glam/.opencode/PROVENANCE_SEPARATION_RULE.md
2026-01-02 02:11:04 +01:00

# Provenance Separation Rule
## Rule 37: Domain Classes MUST NOT Contain Data-Source-Specific Provenance
### Overview
Domain classes that model heritage custodian entities (events, identifiers, locations, etc.) MUST NOT contain provenance fields specific to any particular data source or API.
### Rationale
1. **Separation of Concerns**: Domain semantics (what happened) should be separate from provenance (how we know).
2. **Source Flexibility**: The same event can be discovered via multiple sources (Linkup, Wikidata, manual research).
3. **Schema Stability**: Adding new data sources should not require modifying domain classes.
4. **Provenance Reuse**: Observation classes can be reused across different domain entities.
### Rule
**NEVER put data-source-specific fields in domain classes:**
```yaml
# WRONG - Domain class with source-specific fields
CustodianTimelineEvent:
  slots:
    - event_type
    - event_date
    - description
    - linkup_query     # Source-specific!
    - linkup_answer    # Source-specific!
    - fetch_timestamp  # Source-specific!
```
**CORRECT - Separate domain and provenance:**
```yaml
# Domain class - source-agnostic
CustodianTimelineEvent:
  slots:
    - event_type
    - event_date
    - description
    - data_tier        # Quality indicator (not source-specific)
    - observation_ref  # Reference to observation (optional)

# Provenance classes - source-specific
WebObservation:
  # For web-scraped data with XPath provenance

CustodianObservation:
  # For institutional observations

LinkupObservation:  # NEW - if needed for Linkup-specific provenance
  slots:
    - linkup_query
    - linkup_answer
    - source_urls
    - fetch_timestamp
    - archive_path
```
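The rule above lends itself to a mechanical check. The following is a minimal sketch, not the project's actual tooling: it assumes the schema has already been loaded into plain dictionaries, and the `SOURCE_PREFIXES` tuple and the `find_source_specific_slots` helper are hypothetical names chosen for illustration.

```python
# Hypothetical lint check: flag slots in domain classes whose names start
# with a data-source prefix. The prefix list is an assumption, not project config.
SOURCE_PREFIXES = ("linkup_", "wikidata_")


def find_source_specific_slots(classes, domain_class_names):
    """Return {class_name: [offending slot names]} for the given domain classes."""
    violations = {}
    for name in domain_class_names:
        slots = classes.get(name, {}).get("slots", [])
        bad = [slot for slot in slots if slot.startswith(SOURCE_PREFIXES)]
        if bad:
            violations[name] = bad
    return violations


# Schema fragments mirroring the WRONG example above, as plain dicts.
schema = {
    "CustodianTimelineEvent": {
        "slots": ["event_type", "event_date", "description",
                  "linkup_query", "linkup_answer"],
    },
    "LinkupObservation": {
        "slots": ["linkup_query", "linkup_answer", "source_urls"],
    },
}

# Only domain classes are checked; provenance classes may keep such slots.
violations = find_source_specific_slots(schema, ["CustodianTimelineEvent"])
print(violations)
```

A prefix check like this cannot catch generically named fields such as `fetch_timestamp`, so it complements schema review rather than replacing it.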
### Application to Timeline Events
The `CustodianTimelineEvent` class models organizational change events (founding, merger, dissolution, etc.) as **domain entities**.
**Timeline events can be discovered from multiple sources:**
| Source | Provenance Class |
|--------|------------------|
| Linkup API | `WebObservation` (with API-specific metadata in `extraction_notes`) |
| Web scraping | `WebObservation` (with XPath provenance in `claims`) |
| Wikidata SPARQL | `WebObservation` (with SPARQL query provenance) |
| Manual research | `CustodianObservation` (with source document reference) |
| Institutional records | `CustodianObservation` (with official source) |
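
The mapping in the table can be expressed as a simple lookup. This is an illustrative sketch only: the source keys (`linkup_api`, `web_scraping`, and so on) and the `provenance_class_for` helper are hypothetical names, while the class names come from the table.

```python
# Source kind -> provenance class, following the table above.
# The dictionary keys are hypothetical labels, not identifiers from the schema.
PROVENANCE_CLASS_BY_SOURCE = {
    "linkup_api": "WebObservation",
    "web_scraping": "WebObservation",
    "wikidata_sparql": "WebObservation",
    "manual_research": "CustodianObservation",
    "institutional_records": "CustodianObservation",
}


def provenance_class_for(source):
    """Return the provenance class name for a source kind, or raise ValueError."""
    try:
        return PROVENANCE_CLASS_BY_SOURCE[source]
    except KeyError:
        raise ValueError(f"No provenance class registered for source {source!r}") from None


print(provenance_class_for("wikidata_sparql"))
```

Adding a new data source then means adding one entry to the lookup, with no change to `CustodianTimelineEvent` itself.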
### Provenance Flow
```
SourceDocument / API Response
    ↓
WebObservation / CustodianObservation (provenance record)
    ↓
CustodianTimelineEvent (domain entity)
    └── references_observation → Observation (backlink for audit)
```
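The flow can be pictured with two small record types. This is a sketch under stated assumptions, not the LinkML-generated classes: the `Observation` and `TimelineEvent` dataclasses, the `observation_id` field, and the `audit_trail` helper are hypothetical; only `references_observation`, `data_tier`, and the event fields appear in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """Provenance record: how we know something."""
    observation_id: str    # hypothetical identifier
    provenance_class: str  # e.g. "WebObservation" or "CustodianObservation"
    source: str


@dataclass(frozen=True)
class TimelineEvent:
    """Domain entity: what happened. Carries no source-specific fields."""
    event_type: str
    event_date: str
    data_tier: str
    references_observation: Optional[str] = None  # backlink for audit


def audit_trail(event, observations):
    """Follow the backlink from a domain event to its provenance record, if any."""
    if event.references_observation is None:
        return None
    return observations.get(event.references_observation)


obs = Observation("obs-001", "WebObservation", "Official website about page")
event = TimelineEvent("FOUNDING", "2005-04-30", "TIER_2_VERIFIED",
                      references_observation="obs-001")
observations = {obs.observation_id: obs}

print(audit_trail(event, observations))
```

Because the event holds only a reference, the same event shape works whether the observation came from an API, a scrape, or manual research.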
### Existing Provenance Classes
Use these existing classes for different provenance needs:
| Class | Purpose | Location |
|-------|---------|----------|
| `CustodianObservation` | Source-based evidence of custodian existence | `schemas/.../classes/CustodianObservation.yaml` |
| `WebObservation` | Web retrieval provenance with claims | `schemas/.../classes/WebObservation.yaml` |
| `WebClaim` | Individual claims with XPath provenance | `schemas/.../classes/WebClaim.yaml` |
| `SourceDocument` | Reference to source documents | `schemas/.../classes/SourceDocument.yaml` |
### Migration Note
The former `LinkupTimelineEvent` class contained Linkup-specific provenance fields. The migration made three changes:
- API-specific metadata moved to the `extraction_notes` field
- Archived API responses are now referenced through the `archive_path` field
- The class was renamed to `CustodianTimelineEvent` to be source-agnostic
### Data Tier Always Required
Even without source-specific provenance, domain classes MUST indicate data quality:
```yaml
CustodianTimelineEvent:
  slots:
    - data_tier  # REQUIRED: TIER_1 through TIER_4
```
This allows consumers to understand trustworthiness without needing source-specific knowledge.
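
A check for this requirement might look like the sketch below. It is an assumption-laden illustration: the `TIER_PATTERN` regular expression accepts `TIER_1` through `TIER_4` with an optional uppercase suffix (as in `TIER_4_INFERRED`), which is inferred from the examples in this document rather than taken from the schema's enum, and `missing_or_invalid_tier` is a hypothetical helper.

```python
import re

# Assumed shape of tier values: TIER_1..TIER_4, optionally followed by
# uppercase suffix words such as _INFERRED or _VERIFIED.
TIER_PATTERN = re.compile(r"^TIER_[1-4](_[A-Z]+)*$")


def missing_or_invalid_tier(events):
    """Return the indices of events whose data_tier is absent or malformed."""
    return [
        index
        for index, event in enumerate(events)
        if not TIER_PATTERN.match(str(event.get("data_tier", "")))
    ]


events = [
    {"event_type": "FOUNDING", "data_tier": "TIER_4_INFERRED"},
    {"event_type": "MERGER"},                          # data_tier missing
    {"event_type": "DISSOLUTION", "data_tier": "TIER_5"},  # out of range
]

print(missing_or_invalid_tier(events))
```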
### Examples
**Founding event from Linkup:**
```yaml
timeline_events:
  - event_type: FOUNDING
    event_date: "2005-04-30"
    date_precision: day
    description: "Founded on 30 April 2005"
    data_tier: TIER_4_INFERRED  # LLM-extracted
    extraction_notes: |
      Source: Linkup API query "Drents Archief opgericht"
      Verified against: nl.wikipedia.org/wiki/Drents_Archief
```
**Founding event from institutional website:**
```yaml
timeline_events:
  - event_type: FOUNDING
    event_date: "2005-04-30"
    date_precision: day
    description: "Founded on 30 April 2005"
    data_tier: TIER_2_VERIFIED  # Verified from official source
    extraction_notes: |
      Source: Official website about page
      XPath: /html/body/div[2]/section[1]/p[3]
```
### Related Rules
- **Rule 6**: WebObservation Claims MUST Have XPath Provenance (for web-scraped claims)
- **Rule 35**: Provenance Statements MUST Have Dual Timestamps
- **Rule 22**: Custodian YAML Files Are the Single Source of Truth
### Related Documentation
- `schemas/20251121/linkml/modules/classes/CustodianObservation.yaml`
- `schemas/20251121/linkml/modules/classes/WebObservation.yaml`
- `schemas/20251121/linkml/modules/classes/CustodianTimelineEvent.yaml`
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md`
---
**Created**: 2026-01-01
**Status**: ACTIVE
**Applies to**: All domain classes in the Heritage Custodian Ontology