Provenance Separation Rule
Rule 37: Domain Classes MUST NOT Contain Data-Source-Specific Provenance
Overview
Domain classes that model heritage custodian entities (events, identifiers, locations, etc.) MUST NOT contain provenance fields specific to any particular data source or API.
Rationale
- Separation of Concerns: Domain semantics (what happened) should be separate from provenance (how we know).
- Source Flexibility: The same event can be discovered via multiple sources (Linkup, Wikidata, manual research).
- Schema Stability: Adding new data sources should not require modifying domain classes.
- Provenance Reuse: Observation classes can be reused across different domain entities.
Rule
NEVER put data-source-specific fields in domain classes:
# WRONG - Domain class with source-specific fields
CustodianTimelineEvent:
  slots:
    - event_type
    - event_date
    - description
    - linkup_query     # Source-specific!
    - linkup_answer    # Source-specific!
    - fetch_timestamp  # Source-specific!
CORRECT - Separate domain and provenance:
# Domain class - source-agnostic
CustodianTimelineEvent:
  slots:
    - event_type
    - event_date
    - description
    - data_tier        # Quality indicator (not source-specific)
    - observation_ref  # Reference to observation (optional)

# Provenance classes - source-specific
WebObservation:
  # For web-scraped data with XPath provenance
CustodianObservation:
  # For institutional observations
LinkupObservation:  # NEW - if needed for Linkup-specific provenance
  slots:
    - linkup_query
    - linkup_answer
    - source_urls
    - fetch_timestamp
    - archive_path
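The rule above could be enforced mechanically. The following is a minimal sketch of such a check, not an existing tool in this repository: the function name, the prefix list, and the set of flagged slot names are all assumptions chosen to match the WRONG and CORRECT examples.

```python
# Hypothetical lint helper. The marker lists below are assumptions for
# illustration; they are not defined anywhere in the schema.
SOURCE_PREFIXES = ("linkup_", "wikidata_")  # assumed naming convention
SOURCE_SPECIFIC_SLOTS = {"fetch_timestamp", "source_urls", "archive_path"}


def find_source_specific_slots(slots):
    """Return the slot names that look data-source-specific, in input order."""
    return [
        name
        for name in slots
        if name.startswith(SOURCE_PREFIXES) or name in SOURCE_SPECIFIC_SLOTS
    ]


# Slot lists copied from the WRONG and CORRECT examples above.
wrong = ["event_type", "event_date", "description",
         "linkup_query", "linkup_answer", "fetch_timestamp"]
correct = ["event_type", "event_date", "description",
           "data_tier", "observation_ref"]

print(find_source_specific_slots(wrong))    # the three Linkup-related slots
print(find_source_specific_slots(correct))  # an empty list
```

A check of this kind would run against domain classes only; provenance classes such as LinkupObservation are expected to carry these slots.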
Application to Timeline Events
The CustodianTimelineEvent class models organizational change events (founding, merger, dissolution, etc.) as domain entities.
Timeline events can be discovered from multiple sources:
| Source | Provenance Class |
|---|---|
| Linkup API | WebObservation (with API-specific metadata in extraction_notes) |
| Web scraping | WebObservation (with XPath provenance in claims) |
| Wikidata SPARQL | WebObservation (with SPARQL query provenance) |
| Manual research | CustodianObservation (with source document reference) |
| Institutional records | CustodianObservation (with official source) |
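The table above can be read as a simple lookup from data source to provenance class. The sketch below assumes hypothetical string keys for the sources; only the class names come from this document.

```python
# Mapping transcribed from the table above. The key strings are
# hypothetical labels, not identifiers defined by the schema.
PROVENANCE_CLASS_BY_SOURCE = {
    "linkup_api": "WebObservation",
    "web_scraping": "WebObservation",
    "wikidata_sparql": "WebObservation",
    "manual_research": "CustodianObservation",
    "institutional_records": "CustodianObservation",
}


def provenance_class_for(source):
    """Return the provenance class name to use for a given source label."""
    try:
        return PROVENANCE_CLASS_BY_SOURCE[source]
    except KeyError:
        raise ValueError(
            f"No provenance class registered for source {source!r}"
        ) from None


print(provenance_class_for("linkup_api"))
print(provenance_class_for("manual_research"))
```

Under this arrangement, supporting a new source means adding one entry to the mapping; the domain class itself is not touched.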
Provenance Flow
SourceDocument/API Response
↓
WebObservation / CustodianObservation (provenance record)
↓
CustodianTimelineEvent (domain entity)
↓
references_observation → Observation (backlink for audit)
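The flow above can be sketched with two small record types. This is an illustration under stated assumptions, not the project's implementation: the Python class names, the observation_id field, and the helper function are hypothetical, and the backlink uses the observation_ref slot name from the rule section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """Provenance record (stands in for WebObservation / CustodianObservation)."""
    observation_id: str    # hypothetical identifier field
    provenance_class: str  # "WebObservation" or "CustodianObservation"
    source: str            # e.g. a URL or a source document reference


@dataclass(frozen=True)
class TimelineEvent:
    """Source-agnostic domain entity (stands in for CustodianTimelineEvent)."""
    event_type: str
    event_date: str
    description: str
    data_tier: str
    observation_ref: Optional[str] = None  # backlink for audit


def event_from_observation(obs, event_type, event_date, description, data_tier):
    """Build a domain event that only references its provenance record."""
    return TimelineEvent(
        event_type=event_type,
        event_date=event_date,
        description=description,
        data_tier=data_tier,
        observation_ref=obs.observation_id,
    )


obs = Observation("obs-001", "WebObservation", "nl.wikipedia.org/wiki/Drents_Archief")
event = event_from_observation(
    obs, "FOUNDING", "2005-04-30", "Founded on 30 April 2005", "TIER_4_INFERRED"
)
print(event.observation_ref)
```

The event carries no source field of its own. An auditor follows observation_ref back to the observation to learn how the fact was obtained.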
Existing Provenance Classes
Use these existing classes for different provenance needs:
| Class | Purpose | Location |
|---|---|---|
| CustodianObservation | Source-based evidence of custodian existence | schemas/.../classes/CustodianObservation.yaml |
| WebObservation | Web retrieval provenance with claims | schemas/.../classes/WebObservation.yaml |
| WebClaim | Individual claims with XPath provenance | schemas/.../classes/WebClaim.yaml |
| SourceDocument | Reference to source documents | schemas/.../classes/SourceDocument.yaml |
Migration Note
The former LinkupTimelineEvent class contained Linkup-specific provenance fields. These have been moved to:
- the extraction_notes field, for API-specific metadata
- the archive_path field, for archived API responses
The class itself was renamed to CustodianTimelineEvent to be source-agnostic.
Data Tier Always Required
Even without source-specific provenance, domain classes MUST indicate data quality:
CustodianTimelineEvent:
  slots:
    - data_tier  # REQUIRED: TIER_1 through TIER_4
This allows consumers to understand trustworthiness without needing source-specific knowledge.
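A consumer or validator could enforce this requirement with a small check. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the accepted pattern (TIER_1 through TIER_4 with an optional uppercase suffix, as in TIER_2_VERIFIED) is inferred from the values shown in this document rather than taken from the schema's enum.

```python
import re

# Assumed pattern: TIER_1 .. TIER_4, optionally followed by one or more
# uppercase suffix parts such as _VERIFIED or _INFERRED.
TIER_PATTERN = re.compile(r"TIER_[1-4](?:_[A-Z]+)*")


def check_data_tier(event):
    """Return the event's data_tier, or raise ValueError if missing or malformed."""
    tier = event.get("data_tier")
    if tier is None:
        raise ValueError("data_tier is required on every timeline event")
    if not isinstance(tier, str) or not TIER_PATTERN.fullmatch(tier):
        raise ValueError(f"Unrecognized data_tier: {tier!r}")
    return tier


print(check_data_tier({"event_type": "FOUNDING", "data_tier": "TIER_2_VERIFIED"}))
```

The check needs no knowledge of Linkup, Wikidata, or any other source, which is the point of keeping the tier on the domain class.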
Examples
Founding event from Linkup:
timeline_events:
  - event_type: FOUNDING
    event_date: "2005-04-30"
    date_precision: day
    description: "Founded on 30 April 2005"
    data_tier: TIER_4_INFERRED  # LLM-extracted
    extraction_notes: |
      Source: Linkup API query "Drents Archief opgericht"
      Verified against: nl.wikipedia.org/wiki/Drents_Archief
Founding event from institutional website:
timeline_events:
  - event_type: FOUNDING
    event_date: "2005-04-30"
    date_precision: day
    description: "Founded on 30 April 2005"
    data_tier: TIER_2_VERIFIED  # Verified from official source
    extraction_notes: |
      Source: Official website about page
      XPath: /html/body/div[2]/section[1]/p[3]
Related Rules
- Rule 6: WebObservation Claims MUST Have XPath Provenance (for web-scraped claims)
- Rule 35: Provenance Statements MUST Have Dual Timestamps
- Rule 22: Custodian YAML Files Are the Single Source of Truth
Related Documentation
- schemas/20251121/linkml/modules/classes/CustodianObservation.yaml
- schemas/20251121/linkml/modules/classes/WebObservation.yaml
- schemas/20251121/linkml/modules/classes/CustodianTimelineEvent.yaml
- .opencode/WEB_OBSERVATION_PROVENANCE_RULES.md
Created: 2026-01-01
Status: ACTIVE
Applies to: All domain classes in the Heritage Custodian Ontology