glam/.opencode/PROVENANCE_SEPARATION_RULE.md
2026-01-02 02:11:04 +01:00

Provenance Separation Rule

Rule 37: Domain Classes MUST NOT Contain Data-Source-Specific Provenance

Overview

Domain classes that model heritage custodian entities (events, identifiers, locations, etc.) MUST NOT contain provenance fields specific to any particular data source or API.

Rationale

  1. Separation of Concerns: Domain semantics (what happened) should be separate from provenance (how we know).
  2. Source Flexibility: The same event can be discovered via multiple sources (Linkup, Wikidata, manual research).
  3. Schema Stability: Adding new data sources should not require modifying domain classes.
  4. Provenance Reuse: Observation classes can be reused across different domain entities.

Rule

NEVER put data-source-specific fields in domain classes:

# WRONG - Domain class with source-specific fields
CustodianTimelineEvent:
  slots:
    - event_type
    - event_date
    - description
    - linkup_query        # Source-specific!
    - linkup_answer       # Source-specific!
    - fetch_timestamp     # Source-specific!

CORRECT - Separate domain and provenance:

# Domain class - source-agnostic
CustodianTimelineEvent:
  slots:
    - event_type
    - event_date
    - description
    - data_tier           # Quality indicator (not source-specific)
    - observation_ref     # Reference to observation (optional)

# Provenance classes - source-specific
WebObservation:
  # For web-scraped data with XPath provenance

CustodianObservation:
  # For institutional observations

LinkupObservation:          # NEW - if needed for Linkup-specific provenance
  slots:
    - linkup_query
    - linkup_answer
    - source_urls
    - fetch_timestamp
    - archive_path
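
The rule above lends itself to an automated check. The following is a minimal sketch, assuming the schema has been loaded into plain dictionaries; the denylist is taken from the WRONG example, while the `linkup_` prefix convention and the function names are hypothetical and not part of the ontology tooling.

```python
# Assumption: illustrative denylist, copied from the WRONG example above.
FORBIDDEN_SLOTS = {"linkup_query", "linkup_answer", "fetch_timestamp"}
# Assumption: hypothetical naming convention for Linkup-specific slots.
FORBIDDEN_PREFIXES = ("linkup_",)


def source_specific_slots(slots):
    """Return the slots that look data-source-specific, preserving order."""
    return [
        slot for slot in slots
        if slot in FORBIDDEN_SLOTS or slot.startswith(FORBIDDEN_PREFIXES)
    ]


def check_domain_classes(schema, domain_classes):
    """Map each offending domain class name to its offending slots."""
    violations = {}
    for name in domain_classes:
        bad = source_specific_slots(schema[name].get("slots", []))
        if bad:
            violations[name] = bad
    return violations


# Schema fragments mirroring the WRONG and CORRECT examples above.
wrong_schema = {
    "CustodianTimelineEvent": {
        "slots": ["event_type", "event_date", "description",
                  "linkup_query", "linkup_answer", "fetch_timestamp"],
    },
}
correct_schema = {
    "CustodianTimelineEvent": {
        "slots": ["event_type", "event_date", "description",
                  "data_tier", "observation_ref"],
    },
    "LinkupObservation": {
        "slots": ["linkup_query", "linkup_answer", "source_urls",
                  "fetch_timestamp", "archive_path"],
    },
}
```

Only the classes named in `domain_classes` are inspected, so provenance classes such as `LinkupObservation` may keep their source-specific slots.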

Application to Timeline Events

The CustodianTimelineEvent class models organizational change events (founding, merger, dissolution, etc.) as domain entities.

Timeline events can be discovered from multiple sources:

| Source | Provenance Class |
|--------|------------------|
| Linkup API | WebObservation (with API-specific metadata in extraction_notes) |
| Web scraping | WebObservation (with XPath provenance in claims) |
| Wikidata SPARQL | WebObservation (with SPARQL query provenance) |
| Manual research | CustodianObservation (with source document reference) |
| Institutional records | CustodianObservation (with official source) |
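
One way to apply this mapping in ingestion code is a lookup table kept outside the domain class. This is a sketch only: the source keys (`linkup_api`, `web_scraping`, and so on) and the function name are hypothetical identifiers chosen for illustration.

```python
# Assumption: the keys are hypothetical source identifiers; the values are
# the provenance class names listed in the table above.
PROVENANCE_CLASS_BY_SOURCE = {
    "linkup_api": "WebObservation",
    "web_scraping": "WebObservation",
    "wikidata_sparql": "WebObservation",
    "manual_research": "CustodianObservation",
    "institutional_records": "CustodianObservation",
}


def provenance_class_for(source):
    """Return the provenance class name to use for a given source kind."""
    try:
        return PROVENANCE_CLASS_BY_SOURCE[source]
    except KeyError:
        # A new source extends this mapping; the domain class stays unchanged.
        raise ValueError(f"Unknown source {source!r}") from None
```

Supporting a new data source then means adding one entry here, which is the schema-stability argument from the Rationale in practice.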

Provenance Flow

SourceDocument/API Response
        ↓
WebObservation / CustodianObservation (provenance record)
        ↓
CustodianTimelineEvent (domain entity)
        ↓
references_observation → Observation (backlink for audit)
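
The flow above can be sketched with two small record types. This is an illustration under stated assumptions, not the ontology's implementation: the `Observation` and `TimelineEvent` types, the `obs-001` identifier, and `resolve_observation` are hypothetical, and `observation_ref` is modelled as an optional string key.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Observation:
    """Provenance record: how we know."""
    observation_id: str
    observation_class: str  # "WebObservation" or "CustodianObservation"
    source: str             # URL or source document reference


@dataclass(frozen=True)
class TimelineEvent:
    """Domain entity: what happened. Holds no source-specific fields."""
    event_type: str
    event_date: str
    description: str
    data_tier: str
    observation_ref: Optional[str] = None


def resolve_observation(
    event: TimelineEvent, observations: Dict[str, Observation]
) -> Optional[Observation]:
    """Follow the event's backlink to its provenance record, if any."""
    if event.observation_ref is None:
        return None
    return observations.get(event.observation_ref)


# Assumption: "obs-001" is a made-up identifier for illustration.
observations = {
    "obs-001": Observation(
        "obs-001", "WebObservation", "nl.wikipedia.org/wiki/Drents_Archief"
    ),
}
event = TimelineEvent(
    "FOUNDING", "2005-04-30", "Founded on 30 April 2005",
    "TIER_4_INFERRED", observation_ref="obs-001",
)
```

Because the event stores only a reference, the same observation can back several events, and an event can exist before any observation is attached.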

Existing Provenance Classes

Use these existing classes for different provenance needs:

| Class | Purpose | Location |
|-------|---------|----------|
| CustodianObservation | Source-based evidence of custodian existence | schemas/.../classes/CustodianObservation.yaml |
| WebObservation | Web retrieval provenance with claims | schemas/.../classes/WebObservation.yaml |
| WebClaim | Individual claims with XPath provenance | schemas/.../classes/WebClaim.yaml |
| SourceDocument | Reference to source documents | schemas/.../classes/SourceDocument.yaml |

Migration Note

The former LinkupTimelineEvent class contained Linkup-specific provenance fields. It was migrated as follows:

  • The class was renamed to CustodianTimelineEvent to be source-agnostic
  • API-specific metadata moved to the extraction_notes field
  • Archived API responses are referenced through the archive_path field
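
A record-level migration along these lines might look like the sketch below. It assumes legacy records carried the `linkup_query`, `linkup_answer`, and `fetch_timestamp` fields shown in the WRONG example; the function name, the `name: value` note format, and the sample archive path are hypothetical.

```python
# Assumption: legacy field names taken from the WRONG example in this rule.
LINKUP_FIELDS = ("linkup_query", "linkup_answer", "fetch_timestamp")


def migrate_linkup_event(legacy):
    """Return a source-agnostic copy of a legacy LinkupTimelineEvent record.

    Linkup-specific fields are folded into extraction_notes; every other
    field, including archive_path, is carried over unchanged.
    """
    event = {k: v for k, v in legacy.items() if k not in LINKUP_FIELDS}
    notes = [f"{name}: {legacy[name]}" for name in LINKUP_FIELDS if name in legacy]
    if notes:
        existing = event.get("extraction_notes", "").rstrip()
        # filter(None, ...) drops an empty pre-existing note.
        event["extraction_notes"] = "\n".join(filter(None, [existing, *notes]))
    return event


legacy_record = {
    "event_type": "FOUNDING",
    "event_date": "2005-04-30",
    "linkup_query": "Drents Archief opgericht",
    "archive_path": "archive/example.json",  # hypothetical path
}
migrated = migrate_linkup_event(legacy_record)
```

The function builds a new dictionary and leaves the legacy record untouched, so the original can be kept for audit.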

Data Tier Always Required

Even without source-specific provenance, domain classes MUST indicate data quality:

CustodianTimelineEvent:
  slots:
    - data_tier  # REQUIRED: TIER_1 through TIER_4

This allows consumers to understand trustworthiness without needing source-specific knowledge.
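
A consumer-side check for this requirement could be as small as the sketch below. Only `TIER_2_VERIFIED` and `TIER_4_INFERRED` appear in this rule, so the pattern accepts any value that starts with `TIER_1` through `TIER_4` followed by optional uppercase suffixes; that pattern and the function name are assumptions, not the ontology's enum definition.

```python
import re

# Assumption: tier values look like TIER_<1-4> with optional _SUFFIX parts,
# e.g. TIER_2_VERIFIED. The authoritative enum lives in the schema.
TIER_PATTERN = re.compile(r"TIER_[1-4](_[A-Z0-9]+)*")


def validate_data_tier(record):
    """Return the record's data_tier, or raise ValueError if absent/invalid."""
    tier = record.get("data_tier")
    if tier is None:
        raise ValueError("data_tier is required on every domain record")
    if not TIER_PATTERN.fullmatch(str(tier)):
        raise ValueError(f"data_tier {tier!r} is not TIER_1 through TIER_4")
    return tier
```

The check needs no knowledge of Linkup, Wikidata, or any other source, which is the point of keeping `data_tier` on the domain class.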

Examples

Founding event from Linkup:

timeline_events:
  - event_type: FOUNDING
    event_date: "2005-04-30"
    date_precision: day
    description: "Founded on 30 April 2005"
    data_tier: TIER_4_INFERRED  # LLM-extracted
    extraction_notes: |
      Source: Linkup API query "Drents Archief opgericht"
      Verified against: nl.wikipedia.org/wiki/Drents_Archief

Founding event from institutional website:

timeline_events:
  - event_type: FOUNDING
    event_date: "2005-04-30"
    date_precision: day
    description: "Founded on 30 April 2005"
    data_tier: TIER_2_VERIFIED  # Verified from official source
    extraction_notes: |
      Source: Official website about page
      XPath: /html/body/div[2]/section[1]/p[3]

Related Rules

  • Rule 6: WebObservation Claims MUST Have XPath Provenance (for web-scraped claims)
  • Rule 35: Provenance Statements MUST Have Dual Timestamps
  • Rule 22: Custodian YAML Files Are the Single Source of Truth

See Also

  • schemas/20251121/linkml/modules/classes/CustodianObservation.yaml
  • schemas/20251121/linkml/modules/classes/WebObservation.yaml
  • schemas/20251121/linkml/modules/classes/CustodianTimelineEvent.yaml
  • .opencode/WEB_OBSERVATION_PROVENANCE_RULES.md

Created: 2026-01-01
Status: ACTIVE
Applies to: All domain classes in the Heritage Custodian Ontology