glam/.opencode/PROVENANCE_SEPARATION_RULE.md

# Provenance Separation Rule

## Rule 37: Domain Classes MUST NOT Contain Data-Source-Specific Provenance

### Overview

Domain classes that model heritage custodian entities (events, identifiers, locations, etc.) MUST NOT contain provenance fields specific to any particular data source or API.

### Rationale

1. **Separation of Concerns**: Domain semantics (what happened) should be separate from provenance (how we know).
2. **Source Flexibility**: The same event can be discovered via multiple sources (Linkup, Wikidata, manual research).
3. **Schema Stability**: Adding new data sources should not require modifying domain classes.
4. **Provenance Reuse**: Observation classes can be reused across different domain entities.

### Rule

**NEVER put data-source-specific fields in domain classes:**

```yaml
# WRONG - Domain class with source-specific fields
CustodianTimelineEvent:
  slots:
    - event_type
    - event_date
    - description
    - linkup_query        # Source-specific!
    - linkup_answer       # Source-specific!
    - fetch_timestamp     # Source-specific!
```

**CORRECT - Separate domain and provenance:**

```yaml
# Domain class - source-agnostic
CustodianTimelineEvent:
  slots:
    - event_type
    - event_date
    - description
    - data_tier           # Quality indicator (not source-specific)
    - observation_ref     # Reference to observation (optional)

# Provenance classes - source-specific
WebObservation:
  # For web-scraped data with XPath provenance

CustodianObservation:
  # For institutional observations

LinkupObservation:          # NEW - if needed for Linkup-specific provenance
  slots:
    - linkup_query
    - linkup_answer
    - source_urls
    - fetch_timestamp
    - archive_path
```
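A check like the following could flag violations of this rule automatically. It is a minimal sketch, not part of the existing tooling: the function name, the prefix list, and the explicit name set are all hypothetical, and the class definitions are assumed to be already loaded from YAML into plain dictionaries.

```python
# Hypothetical markers; the real project may identify source-specific slots differently.
SOURCE_SPECIFIC_PREFIXES = ("linkup_", "wikidata_")
SOURCE_SPECIFIC_NAMES = {"fetch_timestamp"}


def find_source_specific_slots(schema_classes, domain_class_names):
    """Map each offending domain class to the source-specific slots it declares."""
    violations = {}
    for class_name in domain_class_names:
        class_def = schema_classes.get(class_name) or {}
        slots = class_def.get("slots") or []
        offending = [
            slot for slot in slots
            if slot.startswith(SOURCE_SPECIFIC_PREFIXES) or slot in SOURCE_SPECIFIC_NAMES
        ]
        if offending:
            violations[class_name] = offending
    return violations


# The WRONG and CORRECT definitions from above, expressed as plain dicts.
wrong_schema = {
    "CustodianTimelineEvent": {
        "slots": ["event_type", "event_date", "description",
                  "linkup_query", "linkup_answer", "fetch_timestamp"],
    },
}
correct_schema = {
    "CustodianTimelineEvent": {
        "slots": ["event_type", "event_date", "description",
                  "data_tier", "observation_ref"],
    },
    "LinkupObservation": {
        "slots": ["linkup_query", "linkup_answer", "source_urls",
                  "fetch_timestamp", "archive_path"],
    },
}

# Only domain classes are checked; provenance classes may hold source-specific slots.
domain_classes = ["CustodianTimelineEvent"]
print(find_source_specific_slots(wrong_schema, domain_classes))
print(find_source_specific_slots(correct_schema, domain_classes))  # {}
```

Only the names listed in `domain_classes` are inspected, so `LinkupObservation` can keep its Linkup-specific slots without being reported.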

### Application to Timeline Events

The `CustodianTimelineEvent` class models organizational change events (founding, merger, dissolution, etc.) as **domain entities**.

**Timeline events can be discovered from multiple sources:**

| Source | Provenance Class |
|--------|------------------|
| Linkup API | `WebObservation` (with API-specific metadata in `extraction_notes`) |
| Web scraping | `WebObservation` (with XPath provenance in `claims`) |
| Wikidata SPARQL | `WebObservation` (with SPARQL query provenance) |
| Manual research | `CustodianObservation` (with source document reference) |
| Institutional records | `CustodianObservation` (with official source) |
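The table can be read as a lookup from source kind to provenance class. The sketch below encodes it that way; the source keys (`linkup_api`, `web_scraping`, and so on) and the function name are illustrative assumptions, not identifiers from the schema.

```python
# Source kind -> provenance class, mirroring the table above.
# The dictionary keys are hypothetical labels chosen for this example.
PROVENANCE_CLASS_BY_SOURCE = {
    "linkup_api": "WebObservation",
    "web_scraping": "WebObservation",
    "wikidata_sparql": "WebObservation",
    "manual_research": "CustodianObservation",
    "institutional_records": "CustodianObservation",
}


def provenance_class_for(source_kind: str) -> str:
    """Return the provenance class name that should record this kind of source."""
    try:
        return PROVENANCE_CLASS_BY_SOURCE[source_kind]
    except KeyError:
        raise ValueError(
            f"No provenance class registered for source {source_kind!r}"
        ) from None


print(provenance_class_for("linkup_api"))       # WebObservation
print(provenance_class_for("manual_research"))  # CustodianObservation
```

Adding a new data source then means adding one entry to the mapping, with no change to `CustodianTimelineEvent` itself.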

### Provenance Flow

```
SourceDocument/API Response
        ↓
WebObservation / CustodianObservation (provenance record)
        ↓
CustodianTimelineEvent (domain entity)
        ↓
observation_ref → Observation (backlink for audit)
```
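The flow above can be sketched with two small record types, one for provenance and one for the domain entity, linked only by a reference. This is an illustrative model under stated assumptions: the `Observation` fields, the `observation_id` key, and `resolve_observation` are hypothetical, and the link uses the `observation_ref` slot name from the domain-class example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Observation:
    """Provenance record (how we know). Field names are hypothetical."""
    observation_id: str
    provenance_class: str  # e.g. "WebObservation" or "CustodianObservation"
    source: str            # URL or source document reference


@dataclass(frozen=True)
class TimelineEvent:
    """Domain entity (what happened). Holds no source-specific fields."""
    event_type: str
    event_date: str
    description: str
    data_tier: str
    observation_ref: Optional[str] = None  # optional backlink for audit


def resolve_observation(
    event: TimelineEvent, observations: Dict[str, Observation]
) -> Optional[Observation]:
    """Follow the event's backlink to its provenance record, if it has one."""
    if event.observation_ref is None:
        return None
    return observations.get(event.observation_ref)


obs = Observation("obs-001", "WebObservation", "nl.wikipedia.org/wiki/Drents_Archief")
event = TimelineEvent(
    event_type="FOUNDING",
    event_date="2005-04-30",
    description="Founded on 30 April 2005",
    data_tier="TIER_4_INFERRED",
    observation_ref=obs.observation_id,
)
print(resolve_observation(event, {obs.observation_id: obs}).provenance_class)  # WebObservation
```

Because the event carries only a reference, the same event could point at a different observation type without any change to its own fields.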

### Existing Provenance Classes

Use these existing classes for different provenance needs:

| Class | Purpose | Location |
|-------|---------|----------|
| `CustodianObservation` | Source-based evidence of custodian existence | `schemas/.../classes/CustodianObservation.yaml` |
| `WebObservation` | Web retrieval provenance with claims | `schemas/.../classes/WebObservation.yaml` |
| `WebClaim` | Individual claims with XPath provenance | `schemas/.../classes/WebClaim.yaml` |
| `SourceDocument` | Reference to source documents | `schemas/.../classes/SourceDocument.yaml` |

### Migration Note

The former `LinkupTimelineEvent` class contained Linkup-specific provenance fields. It has been renamed to `CustodianTimelineEvent` to be source-agnostic, and its Linkup-specific content has been moved to:
- the `extraction_notes` field, for API-specific metadata
- the `archive_path` field, for archived API responses

### Data Tier Always Required

Even without source-specific provenance, domain classes MUST indicate data quality:

```yaml
CustodianTimelineEvent:
  slots:
    - data_tier  # REQUIRED: TIER_1 through TIER_4
```

This lets consumers judge how trustworthy an event is without any source-specific knowledge.
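A simple validation pass could enforce this requirement on loaded timeline events. The sketch below is an assumption-laden example: the function name is hypothetical, and the accepted pattern (`TIER_1` to `TIER_4`, optionally followed by an uppercase suffix such as `_VERIFIED`) is inferred from the values used in this document rather than taken from the schema's enum.

```python
import re

# Assumed naming pattern: TIER_1..TIER_4 with an optional uppercase suffix,
# e.g. TIER_2_VERIFIED or TIER_4_INFERRED. The authoritative list lives in the schema.
DATA_TIER_PATTERN = re.compile(r"TIER_[1-4](?:_[A-Z]+)*")


def events_missing_data_tier(events):
    """Return the indices of events whose data_tier is absent or malformed."""
    bad_indices = []
    for index, event in enumerate(events):
        tier = event.get("data_tier")
        if not isinstance(tier, str) or not DATA_TIER_PATTERN.fullmatch(tier):
            bad_indices.append(index)
    return bad_indices


timeline_events = [
    {"event_type": "FOUNDING", "event_date": "2005-04-30", "data_tier": "TIER_2_VERIFIED"},
    {"event_type": "FOUNDING", "event_date": "2005-04-30"},  # data_tier missing
    {"event_type": "MERGER", "data_tier": "TIER_5_UNKNOWN"},  # outside TIER_1..TIER_4
]
print(events_missing_data_tier(timeline_events))  # [1, 2]
```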

### Examples

**Founding event from Linkup:**
```yaml
timeline_events:
  - event_type: FOUNDING
    event_date: "2005-04-30"
    date_precision: day
    description: "Founded on 30 April 2005"
    data_tier: TIER_4_INFERRED  # LLM-extracted
    extraction_notes: |
      Source: Linkup API query "Drents Archief opgericht"
      Verified against: nl.wikipedia.org/wiki/Drents_Archief
```

**Founding event from institutional website:**
```yaml
timeline_events:
  - event_type: FOUNDING
    event_date: "2005-04-30"
    date_precision: day
    description: "Founded on 30 April 2005"
    data_tier: TIER_2_VERIFIED  # Verified from official source
    extraction_notes: |
      Source: Official website about page
      XPath: /html/body/div[2]/section[1]/p[3]
```

### Related Rules

- **Rule 6**: WebObservation Claims MUST Have XPath Provenance (for web-scraped claims)
- **Rule 22**: Custodian YAML Files Are the Single Source of Truth
- **Rule 35**: Provenance Statements MUST Have Dual Timestamps

### Related Documentation

- `schemas/20251121/linkml/modules/classes/CustodianObservation.yaml`
- `schemas/20251121/linkml/modules/classes/WebObservation.yaml`
- `schemas/20251121/linkml/modules/classes/CustodianTimelineEvent.yaml`
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md`

---

**Created**: 2026-01-01
**Status**: ACTIVE
**Applies to**: All domain classes in the Heritage Custodian Ontology