# Provenance Separation Rule

## Rule 37: Domain Classes MUST NOT Contain Data-Source-Specific Provenance

### Overview

Domain classes that model heritage custodian entities (events, identifiers, locations, etc.) MUST NOT contain provenance fields specific to any particular data source or API.

### Rationale

1. **Separation of Concerns**: Domain semantics (what happened) should be separate from provenance (how we know).
2. **Source Flexibility**: The same event can be discovered via multiple sources (Linkup, Wikidata, manual research).
3. **Schema Stability**: Adding new data sources should not require modifying domain classes.
4. **Provenance Reuse**: Observation classes can be reused across different domain entities.

### Rule

**NEVER put data-source-specific fields in domain classes:**

```yaml
# WRONG - Domain class with source-specific fields
CustodianTimelineEvent:
  slots:
    - event_type
    - event_date
    - description
    - linkup_query      # Source-specific!
    - linkup_answer     # Source-specific!
    - fetch_timestamp   # Source-specific!
```

**CORRECT - Separate domain and provenance:**

```yaml
# Domain class - source-agnostic
CustodianTimelineEvent:
  slots:
    - event_type
    - event_date
    - description
    - data_tier         # Quality indicator (not source-specific)
    - observation_ref   # Reference to observation (optional)

# Provenance classes - source-specific
WebObservation:
  # For web-scraped data with XPath provenance

CustodianObservation:
  # For institutional observations

LinkupObservation:  # NEW - if needed for Linkup-specific provenance
  slots:
    - linkup_query
    - linkup_answer
    - source_urls
    - fetch_timestamp
    - archive_path
```
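The rule above could be checked mechanically with a small lint-style pass over a class's slot list. The following is a minimal sketch in Python, not part of the ontology tooling: the function name `find_source_specific_slots` and the `SOURCE_SPECIFIC_SLOTS` denylist are hypothetical, and the denylist simply reuses the three slot names flagged in the WRONG example.

```python
# Hypothetical denylist: the slot names flagged as source-specific in the
# WRONG example. A real check would need a project-maintained list.
SOURCE_SPECIFIC_SLOTS = frozenset({"linkup_query", "linkup_answer", "fetch_timestamp"})


def find_source_specific_slots(slots, denylist=SOURCE_SPECIFIC_SLOTS):
    """Return the slots (in their original order) that appear in the denylist."""
    return [slot for slot in slots if slot in denylist]


# Slot lists copied from the WRONG and CORRECT examples.
wrong_slots = ["event_type", "event_date", "description",
               "linkup_query", "linkup_answer", "fetch_timestamp"]
correct_slots = ["event_type", "event_date", "description",
                 "data_tier", "observation_ref"]

violations = find_source_specific_slots(wrong_slots)
clean = find_source_specific_slots(correct_slots)
```

Under these assumptions, `violations` lists the three Linkup-related slots and `clean` is empty. An explicit denylist is used here because `fetch_timestamp` carries no source prefix, so a prefix match on `linkup_` alone would miss it.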

### Application to Timeline Events

The `CustodianTimelineEvent` class models organizational change events (founding, merger, dissolution, etc.) as **domain entities**.

**Timeline events can be discovered from multiple sources:**

| Source | Provenance Class |
|--------|------------------|
| Linkup API | `WebObservation` (with API-specific metadata in `extraction_notes`) |
| Web scraping | `WebObservation` (with XPath provenance in `claims`) |
| Wikidata SPARQL | `WebObservation` (with SPARQL query provenance) |
| Manual research | `CustodianObservation` (with source document reference) |
| Institutional records | `CustodianObservation` (with official source) |

### Provenance Flow

```
SourceDocument/API Response
    ↓
WebObservation / CustodianObservation (provenance record)
    ↓
CustodianTimelineEvent (domain entity)
    ↓
references_observation → Observation (backlink for audit)
```
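The flow above can be sketched in Python to show how an audit might follow the backlink from a domain entity to its provenance record. This is an illustration only: the `Observation` fields (`observation_id`, `observation_class`, `source`), the `TimelineEvent` dataclass, the `resolve_observation` helper, and the `obs-001` identifier are all hypothetical. The only names taken from this document are the `CustodianTimelineEvent` slots, including `observation_ref`.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Observation:
    """Provenance record; field names here are illustrative assumptions."""
    observation_id: str
    observation_class: str  # e.g. "WebObservation" or "CustodianObservation"
    source: str             # source document or API response it was derived from


@dataclass
class TimelineEvent:
    """Source-agnostic domain entity, mirroring CustodianTimelineEvent's slots."""
    event_type: str
    event_date: str
    description: str
    data_tier: str
    observation_ref: Optional[str] = None  # optional backlink for audit


def resolve_observation(event: TimelineEvent,
                        observations: Dict[str, Observation]) -> Optional[Observation]:
    """Follow the event's backlink to its provenance record, if it has one."""
    if event.observation_ref is None:
        return None
    return observations.get(event.observation_ref)


observations = {
    "obs-001": Observation("obs-001", "WebObservation", "Official website about page"),
}
event = TimelineEvent("FOUNDING", "2005-04-30", "Founded on 30 April 2005",
                      "TIER_2_VERIFIED", observation_ref="obs-001")
audit_trail = resolve_observation(event, observations)
```

The domain entity holds only an opaque reference, so a new kind of observation can be added to the lookup without touching `TimelineEvent`.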

### Existing Provenance Classes

Use these existing classes for different provenance needs:

| Class | Purpose | Location |
|-------|---------|----------|
| `CustodianObservation` | Source-based evidence of custodian existence | `schemas/.../classes/CustodianObservation.yaml` |
| `WebObservation` | Web retrieval provenance with claims | `schemas/.../classes/WebObservation.yaml` |
| `WebClaim` | Individual claims with XPath provenance | `schemas/.../classes/WebClaim.yaml` |
| `SourceDocument` | Reference to source documents | `schemas/.../classes/SourceDocument.yaml` |

### Migration Note

The former `LinkupTimelineEvent` class contained Linkup-specific provenance fields. As part of the migration:

- API-specific metadata moved to the `extraction_notes` field
- Archived API responses are referenced through the `archive_path` field
- The class was renamed to `CustodianTimelineEvent` to make it source-agnostic
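A migration of this kind might look like the following Python sketch. It is hedged throughout: the helper `migrate_linkup_event` is hypothetical, the list of legacy fields is assumed from the WRONG example in the Rule section, and the `name: value` layout written into `extraction_notes` is an invented convention, not the project's actual format.

```python
# Assumed legacy field names, taken from the WRONG example in the Rule section.
LINKUP_FIELDS = ("linkup_query", "linkup_answer", "fetch_timestamp")


def migrate_linkup_event(old_event):
    """Return a source-agnostic copy of a legacy Linkup-specific event dict."""
    # Keep every field that is not Linkup-specific (archive_path included).
    new_event = {k: v for k, v in old_event.items() if k not in LINKUP_FIELDS}
    # Fold the Linkup-specific values into free-text notes (layout is assumed).
    notes = [f"{name}: {old_event[name]}" for name in LINKUP_FIELDS if name in old_event]
    if notes:
        existing = new_event.get("extraction_notes")
        if existing:
            notes.insert(0, existing)
        new_event["extraction_notes"] = "\n".join(notes)
    return new_event


legacy = {
    "event_type": "FOUNDING",
    "event_date": "2005-04-30",
    "linkup_query": "Drents Archief opgericht",
    "archive_path": "archive/linkup/response.json",  # hypothetical path
}
migrated = migrate_linkup_event(legacy)
```

The function returns a new dict rather than mutating its input, so the legacy record stays available for comparison during the migration.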

### Data Tier Always Required

Even without source-specific provenance, domain classes MUST indicate data quality:

```yaml
CustodianTimelineEvent:
  slots:
    - data_tier  # REQUIRED: TIER_1 through TIER_4
```

This allows consumers to understand trustworthiness without needing source-specific knowledge.
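A consumer could verify the requirement with a check along these lines. This is a minimal Python sketch under a stated assumption: tier values are taken to have the form `TIER_<1-4>` with an optional uppercase suffix, as in `TIER_2_VERIFIED` and `TIER_4_INFERRED` elsewhere in this document. The helper name `has_valid_data_tier` is hypothetical, and the authoritative list of tier values lives in the schema, not here.

```python
import re

# Assumption: tier values look like TIER_<1-4> with an optional suffix,
# as in TIER_2_VERIFIED and TIER_4_INFERRED from this document's examples.
TIER_PATTERN = re.compile(r"TIER_[1-4](_[A-Z]+)*")


def has_valid_data_tier(event):
    """Return True if the event dict carries a data_tier in the assumed format."""
    tier = event.get("data_tier")
    return isinstance(tier, str) and TIER_PATTERN.fullmatch(tier) is not None


tiered = {"event_type": "FOUNDING", "data_tier": "TIER_4_INFERRED"}
untiered = {"event_type": "FOUNDING"}

tiered_ok = has_valid_data_tier(tiered)
untiered_ok = has_valid_data_tier(untiered)
```

Here `tiered_ok` is `True` and `untiered_ok` is `False`. A missing `data_tier` is treated as a failure, not defaulted to the lowest tier, in keeping with the MUST above.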

### Examples

**Founding event from Linkup:**

```yaml
timeline_events:
  - event_type: FOUNDING
    event_date: "2005-04-30"
    date_precision: day
    description: "Founded on 30 April 2005"
    data_tier: TIER_4_INFERRED  # LLM-extracted
    extraction_notes: |
      Source: Linkup API query "Drents Archief opgericht"
      Verified against: nl.wikipedia.org/wiki/Drents_Archief
```

**Founding event from institutional website:**

```yaml
timeline_events:
  - event_type: FOUNDING
    event_date: "2005-04-30"
    date_precision: day
    description: "Founded on 30 April 2005"
    data_tier: TIER_2_VERIFIED  # Verified from official source
    extraction_notes: |
      Source: Official website about page
      XPath: /html/body/div[2]/section[1]/p[3]
```

### Related Rules

- **Rule 6**: WebObservation Claims MUST Have XPath Provenance (for web-scraped claims)
- **Rule 35**: Provenance Statements MUST Have Dual Timestamps
- **Rule 22**: Custodian YAML Files Are the Single Source of Truth

### Related Documentation

- `schemas/20251121/linkml/modules/classes/CustodianObservation.yaml`
- `schemas/20251121/linkml/modules/classes/WebObservation.yaml`
- `schemas/20251121/linkml/modules/classes/CustodianTimelineEvent.yaml`
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md`

---

**Created**: 2026-01-01
**Status**: ACTIVE
**Applies to**: All domain classes in the Heritage Custodian Ontology