glam/.opencode/TIMELINE_EVENT_PROVENANCE_POLICY.md
2026-01-02 02:11:04 +01:00

# Timeline Event Provenance Policy
## Overview
This document clarifies the provenance model for `CustodianTimelineEvent` data (renamed from `LinkupTimelineEvent` in January 2026).
**Key Change**: The class is now source-agnostic. Detailed provenance about the data source (Linkup, Wikidata, web scraping, etc.) belongs in observation classes, not in the event itself.
## Architectural Principle: Provenance Separation (Rule 37)
Domain classes model WHAT happened, not HOW we know:
| Layer | Purpose | Classes |
|-------|---------|---------|
| **Domain** | What happened (events, entities) | `CustodianTimelineEvent` |
| **Observation** | How we observed it (provenance) | `WebObservation`, `CustodianObservation` |
See `.opencode/PROVENANCE_SEPARATION_RULE.md` for the full rule.
## Rule 6 Scope Clarification
**AGENTS.md Rule 6** ("WebObservation Claims MUST Have XPath Provenance") applies ONLY to:
- `WebClaim` class
- `WebObservation` class
- `PersonWebClaim` class
**Rule 6 does NOT apply to**:
- `CustodianTimelineEvent` class (source-agnostic design)
- `WikidataEnrichment` (uses entity URI provenance)
- Other API-based enrichments
## CustodianTimelineEvent Provenance Model
The `CustodianTimelineEvent` class uses **source-agnostic provenance fields**:
### Required Fields
| Field | Purpose |
|-------|---------|
| `event_type` | What kind of event (FOUNDING, MERGER, etc.) |
| `date_precision` | How specific is the date (day, year, decade) |
| `approximate` | Is the date approximate (circa, roughly) |
| `description` | Human-readable summary of the event |
| `extraction_method` | How was the event discovered |
| `extraction_timestamp` | When was the event extracted |
| `data_tier` | Quality tier (TIER_1 to TIER_4) |
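
The required-field list above can be checked mechanically before an event is written to a data file. The following is a minimal sketch, not the project's validator: the helper name `missing_required_fields` is hypothetical, and the sample values (including `"day"` for `date_precision` and the description text) are illustrative assumptions rather than values confirmed by the schema.

```python
# Required fields, copied from the table above.
REQUIRED_FIELDS = (
    "event_type",
    "date_precision",
    "approximate",
    "description",
    "extraction_method",
    "extraction_timestamp",
    "data_tier",
)


def missing_required_fields(event: dict) -> list[str]:
    """Return the required field names that are absent or None in `event`.

    Hypothetical helper: `approximate: False` counts as present, because
    only a missing key or an explicit None is treated as missing.
    """
    return [name for name in REQUIRED_FIELDS if event.get(name) is None]


# Illustrative event; the values are assumptions made for this example.
example_event = {
    "event_type": "FOUNDING",
    "date_precision": "day",
    "approximate": False,
    "description": "Founding of the RHC Drents Archief",
    "extraction_method": "api_response_regex",
    "extraction_timestamp": "2025-12-15T16:04:38Z",
    "data_tier": "TIER_4_INFERRED",
}
```

Calling `missing_required_fields(example_event)` returns an empty list; an event that only sets `event_type` returns the six remaining field names in table order.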
### Optional Fields
| Field | Purpose |
|-------|---------|
| `event_date` | When the event occurred (if known) |
| `source_urls` | URLs documenting the event |
| `extraction_notes` | Free-text notes for source-specific details |
| `archive_path` | Path to archived source material |
| `observation_ref` | Link to observation class for detailed provenance |
### Extraction Methods
The `TimelineExtractionMethodEnum` covers various sources:
| Method | Description |
|--------|-------------|
| `api_response_regex` | Date extracted via regex from API response |
| `api_response_llm` | Date extracted from API response using LLM analysis |
| `web_scrape_xpath` | Date extracted via XPath from archived HTML |
| `wikidata_sparql` | Date retrieved via a Wikidata SPARQL query |
| `manual_research` | Event discovered through manual research |
| `manual_verification` | Event manually verified and corrected |
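
For scripts that read or write events, the permitted values can be mirrored in code so that stale Linkup-specific values are rejected early. This is a sketch under assumptions: the authoritative enum lives in the LinkML schema, and the Python class `TimelineExtractionMethod` and the function `parse_extraction_method` are hypothetical names introduced here for illustration.

```python
from enum import Enum


class TimelineExtractionMethod(str, Enum):
    """Hypothetical Python mirror of the values in the table above."""

    API_RESPONSE_REGEX = "api_response_regex"
    API_RESPONSE_LLM = "api_response_llm"
    WEB_SCRAPE_XPATH = "web_scrape_xpath"
    WIKIDATA_SPARQL = "wikidata_sparql"
    MANUAL_RESEARCH = "manual_research"
    MANUAL_VERIFICATION = "manual_verification"


def parse_extraction_method(value: str) -> TimelineExtractionMethod:
    """Convert a raw string to the enum, listing the allowed values on failure."""
    try:
        return TimelineExtractionMethod(value)
    except ValueError:
        allowed = ", ".join(m.value for m in TimelineExtractionMethod)
        raise ValueError(
            f"Unknown extraction_method {value!r}; expected one of: {allowed}"
        ) from None
```

A value such as `"wikidata_sparql"` parses to `TimelineExtractionMethod.WIKIDATA_SPARQL`, while any string outside the table raises `ValueError`.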
## Source-Specific Details in extraction_notes
Use the `extraction_notes` field to capture source-specific details that don't fit elsewhere:
### For API-sourced data (e.g., Linkup)
```yaml
extraction_notes: |
  Query: "Drents Archief" Assen opgericht OR gesticht
  API: Linkup. Answer: "Het RHC Drents Archief werd opgericht op 30 april 2005..."
  Sources cited: nl.wikipedia.org, bizzy.ai
archive_path: web/0002/linkup/linkup_founding_20251215T160438Z.json
```
### For web-scraped data
```yaml
extraction_notes: |
  XPath: /html/body/main/section[2]/div/p[3]
  Source page: https://www.rijksmuseum.nl/en/about-us/history
archive_path: web/0001/rijksmuseum.nl/about-us/rendered.html
```
### For Wikidata-sourced data
```yaml
extraction_notes: |
  Wikidata: Q190804
  Property: P571 (inception date)
  SPARQL timestamp: 2025-12-20T14:30:00Z
```
## Linking to Observation Classes
For detailed provenance, use `observation_ref` to link to a `WebObservation`:
```yaml
timeline_events:
  - event_type: FOUNDING
    event_date: "2005-04-30"
    # ... other fields ...
    observation_ref: "https://nde.nl/ontology/hc/observation/web/2025-12-15/drents-archief"
```
The referenced `WebObservation` contains:
- Full API response details
- XPath provenance (if applicable)
- HTTP response metadata
- Archived content hash
## Data Quality Tiers
Set each event's `data_tier` to reflect the strongest source the event has been verified against:
| Tier | Description | Typical Source |
|------|-------------|----------------|
| `TIER_4_INFERRED` | Unverified, possibly from LLM | Initial API extraction |
| `TIER_3_CROWD_SOURCED` | Verified against Wikipedia/Wikidata | Cross-referenced |
| `TIER_2_VERIFIED` | Verified against institutional website | Official source |
| `TIER_1_AUTHORITATIVE` | Verified against official registry | Government records |
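
Because the tier names carry an implicit ordering (TIER_1 is the most trustworthy), tooling can compare them numerically. The sketch below assumes that ordering and uses hypothetical helper names (`meets_tier`, `events_pending_verification`); the cut-off at `TIER_3_CROWD_SOURCED` is an illustrative choice, not a rule stated in this policy.

```python
# Lower rank means higher trust, following the tier numbering in the table.
TIER_RANK = {
    "TIER_1_AUTHORITATIVE": 1,
    "TIER_2_VERIFIED": 2,
    "TIER_3_CROWD_SOURCED": 3,
    "TIER_4_INFERRED": 4,
}


def meets_tier(data_tier: str, minimum: str) -> bool:
    """Return True if `data_tier` is at least as trustworthy as `minimum`."""
    return TIER_RANK[data_tier] <= TIER_RANK[minimum]


def events_pending_verification(events: list[dict]) -> list[dict]:
    """Select events below TIER_3 (assumed cut-off for 'still unverified')."""
    return [
        event
        for event in events
        if not meets_tier(event["data_tier"], "TIER_3_CROWD_SOURCED")
    ]
```

With this cut-off, only `TIER_4_INFERRED` events are returned as pending verification.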
## Migration from LinkupTimelineEvent
The following fields and the Linkup-specific enum were removed; use the migration path listed for each:
| Old Field | Migration Path |
|-----------|----------------|
| `linkup_query` | Put in `extraction_notes` |
| `linkup_answer` | Put in `extraction_notes` |
| `fetch_timestamp` | Use `extraction_timestamp` |
| `LinkupExtractionMethodEnum` | Use `TimelineExtractionMethodEnum` |
Data files with `timeline_enrichment.timeline_events` continue to work; the events in them are now instances of `CustodianTimelineEvent`.
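
The field-level migration in the table can be sketched as a small transformation over an event loaded as a dictionary. This is an illustration under assumptions, not the project's migration script: the function name `migrate_linkup_event` is hypothetical, and the note layout (`Query: ...`, `API: Linkup. Answer: ...`) simply follows the `extraction_notes` example earlier in this document. Values of the old `LinkupExtractionMethodEnum` are not listed here, so `extraction_method` is left untouched.

```python
import copy


def migrate_linkup_event(old_event: dict) -> dict:
    """Return a copy of `old_event` with the removed Linkup fields migrated.

    Hypothetical helper. The input dict is not modified.
    """
    event = copy.deepcopy(old_event)

    # Keep any notes that already exist, then append the Linkup details.
    notes = []
    existing = event.get("extraction_notes")
    if existing:
        notes.append(existing.rstrip("\n"))

    query = event.pop("linkup_query", None)
    answer = event.pop("linkup_answer", None)
    if query:
        notes.append(f"Query: {query}")
    if answer:
        notes.append(f'API: Linkup. Answer: "{answer}"')
    if notes:
        event["extraction_notes"] = "\n".join(notes)

    # fetch_timestamp becomes extraction_timestamp unless one is already set.
    fetched = event.pop("fetch_timestamp", None)
    if fetched is not None:
        event.setdefault("extraction_timestamp", fetched)

    return event
```

Keeping an existing `extraction_timestamp` in preference to `fetch_timestamp` is a design choice made for this sketch; a real migration may prefer to flag such conflicts for review.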
## Current Statistics (January 2026)
- **Total events**: ~1,199 across ~862 custodian files
- **Event types**: FOUNDING (927), TRANSFER (190), MERGER (57), DISSOLUTION (10), RENAMING (10)
- **Data tier**: Mostly TIER_4_INFERRED (pending verification)
## Schema Reference
The formal schema is defined in:
- `schemas/20251121/linkml/modules/classes/CustodianTimelineEvent.yaml`
## Related Documentation
- `AGENTS.md` Rule 6 - WebObservation XPath requirements
- `.opencode/PROVENANCE_SEPARATION_RULE.md` - Rule 37 on provenance separation
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` - WebClaim details
- `schemas/20251121/linkml/modules/classes/WebObservation.yaml` - WebObservation schema
- `schemas/20251121/linkml/modules/classes/CustodianObservation.yaml` - CustodianObservation schema
---
**Created**: 2025-12-16
**Updated**: 2026-01-01 (Renamed class to CustodianTimelineEvent, source-agnostic design)
**Status**: ACTIVE
**Applies to**: Heritage Custodian Timeline Events