# Linkup Provenance Policy

## Overview

This document clarifies the provenance requirements for `LinkupTimelineEvent` data, explaining why it differs from `WebClaim` provenance.

## Rule 6 Scope Clarification

**AGENTS.md Rule 6** ("WebObservation Claims MUST Have XPath Provenance") applies ONLY to:

- `WebClaim` class
- `WebObservation` class
- `PersonWebClaim` class

**Rule 6 does NOT apply to**:

- `LinkupTimelineEvent` class (has its own provenance model)
- `WikidataEnrichment` (uses entity URI provenance)
- Other API-based enrichments

## Why Linkup Provenance is Different

### Data Flow Comparison

**WebClaim** (XPath required):

```
Webpage HTML → Archive to file → Parse HTML → Extract XPath → Claim value
                     ↓                             ↓
          Verifiable: Can check XPath points to value in HTML
```

**LinkupTimelineEvent** (API provenance):

```
Query → Linkup API → LLM Answer + Source URLs → Archive JSON → Regex Extract → Event
                              ↓
          NOT directly verifiable: LLM may hallucinate, sources may be misquoted
```

### Fundamental Difference

| Aspect | WebClaim | LinkupTimelineEvent |
|--------|----------|---------------------|
| **Source** | HTML file (static) | LLM answer (generated) |
| **Verification** | Automated (XPath lookup) | Manual (check source_urls) |
| **Trust Model** | High (direct extraction) | Low (LLM intermediary) |
| **Data Tier** | TIER_2 or higher | TIER_4_INFERRED (always) |

## LinkupTimelineEvent Provenance Requirements

All events MUST have these fields (per schema `LinkupTimelineEvent.yaml`):

### Required Fields

| Field | Purpose |
|-------|---------|
| `linkup_query` | The exact query sent to API (reproducibility) |
| `linkup_answer` | Full LLM response (audit trail) |
| `fetch_timestamp` | When API was called |
| `archive_path` | Path to archived JSON (evidence) |
| `extraction_method` | How event was extracted |
| `extraction_timestamp` | When extraction occurred |
| `data_tier` | Always `TIER_4_INFERRED` initially |

### Optional but Recommended

| Field | Purpose |
|-------|---------|
| `source_urls` | URLs cited by Linkup for manual verification |

## Verification Pathway

Timeline events can be promoted from TIER_4 to higher tiers through verification:

```
TIER_4_INFERRED (initial)
    ↓ Verify against source_urls
TIER_3_CROWD_SOURCED (if verified against Wikipedia)
    ↓ Verify against institutional website
TIER_2_VERIFIED (if institutional source confirms)
    ↓ Verify against official registry/document
TIER_1_AUTHORITATIVE (rare for events)
```

## Implementation

### Current Statistics (December 2025)

- **Total events**: 1,199 across 862 custodian files
- **Event types**: FOUNDING (927), TRANSFER (190), MERGER (57), DISSOLUTION (10), RENAMING (10)
- **Data tier**: 100% TIER_4_INFERRED
- **Provenance fields**: 100% complete

### Archived JSON Location

All Linkup API responses are archived at:

```
data/custodian/web/{entry_number}/linkup/linkup_{event_type}_{timestamp}.json
```

Example:

```
data/custodian/web/1071/linkup/linkup_founding_20251215T215802Z.json
```

## Schema Reference

The formal schema is defined in:

- `schemas/20251121/linkml/modules/classes/LinkupTimelineEvent.yaml`

Key documentation in schema (lines 6-15):

```yaml
# Key principle:
# Linkup API returns LLM-generated answers with source URLs, not XPath locations.
# Therefore, provenance is different from WebClaim:
# - Store the query that was sent to Linkup
# - Store the LLM answer (which may contain hallucinations)
# - Store source URLs (for manual verification)
# - Archive the complete API response JSON
#
# This acknowledges that Linkup data is TIER_4_INFERRED (LLM-generated)
# and requires manual verification before promotion to higher tiers.
```

## Related Documentation

- `AGENTS.md` Rule 6 - WebObservation XPath requirements
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` - WebClaim details
- `schemas/20251121/linkml/modules/classes/WebClaim.yaml` - WebClaim schema
- `schemas/20251121/linkml/modules/classes/LinkupTimelineEvent.yaml` - LinkupTimelineEvent schema

---

**Created**: 2025-12-16
**Status**: ACTIVE
**Applies to**: Dutch GLAM Timeline Event Enrichment project
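
The required-field table and the archive path convention in this policy lend themselves to a mechanical check. The following is a minimal sketch, assuming each event is available as a plain Python dict keyed by the field names above. The function name `validate_linkup_event`, the constants, and the archive path regex are hypothetical and not part of the project codebase; the regex in particular is inferred from the single example path in this document and may be stricter or looser than the real layout.

```python
import re
from typing import Any

# Required provenance fields, copied from the "Required Fields" table.
REQUIRED_FIELDS = (
    "linkup_query",
    "linkup_answer",
    "fetch_timestamp",
    "archive_path",
    "extraction_method",
    "extraction_timestamp",
    "data_tier",
)

# Tier names as they appear in the "Verification Pathway" section.
KNOWN_TIERS = (
    "TIER_4_INFERRED",
    "TIER_3_CROWD_SOURCED",
    "TIER_2_VERIFIED",
    "TIER_1_AUTHORITATIVE",
)

# ASSUMPTION: numeric entry number, lowercase event type, and a compact UTC
# timestamp, as seen in the one example path given in this policy.
ARCHIVE_PATH_RE = re.compile(
    r"^data/custodian/web/\d+/linkup/linkup_[a-z_]+_\d{8}T\d{6}Z\.json$"
)


def validate_linkup_event(event: dict[str, Any]) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for one event dict (hypothetical helper)."""
    errors: list[str] = []
    warnings: list[str] = []

    # Every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if event.get(field) in (None, ""):
            errors.append(f"missing required field: {field}")

    # A tier that is present must be one of the documented tiers.
    tier = event.get("data_tier")
    if tier and tier not in KNOWN_TIERS:
        errors.append(f"unknown data_tier: {tier}")

    # The path check is only a warning because the pattern is an assumption.
    path = event.get("archive_path")
    if isinstance(path, str) and path and not ARCHIVE_PATH_RE.match(path):
        warnings.append(f"archive_path does not match the documented layout: {path}")

    # source_urls is optional but recommended for manual verification.
    if not event.get("source_urls"):
        warnings.append("source_urls is empty; manual verification is not possible")

    return errors, warnings
```

Treating the archive path and `source_urls` checks as warnings rather than errors keeps the sketch aligned with the policy: only the seven required fields are mandatory, while `source_urls` is recommended.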