glam/.opencode/LINKUP_PROVENANCE_POLICY.md
2025-12-17 10:11:56 +01:00

# Linkup Provenance Policy
## Overview
This document defines the provenance requirements for `LinkupTimelineEvent` data and explains why they differ from the XPath-based provenance required for `WebClaim`.
## Rule 6 Scope Clarification
**AGENTS.md Rule 6** ("WebObservation Claims MUST Have XPath Provenance") applies ONLY to:
- `WebClaim` class
- `WebObservation` class
- `PersonWebClaim` class
**Rule 6 does NOT apply to**:
- `LinkupTimelineEvent` class (has its own provenance model)
- `WikidataEnrichment` (uses entity URI provenance)
- Other API-based enrichments
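The scope rule above can be expressed as a simple membership check. This is a minimal sketch: the class names come from the lists above, while the constant and function names are hypothetical and not part of the project's code.

```python
# Classes that AGENTS.md Rule 6 applies to, as listed above.
# The names XPATH_REQUIRED_CLASSES and requires_xpath_provenance are illustrative.
XPATH_REQUIRED_CLASSES = frozenset({"WebClaim", "WebObservation", "PersonWebClaim"})


def requires_xpath_provenance(class_name: str) -> bool:
    """Return True if Rule 6 (XPath provenance) applies to the given class."""
    return class_name in XPATH_REQUIRED_CLASSES
```

With this check, `LinkupTimelineEvent` and `WikidataEnrichment` fall outside Rule 6 and are validated against their own provenance models instead.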
## Why Linkup Provenance is Different
### Data Flow Comparison
**WebClaim** (XPath required):
```
Webpage HTML → Archive to file → Parse HTML → Extract XPath → Claim value
                                                    ↓              ↓
                         Verifiable: Can check XPath points to value in HTML
```
**LinkupTimelineEvent** (API provenance):
```
Query → Linkup API → LLM Answer + Source URLs → Archive JSON → Regex Extract → Event
NOT directly verifiable: LLM may hallucinate, sources may be misquoted
```
### Fundamental Difference
| Aspect | WebClaim | LinkupTimelineEvent |
|--------|----------|---------------------|
| **Source** | HTML file (static) | LLM answer (generated) |
| **Verification** | Automated (XPath lookup) | Manual (check source_urls) |
| **Trust Model** | High (direct extraction) | Low (LLM intermediary) |
| **Data Tier** | TIER_2 or higher | TIER_4_INFERRED (initially; promotable after verification) |
## LinkupTimelineEvent Provenance Requirements
Every event MUST include the required fields listed below (per schema `LinkupTimelineEvent.yaml`):
### Required Fields
| Field | Purpose |
|-------|---------|
| `linkup_query` | The exact query sent to API (reproducibility) |
| `linkup_answer` | Full LLM response (audit trail) |
| `fetch_timestamp` | When API was called |
| `archive_path` | Path to archived JSON (evidence) |
| `extraction_method` | How event was extracted |
| `extraction_timestamp` | When extraction occurred |
| `data_tier` | `TIER_4_INFERRED` at creation; changed only through verification |
### Optional but Recommended
| Field | Purpose |
|-------|---------|
| `source_urls` | URLs cited by Linkup for manual verification |
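A completeness check over these fields might look like the sketch below. The field names are taken from the two tables above; the function name, the flat-mapping representation of an event, and the definition of "missing" (absent, `None`, empty string, or empty list) are assumptions made for illustration.

```python
from typing import Any, Mapping

# Field names come from the tables above; everything else is illustrative.
REQUIRED_FIELDS = (
    "linkup_query",
    "linkup_answer",
    "fetch_timestamp",
    "archive_path",
    "extraction_method",
    "extraction_timestamp",
    "data_tier",
)
RECOMMENDED_FIELDS = ("source_urls",)


def check_provenance(event: Mapping[str, Any]) -> dict[str, list[str]]:
    """Report which required and recommended provenance fields are missing."""

    def is_missing(name: str) -> bool:
        # Assumption: absent, None, empty string, and empty list all count as missing.
        value = event.get(name)
        return value is None or value == "" or value == []

    return {
        "missing_required": [f for f in REQUIRED_FIELDS if is_missing(f)],
        "missing_recommended": [f for f in RECOMMENDED_FIELDS if is_missing(f)],
    }
```

An event with a non-empty `missing_required` list violates this policy, while a non-empty `missing_recommended` list only signals that manual verification will be harder.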
## Verification Pathway
Timeline events can be promoted from TIER_4 to higher tiers through verification:
```
TIER_4_INFERRED (initial)
↓ Verify against source_urls
TIER_3_CROWD_SOURCED (if verified against Wikipedia)
↓ Verify against institutional website
TIER_2_VERIFIED (if institutional source confirms)
↓ Verify against official registry/document
TIER_1_AUTHORITATIVE (rare for events)
```
## Implementation
### Current Statistics (December 2025)
- **Total events**: 1,199 across 862 custodian files
- **Event types**: FOUNDING (927), TRANSFER (190), MERGER (57), DISSOLUTION (10), RENAMING (10)
- **Data tier**: 100% TIER_4_INFERRED
- **Provenance fields**: 100% complete
### Archived JSON Location
All Linkup API responses are archived at:
```
data/custodian/web/{entry_number}/linkup/linkup_{event_type}_{timestamp}.json
```
Example:
```
data/custodian/web/1071/linkup/linkup_founding_20251215T215802Z.json
```
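A helper that builds a path in this layout could look like the following sketch. The function name is hypothetical. Lowercasing the event type and formatting the timestamp as compact UTC (`YYYYMMDDTHHMMSSZ`) are inferred from the example path above, not taken from the enrichment code.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def linkup_archive_path(
    entry_number: int, event_type: str, fetched_at: datetime
) -> PurePosixPath:
    """Build the archive path for one Linkup API response."""
    if fetched_at.tzinfo is None:
        # A naive datetime would be silently interpreted as local time.
        raise ValueError("fetched_at must be timezone-aware")
    # Assumption: timestamps are UTC in compact ISO 8601 form, as in the example.
    stamp = fetched_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # Assumption: the event type appears lowercased in the file name.
    name = f"linkup_{event_type.lower()}_{stamp}.json"
    return PurePosixPath("data/custodian/web") / str(entry_number) / "linkup" / name
```

For instance, entry `1071`, event type `FOUNDING`, and a fetch time of 2025-12-15 21:58:02 UTC reproduce the example path shown above.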
## Schema Reference
The formal schema is defined in:
- `schemas/20251121/linkml/modules/classes/LinkupTimelineEvent.yaml`
Key documentation in schema (lines 6-15):
```yaml
# Key principle:
# Linkup API returns LLM-generated answers with source URLs, not XPath locations.
# Therefore, provenance is different from WebClaim:
# - Store the query that was sent to Linkup
# - Store the LLM answer (which may contain hallucinations)
# - Store source URLs (for manual verification)
# - Archive the complete API response JSON
#
# This acknowledges that Linkup data is TIER_4_INFERRED (LLM-generated)
# and requires manual verification before promotion to higher tiers.
```
## Related Documentation
- `AGENTS.md` Rule 6 - WebObservation XPath requirements
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` - WebClaim details
- `schemas/20251121/linkml/modules/classes/WebClaim.yaml` - WebClaim schema
- `schemas/20251121/linkml/modules/classes/LinkupTimelineEvent.yaml` - LinkupTimelineEvent schema
---
**Created**: 2025-12-16
**Status**: ACTIVE
**Applies to**: Dutch GLAM Timeline Event Enrichment project