glam/.opencode/LINKUP_PROVENANCE_POLICY.md
2025-12-17 10:11:56 +01:00

Linkup Provenance Policy

Overview

This document clarifies the provenance requirements for LinkupTimelineEvent data and explains why they differ from the WebClaim provenance requirements.

Rule 6 Scope Clarification

AGENTS.md Rule 6 ("WebObservation Claims MUST Have XPath Provenance") applies ONLY to:

  • WebClaim class
  • WebObservation class
  • PersonWebClaim class

Rule 6 does NOT apply to:

  • LinkupTimelineEvent class (has its own provenance model)
  • WikidataEnrichment (uses entity URI provenance)
  • Other API-based enrichments
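The scope rule above can be expressed as a simple membership check. This is a minimal sketch, not the project's actual enforcement code: the function name `requires_xpath_provenance` and the constant `XPATH_REQUIRED_CLASSES` are hypothetical, and only the three class names come from this policy.

```python
# Classes that this policy lists as subject to AGENTS.md Rule 6.
# The constant and function names are illustrative assumptions.
XPATH_REQUIRED_CLASSES = frozenset({"WebClaim", "WebObservation", "PersonWebClaim"})


def requires_xpath_provenance(class_name: str) -> bool:
    """Return True if Rule 6 (XPath provenance) applies to the given class."""
    return class_name in XPATH_REQUIRED_CLASSES


print(requires_xpath_provenance("WebClaim"))             # True
print(requires_xpath_provenance("LinkupTimelineEvent"))  # False
```

Any class not in the set, including WikidataEnrichment and other API-based enrichments, falls outside Rule 6 in this sketch.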

Why Linkup Provenance is Different

Data Flow Comparison

WebClaim (XPath required):

Webpage HTML → Archive to file → Parse HTML → Extract XPath → Claim value
      ↓                              ↓
   Verifiable: Can check XPath points to value in HTML

LinkupTimelineEvent (API provenance):

Query → Linkup API → LLM Answer + Source URLs → Archive JSON → Regex Extract → Event
                          ↓
   NOT directly verifiable: LLM may hallucinate, sources may be misquoted

Fundamental Difference

Aspect        WebClaim                  LinkupTimelineEvent
Source        HTML file (static)        LLM answer (generated)
Verification  Automated (XPath lookup)  Manual (check source_urls)
Trust Model   High (direct extraction)  Low (LLM intermediary)
Data Tier     TIER_2 or higher          TIER_4_INFERRED (initial tier)

LinkupTimelineEvent Provenance Requirements

All events MUST have these fields (per schema LinkupTimelineEvent.yaml):

Required Fields

Field                 Purpose
linkup_query          The exact query sent to the API (reproducibility)
linkup_answer         Full LLM response (audit trail)
fetch_timestamp       When the API was called
archive_path          Path to the archived JSON (evidence)
extraction_method     How the event was extracted
extraction_timestamp  When extraction occurred
data_tier             TIER_4_INFERRED on creation

Field                 Purpose
source_urls           URLs cited by Linkup, for manual verification
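A completeness check over the required fields might look like the sketch below. It assumes an event is available as a plain dictionary keyed by field name; the helper `missing_provenance_fields` and all sample values except the archive path pattern are hypothetical and are not taken from LinkupTimelineEvent.yaml.

```python
# Required provenance fields as listed in this policy.
REQUIRED_FIELDS = (
    "linkup_query",
    "linkup_answer",
    "fetch_timestamp",
    "archive_path",
    "extraction_method",
    "extraction_timestamp",
    "data_tier",
)


def missing_provenance_fields(event: dict) -> list[str]:
    """Return the required fields that are absent or empty in the event."""
    return [name for name in REQUIRED_FIELDS if not event.get(name)]


# Illustrative event; the values are placeholders, not real data.
example_event = {
    "linkup_query": "When was the institution founded?",
    "linkup_answer": "(full LLM answer text)",
    "fetch_timestamp": "2025-12-15T21:58:02Z",
    "archive_path": "data/custodian/web/1071/linkup/linkup_founding_20251215T215802Z.json",
    "extraction_method": "regex",
    "extraction_timestamp": "2025-12-16T09:00:00Z",
    "data_tier": "TIER_4_INFERRED",
}

print(missing_provenance_fields(example_event))  # []
```

Treating empty strings as missing is a design choice of this sketch; a schema validator may draw that line differently.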

Verification Pathway

Timeline events can be promoted from TIER_4 to higher tiers through verification:

TIER_4_INFERRED (initial)
    ↓ Verify against source_urls
TIER_3_CROWD_SOURCED (if verified against Wikipedia)
    ↓ Verify against institutional website
TIER_2_VERIFIED (if institutional source confirms)
    ↓ Verify against official registry/document
TIER_1_AUTHORITATIVE (rare for events)
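The promotion ladder above could be modelled as follows. This is a sketch under stated assumptions: the evidence labels (`wikipedia`, `institutional_website`, `official_registry`) and the function `tier_after_verification` are hypothetical, and the sketch assumes that a stronger source may promote an event directly to its tier and that verification never demotes an event. The policy itself does not state either point.

```python
from enum import IntEnum


class DataTier(IntEnum):
    """Tiers named in this policy; a lower number means higher trust."""
    TIER_1_AUTHORITATIVE = 1
    TIER_2_VERIFIED = 2
    TIER_3_CROWD_SOURCED = 3
    TIER_4_INFERRED = 4


# Hypothetical mapping from the kind of confirming source to the tier it supports.
EVIDENCE_TIER = {
    "wikipedia": DataTier.TIER_3_CROWD_SOURCED,
    "institutional_website": DataTier.TIER_2_VERIFIED,
    "official_registry": DataTier.TIER_1_AUTHORITATIVE,
}


def tier_after_verification(current: DataTier, evidence_kind: str) -> DataTier:
    """Return the tier after a successful verification; never demotes."""
    target = EVIDENCE_TIER[evidence_kind]
    return min(current, target)


print(tier_after_verification(DataTier.TIER_4_INFERRED, "wikipedia").name)
# TIER_3_CROWD_SOURCED
```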

Implementation

Current Statistics (December 2025)

  • Total events: 1,199 across 862 custodian files
  • Event types: FOUNDING (927), TRANSFER (190), MERGER (57), DISSOLUTION (10), RENAMING (10)
  • Data tier: 100% TIER_4_INFERRED
  • Provenance fields: 100% complete

Archived JSON Location

All Linkup API responses are archived at:

data/custodian/web/{entry_number}/linkup/linkup_{event_type}_{timestamp}.json

Example:

data/custodian/web/1071/linkup/linkup_founding_20251215T215802Z.json
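The path pattern above can be built with a small helper. This is a sketch: the function name `linkup_archive_path` is hypothetical, and lowercasing the event type and formatting the timestamp in UTC are assumptions inferred from the single example path.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def linkup_archive_path(entry_number: int, event_type: str, fetched_at: datetime) -> str:
    """Build the archive path for one Linkup API response."""
    # Compact UTC timestamp, matching the form seen in the example path.
    stamp = fetched_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    filename = f"linkup_{event_type.lower()}_{stamp}.json"
    return str(PurePosixPath("data/custodian/web", str(entry_number), "linkup", filename))


path = linkup_archive_path(
    1071, "FOUNDING", datetime(2025, 12, 15, 21, 58, 2, tzinfo=timezone.utc)
)
print(path)  # data/custodian/web/1071/linkup/linkup_founding_20251215T215802Z.json
```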

Schema Reference

The formal schema is defined in:

  • schemas/20251121/linkml/modules/classes/LinkupTimelineEvent.yaml

Key documentation in schema (lines 6-15):

# Key principle:
#   Linkup API returns LLM-generated answers with source URLs, not XPath locations.
#   Therefore, provenance is different from WebClaim:
#   - Store the query that was sent to Linkup
#   - Store the LLM answer (which may contain hallucinations)
#   - Store source URLs (for manual verification)
#   - Archive the complete API response JSON
#
# This acknowledges that Linkup data is TIER_4_INFERRED (LLM-generated)
# and requires manual verification before promotion to higher tiers.

Related Documentation

  • AGENTS.md Rule 6 - WebObservation XPath requirements
  • .opencode/WEB_OBSERVATION_PROVENANCE_RULES.md - WebClaim details
  • schemas/20251121/linkml/modules/classes/WebClaim.yaml - WebClaim schema
  • schemas/20251121/linkml/modules/classes/LinkupTimelineEvent.yaml - LinkupTimelineEvent schema

Created: 2025-12-16
Status: ACTIVE
Applies to: Dutch GLAM Timeline Event Enrichment project