# WebObservation XPath Provenance Rules

**Rule ID:** XPATH-PROVENANCE
**Status:** MANDATORY
**Applies To:** WebClaim extraction from websites
**Created:** 2025-11-29


## Core Principle: Every Claim MUST Have Verifiable Provenance

If a claim allegedly came from a webpage, it MUST have an XPath pointer to the exact location in the archived HTML where that value appears. Claims without XPath provenance are considered FABRICATED and must be removed.

This is not about "confidence" or "uncertainty" - it's about verifiability. Either the claim value exists in the HTML at a specific XPath, or it was hallucinated/fabricated by an LLM.


## Required Fields for WebObservation Claims

Every claim in `web_enrichment.claims` MUST have:

| Field | Required | Description |
|-------|----------|-------------|
| `claim_type` | YES | Type of claim (`full_name`, `description`, `email`, etc.) |
| `claim_value` | YES | The extracted value |
| `source_url` | YES | URL the claim was extracted from |
| `retrieved_on` | YES | ISO 8601 timestamp when the page was archived |
| `xpath` | YES | XPath to the element containing this value |
| `html_file` | YES | Relative path to the archived HTML file |
| `xpath_match_score` | YES | 1.0 for an exact match, <1.0 for a fuzzy match |

### Example - CORRECT (Verifiable)

```yaml
web_enrichment:
  claims:
    - claim_type: full_name
      claim_value: Historische Vereniging Nijeveen
      source_url: https://historischeverenigingnijeveen.nl/
      retrieved_on: "2025-11-29T12:28:00Z"
      xpath: /[document][1]/html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[6]
      html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
      xpath_match_score: 1.0
```

### Example - WRONG (Fabricated - Must Be Removed)

```yaml
web_enrichment:
  claims:
    - claim_type: full_name
      claim_value: Historische Vereniging Nijeveen
      confidence: 0.95  # ← NO! This is meaningless without XPath
```

## Forbidden: Confidence Scores Without XPath

NEVER use arbitrary confidence scores for web-extracted claims.

Confidence scores like 0.95, 0.90, 0.85 are meaningless because:

  1. There is NO methodology defining what these numbers mean
  2. They cannot be verified or reproduced
  3. They give false impression of rigor
  4. They mask the fact that claims may be fabricated

If a value appears in the HTML, record `xpath_match_score: 1.0` (or a lower score for a fuzzy match). If a value does NOT appear in the HTML, REMOVE THE CLAIM.
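The binary rule above can be sketched with the standard library alone (a hypothetical helper, not one of the project's scripts; fuzzy matching is omitted for brevity): either the claim value occurs in the page text and earns a score of 1.0, or the claim is dropped.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Accumulate all visible text nodes from an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def verify_claim(claim_value: str, html: str):
    """Return 1.0 if the value occurs in the page text, else None (remove the claim)."""
    parser = _TextCollector()
    parser.feed(html)
    # Normalize whitespace so layout differences do not hide a real match
    text = " ".join(" ".join(parser.chunks).split())
    return 1.0 if claim_value in text else None

html = "<html><body><h1>Historische Vereniging Nijeveen</h1></body></html>"
print(verify_claim("Historische Vereniging Nijeveen", html))  # 1.0
print(verify_claim("+31 6 12345678", html))                   # None
```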


## Website Archiving Workflow

### Step 1: Archive the Website

Use Playwright to archive websites with JavaScript rendering:

```shell
python scripts/fetch_website_playwright.py <entry_number> <url>

# Example:
python scripts/fetch_website_playwright.py 0021 https://historischeverenigingnijeveen.nl/
```

This creates:

```text
data/nde/enriched/entries/web/{entry_number}/{domain}/
├── index.html       # Raw HTML as received
├── rendered.html    # HTML after JS execution
├── content.md       # Markdown conversion
└── metadata.yaml    # XPath extractions for provenance
```
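To illustrate the layout above, the archive directory can be derived from the entry number and URL. `archive_dir` is a hypothetical helper for illustration; the actual script may construct paths differently.

```python
from pathlib import Path
from urllib.parse import urlparse

# Root of the web archive tree, per the directory listing above
ARCHIVE_ROOT = Path("data/nde/enriched/entries/web")

def archive_dir(entry_number: str, url: str) -> Path:
    """Directory where index.html, rendered.html, content.md, metadata.yaml land."""
    domain = urlparse(url).netloc
    return ARCHIVE_ROOT / entry_number / domain

print(archive_dir("0021", "https://historischeverenigingnijeveen.nl/").as_posix())
# data/nde/enriched/entries/web/0021/historischeverenigingnijeveen.nl
```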

### Step 2: Add XPath Provenance to Claims

Run the XPath migration script:

```shell
python scripts/add_xpath_provenance.py

# Or for specific entries:
python scripts/add_xpath_provenance.py --entries 0021,0022,0023
```

This script:

  1. Reads each entry's `web_enrichment.claims`
  2. Searches the archived HTML for each claim value
  3. Adds `xpath` + `html_file` if found
  4. REMOVES claims that cannot be verified (storing them in `removed_unverified_claims`)
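Steps 2 and 3 can be approximated with the standard library (a simplified stand-in; the internals of `add_xpath_provenance.py` are not shown here): walk the HTML, keep a stack of elements with their sibling positions, and emit an indexed XPath wherever the claim value occurs in a text node.

```python
from html.parser import HTMLParser

# Elements that never close, so they must not stay on the stack
VOID = {"br", "img", "meta", "link", "hr", "input"}

class XPathFinder(HTMLParser):
    """Record an indexed XPath for each element whose text contains a target value."""
    def __init__(self, target: str):
        super().__init__()
        self.target = target
        self.stack = []      # [(tag, position among same-tag siblings)]
        self.counts = [{}]   # per-depth sibling counters
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        counts = self.counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        self.stack.append((tag, counts[tag]))
        self.counts.append({})

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        if self.target in data and self.stack:
            self.matches.append("/" + "/".join(f"{t}[{i}]" for t, i in self.stack))

finder = XPathFinder("Historische Vereniging Nijeveen")
finder.feed("<html><body><div><h1>Historische Vereniging Nijeveen</h1></div></body></html>")
print(finder.matches)  # ['/html[1]/body[1]/div[1]/h1[1]']
```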

### Step 3: Audit Removed Claims

Check `removed_unverified_claims` in each entry file:

```yaml
removed_unverified_claims:
  - claim_type: phone
    claim_value: "+31 6 12345678"
    reason: "Value not found in archived HTML - likely fabricated"
    removed_on: "2025-11-29T14:30:00Z"
```

These claims were NOT in the HTML and should NOT be restored without proper sourcing.
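For the audit step, removed claims can be grouped by type across entries so fabrications are reviewed in bulk. This is an illustrative sketch operating on already-loaded entry dicts; the entry loading itself (e.g. from YAML files) is left out.

```python
def audit_removed(entries: dict) -> dict:
    """Group removed claims by claim_type: {claim_type: [(entry_id, value, reason)]}."""
    summary = {}
    for entry_id, entry in entries.items():
        for claim in entry.get("removed_unverified_claims", []):
            summary.setdefault(claim["claim_type"], []).append(
                (entry_id, claim["claim_value"], claim.get("reason", ""))
            )
    return summary

# Example input shaped like the YAML above
entries = {
    "0021": {"removed_unverified_claims": [
        {"claim_type": "phone", "claim_value": "+31 6 12345678",
         "reason": "Value not found in archived HTML - likely fabricated"}]},
    "0022": {},
}
print(audit_removed(entries))
```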


## Claim Types and Expected Sources

| Claim Type | Expected Source | Notes |
|------------|-----------------|-------|
| `full_name` | Page title, heading, logo text | Usually in `<h1>`, `<title>`, or a prominent `<div>` |
| `description` | Meta description, about text | Check `<meta name="description">` first |
| `email` | Contact page, footer | Often in `<a href="mailto:...">` |
| `phone` | Contact page, footer | May need normalization |
| `address` | Contact page, footer | Check for structured data too |
| `social_media` | Footer, contact page | Links to social platforms |
| `opening_hours` | Contact/visit page | May be in structured data |
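As one concrete case from the table, `email` claims are often found in `mailto:` links. A stdlib sketch (illustrative only) that collects such addresses:

```python
from html.parser import HTMLParser

class MailtoFinder(HTMLParser):
    """Collect addresses from <a href="mailto:..."> links, a common email source."""
    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith("mailto:"):
                self.emails.append(value[len("mailto:"):])

mailto_finder = MailtoFinder()
mailto_finder.feed('<footer><a href="mailto:info@example.org">Contact</a></footer>')
print(mailto_finder.emails)  # ['info@example.org']
```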

## XPath Matching Strategy

The `add_xpath_provenance.py` script uses this matching strategy:

  1. Exact match: Claim value appears exactly in element text
  2. Normalized match: After whitespace normalization
  3. Substring match: Claim value is substring of element text (score < 1.0)
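The three tiers can be sketched as a scoring function. Note that the length-ratio formula for substring matches is an assumption made for this example, not necessarily the script's actual scoring.

```python
import re

def _norm(s: str) -> str:
    """Collapse runs of whitespace and trim, per tier 2."""
    return re.sub(r"\s+", " ", s).strip()

def match_score(claim_value: str, element_text: str):
    """Apply the three-tier strategy; None means the claim must be removed."""
    if claim_value == element_text:
        return 1.0                                  # 1. exact match
    if _norm(claim_value) == _norm(element_text):
        return 1.0                                  # 2. normalized match
    if _norm(claim_value) in _norm(element_text):
        # 3. substring match: score < 1.0; the coverage ratio here is an
        # assumed scoring choice, not necessarily the project's formula
        return len(_norm(claim_value)) / len(_norm(element_text))
    return None

print(match_score("Historische  Vereniging", "Historische Vereniging"))  # 1.0
print(match_score("Nijeveen", "Historische Vereniging Nijeveen"))        # fuzzy: a score below 1.0
```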

Priority order for matching:

  1. `rendered.html` (after JS execution) - preferred
  2. `index.html` (raw HTML) - fallback

## Integration with LinkML Schema

The `WebClaim` class in the LinkML schema requires:

```yaml
# schemas/20251121/linkml/modules/classes/WebClaim.yaml
WebClaim:
  slots:
    - source_url        # Required
    - retrieved_on      # Required (timestamp)
    - xpath             # Required for claims
    - html_archive_path # Path to archived HTML
```

## Rules for AI Agents

### When Extracting Claims from Websites

  1. ALWAYS archive the website first using Playwright
  2. ALWAYS extract claims with XPath provenance using the archived HTML
  3. NEVER invent or infer claims not present in the HTML
  4. NEVER use confidence scores without XPath backing

### When Processing Existing Claims

  1. Verify each claim against archived HTML
  2. Add XPath provenance to verified claims
  3. REMOVE fabricated claims that cannot be verified
  4. Document removed claims in removed_unverified_claims

### When Reviewing Data Quality

  1. Claims with `xpath` + `html_file` = VERIFIED
  2. Claims with only `confidence` = SUSPECT (migrate or remove)
  3. Claims in `removed_unverified_claims` = FABRICATED (do not restore)
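These three review rules translate into a small triage function (an illustrative sketch; field names follow the required-fields table above):

```python
def triage(claim: dict, in_removed_list: bool = False) -> str:
    """Classify a claim per the three review rules above."""
    if in_removed_list:
        return "FABRICATED"   # lives in removed_unverified_claims; do not restore
    if claim.get("xpath") and claim.get("html_file"):
        return "VERIFIED"     # full XPath provenance present
    return "SUSPECT"          # bare confidence score or no provenance; migrate or remove

print(triage({"xpath": "/html[1]/body[1]/p[1]",
              "html_file": "web/0021/rendered.html"}))  # VERIFIED
print(triage({"confidence": 0.95}))                     # SUSPECT
print(triage({"claim_type": "phone"}, in_removed_list=True))  # FABRICATED
```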

## Scripts Reference

| Script | Purpose |
|--------|---------|
| `scripts/fetch_website_playwright.py` | Archive a website with Playwright |
| `scripts/add_xpath_provenance.py` | Add XPath to claims, remove fabricated ones |
| `scripts/batch_fetch_websites.py` | Batch-archive multiple entries |

## Version History

- 2025-11-29: Initial version
  - Established the XPath provenance requirement
  - Replaced confidence scores with verifiable XPath pointers
  - Established the policy of removing fabricated claims