glam/.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md
2025-11-29 18:05:16 +01:00


WebObservation Provenance Rules

Core Principle: Every Claim MUST Have Verifiable Provenance

If a claim allegedly came from a webpage, it MUST have an XPath pointer to the exact location in the archived HTML where that value appears. Claims without XPath provenance are considered FABRICATED and must be removed.

This is not about "confidence" or "uncertainty"; it is about verifiability. Either the claim value exists in the HTML at a specific XPath, or it was hallucinated or fabricated by an LLM.


Required Fields for WebObservation Claims

Every claim in web_enrichment.claims MUST have:

| Field | Required | Description |
|-------|----------|-------------|
| claim_type | YES | Type of claim (full_name, description, email, etc.) |
| claim_value | YES | The extracted value |
| source_url | YES | URL the claim was extracted from |
| retrieved_on | YES | ISO 8601 timestamp when the page was archived |
| xpath | YES | XPath to the element containing this value |
| html_file | YES | Relative path to the archived HTML file |
| xpath_match_score | YES | 1.0 for exact match, <1.0 for fuzzy match |
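The required-fields rule above can be enforced mechanically. The sketch below (function name `validate_claim` and the `REQUIRED_FIELDS` tuple are illustrative, not part of any existing script) returns the provenance fields a claim is missing:

```python
# Minimal validation sketch: reject any web claim that lacks a required
# provenance field. Field names follow the table above.

REQUIRED_FIELDS = (
    "claim_type", "claim_value", "source_url",
    "retrieved_on", "xpath", "html_file", "xpath_match_score",
)

def validate_claim(claim: dict) -> list[str]:
    """Return the list of missing required fields (empty list = verifiable)."""
    return [f for f in REQUIRED_FIELDS if f not in claim or claim[f] in (None, "")]

claim = {
    "claim_type": "full_name",
    "claim_value": "Historische Vereniging Nijeveen",
    "confidence": 0.95,  # forbidden on its own: no XPath provenance
}
print(validate_claim(claim))
# → ['source_url', 'retrieved_on', 'xpath', 'html_file', 'xpath_match_score']
```

A claim that passes (empty list) carries everything needed to re-check it against the archived HTML.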

Example - CORRECT (Verifiable)

```yaml
web_enrichment:
  claims:
    - claim_type: full_name
      claim_value: Historische Vereniging Nijeveen
      source_url: https://historischeverenigingnijeveen.nl/
      retrieved_on: "2025-11-29T12:28:00Z"
      xpath: /[document][1]/html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[6]
      html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
      xpath_match_score: 1.0
```

Example - WRONG (Fabricated - Must Be Removed)

```yaml
web_enrichment:
  claims:
    - claim_type: full_name
      claim_value: Historische Vereniging Nijeveen
      confidence: 0.95  # ← NO! This is meaningless without XPath
```

Forbidden: Confidence Scores Without XPath

NEVER use arbitrary confidence scores for web-extracted claims.

Confidence scores like 0.95, 0.90, 0.85 are meaningless because:

  1. There is NO methodology defining what these numbers mean
  2. They cannot be verified or reproduced
  3. They give false impression of rigor
  4. They mask the fact that claims may be fabricated

If a value appears in the HTML → xpath_match_score: 1.0
If a value does NOT appear in the HTML → REMOVE THE CLAIM


Website Archiving Workflow

Step 1: Archive the Website

Use Playwright to archive websites with JavaScript rendering:

```bash
python scripts/fetch_website_playwright.py <entry_number> <url>

# Example:
python scripts/fetch_website_playwright.py 0021 https://historischeverenigingnijeveen.nl/
```

This creates:

```
data/nde/enriched/entries/web/{entry_number}/{domain}/
├── index.html       # Raw HTML as received
├── rendered.html    # HTML after JS execution
├── content.md       # Markdown conversion
└── metadata.yaml    # XPath extractions for provenance
```
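The archive location is fully determined by the entry number and the URL's domain. A stdlib-only sketch of that mapping (the `archive_dir` helper and `ARCHIVE_ROOT` constant are illustrative; the actual fetching and JS rendering are done by Playwright in fetch_website_playwright.py):

```python
# Derive the archive directory for an entry/URL pair, mirroring the
# layout shown above. Assumes the data root used elsewhere in this doc.
from pathlib import Path
from urllib.parse import urlparse

ARCHIVE_ROOT = Path("data/nde/enriched/entries/web")
ARTIFACTS = ("index.html", "rendered.html", "content.md", "metadata.yaml")

def archive_dir(entry_number: str, url: str) -> Path:
    """Map an entry number and URL to its archive directory."""
    domain = urlparse(url).netloc
    return ARCHIVE_ROOT / entry_number / domain

print(archive_dir("0021", "https://historischeverenigingnijeveen.nl/").as_posix())
# → data/nde/enriched/entries/web/0021/historischeverenigingnijeveen.nl
```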

Step 2: Add XPath Provenance to Claims

Run the XPath migration script:

```bash
python scripts/add_xpath_provenance.py

# Or for specific entries:
python scripts/add_xpath_provenance.py --entries 0021,0022,0023
```

This script:

  1. Reads each entry's web_enrichment.claims
  2. Searches archived HTML for each claim value
  3. Adds xpath + html_file if found
  4. REMOVES claims that cannot be verified (stores in removed_unverified_claims)
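The core of step 2 is locating the element that contains a claim value and recording its XPath. What add_xpath_provenance.py actually uses is not shown here; the stdlib-only sketch below illustrates the idea with positional predicates like those in the example XPath above (void elements written without a closing tag, such as `<br>` or `<img>`, would need extra handling in a production version):

```python
from html.parser import HTMLParser

class XPathFinder(HTMLParser):
    """Record the XPath of the first element whose text contains `target`."""

    def __init__(self, target: str):
        super().__init__()
        self.target = target
        self.stack = []     # (tag, sibling_index) frames for open elements
        self.counts = [{}]  # per-depth tag counters for positional predicates
        self.xpath = None

    def handle_starttag(self, tag, attrs):
        siblings = self.counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        self.stack.append((tag, siblings[tag]))
        self.counts.append({})

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        if self.xpath is None and self.target in data:
            self.xpath = "/" + "/".join(f"{t}[{i}]" for t, i in self.stack)

def find_xpath(html: str, value: str):
    """Return an XPath for the first element containing `value`, or None."""
    finder = XPathFinder(value)
    finder.feed(html)
    return finder.xpath

html = ("<html><body><div>intro</div>"
        "<div><p>Historische Vereniging Nijeveen</p></div></body></html>")
print(find_xpath(html, "Historische Vereniging Nijeveen"))
# → /html[1]/body[1]/div[2]/p[1]
```

A `None` result is exactly the "cannot be verified" case: the claim is moved to removed_unverified_claims instead of receiving an xpath.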

Step 3: Audit Removed Claims

Check removed_unverified_claims in each entry file:

```yaml
removed_unverified_claims:
  - claim_type: phone
    claim_value: "+31 6 12345678"
    reason: "Value not found in archived HTML - likely fabricated"
    removed_on: "2025-11-29T14:30:00Z"
```

These claims were NOT in the HTML and should NOT be restored without proper sourcing.


Claim Types and Expected Sources

| Claim Type | Expected Source | Notes |
|------------|-----------------|-------|
| full_name | Page title, heading, logo text | Usually in `<h1>`, `<title>`, or a prominent `<div>` |
| description | Meta description, about text | Check `<meta name="description">` first |
| email | Contact page, footer | Often in `<a href="mailto:...">` |
| phone | Contact page, footer | May need normalization |
| address | Contact page, footer | Check for structured data too |
| social_media | Footer, contact page | Links to social platforms |
| opening_hours | Contact/visit page | May be in structured data |
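As an example of one row in the table, email claims can often be harvested directly from `mailto:` links. A minimal sketch (the function name is illustrative; a real extractor would also record the XPath of each matching `<a>` element, per the rules above):

```python
import re

def extract_mailto_emails(html: str) -> list[str]:
    """Collect unique addresses from mailto: links, dropping query parts."""
    return sorted(set(re.findall(r'href=["\']mailto:([^"\'?#]+)', html)))

html = '<footer><a href="mailto:info@example.org?subject=hi">Mail</a></footer>'
print(extract_mailto_emails(html))
# → ['info@example.org']
```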

XPath Matching Strategy

The add_xpath_provenance.py script uses this matching strategy:

  1. Exact match: Claim value appears exactly in element text
  2. Normalized match: After whitespace normalization
  3. Substring match: Claim value is substring of element text (score < 1.0)

Priority order for matching:

  1. rendered.html (after JS execution) - preferred
  2. index.html (raw HTML) - fallback
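The three-tier strategy can be sketched as a scoring function. Note the assumptions: treating a whitespace-normalized match as exact (score 1.0) and scoring substring matches by length coverage are illustrative choices; the rules above only mandate "exact → 1.0" and "substring → score < 1.0".

```python
import re

def _norm(text: str) -> str:
    """Collapse runs of whitespace and trim, for tier-2 matching."""
    return re.sub(r"\s+", " ", text).strip()

def xpath_match_score(claim_value: str, element_text: str) -> float:
    if claim_value == element_text:
        return 1.0                                  # 1. exact match
    if _norm(claim_value) == _norm(element_text):
        return 1.0                                  # 2. normalized match (assumed exact)
    if _norm(claim_value) and _norm(claim_value) in _norm(element_text):
        # 3. substring match: coverage ratio, always < 1.0 at this point
        return round(len(_norm(claim_value)) / len(_norm(element_text)), 2)
    return 0.0                                      # no match: remove the claim

print(xpath_match_score("Historische Vereniging Nijeveen",
                        "Historische  Vereniging\nNijeveen"))
# → 1.0
```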

Integration with LinkML Schema

The WebObservation class in the LinkML schema requires:

```yaml
# schemas/20251121/linkml/modules/classes/WebObservation.yaml
WebObservation:
  slots:
    - source_url        # Required
    - retrieved_on      # Required (timestamp)
    - xpath             # Required for claims
    - html_archive_path # Path to archived HTML
```

Rules for AI Agents

When Extracting Claims from Websites

  1. ALWAYS archive the website first using Playwright
  2. ALWAYS extract claims with XPath provenance using the archived HTML
  3. NEVER invent or infer claims not present in the HTML
  4. NEVER use confidence scores without XPath backing

When Processing Existing Claims

  1. Verify each claim against archived HTML
  2. Add XPath provenance to verified claims
  3. REMOVE fabricated claims that cannot be verified
  4. Document removed claims in removed_unverified_claims

When Reviewing Data Quality

  1. Claims with xpath + html_file = VERIFIED
  2. Claims with only confidence = SUSPECT (migrate or remove)
  3. Claims in removed_unverified_claims = FABRICATED (do not restore)
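The three review outcomes above map to a simple triage. A sketch (the status labels, `removed` flag, and function name are illustrative, not part of any existing script):

```python
def triage_claim(claim: dict, removed: bool = False) -> str:
    """Classify a claim per the data-quality review rules."""
    if removed:
        # Lives in removed_unverified_claims: do not restore.
        return "FABRICATED"
    if claim.get("xpath") and claim.get("html_file"):
        return "VERIFIED"
    # Bare confidence scores (or nothing at all): migrate to XPath or remove.
    return "SUSPECT"

print(triage_claim({"xpath": "/html[1]/body[1]",
                    "html_file": "web/0021/rendered.html"}))
# → VERIFIED
```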

Scripts Reference

| Script | Purpose |
|--------|---------|
| scripts/fetch_website_playwright.py | Archive a website with Playwright |
| scripts/add_xpath_provenance.py | Add XPath to claims, remove fabricated ones |
| scripts/batch_fetch_websites.py | Batch-archive multiple entries |

Version History

  • 2025-11-29: Initial version
      • Established the XPath provenance requirement
      • Replaced confidence scores with verifiable XPath pointers
      • Established the policy of removing fabricated claims