glam/.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md
2025-11-29 18:05:16 +01:00

# WebObservation Provenance Rules
## Core Principle: Every Claim MUST Have Verifiable Provenance
**If a claim allegedly came from a webpage, it MUST have an XPath pointer to the exact location in the archived HTML where that value appears. Claims without XPath provenance are considered FABRICATED and must be removed.**
This is not about "confidence" or "uncertainty" - it's about **verifiability**. Either the claim value exists in the HTML at a specific XPath, or it was hallucinated/fabricated by an LLM.
---
## Required Fields for WebObservation Claims
Every claim in `web_enrichment.claims` MUST have:
| Field | Required | Description |
|-------|----------|-------------|
| `claim_type` | YES | Type of claim (full_name, description, email, etc.) |
| `claim_value` | YES | The extracted value |
| `source_url` | YES | URL the claim was extracted from |
| `retrieved_on` | YES | ISO 8601 timestamp when page was archived |
| `xpath` | YES | XPath to the element containing this value |
| `html_file` | YES | Relative path to archived HTML file |
| `xpath_match_score` | YES | 1.0 for exact match, <1.0 for fuzzy match |
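A minimal sketch of checking these required fields on a claim dict (the helper name is hypothetical; field names follow the table above):

```python
# Required provenance fields, per the table above.
REQUIRED_FIELDS = {
    "claim_type", "claim_value", "source_url",
    "retrieved_on", "xpath", "html_file", "xpath_match_score",
}

def missing_provenance_fields(claim: dict) -> set:
    """Return the required fields that are absent from a claim."""
    return REQUIRED_FIELDS - claim.keys()
```

A claim is only acceptable when this returns an empty set.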
### Example - CORRECT (Verifiable)
```yaml
web_enrichment:
  claims:
    - claim_type: full_name
      claim_value: Historische Vereniging Nijeveen
      source_url: https://historischeverenigingnijeveen.nl/
      retrieved_on: "2025-11-29T12:28:00Z"
      xpath: /[document][1]/html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[6]
      html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
      xpath_match_score: 1.0
```
### Example - WRONG (Fabricated - Must Be Removed)
```yaml
web_enrichment:
  claims:
    - claim_type: full_name
      claim_value: Historische Vereniging Nijeveen
      confidence: 0.95  # ← NO! This is meaningless without XPath
```
---
## Forbidden: Confidence Scores Without XPath
**NEVER use arbitrary confidence scores for web-extracted claims.**
Confidence scores like `0.95`, `0.90`, `0.85` are meaningless because:
1. There is NO methodology defining what these numbers mean
2. They cannot be verified or reproduced
3. They give false impression of rigor
4. They mask the fact that claims may be fabricated
If a value appears in the HTML → add its XPath and `xpath_match_score: 1.0`
If a value does NOT appear in the HTML → **REMOVE THE CLAIM**
---
## Website Archiving Workflow
### Step 1: Archive the Website
Use Playwright to archive websites with JavaScript rendering:
```bash
python scripts/fetch_website_playwright.py <entry_number> <url>
# Example:
python scripts/fetch_website_playwright.py 0021 https://historischeverenigingnijeveen.nl/
```
This creates:
```
data/nde/enriched/entries/web/{entry_number}/{domain}/
├── index.html # Raw HTML as received
├── rendered.html # HTML after JS execution
├── content.md # Markdown conversion
└── metadata.yaml # XPath extractions for provenance
```
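The archive directory above is derived from the entry number and the URL's domain. A small sketch of that convention (helper name hypothetical; the real script may differ):

```python
from pathlib import Path
from urllib.parse import urlparse

def archive_dir(entry_number: str, url: str,
                root: str = "data/nde/enriched/entries/web") -> Path:
    """Compute the archive directory for an entry/URL pair."""
    domain = urlparse(url).netloc
    return Path(root) / entry_number / domain
```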
### Step 2: Add XPath Provenance to Claims
Run the XPath migration script:
```bash
python scripts/add_xpath_provenance.py
# Or for specific entries:
python scripts/add_xpath_provenance.py --entries 0021,0022,0023
```
This script:
1. Reads each entry's `web_enrichment.claims`
2. Searches archived HTML for each claim value
3. Adds `xpath` + `html_file` if found
4. **REMOVES claims that cannot be verified** (stores in `removed_unverified_claims`)
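Conceptually, step 2 walks the parsed HTML tree and records the path to the element containing the claim value. A standard-library sketch on a well-formed snippet (the real script presumably uses a proper HTML parser, and its matching rules may differ):

```python
import xml.etree.ElementTree as ET

def find_claim_xpath(html: str, claim_value: str):
    """Return an XPath-like pointer to the first element whose text
    contains claim_value, or None if the value is absent."""
    root = ET.fromstring(html)

    def walk(el, path):
        counts = {}  # per-tag sibling counters for [n] indices
        for child in el:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            child_path = f"{path}/{child.tag}[{counts[child.tag]}]"
            if child.text and claim_value in child.text:
                return child_path
            found = walk(child, child_path)
            if found:
                return found
        return None

    root_path = f"/{root.tag}[1]"
    if root.text and claim_value in root.text:
        return root_path
    return walk(root, root_path)
```

A `None` result corresponds to step 4: the claim cannot be verified and must be removed.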
### Step 3: Audit Removed Claims
Check `removed_unverified_claims` in each entry file:
```yaml
removed_unverified_claims:
  - claim_type: phone
    claim_value: "+31 6 12345678"
    reason: "Value not found in archived HTML - likely fabricated"
    removed_on: "2025-11-29T14:30:00Z"
```
These claims were NOT in the HTML and should NOT be restored without proper sourcing.
---
## Claim Types and Expected Sources
| Claim Type | Expected Source | Notes |
|------------|-----------------|-------|
| `full_name` | Page title, heading, logo text | Usually in `<h1>`, `<title>`, or prominent `<div>` |
| `description` | Meta description, about text | Check `<meta name="description">` first |
| `email` | Contact page, footer | Often in `<a href="mailto:...">` |
| `phone` | Contact page, footer | May need normalization |
| `address` | Contact page, footer | Check for structured data too |
| `social_media` | Footer, contact page | Links to social platforms |
| `opening_hours` | Contact/visit page | May be in structured data |
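For example, `email` claims typically come from `mailto:` links. A rough sketch of that extraction (regex and helper name are illustrative, not the production extractor):

```python
import re

def extract_mailto_emails(html: str) -> list:
    """Pull addresses out of href="mailto:..." attributes,
    dropping any ?subject=... query suffix."""
    return re.findall(r'href="mailto:([^"?]+)', html)
```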
---
## XPath Matching Strategy
The `add_xpath_provenance.py` script uses this matching strategy:
1. **Exact match**: Claim value appears exactly in element text
2. **Normalized match**: After whitespace normalization
3. **Substring match**: Claim value is substring of element text (score < 1.0)
Priority order for matching:
1. `rendered.html` (after JS execution) - preferred
2. `index.html` (raw HTML) - fallback
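The three-step matching strategy can be sketched as follows; note the fuzzy-score formula here is an assumption for illustration, not necessarily the script's actual rule:

```python
import re

def _norm(s: str) -> str:
    """Collapse runs of whitespace and trim."""
    return re.sub(r"\s+", " ", s).strip()

def match_score(claim_value: str, element_text: str):
    if claim_value == element_text:
        return 1.0                      # 1. exact match
    cv, et = _norm(claim_value), _norm(element_text)
    if cv == et:
        return 1.0                      # 2. normalized match
    if cv and cv in et:
        return len(cv) / len(et)        # 3. substring match, score < 1.0
    return None                         # no match: claim is unverified
```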
---
## Integration with LinkML Schema
The `WebObservation` class in the LinkML schema requires:
```yaml
# schemas/20251121/linkml/modules/classes/WebObservation.yaml
WebObservation:
  slots:
    - source_url        # Required
    - retrieved_on      # Required (timestamp)
    - xpath             # Required for claims
    - html_archive_path # Path to archived HTML
```
---
## Rules for AI Agents
### When Extracting Claims from Websites
1. **ALWAYS archive the website first** using Playwright
2. **ALWAYS extract claims with XPath provenance** using the archived HTML
3. **NEVER invent or infer claims** not present in the HTML
4. **NEVER use confidence scores** without XPath backing
### When Processing Existing Claims
1. **Verify each claim** against archived HTML
2. **Add XPath provenance** to verified claims
3. **REMOVE fabricated claims** that cannot be verified
4. **Document removed claims** in `removed_unverified_claims`
### When Reviewing Data Quality
1. Claims with `xpath` + `html_file` = **VERIFIED**
2. Claims with only `confidence` = **SUSPECT** (migrate or remove)
3. Claims in `removed_unverified_claims` = **FABRICATED** (do not restore)
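The triage above can be sketched as a simple classifier (function name and labels are illustrative):

```python
def claim_status(claim: dict) -> str:
    """Classify a claim per the data-quality review rules."""
    if "xpath" in claim and "html_file" in claim:
        return "VERIFIED"
    if "confidence" in claim:
        return "SUSPECT"    # migrate to XPath provenance or remove
    return "UNVERIFIED"
```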
---
## Scripts Reference
| Script | Purpose |
|--------|---------|
| `scripts/fetch_website_playwright.py` | Archive website with Playwright |
| `scripts/add_xpath_provenance.py` | Add XPath to claims, remove fabricated |
| `scripts/batch_fetch_websites.py` | Batch archive multiple entries |
---
## Version History
- **2025-11-29**: Initial version - established XPath provenance requirement
  - Replaced confidence scores with verifiable XPath pointers
  - Established policy of removing fabricated claims