# WebObservation Provenance Rules

## Core Principle: Every Claim MUST Have Verifiable Provenance

**If a claim allegedly came from a webpage, it MUST have an XPath pointer to the exact location in the archived HTML where that value appears. Claims without XPath provenance are considered FABRICATED and must be removed.**

This is not about "confidence" or "uncertainty" - it's about **verifiability**. Either the claim value exists in the HTML at a specific XPath, or it was hallucinated/fabricated by an LLM.

---

## Required Fields for WebObservation Claims

Every claim in `web_enrichment.claims` MUST have:

| Field | Required | Description |
|-------|----------|-------------|
| `claim_type` | YES | Type of claim (`full_name`, `description`, `email`, etc.) |
| `claim_value` | YES | The extracted value |
| `source_url` | YES | URL the claim was extracted from |
| `retrieved_on` | YES | ISO 8601 timestamp of when the page was archived |
| `xpath` | YES | XPath to the element containing this value |
| `html_file` | YES | Relative path to the archived HTML file |
| `xpath_match_score` | YES | 1.0 for an exact match, <1.0 for a fuzzy match |


### Example - CORRECT (Verifiable)

```yaml
web_enrichment:
  claims:
    - claim_type: full_name
      claim_value: Historische Vereniging Nijeveen
      source_url: https://historischeverenigingnijeveen.nl/
      retrieved_on: "2025-11-29T12:28:00Z"
      xpath: /[document][1]/html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[6]
      html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
      xpath_match_score: 1.0
```


### Example - WRONG (Fabricated - Must Be Removed)

```yaml
web_enrichment:
  claims:
    - claim_type: full_name
      claim_value: Historische Vereniging Nijeveen
      confidence: 0.95  # ← NO! This is meaningless without XPath
```

---

## Forbidden: Confidence Scores Without XPath

**NEVER use arbitrary confidence scores for web-extracted claims.**

Confidence scores like `0.95`, `0.90`, `0.85` are meaningless because:

1. There is NO methodology defining what these numbers mean
2. They cannot be verified or reproduced
3. They give a false impression of rigor
4. They mask the fact that claims may be fabricated

If a value appears in the HTML → `xpath_match_score: 1.0`

If a value does NOT appear in the HTML → **REMOVE THE CLAIM**

---

## Website Archiving Workflow

### Step 1: Archive the Website

Use Playwright to archive websites with JavaScript rendering:

```bash
python scripts/fetch_website_playwright.py <entry_number> <url>

# Example:
python scripts/fetch_website_playwright.py 0021 https://historischeverenigingnijeveen.nl/
```

This creates:

```
data/nde/enriched/entries/web/{entry_number}/{domain}/
├── index.html      # Raw HTML as received
├── rendered.html   # HTML after JS execution
├── content.md      # Markdown conversion
└── metadata.yaml   # XPath extractions for provenance
```
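
Conceptually, the fetch step amounts to loading the page in a headless browser and saving both the raw response and the post-JavaScript DOM. A minimal sketch of that idea (the real `fetch_website_playwright.py` is not shown here; `archive_paths` and `archive_site` are illustrative names, and the directory layout follows the tree above):

```python
from pathlib import Path
from urllib.parse import urlparse


def archive_paths(entry_number: str, url: str,
                  base: str = "data/nde/enriched/entries/web") -> Path:
    """Target directory for an archived site, following the layout above."""
    return Path(base) / entry_number / urlparse(url).netloc


def archive_site(entry_number: str, url: str) -> Path:
    """Save raw and JS-rendered HTML for one site using Playwright."""
    # Imported lazily so the path helper works without a browser installed.
    from playwright.sync_api import sync_playwright

    target = archive_paths(entry_number, url)
    target.mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        response = page.goto(url, wait_until="networkidle")
        # index.html: the HTML exactly as the server returned it
        (target / "index.html").write_text(response.text(), encoding="utf-8")
        # rendered.html: the DOM after JavaScript has run
        (target / "rendered.html").write_text(page.content(), encoding="utf-8")
        browser.close()
    return target
```

Keeping both files matters for provenance: an XPath that only resolves in `rendered.html` proves the value was injected by JavaScript, not present in the raw response.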

### Step 2: Add XPath Provenance to Claims

Run the XPath migration script:

```bash
python scripts/add_xpath_provenance.py

# Or for specific entries:
python scripts/add_xpath_provenance.py --entries 0021,0022,0023
```

This script:

1. Reads each entry's `web_enrichment.claims`
2. Searches the archived HTML for each claim value
3. Adds `xpath` + `html_file` if the value is found
4. **REMOVES claims that cannot be verified** (storing them in `removed_unverified_claims`)

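The verify-or-remove logic of steps 2-4 can be sketched as follows. This is a simplified illustration, not the actual script: it only checks that the value occurs in the HTML text, whereas the real matching also records the XPath of the containing element.

```python
from datetime import datetime, timezone


def verify_claims(claims: list, html_text: str, html_file: str):
    """Split claims into verified (value found in archived HTML) and removed."""
    verified, removed = [], []
    for claim in claims:
        value = str(claim.get("claim_value", ""))
        if value and value in html_text:
            # Simplified: the real script also attaches the element's XPath.
            verified.append({**claim, "html_file": html_file,
                             "xpath_match_score": 1.0})
        else:
            removed.append({
                "claim_type": claim.get("claim_type"),
                "claim_value": value,
                "reason": "Value not found in archived HTML - likely fabricated",
                "removed_on": datetime.now(timezone.utc)
                                      .strftime("%Y-%m-%dT%H:%M:%SZ"),
            })
    return verified, removed
```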

### Step 3: Audit Removed Claims

Check `removed_unverified_claims` in each entry file:

```yaml
removed_unverified_claims:
  - claim_type: phone
    claim_value: "+31 6 12345678"
    reason: "Value not found in archived HTML - likely fabricated"
    removed_on: "2025-11-29T14:30:00Z"
```

These claims were NOT in the HTML and should NOT be restored without proper sourcing.

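For auditing at scale, removed claims can be tallied by type across entries. A small sketch (assumes the entry files are already loaded as dicts; the YAML-loading step is omitted):

```python
from collections import Counter


def tally_removed(entries: list) -> Counter:
    """Count removed_unverified_claims by claim_type across entry files."""
    counts = Counter()
    for entry in entries:
        for claim in entry.get("removed_unverified_claims", []):
            counts[claim["claim_type"]] += 1
    return counts
```

A claim type that dominates the tally (e.g. `phone`) is a signal that extraction for that type is prone to fabrication and deserves a closer look.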
---

## Claim Types and Expected Sources

| Claim Type | Expected Source | Notes |
|------------|-----------------|-------|
| `full_name` | Page title, heading, logo text | Usually in `<h1>`, `<title>`, or a prominent `<div>` |
| `description` | Meta description, about text | Check `<meta name="description">` first |
| `email` | Contact page, footer | Often in `<a href="mailto:...">` |
| `phone` | Contact page, footer | May need normalization |
| `address` | Contact page, footer | Check for structured data too |
| `social_media` | Footer, contact page | Links to social platforms |
| `opening_hours` | Contact/visit page | May be in structured data |

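On the `phone` note above: normalization should apply only when *comparing* values, since the stored `claim_value` must stay exactly as it appears in the HTML or the XPath match breaks. A hypothetical comparison-side normalizer:

```python
import re


def normalize_phone(raw: str) -> str:
    """Strip separators for comparison, keeping only a leading '+'."""
    # \D matches any non-digit; the lookahead spares a '+' at position 0.
    return re.sub(r"(?!^\+)\D", "", raw)
```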
---

## XPath Matching Strategy

The `add_xpath_provenance.py` script uses this matching strategy:

1. **Exact match**: the claim value appears exactly in the element text
2. **Normalized match**: the values match after whitespace normalization
3. **Substring match**: the claim value is a substring of the element text (score < 1.0)

Priority order for matching:

1. `rendered.html` (after JS execution) - preferred
2. `index.html` (raw HTML) - fallback

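The three tiers can be expressed as a scoring function. A sketch under stated assumptions: the script's actual fuzzy-score formula is not documented here, so the substring score below (length ratio) is illustrative only.

```python
import re
from typing import Optional


def _norm(s: str) -> str:
    """Collapse runs of whitespace and trim."""
    return re.sub(r"\s+", " ", s).strip()


def match_score(claim_value: str, element_text: str) -> Optional[float]:
    """Score a claim value against one element's text; None means no match."""
    if claim_value == element_text:
        return 1.0                                    # tier 1: exact match
    if _norm(claim_value) == _norm(element_text):
        return 1.0                                    # tier 2: normalized match
    if _norm(claim_value) and _norm(claim_value) in _norm(element_text):
        # Tier 3: substring match, scored < 1.0 (ratio is an assumption).
        return len(_norm(claim_value)) / len(_norm(element_text))
    return None
```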
---

## Integration with LinkML Schema

The `WebObservation` class in the LinkML schema requires:

```yaml
# schemas/20251121/linkml/modules/classes/WebObservation.yaml
WebObservation:
  slots:
    - source_url         # Required
    - retrieved_on       # Required (timestamp)
    - xpath              # Required for claims
    - html_archive_path  # Path to archived HTML
```
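Claims can be screened against the required fields before acceptance. A minimal sketch using the claim-level field list from the required-fields table above (note the schema snippet names the archive path `html_archive_path`, while claim dicts use `html_file`; the check below follows the table):

```python
REQUIRED_CLAIM_FIELDS = (
    "claim_type", "claim_value", "source_url",
    "retrieved_on", "xpath", "html_file", "xpath_match_score",
)


def missing_fields(claim: dict) -> list:
    """Required WebObservation fields absent from a claim dict."""
    return [f for f in REQUIRED_CLAIM_FIELDS if f not in claim]
```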
---

## Rules for AI Agents

### When Extracting Claims from Websites

1. **ALWAYS archive the website first** using Playwright
2. **ALWAYS extract claims with XPath provenance** from the archived HTML
3. **NEVER invent or infer claims** not present in the HTML
4. **NEVER use confidence scores** without XPath backing

### When Processing Existing Claims

1. **Verify each claim** against the archived HTML
2. **Add XPath provenance** to verified claims
3. **REMOVE fabricated claims** that cannot be verified
4. **Document removed claims** in `removed_unverified_claims`

### When Reviewing Data Quality

1. Claims with `xpath` + `html_file` = **VERIFIED**
2. Claims with only `confidence` = **SUSPECT** (migrate or remove)
3. Claims in `removed_unverified_claims` = **FABRICATED** (do not restore)

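The review rules collapse into a small triage helper; `in_removed_list` flags claims taken from `removed_unverified_claims` (the function name and signature are illustrative):

```python
def triage(claim: dict, in_removed_list: bool = False) -> str:
    """Label a claim per the data-quality review rules above."""
    if in_removed_list:
        return "FABRICATED"   # do not restore
    if claim.get("xpath") and claim.get("html_file"):
        return "VERIFIED"
    return "SUSPECT"          # no XPath provenance: migrate or remove
```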
---

## Scripts Reference

| Script | Purpose |
|--------|---------|
| `scripts/fetch_website_playwright.py` | Archive a website with Playwright |
| `scripts/add_xpath_provenance.py` | Add XPath to claims; remove fabricated ones |
| `scripts/batch_fetch_websites.py` | Batch-archive multiple entries |

---

## Version History

- **2025-11-29**: Initial version - established the XPath provenance requirement
  - Replaced confidence scores with verifiable XPath pointers
  - Established the policy of removing fabricated claims