# Web Claim Provenance Schema

**Created**: 2025-12-28
**Updated**: 2025-12-28
**Status**: Active Rule
**Supersedes**: Basic web_claims structure

## Purpose

This schema defines a comprehensive provenance structure for web-sourced claims that:

1. **Enables automated re-verification** - Claims can be automatically re-checked
2. **Ensures content integrity** - Hashes detect if source content changed
3. **Supports temporal tracking** - Links to web archives for historical access
4. **Aligns with FAIR principles** - Findable, Accessible, Interoperable, Reusable
5. **Maps to linked data standards** - PROV-O, Schema.org ClaimReview, W3C Web Annotation
6. **Handles modern web applications** - SPAs, AJAX, JavaScript-rendered content

## Standards Alignment

| Standard | Purpose | Reference |
|----------|---------|-----------|
| W3C PROV-O | Provenance ontology | https://www.w3.org/TR/prov-o/ |
| W3C Web Annotation Data Model | Selectors and states | https://www.w3.org/TR/annotation-model/ |
| Schema.org ClaimReview | Fact-checking markup | https://schema.org/ClaimReview |
| RFC 7089 Memento | Web archival access | https://tools.ietf.org/html/rfc7089 |
| W3C SRI | Subresource integrity | https://www.w3.org/TR/sri/ |
| W3C Text Fragments | URL text targeting | https://wicg.github.io/scroll-to-text-fragment/ |
| WAI-ARIA | Accessibility selectors | https://www.w3.org/WAI/ARIA/apg/ |
| Playwright Locators | Modern selector strategies | https://playwright.dev/docs/locators |

---

## Mandatory vs Recommended Elements

### MANDATORY Elements (Required for FAIR Compliance)

Every `web_claim` object **MUST** have these elements:

| Element | Standard | Purpose |
|---------|----------|---------|
| `content_hash` | W3C SRI | SHA-256 hash of `extracted_text` for integrity verification |
| `text_fragment` | W3C Text Fragments | URL `#:~:text=...` for direct linking to source text |
| `archive.memento_uri` | RFC 7089 Memento | Wayback Machine archived snapshot URL |
| `prov.wasDerivedFrom` | W3C PROV-O | Source URL for linked data tracing |
| `verification.status` | - | Claim freshness (`verified`/`stale`/`failed`/`pending`/`archived`) |
| `w3c_selectors` | W3C Web Annotation | At least 2 selector types for redundancy |

### RECOMMENDED Elements (For Full Provenance)

| Element | Standard | Purpose |
|---------|----------|---------|
| `aria_selector` | WAI-ARIA | Accessibility-based element identification |
| `rendering_context` | - | JS framework detection and execution state |
| `interaction_sequence` | - | Actions taken to reach dynamic content |
| `network_context` | - | AJAX/API request details for dynamic content |
| `user_agent_context` | - | Browser and viewport information |

---

## Complete Web Claim Structure

```json
{
  "claim_id": "string (UUID recommended)",

  // === CLAIM CONTENT ===
  "claim_type": "string (role|tenure|education|biography|contact|etc)",
  "claim_value": "string (the extracted fact)",
  "extracted_text": "string (full text from which claim derived)",
  "language": "string (BCP 47 code: nl, en, fr, etc)",

  // === SOURCE IDENTIFICATION ===
  "source_url": "string (URL)",
  "canonical_url": "string (URL, if different from source_url)",
  "content_type": "string (MIME type: text/html)",

  // === W3C WEB ANNOTATION SELECTORS (MANDATORY - at least 2) ===
  "w3c_selectors": [
    {
      "type": "CssSelector",
      "value": "string (CSS selector path)"
    },
    {
      "type": "XPathSelector",
      "value": "string (XPath expression)"
    },
    {
      "type": "TextQuoteSelector",
      "exact": "string (the matched text)",
      "prefix": "string (text before match)",
      "suffix": "string (text after match)"
    },
    {
      "type": "TextPositionSelector",
      "start": "integer (character offset start)",
      "end": "integer (character offset end)"
    },
    {
      "type": "FragmentSelector",
      "value": "string (fragment identifier)",
      "conformsTo": "string (specification URL)"
    }
  ],

  // === ACCESSIBILITY SELECTORS (RECOMMENDED) ===
  "aria_selector": {
    "role": "string (ARIA role: article, heading, button, etc)",
    "name": "string (accessible name)",
    "label": "string (aria-label value)",
    "testid": "string (data-testid attribute)"
  },

  // === TEXT FRAGMENT (MANDATORY) ===
  "text_fragment": "string (W3C Text Fragment: #:~:text=...)",

  // === LEGACY SELECTORS (for backwards compatibility) ===
  "css_selector": "string (CSS selector path)",
  "xpath_selector": "string (XPath expression)",

  // === TEMPORAL METADATA ===
  "published_date": "string (ISO 8601, from article:published_time)",
  "modified_date": "string (ISO 8601, from article:modified_time)",
  "author": "string or object (content creator)",

  // === RETRIEVAL METADATA ===
  "retrieval_timestamp": "string (ISO 8601, when we fetched)",
  "retrieval_agent": "string (tool identifier: opencode/claude-sonnet-4)",
  "extraction_method": "string (MCP tool: web-reader_webReader)",

  // === CONTENT INTEGRITY (MANDATORY) ===
  "content_hash": {
    "algorithm": "sha256",
    "value": "string (SRI format: sha256-<base64 digest of extracted_text>)",
    "scope": "extracted_text"
  },
  "http_etag": "string (ETag header from server response)",
  "http_last_modified": "string (Last-Modified header)",

  // === WEB ARCHIVAL (MANDATORY - RFC 7089 Memento) ===
  "archive": {
    "memento_uri": "string (archived snapshot URL)",
    "memento_datetime": "string (ISO 8601, archival datetime)",
    "timemap_uri": "string (TimeMap URL for all snapshots)",
    "timegate_uri": "string (TimeGate for datetime negotiation)",
    "archive_source": "string (web.archive.org, archive.today, etc)"
  },

  // === PROV-O ALIGNMENT (MANDATORY - wasDerivedFrom) ===
  "prov": {
    "wasAttributedTo": {
      "@type": "prov:Agent",
      "name": "string",
      "url": "string"
    },
    "generatedAtTime": "string (ISO 8601)",
    "wasDerivedFrom": "string (source URL or entity reference)",
    "wasGeneratedBy": {
      "@type": "prov:Activity",
      "name": "web_extraction",
      "used": "string (extraction_method)"
    }
  },

  // === SCHEMA.ORG CLAIMREVIEW ALIGNMENT ===
  "schema_org": {
    "@type": "Claim",
    "claimReviewed": "string (exact claim text)",
    "appearance": {
      "@type": "CreativeWork",
      "url": "string (source_url)",
      "datePublished": "string (published_date)"
    },
    "author": {
      "@type": "Person",
      "name": "string"
    }
  },

  // === VERIFICATION SUPPORT (MANDATORY - status) ===
  "verification": {
    "status": "string (verified|stale|failed|pending|archived)",
    "last_verified": "string (ISO 8601)",
    "next_verification_due": "string (ISO 8601)",
    "confidence_score": "number (0.0-1.0)",
    "verification_history": [
      {
        "timestamp": "string (ISO 8601)",
        "status": "string",
        "content_hash": "string",
        "notes": "string"
      }
    ]
  },

  // === RENDERING CONTEXT (RECOMMENDED - for SPAs) ===
  "rendering_context": {
    "framework_detected": "string (React|Vue|Angular|Svelte|vanilla|unknown)",
    "framework_version": "string (version if detectable)",
    "hydration_complete": "boolean",
    "client_side_rendered": "boolean",
    "server_side_rendered": "boolean",
    "js_execution_required": "boolean",
    "wait_condition": "string (load|domcontentloaded|networkidle)",
    "wait_selector": "string (selector waited for)",
    "wait_duration_ms": "integer (milliseconds waited)"
  },

  // === DOM STATE (RECOMMENDED - W3C Web Annotation TimeState) ===
  "dom_state": {
    "type": "TimeState",
    "sourceDate": "string (ISO 8601)",
    "dom_content_loaded": "boolean",
    "load_event_fired": "boolean",
    "mutation_observer_stable": "boolean",
    "scroll_position": {
      "x": "integer",
      "y": "integer"
    },
    "viewport": {
      "width": "integer",
      "height": "integer"
    }
  },

  // === INTERACTION SEQUENCE (RECOMMENDED - for dynamic content) ===
  "interaction_sequence": [
    {
      "action": "string (navigate|click|scroll|wait|type|extract)",
      "target": "string (URL or selector)",
      "value": "string (input value if applicable)",
      "condition": "string (wait condition if applicable)",
      "duration_ms": "integer (action duration)",
      "timestamp": "string (ISO 8601)"
    }
  ],

  // === NETWORK CONTEXT (RECOMMENDED - for AJAX content) ===
  "network_context": {
    "request_url": "string (API endpoint if different from page)",
    "request_method": "string (GET|POST|etc)",
    "request_headers": "object (relevant headers)",
    "response_status": "integer (HTTP status code)",
    "response_content_type": "string (MIME type)",
    "response_hash": "string (hash of API response)",
    "xhr_intercepted": "boolean",
    "api_version": "string (API version if known)"
  },

  // === USER AGENT CONTEXT (RECOMMENDED) ===
  "user_agent_context": {
    "user_agent": "string (full UA string)",
    "browser": "string (Chrome|Firefox|Chromium|etc)",
    "browser_version": "string",
    "headless": "boolean",
    "viewport_width": "integer",
    "viewport_height": "integer",
    "device_scale_factor": "number",
    "mobile": "boolean",
    "locale": "string (BCP 47)",
    "timezone": "string (IANA timezone)"
  },

  // === FREE-FORM ===
  "notes": "string (optional context or caveats)"
}
```
---

## Minimal Compliant Web Claim

For basic FAIR compliance, every web claim **MUST** have:

```json
{
  "claim_type": "MANDATORY",
  "claim_value": "MANDATORY",
  "source_url": "MANDATORY",
  "extracted_text": "MANDATORY",
  "retrieval_timestamp": "MANDATORY",
  "retrieval_agent": "MANDATORY",
  "extraction_method": "MANDATORY",

  "content_hash": "MANDATORY (integrity)",
  "text_fragment": "MANDATORY (direct linking)",
  "archive": {
    "memento_uri": "MANDATORY (archival fallback)"
  },
  "prov": {
    "wasDerivedFrom": "MANDATORY (provenance)"
  },
  "verification": {
    "status": "MANDATORY (freshness)"
  },
  "w3c_selectors": "MANDATORY (at least 2 selector types)"
}
```
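
The checklist above can be checked mechanically before a claim is stored. The following is a minimal sketch, not part of the schema itself: the names `MANDATORY_PATHS`, `_lookup`, and `find_missing_mandatory_fields` and the dotted-path notation are assumptions made for illustration.

```python
from typing import Any

# Dotted paths of the mandatory elements from the minimal claim above
# (hypothetical constant, not defined by the schema).
MANDATORY_PATHS = [
    "claim_type", "claim_value", "source_url", "extracted_text",
    "retrieval_timestamp", "retrieval_agent", "extraction_method",
    "content_hash", "text_fragment", "archive.memento_uri",
    "prov.wasDerivedFrom", "verification.status",
]


def _lookup(claim: dict, dotted_path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if absent."""
    node: Any = claim
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def find_missing_mandatory_fields(claim: dict) -> list[str]:
    """Return the mandatory elements that are absent or empty in a claim."""
    missing = [path for path in MANDATORY_PATHS if not _lookup(claim, path)]

    # w3c_selectors must contain at least two distinct selector types.
    selectors = claim.get("w3c_selectors") or []
    selector_types = {s.get("type") for s in selectors if isinstance(s, dict)}
    selector_types.discard(None)
    if len(selector_types) < 2:
        missing.append("w3c_selectors (at least 2 selector types)")
    return missing
```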
---

## W3C Web Annotation Selector Types

The W3C Web Annotation Data Model defines several selector types. Use **at least 2** for redundancy:

### 1. CssSelector
```json
{
  "type": "CssSelector",
  "value": "#main-article > p.intro"
}
```

### 2. XPathSelector
```json
{
  "type": "XPathSelector",
  "value": "/html/body/main/article/div/p[1]"
}
```

### 3. TextQuoteSelector (Highly Recommended)
Most resilient to DOM changes:
```json
{
  "type": "TextQuoteSelector",
  "exact": "Arjen Dijkstra wordt de nieuwe directeur",
  "prefix": "15 juli 2022 - ",
  "suffix": " van Fries historisch"
}
```

### 4. TextPositionSelector
```json
{
  "type": "TextPositionSelector",
  "start": 412,
  "end": 795
}
```

### 5. FragmentSelector
```json
{
  "type": "FragmentSelector",
  "value": "section2",
  "conformsTo": "http://tools.ietf.org/rfc/rfc3236"
}
```

### 6. DataPositionSelector (for binary data)
```json
{
  "type": "DataPositionSelector",
  "start": 4096,
  "end": 4104
}
```

### 7. RangeSelector (for spanning selections)
```json
{
  "type": "RangeSelector",
  "startSelector": {
    "type": "XPathSelector",
    "value": "//table[1]/tr[1]/td[2]"
  },
  "endSelector": {
    "type": "XPathSelector",
    "value": "//table[1]/tr[1]/td[4]"
  }
}
```
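
To illustrate why a TextQuoteSelector survives DOM changes, the sketch below resolves one against plain page text: it looks for `exact` and accepts a hit only when `prefix` and `suffix` also match. The function name `resolve_text_quote` is hypothetical, and the matching is deliberately literal (no whitespace or Unicode normalization), which a production resolver would probably need.

```python
from typing import Optional


def resolve_text_quote(text: str, selector: dict) -> Optional[tuple[int, int]]:
    """Return (start, end) offsets of the quoted text, or None if not found."""
    exact = selector["exact"]
    prefix = selector.get("prefix", "")
    suffix = selector.get("suffix", "")

    start = text.find(exact)
    while start != -1:
        end = start + len(exact)
        # Accept a candidate only if the surrounding context also matches.
        if text[:start].endswith(prefix) and text[end:].startswith(suffix):
            return start, end
        # Otherwise keep searching for a later occurrence of `exact`.
        start = text.find(exact, start + 1)
    return None
```

The returned offsets have the same shape as the `start`/`end` pair of a TextPositionSelector, so one selector can be used to refresh the other.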
---

## Accessibility-First Selectors (ARIA)

Modern web applications use semantic HTML and ARIA. These selectors are often **more stable** than CSS/XPath:

```json
"aria_selector": {
  "role": "article",
  "name": "Nieuws artikel over directeurswisseling",
  "label": "Article content",
  "testid": "news-article-content",
  "description": "Artikel over de nieuwe directeur van Tresoar"
}
```

**Playwright Locator Equivalents:**
- `page.getByRole('article', { name: 'Nieuws artikel' })`
- `page.getByLabel('Article content')`
- `page.getByTestId('news-article-content')`
- `page.getByText('Arjen Dijkstra wordt')`

---

## Content Hash Generation

Generate SHA-256 hash of `extracted_text` for integrity verification:

```python
import hashlib
import base64

def generate_content_hash(text: str) -> dict:
    """Generate SHA-256 hash for content integrity (W3C SRI format)."""
    hash_bytes = hashlib.sha256(text.encode('utf-8')).digest()
    hash_b64 = base64.b64encode(hash_bytes).decode('ascii')
    return {
        "algorithm": "sha256",
        "value": f"sha256-{hash_b64}",
        "scope": "extracted_text"
    }

# Example
text = "Arjen Dijkstra wordt de nieuwe directeur van Tresoar."
hash_obj = generate_content_hash(text)
# Result: {"algorithm": "sha256", "value": "sha256-abc123...", "scope": "extracted_text"}
```
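
The counterpart of hash generation is the check performed at re-verification time. A minimal sketch follows; `verify_content_hash` is a hypothetical helper name, and only the `sha256` algorithm used in this document is handled. Because the hash covers the exact UTF-8 bytes, even a whitespace-only change counts as a mismatch.

```python
import base64
import hashlib


def verify_content_hash(text: str, content_hash: dict) -> bool:
    """Return True if `text` still matches a stored content_hash object."""
    if content_hash.get("algorithm") != "sha256":
        raise ValueError("only sha256 is handled in this sketch")

    # Recompute the SRI-style value exactly as it was generated.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    expected = "sha256-" + base64.b64encode(digest).decode("ascii")
    return expected == content_hash["value"]
```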
---

## Text Fragment URL Generation

Create W3C Text Fragment URLs for direct linking:

```python
from urllib.parse import quote

def generate_text_fragment(source_url: str, text: str) -> str:
    """Generate URL with text fragment for direct linking."""
    # Keep the fragment to roughly 100 chars, but cut at a word boundary:
    # text fragments match whole words, so a word cut mid-way would not match
    fragment_text = text if len(text) <= 100 else text[:100].rsplit(" ", 1)[0]
    # quote() leaves "-" unescaped, but "-" is a delimiter in the text
    # fragment syntax (prefix-, and ,-suffix), so escape it explicitly
    encoded = quote(fragment_text, safe="").replace("-", "%2D")
    return f"{source_url}#:~:text={encoded}"

# Example
url = "https://example.com/article"
text = "Arjen Dijkstra wordt de nieuwe directeur"
fragment_url = generate_text_fragment(url, text)
# Result: "https://example.com/article#:~:text=Arjen%20Dijkstra%20wordt%20de%20nieuwe%20directeur"
```
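
When the quoted text occurs more than once on a page, the text fragment syntax also accepts context terms, in the form `#:~:text=prefix-,exact,-suffix`. The sketch below derives such a directive from a TextQuoteSelector-style dict. The helper names are hypothetical, and stripping surrounding whitespace from each term is an assumption of this sketch rather than something the schema prescribes.

```python
from urllib.parse import quote


def _encode_fragment_term(term: str) -> str:
    """Percent-encode one term, including the '-' that quote() leaves as-is."""
    return quote(term.strip(), safe="").replace("-", "%2D")


def text_fragment_from_quote(selector: dict) -> str:
    """Build a '#:~:text=' directive from a TextQuoteSelector-style dict."""
    parts = []
    if selector.get("prefix", "").strip():
        # A prefix term is marked by a trailing '-'.
        parts.append(_encode_fragment_term(selector["prefix"]) + "-")
    parts.append(_encode_fragment_term(selector["exact"]))
    if selector.get("suffix", "").strip():
        # A suffix term is marked by a leading '-'.
        parts.append("-" + _encode_fragment_term(selector["suffix"]))
    return "#:~:text=" + ",".join(parts)
```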
---

## Web Archive (Memento) Integration

Request an archived version via the Wayback Machine:

```python
from datetime import datetime
from typing import Optional

import requests

def get_memento_info(url: str, target_date: Optional[datetime] = None) -> dict:
    """Get Memento (archived) version info from Wayback Machine."""

    # Memento TimeGate: redirects to a snapshot of the URL
    timegate = f"https://web.archive.org/web/{url}"

    # TimeMap: lists all archived versions
    timemap = f"https://web.archive.org/web/timemap/link/{url}"

    if target_date:
        # A dated URI resolves to the snapshot closest to target_date
        date_str = target_date.strftime("%Y%m%d")
        memento_uri = f"https://web.archive.org/web/{date_str}/{url}"
    else:
        # Without a date, the undated URI redirects to the most recent snapshot.
        # (The "*" wildcard returns the calendar overview, not a snapshot.)
        memento_uri = timegate

    return {
        "memento_uri": memento_uri,
        "timemap_uri": timemap,
        "timegate_uri": timegate,
        "archive_source": "web.archive.org"
    }

def check_wayback_availability(url: str) -> dict:
    """Check if URL is available in Wayback Machine."""
    response = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},  # let requests URL-encode the target URL
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()

    snapshot = data.get("archived_snapshots", {}).get("closest")
    if snapshot:
        # The API returns a 14-digit timestamp (YYYYMMDDhhmmss, UTC);
        # convert it to the ISO 8601 form used by memento_datetime
        captured = datetime.strptime(snapshot["timestamp"], "%Y%m%d%H%M%S")
        return {
            "available": True,
            "memento_uri": snapshot["url"],
            "memento_datetime": captured.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "archive_source": "web.archive.org"
        }
    return {"available": False}
```
---

## Rendering Context Detection

For Single Page Applications (SPAs), detect the framework and rendering state:

```python
def detect_rendering_context(page) -> dict:
    """Detect JS framework and rendering state using Playwright (sync API)."""

    context = {
        "framework_detected": "unknown",
        "js_execution_required": False,
        "client_side_rendered": False,
        "server_side_rendered": True
    }

    # Shared JS helper: attribute selectors cannot match a name *prefix*
    # (e.g. data-v-7ba5bd90), so scan attribute names instead
    has_attr_prefix = """const hasAttrPrefix = (prefix) =>
        Array.from(document.querySelectorAll('*')).some(el =>
            el.getAttributeNames().some(name => name.startsWith(prefix)));"""

    # Check for React (the DevTools hook can also be injected by a browser
    # extension, so treat a match as a hint rather than proof)
    react_check = page.evaluate("""() => {
        return !!(window.__REACT_DEVTOOLS_GLOBAL_HOOK__ ||
                  document.querySelector('[data-reactroot]') ||
                  document.querySelector('[data-react-helmet]'))
    }""")
    if react_check:
        context["framework_detected"] = "React"
        context["js_execution_required"] = True

    # Check for Vue (scoped-style attributes start with data-v-)
    vue_check = page.evaluate("() => {" + has_attr_prefix + """
        return !!(window.__VUE__ ||
                  document.querySelector('#__nuxt') ||
                  hasAttrPrefix('data-v-'))
    }""")
    if vue_check:
        context["framework_detected"] = "Vue"
        context["js_execution_required"] = True

    # Check for Angular (view-encapsulation attributes start with _ngcontent-)
    angular_check = page.evaluate("() => {" + has_attr_prefix + """
        return !!(window.ng ||
                  document.querySelector('[ng-version]') ||
                  hasAttrPrefix('_ngcontent-'))
    }""")
    if angular_check:
        context["framework_detected"] = "Angular"
        context["js_execution_required"] = True

    # Rough heuristic: page.content() returns the current DOM serialization,
    # and a very small document suggests content is injected client-side
    rendered_html = page.content()
    context["client_side_rendered"] = len(rendered_html) < 5000
    context["server_side_rendered"] = not context["client_side_rendered"]

    return context
```
---

## Interaction Sequence Recording

For dynamic content that requires user interaction:

```python
from datetime import datetime, timezone

def record_interaction_sequence(actions: list) -> list:
    """Record a sequence of interactions for provenance."""

    sequence = []
    for action in actions:
        entry = {
            "action": action["type"],
            # Timezone-aware UTC time at which the entry is recorded
            # (datetime.utcnow() is deprecated since Python 3.12)
            "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
        }

        if action["type"] == "navigate":
            entry["target"] = action["url"]
        elif action["type"] == "click":
            entry["target"] = action["selector"]
        elif action["type"] == "wait":
            entry["condition"] = action.get("condition", "networkidle")
            entry["duration_ms"] = action.get("duration_ms", 0)
        elif action["type"] == "scroll":
            entry["target"] = action.get("selector", "window")
            entry["scroll_position"] = action.get("position", {"x": 0, "y": 0})
        elif action["type"] == "type":
            # "type" is listed in the schema's action values: record the input
            entry["target"] = action["selector"]
            entry["value"] = action.get("value", "")
        elif action["type"] == "extract":
            entry["target"] = action["selector"]

        sequence.append(entry)

    return sequence
```
---

## Verification Workflow

```
1. INITIAL EXTRACTION
   ├─ Navigate to source URL (record interaction)
   ├─ Wait for JS rendering if needed (record wait condition)
   ├─ Detect framework (record rendering_context)
   ├─ Extract text via multiple selectors (record w3c_selectors)
   ├─ Generate content_hash
   ├─ Generate text_fragment URL
   ├─ Check Wayback Machine availability
   ├─ Set verification.status = "verified"
   └─ Set next_verification_due (e.g., 90 days)

2. RE-VERIFICATION (automated)
   ├─ Fetch source URL again
   ├─ Replay interaction_sequence if needed
   ├─ Re-extract via same selectors (try all w3c_selectors)
   ├─ Compare content_hash
   │  ├─ MATCH: Update last_verified, keep status "verified"
   │  └─ MISMATCH: Set status "stale", log to verification_history
   └─ If source 404: Try memento_uri, set status "archived"

3. ARCHIVAL FALLBACK
   ├─ If source unavailable, check memento_uri
   ├─ If no memento, search archive.org for URL
   └─ Log archival source in verification_history
```
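
The hash comparison in step 2 can be sketched as a small state update on the claim. This is an illustrative sketch only: `sri_sha256` and `apply_reverification` are hypothetical names, fetching and re-extraction are assumed to have happened already, and the 404/archival branch of the workflow is left to the caller.

```python
import base64
import hashlib
from datetime import datetime, timezone
from typing import Optional


def sri_sha256(text: str) -> str:
    """SRI-style SHA-256 value, matching the content_hash format."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return "sha256-" + base64.b64encode(digest).decode("ascii")


def apply_reverification(claim: dict, fresh_text: str,
                         now: Optional[datetime] = None) -> dict:
    """Compare re-extracted text with the stored hash and update the claim.

    `now` should be timezone-aware; it defaults to the current UTC time.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    verification = claim.setdefault("verification", {})
    history = verification.setdefault("verification_history", [])

    new_hash = sri_sha256(fresh_text)
    if new_hash == claim["content_hash"]["value"]:
        # MATCH: update last_verified, keep status "verified"
        verification["status"] = "verified"
        verification["last_verified"] = stamp
        notes = "Hash matches stored content_hash"
    else:
        # MISMATCH: mark as stale; last_verified is left untouched
        verification["status"] = "stale"
        notes = "Hash differs from stored content_hash"

    history.append({
        "timestamp": stamp,
        "status": verification["status"],
        "content_hash": new_hash,
        "notes": notes,
    })
    return claim
```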
---

## Example: Complete Web Claim with Enhanced Provenance

```json
{
  "claim_id": "c47f3e8a-9b2d-4e1a-8c5f-7d6e9a0b1c2d",

  "claim_type": "role",
  "claim_value": "directeur Tresoar",
  "extracted_text": "Arjen Dijkstra wordt de nieuwe directeur van Fries historisch en letterkundig centrum Tresoar. Hij begint in oktober en volgt Bert Looper op.",
  "language": "nl",

  "source_url": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
  "canonical_url": null,
  "content_type": "text/html",

  "w3c_selectors": [
    {
      "type": "CssSelector",
      "value": "article.news-article > div.content > p:first-of-type"
    },
    {
      "type": "XPathSelector",
      "value": "/html/body/main/article/div[@class='content']/p[1]"
    },
    {
      "type": "TextQuoteSelector",
      "exact": "Arjen Dijkstra wordt de nieuwe directeur van Fries historisch en letterkundig centrum Tresoar",
      "prefix": "15 juli 2022 - ",
      "suffix": ". Hij begint in oktober"
    }
  ],

  "aria_selector": {
    "role": "article",
    "name": "Historisch centrum Tresoar vindt nieuwe directeur in Arjen Dijkstra"
  },

  "text_fragment": "#:~:text=Arjen%20Dijkstra%20wordt%20de%20nieuwe%20directeur",

  "css_selector": "article.news-article > div.content > p:first-of-type",
  "xpath_selector": "/html/body/main/article/div[@class='content']/p[1]",

  "published_date": "2022-07-15T14:15:00Z",
  "modified_date": null,
  "author": "Omrop Fryslân",

  "retrieval_timestamp": "2025-12-28T02:45:00Z",
  "retrieval_agent": "opencode/claude-sonnet-4",
  "extraction_method": "web-reader_webReader",

  "content_hash": {
    "algorithm": "sha256",
    "value": "sha256-K7gNU3sdo+OL0wNhqoVWhr3g6s1xYv72ol/pe/Unols=",
    "scope": "extracted_text"
  },
  "http_etag": null,
  "http_last_modified": null,

  "archive": {
    "memento_uri": "https://web.archive.org/web/20220716/https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
    "memento_datetime": "2022-07-16T00:00:00Z",
    "timemap_uri": "https://web.archive.org/web/timemap/link/https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
    "timegate_uri": "https://web.archive.org/web/https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
    "archive_source": "web.archive.org"
  },

  "prov": {
    "wasAttributedTo": {
      "@type": "prov:Agent",
      "name": "Omrop Fryslân",
      "url": "https://www.omropfryslan.nl"
    },
    "generatedAtTime": "2025-12-28T02:45:00Z",
    "wasDerivedFrom": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
    "wasGeneratedBy": {
      "@type": "prov:Activity",
      "name": "web_extraction",
      "used": "web-reader_webReader"
    }
  },

  "schema_org": {
    "@type": "Claim",
    "claimReviewed": "Arjen Dijkstra is directeur van Tresoar",
    "appearance": {
      "@type": "NewsArticle",
      "url": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
      "datePublished": "2022-07-15T14:15:00Z"
    },
    "author": {
      "@type": "Organization",
      "name": "Omrop Fryslân"
    }
  },

  "verification": {
    "status": "verified",
    "last_verified": "2025-12-28T02:45:00Z",
    "next_verification_due": "2026-03-28T00:00:00Z",
    "confidence_score": 0.95,
    "verification_history": [
      {
        "timestamp": "2025-12-28T02:45:00Z",
        "status": "verified",
        "content_hash": "sha256-K7gNU3sdo+OL0wNhqoVWhr3g6s1xYv72ol/pe/Unols=",
        "notes": "Initial extraction from Omrop Fryslân"
      }
    ]
  },

  "rendering_context": {
    "framework_detected": "vanilla",
    "js_execution_required": false,
    "client_side_rendered": false,
    "server_side_rendered": true,
    "wait_condition": "domcontentloaded",
    "wait_duration_ms": 1200
  },

  "interaction_sequence": [
    {
      "action": "navigate",
      "target": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
      "timestamp": "2025-12-28T02:44:58Z"
    },
    {
      "action": "wait",
      "condition": "domcontentloaded",
      "duration_ms": 1200,
      "timestamp": "2025-12-28T02:44:59Z"
    },
    {
      "action": "extract",
      "target": "article.news-article > div.content > p:first-of-type",
      "timestamp": "2025-12-28T02:45:00Z"
    }
  ],

  "user_agent_context": {
    "browser": "Chromium",
    "headless": true,
    "viewport_width": 1920,
    "viewport_height": 1080,
    "locale": "nl-NL",
    "timezone": "Europe/Amsterdam"
  },

  "notes": "Primary source for Tresoar director appointment announcement"
}
```
---

## Example: SPA with AJAX Content

For a Single Page Application that loads content via API:

```json
{
  "claim_id": "d58e4f9b-0c3e-5f2b-9d6a-8e7f0b2c3d4e",

  "claim_type": "role",
  "claim_value": "Head of Collections",
  "extracted_text": "Dr. Maria van der Berg serves as Head of Collections since 2021.",
  "language": "en",

  "source_url": "https://museum.example.nl/about/team",

  "w3c_selectors": [
    {
      "type": "CssSelector",
      "value": "[data-testid='team-member-card']:nth-child(3) .role"
    },
    {
      "type": "TextQuoteSelector",
      "exact": "Dr. Maria van der Berg serves as Head of Collections",
      "prefix": "",
      "suffix": " since 2021"
    }
  ],

  "aria_selector": {
    "role": "listitem",
    "name": "Dr. Maria van der Berg",
    "testid": "team-member-card"
  },

  "text_fragment": "#:~:text=Dr.%20Maria%20van%20der%20Berg%20serves%20as%20Head%20of%20Collections",

  "content_hash": {
    "algorithm": "sha256",
    "value": "sha256-xyz789...",
    "scope": "extracted_text"
  },

  "archive": {
    "memento_uri": "https://web.archive.org/web/20251228/https://museum.example.nl/about/team",
    "archive_source": "web.archive.org"
  },

  "prov": {
    "wasDerivedFrom": "https://museum.example.nl/about/team"
  },

  "verification": {
    "status": "verified",
    "last_verified": "2025-12-28T03:00:00Z"
  },

  "rendering_context": {
    "framework_detected": "React",
    "framework_version": "18.2.0",
    "hydration_complete": true,
    "client_side_rendered": true,
    "server_side_rendered": false,
    "js_execution_required": true,
    "wait_condition": "networkidle",
    "wait_selector": "[data-testid='team-loaded']",
    "wait_duration_ms": 3500
  },

  "interaction_sequence": [
    {
      "action": "navigate",
      "target": "https://museum.example.nl/about/team",
      "timestamp": "2025-12-28T02:59:55Z"
    },
    {
      "action": "wait",
      "condition": "networkidle",
      "duration_ms": 3500,
      "timestamp": "2025-12-28T02:59:58Z"
    },
    {
      "action": "scroll",
      "target": "[data-testid='team-member-card']:nth-child(3)",
      "timestamp": "2025-12-28T02:59:59Z"
    },
    {
      "action": "extract",
      "target": "[data-testid='team-member-card']:nth-child(3) .role",
      "timestamp": "2025-12-28T03:00:00Z"
    }
  ],

  "network_context": {
    "request_url": "https://api.museum.example.nl/v2/team",
    "request_method": "GET",
    "response_status": 200,
    "response_content_type": "application/json",
    "xhr_intercepted": true,
    "api_version": "v2"
  },

  "retrieval_timestamp": "2025-12-28T03:00:00Z",
  "retrieval_agent": "opencode/claude-sonnet-4",
  "extraction_method": "playwright_browser_snapshot"
}
```
---

## Implementation Priority

### Phase 1: Critical (implement immediately)
- `content_hash` - Content integrity verification
- `text_fragment` - URL-based text targeting
- `archive.memento_uri` - Archival fallback
- `prov.wasDerivedFrom` - Provenance tracing
- `verification.status` - Claim freshness
- `w3c_selectors` - Multiple selector types (at least 2)

### Phase 2: High (implement for SPAs)
- `rendering_context` - JS framework detection
- `interaction_sequence` - Dynamic content actions
- `aria_selector` - Accessibility-first selection
- `network_context` - AJAX request details

### Phase 3: Complete (full FAIR compliance)
- Full `prov` object - Complete PROV-O alignment
- Full `schema_org` object - Schema.org ClaimReview
- `verification_history` - Change tracking
- `user_agent_context` - Browser environment
- `dom_state` - Complete DOM state capture

---

## Selector Resilience Ranking

For maximum resilience to DOM changes, use selectors in this priority order:

1. **TextQuoteSelector** - Most resilient (content-based)
2. **aria_selector** - Very stable (semantic/accessibility)
3. **data-testid** - Stable (explicit test hooks)
4. **FragmentSelector** - Stable for named anchors
5. **CssSelector** - Moderate (structure-dependent)
6. **XPathSelector** - Brittle (exact path-dependent)
7. **TextPositionSelector** - Most brittle (offset-dependent)
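
At re-verification time this ranking can decide which entry of `w3c_selectors` to try first. A minimal sketch, assuming a hypothetical `RESILIENCE_RANK` table and `order_by_resilience` helper: ranks 2 and 3 (`aria_selector`, `data-testid`) are omitted because they are stored outside `w3c_selectors`, and unknown selector types are tried last.

```python
# Lower rank = tried first; mirrors the ordering listed above.
RESILIENCE_RANK = {
    "TextQuoteSelector": 1,
    "FragmentSelector": 4,
    "CssSelector": 5,
    "XPathSelector": 6,
    "TextPositionSelector": 7,
}


def order_by_resilience(w3c_selectors: list[dict]) -> list[dict]:
    """Return a new list of selectors sorted from most to least resilient."""
    # sorted() is stable, so selectors with the same rank keep their input order.
    return sorted(
        w3c_selectors,
        key=lambda selector: RESILIENCE_RANK.get(selector.get("type"), 99),
    )
```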

**Best Practice:** Always include at least one TextQuoteSelector and one structural selector (CSS or XPath).

---

## Related Rules

- `.opencode/WEB_READER_PREFERRED_SCRAPER_RULE.md` - Preferred scraper tool
- `.opencode/DATA_PRESERVATION_RULES.md` - Never delete enriched data
- `.opencode/DATA_FABRICATION_PROHIBITION.md` - Real data only
- `.opencode/INITIALS_EXPANSION_PROHIBITION.md` - Never expand initials without verification
- `AGENTS.md` - Web Claim Provenance Requirements section