Web Claim Provenance Schema
Created: 2025-12-28
Updated: 2025-12-28
Status: Active Rule
Supersedes: Basic web_claims structure
Purpose
This schema defines a comprehensive provenance structure for web-sourced claims that:
- Enables automated re-verification - Claims can be automatically re-checked
- Ensures content integrity - Hashes detect if source content changed
- Supports temporal tracking - Links to web archives for historical access
- Aligns with FAIR principles - Findable, Accessible, Interoperable, Reusable
- Maps to linked data standards - PROV-O, Schema.org ClaimReview, W3C Web Annotation
- Handles modern web applications - SPAs, AJAX, JavaScript-rendered content
Standards Alignment
| Standard | Purpose | Reference |
|---|---|---|
| W3C PROV-O | Provenance ontology | https://www.w3.org/TR/prov-o/ |
| W3C Web Annotation Data Model | Selectors and states | https://www.w3.org/TR/annotation-model/ |
| Schema.org ClaimReview | Fact-checking markup | https://schema.org/ClaimReview |
| RFC 7089 Memento | Web archival access | https://tools.ietf.org/html/rfc7089 |
| W3C SRI | Subresource integrity | https://www.w3.org/TR/sri/ |
| W3C Text Fragments | URL text targeting | https://wicg.github.io/scroll-to-text-fragment/ |
| WAI-ARIA | Accessibility selectors | https://www.w3.org/WAI/ARIA/apg/ |
| Playwright Locators | Modern selector strategies | https://playwright.dev/docs/locators |
Mandatory vs Recommended Elements
MANDATORY Elements (Required for FAIR Compliance)
Every web_claim object MUST have these elements:
| Element | Standard | Purpose |
|---|---|---|
| content_hash | W3C SRI | SHA-256 hash of extracted_text for integrity verification |
| text_fragment | W3C Text Fragments | URL #:~:text=... for direct linking to source text |
| archive.memento_uri | RFC 7089 Memento | Wayback Machine archived snapshot URL |
| prov.wasDerivedFrom | W3C PROV-O | Source URL for linked data tracing |
| verification.status | - | Claim freshness (verified/stale/failed) |
| w3c_selectors | W3C Web Annotation | At least 2 selector types for redundancy |
RECOMMENDED Elements (For Full Provenance)
| Element | Standard | Purpose |
|---|---|---|
| aria_selector | WAI-ARIA | Accessibility-based element identification |
| rendering_context | - | JS framework detection and execution state |
| interaction_sequence | - | Actions taken to reach dynamic content |
| network_context | - | AJAX/API request details for dynamic content |
| user_agent_context | - | Browser and viewport information |
Complete Web Claim Structure
{
"claim_id": "string (UUID recommended)",
// === CLAIM CONTENT ===
"claim_type": "string (role|tenure|education|biography|contact|etc)",
"claim_value": "string (the extracted fact)",
"extracted_text": "string (full text from which claim derived)",
"language": "string (BCP 47 code: nl, en, fr, etc)",
// === SOURCE IDENTIFICATION ===
"source_url": "string (URL)",
"canonical_url": "string (URL, if different from source_url)",
"content_type": "string (MIME type: text/html)",
// === W3C WEB ANNOTATION SELECTORS (MANDATORY - at least 2) ===
"w3c_selectors": [
{
"type": "CssSelector",
"value": "string (CSS selector path)"
},
{
"type": "XPathSelector",
"value": "string (XPath expression)"
},
{
"type": "TextQuoteSelector",
"exact": "string (the matched text)",
"prefix": "string (text before match)",
"suffix": "string (text after match)"
},
{
"type": "TextPositionSelector",
"start": "integer (character offset start)",
"end": "integer (character offset end)"
},
{
"type": "FragmentSelector",
"value": "string (fragment identifier)",
"conformsTo": "string (specification URL)"
}
],
// === ACCESSIBILITY SELECTORS (RECOMMENDED) ===
"aria_selector": {
"role": "string (ARIA role: article, heading, button, etc)",
"name": "string (accessible name)",
"label": "string (aria-label value)",
"testid": "string (data-testid attribute)"
},
// === TEXT FRAGMENT (MANDATORY) ===
"text_fragment": "string (W3C Text Fragment: #:~:text=...)",
// === LEGACY SELECTORS (for backwards compatibility) ===
"css_selector": "string (CSS selector path)",
"xpath_selector": "string (XPath expression)",
// === TEMPORAL METADATA ===
"published_date": "string (ISO 8601, from article:published_time)",
"modified_date": "string (ISO 8601, from article:modified_time)",
"author": "string or object (content creator)",
// === RETRIEVAL METADATA ===
"retrieval_timestamp": "string (ISO 8601, when we fetched)",
"retrieval_agent": "string (tool identifier: opencode/claude-sonnet-4)",
"extraction_method": "string (MCP tool: web-reader_webReader)",
// === CONTENT INTEGRITY (MANDATORY) ===
"content_hash": {
"algorithm": "sha256",
"value": "string (base64 encoded hash of extracted_text)",
"scope": "extracted_text"
},
"http_etag": "string (ETag header from server response)",
"http_last_modified": "string (Last-Modified header)",
// === WEB ARCHIVAL (MANDATORY - RFC 7089 Memento) ===
"archive": {
"memento_uri": "string (archived snapshot URL)",
"memento_datetime": "string (ISO 8601, archival datetime)",
"timemap_uri": "string (TimeMap URL for all snapshots)",
"timegate_uri": "string (TimeGate for datetime negotiation)",
"archive_source": "string (web.archive.org, archive.today, etc)"
},
// === PROV-O ALIGNMENT (MANDATORY - wasDerivedFrom) ===
"prov": {
"wasAttributedTo": {
"@type": "prov:Agent",
"name": "string",
"url": "string"
},
"generatedAtTime": "string (ISO 8601)",
"wasDerivedFrom": "string (source URL or entity reference)",
"wasGeneratedBy": {
"@type": "prov:Activity",
"name": "web_extraction",
"used": "string (extraction_method)"
}
},
// === SCHEMA.ORG CLAIMREVIEW ALIGNMENT ===
"schema_org": {
"@type": "Claim",
"claimReviewed": "string (exact claim text)",
"appearance": {
"@type": "CreativeWork",
"url": "string (source_url)",
"datePublished": "string (published_date)"
},
"author": {
"@type": "Person",
"name": "string"
}
},
// === VERIFICATION SUPPORT (MANDATORY - status) ===
"verification": {
"status": "string (verified|stale|failed|pending)",
"last_verified": "string (ISO 8601)",
"next_verification_due": "string (ISO 8601)",
"confidence_score": "number (0.0-1.0)",
"verification_history": [
{
"timestamp": "string (ISO 8601)",
"status": "string",
"content_hash": "string",
"notes": "string"
}
]
},
// === RENDERING CONTEXT (RECOMMENDED - for SPAs) ===
"rendering_context": {
"framework_detected": "string (React|Vue|Angular|Svelte|vanilla|unknown)",
"framework_version": "string (version if detectable)",
"hydration_complete": "boolean",
"client_side_rendered": "boolean",
"server_side_rendered": "boolean",
"js_execution_required": "boolean",
"wait_condition": "string (load|domcontentloaded|networkidle)",
"wait_selector": "string (selector waited for)",
"wait_duration_ms": "integer (milliseconds waited)"
},
// === DOM STATE (RECOMMENDED - W3C Web Annotation TimeState) ===
"dom_state": {
"type": "TimeState",
"sourceDate": "string (ISO 8601)",
"dom_content_loaded": "boolean",
"load_event_fired": "boolean",
"mutation_observer_stable": "boolean",
"scroll_position": {
"x": "integer",
"y": "integer"
},
"viewport": {
"width": "integer",
"height": "integer"
}
},
// === INTERACTION SEQUENCE (RECOMMENDED - for dynamic content) ===
"interaction_sequence": [
{
"action": "string (navigate|click|scroll|wait|type|extract)",
"target": "string (URL or selector)",
"value": "string (input value if applicable)",
"condition": "string (wait condition if applicable)",
"duration_ms": "integer (action duration)",
"timestamp": "string (ISO 8601)"
}
],
// === NETWORK CONTEXT (RECOMMENDED - for AJAX content) ===
"network_context": {
"request_url": "string (API endpoint if different from page)",
"request_method": "string (GET|POST|etc)",
"request_headers": "object (relevant headers)",
"response_status": "integer (HTTP status code)",
"response_content_type": "string (MIME type)",
"response_hash": "string (hash of API response)",
"xhr_intercepted": "boolean",
"api_version": "string (API version if known)"
},
// === USER AGENT CONTEXT (RECOMMENDED) ===
"user_agent_context": {
"user_agent": "string (full UA string)",
"browser": "string (Chrome|Firefox|Chromium|etc)",
"browser_version": "string",
"headless": "boolean",
"viewport_width": "integer",
"viewport_height": "integer",
"device_scale_factor": "number",
"mobile": "boolean",
"locale": "string (BCP 47)",
"timezone": "string (IANA timezone)"
},
// === FREE-FORM ===
"notes": "string (optional context or caveats)"
}
Minimal Compliant Web Claim
For basic FAIR compliance, every web claim MUST have:
{
"claim_type": "MANDATORY",
"claim_value": "MANDATORY",
"source_url": "MANDATORY",
"extracted_text": "MANDATORY",
"retrieval_timestamp": "MANDATORY",
"retrieval_agent": "MANDATORY",
"extraction_method": "MANDATORY",
"content_hash": "MANDATORY (integrity)",
"text_fragment": "MANDATORY (direct linking)",
"archive": {
"memento_uri": "MANDATORY (archival fallback)"
},
"prov": {
"wasDerivedFrom": "MANDATORY (provenance)"
},
"verification": {
"status": "MANDATORY (freshness)"
},
"w3c_selectors": "MANDATORY (at least 2 selector types)"
}
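The mandatory-element check can be automated. The following is a minimal sketch, not part of the original rule: the names `MANDATORY_PATHS`, `_lookup`, and `find_compliance_problems` are hypothetical, and the dotted paths are simply read off the listing above.

```python
from typing import Any

# Dotted paths taken from the "Minimal Compliant Web Claim" listing.
MANDATORY_PATHS = [
    "claim_type", "claim_value", "source_url", "extracted_text",
    "retrieval_timestamp", "retrieval_agent", "extraction_method",
    "content_hash", "text_fragment", "archive.memento_uri",
    "prov.wasDerivedFrom", "verification.status", "w3c_selectors",
]


def _lookup(claim: dict, dotted_path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if absent."""
    node: Any = claim
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def find_compliance_problems(claim: dict) -> list[str]:
    """Return human-readable problems; an empty list means compliant."""
    problems = [
        f"missing or empty: {path}"
        for path in MANDATORY_PATHS
        if _lookup(claim, path) in (None, "", [], {})
    ]
    # The rule asks for at least 2 selector *types*, so count distinct types.
    selectors = claim.get("w3c_selectors") or []
    selector_types = {s.get("type") for s in selectors if isinstance(s, dict)}
    selector_types.discard(None)
    if selectors and len(selector_types) < 2:
        problems.append("w3c_selectors needs at least 2 distinct selector types")
    return problems
```

An empty result means the claim carries every mandatory element; the check does not validate the values themselves (for example, that memento_uri actually resolves).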
W3C Web Annotation Selector Types
The W3C Web Annotation Data Model defines several selector types. Use at least 2 for redundancy:
1. CssSelector
{
"type": "CssSelector",
"value": "#main-article > p.intro"
}
2. XPathSelector
{
"type": "XPathSelector",
"value": "/html/body/main/article/div/p[1]"
}
3. TextQuoteSelector (Highly Recommended)
Most resilient to DOM changes:
{
"type": "TextQuoteSelector",
"exact": "Arjen Dijkstra wordt de nieuwe directeur",
"prefix": "15 juli 2022 - ",
"suffix": " van Fries historisch"
}
4. TextPositionSelector
{
"type": "TextPositionSelector",
"start": 412,
"end": 795
}
5. FragmentSelector
{
"type": "FragmentSelector",
"value": "section2",
"conformsTo": "http://tools.ietf.org/rfc/rfc3236"
}
6. DataPositionSelector (for binary data)
{
"type": "DataPositionSelector",
"start": 4096,
"end": 4104
}
7. RangeSelector (for spanning selections)
{
"type": "RangeSelector",
"startSelector": {
"type": "XPathSelector",
"value": "//table[1]/tr[1]/td[2]"
},
"endSelector": {
"type": "XPathSelector",
"value": "//table[1]/tr[1]/td[4]"
}
}
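To illustrate why a TextQuoteSelector survives DOM changes, here is a minimal sketch of resolving one against plain extracted text. The function name `resolve_text_quote` is hypothetical, and the matching is deliberately strict (exact string comparison, no whitespace normalization), which a production resolver would likely relax.

```python
from typing import Optional


def resolve_text_quote(text: str, selector: dict) -> Optional[tuple[int, int]]:
    """Return (start, end) offsets of the first match whose context agrees."""
    exact = selector["exact"]
    prefix = selector.get("prefix", "")
    suffix = selector.get("suffix", "")
    start = text.find(exact)
    while start != -1:
        end = start + len(exact)
        # Accept the candidate only if the text around it matches the
        # recorded prefix and suffix; otherwise try the next occurrence.
        if text[:start].endswith(prefix) and text[end:].startswith(suffix):
            return start, end
        start = text.find(exact, start + 1)
    return None
```

The returned offsets are the same values a TextPositionSelector would store as start and end, so a successful quote match can be used to refresh stale position offsets.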
Accessibility-First Selectors (ARIA)
Modern web applications use semantic HTML and ARIA. These selectors are often more stable than CSS/XPath:
"aria_selector": {
"role": "article",
"name": "Nieuws artikel over directeurswisseling",
"label": "Article content",
"testid": "news-article-content",
"description": "Artikel over de nieuwe directeur van Tresoar"
}
Playwright Locator Equivalents:
page.getByRole('article', { name: 'Nieuws artikel' })
page.getByLabel('Article content')
page.getByTestId('news-article-content')
page.getByText('Arjen Dijkstra wordt')
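In Playwright's Python API the same locators are spelled `get_by_role`, `get_by_test_id`, and `get_by_label`. The sketch below maps a stored aria_selector onto one of them. The function name `locate_by_aria` and the fallback order (role with name, then testid, then label, then bare role) are assumptions made for illustration, not something this rule prescribes.

```python
def locate_by_aria(page, aria_selector: dict):
    """Return a Playwright locator for the most specific field available.

    `page` is expected to be a playwright.sync_api.Page (or any object
    exposing the same get_by_* methods).
    """
    role = aria_selector.get("role")
    name = aria_selector.get("name")
    if role and name:
        # Role plus accessible name is the most semantic combination.
        return page.get_by_role(role, name=name)
    if aria_selector.get("testid"):
        return page.get_by_test_id(aria_selector["testid"])
    if aria_selector.get("label"):
        return page.get_by_label(aria_selector["label"])
    if role:
        # A bare role is weak: it can match many elements on one page.
        return page.get_by_role(role)
    raise ValueError("aria_selector has no usable field")
```

Because the function only calls methods on the object it is given, it can be exercised with a stub page in unit tests without launching a browser.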
Content Hash Generation
Generate SHA-256 hash of extracted_text for integrity verification:
import hashlib
import base64

def generate_content_hash(text: str) -> dict:
    """Generate SHA-256 hash for content integrity (W3C SRI format)."""
    hash_bytes = hashlib.sha256(text.encode('utf-8')).digest()
    hash_b64 = base64.b64encode(hash_bytes).decode('ascii')
    return {
        "algorithm": "sha256",
        "value": f"sha256-{hash_b64}",
        "scope": "extracted_text"
    }

# Example
text = "Arjen Dijkstra wordt de nieuwe directeur van Tresoar."
hash_obj = generate_content_hash(text)
# Result: {"algorithm": "sha256", "value": "sha256-abc123...", "scope": "extracted_text"}
Text Fragment URL Generation
Create W3C Text Fragment URLs for direct linking:
from urllib.parse import quote

def generate_text_fragment(source_url: str, text: str) -> str:
    """Generate URL with text fragment for direct linking."""
    # Truncate to the first 100 chars; slicing is a no-op on shorter text.
    # Caveat: a cut that lands mid-word can stop the fragment from matching.
    fragment_text = text[:100]
    # safe="" also encodes "/", "&" and ","; quote() never encodes "-",
    # which the text directive syntax reserves, so encode it by hand.
    encoded = quote(fragment_text, safe="").replace("-", "%2D")
    return f"{source_url}#:~:text={encoded}"

# Example
url = "https://example.com/article"
text = "Arjen Dijkstra wordt de nieuwe directeur"
fragment_url = generate_text_fragment(url, text)
# Result: "https://example.com/article#:~:text=Arjen%20Dijkstra%20wordt%20de%20nieuwe%20directeur"
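When a TextQuoteSelector is already recorded, its prefix and suffix can be carried into the text directive, which uses the form `text=[prefix-,]textStart[,-suffix]`. The sketch below is one way to build it; the helper names are hypothetical, and stripping surrounding whitespace from each term is an assumption made to keep the directive tidy.

```python
from urllib.parse import quote


def _encode_term(term: str) -> str:
    """Percent-encode one directive term, including '-', which quote() leaves alone."""
    return quote(term.strip(), safe="").replace("-", "%2D")


def text_directive_from_quote(selector: dict) -> str:
    """Build '#:~:text=[prefix-,]exact[,-suffix]' from a TextQuoteSelector."""
    parts = []
    if selector.get("prefix", "").strip():
        # A trailing literal '-' marks the term as a prefix.
        parts.append(_encode_term(selector["prefix"]) + "-")
    parts.append(_encode_term(selector["exact"]))
    if selector.get("suffix", "").strip():
        # A leading literal '-' marks the term as a suffix.
        parts.append("-" + _encode_term(selector["suffix"]))
    return "#:~:text=" + ",".join(parts)
```

The context terms help disambiguate when the exact text occurs more than once on the page, at the cost of a longer and somewhat more fragile fragment.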
Web Archive (Memento) Integration
Request archived version via Wayback Machine:
import requests
from datetime import datetime
from typing import Optional

def get_memento_info(url: str, target_date: Optional[datetime] = None) -> dict:
    """Get Memento (archived) version info from Wayback Machine."""
    # Use Memento TimeGate
    timegate = f"https://web.archive.org/web/{url}"
    # Or query TimeMap for all versions
    timemap = f"https://web.archive.org/web/timemap/link/{url}"
    # Get closest memento to target date
    if target_date:
        date_str = target_date.strftime("%Y%m%d")
        memento_uri = f"https://web.archive.org/web/{date_str}/{url}"
    else:
        # No date given: the TimeGate resolves to the most recent capture.
        # (The /web/*/ form opens the capture calendar, not a snapshot.)
        memento_uri = timegate
    return {
        "memento_uri": memento_uri,
        "timemap_uri": timemap,
        "timegate_uri": timegate,
        "archive_source": "web.archive.org"
    }

def check_wayback_availability(url: str) -> dict:
    """Check if URL is available in Wayback Machine."""
    # params= lets requests URL-encode the target; the timeout avoids hanging
    response = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()
    snapshot = data.get("archived_snapshots", {}).get("closest")
    if snapshot:
        # The API reports timestamps as YYYYMMDDhhmmss (UTC); the schema
        # expects ISO 8601 for memento_datetime, so convert here.
        captured = datetime.strptime(snapshot["timestamp"], "%Y%m%d%H%M%S")
        return {
            "available": True,
            "memento_uri": snapshot["url"],
            "memento_datetime": captured.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "archive_source": "web.archive.org"
        }
    return {"available": False}
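RFC 7089 datetime negotiation works by sending an `Accept-Datetime` request header (an HTTP-date in GMT) to the TimeGate. The sketch below only builds the URL and headers and performs no network call. The function name `timegate_request` is hypothetical, and treating a naive datetime as UTC is an assumption.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def timegate_request(url: str, target: datetime) -> tuple[str, dict]:
    """Return the TimeGate URL and headers for RFC 7089 datetime negotiation."""
    if target.tzinfo is None:
        # Assumption: a naive datetime is interpreted as UTC.
        target = target.replace(tzinfo=timezone.utc)
    target_utc = target.astimezone(timezone.utc)
    # format_datetime(..., usegmt=True) emits an HTTP-date such as
    # "Sat, 16 Jul 2022 00:00:00 GMT"; it requires a UTC-aware datetime.
    headers = {"Accept-Datetime": format_datetime(target_utc, usegmt=True)}
    return f"https://web.archive.org/web/{url}", headers
```

The pair can be passed to an HTTP client, for example `requests.get(timegate, headers=headers, timeout=30)`; a Memento-compliant archive is expected to redirect to the selected snapshot and report its capture time in the `Memento-Datetime` response header.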
Rendering Context Detection
For Single Page Applications (SPAs), detect the framework and rendering state:
def detect_rendering_context(page) -> dict:
    """Detect JS framework and rendering state using Playwright."""
    context = {
        "framework_detected": "unknown",
        "js_execution_required": False,
        "client_side_rendered": False,
        "server_side_rendered": True
    }

    # Check for React
    # Note: the devtools hook is injected by the browser extension, so it
    # can be present on non-React pages when the extension is installed.
    react_check = page.evaluate("""() => {
        return !!(window.__REACT_DEVTOOLS_GLOBAL_HOOK__ ||
                  document.querySelector('[data-reactroot]') ||
                  document.querySelector('[data-react-helmet]'))
    }""")
    if react_check:
        context["framework_detected"] = "React"
        context["js_execution_required"] = True

    # Check for Vue
    # Note: '[data-v-]' only matches an attribute named exactly "data-v-";
    # Vue's scoped-style attributes carry a hash suffix, so it rarely fires.
    vue_check = page.evaluate("""() => {
        return !!(window.__VUE__ ||
                  document.querySelector('[data-v-]') ||
                  document.querySelector('#__nuxt'))
    }""")
    if vue_check:
        context["framework_detected"] = "Vue"
        context["js_execution_required"] = True

    # Check for Angular
    # Note: '[_ngcontent-]' has the same exact-name limitation as above.
    angular_check = page.evaluate("""() => {
        return !!(window.ng ||
                  document.querySelector('[ng-version]') ||
                  document.querySelector('[_ngcontent-]'))
    }""")
    if angular_check:
        context["framework_detected"] = "Angular"
        context["js_execution_required"] = True

    # Check if content was client-side rendered.
    # page.content() serializes the current DOM, not the raw server response,
    # so this size threshold is only a rough heuristic.
    rendered_html = page.content()
    context["client_side_rendered"] = len(rendered_html) < 5000  # Heuristic
    # Keep the two flags consistent under this heuristic.
    context["server_side_rendered"] = not context["client_side_rendered"]
    return context
Interaction Sequence Recording
For dynamic content that requires user interaction:
from datetime import datetime, timezone

def record_interaction_sequence(actions: list) -> list:
    """Record a sequence of interactions for provenance."""
    sequence = []
    for action in actions:
        entry = {
            "action": action["type"],
            # Time of recording, timezone-aware (datetime.utcnow() is deprecated)
            "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
        }
        if action["type"] == "navigate":
            entry["target"] = action["url"]
        elif action["type"] == "click":
            entry["target"] = action["selector"]
        elif action["type"] == "wait":
            entry["condition"] = action.get("condition", "networkidle")
            entry["duration_ms"] = action.get("duration_ms", 0)
        elif action["type"] == "scroll":
            entry["target"] = action.get("selector", "window")
            entry["scroll_position"] = action.get("position", {"x": 0, "y": 0})
        elif action["type"] == "extract":
            entry["target"] = action["selector"]
        sequence.append(entry)
    return sequence
Verification Workflow
1. INITIAL EXTRACTION
├─ Navigate to source URL (record interaction)
├─ Wait for JS rendering if needed (record wait condition)
├─ Detect framework (record rendering_context)
├─ Extract text via multiple selectors (record w3c_selectors)
├─ Generate content_hash
├─ Generate text_fragment URL
├─ Check Wayback Machine availability
├─ Set verification.status = "verified"
└─ Set next_verification_due (e.g., 90 days)
2. RE-VERIFICATION (automated)
├─ Fetch source URL again
├─ Replay interaction_sequence if needed
├─ Re-extract via same selectors (try all w3c_selectors)
├─ Compare content_hash
│ ├─ MATCH: Update last_verified, keep status "verified"
│ └─ MISMATCH: Set status "stale", log to verification_history
└─ If source 404: Try memento_uri, set status "archived"
3. ARCHIVAL FALLBACK
├─ If source unavailable, check memento_uri
├─ If no memento, search archive.org for URL
└─ Log archival source in verification_history
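Step 2 of the workflow (compare content_hash, then update the verification block) might be sketched as follows. The function names `sri_sha256` and `reverify` and the 90-day default are illustrative assumptions. This sketch also logs matches to verification_history, which goes slightly beyond the workflow text (it only requires logging mismatches).

```python
import base64
import hashlib
from datetime import datetime, timedelta, timezone


def sri_sha256(text: str) -> str:
    """SHA-256 of the UTF-8 text, base64-encoded with an 'sha256-' prefix."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return "sha256-" + base64.b64encode(digest).decode("ascii")


def _iso(moment: datetime) -> str:
    """Format an aware UTC datetime as ISO 8601 with a trailing 'Z'."""
    return moment.isoformat(timespec="seconds").replace("+00:00", "Z")


def reverify(claim: dict, fresh_text: str, interval_days: int = 90) -> dict:
    """Compare the stored hash with freshly extracted text; update in place."""
    now = datetime.now(timezone.utc)
    new_hash = sri_sha256(fresh_text)
    matches = new_hash == claim["content_hash"]["value"]
    verification = claim.setdefault("verification", {})
    verification["status"] = "verified" if matches else "stale"
    if matches:
        verification["last_verified"] = _iso(now)
        verification["next_verification_due"] = _iso(now + timedelta(days=interval_days))
    # Record every check so the history shows when the source changed.
    verification.setdefault("verification_history", []).append({
        "timestamp": _iso(now),
        "status": verification["status"],
        "content_hash": new_hash,
        "notes": "hash match" if matches else "hash mismatch: source text changed",
    })
    return claim
```

The stored content_hash is left untouched on a mismatch, so the original extraction remains verifiable against an archived snapshot.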
Example: Complete Web Claim with Enhanced Provenance
{
"claim_id": "c47f3e8a-9b2d-4e1a-8c5f-7d6e9a0b1c2d",
"claim_type": "role",
"claim_value": "directeur Tresoar",
"extracted_text": "Arjen Dijkstra wordt de nieuwe directeur van Fries historisch en letterkundig centrum Tresoar. Hij begint in oktober en volgt Bert Looper op.",
"language": "nl",
"source_url": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
"canonical_url": null,
"content_type": "text/html",
"w3c_selectors": [
{
"type": "CssSelector",
"value": "article.news-article > div.content > p:first-of-type"
},
{
"type": "XPathSelector",
"value": "/html/body/main/article/div[@class='content']/p[1]"
},
{
"type": "TextQuoteSelector",
"exact": "Arjen Dijkstra wordt de nieuwe directeur van Fries historisch en letterkundig centrum Tresoar",
"prefix": "15 juli 2022 - ",
"suffix": ". Hij begint in oktober"
}
],
"aria_selector": {
"role": "article",
"name": "Historisch centrum Tresoar vindt nieuwe directeur in Arjen Dijkstra"
},
"text_fragment": "#:~:text=Arjen%20Dijkstra%20wordt%20de%20nieuwe%20directeur",
"css_selector": "article.news-article > div.content > p:first-of-type",
"xpath_selector": "/html/body/main/article/div[@class='content']/p[1]",
"published_date": "2022-07-15T14:15:00Z",
"modified_date": null,
"author": "Omrop Fryslân",
"retrieval_timestamp": "2025-12-28T02:45:00Z",
"retrieval_agent": "opencode/claude-sonnet-4",
"extraction_method": "web-reader_webReader",
"content_hash": {
"algorithm": "sha256",
"value": "sha256-K7gNU3sdo+OL0wNhqoVWhr3g6s1xYv72ol/pe/Unols=",
"scope": "extracted_text"
},
"http_etag": null,
"http_last_modified": null,
"archive": {
"memento_uri": "https://web.archive.org/web/20220716/https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
"memento_datetime": "2022-07-16T00:00:00Z",
"timemap_uri": "https://web.archive.org/web/timemap/link/https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
"timegate_uri": "https://web.archive.org/web/https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
"archive_source": "web.archive.org"
},
"prov": {
"wasAttributedTo": {
"@type": "prov:Agent",
"name": "Omrop Fryslân",
"url": "https://www.omropfryslan.nl"
},
"generatedAtTime": "2025-12-28T02:45:00Z",
"wasDerivedFrom": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
"wasGeneratedBy": {
"@type": "prov:Activity",
"name": "web_extraction",
"used": "web-reader_webReader"
}
},
"schema_org": {
"@type": "Claim",
"claimReviewed": "Arjen Dijkstra is directeur van Tresoar",
"appearance": {
"@type": "NewsArticle",
"url": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
"datePublished": "2022-07-15T14:15:00Z"
},
"author": {
"@type": "Organization",
"name": "Omrop Fryslân"
}
},
"verification": {
"status": "verified",
"last_verified": "2025-12-28T02:45:00Z",
"next_verification_due": "2026-03-28T00:00:00Z",
"confidence_score": 0.95,
"verification_history": [
{
"timestamp": "2025-12-28T02:45:00Z",
"status": "verified",
"content_hash": "sha256-K7gNU3sdo+OL0wNhqoVWhr3g6s1xYv72ol/pe/Unols=",
"notes": "Initial extraction from Omrop Fryslân"
}
]
},
"rendering_context": {
"framework_detected": "vanilla",
"js_execution_required": false,
"client_side_rendered": false,
"server_side_rendered": true,
"wait_condition": "domcontentloaded",
"wait_duration_ms": 1200
},
"interaction_sequence": [
{
"action": "navigate",
"target": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
"timestamp": "2025-12-28T02:44:58Z"
},
{
"action": "wait",
"condition": "domcontentloaded",
"duration_ms": 1200,
"timestamp": "2025-12-28T02:44:59Z"
},
{
"action": "extract",
"target": "article.news-article > div.content > p:first-of-type",
"timestamp": "2025-12-28T02:45:00Z"
}
],
"user_agent_context": {
"browser": "Chromium",
"headless": true,
"viewport_width": 1920,
"viewport_height": 1080,
"locale": "nl-NL",
"timezone": "Europe/Amsterdam"
},
"notes": "Primary source for Tresoar director appointment announcement"
}
Example: SPA with AJAX Content
For a Single Page Application that loads content via API:
{
"claim_id": "d58e4f9b-0c3e-5f2b-9d6g-8e7f0b2c3d4e",
"claim_type": "role",
"claim_value": "Head of Collections",
"extracted_text": "Dr. Maria van der Berg serves as Head of Collections since 2021.",
"language": "en",
"source_url": "https://museum.example.nl/about/team",
"w3c_selectors": [
{
"type": "CssSelector",
"value": "[data-testid='team-member-card']:nth-child(3) .role"
},
{
"type": "TextQuoteSelector",
"exact": "Dr. Maria van der Berg serves as Head of Collections",
"prefix": "",
"suffix": " since 2021"
}
],
"aria_selector": {
"role": "listitem",
"name": "Dr. Maria van der Berg",
"testid": "team-member-card"
},
"text_fragment": "#:~:text=Dr.%20Maria%20van%20der%20Berg%20serves%20as%20Head%20of%20Collections",
"content_hash": {
"algorithm": "sha256",
"value": "sha256-xyz789...",
"scope": "extracted_text"
},
"archive": {
"memento_uri": "https://web.archive.org/web/20251228/https://museum.example.nl/about/team",
"archive_source": "web.archive.org"
},
"prov": {
"wasDerivedFrom": "https://museum.example.nl/about/team"
},
"verification": {
"status": "verified",
"last_verified": "2025-12-28T03:00:00Z"
},
"rendering_context": {
"framework_detected": "React",
"framework_version": "18.2.0",
"hydration_complete": true,
"client_side_rendered": true,
"server_side_rendered": false,
"js_execution_required": true,
"wait_condition": "networkidle",
"wait_selector": "[data-testid='team-loaded']",
"wait_duration_ms": 3500
},
"interaction_sequence": [
{
"action": "navigate",
"target": "https://museum.example.nl/about/team",
"timestamp": "2025-12-28T02:59:55Z"
},
{
"action": "wait",
"condition": "networkidle",
"duration_ms": 3500,
"timestamp": "2025-12-28T02:59:58Z"
},
{
"action": "scroll",
"target": "[data-testid='team-member-card']:nth-child(3)",
"timestamp": "2025-12-28T02:59:59Z"
},
{
"action": "extract",
"target": "[data-testid='team-member-card']:nth-child(3) .role",
"timestamp": "2025-12-28T03:00:00Z"
}
],
"network_context": {
"request_url": "https://api.museum.example.nl/v2/team",
"request_method": "GET",
"response_status": 200,
"response_content_type": "application/json",
"xhr_intercepted": true,
"api_version": "v2"
},
"retrieval_timestamp": "2025-12-28T03:00:00Z",
"retrieval_agent": "opencode/claude-sonnet-4",
"extraction_method": "playwright_browser_snapshot"
}
Implementation Priority
Phase 1: Critical (implement immediately)
- content_hash - Content integrity verification
- text_fragment - URL-based text targeting
- archive.memento_uri - Archival fallback
- prov.wasDerivedFrom - Provenance tracing
- verification.status - Claim freshness
- w3c_selectors - Multiple selector types (at least 2)
Phase 2: High (implement for SPAs)
- rendering_context - JS framework detection
- interaction_sequence - Dynamic content actions
- aria_selector - Accessibility-first selection
- network_context - AJAX request details
Phase 3: Complete (full FAIR compliance)
- Full prov object - Complete PROV-O alignment
- Full schema_org object - Schema.org ClaimReview
- verification_history - Change tracking
- user_agent_context - Browser environment
- dom_state - Complete DOM state capture
Selector Resilience Ranking
For maximum resilience to DOM changes, use selectors in this priority order:
1. TextQuoteSelector - Most resilient (content-based)
2. aria_selector - Very stable (semantic/accessibility)
3. data-testid - Stable (explicit test hooks)
4. FragmentSelector - Stable for named anchors
5. CssSelector - Moderate (structure-dependent)
6. XPathSelector - Brittle (exact path-dependent)
7. TextPositionSelector - Most brittle (offset-dependent)
Best Practice: Always include at least one TextQuoteSelector and one structural selector (CSS or XPath).
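During re-verification, the ranking can decide which selector to try first. The sketch below covers only the entries that appear in w3c_selectors (aria_selector and data-testid live in a separate field). The names `RESILIENCE_ORDER`, `by_resilience`, and `follows_best_practice` are hypothetical.

```python
# W3C selector types from the ranking, most resilient first.
RESILIENCE_ORDER = [
    "TextQuoteSelector",
    "FragmentSelector",
    "CssSelector",
    "XPathSelector",
    "TextPositionSelector",
]


def by_resilience(selectors: list[dict]) -> list[dict]:
    """Sort selectors so the most resilient type is tried first."""
    rank = {name: index for index, name in enumerate(RESILIENCE_ORDER)}
    # Unknown types sort last; sorted() is stable, so ties keep input order.
    return sorted(selectors, key=lambda s: rank.get(s.get("type"), len(rank)))


def follows_best_practice(selectors: list[dict]) -> bool:
    """True if a TextQuoteSelector and a structural selector are both present."""
    types = {s.get("type") for s in selectors}
    return "TextQuoteSelector" in types and bool(types & {"CssSelector", "XPathSelector"})
```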
Related Rules
- .opencode/WEB_READER_PREFERRED_SCRAPER_RULE.md - Preferred scraper tool
- .opencode/DATA_PRESERVATION_RULES.md - Never delete enriched data
- .opencode/DATA_FABRICATION_PROHIBITION.md - Real data only
- .opencode/INITIALS_EXPANSION_PROHIBITION.md - Never expand initials without verification
- AGENTS.md - Web Claim Provenance Requirements section