# Web Claim Provenance Schema

**Created**: 2025-12-28
**Updated**: 2025-12-28
**Status**: Active Rule
**Supersedes**: Basic web_claims structure

## Purpose

This schema defines a comprehensive provenance structure for web-sourced claims that:

1. **Enables automated re-verification** - Claims can be automatically re-checked
2. **Ensures content integrity** - Hashes detect if source content changed
3. **Supports temporal tracking** - Links to web archives for historical access
4. **Aligns with FAIR principles** - Findable, Accessible, Interoperable, Reusable
5. **Maps to linked data standards** - PROV-O, Schema.org ClaimReview, W3C Web Annotation
6. **Handles modern web applications** - SPAs, AJAX, JavaScript-rendered content

## Standards Alignment

| Standard | Purpose | Reference |
|----------|---------|-----------|
| W3C PROV-O | Provenance ontology | https://www.w3.org/TR/prov-o/ |
| W3C Web Annotation Data Model | Selectors and states | https://www.w3.org/TR/annotation-model/ |
| Schema.org ClaimReview | Fact-checking markup | https://schema.org/ClaimReview |
| RFC 7089 Memento | Web archival access | https://tools.ietf.org/html/rfc7089 |
| W3C SRI | Subresource integrity | https://www.w3.org/TR/sri/ |
| W3C Text Fragments | URL text targeting | https://wicg.github.io/scroll-to-text-fragment/ |
| WAI-ARIA | Accessibility selectors | https://www.w3.org/WAI/ARIA/apg/ |
| Playwright Locators | Modern selector strategies | https://playwright.dev/docs/locators |

---

## Mandatory vs Recommended Elements

### MANDATORY Elements (Required for FAIR Compliance)

Every `web_claim` object **MUST** have these elements:

| Element | Standard | Purpose |
|---------|----------|---------|
| `content_hash` | W3C SRI | SHA-256 hash of `extracted_text` for integrity verification |
| `text_fragment` | W3C Text Fragments | URL `#:~:text=...` for direct linking to source text |
| `archive.memento_uri` | RFC 7089 Memento | Wayback Machine archived snapshot URL |
| `prov.wasDerivedFrom` | W3C PROV-O | Source URL for linked data tracing |
| `verification.status` | - | Claim freshness (`verified`/`stale`/`failed`/`pending`/`archived`) |
| `w3c_selectors` | W3C Web Annotation | At least 2 selector types for redundancy |

### RECOMMENDED Elements (For Full Provenance)

| Element | Standard | Purpose |
|---------|----------|---------|
| `aria_selector` | WAI-ARIA | Accessibility-based element identification |
| `rendering_context` | - | JS framework detection and execution state |
| `interaction_sequence` | - | Actions taken to reach dynamic content |
| `network_context` | - | AJAX/API request details for dynamic content |
| `user_agent_context` | - | Browser and viewport information |

---

## Complete Web Claim Structure

```json
{
  "claim_id": "string (UUID recommended)",

  // === CLAIM CONTENT ===
  "claim_type": "string (role|tenure|education|biography|contact|etc)",
  "claim_value": "string (the extracted fact)",
  "extracted_text": "string (full text from which claim derived)",
  "language": "string (BCP 47 code: nl, en, fr, etc)",

  // === SOURCE IDENTIFICATION ===
  "source_url": "string (URL)",
  "canonical_url": "string (URL, if different from source_url)",
  "content_type": "string (MIME type: text/html)",

  // === W3C WEB ANNOTATION SELECTORS (MANDATORY - at least 2) ===
  "w3c_selectors": [
    {
      "type": "CssSelector",
      "value": "string (CSS selector path)"
    },
    {
      "type": "XPathSelector",
      "value": "string (XPath expression)"
    },
    {
      "type": "TextQuoteSelector",
      "exact": "string (the matched text)",
      "prefix": "string (text before match)",
      "suffix": "string (text after match)"
    },
    {
      "type": "TextPositionSelector",
      "start": "integer (character offset start)",
      "end": "integer (character offset end)"
    },
    {
      "type": "FragmentSelector",
      "value": "string (fragment identifier)",
      "conformsTo": "string (specification URL)"
    }
  ],

  // === ACCESSIBILITY SELECTORS (RECOMMENDED) ===
  "aria_selector": {
    "role": "string (ARIA role: article, heading, button, etc)",
    "name": "string (accessible name)",
    "label": "string (aria-label value)",
    "testid": "string (data-testid attribute)",
    "description": "string (accessible description)"
  },

  // === TEXT FRAGMENT (MANDATORY) ===
  "text_fragment": "string (W3C Text Fragment: #:~:text=...)",

  // === LEGACY SELECTORS (for backwards compatibility) ===
  "css_selector": "string (CSS selector path)",
  "xpath_selector": "string (XPath expression)",

  // === TEMPORAL METADATA ===
  "published_date": "string (ISO 8601, from article:published_time)",
  "modified_date": "string (ISO 8601, from article:modified_time)",
  "author": "string or object (content creator)",

  // === RETRIEVAL METADATA ===
  "retrieval_timestamp": "string (ISO 8601, when we fetched)",
  "retrieval_agent": "string (tool identifier: opencode/claude-sonnet-4)",
  "extraction_method": "string (MCP tool: web-reader_webReader)",

  // === CONTENT INTEGRITY (MANDATORY) ===
  "content_hash": {
    "algorithm": "sha256",
    "value": "string (SRI format: sha256-<base64 digest of extracted_text>)",
    "scope": "extracted_text"
  },
  "http_etag": "string (ETag header from server response)",
  "http_last_modified": "string (Last-Modified header)",

  // === WEB ARCHIVAL (MANDATORY - RFC 7089 Memento) ===
  "archive": {
    "memento_uri": "string (archived snapshot URL)",
    "memento_datetime": "string (ISO 8601, archival datetime)",
    "timemap_uri": "string (TimeMap URL for all snapshots)",
    "timegate_uri": "string (TimeGate for datetime negotiation)",
    "archive_source": "string (web.archive.org, archive.today, etc)"
  },

  // === PROV-O ALIGNMENT (MANDATORY - wasDerivedFrom) ===
  "prov": {
    "wasAttributedTo": {
      "@type": "prov:Agent",
      "name": "string",
      "url": "string"
    },
    "generatedAtTime": "string (ISO 8601)",
    "wasDerivedFrom": "string (source URL or entity reference)",
    "wasGeneratedBy": {
      "@type": "prov:Activity",
      "name": "web_extraction",
      "used": "string (extraction_method)"
    }
  },

  // === SCHEMA.ORG CLAIMREVIEW ALIGNMENT ===
  "schema_org": {
    "@type": "Claim",
    "claimReviewed": "string (exact claim text)",
    "appearance": {
      "@type": "CreativeWork",
      "url": "string (source_url)",
      "datePublished": "string (published_date)"
    },
    "author": {
      "@type": "Person",
      "name": "string"
    }
  },

  // === VERIFICATION SUPPORT (MANDATORY - status) ===
  "verification": {
    "status": "string (verified|stale|failed|pending|archived)",
    "last_verified": "string (ISO 8601)",
    "next_verification_due": "string (ISO 8601)",
    "confidence_score": "number (0.0-1.0)",
    "verification_history": [
      {
        "timestamp": "string (ISO 8601)",
        "status": "string",
        "content_hash": "string",
        "notes": "string"
      }
    ]
  },

  // === RENDERING CONTEXT (RECOMMENDED - for SPAs) ===
  "rendering_context": {
    "framework_detected": "string (React|Vue|Angular|Svelte|vanilla|unknown)",
    "framework_version": "string (version if detectable)",
    "hydration_complete": "boolean",
    "client_side_rendered": "boolean",
    "server_side_rendered": "boolean",
    "js_execution_required": "boolean",
    "wait_condition": "string (load|domcontentloaded|networkidle)",
    "wait_selector": "string (selector waited for)",
    "wait_duration_ms": "integer (milliseconds waited)"
  },

  // === DOM STATE (RECOMMENDED - W3C Web Annotation TimeState) ===
  "dom_state": {
    "type": "TimeState",
    "sourceDate": "string (ISO 8601)",
    "dom_content_loaded": "boolean",
    "load_event_fired": "boolean",
    "mutation_observer_stable": "boolean",
    "scroll_position": { "x": "integer", "y": "integer" },
    "viewport": { "width": "integer", "height": "integer" }
  },

  // === INTERACTION SEQUENCE (RECOMMENDED - for dynamic content) ===
  "interaction_sequence": [
    {
      "action": "string (navigate|click|scroll|wait|type|extract)",
      "target": "string (URL or selector)",
      "value": "string (input value if applicable)",
      "condition": "string (wait condition if applicable)",
      "duration_ms": "integer (action duration)",
      "timestamp": "string (ISO 8601)"
    }
  ],

  // === NETWORK CONTEXT (RECOMMENDED - for AJAX content) ===
  "network_context": {
    "request_url": "string (API endpoint if different from page)",
    "request_method": "string (GET|POST|etc)",
    "request_headers": "object (relevant headers)",
    "response_status": "integer (HTTP status code)",
    "response_content_type": "string (MIME type)",
    "response_hash": "string (hash of API response)",
    "xhr_intercepted": "boolean",
    "api_version": "string (API version if known)"
  },

  // === USER AGENT CONTEXT (RECOMMENDED) ===
  "user_agent_context": {
    "user_agent": "string (full UA string)",
    "browser": "string (Chrome|Firefox|Chromium|etc)",
    "browser_version": "string",
    "headless": "boolean",
    "viewport_width": "integer",
    "viewport_height": "integer",
    "device_scale_factor": "number",
    "mobile": "boolean",
    "locale": "string (BCP 47)",
    "timezone": "string (IANA timezone)"
  },

  // === FREE-FORM ===
  "notes": "string (optional context or caveats)"
}
```

---

## Minimal Compliant Web Claim

For basic FAIR compliance, every web claim **MUST** have:

```json
{
  "claim_type": "MANDATORY",
  "claim_value": "MANDATORY",
  "source_url": "MANDATORY",
  "extracted_text": "MANDATORY",
  "retrieval_timestamp": "MANDATORY",
  "retrieval_agent": "MANDATORY",
  "extraction_method": "MANDATORY",
  "content_hash": "MANDATORY (integrity)",
  "text_fragment": "MANDATORY (direct linking)",
  "archive": {
    "memento_uri": "MANDATORY (archival fallback)"
  },
  "prov": {
    "wasDerivedFrom": "MANDATORY (provenance)"
  },
  "verification": {
    "status": "MANDATORY (freshness)"
  },
  "w3c_selectors": "MANDATORY (at least 2 selector types)"
}
```

---

## W3C Web Annotation Selector Types

The W3C Web Annotation Data Model defines several selector types. Use **at least 2** for redundancy:

### 1. CssSelector

```json
{
  "type": "CssSelector",
  "value": "#main-article > p.intro"
}
```

### 2. XPathSelector

```json
{
  "type": "XPathSelector",
  "value": "/html/body/main/article/div/p[1]"
}
```

### 3. TextQuoteSelector (Highly Recommended)

Most resilient to DOM changes:

```json
{
  "type": "TextQuoteSelector",
  "exact": "Arjen Dijkstra wordt de nieuwe directeur",
  "prefix": "15 juli 2022 - ",
  "suffix": " van Fries historisch"
}
```

### 4. TextPositionSelector

```json
{
  "type": "TextPositionSelector",
  "start": 412,
  "end": 795
}
```

### 5. FragmentSelector

```json
{
  "type": "FragmentSelector",
  "value": "section2",
  "conformsTo": "http://tools.ietf.org/rfc/rfc3236"
}
```

### 6. DataPositionSelector (for binary data)

```json
{
  "type": "DataPositionSelector",
  "start": 4096,
  "end": 4104
}
```

### 7. RangeSelector (for spanning selections)

```json
{
  "type": "RangeSelector",
  "startSelector": {
    "type": "XPathSelector",
    "value": "//table[1]/tr[1]/td[2]"
  },
  "endSelector": {
    "type": "XPathSelector",
    "value": "//table[1]/tr[1]/td[4]"
  }
}
```

---

## Accessibility-First Selectors (ARIA)

Modern web applications use semantic HTML and ARIA. These selectors are often **more stable** than CSS/XPath:

```json
"aria_selector": {
  "role": "article",
  "name": "Nieuws artikel over directeurswisseling",
  "label": "Article content",
  "testid": "news-article-content",
  "description": "Artikel over de nieuwe directeur van Tresoar"
}
```

**Playwright Locator Equivalents:**

- `page.getByRole('article', { name: 'Nieuws artikel' })`
- `page.getByLabel('Article content')`
- `page.getByTestId('news-article-content')`
- `page.getByText('Arjen Dijkstra wordt')`

---

## Content Hash Generation

Generate SHA-256 hash of `extracted_text` for integrity verification:

```python
import base64
import hashlib


def generate_content_hash(text: str) -> dict:
    """Generate SHA-256 hash for content integrity (W3C SRI format)."""
    # Hash the UTF-8 bytes of the text, then base64-encode the raw digest.
    hash_bytes = hashlib.sha256(text.encode("utf-8")).digest()
    hash_b64 = base64.b64encode(hash_bytes).decode("ascii")
    return {
        "algorithm": "sha256",
        "value": f"sha256-{hash_b64}",
        "scope": "extracted_text",
    }


# Example
text = "Arjen Dijkstra wordt de nieuwe directeur van Tresoar."
hash_obj = generate_content_hash(text)
# Result: {"algorithm": "sha256", "value": "sha256-abc123...", "scope": "extracted_text"}
```

---

## Text Fragment URL Generation

Create W3C Text Fragment URLs for direct linking:

```python
from urllib.parse import quote


def generate_text_fragment(source_url: str, text: str) -> str:
    """Generate URL with text fragment for direct linking."""
    # Truncate to the first 100 chars. A cut in the middle of a word can keep
    # the fragment from matching, so prefer passing a short, complete phrase.
    fragment_text = text[:100]
    # quote() leaves "-" unescaped, but "-" is syntax inside a text directive,
    # so it is percent-encoded explicitly.
    encoded = quote(fragment_text, safe="").replace("-", "%2D")
    return f"{source_url}#:~:text={encoded}"


# Example
url = "https://example.com/article"
text = "Arjen Dijkstra wordt de nieuwe directeur"
fragment_url = generate_text_fragment(url, text)
# Result: "https://example.com/article#:~:text=Arjen%20Dijkstra%20wordt%20de%20nieuwe%20directeur"
```

---

## Web Archive (Memento) Integration

Request archived version via Wayback Machine:

```python
from datetime import datetime
from typing import Optional

import requests


def get_memento_info(url: str, target_date: Optional[datetime] = None) -> dict:
    """Get Memento (archived) version info from Wayback Machine."""
    # Memento TimeGate (negotiates the capture closest to a requested datetime)
    timegate = f"https://web.archive.org/web/{url}"
    # TimeMap listing all versions
    timemap = f"https://web.archive.org/web/timemap/link/{url}"

    if target_date:
        # Get closest memento to target date
        date_str = target_date.strftime("%Y%m%d")
        memento_uri = f"https://web.archive.org/web/{date_str}/{url}"
    else:
        # No target date: use the TimeGate URL, which Wayback resolves to the
        # most recent capture. The "/web/*/" form opens the capture calendar
        # rather than a single snapshot.
        memento_uri = timegate

    return {
        "memento_uri": memento_uri,
        "timemap_uri": timemap,
        "timegate_uri": timegate,
        "archive_source": "web.archive.org",
    }


def check_wayback_availability(url: str) -> dict:
    """Check if URL is available in Wayback Machine."""
    # Pass the URL as a query parameter so requests percent-encodes it.
    response = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()

    snapshot = data.get("archived_snapshots", {}).get("closest")
    if snapshot:
        # The API reports timestamps as YYYYMMDDhhmmss; convert to ISO 8601.
        captured = datetime.strptime(snapshot["timestamp"], "%Y%m%d%H%M%S")
        return {
            "available": True,
            "memento_uri": snapshot["url"],
            "memento_datetime": captured.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "archive_source": "web.archive.org",
        }
    return {"available": False}
```

---

## Rendering Context Detection

For Single Page Applications (SPAs), detect the framework and rendering state:

```python
# CSS attribute selectors match exact attribute names only, so prefixed names
# such as data-v-<hash> (Vue) or _ngcontent-<id> (Angular) are found by
# scanning attribute names instead.
HAS_ATTR_PREFIX_JS = """(prefix) => Array.from(document.querySelectorAll('*')).some(
    (el) => el.getAttributeNames().some((name) => name.startsWith(prefix))
)"""


def detect_rendering_context(page) -> dict:
    """Detect JS framework and rendering state using Playwright."""
    context = {
        "framework_detected": "unknown",
        "js_execution_required": False,
        "client_side_rendered": False,
        "server_side_rendered": True,
    }

    # Check for React
    react_check = page.evaluate("""() => {
        return !!(window.__REACT_DEVTOOLS_GLOBAL_HOOK__ ||
                  document.querySelector('[data-reactroot]') ||
                  document.querySelector('[data-react-helmet]'))
    }""")
    if react_check:
        context["framework_detected"] = "React"
        context["js_execution_required"] = True

    # Check for Vue
    vue_check = page.evaluate("""() => {
        return !!(window.__VUE__ || document.querySelector('#__nuxt'))
    }""") or page.evaluate(HAS_ATTR_PREFIX_JS, "data-v-")
    if vue_check:
        context["framework_detected"] = "Vue"
        context["js_execution_required"] = True

    # Check for Angular
    angular_check = page.evaluate("""() => {
        return !!(window.ng || document.querySelector('[ng-version]'))
    }""") or page.evaluate(HAS_ATTR_PREFIX_JS, "_ngcontent-")
    if angular_check:
        context["framework_detected"] = "Angular"
        context["js_execution_required"] = True

    # page.content() serializes the DOM after scripts have run, so this size
    # check is only a rough heuristic for client-side rendering.
    rendered_html = page.content()
    context["client_side_rendered"] = len(rendered_html) < 5000  # Heuristic

    return context
```

---

## Interaction Sequence Recording

For dynamic content that requires user interaction:

```python
from datetime import datetime, timezone


def record_interaction_sequence(actions: list) -> list:
    """Record a sequence of interactions for provenance."""
    sequence = []
    for action in actions:
        entry = {
            "action": action["type"],
            # Timezone-aware UTC timestamp in ISO 8601 with a trailing "Z"
            "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        }

        if action["type"] == "navigate":
            entry["target"] = action["url"]
        elif action["type"] == "click":
            entry["target"] = action["selector"]
        elif action["type"] == "wait":
            entry["condition"] = action.get("condition", "networkidle")
            entry["duration_ms"] = action.get("duration_ms", 0)
        elif action["type"] == "scroll":
            entry["target"] = action.get("selector", "window")
            entry["scroll_position"] = action.get("position", {"x": 0, "y": 0})
        elif action["type"] == "extract":
            entry["target"] = action["selector"]

        sequence.append(entry)

    return sequence
```

---

## Verification Workflow

```
1. INITIAL EXTRACTION
   ├─ Navigate to source URL (record interaction)
   ├─ Wait for JS rendering if needed (record wait condition)
   ├─ Detect framework (record rendering_context)
   ├─ Extract text via multiple selectors (record w3c_selectors)
   ├─ Generate content_hash
   ├─ Generate text_fragment URL
   ├─ Check Wayback Machine availability
   ├─ Set verification.status = "verified"
   └─ Set next_verification_due (e.g., 90 days)

2. RE-VERIFICATION (automated)
   ├─ Fetch source URL again
   ├─ Replay interaction_sequence if needed
   ├─ Re-extract via same selectors (try all w3c_selectors)
   ├─ Compare content_hash
   │  ├─ MATCH: Update last_verified, keep status "verified"
   │  └─ MISMATCH: Set status "stale", log to verification_history
   └─ If source 404: Try memento_uri, set status "archived"

3. ARCHIVAL FALLBACK
   ├─ If source unavailable, check memento_uri
   ├─ If no memento, search archive.org for URL
   └─ Log archival source in verification_history
```

---

## Example: Complete Web Claim with Enhanced Provenance

```json
{
  "claim_id": "c47f3e8a-9b2d-4e1a-8c5f-7d6e9a0b1c2d",
  "claim_type": "role",
  "claim_value": "directeur Tresoar",
  "extracted_text": "Arjen Dijkstra wordt de nieuwe directeur van Fries historisch en letterkundig centrum Tresoar. Hij begint in oktober en volgt Bert Looper op.",
  "language": "nl",

  "source_url": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
  "canonical_url": null,
  "content_type": "text/html",

  "w3c_selectors": [
    {
      "type": "CssSelector",
      "value": "article.news-article > div.content > p:first-of-type"
    },
    {
      "type": "XPathSelector",
      "value": "/html/body/main/article/div[@class='content']/p[1]"
    },
    {
      "type": "TextQuoteSelector",
      "exact": "Arjen Dijkstra wordt de nieuwe directeur van Fries historisch en letterkundig centrum Tresoar",
      "prefix": "15 juli 2022 - ",
      "suffix": ". Hij begint in oktober"
    }
  ],

  "aria_selector": {
    "role": "article",
    "name": "Historisch centrum Tresoar vindt nieuwe directeur in Arjen Dijkstra"
  },

  "text_fragment": "#:~:text=Arjen%20Dijkstra%20wordt%20de%20nieuwe%20directeur",

  "css_selector": "article.news-article > div.content > p:first-of-type",
  "xpath_selector": "/html/body/main/article/div[@class='content']/p[1]",

  "published_date": "2022-07-15T14:15:00Z",
  "modified_date": null,
  "author": "Omrop Fryslân",

  "retrieval_timestamp": "2025-12-28T02:45:00Z",
  "retrieval_agent": "opencode/claude-sonnet-4",
  "extraction_method": "web-reader_webReader",

  "content_hash": {
    "algorithm": "sha256",
    "value": "sha256-K7gNU3sdo+OL0wNhqoVWhr3g6s1xYv72ol/pe/Unols=",
    "scope": "extracted_text"
  },
  "http_etag": null,
  "http_last_modified": null,

  "archive": {
    "memento_uri": "https://web.archive.org/web/20220716/https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
    "memento_datetime": "2022-07-16T00:00:00Z",
    "timemap_uri": "https://web.archive.org/web/timemap/link/https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
    "timegate_uri": "https://web.archive.org/web/https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
    "archive_source": "web.archive.org"
  },

  "prov": {
    "wasAttributedTo": {
      "@type": "prov:Agent",
      "name": "Omrop Fryslân",
      "url": "https://www.omropfryslan.nl"
    },
    "generatedAtTime": "2025-12-28T02:45:00Z",
    "wasDerivedFrom": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
    "wasGeneratedBy": {
      "@type": "prov:Activity",
      "name": "web_extraction",
      "used": "web-reader_webReader"
    }
  },

  "schema_org": {
    "@type": "Claim",
    "claimReviewed": "Arjen Dijkstra is directeur van Tresoar",
    "appearance": {
      "@type": "NewsArticle",
      "url": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
      "datePublished": "2022-07-15T14:15:00Z"
    },
    "author": {
      "@type": "Organization",
      "name": "Omrop Fryslân"
    }
  },

  "verification": {
    "status": "verified",
    "last_verified": "2025-12-28T02:45:00Z",
    "next_verification_due": "2026-03-28T00:00:00Z",
    "confidence_score": 0.95,
    "verification_history": [
      {
        "timestamp": "2025-12-28T02:45:00Z",
        "status": "verified",
        "content_hash": "sha256-K7gNU3sdo+OL0wNhqoVWhr3g6s1xYv72ol/pe/Unols=",
        "notes": "Initial extraction from Omrop Fryslân"
      }
    ]
  },

  "rendering_context": {
    "framework_detected": "vanilla",
    "js_execution_required": false,
    "client_side_rendered": false,
    "server_side_rendered": true,
    "wait_condition": "domcontentloaded",
    "wait_duration_ms": 1200
  },

  "interaction_sequence": [
    {
      "action": "navigate",
      "target": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
      "timestamp": "2025-12-28T02:44:58Z"
    },
    {
      "action": "wait",
      "condition": "domcontentloaded",
      "duration_ms": 1200,
      "timestamp": "2025-12-28T02:44:59Z"
    },
    {
      "action": "extract",
      "target": "article.news-article > div.content > p:first-of-type",
      "timestamp": "2025-12-28T02:45:00Z"
    }
  ],

  "user_agent_context": {
    "browser": "Chromium",
    "headless": true,
    "viewport_width": 1920,
    "viewport_height": 1080,
    "locale": "nl-NL",
    "timezone": "Europe/Amsterdam"
  },

  "notes": "Primary source for Tresoar director appointment announcement"
}
```

---

## Example: SPA with AJAX Content

For a Single Page Application that loads content via API:

```json
{
  "claim_id": "d58e4f9b-0c3e-5f2b-9d6a-8e7f0b2c3d4e",
  "claim_type": "role",
  "claim_value": "Head of Collections",
  "extracted_text": "Dr. Maria van der Berg serves as Head of Collections since 2021.",
  "language": "en",

  "source_url": "https://museum.example.nl/about/team",

  "w3c_selectors": [
    {
      "type": "CssSelector",
      "value": "[data-testid='team-member-card']:nth-child(3) .role"
    },
    {
      "type": "TextQuoteSelector",
      "exact": "Dr. Maria van der Berg serves as Head of Collections",
      "prefix": "",
      "suffix": " since 2021"
    }
  ],

  "aria_selector": {
    "role": "listitem",
    "name": "Dr. Maria van der Berg",
    "testid": "team-member-card"
  },

  "text_fragment": "#:~:text=Dr.%20Maria%20van%20der%20Berg%20serves%20as%20Head%20of%20Collections",

  "content_hash": {
    "algorithm": "sha256",
    "value": "sha256-xyz789...",
    "scope": "extracted_text"
  },

  "archive": {
    "memento_uri": "https://web.archive.org/web/20251228/https://museum.example.nl/about/team",
    "archive_source": "web.archive.org"
  },

  "prov": {
    "wasDerivedFrom": "https://museum.example.nl/about/team"
  },

  "verification": {
    "status": "verified",
    "last_verified": "2025-12-28T03:00:00Z"
  },

  "rendering_context": {
    "framework_detected": "React",
    "framework_version": "18.2.0",
    "hydration_complete": true,
    "client_side_rendered": true,
    "server_side_rendered": false,
    "js_execution_required": true,
    "wait_condition": "networkidle",
    "wait_selector": "[data-testid='team-loaded']",
    "wait_duration_ms": 3500
  },

  "interaction_sequence": [
    {
      "action": "navigate",
      "target": "https://museum.example.nl/about/team",
      "timestamp": "2025-12-28T02:59:55Z"
    },
    {
      "action": "wait",
      "condition": "networkidle",
      "duration_ms": 3500,
      "timestamp": "2025-12-28T02:59:58Z"
    },
    {
      "action": "scroll",
      "target": "[data-testid='team-member-card']:nth-child(3)",
      "timestamp": "2025-12-28T02:59:59Z"
    },
    {
      "action": "extract",
      "target": "[data-testid='team-member-card']:nth-child(3) .role",
      "timestamp": "2025-12-28T03:00:00Z"
    }
  ],

  "network_context": {
    "request_url": "https://api.museum.example.nl/v2/team",
    "request_method": "GET",
    "response_status": 200,
    "response_content_type": "application/json",
    "xhr_intercepted": true,
    "api_version": "v2"
  },

  "retrieval_timestamp": "2025-12-28T03:00:00Z",
  "retrieval_agent": "opencode/claude-sonnet-4",
  "extraction_method": "playwright_browser_snapshot"
}
```

---

## Implementation Priority

### Phase 1: Critical (implement immediately)

- `content_hash` - Content integrity verification
- `text_fragment` - URL-based text targeting
- `archive.memento_uri` - Archival fallback
- `prov.wasDerivedFrom` - Provenance tracing
- `verification.status` - Claim freshness
- `w3c_selectors` - Multiple selector types (at least 2)

### Phase 2: High (implement for SPAs)

- `rendering_context` - JS framework detection
- `interaction_sequence` - Dynamic content actions
- `aria_selector` - Accessibility-first selection
- `network_context` - AJAX request details

### Phase 3: Complete (full FAIR compliance)

- Full `prov` object - Complete PROV-O alignment
- Full `schema_org` object - Schema.org ClaimReview
- `verification_history` - Change tracking
- `user_agent_context` - Browser environment
- `dom_state` - Complete DOM state capture

---

## Selector Resilience Ranking

For maximum resilience to DOM changes, use selectors in this priority order:

1. **TextQuoteSelector** - Most resilient (content-based)
2. **aria_selector** - Very stable (semantic/accessibility)
3. **data-testid** - Stable (explicit test hooks)
4. **FragmentSelector** - Stable for named anchors
5. **CssSelector** - Moderate (structure-dependent)
6. **XPathSelector** - Brittle (exact path-dependent)
7. **TextPositionSelector** - Most brittle (offset-dependent)

**Best Practice:** Always include at least one TextQuoteSelector and one structural selector (CSS or XPath).


---

## Related Rules

- `.opencode/WEB_READER_PREFERRED_SCRAPER_RULE.md` - Preferred scraper tool
- `.opencode/DATA_PRESERVATION_RULES.md` - Never delete enriched data
- `.opencode/DATA_FABRICATION_PROHIBITION.md` - Real data only
- `.opencode/INITIALS_EXPANSION_PROHIBITION.md` - Never expand initials without verification
- `AGENTS.md` - Web Claim Provenance Requirements section