Web Claim Provenance Schema

Created: 2025-12-28
Updated: 2025-12-28
Status: Active Rule
Supersedes: Basic web_claims structure

Purpose

This schema defines a comprehensive provenance structure for web-sourced claims that:

Enables automated re-verification - Claims can be automatically re-checked
Ensures content integrity - Hashes detect if source content changed
Supports temporal tracking - Links to web archives for historical access
Aligns with FAIR principles - Findable, Accessible, Interoperable, Reusable
Maps to linked data standards - PROV-O, Schema.org ClaimReview, W3C Web Annotation
Handles modern web applications - SPAs, AJAX, JavaScript-rendered content

Standards Alignment

Standard	Purpose	Reference
W3C PROV-O	Provenance ontology	https://www.w3.org/TR/prov-o/
W3C Web Annotation Data Model	Selectors and states	https://www.w3.org/TR/annotation-model/
Schema.org ClaimReview	Fact-checking markup	https://schema.org/ClaimReview
RFC 7089 Memento	Web archival access	https://tools.ietf.org/html/rfc7089
W3C SRI	Subresource integrity	https://www.w3.org/TR/sri/
W3C Text Fragments	URL text targeting	https://wicg.github.io/scroll-to-text-fragment/
WAI-ARIA	Accessibility selectors	https://www.w3.org/WAI/ARIA/apg/
Playwright Locators	Modern selector strategies	https://playwright.dev/docs/locators

Mandatory vs Recommended Elements

MANDATORY Elements (Required for FAIR Compliance)

Every web_claim object MUST have these elements:

Element	Standard	Purpose
`content_hash`	W3C SRI	SHA-256 hash of `extracted_text` for integrity verification
`text_fragment`	W3C Text Fragments	URL `#:~:text=...` for direct linking to source text
`archive.memento_uri`	RFC 7089 Memento	Wayback Machine archived snapshot URL
`prov.wasDerivedFrom`	W3C PROV-O	Source URL for linked data tracing
`verification.status`	-	Claim freshness (`verified`/`stale`/`failed`/`pending`/`archived`)
`w3c_selectors`	W3C Web Annotation	At least 2 selector types for redundancy

RECOMMENDED Elements (For Full Provenance)

Element	Standard	Purpose
`aria_selector`	WAI-ARIA	Accessibility-based element identification
`rendering_context`	-	JS framework detection and execution state
`interaction_sequence`	-	Actions taken to reach dynamic content
`network_context`	-	AJAX/API request details for dynamic content
`user_agent_context`	-	Browser and viewport information

Complete Web Claim Structure

{
  "claim_id": "string (UUID recommended)",
  
  // === CLAIM CONTENT ===
  "claim_type": "string (role|tenure|education|biography|contact|etc)",
  "claim_value": "string (the extracted fact)",
  "extracted_text": "string (full text from which claim derived)",
  "language": "string (BCP 47 code: nl, en, fr, etc)",
  
  // === SOURCE IDENTIFICATION ===
  "source_url": "string (URL)",
  "canonical_url": "string (URL, if different from source_url)",
  "content_type": "string (MIME type: text/html)",
  
  // === W3C WEB ANNOTATION SELECTORS (MANDATORY - at least 2) ===
  "w3c_selectors": [
    {
      "type": "CssSelector",
      "value": "string (CSS selector path)"
    },
    {
      "type": "XPathSelector",
      "value": "string (XPath expression)"
    },
    {
      "type": "TextQuoteSelector",
      "exact": "string (the matched text)",
      "prefix": "string (text before match)",
      "suffix": "string (text after match)"
    },
    {
      "type": "TextPositionSelector",
      "start": "integer (character offset start)",
      "end": "integer (character offset end)"
    },
    {
      "type": "FragmentSelector",
      "value": "string (fragment identifier)",
      "conformsTo": "string (specification URL)"
    }
  ],
  
  // === ACCESSIBILITY SELECTORS (RECOMMENDED) ===
  "aria_selector": {
    "role": "string (ARIA role: article, heading, button, etc)",
    "name": "string (accessible name)",
    "label": "string (aria-label value)",
    "testid": "string (data-testid attribute)"
  },
  
  // === TEXT FRAGMENT (MANDATORY) ===
  "text_fragment": "string (W3C Text Fragment: #:~:text=...)",
  
  // === LEGACY SELECTORS (for backwards compatibility) ===
  "css_selector": "string (CSS selector path)",
  "xpath_selector": "string (XPath expression)",
  
  // === TEMPORAL METADATA ===
  "published_date": "string (ISO 8601, from article:published_time)",
  "modified_date": "string (ISO 8601, from article:modified_time)",
  "author": "string or object (content creator)",
  
  // === RETRIEVAL METADATA ===
  "retrieval_timestamp": "string (ISO 8601, when we fetched)",
  "retrieval_agent": "string (tool identifier: opencode/claude-sonnet-4)",
  "extraction_method": "string (MCP tool: web-reader_webReader)",
  
  // === CONTENT INTEGRITY (MANDATORY) ===
  "content_hash": {
    "algorithm": "sha256",
    "value": "string (SRI-style: 'sha256-' followed by the base64-encoded SHA-256 digest of extracted_text)",
    "scope": "extracted_text"
  },
  "http_etag": "string (ETag header from server response)",
  "http_last_modified": "string (Last-Modified header)",
  
  // === WEB ARCHIVAL (MANDATORY - RFC 7089 Memento) ===
  "archive": {
    "memento_uri": "string (archived snapshot URL)",
    "memento_datetime": "string (ISO 8601, archival datetime)",
    "timemap_uri": "string (TimeMap URL for all snapshots)",
    "timegate_uri": "string (TimeGate for datetime negotiation)",
    "archive_source": "string (web.archive.org, archive.today, etc)"
  },
  
  // === PROV-O ALIGNMENT (MANDATORY - wasDerivedFrom) ===
  "prov": {
    "wasAttributedTo": {
      "@type": "prov:Agent",
      "name": "string",
      "url": "string"
    },
    "generatedAtTime": "string (ISO 8601)",
    "wasDerivedFrom": "string (source URL or entity reference)",
    "wasGeneratedBy": {
      "@type": "prov:Activity",
      "name": "web_extraction",
      "used": "string (extraction_method)"
    }
  },
  
  // === SCHEMA.ORG CLAIMREVIEW ALIGNMENT ===
  "schema_org": {
    "@type": "Claim",
    "claimReviewed": "string (exact claim text)",
    "appearance": {
      "@type": "CreativeWork",
      "url": "string (source_url)",
      "datePublished": "string (published_date)"
    },
    "author": {
      "@type": "Person",
      "name": "string"
    }
  },
  
  // === VERIFICATION SUPPORT (MANDATORY - status) ===
  "verification": {
    "status": "string (verified|stale|failed|pending|archived)",
    "last_verified": "string (ISO 8601)",
    "next_verification_due": "string (ISO 8601)",
    "confidence_score": "number (0.0-1.0)",
    "verification_history": [
      {
        "timestamp": "string (ISO 8601)",
        "status": "string",
        "content_hash": "string",
        "notes": "string"
      }
    ]
  },
  
  // === RENDERING CONTEXT (RECOMMENDED - for SPAs) ===
  "rendering_context": {
    "framework_detected": "string (React|Vue|Angular|Svelte|vanilla|unknown)",
    "framework_version": "string (version if detectable)",
    "hydration_complete": "boolean",
    "client_side_rendered": "boolean",
    "server_side_rendered": "boolean",
    "js_execution_required": "boolean",
    "wait_condition": "string (load|domcontentloaded|networkidle)",
    "wait_selector": "string (selector waited for)",
    "wait_duration_ms": "integer (milliseconds waited)"
  },
  
  // === DOM STATE (RECOMMENDED - W3C Web Annotation TimeState) ===
  "dom_state": {
    "type": "TimeState",
    "sourceDate": "string (ISO 8601)",
    "dom_content_loaded": "boolean",
    "load_event_fired": "boolean",
    "mutation_observer_stable": "boolean",
    "scroll_position": {
      "x": "integer",
      "y": "integer"
    },
    "viewport": {
      "width": "integer",
      "height": "integer"
    }
  },
  
  // === INTERACTION SEQUENCE (RECOMMENDED - for dynamic content) ===
  "interaction_sequence": [
    {
      "action": "string (navigate|click|scroll|wait|type|extract)",
      "target": "string (URL or selector)",
      "value": "string (input value if applicable)",
      "condition": "string (wait condition if applicable)",
      "duration_ms": "integer (action duration)",
      "timestamp": "string (ISO 8601)"
    }
  ],
  
  // === NETWORK CONTEXT (RECOMMENDED - for AJAX content) ===
  "network_context": {
    "request_url": "string (API endpoint if different from page)",
    "request_method": "string (GET|POST|etc)",
    "request_headers": "object (relevant headers)",
    "response_status": "integer (HTTP status code)",
    "response_content_type": "string (MIME type)",
    "response_hash": "string (hash of API response)",
    "xhr_intercepted": "boolean",
    "api_version": "string (API version if known)"
  },
  
  // === USER AGENT CONTEXT (RECOMMENDED) ===
  "user_agent_context": {
    "user_agent": "string (full UA string)",
    "browser": "string (Chrome|Firefox|Chromium|etc)",
    "browser_version": "string",
    "headless": "boolean",
    "viewport_width": "integer",
    "viewport_height": "integer",
    "device_scale_factor": "number",
    "mobile": "boolean",
    "locale": "string (BCP 47)",
    "timezone": "string (IANA timezone)"
  },
  
  // === FREE-FORM ===
  "notes": "string (optional context or caveats)"
}

Minimal Compliant Web Claim

For basic FAIR compliance, every web claim MUST have:

{
  "claim_type": "MANDATORY",
  "claim_value": "MANDATORY",
  "source_url": "MANDATORY",
  "extracted_text": "MANDATORY",
  "retrieval_timestamp": "MANDATORY",
  "retrieval_agent": "MANDATORY",
  "extraction_method": "MANDATORY",
  
  "content_hash": "MANDATORY (integrity)",
  "text_fragment": "MANDATORY (direct linking)",
  "archive": {
    "memento_uri": "MANDATORY (archival fallback)"
  },
  "prov": {
    "wasDerivedFrom": "MANDATORY (provenance)"
  },
  "verification": {
    "status": "MANDATORY (freshness)"
  },
  "w3c_selectors": "MANDATORY (at least 2 selector types)"
}
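
The mandatory fields above can be checked mechanically. The following is a minimal sketch, not part of the schema itself: the names `MANDATORY_PATHS` and `find_compliance_problems` are hypothetical, and the check only tests for presence and for two distinct selector types, not for well-formed values.

```python
from typing import Any, List

# Dotted paths copied from the minimal compliant claim above.
MANDATORY_PATHS = [
    "claim_type", "claim_value", "source_url", "extracted_text",
    "retrieval_timestamp", "retrieval_agent", "extraction_method",
    "content_hash", "text_fragment", "archive.memento_uri",
    "prov.wasDerivedFrom", "verification.status", "w3c_selectors",
]


def _lookup(claim: dict, dotted: str) -> Any:
    """Follow a dotted path through nested dicts; return None if any key is absent."""
    node: Any = claim
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def find_compliance_problems(claim: dict) -> List[str]:
    """Return human-readable problems; an empty list means the claim passes."""
    problems = [
        f"missing: {path}"
        for path in MANDATORY_PATHS
        if _lookup(claim, path) in (None, "", [], {})
    ]
    selectors = claim.get("w3c_selectors") or []
    types = {s.get("type") for s in selectors if isinstance(s, dict)}
    if selectors and len(types) < 2:
        problems.append("w3c_selectors: need at least 2 distinct selector types")
    return problems
```

Returning a list of problems rather than a boolean lets a batch job log every gap in a claim in one pass.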

W3C Web Annotation Selector Types

The W3C Web Annotation Data Model defines several selector types. Use at least 2 for redundancy:

1. CssSelector

{
  "type": "CssSelector",
  "value": "#main-article > p.intro"
}

2. XPathSelector

{
  "type": "XPathSelector",
  "value": "/html/body/main/article/div/p[1]"
}

3. TextQuoteSelector (Highly Recommended)

Most resilient to DOM changes:

{
  "type": "TextQuoteSelector",
  "exact": "Arjen Dijkstra wordt de nieuwe directeur",
  "prefix": "15 juli 2022 - ",
  "suffix": " van Fries historisch"
}
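
To illustrate why this selector survives DOM changes, here is a small sketch of resolving it against plain page text. The function name `resolve_text_quote` is hypothetical, and the matching is deliberately strict (exact string comparison, no whitespace normalization or fuzzy matching), which a production resolver would probably relax.

```python
from typing import Optional, Tuple


def resolve_text_quote(text: str, selector: dict) -> Optional[Tuple[int, int]]:
    """Return (start, end) offsets of the first match whose context also matches."""
    exact = selector["exact"]
    prefix = selector.get("prefix", "")
    suffix = selector.get("suffix", "")

    start = text.find(exact)
    while start != -1:
        end = start + len(exact)
        # Accept the candidate only if the text around it agrees with prefix/suffix.
        if text[:start].endswith(prefix) and text[end:].startswith(suffix):
            return start, end
        start = text.find(exact, start + 1)
    return None
```

The returned offsets are the values a TextPositionSelector would store, so one successful text match can refresh the more brittle selectors.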

4. TextPositionSelector

{
  "type": "TextPositionSelector",
  "start": 412,
  "end": 795
}

5. FragmentSelector

{
  "type": "FragmentSelector",
  "value": "section2",
  "conformsTo": "http://tools.ietf.org/rfc/rfc3236"
}

6. DataPositionSelector (for binary data)

{
  "type": "DataPositionSelector",
  "start": 4096,
  "end": 4104
}

7. RangeSelector (for spanning selections)

{
  "type": "RangeSelector",
  "startSelector": {
    "type": "XPathSelector",
    "value": "//table[1]/tr[1]/td[2]"
  },
  "endSelector": {
    "type": "XPathSelector",
    "value": "//table[1]/tr[1]/td[4]"
  }
}

Accessibility-First Selectors (ARIA)

Modern web applications use semantic HTML and ARIA. These selectors are often more stable than CSS/XPath:

"aria_selector": {
  "role": "article",
  "name": "Nieuws artikel over directeurswisseling",
  "label": "Article content",
  "testid": "news-article-content",
  "description": "Artikel over de nieuwe directeur van Tresoar"
}

Playwright Locator Equivalents:

page.getByRole('article', { name: 'Nieuws artikel' })
page.getByLabel('Article content')
page.getByTestId('news-article-content')
page.getByText('Arjen Dijkstra wordt')

Content Hash Generation

Generate SHA-256 hash of extracted_text for integrity verification:

import hashlib
import base64

def generate_content_hash(text: str) -> dict:
    """Generate SHA-256 hash for content integrity (W3C SRI format)."""
    hash_bytes = hashlib.sha256(text.encode('utf-8')).digest()
    hash_b64 = base64.b64encode(hash_bytes).decode('ascii')
    return {
        "algorithm": "sha256",
        "value": f"sha256-{hash_b64}",
        "scope": "extracted_text"
    }

# Example
text = "Arjen Dijkstra wordt de nieuwe directeur van Tresoar."
hash_obj = generate_content_hash(text)
# Result: {"algorithm": "sha256", "value": "sha256-abc123...", "scope": "extracted_text"}
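
During re-verification the stored hash has to be compared with a hash of the freshly extracted text. A minimal sketch of that comparison follows; `content_hash_matches` is a hypothetical helper name, and it assumes the stored object has the `algorithm`/`value` layout shown above.

```python
import base64
import hashlib
import hmac


def content_hash_matches(text: str, stored: dict) -> bool:
    """Recompute the SRI-style hash of text and compare it with the stored value."""
    if stored.get("algorithm") != "sha256":
        raise ValueError(f"unsupported algorithm: {stored.get('algorithm')!r}")
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    expected = "sha256-" + base64.b64encode(digest).decode("ascii")
    # compare_digest runs in constant time; a plain == would also be correct here.
    return hmac.compare_digest(expected, str(stored.get("value", "")))
```

Any change to the text, including whitespace, produces a different hash, so the extraction step should clean the text the same way on every run.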

Text Fragment URL Generation

Create W3C Text Fragment URLs for direct linking:

from urllib.parse import quote

def generate_text_fragment(source_url: str, text: str) -> str:
    """Generate URL with text fragment for direct linking."""
    # Keep the directive short: use at most the first 100 characters.
    # A cut that lands mid-word can prevent a match, because text
    # fragments match on word boundaries.
    fragment_text = text[:100]
    # quote() leaves "-" unescaped, but "-", "," and "&" are directive
    # syntax and must be percent-encoded inside the text term.
    encoded = quote(fragment_text, safe="").replace("-", "%2D")
    return f"{source_url}#:~:text={encoded}"

# Example
url = "https://example.com/article"
text = "Arjen Dijkstra wordt de nieuwe directeur"
fragment_url = generate_text_fragment(url, text)
# Result: "https://example.com/article#:~:text=Arjen%20Dijkstra%20wordt%20de%20nieuwe%20directeur"
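
When the quoted text occurs more than once on a page, the text directive can carry context, using the form `text=prefix-,start,-suffix`. The sketch below derives such a directive from a TextQuoteSelector; `text_directive_from_quote` and `_encode_term` are hypothetical names, and surrounding whitespace is stripped from each term as a simplifying assumption.

```python
from urllib.parse import quote


def _encode_term(term: str) -> str:
    """Percent-encode one directive term, including the '-' that quote() leaves alone."""
    return quote(term.strip(), safe="").replace("-", "%2D")


def text_directive_from_quote(selector: dict) -> str:
    """Build '#:~:text=[prefix-,]exact[,-suffix]' from a TextQuoteSelector."""
    parts = []
    if selector.get("prefix", "").strip():
        parts.append(_encode_term(selector["prefix"]) + "-")
    parts.append(_encode_term(selector["exact"]))
    if selector.get("suffix", "").strip():
        parts.append("-" + _encode_term(selector["suffix"]))
    return "#:~:text=" + ",".join(parts)
```

Deriving the fragment from the selector keeps `text_fragment` and `w3c_selectors` in agreement, since both then come from the same quote.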

Web Archive (Memento) Integration

Request archived version via Wayback Machine:

import requests
from datetime import datetime, timezone
from typing import Optional

def get_memento_info(url: str, target_date: Optional[datetime] = None) -> dict:
    """Build Wayback Machine Memento URLs for a source URL (no network calls)."""

    # Memento TimeGate: redirects to the snapshot closest to the requested datetime
    timegate = f"https://web.archive.org/web/{url}"

    # TimeMap: lists all archived versions
    timemap = f"https://web.archive.org/web/timemap/link/{url}"

    if target_date:
        # Wayback resolves a partial timestamp to the closest snapshot
        date_str = target_date.strftime("%Y%m%d")
        memento_uri = f"https://web.archive.org/web/{date_str}/{url}"
    else:
        # Without a date, the TimeGate resolves to the most recent snapshot.
        # (The "/web/*/" form is a capture-listing page, not a memento.)
        memento_uri = timegate

    return {
        "memento_uri": memento_uri,
        "timemap_uri": timemap,
        "timegate_uri": timegate,
        "archive_source": "web.archive.org"
    }

def check_wayback_availability(url: str) -> dict:
    """Check if URL is available in Wayback Machine."""
    # Pass the URL via params so it is percent-encoded correctly
    response = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()

    snapshot = data.get("archived_snapshots", {}).get("closest")
    if snapshot and snapshot.get("available"):
        # The API returns a 14-digit timestamp (YYYYMMDDhhmmss); store it as ISO 8601
        captured = datetime.strptime(snapshot["timestamp"], "%Y%m%d%H%M%S")
        captured = captured.replace(tzinfo=timezone.utc)
        return {
            "available": True,
            "memento_uri": snapshot["url"],
            "memento_datetime": captured.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "archive_source": "web.archive.org"
        }
    return {"available": False}
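
RFC 7089 datetime negotiation works by sending an `Accept-Datetime` header to the TimeGate, and a TimeMap in link format lists each memento with its own `datetime` attribute. The sketch below covers both pieces without any network access. The function names are hypothetical, and the regular expression assumes `rel` appears before `datetime` in each entry, which holds for the examples in the RFC but is not guaranteed by the link format.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Matches TimeMap entries such as:
#   <memento-uri>; rel="first memento"; datetime="Sat, 16 Jul 2022 00:00:00 GMT",
_MEMENTO_LINK = re.compile(
    r'<([^>]+)>;\s*rel="[^"]*\bmemento\b[^"]*";\s*datetime="([^"]+)"'
)


def accept_datetime_headers(target: datetime) -> dict:
    """Build the request header a TimeGate uses to pick the closest memento."""
    if target.tzinfo is None:
        raise ValueError("target must be timezone-aware")
    http_date = format_datetime(target.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": http_date}


def parse_timemap(body: str) -> list:
    """Extract (memento_uri, memento_datetime) pairs from a link-format TimeMap."""
    return [
        (uri, parsedate_to_datetime(stamp))
        for uri, stamp in _MEMENTO_LINK.findall(body)
    ]
```

The headers can be passed to an HTTP client when requesting `timegate_uri`. The parsed TimeMap gives candidate values for `archive.memento_uri` and `archive.memento_datetime`.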

Rendering Context Detection

For Single Page Applications (SPAs), detect the framework and rendering state:

def detect_rendering_context(page) -> dict:
    """Detect JS framework and rendering state using Playwright."""

    context = {
        "framework_detected": "unknown",
        "js_execution_required": False,
        "client_side_rendered": False,
        "server_side_rendered": True
    }

    # Check for React
    react_check = page.evaluate("""() => {
        return !!(window.__REACT_DEVTOOLS_GLOBAL_HOOK__ ||
                  document.querySelector('[data-reactroot]') ||
                  document.querySelector('[data-react-helmet]'))
    }""")
    if react_check:
        context["framework_detected"] = "React"
        context["js_execution_required"] = True

    # Check for Vue. Scoped-style attributes look like data-v-<hash>, which an
    # attribute selector cannot match by prefix, so test the data-v-app
    # attribute that Vue 3 sets on the mount container instead.
    vue_check = page.evaluate("""() => {
        return !!(window.__VUE__ ||
                  document.querySelector('[data-v-app]') ||
                  document.querySelector('#__nuxt'))
    }""")
    if vue_check:
        context["framework_detected"] = "Vue"
        context["js_execution_required"] = True

    # Check for Angular. The _ngcontent-<id> attributes have the same prefix
    # problem as data-v-<hash>, so rely on ng-version on the root element.
    angular_check = page.evaluate("""() => {
        return !!(window.ng ||
                  document.querySelector('[ng-version]'))
    }""")
    if angular_check:
        context["framework_detected"] = "Angular"
        context["js_execution_required"] = True

    # Rough heuristic for client-side rendering. page.content() returns the
    # DOM serialized after scripts have run, not the raw server response, so
    # comparing against the raw response body would be more reliable.
    rendered_html = page.content()
    context["client_side_rendered"] = len(rendered_html) < 5000  # Heuristic
    context["server_side_rendered"] = not context["client_side_rendered"]

    return context

Interaction Sequence Recording

For dynamic content that requires user interaction:

from datetime import datetime, timezone

def record_interaction_sequence(actions: list) -> list:
    """Record a sequence of interactions for provenance."""
    sequence = []
    for action in actions:
        entry = {
            "action": action["type"],
            # Prefer the time the action actually ran, if the caller supplied
            # one (optional "timestamp" key); otherwise use the recording time.
            "timestamp": action.get("timestamp")
            or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        }

        if action["type"] == "navigate":
            entry["target"] = action["url"]
        elif action["type"] == "click":
            entry["target"] = action["selector"]
        elif action["type"] == "type":
            entry["target"] = action["selector"]
            entry["value"] = action.get("value", "")
        elif action["type"] == "wait":
            entry["condition"] = action.get("condition", "networkidle")
            entry["duration_ms"] = action.get("duration_ms", 0)
        elif action["type"] == "scroll":
            entry["target"] = action.get("selector", "window")
            entry["scroll_position"] = action.get("position", {"x": 0, "y": 0})
        elif action["type"] == "extract":
            entry["target"] = action["selector"]

        sequence.append(entry)

    return sequence
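
A recorded sequence is only useful if it can be replayed. The sketch below maps each recorded entry onto a Playwright sync `Page` call. `replay_interaction_sequence` is a hypothetical name, the `page` argument is assumed to be a Playwright `Page` (or any object with the same methods), and whole-window scroll entries are skipped rather than reproduced.

```python
def replay_interaction_sequence(page, sequence: list) -> list:
    """Replay recorded steps on a Playwright Page; return the extracted texts."""
    extracted = []
    for step in sequence:
        action = step["action"]
        if action == "navigate":
            page.goto(step["target"])
        elif action == "click":
            page.click(step["target"])
        elif action == "type":
            page.fill(step["target"], step.get("value", ""))
        elif action == "wait":
            # Recorded conditions are load states: load, domcontentloaded, networkidle
            page.wait_for_load_state(step.get("condition", "networkidle"))
        elif action == "scroll":
            # "window" means no specific element was recorded; nothing to bring into view
            if step.get("target", "window") != "window":
                page.locator(step["target"]).scroll_into_view_if_needed()
        elif action == "extract":
            extracted.append(page.inner_text(step["target"]))
        else:
            raise ValueError(f"unknown action: {action!r}")
    return extracted
```

Because the function only relies on method names, it can be exercised in tests with a stub object before running against a real browser.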

Verification Workflow

1. INITIAL EXTRACTION
   ├─ Navigate to source URL (record interaction)
   ├─ Wait for JS rendering if needed (record wait condition)
   ├─ Detect framework (record rendering_context)
   ├─ Extract text via multiple selectors (record w3c_selectors)
   ├─ Generate content_hash
   ├─ Generate text_fragment URL
   ├─ Check Wayback Machine availability
   ├─ Set verification.status = "verified"
   └─ Set next_verification_due (e.g., 90 days)

2. RE-VERIFICATION (automated)
   ├─ Fetch source URL again
   ├─ Replay interaction_sequence if needed
   ├─ Re-extract via same selectors (try all w3c_selectors)
   ├─ Compare content_hash
   │   ├─ MATCH: Update last_verified, keep status "verified"
   │   └─ MISMATCH: Set status "stale", log to verification_history
   └─ If source 404: Try memento_uri, set status "archived"

3. ARCHIVAL FALLBACK
   ├─ If source unavailable, check memento_uri
   ├─ If no memento, search archive.org for URL
   └─ Log archival source in verification_history
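
Step 2 of the workflow can be sketched as a pure function that takes the freshly extracted text and updates the claim. This is an illustration under assumptions rather than the project's implementation: `reverify` is a hypothetical name, `fresh_text=None` stands for "source unavailable", and the 90-day interval is the example value from step 1.

```python
import base64
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Optional

ISO = "%Y-%m-%dT%H:%M:%SZ"


def reverify(claim: dict, fresh_text: Optional[str],
             now: Optional[datetime] = None, interval_days: int = 90) -> dict:
    """Update claim['verification'] in place from a fresh extraction result."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    verification = claim.setdefault("verification", {})
    history = verification.setdefault("verification_history", [])

    if fresh_text is None:
        # Source returned 404 or similar: rely on archive.memento_uri
        status, new_hash, note = "archived", None, "Source unavailable; using memento_uri"
    else:
        digest = hashlib.sha256(fresh_text.encode("utf-8")).digest()
        new_hash = "sha256-" + base64.b64encode(digest).decode("ascii")
        if new_hash == claim["content_hash"]["value"]:
            status, note = "verified", "Content hash matches"
            verification["last_verified"] = now.strftime(ISO)
        else:
            status, note = "stale", "Content hash mismatch"

    verification["status"] = status
    verification["next_verification_due"] = (now + timedelta(days=interval_days)).strftime(ISO)
    history.append({
        "timestamp": now.strftime(ISO),
        "status": status,
        "content_hash": new_hash,
        "notes": note,
    })
    return claim
```

Keeping the network fetch outside this function means the status logic can be tested with fixed inputs and a fixed `now`.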

Example: Complete Web Claim with Enhanced Provenance

{
  "claim_id": "c47f3e8a-9b2d-4e1a-8c5f-7d6e9a0b1c2d",
  
  "claim_type": "role",
  "claim_value": "directeur Tresoar",
  "extracted_text": "Arjen Dijkstra wordt de nieuwe directeur van Fries historisch en letterkundig centrum Tresoar. Hij begint in oktober en volgt Bert Looper op.",
  "language": "nl",
  
  "source_url": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
  "canonical_url": null,
  "content_type": "text/html",
  
  "w3c_selectors": [
    {
      "type": "CssSelector",
      "value": "article.news-article > div.content > p:first-of-type"
    },
    {
      "type": "XPathSelector",
      "value": "/html/body/main/article/div[@class='content']/p[1]"
    },
    {
      "type": "TextQuoteSelector",
      "exact": "Arjen Dijkstra wordt de nieuwe directeur van Fries historisch en letterkundig centrum Tresoar",
      "prefix": "15 juli 2022 - ",
      "suffix": ". Hij begint in oktober"
    }
  ],
  
  "aria_selector": {
    "role": "article",
    "name": "Historisch centrum Tresoar vindt nieuwe directeur in Arjen Dijkstra"
  },
  
  "text_fragment": "#:~:text=Arjen%20Dijkstra%20wordt%20de%20nieuwe%20directeur",
  
  "css_selector": "article.news-article > div.content > p:first-of-type",
  "xpath_selector": "/html/body/main/article/div[@class='content']/p[1]",
  
  "published_date": "2022-07-15T14:15:00Z",
  "modified_date": null,
  "author": "Omrop Fryslân",
  
  "retrieval_timestamp": "2025-12-28T02:45:00Z",
  "retrieval_agent": "opencode/claude-sonnet-4",
  "extraction_method": "web-reader_webReader",
  
  "content_hash": {
    "algorithm": "sha256",
    "value": "sha256-K7gNU3sdo+OL0wNhqoVWhr3g6s1xYv72ol/pe/Unols=",
    "scope": "extracted_text"
  },
  "http_etag": null,
  "http_last_modified": null,
  
  "archive": {
    "memento_uri": "https://web.archive.org/web/20220716/https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
    "memento_datetime": "2022-07-16T00:00:00Z",
    "timemap_uri": "https://web.archive.org/web/timemap/link/https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
    "timegate_uri": "https://web.archive.org/web/https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
    "archive_source": "web.archive.org"
  },
  
  "prov": {
    "wasAttributedTo": {
      "@type": "prov:Agent",
      "name": "Omrop Fryslân",
      "url": "https://www.omropfryslan.nl"
    },
    "generatedAtTime": "2025-12-28T02:45:00Z",
    "wasDerivedFrom": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
    "wasGeneratedBy": {
      "@type": "prov:Activity",
      "name": "web_extraction",
      "used": "web-reader_webReader"
    }
  },
  
  "schema_org": {
    "@type": "Claim",
    "claimReviewed": "Arjen Dijkstra is directeur van Tresoar",
    "appearance": {
      "@type": "NewsArticle",
      "url": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
      "datePublished": "2022-07-15T14:15:00Z"
    },
    "author": {
      "@type": "Organization",
      "name": "Omrop Fryslân"
    }
  },
  
  "verification": {
    "status": "verified",
    "last_verified": "2025-12-28T02:45:00Z",
    "next_verification_due": "2026-03-28T00:00:00Z",
    "confidence_score": 0.95,
    "verification_history": [
      {
        "timestamp": "2025-12-28T02:45:00Z",
        "status": "verified",
        "content_hash": "sha256-K7gNU3sdo+OL0wNhqoVWhr3g6s1xYv72ol/pe/Unols=",
        "notes": "Initial extraction from Omrop Fryslân"
      }
    ]
  },
  
  "rendering_context": {
    "framework_detected": "vanilla",
    "js_execution_required": false,
    "client_side_rendered": false,
    "server_side_rendered": true,
    "wait_condition": "domcontentloaded",
    "wait_duration_ms": 1200
  },
  
  "interaction_sequence": [
    {
      "action": "navigate",
      "target": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
      "timestamp": "2025-12-28T02:44:58Z"
    },
    {
      "action": "wait",
      "condition": "domcontentloaded",
      "duration_ms": 1200,
      "timestamp": "2025-12-28T02:44:59Z"
    },
    {
      "action": "extract",
      "target": "article.news-article > div.content > p:first-of-type",
      "timestamp": "2025-12-28T02:45:00Z"
    }
  ],
  
  "user_agent_context": {
    "browser": "Chromium",
    "headless": true,
    "viewport_width": 1920,
    "viewport_height": 1080,
    "locale": "nl-NL",
    "timezone": "Europe/Amsterdam"
  },
  
  "notes": "Primary source for Tresoar director appointment announcement"
}

Example: SPA with AJAX Content

For a Single Page Application that loads content via API:

{
  "claim_id": "d58e4f9b-0c3e-5f2b-9d6a-8e7f0b2c3d4e",
  
  "claim_type": "role",
  "claim_value": "Head of Collections",
  "extracted_text": "Dr. Maria van der Berg serves as Head of Collections since 2021.",
  "language": "en",
  
  "source_url": "https://museum.example.nl/about/team",
  
  "w3c_selectors": [
    {
      "type": "CssSelector",
      "value": "[data-testid='team-member-card']:nth-child(3) .role"
    },
    {
      "type": "TextQuoteSelector",
      "exact": "Dr. Maria van der Berg serves as Head of Collections",
      "prefix": "",
      "suffix": " since 2021"
    }
  ],
  
  "aria_selector": {
    "role": "listitem",
    "name": "Dr. Maria van der Berg",
    "testid": "team-member-card"
  },
  
  "text_fragment": "#:~:text=Dr.%20Maria%20van%20der%20Berg%20serves%20as%20Head%20of%20Collections",
  
  "content_hash": {
    "algorithm": "sha256",
    "value": "sha256-xyz789...",
    "scope": "extracted_text"
  },
  
  "archive": {
    "memento_uri": "https://web.archive.org/web/20251228/https://museum.example.nl/about/team",
    "archive_source": "web.archive.org"
  },
  
  "prov": {
    "wasDerivedFrom": "https://museum.example.nl/about/team"
  },
  
  "verification": {
    "status": "verified",
    "last_verified": "2025-12-28T03:00:00Z"
  },
  
  "rendering_context": {
    "framework_detected": "React",
    "framework_version": "18.2.0",
    "hydration_complete": true,
    "client_side_rendered": true,
    "server_side_rendered": false,
    "js_execution_required": true,
    "wait_condition": "networkidle",
    "wait_selector": "[data-testid='team-loaded']",
    "wait_duration_ms": 3500
  },
  
  "interaction_sequence": [
    {
      "action": "navigate",
      "target": "https://museum.example.nl/about/team",
      "timestamp": "2025-12-28T02:59:55Z"
    },
    {
      "action": "wait",
      "condition": "networkidle",
      "duration_ms": 3500,
      "timestamp": "2025-12-28T02:59:58Z"
    },
    {
      "action": "scroll",
      "target": "[data-testid='team-member-card']:nth-child(3)",
      "timestamp": "2025-12-28T02:59:59Z"
    },
    {
      "action": "extract",
      "target": "[data-testid='team-member-card']:nth-child(3) .role",
      "timestamp": "2025-12-28T03:00:00Z"
    }
  ],
  
  "network_context": {
    "request_url": "https://api.museum.example.nl/v2/team",
    "request_method": "GET",
    "response_status": 200,
    "response_content_type": "application/json",
    "xhr_intercepted": true,
    "api_version": "v2"
  },
  
  "retrieval_timestamp": "2025-12-28T03:00:00Z",
  "retrieval_agent": "opencode/claude-sonnet-4",
  "extraction_method": "playwright_browser_snapshot"
}

Implementation Priority

Phase 1: Critical (implement immediately)

content_hash - Content integrity verification
text_fragment - URL-based text targeting
archive.memento_uri - Archival fallback
prov.wasDerivedFrom - Provenance tracing
verification.status - Claim freshness
w3c_selectors - Multiple selector types (at least 2)

Phase 2: High (implement for SPAs)

rendering_context - JS framework detection
interaction_sequence - Dynamic content actions
aria_selector - Accessibility-first selection
network_context - AJAX request details

Phase 3: Complete (full FAIR compliance)

Full prov object - Complete PROV-O alignment
Full schema_org object - Schema.org ClaimReview
verification_history - Change tracking
user_agent_context - Browser environment
dom_state - Complete DOM state capture

Selector Resilience Ranking

For maximum resilience to DOM changes, use selectors in this priority order:

1. TextQuoteSelector - Most resilient (content-based)
2. aria_selector - Very stable (semantic/accessibility)
3. data-testid - Stable (explicit test hooks)
4. FragmentSelector - Stable for named anchors
5. CssSelector - Moderate (structure-dependent)
6. XPathSelector - Brittle (exact path-dependent)
7. TextPositionSelector - Most brittle (offset-dependent)

Best Practice: Always include at least one TextQuoteSelector and one structural selector (CSS or XPath).
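
During re-verification, the ranking can decide which entry of `w3c_selectors` to try first. The sketch below is a hypothetical helper pair, not part of the schema. It covers only the selector types that can appear in `w3c_selectors`, since `aria_selector` and `data-testid` are stored in a separate object.

```python
# W3C selector types from the ranking above, most resilient first.
RESILIENCE_ORDER = [
    "TextQuoteSelector",
    "FragmentSelector",
    "CssSelector",
    "XPathSelector",
    "TextPositionSelector",
]
_RANK = {name: i for i, name in enumerate(RESILIENCE_ORDER)}


def order_by_resilience(selectors: list) -> list:
    """Return selectors sorted so the most resilient type is tried first."""
    # Unknown types sort last; sorted() is stable, so ties keep their original order.
    return sorted(selectors, key=lambda s: _RANK.get(s.get("type"), len(_RANK)))


def follows_best_practice(selectors: list) -> bool:
    """True if there is a TextQuoteSelector plus a structural (CSS or XPath) selector."""
    types = {s.get("type") for s in selectors}
    return "TextQuoteSelector" in types and bool(types & {"CssSelector", "XPathSelector"})
```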

Related Documentation

.opencode/WEB_READER_PREFERRED_SCRAPER_RULE.md - Preferred scraper tool
.opencode/DATA_PRESERVATION_RULES.md - Never delete enriched data
.opencode/DATA_FABRICATION_PROHIBITION.md - Real data only
.opencode/INITIALS_EXPANSION_PROHIBITION.md - Never expand initials without verification
AGENTS.md - Web Claim Provenance Requirements section