glam/.opencode/WEB_CLAIM_PROVENANCE_SCHEMA.md

# Web Claim Provenance Schema

**Created**: 2025-12-28
**Updated**: 2025-12-28
**Status**: Active Rule
**Supersedes**: Basic web_claims structure

## Purpose

This schema defines a comprehensive provenance structure for web-sourced claims that:
1. **Enables automated re-verification** - Claims can be automatically re-checked
2. **Ensures content integrity** - Hashes detect if source content changed
3. **Supports temporal tracking** - Links to web archives for historical access
4. **Aligns with FAIR principles** - Findable, Accessible, Interoperable, Reusable
5. **Maps to linked data standards** - PROV-O, Schema.org ClaimReview, W3C Web Annotation
6. **Handles modern web applications** - SPAs, AJAX, JavaScript-rendered content

## Standards Alignment

| Standard | Purpose | Reference |
|----------|---------|-----------|
| W3C PROV-O | Provenance ontology | https://www.w3.org/TR/prov-o/ |
| W3C Web Annotation Data Model | Selectors and states | https://www.w3.org/TR/annotation-model/ |
| Schema.org ClaimReview | Fact-checking markup | https://schema.org/ClaimReview |
| RFC 7089 Memento | Web archival access | https://tools.ietf.org/html/rfc7089 |
| W3C SRI | Subresource integrity | https://www.w3.org/TR/sri/ |
| W3C Text Fragments | URL text targeting | https://wicg.github.io/scroll-to-text-fragment/ |
| WAI-ARIA | Accessibility selectors | https://www.w3.org/WAI/ARIA/apg/ |
| Playwright Locators | Modern selector strategies | https://playwright.dev/docs/locators |

---

## Mandatory vs Recommended Elements

### MANDATORY Elements (Required for FAIR Compliance)

Every `web_claim` object **MUST** have these elements:

| Element | Standard | Purpose |
|---------|----------|---------|
| `content_hash` | W3C SRI | SHA-256 hash of `extracted_text` for integrity verification |
| `text_fragment` | W3C Text Fragments | URL `#:~:text=...` for direct linking to source text |
| `archive.memento_uri` | RFC 7089 Memento | Wayback Machine archived snapshot URL |
| `prov.wasDerivedFrom` | W3C PROV-O | Source URL for linked data tracing |
| `verification.status` | - | Claim freshness (`verified`/`stale`/`failed`/`pending`) |
| `w3c_selectors` | W3C Web Annotation | At least 2 selector types for redundancy |

### RECOMMENDED Elements (For Full Provenance)

| Element | Standard | Purpose |
|---------|----------|---------|
| `aria_selector` | WAI-ARIA | Accessibility-based element identification |
| `rendering_context` | - | JS framework detection and execution state |
| `interaction_sequence` | - | Actions taken to reach dynamic content |
| `network_context` | - | AJAX/API request details for dynamic content |
| `user_agent_context` | - | Browser and viewport information |

---

## Complete Web Claim Structure

```jsonc
{
  "claim_id": "string (UUID recommended)",

  // === CLAIM CONTENT ===
  "claim_type": "string (role|tenure|education|biography|contact|etc)",
  "claim_value": "string (the extracted fact)",
  "extracted_text": "string (full text from which claim derived)",
  "language": "string (BCP 47 code: nl, en, fr, etc)",

  // === SOURCE IDENTIFICATION ===
  "source_url": "string (URL)",
  "canonical_url": "string (URL, if different from source_url)",
  "content_type": "string (MIME type: text/html)",

  // === W3C WEB ANNOTATION SELECTORS (MANDATORY - at least 2) ===
  "w3c_selectors": [
    {
      "type": "CssSelector",
      "value": "string (CSS selector path)"
    },
    {
      "type": "XPathSelector",
      "value": "string (XPath expression)"
    },
    {
      "type": "TextQuoteSelector",
      "exact": "string (the matched text)",
      "prefix": "string (text before match)",
      "suffix": "string (text after match)"
    },
    {
      "type": "TextPositionSelector",
      "start": "integer (character offset start)",
      "end": "integer (character offset end)"
    },
    {
      "type": "FragmentSelector",
      "value": "string (fragment identifier)",
      "conformsTo": "string (specification URL)"
    }
  ],

  // === ACCESSIBILITY SELECTORS (RECOMMENDED) ===
  "aria_selector": {
    "role": "string (ARIA role: article, heading, button, etc)",
    "name": "string (accessible name)",
    "label": "string (aria-label value)",
    "testid": "string (data-testid attribute)"
  },

  // === TEXT FRAGMENT (MANDATORY) ===
  "text_fragment": "string (W3C Text Fragment: #:~:text=...)",

  // === LEGACY SELECTORS (for backwards compatibility) ===
  "css_selector": "string (CSS selector path)",
  "xpath_selector": "string (XPath expression)",

  // === TEMPORAL METADATA ===
  "published_date": "string (ISO 8601, from article:published_time)",
  "modified_date": "string (ISO 8601, from article:modified_time)",
  "author": "string or object (content creator)",

  // === RETRIEVAL METADATA ===
  "retrieval_timestamp": "string (ISO 8601, when we fetched)",
  "retrieval_agent": "string (tool identifier: opencode/claude-sonnet-4)",
  "extraction_method": "string (MCP tool: web-reader_webReader)",

  // === CONTENT INTEGRITY (MANDATORY) ===
  "content_hash": {
    "algorithm": "sha256",
    "value": "string (SRI format: sha256-<base64 digest> of extracted_text)",
    "scope": "extracted_text"
  },
  "http_etag": "string (ETag header from server response)",
  "http_last_modified": "string (Last-Modified header)",

  // === WEB ARCHIVAL (MANDATORY - RFC 7089 Memento) ===
  "archive": {
    "memento_uri": "string (archived snapshot URL)",
    "memento_datetime": "string (ISO 8601, archival datetime)",
    "timemap_uri": "string (TimeMap URL for all snapshots)",
    "timegate_uri": "string (TimeGate for datetime negotiation)",
    "archive_source": "string (web.archive.org, archive.today, etc)"
  },

  // === PROV-O ALIGNMENT (MANDATORY - wasDerivedFrom) ===
  "prov": {
    "wasAttributedTo": {
      "@type": "prov:Agent",
      "name": "string",
      "url": "string"
    },
    "generatedAtTime": "string (ISO 8601)",
    "wasDerivedFrom": "string (source URL or entity reference)",
    "wasGeneratedBy": {
      "@type": "prov:Activity",
      "name": "web_extraction",
      "used": "string (extraction_method)"
    }
  },

  // === SCHEMA.ORG CLAIMREVIEW ALIGNMENT ===
  "schema_org": {
    "@type": "Claim",
    "claimReviewed": "string (exact claim text)",
    "appearance": {
      "@type": "CreativeWork",
      "url": "string (source_url)",
      "datePublished": "string (published_date)"
    },
    "author": {
      "@type": "Person",
      "name": "string"
    }
  },

  // === VERIFICATION SUPPORT (MANDATORY - status) ===
  "verification": {
    "status": "string (verified|stale|failed|pending)",
    "last_verified": "string (ISO 8601)",
    "next_verification_due": "string (ISO 8601)",
    "confidence_score": "number (0.0-1.0)",
    "verification_history": [
      {
        "timestamp": "string (ISO 8601)",
        "status": "string",
        "content_hash": "string",
        "notes": "string"
      }
    ]
  },

  // === RENDERING CONTEXT (RECOMMENDED - for SPAs) ===
  "rendering_context": {
    "framework_detected": "string (React|Vue|Angular|Svelte|vanilla|unknown)",
    "framework_version": "string (version if detectable)",
    "hydration_complete": "boolean",
    "client_side_rendered": "boolean",
    "server_side_rendered": "boolean",
    "js_execution_required": "boolean",
    "wait_condition": "string (load|domcontentloaded|networkidle)",
    "wait_selector": "string (selector waited for)",
    "wait_duration_ms": "integer (milliseconds waited)"
  },

  // === DOM STATE (RECOMMENDED - W3C Web Annotation TimeState) ===
  "dom_state": {
    "type": "TimeState",
    "sourceDate": "string (ISO 8601)",
    "dom_content_loaded": "boolean",
    "load_event_fired": "boolean",
    "mutation_observer_stable": "boolean",
    "scroll_position": {
      "x": "integer",
      "y": "integer"
    },
    "viewport": {
      "width": "integer",
      "height": "integer"
    }
  },

  // === INTERACTION SEQUENCE (RECOMMENDED - for dynamic content) ===
  "interaction_sequence": [
    {
      "action": "string (navigate|click|scroll|wait|type|extract)",
      "target": "string (URL or selector)",
      "value": "string (input value if applicable)",
      "condition": "string (wait condition if applicable)",
      "duration_ms": "integer (action duration)",
      "timestamp": "string (ISO 8601)"
    }
  ],

  // === NETWORK CONTEXT (RECOMMENDED - for AJAX content) ===
  "network_context": {
    "request_url": "string (API endpoint if different from page)",
    "request_method": "string (GET|POST|etc)",
    "request_headers": "object (relevant headers)",
    "response_status": "integer (HTTP status code)",
    "response_content_type": "string (MIME type)",
    "response_hash": "string (hash of API response)",
    "xhr_intercepted": "boolean",
    "api_version": "string (API version if known)"
  },

  // === USER AGENT CONTEXT (RECOMMENDED) ===
  "user_agent_context": {
    "user_agent": "string (full UA string)",
    "browser": "string (Chrome|Firefox|Chromium|etc)",
    "browser_version": "string",
    "headless": "boolean",
    "viewport_width": "integer",
    "viewport_height": "integer",
    "device_scale_factor": "number",
    "mobile": "boolean",
    "locale": "string (BCP 47)",
    "timezone": "string (IANA timezone)"
  },

  // === FREE-FORM ===
  "notes": "string (optional context or caveats)"
}
```

---

## Minimal Compliant Web Claim

For basic FAIR compliance, every web claim **MUST** have:

```json
{
  "claim_type": "MANDATORY",
  "claim_value": "MANDATORY",
  "source_url": "MANDATORY",
  "extracted_text": "MANDATORY",
  "retrieval_timestamp": "MANDATORY",
  "retrieval_agent": "MANDATORY",
  "extraction_method": "MANDATORY",

  "content_hash": "MANDATORY (integrity)",
  "text_fragment": "MANDATORY (direct linking)",
  "archive": {
    "memento_uri": "MANDATORY (archival fallback)"
  },
  "prov": {
    "wasDerivedFrom": "MANDATORY (provenance)"
  },
  "verification": {
    "status": "MANDATORY (freshness)"
  },
  "w3c_selectors": "MANDATORY (at least 2 selector types)"
}
```
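
The minimal structure above can be checked mechanically. The following is a minimal sketch, not part of the schema itself: the names `MANDATORY_FIELDS`, `_lookup`, and `find_compliance_problems` are hypothetical, and the check only tests for presence, not for well-formed values.

```python
# Dotted paths of the mandatory elements listed in the minimal claim above.
MANDATORY_FIELDS = [
    "claim_type", "claim_value", "source_url", "extracted_text",
    "retrieval_timestamp", "retrieval_agent", "extraction_method",
    "content_hash", "text_fragment",
    "archive.memento_uri", "prov.wasDerivedFrom", "verification.status",
]

def _lookup(claim: dict, dotted_path: str):
    """Follow a dotted path such as 'archive.memento_uri'; None if absent."""
    node = claim
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

def find_compliance_problems(claim: dict) -> list:
    """Return human-readable problems; an empty list means minimally compliant."""
    problems = [
        f"missing or empty: {path}"
        for path in MANDATORY_FIELDS
        if _lookup(claim, path) in (None, "", {}, [])
    ]
    # The schema asks for at least 2 selector *types*, so count distinct types.
    selectors = claim.get("w3c_selectors") or []
    selector_types = {s.get("type") for s in selectors if isinstance(s, dict)}
    selector_types.discard(None)
    if len(selector_types) < 2:
        problems.append("w3c_selectors needs at least 2 distinct selector types")
    return problems
```

Returning a list of problems rather than a boolean makes it possible to log exactly which mandatory element is missing.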

---

## W3C Web Annotation Selector Types

The W3C Web Annotation Data Model defines several selector types. Use **at least 2** for redundancy:

### 1. CssSelector
```json
{
  "type": "CssSelector",
  "value": "#main-article > p.intro"
}
```

### 2. XPathSelector
```json
{
  "type": "XPathSelector",
  "value": "/html/body/main/article/div/p[1]"
}
```

### 3. TextQuoteSelector (Highly Recommended)
Most resilient to DOM changes:
```json
{
  "type": "TextQuoteSelector",
  "exact": "Arjen Dijkstra wordt de nieuwe directeur",
  "prefix": "15 juli 2022 - ",
  "suffix": " van Fries historisch"
}
```
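
To illustrate why a TextQuoteSelector survives DOM changes, the sketch below resolves one against plain page text and returns the matching offsets as a TextPositionSelector. The function name `resolve_text_quote` is hypothetical, and the sketch assumes exact string matching with no whitespace normalization.

```python
from typing import Optional

def resolve_text_quote(page_text: str, selector: dict) -> Optional[dict]:
    """Locate a TextQuoteSelector in page_text.

    Returns an equivalent TextPositionSelector, or None if no occurrence of
    `exact` is directly preceded by `prefix` and followed by `suffix`.
    """
    exact = selector["exact"]
    prefix = selector.get("prefix", "")
    suffix = selector.get("suffix", "")

    start = page_text.find(exact)
    while start != -1:
        end = start + len(exact)
        # Check the context on both sides of this occurrence.
        if page_text.endswith(prefix, 0, start) and page_text.startswith(suffix, end):
            return {"type": "TextPositionSelector", "start": start, "end": end}
        start = page_text.find(exact, start + 1)  # try the next occurrence
    return None
```

Because only the text and its immediate context are compared, the selector keeps resolving when the surrounding markup is restructured, as long as the wording itself is unchanged.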

### 4. TextPositionSelector
```json
{
  "type": "TextPositionSelector",
  "start": 412,
  "end": 795
}
```

### 5. FragmentSelector
```json
{
  "type": "FragmentSelector",
  "value": "section2",
  "conformsTo": "http://tools.ietf.org/rfc/rfc3236"
}
```

### 6. DataPositionSelector (for binary data)
```json
{
  "type": "DataPositionSelector",
  "start": 4096,
  "end": 4104
}
```

### 7. RangeSelector (for spanning selections)
```json
{
  "type": "RangeSelector",
  "startSelector": {
    "type": "XPathSelector",
    "value": "//table[1]/tr[1]/td[2]"
  },
  "endSelector": {
    "type": "XPathSelector",
    "value": "//table[1]/tr[1]/td[4]"
  }
}
```

---

## Accessibility-First Selectors (ARIA)

Modern web applications use semantic HTML and ARIA. These selectors are often **more stable** than CSS/XPath:

```json
"aria_selector": {
  "role": "article",
  "name": "Nieuws artikel over directeurswisseling",
  "label": "Article content",
  "testid": "news-article-content",
  "description": "Artikel over de nieuwe directeur van Tresoar"
}
```

**Playwright Locator Equivalents:**
- `page.getByRole('article', { name: 'Nieuws artikel' })`
- `page.getByLabel('Article content')`
- `page.getByTestId('news-article-content')`
- `page.getByText('Arjen Dijkstra wordt')`
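
The locators above use Playwright's JavaScript API. A minimal Python sketch of the same mapping follows; `locator_from_aria` is a hypothetical helper, and the order in which the fields are tried (role, then test id, then label) is an assumption, not something the schema prescribes.

```python
def locator_from_aria(page, aria_selector: dict):
    """Build a Playwright (Python) locator from an aria_selector object."""
    role = aria_selector.get("role")
    if role:
        name = aria_selector.get("name")
        # get_by_role matches the accessible name as a substring by default.
        if name:
            return page.get_by_role(role, name=name)
        return page.get_by_role(role)
    if aria_selector.get("testid"):
        return page.get_by_test_id(aria_selector["testid"])
    if aria_selector.get("label"):
        return page.get_by_label(aria_selector["label"])
    raise ValueError("aria_selector has no usable role, testid, or label")
```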

---

## Content Hash Generation

Generate SHA-256 hash of `extracted_text` for integrity verification:

```python
import hashlib
import base64

def generate_content_hash(text: str) -> dict:
    """Generate SHA-256 hash for content integrity (W3C SRI format)."""
    hash_bytes = hashlib.sha256(text.encode('utf-8')).digest()
    hash_b64 = base64.b64encode(hash_bytes).decode('ascii')
    return {
        "algorithm": "sha256",
        "value": f"sha256-{hash_b64}",
        "scope": "extracted_text"
    }

# Example
text = "Arjen Dijkstra wordt de nieuwe directeur van Tresoar."
hash_obj = generate_content_hash(text)
# Result: {"algorithm": "sha256", "value": "sha256-abc123...", "scope": "extracted_text"}
```
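
During re-verification the stored hash has to be compared with a freshly extracted text. A minimal sketch, assuming the `sha256-<base64>` value format shown above; `content_hash_matches` is a hypothetical name. Note that any whitespace difference in the extracted text changes the hash, so extraction should be deterministic.

```python
import base64
import hashlib

def content_hash_matches(text: str, content_hash: dict) -> bool:
    """Recompute the SHA-256 of `text` and compare it with a stored content_hash."""
    if content_hash.get("algorithm") != "sha256":
        raise ValueError(f"unsupported algorithm: {content_hash.get('algorithm')}")
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    recomputed = "sha256-" + base64.b64encode(digest).decode("ascii")
    return recomputed == content_hash["value"]
```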

---

## Text Fragment URL Generation

Create W3C Text Fragment URLs for direct linking:

```python
from urllib.parse import quote

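def generate_text_fragment(source_url: str, text: str, max_chars: int = 100) -> str:
    """Generate URL with text fragment for direct linking."""
    fragment_text = text.strip()
    if len(fragment_text) > max_chars:
        # Cut at a word boundary: text fragments only match whole words
        fragment_text = fragment_text[:max_chars].rsplit(" ", 1)[0]
    # Encode all reserved characters; "-" is a text-directive delimiter that
    # quote() never escapes, so encode it by hand
    encoded = quote(fragment_text, safe="").replace("-", "%2D")
    return f"{source_url}#:~:text={encoded}"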

# Example
url = "https://example.com/article"
text = "Arjen Dijkstra wordt de nieuwe directeur"
fragment_url = generate_text_fragment(url, text)
# Result: "https://example.com/article#:~:text=Arjen%20Dijkstra%20wordt%20de%20nieuwe%20directeur"
```
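
When the quoted text occurs more than once on a page, the text directive can carry context in the form `#:~:text=[prefix-,]text[,-suffix]`. The sketch below derives such a fragment from a TextQuoteSelector; the helper names are hypothetical, and the ranged `textStart,textEnd` form is deliberately left out.

```python
from urllib.parse import quote

def _encode_directive_part(text: str) -> str:
    # quote() leaves "-" alone, but "-" is a delimiter inside a text directive,
    # so it is percent-encoded by hand; "," and "&" are handled by quote().
    return quote(text.strip(), safe="").replace("-", "%2D")

def text_fragment_from_quote(selector: dict) -> str:
    """Build '#:~:text=[prefix-,]exact[,-suffix]' from a TextQuoteSelector."""
    parts = []
    if selector.get("prefix", "").strip():
        parts.append(_encode_directive_part(selector["prefix"]) + "-")
    parts.append(_encode_directive_part(selector["exact"]))
    if selector.get("suffix", "").strip():
        parts.append("-" + _encode_directive_part(selector["suffix"]))
    return "#:~:text=" + ",".join(parts)
```

Reusing the selector's `prefix` and `suffix` keeps the `text_fragment` and the TextQuoteSelector pointing at the same occurrence.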

---

## Web Archive (Memento) Integration

Build Memento URIs for the Wayback Machine and check whether a snapshot exists:

```python
import requests
from datetime import datetime
from typing import Optional

def get_memento_info(url: str, target_date: Optional[datetime] = None) -> dict:
    """Build Memento (RFC 7089) URIs for a URL on the Wayback Machine.

    No request is made here. The returned memento_uri redirects to the
    closest capture and should be resolved before it is stored.
    """

    # Memento TimeGate
    timegate = f"https://web.archive.org/web/{url}"

    # TimeMap listing all captures of the URL
    timemap = f"https://web.archive.org/web/timemap/link/{url}"

    if target_date:
        # A dated URL redirects to the capture closest to that date
        date_str = target_date.strftime("%Y%m%d")
        memento_uri = f"https://web.archive.org/web/{date_str}/{url}"
    else:
        # Without a date, the TimeGate serves the most recent capture
        # (".../web/*/..." is the calendar listing page, not a memento)
        memento_uri = timegate

    return {
        "memento_uri": memento_uri,
        "timemap_uri": timemap,
        "timegate_uri": timegate,
        "archive_source": "web.archive.org"
    }

def check_wayback_availability(url: str) -> dict:
    """Check if URL is available in Wayback Machine."""
    response = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},  # let requests URL-encode the target URL
        timeout=30,
    )
    response.raise_for_status()
    snapshot = response.json().get("archived_snapshots", {}).get("closest")

    if snapshot and snapshot.get("available"):
        # The API returns a 14-digit YYYYMMDDhhmmss timestamp;
        # the schema expects ISO 8601 for memento_datetime
        captured = datetime.strptime(snapshot["timestamp"], "%Y%m%d%H%M%S")
        return {
            "available": True,
            "memento_uri": snapshot["url"],
            "memento_datetime": captured.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "archive_source": "web.archive.org"
        }
    return {"available": False}
```
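
Wayback snapshot URLs embed the capture timestamp, so `memento_datetime` can be derived from a stored `memento_uri` without a network call. A minimal sketch; `memento_datetime_from_uri` is a hypothetical helper, it only understands `web.archive.org` URLs, and it assumes the embedded timestamp is in UTC.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Matches e.g. https://web.archive.org/web/20220716/https://example.com/
# and also timestamps followed by a two-letter modifier such as "id_".
_WAYBACK_URI = re.compile(r"^https?://web\.archive\.org/web/(\d{8,14})(?:[a-z]{2}_)?/")

def memento_datetime_from_uri(memento_uri: str) -> Optional[str]:
    """Return the capture time of a Wayback URL as ISO 8601, or None."""
    match = _WAYBACK_URI.match(memento_uri)
    if not match:
        return None  # not a dated Wayback URL (e.g. the "*" listing page)
    digits = match.group(1)
    if len(digits) % 2:
        return None  # a dangling digit is not a whole date/time field
    # Missing hour/minute/second fields default to 00 (assumption).
    padded = digits.ljust(14, "0")
    try:
        parsed = datetime.strptime(padded, "%Y%m%d%H%M%S")
    except ValueError:
        return None
    return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```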

---

## Rendering Context Detection

For Single Page Applications (SPAs), detect the framework and rendering state:

```python
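from typing import Optional

def detect_rendering_context(page, initial_html: Optional[str] = None) -> dict:
    """Detect JS framework and rendering state using Playwright.

    initial_html is the raw navigation response body, e.g.
    page.goto(url).text(). page.content() cannot be used for this because
    it serializes the DOM after scripts have run.
    """

    context = {
        "framework_detected": "unknown",
        "js_execution_required": False,
        "client_side_rendered": False,
        "server_side_rendered": True
    }

    # CSS cannot match attribute names by prefix ('[data-v-]' only matches an
    # attribute literally named "data-v-"), so scan attribute names in JS
    has_attr_prefix = """(prefix) => Array.from(document.querySelectorAll('*')).some(
        el => el.getAttributeNames().some(name => name.startsWith(prefix)))"""

    # Check for React (heuristic markers; none of them is guaranteed)
    react_check = page.evaluate("""() => {
        return !!(window.__REACT_DEVTOOLS_GLOBAL_HOOK__ ||
                  document.querySelector('[data-reactroot]') ||
                  document.querySelector('[data-react-helmet]'))
    }""")
    if react_check:
        context["framework_detected"] = "React"
        context["js_execution_required"] = True

    # Check for Vue (scoped-style attributes look like data-v-7ba5bd90)
    vue_check = page.evaluate("""() => {
        return !!(window.__VUE__ || document.querySelector('#__nuxt'))
    }""") or page.evaluate(has_attr_prefix, "data-v-")
    if vue_check:
        context["framework_detected"] = "Vue"
        context["js_execution_required"] = True

    # Check for Angular (view-encapsulation attributes start with _ngcontent-)
    angular_check = page.evaluate("""() => {
        return !!(window.ng || document.querySelector('[ng-version]'))
    }""") or page.evaluate(has_attr_prefix, "_ngcontent-")
    if angular_check:
        context["framework_detected"] = "Angular"
        context["js_execution_required"] = True

    # Heuristic: a very small initial response suggests an empty app shell
    if initial_html is not None:
        context["client_side_rendered"] = len(initial_html) < 5000
        context["server_side_rendered"] = not context["client_side_rendered"]

    return context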
```

---

## Interaction Sequence Recording

For dynamic content that requires user interaction:

```python
from datetime import datetime, timezone

def record_interaction_sequence(actions: list) -> list:
    """Record a sequence of interactions for provenance."""
    sequence = []
    for action in actions:
        # Prefer the time the action actually ran; fall back to the current time
        timestamp = action.get("timestamp") or datetime.now(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        )
        entry = {
            "action": action["type"],
            "timestamp": timestamp
        }

        if action["type"] == "navigate":
            entry["target"] = action["url"]
        elif action["type"] == "click":
            entry["target"] = action["selector"]
        elif action["type"] == "type":
            entry["target"] = action["selector"]
            entry["value"] = action.get("value", "")
        elif action["type"] == "wait":
            entry["condition"] = action.get("condition", "networkidle")
            entry["duration_ms"] = action.get("duration_ms", 0)
        elif action["type"] == "scroll":
            entry["target"] = action.get("selector", "window")
            entry["scroll_position"] = action.get("position", {"x": 0, "y": 0})
        elif action["type"] == "extract":
            entry["target"] = action["selector"]

        sequence.append(entry)

    return sequence
```
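
A recorded sequence is only useful for re-verification if it can be replayed. The following is a minimal sketch of a replay loop against a Playwright (sync API) page; `replay_interaction_sequence` is a hypothetical helper, and treating a `"window"` scroll target as an absolute `window.scrollTo` is an assumption.

```python
def replay_interaction_sequence(page, sequence: list) -> list:
    """Replay recorded steps on a Playwright page; return the extracted texts."""
    extracted = []
    for step in sequence:
        action = step["action"]
        if action == "navigate":
            page.goto(step["target"])
        elif action == "click":
            page.click(step["target"])
        elif action == "type":
            page.fill(step["target"], step.get("value", ""))
        elif action == "wait":
            # Conditions map onto Playwright load states:
            # load | domcontentloaded | networkidle
            page.wait_for_load_state(step.get("condition", "networkidle"))
        elif action == "scroll":
            if step.get("target", "window") == "window":
                pos = step.get("scroll_position", {})
                page.evaluate(
                    "([x, y]) => window.scrollTo(x, y)",
                    [pos.get("x", 0), pos.get("y", 0)],
                )
            else:
                page.locator(step["target"]).scroll_into_view_if_needed()
        elif action == "extract":
            extracted.append(page.inner_text(step["target"]))
        else:
            raise ValueError(f"unknown action: {action}")
    return extracted
```

Recorded `duration_ms` values are not replayed as fixed sleeps; waiting on the load state again is usually more robust than reusing an old timing.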

---

## Verification Workflow

```
1. INITIAL EXTRACTION
   ├─ Navigate to source URL (record interaction)
   ├─ Wait for JS rendering if needed (record wait condition)
   ├─ Detect framework (record rendering_context)
   ├─ Extract text via multiple selectors (record w3c_selectors)
   ├─ Generate content_hash
   ├─ Generate text_fragment URL
   ├─ Check Wayback Machine availability
   ├─ Set verification.status = "verified"
   └─ Set next_verification_due (e.g., 90 days)

2. RE-VERIFICATION (automated)
   ├─ Fetch source URL again
   ├─ Replay interaction_sequence if needed
   ├─ Re-extract via same selectors (try all w3c_selectors)
   ├─ Compare content_hash
   │   ├─ MATCH: Update last_verified, keep status "verified"
   │   └─ MISMATCH: Set status "stale", log to verification_history
   └─ If source 404: Try memento_uri, set status "archived"

3. ARCHIVAL FALLBACK
   ├─ If source unavailable, check memento_uri
   ├─ If no memento, search archive.org for URL
   └─ Log archival source in verification_history
```
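
The hash comparison in step 2 can be sketched as a pure function over the claim object. This is an illustration under stated assumptions, not the definitive workflow: `apply_reverification` is a hypothetical name, every outcome is appended to `verification_history`, an unreachable source is recorded as `failed`, and the archival fallback of step 3 is left to the caller.

```python
import base64
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Optional

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def apply_reverification(claim: dict, refetched_text: Optional[str],
                         now: Optional[datetime] = None,
                         interval_days: int = 90) -> dict:
    """Update claim['verification'] in place after a re-fetch.

    refetched_text is None when the source could not be fetched at all.
    """
    now = now or datetime.now(timezone.utc)
    verification = claim.setdefault("verification", {})
    history = verification.setdefault("verification_history", [])

    if refetched_text is None:
        status, new_hash, notes = "failed", None, "Source unavailable"
    else:
        digest = hashlib.sha256(refetched_text.encode("utf-8")).digest()
        new_hash = "sha256-" + base64.b64encode(digest).decode("ascii")
        if new_hash == claim["content_hash"]["value"]:
            status, notes = "verified", "Content hash matches"
        else:
            status, notes = "stale", "Content hash mismatch"

    verification["status"] = status
    if status == "verified":
        # Only a successful check moves the verification window forward.
        verification["last_verified"] = now.strftime(ISO_FORMAT)
        due = now + timedelta(days=interval_days)
        verification["next_verification_due"] = due.strftime(ISO_FORMAT)

    history.append({
        "timestamp": now.strftime(ISO_FORMAT),
        "status": status,
        "content_hash": new_hash,
        "notes": notes,
    })
    return claim
```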

---

## Example: Complete Web Claim with Enhanced Provenance

```json
{
  "claim_id": "c47f3e8a-9b2d-4e1a-8c5f-7d6e9a0b1c2d",

  "claim_type": "role",
  "claim_value": "directeur Tresoar",
  "extracted_text": "Arjen Dijkstra wordt de nieuwe directeur van Fries historisch en letterkundig centrum Tresoar. Hij begint in oktober en volgt Bert Looper op.",
  "language": "nl",

  "source_url": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
  "canonical_url": null,
  "content_type": "text/html",

  "w3c_selectors": [
    {
      "type": "CssSelector",
      "value": "article.news-article > div.content > p:first-of-type"
    },
    {
      "type": "XPathSelector",
      "value": "/html/body/main/article/div[@class='content']/p[1]"
    },
    {
      "type": "TextQuoteSelector",
      "exact": "Arjen Dijkstra wordt de nieuwe directeur van Fries historisch en letterkundig centrum Tresoar",
      "prefix": "15 juli 2022 - ",
      "suffix": ". Hij begint in oktober"
    }
  ],

  "aria_selector": {
    "role": "article",
    "name": "Historisch centrum Tresoar vindt nieuwe directeur in Arjen Dijkstra"
  },

  "text_fragment": "#:~:text=Arjen%20Dijkstra%20wordt%20de%20nieuwe%20directeur",

  "css_selector": "article.news-article > div.content > p:first-of-type",
  "xpath_selector": "/html/body/main/article/div[@class='content']/p[1]",

  "published_date": "2022-07-15T14:15:00Z",
  "modified_date": null,
  "author": "Omrop Fryslân",

  "retrieval_timestamp": "2025-12-28T02:45:00Z",
  "retrieval_agent": "opencode/claude-sonnet-4",
  "extraction_method": "web-reader_webReader",

  "content_hash": {
    "algorithm": "sha256",
    "value": "sha256-K7gNU3sdo+OL0wNhqoVWhr3g6s1xYv72ol/pe/Unols=",
    "scope": "extracted_text"
  },
  "http_etag": null,
  "http_last_modified": null,

  "archive": {
    "memento_uri": "https://web.archive.org/web/20220716/https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
    "memento_datetime": "2022-07-16T00:00:00Z",
    "timemap_uri": "https://web.archive.org/web/timemap/link/https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
    "timegate_uri": "https://web.archive.org/web/https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
    "archive_source": "web.archive.org"
  },

  "prov": {
    "wasAttributedTo": {
      "@type": "prov:Agent",
      "name": "Omrop Fryslân",
      "url": "https://www.omropfryslan.nl"
    },
    "generatedAtTime": "2025-12-28T02:45:00Z",
    "wasDerivedFrom": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
    "wasGeneratedBy": {
      "@type": "prov:Activity",
      "name": "web_extraction",
      "used": "web-reader_webReader"
    }
  },

  "schema_org": {
    "@type": "Claim",
    "claimReviewed": "Arjen Dijkstra is directeur van Tresoar",
    "appearance": {
      "@type": "NewsArticle",
      "url": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
      "datePublished": "2022-07-15T14:15:00Z"
    },
    "author": {
      "@type": "Organization",
      "name": "Omrop Fryslân"
    }
  },

  "verification": {
    "status": "verified",
    "last_verified": "2025-12-28T02:45:00Z",
    "next_verification_due": "2026-03-28T00:00:00Z",
    "confidence_score": 0.95,
    "verification_history": [
      {
        "timestamp": "2025-12-28T02:45:00Z",
        "status": "verified",
        "content_hash": "sha256-K7gNU3sdo+OL0wNhqoVWhr3g6s1xYv72ol/pe/Unols=",
        "notes": "Initial extraction from Omrop Fryslân"
      }
    ]
  },

  "rendering_context": {
    "framework_detected": "vanilla",
    "js_execution_required": false,
    "client_side_rendered": false,
    "server_side_rendered": true,
    "wait_condition": "domcontentloaded",
    "wait_duration_ms": 1200
  },

  "interaction_sequence": [
    {
      "action": "navigate",
      "target": "https://www.omropfryslan.nl/nl/nieuws/1161390/historisch-centrum-tresoar-vindt-nieuwe-directeur-in-arjen-dijkstra",
      "timestamp": "2025-12-28T02:44:58Z"
    },
    {
      "action": "wait",
      "condition": "domcontentloaded",
      "duration_ms": 1200,
      "timestamp": "2025-12-28T02:44:59Z"
    },
    {
      "action": "extract",
      "target": "article.news-article > div.content > p:first-of-type",
      "timestamp": "2025-12-28T02:45:00Z"
    }
  ],

  "user_agent_context": {
    "browser": "Chromium",
    "headless": true,
    "viewport_width": 1920,
    "viewport_height": 1080,
    "locale": "nl-NL",
    "timezone": "Europe/Amsterdam"
  },

  "notes": "Primary source for Tresoar director appointment announcement"
}
```

---

## Example: SPA with AJAX Content

For a Single Page Application that loads content via API:

```json
{
  "claim_id": "d58e4f9b-0c3e-5f2b-9d6a-8e7f0b2c3d4e",

  "claim_type": "role",
  "claim_value": "Head of Collections",
  "extracted_text": "Dr. Maria van der Berg serves as Head of Collections since 2021.",
  "language": "en",

  "source_url": "https://museum.example.nl/about/team",

  "w3c_selectors": [
    {
      "type": "CssSelector",
      "value": "[data-testid='team-member-card']:nth-child(3) .role"
    },
    {
      "type": "TextQuoteSelector",
      "exact": "Dr. Maria van der Berg serves as Head of Collections",
      "prefix": "",
      "suffix": " since 2021"
    }
  ],

  "aria_selector": {
    "role": "listitem",
    "name": "Dr. Maria van der Berg",
    "testid": "team-member-card"
  },

  "text_fragment": "#:~:text=Dr.%20Maria%20van%20der%20Berg%20serves%20as%20Head%20of%20Collections",

  "content_hash": {
    "algorithm": "sha256",
    "value": "sha256-xyz789...",
    "scope": "extracted_text"
  },

  "archive": {
    "memento_uri": "https://web.archive.org/web/20251228/https://museum.example.nl/about/team",
    "archive_source": "web.archive.org"
  },

  "prov": {
    "wasDerivedFrom": "https://museum.example.nl/about/team"
  },

  "verification": {
    "status": "verified",
    "last_verified": "2025-12-28T03:00:00Z"
  },

  "rendering_context": {
    "framework_detected": "React",
    "framework_version": "18.2.0",
    "hydration_complete": true,
    "client_side_rendered": true,
    "server_side_rendered": false,
    "js_execution_required": true,
    "wait_condition": "networkidle",
    "wait_selector": "[data-testid='team-loaded']",
    "wait_duration_ms": 3500
  },

  "interaction_sequence": [
    {
      "action": "navigate",
      "target": "https://museum.example.nl/about/team",
      "timestamp": "2025-12-28T02:59:55Z"
    },
    {
      "action": "wait",
      "condition": "networkidle",
      "duration_ms": 3500,
      "timestamp": "2025-12-28T02:59:58Z"
    },
    {
      "action": "scroll",
      "target": "[data-testid='team-member-card']:nth-child(3)",
      "timestamp": "2025-12-28T02:59:59Z"
    },
    {
      "action": "extract",
      "target": "[data-testid='team-member-card']:nth-child(3) .role",
      "timestamp": "2025-12-28T03:00:00Z"
    }
  ],

  "network_context": {
    "request_url": "https://api.museum.example.nl/v2/team",
    "request_method": "GET",
    "response_status": 200,
    "response_content_type": "application/json",
    "xhr_intercepted": true,
    "api_version": "v2"
  },

  "retrieval_timestamp": "2025-12-28T03:00:00Z",
  "retrieval_agent": "opencode/claude-sonnet-4",
  "extraction_method": "playwright_browser_snapshot"
}
```

---

## Implementation Priority

### Phase 1: Critical (implement immediately)
- `content_hash` - Content integrity verification
- `text_fragment` - URL-based text targeting
- `archive.memento_uri` - Archival fallback
- `prov.wasDerivedFrom` - Provenance tracing
- `verification.status` - Claim freshness
- `w3c_selectors` - Multiple selector types (at least 2)

### Phase 2: High (implement for SPAs)
- `rendering_context` - JS framework detection
- `interaction_sequence` - Dynamic content actions
- `aria_selector` - Accessibility-first selection
- `network_context` - AJAX request details

### Phase 3: Complete (full FAIR compliance)
- Full `prov` object - Complete PROV-O alignment
- Full `schema_org` object - Schema.org ClaimReview
- `verification_history` - Change tracking
- `user_agent_context` - Browser environment
- `dom_state` - Complete DOM state capture

---

## Selector Resilience Ranking

For maximum resilience to DOM changes, use selectors in this priority order:

1. **TextQuoteSelector** - Most resilient (content-based)
2. **aria_selector** - Very stable (semantic/accessibility)
3. **data-testid** - Stable (explicit test hooks)
4. **FragmentSelector** - Stable for named anchors
5. **CssSelector** - Moderate (structure-dependent)
6. **XPathSelector** - Brittle (exact path-dependent)
7. **TextPositionSelector** - Most brittle (offset-dependent)

**Best Practice:** Always include at least one TextQuoteSelector and one structural selector (CSS or XPath).
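
As a small illustration, the ranking can be applied to a claim's `w3c_selectors` so that re-verification tries the most resilient selector first. The names below are hypothetical; `aria_selector` and `data-testid` are not W3C selector types and are stored outside `w3c_selectors`, so they are left out of this sketch.

```python
# W3C selector types from the ranking above, most resilient first.
RESILIENCE_ORDER = [
    "TextQuoteSelector",
    "FragmentSelector",
    "CssSelector",
    "XPathSelector",
    "TextPositionSelector",
]

def order_selectors_by_resilience(selectors: list) -> list:
    """Return selectors sorted most-resilient first; unknown types go last."""
    rank = {name: index for index, name in enumerate(RESILIENCE_ORDER)}
    return sorted(selectors, key=lambda s: rank.get(s.get("type"), len(rank)))

def follows_best_practice(selectors: list) -> bool:
    """True if there is a TextQuoteSelector plus a CSS or XPath selector."""
    types = {s.get("type") for s in selectors}
    return "TextQuoteSelector" in types and bool(types & {"CssSelector", "XPathSelector"})
```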

---

## Related Rules

- `.opencode/WEB_READER_PREFERRED_SCRAPER_RULE.md` - Preferred scraper tool
- `.opencode/DATA_PRESERVATION_RULES.md` - Never delete enriched data
- `.opencode/DATA_FABRICATION_PROHIBITION.md` - Real data only
- `.opencode/INITIALS_EXPANSION_PROHIBITION.md` - Never expand initials without verification
- `AGENTS.md` - Web Claim Provenance Requirements section