kempersc 84904e344b Make AGENTS more succint by referring to opencode rules & enrich custodians

2025-12-28 14:56:35 +01:00


Web-Reader is the Preferred Web Scraper for Comprehensive Provenance

Overview

🚨 CRITICAL: Use web-reader_webReader as the PRIMARY web scraping tool. It returns the structured metadata needed to write comprehensive provenance statements, which data archival and future updates depend on.

Why Web-Reader Over Other Tools

Web-Reader provides structured metadata that enables proper provenance tracking:

| Feature | Web-Reader | Linkup | Firecrawl | Playwright |
|---|---|---|---|---|
| Source URL | ✅ Canonical URL | ✅ | ✅ | ✅ |
| Published Date | ✅ From metadata | ❌ | ❌ | ❌ |
| Author | ✅ From metadata | ❌ | ❌ | ❌ |
| CSS Selectors | ✅ Via metadata structure | ❌ | Limited | ✅ Manual |
| Open Graph Data | ✅ Full og: metadata | ❌ | ❌ | ❌ Manual |
| Article Metadata | ✅ article:published_time, etc. | ❌ | ❌ | ❌ |
| Content Structure | ✅ Markdown + links summary | Text only | Markdown | Raw HTML |
| External Links | ✅ Categorized | ❌ | ✅ | ❌ Manual |

Provenance Fields Extracted by Web-Reader

Web-Reader automatically extracts these provenance-critical fields:

{
  "title": "Article headline",
  "description": "Meta description for claim verification",
  "url": "Canonical URL (source_url)",
  "publishedTime": "2022-07-15T10:41:12.000Z",  // article:published_time
  "metadata": {
    "og:title": "...",
    "og:description": "...",          // Claim verification
    "og:image": "...",
    "article:published_time": "...",  // Timestamp provenance
    "article:modified_time": "...",
    "article:author": "...",          // Author attribution
    "article:section": "...",
    "twitter:card": "...",
    "description": "..."              // CSS: meta[name='description']
  },
  "content": "Full article text in markdown"
}
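As a minimal sketch of how these fields might be consumed, the helper below pulls the provenance-critical values out of a response shaped like the example above. The function name `extract_provenance` and the fallback order are assumptions for illustration, not part of Web-Reader itself.

```python
from typing import Any, Optional


def extract_provenance(response: dict[str, Any]) -> dict[str, Optional[str]]:
    """Collect provenance fields from a Web-Reader-style response dict.

    Hypothetical helper: the key names follow the example response above,
    and the fallback order is an assumption.
    """
    meta = response.get("metadata") or {}
    return {
        "source_url": response.get("url"),
        # Prefer the top-level publishedTime, then the article: meta tag.
        "published_date": response.get("publishedTime")
        or meta.get("article:published_time"),
        "modified_date": meta.get("article:modified_time"),
        "author": meta.get("article:author"),
        # The description doubles as text for claim verification.
        "description": response.get("description")
        or meta.get("og:description")
        or meta.get("description"),
    }


sample = {
    "title": "Article headline",
    "url": "https://example.nl/article.html",
    "metadata": {
        "article:published_time": "2022-07-15T10:41:12.000Z",
        "og:description": "Summary",
    },
}
provenance = extract_provenance(sample)
```

Fields the page does not expose come back as `None` rather than being guessed, so a missing author stays visibly missing in the provenance record.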

CSS Selector Derivation from Web-Reader Metadata

Web-Reader metadata maps directly to CSS selectors for provenance:

| Metadata Field | CSS Selector |
|---|---|
| `title` | `head > title` |
| `description` (metadata) | `meta[name="description"]` |
| `og:title` | `meta[property="og:title"]` |
| `og:description` | `meta[property="og:description"]` |
| `article:published_time` | `meta[property="article:published_time"]` |
| `article:author` | `meta[property="article:author"]` |
| `content` (first paragraph) | `article > p:first-of-type` |
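The mapping above is regular enough to compute. The sketch below is one way to do it, assuming a hypothetical helper named `selector_for`; it covers only the fields listed in the table and refuses to guess a selector for anything else.

```python
def selector_for(field: str) -> str:
    """Return the CSS selector for a metadata field from the table above.

    Hypothetical helper: only the fields listed in the table are covered.
    """
    fixed = {
        "title": "head > title",
        "description": 'meta[name="description"]',
        # Heuristic from the table; real pages may structure content differently.
        "content": "article > p:first-of-type",
    }
    if field in fixed:
        return fixed[field]
    # Open Graph and article: tags are exposed via the property attribute.
    if field.startswith(("og:", "article:")):
        return f'meta[property="{field}"]'
    raise KeyError(f"no known selector for metadata field {field!r}")


og_selector = selector_for("og:description")
```

Raising on unknown fields keeps an unverified selector from ending up in a claim's `css_selector`.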

Web Claim Structure with Web-Reader Provenance

{
  "claim_type": "role",
  "claim_value": "directeur Tresoar",
  "source_url": "https://example.nl/article.html",
  "css_selector": "meta[property='og:description']",
  "extracted_text": "Full text containing the claim",
  "published_date": "2022-07-15T10:41:12.000Z",
  "author": "Journalist Name",
  "retrieval_timestamp": "2025-12-28T02:45:00Z",
  "retrieval_agent": "opencode/claude-sonnet-4",
  "extraction_method": "web-reader_webReader"
}
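A claim in this shape can be assembled mechanically from a Web-Reader response. The following is a sketch under assumptions: `build_claim` and its parameters are hypothetical names, and the response keys follow the example structure shown earlier in this rule.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def build_claim(
    response: dict[str, Any],
    claim_type: str,
    claim_value: str,
    css_selector: str,
    extracted_text: str,
    retrieval_agent: str,
    retrieved_at: Optional[datetime] = None,
) -> dict[str, str]:
    """Assemble a web claim with provenance (hypothetical helper).

    retrieved_at must be timezone-aware; it defaults to the current UTC time.
    """
    meta = response.get("metadata") or {}
    moment = (retrieved_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    claim = {
        "claim_type": claim_type,
        "claim_value": claim_value,
        "source_url": response.get("url"),
        "css_selector": css_selector,
        "extracted_text": extracted_text,
        "published_date": response.get("publishedTime")
        or meta.get("article:published_time"),
        "author": meta.get("article:author"),
        "retrieval_timestamp": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "retrieval_agent": retrieval_agent,
        "extraction_method": "web-reader_webReader",
    }
    # Omit fields the source did not provide instead of storing nulls.
    return {key: value for key, value in claim.items() if value is not None}


claim = build_claim(
    {"url": "https://example.nl/article.html",
     "publishedTime": "2022-07-15T10:41:12.000Z", "metadata": {}},
    claim_type="role",
    claim_value="directeur Tresoar",
    css_selector="meta[property='og:description']",
    extracted_text="Full text containing the claim",
    retrieval_agent="opencode/claude-sonnet-4",
    retrieved_at=datetime(2025, 12, 28, 2, 45, tzinfo=timezone.utc),
)
```

Passing `retrieved_at` explicitly makes the claim reproducible in tests; in normal use the default records the moment of extraction.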

Tool Selection Hierarchy

Use tools in this priority order:

1. Web-Reader (PRIMARY) - For Content with Provenance

Tool: web-reader_webReader
When: Extracting claims that need full provenance
Returns: Structured metadata + content (CSS selectors are derived from the metadata fields)

2. Playwright (SECONDARY) - For Interactive Sites

Tool: playwright_browser_*
When: JavaScript-heavy sites, login walls, infinite scroll
Returns: Full DOM access for XPath extraction

3. Linkup (TERTIARY) - For Discovery Only

Tool: linkup_linkup-search / linkup_linkup-fetch
When: Finding URLs, quick content preview
Returns: Text content (NO provenance metadata)

4. Firecrawl (AVOID) - Credit Limited

Tool: firecrawl_*
When: Only if other tools fail
Note: Credits exhaust quickly
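The priority order above can be read as a small decision function. This is an illustrative sketch only: `Tool`, `choose_tool`, and the three boolean flags are hypothetical names, and the order in which the flags are checked is an assumption about how the hierarchy would be applied.

```python
from enum import Enum


class Tool(Enum):
    WEB_READER = "web-reader_webReader"
    PLAYWRIGHT = "playwright_browser_*"
    LINKUP = "linkup_linkup-search / linkup_linkup-fetch"
    FIRECRAWL = "firecrawl_*"


def choose_tool(
    *,
    discovery_only: bool = False,
    interactive: bool = False,
    others_failed: bool = False,
) -> Tool:
    """Pick a tool following the hierarchy above (hypothetical helper)."""
    if others_failed:
        # Last resort only: credits exhaust quickly.
        return Tool.FIRECRAWL
    if discovery_only:
        # Finding URLs or previewing content; no provenance metadata.
        return Tool.LINKUP
    if interactive:
        # JavaScript-heavy sites, login walls, infinite scroll.
        return Tool.PLAYWRIGHT
    # Default: claims that need full provenance.
    return Tool.WEB_READER


default_tool = choose_tool()
```

Making Web-Reader the fall-through case mirrors the rule: every other tool needs an explicit reason.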

Two-Phase Workflow (Updated), Plus an Optional Third Phase

Phase 1: Discovery

Use Linkup to find relevant URLs:

linkup_linkup-search: "medewerkers [Institution Name]"
→ Returns candidate URLs

Phase 2: Extraction with Provenance (USE WEB-READER)

Use Web-Reader to extract with full provenance:

web-reader_webReader: 
  url: https://example.nl/medewerkers
  return_format: markdown
  with_links_summary: true
→ Returns structured metadata + content

Phase 3: XPath Extraction (If Needed)

Use Playwright only if the CSS selectors derived from meta tags are insufficient:

playwright_browser_navigate + playwright_browser_snapshot
→ Returns full DOM for XPath extraction
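The phases above can be sketched as one loop. The tools themselves are invoked by the agent, so this example takes them as plain callables; `run_workflow`, the stub functions, and the "empty metadata" test for falling back to Playwright are all assumptions made for illustration.

```python
from typing import Any, Callable, Iterable, Optional

Page = dict[str, Any]


def run_workflow(
    query: str,
    search: Callable[[str], Iterable[str]],
    read: Callable[[str], Page],
    browse: Optional[Callable[[str], str]] = None,
) -> list[Page]:
    """Discovery, extraction, then an optional DOM fallback (hypothetical)."""
    pages: list[Page] = []
    for url in search(query):  # Phase 1: candidate URLs
        page = dict(read(url))  # Phase 2: metadata + content
        page["extraction_method"] = "web-reader_webReader"
        if not page.get("metadata") and browse is not None:
            # Phase 3: no meta tags to anchor a selector on, so keep the DOM.
            page["dom"] = browse(url)
            page["extraction_method"] = "playwright_browser_snapshot"
        pages.append(page)
    return pages


# Stub callables standing in for the real tools (assumptions, not real APIs).
def fake_search(query: str) -> list[str]:
    return ["https://example.nl/medewerkers", "https://example.nl/spa"]


def fake_read(url: str) -> Page:
    if url.endswith("/spa"):
        return {"url": url, "metadata": {}}
    return {"url": url, "metadata": {"og:title": "Medewerkers"}}


def fake_browse(url: str) -> str:
    return "<html><body>rendered DOM</body></html>"


pages = run_workflow("medewerkers [Institution Name]", fake_search, fake_read, fake_browse)
```

Recording `extraction_method` per page keeps the fallback visible in the provenance record.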

Example: Person Data Extraction

Step 1: Discover the source page

linkup_linkup-search: "directeur Tresoar Leeuwarden"
→ Found: https://lc.nl/friesland/Arjen-Dijkstra-nieuwe-directeur-Tresoar-27819587.html

Step 2: Extract with Web-Reader

web-reader_webReader:
  url: https://lc.nl/friesland/Arjen-Dijkstra-nieuwe-directeur-Tresoar-27819587.html
  
Returns:
{
  "title": "Arjen Dijkstra nieuwe directeur Tresoar",
  "description": "Arjen Dijkstra, nu nog directeur van het museum van de Rijksuniversiteit Groningen, wordt op 1 oktober de nieuwe directeur van Tresoar.",
  "publishedTime": "2022-07-15T10:41:12.000Z",
  "metadata": {
    "og:description": "...",
    "article:published_time": "2022-07-15T10:41:12.000Z"
  }
}

Step 3: Create claim with provenance

{
  "claim_type": "role",
  "claim_value": "directeur Tresoar",
  "source_url": "https://lc.nl/friesland/Arjen-Dijkstra-nieuwe-directeur-Tresoar-27819587.html",
  "css_selector": "meta[property='og:description']",
  "extracted_text": "Arjen Dijkstra, nu nog directeur van het museum van de Rijksuniversiteit Groningen, wordt op 1 oktober de nieuwe directeur van Tresoar.",
  "published_date": "2022-07-15T10:41:12.000Z",
  "retrieval_timestamp": "2025-12-28T02:45:00Z",
  "retrieval_agent": "opencode/claude-sonnet-4",
  "extraction_method": "web-reader_webReader"
}

Benefits for Data Archival

Web-Reader provenance enables:

Claim Verification: css_selector allows re-extraction from archived HTML
Temporal Tracking: published_date + retrieval_timestamp establish timeline
Source Attribution: author field for journalistic sources
Update Detection: Compare article:modified_time across extractions
Legal Compliance: Full audit trail for data provenance
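Two of these benefits, claim verification and update detection, can be sketched with the standard library alone. Python ships no CSS selector engine, so this example matches on the meta tag's `property` or `name` key rather than evaluating the stored `css_selector` string; `MetaCollector`, `verify_claim`, and `was_modified` are hypothetical names.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags keyed by their property or name attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content is not None:
            # Keep the first occurrence if a tag is repeated.
            self.meta.setdefault(key, content)


def verify_claim(archived_html: str, meta_key: str, extracted_text: str) -> bool:
    """Re-extract a meta value from archived HTML and compare it to the claim."""
    collector = MetaCollector()
    collector.feed(archived_html)
    return collector.meta.get(meta_key) == extracted_text


def was_modified(old_meta: dict, new_meta: dict) -> bool:
    """Detect an update by comparing article:modified_time across extractions."""
    key = "article:modified_time"
    return old_meta.get(key) != new_meta.get(key)


archived = (
    '<html><head><meta property="og:description" content="Hello world">'
    '<meta property="article:modified_time" content="2022-07-16T08:00:00.000Z" />'
    "</head></html>"
)
still_valid = verify_claim(archived, "og:description", "Hello world")
```

A production check would more likely run the stored selector through a real selector engine; the key lookup here only covers claims anchored on meta tags.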

When NOT to Use Web-Reader

| Scenario | Use Instead |
|---|---|
| Login-protected pages | Playwright with authentication |
| Heavy JavaScript SPAs | Playwright with renderJs |
| PDF documents | Firecrawl or direct download |
| Real-time data (stock prices) | Direct API calls |
| Batch scraping 100+ URLs | Firecrawl batch or custom script |

Relationship to Other Rules

This rule supersedes/supplements:

LINKUP_PREFERRED_WEB_SCRAPER_RULE.md - Linkup for discovery, Web-Reader for extraction
WEB_OBSERVATION_PROVENANCE_RULES.md - Web-Reader provides the provenance data
PERSON_DATA_PROVENANCE_RULE.md - Web-Reader extraction method for person claims

Created: 2025-12-28
Status: ACTIVE
Priority: HIGH - Use for all web claim extraction
Related Rules:

.opencode/LINKUP_PREFERRED_WEB_SCRAPER_RULE.md
.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md
.opencode/PERSON_DATA_PROVENANCE_RULE.md
