Web-Reader is the Preferred Web Scraper for Comprehensive Provenance
Overview
🚨 CRITICAL: Use web-reader_webReader as the PRIMARY web scraping tool. It returns the structured metadata needed to write comprehensive provenance statements, which are essential for data archival and future updates.
Why Web-Reader Over Other Tools
Web-Reader provides structured metadata that enables proper provenance tracking:
| Feature | Web-Reader | Linkup | Firecrawl | Playwright |
|---|---|---|---|---|
| Source URL | ✅ Canonical URL | ✅ | ✅ | ✅ |
| Published Date | ✅ From metadata | ❌ | ❌ | ❌ |
| Author | ✅ From metadata | ❌ | ❌ | ❌ |
| CSS Selectors | ✅ Via metadata structure | ❌ | Limited | ✅ Manual |
| Open Graph Data | ✅ Full og: metadata | ❌ | ❌ | ❌ Manual |
| Article Metadata | ✅ article:published_time, etc. | ❌ | ❌ | ❌ |
| Content Structure | ✅ Markdown + links summary | Text only | Markdown | Raw HTML |
| External Links | ✅ Categorized | ❌ | ✅ | ❌ Manual |
Provenance Fields Extracted by Web-Reader
Web-Reader automatically extracts these provenance-critical fields:
{
"title": "Article headline",
"description": "Meta description for claim verification",
"url": "Canonical URL (source_url)",
"publishedTime": "2022-07-15T10:41:12.000Z", // article:published_time
"metadata": {
"og:title": "...",
"og:description": "...", // Claim verification
"og:image": "...",
"article:published_time": "...", // Timestamp provenance
"article:modified_time": "...",
"article:author": "...", // Author attribution
"article:section": "...",
"twitter:card": "...",
"description": "..." // CSS: meta[name='description']
},
"content": "Full article text in markdown"
}
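As a minimal sketch of how these fields might be collected, the helper below reads a response shaped like the JSON above. The function name `extract_provenance`, the output key names, and the fallback order are assumptions for illustration, not part of the Web-Reader tool itself.

```python
from typing import Any, Optional


def extract_provenance(response: dict[str, Any]) -> dict[str, Optional[str]]:
    """Collect provenance fields from a Web-Reader-style response dict.

    Hypothetical helper: the fallback order below is an assumption.
    """
    metadata = response.get("metadata") or {}
    return {
        "source_url": response.get("url"),
        # Prefer the top-level timestamp, then the article: meta tag.
        "published_date": response.get("publishedTime")
        or metadata.get("article:published_time"),
        "modified_date": metadata.get("article:modified_time"),
        "author": metadata.get("article:author"),
        # Prefer og:description, then the plain meta description.
        "description": metadata.get("og:description")
        or metadata.get("description")
        or response.get("description"),
    }


sample = {
    "title": "Article headline",
    "url": "https://example.nl/article.html",
    "publishedTime": "2022-07-15T10:41:12.000Z",
    "metadata": {"article:author": "Journalist Name"},
}
provenance = extract_provenance(sample)
```

Fields missing from the response come back as `None`, so a caller can decide whether an incomplete record is still acceptable for archival.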
CSS Selector Derivation from Web-Reader Metadata
Web-Reader metadata maps directly to CSS selectors for provenance:
| Metadata Field | CSS Selector |
|---|---|
| `title` | `head > title` |
| `description` (metadata) | `meta[name="description"]` |
| `og:title` | `meta[property="og:title"]` |
| `og:description` | `meta[property="og:description"]` |
| `article:published_time` | `meta[property="article:published_time"]` |
| `article:author` | `meta[property="article:author"]` |
| `content` (first paragraph) | `article > p:first-of-type` |
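The mapping in this table can be sketched as a small lookup. The function name `selector_for` is hypothetical, and the rule that every `og:` and `article:` key maps to a `meta[property=...]` selector is a generalization of the rows shown here; fields outside the table (for example `twitter:card`) are deliberately rejected rather than guessed.

```python
def selector_for(field: str) -> str:
    """Return the CSS selector for a Web-Reader metadata field name.

    Hypothetical helper covering only the rows of the mapping table.
    """
    special = {
        "title": "head > title",
        "description": 'meta[name="description"]',
        "content": "article > p:first-of-type",
    }
    if field in special:
        return special[field]
    # Assumption: og: and article: keys are exposed as property attributes.
    if field.startswith(("og:", "article:")):
        return f'meta[property="{field}"]'
    raise KeyError(f"no selector rule for metadata field: {field}")


og_selector = selector_for("og:description")
title_selector = selector_for("title")
```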
Web Claim Structure with Web-Reader Provenance
{
"claim_type": "role",
"claim_value": "directeur Tresoar",
"source_url": "https://example.nl/article.html",
"css_selector": "meta[property='og:description']",
"extracted_text": "Full text containing the claim",
"published_date": "2022-07-15T10:41:12.000Z",
"author": "Journalist Name",
"retrieval_timestamp": "2025-12-28T02:45:00Z",
"retrieval_agent": "opencode/claude-sonnet-4",
"extraction_method": "web-reader_webReader"
}
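A claim of this shape could be assembled along the following lines. This is a sketch under stated assumptions: `build_claim` and its parameters are hypothetical names, the input is a dict shaped like the Web-Reader response shown earlier, and the selector is built only for `og:` or `article:` metadata keys.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def build_claim(
    response: dict[str, Any],
    claim_type: str,
    claim_value: str,
    metadata_key: str,
    retrieval_agent: str,
    now: Optional[datetime] = None,
) -> dict[str, Any]:
    """Build a web claim record from a Web-Reader-style response (hypothetical helper)."""
    metadata = response.get("metadata") or {}
    if metadata_key not in metadata:
        raise KeyError(f"metadata field not present: {metadata_key}")
    # Record retrieval time in UTC; `now` is injectable to keep the function testable.
    retrieved = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "claim_type": claim_type,
        "claim_value": claim_value,
        "source_url": response["url"],
        "css_selector": f"meta[property='{metadata_key}']",
        "extracted_text": metadata[metadata_key],
        "published_date": response.get("publishedTime"),
        "author": metadata.get("article:author"),
        "retrieval_timestamp": retrieved.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "retrieval_agent": retrieval_agent,
        "extraction_method": "web-reader_webReader",
    }


claim = build_claim(
    response={
        "url": "https://example.nl/article.html",
        "publishedTime": "2022-07-15T10:41:12.000Z",
        "metadata": {"og:description": "Full text containing the claim"},
    },
    claim_type="role",
    claim_value="directeur Tresoar",
    metadata_key="og:description",
    retrieval_agent="opencode/claude-sonnet-4",
    now=datetime(2025, 12, 28, 2, 45, tzinfo=timezone.utc),
)
```

Raising on a missing metadata key keeps a claim from being stored with a selector that never matched anything on the page.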
Tool Selection Hierarchy
Use tools in this priority order:
1. Web-Reader (PRIMARY) - For Content with Provenance
Tool: web-reader_webReader
When: Extracting claims that need full provenance
Returns: Structured metadata + content (CSS selectors are derived from the metadata keys)
2. Playwright (SECONDARY) - For Interactive Sites
Tool: playwright_browser_*
When: JavaScript-heavy sites, login walls, infinite scroll
Returns: Full DOM access for XPath extraction
3. Linkup (TERTIARY) - For Discovery Only
Tool: linkup_linkup-search / linkup_linkup-fetch
When: Finding URLs, quick content preview
Returns: Text content (NO provenance metadata)
4. Firecrawl (AVOID) - Credit Limited
Tool: firecrawl_*
When: Only if other tools fail
Note: Credits exhaust quickly
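One possible reading of this hierarchy as code is shown below. The `PageTraits` class and `tool_order` function are hypothetical, and the exact fallback order for interactive pages is an assumption; only the tool names come from this rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageTraits:
    """Hypothetical description of what a page needs from a scraper."""
    needs_login: bool = False
    javascript_heavy: bool = False
    infinite_scroll: bool = False
    discovery_only: bool = False


def tool_order(traits: PageTraits) -> list[str]:
    """Return tool names to try, most preferred first."""
    if traits.discovery_only:
        # Linkup finds URLs but returns no provenance metadata.
        return ["linkup_linkup-search"]
    interactive = (
        traits.needs_login or traits.javascript_heavy or traits.infinite_scroll
    )
    if interactive:
        # Assumption: Web-Reader is skipped when the page needs a real browser.
        return ["playwright_browser_*", "firecrawl_*"]
    # Default: Web-Reader first, Firecrawl only as a last resort.
    return ["web-reader_webReader", "playwright_browser_*", "firecrawl_*"]


default_order = tool_order(PageTraits())
```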
Two-Phase Workflow (Updated), with an Optional Third Phase
Phase 1: Discovery
Use Linkup to find relevant URLs:
linkup_linkup-search: "medewerkers [Institution Name]"
→ Returns candidate URLs
Phase 2: Extraction with Provenance (USE WEB-READER)
Use Web-Reader to extract with full provenance:
web-reader_webReader:
url: https://example.nl/medewerkers
return_format: markdown
with_links_summary: true
→ Returns structured metadata + content
Phase 3: XPath Extraction (If Needed)
Use Playwright only if the CSS selectors derived from meta tags are insufficient:
playwright_browser_navigate + playwright_browser_snapshot
→ Returns full DOM for XPath extraction
Example: Person Data Extraction
Step 1: Discover the source page
linkup_linkup-search: "directeur Tresoar Leeuwarden"
→ Found: https://lc.nl/friesland/Arjen-Dijkstra-nieuwe-directeur-Tresoar-27819587.html
Step 2: Extract with Web-Reader
web-reader_webReader:
url: https://lc.nl/friesland/Arjen-Dijkstra-nieuwe-directeur-Tresoar-27819587.html
Returns:
{
"title": "Arjen Dijkstra nieuwe directeur Tresoar",
"description": "Arjen Dijkstra, nu nog directeur van het museum van de Rijksuniversiteit Groningen, wordt op 1 oktober de nieuwe directeur van Tresoar.",
"publishedTime": "2022-07-15T10:41:12.000Z",
"metadata": {
"og:description": "...",
"article:published_time": "2022-07-15T10:41:12.000Z"
}
}
Step 3: Create claim with provenance
{
"claim_type": "role",
"claim_value": "directeur Tresoar",
"source_url": "https://lc.nl/friesland/Arjen-Dijkstra-nieuwe-directeur-Tresoar-27819587.html",
"css_selector": "meta[property='og:description']",
"extracted_text": "Arjen Dijkstra, nu nog directeur van het museum van de Rijksuniversiteit Groningen, wordt op 1 oktober de nieuwe directeur van Tresoar.",
"published_date": "2022-07-15T10:41:12.000Z",
"retrieval_timestamp": "2025-12-28T02:45:00Z",
"retrieval_agent": "opencode/claude-sonnet-4",
"extraction_method": "web-reader_webReader"
}
Benefits for Data Archival
Web-Reader provenance enables:
- Claim Verification: `css_selector` allows re-extraction from archived HTML
- Temporal Tracking: `published_date` + `retrieval_timestamp` establish timeline
- Source Attribution: `author` field for journalistic sources
- Update Detection: Compare `article:modified_time` across extractions
- Legal Compliance: Full audit trail for data provenance
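Two of these benefits, claim verification and update detection, can be sketched with the Python standard library alone. The names `MetaPropertyFinder`, `claim_still_supported`, and `was_updated` are hypothetical, and the sketch assumes the claim was taken from a `meta[property=...]` tag and that timestamps use the ISO 8601 form shown in this rule.

```python
from datetime import datetime
from html.parser import HTMLParser
from typing import Optional


class MetaPropertyFinder(HTMLParser):
    """Find the content of the first <meta property="..."> tag with a given property."""

    def __init__(self, prop: str) -> None:
        super().__init__()
        self.prop = prop
        self.content: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.content is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == self.prop:
            self.content = attr_map.get("content")


def claim_still_supported(archived_html: str, prop: str, extracted_text: str) -> bool:
    """Re-extract the meta tag from archived HTML and compare it to the stored text."""
    finder = MetaPropertyFinder(prop)
    finder.feed(archived_html)
    return finder.content == extracted_text


def _parse(timestamp: str) -> datetime:
    # Replace the trailing "Z" so older Python versions can parse the value.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def was_updated(previous: Optional[str], current: Optional[str]) -> bool:
    """Return True when article:modified_time moved forward between extractions."""
    if previous is None or current is None:
        return False
    return _parse(current) > _parse(previous)


archived = '<html><head><meta property="og:description" content="Hello"></head></html>'
still_valid = claim_still_supported(archived, "og:description", "Hello")
```

An exact string comparison is the strictest choice; a real archive check might normalize whitespace first, which is left out here.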
When NOT to Use Web-Reader
| Scenario | Use Instead |
|---|---|
| Login-protected pages | Playwright with authentication |
| Heavy JavaScript SPAs | Playwright with renderJs |
| PDF documents | Firecrawl or direct download |
| Real-time data (stock prices) | Direct API calls |
| Batch scraping 100+ URLs | Firecrawl batch or custom script |
Relationship to Other Rules
This rule supersedes/supplements:
- `LINKUP_PREFERRED_WEB_SCRAPER_RULE.md` - Linkup for discovery, Web-Reader for extraction
- `WEB_OBSERVATION_PROVENANCE_RULES.md` - Web-Reader provides the provenance data
- `PERSON_DATA_PROVENANCE_RULE.md` - Web-Reader extraction method for person claims
Created: 2025-12-28
Status: ACTIVE
Priority: HIGH - Use for all web claim extraction
Related Rules:
- `.opencode/LINKUP_PREFERRED_WEB_SCRAPER_RULE.md`
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md`
- `.opencode/PERSON_DATA_PROVENANCE_RULE.md`