# Web-Reader is the Preferred Web Scraper for Comprehensive Provenance

## Overview

**🚨 CRITICAL**: Use `web-reader_webReader` as the PRIMARY web scraping tool. It returns the structured metadata needed to create comprehensive provenance statements, which are essential for data archival and future updates.

## Why Web-Reader Over Other Tools

Web-Reader provides structured metadata that enables proper provenance tracking:

| Feature | Web-Reader | Linkup | Firecrawl | Playwright |
|---------|------------|--------|-----------|------------|
| **Source URL** | ✅ Canonical URL | ✅ | ✅ | ✅ |
| **Published Date** | ✅ From metadata | ❌ | ❌ | ❌ |
| **Author** | ✅ From metadata | ❌ | ❌ | ❌ |
| **CSS Selectors** | ✅ Via metadata structure | ❌ | Limited | ✅ Manual |
| **Open Graph Data** | ✅ Full og: metadata | ❌ | ❌ | ❌ Manual |
| **Article Metadata** | ✅ article:published_time, etc. | ❌ | ❌ | ❌ |
| **Content Structure** | ✅ Markdown + links summary | Text only | Markdown | Raw HTML |
| **External Links** | ✅ Categorized | ❌ | ✅ | ❌ Manual |

## Provenance Fields Extracted by Web-Reader

Web-Reader automatically extracts these provenance-critical fields:

```json
{
  "title": "Article headline",
  "description": "Meta description for claim verification",
  "url": "Canonical URL (source_url)",
  "publishedTime": "2022-07-15T10:41:12.000Z",  // article:published_time
  "metadata": {
    "og:title": "...",
    "og:description": "...",              // Claim verification
    "og:image": "...",
    "article:published_time": "...",      // Timestamp provenance
    "article:modified_time": "...",
    "article:author": "...",              // Author attribution
    "article:section": "...",
    "twitter:card": "...",
    "description": "..."                  // CSS: meta[name='description']
  },
  "content": "Full article text in markdown"
}
```

## CSS Selector Derivation from Web-Reader Metadata

Each Web-Reader metadata field maps directly to the CSS selector of the element it was read from, so the selector can be recorded as provenance:

| Metadata Field | CSS Selector |
|----------------|--------------|
| `title` | `head > title` |
| `description` (metadata) | `meta[name="description"]` |
| `og:title` | `meta[property="og:title"]` |
| `og:description` | `meta[property="og:description"]` |
| `article:published_time` | `meta[property="article:published_time"]` |
| `article:author` | `meta[property="article:author"]` |
| `content` (first paragraph) | `article > p:first-of-type` |

## Web Claim Structure with Web-Reader Provenance

```json
{
  "claim_type": "role",
  "claim_value": "directeur Tresoar",
  "source_url": "https://example.nl/article.html",
  "css_selector": "meta[property='og:description']",
  "extracted_text": "Full text containing the claim",
  "published_date": "2022-07-15T10:41:12.000Z",
  "author": "Journalist Name",
  "retrieval_timestamp": "2025-12-28T02:45:00Z",
  "retrieval_agent": "opencode/claude-sonnet-4",
  "extraction_method": "web-reader_webReader"
}
```

## Tool Selection Hierarchy

Use tools in this priority order:

### 1. Web-Reader (PRIMARY) - For Content with Provenance

```
Tool: web-reader_webReader
When: Extracting claims that need full provenance
Returns: Structured metadata + content + CSS selectors
```

### 2. Playwright (SECONDARY) - For Interactive Sites

```
Tool: playwright_browser_*
When: JavaScript-heavy sites, login walls, infinite scroll
Returns: Full DOM access for XPath extraction
```

### 3. Linkup (TERTIARY) - For Discovery Only

```
Tool: linkup_linkup-search / linkup_linkup-fetch
When: Finding URLs, quick content preview
Returns: Text content (NO provenance metadata)
```

### 4. Firecrawl (AVOID) - Credit Limited

```
Tool: firecrawl_*
When: Only if other tools fail
Note: Credits exhaust quickly
```

## Two-Phase Workflow (Updated)

### Phase 1: Discovery

Use Linkup to **find** relevant URLs:

```
linkup_linkup-search: "medewerkers [Institution Name]"
→ Returns candidate URLs
```

### Phase 2: Extraction with Provenance (USE WEB-READER)

Use Web-Reader to **extract** with full provenance:

```
web-reader_webReader:
  url: https://example.nl/medewerkers
  return_format: markdown
  with_links_summary: true
→ Returns structured metadata + content
```

### Phase 3: XPath Extraction (If Needed)

Use Playwright only if the CSS selectors derived from meta tags are insufficient:

```
playwright_browser_navigate + playwright_browser_snapshot
→ Returns full DOM for XPath extraction
```

## Example: Person Data Extraction

### Step 1: Discover staff page

```
linkup_linkup-search: "directeur Tresoar Leeuwarden"
→ Found: https://lc.nl/friesland/Arjen-Dijkstra-nieuwe-directeur-Tresoar-27819587.html
```

### Step 2: Extract with Web-Reader

```
web-reader_webReader:
  url: https://lc.nl/friesland/Arjen-Dijkstra-nieuwe-directeur-Tresoar-27819587.html

Returns:
{
  "title": "Arjen Dijkstra nieuwe directeur Tresoar",
  "description": "Arjen Dijkstra, nu nog directeur van het museum van de Rijksuniversiteit Groningen, wordt op 1 oktober de nieuwe directeur van Tresoar.",
  "publishedTime": "2022-07-15T10:41:12.000Z",
  "metadata": {
    "og:description": "...",
    "article:published_time": "2022-07-15T10:41:12.000Z"
  }
}
```

### Step 3: Create claim with provenance

```json
{
  "claim_type": "role",
  "claim_value": "directeur Tresoar",
  "source_url": "https://lc.nl/friesland/Arjen-Dijkstra-nieuwe-directeur-Tresoar-27819587.html",
  "css_selector": "meta[property='og:description']",
  "extracted_text": "Arjen Dijkstra, nu nog directeur van het museum van de Rijksuniversiteit Groningen, wordt op 1 oktober de nieuwe directeur van Tresoar.",
  "published_date": "2022-07-15T10:41:12.000Z",
  "retrieval_timestamp": "2025-12-28T02:45:00Z",
  "retrieval_agent": "opencode/claude-sonnet-4",
  "extraction_method": "web-reader_webReader"
}
```

## Benefits for Data Archival

Web-Reader provenance enables:

1. **Claim Verification**: `css_selector` allows re-extraction from archived HTML
2. **Temporal Tracking**: `published_date` + `retrieval_timestamp` establish a timeline
3. **Source Attribution**: `author` field for journalistic sources
4. **Update Detection**: Compare `article:modified_time` across extractions
5. **Legal Compliance**: Full audit trail for data provenance

## When NOT to Use Web-Reader

| Scenario | Use Instead |
|----------|-------------|
| Login-protected pages | Playwright with authentication |
| Heavy JavaScript SPAs | Playwright with renderJs |
| PDF documents | Firecrawl or direct download |
| Real-time data (stock prices) | Direct API calls |
| Batch scraping 100+ URLs | Firecrawl batch or custom script |

## Relationship to Other Rules

This rule supersedes/supplements:

- `LINKUP_PREFERRED_WEB_SCRAPER_RULE.md` - Linkup for discovery, Web-Reader for extraction
- `WEB_OBSERVATION_PROVENANCE_RULES.md` - Web-Reader provides the provenance data
- `PERSON_DATA_PROVENANCE_RULE.md` - Web-Reader extraction method for person claims

---

**Created**: 2025-12-28  
**Status**: ACTIVE  
**Priority**: HIGH - Use for all web claim extraction

**Related Rules**:

- `.opencode/LINKUP_PREFERRED_WEB_SCRAPER_RULE.md`
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md`
- `.opencode/PERSON_DATA_PROVENANCE_RULE.md`
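
## Appendix: Claim Construction Sketch

The metadata-to-selector mapping and the web claim structure described in this rule can be combined in a few lines of code. The following is a minimal Python sketch, not part of the Web-Reader tool itself. It assumes the Web-Reader response is already available as a plain dictionary shaped like the examples in this rule; the names `SELECTORS`, `build_claim`, and `modified_since` are hypothetical helpers introduced only for illustration.

```python
from datetime import datetime, timezone

# Metadata field -> CSS selector, copied from the derivation table in this rule.
SELECTORS = {
    "title": "head > title",
    "description": 'meta[name="description"]',
    "og:title": 'meta[property="og:title"]',
    "og:description": 'meta[property="og:description"]',
    "article:published_time": 'meta[property="article:published_time"]',
    "article:author": 'meta[property="article:author"]',
}


def build_claim(response, claim_type, claim_value, field, agent, retrieved_at=None):
    """Build a web claim record from a Web-Reader style response dict.

    Hypothetical helper: `response` is assumed to have the top-level keys
    and the nested "metadata" dict shown in the examples of this rule.
    """
    if field not in SELECTORS:
        raise KeyError(f"no CSS selector known for field {field!r}")
    metadata = response.get("metadata", {})
    # Prefer the value under "metadata"; fall back to the top-level key.
    text = metadata.get(field, response.get(field))
    if text is None:
        raise ValueError(f"field {field!r} is missing from the response")
    if retrieved_at is None:
        retrieved_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    claim = {
        "claim_type": claim_type,
        "claim_value": claim_value,
        "source_url": response["url"],
        "css_selector": SELECTORS[field],
        "extracted_text": text,
        "published_date": response.get("publishedTime")
        or metadata.get("article:published_time"),
        "author": metadata.get("article:author"),
        "retrieval_timestamp": retrieved_at,
        "retrieval_agent": agent,
        "extraction_method": "web-reader_webReader",
    }
    # Omit fields the page did not provide (for example, a missing author).
    return {key: value for key, value in claim.items() if value is not None}


def modified_since(old_response, new_response):
    """Return True if article:modified_time differs between two extractions."""
    old = old_response.get("metadata", {}).get("article:modified_time")
    new = new_response.get("metadata", {}).get("article:modified_time")
    return old is not None and new is not None and old != new


if __name__ == "__main__":
    # Response values taken from the person data example in this rule.
    sample = {
        "title": "Arjen Dijkstra nieuwe directeur Tresoar",
        "url": "https://lc.nl/friesland/Arjen-Dijkstra-nieuwe-directeur-Tresoar-27819587.html",
        "publishedTime": "2022-07-15T10:41:12.000Z",
        "metadata": {"article:published_time": "2022-07-15T10:41:12.000Z"},
    }
    print(build_claim(sample, "role", "directeur Tresoar", "title",
                      agent="opencode/claude-sonnet-4"))
```

Dropping absent fields rather than storing `null` keeps the record consistent with the Step 3 example, which has no `author` key. Whether the archive prefers explicit nulls is a design choice this sketch does not settle.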