# Web-Reader is the Preferred Web Scraper for Comprehensive Provenance
## Overview
**🚨 CRITICAL**: Use `web-reader_webReader` as the PRIMARY web scraping tool due to its ability to create comprehensive provenance statements essential for data archival and future updates.
## Why Web-Reader Over Other Tools
Web-Reader provides structured metadata that enables proper provenance tracking:

| Feature | Web-Reader | Linkup | Firecrawl | Playwright |
|---------|------------|--------|-----------|------------|
| **Source URL** | ✅ Canonical URL | ✅ | ✅ | ✅ |
| **Published Date** | ✅ From metadata | ❌ | ❌ | ❌ |
| **Author** | ✅ From metadata | ❌ | ❌ | ❌ |
| **CSS Selectors** | ✅ Via metadata structure | ❌ | Limited | ✅ Manual |
| **Open Graph Data** | ✅ Full og: metadata | ❌ | ❌ | ❌ Manual |
| **Article Metadata** | ✅ article:published_time, etc. | ❌ | ❌ | ❌ |
| **Content Structure** | ✅ Markdown + links summary | Text only | Markdown | Raw HTML |
| **External Links** | ✅ Categorized | ❌ | ✅ | ❌ Manual |
## Provenance Fields Extracted by Web-Reader
Web-Reader automatically extracts these provenance-critical fields:

```json
{
  "title": "Article headline",
  "description": "Meta description for claim verification",
  "url": "Canonical URL (source_url)",
  "publishedTime": "2022-07-15T10:41:12.000Z",  // article:published_time
  "metadata": {
    "og:title": "...",
    "og:description": "...",           // Claim verification
    "og:image": "...",
    "article:published_time": "...",   // Timestamp provenance
    "article:modified_time": "...",
    "article:author": "...",           // Author attribution
    "article:section": "...",
    "twitter:card": "...",
    "description": "..."               // CSS: meta[name='description']
  },
  "content": "Full article text in markdown"
}
```
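As a rough illustration of how these fields might be consumed, the sketch below flattens a response shaped like the JSON above into a provenance dictionary. The function name `extract_provenance` and the fallback order (top-level `publishedTime` first, then `article:published_time`) are assumptions made for this example, not documented Web-Reader behavior.

```python
from typing import Any, Optional


def extract_provenance(response: dict[str, Any]) -> dict[str, Optional[str]]:
    """Collect provenance-critical fields from a Web-Reader style response dict.

    Hypothetical helper: the response shape follows the JSON example in this
    document; missing fields come back as None.
    """
    metadata = response.get("metadata") or {}
    return {
        "source_url": response.get("url"),
        "title": response.get("title") or metadata.get("og:title"),
        # Assumed fallback: top-level timestamp first, then the article: meta tag.
        "published_date": response.get("publishedTime")
        or metadata.get("article:published_time"),
        "modified_date": metadata.get("article:modified_time"),
        "author": metadata.get("article:author"),
        "description": response.get("description")
        or metadata.get("og:description")
        or metadata.get("description"),
    }


sample = {
    "title": "Article headline",
    "url": "https://example.nl/article.html",
    "metadata": {"article:published_time": "2022-07-15T10:41:12.000Z"},
}
provenance = extract_provenance(sample)
print(provenance["published_date"])  # 2022-07-15T10:41:12.000Z
```

Returning `None` for absent fields keeps the gap visible, so a later step can decide whether a claim without an author or date is acceptable.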
## CSS Selector Derivation from Web-Reader Metadata
Web-Reader metadata maps directly to CSS selectors for provenance:

| Metadata Field | CSS Selector |
|----------------|--------------|
| `title` | `head > title` |
| `description` (metadata) | `meta[name="description"]` |
| `og:title` | `meta[property="og:title"]` |
| `og:description` | `meta[property="og:description"]` |
| `article:published_time` | `meta[property="article:published_time"]` |
| `article:author` | `meta[property="article:author"]` |
| `content` (first paragraph) | `article > p:first-of-type` |
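The mapping in the table can be expressed as a small lookup. This is a minimal sketch: `SELECTOR_BY_FIELD` and `selector_for` are hypothetical names, and extending the `meta[property="..."]` pattern to every `og:` and `article:` key is an assumption that goes beyond the rows listed above.

```python
from typing import Optional

# Rows copied from the table above.
SELECTOR_BY_FIELD = {
    "title": "head > title",
    "description": 'meta[name="description"]',
    "og:title": 'meta[property="og:title"]',
    "og:description": 'meta[property="og:description"]',
    "article:published_time": 'meta[property="article:published_time"]',
    "article:author": 'meta[property="article:author"]',
    "content": "article > p:first-of-type",
}


def selector_for(field: str) -> Optional[str]:
    """Return the CSS selector for a metadata field, or None if unknown."""
    if field in SELECTOR_BY_FIELD:
        return SELECTOR_BY_FIELD[field]
    # Assumption: other og:/article: keys follow the same property pattern.
    if field.startswith(("og:", "article:")):
        return f'meta[property="{field}"]'
    return None


print(selector_for("article:modified_time"))
```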
## Web Claim Structure with Web-Reader Provenance
```json
{
  "claim_type": "role",
  "claim_value": "directeur Tresoar",
  "source_url": "https://example.nl/article.html",
  "css_selector": "meta[property='og:description']",
  "extracted_text": "Full text containing the claim",
  "published_date": "2022-07-15T10:41:12.000Z",
  "author": "Journalist Name",
  "retrieval_timestamp": "2025-12-28T02:45:00Z",
  "retrieval_agent": "opencode/claude-sonnet-4",
  "extraction_method": "web-reader_webReader"
}
```
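One way to assemble such a claim record from a Web-Reader style response is sketched below. `build_claim` and its parameters are hypothetical, and dropping fields the page did not provide (instead of storing nulls) is a design choice assumed here, not something this rule prescribes.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def build_claim(
    claim_type: str,
    claim_value: str,
    response: dict[str, Any],
    css_selector: str,
    extracted_text: str,
    retrieval_agent: str,
    retrieved_at: Optional[datetime] = None,
) -> dict[str, str]:
    """Build a claim dict with the provenance fields shown above (hypothetical helper)."""
    metadata = response.get("metadata") or {}
    # Default to the current time; callers should pass a timezone-aware datetime.
    retrieved_at = retrieved_at or datetime.now(timezone.utc)
    claim = {
        "claim_type": claim_type,
        "claim_value": claim_value,
        "source_url": response.get("url"),
        "css_selector": css_selector,
        "extracted_text": extracted_text,
        "published_date": response.get("publishedTime")
        or metadata.get("article:published_time"),
        "author": metadata.get("article:author"),
        "retrieval_timestamp": retrieved_at.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        ),
        "retrieval_agent": retrieval_agent,
        "extraction_method": "web-reader_webReader",
    }
    # Assumed design choice: omit fields the page did not provide.
    return {key: value for key, value in claim.items() if value is not None}


claim = build_claim(
    claim_type="role",
    claim_value="directeur Tresoar",
    response={
        "url": "https://example.nl/article.html",
        "publishedTime": "2022-07-15T10:41:12.000Z",
        "metadata": {},
    },
    css_selector="meta[property='og:description']",
    extracted_text="Full text containing the claim",
    retrieval_agent="opencode/claude-sonnet-4",
    retrieved_at=datetime(2025, 12, 28, 2, 45, tzinfo=timezone.utc),
)
print(claim["retrieval_timestamp"])  # 2025-12-28T02:45:00Z
```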
## Tool Selection Hierarchy
Use tools in this priority order:
### 1. Web-Reader (PRIMARY) - For Content with Provenance
```
Tool: web-reader_webReader
When: Extracting claims that need full provenance
Returns: Structured metadata + content (CSS selectors are derived from the metadata)
```
### 2. Playwright (SECONDARY) - For Interactive Sites
```
Tool: playwright_browser_*
When: JavaScript-heavy sites, login walls, infinite scroll
Returns: Full DOM access for XPath extraction
```
### 3. Linkup (TERTIARY) - For Discovery Only
```
Tool: linkup_linkup-search / linkup_linkup-fetch
When: Finding URLs, quick content preview
Returns: Text content (NO provenance metadata)
```
### 4. Firecrawl (AVOID) - Credit Limited
```
Tool: firecrawl_*
When: Only if other tools fail
Note: Credits exhaust quickly
```
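The priority order above could be encoded as a small decision function. This is only a sketch: `choose_tool` and its keyword flags (`discovery_only`, `needs_interaction`, `others_failed`) are hypothetical names introduced for illustration, and the returned strings are the tool identifiers listed in this section.

```python
def choose_tool(
    *,
    discovery_only: bool = False,
    needs_interaction: bool = False,
    others_failed: bool = False,
) -> str:
    """Pick a tool name following the hierarchy in this section (illustrative only)."""
    if discovery_only:
        # Finding URLs or a quick preview: no provenance metadata is needed.
        return "linkup_linkup-search"
    if needs_interaction:
        # JavaScript-heavy sites, login walls, infinite scroll.
        return "playwright_browser_*"
    if others_failed:
        # Last resort because credits exhaust quickly.
        return "firecrawl_*"
    # Default: content extraction with full provenance.
    return "web-reader_webReader"


print(choose_tool())  # web-reader_webReader
```

Keeping Web-Reader as the fall-through default mirrors its PRIMARY status: every other tool has to be selected explicitly.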
## Two-Phase Workflow with Optional Phase 3 (Updated)
### Phase 1: Discovery
Use Linkup to **find** relevant URLs:
```
linkup_linkup-search: "medewerkers [Institution Name]"
→ Returns candidate URLs
```
### Phase 2: Extraction with Provenance (USE WEB-READER)
Use Web-Reader to **extract** with full provenance:
```
web-reader_webReader:
  url: https://example.nl/medewerkers
  return_format: markdown
  with_links_summary: true
→ Returns structured metadata + content
```
### Phase 3: XPath Extraction (If Needed)
Use Playwright only if the CSS selectors derived from meta tags are insufficient:
```
playwright_browser_navigate + playwright_browser_snapshot
→ Returns full DOM for XPath extraction
```
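The phases can be chained as in the sketch below. The tool calls are passed in as plain callables (`search`, `read`, `snapshot`), which are stand-ins rather than real client APIs, and treating an empty `metadata` dict as the signal that meta-tag selectors are insufficient is an assumption made for this example.

```python
from typing import Any, Callable, Optional

Page = dict[str, Any]


def run_workflow(
    query: str,
    search: Callable[[str], list[str]],
    read: Callable[[str], Page],
    snapshot: Optional[Callable[[str], Page]] = None,
) -> list[Page]:
    """Discovery, then extraction, then an optional DOM fallback (illustrative)."""
    pages = []
    for url in search(query):  # Phase 1: discovery (Linkup)
        page = read(url)  # Phase 2: extraction with provenance (Web-Reader)
        # Phase 3 (assumed trigger): no usable metadata came back.
        if not page.get("metadata") and snapshot is not None:
            page = snapshot(url)
        pages.append(page)
    return pages


# Stub callables standing in for the real tools.
def fake_search(query: str) -> list[str]:
    return ["https://example.nl/medewerkers"]


def fake_read(url: str) -> Page:
    return {"url": url, "metadata": {}, "content": ""}


def fake_snapshot(url: str) -> Page:
    return {"url": url, "metadata": {}, "source": "playwright"}


pages = run_workflow(
    "medewerkers [Institution Name]", fake_search, fake_read, fake_snapshot
)
print(pages[0]["source"])  # playwright
```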
## Example: Person Data Extraction
### Step 1: Discover staff page
```
linkup_linkup-search: "directeur Tresoar Leeuwarden"
→ Found: https://lc.nl/friesland/Arjen-Dijkstra-nieuwe-directeur-Tresoar-27819587.html
```
### Step 2: Extract with Web-Reader
```
web-reader_webReader:
  url: https://lc.nl/friesland/Arjen-Dijkstra-nieuwe-directeur-Tresoar-27819587.html

Returns:
{
  "title": "Arjen Dijkstra nieuwe directeur Tresoar",
  "description": "Arjen Dijkstra, nu nog directeur van het museum van de Rijksuniversiteit Groningen, wordt op 1 oktober de nieuwe directeur van Tresoar.",
  "publishedTime": "2022-07-15T10:41:12.000Z",
  "metadata": {
    "og:description": "...",
    "article:published_time": "2022-07-15T10:41:12.000Z"
  }
}
```
### Step 3: Create claim with provenance
```json
{
  "claim_type": "role",
  "claim_value": "directeur Tresoar",
  "source_url": "https://lc.nl/friesland/Arjen-Dijkstra-nieuwe-directeur-Tresoar-27819587.html",
  "css_selector": "meta[property='og:description']",
  "extracted_text": "Arjen Dijkstra, nu nog directeur van het museum van de Rijksuniversiteit Groningen, wordt op 1 oktober de nieuwe directeur van Tresoar.",
  "published_date": "2022-07-15T10:41:12.000Z",
  "retrieval_timestamp": "2025-12-28T02:45:00Z",
  "retrieval_agent": "opencode/claude-sonnet-4",
  "extraction_method": "web-reader_webReader"
}
```
## Benefits for Data Archival
Web-Reader provenance enables:

1. **Claim Verification**: `css_selector` allows re-extraction from archived HTML
2. **Temporal Tracking**: `published_date` + `retrieval_timestamp` establish timeline
3. **Source Attribution**: `author` field for journalistic sources
4. **Update Detection**: Compare `article:modified_time` across extractions
5. **Legal Compliance**: Full audit trail for data provenance
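Claim verification and update detection from the list above might look like the following sketch, which uses only the Python standard library. It handles just the `meta[property=...]` and `meta[name=...]` selector forms used in this document; `reextract`, `claim_still_supported`, and `was_modified` are hypothetical helper names, and a full CSS selector engine would be needed for selectors such as `article > p:first-of-type`.

```python
import re
from html.parser import HTMLParser
from typing import Optional

# Matches meta[property='og:description'] or meta[name="description"].
_META_SELECTOR = re.compile(r"""^meta\[(property|name)=['"]([^'"]+)['"]\]$""")


class _MetaFinder(HTMLParser):
    """Records the content attribute of the first matching <meta> tag."""

    def __init__(self, attr: str, value: str) -> None:
        super().__init__()
        self.attr = attr
        self.value = value
        self.content: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.content is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get(self.attr) == self.value:
            self.content = attr_map.get("content")


def reextract(archived_html: str, css_selector: str) -> Optional[str]:
    """Re-run a stored meta-tag selector against archived HTML."""
    match = _META_SELECTOR.match(css_selector.strip())
    if match is None:
        raise ValueError(f"unsupported selector: {css_selector}")
    finder = _MetaFinder(match.group(1), match.group(2))
    finder.feed(archived_html)
    return finder.content


def claim_still_supported(claim: dict, archived_html: str) -> bool:
    """True if the archived page still yields the claim's extracted text."""
    return reextract(archived_html, claim["css_selector"]) == claim["extracted_text"]


def was_modified(previous: dict, current: dict) -> bool:
    """Compare article:modified_time between two extractions of the same URL."""
    key = "article:modified_time"
    return (previous.get("metadata") or {}).get(key) != (
        current.get("metadata") or {}
    ).get(key)


archived = (
    '<html><head><meta property="og:description" '
    'content="Full text containing the claim"/></head></html>'
)
print(reextract(archived, "meta[property='og:description']"))
```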
## When NOT to Use Web-Reader
| Scenario | Use Instead |
|----------|-------------|
| Login-protected pages | Playwright with authentication |
| Heavy JavaScript SPAs | Playwright with renderJs |
| PDF documents | Firecrawl or direct download |
| Real-time data (stock prices) | Direct API calls |
| Batch scraping 100+ URLs | Firecrawl batch or custom script |
## Relationship to Other Rules
This rule supersedes/supplements:
- `LINKUP_PREFERRED_WEB_SCRAPER_RULE.md` - Linkup for discovery, Web-Reader for extraction
- `WEB_OBSERVATION_PROVENANCE_RULES.md` - Web-Reader provides the provenance data
- `PERSON_DATA_PROVENANCE_RULE.md` - Web-Reader extraction method for person claims

---

**Created**: 2025-12-28
**Status**: ACTIVE
**Priority**: HIGH - Use for all web claim extraction
**Related Rules**:
- `.opencode/LINKUP_PREFERRED_WEB_SCRAPER_RULE.md`
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md`
- `.opencode/PERSON_DATA_PROVENANCE_RULE.md`