Linkup is the Preferred Web Scraper
Overview
Use Linkup (linkup_linkup-search or linkup_linkup-fetch) instead of Firecrawl as the primary web scraping tool for DISCOVERY, then use Playwright or crawl4ai for XPath provenance extraction.
This rule establishes a two-phase workflow:
- Phase 1 (Discovery): Linkup to find information and URLs
- Phase 2 (Provenance): Playwright/crawl4ai to archive HTML and extract XPaths
Why Linkup Over Firecrawl for Discovery
| Aspect | Linkup | Firecrawl |
|---|---|---|
| Availability | Always available | Credit-limited (exhausts quickly) |
| Cost | Included in plan | Consumes credits |
| Reliability | Consistent | May hit rate limits |
| Source Attribution | Returns source URLs | Raw content only |
| LLM Integration | Built-in summarization | Requires separate step |
Two-Phase Workflow for Person Data Extraction
Phase 1: Discovery (Linkup)
Use Linkup to find the staff page and identify the information:
1. Search for staff page URL
   Tool: linkup_linkup-search
   Query: "staff members [Custodian Name] medewerkers"
2. Identify relevant URLs from search results
   Extract: https://example.org/organisatie/medewerkers
3. Preview content (optional)
   Tool: linkup_linkup-fetch
   URL: https://example.org/organisatie/medewerkers
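
Step 2 above (picking relevant URLs out of the search results) can be sketched as a small filter. This is a minimal sketch, not part of Linkup: the function name `pick_staff_urls` and the keyword list are assumptions made for illustration, and the input is assumed to be a plain list of URL strings taken from the search results.

```python
from urllib.parse import urlparse

# Hypothetical keyword list; extend it per language or site convention.
STAFF_KEYWORDS = ("medewerkers", "staff", "team")


def pick_staff_urls(urls, domain, keywords=STAFF_KEYWORDS):
    """Return the URLs on `domain` whose path mentions a staff keyword."""
    picked = []
    for url in urls:
        parts = urlparse(url)
        host = parts.netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host != domain.lower():
            continue  # keep only pages on the custodian's own site
        path = parts.path.lower()
        if any(keyword in path for keyword in keywords):
            picked.append(url)
    return picked
```

Restricting the filter to the custodian's own domain keeps third-party pages out of Phase 2, where each URL costs a browser navigation and an archived HTML file.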
Phase 2: XPath Provenance (Playwright or crawl4ai)
Use Playwright to archive the HTML and extract with XPath provenance:
1. Navigate to the URL
   Tool: playwright_browser_navigate
   URL: https://example.org/organisatie/medewerkers
2. Take accessibility snapshot (for element refs)
   Tool: playwright_browser_snapshot
3. Archive the HTML
   Save to: data/custodian/web/{GHCID}/example.org/medewerkers_rendered.html
4. Extract data with XPath locations
   For each person, record:
   - claim_value: "Corinne Rodenburg"
   - xpath: /html/body/main/section[2]/div[1]/ul/li[1]
   - html_file: web/{GHCID}/example.org/medewerkers_rendered.html
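
Step 4 above can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not the project's extraction script: the names `XPathTextIndexer` and `locate_claims` are hypothetical, the input is the archived HTML as a string, and every path step carries an explicit index (`/html[1]/body[1]/...`), which selects the same node as the shorter form when an element has no same-tag siblings. `html.parser` does not repair malformed markup the way a browser does, so on pages with unclosed tags or implicit elements the paths may differ from the browser DOM.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag or children.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class XPathTextIndexer(HTMLParser):
    """Record (xpath, text) for every non-empty text node in a document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []      # (tag, position) for each currently open element
        self.counts = [{}]   # per-depth counters of child tags seen so far
        self.records = []    # collected (xpath, text) pairs

    def handle_starttag(self, tag, attrs):
        siblings = self.counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        if tag in VOID_TAGS:
            return  # counted as a sibling, but never opened
        self.stack.append((tag, siblings[tag]))
        self.counts.append({})

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Close the nearest matching open element, dropping unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                del self.counts[i + 1:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or not self.stack:
            return
        if self.stack[-1][0] in ("script", "style"):
            return  # not visible page text
        xpath = "".join(f"/{tag}[{pos}]" for tag, pos in self.stack)
        self.records.append((xpath, text))


def locate_claims(html_text, values, html_file):
    """Return one claim dict per text node whose text equals a wanted value."""
    indexer = XPathTextIndexer()
    indexer.feed(html_text)
    indexer.close()
    wanted = set(values)
    return [
        {"claim_value": text, "xpath": xpath, "html_file": html_file}
        for xpath, text in indexer.records
        if text in wanted
    ]
```

Because the XPath is computed from the archived file rather than the live page, the recorded location can be re-checked later against the same HTML.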
Tool Selection by Purpose
| Purpose | Tool | Output |
|---|---|---|
| Find URLs | linkup_linkup-search | Source URLs |
| Preview content | linkup_linkup-fetch | Text content (no XPath) |
| Archive HTML | playwright_browser_navigate + save | HTML file |
| Get element refs | playwright_browser_snapshot | Accessibility tree |
| Extract with XPath | Parse archived HTML | Claims with XPath provenance |
Linkup Tools Reference
linkup_linkup-search
Best for: Finding URLs and discovering information
Parameters:
- query: Natural language search query (required)
- depth: "standard" for direct answers, "deep" for complex research
linkup_linkup-fetch
Best for: Quick content preview (NOT for provenance)
Parameters:
- url: The URL to fetch (required)
- renderJs: Set to true for JavaScript-rendered content
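
The parameters above can be assembled and checked before a tool call is issued. The helpers below are a hypothetical sketch: `search_args` and `fetch_args` are illustrative names, and how the resulting dictionaries reach the tools depends on the agent or client in use, which this document does not specify.

```python
VALID_DEPTHS = ("standard", "deep")


def search_args(query, depth=None):
    """Build the argument dict for linkup_linkup-search."""
    if not query or not query.strip():
        raise ValueError("query is required")
    args = {"query": query}
    if depth is not None:
        if depth not in VALID_DEPTHS:
            raise ValueError(f"depth must be one of {VALID_DEPTHS}")
        args["depth"] = depth
    return args


def fetch_args(url, render_js=False):
    """Build the argument dict for linkup_linkup-fetch."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be an absolute http(s) URL")
    args = {"url": url}
    if render_js:
        args["renderJs"] = True  # only sent for JavaScript-rendered pages
    return args
```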
When Linkup-Only is Acceptable (TIER_4)
For timeline events and non-critical metadata, Linkup-only provenance is acceptable:
# TIER_4_INFERRED - Linkup discovery without XPath
timeline_events:
  - event_type: FOUNDING
    event_date: "1895-01-01"
    provenance:
      retrieval_agent: linkup_linkup-search
      linkup_query: "founding date Drents Museum"
      source_urls:
        - https://www.drentsmuseum.nl/over-ons
    data_tier: TIER_4_INFERRED
When XPath Provenance is REQUIRED (Rule 6)
For person data and web claims, XPath provenance is mandatory:
# TIER_2_VERIFIED - Full XPath provenance
web_claims:
  - claim_type: full_name
    claim_value: "Corinne Rodenburg"
    source_url: https://www.drentsarchief.nl/organisatie/medewerkers
    xpath: /html/body/main/section[2]/div[1]/ul/li[1]/span[1]
    html_file: web/NL-DR-ASS-A-PDA/drentsarchief.nl/medewerkers.html
    retrieved_on: "2025-12-27T12:00:00Z"
    xpath_match_score: 1.0
    retrieval_agent: playwright
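
A claim like the one above can be checked mechanically before it is written to a person entity file. The sketch below is an assumption-laden illustration rather than the project's validator: the required field list is taken from the example claim, the function name `rule_6_problems` is hypothetical, and the 0.0 to 1.0 range for `xpath_match_score` is inferred from the example value of 1.0.

```python
# Fields shown in the TIER_2_VERIFIED example; treated here as required.
RULE_6_FIELDS = (
    "claim_type", "claim_value", "source_url", "xpath",
    "html_file", "retrieved_on", "xpath_match_score", "retrieval_agent",
)


def rule_6_problems(claim):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [
        f"missing field: {name}"
        for name in RULE_6_FIELDS
        if claim.get(name) in (None, "")
    ]
    xpath = claim.get("xpath")
    if isinstance(xpath, str) and xpath and not xpath.startswith("/"):
        problems.append("xpath should be an absolute path starting with '/'")
    score = claim.get("xpath_match_score")
    if score is not None:
        # Assumed range; the source only shows the value 1.0.
        if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
            problems.append("xpath_match_score should be a number in [0.0, 1.0]")
    return problems
```

Returning a list of problems instead of raising on the first one lets a batch run report every incomplete claim in a single pass.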
Complete Example Workflow
GOAL: Extract staff from Drents Archief

PHASE 1: DISCOVERY (Linkup)
─────────────────────────────
1. linkup_linkup-search
   Query: "medewerkers Drents Archief site:drentsarchief.nl"
   Result: Found https://www.drentsarchief.nl/organisatie/over-het-drents-archief/medewerkers
2. linkup_linkup-fetch (optional preview)
   URL: https://www.drentsarchief.nl/organisatie/over-het-drents-archief/medewerkers
   Result: Page lists 16 staff members

PHASE 2: PROVENANCE (Playwright)
─────────────────────────────────
3. playwright_browser_navigate
   URL: https://www.drentsarchief.nl/organisatie/over-het-drents-archief/medewerkers
4. playwright_browser_snapshot
   Result: Accessibility tree with element references
5. Archive HTML (bash or script)
   Save rendered HTML to: data/custodian/web/NL-DR-ASS-A-PDA/drentsarchief.nl/medewerkers.html
6. Extract with XPath
   Parse HTML, record XPath for each person's name and role
7. Create person entity JSONs with web_claims
   Each claim includes xpath, html_file, xpath_match_score
8. Update custodian YAML with person_observations
   Reference the person entity files
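
Step 5 of the workflow above can be sketched in Python. Two things here are assumptions: the path convention (host without `www.`, last URL path segment plus `.html`, and `index` as a fallback name) is inferred from the example paths in this document, and the archival function uses Playwright's Python library rather than the `playwright_browser_*` tools named above. The names `archive_paths` and `archive_rendered_html` are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse

ARCHIVE_ROOT = Path("data/custodian")


def archive_paths(url, ghcid, suffix=""):
    """Return (file path to write, html_file value to store in claims)."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    segments = [segment for segment in parts.path.split("/") if segment]
    page = segments[-1] if segments else "index"  # assumed fallback name
    relative = Path("web") / ghcid / host / f"{page}{suffix}.html"
    return ARCHIVE_ROOT / relative, relative.as_posix()


def archive_rendered_html(url, ghcid, suffix=""):
    """Save the rendered HTML of `url` and return its html_file value.

    Needs the third-party `playwright` package and an installed browser,
    so the import is deferred until the function is actually called.
    """
    from playwright.sync_api import sync_playwright

    target, html_file = archive_paths(url, ghcid, suffix)
    target.parent.mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        target.write_text(page.content(), encoding="utf-8")
        browser.close()
    return html_file
```

Keeping the path logic in its own function means the `html_file` value recorded in each claim always matches the file that was actually written.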
Alternatives to Playwright for Phase 2
| Tool | Best For | XPath Support |
|---|---|---|
| Playwright | JavaScript-heavy sites, interactive pages | Yes (via snapshot + parsing) |
| crawl4ai | Batch processing, simpler sites | Yes |
| scripts/fetch_website_playwright.py | Automated archival | Yes |
| Exa crawling | LinkedIn profiles (special case) | No (use for content only) |
Migration from Firecrawl
| Firecrawl Tool | Replacement |
|---|---|
| firecrawl_firecrawl_search | linkup_linkup-search (discovery) |
| firecrawl_firecrawl_scrape | linkup_linkup-fetch (preview) + Playwright (provenance) |
| firecrawl_firecrawl_map | Playwright navigation or sitemap parsing |
| firecrawl_firecrawl_extract | Playwright + manual XPath extraction |
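
The migration table can also be applied as a quick audit of existing prompts or scripts. The sketch below is illustrative only: `suggest_replacements` is a hypothetical helper, and the mapping simply restates the table above.

```python
import re

# Mirrors the migration table; values are the suggested replacements.
REPLACEMENTS = {
    "firecrawl_firecrawl_search": "linkup_linkup-search (discovery)",
    "firecrawl_firecrawl_scrape": "linkup_linkup-fetch (preview) + Playwright (provenance)",
    "firecrawl_firecrawl_map": "Playwright navigation or sitemap parsing",
    "firecrawl_firecrawl_extract": "Playwright + manual XPath extraction",
}

FIRECRAWL_TOOL = re.compile(r"\bfirecrawl_firecrawl_[a-z_]+\b")


def suggest_replacements(text):
    """Map each Firecrawl tool name found in `text` to its replacement."""
    found = sorted(set(FIRECRAWL_TOOL.findall(text)))
    return {name: REPLACEMENTS.get(name, "no replacement listed") for name in found}
```

Tool names that are not in the table are reported as "no replacement listed" so they can be reviewed by hand.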
Summary
| Phase | Tool | Purpose | Output |
|---|---|---|---|
| Discovery | Linkup | Find URLs, preview content | Source URLs, text |
| Provenance | Playwright/crawl4ai | Archive HTML, extract XPath | WebClaims with full provenance |
Key Principle: Linkup replaces Firecrawl for discovery, but Rule 6 still requires XPath provenance for person data. Use Playwright or crawl4ai to fulfill Rule 6 requirements.
Created: 2025-12-27
Updated: 2025-12-27 (clarified two-phase workflow)
Status: ACTIVE
Relates to:
- .opencode/LINKUP_PROVENANCE_POLICY.md - Linkup-only provenance for TIER_4 data
- .opencode/WEB_OBSERVATION_PROVENANCE_RULES.md - XPath requirements (Rule 6)
- AGENTS.md Rule 6 - WebObservation XPath requirements
- AGENTS.md Rule 34 - This rule