# Linkup is the Preferred Web Scraper

## Overview

**Use Linkup (`linkup_linkup-search` or `linkup_linkup-fetch`) as the primary web scraping tool instead of Firecrawl for DISCOVERY. Then use Playwright or crawl4ai for XPath provenance extraction.**

This rule establishes a **two-phase workflow**:

1. **Phase 1 (Discovery)**: Linkup to find information and URLs
2. **Phase 2 (Provenance)**: Playwright/crawl4ai to archive HTML and extract XPaths

## Why Linkup Over Firecrawl for Discovery

| Aspect | Linkup | Firecrawl |
|--------|--------|-----------|
| **Availability** | Always available | Credit-limited (exhausts quickly) |
| **Cost** | Included in plan | Consumes credits |
| **Reliability** | Consistent | May hit rate limits |
| **Source Attribution** | Returns source URLs | Raw content only |
| **LLM Integration** | Built-in summarization | Requires separate step |

## Two-Phase Workflow for Person Data Extraction

### Phase 1: Discovery (Linkup)

Use Linkup to **find** the staff page and **identify** the information:

```
1. Search for staff page URL
   Tool: linkup_linkup-search
   Query: "staff members [Custodian Name] medewerkers"

2. Identify relevant URLs from search results
   Extract: https://example.org/organisatie/medewerkers

3. Preview content (optional)
   Tool: linkup_linkup-fetch
   URL: https://example.org/organisatie/medewerkers
```

### Phase 2: XPath Provenance (Playwright or crawl4ai)

Use Playwright to **archive** the HTML and **extract with XPath provenance**:

```
1. Navigate to the URL
   Tool: playwright_browser_navigate
   URL: https://example.org/organisatie/medewerkers

2. Take accessibility snapshot (for element refs)
   Tool: playwright_browser_snapshot

3. Archive the HTML
   Save to: data/custodian/web/{GHCID}/example.org/medewerkers_rendered.html

4. Extract data with XPath locations
   For each person, record:
   - claim_value: "Corinne Rodenburg"
   - xpath: /html/body/main/section[2]/div[1]/ul/li[1]
   - html_file: web/{GHCID}/example.org/medewerkers_rendered.html
```

## Tool Selection by Purpose

| Purpose | Tool | Output |
|---------|------|--------|
| **Find URLs** | `linkup_linkup-search` | Source URLs |
| **Preview content** | `linkup_linkup-fetch` | Text content (no XPath) |
| **Archive HTML** | `playwright_browser_navigate` + save | HTML file |
| **Get element refs** | `playwright_browser_snapshot` | Accessibility tree |
| **Extract with XPath** | Parse archived HTML | Claims with XPath provenance |

## Linkup Tools Reference

### `linkup_linkup-search`

Best for: Finding URLs and discovering information

```
Parameters:
- query: Natural language search query (required)
- depth: "standard" for direct answers, "deep" for complex research
```

### `linkup_linkup-fetch`

Best for: Quick content preview (NOT for provenance)

```
Parameters:
- url: The URL to fetch (required)
- renderJs: Set to true for JavaScript-rendered content
```

## When Linkup-Only is Acceptable (TIER_4)

For **timeline events** and **non-critical metadata**, Linkup-only provenance is acceptable:

```yaml
# TIER_4_INFERRED - Linkup discovery without XPath
timeline_events:
  - event_type: FOUNDING
    event_date: "1895-01-01"
    provenance:
      retrieval_agent: linkup_linkup-search
      linkup_query: "founding date Drents Museum"
      source_urls:
        - https://www.drentsmuseum.nl/over-ons
      data_tier: TIER_4_INFERRED
```

## When XPath Provenance is REQUIRED (Rule 6)

For **person data** and **web claims**, XPath provenance is mandatory:

```yaml
# TIER_2_VERIFIED - Full XPath provenance
web_claims:
  - claim_type: full_name
    claim_value: "Corinne Rodenburg"
    source_url: https://www.drentsarchief.nl/organisatie/medewerkers
    xpath: /html/body/main/section[2]/div[1]/ul/li[1]/span[1]
    html_file: web/NL-DR-ASS-A-PDA/drentsarchief.nl/medewerkers.html
    retrieved_on: "2025-12-27T12:00:00Z"
    xpath_match_score: 1.0
    retrieval_agent: playwright
```

## Complete Example Workflow

```
GOAL: Extract staff from Drents Archief

PHASE 1: DISCOVERY (Linkup)
─────────────────────────────
1. linkup_linkup-search
   Query: "medewerkers Drents Archief site:drentsarchief.nl"
   Result: Found https://www.drentsarchief.nl/organisatie/over-het-drents-archief/medewerkers

2. linkup_linkup-fetch (optional preview)
   URL: https://www.drentsarchief.nl/organisatie/over-het-drents-archief/medewerkers
   Result: Page lists 16 staff members

PHASE 2: PROVENANCE (Playwright)
─────────────────────────────────
3. playwright_browser_navigate
   URL: https://www.drentsarchief.nl/organisatie/over-het-drents-archief/medewerkers

4. playwright_browser_snapshot
   Result: Accessibility tree with element references

5. Archive HTML (bash or script)
   Save rendered HTML to: data/custodian/web/NL-DR-ASS-A-PDA/drentsarchief.nl/medewerkers.html

6. Extract with XPath
   Parse HTML, record XPath for each person's name and role

7. Create person entity JSONs with web_claims
   Each claim includes xpath, html_file, xpath_match_score

8. Update custodian YAML with person_observations
   Reference the person entity files
```

## Alternatives to Playwright for Phase 2

| Tool | Best For | XPath Support |
|------|----------|---------------|
| **Playwright** | JavaScript-heavy sites, interactive pages | Yes (via snapshot + parsing) |
| **crawl4ai** | Batch processing, simpler sites | Yes |
| **scripts/fetch_website_playwright.py** | Automated archival | Yes |
| **Exa crawling** | LinkedIn profiles (special case) | No (use for content only) |

## Migration from Firecrawl

| Firecrawl Tool | Replacement |
|----------------|-------------|
| `firecrawl_firecrawl_search` | `linkup_linkup-search` (discovery) |
| `firecrawl_firecrawl_scrape` | `linkup_linkup-fetch` (preview) + Playwright (provenance) |
| `firecrawl_firecrawl_map` | Playwright navigation or sitemap parsing |
| `firecrawl_firecrawl_extract` | Playwright + manual XPath extraction |

## Summary

| Phase | Tool | Purpose | Output |
|-------|------|---------|--------|
| **Discovery** | Linkup | Find URLs, preview content | Source URLs, text |
| **Provenance** | Playwright/crawl4ai | Archive HTML, extract XPath | WebClaims with full provenance |

**Key Principle**: Linkup replaces Firecrawl for **discovery**, but Rule 6 still requires XPath provenance for person data. Use Playwright or crawl4ai to fulfill Rule 6 requirements.

---

**Created**: 2025-12-27
**Updated**: 2025-12-27 (clarified two-phase workflow)
**Status**: ACTIVE

**Relates to**:
- `.opencode/LINKUP_PROVENANCE_POLICY.md` - Linkup-only provenance for TIER_4 data
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` - XPath requirements (Rule 6)
- `AGENTS.md` Rule 6 - WebObservation XPath requirements
- `AGENTS.md` Rule 34 - This rule
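
The "Extract with XPath" step of Phase 2 (parse the archived HTML and record an XPath for each claim) can be sketched as follows. This is a minimal illustration, not the project's actual extractor: it assumes the page has already been archived by Playwright, uses only Python's standard-library `html.parser` rather than Playwright or crawl4ai, and the names `XPathTextIndexer` and `build_name_claims` are hypothetical. It always writes a positional index on every step (for example `ul[1]` rather than `ul`), and it assumes an exact text match corresponds to `xpath_match_score: 1.0`. The sample HTML is invented for the demonstration.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are counted but never opened.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class XPathTextIndexer(HTMLParser):
    """Collects (absolute XPath, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._stack = []           # open elements as (tag, "tag[n]") pairs
        self._child_counts = [{}]  # per open element: tag -> children seen
        self.text_nodes = []       # results: list of (xpath, text)

    def handle_starttag(self, tag, attrs):
        counts = self._child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        if tag in VOID_TAGS:
            return
        self._stack.append((tag, f"{tag}[{counts[tag]}]"))
        self._child_counts.append({})

    def handle_endtag(self, tag):
        # Close the nearest matching open element, tolerating unclosed tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                del self._child_counts[i + 1:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            xpath = "/" + "/".join(step for _, step in self._stack)
            self.text_nodes.append((xpath, text))


def build_name_claims(html, names, source_url, html_file, retrieved_on):
    """Return one web_claims-style dict per name found verbatim in the HTML."""
    indexer = XPathTextIndexer()
    indexer.feed(html)
    indexer.close()
    claims = []
    for name in names:
        for xpath, text in indexer.text_nodes:
            if text == name:
                claims.append({
                    "claim_type": "full_name",
                    "claim_value": name,
                    "source_url": source_url,
                    "xpath": xpath,
                    "html_file": html_file,
                    "retrieved_on": retrieved_on,
                    "xpath_match_score": 1.0,  # assumption: exact match
                    "retrieval_agent": "playwright",
                })
                break  # keep the first occurrence only
    return claims


# Invented sample standing in for an archived staff page.
SAMPLE_HTML = """<html><body><main>
<section><h1>Over ons</h1></section>
<section><div><ul>
<li><span>Corinne Rodenburg</span><span>Role A</span></li>
<li><span>Example Person</span><span>Role B</span></li>
</ul></div></section>
</main></body></html>"""

if __name__ == "__main__":
    for claim in build_name_claims(
        SAMPLE_HTML,
        ["Corinne Rodenburg"],
        source_url="https://www.drentsarchief.nl/organisatie/medewerkers",
        html_file="web/NL-DR-ASS-A-PDA/drentsarchief.nl/medewerkers.html",
        retrieved_on="2025-12-27T12:00:00Z",
    ):
        print(claim["claim_value"], claim["xpath"])
```

Names that do not appear verbatim in the archived HTML produce no claim, which keeps the sketch from emitting a claim without an XPath. A real extractor would likely also need fuzzy matching and a score below 1.0 for partial matches; those are left out here.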