kempersc cdb633b0c9 enrich custodian entries with logo

2025-12-27 02:15:17 +01:00

Linkup is the Preferred Web Scraper

Overview

Use Linkup (`linkup_linkup-search` or `linkup_linkup-fetch`) instead of Firecrawl as the primary tool for discovery. Then use Playwright or crawl4ai to extract XPath provenance.

This rule establishes a two-phase workflow:

Phase 1 (Discovery): Linkup to find information and URLs
Phase 2 (Provenance): Playwright/crawl4ai to archive HTML and extract XPaths

Why Linkup Over Firecrawl for Discovery

| Aspect | Linkup | Firecrawl |
|--------|--------|-----------|
| Availability | Always available | Credit-limited (exhausts quickly) |
| Cost | Included in plan | Consumes credits |
| Reliability | Consistent | May hit rate limits |
| Source Attribution | Returns source URLs | Raw content only |
| LLM Integration | Built-in summarization | Requires separate step |

Two-Phase Workflow for Person Data Extraction

Phase 1: Discovery (Linkup)

Use Linkup to find the staff page and identify the information:

1. Search for staff page URL
   Tool: linkup_linkup-search
   Query: "staff members [Custodian Name] medewerkers"
   
2. Identify relevant URLs from search results
   Extract: https://example.org/organisatie/medewerkers
   
3. Preview content (optional)
   Tool: linkup_linkup-fetch
   URL: https://example.org/organisatie/medewerkers
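
Step 2 above can be sketched as a small filter over search results. This is only an illustration: the result shape (a list of dicts with a `url` key) and the helper name `pick_staff_urls` are assumptions, not the documented Linkup response format.

```python
from urllib.parse import urlparse

# Keywords taken from the example query above; extend as needed.
STAFF_KEYWORDS = ("medewerkers", "staff")


def pick_staff_urls(results, domain):
    """Return result URLs on `domain` whose path mentions a staff keyword.

    `results` is assumed to be a list of dicts with a "url" key
    (hypothetical shape, adapt to the actual search output).
    """
    picked = []
    for item in results:
        parsed = urlparse(item["url"])
        host = parsed.netloc.lower().removeprefix("www.")
        if host != domain:
            continue  # skip results from other sites
        if any(word in parsed.path.lower() for word in STAFF_KEYWORDS):
            picked.append(item["url"])
    return picked
```

Filtering on the custodian's own domain keeps third-party pages out of Phase 2, where only the archived source page should carry provenance.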

Phase 2: XPath Provenance (Playwright or crawl4ai)

Use Playwright (or crawl4ai) to archive the HTML and extract data with XPath provenance:

1. Navigate to the URL
   Tool: playwright_browser_navigate
   URL: https://example.org/organisatie/medewerkers
   
2. Take accessibility snapshot (for element refs)
   Tool: playwright_browser_snapshot
   
3. Archive the HTML
   Save to: data/custodian/web/{GHCID}/example.org/medewerkers_rendered.html
   
4. Extract data with XPath locations
   For each person, record:
   - claim_value: "Corinne Rodenburg"
   - xpath: /html/body/main/section[2]/div[1]/ul/li[1]
   - html_file: web/{GHCID}/example.org/medewerkers_rendered.html
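
Step 4 can be sketched with the Python standard library alone. The class and function names below are hypothetical, and the sketch makes two simplifying assumptions: it emits fully indexed paths (for example `/html[1]/body[1]/...`) rather than the abbreviated form shown above, and it assumes reasonably well-formed archived HTML, since `html.parser` does not apply the tree repairs a browser does.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class XPathTextIndexer(HTMLParser):
    """Record an absolute, fully indexed XPath for every non-empty text node."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []             # open elements as (tag, position among same-tag siblings)
        self.sibling_counts = [{}]  # one counter dict per open element, plus the root
        self.texts = []             # collected (xpath, text) pairs

    def handle_starttag(self, tag, attrs):
        counts = self.sibling_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        if tag not in VOID_TAGS:
            self.stack.append((tag, counts[tag]))
            self.sibling_counts.append({})

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                del self.sibling_counts[i + 1:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            xpath = "".join(f"/{tag}[{n}]" for tag, n in self.stack)
            self.texts.append((xpath, text))


def find_xpaths(html, claim_value):
    """Return the XPaths of elements whose text equals `claim_value` exactly."""
    indexer = XPathTextIndexer()
    indexer.feed(html)
    indexer.close()
    return [xpath for xpath, text in indexer.texts if text == claim_value]
```

Running `find_xpaths` against the archived file rather than the live page keeps the recorded XPath reproducible, because the file named in `html_file` is the exact document the path was computed from.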

Tool Selection by Purpose

| Purpose | Tool | Output |
|---------|------|--------|
| Find URLs | `linkup_linkup-search` | Source URLs |
| Preview content | `linkup_linkup-fetch` | Text content (no XPath) |
| Archive HTML | `playwright_browser_navigate` + save | HTML file |
| Get element refs | `playwright_browser_snapshot` | Accessibility tree |
| Extract with XPath | Parse archived HTML | Claims with XPath provenance |

Linkup Tools Reference

`linkup_linkup-search`

Best for: Finding URLs and discovering information

Parameters:
- query: Natural language search query (required)
- depth: "standard" for direct answers, "deep" for complex research

`linkup_linkup-fetch`

Best for: Quick content preview (NOT for provenance)

Parameters:
- url: The URL to fetch (required)
- renderJs: Set to true for JavaScript-rendered content
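
The parameters listed above could be assembled and checked before a tool call, roughly as follows. The helper names and the defaults (`depth="standard"`, `render_js=False`) are assumptions made for this sketch, not documented Linkup behavior. Only the parameter names `query`, `depth`, `url`, and `renderJs` come from this reference.

```python
VALID_DEPTHS = ("standard", "deep")


def search_args(query, depth="standard"):
    """Build arguments for linkup_linkup-search (default depth is an assumption)."""
    if not query.strip():
        raise ValueError("query is required")
    if depth not in VALID_DEPTHS:
        raise ValueError(f"depth must be one of {VALID_DEPTHS}")
    return {"query": query, "depth": depth}


def fetch_args(url, render_js=False):
    """Build arguments for linkup_linkup-fetch (default renderJs is an assumption)."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be an absolute http(s) URL")
    return {"url": url, "renderJs": render_js}
```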

When Linkup-Only is Acceptable (TIER_4)

For timeline events and non-critical metadata, Linkup-only provenance is acceptable:

# TIER_4_INFERRED - Linkup discovery without XPath
timeline_events:
  - event_type: FOUNDING
    event_date: "1895-01-01"
    provenance:
      retrieval_agent: linkup_linkup-search
      linkup_query: "founding date Drents Museum"
      source_urls:
        - https://www.drentsmuseum.nl/over-ons
      data_tier: TIER_4_INFERRED

When XPath Provenance is REQUIRED (Rule 6)

For person data and web claims, XPath provenance is mandatory:

# TIER_2_VERIFIED - Full XPath provenance
web_claims:
  - claim_type: full_name
    claim_value: "Corinne Rodenburg"
    source_url: https://www.drentsarchief.nl/organisatie/medewerkers
    xpath: /html/body/main/section[2]/div[1]/ul/li[1]/span[1]
    html_file: web/NL-DR-ASS-A-PDA/drentsarchief.nl/medewerkers.html
    retrieved_on: "2025-12-27T12:00:00Z"
    xpath_match_score: 1.0
    retrieval_agent: playwright
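
A check along these lines could flag web claims that lack XPath provenance before they are written. The required-field list is inferred from the example above and is an assumption. The authoritative list lives in the Rule 6 documents. `missing_claim_fields` is a hypothetical helper name.

```python
# Fields taken from the example claim above; treat the list as an assumption.
REQUIRED_WEB_CLAIM_FIELDS = (
    "claim_type",
    "claim_value",
    "source_url",
    "xpath",
    "html_file",
    "retrieved_on",
    "xpath_match_score",
)


def missing_claim_fields(claim):
    """Return the required fields that are absent or empty in a web claim dict."""
    return [
        field
        for field in REQUIRED_WEB_CLAIM_FIELDS
        if claim.get(field) is None or claim.get(field) == ""
    ]
```

An empty list means the claim carries every field shown in the example. A claim built from a Linkup preview alone would typically report `xpath` and `html_file` as missing.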

Complete Example Workflow

GOAL: Extract staff from Drents Archief

PHASE 1: DISCOVERY (Linkup)
─────────────────────────────
1. linkup_linkup-search
   Query: "medewerkers Drents Archief site:drentsarchief.nl"
   Result: Found https://www.drentsarchief.nl/organisatie/over-het-drents-archief/medewerkers

2. linkup_linkup-fetch (optional preview)
   URL: https://www.drentsarchief.nl/organisatie/over-het-drents-archief/medewerkers
   Result: Page lists 16 staff members

PHASE 2: PROVENANCE (Playwright)
─────────────────────────────────
3. playwright_browser_navigate
   URL: https://www.drentsarchief.nl/organisatie/over-het-drents-archief/medewerkers

4. playwright_browser_snapshot
   Result: Accessibility tree with element references

5. Archive HTML (bash or script)
   Save rendered HTML to: data/custodian/web/NL-DR-ASS-A-PDA/drentsarchief.nl/medewerkers.html

6. Extract with XPath
   Parse HTML, record XPath for each person's name and role

7. Create person entity JSONs with web_claims
   Each claim includes xpath, html_file, xpath_match_score

8. Update custodian YAML with person_observations
   Reference the person entity files
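
Steps 5 to 7 can be sketched as two small helpers: one derives the archive path from the GHCID and URL, the other assembles a claim record. Both function names are hypothetical. The path convention (strip `www.`, use the last URL path segment plus `.html`) is inferred from the example paths in this rule, and the default `xpath_match_score` of 1.0 assumes an exact text match.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from urllib.parse import urlparse


def archive_relpath(ghcid, url):
    """Derive web/{GHCID}/{domain}/{page}.html from a page URL (assumed convention)."""
    parsed = urlparse(url)
    domain = parsed.netloc.lower().removeprefix("www.")
    page = PurePosixPath(parsed.path).name or "index"  # fall back for bare domains
    return f"web/{ghcid}/{domain}/{page}.html"


def build_web_claim(claim_type, claim_value, xpath, source_url, ghcid,
                    retrieved_on=None, xpath_match_score=1.0):
    """Assemble one web claim dict with the provenance fields used in this rule."""
    if retrieved_on is None:
        retrieved_on = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "claim_type": claim_type,
        "claim_value": claim_value,
        "source_url": source_url,
        "xpath": xpath,
        "html_file": archive_relpath(ghcid, source_url),
        "retrieved_on": retrieved_on,
        "xpath_match_score": xpath_match_score,
        "retrieval_agent": "playwright",
    }
```

Deriving `html_file` from the same URL that is stored in `source_url` keeps the two fields from drifting apart when claims are generated in bulk.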

Alternatives to Playwright for Phase 2

| Tool | Best For | XPath Support |
|------|----------|---------------|
| Playwright | JavaScript-heavy sites, interactive pages | Yes (via snapshot + parsing) |
| crawl4ai | Batch processing, simpler sites | Yes |
| scripts/fetch_website_playwright.py | Automated archival | Yes |
| Exa crawling | LinkedIn profiles (special case) | No (use for content only) |

Migration from Firecrawl

| Firecrawl Tool | Replacement |
|----------------|-------------|
| `firecrawl_firecrawl_search` | `linkup_linkup-search` (discovery) |
| `firecrawl_firecrawl_scrape` | `linkup_linkup-fetch` (preview) + Playwright (provenance) |
| `firecrawl_firecrawl_map` | Playwright navigation or sitemap parsing |
| `firecrawl_firecrawl_extract` | Playwright + manual XPath extraction |
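
The mapping above lends itself to a simple scan for leftover Firecrawl tool names in prompts or notes. The helper name `find_firecrawl_calls` is hypothetical. The replacements are copied from the table.

```python
import re

# Firecrawl tool name -> suggested replacement, copied from the migration table.
REPLACEMENTS = {
    "firecrawl_firecrawl_search": "linkup_linkup-search (discovery)",
    "firecrawl_firecrawl_scrape": "linkup_linkup-fetch (preview) + Playwright (provenance)",
    "firecrawl_firecrawl_map": "Playwright navigation or sitemap parsing",
    "firecrawl_firecrawl_extract": "Playwright + manual XPath extraction",
}


def find_firecrawl_calls(text):
    """Return (tool, suggestion) pairs for each Firecrawl tool name found in text."""
    found = []
    for tool, suggestion in REPLACEMENTS.items():
        if re.search(rf"\b{re.escape(tool)}\b", text):
            found.append((tool, suggestion))
    return found
```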

Summary

| Phase | Tool | Purpose | Output |
|-------|------|---------|--------|
| Discovery | Linkup | Find URLs, preview content | Source URLs, text |
| Provenance | Playwright/crawl4ai | Archive HTML, extract XPath | WebClaims with full provenance |

Key Principle: Linkup replaces Firecrawl for discovery, but Rule 6 still requires XPath provenance for person data. Use Playwright or crawl4ai to fulfill Rule 6 requirements.

Created: 2025-12-27
Updated: 2025-12-27 (clarified two-phase workflow)
Status: ACTIVE
Relates to:

.opencode/LINKUP_PROVENANCE_POLICY.md - Linkup-only provenance for TIER_4 data
.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md - XPath requirements (Rule 6)
AGENTS.md Rule 6 - WebObservation XPath requirements
AGENTS.md Rule 34 - This rule