# Linkup is the Preferred Web Scraper

## Overview

**Use Linkup (`linkup_linkup-search` or `linkup_linkup-fetch`) as the primary web scraping tool instead of Firecrawl for DISCOVERY. Then use Playwright or crawl4ai for XPath provenance extraction.**

This rule establishes a **two-phase workflow**:

1. **Phase 1 (Discovery)**: Linkup to find information and URLs
2. **Phase 2 (Provenance)**: Playwright/crawl4ai to archive HTML and extract XPaths

## Why Linkup Over Firecrawl for Discovery

| Aspect | Linkup | Firecrawl |
|--------|--------|-----------|
| **Availability** | Always available | Credit-limited (exhausts quickly) |
| **Cost** | Included in plan | Consumes credits |
| **Reliability** | Consistent | May hit rate limits |
| **Source Attribution** | Returns source URLs | Raw content only |
| **LLM Integration** | Built-in summarization | Requires a separate step |

## Two-Phase Workflow for Person Data Extraction

### Phase 1: Discovery (Linkup)

Use Linkup to **find** the staff page and **identify** the information:

```
1. Search for staff page URL
   Tool: linkup_linkup-search
   Query: "staff members [Custodian Name] medewerkers"

2. Identify relevant URLs from search results
   Extract: https://example.org/organisatie/medewerkers

3. Preview content (optional)
   Tool: linkup_linkup-fetch
   URL: https://example.org/organisatie/medewerkers
```

### Phase 2: XPath Provenance (Playwright or crawl4ai)

Use Playwright to **archive** the HTML and **extract with XPath provenance**:

```
1. Navigate to the URL
   Tool: playwright_browser_navigate
   URL: https://example.org/organisatie/medewerkers

2. Take accessibility snapshot (for element refs)
   Tool: playwright_browser_snapshot

3. Archive the HTML
   Save to: data/custodian/web/{GHCID}/example.org/medewerkers_rendered.html

4. Extract data with XPath locations
   For each person, record:
   - claim_value: "Corinne Rodenburg"
   - xpath: /html/body/main/section[2]/div[1]/ul/li[1]
   - html_file: web/{GHCID}/example.org/medewerkers_rendered.html
```
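
Step 4 above can be sketched with the standard library alone. The sample markup, the `<span>`-based selection, and the rule of indexing every path step (e.g. `/html[1]/body[1]/...`) are illustrative assumptions, not the markup or exact XPath style of any real site:

```python
# Sketch of Phase 2, step 4: derive absolute, indexed XPaths while parsing
# archived HTML. Stdlib-only; the sample HTML below is invented.
from html.parser import HTMLParser

class XPathRecorder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []     # [(tag, sibling_index), ...] from root to current node
        self.counts = [{}]  # per-depth counters of how often each tag has appeared
        self.claims = []

    def handle_starttag(self, tag, attrs):
        counter = self.counts[-1]
        counter[tag] = counter.get(tag, 0) + 1
        self.stack.append((tag, counter[tag]))
        self.counts.append({})

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        text = data.strip()
        # Record every non-empty <span> text as a candidate claim with its XPath.
        if text and self.stack and self.stack[-1][0] == "span":
            xpath = "".join(f"/{t}[{i}]" for t, i in self.stack)
            self.claims.append({"claim_value": text, "xpath": xpath})

SAMPLE = ("<html><body><main><section></section><section><div><ul>"
          "<li><span>Corinne Rodenburg</span></li>"
          "<li><span>Jan Jansen</span></li>"
          "</ul></div></section></main></body></html>")

recorder = XPathRecorder()
recorder.feed(SAMPLE)
claims = recorder.claims
# claims[0] → {'claim_value': 'Corinne Rodenburg',
#              'xpath': '/html[1]/body[1]/main[1]/section[2]/div[1]/ul[1]/li[1]/span[1]'}
```

In practice the selection rule (which elements hold names vs roles) is site-specific; the point is that the XPath is computed from the archived file itself, so it can be re-verified later.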

## Tool Selection by Purpose

| Purpose | Tool | Output |
|---------|------|--------|
| **Find URLs** | `linkup_linkup-search` | Source URLs |
| **Preview content** | `linkup_linkup-fetch` | Text content (no XPath) |
| **Archive HTML** | `playwright_browser_navigate` + save | HTML file |
| **Get element refs** | `playwright_browser_snapshot` | Accessibility tree |
| **Extract with XPath** | Parse archived HTML | Claims with XPath provenance |

## Linkup Tools Reference

### `linkup_linkup-search`

Best for: Finding URLs and discovering information

```
Parameters:
- query: Natural language search query (required)
- depth: "standard" for direct answers, "deep" for complex research
```

### `linkup_linkup-fetch`

Best for: Quick content preview (NOT for provenance)

```
Parameters:
- url: The URL to fetch (required)
- renderJs: Set to true for JavaScript-rendered content
```

## When Linkup-Only is Acceptable (TIER_4)

For **timeline events** and **non-critical metadata**, Linkup-only provenance is acceptable:

```yaml
# TIER_4_INFERRED - Linkup discovery without XPath
timeline_events:
  - event_type: FOUNDING
    event_date: "1895-01-01"
    provenance:
      retrieval_agent: linkup_linkup-search
      linkup_query: "founding date Drents Museum"
      source_urls:
        - https://www.drentsmuseum.nl/over-ons
      data_tier: TIER_4_INFERRED
```

## When XPath Provenance is REQUIRED (Rule 6)

For **person data** and **web claims**, XPath provenance is mandatory:

```yaml
# TIER_2_VERIFIED - Full XPath provenance
web_claims:
  - claim_type: full_name
    claim_value: "Corinne Rodenburg"
    source_url: https://www.drentsarchief.nl/organisatie/medewerkers
    xpath: /html/body/main/section[2]/div[1]/ul/li[1]/span[1]
    html_file: web/NL-DR-ASS-A-PDA/drentsarchief.nl/medewerkers.html
    retrieved_on: "2025-12-27T12:00:00Z"
    xpath_match_score: 1.0
    retrieval_agent: playwright
```
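
This document does not define how `xpath_match_score` is computed. One plausible sketch, assuming the intent is "how closely does the archived page's text at the stored XPath match the claimed value", scores string similarity and yields 1.0 on an exact match, as in the example above:

```python
# Hypothetical scoring helper: compare the claim_value against the text
# re-extracted from the archived HTML at the stored XPath.
# The function name and formula are illustrative assumptions, not the
# repo's actual rule.
from difflib import SequenceMatcher

def xpath_match_score(claim_value: str, extracted_text: str) -> float:
    """Return a 0.0-1.0 similarity; 1.0 means the XPath still yields the claim."""
    return round(SequenceMatcher(None, claim_value.strip(),
                                 extracted_text.strip()).ratio(), 2)

xpath_match_score("Corinne Rodenburg", "Corinne Rodenburg")  # → 1.0
```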

## Complete Example Workflow

```
GOAL: Extract staff from Drents Archief

PHASE 1: DISCOVERY (Linkup)
─────────────────────────────
1. linkup_linkup-search
   Query: "medewerkers Drents Archief site:drentsarchief.nl"
   Result: Found https://www.drentsarchief.nl/organisatie/over-het-drents-archief/medewerkers

2. linkup_linkup-fetch (optional preview)
   URL: https://www.drentsarchief.nl/organisatie/over-het-drents-archief/medewerkers
   Result: Page lists 16 staff members

PHASE 2: PROVENANCE (Playwright)
─────────────────────────────────
3. playwright_browser_navigate
   URL: https://www.drentsarchief.nl/organisatie/over-het-drents-archief/medewerkers

4. playwright_browser_snapshot
   Result: Accessibility tree with element references

5. Archive HTML (bash or script)
   Save rendered HTML to: data/custodian/web/NL-DR-ASS-A-PDA/drentsarchief.nl/medewerkers.html

6. Extract with XPath
   Parse HTML, record XPath for each person's name and role

7. Create person entity JSONs with web_claims
   Each claim includes xpath, html_file, xpath_match_score

8. Update custodian YAML with person_observations
   Reference the person entity files
```
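
The archive path in step 5 can be derived mechanically from the source URL. The slug rule below (last path segment, `www.` stripped) is an assumption inferred from the example paths in this rule, and `archive_path` is a hypothetical helper, not an existing repo function:

```python
# Hypothetical helper for step 5: map a source URL onto the archive layout
# data/custodian/web/{GHCID}/{domain}/{slug}.html used in this rule's examples.
from urllib.parse import urlparse
from pathlib import PurePosixPath

def archive_path(ghcid: str, url: str, suffix: str = ".html") -> str:
    parts = urlparse(url)
    domain = parts.netloc.removeprefix("www.")        # www.foo.nl -> foo.nl
    slug = PurePosixPath(parts.path).name or "index"  # last path segment
    return f"data/custodian/web/{ghcid}/{domain}/{slug}{suffix}"

archive_path(
    "NL-DR-ASS-A-PDA",
    "https://www.drentsarchief.nl/organisatie/over-het-drents-archief/medewerkers",
)
# → 'data/custodian/web/NL-DR-ASS-A-PDA/drentsarchief.nl/medewerkers.html'
```

Deriving the path once, then reusing it in every claim's `html_file`, keeps the claim records and the archived files from drifting apart.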

## Alternatives to Playwright for Phase 2

| Tool | Best For | XPath Support |
|------|----------|---------------|
| **Playwright** | JavaScript-heavy sites, interactive pages | Yes (via snapshot + parsing) |
| **crawl4ai** | Batch processing, simpler sites | Yes |
| **scripts/fetch_website_playwright.py** | Automated archival | Yes |
| **Exa crawling** | LinkedIn profiles (special case) | No (use for content only) |

## Migration from Firecrawl

| Firecrawl Tool | Replacement |
|----------------|-------------|
| `firecrawl_firecrawl_search` | `linkup_linkup-search` (discovery) |
| `firecrawl_firecrawl_scrape` | `linkup_linkup-fetch` (preview) + Playwright (provenance) |
| `firecrawl_firecrawl_map` | Playwright navigation or sitemap parsing |
| `firecrawl_firecrawl_extract` | Playwright + manual XPath extraction |

## Summary

| Phase | Tool | Purpose | Output |
|-------|------|---------|--------|
| **Discovery** | Linkup | Find URLs, preview content | Source URLs, text |
| **Provenance** | Playwright/crawl4ai | Archive HTML, extract XPath | WebClaims with full provenance |

**Key Principle**: Linkup replaces Firecrawl for **discovery**, but Rule 6 still requires XPath provenance for person data. Use Playwright or crawl4ai to fulfill Rule 6 requirements.

---

**Created**: 2025-12-27
**Updated**: 2025-12-27 (clarified two-phase workflow)
**Status**: ACTIVE
**Relates to**:

- `.opencode/LINKUP_PROVENANCE_POLICY.md` - Linkup-only provenance for TIER_4 data
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` - XPath requirements (Rule 6)
- `AGENTS.md` Rule 6 - WebObservation XPath requirements
- `AGENTS.md` Rule 34 - This rule