# Web Enrichment via Firecrawl MCP

This document describes using Firecrawl MCP as a **practical alternative** to the full Playwright archiving process for extracting web claims with XPath provenance.

## Overview

The Firecrawl MCP tool can return raw HTML content, enabling:

1. **Local HTML archiving** - Save the raw HTML as archived files
2. **XPath generation** - Parse HTML to generate exact XPaths for claims
3. **WebClaim provenance** - Create properly sourced claims per Rule 6

This is a **lightweight alternative** to the full `scripts/fetch_website_playwright.py` workflow, suitable for:

- Quick enrichment of individual custodian entries
- Sites that work well with static HTML scraping
- Situations where JavaScript rendering is not required

## When to Use Each Method

| Method | Use Case | Pros | Cons |
|--------|----------|------|------|
| **Firecrawl MCP** | Simple static sites, quick enrichment | Fast, MCP-integrated, cached results | No JS rendering, limited interaction |
| **Playwright script** | Complex SPAs, JS-heavy sites | Full JS support, screenshots | Slower, requires local setup |

## Workflow

### Step 1: Scrape with Firecrawl (rawHtml format)

```
Tool: firecrawl_firecrawl_scrape
Parameters:
  url: "http://www.example-museum.org/about/"
  formats: ["rawHtml"]
```

The `rawHtml` format returns the complete HTML source, including all elements needed for XPath generation.

### Step 2: Create Archive Directory

```
data/custodian/web/{GHCID}/{domain}/
```

Example:

```
data/custodian/web/AG-04-SJ-M-MAB/antiguamuseums.net/
├── about-the-museum.html
├── staff.html
└── metadata.yaml
```

### Step 3: Save HTML Files

Save the `rawHtml` content from Firecrawl to local files. Strip unnecessary CSS/JS if desired, but preserve the DOM structure for XPath validity.
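The save step can be scripted with the standard library alone. The sketch below is illustrative, not an existing project script; the `save_page` helper and its parameters are assumptions, and it simply writes the `rawHtml` string from a Firecrawl scrape into the archive layout shown above.

```python
from pathlib import Path

def save_page(ghcid: str, domain: str, slug: str, raw_html: str) -> Path:
    """Write one scraped page into data/custodian/web/{GHCID}/{domain}/."""
    archive_dir = Path("data/custodian/web") / ghcid / domain
    archive_dir.mkdir(parents=True, exist_ok=True)
    out_path = archive_dir / f"{slug}.html"
    # Preserve the raw DOM exactly as fetched so later XPaths stay valid
    out_path.write_text(raw_html, encoding="utf-8")
    return out_path
```

Keeping the file name derived from the page slug makes the `html_file` field in the metadata and claims below easy to populate consistently.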
### Step 4: Create metadata.yaml

```yaml
archive_metadata:
  ghcid: AG-04-SJ-M-MAB
  custodian_name: Museum of Antigua and Barbuda
  domain: antiguamuseums.net
  archive_created: "2025-12-14T09:34:47Z"
  fetched_pages:
    - file: about-the-museum.html
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      fetch_method: firecrawl
      cache_state: hit
      content_type: text/html
      status_code: 200
```

### Step 5: Parse HTML for XPaths

Use Python with lxml to generate accurate XPaths:

```python
from lxml import etree

# Parse the archived HTML
parser = etree.HTMLParser()
tree = etree.fromstring(html_content, parser)

# Find the element containing the target text
for elem in tree.iter('h5'):
    text = ''.join(elem.itertext())
    if 'telephone:' in text.lower():
        xpath = tree.getroottree().getpath(elem)
        print(f"Phone element XPath: {xpath}")
        # Result: /html/body/div/div[3]/div/div/div/h5[18]
```

### Step 6: Add WebClaims to Custodian YAML

```yaml
web_enrichment:
  archive_path: web/AG-04-SJ-M-MAB/antiguamuseums.net/
  fetch_timestamp: "2025-12-14T09:34:47Z"
  fetch_method: firecrawl
  claims:
    - claim_type: founding_date
      claim_value: "1985"
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      xpath: /html/body/div/div[3]/div/div/div/h5[1]
      html_file: web/AG-04-SJ-M-MAB/antiguamuseums.net/about-the-museum.html
      xpath_match_score: 1.0
    - claim_type: phone
      claim_value: "1-268-462-4930"
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      xpath: /html/body/div/div[3]/div/div/div/h5[18]
      html_file: web/AG-04-SJ-M-MAB/antiguamuseums.net/about-the-museum.html
      xpath_match_score: 1.0
```

## XPath Generation Tips

### Finding Elements by Text Content

```python
# Find the element containing specific text
for elem in tree.iter():
    text = ''.join(elem.itertext())
    if 'search term' in text.lower():
        xpath = tree.getroottree().getpath(elem)
```

### Common XPath Patterns

| Content Type | Typical XPath Pattern |
|--------------|-----------------------|
| Page title | `/html/body/div[@id='wrapper']/div[@id='content']/h1[1]` |
| Address | `/html/body/...//h5[contains(text(),'address')]` |
| Phone | `/html/body/...//h5[contains(text(),'telephone')]` |
| Description | `/html/body/...//div[@class='content']/p[1]` |

### Verifying XPath Accuracy

```python
# Verify that the XPath returns the expected content
result = tree.xpath(xpath)
if result:
    actual_text = ''.join(result[0].itertext())
    if expected_value in actual_text:
        match_score = 1.0  # Exact match
    else:
        match_score = 0.8  # Partial match - investigate
```

## ClaimTypeEnum Values

From `schemas/20251121/linkml/modules/classes/WebClaim.yaml`:

| Claim Type | Description |
|------------|-------------|
| `full_name` | Official institution name |
| `description` | Institution description/mission |
| `phone` | Phone number |
| `email` | Email address |
| `address` | Physical address |
| `website` | Official website URL |
| `founding_date` | Year or date founded |
| `parent_organization` | Managing/parent organization |
| `opening_hours` | Hours of operation |
| `social_media` | Social media URLs |

## Example: Museum of Antigua and Barbuda

### Extracted Claims

| Claim Type | Value | XPath |
|------------|-------|-------|
| `founding_date` | 1985 | `/html/body/div/div[3]/div/div/div/h5[1]` |
| `address` | Long Street, St. John's, Antigua | `/html/body/div/div[3]/div/div/div/h5[9]` |
| `phone` | 1-268-462-4930 | `/html/body/div/div[3]/div/div/div/h5[18]` |
| `parent_organization` | Historical and Archaeological Society of Antigua and Barbuda | `/html/body/div/div[3]/div/div/div/h5[20]` |

### Staff Data (from separate page)

| Role | Name | XPath |
|------|------|-------|
| Curator | Dian Andrews | `/html/body/div/div[3]/div/div/div/h5[3]` |
| Research Librarian | Myra Piper | `/html/body/div/div[3]/div/div/div/h5[4]` |
| Board Chairperson | Walter Berridge | `/html/body/div/div[3]/div/div/div/h5[25]` |
| Board President | Reg Murphy | `/html/body/div/div[3]/div/div/div/h5[26]` |

## Comparison with Playwright Method

| Aspect | Firecrawl MCP | Playwright Script |
|--------|---------------|-------------------|
| **Setup** | None (MCP ready) | Python + Playwright install |
| **Speed** | Fast (cached) | Slower (full render) |
| **JS Support** | Limited | Full |
| **Screenshots** | No | Yes |
| **Archival** | Manual save | Automatic |
| **Best For** | Static HTML sites | SPAs, JS-heavy sites |

## Related Documentation

- `AGENTS.md` Rule 6: WebObservation Claims MUST Have XPath Provenance
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` - Complete provenance rules
- `schemas/20251121/linkml/modules/classes/WebClaim.yaml` - Schema definition
- `scripts/fetch_website_playwright.py` - Full Playwright archiving script
- `scripts/add_xpath_provenance.py` - XPath verification script

## Limitations

1. **No JavaScript rendering** - Sites requiring JS won't work well
2. **Manual archiving** - HTML must be saved manually (vs automatic with Playwright)
3. **No screenshots** - Cannot capture visual state
4. **Cache dependency** - Firecrawl caching affects data freshness

For complex sites or when screenshots are needed, use the full Playwright workflow instead.
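As a closing sketch, the verification step from "Verifying XPath Accuracy" can be wrapped in a small function and run over every claim before it is committed. The `verify_claim` helper below is illustrative (it is not one of the project scripts listed above); it assumes lxml is available, takes the archived HTML text plus one claim's `xpath` and `claim_value`, and returns an `xpath_match_score`.

```python
from lxml import etree

def verify_claim(html_text: str, xpath: str, expected_value: str) -> float:
    """Score one claim's XPath against archived HTML (1.0 / 0.8 / 0.0)."""
    tree = etree.fromstring(html_text, etree.HTMLParser())
    nodes = tree.xpath(xpath)
    if not nodes:
        return 0.0  # XPath no longer resolves: archive and claim disagree
    actual_text = ''.join(nodes[0].itertext())
    # Exact-value containment scores 1.0; anything else needs review
    return 1.0 if expected_value in actual_text else 0.8

# Example against a trivial page (not the real archived museum HTML)
html = "<html><body><div><h5>Telephone: 1-268-462-4930</h5></div></body></html>"
score = verify_claim(html, "/html/body/div/h5[1]", "1-268-462-4930")
```

A score below 1.0 is a signal to re-scrape the page or regenerate the XPath before recording the claim, matching the 0.8 "partial match - investigate" convention used earlier in this document.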