
Web Enrichment via Firecrawl MCP

This document describes using Firecrawl MCP as a practical alternative to the full Playwright archiving process for extracting web claims with XPath provenance.

Overview

The Firecrawl MCP tool can return raw HTML content, enabling:

  1. Local HTML archiving - Save the raw HTML as archived files
  2. XPath generation - Parse HTML to generate exact XPaths for claims
  3. WebClaim provenance - Create properly sourced claims per Rule 6

This is a lightweight alternative to the full scripts/fetch_website_playwright.py workflow, suitable for:

  • Quick enrichment of individual custodian entries
  • Sites that work well with static HTML scraping
  • Situations where JavaScript rendering is not required

When to Use Each Method

| Method | Use Case | Pros | Cons |
|---|---|---|---|
| Firecrawl MCP | Simple static sites, quick enrichment | Fast, MCP-integrated, cached results | No JS rendering, limited interaction |
| Playwright script | Complex SPAs, JS-heavy sites | Full JS support, screenshots | Slower, requires local setup |

Workflow

Step 1: Scrape with Firecrawl (rawHtml format)

Tool: firecrawl_firecrawl_scrape
Parameters:
  url: "http://www.example-museum.org/about/"
  formats: ["rawHtml"]

The rawHtml format returns the complete HTML source, including all elements needed for XPath generation.

Step 2: Create Archive Directory

data/custodian/web/{GHCID}/{domain}/

Example:

data/custodian/web/AG-04-SJ-M-MAB/antiguamuseums.net/
├── about-the-museum.html
├── staff.html
└── metadata.yaml

Step 3: Save HTML Files

Save the rawHtml content from Firecrawl to local files. Strip unnecessary CSS/JS if desired, but preserve the DOM structure for XPath validity.
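As a sketch, the save step might look like the following (the `save_archive_page` helper name and its arguments are illustrative, not part of any existing script):

```python
from pathlib import Path

def save_archive_page(archive_dir: str, filename: str, raw_html: str) -> Path:
    """Save Firecrawl rawHtml output under the custodian archive directory."""
    out_dir = Path(archive_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / filename
    # Write the HTML as-is: XPaths generated later must match the
    # exact DOM structure that was archived.
    out_file.write_text(raw_html, encoding="utf-8")
    return out_file
```

If you do strip CSS/JS before saving, re-run XPath generation against the stripped file, not the original response, so the archived DOM and the recorded XPaths stay in sync.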

Step 4: Create metadata.yaml

archive_metadata:
  ghcid: AG-04-SJ-M-MAB
  custodian_name: Museum of Antigua and Barbuda
  domain: antiguamuseums.net
  archive_created: "2025-12-14T09:34:47Z"

fetched_pages:
  - file: about-the-museum.html
    source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
    retrieved_on: "2025-12-14T09:34:47Z"
    fetch_method: firecrawl
    cache_state: hit
    content_type: text/html
    status_code: 200

Step 5: Parse HTML for XPaths

Use Python with lxml to generate accurate XPaths:

from lxml import etree

# Parse HTML
parser = etree.HTMLParser()
tree = etree.fromstring(html_content, parser)

# Find element containing target text
for elem in tree.iter('h5'):
    text = ''.join(elem.itertext())
    if 'telephone:' in text.lower():
        xpath = tree.getroottree().getpath(elem)
        print(f"Phone element XPath: {xpath}")
        # Result: /html/body/div/div[3]/div/div/div/h5[18]

Step 6: Add WebClaims to Custodian YAML

web_enrichment:
  archive_path: web/AG-04-SJ-M-MAB/antiguamuseums.net/
  fetch_timestamp: "2025-12-14T09:34:47Z"
  fetch_method: firecrawl
  claims:
    - claim_type: founding_date
      claim_value: "1985"
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      xpath: /html/body/div/div[3]/div/div/div/h5[1]
      html_file: web/AG-04-SJ-M-MAB/antiguamuseums.net/about-the-museum.html
      xpath_match_score: 1.0
      
    - claim_type: phone
      claim_value: "1-268-462-4930"
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      xpath: /html/body/div/div[3]/div/div/div/h5[18]
      html_file: web/AG-04-SJ-M-MAB/antiguamuseums.net/about-the-museum.html
      xpath_match_score: 1.0
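Before committing the YAML, a few lines of Python can flag claims that are missing Rule 6 provenance. This is an illustrative sanity check (the `check_claims` helper is hypothetical; the required-field list is transcribed from the example above, and the schema remains authoritative):

```python
# Required provenance fields per the WebClaim example above (Rule 6).
REQUIRED_CLAIM_FIELDS = {
    "claim_type", "claim_value", "source_url",
    "retrieved_on", "xpath", "html_file",
}

def check_claims(enrichment: dict) -> list:
    """Given a parsed web_enrichment mapping (e.g. from yaml.safe_load),
    return (index, missing_fields) pairs for claims lacking provenance."""
    problems = []
    for i, claim in enumerate(enrichment.get("claims", [])):
        missing = REQUIRED_CLAIM_FIELDS - claim.keys()
        if missing:
            problems.append((i, sorted(missing)))
    return problems
```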

XPath Generation Tips

Finding Elements by Text Content

# Find element containing specific text.
# Note: checking ''.join(elem.itertext()) while iterating all elements
# matches ancestors too (html, body, ...), since itertext() includes
# descendant text. Check the element's own text, or iterate a specific
# tag as in Step 5, to get the deepest matching element.
for elem in tree.iter():
    own_text = elem.text or ''
    if 'search term' in own_text.lower():
        xpath = tree.getroottree().getpath(elem)

Common XPath Patterns

| Content Type | Typical XPath Pattern |
|---|---|
| Page title | /html/body/div[@id='wrapper']/div[@id='content']/h1[1] |
| Address | /html/body/...//h5[contains(text(),'address')] |
| Phone | /html/body/...//h5[contains(text(),'telephone')] |
| Description | /html/body/...//div[@class='content']/p[1] |
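A minimal lxml example of how a contains() predicate locates an element (the sample HTML is invented). Predicates like this are useful for finding elements; for the provenance records themselves, prefer the absolute getpath() XPaths shown in Step 5:

```python
from lxml import etree

html = """<html><body>
  <h5>Address: Long Street</h5>
  <h5>Telephone: 1-268-462-4930</h5>
</body></html>"""

tree = etree.fromstring(html, etree.HTMLParser())
# Find any h5 whose own text node mentions 'Telephone'
hits = tree.xpath("//h5[contains(text(), 'Telephone')]")
print(hits[0].text)  # Telephone: 1-268-462-4930
```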

Verifying XPath Accuracy

# Verify the stored XPath still returns the expected content
result = tree.xpath(xpath)
if result:
    actual_text = ''.join(result[0].itertext())
    if expected_value in actual_text:
        match_score = 1.0  # Exact match
    else:
        match_score = 0.8  # Partial match - investigate
else:
    match_score = 0.0  # XPath no longer resolves - re-archive and regenerate
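The check above can be wrapped into a reusable helper. This is a sketch; the 0.8 threshold follows the convention shown, and the 0.0 fallback for an unresolvable XPath is a suggested convention rather than anything schema-mandated:

```python
from lxml import etree

def xpath_match_score(html: str, xpath: str, expected_value: str) -> float:
    """Score how well a stored XPath still locates its claim value.
    1.0 = element found and contains the value; 0.8 = element found but
    text differs (investigate); 0.0 = XPath no longer resolves."""
    tree = etree.fromstring(html, etree.HTMLParser())
    result = tree.xpath(xpath)
    if not result:
        return 0.0
    actual_text = "".join(result[0].itertext())
    return 1.0 if expected_value in actual_text else 0.8
```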

ClaimTypeEnum Values

From schemas/20251121/linkml/modules/classes/WebClaim.yaml:

| Claim Type | Description |
|---|---|
| full_name | Official institution name |
| description | Institution description/mission |
| phone | Phone number |
| email | Email address |
| address | Physical address |
| website | Official website URL |
| founding_date | Year or date founded |
| parent_organization | Managing/parent organization |
| opening_hours | Hours of operation |
| social_media | Social media URLs |
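For quick validation, the enum can be mirrored in a small lookup set (transcribed from the table above; the WebClaim.yaml schema file remains authoritative, and the `invalid_claim_types` helper is illustrative):

```python
# Allowed claim types, transcribed from the ClaimTypeEnum table above.
CLAIM_TYPES = {
    "full_name", "description", "phone", "email", "address",
    "website", "founding_date", "parent_organization",
    "opening_hours", "social_media",
}

def invalid_claim_types(claims: list) -> list:
    """Return any claim_type values that are not in the enum."""
    return [c["claim_type"] for c in claims if c["claim_type"] not in CLAIM_TYPES]
```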

Example: Museum of Antigua and Barbuda

Extracted Claims

| Claim Type | Value | XPath |
|---|---|---|
| founding_date | 1985 | /html/body/div/div[3]/div/div/div/h5[1] |
| address | Long Street, St. John's, Antigua | /html/body/div/div[3]/div/div/div/h5[9] |
| phone | 1-268-462-4930 | /html/body/div/div[3]/div/div/div/h5[18] |
| parent_organization | Historical and Archaeological Society of Antigua and Barbuda | /html/body/div/div[3]/div/div/div/h5[20] |

Staff Data (from separate page)

| Role | Name | XPath |
|---|---|---|
| Curator | Dian Andrews | /html/body/div/div[3]/div/div/div/h5[3] |
| Research Librarian | Myra Piper | /html/body/div/div[3]/div/div/div/h5[4] |
| Board Chairperson | Walter Berridge | /html/body/div/div[3]/div/div/div/h5[25] |
| Board President | Reg Murphy | /html/body/div/div[3]/div/div/div/h5[26] |

Comparison with Playwright Method

| Aspect | Firecrawl MCP | Playwright Script |
|---|---|---|
| Setup | None (MCP ready) | Python + Playwright install |
| Speed | Fast (cached) | Slower (full render) |
| JS Support | Limited | Full |
| Screenshots | No | Yes |
| Archival | Manual save | Automatic |
| Best For | Static HTML sites | SPAs, JS-heavy sites |

Related Documentation

  • AGENTS.md Rule 6 - WebObservation claims MUST have XPath provenance
  • .opencode/WEB_OBSERVATION_PROVENANCE_RULES.md - Complete provenance rules
  • schemas/20251121/linkml/modules/classes/WebClaim.yaml - Schema definition
  • scripts/fetch_website_playwright.py - Full Playwright archiving script
  • scripts/add_xpath_provenance.py - XPath verification script

Limitations

  1. No JavaScript rendering - Sites requiring JS won't work well
  2. Manual archiving - HTML must be saved manually (vs automatic with Playwright)
  3. No screenshots - Cannot capture visual state
  4. Cache dependency - Firecrawl caching affects data freshness

For complex sites or when screenshots are needed, use the full Playwright workflow instead.