# Web Enrichment via Firecrawl MCP
This document describes how to use Firecrawl MCP as a practical alternative to the full Playwright archiving process for extracting web claims with XPath provenance.
## Overview
The Firecrawl MCP tool can return raw HTML content, enabling:
- Local HTML archiving - Save the raw HTML as archived files
- XPath generation - Parse HTML to generate exact XPaths for claims
- WebClaim provenance - Create properly sourced claims per Rule 6
This is a lightweight alternative to the full `scripts/fetch_website_playwright.py` workflow, suitable for:
- Quick enrichment of individual custodian entries
- Sites that work well with static HTML scraping
- Situations where JavaScript rendering is not required
## When to Use Each Method
| Method | Use Case | Pros | Cons |
|---|---|---|---|
| Firecrawl MCP | Simple static sites, quick enrichment | Fast, MCP-integrated, cached results | No JS rendering, limited interaction |
| Playwright script | Complex SPAs, JS-heavy sites | Full JS support, screenshots | Slower, requires local setup |
## Workflow

### Step 1: Scrape with Firecrawl (rawHtml format)

Tool: `firecrawl_firecrawl_scrape`

Parameters:

```yaml
url: "http://www.example-museum.org/about/"
formats: ["rawHtml"]
```
The `rawHtml` format returns the complete HTML source, including all elements needed for XPath generation.
### Step 2: Create Archive Directory

```
data/custodian/web/{GHCID}/{domain}/
```

Example:

```
data/custodian/web/AG-04-SJ-M-MAB/antiguamuseums.net/
├── about-the-museum.html
├── staff.html
└── metadata.yaml
```
### Step 3: Save HTML Files

Save the `rawHtml` content returned by Firecrawl to local files. Strip unnecessary CSS/JS if desired, but preserve the DOM structure so that generated XPaths remain valid.
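A minimal sketch of this step, using the directory layout from Step 2 (`save_page` is a hypothetical helper, not one of the repo's scripts):

```python
from pathlib import Path


def save_page(archive_dir: str, slug: str, raw_html: str) -> Path:
    """Write one scraped page into the custodian's archive directory."""
    out_dir = Path(archive_dir)
    # Create web/{GHCID}/{domain}/ if it does not exist yet
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{slug}.html"
    out_path.write_text(raw_html, encoding="utf-8")
    return out_path
```

For example, `save_page("data/custodian/web/AG-04-SJ-M-MAB/antiguamuseums.net", "about-the-museum", raw_html)` produces the `about-the-museum.html` file shown above.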
### Step 4: Create metadata.yaml

```yaml
archive_metadata:
  ghcid: AG-04-SJ-M-MAB
  custodian_name: Museum of Antigua and Barbuda
  domain: antiguamuseums.net
  archive_created: "2025-12-14T09:34:47Z"
  fetched_pages:
    - file: about-the-museum.html
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      fetch_method: firecrawl
      cache_state: hit
      content_type: text/html
      status_code: 200
```
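The metadata file can be generated rather than hand-written. A sketch using the field names shown above (`build_metadata` is a hypothetical helper; `cache_state` comes from the Firecrawl response and is omitted here; serialize the result with PyYAML's `yaml.safe_dump`):

```python
from datetime import datetime, timezone


def build_metadata(ghcid, custodian_name, domain, pages):
    """Build the archive_metadata dict; pages is a list of (filename, source_url)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "archive_metadata": {
            "ghcid": ghcid,
            "custodian_name": custodian_name,
            "domain": domain,
            "archive_created": now,
            "fetched_pages": [
                {
                    "file": filename,
                    "source_url": url,
                    "retrieved_on": now,
                    "fetch_method": "firecrawl",
                    "content_type": "text/html",
                    "status_code": 200,
                }
                for filename, url in pages
            ],
        }
    }
```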
### Step 5: Parse HTML for XPaths

Use Python with `lxml` to generate accurate XPaths:

```python
from lxml import etree

# Parse the archived HTML (html_content holds the saved rawHtml string)
parser = etree.HTMLParser()
tree = etree.fromstring(html_content, parser)

# Find the element containing the target text
for elem in tree.iter('h5'):
    text = ''.join(elem.itertext())
    if 'telephone:' in text.lower():
        xpath = tree.getroottree().getpath(elem)
        print(f"Phone element XPath: {xpath}")
        # Result: /html/body/div/div[3]/div/div/div/h5[18]
```
### Step 6: Add WebClaims to Custodian YAML

```yaml
web_enrichment:
  archive_path: web/AG-04-SJ-M-MAB/antiguamuseums.net/
  fetch_timestamp: "2025-12-14T09:34:47Z"
  fetch_method: firecrawl
  claims:
    - claim_type: founding_date
      claim_value: "1985"
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      xpath: /html/body/div/div[3]/div/div/div/h5[1]
      html_file: web/AG-04-SJ-M-MAB/antiguamuseums.net/about-the-museum.html
      xpath_match_score: 1.0
    - claim_type: phone
      claim_value: "1-268-462-4930"
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      xpath: /html/body/div/div[3]/div/div/div/h5[18]
      html_file: web/AG-04-SJ-M-MAB/antiguamuseums.net/about-the-museum.html
      xpath_match_score: 1.0
```
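Each claim entry repeats the same fields, so it can help to build them programmatically before dumping to YAML. A sketch (`make_web_claim` is a hypothetical helper that mirrors the fields shown above):

```python
def make_web_claim(claim_type, claim_value, source_url, retrieved_on,
                   xpath, html_file, xpath_match_score=1.0):
    """Build one WebClaim entry with the fields used in the YAML above."""
    return {
        "claim_type": claim_type,
        "claim_value": claim_value,
        "source_url": source_url,
        "retrieved_on": retrieved_on,
        "xpath": xpath,
        "html_file": html_file,
        "xpath_match_score": xpath_match_score,
    }


claim = make_web_claim(
    "phone", "1-268-462-4930",
    "http://www.antiguamuseums.net/portfolios/about-the-museum/",
    "2025-12-14T09:34:47Z",
    "/html/body/div/div[3]/div/div/div/h5[18]",
    "web/AG-04-SJ-M-MAB/antiguamuseums.net/about-the-museum.html",
)
```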
## XPath Generation Tips

### Finding Elements by Text Content

```python
# Find the element containing specific text
for elem in tree.iter():
    text = ''.join(elem.itertext())
    if 'search term' in text.lower():
        xpath = tree.getroottree().getpath(elem)
```
### Common XPath Patterns

| Content Type | Typical XPath Pattern |
|---|---|
| Page title | `/html/body/div[@id='wrapper']/div[@id='content']/h1[1]` |
| Address | `/html/body/...//h5[contains(text(),'address')]` |
| Phone | `/html/body/...//h5[contains(text(),'telephone')]` |
| Description | `/html/body/...//div[@class='content']/p[1]` |
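The `contains(text(), ...)` patterns above can be tried directly with lxml. A small self-contained example (the HTML snippet is illustrative, not from the archived site; note that `contains()` is case-sensitive in XPath 1.0):

```python
from lxml import etree

html = (
    "<html><body><div class='content'>"
    "<h5>Telephone: 1-268-462-4930</h5>"
    "<h5>Address: Long Street</h5>"
    "</div></body></html>"
)
tree = etree.fromstring(html, etree.HTMLParser())

# Relative pattern with a text predicate
hits = tree.xpath("//h5[contains(text(), 'Telephone')]")
print(hits[0].text)  # Telephone: 1-268-462-4930
```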
### Verifying XPath Accuracy

```python
# Verify the XPath returns the expected content
result = tree.xpath(xpath)
if result:
    actual_text = ''.join(result[0].itertext())
    if expected_value in actual_text:
        match_score = 1.0  # Exact match
    else:
        match_score = 0.8  # Partial match - investigate
```
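Putting the check together: a sketch of a helper that re-queries archived HTML and scores the match the same way (`verify_claim` is hypothetical, not one of the repo's scripts):

```python
from lxml import etree


def verify_claim(html: str, xpath: str, expected_value: str) -> float:
    """Re-evaluate a stored XPath against archived HTML and score the match."""
    tree = etree.fromstring(html, etree.HTMLParser())
    result = tree.xpath(xpath)
    if not result:
        return 0.0  # XPath no longer resolves - page structure changed
    actual_text = "".join(result[0].itertext())
    return 1.0 if expected_value in actual_text else 0.8
```

A score below 1.0 suggests the page changed since archiving and the claim should be re-checked against the live site.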
## ClaimTypeEnum Values

From `schemas/20251121/linkml/modules/classes/WebClaim.yaml`:
| Claim Type | Description |
|---|---|
| `full_name` | Official institution name |
| `description` | Institution description/mission |
| `phone` | Phone number |
| `email` | Email address |
| `address` | Physical address |
| `website` | Official website URL |
| `founding_date` | Year or date founded |
| `parent_organization` | Managing/parent organization |
| `opening_hours` | Hours of operation |
| `social_media` | Social media URLs |
## Example: Museum of Antigua and Barbuda

### Extracted Claims
| Claim Type | Value | XPath |
|---|---|---|
| `founding_date` | 1985 | `/html/body/div/div[3]/div/div/div/h5[1]` |
| `address` | Long Street, St. John's, Antigua | `/html/body/div/div[3]/div/div/div/h5[9]` |
| `phone` | 1-268-462-4930 | `/html/body/div/div[3]/div/div/div/h5[18]` |
| `parent_organization` | Historical and Archaeological Society of Antigua and Barbuda | `/html/body/div/div[3]/div/div/div/h5[20]` |
### Staff Data (from separate page)

| Role | Name | XPath |
|---|---|---|
| Curator | Dian Andrews | `/html/body/div/div[3]/div/div/div/h5[3]` |
| Research Librarian | Myra Piper | `/html/body/div/div[3]/div/div/div/h5[4]` |
| Board Chairperson | Walter Berridge | `/html/body/div/div[3]/div/div/div/h5[25]` |
| Board President | Reg Murphy | `/html/body/div/div[3]/div/div/div/h5[26]` |
## Comparison with Playwright Method
| Aspect | Firecrawl MCP | Playwright Script |
|---|---|---|
| Setup | None (MCP ready) | Python + Playwright install |
| Speed | Fast (cached) | Slower (full render) |
| JS Support | Limited | Full |
| Screenshots | No | Yes |
| Archival | Manual save | Automatic |
| Best For | Static HTML sites | SPAs, JS-heavy sites |
## Related Documentation

- `AGENTS.md` Rule 6: WebObservation Claims MUST Have XPath Provenance
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` - Complete provenance rules
- `schemas/20251121/linkml/modules/classes/WebClaim.yaml` - Schema definition
- `scripts/fetch_website_playwright.py` - Full Playwright archiving script
- `scripts/add_xpath_provenance.py` - XPath verification script
## Limitations
- No JavaScript rendering - Sites requiring JS won't work well
- Manual archiving - HTML must be saved manually (vs automatic with Playwright)
- No screenshots - Cannot capture visual state
- Cache dependency - Firecrawl caching affects data freshness
For complex sites or when screenshots are needed, use the full Playwright workflow instead.