glam/docs/WEB_ENRICHMENT_FIRECRAWL.md
2025-12-14 17:09:55 +01:00


# Web Enrichment via Firecrawl MCP
This document describes using Firecrawl MCP as a **practical alternative** to the full Playwright archiving process for extracting web claims with XPath provenance.
## Overview
The Firecrawl MCP tool can return raw HTML content, enabling:
1. **Local HTML archiving** - Save the raw HTML as archived files
2. **XPath generation** - Parse HTML to generate exact XPaths for claims
3. **WebClaim provenance** - Create properly sourced claims per Rule 6
This is a **lightweight alternative** to the full `scripts/fetch_website_playwright.py` workflow, suitable for:
- Quick enrichment of individual custodian entries
- Sites that work well with static HTML scraping
- Situations where JavaScript rendering is not required
## When to Use Each Method
| Method | Use Case | Pros | Cons |
|--------|----------|------|------|
| **Firecrawl MCP** | Simple static sites, quick enrichment | Fast, MCP-integrated, cached results | No JS rendering, limited interaction |
| **Playwright script** | Complex SPAs, JS-heavy sites | Full JS support, screenshots | Slower, requires local setup |
## Workflow
### Step 1: Scrape with Firecrawl (rawHtml format)
```
Tool: firecrawl_firecrawl_scrape
Parameters:
  url: "http://www.example-museum.org/about/"
  formats: ["rawHtml"]
```
The `rawHtml` format returns the complete HTML source, including all elements needed for XPath generation.
### Step 2: Create Archive Directory
```
data/custodian/web/{GHCID}/{domain}/
```
Example:
```
data/custodian/web/AG-04-SJ-M-MAB/antiguamuseums.net/
├── about-the-museum.html
├── staff.html
└── metadata.yaml
```
### Step 3: Save HTML Files
Save the `rawHtml` content from Firecrawl to local files. Inline CSS/JS payloads may be stripped to reduce file size, but do not remove elements from the DOM: positional XPaths such as `h5[18]` break if sibling elements disappear.
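Steps 2 and 3 can be sketched together in Python. The `archive_pages` helper below is hypothetical (not part of the repository's scripts); it assumes `pages` maps a filename to the `rawHtml` string returned by Firecrawl.

```python
from pathlib import Path

def archive_pages(ghcid: str, domain: str, pages: dict[str, str]) -> Path:
    """Save raw HTML pages under data/custodian/web/{GHCID}/{domain}/."""
    archive_dir = Path("data/custodian/web") / ghcid / domain
    archive_dir.mkdir(parents=True, exist_ok=True)
    for filename, raw_html in pages.items():
        # Write the DOM exactly as fetched so generated XPaths stay valid.
        (archive_dir / filename).write_text(raw_html, encoding="utf-8")
    return archive_dir
```

Any post-processing (e.g. stripping inline CSS) should happen after this initial byte-faithful save, so the original fetch is always recoverable.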
### Step 4: Create metadata.yaml
```yaml
archive_metadata:
  ghcid: AG-04-SJ-M-MAB
  custodian_name: Museum of Antigua and Barbuda
  domain: antiguamuseums.net
  archive_created: "2025-12-14T09:34:47Z"
  fetched_pages:
    - file: about-the-museum.html
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      fetch_method: firecrawl
      cache_state: hit
      content_type: text/html
      status_code: 200
```
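Assembling this structure in Python keeps the field set consistent across archives. The `build_metadata` helper below is a sketch (not an existing script); it returns a plain dict, which can then be serialized with e.g. PyYAML's `yaml.safe_dump`. The `cache_state`/`status_code` defaults are assumptions for illustration.

```python
from datetime import datetime, timezone

def build_metadata(ghcid, custodian_name, domain, fetched_pages):
    """Assemble the metadata.yaml structure for a Firecrawl archive."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "archive_metadata": {
            "ghcid": ghcid,
            "custodian_name": custodian_name,
            "domain": domain,
            "archive_created": now,
            "fetched_pages": [
                {
                    "file": page["file"],
                    "source_url": page["source_url"],
                    "retrieved_on": now,
                    "fetch_method": "firecrawl",
                    "cache_state": page.get("cache_state", "miss"),
                    "content_type": "text/html",
                    "status_code": page.get("status_code", 200),
                }
                for page in fetched_pages
            ],
        }
    }
```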
### Step 5: Parse HTML for XPaths
Use Python with lxml to generate accurate XPaths:
```python
from lxml import etree

# html_content is the rawHtml string returned by Firecrawl
parser = etree.HTMLParser()
tree = etree.fromstring(html_content, parser)

# Find the element containing the target text
for elem in tree.iter('h5'):
    text = ''.join(elem.itertext())
    if 'telephone:' in text.lower():
        xpath = tree.getroottree().getpath(elem)
        print(f"Phone element XPath: {xpath}")
        # Result: /html/body/div/div[3]/div/div/div/h5[18]
```
### Step 6: Add WebClaims to Custodian YAML
```yaml
web_enrichment:
  archive_path: web/AG-04-SJ-M-MAB/antiguamuseums.net/
  fetch_timestamp: "2025-12-14T09:34:47Z"
  fetch_method: firecrawl
  claims:
    - claim_type: founding_date
      claim_value: "1985"
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      xpath: /html/body/div/div[3]/div/div/div/h5[1]
      html_file: web/AG-04-SJ-M-MAB/antiguamuseums.net/about-the-museum.html
      xpath_match_score: 1.0
    - claim_type: phone
      claim_value: "1-268-462-4930"
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      xpath: /html/body/div/div[3]/div/div/div/h5[18]
      html_file: web/AG-04-SJ-M-MAB/antiguamuseums.net/about-the-museum.html
      xpath_match_score: 1.0
```
## XPath Generation Tips
### Finding Elements by Text Content
```python
# Find the element containing specific text
for elem in tree.iter():
    text = ''.join(elem.itertext())
    if 'search term' in text.lower():
        xpath = tree.getroottree().getpath(elem)
```
### Common XPath Patterns
| Content Type | Typical XPath Pattern |
|--------------|----------------------|
| Page title | `/html/body/div[@id='wrapper']/div[@id='content']/h1[1]` |
| Address | `/html/body/...//h5[contains(text(),'address')]` |
| Phone | `/html/body/...//h5[contains(text(),'telephone')]` |
| Description | `/html/body/...//div[@class='content']/p[1]` |
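Predicate-based patterns like those in the table can be evaluated directly with lxml's `xpath()` method. A minimal self-contained sketch (the HTML snippet is made up for illustration):

```python
from lxml import etree

html = b"""<html><body>
  <h5>Telephone: 1-268-462-4930</h5>
  <h5>Address: Long Street, St. John's</h5>
</body></html>"""

tree = etree.fromstring(html, etree.HTMLParser())

# Select any <h5> whose text contains 'Telephone'
hits = tree.xpath("//h5[contains(text(), 'Telephone')]")
print(''.join(hits[0].itertext()))  # Telephone: 1-268-462-4930
```

Note that predicate XPaths are more robust to layout changes than the purely positional paths (`h5[18]`) produced by `getpath()`, at the cost of depending on the page's text content.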
### Verifying XPath Accuracy
```python
# Verify the XPath returns the expected content
result = tree.xpath(xpath)
if result:
    actual_text = ''.join(result[0].itertext())
    if expected_value in actual_text:
        match_score = 1.0  # Exact match
    else:
        match_score = 0.8  # Partial match - investigate
```
## ClaimTypeEnum Values
From `schemas/20251121/linkml/modules/classes/WebClaim.yaml`:
| Claim Type | Description |
|------------|-------------|
| `full_name` | Official institution name |
| `description` | Institution description/mission |
| `phone` | Phone number |
| `email` | Email address |
| `address` | Physical address |
| `website` | Official website URL |
| `founding_date` | Year or date founded |
| `parent_organization` | Managing/parent organization |
| `opening_hours` | Hours of operation |
| `social_media` | Social media URLs |
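A claim dict can be checked against these values before it is written into custodian YAML. The `validate_claim` helper below is a sketch; the enum values are mirrored from the table above, and the schema file `schemas/20251121/linkml/modules/classes/WebClaim.yaml` remains authoritative.

```python
# Assumed enum values, mirrored from the ClaimTypeEnum table above.
CLAIM_TYPES = {
    "full_name", "description", "phone", "email", "address",
    "website", "founding_date", "parent_organization",
    "opening_hours", "social_media",
}

def validate_claim(claim: dict) -> list[str]:
    """Return a list of problems with a WebClaim dict (empty = valid)."""
    problems = []
    if claim.get("claim_type") not in CLAIM_TYPES:
        problems.append(f"unknown claim_type: {claim.get('claim_type')!r}")
    # Rule 6 provenance fields: every claim needs a source and an XPath.
    for key in ("claim_value", "source_url", "retrieved_on", "xpath", "html_file"):
        if not claim.get(key):
            problems.append(f"missing required field: {key}")
    return problems
```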
## Example: Museum of Antigua and Barbuda
### Extracted Claims
| Claim Type | Value | XPath |
|------------|-------|-------|
| `founding_date` | 1985 | `/html/body/div/div[3]/div/div/div/h5[1]` |
| `address` | Long Street, St. John's, Antigua | `/html/body/div/div[3]/div/div/div/h5[9]` |
| `phone` | 1-268-462-4930 | `/html/body/div/div[3]/div/div/div/h5[18]` |
| `parent_organization` | Historical and Archaeological Society of Antigua and Barbuda | `/html/body/div/div[3]/div/div/div/h5[20]` |
### Staff Data (from separate page)
| Role | Name | XPath |
|------|------|-------|
| Curator | Dian Andrews | `/html/body/div/div[3]/div/div/div/h5[3]` |
| Research Librarian | Myra Piper | `/html/body/div/div[3]/div/div/div/h5[4]` |
| Board Chairperson | Walter Berridge | `/html/body/div/div[3]/div/div/div/h5[25]` |
| Board President | Reg Murphy | `/html/body/div/div[3]/div/div/div/h5[26]` |
## Comparison with Playwright Method
| Aspect | Firecrawl MCP | Playwright Script |
|--------|---------------|-------------------|
| **Setup** | None (MCP ready) | Python + Playwright install |
| **Speed** | Fast (cached) | Slower (full render) |
| **JS Support** | Limited | Full |
| **Screenshots** | No | Yes |
| **Archival** | Manual save | Automatic |
| **Best For** | Static HTML sites | SPAs, JS-heavy sites |
## Related Documentation
- `AGENTS.md` Rule 6: WebObservation Claims MUST Have XPath Provenance
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` - Complete provenance rules
- `schemas/20251121/linkml/modules/classes/WebClaim.yaml` - Schema definition
- `scripts/fetch_website_playwright.py` - Full Playwright archiving script
- `scripts/add_xpath_provenance.py` - XPath verification script
## Limitations
1. **No JavaScript rendering** - Sites requiring JS won't work well
2. **Manual archiving** - HTML must be saved manually (vs automatic with Playwright)
3. **No screenshots** - Cannot capture visual state
4. **Cache dependency** - Firecrawl caching affects data freshness
For complex sites or when screenshots are needed, use the full Playwright workflow instead.