# Web Enrichment via Firecrawl MCP

This document describes using Firecrawl MCP as a **practical alternative** to the full Playwright archiving process for extracting web claims with XPath provenance.

## Overview

The Firecrawl MCP tool can return raw HTML content, enabling:

1. **Local HTML archiving** - Save the raw HTML as archived files
2. **XPath generation** - Parse HTML to generate exact XPaths for claims
3. **WebClaim provenance** - Create properly sourced claims per Rule 6

This is a **lightweight alternative** to the full `scripts/fetch_website_playwright.py` workflow, suitable for:

- Quick enrichment of individual custodian entries
- Sites that work well with static HTML scraping
- Situations where JavaScript rendering is not required

## When to Use Each Method

| Method | Use Case | Pros | Cons |
|--------|----------|------|------|
| **Firecrawl MCP** | Simple static sites, quick enrichment | Fast, MCP-integrated, cached results | No JS rendering, limited interaction |
| **Playwright script** | Complex SPAs, JS-heavy sites | Full JS support, screenshots | Slower, requires local setup |

## Workflow

### Step 1: Scrape with Firecrawl (rawHtml format)

```
Tool: firecrawl_firecrawl_scrape
Parameters:
  url: "http://www.example-museum.org/about/"
  formats: ["rawHtml"]
```

The `rawHtml` format returns the complete HTML source, including all elements needed for XPath generation.

### Step 2: Create Archive Directory

```
data/custodian/web/{GHCID}/{domain}/
```

Example:

```
data/custodian/web/AG-04-SJ-M-MAB/antiguamuseums.net/
├── about-the-museum.html
├── staff.html
└── metadata.yaml
```

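The layout above can be created with a small helper; a minimal sketch using only the standard library (the `archive_dir` name is illustrative, not one of the repo's scripts):

```python
from pathlib import Path

def archive_dir(ghcid: str, domain: str, root: str = "data/custodian/web") -> Path:
    """Build (and create) the per-custodian archive directory."""
    # e.g. data/custodian/web/AG-04-SJ-M-MAB/antiguamuseums.net/
    path = Path(root) / ghcid / domain
    path.mkdir(parents=True, exist_ok=True)
    return path
```
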
### Step 3: Save HTML Files

Save the `rawHtml` content from Firecrawl to local files. Strip unnecessary CSS/JS if desired, but preserve the DOM structure for XPath validity.

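As a sketch, saving the payload verbatim (the `save_raw_html` helper name is illustrative; any optional post-processing such as stripping `<script>` tags must leave element order intact):

```python
from pathlib import Path

def save_raw_html(raw_html: str, dest: Path) -> Path:
    # Write the rawHtml payload as-is; the XPaths generated in Step 5
    # must resolve against exactly this DOM.
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(raw_html, encoding="utf-8")
    return dest
```
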
### Step 4: Create metadata.yaml

```yaml
archive_metadata:
  ghcid: AG-04-SJ-M-MAB
  custodian_name: Museum of Antigua and Barbuda
  domain: antiguamuseums.net
  archive_created: "2025-12-14T09:34:47Z"

fetched_pages:
  - file: about-the-museum.html
    source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
    retrieved_on: "2025-12-14T09:34:47Z"
    fetch_method: firecrawl
    cache_state: hit
    content_type: text/html
    status_code: 200
```

### Step 5: Parse HTML for XPaths

Use Python with lxml to generate accurate XPaths:

```python
from lxml import etree

# Parse the archived HTML
parser = etree.HTMLParser()
tree = etree.fromstring(html_content, parser)

# Find the element containing the target text
for elem in tree.iter('h5'):
    text = ''.join(elem.itertext())
    if 'telephone:' in text.lower():
        xpath = tree.getroottree().getpath(elem)
        print(f"Phone element XPath: {xpath}")
        # Result: /html/body/div/div[3]/div/div/div/h5[18]
```

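Run end-to-end against a small sample page, the same approach looks like this (the `html_content` below is invented for illustration):

```python
from lxml import etree

# Sample payload standing in for a Firecrawl rawHtml response
html_content = (
    "<html><body><div>"
    "<h5>Address: Long Street</h5>"
    "<h5>Telephone: 1-268-462-4930</h5>"
    "</div></body></html>"
)

tree = etree.fromstring(html_content, etree.HTMLParser())
for elem in tree.iter('h5'):
    if 'telephone:' in ''.join(elem.itertext()).lower():
        xpath = tree.getroottree().getpath(elem)
        # -> /html/body/div/h5[2]
```

Note that `getpath` returns a positional path relative to the parsed DOM, which is why the archived HTML must not be restructured after scraping.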
### Step 6: Add WebClaims to Custodian YAML

```yaml
web_enrichment:
  archive_path: web/AG-04-SJ-M-MAB/antiguamuseums.net/
  fetch_timestamp: "2025-12-14T09:34:47Z"
  fetch_method: firecrawl
  claims:
    - claim_type: founding_date
      claim_value: "1985"
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      xpath: /html/body/div/div[3]/div/div/div/h5[1]
      html_file: web/AG-04-SJ-M-MAB/antiguamuseums.net/about-the-museum.html
      xpath_match_score: 1.0

    - claim_type: phone
      claim_value: "1-268-462-4930"
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      xpath: /html/body/div/div[3]/div/div/div/h5[18]
      html_file: web/AG-04-SJ-M-MAB/antiguamuseums.net/about-the-museum.html
      xpath_match_score: 1.0
```

## XPath Generation Tips

### Finding Elements by Text Content

```python
# Find elements containing specific text
for elem in tree.iter():
    text = ''.join(elem.itertext())
    if 'search term' in text.lower():
        xpath = tree.getroottree().getpath(elem)
```

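A scan over `tree.iter()` also matches every ancestor of the target, since an ancestor's concatenated text contains the term too. One way to disambiguate, sketched here (the `tightest_xpath` helper is illustrative), is to prefer the matching element with the least surrounding text:

```python
def tightest_xpath(tree, term):
    """Return the XPath of the smallest element whose text contains term."""
    best = None
    best_len = None
    for elem in tree.iter():
        text = ''.join(elem.itertext())
        if term.lower() in text.lower():
            # Ancestors match too, but carry more text; keep the tightest.
            if best is None or len(text) < best_len:
                best, best_len = elem, len(text)
    return tree.getroottree().getpath(best) if best is not None else None
```
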
### Common XPath Patterns

| Content Type | Typical XPath Pattern |
|--------------|----------------------|
| Page title | `/html/body/div[@id='wrapper']/div[@id='content']/h1[1]` |
| Address | `/html/body/...//h5[contains(text(),'address')]` |
| Phone | `/html/body/...//h5[contains(text(),'telephone')]` |
| Description | `/html/body/...//div[@class='content']/p[1]` |

### Verifying XPath Accuracy

```python
# Verify that the XPath returns the expected content
result = tree.xpath(xpath)
if result:
    actual_text = ''.join(result[0].itertext())
    if expected_value in actual_text:
        match_score = 1.0  # Exact match
    else:
        match_score = 0.8  # Partial match - investigate
```

## ClaimTypeEnum Values

From `schemas/20251121/linkml/modules/classes/WebClaim.yaml`:

| Claim Type | Description |
|------------|-------------|
| `full_name` | Official institution name |
| `description` | Institution description/mission |
| `phone` | Phone number |
| `email` | Email address |
| `address` | Physical address |
| `website` | Official website URL |
| `founding_date` | Year or date founded |
| `parent_organization` | Managing/parent organization |
| `opening_hours` | Hours of operation |
| `social_media` | Social media URLs |

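Claims can be sanity-checked against this enum before being committed to the custodian YAML. A minimal sketch (the `validate_claim` helper and its required-field list are illustrative; `WebClaim.yaml` remains the authoritative definition):

```python
# Values transcribed from the table above.
CLAIM_TYPES = frozenset({
    "full_name", "description", "phone", "email", "address",
    "website", "founding_date", "parent_organization",
    "opening_hours", "social_media",
})

REQUIRED_FIELDS = ("claim_value", "source_url", "retrieved_on", "xpath", "html_file")

def validate_claim(claim: dict) -> list:
    """Return a list of problems; an empty list means the claim looks well-formed."""
    errors = []
    if claim.get("claim_type") not in CLAIM_TYPES:
        errors.append(f"unknown claim_type: {claim.get('claim_type')!r}")
    for field in REQUIRED_FIELDS:
        if not claim.get(field):
            errors.append(f"missing {field}")
    return errors
```
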
## Example: Museum of Antigua and Barbuda

### Extracted Claims

| Claim Type | Value | XPath |
|------------|-------|-------|
| `founding_date` | 1985 | `/html/body/div/div[3]/div/div/div/h5[1]` |
| `address` | Long Street, St. John's, Antigua | `/html/body/div/div[3]/div/div/div/h5[9]` |
| `phone` | 1-268-462-4930 | `/html/body/div/div[3]/div/div/div/h5[18]` |
| `parent_organization` | Historical and Archaeological Society of Antigua and Barbuda | `/html/body/div/div[3]/div/div/div/h5[20]` |

### Staff Data (from separate page)

| Role | Name | XPath |
|------|------|-------|
| Curator | Dian Andrews | `/html/body/div/div[3]/div/div/div/h5[3]` |
| Research Librarian | Myra Piper | `/html/body/div/div[3]/div/div/div/h5[4]` |
| Board Chairperson | Walter Berridge | `/html/body/div/div[3]/div/div/div/h5[25]` |
| Board President | Reg Murphy | `/html/body/div/div[3]/div/div/div/h5[26]` |

## Comparison with Playwright Method

| Aspect | Firecrawl MCP | Playwright Script |
|--------|---------------|-------------------|
| **Setup** | None (MCP ready) | Python + Playwright install |
| **Speed** | Fast (cached) | Slower (full render) |
| **JS Support** | Limited | Full |
| **Screenshots** | No | Yes |
| **Archival** | Manual save | Automatic |
| **Best For** | Static HTML sites | SPAs, JS-heavy sites |

## Related Documentation

- `AGENTS.md` Rule 6: WebObservation Claims MUST Have XPath Provenance
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` - Complete provenance rules
- `schemas/20251121/linkml/modules/classes/WebClaim.yaml` - Schema definition
- `scripts/fetch_website_playwright.py` - Full Playwright archiving script
- `scripts/add_xpath_provenance.py` - XPath verification script

## Limitations

1. **No JavaScript rendering** - Sites requiring JS won't work well
2. **Manual archiving** - HTML must be saved manually (vs automatic with Playwright)
3. **No screenshots** - Cannot capture visual state
4. **Cache dependency** - Firecrawl caching affects data freshness

For complex sites or when screenshots are needed, use the full Playwright workflow instead.