glam/docs/WEB_ENRICHMENT_FIRECRAWL.md
2025-12-14 17:09:55 +01:00


# Web Enrichment via Firecrawl MCP
This document describes using Firecrawl MCP as a **practical alternative** to the full Playwright archiving process for extracting web claims with XPath provenance.
## Overview
The Firecrawl MCP tool can return raw HTML content, enabling:
1. **Local HTML archiving** - Save the raw HTML as archived files
2. **XPath generation** - Parse HTML to generate exact XPaths for claims
3. **WebClaim provenance** - Create properly sourced claims per Rule 6
This is a **lightweight alternative** to the full `scripts/fetch_website_playwright.py` workflow, suitable for:
- Quick enrichment of individual custodian entries
- Sites that work well with static HTML scraping
- Situations where JavaScript rendering is not required
## When to Use Each Method
| Method | Use Case | Pros | Cons |
|--------|----------|------|------|
| **Firecrawl MCP** | Simple static sites, quick enrichment | Fast, MCP-integrated, cached results | No JS rendering, limited interaction |
| **Playwright script** | Complex SPAs, JS-heavy sites | Full JS support, screenshots | Slower, requires local setup |
## Workflow
### Step 1: Scrape with Firecrawl (rawHtml format)
```
Tool: firecrawl_firecrawl_scrape
Parameters:
  url: "http://www.example-museum.org/about/"
  formats: ["rawHtml"]
```
The `rawHtml` format returns the complete HTML source, including all elements needed for XPath generation.
### Step 2: Create Archive Directory
```
data/custodian/web/{GHCID}/{domain}/
```
Example:
```
data/custodian/web/AG-04-SJ-M-MAB/antiguamuseums.net/
├── about-the-museum.html
├── staff.html
└── metadata.yaml
```
### Step 3: Save HTML Files
Save the `rawHtml` content from Firecrawl to local files. Inline CSS/JS payloads may be stripped to reduce file size, but do not remove elements from the DOM: positional XPaths such as `h5[18]` break if sibling elements disappear.
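Steps 2 and 3 can be sketched together in Python. The `archive_pages` helper below is hypothetical (not part of the repository's scripts); it assumes `pages` maps a filename to the `rawHtml` string returned by Firecrawl.

```python
from pathlib import Path

def archive_pages(ghcid: str, domain: str, pages: dict[str, str]) -> Path:
    """Save raw HTML pages under data/custodian/web/{GHCID}/{domain}/."""
    archive_dir = Path("data/custodian/web") / ghcid / domain
    archive_dir.mkdir(parents=True, exist_ok=True)
    for filename, raw_html in pages.items():
        # Write the DOM exactly as fetched so generated XPaths stay valid.
        (archive_dir / filename).write_text(raw_html, encoding="utf-8")
    return archive_dir
```

Any post-processing (e.g. stripping inline CSS) should happen after this initial byte-faithful save, so the original fetch is always recoverable.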
### Step 4: Create metadata.yaml
```yaml
archive_metadata:
  ghcid: AG-04-SJ-M-MAB
  custodian_name: Museum of Antigua and Barbuda
  domain: antiguamuseums.net
  archive_created: "2025-12-14T09:34:47Z"
  fetched_pages:
    - file: about-the-museum.html
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      fetch_method: firecrawl
      cache_state: hit
      content_type: text/html
      status_code: 200
```
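Assembling this structure in Python keeps the field set consistent across archives. The `build_metadata` helper below is a sketch (not an existing script); it returns a plain dict, which can then be serialized with e.g. PyYAML's `yaml.safe_dump`. The `cache_state`/`status_code` defaults are assumptions for illustration.

```python
from datetime import datetime, timezone

def build_metadata(ghcid, custodian_name, domain, fetched_pages):
    """Assemble the metadata.yaml structure for a Firecrawl archive."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "archive_metadata": {
            "ghcid": ghcid,
            "custodian_name": custodian_name,
            "domain": domain,
            "archive_created": now,
            "fetched_pages": [
                {
                    "file": page["file"],
                    "source_url": page["source_url"],
                    "retrieved_on": now,
                    "fetch_method": "firecrawl",
                    "cache_state": page.get("cache_state", "miss"),
                    "content_type": "text/html",
                    "status_code": page.get("status_code", 200),
                }
                for page in fetched_pages
            ],
        }
    }
```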
### Step 5: Parse HTML for XPaths
Use Python with lxml to generate accurate XPaths:
```python
from lxml import etree

# html_content is the rawHtml string returned by Firecrawl
parser = etree.HTMLParser()
tree = etree.fromstring(html_content, parser)

# Find the element containing the target text
for elem in tree.iter('h5'):
    text = ''.join(elem.itertext())
    if 'telephone:' in text.lower():
        xpath = tree.getroottree().getpath(elem)
        print(f"Phone element XPath: {xpath}")
        # Result: /html/body/div/div[3]/div/div/div/h5[18]
```
### Step 6: Add WebClaims to Custodian YAML
```yaml
web_enrichment:
  archive_path: web/AG-04-SJ-M-MAB/antiguamuseums.net/
  fetch_timestamp: "2025-12-14T09:34:47Z"
  fetch_method: firecrawl
  claims:
    - claim_type: founding_date
      claim_value: "1985"
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      xpath: /html/body/div/div[3]/div/div/div/h5[1]
      html_file: web/AG-04-SJ-M-MAB/antiguamuseums.net/about-the-museum.html
      xpath_match_score: 1.0
    - claim_type: phone
      claim_value: "1-268-462-4930"
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      xpath: /html/body/div/div[3]/div/div/div/h5[18]
      html_file: web/AG-04-SJ-M-MAB/antiguamuseums.net/about-the-museum.html
      xpath_match_score: 1.0
```
## XPath Generation Tips
### Finding Elements by Text Content
```python
# Find the element containing specific text
for elem in tree.iter():
    text = ''.join(elem.itertext())
    if 'search term' in text.lower():
        xpath = tree.getroottree().getpath(elem)
```
### Common XPath Patterns
| Content Type | Typical XPath Pattern |
|--------------|----------------------|
| Page title | `/html/body/div[@id='wrapper']/div[@id='content']/h1[1]` |
| Address | `/html/body/...//h5[contains(text(),'address')]` |
| Phone | `/html/body/...//h5[contains(text(),'telephone')]` |
| Description | `/html/body/...//div[@class='content']/p[1]` |
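Predicate-based patterns like those in the table can be evaluated directly with lxml's `xpath()` method. A minimal self-contained sketch (the HTML snippet is made up for illustration):

```python
from lxml import etree

html = b"""<html><body>
  <h5>Telephone: 1-268-462-4930</h5>
  <h5>Address: Long Street, St. John's</h5>
</body></html>"""

tree = etree.fromstring(html, etree.HTMLParser())

# Select any <h5> whose text contains 'Telephone'
hits = tree.xpath("//h5[contains(text(), 'Telephone')]")
print(''.join(hits[0].itertext()))  # Telephone: 1-268-462-4930
```

Note that predicate XPaths are more robust to layout changes than the purely positional paths (`h5[18]`) produced by `getpath()`, at the cost of depending on the page's text content.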
### Verifying XPath Accuracy
```python
# Verify the XPath returns the expected content
result = tree.xpath(xpath)
if result:
    actual_text = ''.join(result[0].itertext())
    if expected_value in actual_text:
        match_score = 1.0  # Exact match
    else:
        match_score = 0.8  # Partial match - investigate
```
## ClaimTypeEnum Values
From `schemas/20251121/linkml/modules/classes/WebClaim.yaml`:
| Claim Type | Description |
|------------|-------------|
| `full_name` | Official institution name |
| `description` | Institution description/mission |
| `phone` | Phone number |
| `email` | Email address |
| `address` | Physical address |
| `website` | Official website URL |
| `founding_date` | Year or date founded |
| `parent_organization` | Managing/parent organization |
| `opening_hours` | Hours of operation |
| `social_media` | Social media URLs |
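A claim dict can be checked against these values before it is written into custodian YAML. The `validate_claim` helper below is a sketch; the enum values are mirrored from the table above, and the schema file `schemas/20251121/linkml/modules/classes/WebClaim.yaml` remains authoritative.

```python
# Assumed enum values, mirrored from the ClaimTypeEnum table above.
CLAIM_TYPES = {
    "full_name", "description", "phone", "email", "address",
    "website", "founding_date", "parent_organization",
    "opening_hours", "social_media",
}

def validate_claim(claim: dict) -> list[str]:
    """Return a list of problems with a WebClaim dict (empty = valid)."""
    problems = []
    if claim.get("claim_type") not in CLAIM_TYPES:
        problems.append(f"unknown claim_type: {claim.get('claim_type')!r}")
    # Rule 6 provenance fields: every claim needs a source and an XPath.
    for key in ("claim_value", "source_url", "retrieved_on", "xpath", "html_file"):
        if not claim.get(key):
            problems.append(f"missing required field: {key}")
    return problems
```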
## Example: Museum of Antigua and Barbuda
### Extracted Claims
| Claim Type | Value | XPath |
|------------|-------|-------|
| `founding_date` | 1985 | `/html/body/div/div[3]/div/div/div/h5[1]` |
| `address` | Long Street, St. John's, Antigua | `/html/body/div/div[3]/div/div/div/h5[9]` |
| `phone` | 1-268-462-4930 | `/html/body/div/div[3]/div/div/div/h5[18]` |
| `parent_organization` | Historical and Archaeological Society of Antigua and Barbuda | `/html/body/div/div[3]/div/div/div/h5[20]` |
### Staff Data (from separate page)
| Role | Name | XPath |
|------|------|-------|
| Curator | Dian Andrews | `/html/body/div/div[3]/div/div/div/h5[3]` |
| Research Librarian | Myra Piper | `/html/body/div/div[3]/div/div/div/h5[4]` |
| Board Chairperson | Walter Berridge | `/html/body/div/div[3]/div/div/div/h5[25]` |
| Board President | Reg Murphy | `/html/body/div/div[3]/div/div/div/h5[26]` |
## Comparison with Playwright Method
| Aspect | Firecrawl MCP | Playwright Script |
|--------|---------------|-------------------|
| **Setup** | None (MCP ready) | Python + Playwright install |
| **Speed** | Fast (cached) | Slower (full render) |
| **JS Support** | Limited | Full |
| **Screenshots** | No | Yes |
| **Archival** | Manual save | Automatic |
| **Best For** | Static HTML sites | SPAs, JS-heavy sites |
## Related Documentation
- `AGENTS.md` Rule 6: WebObservation Claims MUST Have XPath Provenance
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` - Complete provenance rules
- `schemas/20251121/linkml/modules/classes/WebClaim.yaml` - Schema definition
- `scripts/fetch_website_playwright.py` - Full Playwright archiving script
- `scripts/add_xpath_provenance.py` - XPath verification script
## Limitations
1. **No JavaScript rendering** - Sites requiring JS won't work well
2. **Manual archiving** - HTML must be saved manually (vs automatic with Playwright)
3. **No screenshots** - Cannot capture visual state
4. **Cache dependency** - Firecrawl caching affects data freshness
For complex sites or when screenshots are needed, use the full Playwright workflow instead.