
Web Enrichment via Firecrawl MCP

This document describes using Firecrawl MCP as a practical alternative to the full Playwright archiving process for extracting web claims with XPath provenance.

Overview

The Firecrawl MCP tool can return raw HTML content, enabling:

  1. Local HTML archiving - Save the raw HTML as archived files
  2. XPath generation - Parse HTML to generate exact XPaths for claims
  3. WebClaim provenance - Create properly sourced claims per Rule 6

This is a lightweight alternative to the full scripts/fetch_website_playwright.py workflow, suitable for:

  • Quick enrichment of individual custodian entries
  • Sites that work well with static HTML scraping
  • Situations where JavaScript rendering is not required

When to Use Each Method

| Method | Use Case | Pros | Cons |
|---|---|---|---|
| Firecrawl MCP | Simple static sites, quick enrichment | Fast, MCP-integrated, cached results | No JS rendering, limited interaction |
| Playwright script | Complex SPAs, JS-heavy sites | Full JS support, screenshots | Slower, requires local setup |

Workflow

Step 1: Scrape with Firecrawl (rawHtml format)

Tool: firecrawl_firecrawl_scrape
Parameters:
  url: "http://www.example-museum.org/about/"
  formats: ["rawHtml"]

The rawHtml format returns the complete HTML source, including all elements needed for XPath generation.

Step 2: Create Archive Directory

data/custodian/web/{GHCID}/{domain}/

Example:

data/custodian/web/AG-04-SJ-M-MAB/antiguamuseums.net/
├── about-the-museum.html
├── staff.html
└── metadata.yaml

Step 3: Save HTML Files

Save the rawHtml content from Firecrawl to local files. Strip unnecessary CSS/JS if desired, but preserve the DOM structure for XPath validity.
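As a sketch, the save step might look like the following (the `save_archive_page` helper name and its arguments are illustrative, not part of any existing script):

```python
from pathlib import Path

def save_archive_page(archive_dir: str, filename: str, raw_html: str) -> Path:
    """Save Firecrawl rawHtml output under the custodian archive directory."""
    out_dir = Path(archive_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / filename
    # Write the HTML as-is: XPaths generated later must match the
    # exact DOM structure that was archived.
    out_file.write_text(raw_html, encoding="utf-8")
    return out_file
```

If you do strip CSS/JS before saving, re-run XPath generation against the stripped file, not the original response, so the archived DOM and the recorded XPaths stay in sync.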

Step 4: Create metadata.yaml

archive_metadata:
  ghcid: AG-04-SJ-M-MAB
  custodian_name: Museum of Antigua and Barbuda
  domain: antiguamuseums.net
  archive_created: "2025-12-14T09:34:47Z"

fetched_pages:
  - file: about-the-museum.html
    source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
    retrieved_on: "2025-12-14T09:34:47Z"
    fetch_method: firecrawl
    cache_state: hit
    content_type: text/html
    status_code: 200

Step 5: Parse HTML for XPaths

Use Python with lxml to generate accurate XPaths:

from lxml import etree

# Parse HTML
parser = etree.HTMLParser()
tree = etree.fromstring(html_content, parser)

# Find element containing target text
for elem in tree.iter('h5'):
    text = ''.join(elem.itertext())
    if 'telephone:' in text.lower():
        xpath = tree.getroottree().getpath(elem)
        print(f"Phone element XPath: {xpath}")
        # Result: /html/body/div/div[3]/div/div/div/h5[18]

Step 6: Add WebClaims to Custodian YAML

web_enrichment:
  archive_path: web/AG-04-SJ-M-MAB/antiguamuseums.net/
  fetch_timestamp: "2025-12-14T09:34:47Z"
  fetch_method: firecrawl
  claims:
    - claim_type: founding_date
      claim_value: "1985"
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      xpath: /html/body/div/div[3]/div/div/div/h5[1]
      html_file: web/AG-04-SJ-M-MAB/antiguamuseums.net/about-the-museum.html
      xpath_match_score: 1.0
      
    - claim_type: phone
      claim_value: "1-268-462-4930"
      source_url: http://www.antiguamuseums.net/portfolios/about-the-museum/
      retrieved_on: "2025-12-14T09:34:47Z"
      xpath: /html/body/div/div[3]/div/div/div/h5[18]
      html_file: web/AG-04-SJ-M-MAB/antiguamuseums.net/about-the-museum.html
      xpath_match_score: 1.0
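Before committing the YAML, a few lines of Python can flag claims that are missing Rule 6 provenance. This is an illustrative sanity check (the `check_claims` helper is hypothetical; the required-field list is transcribed from the example above, and the schema remains authoritative):

```python
# Required provenance fields per the WebClaim example above (Rule 6).
REQUIRED_CLAIM_FIELDS = {
    "claim_type", "claim_value", "source_url",
    "retrieved_on", "xpath", "html_file",
}

def check_claims(enrichment: dict) -> list:
    """Given a parsed web_enrichment mapping (e.g. from yaml.safe_load),
    return (index, missing_fields) pairs for claims lacking provenance."""
    problems = []
    for i, claim in enumerate(enrichment.get("claims", [])):
        missing = REQUIRED_CLAIM_FIELDS - claim.keys()
        if missing:
            problems.append((i, sorted(missing)))
    return problems
```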

XPath Generation Tips

Finding Elements by Text Content

# Find element containing specific text.
# Note: checking ''.join(elem.itertext()) while iterating all elements
# matches ancestors too (html, body, ...), since itertext() includes
# descendant text. Check the element's own text, or iterate a specific
# tag as in Step 5, to get the deepest matching element.
for elem in tree.iter():
    own_text = elem.text or ''
    if 'search term' in own_text.lower():
        xpath = tree.getroottree().getpath(elem)

Common XPath Patterns

| Content Type | Typical XPath Pattern |
|---|---|
| Page title | /html/body/div[@id='wrapper']/div[@id='content']/h1[1] |
| Address | /html/body/...//h5[contains(text(),'address')] |
| Phone | /html/body/...//h5[contains(text(),'telephone')] |
| Description | /html/body/...//div[@class='content']/p[1] |
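A minimal lxml example of how a contains() predicate locates an element (the sample HTML is invented). Predicates like this are useful for finding elements; for the provenance records themselves, prefer the absolute getpath() XPaths shown in Step 5:

```python
from lxml import etree

html = """<html><body>
  <h5>Address: Long Street</h5>
  <h5>Telephone: 1-268-462-4930</h5>
</body></html>"""

tree = etree.fromstring(html, etree.HTMLParser())
# Find any h5 whose own text node mentions 'Telephone'
hits = tree.xpath("//h5[contains(text(), 'Telephone')]")
print(hits[0].text)  # Telephone: 1-268-462-4930
```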

Verifying XPath Accuracy

# Verify the stored XPath still returns the expected content
result = tree.xpath(xpath)
if result:
    actual_text = ''.join(result[0].itertext())
    if expected_value in actual_text:
        match_score = 1.0  # Exact match
    else:
        match_score = 0.8  # Partial match - investigate
else:
    match_score = 0.0  # XPath no longer resolves - re-archive and regenerate
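The check above can be wrapped into a reusable helper. This is a sketch; the 0.8 threshold follows the convention shown, and the 0.0 fallback for an unresolvable XPath is a suggested convention rather than anything schema-mandated:

```python
from lxml import etree

def xpath_match_score(html: str, xpath: str, expected_value: str) -> float:
    """Score how well a stored XPath still locates its claim value.
    1.0 = element found and contains the value; 0.8 = element found but
    text differs (investigate); 0.0 = XPath no longer resolves."""
    tree = etree.fromstring(html, etree.HTMLParser())
    result = tree.xpath(xpath)
    if not result:
        return 0.0
    actual_text = "".join(result[0].itertext())
    return 1.0 if expected_value in actual_text else 0.8
```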

ClaimTypeEnum Values

From schemas/20251121/linkml/modules/classes/WebClaim.yaml:

| Claim Type | Description |
|---|---|
| full_name | Official institution name |
| description | Institution description/mission |
| phone | Phone number |
| email | Email address |
| address | Physical address |
| website | Official website URL |
| founding_date | Year or date founded |
| parent_organization | Managing/parent organization |
| opening_hours | Hours of operation |
| social_media | Social media URLs |
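For quick validation, the enum can be mirrored in a small lookup set (transcribed from the table above; the WebClaim.yaml schema file remains authoritative, and the `invalid_claim_types` helper is illustrative):

```python
# Allowed claim types, transcribed from the ClaimTypeEnum table above.
CLAIM_TYPES = {
    "full_name", "description", "phone", "email", "address",
    "website", "founding_date", "parent_organization",
    "opening_hours", "social_media",
}

def invalid_claim_types(claims: list) -> list:
    """Return any claim_type values that are not in the enum."""
    return [c["claim_type"] for c in claims if c["claim_type"] not in CLAIM_TYPES]
```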

Example: Museum of Antigua and Barbuda

Extracted Claims

| Claim Type | Value | XPath |
|---|---|---|
| founding_date | 1985 | /html/body/div/div[3]/div/div/div/h5[1] |
| address | Long Street, St. John's, Antigua | /html/body/div/div[3]/div/div/div/h5[9] |
| phone | 1-268-462-4930 | /html/body/div/div[3]/div/div/div/h5[18] |
| parent_organization | Historical and Archaeological Society of Antigua and Barbuda | /html/body/div/div[3]/div/div/div/h5[20] |

Staff Data (from separate page)

| Role | Name | XPath |
|---|---|---|
| Curator | Dian Andrews | /html/body/div/div[3]/div/div/div/h5[3] |
| Research Librarian | Myra Piper | /html/body/div/div[3]/div/div/div/h5[4] |
| Board Chairperson | Walter Berridge | /html/body/div/div[3]/div/div/div/h5[25] |
| Board President | Reg Murphy | /html/body/div/div[3]/div/div/div/h5[26] |

Comparison with Playwright Method

| Aspect | Firecrawl MCP | Playwright Script |
|---|---|---|
| Setup | None (MCP ready) | Python + Playwright install |
| Speed | Fast (cached) | Slower (full render) |
| JS Support | Limited | Full |
| Screenshots | No | Yes |
| Archival | Manual save | Automatic |
| Best For | Static HTML sites | SPAs, JS-heavy sites |

Related Documentation

  • AGENTS.md Rule 6 - WebObservation claims MUST have XPath provenance
  • .opencode/WEB_OBSERVATION_PROVENANCE_RULES.md - Complete provenance rules
  • schemas/20251121/linkml/modules/classes/WebClaim.yaml - Schema definition
  • scripts/fetch_website_playwright.py - Full Playwright archiving script
  • scripts/add_xpath_provenance.py - XPath verification script

Limitations

  1. No JavaScript rendering - Sites requiring JS won't work well
  2. Manual archiving - HTML must be saved manually (vs automatic with Playwright)
  3. No screenshots - Cannot capture visual state
  4. Cache dependency - Firecrawl caching affects data freshness

For complex sites or when screenshots are needed, use the full Playwright workflow instead.