kempersc c50c35fd3a enrich person custodian

2025-12-14 17:09:55 +01:00

Digital Platform Discovery Rule

Rule Summary

Rule 25 in AGENTS.md: Every heritage custodian MUST be enriched with digital platform discovery data, and that data MUST carry complete provenance tracking.

Purpose

Digital platform discovery documents how heritage institutions make their collections accessible online. This is critical for:

Collection Accessibility Research - Understanding which collections are digitized and discoverable
System Integration Planning - Identifying which platforms to integrate with
Aggregator Mapping - Tracking which custodians contribute to which aggregators
Technology Assessment - Understanding the collection management system landscape

What to Discover

1. Collection Management Systems (CMS)

The backend system used for cataloging and managing collections.

Common Systems:

MAIS-Flexis - DE REE (Dutch archives)
Adlib/Axiell Collections - Axiell (museums, archives)
CollectiveAccess - Open source
ArchivesSpace - Open source (archives)
Koha - Open source (libraries)
Ex Libris Alma - Libraries
TMS (The Museum System) - Gallery Systems

Required Fields:

collection_management_system:
  system_name: MAIS-Flexis
  vendor: DE REE Archiefsystemen
  vendor_url: https://www.de-ree.nl/
  version: null  # If determinable
  primary_use: archival_description | collection_management | digital_asset_management
  provenance:
    source_url: https://example.org/about
    xpath: /html/body/main/div[2]/p[3]
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl

2. Auxiliary Digital Platforms

Public-facing discovery tools operated by the institution.

Platform Types:

Type	Description	Examples
`DISCOVERY_PORTAL`	Search interface for collections	Beeldbank, Archiefstukken
`DIGITAL_ARCHIVE`	Digitized document repository	E-depot
`IMAGE_DATABASE`	Photograph/image collection	Beeldbank
`GENEALOGY_PORTAL`	Family history records	Genealogie, WieWasWie
`NEWSPAPER_ARCHIVE`	Historical newspapers	Delpher, Kranten
`MAP_COLLECTION`	Historical cartography	Kaarten
`AV_ARCHIVE`	Audio/video materials	Film & Geluid

Required Fields:

auxiliary_digital_platforms:
  - platform_name: Beeldbank
    platform_url: https://beeldbank.example.nl/
    platform_type: IMAGE_DATABASE
    content_type: images | documents | newspapers | maps | audio | video | mixed
    items_indexed: 125000
    description: "Brief description of content"
    provenance:
      source_url: https://example.org/onderzoeken
      xpath: /html/body/main/section[2]/div[1]/a
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl

3. External Platform Integrations

Aggregators and shared platforms the institution contributes to.

Common Aggregators:

Platform	Scope	URL Pattern
Archieven.nl	Dutch archives	`archieven.nl/nl/zoeken?mivast=XXX`
Europeana	European heritage	`europeana.eu/...`
Memorix Maior	Dutch shared platform	`memorix.nl/...`
Collectie Nederland	Dutch museums	`collectienederland.nl/...`
DPLA	US heritage	`dp.la/...`
Delpher	Dutch newspapers/books	`delpher.nl/...`
WieWasWie	Dutch genealogy	`wiewaswie.nl/...`

Required Fields:

external_platform_integrations:
  - platform_name: Archieven.nl
    integration_type: discovery_aggregator | data_provider | api_consumer
    integration_url: https://www.archieven.nl/nl/zoeken?mivast=123
    items_contributed: 450000
    sync_frequency: daily | weekly | monthly | on_demand | unknown
    provenance:
      source_url: https://example.org/onderzoeken
      xpath: /html/body/main/section[3]/div[1]
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
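
The Archieven.nl URL pattern above carries a `mivast` query parameter. A minimal sketch of pulling that value out of an `integration_url`, so it can be recorded or compared across custodians, might look like the following. The function name `extract_mivast_id` is hypothetical, and reading `mivast` as a per-institution identifier is an assumption that this rule does not state.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_mivast_id(integration_url: str) -> Optional[str]:
    """Return the `mivast` query value from an Archieven.nl URL, or None.

    Hypothetical helper: the meaning of `mivast` is assumed, not documented here.
    """
    parsed = urlparse(integration_url)
    host = (parsed.hostname or "").lower()
    # Only accept archieven.nl itself or one of its subdomains (e.g. www).
    if host != "archieven.nl" and not host.endswith(".archieven.nl"):
        return None
    values = parse_qs(parsed.query).get("mivast")
    return values[0] if values else None


if __name__ == "__main__":
    print(extract_mivast_id("https://www.archieven.nl/nl/zoeken?mivast=123"))
```

URLs on other hosts, or Archieven.nl URLs without the parameter, yield `None` rather than raising, which keeps the helper safe to run over every discovered link.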

4. APIs & Data Services

Machine-readable interfaces for collection data.

Protocol Types:

OAI-PMH - Open Archives Initiative Protocol for Metadata Harvesting
SPARQL - RDF query endpoint
REST API - Custom REST endpoints
SRU/SRW - Search/Retrieve via URL and Search/Retrieve Web Service
Z39.50 - Legacy library protocol

Required Fields:

data_services:
  - service_name: OAI-PMH Endpoint
    endpoint_url: https://example.org/oai
    protocol: OAI-PMH | SPARQL | REST | SRU | Z39.50
    documentation_url: https://example.org/api-docs
    metadata_formats: [dublin_core, ead, marc21]
    provenance:
      source_url: https://example.org/voor-onderzoekers/api
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
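
When a candidate OAI-PMH endpoint turns up, one way to confirm it before recording it is to send the protocol's `Identify` request and read back the repository name. The sketch below only builds the request URL and parses a response body, so the network call itself is left to whichever retrieval agent is in use. The helper names and the sample response are illustrative assumptions. The `verb=Identify` parameter, the namespace, and the element names come from the OAI-PMH 2.0 specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def identify_url(endpoint_url: str) -> str:
    """Build an Identify request URL (assumes the endpoint has no query string)."""
    return f"{endpoint_url}?{urlencode({'verb': 'Identify'})}"


def parse_identify(xml_text: str) -> dict:
    """Extract a few fields from an Identify response; empty dict if absent."""
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", OAI_NS)
    if identify is None:
        return {}

    def text(tag: str):
        node = identify.find(f"oai:{tag}", OAI_NS)
        return node.text.strip() if node is not None and node.text else None

    return {
        "repository_name": text("repositoryName"),
        "base_url": text("baseURL"),
        "protocol_version": text("protocolVersion"),
    }


# Illustrative response body, not taken from a real repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2025-01-15T10:30:00Z</responseDate>
  <request verb="Identify">https://example.org/oai</request>
  <Identify>
    <repositoryName>Example Archive</repositoryName>
    <baseURL>https://example.org/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
  </Identify>
</OAI-PMH>"""

if __name__ == "__main__":
    print(identify_url("https://example.org/oai"))
    print(parse_identify(SAMPLE))
```

An empty result suggests the URL is not an OAI-PMH endpoint, or that it returned an error document, in which case it probably should not be recorded under `data_services` without further inspection.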

Provenance Requirements

Required Fields for ALL Provenance

Field	Type	Required	Description
`source_url`	string	YES	URL where information was found
`retrieved_on`	datetime	YES	ISO 8601 timestamp
`retrieval_agent`	enum	YES	Tool used for extraction
`xpath`	string	RECOMMENDED	XPath to source element
`html_file`	string	RECOMMENDED	Path to archived HTML

Retrieval Agent Values

Value	Description	When to Use
`firecrawl`	FireCrawl MCP tools	Primary - most web pages
`playwright`	Playwright browser automation	JavaScript-heavy sites
`exa`	Exa web search/crawl	When FireCrawl unavailable
`manual`	Manual inspection	Last resort
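
The two tables above translate fairly directly into a check that can run over any provenance block. The following is a minimal sketch, not the project's actual validator. The function name `check_provenance` is hypothetical, and treating RECOMMENDED fields as warnings rather than errors is an assumption about how strictly the rule should be enforced.

```python
from datetime import datetime

REQUIRED_FIELDS = ("source_url", "retrieved_on", "retrieval_agent")
RECOMMENDED_FIELDS = ("xpath", "html_file")
RETRIEVAL_AGENTS = {"firecrawl", "playwright", "exa", "manual"}


def check_provenance(prov: dict) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for a single provenance block."""
    errors: list[str] = []
    warnings: list[str] = []

    for field in REQUIRED_FIELDS:
        if not prov.get(field):
            errors.append(f"missing required field: {field}")

    agent = prov.get("retrieval_agent")
    if agent and agent not in RETRIEVAL_AGENTS:
        errors.append(f"unknown retrieval_agent: {agent}")

    stamp = prov.get("retrieved_on")
    if stamp:
        try:
            # Older Python versions reject a trailing "Z", so normalise it first.
            datetime.fromisoformat(str(stamp).replace("Z", "+00:00"))
        except ValueError:
            errors.append(f"retrieved_on is not ISO 8601: {stamp}")

    for field in RECOMMENDED_FIELDS:
        if not prov.get(field):
            warnings.append(f"missing recommended field: {field}")

    return errors, warnings


if __name__ == "__main__":
    block = {
        "source_url": "https://example.org/about",
        "xpath": "/html/body/main/div[2]/p[3]",
        "retrieved_on": "2025-01-15T10:30:00Z",
        "retrieval_agent": "firecrawl",
    }
    print(check_provenance(block))
```

`datetime.fromisoformat` accepts the timestamp form used in this document but not every ISO 8601 variant, so a stricter parser may be needed if other forms appear.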

Discovery Summary Block

Every custodian with digital platform data MUST include:

digital_platform_discovery_summary:
  discovery_metadata:
    retrieval_agent: firecrawl
    retrieval_timestamp: "2025-01-15T10:30:00Z"
    source_url: https://www.example.nl/onderzoeken
    xpath_base: /html/body/main/section[2]
    html_file: web/GHCID/example.nl/rendered.html
  platforms_discovered: 7
  total_items_indexed: 545393
  cms_identified: true
  external_integrations_count: 2
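
The counts in the summary block can be derived from the other sections rather than typed by hand. The sketch below rests on several assumptions that this rule does not spell out: that `platforms_discovered` counts entries in `auxiliary_digital_platforms`, that `total_items_indexed` sums their `items_indexed` values, and that a CMS counts as identified once `system_name` is set. The function name `build_summary` is hypothetical.

```python
def build_summary(custodian: dict, discovery_metadata: dict) -> dict:
    """Derive a digital_platform_discovery_summary from a custodian record.

    The counting rules used here are assumptions; adjust them to match
    however the totals are defined for the project.
    """
    cms = custodian.get("collection_management_system") or {}
    auxiliary = custodian.get("auxiliary_digital_platforms") or []
    external = custodian.get("external_platform_integrations") or []

    return {
        "discovery_metadata": discovery_metadata,
        "platforms_discovered": len(auxiliary),
        # Platforms without a visible item count contribute zero.
        "total_items_indexed": sum(p.get("items_indexed") or 0 for p in auxiliary),
        "cms_identified": bool(cms.get("system_name")),
        "external_integrations_count": len(external),
    }


if __name__ == "__main__":
    record = {
        "collection_management_system": {"system_name": "MAIS-Flexis"},
        "auxiliary_digital_platforms": [
            {"platform_name": "Beeldbank", "items_indexed": 125000},
        ],
        "external_platform_integrations": [{"platform_name": "Archieven.nl"}],
    }
    print(build_summary(record, {"retrieval_agent": "firecrawl"}))
```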

Discovery Workflow

Step 1: Initial Website Scrape

# Using FireCrawl MCP
firecrawl_firecrawl_scrape(
  url="https://www.example-archive.nl",
  formats=["markdown", "links"]
)

Step 2: Map Site Structure

# Find all URLs on site
firecrawl_firecrawl_map(
  url="https://www.example-archive.nl",
  limit=200
)

Step 3: Target Key Pages

Common page patterns for platform discovery:

/onderzoeken - Dutch archives
/collecties - Collections
/zoeken - Search functionality
/over-ons - About (may mention CMS)
/api or /data - Technical services
/partners - External integrations
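
Once Step 2 has produced a list of URLs, the path patterns above can serve as a filter to decide which pages to scrape first. This is a minimal sketch under the assumption that the patterns are path prefixes at the site root. The name `select_key_pages` is hypothetical, and sites that put a language segment first (for example `/nl/onderzoeken`) would need the matching relaxed.

```python
from urllib.parse import urlparse

KEY_PATH_PREFIXES = (
    "/onderzoeken",
    "/collecties",
    "/zoeken",
    "/over-ons",
    "/api",
    "/data",
    "/partners",
)


def select_key_pages(urls: list[str]) -> list[str]:
    """Keep URLs whose path equals a key prefix or sits beneath one."""
    selected = []
    for url in urls:
        path = urlparse(url).path.lower().rstrip("/")
        # Match "/api" and "/api/docs" but not "/apiary".
        if any(path == p or path.startswith(p + "/") for p in KEY_PATH_PREFIXES):
            selected.append(url)
    return selected


if __name__ == "__main__":
    mapped = [
        "https://www.example-archive.nl/onderzoeken",
        "https://www.example-archive.nl/nieuws",
        "https://www.example-archive.nl/collecties/fotos",
    ]
    print(select_key_pages(mapped))
```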

Step 4: Extract Platform Information

For each platform discovered:

Note the platform name
Copy the URL
Record XPath to source element
Count items (if displayed)
Identify platform type
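
The items noted in this step map onto one entry in `auxiliary_digital_platforms`. A small helper can assemble that entry and stamp the provenance at the same time, so the timestamp format stays consistent. The sketch below is illustrative: `make_platform_entry` and its parameters are hypothetical names, and the output shape simply mirrors the Required Fields example for auxiliary platforms.

```python
from datetime import datetime, timezone
from typing import Optional


def make_platform_entry(
    name: str,
    url: str,
    platform_type: str,
    source_url: str,
    xpath: str,
    retrieval_agent: str = "firecrawl",
    items_indexed: Optional[int] = None,
    retrieved_at: Optional[datetime] = None,
) -> dict:
    """Assemble one platform entry with its provenance block.

    `retrieved_at` should be timezone-aware; it defaults to the current UTC time.
    """
    moment = retrieved_at or datetime.now(timezone.utc)
    entry = {
        "platform_name": name,
        "platform_url": url,
        "platform_type": platform_type,
        "provenance": {
            "source_url": source_url,
            "xpath": xpath,
            # Render as UTC with a trailing "Z", matching the examples in this rule.
            "retrieved_on": moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "retrieval_agent": retrieval_agent,
        },
    }
    # Only record a count when one was actually displayed on the page.
    if items_indexed is not None:
        entry["items_indexed"] = items_indexed
    return entry


if __name__ == "__main__":
    print(make_platform_entry(
        "Beeldbank",
        "https://beeldbank.example.nl/",
        "IMAGE_DATABASE",
        source_url="https://example.org/onderzoeken",
        xpath="/html/body/main/section[2]/div[1]/a",
        items_indexed=125000,
    ))
```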

Step 5: Archive Source HTML

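# Archive the rendered page for provenance.
# Note: per the Tools Reference, the snapshot tool captures the accessibility tree,
# so confirm the saved file contains the content you intend to cite.
playwright_browser_navigate(url="https://...")
playwright_browser_snapshot(filename="web/GHCID/domain/rendered.html")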

Step 6: Update Custodian YAML

Add all sections to the custodian file:

collection_management_system
auxiliary_digital_platforms
external_platform_integrations
data_services (if applicable)
digital_platform_discovery_summary
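
Because Rule 5 makes enrichment additive only, adding these sections should never overwrite or drop what the custodian file already holds. One way to honour that when working with the parsed YAML as plain dictionaries is sketched below. The function name `merge_additive` and its merge policy (existing scalars win, lists gain only unseen items, nulls may be filled) are assumptions rather than the project's defined behaviour. Loading and dumping the YAML file itself is left out.

```python
import copy


def merge_additive(existing: dict, discovered: dict) -> dict:
    """Return a copy of `existing` with `discovered` merged in additively.

    Nothing already present is removed or overwritten; only missing keys,
    null values, and unseen list items are filled in.
    """
    merged = copy.deepcopy(existing)
    for key, value in discovered.items():
        current = merged.get(key)
        if current is None:
            merged[key] = copy.deepcopy(value)
        elif isinstance(current, dict) and isinstance(value, dict):
            merged[key] = merge_additive(current, value)
        elif isinstance(current, list) and isinstance(value, list):
            for item in value:
                if item not in current:
                    current.append(copy.deepcopy(item))
        # Otherwise keep the existing value untouched.
    return merged


if __name__ == "__main__":
    on_disk = {"auxiliary_digital_platforms": [{"platform_name": "Beeldbank"}]}
    found = {
        "auxiliary_digital_platforms": [{"platform_name": "Kaarten"}],
        "collection_management_system": {"system_name": "MAIS-Flexis"},
    }
    print(merge_additive(on_disk, found))
```

Note that list items are compared by equality, so two entries for the same platform with different provenance timestamps would both be kept. Whether that is desirable depends on how re-discovery is meant to be recorded.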

Example: Complete Implementation

See data/custodian/NL-DR-ASS-A-DA.yaml (Drents Archief) for a comprehensive example with:

1 Collection Management System (MAIS-Flexis)
7 Auxiliary Digital Platforms
2 External Integrations (Archieven.nl, Memorix)
Complete provenance for each platform
Discovery summary with totals

Tools Reference

FireCrawl MCP Tools

Tool	Purpose	MCP Name
Scrape	Extract single page content	`firecrawl_firecrawl_scrape`
Map	Discover all URLs on site	`firecrawl_firecrawl_map`
Search	Web search with scraping	`firecrawl_firecrawl_search`
Crawl	Multi-page extraction	`firecrawl_firecrawl_crawl`

Playwright MCP Tools

Tool	Purpose	MCP Name
Navigate	Open page in browser	`playwright_browser_navigate`
Snapshot	Capture accessibility tree	`playwright_browser_snapshot`
Screenshot	Visual capture	`playwright_browser_take_screenshot`

Validation Checklist

Before marking discovery complete, verify:

collection_management_system identified (or marked as unknown)
All public discovery portals documented
External aggregator integrations listed
Every platform has provenance block
retrieval_agent specified for each
retrieved_on in ISO 8601 format
source_url is the page where info was found
digital_platform_discovery_summary present
Item counts included where visible
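
Several of these checks are structural and can be automated over a parsed custodian record. The sketch below covers only the mechanical items (section presence and required provenance fields). The function name `checklist_failures` is hypothetical, and judgement calls such as whether all public portals were documented still need a human or agent review.

```python
PLATFORM_SECTIONS = (
    "auxiliary_digital_platforms",
    "external_platform_integrations",
    "data_services",
)
REQUIRED_PROVENANCE = ("source_url", "retrieved_on", "retrieval_agent")


def checklist_failures(custodian: dict) -> list[str]:
    """List the structural checklist items that a custodian record fails."""
    failures = []
    if "collection_management_system" not in custodian:
        failures.append("collection_management_system missing (identify it or mark it unknown)")
    if "digital_platform_discovery_summary" not in custodian:
        failures.append("digital_platform_discovery_summary missing")

    for section in PLATFORM_SECTIONS:
        for index, entry in enumerate(custodian.get(section) or []):
            label = f"{section}[{index}]"
            prov = entry.get("provenance")
            if not prov:
                failures.append(f"{label}: no provenance block")
                continue
            for field in REQUIRED_PROVENANCE:
                if not prov.get(field):
                    failures.append(f"{label}: provenance lacks {field}")
    return failures


if __name__ == "__main__":
    record = {"auxiliary_digital_platforms": [{"platform_name": "Beeldbank"}]}
    for problem in checklist_failures(record):
        print(problem)
```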

Related Rules

Rule 6: WebObservation claims MUST have XPath provenance
Rule 22: Custodian YAML is single source of truth
Rule 5: Never delete enriched data (additive only)

Version History

Date	Change
2025-01-15	Initial rule creation
