glam/.opencode/DIGITAL_PLATFORM_DISCOVERY_RULE.md
2025-12-14 17:09:55 +01:00

Digital Platform Discovery Rule

Rule Summary

Rule 25 in AGENTS.md: Every heritage custodian MUST be enriched with digital platform discovery data, including complete provenance tracking.

Purpose

Digital platform discovery documents how heritage institutions make their collections accessible online. This is critical for:

  1. Collection Accessibility Research - Understanding which collections are digitized and discoverable
  2. System Integration Planning - Identifying which platforms to integrate with
  3. Aggregator Mapping - Tracking which custodians contribute to which aggregators
  4. Technology Assessment - Understanding the collection management system landscape

What to Discover

1. Collection Management Systems (CMS)

The backend system used for cataloging and managing collections.

Common Systems:

  • MAIS-Flexis - DE REE (Dutch archives)
  • Adlib/Axiell Collections - Axiell (museums, archives)
  • CollectiveAccess - Open source
  • ArchivesSpace - Open source (archives)
  • Koha - Open source (libraries)
  • Ex Libris Alma - Libraries
  • TMS (The Museum System) - Gallery Systems

Required Fields:

collection_management_system:
  system_name: MAIS-Flexis
  vendor: DE REE Archiefsystemen
  vendor_url: https://www.de-ree.nl/
  version: null  # If determinable
  primary_use: archival_description | collection_management | digital_asset_management
  provenance:
    source_url: https://example.org/about
    xpath: /html/body/main/div[2]/p[3]
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl

2. Auxiliary Digital Platforms

Public-facing discovery tools operated by the institution.

Platform Types:

| Type | Description | Examples |
| --- | --- | --- |
| DISCOVERY_PORTAL | Search interface for collections | Beeldbank, Archiefstukken |
| DIGITAL_ARCHIVE | Digitized document repository | E-depot |
| IMAGE_DATABASE | Photograph/image collection | Beeldbank |
| GENEALOGY_PORTAL | Family history records | Genealogie, WieWasWie |
| NEWSPAPER_ARCHIVE | Historical newspapers | Delpher, Kranten |
| MAP_COLLECTION | Historical cartography | Kaarten |
| AV_ARCHIVE | Audio/video materials | Film & Geluid |

Required Fields:

auxiliary_digital_platforms:
  - platform_name: Beeldbank
    platform_url: https://beeldbank.example.nl/
    platform_type: IMAGE_DATABASE
    content_type: images | documents | newspapers | maps | audio | video | mixed
    items_indexed: 125000
    description: "Brief description of content"
    provenance:
      source_url: https://example.org/onderzoeken
      xpath: /html/body/main/section[2]/div[1]/a
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl

3. External Platform Integrations

Aggregators and shared platforms the institution contributes to.

Common Aggregators:

| Platform | Scope | URL Pattern |
| --- | --- | --- |
| Archieven.nl | Dutch archives | archieven.nl/nl/zoeken?mivast=XXX |
| Europeana | European heritage | europeana.eu/... |
| Memorix Maior | Dutch shared platform | memorix.nl/... |
| Collectie Nederland | Dutch museums | collectienederland.nl/... |
| DPLA | US heritage | dp.la/... |
| Delpher | Dutch newspapers/books | delpher.nl/... |
| WieWasWie | Dutch genealogy | wiewaswie.nl/... |
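
The URL patterns above can be used to recognise aggregator links found on a custodian's site. The following is a minimal sketch, not part of the rule itself: the function name `match_aggregator` and the choice to match on hostname only are assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import urlparse

# Domains taken from the "URL Pattern" column of the aggregator table.
AGGREGATOR_HOSTS = {
    "archieven.nl": "Archieven.nl",
    "europeana.eu": "Europeana",
    "memorix.nl": "Memorix Maior",
    "collectienederland.nl": "Collectie Nederland",
    "dp.la": "DPLA",
    "delpher.nl": "Delpher",
    "wiewaswie.nl": "WieWasWie",
}


def match_aggregator(url: str) -> Optional[str]:
    """Return the aggregator name for a URL, or None if its host is not listed.

    Hypothetical helper: it expects an absolute URL with a scheme.
    """
    host = (urlparse(url).hostname or "").lower()
    for domain, name in AGGREGATOR_HOSTS.items():
        # Accept the bare domain and any subdomain (www., api., ...).
        if host == domain or host.endswith("." + domain):
            return name
    return None
```

A link such as `https://www.archieven.nl/nl/zoeken?mivast=123` would be reported as `Archieven.nl`; a URL without a scheme has no parsed hostname and returns `None`.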

Required Fields:

external_platform_integrations:
  - platform_name: Archieven.nl
    integration_type: discovery_aggregator | data_provider | api_consumer
    integration_url: https://www.archieven.nl/nl/zoeken?mivast=123
    items_contributed: 450000
    sync_frequency: daily | weekly | monthly | on_demand | unknown
    provenance:
      source_url: https://example.org/onderzoeken
      xpath: /html/body/main/section[3]/div[1]
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl

4. APIs & Data Services

Machine-readable interfaces for collection data.

Protocol Types:

  • OAI-PMH - Open Archives Initiative Protocol for Metadata Harvesting
  • SPARQL - RDF query endpoint
  • REST API - Custom REST endpoints
  • SRU/SRW - Search/Retrieve via URL and its SOAP-based counterpart, Search/Retrieve Web Service
  • Z39.50 - Legacy library protocol

Required Fields:

data_services:
  - service_name: OAI-PMH Endpoint
    endpoint_url: https://example.org/oai
    protocol: OAI-PMH | SPARQL | REST | SRU | Z39.50
    documentation_url: https://example.org/api-docs
    metadata_formats: [dublin_core, ead, marc21]
    provenance:
      source_url: https://example.org/voor-onderzoekers/api
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
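
For OAI-PMH endpoints, the `metadata_formats` value can usually be confirmed by asking the endpoint itself with the standard `ListMetadataFormats` verb. The sketch below only builds the request URL and makes no network call; the helper name `oai_request_url` is an assumption for illustration.

```python
from urllib.parse import urlencode

# The six request verbs defined by the OAI-PMH 2.0 specification.
OAI_VERBS = {
    "GetRecord",
    "Identify",
    "ListIdentifiers",
    "ListMetadataFormats",
    "ListRecords",
    "ListSets",
}


def oai_request_url(endpoint_url: str, verb: str, **params: str) -> str:
    """Build an OAI-PMH request URL for the given endpoint and verb."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    query = urlencode({"verb": verb, **params})
    # Some endpoints already carry a query string in their base URL.
    separator = "&" if "?" in endpoint_url else "?"
    return f"{endpoint_url}{separator}{query}"


if __name__ == "__main__":
    print(oai_request_url("https://example.org/oai", "ListMetadataFormats"))
```

Fetching the resulting URL with any HTTP client should return an XML list of supported metadata prefixes, which can then be recorded under `metadata_formats`.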

Provenance Requirements

Required Fields for ALL Provenance

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| source_url | string | YES | URL where information was found |
| retrieved_on | datetime | YES | ISO 8601 timestamp |
| retrieval_agent | enum | YES | Tool used for extraction |
| xpath | string | RECOMMENDED | XPath to source element |
| html_file | string | RECOMMENDED | Path to archived HTML |

Retrieval Agent Values

| Value | Description | When to Use |
| --- | --- | --- |
| firecrawl | FireCrawl MCP tools | Primary - most web pages |
| playwright | Playwright browser automation | JavaScript-heavy sites |
| exa | Exa web search/crawl | When FireCrawl unavailable |
| manual | Manual inspection | Last resort |
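
The provenance requirements can be checked mechanically. The following is a minimal sketch, assuming the provenance block has already been loaded into a Python dict; the function name `validate_provenance` and the wording of its messages are illustrative, not part of the rule.

```python
from datetime import datetime
from typing import List

# Fields marked YES in the provenance table.
REQUIRED_FIELDS = ("source_url", "retrieved_on", "retrieval_agent")
# Allowed values from the retrieval agent table.
RETRIEVAL_AGENTS = {"firecrawl", "playwright", "exa", "manual"}


def validate_provenance(prov: dict) -> List[str]:
    """Return a list of problems; an empty list means the block passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not prov.get(field):
            problems.append(f"missing required field: {field}")

    agent = prov.get("retrieval_agent")
    if agent and agent not in RETRIEVAL_AGENTS:
        problems.append(f"unknown retrieval_agent: {agent}")

    timestamp = prov.get("retrieved_on")
    if timestamp:
        try:
            # datetime.fromisoformat() rejects a trailing "Z" before
            # Python 3.11, so normalise it to an explicit UTC offset.
            datetime.fromisoformat(str(timestamp).replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"retrieved_on is not ISO 8601: {timestamp}")
    return problems
```

The RECOMMENDED fields (`xpath`, `html_file`) are deliberately not enforced here; a stricter variant could report them as warnings.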

Discovery Summary Block

Every custodian with digital platform data MUST include:

digital_platform_discovery_summary:
  discovery_metadata:
    retrieval_agent: firecrawl
    retrieval_timestamp: "2025-01-15T10:30:00Z"
    source_url: https://www.example.nl/onderzoeken
    xpath_base: /html/body/main/section[2]
    html_file: web/GHCID/example.nl/rendered.html
  platforms_discovered: 7
  total_items_indexed: 545393
  cms_identified: true
  external_integrations_count: 2
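
The totals in the summary block can be derived from the other sections instead of being typed by hand. The sketch below rests on assumptions the rule does not state: that `platforms_discovered` counts the auxiliary platforms, that `total_items_indexed` sums their `items_indexed` values, and that a CMS counts as identified when `system_name` is set. The function name is hypothetical.

```python
def build_discovery_summary(custodian: dict, discovery_metadata: dict) -> dict:
    """Derive a digital_platform_discovery_summary block from a custodian dict.

    Assumption: the counting rules below are one plausible reading of the
    summary fields, not a definition taken from the rule.
    """
    platforms = custodian.get("auxiliary_digital_platforms") or []
    integrations = custodian.get("external_platform_integrations") or []
    cms = custodian.get("collection_management_system") or {}

    return {
        "discovery_metadata": discovery_metadata,
        "platforms_discovered": len(platforms),
        # Platforms without a visible item count contribute 0.
        "total_items_indexed": sum(p.get("items_indexed") or 0 for p in platforms),
        "cms_identified": bool(cms.get("system_name")),
        "external_integrations_count": len(integrations),
    }
```

If the project counts items differently (for example, including `items_contributed` from integrations), only the `total_items_indexed` expression needs to change.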

Discovery Workflow

Step 1: Initial Website Scrape

# Using FireCrawl MCP
firecrawl_firecrawl_scrape(
  url="https://www.example-archive.nl",
  formats=["markdown", "links"]
)

Step 2: Map Site Structure

# Find all URLs on site
firecrawl_firecrawl_map(
  url="https://www.example-archive.nl",
  limit=200
)

Step 3: Target Key Pages

Common page patterns for platform discovery:

  • /onderzoeken - Research pages (common on Dutch archive sites)
  • /collecties - Collections
  • /zoeken - Search functionality
  • /over-ons - About (may mention CMS)
  • /api or /data - Technical services
  • /partners - External integrations
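
Given the URL list returned by the site map in Step 2, the page patterns above can be applied as a simple filter. This is a sketch under the assumption that a page is relevant when any path segment equals one of the listed names; `select_key_pages` is a hypothetical helper.

```python
from typing import Iterable, List
from urllib.parse import urlparse

# Path segments taken from the page patterns listed above.
KEY_SEGMENTS = {
    "onderzoeken",
    "collecties",
    "zoeken",
    "over-ons",
    "api",
    "data",
    "partners",
}


def select_key_pages(urls: Iterable[str]) -> List[str]:
    """Keep URLs whose path contains one of the key segments, in input order."""
    selected = []
    for url in urls:
        path = urlparse(url).path.lower()
        segments = [segment for segment in path.split("/") if segment]
        # Whole-segment matching: "/database" does not match "data".
        if any(segment in KEY_SEGMENTS for segment in segments):
            selected.append(url)
    return selected
```

Matching whole segments keeps language-prefixed paths such as `/nl/over-ons` while avoiding false hits on longer words.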

Step 4: Extract Platform Information

For each platform discovered:

  1. Note the platform name
  2. Copy the URL
  3. Record XPath to source element
  4. Count items (if displayed)
  5. Identify platform type
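
The five items above map directly onto one `auxiliary_digital_platforms` entry. A minimal sketch of assembling such an entry follows; the function names and the default agent are assumptions, while the field names and platform types come from the sections above.

```python
from datetime import datetime, timezone
from typing import Optional

# Platform types from the "Platform Types" table.
PLATFORM_TYPES = {
    "DISCOVERY_PORTAL", "DIGITAL_ARCHIVE", "IMAGE_DATABASE",
    "GENEALOGY_PORTAL", "NEWSPAPER_ARCHIVE", "MAP_COLLECTION", "AV_ARCHIVE",
}


def utc_timestamp(moment: Optional[datetime] = None) -> str:
    """Format a timezone-aware datetime (default: now) as ISO 8601 in UTC."""
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_platform_entry(name: str, url: str, platform_type: str,
                         source_url: str, xpath: str,
                         items_indexed: Optional[int] = None,
                         retrieval_agent: str = "firecrawl",
                         retrieved_on: Optional[str] = None) -> dict:
    """Assemble one platform entry with its provenance block."""
    if platform_type not in PLATFORM_TYPES:
        raise ValueError(f"unknown platform_type: {platform_type}")
    entry = {
        "platform_name": name,
        "platform_url": url,
        "platform_type": platform_type,
    }
    # Item counts are recorded only when the page actually displays one.
    if items_indexed is not None:
        entry["items_indexed"] = items_indexed
    entry["provenance"] = {
        "source_url": source_url,
        "xpath": xpath,
        "retrieved_on": retrieved_on or utc_timestamp(),
        "retrieval_agent": retrieval_agent,
    }
    return entry
```

Passing `retrieved_on` explicitly is useful when the entry is written some time after the page was scraped, so the timestamp reflects the retrieval and not the write.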

Step 5: Archive Source HTML

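# Archive the rendered page for provenance.
# Note: the snapshot tool captures the accessibility tree (see Tools Reference),
# so confirm that the saved file really contains the rendered HTML.
playwright_browser_navigate(url="https://...")
playwright_browser_snapshot(filename="web/GHCID/domain/rendered.html")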

Step 6: Update Custodian YAML

Add all sections to the custodian file:

  • collection_management_system
  • auxiliary_digital_platforms
  • external_platform_integrations
  • data_services (if applicable)
  • digital_platform_discovery_summary
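
Because enriched data must never be deleted (Rule 5), new discovery results should be merged into the existing custodian record, not written over it. The sketch below is one possible reading of "additive only", operating on already-loaded dicts: existing values win, missing or null values are filled in, and lists gain only entries they do not already contain. `merge_additive` is a hypothetical helper, and treating `null` as fillable is an assumption.

```python
import copy


def merge_additive(existing: dict, discovered: dict) -> dict:
    """Return a new dict with discovered data added to existing data.

    Nothing already present in `existing` is removed or overwritten.
    """
    merged = copy.deepcopy(existing)
    for key, value in discovered.items():
        current = merged.get(key)
        if current is None:
            # Key absent or null: safe to fill in.
            merged[key] = copy.deepcopy(value)
        elif isinstance(current, dict) and isinstance(value, dict):
            merged[key] = merge_additive(current, value)
        elif isinstance(current, list) and isinstance(value, list):
            for item in value:
                if item not in current:
                    current.append(copy.deepcopy(item))
        # Otherwise keep the existing value untouched.
    return merged
```

List entries are compared by full equality, so a platform rediscovered with a new timestamp would be appended as a second entry; matching on `platform_name` may be preferable in practice.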

Example: Complete Implementation

See data/custodian/NL-DR-ASS-A-DA.yaml (Drents Archief) for a comprehensive example with:

  • 1 Collection Management System (MAIS-Flexis)
  • 7 Auxiliary Digital Platforms
  • 2 External Integrations (Archieven.nl, Memorix)
  • Complete provenance for each platform
  • Discovery summary with totals

Tools Reference

FireCrawl MCP Tools

| Tool | Purpose | MCP Name |
| --- | --- | --- |
| Scrape | Extract single page content | firecrawl_firecrawl_scrape |
| Map | Discover all URLs on site | firecrawl_firecrawl_map |
| Search | Web search with scraping | firecrawl_firecrawl_search |
| Crawl | Multi-page extraction | firecrawl_firecrawl_crawl |

Playwright MCP Tools

| Tool | Purpose | MCP Name |
| --- | --- | --- |
| Navigate | Open page in browser | playwright_browser_navigate |
| Snapshot | Capture accessibility tree | playwright_browser_snapshot |
| Screenshot | Visual capture | playwright_browser_take_screenshot |

Validation Checklist

Before marking discovery complete, verify:

  • collection_management_system identified (or marked as unknown)
  • All public discovery portals documented
  • External aggregator integrations listed
  • Every platform has provenance block
  • retrieval_agent specified for each
  • retrieved_on in ISO 8601 format
  • source_url is the page where info was found
  • digital_platform_discovery_summary present
  • Item counts included where visible
  • Rule 6 is met: WebObservation claims MUST have XPath provenance
  • Rule 22 is met: Custodian YAML is the single source of truth
  • Rule 5 is met: Enriched data is never deleted (additive only)
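
The structural items of this checklist lend themselves to an automated pre-check. The following is a minimal sketch, assuming the custodian YAML has been loaded into a dict; `checklist_failures` is a hypothetical helper and covers only the items that can be verified without visiting the website.

```python
from typing import List

LIST_SECTIONS = (
    "auxiliary_digital_platforms",
    "external_platform_integrations",
    "data_services",
)
PROVENANCE_FIELDS = ("source_url", "retrieved_on", "retrieval_agent")


def checklist_failures(custodian: dict) -> List[str]:
    """Return checklist items that fail; an empty list means all checks pass."""
    failures = []
    if "collection_management_system" not in custodian:
        failures.append("collection_management_system missing (identify or mark unknown)")
    if "digital_platform_discovery_summary" not in custodian:
        failures.append("digital_platform_discovery_summary missing")

    # Collect every entry that is expected to carry a provenance block.
    entries = []
    cms = custodian.get("collection_management_system")
    if isinstance(cms, dict):
        entries.append(("collection_management_system", cms))
    for section in LIST_SECTIONS:
        for index, item in enumerate(custodian.get(section) or []):
            entries.append((f"{section}[{index}]", item))

    for label, entry in entries:
        provenance = entry.get("provenance") or {}
        for field in PROVENANCE_FIELDS:
            if not provenance.get(field):
                failures.append(f"{label}: provenance.{field} missing")
    return failures
```

Items such as "all public discovery portals documented" still need human or agent judgment, since they depend on what the site actually offers.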

Version History

| Date | Change |
| --- | --- |
| 2025-01-15 | Initial rule creation |

See Also

  • AGENTS.md Rule 25
  • docs/DIGITAL_PLATFORM_DISCOVERY_GUIDE.md
  • data/custodian/NL-DR-ASS-A-DA.yaml (reference implementation)