Digital Platform Discovery Rule
Rule Summary
Rule 25 in AGENTS.md: Every heritage custodian MUST be enriched with digital platform discovery data with complete provenance tracking.
Purpose
Digital platform discovery documents how heritage institutions make their collections accessible online. This is critical for:
- Collection Accessibility Research - Understanding which collections are digitized and discoverable
- System Integration Planning - Identifying which platforms to integrate with
- Aggregator Mapping - Tracking which custodians contribute to which aggregators
- Technology Assessment - Understanding collection management system landscape
What to Discover
1. Collection Management Systems (CMS)
The backend system used for cataloging and managing collections.
Common Systems:
- MAIS-Flexis - DE REE (Dutch archives)
- Adlib/Axiell Collections - Axiell (museums, archives)
- CollectiveAccess - Open source
- ArchivesSpace - Open source (archives)
- Koha - Open source (libraries)
- Ex Libris Alma - Libraries
- TMS (The Museum System) - Gallery Systems
Required Fields:
collection_management_system:
  system_name: MAIS-Flexis
  vendor: DE REE Archiefsystemen
  vendor_url: https://www.de-ree.nl/
  version: null  # If determinable
  primary_use: archival_description | collection_management | digital_asset_management
  provenance:
    source_url: https://example.org/about
    xpath: /html/body/main/div[2]/p[3]
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
2. Auxiliary Digital Platforms
Public-facing discovery tools operated by the institution.
Platform Types:
| Type | Description | Examples |
|---|---|---|
| DISCOVERY_PORTAL | Search interface for collections | Beeldbank, Archiefstukken |
| DIGITAL_ARCHIVE | Digitized document repository | E-depot |
| IMAGE_DATABASE | Photograph/image collection | Beeldbank |
| GENEALOGY_PORTAL | Family history records | Genealogie, WieWasWie |
| NEWSPAPER_ARCHIVE | Historical newspapers | Delpher, Kranten |
| MAP_COLLECTION | Historical cartography | Kaarten |
| AV_ARCHIVE | Audio/video materials | Film & Geluid |
Required Fields:
auxiliary_digital_platforms:
  - platform_name: Beeldbank
    platform_url: https://beeldbank.example.nl/
    platform_type: IMAGE_DATABASE
    content_type: images | documents | newspapers | maps | audio | video | mixed
    items_indexed: 125000
    description: "Brief description of content"
    provenance:
      source_url: https://example.org/onderzoeken
      xpath: /html/body/main/section[2]/div[1]/a
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
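The platform record above can be modelled in code so that invalid `platform_type` or `content_type` values are rejected early. The following is a minimal sketch, not part of the rule itself: the class name `AuxiliaryPlatform` is hypothetical, and it assumes the type table and the `content_type` alternatives listed above are exhaustive.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values copied from the Platform Types table and the content_type field.
PLATFORM_TYPES = {
    "DISCOVERY_PORTAL", "DIGITAL_ARCHIVE", "IMAGE_DATABASE", "GENEALOGY_PORTAL",
    "NEWSPAPER_ARCHIVE", "MAP_COLLECTION", "AV_ARCHIVE",
}
CONTENT_TYPES = {"images", "documents", "newspapers", "maps", "audio", "video", "mixed"}


@dataclass
class AuxiliaryPlatform:
    """Hypothetical in-memory form of one auxiliary_digital_platforms entry."""
    platform_name: str
    platform_url: str
    platform_type: str
    content_type: str
    items_indexed: Optional[int] = None  # None when no count is displayed
    description: str = ""

    def __post_init__(self) -> None:
        # Reject values that are not in the documented vocabularies.
        if self.platform_type not in PLATFORM_TYPES:
            raise ValueError(f"unknown platform_type: {self.platform_type}")
        if self.content_type not in CONTENT_TYPES:
            raise ValueError(f"unknown content_type: {self.content_type}")
```

The provenance block is left out here to keep the sketch short; a real model would carry it alongside these fields.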
3. External Platform Integrations
Aggregators and shared platforms the institution contributes to.
Common Aggregators:
| Platform | Scope | URL Pattern |
|---|---|---|
| Archieven.nl | Dutch archives | archieven.nl/nl/zoeken?mivast=XXX |
| Europeana | European heritage | europeana.eu/... |
| Memorix Maior | Dutch shared platform | memorix.nl/... |
| Collectie Nederland | Dutch museums | collectienederland.nl/... |
| DPLA | US heritage | dp.la/... |
| Delpher | Dutch newspapers/books | delpher.nl/... |
| WieWasWie | Dutch genealogy | wiewaswie.nl/... |
Required Fields:
external_platform_integrations:
  - platform_name: Archieven.nl
    integration_type: discovery_aggregator | data_provider | api_consumer
    integration_url: https://www.archieven.nl/nl/zoeken?mivast=123
    items_contributed: 450000
    sync_frequency: daily | weekly | monthly | on_demand | unknown
    provenance:
      source_url: https://example.org/onderzoeken
      xpath: /html/body/main/section[3]/div[1]
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
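Links found on a custodian's site can be matched against the aggregator table above to suggest a `platform_name`. The sketch below is illustrative only: the function name `identify_aggregator` is hypothetical, and it assumes that matching on the host name is sufficient, which will miss aggregators served from other domains.

```python
from typing import Optional
from urllib.parse import urlparse

# Domains taken from the URL Pattern column of the Common Aggregators table.
AGGREGATOR_DOMAINS = {
    "archieven.nl": "Archieven.nl",
    "europeana.eu": "Europeana",
    "memorix.nl": "Memorix Maior",
    "collectienederland.nl": "Collectie Nederland",
    "dp.la": "DPLA",
    "delpher.nl": "Delpher",
    "wiewaswie.nl": "WieWasWie",
}


def identify_aggregator(url: str) -> Optional[str]:
    """Return the aggregator name for a URL, or None if the host is not listed."""
    host = (urlparse(url).hostname or "").lower()
    for domain, name in AGGREGATOR_DOMAINS.items():
        # Match the bare domain or any subdomain such as www.archieven.nl.
        if host == domain or host.endswith("." + domain):
            return name
    return None
```

A match only suggests an integration; the `integration_type` and `items_contributed` values still have to come from the source page.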
4. APIs & Data Services
Machine-readable interfaces for collection data.
Protocol Types:
- OAI-PMH - Open Archives Initiative Protocol for Metadata Harvesting
- SPARQL - RDF query endpoint
- REST API - Custom REST endpoints
- SRU/SRW - Search/Retrieve via URL
- Z39.50 - Legacy library protocol
Required Fields:
data_services:
  - service_name: OAI-PMH Endpoint
    endpoint_url: https://example.org/oai
    protocol: OAI-PMH | SPARQL | REST | SRU | Z39.50
    documentation_url: https://example.org/api-docs
    metadata_formats: [dublin_core, ead, marc21]
    provenance:
      source_url: https://example.org/voor-onderzoekers/api
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
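For an OAI-PMH service, the recorded `endpoint_url` can be probed with the protocol's standard verbs, for example `Identify` or `ListMetadataFormats`, to confirm the endpoint and fill in `metadata_formats`. The sketch below only builds the request URL and performs no network access; the helper name `oai_request_url` is hypothetical.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def oai_request_url(endpoint_url: str, verb: str = "Identify", **params: str) -> str:
    """Build an OAI-PMH GET request URL for the given verb and arguments."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    # urlencode escapes values and keeps the argument order given here.
    return f"{endpoint_url}?{urlencode({'verb': verb, **params})}"


print(oai_request_url("https://example.org/oai", "ListMetadataFormats"))
```

The resulting URL can then be fetched with whichever retrieval agent is in use, and that agent is what belongs in the provenance block.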
Provenance Requirements
Required Fields for ALL Provenance
| Field | Type | Required | Description |
|---|---|---|---|
| source_url | string | YES | URL where information was found |
| retrieved_on | datetime | YES | ISO 8601 timestamp |
| retrieval_agent | enum | YES | Tool used for extraction |
| xpath | string | RECOMMENDED | XPath to source element |
| html_file | string | RECOMMENDED | Path to archived HTML |
Retrieval Agent Values
| Value | Description | When to Use |
|---|---|---|
| firecrawl | FireCrawl MCP tools | Primary - most web pages |
| playwright | Playwright browser automation | JavaScript-heavy sites |
| exa | Exa web search/crawl | When FireCrawl unavailable |
| manual | Manual inspection | Last resort |
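The two tables above translate fairly directly into a check on a single provenance block. The following is a minimal sketch rather than the project's validator: the function name `check_provenance` is hypothetical, and `datetime.fromisoformat` accepts only a subset of ISO 8601, so treat the timestamp check as approximate.

```python
from datetime import datetime

REQUIRED_FIELDS = ("source_url", "retrieved_on", "retrieval_agent")
RECOMMENDED_FIELDS = ("xpath", "html_file")
RETRIEVAL_AGENTS = {"firecrawl", "playwright", "exa", "manual"}


def check_provenance(prov: dict) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for one provenance block."""
    errors, warnings = [], []
    for field in REQUIRED_FIELDS:
        if not prov.get(field):
            errors.append(f"missing required field: {field}")
    agent = prov.get("retrieval_agent")
    if agent and agent not in RETRIEVAL_AGENTS:
        errors.append(f"unknown retrieval_agent: {agent}")
    stamp = prov.get("retrieved_on")
    if stamp:
        try:
            # Older Python versions reject a trailing "Z", so normalise it first.
            datetime.fromisoformat(str(stamp).replace("Z", "+00:00"))
        except ValueError:
            errors.append(f"retrieved_on is not ISO 8601: {stamp}")
    for field in RECOMMENDED_FIELDS:
        if not prov.get(field):
            warnings.append(f"missing recommended field: {field}")
    return errors, warnings
```

Missing RECOMMENDED fields are reported as warnings rather than errors, mirroring the Required column.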
Discovery Summary Block
Every custodian with digital platform data MUST include:
digital_platform_discovery_summary:
  discovery_metadata:
    retrieval_agent: firecrawl
    retrieval_timestamp: "2025-01-15T10:30:00Z"
    source_url: https://www.example.nl/onderzoeken
    xpath_base: /html/body/main/section[2]
    html_file: web/GHCID/example.nl/rendered.html
  platforms_discovered: 7
  total_items_indexed: 545393
  cms_identified: true
  external_integrations_count: 2
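The count fields in the summary can be derived from the other sections instead of being typed by hand. The sketch below is one possible reading, with two assumptions the rule does not state: `platforms_discovered` counts the auxiliary platforms, and `total_items_indexed` sums their `items_indexed` values only. The function name `summarize_discovery` is hypothetical.

```python
def summarize_discovery(custodian: dict) -> dict:
    """Derive the summary counts from an already-loaded custodian mapping."""
    platforms = custodian.get("auxiliary_digital_platforms") or []
    integrations = custodian.get("external_platform_integrations") or []
    cms = custodian.get("collection_management_system") or {}
    return {
        # Assumption: only auxiliary platforms count as "discovered".
        "platforms_discovered": len(platforms),
        # Assumption: missing or null item counts contribute zero.
        "total_items_indexed": sum(p.get("items_indexed") or 0 for p in platforms),
        "cms_identified": bool(cms.get("system_name")),
        "external_integrations_count": len(integrations),
    }
```

If the project defines the totals differently, for example by including `items_contributed`, the two marked lines are the ones to change.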
Discovery Workflow
Step 1: Initial Website Scrape
# Using FireCrawl MCP
firecrawl_firecrawl_scrape(
    url="https://www.example-archive.nl",
    formats=["markdown", "links"]
)
Step 2: Map Site Structure
# Find all URLs on site
firecrawl_firecrawl_map(
    url="https://www.example-archive.nl",
    limit=200
)
Step 3: Target Key Pages
Common page patterns for platform discovery:
- /onderzoeken - Dutch archives
- /collecties - Collections
- /zoeken - Search functionality
- /over-ons - About (may mention CMS)
- /api or /data - Technical services
- /partners - External integrations
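Once Step 2 has produced a list of URLs, the page patterns above can be used to shortlist the pages worth scraping. This is a rough sketch under stated assumptions: the function name `select_key_pages` is hypothetical, and only paths that start with one of the listed patterns are matched, so language-prefixed paths such as /nl/onderzoeken would need an extra rule.

```python
from urllib.parse import urlparse

# Path prefixes taken from the list of common page patterns.
KEY_PAGE_PREFIXES = (
    "/onderzoeken", "/collecties", "/zoeken", "/over-ons",
    "/api", "/data", "/partners",
)


def select_key_pages(urls: list[str]) -> list[str]:
    """Keep URLs whose path equals a key prefix or lies beneath it."""
    selected = []
    for url in urls:
        path = urlparse(url).path.lower().rstrip("/")
        # "/api" matches "/api" and "/api/docs" but not "/apis".
        if any(path == p or path.startswith(p + "/") for p in KEY_PAGE_PREFIXES):
            selected.append(url)
    return selected
```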
Step 4: Extract Platform Information
For each platform discovered:
- Note the platform name
- Copy the URL
- Record XPath to source element
- Count items (if displayed)
- Identify platform type
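Item counts on Dutch sites are often displayed with dots as thousands separators (for example "125.000 foto's"), so the displayed text needs normalising before it is stored as `items_indexed`. The helper below is a heuristic sketch, not part of the rule: `parse_item_count` is a hypothetical name, and it simply takes the first number in the text, so the result should be checked against the page.

```python
import re
from typing import Optional

# A grouped number such as 125.000, 1,234,567 or 450 000, or else a plain run of digits.
_COUNT_RE = re.compile(r"\d{1,3}(?:[.,\s]\d{3})+|\d+")


def parse_item_count(text: str) -> Optional[int]:
    """Return the first item count found in displayed text, or None."""
    match = _COUNT_RE.search(text)
    if match is None:
        return None
    # Strip thousands separators before converting to int.
    return int(re.sub(r"[.,\s]", "", match.group()))
```

When no count is displayed, the function returns None, which matches the "if displayed" condition in the step above.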
Step 5: Archive Source HTML
# Save rendered HTML for provenance
playwright_browser_navigate(url="https://...")
playwright_browser_snapshot(filename="web/GHCID/domain/rendered.html")
Step 6: Update Custodian YAML
Add all sections to the custodian file:
- collection_management_system
- auxiliary_digital_platforms
- external_platform_integrations
- data_services (if applicable)
- digital_platform_discovery_summary
Example: Complete Implementation
See data/custodian/NL-DR-ASS-A-DA.yaml (Drents Archief) for a comprehensive example with:
- 1 Collection Management System (MAIS-Flexis)
- 7 Auxiliary Digital Platforms
- 2 External Integrations (Archieven.nl, Memorix)
- Complete provenance for each platform
- Discovery summary with totals
Tools Reference
FireCrawl MCP Tools
| Tool | Purpose | MCP Name |
|---|---|---|
| Scrape | Extract single page content | firecrawl_firecrawl_scrape |
| Map | Discover all URLs on site | firecrawl_firecrawl_map |
| Search | Web search with scraping | firecrawl_firecrawl_search |
| Crawl | Multi-page extraction | firecrawl_firecrawl_crawl |
Playwright MCP Tools
| Tool | Purpose | MCP Name |
|---|---|---|
| Navigate | Open page in browser | playwright_browser_navigate |
| Snapshot | Capture accessibility tree | playwright_browser_snapshot |
| Screenshot | Visual capture | playwright_browser_take_screenshot |
Validation Checklist
Before marking discovery complete, verify:
- collection_management_system identified (or marked as unknown)
- All public discovery portals documented
- External aggregator integrations listed
- Every platform has provenance block
- retrieval_agent specified for each
- retrieved_on in ISO 8601 format
- source_url is the page where info was found
- digital_platform_discovery_summary present
- Item counts included where visible
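The structural items in this checklist can be checked mechanically; judgement calls such as "all public discovery portals documented" still need a human. The sketch below assumes the custodian YAML has already been loaded into a dict, uses a hypothetical function name `checklist_failures`, and treats the mere presence of the `collection_management_system` key as satisfying the first item, since the rule does not say how "unknown" is encoded.

```python
# Sections whose entries each need their own provenance block.
PLATFORM_SECTIONS = (
    "auxiliary_digital_platforms",
    "external_platform_integrations",
    "data_services",
)
REQUIRED_PROVENANCE = ("source_url", "retrieved_on", "retrieval_agent")


def checklist_failures(custodian: dict) -> list[str]:
    """Return a list of failed structural checks; empty means none failed."""
    failures = []
    if "collection_management_system" not in custodian:
        failures.append("collection_management_system missing")
    if "digital_platform_discovery_summary" not in custodian:
        failures.append("digital_platform_discovery_summary missing")
    for section in PLATFORM_SECTIONS:
        for index, entry in enumerate(custodian.get(section) or []):
            prov = entry.get("provenance") or {}
            for field in REQUIRED_PROVENANCE:
                if not prov.get(field):
                    failures.append(f"{section}[{index}].provenance.{field} missing")
    return failures
```

An empty result means only that the structure is in place, not that the discovery itself is complete.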
Related Rules
- Rule 6: WebObservation claims MUST have XPath provenance
- Rule 22: Custodian YAML is single source of truth
- Rule 5: Never delete enriched data (additive only)
Version History
| Date | Change |
|---|---|
| 2025-01-15 | Initial rule creation |
See Also
- AGENTS.md Rule 25
- docs/DIGITAL_PLATFORM_DISCOVERY_GUIDE.md
- data/custodian/NL-DR-ASS-A-DA.yaml (reference implementation)