# Digital Platform Discovery Rule

## Rule Summary

**Rule 25** in `AGENTS.md`: Every heritage custodian MUST be enriched with digital platform discovery data, with complete provenance tracking.

## Purpose

Digital platform discovery documents how heritage institutions make their collections accessible online. This is critical for:

1. **Collection Accessibility Research** - Understanding which collections are digitized and discoverable
2. **System Integration Planning** - Identifying which platforms to integrate with
3. **Aggregator Mapping** - Tracking which custodians contribute to which aggregators
4. **Technology Assessment** - Understanding the collection management system landscape

## What to Discover

### 1. Collection Management Systems (CMS)

The backend system used for cataloging and managing collections.

**Common Systems**:

- **MAIS-Flexis** - DE REE (Dutch archives)
- **Adlib/Axiell Collections** - Axiell (museums, archives)
- **CollectiveAccess** - Open source
- **ArchivesSpace** - Open source (archives)
- **Koha** - Open source (libraries)
- **Ex Libris Alma** - Libraries
- **TMS (The Museum System)** - Gallery Systems

**Required Fields**:

```yaml
collection_management_system:
  system_name: MAIS-Flexis
  vendor: DE REE Archiefsystemen
  vendor_url: https://www.de-ree.nl/
  version: null  # If determinable
  primary_use: archival_description  # one of: archival_description | collection_management | digital_asset_management
  provenance:
    source_url: https://example.org/about
    xpath: /html/body/main/div[2]/p[3]
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
```

### 2. Auxiliary Digital Platforms

Public-facing discovery tools operated by the institution.

**Platform Types**:

| Type | Description | Examples |
|------|-------------|----------|
| `DISCOVERY_PORTAL` | Search interface for collections | Beeldbank, Archiefstukken |
| `DIGITAL_ARCHIVE` | Digitized document repository | E-depot |
| `IMAGE_DATABASE` | Photograph/image collection | Beeldbank |
| `GENEALOGY_PORTAL` | Family history records | Genealogie, WieWasWie |
| `NEWSPAPER_ARCHIVE` | Historical newspapers | Delpher, Kranten |
| `MAP_COLLECTION` | Historical cartography | Kaarten |
| `AV_ARCHIVE` | Audio/video materials | Film & Geluid |

**Required Fields**:

```yaml
auxiliary_digital_platforms:
  - platform_name: Beeldbank
    platform_url: https://beeldbank.example.nl/
    platform_type: IMAGE_DATABASE
    content_type: images  # one of: images | documents | newspapers | maps | audio | video | mixed
    items_indexed: 125000
    description: "Brief description of content"
    provenance:
      source_url: https://example.org/onderzoeken
      xpath: /html/body/main/section[2]/div[1]/a
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
```

### 3. External Platform Integrations

Aggregators and shared platforms the institution contributes to.

**Common Aggregators**:

| Platform | Scope | URL Pattern |
|----------|-------|-------------|
| **Archieven.nl** | Dutch archives | `archieven.nl/nl/zoeken?mivast=XXX` |
| **Europeana** | European heritage | `europeana.eu/...` |
| **Memorix Maior** | Dutch shared platform | `memorix.nl/...` |
| **Collectie Nederland** | Dutch museums | `collectienederland.nl/...` |
| **DPLA** | US heritage | `dp.la/...` |
| **Delpher** | Dutch newspapers/books | `delpher.nl/...` |
| **WieWasWie** | Dutch genealogy | `wiewaswie.nl/...` |

**Required Fields**:

```yaml
external_platform_integrations:
  - platform_name: Archieven.nl
    integration_type: discovery_aggregator  # one of: discovery_aggregator | data_provider | api_consumer
    integration_url: https://www.archieven.nl/nl/zoeken?mivast=123
    items_contributed: 450000
    sync_frequency: daily  # one of: daily | weekly | monthly | on_demand | unknown
    provenance:
      source_url: https://example.org/onderzoeken
      xpath: /html/body/main/section[3]/div[1]
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
```

### 4. APIs & Data Services

Machine-readable interfaces for collection data.
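
As one illustration of a machine-readable interface: OAI-PMH (one of the protocols listed below) is driven by HTTP requests carrying a `verb` parameter, and `Identify` is the verb that asks a repository to describe itself. The sketch below only builds such a request URL for a candidate endpoint. The helper name `oai_identify_url` is hypothetical, and the endpoint is the placeholder used in this document's examples, not a real service.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def oai_identify_url(endpoint_url: str) -> str:
    """Build an OAI-PMH Identify request URL for a candidate endpoint.

    Hypothetical helper for illustration; it does not send the request.
    """
    parts = urlsplit(endpoint_url)
    query = urlencode({"verb": "Identify"})
    # Replace any existing query string and drop the fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(oai_identify_url("https://example.org/oai"))
# prints: https://example.org/oai?verb=Identify
```

Whether a given custodian exposes such an endpoint still has to be confirmed from the institution's own pages and recorded with provenance, as for any other claim.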

**Protocol Types**:

- **OAI-PMH** - Open Archives Initiative Protocol for Metadata Harvesting
- **SPARQL** - RDF query endpoint
- **REST API** - Custom REST endpoints
- **SRU/SRW** - Search/Retrieve via URL and Search/Retrieve Web Service
- **Z39.50** - Legacy library protocol

**Required Fields**:

```yaml
data_services:
  - service_name: OAI-PMH Endpoint
    endpoint_url: https://example.org/oai
    protocol: OAI-PMH  # one of: OAI-PMH | SPARQL | REST | SRU | Z39.50
    documentation_url: https://example.org/api-docs
    metadata_formats: [dublin_core, ead, marc21]
    provenance:
      source_url: https://example.org/voor-onderzoekers/api
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
```

## Provenance Requirements

### Required Fields for ALL Provenance

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `source_url` | string | YES | URL where information was found |
| `retrieved_on` | datetime | YES | ISO 8601 timestamp |
| `retrieval_agent` | enum | YES | Tool used for extraction |
| `xpath` | string | RECOMMENDED | XPath to source element |
| `html_file` | string | RECOMMENDED | Path to archived HTML |

### Retrieval Agent Values

| Value | Description | When to Use |
|-------|-------------|-------------|
| `firecrawl` | FireCrawl MCP tools | Primary - most web pages |
| `playwright` | Playwright browser automation | JavaScript-heavy sites |
| `exa` | Exa web search/crawl | When FireCrawl is unavailable |
| `manual` | Manual inspection | Last resort |

### Discovery Summary Block

Every custodian with digital platform data MUST include:

```yaml
digital_platform_discovery_summary:
  discovery_metadata:
    retrieval_agent: firecrawl
    retrieval_timestamp: "2025-01-15T10:30:00Z"
    source_url: https://www.example.nl/onderzoeken
    xpath_base: /html/body/main/section[2]
    html_file: web/GHCID/example.nl/rendered.html
  platforms_discovered: 7
  total_items_indexed: 545393
  cms_identified: true
  external_integrations_count: 2
```

## Discovery Workflow

### Step 1: Initial Website Scrape

```python
# FireCrawl MCP tool call (call notation, not a shell command)
firecrawl_firecrawl_scrape(
    url="https://www.example-archive.nl",
    formats=["markdown", "links"]
)
```

### Step 2: Map Site Structure

```python
# Find all URLs on the site (FireCrawl MCP tool call)
firecrawl_firecrawl_map(
    url="https://www.example-archive.nl",
    limit=200
)
```

### Step 3: Target Key Pages

Common page patterns for platform discovery:

- `/onderzoeken` - Research (Dutch archives)
- `/collecties` - Collections
- `/zoeken` - Search functionality
- `/over-ons` - About (may mention the CMS)
- `/api` or `/data` - Technical services
- `/partners` - External integrations

### Step 4: Extract Platform Information

For each platform discovered:

1. Note the platform name
2. Copy the URL
3. Record the XPath to the source element
4. Count items (if displayed)
5. Identify the platform type

### Step 5: Archive Source HTML

```python
# Save rendered HTML for provenance (Playwright MCP tool calls)
playwright_browser_navigate(url="https://...")
playwright_browser_snapshot(filename="web/GHCID/domain/rendered.html")
```

### Step 6: Update Custodian YAML

Add all sections to the custodian file:

- `collection_management_system`
- `auxiliary_digital_platforms`
- `external_platform_integrations`
- `data_services` (if applicable)
- `digital_platform_discovery_summary`

## Example: Complete Implementation

See `data/custodian/NL-DR-ASS-A-DA.yaml` (Drents Archief) for a comprehensive example with:

- 1 Collection Management System (MAIS-Flexis)
- 7 Auxiliary Digital Platforms
- 2 External Integrations (Archieven.nl, Memorix)
- Complete provenance for each platform
- Discovery summary with totals

## Tools Reference

### FireCrawl MCP Tools

| Tool | Purpose | MCP Name |
|------|---------|----------|
| **Scrape** | Extract single page content | `firecrawl_firecrawl_scrape` |
| **Map** | Discover all URLs on site | `firecrawl_firecrawl_map` |
| **Search** | Web search with scraping | `firecrawl_firecrawl_search` |
| **Crawl** | Multi-page extraction | `firecrawl_firecrawl_crawl` |

### Playwright MCP Tools

| Tool | Purpose | MCP Name |
|------|---------|----------|
| **Navigate** | Open page in browser | `playwright_browser_navigate` |
| **Snapshot** | Capture accessibility tree | `playwright_browser_snapshot` |
| **Screenshot** | Visual capture | `playwright_browser_take_screenshot` |

## Validation Checklist

Before marking discovery complete, verify:

- [ ] `collection_management_system` identified (or marked as unknown)
- [ ] All public discovery portals documented
- [ ] External aggregator integrations listed
- [ ] Every platform has a provenance block
- [ ] `retrieval_agent` specified for each provenance block
- [ ] `retrieved_on` in ISO 8601 format
- [ ] `source_url` is the page where the information was found
- [ ] `digital_platform_discovery_summary` present
- [ ] Item counts included where visible

## Related Rules

- **Rule 6**: WebObservation claims MUST have XPath provenance
- **Rule 22**: Custodian YAML is single source of truth
- **Rule 5**: Never delete enriched data (additive only)

## Version History

| Date | Change |
|------|--------|
| 2025-01-15 | Initial rule creation |

## See Also

- `AGENTS.md` Rule 25
- `docs/DIGITAL_PLATFORM_DISCOVERY_GUIDE.md`
- `data/custodian/NL-DR-ASS-A-DA.yaml` (reference implementation)
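
The provenance items in the Validation Checklist lend themselves to an automated check. The following is a minimal sketch, assuming each provenance block has already been loaded into a Python dict (for example by a YAML parser). The function name `check_provenance` and the sample record are hypothetical and not part of the rule; the required fields and allowed agent values are taken from the tables in Provenance Requirements.

```python
from datetime import datetime

# Taken from "Required Fields for ALL Provenance" and "Retrieval Agent Values".
REQUIRED_FIELDS = ("source_url", "retrieved_on", "retrieval_agent")
ALLOWED_AGENTS = {"firecrawl", "playwright", "exa", "manual"}


def check_provenance(provenance: dict) -> list[str]:
    """Return the problems found in one provenance block (empty list if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not provenance.get(field):
            problems.append(f"missing required field: {field}")

    agent = provenance.get("retrieval_agent")
    if agent and agent not in ALLOWED_AGENTS:
        problems.append(f"unknown retrieval_agent: {agent}")

    retrieved_on = provenance.get("retrieved_on")
    if retrieved_on:
        try:
            # fromisoformat() rejects a trailing "Z" before Python 3.11,
            # so normalise it to an explicit UTC offset first.
            datetime.fromisoformat(str(retrieved_on).replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"retrieved_on is not ISO 8601: {retrieved_on}")
    return problems


# Hypothetical sample record mirroring the examples in this document.
sample = {
    "source_url": "https://example.org/onderzoeken",
    "retrieved_on": "2025-01-15T10:30:00Z",
    "retrieval_agent": "firecrawl",
}
print(check_provenance(sample))
```

A check like this covers only the mechanical items (required fields, agent value, timestamp format). Whether `source_url` really is the page where the information was found still needs human or agent review.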