Digital Platform Discovery Guide

A comprehensive guide for discovering and documenting digital platforms used by heritage custodians.

Overview

Digital platform discovery is the process of identifying and cataloging the online systems, discovery portals, and digital integrations that heritage institutions use. The resulting record shows how each institution makes its collections accessible online.

Prerequisites

  • Access to FireCrawl scraping tools
  • Basic understanding of YAML structure
  • Knowledge of common heritage digital platforms

Discovery Workflow

Step 1: Initial Website Mapping

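Use FireCrawl to map the URLs on the custodian's website. The search terms steer the map toward research, collection, search, and about pages, and limit caps the number of URLs returned, so the result is a targeted sample rather than a complete site inventory: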

Tool: firecrawl_firecrawl_map
Parameters:
  url: https://www.example-archive.nl/
  search: "onderzoeken collecties zoeken about over"
  limit: 50

Look for these common page patterns:

  • /onderzoeken or /research - Main research/discovery page
  • /collecties or /collections - Collection information
  • /zoeken or /search - Search interfaces
  • /over-ons or /about - Organization information
  • /contact - Contact and system information
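
The page patterns above can be applied as a simple path filter over the URLs returned by the map step. The following is a minimal sketch, assuming the mapped URLs are available as a plain list of strings. The function name `classify_url` and the category labels are illustrative and are not part of FireCrawl or of this project's schema.

```python
from typing import Optional
from urllib.parse import urlparse

# Path prefixes from the list above, keyed by hypothetical category labels.
PAGE_PATTERNS = {
    "research": ("/onderzoeken", "/research"),
    "collections": ("/collecties", "/collections"),
    "search": ("/zoeken", "/search"),
    "about": ("/over-ons", "/about"),
    "contact": ("/contact",),
}


def classify_url(url: str) -> Optional[str]:
    """Return the category whose prefix matches the URL path, else None."""
    path = urlparse(url).path.lower().rstrip("/")
    for category, prefixes in PAGE_PATTERNS.items():
        for prefix in prefixes:
            # Match the prefix itself or any page nested beneath it.
            if path == prefix or path.startswith(prefix + "/"):
                return category
    return None


mapped_urls = [
    "https://www.example-archive.nl/onderzoeken",
    "https://www.example-archive.nl/collecties/kaarten",
    "https://www.example-archive.nl/nieuws/2025",
]
# Keep only the URLs that look like key pages worth scraping.
key_pages = {u: classify_url(u) for u in mapped_urls if classify_url(u)}
```

Sites that put a language segment first (for example `/nl/onderzoeken`) would not match this sketch and would need the prefix check relaxed.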

Step 2: Scrape Key Pages

Scrape the main pages with relevant content:

Tool: firecrawl_firecrawl_scrape
Parameters:
  url: https://www.example-archive.nl/onderzoeken
  formats: ["markdown", "html"]
  onlyMainContent: true

Step 3: Identify Digital Platforms

Look for these platform categories in the scraped content:

| Category | Examples | What to Look For |
|---|---|---|
| Collection Management System | MAIS-Flexis, Adlib, CollectiveAccess | "Powered by", system names in footer, API references |
| Discovery Portals | Beeldbank, Archiefstukken, Genealogie | Links to search interfaces, item counts |
| External Integrations | Archieven.nl, Europeana, Memorix | Partner logos, integration links, federation mentions |
| APIs & Data Services | OAI-PMH, SPARQL, REST APIs | Developer documentation, endpoint URLs |
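
A first pass over scraped content can be automated by searching for the platform names listed above. This is a rough sketch, assuming the scraped page is available as a single string of markdown or text. The name list, category keys, and function name are illustrative. A hit is only a lead to verify by hand, since a name can appear in a news item without the platform being in use.

```python
import re
from typing import Dict, List

# Platform names taken from the table above; extend as needed.
KNOWN_PLATFORMS = {
    "collection_management_system": ["MAIS-Flexis", "Adlib", "CollectiveAccess"],
    "discovery_portal": ["Beeldbank", "Archiefstukken", "Genealogie"],
    "external_integration": ["Archieven.nl", "Europeana", "Memorix"],
    "api_or_data_service": ["OAI-PMH", "SPARQL", "REST API"],
}


def find_platform_mentions(text: str) -> Dict[str, List[str]]:
    """Return, per category, the known platform names mentioned in text."""
    found: Dict[str, List[str]] = {}
    for category, names in KNOWN_PLATFORMS.items():
        for name in names:
            # Case-insensitive whole-word match; a trailing plural "s" is allowed.
            pattern = r"(?<!\w)" + re.escape(name) + r"s?(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.setdefault(category, []).append(name)
    return found


page_text = (
    "Onze collectie is doorzoekbaar via MAIS-Flexis. "
    "Bekijk ook de Beeldbank en zoek verder op archieven.nl."
)
mentions = find_platform_mentions(page_text)
```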

Step 4: Document Platform Details

For each platform discovered, document:

collection_management_system:
  system_name: MAIS-Flexis
  vendor: DE REE Archiefsystemen
  vendor_url: https://www.de-ree.nl/
  version: null  # If unknown
  primary_use: archival_description
  provenance:
    source_url: https://www.example-archive.nl/over-ons
    xpath: /html/body/main/div[2]/p[3]  # Where info was found
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
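
Building the record in code helps keep the provenance block consistent across custodians. The sketch below assumes the record is assembled as a plain dictionary before being serialized to YAML. The `Provenance` class and the helper names are hypothetical, and only a subset of the fields shown above is included.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass
class Provenance:
    source_url: str
    retrieved_on: str
    retrieval_agent: str
    xpath: Optional[str] = None  # recommended rather than required


def utc_timestamp(moment: Optional[datetime] = None) -> str:
    """Format a timezone-aware moment as ISO 8601 UTC with a 'Z' suffix."""
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def cms_record(system_name: str, vendor: str, provenance: Provenance) -> Dict[str, Any]:
    """Wrap a CMS claim and its provenance in the structure shown above."""
    return {
        "collection_management_system": {
            "system_name": system_name,
            "vendor": vendor,
            "version": None,  # unknown until the site states it
            "provenance": asdict(provenance),
        }
    }


record = cms_record(
    "MAIS-Flexis",
    "DE REE Archiefsystemen",
    Provenance(
        source_url="https://www.example-archive.nl/over-ons",
        retrieved_on=utc_timestamp(datetime(2025, 1, 15, 10, 30, tzinfo=timezone.utc)),
        retrieval_agent="firecrawl",
        xpath="/html/body/main/div[2]/p[3]",
    ),
)
```

Passing a naive `datetime` to `utc_timestamp` would be interpreted as local time by `astimezone`, so timezone-aware values are the safer input.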

Step 5: Calculate Item Counts

When platforms display item counts, document them:

auxiliary_digital_platforms:
  - platform_name: Beeldbank
    platform_url: https://beeldbank.example-archive.nl/
    platform_type: DISCOVERY_PORTAL
    content_type: images
    items_indexed: 125000
    description: "Digitized photographs, maps, and visual materials"
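
Item counts usually appear on the page as display text, and Dutch sites often write thousands with a period ("252.183") where English pages use a comma ("252,183"). The following sketch shows one way to normalize such text into the integer stored in `items_indexed`. The function name is illustrative, and the pattern deliberately ignores decimals and worded amounts such as "1,5 miljoen".

```python
import re
from typing import Optional

# Digits grouped in thousands by "." or ",", or else a plain run of digits.
COUNT_PATTERN = re.compile(r"\d{1,3}(?:[.,]\d{3})+|\d+")


def parse_item_count(text: str) -> Optional[int]:
    """Return the first integer count found in text, or None if there is none."""
    match = COUNT_PATTERN.search(text)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return int(re.sub(r"\D", "", match.group()))


for label in ["252.183 foto's", "252,183 images", "Ruim 7.700 kaarten", "Zoek op naam"]:
    print(label, "->", parse_item_count(label))
```

Because only the first number is returned, the input should be the short label next to the counter rather than a whole page of text.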

Step 6: Document External Integrations

Record each external platform the custodian contributes to or integrates with:

external_platform_integrations:
  - platform_name: Archieven.nl
    integration_type: discovery_aggregator
    integration_url: https://www.archieven.nl/nl/zoeken?mivast=123
    items_contributed: 450000
    sync_frequency: daily

Common Dutch Heritage Platforms

Collection Management Systems

| System | Vendor | Common Users |
|---|---|---|
| MAIS-Flexis | DE REE Archiefsystemen | Regional archives |
| Adlib | Axiell | Museums |
| CollectiveAccess | Whirl-i-Gig | Various |
| ArchivesSpace | Lyrasis | Archives |
| Memorix Maior | Picturae | Archives, libraries |

Discovery Aggregators

| Platform | URL | Description |
|---|---|---|
| Archieven.nl | archieven.nl | Dutch archival finding aids |
| Delpher | delpher.nl | Digitized newspapers, books |
| Europeana | europeana.eu | European cultural heritage |
| Collectie Nederland | collectienederland.nl | Dutch museum collections |
| Geheugen van Nederland | geheugenvannederland.nl | Digital heritage portal |

Regional Platforms

| Platform | Region | Type |
|---|---|---|
| Geheugen van Drenthe | Drenthe | Regional memory |
| Brabants Historisch Informatie Centrum | Noord-Brabant | Regional archives |
| Beeldbank Amsterdam | Amsterdam | Image archives |

Provenance Requirements

Every discovery claim MUST carry provenance. Fields marked YES are mandatory; xpath is recommended but not mandatory:

| Field | Required | Description |
|---|---|---|
| source_url | YES | URL where platform was discovered |
| xpath | RECOMMENDED | XPath to element mentioning platform |
| retrieved_on | YES | ISO 8601 timestamp of discovery |
| retrieval_agent | YES | Tool used (firecrawl, playwright, manual) |
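
These requirements are easy to check mechanically before a record is submitted. The sketch below assumes each provenance block has already been loaded into a dictionary. The function name and message wording are illustrative. It treats a missing required field or an unparseable timestamp as an error, and a missing recommended field as a warning.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple

REQUIRED_FIELDS = ("source_url", "retrieved_on", "retrieval_agent")
RECOMMENDED_FIELDS = ("xpath",)


def validate_provenance(provenance: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for one provenance mapping."""
    errors = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if not provenance.get(name)
    ]
    warnings = [
        f"missing recommended field: {name}"
        for name in RECOMMENDED_FIELDS
        if not provenance.get(name)
    ]

    retrieved_on = provenance.get("retrieved_on")
    if retrieved_on:
        try:
            # fromisoformat() rejects a trailing "Z" before Python 3.11,
            # so rewrite it as an explicit UTC offset first.
            datetime.fromisoformat(str(retrieved_on).replace("Z", "+00:00"))
        except ValueError:
            errors.append(f"retrieved_on is not ISO 8601: {retrieved_on!r}")
    return errors, warnings


errors, warnings = validate_provenance(
    {
        "source_url": "https://www.example-archive.nl/over-ons",
        "retrieved_on": "2025-01-15T10:30:00Z",
        "retrieval_agent": "firecrawl",
    }
)
```

Note that `datetime.fromisoformat` accepts only a subset of ISO 8601 on older Python versions, so unusual but valid timestamps may be reported as errors.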

Output Structure

The final output pairs a digital_platform_discovery_summary block with the detailed platform records as sibling top-level keys:

digital_platform_discovery_summary:
  discovery_metadata:
    retrieval_agent: firecrawl
    retrieval_timestamp: "2025-01-15T10:30:00Z"
    source_url: https://www.example-archive.nl/onderzoeken
    xpath_base: /html/body/main/section[2]
    html_file: web/GHCID/example-archive.nl/rendered.html
  platforms_discovered: 7
  total_items_indexed: 545393
  
collection_management_system:
  system_name: MAIS-Flexis
  vendor: DE REE Archiefsystemen
  # ... full details
  
auxiliary_digital_platforms:
  - platform_name: Beeldbank
    # ... full details
  - platform_name: Archiefstukken
    # ... full details
    
external_platform_integrations:
  - platform_name: Archieven.nl
    # ... full details
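
The two counters in the summary can be derived from the platform records instead of being typed by hand. The sketch below rests on two assumptions that this guide does not spell out: that `platforms_discovered` counts the collection management system plus each auxiliary platform, and that `total_items_indexed` is the sum of the `items_indexed` values. The function name `build_summary` is hypothetical.

```python
from typing import Any, Dict, List, Optional


def build_summary(
    discovery_metadata: Dict[str, Any],
    cms: Optional[Dict[str, Any]],
    auxiliary_platforms: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Derive the summary counters from the platform records."""
    # Assumption: the CMS counts as one discovered platform.
    platforms_discovered = len(auxiliary_platforms) + (1 if cms else 0)
    # Assumption: platforms without a published count contribute nothing.
    total_items_indexed = sum(
        platform.get("items_indexed") or 0 for platform in auxiliary_platforms
    )
    return {
        "digital_platform_discovery_summary": {
            "discovery_metadata": discovery_metadata,
            "platforms_discovered": platforms_discovered,
            "total_items_indexed": total_items_indexed,
        }
    }


summary = build_summary(
    {
        "retrieval_agent": "firecrawl",
        "source_url": "https://www.example-archive.nl/onderzoeken",
    },
    {"system_name": "MAIS-Flexis"},
    [
        {"platform_name": "Beeldbank", "items_indexed": 125000},
        {"platform_name": "Genealogie", "items_indexed": None},
    ],
)
```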

Example: Drents Archief Discovery

The Drents Archief (NL-DR-ASS-A-DA) provides a comprehensive example:

Discovered Platforms

  1. Collection Management: MAIS-Flexis by DE REE Archiefsystemen
  2. Beeldbank: 252,183 images
  3. Archiefstukken: 1,155 finding aids
  4. Genealogie: Person name search
  5. Kaarten: 7,700 maps
  6. Kranten: 276,852 newspaper pages
  7. Film en Geluid: 8,658 audiovisual items
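
The counts above can be cross-checked against the total_items_indexed value of 545393 shown under Output Structure. A short arithmetic sketch follows. Summing all five published counts gives 546548, while 545393 is the sum without the 1,155 finding aids. This guide does not say whether finding aids are meant to be left out of the total, so treat the match as an observation rather than a rule.

```python
# Item counts as listed under "Discovered Platforms" above.
drents_counts = {
    "Beeldbank": 252_183,
    "Archiefstukken": 1_155,
    "Kaarten": 7_700,
    "Kranten": 276_852,
    "Film en Geluid": 8_658,
}

total_all = sum(drents_counts.values())
total_without_finding_aids = total_all - drents_counts["Archiefstukken"]

print(total_all)                   # 546548
print(total_without_finding_aids)  # 545393
```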

External Integrations

  • Archieven.nl (discovery aggregator)
  • Memorix (digital asset management)
  • Archives Portal Europe

Quality Checklist

Before submitting a digital platform discovery, check that:

  • All platforms have source URLs
  • Item counts are documented where available
  • Collection management system identified (if known)
  • External integrations listed
  • Provenance timestamps included
  • XPath references provided for key claims

Tools Reference

| Tool | MCP Name | Best For |
|---|---|---|
| FireCrawl Map | firecrawl_firecrawl_map | Discovering all URLs |
| FireCrawl Scrape | firecrawl_firecrawl_scrape | Extracting page content |
| FireCrawl Search | firecrawl_firecrawl_search | Finding specific platforms |
| Playwright Snapshot | playwright_browser_snapshot | JavaScript-heavy pages |

Version: 1.0
Last Updated: 2025-01-15
Author: GLAM Data Extraction Project