# Digital Platform Discovery Rule

## Rule Summary

**Rule 25** in `AGENTS.md`: Every heritage custodian MUST be enriched with digital platform discovery data, including complete provenance tracking.

## Purpose

Digital platform discovery documents how heritage institutions make their collections accessible online. This is critical for:

1. **Collection Accessibility Research** - Understanding which collections are digitized and discoverable
2. **System Integration Planning** - Identifying which platforms to integrate with
3. **Aggregator Mapping** - Tracking which custodians contribute to which aggregators
4. **Technology Assessment** - Understanding the collection management system landscape

## What to Discover

### 1. Collection Management Systems (CMS)

The backend system used for cataloging and managing collections.

**Common Systems**:
- **MAIS-Flexis** - DE REE (Dutch archives)
- **Adlib/Axiell Collections** - Axiell (museums, archives)
- **CollectiveAccess** - Open source
- **ArchivesSpace** - Open source (archives)
- **Koha** - Open source (libraries)
- **Ex Libris Alma** - Libraries
- **TMS (The Museum System)** - Gallery Systems

**Required Fields**:
```yaml
collection_management_system:
  system_name: MAIS-Flexis
  vendor: DE REE Archiefsystemen
  vendor_url: https://www.de-ree.nl/
  version: null  # If determinable
  primary_use: archival_description  # One of: archival_description | collection_management | digital_asset_management
  provenance:
    source_url: https://example.org/about
    xpath: /html/body/main/div[2]/p[3]
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
```
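
One way to spot a CMS mention in scraped page text is a simple name scan against the systems listed above. The sketch below is illustrative only: `KNOWN_SYSTEMS` and `find_cms_mentions` are hypothetical names, the vendor is `None` wherever the list names no vendor, and a name match still needs manual confirmation before it is recorded.

```python
import re

# System names and vendors copied from the "Common Systems" list.
# None means the list does not name a vendor for that system.
KNOWN_SYSTEMS = {
    "MAIS-Flexis": "DE REE",
    "Adlib": "Axiell",
    "Axiell Collections": "Axiell",
    "CollectiveAccess": None,
    "ArchivesSpace": None,
    "Koha": None,
    "Ex Libris Alma": None,
    "The Museum System": "Gallery Systems",
}


def find_cms_mentions(page_text: str) -> list:
    """Return candidate CMS records for every known system named in the text."""
    candidates = []
    for name, vendor in KNOWN_SYSTEMS.items():
        # Word boundaries avoid matching a name inside a longer word.
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, page_text, flags=re.IGNORECASE):
            candidates.append({"system_name": name, "vendor": vendor})
    return candidates
```

A hit is only a candidate: the `provenance` block still has to point at the page and element where the mention was found.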

### 2. Auxiliary Digital Platforms

Public-facing discovery tools operated by the institution.

**Platform Types**:
| Type | Description | Examples |
|------|-------------|----------|
| `DISCOVERY_PORTAL` | Search interface for collections | Beeldbank, Archiefstukken |
| `DIGITAL_ARCHIVE` | Digitized document repository | E-depot |
| `IMAGE_DATABASE` | Photograph/image collection | Beeldbank |
| `GENEALOGY_PORTAL` | Family history records | Genealogie, WieWasWie |
| `NEWSPAPER_ARCHIVE` | Historical newspapers | Delpher, Kranten |
| `MAP_COLLECTION` | Historical cartography | Kaarten |
| `AV_ARCHIVE` | Audio/video materials | Film & Geluid |

**Required Fields**:
```yaml
auxiliary_digital_platforms:
  - platform_name: Beeldbank
    platform_url: https://beeldbank.example.nl/
    platform_type: IMAGE_DATABASE
    content_type: images  # One of: images | documents | newspapers | maps | audio | video | mixed
    items_indexed: 125000
    description: "Brief description of content"
    provenance:
      source_url: https://example.org/onderzoeken
      xpath: /html/body/main/section[2]/div[1]/a
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
```

### 3. External Platform Integrations

Aggregators and shared platforms the institution contributes to.

**Common Aggregators**:
| Platform | Scope | URL Pattern |
|----------|-------|-------------|
| **Archieven.nl** | Dutch archives | `archieven.nl/nl/zoeken?mivast=XXX` |
| **Europeana** | European heritage | `europeana.eu/...` |
| **Memorix Maior** | Dutch shared platform | `memorix.nl/...` |
| **Collectie Nederland** | Dutch museums | `collectienederland.nl/...` |
| **DPLA** | US heritage | `dp.la/...` |
| **Delpher** | Dutch newspapers/books | `delpher.nl/...` |
| **WieWasWie** | Dutch genealogy | `wiewaswie.nl/...` |

**Required Fields**:
```yaml
external_platform_integrations:
  - platform_name: Archieven.nl
    integration_type: discovery_aggregator  # One of: discovery_aggregator | data_provider | api_consumer
    integration_url: https://www.archieven.nl/nl/zoeken?mivast=123
    items_contributed: 450000
    sync_frequency: unknown  # One of: daily | weekly | monthly | on_demand | unknown
    provenance:
      source_url: https://example.org/onderzoeken
      xpath: /html/body/main/section[3]/div[1]
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
```
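
The URL patterns in the aggregator table can be used to flag outbound links that point at a known aggregator. The following is a minimal sketch, assuming host-name matching is sufficient; `AGGREGATOR_DOMAINS` and `match_aggregator` are hypothetical names, not part of any existing tooling.

```python
from typing import Optional
from urllib.parse import urlparse

# Domains taken from the "URL Pattern" column of the aggregator table.
AGGREGATOR_DOMAINS = {
    "archieven.nl": "Archieven.nl",
    "europeana.eu": "Europeana",
    "memorix.nl": "Memorix Maior",
    "collectienederland.nl": "Collectie Nederland",
    "dp.la": "DPLA",
    "delpher.nl": "Delpher",
    "wiewaswie.nl": "WieWasWie",
}


def match_aggregator(url: str) -> Optional[str]:
    """Return the aggregator name for a URL, or None if the host is not listed."""
    host = (urlparse(url).hostname or "").lower()
    for domain, name in AGGREGATOR_DOMAINS.items():
        # Accept the bare domain and any subdomain such as www.
        if host == domain or host.endswith("." + domain):
            return name
    return None
```

Matching on the host rather than the full URL keeps the check tolerant of query-string variations such as the `mivast` parameter.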

### 4. APIs & Data Services

Machine-readable interfaces for collection data.

**Protocol Types**:
- **OAI-PMH** - Open Archives Initiative Protocol for Metadata Harvesting
- **SPARQL** - RDF query endpoint
- **REST API** - Custom REST endpoints
- **SRU/SRW** - Search/Retrieve via URL
- **Z39.50** - Legacy library protocol

**Required Fields**:
```yaml
data_services:
  - service_name: OAI-PMH Endpoint
    endpoint_url: https://example.org/oai
    protocol: OAI-PMH  # One of: OAI-PMH | SPARQL | REST | SRU | Z39.50
    documentation_url: https://example.org/api-docs
    metadata_formats: [dublin_core, ead, marc21]
    provenance:
      source_url: https://example.org/voor-onderzoekers/api
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
```
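
For an OAI-PMH endpoint, the `metadata_formats` value can usually be confirmed by requesting the `ListMetadataFormats` verb. The sketch below only builds the request URL and performs no network access; `oai_pmh_request_url` is a hypothetical helper, and `https://example.org/oai` is the placeholder endpoint from the YAML above.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

# The six request verbs defined by OAI-PMH 2.0.
OAI_PMH_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def oai_pmh_request_url(endpoint_url: str, verb: str, **params: str) -> str:
    """Build an OAI-PMH request URL for the given endpoint and verb."""
    if verb not in OAI_PMH_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    parts = urlsplit(endpoint_url)
    # Any query string already on the endpoint is replaced, not extended.
    query = urlencode({"verb": verb, **params})
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


identify_url = oai_pmh_request_url("https://example.org/oai", "Identify")
formats_url = oai_pmh_request_url("https://example.org/oai", "ListMetadataFormats")
```

Fetching `identify_url` with whichever retrieval agent is in use returns the repository description; the response of `formats_url` lists the metadata prefixes the endpoint supports.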

## Provenance Requirements

### Required Fields for ALL Provenance

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `source_url` | string | YES | URL where information was found |
| `retrieved_on` | datetime | YES | ISO 8601 timestamp |
| `retrieval_agent` | enum | YES | Tool used for extraction |
| `xpath` | string | RECOMMENDED | XPath to source element |
| `html_file` | string | RECOMMENDED | Path to archived HTML |

### Retrieval Agent Values

| Value | Description | When to Use |
|-------|-------------|-------------|
| `firecrawl` | FireCrawl MCP tools | Primary - most web pages |
| `playwright` | Playwright browser automation | JavaScript-heavy sites |
| `exa` | Exa web search/crawl | When FireCrawl unavailable |
| `manual` | Manual inspection | Last resort |
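
The two tables above translate directly into a small check on a single provenance block. This is a sketch under stated assumptions: the block is already loaded as a Python dict, `check_provenance` is a hypothetical name, and missing RECOMMENDED fields are reported as warnings rather than errors.

```python
from datetime import datetime

REQUIRED_FIELDS = ("source_url", "retrieved_on", "retrieval_agent")
RECOMMENDED_FIELDS = ("xpath", "html_file")
RETRIEVAL_AGENTS = {"firecrawl", "playwright", "exa", "manual"}


def check_provenance(prov: dict) -> tuple:
    """Return (errors, warnings) for one provenance block."""
    errors, warnings = [], []
    for field in REQUIRED_FIELDS:
        if not prov.get(field):
            errors.append(f"missing required field: {field}")
    agent = prov.get("retrieval_agent")
    if agent and agent not in RETRIEVAL_AGENTS:
        errors.append(f"unknown retrieval_agent: {agent}")
    timestamp = prov.get("retrieved_on")
    if timestamp:
        try:
            # Replace a trailing "Z" so older Python versions can parse it.
            datetime.fromisoformat(str(timestamp).replace("Z", "+00:00"))
        except ValueError:
            errors.append(f"retrieved_on is not ISO 8601: {timestamp}")
    for field in RECOMMENDED_FIELDS:
        if not prov.get(field):
            warnings.append(f"missing recommended field: {field}")
    return errors, warnings
```

Keeping errors and warnings separate mirrors the YES / RECOMMENDED split in the field table.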

### Discovery Summary Block

Every custodian with digital platform data MUST include:

```yaml
digital_platform_discovery_summary:
  discovery_metadata:
    retrieval_agent: firecrawl
    retrieval_timestamp: "2025-01-15T10:30:00Z"
    source_url: https://www.example.nl/onderzoeken
    xpath_base: /html/body/main/section[2]
    html_file: web/GHCID/example.nl/rendered.html
  platforms_discovered: 7
  total_items_indexed: 545393
  cms_identified: true
  external_integrations_count: 2
```
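
The summary counts can be derived from the other sections instead of being typed by hand. The rule does not spell out how each total is computed, so the sketch below makes explicit assumptions: `platforms_discovered` counts the entries in `auxiliary_digital_platforms`, `total_items_indexed` sums their `items_indexed` values, and `cms_identified` is true when a `system_name` is present. `build_discovery_summary` is a hypothetical name.

```python
def build_discovery_summary(custodian: dict, discovery_metadata: dict) -> dict:
    """Derive the summary block from a custodian record loaded as a dict."""
    platforms = custodian.get("auxiliary_digital_platforms") or []
    integrations = custodian.get("external_platform_integrations") or []
    cms = custodian.get("collection_management_system") or {}
    return {
        "discovery_metadata": dict(discovery_metadata),
        # Assumption: only auxiliary platforms count as "discovered" platforms.
        "platforms_discovered": len(platforms),
        # Assumption: platforms without a visible item count contribute 0.
        "total_items_indexed": sum(p.get("items_indexed") or 0 for p in platforms),
        "cms_identified": bool(cms.get("system_name")),
        "external_integrations_count": len(integrations),
    }
```

If the project counts platforms differently, only the marked lines need to change.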

## Discovery Workflow

### Step 1: Initial Website Scrape

```python
# Using FireCrawl MCP (tool call, not a shell command)
firecrawl_firecrawl_scrape(
    url="https://www.example-archive.nl",
    formats=["markdown", "links"]
)
```

### Step 2: Map Site Structure

```python
# Find all URLs on site (FireCrawl MCP tool call)
firecrawl_firecrawl_map(
    url="https://www.example-archive.nl",
    limit=200
)
```

### Step 3: Target Key Pages

Common page patterns for platform discovery:
- `/onderzoeken` - Dutch archives
- `/collecties` - Collections
- `/zoeken` - Search functionality
- `/over-ons` - About (may mention CMS)
- `/api` or `/data` - Technical services
- `/partners` - External integrations
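
The page patterns above can be applied to the URL list returned by the site map step. The following is a minimal sketch, assuming the patterns appear at the start of the URL path; `KEY_PATH_PREFIXES` and `select_key_pages` are hypothetical names. Sites that put a language segment first (for example `/nl/onderzoeken`) would need a looser match.

```python
from urllib.parse import urlparse

# Path prefixes taken from the list of common page patterns.
KEY_PATH_PREFIXES = (
    "/onderzoeken",
    "/collecties",
    "/zoeken",
    "/over-ons",
    "/api",
    "/data",
    "/partners",
)


def select_key_pages(urls: list) -> list:
    """Keep the URLs whose path equals a key prefix or sits beneath it."""
    selected = []
    for url in urls:
        path = urlparse(url).path.rstrip("/").lower()
        # Require an exact match or a "/" boundary so /database is not
        # mistaken for /data.
        if any(path == p or path.startswith(p + "/") for p in KEY_PATH_PREFIXES):
            selected.append(url)
    return selected
```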

### Step 4: Extract Platform Information

For each platform discovered:
1. Note the platform name
2. Copy the URL
3. Record XPath to source element
4. Count items (if displayed)
5. Identify platform type
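
The five items above map onto one entry of `auxiliary_digital_platforms`. As an illustration, the hypothetical helper below assembles such an entry as a Python dict, using the field names from the Required Fields examples in this rule; serializing it to YAML is left to whatever tooling the project already uses.

```python
from datetime import datetime, timezone

# Platform types from the "Platform Types" table in this rule.
PLATFORM_TYPES = {
    "DISCOVERY_PORTAL", "DIGITAL_ARCHIVE", "IMAGE_DATABASE", "GENEALOGY_PORTAL",
    "NEWSPAPER_ARCHIVE", "MAP_COLLECTION", "AV_ARCHIVE",
}


def make_platform_record(platform_name, platform_url, platform_type,
                         source_url, xpath, retrieval_agent="firecrawl",
                         items_indexed=None, retrieved_on=None):
    """Assemble one platform entry with its provenance block."""
    if platform_type not in PLATFORM_TYPES:
        raise ValueError(f"unknown platform_type: {platform_type}")
    if retrieved_on is None:
        # Default to the current UTC time in ISO 8601 with a "Z" suffix.
        retrieved_on = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    record = {
        "platform_name": platform_name,
        "platform_url": platform_url,
        "platform_type": platform_type,
    }
    # Item counts are recorded only when the site displays them.
    if items_indexed is not None:
        record["items_indexed"] = int(items_indexed)
    record["provenance"] = {
        "source_url": source_url,
        "xpath": xpath,
        "retrieved_on": retrieved_on,
        "retrieval_agent": retrieval_agent,
    }
    return record
```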

### Step 5: Archive Source HTML

```python
# Save rendered HTML for provenance (Playwright MCP tool calls)
playwright_browser_navigate(url="https://...")
playwright_browser_snapshot(filename="web/GHCID/domain/rendered.html")
```

### Step 6: Update Custodian YAML

Add all sections to the custodian file:
- `collection_management_system`
- `auxiliary_digital_platforms`
- `external_platform_integrations`
- `data_services` (if applicable)
- `digital_platform_discovery_summary`
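
Because enrichment is additive (Rule 5: never delete enriched data), newly discovered sections should be merged into the existing custodian record rather than written over it. The sketch below shows one possible merge on records loaded as Python dicts; `merge_additive` is a hypothetical name, and matching list entries on `platform_name` or `service_name` is an assumption.

```python
def _entry_name(entry):
    """Return the identifying name of a list entry, or None if it has none."""
    if isinstance(entry, dict):
        return entry.get("platform_name") or entry.get("service_name")
    return None


def merge_additive(existing: dict, discovered: dict) -> dict:
    """Merge discovered sections into a custodian record without removing data."""
    merged = dict(existing)
    for key, new_value in discovered.items():
        old_value = merged.get(key)
        if old_value is None:
            # Section not present yet: add it as discovered.
            merged[key] = new_value
        elif isinstance(old_value, list) and isinstance(new_value, list):
            known = {_entry_name(e) for e in old_value}
            # Unnamed entries are always appended; named ones only if new.
            additions = [
                e for e in new_value
                if _entry_name(e) is None or _entry_name(e) not in known
            ]
            merged[key] = old_value + additions
        # Any other existing value (scalar or mapping) is left untouched.
    return merged
```

Existing entries win over rediscovered ones here; updating a value that has genuinely changed is a separate decision this sketch does not make.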

## Example: Complete Implementation

See `data/custodian/NL-DR-ASS-A-DA.yaml` (Drents Archief) for a comprehensive example with:
- 1 Collection Management System (MAIS-Flexis)
- 7 Auxiliary Digital Platforms
- 2 External Integrations (Archieven.nl, Memorix)
- Complete provenance for each platform
- Discovery summary with totals

## Tools Reference

### FireCrawl MCP Tools

| Tool | Purpose | MCP Name |
|------|---------|----------|
| **Scrape** | Extract single page content | `firecrawl_firecrawl_scrape` |
| **Map** | Discover all URLs on site | `firecrawl_firecrawl_map` |
| **Search** | Web search with scraping | `firecrawl_firecrawl_search` |
| **Crawl** | Multi-page extraction | `firecrawl_firecrawl_crawl` |

### Playwright MCP Tools

| Tool | Purpose | MCP Name |
|------|---------|----------|
| **Navigate** | Open page in browser | `playwright_browser_navigate` |
| **Snapshot** | Capture accessibility tree | `playwright_browser_snapshot` |
| **Screenshot** | Visual capture | `playwright_browser_take_screenshot` |

## Validation Checklist

Before marking discovery complete, verify:

- [ ] `collection_management_system` identified (or marked as unknown)
- [ ] All public discovery portals documented
- [ ] External aggregator integrations listed
- [ ] Every platform has provenance block
- [ ] `retrieval_agent` specified for each
- [ ] `retrieved_on` in ISO 8601 format
- [ ] `source_url` is the page where info was found
- [ ] `digital_platform_discovery_summary` present
- [ ] Item counts included where visible
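
The mechanical parts of this checklist can be automated once a custodian file is loaded as a Python dict. The sketch below covers only section presence and required provenance fields; `checklist_failures` is a hypothetical name, and items that need human judgment (such as whether all public portals were documented) remain manual.

```python
SECTIONS_WITH_ENTRIES = (
    "auxiliary_digital_platforms",
    "external_platform_integrations",
    "data_services",
)
REQUIRED_PROVENANCE = ("source_url", "retrieved_on", "retrieval_agent")


def checklist_failures(custodian: dict) -> list:
    """Return a list of human-readable checklist failures (empty if none)."""
    failures = []
    if "collection_management_system" not in custodian:
        failures.append("collection_management_system not identified or marked as unknown")
    if "digital_platform_discovery_summary" not in custodian:
        failures.append("digital_platform_discovery_summary missing")
    for section in SECTIONS_WITH_ENTRIES:
        for index, entry in enumerate(custodian.get(section) or []):
            prov = entry.get("provenance") or {}
            for field in REQUIRED_PROVENANCE:
                if not prov.get(field):
                    failures.append(f"{section}[{index}]: provenance.{field} missing")
    return failures
```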

## Related Rules

- **Rule 6**: WebObservation claims MUST have XPath provenance
- **Rule 22**: Custodian YAML is single source of truth
- **Rule 5**: Never delete enriched data (additive only)

## Version History

| Date | Change |
|------|--------|
| 2025-01-15 | Initial rule creation |

## See Also

- `AGENTS.md` Rule 25
- `docs/DIGITAL_PLATFORM_DISCOVERY_GUIDE.md`
- `data/custodian/NL-DR-ASS-A-DA.yaml` (reference implementation)