glam/.opencode/DIGITAL_PLATFORM_DISCOVERY_RULE.md
2025-12-14 17:09:55 +01:00

# Digital Platform Discovery Rule
## Rule Summary
**Rule 25** in `AGENTS.md`: Every heritage custodian MUST be enriched with digital platform discovery data that carries complete provenance tracking.
## Purpose
Digital platform discovery documents how heritage institutions make their collections accessible online. This is critical for:
1. **Collection Accessibility Research** - Understanding which collections are digitized and discoverable
2. **System Integration Planning** - Identifying which platforms to integrate with
3. **Aggregator Mapping** - Tracking which custodians contribute to which aggregators
4. **Technology Assessment** - Understanding the collection management system landscape
## What to Discover
### 1. Collection Management Systems (CMS)
The backend system used for cataloging and managing collections.
**Common Systems**:
- **MAIS-Flexis** - DE REE (Dutch archives)
- **Adlib/Axiell Collections** - Axiell (museums, archives)
- **CollectiveAccess** - Open source
- **ArchivesSpace** - Open source (archives)
- **Koha** - Open source (libraries)
- **Ex Libris Alma** - Libraries
- **TMS (The Museum System)** - Gallery Systems
**Required Fields**:
```yaml
collection_management_system:
  system_name: MAIS-Flexis
  vendor: DE REE Archiefsystemen
  vendor_url: https://www.de-ree.nl/
  version: null  # If determinable
  primary_use: archival_description | collection_management | digital_asset_management
  provenance:
    source_url: https://example.org/about
    xpath: /html/body/main/div[2]/p[3]
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
```
### 2. Auxiliary Digital Platforms
Public-facing discovery tools operated by the institution.
**Platform Types**:
| Type | Description | Examples |
|------|-------------|----------|
| `DISCOVERY_PORTAL` | Search interface for collections | Beeldbank, Archiefstukken |
| `DIGITAL_ARCHIVE` | Digitized document repository | E-depot |
| `IMAGE_DATABASE` | Photograph/image collection | Beeldbank |
| `GENEALOGY_PORTAL` | Family history records | Genealogie, WieWasWie |
| `NEWSPAPER_ARCHIVE` | Historical newspapers | Delpher, Kranten |
| `MAP_COLLECTION` | Historical cartography | Kaarten |
| `AV_ARCHIVE` | Audio/video materials | Film & Geluid |
**Required Fields**:
```yaml
auxiliary_digital_platforms:
  - platform_name: Beeldbank
    platform_url: https://beeldbank.example.nl/
    platform_type: IMAGE_DATABASE
    content_type: images | documents | newspapers | maps | audio | video | mixed
    items_indexed: 125000
    description: "Brief description of content"
    provenance:
      source_url: https://example.org/onderzoeken
      xpath: /html/body/main/section[2]/div[1]/a
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
```
### 3. External Platform Integrations
Aggregators and shared platforms the institution contributes to.
**Common Aggregators**:
| Platform | Scope | URL Pattern |
|----------|-------|-------------|
| **Archieven.nl** | Dutch archives | `archieven.nl/nl/zoeken?mivast=XXX` |
| **Europeana** | European heritage | `europeana.eu/...` |
| **Memorix Maior** | Dutch shared platform | `memorix.nl/...` |
| **Collectie Nederland** | Dutch museums | `collectienederland.nl/...` |
| **DPLA** | US heritage | `dp.la/...` |
| **Delpher** | Dutch newspapers/books | `delpher.nl/...` |
| **WieWasWie** | Dutch genealogy | `wiewaswie.nl/...` |
**Required Fields**:
```yaml
external_platform_integrations:
  - platform_name: Archieven.nl
    integration_type: discovery_aggregator | data_provider | api_consumer
    integration_url: https://www.archieven.nl/nl/zoeken?mivast=123
    items_contributed: 450000
    sync_frequency: daily | weekly | monthly | on_demand | unknown
    provenance:
      source_url: https://example.org/onderzoeken
      xpath: /html/body/main/section[3]/div[1]
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
```
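As a rough illustration of how the URL patterns in the aggregator table could be applied, the sketch below maps a link found on a custodian's website to an aggregator name by its hostname. The function name `identify_aggregator` and the lookup table are hypothetical helpers, not part of the rule; only the domains come from the table above.

```python
from typing import Optional
from urllib.parse import urlparse

# Domains taken from the "Common Aggregators" table; the mapping itself is illustrative.
AGGREGATOR_DOMAINS = {
    "archieven.nl": "Archieven.nl",
    "europeana.eu": "Europeana",
    "memorix.nl": "Memorix Maior",
    "collectienederland.nl": "Collectie Nederland",
    "dp.la": "DPLA",
    "delpher.nl": "Delpher",
    "wiewaswie.nl": "WieWasWie",
}


def identify_aggregator(url: str) -> Optional[str]:
    """Return the aggregator name for a URL, or None if its host is not in the table."""
    host = (urlparse(url).hostname or "").lower()
    for domain, name in AGGREGATOR_DOMAINS.items():
        # Match the bare domain and any subdomain (for example www.archieven.nl).
        if host == domain or host.endswith("." + domain):
            return name
    return None
```

A link that resolves to `None` is not necessarily irrelevant; it may be an auxiliary platform operated by the institution itself rather than an external integration.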
### 4. APIs & Data Services
Machine-readable interfaces for collection data.
**Protocol Types**:
- **OAI-PMH** - Open Archives Initiative Protocol for Metadata Harvesting
- **SPARQL** - RDF query endpoint
- **REST API** - Custom REST endpoints
- **SRU/SRW** - Search/Retrieve via URL and Search/Retrieve Web Service
- **Z39.50** - Legacy library protocol
**Required Fields**:
```yaml
data_services:
  - service_name: OAI-PMH Endpoint
    endpoint_url: https://example.org/oai
    protocol: OAI-PMH | SPARQL | REST | SRU | Z39.50
    documentation_url: https://example.org/api-docs
    metadata_formats: [dublin_core, ead, marc21]
    provenance:
      source_url: https://example.org/voor-onderzoekers/api
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
```
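One way to confirm that a suspected `endpoint_url` really speaks OAI-PMH is to request the protocol's `Identify` verb, and `ListMetadataFormats` can help fill in `metadata_formats`. The sketch below only builds the request URL and leaves fetching to whichever retrieval agent is in use. The helper name `oai_pmh_request_url` is an assumption for illustration, and it deliberately discards any query string already present on the endpoint.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def oai_pmh_request_url(endpoint_url: str, verb: str = "Identify") -> str:
    """Build an OAI-PMH request URL by setting the `verb` query argument."""
    parts = urlsplit(endpoint_url)
    query = urlencode({"verb": verb})
    # Existing query and fragment are dropped so only the verb is sent.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


identify_url = oai_pmh_request_url("https://example.org/oai")
formats_url = oai_pmh_request_url("https://example.org/oai", "ListMetadataFormats")
```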
## Provenance Requirements
### Required Fields for ALL Provenance
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `source_url` | string | YES | URL where information was found |
| `retrieved_on` | datetime | YES | ISO 8601 timestamp |
| `retrieval_agent` | enum | YES | Tool used for extraction |
| `xpath` | string | RECOMMENDED | XPath to source element |
| `html_file` | string | RECOMMENDED | Path to archived HTML |
### Retrieval Agent Values
| Value | Description | When to Use |
|-------|-------------|-------------|
| `firecrawl` | FireCrawl MCP tools | Primary - most web pages |
| `playwright` | Playwright browser automation | JavaScript-heavy sites |
| `exa` | Exa web search/crawl | When FireCrawl unavailable |
| `manual` | Manual inspection | Last resort |
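The two tables above can be checked mechanically. The following is a minimal sketch, assuming a provenance block has already been loaded into a plain `dict`; the function name `validate_provenance` and the error messages are hypothetical and not part of the rule.

```python
from datetime import datetime

REQUIRED_FIELDS = ("source_url", "retrieved_on", "retrieval_agent")
RETRIEVAL_AGENTS = {"firecrawl", "playwright", "exa", "manual"}


def validate_provenance(prov: dict) -> list:
    """Return a list of problems found in one provenance block (empty if none)."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not prov.get(field):
            errors.append(f"missing required field: {field}")
    agent = prov.get("retrieval_agent")
    if agent and agent not in RETRIEVAL_AGENTS:
        errors.append(f"unknown retrieval_agent: {agent}")
    timestamp = prov.get("retrieved_on")
    if timestamp:
        try:
            # Older Python versions reject a trailing "Z", so normalise it first.
            datetime.fromisoformat(str(timestamp).replace("Z", "+00:00"))
        except ValueError:
            errors.append(f"retrieved_on is not ISO 8601: {timestamp}")
    return errors
```

The RECOMMENDED fields (`xpath`, `html_file`) are intentionally not enforced here, since their absence is not an error under the table above.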
### Discovery Summary Block
Every custodian with digital platform data MUST include a summary block:
```yaml
digital_platform_discovery_summary:
  discovery_metadata:
    retrieval_agent: firecrawl
    retrieval_timestamp: "2025-01-15T10:30:00Z"
    source_url: https://www.example.nl/onderzoeken
    xpath_base: /html/body/main/section[2]
    html_file: web/GHCID/example.nl/rendered.html
  platforms_discovered: 7
  total_items_indexed: 545393
  cms_identified: true
  external_integrations_count: 2
```
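The counts in the summary block could be derived from the other sections rather than typed by hand. The sketch below rests on two assumptions that the rule does not state explicitly: that `platforms_discovered` counts the entries in `auxiliary_digital_platforms`, and that `total_items_indexed` is the sum of their `items_indexed` values. The function name `build_discovery_summary` is hypothetical.

```python
def build_discovery_summary(custodian: dict, discovery_metadata: dict) -> dict:
    """Derive the summary counts from an already-loaded custodian record."""
    platforms = custodian.get("auxiliary_digital_platforms") or []
    integrations = custodian.get("external_platform_integrations") or []
    cms = custodian.get("collection_management_system") or {}
    return {
        "discovery_metadata": discovery_metadata,
        # Assumption: only auxiliary platforms are counted here.
        "platforms_discovered": len(platforms),
        # Assumption: missing or null item counts contribute zero.
        "total_items_indexed": sum(p.get("items_indexed") or 0 for p in platforms),
        "cms_identified": bool(cms.get("system_name")),
        "external_integrations_count": len(integrations),
    }
```

If the project defines these totals differently, only the two commented lines need to change.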
## Discovery Workflow
### Step 1: Initial Website Scrape
```python
# Using FireCrawl MCP (tool call shown in keyword-argument form, not a shell command)
firecrawl_firecrawl_scrape(
    url="https://www.example-archive.nl",
    formats=["markdown", "links"]
)
```
### Step 2: Map Site Structure
```python
# Find all URLs on site (FireCrawl MCP tool call, not a shell command)
firecrawl_firecrawl_map(
    url="https://www.example-archive.nl",
    limit=200
)
```
### Step 3: Target Key Pages
Common page patterns for platform discovery:
- `/onderzoeken` - Research (common on Dutch archive sites)
- `/collecties` - Collections
- `/zoeken` - Search functionality
- `/over-ons` - About (may mention CMS)
- `/api` or `/data` - Technical services
- `/partners` - External integrations
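The page patterns above can be used to narrow the URL list returned by the site map in Step 2. This is a minimal sketch, assuming the mapped URLs are available as a list of strings; the helper name `select_key_pages` is hypothetical.

```python
from urllib.parse import urlparse

# Path prefixes taken from the list above.
KEY_PATH_PREFIXES = (
    "/onderzoeken", "/collecties", "/zoeken", "/over-ons", "/api", "/data", "/partners",
)


def select_key_pages(urls):
    """Keep URLs whose path equals, or sits below, one of the key prefixes."""
    selected = []
    for url in urls:
        path = urlparse(url).path.lower().rstrip("/")
        if any(path == p or path.startswith(p + "/") for p in KEY_PATH_PREFIXES):
            selected.append(url)
    return selected
```

Matching on whole path segments keeps `/data/oai` but skips unrelated paths such as `/database`.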
### Step 4: Extract Platform Information
For each platform discovered:
1. Note the platform name
2. Copy the URL
3. Record XPath to source element
4. Count items (if displayed)
5. Identify platform type
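The extracted details can be assembled into a provenance block in one place so that the timestamp format stays consistent. The sketch below is one possible helper, not a prescribed implementation; the name `make_provenance` and its parameters are assumptions, and it expects a timezone-aware `datetime` when one is passed in.

```python
from datetime import datetime, timezone


def make_provenance(source_url, retrieval_agent, xpath=None, retrieved_at=None):
    """Build a provenance dict with a UTC ISO 8601 timestamp ending in 'Z'."""
    moment = retrieved_at or datetime.now(timezone.utc)
    provenance = {
        "source_url": source_url,
        "retrieved_on": moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "retrieval_agent": retrieval_agent,
    }
    if xpath:
        # xpath is RECOMMENDED rather than required, so it is added only when known.
        provenance["xpath"] = xpath
    return provenance


example = make_provenance(
    "https://example.org/onderzoeken",
    "firecrawl",
    xpath="/html/body/main/section[2]/div[1]/a",
    retrieved_at=datetime(2025, 1, 15, 10, 30, tzinfo=timezone.utc),
)
```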
### Step 5: Archive Source HTML
```python
# Save rendered HTML for provenance (Playwright MCP tool calls, not shell commands)
playwright_browser_navigate(url="https://...")
playwright_browser_snapshot(filename="web/GHCID/domain/rendered.html")
```
### Step 6: Update Custodian YAML
Add all sections to the custodian file:
- `collection_management_system`
- `auxiliary_digital_platforms`
- `external_platform_integrations`
- `data_services` (if applicable)
- `digital_platform_discovery_summary`
## Example: Complete Implementation
See `data/custodian/NL-DR-ASS-A-DA.yaml` (Drents Archief) for a comprehensive example with:
- 1 Collection Management System (MAIS-Flexis)
- 7 Auxiliary Digital Platforms
- 2 External Integrations (Archieven.nl, Memorix)
- Complete provenance for each platform
- Discovery summary with totals
## Tools Reference
### FireCrawl MCP Tools
| Tool | Purpose | MCP Name |
|------|---------|----------|
| **Scrape** | Extract single page content | `firecrawl_firecrawl_scrape` |
| **Map** | Discover all URLs on site | `firecrawl_firecrawl_map` |
| **Search** | Web search with scraping | `firecrawl_firecrawl_search` |
| **Crawl** | Multi-page extraction | `firecrawl_firecrawl_crawl` |
### Playwright MCP Tools
| Tool | Purpose | MCP Name |
|------|---------|----------|
| **Navigate** | Open page in browser | `playwright_browser_navigate` |
| **Snapshot** | Capture accessibility tree | `playwright_browser_snapshot` |
| **Screenshot** | Visual capture | `playwright_browser_take_screenshot` |
## Validation Checklist
Before marking discovery complete, verify:
- [ ] `collection_management_system` identified (or marked as unknown)
- [ ] All public discovery portals documented
- [ ] External aggregator integrations listed
- [ ] Every platform has provenance block
- [ ] `retrieval_agent` specified for each
- [ ] `retrieved_on` in ISO 8601 format
- [ ] `source_url` is the page where info was found
- [ ] `digital_platform_discovery_summary` present
- [ ] Item counts included where visible
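Several checklist items lend themselves to an automated check. The sketch below covers only the mechanically verifiable ones (required sections present, every platform entry carrying the three required provenance fields) and assumes the custodian YAML has already been loaded into a `dict`. The function name `check_discovery_complete` is hypothetical.

```python
PLATFORM_SECTIONS = (
    "auxiliary_digital_platforms",
    "external_platform_integrations",
    "data_services",
)
REQUIRED_PROVENANCE = ("source_url", "retrieved_on", "retrieval_agent")


def check_discovery_complete(custodian: dict) -> list:
    """Return unmet checklist items for one custodian record (empty if none)."""
    problems = []
    if "collection_management_system" not in custodian:
        problems.append("collection_management_system not identified or marked unknown")
    if "digital_platform_discovery_summary" not in custodian:
        problems.append("digital_platform_discovery_summary missing")
    for section in PLATFORM_SECTIONS:
        for index, entry in enumerate(custodian.get(section) or []):
            provenance = entry.get("provenance") or {}
            for field in REQUIRED_PROVENANCE:
                if not provenance.get(field):
                    problems.append(f"{section}[{index}]: provenance.{field} missing")
    return problems
```

Items that need human judgement, such as whether all public discovery portals were found, still have to be reviewed by hand.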
## Related Rules
- **Rule 6**: WebObservation claims MUST have XPath provenance
- **Rule 22**: Custodian YAML is single source of truth
- **Rule 5**: Never delete enriched data (additive only)
## Version History
| Date | Change |
|------|--------|
| 2025-01-15 | Initial rule creation |
## See Also
- `AGENTS.md` Rule 25
- `docs/DIGITAL_PLATFORM_DISCOVERY_GUIDE.md`
- `data/custodian/NL-DR-ASS-A-DA.yaml` (reference implementation)