# Digital Platform Discovery Rule

## Rule Summary

**Rule 25** in `AGENTS.md`: Every heritage custodian MUST be enriched with digital platform discovery data, and every discovered item MUST carry complete provenance.

## Purpose

Digital platform discovery documents how heritage institutions make their collections accessible online. This is critical for:

1. **Collection Accessibility Research** - Understanding which collections are digitized and discoverable
2. **System Integration Planning** - Identifying which platforms to integrate with
3. **Aggregator Mapping** - Tracking which custodians contribute to which aggregators
4. **Technology Assessment** - Understanding the collection management system landscape

## What to Discover

### 1. Collection Management Systems (CMS)

The backend system used for cataloging and managing collections.

**Common Systems**:
- **MAIS-Flexis** - DE REE (Dutch archives)
- **Adlib/Axiell Collections** - Axiell (museums, archives)
- **CollectiveAccess** - Open source
- **ArchivesSpace** - Open source (archives)
- **Koha** - Open source (libraries)
- **Ex Libris Alma** - Libraries
- **TMS (The Museum System)** - Gallery Systems

**Required Fields**:
```yaml
collection_management_system:
  system_name: MAIS-Flexis
  vendor: DE REE Archiefsystemen
  vendor_url: https://www.de-ree.nl/
  version: null  # If determinable
  primary_use: archival_description  # one of: archival_description | collection_management | digital_asset_management
  provenance:
    source_url: https://example.org/about
    xpath: /html/body/main/div[2]/p[3]
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
```

### 2. Auxiliary Digital Platforms

Public-facing discovery tools operated by the institution.

**Platform Types**:
| Type | Description | Examples |
|------|-------------|----------|
| `DISCOVERY_PORTAL` | Search interface for collections | Beeldbank, Archiefstukken |
| `DIGITAL_ARCHIVE` | Digitized document repository | E-depot |
| `IMAGE_DATABASE` | Photograph/image collection | Beeldbank |
| `GENEALOGY_PORTAL` | Family history records | Genealogie, WieWasWie |
| `NEWSPAPER_ARCHIVE` | Historical newspapers | Delpher, Kranten |
| `MAP_COLLECTION` | Historical cartography | Kaarten |
| `AV_ARCHIVE` | Audio/video materials | Film & Geluid |

**Required Fields**:
```yaml
auxiliary_digital_platforms:
  - platform_name: Beeldbank
    platform_url: https://beeldbank.example.nl/
    platform_type: IMAGE_DATABASE
    content_type: images  # one of: images | documents | newspapers | maps | audio | video | mixed
    items_indexed: 125000
    description: "Brief description of content"
    provenance:
      source_url: https://example.org/onderzoeken
      xpath: /html/body/main/section[2]/div[1]/a
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
```

### 3. External Platform Integrations

Aggregators and shared platforms the institution contributes to.

**Common Aggregators**:
| Platform | Scope | URL Pattern |
|----------|-------|-------------|
| **Archieven.nl** | Dutch archives | `archieven.nl/nl/zoeken?mivast=XXX` |
| **Europeana** | European heritage | `europeana.eu/...` |
| **Memorix Maior** | Dutch shared platform | `memorix.nl/...` |
| **Collectie Nederland** | Dutch museums | `collectienederland.nl/...` |
| **DPLA** | US heritage | `dp.la/...` |
| **Delpher** | Dutch newspapers/books | `delpher.nl/...` |
| **WieWasWie** | Dutch genealogy | `wiewaswie.nl/...` |

**Required Fields**:
```yaml
external_platform_integrations:
  - platform_name: Archieven.nl
    integration_type: discovery_aggregator  # one of: discovery_aggregator | data_provider | api_consumer
    integration_url: https://www.archieven.nl/nl/zoeken?mivast=123
    items_contributed: 450000
    sync_frequency: unknown  # one of: daily | weekly | monthly | on_demand | unknown
    provenance:
      source_url: https://example.org/onderzoeken
      xpath: /html/body/main/section[3]/div[1]
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
```
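
Links found on a custodian's site can be checked against the aggregator table above by hostname. The following is a minimal sketch, not part of the rule itself: the function name `match_aggregator` and the `AGGREGATOR_DOMAINS` mapping are hypothetical, and the domains are taken directly from the URL patterns in the table.

```python
from typing import Optional
from urllib.parse import urlparse

# Hostnames taken from the "Common Aggregators" table above.
AGGREGATOR_DOMAINS = {
    "archieven.nl": "Archieven.nl",
    "europeana.eu": "Europeana",
    "memorix.nl": "Memorix Maior",
    "collectienederland.nl": "Collectie Nederland",
    "dp.la": "DPLA",
    "delpher.nl": "Delpher",
    "wiewaswie.nl": "WieWasWie",
}


def match_aggregator(url: str) -> Optional[str]:
    """Return the aggregator name for a URL, or None if the host is not in the table."""
    host = (urlparse(url).hostname or "").lower()
    for domain, name in AGGREGATOR_DOMAINS.items():
        # Match the bare domain and any subdomain (www., api., ...).
        if host == domain or host.endswith("." + domain):
            return name
    return None
```

A match only suggests a candidate integration; the `integration_type` and item counts still have to be read from the source page and recorded with provenance.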

### 4. APIs & Data Services

Machine-readable interfaces for collection data.

**Protocol Types**:
- **OAI-PMH** - Open Archives Initiative Protocol for Metadata Harvesting
- **SPARQL** - RDF query endpoint
- **REST API** - Custom REST endpoints
- **SRU/SRW** - Search/Retrieve via URL (SRU) and Search/Retrieve Web Service (SRW)
- **Z39.50** - Legacy library protocol

**Required Fields**:
```yaml
data_services:
  - service_name: OAI-PMH Endpoint
    endpoint_url: https://example.org/oai
    protocol: OAI-PMH  # one of: OAI-PMH | SPARQL | REST | SRU | Z39.50
    documentation_url: https://example.org/api-docs
    metadata_formats: [dublin_core, ead, marc21]
    provenance:
      source_url: https://example.org/voor-onderzoekers/api
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl
```
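
For OAI-PMH endpoints, a candidate `endpoint_url` can be probed with the protocol's standard verbs before it is recorded. The sketch below only builds the request URLs and performs no network access; the helper name `oai_pmh_url` is hypothetical, while `Identify`, `ListMetadataFormats`, `ListRecords`, and the `oai_dc` metadata prefix come from the OAI-PMH specification.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def oai_pmh_url(endpoint_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    parts = urlsplit(endpoint_url)
    params = parse_qsl(parts.query)  # keep any parameters already on the endpoint
    params.append(("verb", verb))
    params.extend(arguments.items())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(params), ""))


identify_url = oai_pmh_url("https://example.org/oai", "Identify")
formats_url = oai_pmh_url("https://example.org/oai", "ListMetadataFormats")
records_url = oai_pmh_url("https://example.org/oai", "ListRecords", metadataPrefix="oai_dc")
```

A successful `Identify` response is reasonable evidence that the endpoint is live, and `ListMetadataFormats` can inform the `metadata_formats` field. How OAI-PMH prefixes map onto values such as `dublin_core` is not defined here and would need to be decided separately.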

## Provenance Requirements

### Required Fields for ALL Provenance

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `source_url` | string | YES | URL where information was found |
| `retrieved_on` | datetime | YES | ISO 8601 timestamp |
| `retrieval_agent` | enum | YES | Tool used for extraction |
| `xpath` | string | RECOMMENDED | XPath to source element |
| `html_file` | string | RECOMMENDED | Path to archived HTML |

### Retrieval Agent Values

| Value | Description | When to Use |
|-------|-------------|-------------|
| `firecrawl` | FireCrawl MCP tools | Primary - most web pages |
| `playwright` | Playwright browser automation | JavaScript-heavy sites |
| `exa` | Exa web search/crawl | When FireCrawl is unavailable |
| `manual` | Manual inspection | Last resort |
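
The two tables above translate fairly directly into a check on a single provenance block. This is a minimal sketch under stated assumptions: the function name `check_provenance` is hypothetical, and missing recommended fields are treated as warnings rather than errors.

```python
from datetime import datetime

REQUIRED_FIELDS = ("source_url", "retrieved_on", "retrieval_agent")
RECOMMENDED_FIELDS = ("xpath", "html_file")
RETRIEVAL_AGENTS = {"firecrawl", "playwright", "exa", "manual"}


def check_provenance(provenance: dict) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for one provenance block."""
    errors, warnings = [], []
    for field in REQUIRED_FIELDS:
        if not provenance.get(field):
            errors.append(f"missing required field: {field}")
    for field in RECOMMENDED_FIELDS:
        if not provenance.get(field):
            warnings.append(f"missing recommended field: {field}")

    agent = provenance.get("retrieval_agent")
    if agent and agent not in RETRIEVAL_AGENTS:
        errors.append(f"unknown retrieval_agent: {agent}")

    retrieved_on = provenance.get("retrieved_on")
    if retrieved_on:
        try:
            # fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
            # so normalise it to an explicit UTC offset first.
            datetime.fromisoformat(str(retrieved_on).replace("Z", "+00:00"))
        except ValueError:
            errors.append(f"retrieved_on is not ISO 8601: {retrieved_on}")
    return errors, warnings
```

Applied to the provenance block from the CMS example, this reports no errors and one warning, because `html_file` is absent there.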

### Discovery Summary Block

Every custodian with digital platform data MUST include:

```yaml
digital_platform_discovery_summary:
  discovery_metadata:
    retrieval_agent: firecrawl
    retrieval_timestamp: "2025-01-15T10:30:00Z"
    source_url: https://www.example.nl/onderzoeken
    xpath_base: /html/body/main/section[2]
    html_file: web/GHCID/example.nl/rendered.html
  platforms_discovered: 7
  total_items_indexed: 545393
  cms_identified: true
  external_integrations_count: 2
```
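
The counts in the summary block can be derived from the other sections instead of being typed by hand. The sketch below is illustrative only and rests on assumptions that the rule does not spell out: `platforms_discovered` is taken as the number of auxiliary platforms, and `total_items_indexed` as the sum of their `items_indexed` values. The function name `build_discovery_summary` is hypothetical.

```python
def build_discovery_summary(custodian: dict, discovery_metadata: dict) -> dict:
    """Derive the summary counts from the platform sections of a custodian record."""
    platforms = custodian.get("auxiliary_digital_platforms") or []
    integrations = custodian.get("external_platform_integrations") or []
    cms = custodian.get("collection_management_system") or {}

    return {
        "digital_platform_discovery_summary": {
            "discovery_metadata": dict(discovery_metadata),
            # Assumption: only auxiliary platforms are counted here.
            "platforms_discovered": len(platforms),
            # Assumption: sum of items_indexed; missing or null counts are skipped.
            "total_items_indexed": sum(p.get("items_indexed") or 0 for p in platforms),
            "cms_identified": bool(cms.get("system_name")),
            "external_integrations_count": len(integrations),
        }
    }
```

If the project counts items differently (for example, including `items_contributed` from external integrations), the two marked lines are the ones to adjust.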

## Discovery Workflow

### Step 1: Initial Website Scrape

```python
# FireCrawl MCP tool call (not a shell command)
firecrawl_firecrawl_scrape(
  url="https://www.example-archive.nl",
  formats=["markdown", "links"]
)
```

### Step 2: Map Site Structure

```python
# FireCrawl MCP tool call: find all URLs on the site
firecrawl_firecrawl_map(
  url="https://www.example-archive.nl",
  limit=200
)
```

### Step 3: Target Key Pages

Common page patterns for platform discovery:
- `/onderzoeken` - Research pages (common on Dutch archive sites)
- `/collecties` - Collections
- `/zoeken` - Search functionality
- `/over-ons` - About (may mention CMS)
- `/api` or `/data` - Technical services
- `/partners` - External integrations
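
The URL list returned by the map step can be narrowed to these page patterns before scraping. This is a minimal sketch: the function name `select_key_pages` is hypothetical, and the prefixes are copied from the list above.

```python
from urllib.parse import urlparse

# Path prefixes from the list above.
KEY_PATH_PREFIXES = (
    "/onderzoeken", "/collecties", "/zoeken", "/over-ons", "/api", "/data", "/partners",
)


def select_key_pages(urls: list[str]) -> list[str]:
    """Keep URLs whose path starts with one of the key prefixes, preserving order."""
    selected = []
    for url in urls:
        path = urlparse(url).path.lower().rstrip("/")
        # Match the prefix itself or anything below it, but not e.g. /database for /data.
        if any(path == p or path.startswith(p + "/") for p in KEY_PATH_PREFIXES):
            selected.append(url)
    return selected
```

Sites that put a language segment first (for example `/nl/onderzoeken`) would not match this simple prefix test and would need an extra rule.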

### Step 4: Extract Platform Information

For each platform discovered:
1. Note the platform name
2. Copy the URL
3. Record XPath to source element
4. Count items (if displayed)
5. Identify platform type
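
The five steps above can be sketched as a small helper that assembles one platform entry together with its provenance block. The function name `make_platform_record` and its parameters are hypothetical; the field names follow the `auxiliary_digital_platforms` template earlier in this document.

```python
from datetime import datetime, timezone
from typing import Optional


def make_platform_record(
    platform_name: str,
    platform_url: str,
    platform_type: str,
    source_url: str,
    xpath: str,
    retrieval_agent: str,
    items_indexed: Optional[int] = None,
    retrieved_on: Optional[datetime] = None,
) -> dict:
    """Assemble one auxiliary_digital_platforms entry with its provenance block."""
    # Default to the current time; pass a timezone-aware datetime to override.
    when = retrieved_on or datetime.now(timezone.utc)
    return {
        "platform_name": platform_name,
        "platform_url": platform_url,
        "platform_type": platform_type,
        "items_indexed": items_indexed,  # None when the site shows no count
        "provenance": {
            "source_url": source_url,
            "xpath": xpath,
            # Same shape as the timestamps used in this document, e.g. 2025-01-15T10:30:00Z
            "retrieved_on": when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "retrieval_agent": retrieval_agent,
        },
    }
```

Recording the timestamp at extraction time, and not when the YAML is written, keeps `retrieved_on` tied to the moment the page was actually read.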

### Step 5: Archive Source HTML

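```python
# Playwright MCP tool calls (not shell commands): archive the rendered page for provenance.
# Note: browser_snapshot captures the accessibility tree (see Tools Reference below).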
playwright_browser_navigate(url="https://...")
playwright_browser_snapshot(filename="web/GHCID/domain/rendered.html")
```

### Step 6: Update Custodian YAML

Add all sections to the custodian file:
- `collection_management_system`
- `auxiliary_digital_platforms`
- `external_platform_integrations`
- `data_services` (if applicable)
- `digital_platform_discovery_summary`
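
Because Rule 5 makes enrichment additive only, new discovery data should be merged into the custodian record without overwriting what is already there. The following is a sketch of one possible merge policy, operating on already-parsed dictionaries. The function name `merge_additive` is hypothetical, and treating `null` values as fillable is an assumption.

```python
def merge_additive(existing, discovered):
    """Merge discovered data into existing data without overwriting or deleting anything."""
    if isinstance(existing, dict) and isinstance(discovered, dict):
        merged = dict(existing)
        for key, value in discovered.items():
            if key not in merged or merged[key] is None:
                # New key, or a null placeholder (assumption: nulls may be filled in).
                merged[key] = value
            else:
                merged[key] = merge_additive(merged[key], value)
        return merged
    if isinstance(existing, list) and isinstance(discovered, list):
        # Append entries that are not already present; never drop existing ones.
        return existing + [item for item in discovered if item not in existing]
    # Scalars (or mismatched types): the existing value wins.
    return existing
```

This policy never replaces a non-null value, so genuine corrections to existing fields would need a separate, explicit review step.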

## Example: Complete Implementation

See `data/custodian/NL-DR-ASS-A-DA.yaml` (Drents Archief) for a comprehensive example with:
- 1 Collection Management System (MAIS-Flexis)
- 7 Auxiliary Digital Platforms
- 2 External Integrations (Archieven.nl, Memorix)
- Complete provenance for each platform
- Discovery summary with totals

## Tools Reference

### FireCrawl MCP Tools

| Tool | Purpose | MCP Name |
|------|---------|----------|
| **Scrape** | Extract single page content | `firecrawl_firecrawl_scrape` |
| **Map** | Discover all URLs on site | `firecrawl_firecrawl_map` |
| **Search** | Web search with scraping | `firecrawl_firecrawl_search` |
| **Crawl** | Multi-page extraction | `firecrawl_firecrawl_crawl` |

### Playwright MCP Tools

| Tool | Purpose | MCP Name |
|------|---------|----------|
| **Navigate** | Open page in browser | `playwright_browser_navigate` |
| **Snapshot** | Capture accessibility tree | `playwright_browser_snapshot` |
| **Screenshot** | Visual capture | `playwright_browser_take_screenshot` |

## Validation Checklist

Before marking discovery complete, verify:

- [ ] `collection_management_system` identified (or marked as unknown)
- [ ] All public discovery portals documented
- [ ] External aggregator integrations listed
- [ ] Every platform has a provenance block
- [ ] `retrieval_agent` specified for each
- [ ] `retrieved_on` in ISO 8601 format
- [ ] `source_url` is the page where info was found
- [ ] `digital_platform_discovery_summary` present
- [ ] Item counts included where visible
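
Several of these checks are mechanical and can be run against a parsed custodian record before manual review. This is a minimal sketch, assuming the record is already loaded as a dictionary. The function name `checklist_failures` is hypothetical, and because the rule does not say how an unknown CMS is represented, the sketch only checks that the key is present.

```python
PLATFORM_SECTIONS = (
    "auxiliary_digital_platforms",
    "external_platform_integrations",
    "data_services",
)


def checklist_failures(custodian: dict) -> list[str]:
    """Return the checklist items that a custodian record does not yet satisfy."""
    failures = []
    if "collection_management_system" not in custodian:
        failures.append("collection_management_system missing (record it as unknown if not found)")
    if "digital_platform_discovery_summary" not in custodian:
        failures.append("digital_platform_discovery_summary missing")

    # Collect every entry that should carry a provenance block.
    entries = []
    cms = custodian.get("collection_management_system")
    if isinstance(cms, dict):
        entries.append(("collection_management_system", cms))
    for section in PLATFORM_SECTIONS:
        for index, entry in enumerate(custodian.get(section) or []):
            entries.append((f"{section}[{index}]", entry))

    for label, entry in entries:
        provenance = entry.get("provenance") or {}
        for field in ("source_url", "retrieved_on", "retrieval_agent"):
            if not provenance.get(field):
                failures.append(f"{label}: provenance.{field} missing")
    return failures
```

An empty result covers only the structural items. Whether all public portals were found, and whether `source_url` really is the page where the information appeared, still needs a human check.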

## Related Rules

- **Rule 6**: WebObservation claims MUST have XPath provenance
- **Rule 22**: Custodian YAML is single source of truth
- **Rule 5**: Never delete enriched data (additive only)

## Version History

| Date | Change |
|------|--------|
| 2025-01-15 | Initial rule creation |

## See Also

- `AGENTS.md` Rule 25
- `docs/DIGITAL_PLATFORM_DISCOVERY_GUIDE.md`
- `data/custodian/NL-DR-ASS-A-DA.yaml` (reference implementation)