glam/CZECH_ARON_API_INVESTIGATION.md
2025-11-19 23:25:22 +01:00


Czech Archives ARON API Investigation Results

Date: 2025-11-19
API Base: https://portal.nacr.cz/aron/api/aron

Summary

FOUND: An undocumented REST API behind the ARON portal
WORKS: The API returns JSON data without authentication
⚠️ CHALLENGE: The API returns ALL records (505,884), so results must be filtered down to institutions only

API Endpoints Discovered

1. List All Records

POST https://portal.nacr.cz/aron/api/aron/apu/listview?listType=EVIDENCE-LIST

Request Body:

{
  "size": 100,
  "searchAfter": "[...]"
}

  • size: Records per page (max seems to be 100)
  • searchAfter: Cursor for pagination (optional, taken from the previous response)

Response Structure:

{
  "count": 505884,
  "items": [
    {
      "id": "uuid",
      "name": "Record name",
      "description": "Optional description",
      "order": 1
    }
  ],
  "searchAfter": "[\"next-cursor\"]",
  "aggregations": null
}

  • count: Total records in database
  • items: Array of records
  • searchAfter: Cursor for the next page

Pagination: Uses cursor-based pagination (searchAfter), not page numbers
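
The cursor loop described above can be sketched as follows. This is a minimal illustration, not the code in the scraper: `post_json` and `iter_records` are hypothetical names, only the standard library is used, and the stop condition (an empty page or a missing cursor) is an assumption about how the API signals the last page.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional

LIST_URL = "https://portal.nacr.cz/aron/api/aron/apu/listview?listType=EVIDENCE-LIST"


def post_json(body: dict) -> dict:
    """POST a JSON body to the list endpoint and decode the JSON response."""
    request = urllib.request.Request(
        LIST_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_records(
    fetch_page: Callable[[dict], dict] = post_json,
    size: int = 100,
    max_pages: Optional[int] = None,
) -> Iterator[dict]:
    """Yield every list item, following the searchAfter cursor page by page."""
    cursor = None
    pages = 0
    while max_pages is None or pages < max_pages:
        body = {"size": size}
        if cursor is not None:
            # Cursor copied verbatim from the previous response.
            body["searchAfter"] = cursor
        page = fetch_page(body)
        items = page.get("items") or []
        yield from items
        pages += 1
        cursor = page.get("searchAfter")
        # Assumption: an empty page or a missing cursor marks the end.
        if not items or not cursor:
            break
```

Passing `max_pages=1` gives a cheap smoke test against the live API, and injecting a fake `fetch_page` keeps the loop testable offline.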

2. Get Individual Record Detail

GET https://portal.nacr.cz/aron/api/aron/apu/{uuid}

Response includes:

  • type: Record type (e.g., "INSTITUTION", "FOND", "ORIGINATOR")
  • parts: Array of metadata parts with items
  • name, description, id, etc.

Example metadata items:

  • INST~CODE: Institution code
  • INST~SHORT~NAME: Short name
  • INST~ADDRESS: Address
  • INST~PHONE: Phone
  • INST~EMAIL: Email
  • INST~URL: Website
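
Pulling these codes out of a detail response could look like the sketch below. The exact shape of `parts` is not documented here, so the key names `type` and `value` on each item are assumptions to be checked against a real response; `extract_metadata` and `is_institution` are hypothetical helper names.

```python
from typing import Dict, List


def extract_metadata(detail: dict) -> Dict[str, List[str]]:
    """Collect item values from every part, grouped by item code."""
    metadata: Dict[str, List[str]] = {}
    for part in detail.get("parts") or []:
        for item in part.get("items") or []:
            # Assumption: each item stores its code (e.g. INST~CODE) under
            # "type" and its content under "value".
            code = item.get("type")
            value = item.get("value")
            if code and value is not None:
                metadata.setdefault(code, []).append(str(value))
    return metadata


def is_institution(detail: dict) -> bool:
    """True when the record's top-level type field is INSTITUTION."""
    return detail.get("type") == "INSTITUTION"
```

Values are kept as lists because a record may plausibly carry more than one item with the same code, such as several phone numbers.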

Key Findings

Database Contents

  • Total records: 505,884
  • Record types: Institutions, archival fonds, originators, finding aids, etc.
  • Institutions: Exact count unknown via the API (the web interface suggests ~560); identifying them requires checking type="INSTITUTION" in the detail API

Challenge: Identifying Institutions

The list API does NOT include the type field. This means:

  • Institutions cannot be filtered at the list level
  • The detail endpoint must be fetched for each record to check its type field
  • ⚠️ Identifying all institutions this way requires 505,884 API calls

Web Interface Shows ~560 Institutions

The browser interface at https://portal.nacr.cz/aron/institution shows:

  • 56 pages × 10 per page = ~560 institutions
  • This uses a different filter/query that we haven't found yet

Recommendation: Two Approaches

Approach A: Smart Filtering (RECOMMENDED)

Strategy: Filter by name patterns in list API, then verify with detail API

  1. Phase 1: Scan all 505,884 records in list API

    • Filter by name patterns (contains "archiv", "muzeum", "galerie", etc.)
    • Exclude non-institutions ("inventář", "pomůcka", "fond", "družstvo")
    • Estimated: ~5,000-10,000 candidates
  2. Phase 2: Fetch details for candidates only

    • Check type == "INSTITUTION"
    • Extract full metadata
    • Estimated: ~560 confirmed institutions
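
The Phase 1 name filter can be sketched as a pair of substring lists. The patterns below are only the ones named above (the include list ends in "etc.", so it is incomplete by construction), and `is_candidate` is a hypothetical helper, not necessarily how the scraper implements it.

```python
import unicodedata

# Patterns taken from the list above; extend as more institution types turn up.
INCLUDE_PATTERNS = ("archiv", "muzeum", "galerie")
EXCLUDE_PATTERNS = ("inventář", "pomůcka", "fond", "družstvo")


def is_candidate(name: str) -> bool:
    """Heuristic: keep names that look like institutions, drop known noise."""
    # Normalize so composed and decomposed diacritics compare equal,
    # then casefold for case-insensitive matching.
    folded = unicodedata.normalize("NFC", name).casefold()
    if any(pattern in folded for pattern in EXCLUDE_PATTERNS):
        return False
    return any(pattern in folded for pattern in INCLUDE_PATTERNS)
```

Exclusions are checked first, so a finding aid whose title mentions an archive is still dropped. That ordering trades recall for fewer detail calls, which is the same trade-off listed under Cons.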

Pros:

  • Reduces detail API calls from ~505k to ~10k (the list scan itself adds about 5,060 paged requests at 100 records per page)
  • Faster execution (~2-3 hours)
  • Respects rate limits (2 req/sec)

Cons:

  • May miss institutions with unusual names
  • Name-based filtering is heuristic

Estimated time: 2-3 hours total

Approach B: Complete Scan (THOROUGH)

Strategy: Fetch details for ALL 505,884 records, filter by type

  1. Fetch detail API for every record
  2. Filter where type == "INSTITUTION"
  3. Extract metadata

Pros:

  • 100% coverage
  • No false negatives

Cons:

  • 505,884 API calls
  • ~70 hours at 2 req/sec
  • May hit rate limits
  • Excessive load on server

Estimated time: 70 hours (3 days continuous)
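
The time estimates for both approaches follow from simple arithmetic on the request counts. The sketch below reproduces them under the stated 0.5 s delay, taking 10,000 as the upper end of the candidate estimate and ignoring network latency, so real runs will take somewhat longer.

```python
import math

TOTAL_RECORDS = 505_884
PAGE_SIZE = 100             # records per list request
SECONDS_PER_REQUEST = 0.5   # 2 requests per second
MAX_CANDIDATES = 10_000     # upper end of the Phase 1 estimate


def hours_for(requests: int, delay: float = SECONDS_PER_REQUEST) -> float:
    """Wall-clock hours spent waiting between requests."""
    return requests * delay / 3600


# Approach A: page through the whole list, then fetch details for candidates.
list_calls = math.ceil(TOTAL_RECORDS / PAGE_SIZE)
approach_a_hours = hours_for(list_calls + MAX_CANDIDATES)

# Approach B: one detail request per record.
approach_b_hours = hours_for(TOTAL_RECORDS)

print(f"list calls: {list_calls}")
print(f"Approach A: {approach_a_hours:.1f} h")
print(f"Approach B: {approach_b_hours:.1f} h")
```

Approach B would still need the list scan to collect record IDs, which adds well under an hour on top of the roughly 70 hours of detail requests.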

Implemented Solution

Created: scripts/scrapers/scrape_czech_archives_aron.py

Current implementation: Approach A (smart filtering)

Features:

  • Two-phase scraping (list → filter → details)
  • Cursor-based pagination
  • Rate limiting (0.5s delay = 2 req/sec)
  • Resumable (can interrupt and restart)
  • Progress display
  • LinkML-compliant output
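
As an illustration of the rate limiting and resumability features, here is a minimal sketch. It is not the scraper's actual code: `RateLimiter`, `save_checkpoint`, `load_checkpoint`, and the checkpoint layout (`cursor`, `candidates`) are all hypothetical.

```python
import json
import os
import time
from typing import Callable, Optional


class RateLimiter:
    """Enforce a minimum interval between calls (0.5 s = 2 requests/second)."""

    def __init__(
        self,
        min_interval: float = 0.5,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def save_checkpoint(path: str, state: dict) -> None:
    """Write state atomically so an interrupted run never leaves a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(state, handle, ensure_ascii=False)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh one when no checkpoint exists."""
    if not os.path.exists(path):
        return {"cursor": None, "candidates": []}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `wait()` before each request and saving the latest `searchAfter` cursor after each page would let a restarted run pick up from the last completed page.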

Usage:

cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_czech_archives_aron.py

Output: data/instances/czech_archives_aron.yaml

Alternative: Find Specific Institution Endpoint

The browser interface must use a different endpoint or filter parameter. Possible options:

  1. Check for filter parameter:

    POST .../listview?listType=RECORD~TYPE
    Body: {"recordType": "INSTITUTION"}
    

    Status: Tried, didn't work (still returns 505k)

  2. Check for different listType:

    • listType=EVIDENCE-LIST (current, returns all)
    • listType=RECORD~TYPE (tried, same result)
    • Other types? (need to investigate)
  3. Check browser network tab more carefully:

    • The /aron/institution page makes a POST request
    • Need to capture exact request body with filters
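
Once a candidate request body has been captured, a quick way to tell whether a filter took effect is to compare the returned `count` against the unfiltered total. The sketch below assumes a hypothetical `post(list_type, body)` callable that sends the request and returns the decoded JSON; the two variants listed are only the ones already tried above, and requesting `size: 1` to keep responses small is an assumption about what the API accepts.

```python
from typing import Callable, Dict, List, Tuple

TOTAL_RECORDS = 505_884

# (listType, request body) pairs to try; add captured browser requests here.
VARIANTS: List[Tuple[str, dict]] = [
    ("EVIDENCE-LIST", {"size": 1}),
    ("RECORD~TYPE", {"size": 1, "recordType": "INSTITUTION"}),
]


def probe(
    post: Callable[[str, dict], dict],
    variants: List[Tuple[str, dict]] = VARIANTS,
    total: int = TOTAL_RECORDS,
) -> Dict[str, Tuple[int, bool]]:
    """Map each listType to (count, filtered?) based on the response count."""
    results: Dict[str, Tuple[int, bool]] = {}
    for list_type, body in variants:
        # A missing count is treated as "no filter applied".
        count = post(list_type, body).get("count", total)
        results[list_type] = (count, count < total)
    return results
```

A variant whose count drops to roughly 560 would be a strong sign that the institution filter has been found.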

Next Steps

Option 1: Run Smart Filter Scraper

python3 scripts/scrapers/scrape_czech_archives_aron.py
  • Expected: ~560 institutions in 2-3 hours
  • Safe, respectful of server

Option 2: Investigate Institution-Specific Endpoint

  • Use browser DevTools on /aron/institution page
  • Capture POST request body when loading page
  • Look for filter parameters we missed
  • Try this first before running scraper

Option 3: Contact Národní archiv

  • Email: posta@nacr.cz or arch@mvcr.cz
  • Ask: "Is there an API endpoint for institutions only?"
  • Mention: "We found undocumented API but it returns 505k records"

API Rate Limits

Rate limits are not documented, so the scraper stays conservative:

  • Max 2 requests/second (0.5s delay)
  • Avoid burst requests
  • Monitor for 429 errors
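
One way to react to 429 responses is exponential backoff. The sketch below is an assumption about how this could be handled, not a description of the scraper: `fetch_with_backoff` is a hypothetical helper, and `fetch` is any callable returning a `(status_code, payload)` pair.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, dict]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry on HTTP 429, doubling the wait after each rate-limited attempt."""
    for attempt in range(max_retries + 1):
        status, payload = fetch()
        if status != 429:
            if status >= 400:
                raise RuntimeError(f"request failed with HTTP {status}")
            return payload
        if attempt < max_retries:
            # Waits of 1 s, 2 s, 4 s, ... with the default base_delay.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("still rate limited after all retries")
```

If the server sends a `Retry-After` header with its 429 responses, honoring that value would be preferable to a fixed schedule.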

Data Quality

TIER_1_AUTHORITATIVE:

  • Source: Národní archiv (National Archives)
  • Official national registry
  • Real-time updates by archivists
  • High quality metadata

License: Unknown (likely CC0 like library data)


Status: API discovered, scraper ready, awaiting final decision on approach
Next action: Capture exact browser request to find institution filter, OR run smart filter scraper
Estimated completion: 2-3 hours for scraping