kempersc 3c80de87e0 add isil entries

2025-11-19 23:25:22 +01:00

Czech Archives ARON API Investigation Results

Date: 2025-11-19
API Base: https://portal.nacr.cz/aron/api/aron

Summary

✅ FOUND: Undocumented REST API for ARON portal
✅ WORKS: API returns JSON data without authentication
⚠️ CHALLENGE: The list endpoint returns ALL records (505,884), so institutions have to be filtered out separately

API Endpoints Discovered

1. List All Records

POST https://portal.nacr.cz/aron/api/aron/apu/listview?listType=EVIDENCE-LIST

Request Body:

{
  "size": 100,           # Records per page (max seems to be 100)
  "searchAfter": "[...]" # Cursor for pagination (optional, from previous response)
}

Response Structure:

{
  "count": 505884,                    # Total records in database
  "items": [                          # Array of records
    {
      "id": "uuid",
      "name": "Record name",
      "description": "Optional description",
      "order": 1
    }
  ],
  "searchAfter": "[\"next-cursor\"]", # Cursor for next page
  "aggregations": null
}

Pagination: Uses cursor-based pagination (searchAfter), not page numbers
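
The cursor loop described above can be sketched as follows. This is a minimal illustration, not the code in the actual scraper: the Content-Type header and the stop condition (an empty items array or a missing searchAfter) are assumptions, since the API is undocumented. The fetch parameter is a hypothetical hook so the loop can be exercised without touching the server.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional

LIST_URL = "https://portal.nacr.cz/aron/api/aron/apu/listview?listType=EVIDENCE-LIST"


def post_page(body: dict) -> dict:
    """POST one list request and decode the JSON response."""
    request = urllib.request.Request(
        LIST_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed to be required
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_records(fetch: Callable[[dict], dict] = post_page,
                 size: int = 100) -> Iterator[dict]:
    """Yield records page by page, passing each searchAfter cursor back in."""
    cursor: Optional[str] = None
    while True:
        body = {"size": size}
        if cursor is not None:
            body["searchAfter"] = cursor
        page = fetch(body)
        items = page.get("items") or []
        if not items:  # assumed end-of-data signal
            return
        yield from items
        cursor = page.get("searchAfter")
        if not cursor:  # assumed: no cursor means no further pages
            return
```

Calling iter_records() with no arguments would walk the live endpoint, so any real run should add a delay between pages.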

2. Get Individual Record Detail

GET https://portal.nacr.cz/aron/api/aron/apu/{uuid}

Response includes:

type: Record type (e.g., "INSTITUTION", "FOND", "ORIGINATOR")
parts: Array of metadata parts with items
name, description, id, etc.

Example metadata items:

INST~CODE: Institution code
INST~SHORT~NAME: Short name
INST~ADDRESS: Address
INST~PHONE: Phone
INST~EMAIL: Email
INST~URL: Website
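
A hedged sketch of pulling these items out of a detail response. The notes above only establish that parts is an array of parts containing items; the per-item field names "type" and "value" used here are assumptions and should be checked against a real response before relying on them.

```python
from typing import Any, Dict, List


def is_institution(detail: Dict[str, Any]) -> bool:
    """True when the detail record carries type == "INSTITUTION"."""
    return detail.get("type") == "INSTITUTION"


def extract_metadata(detail: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group item values from every part by item type, e.g. "INST~CODE"."""
    metadata: Dict[str, List[str]] = {}
    for part in detail.get("parts") or []:
        for item in part.get("items") or []:
            key = item.get("type")     # ASSUMED field name
            value = item.get("value")  # ASSUMED field name
            if key is None or value is None:
                continue
            metadata.setdefault(key, []).append(str(value))
    return metadata
```

Values are kept as lists because nothing in the findings rules out repeated items, such as several phone numbers.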

Key Findings

Database Contents

Total records: 505,884
Record types: Institutions, archival fonds, originators, finding aids, etc.
Institutions: Unknown count (need to filter by type="INSTITUTION" from detail API)

Challenge: Identifying Institutions

The list API does NOT include the type field. This means:

❌ Cannot filter institutions at list level
✅ Must fetch detail for each record to check type field
⚠️ 505,884 API calls required to identify all institutions

Web Interface Shows ~560 Institutions

The browser interface at https://portal.nacr.cz/aron/institution shows:

56 pages × 10 per page = ~560 institutions
This uses a different filter/query that we haven't found yet

Recommendation: Two Approaches

Approach A: Smart Filtering (RECOMMENDED) ⭐

Strategy: Filter by name patterns in list API, then verify with detail API

Phase 1: Scan all 505,884 records in list API
- Filter by name patterns (contains "archiv", "muzeum", "galerie", etc.)
- Exclude non-institutions ("inventář", "pomůcka", "fond", "družstvo")
- Estimated: ~5,000-10,000 candidates
Phase 2: Fetch details for candidates only
- Check type == "INSTITUTION"
- Extract full metadata
- Estimated: ~560 confirmed institutions
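
The Phase 1 name filter could look roughly like this. The term lists contain only the examples named above (the "etc." is not filled in), and the function name is hypothetical. Exclusions are checked first, so a finding aid that mentions an archive is still dropped.

```python
INCLUDE_TERMS = ("archiv", "muzeum", "galerie")
EXCLUDE_TERMS = ("inventář", "pomůcka", "fond", "družstvo")


def is_candidate(name: str) -> bool:
    """Heuristic: keep names that look like institutions, drop known non-institutions."""
    lowered = name.casefold()
    if any(term in lowered for term in EXCLUDE_TERMS):
        return False
    return any(term in lowered for term in INCLUDE_TERMS)
```

Substring matching is deliberately loose: "archiv" also matches inflected forms such as "archivu", which widens the candidate set that Phase 2 then verifies.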

Pros:

Reduces detail API calls from ~505k to ~5k-10k (plus ~5,059 list requests at 100 records per page)
Faster execution (~2-3 hours)
Stays within the self-imposed limit of 2 req/sec

Cons:

May miss institutions with unusual names
Name-based filtering is heuristic

Estimated time: 2-3 hours total

Approach B: Complete Scan (THOROUGH)

Strategy: Fetch details for ALL 505,884 records, filter by type

Fetch detail API for every record
Filter where type == "INSTITUTION"
Extract metadata

Pros:

100% coverage
No false negatives

Cons:

505,884 API calls
~70 hours at 2 req/sec
May hit rate limits
Excessive load on server

Estimated time: 70 hours (3 days continuous)
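
The two time estimates follow from the request counts and the 0.5 s delay. The sketch below only reproduces that arithmetic: it ignores network latency, and for Approach A it assumes the upper bound of 10,000 candidates.

```python
import math

TOTAL_RECORDS = 505_884
PAGE_SIZE = 100
DELAY_SECONDS = 0.5  # one request every 0.5 s = 2 req/sec


def hours(requests: int, delay: float = DELAY_SECONDS) -> float:
    """Wall-clock hours spent waiting between requests."""
    return requests * delay / 3600


list_requests = math.ceil(TOTAL_RECORDS / PAGE_SIZE)  # pages needed to list everything
approach_a = list_requests + 10_000                   # list scan + candidate details
approach_b = TOTAL_RECORDS                            # one detail call per record

print(f"Approach A: {approach_a} requests, about {hours(approach_a):.1f} h")
print(f"Approach B: {approach_b} requests, about {hours(approach_b):.1f} h")
```

Delay alone accounts for about 2.1 hours in Approach A, so the 2-3 hour figure leaves room for response times.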

Implemented Solution

Created: scripts/scrapers/scrape_czech_archives_aron.py

Current implementation: Approach A (smart filtering)

Features:

Two-phase scraping (list → filter → details)
Cursor-based pagination
Rate limiting (0.5s delay = 2 req/sec)
Resumable (can interrupt and restart)
Progress display
LinkML-compliant output
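
One way resumability could work is to persist the last cursor after each page. This is an illustrative sketch only: the checkpoint file layout and function names are hypothetical and not taken from scrape_czech_archives_aron.py.

```python
import json
import os
import tempfile
from typing import Any, Dict, Optional


def save_checkpoint(path: str, cursor: Optional[str], records_seen: int) -> None:
    """Write the state to a temp file, then rename it, so an interrupt never leaves a half-written file."""
    state = {"searchAfter": cursor, "records_seen": records_seen}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> Dict[str, Any]:
    """Return the saved state, or a fresh state when no checkpoint exists."""
    if not os.path.exists(path):
        return {"searchAfter": None, "records_seen": 0}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

On restart, the saved searchAfter value would be sent as the cursor of the first list request. That assumes cursors stay valid across sessions, which has not been verified.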

Usage:

cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_czech_archives_aron.py

Output: data/instances/czech_archives_aron.yaml

Alternative: Find Specific Institution Endpoint

The browser interface must use a different endpoint or filter parameter. Possible options:

Check for filter parameter:

POST .../listview?listType=RECORD~TYPE
Body: {"recordType": "INSTITUTION"}

Status: Tried, didn't work (still returns 505k)

Check for different listType:

- listType=EVIDENCE-LIST (current, returns all)
- listType=RECORD~TYPE (tried, same result)
- Other types? (need to investigate)

Check browser network tab more carefully:

- The /aron/institution page makes a POST request
- Need to capture exact request body with filters

Next Steps

Option 1: Run Smart Filter Scraper (RECOMMENDED)

python3 scripts/scrapers/scrape_czech_archives_aron.py

Expected: ~560 institutions in 2-3 hours
Safe, respectful of server

Option 2: Investigate Institution-Specific Endpoint

Use browser DevTools on /aron/institution page
Capture POST request body when loading page
Look for filter parameters we missed
Try this first before running scraper

Option 3: Contact Národní archiv

Email: posta@nacr.cz or arch@mvcr.cz
Ask: "Is there an API endpoint for institutions only?"
Mention: "We found undocumented API but it returns 505k records"

API Rate Limits

Not documented, but being conservative:

Max 2 requests/second (0.5s delay)
Avoid burst requests
Monitor for 429 errors
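
A minimal sketch of this policy: wait before every request and double the wait after each 429. The doubling backoff and the retry cap are assumptions, since the server's real limits are unknown. The fetch callable is a hypothetical hook that returns a (status, payload) pair.

```python
import time
from typing import Any, Callable, Tuple


def fetch_politely(fetch: Callable[[], Tuple[int, Any]],
                   delay: float = 0.5,
                   max_retries: int = 5,
                   sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call fetch() after a pause, backing off exponentially while it returns HTTP 429."""
    wait = delay
    for _ in range(max_retries):
        sleep(wait)  # pausing before every call avoids bursts
        status, payload = fetch()
        if status != 429:
            return payload
        wait *= 2  # rate limited: double the pause before retrying
    raise RuntimeError(f"still rate limited after {max_retries} attempts")
```

Passing sleep as a parameter keeps the function testable without real waiting.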

Data Quality

TIER_1_AUTHORITATIVE:

Source: Národní archiv (National Archives)
Official national registry
Real-time updates by archivists
High quality metadata

License: Unknown (likely CC0 like library data)

Status: API discovered, scraper ready, awaiting final decision on approach
Next action: Capture exact browser request to find institution filter, OR run smart filter scraper
Estimated completion: 2-3 hours for scraping
