Czech Archives ARON API Investigation Results
Date: 2025-11-19
API Base: https://portal.nacr.cz/aron/api/aron
Summary
✅ FOUND: Undocumented REST API for ARON portal
✅ WORKS: API returns JSON data without authentication
⚠️ CHALLENGE: API returns ALL records (505,884), need to filter for institutions only
API Endpoints Discovered
1. List All Records
POST https://portal.nacr.cz/aron/api/aron/apu/listview?listType=EVIDENCE-LIST
Request Body:
{
  "size": 100,             # Records per page (max seems to be 100)
  "searchAfter": "[...]"   # Cursor for pagination (optional, from previous response)
}
Response Structure:
{
  "count": 505884,         # Total records in database
  "items": [               # Array of records
    {
      "id": "uuid",
      "name": "Record name",
      "description": "Optional description",
      "order": 1
    }
  ],
  "searchAfter": "[\"next-cursor\"]",   # Cursor for next page
  "aggregations": null
}
Pagination: Uses cursor-based pagination (searchAfter), not page numbers
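The cursor loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in the actual scraper: the helper names `post_list_page` and `iter_records` are hypothetical, the `Content-Type` header is an assumption, and treating an empty `items` array or a missing `searchAfter` as the end of the data is also an assumption, since the API is undocumented.

```python
import json
import urllib.request
from typing import Callable, Iterator

LIST_URL = "https://portal.nacr.cz/aron/api/aron/apu/listview?listType=EVIDENCE-LIST"


def post_list_page(body: dict) -> dict:
    """POST one list request and decode the JSON response (hypothetical helper)."""
    req = urllib.request.Request(
        LIST_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed to be required
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def iter_records(
    fetch_page: Callable[[dict], dict] = post_list_page, size: int = 100
) -> Iterator[dict]:
    """Yield every list record, following the searchAfter cursor page by page."""
    cursor = None
    while True:
        body = {"size": size}
        if cursor is not None:
            body["searchAfter"] = cursor  # cursor copied from the previous response
        page = fetch_page(body)
        items = page.get("items") or []
        if not items:
            return  # assumption: an empty page means the scan is complete
        yield from items
        cursor = page.get("searchAfter")
        if not cursor:
            return  # assumption: no cursor means there is no next page
```

Passing the page fetcher in as a parameter keeps the loop testable without network access and leaves room to wrap the real request with rate limiting.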
2. Get Individual Record Detail
GET https://portal.nacr.cz/aron/api/aron/apu/{uuid}
Response includes:
- type: Record type (e.g., "INSTITUTION", "FOND", "ORIGINATOR")
- parts: Array of metadata parts with items
- name, description, id, etc.
Example metadata items:
- INST~CODE: Institution code
- INST~SHORT~NAME: Short name
- INST~ADDRESS: Address
- INST~PHONE: Phone
- INST~EMAIL: Email
- INST~URL: Website
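A detail response could be reduced to a flat institution record along these lines. This is only a sketch under assumptions: the notes above confirm the top-level `type` and `parts` fields and the `INST~*` item codes, but the exact shape of each item is not recorded, so the `"type"` and `"value"` keys on items are guesses that should be checked against a real response. The function name and output field names are hypothetical.

```python
from typing import Optional

# Maps the item codes listed above to output field names (names chosen here).
INSTITUTION_FIELDS = {
    "INST~CODE": "code",
    "INST~SHORT~NAME": "short_name",
    "INST~ADDRESS": "address",
    "INST~PHONE": "phone",
    "INST~EMAIL": "email",
    "INST~URL": "url",
}


def extract_institution(detail: dict) -> Optional[dict]:
    """Return a flat record for an institution, or None for any other record type."""
    if detail.get("type") != "INSTITUTION":
        return None
    record = {"id": detail.get("id"), "name": detail.get("name")}
    for part in detail.get("parts") or []:
        for item in part.get("items") or []:
            # ASSUMPTION: each item carries its code under "type" and its
            # content under "value"; the real key names are unverified.
            field = INSTITUTION_FIELDS.get(item.get("type"))
            if field and field not in record:
                record[field] = item.get("value")
    return record
```

Keeping only the first value per code is a simplification; an institution with several phone numbers or addresses would need a list instead.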
Key Findings
Database Contents
- Total records: 505,884
- Record types: Institutions, archival fonds, originators, finding aids, etc.
- Institutions: Unknown count (need to filter by type="INSTITUTION" from detail API)
Challenge: Identifying Institutions
The list API does NOT include the type field. This means:
❌ Cannot filter institutions at list level
✅ Must fetch detail for each record to check type field
⚠️ 505,884 API calls required to identify all institutions
Web Interface Shows ~560 Institutions
The browser interface at https://portal.nacr.cz/aron/institution shows:
- 56 pages × 10 per page = ~560 institutions
- This uses a different filter/query that we haven't found yet
Recommendation: Two Approaches
Approach A: Smart Filtering (RECOMMENDED) ⭐
Strategy: Filter by name patterns in list API, then verify with detail API
- Phase 1: Scan all 505,884 records in list API
  - Filter by name patterns (contains "archiv", "muzeum", "galerie", etc.)
  - Exclude non-institutions ("inventář", "pomůcka", "fond", "družstvo")
  - Estimated: ~5,000-10,000 candidates
- Phase 2: Fetch details for candidates only
  - Check type == "INSTITUTION"
  - Extract full metadata
  - Estimated: ~560 confirmed institutions
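The Phase 1 name filter might look something like the sketch below. It uses only the include and exclude terms quoted above (the "etc." in the include list is left unexpanded), applies plain substring matching, and lets exclusions win over inclusions. The function name `is_candidate` is hypothetical, and normalizing to NFC is an assumption about how the API encodes Czech diacritics.

```python
import unicodedata

INCLUDE_TERMS = ("archiv", "muzeum", "galerie")
EXCLUDE_TERMS = ("inventář", "pomůcka", "fond", "družstvo")


def is_candidate(name: str) -> bool:
    """Heuristic: True if the record name looks like an institution name."""
    # Normalize so that composed and decomposed diacritics compare equal.
    lowered = unicodedata.normalize("NFC", name).casefold()
    if any(term in lowered for term in EXCLUDE_TERMS):
        return False  # exclusions take priority over inclusions
    return any(term in lowered for term in INCLUDE_TERMS)
```

Because this is substring matching, a name such as "Archivní fond ..." is dropped by the "fond" exclusion even though it contains "archiv". That is the intended trade-off here, but it is also how an institution with an unusual name could be missed.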
Pros:
- Reduces detail API calls from 505k to ~10k (the Phase 1 list scan adds about 5,060 paged requests at 100 records each)
- Faster execution (~2-3 hours)
- Respects rate limits (2 req/sec)
Cons:
- May miss institutions with unusual names
- Name-based filtering is heuristic
Estimated time: 2-3 hours total
Approach B: Complete Scan (THOROUGH)
Strategy: Fetch details for ALL 505,884 records, filter by type
- Fetch detail API for every record
- Filter where type == "INSTITUTION"
- Extract metadata
Pros:
- 100% coverage
- No false negatives
Cons:
- 505,884 API calls
- ~70 hours at 2 req/sec
- May hit rate limits
- Excessive load on server
Estimated time: 70 hours (3 days continuous)
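The time estimates for both approaches follow from simple arithmetic on the request counts. The sketch below reproduces them; the function name is hypothetical, and the figure of 10,000 candidates is the upper end of the Phase 1 estimate rather than a measured value.

```python
import math

TOTAL_RECORDS = 505_884
PAGE_SIZE = 100
REQUESTS_PER_SECOND = 2.0


def estimate_hours(requests: int, rate: float = REQUESTS_PER_SECOND) -> float:
    """Wall-clock hours needed to send `requests` calls at `rate` calls per second."""
    return requests / rate / 3600


# Approach B: one detail call per record.
approach_b_hours = estimate_hours(TOTAL_RECORDS)

# Approach A: page through the whole list, then fetch details for candidates only.
list_requests = math.ceil(TOTAL_RECORDS / PAGE_SIZE)
candidate_upper_bound = 10_000  # upper end of the Phase 1 estimate
approach_a_hours = estimate_hours(list_requests + candidate_upper_bound)

print(f"Approach A: about {approach_a_hours:.1f} h, Approach B: about {approach_b_hours:.1f} h")
```

This gives roughly 2.1 hours for Approach A and 70.3 hours for Approach B, before any retries or network latency beyond the fixed delay.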
Implemented Solution
Created: scripts/scrapers/scrape_czech_archives_aron.py
Current implementation: Approach A (smart filtering)
Features:
- Two-phase scraping (list → filter → details)
- Cursor-based pagination
- Rate limiting (0.5s delay = 2 req/sec)
- Resumable (can interrupt and restart)
- Progress display
- LinkML-compliant output
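Resumability with cursor-based pagination usually comes down to persisting the last cursor together with the candidates found so far. The following is a minimal sketch of that idea, not a description of how scrape_czech_archives_aron.py actually does it; the function names and the JSON checkpoint layout are assumptions.

```python
import json
import os
from pathlib import Path
from typing import Optional, Tuple


def save_checkpoint(path: str, cursor: Optional[str], candidates: list) -> None:
    """Write the scan state atomically so an interrupted run never leaves a torn file."""
    tmp = Path(str(path) + ".tmp")
    state = {"searchAfter": cursor, "candidates": candidates}
    tmp.write_text(json.dumps(state, ensure_ascii=False), encoding="utf-8")
    os.replace(tmp, path)  # atomic rename on the same filesystem


def load_checkpoint(path: str) -> Tuple[Optional[str], list]:
    """Return (cursor, candidates); start from scratch if no checkpoint exists."""
    p = Path(path)
    if not p.exists():
        return None, []
    state = json.loads(p.read_text(encoding="utf-8"))
    return state.get("searchAfter"), state.get("candidates", [])
```

On restart, the loaded cursor would be sent as `searchAfter` in the first list request, assuming the server still accepts a cursor issued in an earlier session.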
Usage:
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_czech_archives_aron.py
Output: data/instances/czech_archives_aron.yaml
Alternative: Find Specific Institution Endpoint
The browser interface must use a different endpoint or filter parameter. Possible options:
- Check for filter parameter:
  POST .../listview?listType=RECORD~TYPE
  Body: {"recordType": "INSTITUTION"}
  Status: Tried, didn't work (still returns 505k)
- Check for different listType:
  - listType=EVIDENCE-LIST (current, returns all)
  - listType=RECORD~TYPE (tried, same result)
  - Other types? (need to investigate)
- Check browser network tab more carefully:
  - The /aron/institution page makes a POST request
  - Need to capture exact request body with filters
Next Steps
Option 1: Run Smart Filter Scraper (RECOMMENDED)
python3 scripts/scrapers/scrape_czech_archives_aron.py
- Expected: ~560 institutions in 2-3 hours
- Safe, respectful of server
Option 2: Investigate Institution-Specific Endpoint
- Use browser DevTools on /aron/institution page
- Capture POST request body when loading page
- Look for filter parameters we missed
- Try this first before running scraper
Option 3: Contact Národní archiv
- Email: posta@nacr.cz or arch@mvcr.cz
- Ask: "Is there an API endpoint for institutions only?"
- Mention: "We found undocumented API but it returns 505k records"
API Rate Limits
Not documented, but being conservative:
- Max 2 requests/second (0.5s delay)
- Avoid burst requests
- Monitor for 429 errors
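These three rules can be combined in a small throttle plus retry wrapper, sketched below. The names `RateLimiter` and `fetch_with_backoff` are hypothetical, and the exponential backoff schedule (2 s, 4 s, 8 s, ...) is an assumption, since the server's actual 429 behaviour has not been observed.

```python
import time
from typing import Any, Callable, Tuple


class RateLimiter:
    """Enforce a minimum gap between consecutive requests (0.5 s = 2 req/sec)."""

    def __init__(self, min_interval: float = 0.5,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_with_backoff(do_request: Callable[[], Tuple[int, Any]],
                       limiter: RateLimiter,
                       max_retries: int = 5,
                       base_delay: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> Tuple[int, Any]:
    """Call do_request() -> (status, payload), backing off whenever it returns 429."""
    for attempt in range(max_retries + 1):
        limiter.wait()
        status, payload = do_request()
        if status != 429:
            return status, payload
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # assumed schedule: 2 s, 4 s, 8 s, ...
    return status, payload
```

The clock and sleep functions are injectable so the throttle can be exercised in tests without real delays.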
Data Quality
TIER_1_AUTHORITATIVE:
- Source: Národní archiv (National Archives)
- Official national registry
- Real-time updates by archivists
- High quality metadata
License: Unknown (likely CC0 like library data)
Status: API discovered, scraper ready, awaiting final decision on approach
Next action: Capture exact browser request to find institution filter, OR run smart filter scraper
Estimated completion: 2-3 hours for scraping