# Czech Archives ARON API Investigation Results

**Date**: 2025-11-19

**API Base**: https://portal.nacr.cz/aron/api/aron

## Summary

- ✅ **FOUND**: Undocumented REST API for the ARON portal
- ✅ **WORKS**: API returns JSON data without authentication
- ⚠️ **CHALLENGE**: API returns ALL records (505,884); we need to filter for institutions only

## API Endpoints Discovered

### 1. List All Records

```bash
POST https://portal.nacr.cz/aron/api/aron/apu/listview?listType=EVIDENCE-LIST
```

**Request Body**:

```json
{
  "size": 100,
  "searchAfter": "[...]"
}
```

- `size`: Records per page (maximum seems to be 100)
- `searchAfter`: Cursor for pagination (optional, taken from the previous response)

**Response Structure**:

```json
{
  "count": 505884,
  "items": [
    {
      "id": "uuid",
      "name": "Record name",
      "description": "Optional description",
      "order": 1
    }
  ],
  "searchAfter": "[\"next-cursor\"]",
  "aggregations": null
}
```

- `count`: Total records in the database
- `items`: Array of records
- `searchAfter`: Cursor for the next page

**Pagination**: Cursor-based (`searchAfter`), not page numbers. Send the `searchAfter` value from one response in the body of the next request.

### 2. Get Individual Record Detail

```bash
GET https://portal.nacr.cz/aron/api/aron/apu/{uuid}
```

**Response includes**:

- `type`: Record type (e.g., "INSTITUTION", "FOND", "ORIGINATOR")
- `parts`: Array of metadata parts with items
- `name`, `description`, `id`, etc.

**Example metadata items**:

- `INST~CODE`: Institution code
- `INST~SHORT~NAME`: Short name
- `INST~ADDRESS`: Address
- `INST~PHONE`: Phone
- `INST~EMAIL`: Email
- `INST~URL`: Website

## Key Findings

### Database Contents

- **Total records**: 505,884
- **Record types**: Institutions, archival fonds, originators, finding aids, etc.
- **Institutions**: Unknown count (need to filter by `type="INSTITUTION"` from detail API)

### Challenge: Identifying Institutions

The list API **does NOT include** the `type` field.
This means:

- ❌ **Cannot filter institutions** at list level
- ✅ **Must fetch detail for each record** to check the `type` field
- ⚠️ **505,884 API calls required** to identify all institutions

### Web Interface Shows ~560 Institutions

The browser interface at `https://portal.nacr.cz/aron/institution` shows:

- 56 pages × 10 per page = ~560 institutions
- **This uses a different filter/query** that we haven't found yet

## Recommendation: Two Approaches

### Approach A: Smart Filtering (RECOMMENDED) ⭐

**Strategy**: Filter by name patterns in list API, then verify with detail API

1. **Phase 1**: Scan all 505,884 records in list API
   - Filter by name patterns (contains "archiv", "muzeum", "galerie", etc.)
   - Exclude non-institutions ("inventář", "pomůcka", "fond", "družstvo")
   - Estimated: ~5,000-10,000 candidates
2. **Phase 2**: Fetch details for candidates only
   - Check `type == "INSTITUTION"`
   - Extract full metadata
   - Estimated: ~560 confirmed institutions

**Pros**:

- Reduces detail API calls from ~505k to ~10k (plus about 5,060 list requests to scan all records at 100 per page)
- Faster execution (~2-3 hours)
- Stays within the self-imposed rate limit (2 req/sec)

**Cons**:

- May miss institutions with unusual names
- Name-based filtering is heuristic

**Estimated time**: 2-3 hours total

### Approach B: Complete Scan (THOROUGH)

**Strategy**: Fetch details for ALL 505,884 records, filter by type

1. Fetch detail API for every record
2. Filter where `type == "INSTITUTION"`
3. Extract metadata

**Pros**:

- 100% coverage
- No false negatives

**Cons**:

- 505,884 API calls
- ~70 hours at 2 req/sec (505,884 ÷ 2 ≈ 253,000 seconds)
- May hit rate limits
- Excessive load on server

**Estimated time**: 70 hours (3 days continuous)

## Implemented Solution

Created: `scripts/scrapers/scrape_czech_archives_aron.py`

**Current implementation**: Approach A (smart filtering)

**Features**:

- Two-phase scraping (list → filter → details)
- Cursor-based pagination
- Rate limiting (0.5s delay = 2 req/sec)
- Resumable (can interrupt and restart)
- Progress display
- LinkML-compliant output

**Usage**:

```bash
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_czech_archives_aron.py
```

**Output**: `data/instances/czech_archives_aron.yaml`

## Alternative: Find Specific Institution Endpoint

The browser interface must use a different endpoint or filter parameter. Possible options:

1. **Check for filter parameter**:
   ```bash
   POST .../listview?listType=RECORD~TYPE
   Body: {"recordType": "INSTITUTION"}
   ```
   Status: Tried, didn't work (still returns 505k records)
2. **Check for different listType**:
   - `listType=EVIDENCE-LIST` (current, returns all)
   - `listType=RECORD~TYPE` (tried, same result)
   - Other types? (need to investigate)
3. **Check browser network tab more carefully**:
   - The `/aron/institution` page makes a POST request
   - Need to capture the exact request body with filters

## Next Steps

### Option 1: Run Smart Filter Scraper (RECOMMENDED)

```bash
python3 scripts/scrapers/scrape_czech_archives_aron.py
```

- Expected: ~560 institutions in 2-3 hours
- Safe, respectful of server

### Option 2: Investigate Institution-Specific Endpoint

- Use browser DevTools on the `/aron/institution` page
- Capture the POST request body when loading the page
- Look for filter parameters we missed
- **Try this first, before running the scraper**

### Option 3: Contact Národní archiv

- Email: posta@nacr.cz or arch@mvcr.cz
- Ask: "Is there an API endpoint for institutions only?"
- Mention: "We found an undocumented API, but it returns 505k records"

## API Rate Limits

**Not documented**, but being conservative:

- Max 2 requests/second (0.5s delay)
- Avoid burst requests
- Monitor for 429 errors

## Data Quality

**TIER_1_AUTHORITATIVE**:

- Source: Národní archiv (National Archives)
- Official national registry
- Real-time updates by archivists
- High quality metadata

**License**: Unknown (likely CC0 like library data)

---

**Status**: API discovered, scraper ready, awaiting final decision on approach

**Next action**: Capture the exact browser request to find the institution filter, OR run the smart filter scraper

**Estimated completion**: 2-3 hours for scraping
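
## Sketch: Phase 1 Scan (Approach A)

Phase 1 of Approach A (walk the list API with the `searchAfter` cursor, then keep records whose names look like institutions) can be sketched roughly as follows. This is a minimal illustration, not the code in `scrape_czech_archives_aron.py`. The function names, the `Content-Type` header, and the stop condition (an empty `items` array or a missing cursor) are assumptions, since the API is undocumented. The include and exclude terms are the examples listed under Approach A.

```python
import json
import time
from typing import Callable, Iterator, Optional
from urllib.request import Request, urlopen

LIST_URL = "https://portal.nacr.cz/aron/api/aron/apu/listview?listType=EVIDENCE-LIST"

# Example name patterns from Approach A; a real run would likely need more terms.
INCLUDE = ("archiv", "muzeum", "galerie")
EXCLUDE = ("inventář", "pomůcka", "fond", "družstvo")


def fetch_page(search_after: Optional[str] = None, size: int = 100) -> dict:
    """POST one list request (hypothetical helper; the header is an assumption)."""
    body = {"size": size}
    if search_after:
        body["searchAfter"] = search_after
    request = Request(
        LIST_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_records(
    fetch: Callable[[Optional[str]], dict] = fetch_page,
    delay: float = 0.5,
) -> Iterator[dict]:
    """Yield every list record, following the searchAfter cursor page by page."""
    cursor = None
    while True:
        page = fetch(cursor)
        items = page.get("items") or []
        if not items:
            return  # assumed end-of-data signal
        yield from items
        cursor = page.get("searchAfter")
        if not cursor:
            return  # assumed: no cursor means no further pages
        time.sleep(delay)  # 0.5 s between requests = 2 req/sec


def is_candidate(name: str) -> bool:
    """Heuristic name filter: exclusion terms win over inclusion terms."""
    lowered = name.lower()
    if any(term in lowered for term in EXCLUDE):
        return False
    return any(term in lowered for term in INCLUDE)
```

Passing the page fetcher in as a parameter keeps the pagination loop testable without touching the server. A run against the live API would look like `candidates = [r for r in iter_records() if is_candidate(r.get("name") or "")]`, after which only those candidates go on to the Phase 2 detail requests.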