# Czech Archives - Next Actions

**Quick Reference**: What to do next session

## Context

- ✅ Czech **libraries** harvested: 8,145 institutions from ADR database
- 🔍 Czech **archives** discovered: ~560 institutions in ARON portal
- ❌ No bulk download found yet for archive data

## Recommended Actions (in order)

### 1️⃣ Email Request to Ministry of Interior (DO THIS FIRST)

**To**: arch@mvcr.cz
**Subject**: Request for Czech Archive Institution Registry Export

**Email Template**:

```
Dear Archival Administration,

I am working on a global heritage institution database project and have
successfully integrated data from the Czech National Library's ADR database
(8,145 libraries).

I would like to request a downloadable export of Czech archive institutions
from the "Archiválie na dosah" (ARON) portal at
https://portal.nacr.cz/aron/institution

Could you provide:
1. Complete list of Czech archive institutions in XML/CSV/JSON format
2. Metadata: institution name, type, location, identifiers, website
3. License information (hoping for CC0 like the ADR database)

The ADR library database was publicly available at:
https://aleph.nkp.cz/data/adr.xml.gz (CC0 license)

Is there a similar download for archive institutions from the ARON system?

This data will contribute to a global open heritage institution dataset
for research and discovery purposes.

Thank you for your assistance!
```

### 2️⃣ Search Czech Open Data Portal

**URL**: https://data.gov.cz/datasets

**Search terms to try**:
- "archivy"
- "archivní instituce"
- "ARON"
- "Národní archiv"
- "Archiválie na dosah"

**Filter by publishers**:
- Národní archiv
- Ministerstvo vnitra

### 3️⃣ Investigate ARON API (if no download found)

**Method**: Use browser DevTools

1. Open https://portal.nacr.cz/aron/institution
2. Open DevTools (F12) → Network tab
3. Filter by "XHR" or "Fetch"
4. Click through pages to see API calls
5. Look for endpoints like:
   - `/api/institution`
   - `/api/apu`
   - `/aron/api/...`

**Check**:
- Response format (JSON expected)
- Authentication requirements
- Pagination parameters
- Rate limits

### 4️⃣ Web Scraping (LAST RESORT ONLY)

**Only if**:
- No email response after 1 week
- No open data export found
- No public API available

**Approach**:

```python
# Use playwright or crawl4ai
# Scrape 56 pages of institution list
# Extract: name, UUID, link
# Follow links to detail pages
# Respect rate limits: 1 req/sec
```

## What We Have vs What We Need

### Currently Have (Libraries)

```
Source: https://aleph.nkp.cz/data/adr.xml.gz
Format: MARC21 XML (27 MB)
Records: 8,145 libraries
Status: ✅ Parsed and validated
Output: data/instances/czech_institutions.yaml (8.8 MB)
```

### Need to Get (Archives)

```
Source: https://portal.nacr.cz/aron/institution
Format: Unknown (web portal only)
Records: ~560 archives (estimated)
Status: ⏳ Awaiting data access
Target: data/instances/czech_archives.yaml
```

### Final Goal (Combined)

```
Output: data/instances/czech_heritage_complete.yaml
Records: ~8,700 total (8,145 libraries + 560 archives)
Status: 📋 Waiting for archive data
```

## Expected Timeline

**Best Case** (Ministry provides export):
- Email response: 3-5 business days
- Download + parse data: 1-2 hours
- Validation + merge: 1-2 hours
- **Total: 1 week**

**Medium Case** (Found in open data portal):
- Deep search: 1-2 hours
- Download + parse: 1-2 hours
- **Total: Same day**

**Worst Case** (Web scraping required):
- Write scraper: 2-3 hours
- Run scraper (56 pages + details): 2-3 hours
- Parse + validate: 1-2 hours
- **Total: 1-2 days**

## Commands for Next Session

### Check email response

```bash
# Manual check - did arch@mvcr.cz respond?
```

### Search open data portal

```bash
# Open in browser (URLs are quoted so the shell does not treat "?" as a glob):
open "https://data.gov.cz/datasets?keywords=archivy"
open "https://data.gov.cz/datasets?keywords=archivn%C3%AD%20instituce"
```

### Check current status

```bash
# Czech libraries
python3 -c "
import yaml
with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)
print(f'Czech libraries: {len(data)}')
print(f'Archives in library DB: {sum(1 for i in data if \"archiv\" in i[\"name\"].lower())}')
"
```

### Test ARON API (if found)

```bash
# Example if we find API endpoint
curl -s "https://portal.nacr.cz/api/institution?page=0&size=100" | jq .
```

## Files to Create (Once Archive Data Available)

1. `scripts/parsers/parse_czech_archives.py` - Archive parser
2. `data/instances/czech_archives.yaml` - Archive records
3. `scripts/merge_czech_datasets.py` - Merge libraries + archives
4. `data/instances/czech_heritage_complete.yaml` - Final unified dataset
5. `CZECH_ARCHIVES_COMPLETE.md` - Archive harvest report

## Success Criteria

- ✅ Archive data obtained (via email, open data, or API)
- ✅ Parser created and tested
- ✅ Records validated against LinkML schema
- ✅ Libraries + archives merged without duplicates
- ✅ Total ~8,700 Czech heritage institutions in unified dataset
- ✅ GHCID generated for all institutions
- ✅ Ready for Wikidata enrichment

## Key Contacts

**Národní archiv (National Archives)**:
- Email: posta@nacr.cz
- Website: https://www.nacr.cz

**Ministry of Interior - Archival Administration**:
- Email: arch@mvcr.cz ⭐ PRIMARY CONTACT
- Website: https://www.mvcr.cz

**National Library (for reference)**:
- Email: eva.svobodova@nkp.cz (ISIL registry contact)
- Website: https://www.nkp.cz

---

**Current Status**: Awaiting archive data access
**Next Action**: Send email to arch@mvcr.cz
**Fallback**: Search open data portal, then investigate API
**Last Resort**: Web scraping with rate limits
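
The "merged without duplicates" criterion for `scripts/merge_czech_datasets.py` could be approached roughly as follows. This is a minimal sketch, not the project's actual merge logic: it assumes both datasets load as lists of dicts with a `name` key (as the status check above implies for the library file), and it treats two records as duplicates when their names match after stripping diacritics, case, and extra whitespace. The function names `normalize_name` and `merge_institutions` are hypothetical.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics, and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.lower().split())


def merge_institutions(libraries, archives):
    """Return all libraries plus each archive whose normalized name is new.

    Assumption: every record is a dict with a "name" key. Library records
    are kept as-is and are not de-duplicated against each other.
    """
    merged = list(libraries)
    seen = {normalize_name(rec["name"]) for rec in libraries}
    for rec in archives:
        key = normalize_name(rec["name"])
        if key not in seen:
            seen.add(key)
            merged.append(rec)
    return merged


if __name__ == "__main__":
    # Illustrative records only, not taken from the real datasets.
    libraries = [{"name": "Národní archiv"}, {"name": "Městská knihovna v Praze"}]
    archives = [{"name": "Narodni  Archiv"}, {"name": "Státní oblastní archiv v Plzni"}]
    merged = merge_institutions(libraries, archives)
    print(f"Merged records: {len(merged)}")
```

Matching on name alone is deliberately crude, since distinct institutions in different towns can share a name. Once the archive format is known, a real merge would probably also compare identifiers or location before dropping a record.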