Czech Archives - Next Actions
Quick Reference: What to do next session
Context
- ✅ Czech libraries harvested: 8,145 institutions from ADR database
- 🔍 Czech archives discovered: ~560 institutions in ARON portal
- ❌ No bulk download found yet for archive data
Recommended Actions (in order)
1️⃣ Email Request to Ministry of Interior (DO THIS FIRST)
To: arch@mvcr.cz
Subject: Request for Czech Archive Institution Registry Export
Email Template:
Dear Archival Administration,
I am working on a global heritage institution database project and have
successfully integrated data from the Czech National Library's ADR database
(8,145 libraries).
I would like to request a downloadable export of Czech archive institutions
from the "Archiválie na dosah" (ARON) portal (https://portal.nacr.cz/aron/institution).
Could you provide:
1. Complete list of Czech archive institutions in XML/CSV/JSON format
2. Metadata: institution name, type, location, identifiers, website
3. License information (hoping for CC0 like the ADR database)
The ADR library database was publicly available at:
https://aleph.nkp.cz/data/adr.xml.gz (CC0 license)
Is there a similar download for archive institutions from the ARON system?
This data will contribute to a global open heritage institution dataset
for research and discovery purposes.
Thank you for your assistance!
2️⃣ Search Czech Open Data Portal
URL: https://data.gov.cz/datasets
Search terms to try:
- "archivy"
- "archivní instituce"
- "ARON"
- "Národní archiv"
- "Archiválie na dosah"
Filter by publishers:
- Národní archiv
- Ministerstvo vnitra
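The search terms contain diacritics and spaces, so they need percent-encoding before going into a URL. A minimal sketch, assuming the portal accepts the `keywords` query parameter used in the commands later in this document (`build_search_url` is a hypothetical helper, not part of the project):

```python
from urllib.parse import quote

# Base URL and the `keywords` parameter come from the commands in this
# document; whether the portal honors other parameters is not verified.
BASE_URL = "https://data.gov.cz/datasets"

SEARCH_TERMS = [
    "archivy",
    "archivní instituce",
    "ARON",
    "Národní archiv",
    "Archiválie na dosah",
]


def build_search_url(term: str) -> str:
    """Return a search URL with the term percent-encoded as UTF-8."""
    return f"{BASE_URL}?keywords={quote(term)}"


if __name__ == "__main__":
    for term in SEARCH_TERMS:
        print(build_search_url(term))
```

Printing the URLs rather than opening them keeps the script portable; paste them into a browser or pass them to a launcher of your choice.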
3️⃣ Investigate ARON API (if no download found)
Method: Use browser DevTools
- Open https://portal.nacr.cz/aron/institution
- Open DevTools (F12) → Network tab
- Filter by "XHR" or "Fetch"
- Click through pages to see API calls
- Look for endpoints like:
  - /api/institution
  - /api/apu
  - /aron/api/...
Check:
- Response format (JSON expected)
- Authentication requirements
- Pagination parameters
- Rate limits
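Once a candidate call shows up in the Network tab, the checklist above can be applied to the captured response mechanically. The following is only a sketch: `summarize_response` is a hypothetical helper, and the pagination key names are common conventions in paginated JSON APIs, not confirmed ARON fields.

```python
import json


def summarize_response(status: int, headers: dict, body: str) -> dict:
    """Summarize one captured API response against the checklist."""
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {k.lower(): v for k, v in headers.items()}

    try:
        payload = json.loads(body)
        is_json = True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        payload, is_json = None, False

    # ASSUMPTION: typical pagination keys; verify the real ones in DevTools.
    hints = ("page", "size", "totalPages", "totalElements",
             "next", "offset", "limit")
    found = sorted(k for k in hints
                   if isinstance(payload, dict) and k in payload)

    return {
        "is_json": is_json,
        "auth_required": status in (401, 403),
        "rate_limited": status == 429 or "retry-after" in lowered,
        "pagination_keys": found,
    }


if __name__ == "__main__":
    # Made-up sample values, purely for illustration.
    sample = '{"page": 0, "size": 100, "items": []}'
    print(summarize_response(200, {"Content-Type": "application/json"}, sample))
```

Copy the status code, headers, and body from DevTools into the function to record what each endpoint does before writing any harvesting code.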
4️⃣ Web Scraping (LAST RESORT ONLY)
Only if:
- No email response after 1 week
- No open data export found
- No public API available
Approach:
# Use playwright or crawl4ai
# Scrape 56 pages of institution list
# Extract: name, UUID, link
# Follow links to detail pages
# Respect rate limits: 1 req/sec
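If scraping does become necessary, the rate-limiting part of the approach above might look like the sketch below. It deliberately leaves the fetching itself (playwright, crawl4ai, or anything else) to a caller-supplied `fetch` function. `crawl_pages` is a hypothetical name, and starting page numbers at 0 is an assumption to check against the portal.

```python
import time
from typing import Callable, Iterator, Optional, Tuple


def crawl_pages(
    fetch: Callable[[int], str],
    page_count: int = 56,          # number of list pages noted above
    min_interval: float = 1.0,     # 1 request per second
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, content), spacing requests by min_interval seconds."""
    last_request: Optional[float] = None
    for page in range(page_count):  # ASSUMPTION: pages are numbered from 0
        if last_request is not None:
            wait = min_interval - (clock() - last_request)
            if wait > 0:
                sleep(wait)
        last_request = clock()
        yield page, fetch(page)


if __name__ == "__main__":
    # Stand-in fetcher; replace with a real page loader.
    for number, content in crawl_pages(lambda p: f"<page {p}>", page_count=3):
        print(number, content)
```

Passing `clock` and `sleep` as parameters makes the throttle testable without waiting in real time; detail pages should go through the same throttle.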
What We Have vs What We Need
Currently Have (Libraries)
Source: https://aleph.nkp.cz/data/adr.xml.gz
Format: MARC21 XML (27 MB)
Records: 8,145 libraries
Status: ✅ Parsed and validated
Output: data/instances/czech_institutions.yaml (8.8 MB)
Need to Get (Archives)
Source: https://portal.nacr.cz/aron/institution
Format: Unknown (web portal only)
Records: ~560 archives (estimated)
Status: ⏳ Awaiting data access
Target: data/instances/czech_archives.yaml
Final Goal (Combined)
Output: data/instances/czech_heritage_complete.yaml
Records: ~8,700 total (8,145 libraries + ~560 archives)
Status: 📋 Waiting for archive data
Expected Timeline
Best Case (Ministry provides export):
- Email response: 3-5 business days
- Download + parse data: 1-2 hours
- Validation + merge: 1-2 hours
- Total: 1 week
Medium Case (Found in open data portal):
- Deep search: 1-2 hours
- Download + parse: 1-2 hours
- Total: Same day
Worst Case (Web scraping required):
- Write scraper: 2-3 hours
- Run scraper (56 pages + details): 2-3 hours
- Parse + validate: 1-2 hours
- Total: 1-2 days
Commands for Next Session
Check email response
# Manual check - did arch@mvcr.cz respond?
Search open data portal
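# Open in browser (macOS `open`; on Linux use xdg-open).
# URLs are quoted so the shell does not treat "?" as a glob character.
open "https://data.gov.cz/datasets?keywords=archivy"
open "https://data.gov.cz/datasets?keywords=archivn%C3%AD%20instituce"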
Check current status
# Czech libraries
python3 -c "
import yaml
# Load the harvested library records
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)
print(f'Czech libraries: {len(data)}')
# Count library records whose name mentions an archive
print(f'Archives in library DB: {sum(1 for i in data if \"archiv\" in i[\"name\"].lower())}')
"
Test ARON API (if found)
# Example if we find API endpoint
curl -s "https://portal.nacr.cz/api/institution?page=0&size=100" | jq .
Files to Create (Once Archive Data Available)
- scripts/parsers/parse_czech_archives.py - Archive parser
- data/instances/czech_archives.yaml - Archive records
- scripts/merge_czech_datasets.py - Merge libraries + archives
- data/instances/czech_heritage_complete.yaml - Final unified dataset
- CZECH_ARCHIVES_COMPLETE.md - Archive harvest report
Success Criteria
✅ Archive data obtained (via email, open data, or API)
✅ Parser created and tested
✅ Records validated against LinkML schema
✅ Libraries + archives merged without duplicates
✅ Total ~8,700 Czech heritage institutions in unified dataset
✅ GHCID generated for all institutions
✅ Ready for Wikidata enrichment
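The "merged without duplicates" criterion needs some way to decide that two records describe the same institution. The sketch below matches on a normalized `name` field only (the field the status check above reads). This is an assumption and a crude one: matching on stable identifiers would be safer once the archive data format is known. `normalize_name` and `merge_records` are hypothetical helpers.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Casefold, strip diacritics, and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return " ".join(no_marks.casefold().split())


def merge_records(libraries: list, archives: list) -> list:
    """Keep all library records; add archives whose name is not yet present."""
    merged = list(libraries)
    seen = {normalize_name(rec["name"]) for rec in libraries}
    for rec in archives:
        key = normalize_name(rec["name"])
        if key not in seen:
            seen.add(key)
            merged.append(rec)
    return merged


if __name__ == "__main__":
    # Made-up sample records, purely for illustration.
    libs = [{"name": "Národní archiv"}, {"name": "Městská knihovna"}]
    arch = [{"name": "narodni  ARCHIV"}, {"name": "Státní oblastní archiv"}]
    for record in merge_records(libs, arch):
        print(record["name"])
```

Name-only matching can wrongly collapse distinct institutions that share a name in different towns, so review dropped records before accepting the merge.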
Key Contacts
Národní archiv (National Archives):
- Email: posta@nacr.cz
- Website: https://www.nacr.cz
Ministry of Interior - Archival Administration:
- Email: arch@mvcr.cz ⭐ PRIMARY CONTACT
- Website: https://www.mvcr.cz
National Library (for reference):
- Email: eva.svobodova@nkp.cz (ISIL registry contact)
- Website: https://www.nkp.cz
Current Status: Awaiting archive data access
Next Action: Send email to arch@mvcr.cz
Fallback: Search open data portal, then investigate API
Last Resort: Web scraping with rate limits