# Czech Archives Database Investigation

**Date**: 2025-11-19  
**Status**: Investigation in progress

## Summary

The Czech Republic has **TWO separate heritage databases**:

1. ✅ **Libraries** - Already harvested (8,145 institutions from ADR database)
2. 🔍 **Archives** - Need to harvest (estimated ~560 institutions from ARON portal)

## Czech Archive Infrastructure

### Primary Source: Archiválie na dosah (Archives Within Reach)

**Portal URL**: https://portal.nacr.cz/cro/pro-badatele/  
**Institution Database**: https://portal.nacr.cz/aron/institution  
**Manager**: Národní archiv (National Archives) + Ministry of Interior

### What We Discovered

1. **"Archiválie na dosah"** replaced the old "Archivní fondy a sbírky v ČR" database
2. Contains **200,000+ archival collections** (fonds and collections)
3. Managed by Národní archiv through **ARON portal** (ARchiv ONline)
4. Updated in real-time by archivists across Czech Republic
5. Includes state archives, municipal archives, university archives, specialized archives, and private archives

### Archive Categories in ARON

According to the portal (https://portal.nacr.cz/cro/pro-badatele/):

**State Archives** (Státní archivy):
- Národní archiv (National Archives)
- Moravský zemský archiv v Brně
- 7 regional state archives (Státní oblastní archivy)
- Archiv bezpečnostních složek (Security Forces Archive)

**Municipal Archives** (Archivy územních samosprávných celků):
- Archiv hlavního města Prahy (Prague City Archive)
- Archiv města Brna
- Archiv města Plzně
- Archiv města Ústí nad Labem
- Archiv města Ostravy
- Many district and municipal archives

**University Archives** (Archivy vysokých škol):
- Univerzita Karlova
- Masarykova univerzita
- Univerzita Palackého
- Ostravská univerzita
- ČVUT Praha
- VUT Brno
- And more...

**Specialized Archives** (Specializované archivy):
- Archiv Kanceláře prezidenta republiky
- Archiv Poslanecké sněmovny (Parliament Archive)
- Archiv Ministerstva zahraničních věcí
- Vojenský historický archiv (Military Historical Archive)
- Archiv Českého rozhlasu (Czech Radio Archive)
- Archiv Národního muzea
- Literární archiv Památníku národního písemnictví
- Archiv Národní galerie
- Archiv Národní knihovny ČR
- And many more...

**Private Archives** (Soukromé archivy):
- Archiv Židovského muzea
- Corporate archives (Škoda, Vítkovice, Plzeňský Prazdroj, etc.)
- Church archives (Biskupství brněnské)

### Current Status

**ARON Portal Observations**:
- Shows **56 pages** of institutions (10 per page)
- **Estimated total: ~560 Czech archive institutions**
- Web interface at https://portal.nacr.cz/aron/institution
- Each institution has UUID (e.g., `/aron/apu/000efd1e-099b-4e8c-ab8c-ec9e47662e7b`)
- Displays institution name, but details require clicking through

## Data Access Challenges

### No Obvious Bulk Download

❌ **What we DON'T have yet**:
- No public downloadable XML/CSV export like ADR library database
- No documented public API for institution list
- ARON API exists but appears to be for internal use (DA-COMM system)
- Czech open data portal (data.gov.cz) doesn't list archive institution registry

### Possible Data Access Methods

**Option 1: Contact Ministry of Interior - Archival Administration** ⭐ RECOMMENDED
- **Email**: arch@mvcr.cz
- **Request**: Ask for downloadable export of Czech archive institution registry
- **Precedent**: National Library provided ADR database as open XML download
- **License**: Likely CC0 (public domain) like library database
- **Format**: Probably XML or CSV

**Option 2: Web Scraping ARON Portal**
- Portal: https://portal.nacr.cz/aron/institution
- 56 pages to scrape (paginated results)
- Extract: Institution name, UUID, link to detail page
- Then scrape each detail page for full metadata
- **Cons**: Time-consuming, fragile, may
  violate terms of service

**Option 3: Check for API Documentation**
- Found DA-COMM API docs at https://stands.nacr.cz/da-comm/viewapi/
- Appears to be for component viewing, not institution export
- May require authentication token
- **Needs investigation**: Does ARON have public API?

**Option 4: Check Open Data Portal**
- Search data.gov.cz for "archivy" or "archivní instituce"
- May be published as open data dataset
- **Status**: Preliminary search didn't find it, but worth deeper investigation

## Technical Details

### ARON System Architecture

**ARON** = ARchiv ONline (Archive Online)
- Modern web application (React-based frontend)
- RESTful API backend
- PostgreSQL database (inferred from similar Czech gov systems)
- Real-time updates by archivists
- Integrated with CAM (Centrální Archivní Modul - Central Archive Module)

### Known ARON Endpoints

- `/aron/institution` - List of archive institutions
- `/aron/fund` - Archival fonds
- `/aron/finding-aid` - Finding aids (archival inventories)
- `/aron/originator` - Originators (organizations that created records)
- `/aron/apu/{uuid}` - Individual access point (entity detail)

### Data We Need from Each Institution

From LinkML schema perspective, extract:
- **Name** (official institution name)
- **Institution Type** (ARCHIVE, MUSEUM, LIBRARY, etc.)
- **Location** (city, address if available)
- **Identifiers** (UUID from ARON, potential ISIL code)
- **Website URL** (if available)
- **Description** (mission, holdings summary)
- **Collections** (archival fonds managed by institution)

### Comparison: Libraries vs Archives

| Feature | Libraries (ADR) | Archives (ARON) |
|---------|-----------------|-----------------|
| **Total Institutions** | 8,145 | ~560 (estimated) |
| **Data Source** | National Library | National Archives + Min. Interior |
| **Download Available** | ✅ Yes (adr.xml.gz) | ❌ Not yet found |
| **Format** | MARC21 XML | Unknown (web portal only) |
| **Coverage** | 100% Czech libraries | All Czech archives + cultural institutions |
| **ISIL Codes** | Yes (siglas) | Unknown if archives have ISIL |
| **Update Frequency** | Periodic (last: 2025-08-01) | Real-time (online updates) |
| **License** | CC0 (public domain) | Unknown (likely open) |

## Next Steps

### Immediate Actions (Priority 1)

1. **Email Národní archiv and Ministry of Interior** ⭐

   ```
   To: arch@mvcr.cz
   Subject: Request for Czech Archive Institution Registry Export

   Dear Archival Administration,

   I am working on a global heritage institution database project
   (https://github.com/user/glam) and successfully integrated data from the
   National Library's ADR database (8,145 libraries).

   I would like to request a downloadable export of Czech archive institutions
   from the "Archiválie na dosah" (ARON) portal. Could you provide:

   1. Complete list of Czech archive institutions in XML/CSV format
   2. Metadata: institution name, type, location, identifiers, website
   3. License information (hoping for CC0 like ADR database)

   The ADR database was available at: https://aleph.nkp.cz/data/adr.xml.gz
   Is there a similar download for archive institutions?

   Thank you!
   ```

2. **Check Czech Open Data Portal Thoroughly**
   - Search data.gov.cz for:
     - "archivy"
     - "archivní instituce"
     - "ARON"
     - "Národní archiv"
     - "Archiválie na dosah"
   - Filter by publisher: Národní archiv, Ministerstvo vnitra

3. **Investigate ARON API**
   - Check browser network tab when using portal
   - Look for API endpoints (likely `/api/institution` or similar)
   - Check if API returns JSON data
   - Test if API requires authentication

### Secondary Actions (Priority 2)

4. **Web Scraping as Last Resort**
   - Only if no official export is available
   - Use `crawl4ai` or `playwright` to scrape institution list
   - Respect rate limits (1 request per second)
   - Parse HTML for institution names and UUIDs
   - Follow links to detail pages for full metadata

5. **Cross-Reference with ADR Database**
   - Check if 87 "archiv" mentions in library database are real archives
   - Some may be archive libraries (KI-MU type institutions)
   - Deduplicate if institutions appear in both databases

6. **ISIL Code Investigation**
   - Check if Czech archives have ISIL codes
   - Libraries use "siglas" (e.g., ABA000)
   - Archives may use different identifier system
   - Contact ISIL agency: https://isil.org

### Data Integration (Priority 3)

7. **Create Czech Archive Parser**
   - Once data is available, create `scripts/parsers/parse_czech_archives.py`
   - Similar structure to `parse_czech_isil.py`
   - Map archive types to GLAM taxonomy (ARCHIVE, MUSEUM, etc.)
   - Generate GHCIDs with CZ-* prefix

8. **Merge Libraries + Archives**
   - Combine `czech_institutions.yaml` (libraries) with new archive data
   - Deduplicate by name/location
   - Create unified `czech_heritage_institutions.yaml`
   - 8,145 libraries + ~560 archives = **~8,700 total Czech institutions**

9. **Enrich with Wikidata**
   - Query Wikidata for Czech archives
   - Fuzzy match by name and city
   - Add Q-numbers to identifiers
   - Update GHCIDs with Q-numbers if needed

## Expected Outcomes

### After Archive Data Harvest

**Czech Heritage Institution Dataset**:
- **Total Institutions**: ~8,700 (8,145 libraries + ~560 archives)
- **Data Quality**: TIER_1_AUTHORITATIVE (from national registries)
- **Geographic Coverage**: All 14 Czech regions
- **Type Distribution** (shares of the ~8,700 total; the museum, gallery, and university counts are rough estimates of how some archive entries may be classified):
  - Libraries: ~93.6% (8,145)
  - Archives: ~6.4% (~560)
  - Museums: ~50 (from specialized archives)
  - Galleries: ~20 (from specialized archives)
  - Universities: ~10 (university archives)
  - Mixed: varies

### Contribution to Global Dataset

Czech Republic would become one of the **most complete** country datasets:
- 🇳🇱 Netherlands: ~1,400 institutions (current best)
- 🇨🇿 Czech Republic: ~8,700 institutions (if archives added) 🏆
- 🇧🇷 Brazil: ~3,000+ institutions (in progress)
- 🇦🇷 Argentina: ~2,000 institutions
- 🇦🇹 Austria: ~1,800 institutions
- 🇨🇦 Canada: ~800 institutions

## References

**Czech National Archives**:
- Website: https://www.nacr.cz
- ARON Portal: https://portal.nacr.cz/cro
- Institution List: https://portal.nacr.cz/aron/institution
- Email: posta@nacr.cz

**Ministry of Interior - Archival Administration**:
- Email: arch@mvcr.cz
- Website: https://www.mvcr.cz (archival section)

**Czech Open Data Portal**:
- Website: https://data.gov.cz
- Search: https://data.gov.cz/datasets

**Related Documentation**:
- CZECH_ISIL_COMPLETE_REPORT.md - Library harvest report
- CZECH_ISIL_NEXT_STEPS.md - Quick start guide
- SESSION_SUMMARY_20251119_CZECH_COMPLETE.md - Session summary

---

**Status**: Awaiting response from arch@mvcr.cz  
**Next Session**: Check email for response, investigate open data portal, or begin API investigation
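
### Sketch: Extracting Institution UUIDs from Listing HTML

The scraping fallback (Option 2 and step 4) comes down to pulling institution names and UUIDs out of the listing pages. Below is a minimal sketch of that parsing step only, with no network access. It assumes each institution appears as an `<a>` element whose `href` contains the `/aron/apu/{uuid}` path and whose text is the institution name. That markup has not been verified against the live portal. Because the frontend appears to be React-based, the HTML may exist only after JavaScript rendering, in which case the input string would have to come from a rendered page (for example via `playwright`). The names `InstitutionLinkParser`, `parse_institutions`, and `SAMPLE_HTML` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Matches the detail-page path format noted in this report: /aron/apu/{uuid}
APU_HREF = re.compile(
    r"/aron/apu/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})",
    re.IGNORECASE,
)


class InstitutionLinkParser(HTMLParser):
    """Collects uuid/name pairs from <a href=".../aron/apu/{uuid}"> links."""

    def __init__(self):
        super().__init__()
        self.institutions = []
        self._uuid = None  # UUID of the link currently being read, if any
        self._text = []    # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        match = APU_HREF.search(dict(attrs).get("href") or "")
        if match:
            self._uuid = match.group(1).lower()
            self._text = []

    def handle_data(self, data):
        if self._uuid is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._uuid is not None:
            # Collapse internal whitespace in the link text.
            name = " ".join("".join(self._text).split())
            self.institutions.append({"uuid": self._uuid, "name": name})
            self._uuid = None


def parse_institutions(html):
    """Return a list of {"uuid", "name"} dicts found in one listing page."""
    parser = InstitutionLinkParser()
    parser.feed(html)
    parser.close()
    return parser.institutions


# Made-up sample markup and placeholder UUIDs; the real portal HTML may differ.
SAMPLE_HTML = """
<ul>
  <li><a href="/aron/apu/00000000-0000-0000-0000-000000000001">Example Archive A</a></li>
  <li><a href="/aron/apu/00000000-0000-0000-0000-000000000002">
        Example   Archive B </a></li>
  <li><a href="/about">About</a></li>
</ul>
"""

if __name__ == "__main__":
    for row in parse_institutions(SAMPLE_HTML):
        print(row["uuid"], row["name"])
```

Keeping the parser separate from fetching means the 1-request-per-second rate limit and the choice of fetcher can change without touching the extraction logic.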
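
### Sketch: Deduplicating and Merging Libraries with Archives

Step 8 (merge and deduplicate by name/location) and the type-distribution figures can be illustrated with a small sketch. It assumes each record is a dict with `name`, `city`, and `type` keys. The actual schema of `czech_institutions.yaml` may differ, so these field names are hypothetical, as are the helper names `normalize`, `merge_institutions`, and `type_shares`. Matching here is exact after normalization (case and diacritics removed), which is stricter than the fuzzy matching planned for the Wikidata step.

```python
import unicodedata
from collections import Counter


def normalize(text):
    """Casefold, strip diacritics, and collapse whitespace for matching."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.casefold().split())


def merge_institutions(libraries, archives):
    """Merge two record lists, keeping the first record per (name, city) key."""
    merged = {}
    for record in [*libraries, *archives]:
        key = (normalize(record["name"]), normalize(record.get("city", "")))
        merged.setdefault(key, record)  # later duplicates are dropped
    return list(merged.values())


def type_shares(records):
    """Return the percentage share of each institution type, one decimal."""
    counts = Counter(record["type"] for record in records)
    total = sum(counts.values())
    return {kind: round(100 * n / total, 1) for kind, n in counts.items()}


# Toy records for illustration only; not taken from either database.
libraries = [{"name": "Archiv města Brna", "city": "Brno", "type": "LIBRARY"}]
archives = [
    {"name": "ARCHIV  MESTA BRNA", "city": "brno", "type": "ARCHIVE"},
    {"name": "Example Archive", "city": "Praha", "type": "ARCHIVE"},
]

if __name__ == "__main__":
    merged = merge_institutions(libraries, archives)
    print(len(merged), type_shares(merged))
    # Library share of the estimated total if no records overlap:
    print(round(100 * 8145 / (8145 + 560), 1))
```

Keeping the first record per key means the library entry wins when an institution appears in both sources. Whether that is the right precedence depends on the cross-reference check in step 5.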