Session Summary: Czech Archives Discovery (2025-11-19)
What We Accomplished
1. Identified Archive Data Gap ✅
- Confirmed Czech ADR database contains libraries only (no archives)
- Discovered that the 87 institutions with "archiv" in their names are archive libraries/collections, not archives
- Found separate Czech archive database: "Archiválie na dosah" (Archives Within Reach)
2. Located Czech Archive Infrastructure ✅
- Portal: https://portal.nacr.cz/cro/pro-badatele/
- Institution List: https://portal.nacr.cz/aron/institution
- Manager: Národní archiv (National Archives) + Ministry of Interior
- Estimated Archives: ~560 institutions (56 pages × 10 per page)
3. Categorized Czech Archive Types ✅
Documented archive categories from ARON portal:
- State archives (Národní archiv, regional archives, security archive)
- Municipal archives (Prague, Brno, Plzeň, Ostrava, etc.)
- University archives (Charles University, Masaryk University, etc.)
- Specialized archives (Parliament, Foreign Ministry, Military, Radio, Museums, Galleries, Libraries)
- Private archives (Jewish Museum, corporate archives, church archives)
4. Identified Data Access Challenges ⚠️
Problem: No obvious bulk download available
- ❌ No public XML/CSV export like library ADR database
- ❌ No documented public API endpoint
- ❌ Not found in Czech open data portal (preliminary search)
- ✅ Web interface exists but requires scraping 56 pages
5. Developed Action Plan 📋
Created comprehensive investigation strategy:
- Priority 1: Email Ministry of Interior (arch@mvcr.cz) for export
- Priority 2: Deep search Czech open data portal
- Priority 3: Investigate ARON API through DevTools
- Priority 4: Web scraping as last resort
Files Created
Documentation
- ✅ CZECH_ARCHIVES_INVESTIGATION.md (6.5 KB)
  - Detailed investigation report
  - System architecture analysis
  - Comparison: libraries vs archives
  - Expected outcomes
- ✅ CZECH_ARCHIVES_NEXT_ACTIONS.md (5.2 KB)
  - Quick reference guide
  - Email template for Ministry
  - Step-by-step instructions
  - Success criteria
Related Files (From Earlier Session)
- CZECH_ISIL_COMPLETE_REPORT.md - Library harvest report
- CZECH_ISIL_NEXT_STEPS.md - Library processing guide
- data/instances/czech_institutions.yaml - 8,145 libraries (8.8 MB)
Key Findings
Archive Distribution Estimate
Based on ARON portal pagination:
Total pages: 56
Per page: 10
Estimated total: ~560 Czech archive institutions
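The page count only bounds the total, because the last page may be partly filled. A minimal sketch of that arithmetic, assuming every page except possibly the last holds the full 10 entries (the function name is illustrative, not part of the project):

```python
def estimate_total(pages: int, per_page: int) -> tuple[int, int]:
    """Return (minimum, maximum) record count for a paginated list.

    Assumes every page except possibly the last one is full.
    """
    if pages < 1 or per_page < 1:
        raise ValueError("pages and per_page must be positive")
    minimum = (pages - 1) * per_page + 1  # last page holds a single record
    maximum = pages * per_page            # last page is full
    return minimum, maximum


low, high = estimate_total(pages=56, per_page=10)
print(f"ARON institutions: between {low} and {high}")
```

The ~560 figure is therefore the upper end of the range; the exact count needs the last page or an official export.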
Czech Heritage Landscape
Libraries: 8,145 (ADR database) ✅ COMPLETE
Archives: ~560 (ARON portal) ⏳ PENDING DATA ACCESS
Total: ~8,700 Czech heritage institutions
Data Quality Comparison
| Feature | Libraries (ADR) | Archives (ARON) |
|---|---|---|
| Data Source | National Library | National Archives |
| Format | MARC21 XML | Unknown (web only) |
| Download | ✅ Available | ❌ Not found |
| Records | 8,145 | ~560 (estimated) |
| GPS Coverage | 81.3% | Unknown |
| ISIL Codes | Yes (siglas) | Unknown |
| License | CC0 | Unknown |
Expected Czech Dataset
After Archive Integration
- Total Institutions: ~8,700
- Libraries: 8,145 (93.6%)
- Archives: ~560 (6.4%)
- Mixed: ~50 (museums, galleries, etc.)
- Geographic Coverage: All 14 Czech regions
- Data Quality: TIER_1_AUTHORITATIVE
- Global Ranking: Would be the 2nd largest national dataset (after the Netherlands) once archives are added
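The library and archive shares follow directly from the two counts. A small sketch for recomputing them when the real archive count arrives; 560 is the pagination estimate, not a confirmed figure, and the "Mixed" category is left out because these notes do not say whether it overlaps the other two:

```python
# Counts from this session; "archives" is the ARON pagination estimate.
counts = {"libraries": 8145, "archives": 560}


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each category's percentage of the combined total, to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total = sum(counts.values())
for name, pct in shares(counts).items():
    print(f"{name}: {counts[name]:,} ({pct}%)")
print(f"total: {total:,}")
```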
Next Steps
Immediate (Next Session)
- Send email to arch@mvcr.cz requesting archive data export
  - Use template in CZECH_ARCHIVES_NEXT_ACTIONS.md
  - Reference ADR database as precedent
  - Request CC0 license
- Search Czech open data portal thoroughly
  - Keywords: "archivy", "archivní instituce", "ARON"
  - Filter by: Národní archiv, Ministerstvo vnitra
  - URL: https://data.gov.cz/datasets
- Investigate ARON API
  - Use browser DevTools on https://portal.nacr.cz/aron/institution
  - Look for /api/ endpoints
  - Test pagination and response format
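For the DevTools step, it may help to have a quick way to check whether a request copied from the Network tab returns JSON and what its top-level shape is. The sketch below is an assumption-laden helper, not a description of the ARON API: no endpoint is known yet, so `probe` takes whatever URL turns up, and both function names are illustrative.

```python
import json
from urllib.request import Request, urlopen


def describe_response(content_type: str, body: bytes) -> dict:
    """Summarise one response: is it JSON, and what is its top-level shape?"""
    summary = {"content_type": content_type, "is_json": False, "shape": None}
    try:
        payload = json.loads(body.decode("utf-8"))
    except ValueError:  # covers UnicodeDecodeError and json.JSONDecodeError
        return summary  # HTML or another non-JSON payload
    summary["is_json"] = True
    if isinstance(payload, dict):
        summary["shape"] = {"type": "object", "keys": sorted(payload)}
    elif isinstance(payload, list):
        summary["shape"] = {"type": "array", "length": len(payload)}
    else:
        summary["shape"] = {"type": type(payload).__name__}
    return summary


def probe(url: str, timeout: float = 10.0) -> dict:
    """Fetch a candidate URL (copied from DevTools) and describe the response."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type", "")
        return describe_response(content_type, response.read())
```

Calling `probe(...)` with a URL observed in the Network tab shows at a glance whether it is worth testing for pagination parameters.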
Secondary (If No Official Export)
- Web scraping fallback
  - Scrape 56 pages of institution list
  - Extract: name, UUID, detail page link
  - Follow links for full metadata
  - Respect rate limits (1 req/sec)
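A rough sketch of the fallback crawl over the list pages, under two loudly stated assumptions: the pagination parameter (`?page=N`) is a placeholder because the real one has not been identified, and the list HTML is assumed to contain `/aron/apu/{uuid}` links. Since the portal is a React application, the server HTML may not contain those links at all, in which case the same loop would target whatever JSON endpoint the DevTools investigation finds.

```python
import re
import time
from typing import Callable, Iterator
from urllib.request import Request, urlopen

BASE_URL = "https://portal.nacr.cz/aron/institution"
# Detail links of the form /aron/apu/{uuid}
APU_LINK = re.compile(
    r"/aron/apu/([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})"
)


def fetch(url: str) -> str:
    """Download one page as text."""
    request = Request(url, headers={"User-Agent": "czech-archives-research/0.1"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def page_url(page: int) -> str:
    # ASSUMPTION: the real pagination parameter is unknown; "?page=N" is a placeholder.
    return f"{BASE_URL}?page={page}"


def extract_uuids(html: str) -> list[str]:
    """Return unique entity UUIDs in order of first appearance."""
    return list(dict.fromkeys(m.lower() for m in APU_LINK.findall(html)))


def crawl(pages: int = 56, delay: float = 1.0,
          get: Callable[[str], str] = fetch,
          sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield each institution UUID once, pausing after every list page."""
    seen: set[str] = set()
    for page in range(1, pages + 1):
        for uuid in extract_uuids(get(page_url(page))):
            if uuid not in seen:
                seen.add(uuid)
                yield uuid
        sleep(delay)  # stay at or below 1 request per second
```

The `get` and `sleep` parameters exist so the loop can be tested offline with canned HTML before it touches the portal.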
After Archive Data Obtained
- Create parser: scripts/parsers/parse_czech_archives.py
- Validate records: Check LinkML schema compliance
- Merge datasets: Combine libraries + archives
- Deduplicate: Check for institutions in both databases
- Enrich with Wikidata: Add Q-numbers
- Generate GHCIDs: CZ-* prefixes
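The deduplication step could start from a simple name comparison. The sketch below is one possible approach, not the project's method: it assumes each record is a dict with a `name` field (as in the library YAML), and it only proposes candidate pairs for manual review, since distinct institutions can share a name.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Casefold, strip diacritics, and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.casefold().split())


def find_overlaps(libraries: list[dict], archives: list[dict]) -> list[tuple[dict, dict]]:
    """Pair library and archive records whose normalized names match exactly."""
    by_name: dict[str, list[dict]] = {}
    for record in libraries:
        by_name.setdefault(normalize_name(record["name"]), []).append(record)
    pairs = []
    for record in archives:
        for match in by_name.get(normalize_name(record["name"]), []):
            pairs.append((match, record))
    return pairs
```

Matching on normalized names catches differences in case, diacritics, and spacing; anything fuzzier (abbreviations, word order) would need a separate pass.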
Technical Discoveries
ARON System
- ARON = ARchiv ONline
- React-based web application
- RESTful API backend (not publicly documented)
- Real-time updates by archivists
- Integrated with CAM (Central Archive Module)
Known Endpoints
- /aron/institution - Institution list
- /aron/fund - Archival fonds
- /aron/finding-aid - Finding aids
- /aron/originator - Record creators
- /aron/apu/{uuid} - Entity details
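These paths can be turned into full URLs with a small helper. This sketch assumes all of them live on the same host as the institution list (https://portal.nacr.cz), which is confirmed only for /aron/institution; the helper names are illustrative.

```python
from urllib.parse import urljoin
from uuid import UUID

PORTAL = "https://portal.nacr.cz"  # host confirmed for the institution list only

LISTINGS = {
    "institution": "/aron/institution",
    "fund": "/aron/fund",
    "finding-aid": "/aron/finding-aid",
    "originator": "/aron/originator",
}


def listing_url(kind: str) -> str:
    """Return the URL of one of the known listing pages."""
    return urljoin(PORTAL, LISTINGS[kind])


def entity_url(uuid: str) -> str:
    """Build an /aron/apu/{uuid} URL; raises ValueError on a malformed UUID."""
    return urljoin(PORTAL, f"/aron/apu/{UUID(uuid)}")


print(listing_url("institution"))
```

Validating the UUID up front keeps malformed identifiers from turning into wasted requests during a crawl.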
API Potential
- DA-COMM API found at https://stands.nacr.cz/da-comm/viewapi/
- Appears to be for component viewing, not data export
- May require authentication
- Needs further investigation
Outstanding Questions
- Does archive data export exist?
  - To be answered by the planned request to arch@mvcr.cz
- Do Czech archives have ISIL codes?
  - Libraries use "siglas" format
  - Archives may use different system
  - Need to check with ISIL agency
- What's the license for archive data?
  - Library data is CC0
  - Archives likely similar but unconfirmed
- How many archives actually exist?
  - Portal shows ~560
  - May be more not in ARON system
  - Need official count
Success Metrics
Session Goals
- ✅ Identified archive database location
- ✅ Estimated archive count
- ✅ Documented archive categories
- ✅ Created action plan
- ⏳ Awaiting data access
Project Goals (Pending)
- ⏳ Obtain archive data export
- ⏳ Parse and validate records
- ⏳ Merge with library dataset
- ⏳ Complete Czech heritage dataset (~8,700 institutions)
- ⏳ Become 2nd largest national dataset globally
Resources
Key Contacts
- Ministry of Interior: arch@mvcr.cz ⭐ PRIMARY
- Národní archiv: posta@nacr.cz
- National Library (reference): eva.svobodova@nkp.cz
URLs
- ARON Portal: https://portal.nacr.cz/cro
- Institution List: https://portal.nacr.cz/aron/institution
- National Archives: https://www.nacr.cz
- Open Data: https://data.gov.cz
- Ministry of Interior: https://www.mvcr.cz
Documentation
- CZECH_ARCHIVES_INVESTIGATION.md - Full investigation report
- CZECH_ARCHIVES_NEXT_ACTIONS.md - Quick start guide
- CZECH_ISIL_COMPLETE_REPORT.md - Library harvest results
- AGENTS.md - Project instructions for AI agents
Timeline Estimate
Best Case (Official export provided):
- Email response: 3-5 business days
- Download + parse: 1-2 hours
- Merge + validate: 1-2 hours
- Total: ~1 week
Medium Case (Found in open data):
- Deep portal search: 1-2 hours
- Download + parse: 1-2 hours
- Total: Same day
Worst Case (Web scraping):
- Write scraper: 2-3 hours
- Run scraper: 2-3 hours (56 pages + details)
- Parse + validate: 1-2 hours
- Total: 1-2 days
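The rate limit alone sets a floor on the scraper run time, which is useful for sanity-checking the estimate above. A minimal sketch, assuming one request per list page plus one per institution detail page (about 560, an estimate) and ignoring network latency, retries, and debugging:

```python
def min_crawl_seconds(list_pages: int, detail_pages: int, delay: float = 1.0) -> float:
    """Lower bound on wall-clock time: one pause per request, ignoring latency."""
    return (list_pages + detail_pages) * delay


seconds = min_crawl_seconds(list_pages=56, detail_pages=560)
print(f"Pauses alone: {seconds:.0f} s ({seconds / 60:.1f} min)")
```

The hours budgeted for the run therefore leave most of the time for slow responses, failures, and re-runs rather than for the pauses themselves.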
Commands for Verification
Check library data
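python3 -c "
import yaml

# Load the harvested library records (a list of institution dicts)
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)

print(f'Czech libraries: {len(data)}')
# Names that mention archiv (archive libraries/collections)
print(f'\"Archiv\" mentions: {sum(1 for i in data if \"archiv\" in i[\"name\"].lower())}')
# Records typed as ARCHIVE; expected to be 0 for the ADR harvest
print(f'Archive type count: {sum(1 for i in data if i[\"institution_type\"] == \"ARCHIVE\")}')
"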
List files created
ls -lh CZECH_ARCHIVES*.md
Notes
- Czech Republic has TWO separate heritage databases (libraries + archives)
- This is common in many countries with specialized systems
- Library database was straightforward (single XML download)
- Archive database requires more investigation (web portal only)
- Both are managed by national institutions (good data quality)
- License unconfirmed; CC0 is likely, matching the library data and Czech open data practices
Status
Current Phase: Data acquisition
Blocking Issue: No bulk download found for archives
Next Action: Email arch@mvcr.cz requesting export
Fallback Plans: Open data search → API investigation → Web scraping
Expected Resolution: 1 week (if email successful)
Session Duration: ~2 hours
Files Created: 2 (investigation report + action guide)
Research Completed: Archive infrastructure mapping
Data Acquired: 0 (awaiting archive export)
Next Session Focus: Data acquisition and parsing