# Session Summary: Czech Archives Discovery (2025-11-19)

## What We Accomplished

### 1. Identified Archive Data Gap ✅
- Confirmed that the Czech ADR database contains **libraries only** (no archives)
- Found that the 87 institutions with "archiv" in their names are archive libraries or collections, not archives
- Found a separate Czech archive database: **"Archiválie na dosah"** (Archives Within Reach)

### 2. Located Czech Archive Infrastructure ✅
- **Portal**: https://portal.nacr.cz/cro/pro-badatele/
- **Institution List**: https://portal.nacr.cz/aron/institution
- **Manager**: Národní archiv (National Archives) + Ministry of Interior
- **Estimated Archives**: ~560 institutions (56 pages × 10 per page)

### 3. Categorized Czech Archive Types ✅
Documented archive categories from ARON portal:
- State archives (Národní archiv, regional archives, security archive)
- Municipal archives (Prague, Brno, Plzeň, Ostrava, etc.)
- University archives (Charles University, Masaryk University, etc.)
- Specialized archives (Parliament, Foreign Ministry, Military, Radio, Museums, Galleries, Libraries)
- Private archives (Jewish Museum, corporate archives, church archives)

### 4. Identified Data Access Challenges ⚠️
**Problem**: No obvious bulk download available
- ❌ No public XML/CSV export like the one the library ADR database offers
- ❌ No documented public API endpoint
- ❌ Not found in the Czech open data portal (preliminary search)
- ✅ Web interface exists, but using it requires scraping 56 pages

### 5. Developed Action Plan 📋
Created an investigation strategy in priority order:
- **Priority 1**: Email the Ministry of Interior (arch@mvcr.cz) to request an export
- **Priority 2**: Search the Czech open data portal in depth
- **Priority 3**: Investigate the ARON API through browser DevTools
- **Priority 4**: Web scraping as a last resort

## Files Created

### Documentation
- ✅ `CZECH_ARCHIVES_INVESTIGATION.md` (6.5 KB)
  - Detailed investigation report
  - System architecture analysis
  - Comparison: libraries vs archives
  - Expected outcomes

- ✅ `CZECH_ARCHIVES_NEXT_ACTIONS.md` (5.2 KB)
  - Quick reference guide
  - Email template for Ministry
  - Step-by-step instructions
  - Success criteria

### Related Files (From Earlier Session)
- `CZECH_ISIL_COMPLETE_REPORT.md` - Library harvest report
- `CZECH_ISIL_NEXT_STEPS.md` - Library processing guide
- `data/instances/czech_institutions.yaml` - 8,145 libraries (8.8 MB)

## Key Findings

### Archive Distribution Estimate
Based on ARON portal pagination:
```
Total pages: 56
Per page: 10
Estimated total: ~560 Czech archive institutions
```

### Czech Heritage Landscape
```
Libraries: 8,145 (ADR database) ✅ COMPLETE
Archives: ~560 (ARON portal) ⏳ PENDING DATA ACCESS
Total: ~8,700 Czech heritage institutions
```

### Data Quality Comparison

| Feature | Libraries (ADR) | Archives (ARON) |
|---------|-----------------|-----------------|
| Data Source | National Library | National Archives |
| Format | MARC21 XML | Unknown (web only) |
| Download | ✅ Available | ❌ Not found |
| Records | 8,145 | ~560 (estimated) |
| GPS Coverage | 81.3% | Unknown |
| ISIL Codes | Yes (siglas) | Unknown |
| License | CC0 | Unknown |

## Expected Czech Dataset

### After Archive Integration
- **Total Institutions**: ~8,700
  - Libraries: 8,145 (~93.6%)
  - Archives: ~560 (~6.4%)
  - Mixed: ~50 (museums, galleries, etc.)
- **Geographic Coverage**: All 14 Czech regions
- **Data Quality**: TIER_1_AUTHORITATIVE
- **Global Ranking**: 2nd largest national dataset (after the Netherlands, if archives are added)

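The shares of libraries and archives follow directly from the two counts in this summary. A minimal sketch of that arithmetic, assuming the archive figure is simply pages × items per page and leaving the ~50 "mixed" institutions out of the total:

```python
# Counts taken from this summary; the archive figure is an estimate.
libraries = 8145
archives_estimated = 56 * 10  # 56 portal pages x 10 institutions per page

total = libraries + archives_estimated
library_share = 100 * libraries / total
archive_share = 100 * archives_estimated / total

print(f"Total: {total}")                    # Total: 8705
print(f"Libraries: {library_share:.1f}%")   # Libraries: 93.6%
print(f"Archives: {archive_share:.1f}%")    # Archives: 6.4%
```

The archive count stays an estimate until an official export or a full crawl confirms it, so these percentages should be treated as approximate.
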
## Next Steps

### Immediate (Next Session)
1. **Send email to arch@mvcr.cz** requesting archive data export
   - Use template in `CZECH_ARCHIVES_NEXT_ACTIONS.md`
   - Reference ADR database as precedent
   - Request CC0 license

2. **Search Czech open data portal thoroughly**
   - Keywords: "archivy", "archivní instituce", "ARON"
   - Filter by: Národní archiv, Ministerstvo vnitra
   - URL: https://data.gov.cz/datasets

3. **Investigate ARON API**
   - Use browser DevTools on https://portal.nacr.cz/aron/institution
   - Look for `/api/` endpoints
   - Test pagination and response format

### Secondary (If No Official Export)
4. **Web scraping fallback**
   - Scrape 56 pages of institution list
   - Extract: name, UUID, detail page link
   - Follow links for full metadata
   - Respect rate limits (1 req/sec)

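The scraping fallback could look roughly like the sketch below. It is only an illustration under several assumptions that have not been verified: that the list pages are server-rendered HTML (the portal is React-based, so the links may only appear after JavaScript runs), that pagination uses a `page` query parameter, and that each institution is linked as `/aron/apu/{uuid}`. The names `InstitutionLinkParser`, `scrape_institutions`, and `http_fetch` are hypothetical.

```python
from __future__ import annotations

import time
from html.parser import HTMLParser
from typing import Callable, Iterator
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "https://portal.nacr.cz"
LIST_PATH = "/aron/institution"
DETAIL_PREFIX = "/aron/apu/"  # detail-page pattern noted in this summary


class InstitutionLinkParser(HTMLParser):
    """Collects (uuid, name) pairs from <a href=".../aron/apu/{uuid}"> links."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[tuple[str, str]] = []
        self._uuid: str | None = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and DETAIL_PREFIX in href:
            self._uuid = href.split(DETAIL_PREFIX, 1)[1].split("?")[0].strip("/")
            self._text = []

    def handle_data(self, data):
        if self._uuid is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._uuid is not None:
            self.links.append((self._uuid, "".join(self._text).strip()))
            self._uuid = None


def scrape_institutions(fetch: Callable[[str], str], pages: int = 56,
                        delay: float = 1.0) -> Iterator[dict]:
    """Yields one record per institution link, sleeping `delay` seconds between pages."""
    for page in range(1, pages + 1):
        # ASSUMPTION: the "page" query parameter is a guess, not a documented API.
        url = f"{BASE_URL}{LIST_PATH}?page={page}"
        parser = InstitutionLinkParser()
        parser.feed(fetch(url))
        for uuid, name in parser.links:
            yield {"uuid": uuid, "name": name,
                   "detail_url": urljoin(BASE_URL, DETAIL_PREFIX + uuid)}
        if page < pages:
            time.sleep(delay)  # 1 request per second by default


def http_fetch(url: str) -> str:
    """Plain HTTP GET; swap in a fake for offline testing."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")
```

A real run would be `records = list(scrape_institutions(http_fetch))`. Passing the fetch function in keeps the parsing logic testable without touching the portal. If the list pages turn out to be empty shells filled in by JavaScript, the JSON backend seen in DevTools would be the better target.
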
### After Archive Data Obtained
5. **Create parser**: `scripts/parsers/parse_czech_archives.py`
6. **Validate records**: Check LinkML schema compliance
7. **Merge datasets**: Combine libraries + archives
8. **Deduplicate**: Check for institutions in both databases
9. **Enrich with Wikidata**: Add Q-numbers
10. **Generate GHCIDs**: CZ-* prefixes

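The deduplication step could start from a simple name-and-city match across the two datasets. The sketch below is one possible approach, not the project's actual method: it assumes each record is a dict with a `name` key (as in the library YAML) and an optional `city` key, which is a hypothetical field used here for illustration.

```python
import unicodedata
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Lowercases, strips diacritics, and collapses whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(ascii_only.lower().split())


def find_cross_dataset_duplicates(libraries, archives):
    """Returns (library, archive) pairs whose normalized name and city match."""
    index = defaultdict(list)
    for lib in libraries:
        key = (normalize_name(lib["name"]), normalize_name(lib.get("city", "")))
        index[key].append(lib)

    pairs = []
    for arc in archives:
        key = (normalize_name(arc["name"]), normalize_name(arc.get("city", "")))
        pairs.extend((lib, arc) for lib in index.get(key, []))
    return pairs
```

Exact matching on normalized strings will miss abbreviations and reordered names, so the pairs it returns are candidates for manual review rather than confirmed duplicates.
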
## Technical Discoveries

### ARON System
- **ARON** = ARchiv ONline
- React-based web application
- RESTful API backend (not publicly documented)
- Real-time updates by archivists
- Integrated with CAM (Central Archive Module)

### Known Endpoints
- `/aron/institution` - Institution list
- `/aron/fund` - Archival fonds
- `/aron/finding-aid` - Finding aids
- `/aron/originator` - Record creators
- `/aron/apu/{uuid}` - Entity details

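For later scripts it may help to keep these paths in one place. The sketch below builds full URLs from the paths listed above, assuming they hang off `https://portal.nacr.cz` (as the institution list does) and that the `{uuid}` segment is a standard UUID, which has not been confirmed. The helper names `listing_url` and `entity_url` are hypothetical.

```python
import uuid
from urllib.parse import urljoin

BASE_URL = "https://portal.nacr.cz"

# Paths as recorded in this summary; none of them is officially documented.
ARON_PATHS = {
    "institution": "/aron/institution",
    "fund": "/aron/fund",
    "finding_aid": "/aron/finding-aid",
    "originator": "/aron/originator",
}


def listing_url(kind: str) -> str:
    """Returns the full URL of a listing page, e.g. listing_url('institution')."""
    return urljoin(BASE_URL, ARON_PATHS[kind])


def entity_url(entity_id: str) -> str:
    """Builds /aron/apu/{uuid}; ASSUMES the identifier is a standard UUID."""
    canonical = str(uuid.UUID(entity_id))  # raises ValueError on malformed input
    return urljoin(BASE_URL, f"/aron/apu/{canonical}")
```

Validating the identifier before building the URL keeps malformed values from a scrape out of the request queue. If the portal turns out to use a different identifier format, the check in `entity_url` would need to be relaxed.
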
### API Potential
- DA-COMM API found at https://stands.nacr.cz/da-comm/viewapi/
- Appears to be for component viewing, not data export
- May require authentication
- Needs further investigation

## Outstanding Questions

1. **Does archive data export exist?**
   - To be answered by the planned email to arch@mvcr.cz

2. **Do Czech archives have ISIL codes?**
   - Libraries use "siglas" format
   - Archives may use a different system
   - Need to check with ISIL agency

3. **What's the license for archive data?**
   - Library data is CC0
   - Archives likely similar but unconfirmed

4. **How many archives actually exist?**
   - Portal shows ~560
   - There may be more that are not in the ARON system
   - Need official count

## Success Metrics

### Session Goals
- ✅ Identified archive database location
- ✅ Estimated archive count
- ✅ Documented archive categories
- ✅ Created action plan
- ⏳ Awaiting data access

### Project Goals (Pending)
- ⏳ Obtain archive data export
- ⏳ Parse and validate records
- ⏳ Merge with library dataset
- ⏳ Complete Czech heritage dataset (~8,700 institutions)
- ⏳ Become 2nd largest national dataset globally

## Resources

### Key Contacts
- **Ministry of Interior**: arch@mvcr.cz ⭐ PRIMARY
- **Národní archiv**: posta@nacr.cz
- **National Library** (reference): eva.svobodova@nkp.cz

### URLs
- **ARON Portal**: https://portal.nacr.cz/cro
- **Institution List**: https://portal.nacr.cz/aron/institution
- **National Archives**: https://www.nacr.cz
- **Open Data**: https://data.gov.cz
- **Ministry of Interior**: https://www.mvcr.cz

### Documentation
- `CZECH_ARCHIVES_INVESTIGATION.md` - Full investigation report
- `CZECH_ARCHIVES_NEXT_ACTIONS.md` - Quick start guide
- `CZECH_ISIL_COMPLETE_REPORT.md` - Library harvest results
- `AGENTS.md` - Project instructions for AI agents

## Timeline Estimate

**Best Case** (Official export provided):
- Email response: 3-5 business days
- Download + parse: 1-2 hours
- Merge + validate: 1-2 hours
- **Total: ~1 week**

**Medium Case** (Found in open data):
- Deep portal search: 1-2 hours
- Download + parse: 1-2 hours
- **Total: Same day**

**Worst Case** (Web scraping):
- Write scraper: 2-3 hours
- Run scraper: 2-3 hours (56 pages + details)
- Parse + validate: 1-2 hours
- **Total: 1-2 days**

## Commands for Verification

### Check library data
```bash
python3 -c "
import yaml

# Load the harvested library records; the checks below assume a list of institution dicts
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)

print(f'Czech libraries: {len(data)}')
print(f'\"Archiv\" mentions: {sum(1 for i in data if \"archiv\" in i[\"name\"].lower())}')
print(f'Archive type count: {sum(1 for i in data if i[\"institution_type\"] == \"ARCHIVE\")}')
"
```

### List files created
```bash
ls -lh CZECH_ARCHIVES*.md
```

## Notes

- The Czech Republic has **TWO separate** heritage databases (libraries + archives)
- This is common in countries with specialized systems
- The library database was straightforward (single XML download)
- The archive database requires more investigation (web portal only)
- Both are managed by national institutions (good data quality)
- The archive license is likely CC0 (following Czech open data practices) but unconfirmed

## Status

**Current Phase**: Data acquisition
**Blocking Issue**: No bulk download found for archives
**Next Action**: Email arch@mvcr.cz requesting export
**Fallback Plans**: Open data search → API investigation → Web scraping
**Expected Resolution**: 1 week (if email successful)

---

**Session Duration**: ~2 hours
**Files Created**: 2 (investigation report + action guide)
**Research Completed**: Archive infrastructure mapping
**Data Acquired**: 0 (awaiting archive export)
**Next Session Focus**: Data acquisition and parsing