glam/SESSION_SUMMARY_20251119_ARCHIVES_DISCOVERY.md
2025-11-19 23:25:22 +01:00


# Session Summary: Czech Archives Discovery (2025-11-19)
## What We Accomplished
### 1. Identified Archive Data Gap ✅
- Confirmed Czech ADR database contains **libraries only** (no archives)
- Discovered that the 87 institutions with "archiv" in their name are archive libraries/collections, not archives
- Found separate Czech archive database: **"Archiválie na dosah"** (Archives Within Reach)
### 2. Located Czech Archive Infrastructure ✅
- **Portal**: https://portal.nacr.cz/cro/pro-badatele/
- **Institution List**: https://portal.nacr.cz/aron/institution
- **Manager**: Národní archiv (National Archives) + Ministry of Interior
- **Estimated Archives**: ~560 institutions (56 pages × 10 per page)
### 3. Categorized Czech Archive Types ✅
Documented archive categories from ARON portal:
- State archives (Národní archiv, regional archives, security archive)
- Municipal archives (Prague, Brno, Plzeň, Ostrava, etc.)
- University archives (Charles University, Masaryk University, etc.)
- Specialized archives (Parliament, Foreign Ministry, Military, Radio, Museums, Galleries, Libraries)
- Private archives (Jewish Museum, corporate archives, church archives)
### 4. Identified Data Access Challenges ⚠️
**Problem**: No obvious bulk download available
- ❌ No public XML/CSV export like library ADR database
- ❌ No documented public API endpoint
- ❌ Not found in Czech open data portal (preliminary search)
- ✅ Web interface exists but requires scraping 56 pages
### 5. Developed Action Plan 📋
Created comprehensive investigation strategy:
- **Priority 1**: Email Ministry of Interior (arch@mvcr.cz) for export
- **Priority 2**: Deep search Czech open data portal
- **Priority 3**: Investigate ARON API through DevTools
- **Priority 4**: Web scraping as last resort
## Files Created
### Documentation
- `CZECH_ARCHIVES_INVESTIGATION.md` (6.5 KB)
  - Detailed investigation report
  - System architecture analysis
  - Comparison: libraries vs archives
  - Expected outcomes
- `CZECH_ARCHIVES_NEXT_ACTIONS.md` (5.2 KB)
  - Quick reference guide
  - Email template for Ministry
  - Step-by-step instructions
  - Success criteria
### Related Files (From Earlier Session)
- `CZECH_ISIL_COMPLETE_REPORT.md` - Library harvest report
- `CZECH_ISIL_NEXT_STEPS.md` - Library processing guide
- `data/instances/czech_institutions.yaml` - 8,145 libraries (8.8 MB)
## Key Findings
### Archive Distribution Estimate
Based on ARON portal pagination:
```
Total pages: 56
Per page: 10
Estimated total: ~560 Czech archive institutions
```
### Czech Heritage Landscape
```
Libraries: 8,145 (ADR database) ✅ COMPLETE
Archives: ~560 (ARON portal) ⏳ PENDING DATA ACCESS
Total: ~8,700 Czech heritage institutions
```
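The estimate and the resulting shares can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the figures quoted in this summary; it assumes the last ARON page is full, so the archive count is an upper bound rather than a confirmed number.

```python
# Figures taken from this summary; the archive count is an estimate.
PAGES = 56          # pages in the ARON institution list
PER_PAGE = 10       # institutions shown per page
LIBRARIES = 8_145   # records harvested from the ADR database

# Upper bound: assumes the last page is completely filled.
archives_estimate = PAGES * PER_PAGE
total_estimate = LIBRARIES + archives_estimate

library_share = LIBRARIES / total_estimate
archive_share = archives_estimate / total_estimate

print(f"Archives (estimate): {archives_estimate}")
print(f"Total (estimate):    {total_estimate}")
print(f"Libraries: {library_share:.1%}, archives: {archive_share:.1%}")
```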
### Data Quality Comparison
| Feature | Libraries (ADR) | Archives (ARON) |
|---------|-----------------|-----------------|
| Data Source | National Library | National Archives |
| Format | MARC21 XML | Unknown (web only) |
| Download | ✅ Available | ❌ Not found |
| Records | 8,145 | ~560 (estimated) |
| GPS Coverage | 81.3% | Unknown |
| ISIL Codes | Yes (siglas) | Unknown |
| License | CC0 | Unknown |
## Expected Czech Dataset
### After Archive Integration
- **Total Institutions**: ~8,700
  - Libraries: 8,145 (~93.6%)
  - Archives: ~560 (~6.4%)
  - Mixed: ~50 (museums, galleries, etc.)
- **Geographic Coverage**: All 14 Czech regions
- **Data Quality**: TIER_1_AUTHORITATIVE
- **Global Ranking**: 2nd largest national dataset, after the Netherlands, once archives are added
## Next Steps
### Immediate (Next Session)
1. **Send email to arch@mvcr.cz** requesting archive data export
   - Use template in `CZECH_ARCHIVES_NEXT_ACTIONS.md`
   - Reference ADR database as precedent
   - Request CC0 license
2. **Search Czech open data portal thoroughly**
   - Keywords: "archivy", "archivní instituce", "ARON"
   - Filter by: Národní archiv, Ministerstvo vnitra
   - URL: https://data.gov.cz/datasets
3. **Investigate ARON API**
   - Use browser DevTools on https://portal.nacr.cz/aron/institution
   - Look for `/api/` endpoints
   - Test pagination and response format
### Secondary (If No Official Export)
4. **Web scraping fallback**
   - Scrape 56 pages of institution list
   - Extract: name, UUID, detail page link
   - Follow links for full metadata
   - Respect rate limits (1 req/sec)
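
The scraping fallback could look roughly like the sketch below. It is not based on the real ARON markup, which has not been inspected yet: it assumes detail links follow the `/aron/apu/{uuid}` pattern listed under Known Endpoints, and it leaves the actual HTTP call to a caller-supplied `fetch_page` function because the pagination parameter is unknown. The names `InstitutionLinkParser`, `scrape_institutions`, and `fetch_page` are hypothetical. Since ARON is a React application, the list may be rendered client-side, in which case the HTML would have to come from a rendered page or from the underlying API instead.

```python
import re
import time
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional

# Assumption: detail links look like /aron/apu/{uuid} (see "Known Endpoints").
APU_LINK = re.compile(r"/aron/apu/([0-9a-fA-F-]{36})")


class InstitutionLinkParser(HTMLParser):
    """Collects uuid, name and link from <a href="/aron/apu/..."> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[Dict[str, str]] = []
        self._current: Optional[Dict[str, str]] = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = APU_LINK.search(href)
        if match:
            self._current = {"uuid": match.group(1), "link": href, "name": ""}

    def handle_data(self, data):
        # Text inside a matching anchor is treated as the institution name.
        if self._current is not None:
            self._current["name"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["name"] = self._current["name"].strip()
            self.records.append(self._current)
            self._current = None


def scrape_institutions(
    fetch_page: Callable[[int], str],
    pages: int = 56,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, str]]:
    """Parse each list page returned by fetch_page(page_number).

    Waits `delay` seconds between pages (1 req/sec by default) and
    skips UUIDs that were already seen on an earlier page.
    """
    seen = set()
    results: List[Dict[str, str]] = []
    for page in range(1, pages + 1):
        parser = InstitutionLinkParser()
        parser.feed(fetch_page(page))
        parser.close()
        for record in parser.records:
            if record["uuid"] not in seen:
                seen.add(record["uuid"])
                results.append(record)
        if page < pages:
            sleep(delay)
    return results


if __name__ == "__main__":
    # Offline demo with made-up HTML; no request is sent to the portal.
    def fake_fetch(page: int) -> str:
        uuid = f"00000000-0000-0000-0000-{page:012d}"
        return f'<ul><li><a href="/aron/apu/{uuid}">Example Archive {page}</a></li></ul>'

    for row in scrape_institutions(fake_fetch, pages=3, delay=0.0):
        print(row["uuid"], row["name"])
```

Passing `fetch_page` and `sleep` as parameters keeps the parsing logic testable offline. For scale: 56 list pages plus ~560 detail pages is about 616 requests, so the 1 req/sec delay alone adds roughly 10 minutes to a full run.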
### After Archive Data Obtained
5. **Create parser**: `scripts/parsers/parse_czech_archives.py`
6. **Validate records**: Check LinkML schema compliance
7. **Merge datasets**: Combine libraries + archives
8. **Deduplicate**: Check for institutions in both databases
9. **Enrich with Wikidata**: Add Q-numbers
10. **Generate GHCIDs**: CZ-* prefixes
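
Steps 7 and 8 can be sketched as follows. This is an illustration only, not the project's merge logic: it assumes archive records will carry the same `name` field that the library YAML uses, and it flags cross-database matches by normalized name for manual review instead of dropping anything automatically. The helper names `normalize_name` and `find_cross_database_candidates` are hypothetical.

```python
import unicodedata
from typing import Dict, List, Tuple

Record = Dict[str, str]


def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_marks.lower().split())


def find_cross_database_candidates(
    libraries: List[Record], archives: List[Record]
) -> List[Tuple[Record, Record]]:
    """Return (library, archive) pairs whose normalized names match.

    Matching on name alone is deliberately loose; the pairs are
    candidates for review, not confirmed duplicates.
    """
    by_name: Dict[str, List[Record]] = {}
    for lib in libraries:
        by_name.setdefault(normalize_name(lib["name"]), []).append(lib)

    candidates: List[Tuple[Record, Record]] = []
    for arc in archives:
        for lib in by_name.get(normalize_name(arc["name"]), []):
            candidates.append((lib, arc))
    return candidates


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    libs = [
        {"name": "Archiv města Brna", "institution_type": "LIBRARY"},
        {"name": "Obecní knihovna", "institution_type": "LIBRARY"},
    ]
    arcs = [{"name": "ARCHIV  MESTA BRNA", "institution_type": "ARCHIVE"}]
    for lib, arc in find_cross_database_candidates(libs, arcs):
        print(f"Possible duplicate: {lib['name']!r} <-> {arc['name']!r}")
```

Name matching will miss institutions listed under different names in the two databases, so a shared identifier (if archives turn out to have one) would be a stronger key.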
## Technical Discoveries
### ARON System
- **ARON** = ARchiv ONline
- React-based web application
- RESTful API backend (not publicly documented)
- Real-time updates by archivists
- Integrated with CAM (Central Archive Module)
### Known Endpoints
- `/aron/institution` - Institution list
- `/aron/fund` - Archival fonds
- `/aron/finding-aid` - Finding aids
- `/aron/originator` - Record creators
- `/aron/apu/{uuid}` - Entity details
### API Potential
- DA-COMM API found at https://stands.nacr.cz/da-comm/viewapi/
- Appears to be for component viewing, not data export
- May require authentication
- Needs further investigation
## Outstanding Questions
1. **Does archive data export exist?**
   - Awaiting response from arch@mvcr.cz
2. **Do Czech archives have ISIL codes?**
   - Libraries use "siglas" format
   - Archives may use different system
   - Need to check with ISIL agency
3. **What's the license for archive data?**
   - Library data is CC0
   - Archives likely similar but unconfirmed
4. **How many archives actually exist?**
   - Portal shows ~560
   - May be more not in ARON system
   - Need official count
## Success Metrics
### Session Goals
- ✅ Identified archive database location
- ✅ Estimated archive count
- ✅ Documented archive categories
- ✅ Created action plan
- ⏳ Awaiting data access
### Project Goals (Pending)
- ⏳ Obtain archive data export
- ⏳ Parse and validate records
- ⏳ Merge with library dataset
- ⏳ Complete Czech heritage dataset (~8,700 institutions)
- ⏳ Become 2nd largest national dataset globally
## Resources
### Key Contacts
- **Ministry of Interior**: arch@mvcr.cz ⭐ PRIMARY
- **Národní archiv**: posta@nacr.cz
- **National Library** (reference): eva.svobodova@nkp.cz
### URLs
- **ARON Portal**: https://portal.nacr.cz/cro
- **Institution List**: https://portal.nacr.cz/aron/institution
- **National Archives**: https://www.nacr.cz
- **Open Data**: https://data.gov.cz
- **Ministry of Interior**: https://www.mvcr.cz
### Documentation
- `CZECH_ARCHIVES_INVESTIGATION.md` - Full investigation report
- `CZECH_ARCHIVES_NEXT_ACTIONS.md` - Quick start guide
- `CZECH_ISIL_COMPLETE_REPORT.md` - Library harvest results
- `AGENTS.md` - Project instructions for AI agents
## Timeline Estimate
**Best Case** (Official export provided):
- Email response: 3-5 business days
- Download + parse: 1-2 hours
- Merge + validate: 1-2 hours
- **Total: ~1 week**

**Medium Case** (Found in open data):
- Deep portal search: 1-2 hours
- Download + parse: 1-2 hours
- **Total: Same day**

**Worst Case** (Web scraping):
- Write scraper: 2-3 hours
- Run scraper: 2-3 hours (56 pages + details)
- Parse + validate: 1-2 hours
- **Total: 1-2 days**
## Commands for Verification
### Check library data
```bash
python3 -c "
import yaml

# Load the harvested library records (a list of institution dicts)
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)

print(f'Czech libraries: {len(data)}')
# Count records whose name contains 'archiv'
print(f'\"Archiv\" mentions: {sum(1 for i in data if \"archiv\" in i[\"name\"].lower())}')
# Count records typed as ARCHIVE
print(f'Archive type count: {sum(1 for i in data if i[\"institution_type\"] == \"ARCHIVE\")}')
"
```
### List files created
```bash
ls -lh CZECH_ARCHIVES*.md
```
## Notes
- Czech Republic has **TWO separate** heritage databases (libraries + archives)
- This is common in many countries with specialized systems
- Library database was straightforward (single XML download)
- Archive database requires more investigation (web portal only)
- Both are managed by national institutions (good data quality)
- License likely CC0 (following Czech open data practices)
## Status
**Current Phase**: Data acquisition
**Blocking Issue**: No bulk download found for archives
**Next Action**: Email arch@mvcr.cz requesting export
**Fallback Plans**: Open data search → API investigation → Web scraping
**Expected Resolution**: 1 week (if email successful)

---
**Session Duration**: ~2 hours
**Files Created**: 2 (investigation report + action guide)
**Research Completed**: Archive infrastructure mapping
**Data Acquired**: 0 (awaiting archive export)
**Next Session Focus**: Data acquisition and parsing