glam/CZECH_ARCHIVES_NEXT_ACTIONS.md
2025-11-19 23:25:22 +01:00

# Czech Archives - Next Actions
**Quick Reference**: What to do next session
## Context
- ✅ Czech **libraries** harvested: 8,145 institutions from ADR database
- 🔍 Czech **archives** discovered: ~560 institutions in ARON portal
- ❌ No bulk download found yet for archive data
## Recommended Actions (in order)
### 1️⃣ Email Request to Ministry of Interior (DO THIS FIRST)
**To**: arch@mvcr.cz
**Subject**: Request for Czech Archive Institution Registry Export
**Email Template**:
```
Dear Archival Administration,
I am working on a global heritage institution database project and have
successfully integrated data from the Czech National Library's ADR database
(8,145 libraries).
I would like to request a downloadable export of Czech archive institutions
from the "Archiválie na dosah" (ARON) portal at https://portal.nacr.cz/aron/institution
Could you provide:
1. Complete list of Czech archive institutions in XML/CSV/JSON format
2. Metadata: institution name, type, location, identifiers, website
3. License information (hoping for CC0 like the ADR database)
The ADR library database was publicly available at:
https://aleph.nkp.cz/data/adr.xml.gz (CC0 license)
Is there a similar download for archive institutions from the ARON system?
This data will contribute to a global open heritage institution dataset
for research and discovery purposes.
Thank you for your assistance!
```
### 2️⃣ Search Czech Open Data Portal
**URL**: https://data.gov.cz/datasets
**Search terms to try**:
- "archivy"
- "archivní instituce"
- "ARON"
- "Národní archiv"
- "Archiválie na dosah"
**Filter by publishers**:
- Národní archiv
- Ministerstvo vnitra
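
The search terms above contain diacritics and spaces, so they need percent-encoding before they go into a URL. A minimal sketch, assuming the portal accepts the `keywords` query parameter used in the browser commands later in this file (the helper name `build_search_urls` is illustrative):

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://data.gov.cz/datasets"
SEARCH_TERMS = [
    "archivy",
    "archivní instituce",
    "ARON",
    "Národní archiv",
    "Archiválie na dosah",
]


def build_search_urls(terms, base_url=BASE_URL):
    """Return one percent-encoded search URL per term.

    Assumption: the portal reads the search string from `keywords`.
    quote_via=quote encodes spaces as %20 instead of '+'.
    """
    return [
        f"{base_url}?{urlencode({'keywords': term}, quote_via=quote)}"
        for term in terms
    ]


if __name__ == "__main__":
    for url in build_search_urls(SEARCH_TERMS):
        print(url)
```

The printed URLs can be pasted into a browser one by one; the publisher filters still have to be applied in the portal UI.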
### 3️⃣ Investigate ARON API (if no download found)
**Method**: Use browser DevTools
1. Open https://portal.nacr.cz/aron/institution
2. Open DevTools (F12) → Network tab
3. Filter by "XHR" or "Fetch"
4. Click through pages to see API calls
5. Look for endpoints like:
- `/api/institution`
- `/api/apu`
- `/aron/api/...`
**Check**:
- Response format (JSON expected)
- Authentication requirements
- Pagination parameters
- Rate limits
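
If DevTools does reveal a paginated JSON endpoint, harvesting it could look roughly like the sketch below. Everything about the API here is an assumption until confirmed: the zero-based `page`/`size` parameters, the `{"items": [...]}` response shape, and the base URL passed in by the caller. The `fetch` argument is injectable so the loop can be tried against canned data before touching the portal.

```python
import json
import time
from urllib.request import Request, urlopen


def fetch_json(url):
    """Fetch a URL and decode the JSON response body."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_institutions(base_url, page_size=100, delay=1.0, fetch=fetch_json):
    """Yield records page by page until an empty page comes back.

    Assumptions (hypothetical, verify in DevTools first):
    - pagination uses zero-based `page` and `size` query parameters
    - each response looks like {"items": [...]}
    """
    page = 0
    while True:
        payload = fetch(f"{base_url}?page={page}&size={page_size}")
        items = payload.get("items", [])
        if not items:
            break
        yield from items
        page += 1
        time.sleep(delay)  # stay polite: one request per `delay` seconds
```

Once the real endpoint is known, `list(iter_institutions(endpoint_url))` would collect all records; adjust the parameter names and the `items` key to match what the portal actually returns.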
### 4️⃣ Web Scraping (LAST RESORT ONLY)
**Only if**:
- No email response after 1 week
- No open data export found
- No public API available
**Approach**:
```python
# Use playwright or crawl4ai
# Scrape 56 pages of institution list
# Extract: name, UUID, link
# Follow links to detail pages
# Respect rate limits: 1 req/sec
```
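
The extraction step of that outline can be sketched with the standard library alone. This covers only parsing already-rendered HTML (for example, page content saved from playwright), not the browsing itself. The link pattern `/aron/institution/<UUID>` is a guess at the portal's markup and must be checked against a real page.

```python
import re
from html.parser import HTMLParser

# Hypothetical link pattern; verify against the portal's actual HTML.
UUID_LINK = re.compile(
    r"/aron/institution/"
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)


class InstitutionLinkParser(HTMLParser):
    """Collect name, UUID, and link from anchors matching UUID_LINK."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # (uuid, href) while inside a matching <a>
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = UUID_LINK.search(href)
        if match:
            self._current = (match.group(1), href)
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            uuid, href = self._current
            name = " ".join("".join(self._text).split())
            self.records.append({"name": name, "uuid": uuid, "link": href})
            self._current = None


def extract_institutions(html):
    """Return a list of {"name", "uuid", "link"} dicts found in the HTML."""
    parser = InstitutionLinkParser()
    parser.feed(html)
    parser.close()
    return parser.records
```

Running `extract_institutions` over each of the 56 list pages, with a one-second pause between page loads, would produce the name/UUID/link triples needed to visit the detail pages.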
## What We Have vs What We Need
### Currently Have (Libraries)
```
Source: https://aleph.nkp.cz/data/adr.xml.gz
Format: MARC21 XML (27 MB)
Records: 8,145 libraries
Status: ✅ Parsed and validated
Output: data/instances/czech_institutions.yaml (8.8 MB)
```
### Need to Get (Archives)
```
Source: https://portal.nacr.cz/aron/institution
Format: Unknown (web portal only)
Records: ~560 archives (estimated)
Status: ⏳ Awaiting data access
Target: data/instances/czech_archives.yaml
```
### Final Goal (Combined)
```
Output: data/instances/czech_heritage_complete.yaml
Records: ~8,700 total (8,145 libraries + 560 archives)
Status: 📋 Waiting for archive data
```
## Expected Timeline
**Best Case** (Ministry provides export):
- Email response: 3-5 business days
- Download + parse data: 1-2 hours
- Validation + merge: 1-2 hours
- **Total: 1 week**
**Medium Case** (Found in open data portal):
- Deep search: 1-2 hours
- Download + parse: 1-2 hours
- **Total: Same day**
**Worst Case** (Web scraping required):
- Write scraper: 2-3 hours
- Run scraper (56 pages + details): 2-3 hours
- Parse + validate: 1-2 hours
- **Total: 1-2 days**
## Commands for Next Session
### Check email response
```bash
# Manual check - did arch@mvcr.cz respond?
```
### Search open data portal
```bash
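# Open in browser (macOS `open`; use `xdg-open` on Linux).
# URLs are quoted so the shell does not treat `?` as a glob character.
open "https://data.gov.cz/datasets?keywords=archivy"
open "https://data.gov.cz/datasets?keywords=archivn%C3%AD%20instituce"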
```
### Check current status
```bash
# Czech libraries
python3 -c "
import yaml

with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)

print(f'Czech libraries: {len(data)}')
print(f'Archives in library DB: {sum(1 for i in data if \"archiv\" in i[\"name\"].lower())}')
"
```
### Test ARON API (if found)
```bash
# Placeholder endpoint: replace the path and parameters with whatever DevTools reveals
curl -s "https://portal.nacr.cz/api/institution?page=0&size=100" | jq .
```
## Files to Create (Once Archive Data Available)
1. `scripts/parsers/parse_czech_archives.py` - Archive parser
2. `data/instances/czech_archives.yaml` - Archive records
3. `scripts/merge_czech_datasets.py` - Merge libraries + archives
4. `data/instances/czech_heritage_complete.yaml` - Final unified dataset
5. `CZECH_ARCHIVES_COMPLETE.md` - Archive harvest report
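
The merge step needs to guard against archives that already appear in the library data. One possible approach, sketched here under assumptions, is to compare records on a normalized name and city. The `name` field matches the status check above; the `city` field and both function names are hypothetical and should follow the real schema.

```python
import unicodedata


def normalize_name(name):
    """Lowercase, strip diacritics, and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.lower().split())


def record_key(record):
    """Build the comparison key; `city` is an assumed, optional field."""
    return (normalize_name(record["name"]), normalize_name(record.get("city", "")))


def merge_without_duplicates(libraries, archives):
    """Return all libraries plus every archive whose key is not already present."""
    seen = {record_key(record) for record in libraries}
    merged = list(libraries)
    for record in archives:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            merged.append(record)
    return merged
```

Name-based matching is only a heuristic. If both sources turn out to carry a shared identifier, keying on that would be more reliable.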
## Success Criteria
- ✅ Archive data obtained (via email, open data, or API)
- ✅ Parser created and tested
- ✅ Records validated against LinkML schema
- ✅ Libraries + archives merged without duplicates
- ✅ Total ~8,700 Czech heritage institutions in unified dataset
- ✅ GHCID generated for all institutions
- ✅ Ready for Wikidata enrichment
## Key Contacts
**Národní archiv (National Archives)**:
- Email: posta@nacr.cz
- Website: https://www.nacr.cz
**Ministry of Interior - Archival Administration**:
- Email: arch@mvcr.cz ⭐ PRIMARY CONTACT
- Website: https://www.mvcr.cz
**National Library (for reference)**:
- Email: eva.svobodova@nkp.cz (ISIL registry contact)
- Website: https://www.nkp.cz
---
**Current Status**: Awaiting archive data access
**Next Action**: Send email to arch@mvcr.cz
**Fallback**: Search open data portal, then investigate API
**Last Resort**: Web scraping with rate limits