# Czech Archives - Next Actions

**Quick Reference**: What to do next session

## Context
- ✅ Czech **libraries** harvested: 8,145 institutions from the ADR database
- 🔍 Czech **archives** discovered: ~560 institutions in the ARON portal
- ❌ No bulk download found yet for the archive data

## Recommended Actions (in order)

### 1️⃣ Email Request to Ministry of Interior (DO THIS FIRST)

**To**: arch@mvcr.cz
**Subject**: Request for Czech Archive Institution Registry Export

**Email Template**:
```
Dear Archival Administration,

I am working on a global heritage institution database project and have
successfully integrated data from the Czech National Library's ADR database
(8,145 libraries).

I would like to request a downloadable export of Czech archive institutions
from the "Archiválie na dosah" (ARON) portal at https://portal.nacr.cz/aron/institution

Could you provide:
1. Complete list of Czech archive institutions in XML/CSV/JSON format
2. Metadata: institution name, type, location, identifiers, website
3. License information (hoping for CC0 like the ADR database)

The ADR library database was publicly available at:
https://aleph.nkp.cz/data/adr.xml.gz (CC0 license)

Is there a similar download for archive institutions from the ARON system?

This data will contribute to a global open heritage institution dataset
for research and discovery purposes.

Thank you for your assistance!
```

### 2️⃣ Search Czech Open Data Portal

**URL**: https://data.gov.cz/datasets

**Search terms to try**:
- "archivy"
- "archivní instituce"
- "ARON"
- "Národní archiv"
- "Archiválie na dosah"

**Filter by publishers**:
- Národní archiv
- Ministerstvo vnitra

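The search terms above contain diacritics and spaces, so they need percent-encoding before they go into a URL. Below is a minimal sketch that builds one search URL per term. It assumes the `keywords` query parameter used in the browser commands later in this document is what the portal accepts; the helper name `search_url` is illustrative.

```python
from urllib.parse import quote

# Base URL and query parameter name are taken from this document, not from portal docs.
BASE_URL = "https://data.gov.cz/datasets"
SEARCH_TERMS = [
    "archivy",
    "archivní instituce",
    "ARON",
    "Národní archiv",
    "Archiválie na dosah",
]


def search_url(term: str) -> str:
    """Build a search URL, percent-encoding spaces and non-ASCII characters."""
    return f"{BASE_URL}?keywords={quote(term)}"


for term in SEARCH_TERMS:
    print(search_url(term))
```

Each printed URL can be pasted into a browser. If the portal turns out to use a different parameter name, only the f-string in `search_url` needs to change.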
### 3️⃣ Investigate ARON API (if no download found)

**Method**: Use browser DevTools

1. Open https://portal.nacr.cz/aron/institution
2. Open DevTools (F12) → Network tab
3. Filter by "XHR" or "Fetch"
4. Click through pages to see API calls
5. Look for endpoints like:
   - `/api/institution`
   - `/api/apu`
   - `/aron/api/...`

**Check**:
- Response format (JSON expected)
- Authentication requirements
- Pagination parameters
- Rate limits

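Once a request has been captured in DevTools, the four checks above can be applied to it mechanically. The sketch below is one possible way to do that and makes no claim about how ARON actually responds: the function name `summarize_response`, the list of pagination key names, and the rate-limit header names are all assumptions chosen for illustration.

```python
import json

# Hypothetical key names that often indicate pagination in JSON APIs.
PAGINATION_HINTS = (
    "page", "size", "number", "totalPages", "totalElements",
    "next", "offset", "limit", "count",
)


def summarize_response(status: int, headers: dict, body: str) -> dict:
    """Summarize one captured response against the checklist above."""
    lowered = {key.lower(): value for key, value in headers.items()}
    try:
        payload = json.loads(body)
        is_json = True
    except (TypeError, ValueError):
        payload, is_json = None, False
    top_level_keys = sorted(payload) if isinstance(payload, dict) else []
    return {
        "is_json": is_json,
        # 401/403 suggest that the endpoint needs credentials or a session.
        "needs_auth": status in (401, 403),
        "pagination_keys": [k for k in top_level_keys if k in PAGINATION_HINTS],
        "rate_limit_headers": sorted(
            k for k in lowered if k.startswith("x-ratelimit") or k == "retry-after"
        ),
    }


# Made-up sample values, standing in for a response copied from DevTools.
sample = summarize_response(
    200,
    {"Content-Type": "application/json", "Retry-After": "5"},
    '{"content": [], "totalPages": 56, "number": 0}',
)
print(sample)
```

The status, headers, and body would be copied from the DevTools Network tab; nothing here sends a request.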
### 4️⃣ Web Scraping (LAST RESORT ONLY)

**Only if all of these hold**:
- No email response after 1 week
- No open data export found
- No public API available

**Approach**:
```python
# Plan (not runnable code):
# 1. Use playwright or crawl4ai to render the portal pages
# 2. Scrape the 56 pages of the institution list
# 3. Extract per institution: name, UUID, link
# 4. Follow each link to its detail page
# 5. Respect rate limits: at most 1 request per second
```

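The rate-limiting part of the approach above can be sketched independently of the scraping library. In the example below, `crawl_pages` and its parameters are illustrative names, and `fetch_page` is a placeholder for whatever playwright or crawl4ai call ends up loading one list page; the sketch only shows how to keep at least one second between requests.

```python
import time
from typing import Callable, Iterator, Tuple


def crawl_pages(
    fetch_page: Callable[[int], str],
    page_count: int = 56,
    min_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, content), waiting so requests start at least min_interval apart."""
    last_start = None
    for page in range(page_count):
        if last_start is not None:
            remaining = min_interval - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)
        last_start = clock()
        yield page, fetch_page(page)


# Demo with a stand-in fetcher and a short interval so it finishes quickly.
for number, content in crawl_pages(lambda n: f"<html>page {n}</html>", page_count=3, min_interval=0.1):
    print(number, content)
```

Passing `clock` and `sleep` as parameters keeps the pacing logic testable without real waiting; the defaults use the standard library timers.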
## What We Have vs What We Need

### Currently Have (Libraries)
```
Source: https://aleph.nkp.cz/data/adr.xml.gz
Format: MARC21 XML (27 MB)
Records: 8,145 libraries
Status: ✅ Parsed and validated
Output: data/instances/czech_institutions.yaml (8.8 MB)
```

### Need to Get (Archives)
```
Source: https://portal.nacr.cz/aron/institution
Format: Unknown (web portal only)
Records: ~560 archives (estimated)
Status: ⏳ Awaiting data access
Target: data/instances/czech_archives.yaml
```

### Final Goal (Combined)
```
Output: data/instances/czech_heritage_complete.yaml
Records: ~8,700 total (8,145 libraries + 560 archives)
Status: 📋 Waiting for archive data
```

## Expected Timeline

**Best Case** (Ministry provides export):
- Email response: 3-5 business days
- Download + parse data: 1-2 hours
- Validation + merge: 1-2 hours
- **Total: 1 week**

**Medium Case** (Found in open data portal):
- Deep search: 1-2 hours
- Download + parse: 1-2 hours
- **Total: Same day**

**Worst Case** (Web scraping required):
- Write scraper: 2-3 hours
- Run scraper (56 pages + details): 2-3 hours
- Parse + validate: 1-2 hours
- **Total: 1-2 days**

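As a rough cross-check on the scraper run time, the request count can be estimated from figures already in this document: 56 list pages, about 560 detail pages, and a limit of 1 request per second. The sketch below computes only the minimum waiting time imposed by that limit. It ignores page rendering, retries, and debugging, which is why the worst-case estimate above is larger. The function name is illustrative.

```python
def min_crawl_seconds(list_pages: int, detail_pages: int, seconds_per_request: float = 1.0) -> float:
    """Lower bound on crawl time when every request is spaced by the rate limit."""
    return (list_pages + detail_pages) * seconds_per_request


total = min_crawl_seconds(56, 560)
print(f"{total:.0f} s, about {total / 60:.1f} min")  # 616 s, about 10.3 min
```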
## Commands for Next Session

### Check email response
```bash
# Manual check - did arch@mvcr.cz respond?
```

### Search open data portal
```bash
# Open in browser (macOS `open`; use `xdg-open` on Linux).
# The URLs are quoted so the shell does not treat `?` as a glob character.
open "https://data.gov.cz/datasets?keywords=archivy"
open "https://data.gov.cz/datasets?keywords=archivn%C3%AD%20instituce"
```

### Check current status
```bash
# Czech libraries: total count, plus entries whose name contains "archiv"
python3 -c "
import yaml
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)
print(f'Czech libraries: {len(data)}')
print(f'Archives in library DB: {sum(1 for i in data if \"archiv\" in i[\"name\"].lower())}')
"
```

### Test ARON API (if found)
```bash
# Example only: the endpoint and parameters are placeholders until a real API is found
curl -s "https://portal.nacr.cz/api/institution?page=0&size=100" | jq .
```

## Files to Create (Once Archive Data Available)

1. `scripts/parsers/parse_czech_archives.py` - Archive parser
2. `data/instances/czech_archives.yaml` - Archive records
3. `scripts/merge_czech_datasets.py` - Merge libraries + archives
4. `data/instances/czech_heritage_complete.yaml` - Final unified dataset
5. `CZECH_ARCHIVES_COMPLETE.md` - Archive harvest report

## Success Criteria

- ✅ Archive data obtained (via email, open data, or API)
- ✅ Parser created and tested
- ✅ Records validated against LinkML schema
- ✅ Libraries + archives merged without duplicates
- ✅ Total ~8,700 Czech heritage institutions in unified dataset
- ✅ GHCID generated for all institutions
- ✅ Ready for Wikidata enrichment

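The "merged without duplicates" criterion needs some rule for deciding when an archive record is already present in the library data. The sketch below shows one possible rule, not the project's actual merge logic: it assumes each record is a dict with a `name` field (as the status check command in this document does) and flags archive records whose normalized name already appears among the libraries. The function names are illustrative.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Strip diacritics, case, and extra whitespace so near-identical names compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_marks.casefold().split())


def merge_records(libraries: list, archives: list) -> tuple:
    """Return (merged, possible_duplicates); libraries are always kept as-is."""
    library_names = {normalize_name(r.get("name", "")) for r in libraries}
    merged = list(libraries)
    possible_duplicates = []
    for record in archives:
        key = normalize_name(record.get("name", ""))
        if key and key in library_names:
            possible_duplicates.append(record)  # set aside for manual review
        else:
            merged.append(record)
    return merged, possible_duplicates


# Made-up sample records for illustration.
libraries = [{"name": "Národní archiv"}, {"name": "Městská knihovna"}]
archives = [{"name": "NÁRODNÍ  ARCHIV"}, {"name": "Státní okresní archiv Kladno"}]
merged, flagged = merge_records(libraries, archives)
print(len(merged), len(flagged))
```

Matching on name alone is deliberately conservative: flagged records are set aside rather than dropped, since two institutions in different towns can share a name. A stable identifier, if the archive data provides one, would be a better key.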
## Key Contacts

**Národní archiv (National Archives)**:
- Email: posta@nacr.cz
- Website: https://www.nacr.cz

**Ministry of Interior - Archival Administration**:
- Email: arch@mvcr.cz ⭐ PRIMARY CONTACT
- Website: https://www.mvcr.cz

**National Library (for reference)**:
- Email: eva.svobodova@nkp.cz (ISIL registry contact)
- Website: https://www.nkp.cz

---

**Current Status**: Awaiting archive data access
**Next Action**: Send email to arch@mvcr.cz
**Fallback**: Search open data portal, then investigate API
**Last Resort**: Web scraping with rate limits