glam/CZECH_ARCHIVES_NEXT_ACTIONS.md

# Czech Archives - Next Actions

**Quick Reference**: What to do next session

## Context
- ✅ Czech **libraries** harvested: 8,145 institutions from ADR database
- 🔍 Czech **archives** discovered: ~560 institutions in ARON portal
- ❌ No bulk download found yet for archive data

## Recommended Actions (in order)

### 1️⃣ Email Request to Ministry of Interior (DO THIS FIRST)

- **To**: arch@mvcr.cz
- **Subject**: Request for Czech Archive Institution Registry Export

**Email Template**:
```
Dear Archival Administration,

I am working on a global heritage institution database project and have
successfully integrated data from the Czech National Library's ADR database
(8,145 libraries).

I would like to request a downloadable export of Czech archive institutions
from the "Archiválie na dosah" (ARON) portal at https://portal.nacr.cz/aron/institution

Could you provide:
1. Complete list of Czech archive institutions in XML/CSV/JSON format
2. Metadata: institution name, type, location, identifiers, website
3. License information (hoping for CC0 like the ADR database)

The ADR library database was publicly available at:
https://aleph.nkp.cz/data/adr.xml.gz (CC0 license)

Is there a similar download for archive institutions from the ARON system?

This data will contribute to a global open heritage institution dataset
for research and discovery purposes.

Thank you for your assistance!
```

### 2️⃣ Search Czech Open Data Portal

**URL**: https://data.gov.cz/datasets

**Search terms to try**:
- "archivy"
- "archivní instituce"
- "ARON"
- "Národní archiv"
- "Archiválie na dosah"

**Filter by publishers**:
- Národní archiv
- Ministerstvo vnitra

### 3️⃣ Investigate ARON API (if no download found)

**Method**: Use browser DevTools

1. Open https://portal.nacr.cz/aron/institution
2. Open DevTools (F12) → Network tab
3. Filter by "XHR" or "Fetch"
4. Click through pages to see API calls
5. Look for endpoints like:
   - `/api/institution`
   - `/api/apu`
   - `/aron/api/...`

**Check**:
- Response format (JSON expected)
- Authentication requirements
- Pagination parameters
- Rate limits
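
The checklist above can be applied to each captured response in a repeatable way. The sketch below is an illustration only: `summarize_response` is a hypothetical helper, the list of pagination hints is a guess, and nothing here reflects a confirmed ARON API.

```python
import json


def summarize_response(status, headers, body):
    """Summarize one captured HTTP response against the checklist."""
    headers = {key.lower(): value for key, value in headers.items()}
    content_type = headers.get("content-type", "").lower()
    summary = {
        "is_json": "json" in content_type,
        "needs_auth": status in (401, 403),
        "rate_limited": status == 429 or "retry-after" in headers,
        "pagination_keys": [],
    }
    if not summary["is_json"]:
        return summary
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Content-Type claimed JSON but the body did not parse.
        summary["is_json"] = False
        return summary
    if isinstance(payload, dict):
        # Assumed hint words; real key names depend on the actual API.
        hints = ("page", "size", "total", "offset", "limit", "next")
        summary["pagination_keys"] = sorted(
            key for key in payload if any(hint in key.lower() for hint in hints)
        )
    return summary


example = summarize_response(
    200,
    {"Content-Type": "application/json"},
    '{"items": [], "page": 0, "totalPages": 56}',
)
print(example)
```

The status code, headers, and body can be copied from a request captured in the DevTools Network tab.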

### 4️⃣ Web Scraping (LAST RESORT ONLY)

**Only if**:
- No email response after 1 week
- No open data export found
- No public API available

**Approach**:
```python
# Use playwright or crawl4ai
# Scrape 56 pages of institution list
# Extract: name, UUID, link
# Follow links to detail pages
# Respect rate limits: 1 req/sec
```
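
A minimal sketch of the extraction and rate-limiting steps, using only the standard library. It assumes that each institution appears as an `<a>` element whose `href` contains a UUID; the real markup of the ARON portal has not been checked, so the pattern and the helper names (`extract_institutions`, `throttled`) are hypothetical. The HTML would come from the rendered page, for example from Playwright's `page.content()`.

```python
import re
import time
from html.parser import HTMLParser

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I
)


class InstitutionLinkParser(HTMLParser):
    """Collect name, UUID, and link from anchors whose href holds a UUID."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record being built while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = UUID_RE.search(href)
        if match:
            self._current = {"name": "", "uuid": match.group(0).lower(), "link": href}

    def handle_data(self, data):
        if self._current is not None:
            self._current["name"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Collapse whitespace left over from the page layout.
            self._current["name"] = " ".join(self._current["name"].split())
            self.records.append(self._current)
            self._current = None


def extract_institutions(html):
    """Return a list of {name, uuid, link} dicts found in one HTML page."""
    parser = InstitutionLinkParser()
    parser.feed(html)
    parser.close()
    return parser.records


def throttled(items, min_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Yield items no faster than one per min_interval seconds (1 req/sec)."""
    last = None
    for item in items:
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item
```

Wrapping the list of page URLs in `throttled(...)` keeps the fetch loop at or below one request per second without scattering `sleep` calls through the scraper.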

## What We Have vs What We Need

### Currently Have (Libraries)
```
Source: https://aleph.nkp.cz/data/adr.xml.gz
Format: MARC21 XML (27 MB)
Records: 8,145 libraries
Status: ✅ Parsed and validated
Output: data/instances/czech_institutions.yaml (8.8 MB)
```

### Need to Get (Archives)
```
Source: https://portal.nacr.cz/aron/institution
Format: Unknown (web portal only)
Records: ~560 archives (estimated)
Status: ⏳ Awaiting data access
Target: data/instances/czech_archives.yaml
```

### Final Goal (Combined)
```
Output: data/instances/czech_heritage_complete.yaml
Records: ~8,700 total (8,145 libraries + ~560 archives)
Status: 📋 Waiting for archive data
```

## Expected Timeline

**Best Case** (Ministry provides export):
- Email response: 3-5 business days
- Download + parse data: 1-2 hours
- Validation + merge: 1-2 hours
- **Total: 1 week**

**Medium Case** (Found in open data portal):
- Deep search: 1-2 hours
- Download + parse: 1-2 hours
- **Total: Same day**

**Worst Case** (Web scraping required):
- Write scraper: 2-3 hours
- Run scraper (56 pages + details): 2-3 hours
- Parse + validate: 1-2 hours
- **Total: 1-2 days**
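
As a sanity check on the scraper run time, the rate limit alone gives a lower bound. The sketch assumes one request per list page and one per institution detail page, using the figures from this document (56 pages, ~560 institutions, 1 req/sec). Page rendering, retries, and network latency come on top, which is why the estimate above is more generous.

```python
LIST_PAGES = 56            # institution list pages (from this document)
INSTITUTIONS = 560         # estimated detail pages, one per institution
SECONDS_PER_REQUEST = 1.0  # rate limit of 1 req/sec


def min_runtime_seconds(list_pages, detail_pages, seconds_per_request):
    """Lower bound on run time imposed by the rate limit alone."""
    return (list_pages + detail_pages) * seconds_per_request


total = min_runtime_seconds(LIST_PAGES, INSTITUTIONS, SECONDS_PER_REQUEST)
print(f"{LIST_PAGES + INSTITUTIONS} requests, at least {total / 60:.1f} minutes")
# prints: 616 requests, at least 10.3 minutes
```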

## Commands for Next Session

### Check email response
```bash
# Manual check - did arch@mvcr.cz respond?
```

### Search open data portal
```bash
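# Open in browser (macOS `open`; on Linux use xdg-open).
# Quote the URLs so the shell does not treat `?` as a glob character.
open "https://data.gov.cz/datasets?keywords=archivy"
open "https://data.gov.cz/datasets?keywords=archivn%C3%AD%20instituce"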
```

### Check current status
```bash
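# Czech libraries: record count, plus records whose name mentions "archiv"
python3 - <<'EOF'
import yaml

# Read as UTF-8 so Czech diacritics load correctly on every platform
with open('data/instances/czech_institutions.yaml', encoding='utf-8') as f:
    data = yaml.safe_load(f)

print(f'Czech libraries: {len(data)}')
# Tolerate records with a missing or empty name
archives = sum(1 for i in data if 'archiv' in (i.get('name') or '').lower())
print(f'Archives in library DB: {archives}')
EOF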
```

### Test ARON API (if found)
```bash
# Hypothetical endpoint and parameters: replace with whatever DevTools reveals
curl -s "https://portal.nacr.cz/api/institution?page=0&size=100" | jq .
```
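
If an endpoint with `page` and `size` parameters does turn up, collecting every record is a short loop. The sketch below is not tied to any confirmed API: `fetch_page` is a hypothetical callable that returns one page of items as a list, and the stand-in `fake_fetch_page` only simulates ~560 records so that the loop can be exercised offline.

```python
def fetch_all(fetch_page, page_size=100, max_pages=1000):
    """Collect items from a zero-based paged source until a short page."""
    items = []
    for page in range(max_pages):  # max_pages guards against endless loops
        batch = fetch_page(page, page_size)
        items.extend(batch)
        if len(batch) < page_size:  # short or empty page means the end
            break
    return items


def fake_fetch_page(page, size):
    """Offline stand-in for a real HTTP call (simulates 560 records)."""
    data = [{"id": n} for n in range(560)]
    return data[page * size:(page + 1) * size]


print(len(fetch_all(fake_fetch_page)))  # 560
```

Passing the fetch function in as an argument keeps the paging logic testable without network access. The real version would wrap an HTTP request and adapt to the actual response shape.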

## Files to Create (Once Archive Data Available)

1. `scripts/parsers/parse_czech_archives.py` - Archive parser
2. `data/instances/czech_archives.yaml` - Archive records
3. `scripts/merge_czech_datasets.py` - Merge libraries + archives
4. `data/instances/czech_heritage_complete.yaml` - Final unified dataset
5. `CZECH_ARCHIVES_COMPLETE.md` - Archive harvest report

## Success Criteria

- ✅ Archive data obtained (via email, open data, or API)
- ✅ Parser created and tested
- ✅ Records validated against LinkML schema
- ✅ Libraries + archives merged without duplicates
- ✅ Total ~8,700 Czech heritage institutions in unified dataset
- ✅ GHCID generated for all institutions
- ✅ Ready for Wikidata enrichment
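
For the "merged without duplicates" criterion, one possible approach is to match records on a normalized name plus city. This is a heuristic sketch, not the project's merge logic: the `city` field is an assumption about the record layout, and a shared stable identifier would be a stronger key if both datasets carry one.

```python
import unicodedata


def normalize(text):
    """Lowercase, strip diacritics, and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", text or "")
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.casefold().split())


def merge_without_duplicates(libraries, archives):
    """Append archive records whose (name, city) key is not already present."""
    merged = list(libraries)
    # 'city' is a hypothetical field name; adjust to the real schema.
    seen = {(normalize(r.get("name")), normalize(r.get("city"))) for r in libraries}
    for record in archives:
        key = (normalize(record.get("name")), normalize(record.get("city")))
        if key not in seen:
            seen.add(key)
            merged.append(record)
    return merged


libraries = [{"name": "Národní archiv", "city": "Praha"}]
archives = [
    {"name": "NARODNI  ARCHIV", "city": "praha"},  # same institution, different spelling
    {"name": "Státní okresní archiv Kolín", "city": "Kolín"},
]
print(len(merge_without_duplicates(libraries, archives)))  # 2
```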

## Key Contacts

**Národní archiv (National Archives)**:
- Email: posta@nacr.cz
- Website: https://www.nacr.cz

**Ministry of Interior - Archival Administration**:
- Email: arch@mvcr.cz ⭐ PRIMARY CONTACT
- Website: https://www.mvcr.cz

**National Library (for reference)**:
- Email: eva.svobodova@nkp.cz (ISIL registry contact)
- Website: https://www.nkp.cz

---

- **Current Status**: Awaiting archive data access
- **Next Action**: Send email to arch@mvcr.cz
- **Fallback**: Search open data portal, then investigate API
- **Last Resort**: Web scraping with rate limits