glam/SESSION_SUMMARY_20251119_ARCHIVES_DISCOVERY.md

# Session Summary: Czech Archives Discovery (2025-11-19)

## What We Accomplished

### 1. Identified Archive Data Gap ✅
- Confirmed Czech ADR database contains **libraries only** (no archives)
- Discovered that the 87 institutions with "archiv" in their name are archive libraries or collections, not archives
- Found separate Czech archive database: **"Archiválie na dosah"** (Archives Within Reach)

### 2. Located Czech Archive Infrastructure ✅
- **Portal**: https://portal.nacr.cz/cro/pro-badatele/
- **Institution List**: https://portal.nacr.cz/aron/institution
- **Manager**: Národní archiv (National Archives) + Ministry of Interior
- **Estimated Archives**: ~560 institutions (56 pages × 10 per page)

### 3. Categorized Czech Archive Types ✅
Documented archive categories from ARON portal:
- State archives (Národní archiv, regional archives, security archive)
- Municipal archives (Prague, Brno, Plzeň, Ostrava, etc.)
- University archives (Charles University, Masaryk University, etc.)
- Specialized archives (Parliament, Foreign Ministry, Military, Radio, Museums, Galleries, Libraries)
- Private archives (Jewish Museum, corporate archives, church archives)

### 4. Identified Data Access Challenges ⚠️
**Problem**: No obvious bulk download available
- ❌ No public XML/CSV export like library ADR database
- ❌ No documented public API endpoint
- ❌ Not found in Czech open data portal (preliminary search)
- ✅ Web interface exists but requires scraping 56 pages

### 5. Developed Action Plan 📋
Created comprehensive investigation strategy:
- **Priority 1**: Email Ministry of Interior (arch@mvcr.cz) for export
- **Priority 2**: Deep search Czech open data portal
- **Priority 3**: Investigate ARON API through DevTools
- **Priority 4**: Web scraping as last resort

## Files Created

### Documentation
- ✅ `CZECH_ARCHIVES_INVESTIGATION.md` (6.5 KB)
  - Detailed investigation report
  - System architecture analysis
  - Comparison: libraries vs archives
  - Expected outcomes

- ✅ `CZECH_ARCHIVES_NEXT_ACTIONS.md` (5.2 KB)
  - Quick reference guide
  - Email template for Ministry
  - Step-by-step instructions
  - Success criteria

### Related Files (From Earlier Session)
- `CZECH_ISIL_COMPLETE_REPORT.md` - Library harvest report
- `CZECH_ISIL_NEXT_STEPS.md` - Library processing guide
- `data/instances/czech_institutions.yaml` - 8,145 libraries (8.8 MB)

## Key Findings

### Archive Distribution Estimate
Based on ARON portal pagination:
```
Total pages: 56
Per page: 10
Estimated total: ~560 Czech archive institutions
```

### Czech Heritage Landscape
```
Libraries: 8,145 (ADR database) ✅ COMPLETE
Archives: ~560 (ARON portal) ⏳ PENDING DATA ACCESS
Total: ~8,700 Czech heritage institutions
```
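
The totals above can be re-derived with a few lines of arithmetic. This is a minimal sketch using only the counts quoted in this summary; the archive figure is an upper bound, because the last of the 56 pages may be only partly filled.

```python
# Counts taken from this summary; the archive figure is an estimate.
LIBRARIES = 8_145           # ADR database, already harvested
PAGES, PER_PAGE = 56, 10    # ARON institution list pagination

archives_max = PAGES * PER_PAGE             # every page full
archives_min = (PAGES - 1) * PER_PAGE + 1   # last page holds a single entry
total = LIBRARIES + archives_max

library_share = round(100 * LIBRARIES / total, 1)
archive_share = round(100 * archives_max / total, 1)

print(f"Archives: {archives_min}-{archives_max}")
print(f"Total: {total}")
print(f"Libraries {library_share}% / Archives {archive_share}%")
```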

### Data Quality Comparison

| Feature | Libraries (ADR) | Archives (ARON) |
|---------|-----------------|-----------------|
| Data Source | National Library | National Archives |
| Format | MARC21 XML | Unknown (web only) |
| Download | ✅ Available | ❌ Not found |
| Records | 8,145 | ~560 (estimated) |
| GPS Coverage | 81.3% | Unknown |
| ISIL Codes | Yes (siglas) | Unknown |
| License | CC0 | Unknown |

## Expected Czech Dataset

### After Archive Integration
- **Total Institutions**: ~8,700
  - Libraries: 8,145 (93.6%)
  - Archives: ~560 (6.4%)
  - Mixed: ~50 (museums, galleries, etc.)
- **Geographic Coverage**: All 14 Czech regions
- **Data Quality**: TIER_1_AUTHORITATIVE
- **Global Ranking**: 2nd largest national dataset (after the Netherlands) once archives are added

## Next Steps

### Immediate (Next Session)
1. **Send email to arch@mvcr.cz** requesting archive data export
   - Use template in `CZECH_ARCHIVES_NEXT_ACTIONS.md`
   - Reference ADR database as precedent
   - Request CC0 license

2. **Search Czech open data portal thoroughly**
   - Keywords: "archivy", "archivní instituce", "ARON"
   - Filter by: Národní archiv, Ministerstvo vnitra
   - URL: https://data.gov.cz/datasets

3. **Investigate ARON API**
   - Use browser DevTools on https://portal.nacr.cz/aron/institution
   - Look for `/api/` endpoints
   - Test pagination and response format
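
One way to make the DevTools investigation repeatable is to export the Network tab as a HAR file and filter it for JSON responses. The sketch below assumes a HAR export saved locally; the helper names (`candidate_api_calls`, `load_har`) are hypothetical, and nothing here presumes which endpoints ARON actually exposes.

```python
import json
from urllib.parse import urlparse


def load_har(path: str) -> dict:
    """Read a HAR file exported from the browser's Network tab."""
    with open(path, encoding="utf-8-sig") as f:
        return json.load(f)


def candidate_api_calls(har: dict, host: str = "portal.nacr.cz") -> list[dict]:
    """Return requests to `host` whose response MIME type mentions JSON."""
    hits = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        response = entry.get("response", {})
        url = request.get("url", "")
        mime = response.get("content", {}).get("mimeType", "")
        if urlparse(url).hostname == host and "json" in mime.lower():
            hits.append({
                "method": request.get("method", "GET"),
                "url": url,
                "status": response.get("status"),
            })
    return hits
```

Running `candidate_api_calls(load_har("aron.har"))` after paging through the institution list should surface any backend calls worth testing by hand, if such calls exist.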

### Secondary (If No Official Export)
4. **Web scraping fallback**
   - Scrape 56 pages of institution list
   - Extract: name, UUID, detail page link
   - Follow links for full metadata
   - Respect rate limits (1 req/sec)
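
The scraping fallback could be sketched roughly as follows. This is an outline under stated assumptions, not a tested scraper: the pagination URL template is left to the caller because the real query parameter is unknown, and since ARON is a React application the served HTML may not contain the detail links without JavaScript rendering. The `/aron/apu/{uuid}` pattern comes from the endpoint list in this summary; all function names are hypothetical.

```python
import re
import time
from typing import Callable, Iterable, Iterator
from urllib.request import Request, urlopen

# Detail links are assumed to follow the /aron/apu/{uuid} pattern.
APU_LINK = re.compile(
    r"/aron/apu/([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})"
)


def page_urls(template: str, pages: int = 56, start: int = 1) -> Iterator[str]:
    """Yield list-page URLs; `template` must contain a {page} placeholder."""
    for page in range(start, start + pages):
        yield template.format(page=page)


def http_get(url: str) -> str:
    """Fetch one URL with an identifying User-Agent (placeholder contact)."""
    req = Request(url, headers={"User-Agent": "glam-research (contact: <your email>)"})
    with urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_uuids(html: str) -> list[str]:
    """Return unique entity UUIDs in first-seen order, lowercased."""
    return list(dict.fromkeys(m.lower() for m in APU_LINK.findall(html)))


def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], str] = http_get,
    min_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[tuple[str, str]]:
    """Fetch URLs one at a time, starting at most one request per interval."""
    last = None
    for url in urls:
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield url, fetch(url)
```

`fetch`, `sleep`, and `clock` are injectable so the throttling logic can be exercised offline before any request is sent to the portal.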

### After Archive Data Obtained
5. **Create parser**: `scripts/parsers/parse_czech_archives.py`
6. **Validate records**: Check LinkML schema compliance
7. **Merge datasets**: Combine libraries + archives
8. **Deduplicate**: Check for institutions in both databases
9. **Enrich with Wikidata**: Add Q-numbers
10. **Generate GHCIDs**: CZ-* prefixes
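
For the deduplication step, one simple starting point is to compare normalized names across the two datasets. The sketch below assumes each record is a mapping with a `name` key, as in the library YAML; the helper names are hypothetical, and matches should be treated as candidates for manual review rather than confirmed duplicates.

```python
import unicodedata
from collections import defaultdict


def name_key(name: str) -> str:
    """Case-fold, strip diacritics, and collapse punctuation and whitespace."""
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    spaced = "".join(ch if ch.isalnum() else " " for ch in no_marks)
    return " ".join(spaced.split())


def find_overlaps(libraries: list[dict], archives: list[dict]) -> list[tuple[dict, dict]]:
    """Pair library and archive records whose normalized names are identical."""
    by_key = defaultdict(list)
    for lib in libraries:
        key = name_key(lib.get("name") or "")
        if key:  # skip records without a usable name
            by_key[key].append(lib)

    pairs = []
    for arc in archives:
        key = name_key(arc.get("name") or "")
        for lib in by_key.get(key, []):
            pairs.append((lib, arc))
    return pairs
```

Exact-key matching will miss abbreviations and word-order variants, so a fuzzier comparison or an identifier-based match may be needed once the archive data shows which identifiers are available.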

## Technical Discoveries

### ARON System
- **ARON** = ARchiv ONline
- React-based web application
- RESTful API backend (not publicly documented)
- Real-time updates by archivists
- Integrated with CAM (Central Archive Module)

### Known Endpoints
- `/aron/institution` - Institution list
- `/aron/fund` - Archival fonds
- `/aron/finding-aid` - Finding aids
- `/aron/originator` - Record creators
- `/aron/apu/{uuid}` - Entity details
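
A small helper can keep these paths in one place. This sketch assumes all of them live on `portal.nacr.cz`, as the institution list does; they are web interface routes observed in the browser, not documented API endpoints, and the function names are hypothetical.

```python
import uuid
from urllib.parse import urljoin

BASE = "https://portal.nacr.cz"  # host assumed from the institution list URL

ARON_PATHS = {
    "institution": "/aron/institution",
    "fund": "/aron/fund",
    "finding_aid": "/aron/finding-aid",
    "originator": "/aron/originator",
}


def list_url(kind: str) -> str:
    """Return the listing page URL for one of the known entity kinds."""
    return urljoin(BASE, ARON_PATHS[kind])


def entity_url(entity_id: str) -> str:
    """Return the detail page URL; raises ValueError if the ID is not a UUID."""
    canonical = str(uuid.UUID(entity_id))
    return urljoin(BASE, f"/aron/apu/{canonical}")
```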

### API Potential
- DA-COMM API found at https://stands.nacr.cz/da-comm/viewapi/
- Appears to be for component viewing, not data export
- May require authentication
- Needs further investigation

## Outstanding Questions

1. **Does archive data export exist?**
   - Awaiting response from arch@mvcr.cz

2. **Do Czech archives have ISIL codes?**
   - Libraries use "siglas" format
   - Archives may use different system
   - Need to check with ISIL agency

3. **What's the license for archive data?**
   - Library data is CC0
   - Archives likely similar but unconfirmed

4. **How many archives actually exist?**
   - Portal shows ~560
   - May be more not in ARON system
   - Need official count
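
For question 2, a quick format check can flag which identifiers in the archive data at least look like Czech ISIL codes. This sketch assumes the general ISO 15511 shape (country prefix, mandatory hyphen, identifier of up to 11 characters from letters, digits, `:`, `/`, and `-`); the sample values are illustrative only, and a passing check does not mean the code is actually registered.

```python
import re

# Assumed shape: "CZ-" followed by 1 to 11 permitted characters.
CZ_ISIL = re.compile(r"CZ-[A-Za-z0-9:/-]{1,11}")


def looks_like_cz_isil(value: str) -> bool:
    """True if `value` matches the assumed Czech ISIL pattern."""
    return CZ_ISIL.fullmatch(value.strip()) is not None


# Illustrative values, not taken from the archive data
for candidate in ["CZ-ABA001", "ABA001", "CZ-"]:
    print(candidate, looks_like_cz_isil(candidate))
```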

## Success Metrics

### Session Goals
- ✅ Identified archive database location
- ✅ Estimated archive count
- ✅ Documented archive categories
- ✅ Created action plan
- ⏳ Awaiting data access

### Project Goals (Pending)
- ⏳ Obtain archive data export
- ⏳ Parse and validate records
- ⏳ Merge with library dataset
- ⏳ Complete Czech heritage dataset (~8,700 institutions)
- ⏳ Become 2nd largest national dataset globally

## Resources

### Key Contacts
- **Ministry of Interior**: arch@mvcr.cz ⭐ PRIMARY
- **Národní archiv**: posta@nacr.cz
- **National Library** (reference): eva.svobodova@nkp.cz

### URLs
- **ARON Portal**: https://portal.nacr.cz/cro
- **Institution List**: https://portal.nacr.cz/aron/institution
- **National Archives**: https://www.nacr.cz
- **Open Data**: https://data.gov.cz
- **Ministry of Interior**: https://www.mvcr.cz

### Documentation
- `CZECH_ARCHIVES_INVESTIGATION.md` - Full investigation report
- `CZECH_ARCHIVES_NEXT_ACTIONS.md` - Quick start guide
- `CZECH_ISIL_COMPLETE_REPORT.md` - Library harvest results
- `AGENTS.md` - Project instructions for AI agents

## Timeline Estimate

**Best Case** (Official export provided):
- Email response: 3-5 business days
- Download + parse: 1-2 hours
- Merge + validate: 1-2 hours
- **Total: ~1 week**

**Medium Case** (Found in open data):
- Deep portal search: 1-2 hours
- Download + parse: 1-2 hours
- **Total: Same day**

**Worst Case** (Web scraping):
- Write scraper: 2-3 hours
- Run scraper: 2-3 hours (56 pages + details)
- Parse + validate: 1-2 hours
- **Total: 1-2 days**
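
As a rough sanity check on the scraper run time, the throttle alone can be costed out. The sketch assumes one request per list page and one per institution detail page at 1 request per second; retries, slow responses, or JavaScript rendering would add to this, which the 2-3 hour figure leaves room for.

```python
LIST_PAGES = 56
DETAIL_PAGES = 560          # estimated, one detail page per institution
SECONDS_PER_REQUEST = 1.0   # from the 1 req/sec rate limit

requests_total = LIST_PAGES + DETAIL_PAGES
minutes = requests_total * SECONDS_PER_REQUEST / 60

print(f"{requests_total} requests, about {minutes:.1f} minutes of throttled waiting")
```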

## Commands for Verification

### Check library data
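```bash
# Count records, names containing "archiv", and records typed as ARCHIVE.
# Assumes the YAML file is a top-level list of mappings; .get() keeps
# records with a missing "name" or "institution_type" from raising KeyError.
python3 - <<'EOF'
import yaml

with open('data/instances/czech_institutions.yaml', encoding='utf-8') as f:
    data = yaml.safe_load(f)

archiv_mentions = sum(1 for i in data if 'archiv' in (i.get('name') or '').lower())
archive_types = sum(1 for i in data if i.get('institution_type') == 'ARCHIVE')

print(f'Czech libraries: {len(data)}')
print(f'"Archiv" mentions: {archiv_mentions}')
print(f'Archive type count: {archive_types}')
EOF
```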

### List files created
```bash
ls -lh CZECH_ARCHIVES*.md
```

## Notes

- The Czech Republic has **TWO separate** heritage databases (libraries + archives)
- This is common in many countries with specialized systems
- Library database was straightforward (single XML download)
- Archive database requires more investigation (web portal only)
- Both are managed by national institutions (good data quality)
- License is likely CC0 (following Czech open data practices) but remains unconfirmed

## Status

- **Current Phase**: Data acquisition
- **Blocking Issue**: No bulk download found for archives
- **Next Action**: Email arch@mvcr.cz requesting export
- **Fallback Plans**: Open data search → API investigation → Web scraping
- **Expected Resolution**: 1 week (if email successful)

---

- **Session Duration**: ~2 hours
- **Files Created**: 2 (investigation report + action guide)
- **Research Completed**: Archive infrastructure mapping
- **Data Acquired**: 0 (awaiting archive export)
- **Next Session Focus**: Data acquisition and parsing