# Czech Archives Database Investigation
**Date**: 2025-11-19
**Status**: Investigation in progress
## Summary
The Czech Republic has **TWO separate heritage databases**:
1. **Libraries** - Already harvested (8,145 institutions from the ADR database)
2. 🔍 **Archives** - Need to harvest (estimated ~560 institutions from ARON portal)
## Czech Archive Infrastructure
### Primary Source: Archiválie na dosah (Archives Within Reach)
**Portal URL**: https://portal.nacr.cz/cro/pro-badatele/
**Institution Database**: https://portal.nacr.cz/aron/institution
**Manager**: Národní archiv (National Archives) + Ministry of Interior
### What We Discovered
1. **"Archiválie na dosah"** replaced the old "Archivní fondy a sbírky v ČR" database
2. Contains **200,000+ archival collections** (fonds and collections)
3. Managed by Národní archiv through **ARON portal** (ARchiv ONline)
4. Updated in real time by archivists across the Czech Republic
5. Includes state archives, municipal archives, university archives, specialized archives, and private archives
### Archive Categories in ARON
According to the portal (https://portal.nacr.cz/cro/pro-badatele/):
**State Archives** (Státní archivy):
- Národní archiv (National Archives)
- Moravský zemský archiv v Brně
- 7 regional state archives (Státní oblastní archivy)
- Archiv bezpečnostních složek (Security Forces Archive)
**Municipal Archives** (Archivy územních samosprávných celků):
- Archiv hlavního města Prahy (Prague City Archive)
- Archiv města Brna
- Archiv města Plzně
- Archiv města Ústí nad Labem
- Archiv města Ostravy
- Many district and municipal archives
**University Archives** (Archivy vysokých škol):
- Univerzita Karlova
- Masarykova univerzita
- Univerzita Palackého
- Ostravská univerzita
- ČVUT Praha
- VUT Brno
- And more...
**Specialized Archives** (Specializované archivy):
- Archiv Kanceláře prezidenta republiky
- Archiv Poslanecké sněmovny (Parliament Archive)
- Archiv Ministerstva zahraničních věcí
- Vojenský historický archiv (Military Historical Archive)
- Archiv Českého rozhlasu (Czech Radio Archive)
- Archiv Národního muzea
- Literární archiv Památníku národního písemnictví
- Archiv Národní galerie
- Archiv Národní knihovny ČR
- And many more...
**Private Archives** (Soukromé archivy):
- Archiv Židovského muzea
- Corporate archives (Škoda, Vítkovice, Plzeňský Prazdroj, etc.)
- Church archives (Biskupství brněnské)
### Current Status
**ARON Portal Observations**:
- Shows **56 pages** of institutions (10 per page)
- **Estimated total: ~560 Czech archive institutions**
- Web interface at https://portal.nacr.cz/aron/institution
- Each institution has UUID (e.g., `/aron/apu/000efd1e-099b-4e8c-ab8c-ec9e47662e7b`)
- Displays institution name, but details require clicking through
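The detail-link shape and page count noted above can be turned into two small helpers. This is a minimal sketch: the path pattern is taken only from the single example URL above, and the function names are hypothetical.

```python
import re
import uuid
from typing import Optional, Tuple

# Path shape assumed from the one example link observed in the portal.
APU_PATH = re.compile(r"/aron/apu/([0-9a-fA-F-]{36})")


def extract_apu_uuid(href: str) -> Optional[str]:
    """Return the canonical UUID from an ARON detail link, or None if absent."""
    match = APU_PATH.search(href)
    if match is None:
        return None
    try:
        return str(uuid.UUID(match.group(1)))
    except ValueError:
        # 36 characters of hex digits and hyphens that do not form a valid UUID.
        return None


def institution_count_bounds(pages: int, per_page: int) -> Tuple[int, int]:
    """Bounds on the total, since the last page may hold 1..per_page entries."""
    return (pages - 1) * per_page + 1, pages * per_page
```

With 56 pages of 10 entries, the bounds are 551 to 560, so "~560" is the upper end of the estimate until the last page is actually counted.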
## Data Access Challenges
### No Obvious Bulk Download
**What we DON'T have yet**:
- No public downloadable XML/CSV export like the one for the ADR library database
- No documented public API for the institution list
- An ARON API exists but appears to be for internal use (DA-COMM system)
- The Czech open data portal (data.gov.cz) does not list an archive institution registry
### Possible Data Access Methods
**Option 1: Contact Ministry of Interior - Archival Administration** ⭐ RECOMMENDED
- **Email**: arch@mvcr.cz
- **Request**: Ask for downloadable export of Czech archive institution registry
- **Precedent**: National Library provided ADR database as open XML download
- **License**: Likely CC0 (public domain) like library database
- **Format**: Probably XML or CSV
**Option 2: Web Scraping ARON Portal**
- Portal: https://portal.nacr.cz/aron/institution
- 56 pages to scrape (paginated results)
- Extract: Institution name, UUID, link to detail page
- Then scrape each detail page for full metadata
- **Cons**: Time-consuming, fragile, may violate terms of service
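If scraping does become necessary, the extraction step for one list page could look roughly like the sketch below. It uses only the standard library and assumes that each institution appears as an `<a>` element whose `href` contains `/aron/apu/`. That markup is an assumption and has not been verified against the live portal. Because the frontend appears to be rendered client-side, the input would likely need to be the rendered HTML from a browser automation tool rather than the raw HTTP response.

```python
from html.parser import HTMLParser


class InstitutionLinkParser(HTMLParser):
    """Collect (uuid, name) pairs from anchors whose href contains /aron/apu/."""

    MARKER = "/aron/apu/"  # assumed link shape, taken from the example URL

    def __init__(self):
        super().__init__()
        self.institutions = []
        self._current_uuid = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if self.MARKER in href:
            # Keep the path segment after the marker, dropping any query string.
            tail = href.split(self.MARKER, 1)[1]
            self._current_uuid = tail.split("?", 1)[0].strip("/")
            self._text_parts = []

    def handle_data(self, data):
        if self._current_uuid is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_uuid is not None:
            # Collapse whitespace in the link text to get a clean name.
            name = " ".join("".join(self._text_parts).split())
            self.institutions.append((self._current_uuid, name))
            self._current_uuid = None


def parse_institution_page(html):
    """Return all (uuid, name) pairs found in one rendered list page."""
    parser = InstitutionLinkParser()
    parser.feed(html)
    parser.close()
    return parser.institutions
```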
**Option 3: Check for API Documentation**
- Found DA-COMM API docs at https://stands.nacr.cz/da-comm/viewapi/
- Appears to be for component viewing, not institution export
- May require authentication token
- **Needs investigation**: Does ARON have a public API?
**Option 4: Check Open Data Portal**
- Search data.gov.cz for "archivy" or "archivní instituce"
- May be published as open data dataset
- **Status**: Preliminary search didn't find it, but worth deeper investigation
## Technical Details
### ARON System Architecture
**ARON** = ARchiv ONline (Archive Online)
- Modern web application (React-based frontend)
- RESTful API backend
- PostgreSQL database (inferred from similar Czech gov systems)
- Real-time updates by archivists
- Integrated with CAM (Centrální Archivní Modul - Central Archive Module)
### Known ARON Endpoints
- `/aron/institution` - List of archive institutions
- `/aron/fund` - Archival fonds
- `/aron/finding-aid` - Finding aids (archival inventories)
- `/aron/originator` - Originators (organizations that created records)
- `/aron/apu/{uuid}` - Individual access point (entity detail)
### Data We Need from Each Institution
From the LinkML schema perspective, extract:
- **Name** (official institution name)
- **Institution Type** (ARCHIVE, MUSEUM, LIBRARY, etc.)
- **Location** (city, address if available)
- **Identifiers** (UUID from ARON, potential ISIL code)
- **Website URL** (if available)
- **Description** (mission, holdings summary)
- **Collections** (archival fonds managed by institution)
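As a working container for these fields during harvesting, a plain dataclass may be enough. The field names below are hypothetical and are not taken from the project's LinkML schema, so they would need mapping to the real slot names before export.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ArchiveInstitution:
    """Intermediate record for one institution (hypothetical field names)."""

    name: str
    institution_type: str = "ARCHIVE"
    city: Optional[str] = None
    address: Optional[str] = None
    aron_uuid: Optional[str] = None
    isil_code: Optional[str] = None  # unknown whether archives have one
    website: Optional[str] = None
    description: Optional[str] = None
    collections: List[str] = field(default_factory=list)

    def to_record(self) -> dict:
        """Serialize to a dict, dropping fields that were never filled in."""
        return {k: v for k, v in asdict(self).items() if v not in (None, [], "")}
```

Dropping empty fields keeps partially scraped records compact, since the list page only exposes the name and UUID until the detail page is fetched.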
### Comparison: Libraries vs Archives
| Feature | Libraries (ADR) | Archives (ARON) |
|---------|-----------------|-----------------|
| **Total Institutions** | 8,145 | ~560 (estimated) |
| **Data Source** | National Library | National Archives + Min. Interior |
| **Download Available** | ✅ Yes (adr.xml.gz) | ❌ Not yet found |
| **Format** | MARC21 XML | Unknown (web portal only) |
| **Coverage** | 100% Czech libraries | All Czech archives + cultural institutions |
| **ISIL Codes** | Yes (siglas) | Unknown if archives have ISIL |
| **Update Frequency** | Periodic (last: 2025-08-01) | Real-time (online updates) |
| **License** | CC0 (public domain) | Unknown (likely open) |
## Next Steps
### Immediate Actions (Priority 1)
1. **Email Národní archiv and Ministry of Interior**
```
To: arch@mvcr.cz
Subject: Request for Czech Archive Institution Registry Export

Dear Archival Administration,

I am working on a global heritage institution database project
(https://github.com/user/glam) and successfully integrated data
from the National Library's ADR database (8,145 libraries).

I would like to request a downloadable export of Czech archive
institutions from the "Archiválie na dosah" (ARON) portal.

Could you provide:
1. Complete list of Czech archive institutions in XML/CSV format
2. Metadata: institution name, type, location, identifiers, website
3. License information (hoping for CC0 like ADR database)

The ADR database was available at:
https://aleph.nkp.cz/data/adr.xml.gz

Is there a similar download for archive institutions?

Thank you!
```
2. **Check Czech Open Data Portal Thoroughly**
- Search data.gov.cz for:
  - "archivy"
  - "archivní instituce"
  - "ARON"
  - "Národní archiv"
  - "Archiválie na dosah"
- Filter by publisher: Národní archiv, Ministerstvo vnitra
3. **Investigate ARON API**
- Check browser network tab when using portal
- Look for API endpoints (likely `/api/institution` or similar)
- Check if API returns JSON data
- Test if API requires authentication
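The API checks in step 3 could be scripted once candidate URLs have been collected from the browser network tab. The sketch below is only a probe helper: it does not assume any particular ARON endpoint exists, and the status labels it returns are made up for this illustration.

```python
import json
import urllib.error
import urllib.request


def classify_response(status, content_type, body):
    """Label a response as JSON, auth-protected, or otherwise not usable."""
    if status in (401, 403):
        return "auth_required"
    if status != 200:
        return "unavailable"
    if "json" in content_type.lower():
        try:
            json.loads(body.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            return "invalid_json"
        return "json"
    return "not_json"


def probe(url, timeout=10.0):
    """Fetch a candidate URL and classify what came back."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "")
            return classify_response(response.status, content_type, response.read())
    except urllib.error.HTTPError as error:
        # HTTP-level failures still tell us whether authentication is the blocker.
        return classify_response(error.code, "", b"")
    except urllib.error.URLError:
        return "unreachable"
```

A result of `auth_required` would point back toward Option 1 (asking for an official export), while `json` would justify a closer look at the response structure.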
### Secondary Actions (Priority 2)
4. **Web Scraping as Last Resort**
- Only if no official export is available
- Use `crawl4ai` or `playwright` to scrape institution list
- Respect rate limits (1 request per second)
- Parse HTML for institution names and UUIDs
- Follow links to detail pages for full metadata
5. **Cross-Reference with ADR Database**
- Check whether the 87 records mentioning "archiv" in the library database are actual archives
- Some may be archive libraries (KI-MU type institutions)
- Deduplicate if institutions appear in both databases
6. **ISIL Code Investigation**
- Check if Czech archives have ISIL codes
- Libraries use "siglas" (e.g., ABA000)
- Archives may use different identifier system
- Contact ISIL agency: https://isil.org
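For the scraping fallback in step 4, the one-request-per-second limit can be enforced with a small helper that sleeps only as long as needed between calls. This is a minimal sketch. The `clock` and `sleep` parameters exist only so the limiter can be tested without real waiting.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until min_interval has elapsed; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay
```

In a scraping loop, `limiter.wait()` would be called immediately before each page request, so slow responses naturally reduce the added delay.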
### Data Integration (Priority 3)
7. **Create Czech Archive Parser**
- Once data is available, create `scripts/parsers/parse_czech_archives.py`
- Similar structure to `parse_czech_isil.py`
- Map archive types to GLAM taxonomy (ARCHIVE, MUSEUM, etc.)
- Generate GHCIDs with CZ-* prefix
8. **Merge Libraries + Archives**
- Combine `czech_institutions.yaml` (libraries) with new archive data
- Deduplicate by name/location
- Create unified `czech_heritage_institutions.yaml`
- 8,145 libraries + ~560 archives = **~8,700 total Czech institutions**
9. **Enrich with Wikidata**
- Query Wikidata for Czech archives
- Fuzzy match by name and city
- Add Q-numbers to identifiers
- Update GHCIDs with Q-numbers if needed
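Steps 8 and 9 both depend on comparing institution names that differ in case, spacing, or diacritics. The sketch below shows one possible approach using only the standard library: exact matching on a normalized (name, city) key for deduplication, and `difflib` similarity for Wikidata candidates. The record keys (`name`, `city`, `label`, `qid`) and the 0.85 threshold are assumptions for illustration, not values from the existing parsers.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Strip diacritics, fold case, and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def merge_institutions(libraries, archives):
    """Keep the first record seen for each normalized (name, city) key."""
    merged = {}
    for record in [*libraries, *archives]:
        key = (normalize(record["name"]), normalize(record.get("city") or ""))
        merged.setdefault(key, record)
    return list(merged.values())


def best_wikidata_match(record, candidates, threshold=0.85):
    """Return the QID of the most similar same-city candidate, or None."""
    name = normalize(record["name"])
    city = normalize(record.get("city") or "")
    best_qid, best_score = None, 0.0
    for candidate in candidates:
        if normalize(candidate.get("city") or "") != city:
            continue  # only compare institutions in the same city
        score = SequenceMatcher(None, name, normalize(candidate["label"])).ratio()
        if score > best_score:
            best_qid, best_score = candidate["qid"], score
    return best_qid if best_score >= threshold else None
```

Keeping the first record drops whatever the duplicate carried, so a real merge would probably combine fields from both sources instead. Exact-key matching also misses name variants, which is where the fuzzy comparison could be reused.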
## Expected Outcomes
### After Archive Data Harvest
**Czech Heritage Institution Dataset**:
- **Total Institutions**: ~8,700 (8,145 libraries + 560 archives)
- **Data Quality**: TIER_1_AUTHORITATIVE (from national registries)
- **Geographic Coverage**: All 14 Czech regions
- **Type Distribution** (of ~8,705 institutions):
  - Libraries: 93.6% (8,145)
  - Archives: 6.4% (~560), which includes:
    - Museums: ~50 (from specialized archives)
    - Galleries: ~20 (from specialized archives)
    - Universities: ~10 (university archives)
    - Mixed: varies
### Contribution to Global Dataset
Czech Republic would become one of the **most complete** country datasets:
- 🇳🇱 Netherlands: ~1,400 institutions (current best)
- 🇨🇿 Czech Republic: ~8,700 institutions (if archives added) 🏆
- 🇧🇷 Brazil: ~3,000+ institutions (in progress)
- 🇦🇷 Argentina: ~2,000 institutions
- 🇦🇹 Austria: ~1,800 institutions
- 🇨🇦 Canada: ~800 institutions
## References
**Czech National Archives**:
- Website: https://www.nacr.cz
- ARON Portal: https://portal.nacr.cz/cro
- Institution List: https://portal.nacr.cz/aron/institution
- Email: posta@nacr.cz
**Ministry of Interior - Archival Administration**:
- Email: arch@mvcr.cz
- Website: https://www.mvcr.cz (archival section)
**Czech Open Data Portal**:
- Website: https://data.gov.cz
- Search: https://data.gov.cz/datasets
**Related Documentation**:
- CZECH_ISIL_COMPLETE_REPORT.md - Library harvest report
- CZECH_ISIL_NEXT_STEPS.md - Quick start guide
- SESSION_SUMMARY_20251119_CZECH_COMPLETE.md - Session summary
---
**Status**: Awaiting response from arch@mvcr.cz
**Next Session**: Check email for response, investigate open data portal, or begin API investigation