# Czech Archives Database Investigation

**Date**: 2025-11-19
**Status**: Investigation in progress

## Summary

The Czech Republic has **TWO separate heritage databases**:
1. ✅ **Libraries** - Already harvested (8,145 institutions from ADR database)
2. 🔍 **Archives** - Need to harvest (estimated ~560 institutions from ARON portal)

## Czech Archive Infrastructure

### Primary Source: Archiválie na dosah (Archives Within Reach)

**Portal URL**: https://portal.nacr.cz/cro/pro-badatele/
**Institution Database**: https://portal.nacr.cz/aron/institution
**Manager**: Národní archiv (National Archives) + Ministry of Interior

### What We Discovered

1. **"Archiválie na dosah"** replaced the old "Archivní fondy a sbírky v ČR" database
2. Contains **200,000+ archival collections** (fonds and collections)
3. Managed by Národní archiv through **ARON portal** (ARchiv ONline)
4. Updated in real-time by archivists across Czech Republic
5. Includes state archives, municipal archives, university archives, specialized archives, and private archives

### Archive Categories in ARON

According to the portal (https://portal.nacr.cz/cro/pro-badatele/):

**State Archives** (Státní archivy):
- Národní archiv (National Archives)
- Moravský zemský archiv v Brně
- 7 regional state archives (Státní oblastní archivy)
- Archiv bezpečnostních složek (Security Forces Archive)

**Municipal Archives** (Archivy územních samosprávných celků):
- Archiv hlavního města Prahy (Prague City Archive)
- Archiv města Brna
- Archiv města Plzně
- Archiv města Ústí nad Labem
- Archiv města Ostravy
- Many district and municipal archives

**University Archives** (Archivy vysokých škol):
- Univerzita Karlova
- Masarykova univerzita
- Univerzita Palackého
- Ostravská univerzita
- ČVUT Praha
- VUT Brno
- And more...

**Specialized Archives** (Specializované archivy):
- Archiv Kanceláře prezidenta republiky
- Archiv Poslanecké sněmovny (Parliament Archive)
- Archiv Ministerstva zahraničních věcí
- Vojenský historický archiv (Military Historical Archive)
- Archiv Českého rozhlasu (Czech Radio Archive)
- Archiv Národního muzea
- Literární archiv Památníku národního písemnictví
- Archiv Národní galerie
- Archiv Národní knihovny ČR
- And many more...

**Private Archives** (Soukromé archivy):
- Archiv Židovského muzea
- Corporate archives (Škoda, Vítkovice, Plzeňský Prazdroj, etc.)
- Church archives (Biskupství brněnské)

### Current Status

**ARON Portal Observations**:
- Shows **56 pages** of institutions (10 per page)
- **Estimated total: ~560 Czech archive institutions**
- Web interface at https://portal.nacr.cz/aron/institution
- Each institution has UUID (e.g., `/aron/apu/000efd1e-099b-4e8c-ab8c-ec9e47662e7b`)
- Displays institution name, but details require clicking through
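
The ~560 estimate follows directly from the page count. As a small worked check, assuming every page except possibly the last is full, the true number lies in a narrow range:

```python
# Figures observed on the ARON institution list (see the bullets above).
PAGES = 56
PER_PAGE = 10

# The last page may be only partly filled, so the count is a range, not a point.
lower_bound = (PAGES - 1) * PER_PAGE + 1  # last page holds a single entry
upper_bound = PAGES * PER_PAGE            # last page is completely full

print(f"Between {lower_bound} and {upper_bound} institutions")
```

The exact figure only becomes known once the last page has been inspected.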

## Data Access Challenges

### No Obvious Bulk Download

❌ **What we DON'T have yet**:
- No public downloadable XML/CSV export like ADR library database
- No documented public API for institution list
- ARON API exists but appears to be for internal use (DA-COMM system)
- Czech open data portal (data.gov.cz) doesn't list archive institution registry

### Possible Data Access Methods

**Option 1: Contact Ministry of Interior - Archival Administration** ⭐ RECOMMENDED
- **Email**: arch@mvcr.cz
- **Request**: Ask for downloadable export of Czech archive institution registry
- **Precedent**: National Library provided ADR database as open XML download
- **License**: Likely CC0 (public domain) like library database
- **Format**: Probably XML or CSV

**Option 2: Web Scraping ARON Portal**
- Portal: https://portal.nacr.cz/aron/institution
- 56 pages to scrape (paginated results)
- Extract: Institution name, UUID, link to detail page
- Then scrape each detail page for full metadata
- **Cons**: Time-consuming, fragile, may violate terms of service
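
If scraping does become necessary, the extraction step in Option 2 could look roughly like the sketch below. It assumes that each institution appears in the rendered HTML as an anchor whose `href` follows the `/aron/apu/<uuid>` pattern noted earlier. The sample markup is hypothetical, and the real page structure has not been verified.

```python
import re
from html.parser import HTMLParser

# Assumption: institution links follow the /aron/apu/<uuid> pattern seen on the portal.
APU_HREF = re.compile(
    r"/aron/apu/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)


class InstitutionLinkParser(HTMLParser):
    """Collect (uuid, name) pairs from anchors that point at /aron/apu/<uuid>."""

    def __init__(self):
        super().__init__()
        self.institutions = []  # list of (uuid, name) tuples
        self._uuid = None       # UUID of the anchor currently open, if any
        self._text = []         # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = APU_HREF.search(href)
        if match:
            self._uuid = match.group(1)
            self._text = []

    def handle_data(self, data):
        if self._uuid is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._uuid is not None:
            # Collapse internal whitespace in the link text.
            name = " ".join("".join(self._text).split())
            self.institutions.append((self._uuid, name))
            self._uuid = None


def parse_institution_list(html):
    """Return every (uuid, name) pair found in one list page."""
    parser = InstitutionLinkParser()
    parser.feed(html)
    parser.close()
    return parser.institutions


# Hypothetical markup for illustration only.
sample = """
<ul>
  <li><a href="/aron/apu/000efd1e-099b-4e8c-ab8c-ec9e47662e7b">Example archive</a></li>
  <li><a href="/aron/fund">Archival fonds</a></li>
</ul>
"""
print(parse_institution_list(sample))
```

Because the frontend appears to be React-based, the list may only exist after client-side rendering. In that case the HTML would have to come from a headless browser rather than a plain HTTP request, and only the parsing step above would carry over.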

**Option 3: Check for API Documentation**
- Found DA-COMM API docs at https://stands.nacr.cz/da-comm/viewapi/
- Appears to be for component viewing, not institution export
- May require authentication token
- **Needs investigation**: Does ARON have public API?

**Option 4: Check Open Data Portal**
- Search data.gov.cz for "archivy" or "archivní instituce"
- May be published as open data dataset
- **Status**: Preliminary search didn't find it, but worth deeper investigation

## Technical Details

### ARON System Architecture

**ARON** = ARchiv ONline (Archive Online)
- Modern web application (React-based frontend)
- RESTful API backend
- PostgreSQL database (inferred from similar Czech gov systems)
- Real-time updates by archivists
- Integrated with CAM (Centrální Archivní Modul - Central Archive Module)

### Known ARON Endpoints

- `/aron/institution` - List of archive institutions
- `/aron/fund` - Archival fonds
- `/aron/finding-aid` - Finding aids (archival inventories)
- `/aron/originator` - Originators (organizations that created records)
- `/aron/apu/{uuid}` - Individual access point (entity detail)
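
As an illustration of how the `/aron/apu/{uuid}` path could be used, the sketch below builds an absolute detail-page URL from a UUID and rejects malformed identifiers. The helper name `detail_url` is hypothetical. The base URL is taken from the portal links above.

```python
import uuid
from urllib.parse import urljoin

BASE_URL = "https://portal.nacr.cz"


def detail_url(apu_id):
    """Return the detail-page URL for an ARON UUID; raise ValueError if it is malformed."""
    canonical = str(uuid.UUID(apu_id))  # validates and normalizes to lower-case
    return urljoin(BASE_URL, f"/aron/apu/{canonical}")


print(detail_url("000efd1e-099b-4e8c-ab8c-ec9e47662e7b"))
```

Validating identifiers before building URLs keeps a scraping bug from turning into a stream of requests for pages that cannot exist.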

### Data We Need from Each Institution

From LinkML schema perspective, extract:
- **Name** (official institution name)
- **Institution Type** (ARCHIVE, MUSEUM, LIBRARY, etc.)
- **Location** (city, address if available)
- **Identifiers** (UUID from ARON, potential ISIL code)
- **Website URL** (if available)
- **Description** (mission, holdings summary)
- **Collections** (archival fonds managed by institution)
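
The fields listed above could be held in a simple record type while harvesting. The sketch below is only an illustration: the class and attribute names are hypothetical and are not taken from the project's LinkML schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArchiveInstitution:
    """One harvested institution; attribute names are illustrative, not the LinkML schema."""

    name: str
    institution_type: str = "ARCHIVE"
    city: Optional[str] = None
    address: Optional[str] = None
    identifiers: dict = field(default_factory=dict)  # e.g. {"ARON_UUID": "..."}
    website: Optional[str] = None
    description: Optional[str] = None
    collections: list = field(default_factory=list)

    def missing_fields(self):
        """Names of fields that are still empty, useful for tracking harvest gaps."""
        return [key for key, value in vars(self).items() if value in (None, {}, [])]


record = ArchiveInstitution(
    name="Example archive",
    city="Praha",
    identifiers={"ARON_UUID": "000efd1e-099b-4e8c-ab8c-ec9e47662e7b"},
)
print(record.missing_fields())
```

A list page alone yields only the name and UUID, so `missing_fields` gives a quick view of what the detail pages still need to supply.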

### Comparison: Libraries vs Archives

| Feature | Libraries (ADR) | Archives (ARON) |
|---------|-----------------|-----------------|
| **Total Institutions** | 8,145 | ~560 (estimated) |
| **Data Source** | National Library | National Archives + Min. Interior |
| **Download Available** | ✅ Yes (adr.xml.gz) | ❌ Not yet found |
| **Format** | MARC21 XML | Unknown (web portal only) |
| **Coverage** | 100% Czech libraries | All Czech archives + cultural institutions |
| **ISIL Codes** | Yes (siglas) | Unknown if archives have ISIL |
| **Update Frequency** | Periodic (last: 2025-08-01) | Real-time (online updates) |
| **License** | CC0 (public domain) | Unknown (likely open) |

## Next Steps

### Immediate Actions (Priority 1)

1. **Email Národní archiv and Ministry of Interior** ⭐
   ```
   To: arch@mvcr.cz
   Subject: Request for Czech Archive Institution Registry Export

   Dear Archival Administration,

   I am working on a global heritage institution database project
   (https://github.com/user/glam) and successfully integrated data
   from the National Library's ADR database (8,145 libraries).

   I would like to request a downloadable export of Czech archive
   institutions from the "Archiválie na dosah" (ARON) portal.

   Could you provide:
   1. Complete list of Czech archive institutions in XML/CSV format
   2. Metadata: institution name, type, location, identifiers, website
   3. License information (hoping for CC0 like ADR database)

   The ADR database was available at:
   https://aleph.nkp.cz/data/adr.xml.gz

   Is there a similar download for archive institutions?

   Thank you!
   ```

2. **Check Czech Open Data Portal Thoroughly**
   - Search data.gov.cz for:
     - "archivy"
     - "archivní instituce"
     - "ARON"
     - "Národní archiv"
     - "Archiválie na dosah"
   - Filter by publisher: Národní archiv, Ministerstvo vnitra

3. **Investigate ARON API**
   - Check browser network tab when using portal
   - Look for API endpoints (likely `/api/institution` or similar)
   - Check if API returns JSON data
   - Test if API requires authentication
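
For the API investigation in step 3, each candidate endpoint seen in the network tab can be recorded with its status code, `Content-Type` header, and body, then classified consistently. The sketch below is one possible way to do that. The function name and the result labels are hypothetical, and no real ARON endpoint is assumed.

```python
import json


def classify_response(status, content_type, body):
    """Summarize what a probe response suggests about an endpoint."""
    if status in (401, 403):
        return "auth-required"
    if status != 200:
        return "unavailable"
    if "json" in content_type.lower():
        try:
            json.loads(body)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            return "invalid-json"
        return "json"
    return "not-json"


# Example observations, written by hand for illustration.
print(classify_response(200, "application/json; charset=utf-8", '{"items": []}'))
print(classify_response(401, "text/html", ""))
print(classify_response(200, "text/html", "<html></html>"))
```

Keeping the classification separate from the HTTP call means the same function works on responses copied from the browser and on responses fetched by a script.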

### Secondary Actions (Priority 2)

4. **Web Scraping as Last Resort**
   - Only if no official export is available
   - Use `crawl4ai` or `playwright` to scrape institution list
   - Respect rate limits (1 request per second)
   - Parse HTML for institution names and UUIDs
   - Follow links to detail pages for full metadata

5. **Cross-Reference with ADR Database**
   - Check if 87 "archiv" mentions in library database are real archives
   - Some may be archive libraries (KI-MU type institutions)
   - Deduplicate if institutions appear in both databases

6. **ISIL Code Investigation**
   - Check if Czech archives have ISIL codes
   - Libraries use "siglas" (e.g., ABA000)
   - Archives may use different identifier system
   - Contact ISIL agency: https://isil.org
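
For the scraping fallback in step 4, the one-request-per-second limit can be enforced independently of whichever fetching tool is used. The following is a minimal sketch: `fetch_politely` is a hypothetical helper, and `fetch` stands for any callable that retrieves one URL. The clock and sleep functions are injectable so the pacing can be tested without waiting.

```python
import time


def fetch_politely(urls, fetch, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Call fetch(url) for each URL, leaving at least min_interval seconds between calls."""
    results = {}
    last_call = None
    for url in urls:
        if last_call is not None:
            # Sleep only for the part of the interval that has not yet elapsed.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results[url] = fetch(url)
    return results


# Stand-in fetch function; min_interval=0.0 keeps the demo instant.
pages = fetch_politely(
    ["page-1", "page-2"],
    fetch=lambda url: f"<html>{url}</html>",
    min_interval=0.0,
)
print(pages)
```

Against the real portal, `min_interval` would stay at its default of one second, and `fetch` would wrap the chosen browser or HTTP client.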

### Data Integration (Priority 3)

7. **Create Czech Archive Parser**
   - Once data is available, create `scripts/parsers/parse_czech_archives.py`
   - Similar structure to `parse_czech_isil.py`
   - Map archive types to GLAM taxonomy (ARCHIVE, MUSEUM, etc.)
   - Generate GHCIDs with CZ-* prefix

8. **Merge Libraries + Archives**
   - Combine `czech_institutions.yaml` (libraries) with new archive data
   - Deduplicate by name/location
   - Create unified `czech_heritage_institutions.yaml`
   - 8,145 libraries + ~560 archives = **~8,700 total Czech institutions**

9. **Enrich with Wikidata**
   - Query Wikidata for Czech archives
   - Fuzzy match by name and city
   - Add Q-numbers to identifiers
   - Update GHCIDs with Q-numbers if needed
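
Both the deduplication in step 8 and the Wikidata matching in step 9 depend on comparing names and cities that may differ in diacritics, case, or spacing. The sketch below shows one way to do this with the standard library. The record layout, the 0.9 threshold, and the placeholder identifiers are assumptions for illustration, not project conventions.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case, strip diacritics, and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.lower().split())


def similarity(a, b):
    """Similarity ratio between two names after normalization (0.0 to 1.0)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(record, candidates, threshold=0.9):
    """Return the most similar candidate in the same city, or None below the threshold."""
    scored = [
        (similarity(record["name"], candidate["name"]), candidate)
        for candidate in candidates
        if normalize(candidate["city"]) == normalize(record["city"])
    ]
    if not scored:
        return None
    score, candidate = max(scored, key=lambda pair: pair[0])
    return candidate if score >= threshold else None


# Placeholder identifiers; real Q-numbers would come from a Wikidata query.
candidates = [
    {"name": "Archiv mesta Brna", "city": "Brno", "qid": "Q_PLACEHOLDER_1"},
    {"name": "Archiv města Plzně", "city": "Plzeň", "qid": "Q_PLACEHOLDER_2"},
]
record = {"name": "Archiv města Brna", "city": "Brno"}
print(best_match(record, candidates))
```

Requiring the same city before comparing names keeps similarly named municipal archives from being merged. The threshold would need tuning against real data.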

## Expected Outcomes

### After Archive Data Harvest

**Czech Heritage Institution Dataset**:
- **Total Institutions**: ~8,700 (8,145 libraries + ~560 archives)
- **Data Quality**: TIER_1_AUTHORITATIVE (from national registries)
- **Geographic Coverage**: All 14 Czech regions
- **Type Distribution**:
  - Libraries: ~93.6% (8,145)
  - Archives: ~6.4% (~560)
  - Museums: ~50 (from specialized archives)
  - Galleries: ~20 (from specialized archives)
  - Universities: ~10 (university archives)
  - Mixed: varies

### Contribution to Global Dataset

Czech Republic would become one of the **most complete** country datasets:
- 🇳🇱 Netherlands: ~1,400 institutions (current best)
- 🇨🇿 Czech Republic: ~8,700 institutions (if archives added) 🏆
- 🇧🇷 Brazil: ~3,000+ institutions (in progress)
- 🇦🇷 Argentina: ~2,000 institutions
- 🇦🇹 Austria: ~1,800 institutions
- 🇨🇦 Canada: ~800 institutions

## References

**Czech National Archives**:
- Website: https://www.nacr.cz
- ARON Portal: https://portal.nacr.cz/cro
- Institution List: https://portal.nacr.cz/aron/institution
- Email: posta@nacr.cz

**Ministry of Interior - Archival Administration**:
- Email: arch@mvcr.cz
- Website: https://www.mvcr.cz (archival section)

**Czech Open Data Portal**:
- Website: https://data.gov.cz
- Search: https://data.gov.cz/datasets

**Related Documentation**:
- CZECH_ISIL_COMPLETE_REPORT.md - Library harvest report
- CZECH_ISIL_NEXT_STEPS.md - Quick start guide
- SESSION_SUMMARY_20251119_CZECH_COMPLETE.md - Session summary

---

**Status**: Awaiting response from arch@mvcr.cz
**Next Session**: Check email for response, investigate open data portal, or begin API investigation