Czech Archives Database Investigation

Date: 2025-11-19
Status: Investigation in progress

Summary

The Czech Republic has TWO separate heritage databases:

  1. Libraries - Already harvested (8,145 institutions from ADR database)
  2. 🔍 Archives - Need to harvest (estimated ~560 institutions from ARON portal)

Czech Archive Infrastructure

Primary Source: Archiválie na dosah (Archives Within Reach)

Portal URL: https://portal.nacr.cz/cro/pro-badatele/
Institution Database: https://portal.nacr.cz/aron/institution
Manager: Národní archiv (National Archives) + Ministry of Interior

What We Discovered

  1. "Archiválie na dosah" replaced the old "Archivní fondy a sbírky v ČR" database
  2. Contains 200,000+ archival collections (fonds and collections)
  3. Managed by Národní archiv through the ARON portal (ARchiv ONline)
  4. Updated in real time by archivists across the Czech Republic
  5. Includes state archives, municipal archives, university archives, specialized archives, and private archives

Archive Categories in ARON

According to the portal (https://portal.nacr.cz/cro/pro-badatele/):

State Archives (Státní archivy):

  • Národní archiv (National Archives)
  • Moravský zemský archiv v Brně
  • 7 regional state archives (Státní oblastní archivy)
  • Archiv bezpečnostních složek (Security Forces Archive)

Municipal Archives (Archivy územních samosprávných celků):

  • Archiv hlavního města Prahy (Prague City Archive)
  • Archiv města Brna
  • Archiv města Plzně
  • Archiv města Ústí nad Labem
  • Archiv města Ostravy
  • Many district and municipal archives

University Archives (Archivy vysokých škol):

  • Univerzita Karlova
  • Masarykova univerzita
  • Univerzita Palackého
  • Ostravská univerzita
  • ČVUT Praha
  • VUT Brno
  • And more...

Specialized Archives (Specializované archivy):

  • Archiv Kanceláře prezidenta republiky
  • Archiv Poslanecké sněmovny (Parliament Archive)
  • Archiv Ministerstva zahraničních věcí
  • Vojenský historický archiv (Military Historical Archive)
  • Archiv Českého rozhlasu (Czech Radio Archive)
  • Archiv Národního muzea
  • Literární archiv Památníku národního písemnictví
  • Archiv Národní galerie
  • Archiv Národní knihovny ČR
  • And many more...

Private Archives (Soukromé archivy):

  • Archiv Židovského muzea
  • Corporate archives (Škoda, Vítkovice, Plzeňský Prazdroj, etc.)
  • Church archives (Biskupství brněnské)

Current Status

ARON Portal Observations:

  • Shows 56 pages of institutions (10 per page)
  • Estimated total: ~560 Czech archive institutions (56 pages × 10 per page; an upper bound if the last page is not full)
  • Web interface at https://portal.nacr.cz/aron/institution
  • Each institution has UUID (e.g., /aron/apu/000efd1e-099b-4e8c-ab8c-ec9e47662e7b)
  • Displays institution name, but details require clicking through
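
Since each institution is addressed by a UUID in its detail link, a harvester will need to pull that UUID out of the link. The sketch below assumes the `/aron/apu/<uuid>` path shape shown above; the function name `extract_apu_uuid` is hypothetical and not part of any existing project script.

```python
import re
import uuid
from typing import Optional

# Path shape taken from the example link above: /aron/apu/<uuid>
APU_PATH = re.compile(r"/aron/apu/([0-9a-fA-F-]{36})")


def extract_apu_uuid(href: str) -> Optional[str]:
    """Return the canonical (lowercase) UUID from an ARON detail link, or None."""
    match = APU_PATH.search(href)
    if match is None:
        return None
    try:
        # uuid.UUID validates the candidate and normalises its formatting.
        return str(uuid.UUID(match.group(1)))
    except ValueError:
        return None


print(extract_apu_uuid("/aron/apu/000efd1e-099b-4e8c-ab8c-ec9e47662e7b"))
```

Validating through `uuid.UUID` rather than trusting the regex alone means malformed links are dropped instead of being stored as identifiers.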

Data Access Challenges

No Obvious Bulk Download

What we DON'T have yet:

  • No public downloadable XML/CSV export like ADR library database
  • No documented public API for institution list
  • ARON API exists but appears to be for internal use (DA-COMM system)
  • Czech open data portal (data.gov.cz) doesn't list archive institution registry

Possible Data Access Methods

Option 1: Contact Ministry of Interior - Archival Administration (recommended)

  • Email: arch@mvcr.cz
  • Request: Ask for downloadable export of Czech archive institution registry
  • Precedent: National Library provided ADR database as open XML download
  • License: Likely CC0 (public domain) like library database
  • Format: Probably XML or CSV

Option 2: Web Scraping ARON Portal

  • Portal: https://portal.nacr.cz/aron/institution
  • 56 pages to scrape (paginated results)
  • Extract: Institution name, UUID, link to detail page
  • Then scrape each detail page for full metadata
  • Cons: Time-consuming, fragile, may violate terms of service
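
If scraping does become necessary, the extraction step might look like the sketch below. It works on HTML that has already been fetched and rendered (the portal appears to be a JavaScript application, so a plain HTTP fetch may not contain the list) and assumes each institution is an `<a>` element whose `href` ends in `/aron/apu/<uuid>`. That markup is an assumption that must be checked against the live page; the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumed link shape for institution detail pages.
APU_HREF = re.compile(r"/aron/apu/[0-9a-fA-F-]{36}$")


class InstitutionLinkParser(HTMLParser):
    """Collects (name, href) pairs for anchors that point at /aron/apu/<uuid>."""

    def __init__(self):
        super().__init__()
        self.institutions = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if APU_HREF.search(href):
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only collect text while inside a matching anchor.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            name = " ".join("".join(self._text).split())
            self.institutions.append((name, self._href))
            self._href = None


def parse_institution_links(html):
    parser = InstitutionLinkParser()
    parser.feed(html)
    parser.close()
    return parser.institutions
```

Only the standard library is used here so the parsing logic can be tested offline against saved pages before any live requests are made.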

Option 3: Check for API Documentation

  • Found DA-COMM API docs at https://stands.nacr.cz/da-comm/viewapi/
  • Appears to be for component viewing, not institution export
  • May require authentication token
  • Needs investigation: Does ARON have public API?

Option 4: Check Open Data Portal

  • Search data.gov.cz for "archivy" or "archivní instituce"
  • May be published as open data dataset
  • Status: Preliminary search didn't find it, but worth deeper investigation

Technical Details

ARON System Architecture

ARON = ARchiv ONline (Archive Online)

  • Modern web application (React-based frontend)
  • RESTful API backend
  • PostgreSQL database (inferred from similar Czech gov systems)
  • Real-time updates by archivists
  • Integrated with CAM (Centrální Archivní Modul - Central Archive Module)

Known ARON Endpoints

  • /aron/institution - List of archive institutions
  • /aron/fund - Archival fonds
  • /aron/finding-aid - Finding aids (archival inventories)
  • /aron/originator - Originators (organizations that created records)
  • /aron/apu/{uuid} - Individual access point (entity detail)
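
For scripting, the paths above can be kept in one place and turned into absolute URLs. This is a minimal sketch assuming all paths live on `https://portal.nacr.cz` (confirmed in this document only for `/aron/institution`); the dictionary keys and helper names are illustrative.

```python
from urllib.parse import urljoin

# Host taken from the institution list URL; assumed to serve the other paths too.
ARON_BASE = "https://portal.nacr.cz"

# Listing paths as recorded in this document.
ARON_PATHS = {
    "institution": "/aron/institution",
    "fund": "/aron/fund",
    "finding_aid": "/aron/finding-aid",
    "originator": "/aron/originator",
}


def listing_url(kind):
    """Absolute URL of a listing page; raises KeyError for unknown kinds."""
    return urljoin(ARON_BASE, ARON_PATHS[kind])


def apu_url(apu_uuid):
    """Absolute URL of an individual access point (entity detail) page."""
    return urljoin(ARON_BASE, f"/aron/apu/{apu_uuid}")


print(listing_url("institution"))
```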

Data We Need from Each Institution

From LinkML schema perspective, extract:

  • Name (official institution name)
  • Institution Type (ARCHIVE, MUSEUM, LIBRARY, etc.)
  • Location (city, address if available)
  • Identifiers (UUID from ARON, potential ISIL code)
  • Website URL (if available)
  • Description (mission, holdings summary)
  • Collections (archival fonds managed by institution)
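
The field list above could be held in a small record type while parsing, before it is mapped to the real LinkML classes. The sketch below is illustrative only: the class name, attribute names, and identifier scheme labels are assumptions and do not reproduce the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ArchiveInstitution:
    """Intermediate record for one institution (hypothetical field names)."""
    name: str
    institution_type: str = "ARCHIVE"
    city: Optional[str] = None
    address: Optional[str] = None
    aron_uuid: Optional[str] = None
    isil: Optional[str] = None
    website: Optional[str] = None
    description: Optional[str] = None
    collections: List[str] = field(default_factory=list)

    def identifiers(self) -> List[Dict[str, str]]:
        """Return only the identifiers that are actually present."""
        found = []
        if self.aron_uuid:
            found.append({"scheme": "ARON_UUID", "value": self.aron_uuid})
        if self.isil:
            found.append({"scheme": "ISIL", "value": self.isil})
        return found


example = ArchiveInstitution(name="Example archive", city="Praha", aron_uuid="000efd1e-099b-4e8c-ab8c-ec9e47662e7b")
print(example.identifiers())
```

Every field except the name is optional because the portal's list view shows only the institution name; the rest depends on what the detail pages or an official export provide.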

Comparison: Libraries vs Archives

| Feature | Libraries (ADR) | Archives (ARON) |
|---|---|---|
| Total Institutions | 8,145 | ~560 (estimated) |
| Data Source | National Library | National Archives + Min. Interior |
| Download Available | Yes (adr.xml.gz) | Not yet found |
| Format | MARC21 XML | Unknown (web portal only) |
| Coverage | 100% Czech libraries | All Czech archives + cultural institutions |
| ISIL Codes | Yes (siglas) | Unknown if archives have ISIL |
| Update Frequency | Periodic (last: 2025-08-01) | Real-time (online updates) |
| License | CC0 (public domain) | Unknown (likely open) |

Next Steps

Immediate Actions (Priority 1)

  1. Email Národní archiv and Ministry of Interior

    To: arch@mvcr.cz
    Subject: Request for Czech Archive Institution Registry Export
    
    Dear Archival Administration,
    
    I am working on a global heritage institution database project 
    (https://github.com/user/glam) and successfully integrated data 
    from the National Library's ADR database (8,145 libraries).
    
    I would like to request a downloadable export of Czech archive 
    institutions from the "Archiválie na dosah" (ARON) portal. 
    
    Could you provide:
    1. Complete list of Czech archive institutions in XML/CSV format
    2. Metadata: institution name, type, location, identifiers, website
    3. License information (hoping for CC0 like ADR database)
    
    The ADR database was available at:
    https://aleph.nkp.cz/data/adr.xml.gz
    
    Is there a similar download for archive institutions?
    
    Thank you!
    
  2. Check Czech Open Data Portal Thoroughly

    • Search data.gov.cz for:
      • "archivy"
      • "archivní instituce"
      • "ARON"
      • "Národní archiv"
      • "Archiválie na dosah"
    • Filter by publisher: Národní archiv, Ministerstvo vnitra
  3. Investigate ARON API

    • Check browser network tab when using portal
    • Look for API endpoints (likely /api/institution or similar)
    • Check if API returns JSON data
    • Test if API requires authentication
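
When probing candidate endpoints found in the network tab, each response can be triaged with a small helper like the one below. It is a sketch that takes a status code, `Content-Type` header, and body from whatever HTTP client is used; the function name and the result labels are hypothetical, and nothing here assumes a particular ARON endpoint exists.

```python
import json


def classify_response(status, content_type, body):
    """Roughly classify one HTTP response from a candidate API endpoint."""
    if status in (401, 403):
        # Endpoint exists but wants credentials or forbids access.
        return "auth_required"
    if status != 200:
        return "error"
    if "json" in content_type.lower():
        try:
            json.loads(body)
        except ValueError:
            # Header claims JSON but the body does not parse.
            return "invalid_json"
        return "json"
    return "not_json"


print(classify_response(200, "application/json; charset=utf-8", '{"items": []}'))
```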

Secondary Actions (Priority 2)

  1. Web Scraping as Last Resort

    • Only if no official export is available
    • Use crawl4ai or playwright to scrape institution list
    • Respect rate limits (1 request per second)
    • Parse HTML for institution names and UUIDs
    • Follow links to detail pages for full metadata
  2. Cross-Reference with ADR Database

    • Check whether the 87 records mentioning "archiv" in the library database are actual archives
    • Some may be libraries attached to archives (KI-MU type institutions)
    • Deduplicate institutions that appear in both databases
  3. ISIL Code Investigation

    • Check if Czech archives have ISIL codes
    • Libraries use "siglas" (e.g., ABA000)
    • Archives may use different identifier system
    • Contact ISIL agency: https://isil.org
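
For the scraping fallback, the one-request-per-second limit could be enforced with a small throttle such as the sketch below. The class name `Throttle` is hypothetical; the clock and sleep functions are injectable only so the behaviour can be tested without real waiting.

```python
import time


class Throttle:
    """Blocks so that consecutive wait() calls are at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each page fetch keeps the crawl at or below one request per second regardless of how long parsing takes.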

Data Integration (Priority 3)

  1. Create Czech Archive Parser

    • Once data is available, create scripts/parsers/parse_czech_archives.py
    • Similar structure to parse_czech_isil.py
    • Map archive types to GLAM taxonomy (ARCHIVE, MUSEUM, etc.)
    • Generate GHCIDs with CZ-* prefix
  2. Merge Libraries + Archives

    • Combine czech_institutions.yaml (libraries) with new archive data
    • Deduplicate by name/location
    • Create unified czech_heritage_institutions.yaml
    • 8,145 libraries + ~560 archives = ~8,700 total Czech institutions
  3. Enrich with Wikidata

    • Query Wikidata for Czech archives
    • Fuzzy match by name and city
    • Add Q-numbers to identifiers
    • Update GHCIDs with Q-numbers if needed
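
The name/location deduplication in the merge step could be sketched as follows. Records are plain dictionaries with `name` and `city` keys, which is an assumption about the YAML structure; matching ignores case, diacritics, and extra whitespace. The function names are hypothetical, and exact-key matching is a simpler stand-in for the fuzzy matching mentioned for Wikidata.

```python
import unicodedata


def normalize(text):
    """Casefold, strip diacritics, and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def merge_institutions(libraries, archives):
    """Merge two record lists, treating equal (name, city) keys as one institution."""
    merged = {}
    for record in list(libraries) + list(archives):
        key = (normalize(record["name"]), normalize(record.get("city", "")))
        if key in merged:
            # Keep the first record's values and fill in only missing fields.
            for field_name, value in record.items():
                merged[key].setdefault(field_name, value)
        else:
            merged[key] = dict(record)
    return list(merged.values())
```

Because library records are processed first, their values win on conflict, and archive records only contribute fields the library record lacks, such as an ARON UUID.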

Expected Outcomes

After Archive Data Harvest

Czech Heritage Institution Dataset:

  • Total Institutions: ~8,700 (8,145 libraries + 560 archives)
  • Data Quality: TIER_1_AUTHORITATIVE (from national registries)
  • Geographic Coverage: All 14 Czech regions
  • Type Distribution:
    • Libraries: 93.6% (8,145)
    • Archives: 6.4% (~560)
    • Museums: ~50 (from specialized archives)
    • Galleries: ~20 (from specialized archives)
    • Universities: ~10 (university archives)
    • Mixed: varies
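
The library and archive shares follow directly from the two counts and can be recomputed once the real archive total is known. A minimal sketch, using the estimated 560 archives as a placeholder:

```python
def type_shares(counts):
    """Return each type's share of the total as a percentage rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# 8,145 libraries is the harvested figure; 560 archives is still an estimate.
print(type_shares({"libraries": 8145, "archives": 560}))
# {'libraries': 93.6, 'archives': 6.4}
```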

Contribution to Global Dataset

Czech Republic would become one of the most complete country datasets:

  • 🇳🇱 Netherlands: ~1,400 institutions (current best)
  • 🇨🇿 Czech Republic: ~8,700 institutions (if archives added) 🏆
  • 🇧🇷 Brazil: ~3,000+ institutions (in progress)
  • 🇦🇷 Argentina: ~2,000 institutions
  • 🇦🇹 Austria: ~1,800 institutions
  • 🇨🇦 Canada: ~800 institutions

References

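Czech National Archives:

  • Archiválie na dosah portal: https://portal.nacr.cz/cro/pro-badatele/
  • ARON institution list: https://portal.nacr.cz/aron/institution

Ministry of Interior - Archival Administration:

  • Email: arch@mvcr.cz

Czech Open Data Portal:

  • data.gov.cz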
Related Documentation:

  • CZECH_ISIL_COMPLETE_REPORT.md - Library harvest report
  • CZECH_ISIL_NEXT_STEPS.md - Quick start guide
  • SESSION_SUMMARY_20251119_CZECH_COMPLETE.md - Session summary

Status: Awaiting response from arch@mvcr.cz
Next Session: Check email for response, investigate open data portal, or begin API investigation