glam/CZECH_ARCHIVES_NEXT_ACTIONS.md
2025-11-19 23:25:22 +01:00

Czech Archives - Next Actions

Quick Reference: What to do next session

Context

  • Czech libraries harvested: 8,145 institutions from ADR database
  • 🔍 Czech archives discovered: ~560 institutions in ARON portal
  • No bulk download found yet for archive data

1. Email Request to Ministry of Interior (DO THIS FIRST)

To: arch@mvcr.cz
Subject: Request for Czech Archive Institution Registry Export

Email Template:

Dear Archival Administration,

I am working on a global heritage institution database project and have 
successfully integrated data from the Czech National Library's ADR database 
(8,145 libraries).

I would like to request a downloadable export of Czech archive institutions 
from the "Archiválie na dosah" (ARON) portal at https://portal.nacr.cz/aron/institution

Could you provide:
1. Complete list of Czech archive institutions in XML/CSV/JSON format
2. Metadata: institution name, type, location, identifiers, website
3. License information (hoping for CC0 like the ADR database)

The ADR library database was publicly available at:
https://aleph.nkp.cz/data/adr.xml.gz (CC0 license)

Is there a similar download for archive institutions from the ARON system?

This data will contribute to a global open heritage institution dataset 
for research and discovery purposes.

Thank you for your assistance!

2. Search Czech Open Data Portal

URL: https://data.gov.cz/datasets

Search terms to try:

  • "archivy"
  • "archivní instituce"
  • "ARON"
  • "Národní archiv"
  • "Archiválie na dosah"

Filter by publishers:

  • Národní archiv
  • Ministerstvo vnitra
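
The search terms above contain diacritics and spaces, so they need percent-encoding before they go into a URL. A minimal sketch that builds one search URL per term; the `keywords` query parameter is an assumption carried over from the browser commands in this note and should be verified against the portal:

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://data.gov.cz/datasets"  # portal URL from this note

SEARCH_TERMS = [
    "archivy",
    "archivní instituce",
    "ARON",
    "Národní archiv",
    "Archiválie na dosah",
]


def build_search_urls(terms, param="keywords"):
    """Return one percent-encoded search URL per term.

    The parameter name is an assumption, not a confirmed portal API.
    quote_via=quote encodes spaces as %20 instead of '+'.
    """
    return [f"{BASE_URL}?{urlencode({param: term}, quote_via=quote)}" for term in terms]


if __name__ == "__main__":
    for url in build_search_urls(SEARCH_TERMS):
        print(url)
```

The publisher filter is not covered here because its URL form is not known yet.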

3. Investigate ARON API (if no download found)

Method: Use browser DevTools

  1. Open https://portal.nacr.cz/aron/institution
  2. Open DevTools (F12) → Network tab
  3. Filter by "XHR" or "Fetch"
  4. Click through pages to see API calls
  5. Look for endpoints like:
    • /api/institution
    • /api/apu
    • /aron/api/...

Check:

  • Response format (JSON expected)
  • Authentication requirements
  • Pagination parameters
  • Rate limits
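
Once a request has been captured in DevTools, the four checks above can be run against the copied status, headers, and body. A rough sketch: the pagination key names and the `X-RateLimit-*` header names are common conventions chosen for illustration, not anything confirmed for ARON.

```python
import json

# Header and key names below are guesses at common conventions (assumption).
RATE_LIMIT_HEADERS = ("retry-after", "x-ratelimit-limit", "x-ratelimit-remaining")
PAGINATION_KEYS = ("page", "size", "totalPages", "totalElements", "offset", "limit", "next")


def summarize_response(status, headers, body):
    """Summarize one captured response: auth, format, pagination, rate limits."""
    headers = {k.lower(): v for k, v in headers.items()}
    summary = {
        "needs_auth": status in (401, 403),
        "is_json": False,
        "pagination_keys": [],
        "rate_limit_headers": {k: headers[k] for k in RATE_LIMIT_HEADERS if k in headers},
    }
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        return summary  # body is not JSON; leave the defaults in place
    summary["is_json"] = True
    if isinstance(payload, dict):
        summary["pagination_keys"] = [k for k in PAGINATION_KEYS if k in payload]
    return summary
```

An empty `pagination_keys` list does not prove there is no pagination; it may live in query parameters or headers instead.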

4. Web Scraping (LAST RESORT ONLY)

Only if:

  • No email response after 1 week
  • No open data export found
  • No public API available

Approach:

# Use playwright or crawl4ai
# Scrape 56 pages of institution list
# Extract: name, UUID, link
# Follow links to detail pages
# Respect rate limits: 1 req/sec
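
The approach above could look roughly like the sketch below. It uses only the standard library and takes the page fetcher as a parameter, so `fetch_html` is a hypothetical callable (for example a thin wrapper around Playwright that returns the rendered HTML of listing page N). The assumption that institution links carry a UUID in their `href`, and that pages are numbered from 0, is unverified.

```python
import re
import time
from html.parser import HTMLParser

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.IGNORECASE
)


class InstitutionLinkParser(HTMLParser):
    """Collect name, UUID, and link for every <a> whose href contains a UUID."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record for the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        match = UUID_RE.search(href)
        if tag == "a" and match:
            self._current = {"name": "", "uuid": match.group(0), "link": href}

    def handle_data(self, data):
        if self._current is not None:
            self._current["name"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["name"] = self._current["name"].strip()
            self.records.append(self._current)
            self._current = None


def scrape_listing(fetch_html, page_count=56, delay=1.0, sleep=time.sleep):
    """Call fetch_html(page) for each listing page, pausing `delay` seconds between requests."""
    records = []
    for page in range(page_count):
        if page > 0:
            sleep(delay)  # 1 req/sec by default
        parser = InstitutionLinkParser()
        parser.feed(fetch_html(page))
        records.extend(parser.records)
    return records
```

Detail pages would be fetched in a second pass over the collected `link` values, with the same delay between requests.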

What We Have vs What We Need

Currently Have (Libraries)

Source: https://aleph.nkp.cz/data/adr.xml.gz
Format: MARC21 XML (27 MB)
Records: 8,145 libraries
Status: ✅ Parsed and validated
Output: data/instances/czech_institutions.yaml (8.8 MB)

Need to Get (Archives)

Source: https://portal.nacr.cz/aron/institution
Format: Unknown (web portal only)
Records: ~560 archives (estimated)
Status: ⏳ Awaiting data access
Target: data/instances/czech_archives.yaml

Final Goal (Combined)

Output: data/instances/czech_heritage_complete.yaml
Records: ~8,700 total (8,145 libraries + ~560 archives)
Status: 📋 Waiting for archive data
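
Since the library file may already contain some archives, the merge step needs a duplicate check. One possible sketch, assuming each record is a dict with `name` and `city` keys (both field names are hypothetical until the archive schema is known), keyed on a diacritic-insensitive name plus city:

```python
import unicodedata


def normalize(text):
    """Lowercase, strip diacritics, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text or "")
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.lower().split())


def merge_institutions(libraries, archives):
    """Concatenate both lists, keeping the first record for each (name, city) key."""
    seen = set()
    merged = []
    for record in list(libraries) + list(archives):
        # "name" and "city" are assumed field names, not the confirmed schema.
        key = (normalize(record.get("name")), normalize(record.get("city")))
        if key in seen:
            continue
        seen.add(key)
        merged.append(record)
    return merged
```

Library records win on conflict because they come first; matching on a shared identifier would be more reliable if both sources turn out to expose one.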

Expected Timeline

Best Case (Ministry provides export):

  • Email response: 3-5 business days
  • Download + parse data: 1-2 hours
  • Validation + merge: 1-2 hours
  • Total: 1 week

Medium Case (Found in open data portal):

  • Deep search: 1-2 hours
  • Download + parse: 1-2 hours
  • Total: Same day

Worst Case (Web scraping required):

  • Write scraper: 2-3 hours
  • Run scraper (56 pages + details): 2-3 hours
  • Parse + validate: 1-2 hours
  • Total: 1-2 days

Commands for Next Session

Check email response

# Manual check - did arch@mvcr.cz respond?

Search open data portal

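# Open in browser (`open` is macOS; use xdg-open on Linux).
# URLs are quoted so the shell does not treat "?" as a glob character.
open "https://data.gov.cz/datasets?keywords=archivy"
open "https://data.gov.cz/datasets?keywords=archivn%C3%AD%20instituce"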

Check current status

# Czech libraries
python3 -c "
import yaml
with open('data/instances/czech_institutions.yaml', 'r') as f:
    data = yaml.safe_load(f)
print(f'Czech libraries: {len(data)}')
print(f'Archives in library DB: {sum(1 for i in data if \"archiv\" in i[\"name\"].lower())}')
"

Test ARON API (if found)

# Example only: this endpoint is a guess. Replace it with the URL actually seen in DevTools.
curl -s "https://portal.nacr.cz/api/institution?page=0&size=100" | jq .
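
If an endpoint with `page` and `size` parameters does turn up, collecting all ~560 records is a short loop. A sketch under heavy assumptions: the URL is the same guessed endpoint as in the curl example, and the response body is assumed to be a bare JSON array, which is unconfirmed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def http_fetch_page(page, size, base_url="https://portal.nacr.cz/api/institution"):
    """Fetch one page from the guessed endpoint; assumes a JSON array body."""
    query = urlencode({"page": page, "size": size})
    with urlopen(f"{base_url}?{query}", timeout=30) as resp:
        return json.load(resp)


def fetch_all(fetch_page, size=100, max_pages=1000):
    """Call fetch_page(page, size) until a short or empty page comes back."""
    records = []
    for page in range(max_pages):
        batch = fetch_page(page, size)
        records.extend(batch)
        if len(batch) < size:
            break  # last page reached
    return records
```

Usage would be `fetch_all(http_fetch_page)`. If the real response wraps the records in an object, only `http_fetch_page` needs to change.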

Files to Create (Once Archive Data Available)

  1. scripts/parsers/parse_czech_archives.py - Archive parser
  2. data/instances/czech_archives.yaml - Archive records
  3. scripts/merge_czech_datasets.py - Merge libraries + archives
  4. data/instances/czech_heritage_complete.yaml - Final unified dataset
  5. CZECH_ARCHIVES_COMPLETE.md - Archive harvest report

Success Criteria

  • Archive data obtained (via email, open data, or API)
  • Parser created and tested
  • Records validated against LinkML schema
  • Libraries + archives merged without duplicates
  • Total ~8,700 Czech heritage institutions in unified dataset
  • GHCID generated for all institutions
  • Ready for Wikidata enrichment

Key Contacts

Národní archiv (National Archives):

Ministry of Interior - Archival Administration:

National Library (for reference):


Current Status: Awaiting archive data access
Next Action: Send email to arch@mvcr.cz
Fallback: Search open data portal, then investigate API
Last Resort: Web scraping with rate limits