glam/SESSION_SUMMARY_20251119_ARCHIVES_DISCOVERY.md
2025-11-19 23:25:22 +01:00

Session Summary: Czech Archives Discovery (2025-11-19)

What We Accomplished

1. Identified Archive Data Gap

  • Confirmed Czech ADR database contains libraries only (no archives)
  • Found that the 87 institutions with "archiv" in their name are archive libraries or collections, not archives in their own right
  • Found separate Czech archive database: "Archiválie na dosah" (Archives Within Reach)

2. Located Czech Archive Infrastructure

3. Categorized Czech Archive Types

Documented archive categories from ARON portal:

  • State archives (Národní archiv, regional archives, security archive)
  • Municipal archives (Prague, Brno, Plzeň, Ostrava, etc.)
  • University archives (Charles University, Masaryk University, etc.)
  • Specialized archives (Parliament, Foreign Ministry, Military, Radio, Museums, Galleries, Libraries)
  • Private archives (Jewish Museum, corporate archives, church archives)

4. Identified Data Access Challenges ⚠️

Problem: No obvious bulk download available

  • No public XML/CSV export comparable to the one offered for the library ADR database
  • No documented public API endpoint
  • Not found in Czech open data portal (preliminary search)
  • Web interface exists but requires scraping 56 pages

5. Developed Action Plan 📋

Created comprehensive investigation strategy:

  • Priority 1: Email the Ministry of the Interior (arch@mvcr.cz) to request an export
  • Priority 2: Search the Czech open data portal in depth
  • Priority 3: Investigate ARON API through DevTools
  • Priority 4: Web scraping as last resort

Files Created

Documentation

  • CZECH_ARCHIVES_INVESTIGATION.md (6.5 KB)

    • Detailed investigation report
    • System architecture analysis
    • Comparison: libraries vs archives
    • Expected outcomes
  • CZECH_ARCHIVES_NEXT_ACTIONS.md (5.2 KB)

    • Quick reference guide
    • Email template for Ministry
    • Step-by-step instructions
    • Success criteria
  • CZECH_ISIL_COMPLETE_REPORT.md - Library harvest report
  • CZECH_ISIL_NEXT_STEPS.md - Library processing guide
  • data/instances/czech_institutions.yaml - 8,145 libraries (8.8 MB)

Key Findings

Archive Distribution Estimate

Based on ARON portal pagination:

Total pages: 56
Per page: 10
Estimated total: ~560 Czech archive institutions

Czech Heritage Landscape

Libraries: 8,145 (ADR database) ✅ COMPLETE
Archives: ~560 (ARON portal) ⏳ PENDING DATA ACCESS
Total: ~8,700 Czech heritage institutions
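
The estimate and totals above are simple arithmetic. A minimal sketch follows, assuming every ARON result page is full (the last page may hold fewer than 10 entries, so the product is an upper bound for the listing). The function name `estimate_totals` is illustrative only.

```python
# Figures taken from this summary. The last results page may be partial,
# so pages * per_page is an upper bound for the ARON listing.
ARON_PAGES = 56
PER_PAGE = 10
LIBRARIES = 8_145


def estimate_totals(pages: int, per_page: int, libraries: int) -> dict:
    """Return the archive estimate, the combined total, and percentage shares."""
    archives = pages * per_page
    total = libraries + archives
    return {
        "archives": archives,
        "total": total,
        "library_share": round(100 * libraries / total, 1),
        "archive_share": round(100 * archives / total, 1),
    }


if __name__ == "__main__":
    print(estimate_totals(ARON_PAGES, PER_PAGE, LIBRARIES))
```

Re-running this once an official count is available keeps the shares consistent with the totals.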

Data Quality Comparison

| Feature | Libraries (ADR) | Archives (ARON) |
|---|---|---|
| Data Source | National Library | National Archives |
| Format | MARC21 XML | Unknown (web only) |
| Download | Available | Not found |
| Records | 8,145 | ~560 (estimated) |
| GPS Coverage | 81.3% | Unknown |
| ISIL Codes | Yes (siglas) | Unknown |
| License | CC0 | Unknown |

Expected Czech Dataset

After Archive Integration

  • Total Institutions: ~8,700
    • Libraries: 8,145 (~93.6%)
    • Archives: ~560 (~6.4%)
    • Mixed: ~50 (museums, galleries, etc.)
  • Geographic Coverage: All 14 Czech regions
  • Data Quality: TIER_1_AUTHORITATIVE
  • Global Ranking: would become the 2nd largest national dataset (after the Netherlands) once archives are added

Next Steps

Immediate (Next Session)

  1. Send email to arch@mvcr.cz requesting archive data export

    • Use template in CZECH_ARCHIVES_NEXT_ACTIONS.md
    • Reference ADR database as precedent
    • Request CC0 license
  2. Search Czech open data portal thoroughly

  3. Investigate ARON API

Secondary (If No Official Export)

  1. Web scraping fallback
    • Scrape 56 pages of institution list
    • Extract: name, UUID, detail page link
    • Follow links for full metadata
    • Respect rate limits (1 req/sec)
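
The fallback steps above could be sketched roughly as follows. This is not a tested scraper for ARON: the list-page markup is an assumption (anchors whose `href` contains `/aron/apu/{uuid}`), and because ARON is React-based the HTML may only exist after client-side rendering. `fetch_page` is a hypothetical callable that returns the rendered HTML for a given page number; `InstitutionLinkParser` and `scrape_institutions` are illustrative names.

```python
import re
import time
from html.parser import HTMLParser
from typing import Callable, Iterator, Optional

# Assumption: detail links look like /aron/apu/{uuid}, per the known endpoints.
APU_LINK = re.compile(
    r"/aron/apu/([0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12})"
)


class InstitutionLinkParser(HTMLParser):
    """Collect name, UUID, and href from anchors that point at detail pages."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []
        self._current: Optional[dict] = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = APU_LINK.search(href)
        if match:
            self._current = {"name": "", "uuid": match.group(1), "href": href}

    def handle_data(self, data):
        # Accumulate the link text while inside a matching anchor.
        if self._current is not None:
            self._current["name"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["name"] = self._current["name"].strip()
            self.records.append(self._current)
            self._current = None


def scrape_institutions(
    fetch_page: Callable[[int], str],
    pages: int = 56,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield institution records page by page, pausing between requests."""
    for page in range(1, pages + 1):
        parser = InstitutionLinkParser()
        parser.feed(fetch_page(page))
        parser.close()
        yield from parser.records
        if page < pages:
            sleep(delay_seconds)  # stay at or below 1 request per second
```

Passing the fetcher and the sleep function as parameters keeps the pagination and rate-limit logic testable without touching the live portal.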

After Archive Data Obtained

  1. Create parser: scripts/parsers/parse_czech_archives.py
  2. Validate records: Check LinkML schema compliance
  3. Merge datasets: Combine libraries + archives
  4. Deduplicate: Check for institutions in both databases
  5. Enrich with Wikidata: Add Q-numbers
  6. Generate GHCIDs: CZ-* prefixes

Technical Discoveries

ARON System

  • ARON = ARchiv ONline
  • React-based web application
  • RESTful API backend (not publicly documented)
  • Real-time updates by archivists
  • Integrated with CAM (Central Archive Module)

Known Endpoints

  • /aron/institution - Institution list
  • /aron/fund - Archival fonds
  • /aron/finding-aid - Finding aids
  • /aron/originator - Record creators
  • /aron/apu/{uuid} - Entity details
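
A small helper for building URLs from the paths listed above might look like this. The host is a placeholder because the real ARON base URL is not recorded in this summary, and whether these paths are front-end routes or backend API routes is still unconfirmed. `list_url` and `entity_url` are illustrative names.

```python
import uuid
from urllib.parse import urljoin

# Placeholder host: substitute the real portal base URL once confirmed.
BASE_URL = "https://aron.example.invalid"

# Paths as listed under Known Endpoints.
ENDPOINTS = {
    "institution": "/aron/institution",
    "fund": "/aron/fund",
    "finding_aid": "/aron/finding-aid",
    "originator": "/aron/originator",
}


def list_url(kind: str, base_url: str = BASE_URL) -> str:
    """Build the URL of a listing page; raises KeyError for unknown kinds."""
    return urljoin(base_url, ENDPOINTS[kind])


def entity_url(entity_id: str, base_url: str = BASE_URL) -> str:
    """Build the detail URL for one entity, validating the UUID first."""
    canonical = str(uuid.UUID(entity_id))  # raises ValueError if malformed
    return urljoin(base_url, f"/aron/apu/{canonical}")
```

Validating the UUID before building the URL avoids sending malformed requests during the DevTools-guided investigation.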

API Potential

Outstanding Questions

  1. Does archive data export exist?

  2. Do Czech archives have ISIL codes?

    • Libraries use "siglas" format
    • Archives may use different system
    • Need to check with ISIL agency
  3. What's the license for archive data?

    • Library data is CC0
    • Archives likely similar but unconfirmed
  4. How many archives actually exist?

    • Portal shows ~560
    • May be more not in ARON system
    • Need official count

Success Metrics

Session Goals

  • Identified archive database location
  • Estimated archive count
  • Documented archive categories
  • Created action plan
  • Awaiting data access

Project Goals (Pending)

  • Obtain archive data export
  • Parse and validate records
  • Merge with library dataset
  • Complete Czech heritage dataset (~8,700 institutions)
  • Become the 2nd largest national dataset globally

Resources

Key Contacts

URLs

Documentation

  • CZECH_ARCHIVES_INVESTIGATION.md - Full investigation report
  • CZECH_ARCHIVES_NEXT_ACTIONS.md - Quick start guide
  • CZECH_ISIL_COMPLETE_REPORT.md - Library harvest results
  • AGENTS.md - Project instructions for AI agents

Timeline Estimate

Best Case (Official export provided):

  • Email response: 3-5 business days
  • Download + parse: 1-2 hours
  • Merge + validate: 1-2 hours
  • Total: ~1 week

Medium Case (Found in open data):

  • Deep portal search: 1-2 hours
  • Download + parse: 1-2 hours
  • Total: Same day

Worst Case (Web scraping):

  • Write scraper: 2-3 hours
  • Run scraper: 2-3 hours (56 pages + details)
  • Parse + validate: 1-2 hours
  • Total: 1-2 days
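
As a rough cross-check on the scraper run time, the request count implied by the figures in this summary can be worked out. The sketch assumes one request per list page plus one per institution detail page, spaced at 1 request per second. It ignores response latency, page rendering, and retries, so the result is a floor and not a prediction. `min_scrape_minutes` is an illustrative name.

```python
def min_scrape_minutes(list_pages: int, per_page: int, delay_seconds: float = 1.0):
    """Return (request count, minimum minutes) for list pages plus detail pages."""
    requests = list_pages + list_pages * per_page
    return requests, requests * delay_seconds / 60


if __name__ == "__main__":
    count, minutes = min_scrape_minutes(56, 10)
    print(f"{count} requests, at least {minutes:.1f} minutes")
```

With the 56 pages and 10 entries per page noted earlier, the floor is a few hundred requests, which leaves the 2-3 hour run estimate with a wide margin for slow responses and retries.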

Commands for Verification

Check library data

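python3 -c "
import yaml

# Load the harvested library records (the file is iterated as a list of mappings).
with open('data/instances/czech_institutions.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)

print(f'Czech libraries: {len(data)}')
# .get() avoids a KeyError on records that lack a name or an institution type.
print(f'\"Archiv\" mentions: {sum(1 for i in data if \"archiv\" in (i.get(\"name\") or \"\").lower())}')
print(f'Archive type count: {sum(1 for i in data if i.get(\"institution_type\") == \"ARCHIVE\")}')
"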

List files created

ls -lh CZECH_ARCHIVES*.md

Notes

  • Czech Republic has TWO separate heritage databases (libraries + archives)
  • This is common in many countries with specialized systems
  • Library database was straightforward (single XML download)
  • Archive database requires more investigation (web portal only)
  • Both are managed by national institutions (good data quality)
  • License likely CC0 (following Czech open data practices)

Status

Current Phase: Data acquisition
Blocking Issue: No bulk download found for archives
Next Action: Email arch@mvcr.cz requesting export
Fallback Plans: Open data search → API investigation → Web scraping
Expected Resolution: 1 week (if email successful)


Session Duration: ~2 hours
Files Created: 2 (investigation report + action guide)
Research Completed: Archive infrastructure mapping
Data Acquired: 0 (awaiting archive export)
Next Session Focus: Data acquisition and parsing