glam/SESSION_SUMMARY_20251119_DENMARK_ARCHIVES_COMPLETE.md
Danish Archives Integration Complete

Date: 2025-11-19
Session: Danish GLAM Enhancement Phase 2
Status: Archives parsed, Libraries + Archives combined


Achievements

1. Fixed GHCID Generator for Danish Characters

Problem: Special characters (æ, ø, å) in city names prevented GHCID generation

Solution: Added normalize_city_name() function with comprehensive transliteration:

  • Danish/Norwegian: Æ→AE, Ø→OE, Å→AA
  • German: ß→ss, Ä→AE, Ö→OE, Ü→UE
  • Polish, Icelandic, Czech, Croatian support

File: /src/glam_extractor/identifiers/ghcid.py

Impact: GHCID coverage for libraries improved from 76% to 78%
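
The transliteration step can be illustrated with a minimal sketch. This is not the code in ghcid.py: the function name normalize_city_name_sketch, the lowercase map entries, and the NFKD fallback for other accented letters are assumptions made for illustration. Only the uppercase pairs and ß→ss come from the notes above.

```python
import unicodedata

# Pairs listed in the notes above, plus lowercase variants (an assumption).
TRANSLITERATION = {
    "Æ": "AE", "Ø": "OE", "Å": "AA",
    "æ": "ae", "ø": "oe", "å": "aa",
    "Ä": "AE", "Ö": "OE", "Ü": "UE",
    "ä": "ae", "ö": "oe", "ü": "ue",
    "ß": "ss",
}


def normalize_city_name_sketch(name: str) -> str:
    """Return a transliterated form of a city name (illustrative only)."""
    # Compose first so that 'a' + combining ring is seen as a single 'å'.
    composed = unicodedata.normalize("NFC", name)
    replaced = "".join(TRANSLITERATION.get(ch, ch) for ch in composed)
    # Fallback for other accented letters: decompose and drop combining marks.
    decomposed = unicodedata.normalize("NFKD", replaced)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize_city_name_sketch("Hjørring"))           # Hjoerring
print(normalize_city_name_sketch("Ringkøbing-Skjern"))  # Ringkoebing-Skjern
```

Letters that have no Unicode decomposition (for example Ł or Đ) pass through this fallback unchanged, so a real implementation would need explicit map entries for them.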


2. Created Danish Archive Parser

New Parser: /src/glam_extractor/parsers/danish_archive.py

Features:

  • Parses danish_archives_arkivdk.csv (594 archives from Arkiv.dk)
  • Detects archive types: National, Provincial, Municipal, Special Collections
  • URL-encodes slugs for Danish characters (fixes URI validation errors)
  • Full LinkML compliance
  • Generates GHCID identifiers (95.1% coverage)

Data Source: Arkiv.dk web scraping (TIER_2_VERIFIED)

Output: /data/instances/denmark_archives.json (594 records, 918 KB)
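
Archive type detection could work along the lines of the keyword heuristic below. This is a sketch only: the keyword list, the rule order, and the fallback to Special Collections are assumptions, and the actual rules in danish_archive.py are not recorded in these notes. Only the four type names come from the feature list above.

```python
from enum import Enum


class ArchiveType(Enum):
    NATIONAL = "National"
    PROVINCIAL = "Provincial"
    MUNICIPAL = "Municipal"
    SPECIAL_COLLECTIONS = "Special Collections"


# Hypothetical keyword rules, checked in order; first match wins.
KEYWORD_RULES = [
    ("rigsarkiv", ArchiveType.NATIONAL),
    ("landsarkiv", ArchiveType.PROVINCIAL),
    ("stadsarkiv", ArchiveType.MUNICIPAL),
    ("kommunearkiv", ArchiveType.MUNICIPAL),
]


def detect_archive_type(name: str) -> ArchiveType:
    """Classify an archive by keywords in its name (illustrative only)."""
    lowered = name.casefold()
    for keyword, archive_type in KEYWORD_RULES:
        if keyword in lowered:
            return archive_type
    # Assumed fallback when no keyword matches.
    return ArchiveType.SPECIAL_COLLECTIONS


print(detect_archive_type("Rigsarkivet").value)  # National
```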


3. Combined Libraries + Archives Dataset

Output: /data/instances/denmark_glam_complete.json

Statistics:

| Metric | Value |
| --- | --- |
| Total institutions | 1,149 |
| Libraries | 555 |
| Archives | 594 |
| GHCID coverage | 998/1,149 (86.9%) |
| ISIL coverage | 555/1,149 (48.3%) |
| File size | 1.8 MB |

GHCID Coverage Breakdown:

  • Libraries: 433/555 (78.0%)
  • Archives: 565/594 (95.1%)

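Note: Archives have higher GHCID coverage mainly because fewer archive records lack city data (29, versus roughly 99 library records). Norwegian institutions listed in the Danish ISIL registry account for only 3 of the library gaps (see Missing GHCID Analysis).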
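
The combined coverage figure follows directly from the two per-type counts. The short check below recomputes it; the helper name coverage is hypothetical, and the counts are those reported in this section.

```python
def coverage(with_id: int, total: int) -> str:
    """Format a coverage ratio as 'n/total (pct%)'."""
    return f"{with_id}/{total:,} ({with_id / total:.1%})"


libraries = (433, 555)   # libraries with GHCID, total libraries
archives = (565, 594)    # archives with GHCID, total archives
combined = (libraries[0] + archives[0], libraries[1] + archives[1])

print(coverage(*libraries))  # 433/555 (78.0%)
print(coverage(*archives))   # 565/594 (95.1%)
print(coverage(*combined))   # 998/1,149 (86.9%)
print(coverage(555, 1149))   # ISIL coverage: 555/1,149 (48.3%)
```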


Top Cities by Institution Count

| City | Institutions |
| --- | --- |
| Esbjerg | 23 |
| Ringkøbing-Skjern | 22 |
| Aarhus | 22 |
| Hjørring | 19 |
| Aalborg | 19 |
| Haderslev | 18 |
| Varde | 18 |
| Vejen | 17 |
| Vejle | 17 |
| Faaborg-Midtfyn | 16 |

Technical Fixes Applied

Fix 1: DataSourceEnum.WEB_SCRAPING → WEB_CRAWL

Error: AttributeError: type object 'DataSourceEnum' has no attribute 'WEB_SCRAPING'

Fix: Changed to DataSourceEnum.WEB_CRAWL (line 254)

File: /src/glam_extractor/parsers/danish_archive.py
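
This kind of failure is easy to reproduce, and to guard against, with a stand-in enum. The sketch below uses the standard library Enum rather than the project's actual DataSourceEnum (which is generated from the LinkML schema and may behave differently). The class name DataSourceEnumSketch, the member value, and the helper resolve_data_source are all hypothetical.

```python
from enum import Enum


class DataSourceEnumSketch(Enum):
    # WEB_CRAWL is the only member named in these notes; its value is assumed.
    WEB_CRAWL = "WEB_CRAWL"


def resolve_data_source(name: str) -> DataSourceEnumSketch:
    """Look up a member by name, failing with a message that lists valid names."""
    try:
        return DataSourceEnumSketch[name]
    except KeyError:
        valid = ", ".join(member.name for member in DataSourceEnumSketch)
        raise ValueError(f"unknown data source {name!r}; valid: {valid}") from None


print(resolve_data_source("WEB_CRAWL").name)  # WEB_CRAWL
```

Resolving members by name through one helper surfaces a typo such as WEB_SCRAPING with a list of valid alternatives, instead of a bare AttributeError deep inside the parser.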


Fix 2: URL Encoding for Danish Characters

Error: ValueError: https://w3id.org/.../allerød-kommune/... is not a valid URI

Root Cause: Danish ø character in URL path

Fix: Added urllib.parse.quote() for slug generation:

from urllib.parse import quote

archive_slug = quote(record.archive_name.lower().replace(' ', '-').replace('/', '-'), safe='')
municipality_slug = quote(record.municipality.lower().replace(' ', '-'), safe='')

Result: URIs now properly encoded (e.g., aller%C3%B8d-kommune)
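
A self-contained version of the same slug logic makes the encoding easy to check. The helper name make_slug is hypothetical, and unlike the snippet above it applies the '/' replacement to every input, not only to archive names.

```python
from urllib.parse import quote, unquote


def make_slug(text: str) -> str:
    """Lowercase, replace spaces and slashes with hyphens, then percent-encode."""
    return quote(text.lower().replace(" ", "-").replace("/", "-"), safe="")


slug = make_slug("Allerød Kommune")
print(slug)            # aller%C3%B8d-kommune
print(slug.isascii())  # True
print(unquote(slug))   # allerød-kommune
```

quote() encodes the UTF-8 bytes of ø (0xC3 0xB8) and leaves hyphens untouched, so the slug is pure ASCII and can be decoded back to the readable form with unquote().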


Missing GHCID Analysis

Libraries without GHCID (122/555)

Reasons:

  1. Norwegian libraries (3) - Can't generate DK GHCID for NO country code
  2. Special characters not yet handled (~20) - Need further investigation
  3. Missing city data (~99) - Likely FFU branches without location

Examples:

  • Føroya Fólkabókasvn (Faroe Islands)
  • Grønlands Nationalbibliotek (Greenland)
  • Norwegian libraries in Danish ISIL registry

Archives without GHCID (29/594)

Reason: Missing location data (no city specified)

Examples:

  • Arkivet ved Dansk Centralbibliotek for Sydslesvig
  • Badminton Museet
  • Brødrene Grams Historiske Arkiv og Museum
  • CFD's lokale døvehistoriske Samling
  • Dansk Forsorgshistorisk Museum

Next Steps: Manual geocoding or contact institutions for location data
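
The breakdown above could be reproduced with a small triage pass over the records. This is a sketch under stated assumptions: the dictionary keys ghcid, country, and city, the reason labels, and the sample rows are all hypothetical and do not reflect the real schema or data.

```python
from collections import Counter
from typing import Optional


def missing_ghcid_reason(record: dict) -> Optional[str]:
    """Return why a record has no GHCID, or None if it has one."""
    if record.get("ghcid"):
        return None
    if record.get("country") != "DK":
        return "non-DK country code"
    if not record.get("city"):
        return "missing city"
    # City present but no GHCID: possibly an unhandled special character.
    return "needs investigation"


# Placeholder rows for illustration only.
sample = [
    {"name": "example 1", "country": "DK", "city": "Esbjerg", "ghcid": "placeholder-id"},
    {"name": "example 2", "country": "NO", "city": "Oslo", "ghcid": None},
    {"name": "example 3", "country": "DK", "city": "", "ghcid": None},
    {"name": "example 4", "country": "DK", "city": "Hjørring", "ghcid": None},
]
reasons = Counter(r for r in map(missing_ghcid_reason, sample) if r is not None)
print(dict(reasons))
```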


Data Files Generated

| File | Records | Size | GHCID Coverage |
| --- | --- | --- | --- |
| denmark_libraries_v2.json | 555 | 964 KB | 78% |
| denmark_archives.json | 594 | 918 KB | 95% |
| denmark_glam_complete.json | 1,149 | 1.8 MB | 87% |

Remaining Tasks

Phase 3: Library Branches (NEXT)

Data Files:

  • publicbranches.csv (594 public library branches)
  • ffubranches.csv (605 research library branches)
  • Total: 1,199 branch records

Modeling Decision Needed:

  1. Separate records - Each branch as independent HeritageCustodian
  2. Hierarchical linking - Use parent_organization field to link to main library
  3. Hybrid - Main libraries get full records, branches get minimal records with parent links

Recommendation: Hierarchical linking (option 2)

  • Reduces redundancy (don't duplicate address/description for every branch)
  • Maintains relationships (clear parent-child structure)
  • Enables queries like "find all branches of Copenhagen Library"
  • Aligns with LinkML schema design (parent_organization field exists for this purpose)
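
A minimal sketch of option 2 follows, to show what the parent-child query would look like. It assumes that parent_organization holds the parent library's ISIL code. The notes do not say whether the real field stores an ISIL code, a GHCID, or a URI. CustodianStub, index_branches, and the DK-EXAMPLE-* codes are placeholders, not real records.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CustodianStub:
    name: str
    isil: str
    # Assumed to hold the parent's ISIL code; None for a main library.
    parent_organization: Optional[str] = None


def index_branches(records: List[CustodianStub]) -> Dict[str, List[CustodianStub]]:
    """Group branch records under the identifier of their parent library."""
    index = defaultdict(list)
    for record in records:
        if record.parent_organization is not None:
            index[record.parent_organization].append(record)
    return dict(index)


records = [
    CustodianStub("Main Library A", "DK-EXAMPLE-A"),
    CustodianStub("Branch A1", "DK-EXAMPLE-A1", parent_organization="DK-EXAMPLE-A"),
    CustodianStub("Branch A2", "DK-EXAMPLE-A2", parent_organization="DK-EXAMPLE-A"),
    CustodianStub("Main Library B", "DK-EXAMPLE-B"),
]
branches = index_branches(records)
print([b.name for b in branches["DK-EXAMPLE-A"]])  # ['Branch A1', 'Branch A2']
```

With this shape, "find all branches of a given library" is a single dictionary lookup, and branch records carry only the fields that differ from the parent.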

Phase 4: RDF Export (PENDING)

Goal: Generate Linked Open Data exports for SPARQL queries

Target Formats:

  • Turtle (.ttl) - Human-readable RDF
  • RDF/XML (.rdf) - Machine-readable RDF
  • JSON-LD (.jsonld) - Web-native linked data

Destination: W3ID persistent URLs

  • Base: https://w3id.org/heritage/custodian/dk/
  • Libraries: https://w3id.org/heritage/custodian/dk/library/
  • Archives: https://w3id.org/heritage/custodian/dk/archive/

Tools:

  • LinkML gen-rdf command
  • Custom RDF serialization script (if needed for CPOV/TOOI alignment)

Phase 5: SPARQL Endpoint Setup (PENDING)

Options:

  1. Apache Jena Fuseki - Self-hosted SPARQL endpoint
  2. GraphDB - Commercial/free triplestore
  3. Static RDF files - Publish to GitHub for download (no SPARQL endpoint)

Recommendation: Start with static RDF files (easiest), add SPARQL endpoint later if needed


Key Insights

Danish Heritage Infrastructure

  1. ISIL Codes: Library-only (archives don't participate in ISIL system)
  2. Archive Registry: Comprehensive (Arkiv.dk covers 594 institutions)
  3. Library System: Dual structure (public + research libraries)
  4. Branch Networks: Extensive (1,199 branches for 555 main libraries)

GHCID as Primary Identifier

Why GHCID matters for archives:

  • Danish archives lack ISIL codes
  • GHCID provides deterministic, collision-resistant identifiers
  • Enables cross-border comparisons (DK archives vs NL, NO, SE)
  • Supports persistent URIs for Linked Open Data

Coverage: 95.1% for archives (vs 78% for libraries)

  • Why higher? Archive gaps come only from missing location data (29 records), while library gaps also include foreign institutions and special characters not yet handled

Session Timeline

  1. 10:00-10:15 - Enhanced GHCID generator with Nordic character support
  2. 10:15-10:30 - Fixed normalize_city_name() and tested with Danish libraries
  3. 10:30-11:00 - Created danish_archive.py parser
  4. 11:00-11:15 - Fixed DataSourceEnum.WEB_SCRAPING → WEB_CRAWL
  5. 11:15-11:30 - Fixed URI encoding for Danish characters
  6. 11:30-11:45 - Generated archives dataset (594 records)
  7. 11:45-12:00 - Combined libraries + archives (1,149 records)

Next Session Instructions

Priority 1: Process library branches (1,199 records)

Command to start:

cd /Users/kempersc/apps/glam
python3 -c "
import sys; sys.path.insert(0, 'src')
from glam_extractor.parsers.danish_library import parse_danish_library_csv
import json
from pathlib import Path

# Parse public library branches
public_branches = parse_danish_library_csv('data/isil/denmark/publicbranches.csv')
print(f'Public branches: {len(public_branches)}')

# Parse research library branches
ffu_branches = parse_danish_library_csv('data/isil/denmark/ffubranches.csv')
print(f'FFU branches: {len(ffu_branches)}')

# Combine (saving is still to do; json and Path are imported for that step)
all_branches = public_branches + ffu_branches
print(f'Total branches: {len(all_branches)}')

# TODO: Link branches to parent libraries using ISIL codes
"

Decision Point: How to model branches?

  • Option A: Full HeritageCustodian records (simple, but redundant)
  • Option B: Use parent_organization field to link to main library (recommended)

Files Modified This Session

Parsers

  • /src/glam_extractor/parsers/danish_archive.py - Created (594 archives)
  • /src/glam_extractor/parsers/danish_library.py - Enhanced (uses normalized city names)

Core Identifiers

  • /src/glam_extractor/identifiers/ghcid.py - Enhanced (Nordic character support)

Data Instances

  • /data/instances/denmark_libraries_v2.json - Updated (555 libraries)
  • /data/instances/denmark_archives.json - Created (594 archives)
  • /data/instances/denmark_glam_complete.json - Created (1,149 combined)

Success Metrics

Archives parsed: 594/594 (100%)
GHCID generation: 998/1,149 (86.9%)
Data quality: TIER_2_VERIFIED (official registries)
LinkML compliance: Full schema validation passing
Nordic character support: æ, ø, å, ä, ö, ü handled


References

  • Danish ISIL Registry: data/isil/denmark/ (4 CSV files)
  • LinkML Schema: schemas/heritage_custodian.yaml (v0.2.1)
  • GHCID Specification: docs/PERSISTENT_IDENTIFIERS.md
  • Previous session summary: SESSION_SUMMARY_20251119_DENMARK_ISIL_COMPLETE.md

Completed By: AI Agent (OpenCODE)
Next Agent: Continue with library branches processing