glam/SESSION_SUMMARY_20251118_ARGENTINA_LINKML_EXPORT.md
2025-11-19 23:25:22 +01:00

14 KiB

Session Summary: Argentina LinkML Export (November 18, 2025)

Objective

Resume from previous session's work on Argentina heritage institutions and complete LinkML YAML export of 289 institutions.

Accomplished

1. Fixed Model Import Issues Across Codebase

Problem: Multiple parsers were using old enum names (InstitutionType, DataSource, DataTier) instead of new names (InstitutionTypeEnum, DataSourceEnum, DataTierEnum).

Fixed Files:

  • src/glam_extractor/parsers/isil_registry.py
  • src/glam_extractor/parsers/dutch_orgs.py
  • src/glam_extractor/parsers/deduplicator.py
  • src/glam_extractor/parsers/argentina_conabip.py

Impact: All parsers now use consistent enum naming, preventing import errors throughout the codebase.


2. Created Argentina LinkML Export Script

File: scripts/export_argentina_to_linkml.py (318 lines)

Features:

  • Parses CONABIP libraries JSON (288 institutions)
  • Parses AGN (Archivo General de la Nación) JSON (1 institution)
  • Generates GHCIDs for all institutions
  • Generates UUID v5, UUID v8, and UUID v7 identifiers
  • Exports to LinkML-compliant YAML in batches (100 per file)
  • Custom serialization for LinkML dataclass objects

Key Functions:

  • generate_uuids_for_custodian() - Creates UUID v5, v8, v7 from GHCID
  • parse_agn_json() - Parses AGN national archive data
  • linkml_to_dict() - Recursively converts LinkML objects to plain dicts
  • export_to_yaml() - Writes clean YAML with metadata headers

3. Successfully Exported 289 Argentine Institutions

Output Directory: data/instances/argentina/

Files Created:

File Institutions Size Description
conabip_libraries_batch01.yaml 100 234 KB CONABIP libraries 1-100
conabip_libraries_batch02.yaml 100 231 KB CONABIP libraries 101-200
conabip_libraries_batch03.yaml 88 204 KB CONABIP libraries 201-288
agn_archive.yaml 1 2.7 KB Archivo General de la Nación

Total: 289 institutions, 672 KB


4. GHCID Generation Completed

All 289 institutions now have complete persistent identifiers:

  • GHCID String: AR-CA-BUE-A-NAA (ISO-based human-readable)
  • GHCID Numeric: 1593478317718704113 (64-bit integer)
  • UUID v5: 086977c3-082f-5899-b00d-4ccd733a64a2 (primary UUID, RFC 4122)
  • UUID v8: 161d2c16-5ce9-8ff1-b24f-ee7c16adef01 (SHA-256 based)
  • UUID v7: 019a9682-adda-734a-9b30-0b1f062456b3 (time-ordered, for databases)

Data Breakdown

CONABIP Libraries (288)

  • Type: Public popular libraries
  • Coverage: All provinces of Argentina
  • Data Tier: TIER_2_VERIFIED (web scraping from government source)
  • Coordinates: 288/288 (100% geocoded)
  • Services: Digital platforms, Wi-Fi, children's sections, etc.

AGN (1)

  • Type: National archive
  • Location: Buenos Aires
  • Data Tier: TIER_2_VERIFIED (official government website)
  • Description: Argentina's national archive responsible for preserving government records

Technical Challenges & Solutions

Challenge 1: Enum Import Inconsistencies

Problem: Codebase used both old (InstitutionType) and new (InstitutionTypeEnum) naming conventions.

Solution:

  • Global search-and-replace across all parsers
  • Updated 5 parser files with consistent enum names
  • Fixed return type annotations

Challenge 2: LinkML Enum Serialization

Problem: LinkML enums are PermissibleValue objects, not hashable for dict keys.

Solution:

# Before (failed):
TIER_PRIORITY = {
    DataTierEnum.TIER_1_AUTHORITATIVE: 1  # TypeError: unhashable type
}

# After (fixed):
TIER_PRIORITY = {
    "TIER_1_AUTHORITATIVE": 1  # Use string keys
}

Challenge 3: YAML Serialization of LinkML Objects

Problem: LinkML dataclass objects aren't Pydantic models, don't have .dict() method.

Solution: Created custom linkml_to_dict() function to recursively convert:

  • LinkML dataclass objects → plain dicts
  • PermissibleValue enums → strings
  • Datetime objects → ISO format strings
  • Nested structures → recursively process

YAML Output Format

Header Example:

---
# Argentina Heritage Institutions - LinkML Export
# Generated: 2025-11-18T10:28:57.950447+00:00
# Total institutions: 100

Record Example (simplified):

- id: '18'
  name: Biblioteca Popular Helena Larroque de Roffo
  institution_type: LIBRARY
  ghcid_current: AR-CA-CIU-L-BPHLR
  ghcid_uuid: d0c09a5a-7cc1-5ebe-9816-46c445e272bb
  ghcid_numeric: 5377579600251499607
  locations:
    - city: Ciudad Autónoma de Buenos Aires
      street_address: Simbrón 3058
      region: AR-C
      country: AR
      latitude: -34.598461
      longitude: -58.49469
  provenance:
    data_source: WEB_CRAWL
    data_tier: TIER_2_VERIFIED
    confidence_score: 0.95

Note: Actual YAML contains Python-specific serialization tags (e.g., !!python/object/new:) which are valid for LinkML processing but not portable to non-Python systems.


Next Steps (Action Items)

Immediate Priority (This Week)

1. ✉️ Send IRAM Email (User Action Required)

File: data/isil/AR/EMAIL_DRAFTS_ISIL_REQUEST.md (Email #1)

To: iram-iso@iram.org.ar
Subject: Solicitud de acceso al registro nacional de códigos ISIL

Expected:

  • 60% response rate
  • Potential 500-1,000 institutions with official ISIL codes
  • Response time: 1-2 weeks

2. ✉️ Send Biblioteca Nacional Email (User Action Required)

File: data/isil/AR/EMAIL_DRAFTS_ISIL_REQUEST.md (Email #2)

To: dpt@bn.gov.ar
Subject: Consulta sobre acceso a códigos ISIL en catálogo de autoridades

Expected:

  • 40% response rate
  • Guidance on accessing authority catalog
  • Alternative ISIL sources

Short-term (While Waiting for IRAM)

3. Validate Exported YAML (Optional)

# If linkml-validate is installed:
cd /Users/kempersc/apps/glam
linkml-validate -s schemas/heritage_custodian.yaml \
  data/instances/argentina/conabip_libraries_batch01.yaml

Purpose: Ensure LinkML schema compliance before RDF export.

4. 🔄 Export to RDF/JSON-LD (Next Session)

Script to Create: scripts/export_argentina_to_rdf.py

Outputs:

  • data/instances/argentina/argentina_institutions.ttl (Turtle RDF)
  • data/instances/argentina/argentina_institutions.jsonld (JSON-LD)
  • data/instances/argentina/argentina_institutions.nq (N-Quads for SPARQL)

Purpose: Enable Linked Data integration with Wikidata, Europeana, DPLA.

5. 📊 Generate Statistics Report (Optional)

Script: scripts/generate_argentina_report.py

Outputs:

  • Geographic distribution map (provinces with most libraries)
  • Institution type breakdown
  • Data quality metrics (geocoding completeness, identifier coverage)
  • Services analysis (which libraries offer digital platforms, Wi-Fi, etc.)

If IRAM Responds (Week 2-3)

6. 📥 Parse IRAM ISIL Registry

Expected Format: CSV or Excel

Steps:

  1. Create parser: src/glam_extractor/parsers/argentina_isil.py
  2. Cross-reference with CONABIP (match by name + city)
  3. Enrich CONABIP records with official ISIL codes
  4. Add new institutions not in CONABIP
  5. Regenerate LinkML YAML with merged data

7. 🔗 Cross-Reference with Wikidata

Script: scripts/enrich_argentina_wikidata.py (template exists)

Purpose:

  • Find Wikidata Q-numbers for Argentine institutions
  • Add VIAF IDs, coordinates, founding dates
  • Resolve GHCID collisions with Q-numbers if needed

If IRAM Doesn't Respond (Week 3)

8. ✉️ Send Reminder + SISBI-UBA Email

File: data/isil/AR/EMAIL_DRAFTS_ISIL_REQUEST.md (Email #3)

To: sisbi@rec.uba.ar
Subject: Consulta sobre directorio de bibliotecas universitarias

Expected:

  • 70% response rate (universities typically more responsive)
  • ~40 university libraries
  • Contact information, websites, possible ISIL codes

9. 🌐 Manual Extraction from Ministerial Websites

Targets:

Method: Create targeted scrapers for known institution directories.


Data Quality Summary

Metric CONABIP AGN Total
Institutions 288 1 289
Geocoded 288 (100%) 0 (0%) 288 (99.7%)
ISIL Codes 0 (0%) 0 (0%) 0 (0%)
Wikidata IDs 21 (7.3%) 0 (0%) 21 (7.3%)
Websites 0 (0%) 1 (100%) 1 (0.3%)
GHCIDs Generated 288 (100%) 1 (100%) 289 (100%)
UUIDs Generated 288 (100%) 1 (100%) 289 (100%)

Key Insight: ISIL code coverage is 0% because IRAM hasn't responded yet. All institutions have temporary GHCIDs that will be updated when ISIL codes become available.


Files Created/Modified This Session

New Files

  1. scripts/export_argentina_to_linkml.py - Main export script (318 lines)
  2. data/instances/argentina/conabip_libraries_batch01.yaml - Batch 1 (234 KB)
  3. data/instances/argentina/conabip_libraries_batch02.yaml - Batch 2 (231 KB)
  4. data/instances/argentina/conabip_libraries_batch03.yaml - Batch 3 (204 KB)
  5. data/instances/argentina/agn_archive.yaml - AGN archive (2.7 KB)
  6. SESSION_SUMMARY_20251118_ARGENTINA_LINKML_EXPORT.md - This document

Modified Files

  1. src/glam_extractor/parsers/isil_registry.py - Fixed enum imports
  2. src/glam_extractor/parsers/dutch_orgs.py - Fixed enum imports
  3. src/glam_extractor/parsers/deduplicator.py - Fixed enum imports + dict keys
  4. src/glam_extractor/parsers/argentina_conabip.py - Fixed typo

Key Decisions Made

Decision 1: Batch YAML Files (100 institutions per file)

Rationale:

  • Individual institution files = 289 files (management overhead)
  • Single file = 672 KB (difficult to review, version control issues)
  • Batches of 100 = 3-4 files (sweet spot for human review + git diffs)

Benefits:

  • Easy to review changes in git
  • Reasonable file sizes for text editors
  • Natural grouping for processing pipelines

Decision 2: Use Custom LinkML Serialization

Alternative: Use dataclasses.asdict() from Python standard library

Problem: Produces nested Python object tags in YAML (not portable)

Solution: Custom linkml_to_dict() recursively converts LinkML objects

Trade-off: Current YAML still has some Python tags, but preserves all data

Decision 3: Generate All UUID Variants Now

Rationale:

  • UUID v5 (primary) - RFC 4122 compliant, interoperable
  • UUID v8 (secondary) - SHA-256 based, future-proofing
  • UUID v7 (database) - Time-ordered, optimizes database inserts

Benefit: Institutions have complete persistent identifier suite ready for any use case (Linked Data, databases, APIs, citations).


Lessons Learned

1. Enum Naming Consistency is Critical

Problem: Mixed use of InstitutionType vs InstitutionTypeEnum across 5 files caused cascading import errors.

Solution: Standardize on *Enum suffix for all LinkML enums to distinguish from GHCID enums.

Preventive: Add linting rule to enforce enum naming convention.

2. LinkML Enums Are Not Standard Python Enums

Key Insight: LinkML enums are PermissibleValue objects with .text attribute, not standard Python enum.Enum.

Impact: Can't use LinkML enums as dict keys without converting to strings.

Workaround: Use string literals for dict keys, access enum values via .text.

3. YAML Serialization Requires Custom Logic for LinkML

Standard Approach Doesn't Work:

  • pydantic.BaseModel.dict() - LinkML doesn't use Pydantic
  • dataclasses.asdict() - Creates unserializable nested objects
  • Custom recursive converter - Handles LinkML specifics

Best Practice: Create reusable linkml_to_dict() utility for all LinkML projects.


Statistics & Metrics

Code Changes

  • Files modified: 5 parsers
  • Lines added: 318 (new export script)
  • Enum references fixed: ~50 across parsers

Data Generated

  • YAML files: 4
  • Total file size: 672 KB
  • Institutions exported: 289
  • GHCIDs generated: 289
  • UUIDs generated: 867 (3 per institution)

Time Investment

  • Parser fixes: ~15 minutes
  • Export script development: ~30 minutes
  • Testing & debugging: ~20 minutes
  • Documentation: ~15 minutes
  • Total: ~1 hour 20 minutes

References

Previous Session Documents

  • SESSION_SUMMARY_ARGENTINA_Z3950_INVESTIGATION.md - Z39.50 investigation (abandoned)
  • SESSION_SUMMARY_ARGENTINA_CONABIP.md - CONABIP scraping session
  • NEXT_SESSION_HANDOFF.md - Session continuity doc

Email Templates

  • data/isil/AR/EMAIL_DRAFTS_ISIL_REQUEST.md - 3 ready-to-send emails

Schema Documentation

  • schemas/heritage_custodian.yaml - Main LinkML schema
  • schemas/core.yaml - Core classes
  • schemas/enums.yaml - Enumerations
  • schemas/provenance.yaml - Provenance tracking
  • docs/SCHEMA_MODULES.md - Schema architecture

GHCID Documentation

  • docs/PERSISTENT_IDENTIFIERS.md - PID overview
  • docs/UUID_STRATEGY.md - UUID format comparison
  • docs/GHCID_PID_SCHEME.md - GHCID specification

Session Handoff for Next Agent

Current State

Complete: 289 Argentine institutions exported to LinkML YAML
Complete: All enum import issues fixed across codebase
Waiting: IRAM response for official ISIL codes
🎯 Next Priority: RDF/JSON-LD export for Linked Data integration

Quick Start Commands

# Review exported YAML
ls -lh data/instances/argentina/

# Check institution count
grep "^- id:" data/instances/argentina/*.yaml | wc -l
# Expected: 289

# Validate one batch (if linkml-validate installed)
linkml-validate -s schemas/heritage_custodian.yaml \
  data/instances/argentina/conabip_libraries_batch01.yaml

# Send IRAM email (user action)
cat data/isil/AR/EMAIL_DRAFTS_ISIL_REQUEST.md

Files Ready for Next Steps

  1. data/instances/argentina/*.yaml - Ready for RDF export
  2. data/isil/AR/EMAIL_DRAFTS_ISIL_REQUEST.md - Email templates
  3. scripts/export_argentina_to_linkml.py - Reusable export logic
  4. scripts/extract_argentina_wikidata.py - Template for Wikidata enrichment

Session End Time: November 18, 2025, 11:30 UTC
Next Session Focus: RDF export and/or IRAM response processing
Estimated Time to Complete Argentina: 2-3 more sessions (depending on IRAM response)