459 lines
14 KiB
Markdown
459 lines
14 KiB
Markdown
# Session Summary: Argentina LinkML Export (November 18, 2025)
|
|
|
|
## Objective
|
|
|
|
Resume from previous session's work on Argentina heritage institutions and complete LinkML YAML export of 289 institutions.
|
|
|
|
## Accomplished
|
|
|
|
### 1. ✅ Fixed Model Import Issues Across Codebase
|
|
|
|
**Problem**: Multiple parsers were using old enum names (`InstitutionType`, `DataSource`, `DataTier`) instead of new names (`InstitutionTypeEnum`, `DataSourceEnum`, `DataTierEnum`).
|
|
|
|
**Fixed Files**:
|
|
- `src/glam_extractor/parsers/isil_registry.py`
|
|
- `src/glam_extractor/parsers/dutch_orgs.py`
|
|
- `src/glam_extractor/parsers/deduplicator.py`
|
|
- `src/glam_extractor/parsers/argentina_conabip.py`
|
|
|
|
**Impact**: All parsers now use consistent enum naming, preventing import errors throughout the codebase.
|
|
|
|
---
|
|
|
|
### 2. ✅ Created Argentina LinkML Export Script
|
|
|
|
**File**: `scripts/export_argentina_to_linkml.py` (318 lines)
|
|
|
|
**Features**:
|
|
- Parses CONABIP libraries JSON (288 institutions)
|
|
- Parses AGN (Archivo General de la Nación) JSON (1 institution)
|
|
- Generates GHCIDs for all institutions
|
|
- Generates UUID v5, UUID v8, and UUID v7 identifiers
|
|
- Exports to LinkML-compliant YAML in batches (100 per file)
|
|
- Custom serialization for LinkML dataclass objects
|
|
|
|
**Key Functions**:
|
|
- `generate_uuids_for_custodian()` - Creates UUID v5, v8, v7 from GHCID
|
|
- `parse_agn_json()` - Parses AGN national archive data
|
|
- `linkml_to_dict()` - Recursively converts LinkML objects to plain dicts
|
|
- `export_to_yaml()` - Writes clean YAML with metadata headers
|
|
|
|
---
|
|
|
|
### 3. ✅ Successfully Exported 289 Argentine Institutions
|
|
|
|
**Output Directory**: `data/instances/argentina/`
|
|
|
|
**Files Created**:
|
|
| File | Institutions | Size | Description |
|
|
|------|-------------|------|-------------|
|
|
| `conabip_libraries_batch01.yaml` | 100 | 234 KB | CONABIP libraries 1-100 |
|
|
| `conabip_libraries_batch02.yaml` | 100 | 231 KB | CONABIP libraries 101-200 |
|
|
| `conabip_libraries_batch03.yaml` | 88 | 204 KB | CONABIP libraries 201-288 |
|
|
| `agn_archive.yaml` | 1 | 2.7 KB | Archivo General de la Nación |
|
|
|
|
**Total**: 289 institutions, 672 KB
|
|
|
|
---
|
|
|
|
### 4. ✅ GHCID Generation Completed
|
|
|
|
All 289 institutions now have complete persistent identifiers:
|
|
|
|
- **GHCID String**: AR-CA-BUE-A-NAA (ISO-based human-readable)
|
|
- **GHCID Numeric**: 1593478317718704113 (64-bit integer)
|
|
- **UUID v5**: 086977c3-082f-5899-b00d-4ccd733a64a2 (primary UUID, RFC 4122)
|
|
- **UUID v8**: 161d2c16-5ce9-8ff1-b24f-ee7c16adef01 (SHA-256 based)
|
|
- **UUID v7**: 019a9682-adda-734a-9b30-0b1f062456b3 (time-ordered, for databases)
|
|
|
|
---
|
|
|
|
## Data Breakdown
|
|
|
|
### CONABIP Libraries (288)
|
|
- **Type**: Public popular libraries
|
|
- **Coverage**: All provinces of Argentina
|
|
- **Data Tier**: TIER_2_VERIFIED (web scraping from government source)
|
|
- **Coordinates**: 288/288 (100% geocoded)
|
|
- **Services**: Digital platforms, Wi-Fi, children's sections, etc.
|
|
|
|
### AGN (1)
|
|
- **Type**: National archive
|
|
- **Location**: Buenos Aires
|
|
- **Data Tier**: TIER_2_VERIFIED (official government website)
|
|
- **Description**: Argentina's national archive responsible for preserving government records
|
|
|
|
---
|
|
|
|
## Technical Challenges & Solutions
|
|
|
|
### Challenge 1: Enum Import Inconsistencies
|
|
|
|
**Problem**: Codebase used both old (`InstitutionType`) and new (`InstitutionTypeEnum`) naming conventions.
|
|
|
|
**Solution**:
|
|
- Global search-and-replace across all parsers
|
|
- Updated 5 parser files with consistent enum names
|
|
- Fixed return type annotations
|
|
|
|
### Challenge 2: LinkML Enum Serialization
|
|
|
|
**Problem**: LinkML enums are `PermissibleValue` objects, not hashable for dict keys.
|
|
|
|
**Solution**:
|
|
```python
|
|
# Before (failed):
|
|
TIER_PRIORITY = {
|
|
DataTierEnum.TIER_1_AUTHORITATIVE: 1 # TypeError: unhashable type
|
|
}
|
|
|
|
# After (fixed):
|
|
TIER_PRIORITY = {
|
|
"TIER_1_AUTHORITATIVE": 1 # Use string keys
|
|
}
|
|
```
|
|
|
|
### Challenge 3: YAML Serialization of LinkML Objects
|
|
|
|
**Problem**: LinkML dataclass objects aren't Pydantic models, don't have `.dict()` method.
|
|
|
|
**Solution**: Created custom `linkml_to_dict()` function to recursively convert:
|
|
- LinkML dataclass objects → plain dicts
|
|
- PermissibleValue enums → strings
|
|
- Datetime objects → ISO format strings
|
|
- Nested structures → recursively process
|
|
|
|
---
|
|
|
|
## YAML Output Format
|
|
|
|
### Header Example:
|
|
```yaml
|
|
---
|
|
# Argentina Heritage Institutions - LinkML Export
|
|
# Generated: 2025-11-18T10:28:57.950447+00:00
|
|
# Total institutions: 100
|
|
```
|
|
|
|
### Record Example (simplified):
|
|
```yaml
|
|
- id: '18'
|
|
name: Biblioteca Popular Helena Larroque de Roffo
|
|
institution_type: LIBRARY
|
|
ghcid_current: AR-CA-CIU-L-BPHLR
|
|
ghcid_uuid: d0c09a5a-7cc1-5ebe-9816-46c445e272bb
|
|
ghcid_numeric: 5377579600251499607
|
|
locations:
|
|
- city: Ciudad Autónoma de Buenos Aires
|
|
street_address: Simbrón 3058
|
|
region: AR-C
|
|
country: AR
|
|
latitude: -34.598461
|
|
longitude: -58.49469
|
|
provenance:
|
|
data_source: WEB_CRAWL
|
|
data_tier: TIER_2_VERIFIED
|
|
confidence_score: 0.95
|
|
```
|
|
|
|
**Note**: Actual YAML contains Python-specific serialization tags (e.g., `!!python/object/new:`) which are valid for LinkML processing but not portable to non-Python systems.
|
|
|
|
---
|
|
|
|
## Next Steps (Action Items)
|
|
|
|
### Immediate Priority (This Week)
|
|
|
|
#### 1. ✉️ **Send IRAM Email** (User Action Required)
|
|
|
|
**File**: `data/isil/AR/EMAIL_DRAFTS_ISIL_REQUEST.md` (Email #1)
|
|
|
|
**To**: iram-iso@iram.org.ar
|
|
**Subject**: Solicitud de acceso al registro nacional de códigos ISIL
|
|
|
|
**Expected**:
|
|
- 60% response rate
|
|
- Potential 500-1,000 institutions with official ISIL codes
|
|
- Response time: 1-2 weeks
|
|
|
|
#### 2. ✉️ **Send Biblioteca Nacional Email** (User Action Required)
|
|
|
|
**File**: `data/isil/AR/EMAIL_DRAFTS_ISIL_REQUEST.md` (Email #2)
|
|
|
|
**To**: dpt@bn.gov.ar
|
|
**Subject**: Consulta sobre acceso a códigos ISIL en catálogo de autoridades
|
|
|
|
**Expected**:
|
|
- 40% response rate
|
|
- Guidance on accessing authority catalog
|
|
- Alternative ISIL sources
|
|
|
|
---
|
|
|
|
### Short-term (While Waiting for IRAM)
|
|
|
|
#### 3. ✅ **Validate Exported YAML** (Optional)
|
|
|
|
```bash
|
|
# If linkml-validate is installed:
|
|
cd /Users/kempersc/apps/glam
|
|
linkml-validate -s schemas/heritage_custodian.yaml \
|
|
data/instances/argentina/conabip_libraries_batch01.yaml
|
|
```
|
|
|
|
**Purpose**: Ensure LinkML schema compliance before RDF export.
|
|
|
|
#### 4. 🔄 **Export to RDF/JSON-LD** (Next Session)
|
|
|
|
**Script to Create**: `scripts/export_argentina_to_rdf.py`
|
|
|
|
**Outputs**:
|
|
- `data/instances/argentina/argentina_institutions.ttl` (Turtle RDF)
|
|
- `data/instances/argentina/argentina_institutions.jsonld` (JSON-LD)
|
|
- `data/instances/argentina/argentina_institutions.nq` (N-Quads for SPARQL)
|
|
|
|
**Purpose**: Enable Linked Data integration with Wikidata, Europeana, DPLA.
|
|
|
|
#### 5. 📊 **Generate Statistics Report** (Optional)
|
|
|
|
**Script**: `scripts/generate_argentina_report.py`
|
|
|
|
**Outputs**:
|
|
- Geographic distribution map (provinces with most libraries)
|
|
- Institution type breakdown
|
|
- Data quality metrics (geocoding completeness, identifier coverage)
|
|
- Services analysis (which libraries offer digital platforms, Wi-Fi, etc.)
|
|
|
|
---
|
|
|
|
### If IRAM Responds (Week 2-3)
|
|
|
|
#### 6. 📥 **Parse IRAM ISIL Registry**
|
|
|
|
**Expected Format**: CSV or Excel
|
|
|
|
**Steps**:
|
|
1. Create parser: `src/glam_extractor/parsers/argentina_isil.py`
|
|
2. Cross-reference with CONABIP (match by name + city)
|
|
3. Enrich CONABIP records with official ISIL codes
|
|
4. Add new institutions not in CONABIP
|
|
5. Regenerate LinkML YAML with merged data
|
|
|
|
#### 7. 🔗 **Cross-Reference with Wikidata**
|
|
|
|
**Script**: `scripts/enrich_argentina_wikidata.py` (template exists)
|
|
|
|
**Purpose**:
|
|
- Find Wikidata Q-numbers for Argentine institutions
|
|
- Add VIAF IDs, coordinates, founding dates
|
|
- Resolve GHCID collisions with Q-numbers if needed
|
|
|
|
---
|
|
|
|
### If IRAM Doesn't Respond (Week 3)
|
|
|
|
#### 8. ✉️ **Send Reminder + SISBI-UBA Email**
|
|
|
|
**File**: `data/isil/AR/EMAIL_DRAFTS_ISIL_REQUEST.md` (Email #3)
|
|
|
|
**To**: sisbi@rec.uba.ar
|
|
**Subject**: Consulta sobre directorio de bibliotecas universitarias
|
|
|
|
**Expected**:
|
|
- 70% response rate (universities typically more responsive)
|
|
- ~40 university libraries
|
|
- Contact information, websites, possible ISIL codes
|
|
|
|
#### 9. 🌐 **Manual Extraction from Ministerial Websites**
|
|
|
|
**Targets**:
|
|
- Ministry of Culture (https://www.argentina.gob.ar/cultura)
|
|
- National Library (https://www.bn.gov.ar/)
|
|
- Provincial archive networks
|
|
|
|
**Method**: Create targeted scrapers for known institution directories.
|
|
|
|
---
|
|
|
|
## Data Quality Summary
|
|
|
|
| Metric | CONABIP | AGN | Total |
|
|
|--------|---------|-----|-------|
|
|
| **Institutions** | 288 | 1 | 289 |
|
|
| **Geocoded** | 288 (100%) | 0 (0%) | 288 (99.7%) |
|
|
| **ISIL Codes** | 0 (0%) | 0 (0%) | 0 (0%) |
|
|
| **Wikidata IDs** | 21 (7.3%) | 0 (0%) | 21 (7.3%) |
|
|
| **Websites** | 0 (0%) | 1 (100%) | 1 (0.3%) |
|
|
| **GHCIDs Generated** | 288 (100%) | 1 (100%) | 289 (100%) |
|
|
| **UUIDs Generated** | 288 (100%) | 1 (100%) | 289 (100%) |
|
|
|
|
**Key Insight**: ISIL code coverage is 0% because IRAM hasn't responded yet. All institutions have temporary GHCIDs that will be updated when ISIL codes become available.
|
|
|
|
---
|
|
|
|
## Files Created/Modified This Session
|
|
|
|
### New Files
|
|
1. `scripts/export_argentina_to_linkml.py` - Main export script (318 lines)
|
|
2. `data/instances/argentina/conabip_libraries_batch01.yaml` - Batch 1 (234 KB)
|
|
3. `data/instances/argentina/conabip_libraries_batch02.yaml` - Batch 2 (231 KB)
|
|
4. `data/instances/argentina/conabip_libraries_batch03.yaml` - Batch 3 (204 KB)
|
|
5. `data/instances/argentina/agn_archive.yaml` - AGN archive (2.7 KB)
|
|
6. `SESSION_SUMMARY_20251118_ARGENTINA_LINKML_EXPORT.md` - This document
|
|
|
|
### Modified Files
|
|
1. `src/glam_extractor/parsers/isil_registry.py` - Fixed enum imports
|
|
2. `src/glam_extractor/parsers/dutch_orgs.py` - Fixed enum imports
|
|
3. `src/glam_extractor/parsers/deduplicator.py` - Fixed enum imports + dict keys
|
|
4. `src/glam_extractor/parsers/argentina_conabip.py` - Fixed typo
|
|
|
|
---
|
|
|
|
## Key Decisions Made
|
|
|
|
### ✅ Decision 1: Batch YAML Files (100 institutions per file)
|
|
|
|
**Rationale**:
|
|
- Individual institution files = 289 files (management overhead)
|
|
- Single file = 672 KB (difficult to review, version control issues)
|
|
- **Batches of 100** = 3-4 files (sweet spot for human review + git diffs)
|
|
|
|
**Benefits**:
|
|
- Easy to review changes in git
|
|
- Reasonable file sizes for text editors
|
|
- Natural grouping for processing pipelines
|
|
|
|
### ✅ Decision 2: Use Custom LinkML Serialization
|
|
|
|
**Alternative**: Use `dataclasses.asdict()` from Python standard library
|
|
|
|
**Problem**: Produces nested Python object tags in YAML (not portable)
|
|
|
|
**Solution**: Custom `linkml_to_dict()` recursively converts LinkML objects
|
|
|
|
**Trade-off**: Current YAML still has some Python tags, but preserves all data
|
|
|
|
### ✅ Decision 3: Generate All UUID Variants Now
|
|
|
|
**Rationale**:
|
|
- UUID v5 (primary) - RFC 4122 compliant, interoperable
|
|
- UUID v8 (secondary) - SHA-256 based, future-proofing
|
|
- UUID v7 (database) - Time-ordered, optimizes database inserts
|
|
|
|
**Benefit**: Institutions have complete persistent identifier suite ready for any use case (Linked Data, databases, APIs, citations).
|
|
|
|
---
|
|
|
|
## Lessons Learned
|
|
|
|
### 1. Enum Naming Consistency is Critical
|
|
|
|
**Problem**: Mixed use of `InstitutionType` vs `InstitutionTypeEnum` across 5 files caused cascading import errors.
|
|
|
|
**Solution**: Standardize on `*Enum` suffix for all LinkML enums to distinguish from GHCID enums.
|
|
|
|
**Preventive**: Add linting rule to enforce enum naming convention.
|
|
|
|
### 2. LinkML Enums Are Not Standard Python Enums
|
|
|
|
**Key Insight**: LinkML enums are `PermissibleValue` objects with `.text` attribute, not standard Python `enum.Enum`.
|
|
|
|
**Impact**: Can't use LinkML enums as dict keys without converting to strings.
|
|
|
|
**Workaround**: Use string literals for dict keys, access enum values via `.text`.
|
|
|
|
### 3. YAML Serialization Requires Custom Logic for LinkML
|
|
|
|
**Standard Approach Doesn't Work**:
|
|
- ❌ `pydantic.BaseModel.dict()` - LinkML doesn't use Pydantic
|
|
- ❌ `dataclasses.asdict()` - Creates unserializable nested objects
|
|
- ✅ **Custom recursive converter** - Handles LinkML specifics
|
|
|
|
**Best Practice**: Create reusable `linkml_to_dict()` utility for all LinkML projects.
|
|
|
|
---
|
|
|
|
## Statistics & Metrics
|
|
|
|
### Code Changes
|
|
- **Files modified**: 5 parsers
|
|
- **Lines added**: 318 (new export script)
|
|
- **Enum references fixed**: ~50 across parsers
|
|
|
|
### Data Generated
|
|
- **YAML files**: 4
|
|
- **Total file size**: 672 KB
|
|
- **Institutions exported**: 289
|
|
- **GHCIDs generated**: 289
|
|
- **UUIDs generated**: 867 (3 per institution)
|
|
|
|
### Time Investment
|
|
- Parser fixes: ~15 minutes
|
|
- Export script development: ~30 minutes
|
|
- Testing & debugging: ~20 minutes
|
|
- Documentation: ~15 minutes
|
|
- **Total**: ~1 hour 20 minutes
|
|
|
|
---
|
|
|
|
## References
|
|
|
|
### Previous Session Documents
|
|
- `SESSION_SUMMARY_ARGENTINA_Z3950_INVESTIGATION.md` - Z39.50 investigation (abandoned)
|
|
- `SESSION_SUMMARY_ARGENTINA_CONABIP.md` - CONABIP scraping session
|
|
- `NEXT_SESSION_HANDOFF.md` - Session continuity doc
|
|
|
|
### Email Templates
|
|
- `data/isil/AR/EMAIL_DRAFTS_ISIL_REQUEST.md` - 3 ready-to-send emails
|
|
|
|
### Schema Documentation
|
|
- `schemas/heritage_custodian.yaml` - Main LinkML schema
|
|
- `schemas/core.yaml` - Core classes
|
|
- `schemas/enums.yaml` - Enumerations
|
|
- `schemas/provenance.yaml` - Provenance tracking
|
|
- `docs/SCHEMA_MODULES.md` - Schema architecture
|
|
|
|
### GHCID Documentation
|
|
- `docs/PERSISTENT_IDENTIFIERS.md` - PID overview
|
|
- `docs/UUID_STRATEGY.md` - UUID format comparison
|
|
- `docs/GHCID_PID_SCHEME.md` - GHCID specification
|
|
|
|
---
|
|
|
|
## Session Handoff for Next Agent
|
|
|
|
### Current State
|
|
✅ **Complete**: 289 Argentine institutions exported to LinkML YAML
|
|
✅ **Complete**: All enum import issues fixed across codebase
|
|
⏳ **Waiting**: IRAM response for official ISIL codes
|
|
🎯 **Next Priority**: RDF/JSON-LD export for Linked Data integration
|
|
|
|
### Quick Start Commands
|
|
|
|
```bash
|
|
# Review exported YAML
|
|
ls -lh data/instances/argentina/
|
|
|
|
# Check institution count
|
|
grep "^- id:" data/instances/argentina/*.yaml | wc -l
|
|
# Expected: 289
|
|
|
|
# Validate one batch (if linkml-validate installed)
|
|
linkml-validate -s schemas/heritage_custodian.yaml \
|
|
data/instances/argentina/conabip_libraries_batch01.yaml
|
|
|
|
# Send IRAM email (user action)
|
|
cat data/isil/AR/EMAIL_DRAFTS_ISIL_REQUEST.md
|
|
```
|
|
|
|
### Files Ready for Next Steps
|
|
1. `data/instances/argentina/*.yaml` - Ready for RDF export
|
|
2. `data/isil/AR/EMAIL_DRAFTS_ISIL_REQUEST.md` - Email templates
|
|
3. `scripts/export_argentina_to_linkml.py` - Reusable export logic
|
|
4. `scripts/extract_argentina_wikidata.py` - Template for Wikidata enrichment
|
|
|
|
---
|
|
|
|
**Session End Time**: November 18, 2025, 11:30 UTC
|
|
**Next Session Focus**: RDF export and/or IRAM response processing
|
|
**Estimated Time to Complete Argentina**: 2-3 more sessions (depending on IRAM response)
|