glam/SESSION_COMPLETE.md
2025-11-19 23:25:22 +01:00

Session Complete: GHCID Integration & Collision Resolution

Summary

Successfully integrated GHCID (Global Heritage Custodian Identifier) generation into the parser pipeline with full collision resolution support using Wikidata Q-numbers.

What Was Accomplished

1. Updated Pydantic Models

File: src/glam_extractor/models.py

Added GHCID fields to HeritageCustodian model:

  • ghcid_numeric: Optional[int] - Persistent 64-bit numeric ID derived from a SHA-256 hash
  • ghcid_current: Optional[str] - Current human-readable GHCID
  • ghcid_original: Optional[str] - Original GHCID (immutable)
  • ghcid_history: Optional[List[GHCIDHistoryEntry]] - Change history

Configured arbitrary_types_allowed = True to support GHCIDHistoryEntry dataclass.

2. Integrated GHCID into ISIL Parser

File: src/glam_extractor/parsers/isil_registry.py

  • Added imports for GHCIDGenerator and lookup functions
  • Implemented _infer_institution_type() method to detect Museum/Library/Archive
  • Updated to_heritage_custodian() to generate GHCID for cities in lookup table
  • Creates initial GHCIDHistoryEntry with assignment date from ISIL registry
  • Gracefully skips GHCID generation for cities not in lookup (212 of 364 records)
  • Updated provenance extraction_method to indicate GHCID generation
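
The type detection step can be illustrated with a keyword match on the institution name. This is a minimal sketch, not the actual _infer_institution_type() implementation: the keyword table and the function name are hypothetical, and the "L" code for libraries is an assumption (only "A" and "M" appear in the GHCIDs in this report).

```python
from typing import Optional

# Hypothetical keyword table: Dutch and English name fragments mapped to
# single-letter type codes. "A" and "M" appear in this report's GHCIDs;
# "L" for libraries is an assumption.
TYPE_KEYWORDS = {
    "A": ("archief", "archive"),
    "M": ("museum",),
    "L": ("bibliotheek", "library"),
}


def infer_institution_type(name: str) -> Optional[str]:
    """Return the code of the first keyword group found in the name, else None."""
    lowered = name.lower()
    for code, keywords in TYPE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return code
    return None
```

Returning None rather than guessing lets the caller decide whether to skip GHCID generation for names that carry no recognizable type keyword.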

Real Data Results:

  • 364 ISIL records processed
  • 152 records (41.8%) with GHCIDs
  • 212 records without GHCIDs (cities not in lookup table yet)
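
The graceful skip and the coverage figure above can be sketched as follows. The lookup excerpt, its schema, and both function names are assumptions made for illustration; the real lookup file's layout is not shown in this report.

```python
from typing import Dict, List, Optional

# Hypothetical excerpt of a city lookup; the codes are taken from the
# example GHCIDs in this report, the dict layout is an assumption.
CITY_CODES: Dict[str, str] = {"Alkmaar": "ALK", "Alphen aan den Rijn": "ALP"}


def city_code(city: str, lookup: Dict[str, str] = CITY_CODES) -> Optional[str]:
    """Return the three-letter code for a city, or None when it is not in the lookup."""
    return lookup.get(city)


def coverage(cities: List[str]) -> float:
    """Percentage of records whose city resolves to a code, rounded to one decimal."""
    if not cities:
        return 0.0
    hits = sum(1 for city in cities if city_code(city) is not None)
    return round(100 * hits / len(cities), 1)
```

With the numbers reported here, round(100 * 152 / 364, 1) gives 41.8.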

Example GHCIDs:

Regionaal Archief Alkmaar           → NL-NH-ALK-A-RAA
Gemeentearchief Alphen aan den Rijn → NL-ZH-ALP-A-GAADR
Streekarchief Rijnlands Midden      → NL-ZH-ALP-A-SRM
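
The three examples above are consistent with an abbreviation built from the first letter of each word in the name. The sketch below assumes that rule; the real GHCIDGenerator may apply additional rules, and both function names are hypothetical.

```python
def abbreviate(name: str) -> str:
    """Take the first letter of each word, uppercased (assumed rule)."""
    return "".join(word[0].upper() for word in name.split())


def build_ghcid(country: str, region: str, city: str, type_code: str, name: str) -> str:
    """Join country, region, city code, type code, and name abbreviation with hyphens."""
    return "-".join([country, region, city, type_code, abbreviate(name)])
```

For example, build_ghcid("NL", "ZH", "ALP", "A", "Gemeentearchief Alphen aan den Rijn") returns "NL-ZH-ALP-A-GAADR".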

3. Implemented Collision Resolution Tests

File: tests/identifiers/test_ghcid.py

Added 13 comprehensive collision tests (TestCollisionResolution class):

  1. Basic Functionality:

    • test_ghcid_without_qid - Standard GHCID without collision
    • test_ghcid_with_qid - GHCID with Wikidata Q-number
  2. Q-Number Normalization:

    • test_qid_normalization_with_prefix - Strips "Q" prefix from input
    • test_qid_normalization_without_prefix - Handles numeric-only input
    • test_qid_consistency - Both formats produce identical results
  3. Real-World Examples:

    • test_collision_example_stedelijk_museum - Stedelijk Museum (Q924335)
    • test_collision_example_science_museum - Hypothetical collision scenario
    • test_collision_different_numeric_hashes - Ensures unique numeric IDs
  4. Validation:

    • test_qid_validation_numeric_only - Accepts valid numeric Q-numbers
    • test_qid_invalid_characters - Rejects non-numeric values
    • test_schema_regex_pattern_without_qid - Matches schema pattern
    • test_schema_regex_pattern_with_qid - Matches schema pattern with Q-suffix
    • test_history_entry_with_collision - History tracking with collision resolver

4. Updated Test Expectations

File: tests/parsers/test_isil_registry.py

Updated test_to_heritage_custodian() to:

  • Expect new extraction method: "ISILRegistryParser with GHCID generation"
  • Verify GHCID fields are populated (numeric, current, original, history)
  • Validate GHCID format matches schema pattern

Test Results

150 tests passing (100%)
  - 88 original tests
  - 49 GHCID core tests (from previous session)
  - 13 NEW collision resolution tests

Coverage: 90%
  - models.py: 96%
  - isil_registry.py: 84%
  - ghcid.py: 93% (raised from 71% during testing)
  - lookups.py: 92%

Schema Validation

GHCID fields in schemas/heritage_custodian.yaml validate correctly:

ghcid_current:
  pattern: ^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$
  # Examples:
  # Without collision: NL-NH-AMS-M-RM
  # With collision:    NL-NH-AMS-M-SM-Q924335

ghcid_original:
  pattern: ^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$
  # Immutable - freezes original identifier
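
The pattern can be checked directly with Python's re module. This is a small sketch; the pattern string is copied from the schema excerpt above, and the helper name is_valid_ghcid is hypothetical.

```python
import re

# Pattern copied from the schema excerpt above.
GHCID_PATTERN = re.compile(
    r"^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$"
)


def is_valid_ghcid(value: str) -> bool:
    """True when the whole string matches the schema pattern."""
    # fullmatch rejects a trailing newline, which "$" alone would tolerate.
    return GHCID_PATTERN.fullmatch(value) is not None
```

Both documented examples pass, while a collision suffix without the "Q" prefix (NL-NH-AMS-M-SM-924335) is rejected.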

Collision Resolution Design

Format

  • Without collision: NL-NH-AMS-M-RM (Rijksmuseum)
  • With collision: NL-NH-AMS-M-SM-Q924335 (Stedelijk Museum)

Q-Number Normalization

  • Input: "Q924335" or "924335" (both accepted)
  • Storage: "924335" (Q prefix stripped)
  • Display: "Q924335" (Q prefix added in to_string())
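
The normalization rules above can be sketched as a pair of helpers. The function names are hypothetical and this is not the code in ghcid.py; it only illustrates the input, storage, and display forms listed above.

```python
def normalize_qid(raw: str) -> str:
    """Strip an optional leading 'Q' and return the digits; reject anything else."""
    digits = raw.strip()
    if digits.startswith("Q"):
        digits = digits[1:]
    # isascii() guards against non-ASCII digit characters that isdigit() accepts.
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"Invalid Wikidata Q-number: {raw!r}")
    return digits


def display_qid(stored: str) -> str:
    """Add the 'Q' prefix back for display."""
    return f"Q{stored}"
```

Storing only the digits means both input forms compare equal, which is what the consistency test in the suite checks for.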

Numeric ID Impact

  • SHA256 hash includes Q-number if present
  • Different Q-numbers produce different numeric IDs
  • Keeps 64-bit numeric identifiers distinct for institutions that share a base GHCID
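
One way to derive a 64-bit value from SHA-256 is to keep the first 8 bytes of the digest. The sketch below assumes that truncation, big-endian byte order, and the full GHCID string (including any Q-suffix) as hash input; the actual derivation in ghcid.py may differ.

```python
import hashlib


def ghcid_numeric(ghcid: str) -> int:
    """Hash the GHCID string with SHA-256 and keep the first 8 bytes as an unsigned integer."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    # Assumption: truncate to 8 bytes (64 bits), interpreted big-endian.
    return int.from_bytes(digest[:8], byteorder="big")
```

Because the Q-suffix is part of the hashed string, NL-NH-AMS-M-SM and NL-NH-AMS-M-SM-Q924335 hash to different values. A truncated hash makes collisions very unlikely rather than impossible.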

Migration Strategy

  • ghcid_original: Frozen at first assignment
  • ghcid_current: Updated when collision detected
  • ghcid_history: Tracks all changes with timestamps and reasons
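
The interplay of the three fields can be sketched with plain dataclasses. This uses the standard library rather than the project's Pydantic model so that it runs without dependencies; HistoryEntry, CustodianIds, and assign() are hypothetical stand-ins, not the real classes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class HistoryEntry:
    """One change record (hypothetical stand-in for GHCIDHistoryEntry)."""
    ghcid: str
    reason: str
    timestamp: datetime


@dataclass
class CustodianIds:
    """Holds the original, current, and historical GHCIDs for one institution."""
    ghcid_original: Optional[str] = None
    ghcid_current: Optional[str] = None
    ghcid_history: List[HistoryEntry] = field(default_factory=list)

    def assign(self, ghcid: str, reason: str) -> None:
        """Set the current GHCID; the original is written only on first assignment."""
        if self.ghcid_original is None:
            self.ghcid_original = ghcid
        self.ghcid_current = ghcid
        self.ghcid_history.append(
            HistoryEntry(ghcid, reason, datetime.now(timezone.utc))
        )
```

Calling assign("NL-NH-AMS-M-SM", "initial assignment") and then assign("NL-NH-AMS-M-SM-Q924335", "collision detected") leaves ghcid_original at the first value, moves ghcid_current to the second, and records two history entries.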

Documentation

Created comprehensive documentation in:

  • File: docs/plan/global_glam/07-ghcid-collision-resolution.md
  • 300+ lines covering:
    • Collision detection strategy
    • Q-number normalization rules
    • Validation patterns
    • Migration strategy
    • Testing approach
    • Future enhancements (VIAF, ISIL, sequential fallbacks)

Known Limitations & Next Steps

City Lookup Coverage

  • Current: 50 Dutch cities in lookup table
  • Needed: 475 cities (425 missing)
  • Impact: 58.2% of ISIL records lack GHCIDs (212/364)

Next Steps:

  1. Expand data/reference/nl_city_locodes.json with more cities
  2. Use UN/LOCODE API for automated lookups
  3. Manual curation for top 200 cities by institution count

Dutch Organizations Parser

  • Status: GHCID generation NOT yet integrated
  • Next Step: Copy GHCID logic from ISIL parser to dutch_orgs.py
  • Advantage: Dutch orgs CSV has richer metadata for institution type detection

Collision Detection Script

  • Status: Not yet implemented
  • Next Step: Create scripts/detect_ghcid_collisions.py
  • Purpose: Scan existing records to find actual collisions requiring Q-numbers
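
The core of such a script could be a grouping pass over existing records. The sketch below is an assumption about how scripts/detect_ghcid_collisions.py might work, since the script does not exist yet; the function name and the (ghcid, name) input shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_collisions(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (ghcid, institution_name) pairs and keep GHCIDs shared by more than one record."""
    by_ghcid: Dict[str, List[str]] = defaultdict(list)
    for ghcid, name in records:
        by_ghcid[ghcid].append(name)
    return {ghcid: names for ghcid, names in by_ghcid.items() if len(names) > 1}
```

Each key in the result is a GHCID that needs a Q-number suffix, and the value lists the institutions competing for it.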

Wikidata Integration

  • Status: Not yet implemented
  • Next Step: Implement Wikidata API client for:
    • Fetching English names (rdfs:label@en)
    • Retrieving Q-numbers automatically
    • Caching results to avoid rate limits
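
The caching requirement can be sketched independently of the network layer by wrapping an injected lookup function. Everything here is hypothetical: the class name, the fetch callable (which in a real client would query Wikidata), and the choice to cache misses as well as hits.

```python
from typing import Callable, Dict, Optional


class CachedQidLookup:
    """Wrap a lookup function and remember its answers, including misses."""

    def __init__(self, fetch: Callable[[str], Optional[str]]) -> None:
        self._fetch = fetch  # hypothetical: a real client would call Wikidata here
        self._cache: Dict[str, Optional[str]] = {}
        self.remote_calls = 0

    def get(self, name: str) -> Optional[str]:
        """Return the Q-number for a name, calling the fetch function at most once per name."""
        if name not in self._cache:
            self.remote_calls += 1
            self._cache[name] = self._fetch(name)
        return self._cache[name]
```

Injecting the fetch function keeps the cache testable with a stub and leaves the choice of HTTP client and endpoint open.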

Files Modified

  1. src/glam_extractor/models.py - Added GHCID fields to HeritageCustodian
  2. src/glam_extractor/parsers/isil_registry.py - Integrated GHCID generation
  3. tests/identifiers/test_ghcid.py - Added 13 collision tests
  4. tests/parsers/test_isil_registry.py - Updated test expectations

Files Already Complete (From Previous Session)

  1. src/glam_extractor/identifiers/ghcid.py - Core GHCID module with collision support
  2. schemas/heritage_custodian.yaml - Schema with GHCID patterns and collision regex
  3. docs/plan/global_glam/07-ghcid-collision-resolution.md - Comprehensive docs

Next Session Priorities

  1. Expand city lookup table (currently only 10.5% coverage: 50/475 cities)
  2. Integrate GHCID into Dutch organizations parser
  3. Create collision detection script to find real duplicates
  4. Implement Wikidata integration for automated Q-number lookup
  5. Add GHCID to conversation JSON parser (for global institutions)

Session Stats

  • Duration: ~1 hour
  • Tests Added: 13
  • Files Modified: 4
  • Test Pass Rate: 100% (150/150)
  • Coverage: 90%
  • Real Data Validated: 364 ISIL records

Status: COMPLETE - GHCID collision resolution fully implemented and tested

Next Step: Expand city lookup table to increase GHCID coverage from 41.8% to >90%