7.2 KiB
Session Complete: GHCID Integration & Collision Resolution
Summary
Successfully integrated GHCID (Global Heritage Custodian Identifier) generation into the parser pipeline with full collision resolution support using Wikidata Q-numbers.
What Was Accomplished
1. Updated Pydantic Models ✅
File: src/glam_extractor/models.py
Added GHCID fields to HeritageCustodian model:
ghcid_numeric: Optional[int]- Persistent SHA256 hash (64-bit)ghcid_current: Optional[str]- Current human-readable GHCIDghcid_original: Optional[str]- Original GHCID (immutable)ghcid_history: Optional[List[GHCIDHistoryEntry]]- Change history
Configured arbitrary_types_allowed = True to support GHCIDHistoryEntry dataclass.
2. Integrated GHCID into ISIL Parser ✅
File: src/glam_extractor/parsers/isil_registry.py
- Added imports for
GHCIDGeneratorand lookup functions - Implemented
_infer_institution_type()method to detect Museum/Library/Archive - Updated
to_heritage_custodian()to generate GHCID for cities in lookup table - Creates initial
GHCIDHistoryEntrywith assignment date from ISIL registry - Gracefully skips GHCID generation for cities not in lookup (212 of 364 records)
- Updated provenance
extraction_methodto indicate GHCID generation
Real Data Results:
- 364 ISIL records processed
- 152 records (41.8%) with GHCIDs
- 212 records without GHCIDs (cities not in lookup table yet)
Example GHCIDs:
Regionaal Archief Alkmaar → NL-NH-ALK-A-RAA
Gemeentearchief Alphen aan den Rijn → NL-ZH-ALP-A-GAADR
Streekarchief Rijnlands Midden → NL-ZH-ALP-A-SRM
3. Implemented Collision Resolution Tests ✅
File: tests/identifiers/test_ghcid.py
Added 13 comprehensive collision tests (TestCollisionResolution class):
-
Basic Functionality:
test_ghcid_without_qid- Standard GHCID without collisiontest_ghcid_with_qid- GHCID with Wikidata Q-number
-
Q-Number Normalization:
test_qid_normalization_with_prefix- Strips "Q" prefix from inputtest_qid_normalization_without_prefix- Handles numeric-only inputtest_qid_consistency- Both formats produce identical results
-
Real-World Examples:
test_collision_example_stedelijk_museum- Stedelijk Museum (Q924335)test_collision_example_science_museum- Hypothetical collision scenariotest_collision_different_numeric_hashes- Ensures unique numeric IDs
-
Validation:
test_qid_validation_numeric_only- Accepts valid numeric Q-numberstest_qid_invalid_characters- Rejects non-numeric valuestest_schema_regex_pattern_without_qid- Matches schema patterntest_schema_regex_pattern_with_qid- Matches schema pattern with Q-suffixtest_history_entry_with_collision- History tracking with collision resolver
4. Updated Test Expectations ✅
File: tests/parsers/test_isil_registry.py
Updated test_to_heritage_custodian() to:
- Expect new extraction method:
"ISILRegistryParser with GHCID generation" - Verify GHCID fields are populated (numeric, current, original, history)
- Validate GHCID format matches schema pattern
Test Results
150 tests passing (100%)
- 88 original tests
- 49 GHCID core tests (from previous session)
- 13 NEW collision resolution tests
Coverage: 90%
- models.py: 96%
- isil_registry.py: 84%
- ghcid.py: 93% (enhanced from 71% during testing)
- lookups.py: 92%
Schema Validation
GHCID fields in schemas/heritage_custodian.yaml validate correctly:
ghcid_current:
pattern: ^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$
# Examples:
# Without collision: NL-NH-AMS-M-RM
# With collision: NL-NH-AMS-M-SM-Q924335
ghcid_original:
pattern: ^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$
# Immutable - freezes original identifier
Collision Resolution Design
Format
- Without collision:
NL-NH-AMS-M-RM(Rijksmuseum) - With collision:
NL-NH-AMS-M-SM-Q924335(Stedelijk Museum)
Q-Number Normalization
- Input:
"Q924335"or"924335"(both accepted) - Storage:
"924335"(Q prefix stripped) - Display:
"Q924335"(Q prefix added into_string())
Numeric ID Impact
- SHA256 hash includes Q-number if present
- Different Q-numbers produce different numeric IDs
- Ensures globally unique 64-bit identifiers
Migration Strategy
ghcid_original: Frozen at first assignmentghcid_current: Updated when collision detectedghcid_history: Tracks all changes with timestamps and reasons
Documentation
Created comprehensive documentation in:
- File:
docs/plan/global_glam/07-ghcid-collision-resolution.md - 300+ lines covering:
- Collision detection strategy
- Q-number normalization rules
- Validation patterns
- Migration strategy
- Testing approach
- Future enhancements (VIAF, ISIL, sequential fallbacks)
Known Limitations & Next Steps
City Lookup Coverage
- Current: 50 Dutch cities in lookup table
- Needed: 475 cities (425 missing)
- Impact: 58.2% of ISIL records lack GHCIDs (212/364)
Next Steps:
- Expand
data/reference/nl_city_locodes.jsonwith more cities - Use UN/LOCODE API for automated lookups
- Manual curation for top 200 cities by institution count
Dutch Organizations Parser
- Status: GHCID generation NOT yet integrated
- Next Step: Copy GHCID logic from ISIL parser to
dutch_orgs.py - Advantage: Dutch orgs CSV has richer metadata for institution type detection
Collision Detection Script
- Status: Not yet implemented
- Next Step: Create
scripts/detect_ghcid_collisions.py - Purpose: Scan existing records to find actual collisions requiring Q-numbers
Wikidata Integration
- Status: Not yet implemented
- Next Step: Implement Wikidata API client for:
- Fetching English names (
rdfs:label@en) - Retrieving Q-numbers automatically
- Caching results to avoid rate limits
- Fetching English names (
Files Modified
src/glam_extractor/models.py- Added GHCID fields to HeritageCustodiansrc/glam_extractor/parsers/isil_registry.py- Integrated GHCID generationtests/identifiers/test_ghcid.py- Added 13 collision teststests/parsers/test_isil_registry.py- Updated test expectations
Files Already Complete (From Previous Session)
src/glam_extractor/identifiers/ghcid.py- Core GHCID module with collision supportschemas/heritage_custodian.yaml- Schema with GHCID patterns and collision regexdocs/plan/global_glam/07-ghcid-collision-resolution.md- Comprehensive docs
Next Session Priorities
- Expand city lookup table (currently only 10.5% coverage: 50/475 cities)
- Integrate GHCID into Dutch organizations parser
- Create collision detection script to find real duplicates
- Implement Wikidata integration for automated Q-number lookup
- Add GHCID to conversation JSON parser (for global institutions)
Session Stats
- Duration: ~1 hour
- Tests Added: 13
- Files Modified: 4
- Test Pass Rate: 100% (150/150)
- Coverage: 90%
- Real Data Validated: ✅ (364 ISIL records)
Status: ✅ COMPLETE - GHCID collision resolution fully implemented and tested Next Step: Expand city lookup table to increase GHCID coverage from 41.8% to >90%