# Session Complete: GHCID Integration & Collision Resolution > **⚠️ HISTORICAL DOCUMENT - SUPERSEDED COLLISION RESOLUTION POLICY** > > This session report documents the **original** GHCID collision resolution approach using Wikidata Q-numbers. > **As of November 2025**, collision resolution now uses **native language institution names in snake_case format**. > > **Current policy**: See `docs/plan/global_glam/07-ghcid-collision-resolution.md` > > Examples of the NEW format: > - OLD: `NL-NH-AMS-M-SM-Q924335` > - NEW: `NL-NH-AMS-M-SM-stedelijk_museum_amsterdam` ## Summary Successfully integrated GHCID (Global Heritage Custodian Identifier) generation into the parser pipeline with collision resolution support. > **Note**: The collision resolution approach documented below (Wikidata Q-numbers) has been **superseded** by native language name suffixes. See deprecation notice above. ## What Was Accomplished ### 1. Updated Pydantic Models ✅ **File**: `src/glam_extractor/models.py` Added GHCID fields to `HeritageCustodian` model: - `ghcid_numeric: Optional[int]` - Persistent SHA256 hash (64-bit) - `ghcid_current: Optional[str]` - Current human-readable GHCID - `ghcid_original: Optional[str]` - Original GHCID (immutable) - `ghcid_history: Optional[List[GHCIDHistoryEntry]]` - Change history Configured `arbitrary_types_allowed = True` to support `GHCIDHistoryEntry` dataclass. ### 2. Integrated GHCID into ISIL Parser ✅ **File**: `src/glam_extractor/parsers/isil_registry.py` - Added imports for `GHCIDGenerator` and lookup functions - Implemented `_infer_institution_type()` method to detect Museum/Library/Archive - Updated `to_heritage_custodian()` to generate GHCID for cities in lookup table - Creates initial `GHCIDHistoryEntry` with assignment date from ISIL registry - Gracefully skips GHCID generation for cities not in lookup (212 of 364 records) - Updated provenance `extraction_method` to indicate GHCID generation **Real Data Results**: - 364 ISIL records processed - 152 records (41.8%) with GHCIDs - 212 records without GHCIDs (cities not in lookup table yet) **Example GHCIDs**: ``` Regionaal Archief Alkmaar → NL-NH-ALK-A-RAA Gemeentearchief Alphen aan den Rijn → NL-ZH-ALP-A-GAADR Streekarchief Rijnlands Midden → NL-ZH-ALP-A-SRM ``` ### 3. Implemented Collision Resolution Tests ✅ **File**: `tests/identifiers/test_ghcid.py` Added 13 comprehensive collision tests (`TestCollisionResolution` class): 1. **Basic Functionality**: - `test_ghcid_without_qid` - Standard GHCID without collision - `test_ghcid_with_qid` - GHCID with Wikidata Q-number 2. **Q-Number Normalization**: - `test_qid_normalization_with_prefix` - Strips "Q" prefix from input - `test_qid_normalization_without_prefix` - Handles numeric-only input - `test_qid_consistency` - Both formats produce identical results 3. **Real-World Examples**: - `test_collision_example_stedelijk_museum` - Stedelijk Museum (Q924335) - `test_collision_example_science_museum` - Hypothetical collision scenario - `test_collision_different_numeric_hashes` - Ensures unique numeric IDs 4. **Validation**: - `test_qid_validation_numeric_only` - Accepts valid numeric Q-numbers - `test_qid_invalid_characters` - Rejects non-numeric values - `test_schema_regex_pattern_without_qid` - Matches schema pattern - `test_schema_regex_pattern_with_qid` - Matches schema pattern with Q-suffix - `test_history_entry_with_collision` - History tracking with collision resolver ### 4. Updated Test Expectations ✅ **File**: `tests/parsers/test_isil_registry.py` Updated `test_to_heritage_custodian()` to: - Expect new extraction method: `"ISILRegistryParser with GHCID generation"` - Verify GHCID fields are populated (numeric, current, original, history) - Validate GHCID format matches schema pattern ## Test Results ``` 150 tests passing (100%) - 88 original tests - 49 GHCID core tests (from previous session) - 13 NEW collision resolution tests Coverage: 90% - models.py: 96% - isil_registry.py: 84% - ghcid.py: 93% (enhanced from 71% during testing) - lookups.py: 92% ``` ## Schema Validation GHCID fields in `schemas/heritage_custodian.yaml` validate correctly: ```yaml ghcid_current: pattern: ^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$ # Examples: # Without collision: NL-NH-AMS-M-RM # With collision: NL-NH-AMS-M-SM-Q924335 ghcid_original: pattern: ^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$ # Immutable - freezes original identifier ``` ## Collision Resolution Design ### Format - **Without collision**: `NL-NH-AMS-M-RM` (Rijksmuseum) - **With collision**: `NL-NH-AMS-M-SM-Q924335` (Stedelijk Museum) ### Q-Number Normalization - **Input**: `"Q924335"` or `"924335"` (both accepted) - **Storage**: `"924335"` (Q prefix stripped) - **Display**: `"Q924335"` (Q prefix added in `to_string()`) ### Numeric ID Impact - SHA256 hash includes Q-number if present - Different Q-numbers produce different numeric IDs - Ensures globally unique 64-bit identifiers ### Migration Strategy - `ghcid_original`: Frozen at first assignment - `ghcid_current`: Updated when collision detected - `ghcid_history`: Tracks all changes with timestamps and reasons ## Documentation Created comprehensive documentation in: - **File**: `docs/plan/global_glam/07-ghcid-collision-resolution.md` - 300+ lines covering: - Collision detection strategy - Q-number normalization rules - Validation patterns - Migration strategy - Testing approach - Future enhancements (VIAF, ISIL, sequential fallbacks) ## Known Limitations & Next Steps ### City Lookup Coverage - **Current**: 50 Dutch cities in lookup table - **Needed**: 475 cities (425 missing) - **Impact**: 58.2% of ISIL records lack GHCIDs (212/364) **Next Steps**: 1. Expand `data/reference/nl_city_locodes.json` with more cities 2. Use UN/LOCODE API for automated lookups 3. Manual curation for top 200 cities by institution count ### Dutch Organizations Parser - **Status**: GHCID generation NOT yet integrated - **Next Step**: Copy GHCID logic from ISIL parser to `dutch_orgs.py` - **Advantage**: Dutch orgs CSV has richer metadata for institution type detection ### Collision Detection Script - **Status**: Not yet implemented - **Next Step**: Create `scripts/detect_ghcid_collisions.py` - **Purpose**: Scan existing records to find actual collisions requiring Q-numbers ### Wikidata Integration - **Status**: Not yet implemented - **Next Step**: Implement Wikidata API client for: - Fetching English names (`rdfs:label@en`) - Retrieving Q-numbers automatically - Caching results to avoid rate limits ## Files Modified 1. `src/glam_extractor/models.py` - Added GHCID fields to HeritageCustodian 2. `src/glam_extractor/parsers/isil_registry.py` - Integrated GHCID generation 3. `tests/identifiers/test_ghcid.py` - Added 13 collision tests 4. `tests/parsers/test_isil_registry.py` - Updated test expectations ## Files Already Complete (From Previous Session) 1. `src/glam_extractor/identifiers/ghcid.py` - Core GHCID module with collision support 2. `schemas/heritage_custodian.yaml` - Schema with GHCID patterns and collision regex 3. `docs/plan/global_glam/07-ghcid-collision-resolution.md` - Comprehensive docs ## Next Session Priorities 1. **Expand city lookup table** (currently only 10.5% coverage: 50/475 cities) 2. **Integrate GHCID into Dutch organizations parser** 3. **Create collision detection script** to find real duplicates 4. **Implement Wikidata integration** for automated Q-number lookup 5. **Add GHCID to conversation JSON parser** (for global institutions) ## Session Stats - **Duration**: ~1 hour - **Tests Added**: 13 - **Files Modified**: 4 - **Test Pass Rate**: 100% (150/150) - **Coverage**: 90% - **Real Data Validated**: ✅ (364 ISIL records) --- **Status**: ✅ **COMPLETE** - GHCID collision resolution fully implemented and tested **Next Step**: Expand city lookup table to increase GHCID coverage from 41.8% to >90%