glam/SESSION_COMPLETE.md
2025-11-30 23:30:29 +01:00

# Session Complete: GHCID Integration & Collision Resolution
> **⚠️ HISTORICAL DOCUMENT - SUPERSEDED COLLISION RESOLUTION POLICY**
>
> This session report documents the **original** GHCID collision resolution approach using Wikidata Q-numbers.
> **As of November 2025**, collision resolution now uses **native language institution names in snake_case format**.
>
> **Current policy**: See `docs/plan/global_glam/07-ghcid-collision-resolution.md`
>
> Examples of the NEW format:
> - OLD: `NL-NH-AMS-M-SM-Q924335`
> - NEW: `NL-NH-AMS-M-SM-stedelijk_museum_amsterdam`
## Summary
Successfully integrated GHCID (Global Heritage Custodian Identifier) generation into the parser pipeline with collision resolution support.
> **Note**: The collision resolution approach documented below (Wikidata Q-numbers) has been **superseded** by native language name suffixes. See deprecation notice above.
## What Was Accomplished
### 1. Updated Pydantic Models ✅
**File**: `src/glam_extractor/models.py`
Added GHCID fields to `HeritageCustodian` model:
- `ghcid_numeric: Optional[int]` - Persistent 64-bit numeric ID derived from a SHA-256 hash
- `ghcid_current: Optional[str]` - Current human-readable GHCID
- `ghcid_original: Optional[str]` - Original GHCID (immutable)
- `ghcid_history: Optional[List[GHCIDHistoryEntry]]` - Change history
Configured `arbitrary_types_allowed = True` to support `GHCIDHistoryEntry` dataclass.
### 2. Integrated GHCID into ISIL Parser ✅
**File**: `src/glam_extractor/parsers/isil_registry.py`
- Added imports for `GHCIDGenerator` and lookup functions
- Implemented `_infer_institution_type()` method to detect Museum/Library/Archive
- Updated `to_heritage_custodian()` to generate GHCID for cities in lookup table
- Creates initial `GHCIDHistoryEntry` with assignment date from ISIL registry
- Gracefully skips GHCID generation for cities not in lookup (212 of 364 records)
- Updated provenance `extraction_method` to indicate GHCID generation
**Real Data Results**:
- 364 ISIL records processed
- 152 records (41.8%) with GHCIDs
- 212 records without GHCIDs (cities not in lookup table yet)
**Example GHCIDs**:
```
Regionaal Archief Alkmaar → NL-NH-ALK-A-RAA
Gemeentearchief Alphen aan den Rijn → NL-ZH-ALP-A-GAADR
Streekarchief Rijnlands Midden → NL-ZH-ALP-A-SRM
```
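The three examples above are consistent with a simple composition rule: country, region, city code, a one-letter type code, and the initials of every word in the institution name. The sketch below illustrates that reading only. The names `infer_type_code`, `name_initials` and `compose_ghcid` are hypothetical, the keyword lists are illustrative, and the region and city codes are passed in by hand rather than looked up; it is not the project's `GHCIDGenerator`.

```python
from typing import Optional

# Illustrative keyword map (assumption): name fragments -> one-letter type code.
TYPE_KEYWORDS = {
    "A": ("archief", "archive"),
    "L": ("bibliotheek", "library"),
    "M": ("museum",),
}


def infer_type_code(name: str) -> Optional[str]:
    """Return 'A', 'L' or 'M' if a known keyword occurs in the name, else None."""
    lowered = name.lower()
    for code, keywords in TYPE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return code
    return None


def name_initials(name: str) -> str:
    """Upper-case first letter of every whitespace-separated word."""
    return "".join(word[0].upper() for word in name.split())


def compose_ghcid(country: str, region: str, city_code: str, name: str) -> Optional[str]:
    """Join the five components with hyphens; None if the type cannot be inferred."""
    type_code = infer_type_code(name)
    if type_code is None:
        return None
    return "-".join([country, region, city_code, type_code, name_initials(name)])


print(compose_ghcid("NL", "ZH", "ALP", "Gemeentearchief Alphen aan den Rijn"))
```

Single-word names such as Rijksmuseum (abbreviated `RM` elsewhere in this report) evidently follow a rule that this sketch does not cover.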
### 3. Implemented Collision Resolution Tests ✅
**File**: `tests/identifiers/test_ghcid.py`
Added 13 comprehensive collision tests (`TestCollisionResolution` class):
1. **Basic Functionality**:
   - `test_ghcid_without_qid` - Standard GHCID without collision
   - `test_ghcid_with_qid` - GHCID with Wikidata Q-number
2. **Q-Number Normalization**:
   - `test_qid_normalization_with_prefix` - Strips "Q" prefix from input
   - `test_qid_normalization_without_prefix` - Handles numeric-only input
   - `test_qid_consistency` - Both formats produce identical results
3. **Real-World Examples**:
   - `test_collision_example_stedelijk_museum` - Stedelijk Museum (Q924335)
   - `test_collision_example_science_museum` - Hypothetical collision scenario
   - `test_collision_different_numeric_hashes` - Ensures unique numeric IDs
4. **Validation**:
   - `test_qid_validation_numeric_only` - Accepts valid numeric Q-numbers
   - `test_qid_invalid_characters` - Rejects non-numeric values
   - `test_schema_regex_pattern_without_qid` - Matches schema pattern
   - `test_schema_regex_pattern_with_qid` - Matches schema pattern with Q-suffix
   - `test_history_entry_with_collision` - History tracking with collision resolver
### 4. Updated Test Expectations ✅
**File**: `tests/parsers/test_isil_registry.py`
Updated `test_to_heritage_custodian()` to:
- Expect new extraction method: `"ISILRegistryParser with GHCID generation"`
- Verify GHCID fields are populated (numeric, current, original, history)
- Validate GHCID format matches schema pattern
## Test Results
```
150 tests passing (100%)
- 88 original tests
- 49 GHCID core tests (from previous session)
- 13 NEW collision resolution tests
Coverage: 90%
- models.py: 96%
- isil_registry.py: 84%
- ghcid.py: 93% (up from 71% during testing)
- lookups.py: 92%
```
## Schema Validation
GHCID fields in `schemas/heritage_custodian.yaml` validate correctly:
```yaml
ghcid_current:
  pattern: ^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$
  # Examples:
  #   Without collision: NL-NH-AMS-M-RM
  #   With collision: NL-NH-AMS-M-SM-Q924335
ghcid_original:
  pattern: ^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$
  # Immutable - freezes original identifier
```
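To see what the schema pattern accepts, it can be exercised directly with Python's `re` module. This is a small illustration, not part of the test suite; `is_valid_ghcid` is a hypothetical helper name. It also shows that the newer snake_case suffix from the deprecation notice does not match this historical pattern.

```python
import re

# Pattern copied verbatim from the schema excerpt above.
GHCID_PATTERN = re.compile(
    r"^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$"
)


def is_valid_ghcid(value: str) -> bool:
    """True if the whole string matches the historical GHCID pattern."""
    return GHCID_PATTERN.fullmatch(value) is not None


for candidate in (
    "NL-NH-AMS-M-RM",
    "NL-NH-AMS-M-SM-Q924335",
    "NL-NH-AMS-M-SM-stedelijk_museum_amsterdam",
):
    print(candidate, is_valid_ghcid(candidate))
```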
## Collision Resolution Design
### Format
- **Without collision**: `NL-NH-AMS-M-RM` (Rijksmuseum)
- **With collision**: `NL-NH-AMS-M-SM-Q924335` (Stedelijk Museum)
### Q-Number Normalization
- **Input**: `"Q924335"` or `"924335"` (both accepted)
- **Storage**: `"924335"` (Q prefix stripped)
- **Display**: `"Q924335"` (Q prefix added in `to_string()`)
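The normalization rules above can be sketched as follows. The function names `normalize_qid` and `display_qid` are hypothetical (the report only mentions `to_string()`), and rejecting a lowercase `q` is an assumption.

```python
import re

# Optional upper-case "Q" followed by one or more ASCII digits.
_QID_RE = re.compile(r"Q?([0-9]+)")


def normalize_qid(raw: str) -> str:
    """Return the digits of a Q-number, with any leading 'Q' removed."""
    match = _QID_RE.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"invalid Wikidata Q-number: {raw!r}")
    return match.group(1)


def display_qid(stored: str) -> str:
    """Re-attach the 'Q' prefix for display."""
    return f"Q{stored}"


stored = normalize_qid("Q924335")
print(stored, display_qid(stored))
```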
### Numeric ID Impact
- SHA256 hash includes Q-number if present
- Different Q-numbers produce different numeric IDs
- Keeps the 64-bit identifiers distinct for institutions that share a base GHCID (barring hash collisions)
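A minimal sketch of how a 64-bit number can be derived from a SHA-256 digest is shown below. How the project actually reduces the 256-bit digest is not stated in this report; taking the first 8 bytes, big-endian, and hashing the display string are assumptions, and `ghcid_numeric_id` is a hypothetical name.

```python
import hashlib
from typing import Optional


def ghcid_numeric_id(base_ghcid: str, qid: Optional[str] = None) -> int:
    """Derive a 64-bit integer from the SHA-256 digest of the identifier string."""
    # The Q-number, when present, is part of the hashed text.
    text = base_ghcid if qid is None else f"{base_ghcid}-Q{qid}"
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Assumption: keep the first 8 bytes (64 bits), interpreted big-endian.
    return int.from_bytes(digest[:8], "big")


plain = ghcid_numeric_id("NL-NH-AMS-M-SM")
suffixed = ghcid_numeric_id("NL-NH-AMS-M-SM", "924335")
print(plain != suffixed)
```

Because the value is a truncated hash, distinct inputs are overwhelmingly likely, but not guaranteed, to yield distinct numbers.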
### Migration Strategy
- `ghcid_original`: Frozen at first assignment
- `ghcid_current`: Updated when collision detected
- `ghcid_history`: Tracks all changes with timestamps and reasons
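The interplay of the three fields can be illustrated with a small stand-alone sketch. It uses plain dataclasses instead of the Pydantic `HeritageCustodian` model so that it runs without dependencies; `HistoryEntry`, `GhcidRecord` and their methods are hypothetical stand-ins, not the project's classes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class HistoryEntry:  # hypothetical stand-in for GHCIDHistoryEntry
    ghcid: str
    changed_at: datetime
    reason: str


@dataclass
class GhcidRecord:  # hypothetical; the real fields live on HeritageCustodian
    ghcid_original: str
    ghcid_current: str
    ghcid_history: List[HistoryEntry] = field(default_factory=list)

    @classmethod
    def assign(cls, ghcid: str, when: datetime) -> "GhcidRecord":
        """First assignment: original and current start out identical."""
        record = cls(ghcid_original=ghcid, ghcid_current=ghcid)
        record.ghcid_history.append(HistoryEntry(ghcid, when, "initial assignment"))
        return record

    def resolve_collision(self, qid: str, when: datetime) -> None:
        """Suffix the current GHCID with the Q-number; the original is left untouched."""
        self.ghcid_current = f"{self.ghcid_original}-Q{qid}"
        self.ghcid_history.append(
            HistoryEntry(self.ghcid_current, when, "collision resolved with Q-number")
        )


record = GhcidRecord.assign("NL-NH-AMS-M-SM", datetime(2025, 1, 1, tzinfo=timezone.utc))
record.resolve_collision("924335", datetime(2025, 6, 1, tzinfo=timezone.utc))
print(record.ghcid_original, record.ghcid_current, len(record.ghcid_history))
```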
## Documentation
Created comprehensive documentation in:
- **File**: `docs/plan/global_glam/07-ghcid-collision-resolution.md`
- 300+ lines covering:
  - Collision detection strategy
  - Q-number normalization rules
  - Validation patterns
  - Migration strategy
  - Testing approach
  - Future enhancements (VIAF, ISIL, sequential fallbacks)
## Known Limitations & Next Steps
### City Lookup Coverage
- **Current**: 50 Dutch cities in lookup table
- **Needed**: 475 cities (425 missing)
- **Impact**: 58.2% of ISIL records lack GHCIDs (212/364)
**Next Steps**:
1. Expand `data/reference/nl_city_locodes.json` with more cities
2. Use UN/LOCODE API for automated lookups
3. Manual curation for top 200 cities by institution count
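The skip-on-missing-city behaviour and the resulting coverage figure can be sketched as below. The layout of `data/reference/nl_city_locodes.json` is not shown in this report, so a plain city-name-to-code mapping is assumed; `lookup_city_code` and `coverage_percent` are hypothetical names.

```python
from typing import Dict, List, Optional

# Assumed shape: city name -> three-letter city code (codes taken from the examples above).
CITY_CODES: Dict[str, str] = {"Alkmaar": "ALK", "Alphen aan den Rijn": "ALP"}


def lookup_city_code(city: str, table: Dict[str, str]) -> Optional[str]:
    """Return the city code, or None so the caller can skip GHCID generation."""
    return table.get(city.strip())


def coverage_percent(cities: List[str], table: Dict[str, str]) -> float:
    """Share of records, in percent, whose city is present in the lookup table."""
    if not cities:
        return 0.0
    hits = sum(1 for city in cities if lookup_city_code(city, table) is not None)
    return 100.0 * hits / len(cities)


sample = ["Alkmaar", "Alphen aan den Rijn", "Zwolle", "Delft"]
print(coverage_percent(sample, CITY_CODES))
```

Applied to the figures in this report, 152 records with a known city out of 364 gives the 41.8% coverage quoted above.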
### Dutch Organizations Parser
- **Status**: GHCID generation NOT yet integrated
- **Next Step**: Copy GHCID logic from ISIL parser to `dutch_orgs.py`
- **Advantage**: Dutch orgs CSV has richer metadata for institution type detection
### Collision Detection Script
- **Status**: Not yet implemented
- **Next Step**: Create `scripts/detect_ghcid_collisions.py`
- **Purpose**: Scan existing records to find actual collisions requiring Q-numbers
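Since the script does not exist yet, the following is only one possible starting point: group records by their base GHCID and report every group with more than one institution. The function name `find_collisions`, the `(ghcid, name)` input shape and the sample records are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_collisions(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group institution names by base GHCID; keep groups with more than one name."""
    by_ghcid: Dict[str, List[str]] = defaultdict(list)
    for ghcid, name in records:
        by_ghcid[ghcid].append(name)
    return {ghcid: names for ghcid, names in by_ghcid.items() if len(names) > 1}


# Hypothetical sample data, mirroring the collision scenario used in the tests.
records = [
    ("NL-NH-AMS-M-SM", "Stedelijk Museum"),
    ("NL-NH-AMS-M-SM", "Science Museum"),
    ("NL-NH-AMS-M-RM", "Rijksmuseum"),
]
print(find_collisions(records))
```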
### Wikidata Integration
- **Status**: Not yet implemented
- **Next Step**: Implement Wikidata API client for:
  - Fetching English names (`rdfs:label@en`)
  - Retrieving Q-numbers automatically
  - Caching results to avoid rate limits
## Files Modified
1. `src/glam_extractor/models.py` - Added GHCID fields to HeritageCustodian
2. `src/glam_extractor/parsers/isil_registry.py` - Integrated GHCID generation
3. `tests/identifiers/test_ghcid.py` - Added 13 collision tests
4. `tests/parsers/test_isil_registry.py` - Updated test expectations
## Files Already Complete (From Previous Session)
1. `src/glam_extractor/identifiers/ghcid.py` - Core GHCID module with collision support
2. `schemas/heritage_custodian.yaml` - Schema with GHCID patterns and collision regex
3. `docs/plan/global_glam/07-ghcid-collision-resolution.md` - Comprehensive docs
## Next Session Priorities
1. **Expand city lookup table** (currently only 10.5% coverage: 50/475 cities)
2. **Integrate GHCID into Dutch organizations parser**
3. **Create collision detection script** to find real duplicates
4. **Implement Wikidata integration** for automated Q-number lookup
5. **Add GHCID to conversation JSON parser** (for global institutions)
## Session Stats
- **Duration**: ~1 hour
- **Tests Added**: 13
- **Files Modified**: 4
- **Test Pass Rate**: 100% (150/150)
- **Coverage**: 90%
- **Real Data Validated**: ✅ (364 ISIL records)
---
**Status**: ✅ **COMPLETE** - GHCID collision resolution format implemented and tested (detection script and Wikidata lookup still pending, see above)
**Next Step**: Expand city lookup table to increase GHCID coverage from 41.8% to >90%