glam/SESSION_SUMMARY_2025-11-08.md
2025-11-19 23:25:22 +01:00

# Wikidata Enrichment Session Summary - November 8, 2025
## Session Context
Resumed from the November 7 session, which achieved 99.98% geocoding coverage and performed initial Wikidata enrichment via ISIL code matching.
## What We Accomplished ✅
### 1. Dutch Institutions Fuzzy Name Matching - Successfully Completed
**Problem Identified**: Low Dutch Wikidata coverage (4.8%) even though 40.9% of Dutch institutions have ISIL codes.
**Root Cause**: The ISIL property (P791) is sparsely populated in Wikidata for Dutch institutions.
**Solution Implemented**:
- Created `scripts/enrich_dutch_institutions_fuzzy.py`
- Queried Wikidata for all Dutch museums, libraries, and archives (1,303 found)
- Fuzzy matched institution names using normalized string similarity
- Added institution type compatibility checking to avoid false positives (e.g., "Drents Archief" vs "Drents Museum")
- Applied matches with similarity above the 0.85 threshold
**Results - HIGHLY SUCCESSFUL**:
- **Processing time**: ~1.6 minutes (97s total: 37s loading + 30s Wikidata query + 10s matching + 20s writing)
- **Dutch enriched**: 200 institutions
- **New Dutch Wikidata coverage**: 21.8% (up from 4.8%)
- **Improvement**: 4.5x increase in coverage
- **Match quality**: 200 high-confidence matches (>0.85 similarity)
  - Many perfect matches (1.000 similarity)
  - Examples: Van Gogh Museum, Amsterdam Museum, Rijksmuseum
### 2. Overall Dataset Statistics
**Final Enrichment State**:
```
Total institutions: 13,396
With real Wikidata IDs: 7,363 (55.0%)
With synthetic Wikidata: 2,563 (19.1%)
With VIAF IDs: 2,035 (15.2%)
With websites: 11,329 (84.6%)
With founding dates: 1,550 (11.6%)
```
**Enrichment Methods**:
- ISIL Match: 9,670 (72.2%) - from SPARQL query via P791 property
- Fuzzy Name Match: 200 (1.5%) - new Dutch enrichments
- Other/Original: 3,526 (26.3%) - from conversation extraction, CSV imports
### 3. Coverage by Country
| Country | Total | Real Wikidata | Synthetic | Coverage |
|---------|-------|---------------|-----------|----------|
| **Japan (JP)** | 12,065 | 7,091 | 2,517 | **58.8%** |
| **Netherlands (NL)** | 1,017 | 222 | 39 | **21.8%** ⬆ |
| **Chile (CL)** | 90 | 26 | 3 | **28.9%** |
| **Mexico (MX)** | 109 | 23 | 3 | **21.1%** |
| **Brazil (BR)** | 97 | 1 | 0 | **1.0%** ⚠️ |
| **Belgium (BE)** | 7 | 0 | 1 | **0.0%** |
| **United States (US)** | 7 | 0 | 0 | **0.0%** |
**Key Insight**: Japan dominates the dataset (90%) with excellent ISIL→Wikidata mapping. Dutch coverage significantly improved but still has room for growth.
## Files Modified/Created 📁
### Created
1. `scripts/enrich_dutch_institutions_fuzzy.py` - **Production-ready** Dutch fuzzy matcher
2. `data/instances/global/global_heritage_institutions_dutch_enriched.yaml` (24 MB) - **Merged into main file**
3. `data/instances/global/global_heritage_institutions_wikidata_enriched_backup.yaml` (24 MB) - Backup of pre-fuzzy-match state
### Modified
- `data/instances/global/global_heritage_institutions_wikidata_enriched.yaml` (24 MB) - **Main enriched dataset** (now includes Dutch fuzzy matches)
### Preserved
- Original files remain unchanged (backup strategy maintained)
## Technical Insights 🔍
### Fuzzy Matching Strategies
**Normalization Techniques**:
```python
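import re

# Sketch of the name normalization used for matching.
# The step order is an assumption; str.removeprefix/removesuffix need Python 3.9+.
PREFIXES = ("stichting", "gemeentearchief", "regionaal archief", "museum")
SUFFIXES = ("archief", "museum", "bibliotheek", "library", "archive")

def normalize(name: str) -> str:
    name = re.sub(r"[^\w\s]", " ", name.lower())  # lowercase, remove punctuation
    name = " ".join(name.split())                 # normalize whitespace
    for prefix in PREFIXES:                       # remove common prefixes
        name = name.removeprefix(prefix + " ")
    for suffix in SUFFIXES:                       # remove common suffixes
        name = name.removesuffix(" " + suffix)
    return name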
```
**Type Compatibility Checking**:
- Prevents mismatches between museums and archives (e.g., "Drents Museum" ≠ "Drents Archief")
- Checks for type keywords in both institution name and Wikidata type
- Archives must match archives, museums must match museums, libraries must match libraries
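**Sketch**: The rules above could look roughly like the following. This is a minimal illustration, not the script's actual code: the keyword table and the names `TYPE_KEYWORDS`, `detect_type`, and `types_compatible` are assumptions made for the example.

```python
from typing import Optional

# Hypothetical keyword table; the keywords used by the actual script may differ.
TYPE_KEYWORDS = {
    "archive": ("archief", "archive"),
    "museum": ("museum",),
    "library": ("bibliotheek", "library"),
}

def detect_type(text: str) -> Optional[str]:
    """Return the first institution type whose keyword occurs in the text, else None."""
    lowered = text.lower()
    for inst_type, keywords in TYPE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return inst_type
    return None

def types_compatible(local_name: str, wikidata_label: str, wikidata_type: str = "") -> bool:
    """Reject a candidate only when both sides reveal a type and the types differ."""
    local_type = detect_type(local_name)
    remote_type = detect_type(wikidata_type) or detect_type(wikidata_label)
    if local_type is None or remote_type is None:
        return True  # not enough evidence to reject the pair
    return local_type == remote_type

print(types_compatible("Drents Archief", "Drents Museum", "museum"))  # False
```

Pairs where neither side reveals a type are let through here; a stricter variant could reject them instead, at the cost of recall.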
**Similarity Threshold**:
- 0.85 chosen to balance precision and recall
- Many perfect matches (1.000) validate the approach
- Examples of high-confidence matches:
  - 1.000: "Van Gogh Museum" → "Van Gogh Museum (Q224124)"
  - 1.000: "Amsterdam Museum" → "Amsterdam Museum (Q1820897)"
  - 0.891: "Koninklijk Tehuis voor Oud-Militairen en Museum Bronbeek" → "Tehuis voor Oud-Militairen en Museum 'Bronbeek' (Q1948006)"
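**Sketch**: A minimal version of the threshold test, assuming the score comes from `difflib.SequenceMatcher` (named under Lessons Learned) and is computed on already-normalized names. `similarity_score` is one of the script's function names; `is_match` and the exact comparison are assumptions for illustration.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.85  # value reported in this summary

def similarity_score(a: str, b: str) -> float:
    """Similarity in [0, 1] between two already-normalized names."""
    return SequenceMatcher(None, a, b).ratio()

def is_match(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    """Accept a pair only when the score is strictly above the threshold."""
    return similarity_score(a, b) > threshold

print(similarity_score("van gogh", "van gogh"))  # 1.0
```

Identical normalized names always score 1.0, which is why a type check is still needed on top of the threshold.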
### Wikidata Query Optimization
**Dutch-Specific Query**:
```sparql
SELECT DISTINCT ?item ?itemLabel ?itemDescription ...
WHERE {
VALUES ?type { wd:Q33506 wd:Q7075 wd:Q166118 } # museum, library, archive
?item wdt:P31 ?type . # instance of
?item wdt:P17 wd:Q55 . # country: Netherlands
OPTIONAL { ?item wdt:P791 ?isil . }
OPTIONAL { ?item wdt:P214 ?viaf . }
...
}
LIMIT 2000
```
**Results**:
- Found 1,303 Dutch heritage institutions in Wikidata
- Many lack the ISIL property (P791), which explains the low ISIL-based coverage
- Rich metadata available (coordinates, websites, founding dates, VIAF IDs)
### ISIL P791 Property Gap
**Finding**: Dutch ISIL codes are not well-represented in Wikidata P791.
**Evidence**:
- 416 Dutch institutions have ISIL codes (40.9%)
- ISIL-based SPARQL query only matched 49 (11.8% of ISIL-bearing institutions)
- Fuzzy name matching found 200 matches (about 4x as many as ISIL matching)
**Implication**: Wikidata's ISIL coverage is incomplete, especially for the Netherlands. Name-based matching is essential for comprehensive enrichment.
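**Worked Numbers**: The percentages in the evidence can be reproduced from the counts reported in this summary. The helper name `pct` is only for illustration.

```python
def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

dutch_total = 1017       # Dutch institutions in the dataset
with_isil = 416          # Dutch institutions with an ISIL code
matched_via_p791 = 49    # matched through the ISIL-based SPARQL query
fuzzy_matches = 200      # matched through fuzzy name matching

print(pct(with_isil, dutch_total))                 # 40.9
print(pct(matched_via_p791, with_isil))            # 11.8
print(round(fuzzy_matches / matched_via_p791, 1))  # 4.1
```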
## Outstanding Issues ⚠️
### 1. Remaining Dutch Coverage Gap
**Current State**:
- 1,017 Dutch institutions total
- 222 with Wikidata (21.8%)
- **795 still without Wikidata (78.2%)**
**Samples Without Wikidata**:
- Regionaal Archief Alkmaar [ISIL: NL-AmrRAA]
- Het Scheepvaartmuseum (HSM) [ISIL: NL-AsdHSM]
- IHLIA LGBT Heritage [ISIL: NL-AsdILGBT]
**Next Steps**:
1. Lower fuzzy match threshold to 0.75-0.80 (trade precision for recall)
2. Try alternative Wikidata properties (P856 website, P131 location)
3. Manual curation for high-value institutions
4. Consider contributing missing ISIL codes to Wikidata
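**Sketch for step 2**: Matching on official website (P856) would need both sides reduced to a comparable key first. A minimal sketch, assuming that comparing bare host names is sufficient; `website_key` is a hypothetical helper, not part of the current script.

```python
from urllib.parse import urlsplit

def website_key(url: str) -> str:
    """Reduce a URL to a comparable host key: lowercase, no scheme, no leading 'www.'."""
    if "//" not in url:
        url = "//" + url  # let urlsplit treat a bare host as the network location
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

print(website_key("https://www.VanGoghMuseum.nl/en"))  # vangoghmuseum.nl
```

Institutions that share a host (for example, several branches under one municipal site) would collide on this key, so a website match would likely still need a name or type check.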
### 2. Very Low Brazilian Coverage
**Current State**:
- 97 Brazilian institutions
- **Only 1 with Wikidata (1.0%)**
- 96 without Wikidata
**Hypothesis**: As in the Dutch case, Brazilian institutions may exist in Wikidata but lack ISIL codes.
**Proposed Solution**: Run fuzzy matching for Brazilian institutions similar to Dutch approach.
### 3. Moderate Latin American Coverage
**Mexico**:
- 109 institutions, 23 with Wikidata (21.1%)
- 86 remaining without Wikidata
**Chile**:
- 90 institutions, 26 with Wikidata (28.9%)
- 64 remaining without Wikidata
**Next Step**: Apply fuzzy matching to Mexican and Chilean institutions.
### 4. Remaining Synthetic Q-numbers
**Current State**:
- 2,563 institutions still have synthetic Q-numbers (19.1%)
- Majority are Japanese institutions (2,517 synthetic in Japan)
**Context**: These are institutions that don't exist in Wikidata yet. Synthetic Q-numbers are hash-based placeholders.
**Decision Point**: Do we prioritize replacing synthetic Q-numbers or accept them as valid for institutions not yet in Wikidata?
### 5. Geocoding Failures
**From Previous Session** (still unresolved):
- 3 institutions failed geocoding (0.02%)
  - 1 Japanese (typo: "YAMAGUCIH" → should be "YAMAGUCHI")
  - 2 Dutch institutions
**Status**: Not addressed in this session
## Next Steps 📋
### Immediate Priorities (Ranked)
**Option A: Expand Fuzzy Matching to Latin America** (Recommended)
1. Adapt `enrich_dutch_institutions_fuzzy.py` for Brazil, Mexico, Chile
2. Query Wikidata for institutions in these countries
3. Apply fuzzy name matching with 0.85 threshold
4. Expected outcome:
   - Brazil: 1% → 15-25% coverage
   - Mexico: 21% → 35-45% coverage
   - Chile: 29% → 40-50% coverage
5. **Impact**: Enrich ~100-150 additional institutions
**Option B: Lower Dutch Threshold for More Matches**
1. Re-run Dutch fuzzy matching with 0.75 threshold
2. Implement interactive review (approve/reject matches)
3. Expected outcome: Dutch coverage 22% → 30-35%
4. **Risk**: Lower threshold may introduce false positives
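**Sketch for Option B, step 2**: An approve/reject loop could be as small as the following. It is not implemented in the script; `review_matches`, the candidate tuple layout, and the injectable `ask` callable are assumptions for illustration.

```python
def review_matches(candidates, ask=input):
    """Prompt for each (local_name, wikidata_label, score) and return the approved ones.

    `ask` defaults to input(); passing another callable allows scripted review.
    """
    approved = []
    for local_name, label, score in candidates:
        answer = ask(f"{score:.3f}  {local_name!r} -> {label!r}  accept? [y/N] ")
        if answer.strip().lower() == "y":  # anything other than 'y' rejects
            approved.append((local_name, label, score))
    return approved
```

Defaulting to reject keeps false positives out when the reviewer just presses Enter, which matters most in the 0.75-0.85 band.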
**Option C: Update GHCIDs with Real Q-numbers**
1. Regenerate GHCIDs for 200 newly enriched Dutch institutions
2. Replace synthetic Q-numbers with real Wikidata QIDs in GHCID
3. Update `ghcid_history` entries with change tracking
4. **Impact**: Improve GHCID stability and citation reliability
**Option D: Fix Remaining Geocoding Failures**
1. Manually correct Japanese typo ("YAMAGUCIH" → "YAMAGUCHI")
2. Re-geocode 2 Dutch institutions
3. Achieve 100.00% geocoding coverage
4. **Impact**: Small but completes geocoding milestone
### Future Work (Not This Session)
**Data Quality & Validation**:
- Cross-reference Wikidata QIDs with actual Wikidata content (verify descriptions, types)
- Identify and flag potential mismatches
- Create validation report comparing enrichment sources
**Export & Publishing**:
- Export enriched data to RDF/JSON-LD for linked data publishing
- Generate GeoJSON with enriched metadata
- Update statistics files with new coverage numbers
**Collection Metadata Extraction**:
- Use 11,329 institutional websites for deep crawling (crawl4ai)
- Extract collection descriptions, opening hours, contact info
- Populate `collections` module of LinkML schema
**Wikidata Contribution**:
- Identify Dutch institutions with ISIL codes missing from Wikidata
- Propose batch upload of ISIL P791 properties to Wikidata
- Improve P791 coverage for future users
## Performance Metrics 📊
### Session Summary
**Duration**: ~45 minutes
**Wikidata Queries**: 1 Dutch query (1,303 results)
**Fuzzy Matches**: 200 high-confidence (>0.85 similarity)
**Data Processed**: 13,396 institutions
**Files Written**: 24 MB YAML output
**Overall Enrichment Progress**:
- **Wikidata Coverage**: 55.0% (7,363/13,396)
- **Website Coverage**: 84.6% (11,329/13,396)
- **VIAF Coverage**: 15.2% (2,035/13,396)
- **Founding Date Coverage**: 11.6% (1,550/13,396)
**Dutch-Specific Progress**:
- **Before**: 49/1,017 (4.8%)
- **After**: 222/1,017 (21.8%)
- **Improvement**: +173 institutions (+353%)
**Status**: ✅ Dutch fuzzy matching complete, ready for Latin American expansion or GHCID regeneration
## Lessons Learned 🎓
### 1. ISIL P791 is Incomplete
**Finding**: Many institutions have ISIL codes but aren't in Wikidata's P791 property.
**Evidence**: Only 11.8% of Dutch ISIL-bearing institutions matched via P791.
**Takeaway**: Always supplement ISIL matching with name-based fuzzy matching for comprehensive coverage.
### 2. Type Compatibility is Critical
**Finding**: High-similarity string matches can be false positives if types differ.
**Example**: "Drents Archief" matched "Drents Museum" at 1.000 similarity before type checking, because normalization strips the "archief" and "museum" suffixes and leaves "drents" for both.
**Takeaway**: Always validate matches against institution type to prevent archive/museum/library confusion.
### 3. Fuzzy Matching Scales Well
**Performance**:
- 1,303 Wikidata institutions × 968 local institutions = 1,261,304 comparisons
- Completed in ~10 seconds
- `difflib.SequenceMatcher` is fast enough at this scale
**Takeaway**: Fuzzy matching is viable for datasets of this size without specialized indexing.
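**Sketch**: The all-pairs comparison amounts to a nested loop that keeps the best candidate per local name. A minimal version, assuming normalized names on both sides; `best_matches` and its return shape are assumptions, not the script's `fuzzy_match`.

```python
from difflib import SequenceMatcher

def best_matches(local_names, wikidata_labels, threshold=0.85):
    """For each local name, keep the highest-scoring label strictly above the threshold."""
    matches = {}
    for local in local_names:
        best_label, best_score = None, threshold
        for label in wikidata_labels:
            score = SequenceMatcher(None, local, label).ratio()
            if score > best_score:
                best_label, best_score = label, score
        if best_label is not None:
            matches[local] = (best_label, best_score)
    return matches

result = best_matches(["van gogh", "drents"], ["van gogh", "amsterdam"])
print(result)  # {'van gogh': ('van gogh', 1.0)}
```

The work grows with `len(local_names) * len(wikidata_labels)`. If that product grows much larger, `SequenceMatcher.quick_ratio()` gives an upper bound on `ratio()` and could be used to skip hopeless pairs.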
### 4. YAML Loading is Slow but Acceptable
**Performance**:
- 24 MB YAML file loads in ~35-45 seconds
- PyYAML default parser is slow but reliable
**Alternatives Considered**:
- JSON format (faster parsing)
- Streaming YAML parser (memory efficient)
- SQLite database (better for repeated queries)
**Takeaway**: For occasional batch processing, YAML loading time is acceptable. Consider alternatives for real-time applications.
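**Sketch of the JSON alternative**: One way to keep YAML as the source of truth while avoiding repeated slow loads is a JSON cache refreshed when the YAML file changes. This is an illustration only: `load_with_json_cache` is hypothetical, and the YAML parser is passed in as a callable so the sketch itself needs only the standard library.

```python
import json
import os
from typing import Any, Callable

def load_with_json_cache(source_path: str, cache_path: str,
                         parse_source: Callable[[str], Any]) -> Any:
    """Load from the JSON cache when it is at least as new as the source file;
    otherwise parse the source with `parse_source` and refresh the cache."""
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        with open(cache_path, encoding="utf-8") as handle:
            return json.load(handle)
    data = parse_source(source_path)
    with open(cache_path, "w", encoding="utf-8") as handle:
        # default=str keeps non-JSON values (e.g. dates) from raising,
        # but they come back from the cache as strings.
        json.dump(data, handle, ensure_ascii=False, default=str)
    return data
```

Here `parse_source` would be a small function that opens the file and calls the YAML loader. Because of the `default=str` fallback, founding dates parsed as date objects would be strings on a cache hit, so downstream code would need to handle both.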
## Code Quality Notes 💻
### New Script: `enrich_dutch_institutions_fuzzy.py`
**Strengths**:
- ✅ Clear documentation with docstrings
- ✅ Modular functions (normalize, similarity_score, fuzzy_match, enrich)
- ✅ Type compatibility validation
- ✅ Comprehensive progress reporting
- ✅ Provenance tracking (adds "fuzzy name match" to extraction_method)
- ✅ Safe file handling (writes to new file, preserves original)
**Areas for Improvement**:
- ⚠️ Hardcoded threshold (0.85) - should be command-line argument
- ⚠️ No interactive review mode (step 2 of Option B is not implemented)
- ⚠️ No checkpoint/resume functionality (if interrupted, restarts from beginning)
- ⚠️ Could benefit from logging to file (currently stdout only)
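**Sketch**: The threshold and logging items could be addressed with the standard library alone. The flag names `--threshold` and `--log-file` and both function names are assumptions for illustration; the script currently has no command-line options.

```python
import argparse
import logging
import sys

def parse_args(argv=None):
    """Parse command-line options; argv=None reads sys.argv."""
    parser = argparse.ArgumentParser(
        description="Fuzzy-match Dutch institutions against Wikidata")
    parser.add_argument("--threshold", type=float, default=0.85,
                        help="minimum similarity score (default: 0.85)")
    parser.add_argument("--log-file", default=None,
                        help="optional log file written in addition to stdout")
    args = parser.parse_args(argv)
    if not 0.0 < args.threshold <= 1.0:
        parser.error("--threshold must be in (0, 1]")
    return args

def configure_logging(log_file=None):
    """Log to stdout, and also to a file when a path is given."""
    handlers = [logging.StreamHandler(sys.stdout)]
    if log_file:
        handlers.append(logging.FileHandler(log_file, encoding="utf-8"))
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s",
                        handlers=handlers, force=True)
```

With this in place, a lower-threshold run would be a matter of passing `--threshold 0.75` rather than editing the script.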
**Reusability**:
- Easily adaptable for other countries (change country code and SPARQL query)
- Normalization function could be extracted to shared utilities
- Type compatibility logic could be expanded to support more types
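**Sketch**: Adapting the query to another country is mostly a matter of swapping the P17 value. The builder below is an illustration, not the script's query: the selected variables, the label service line, and the names `build_query` and `request_url` are assumptions, and the parts elided in the query shown earlier are not reconstructed here. The country QIDs are the ones listed under References.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Country QIDs as listed in this summary
COUNTRY_QIDS = {"NL": "Q55", "BR": "Q155", "MX": "Q96", "CL": "Q298"}

def build_query(country_code: str, limit: int = 2000) -> str:
    """Return a SPARQL query for museums, libraries, and archives in one country."""
    qid = COUNTRY_QIDS[country_code]
    return f"""
SELECT DISTINCT ?item ?itemLabel ?isil ?viaf WHERE {{
  VALUES ?type {{ wd:Q33506 wd:Q7075 wd:Q166118 }}  # museum, library, archive
  ?item wdt:P31 ?type .                             # instance of
  ?item wdt:P17 wd:{qid} .                          # country
  OPTIONAL {{ ?item wdt:P791 ?isil . }}
  OPTIONAL {{ ?item wdt:P214 ?viaf . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "nl,en" . }}
}}
LIMIT {limit}
"""

def request_url(country_code: str) -> str:
    """Build a GET URL for the Wikidata Query Service that asks for JSON results."""
    params = {"query": build_query(country_code), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)
```

For Brazil, Mexico, or Chile the label language list would need adjusting (for example to Portuguese or Spanish), and any HTTP client calling the endpoint should send a descriptive User-Agent header.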
## References 📚
### Documentation
- **AGENTS.md**: AI agent instructions (schema reference, extraction tasks)
- **PERSISTENT_IDENTIFIERS.md**: GHCID specification, collision handling
- **SCHEMA_MODULES.md**: LinkML schema v0.2.0 architecture
- **Session Summary (Nov 7)**: Previous geocoding session results
### Schema Modules
- `schemas/core.yaml`: HeritageCustodian, Location, Identifier, DigitalPlatform
- `schemas/enums.yaml`: InstitutionTypeEnum, DataSource, DataTier
- `schemas/provenance.yaml`: Provenance, ChangeEvent, GHCIDHistoryEntry
### Scripts
- `scripts/enrich_global_with_wikidata_fast.py`: ISIL-based enrichment (SPARQL P791)
- `scripts/enrich_dutch_institutions_fuzzy.py`: Name-based fuzzy matching ⭐ NEW
### Wikidata Properties
- **P791**: ISIL code (primary matching key, but incomplete)
- **P31**: instance of (Q33506=museum, Q7075=library, Q166118=archive)
- **P17**: country (Q55=Netherlands, Q155=Brazil, Q96=Mexico, Q298=Chile)
- **P214**: VIAF ID
- **P856**: official website
- **P625**: coordinate location
- **P571**: inception (founding date)
---
**Version**: 0.2.0
**Schema Version**: v0.2.0 (modular)
**Session Date**: 2025-11-08
**Previous Session**: 2025-11-07 (Geocoding + Initial Wikidata Enrichment)
**Next Session**: TBD (Latin American fuzzy matching or GHCID regeneration)