# Wikidata Enrichment Session Summary - November 8, 2025

## Session Context

Resumed from the November 7 session, which achieved 99.98% geocoding coverage and performed initial Wikidata enrichment via ISIL code matching.

## What We Accomplished ✅

### 1. Dutch Institutions Fuzzy Name Matching - Successfully Completed

**Problem Identified**: Low Dutch Wikidata coverage (4.8%) even though 40.9% of Dutch institutions have ISIL codes.

**Root Cause**: The ISIL property (P791) is sparsely populated in Wikidata for Dutch institutions.

**Solution Implemented**:
- Created `scripts/enrich_dutch_institutions_fuzzy.py`
- Queried Wikidata for all Dutch museums, libraries, and archives (1,303 found)
- Fuzzy matched institution names using normalized string similarity
- Added institution type compatibility checking to avoid false positives (e.g., "Drents Archief" vs "Drents Museum")
- Applied matches above a 0.85 similarity threshold

**Results - HIGHLY SUCCESSFUL**:
- **Processing time**: about 1.6 minutes (37s loading + 30s Wikidata query + 10s matching + 20s writing = 97s)
- **Dutch enriched**: 200 institutions
- **New Dutch Wikidata coverage**: 21.8% (up from 4.8%)
- **Improvement**: 4.5x increase in coverage
- **Match quality**: 200 high-confidence matches (>0.85 similarity)
  - Many perfect matches (1.000 similarity)
  - Examples: Van Gogh Museum, Amsterdam Museum, Rijksmuseum

### 2. Overall Dataset Statistics

**Final Enrichment State**:
```
Total institutions:         13,396
With real Wikidata IDs:      7,363 (55.0%)
With synthetic Wikidata:     2,563 (19.1%)
With VIAF IDs:               2,035 (15.2%)
With websites:              11,329 (84.6%)
With founding dates:         1,550 (11.6%)
```

**Enrichment Methods**:
- ISIL Match: 9,670 (72.2%) - from SPARQL query via P791 property
- Fuzzy Name Match: 200 (1.5%) - new Dutch enrichments
- Other/Original: 3,526 (26.3%) - from conversation extraction, CSV imports

### 3. Coverage by Country

| Country | Total | Real Wikidata | Synthetic | Coverage |
|---------|-------|---------------|-----------|----------|
| **Japan (JP)** | 12,065 | 7,091 | 2,517 | **58.8%** |
| **Netherlands (NL)** | 1,017 | 222 | 39 | **21.8%** ⬆ |
| **Chile (CL)** | 90 | 26 | 3 | **28.9%** |
| **Mexico (MX)** | 109 | 23 | 3 | **21.1%** |
| **Brazil (BR)** | 97 | 1 | 0 | **1.0%** ⚠️ |
| **Belgium (BE)** | 7 | 0 | 1 | **0.0%** |
| **United States (US)** | 7 | 0 | 0 | **0.0%** |

**Key Insight**: Japan dominates the dataset (90%) with excellent ISIL→Wikidata mapping. Dutch coverage improved significantly but still has room to grow.
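
The Coverage column can be reproduced from the counts in the table. The following is a minimal sketch, assuming coverage means real Wikidata IDs divided by total institutions (synthetic Q-numbers excluded), which is consistent with the published figures. `COUNTS` and `coverage_pct` are illustrative names, not part of the project's scripts.

```python
# Per-country counts copied from the coverage table: (total, real Wikidata IDs).
COUNTS = {
    "JP": (12_065, 7_091),
    "NL": (1_017, 222),
    "CL": (90, 26),
    "MX": (109, 23),
    "BR": (97, 1),
    "BE": (7, 0),
    "US": (7, 0),
}


def coverage_pct(total: int, with_real_qid: int) -> float:
    """Percentage of institutions with a real Wikidata ID, rounded to 0.1."""
    if total == 0:
        return 0.0
    return round(100 * with_real_qid / total, 1)


for country, (total, real) in COUNTS.items():
    print(f"{country}: {coverage_pct(total, real)}%")
```

Keeping the counts and the percentage in one place makes it easy to regenerate the table after each enrichment pass instead of editing the figures by hand.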

## Files Modified/Created 📁

### Created
1. `scripts/enrich_dutch_institutions_fuzzy.py` - **Production-ready** Dutch fuzzy matcher
2. `data/instances/global/global_heritage_institutions_dutch_enriched.yaml` (24 MB) - **Merged into main file**
3. `data/instances/global/global_heritage_institutions_wikidata_enriched_backup.yaml` (24 MB) - Backup of pre-fuzzy-match state

### Modified
- `data/instances/global/global_heritage_institutions_wikidata_enriched.yaml` (24 MB) - **Main enriched dataset** (now includes Dutch fuzzy matches)

### Preserved
- Original files remain unchanged (backup strategy maintained)

## Technical Insights 🔍

### Fuzzy Matching Strategies

**Normalization Techniques**:
```python
# Name normalization for matching
# - Lowercase
# - Remove common prefixes: "stichting", "gemeentearchief", "regionaal archief", "museum"
# - Remove common suffixes: "archief", "museum", "bibliotheek", "library", "archive"
# - Remove punctuation
# - Normalize whitespace
```
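
The normalization steps listed above could be implemented roughly as follows. This is a sketch rather than the code in `enrich_dutch_institutions_fuzzy.py`: the function name `normalize_name`, the order of the steps, and the choice to strip at most one prefix and one suffix are assumptions. Only the prefix and suffix lists come from this summary.

```python
import re
import string

# Prefix and suffix lists as given in this summary; the real script may differ.
PREFIXES = ("stichting", "gemeentearchief", "regionaal archief", "museum")
SUFFIXES = ("archief", "museum", "bibliotheek", "library", "archive")

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize_name(name: str) -> str:
    """Hypothetical normalizer: lowercase, drop punctuation, trim one prefix/suffix."""
    text = name.lower().translate(_PUNCT_TO_SPACE)
    text = re.sub(r"\s+", " ", text).strip()
    for prefix in PREFIXES:
        if text.startswith(prefix + " "):
            text = text[len(prefix) + 1:]
            break
    for suffix in SUFFIXES:
        if text.endswith(" " + suffix):
            text = text[: -len(suffix) - 1]
            break
    return text.strip()


print(normalize_name("Drents Archief"))  # drents
print(normalize_name("Drents Museum"))   # drents
```

Both example names collapse to the same string, which is why a separate type check is needed before accepting a match.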

**Type Compatibility Checking**:
- Prevents mismatches between museums and archives (e.g., "Drents Museum" ≠ "Drents Archief")
- Checks for type keywords in both institution name and Wikidata type
- Archives must match archives, museums must match museums, libraries must match libraries
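
A keyword-based type check along these lines might look like the sketch below. The keyword table, the function names, and the decision to allow a match when either side carries no type keyword are all assumptions made for illustration; the actual script may be stricter.

```python
# Hypothetical keyword table: institution type -> substrings that signal it.
TYPE_KEYWORDS = {
    "archive": ("archief", "archive"),
    "museum": ("museum",),
    "library": ("bibliotheek", "library"),
}


def detect_types(text: str) -> set:
    """Return the set of institution types whose keywords occur in the text."""
    lowered = text.lower()
    return {
        kind
        for kind, words in TYPE_KEYWORDS.items()
        if any(word in lowered for word in words)
    }


def types_compatible(local_name: str, wikidata_label: str, wikidata_type: str = "") -> bool:
    """Reject a pair only when both sides name a type and the types do not overlap."""
    local = detect_types(local_name)
    remote = detect_types(f"{wikidata_label} {wikidata_type}")
    if not local or not remote:
        return True  # assumption: no type signal on one side does not block a match
    return bool(local & remote)


print(types_compatible("Drents Archief", "Drents Museum", "museum"))    # False
print(types_compatible("Van Gogh Museum", "Van Gogh Museum", "museum"))  # True
```

Running the check on the raw names, before normalization strips "archief" or "museum", is what lets it catch the Drents case.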

**Similarity Threshold**:
- 0.85 chosen to balance precision and recall
- Many perfect matches (1.000) validate the approach
- Examples of high-confidence matches:
  - 1.000: "Van Gogh Museum" → "Van Gogh Museum (Q224124)"
  - 1.000: "Amsterdam Museum" → "Amsterdam Museum (Q1820897)"
  - 0.891: "Koninklijk Tehuis voor Oud-Militairen en Museum Bronbeek" → "Tehuis voor Oud-Militairen en Museum 'Bronbeek' (Q1948006)"
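
Scoring and threshold selection can be sketched with `difflib.SequenceMatcher`, which the Lessons Learned section names as the matcher. The helper names `similarity_score` and `best_match`, and the use of plain lowercasing in place of the full normalization, are assumptions for this example.

```python
from difflib import SequenceMatcher


def similarity_score(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, per SequenceMatcher.ratio()."""
    return SequenceMatcher(None, a, b).ratio()


def best_match(name, candidates, threshold=0.85):
    """Return (label, qid, score) for the best candidate above threshold, else None.

    `candidates` is an iterable of (label, qid) pairs.
    """
    best = None
    for label, qid in candidates:
        score = similarity_score(name.lower(), label.lower())
        if score > threshold and (best is None or score > best[2]):
            best = (label, qid, score)
    return best


candidates = [("Van Gogh Museum", "Q224124"), ("Amsterdam Museum", "Q1820897")]
print(best_match("Van Gogh Museum", candidates))  # ('Van Gogh Museum', 'Q224124', 1.0)
```

Returning the score alongside the QID makes it possible to record the confidence in provenance and to review borderline matches later.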

### Wikidata Query Optimization

**Dutch-Specific Query**:
```sparql
SELECT DISTINCT ?item ?itemLabel ?itemDescription ...
WHERE {
  VALUES ?type { wd:Q33506 wd:Q7075 wd:Q166118 }  # museum, library, archive
  ?item wdt:P31 ?type .                            # instance of
  ?item wdt:P17 wd:Q55 .                           # country: Netherlands
  OPTIONAL { ?item wdt:P791 ?isil . }
  OPTIONAL { ?item wdt:P214 ?viaf . }
  ...
}
LIMIT 2000
```

**Results**:
- Found 1,303 Dutch heritage institutions in Wikidata
- Many lack the ISIL property P791 (explaining low ISIL-based coverage)
- Rich metadata available (coordinates, websites, founding dates, VIAF IDs)
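
To reuse the query for other countries, the country QID can be made a parameter. The sketch below builds a simplified version of the query and flattens a response in the standard SPARQL JSON results format. The function names and the reduced variable list are assumptions, and the parts the original query elides with `...` are not reconstructed here. The country QIDs are the ones listed under Wikidata Properties in this document.

```python
# Country QIDs as listed in the References section of this summary.
COUNTRY_QIDS = {"NL": "Q55", "BR": "Q155", "MX": "Q96", "CL": "Q298"}


def build_query(country_code: str, limit: int = 2000) -> str:
    """Build a simplified museum/library/archive query for one country."""
    qid = COUNTRY_QIDS[country_code]
    return f"""
SELECT DISTINCT ?item ?itemLabel ?isil ?viaf WHERE {{
  VALUES ?type {{ wd:Q33506 wd:Q7075 wd:Q166118 }}
  ?item wdt:P31 ?type .
  ?item wdt:P17 wd:{qid} .
  OPTIONAL {{ ?item wdt:P791 ?isil . }}
  OPTIONAL {{ ?item wdt:P214 ?viaf . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "nl,en" . }}
}}
LIMIT {limit}
""".strip()


def parse_bindings(result: dict) -> list:
    """Flatten SPARQL JSON results into dicts with qid, label, isil, viaf."""
    rows = []
    for binding in result["results"]["bindings"]:
        rows.append({
            "qid": binding["item"]["value"].rsplit("/", 1)[-1],
            "label": binding.get("itemLabel", {}).get("value", ""),
            "isil": binding.get("isil", {}).get("value"),
            "viaf": binding.get("viaf", {}).get("value"),
        })
    return rows


print(build_query("BR"))
```

The sketch leaves out the HTTP request itself; any client that posts the query to a SPARQL endpoint and returns the JSON body can feed `parse_bindings`.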

### ISIL P791 Property Gap

**Finding**: Dutch ISIL codes are not well-represented in Wikidata P791.

**Evidence**:
- 416 Dutch institutions have ISIL codes (40.9%)
- ISIL-based SPARQL query matched only 49 (11.8% of ISIL-bearing institutions)
- Fuzzy name matching found 200 additional matches (about 4x as many as ISIL matching)

**Implication**: Wikidata's ISIL coverage is incomplete, especially for the Netherlands. Name-based matching is essential for comprehensive enrichment.
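
One way to combine the two strategies is to try an exact ISIL match first and fall back to fuzzy name matching only when that fails. The sketch below assumes institutions and Wikidata rows are plain dicts with `name`/`label`, `isil`, and `qid` keys, and the method labels `isil_match` and `fuzzy_name_match` are placeholders. None of this is confirmed to be how the project's scripts are structured.

```python
from difflib import SequenceMatcher


def match_institution(inst: dict, wikidata_rows: list, threshold: float = 0.85):
    """Return (qid, method), preferring an exact ISIL match over a fuzzy name match.

    Returns (None, None) when neither strategy finds a candidate.
    """
    isil = inst.get("isil")
    if isil:
        for row in wikidata_rows:
            if row.get("isil") == isil:
                return row["qid"], "isil_match"

    best_qid, best_score = None, threshold
    for row in wikidata_rows:
        score = SequenceMatcher(None, inst["name"].lower(), row["label"].lower()).ratio()
        if score > best_score:
            best_qid, best_score = row["qid"], score
    if best_qid:
        return best_qid, "fuzzy_name_match"
    return None, None


# Made-up rows for illustration; "Q_EXAMPLE" and "NL-EXAMPLE" are not real identifiers.
rows = [
    {"qid": "Q_EXAMPLE", "label": "Example Archief", "isil": "NL-EXAMPLE"},
    {"qid": "Q224124", "label": "Van Gogh Museum", "isil": None},
]
print(match_institution({"name": "Van Gogh Museum", "isil": None}, rows))
```

Returning the method with the QID keeps the provenance distinction between ISIL matches and fuzzy matches that the enrichment statistics rely on.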

## Outstanding Issues ⚠️

### 1. Remaining Dutch Coverage Gap

**Current State**:
- 1,017 Dutch institutions total
- 222 with Wikidata (21.8%)
- **795 still without Wikidata (78.2%)**

**Samples Without Wikidata**:
- Regionaal Archief Alkmaar [ISIL: NL-AmrRAA]
- Het Scheepvaartmuseum (HSM) [ISIL: NL-AsdHSM]
- IHLIA LGBT Heritage [ISIL: NL-AsdILGBT]

**Next Steps**:
1. Lower fuzzy match threshold to 0.75-0.80 (trade precision for recall)
2. Try alternative Wikidata properties (P856 official website, P131 located in the administrative territorial entity)
3. Manual curation for high-value institutions
4. Consider contributing missing ISIL codes to Wikidata

### 2. Very Low Brazilian Coverage

**Current State**:
- 97 Brazilian institutions
- **Only 1 with Wikidata (1.0%)**
- 96 without Wikidata

**Hypothesis**: As with the Dutch data, Brazilian institutions may exist in Wikidata but lack ISIL codes.

**Proposed Solution**: Run fuzzy matching for Brazilian institutions, following the Dutch approach.

### 3. Moderate Latin American Coverage

**Mexico**:
- 109 institutions, 23 with Wikidata (21.1%)
- 86 remaining without Wikidata

**Chile**:
- 90 institutions, 26 with Wikidata (28.9%)
- 64 remaining without Wikidata

**Next Step**: Apply fuzzy matching to Mexican and Chilean institutions.

### 4. Remaining Synthetic Q-numbers

**Current State**:
- 2,563 institutions still have synthetic Q-numbers (19.1%)
- The majority are Japanese institutions (2,517 synthetic in Japan)

**Context**: These are institutions that don't exist in Wikidata yet. Synthetic Q-numbers are hash-based placeholders.

**Decision Point**: Do we prioritize replacing synthetic Q-numbers or accept them as valid for institutions not yet in Wikidata?

### 5. Geocoding Failures

**From Previous Session** (still unresolved):
- 3 institutions failed geocoding (0.02%)
  - 1 Japanese (typo: "YAMAGUCIH" → should be "YAMAGUCHI")
  - 2 Dutch institutions

**Status**: Not addressed in this session

## Next Steps 📋

### Immediate Priorities (Ranked)

**Option A: Expand Fuzzy Matching to Latin America** (Recommended)
1. Adapt `enrich_dutch_institutions_fuzzy.py` for Brazil, Mexico, Chile
2. Query Wikidata for institutions in these countries
3. Apply fuzzy name matching with 0.85 threshold
4. Expected outcome:
   - Brazil: 1% → 15-25% coverage
   - Mexico: 21% → 35-45% coverage
   - Chile: 29% → 40-50% coverage
5. **Impact**: Enrich ~100-150 additional institutions

**Option B: Lower Dutch Threshold for More Matches**
1. Re-run Dutch fuzzy matching with 0.75 threshold
2. Implement interactive review (approve/reject matches)
3. Expected outcome: Dutch coverage 22% → 30-35%
4. **Risk**: Lower threshold may introduce false positives
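
The interactive review in Option B could be sketched as a loop that auto-accepts matches above the current 0.85 threshold and asks a reviewer about those in the 0.75-0.85 band. The function name, the tuple layout of `candidates`, and the injectable `ask` callable (which defaults to `input`) are assumptions for this example. No such mode exists in the script yet.

```python
def review_matches(candidates, low=0.75, high=0.85, ask=input):
    """Auto-accept scores above `high`; ask the reviewer about scores in (low, high].

    `candidates` is an iterable of (local_name, wikidata_label, qid, score) tuples.
    Returns the accepted (local_name, qid) pairs.
    """
    accepted = []
    for local_name, label, qid, score in candidates:
        if score > high:
            accepted.append((local_name, qid))
        elif score > low:
            prompt = f'{score:.3f}: "{local_name}" -> "{label}" ({qid}) accept? [y/N] '
            if ask(prompt).strip().lower() == "y":
                accepted.append((local_name, qid))
    return accepted


if __name__ == "__main__":
    # Placeholder candidates; the QIDs here are not real Wikidata identifiers.
    demo = [
        ("Name A", "Name A", "Q_A", 1.0),
        ("Name B", "Name Bee", "Q_B", 0.80),
    ]
    print(review_matches(demo, ask=lambda prompt: "n"))
```

Passing `ask` as a parameter keeps the loop testable without a terminal and would also allow replaying decisions saved from an earlier review.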

**Option C: Update GHCIDs with Real Q-numbers**
1. Regenerate GHCIDs for 200 newly enriched Dutch institutions
2. Replace synthetic Q-numbers with real Wikidata QIDs in GHCID
3. Update `ghcid_history` entries with change tracking
4. **Impact**: Improve GHCID stability and citation reliability

**Option D: Fix Remaining Geocoding Failures**
1. Manually correct Japanese typo ("YAMAGUCIH" → "YAMAGUCHI")
2. Re-geocode 2 Dutch institutions
3. Achieve 100.00% geocoding coverage
4. **Impact**: Small but completes geocoding milestone

### Future Work (Not This Session)

**Data Quality & Validation**:
- Cross-reference Wikidata QIDs with actual Wikidata content (verify descriptions, types)
- Identify and flag potential mismatches
- Create validation report comparing enrichment sources

**Export & Publishing**:
- Export enriched data to RDF/JSON-LD for linked data publishing
- Generate GeoJSON with enriched metadata
- Update statistics files with new coverage numbers

**Collection Metadata Extraction**:
- Use 11,329 institutional websites for deep crawling (crawl4ai)
- Extract collection descriptions, opening hours, contact info
- Populate `collections` module of LinkML schema

**Wikidata Contribution**:
- Identify Dutch institutions with ISIL codes missing from Wikidata
- Propose batch upload of ISIL P791 properties to Wikidata
- Improve P791 coverage for future users

## Performance Metrics 📊

### Session Summary

**Duration**: ~45 minutes
**Wikidata Queries**: 1 Dutch query (1,303 results)
**Fuzzy Matches**: 200 high-confidence (>0.85 similarity)
**Data Processed**: 13,396 institutions
**Files Written**: 24 MB YAML output

**Overall Enrichment Progress**:
- **Wikidata Coverage**: 55.0% (7,363/13,396)
- **Website Coverage**: 84.6% (11,329/13,396)
- **VIAF Coverage**: 15.2% (2,035/13,396)
- **Founding Date Coverage**: 11.6% (1,550/13,396)

**Dutch-Specific Progress**:
- **Before**: 49/1,017 (4.8%)
- **After**: 222/1,017 (21.8%)
- **Improvement**: +173 institutions (+353%)

**Status**: ✅ Dutch fuzzy matching complete, ready for Latin American expansion or GHCID regeneration

## Lessons Learned 🎓

### 1. ISIL P791 is Incomplete

**Finding**: Many institutions have ISIL codes but aren't in Wikidata's P791 property.

**Evidence**: Only 11.8% of Dutch ISIL-bearing institutions matched via P791.

**Takeaway**: Always supplement ISIL matching with name-based fuzzy matching for comprehensive coverage.

### 2. Type Compatibility is Critical

**Finding**: High-similarity string matches can be false positives if types differ.

**Example**: "Drents Archief" matched "Drents Museum" at 1.000 similarity before type checking.

**Takeaway**: Always validate matches against institution type to prevent archive/museum/library confusion.

### 3. Fuzzy Matching Scales Well

**Performance**:
- 1,303 Wikidata institutions × 968 local institutions = 1,261,304 comparisons
- Completed in ~10 seconds
- `SequenceMatcher` is efficient enough at this scale

**Takeaway**: Fuzzy matching is viable for datasets of this size without specialized indexing.
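
If the all-pairs comparison ever becomes a bottleneck, `difflib.SequenceMatcher` offers cheap upper bounds (`real_quick_ratio()` and `quick_ratio()`) that can rule out a pair before the more expensive `ratio()` call. The sketch below shows that pattern together with the pairwise comparison count for 1,303 × 968 names. Whether the current script uses these shortcuts is not stated in this summary, and `similarity_above` is an illustrative name.

```python
from difflib import SequenceMatcher


def similarity_above(a: str, b: str, threshold: float):
    """Return ratio() when it exceeds `threshold`, else None.

    real_quick_ratio() and quick_ratio() are upper bounds on ratio(), so a pair
    can be discarded as soon as either bound fails to exceed the threshold.
    """
    matcher = SequenceMatcher(None, a, b)
    if matcher.real_quick_ratio() <= threshold:
        return None
    if matcher.quick_ratio() <= threshold:
        return None
    score = matcher.ratio()
    return score if score > threshold else None


n_wikidata, n_local = 1_303, 968
print(f"{n_wikidata * n_local:,} pairwise comparisons")  # 1,261,304 pairwise comparisons
```

Because most pairs differ widely in length or character content, the bounds reject them early and only near matches reach the full computation.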

### 4. YAML Loading is Slow but Acceptable

**Performance**:
- 24 MB YAML file loads in ~35-45 seconds
- PyYAML's default pure-Python parser is slow but reliable

**Alternatives Considered**:
- JSON format (faster parsing)
- Streaming YAML parser (memory efficient)
- SQLite database (better for repeated queries)

**Takeaway**: For occasional batch processing, YAML loading time is acceptable. Consider alternatives for real-time applications.
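
The JSON alternative can be approximated without changing the source format by caching the parsed data as JSON next to the YAML file. The following is a sketch under stated assumptions: `load_with_json_cache` and the `.cache.json` suffix are invented for this example, the slow parser is passed in as a callable (for instance a wrapper around `yaml.safe_load`), and values that JSON cannot represent, such as dates, are stored as strings.

```python
import json
import os


def load_with_json_cache(source_path: str, parse, cache_path: str = None):
    """Run the slow `parse(source_path)` once, then reuse a JSON cache.

    The cache is used only while it is at least as new as the source file.
    """
    cache_path = cache_path or source_path + ".cache.json"
    if (
        os.path.exists(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)
    ):
        with open(cache_path, encoding="utf-8") as fh:
            return json.load(fh)

    data = parse(source_path)
    with open(cache_path, "w", encoding="utf-8") as fh:
        # default=str turns non-JSON values (e.g. dates) into strings,
        # so cached data may differ in type from freshly parsed data.
        json.dump(data, fh, ensure_ascii=False, default=str)
    return data
```

The first run pays the full YAML parsing cost; later runs skip it until the source file is modified. If PyYAML was built with libyaml, its C-based loaders are another option to measure before changing formats.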

## Code Quality Notes 💻

### New Script: `enrich_dutch_institutions_fuzzy.py`

**Strengths**:
- ✅ Clear documentation with docstrings
- ✅ Modular functions (normalize, similarity_score, fuzzy_match, enrich)
- ✅ Type compatibility validation
- ✅ Comprehensive progress reporting
- ✅ Provenance tracking (adds "fuzzy name match" to extraction_method)
- ✅ Safe file handling (writes to new file, preserves original)

**Areas for Improvement**:
- ⚠️ Hardcoded threshold (0.85) - should be a command-line argument
- ⚠️ No interactive review mode (option 2 not implemented)
- ⚠️ No checkpoint/resume functionality (if interrupted, restarts from beginning)
- ⚠️ Could benefit from logging to file (currently stdout only)
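
Two of these items, the hardcoded threshold and stdout-only logging, could be addressed with the standard library alone. The sketch below is one possible shape; the flag names `--threshold` and `--log-file` and both function names are assumptions, not existing options of the script.

```python
import argparse
import logging


def parse_args(argv=None):
    """Parse hypothetical CLI options for the fuzzy matcher."""
    parser = argparse.ArgumentParser(description="Fuzzy-match institutions against Wikidata")
    parser.add_argument("--threshold", type=float, default=0.85,
                        help="minimum similarity between 0 and 1 (default: 0.85)")
    parser.add_argument("--log-file", default=None,
                        help="optional path of a log file written in addition to the console")
    args = parser.parse_args(argv)
    if not 0.0 <= args.threshold <= 1.0:
        parser.error("--threshold must be between 0 and 1")
    return args


def configure_logging(log_file=None):
    """Log to the console and, when a path is given, also to a file."""
    handlers = [logging.StreamHandler()]
    if log_file:
        handlers.append(logging.FileHandler(log_file, encoding="utf-8"))
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=handlers,
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )


if __name__ == "__main__":
    options = parse_args()
    configure_logging(options.log_file)
    logging.info("Using similarity threshold %.2f", options.threshold)
```

Keeping 0.85 as the default preserves current behavior while making a lower-threshold rerun a matter of passing a flag.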

**Reusability**:
- Easily adaptable for other countries (change country code and SPARQL query)
- Normalization function could be extracted to shared utilities
- Type compatibility logic could be expanded to support more types

## References 📚

### Documentation
- **AGENTS.md**: AI agent instructions (schema reference, extraction tasks)
- **PERSISTENT_IDENTIFIERS.md**: GHCID specification, collision handling
- **SCHEMA_MODULES.md**: LinkML schema v0.2.0 architecture
- **Session Summary (Nov 7)**: Previous geocoding session results

### Schema Modules
- `schemas/core.yaml`: HeritageCustodian, Location, Identifier, DigitalPlatform
- `schemas/enums.yaml`: InstitutionTypeEnum, DataSource, DataTier
- `schemas/provenance.yaml`: Provenance, ChangeEvent, GHCIDHistoryEntry

### Scripts
- `scripts/enrich_global_with_wikidata_fast.py`: ISIL-based enrichment (SPARQL P791)
- `scripts/enrich_dutch_institutions_fuzzy.py`: Name-based fuzzy matching ⭐ NEW

### Wikidata Properties
- **P791**: ISIL code (primary matching key, but incomplete)
- **P31**: instance of (Q33506=museum, Q7075=library, Q166118=archive)
- **P17**: country (Q55=Netherlands, Q155=Brazil, Q96=Mexico, Q298=Chile)
- **P214**: VIAF ID
- **P856**: official website
- **P625**: coordinate location
- **P571**: inception (founding date)

---

**Version**: 0.2.0
**Schema Version**: v0.2.0 (modular)
**Session Date**: 2025-11-08
**Previous Session**: 2025-11-07 (Geocoding + Initial Wikidata Enrichment)
**Next Session**: TBD (Latin American fuzzy matching or GHCID regeneration)