glam/data/instances/netherlands/NETHERLANDS_PHASE2_ENRICHMENT_REPORT.md
2025-11-19 23:25:22 +01:00

# Netherlands Phase 2 Enrichment Report
- **Date**: 2025-11-11 (Unix timestamp 1762880966.934019)
- **Script**: `scripts/enrich_phase2_netherlands.py`
- **Target**: 622 Dutch institutions
- **Methodology**: SPARQL batch query + fuzzy name matching (Dutch normalization, 70% threshold)
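
The report does not reproduce the SPARQL batch query itself. As a rough sketch only, a query that collects the pool of Dutch heritage institutions could be built as shown here. The set of classes (museum `Q33506`, archive `Q166118`, library `Q7075`), the `LIMIT`, and the helper names `build_query` and `fetch_pool` are assumptions for illustration and are not taken from `scripts/enrich_phase2_netherlands.py`. `fetch_pool` needs network access and is defined but not called.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed heritage classes; the real script may query a different set.
HERITAGE_CLASSES = {"museum": "Q33506", "archive": "Q166118", "library": "Q7075"}


def build_query(class_qids, country_qid="Q55", limit=5000):
    """Build one batch query for institutions of the given classes in a country."""
    values = " ".join(f"wd:{qid}" for qid in class_qids)
    return f"""
SELECT DISTINCT ?item ?itemLabel WHERE {{
  VALUES ?class {{ {values} }}
  ?item wdt:P31/wdt:P279* ?class ;   # instance of (a subclass of) the class
        wdt:P17 wd:{country_qid} .   # country; Q55 is the Netherlands
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "nl,en". }}
}}
LIMIT {limit}
""".strip()


def fetch_pool(query, user_agent="glam-enrichment-sketch/0.1"):
    """Run the query against the public endpoint and return (QID, label) pairs."""
    url = WIKIDATA_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=120) as response:
        rows = json.load(response)["results"]["bindings"]
    return [(r["item"]["value"].rsplit("/", 1)[-1], r["itemLabel"]["value"]) for r in rows]


query = build_query(HERITAGE_CLASSES.values())
```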
---
## 📊 Overall Results
| Metric | Value |
|--------|-------|
| **Total Dutch Institutions** | 622 |
| **With Wikidata** | 396 (63.7%) |
| **Without Wikidata** | 226 (36.3%) |
| **Phase 2 Enriched** | 203 institutions |
| **Wikidata Pool** | 3,550 Dutch institutions in Wikidata |
| **Match Threshold** | 70% similarity (Dutch normalization) |
### Coverage Progression
- **Before Phase 2**: 193 institutions (31.0%)
- **After Phase 2**: 396 institutions (63.7%)
- **Improvement**: +203 institutions (+32.6 percentage points)
- **🎯 Target Achieved**: 62%+ coverage ✅
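
The coverage figures above follow directly from the institution counts. The short sketch below recomputes them; the helper name `coverage` is illustrative. Note that the gain is computed from unrounded percentages, which is why it comes out as 32.6 rather than the 32.7 obtained by subtracting the rounded values.

```python
def coverage(with_wikidata: int, total: int) -> float:
    """Share of institutions that have a Wikidata ID, as a percentage."""
    return 100.0 * with_wikidata / total


TOTAL = 622
before, after = 193, 396  # institutions with a Wikidata ID before/after Phase 2

before_pct = coverage(before, TOTAL)
after_pct = coverage(after, TOTAL)
gain_pp = after_pct - before_pct  # percentage points, from unrounded values

print(f"{before_pct:.1f}% -> {after_pct:.1f}% (+{gain_pp:.1f} pp, +{after - before} institutions)")
# prints: 31.0% -> 63.7% (+32.6 pp, +203 institutions)
```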
---
## 🏛️ Coverage by Institution Type
| Type | With Wikidata | Without Wikidata | Total | Coverage |
|------|---------------|------------------|-------|----------|
| ARCHIVE | 135 | 16 | 151 | 89.4% |
| COLLECTING_SOCIETY | 1 | 17 | 18 | 5.6% |
| L | 1 | 0 | 1 | 100.0% |
| LIBRARY | 5 | 2 | 7 | 71.4% |
| MIXED | 176 | 151 | 327 | 53.8% |
| MUSEUM | 71 | 27 | 98 | 72.4% |
| OFFICIAL_INSTITUTION | 4 | 4 | 8 | 50.0% |
| RESEARCH_CENTER | 3 | 8 | 11 | 27.3% |
| UNDEFINED | 0 | 1 | 1 | 0.0% |
### Highlights
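- **MUSEUM**: 71/98 (72.4%) - Second-highest coverage among types with more than one institution
- **ARCHIVE**: 135/151 (89.4%) - Highest coverage among types with more than one institution, up from 21.2%
- **MIXED**: 176/327 (53.8%) - Largest group and highest absolute count, up from 30.6%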
---
## 🏙️ Geographic Distribution
### Top 10 Cities by Institution Count, with Wikidata Coverage
| City | With Wikidata | Without Wikidata | Total | Coverage |
|------|---------------|------------------|-------|----------|
| Den Haag | 18 | 31 | 49 | 36.7% |
| Amsterdam | 25 | 14 | 39 | 64.1% |
| Utrecht | 22 | 11 | 33 | 66.7% |
| Arnhem | 15 | 2 | 17 | 88.2% |
| Zwolle | 4 | 11 | 15 | 26.7% |
| Rotterdam | 13 | 1 | 14 | 92.9% |
| Leiden | 6 | 5 | 11 | 54.5% |
| Groningen | 5 | 5 | 10 | 50.0% |
| Zeeland | 9 | 1 | 10 | 90.0% |
| Maastricht | 6 | 3 | 9 | 66.7% |
### Top 10 Cities Needing Enrichment
| City | Institutions Without Wikidata |
|------|-------------------------------|
| Den Haag | 31 |
| Amsterdam | 14 |
| Utrecht | 11 |
| Zwolle | 11 |
| Enschede | 5 |
| Groningen | 5 |
| Leiden | 5 |
| Roermond | 5 |
| Deventer | 4 |
| Leeuwarden | 4 |
---
## 🎯 Remaining Work
### Institutions Without Wikidata: 226
**By Type:**
- **MIXED**: 151 institutions
- **MUSEUM**: 27 institutions
- **COLLECTING_SOCIETY**: 17 institutions
- **ARCHIVE**: 16 institutions
- **RESEARCH_CENTER**: 8 institutions
- **OFFICIAL_INSTITUTION**: 4 institutions
- **LIBRARY**: 2 institutions
- **UNDEFINED**: 1 institution
### Recommended Next Steps
1. **Phase 3 Netherlands**: Alternative name search for the remaining 226 institutions
   - Target: COLLECTING_SOCIETY (5.6% coverage currently, 1 of 18)
   - Target: Generic "Museum" institutions (common names)
   - Target: Regional archives with variant spellings
2. **Manual Curation**: Review institutions with unique names not found in Wikidata
3. **ISIL Code Matching**: Cross-reference with Dutch ISIL registry for remaining institutions
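
Step 3 amounts to an exact-match join on the ISIL identifier. A minimal sketch follows, assuming each local record carries an `isil` field and that a lookup from ISIL code to Wikidata ID has already been built (for example from Wikidata's ISIL property, `P791`). The function name, field names, codes, and IDs are placeholders, not real data or the project's actual schema.

```python
def match_by_isil(institutions, wikidata_by_isil):
    """Attach a Wikidata ID to each record whose ISIL code appears in the lookup."""
    matched, unmatched = [], []
    for inst in institutions:
        isil = (inst.get("isil") or "").strip()  # tolerate missing or padded codes
        qid = wikidata_by_isil.get(isil)
        if qid:
            matched.append({**inst, "wikidata_id": qid})
        else:
            unmatched.append(inst)
    return matched, unmatched


# Placeholder data: these are not real ISIL codes or Wikidata IDs.
institutions = [
    {"name": "Example Archive", "isil": " NL-EXAMPLE-001 "},
    {"name": "Example Museum", "isil": None},
]
wikidata_by_isil = {"NL-EXAMPLE-001": "Q_PLACEHOLDER"}

matched, unmatched = match_by_isil(institutions, wikidata_by_isil)
```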
---
## 🔍 Sample Enriched Institutions
### 1. Regionaal Archief Alkmaar
- **Location**: Alkmaar
- **Type**: ARCHIVE
- **Wikidata**: [Q2189005](https://www.wikidata.org/wiki/Q2189005)
- **Match Score**: 1.000
### 2. Gemeente Almelo
- **Location**: Almelo
- **Type**: MIXED
- **Wikidata**: [Q110891755](https://www.wikidata.org/wiki/Q110891755)
- **Match Score**: 0.811
### 3. Gemeentearchief Alphen aan den Rijn
- **Location**: Alphen aan den Rijn
- **Type**: ARCHIVE
- **Wikidata**: [Q111190988](https://www.wikidata.org/wiki/Q111190988)
- **Match Score**: 1.000
### 4. Huygens Instituut (HI)
- **Location**: Amsterdam
- **Type**: MIXED
- **Wikidata**: [Q487857](https://www.wikidata.org/wiki/Q487857)
- **Match Score**: 0.743
### 5. IHLIA LGBT Heritage
- **Location**: Amsterdam
- **Type**: MIXED
- **Wikidata**: [Q1417841](https://www.wikidata.org/wiki/Q1417841)
- **Match Score**: 0.974
### 6. Nationale Opera & Ballet
- **Location**: Amsterdam
- **Type**: MIXED
- **Wikidata**: [Q110996017](https://www.wikidata.org/wiki/Q110996017)
- **Match Score**: 0.714
### 7. Rijksmuseum
- **Location**: Amsterdam
- **Type**: MUSEUM
- **Wikidata**: [Q124624215](https://www.wikidata.org/wiki/Q124624215)
- **Match Score**: 0.909
### 8. Gemeente Appingedam
- **Location**: Appingedam
- **Type**: ARCHIVE
- **Wikidata**: [Q81181191](https://www.wikidata.org/wiki/Q81181191)
- **Match Score**: 0.844
### 9. Museum Arnhem (MA)
- **Location**: Arnhem
- **Type**: MUSEUM
- **Wikidata**: [Q2114028](https://www.wikidata.org/wiki/Q2114028)
- **Match Score**: 1.000
### 10. Drents Archief
- **Location**: Assen
- **Type**: ARCHIVE
- **Wikidata**: [Q1978308](https://www.wikidata.org/wiki/Q1978308)
- **Match Score**: 1.000
---
## 📈 Performance Metrics
- **Wikidata Query Time**: 58.9 seconds
- **Institutions Matched**: 203
- **Match Rate**: 47.3% (203 matched out of the 429 institutions that lacked a Wikidata ID before Phase 2)
- **Total Processing Time**: 2.5 minutes
- **Dataset Write Time**: 16.4 seconds
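
The match rate is measured against the institutions that had no Wikidata ID before the run, not against the full dataset. A small sketch of that arithmetic, using the counts reported in this document (variable names are illustrative):

```python
total = 622
with_wikidata_before = 193
matched = 203

# Candidates are the institutions lacking a Wikidata ID before Phase 2.
candidates = total - with_wikidata_before
match_rate = 100.0 * matched / candidates
remaining = candidates - matched

print(f"{matched}/{candidates} = {match_rate:.1f}% matched, {remaining} still unmatched")
# prints: 203/429 = 47.3% matched, 226 still unmatched
```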
---
## ✅ Success Factors
1. **Strong Dutch Wikipedia Coverage**: Netherlands has extensive cultural heritage documentation
2. **ISIL Code Integration**: Many institutions already have ISIL codes for validation
3. **Dutch Normalization**: Effective handling of Dutch-specific prefixes/suffixes
4. **High Wikidata Pool**: 3,550 Dutch institutions available (vs. 1,845 for Mexico)
5. **Type Compatibility Checks**: Prevented museum → library mismatches
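
Factors 3 and 5 can be sketched together: normalize both names, score their similarity, and only accept candidates whose type is compatible. The report does not say which similarity measure or normalization rules the script uses, so this sketch substitutes Python's `difflib.SequenceMatcher` and a made-up noise-word list. `DUTCH_NOISE_WORDS`, `normalize_dutch`, `types_compatible`, `best_match`, and the placeholder IDs are all assumptions; only the 0.70 threshold comes from the report.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Hypothetical noise words; the script's actual Dutch normalization list is not shown.
DUTCH_NOISE_WORDS = {"stichting", "vereniging", "gemeente", "het", "de"}
THRESHOLD = 0.70  # the 70% threshold stated in the report


def normalize_dutch(name: str) -> str:
    """Lowercase, strip accents, drop parenthesised abbreviations and noise words."""
    name = re.sub(r"\([^)]*\)", " ", name)  # "Museum Arnhem (MA)" -> "Museum Arnhem"
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    kept = [t for t in tokens if t not in DUTCH_NOISE_WORDS]
    return " ".join(kept or tokens)  # keep the tokens if every one was a noise word


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize_dutch(a), normalize_dutch(b)).ratio()


def types_compatible(local_type: str, wikidata_type: str) -> bool:
    """Reject clear mismatches such as MUSEUM vs LIBRARY; MIXED/UNDEFINED accept anything."""
    permissive = {"MIXED", "UNDEFINED"}
    return local_type in permissive or wikidata_type in permissive or local_type == wikidata_type


def best_match(name, inst_type, candidates):
    """Return (score, qid) of the best compatible candidate at or above THRESHOLD, else None."""
    scored = [
        (similarity(name, label), qid)
        for qid, label, cand_type in candidates
        if types_compatible(inst_type, cand_type)
    ]
    best = max(scored, default=None)
    return best if best and best[0] >= THRESHOLD else None


# Placeholder candidates: the IDs are not real Wikidata items.
candidates = [
    ("Q_A", "Museum Arnhem", "MUSEUM"),
    ("Q_B", "Bibliotheek Arnhem", "LIBRARY"),
]
result = best_match("Museum Arnhem (MA)", "MUSEUM", candidates)  # (1.0, "Q_A")
```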
---
## 🔄 Comparison with Mexico Phase 2
| Metric | Mexico | Netherlands |
|--------|--------|-------------|
| **Total Institutions** | 192 | 622 |
| **Starting Coverage** | 17.7% (34) | 31.0% (193) |
| **Ending Coverage** | 50.0% (96) | 63.7% (396) |
| **Institutions Enriched** | 62 | 203 |
| **Coverage Gain** | +32.3pp | +32.6pp |
| **Match Rate** | 39.2% | 47.3% |
| **Wikidata Pool** | 1,845 | 3,550 |
| **Processing Time** | 2.1 min | 2.5 min |

**Netherlands outperformed Mexico** in absolute numbers (203 vs 62 enriched) and match rate (47.3% vs 39.2%), demonstrating the value of targeting well-documented European heritage institutions.

---
## 📊 Phase 2 Summary Across Countries
| Country | Total | Before | After | Enriched | Coverage Gain |
|---------|-------|--------|-------|----------|---------------|
| 🇧🇷 Brazil | 241 | 13.7% | 32.5% | 45 | +18.8pp |
| 🇲🇽 Mexico | 192 | 17.7% | 50.0% | 62 | +32.3pp |
| 🇳🇱 **Netherlands** | **622** | **31.0%** | **63.7%** | **203** | **+32.6pp** |

**Netherlands Phase 2 is the largest successful enrichment to date** with 203 institutions enriched, bringing total Wikidata coverage to 63.7%.

---
- **Generated**: /Users/kempersc/apps/glam
- **Script**: `scripts/enrich_phase2_netherlands.py`
- **Dataset**: `data/instances/all/globalglam-20251111.yaml`