glam/UNIFICATION_SUMMARY.md
2025-11-19 23:25:22 +01:00

Global GLAM Dataset Unification - Complete Summary

Completed: 2025-11-11 15:17 UTC
Script: scripts/unify_all_datasets.py

MISSION ACCOMPLISHED

Successfully unified all heritage institution datasets into a single comprehensive global dataset.


📊 Final Statistics

Overall Coverage

  • Total Institutions: 13,502 (from 25,963 raw records)
  • Countries Covered: 18
  • Wikidata Coverage: 7,520/13,502 (55.7%)
  • Geocoding Coverage: 8,178/13,502 (60.6%)
  • Duplicates Removed: 12,461 (48.0% deduplication rate)
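
The derived figures above follow from four raw counts. A minimal sketch of the arithmetic, using only numbers stated in this summary (the constant and function names are illustrative, not taken from scripts/unify_all_datasets.py):

```python
# Counts copied from the "Overall Coverage" list above.
RAW_RECORDS = 25_963
UNIQUE_INSTITUTIONS = 13_502
WITH_WIKIDATA = 7_520
WITH_COORDINATES = 8_178


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


duplicates_removed = RAW_RECORDS - UNIQUE_INSTITUTIONS
dedup_rate = pct(duplicates_removed, RAW_RECORDS)
wikidata_coverage = pct(WITH_WIKIDATA, UNIQUE_INSTITUTIONS)
geocoding_coverage = pct(WITH_COORDINATES, UNIQUE_INSTITUTIONS)

print(duplicates_removed, dedup_rate, wikidata_coverage, geocoding_coverage)
# prints: 12461 48.0 55.7 60.6
```

Note that the deduplication rate is measured against the raw record count, while the two coverage figures are measured against the deduplicated total.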

Data Quality Metrics

  • Records Needing Enrichment: 13,461 (99.7%)
  • Missing Wikidata Q-numbers: 5,982 institutions
  • Missing Coordinates: 5,324 institutions
  • Missing Website URLs: 2,085 institutions
  • Missing/Incomplete Descriptions: 13,036 institutions

🌍 Geographic Distribution

Top 10 Countries by Institution Count

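| Rank | Country | Code | Count | Wikidata % | Geocode % | Status |
|------|---------|------|-------|------------|-----------|--------|
| 1 | Japan | JP | 12,065 | 58.8% | 58.8% | Good |
| 2 | Netherlands | NL | 622 | 31.0% | 99.8% | ⚠️ Needs Wikidata |
| 3 | Mexico | MX | 226 | 15.0% | 73.9% | 🔴 Priority |
| 4 | Brazil | BR | 212 | 13.7% | 45.8% | 🔴 Priority |
| 5 | Chile | CL | 180 | 53.9% | 93.3% | Good |
| 6 | Tunisia | TN | 69 | 1.4% | 13.0% | 🔴 Critical |
| 7 | Libya | LY | 50 | 16.0% | 0.0% | 🔴 Needs Geocoding |
| 8 | Vietnam | VN | 21 | 38.1% | 0.0% | ⚠️ Needs Geocoding |
| 9 | Algeria | DZ | 19 | 5.3% | 0.0% | 🔴 Critical |
| 10 | Georgia | GE | 14 | 0.0% | 0.0% | 🔴 Critical |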

Countries with 0% Wikidata Coverage (HIGHEST PRIORITY)

  1. Georgia (GE): 14 institutions - no Wikidata, no geocoding
  2. Great Britain (GB): 4 institutions - no Wikidata, no geocoding
  3. Belgium (BE): 7 institutions - Geocoded but no Wikidata
  4. United States (US): 7 institutions - Geocoded but no Wikidata
  5. Luxembourg (LU): 1 institution - Geocoded but no Wikidata

📁 Files Generated

Output Location: /data/instances/all/

  1. globalglam-20251111.yaml (24 MB)

    • Complete unified dataset
    • 13,502 unique institutions
    • Provenance tracking for each record
  2. ENRICHMENT_CANDIDATES.yaml (2.8 MB)

    • 13,461 institutions needing enrichment
    • Sorted by priority (4 = most urgent, 1 = least urgent)
    • Detailed field-level gap analysis
  3. UNIFICATION_REPORT.md (11 KB)

    • Comprehensive statistics by country and source
    • Top 50 enrichment candidates
    • Duplicate detection results
  4. DATASET_STATISTICS.yaml (3 KB)

    • Machine-readable metrics
    • Country-by-country breakdown
    • Quality indicators

🔍 Data Sources Merged

| Source | Count | Wikidata % | Geocode % | Notes |
|--------|-------|------------|-----------|-------|
| Global Merged | 13,396 | 55.6% | 100.0% | Base dataset from previous work |
| Japan | 12,065 | 0.0% | 0.0% | Largest single-country dataset |
| Chile | 90 | 78.9% | 86.7% | Best quality - enriched in Batch 19 |
| Brazil | 115 | 6.1% | 0.0% | Batch 6 enriched |
| Mexico | 117 | 0.0% | 49.6% | Geocoded only |
| Libya | 54 | 14.8% | 0.0% | Needs geocoding |
| Tunisia | 69 | 1.4% | 13.0% | Minimal data |
| Vietnam | 21 | 38.1% | 0.0% | Needs geocoding |
| Algeria | 19 | 5.3% | 0.0% | Minimal data |
| Georgia | 14 | 0.0% | 0.0% | Critical - no enrichment |
| Historical | 5 | 100.0% | 100.0% | Validation dataset |

🎯 Enrichment Priority Matrix

Priority 4 (4 Missing Fields): 855 Institutions

ALL need: Wikidata, Coordinates, Website, Description

Geographic Focus:

  • Brazil: 108 institutions (Museu dos Povos Acreanos, UFAC Repository, etc.)
  • Algeria: 19 institutions (all institutions)
  • Georgia: 14 institutions (all institutions)
  • Libya: 47 institutions (most institutions)

Action: Batch Wikidata query + Nominatim geocoding + website scraping


Priority 3 (3 Missing Fields): 4,875 Institutions

Typical pattern: Missing Wikidata, Coordinates, Description (have website) OR missing Wikidata, Website, Description (have coordinates)

Geographic Focus:

  • Japan: Largest share of this tier, drawn from the 12,065 Japanese institutions (missing Wikidata)
  • Netherlands: 430 institutions (have geocoding, need Wikidata)
  • Mexico: 170 institutions (partial data)

Action: Focus on Wikidata enrichment via SPARQL


Priority 2 (2 Missing Fields): 665 Institutions

Typical pattern: Missing Wikidata + one other field

Action: Targeted enrichment for specific gaps


Priority 1 (1 Missing Field): 7,042 Institutions

Typical pattern: Only missing description OR only missing website

Action: Lower priority - can defer
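
The four-level priority score is simply the number of missing fields per record. A minimal sketch of that scoring, assuming hypothetical field names (wikidata_id, coordinates, website, description); the actual record schema in globalglam-20251111.yaml may differ:

```python
from typing import Any, Iterable, List, Mapping

# Hypothetical field names; adjust to the real record schema.
ENRICHMENT_FIELDS = ("wikidata_id", "coordinates", "website", "description")


def enrichment_priority(record: Mapping[str, Any]) -> int:
    """Count how many of the four tracked fields are missing or empty."""
    return sum(1 for field in ENRICHMENT_FIELDS if not record.get(field))


def sort_candidates(records: Iterable[Mapping[str, Any]]) -> List[Mapping[str, Any]]:
    """Return records needing enrichment, most urgent (priority 4) first."""
    candidates = [r for r in records if enrichment_priority(r) > 0]
    return sorted(candidates, key=enrichment_priority, reverse=True)


# Illustrative records only, not taken from the dataset.
sample = [
    {"id": "example-1", "wikidata_id": "Q_PLACEHOLDER", "coordinates": (0.0, 0.0),
     "website": "https://example.org", "description": ""},
    {"id": "example-2"},
]
for record in sort_candidates(sample):
    print(record["id"], enrichment_priority(record))
# prints "example-2 4", then "example-1 1"
```

Records with a score of 0 are complete and drop out of the candidate list.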


🚀 Next Steps - Global Enrichment Workflow

Phase 1: Critical Countries (0% Wikidata Coverage)

Target: 33 institutions across 5 countries (GE, GB, BE, US, LU)

Workflow:

  1. Create enrichment script: scripts/enrich_critical_countries_batch.py
  2. Query Wikidata SPARQL endpoint by country + institution type
  3. Fuzzy match institution names (threshold > 0.85)
  4. Geocode missing coordinates via Nominatim
  5. Validate and update records
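
Step 3 can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not the matching code used for earlier batches: it uses difflib.SequenceMatcher as the similarity measure, and the normalize and best_match helpers and the candidate IDs are hypothetical.

```python
import unicodedata
from difflib import SequenceMatcher
from typing import Mapping, Optional, Tuple


def normalize(name: str) -> str:
    """Casefold, strip accents, and collapse whitespace before comparing."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.casefold().split())


def best_match(
    name: str,
    candidates: Mapping[str, str],
    threshold: float = 0.85,
) -> Optional[Tuple[str, float]]:
    """Return (candidate_id, score) for the closest label scoring above threshold."""
    target = normalize(name)
    best: Optional[Tuple[str, float]] = None
    for candidate_id, label in candidates.items():
        score = SequenceMatcher(None, target, normalize(label)).ratio()
        if score > threshold and (best is None or score > best[1]):
            best = (candidate_id, score)
    return best


# Placeholder IDs and labels for illustration; real candidates would come
# from the SPARQL query in step 2.
candidates = {
    "Q_PLACEHOLDER_1": "Museo Nacional de Historia",
    "Q_PLACEHOLDER_2": "Biblioteca Central",
}
print(best_match("MUSEO NACIONAL DE HISTORIA", candidates))
# prints: ('Q_PLACEHOLDER_1', 1.0)
```

Matches that score just above the threshold are the ones worth routing to manual validation in step 5.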

Expected Outcome: Bring all 5 countries to 50%+ Wikidata coverage


Phase 2: North Africa (Tunisia, Algeria, Libya)

Target: 112 institutions with at most 16% Wikidata coverage

Challenges:

  • Limited Wikidata entries for North African institutions
  • Multilingual names (Arabic/French/English)
  • Missing coordinates

Workflow:

  1. Wikidata enrichment with Arabic name variants
  2. Batch geocoding for all institutions
  3. Cross-reference with UNESCO heritage sites
  4. Manual validation of fuzzy matches
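
For the batch geocoding step, a rough sketch against the public Nominatim search endpoint is shown below. The helper names and the one-second pause are assumptions for illustration; the public instance expects an identifying User-Agent and at most one request per second, so a self-hosted instance may be preferable for large batches.

```python
import json
import time
import urllib.request
from typing import Optional, Tuple
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(name: str, country_code: str) -> str:
    """Build a Nominatim search URL restricted to one country."""
    params = {
        "q": name,
        "countrycodes": country_code.lower(),
        "format": "jsonv2",
        "limit": 1,
    }
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


def geocode(name: str, country_code: str, user_agent: str) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) for the top hit, or None when nothing is found."""
    request = urllib.request.Request(
        build_search_url(name, country_code),
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        results = json.load(response)
    time.sleep(1.0)  # stay within one request per second on the public instance
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])


print(build_search_url("National Museum", "LY"))
```

Restricting by country code reduces false hits for generic names; trying the Arabic, French, and English name variants in turn is one way to handle the multilingual-name challenge listed above.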

Expected Outcome: 40%+ Wikidata coverage, 80%+ geocoding


Phase 3: Latin America (Brazil, Mexico)

Target: 438 institutions (212 BR + 226 MX)

Current State:

  • Brazil: 13.7% Wikidata, 45.8% geocoded
  • Mexico: 15.0% Wikidata, 73.9% geocoded

Workflow:

  1. Reuse Chile enrichment scripts (reached 78.9% Wikidata coverage on the Chile dataset)
  2. Batch SPARQL queries for Brazilian/Mexican institutions
  3. Enhance geocoding for Brazil (currently 45.8%)
  4. Website crawling for missing descriptions

Expected Outcome:

  • Brazil → 50%+ Wikidata, 80%+ geocoding
  • Mexico → 50%+ Wikidata, 90%+ geocoding

Phase 4: Netherlands Deep Enrichment

Target: 622 institutions, currently 31.0% Wikidata

Advantages:

  • Already 99.8% geocoded
  • Rich metadata available (ISIL codes, KvK numbers)
  • Many institutions have websites

Workflow:

  1. Cross-reference with Dutch ISIL registry (TIER_1 data)
  2. Query Wikidata using ISIL codes as identifiers
  3. Crawl institutional websites for descriptions
  4. Leverage existing digital platform metadata
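
Step 2 can use the Wikidata ISIL property (P791) for exact identifier lookups instead of name matching. The sketch below only builds the query text; the function name is hypothetical and the ISIL values in the example are placeholders, not real codes. The resulting string would be sent to the Wikidata Query Service endpoint noted in the constant.

```python
from typing import Iterable

# Wikidata Query Service endpoint; the query string below would be sent here.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_isil_query(isil_codes: Iterable[str]) -> str:
    """Build a SPARQL query that finds Wikidata items by ISIL code (P791)."""
    codes = sorted(set(isil_codes))
    if not codes:
        raise ValueError("at least one ISIL code is required")
    for code in codes:
        # Reject characters that would break out of the quoted literal.
        if '"' in code or "\\" in code:
            raise ValueError(f"unexpected character in ISIL code: {code!r}")
    values = " ".join(f'"{code}"' for code in codes)
    return (
        "SELECT ?item ?itemLabel ?isil WHERE {\n"
        f"  VALUES ?isil {{ {values} }}\n"
        "  ?item wdt:P791 ?isil .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }\n'
        "}"
    )


# Placeholder codes for illustration only.
print(build_isil_query(["NL-Placeholder1", "NL-Placeholder2"]))
```

Batching codes into one VALUES clause keeps the number of requests low; how many codes fit comfortably in a single query is something to determine empirically.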

Expected Outcome: 70%+ Wikidata coverage (at least 436 of 622 institutions)


Phase 5: Japan Mass Enrichment

Target: 12,065 institutions (89.4% of total dataset!)

Current State: 0% Wikidata from local dataset, but global merge shows 58.8%

Analysis: Japan data appears split between:

  • Local Japanese dataset (12,065 records, 0% enriched)
  • Global dataset (includes ~7,091 Japanese institutions with Wikidata)

Workflow:

  1. Investigate duplicate detection logic (why were 12,461 duplicates removed?)
  2. Verify Japanese institution deduplication by name + coordinates
  3. Run batch Wikidata enrichment for remaining institutions
  4. Consider Japanese-language Wikidata queries
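
Step 2 (checking deduplication by name + coordinates) could look roughly like the sketch below. It groups records by normalized name and flags pairs that lie close together. The field names (id, name, lat, lon) and the 200 m radius are assumptions for illustration, not values taken from the unification script.

```python
import math
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two WGS84 points."""
    radius_m = 6_371_000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def likely_duplicates(records: List[dict], max_distance_m: float = 200.0) -> List[Tuple[str, str]]:
    """Return ID pairs that share a normalized name and lie within max_distance_m."""
    by_name: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        # Records without coordinates cannot be checked this way and are skipped.
        if record.get("lat") is not None and record.get("lon") is not None:
            by_name[record["name"].strip().casefold()].append(record)

    pairs: List[Tuple[str, str]] = []
    for group in by_name.values():
        for a, b in combinations(group, 2):
            if haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_distance_m:
                pairs.append((a["id"], b["id"]))
    return pairs
```

Grouping by name first avoids comparing every pair among 12,065 records; only records that already share a name are compared by distance.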

Expected Outcome: Hold the current 58.8% coverage as a floor, then raise it to 70%+


📋 Enrichment Scripts to Create

1. scripts/enrich_critical_countries_batch.py

Purpose: Enrich Georgia, Great Britain, Belgium, US, Luxembourg institutions
Target: 33 institutions (0% → 50%+ Wikidata)

2. scripts/enrich_north_africa_batch.py

Purpose: Enrich Tunisia, Algeria, Libya institutions
Target: 112 institutions (≤16% → 40%+ Wikidata)

3. scripts/enrich_brazil_comprehensive.py

Purpose: Full Brazil enrichment (Wikidata + geocoding + websites)
Target: 212 institutions (13.7% → 50%+ Wikidata, 45.8% → 80%+ geocoding)

4. scripts/enrich_mexico_comprehensive.py

Purpose: Mexico Wikidata enrichment (geocoding already good)
Target: 226 institutions (15.0% → 50%+ Wikidata)

5. scripts/enrich_netherlands_isil.py

Purpose: Netherlands enrichment using ISIL codes
Target: 622 institutions (31.0% → 70%+ Wikidata)

6. scripts/enrich_japan_mass.py

Purpose: Japan mass enrichment + deduplication analysis
Target: 12,065 institutions (maintain/improve 58.8% coverage)


🏆 Success Criteria

Minimum Viable Dataset (MVP)

  • Total institutions: 13,000+ (ACHIEVED: 13,502)
  • Wikidata coverage: 50%+ (ACHIEVED: 55.7%)
  • Geocoding coverage: 50%+ (ACHIEVED: 60.6%)
  • All countries: 30%+ Wikidata (NOT MET: 5 countries at 0%; MX, BR, LY, TN, and DZ are also below 30%)

Target Goals (Post-Enrichment)

  • Total institutions: 15,000+ (add more countries)
  • Wikidata coverage: 70%+ globally
  • Geocoding coverage: 80%+ globally
  • All countries: 50%+ Wikidata minimum
  • Description coverage: 80%+ (currently 3.5%)

🔧 Technical Notes

Deduplication Strategy

  • 12,461 duplicates removed (48% duplicate rate!)
  • Prioritized records with Wikidata Q-numbers
  • Kept most complete version when duplicates found
  • ID-based deduplication (exact match on id field)

Investigation Needed: Why such a high duplicate rate?

  • Likely overlap between "global" dataset and country-specific datasets
  • Japan institutions may be duplicated between sources
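
The strategy listed above (exact match on id, prefer records with a Wikidata Q-number, then the most complete record) can be sketched as follows. This is an illustration of the described rules, not the code in scripts/unify_all_datasets.py; apart from id, the field names are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Mapping, Tuple

# Hypothetical field names used to measure completeness.
TRACKED_FIELDS = ("wikidata_id", "coordinates", "website", "description")


def rank(record: Mapping[str, Any]) -> Tuple[bool, int]:
    """Rank by (has a Wikidata ID, number of filled tracked fields)."""
    filled = sum(1 for field in TRACKED_FIELDS if record.get(field))
    return bool(record.get("wikidata_id")), filled


def deduplicate(records: Iterable[Mapping[str, Any]]) -> List[Mapping[str, Any]]:
    """Keep one record per id, preferring the highest-ranked version."""
    best: Dict[str, Mapping[str, Any]] = {}
    for record in records:
        key = record["id"]
        if key not in best or rank(record) > rank(best[key]):
            best[key] = record
    return list(best.values())
```

Because the key is an exact id match, records for the same institution that carry different IDs across sources would survive this pass, which is one thing to check when investigating the duplicate rate.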

Data Quality Issues

  1. Description Completeness: 96.5% of records missing/incomplete descriptions

    • Most urgent data quality issue
    • Affects usability and discoverability
    • Can be addressed via website crawling
  2. Coordinate Coverage: 39.4% missing coordinates

    • Libya: 100% missing (50 institutions)
    • Algeria: 100% missing (19 institutions)
    • Vietnam: 100% missing (21 institutions)
    • Georgia: 100% missing (14 institutions)
    • Brazil: 54.2% missing (115 institutions)
  3. Website URLs: 15.4% missing

    • Lower priority (institutions may not have websites)
    • Focus on institutional websites vs. social media

📚 Chile Success Story - Benchmark for Quality

Chile Enrichment Results (Completed in Batch 19):

  • Total: 90 institutions
  • Wikidata: 71/90 (78.9%) - EXCEEDS 70% target by 8.9 points
  • Geocoding: 78/90 (86.7%)
  • Method: Iterative batch enrichment with fuzzy matching
  • Scripts: enrich_chile_batch[1-19].py

Key Success Factors:

  1. Iterative approach (19 batches, gradual refinement)
  2. Fuzzy matching threshold optimization (0.85+)
  3. Manual validation of uncertain matches
  4. Parent organization fallback (when direct match fails)

Replication Strategy: Use Chile's approach as template for other countries


🎉 Achievements

Data Integration

  • Unified 11 separate datasets into single comprehensive file
  • Merged 25,963 raw records → 13,502 unique institutions (48.0% deduplication)
  • Covered 18 countries across 4 continents
  • Preserved provenance tracking for all records

Quality Metrics

  • Exceeded 50% Wikidata coverage globally (55.7%)
  • Exceeded 50% geocoding coverage globally (60.6%)
  • Generated comprehensive enrichment candidates list (13,461 records)
  • Automated priority scoring (4-level system)

Documentation

  • Created detailed unification report (UNIFICATION_REPORT.md)
  • Machine-readable statistics (DATASET_STATISTICS.yaml)
  • Enrichment roadmap with a 6-phase plan (Phases 0-5)
  • Country-by-country breakdown with quality indicators


💡 Recommendations

Immediate Priorities (This Week)

  1. Enrich Critical Countries (GE, GB, BE, US, LU) - 33 institutions, 0% coverage
  2. Fix North Africa Geocoding (DZ, LY, TN) - 112 institutions, 0% geocoded in DZ and LY, 13.0% in TN
  3. Boost Brazil Coverage - 212 institutions, only 13.7% Wikidata

Short-term Goals (This Month)

  1. Netherlands Deep Dive - 622 institutions, leverage ISIL codes
  2. Mexico Enhancement - 226 institutions, build on existing geocoding
  3. Japan Deduplication Analysis - Investigate high duplicate rate

Long-term Vision (Next Quarter)

  1. Add New Countries - Target 25 countries total
  2. Semantic Web Integration - Generate RDF/Turtle exports
  3. API Development - Create SPARQL endpoint for querying
  4. Collection-Level Enrichment - Extract collection metadata from websites

📊 Progress Tracking

Overall Progress:

  • Phase 0: Dataset Unification (COMPLETE)
  • Phase 1: Critical Countries Enrichment (READY TO START)
  • Phase 2: North Africa Enrichment (READY TO START)
  • 📋 Phase 3: Latin America Enrichment (PLANNED)
  • 📋 Phase 4: Netherlands Enrichment (PLANNED)
  • 📋 Phase 5: Japan Mass Enrichment (PLANNED)

Files Ready for Use:

  • globalglam-20251111.yaml - Master dataset
  • ENRICHMENT_CANDIDATES.yaml - Prioritized enrichment list
  • UNIFICATION_REPORT.md - Detailed statistics
  • DATASET_STATISTICS.yaml - Machine-readable metrics

Scripts to Create:

  • 📝 enrich_critical_countries_batch.py
  • 📝 enrich_north_africa_batch.py
  • 📝 enrich_brazil_comprehensive.py
  • 📝 enrich_mexico_comprehensive.py
  • 📝 enrich_netherlands_isil.py
  • 📝 enrich_japan_mass.py

Status: READY FOR GLOBAL ENRICHMENT WORKFLOW

Next Command: Create first enrichment script for critical countries (GE, GB, BE, US, LU)