# GLAM Dataset Unification Report

**Generated**: 2025-11-20 19:29:21 UTC

## Executive Summary

- **Total Institutions**: 13,500
- **Countries Covered**: 18
- **Wikidata Coverage**: 7,542/13,500 (55.9%)
- **Geocoding Coverage**: 8,178/13,500 (60.6%)
- **Duplicates Removed**: 12,461

## Data Sources

### Algeria

- Total: 19 institutions
- Wikidata: 13 (68.4%)
- Geocoded: 0 (0.0%)

### Brazil

- Total: 115 institutions
- Wikidata: 7 (6.1%)
- Geocoded: 0 (0.0%)

### Chile

- Total: 90 institutions
- Wikidata: 71 (78.9%)
- Geocoded: 78 (86.7%)

### Georgia

- Total: 14 institutions
- Wikidata: 11 (78.6%)
- Geocoded: 0 (0.0%)

### Global

- Total: 13,396 institutions
- Wikidata: 7,448 (55.6%)
- Geocoded: 13,393 (100.0%)

### Historical

- Total: 5 institutions
- Wikidata: 5 (100.0%)
- Geocoded: 5 (100.0%)

### Japan

- Total: 12,065 institutions
- Wikidata: 0 (0.0%)
- Geocoded: 0 (0.0%)

### Libya

- Total: 50 institutions
- Wikidata: 50 (100.0%)
- Geocoded: 0 (0.0%)

### Mexico

- Total: 117 institutions
- Wikidata: 10 (8.5%)
- Geocoded: 58 (49.6%)

### Tunisia

- Total: 69 institutions
- Wikidata: 2 (2.9%)
- Geocoded: 12 (17.4%)

### Vietnam

- Total: 21 institutions
- Wikidata: 8 (38.1%)
- Geocoded: 0 (0.0%)

## Coverage by Country

| Country | Total | Wikidata | Wikidata % | Geocoded | Geocoded % |
|---------|-------|----------|------------|----------|------------|
| JP | 12,065 | 7,091 | 58.8% | 7,091 | 58.8% |
| NL | 622 | 193 | 31.0% | 621 | 99.8% |
| MX | 226 | 44 | 19.5% | 167 | 73.9% |
| BR | 212 | 29 | 13.7% | 97 | 45.8% |
| CL | 180 | 97 | 53.9% | 168 | 93.3% |
| TN | 69 | 2 | 2.9% | 12 | 17.4% |
| LY | 47 | 47 | 100.0% | 0 | 0.0% |
| VN | 21 | 8 | 38.1% | 0 | 0.0% |
| DZ | 19 | 13 | 68.4% | 0 | 0.0% |
| GE | 14 | 11 | 78.6% | 0 | 0.0% |
| BE | 7 | 0 | 0.0% | 7 | 100.0% |
| US | 7 | 0 | 0.0% | 7 | 100.0% |
| GB | 3 | 3 | 100.0% | 0 | 0.0% |
| IT | 3 | 1 | 33.3% | 3 | 100.0% |
| AR | 2 | 1 | 50.0% | 2 | 100.0% |
| RU | 1 | 1 | 100.0% | 1 | 100.0% |
| DK | 1 | 1 | 100.0% | 1 | 100.0% |
| LU | 1 | 0 | 0.0% | 1 | 100.0% |

## Enrichment Needs

Total institutions requiring enrichment: **13,458** (99.7% of dataset)

### By Enrichment Type

- **Need Wikidata**: 5,958 (44.1%)
- **Need Coordinates**: 5,322 (39.4%)
- **Need Website**: 2,085 (15.4%)
- **Need Description**: 13,005 (96.3%)

### Priority Distribution (by number of missing fields)

- **Priority 4** (4 missing fields): 854 institutions
- **Priority 3** (3 missing fields): 4,834 institutions
- **Priority 2** (2 missing fields): 682 institutions
- **Priority 1** (1 missing field): 7,088 institutions

## Top 50 Enrichment Candidates (Highest Priority)

| Name | Country | Type | Missing Fields |
|------|---------|------|----------------|
| Museu dos Povos Acreanos | BR | MUSEUM | wikidata, coordinates, website, description |
| UFAC Repository | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Serra da Barriga | BR | MIXED | wikidata, coordinates, website, description |
| Parque Memorial Quilombo dos Palmares | BR | MIXED | wikidata, coordinates, website, description |
| UFAL Natural History Museum | BR | MUSEUM | wikidata, coordinates, website, description |
| SECULT | BR | MIXED | wikidata, coordinates, website, description |
| Museu Sacaca | BR | MUSEUM | wikidata, coordinates, website, description |
| Museu de Arqueologia e Etnologia | BR | MUSEUM | wikidata, coordinates, website, description |
| Teatro Amazonas | BR | MIXED | wikidata, coordinates, website, description |
| Centro Cultural Povos da Amazônia | BR | GALLERY | wikidata, coordinates, website, description |
| Arquivo Público (APEB) | BR | ARCHIVE | wikidata, coordinates, website, description |
| FPC/IPAC | BR | MIXED | wikidata, coordinates, website, description |
| MAM-BA | BR | MIXED | wikidata, coordinates, website, description |
| Centro Dragão do Mar | BR | GALLERY | wikidata, coordinates, website, description |
| Arquivo Público DF | BR | ARCHIVE | wikidata, coordinates, website, description |
| UnB BCE | BR | LIBRARY | wikidata, coordinates, website, description |
| CCBB Brasília | BR | GALLERY | wikidata, coordinates, website, description |
| UFES Digital Libraries | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| State Archives | BR | ARCHIVE | wikidata, coordinates, website, description |
| Cultural Collections | BR | MIXED | wikidata, coordinates, website, description |
| UFG Repositories | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Museu Zoroastro Artiaga | BR | MUSEUM | wikidata, coordinates, website, description |
| Museu Histórico (MHAM) | BR | MUSEUM | wikidata, coordinates, website, description |
| Casa das Minas/Casa de Nagô | BR | MIXED | wikidata, coordinates, website, description |
| MUSEAR/UFMT | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Instituto Histórico | BR | OFFICIAL_INSTITUTION | wikidata, coordinates, website, description |
| Museu Histórico | BR | MUSEUM | wikidata, coordinates, website, description |
| Guarani-Kaiowá Projects | BR | MIXED | wikidata, coordinates, website, description |
| UFMS Repositories | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| UFMG Tainacan Lab | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Inhotim | BR | MIXED | wikidata, coordinates, website, description |
| Ouro Preto System | BR | MIXED | wikidata, coordinates, website, description |
| MM Gerdau | BR | MIXED | wikidata, coordinates, website, description |
| Museu Goeldi | BR | MUSEUM | wikidata, coordinates, website, description |
| Forte do Presépio | BR | MIXED | wikidata, coordinates, website, description |
| UFPA | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Teatro da Paz | BR | MIXED | wikidata, coordinates, website, description |
| Pedra do Ingá | BR | MIXED | wikidata, coordinates, website, description |
| UFPB/UEPB | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Natural History Museum | BR | MUSEUM | wikidata, coordinates, website, description |
| Forte Santa Catarina | BR | MIXED | wikidata, coordinates, website, description |
| MON | BR | MIXED | wikidata, coordinates, website, description |
| DEAP Archives | BR | ARCHIVE | wikidata, coordinates, website, description |
| Centro de Memória | BR | MIXED | wikidata, coordinates, website, description |
| UFPR | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Instituto Ricardo Brennand | BR | OFFICIAL_INSTITUTION | wikidata, coordinates, website, description |
| UFPE | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| MEPE/IAHGP | BR | MIXED | wikidata, coordinates, website, description |
| FUMDHAM | BR | MIXED | wikidata, coordinates, website, description |
| Museu do Piauí | BR | MUSEUM | wikidata, coordinates, website, description |

## Deduplication Details

### Duplicates Found

Total duplicate IDs: 12,461

| ID | Sources |
|----|---------|
| ...JP-1000034 | japan, global |
| ...JP-1000036 | japan, global |
| ...JP-1000038 | japan, global |
| ...JP-1000039 | japan, global |
| ...JP-1000044 | japan, global |
| ...JP-1000045 | japan, global |
| ...JP-1000046 | japan, global |
| ...JP-1000047 | japan, global |
| ...JP-1004852 | japan, global |
| ...JP-1005807 | japan, global |
| ...JP-1000051 | japan, global |
| ...JP-1005809 | japan, global |
| ...JP-1000053 | japan, global |
| ...JP-1000059 | japan, global |
| ...JP-1000060 | japan, global |
| ...JP-1000061 | japan, global |
| ...JP-1000054 | japan, global |
| ...JP-1000058 | japan, global |
| ...JP-1000075 | japan, global |
| ...JP-1000141 | japan, global |

*...and 12,441 more duplicates*

## Next Steps

### Immediate Actions

1. **Review Enrichment Candidates**: Check `ENRICHMENT_CANDIDATES.yaml` for institutions needing data
2. **Prioritize Countries**: Focus on the ten countries with the lowest Wikidata coverage:
   - BE: 0.0% coverage (7 institutions)
   - LU: 0.0% coverage (1 institution)
   - US: 0.0% coverage (7 institutions)
   - TN: 2.9% coverage (69 institutions)
   - BR: 13.7% coverage (212 institutions)
   - MX: 19.5% coverage (226 institutions)
   - NL: 31.0% coverage (622 institutions)
   - IT: 33.3% coverage (3 institutions)
   - VN: 38.1% coverage (21 institutions)
   - AR: 50.0% coverage (2 institutions)
3. **Batch Enrichment Workflow**:
   - Run Wikidata enrichment for high-priority candidates
   - Run geocoding for missing coordinates
   - Crawl institutional websites for missing data

### Tools Available

- **Wikidata Enrichment**: `scripts/enrich_global_batch.py`
- **Geocoding**: `scripts/geocode_institutions.py`
- **Website Crawling**: `scripts/crawl_institution_websites.py` (to be created)

## Files Generated

1. **globalglam-20251111.yaml** - Complete unified dataset (13,500 institutions)
2. **ENRICHMENT_CANDIDATES.yaml** - Institutions needing enrichment (13,458 candidates)
3. **UNIFICATION_REPORT.md** - This report

---

**Generated by**: `scripts/unify_all_datasets.py`
**Dataset Version**: 1.0
**Schema Version**: LinkML v0.2.1
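
## Appendix: How the Figures Can Be Reproduced (Illustrative)

The duplicate counts, priority levels, and coverage percentages in this report follow simple rules that can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code in `scripts/unify_all_datasets.py`: the record layout (plain dicts with an `id` key), the four field names, the merge rule (first non-empty value per field wins), and the function names `unify`, `priority`, and `coverage` are all hypothetical.

```python
from collections import defaultdict

# Assumed field names, taken from the "Missing Fields" column of this report.
ENRICHMENT_FIELDS = ("wikidata", "coordinates", "website", "description")


def unify(sources):
    """Merge records that share an ID across sources.

    `sources` maps a source name to a list of record dicts.
    Assumed merge rule: the first non-empty value seen for a field is kept.
    Returns (unified_records, duplicates), where `duplicates` maps each ID
    that appeared more than once to the list of sources that supplied it.
    """
    unified = {}
    seen_in = defaultdict(list)
    for source_name, records in sources.items():
        for record in records:
            seen_in[record["id"]].append(source_name)
            merged = unified.setdefault(record["id"], {})
            for key, value in record.items():
                if value and not merged.get(key):
                    merged[key] = value
    duplicates = {i: names for i, names in seen_in.items() if len(names) > 1}
    return list(unified.values()), duplicates


def missing_fields(record):
    """List the enrichment fields that are absent or empty in a record."""
    return [field for field in ENRICHMENT_FIELDS if not record.get(field)]


def priority(record):
    """Priority equals the number of missing fields (0 means complete)."""
    return len(missing_fields(record))


def coverage(records, field):
    """Return (count, percent) of records with a non-empty `field`."""
    have = sum(1 for record in records if record.get(field))
    percent = 100.0 * have / len(records) if records else 0.0
    return have, round(percent, 1)
```

With these definitions, the Wikidata figure in the Executive Summary corresponds to `round(100 * 7542 / 13500, 1)`, and the enrichment candidates are the records for which `priority(record)` is at least 1.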