GLAM Dataset Unification Report
Generated: 2025-11-20 19:29:21 UTC
Executive Summary
- Total Institutions: 13,500
- Countries Covered: 18
- Wikidata Coverage: 7,542/13,500 (55.9%)
- Geocoding Coverage: 8,178/13,500 (60.6%)
- Duplicates Removed: 12,461
Data Sources
Algeria
- Total: 19 institutions
- Wikidata: 13 (68.4%)
- Geocoded: 0 (0.0%)
Brazil
- Total: 115 institutions
- Wikidata: 7 (6.1%)
- Geocoded: 0 (0.0%)
Chile
- Total: 90 institutions
- Wikidata: 71 (78.9%)
- Geocoded: 78 (86.7%)
Georgia
- Total: 14 institutions
- Wikidata: 11 (78.6%)
- Geocoded: 0 (0.0%)
Global
- Total: 13,396 institutions
- Wikidata: 7,448 (55.6%)
- Geocoded: 13,393 (100.0%)
Historical
- Total: 5 institutions
- Wikidata: 5 (100.0%)
- Geocoded: 5 (100.0%)
Japan
- Total: 12,065 institutions
- Wikidata: 0 (0.0%)
- Geocoded: 0 (0.0%)
Libya
- Total: 50 institutions
- Wikidata: 50 (100.0%)
- Geocoded: 0 (0.0%)
Mexico
- Total: 117 institutions
- Wikidata: 10 (8.5%)
- Geocoded: 58 (49.6%)
Tunisia
- Total: 69 institutions
- Wikidata: 2 (2.9%)
- Geocoded: 12 (17.4%)
Vietnam
- Total: 21 institutions
- Wikidata: 8 (38.1%)
- Geocoded: 0 (0.0%)
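As a consistency check, the per-source totals above can be reconciled with the headline figures in the Executive Summary. The sketch below is not part of the pipeline; it assumes each duplicate ID accounts for exactly one dropped record, which the report does not state explicitly.

```python
# Per-source totals as listed under "Data Sources".
source_totals = {
    "Algeria": 19, "Brazil": 115, "Chile": 90, "Georgia": 14,
    "Global": 13_396, "Historical": 5, "Japan": 12_065, "Libya": 50,
    "Mexico": 117, "Tunisia": 69, "Vietnam": 21,
}

# "Duplicates Removed" from the Executive Summary.
duplicates_removed = 12_461

# Assumption: one record is dropped per duplicate ID.
raw_records = sum(source_totals.values())
unified_total = raw_records - duplicates_removed

print(f"raw records: {raw_records:,}")      # raw records: 25,961
print(f"unified total: {unified_total:,}")  # unified total: 13,500
```

Under that assumption, the 25,961 input records minus 12,461 duplicates give the 13,500 institutions reported as the dataset total.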
Coverage by Country
| Country | Total | Wikidata | Wikidata % | Geocoded | Geocoded % |
|---|---|---|---|---|---|
| JP | 12,065 | 7,091 | 58.8% | 7,091 | 58.8% |
| NL | 622 | 193 | 31.0% | 621 | 99.8% |
| MX | 226 | 44 | 19.5% | 167 | 73.9% |
| BR | 212 | 29 | 13.7% | 97 | 45.8% |
| CL | 180 | 97 | 53.9% | 168 | 93.3% |
| TN | 69 | 2 | 2.9% | 12 | 17.4% |
| LY | 47 | 47 | 100.0% | 0 | 0.0% |
| VN | 21 | 8 | 38.1% | 0 | 0.0% |
| DZ | 19 | 13 | 68.4% | 0 | 0.0% |
| GE | 14 | 11 | 78.6% | 0 | 0.0% |
| BE | 7 | 0 | 0.0% | 7 | 100.0% |
| US | 7 | 0 | 0.0% | 7 | 100.0% |
| GB | 3 | 3 | 100.0% | 0 | 0.0% |
| IT | 3 | 1 | 33.3% | 3 | 100.0% |
| AR | 2 | 1 | 50.0% | 2 | 100.0% |
| RU | 1 | 1 | 100.0% | 1 | 100.0% |
| DK | 1 | 1 | 100.0% | 1 | 100.0% |
| LU | 1 | 0 | 0.0% | 1 | 100.0% |
Enrichment Needs
Total institutions requiring enrichment: 13,458 (99.7% of dataset)
By Enrichment Type
- Need Wikidata: 5,958 (44.1%)
- Need Coordinates: 5,322 (39.4%)
- Need Website: 2,085 (15.4%)
- Need Description: 13,005 (96.3%)
Priority Distribution (by number of missing fields)
- Priority 4 (4 missing fields): 854 institutions
- Priority 3 (3 missing fields): 4,834 institutions
- Priority 2 (2 missing fields): 682 institutions
- Priority 1 (1 missing field): 7,088 institutions
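The priority levels above are defined as the count of missing fields per institution. A minimal sketch of that rule follows; the record layout (plain dictionaries keyed by field name) and the helper names `missing_fields` and `priority` are hypothetical and may not match what `scripts/unify_all_datasets.py` actually does.

```python
from collections import Counter

# The four enrichable fields named in the report's "Missing Fields" column.
ENRICHABLE_FIELDS = ("wikidata", "coordinates", "website", "description")


def missing_fields(record):
    """Return the enrichable fields that are absent or empty in a record."""
    return [field for field in ENRICHABLE_FIELDS if not record.get(field)]


def priority(record):
    """Priority equals the number of missing fields (0 means complete)."""
    return len(missing_fields(record))


# Hypothetical records; the real schema may name these keys differently.
records = [
    {"name": "Example A"},
    {"name": "Example B", "wikidata": "Q0", "coordinates": (0.0, 0.0)},
    {"name": "Example C", "wikidata": "Q0", "coordinates": (0.0, 0.0),
     "website": "https://example.org", "description": "Complete record"},
]

# Tally institutions per priority level, as in the distribution above.
distribution = Counter(priority(r) for r in records)

# Only records with at least one missing field are enrichment candidates.
candidates = [r["name"] for r in records if priority(r) > 0]
```

With this rule, the four priority counts sum to the number of enrichment candidates: 854 + 4,834 + 682 + 7,088 = 13,458.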
Top 50 Enrichment Candidates (Highest Priority)
| Name | Country | Type | Missing Fields |
|---|---|---|---|
| Museu dos Povos Acreanos | BR | MUSEUM | wikidata, coordinates, website, description |
| UFAC Repository | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Serra da Barriga | BR | MIXED | wikidata, coordinates, website, description |
| Parque Memorial Quilombo dos Palmares | BR | MIXED | wikidata, coordinates, website, description |
| UFAL Natural History Museum | BR | MUSEUM | wikidata, coordinates, website, description |
| SECULT | BR | MIXED | wikidata, coordinates, website, description |
| Museu Sacaca | BR | MUSEUM | wikidata, coordinates, website, description |
| Museu de Arqueologia e Etnologia | BR | MUSEUM | wikidata, coordinates, website, description |
| Teatro Amazonas | BR | MIXED | wikidata, coordinates, website, description |
| Centro Cultural Povos da Amazônia | BR | GALLERY | wikidata, coordinates, website, description |
| Arquivo Público (APEB) | BR | ARCHIVE | wikidata, coordinates, website, description |
| FPC/IPAC | BR | MIXED | wikidata, coordinates, website, description |
| MAM-BA | BR | MIXED | wikidata, coordinates, website, description |
| Centro Dragão do Mar | BR | GALLERY | wikidata, coordinates, website, description |
| Arquivo Público DF | BR | ARCHIVE | wikidata, coordinates, website, description |
| UnB BCE | BR | LIBRARY | wikidata, coordinates, website, description |
| CCBB Brasília | BR | GALLERY | wikidata, coordinates, website, description |
| UFES Digital Libraries | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| State Archives | BR | ARCHIVE | wikidata, coordinates, website, description |
| Cultural Collections | BR | MIXED | wikidata, coordinates, website, description |
| UFG Repositories | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Museu Zoroastro Artiaga | BR | MUSEUM | wikidata, coordinates, website, description |
| Museu Histórico (MHAM) | BR | MUSEUM | wikidata, coordinates, website, description |
| Casa das Minas/Casa de Nagô | BR | MIXED | wikidata, coordinates, website, description |
| MUSEAR/UFMT | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Instituto Histórico | BR | OFFICIAL_INSTITUTION | wikidata, coordinates, website, description |
| Museu Histórico | BR | MUSEUM | wikidata, coordinates, website, description |
| Guarani-Kaiowá Projects | BR | MIXED | wikidata, coordinates, website, description |
| UFMS Repositories | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| UFMG Tainacan Lab | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Inhotim | BR | MIXED | wikidata, coordinates, website, description |
| Ouro Preto System | BR | MIXED | wikidata, coordinates, website, description |
| MM Gerdau | BR | MIXED | wikidata, coordinates, website, description |
| Museu Goeldi | BR | MUSEUM | wikidata, coordinates, website, description |
| Forte do Presépio | BR | MIXED | wikidata, coordinates, website, description |
| UFPA | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Teatro da Paz | BR | MIXED | wikidata, coordinates, website, description |
| Pedra do Ingá | BR | MIXED | wikidata, coordinates, website, description |
| UFPB/UEPB | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Natural History Museum | BR | MUSEUM | wikidata, coordinates, website, description |
| Forte Santa Catarina | BR | MIXED | wikidata, coordinates, website, description |
| MON | BR | MIXED | wikidata, coordinates, website, description |
| DEAP Archives | BR | ARCHIVE | wikidata, coordinates, website, description |
| Centro de Memória | BR | MIXED | wikidata, coordinates, website, description |
| UFPR | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Instituto Ricardo Brennand | BR | OFFICIAL_INSTITUTION | wikidata, coordinates, website, description |
| UFPE | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| MEPE/IAHGP | BR | MIXED | wikidata, coordinates, website, description |
| FUMDHAM | BR | MIXED | wikidata, coordinates, website, description |
| Museu do Piauí | BR | MUSEUM | wikidata, coordinates, website, description |
Deduplication Details
Duplicates Found
Total duplicate IDs: 12,461
| ID | Sources |
|---|---|
| ...JP-1000034 | japan, global |
| ...JP-1000036 | japan, global |
| ...JP-1000038 | japan, global |
| ...JP-1000039 | japan, global |
| ...JP-1000044 | japan, global |
| ...JP-1000045 | japan, global |
| ...JP-1000046 | japan, global |
| ...JP-1000047 | japan, global |
| ...JP-1004852 | japan, global |
| ...JP-1005807 | japan, global |
| ...JP-1000051 | japan, global |
| ...JP-1005809 | japan, global |
| ...JP-1000053 | japan, global |
| ...JP-1000059 | japan, global |
| ...JP-1000060 | japan, global |
| ...JP-1000061 | japan, global |
| ...JP-1000054 | japan, global |
| ...JP-1000058 | japan, global |
| ...JP-1000075 | japan, global |
| ...JP-1000141 | japan, global |
...and 12,441 more duplicates
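The table above pairs each duplicate ID with the sources it appeared in. One way such ID-based deduplication could work is sketched below. The `unify` function, the record layout, and the gap-filling merge rule are all assumptions for illustration; the report does not describe how `scripts/unify_all_datasets.py` resolves duplicates.

```python
def unify(sources):
    """Merge records from several sources, keyed by their 'id' field.

    sources: dict mapping a source name to a list of record dicts.
    Returns (unified_records, duplicates), where duplicates maps each
    ID seen in more than one source to the list of those sources.
    """
    unified = {}
    seen_in = {}
    for source_name, records in sources.items():
        for record in records:
            record_id = record["id"]
            seen_in.setdefault(record_id, []).append(source_name)
            if record_id not in unified:
                # First occurrence wins; copy so inputs stay untouched.
                unified[record_id] = dict(record)
            else:
                # Assumed merge rule: later sources only fill empty fields.
                merged = unified[record_id]
                for key, value in record.items():
                    if merged.get(key) in (None, ""):
                        merged[key] = value
    duplicates = {rid: names for rid, names in seen_in.items() if len(names) > 1}
    return list(unified.values()), duplicates


# Hypothetical input with placeholder IDs (not real dataset identifiers).
sources = {
    "japan": [{"id": "X-1", "name": "A"}, {"id": "X-2", "name": "B"}],
    "global": [{"id": "X-1", "name": "A", "wikidata": "Q0"},
               {"id": "X-3", "name": "C"}],
}
unified_records, duplicates = unify(sources)
```

A gap-filling merge of this kind would be one way to explain how the Japan source reports 0 Wikidata links while the unified JP row reports 7,091, though the report does not confirm the actual merge rule.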
Next Steps
Immediate Actions
- Review Enrichment Candidates: Check ENRICHMENT_CANDIDATES.yaml for institutions needing data
- Prioritize Countries: Focus on countries with low Wikidata coverage:
- BE: 0.0% coverage (7 institutions)
- LU: 0.0% coverage (1 institution)
- US: 0.0% coverage (7 institutions)
- TN: 2.9% coverage (69 institutions)
- BR: 13.7% coverage (212 institutions)
- MX: 19.5% coverage (226 institutions)
- NL: 31.0% coverage (622 institutions)
- IT: 33.3% coverage (3 institutions)
- VN: 38.1% coverage (21 institutions)
- AR: 50.0% coverage (2 institutions)
- Batch Enrichment Workflow:
- Run Wikidata enrichment for high-priority candidates
- Run geocoding for missing coordinates
- Crawl institutional websites for missing data
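The country prioritization above can be reproduced from the "Coverage by Country" table. The sketch below ranks countries by ascending Wikidata coverage and keeps the ten lowest; the function name `lowest_coverage` and the tie-break on country code are assumptions, chosen because they happen to reproduce the order listed in the report.

```python
# (total institutions, institutions with Wikidata) per country,
# copied from the "Coverage by Country" table.
coverage = {
    "JP": (12_065, 7_091), "NL": (622, 193), "MX": (226, 44),
    "BR": (212, 29), "CL": (180, 97), "TN": (69, 2), "LY": (47, 47),
    "VN": (21, 8), "DZ": (19, 13), "GE": (14, 11), "BE": (7, 0),
    "US": (7, 0), "GB": (3, 3), "IT": (3, 1), "AR": (2, 1),
    "RU": (1, 1), "DK": (1, 1), "LU": (1, 0),
}


def lowest_coverage(rows, limit=10):
    """Return (country, percent) pairs with the lowest Wikidata coverage.

    Ties are broken alphabetically by country code (an assumption).
    """
    ranked = sorted(
        rows.items(),
        key=lambda item: (item[1][1] / item[1][0], item[0]),
    )
    return [
        (code, round(100 * linked / total, 1))
        for code, (total, linked) in ranked[:limit]
    ]


for code, percent in lowest_coverage(coverage):
    print(f"{code}: {percent}% coverage ({coverage[code][0]} institutions)")
```

Ranking by percentage alone puts very small countries (BE, LU, US) first; weighting by the number of unlinked institutions would instead surface JP and NL, which may be a more useful order for batch enrichment.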
Tools Available
- Wikidata Enrichment: scripts/enrich_global_batch.py
- Geocoding: scripts/geocode_institutions.py
- Website Crawling: scripts/crawl_institution_websites.py (to be created)
Files Generated
- globalglam-20251111.yaml - Complete unified dataset (13,500 institutions)
- ENRICHMENT_CANDIDATES.yaml - Institutions needing enrichment (13,458 candidates)
- UNIFICATION_REPORT.md - This report
Generated by: scripts/unify_all_datasets.py
Dataset Version: 1.0
Schema Version: LinkML v0.2.1