GLAM Dataset Unification Report

Generated: 2025-11-20 19:29:21 UTC

Executive Summary

  • Total Institutions: 13,500
  • Countries Covered: 18
  • Wikidata Coverage: 7,542/13,500 (55.9%)
  • Geocoding Coverage: 8,178/13,500 (60.6%)
  • Duplicates Removed: 12,461

Data Sources

Algeria

  • Total: 19 institutions
  • Wikidata: 13 (68.4%)
  • Geocoded: 0 (0.0%)

Brazil

  • Total: 115 institutions
  • Wikidata: 7 (6.1%)
  • Geocoded: 0 (0.0%)

Chile

  • Total: 90 institutions
  • Wikidata: 71 (78.9%)
  • Geocoded: 78 (86.7%)

Georgia

  • Total: 14 institutions
  • Wikidata: 11 (78.6%)
  • Geocoded: 0 (0.0%)

Global

  • Total: 13,396 institutions
  • Wikidata: 7,448 (55.6%)
  • Geocoded: 13,393 (100.0%)

Historical

  • Total: 5 institutions
  • Wikidata: 5 (100.0%)
  • Geocoded: 5 (100.0%)

Japan

  • Total: 12,065 institutions
  • Wikidata: 0 (0.0%)
  • Geocoded: 0 (0.0%)

Libya

  • Total: 50 institutions
  • Wikidata: 50 (100.0%)
  • Geocoded: 0 (0.0%)

Mexico

  • Total: 117 institutions
  • Wikidata: 10 (8.5%)
  • Geocoded: 58 (49.6%)

Tunisia

  • Total: 69 institutions
  • Wikidata: 2 (2.9%)
  • Geocoded: 12 (17.4%)

Vietnam

  • Total: 21 institutions
  • Wikidata: 8 (38.1%)
  • Geocoded: 0 (0.0%)
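The per-source totals above can be reconciled with the Executive Summary: their sum, minus the duplicates removed, should equal the unified total. The following is a minimal sketch of that check using only figures from this report; the variable names are illustrative and are not taken from scripts/unify_all_datasets.py.

```python
# Per-source institution totals, copied from the Data Sources section.
source_totals = {
    "algeria": 19, "brazil": 115, "chile": 90, "georgia": 14,
    "global": 13_396, "historical": 5, "japan": 12_065, "libya": 50,
    "mexico": 117, "tunisia": 69, "vietnam": 21,
}
duplicates_removed = 12_461  # from the Executive Summary

# Records across all sources before deduplication.
records_before_dedup = sum(source_totals.values())
# Records left once duplicates are dropped.
unified_total = records_before_dedup - duplicates_removed

print(records_before_dedup)  # 25961
print(unified_total)         # 13500
```

The result matches the 13,500 institutions reported in the Executive Summary.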

Coverage by Country

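| Country | Total | Wikidata | Wikidata % | Geocoded | Geocoded % |
|---------|-------|----------|------------|----------|------------|
| JP | 12,065 | 7,091 | 58.8% | 7,091 | 58.8% |
| NL | 622 | 193 | 31.0% | 621 | 99.8% |
| MX | 226 | 44 | 19.5% | 167 | 73.9% |
| BR | 212 | 29 | 13.7% | 97 | 45.8% |
| CL | 180 | 97 | 53.9% | 168 | 93.3% |
| TN | 69 | 2 | 2.9% | 12 | 17.4% |
| LY | 47 | 47 | 100.0% | 0 | 0.0% |
| VN | 21 | 8 | 38.1% | 0 | 0.0% |
| DZ | 19 | 13 | 68.4% | 0 | 0.0% |
| GE | 14 | 11 | 78.6% | 0 | 0.0% |
| BE | 7 | 0 | 0.0% | 7 | 100.0% |
| US | 7 | 0 | 0.0% | 7 | 100.0% |
| GB | 3 | 3 | 100.0% | 0 | 0.0% |
| IT | 3 | 1 | 33.3% | 3 | 100.0% |
| AR | 2 | 1 | 50.0% | 2 | 100.0% |
| RU | 1 | 1 | 100.0% | 1 | 100.0% |
| DK | 1 | 1 | 100.0% | 1 | 100.0% |
| LU | 1 | 0 | 0.0% | 1 | 100.0% |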

Enrichment Needs

Total institutions requiring enrichment: 13,458 (99.7% of dataset)

By Enrichment Type

  • Need Wikidata: 5,958 (44.1%)
  • Need Coordinates: 5,322 (39.4%)
  • Need Website: 2,085 (15.4%)
  • Need Description: 13,005 (96.3%)

Priority Distribution (by number of missing fields)

  • Priority 4 (4 missing fields): 854 institutions
  • Priority 3 (3 missing fields): 4,834 institutions
  • Priority 2 (2 missing fields): 682 institutions
  • Priority 1 (1 missing field): 7,088 institutions
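The priority levels above equal the number of missing enrichment fields per institution. A minimal sketch of that scoring follows, assuming each record is a dictionary keyed by the four field names shown in the "Missing Fields" column. The function names and the sample records are hypothetical and do not come from the actual scripts or dataset.

```python
from collections import Counter

# Fields checked for completeness; names mirror the "Missing Fields" column.
ENRICHMENT_FIELDS = ("wikidata", "coordinates", "website", "description")


def missing_fields(record: dict) -> list[str]:
    """Return the enrichment fields that are absent or empty in a record."""
    return [field for field in ENRICHMENT_FIELDS if not record.get(field)]


def priority(record: dict) -> int:
    """Priority is the count of missing fields (0 means fully enriched)."""
    return len(missing_fields(record))


# Hypothetical placeholder records, not taken from the dataset.
records = [
    {"name": "Example A"},
    {"name": "Example B", "wikidata": "Q0", "website": "https://example.org"},
    {
        "name": "Example C",
        "wikidata": "Q0",
        "coordinates": (1.0, 2.0),
        "website": "https://example.org",
        "description": "Placeholder description",
    },
]

# Count how many records fall into each priority level.
distribution = Counter(priority(r) for r in records)

# Enrichment candidates: anything with at least one missing field,
# highest priority first.
candidates = sorted(
    (r for r in records if priority(r) > 0), key=priority, reverse=True
)

for record in candidates:
    print(record["name"], priority(record), ", ".join(missing_fields(record)))
```

Under this scheme, the four priority counts add up to the total number of institutions requiring enrichment (854 + 4,834 + 682 + 7,088 = 13,458).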

Top 50 Enrichment Candidates (Highest Priority)

| Name | Country | Type | Missing Fields |
|------|---------|------|----------------|
| Museu dos Povos Acreanos | BR | MUSEUM | wikidata, coordinates, website, description |
| UFAC Repository | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Serra da Barriga | BR | MIXED | wikidata, coordinates, website, description |
| Parque Memorial Quilombo dos Palmares | BR | MIXED | wikidata, coordinates, website, description |
| UFAL Natural History Museum | BR | MUSEUM | wikidata, coordinates, website, description |
| SECULT | BR | MIXED | wikidata, coordinates, website, description |
| Museu Sacaca | BR | MUSEUM | wikidata, coordinates, website, description |
| Museu de Arqueologia e Etnologia | BR | MUSEUM | wikidata, coordinates, website, description |
| Teatro Amazonas | BR | MIXED | wikidata, coordinates, website, description |
| Centro Cultural Povos da Amazônia | BR | GALLERY | wikidata, coordinates, website, description |
| Arquivo Público (APEB) | BR | ARCHIVE | wikidata, coordinates, website, description |
| FPC/IPAC | BR | MIXED | wikidata, coordinates, website, description |
| MAM-BA | BR | MIXED | wikidata, coordinates, website, description |
| Centro Dragão do Mar | BR | GALLERY | wikidata, coordinates, website, description |
| Arquivo Público DF | BR | ARCHIVE | wikidata, coordinates, website, description |
| UnB BCE | BR | LIBRARY | wikidata, coordinates, website, description |
| CCBB Brasília | BR | GALLERY | wikidata, coordinates, website, description |
| UFES Digital Libraries | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| State Archives | BR | ARCHIVE | wikidata, coordinates, website, description |
| Cultural Collections | BR | MIXED | wikidata, coordinates, website, description |
| UFG Repositories | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Museu Zoroastro Artiaga | BR | MUSEUM | wikidata, coordinates, website, description |
| Museu Histórico (MHAM) | BR | MUSEUM | wikidata, coordinates, website, description |
| Casa das Minas/Casa de Nagô | BR | MIXED | wikidata, coordinates, website, description |
| MUSEAR/UFMT | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Instituto Histórico | BR | OFFICIAL_INSTITUTION | wikidata, coordinates, website, description |
| Museu Histórico | BR | MUSEUM | wikidata, coordinates, website, description |
| Guarani-Kaiowá Projects | BR | MIXED | wikidata, coordinates, website, description |
| UFMS Repositories | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| UFMG Tainacan Lab | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Inhotim | BR | MIXED | wikidata, coordinates, website, description |
| Ouro Preto System | BR | MIXED | wikidata, coordinates, website, description |
| MM Gerdau | BR | MIXED | wikidata, coordinates, website, description |
| Museu Goeldi | BR | MUSEUM | wikidata, coordinates, website, description |
| Forte do Presépio | BR | MIXED | wikidata, coordinates, website, description |
| UFPA | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Teatro da Paz | BR | MIXED | wikidata, coordinates, website, description |
| Pedra do Ingá | BR | MIXED | wikidata, coordinates, website, description |
| UFPB/UEPB | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Natural History Museum | BR | MUSEUM | wikidata, coordinates, website, description |
| Forte Santa Catarina | BR | MIXED | wikidata, coordinates, website, description |
| MON | BR | MIXED | wikidata, coordinates, website, description |
| DEAP Archives | BR | ARCHIVE | wikidata, coordinates, website, description |
| Centro de Memória | BR | MIXED | wikidata, coordinates, website, description |
| UFPR | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Instituto Ricardo Brennand | BR | OFFICIAL_INSTITUTION | wikidata, coordinates, website, description |
| UFPE | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| MEPE/IAHGP | BR | MIXED | wikidata, coordinates, website, description |
| FUMDHAM | BR | MIXED | wikidata, coordinates, website, description |
| Museu do Piauí | BR | MUSEUM | wikidata, coordinates, website, description |

Deduplication Details

Duplicates Found

Total duplicate IDs: 12,461

| ID | Sources |
|----|---------|
| ...JP-1000034 | japan, global |
| ...JP-1000036 | japan, global |
| ...JP-1000038 | japan, global |
| ...JP-1000039 | japan, global |
| ...JP-1000044 | japan, global |
| ...JP-1000045 | japan, global |
| ...JP-1000046 | japan, global |
| ...JP-1000047 | japan, global |
| ...JP-1004852 | japan, global |
| ...JP-1005807 | japan, global |
| ...JP-1000051 | japan, global |
| ...JP-1005809 | japan, global |
| ...JP-1000053 | japan, global |
| ...JP-1000059 | japan, global |
| ...JP-1000060 | japan, global |
| ...JP-1000061 | japan, global |
| ...JP-1000054 | japan, global |
| ...JP-1000058 | japan, global |
| ...JP-1000075 | japan, global |
| ...JP-1000141 | japan, global |

...and 12,441 more duplicates
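The duplicate listing above pairs each ID with the sources it appeared in. One way such a result could be produced is sketched below. This is an illustration only, not the logic of scripts/unify_all_datasets.py: the `unify` function, the `id` key, the gap-filling merge rule, and the sample records are all assumptions.

```python
def unify(sources: dict[str, list[dict]]) -> tuple[dict[str, dict], dict[str, list[str]]]:
    """Merge records from several sources by ID and report duplicate IDs.

    Returns (unified, duplicates), where `unified` maps each ID to one merged
    record and `duplicates` maps each ID seen more than once to its sources.
    """
    unified: dict[str, dict] = {}
    seen_in: dict[str, list[str]] = {}

    for source_name, records in sources.items():
        for record in records:
            record_id = record["id"]
            seen_in.setdefault(record_id, []).append(source_name)

            if record_id not in unified:
                # First occurrence: keep a copy of the record.
                unified[record_id] = dict(record)
            else:
                # Later occurrence: fill only fields that are still empty
                # (an assumed merge rule, chosen for illustration).
                merged = unified[record_id]
                for key, value in record.items():
                    if merged.get(key) in (None, ""):
                        merged[key] = value

    duplicates = {rid: srcs for rid, srcs in seen_in.items() if len(srcs) > 1}
    return unified, duplicates


# Hypothetical placeholder records, not taken from the dataset.
sources = {
    "japan": [
        {"id": "EX-1", "name": "Example One"},
        {"id": "EX-2", "name": "Example Two"},
    ],
    "global": [
        {"id": "EX-1", "name": "Example One", "wikidata": "Q0"},
    ],
}

unified, duplicates = unify(sources)
for record_id, source_names in duplicates.items():
    print(record_id, ", ".join(source_names))
```

With a gap-filling merge like this, a record that lacks a Wikidata ID in one source can still end up with one in the unified dataset if another source supplies it.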

Next Steps

Immediate Actions

  1. Review Enrichment Candidates: Check ENRICHMENT_CANDIDATES.yaml for institutions needing data

  2. Prioritize Countries: Focus on countries with low Wikidata coverage:

    • BE: 0.0% coverage (7 institutions)
    • LU: 0.0% coverage (1 institution)
    • US: 0.0% coverage (7 institutions)
    • TN: 2.9% coverage (69 institutions)
    • BR: 13.7% coverage (212 institutions)
    • MX: 19.5% coverage (226 institutions)
    • NL: 31.0% coverage (622 institutions)
    • IT: 33.3% coverage (3 institutions)
    • VN: 38.1% coverage (21 institutions)
    • AR: 50.0% coverage (2 institutions)
  3. Batch Enrichment Workflow:

    • Run Wikidata enrichment for high-priority candidates
    • Run geocoding for missing coordinates
    • Crawl institutional websites for missing data
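The country ranking in step 2 can be reproduced from the Coverage by Country table by sorting on Wikidata coverage in ascending order. The sketch below uses the table's figures; the function name and the tie-breaking rule (alphabetical by country code) are assumptions made for illustration.

```python
# (country code, total institutions, institutions with a Wikidata ID),
# copied from the Coverage by Country table.
coverage_rows = [
    ("JP", 12_065, 7_091), ("NL", 622, 193), ("MX", 226, 44), ("BR", 212, 29),
    ("CL", 180, 97), ("TN", 69, 2), ("LY", 47, 47), ("VN", 21, 8),
    ("DZ", 19, 13), ("GE", 14, 11), ("BE", 7, 0), ("US", 7, 0),
    ("GB", 3, 3), ("IT", 3, 1), ("AR", 2, 1), ("RU", 1, 1),
    ("DK", 1, 1), ("LU", 1, 0),
]


def lowest_wikidata_coverage(rows, limit=10):
    """Return (code, coverage %, total) for the `limit` least-covered countries."""
    # Sort by coverage ratio, breaking ties alphabetically by country code.
    ranked = sorted(rows, key=lambda row: (row[2] / row[1], row[0]))
    return [
        (code, round(100 * with_wikidata / total, 1), total)
        for code, total, with_wikidata in ranked[:limit]
    ]


lowest = lowest_wikidata_coverage(coverage_rows)
for code, percent, total in lowest:
    noun = "institution" if total == 1 else "institutions"
    print(f"{code}: {percent:.1f}% coverage ({total} {noun})")
```

A ranking by coverage alone puts very small countries (BE, LU, US) first. Weighting by the absolute number of institutions without a Wikidata ID would instead surface JP and NL, so the choice of sort key depends on the enrichment goal.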

Tools Available

  • Wikidata Enrichment: scripts/enrich_global_batch.py
  • Geocoding: scripts/geocode_institutions.py
  • Website Crawling: scripts/crawl_institution_websites.py (to be created)

Files Generated

  1. globalglam-20251111.yaml - Complete unified dataset (13,500 institutions)
  2. ENRICHMENT_CANDIDATES.yaml - Institutions needing enrichment (13,458 candidates)
  3. UNIFICATION_REPORT.md - This report

Generated by: scripts/unify_all_datasets.py
Dataset Version: 1.0
Schema Version: LinkML v0.2.1