# GLAM Dataset Unification Report

**Generated**: 2025-11-20 19:29:21 UTC

## Executive Summary

- **Total Institutions**: 13,500
- **Countries Covered**: 18
- **Wikidata Coverage**: 7,542/13,500 (55.9%)
- **Geocoding Coverage**: 8,178/13,500 (60.6%)
- **Duplicates Removed**: 12,461

## Data Sources

### Algeria

- Total: 19 institutions
- Wikidata: 13 (68.4%)
- Geocoded: 0 (0.0%)

### Brazil

- Total: 115 institutions
- Wikidata: 7 (6.1%)
- Geocoded: 0 (0.0%)

### Chile

- Total: 90 institutions
- Wikidata: 71 (78.9%)
- Geocoded: 78 (86.7%)

### Georgia

- Total: 14 institutions
- Wikidata: 11 (78.6%)
- Geocoded: 0 (0.0%)

### Global

- Total: 13,396 institutions
- Wikidata: 7,448 (55.6%)
- Geocoded: 13,393 (100.0%)

### Historical

- Total: 5 institutions
- Wikidata: 5 (100.0%)
- Geocoded: 5 (100.0%)

### Japan

- Total: 12,065 institutions
- Wikidata: 0 (0.0%)
- Geocoded: 0 (0.0%)

### Libya

- Total: 50 institutions
- Wikidata: 50 (100.0%)
- Geocoded: 0 (0.0%)

### Mexico

- Total: 117 institutions
- Wikidata: 10 (8.5%)
- Geocoded: 58 (49.6%)

### Tunisia

- Total: 69 institutions
- Wikidata: 2 (2.9%)
- Geocoded: 12 (17.4%)

### Vietnam

- Total: 21 institutions
- Wikidata: 8 (38.1%)
- Geocoded: 0 (0.0%)

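The per-source totals can be reconciled against the Executive Summary figures. The following is a minimal sketch, assuming each duplicate ID accounts for exactly one surplus record (the sample under Deduplication Details lists every ID in two sources). The dictionary only restates the totals from this section.

```python
# Per-source institution counts, copied from the Data Sources section.
source_totals = {
    "algeria": 19,
    "brazil": 115,
    "chile": 90,
    "georgia": 14,
    "global": 13_396,
    "historical": 5,
    "japan": 12_065,
    "libya": 50,
    "mexico": 117,
    "tunisia": 69,
    "vietnam": 21,
}
duplicates_removed = 12_461  # from the Executive Summary

raw_total = sum(source_totals.values())
# Assumption: each duplicate ID drops exactly one surplus record.
unified_total = raw_total - duplicates_removed

print(f"raw records:     {raw_total:,}")
print(f"unified records: {unified_total:,}")
```

Under that assumption the raw total is 25,961 and the unified total is 13,500, which matches the Total Institutions figure.
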
## Coverage by Country

| Country | Total | Wikidata | Wikidata % | Geocoded | Geocoded % |
|---------|-------|----------|------------|----------|------------|
| JP | 12,065 | 7,091 | 58.8% | 7,091 | 58.8% |
| NL | 622 | 193 | 31.0% | 621 | 99.8% |
| MX | 226 | 44 | 19.5% | 167 | 73.9% |
| BR | 212 | 29 | 13.7% | 97 | 45.8% |
| CL | 180 | 97 | 53.9% | 168 | 93.3% |
| TN | 69 | 2 | 2.9% | 12 | 17.4% |
| LY | 47 | 47 | 100.0% | 0 | 0.0% |
| VN | 21 | 8 | 38.1% | 0 | 0.0% |
| DZ | 19 | 13 | 68.4% | 0 | 0.0% |
| GE | 14 | 11 | 78.6% | 0 | 0.0% |
| BE | 7 | 0 | 0.0% | 7 | 100.0% |
| US | 7 | 0 | 0.0% | 7 | 100.0% |
| GB | 3 | 3 | 100.0% | 0 | 0.0% |
| IT | 3 | 1 | 33.3% | 3 | 100.0% |
| AR | 2 | 1 | 50.0% | 2 | 100.0% |
| RU | 1 | 1 | 100.0% | 1 | 100.0% |
| DK | 1 | 1 | 100.0% | 1 | 100.0% |
| LU | 1 | 0 | 0.0% | 1 | 100.0% |

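Per-country columns like these could be derived from the unified records along the following lines. This is a sketch, not the logic in `scripts/unify_all_datasets.py`: the record keys `country`, `wikidata_id`, and `coordinates` are hypothetical names chosen for illustration, and the sample records are invented.

```python
from collections import defaultdict


def coverage_by_country(records):
    """Return {country: (total, wikidata, geocoded)} for a list of dict records.

    The keys "country", "wikidata_id" and "coordinates" are hypothetical;
    a field counts as present when its value is truthy.
    """
    counts = defaultdict(lambda: [0, 0, 0])
    for rec in records:
        row = counts[rec["country"]]
        row[0] += 1
        row[1] += bool(rec.get("wikidata_id"))
        row[2] += bool(rec.get("coordinates"))
    return {country: tuple(row) for country, row in counts.items()}


def pct(part, total):
    """Format a share as a percentage with one decimal place."""
    return f"{100 * part / total:.1f}%" if total else "n/a"


# Invented sample records, for illustration only.
sample = [
    {"country": "NL", "wikidata_id": "Q_PLACEHOLDER_1", "coordinates": (52.37, 4.90)},
    {"country": "NL", "wikidata_id": None, "coordinates": (52.09, 5.12)},
    {"country": "LY", "wikidata_id": "Q_PLACEHOLDER_2", "coordinates": None},
]
for country, (total, wd, geo) in sorted(coverage_by_country(sample).items()):
    print(f"| {country} | {total} | {wd} | {pct(wd, total)} | {geo} | {pct(geo, total)} |")
```

Applied to the figures in the table, `pct(193, 622)` gives `31.0%`, the NL Wikidata share.
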
## Enrichment Needs

Total institutions requiring enrichment: **13,458** (99.7% of dataset)

### By Enrichment Type

- **Need Wikidata**: 5,958 (44.1%)
- **Need Coordinates**: 5,322 (39.4%)
- **Need Website**: 2,085 (15.4%)
- **Need Description**: 13,005 (96.3%)

### Priority Distribution (by number of missing fields)

- **Priority 4** (4 missing fields): 854 institutions
- **Priority 3** (3 missing fields): 4,834 institutions
- **Priority 2** (2 missing fields): 682 institutions
- **Priority 1** (1 missing field): 7,088 institutions

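Since priority is defined as the number of missing fields, it can be computed per record as sketched below. The assumptions: the four field names mirror the "Missing Fields" labels used in this report, a record is a plain dict, and an absent or empty value counts as missing. The example record is invented.

```python
ENRICHMENT_FIELDS = ("wikidata", "coordinates", "website", "description")


def missing_fields(record):
    """List the enrichment fields that are absent or empty in a record."""
    return [name for name in ENRICHMENT_FIELDS if not record.get(name)]


def priority(record):
    """Priority equals the number of missing fields (0 means complete)."""
    return len(missing_fields(record))


# Invented record for illustration: only a website is known.
example = {"name": "Example Museum", "website": "https://example.org"}
print(priority(example), missing_fields(example))

# The published distribution sums to the number of enrichment candidates.
distribution = {4: 854, 3: 4_834, 2: 682, 1: 7_088}
print(sum(distribution.values()))
```

The four priority buckets add up to 13,458, the total number of institutions requiring enrichment.
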
## Top 50 Enrichment Candidates (Highest Priority)

| Name | Country | Type | Missing Fields |
|------|---------|------|----------------|
| Museu dos Povos Acreanos | BR | MUSEUM | wikidata, coordinates, website, description |
| UFAC Repository | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Serra da Barriga | BR | MIXED | wikidata, coordinates, website, description |
| Parque Memorial Quilombo dos Palmares | BR | MIXED | wikidata, coordinates, website, description |
| UFAL Natural History Museum | BR | MUSEUM | wikidata, coordinates, website, description |
| SECULT | BR | MIXED | wikidata, coordinates, website, description |
| Museu Sacaca | BR | MUSEUM | wikidata, coordinates, website, description |
| Museu de Arqueologia e Etnologia | BR | MUSEUM | wikidata, coordinates, website, description |
| Teatro Amazonas | BR | MIXED | wikidata, coordinates, website, description |
| Centro Cultural Povos da Amazônia | BR | GALLERY | wikidata, coordinates, website, description |
| Arquivo Público (APEB) | BR | ARCHIVE | wikidata, coordinates, website, description |
| FPC/IPAC | BR | MIXED | wikidata, coordinates, website, description |
| MAM-BA | BR | MIXED | wikidata, coordinates, website, description |
| Centro Dragão do Mar | BR | GALLERY | wikidata, coordinates, website, description |
| Arquivo Público DF | BR | ARCHIVE | wikidata, coordinates, website, description |
| UnB BCE | BR | LIBRARY | wikidata, coordinates, website, description |
| CCBB Brasília | BR | GALLERY | wikidata, coordinates, website, description |
| UFES Digital Libraries | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| State Archives | BR | ARCHIVE | wikidata, coordinates, website, description |
| Cultural Collections | BR | MIXED | wikidata, coordinates, website, description |
| UFG Repositories | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Museu Zoroastro Artiaga | BR | MUSEUM | wikidata, coordinates, website, description |
| Museu Histórico (MHAM) | BR | MUSEUM | wikidata, coordinates, website, description |
| Casa das Minas/Casa de Nagô | BR | MIXED | wikidata, coordinates, website, description |
| MUSEAR/UFMT | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Instituto Histórico | BR | OFFICIAL_INSTITUTION | wikidata, coordinates, website, description |
| Museu Histórico | BR | MUSEUM | wikidata, coordinates, website, description |
| Guarani-Kaiowá Projects | BR | MIXED | wikidata, coordinates, website, description |
| UFMS Repositories | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| UFMG Tainacan Lab | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Inhotim | BR | MIXED | wikidata, coordinates, website, description |
| Ouro Preto System | BR | MIXED | wikidata, coordinates, website, description |
| MM Gerdau | BR | MIXED | wikidata, coordinates, website, description |
| Museu Goeldi | BR | MUSEUM | wikidata, coordinates, website, description |
| Forte do Presépio | BR | MIXED | wikidata, coordinates, website, description |
| UFPA | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Teatro da Paz | BR | MIXED | wikidata, coordinates, website, description |
| Pedra do Ingá | BR | MIXED | wikidata, coordinates, website, description |
| UFPB/UEPB | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Natural History Museum | BR | MUSEUM | wikidata, coordinates, website, description |
| Forte Santa Catarina | BR | MIXED | wikidata, coordinates, website, description |
| MON | BR | MIXED | wikidata, coordinates, website, description |
| DEAP Archives | BR | ARCHIVE | wikidata, coordinates, website, description |
| Centro de Memória | BR | MIXED | wikidata, coordinates, website, description |
| UFPR | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Instituto Ricardo Brennand | BR | OFFICIAL_INSTITUTION | wikidata, coordinates, website, description |
| UFPE | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| MEPE/IAHGP | BR | MIXED | wikidata, coordinates, website, description |
| FUMDHAM | BR | MIXED | wikidata, coordinates, website, description |
| Museu do Piauí | BR | MUSEUM | wikidata, coordinates, website, description |

## Deduplication Details

### Duplicates Found

Total duplicate IDs: 12,461

| ID | Sources |
|----|---------|
| ...JP-1000034 | japan, global |
| ...JP-1000036 | japan, global |
| ...JP-1000038 | japan, global |
| ...JP-1000039 | japan, global |
| ...JP-1000044 | japan, global |
| ...JP-1000045 | japan, global |
| ...JP-1000046 | japan, global |
| ...JP-1000047 | japan, global |
| ...JP-1004852 | japan, global |
| ...JP-1005807 | japan, global |
| ...JP-1000051 | japan, global |
| ...JP-1005809 | japan, global |
| ...JP-1000053 | japan, global |
| ...JP-1000059 | japan, global |
| ...JP-1000060 | japan, global |
| ...JP-1000061 | japan, global |
| ...JP-1000054 | japan, global |
| ...JP-1000058 | japan, global |
| ...JP-1000075 | japan, global |
| ...JP-1000141 | japan, global |

*...and 12,441 more duplicates*

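One way to produce an ID-to-sources listing like this is to merge records by ID while remembering which sources contributed each one. The following is a hypothetical sketch, not the code in `scripts/unify_all_datasets.py`: the function name, the `id` key, and the policy that the first record wins while later copies only fill empty fields are all assumptions, and the IDs in the sample are placeholders.

```python
def merge_by_id(sources):
    """Merge records from several sources, keeping the first record per ID.

    `sources` maps a source name to a list of dict records with an "id" key.
    Returns (merged, duplicates): `merged` maps ID -> record, and
    `duplicates` maps each ID seen in more than one source to its sources.
    """
    merged = {}
    seen_in = {}
    for source_name, records in sources.items():
        for rec in records:
            rec_id = rec["id"]
            names = seen_in.setdefault(rec_id, [])
            if source_name not in names:
                names.append(source_name)
            # Assumption: first occurrence wins; later copies only fill gaps.
            kept = merged.setdefault(rec_id, dict(rec))
            for key, value in rec.items():
                if kept.get(key) is None:
                    kept[key] = value
    duplicates = {i: s for i, s in seen_in.items() if len(s) > 1}
    return merged, duplicates


# Invented sample data with placeholder IDs.
sources = {
    "japan": [{"id": "JP-A", "name": "Example Library", "wikidata": None}],
    "global": [
        {"id": "JP-A", "name": "Example Library", "wikidata": "Q_PLACEHOLDER"},
        {"id": "NL-B", "name": "Example Archive", "wikidata": None},
    ],
}
merged, duplicates = merge_by_id(sources)
for rec_id, names in duplicates.items():
    print(f"| {rec_id} | {', '.join(names)} |")
```

Keeping the list of contributing sources per ID makes the duplicate count auditable: the number of duplicate IDs is simply `len(duplicates)`.
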
## Next Steps

### Immediate Actions

1. **Review Enrichment Candidates**: Check `ENRICHMENT_CANDIDATES.yaml` for institutions needing data
2. **Prioritize Countries**: Focus on countries with low Wikidata coverage:
   - BE: 0.0% coverage (7 institutions)
   - LU: 0.0% coverage (1 institution)
   - US: 0.0% coverage (7 institutions)
   - TN: 2.9% coverage (69 institutions)
   - BR: 13.7% coverage (212 institutions)
   - MX: 19.5% coverage (226 institutions)
   - NL: 31.0% coverage (622 institutions)
   - IT: 33.3% coverage (3 institutions)
   - VN: 38.1% coverage (21 institutions)
   - AR: 50.0% coverage (2 institutions)

3. **Batch Enrichment Workflow**:
   - Run Wikidata enrichment for high-priority candidates
   - Run geocoding for missing coordinates
   - Crawl institutional websites for missing data

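The country ranking in step 2 can be reproduced from the Coverage by Country table. The following is a minimal sketch, assuming the ten lowest-coverage countries are listed and that ties are broken alphabetically by country code (an assumption that happens to match the order shown).

```python
# (total, wikidata) per country, copied from the Coverage by Country table.
coverage = {
    "JP": (12_065, 7_091), "NL": (622, 193), "MX": (226, 44),
    "BR": (212, 29), "CL": (180, 97), "TN": (69, 2),
    "LY": (47, 47), "VN": (21, 8), "DZ": (19, 13),
    "GE": (14, 11), "BE": (7, 0), "US": (7, 0),
    "GB": (3, 3), "IT": (3, 1), "AR": (2, 1),
    "RU": (1, 1), "DK": (1, 1), "LU": (1, 0),
}


def lowest_wikidata_coverage(table, limit=10):
    """Return (code, percent, total) tuples, lowest coverage first."""
    ranked = sorted(
        table.items(),
        # Sort by coverage ratio, then by country code to break ties.
        key=lambda item: (item[1][1] / item[1][0], item[0]),
    )
    return [(code, 100 * wd / total, total) for code, (total, wd) in ranked[:limit]]


for code, percent, total in lowest_wikidata_coverage(coverage):
    noun = "institution" if total == 1 else "institutions"
    print(f"- {code}: {percent:.1f}% coverage ({total} {noun})")
```

Ranking by percentage alone puts very small countries (BE, LU, US) ahead of larger gaps such as BR and MX; weighting by the absolute number of institutions without a Wikidata link would be an alternative ordering.
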
### Tools Available

- **Wikidata Enrichment**: `scripts/enrich_global_batch.py`
- **Geocoding**: `scripts/geocode_institutions.py`
- **Website Crawling**: `scripts/crawl_institution_websites.py` (to be created)

## Files Generated

1. **globalglam-20251111.yaml** - Complete unified dataset (13,500 institutions)
2. **ENRICHMENT_CANDIDATES.yaml** - Institutions needing enrichment (13,458 candidates)
3. **UNIFICATION_REPORT.md** - This report

---

**Generated by**: `scripts/unify_all_datasets.py`
**Dataset Version**: 1.0
**Schema Version**: LinkML v0.2.1