glam/data/instances/all/UNIFICATION_REPORT.md
2025-11-21 22:12:33 +01:00

# GLAM Dataset Unification Report
**Generated**: 2025-11-20 19:29:21 UTC
## Executive Summary
- **Total Institutions**: 13,500
- **Countries Covered**: 18
- **Wikidata Coverage**: 7,542/13,500 (55.9%)
- **Geocoding Coverage**: 8,178/13,500 (60.6%)
- **Duplicates Removed**: 12,461
## Data Sources
### Algeria
- Total: 19 institutions
- Wikidata: 13 (68.4%)
- Geocoded: 0 (0.0%)
### Brazil
- Total: 115 institutions
- Wikidata: 7 (6.1%)
- Geocoded: 0 (0.0%)
### Chile
- Total: 90 institutions
- Wikidata: 71 (78.9%)
- Geocoded: 78 (86.7%)
### Georgia
- Total: 14 institutions
- Wikidata: 11 (78.6%)
- Geocoded: 0 (0.0%)
### Global
- Total: 13,396 institutions
- Wikidata: 7,448 (55.6%)
- Geocoded: 13,393 (100.0%)
### Historical
- Total: 5 institutions
- Wikidata: 5 (100.0%)
- Geocoded: 5 (100.0%)
### Japan
- Total: 12,065 institutions
- Wikidata: 0 (0.0%)
- Geocoded: 0 (0.0%)
### Libya
- Total: 50 institutions
- Wikidata: 50 (100.0%)
- Geocoded: 0 (0.0%)
### Mexico
- Total: 117 institutions
- Wikidata: 10 (8.5%)
- Geocoded: 58 (49.6%)
### Tunisia
- Total: 69 institutions
- Wikidata: 2 (2.9%)
- Geocoded: 12 (17.4%)
### Vietnam
- Total: 21 institutions
- Wikidata: 8 (38.1%)
- Geocoded: 0 (0.0%)
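The per-source totals above can be reconciled with the Executive Summary. The sketch below is a consistency check only: the numbers are transcribed from this report, the variable names are illustrative, and it assumes that the reported duplicates were the only records dropped during unification.

```python
# Per-source totals transcribed from the "Data Sources" section.
source_totals = {
    "Algeria": 19, "Brazil": 115, "Chile": 90, "Georgia": 14,
    "Global": 13_396, "Historical": 5, "Japan": 12_065, "Libya": 50,
    "Mexico": 117, "Tunisia": 69, "Vietnam": 21,
}
duplicates_removed = 12_461  # from the Executive Summary

raw_total = sum(source_totals.values())
unified_total = raw_total - duplicates_removed
print(raw_total, unified_total)  # 25961 13500
```

Under that assumption the 25,961 source records minus 12,461 duplicates give the 13,500 institutions reported for the unified dataset.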
## Coverage by Country
| Country | Total | Wikidata | Wikidata % | Geocoded | Geocoded % |
|---------|-------|----------|------------|----------|------------|
| JP | 12,065 | 7,091 | 58.8% | 7,091 | 58.8% |
| NL | 622 | 193 | 31.0% | 621 | 99.8% |
| MX | 226 | 44 | 19.5% | 167 | 73.9% |
| BR | 212 | 29 | 13.7% | 97 | 45.8% |
| CL | 180 | 97 | 53.9% | 168 | 93.3% |
| TN | 69 | 2 | 2.9% | 12 | 17.4% |
| LY | 47 | 47 | 100.0% | 0 | 0.0% |
| VN | 21 | 8 | 38.1% | 0 | 0.0% |
| DZ | 19 | 13 | 68.4% | 0 | 0.0% |
| GE | 14 | 11 | 78.6% | 0 | 0.0% |
| BE | 7 | 0 | 0.0% | 7 | 100.0% |
| US | 7 | 0 | 0.0% | 7 | 100.0% |
| GB | 3 | 3 | 100.0% | 0 | 0.0% |
| IT | 3 | 1 | 33.3% | 3 | 100.0% |
| AR | 2 | 1 | 50.0% | 2 | 100.0% |
| RU | 1 | 1 | 100.0% | 1 | 100.0% |
| DK | 1 | 1 | 100.0% | 1 | 100.0% |
| LU | 1 | 0 | 0.0% | 1 | 100.0% |
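The table can be cross-checked against the headline figures. The following sketch transcribes the rows above and recomputes the totals and percentages; the helper name `pct` is illustrative and not taken from the generating script.

```python
# (country, total, wikidata, geocoded), transcribed from the table above.
rows = [
    ("JP", 12_065, 7_091, 7_091), ("NL", 622, 193, 621),
    ("MX", 226, 44, 167), ("BR", 212, 29, 97), ("CL", 180, 97, 168),
    ("TN", 69, 2, 12), ("LY", 47, 47, 0), ("VN", 21, 8, 0),
    ("DZ", 19, 13, 0), ("GE", 14, 11, 0), ("BE", 7, 0, 7),
    ("US", 7, 0, 7), ("GB", 3, 3, 0), ("IT", 3, 1, 3),
    ("AR", 2, 1, 2), ("RU", 1, 1, 1), ("DK", 1, 1, 1), ("LU", 1, 0, 1),
]

def pct(part: int, whole: int) -> str:
    """Format a coverage ratio with one decimal place, e.g. '58.8%'."""
    return f"{part / whole:.1%}"

total = sum(r[1] for r in rows)
wikidata = sum(r[2] for r in rows)
geocoded = sum(r[3] for r in rows)
print(len(rows), total, pct(wikidata, total), pct(geocoded, total))
# 18 13500 55.9% 60.6%
```

The column sums (13,500 institutions, 7,542 with Wikidata, 8,178 geocoded) agree with the Executive Summary.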
## Enrichment Needs
Total institutions requiring enrichment: **13,458** (99.7% of dataset)
### By Enrichment Type
- **Need Wikidata**: 5,958 (44.1%)
- **Need Coordinates**: 5,322 (39.4%)
- **Need Website**: 2,085 (15.4%)
- **Need Description**: 13,005 (96.3%)
### Priority Distribution (by number of missing fields)
- **Priority 4** (4 missing fields): 854 institutions
- **Priority 3** (3 missing fields): 4,834 institutions
- **Priority 2** (2 missing fields): 682 institutions
- **Priority 1** (1 missing field): 7,088 institutions
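A minimal sketch of how a priority could be derived from the number of missing fields. The record layout, the key names, and the rule that an empty value counts as missing are assumptions for illustration; the report does not show how `scripts/unify_all_datasets.py` computes this.

```python
# The four enrichable fields named in this report (key names are assumed).
ENRICHABLE_FIELDS = ("wikidata", "coordinates", "website", "description")

def missing_fields(record: dict) -> list[str]:
    """Return the enrichable fields that are absent or empty in a record."""
    return [field for field in ENRICHABLE_FIELDS if not record.get(field)]

def priority(record: dict) -> int:
    """Priority equals the number of missing fields (0 means nothing to enrich)."""
    return len(missing_fields(record))

# Hypothetical record: "Q0" is a placeholder, not a real Wikidata ID.
example = {"name": "Example Museum", "wikidata": "Q0",
           "coordinates": None, "website": ""}
print(priority(example), missing_fields(example))

# The four priority buckets above should add up to the candidate count.
buckets = {4: 854, 3: 4_834, 2: 682, 1: 7_088}
print(sum(buckets.values()))  # 13458
```

The bucket counts sum to 13,458, matching the total number of institutions requiring enrichment.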
## Top 50 Enrichment Candidates (Highest Priority)
| Name | Country | Type | Missing Fields |
|------|---------|------|----------------|
| Museu dos Povos Acreanos | BR | MUSEUM | wikidata, coordinates, website, description |
| UFAC Repository | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Serra da Barriga | BR | MIXED | wikidata, coordinates, website, description |
| Parque Memorial Quilombo dos Palmares | BR | MIXED | wikidata, coordinates, website, description |
| UFAL Natural History Museum | BR | MUSEUM | wikidata, coordinates, website, description |
| SECULT | BR | MIXED | wikidata, coordinates, website, description |
| Museu Sacaca | BR | MUSEUM | wikidata, coordinates, website, description |
| Museu de Arqueologia e Etnologia | BR | MUSEUM | wikidata, coordinates, website, description |
| Teatro Amazonas | BR | MIXED | wikidata, coordinates, website, description |
| Centro Cultural Povos da Amazônia | BR | GALLERY | wikidata, coordinates, website, description |
| Arquivo Público (APEB) | BR | ARCHIVE | wikidata, coordinates, website, description |
| FPC/IPAC | BR | MIXED | wikidata, coordinates, website, description |
| MAM-BA | BR | MIXED | wikidata, coordinates, website, description |
| Centro Dragão do Mar | BR | GALLERY | wikidata, coordinates, website, description |
| Arquivo Público DF | BR | ARCHIVE | wikidata, coordinates, website, description |
| UnB BCE | BR | LIBRARY | wikidata, coordinates, website, description |
| CCBB Brasília | BR | GALLERY | wikidata, coordinates, website, description |
| UFES Digital Libraries | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| State Archives | BR | ARCHIVE | wikidata, coordinates, website, description |
| Cultural Collections | BR | MIXED | wikidata, coordinates, website, description |
| UFG Repositories | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Museu Zoroastro Artiaga | BR | MUSEUM | wikidata, coordinates, website, description |
| Museu Histórico (MHAM) | BR | MUSEUM | wikidata, coordinates, website, description |
| Casa das Minas/Casa de Nagô | BR | MIXED | wikidata, coordinates, website, description |
| MUSEAR/UFMT | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Instituto Histórico | BR | OFFICIAL_INSTITUTION | wikidata, coordinates, website, description |
| Museu Histórico | BR | MUSEUM | wikidata, coordinates, website, description |
| Guarani-Kaiowá Projects | BR | MIXED | wikidata, coordinates, website, description |
| UFMS Repositories | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| UFMG Tainacan Lab | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Inhotim | BR | MIXED | wikidata, coordinates, website, description |
| Ouro Preto System | BR | MIXED | wikidata, coordinates, website, description |
| MM Gerdau | BR | MIXED | wikidata, coordinates, website, description |
| Museu Goeldi | BR | MUSEUM | wikidata, coordinates, website, description |
| Forte do Presépio | BR | MIXED | wikidata, coordinates, website, description |
| UFPA | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Teatro da Paz | BR | MIXED | wikidata, coordinates, website, description |
| Pedra do Ingá | BR | MIXED | wikidata, coordinates, website, description |
| UFPB/UEPB | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Natural History Museum | BR | MUSEUM | wikidata, coordinates, website, description |
| Forte Santa Catarina | BR | MIXED | wikidata, coordinates, website, description |
| MON | BR | MIXED | wikidata, coordinates, website, description |
| DEAP Archives | BR | ARCHIVE | wikidata, coordinates, website, description |
| Centro de Memória | BR | MIXED | wikidata, coordinates, website, description |
| UFPR | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| Instituto Ricardo Brennand | BR | OFFICIAL_INSTITUTION | wikidata, coordinates, website, description |
| UFPE | BR | EDUCATION_PROVIDER | wikidata, coordinates, website, description |
| MEPE/IAHGP | BR | MIXED | wikidata, coordinates, website, description |
| FUMDHAM | BR | MIXED | wikidata, coordinates, website, description |
| Museu do Piauí | BR | MUSEUM | wikidata, coordinates, website, description |
## Deduplication Details
### Duplicates Found
Total duplicate IDs: 12,461
| ID | Sources |
|----|---------|
| ...JP-1000034 | japan, global |
| ...JP-1000036 | japan, global |
| ...JP-1000038 | japan, global |
| ...JP-1000039 | japan, global |
| ...JP-1000044 | japan, global |
| ...JP-1000045 | japan, global |
| ...JP-1000046 | japan, global |
| ...JP-1000047 | japan, global |
| ...JP-1004852 | japan, global |
| ...JP-1005807 | japan, global |
| ...JP-1000051 | japan, global |
| ...JP-1005809 | japan, global |
| ...JP-1000053 | japan, global |
| ...JP-1000059 | japan, global |
| ...JP-1000060 | japan, global |
| ...JP-1000061 | japan, global |
| ...JP-1000054 | japan, global |
| ...JP-1000058 | japan, global |
| ...JP-1000075 | japan, global |
| ...JP-1000141 | japan, global |
*...and 12,441 more duplicates*
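The merge logic itself is not shown in this report. As a rough sketch, ID-based deduplication across sources might look like the following, assuming duplicates are detected by exact ID match and the first record seen is kept. The function name `unify`, the `id` key, and the sample IDs are hypothetical.

```python
from collections import defaultdict

def unify(sources: dict[str, list[dict]]) -> tuple[list[dict], dict[str, list[str]]]:
    """Merge records from several sources, keeping the first record per ID.

    Returns the unified records and a mapping of duplicated IDs to the
    sources they appeared in.
    """
    unified: dict[str, dict] = {}
    seen_in: dict[str, list[str]] = defaultdict(list)
    for source_name, records in sources.items():
        for record in records:
            seen_in[record["id"]].append(source_name)
            unified.setdefault(record["id"], record)  # first occurrence wins
    duplicates = {rid: names for rid, names in seen_in.items() if len(names) > 1}
    return list(unified.values()), duplicates

# Hypothetical input: one ID appears in both sources.
records, duplicates = unify({
    "japan": [{"id": "A"}, {"id": "B"}],
    "global": [{"id": "B"}, {"id": "C"}],
})
print(len(records), duplicates)
```

With this shape, the `duplicates` mapping corresponds to the ID and Sources columns of the table above.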
## Next Steps
### Immediate Actions
1. **Review Enrichment Candidates**: Check `ENRICHMENT_CANDIDATES.yaml` for institutions needing data
2. **Prioritize Countries**: Focus on the ten countries with the lowest Wikidata coverage:
   - BE: 0.0% coverage (7 institutions)
   - LU: 0.0% coverage (1 institution)
   - US: 0.0% coverage (7 institutions)
   - TN: 2.9% coverage (69 institutions)
   - BR: 13.7% coverage (212 institutions)
   - MX: 19.5% coverage (226 institutions)
   - NL: 31.0% coverage (622 institutions)
   - IT: 33.3% coverage (3 institutions)
   - VN: 38.1% coverage (21 institutions)
   - AR: 50.0% coverage (2 institutions)
3. **Batch Enrichment Workflow**:
   - Run Wikidata enrichment for high-priority candidates
   - Run geocoding for missing coordinates
   - Crawl institutional websites for missing data
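The country shortlist in step 2 can be reproduced from the "Coverage by Country" figures by sorting on Wikidata coverage. This is a sketch: the tie-break on country code is an assumption that happens to match the order shown, and the variable names are illustrative.

```python
# (country, total, wikidata), transcribed from "Coverage by Country".
coverage_rows = [
    ("JP", 12_065, 7_091), ("NL", 622, 193), ("MX", 226, 44),
    ("BR", 212, 29), ("CL", 180, 97), ("TN", 69, 2), ("LY", 47, 47),
    ("VN", 21, 8), ("DZ", 19, 13), ("GE", 14, 11), ("BE", 7, 0),
    ("US", 7, 0), ("GB", 3, 3), ("IT", 3, 1), ("AR", 2, 1),
    ("RU", 1, 1), ("DK", 1, 1), ("LU", 1, 0),
]

# Sort ascending by coverage ratio, breaking ties alphabetically by code.
ranked = sorted(coverage_rows, key=lambda row: (row[2] / row[1], row[0]))
lowest_ten = [code for code, _, _ in ranked[:10]]
print(lowest_ten)
# ['BE', 'LU', 'US', 'TN', 'BR', 'MX', 'NL', 'IT', 'VN', 'AR']
```

Ranking by coverage alone puts very small countries first; weighting by the number of institutions would be a different and equally defensible choice.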
### Tools Available
- **Wikidata Enrichment**: `scripts/enrich_global_batch.py`
- **Geocoding**: `scripts/geocode_institutions.py`
- **Website Crawling**: `scripts/crawl_institution_websites.py` (to be created)
## Files Generated
1. **globalglam-20251111.yaml** - Complete unified dataset (13,500 institutions)
2. **ENRICHMENT_CANDIDATES.yaml** - Institutions needing enrichment (13,458 candidates)
3. **UNIFICATION_REPORT.md** - This report
---
- **Generated by**: `scripts/unify_all_datasets.py`
- **Dataset Version**: 1.0
- **Schema Version**: LinkML v0.2.1