# Global Heritage Institutions - Unification Summary

**Last Updated**: November 11, 2025
**Quick Reference**: Post-Merge Statistics & File Locations

---

## Current Statistics (Post-Merge)

| Metric | Value |
|--------|-------|
| **Total Institutions** | 13,505 |
| **Countries Covered** | 18 |
| **Wikidata Coverage** | 56.6% (7,650 institutions) |
| **Geocoding Coverage** | 60.7% (8,192 institutions) |
| **Master File Size** | ~45 MB |
| **Schema Version** | LinkML v0.2.1 |
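The coverage percentages in the table follow from dividing the covered count by the total institution count. A minimal sketch of that arithmetic is shown here; `coverage_pct` is a hypothetical helper name, not part of the project's tooling.

```python
def coverage_pct(covered: int, total: int) -> float:
    """Return covered/total as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * covered / total, 1)


TOTAL_INSTITUTIONS = 13_505  # from the table above

print(coverage_pct(7_650, TOTAL_INSTITUTIONS))  # Wikidata coverage: 56.6
print(coverage_pct(8_192, TOTAL_INSTITUTIONS))  # Geocoding coverage: 60.7
```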

---

## Merge History

### November 11, 2025 - Major Enrichment Merge

| Dataset | Institutions | Wikidata % | Status |
|---------|--------------|------------|--------|
| Tunisia Enhanced | 70 (+27 new) | 74.3% | ✅ Merged |
| Georgia GLAM | 14 (new country) | 85.7% | ✅ Merged |
| Latin America Updated | 586 (updated) | 43.4% avg | ✅ Merged |

**Impact**:
- **+27 institutions** (13,478 → 13,505)
- **+153 Wikidata Q-numbers** (55.6% → 56.6%)
- **+1 country** (Georgia added)

**Notable Achievements**:
- 🎯 **Tunisia**: 2.3% → 74.3% Wikidata coverage (+72 pp!)
- 🎯 **Georgia**: 85.7% Wikidata coverage (new country)
- 🎯 **Chile**: 53.9% → 81.7% Wikidata coverage
- 🎯 **Mexico**: 117 → 192 institutions (+75)
- 🎯 **Six countries at 100%**: Belgium, USA, UK, Russia, Denmark, Luxembourg

---

## File Locations

### Master Dataset

**Primary File**:
```
data/instances/all/unified_global_heritage_institutions.yaml
```
- **Size**: ~45 MB
- **Format**: LinkML-compliant YAML
- **Institutions**: 13,505
- **Last Updated**: November 11, 2025

**Backup Files** (in same directory):
```
unified_global_heritage_institutions_backup_20251111_092645.yaml  ← Pre-merge backup
unified_global_heritage_institutions.yaml.backup                  ← Latest backup
unified_global_heritage_institutions.yaml.backup2                 ← Secondary backup
```
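The pre-merge backup name appears to embed a `YYYYMMDD_HHMMSS` timestamp between the file stem and the extension. The sketch below shows one way such a backup could be produced before a merge. The timestamp interpretation and the helper names `backup_name` and `make_backup` are assumptions for illustration, not the project's actual tooling.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def backup_name(path: Path, when: datetime) -> Path:
    """Build '<stem>_backup_<YYYYMMDD_HHMMSS><suffix>' next to the original file."""
    stamp = when.strftime("%Y%m%d_%H%M%S")  # assumed timestamp layout
    return path.with_name(f"{path.stem}_backup_{stamp}{path.suffix}")


def make_backup(path: Path, when: Optional[datetime] = None) -> Path:
    """Copy the file (with metadata) to its timestamped backup name."""
    target = backup_name(path, when or datetime.now())
    shutil.copy2(path, target)
    return target


master = Path("data/instances/all/unified_global_heritage_institutions.yaml")
print(backup_name(master, datetime(2025, 11, 11, 9, 26, 45)).name)
```

Calling `make_backup(master)` before running a merge would leave a timestamped copy in the same directory as the master file.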

### Documentation Files

| File | Purpose | Size |
|------|---------|------|
| `UNIFIED_OVERVIEW.md` | Complete project documentation | ~22 KB |
| `UNIFICATION_REPORT.md` | Detailed merge analysis | ~10 KB |
| `UNIFICATION_SUMMARY.md` | Quick reference (this file) | ~3 KB |
| `DATASET_STATISTICS.yaml` | Machine-readable metrics | ~3.3 KB |
| `ENRICHMENT_PROGRESS.md` | Historical enrichment tracking | ~11 KB |
| `README.md` | Quick reference index | ~7.7 KB |

---

## Superseded Files (Archived)

The following files were merged into the unified dataset and archived:

### Archived on November 11, 2025

**Location**: `data/instances/archive/2025-11-11-pre-merge/`

| File | Source Location | Reason |
|------|-----------------|--------|
| `tunisian_institutions_enhanced.yaml` | `tunisia/` | Merged into unified dataset |
| `georgia_glam_institutions_enriched.yaml` | Root directory | Merged into unified dataset |
| `latin_american_institutions_AUTHORITATIVE.yaml` | Root directory | Updated and merged |

**Archive Documentation**: See `data/instances/archive/2025-11-11-pre-merge/SUPERSEDED.md`

---

## Countries with 100% Wikidata Coverage

As of November 11, 2025, the following countries have achieved complete Wikidata coverage:

| Country | Institutions | Wikidata Q-numbers | Geocoded |
|---------|--------------|-------------------|----------|
| Belgium | 7 | 7 (100%) | 7 (100%) |
| USA | 7 | 7 (100%) | 7 (100%) |
| UK | 4 | 4 (100%) | 3 (75%) |
| Russia | 1 | 1 (100%) | 1 (100%) |
| Denmark | 1 | 1 (100%) | 1 (100%) |
| Luxembourg | 1 | 1 (100%) | 1 (100%) |

---

## Top 10 Countries by Institution Count

| Rank | Country | Institutions | Wikidata % | Geocoded % |
|------|---------|--------------|------------|------------|
| 1 | Japan | 12,065 | 58.8% | 58.8% |
| 2 | Netherlands | 622 | 31.0% | 99.8% |
| 3 | Brazil | 212 | 24.5% | 40.1% |
| 4 | Mexico | 192 | 32.8% | 78.1% |
| 5 | Chile | 180 | 81.7% | 89.4% |
| 6 | Tunisia | 70 | 74.3% | 84.3% |
| 7 | Libya | 50 | 16.0% | 0.0% |
| 8 | Vietnam | 21 | 38.1% | 0.0% |
| 9 | Algeria | 19 | 5.3% | 0.0% |
| 10 | Georgia | 14 | 85.7% | 0.0% |

---

## Enrichment Needs Summary

**Total Institutions Needing Enrichment**: 5,855 (43.4%)

### By Priority Level

**🔴 High Priority** (Low Wikidata coverage, significant size):
- Japan: 4,974 institutions need Q-numbers (58.8% → target 70%)
- Netherlands: 429 institutions need Q-numbers (31.0% → target 50%)
- Brazil: 160 institutions need Q-numbers (24.5% → target 50%)

**🟡 Medium Priority** (Moderate coverage, room for improvement):
- Mexico: 129 institutions need Q-numbers (32.8% → target 50%)
- Libya: 42 institutions need Q-numbers (16.0% → target 50%)
- Chile: 33 institutions need Q-numbers (81.7% → target 90%)

**🟢 Low Priority** (High coverage, minor gaps):
- Tunisia: 18 institutions need Q-numbers (74.3% → target 90%)
- Vietnam: 13 institutions need Q-numbers (38.1% → target 75%)
- Georgia: 2 institutions need Q-numbers (85.7% → target 100%)
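The counts in the list above are institutions that currently lack a Q-number (total minus enriched). Reaching the stated target usually takes fewer new Q-numbers than that. The sketch below estimates the gap to target, using counts reported elsewhere in this summary; `q_numbers_needed` is a hypothetical helper name, and integer percentage targets are assumed.

```python
def q_numbers_needed(total: int, with_q: int, target_pct: int) -> int:
    """Additional Q-numbers required for coverage to reach target_pct (0-100)."""
    if not 0 <= target_pct <= 100:
        raise ValueError("target_pct must be between 0 and 100")
    # Ceiling division: smallest count whose share is at least target_pct.
    required = (total * target_pct + 99) // 100
    return max(0, required - with_q)


# (total institutions, institutions with a Q-number, target %)
print(q_numbers_needed(212, 52, 50))   # Brazil: 54
print(q_numbers_needed(180, 147, 90))  # Chile: 15
print(q_numbers_needed(14, 12, 100))   # Georgia: 2
```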

---

## Recent Major Improvements

### Tunisia - Regional Leader
- **Before**: 43 institutions, 1 Wikidata Q-number (2.3%)
- **After**: 70 institutions, 52 Wikidata Q-numbers (74.3%)
- **Improvement**: +27 institutions, +51 Q-numbers, +72 percentage points

### Georgia - New Country Addition
- **Before**: Not in dataset
- **After**: 14 institutions, 12 Wikidata Q-numbers (85.7%)
- **Impact**: Highest Wikidata coverage of any country below 100% listed in this summary

### Chile - Major Enhancement
- **Before**: 180 institutions, 97 Wikidata Q-numbers (53.9%)
- **After**: 180 institutions, 147 Wikidata Q-numbers (81.7%)
- **Improvement**: +50 Q-numbers, +27.8 percentage points

### Brazil - Steady Growth
- **Before**: 212 institutions, 29 Wikidata Q-numbers (13.7%)
- **After**: 212 institutions, 52 Wikidata Q-numbers (24.5%)
- **Improvement**: +23 Q-numbers, +10.8 percentage points

### Mexico - Expansion & Enrichment
- **Before**: 117 institutions, 0 Wikidata Q-numbers (0%)
- **After**: 192 institutions, 63 Wikidata Q-numbers (32.8%)
- **Improvement**: +75 institutions, +63 Q-numbers, +32.8 percentage points

---

## Next Steps

### Immediate (This Week)
1. ✅ Complete documentation updates (this file)
2. 📦 Archive superseded files to `archive/2025-11-11-pre-merge/`
3. 🌍 Start geocoding for Libya, Vietnam, Algeria, Georgia

### Short-Term (This Month)
4. 🇯🇵 Japan Batch 1 enrichment: Target 50 major institutions
5. 🇳🇱 Netherlands Batch 1 enrichment: Target 30 provincial museums
6. 🇧🇷 Brazil Batch 8 enrichment: Target 25 university collections
7. 🇲🇽 Mexico Batch 3 enrichment: Target 20 federal institutions

### Long-Term (Q1 2026)
8. 🌏 Expand to Southeast Asia (Vietnam, Thailand, Indonesia)
9. 🌍 Expand to Middle East (UAE, Saudi Arabia, Jordan)
10. 🌍 Expand to Africa (Egypt, Kenya, South Africa)

---

## Data Quality Tiers

| Tier | Description | Count | % |
|------|-------------|-------|---|
| **TIER_1** | Authoritative (registries) | 622 | 4.6% |
| **TIER_2** | Verified (websites) | 4,891 | 36.2% |
| **TIER_3** | Crowd-sourced (Wikidata) | 7,650 | 56.6% |
| **TIER_4** | Inferred (NLP extraction) | 342 | 2.5% |

**Goal**: Reduce TIER_4 to <1% through verification and website crawling.

---

## Technical References

### Schema Architecture
- **Version**: LinkML v0.2.1 (modular)
- **Modules**: 6 specialized modules (core, enums, provenance, collections, dutch, main)
- **Documentation**: `/Users/kempersc/apps/glam/docs/SCHEMA_MODULES.md`

### Persistent Identifiers (GHCID)
- **Format**: `{COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-Q{number}]`
- **UUID Strategies**: v5 (SHA-1 primary), v8 (SHA-256 secondary), v7 (database only)
- **Documentation**: `/Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md`
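The GHCID format and the primary v5 strategy listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the namespace UUID, the function names, and the sample component values (`NL`, `NH`, `AMS`, `M`, `RM`, `Q12345`) are all hypothetical, and the authoritative rules live in `PERSISTENT_IDENTIFIERS.md`.

```python
import uuid
from typing import Optional

# Hypothetical namespace; the project's real namespace UUID is not given here.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "ghcid.example.org")


def build_ghcid(country: str, region: str, city: str, inst_type: str,
                abbrev: str, q_number: Optional[int] = None) -> str:
    """Join the five mandatory components, plus the optional Wikidata suffix."""
    parts = [country, region, city, inst_type, abbrev]
    if any(not part for part in parts):
        raise ValueError("all five GHCID components are required")
    ghcid = "-".join(parts)
    if q_number is not None:
        ghcid += f"-Q{q_number}"
    return ghcid


def ghcid_uuid5(ghcid: str) -> uuid.UUID:
    """Derive a deterministic, SHA-1 based (version 5) UUID from the GHCID string."""
    return uuid.uuid5(EXAMPLE_NAMESPACE, ghcid)


example = build_ghcid("NL", "NH", "AMS", "M", "RM", q_number=12345)
print(example)               # NL-NH-AMS-M-RM-Q12345
print(ghcid_uuid5(example))  # same UUID on every run for the same input
```

Because a v5 UUID is a hash of the namespace and the name, the same GHCID string always maps to the same UUID, which is what makes it usable as a persistent identifier.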

### Ontology Alignment
- **TOOI**: Dutch government organizations
- **CPOV**: EU Core Public Organisation Vocabulary
- **Schema.org**: Web semantics
- **CIDOC-CRM**: Cultural heritage domain

---

## For More Information

| Document | Purpose |
|----------|---------|
| **UNIFIED_OVERVIEW.md** | Comprehensive project documentation (read this for details) |
| **UNIFICATION_REPORT.md** | Detailed merge analysis with country-by-country breakdowns |
| **DATASET_STATISTICS.yaml** | Machine-readable metrics (parse this for automation) |
| **ENRICHMENT_PROGRESS.md** | Historical enrichment tracking (batch-by-batch progress) |
| **README.md** | Quick reference index (start here for navigation) |

---

**Document Version**: 1.0
**Created**: November 11, 2025
**Maintained By**: GLAM Data Extraction Project

**Master Dataset Location**: `data/instances/all/unified_global_heritage_institutions.yaml`