glam/data/instances/all/UNIFICATION_SUMMARY.md
2025-11-30 23:30:29 +01:00

# Global Heritage Institutions - Unification Summary
**Last Updated**: November 11, 2025

**Quick Reference**: Post-Merge Statistics & File Locations

---
## Current Statistics (Post-Merge)
| Metric | Value |
|--------|-------|
| **Total Institutions** | 13,505 |
| **Countries Covered** | 18 |
| **Wikidata Coverage** | 56.6% (7,650 institutions) |
| **Geocoding Coverage** | 60.7% (8,192 institutions) |
| **Master File Size** | ~45 MB |
| **Schema Version** | LinkML v0.2.1 |
---
## Merge History
### November 11, 2025 - Major Enrichment Merge
| Dataset | Institutions | Wikidata % | Status |
|---------|--------------|------------|--------|
| Tunisia Enhanced | 70 (+27 new) | 74.3% | ✅ Merged |
| Georgia GLAM | 14 (new country) | 85.7% | ✅ Merged |
| Latin America Updated | 586 (updated) | 43.4% avg | ✅ Merged |

**Impact**:
- **+27 institutions** (13,478 → 13,505)
- **+153 Wikidata Q-numbers** (55.6% → 56.6%)
- **+1 country** (Georgia added)

**Notable Achievements**:
- 🎯 **Tunisia**: 2.3% → 74.3% Wikidata coverage (+72 pp!)
- 🎯 **Georgia**: 85.7% Wikidata coverage (new country)
- 🎯 **Chile**: 53.9% → 81.7% Wikidata coverage
- 🎯 **Mexico**: 117 → 192 institutions (+75)
- 🎯 **Six countries at 100%**: Belgium, USA, UK, Russia, Denmark, Luxembourg
---
## File Locations
### Master Dataset
**Primary File**:
```
data/instances/all/unified_global_heritage_institutions.yaml
```
- **Size**: ~45 MB
- **Format**: LinkML-compliant YAML
- **Institutions**: 13,505
- **Last Updated**: November 11, 2025

**Backup Files** (in same directory):
```
unified_global_heritage_institutions_backup_20251111_092645.yaml ← Pre-merge backup
unified_global_heritage_institutions.yaml.backup ← Latest backup
unified_global_heritage_institutions.yaml.backup2 ← Secondary backup
```
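
The timestamped pre-merge backup listed above appears to follow a `_backup_YYYYMMDD_HHMMSS` naming pattern. The following is a minimal sketch of creating such a backup before a merge, assuming that pattern holds; the helper names `backup_name` and `make_backup` are hypothetical and are not part of the project's tooling.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional
import shutil


def backup_name(master: Path, when: datetime) -> str:
    """Build '<stem>_backup_<YYYYMMDD>_<HHMMSS><suffix>' for the given file."""
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return f"{master.stem}_backup_{stamp}{master.suffix}"


def make_backup(master: Path, when: Optional[datetime] = None) -> Path:
    """Copy the master file into its own directory under a timestamped name."""
    when = when or datetime.now()
    target = master.with_name(backup_name(master, when))
    shutil.copy2(master, target)  # copy2 also preserves file timestamps
    return target
```

Calling `make_backup(Path("data/instances/all/unified_global_heritage_institutions.yaml"))` from the repository root would write the copy next to the master file.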
### Documentation Files
| File | Purpose | Size |
|------|---------|------|
| `UNIFIED_OVERVIEW.md` | Complete project documentation | ~22 KB |
| `UNIFICATION_REPORT.md` | Detailed merge analysis | ~10 KB |
| `UNIFICATION_SUMMARY.md` | Quick reference (this file) | ~8.5 KB |
| `DATASET_STATISTICS.yaml` | Machine-readable metrics | ~3.3 KB |
| `ENRICHMENT_PROGRESS.md` | Historical enrichment tracking | ~11 KB |
| `README.md` | Quick reference index | ~7.7 KB |
---
## Superseded Files (Archived)
The following files were merged into the unified dataset and archived:
### Archived on November 11, 2025
**Location**: `data/instances/archive/2025-11-11-pre-merge/`
| File | Source Location | Reason |
|------|-----------------|--------|
| `tunisian_institutions_enhanced.yaml` | `tunisia/` | Merged into unified dataset |
| `georgia_glam_institutions_enriched.yaml` | Root directory | Merged into unified dataset |
| `latin_american_institutions_AUTHORITATIVE.yaml` | Root directory | Updated and merged |

**Archive Documentation**: See `data/instances/archive/2025-11-11-pre-merge/SUPERSEDED.md`

---
## Countries with 100% Wikidata Coverage
As of November 11, 2025, the following countries have achieved complete Wikidata coverage:
| Country | Institutions | Wikidata Q-numbers | Geocoded |
|---------|--------------|-------------------|----------|
| Belgium | 7 | 7 (100%) | 7 (100%) |
| USA | 7 | 7 (100%) | 7 (100%) |
| UK | 4 | 4 (100%) | 3 (75%) |
| Russia | 1 | 1 (100%) | 1 (100%) |
| Denmark | 1 | 1 (100%) | 1 (100%) |
| Luxembourg | 1 | 1 (100%) | 1 (100%) |
---
## Top 10 Countries by Institution Count
| Rank | Country | Institutions | Wikidata % | Geocoded % |
|------|---------|--------------|------------|------------|
| 1 | Japan | 12,065 | 58.8% | 58.8% |
| 2 | Netherlands | 622 | 31.0% | 99.8% |
| 3 | Brazil | 212 | 24.5% | 40.1% |
| 4 | Mexico | 192 | 32.8% | 78.1% |
| 5 | Chile | 180 | 81.7% | 89.4% |
| 6 | Tunisia | 70 | 74.3% | 84.3% |
| 7 | Libya | 50 | 16.0% | 0.0% |
| 8 | Vietnam | 21 | 38.1% | 0.0% |
| 9 | Algeria | 19 | 5.3% | 0.0% |
| 10 | Georgia | 14 | 85.7% | 0.0% |
---
## Enrichment Needs Summary
**Total Institutions Needing Enrichment**: 5,855 (43.4%)
### By Priority Level
**🔴 High Priority** (largest absolute gaps in Q-number coverage):
- Japan: 4,974 institutions need Q-numbers (58.8% → target 70%)
- Netherlands: 429 institutions need Q-numbers (31.0% → target 50%)
- Brazil: 160 institutions need Q-numbers (24.5% → target 50%)

**🟡 Medium Priority** (mid-sized absolute gaps):
- Mexico: 129 institutions need Q-numbers (32.8% → target 50%)
- Libya: 42 institutions need Q-numbers (16.0% → target 50%)
- Chile: 33 institutions need Q-numbers (81.7% → target 90%)

**🟢 Low Priority** (small absolute gaps):
- Tunisia: 18 institutions need Q-numbers (74.3% → target 90%)
- Vietnam: 13 institutions need Q-numbers (38.1% → target 75%)
- Georgia: 2 institutions need Q-numbers (85.7% → target 100%)
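
The "need Q-numbers" counts above are the institutions with no Q-number at all; reaching a stated target usually requires fewer new links than that. As a rough planning aid, the sketch below converts a target percentage into the number of additional Q-numbers required. It assumes coverage is simply linked institutions divided by total institutions, and `q_numbers_needed` is an illustrative helper, not a project function.

```python
def q_numbers_needed(total: int, linked: int, target_pct: int) -> int:
    """Extra Q-numbers needed so that linked/total reaches target_pct percent."""
    # Integer ceiling division avoids floating-point rounding surprises.
    required = -(-total * target_pct // 100)
    return max(0, required - linked)


# (total institutions, currently linked, target %) taken from this summary
countries = {
    "Brazil": (212, 52, 50),
    "Chile": (180, 147, 90),
    "Tunisia": (70, 52, 90),
    "Georgia": (14, 12, 100),
}

for name, (total, linked, target) in countries.items():
    extra = q_numbers_needed(total, linked, target)
    print(f"{name}: {extra} more Q-numbers to reach {target}%")
```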
---
## Recent Major Improvements
### Tunisia - Regional Leader
- **Before**: 43 institutions, 1 Wikidata Q-number (2.3%)
- **After**: 70 institutions, 52 Wikidata Q-numbers (74.3%)
- **Improvement**: +27 institutions, +51 Q-numbers, +72 percentage points
### Georgia - New Country Addition
- **Before**: Not in dataset
- **After**: 14 institutions, 12 Wikidata Q-numbers (85.7%)
- **Impact**: Highest Wikidata coverage among the ten largest countries by institution count (the six countries at 100% are all smaller)
### Chile - Major Enhancement
- **Before**: 180 institutions, 97 Wikidata Q-numbers (53.9%)
- **After**: 180 institutions, 147 Wikidata Q-numbers (81.7%)
- **Improvement**: +50 Q-numbers, +27.8 percentage points
### Brazil - Steady Growth
- **Before**: 212 institutions, 29 Wikidata Q-numbers (13.7%)
- **After**: 212 institutions, 52 Wikidata Q-numbers (24.5%)
- **Improvement**: +23 Q-numbers, +10.8 percentage points
### Mexico - Expansion & Enrichment
- **Before**: 117 institutions, 0 Wikidata Q-numbers (0%)
- **After**: 192 institutions, 63 Wikidata Q-numbers (32.8%)
- **Improvement**: +75 institutions, +63 Q-numbers, +32.8 percentage points
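
The before/after figures in this section can be reproduced from the raw counts. The sketch below assumes percentages are rounded to one decimal place before the percentage-point difference is taken, which matches the numbers reported here; the function names are illustrative only.

```python
def coverage_pct(linked: int, total: int) -> float:
    """Wikidata coverage as a percentage, rounded to one decimal place."""
    return round(100 * linked / total, 1) if total else 0.0


def pp_change(before: tuple, after: tuple) -> float:
    """Percentage-point change between two (linked, total) snapshots."""
    return round(coverage_pct(*after) - coverage_pct(*before), 1)


# (linked, total) before and after the merge, as listed in this section
snapshots = {
    "Tunisia": ((1, 43), (52, 70)),
    "Chile": ((97, 180), (147, 180)),
    "Brazil": ((29, 212), (52, 212)),
    "Mexico": ((0, 117), (63, 192)),
}

for name, (before, after) in snapshots.items():
    print(f"{name}: {coverage_pct(*before)}% -> {coverage_pct(*after)}% "
          f"({pp_change(before, after):+.1f} pp)")
```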
---
## Next Steps
### Immediate (This Week)
1. ✅ Complete documentation updates (this file)
2. 📦 Archive superseded files to `data/instances/archive/2025-11-11-pre-merge/`
3. 🌍 Start geocoding for Libya, Vietnam, Algeria, Georgia
### Short-Term (This Month)
4. 🇯🇵 Japan Batch 1 enrichment: Target 50 major institutions
5. 🇳🇱 Netherlands Batch 1 enrichment: Target 30 provincial museums
6. 🇧🇷 Brazil Batch 8 enrichment: Target 25 university collections
7. 🇲🇽 Mexico Batch 3 enrichment: Target 20 federal institutions
### Long-Term (Q1 2026)
8. 🌏 Expand to Southeast Asia (Vietnam, Thailand, Indonesia)
9. 🌍 Expand to Middle East (UAE, Saudi Arabia, Jordan)
10. 🌍 Expand to Africa (Egypt, Kenya, South Africa)
---
## Data Quality Tiers
| Tier | Description | Count | % |
|------|-------------|-------|---|
| **TIER_1** | Authoritative (registries) | 622 | 4.6% |
| **TIER_2** | Verified (websites) | 4,891 | 36.2% |
| **TIER_3** | Crowd-sourced (Wikidata) | 7,650 | 56.6% |
| **TIER_4** | Inferred (NLP extraction) | 342 | 2.5% |

**Goal**: Reduce TIER_4 to <1% through verification and website crawling.

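
To make the goal concrete, the sketch below works out how many TIER_4 records would have to move to a higher tier for TIER_4 to fall strictly below 1%. It assumes the total number of institutions stays fixed at the current tier counts; `max_count_below` is a hypothetical helper.

```python
tiers = {"TIER_1": 622, "TIER_2": 4891, "TIER_3": 7650, "TIER_4": 342}
total = sum(tiers.values())


def max_count_below(total: int, pct: int) -> int:
    """Largest count that is strictly below pct percent of total."""
    # count / total < pct / 100  is equivalent to  count * 100 < pct * total
    return (pct * total - 1) // 100


allowed = max_count_below(total, 1)
to_verify = max(0, tiers["TIER_4"] - allowed)
print(f"Total: {total}, TIER_4 allowed: {allowed}, records to verify: {to_verify}")
```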
---
## Technical References
### Schema Architecture
- **Version**: LinkML v0.2.1 (modular)
- **Modules**: 6 specialized modules (core, enums, provenance, collections, dutch, main)
- **Documentation**: `docs/SCHEMA_MODULES.md`
### Persistent Identifiers (GHCID)
- **Format**: `{COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]`
- **Collision Suffix**: Native language institution name in snake_case (NOT Wikidata Q-numbers)
- **UUID Strategies**: v5 (SHA-1 primary), v8 (SHA-256 secondary), v7 (database only)
- **Documentation**: `docs/PERSISTENT_IDENTIFIERS.md`
- **Collision Resolution**: `docs/plan/global_glam/07-ghcid-collision-resolution.md`
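
The GHCID format and the primary UUID v5 strategy can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the namespace UUID, the component values in the usage note, and the exact snake_case rules (for example, whether diacritics are kept) are all hypothetical. The authoritative rules live in the documents referenced above.

```python
import re
import unicodedata
import uuid

# Hypothetical namespace for illustration only; the project's real
# namespace UUID is not given in this summary.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid")


def snake_case(name: str) -> str:
    """Lowercase a native-language name and join its words with underscores."""
    normalized = unicodedata.normalize("NFC", name).lower()
    # Unicode letters and digits are kept; punctuation and whitespace
    # act as word separators.
    return "_".join(re.findall(r"[^\W_]+", normalized))


def build_ghcid(country, region, city, inst_type, abbrev, native_name=None):
    """Assemble {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name}]."""
    parts = [country, region, city, inst_type, abbrev]
    if native_name:  # collision suffix is appended only when supplied
        parts.append(snake_case(native_name))
    return "-".join(parts)


def ghcid_uuid5(ghcid: str) -> uuid.UUID:
    """Derive a deterministic, SHA-1 based UUID v5 from a GHCID string."""
    return uuid.uuid5(EXAMPLE_NAMESPACE, ghcid)
```

With made-up components, `build_ghcid("NL", "NH", "AMS", "M", "RM", "Rijksmuseum Amsterdam")` yields `NL-NH-AMS-M-RM-rijksmuseum_amsterdam`. Because UUID v5 is name-based, the same GHCID always maps to the same UUID within a given namespace.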
### Ontology Alignment
- **TOOI**: Dutch government organizations
- **CPOV**: EU Core Public Organisation Vocabulary
- **Schema.org**: Web semantics
- **CIDOC-CRM**: Cultural heritage domain
---
## For More Information
| Document | Purpose |
|----------|---------|
| **UNIFIED_OVERVIEW.md** | Comprehensive project documentation (read this for details) |
| **UNIFICATION_REPORT.md** | Detailed merge analysis with country-by-country breakdowns |
| **DATASET_STATISTICS.yaml** | Machine-readable metrics (parse this for automation) |
| **ENRICHMENT_PROGRESS.md** | Historical enrichment tracking (batch-by-batch progress) |
| **README.md** | Quick reference index (start here for navigation) |
---
**Document Version**: 1.0

**Created**: November 11, 2025

**Maintained By**: GLAM Data Extraction Project

**Master Dataset Location**: `data/instances/all/unified_global_heritage_institutions.yaml`