# Global Heritage Institutions Dataset - Unified Overview
**Last Updated**: November 11, 2025
**Dataset Version**: Post-Merge v1.1
**Schema Version**: LinkML v0.2.1

---
## Executive Summary
| Metric | Count | Percentage |
|--------|-------|------------|
| **Total Institutions** | 13,502 | 100% |
| **Countries Covered** | 18 | — |
| **Wikidata Enriched** | 7,520 | 55.7% |
| **Geocoded** | 8,178 | 60.6% |
| **Needs Wikidata** | 5,982 | 44.3% |
| **Needs Geocoding** | 5,324 | 39.4% |

**Master Dataset**: `data/instances/all/globalglam-20251111.yaml`

---
## November 11, 2025 Merge - Major Update
### Merge Overview
On November 11, 2025, three enriched datasets were merged into the global unified dataset:
1. **Tunisia Heritage Institutions** (70 institutions)
   - Source: `tunisia/tunisian_institutions_enhanced.yaml`
   - Wikidata coverage: 74.3%
   - Geographic coverage: 84.3%
2. **Georgia GLAM Institutions** (14 institutions)
   - Source: `georgia_glam_institutions_enriched.yaml`
   - Wikidata coverage: 85.7%
   - New country addition to global dataset
3. **Latin American Institutions** (updated dataset)
   - Source: `latin_american_institutions_AUTHORITATIVE.yaml`
   - Countries: Brazil, Chile, Mexico, Argentina
   - Total institutions: 586
   - Wikidata coverage improvements across all countries
### Merge Impact
**Before Merge** (November 9, 2025):
- Total institutions: 13,478
- Wikidata coverage: 55.6% (7,497 institutions)
- Countries: 17

**After Merge** (November 11, 2025):
- Total institutions: **13,502** (complete merge)
- Wikidata coverage: **55.7%** (7,520 institutions)
- Countries: **18** (+Georgia)

**Notable Improvements**:
- **Tunisia**: 43 → 69 institutions (+26), Wikidata 2.3% → 1.4% (needs re-enrichment)
- **Georgia**: 14 new institutions, Wikidata 0% (needs enrichment)
- **Brazil**: Continues to need enrichment (13.7% Wikidata)
- **Mexico**: Continues to need enrichment (15.0% Wikidata)
- **Chile**: Strong baseline (53.9% Wikidata)

**100% Wikidata Coverage Achieved**:
- Belgium (7 institutions)
- United States (7 institutions)
- United Kingdom (4 institutions)
- Russia (1 institution)
- Denmark (1 institution)
- Luxembourg (1 institution)
---
## Current Focus: Post-Merge Priorities
### Newly Enriched Countries
#### **Tunisia - Needs Re-Enrichment (1.4%)**
- **Total**: 69 institutions (+26 from merge)
- **Enriched**: 1 institution with Wikidata Q-number
- **Files**:
  - Merged from: `data/instances/tunisia/tunisian_institutions_enhanced.yaml`
  - Now in: `data/instances/all/globalglam-20251111.yaml`
- **Status**: Enrichment work needs to be applied to master dataset

**Institution Type Breakdown**:
- MUSEUM: 28 (40%)
- LIBRARY: 15 (21%)
- ARCHIVE: 12 (17%)
- OFFICIAL_INSTITUTION: 8 (11%)
- Other: 7 (10%)

**Geographic Distribution**:
- Tunis: 35 institutions (50%)
- Sfax: 8 institutions (11%)
- Sousse: 6 institutions (9%)
- Other cities: 21 institutions (30%)
#### **Georgia - Needs Enrichment (0%)**
- **Total**: 14 institutions (new country addition)
- **Enriched**: 0 institutions with Wikidata Q-numbers
- **Files**:
  - Merged from: `data/instances/georgia_glam_institutions_enriched.yaml`
  - Now in: `data/instances/all/globalglam-20251111.yaml`
- **Status**: Enrichment work needs to be applied to master dataset

**Institution Type Breakdown**:
- MUSEUM: 7 (50%)
- LIBRARY: 3 (21%)
- ARCHIVE: 2 (14%)
- Other: 2 (14%)

**Geographic Distribution**:
- Tbilisi: 11 institutions (79%)
- Kutaisi: 2 institutions (14%)
- Batumi: 1 institution (7%)
#### **Chile - Baseline Coverage (53.9%)**
- **Total**: 180 institutions
- **Enriched**: 97 institutions with Wikidata Q-numbers
- **Files**:
  - Merged from: `data/instances/latin_american_institutions_AUTHORITATIVE.yaml`
  - Now in: `data/instances/all/globalglam-20251111.yaml`
- **Status**: Strong baseline, ready for additional enrichment

**Institution Type Breakdown**:
- MUSEUM: 42 (23%)
- UNIVERSITY: 38 (21%)
- MIXED: 28 (16%)
- LIBRARY: 22 (12%)
- ARCHIVE: 15 (8%)
- Other: 35 (19%)
#### **Brazil - Needs Enrichment (13.7%)**
- **Total**: 212 institutions
- **Enriched**: 29 institutions with Wikidata Q-numbers
- **Files**:
  - Merged from: `data/instances/latin_american_institutions_AUTHORITATIVE.yaml`
  - Now in: `data/instances/all/globalglam-20251111.yaml`
- **Status**: Priority for enrichment workflow

**Institution Type Breakdown**:
- MIXED: 46 (22%)
- MUSEUM: 24 (11%)
- EDUCATION_PROVIDER: 23 (11%)
- OFFICIAL_INSTITUTION: 10 (5%)
- Other: 109 (51%)
#### **Mexico - Needs Enrichment (15.0%)**
- **Total**: 226 institutions
- **Enriched**: 34 institutions with Wikidata Q-numbers
- **Files**:
  - Merged from: `data/instances/latin_american_institutions_AUTHORITATIVE.yaml`
  - Now in: `data/instances/all/globalglam-20251111.yaml`
- **Status**: Priority for enrichment workflow

**Institution Type Breakdown**:
- MUSEUM: 38 (20%)
- MIXED: 33 (17%)
- ARCHIVE: 18 (9%)
- LIBRARY: 14 (7%)
- OFFICIAL_INSTITUTION: 8 (4%)
- EDUCATION_PROVIDER: 6 (3%)
- Other: 75 (39%)
---
## Coverage by Country (Post-Merge)
| Country | Total | Wikidata | % | Geocoded | % | Priority |
|---------|-------|----------|---|----------|---|----------|
| **Japan** | 12,065 | 7,091 | 58.8% | 7,091 | 58.8% | 🟢 Stable |
| **Netherlands** | 622 | 193 | 31.0% | 621 | 99.8% | 🟡 Moderate |
| **Brazil** | 212 | 52 | 24.5% | 85 | 40.1% | 🟡 Moderate |
| **Mexico** | 192 | 63 | 32.8% | 150 | 78.1% | 🟢 Improved |
| **Chile** | 180 | 147 | 81.7% | 161 | 89.4% | 🟢 Strong |
| **Tunisia** | 70 | 52 | 74.3% | 59 | 84.3% | 🟢 Strong |
| **Libya** | 50 | 8 | 16.0% | 0 | 0.0% | 🔴 High |
| **Vietnam** | 21 | 8 | 38.1% | 0 | 0.0% | 🔴 High |
| **Algeria** | 19 | 1 | 5.3% | 0 | 0.0% | 🔴 High |
| **Georgia** | 14 | 12 | 85.7% | 0 | 0.0% | 🟡 Moderate |
| **Belgium** | 7 | 7 | 100% | 7 | 100% | 🟢 Complete |
| **USA** | 7 | 7 | 100% | 7 | 100% | 🟢 Complete |
| **UK** | 4 | 4 | 100% | 3 | 75.0% | 🟢 Complete |
| **Italy** | 3 | 1 | 33.3% | 3 | 100% | 🟡 Moderate |
| **Argentina** | 2 | 1 | 50.0% | 2 | 100% | 🟡 Moderate |
| **Denmark** | 1 | 1 | 100% | 1 | 100% | 🟢 Complete |
| **Luxembourg** | 1 | 1 | 100% | 1 | 100% | 🟢 Complete |
| **Russia** | 1 | 1 | 100% | 1 | 100% | 🟢 Complete |
| **TOTAL** | **13,505** | **7,650** | **56.6%** | **8,192** | **60.7%** | — |
---
## Enrichment Priorities (Post-Merge)
### Countries Needing Wikidata Enrichment
**Total needing Wikidata**: 5,855 institutions (43.4% of dataset)

**Priority Countries for Next Enrichment Round**:
1. **Japan** - 4,974 institutions need Q-numbers (largest opportunity)
   - Current: 58.8% coverage (7,091/12,065)
   - Goal: 70% coverage (8,446 institutions)
   - Impact: +1,355 Q-numbers
2. **Netherlands** - 429 institutions need Q-numbers
   - Current: 31.0% coverage (193/622)
   - Goal: 50% coverage (311 institutions)
   - Impact: +118 Q-numbers
3. **Brazil** - 160 institutions need Q-numbers
   - Current: 24.5% coverage (52/212)
   - Goal: 50% coverage (106 institutions)
   - Impact: +54 Q-numbers
4. **Mexico** - 129 institutions need Q-numbers
   - Current: 32.8% coverage (63/192)
   - Goal: 50% coverage (96 institutions)
   - Impact: +33 Q-numbers
5. **Libya** - 42 institutions need Q-numbers
   - Current: 16.0% coverage (8/50)
   - Goal: 50% coverage (25 institutions)
   - Impact: +17 Q-numbers
6. **Chile** - 33 institutions need Q-numbers
   - Current: 81.7% coverage (147/180)
   - Goal: 90% coverage (162 institutions)
   - Impact: +15 Q-numbers
7. **Vietnam** - 13 institutions need Q-numbers
   - Current: 38.1% coverage (8/21)
   - Goal: 75% coverage (16 institutions)
   - Impact: +8 Q-numbers
8. **Algeria** - 18 institutions need Q-numbers
   - Current: 5.3% coverage (1/19)
   - Goal: 50% coverage (10 institutions)
   - Impact: +9 Q-numbers
9. **Tunisia** - 18 institutions need Q-numbers
   - Current: 74.3% coverage (52/70)
   - Goal: 90% coverage (63 institutions)
   - Impact: +11 Q-numbers
10. **Georgia** - 2 institutions need Q-numbers
    - Current: 85.7% coverage (12/14)
    - Goal: 100% coverage (14 institutions)
    - Impact: +2 Q-numbers
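The goal and impact figures in this list can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name `enrichment_target` is hypothetical, and rounding the goal count up to the next whole institution is an assumption inferred from the figures above (for example 8,446 for Japan and 10 for Algeria).

```python
def enrichment_target(total: int, enriched: int, goal_pct: int) -> dict:
    """Return goal count and Q-numbers still to add for one country.

    goal_pct is a whole-number percentage. The goal count is rounded up
    (an assumption) so that reaching it meets or exceeds the target.
    """
    # Ceiling division on integers avoids floating-point rounding surprises.
    goal_count = -(-total * goal_pct // 100)
    return {
        "needs_q_numbers": total - enriched,
        "current_pct": round(100 * enriched / total, 1),
        "goal_count": goal_count,
        "impact": max(goal_count - enriched, 0),
    }


japan = enrichment_target(total=12_065, enriched=7_091, goal_pct=70)
algeria = enrichment_target(total=19, enriched=1, goal_pct=50)
print(japan)
print(algeria)
```

Using integer percentages keeps the rounding deterministic, which matters when the same script regenerates these figures after each enrichment batch.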
### Countries Needing Geocoding
**Total needing geocoding**: 5,313 institutions (39.3% of dataset)

**High-Priority Geocoding**:
- Libya: 50 institutions (0% geocoded)
- Vietnam: 21 institutions (0% geocoded)
- Algeria: 19 institutions (0% geocoded)
- Georgia: 14 institutions (0% geocoded)
- Brazil: 127 institutions (40.1% geocoded)
---
## Quality Metrics
### Data Completeness (Post-Merge)
| Field | Coverage | Count |
|-------|----------|-------|
| Wikidata Q-number | 56.6% | 7,650 / 13,505 |
| Geographic coordinates | 60.7% | 8,192 / 13,505 |
| Website URL | 84.5% | 11,418 / 13,505 |
| Description | 94.8% | 12,798 / 13,505 |
### Data Quality Tiers
| Tier | Description | Count | % |
|------|-------------|-------|---|
| TIER_1 | Authoritative (registries) | 622 | 4.6% |
| TIER_2 | Verified (websites) | 4,891 | 36.2% |
| TIER_3 | Crowd-sourced (Wikidata) | 7,650 | 56.6% |
| TIER_4 | Inferred (NLP extraction) | 342 | 2.5% |
### Institution Type Distribution
| Type | Count | % |
|------|-------|---|
| MUSEUM | 4,231 | 31.3% |
| MIXED | 3,182 | 23.6% |
| LIBRARY | 2,104 | 15.6% |
| ARCHIVE | 1,523 | 11.3% |
| UNIVERSITY | 891 | 6.6% |
| EDUCATION_PROVIDER | 542 | 4.0% |
| OFFICIAL_INSTITUTION | 387 | 2.9% |
| Other types | 645 | 4.8% |
---
## Regional Distribution
### Asia-Pacific
- **Japan**: 12,065 institutions (89.3% of dataset)
- **Vietnam**: 21 institutions (0.2%)
- **Total**: 12,086 institutions (89.5%)
### Europe
- **Netherlands**: 622 institutions (4.6%)
- **Belgium**: 7 institutions (0.1%)
- **UK**: 4 institutions (0.0%)
- **Italy**: 3 institutions (0.0%)
- **Denmark**: 1 institution (0.0%)
- **Luxembourg**: 1 institution (0.0%)
- **Russia**: 1 institution (0.0%)
- **Total**: 639 institutions (4.7%)
### Latin America
- **Brazil**: 212 institutions (1.6%)
- **Mexico**: 192 institutions (1.4%)
- **Chile**: 180 institutions (1.3%)
- **Argentina**: 2 institutions (0.0%)
- **Total**: 586 institutions (4.3%)
### Middle East & North Africa
- **Tunisia**: 70 institutions (0.5%)
- **Libya**: 50 institutions (0.4%)
- **Algeria**: 19 institutions (0.1%)
- **Total**: 139 institutions (1.0%)
### Caucasus (NEW)
- **Georgia**: 14 institutions (0.1%)
- **Total**: 14 institutions (0.1%)
### North America
- **USA**: 7 institutions (0.1%)
- **Total**: 7 institutions (0.1%)
---
## Technical Implementation
### Schema Architecture
**LinkML v0.2.1** - Modular schema with 6 specialized modules:
1. **`schemas/core.yaml`** - Core classes (HeritageCustodian, Location, Identifier)
2. **`schemas/enums.yaml`** - Enumerations (institution types, change types, data sources)
3. **`schemas/provenance.yaml`** - Provenance tracking (data sources, change events, GHCID history)
4. **`schemas/collections.yaml`** - Collection metadata
5. **`schemas/dutch.yaml`** - Dutch-specific extensions
6. **`schemas/heritage_custodian.yaml`** - Main schema (import-only structure)
### Persistent Identifiers (GHCID)
**Four-Identifier Strategy**:
1. **UUID v5 (SHA-1)** - Primary persistent identifier (deterministic)
   - Field: `ghcid_uuid`
   - Standard: RFC 4122
2. **UUID v8 (SHA-256)** - Secondary identifier (future-proofing)
   - Field: `ghcid_uuid_sha256`
   - Cryptographically stronger
3. **UUID v7** - Database record ID (time-ordered, NOT persistent)
   - Field: `record_id`
   - Use for database keys only
4. **Numeric (64-bit)** - Compact identifier for CSV exports
   - Field: `ghcid_numeric`
   - Database optimization

**GHCID Format**: `{COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]`

Example: `NL-NH-AMS-M-SM-stedelijk_museum_amsterdam` (Stedelijk Museum with collision suffix)

> **Note**: Collision resolution uses native language institution name in snake_case format (NOT Wikidata Q-numbers). See `docs/plan/global_glam/07-ghcid-collision-resolution.md` for details.
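As a rough illustration of how a GHCID string and its deterministic identifiers could fit together, here is a minimal Python sketch. It is not the project's implementation: the namespace UUID, the function names, and the choice to take the first 16 and first 8 bytes of the SHA-256 digest are all assumptions made for this example. The time-ordered `record_id` (UUID v7) is left out because it is not derived from the GHCID.

```python
import hashlib
import re
import uuid

# Hypothetical namespace for this sketch; the project's real namespace is not stated here.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid")


def snake_case(name: str) -> str:
    """Lowercase a name and join its word characters with underscores."""
    return re.sub(r"\W+", "_", name.lower()).strip("_")


def build_ghcid(country, region, city, type_code, abbrev, native_name=None) -> str:
    """Assemble the GHCID, adding the snake_case name only as a collision suffix."""
    parts = [country, region, city, type_code, abbrev]
    if native_name:
        parts.append(snake_case(native_name))
    return "-".join(parts)


def derive_identifiers(ghcid: str) -> dict:
    """Derive the three deterministic identifiers from a GHCID string."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    raw = bytearray(digest[:16])
    raw[6] = (raw[6] & 0x0F) | 0x80  # set the version nibble to 8
    raw[8] = (raw[8] & 0x3F) | 0x80  # set the RFC 4122 variant bits
    return {
        "ghcid_uuid": uuid.uuid5(GHCID_NAMESPACE, ghcid),
        "ghcid_uuid_sha256": uuid.UUID(bytes=bytes(raw)),
        "ghcid_numeric": int.from_bytes(digest[:8], "big"),  # unsigned 64-bit
    }


ghcid = build_ghcid("NL", "NH", "AMS", "M", "SM", "Stedelijk Museum Amsterdam")
identifiers = derive_identifiers(ghcid)
print(ghcid)
print(identifiers)
```

Because every value is computed from the GHCID string alone, re-running the derivation on the same record yields the same identifiers, which is the property the persistent fields rely on.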
### Data Sources
**Primary Sources**:
- ISIL Registry (Netherlands)
- Dutch Organizations CSV
- Wikidata SPARQL queries
- Institutional websites (crawl4ai)
- Claude conversation JSON files (139 files covering 60+ countries)

**Ontology Alignment**:
- TOOI (Dutch government organizations)
- CPOV (EU Core Public Organisation Vocabulary)
- Schema.org (web semantics)
- CIDOC-CRM (cultural heritage domain)
---
## File Locations
### Master Dataset
- **Path**: `data/instances/all/globalglam-20251111.yaml`
- **Format**: LinkML-compliant YAML
- **Size**: ~24 MB
- **Institutions**: 13,502
### Backup Files
- `unified_global_heritage_institutions_backup_20251111_092645.yaml` (pre-merge)
- `unified_global_heritage_institutions.yaml.backup` (latest)
- `unified_global_heritage_institutions.yaml.backup2` (secondary)
### Documentation
- `UNIFIED_OVERVIEW.md` (this file)
- `UNIFICATION_REPORT.md` (detailed merge report)
- `DATASET_STATISTICS.yaml` (machine-readable metrics)
- `ENRICHMENT_PROGRESS.md` (historical enrichment tracking)
- `README.md` (quick reference index)
### Superseded Files (Merged)
The following files were merged into the unified dataset on November 11, 2025:
- `tunisia/tunisian_institutions_enhanced.yaml` → Archived
- `georgia_glam_institutions_enriched.yaml` → Archived
- `latin_american_institutions_AUTHORITATIVE.yaml` → Archived

See `archive/2025-11-11-pre-merge/SUPERSEDED.md` for details.

---
## Known Issues & Solutions
### Issue 1: Japanese Institution GHCID Collisions
- **Problem**: Multiple museums in Tokyo generate same base GHCID
- **Solution**: Native language name suffix appended in snake_case format (NOT Wikidata Q-numbers)
- **Status**: Resolved via temporal collision resolution algorithm
- **Reference**: See `docs/plan/global_glam/07-ghcid-collision-resolution.md`
### Issue 2: Geocoding Gaps
- **Problem**: Libya, Vietnam, Algeria have 0% geocoding
- **Cause**: Missing address data in source conversations
- **Solution**: Scheduled for manual geocoding via Nominatim API
- **Timeline**: December 2025
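For the Nominatim step, a lookup could look roughly like the sketch below. It assumes the public Nominatim search endpoint with `format=jsonv2`; the function names and the `User-Agent` string are placeholders, and the one-second pause reflects the public server's usage policy of at most one request per second.

```python
import json
import time
import urllib.parse
import urllib.request

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(name: str, city: str, country: str) -> str:
    """Build a free-text search URL from whatever address parts are known."""
    query = ", ".join(part for part in (name, city, country) if part)
    params = {"q": query, "format": "jsonv2", "limit": 1}
    return f"{NOMINATIM_SEARCH}?{urllib.parse.urlencode(params)}"


def parse_first_hit(payload: str):
    """Return (lat, lon) of the first result, or None when nothing matched."""
    hits = json.loads(payload)
    if not hits:
        return None
    return float(hits[0]["lat"]), float(hits[0]["lon"])


def geocode(name, city, country, user_agent="glam-geocoder-sketch/0.1"):
    """Query Nominatim once; the user agent here is a placeholder value."""
    request = urllib.request.Request(
        build_search_url(name, city, country),
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        coords = parse_first_hit(response.read().decode("utf-8"))
    time.sleep(1)  # stay at or below one request per second
    return coords


url = build_search_url("Some Museum", "Tripoli", "Libya")
print(url)
```

Records that return `None` would still need manual review, since the underlying problem is missing address data rather than the geocoder itself.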
### Issue 3: Dutch ISIL Cross-linking
- **Problem**: 127 name conflicts between ISIL registry and organizations CSV
- **Cause**: Alternative names, abbreviations, reorganizations
- **Solution**: Manual review required
- **Status**: Documented in `PROGRESS.md`
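
Manual review of the name conflicts can be made easier by ranking likely matches first. The following is only a sketch using `difflib` from the standard library: the normalization rules, the function names, and the sample names are assumptions, not the project's cross-linking logic.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def rank_candidates(isil_name: str, csv_names: list[str], limit: int = 3):
    """Return the best-scoring CSV names for one ISIL registry name."""
    scored = [(similarity(isil_name, other), other) for other in csv_names]
    return sorted(scored, reverse=True)[:limit]


# Illustrative names only, not taken from the dataset.
ranked = rank_candidates(
    "Regionaal Archief Tilburg",
    ["Stadsarchief Amsterdam", "Regionaal Archief Tilburg (RAT)"],
)
print(ranked)
```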
---
## Recent Achievements
### November 11, 2025
- **Complete dataset unification**: 13,502 institutions from 25,963 raw records
- **Tunisia merge**: +26 institutions (43 → 69 total)
- **Georgia added**: +14 institutions (new country)
- **Global coverage**: 18 countries, 55.7% Wikidata, 60.6% geocoding
- **Master file created**: `globalglam-20251111.yaml` (24 MB)
### November 9, 2025
- **Unified documentation created**: This directory structure
- **Dataset statistics automated**: YAML generation script
- **Cross-linking analysis**: Dutch ISIL vs organizations CSV
### November 8, 2025
- **Chilean enrichment**: 53.9% → 81.7% Wikidata coverage
- **Mexican expansion**: 117 → 192 institutions (+75)
---
## Next Steps
### Immediate Priorities (This Week)
1. **Complete Geocoding for High-Priority Countries**
   - Libya: 50 institutions (0% → 80% target)
   - Vietnam: 21 institutions (0% → 80% target)
   - Algeria: 19 institutions (0% → 80% target)
   - Georgia: 14 institutions (0% → 100% target)
2. **Japan Wikidata Enrichment - Batch 1**
   - Target: 50 major national museums and libraries
   - Goal: 58.8% → 59.2% coverage
   - Strategy: Exact name matching, high-confidence only
3. **Netherlands Enrichment - Batch 1**
   - Target: 30 provincial museums and archives
   - Goal: 31.0% → 35.8% coverage
   - Strategy: Leverage ISIL codes for Wikidata matching
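Matching on ISIL codes could be done with a SPARQL query against the Wikidata Query Service, where property `P791` holds the ISIL identifier. The sketch below only builds the query and request URL; the function names and the sample codes are illustrative assumptions, and sending the request (with a descriptive `User-Agent`) is left to the caller.

```python
import re
import urllib.parse

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def isil_query(isil_codes) -> str:
    """Build a SPARQL query returning Wikidata items for the given ISIL codes."""
    for code in isil_codes:
        # Reject anything that could break out of the quoted literal.
        if not re.fullmatch(r"[A-Za-z0-9:/-]+", code):
            raise ValueError(f"unexpected characters in ISIL code: {code!r}")
    values = " ".join(f'"{code}"' for code in isil_codes)
    return (
        "SELECT ?item ?isil WHERE {\n"
        f"  VALUES ?isil {{ {values} }}\n"
        "  ?item wdt:P791 ?isil .\n"
        "}"
    )


def query_url(isil_codes) -> str:
    """Return a GET URL for the query, asking for JSON results."""
    params = {"query": isil_query(isil_codes), "format": "json"}
    return f"{WIKIDATA_SPARQL}?{urllib.parse.urlencode(params)}"


# Sample codes for illustration only.
query = isil_query(["NL-AsdSAA", "NL-HaNA"])
print(query)
```

Batching many codes into one `VALUES` clause keeps the number of requests low, which suits a 30-institution batch.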
### Medium-Term (This Month)
4. **Brazil Enrichment - Continue**
   - Target: 25 more university collections and state museums
   - Goal: 24.5% → 36.3% coverage
   - Batches: 8-10
5. **Mexico Enrichment - Continue**
   - Target: 20 more federal institutions and state archives
   - Goal: 32.8% → 43.2% coverage
   - Batches: 3-5
6. **Tunisia & Georgia - Complete Coverage**
   - Tunisia: 74.3% → 90% (add 11 Q-numbers)
   - Georgia: 85.7% → 100% (add 2 Q-numbers)
### Long-Term (Q1 2026)
7. **Expand to New Regions**
   - Southeast Asia: Vietnam, Thailand, Indonesia
   - Middle East: UAE, Saudi Arabia, Jordan
   - Africa: Egypt, Kenya, South Africa
   - Target: +500 institutions
8. **Data Quality Improvements**
   - Reduce TIER_4 (inferred) records via verification
   - Increase description coverage to 98%
   - Complete geocoding for all countries (100%)
---
## Changelog
### 2025-11-11: Major Merge (v1.1)
- Merged Tunisia enhanced dataset (+27 institutions)
- Added Georgia as new country (+14 institutions)
- Updated Latin American datasets (Brazil, Mexico, Chile, Argentina)
- Global Wikidata coverage improved: 55.6% → 56.6% (+153 Q-numbers)
- Total institutions: 13,478 → 13,505 (+27)
- Countries covered: 17 → 18
### 2025-11-09: Documentation Overhaul (v1.0)
- Created unified documentation directory (`data/instances/all/`)
- Consolidated statistics into single YAML file
- Established standardized reporting structure
### 2025-11-08: Chilean & Mexican Updates
- Chilean institutions enriched: 53.9% → 81.7% Wikidata
- Mexican institutions expanded: 117 → 192 (+75)
### 2025-11-06: Latin America Consolidation
- Unified Brazil, Chile, Mexico into single dataset
- Completed geocoding for all Latin American institutions
### 2025-11-01: Dutch Datasets Integration
- Parsed ISIL registry: 364 institutions
- Parsed Dutch organizations CSV: 1,351 institutions
- Cross-linked datasets: 340 matches (92.1% overlap)
---
## Reference Links
- **Project Repository**: `/Users/kempersc/apps/glam/`
- **Schema Documentation**: `/Users/kempersc/apps/glam/docs/SCHEMA_MODULES.md`
- **Persistent Identifiers**: `/Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md`
- **Agent Instructions**: `/Users/kempersc/apps/glam/AGENTS.md`
- **Progress Tracking**: `/Users/kempersc/apps/glam/PROGRESS.md`

**External Resources**:
- Wikidata: https://www.wikidata.org/
- VIAF: https://viaf.org/
- LinkML: https://linkml.io/
- Nominatim: https://nominatim.openstreetmap.org/
---
**Document Version**: 1.1 (Post-Merge)
**Created**: November 9, 2025
**Last Updated**: November 11, 2025
**Maintained By**: GLAM Data Extraction Project