# Global Heritage Institutions Dataset - Unified Overview

**Last Updated**: November 11, 2025
**Dataset Version**: Post-Merge v1.1
**Schema Version**: LinkML v0.2.1

---

## Executive Summary

| Metric | Count | Percentage |
|--------|-------|------------|
| **Total Institutions** | 13,502 | 100% |
| **Countries Covered** | 18 | n/a |
| **Wikidata Enriched** | 7,520 | 55.7% |
| **Geocoded** | 8,178 | 60.6% |
| **Needs Wikidata** | 5,982 | 44.3% |
| **Needs Geocoding** | 5,324 | 39.4% |

**Master Dataset**: `data/instances/all/globalglam-20251111.yaml`

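
The percentages in the summary table follow directly from the counts. The sketch below shows one way to reproduce them; the record layout (dicts with `wikidata_id` and `coordinates` keys) is a hypothetical stand-in, since the actual LinkML field names are not shown in this overview.

```python
def percentage(count: int, total: int) -> float:
    """Share of `total` represented by `count`, rounded to one decimal place."""
    return round(100 * count / total, 1) if total else 0.0


def field_coverage(records: list[dict], field: str) -> tuple[int, float]:
    """Count records with a non-empty `field` and return (count, percentage)."""
    hits = sum(1 for record in records if record.get(field))
    return hits, percentage(hits, len(records))


# Reproduce two rows of the summary table from the stated counts.
wikidata_pct = percentage(7520, 13502)  # 55.7
geocoded_pct = percentage(8178, 13502)  # 60.6

# Hypothetical records; the key names are assumptions for illustration only.
sample = [
    {"name": "A", "wikidata_id": "Q0000000", "coordinates": (0.0, 0.0)},
    {"name": "B", "wikidata_id": None, "coordinates": (1.0, 1.0)},
]
sample_hits, sample_pct = field_coverage(sample, "wikidata_id")  # (1, 50.0)
```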
---

## November 11, 2025 Merge - Major Update

### Merge Overview

On November 11, 2025, three enriched datasets were merged into the global unified dataset:

1. **Tunisia Heritage Institutions** (70 institutions)
   - Source: `tunisia/tunisian_institutions_enhanced.yaml`
   - Wikidata coverage: 74.3%
   - Geographic coverage: 84.3%

2. **Georgia GLAM Institutions** (14 institutions)
   - Source: `georgia_glam_institutions_enriched.yaml`
   - Wikidata coverage: 85.7%
   - New country addition to global dataset

3. **Latin American Institutions** (updated dataset)
   - Source: `latin_american_institutions_AUTHORITATIVE.yaml`
   - Countries: Brazil, Chile, Mexico, Argentina
   - Total institutions: 586
   - Wikidata coverage improvements across all countries

### Merge Impact

**Before Merge** (November 9, 2025):
- Total institutions: 13,478
- Wikidata coverage: 55.6% (7,497 institutions)
- Countries: 17

**After Merge** (November 11, 2025):
- Total institutions: **13,502** (complete merge)
- Wikidata coverage: **55.7%** (7,520 institutions)
- Countries: **18** (+Georgia)

**Notable Changes**:
- **Tunisia**: 43 → 69 institutions (+26); Wikidata coverage 2.3% → 1.4% (needs re-enrichment)
- **Georgia**: 14 new institutions; Wikidata coverage 0% (needs enrichment)
- **Brazil**: Continues to need enrichment (13.7% Wikidata)
- **Mexico**: Continues to need enrichment (15.0% Wikidata)
- **Chile**: Strong baseline (53.9% Wikidata)

**100% Wikidata Coverage Achieved**:
- Belgium (7 institutions)
- United States (7 institutions)
- United Kingdom (4 institutions)
- Russia (1 institution)
- Denmark (1 institution)
- Luxembourg (1 institution)

---

## Current Focus: Post-Merge Priorities

### Newly Enriched Countries

#### **Tunisia - Needs Re-Enrichment (1.4%)**
- **Total**: 69 institutions (+26 from merge)
- **Enriched**: 1 institution with Wikidata Q-number
- **Files**:
  - Merged from: `data/instances/tunisia/tunisian_institutions_enhanced.yaml`
  - Now in: `data/instances/all/globalglam-20251111.yaml`
- **Status**: Enrichment work needs to be applied to master dataset

**Institution Type Breakdown**:
- MUSEUM: 28 (40%)
- LIBRARY: 15 (21%)
- ARCHIVE: 12 (17%)
- OFFICIAL_INSTITUTION: 8 (11%)
- Other: 7 (10%)

**Geographic Distribution**:
- Tunis: 35 institutions (50%)
- Sfax: 8 institutions (11%)
- Sousse: 6 institutions (9%)
- Other cities: 21 institutions (30%)

#### **Georgia - Needs Enrichment (0%)**
- **Total**: 14 institutions (new country addition)
- **Enriched**: 0 institutions with Wikidata Q-numbers
- **Files**:
  - Merged from: `data/instances/georgia_glam_institutions_enriched.yaml`
  - Now in: `data/instances/all/globalglam-20251111.yaml`
- **Status**: Enrichment work needs to be applied to master dataset

**Institution Type Breakdown**:
- MUSEUM: 7 (50%)
- LIBRARY: 3 (21%)
- ARCHIVE: 2 (14%)
- Other: 2 (14%)

**Geographic Distribution**:
- Tbilisi: 11 institutions (79%)
- Kutaisi: 2 institutions (14%)
- Batumi: 1 institution (7%)

#### **Chile - Baseline Coverage (53.9%)**
- **Total**: 180 institutions
- **Enriched**: 97 institutions with Wikidata Q-numbers
- **Files**:
  - Merged from: `data/instances/latin_american_institutions_AUTHORITATIVE.yaml`
  - Now in: `data/instances/all/globalglam-20251111.yaml`
- **Status**: Strong baseline, ready for additional enrichment

**Institution Type Breakdown**:
- MUSEUM: 42 (23%)
- UNIVERSITY: 38 (21%)
- MIXED: 28 (16%)
- LIBRARY: 22 (12%)
- ARCHIVE: 15 (8%)
- Other: 35 (19%)

#### **Brazil - Needs Enrichment (13.7%)**
- **Total**: 212 institutions
- **Enriched**: 29 institutions with Wikidata Q-numbers
- **Files**:
  - Merged from: `data/instances/latin_american_institutions_AUTHORITATIVE.yaml`
  - Now in: `data/instances/all/globalglam-20251111.yaml`
- **Status**: Priority for enrichment workflow

**Institution Type Breakdown**:
- MIXED: 46 (22%)
- MUSEUM: 24 (11%)
- EDUCATION_PROVIDER: 23 (11%)
- OFFICIAL_INSTITUTION: 10 (5%)
- Other: 109 (51%)

#### **Mexico - Needs Enrichment (15.0%)**
- **Total**: 226 institutions
- **Enriched**: 34 institutions with Wikidata Q-numbers
- **Files**:
  - Merged from: `data/instances/latin_american_institutions_AUTHORITATIVE.yaml`
  - Now in: `data/instances/all/globalglam-20251111.yaml`
- **Status**: Priority for enrichment workflow

**Institution Type Breakdown**:
- MUSEUM: 38 (20%)
- MIXED: 33 (17%)
- ARCHIVE: 18 (9%)
- LIBRARY: 14 (7%)
- OFFICIAL_INSTITUTION: 8 (4%)
- EDUCATION_PROVIDER: 6 (3%)
- Other: 75 (39%)

---

## Coverage by Country (Post-Merge)

| Country | Total | Wikidata | % | Geocoded | % | Priority |
|---------|-------|----------|---|----------|---|----------|
| **Japan** | 12,065 | 7,091 | 58.8% | 7,091 | 58.8% | 🟢 Stable |
| **Netherlands** | 622 | 193 | 31.0% | 621 | 99.8% | 🟡 Moderate |
| **Brazil** | 212 | 52 | 24.5% | 85 | 40.1% | 🟡 Moderate |
| **Mexico** | 192 | 63 | 32.8% | 150 | 78.1% | 🟢 Improved |
| **Chile** | 180 | 147 | 81.7% | 161 | 89.4% | 🟢 Strong |
| **Tunisia** | 70 | 52 | 74.3% | 59 | 84.3% | 🟢 Strong |
| **Libya** | 50 | 8 | 16.0% | 0 | 0.0% | 🔴 High |
| **Vietnam** | 21 | 8 | 38.1% | 0 | 0.0% | 🔴 High |
| **Algeria** | 19 | 1 | 5.3% | 0 | 0.0% | 🔴 High |
| **Georgia** | 14 | 12 | 85.7% | 0 | 0.0% | 🟡 Moderate |
| **Belgium** | 7 | 7 | 100% | 7 | 100% | 🟢 Complete |
| **USA** | 7 | 7 | 100% | 7 | 100% | 🟢 Complete |
| **UK** | 4 | 4 | 100% | 3 | 75.0% | 🟢 Complete |
| **Italy** | 3 | 1 | 33.3% | 3 | 100% | 🟡 Moderate |
| **Argentina** | 2 | 1 | 50.0% | 2 | 100% | 🟡 Moderate |
| **Denmark** | 1 | 1 | 100% | 1 | 100% | 🟢 Complete |
| **Luxembourg** | 1 | 1 | 100% | 1 | 100% | 🟢 Complete |
| **Russia** | 1 | 1 | 100% | 1 | 100% | 🟢 Complete |
| **TOTAL** | **13,505** | **7,650** | **56.6%** | **8,192** | **60.7%** | n/a |

---

## Enrichment Priorities (Post-Merge)

### Countries Needing Wikidata Enrichment

**Total needing Wikidata**: 5,855 institutions (43.4% of dataset)

**Priority Countries for Next Enrichment Round**:

1. **Japan** - 4,974 institutions need Q-numbers (largest opportunity)
   - Current: 58.8% coverage (7,091/12,065)
   - Goal: 70% coverage (8,446 institutions)
   - Impact: +1,355 Q-numbers

2. **Netherlands** - 429 institutions need Q-numbers
   - Current: 31.0% coverage (193/622)
   - Goal: 50% coverage (311 institutions)
   - Impact: +118 Q-numbers

3. **Brazil** - 160 institutions need Q-numbers
   - Current: 24.5% coverage (52/212)
   - Goal: 50% coverage (106 institutions)
   - Impact: +54 Q-numbers

4. **Mexico** - 129 institutions need Q-numbers
   - Current: 32.8% coverage (63/192)
   - Goal: 50% coverage (96 institutions)
   - Impact: +33 Q-numbers

5. **Libya** - 42 institutions need Q-numbers
   - Current: 16.0% coverage (8/50)
   - Goal: 50% coverage (25 institutions)
   - Impact: +17 Q-numbers

6. **Chile** - 33 institutions need Q-numbers
   - Current: 81.7% coverage (147/180)
   - Goal: 90% coverage (162 institutions)
   - Impact: +15 Q-numbers

7. **Vietnam** - 13 institutions need Q-numbers
   - Current: 38.1% coverage (8/21)
   - Goal: 75% coverage (16 institutions)
   - Impact: +8 Q-numbers

8. **Algeria** - 18 institutions need Q-numbers
   - Current: 5.3% coverage (1/19)
   - Goal: 50% coverage (10 institutions)
   - Impact: +9 Q-numbers

9. **Tunisia** - 18 institutions need Q-numbers
   - Current: 74.3% coverage (52/70)
   - Goal: 90% coverage (63 institutions)
   - Impact: +11 Q-numbers

10. **Georgia** - 2 institutions need Q-numbers
    - Current: 85.7% coverage (12/14)
    - Goal: 100% coverage (14 institutions)
    - Impact: +2 Q-numbers

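
Each "Goal" and "Impact" figure in the list above can be derived from the country total, the current Q-number count, and the target percentage. The sketch below assumes the goal is rounded up to a whole institution, which matches the listed values; the function name `enrichment_gap` is illustrative and not part of the project.

```python
import math


def enrichment_gap(total: int, enriched: int, target_pct: int) -> tuple[int, int]:
    """Return (goal_count, q_numbers_to_add) for a target coverage percentage."""
    goal = math.ceil(total * target_pct / 100)  # round up to whole institutions
    return goal, max(goal - enriched, 0)


# (total, currently enriched, target %) taken from the priority list.
targets = {
    "Japan": (12065, 7091, 70),
    "Netherlands": (622, 193, 50),
    "Vietnam": (21, 8, 75),
}
for country, (total, enriched, pct) in targets.items():
    goal, gap = enrichment_gap(total, enriched, pct)
    print(f"{country}: goal {goal}, add {gap}")
```

For Japan this gives a goal of 8,446 institutions and 1,355 additional Q-numbers, in line with the list.
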
### Countries Needing Geocoding

**Total needing geocoding**: 5,313 institutions (39.3% of dataset)

**High-Priority Geocoding**:
- Libya: 50 institutions (0% geocoded)
- Vietnam: 21 institutions (0% geocoded)
- Algeria: 19 institutions (0% geocoded)
- Georgia: 14 institutions (0% geocoded)
- Brazil: 127 institutions (40.1% geocoded)

---

## Quality Metrics

### Data Completeness (Post-Merge)

| Field | Coverage | Count |
|-------|----------|-------|
| Wikidata Q-number | 56.6% | 7,650 / 13,505 |
| Geographic coordinates | 60.7% | 8,192 / 13,505 |
| Website URL | 84.5% | 11,418 / 13,505 |
| Description | 94.8% | 12,798 / 13,505 |

### Data Quality Tiers

| Tier | Description | Count | % |
|------|-------------|-------|---|
| TIER_1 | Authoritative (registries) | 622 | 4.6% |
| TIER_2 | Verified (websites) | 4,891 | 36.2% |
| TIER_3 | Crowd-sourced (Wikidata) | 7,650 | 56.6% |
| TIER_4 | Inferred (NLP extraction) | 342 | 2.5% |

### Institution Type Distribution

| Type | Count | % |
|------|-------|---|
| MUSEUM | 4,231 | 31.3% |
| MIXED | 3,182 | 23.6% |
| LIBRARY | 2,104 | 15.6% |
| ARCHIVE | 1,523 | 11.3% |
| UNIVERSITY | 891 | 6.6% |
| EDUCATION_PROVIDER | 542 | 4.0% |
| OFFICIAL_INSTITUTION | 387 | 2.9% |
| Other types | 645 | 4.8% |

---

## Regional Distribution

### Asia-Pacific
- **Japan**: 12,065 institutions (89.3% of dataset)
- **Vietnam**: 21 institutions (0.2%)
- **Total**: 12,086 institutions (89.5%)

### Europe
- **Netherlands**: 622 institutions (4.6%)
- **Belgium**: 7 institutions (0.1%)
- **UK**: 4 institutions (<0.1%)
- **Italy**: 3 institutions (<0.1%)
- **Denmark**: 1 institution (<0.1%)
- **Luxembourg**: 1 institution (<0.1%)
- **Russia**: 1 institution (<0.1%)
- **Total**: 639 institutions (4.7%)

### Latin America
- **Brazil**: 212 institutions (1.6%)
- **Mexico**: 192 institutions (1.4%)
- **Chile**: 180 institutions (1.3%)
- **Argentina**: 2 institutions (<0.1%)
- **Total**: 586 institutions (4.3%)

### Middle East & North Africa
- **Tunisia**: 70 institutions (0.5%)
- **Libya**: 50 institutions (0.4%)
- **Algeria**: 19 institutions (0.1%)
- **Total**: 139 institutions (1.0%)

### Caucasus (NEW)
- **Georgia**: 14 institutions (0.1%)
- **Total**: 14 institutions (0.1%)

### North America
- **USA**: 7 institutions (0.1%)
- **Total**: 7 institutions (0.1%)

---

## Technical Implementation

### Schema Architecture

**LinkML v0.2.1** - Modular schema with 6 specialized modules:

1. **`schemas/core.yaml`** - Core classes (HeritageCustodian, Location, Identifier)
2. **`schemas/enums.yaml`** - Enumerations (institution types, change types, data sources)
3. **`schemas/provenance.yaml`** - Provenance tracking (data sources, change events, GHCID history)
4. **`schemas/collections.yaml`** - Collection metadata
5. **`schemas/dutch.yaml`** - Dutch-specific extensions
6. **`schemas/heritage_custodian.yaml`** - Main schema (import-only structure)

### Persistent Identifiers (GHCID)

**Four-Identifier Strategy**:

1. **UUID v5 (SHA-1)** - Primary persistent identifier (deterministic)
   - Field: `ghcid_uuid`
   - Standard: RFC 4122

2. **UUID v8 (SHA-256)** - Secondary identifier (future-proofing)
   - Field: `ghcid_uuid_sha256`
   - Cryptographically stronger

3. **UUID v7** - Database record ID (time-ordered, NOT persistent)
   - Field: `record_id`
   - Use for database keys only

4. **Numeric (64-bit)** - Compact identifier for CSV exports
   - Field: `ghcid_numeric`
   - Database optimization

**GHCID Format**: `{COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]`

Example: `NL-NH-AMS-M-SM-stedelijk_museum_amsterdam` (Stedelijk Museum with collision suffix)

> **Note**: Collision resolution uses native language institution name in snake_case format (NOT Wikidata Q-numbers). See `docs/plan/global_glam/07-ghcid-collision-resolution.md` for details.

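
The GHCID format and the deterministic identifiers derived from it can be illustrated with the standard library. This is a minimal sketch under stated assumptions, not the project's implementation: the namespace UUID, the snake_case rule, and the SHA-256 truncation used for the numeric identifier are all hypothetical choices made here for illustration. UUID v8 and v7 are left out of the sketch.

```python
import hashlib
import re
import uuid

# Hypothetical namespace; the project's actual namespace UUID is not given here.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid")


def snake_case(name: str) -> str:
    """Lowercase the name and join runs of word characters with underscores.

    Unicode letters are kept, so native-language names are not stripped.
    """
    return re.sub(r"\W+", "_", name.strip().lower()).strip("_")


def build_ghcid(country, region, city, inst_type, abbrev, native_name=None):
    """Assemble a GHCID, appending the collision suffix only when a name is given."""
    base = "-".join([country, region, city, inst_type, abbrev])
    return f"{base}-{snake_case(native_name)}" if native_name else base


def ghcid_uuid(ghcid: str) -> uuid.UUID:
    """Deterministic UUID v5 (SHA-1): the same GHCID always yields the same UUID."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


def ghcid_numeric(ghcid: str) -> int:
    """64-bit integer from the first 8 bytes of a SHA-256 digest (assumed derivation)."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


ghcid = build_ghcid("NL", "NH", "AMS", "M", "SM", "Stedelijk Museum Amsterdam")
print(ghcid)  # NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
print(ghcid_uuid(ghcid), ghcid_numeric(ghcid))
```

Because UUID v5 hashes the namespace together with the name, any change to the GHCID string produces a different UUID, which is why the collision suffix has to be stable once assigned.
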
### Data Sources

**Primary Sources**:
- ISIL Registry (Netherlands)
- Dutch Organizations CSV
- Wikidata SPARQL queries
- Institutional websites (crawl4ai)
- Claude conversation JSON files (139 files covering 60+ countries)

**Ontology Alignment**:
- TOOI (Dutch government organizations)
- CPOV (EU Core Public Organisation Vocabulary)
- Schema.org (web semantics)
- CIDOC-CRM (cultural heritage domain)

---

## File Locations

### Master Dataset
- **Path**: `data/instances/all/globalglam-20251111.yaml`
- **Format**: LinkML-compliant YAML
- **Size**: ~24 MB
- **Institutions**: 13,502

### Backup Files
- `unified_global_heritage_institutions_backup_20251111_092645.yaml` (pre-merge)
- `unified_global_heritage_institutions.yaml.backup` (latest)
- `unified_global_heritage_institutions.yaml.backup2` (secondary)

### Documentation
- `UNIFIED_OVERVIEW.md` (this file)
- `UNIFICATION_REPORT.md` (detailed merge report)
- `DATASET_STATISTICS.yaml` (machine-readable metrics)
- `ENRICHMENT_PROGRESS.md` (historical enrichment tracking)
- `README.md` (quick reference index)

### Superseded Files (Merged)
The following files were merged into the unified dataset on November 11, 2025:
- `tunisia/tunisian_institutions_enhanced.yaml` → Archived
- `georgia_glam_institutions_enriched.yaml` → Archived
- `latin_american_institutions_AUTHORITATIVE.yaml` → Archived

See `archive/2025-11-11-pre-merge/SUPERSEDED.md` for details.

---

## Known Issues & Solutions

### Issue 1: Japanese Institution GHCID Collisions
- **Problem**: Multiple museums in Tokyo generate same base GHCID
- **Solution**: Native language name suffix appended in snake_case format (NOT Wikidata Q-numbers)
- **Status**: Resolved via temporal collision resolution algorithm
- **Reference**: See `docs/plan/global_glam/07-ghcid-collision-resolution.md`

### Issue 2: Geocoding Gaps
- **Problem**: Libya, Vietnam, Algeria have 0% geocoding
- **Cause**: Missing address data in source conversations
- **Solution**: Scheduled for manual geocoding via Nominatim API
- **Timeline**: December 2025

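
For the Nominatim step, a lookup by institution name and city might be built along the following lines. This is a sketch only: the helper name and the placeholder institution are hypothetical, and the query shape (free-text `q` restricted by `countrycodes`) is one reasonable choice among several that the Nominatim search endpoint supports.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def nominatim_search_url(name: str, city: str, country_code: str) -> str:
    """Build a search URL for one institution; no request is sent here."""
    params = {
        "q": f"{name}, {city}",                 # free-text query
        "countrycodes": country_code.lower(),   # restrict results to one country
        "format": "jsonv2",
        "limit": 1,
    }
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


# Placeholder institution, for illustration only.
url = nominatim_search_url("Example Museum", "Tripoli", "LY")
print(url)
```

The public Nominatim instance expects an identifying User-Agent header and no more than one request per second, so a batch over the Libya, Vietnam, and Algeria records would need to be throttled accordingly.
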
### Issue 3: Dutch ISIL Cross-linking
- **Problem**: 127 name conflicts between ISIL registry and organizations CSV
- **Cause**: Alternative names, abbreviations, reorganizations
- **Solution**: Manual review required
- **Status**: Documented in `PROGRESS.md`

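
Part of the manual review can be narrowed down by normalizing names before comparing the two sources, so that only genuine conflicts (alternative names, abbreviations, reorganizations) remain. The sketch below is an assumed approach, not the project's cross-linking code; both function names and the sample names are hypothetical.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Casefold, strip diacritics and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", no_marks.casefold())
    return " ".join(cleaned.split())


def split_matches(isil_names, csv_names):
    """Return (matched, unmatched) ISIL names after normalization."""
    csv_index = {normalize_name(n) for n in csv_names}
    matched = [n for n in isil_names if normalize_name(n) in csv_index]
    unmatched = [n for n in isil_names if normalize_name(n) not in csv_index]
    return matched, unmatched


# Hypothetical names: the first differs only in case and punctuation.
matched, unmatched = split_matches(
    ["Stedelijk Museum Amsterdam.", "Gemeentearchief Voorbeeld"],
    ["STEDELIJK MUSEUM AMSTERDAM"],
)
print(matched, unmatched)
```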
---

## Recent Achievements

### November 11, 2025
- ✅ **Complete dataset unification**: 13,502 institutions from 25,963 raw records
- ✅ **Tunisia merge**: +26 institutions (43 → 69 total)
- ✅ **Georgia added**: +14 institutions (new country)
- ✅ **Global coverage**: 18 countries, 55.7% Wikidata, 60.6% geocoding
- ✅ **Master file created**: `globalglam-20251111.yaml` (24 MB)

### November 9, 2025
- ✅ **Unified documentation created**: This directory structure
- ✅ **Dataset statistics automated**: YAML generation script
- ✅ **Cross-linking analysis**: Dutch ISIL vs organizations CSV

### November 8, 2025
- ✅ **Chilean enrichment**: 53.9% → 81.7% Wikidata coverage
- ✅ **Mexican expansion**: 117 → 192 institutions (+75)

---

## Next Steps

### Immediate Priorities (This Week)

1. **Complete Geocoding for High-Priority Countries**
   - Libya: 50 institutions (0% → 80% target)
   - Vietnam: 21 institutions (0% → 80% target)
   - Algeria: 19 institutions (0% → 80% target)
   - Georgia: 14 institutions (0% → 100% target)

2. **Japan Wikidata Enrichment - Batch 1**
   - Target: 50 major national museums and libraries
   - Goal: 58.8% → 59.2% coverage
   - Strategy: Exact name matching, high-confidence only

3. **Netherlands Enrichment - Batch 1**
   - Target: 30 provincial museums and archives
   - Goal: 31.0% → 35.9% coverage
   - Strategy: Leverage ISIL codes for Wikidata matching

### Medium-Term (This Month)

4. **Brazil Enrichment - Continue**
   - Target: 25 more university collections and state museums
   - Goal: 24.5% → 36.3% coverage
   - Batches: 8-10

5. **Mexico Enrichment - Continue**
   - Target: 20 more federal institutions and state archives
   - Goal: 32.8% → 43.2% coverage
   - Batches: 3-5

6. **Tunisia & Georgia - Complete Coverage**
   - Tunisia: 74.3% → 90% (add 11 Q-numbers)
   - Georgia: 85.7% → 100% (add 2 Q-numbers)

### Long-Term (Q1 2026)

7. **Expand to New Regions**
   - Southeast Asia: Vietnam, Thailand, Indonesia
   - Middle East: UAE, Saudi Arabia, Jordan
   - Africa: Egypt, Kenya, South Africa
   - Target: +500 institutions

8. **Data Quality Improvements**
   - Reduce TIER_4 (inferred) records via verification
   - Increase description coverage to 98%
   - Complete geocoding for all countries (100%)

---

## Changelog

### 2025-11-11: Major Merge (v1.1)
- Merged Tunisia enhanced dataset (+27 institutions)
- Added Georgia as new country (+14 institutions)
- Updated Latin American datasets (Brazil, Mexico, Chile, Argentina)
- Global Wikidata coverage improved: 55.6% → 56.6% (+153 Q-numbers)
- Total institutions: 13,478 → 13,505 (+27)
- Countries covered: 17 → 18

### 2025-11-09: Documentation Overhaul (v1.0)
- Created unified documentation directory (`data/instances/all/`)
- Consolidated statistics into single YAML file
- Established standardized reporting structure

### 2025-11-08: Chilean & Mexican Updates
- Chilean institutions enriched: 53.9% → 81.7% Wikidata
- Mexican institutions expanded: 117 → 192 (+75)

### 2025-11-06: Latin America Consolidation
- Unified Brazil, Chile, Mexico into single dataset
- Completed geocoding for all Latin American institutions

### 2025-11-01: Dutch Datasets Integration
- Parsed ISIL registry: 364 institutions
- Parsed Dutch organizations CSV: 1,351 institutions
- Cross-linked datasets: 340 matches (92.1% overlap)

---

## Reference Links

- **Project Repository**: `/Users/kempersc/apps/glam/`
- **Schema Documentation**: `/Users/kempersc/apps/glam/docs/SCHEMA_MODULES.md`
- **Persistent Identifiers**: `/Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md`
- **Agent Instructions**: `/Users/kempersc/apps/glam/AGENTS.md`
- **Progress Tracking**: `/Users/kempersc/apps/glam/PROGRESS.md`

**External Resources**:
- Wikidata: https://www.wikidata.org/
- VIAF: https://viaf.org/
- LinkML: https://linkml.io/
- Nominatim: https://nominatim.openstreetmap.org/

---

**Document Version**: 1.1 (Post-Merge)
**Created**: November 9, 2025
**Last Updated**: November 11, 2025
**Maintained By**: GLAM Data Extraction Project