# LinkML Validation Report - Latin American GLAM Institutions

**Date:** 2025-11-06
**Validator:** Custom Python validator (`validate_instances.py`)
**Schema:** `heritage_custodian.yaml` v0.2.0

## Summary
All three curated Latin American heritage institution datasets pass LinkML schema validation with zero errors.
| Dataset | Records | Errors | Warnings | Status |
|---|---|---|---|---|
| Chilean | 90 | 0 | 90 | ✅ VALID |
| Mexican | 117 | 0 | 113 | ✅ VALID |
| Brazilian | 97 | 0 | 54 | ✅ VALID |
| TOTAL | 304 | 0 | 257 | ✅ VALID |
## Validation Details

### Chilean Institutions (`chilean_institutions_curated.yaml`)

- Records: 90
- Errors: 0 ✅
- Warnings: 90 (all missing city names)
- Field Coverage:
  - id: 100%
  - name: 100%
  - institution_type: 100%
  - description: 100%
  - provenance: 100%
  - locations: 100%
  - digital_platforms: 100%

**Issues:**

- All 90 records are missing city names in locations (only region/country provided)
- Recommended: apply geocoding enrichment to add city names and coordinates
### Mexican Institutions (`mexican_institutions_curated.yaml`)

- Records: 117
- Errors: 0 ✅
- Warnings: 113 (110 missing city names, 3 other)
- Field Coverage:
  - id: 100%
  - name: 100%
  - institution_type: 100%
  - description: 100%
  - provenance: 100%
  - locations: 70.9% (83/117)
  - identifiers: 57.3% (67/117)
  - digital_platforms: 23.1% (27/117)
  - collections: 3.4% (4/117)

**Issues:**

- 110 records missing city names
- 34 records missing location information entirely
- Recommended: apply geocoding enrichment
### Brazilian Institutions (`brazilian_institutions_geocoded_v3.yaml`)

- Records: 97
- Errors: 0 ✅
- Warnings: 54 (39 missing city names, 15 missing descriptions)
- Field Coverage:
  - id: 100%
  - name: 100%
  - institution_type: 100%
  - provenance: 100%
  - locations: 100%
  - description: 84.5% (82/97)
  - identifiers: 55.7% (54/97)
  - change_history: 7.2% (7/97)

**Best-performing dataset:**

- Already geocoded, with 59.8% city coverage (58/97)
- 51.5% coordinate coverage (50/97)
- Highest description coverage (84.5%)
## Required Fields Compliance

All 304 records have complete required fields per the LinkML schema:

### HeritageCustodian (100% compliance)

- ✅ id - 304/304 records
- ✅ name - 304/304 records
- ✅ institution_type - 304/304 records
- ✅ provenance - 304/304 records

### Provenance (100% compliance)

- ✅ data_source - 304/304 records
- ✅ data_tier - 304/304 records
- ✅ extraction_date - 304/304 records
- ✅ extraction_method - 304/304 records
- ✅ confidence_score - 304/304 records
## Enum Validation

All enum values validated successfully:

- institution_type: all values match InstitutionTypeEnum (GALLERY, LIBRARY, ARCHIVE, MUSEUM, etc.)
- data_source: all values match the DataSource enum (CONVERSATION_NLP, CSV_REGISTRY, etc.)
- data_tier: all values match the DataTier enum (TIER_1_AUTHORITATIVE through TIER_4_INFERRED)
- change_type: all values match ChangeTypeEnum (FOUNDING, MERGER, etc.)
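Because LinkML declares enum members under `enums: <name>: permissible_values:`, this kind of check can be driven by the schema itself rather than hardcoded lists. A minimal sketch, using an illustrative inline fragment (not the full `heritage_custodian.yaml`) in place of the dict that `yaml.safe_load()` would return:

```python
def permissible_values(schema: dict, enum_name: str) -> set:
    """Collect the permissible_values keys for one enum in a loaded LinkML schema."""
    return set(schema.get("enums", {}).get(enum_name, {}).get("permissible_values", {}))

# Illustrative fragment of a loaded LinkML schema; the real schema defines more values.
schema = {
    "enums": {
        "InstitutionTypeEnum": {
            "permissible_values": {
                "GALLERY": {}, "LIBRARY": {}, "ARCHIVE": {}, "MUSEUM": {},
            }
        }
    }
}

allowed = permissible_values(schema, "InstitutionTypeEnum")
print("MUSEUM" in allowed)   # True
print("THEATER" in allowed)  # False
```

Reading the enum sets from the schema keeps the validator and the schema from drifting apart as `heritage_custodian.yaml` evolves.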
## Recommendations

### Priority 1: Geocoding Enrichment

**Chilean dataset:**

- 0% city coverage → target: 60%+
- Add coordinates via the Nominatim API
- Expected yield: ~54 geocoded records

**Mexican dataset:**

- 5.9% city coverage (7/117) → target: 60%+
- Add coordinates via the Nominatim API
- Expected yield: ~70 geocoded records
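A minimal sketch of this enrichment step, assuming free-text Nominatim queries of the form "name, country"; the function names are illustrative, and a real run must also honor Nominatim's usage policy (descriptive `User-Agent`, at most one request per second):

```python
import json
import urllib.parse

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"

def build_geocode_url(name: str, country: str) -> str:
    """Build a Nominatim free-text search URL for one institution."""
    params = {"q": f"{name}, {country}", "format": "json", "limit": "1"}
    return f"{NOMINATIM_SEARCH}?{urllib.parse.urlencode(params)}"

def extract_coordinates(response_body: str):
    """Return (lat, lon) from a Nominatim JSON response, or None on no match."""
    results = json.loads(response_body)
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])

url = build_geocode_url("Biblioteca Nacional de Chile", "Chile")
```

The actual fetch (e.g. `urllib.request.urlopen` with a `User-Agent` header) and the rate-limiting loop over all 90 + 110 records are omitted here.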
### Priority 2: Description Enhancement

**Brazilian dataset:**

- 15 records (15.5%) missing descriptions
- Use web scraping or Wikidata to fill gaps

### Priority 3: Identifier Collection

**Mexican dataset:**

- Only 57.3% of records have identifiers
- Search for Wikidata IDs, VIAF IDs, and institutional websites
- Target: 80%+ identifier coverage

**Brazilian dataset:**

- 55.7% identifier coverage
- Apply the same enhancement strategy
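Candidate Wikidata QIDs can be looked up with the real `wbsearchentities` endpoint of the Wikidata API; the helper names and the sample response below are illustrative sketches, not part of the existing pipeline:

```python
import urllib.parse

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def wikidata_search_url(institution_name: str, language: str = "es") -> str:
    """Build a wbsearchentities request URL for candidate Wikidata QIDs."""
    params = {
        "action": "wbsearchentities",
        "search": institution_name,
        "language": language,
        "format": "json",
        "limit": "5",
    }
    return f"{WIKIDATA_API}?{urllib.parse.urlencode(params)}"

def first_qid(response: dict):
    """Take the top search hit's QID, or None if there were no hits."""
    hits = response.get("search", [])
    return hits[0]["id"] if hits else None

# Abridged shape of a wbsearchentities response; the QID value is illustrative.
sample = {"search": [{"id": "Q123456", "label": "Museo Nacional (ejemplo)"}]}
```

Matches found this way should still be reviewed before import, since label search can return homonymous institutions; `language="pt"` would be the natural choice for the Brazilian dataset.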
## Validation Method

The custom Python validator checks:

- Required field presence
- Enum value validity
- Confidence score ranges (0.0-1.0)
- Nested object structure (Location, Identifier, Provenance, etc.)
- Field completeness statistics
Note: LinkML's official `linkml-validate` CLI tool is incompatible with this environment due to Pydantic v1/v2 version conflicts, so the custom validator implements the schema rules directly.
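The checks listed above can be sketched roughly as follows; the names and the trimmed enum set are illustrative, and the actual `validate_instances.py` may differ:

```python
REQUIRED_FIELDS = {"id", "name", "institution_type", "provenance"}
REQUIRED_PROVENANCE = {"data_source", "data_tier", "extraction_date",
                       "extraction_method", "confidence_score"}
# Trimmed to the values named in this report; the schema defines the full enum.
INSTITUTION_TYPES = {"GALLERY", "LIBRARY", "ARCHIVE", "MUSEUM"}

def validate_record(record: dict) -> list:
    """Return a list of error strings for one HeritageCustodian record."""
    errors = []
    # Required field presence
    for field in sorted(REQUIRED_FIELDS - record.keys()):
        errors.append(f"missing required field: {field}")
    # Enum value validity
    itype = record.get("institution_type")
    if itype is not None and itype not in INSTITUTION_TYPES:
        errors.append(f"invalid institution_type: {itype}")
    # Nested Provenance structure and confidence score range
    prov = record.get("provenance")
    if isinstance(prov, dict):
        for field in sorted(REQUIRED_PROVENANCE - prov.keys()):
            errors.append(f"provenance missing: {field}")
        score = prov.get("confidence_score")
        if score is not None and not (0.0 <= score <= 1.0):
            errors.append(f"confidence_score out of range: {score}")
    return errors
```

A record passes when `validate_record` returns an empty list; warnings such as missing city names would be collected in a parallel pass rather than treated as errors.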
## Conclusion

✅ All 304 Latin American heritage institution records are valid against the LinkML `heritage_custodian.yaml` schema v0.2.0.

The datasets are production-ready for:

- RDF/JSON-LD export
- SPARQL endpoint loading
- Data aggregation and analysis
- Further enrichment (geocoding, identifiers, web scraping)

**Next steps:** Apply geocoding enrichment to the Chilean and Mexican datasets to bring all three datasets to a similar quality level.