# LinkML Validation Report - Latin American GLAM Institutions

- **Date**: 2025-11-06
- **Validator**: Custom Python validator (`validate_instances.py`)
- **Schema**: `heritage_custodian.yaml` v0.2.0

## Summary

All three curated Latin American heritage institution datasets pass LinkML schema validation with **zero errors**.

| Dataset | Records | Errors | Warnings | Status |
|---------|---------|--------|----------|--------|
| **Chilean** | 90 | 0 | 90 | ✅ VALID |
| **Mexican** | 117 | 0 | 113 | ✅ VALID |
| **Brazilian** | 97 | 0 | 54 | ✅ VALID |
| **TOTAL** | **304** | **0** | **257** | ✅ **VALID** |
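
The totals row can be re-derived from the per-dataset rows. A quick sanity check, using only the numbers copied from the table above (the variable names are illustrative):

```python
# Per-dataset (records, errors, warnings), copied from the summary table.
summary = {
    "Chilean": (90, 0, 90),
    "Mexican": (117, 0, 113),
    "Brazilian": (97, 0, 54),
}

# Sum each column across the three datasets.
total_records, total_errors, total_warnings = (
    sum(column) for column in zip(*summary.values())
)

print(total_records, total_errors, total_warnings)  # 304 0 257
```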

## Validation Details

### Chilean Institutions (`chilean_institutions_curated.yaml`)
- **Records**: 90
- **Errors**: 0 ✅
- **Warnings**: 90 (all missing city names)
- **Field Coverage**:
  - id: 100%
  - name: 100%
  - institution_type: 100%
  - description: 100%
  - provenance: 100%
  - locations: 100%
  - digital_platforms: 100%

**Issues**:
- All 90 records missing city names in locations (only region/country provided)
- Recommended: Apply geocoding enrichment to add city names and coordinates

### Mexican Institutions (`mexican_institutions_curated.yaml`)
- **Records**: 117
- **Errors**: 0 ✅
- **Warnings**: 113 (110 missing city names, 3 other)
- **Field Coverage**:
  - id: 100%
  - name: 100%
  - institution_type: 100%
  - description: 100%
  - provenance: 100%
  - locations: 70.9% (83/117)
  - identifiers: 57.3% (67/117)
  - digital_platforms: 23.1% (27/117)
  - collections: 3.4% (4/117)

**Issues**:
- 110 records missing city names (this count includes the 34 records with no location at all)
- 34 records missing location information entirely
- Recommended: Apply geocoding enrichment

### Brazilian Institutions (`brazilian_institutions_geocoded_v3.yaml`)
- **Records**: 97
- **Errors**: 0 ✅
- **Warnings**: 54 (39 missing city names, 15 missing descriptions)
- **Field Coverage**:
  - id: 100%
  - name: 100%
  - institution_type: 100%
  - provenance: 100%
  - locations: 100%
  - description: 84.5% (82/97)
  - identifiers: 55.7% (54/97)
  - change_history: 7.2% (7/97)

**Best Performing Dataset** for geocoding:
- Already geocoded with 59.8% city coverage (58/97)
- 51.5% coordinate coverage (50/97)
- Description coverage is 84.5% (82/97), below the 100% reported for the Chilean and Mexican datasets
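
The field coverage figures in this section are presence ratios: the number of records that carry a field, divided by the dataset size. The following is a minimal sketch of how such a statistic could be computed. It assumes each record is a plain dict loaded from YAML and that empty values count as missing; both are assumptions, and `validate_instances.py` may count differently. The function name `field_coverage` and the toy records are illustrative only.

```python
def field_coverage(records, fields):
    """Return {field: (present, total, percent)} for the given field names."""
    total = len(records)
    stats = {}
    for field in fields:
        # Assumption: None, "", [] and {} all count as "missing".
        present = sum(
            1 for rec in records if rec.get(field) not in (None, "", [], {})
        )
        percent = round(100 * present / total, 1) if total else 0.0
        stats[field] = (present, total, percent)
    return stats


# Toy data shaped like the Mexican dataset: 83 of 117 records have locations.
records = [
    {"id": f"mx-{i}", "locations": [{"country": "MX"}] if i < 83 else []}
    for i in range(117)
]

for name, (present, total, percent) in field_coverage(records, ["id", "locations"]).items():
    print(f"{name}: {percent}% ({present}/{total})")
```

With the toy data above, the `locations` line reproduces the 70.9% (83/117) figure reported for the Mexican dataset.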

## Required Fields Compliance

All 304 records have complete required fields per LinkML schema:

### HeritageCustodian (100% compliance)
- ✅ `id` - 304/304 records
- ✅ `name` - 304/304 records
- ✅ `institution_type` - 304/304 records
- ✅ `provenance` - 304/304 records

### Provenance (100% compliance)
- ✅ `data_source` - 304/304 records
- ✅ `data_tier` - 304/304 records
- ✅ `extraction_date` - 304/304 records
- ✅ `extraction_method` - 304/304 records
- ✅ `confidence_score` - 304/304 records
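
A required-field check of this kind can be sketched in a few lines. The field names below are the ones listed above; the function name `missing_required`, the dict-based record shape, and the example values are assumptions made for illustration, not the validator's actual code.

```python
REQUIRED_CUSTODIAN = ("id", "name", "institution_type", "provenance")
REQUIRED_PROVENANCE = (
    "data_source",
    "data_tier",
    "extraction_date",
    "extraction_method",
    "confidence_score",
)


def missing_required(record):
    """List the required fields absent from one HeritageCustodian record."""
    missing = [f for f in REQUIRED_CUSTODIAN if record.get(f) is None]
    provenance = record.get("provenance")
    if not isinstance(provenance, dict):
        provenance = {}  # every provenance field is then reported as missing
    missing += [
        f"provenance.{f}" for f in REQUIRED_PROVENANCE if provenance.get(f) is None
    ]
    return missing


# Hypothetical record; the values are placeholders, not taken from the datasets.
example = {
    "id": "example-001",
    "name": "Museo de Ejemplo",
    "institution_type": "MUSEUM",
    "provenance": {
        "data_source": "CONVERSATION_NLP",
        "data_tier": "TIER_4_INFERRED",
        "extraction_date": "2025-11-06",
        "extraction_method": "illustrative value",
        "confidence_score": 0.85,
    },
}

print(missing_required(example))          # []
print(missing_required({"id": "x"})[:3])  # ['name', 'institution_type', 'provenance']
```

The check uses `is None` rather than truthiness so that a legitimate `confidence_score` of `0.0` is not reported as missing.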

## Enum Validation

All enum values validated successfully:

- **institution_type**: All values match InstitutionTypeEnum (GALLERY, LIBRARY, ARCHIVE, MUSEUM, etc.)
- **data_source**: All values match DataSource enum (CONVERSATION_NLP, CSV_REGISTRY, etc.)
- **data_tier**: All values match DataTier enum (TIER_1_AUTHORITATIVE through TIER_4_INFERRED)
- **change_type**: All values match ChangeTypeEnum (FOUNDING, MERGER, etc.)
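
As an illustration of what an enum check involves, the sketch below flags values that are not permitted members. It is not the validator's code: the `InstitutionType` class holds only the four members named in this report (the schema defines more), and `invalid_enum_values` and the sample records are hypothetical.

```python
from enum import Enum


class InstitutionType(Enum):
    # Only the members named in this report; the real enum has more.
    GALLERY = "GALLERY"
    LIBRARY = "LIBRARY"
    ARCHIVE = "ARCHIVE"
    MUSEUM = "MUSEUM"


def invalid_enum_values(records, field, enum_cls):
    """Return (record id, value) pairs whose value is not a member of enum_cls."""
    allowed = {member.value for member in enum_cls}
    return [
        (rec.get("id"), rec[field])
        for rec in records
        if field in rec and rec[field] not in allowed
    ]


sample = [
    {"id": "r1", "institution_type": "MUSEUM"},
    {"id": "r2", "institution_type": "MUSUEM"},  # deliberate typo
]
print(invalid_enum_values(sample, "institution_type", InstitutionType))
# [('r2', 'MUSUEM')]
```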

## Recommendations

### Priority 1: Geocoding Enrichment
**Chilean Dataset**:
- 0% city coverage → Target: 60%+
- Add coordinates via Nominatim API
- Expected yield: ~54 geocoded records (60% of 90)

**Mexican Dataset**:
- 6.0% city coverage (7/117) → Target: 60%+
- Add coordinates via Nominatim API
- Expected yield: ~70 geocoded records (60% of 117)
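
One possible shape for the geocoding step is sketched below, using only the Python standard library against the public Nominatim search endpoint. The function names, the query format (institution name, region, country), and the User-Agent string are assumptions for illustration. The public instance's usage policy asks for an identifying User-Agent and at most one request per second, so the sketch sleeps after each call.

```python
import json
import time
import urllib.parse
import urllib.request

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def build_query_url(name, region, country):
    """Build a free-text Nominatim search URL from the parts that are present."""
    query = ", ".join(part for part in (name, region, country) if part)
    params = {"q": query, "format": "jsonv2", "limit": 1, "addressdetails": 1}
    return f"{NOMINATIM_URL}?{urllib.parse.urlencode(params)}"


def geocode(name, region, country, user_agent="glam-geocoder-example/0.1"):
    """Return {'city', 'latitude', 'longitude'} for the best match, or None."""
    request = urllib.request.Request(
        build_query_url(name, region, country),
        headers={"User-Agent": user_agent},  # placeholder; use your own identifier
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        results = json.load(response)
    time.sleep(1)  # stay within one request per second
    if not results:
        return None
    hit = results[0]
    address = hit.get("address", {})
    return {
        "city": address.get("city") or address.get("town") or address.get("village"),
        "latitude": float(hit["lat"]),
        "longitude": float(hit["lon"]),
    }


# Building the URL needs no network access:
print(build_query_url("Museo de Ejemplo", "Valparaíso", "Chile"))
```

Calling `geocode(...)` performs a live HTTP request, so it is left out of the example run. Results for ambiguous institution names would need manual review before being written back to the datasets.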

### Priority 2: Description Enhancement
**Brazilian Dataset**:
- 15 records (15.5%) missing descriptions
- Use web scraping or Wikidata to fill gaps

### Priority 3: Identifier Collection
**Mexican Dataset**:
- Only 57.3% of records (67/117) have identifiers
- Search for Wikidata IDs, VIAF IDs, institutional websites
- Target: 80%+ identifier coverage

**Brazilian Dataset**:
- 55.7% identifier coverage (54/97)
- Apply the same enrichment strategy as for the Mexican dataset

## Validation Method

The custom Python validator checks:
1. Required field presence
2. Enum value validity
3. Confidence score ranges (0.0-1.0)
4. Nested object structure (Location, Identifier, Provenance, etc.)
5. Field completeness statistics

**Note**: LinkML's official `linkml-validate` CLI tool could not be used because of Pydantic v1/v2 version conflicts, so the custom validator implements the schema rules directly.
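
To make the split between errors and warnings concrete, here is a minimal sketch of checks 3 and 4. It is an illustration under stated assumptions rather than the contents of `validate_instances.py`: the function name `validate_record`, the message wording, and the `city` key inside a location are all hypothetical.

```python
def validate_record(record):
    """Return (errors, warnings) for one record; only errors make it invalid."""
    errors, warnings = [], []

    # Check 3: confidence_score must be a number in [0.0, 1.0].
    score = (record.get("provenance") or {}).get("confidence_score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        errors.append(f"confidence_score is not numeric: {score!r}")
    elif not 0.0 <= score <= 1.0:
        errors.append(f"confidence_score out of range: {score}")

    # Check 4: locations, when present, must be a list of mappings.
    locations = record.get("locations") or []
    if not isinstance(locations, list) or not all(
        isinstance(loc, dict) for loc in locations
    ):
        errors.append("locations must be a list of Location objects")
    elif not any(loc.get("city") for loc in locations):
        # A missing city is reported as a warning, not an error.
        warnings.append("no city name in locations")

    return errors, warnings


record = {
    "provenance": {"confidence_score": 0.85},
    "locations": [{"region": "Maule", "country": "CL"}],
}
print(validate_record(record))  # ([], ['no city name in locations'])
```

Under this sketch a record with no `locations` at all also receives the city warning, which is one way the Mexican warning count could exceed the number of records that have locations.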

## Conclusion

✅ **All 304 Latin American heritage institution records are valid** according to the LinkML `heritage_custodian.yaml` schema v0.2.0, with zero errors and 257 warnings.

The datasets are production-ready for:
- RDF/JSON-LD export
- SPARQL endpoint loading
- Data aggregation and analysis
- Further enrichment (geocoding, identifiers, web scraping)

Next steps: Apply geocoding enrichment to the Chilean and Mexican datasets to bring all three datasets to similar quality levels.