glam/data/instances/archive/validation_report_2025-11-06.md
2025-11-19 23:25:22 +01:00

# LinkML Validation Report - Latin American GLAM Institutions
**Date**: 2025-11-06
**Validator**: Custom Python validator (validate_instances.py)
**Schema**: heritage_custodian.yaml v0.2.0
## Summary
All three curated Latin American heritage institution datasets pass LinkML schema validation with **zero errors**.
| Dataset | Records | Errors | Warnings | Status |
|---------|---------|--------|----------|--------|
| **Chilean** | 90 | 0 | 90 | ✅ VALID |
| **Mexican** | 117 | 0 | 113 | ✅ VALID |
| **Brazilian** | 97 | 0 | 54 | ✅ VALID |
| **TOTAL** | **304** | **0** | **257** | ✅ **VALID** |
## Validation Details
### Chilean Institutions (`chilean_institutions_curated.yaml`)
- **Records**: 90
- **Errors**: 0 ✅
- **Warnings**: 90 (all missing city names)
- **Field Coverage**:
  - id: 100%
  - name: 100%
  - institution_type: 100%
  - description: 100%
  - provenance: 100%
  - locations: 100%
  - digital_platforms: 100%

**Issues**:
- All 90 records missing city names in locations (only region/country provided)
- Recommended: Apply geocoding enrichment to add city names and coordinates
### Mexican Institutions (`mexican_institutions_curated.yaml`)
- **Records**: 117
- **Errors**: 0 ✅
- **Warnings**: 113 (110 missing city names, 3 other)
- **Field Coverage**:
  - id: 100%
  - name: 100%
  - institution_type: 100%
  - description: 100%
  - provenance: 100%
  - locations: 70.9% (83/117)
  - identifiers: 57.3% (67/117)
  - digital_platforms: 23.1% (27/117)
  - collections: 3.4% (4/117)

**Issues**:
- 110 records missing city names
- 34 records missing location information entirely
- Recommended: Apply geocoding enrichment
### Brazilian Institutions (`brazilian_institutions_geocoded_v3.yaml`)
- **Records**: 97
- **Errors**: 0 ✅
- **Warnings**: 54 (39 missing city names, 15 missing descriptions)
- **Field Coverage**:
  - id: 100%
  - name: 100%
  - institution_type: 100%
  - provenance: 100%
  - locations: 100%
  - description: 84.5% (82/97)
  - identifiers: 55.7% (54/97)
  - change_history: 7.2% (7/97)

**Best Performing Dataset**:
- Already geocoded with 59.8% city coverage (58/97)
- 51.5% coordinate coverage (50/97)
- Description coverage of 84.5% (82/97), which is below the 100% reported for the Chilean and Mexican datasets
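
The field coverage figures in the sections above (for example `locations: 70.9% (83/117)`) are simple populated/total ratios. The following is a minimal sketch of how such a figure could be computed, not the code in `validate_instances.py`, which is not shown in this report. The function names `field_coverage` and `format_coverage` are hypothetical, and the records are synthetic placeholders that only reproduce the Mexican `locations` counts.

```python
from typing import Any


def field_coverage(records: list[dict[str, Any]], field: str) -> tuple[int, int, float]:
    """Return (populated, total, percent) for one field across all records."""
    total = len(records)
    # A field counts as populated only if it is present and non-empty.
    populated = sum(1 for r in records if r.get(field) not in (None, "", [], {}))
    percent = round(100 * populated / total, 1) if total else 0.0
    return populated, total, percent


def format_coverage(records: list[dict[str, Any]], field: str) -> str:
    """Format coverage the way this report lists it."""
    populated, total, percent = field_coverage(records, field)
    return f"{field}: {percent}% ({populated}/{total})"


# Synthetic placeholder records (assumption): 117 records, 83 with a location.
records = [
    {"id": f"rec-{i}", "locations": [{"country": "MX"}] if i < 83 else []}
    for i in range(117)
]

print(format_coverage(records, "locations"))  # locations: 70.9% (83/117)
```

Treating empty lists and empty strings as missing is a design choice in this sketch; a validator that only checks for key presence would report higher coverage.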
## Required Fields Compliance
All 304 records have complete required fields per LinkML schema:
### HeritageCustodian (100% compliance)
- `id` - 304/304 records
- `name` - 304/304 records
- `institution_type` - 304/304 records
- `provenance` - 304/304 records
### Provenance (100% compliance)
- `data_source` - 304/304 records
- `data_tier` - 304/304 records
- `extraction_date` - 304/304 records
- `extraction_method` - 304/304 records
- `confidence_score` - 304/304 records
## Enum Validation
All enum values validated successfully:
- **institution_type**: All values match InstitutionTypeEnum (GALLERY, LIBRARY, ARCHIVE, MUSEUM, etc.)
- **data_source**: All values match DataSource enum (CONVERSATION_NLP, CSV_REGISTRY, etc.)
- **data_tier**: All values match DataTier enum (TIER_1_AUTHORITATIVE through TIER_4_INFERRED)
- **change_type**: All values match ChangeTypeEnum (FOUNDING, MERGER, etc.)
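
The enum check described above amounts to comparing each field value against a set of permitted values. A minimal sketch follows, assuming records are plain dictionaries; the `InstitutionType` class here lists only the four members named in this report (the real `InstitutionTypeEnum` has more), and `invalid_enum_values` is a hypothetical helper, not a function from the actual validator.

```python
from enum import Enum
from typing import Any


class InstitutionType(Enum):
    # Partial list (assumption): only the members named in this report.
    GALLERY = "GALLERY"
    LIBRARY = "LIBRARY"
    ARCHIVE = "ARCHIVE"
    MUSEUM = "MUSEUM"


def invalid_enum_values(
    records: list[dict[str, Any]], field: str, enum_cls: type[Enum]
) -> list[tuple[Any, Any]]:
    """Return (record id, offending value) for values outside the enum."""
    allowed = {member.value for member in enum_cls}
    return [
        (r.get("id"), r[field])
        for r in records
        if field in r and r[field] not in allowed
    ]


# Illustrative records (not taken from the datasets).
sample = [
    {"id": "a", "institution_type": "MUSEUM"},
    {"id": "b", "institution_type": "CASTLE"},
]
print(invalid_enum_values(sample, "institution_type", InstitutionType))
```

An empty result corresponds to the "all values validated successfully" outcome reported here.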
## Recommendations
### Priority 1: Geocoding Enrichment
**Chilean Dataset**:
- 0% city coverage → Target: 60%+
- Add coordinates via Nominatim API
- Expected yield: ~54 geocoded records

**Mexican Dataset**:
- 6.0% city coverage (7/117) → Target: 60%+
- Add coordinates via Nominatim API
- Expected yield: ~70 geocoded records
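
As a rough illustration of the Nominatim step, the sketch below builds a query for the public Nominatim `/search` endpoint and extracts a city name and coordinates from the JSON response. It is a sketch under assumptions, not the project's enrichment script: the helper names and the output keys `city`, `latitude`, and `longitude` are hypothetical and may not match the Location fields in `heritage_custodian.yaml`.

```python
import json
from typing import Any, Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(name: str, region: str, country: str) -> str:
    """Build a free-text search URL from the fields a record already has."""
    query = ", ".join(part for part in (name, region, country) if part)
    params = {"q": query, "format": "jsonv2", "addressdetails": 1, "limit": 1}
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


def extract_location(results: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Pick city and coordinates from the top hit, or None if no match."""
    if not results:
        return None
    top = results[0]
    address = top.get("address", {})
    # Nominatim labels the settlement differently depending on its size.
    city = address.get("city") or address.get("town") or address.get("village")
    return {
        "city": city,
        "latitude": float(top["lat"]),  # lat/lon arrive as strings
        "longitude": float(top["lon"]),
    }


def geocode(name: str, region: str, country: str, user_agent: str) -> Optional[dict[str, Any]]:
    """Perform one lookup; callers should identify themselves via user_agent."""
    request = Request(build_search_url(name, region, country),
                      headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return extract_location(json.load(response))
```

The public Nominatim instance asks clients to send an identifying User-Agent and to stay at or below one request per second, so a batch run over the Chilean and Mexican records would need a pause between calls (for example `time.sleep(1)`).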
### Priority 2: Description Enhancement
**Brazilian Dataset**:
- 15 records (15.5%) missing descriptions
- Use web scraping or Wikidata to fill gaps
### Priority 3: Identifier Collection
**Mexican Dataset**:
- Only 57.3% have identifiers
- Search for Wikidata IDs, VIAF IDs, institutional websites
- Target: 80%+ identifier coverage

**Brazilian Dataset**:
- 55.7% identifier coverage
- Similar enhancement strategy
## Validation Method
Custom Python validator checks:
1. Required field presence
2. Enum value validity
3. Confidence score ranges (0.0-1.0)
4. Nested object structure (Location, Identifier, Provenance, etc.)
5. Field completeness statistics

**Note**: LinkML's official `linkml-validate` CLI tool could not be used because of Pydantic v1/v2 version conflicts, so the custom validator implements the schema rules directly.
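
To make the checks listed above concrete, here is a minimal sketch of a per-record validator covering required fields, the confidence score range, and a city-name warning. It is an illustration only and not the contents of `validate_instances.py`; the function name `validate_record`, the message wording, and the `city` key inside each location are assumptions. The required field names come from the Required Fields Compliance section of this report.

```python
from typing import Any

REQUIRED_CUSTODIAN = ("id", "name", "institution_type", "provenance")
REQUIRED_PROVENANCE = (
    "data_source", "data_tier", "extraction_date",
    "extraction_method", "confidence_score",
)


def validate_record(record: dict[str, Any]) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for a single HeritageCustodian record."""
    errors: list[str] = []
    warnings: list[str] = []

    # 1. Required top-level fields must be present and non-empty.
    for field in REQUIRED_CUSTODIAN:
        if not record.get(field):
            errors.append(f"missing required field: {field}")

    # 2. Nested provenance structure and its required fields.
    provenance = record.get("provenance")
    if isinstance(provenance, dict):
        for field in REQUIRED_PROVENANCE:
            if provenance.get(field) is None:
                errors.append(f"missing required field: provenance.{field}")
        # 3. Confidence score must be a number in [0.0, 1.0].
        score = provenance.get("confidence_score")
        if score is not None and not (
            isinstance(score, (int, float)) and 0.0 <= score <= 1.0
        ):
            errors.append("provenance.confidence_score must be within 0.0-1.0")

    # 4. Missing city names are warnings, not errors.
    for location in record.get("locations") or []:
        if not location.get("city"):
            warnings.append("location missing city")

    return errors, warnings
```

Keeping errors and warnings in separate lists mirrors the Errors and Warnings columns in the summary table: a record with only warnings still counts as valid.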
## Conclusion
**All 304 Latin American heritage institution records are valid** according to the LinkML heritage_custodian.yaml schema v0.2.0.
The datasets are production-ready for:
- RDF/JSON-LD export
- SPARQL endpoint loading
- Data aggregation and analysis
- Further enrichment (geocoding, identifiers, web scraping)

Next steps: Apply geocoding enrichment to Chilean and Mexican datasets to bring all three datasets to similar quality levels.