glam/data/instances/archive/validation_report_2025-11-06.md

LinkML Validation Report - Latin American GLAM Institutions

Date: 2025-11-06
Validator: Custom Python validator (validate_instances.py)
Schema: heritage_custodian.yaml v0.2.0

Summary

All three curated Latin American heritage institution datasets pass schema validation with zero errors. The 257 warnings are non-blocking; 239 of them flag missing city names.

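| Dataset   | Records | Errors | Warnings | Status |
|-----------|---------|--------|----------|--------|
| Chilean   | 90      | 0      | 90       | VALID  |
| Mexican   | 117     | 0      | 113      | VALID  |
| Brazilian | 97      | 0      | 54       | VALID  |
| TOTAL     | 304     | 0      | 257      | VALID  |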

Validation Details

Chilean Institutions (chilean_institutions_curated.yaml)

  • Records: 90
  • Errors: 0
  • Warnings: 90 (all missing city names)
  • Field Coverage:
    • id: 100%
    • name: 100%
    • institution_type: 100%
    • description: 100%
    • provenance: 100%
    • locations: 100%
    • digital_platforms: 100%

Issues:

  • All 90 records missing city names in locations (only region/country provided)
  • Recommended: Apply geocoding enrichment to add city names and coordinates

Mexican Institutions (mexican_institutions_curated.yaml)

  • Records: 117
  • Errors: 0
  • Warnings: 113 (110 missing city names, 3 other)
  • Field Coverage:
    • id: 100%
    • name: 100%
    • institution_type: 100%
    • description: 100%
    • provenance: 100%
    • locations: 70.9% (83/117)
    • identifiers: 57.3% (67/117)
    • digital_platforms: 23.1% (27/117)
    • collections: 3.4% (4/117)

Issues:

  • 110 records missing city names
  • 34 records missing location information entirely
  • Recommended: Apply geocoding enrichment
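
The field coverage figures in this report are simple ratios of populated records to total records. The following is a minimal sketch of that calculation, not the code in validate_instances.py; the function name `field_coverage` and the rule that empty strings, lists, and dicts count as missing are assumptions.

```python
from typing import Any


def field_coverage(records: list[dict[str, Any]], field: str) -> tuple[int, int, float]:
    """Return (populated, total, percent) for one top-level field."""
    total = len(records)
    # Assumption: a field counts as populated only if it is present and
    # non-empty (None, "", [] and {} are all treated as missing).
    populated = sum(1 for r in records if r.get(field) not in (None, "", [], {}))
    percent = round(100 * populated / total, 1) if total else 0.0
    return populated, total, percent


# Synthetic records reproducing the Mexican "locations" ratio (83 of 117).
records = [{"id": i, "locations": [{"country": "MX"}]} for i in range(83)]
records += [{"id": i} for i in range(83, 117)]
populated, total, percent = field_coverage(records, "locations")
print(f"locations: {percent}% ({populated}/{total})")
```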

Brazilian Institutions (brazilian_institutions_geocoded_v3.yaml)

  • Records: 97
  • Errors: 0
  • Warnings: 54 (39 missing city names, 15 missing descriptions)
  • Field Coverage:
    • id: 100%
    • name: 100%
    • institution_type: 100%
    • provenance: 100%
    • locations: 100%
    • description: 84.5% (82/97)
    • identifiers: 55.7% (54/97)
    • change_history: 7.2% (7/97)

Best Performing Dataset for Geocoding:

  • Already geocoded with 59.8% city coverage (58/97)
  • 51.5% coordinate coverage (50/97)
  • Description coverage is 84.5% (82/97), below the 100% of the Chilean and Mexican datasets

Required Fields Compliance

All 304 records have complete required fields per LinkML schema:

HeritageCustodian (100% compliance)

  • id - 304/304 records
  • name - 304/304 records
  • institution_type - 304/304 records
  • provenance - 304/304 records

Provenance (100% compliance)

  • data_source - 304/304 records
  • data_tier - 304/304 records
  • extraction_date - 304/304 records
  • extraction_method - 304/304 records
  • confidence_score - 304/304 records
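
A required-field check along these lines would produce the compliance counts above. This is a sketch under assumptions: the field names come from this report, but the function name `missing_required`, the dotted `provenance.<field>` labels, and the sample record values are hypothetical.

```python
REQUIRED_CUSTODIAN = ("id", "name", "institution_type", "provenance")
REQUIRED_PROVENANCE = (
    "data_source",
    "data_tier",
    "extraction_date",
    "extraction_method",
    "confidence_score",
)


def missing_required(record: dict) -> list[str]:
    """List required fields that are absent or null in one record."""
    # Compare against None rather than truthiness so that a legitimate
    # confidence_score of 0.0 is not reported as missing.
    missing = [f for f in REQUIRED_CUSTODIAN if record.get(f) is None]
    provenance = record.get("provenance")
    if isinstance(provenance, dict):
        missing += [
            f"provenance.{f}" for f in REQUIRED_PROVENANCE if provenance.get(f) is None
        ]
    return missing


# Hypothetical record: "name" and "provenance.data_tier" are left out on purpose.
record = {
    "id": "example-id",
    "institution_type": "MUSEUM",
    "provenance": {
        "data_source": "CONVERSATION_NLP",
        "extraction_date": "2025-11-06",
        "extraction_method": "example",
        "confidence_score": 0.0,
    },
}
print(missing_required(record))
```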

Enum Validation

All enum values validated successfully:

  • institution_type: All values match InstitutionTypeEnum (GALLERY, LIBRARY, ARCHIVE, MUSEUM, etc.)
  • data_source: All values match DataSource enum (CONVERSATION_NLP, CSV_REGISTRY, etc.)
  • data_tier: All values match DataTier enum (TIER_1_AUTHORITATIVE through TIER_4_INFERRED)
  • change_type: All values match ChangeTypeEnum (FOUNDING, MERGER, etc.)
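
Enum validation can be sketched as a membership test against the permissible values. The enum below is deliberately partial: it lists only the four institution types named in this report, while the real InstitutionTypeEnum in heritage_custodian.yaml has further members. The helper name `enum_error` is an assumption.

```python
from enum import Enum
from typing import Optional


class InstitutionTypeEnum(Enum):
    # Partial on purpose: only the members named in this report.
    GALLERY = "GALLERY"
    LIBRARY = "LIBRARY"
    ARCHIVE = "ARCHIVE"
    MUSEUM = "MUSEUM"


def enum_error(value: object, enum_cls: type[Enum], field: str) -> Optional[str]:
    """Return an error message if value is not permissible, else None."""
    allowed = {member.value for member in enum_cls}
    # Non-string values (lists, numbers, None) can never be permissible.
    if isinstance(value, str) and value in allowed:
        return None
    return f"{field}: {value!r} is not a permissible {enum_cls.__name__} value"


print(enum_error("MUSEUM", InstitutionTypeEnum, "institution_type"))
print(enum_error("museum", InstitutionTypeEnum, "institution_type"))
```

The comparison is case-sensitive, so `museum` is rejected even though `MUSEUM` passes.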

Recommendations

Priority 1: Geocoding Enrichment

Chilean Dataset:

  • 0% city coverage → Target: 60%+
  • Add coordinates via Nominatim API
  • Expected yield: ~54 geocoded records

Mexican Dataset:

  • 6.0% city coverage (7/117) → Target: 60%+
  • Add coordinates via Nominatim API
  • Expected yield: ~70 geocoded records
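
A geocoding pass against the public Nominatim search endpoint might look like the sketch below. It is not the project's enrichment script. The function names and the output keys `city`, `latitude`, and `longitude` are assumptions about how results would map onto Location records, and the one-second pause reflects the public instance's usage policy of at most one request per second with an identifying User-Agent.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Optional

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(name: str, region: str, country: str) -> str:
    """Build a free-text search URL from the fields a record already has."""
    query = ", ".join(part for part in (name, region, country) if part)
    params = {"q": query, "format": "jsonv2", "addressdetails": 1, "limit": 1}
    return f"{NOMINATIM_SEARCH}?{urllib.parse.urlencode(params)}"


def parse_first_hit(payload: str) -> Optional[dict]:
    """Extract city and coordinates from a Nominatim JSON response body."""
    hits = json.loads(payload)
    if not hits:
        return None  # no match: leave the record unchanged
    hit = hits[0]
    address = hit.get("address", {})
    # Nominatim reports the settlement under different keys depending on size.
    city = address.get("city") or address.get("town") or address.get("village")
    return {"city": city, "latitude": float(hit["lat"]), "longitude": float(hit["lon"])}


def geocode(name: str, region: str, country: str, user_agent: str) -> Optional[dict]:
    """Query Nominatim once and pause to respect the rate limit."""
    request = urllib.request.Request(
        build_search_url(name, region, country),
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        result = parse_first_hit(response.read().decode("utf-8"))
    time.sleep(1)  # at most one request per second on the public instance
    return result
```

Splitting URL construction and response parsing from the network call keeps both testable offline. Free-text matches on institution names can be wrong, so results are best reviewed before being written back.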

Priority 2: Description Enhancement

Brazilian Dataset:

  • 15 records (15.5%) missing descriptions
  • Use web scraping or Wikidata to fill gaps

Priority 3: Identifier Collection

Mexican Dataset:

  • Only 57.3% have identifiers
  • Search for Wikidata IDs, VIAF IDs, institutional websites
  • Target: 80%+ identifier coverage

Brazilian Dataset:

  • 55.7% identifier coverage
  • Apply the same strategy as for the Mexican dataset (Wikidata IDs, VIAF IDs, institutional websites)
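
One way to look for Wikidata IDs is the `wbsearchentities` action of the Wikidata API, which returns candidate items for a label. The sketch below only builds the request URL and parses a response body; the function names are assumptions, and the sample payload is synthetic rather than real Wikidata output.

```python
import json
import urllib.parse

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_wikidata_search_url(name: str, language: str = "es") -> str:
    """Build a wbsearchentities URL; use "pt" for the Brazilian dataset."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "format": "json",
        "limit": 5,
    }
    return f"{WIKIDATA_API}?{urllib.parse.urlencode(params)}"


def candidate_ids(payload: str) -> list[tuple[str, str, str]]:
    """Return (id, label, description) for each candidate in a response body."""
    data = json.loads(payload)
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in data.get("search", [])
    ]


# Synthetic response body for illustration only.
sample = '{"search": [{"id": "Q0", "label": "Example", "description": "placeholder"}]}'
print(candidate_ids(sample))
```

Label search returns candidates, not confirmed matches, so each ID should be checked against the record's country and institution type before it is stored.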

Validation Method

Custom Python validator checks:

  1. Required field presence
  2. Enum value validity
  3. Confidence score ranges (0.0-1.0)
  4. Nested object structure (Location, Identifier, Provenance, etc.)
  5. Field completeness statistics

Note: LinkML's official linkml-validate CLI tool could not be used because of Pydantic v1/v2 version conflicts. The custom validator implements the schema rules directly.
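
The split between errors and warnings explains how a dataset can be VALID with dozens of warnings. The sketch below illustrates that split for two of the checks, assuming that an out-of-range confidence score is an error and a missing city name is only a warning. The function name `check_record`, the message wording, and the `city` key on locations are hypothetical, not taken from validate_instances.py.

```python
def check_record(record: dict) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for one record."""
    errors: list[str] = []
    warnings: list[str] = []

    provenance = record.get("provenance")
    score = provenance.get("confidence_score") if isinstance(provenance, dict) else None
    # bool is a subclass of int in Python, so reject it explicitly.
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    if not is_number or not 0.0 <= score <= 1.0:
        errors.append(f"provenance.confidence_score must be in 0.0-1.0, got {score!r}")

    locations = record.get("locations") or []
    if not isinstance(locations, list):
        errors.append("locations must be a list of Location objects")
    elif not any(isinstance(loc, dict) and loc.get("city") for loc in locations):
        # Records with no locations at all also lack a city name.
        warnings.append("missing city name")

    return errors, warnings


# Hypothetical record with region and country only, as in the Chilean dataset.
record = {
    "provenance": {"confidence_score": 0.8},
    "locations": [{"region": "Example Region", "country": "CL"}],
}
errors, warnings = check_record(record)
print("VALID" if not errors else "INVALID", warnings)
```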

Conclusion

All 304 Latin American heritage institution records are valid according to the LinkML heritage_custodian.yaml schema v0.2.0.

The datasets are production-ready for:

  • RDF/JSON-LD export
  • SPARQL endpoint loading
  • Data aggregation and analysis
  • Further enrichment (geocoding, identifiers, web scraping)

Next steps: Apply geocoding enrichment to Chilean and Mexican datasets to bring all three datasets to similar quality levels.