- Introduced LEGAL-FORM-FILTER rule to standardize CustodianName by removing legal form designations.
- Documented rationale, examples, and implementation guidelines for the filtering process.
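
A minimal sketch of the filtering step, for illustration only. The list of legal forms and the function name `strip_legal_form` are assumptions, not taken from the rule text; the documented rule may cover more forms and languages.

```python
import re

# Hypothetical, deliberately short list of legal-form designations.
# The actual LEGAL-FORM-FILTER rule may define a different, longer list.
LEGAL_FORMS = ["Stichting", "GmbH", "Ltd.", "Inc.", "B.V.", "e.V."]

# Match a legal form only when it stands alone as a token, so that a word
# such as "Zinc." is not mistaken for "Inc.".
_PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(form) for form in LEGAL_FORMS) + r")(?!\w)",
    flags=re.IGNORECASE,
)


def strip_legal_form(name: str) -> str:
    """Return the name with legal-form designations removed."""
    stripped = _PATTERN.sub(" ", name)
    stripped = re.sub(r"\s+", " ", stripped)  # collapse leftover whitespace
    return stripped.strip(" ,")  # drop dangling commas and spaces


print(strip_legal_form("Stichting Voorbeeld Museum"))
print(strip_legal_form("Example Archive, Inc."))
```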
docs: Create README for value standardization rules
- Established a comprehensive README outlining various value standardization rules applicable to Heritage Custodian classes.
- Categorized rules into Name Standardization, Geographic Standardization, Web Observation, and Schema Evolution.
feat: Implement transliteration standards for non-Latin scripts
- Added TRANSLIT-ISO rule to ensure GHCID abbreviations are generated from emic names using ISO standards for transliteration.
- Included detailed guidelines for various scripts and languages, along with implementation examples.
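
A rough sketch of how an abbreviation could be derived from a Cyrillic emic name, assuming ISO 9:1995 for the transliteration step. The helper names and the "first letter of each word" abbreviation scheme are assumptions for illustration; the TRANSLIT-ISO rule itself defines the authoritative per-script standards.

```python
import unicodedata

# ISO 9:1995 mappings for the basic Russian alphabet (lowercase).
ISO9_CYRILLIC = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "ë",
    "ж": "ž", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "ш": "š", "щ": "ŝ", "ъ": "ʺ",
    "ы": "y", "ь": "ʹ", "э": "è", "ю": "û", "я": "â",
}


def transliterate(text: str) -> str:
    """Transliterate Cyrillic characters, leaving everything else untouched."""
    out = []
    for ch in text:
        latin = ISO9_CYRILLIC.get(ch.lower())
        if latin is None:
            out.append(ch)
        else:
            out.append(latin.upper() if ch.isupper() else latin)
    return "".join(out)


def abbreviation(emic_name: str) -> str:
    """Hypothetical scheme: ASCII initial of each transliterated word."""
    initials = []
    for word in transliterate(emic_name).split():
        decomposed = unicodedata.normalize("NFD", word)  # split off diacritics
        letters = [c for c in decomposed if c.isascii() and c.isalpha()]
        if letters:
            initials.append(letters[0].upper())
    return "".join(initials)


name = "Российская государственная библиотека"
print(transliterate(name))
print(abbreviation(name))
```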
feat: Define XPath provenance rules for web observations
- Created XPATH-PROVENANCE rule mandating XPath pointers for claims extracted from web sources.
- Established a workflow for archiving websites and verifying claims against archived HTML.
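
The verification step could look roughly like the sketch below. It assumes the archived page is well-formed XHTML so that the standard library's limited XPath support is enough; real-world HTML would likely need a tolerant HTML parser. The function name `verify_claim` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def verify_claim(archived_html: str, xpath: str, claimed_value: str) -> bool:
    """Check that the node at `xpath` in the archived page holds the claim."""
    root = ET.fromstring(archived_html)  # raises ParseError on malformed markup
    node = root.find(xpath)
    if node is None:
        return False  # the pointer no longer resolves in the archived copy
    text = "".join(node.itertext()).strip()
    return text == claimed_value


# Hypothetical archived snippet and claim.
ARCHIVED = (
    "<html><body><div id='about'>"
    "<h1>Example Museum</h1><p>Founded in 1901.</p>"
    "</div></body></html>"
)

print(verify_claim(ARCHIVED, "./body/div[@id='about']/h1", "Example Museum"))
```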
chore: Update records lifecycle diagram
- Generated a new Mermaid diagram illustrating the records lifecycle for heritage custodians.
- Included phases for active records, inactive archives, and processed heritage collections with key relationships and classifications.
- Final 42 files updated
- Normalization complete: all 27,511 custodian files have a location: block
- 15,419 files have coordinates with coordinate_provenance
- 12,092 files have address-only location blocks
- 2,546 files updated with location blocks
- All 27,511 custodian files now have location: block
- 15,421 files have coordinates with coordinate_provenance
- 12,090 files have address-only location blocks
- Fixed 469 JP files missing location: blocks (had data in original_entry.locations)
- Fixed 117 additional JP files found in second pass
- 1 EG file skipped (no location source data available)
- Total files with location: blocks now 27,459 out of 27,511 (99.8%)
- Also includes YAML formatting standardization (line wrapping)
Recovery from data loss in commit 62fdd35321 is now complete.
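
A sketch of the backfill described for the JP files, operating on an already-parsed record (a plain dict). Only `location` and `original_entry.locations` come from the notes above; the keys inside each location entry and the function name are assumptions.

```python
def backfill_location(record: dict) -> bool:
    """Copy the first original_entry location into `location` if it is missing.

    Returns True when the record was changed, False when it already had a
    location block or when no source data is available (the skipped case).
    """
    if record.get("location"):
        return False
    locations = (record.get("original_entry") or {}).get("locations") or []
    if not locations:
        return False
    record["location"] = dict(locations[0])  # copy, do not alias the source
    return True


# Hypothetical record; the inner keys are illustrative only.
record = {"original_entry": {"locations": [{"city": "Kyoto", "country": "JP"}]}}
print(backfill_location(record), record["location"])
```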
- Add emic_name, name_language, and standardized_name to 1,781 custodian files
- Remove 2,239 duplicate files that had name suffixes in their filenames
- Consolidate data into base GHCID files per PID stability rules
- Part of UNESCO Memory of the World custodian enrichment
Remove redundant ch_annotator metadata and duplicate ghcid_history entries
that were causing YAML parsing issues. Files now have cleaner, more
consistent structure while preserving all essential data.
Rename 144 custodian files from XXX placeholders to resolved city codes:
- BR (65): ASA, RBR, MAN, FAZ, MAC, VIT, FOR, GUA, BRE, GOI, SLU, BHO, XAN, CGR, CUI, etc.
- CH (24): ZUR, BER, GEN, BAL, LUC, etc.
- MX (23): MEX, GDL, MTY, PUE, etc.
- CL (9): SCL, VAL, etc.
- CZ (5): PRG, BRN, MSV, etc.
- KR (4): SEL, etc.
- GB (4): LON, etc.
- FR (3): PAR, etc.
- IN (2): DEL, etc.
- PH, JP, EE (1 each)
City codes derived from GeoNames reverse geocoding using institution coordinates.
GHCID format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}
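
To illustrate the format, a small sketch that parses a GHCID and swaps an `XXX` city placeholder for a resolved code. The class and function names, and the sample region, type, and abbreviation values, are hypothetical; the reverse-geocoding lookup itself is not shown.

```python
from typing import NamedTuple


class GHCID(NamedTuple):
    country: str
    region: str
    city: str
    inst_type: str
    abbrev: str

    def __str__(self) -> str:
        return "-".join(self)


def parse_ghcid(value: str) -> GHCID:
    """Split a GHCID string into its five hyphen-separated components."""
    parts = value.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 components, got {len(parts)}: {value!r}")
    return GHCID(*parts)


def resolve_city(ghcid: GHCID, city_code: str) -> GHCID:
    """Replace the XXX placeholder; leave already-resolved GHCIDs untouched."""
    if ghcid.city != "XXX":
        return ghcid
    return ghcid._replace(city=city_code)


# Hypothetical identifier with an unresolved city code.
old = parse_ghcid("CH-ZH-XXX-M-EM")
print(str(resolve_city(old, "ZUR")))
```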
Remove 229 custodian YAML files containing invalid characters in GHCIDs:
- Ampersand (&) in abbreviations (e.g., BM&HS, UNL&AG, DR&IMSM)
- Parentheses in abbreviations (e.g., WHO(RA, VK(, SL()
- Unicode characters in filenames (Ö, Ä, Å, É, İ, Ż, etc.)
These files are replaced with corrected versions using alphabetic-only
abbreviations per AGENTS.md Rule 8 (Special Characters MUST Be Excluded).
Related scripts updated for location resolution.
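
The alphabetic-only constraint on abbreviations can be checked with a sketch like the following. The sanitizing strategy shown (strip diacritics, drop everything that is not an ASCII letter) is an assumption about how corrected abbreviations might be produced, not a description of the scripts that were updated.

```python
import re
import unicodedata

_ALPHA_ONLY = re.compile(r"[A-Z]+")


def is_valid_abbreviation(abbrev: str) -> bool:
    """True when the abbreviation consists solely of A-Z."""
    return _ALPHA_ONLY.fullmatch(abbrev) is not None


def sanitize_abbreviation(abbrev: str) -> str:
    """Assumed cleanup: decompose accents, keep only ASCII letters."""
    decomposed = unicodedata.normalize("NFD", abbrev)
    return "".join(c for c in decomposed if c.isascii() and c.isalpha()).upper()


for raw in ["BM&HS", "WHO(RA", "ÖAW"]:
    print(raw, is_valid_abbreviation(raw), sanitize_abbreviation(raw))
```

Letters without a canonical decomposition (for example Ø or Ł) are dropped by this approach and would need an explicit mapping.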
- Introduced `llm_extract_archiveslab.py` script for entity and relationship extraction using LLMAnnotator with GLAM-NER v1.7.0.
- Replaced regex-based extraction with generative LLM inference.
- Added functions for loading markdown content, converting annotation sessions to dictionaries, and generating extraction statistics.
- Implemented comprehensive logging of extraction results, including counts of entities, relationships, and specific types like heritage institutions and persons.
- Results and statistics are saved in JSON format for further analysis.
- Implemented `generate_mermaid_with_instances.py` to create ER diagrams that include all classes, relationships, enum values, and instance data.
- Loaded instance data from YAML files and enriched enum definitions with meaningful annotations.
- Configured output paths for generated diagrams in both frontend and schema directories.
- Added support for excluding technical classes and limiting the number of displayed enum and instance values for readability.
- Implemented a Python script to validate KB library YAML files for required fields and data quality.
- Analyzed enrichment coverage from Wikidata and Google Maps, generating statistics.
- Created a comprehensive markdown report summarizing validation results and enrichment quality.
- Included error handling for file loading and validation processes.
- Generated JSON statistics for further analysis.
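
A simplified sketch of the required-field check and the JSON statistics, working on records that have already been loaded into dicts. The required field names and the shape of the statistics are assumptions; the actual script may check different fields.

```python
import json

# Hypothetical required fields, for illustration only.
REQUIRED_FIELDS = ("ghcid", "name", "location")


def validate_record(record: dict) -> list[str]:
    """Return the names of required fields that are missing or empty."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def summarize(records: dict[str, dict]) -> dict:
    """Validate every record and collect per-file errors plus totals."""
    errors = {}
    for filename, record in records.items():
        missing = validate_record(record)
        if missing:
            errors[filename] = missing
    return {
        "total": len(records),
        "valid": len(records) - len(errors),
        "errors": errors,
    }


# Hypothetical in-memory records keyed by filename.
records = {
    "a.yaml": {"ghcid": "NL-XX-XXX-L-A", "name": "Library A", "location": {"city": "X"}},
    "b.yaml": {"ghcid": "NL-XX-XXX-L-B", "name": ""},
}
print(json.dumps(summarize(records), indent=2))
```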
- Introduced a comprehensive class diagram for the heritage custodian observation reconstruction schema.
- Defined multiple classes including AllocationAgency, ArchiveOrganizationType, AuxiliaryDigitalPlatform, and others, with relevant attributes and relationships.
- Established inheritance and associations among classes to represent complex relationships within the schema.
- Generated on 2025-11-28, version 0.9.0, excluding the Container class.