- Download GeoNames JP postal code database (142K entries)
- Create geocode_japan_postal.py with postal code lookup
- Handle unicode hyphen variants in postal codes
- Add manual mappings for remote Tokyo islands (Hachijojima, Miyakejima)
- Implement prefix fallback for company postal codes
- Total JP files geocoded: 540 (99.81% coverage)
This brings overall geocoding coverage from 97.84% to 99.81%.
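The hyphen handling and prefix fallback described above can be sketched roughly as follows. This is a minimal illustration, not the contents of geocode_japan_postal.py: the function names, the list of hyphen variants, and the fallback strategy (reuse the first entry sharing the longest leading digits) are all assumptions.

```python
import re

# Hyphen-like characters that may appear in postal codes (assumed list).
HYPHEN_VARIANTS = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212\u30fc\uff0d\uff70"

# Map hyphen variants to ASCII "-" and full-width digits to ASCII digits.
_TO_ASCII = str.maketrans({
    **{ch: "-" for ch in HYPHEN_VARIANTS},
    **{chr(0xFF10 + d): str(d) for d in range(10)},
})


def normalize_postal_code(raw):
    """Return the code as 'NNN-NNNN', or None if it is not a 7-digit code."""
    match = re.fullmatch(r"([0-9]{3})-?([0-9]{4})", raw.translate(_TO_ASCII).strip())
    if match is None:
        return None
    return f"{match.group(1)}-{match.group(2)}"


def lookup(raw, table, min_prefix=3):
    """Exact lookup first, then fall back to an entry sharing a digit prefix.

    `table` maps normalized codes to any value (hypothetical structure).
    """
    code = normalize_postal_code(raw)
    if code is None:
        return None
    if code in table:
        return table[code]
    digits = code.replace("-", "")
    # Try progressively shorter prefixes: 6 digits down to `min_prefix`.
    for length in range(6, min_prefix - 1, -1):
        prefix = digits[:length]
        for key in sorted(table):
            if key.replace("-", "").startswith(prefix):
                return table[key]
    return None
```

A code that is absent from the table, such as a company-specific one, then resolves to a neighbouring entry with the same leading digits, which is coarser than an exact match but usually lands in the same area.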
- Improved city name cleaning:
  - Roman numeral district suffixes (Kolín V. -> Kolín)
  - City + country suffixes (Genève 4 - Suisse -> Genève)
  - Czech postal notation (p. Luka nad Jihlavou -> Luka nad Jihlavou)
  - Historical city names (Gottwaldov -> Zlín, renamed 1990)
  - Manual mappings for Swiss districts (Lugano Massagno -> Lugano)
- Handle Czech address patterns:
  - House numbers with čp./č.p. prefix
  - X nad/pod Y town names (rivers/landmarks)
  - Hyphenated district names (Město-Část)
  - Trailing numbers and suffixes
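The city name cleaning rules listed above could be expressed as a small chain of regex substitutions followed by lookup tables. The sketch below only covers the examples given in the list; the function name, the regexes, and the two dictionaries are hypothetical, and the Czech house-number patterns are not handled here.

```python
import re

# Lookup tables seeded with the examples from the list above (illustrative only).
HISTORICAL_NAMES = {"Gottwaldov": "Zlín"}
MANUAL_MAPPINGS = {"Lugano Massagno": "Lugano"}


def clean_city_name(raw):
    """Reduce a raw locality string to a plain city name."""
    name = raw.strip()
    name = re.sub(r"^p\.\s*", "", name)        # Czech postal "p." prefix
    name = re.sub(r"\s+-\s+\S.*$", "", name)   # " - Suisse" style suffix
    name = re.sub(r"\s+\d+$", "", name)        # trailing district number
    name = re.sub(r"\s+[IVX]+\.?$", "", name)  # Roman numeral district suffix
    name = MANUAL_MAPPINGS.get(name, name)
    return HISTORICAL_NAMES.get(name, name)
```

The Roman numeral rule is deliberately narrow (only I, V, X after whitespace at the end of the string), but it would still strip a genuine trailing word made of those letters, so a real implementation would need an exception list.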
- Improved city name normalization to handle:
  - St. Gallen / St.Gallen -> Sankt Gallen
  - Canton suffixes (Buchs SG, Brugg AG)
  - Hyphenated districts (Bernex - Genève)
  - Postal codes with slashes (Ecublens/VD)
  - German prepositions (Hausen b. Brugg)
- Created scripts/geocode_from_city_name.py for unified geocoding
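Two of the normalization cases above are easy to illustrate: expanding "St." to "Sankt" and dropping a trailing canton code. The sketch assumes that canton suffixes are simply removed (the log does not state the target form), and `normalize_swiss_city` is a hypothetical name, not a function from scripts/geocode_from_city_name.py. Hyphenated districts and German prepositions are left out because their intended output is not given.

```python
import re

# Two-letter codes of the 26 Swiss cantons.
CANTON_CODES = {
    "AG", "AI", "AR", "BE", "BL", "BS", "FR", "GE", "GL", "GR", "JU", "LU", "NE",
    "NW", "OW", "SG", "SH", "SO", "SZ", "TG", "TI", "UR", "VD", "VS", "ZG", "ZH",
}


def normalize_swiss_city(raw):
    """Expand a leading 'St.' and strip a trailing canton code (assumed behaviour)."""
    name = raw.strip()
    # "St. Gallen" and "St.Gallen" both become "Sankt Gallen".
    name = re.sub(r"^St\.\s*", "Sankt ", name)
    # "Buchs SG" or "Ecublens/VD": city, then a space or slash, then a canton code.
    match = re.fullmatch(r"(.*\S)\s*[ /]\s*([A-Z]{2})", name)
    if match and match.group(2) in CANTON_CODES:
        name = match.group(1)
    return name
```

Checking the suffix against the canton set avoids stripping arbitrary two-letter endings that happen to be uppercase.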
- Introduced LEGAL-FORM-FILTER rule to standardize CustodianName by removing legal form designations.
- Documented rationale, examples, and implementation guidelines for the filtering process.
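As a rough illustration of what a legal-form filter might do, the sketch below removes a handful of legal form designations from a name. The list of forms, the function name, and the example names are assumptions for illustration; the actual LEGAL-FORM-FILTER rule document defines the authoritative list and edge cases.

```python
import re

# Illustrative subset of legal form designations (not the rule's actual list).
LEGAL_FORMS = ["Stichting", "B.V.", "N.V.", "GmbH", "e.V.", "Ltd.", "Ltd", "Inc.", "Inc"]

# Longest forms first so "Inc." is tried before "Inc".
_LEGAL_FORM_RE = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(form) for form in sorted(LEGAL_FORMS, key=len, reverse=True))
    + r")(?!\w)",
    re.IGNORECASE,
)


def strip_legal_form(name):
    """Remove legal form designations; keep the original if nothing would remain."""
    stripped = _LEGAL_FORM_RE.sub(" ", name)
    stripped = re.sub(r"\s+", " ", stripped).strip(" ,")
    return stripped or name
```

The lookarounds keep the filter from cutting into longer words, so "Incorporated" is left alone even though it starts with "Inc".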
docs: Create README for value standardization rules
- Established a comprehensive README outlining various value standardization rules applicable to Heritage Custodian classes.
- Categorized rules into Name Standardization, Geographic Standardization, Web Observation, and Schema Evolution.
feat: Implement transliteration standards for non-Latin scripts
- Added TRANSLIT-ISO rule to ensure GHCID abbreviations are generated from emic names using ISO standards for transliteration.
- Included detailed guidelines for various scripts and languages, along with implementation examples.
feat: Define XPath provenance rules for web observations
- Created XPATH-PROVENANCE rule mandating XPath pointers for claims extracted from web sources.
- Established a workflow for archiving websites and verifying claims against archived HTML.
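The verification step in that workflow amounts to re-evaluating the stored XPath against the archived page and comparing the text. A minimal sketch, assuming the archived HTML is well-formed enough to parse as XML and that the pointer uses the XPath subset supported by the standard library's `xml.etree.ElementTree` (relative paths only); `verify_claim` is a hypothetical helper, and real-world HTML would likely need a dedicated HTML parser instead.

```python
import xml.etree.ElementTree as ET


def verify_claim(archived_html, xpath, claimed_value):
    """Return True if the text found at `xpath` equals the claimed value."""
    root = ET.fromstring(archived_html)
    node = root.find(xpath)
    if node is None:
        return False
    # Join all nested text and collapse whitespace before comparing.
    text = " ".join("".join(node.itertext()).split())
    return text == claimed_value
```

For example, `verify_claim(page, ".//h1[@class='name']", "Example Archive")` checks that the first matching heading in the archived copy still carries the claimed name.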
chore: Update records lifecycle diagram
- Generated a new Mermaid diagram illustrating the records lifecycle for heritage custodians.
- Included phases for active records, inactive archives, and processed heritage collections with key relationships and classifications.
- Final 42 files updated
- Normalization complete: all 27,511 custodian files have location block
- 15,419 files have coordinates with coordinate_provenance
- 12,092 files have address-only location blocks
- 2,546 files updated with location blocks
- All 27,511 custodian files now have location: block
- 15,421 files have coordinates with coordinate_provenance
- 12,090 files have address-only location blocks
- Fixed 469 JP files missing location: blocks (had data in original_entry.locations)
- Fixed 117 additional JP files found in second pass
- 1 EG file skipped (no location source data available)
- Total files with location: blocks now 27,459 out of 27,511 (99.8%)
- Also includes YAML formatting standardization (line wrapping)
Recovery from data loss in commit 62fdd35321 is now complete.
- Add emic_name, name_language, and standardized_name to 1,781 custodian files
- Remove 2,239 duplicate files that had name suffixes in filename
- Consolidate data into base GHCID files per PID stability rules
- Part of UNESCO Memory of the World custodian enrichment
Remove redundant ch_annotator metadata and duplicate ghcid_history entries
that were causing YAML parsing issues. Files now have cleaner, more
consistent structure while preserving all essential data.
Rename 144 custodian files from XXX placeholders to resolved city codes:
- BR (65): ASA, RBR, MAN, FAZ, MAC, VIT, FOR, GUA, BRE, GOI, SLU, BHO, XAN, CGR, CUI, etc.
- CH (24): ZUR, BER, GEN, BAL, LUC, etc.
- MX (23): MEX, GDL, MTY, PUE, etc.
- CL (9): SCL, VAL, etc.
- CZ (5): PRG, BRN, MSV, etc.
- KR (4): SEL, etc.
- GB (4): LON, etc.
- FR (3): PAR, etc.
- IN (2): DEL, etc.
- PH, JP, EE (1 each)
City codes derived from GeoNames reverse geocoding using institution coordinates.
GHCID format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}
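Replacing the XXX placeholder is a matter of splitting the identifier on the format above and swapping the third component. The sketch below assumes the first four components never contain a hyphen; the helper name and the example region, type, and abbreviation values are hypothetical.

```python
def replace_city_placeholder(ghcid, city_code):
    """Swap an 'XXX' city component for a resolved code; leave other GHCIDs unchanged."""
    # Split into at most five parts so any hyphen in the abbreviation is preserved.
    parts = ghcid.split("-", 4)
    if len(parts) != 5:
        raise ValueError(f"not a five-part GHCID: {ghcid!r}")
    country, region, city, inst_type, abbrev = parts
    if city != "XXX":
        return ghcid
    return "-".join([country, region, city_code.upper(), inst_type, abbrev])


# Example with made-up region, type, and abbreviation values:
print(replace_city_placeholder("CH-ZH-XXX-M-ABC", "ZUR"))
```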
Remove 229 custodian YAML files containing invalid characters in GHCIDs:
- Ampersand (&) in abbreviations (e.g., BM&HS, UNL&AG, DR&IMSM)
- Parentheses in abbreviations (e.g., WHO(RA, VK(, SL()
- Unicode characters in filenames (Ö, Ä, Å, É, İ, Ż, etc.)
These files are replaced with corrected versions using alphabetic-only
abbreviations per AGENTS.md Rule 8 (Special Characters MUST Be Excluded).
Related scripts updated for location resolution.
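A check for the alphabetic-only requirement can be very small. The sketch below validates an abbreviation and shows one possible way to derive a corrected form; folding diacritics via Unicode decomposition is an assumption here, since the log does not say how characters such as Ö were rewritten, and both function names are hypothetical.

```python
import re
import unicodedata


def is_valid_abbreviation(abbrev):
    """True if the abbreviation consists only of ASCII letters A-Z."""
    return re.fullmatch(r"[A-Z]+", abbrev) is not None


def sanitize_abbreviation(abbrev):
    """Drop everything except A-Z after folding diacritics (assumed policy)."""
    # NFD splits e.g. "Ö" into "O" plus a combining mark, which is then removed.
    decomposed = unicodedata.normalize("NFD", abbrev.upper())
    return re.sub(r"[^A-Z]", "", decomposed)
```

Letters without a canonical decomposition (for example Ø) are dropped entirely by this approach, so a production version would need an explicit mapping for them.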
- Introduced `llm_extract_archiveslab.py` script for entity and relationship extraction using LLMAnnotator with GLAM-NER v1.7.0.
- Replaced regex-based extraction with generative LLM inference.
- Added functions for loading markdown content, converting annotation sessions to dictionaries, and generating extraction statistics.
- Implemented comprehensive logging of extraction results, including counts of entities, relationships, and specific types like heritage institutions and persons.
- Results and statistics are saved in JSON format for further analysis.
- Implemented `generate_mermaid_with_instances.py` to create ER diagrams that include all classes, relationships, enum values, and instance data.
- Loaded instance data from YAML files and enriched enum definitions with meaningful annotations.
- Configured output paths for generated diagrams in both frontend and schema directories.
- Added support for excluding technical classes and limiting the number of displayed enum and instance values for readability.
- Implemented a Python script to validate KB library YAML files for required fields and data quality.
- Analyzed enrichment coverage from Wikidata and Google Maps, generating statistics.
- Created a comprehensive markdown report summarizing validation results and enrichment quality.
- Included error handling for file loading and validation processes.
- Generated JSON statistics for further analysis.
- Introduced a comprehensive class diagram for the heritage custodian observation reconstruction schema.
- Defined multiple classes including AllocationAgency, ArchiveOrganizationType, AuxiliaryDigitalPlatform, and others, with relevant attributes and relationships.
- Established inheritance and associations among classes to represent complex relationships within the schema.
- Generated on 2025-11-28, version 0.9.0, excluding the Container class.
- Implemented a Python script that fetches and enriches entries from the NDE Register using data from Wikidata.
- Utilized the Wikibase REST API and SPARQL endpoints for data retrieval.
- Added logging for tracking progress and errors during the enrichment process.
- Configured rate limiting based on authentication status for API requests.
- Created a structured output in YAML format, including detailed enrichment data.
- Generated a log file summarizing the enrichment process and results.
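Rate limiting that depends on authentication status can be sketched as a limiter whose minimum interval between requests is chosen from whether a token is present. The class name, helper name, and the request rates below are placeholders, not the values used by the enrichment script or mandated by Wikidata.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed, then record its time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def limiter_for(token, authenticated_rps=5.0, anonymous_rps=1.0):
    """Pick a faster limiter when an API token is available (rates are placeholders)."""
    return RateLimiter(authenticated_rps if token else anonymous_rps)
```

Calling `limiter.wait()` before each API request is enough; injecting `clock` and `sleep` keeps the class testable without real delays.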
- Created PlantUML diagrams for custodian types, full schema, legal status, and organizational structure.
- Implemented a script to generate GraphViz DOT diagrams from OWL/RDF ontology files.
- Developed a script to generate UML diagrams from modular LinkML schema, supporting both Mermaid and PlantUML formats.
- Enhanced class definitions and relationships in UML diagrams to reflect the latest schema updates.
- Created the Country class with ISO 3166-1 alpha-2 and alpha-3 codes, keeping the design minimal with no additional metadata.
- Integrated the Country class into CustodianPlace and LegalForm schemas to support country-specific feature types and legal forms.
- Removed duplicate keys in FeatureTypeEnum.yaml, resulting in 294 unique feature types.
- Eliminated "Hypernyms:" text from FeatureTypeEnum descriptions, verifying that semantic relationships are now conveyed through ontology mappings.
- Created example instance file demonstrating integration of Country with CustodianPlace and LegalForm.
- Updated documentation to reflect the completion of the Country class implementation and hypernyms removal.
- Implemented `owl_to_mermaid.py` to convert OWL/Turtle files into Mermaid class diagrams.
- Implemented `owl_to_plantuml.py` to convert OWL/Turtle files into PlantUML class diagrams.
- Added two new PlantUML files for custodian multi-aspect diagrams.
- Introduced custodian_hub_v3.mmd, custodian_hub_v4_final.mmd, and custodian_hub_v5_FINAL.mmd for Mermaid representation.
- Created custodian_hub_FINAL.puml and custodian_hub_v3.puml for PlantUML representation.
- Defined entities such as CustodianReconstruction, Identifier, TimeSpan, Agent, CustodianName, CustodianObservation, ReconstructionActivity, Appellation, ConfidenceMeasure, Custodian, LanguageCode, and SourceDocument.
- Established relationships and associations between entities, including temporal extents, observations, and reconstruction activities.
- Incorporated enumerations for various types, statuses, and classifications relevant to custodians and their activities.
- Implemented a new script to extract full metadata from 149 archive detail pages on archive-in-thueringen.de.
- Extracted data includes addresses, emails, phones, directors, collection sizes, opening hours, histories, and more.
- Introduced structured data parsing and error handling for robust data extraction.
- Added rate limiting between requests to limit load on the server.
- Results are saved as JSON with detailed metadata about the extraction process.
- Add Wikidata Q-numbers to 8 Brazilian institutions
- Coverage: 56/212 institutions (26.4%, +5.6pp gain)
- All Q-numbers validated via Wikidata authenticated API
- Largest single batch gain yet
- Note: Duplicate entries detected, deduplication needed
Q-numbers added:
- Q10333651 - Museu da Borracha
- Q10387829 - UFAC Repository
- Q10345196 - Parque Memorial Quilombo dos Palmares
- Q1434444 - Teatro Amazonas
- Q116921020 - Centro Cultural dos Povos da Amazônia
- Q7894381 - UNIFAP
- Q16496091 - Arquivo Público do Estado da Bahia
- Q56695457 - Museu de Arqueologia e Etnologia da UFPR