- Skip YYYYMMDD and YYMMDD date patterns at end of email
- Skip digit sequences longer than 4 characters
- Require non-digit before 4-digit years at end
- Add knid.nl/kabelnoord.nl to consumer domains (Friesland ISP)
- Add 11 missing regional archive domains to HERITAGE_DOMAIN_MAP
- Update recalculation script to re-extract email semantics
Results:
- 3,151 false birth years removed
- 'Likely wrong person' reduced from 533 to 325 (-39%)
- 2,944 candidates' scores boosted
- Updated `entity_review.py` to map email semantic fields from JSON.
- Expanded `email_semantics.py` with additional museum mappings.
- Introduced a new rule in `.opencode/rules/no-duplicate-ontology-mappings.md` to prevent duplicate ontology mappings.
- Added a backup JSON file for entity resolution candidates.
- Created `enrich_email_semantics.py` to enrich candidates with email semantic signals.
- Developed `merge_entity_reviews.py` to merge reviewed decisions from a backup into new candidates.
- Introduced SoundArchiveRecordSetType, SpecialCollectionRecordSetType, SpecializedArchiveRecordSetType, SpecializedArchivesCzechiaRecordSetType, StateArchivesRecordSetType, StateArchivesSectionRecordSetType, StateDistrictArchiveRecordSetType, StateRegionalArchiveCzechiaRecordSetType, TelevisionArchiveRecordSetType, TradeUnionArchiveRecordSetType, UniversityArchiveRecordSetType, VereinsarchivRecordSetType, VerlagsarchivRecordSetType, VerwaltungsarchivRecordSetType, WebArchiveRecordSetType, and WomensArchivesRecordSetType.
- Each new type includes appropriate metadata, slots, and relationships to existing classes.
- Implemented a script to detect and fix Type class violations in LinkML files.
- Remove inline slot definitions from 144 class files
- Create 7 new centralized slot files in modules/slots/:
- custodian_type_broader.yaml
- custodian_type_narrower.yaml
- custodian_type_related.yaml
- definition.yaml
- finding_aid_access_restriction.yaml
- finding_aid_description.yaml
- finding_aid_temporal_coverage.yaml
- Add centralize_inline_slots.py automation script
- Update manifest with new timestamp
Rule 48: Class files must NOT define inline slots - all slots
must be imported from modules/slots/ directory.
Note: Pre-existing IdentifierFormat duplicate class definition
(in Standard.yaml and IdentifierFormat.yaml) not addressed in
this commit - requires separate schema refactor.
- Set up GitHub integration to be disabled.
- Configure Git settings including path and autofetch options.
- Add Gitea instance URL and repository details.
- Enable YAML support for LinkML schemas with validation.
- Define file associations for YAML files.
- Recommend essential extensions for development and exclude unwanted ones.
- Remove hardcoded type mappings, derive dynamically from LinkML
- Extract keywords from annotations, structured_aliases, and comments
- Add rename_plural_slot.py utility for schema slot renaming
Person Enrichment Scripts:
- enrich_person_comprehensive.py: Full-featured web search enrichment via Linkup
with Rule 6/21/26/34/35 compliance (dual timestamps, no fabrication)
- enrich_ppids_linkup.py: Batch PPID enrichment pipeline
- extract_persons_with_provenance.py: Extract person data from LinkedIn HTML
with XPath provenance tracking
LinkML Slot Management:
- update_slot_mappings.py: Update slots for RiC-O naming (Rule 39) and
semantic URI requirements (Rule 38)
- update_class_slot_references.py: Update class files referencing renamed slots
- validate_slot_mappings.py: Validate slot definitions against ontology rules
All scripts follow established project conventions for provenance and
ontology alignment.
Merged LinkedIn-extracted staff sections from PENDING files into their
corresponding proper GHCID custodian files. This consolidates data from
two extraction sources:
- Existing enriched files: Google Maps, Museum Register, YouTube, etc.
- PENDING files: LinkedIn staff data extraction
Files modified:
- 28 custodian files enriched with staff data
- 35 PENDING files deleted (merged into proper locations)
- Originals archived to archive/pending_duplicates_20250109/
Key institutions enriched:
- Rijksmuseum (NL-NH-AMS-M-RM)
- Stedelijk Museum Amsterdam (NL-NH-AMS-M-SMA)
- Amsterdam Museum (NL-NH-AMS-M-AM)
- Regionaal Archief Alkmaar (NL-NH-ALK-A-RAA)
- Maritiem Museum Rotterdam (NL-ZH-ROT-M-MMR)
- And 23 more museums/archives across NL
New scripts:
- scripts/merge_staff_data.py: Automated staff data merger
- scripts/categorize_pending_files.py: PENDING file analysis utility
Script to resolve NL-XX-XXX-PENDING files that have city names in filename:
- Looks up city in GeoNames database
- Updates YAML with location data (city, region, country)
- Generates proper GHCID with UUID v5/v8
- Renames files to match new GHCID
- Archives original PENDING files for reference
- Add --type/-t flag to specify institution type (A, G, H, I, L, M, N, O, R, S, T, U, X, ALL)
- Default still Type I (Intangible Heritage) for backward compatibility
- Skip PENDING files that have no location data
- Update help text with all supported types
Manual review of remaining Type I custodian files without official websites
identified additional false matches in these categories:
Wrong organization type:
- Bird catchers vs bird watchers association
- Heritage org vs webshop
- Regional org vs specific local entity
- Federation vs single member association
- Bell ringers org vs church building
Wrong location:
- Amsterdam org matched to Den Haag
- Haarlem org matched to Apeldoorn
- Rotterdam org matched to Amstelveen
- Dutch org matched to Suriname (!)
- Giethoorn event matched to Belt-Schutsloot
- Duindorp bonfire matched to Scheveningen
Different event/entity:
- Horse racing org vs summer festival
- Street name vs organization
- Heritage foundation vs specific local fair
Total Type I false matches fixed: 62 of 188 files (33%)
Additional Type I custodian files with obvious name mismatches between
KIEN registry entries and Google Maps results. These couldn't be
auto-detected via domain mismatch because they lack official websites.
Fixes:
- Dick Timmerman (person) → carpentry business
- Ria Bos (cigar maker) → money transfer agent
- Stichting Kracom (Krampuslauf) → Happy Caps retail
- Fed. Nederlandse Vertelorganisaties → NET Foundation
- Stichting dodenherdenking Alphen → wrong memorial
- Sao Joao Rotterdam → Heemraadsplein (location not org)
- sport en spel (heritage) → equipment rental
- Eiertikken Ommen → restaurant
Also adds detection and fix scripts for Google Maps false matches.
- Updated documentation to clarify integration points with existing components in the RAG pipeline and DSPy framework.
- Added detailed mapping of SPARQL templates to context templates for improved specificity filtering.
- Implemented wrapper patterns around existing classifiers to extend functionality without duplication.
- Introduced new tests for the SpecificityAwareClassifier and SPARQLToContextMapper to ensure proper integration and functionality.
- Enhanced the CustodianRDFConverter to include ISO country and subregion codes from GHCID for better geospatial data handling.
- Created deliverables_slot for expected or achieved deliverable outputs.
- Introduced event_id_slot for persistent unique event identifiers.
- Added follow_up_date_slot for scheduled follow-up action dates.
- Implemented object_ref_slot for references to heritage objects.
- Established price_slot for price information across entities.
- Added price_currency_slot for currency codes in price information.
- Created protocol_slot for API protocol specifications.
- Introduced provenance_text_slot for full provenance entry text.
- Added record_type_slot for classification of record types.
- Implemented response_formats_slot for supported API response formats.
- Established status_slot for current status of entities or activities.
- Added FactualCountDisplay component for displaying count query results.
- Introduced ReplyTypeIndicator component for visualizing reply types.
- Created approval_date_slot for formal approval dates.
- Added authentication_required_slot for API authentication status.
- Implemented capacity_items_slot for maximum storage capacity.
- Established conservation_lab_slot for conservation laboratory information.
- Added cost_usd_slot for API operation costs in USD.