Commit graph

210 commits

Author SHA1 Message Date
kempersc
7ea7e3d0d7 feat: Add new ontology and schema classes for Heritage and related concepts
- Introduced new classes: Heritage, HeritagePractice, HeritageRelevanceAssessment, HeritageRelevanceScore, HolySiteType, Mandate.
- Added slots for heritage-related attributes including has_or_had_confidence_measure, has_or_had_related_heritage_form, heritage_education, heritage_employer, heritage_mandate, heritage_practice, and more.
- Migrated existing attributes and ensured compliance with RiC-O naming conventions.
- Enhanced documentation and descriptions for clarity and usability.
- Archived previous versions of slots and classes to maintain schema integrity.
2026-01-28 08:06:56 +01:00
kempersc
7992e8abaa Remove deprecated slot definitions and introduce new slots for height, width, x and y coordinates with temporal predicates
- Add new classes for assessment and audit status types, along with a dataset class.
- Archive previous slot definitions and ensure proper migration documentation for the new slots.
- Update links to datasets and their registration status.
2026-01-28 01:27:24 +01:00
kempersc
f800e198ff Refactor code structure for improved readability and maintainability 2026-01-28 01:11:55 +01:00
kempersc
4a277d7d42 standardise slots 2026-01-19 00:09:28 +01:00
kempersc
a6a9ba58b8 standardise slots 2026-01-18 01:23:32 +01:00
kempersc
54b26343c9 Add initial version of QUDT ontology file 2026-01-17 00:08:39 +01:00
kempersc
24cddb82dc enrich ppid profiles 2026-01-16 12:50:50 +01:00
kempersc
7424b85352 Add new slots for heritage custodian entities
- Introduced setpoint_max, setpoint_min, setpoint_tolerance, setpoint_type, setpoint_unit, setpoint_value, temperature_target, track_id, typical_http_methods, typical_metadata_standard, typical_response_formats, typical_scope, typical_technical_feature, unit_code, unit_symbol, unit_type, wikidata_entity, wikidata_equivalent, and wikidata_id slots.
- Each slot includes a unique identifier, name, title, description, and annotations for custodian types and specificity score.
2026-01-16 01:04:38 +01:00
kempersc
f9f3cc8e74 fix: resolve YAML import indentation and add missing slot descriptions
Schema Improvements:
- Fix YAML import indentation across 800+ class files (sed: '^- ../' → '  - ../')
- Add descriptions to 26 inline slots missing them (lint warnings)
- Fix malformed imports in BirthPlace.yaml and CustodianObservation.yaml

Validation Results:
- linkml-lint: 4 warnings (intentional SCREAMING_CASE tier names)
- gen-owl: SUCCESS (164,069 lines generated)
- gen-json-schema: SUCCESS (9.4MB generated)

Files affected: 1,034 files, +23,908 -15,200 lines
2026-01-16 00:09:28 +01:00
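The import indentation fix in f9f3cc8e74 is described only as a sed substitution. A minimal Python sketch of the same rewrite follows, assuming the affected lines are unindented list items that begin with `- ../`. The sample path `../slots/name` is hypothetical, and the dots are escaped here, whereas the quoted sed pattern leaves them as wildcards.

```python
import re

# Matches an unindented list item that starts a relative import, e.g. "- ../slots/name".
# Assumption: only lines beginning at column 0 need fixing.
UNINDENTED_IMPORT = re.compile(r"^- \.\./", flags=re.MULTILINE)


def indent_imports(yaml_text: str) -> str:
    """Indent top-level '- ../' list items by two spaces; leave other lines alone."""
    return UNINDENTED_IMPORT.sub("  - ../", yaml_text)


if __name__ == "__main__":
    # Hypothetical file content, not taken from the repository.
    sample = "imports:\n- ../slots/name\n  - ../slots/already_indented\n"
    print(indent_imports(sample))
```

Lines that are already indented do not match the anchored pattern, so running the rewrite twice leaves the file unchanged.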
kempersc
043ea868b5 fix(schema): Resolve broken imports after slot migration
- Fix empty import list elements (- # comment pattern) in Laptop, Expenses,
  FunctionType, Overview, WebLink, Photography classes
- Replace valid_from/valid_to slots with temporal_extent in class slots lists
- Update slot_usage to use temporal_extent with TimeSpan range
- Update examples to use temporal_extent with begin_of_the_begin/end_of_the_end
- Fix typo is_or_was_is_or_was_archived_at → is_or_was_archived_at in WebObservation
- Add TimeSpan imports to classes using temporal_extent
- Fix relative import paths for Timestamp in temporal slots
- Fix CustodianIdentifier → Identifier imports in FundingAgenda, ReadingRoomAnnex

Schema validates successfully with 902 classes and 2043 slots.
2026-01-15 12:25:27 +01:00
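The valid_from/valid_to replacement in 043ea868b5 can be pictured as a record-level migration. The sketch below assumes that valid_from maps to begin_of_the_begin and valid_to maps to end_of_the_end. The commit names these fields but does not state the mapping, and the function name `migrate_validity` is hypothetical.

```python
from typing import Any, Dict


def migrate_validity(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of record with valid_from/valid_to folded into temporal_extent.

    Assumed mapping (not confirmed by the commit message):
    valid_from -> begin_of_the_begin, valid_to -> end_of_the_end.
    """
    migrated = {k: v for k, v in record.items() if k not in ("valid_from", "valid_to")}
    extent: Dict[str, Any] = {}
    if record.get("valid_from") is not None:
        extent["begin_of_the_begin"] = record["valid_from"]
    if record.get("valid_to") is not None:
        extent["end_of_the_end"] = record["valid_to"]
    if extent:
        # Only add temporal_extent when at least one bound is known.
        migrated["temporal_extent"] = extent
    return migrated
```

Records without either bound pass through unchanged, so the migration can be applied to a whole dataset without first filtering it.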
kempersc
b13674400f Refactor schema slots and classes for improved organization and clarity
- Removed deprecated slots: appraisal_notes, branch_id, is_or_was_real.
- Introduced new slots: has_or_had_notes, has_or_had_provenance.
- Created Notes class to encapsulate note-related metadata.
- Archived removed slots and classes in accordance with the new archive folder convention.
- Updated slot_fixes.yaml to reflect migration status and details.
- Enhanced documentation for new slots and classes, ensuring compliance with ontology alignment.
- Added new slots for note content, date, and type to support the Notes class.
2026-01-14 12:14:07 +01:00
kempersc
b30711fcfb update slots 2026-01-14 09:05:54 +01:00
kempersc
d51bba5003 data: update entity resolution confidence scores
Regenerated confidence scores with updated scoring algorithm:
- Total candidates: 78,746
- Adjusted: 2,832 (was 3,869)
- Boosted: 2,499 (was 3,192)
- Penalized: 333 (was 677)
- Likely wrong person: 533
- Reviews preserved: 57

Confidence scoring version: 2.0
2026-01-13 21:54:18 +01:00
kempersc
92b490d690 edit slots 2026-01-13 20:35:11 +01:00
kempersc
f74513e8ef feat: Enhance entity resolution with email semantics and review merging
- Updated `entity_review.py` to map email semantic fields from JSON.
- Expanded `email_semantics.py` with additional museum mappings.
- Introduced a new rule in `.opencode/rules/no-duplicate-ontology-mappings.md` to prevent duplicate ontology mappings.
- Added a backup JSON file for entity resolution candidates.
- Created `enrich_email_semantics.py` to enrich candidates with email semantic signals.
- Developed `merge_entity_reviews.py` to merge reviewed decisions from a backup into new candidates.
2026-01-13 16:43:56 +01:00
kempersc
1fb924c412 feat: add ontology mappings to LinkML schema and enhance entity resolution
Schema enhancements (443 files):
- Add class_uri with proper ontology references (schema:, prov:, skos:, rico:)
- Add close_mappings, related_mappings per Rule 50 convention
- Replace stub hc: slot_uri with standard predicates (dcterms:identifier, skos:prefLabel)
- Improve descriptions with ontology mapping rationale
- Add prefixes blocks to all schema modules

Entity Resolution improvements:
- Add entity_resolution module with email semantics parsing
- Enhance build_entity_resolution.py with email-based matching signals
- Extend Entity Review API with filtering by signal types and count
- Add candidates caching and indexing for performance
- Add ReviewLoginPage component

New rules and documentation:
- Add Rule 51: No Hallucinated Ontology References
- Add .opencode/rules/no-hallucinated-ontology-references.md
- Add .opencode/rules/slot-ontology-mapping-reference.md
- Add adms.ttl and dqv.ttl ontology files

Frontend ontology support:
- Add RiC-O_1-1.rdf and schemaorg.owl to public/ontology
2026-01-13 13:51:02 +01:00
kempersc
846a6cdcec Add new Record Set Types for various archival collections
- Introduced SoundArchiveRecordSetType, SpecialCollectionRecordSetType, SpecializedArchiveRecordSetType, SpecializedArchivesCzechiaRecordSetType, StateArchivesRecordSetType, StateArchivesSectionRecordSetType, StateDistrictArchiveRecordSetType, StateRegionalArchiveCzechiaRecordSetType, TelevisionArchiveRecordSetType, TradeUnionArchiveRecordSetType, UniversityArchiveRecordSetType, VereinsarchivRecordSetType, VerlagsarchivRecordSetType, VerwaltungsarchivRecordSetType, WebArchiveRecordSetType, and WomensArchivesRecordSetType.
- Each new type includes appropriate metadata, slots, and relationships to existing classes.
- Implemented a script to detect and fix Type class violations in LinkML files.
2026-01-12 15:20:29 +01:00
kempersc
355d8be51d centralise slots 2026-01-12 14:33:56 +01:00
kempersc
070c87af7b refactor(migrate_wcms_resume): use recursive glob to find user JSON files and skip macOS hidden files 2026-01-11 23:32:27 +01:00
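A recursive glob that skips macOS hidden files, as mentioned in 070c87af7b, might look like the sketch below. It assumes that "hidden" means names starting with a dot, which covers AppleDouble `._*` sidecar files. The function name `find_user_json` is hypothetical and is not taken from migrate_wcms_resume.

```python
from pathlib import Path
from typing import List


def find_user_json(root: str) -> List[Path]:
    """Recursively collect *.json files under root, skipping dot-prefixed names."""
    return sorted(
        path
        for path in Path(root).rglob("*.json")
        # Assumption: macOS hidden files are those whose name starts with ".",
        # e.g. "._user.json" sidecar files created on non-HFS volumes.
        if path.is_file() and not path.name.startswith(".")
    )
```

Sorting the result gives a stable processing order, which helps when a run is resumed from a checkpoint.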
kempersc
56c373bba8 Implement fast WCMS migration script with state file checkpointing and batch processing 2026-01-11 22:26:37 +01:00
kempersc
fce186b649 enrich person profiles 2026-01-11 18:08:40 +01:00
kempersc
fd792fce2c Refactor code structure for improved readability and maintainability
2026-01-11 15:27:14 +01:00
kempersc
55ef2a831d feat(data): add Belgian surnames dataset with metadata and surname counts 2026-01-11 13:50:20 +01:00
kempersc
7d09e4179c Add US surnames dataset from 2010 Census with metadata and surname counts 2026-01-11 12:28:58 +01:00
kempersc
dfb4744dc7 Evaluate data enrichments of persons 2026-01-11 12:15:27 +01:00
kempersc
49a8c341b5 chore(data): update geonames database journal file 2026-01-11 02:51:52 +01:00
kempersc
170fd73c49 feat(agents): update critical rules section to include entity resolution guidelines 2026-01-11 02:51:18 +01:00
kempersc
556cc6c294 Add workspace configuration for Git and Gitea integration
- Disable GitHub integration.
- Configure Git settings including path and autofetch options.
- Add Gitea instance URL and repository details.
- Enable YAML support for LinkML schemas with validation.
- Define file associations for YAML files.
- Recommend essential extensions for development and exclude unwanted ones.
2026-01-11 02:50:39 +01:00
kempersc
b3e57e709c Refactor code structure for improved readability and maintainability 2026-01-11 02:24:34 +01:00
kempersc
0df26a6e44 data(person): additional person profile enrichments 2026-01-11 00:41:59 +01:00
kempersc
3eb097d92e data(person): enrich 64 person profiles with comprehensive metadata
- Add inferred birth dates using EDTF notation
- Add inferred birth/current settlements
- Enrich employment history with temporal data
- Add heritage sector relevance scores
- Improve PPID component tracking
- Update .gitignore with large file patterns (warc, nt, trix, geonames.db)
2026-01-11 00:38:09 +01:00
kempersc
ac36b80476 feat(rag): add companion queries for count templates
Add companion_query support to fetch full entity records alongside
aggregate count queries. Enables displaying results on map/list when
asking 'how many museums in Amsterdam?'

Backend changes:
- Add companion_query, companion_query_region, companion_query_country
  fields to TemplateDefinition and TemplateMatchResult
- Add render_template_string() for raw companion query rendering

Template changes:
- Add companion queries to count_institutions_by_type_and_location
  for settlement, region, and country level queries
- Returns institution URI, name, coordinates, city for visualization
2026-01-10 18:44:06 +01:00
kempersc
f8b4ecad7d data(person): enrich 7 person profiles with detailed employment history
Update heritage professional profiles with:
- Separate role entries for different positions at same institution
- Employment date ranges (start_date, end_date)
- Updated observed_on timestamps
- Direct LinkedIn profile URLs as source

Profiles updated:
- Antoinet Nijssen (Noord-Hollands Archief)
- Anna Lakmaker
- Annelies Reus
- Marianne Hamersma
- Marcel Auwers
- Hans Felius
- Nico Vriend
2026-01-10 18:43:27 +01:00
kempersc
28c3aaf33f enrich profiles 2026-01-10 17:31:02 +01:00
kempersc
bd257c52f4 data(person): update 2 additional profiles 2026-01-10 15:39:12 +01:00
kempersc
2f33e6a230 data(person): update DR-STAPEL profile 2026-01-10 15:38:37 +01:00
kempersc
ec18e1810d data(person): enrich 7 profiles with detailed affiliations and GHCIDs
- Add GHCID references to custodian affiliations
- Add start dates for employment periods
- Expand heritage type classifications (A→[A,F])
- Add detailed rationales based on career history
- Add full_initials from archival publications
2026-01-10 15:36:49 +01:00
kempersc
e5a08a353d enrich person profiles 2026-01-10 14:14:04 +01:00
kempersc
9339de2cfb data(person): process 44,512 heritage-relevant profiles from entity extractions
Processing Summary:
- Scanned 94,716 LinkedIn entity files
- Identified 44,512 heritage-relevant individuals (47%)
- Created 1,430 new PPID-formatted profiles
- Updated 43,070 existing profiles with entity data
- Final count: 40,731 person profiles

Profile updates include:
- Merged web_claims with full provenance
- Added/updated heritage_relevance scoring
- Added affiliation data with custodian references
- Added inferred birth decades with provenance chains (Rule 45)

All data preserved per Rule 5 (additive only)
2026-01-10 14:01:29 +01:00
kempersc
6f3cf95492 data(person): fix data quality issues and PPID corrections
Data Quality Corrections:
- TIRANA-ADISUNA: Fix erroneous death_year claim (was education end date 2016,
  not death). Set is_living=true. Reassess heritage_relevance=false (tourism
  ministry is not a GLAM institution)
- ALEX-ALSEMGEEST: Rename from NL-ZH-TH (The Hague) to NL-ZH-ROT (Rotterdam)
  based on verified birth location. Update birth year to 1980

Profile Enrichments (5 profiles with XX-XX-XXX placeholders):
- Add web claims with proper provenance timestamps
- Add LinkedIn-verified education and position claims
- Document correction rationale in modification_reason

Heritage Relevance Reassessments:
- Government ministries (Tourism, etc.) marked as non-heritage
- Only GLAM institutions (Galleries, Libraries, Archives, Museums) qualify
2026-01-10 13:31:39 +01:00
kempersc
49f4054802 data(person/entity): add 83,845 LinkedIn profile extractions from company pages
Bulk extraction of heritage professional profiles from LinkedIn company pages
using extract_persons_with_provenance.py script.

Key characteristics:
- Source: LinkedIn company 'People' pages for heritage institutions
- File format: {linkedin-slug}_{timestamp}.json
- Total size: ~3.6GB
- Includes: profile_data, heritage_relevance, affiliations, web_claims
- Provenance: Full XPath + archived HTML references (Rule 6 compliant)
- Dual timestamps: statement_created_at + source_archived_at (Rule 35)

Extraction metadata includes:
- extraction_agent: extract_persons_with_provenance.py
- source_file: Original archived HTML filename
- source_archived_at: When LinkedIn page was captured
- schema_version: 1.0.0

Note: URL-encoded filenames preserve international characters (Arabic,
Hebrew, Chinese, Turkish, accented Latin, etc.)
2026-01-10 13:27:08 +01:00
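The `{linkedin-slug}_{timestamp}.json` naming in 49f4054802, combined with URL-encoding, can be sketched with the standard library. The timestamp layout and the example slug below are assumptions, since the commit does not show a concrete filename.

```python
from urllib.parse import quote, unquote


def extraction_filename(linkedin_slug: str, timestamp: str) -> str:
    """Build '{slug}_{timestamp}.json' with the slug percent-encoded as UTF-8."""
    return f"{quote(linkedin_slug, safe='')}_{timestamp}.json"


def slug_from_filename(filename: str) -> str:
    """Recover the original slug; assumes the timestamp contains no underscore."""
    stem = filename[: -len(".json")]
    encoded_slug, _timestamp = stem.rsplit("_", 1)
    return unquote(encoded_slug)


if __name__ == "__main__":
    # Hypothetical slug and timestamp format, for illustration only.
    name = extraction_filename("müller-anna", "20260110T132708Z")
    print(name)  # m%C3%BCller-anna_20260110T132708Z.json
    print(slug_from_filename(name))  # müller-anna
```

Percent-encoding keeps filenames ASCII-only while remaining reversible, which is one way the international characters mentioned in the note could be preserved.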
kempersc
30cd8842d9 data(person): update profiles with web claims and PPID corrections
- Rename SENNAY-GHEBREAB profile: NL-ZH-ROT → ET-XX-ADD (Ethiopian birth)
- Enrich profiles with inferred birth decades and settlements
- Add web claims provenance for enriched data
- Update 16 profiles with improved location resolution

Files: +1 new (renamed), 16 modified, 1 deleted
2026-01-10 12:56:28 +01:00
kempersc
5eaab2bd30 data(person): enrich heritage professional profiles with web claims
Batch enrichment of 3,728 person profiles with additional data:
- Birth decade inference from education/career history
- Location resolution for inferred birth settlements
- Web claims with full provenance (source_url, retrieved_on)
- Organizational subdivision extraction
- Heritage relevance scoring

Also includes:
- 14 profile renames for PPID format corrections
- Updated _manifest.json with extraction statistics
- New _extraction_log.txt and _extraction_summary.json

Enrichment follows AGENTS.md rules:
- Rule 44: EDTF unknown date notation (XXXX, 196X, etc.)
- Rule 45: Inferred data with explicit provenance
- Rule 30: Confidence scoring (0.50-0.95)
- Rule 31: Organizational subdivision extraction

35,052 files changed, +4,507,411 insertions, -63,118 deletions
2026-01-10 10:35:20 +01:00
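The unknown-date notation cited under Rule 44 in 5eaab2bd30 (XXXX, 196X) replaces trailing digits of a year with X. A minimal sketch of reading and writing such years follows, assuming four-character years with X only on the right. The helper names are hypothetical and this is not the project's own parser.

```python
import re
from typing import Tuple

# Four characters: leading digits followed by trailing X placeholders.
_MASKED_YEAR = re.compile(r"^(?=.{4}$)\d*X*$")


def year_bounds(masked_year: str) -> Tuple[int, int]:
    """Return the earliest and latest year a masked year such as '196X' can denote."""
    if not _MASKED_YEAR.match(masked_year):
        raise ValueError(f"not a masked four-digit year: {masked_year!r}")
    return int(masked_year.replace("X", "0")), int(masked_year.replace("X", "9"))


def decade_notation(year: int) -> str:
    """Mask the last digit of a four-digit year, e.g. 1964 -> '196X'."""
    if not 1000 <= year <= 9999:
        raise ValueError("expected a four-digit year")
    return f"{year // 10}X"


if __name__ == "__main__":
    print(year_bounds("196X"))    # (1960, 1969)
    print(decade_notation(1964))  # 196X
```

A fully unknown year, `XXXX`, yields the widest possible bounds under this scheme, so callers may prefer to treat it as "no information" rather than as a range.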
kempersc
519b0b47a8 Add Playwright test results JSON file with initial test suite and failure details 2026-01-09 21:33:31 +01:00
kempersc
004d342935 chore: minor updates and evaluation results
- auth.setup.ts: require env vars for test credentials (no hardcoded defaults)
- manifest.json: update schema manifest
- full_evaluation_results.json: add RAG evaluation results
- petra-links.json: update birth date from web claim
2026-01-09 21:10:55 +01:00
kempersc
855fff5962 data(person): resolve PPID locations and enrich profiles
- Rename 512 person files from XX-XX-XXX placeholders to proper GeoNames locations
- Update 2,463 profiles with enriched data
- Add 512 new person profiles (AU, international heritage professionals)
- PPID format: ID_{birth-loc}_{decade}_{work-loc}_{custodian}_{NAME}
2026-01-09 21:09:28 +01:00
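The PPID layout quoted in 855fff5962, `ID_{birth-loc}_{decade}_{work-loc}_{custodian}_{NAME}`, can be sketched as a small formatter and parser. The sketch assumes that underscores separate the six fields and never occur inside one. All component values shown are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPIDParts:
    birth_loc: str   # e.g. "XX-XX-XXX" while the location is unresolved
    decade: str      # e.g. "196X"
    work_loc: str
    custodian: str
    name: str        # e.g. "JANE-DOE"


def format_ppid(parts: PPIDParts) -> str:
    """Join the components with underscores behind the literal 'ID' prefix."""
    return "_".join(
        ["ID", parts.birth_loc, parts.decade, parts.work_loc, parts.custodian, parts.name]
    )


def parse_ppid(ppid: str) -> PPIDParts:
    """Split a PPID back into components; assumes no underscores inside a field."""
    prefix, *fields = ppid.split("_")
    if prefix != "ID" or len(fields) != 5:
        raise ValueError(f"unexpected PPID layout: {ppid!r}")
    return PPIDParts(*fields)


if __name__ == "__main__":
    # Placeholder values only; not a real profile.
    example = PPIDParts("XX-XX-XXX", "196X", "XX-XX-XXX", "EXAMPLE", "JANE-DOE")
    print(format_ppid(example))
```

Under this layout, resolving a placeholder location changes the identifier itself, which is consistent with the file renames the commit reports.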
kempersc
eb122e2532 data(custodian): remove 380 PENDING files after collision merge
PENDING files were merged into existing custodian records in commit eaf80ec.
These temporary collision placeholder files are no longer needed.
2026-01-09 21:06:22 +01:00
kempersc
9e67d0f967 enrich profiles 2026-01-09 20:35:19 +01:00
kempersc
eaf80ec756 data(custodian): merge PENDING collision files into existing custodians
Merge staff data from 7 PENDING files into their matching custodian records:
- NL-XX-XXX-PENDING-SPOT-GRONINGEN → NL-GR-GRO-M-SG (SPOT Groningen, 120 staff)
- NL-XX-XXX-PENDING-DIENST-UITVOERING-ONDERWIJS → NL-GR-GRO-O-DUO
- NL-XX-XXX-PENDING-ANNE-FRANK-STICHTING → NL-NH-AMS-M-AFS
- NL-XX-XXX-PENDING-ALLARD-PIERSON → NL-NH-AMS-M-AP
- NL-XX-XXX-PENDING-STICHTING-JOODS-HISTORISCH-MUSEUM → NL-NH-AMS-M-JHM
- NL-XX-XXX-PENDING-MINISTERIE-VAN-BUITENLANDSE-ZAKEN → NL-ZH-DHA-O-MBZ
- NL-XX-XXX-PENDING-MINISTERIE-VAN-JUSTITIE-EN-VEILIGHEID → NL-ZH-DHA-O-MJV

Originals archived in data/custodian/archive/pending_collisions_20250109/
Add scripts/merge_collision_files.py for reproducible merging
2026-01-09 18:33:00 +01:00
kempersc
e9c9aefc37 data(person): regenerate PPIDs with unidecode support for non-Latin scripts
- Add display_name and name_romanized fields to all 7948 person profiles
- Resolve UNKNOWN-UNKNOWN collision group (Hebrew/Arabic names now properly romanize)
- Hebrew names like אבישי דנינו now generate PPID AVISHI-DANINO instead of UNKNOWN-UNKNOWN
- Collision count reduced from 82 to 81 groups

Regenerated using generate_ppids.py with unidecode support (commit abe30cb)
2026-01-09 18:31:53 +01:00