Commit graph

190 commits

Author SHA1 Message Date
kempersc
fce186b649 enrich person profiles 2026-01-11 18:08:40 +01:00
kempersc
fd792fce2c Refactor code structure for improved readability and maintainability
2026-01-11 15:27:14 +01:00
kempersc
55ef2a831d feat(data): add Belgian surnames dataset with metadata and surname counts 2026-01-11 13:50:20 +01:00
kempersc
7d09e4179c Add US surnames dataset from 2010 Census with metadata and surname counts 2026-01-11 12:28:58 +01:00
kempersc
dfb4744dc7 Evaluate data enrichments of persons 2026-01-11 12:15:27 +01:00
kempersc
49a8c341b5 chore(data): update geonames database journal file 2026-01-11 02:51:52 +01:00
kempersc
170fd73c49 feat(agents): update critical rules section to include entity resolution guidelines 2026-01-11 02:51:18 +01:00
kempersc
556cc6c294 Add workspace configuration for Git and Gitea integration
- Disable GitHub integration.
- Configure Git settings including path and autofetch options.
- Add Gitea instance URL and repository details.
- Enable YAML support for LinkML schemas with validation.
- Define file associations for YAML files.
- Recommend essential extensions for development and exclude unwanted ones.
2026-01-11 02:50:39 +01:00
kempersc
b3e57e709c Refactor code structure for improved readability and maintainability 2026-01-11 02:24:34 +01:00
kempersc
0df26a6e44 data(person): additional person profile enrichments 2026-01-11 00:41:59 +01:00
kempersc
3eb097d92e data(person): enrich 64 person profiles with comprehensive metadata
- Add inferred birth dates using EDTF notation
- Add inferred birth/current settlements
- Enrich employment history with temporal data
- Add heritage sector relevance scores
- Improve PPID component tracking
- Update .gitignore with large file patterns (warc, nt, trix, geonames.db)
2026-01-11 00:38:09 +01:00
kempersc
ac36b80476 feat(rag): add companion queries for count templates
Add companion_query support to fetch full entity records alongside
aggregate count queries. Enables displaying results on map/list when
asking 'how many museums in Amsterdam?'

Backend changes:
- Add companion_query, companion_query_region, companion_query_country
  fields to TemplateDefinition and TemplateMatchResult
- Add render_template_string() for raw companion query rendering

Template changes:
- Add companion queries to count_institutions_by_type_and_location
  for settlement, region, and country level queries
- Returns institution URI, name, coordinates, city for visualization
2026-01-10 18:44:06 +01:00
kempersc
f8b4ecad7d data(person): enrich 7 person profiles with detailed employment history
Update heritage professional profiles with:
- Separate role entries for different positions at same institution
- Employment date ranges (start_date, end_date)
- Updated observed_on timestamps
- Direct LinkedIn profile URLs as source

Profiles updated:
- Antoinet Nijssen (Noord-Hollands Archief)
- Anna Lakmaker
- Annelies Reus
- Marianne Hamersma
- Marcel Auwers
- Hans Felius
- Nico Vriend
2026-01-10 18:43:27 +01:00
kempersc
28c3aaf33f enrich profiles 2026-01-10 17:31:02 +01:00
kempersc
bd257c52f4 data(person): update 2 additional profiles 2026-01-10 15:39:12 +01:00
kempersc
2f33e6a230 data(person): update DR-STAPEL profile 2026-01-10 15:38:37 +01:00
kempersc
ec18e1810d data(person): enrich 7 profiles with detailed affiliations and GHCIDs
- Add GHCID references to custodian affiliations
- Add start dates for employment periods
- Expand heritage type classifications (A→[A,F])
- Add detailed rationales based on career history
- Add full_initials from archival publications
2026-01-10 15:36:49 +01:00
kempersc
e5a08a353d enrich person profiles 2026-01-10 14:14:04 +01:00
kempersc
9339de2cfb data(person): process 44,512 heritage-relevant profiles from entity extractions
Processing Summary:
- Scanned 94,716 LinkedIn entity files
- Identified 44,512 heritage-relevant individuals (47%)
- Created 1,430 new PPID-formatted profiles
- Updated 43,070 existing profiles with entity data
- Final count: 40,731 person profiles

Profile updates include:
- Merged web_claims with full provenance
- Added/updated heritage_relevance scoring
- Added affiliation data with custodian references
- Added inferred birth decades with provenance chains (Rule 45)

All data preserved per Rule 5 (additive only)
2026-01-10 14:01:29 +01:00
kempersc
6f3cf95492 data(person): fix data quality issues and PPID corrections
Data Quality Corrections:
- TIRANA-ADISUNA: Fix erroneous death_year claim (was education end date 2016,
  not death). Set is_living=true. Reassess heritage_relevance=false (tourism
  ministry is not a GLAM institution)
- ALEX-ALSEMGEEST: Rename from NL-ZH-TH (The Hague) to NL-ZH-ROT (Rotterdam)
  based on verified birth location. Update birth year to 1980

Profile Enrichments (5 profiles with XX-XX-XXX placeholders):
- Add web claims with proper provenance timestamps
- Add LinkedIn-verified education and position claims
- Document correction rationale in modification_reason

Heritage Relevance Reassessments:
- Government ministries (Tourism, etc.) marked as non-heritage
- Only GLAM institutions (Galleries, Libraries, Archives, Museums) qualify
2026-01-10 13:31:39 +01:00
kempersc
49f4054802 data(person/entity): add 83,845 LinkedIn profile extractions from company pages
Bulk extraction of heritage professional profiles from LinkedIn company pages
using extract_persons_with_provenance.py script.

Key characteristics:
- Source: LinkedIn company 'People' pages for heritage institutions
- File format: {linkedin-slug}_{timestamp}.json
- Total size: ~3.6GB
- Includes: profile_data, heritage_relevance, affiliations, web_claims
- Provenance: Full XPath + archived HTML references (Rule 6 compliant)
- Dual timestamps: statement_created_at + source_archived_at (Rule 35)

Extraction metadata includes:
- extraction_agent: extract_persons_with_provenance.py
- source_file: Original archived HTML filename
- source_archived_at: When LinkedIn page was captured
- schema_version: 1.0.0

Note: URL-encoded filenames preserve international characters (Arabic,
Hebrew, Chinese, Turkish, accented Latin, etc.)
2026-01-10 13:27:08 +01:00
kempersc
30cd8842d9 data(person): update profiles with web claims and PPID corrections
- Rename SENNAY-GHEBREAB profile: NL-ZH-ROT → ET-XX-ADD (Ethiopian birth)
- Enrich profiles with inferred birth decades and settlements
- Add web claims provenance for enriched data
- Update 16 profiles with improved location resolution

Files: +1 new (renamed), 16 modified, 1 deleted
2026-01-10 12:56:28 +01:00
kempersc
5eaab2bd30 data(person): enrich heritage professional profiles with web claims
Batch enrichment of 3,728 person profiles with additional data:
- Birth decade inference from education/career history
- Location resolution for inferred birth settlements
- Web claims with full provenance (source_url, retrieved_on)
- Organizational subdivision extraction
- Heritage relevance scoring

Also includes:
- 14 profile renames for PPID format corrections
- Updated _manifest.json with extraction statistics
- New _extraction_log.txt and _extraction_summary.json

Enrichment follows AGENTS.md rules:
- Rule 44: EDTF unknown date notation (XXXX, 196X, etc.)
- Rule 45: Inferred data with explicit provenance
- Rule 30: Confidence scoring (0.50-0.95)
- Rule 31: Organizational subdivision extraction

35,052 files changed, +4,507,411 insertions, -63,118 deletions
2026-01-10 10:35:20 +01:00
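
Note on the EDTF notation cited under Rule 44 above: the values "XXXX" and "196X" use X for unspecified digits. A minimal sketch of producing such a decade value from a known or unknown year follows. The function name edtf_decade is hypothetical, and how the project actually infers a birth decade from education or career history is not stated in the commit, so that step is left out.

```python
from typing import Optional


def edtf_decade(year: Optional[int]) -> str:
    """Return a decade-precision EDTF string with X for unspecified digits.

    Hypothetical helper: a missing year maps to "XXXX", a known year to
    its decade with the last digit left unspecified.
    """
    if year is None:
        return "XXXX"
    if not 0 <= year <= 9999:
        raise ValueError("year must be a four-digit calendar year")
    # Drop the final digit and pad to three characters, then append X.
    return f"{year // 10:03d}X"


print(edtf_decade(1964))
print(edtf_decade(None))
```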
kempersc
519b0b47a8 Add Playwright test results JSON file with initial test suite and failure details 2026-01-09 21:33:31 +01:00
kempersc
004d342935 chore: minor updates and evaluation results
- auth.setup.ts: require env vars for test credentials (no hardcoded defaults)
- manifest.json: update schema manifest
- full_evaluation_results.json: add RAG evaluation results
- petra-links.json: update birth date from web claim
2026-01-09 21:10:55 +01:00
kempersc
855fff5962 data(person): resolve PPID locations and enrich profiles
- Rename 512 person files from XX-XX-XXX placeholders to proper GeoNames locations
- Update 2,463 profiles with enriched data
- Add 512 new person profiles (AU, international heritage professionals)
- PPID format: ID_{birth-loc}_{decade}_{work-loc}_{custodian}_{NAME}
2026-01-09 21:09:28 +01:00
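
Note on the PPID format given above: the component order ID_{birth-loc}_{decade}_{work-loc}_{custodian}_{NAME} can be illustrated with a short sketch. The function name build_ppid, the example values, and the upper-cased, hyphen-joined form of the name are assumptions for illustration only; the repository's generate_ppids.py may build identifiers differently.

```python
def build_ppid(birth_loc: str, decade: str, work_loc: str,
               custodian: str, name: str) -> str:
    """Join PPID components in the order named in the commit message."""
    # Assumption: the NAME part is upper-cased, with hyphens between name parts.
    name_part = "-".join(name.upper().split())
    return "_".join(["ID", birth_loc, decade, work_loc, custodian, name_part])


# Hypothetical component values, not taken from any real profile.
ppid = build_ppid("NL-ZH-ROT", "198X", "NL-NH-AMS", "EXAMPLE", "Jan Jansen")
print(ppid)
```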
kempersc
eb122e2532 data(custodian): remove 380 PENDING files after collision merge
PENDING files were merged into existing custodian records in commit eaf80ec.
These temporary collision placeholder files are no longer needed.
2026-01-09 21:06:22 +01:00
kempersc
9e67d0f967 enrich profiles 2026-01-09 20:35:19 +01:00
kempersc
eaf80ec756 data(custodian): merge PENDING collision files into existing custodians
Merge staff data from 7 PENDING files into their matching custodian records:
- NL-XX-XXX-PENDING-SPOT-GRONINGEN → NL-GR-GRO-M-SG (SPOT Groningen, 120 staff)
- NL-XX-XXX-PENDING-DIENST-UITVOERING-ONDERWIJS → NL-GR-GRO-O-DUO
- NL-XX-XXX-PENDING-ANNE-FRANK-STICHTING → NL-NH-AMS-M-AFS
- NL-XX-XXX-PENDING-ALLARD-PIERSON → NL-NH-AMS-M-AP
- NL-XX-XXX-PENDING-STICHTING-JOODS-HISTORISCH-MUSEUM → NL-NH-AMS-M-JHM
- NL-XX-XXX-PENDING-MINISTERIE-VAN-BUITENLANDSE-ZAKEN → NL-ZH-DHA-O-MBZ
- NL-XX-XXX-PENDING-MINISTERIE-VAN-JUSTITIE-EN-VEILIGHEID → NL-ZH-DHA-O-MJV

Originals archived in data/custodian/archive/pending_collisions_20250109/
Add scripts/merge_collision_files.py for reproducible merging
2026-01-09 18:33:00 +01:00
kempersc
e9c9aefc37 data(person): regenerate PPIDs with unidecode support for non-Latin scripts
- Add display_name and name_romanized fields to all 7948 person profiles
- Resolve UNKNOWN-UNKNOWN collision group (Hebrew/Arabic names now properly romanize)
- Hebrew names like אבישי דנינו now generate PPID AVISHI-DANINO instead of UNKNOWN-UNKNOWN
- Collision count reduced from 82 to 81 groups

Regenerated using generate_ppids.py with unidecode support (commit abe30cb)
2026-01-09 18:31:53 +01:00
kempersc
c45367c60f data(custodian): resolve more PENDING files with proper GHCIDs
Additional batch of PENDING file resolutions:
- DK: Aalborg Teater
- FR: Airborne Museum, ALCA Nouvelle-Aquitaine
- NL: 12 institutions (CODA Apeldoorn, Airborne Museum Arnhem, etc.)
- SA: Saudi Arabia Ministry of Culture

Files renamed from NL-XX-XXX-PENDING-* to proper country/region codes.
2026-01-09 18:29:09 +01:00
kempersc
932ec5438c add person profiles with PPID 2026-01-09 18:26:58 +01:00
kempersc
bd06e4f864 data(custodian): merge 135 PENDING files into existing enriched records
Merge data from PENDING files (with XX-XXX placeholders) into their
corresponding enriched custodian records with proper GHCIDs.

Countries affected:
- DE: 4 institutions (Deutsche Stiftung, Jewish Museum Berlin, etc.)
- ES: 1 institution (Biblioteca Nacional de España)
- FR: 1 institution (NMO)
- ID: 18 Indonesian museums and archives
- NL: 111 Dutch institutions across all provinces
- US: 1 institution (ARCA)

The PENDING files are deleted after merge; originals archived in
data/custodian/archive/pending_merged_20250109/
2026-01-09 18:25:56 +01:00
kempersc
a51c8c400c data(pending): add 125 international PENDING custodian files with proper country codes
Identified 125 institutions from LinkedIn staff extraction that are NOT Dutch:
- FR: 45 (French museums, archives, libraries)
- ID: 14 (Indonesian institutions)
- GB: 14 (British institutions)
- DE: 13 (German museums, foundations)
- BE: 11 (Belgian museums)
- IT: 6 (Italian institutions)
- AU: 6 (Australian archives, museums)
- Plus smaller counts from IN, US, ES, CH, DK, AT, SA, NO, IL

These files have staff data from LinkedIn company pages but need
GHCID resolution (currently XX-XXX placeholders for region/city).

Dutch PENDING files remain: 1,283
2026-01-09 15:55:31 +01:00
kempersc
14be18e7c4 feat(data): merge staff data from 30 more PENDING files into enriched custodians
Batch 2 of PENDING file resolution:
- Merged LinkedIn staff data from 30 PENDING files into matching enriched custodians
- Archived processed PENDING files to data/custodian/archive/pending_merged_20250109/
- Notable merges: ASML (994 staff), BBB (117), Apenheul (100), BOEI (93)

Files merged include:
- Corporate: ASML, BOS Foundation, Constructing the Limes
- Museums: Allard Pierson, Apenheul, various regional museums
- Research: Catholic Documentation Centre, Creating Cultures of Care
- Cultural orgs: Cultuur Ondernemen, CultuurOost, CultuurKwadraat

This continues the effort to consolidate PENDING files (1283 remaining).
2026-01-09 15:42:32 +01:00
kempersc
1f723fd5d7 feat(data): merge staff data from 35 PENDING files into enriched custodians
Merged LinkedIn-extracted staff sections from PENDING files into their
corresponding proper GHCID custodian files. This consolidates data from
two extraction sources:
- Existing enriched files: Google Maps, Museum Register, YouTube, etc.
- PENDING files: LinkedIn staff data extraction

Files modified:
- 28 custodian files enriched with staff data
- 35 PENDING files deleted (merged into proper locations)
- Originals archived to archive/pending_duplicates_20250109/

Key institutions enriched:
- Rijksmuseum (NL-NH-AMS-M-RM)
- Stedelijk Museum Amsterdam (NL-NH-AMS-M-SMA)
- Amsterdam Museum (NL-NH-AMS-M-AM)
- Regionaal Archief Alkmaar (NL-NH-ALK-A-RAA)
- Maritiem Museum Rotterdam (NL-ZH-ROT-M-MMR)
- And 23 more museums/archives across NL

New scripts:
- scripts/merge_staff_data.py: Automated staff data merger
- scripts/categorize_pending_files.py: PENDING file analysis utility
2026-01-09 14:51:17 +01:00
kempersc
2c2a312e0a feat(rag): add database routing to 8 more factual query templates
Add databases: ["oxigraph"] to skip vector search for deterministic queries:
- count_institutions_by_type_location (count)
- count_institutions_by_type (aggregation)
- find_institutions_by_founding_date (temporal)
- find_custodians_by_budget_threshold (financial)
- compare_locations (comparative)
- find_by_founding (temporal)
- events_in_period (temporal events)
- institutions_by_founding_decade (temporal aggregation)

Total templates with oxigraph-only routing: 12
2026-01-09 12:33:41 +01:00
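
Note on the routing described above: a template that declares databases: ["oxigraph"] skips vector search. The sketch below shows one way such a lookup could work. The function name databases_for, the default list, and the store name "vector" are hypothetical; only the "databases" key and the "oxigraph" value come from the commit message.

```python
from typing import Dict, List

# Assumption: templates without a "databases" key query every backend.
# "vector" is a placeholder name for the vector store.
DEFAULT_DATABASES: List[str] = ["oxigraph", "vector"]


def databases_for(template: Dict) -> List[str]:
    """Return the backends to query for a template definition."""
    configured = template.get("databases")
    # A missing or empty list falls back to the default set.
    return list(configured) if configured else list(DEFAULT_DATABASES)


count_template = {"id": "count_institutions_by_type", "databases": ["oxigraph"]}
open_template = {"id": "hypothetical_open_question"}

print(databases_for(count_template))
print(databases_for(open_template))
```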
kempersc
b9c30fc970 feat(rag): extend database routing to count, temporal, and financial templates
Add databases: ["oxigraph"] to 5 more templates that don't benefit from vector search:
- count_institutions_by_type_location
- compare_locations
- find_by_founding
- find_custodians_by_budget_threshold
- find_institutions_by_founding_date

Total templates with Oxigraph-only routing: 10
2026-01-09 12:32:28 +01:00
kempersc
17a94613f3 data(custodian): resolve 57 PENDING files to proper GHCID locations
Resolved NL-XX-XXX-PENDING files to proper regional GHCIDs:
- 57 new files with proper location codes (city, region)
- Cities include: Amsterdam, Rotterdam, Utrecht, Leiden, Groningen, etc.
- 34 original PENDING files archived to archive/pending_duplicates_20250109/

Examples:
- NL-XX-XXX-PENDING-AMSTERDAM-MUSEUM → NL-NH-AMS-M-AM (Amsterdam Museum)
- NL-XX-XXX-PENDING-GRONINGEN-MUSEUM → NL-GR-GRO-M-GM (Groninger Museum)
- NL-XX-XXX-PENDING-KUNSTHAL-ROTTERDAM → NL-ZH-ROT-G-KR (Kunsthal Rotterdam)
2026-01-09 12:19:19 +01:00
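
Note on the identifiers above: the examples suggest a GHCID of the shape country-region-city-type-abbreviation, with NL-XX-XXX-PENDING-* marking unresolved files. The sketch below parses that shape. The pattern is inferred from the Dutch examples in this log only and is an assumption; other countries may use different component lengths, and the names Ghcid, parse_ghcid and is_pending are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Ghcid(NamedTuple):
    country: str
    region: str
    city: str
    type_code: str
    abbreviation: str


# Assumed shape: 2-letter country, 2-letter region, 3-letter city,
# 1-letter type code, then an alphanumeric abbreviation.
GHCID_RE = re.compile(r"^([A-Z]{2})-([A-Z]{2})-([A-Z]{3})-([A-Z])-([A-Z0-9]+)$")


def parse_ghcid(identifier: str) -> Optional[Ghcid]:
    """Split a resolved GHCID into components, or return None if it does not fit."""
    match = GHCID_RE.match(identifier)
    return Ghcid(*match.groups()) if match else None


def is_pending(identifier: str) -> bool:
    """True for placeholder identifiers that still need location resolution."""
    return "-XX-XXX-PENDING-" in identifier


print(parse_ghcid("NL-NH-AMS-M-AM"))
print(is_pending("NL-XX-XXX-PENDING-AMSTERDAM-MUSEUM"))
```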
kempersc
76644f55f5 feat(rag): add database routing to geographic query templates
Add databases: ["oxigraph"] to 4 geographic templates to skip vector search:
- list_institutions_by_type_city
- list_institutions_by_type_region
- list_institutions_by_type_country
- list_institutions_in_city

Also add documentation explaining database routing configuration in _metadata.
2026-01-09 11:56:18 +01:00
kempersc
5255128159 fix(data): correct GHCID locations for 4 heritage custodians
Location corrections based on GeoNames reverse geocoding:
- NL-FR-LAN-S-L → NL-FR-DKN-S-L (Historische Werkgroep Kynhout: De Knipe)
- NL-LI-HEE-A-CRGR → NL-LI-MAA-A-CRGR (Centrum Regionale Geschiedenis: Maastricht)
- NL-NB-MID-S-M → NL-NB-BER-S-M (Heemkundekring De Plaets: Berlicum)
- NL-OV-NIJ-A-GH → NL-OV-HEL-A-GH (Gemeente Hellendoorn: Hellendoorn)
2026-01-09 11:55:08 +01:00
kempersc
e128727b13 fix(data): correct GHCID location for Museumreddingboot Terschelling
- Rename NL-FR-HOO-M-MT.yaml → NL-FR-TER-M-MT.yaml
- HOO (Hooghalen) → TER (Terschelling) - correct island location
- Institution is on Terschelling island, not in Drenthe
2026-01-09 11:54:37 +01:00
kempersc
c88fd3af70 Refactor code structure for improved readability and maintainability 2026-01-09 11:05:26 +01:00
kempersc
6608a207d4 update frontend 2026-01-08 15:56:28 +01:00
kempersc
9d68ed8c2e fix: mark 15 more Google Maps false matches via comprehensive review
Manual review of remaining Type I custodian files without official websites
identified additional false matches in these categories:

Wrong organization type:
- Bird catchers vs bird watchers association
- Heritage org vs webshop
- Regional org vs specific local entity
- Federation vs single member association
- Bell ringers org vs church building

Wrong location:
- Amsterdam org matched to Den Haag
- Haarlem org matched to Apeldoorn
- Rotterdam org matched to Amstelveen
- Dutch org matched to Suriname (!)
- Giethoorn event matched to Belt-Schutsloot
- Duindorp bonfire matched to Scheveningen

Different event/entity:
- Horse racing org vs summer festival
- Street name vs organization
- Heritage foundation vs specific local fair

Total Type I false matches fixed: 62 of 188 files (33%)
2026-01-08 15:21:31 +01:00
kempersc
85d9cee82f fix: mark 8 more Google Maps false matches detected via name mismatch
Additional Type I custodian files with obvious name mismatches between
KIEN registry entries and Google Maps results. These couldn't be
auto-detected via domain mismatch because they lack official websites.

Fixes:
- Dick Timmerman (person) → carpentry business
- Ria Bos (cigar maker) → money transfer agent
- Stichting Kracom (Krampuslauf) → Happy Caps retail
- Fed. Nederlandse Vertelorganisaties → NET Foundation
- Stichting dodenherdenking Alphen → wrong memorial
- Sao Joao Rotterdam → Heemraadsplein (location not org)
- sport en spel (heritage) → equipment rental
- Eiertikken Ommen → restaurant

Also adds detection and fix scripts for Google Maps false matches.
2026-01-08 13:26:53 +01:00
kempersc
b2b21abe2b fix: mark 39 Google Maps false matches for Type I intangible heritage custodians
Per Rule 40 (KIEN authoritative source), Google Maps frequently returns
false matches for intangible heritage organizations. These are virtual
networks without commercial storefronts.

Changes:
- Mark google_maps_enrichment.status as FALSE_MATCH
- Preserve original data in original_false_match for audit trail
- Add correction_timestamp and correction_agent provenance
- Special handling for NL-GE-TIE-I-M (Stichting MOZA): also fixed
  YouTube false match (Mozart channel) and removed ~1750 lines of
  irrelevant video data

Detection method: Domain mismatch between Google Maps website field
and official KIEN registry website.
2026-01-08 12:16:39 +01:00
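
Note on the detection method above: a domain mismatch check between the Google Maps website field and the KIEN registry website could look roughly like the sketch below. The function names and the exact record layout are assumptions; the field names status, original_false_match, correction_timestamp and correction_agent come from the commit message, but where they sit in the record is guessed. The repository's own detection and fix scripts may work differently.

```python
import copy
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlparse


def host_of(url: str) -> str:
    """Return the lower-cased host of a URL without a leading 'www.'."""
    if "://" not in url:
        url = "http://" + url  # let urlparse see a network location
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_domain_mismatch(maps_website: Optional[str],
                       registry_website: Optional[str]) -> bool:
    """True only when both websites are known and their hosts differ."""
    if not maps_website or not registry_website:
        return False  # nothing to compare, so no automatic verdict
    return host_of(maps_website) != host_of(registry_website)


def mark_false_match(record: dict, agent: str) -> dict:
    """Flag the enrichment as a false match while keeping the original data."""
    original = copy.deepcopy(record["google_maps_enrichment"])
    record["google_maps_enrichment"] = {
        "status": "FALSE_MATCH",
        "original_false_match": original,  # preserved for the audit trail
        "correction_timestamp": datetime.now(timezone.utc).isoformat(),
        "correction_agent": agent,
    }
    return record


print(is_domain_mismatch("https://www.example.org/contact", "http://example.org"))
print(is_domain_mismatch("https://shop.example.com", "https://example.org"))
```

Records without an official website return False here, so they still need a manual or name-based review.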
kempersc
98c42bf272 Fix LinkML URI conflicts and generate RDF outputs
- Fix scope_note → finding_aid_scope_note in FindingAid.yaml
- Remove duplicate wikidata_entity slot from CustodianType.yaml (import instead)
- Remove duplicate rico_record_set_type from class_metadata_slots.yaml
- Fix range types for equals_string compatibility (uriorcurie → string)
- Move class names from close_mappings to see_also in 10 RecordSetTypes files
- Generate all RDF formats: OWL, N-Triples, RDF/XML, N3, JSON-LD context
- Sync schemas to frontend/public/schemas/

Files: 1,151 changed (includes prior CustodianType migration)
2026-01-07 12:32:59 +01:00
kempersc
11983014bb Enhance specificity scoring system integration with existing infrastructure
- Updated documentation to clarify integration points with existing components in the RAG pipeline and DSPy framework.
- Added detailed mapping of SPARQL templates to context templates for improved specificity filtering.
- Implemented wrapper patterns around existing classifiers to extend functionality without duplication.
- Introduced new tests for the SpecificityAwareClassifier and SPARQLToContextMapper to ensure proper integration and functionality.
- Enhanced the CustodianRDFConverter to include ISO country and subregion codes from GHCID for better geospatial data handling.
2026-01-05 17:37:49 +01:00
kempersc
242bc8bb35 Add new slots for heritage custodian entities
- Created deliverables_slot for expected or achieved deliverable outputs.
- Introduced event_id_slot for persistent unique event identifiers.
- Added follow_up_date_slot for scheduled follow-up action dates.
- Implemented object_ref_slot for references to heritage objects.
- Established price_slot for price information across entities.
- Added price_currency_slot for currency codes in price information.
- Created protocol_slot for API protocol specifications.
- Introduced provenance_text_slot for full provenance entry text.
- Added record_type_slot for classification of record types.
- Implemented response_formats_slot for supported API response formats.
- Established status_slot for current status of entities or activities.
- Added FactualCountDisplay component for displaying count query results.
- Introduced ReplyTypeIndicator component for visualizing reply types.
- Created approval_date_slot for formal approval dates.
- Added authentication_required_slot for API authentication status.
- Implemented capacity_items_slot for maximum storage capacity.
- Established conservation_lab_slot for conservation laboratory information.
- Added cost_usd_slot for API operation costs in USD.
2026-01-05 00:49:05 +01:00