Commit graph

200 commits

Author SHA1 Message Date
kempersc
5255128159 fix(data): correct GHCID locations for 4 heritage custodians
Location corrections based on GeoNames reverse geocoding:
- NL-FR-LAN-S-L → NL-FR-DKN-S-L (Historische Werkgroep Kynhout: De Knipe)
- NL-LI-HEE-A-CRGR → NL-LI-MAA-A-CRGR (Centrum Regionale Geschiedenis: Maastricht)
- NL-NB-MID-S-M → NL-NB-BER-S-M (Heemkundekring De Plaets: Berlicum)
- NL-OV-NIJ-A-GH → NL-OV-HEL-A-GH (Gemeente Hellendoorn: Hellendoorn)
2026-01-09 11:55:08 +01:00
kempersc
e128727b13 fix(data): correct GHCID location for Museumreddingboot Terschelling
- Rename NL-FR-HOO-M-MT.yaml → NL-FR-TER-M-MT.yaml
- HOO (Hooghalen) → TER (Terschelling) - correct island location
- Institution is on Terschelling island, not in Drenthe
2026-01-09 11:54:37 +01:00
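The rename above changes only the location part of the identifier. A minimal sketch of that operation, assuming a GHCID consists of five hyphen-separated components (country, region, location, type, abbreviation). That layout is inferred from the identifiers in this log, not from a specification, and `replace_ghcid_location` is a hypothetical helper name:

```python
def replace_ghcid_location(ghcid: str, new_location: str) -> str:
    """Return the GHCID with its third (location) component replaced.

    Assumes the layout COUNTRY-REGION-LOCATION-TYPE-ABBREVIATION, inferred
    from identifiers such as NL-FR-HOO-M-MT; the real scheme may differ
    (for example when collision suffixes are present).
    """
    parts = ghcid.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 components, got {len(parts)}: {ghcid!r}")
    parts[2] = new_location.upper()
    return "-".join(parts)


print(replace_ghcid_location("NL-FR-HOO-M-MT", "TER"))  # NL-FR-TER-M-MT
```

Rejecting identifiers that do not have exactly five components keeps the sketch from silently rewriting the wrong field.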
kempersc
c88fd3af70 Refactor code structure for improved readability and maintainability 2026-01-09 11:05:26 +01:00
kempersc
6608a207d4 update frontend 2026-01-08 15:56:28 +01:00
kempersc
9d68ed8c2e fix: mark 15 more Google Maps false matches via comprehensive review
Manual review of remaining Type I custodian files without official websites
identified additional false matches in these categories:

Wrong organization type:
- Bird catchers vs bird watchers association
- Heritage org vs webshop
- Regional org vs specific local entity
- Federation vs single member association
- Bell ringers org vs church building

Wrong location:
- Amsterdam org matched to Den Haag
- Haarlem org matched to Apeldoorn
- Rotterdam org matched to Amstelveen
- Dutch org matched to Suriname (!)
- Giethoorn event matched to Belt-Schutsloot
- Duindorp bonfire matched to Scheveningen

Different event/entity:
- Horse racing org vs summer festival
- Street name vs organization
- Heritage foundation vs specific local fair

Total Type I false matches fixed: 62 of 188 files (33%)
2026-01-08 15:21:31 +01:00
kempersc
85d9cee82f fix: mark 8 more Google Maps false matches detected via name mismatch
Additional Type I custodian files with obvious name mismatches between
KIEN registry entries and Google Maps results. These couldn't be
auto-detected via domain mismatch because they lack official websites.

Fixes:
- Dick Timmerman (person) → carpentry business
- Ria Bos (cigar maker) → money transfer agent
- Stichting Kracom (Krampuslauf) → Happy Caps retail
- Fed. Nederlandse Vertelorganisaties → NET Foundation
- Stichting dodenherdenking Alphen → wrong memorial
- Sao Joao Rotterdam → Heemraadsplein (location not org)
- sport en spel (heritage) → equipment rental
- Eiertikken Ommen → restaurant

Also adds detection and fix scripts for Google Maps false matches.
2026-01-08 13:26:53 +01:00
kempersc
b2b21abe2b fix: mark 39 Google Maps false matches for Type I intangible heritage custodians
Per Rule 40 (KIEN authoritative source), Google Maps frequently returns
false matches for intangible heritage organizations. These are virtual
networks without commercial storefronts.

Changes:
- Mark google_maps_enrichment.status as FALSE_MATCH
- Preserve original data in original_false_match for audit trail
- Add correction_timestamp and correction_agent provenance
- Special handling for NL-GE-TIE-I-M (Stichting MOZA): also fixed
  YouTube false match (Mozart channel) and removed ~1750 lines of
  irrelevant video data

Detection method: Domain mismatch between Google Maps website field
and official KIEN registry website.
2026-01-08 12:16:39 +01:00
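The detection method and the changes listed in this commit could be sketched roughly as follows. The field names `google_maps_enrichment`, `status`, `original_false_match`, `correction_timestamp` and `correction_agent` come from the commit message. The `website` key, the nesting of the preserved data, and both function names are assumptions made for illustration, not the repository's actual script:

```python
from copy import deepcopy
from datetime import datetime, timezone
from urllib.parse import urlparse


def registered_host(url: str) -> str:
    """Lower-cased host of a URL, without a leading 'www.'."""
    if "://" not in url:
        url = "//" + url  # lets urlparse treat a bare domain as the host
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def mark_if_false_match(record: dict, kien_website: str, agent: str) -> bool:
    """Mark the Google Maps enrichment as FALSE_MATCH on a domain mismatch.

    Returns True when the record was changed. Records without a website on
    either side are left alone, since this check cannot decide them.
    """
    enrichment = record.get("google_maps_enrichment")
    if not enrichment or not enrichment.get("website") or not kien_website:
        return False
    if registered_host(enrichment["website"]) == registered_host(kien_website):
        return False
    record["google_maps_enrichment"] = {
        "status": "FALSE_MATCH",
        # Keep the rejected data for the audit trail (assumed nesting).
        "original_false_match": deepcopy(enrichment),
        "correction_timestamp": datetime.now(timezone.utc).isoformat(),
        "correction_agent": agent,
    }
    return True
```

Comparing hosts rather than full URLs means differences in scheme, path or a `www.` prefix do not count as a mismatch.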
kempersc
98c42bf272 Fix LinkML URI conflicts and generate RDF outputs
- Fix scope_note → finding_aid_scope_note in FindingAid.yaml
- Remove duplicate wikidata_entity slot from CustodianType.yaml (import instead)
- Remove duplicate rico_record_set_type from class_metadata_slots.yaml
- Fix range types for equals_string compatibility (uriorcurie → string)
- Move class names from close_mappings to see_also in 10 RecordSetTypes files
- Generate all RDF formats: OWL, N-Triples, RDF/XML, N3, JSON-LD context
- Sync schemas to frontend/public/schemas/

Files: 1,151 changed (includes prior CustodianType migration)
2026-01-07 12:32:59 +01:00
kempersc
11983014bb Enhance specificity scoring system integration with existing infrastructure
- Updated documentation to clarify integration points with existing components in the RAG pipeline and DSPy framework.
- Added detailed mapping of SPARQL templates to context templates for improved specificity filtering.
- Implemented wrapper patterns around existing classifiers to extend functionality without duplication.
- Introduced new tests for the SpecificityAwareClassifier and SPARQLToContextMapper to ensure proper integration and functionality.
- Enhanced the CustodianRDFConverter to include ISO country and subregion codes from GHCID for better geospatial data handling.
2026-01-05 17:37:49 +01:00
kempersc
242bc8bb35 Add new slots for heritage custodian entities
- Created deliverables_slot for expected or achieved deliverable outputs.
- Introduced event_id_slot for persistent unique event identifiers.
- Added follow_up_date_slot for scheduled follow-up action dates.
- Implemented object_ref_slot for references to heritage objects.
- Established price_slot for price information across entities.
- Added price_currency_slot for currency codes in price information.
- Created protocol_slot for API protocol specifications.
- Introduced provenance_text_slot for full provenance entry text.
- Added record_type_slot for classification of record types.
- Implemented response_formats_slot for supported API response formats.
- Established status_slot for current status of entities or activities.
- Added FactualCountDisplay component for displaying count query results.
- Introduced ReplyTypeIndicator component for visualizing reply types.
- Created approval_date_slot for formal approval dates.
- Added authentication_required_slot for API authentication status.
- Implemented capacity_items_slot for maximum storage capacity.
- Established conservation_lab_slot for conservation laboratory information.
- Added cost_usd_slot for API operation costs in USD.
2026-01-05 00:49:05 +01:00
kempersc
2dca28d8c1 enrich CH entries with mission statements 2026-01-04 13:12:32 +01:00
kempersc
4f0cafe98a enrich HC profiles 2026-01-02 02:11:04 +01:00
kempersc
349f31ae6f enrich custodian profiles 2026-01-02 02:10:18 +01:00
kempersc
aee76fcc7f backup html content 2025-12-31 02:36:38 +01:00
kempersc
b7701c8a8e backup person profiles 2025-12-31 00:04:09 +01:00
kempersc
7108cb1483 backup person profiles 2025-12-31 00:00:25 +01:00
kempersc
38dcd2ce9c Restore YAML files for Museum Dokkum and Gemeente Smallingerland with enriched data and provenance tracking 2025-12-30 23:58:21 +01:00
kempersc
1d8fd68e3a backup custodian web profiles 2025-12-30 23:53:16 +01:00
kempersc
f6a5962c3b backup person profiles 2025-12-30 23:48:50 +01:00
kempersc
cbf88d2a6d backup person profiles 2025-12-30 23:44:57 +01:00
kempersc
30b701a5ec backup HC data 2025-12-30 23:41:15 +01:00
kempersc
c417d0c758 Refactor code structure for improved readability and maintainability 2025-12-30 23:38:18 +01:00
kempersc
fb0daab718 backup JP profiles 2025-12-30 23:24:30 +01:00
kempersc
b42d6bf5d2 backup CZ and JP 2025-12-30 23:19:38 +01:00
kempersc
45e873ec0a enrich JP BE AR profiles 2025-12-30 23:07:03 +01:00
kempersc
bc6ad46bfa enrich CZ and JP profiles 2025-12-30 23:03:03 +01:00
kempersc
90b402dba6 enrich AR and Czech files 2025-12-30 23:01:01 +01:00
kempersc
cefc847056 Remove custodian entry for Leica AG from YAML file 2025-12-30 03:44:25 +01:00
kempersc
9159ff35db Add custodian entry for Leica AG with data contamination fixes and location corrections 2025-12-30 03:43:47 +01:00
kempersc
d64f857aa9 add sparql validator and RAG injector 2025-12-30 03:43:31 +01:00
kempersc
84904e344b Make AGENTS more succinct by referring to opencode rules & enrich custodians 2025-12-28 14:56:35 +01:00
kempersc
4cf3fe8a07 Logo enrichment batch: JP+170 (5,166/12,096 = 42.7%) - 14,503 total (45.6%) 2025-12-27 13:17:40 +01:00
kempersc
3447a9cc6c Logo enrichment batch: JP+440 (4,996/12,096 = 41.3%) - 14,333 total (45.1%) 2025-12-27 12:20:53 +01:00
kempersc
cdb633b0c9 enrich custodian entries with logo 2025-12-27 02:15:17 +01:00
kempersc
fd91fec63f Logo enrichment batch: JP+320, 13,603 total (42.8%)
- JP: 4,516/12,096 (37.4%)  NEW COMMIT
- CZ: 3,820/8,432 (45.3%) - batches 7-16 running
- CH, NL, BE, AT, BR: 100% complete
- Total: 13,603/31,772 (42.8%)
- Using crawl4ai favicon extraction
2025-12-26 23:25:40 +01:00
kempersc
2104a90f22 Logo enrichment COMPLETE: CZ 3,820 (45.3%)
- CZ: 3,820/8,432 files processed (45.3%)
- 9 parallel batches completed (500 files each)
- NL person entities added (4 staff profiles)
- scripts/discover_websites_crawl4ai.py modified
- Using crawl4ai favicon extraction
2025-12-26 21:45:14 +01:00
kempersc
6af5009444 enrich entries 2025-12-26 21:41:18 +01:00
kempersc
59963c8d3f Logo enrichment batch: JP+300, CZ-0 - 12,833 files (40.4%)
- JP: 4,496 processed (37.2% of 12,096)  COMPLETE
- CZ: 2,820 processed (33.4% of 8,432) - batch completed, slight decrease
- CH, NL, BE, AT, BR: 100% complete
- Total: 12,833 of 31,772 files (40.4%)
- Using crawl4ai favicon extraction
2025-12-26 13:42:21 +01:00
kempersc
6b9fa33767 Logo enrichment batch: CZ+500, JP+170 - 12,513 files (39.4%)
- CZ: 2,820 processed (33.4% of 8,432)
- JP: 4,176 processed (34.5% of 12,096)
- Total: 12,513 of 31,772 (39.4%)
- CZ batch completed: 500 files, 52 logos found
- JP batch crashed during run (4,176 files before crash)
- Using crawl4ai favicon extraction
2025-12-26 02:03:48 +01:00
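The percentages in these progress messages are plain processed/total ratios rounded to one decimal. A small sketch that reproduces the figures of the commit above; the `coverage` helper is hypothetical and only illustrates the arithmetic:

```python
def coverage(processed: int, total: int) -> str:
    """Format processed/total as a percentage with one decimal place."""
    return f"{100 * processed / total:.1f}%"


# Counts copied from the commit message above.
counts = {"CZ": (2820, 8432), "JP": (4176, 12096), "Total": (12513, 31772)}
for label, (processed, total) in counts.items():
    print(f"{label}: {processed:,} of {total:,} ({coverage(processed, total)})")
# CZ: 2,820 of 8,432 (33.4%)
# JP: 4,176 of 12,096 (34.5%)
# Total: 12,513 of 31,772 (39.4%)
```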
kempersc
63400392ff Fix CZ-52-PAB-L-IPVVZOVI logo: use primary_logo.png instead of favicon.ico
- Primary logo (logo.png) identified via crawl4ai direct scraping
- Favicon (favicon.ico) retained as secondary asset
- Updated claims: primary_logo_url + favicon_url
- Summary shows: has_primary_logo: true, total_claims: 2
2025-12-25 21:01:05 +01:00
kempersc
6ab0b19ae2 Logo enrichment batch: CZ+260, JP+260 - 11,663 files (36.7%)
- CZ: 2,810 processed (33.3% of 8,432)
- JP: 3,336 processed (27.6% of 12,096)
- Total: 11,663 of 31,772 (36.7%)
- Using crawl4ai favicon extraction
2025-12-25 19:23:41 +01:00
kempersc
717ee3408a Logo enrichment batch: JP+771, CZ+380 - 10,913 files (34%)
- JP: 2,846 processed (24% of 12,096)
- CZ: 2,550 processed (30% of 8,432)
- CH, NL, BE, AT, BR: 100% complete
- Total: 10,913 of 31,772 files (34%)
- Using crawl4ai favicon extraction
2025-12-25 13:44:26 +01:00
kempersc
c3387ef3f1 Logo enrichment batch: CZ +380, JP +125, AR +28 files
- CZ: 2,170 processed (26% of 8,432)
- JP: 2,075 processed (17% of 12,096)
- AR: Started processing
- Total checkpoint: 9,762 files across all countries
- Using crawl4ai favicon extraction
2025-12-24 12:50:20 +01:00
kempersc
57de5e4b11 CZ logo enrichment: 1,790 files processed (21%)
- Added logo_enrichment to 771 Czech custodian files
- 87% logo hit rate using crawl4ai favicon extraction
- Total checkpoint: 9,257 files across all countries
- CZ remaining: 6,642 files
2025-12-24 02:41:26 +01:00
kempersc
ce1f80d024 enrich: logo enrichment progress (CZ: 220, JP: 1600) 2025-12-23 22:08:43 +01:00
kempersc
4f6ca92084 enrich: logo enrichment progress (JP: 1500, CZ: 40 started) 2025-12-23 21:37:10 +01:00
kempersc
8036eb5a3f enrich: logo enrichment for JP custodians (1490 processed, 10606 remaining) 2025-12-23 21:17:45 +01:00
kempersc
38292d1918 enrich: logo enrichment for JP custodians (1350 processed, 10746 remaining) 2025-12-23 20:56:21 +01:00
kempersc
5e8a432ef0 enrich japanese and dutch custodians 2025-12-23 18:08:45 +01:00
kempersc
a1fb6344e7 enriching custodian data 2025-12-23 17:26:29 +01:00
kempersc
0c1d19e98b enrich entries 2025-12-23 13:27:35 +01:00
kempersc
7a056fa746 enrich entries 2025-12-21 22:12:34 +01:00
kempersc
aca68ea47f remove ambiguous web-claims 2025-12-21 00:01:54 +01:00
kempersc
23b1d8ee5f clean up GHCID 2025-12-17 11:58:40 +01:00
kempersc
99430c2a70 add new entries and semantic routing 2025-12-17 10:11:56 +01:00
kempersc
e0dd847491 extend ontology 2025-12-16 20:27:39 +01:00
kempersc
b0416efc7d enrich custodians and persons 2025-12-16 11:57:34 +01:00
kempersc
52ae711c56 add timespans 2025-12-16 09:02:52 +01:00
kempersc
b1340e30c8 add timespan 2025-12-15 22:35:35 +01:00
kempersc
cb56aa7e40 enrich all custodian timespan 2025-12-15 22:31:41 +01:00
kempersc
525662ea16 data: fix remaining person entity profiles 2025-12-15 01:48:33 +01:00
kempersc
3820f2fc92 chore: Add data reports, infra scripts, and API updates
- Data quality reports for Dutch custodians
- Name mismatch detection reports
- Failed crawl URL tracking
- Caddy configuration updates
- Monitor script for chunk 404 errors
- API endpoint improvements
2025-12-15 01:48:08 +01:00
kempersc
70c30a52d4 data: update person entity profiles with heritage classification 2025-12-15 01:47:42 +01:00
kempersc
181b1cf705 data: enrich Dutch heritage custodians (DR, FL, FR, GE, GR, LI provinces)
- Add digital platform discovery data with provenance
- Cleanup duplicate/incorrect custodian entries
- Add GHCID collision resolution suffixes where needed
- Update person entity profiles with career history
2025-12-15 01:34:38 +01:00
kempersc
1d26cade66 correct person labels 2025-12-14 17:58:55 +01:00
kempersc
c6aee998db correct person labels 2025-12-14 17:29:39 +01:00
kempersc
c50c35fd3a enrich person custodian 2025-12-14 17:09:55 +01:00
kempersc
505c12601a Add test script for PiCo extraction from Arabic waqf documents
- Implemented a new script `test_pico_arabic_waqf.py` to test the GLM annotator's ability to extract person observations from Arabic historical documents.
- The script includes environment variable handling for API token, structured prompts for the GLM API, and validation of extraction results.
- Added comprehensive logging for API responses, extraction results, and validation errors.
- Included a sample Arabic waqf text for testing purposes, following the PiCo ontology pattern.
2025-12-12 17:50:17 +01:00
kempersc
b1f93b6f22 enrich person profiles 2025-12-12 12:51:10 +01:00
kempersc
03263f67d6 moved web archives 2025-12-12 00:40:26 +01:00
kempersc
1b1cfbfca0 enrich custodians 2025-12-11 22:32:09 +01:00
kempersc
d4906abae4 update postgis data 2025-12-10 23:51:51 +01:00
kempersc
be3fbac601 enrich entries and persons 2025-12-10 18:04:25 +01:00
kempersc
41959f0766 correct HCID! 2025-12-10 13:01:13 +01:00
kempersc
c4b0f17a43 geocode: complete 100% coverage - add coordinates to final 26 files (CZ, BE, AR, LB, ML) 2025-12-10 01:07:34 +01:00
kempersc
82e58f6d40 geocode: add coordinates to 29 custodian files via Wikidata P131/P159 lookups 2025-12-10 01:04:29 +01:00
kempersc
6e2c36413e geocode: add coordinates to 540 Japanese custodian files using postal codes
- Download GeoNames JP postal code database (142K entries)
- Create geocode_japan_postal.py with postal code lookup
- Handle unicode hyphen variants in postal codes
- Add manual mappings for remote Tokyo islands (Hachijojima, Miyakejima)
- Implement prefix fallback for company postal codes
- Total JP files geocoded: 540 (99.81% coverage)

This brings overall geocoding coverage from 97.84% to 99.81%
2025-12-10 00:27:33 +01:00
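Two of the steps in this commit, hyphen normalization and prefix fallback, might look roughly like the sketch below. It is not the content of `geocode_japan_postal.py`: the list of hyphen variants, the fallback strategy (shorten the code until some known code shares the prefix) and all function names are assumptions, and the coordinates in the sample table are illustrative rather than GeoNames data:

```python
import re
from typing import Dict, Optional, Tuple

Coord = Tuple[float, float]

# Characters seen in place of an ASCII hyphen (assumed, non-exhaustive list).
HYPHEN_VARIANTS = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212\uff0d\u30fc"
_TO_ASCII_HYPHEN = str.maketrans({ch: "-" for ch in HYPHEN_VARIANTS})


def normalize_postal_code(raw: str) -> Optional[str]:
    """Return a postal code as 'NNN-NNNN', or None if no code is found."""
    text = raw.translate(_TO_ASCII_HYPHEN)
    match = re.search(r"([0-9]{3})-?([0-9]{4})", text)
    return f"{match.group(1)}-{match.group(2)}" if match else None


def lookup_with_prefix_fallback(
    code: str, table: Dict[str, Coord], min_prefix: int = 3
) -> Optional[Coord]:
    """Exact lookup first; otherwise use a known code sharing a prefix.

    Dedicated company codes are often absent from area tables, so the
    sketch shortens the code digit by digit down to `min_prefix` digits.
    """
    if code in table:
        return table[code]
    digits = code.replace("-", "")
    for length in range(len(digits) - 1, min_prefix - 1, -1):
        prefix = digits[:length]
        for key in sorted(table):  # sorted for a deterministic choice
            if key.replace("-", "").startswith(prefix):
                return table[key]
    return None


table = {"100-0001": (35.68, 139.75)}  # illustrative values only
code = normalize_postal_code("100\u22128111")  # U+2212 MINUS SIGN as hyphen
print(code, lookup_with_prefix_fallback(code, table))
```

A coordinate found through the fallback is only approximate, so a real pipeline would probably want to record that in the provenance.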
kempersc
251b5eee68 geocode: add coordinates to 26 more custodian files
- Improved city name cleaning:
  - Roman numeral district suffixes (Kolín V. -> Kolín)
  - City + country suffixes (Genève 4 - Suisse -> Genève)
  - Czech postal notation (p. Luka nad Jihlavou -> Luka nad Jihlavou)
  - Historical city names (Gottwaldov -> Zlín, renamed 1990)
- Manual mappings for Swiss districts (Lugano Massagno -> Lugano)
2025-12-09 22:47:32 +01:00
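The cleaning rules listed in this commit can be illustrated with a few regular expressions. This is a sketch built only from the examples in the message; the patterns, the order in which they are applied and the name `clean_city_name` are assumptions, and real data would likely need more cases:

```python
import re

# Historical names and manual mappings taken from the commit message.
RENAMED = {"Gottwaldov": "Zlín", "Lugano Massagno": "Lugano"}


def clean_city_name(raw: str) -> str:
    """Reduce a free-text city field to a name suitable for lookup."""
    name = raw.strip()
    name = re.sub(r"^p\.\s+", "", name)          # Czech postal notation "p. X"
    name = re.sub(r"\s+-\s+\S.*$", "", name)     # " - Suisse" style suffix
    name = re.sub(r"\s+\d+$", "", name)          # trailing district number
    name = re.sub(r"\s+[IVXLC]+\.$", "", name)   # Roman numeral district "V."
    return RENAMED.get(name, name)


for raw in ["Kolín V.", "Genève 4 - Suisse", "p. Luka nad Jihlavou", "Gottwaldov"]:
    print(f"{raw!r} -> {clean_city_name(raw)!r}")
```

The Roman numeral rule deliberately requires the trailing period, which avoids stripping a genuine final word made up of those letters.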
kempersc
35e1686160 geocode: add coordinates to 69 custodian files across multiple countries
Countries updated: AR, AT, BG, BR, CA, CL, CN, CU, FI, GE, IR, JO, KG,
KR, LB, LI, LV, MX, MY, NI, NL, PS, PY, SX, TM, VN

- Manual city name mappings for transliteration variants
- St. Pölten -> Sankt Pölten (AT)
- Gaza City -> Gaza (PS)
- Beit Hanoun -> Bayt Hanun (PS)
- Veliko Tarnovo via geonames_id (BG)
2025-12-09 22:44:12 +01:00
kempersc
ef9607d991 geocode: add coordinates to 80 Czech custodian files
- Handle Czech address patterns:
  - House numbers with čp./č.p. prefix
  - X nad/pod Y town names (rivers/landmarks)
  - Hyphenated district names (Město-Část)
  - Trailing numbers and suffixes
2025-12-09 22:41:09 +01:00
kempersc
dee7a4c7d9 geocode: add coordinates to 147 Swiss custodian files
- Improved city name normalization to handle:
  - St. Gallen / St.Gallen -> Sankt Gallen
  - Canton suffixes (Buchs SG, Brugg AG)
  - Hyphenated districts (Bernex - Genève)
  - Postal codes with slashes (Ecublens/VD)
  - German prepositions (Hausen b. Brugg)
- Created scripts/geocode_from_city_name.py for unified geocoding
2025-12-09 22:38:33 +01:00
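A sketch of three of the normalizations named in this commit: the "St." prefix, canton suffixes, and slash-separated canton codes. It is not the code in `scripts/geocode_from_city_name.py`. The function name and patterns are assumptions, and the hyphenated-district and German-preposition cases are left out because the log does not state their expected output:

```python
import re

# The 26 Swiss canton abbreviations.
SWISS_CANTONS = (
    "AG AI AR BE BL BS FR GE GL GR JU LU NE NW OW "
    "SG SH SO SZ TG TI UR VD VS ZG ZH"
).split()

# Matches " SG" or "/VD" at the end of the name.
_CANTON_SUFFIX = re.compile(r"(?:\s+|\s*/\s*)(?:%s)$" % "|".join(SWISS_CANTONS))
_SAINT_PREFIX = re.compile(r"^St\.\s*")


def normalize_swiss_city(raw: str) -> str:
    """Strip a trailing canton code and expand a leading 'St.' to 'Sankt'."""
    name = raw.strip()
    name = _CANTON_SUFFIX.sub("", name)
    name = _SAINT_PREFIX.sub("Sankt ", name)
    return name


for raw in ["St.Gallen", "Buchs SG", "Ecublens/VD"]:
    print(f"{raw!r} -> {normalize_swiss_city(raw)!r}")
```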
kempersc
cc61d99acf geocode: add coordinates to BG and EG custodian files
- BG: Add lat/lon from existing GeoNames IDs (28 files)
- EG: Map city codes to GeoNames (CAI→Cairo, ALX→Alexandria, etc.) (28 files)
- Fix malformed EG-IS-`A`-O-SCA.yaml → EG-IS-ISM-O-SCA.yaml
- Overall coverage: 96.4% → 96.6%
2025-12-09 21:59:58 +01:00
kempersc
2137c522db geocode: add coordinates to JP compound cities and CZ files from GeoNames
- JP: Handle Gun/Cho/Machi/Mura compound city names (2615 files)
- CZ: Map city codes to GeoNames entries (667 files)
- Overall coverage: 84.5% → 96.4%
2025-12-09 21:49:40 +01:00
kempersc
92b5e58ef3 geocode: add coordinates to AT, BE, DE, GB, PL, UA, US custodian files from GeoNames 2025-12-09 20:38:34 +01:00
kempersc
3a6ead8fde feat: Add legal form filtering rule for CustodianName
- Introduced LEGAL-FORM-FILTER rule to standardize CustodianName by removing legal form designations.
- Documented rationale, examples, and implementation guidelines for the filtering process.

docs: Create README for value standardization rules

- Established a comprehensive README outlining various value standardization rules applicable to Heritage Custodian classes.
- Categorized rules into Name Standardization, Geographic Standardization, Web Observation, and Schema Evolution.

feat: Implement transliteration standards for non-Latin scripts

- Added TRANSLIT-ISO rule to ensure GHCID abbreviations are generated from emic names using ISO standards for transliteration.
- Included detailed guidelines for various scripts and languages, along with implementation examples.

feat: Define XPath provenance rules for web observations

- Created XPATH-PROVENANCE rule mandating XPath pointers for claims extracted from web sources.
- Established a workflow for archiving websites and verifying claims against archived HTML.

chore: Update records lifecycle diagram

- Generated a new Mermaid diagram illustrating the records lifecycle for heritage custodians.
- Included phases for active records, inactive archives, and processed heritage collections with key relationships and classifications.
2025-12-09 16:58:41 +01:00
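The LEGAL-FORM-FILTER rule introduced in this commit removes legal form designations from a CustodianName. A minimal sketch of such a filter, assuming a small illustrative list of designations; the rule's actual list and implementation are not shown in this log, and `strip_legal_form` is a hypothetical name:

```python
import re

# Illustrative subset only; the rule's real list of legal forms is not
# given in the commit message.
LEGAL_FORMS = ["Stichting", "Vereniging", "B.V.", "N.V.", "GmbH", "AG", "e.V.", "Ltd."]

# Longest alternatives first, matched only as whole tokens (case-sensitive).
_LEGAL_FORM = re.compile(
    r"(?<!\w)(?:%s)(?!\w)"
    % "|".join(re.escape(form) for form in sorted(LEGAL_FORMS, key=len, reverse=True))
)


def strip_legal_form(name: str) -> str:
    """Remove legal form designations and tidy the remaining whitespace."""
    cleaned = _LEGAL_FORM.sub(" ", name)
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" ,")
    return cleaned or name  # never return an empty name


for raw in ["Stichting MOZA", "Leica AG"]:
    print(f"{raw!r} -> {strip_legal_form(raw)!r}")
```

Matching is case-sensitive and token-bounded so that a short form such as "AG" is not removed from inside a longer word.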
kempersc
7b42d720d5 geocode: add coordinates to CZ, BY, CH, FR, ES custodian files from GeoNames (1145 files) 2025-12-09 16:41:41 +01:00
kempersc
b54904ad0a fix: normalize YAML null formatting in Eye Filmmuseum file 2025-12-09 16:34:12 +01:00
kempersc
2c25ed6a96 geocode: add coordinates to JP custodian files from GeoNames (batch 2 - remaining 3639 files) 2025-12-09 16:33:29 +01:00
kempersc
9bc454cdbf geocode: add coordinates to JP custodian files from GeoNames (batch 1 - 3000 files) 2025-12-09 16:32:01 +01:00
kempersc
982620ba0c normalize: add canonical location blocks (batch 7 - final)
- Final 42 files updated
- Normalization complete: all 27,511 custodian files have location block
- 15,419 files have coordinates with coordinate_provenance
- 12,092 files have address-only location blocks
2025-12-09 14:57:33 +01:00
kempersc
e28576ee65 normalize: add canonical location blocks (batch 6)
- 2,546 files updated with location blocks
- All 27,511 custodian files now have location: block
- 15,421 files have coordinates with coordinate_provenance
- 12,090 files have address-only location blocks
2025-12-09 14:44:03 +01:00
kempersc
d20978dcbe normalize: add canonical location blocks (batch 5) 2025-12-09 14:39:02 +01:00
kempersc
3f60aa6238 normalize: add canonical location blocks (batch 4 - final) 2025-12-09 14:18:15 +01:00
kempersc
5b3d4d1ed5 normalize: add canonical location blocks (batch 3) 2025-12-09 14:14:13 +01:00
kempersc
b739ad4e61 normalize: add canonical location blocks (batch 2) 2025-12-09 13:28:59 +01:00
kempersc
bb41287730 normalize: add canonical location blocks (batch 1) 2025-12-09 13:17:11 +01:00
kempersc
a7321b1bb9 reconstruct location blocks 2025-12-09 12:25:16 +01:00
kempersc
85a951bbea normalize: add canonical location blocks to 586 files
- Fixed 469 JP files missing location: blocks (had data in original_entry.locations)
- Fixed 117 additional JP files found in second pass
- 1 EG file skipped (no location source data available)
- Total files with location: blocks now 27,459 out of 27,511 (99.8%)
- Also includes YAML formatting standardization (line wrapping)

Recovery from data loss in commit 62fdd35321 is now complete.
2025-12-09 12:17:34 +01:00
kempersc
cab712659d recover location blocks 2025-12-09 11:34:56 +01:00
kempersc
62fdd35321 Refactor code structure for improved readability and maintainability 2025-12-09 11:15:51 +01:00