Commit graph

146 commits

Author SHA1 Message Date
kempersc
aca68ea47f remove a,bihguous web-claims 2025-12-21 00:01:54 +01:00
kempersc
23b1d8ee5f clean up GHCID 2025-12-17 11:58:40 +01:00
kempersc
99430c2a70 add new entries and semantic routing 2025-12-17 10:11:56 +01:00
kempersc
5fe692296d Fix RDF visualization: correct SPARQL namespaces and show all node types
- Update SPARQL CONSTRUCT query to use correct ontology namespaces:
  - hc: https://w3id.org/heritage/custodian/ (was nde)
  - Use nde:Custodian type (was crm:E39_Actor)
  - Use schema:location + geo:lat/long (was crm predicates)
- Remove LIMIT 500 clause to fetch all results
- Show all node types by default instead of random single type
- Fixes issue where Knowledge Graph showed incomplete/random data
2025-12-17 08:55:26 +01:00
kempersc
e0dd847491 extend ontology 2025-12-16 20:27:39 +01:00
kempersc
0cf93587fb fix(schema): Normalize custodian_types annotation YAML quoting
YAML arrays in LinkML annotations must be quoted strings to ensure
proper parsing. This change quotes all custodian_types annotations
from the raw array format to quoted string format.

Before: custodian_types: ["A", "G"]
After:  custodian_types: '["A", "G"]'

Affected: 50+ class files in modules/classes/
Also updates: manifest.json, 01_custodian_name_modular.yaml
2025-12-16 20:19:45 +01:00
kempersc
8a4727eb34 feat(schema): Add social media post and content modeling schema
Social Media Post Classes:
- SocialMediaPost: Base class for platform-agnostic post modeling
- SocialMediaPostType: Abstract base for post type taxonomy
- SocialMediaPostTypes: Concrete post types (TextPost, ImagePost,
  CarouselPost, StoryPost, ReelPost, ArticlePost, PollPost, EventPost)

Content Classes:
- SocialMediaContent: Rich content modeling with media attachments,
  hashtags, mentions, links, and engagement metrics

Features:
- Platform-specific post type mappings (Instagram, LinkedIn, Twitter, etc.)
- Engagement analytics (likes, comments, shares, saves)
- Heritage institution content categorization
- Media attachment handling (images, videos, documents)
- Hashtag and mention extraction for heritage topic tracking
2025-12-16 20:06:08 +01:00
kempersc
767fb8ca80 feat(schema): Add LinkedIn profile and person modeling schema
Person Identity Classes:
- PersonName: Full name modeling with components (given_name, surname_prefix,
  base_surname, patronym, initials) following Dutch naming conventions
- PersonConnection: Professional network connections with heritage relevance scoring
- ConnectionNetwork: Network-level analysis and statistics

LinkedIn Profile Schema:
- LinkedInProfile: Complete professional profile structure
- WorkExperience: Employment history with heritage institution detection
- EducationCredential: Academic background and qualifications
- LanguageProficiency: Language skills with ISO 639-1 codes

Supporting Classes:
- ExtractionMetadata: Provenance tracking for extracted profile data
- HeritageRelevance: GLAMORCUBESFIXPHDNT type scoring and classification

Slots (17 person-related slots):
- Name components: given_name, base_surname, surname_prefix, patronym, initials
- Identity: age, birth_date, birth_place, death_place, gender_identity, pronouns
- Professional: occupation, religion
- References: literal_name, name_specification, has_person_name, extraction_metadata

Enums:
- HeritageTypeEnum: GLAMORCUBESFIXPHDNT type codes for heritage relevance
2025-12-16 20:04:59 +01:00
kempersc
51554947a0 feat(schema): Add video content schema with comprehensive examples
Video Schema Classes (9 files):
- VideoPost, VideoComment: Social media video modeling
- VideoTextContent: Base class for text content extraction
- VideoTranscript, VideoSubtitle: Text with timing and formatting
- VideoTimeSegment: Time code handling with ISO 8601 duration
- VideoAnnotation: Base annotation with W3C Web Annotation alignment
- VideoAnnotationTypes: Scene, Object, OCR detection annotations
- VideoChapter, VideoChapterList: Navigation and chapter structure
- VideoAudioAnnotation: Speaker diarization, music, sound events

Enumerations (12 enums):
- VideoDefinitionEnum, LiveBroadcastStatusEnum
- TranscriptFormatEnum, SubtitleFormatEnum, SubtitlePositionEnum
- AnnotationTypeEnum, AnnotationMotivationEnum
- DetectionLevelEnum, SceneTypeEnum, TransitionTypeEnum, TextTypeEnum
- ChapterSourceEnum, AudioEventTypeEnum, SoundEventTypeEnum, MusicTypeEnum

Examples (904 lines, 10 comprehensive heritage-themed examples):
- Rijksmuseum virtual tour chapters (5 chapters with heritage entity refs)
- Operation Night Watch documentary chapters (5 chapters)
- VideoAudioAnnotation: curator interview, exhibition promo, museum lecture

All examples reference real heritage entities with Wikidata IDs:
Q5598 (Rembrandt), Q41264 (Vermeer), Q219831 (The Night Watch)
2025-12-16 20:03:17 +01:00
kempersc
b0416efc7d enrich custodians and persons 2025-12-16 11:57:34 +01:00
kempersc
52ae711c56 add timespans 2025-12-16 09:02:52 +01:00
kempersc
b1340e30c8 add timespan 2025-12-15 22:35:35 +01:00
kempersc
cb56aa7e40 enrich all custodian timespan 2025-12-15 22:31:41 +01:00
kempersc
82aa655522 feat(conversation): Add resizable embedding projector panel with improved UX
- Larger default size (700x550) for better readability
- Resizable from all 8 edges/corners with visual SE grip indicator
- Clearer button icons (18px, strokeWidth 2.5)
- Draggable, minimizable, pinnable panel
- Dark theme and mobile responsive support
2025-12-15 17:45:27 +01:00
kempersc
d9892dba6f fix: handle single-vector Qdrant collections and multi-collection embedding dimensions
- Fixed _vector_search() to check uses_named_vectors() before adding 'using' parameter
- Fixed _person_vector_search() to detect person collection vector size and use appropriate model
- Resolves 'Not existing vector name error: openai_1536' for single-vector collections
- Resolves embedding dimension mismatch between heritage_custodians (1536-dim) and heritage_persons (384-dim)
2025-12-15 10:31:39 +01:00
kempersc
31bbce13e6 fix(types): Make genealogiewerkbalk nested fields optional
Fixes TypeScript error where parseGenealogiewerkbalk returns optional
fields but Institution interface expected required fields.
2025-12-15 09:04:09 +01:00
kempersc
525662ea16 data: fix remaining person entity profiles 2025-12-15 01:48:33 +01:00
kempersc
3820f2fc92 chore: Add data reports, infra scripts, and API updates
- Data quality reports for Dutch custodians
- Name mismatch detection reports
- Failed crawl URL tracking
- Caddy configuration updates
- Monitor script for chunk 404 errors
- API endpoint improvements
2025-12-15 01:48:08 +01:00
kempersc
0c36429257 feat(scripts): Add batch crawling and data quality scripts
- batch_crawl4ai_recrawl.py: Retry failed URL crawls
- batch_firecrawl_recrawl.py: FireCrawl batch processing
- batch_httpx_scrape.py: HTTPX-based scraping
- detect_name_mismatch.py: Find name mismatches in data
- enrich_dutch_custodians_crawl4ai.py: Dutch custodian enrichment
- fix_collision_victims.py: GHCID collision resolution
- fix_generic_platform_names*.py: Platform name cleanup
- fix_ghcid_type.py: GHCID type corrections
- fix_simon_kemper_contamination.py: Data cleanup
- scan_dutch_data_quality.py: Data quality scanning
- transform_crawl4ai_to_digital_platform.py: Data transformation
2025-12-15 01:47:46 +01:00
kempersc
70c30a52d4 data: update person entity profiles with heritage classification 2025-12-15 01:47:42 +01:00
kempersc
0a38225b36 feat(frontend): Add multi-select filters, URL params, and UI improvements
- Institution Browser: multi-select for types and countries
- URL query param sync for shareable filter URLs
- New utility: countryNames.ts with flag emoji support
- New utility: imageProxy.ts for image URL handling
- New component: SearchableMultiSelect dropdown
- Career timeline CSS and component updates
- Media gallery improvements
- Lazy load error boundary component
- Version check utility
2025-12-15 01:47:11 +01:00
kempersc
181b1cf705 data: enrich Dutch heritage custodians (DR, FL, FR, GE, GR, LI provinces)
- Add digital platform discovery data with provenance
- Cleanup duplicate/incorrect custodian entries
- Add GHCID collision resolution suffixes where needed
- Update person entity profiles with career history
2025-12-15 01:34:38 +01:00
kempersc
68c5aa2724 feat(api): Add heritage person classification and RAG retry logic
- Add GLAMORCUBESFIXPHDNT heritage type detection for person profiles
- Two-stage classification: blocklist non-heritage orgs, then match keywords
- Special handling for Digital (D) type: requires heritage org context
- Add career_history heritage_relevant and heritage_type fields
- Add exponential backoff retry for Anthropic API overload errors
- Fix DSPy 3.x async context with dspy.context() wrapper
2025-12-15 01:31:54 +01:00
kempersc
22709cc13e feat(rag): Add per-message refresh, bypass cache toggle, and cache clear improvements
- Add refresh button to assistant messages for re-running queries with fresh results
- Highlight refresh button (amber) for cached responses to draw attention
- Add spinning icon animation while refreshing
- Fix cache clear to return detailed success/failure status for local vs shared cache
- Add bypass cache toggle that forces fresh queries (one-shot, resets after query)
- Add Dutch/English translations for new UI elements
2025-12-14 19:12:25 +01:00
kempersc
1d26cade66 correct person labels 2025-12-14 17:58:55 +01:00
kempersc
c6aee998db correct person labels 2025-12-14 17:29:39 +01:00
kempersc
c50c35fd3a enrich person custodian 2025-12-14 17:09:55 +01:00
kempersc
d1c9aebd84 feat(rag): Add hybrid language detection and enhanced ontology mapping
Implement Heritage RAG pipeline enhancements:

1. Ontology Mapping (new file: ontology_mapping.py)
   - Hybrid language detection: heritage vocabulary -> fast-langdetect -> English default
   - HERITAGE_VOCABULARY dict (~40 terms) for domain-specific accuracy
   - FastText-based ML detection with 0.6 confidence threshold
   - Support for Dutch, French, German, Spanish, Italian, Portuguese, English
   - Dynamic synonym extraction from LinkML enum values
   - 93 comprehensive tests (all passing)

2. Schema Loader Enhancements (schema_loader.py)
   - Language-tagged multilingual synonym extraction for DSPy signatures
   - Enhanced enum value parsing with annotations support
   - Better error handling for malformed schema files

3. DSPy Heritage RAG (dspy_heritage_rag.py)
   - Fixed all 10 mypy type errors
   - Enhanced type annotations throughout
   - Improved query routing with multilingual support

4. Dependencies (pyproject.toml)
   - Added fast-langdetect ^1.0.0 (primary language detection)
   - Added types-pyyaml ^6.0.12 (mypy type stubs)

Tests: 93 new tests for ontology_mapping, all passing
Mypy: Clean (no type errors)
2025-12-14 15:55:18 +01:00
kempersc
41aace785f feat: Add SyncPanel component for database synchronization
- Add SyncPanel component with bilingual (NL/EN) support
- Add relative URL handling for production (bronhouder.nl)
- Integrate SyncPanel into Database page
- Show sync status for all 4 databases (DuckLake, PostgreSQL, Oxigraph, Qdrant)
- Support dry-run mode and file limit options
2025-12-12 23:42:22 +01:00
kempersc
505c12601a Add test script for PiCo extraction from Arabic waqf documents
- Implemented a new script `test_pico_arabic_waqf.py` to test the GLM annotator's ability to extract person observations from Arabic historical documents.
- The script includes environment variable handling for API token, structured prompts for the GLM API, and validation of extraction results.
- Added comprehensive logging for API responses, extraction results, and validation errors.
- Included a sample Arabic waqf text for testing purposes, following the PiCo ontology pattern.
2025-12-12 17:50:17 +01:00
kempersc
b1f93b6f22 enrich person profiles 2025-12-12 12:51:10 +01:00
kempersc
03263f67d6 moved web archives 2025-12-12 00:40:26 +01:00
kempersc
1b1cfbfca0 enrich custodians 2025-12-11 22:32:09 +01:00
kempersc
d4906abae4 update postgis data 2025-12-10 23:51:51 +01:00
kempersc
be3fbac601 enrich entries and persons 2025-12-10 18:04:25 +01:00
kempersc
41959f0766 correct HCID! 2025-12-10 13:01:13 +01:00
kempersc
162ca3ad79 docs: add Rules 13-16 for custodian type annotations, Exa LinkedIn, connection data, photo CDN 2025-12-10 09:04:14 +01:00
kempersc
c4b0f17a43 geocode: complete 100% coverage - add coordinates to final 26 files (CZ, BE, AR, LB, ML) 2025-12-10 01:07:34 +01:00
kempersc
82e58f6d40 geocode: add coordinates to 29 custodian files via Wikidata P131/P159 lookups 2025-12-10 01:04:29 +01:00
kempersc
6e2c36413e geocode: add coordinates to 540 Japanese custodian files using postal codes
- Download GeoNames JP postal code database (142K entries)
- Create geocode_japan_postal.py with postal code lookup
- Handle unicode hyphen variants in postal codes
- Add manual mappings for remote Tokyo islands (Hachijojima, Miyakejima)
- Implement prefix fallback for company postal codes
- Total JP files geocoded: 540 (99.81% coverage)

This brings overall geocoding coverage from 97.84% to 99.81%
2025-12-10 00:27:33 +01:00
kempersc
251b5eee68 geocode: add coordinates to 26 more custodian files
- Improved city name cleaning:
  - Roman numeral district suffixes (Kolín V. -> Kolín)
  - City + country suffixes (Genève 4 - Suisse -> Genève)
  - Czech postal notation (p. Luka nad Jihlavou -> Luka nad Jihlavou)
  - Historical city names (Gottwaldov -> Zlín, renamed 1990)
- Manual mappings for Swiss districts (Lugano Massagno -> Lugano)
2025-12-09 22:47:32 +01:00
kempersc
35e1686160 geocode: add coordinates to 69 custodian files across multiple countries
Countries updated: AR, AT, BG, BR, CA, CL, CN, CU, FI, GE, IR, JO, KG,
KR, LB, LI, LV, MX, MY, NI, NL, PS, PY, SX, TM, VN

- Manual city name mappings for transliteration variants
- St. Pölten -> Sankt Pölten (AT)
- Gaza City -> Gaza (PS)
- Beit Hanoun -> Bayt Hanun (PS)
- Veliko Tarnovo via geonames_id (BG)
2025-12-09 22:44:12 +01:00
kempersc
ef9607d991 geocode: add coordinates to 80 Czech custodian files
- Handle Czech address patterns:
  - House numbers with čp./č.p. prefix
  - X nad/pod Y town names (rivers/landmarks)
  - Hyphenated district names (Město-Část)
  - Trailing numbers and suffixes
2025-12-09 22:41:09 +01:00
kempersc
dee7a4c7d9 geocode: add coordinates to 147 Swiss custodian files
- Improved city name normalization to handle:
  - St. Gallen / St.Gallen -> Sankt Gallen
  - Canton suffixes (Buchs SG, Brugg AG)
  - Hyphenated districts (Bernex - Genève)
  - Postal codes with slashes (Ecublens/VD)
  - German prepositions (Hausen b. Brugg)
- Created scripts/geocode_from_city_name.py for unified geocoding
2025-12-09 22:38:33 +01:00
kempersc
cc61d99acf geocode: add coordinates to BG and EG custodian files
- BG: Add lat/lon from existing GeoNames IDs (28 files)
- EG: Map city codes to GeoNames (CAI→Cairo, ALX→Alexandria, etc.) (28 files)
- Fix malformed EG-IS-\`A\`-O-SCA.yaml → EG-IS-ISM-O-SCA.yaml
- Overall coverage: 96.4% → 96.6%
2025-12-09 21:59:58 +01:00
kempersc
2137c522db geocode: add coordinates to JP compound cities and CZ files from GeoNames
- JP: Handle Gun/Cho/Machi/Mura compound city names (2615 files)
- CZ: Map city codes to GeoNames entries (667 files)
- Overall coverage: 84.5% → 96.4%
2025-12-09 21:49:40 +01:00
kempersc
92b5e58ef3 geocode: add coordinates to AT, BE, DE, GB, PL, UA, US custodian files from GeoNames 2025-12-09 20:38:34 +01:00
kempersc
3a6ead8fde feat: Add legal form filtering rule for CustodianName
- Introduced LEGAL-FORM-FILTER rule to standardize CustodianName by removing legal form designations.
- Documented rationale, examples, and implementation guidelines for the filtering process.

docs: Create README for value standardization rules

- Established a comprehensive README outlining various value standardization rules applicable to Heritage Custodian classes.
- Categorized rules into Name Standardization, Geographic Standardization, Web Observation, and Schema Evolution.

feat: Implement transliteration standards for non-Latin scripts

- Added TRANSLIT-ISO rule to ensure GHCID abbreviations are generated from emic names using ISO standards for transliteration.
- Included detailed guidelines for various scripts and languages, along with implementation examples.

feat: Define XPath provenance rules for web observations

- Created XPATH-PROVENANCE rule mandating XPath pointers for claims extracted from web sources.
- Established a workflow for archiving websites and verifying claims against archived HTML.

chore: Update records lifecycle diagram

- Generated a new Mermaid diagram illustrating the records lifecycle for heritage custodians.
- Included phases for active records, inactive archives, and processed heritage collections with key relationships and classifications.
2025-12-09 16:58:41 +01:00
kempersc
7b42d720d5 geocode: add coordinates to CZ, BY, CH, FR, ES custodian files from GeoNames (1145 files) 2025-12-09 16:41:41 +01:00
kempersc
b54904ad0a fix: normalize YAML null formatting in Eye Filmmuseum file 2025-12-09 16:34:12 +01:00