kempersc/glam - Forgejo

Author	SHA1	Message	Date
kempersc	aca68ea47f	remove ambiguous web-claims	2025-12-21 00:01:54 +01:00
kempersc	23b1d8ee5f	clean up GHCID	2025-12-17 11:58:40 +01:00
kempersc	99430c2a70	add new entries and semantic routing	2025-12-17 10:11:56 +01:00
kempersc	5fe692296d	Fix RDF visualization: correct SPARQL namespaces and show all node types - Update SPARQL CONSTRUCT query to use correct ontology namespaces: - hc: https://w3id.org/heritage/custodian/ (was nde) - Use nde:Custodian type (was crm:E39_Actor) - Use schema:location + geo:lat/long (was crm predicates) - Remove LIMIT 500 clause to fetch all results - Show all node types by default instead of random single type - Fixes issue where Knowledge Graph showed incomplete/random data	2025-12-17 08:55:26 +01:00
kempersc	e0dd847491	extend ontology	2025-12-16 20:27:39 +01:00
kempersc	0cf93587fb	fix(schema): Normalize custodian_types annotation YAML quoting YAML arrays in LinkML annotations must be quoted strings to ensure proper parsing. This change quotes all custodian_types annotations from the raw array format to quoted string format. Before: custodian_types: ["A", "G"] After: custodian_types: '["A", "G"]' Affected: 50+ class files in modules/classes/ Also updates: manifest.json, 01_custodian_name_modular.yaml	2025-12-16 20:19:45 +01:00
kempersc	8a4727eb34	feat(schema): Add social media post and content modeling schema Social Media Post Classes: - SocialMediaPost: Base class for platform-agnostic post modeling - SocialMediaPostType: Abstract base for post type taxonomy - SocialMediaPostTypes: Concrete post types (TextPost, ImagePost, CarouselPost, StoryPost, ReelPost, ArticlePost, PollPost, EventPost) Content Classes: - SocialMediaContent: Rich content modeling with media attachments, hashtags, mentions, links, and engagement metrics Features: - Platform-specific post type mappings (Instagram, LinkedIn, Twitter, etc.) - Engagement analytics (likes, comments, shares, saves) - Heritage institution content categorization - Media attachment handling (images, videos, documents) - Hashtag and mention extraction for heritage topic tracking	2025-12-16 20:06:08 +01:00
kempersc	767fb8ca80	feat(schema): Add LinkedIn profile and person modeling schema Person Identity Classes: - PersonName: Full name modeling with components (given_name, surname_prefix, base_surname, patronym, initials) following Dutch naming conventions - PersonConnection: Professional network connections with heritage relevance scoring - ConnectionNetwork: Network-level analysis and statistics LinkedIn Profile Schema: - LinkedInProfile: Complete professional profile structure - WorkExperience: Employment history with heritage institution detection - EducationCredential: Academic background and qualifications - LanguageProficiency: Language skills with ISO 639-1 codes Supporting Classes: - ExtractionMetadata: Provenance tracking for extracted profile data - HeritageRelevance: GLAMORCUBESFIXPHDNT type scoring and classification Slots (17 person-related slots): - Name components: given_name, base_surname, surname_prefix, patronym, initials - Identity: age, birth_date, birth_place, death_place, gender_identity, pronouns - Professional: occupation, religion - References: literal_name, name_specification, has_person_name, extraction_metadata Enums: - HeritageTypeEnum: GLAMORCUBESFIXPHDNT type codes for heritage relevance	2025-12-16 20:04:59 +01:00
kempersc	51554947a0	feat(schema): Add video content schema with comprehensive examples Video Schema Classes (9 files): - VideoPost, VideoComment: Social media video modeling - VideoTextContent: Base class for text content extraction - VideoTranscript, VideoSubtitle: Text with timing and formatting - VideoTimeSegment: Time code handling with ISO 8601 duration - VideoAnnotation: Base annotation with W3C Web Annotation alignment - VideoAnnotationTypes: Scene, Object, OCR detection annotations - VideoChapter, VideoChapterList: Navigation and chapter structure - VideoAudioAnnotation: Speaker diarization, music, sound events Enumerations (12 enums): - VideoDefinitionEnum, LiveBroadcastStatusEnum - TranscriptFormatEnum, SubtitleFormatEnum, SubtitlePositionEnum - AnnotationTypeEnum, AnnotationMotivationEnum - DetectionLevelEnum, SceneTypeEnum, TransitionTypeEnum, TextTypeEnum - ChapterSourceEnum, AudioEventTypeEnum, SoundEventTypeEnum, MusicTypeEnum Examples (904 lines, 10 comprehensive heritage-themed examples): - Rijksmuseum virtual tour chapters (5 chapters with heritage entity refs) - Operation Night Watch documentary chapters (5 chapters) - VideoAudioAnnotation: curator interview, exhibition promo, museum lecture All examples reference real heritage entities with Wikidata IDs: Q5598 (Rembrandt), Q41264 (Vermeer), Q219831 (The Night Watch)	2025-12-16 20:03:17 +01:00
kempersc	b0416efc7d	enrich custodians and persons	2025-12-16 11:57:34 +01:00
kempersc	52ae711c56	add timespans	2025-12-16 09:02:52 +01:00
kempersc	b1340e30c8	add timespan	2025-12-15 22:35:35 +01:00
kempersc	cb56aa7e40	enrich all custodian timespan	2025-12-15 22:31:41 +01:00
kempersc	82aa655522	feat(conversation): Add resizable embedding projector panel with improved UX - Larger default size (700x550) for better readability - Resizable from all 8 edges/corners with visual SE grip indicator - Clearer button icons (18px, strokeWidth 2.5) - Draggable, minimizable, pinnable panel - Dark theme and mobile responsive support	2025-12-15 17:45:27 +01:00
kempersc	d9892dba6f	fix: handle single-vector Qdrant collections and multi-collection embedding dimensions - Fixed _vector_search() to check uses_named_vectors() before adding 'using' parameter - Fixed _person_vector_search() to detect person collection vector size and use appropriate model - Resolves 'Not existing vector name error: openai_1536' for single-vector collections - Resolves embedding dimension mismatch between heritage_custodians (1536-dim) and heritage_persons (384-dim)	2025-12-15 10:31:39 +01:00
kempersc	31bbce13e6	fix(types): Make genealogiewerkbalk nested fields optional Fixes TypeScript error where parseGenealogiewerkbalk returns optional fields but Institution interface expected required fields.	2025-12-15 09:04:09 +01:00
kempersc	525662ea16	data: fix remaining person entity profiles	2025-12-15 01:48:33 +01:00
kempersc	3820f2fc92	chore: Add data reports, infra scripts, and API updates - Data quality reports for Dutch custodians - Name mismatch detection reports - Failed crawl URL tracking - Caddy configuration updates - Monitor script for chunk 404 errors - API endpoint improvements	2025-12-15 01:48:08 +01:00
kempersc	0c36429257	feat(scripts): Add batch crawling and data quality scripts - batch_crawl4ai_recrawl.py: Retry failed URL crawls - batch_firecrawl_recrawl.py: FireCrawl batch processing - batch_httpx_scrape.py: HTTPX-based scraping - detect_name_mismatch.py: Find name mismatches in data - enrich_dutch_custodians_crawl4ai.py: Dutch custodian enrichment - fix_collision_victims.py: GHCID collision resolution - fix_generic_platform_names*.py: Platform name cleanup - fix_ghcid_type.py: GHCID type corrections - fix_simon_kemper_contamination.py: Data cleanup - scan_dutch_data_quality.py: Data quality scanning - transform_crawl4ai_to_digital_platform.py: Data transformation	2025-12-15 01:47:46 +01:00
kempersc	70c30a52d4	data: update person entity profiles with heritage classification	2025-12-15 01:47:42 +01:00
kempersc	0a38225b36	feat(frontend): Add multi-select filters, URL params, and UI improvements - Institution Browser: multi-select for types and countries - URL query param sync for shareable filter URLs - New utility: countryNames.ts with flag emoji support - New utility: imageProxy.ts for image URL handling - New component: SearchableMultiSelect dropdown - Career timeline CSS and component updates - Media gallery improvements - Lazy load error boundary component - Version check utility	2025-12-15 01:47:11 +01:00
kempersc	181b1cf705	data: enrich Dutch heritage custodians (DR, FL, FR, GE, GR, LI provinces) - Add digital platform discovery data with provenance - Cleanup duplicate/incorrect custodian entries - Add GHCID collision resolution suffixes where needed - Update person entity profiles with career history	2025-12-15 01:34:38 +01:00
kempersc	68c5aa2724	feat(api): Add heritage person classification and RAG retry logic - Add GLAMORCUBESFIXPHDNT heritage type detection for person profiles - Two-stage classification: blocklist non-heritage orgs, then match keywords - Special handling for Digital (D) type: requires heritage org context - Add career_history heritage_relevant and heritage_type fields - Add exponential backoff retry for Anthropic API overload errors - Fix DSPy 3.x async context with dspy.context() wrapper	2025-12-15 01:31:54 +01:00
kempersc	22709cc13e	feat(rag): Add per-message refresh, bypass cache toggle, and cache clear improvements - Add refresh button to assistant messages for re-running queries with fresh results - Highlight refresh button (amber) for cached responses to draw attention - Add spinning icon animation while refreshing - Fix cache clear to return detailed success/failure status for local vs shared cache - Add bypass cache toggle that forces fresh queries (one-shot, resets after query) - Add Dutch/English translations for new UI elements	2025-12-14 19:12:25 +01:00
kempersc	1d26cade66	correct person labels	2025-12-14 17:58:55 +01:00
kempersc	c6aee998db	correct person labels	2025-12-14 17:29:39 +01:00
kempersc	c50c35fd3a	enrich person custodian	2025-12-14 17:09:55 +01:00
kempersc	d1c9aebd84	feat(rag): Add hybrid language detection and enhanced ontology mapping Implement Heritage RAG pipeline enhancements: 1. Ontology Mapping (new file: ontology_mapping.py) - Hybrid language detection: heritage vocabulary -> fast-langdetect -> English default - HERITAGE_VOCABULARY dict (~40 terms) for domain-specific accuracy - FastText-based ML detection with 0.6 confidence threshold - Support for Dutch, French, German, Spanish, Italian, Portuguese, English - Dynamic synonym extraction from LinkML enum values - 93 comprehensive tests (all passing) 2. Schema Loader Enhancements (schema_loader.py) - Language-tagged multilingual synonym extraction for DSPy signatures - Enhanced enum value parsing with annotations support - Better error handling for malformed schema files 3. DSPy Heritage RAG (dspy_heritage_rag.py) - Fixed all 10 mypy type errors - Enhanced type annotations throughout - Improved query routing with multilingual support 4. Dependencies (pyproject.toml) - Added fast-langdetect ^1.0.0 (primary language detection) - Added types-pyyaml ^6.0.12 (mypy type stubs) Tests: 93 new tests for ontology_mapping, all passing Mypy: Clean (no type errors)	2025-12-14 15:55:18 +01:00
kempersc	41aace785f	feat: Add SyncPanel component for database synchronization - Add SyncPanel component with bilingual (NL/EN) support - Add relative URL handling for production (bronhouder.nl) - Integrate SyncPanel into Database page - Show sync status for all 4 databases (DuckLake, PostgreSQL, Oxigraph, Qdrant) - Support dry-run mode and file limit options	2025-12-12 23:42:22 +01:00
kempersc	505c12601a	Add test script for PiCo extraction from Arabic waqf documents - Implemented a new script `test_pico_arabic_waqf.py` to test the GLM annotator's ability to extract person observations from Arabic historical documents. - The script includes environment variable handling for API token, structured prompts for the GLM API, and validation of extraction results. - Added comprehensive logging for API responses, extraction results, and validation errors. - Included a sample Arabic waqf text for testing purposes, following the PiCo ontology pattern.	2025-12-12 17:50:17 +01:00
kempersc	b1f93b6f22	enrich person profiles	2025-12-12 12:51:10 +01:00
kempersc	03263f67d6	moved web archives	2025-12-12 00:40:26 +01:00
kempersc	1b1cfbfca0	enrich custodians	2025-12-11 22:32:09 +01:00
kempersc	d4906abae4	update postgis data	2025-12-10 23:51:51 +01:00
kempersc	be3fbac601	enrich entries and persons	2025-12-10 18:04:25 +01:00
kempersc	41959f0766	correct HCID!	2025-12-10 13:01:13 +01:00
kempersc	162ca3ad79	docs: add Rules 13-16 for custodian type annotations, Exa LinkedIn, connection data, photo CDN	2025-12-10 09:04:14 +01:00
kempersc	c4b0f17a43	geocode: complete 100% coverage - add coordinates to final 26 files (CZ, BE, AR, LB, ML)	2025-12-10 01:07:34 +01:00
kempersc	82e58f6d40	geocode: add coordinates to 29 custodian files via Wikidata P131/P159 lookups	2025-12-10 01:04:29 +01:00
kempersc	6e2c36413e	geocode: add coordinates to 540 Japanese custodian files using postal codes - Download GeoNames JP postal code database (142K entries) - Create geocode_japan_postal.py with postal code lookup - Handle unicode hyphen variants in postal codes - Add manual mappings for remote Tokyo islands (Hachijojima, Miyakejima) - Implement prefix fallback for company postal codes - Total JP files geocoded: 540 (99.81% coverage) This brings overall geocoding coverage from 97.84% to 99.81%	2025-12-10 00:27:33 +01:00
kempersc	251b5eee68	geocode: add coordinates to 26 more custodian files - Improved city name cleaning: - Roman numeral district suffixes (Kolín V. -> Kolín) - City + country suffixes (Genève 4 - Suisse -> Genève) - Czech postal notation (p. Luka nad Jihlavou -> Luka nad Jihlavou) - Historical city names (Gottwaldov -> Zlín, renamed 1990) - Manual mappings for Swiss districts (Lugano Massagno -> Lugano)	2025-12-09 22:47:32 +01:00
kempersc	35e1686160	geocode: add coordinates to 69 custodian files across multiple countries Countries updated: AR, AT, BG, BR, CA, CL, CN, CU, FI, GE, IR, JO, KG, KR, LB, LI, LV, MX, MY, NI, NL, PS, PY, SX, TM, VN - Manual city name mappings for transliteration variants - St. Pölten -> Sankt Pölten (AT) - Gaza City -> Gaza (PS) - Beit Hanoun -> Bayt Hanun (PS) - Veliko Tarnovo via geonames_id (BG)	2025-12-09 22:44:12 +01:00
kempersc	ef9607d991	geocode: add coordinates to 80 Czech custodian files - Handle Czech address patterns: - House numbers with čp./č.p. prefix - X nad/pod Y town names (rivers/landmarks) - Hyphenated district names (Město-Část) - Trailing numbers and suffixes	2025-12-09 22:41:09 +01:00
kempersc	dee7a4c7d9	geocode: add coordinates to 147 Swiss custodian files - Improved city name normalization to handle: - St. Gallen / St.Gallen -> Sankt Gallen - Canton suffixes (Buchs SG, Brugg AG) - Hyphenated districts (Bernex - Genève) - Postal codes with slashes (Ecublens/VD) - German prepositions (Hausen b. Brugg) - Created scripts/geocode_from_city_name.py for unified geocoding	2025-12-09 22:38:33 +01:00
kempersc	cc61d99acf	geocode: add coordinates to BG and EG custodian files - BG: Add lat/lon from existing GeoNames IDs (28 files) - EG: Map city codes to GeoNames (CAI→Cairo, ALX→Alexandria, etc.) (28 files) - Fix malformed EG-IS-`A`-O-SCA.yaml → EG-IS-ISM-O-SCA.yaml - Overall coverage: 96.4% → 96.6%	2025-12-09 21:59:58 +01:00
kempersc	2137c522db	geocode: add coordinates to JP compound cities and CZ files from GeoNames - JP: Handle Gun/Cho/Machi/Mura compound city names (2615 files) - CZ: Map city codes to GeoNames entries (667 files) - Overall coverage: 84.5% → 96.4%	2025-12-09 21:49:40 +01:00
kempersc	92b5e58ef3	geocode: add coordinates to AT, BE, DE, GB, PL, UA, US custodian files from GeoNames	2025-12-09 20:38:34 +01:00
kempersc	3a6ead8fde	feat: Add legal form filtering rule for CustodianName - Introduced LEGAL-FORM-FILTER rule to standardize CustodianName by removing legal form designations. - Documented rationale, examples, and implementation guidelines for the filtering process. docs: Create README for value standardization rules - Established a comprehensive README outlining various value standardization rules applicable to Heritage Custodian classes. - Categorized rules into Name Standardization, Geographic Standardization, Web Observation, and Schema Evolution. feat: Implement transliteration standards for non-Latin scripts - Added TRANSLIT-ISO rule to ensure GHCID abbreviations are generated from emic names using ISO standards for transliteration. - Included detailed guidelines for various scripts and languages, along with implementation examples. feat: Define XPath provenance rules for web observations - Created XPATH-PROVENANCE rule mandating XPath pointers for claims extracted from web sources. - Established a workflow for archiving websites and verifying claims against archived HTML. chore: Update records lifecycle diagram - Generated a new Mermaid diagram illustrating the records lifecycle for heritage custodians. - Included phases for active records, inactive archives, and processed heritage collections with key relationships and classifications.	2025-12-09 16:58:41 +01:00
kempersc	7b42d720d5	geocode: add coordinates to CZ, BY, CH, FR, ES custodian files from GeoNames (1145 files)	2025-12-09 16:41:41 +01:00
kempersc	b54904ad0a	fix: normalize YAML null formatting in Eye Filmmuseum file	2025-12-09 16:34:12 +01:00

146 commits
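
The geocode commits listed above describe several city-name cleaning rules: "St." abbreviations, Czech postal notation, country suffixes, canton suffixes, Roman numeral district suffixes, and historical city names. A minimal Python sketch of how such rules might be chained follows. It is not the repository's scripts/geocode_from_city_name.py. The function name normalize_city, the MANUAL_MAP table, the canton code list, and the results for canton-suffixed names (Buchs SG, Brugg AG, Ecublens/VD) are assumptions made for illustration; only the mappings quoted in the commit messages come from the source.

```python
import re

# Hypothetical lookup table for names that no regex rule can fix.
# The two entries are the examples quoted in the commit messages.
MANUAL_MAP = {
    "Gottwaldov": "Zlín",         # historical name, renamed 1990
    "Lugano Massagno": "Lugano",  # Swiss district mapped to its city
}

# Assumed subset of Swiss canton codes; a real list would be longer.
CANTON_CODES = ("AG", "SG", "VD", "GE", "BE", "ZH")
CANTON_SUFFIX = re.compile(r"\s+(?:" + "|".join(CANTON_CODES) + r")$")


def normalize_city(raw: str) -> str:
    """Return a cleaned city name suitable for a gazetteer lookup (sketch)."""
    name = raw.strip()

    # Manual mappings take priority over pattern-based rules.
    if name in MANUAL_MAP:
        return MANUAL_MAP[name]

    # "St. Gallen" / "St.Gallen" -> "Sankt Gallen"
    name = re.sub(r"^St\.\s*", "Sankt ", name)
    # Czech postal notation: "p. Luka nad Jihlavou" -> "Luka nad Jihlavou"
    name = re.sub(r"^p\.\s+", "", name)
    # Country suffix after a spaced hyphen: "Genève 4 - Suisse" -> "Genève 4"
    name = re.sub(r"\s+-\s+.*$", "", name)
    # Slash-separated canton code: "Ecublens/VD" -> "Ecublens" (assumed result)
    name = re.sub(r"/[A-Z]{2}$", "", name)
    # Trailing canton code: "Buchs SG" -> "Buchs" (assumed result)
    name = CANTON_SUFFIX.sub("", name)
    # Roman numeral district suffix: "Kolín V." -> "Kolín"
    name = re.sub(r"\s+[IVX]+\.$", "", name)
    # Trailing district or postal number: "Genève 4" -> "Genève"
    name = re.sub(r"\s+\d+$", "", name)

    return name.strip()


if __name__ == "__main__":
    for raw in ["St.Gallen", "Kolín V.", "Genève 4 - Suisse", "Gottwaldov"]:
        print(f"{raw!r} -> {normalize_city(raw)!r}")
```

The order of the rules matters in this sketch: the country suffix is stripped before the trailing number, so "Genève 4 - Suisse" reduces to "Genève" in two steps.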