11 commits

Author SHA1 Message Date
kempersc
e5a08a353d enrich person profiles 2026-01-10 14:14:04 +01:00
kempersc
9339de2cfb data(person): process 44,512 heritage-relevant profiles from entity extractions
Processing Summary:
- Scanned 94,716 LinkedIn entity files
- Identified 44,512 heritage-relevant individuals (47%)
- Created 1,430 new PPID-formatted profiles
- Updated 43,070 existing profiles with entity data
- Final count: 40,731 person profiles

Profile updates include:
- Merged web_claims with full provenance
- Added/updated heritage_relevance scoring
- Added affiliation data with custodian references
- Added inferred birth decades with provenance chains (Rule 45)

All data preserved per Rule 5 (additive only)
2026-01-10 14:01:29 +01:00
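
The "additive only" constraint mentioned above (Rule 5) could be sketched roughly as follows. This is a minimal illustration, not the repository's merge code: the function name `merge_additive`, the field names in the usage example, and the exact conflict policy (existing scalar values win, lists only grow, nested dicts merge recursively) are all assumptions.

```python
import copy


def merge_additive(existing: dict, incoming: dict) -> dict:
    """Return a new profile in which nothing from `existing` is lost.

    Assumed policy (hypothetical, for illustration only):
    - keys missing from `existing` are added;
    - lists gain any incoming items they do not already contain;
    - nested dicts are merged with the same rules;
    - conflicting scalar values keep the existing value.
    """
    merged = copy.deepcopy(existing)
    for key, value in incoming.items():
        if key not in merged:
            merged[key] = copy.deepcopy(value)
        elif isinstance(merged[key], list) and isinstance(value, list):
            for item in value:
                if item not in merged[key]:
                    merged[key].append(copy.deepcopy(item))
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_additive(merged[key], value)
        # Otherwise: scalar conflict, so the existing value is kept untouched.
    return merged


# Hypothetical profile fragments; field names are illustrative.
existing_profile = {
    "name": "Jane Doe",
    "web_claims": [{"claim": "position", "value": "Archivist"}],
    "affiliation": {"custodian": "EXAMPLE"},
}
entity_data = {
    "name": "J. Doe",
    "web_claims": [
        {"claim": "position", "value": "Archivist"},
        {"claim": "education", "value": "MA"},
    ],
    "affiliation": {"role": "staff"},
    "heritage_relevance": True,
}
merged_profile = merge_additive(existing_profile, entity_data)
```

Under this policy an update can only add keys or list items, so rerunning the same merge leaves the profile unchanged.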
kempersc
6f3cf95492 data(person): fix data quality issues and PPID corrections
Data Quality Corrections:
- TIRANA-ADISUNA: Fix erroneous death_year claim (was education end date 2016,
  not death). Set is_living=true. Reassess heritage_relevance=false (tourism
  ministry is not a GLAM institution)
- ALEX-ALSEMGEEST: Rename from NL-ZH-TH (The Hague) to NL-ZH-ROT (Rotterdam)
  based on verified birth location. Update birth year to 1980

Profile Enrichments (5 profiles with XX-XX-XXX placeholders):
- Add web claims with proper provenance timestamps
- Add LinkedIn-verified education and position claims
- Document correction rationale in modification_reason

Heritage Relevance Reassessments:
- Government ministries (Tourism, etc.) marked as non-heritage
- Only GLAM institutions (Galleries, Libraries, Archives, Museums) qualify
2026-01-10 13:31:39 +01:00
kempersc
30cd8842d9 data(person): update profiles with web claims and PPID corrections
- Rename SENNAY-GHEBREAB profile: NL-ZH-ROT → ET-XX-ADD (Ethiopian birth)
- Enrich profiles with inferred birth decades and settlements
- Add web claims provenance for enriched data
- Update 16 profiles with improved location resolution

Files: +1 new (renamed), 16 modified, 1 deleted
2026-01-10 12:56:28 +01:00
kempersc
5eaab2bd30 data(person): enrich heritage professional profiles with web claims
Batch enrichment of 3,728 person profiles with additional data:
- Birth decade inference from education/career history
- Location resolution for inferred birth settlements
- Web claims with full provenance (source_url, retrieved_on)
- Organizational subdivision extraction
- Heritage relevance scoring

Also includes:
- 14 profile renames for PPID format corrections
- Updated _manifest.json with extraction statistics
- New _extraction_log.txt and _extraction_summary.json

Enrichment follows AGENTS.md rules:
- Rule 44: EDTF unknown date notation (XXXX, 196X, etc.)
- Rule 45: Inferred data with explicit provenance
- Rule 30: Confidence scoring (0.50-0.95)
- Rule 31: Organizational subdivision extraction

35,052 files changed, +4,507,411 insertions, -63,118 deletions
2026-01-10 10:35:20 +01:00
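
As a rough illustration of Rules 44, 45 and 30 as summarized above, an inferred birth decade might be recorded like this. The sketch assumes the year has already been estimated by some other step. The function names, the claim field names (`value`, `confidence`, `provenance`, `method`) and the idea of rejecting confidences outside 0.50-0.95 are hypothetical; only the `XXXX`/`196X` notation and the confidence range come from the commit message.

```python
from typing import Optional

# Confidence range quoted in the commit message (Rule 30).
MIN_CONFIDENCE = 0.50
MAX_CONFIDENCE = 0.95


def year_to_edtf_decade(year: Optional[int]) -> str:
    """Render an estimated year as an EDTF decade, e.g. 1964 -> '196X'.

    An unknown year becomes 'XXXX'.
    """
    if year is None:
        return "XXXX"
    if not 0 <= year <= 9999:
        raise ValueError(f"year out of four-digit range: {year}")
    return f"{year // 10:03d}X"


def inferred_birth_decade(year: Optional[int], confidence: float, method: str) -> dict:
    """Build a claim dict that marks the value as inferred (hypothetical schema)."""
    if not MIN_CONFIDENCE <= confidence <= MAX_CONFIDENCE:
        raise ValueError(f"confidence outside {MIN_CONFIDENCE}-{MAX_CONFIDENCE}: {confidence}")
    return {
        "value": year_to_edtf_decade(year),
        "confidence": confidence,
        "provenance": {"inferred": True, "method": method},
    }


claim = inferred_birth_decade(1964, 0.70, "estimated from education history")
```

Keeping the inference method next to the value lets a later correction identify which claims were derived rather than observed.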
kempersc
519b0b47a8 Add Playwright test results JSON file with initial test suite and failure details 2026-01-09 21:33:31 +01:00
kempersc
004d342935 chore: minor updates and evaluation results
- auth.setup.ts: require env vars for test credentials (no hardcoded defaults)
- manifest.json: update schema manifest
- full_evaluation_results.json: add RAG evaluation results
- petra-links.json: update birth date from web claim
2026-01-09 21:10:55 +01:00
kempersc
855fff5962 data(person): resolve PPID locations and enrich profiles
- Rename 512 person files from XX-XX-XXX placeholders to proper GeoNames locations
- Update 2,463 profiles with enriched data
- Add 512 new person profiles (AU, international heritage professionals)
- PPID format: ID_{birth-loc}_{decade}_{work-loc}_{custodian}_{NAME}
2026-01-09 21:09:28 +01:00
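
The PPID pattern quoted above can be read as a simple string template. The sketch below assumes that `ID` is a literal prefix, that the location placeholder for unresolved places is `XX-XX-XXX` (as in the commit messages), and that an unknown decade is written `XXXX`. The function name `build_ppid` and all example values are hypothetical.

```python
UNKNOWN_LOCATION = "XX-XX-XXX"  # placeholder seen in the commit messages
UNKNOWN_DECADE = "XXXX"         # assumed placeholder for an unknown decade


def build_ppid(name: str,
               custodian: str,
               birth_loc: str = UNKNOWN_LOCATION,
               decade: str = UNKNOWN_DECADE,
               work_loc: str = UNKNOWN_LOCATION) -> str:
    """Assemble ID_{birth-loc}_{decade}_{work-loc}_{custodian}_{NAME}."""
    return f"ID_{birth_loc}_{decade}_{work_loc}_{custodian}_{name}"


# Fully resolved example (all values illustrative).
resolved = build_ppid("JANE-DOE", "EXAMPLE",
                      birth_loc="NL-ZH-ROT", decade="198X", work_loc="NL-ZH-ROT")

# Unresolved example: locations and decade fall back to placeholders.
unresolved = build_ppid("JANE-DOE", "EXAMPLE")
```

Because the birth location is part of the identifier, resolving a placeholder to a real location changes the PPID, which is consistent with the file renames reported in these commits.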
kempersc
9e67d0f967 enrich profiles 2026-01-09 20:35:19 +01:00
kempersc
e9c9aefc37 data(person): regenerate PPIDs with unidecode support for non-Latin scripts
- Add display_name and name_romanized fields to all 7,948 person profiles
- Resolve UNKNOWN-UNKNOWN collision group (Hebrew/Arabic names are now romanized properly)
- Hebrew names like אבישי דנינו now generate PPID AVISHI-DANINO instead of UNKNOWN-UNKNOWN
- Collision count reduced from 82 to 81 groups

Regenerated using generate_ppids.py with unidecode support (commit abe30cb)
2026-01-09 18:31:53 +01:00
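
To see how an UNKNOWN-UNKNOWN collision group can arise, consider a name slugger that keeps only ASCII letters. The sketch below is not the logic of generate_ppids.py; it is a standard-library illustration in which `ascii_name_slug` and its `UNKNOWN` fallback are assumptions. Accented Latin names survive the decomposition, while names in Hebrew or Arabic script lose every character and collapse to the same fallback, which is the gap a transliteration step (such as the `unidecode` package named in the commit) is meant to close.

```python
import unicodedata


def ascii_name_slug(full_name: str) -> str:
    """Upper-case, hyphen-joined name slug that keeps ASCII letters only.

    Hypothetical helper: tokens with no ASCII letters become 'UNKNOWN'.
    """
    parts = []
    for token in full_name.split():
        # NFKD splits accented letters into a base letter plus combining marks.
        decomposed = unicodedata.normalize("NFKD", token)
        ascii_letters = "".join(ch for ch in decomposed if ch.isascii() and ch.isalpha())
        parts.append(ascii_letters.upper() or "UNKNOWN")
    return "-".join(parts) or "UNKNOWN"


latin_slug = ascii_name_slug("José Müller")
hebrew_slug = ascii_name_slug("אבישי דנינו")
```

Every two-token name written entirely in a non-Latin script yields the same slug here, so distinct people end up sharing one identifier component.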
kempersc
932ec5438c add person profiles with PPID 2026-01-09 18:26:58 +01:00