kempersc cb56aa7e40 enrich all custodian timespan

2025-12-15 22:31:41 +01:00

Person Profile Extraction Confidence Scoring

Version: 1.0.0
Created: 2025-12-15
Applies To: Person entity profiles in data/custodian/person/entity/

Purpose

This document defines the confidence scoring rubric for profile extraction quality: how confident we are that the extracted profile data is accurate and complete. It is distinct from heritage_sector_relevance, which measures domain expertise.

Two Different Scores:

Field	Measures	Range
`exa_enrichment.confidence_score`	Data extraction quality/completeness	0.50-0.95
`heritage_sector_relevance.score`	Domain expertise in heritage sector	0.10-1.00

Confidence Score Rubric

Score Range	Level	Criteria	Examples
0.90-0.95	High Confidence	Senior heritage role, clear title, named institution, verifiable details	"Director at Rijksmuseum", "Chief Curator at British Museum"
0.75-0.85	Good Confidence	Mid-level heritage role, good institutional context, clear affiliation	"Junior Development at Rijksmuseum | MA Cultural Economics"
0.60-0.70	Moderate Confidence	Entry-level/support role, or technical role at heritage institution, limited details	"Staff at Internet Archive", "Stedelijk Museum Amsterdam" (no role)
0.50-0.55	Low Confidence	Intern, unclear relationship, privacy-abbreviated name, minimal data	"Intern at Museum", "Amy B." (abbreviated name)
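
As a minimal sketch, the rubric bands above can be looked up in code. The function name `confidence_level` and the `RUBRIC_BANDS` constant are hypothetical and not part of any existing tooling. The rubric does not say how to label scores that fall between bands (for example 0.88), so this sketch assumes such scores simply have no level and returns `None`.

```python
from typing import Optional

# Bands copied from the rubric table, stored as integer hundredths so that
# comparisons are not affected by floating-point rounding.
# Each entry is (lower bound, upper bound, level); both bounds are inclusive.
RUBRIC_BANDS = [
    (90, 95, "High Confidence"),
    (75, 85, "Good Confidence"),
    (60, 70, "Moderate Confidence"),
    (50, 55, "Low Confidence"),
]


def confidence_level(score: float) -> Optional[str]:
    """Return the rubric level for a score, or None if no band contains it."""
    hundredths = round(score * 100)
    for lower, upper, level in RUBRIC_BANDS:
        if lower <= hundredths <= upper:
            return level
    # Assumption: scores in the gaps between bands are left unlabelled.
    return None
```

Returning `None` for the gaps keeps the lookup faithful to the table rather than guessing which neighbouring band a score such as 0.72 should belong to.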

Scoring Factors

Factors That INCREASE Confidence

Factor	Impact	Example
Clear job title visible	+0.10 to +0.15	"Curator", "Archivist", "Director"
Named institution in headline	+0.05 to +0.10	"at Rijksmuseum", "at Internet Archive"
Education degree visible	+0.05	"MA Cultural Economics", "PhD Art History"
Seniority indicator	+0.05 to +0.10	"Senior", "Head of", "Director"
Multiple data points	+0.05	Role + Education + Location
Full name (not abbreviated)	+0.05	"Aliza Snoek" vs "Amy B."
Specific department/team	+0.05	"Development Team", "Conservation Department"

Factors That DECREASE Confidence

Factor	Impact	Example
No role title (institution only)	-0.15 to -0.20	Headline: "Stedelijk Museum Amsterdam"
Generic "staff" title	-0.10	"staff at The Internet Archive"
Privacy-abbreviated name	-0.15 to -0.20	"Amy B.", "J. Smith"
Intern/trainee position	-0.10	"Intern", "Stagiair", "Trainee"
No location data	-0.05	Location field is null
403 privacy restriction	-0.10	Full profile unavailable
Ambiguous affiliation	-0.10	Unclear which institution

Score Calculation Examples

Example 1: Score 0.85 (Good Confidence)

Profile: "Junior Development Rijksmuseum | MA Cultural Economics"

Base score: 0.65
+ Clear role title ("Junior Development"): +0.10
+ Named institution ("Rijksmuseum"): +0.05
+ Education visible ("MA Cultural Economics"): +0.05
= Final score: 0.85

Example 2: Score 0.65 (Moderate Confidence)

Profile: "staff at The Internet Archive"

Base score: 0.65
+ Named institution ("Internet Archive"): +0.05
- Generic title ("staff"): -0.10
+ Full name visible: +0.05
= Final score: 0.65

Example 3: Score 0.60 (Moderate Confidence)

Profile: "Stedelijk Museum Amsterdam" (no role)

Base score: 0.65
+ Named institution: +0.05
- No role title visible: -0.15
+ Full name visible: +0.05
= Final score: 0.60

Example 4: Score 0.50 (Low Confidence)

Profile: "Intern at Kröller-Müller Museum"

Base score: 0.65
+ Named institution: +0.05
- Intern position: -0.10
- 403 privacy restriction: -0.10
= Final score: 0.50

Example 5: Score 0.50 (Low Confidence - Abbreviated Name)

Profile: "Amy B. - Film Archivist"

Base score: 0.65
+ Clear role title: +0.10
- Abbreviated name: -0.20
- No institution in headline: -0.05
= Final score: 0.50
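
The worked examples all follow the same pattern: start from the base score of 0.65 and add the signed adjustments. The sketch below reproduces that arithmetic. The name `final_score` is hypothetical, and clamping the result to the 0.50-0.95 range is an assumption based on the range stated for `confidence_score`; the document does not say what to do when the sum falls outside it.

```python
BASE_SCORE = 0.65                    # base used in every worked example
MIN_SCORE, MAX_SCORE = 0.50, 0.95    # range stated for confidence_score


def final_score(adjustments, base=BASE_SCORE):
    """Sum the signed adjustments onto the base score.

    Assumption: the result is clamped to [MIN_SCORE, MAX_SCORE].
    The value is rounded to two decimals to hide floating-point noise.
    """
    total = base + sum(adjustments)
    clamped = max(MIN_SCORE, min(MAX_SCORE, total))
    return round(clamped, 2)


# Example 2: named institution, generic "staff" title, full name visible.
example_2 = final_score([+0.05, -0.10, +0.05])

# Example 5: clear role title, abbreviated name, no institution in headline.
example_5 = final_score([+0.10, -0.20, -0.05])

print(example_2, example_5)
```

Passing the adjustments as a plain list keeps the choice of a value within a range (for instance -0.15 versus -0.20 for an abbreviated name) with the person doing the scoring, as in the examples.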

Implementation in Entity Files

The confidence score is stored in the exa_enrichment block:

{
  "exa_enrichment": {
    "confidence_score": 0.75,
    "enrichment_date": "2025-12-15T12:45:00Z",
    "sources_consulted": [
      "LinkedIn profile headline",
      "Rijksmuseum institutional website"
    ],
    "notes": "Clear role title and educational background visible in headline. Development roles are core museum functions."
  }
}
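
A short sketch of reading this block back from an entity file follows. The helper name `read_enrichment` and the shape of its return value are hypothetical; only the field names come from the JSON above. The sample is embedded as a string so the snippet runs without touching the repository.

```python
import json
from datetime import datetime

# Sample copied from the block above (notes shortened for brevity).
SAMPLE = """
{
  "exa_enrichment": {
    "confidence_score": 0.75,
    "enrichment_date": "2025-12-15T12:45:00Z",
    "sources_consulted": [
      "LinkedIn profile headline",
      "Rijksmuseum institutional website"
    ],
    "notes": "Clear role title and educational background visible in headline."
  }
}
"""


def read_enrichment(entity: dict) -> dict:
    """Extract the exa_enrichment fields from a parsed entity file."""
    block = entity["exa_enrichment"]
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11
    # onwards, so normalise it to an explicit UTC offset first.
    enriched_at = datetime.fromisoformat(
        block["enrichment_date"].replace("Z", "+00:00")
    )
    return {
        "score": float(block["confidence_score"]),
        "enriched_at": enriched_at,
        "sources": list(block.get("sources_consulted", [])),
        "notes": block.get("notes", ""),
    }


record = read_enrichment(json.loads(SAMPLE))
print(record["score"], record["enriched_at"].isoformat())
```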

Relationship to Other Scores

exa_enrichment.confidence_score vs heritage_sector_relevance.score

Aspect	confidence_score	heritage_sector_relevance.score
What it measures	Data extraction quality	Domain expertise
Question answered	"How sure are we about this data?"	"How relevant is this person to heritage?"
High score means	Rich, verifiable profile data	Deep heritage sector expertise
Low score means	Sparse, uncertain data	Peripheral/support role
Example: IT Director	0.90 (clear role, full data)	0.45 (enabling role, not heritage-specific)
Example: Intern Curator	0.50 (intern, limited data)	0.65 (heritage role, limited experience)

When Both Scores Are Used

{
  "exa_enrichment": {
    "confidence_score": 0.75,
    "notes": "Good extraction with clear role title"
  },
  "heritage_sector_relevance": {
    "score": 0.85,
    "primary_domain": "Archives",
    "assessment_notes": "Senior archivist with 10+ years experience"
  }
}

Quality Control

Minimum Thresholds

Threshold	Action
< 0.50	Flag for manual review, consider re-extraction
0.50 to < 0.60	Accept but note uncertainty in provenance
0.60-0.75	Standard acceptance
> 0.75	High-quality record
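
The thresholds can be expressed as a small lookup. This is a sketch only: the function name `qc_action` and the returned labels are hypothetical, and the boundary values 0.60 and 0.75 are both treated as standard acceptance, which is consistent with the documentation rule applying to scores below 0.60.

```python
def qc_action(score: float) -> str:
    """Map a confidence score to the quality-control action for its threshold."""
    # Work in integer hundredths so that boundary comparisons are exact.
    hundredths = round(score * 100)
    if hundredths < 50:
        return "manual_review"                  # flag, consider re-extraction
    if hundredths < 60:
        return "accept_with_uncertainty_note"   # note uncertainty in provenance
    if hundredths <= 75:
        return "standard_acceptance"
    return "high_quality"


for s in (0.45, 0.55, 0.60, 0.75, 0.80):
    print(s, qc_action(s))
```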

Required Documentation

For scores below 0.60, the notes field MUST explain:

Why the score is low
What data is missing or uncertain
Potential sources for verification
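
A simple automated check can catch low-scoring records that have no notes at all. The sketch below uses a hypothetical helper, `missing_required_notes`, and only tests that the notes field is non-empty; whether the notes actually cover the three points above still needs a human reviewer.

```python
def missing_required_notes(entity: dict) -> bool:
    """Return True if the score is below 0.60 and the notes field is empty."""
    block = entity.get("exa_enrichment", {})
    score = block.get("confidence_score")
    if score is None:
        # No score recorded, so the documentation rule does not apply here.
        return False
    notes = (block.get("notes") or "").strip()
    # Compare in integer hundredths to avoid floating-point edge cases.
    return round(score * 100) < 60 and not notes


low_without_notes = {"exa_enrichment": {"confidence_score": 0.50}}
low_with_notes = {
    "exa_enrichment": {
        "confidence_score": 0.50,
        "notes": "Abbreviated name; no institution in headline.",
    }
}
print(missing_required_notes(low_without_notes))
print(missing_required_notes(low_with_notes))
```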

References

AGENTS.md: Rule 30 (Person Profile Extraction Confidence Scoring)
AGENTS.md: Rule 20 (Person Entity Profiles)
HERITAGE_SECTOR_RELEVANCE_SCORING.md: Domain expertise scoring (separate metric)
PERSON_ENTITY_PROFILE_FORMAT_RULE.md: Entity file structure
DATA_FABRICATION_PROHIBITION.md: Never fabricate data to increase confidence
