Rule 45: Inferred Data Must Be Explicit with Provenance
Status: Active
Created: 2025-01-09
Applies to: PPID enrichment, person entity profiles, any data inference
Core Principle
All inferred data MUST be stored in explicit inferred_* fields with full provenance statements. Inferred values MUST NEVER silently replace or merge with verified data.
This ensures:
- Transparency: Users can distinguish verified facts from heuristic estimates
- Auditability: The inference method and source observations are traceable
- Reversibility: Inferred data can be corrected when verified data becomes available
- Quality Signals: Confidence levels and inference chains are preserved
Required Structure for Inferred Data
Every inferred claim MUST include:
inferred_[field_name]:
  value: "the inferred value"
  edtf: "196X"            # For dates: EDTF notation
  formatted: "NL-UT-UTR"  # For locations: CC-RR-PPP format
  confidence: "very_low|low|medium|high"
  inference_provenance:
    method: "heuristic_name"
    inference_chain:
      - step: 1
        observation: "University start year 1986"
        source_field: "profile_data.education[0].date_range"
        source_value: "1986 - 1990"
      - step: 2
        assumption: "University entry at age 18"
        rationale: "Standard Dutch university entry age"
      - step: 3
        calculation: "1986 - 18 = 1968"
        result: "Estimated birth year 1968"
      - step: 4
        generalization: "Round to decade → 196X"
        rationale: "EDTF decade notation for uncertain years"
    inferred_at: "2025-01-09T18:00:00Z"
    inferred_by: "enrich_ppids.py"
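The required structure can be checked mechanically before a record is written. The following is a minimal sketch, not part of enrich_ppids.py: the function name `validate_inferred_field` and its message wording are hypothetical, and the accepted confidence values are assumed to be the four levels listed in the Confidence Levels table.

```python
from typing import Any

# Keys taken from the required structure above.
REQUIRED_PROVENANCE_KEYS = ("method", "inference_chain", "inferred_at", "inferred_by")
# Assumption: the four levels listed in the Confidence Levels table.
ALLOWED_CONFIDENCE = {"very_low", "low", "medium", "high"}


def validate_inferred_field(name: str, field: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the field passes."""
    problems: list[str] = []
    if not name.startswith("inferred_"):
        problems.append(f"{name}: field name must start with 'inferred_'")
    # List-valued fields carry 'values' instead of 'value'.
    if "value" not in field and "values" not in field:
        problems.append(f"{name}: missing 'value' (or 'values')")
    if field.get("confidence") not in ALLOWED_CONFIDENCE:
        problems.append(f"{name}: missing or unknown 'confidence'")

    provenance = field.get("inference_provenance")
    if not isinstance(provenance, dict):
        problems.append(f"{name}: missing 'inference_provenance'")
        return problems

    for key in REQUIRED_PROVENANCE_KEYS:
        if key not in provenance:
            problems.append(f"{name}: inference_provenance missing '{key}'")

    chain = provenance.get("inference_chain")
    if isinstance(chain, list):
        if not chain:
            problems.append(f"{name}: 'inference_chain' is empty")
        # Steps are expected to be numbered 1..n in order.
        steps = [s.get("step") if isinstance(s, dict) else None for s in chain]
        if steps != list(range(1, len(chain) + 1)):
            problems.append(f"{name}: inference_chain steps must be numbered 1..n")
    return problems
```

An enrichment script could call this on every inferred_* field and refuse to write the record when the returned list is non-empty.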
Explicit Inferred Fields
For Person Profiles (PPID)
| Inferred Field | Source Observations | Heuristic |
|---|---|---|
| inferred_birth_year | Earliest education/job dates | Entry age assumptions |
| inferred_birth_decade | Birth year estimate | EDTF decade notation |
| inferred_birth_settlement | School/university location | Residential proximity |
| inferred_birth_region | Settlement location | GeoNames admin1 |
| inferred_birth_country | Settlement location | GeoNames country |
| inferred_current_settlement | Profile location, current job | Direct extraction |
| inferred_current_region | Settlement location | GeoNames admin1 |
| inferred_current_country | Settlement location | GeoNames country |
Example: Complete Inferred Birth Data
{
"ppid": "ID_NL-UT-UTR_196X_NL-UT-UTR_XXXX_AART-HARTEN",
"birth_date": {
"edtf": "XXXX",
"precision": "unknown",
"note": "See inferred_birth_decade for heuristic estimate"
},
"inferred_birth_decade": {
"value": "196X",
"edtf": "196X",
"precision": "decade",
"confidence": "low",
"inference_provenance": {
"method": "earliest_education_heuristic",
"inference_chain": [
{
"step": 1,
"observation": "University education record found",
"source_field": "profile_data.education[0]",
"source_value": {
"institution": "Universiteit Utrecht",
"degree": "Social & Organisational psychology, doctoraal",
"date_range": "1986 - 1990"
}
},
{
"step": 2,
"extraction": "Start year extracted from date_range",
"extracted_value": 1986
},
{
"step": 3,
"assumption": "University entry age",
"assumed_value": 18,
"rationale": "Standard Dutch university entry age (post-VWO)",
"confidence_impact": "Assumption reduces confidence; actual age 17-20 possible"
},
{
"step": 4,
"calculation": "1986 - 18 = 1968",
"result": "Estimated birth year: 1968"
},
{
"step": 5,
"generalization": "Convert to EDTF decade",
"input": 1968,
"output": "196X",
"rationale": "Decade precision appropriate for heuristic estimate"
}
],
"inferred_at": "2025-01-09T18:00:00Z",
"inferred_by": "enrich_ppids.py"
}
},
"inferred_birth_settlement": {
"value": "Utrecht",
"formatted": "NL-UT-UTR",
"confidence": "low",
"inference_provenance": {
"method": "earliest_education_location",
"inference_chain": [
{
"step": 1,
"observation": "Earliest education institution identified",
"source_field": "profile_data.education[0].institution",
"source_value": "Universiteit Utrecht"
},
{
"step": 2,
"lookup": "Institution location mapping",
"mapping_key": "Universiteit Utrecht",
"mapping_value": "Utrecht, Netherlands"
},
{
"step": 3,
"geocoding": "GeoNames resolution",
"query": "Utrecht",
"country_code": "NL",
"result": {
"geonames_id": 2745912,
"name": "Utrecht",
"admin1_code": "09",
"admin1_name": "Utrecht"
}
},
{
"step": 4,
"formatting": "CC-RR-PPP generation",
"country_code": "NL",
"region_code": "UT",
"settlement_code": "UTR",
"result": "NL-UT-UTR"
}
],
"assumption_note": "University location used as proxy for birth location; student may have relocated for education",
"inferred_at": "2025-01-09T18:00:00Z",
"inferred_by": "enrich_ppids.py"
}
}
}
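The birth-decade chain in this example can be produced step by step as the value is computed, so the provenance never drifts from the calculation. The sketch below is illustrative only and is not taken from enrich_ppids.py: the function name `infer_birth_decade` and the constant `ASSUMED_ENTRY_AGE` are hypothetical, and the entry age of 18 is the assumption stated in the example.

```python
import re
from datetime import datetime, timezone

ASSUMED_ENTRY_AGE = 18  # assumption from the example: standard Dutch university entry age


def infer_birth_decade(date_range: str, source_field: str) -> dict:
    """Build an inferred_birth_decade record from a range such as '1986 - 1990'."""
    match = re.search(r"\b(\d{4})\b", date_range)  # first four-digit year = start year
    if match is None:
        raise ValueError(f"no four-digit year in {date_range!r}")
    start_year = int(match.group(1))
    birth_year = start_year - ASSUMED_ENTRY_AGE
    decade = f"{birth_year // 10}X"  # 1968 -> "196X"

    chain = [
        {
            "step": 1,
            "observation": f"University start year {start_year}",
            "source_field": source_field,
            "source_value": date_range,
        },
        {
            "step": 2,
            "assumption": f"University entry at age {ASSUMED_ENTRY_AGE}",
            "rationale": "Standard Dutch university entry age",
        },
        {
            "step": 3,
            "calculation": f"{start_year} - {ASSUMED_ENTRY_AGE} = {birth_year}",
            "result": f"Estimated birth year {birth_year}",
        },
        {
            "step": 4,
            "generalization": f"Round to decade → {decade}",
            "rationale": "EDTF decade notation for uncertain years",
        },
    ]
    return {
        "value": decade,
        "edtf": decade,
        "precision": "decade",
        "confidence": "low",
        "inference_provenance": {
            "method": "earliest_education_heuristic",
            "inference_chain": chain,
            "inferred_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "inferred_by": "enrich_ppids.py",
        },
    }
```

Calling `infer_birth_decade("1986 - 1990", "profile_data.education[0].date_range")` yields a record whose value is "196X" and whose third chain step records the calculation "1986 - 18 = 1968".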
List-Valued Inferred Data (EDTF Set Notation)
When inference yields multiple plausible values (e.g., an estimated birth year of 1968 with a ±3-year margin could fall in either the 1960s or the 1970s), store them as a list and express the date in EDTF set notation.
EDTF Set Notation Standards
| Notation | Meaning | Use Case |
|---|---|---|
| [196X,197X] | One of these values | Person born in late 1960s (uncertainty spans decades) |
| {196X,197X} | All of these values | NOT for birth decade (use [...]) |
| [1965..1970] | Range within set | Birth year between 1965-1970 |
When to Use List Values
- Decade Boundary Cases: Estimated birth year is within 3 years of a decade boundary
  - Estimated 1968 → [196X,197X] (could be late 60s or early 70s due to age assumption variance)
  - Estimated 1972 → [196X,197X] (same logic)
  - Estimated 1975 → 197X (confidently mid-decade)
- Multiple Plausible Locations: Student attended schools in different cities
  - ["NL-UT-UTR", "NL-NH-AMS"] with provenance explaining each candidate
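The decade-boundary case can be expressed as a small function. This is one possible reading of the "within 3 years" criterion, sketched under the assumption that the variance is applied symmetrically to the estimated year; the name `birth_decade_edtf` and the `variance` parameter are hypothetical.

```python
def birth_decade_edtf(estimated_year: int, variance: int = 3) -> str:
    """Return an EDTF decade, or a one-of set when the variance range crosses a decade boundary."""
    low = estimated_year - variance
    high = estimated_year + variance
    # Every decade touched by the range low..high, in ascending order.
    decades = [f"{d}X" for d in range(low // 10, high // 10 + 1)]
    if len(decades) == 1:
        return decades[0]
    # Square brackets are the EDTF "one of" set; curly braces would mean "all of".
    return "[" + ",".join(decades) + "]"


print(birth_decade_edtf(1968))  # [196X,197X]
print(birth_decade_edtf(1972))  # [196X,197X]
print(birth_decade_edtf(1975))  # 197X
```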
Example: List-Valued Birth Decade
{
"inferred_birth_decade": {
"values": ["196X", "197X"],
"edtf": "[196X,197X]",
"edtf_meaning": "one of: 1960s or 1970s",
"precision": "decade_set",
"confidence": "very_low",
"primary_value": "196X",
"primary_rationale": "1968 is closer to 1960s center than 1970s",
"inference_provenance": {
"method": "earliest_observation_heuristic",
"inference_chain": [
{
"step": 1,
"observation": "University start 1986",
"source_field": "profile_data.education[0].date_range"
},
{
"step": 2,
"assumption": "University entry at age 18 (±3 years)",
"rationale": "Dutch university entry typically 17-21"
},
{
"step": 3,
"calculation": "1986 - 18 = 1968 (range: 1965-1971)",
"result": "Birth year estimate: 1968 with variance 1965-1971"
},
{
"step": 4,
"generalization": "Birth year range spans decade boundary",
"input_range": [1965, 1971],
"output": ["196X", "197X"],
"rationale": "Cannot determine which decade without additional evidence"
}
],
"inferred_at": "2025-01-09T18:00:00Z",
"inferred_by": "enrich_ppids.py"
}
}
}
PPID Generation with List Values
When inferred_birth_decade is list-valued, use its primary_value for the PPID and record the remaining candidates in first_date_alternatives:
{
"ppid": "ID_NL-UT-UTR_196X_NL-UT-UTR_XXXX_AART-HARTEN",
"ppid_components": {
"first_date": "196X",
"first_date_source": "inferred_birth_decade.primary_value",
"first_date_alternatives": ["197X"]
}
}
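A minimal sketch of that selection, assuming the field shapes used in the examples in this rule (`value` for a single decade; `values` plus `primary_value` for a list); the function name `ppid_first_date` is hypothetical.

```python
def ppid_first_date(inferred_birth_decade: dict) -> dict:
    """Derive the PPID first_date components from a single- or list-valued inferred_birth_decade."""
    if "values" in inferred_birth_decade:
        primary = inferred_birth_decade["primary_value"]
        # Everything that is not the primary value is kept as an alternative.
        alternatives = [v for v in inferred_birth_decade["values"] if v != primary]
        return {
            "first_date": primary,
            "first_date_source": "inferred_birth_decade.primary_value",
            "first_date_alternatives": alternatives,
        }
    return {
        "first_date": inferred_birth_decade["value"],
        "first_date_source": "inferred_birth_decade",
    }
```

Keeping the alternatives next to the chosen value means a later correction can regenerate the PPID without re-running the inference.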
Example: List-Valued Location
{
"inferred_birth_settlement": {
"values": [
{"settlement": "Utrecht", "formatted": "NL-UT-UTR"},
{"settlement": "Amsterdam", "formatted": "NL-NH-AMS"}
],
"primary_value": "NL-UT-UTR",
"primary_rationale": "Earlier education (1986) in Utrecht; Amsterdam education later (1990)",
"confidence": "very_low",
"inference_provenance": {
"method": "education_locations",
"inference_chain": [
{
"step": 1,
"observation": "Multiple education institutions found",
"source_field": "profile_data.education",
"candidates": ["Universiteit Utrecht (1986)", "UvA (1990)"]
},
{
"step": 2,
"assumption": "Earlier education more likely near birth location",
"rationale": "Students often attend local university first"
}
],
"inferred_at": "2025-01-09T18:00:00Z",
"inferred_by": "enrich_ppids.py"
}
}
}
Confidence Levels
| Level | Criteria | Example |
|---|---|---|
| high | Direct extraction from authoritative source | Profile states "Born in Amsterdam" |
| medium | Single-step inference with reliable source | Current job location from employment record |
| low | Multi-step heuristic with assumptions | Birth year from university start date |
| very_low | Speculative, multiple assumptions, or list-valued | Birth location from first observed location, or decade spanning boundary |
Anti-Patterns (FORBIDDEN)
❌ Silent Replacement
{
"birth_date": {
"edtf": "196X",
"precision": "decade"
}
}
Problem: No indication this is inferred, no provenance, no confidence level.
❌ Hidden in Metadata
{
"birth_date": {
"edtf": "196X"
},
"enrichment_metadata": {
"birth_date_inferred": true
}
}
Problem: Inference metadata separated from the value; easy to miss.
❌ Missing Inference Chain
{
"inferred_birth_decade": {
"value": "196X",
"method": "heuristic"
}
}
Problem: No explanation of HOW the value was derived; not auditable.
Correct Pattern ✅
{
"birth_date": {
"edtf": "XXXX",
"precision": "unknown",
"note": "See inferred_birth_decade"
},
"inferred_birth_decade": {
"value": "196X",
"edtf": "196X",
"confidence": "low",
"inference_provenance": {
"method": "earliest_education_heuristic",
"inference_chain": [
{"step": 1, "observation": "...", "source_field": "...", "source_value": "..."},
{"step": 2, "assumption": "...", "rationale": "..."},
{"step": 3, "calculation": "...", "result": "..."}
],
"inferred_at": "2025-01-09T18:00:00Z",
"inferred_by": "enrich_ppids.py"
}
}
}
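The three anti-patterns can be caught with a simple lint pass over a record. The sketch below is a rough illustration under stated assumptions, not an existing tool: the function name `find_anti_patterns` and the finding labels are hypothetical, it only inspects birth_date as the canonical field, and it treats a canonical value as acceptable only when it is all-unknown (X) or carries "verified": true as in the Upgrade Path example.

```python
def find_anti_patterns(record: dict) -> list[str]:
    """Return labels for the anti-patterns found in a person record (empty list = clean)."""
    findings: list[str] = []

    # Silent replacement: canonical field holds known digits but is not marked verified.
    birth_date = record.get("birth_date", {})
    edtf = birth_date.get("edtf", "")
    has_known_digits = bool(set(edtf) - {"X", "-"})
    if has_known_digits and not birth_date.get("verified"):
        findings.append("silent_replacement")

    # Hidden in metadata: inference flags stored away from the value they describe.
    metadata = record.get("enrichment_metadata", {})
    if any(key.endswith("_inferred") for key in metadata):
        findings.append("hidden_in_metadata")

    # Missing inference chain: an active inferred_* field without a non-empty chain.
    for name, field in record.items():
        if not name.startswith("inferred_") or not isinstance(field, dict):
            continue
        if field.get("superseded"):
            continue  # superseded fields are audit-trail entries
        chain = field.get("inference_provenance", {}).get("inference_chain")
        if not chain:
            findings.append(f"missing_inference_chain:{name}")
    return findings
```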
PPID Component Handling
When inferred values are used in PPID components:
{
"ppid": "ID_NL-UT-UTR_196X_NL-NH-AMS_XXXX_AART-HARTEN",
"ppid_components": {
"type": "ID",
"first_location": "NL-UT-UTR",
"first_location_source": "inferred_birth_settlement",
"first_date": "196X",
"first_date_source": "inferred_birth_decade",
"last_location": "NL-NH-AMS",
"last_location_source": "inferred_current_settlement",
"last_date": "XXXX",
"name_tokens": ["AART", "HARTEN"]
}
}
The *_source fields document which inferred field was used for PPID generation.
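For illustration, the PPID string in this example can be reassembled from ppid_components. The layout below (underscore-separated components, hyphen-joined name tokens) is inferred from the example PPID alone and is an assumption; the function name `build_ppid` is hypothetical, and the authoritative format is whatever the PPID rules define.

```python
def build_ppid(components: dict) -> str:
    """Join ppid_components into a PPID string, following the layout of the example above."""
    parts = [
        components["type"],
        components["first_location"],
        components["first_date"],
        components["last_location"],
        components["last_date"],
        "-".join(components["name_tokens"]),  # name tokens are hyphen-joined
    ]
    return "_".join(parts)


example = {
    "type": "ID",
    "first_location": "NL-UT-UTR",
    "first_date": "196X",
    "last_location": "NL-NH-AMS",
    "last_date": "XXXX",
    "name_tokens": ["AART", "HARTEN"],
}
print(build_ppid(example))  # ID_NL-UT-UTR_196X_NL-NH-AMS_XXXX_AART-HARTEN
```

The *_source keys are ignored here on purpose: they document where each component came from and do not appear in the identifier itself.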
Upgrade Path: Inferred → Verified
When verified data becomes available:
- Keep inferred data in inferred_* fields for audit trail
- Add verified data to canonical fields
- Mark inferred as superseded:
{
"birth_date": {
"edtf": "1967-03-15",
"precision": "day",
"verified": true,
"source": "official_record"
},
"inferred_birth_decade": {
"value": "196X",
"superseded": true,
"superseded_by": "birth_date",
"superseded_at": "2025-01-15T10:00:00Z",
"accuracy_assessment": "Inferred decade was correct (1960s), actual year 1967"
}
}
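The upgrade can be applied without losing the original inference. The following is a sketch only, assuming the superseded_* keys shown in the example above; the function name `supersede_inferred` and its parameters are hypothetical.

```python
import copy
from typing import Optional


def supersede_inferred(
    record: dict,
    inferred_name: str,
    canonical_name: str,
    verified: dict,
    superseded_at: str,
    accuracy_assessment: Optional[str] = None,
) -> dict:
    """Return a copy of record with verified data in the canonical field and the inferred field flagged."""
    if not inferred_name.startswith("inferred_"):
        raise ValueError("inferred_name must be an inferred_* field")
    updated = copy.deepcopy(record)  # leave the caller's record untouched
    updated[canonical_name] = {**verified, "verified": True}

    inferred = updated[inferred_name]  # kept, never deleted, for the audit trail
    inferred["superseded"] = True
    inferred["superseded_by"] = canonical_name
    inferred["superseded_at"] = superseded_at
    if accuracy_assessment is not None:
        inferred["accuracy_assessment"] = accuracy_assessment
    return updated
```

Because the inferred field is only annotated, its value, confidence, and inference_provenance remain available for later accuracy reviews of the heuristic.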
Implementation Checklist
For any enrichment script:
- Create explicit inferred_* fields for ALL inferred data
- Include inference_provenance with complete inference_chain
- Record each step: observation → assumption → calculation → result
- Set appropriate confidence level
- Add *_source references in PPID components
- Preserve original unknown values (XXXX, XX-XX-XXX)
- Add note in canonical fields pointing to inferred alternatives
Related Rules
- Rule 44: PPID Birth Date Enrichment and EDTF Unknown Date Notation
- Rule 35: Provenance Statements MUST Have Dual Timestamps
- Rule 6: WebObservation Claims MUST Have XPath Provenance