# Rule 45: Inferred Data Must Be Explicit with Provenance

**Status**: Active
**Created**: 2025-01-09
**Applies to**: PPID enrichment, person entity profiles, any data inference

## Core Principle

**All inferred data MUST be stored in explicit `inferred_*` fields with full provenance statements. Inferred values MUST NEVER silently replace or merge with verified data.**

This ensures:

1. **Transparency**: Users can distinguish verified facts from heuristic estimates
2. **Auditability**: The inference method and source observations are traceable
3. **Reversibility**: Inferred data can be corrected when verified data becomes available
4. **Quality Signals**: Confidence levels and argument chains are preserved

## Required Structure for Inferred Data

Every inferred claim MUST include:

```yaml
inferred_[field_name]:
  value: "the inferred value"
  edtf: "196X"              # For dates: EDTF notation
  formatted: "NL-UT-UTR"    # For locations: CC-RR-PPP format
  confidence: "very_low|low|medium|high"
  inference_provenance:
    method: "heuristic_name"
    inference_chain:
      - step: 1
        observation: "University start year 1986"
        source_field: "profile_data.education[0].date_range"
        source_value: "1986 - 1990"
      - step: 2
        assumption: "University entry at age 18"
        rationale: "Standard Dutch university entry age"
      - step: 3
        calculation: "1986 - 18 = 1968"
        result: "Estimated birth year 1968"
      - step: 4
        generalization: "Round to decade → 196X"
        rationale: "EDTF decade notation for uncertain years"
    inferred_at: "2025-01-09T18:00:00Z"
    inferred_by: "enrich_ppids.py"
```

## Explicit Inferred Fields

### For Person Profiles (PPID)

| Inferred Field | Source Observations | Heuristic |
|----------------|---------------------|-----------|
| `inferred_birth_year` | Earliest education/job dates | Entry age assumptions |
| `inferred_birth_decade` | Birth year estimate | EDTF decade notation |
| `inferred_birth_settlement` | School/university location | Residential proximity |
| `inferred_birth_region` | Settlement location | GeoNames admin1 |
| `inferred_birth_country` | Settlement location | GeoNames country |
| `inferred_current_settlement` | Profile location, current job | Direct extraction |
| `inferred_current_region` | Settlement location | GeoNames admin1 |
| `inferred_current_country` | Settlement location | GeoNames country |

### Example: Complete Inferred Birth Data

```json
{
  "ppid": "ID_NL-UT-UTR_196X_NL-UT-UTR_XXXX_AART-HARTEN",
  "birth_date": {
    "edtf": "XXXX",
    "precision": "unknown",
    "note": "See inferred_birth_decade for heuristic estimate"
  },
  "inferred_birth_decade": {
    "value": "196X",
    "edtf": "196X",
    "precision": "decade",
    "confidence": "low",
    "inference_provenance": {
      "method": "earliest_education_heuristic",
      "inference_chain": [
        {
          "step": 1,
          "observation": "University education record found",
          "source_field": "profile_data.education[0]",
          "source_value": {
            "institution": "Universiteit Utrecht",
            "degree": "Social & Organisational psychology, doctoraal",
            "date_range": "1986 - 1990"
          }
        },
        {
          "step": 2,
          "extraction": "Start year extracted from date_range",
          "extracted_value": 1986
        },
        {
          "step": 3,
          "assumption": "University entry age",
          "assumed_value": 18,
          "rationale": "Standard Dutch university entry age (post-VWO)",
          "confidence_impact": "Assumption reduces confidence; actual age 17-20 possible"
        },
        {
          "step": 4,
          "calculation": "1986 - 18 = 1968",
          "result": "Estimated birth year: 1968"
        },
        {
          "step": 5,
          "generalization": "Convert to EDTF decade",
          "input": 1968,
          "output": "196X",
          "rationale": "Decade precision appropriate for heuristic estimate"
        }
      ],
      "inferred_at": "2025-01-09T18:00:00Z",
      "inferred_by": "enrich_ppids.py"
    }
  },
  "inferred_birth_settlement": {
    "value": "Utrecht",
    "formatted": "NL-UT-UTR",
    "confidence": "low",
    "inference_provenance": {
      "method": "earliest_education_location",
      "inference_chain": [
        {
          "step": 1,
          "observation": "Earliest education institution identified",
          "source_field": "profile_data.education[0].institution",
          "source_value": "Universiteit Utrecht"
        },
        {
          "step": 2,
          "lookup": "Institution location mapping",
          "mapping_key": "Universiteit Utrecht",
          "mapping_value": "Utrecht, Netherlands"
        },
        {
          "step": 3,
          "geocoding": "GeoNames resolution",
          "query": "Utrecht",
          "country_code": "NL",
          "result": {
            "geonames_id": 2745912,
            "name": "Utrecht",
            "admin1_code": "09",
            "admin1_name": "Utrecht"
          }
        },
        {
          "step": 4,
          "formatting": "CC-RR-PPP generation",
          "country_code": "NL",
          "region_code": "UT",
          "settlement_code": "UTR",
          "result": "NL-UT-UTR"
        }
      ],
      "assumption_note": "University location used as proxy for birth location; student may have relocated for education",
      "inferred_at": "2025-01-09T18:00:00Z",
      "inferred_by": "enrich_ppids.py"
    }
  }
}
```

## List-Valued Inferred Data (EDTF Set Notation)

When inference yields multiple plausible values (e.g., an estimated birth year of 1968 could fall in either the 1960s or the 1970s once the variance of the age assumption is taken into account), store the result as a **list** with EDTF set notation.

### EDTF Set Notation Standards

| Notation | Meaning | Use Case |
|----------|---------|----------|
| `[196X,197X]` | One of these values | Person born in late 1960s (uncertainty spans decades) |
| `{196X,197X}` | All of these values | NOT for birth decade (use `[...]`) |
| `[1965..1970]` | Range within set | Birth year between 1965-1970 |

### When to Use List Values

1. **Decade Boundary Cases**: Estimated birth year is within 3 years of a decade boundary
   - Estimated 1968 → `[196X,197X]` (could be late 60s or early 70s due to age assumption variance)
   - Estimated 1972 → `[196X,197X]` (same logic)
   - Estimated 1975 → `197X` (confidently mid-decade)

2. **Multiple Plausible Locations**: Student attended schools in different cities
   - `["NL-UT-UTR", "NL-NH-AMS"]` with provenance explaining each candidate

### Example: List-Valued Birth Decade

```json
{
  "inferred_birth_decade": {
    "values": ["196X", "197X"],
    "edtf": "[196X,197X]",
    "edtf_meaning": "one of: 1960s or 1970s",
    "precision": "decade_set",
    "confidence": "very_low",
    "primary_value": "196X",
    "primary_rationale": "1968 is closer to 1960s center than 1970s",
    "inference_provenance": {
      "method": "earliest_observation_heuristic",
      "inference_chain": [
        {
          "step": 1,
          "observation": "University start 1986",
          "source_field": "profile_data.education[0].date_range"
        },
        {
          "step": 2,
          "assumption": "University entry at age 18 (±3 years)",
          "rationale": "Dutch university entry typically 17-21"
        },
        {
          "step": 3,
          "calculation": "1986 - 18 = 1968 (range: 1965-1971)",
          "result": "Birth year estimate: 1968 with variance 1965-1971"
        },
        {
          "step": 4,
          "generalization": "Birth year range spans decade boundary",
          "input_range": [1965, 1971],
          "output": ["196X", "197X"],
          "rationale": "Cannot determine which decade without additional evidence"
        }
      ],
      "inferred_at": "2025-01-09T18:00:00Z",
      "inferred_by": "enrich_ppids.py"
    }
  }
}
```

### PPID Generation with List Values

When `inferred_birth_decade` is a list, use `primary_value` for the PPID:

```json
{
  "ppid": "ID_NL-UT-UTR_196X_NL-UT-UTR_XXXX_AART-HARTEN",
  "ppid_components": {
    "first_date": "196X",
    "first_date_source": "inferred_birth_decade.primary_value",
    "first_date_alternatives": ["197X"]
  }
}
```

### Example: List-Valued Location

```json
{
  "inferred_birth_settlement": {
    "values": [
      {"settlement": "Utrecht", "formatted": "NL-UT-UTR"},
      {"settlement": "Amsterdam", "formatted": "NL-NH-AMS"}
    ],
    "primary_value": "NL-UT-UTR",
    "primary_rationale": "Earlier education (1986) in Utrecht; later education (1990) in Amsterdam",
    "confidence": "very_low",
    "inference_provenance": {
      "method": "education_locations",
      "inference_chain": [
        {
          "step": 1,
          "observation": "Multiple education institutions found",
          "source_field": "profile_data.education",
          "candidates": ["Universiteit Utrecht (1986)", "UvA (1990)"]
        },
        {
          "step": 2,
          "assumption": "Earlier education more likely near birth location",
          "rationale": "Students often attend local university first"
        }
      ]
    }
  }
}
```

## Confidence Levels

| Level | Criteria | Example |
|-------|----------|---------|
| **high** | Direct extraction from authoritative source | Profile states "Born in Amsterdam" |
| **medium** | Single-step inference with reliable source | Current job location from employment record |
| **low** | Multi-step heuristic with assumptions | Birth year from university start date |
| **very_low** | Speculative, multiple assumptions, or list-valued | Birth location from first observed location, or decade spanning boundary |

## Anti-Patterns (FORBIDDEN)

### ❌ Silent Replacement

```json
{
  "birth_date": {
    "edtf": "196X",
    "precision": "decade"
  }
}
```

**Problem**: No indication this is inferred, no provenance, no confidence level.

### ❌ Hidden in Metadata

```json
{
  "birth_date": {
    "edtf": "196X"
  },
  "enrichment_metadata": {
    "birth_date_inferred": true
  }
}
```

**Problem**: Inference metadata separated from the value; easy to miss.

### ❌ Missing Inference Chain

```json
{
  "inferred_birth_decade": {
    "value": "196X",
    "method": "heuristic"
  }
}
```

**Problem**: No explanation of HOW the value was derived; not auditable.
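
The last two anti-patterns can be caught mechanically. The sketch below is one possible check, not part of the rule itself: `find_inference_violations` is a hypothetical helper name, and treating any key ending in `_metadata` as a metadata block is an assumption. Silent replacement cannot be detected this way, because the record carries no trace that the value was inferred.

```python
def find_inference_violations(record: dict) -> list[str]:
    """Return human-readable violations found in one profile record.

    Hypothetical helper: checks top-level keys only.
    """
    problems = []
    for key, value in record.items():
        if key.startswith("inferred_"):
            if not isinstance(value, dict):
                problems.append(f"{key}: must be an object, not a bare value")
                continue
            if "confidence" not in value:
                problems.append(f"{key}: missing confidence")
            provenance = value.get("inference_provenance")
            if not isinstance(provenance, dict):
                problems.append(f"{key}: missing inference_provenance")
            elif not provenance.get("inference_chain"):
                problems.append(f"{key}: missing or empty inference_chain")
        elif key.endswith("_metadata") and isinstance(value, dict):
            # Assumption: flags such as "birth_date_inferred" mark hidden inference.
            for flag, is_set in value.items():
                if flag.endswith("_inferred") and is_set:
                    problems.append(
                        f"{key}.{flag}: inference hidden in metadata; "
                        "use an inferred_* field instead"
                    )
    return problems


hidden = {
    "birth_date": {"edtf": "196X"},
    "enrichment_metadata": {"birth_date_inferred": True},
}
for problem in find_inference_violations(hidden):
    print(problem)
```

A check like this could run at the end of an enrichment script so that records with incomplete provenance are rejected before they are written.
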
## Correct Pattern ✅

```json
{
  "birth_date": {
    "edtf": "XXXX",
    "precision": "unknown",
    "note": "See inferred_birth_decade"
  },
  "inferred_birth_decade": {
    "value": "196X",
    "edtf": "196X",
    "confidence": "low",
    "inference_provenance": {
      "method": "earliest_education_heuristic",
      "inference_chain": [
        {"step": 1, "observation": "...", "source_field": "...", "source_value": "..."},
        {"step": 2, "assumption": "...", "rationale": "..."},
        {"step": 3, "calculation": "...", "result": "..."}
      ],
      "inferred_at": "2025-01-09T18:00:00Z",
      "inferred_by": "enrich_ppids.py"
    }
  }
}
```

## PPID Component Handling

When inferred values are used in PPID components:

```json
{
  "ppid": "ID_NL-UT-UTR_196X_NL-NH-AMS_XXXX_AART-HARTEN",
  "ppid_components": {
    "type": "ID",
    "first_location": "NL-UT-UTR",
    "first_location_source": "inferred_birth_settlement",
    "first_date": "196X",
    "first_date_source": "inferred_birth_decade",
    "last_location": "NL-NH-AMS",
    "last_location_source": "inferred_current_settlement",
    "last_date": "XXXX",
    "name_tokens": ["AART", "HARTEN"]
  }
}
```

The `*_source` fields document which inferred field was used for PPID generation.

## Upgrade Path: Inferred → Verified

When verified data becomes available:

1. **Keep inferred data** in `inferred_*` fields for audit trail
2. **Add verified data** to canonical fields
3. **Mark inferred as superseded**:

```json
{
  "birth_date": {
    "edtf": "1967-03-15",
    "precision": "day",
    "verified": true,
    "source": "official_record"
  },
  "inferred_birth_decade": {
    "value": "196X",
    "superseded": true,
    "superseded_by": "birth_date",
    "superseded_at": "2025-01-15T10:00:00Z",
    "accuracy_assessment": "Inferred decade was correct (1960s), actual year 1967"
  }
}
```

## Implementation Checklist

For any enrichment script:

- [ ] Create explicit `inferred_*` fields for ALL inferred data
- [ ] Include `inference_provenance` with complete `inference_chain`
- [ ] Record each step: observation → assumption → calculation → result
- [ ] Set appropriate `confidence` level
- [ ] Add `*_source` references in PPID components
- [ ] Preserve original unknown values (`XXXX`, `XX-XX-XXX`)
- [ ] Add `note` in canonical fields pointing to inferred alternatives

## Related Rules

- **Rule 44**: PPID Birth Date Enrichment and EDTF Unknown Date Notation
- **Rule 35**: Provenance Statements MUST Have Dual Timestamps
- **Rule 6**: WebObservation Claims MUST Have XPath Provenance
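
As an illustration of the implementation checklist, the sketch below builds an `inferred_birth_decade` record from an education start year. It is not the actual `enrich_ppids.py` code: `decade_of` and `infer_birth_decade` are hypothetical names, and the decade is treated as list-valued whenever the estimate plus or minus the variance crosses a decade boundary, which is one reading of the boundary rule in this document. The `very_low` confidence for list values follows the Confidence Levels table.

```python
from datetime import datetime, timezone


def decade_of(year: int) -> str:
    """Return the EDTF decade string for a year, e.g. 1968 -> '196X'."""
    return f"{year // 10}X"


def infer_birth_decade(start_year, source_field, source_value,
                       entry_age=18, variance=3, now=None):
    """Build an inferred_birth_decade record with a complete inference chain."""
    now = now or datetime.now(timezone.utc)
    estimate = start_year - entry_age
    low, high = estimate - variance, estimate + variance
    # Every decade touched by the plausible birth-year range, ascending.
    decades = sorted({decade_of(year) for year in range(low, high + 1)})
    single = len(decades) == 1
    chain = [
        {"step": 1, "observation": f"Education start year {start_year}",
         "source_field": source_field, "source_value": source_value},
        {"step": 2, "assumption": f"Entry at age {entry_age} (±{variance} years)",
         "rationale": "Assumed typical entry age"},
        {"step": 3, "calculation": f"{start_year} - {entry_age} = {estimate}",
         "result": f"Estimated birth year {estimate} (range {low}-{high})"},
        {"step": 4, "generalization": "Convert range to EDTF decade(s)",
         "input_range": [low, high], "output": decades},
    ]
    record = {
        "edtf": decades[0] if single else "[" + ",".join(decades) + "]",
        "precision": "decade" if single else "decade_set",
        # List-valued results are rated very_low per the Confidence Levels table.
        "confidence": "low" if single else "very_low",
        "inference_provenance": {
            "method": "earliest_education_heuristic",
            "inference_chain": chain,
            "inferred_at": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "inferred_by": "enrich_ppids.py",
        },
    }
    if single:
        record["value"] = decades[0]
    else:
        record["values"] = decades
        record["primary_value"] = decade_of(estimate)
    return record


record = infer_birth_decade(1986, "profile_data.education[0].date_range", "1986 - 1990")
print(record["edtf"])  # [196X,197X]
```

Passing `now` explicitly keeps the output reproducible in tests; the canonical `birth_date` field is deliberately left untouched by this helper.
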