glam/.opencode/rules/inferred-data-explicit-provenance-rule.md
2026-01-09 20:35:19 +01:00

Rule 45: Inferred Data Must Be Explicit with Provenance

Status: Active
Created: 2025-01-09
Applies to: PPID enrichment, person entity profiles, any data inference

Core Principle

All inferred data MUST be stored in explicit inferred_* fields with full provenance statements. Inferred values MUST NEVER silently replace or merge with verified data.

This ensures:

  1. Transparency: Users can distinguish verified facts from heuristic estimates
  2. Auditability: The inference method and source observations are traceable
  3. Reversibility: Inferred data can be corrected when verified data becomes available
  4. Quality Signals: Confidence levels and argument chains are preserved

Required Structure for Inferred Data

Every inferred claim MUST include:

inferred_[field_name]:
  value: "the inferred value"
  edtf: "196X"  # For dates: EDTF notation
  formatted: "NL-UT-UTR"  # For locations: CC-RR-PPP format
  confidence: "very_low|low|medium|high"
  inference_provenance:
    method: "heuristic_name"
    inference_chain:
      - step: 1
        observation: "University start year 1986"
        source_field: "profile_data.education[0].date_range"
        source_value: "1986 - 1990"
      - step: 2
        assumption: "University entry at age 18"
        rationale: "Standard Dutch university entry age"
      - step: 3
        calculation: "1986 - 18 = 1968"
        result: "Estimated birth year 1968"
      - step: 4
        generalization: "Round to decade → 196X"
        rationale: "EDTF decade notation for uncertain years"
    inferred_at: "2025-01-09T18:00:00Z"
    inferred_by: "enrich_ppids.py"
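
The structure above can be sketched in Python as follows. This is a minimal illustration, not the actual code of enrich_ppids.py: the function name `build_inferred_birth_decade` and its parameters are hypothetical, and only the field names are taken from the structure shown above.

```python
from datetime import datetime, timezone


def build_inferred_birth_decade(start_year, source_field, source_value,
                                entry_age=18, inferred_at=None,
                                inferred_by="enrich_ppids.py"):
    """Return an inferred_birth_decade claim with a full inference chain.

    Hypothetical helper: name and signature are illustrative only.
    """
    birth_year = start_year - entry_age
    decade = f"{birth_year // 10}X"  # e.g. 1968 -> "196X"
    if inferred_at is None:
        inferred_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    chain = [
        {"step": 1,
         "observation": f"University start year {start_year}",
         "source_field": source_field,
         "source_value": source_value},
        {"step": 2,
         "assumption": f"University entry at age {entry_age}",
         "rationale": "Standard Dutch university entry age"},
        {"step": 3,
         "calculation": f"{start_year} - {entry_age} = {birth_year}",
         "result": f"Estimated birth year {birth_year}"},
        {"step": 4,
         "generalization": f"Round to decade → {decade}",
         "rationale": "EDTF decade notation for uncertain years"},
    ]
    return {
        "value": decade,
        "edtf": decade,
        "confidence": "low",  # multi-step heuristic with an assumption
        "inference_provenance": {
            "method": "earliest_education_heuristic",
            "inference_chain": chain,
            "inferred_at": inferred_at,
            "inferred_by": inferred_by,
        },
    }


claim = build_inferred_birth_decade(
    1986, "profile_data.education[0].date_range", "1986 - 1990")
print(claim["value"])  # 196X
```

The claim is returned as a separate object so the caller can store it under an inferred_* key without touching the canonical birth_date field.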

Explicit Inferred Fields

For Person Profiles (PPID)

| Inferred Field | Source Observations | Heuristic |
|---|---|---|
| inferred_birth_year | Earliest education/job dates | Entry age assumptions |
| inferred_birth_decade | Birth year estimate | EDTF decade notation |
| inferred_birth_settlement | School/university location | Residential proximity |
| inferred_birth_region | Settlement location | GeoNames admin1 |
| inferred_birth_country | Settlement location | GeoNames country |
| inferred_current_settlement | Profile location, current job | Direct extraction |
| inferred_current_region | Settlement location | GeoNames admin1 |
| inferred_current_country | Settlement location | GeoNames country |

Example: Complete Inferred Birth Data

{
  "ppid": "ID_NL-UT-UTR_196X_NL-UT-UTR_XXXX_AART-HARTEN",
  
  "birth_date": {
    "edtf": "XXXX",
    "precision": "unknown",
    "note": "See inferred_birth_decade for heuristic estimate"
  },
  
  "inferred_birth_decade": {
    "value": "196X",
    "edtf": "196X",
    "precision": "decade",
    "confidence": "low",
    "inference_provenance": {
      "method": "earliest_education_heuristic",
      "inference_chain": [
        {
          "step": 1,
          "observation": "University education record found",
          "source_field": "profile_data.education[0]",
          "source_value": {
            "institution": "Universiteit Utrecht",
            "degree": "Social & Organisational psychology, doctoraal",
            "date_range": "1986 - 1990"
          }
        },
        {
          "step": 2,
          "extraction": "Start year extracted from date_range",
          "extracted_value": 1986
        },
        {
          "step": 3,
          "assumption": "University entry age",
          "assumed_value": 18,
          "rationale": "Standard Dutch university entry age (post-VWO)",
          "confidence_impact": "Assumption reduces confidence; actual age 17-20 possible"
        },
        {
          "step": 4,
          "calculation": "1986 - 18 = 1968",
          "result": "Estimated birth year: 1968"
        },
        {
          "step": 5,
          "generalization": "Convert to EDTF decade",
          "input": 1968,
          "output": "196X",
          "rationale": "Decade precision appropriate for heuristic estimate"
        }
      ],
      "inferred_at": "2025-01-09T18:00:00Z",
      "inferred_by": "enrich_ppids.py"
    }
  },
  
  "inferred_birth_settlement": {
    "value": "Utrecht",
    "formatted": "NL-UT-UTR",
    "confidence": "low",
    "inference_provenance": {
      "method": "earliest_education_location",
      "inference_chain": [
        {
          "step": 1,
          "observation": "Earliest education institution identified",
          "source_field": "profile_data.education[0].institution",
          "source_value": "Universiteit Utrecht"
        },
        {
          "step": 2,
          "lookup": "Institution location mapping",
          "mapping_key": "Universiteit Utrecht",
          "mapping_value": "Utrecht, Netherlands"
        },
        {
          "step": 3,
          "geocoding": "GeoNames resolution",
          "query": "Utrecht",
          "country_code": "NL",
          "result": {
            "geonames_id": 2745912,
            "name": "Utrecht",
            "admin1_code": "09",
            "admin1_name": "Utrecht"
          }
        },
        {
          "step": 4,
          "formatting": "CC-RR-PPP generation",
          "country_code": "NL",
          "region_code": "UT",
          "settlement_code": "UTR",
          "result": "NL-UT-UTR"
        }
      ],
      "assumption_note": "University location used as proxy for birth location; student may have relocated for education",
      "inferred_at": "2025-01-09T18:00:00Z",
      "inferred_by": "enrich_ppids.py"
    }
  }
}

List-Valued Inferred Data (EDTF Set Notation)

When inference yields multiple plausible values (e.g., an estimated birth year of 1968 with a ±3-year margin could fall in either the 1960s or the 1970s), store them as a list and express the set in EDTF set notation.

EDTF Set Notation Standards

| Notation | Meaning | Use Case |
|---|---|---|
| [196X,197X] | One of these values | Person born in late 1960s (uncertainty spans decades) |
| {196X,197X} | All of these values | NOT for birth decade (use [...]) |
| [1965..1970] | Range within set | Birth year between 1965-1970 |

When to Use List Values

  1. Decade Boundary Cases: Estimated birth year is within 3 years of a decade boundary

    • Estimated 1968 → [196X,197X] (could be late 60s or early 70s due to age assumption variance)
    • Estimated 1972 → [196X,197X] (same logic)
    • Estimated 1975 → 197X (confidently mid-decade)
  2. Multiple Plausible Locations: Student attended schools in different cities

    • ["NL-UT-UTR", "NL-NH-AMS"] with provenance explaining each candidate
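
The decade-boundary cases above can be reproduced with a short sketch. It assumes one particular reading of the rule: the estimate is widened by ±3 years (the margin used in the list-valued example in this rule), and every decade touched by that range becomes a candidate. The function names are hypothetical. Under this reading an estimate of 1973 yields only 197X, because 1970-1976 stays within one decade.

```python
def decades_for_estimate(estimated_year, variance=3):
    """Return the EDTF decades covered by estimated_year ± variance.

    Assumption: a decade is a candidate if any year in the range falls in it.
    """
    low, high = estimated_year - variance, estimated_year + variance
    return [f"{d}X" for d in range(low // 10, high // 10 + 1)]


def to_edtf(decades):
    """One decade -> '197X'; several -> EDTF one-of set '[196X,197X]'."""
    if len(decades) == 1:
        return decades[0]
    return "[" + ",".join(decades) + "]"


for year in (1968, 1972, 1975):
    print(year, to_edtf(decades_for_estimate(year)))
# 1968 [196X,197X]
# 1972 [196X,197X]
# 1975 197X
```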

Example: List-Valued Birth Decade

{
  "inferred_birth_decade": {
    "values": ["196X", "197X"],
    "edtf": "[196X,197X]",
    "edtf_meaning": "one of: 1960s or 1970s",
    "precision": "decade_set",
    "confidence": "very_low",
    "primary_value": "196X",
    "primary_rationale": "1968 is closer to 1960s center than 1970s",
    "inference_provenance": {
      "method": "earliest_observation_heuristic",
      "inference_chain": [
        {
          "step": 1,
          "observation": "University start 1986",
          "source_field": "profile_data.education[0].date_range"
        },
        {
          "step": 2,
          "assumption": "University entry at age 18 (±3 years)",
          "rationale": "Dutch university entry is typically at 17-21; a symmetric ±3-year margin (ages 15-21) is used as a conservative bound"
        },
        {
          "step": 3,
          "calculation": "1986 - 18 = 1968 (range: 1965-1971)",
          "result": "Birth year estimate: 1968 with variance 1965-1971"
        },
        {
          "step": 4,
          "generalization": "Birth year range spans decade boundary",
          "input_range": [1965, 1971],
          "output": ["196X", "197X"],
          "rationale": "Cannot determine which decade without additional evidence"
        }
      ],
      "inferred_at": "2025-01-09T18:00:00Z",
      "inferred_by": "enrich_ppids.py"
    }
  }
}

PPID Generation with List Values

When inferred_birth_decade is a list, use primary_value for PPID:

{
  "ppid": "ID_NL-UT-UTR_196X_NL-UT-UTR_XXXX_AART-HARTEN",
  "ppid_components": {
    "first_date": "196X",
    "first_date_source": "inferred_birth_decade.primary_value",
    "first_date_alternatives": ["197X"]
  }
}
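
A minimal sketch of that selection, assuming a claim is list-valued exactly when it carries a `values` key as in the example above. The helper name `first_date_components` is hypothetical.

```python
def first_date_components(inferred_birth_decade):
    """Derive the PPID first_date fields from a single- or list-valued claim."""
    if "values" in inferred_birth_decade:
        primary = inferred_birth_decade["primary_value"]
        return {
            "first_date": primary,
            "first_date_source": "inferred_birth_decade.primary_value",
            # Keep the non-primary candidates so they stay visible.
            "first_date_alternatives": [
                v for v in inferred_birth_decade["values"] if v != primary
            ],
        }
    return {
        "first_date": inferred_birth_decade["value"],
        "first_date_source": "inferred_birth_decade",
    }


list_claim = {"values": ["196X", "197X"], "primary_value": "196X"}
print(first_date_components(list_claim)["first_date_alternatives"])  # ['197X']
```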

Example: List-Valued Location

{
  "inferred_birth_settlement": {
    "values": [
      {"settlement": "Utrecht", "formatted": "NL-UT-UTR"},
      {"settlement": "Amsterdam", "formatted": "NL-NH-AMS"}
    ],
    "primary_value": "NL-UT-UTR",
    "primary_rationale": "Earlier education (1986) in Utrecht; later education (1990) in Amsterdam",
    "confidence": "very_low",
    "inference_provenance": {
      "method": "education_locations",
      "inference_chain": [
        {
          "step": 1,
          "observation": "Multiple education institutions found",
          "source_field": "profile_data.education",
          "candidates": ["Universiteit Utrecht (1986)", "UvA (1990)"]
        },
        {
          "step": 2,
          "assumption": "Earlier education more likely near birth location",
          "rationale": "Students often attend local university first"
        }
      ]
    }
  }
}

Confidence Levels

| Level | Criteria | Example |
|---|---|---|
| high | Direct extraction from authoritative source | Profile states "Born in Amsterdam" |
| medium | Single-step inference with reliable source | Current job location from employment record |
| low | Multi-step heuristic with assumptions | Birth year from university start date |
| very_low | Speculative, multiple assumptions, or list-valued | Birth location from first observed location, or decade spanning boundary |
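
One way an enrichment script could enforce these levels is sketched below. It rejects unknown labels and caps list-valued claims at very_low, following the last table row. The names `CONFIDENCE_ORDER` and `effective_confidence` are hypothetical, and detecting list-valued claims via a `values` key is an assumption based on the examples in this rule.

```python
# Ordered from least to most trustworthy.
CONFIDENCE_ORDER = ["very_low", "low", "medium", "high"]


def effective_confidence(claim):
    """Return the confidence a claim may carry under the table above."""
    level = claim["confidence"]
    if level not in CONFIDENCE_ORDER:
        raise ValueError(f"unknown confidence level: {level!r}")
    if "values" in claim:
        # List-valued claims are capped at very_low.
        return "very_low"
    return level


print(effective_confidence({"value": "196X", "confidence": "low"}))  # low
print(effective_confidence(
    {"values": ["196X", "197X"], "confidence": "low"}))  # very_low
```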

Anti-Patterns (FORBIDDEN)

Silent Replacement

{
  "birth_date": {
    "edtf": "196X",
    "precision": "decade"
  }
}

Problem: No indication this is inferred, no provenance, no confidence level.

Hidden in Metadata

{
  "birth_date": {
    "edtf": "196X"
  },
  "enrichment_metadata": {
    "birth_date_inferred": true
  }
}

Problem: Inference metadata separated from the value; easy to miss.

Missing Inference Chain

{
  "inferred_birth_decade": {
    "value": "196X",
    "method": "heuristic"
  }
}

Problem: No explanation of HOW the value was derived; not auditable.

Correct Pattern

{
  "birth_date": {
    "edtf": "XXXX",
    "precision": "unknown",
    "note": "See inferred_birth_decade"
  },
  "inferred_birth_decade": {
    "value": "196X",
    "edtf": "196X",
    "confidence": "low",
    "inference_provenance": {
      "method": "earliest_education_heuristic",
      "inference_chain": [
        {"step": 1, "observation": "...", "source_field": "...", "source_value": "..."},
        {"step": 2, "assumption": "...", "rationale": "..."},
        {"step": 3, "calculation": "...", "result": "..."}
      ],
      "inferred_at": "2025-01-09T18:00:00Z",
      "inferred_by": "enrich_ppids.py"
    }
  }
}
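
A rough structural check for two of the anti-patterns above ("Hidden in Metadata" and "Missing Inference Chain") might look like the sketch below. Silent replacement cannot be detected from structure alone, so it is not covered. The function name and the list of required keys are assumptions drawn from the examples in this rule, not an existing validator.

```python
REQUIRED_PROVENANCE_KEYS = ("method", "inference_chain", "inferred_at", "inferred_by")


def find_provenance_problems(record):
    """Return human-readable problems found in a profile record."""
    problems = []
    for name, claim in record.items():
        if not name.startswith("inferred_"):
            continue
        if not isinstance(claim, dict):
            problems.append(f"{name}: claim must be an object")
            continue
        if "confidence" not in claim:
            problems.append(f"{name}: missing confidence")
        provenance = claim.get("inference_provenance")
        if not isinstance(provenance, dict):
            problems.append(f"{name}: missing inference_provenance")
            continue
        for key in REQUIRED_PROVENANCE_KEYS:
            if not provenance.get(key):
                problems.append(f"{name}: missing inference_provenance.{key}")
    # Inference flags tucked away in a separate metadata object.
    for key in record.get("enrichment_metadata", {}):
        if key.endswith("_inferred"):
            problems.append(f"enrichment_metadata.{key}: inference hidden in metadata")
    return problems


bad = {"inferred_birth_decade": {"value": "196X", "method": "heuristic"}}
for problem in find_provenance_problems(bad):
    print(problem)
# inferred_birth_decade: missing confidence
# inferred_birth_decade: missing inference_provenance
```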

PPID Component Handling

When inferred values are used in PPID components:

{
  "ppid": "ID_NL-UT-UTR_196X_NL-NH-AMS_XXXX_AART-HARTEN",
  "ppid_components": {
    "type": "ID",
    "first_location": "NL-UT-UTR",
    "first_location_source": "inferred_birth_settlement",
    "first_date": "196X",
    "first_date_source": "inferred_birth_decade",
    "last_location": "NL-NH-AMS",
    "last_location_source": "inferred_current_settlement",
    "last_date": "XXXX",
    "name_tokens": ["AART", "HARTEN"]
  }
}

The *_source fields document which inferred field was used for PPID generation.
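
For illustration, the PPID string in this example can be rebuilt from its components. The joining pattern below (underscores between components, hyphens between name tokens) is read off the example only; the authoritative PPID format is defined elsewhere, and `assemble_ppid` and `inferred_sources` are hypothetical names.

```python
def assemble_ppid(components):
    """Join PPID components in the order seen in the example above."""
    return "_".join([
        components["type"],
        components["first_location"],
        components["first_date"],
        components["last_location"],
        components["last_date"],
        "-".join(components["name_tokens"]),
    ])


def inferred_sources(components):
    """List the inferred fields that fed the PPID, via the *_source keys."""
    return sorted(v for k, v in components.items() if k.endswith("_source"))


components = {
    "type": "ID",
    "first_location": "NL-UT-UTR",
    "first_location_source": "inferred_birth_settlement",
    "first_date": "196X",
    "first_date_source": "inferred_birth_decade",
    "last_location": "NL-NH-AMS",
    "last_location_source": "inferred_current_settlement",
    "last_date": "XXXX",
    "name_tokens": ["AART", "HARTEN"],
}
print(assemble_ppid(components))  # ID_NL-UT-UTR_196X_NL-NH-AMS_XXXX_AART-HARTEN
```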

Upgrade Path: Inferred → Verified

When verified data becomes available:

  1. Keep inferred data in inferred_* fields for audit trail
  2. Add verified data to canonical fields
  3. Mark inferred as superseded:
{
  "birth_date": {
    "edtf": "1967-03-15",
    "precision": "day",
    "verified": true,
    "source": "official_record"
  },
  "inferred_birth_decade": {
    "value": "196X",
    "superseded": true,
    "superseded_by": "birth_date",
    "superseded_at": "2025-01-15T10:00:00Z",
    "accuracy_assessment": "Inferred decade was correct (1960s), actual year 1967"
  }
}
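
The three steps can be sketched as a single helper that leaves the inferred claim, including its provenance, in place. The function name `supersede` and its parameters are hypothetical; the field names follow the example above.

```python
import copy


def supersede(record, inferred_field, canonical_field, verified_value,
              superseded_at, accuracy_assessment=None):
    """Return a copy of record with verified data added and the inferred
    claim marked as superseded (never deleted)."""
    updated = copy.deepcopy(record)
    updated[canonical_field] = verified_value
    claim = updated[inferred_field]
    claim["superseded"] = True
    claim["superseded_by"] = canonical_field
    claim["superseded_at"] = superseded_at
    if accuracy_assessment is not None:
        claim["accuracy_assessment"] = accuracy_assessment
    return updated


record = {
    "birth_date": {"edtf": "XXXX", "precision": "unknown"},
    "inferred_birth_decade": {"value": "196X", "confidence": "low"},
}
verified = {"edtf": "1967-03-15", "precision": "day",
            "verified": True, "source": "official_record"}
updated = supersede(record, "inferred_birth_decade", "birth_date", verified,
                    "2025-01-15T10:00:00Z")
print(updated["inferred_birth_decade"]["superseded_by"])  # birth_date
```

Working on a deep copy keeps the original record available for comparison when writing the accuracy assessment.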

Implementation Checklist

For any enrichment script:

  • Create explicit inferred_* fields for ALL inferred data
  • Include inference_provenance with complete inference_chain
  • Record each step: observation → assumption → calculation → result
  • Set appropriate confidence level
  • Add *_source references in PPID components
  • Preserve original unknown values (XXXX, XX-XX-XXX)
  • Add note in canonical fields pointing to inferred alternatives

Related Rules

  • Rule 44: PPID Birth Date Enrichment and EDTF Unknown Date Notation
  • Rule 35: Provenance Statements MUST Have Dual Timestamps
  • Rule 6: WebObservation Claims MUST Have XPath Provenance