glam/.opencode/WEB_CLAIMS_DEDUPLICATION_RULE.md
2025-12-14 17:09:55 +01:00

Web Claims Deduplication Rule

Rule Summary

Do not record the same claim value more than once. Keep multiple claims for one property only when the values genuinely differ AND it is uncertain which value is accurate.

Rationale

Web claims extracted from institutional websites often contain duplicate or near-duplicate information:

  • Multiple favicon variants (16x16, 32x32, apple-touch-icon, safari-pinned-tab)
  • Same value extracted via different methods (page_title vs org_name from same <title> tag)
  • Dynamic content that changes frequently (image counts, gallery elements)

Storing duplicates:

  1. Wastes storage - Same information repeated 5x
  2. Creates maintenance burden - Must update all instances when value changes
  3. Obscures authoritative data - Hard to find the "canonical" value among duplicates
  4. Violates single-source-of-truth principle - Which duplicate is correct?

When to Keep Multiple Claims

Keep multiple claims for the same property type ONLY when:

  1. Genuine variation exists: Different social media URLs for different regional accounts
  2. Uncertainty about accuracy: Two conflicting values from different sources
  3. Temporal tracking: Historical values vs current values (use valid_from/valid_to)
  4. Different semantic meaning: Logo for header vs logo for footer (rare)
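
The temporal-tracking case (condition 3) can be sketched as a lookup over validity windows. This is a minimal illustration, not part of the rule itself: it assumes valid_from and valid_to are plain YYYY-MM-DD strings, that a missing bound means "open-ended", and that windows are half-open (valid_to is exclusive). The function name claims_valid_on is hypothetical.

```python
from datetime import date


def claims_valid_on(claims, on):
    """Return the claims whose validity window contains the date `on`.

    Assumption: valid_from / valid_to are ISO dates (YYYY-MM-DD);
    a missing bound is treated as unbounded on that side.
    """
    def parse(value, default):
        return date.fromisoformat(value) if value else default

    return [
        c for c in claims
        if parse(c.get("valid_from"), date.min) <= on < parse(c.get("valid_to"), date.max)
    ]


# Placeholder values, not real institutional data.
history = [
    {"claim_type": "org_name", "claim_value": "Former Name", "valid_to": "2010-01-01"},
    {"claim_type": "org_name", "claim_value": "Current Name", "valid_from": "2010-01-01"},
]
current = claims_valid_on(history, date(2024, 1, 1))
```

Both claims stay in the file because each is the correct value for its own period; the lookup only selects which one applies on a given day.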

When to Deduplicate

Remove duplicate claims when:

  1. Same value, same source: page_title and org_name both = "Nationaal Archief" from same <title> tag
  2. Variant forms of same asset: Multiple favicon sizes (keep primary /favicon.ico)
  3. Dynamic content: Image counts, gallery element counts (changes frequently, low value)
  4. Computed/derived values: Values trivially derivable from other claims

Implementation Pattern

When consolidating web_claims into web_enrichment:

web_enrichment:
  verified_claims:
    verification_date: '2025-01-14T00:00:00Z'
    verification_method: firecrawl_live_scrape
    claims:
    - claim_type: org_name
      claim_value: Institution Name
      # ... provenance fields
      verification_status: verified
    # Only ONE social_linkedin, ONE favicon, etc.
  
  removed_claims:
    removal_date: '2025-01-14T00:00:00Z'
    removal_reason: Duplicates or low-value dynamic content
    claims:
    - claim_type: page_title
      reason: Duplicate of org_name (same xpath, same value)
    - claim_type: favicon
      original_values:
      - /favicon-32x32.png
      - /favicon-16x16.png
      reason: Duplicate favicon variants, primary /favicon.ico retained
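
A consolidated file can be checked for leftover duplicates with a short script. The following is a sketch only, assuming the YAML above has already been loaded into a Python dict; the function name find_repeated_claim_types is hypothetical. A repeated type is a prompt for review, not automatically an error, since the rule permits multiple claims in some cases.

```python
from collections import Counter


def find_repeated_claim_types(web_enrichment):
    """Return the claim types that appear more than once in verified_claims."""
    claims = web_enrichment.get("verified_claims", {}).get("claims", [])
    counts = Counter(c["claim_type"] for c in claims)
    return sorted(t for t, n in counts.items() if n > 1)
```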

Audit Trail

Always document removed claims in the removed_claims section:

  • claim_type: What was removed
  • reason: Why it was removed
  • original_values: (optional) What the duplicate values were

This enables:

  • Audit trail for data governance
  • Recovery if removal was incorrect
  • Understanding of extraction pipeline behavior
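
To illustrate how the audit trail supports validation and recovery, here is a minimal sketch. It assumes removed_claims has been loaded as a dict shaped like the YAML in the Implementation Pattern; both function names are hypothetical. Note that restored claims carry only the type and value, because the audit entry does not store the original provenance fields.

```python
REQUIRED_FIELDS = ("claim_type", "reason")


def validate_removed_claims(removed_claims):
    """Return a list of human-readable problems; empty means the trail is complete."""
    problems = []
    for i, entry in enumerate(removed_claims.get("claims", [])):
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append(f"removed_claims.claims[{i}] is missing '{field}'")
    return problems


def restore_removed(entry):
    """Rebuild bare claims from an audit entry's original_values (if recorded)."""
    return [
        {"claim_type": entry["claim_type"], "claim_value": value}
        for value in entry.get("original_values", [])
    ]
```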

Examples

Example 1: Favicon Deduplication

Before (17 claims including 5 favicons; only the favicon claims are shown):

claims:
- claim_type: favicon
  claim_value: /favicon.ico
- claim_type: favicon
  claim_value: /favicon-32x32.png
- claim_type: favicon
  claim_value: /favicon-16x16.png
- claim_type: favicon
  claim_value: /apple-touch-icon.png
- claim_type: favicon
  claim_value: /safari-pinned-tab.svg

After (10 verified claims, 1 favicon; only the favicon entries are shown):

verified_claims:
  claims:
  - claim_type: favicon
    claim_value: /favicon.ico
    verification_status: verified

removed_claims:
  claims:
  - claim_type: favicon
    original_values: [/favicon-32x32.png, /favicon-16x16.png, /apple-touch-icon.png, /safari-pinned-tab.svg]
    reason: Duplicate favicon variants, primary /favicon.ico retained
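
The favicon consolidation above could be automated along these lines. This is a sketch under assumptions: claims are plain dicts, /favicon.ico is treated as the primary when present, and the first favicon found is retained otherwise. The function name deduplicate_favicons is hypothetical.

```python
def deduplicate_favicons(claims, primary="/favicon.ico"):
    """Keep one favicon claim and report the rest as a single audit entry."""
    favicons = [c for c in claims if c["claim_type"] == "favicon"]
    if len(favicons) <= 1:
        return list(claims), []

    # Prefer the primary path; fall back to the first favicon seen.
    keep = next((c for c in favicons if c["claim_value"] == primary), favicons[0])
    kept = [c for c in claims if c["claim_type"] != "favicon" or c is keep]
    dropped = [c["claim_value"] for c in favicons if c is not keep]
    removed = [{
        "claim_type": "favicon",
        "original_values": dropped,
        "reason": f"Duplicate favicon variants, primary {keep['claim_value']} retained",
    }]
    return kept, removed
```

Non-favicon claims pass through untouched and keep their original order.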

Example 2: Title Deduplication

Before:

claims:
- claim_type: org_name
  claim_value: Nationaal Archief
  xpath: /html/head/title
- claim_type: page_title
  claim_value: Nationaal Archief
  xpath: /html/head/title  # Same xpath!

After:

verified_claims:
  claims:
  - claim_type: org_name
    claim_value: Nationaal Archief
    verification_status: verified

removed_claims:
  claims:
  - claim_type: page_title
    reason: Duplicate of org_name (same xpath, same value)

Related Rules

  • Rule 6: WebObservation Claims MUST Have XPath Provenance
  • Rule 22: Custodian YAML Files Are the Single Source of Truth
  • Rule 5: NEVER Delete Enriched Data - Additive Only (removed claims are documented, not deleted silently)

Version History

Date        Change
2025-01-14  Initial rule created based on Nationaal Archief web claims consolidation