Web Claims Deduplication Rule
Rule Summary
Do not state the same claim value multiple times unless there is strong variation in its value AND genuine uncertainty about its accuracy.
Rationale
Web claims extracted from institutional websites often contain duplicate or near-duplicate information:
- Multiple favicon variants (16x16, 32x32, apple-touch-icon, safari-pinned-tab)
- Same value extracted via different methods (page_title vs org_name from the same <title> tag)
- Dynamic content that changes frequently (image counts, gallery elements)
Storing duplicates:
- Wastes storage - Same information repeated 5x
- Creates maintenance burden - Must update all instances when value changes
- Obscures authoritative data - Hard to find the "canonical" value among duplicates
- Violates single-source-of-truth principle - Which duplicate is correct?
When to Keep Multiple Claims
Keep multiple claims for the same property type ONLY when:
- Genuine variation exists: Different social media URLs for different regional accounts
- Uncertainty about accuracy: Two conflicting values from different sources
- Temporal tracking: Historical values vs current values (use valid_from/valid_to)
- Different semantic meaning: Logo for header vs logo for footer (rare)
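The temporal-tracking case can be checked mechanically. The following is a minimal sketch, assuming valid_from and valid_to hold plain ISO dates (YYYY-MM-DD) and that a missing bound means "open-ended"; the helper names and the sample values are hypothetical and not part of the rule.

```python
from datetime import date


def _window(claim):
    """Return the (start, end) validity window; missing bounds are open-ended."""
    start = date.fromisoformat(claim["valid_from"]) if claim.get("valid_from") else date.min
    end = date.fromisoformat(claim["valid_to"]) if claim.get("valid_to") else date.max
    return start, end


def is_temporal_pair(a, b):
    """True when two claims of the same type cover non-overlapping periods."""
    if a["claim_type"] != b["claim_type"]:
        return False
    a_start, a_end = _window(a)
    b_start, b_end = _window(b)
    return a_end < b_start or b_end < a_start


# Hypothetical sample data: a historical name and its successor.
old = {"claim_type": "org_name", "claim_value": "Old Name",
       "valid_from": "2001-01-01", "valid_to": "2012-12-31"}
new = {"claim_type": "org_name", "claim_value": "New Name",
       "valid_from": "2013-01-01"}
undated = {"claim_type": "org_name", "claim_value": "Other Name"}

keep_both = is_temporal_pair(old, new)
needs_review = not is_temporal_pair(new, undated)
```

Two claims without any validity window overlap by definition here, so they would not qualify as temporal tracking and would need one of the other justifications.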
When to Deduplicate
Remove duplicate claims when:
- Same value, same source: page_title and org_name both = "Nationaal Archief" from the same <title> tag
- Variant forms of same asset: Multiple favicon sizes (keep primary /favicon.ico)
- Dynamic content: Image counts, gallery element counts (changes frequently, low value)
- Computed/derived values: Values trivially derivable from other claims
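The "same value, same source" case is the easiest one to automate. The sketch below is one possible approach, not the pipeline's actual code: it assumes claims are dicts with claim_type, claim_value and xpath keys, and the function name and the `preferred` parameter are hypothetical.

```python
def dedupe_same_source(claims, preferred=("org_name",)):
    """Collapse claims that share both claim_value and xpath.

    Returns (kept, removed). Claims without an xpath are left untouched
    because there is no provenance to compare.
    """
    groups = {}
    kept, removed = [], []
    for claim in claims:
        xpath = claim.get("xpath")
        if xpath is None:
            kept.append(claim)
            continue
        groups.setdefault((claim["claim_value"], xpath), []).append(claim)

    for group in groups.values():
        # Prefer a claim type listed in `preferred`; otherwise keep the first seen.
        winner = next((c for c in group if c["claim_type"] in preferred), group[0])
        kept.append(winner)
        for c in group:
            if c is not winner:
                removed.append({
                    "claim_type": c["claim_type"],
                    "reason": f"Duplicate of {winner['claim_type']} (same xpath, same value)",
                })
    return kept, removed


claims = [
    {"claim_type": "org_name", "claim_value": "Nationaal Archief", "xpath": "/html/head/title"},
    {"claim_type": "page_title", "claim_value": "Nationaal Archief", "xpath": "/html/head/title"},
]
kept, removed = dedupe_same_source(claims)
```

The `removed` list has the shape expected for entries under removed_claims, so nothing is dropped without a record.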
Implementation Pattern
When consolidating web_claims into web_enrichment:
web_enrichment:
  verified_claims:
    verification_date: '2025-01-14T00:00:00Z'
    verification_method: firecrawl_live_scrape
    claims:
      - claim_type: org_name
        claim_value: Institution Name
        # ... provenance fields
        verification_status: verified
      # Only ONE social_linkedin, ONE favicon, etc.
  removed_claims:
    removal_date: '2025-01-14T00:00:00Z'
    removal_reason: Duplicates or low-value dynamic content
    claims:
      - claim_type: page_title
        reason: Duplicate of org_name (same xpath, same value)
      - claim_type: favicon
        original_values:
          - /favicon-32x32.png
          - /favicon-16x16.png
        reason: Duplicate favicon variants, primary /favicon.ico retained
Audit Trail
Always document removed claims in the removed_claims section:
- claim_type: What was removed
- reason: Why it was removed
- original_values: (optional) What the duplicate values were
This enables:
- Audit trail for data governance
- Recovery if removal was incorrect
- Understanding of extraction pipeline behavior
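A small check can catch removed_claims entries that lack the required fields before a file is committed. This is a minimal sketch, assuming the removed_claims section has already been loaded into a dict; the function name is hypothetical.

```python
def validate_removed_claims(removed_claims):
    """Return a list of problems; an empty list means every entry is documented."""
    problems = []
    for i, entry in enumerate(removed_claims.get("claims", [])):
        # claim_type and reason are required; original_values is optional.
        for field in ("claim_type", "reason"):
            if not entry.get(field):
                problems.append(f"claims[{i}]: missing {field}")
    return problems


good = {"claims": [{"claim_type": "page_title",
                    "reason": "Duplicate of org_name (same xpath, same value)"}]}
bad = {"claims": [{"claim_type": "favicon",
                   "original_values": ["/favicon-16x16.png"]}]}

good_problems = validate_removed_claims(good)
bad_problems = validate_removed_claims(bad)
```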
Examples
Example 1: Favicon Deduplication
Before (17 claims including 5 favicons):
claims:
  - claim_type: favicon
    claim_value: /favicon.ico
  - claim_type: favicon
    claim_value: /favicon-32x32.png
  - claim_type: favicon
    claim_value: /favicon-16x16.png
  - claim_type: favicon
    claim_value: /apple-touch-icon.png
  - claim_type: favicon
    claim_value: /safari-pinned-tab.svg
After (10 verified claims, 1 favicon):
verified_claims:
  claims:
    - claim_type: favicon
      claim_value: /favicon.ico
      verification_status: verified
removed_claims:
  claims:
    - claim_type: favicon
      original_values: [/favicon-32x32.png, /favicon-16x16.png, /apple-touch-icon.png, /safari-pinned-tab.svg]
      reason: Duplicate favicon variants, primary /favicon.ico retained
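The favicon consolidation above could be scripted roughly as follows. This is a sketch under stated assumptions: claims are dicts with claim_type and claim_value keys, /favicon.ico is treated as the primary when present, and the function name is hypothetical.

```python
def collapse_favicons(claims, primary="/favicon.ico"):
    """Keep one favicon claim and describe the dropped variants.

    Returns (kept, removed); `removed` holds at most one summary entry.
    """
    favicons = [c for c in claims if c["claim_type"] == "favicon"]
    if len(favicons) <= 1:
        return list(claims), []

    # Keep the primary if it was extracted; otherwise fall back to the first favicon.
    winner = next((c for c in favicons if c["claim_value"] == primary), favicons[0])
    kept = [c for c in claims if c["claim_type"] != "favicon" or c is winner]
    removed = [{
        "claim_type": "favicon",
        "original_values": [c["claim_value"] for c in favicons if c is not winner],
        "reason": f"Duplicate favicon variants, primary {winner['claim_value']} retained",
    }]
    return kept, removed


before = [
    {"claim_type": "favicon", "claim_value": "/favicon.ico"},
    {"claim_type": "favicon", "claim_value": "/favicon-32x32.png"},
    {"claim_type": "favicon", "claim_value": "/favicon-16x16.png"},
    {"claim_type": "favicon", "claim_value": "/apple-touch-icon.png"},
    {"claim_type": "favicon", "claim_value": "/safari-pinned-tab.svg"},
]
kept, removed = collapse_favicons(before)
```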
Example 2: Title Deduplication
Before:
claims:
  - claim_type: org_name
    claim_value: Nationaal Archief
    xpath: /html/head/title
  - claim_type: page_title
    claim_value: Nationaal Archief
    xpath: /html/head/title  # Same xpath!
After:
verified_claims:
  claims:
    - claim_type: org_name
      claim_value: Nationaal Archief
      verification_status: verified
removed_claims:
  claims:
    - claim_type: page_title
      reason: Duplicate of org_name (same xpath, same value)
Related Rules
- Rule 6: WebObservation Claims MUST Have XPath Provenance
- Rule 22: Custodian YAML Files Are the Single Source of Truth
- Rule 5: NEVER Delete Enriched Data - Additive Only (removed claims are documented, not deleted silently)
Version History
| Date | Change |
|---|---|
| 2025-01-14 | Initial rule created based on Nationaal Archief web claims consolidation |