# Web Claims Deduplication Rule

## Rule Summary

**Do not state the same claim value multiple times unless there is strong variation in its value AND genuine uncertainty about its accuracy.**

## Rationale

Web claims extracted from institutional websites often contain duplicate or near-duplicate information:

- Multiple favicon variants (16x16, 32x32, apple-touch-icon, safari-pinned-tab)
- Same value extracted via different methods (page_title vs org_name from same `<title>` tag)
- Dynamic content that changes frequently (image counts, gallery elements)

Storing duplicates:

1. **Wastes storage** - Same information repeated 5x
2. **Creates maintenance burden** - Must update all instances when value changes
3. **Obscures authoritative data** - Hard to find the "canonical" value among duplicates
4. **Violates single-source-of-truth principle** - Which duplicate is correct?

## When to Keep Multiple Claims

Keep multiple claims for the same property type ONLY when:

1. **Genuine variation exists**: Different social media URLs for different regional accounts
2. **Uncertainty about accuracy**: Two conflicting values from different sources
3. **Temporal tracking**: Historical values vs current values (use `valid_from`/`valid_to`)
4. **Different semantic meaning**: Logo for header vs logo for footer (rare)

## When to Deduplicate

Remove duplicate claims when:

1. **Same value, same source**: `page_title` and `org_name` both = "Nationaal Archief" from same `<title>` tag
2. **Variant forms of same asset**: Multiple favicon sizes (keep primary `/favicon.ico`)
3. **Dynamic content**: Image counts, gallery element counts (changes frequently, low value)
4. **Computed/derived values**: Values trivially derivable from other claims

## Implementation Pattern

When consolidating web_claims into web_enrichment:

```yaml
web_enrichment:
  verified_claims:
    verification_date: '2025-01-14T00:00:00Z'
    verification_method: firecrawl_live_scrape
    claims:
      - claim_type: org_name
        claim_value: Institution Name
        # ... provenance fields
        verification_status: verified
      # Only ONE social_linkedin, ONE favicon, etc.
  removed_claims:
    removal_date: '2025-01-14T00:00:00Z'
    removal_reason: Duplicates or low-value dynamic content
    claims:
      - claim_type: page_title
        reason: Duplicate of org_name (same xpath, same value)
      - claim_type: favicon
        original_values:
          - /favicon-32x32.png
          - /favicon-16x16.png
        reason: Duplicate favicon variants, primary /favicon.ico retained
```

## Audit Trail

Always document removed claims in `removed_claims` section:

- `claim_type`: What was removed
- `reason`: Why it was removed
- `original_values`: (optional) What the duplicate values were

This enables:

- Audit trail for data governance
- Recovery if removal was incorrect
- Understanding of extraction pipeline behavior

## Examples

### Example 1: Favicon Deduplication

**Before (17 claims including 5 favicons):**

```yaml
claims:
  - claim_type: favicon
    claim_value: /favicon.ico
  - claim_type: favicon
    claim_value: /favicon-32x32.png
  - claim_type: favicon
    claim_value: /favicon-16x16.png
  - claim_type: favicon
    claim_value: /apple-touch-icon.png
  - claim_type: favicon
    claim_value: /safari-pinned-tab.svg
```

**After (10 verified claims, 1 favicon):**

```yaml
verified_claims:
  claims:
    - claim_type: favicon
      claim_value: /favicon.ico
      verification_status: verified
removed_claims:
  claims:
    - claim_type: favicon
      original_values: [/favicon-32x32.png, /favicon-16x16.png, /apple-touch-icon.png, /safari-pinned-tab.svg]
      reason: Duplicate favicon variants, primary /favicon.ico retained
```

### Example 2: Title Deduplication

**Before:**

```yaml
claims:
  - claim_type: org_name
    claim_value: Nationaal Archief
    xpath: /html/head/title
  - claim_type: page_title
    claim_value: Nationaal Archief
    xpath: /html/head/title  # Same xpath!
```

**After:**

```yaml
verified_claims:
  claims:
    - claim_type: org_name
      claim_value: Nationaal Archief
      verification_status: verified
removed_claims:
  claims:
    - claim_type: page_title
      reason: Duplicate of org_name (same xpath, same value)
```

## Related Rules

- **Rule 6**: WebObservation Claims MUST Have XPath Provenance
- **Rule 22**: Custodian YAML Files Are the Single Source of Truth
- **Rule 5**: NEVER Delete Enriched Data - Additive Only (removed claims are documented, not deleted silently)

## Version History

| Date | Change |
|------|--------|
| 2025-01-14 | Initial rule created based on Nationaal Archief web claims consolidation |
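
## Illustrative Sketch

The deduplication criteria and audit trail described in this rule could be applied programmatically along the following lines. This is a minimal sketch, not the project's actual pipeline: the function `deduplicate_claims` and the mapping `PRIMARY_VALUES` are hypothetical names introduced here, and the sketch assumes claims are plain dicts with the `claim_type`, `claim_value`, and optional `xpath` keys shown in the YAML examples. It covers only two cases from "When to Deduplicate" (same value from the same xpath, and variant forms of one asset).

```python
# Assumption: claim types whose variants collapse to one primary value.
PRIMARY_VALUES = {"favicon": "/favicon.ico"}


def deduplicate_claims(claims):
    """Split claims into (verified, removed) lists.

    Hypothetical helper; removed claims are documented, never dropped silently.
    """
    verified = []
    removed = []
    seen_sources = {}  # (xpath, claim_value) -> claim_type that was kept
    variants = {}      # claim_type -> all values seen for single-asset types

    for claim in claims:
        ctype = claim["claim_type"]
        value = claim["claim_value"]

        # Variant forms of the same asset are collected and resolved below.
        if ctype in PRIMARY_VALUES:
            variants.setdefault(ctype, []).append(value)
            continue

        # Same value extracted from the same xpath: keep the first claim only.
        xpath = claim.get("xpath")
        if xpath is not None:
            key = (xpath, value)
            if key in seen_sources:
                removed.append({
                    "claim_type": ctype,
                    "reason": f"Duplicate of {seen_sources[key]} (same xpath, same value)",
                })
                continue
            seen_sources[key] = ctype

        verified.append({**claim, "verification_status": "verified"})

    for ctype, values in variants.items():
        primary = PRIMARY_VALUES[ctype]
        # Fall back to the first value seen if the primary is absent.
        keep = primary if primary in values else values[0]
        verified.append({
            "claim_type": ctype,
            "claim_value": keep,
            "verification_status": "verified",
        })
        extras = [v for v in values if v != keep]
        if extras:
            removed.append({
                "claim_type": ctype,
                "original_values": extras,
                "reason": f"Duplicate {ctype} variants, primary {keep} retained",
            })

    return verified, removed


example_claims = [
    {"claim_type": "org_name", "claim_value": "Nationaal Archief", "xpath": "/html/head/title"},
    {"claim_type": "page_title", "claim_value": "Nationaal Archief", "xpath": "/html/head/title"},
    {"claim_type": "favicon", "claim_value": "/favicon.ico"},
    {"claim_type": "favicon", "claim_value": "/favicon-32x32.png"},
    {"claim_type": "favicon", "claim_value": "/favicon-16x16.png"},
]
verified, removed = deduplicate_claims(example_claims)
```

With the example input, `verified` holds one `org_name` claim and one `favicon` claim, while `removed` records the `page_title` duplicate and the two favicon variants together with their reasons. Claims that differ in value or come from different xpaths are left untouched, since those may reflect genuine variation or uncertainty and need a human decision.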