# Web Claims Deduplication Rule
## Rule Summary
**Do not state the same claim value multiple times unless there is strong variation in its value AND genuine uncertainty about its accuracy.**
## Rationale
Web claims extracted from institutional websites often contain duplicate or near-duplicate information:
- Multiple favicon variants (16x16, 32x32, apple-touch-icon, safari-pinned-tab)
- Same value extracted via different methods (`page_title` vs `org_name` from the same `<title>` tag)
- Dynamic content that changes frequently (image counts, gallery elements)
Storing duplicates:
1. **Wastes storage** - Same information repeated 5x
2. **Creates maintenance burden** - Must update all instances when value changes
3. **Obscures authoritative data** - Hard to find the "canonical" value among duplicates
4. **Violates single-source-of-truth principle** - Which duplicate is correct?
## When to Keep Multiple Claims
Keep multiple claims for the same property type ONLY when:
1. **Genuine variation exists**: Different social media URLs for different regional accounts
2. **Uncertainty about accuracy**: Two conflicting values from different sources
3. **Temporal tracking**: Historical values vs current values (use `valid_from`/`valid_to`)
4. **Different semantic meaning**: Logo for header vs logo for footer (rare)
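The mechanical part of these criteria can be sketched in Python. This is a minimal illustration, not the pipeline's actual code: it assumes claims are plain dicts with `claim_value` and optional `valid_from`/`valid_to` keys, and the helper name `distinct_variants` and the sample LinkedIn URLs are hypothetical. Whether a remaining difference is *genuine* variation still needs human review.

```python
def distinct_variants(claims):
    """Return one claim per distinct (value, valid_from, valid_to) triple.

    Claims that differ in value or validity window are kept side by side;
    exact repeats are collapsed to the first occurrence.
    """
    seen = set()
    variants = []
    for claim in claims:
        key = (claim["claim_value"], claim.get("valid_from"), claim.get("valid_to"))
        if key not in seen:
            seen.add(key)
            variants.append(claim)
    return variants


# Hypothetical regional accounts: two different URLs, one exact repeat.
linkedin_claims = [
    {"claim_type": "social_linkedin", "claim_value": "https://example.org/linkedin-nl"},
    {"claim_type": "social_linkedin", "claim_value": "https://example.org/linkedin-be"},
    {"claim_type": "social_linkedin", "claim_value": "https://example.org/linkedin-nl"},
]
print(len(distinct_variants(linkedin_claims)))
```

If more than one variant remains for a property type, that type is a candidate for keeping multiple claims under the criteria above.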
## When to Deduplicate
Remove duplicate claims when:
1. **Same value, same source**: `page_title` and `org_name` both = "Nationaal Archief" from the same `<title>` tag
2. **Variant forms of same asset**: Multiple favicon sizes (keep primary `/favicon.ico`)
3. **Dynamic content**: Image counts, gallery element counts (changes frequently, low value)
4. **Computed/derived values**: Values trivially derivable from other claims
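The first three cases lend themselves to a small consolidation pass. The sketch below is one possible way to apply them, under several assumptions: claims are dicts with `claim_type`, `claim_value`, and an optional `xpath`; the first claim seen for a given (xpath, value) pair is the one retained; and the dynamic claim type names in `DYNAMIC_TYPES` are hypothetical placeholders. Derived values (case 4) are not handled here because detecting them depends on the specific claims involved.

```python
# Hypothetical claim type names for "image counts" and "gallery element counts".
DYNAMIC_TYPES = {"image_count", "gallery_element_count"}
PRIMARY_FAVICON = "/favicon.ico"


def consolidate(claims):
    """Split claims into (kept, removed), recording a reason for each removal."""
    kept, removed = [], []
    seen = {}  # (xpath, value) -> claim_type of the claim that was kept

    for claim in claims:
        ctype, value = claim["claim_type"], claim["claim_value"]
        if ctype == "favicon":
            continue  # favicon variants are handled together below
        if ctype in DYNAMIC_TYPES:
            removed.append({"claim_type": ctype,
                            "reason": "Dynamic content (changes frequently, low value)"})
            continue
        xpath = claim.get("xpath")
        if xpath is not None and (xpath, value) in seen:
            removed.append({"claim_type": ctype,
                            "reason": f"Duplicate of {seen[(xpath, value)]} (same xpath, same value)"})
            continue
        if xpath is not None:
            seen[(xpath, value)] = ctype
        kept.append(claim)

    favicons = [c for c in claims if c["claim_type"] == "favicon"]
    if favicons:
        values = [c["claim_value"] for c in favicons]
        # Prefer /favicon.ico; fall back to the first variant if it is absent.
        primary = PRIMARY_FAVICON if PRIMARY_FAVICON in values else values[0]
        kept.append(next(c for c in favicons if c["claim_value"] == primary))
        variants = [v for v in values if v != primary]
        if variants:
            removed.append({"claim_type": "favicon",
                            "original_values": variants,
                            "reason": f"Duplicate favicon variants, primary {primary} retained"})
    return kept, removed
```

Returning the removed entries alongside the kept ones, rather than discarding them, keeps the pass compatible with the requirement that removals are documented.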
## Implementation Pattern
When consolidating web_claims into web_enrichment:
```yaml
web_enrichment:
  verified_claims:
    verification_date: '2025-01-14T00:00:00Z'
    verification_method: firecrawl_live_scrape
    claims:
      - claim_type: org_name
        claim_value: Institution Name
        # ... provenance fields
        verification_status: verified
      # Only ONE social_linkedin, ONE favicon, etc.
  removed_claims:
    removal_date: '2025-01-14T00:00:00Z'
    removal_reason: Duplicates or low-value dynamic content
    claims:
      - claim_type: page_title
        reason: Duplicate of org_name (same xpath, same value)
      - claim_type: favicon
        original_values:
          - /favicon-32x32.png
          - /favicon-16x16.png
        reason: Duplicate favicon variants, primary /favicon.ico retained
```
## Audit Trail
Always document removed claims in the `removed_claims` section, with these fields per entry:
- `claim_type`: What was removed
- `reason`: Why it was removed
- `original_values`: (optional) What the duplicate values were
This enables:
- Audit trail for data governance
- Recovery if removal was incorrect
- Understanding of extraction pipeline behavior
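A lightweight check can catch removals that were recorded without the required fields. The following is a sketch only: it assumes the `removed_claims` section has already been loaded into a dict with a `claims` list, and the function name `audit_problems` is hypothetical.

```python
REQUIRED_FIELDS = ("claim_type", "reason")  # original_values is optional


def audit_problems(removed_claims):
    """Return a list of human-readable problems found in a removed_claims section."""
    problems = []
    for i, entry in enumerate(removed_claims.get("claims", [])):
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append(f"removed_claims.claims[{i}]: missing '{field}'")
    return problems


section = {
    "claims": [
        {"claim_type": "page_title",
         "reason": "Duplicate of org_name (same xpath, same value)"},
        {"claim_type": "favicon",
         "original_values": ["/favicon-32x32.png", "/favicon-16x16.png"]},
    ]
}
for problem in audit_problems(section):
    print(problem)
```

Here the second entry would be flagged because it lists `original_values` but gives no `reason`.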
## Examples
### Example 1: Favicon Deduplication
**Before (17 claims including 5 favicons):**
```yaml
claims:
  - claim_type: favicon
    claim_value: /favicon.ico
  - claim_type: favicon
    claim_value: /favicon-32x32.png
  - claim_type: favicon
    claim_value: /favicon-16x16.png
  - claim_type: favicon
    claim_value: /apple-touch-icon.png
  - claim_type: favicon
    claim_value: /safari-pinned-tab.svg
```
**After (10 verified claims, 1 favicon):**
```yaml
verified_claims:
  claims:
    - claim_type: favicon
      claim_value: /favicon.ico
      verification_status: verified
removed_claims:
  claims:
    - claim_type: favicon
      original_values: [/favicon-32x32.png, /favicon-16x16.png, /apple-touch-icon.png, /safari-pinned-tab.svg]
      reason: Duplicate favicon variants, primary /favicon.ico retained
```
### Example 2: Title Deduplication
**Before:**
```yaml
claims:
  - claim_type: org_name
    claim_value: Nationaal Archief
    xpath: /html/head/title
  - claim_type: page_title
    claim_value: Nationaal Archief
    xpath: /html/head/title  # Same xpath!
```
**After:**
```yaml
verified_claims:
  claims:
    - claim_type: org_name
      claim_value: Nationaal Archief
      verification_status: verified
removed_claims:
  claims:
    - claim_type: page_title
      reason: Duplicate of org_name (same xpath, same value)
```
## Related Rules
- **Rule 6**: WebObservation Claims MUST Have XPath Provenance
- **Rule 22**: Custodian YAML Files Are the Single Source of Truth
- **Rule 5**: NEVER Delete Enriched Data - Additive Only (removed claims are documented, not deleted silently)
## Version History
| Date | Change |
|------|--------|
| 2025-01-14 | Initial rule created based on Nationaal Archief web claims consolidation |