# Web Claims Deduplication Rule

## Rule Summary

**Do not state the same claim value multiple times unless there is strong variation in its value AND genuine uncertainty about its accuracy.**

## Rationale

Web claims extracted from institutional websites often contain duplicate or near-duplicate information:

- Multiple favicon variants (16x16, 32x32, apple-touch-icon, safari-pinned-tab)
- The same value extracted via different methods (`page_title` vs `org_name` from the same `<title>` tag)
- Dynamic content that changes frequently (image counts, gallery elements)

Storing duplicates:

1. **Wastes storage** - The same information is repeated 5x
2. **Creates maintenance burden** - All instances must be updated when the value changes
3. **Obscures authoritative data** - The "canonical" value is hard to find among duplicates
4. **Violates the single-source-of-truth principle** - It is unclear which duplicate is correct

## When to Keep Multiple Claims

Keep multiple claims for the same property type ONLY when:

1. **Genuine variation exists**: Different social media URLs for different regional accounts
2. **Uncertainty about accuracy**: Two conflicting values from different sources
3. **Temporal tracking**: Historical values vs current values (use `valid_from`/`valid_to`)
4. **Different semantic meaning**: Logo for header vs logo for footer (rare)

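One way to surface candidates for the first two conditions above is to list the claim types that carry more than one distinct value. The sketch below assumes claims are plain dictionaries with `claim_type` and `claim_value` keys, mirroring the YAML in this document; the function name `distinct_values_by_type` and the example URLs are hypothetical. It only flags candidates: deciding whether the variation is genuine remains a manual judgment.

```python
from collections import defaultdict


def distinct_values_by_type(claims):
    """Return {claim_type: [distinct values]} for types with 2+ distinct values."""
    values = defaultdict(list)
    for claim in claims:
        seen = values[claim["claim_type"]]
        if claim["claim_value"] not in seen:
            seen.append(claim["claim_value"])  # preserve first-seen order
    return {ctype: vals for ctype, vals in values.items() if len(vals) > 1}


# Hypothetical input: two regional LinkedIn accounts, one repeated org_name.
example_claims = [
    {"claim_type": "social_linkedin", "claim_value": "https://example.org/linkedin-nl"},
    {"claim_type": "social_linkedin", "claim_value": "https://example.org/linkedin-be"},
    {"claim_type": "org_name", "claim_value": "Nationaal Archief"},
    {"claim_type": "org_name", "claim_value": "Nationaal Archief"},
]
candidates = distinct_values_by_type(example_claims)
```

Here only `social_linkedin` is reported, because `org_name` has a single distinct value and is therefore a plain duplicate rather than a variation.
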
## When to Deduplicate

Remove duplicate claims when:

1. **Same value, same source**: `page_title` and `org_name` both = "Nationaal Archief" from the same `<title>` tag
2. **Variant forms of the same asset**: Multiple favicon sizes (keep the primary `/favicon.ico`)
3. **Dynamic content**: Image counts, gallery element counts (change frequently, low value)
4. **Computed/derived values**: Values trivially derivable from other claims

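The "same value, same source" case can be sketched as a single pass over the claims. This is a minimal illustration, not the pipeline's actual code: it assumes each claim is a dictionary with `claim_type`, `claim_value`, and `xpath` keys, treats a matching `(xpath, claim_value)` pair as "same source, same value", and keeps whichever claim appears first. The function name `dedupe_same_source` is hypothetical.

```python
def dedupe_same_source(claims):
    """Split claims into (kept, removed) using (xpath, claim_value) as the key.

    The first claim seen for a key is kept. Later claims with the same key
    become removal records. Claims without an xpath are always kept, since
    their source cannot be compared.
    """
    kept, removed = [], []
    first_seen = {}  # (xpath, claim_value) -> claim_type of the kept claim
    for claim in claims:
        xpath = claim.get("xpath")
        key = (xpath, claim.get("claim_value"))
        if xpath is not None and key in first_seen:
            removed.append({
                "claim_type": claim["claim_type"],
                "reason": f"Duplicate of {first_seen[key]} (same xpath, same value)",
            })
        else:
            first_seen.setdefault(key, claim["claim_type"])
            kept.append(claim)
    return kept, removed
```

Because the first claim wins, the input order decides which claim type survives; a real pipeline would more likely use an explicit priority list of claim types.
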
## Implementation Pattern

When consolidating `web_claims` into `web_enrichment`:

```yaml
web_enrichment:
  verified_claims:
    verification_date: '2025-01-14T00:00:00Z'
    verification_method: firecrawl_live_scrape
    claims:
      - claim_type: org_name
        claim_value: Institution Name
        # ... provenance fields
        verification_status: verified
      # Only ONE social_linkedin, ONE favicon, etc.

  removed_claims:
    removal_date: '2025-01-14T00:00:00Z'
    removal_reason: Duplicates or low-value dynamic content
    claims:
      - claim_type: page_title
        reason: Duplicate of org_name (same xpath, same value)
      - claim_type: favicon
        original_values:
          - /favicon-32x32.png
          - /favicon-16x16.png
        reason: Duplicate favicon variants, primary /favicon.ico retained
```

## Audit Trail

Always document removed claims in the `removed_claims` section:

- `claim_type`: What was removed
- `reason`: Why it was removed
- `original_values`: (optional) What the duplicate values were

This enables:

- Audit trail for data governance
- Recovery if removal was incorrect
- Understanding of extraction pipeline behavior

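A small check can confirm that every removal record carries the fields listed above. The sketch below assumes the `removed_claims` section has already been loaded into a dictionary with a `claims` list. The helper name `validate_removed_claims` is hypothetical, and the requirement that `original_values` be a list is an assumption drawn from the examples in this document, not a stated rule.

```python
REQUIRED_FIELDS = ("claim_type", "reason")


def validate_removed_claims(removed_claims):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for index, entry in enumerate(removed_claims.get("claims", [])):
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append(f"claims[{index}]: missing {field}")
        original_values = entry.get("original_values")
        # original_values is optional; when present, assume it must be a list.
        if original_values is not None and not isinstance(original_values, list):
            problems.append(f"claims[{index}]: original_values must be a list")
    return problems
```

Run as part of consolidation, a non-empty result would indicate that a removal was not fully documented.
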
## Examples

### Example 1: Favicon Deduplication

**Before (17 claims including 5 favicons):**

```yaml
claims:
  - claim_type: favicon
    claim_value: /favicon.ico
  - claim_type: favicon
    claim_value: /favicon-32x32.png
  - claim_type: favicon
    claim_value: /favicon-16x16.png
  - claim_type: favicon
    claim_value: /apple-touch-icon.png
  - claim_type: favicon
    claim_value: /safari-pinned-tab.svg
```

**After (10 verified claims, 1 favicon):**

```yaml
verified_claims:
  claims:
    - claim_type: favicon
      claim_value: /favicon.ico
      verification_status: verified

removed_claims:
  claims:
    - claim_type: favicon
      original_values: [/favicon-32x32.png, /favicon-16x16.png, /apple-touch-icon.png, /safari-pinned-tab.svg]
      reason: Duplicate favicon variants, primary /favicon.ico retained
```

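The favicon consolidation in Example 1 could be automated along these lines. This is a sketch under assumptions: claims are dictionaries with `claim_type` and `claim_value` keys, `/favicon.ico` is the primary value, and nothing is removed when the primary is absent, so that no favicon information is lost. The function name `consolidate_favicons` is hypothetical.

```python
def consolidate_favicons(claims, primary="/favicon.ico"):
    """Keep only the primary favicon claim and record the other variants.

    Returns (kept_claims, removed_records). If no favicon claim has the
    primary value, the claims are returned unchanged with no removals.
    """
    has_primary = any(
        c["claim_type"] == "favicon" and c["claim_value"] == primary
        for c in claims
    )
    if not has_primary:
        return list(claims), []

    kept, variants = [], []
    for claim in claims:
        if claim["claim_type"] == "favicon" and claim["claim_value"] != primary:
            variants.append(claim["claim_value"])
        else:
            kept.append(claim)

    removed = []
    if variants:
        removed.append({
            "claim_type": "favicon",
            "original_values": variants,
            "reason": f"Duplicate favicon variants, primary {primary} retained",
        })
    return kept, removed
```

The removal record has the same shape as the `removed_claims` entry in the "After" block, so it can be written to the audit trail as-is.
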
### Example 2: Title Deduplication

**Before:**

```yaml
claims:
  - claim_type: org_name
    claim_value: Nationaal Archief
    xpath: /html/head/title
  - claim_type: page_title
    claim_value: Nationaal Archief
    xpath: /html/head/title  # Same xpath!
```

**After:**

```yaml
verified_claims:
  claims:
    - claim_type: org_name
      claim_value: Nationaal Archief
      verification_status: verified

removed_claims:
  claims:
    - claim_type: page_title
      reason: Duplicate of org_name (same xpath, same value)
```

## Related Rules

- **Rule 6**: WebObservation Claims MUST Have XPath Provenance
- **Rule 22**: Custodian YAML Files Are the Single Source of Truth
- **Rule 5**: NEVER Delete Enriched Data - Additive Only (removed claims are documented, not deleted silently)

## Version History

| Date | Change |
|------|--------|
| 2025-01-14 | Initial rule created based on Nationaal Archief web claims consolidation |