glam/reports/entity_duplicate_analysis.md
2025-12-14 17:09:55 +01:00

# Entity Duplicate Analysis Report
**Generated**: 2025-12-13 17:34:36
## Data Quality Issues Summary
| Issue Type | Distinct Entities | Total Occurrences |
|------------|-------|-------------------|
| Language Codes | 31 | 1098 |
| Generic Labels | 7 | 87 |
| Numeric Only | 235 | 366 |
| Single Char | 48 | 476 |
| Type Mismatches | 475 | 2652 |
| Variant Spellings | 0 | 0 |
| **TOTAL** | **796** | **4679** |
## Language Code Entities (Should Be Filtered)
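The 20 most frequent of the 31 flagged values are shown. Most are HTML `lang` attribute values rather than real entities; two-letter uppercase values such as `NH`, `UB`, `VU`, and `MA` may instead be abbreviations and deserve a spot check before filtering: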
| Entity | Occurrences | Action |
|--------|-------------|--------|
| `nl-NL` | 527 | Filter out |
| `nl` | 370 | Filter out |
| `nl_NL` | 86 | Filter out |
| `EN` | 36 | Filter out |
| `en-GB` | 20 | Filter out |
| `en-US` | 12 | Filter out |
| `de` | 10 | Filter out |
| `en_US` | 6 | Filter out |
| `de-de` | 4 | Filter out |
| `NH` | 4 | Filter out |
| `UB` | 2 | Filter out |
| `fy` | 2 | Filter out |
| `fy-nl` | 1 | Filter out |
| `PI` | 1 | Filter out |
| `ym` | 1 | Filter out |
| `en_GB` | 1 | Filter out |
| `VU` | 1 | Filter out |
| `fr-FR` | 1 | Filter out |
| `MA` | 1 | Filter out |
| `US` | 1 | Filter out |
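
A minimal sketch of how such values could be filtered, assuming a purely shape-based rule: two letters, optionally followed by `-` or `_` and two more letters. The pattern and the function names are illustrative assumptions, not the pipeline's actual filter.

```python
import re

# Assumed shape of an HTML lang value: "nl", "nl-NL", "en_US", ...
# This checks the shape only; it does not validate against a language registry.
LANG_TAG_RE = re.compile(r"[A-Za-z]{2}(?:[-_][A-Za-z]{2})?")


def looks_like_lang_tag(surface: str) -> bool:
    """Return True if the surface form has the shape of a lang attribute value."""
    return LANG_TAG_RE.fullmatch(surface.strip()) is not None


def drop_lang_tags(surfaces):
    """Keep only the surface forms that do not look like lang attribute values."""
    return [s for s in surfaces if not looks_like_lang_tag(s)]
```

Because the rule looks only at shape, it also matches two-letter abbreviations such as `NH` or `UB`, so a check against the entity's HTML source location would be a safer criterion where that information is available.
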
## Generic Navigation Labels (Low Value)
These are website navigation labels, not heritage entities:
| Entity | Occurrences | Action |
|--------|-------------|--------|
| `Home` | 42 | Filter out |
| `collectie` | 22 | Filter out |
| `Archief` | 8 | Filter out |
| `Contact` | 5 | Filter out |
| `Nieuws` | 5 | Filter out |
| `AGENDA` | 4 | Filter out |
| `collection` | 1 | Filter out |
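
One way to apply this filter is a case-insensitive stoplist built from the seven labels in the table. This is a sketch under that assumption; `GENERIC_LABELS` and `is_generic_label` are hypothetical names.

```python
# Stoplist taken from the table above, stored case-folded so that
# "AGENDA", "Agenda" and "agenda" are treated alike.
GENERIC_LABELS = frozenset(
    {"home", "collectie", "archief", "contact", "nieuws", "agenda", "collection"}
)


def is_generic_label(surface: str) -> bool:
    """Return True if the surface form is a known navigation label."""
    return surface.strip().casefold() in GENERIC_LABELS


def drop_generic_labels(surfaces):
    """Keep only the surface forms that are not navigation labels."""
    return [s for s in surfaces if not is_generic_label(s)]
```

Exact matching keeps compound names such as `Gemeentearchief` intact, but it still drops every bare `Archief` mention, including any that refer to an actual organisational unit.
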
## Numeric-Only Entities (Often Dimensions)
These are often image dimensions or years without context. The 20 most frequent of the 235 numeric-only entities are shown:
| Entity | Occurrences | Action |
|--------|-------------|--------|
| `'2025'` | 26 | Review context |
| `'2560'` | 10 | Review context |
| `'1200'` | 10 | Review context |
| `'1920'` | 10 | Review context |
| `'2024'` | 8 | Review context |
| `'800'` | 6 | Review context |
| `'1280'` | 6 | Review context |
| `'670'` | 5 | Review context |
| `'700'` | 4 | Review context |
| `'2021'` | 4 | Review context |
| `'2023'` | 4 | Review context |
| `'20'` | 3 | Review context |
| `'1080'` | 3 | Review context |
| `'10'` | 3 | Review context |
| `'150'` | 3 | Review context |
| `'1995'` | 3 | Review context |
| `'630'` | 3 | Review context |
| `'100'` | 3 | Review context |
| `'2000'` | 3 | Review context |
| `'2048'` | 3 | Review context |
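
The "review context" step could start from a rough triage like the sketch below. The pixel-size list and the year range are assumptions chosen for illustration, and values that satisfy both (such as `1920`) are deliberately reported as ambiguous rather than guessed.

```python
# Assumed list of common image/screen dimensions, taken from values in the table.
COMMON_PIXEL_SIZES = frozenset({800, 1080, 1200, 1280, 1920, 2048, 2560})


def classify_numeric(surface: str, year_range=(1500, 2100)) -> str:
    """Give a rough label to a numeric-only surface form such as "'1920'"."""
    text = surface.strip().strip("'")
    if not (text.isascii() and text.isdigit()):
        return "not_numeric"
    value = int(text)
    is_year = year_range[0] <= value <= year_range[1]
    is_pixels = value in COMMON_PIXEL_SIZES
    if is_year and is_pixels:
        return "ambiguous"  # needs surrounding context to decide
    if is_year:
        return "year_candidate"
    if is_pixels:
        return "dimension_candidate"
    return "other_number"
```

Anything labelled `ambiguous` or `other_number` still needs a look at the surrounding markup, for example whether the value came from a `width` or `height` attribute.
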
## Type Mismatches (Need Resolution)
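The same surface form was classified with different types across occurrences. The 50 most frequent of the 475 affected entities are shown; several (for example `nl-NL`, `Home`, `'2025'`) are also listed above as filter or review candidates: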
| Entity | Types | Occurrences | Action |
|--------|-------|-------------|--------|
| nl-NL | `THG.LNG, TOP.CTY, TOP.LNG` | 527 | Resolve type |
| nl | `THG.LNG, TOP.CTY` | 370 | Resolve type |
| Nederland | `THG.LNG, TOP.CTY, TOP.NAT, TOP.REG, TOP.SET` | 87 | Resolve type |
| nl_NL | `THG.LNG, TOP.CTY` | 86 | Resolve type |
| Home | `APP.EXH, APP.TIT, APP.TTL, DOC.WEB, THG.CON, TOP.SET, WRK.TXT, WRK.WEB` | 42 | Resolve type |
| '2025' | `TMP.DAB, TMP.DRL` | 26 | Resolve type |
| Netherlands | `TOP.CTY, TOP.REG, TOP.SET` | 24 | Resolve type |
| collectie | `APP.COL, APP.EXH, THG.COL, THG.CON, WRK.COL, WRK.SER` | 22 | Resolve type |
| Groningen | `TOP.CTY, TOP.REG, TOP.SET` | 21 | Resolve type |
| Den Haag | `APP.PNM, TOP.SET` | 19 | Resolve type |
| Onder de paramariboom | `APP.TIT, WRK.TTL, WRK.TXT, WRK.VIS, WRK.WRK` | 18 | Resolve type |
| bibliotheek | `APP.COL, GRP.EDU, GRP.HER, THG.CON, TOP.BLD, WRK.COL, WRK.WEB` | 16 | Resolve type |
| Zwolle | `TOP.ADR, TOP.SET` | 16 | Resolve type |
| Beeldbank | `APP.COL, APP.EXH, WRK.COL, WRK.VIS, WRK.WEB` | 15 | Resolve type |
| Heel Nederland Leest | `APP.EXH, APP.TTL, GRP.HIS, THG.CON, THG.EVT, WRK.EXH, WRK.SER` | 15 | Resolve type |
| Rotterdam | `TOP.CTY, TOP.REG, TOP.SET` | 14 | Resolve type |
| Nederlandse | `THG.LNG, TOP.CTY` | 12 | Resolve type |
| Utrecht | `TOP.CTY, TOP.REG, TOP.SET` | 12 | Resolve type |
| Facebook | `APP.NAM, APP.SOC, GRP.COR, GRP.GOV, THG.CON, WRK.SOC, WRK.WEB` | 11 | Resolve type |
| archieven | `APP.COL, THG.CON, WRK.ARC, WRK.COL` | 11 | Resolve type |
| Gemeentearchief | `APP.TIT, APP.TTL, GRP.HER, WRK.WEB` | 11 | Resolve type |
| Overijssel | `GRP.GOV, TOP.REG` | 11 | Resolve type |
| Friesland | `TOP.CTY, TOP.REG` | 10 | Resolve type |
| '1200' | `QTY.CNT, QTY.MSR` | 10 | Resolve type |
| Twente | `TOP.REG, TOP.SET` | 10 | Resolve type |
| Leiden | `TOP.REG, TOP.SET` | 10 | Resolve type |
| '1920' | `QTY.MSR, TMP.CEN, TMP.DAB` | 10 | Resolve type |
| de | `THG.LNG, TOP.CTY` | 10 | Resolve type |
| Noord-Holland | `TOP.CTY, TOP.REG` | 9 | Resolve type |
| Tweede Wereldoorlog | `THG.CON, THG.EVT, TMP.ERA` | 9 | Resolve type |
| Limburg | `TOP.CTY, TOP.REG, TOP.SET` | 8 | Resolve type |
| Zuid-Holland | `GRP.GOV, TOP.REG` | 8 | Resolve type |
| Brabant | `TOP.CTY, TOP.REG` | 8 | Resolve type |
| Archief | `APP.COL, APP.TIT, GRP.UNT, THG.CON, WRK.ARC, WRK.COL, WRK.WEB` | 8 | Resolve type |
| Google | `GRP.COR, GRP.GOV` | 7 | Resolve type |
| Ministerie van Algemene Zaken | `GRP.GOV, GRP.PAR` | 7 | Resolve type |
| foto's | `THG.CON, THG.DOC, THG.PHO` | 7 | Resolve type |
| Kasteel Doorwerth | `APP.TTL, GRP.HER, THG.AFT, TOP.BLD` | 7 | Resolve type |
| Boeken | `THG.CON, THG.DOC, WRK.COL, WRK.PUB, WRK.TXT` | 7 | Resolve type |
| Delft | `TOP.REG, TOP.SET` | 7 | Resolve type |
| Gelderland | `GRP.GOV, TOP.REG, TOP.SET` | 7 | Resolve type |
| Instagram | `APP.NAM, APP.SOC, THG.CON, WRK.SOC, WRK.WEB` | 6 | Resolve type |
| '800' | `QTY.CNT, QTY.MSR` | 6 | Resolve type |
| Haarlem | `TOP.REG, TOP.SET` | 6 | Resolve type |
| museum | `GRP.HER, ROL.OCC, THG.CON, THG.EVT` | 6 | Resolve type |
| Tijdschriften | `APP.TTL, THG.CON, THG.DOC, WRK.PUB, WRK.SER` | 6 | Resolve type |
| Archeologie | `GRP.HER, GRP.UNT, THG.CON` | 6 | Resolve type |
| Provincie Overijssel | `APP.TIT, GRP.GOV, TOP.REG` | 6 | Resolve type |
| '1280' | `QTY.CNT, QTY.MSR` | 6 | Resolve type |
| november 2025 | `TMP.DAB, TMP.DRL, TMP.DUR, TMP.EXP, TMP.RNG` | 6 | Resolve type |
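
A sketch of how mismatches like these can be detected and, where one type clearly dominates, resolved by majority vote. The report does not give per-type counts, so the input format (`(surface, type_code)` pairs) and any counts fed into it are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def find_type_mismatches(mentions):
    """Group (surface, type_code) pairs and return surfaces with more than one type.

    The result maps each conflicting surface form to a Counter of its types.
    """
    types_by_surface = defaultdict(Counter)
    for surface, type_code in mentions:
        types_by_surface[surface][type_code] += 1
    return {s: c for s, c in types_by_surface.items() if len(c) > 1}


def majority_type(type_counts):
    """Return the most frequent type, or None when the top two types are tied."""
    ranked = type_counts.most_common(2)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear winner; leave for manual review
    return ranked[0][0]
```

Majority voting suits cases like `Zwolle` (`TOP.ADR` vs `TOP.SET`), where one reading is probably noise. It is less suitable for entries like `Overijssel` (`GRP.GOV` vs `TOP.REG`), where both types can be legitimate readings of the same name.
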
## Cleanup Impact Analysis
| Metric | Value |
|--------|-------|
| Total entity occurrences | 13,268 |
| Candidate cleanup occurrences | 1,551 |
| Cleanup percentage | 11.7% |
| Remaining after cleanup | 11,717 |
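
The candidate figure equals the sum of the Language Codes, Generic Labels, and Numeric Only occurrences from the summary table, which suggests that single-character entities and type mismatches are not counted as cleanup candidates here. The arithmetic can be checked as follows (variable names are illustrative):

```python
# Occurrence counts copied from the Data Quality Issues Summary table.
cleanup_occurrences = {
    "Language Codes": 1098,
    "Generic Labels": 87,
    "Numeric Only": 366,
}
total_occurrences = 13268

candidate = sum(cleanup_occurrences.values())
remaining = total_occurrences - candidate
percentage = round(100 * candidate / total_occurrences, 1)

print(candidate, remaining, percentage)
```
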
## High-Value Entities for Linking
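Entities that appear frequently and are good candidates for Wikidata/VIAF linking. Many of them (for example `Nederland`, `Groningen`, `bibliotheek`, `Facebook`) also appear in the type-mismatch table, so the single type shown here should be confirmed before linking: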
| Entity | Type | Occurrences | Wikidata? |
|--------|------|-------------|-----------|
| Nederland | `TOP.CTY` | 87 | 🔍 Search |
| Amsterdam | `TOP.SET` | 42 | 🔍 Search |
| Netherlands | `TOP.CTY` | 24 | 🔍 Search |
| Groningen | `TOP.SET` | 21 | 🔍 Search |
| Den Haag | `TOP.SET` | 19 | 🔍 Search |
| bibliotheek | `GRP.HER` | 16 | 🔍 Search |
| Zwolle | `TOP.SET` | 16 | 🔍 Search |
| Rotterdam | `TOP.SET` | 14 | 🔍 Search |
| Johan Fretz | `AGT.PER` | 13 | 🔍 Search |
| Nederlandse | `TOP.CTY` | 12 | 🔍 Search |
| Utrecht | `TOP.SET` | 12 | 🔍 Search |
| Facebook | `WRK.SOC` | 11 | 🔍 Search |
| Roermond | `TOP.SET` | 11 | 🔍 Search |
| Gemeentearchief | `GRP.HER` | 11 | 🔍 Search |
| Overijssel | `TOP.REG` | 11 | 🔍 Search |
| Deventer | `TOP.SET` | 11 | 🔍 Search |
| Friesland | `TOP.REG` | 10 | 🔍 Search |
| Twente | `TOP.REG` | 10 | 🔍 Search |
| Leiden | `TOP.SET` | 10 | 🔍 Search |
| Noord-Holland | `TOP.REG` | 9 | 🔍 Search |
| Geldersch Landschap en Kasteelen | `GRP.HER` | 9 | 🔍 Search |
| Nijmegen | `TOP.SET` | 8 | 🔍 Search |
| Limburg | `TOP.REG` | 8 | 🔍 Search |
| Enschede | `TOP.SET` | 8 | 🔍 Search |
| Zuid-Holland | `TOP.REG` | 8 | 🔍 Search |
| Brabant | `TOP.REG` | 8 | 🔍 Search |
| Google | `GRP.COR` | 7 | 🔍 Search |
| Heerlen | `TOP.SET` | 7 | 🔍 Search |
| Maastricht | `TOP.SET` | 7 | 🔍 Search |
| Ministerie van Algemene Zaken | `GRP.GOV` | 7 | 🔍 Search |
| Kasteel Doorwerth | `GRP.HER` | 7 | 🔍 Search |
| Delft | `TOP.SET` | 7 | 🔍 Search |
| Gelderland | `TOP.REG` | 7 | 🔍 Search |
| Schiedam | `TOP.SET` | 6 | 🔍 Search |
| Tilburg | `TOP.SET` | 6 | 🔍 Search |
| Amelander Musea | `GRP.HER` | 6 | 🔍 Search |
| Alkmaar | `TOP.SET` | 6 | 🔍 Search |
| Franeker | `TOP.SET` | 6 | 🔍 Search |
| Ommen | `TOP.SET` | 6 | 🔍 Search |
| Zutphen | `TOP.SET` | 6 | 🔍 Search |
| Haarlem | `TOP.SET` | 6 | 🔍 Search |
| museum | `GRP.HER` | 6 | 🔍 Search |
| Maassluis | `TOP.SET` | 6 | 🔍 Search |
| Dordrecht | `TOP.SET` | 6 | 🔍 Search |
| Arnhem | `TOP.SET` | 6 | 🔍 Search |
| Government of the Netherlands | `GRP.GOV` | 6 | 🔍 Search |
| Leeuwarden | `TOP.SET` | 6 | 🔍 Search |
| Apeldoorn | `TOP.SET` | 6 | 🔍 Search |
| Archeologie | `THG.CON` | 6 | 🔍 Search |
| Provincie Overijssel | `GRP.GOV` | 6 | 🔍 Search |
| Rijksmuseum | `GRP.HER` | 6 | 🔍 Search |
| Zeeland | `TOP.SET` | 6 | 🔍 Search |
| Kampen | `TOP.SET` | 5 | 🔍 Search |
| Breda | `TOP.SET` | 5 | 🔍 Search |
| Hengelo | `TOP.SET` | 5 | 🔍 Search |
| Midden-Groningen | `TOP.REG` | 5 | 🔍 Search |
| The Netherlands | `TOP.CTY` | 5 | 🔍 Search |
| Erfgoed 's-Hertogenbosch | `GRP.HER` | 5 | 🔍 Search |
| Amersfoort | `TOP.SET` | 5 | 🔍 Search |
| Eindhoven | `TOP.SET` | 5 | 🔍 Search |
| Hoorn | `TOP.SET` | 5 | 🔍 Search |
| Dalfsen | `TOP.SET` | 5 | 🔍 Search |
| Drenthe | `TOP.REG` | 5 | 🔍 Search |
| Oirschot | `TOP.SET` | 5 | 🔍 Search |
| Bollenstreek | `TOP.REG` | 5 | 🔍 Search |
| Medemblik | `TOP.SET` | 5 | 🔍 Search |
| Wageningen | `TOP.SET` | 5 | 🔍 Search |
| Rijksuniversiteit Groningen | `GRP.EDU` | 5 | 🔍 Search |
| Texel | `TOP.SET` | 5 | 🔍 Search |
| Collectie Overijssel | `GRP.HER` | 5 | 🔍 Search |
| Flevoland | `TOP.REG` | 5 | 🔍 Search |
| Venlo | `TOP.SET` | 5 | 🔍 Search |
| Twitter | `APP.SOC` | 4 | 🔍 Search |
| Gemeente Waadhoeke | `GRP.GOV` | 4 | 🔍 Search |
| St.-Annaparochie | `TOP.SET` | 4 | 🔍 Search |
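
For the pending lookups, a minimal sketch using the public Wikidata `wbsearchentities` API action is shown below. The function names, the default language `nl`, the result limit, and the User-Agent string are assumptions; the top hit still needs manual confirmation before it is stored as a link.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(label: str, language: str = "nl", limit: int = 5) -> str:
    """Build a wbsearchentities request URL for a surface form."""
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def search_wikidata(label: str, language: str = "nl", limit: int = 5):
    """Return (id, label, description) tuples for candidate Wikidata items.

    Performs a network request; the User-Agent value is a placeholder.
    """
    request = Request(
        build_search_url(label, language, limit),
        headers={"User-Agent": "entity-linking-sketch/0.1 (example)"},
    )
    with urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in payload.get("search", [])
    ]
```

A call such as `search_wikidata("Kasteel Doorwerth")` returns a short candidate list. Ambiguous names such as `Groningen` (city or province) are better disambiguated using the resolved entity type than by taking the first result.
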