# Entity Duplicate Analysis Report

**Generated**: 2025-12-13 17:34:36

## Data Quality Issues Summary

| Issue Type | Count | Total Occurrences |
|------------|-------|-------------------|
| Language Codes | 31 | 1098 |
| Generic Labels | 7 | 87 |
| Numeric Only | 235 | 366 |
| Single Char | 48 | 476 |
| Type Mismatches | 475 | 2652 |
| Variant Spellings | 0 | 0 |
| **TOTAL** | **796** | **4679** |

## Language Code Entities (Should Be Filtered)

These are HTML `lang` attribute values, not real entities:

| Entity | Occurrences | Action |
|--------|-------------|--------|
| `nl-NL` | 527 | Filter out |
| `nl` | 370 | Filter out |
| `nl_NL` | 86 | Filter out |
| `EN` | 36 | Filter out |
| `en-GB` | 20 | Filter out |
| `en-US` | 12 | Filter out |
| `de` | 10 | Filter out |
| `en_US` | 6 | Filter out |
| `de-de` | 4 | Filter out |
| `NH` | 4 | Filter out |
| `UB` | 2 | Filter out |
| `fy` | 2 | Filter out |
| `fy-nl` | 1 | Filter out |
| `PI` | 1 | Filter out |
| `ym` | 1 | Filter out |
| `en_GB` | 1 | Filter out |
| `VU` | 1 | Filter out |
| `fr-FR` | 1 | Filter out |
| `MA` | 1 | Filter out |
| `US` | 1 | Filter out |

## Generic Navigation Labels (Low Value)

These are website navigation labels, not heritage entities:

| Entity | Occurrences | Action |
|--------|-------------|--------|
| `Home` | 42 | Filter out |
| `collectie` | 22 | Filter out |
| `Archief` | 8 | Filter out |
| `Contact` | 5 | Filter out |
| `Nieuws` | 5 | Filter out |
| `AGENDA` | 4 | Filter out |
| `collection` | 1 | Filter out |

## Numeric-Only Entities (Often Dimensions)

These are often image dimensions or years without context:

| Entity | Occurrences | Action |
|--------|-------------|--------|
| `'2025'` | 26 | Review context |
| `'2560'` | 10 | Review context |
| `'1200'` | 10 | Review context |
| `'1920'` | 10 | Review context |
| `'2024'` | 8 | Review context |
| `'800'` | 6 | Review context |
| `'1280'` | 6 | Review context |
| `'670'` | 5 | Review context |
| `'700'` | 4 | Review context |
| `'2021'` | 4 | Review context |
| `'2023'` | 4 | Review context |
| `'20'` | 3 | Review context |
| `'1080'` | 3 | Review context |
| `'10'` | 3 | Review context |
| `'150'` | 3 | Review context |
| `'1995'` | 3 | Review context |
| `'630'` | 3 | Review context |
| `'100'` | 3 | Review context |
| `'2000'` | 3 | Review context |
| `'2048'` | 3 | Review context |

## Type Mismatches (Need Resolution)

Same entity classified with different types:

| Entity | Types | Occurrences | Action |
|--------|-------|-------------|--------|
| nl-NL | `THG.LNG, TOP.CTY, TOP.LNG` | 527 | Resolve type |
| nl | `THG.LNG, TOP.CTY` | 370 | Resolve type |
| Nederland | `THG.LNG, TOP.CTY, TOP.NAT, TOP.REG, TOP.SET` | 87 | Resolve type |
| nl_NL | `THG.LNG, TOP.CTY` | 86 | Resolve type |
| Home | `APP.EXH, APP.TIT, APP.TTL, DOC.WEB, THG.CON, TOP.SET, WRK.TXT, WRK.WEB` | 42 | Resolve type |
| '2025' | `TMP.DAB, TMP.DRL` | 26 | Resolve type |
| Netherlands | `TOP.CTY, TOP.REG, TOP.SET` | 24 | Resolve type |
| collectie | `APP.COL, APP.EXH, THG.COL, THG.CON, WRK.COL, WRK.SER` | 22 | Resolve type |
| Groningen | `TOP.CTY, TOP.REG, TOP.SET` | 21 | Resolve type |
| Den Haag | `APP.PNM, TOP.SET` | 19 | Resolve type |
| Onder de paramariboom | `APP.TIT, WRK.TTL, WRK.TXT, WRK.VIS, WRK.WRK` | 18 | Resolve type |
| bibliotheek | `APP.COL, GRP.EDU, GRP.HER, THG.CON, TOP.BLD, WRK.COL, WRK.WEB` | 16 | Resolve type |
| Zwolle | `TOP.ADR, TOP.SET` | 16 | Resolve type |
| Beeldbank | `APP.COL, APP.EXH, WRK.COL, WRK.VIS, WRK.WEB` | 15 | Resolve type |
| Heel Nederland Leest | `APP.EXH, APP.TTL, GRP.HIS, THG.CON, THG.EVT, WRK.EXH, WRK.SER` | 15 | Resolve type |
| Rotterdam | `TOP.CTY, TOP.REG, TOP.SET` | 14 | Resolve type |
| Nederlandse | `THG.LNG, TOP.CTY` | 12 | Resolve type |
| Utrecht | `TOP.CTY, TOP.REG, TOP.SET` | 12 | Resolve type |
| Facebook | `APP.NAM, APP.SOC, GRP.COR, GRP.GOV, THG.CON, WRK.SOC, WRK.WEB` | 11 | Resolve type |
| archieven | `APP.COL, THG.CON, WRK.ARC, WRK.COL` | 11 | Resolve type |
| Gemeentearchief | `APP.TIT, APP.TTL, GRP.HER, WRK.WEB` | 11 | Resolve type |
| Overijssel | `GRP.GOV, TOP.REG` | 11 | Resolve type |
| Friesland | `TOP.CTY, TOP.REG` | 10 | Resolve type |
| '1200' | `QTY.CNT, QTY.MSR` | 10 | Resolve type |
| Twente | `TOP.REG, TOP.SET` | 10 | Resolve type |
| Leiden | `TOP.REG, TOP.SET` | 10 | Resolve type |
| '1920' | `QTY.MSR, TMP.CEN, TMP.DAB` | 10 | Resolve type |
| de | `THG.LNG, TOP.CTY` | 10 | Resolve type |
| Noord-Holland | `TOP.CTY, TOP.REG` | 9 | Resolve type |
| Tweede Wereldoorlog | `THG.CON, THG.EVT, TMP.ERA` | 9 | Resolve type |
| Limburg | `TOP.CTY, TOP.REG, TOP.SET` | 8 | Resolve type |
| Zuid-Holland | `GRP.GOV, TOP.REG` | 8 | Resolve type |
| Brabant | `TOP.CTY, TOP.REG` | 8 | Resolve type |
| Archief | `APP.COL, APP.TIT, GRP.UNT, THG.CON, WRK.ARC, WRK.COL, WRK.WEB` | 8 | Resolve type |
| Google | `GRP.COR, GRP.GOV` | 7 | Resolve type |
| Ministerie van Algemene Zaken | `GRP.GOV, GRP.PAR` | 7 | Resolve type |
| foto's | `THG.CON, THG.DOC, THG.PHO` | 7 | Resolve type |
| Kasteel Doorwerth | `APP.TTL, GRP.HER, THG.AFT, TOP.BLD` | 7 | Resolve type |
| Boeken | `THG.CON, THG.DOC, WRK.COL, WRK.PUB, WRK.TXT` | 7 | Resolve type |
| Delft | `TOP.REG, TOP.SET` | 7 | Resolve type |
| Gelderland | `GRP.GOV, TOP.REG, TOP.SET` | 7 | Resolve type |
| Instagram | `APP.NAM, APP.SOC, THG.CON, WRK.SOC, WRK.WEB` | 6 | Resolve type |
| '800' | `QTY.CNT, QTY.MSR` | 6 | Resolve type |
| Haarlem | `TOP.REG, TOP.SET` | 6 | Resolve type |
| museum | `GRP.HER, ROL.OCC, THG.CON, THG.EVT` | 6 | Resolve type |
| Tijdschriften | `APP.TTL, THG.CON, THG.DOC, WRK.PUB, WRK.SER` | 6 | Resolve type |
| Archeologie | `GRP.HER, GRP.UNT, THG.CON` | 6 | Resolve type |
| Provincie Overijssel | `APP.TIT, GRP.GOV, TOP.REG` | 6 | Resolve type |
| '1280' | `QTY.CNT, QTY.MSR` | 6 | Resolve type |
| november 2025 | `TMP.DAB, TMP.DRL, TMP.DUR, TMP.EXP, TMP.RNG` | 6 | Resolve type |

## Cleanup Impact Analysis

| Metric | Value |
|--------|-------|
| Total entity occurrences | 13,268 |
| Candidate cleanup occurrences | 1,551 |
| Cleanup percentage | 11.7% |
| Remaining after cleanup | 11,717 |

## High-Value Entities for Linking

Entities that appear frequently and are good candidates for Wikidata/VIAF linking:

| Entity | Type | Occurrences | Wikidata? |
|--------|------|-------------|-----------|
| Nederland | `TOP.CTY` | 87 | 🔍 Search |
| Amsterdam | `TOP.SET` | 42 | 🔍 Search |
| Netherlands | `TOP.CTY` | 24 | 🔍 Search |
| Groningen | `TOP.SET` | 21 | 🔍 Search |
| Den Haag | `TOP.SET` | 19 | 🔍 Search |
| bibliotheek | `GRP.HER` | 16 | 🔍 Search |
| Zwolle | `TOP.SET` | 16 | 🔍 Search |
| Rotterdam | `TOP.SET` | 14 | 🔍 Search |
| Johan Fretz | `AGT.PER` | 13 | 🔍 Search |
| Nederlandse | `TOP.CTY` | 12 | 🔍 Search |
| Utrecht | `TOP.SET` | 12 | 🔍 Search |
| Facebook | `WRK.SOC` | 11 | 🔍 Search |
| Roermond | `TOP.SET` | 11 | 🔍 Search |
| Gemeentearchief | `GRP.HER` | 11 | 🔍 Search |
| Overijssel | `TOP.REG` | 11 | 🔍 Search |
| Deventer | `TOP.SET` | 11 | 🔍 Search |
| Friesland | `TOP.REG` | 10 | 🔍 Search |
| Twente | `TOP.REG` | 10 | 🔍 Search |
| Leiden | `TOP.SET` | 10 | 🔍 Search |
| Noord-Holland | `TOP.REG` | 9 | 🔍 Search |
| Geldersch Landschap en Kasteelen | `GRP.HER` | 9 | 🔍 Search |
| Nijmegen | `TOP.SET` | 8 | 🔍 Search |
| Limburg | `TOP.REG` | 8 | 🔍 Search |
| Enschede | `TOP.SET` | 8 | 🔍 Search |
| Zuid-Holland | `TOP.REG` | 8 | 🔍 Search |
| Brabant | `TOP.REG` | 8 | 🔍 Search |
| Google | `GRP.COR` | 7 | 🔍 Search |
| Heerlen | `TOP.SET` | 7 | 🔍 Search |
| Maastricht | `TOP.SET` | 7 | 🔍 Search |
| Ministerie van Algemene Zaken | `GRP.GOV` | 7 | 🔍 Search |
| Kasteel Doorwerth | `GRP.HER` | 7 | 🔍 Search |
| Delft | `TOP.SET` | 7 | 🔍 Search |
| Gelderland | `TOP.REG` | 7 | 🔍 Search |
| Schiedam | `TOP.SET` | 6 | 🔍 Search |
| Tilburg | `TOP.SET` | 6 | 🔍 Search |
| Amelander Musea | `GRP.HER` | 6 | 🔍 Search |
| Alkmaar | `TOP.SET` | 6 | 🔍 Search |
| Franeker | `TOP.SET` | 6 | 🔍 Search |
| Ommen | `TOP.SET` | 6 | 🔍 Search |
| Zutphen | `TOP.SET` | 6 | 🔍 Search |
| Haarlem | `TOP.SET` | 6 | 🔍 Search |
| museum | `GRP.HER` | 6 | 🔍 Search |
| Maassluis | `TOP.SET` | 6 | 🔍 Search |
| Dordrecht | `TOP.SET` | 6 | 🔍 Search |
| Arnhem | `TOP.SET` | 6 | 🔍 Search |
| Government of the Netherlands | `GRP.GOV` | 6 | 🔍 Search |
| Leeuwarden | `TOP.SET` | 6 | 🔍 Search |
| Apeldoorn | `TOP.SET` | 6 | 🔍 Search |
| Archeologie | `THG.CON` | 6 | 🔍 Search |
| Provincie Overijssel | `GRP.GOV` | 6 | 🔍 Search |
| Rijksmuseum | `GRP.HER` | 6 | 🔍 Search |
| Zeeland | `TOP.SET` | 6 | 🔍 Search |
| Kampen | `TOP.SET` | 5 | 🔍 Search |
| Breda | `TOP.SET` | 5 | 🔍 Search |
| Hengelo | `TOP.SET` | 5 | 🔍 Search |
| Midden-Groningen | `TOP.REG` | 5 | 🔍 Search |
| The Netherlands | `TOP.CTY` | 5 | 🔍 Search |
| Erfgoed 's-Hertogenbosch | `GRP.HER` | 5 | 🔍 Search |
| Amersfoort | `TOP.SET` | 5 | 🔍 Search |
| Eindhoven | `TOP.SET` | 5 | 🔍 Search |
| Hoorn | `TOP.SET` | 5 | 🔍 Search |
| Dalfsen | `TOP.SET` | 5 | 🔍 Search |
| Drenthe | `TOP.REG` | 5 | 🔍 Search |
| Oirschot | `TOP.SET` | 5 | 🔍 Search |
| Bollenstreek | `TOP.REG` | 5 | 🔍 Search |
| Medemblik | `TOP.SET` | 5 | 🔍 Search |
| Wageningen | `TOP.SET` | 5 | 🔍 Search |
| Rijksuniversiteit Groningen | `GRP.EDU` | 5 | 🔍 Search |
| Texel | `TOP.SET` | 5 | 🔍 Search |
| Collectie Overijssel | `GRP.HER` | 5 | 🔍 Search |
| Flevoland | `TOP.REG` | 5 | 🔍 Search |
| Venlo | `TOP.SET` | 5 | 🔍 Search |
| Twitter | `APP.SOC` | 4 | 🔍 Search |
| Gemeente Waadhoeke | `GRP.GOV` | 4 | 🔍 Search |
| St.-Annaparochie | `TOP.SET` | 4 | 🔍 Search |
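
The "Filter out" and "Review context" actions in the data quality sections of this report could be applied with a small rule-based classifier. The following is a minimal sketch, not the tool that produced this report: `classify_issue`, `LANG_CODE_RE`, and `GENERIC_LABELS` are hypothetical names, the regular expression is only an approximation of the language-code rows listed in this report, and the order in which rules are checked is an assumption. The second half rechecks the Cleanup Impact Analysis figures, assuming the 1,551 candidate occurrences are the sum of the Language Codes, Generic Labels, and Numeric Only rows of the summary (the numbers match, but the report does not state the derivation).

```python
import re
from typing import Optional

# Assumption: two letters, optionally followed by "-" or "_" and two more
# letters, covers the codes listed in this report (nl, nl-NL, en_US, ...).
LANG_CODE_RE = re.compile(r"^[A-Za-z]{2}([-_][A-Za-z]{2})?$")

# Labels taken from the Generic Navigation Labels table, lower-cased.
GENERIC_LABELS = {
    "home", "collectie", "archief", "contact", "nieuws", "agenda", "collection",
}


def classify_issue(entity: str) -> Optional[str]:
    """Return an issue category for an entity string, or None if it looks clean."""
    # Numeric entities appear wrapped in single quotes in the report tables.
    text = entity.strip().strip("'")
    if len(text) == 1:
        return "single_char"
    if text.isdigit():
        return "numeric_only"
    if text.casefold() in GENERIC_LABELS:
        return "generic_label"
    if LANG_CODE_RE.match(text):
        return "language_code"
    return None


# Occurrence counts copied from the Data Quality Issues Summary table.
CLEANUP_OCCURRENCES = {
    "language_code": 1098,
    "generic_label": 87,
    "numeric_only": 366,
}
TOTAL_OCCURRENCES = 13268

candidate = sum(CLEANUP_OCCURRENCES.values())
cleanup_pct = round(100 * candidate / TOTAL_OCCURRENCES, 1)
remaining = TOTAL_OCCURRENCES - candidate

if __name__ == "__main__":
    for sample in ["nl-NL", "Home", "'2025'", "Amsterdam"]:
        print(sample, "->", classify_issue(sample))
    print(candidate, cleanup_pct, remaining)
```

A purely pattern-based language-code rule will also match legitimate two-letter names, so in practice it would likely need to be combined with the source of the value (for example, whether it came from an HTML `lang` attribute) before anything is dropped.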