# Entity Duplicate Analysis Report

**Generated**: 2025-12-13 17:34:36

## Data Quality Issues Summary

| Issue Type | Count | Total Occurrences |
|------------|-------|-------------------|
| Language Codes | 31 | 1098 |
| Generic Labels | 7 | 87 |
| Numeric Only | 235 | 366 |
| Single Char | 48 | 476 |
| Type Mismatches | 475 | 2652 |
| Variant Spellings | 0 | 0 |
| **TOTAL** | **796** | **4679** |

## Language Code Entities (Should Be Filtered)

These are HTML `lang` attribute values, not real entities (the 20 most frequent of 31 shown):

| Entity | Occurrences | Action |
|--------|-------------|--------|
| `nl-NL` | 527 | Filter out |
| `nl` | 370 | Filter out |
| `nl_NL` | 86 | Filter out |
| `EN` | 36 | Filter out |
| `en-GB` | 20 | Filter out |
| `en-US` | 12 | Filter out |
| `de` | 10 | Filter out |
| `en_US` | 6 | Filter out |
| `de-de` | 4 | Filter out |
| `NH` | 4 | Filter out |
| `UB` | 2 | Filter out |
| `fy` | 2 | Filter out |
| `fy-nl` | 1 | Filter out |
| `PI` | 1 | Filter out |
| `ym` | 1 | Filter out |
| `en_GB` | 1 | Filter out |
| `VU` | 1 | Filter out |
| `fr-FR` | 1 | Filter out |
| `MA` | 1 | Filter out |
| `US` | 1 | Filter out |
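
The report does not say how the filter is implemented. As a minimal sketch, region-qualified tags such as `nl-NL` or `en_US` can be matched by pattern, while bare two-letter values are checked against an explicit set. The helper name `is_language_code` and the set `BARE_CODES` are hypothetical; the set holds only a hand-picked subset of the bare values from the table (`nl`, `en`, `de`, `fy`).

```python
import re

# Region-qualified tags such as "nl-NL", "en_US" or "fy-nl".
LANG_REGION = re.compile(r"^[A-Za-z]{2,3}[-_][A-Za-z]{2}$")

# Bare codes are ambiguous (a two-letter string may be an abbreviation),
# so they come from an explicit set. Hypothetical subset of the table above.
BARE_CODES = {"nl", "en", "de", "fy"}


def is_language_code(label: str) -> bool:
    """Return True if the label looks like an HTML `lang` value."""
    text = label.strip()
    return bool(LANG_REGION.match(text)) or text.lower() in BARE_CODES


examples = ["nl-NL", "en_US", "EN", "NH", "Nederland"]
flags = [is_language_code(e) for e in examples]
```

With this split, `NH`, `UB`, `PI`, `ym`, `VU`, `MA` and `US` are not caught; they would have to be added to the set explicitly if they should be filtered as the table indicates.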

## Generic Navigation Labels (Low Value)

These are website navigation labels, not heritage entities:

| Entity | Occurrences | Action |
|--------|-------------|--------|
| `Home` | 42 | Filter out |
| `collectie` | 22 | Filter out |
| `Archief` | 8 | Filter out |
| `Contact` | 5 | Filter out |
| `Nieuws` | 5 | Filter out |
| `AGENDA` | 4 | Filter out |
| `collection` | 1 | Filter out |
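
One way to apply this filter is an exact-match stop list compared case-insensitively. This is a sketch, not the report's actual implementation; `NAVIGATION_LABELS`, `is_navigation_label` and `drop_navigation_labels` are hypothetical names, and the `Amsterdam` entry in the sample is only there to show a label that survives.

```python
# Labels taken from the table above, stored in lower case.
NAVIGATION_LABELS = {
    "home", "collectie", "archief", "contact", "nieuws", "agenda", "collection",
}


def is_navigation_label(label: str) -> bool:
    """Case-insensitive exact match against the stop list."""
    return label.strip().casefold() in NAVIGATION_LABELS


def drop_navigation_labels(counts: dict[str, int]) -> dict[str, int]:
    """Return a copy of the occurrence counts without navigation labels."""
    return {label: n for label, n in counts.items() if not is_navigation_label(label)}


counts = {"Home": 42, "collectie": 22, "AGENDA": 4, "Amsterdam": 42}
kept = drop_navigation_labels(counts)
```

Because the match is exact, longer names that merely contain a stop word, such as `Gemeentearchief`, are left untouched.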

## Numeric-Only Entities (Often Dimensions)

These are often image dimensions or years without context (the 20 most frequent of 235 shown):

| Entity | Occurrences | Action |
|--------|-------------|--------|
| `'2025'` | 26 | Review context |
| `'2560'` | 10 | Review context |
| `'1200'` | 10 | Review context |
| `'1920'` | 10 | Review context |
| `'2024'` | 8 | Review context |
| `'800'` | 6 | Review context |
| `'1280'` | 6 | Review context |
| `'670'` | 5 | Review context |
| `'700'` | 4 | Review context |
| `'2021'` | 4 | Review context |
| `'2023'` | 4 | Review context |
| `'20'` | 3 | Review context |
| `'1080'` | 3 | Review context |
| `'10'` | 3 | Review context |
| `'150'` | 3 | Review context |
| `'1995'` | 3 | Review context |
| `'630'` | 3 | Review context |
| `'100'` | 3 | Review context |
| `'2000'` | 3 | Review context |
| `'2048'` | 3 | Review context |
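
To support the context review, a rough triage could list which readings are plausible for each number. The sketch below is only a heuristic under stated assumptions: the year range (1000 to 2100) and the set `COMMON_PIXEL_SIZES` are hypothetical choices, not values from the report, and `plausible_readings` is a made-up helper.

```python
# Hypothetical set of pixel sizes that often show up as image dimensions.
COMMON_PIXEL_SIZES = {150, 630, 670, 700, 800, 1080, 1200, 1280, 1920, 2048, 2560}


def plausible_readings(label: str, year_range: tuple[int, int] = (1000, 2100)) -> set[str]:
    """Return the plausible readings of a numeric-only label.

    An empty set means the label is not numeric at all.
    """
    text = label.strip("'\" ")
    if not (text.isascii() and text.isdigit()):
        return set()
    number = int(text)
    readings = set()
    if year_range[0] <= number <= year_range[1]:
        readings.add("year")
    if number in COMMON_PIXEL_SIZES:
        readings.add("dimension")
    return readings or {"unknown"}


r_2025 = plausible_readings("'2025'")
r_2560 = plausible_readings("'2560'")
r_1920 = plausible_readings("'1920'")
```

Values such as `'1920'` come back with both readings, which is exactly the case where the surrounding context still has to decide.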

## Type Mismatches (Need Resolution)

The same entity is classified with different types (the 50 most frequent of 475 shown):

| Entity | Types | Occurrences | Action |
|--------|-------|-------------|--------|
| nl-NL | `THG.LNG, TOP.CTY, TOP.LNG` | 527 | Resolve type |
| nl | `THG.LNG, TOP.CTY` | 370 | Resolve type |
| Nederland | `THG.LNG, TOP.CTY, TOP.NAT, TOP.REG, TOP.SET` | 87 | Resolve type |
| nl_NL | `THG.LNG, TOP.CTY` | 86 | Resolve type |
| Home | `APP.EXH, APP.TIT, APP.TTL, DOC.WEB, THG.CON, TOP.SET, WRK.TXT, WRK.WEB` | 42 | Resolve type |
| '2025' | `TMP.DAB, TMP.DRL` | 26 | Resolve type |
| Netherlands | `TOP.CTY, TOP.REG, TOP.SET` | 24 | Resolve type |
| collectie | `APP.COL, APP.EXH, THG.COL, THG.CON, WRK.COL, WRK.SER` | 22 | Resolve type |
| Groningen | `TOP.CTY, TOP.REG, TOP.SET` | 21 | Resolve type |
| Den Haag | `APP.PNM, TOP.SET` | 19 | Resolve type |
| Onder de paramariboom | `APP.TIT, WRK.TTL, WRK.TXT, WRK.VIS, WRK.WRK` | 18 | Resolve type |
| bibliotheek | `APP.COL, GRP.EDU, GRP.HER, THG.CON, TOP.BLD, WRK.COL, WRK.WEB` | 16 | Resolve type |
| Zwolle | `TOP.ADR, TOP.SET` | 16 | Resolve type |
| Beeldbank | `APP.COL, APP.EXH, WRK.COL, WRK.VIS, WRK.WEB` | 15 | Resolve type |
| Heel Nederland Leest | `APP.EXH, APP.TTL, GRP.HIS, THG.CON, THG.EVT, WRK.EXH, WRK.SER` | 15 | Resolve type |
| Rotterdam | `TOP.CTY, TOP.REG, TOP.SET` | 14 | Resolve type |
| Nederlandse | `THG.LNG, TOP.CTY` | 12 | Resolve type |
| Utrecht | `TOP.CTY, TOP.REG, TOP.SET` | 12 | Resolve type |
| Facebook | `APP.NAM, APP.SOC, GRP.COR, GRP.GOV, THG.CON, WRK.SOC, WRK.WEB` | 11 | Resolve type |
| archieven | `APP.COL, THG.CON, WRK.ARC, WRK.COL` | 11 | Resolve type |
| Gemeentearchief | `APP.TIT, APP.TTL, GRP.HER, WRK.WEB` | 11 | Resolve type |
| Overijssel | `GRP.GOV, TOP.REG` | 11 | Resolve type |
| Friesland | `TOP.CTY, TOP.REG` | 10 | Resolve type |
| '1200' | `QTY.CNT, QTY.MSR` | 10 | Resolve type |
| Twente | `TOP.REG, TOP.SET` | 10 | Resolve type |
| Leiden | `TOP.REG, TOP.SET` | 10 | Resolve type |
| '1920' | `QTY.MSR, TMP.CEN, TMP.DAB` | 10 | Resolve type |
| de | `THG.LNG, TOP.CTY` | 10 | Resolve type |
| Noord-Holland | `TOP.CTY, TOP.REG` | 9 | Resolve type |
| Tweede Wereldoorlog | `THG.CON, THG.EVT, TMP.ERA` | 9 | Resolve type |
| Limburg | `TOP.CTY, TOP.REG, TOP.SET` | 8 | Resolve type |
| Zuid-Holland | `GRP.GOV, TOP.REG` | 8 | Resolve type |
| Brabant | `TOP.CTY, TOP.REG` | 8 | Resolve type |
| Archief | `APP.COL, APP.TIT, GRP.UNT, THG.CON, WRK.ARC, WRK.COL, WRK.WEB` | 8 | Resolve type |
| Google | `GRP.COR, GRP.GOV` | 7 | Resolve type |
| Ministerie van Algemene Zaken | `GRP.GOV, GRP.PAR` | 7 | Resolve type |
| foto's | `THG.CON, THG.DOC, THG.PHO` | 7 | Resolve type |
| Kasteel Doorwerth | `APP.TTL, GRP.HER, THG.AFT, TOP.BLD` | 7 | Resolve type |
| Boeken | `THG.CON, THG.DOC, WRK.COL, WRK.PUB, WRK.TXT` | 7 | Resolve type |
| Delft | `TOP.REG, TOP.SET` | 7 | Resolve type |
| Gelderland | `GRP.GOV, TOP.REG, TOP.SET` | 7 | Resolve type |
| Instagram | `APP.NAM, APP.SOC, THG.CON, WRK.SOC, WRK.WEB` | 6 | Resolve type |
| '800' | `QTY.CNT, QTY.MSR` | 6 | Resolve type |
| Haarlem | `TOP.REG, TOP.SET` | 6 | Resolve type |
| museum | `GRP.HER, ROL.OCC, THG.CON, THG.EVT` | 6 | Resolve type |
| Tijdschriften | `APP.TTL, THG.CON, THG.DOC, WRK.PUB, WRK.SER` | 6 | Resolve type |
| Archeologie | `GRP.HER, GRP.UNT, THG.CON` | 6 | Resolve type |
| Provincie Overijssel | `APP.TIT, GRP.GOV, TOP.REG` | 6 | Resolve type |
| '1280' | `QTY.CNT, QTY.MSR` | 6 | Resolve type |
| november 2025 | `TMP.DAB, TMP.DRL, TMP.DUR, TMP.EXP, TMP.RNG` | 6 | Resolve type |
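
The report does not prescribe a resolution rule. One simple option is a majority vote over the types assigned to each entity, sketched below. The function names and the sample annotations are hypothetical, and the per-type counts in the sample are invented for illustration, since the table lists only the set of types per entity.

```python
from collections import Counter, defaultdict


def find_type_mismatches(annotations):
    """Map each entity that carries more than one type to a Counter of its types."""
    types_by_entity = defaultdict(Counter)
    for entity, entity_type in annotations:
        types_by_entity[entity][entity_type] += 1
    return {e: c for e, c in types_by_entity.items() if len(c) > 1}


def resolve_by_majority(type_counts: Counter) -> str:
    """Pick the most frequent type; ties go to the type seen first."""
    return type_counts.most_common(1)[0][0]


# Invented sample: (entity, type) pairs as an extraction step might emit them.
sample = [
    ("Zwolle", "TOP.SET"),
    ("Zwolle", "TOP.SET"),
    ("Zwolle", "TOP.ADR"),
    ("Amsterdam", "TOP.SET"),
]
mismatches = find_type_mismatches(sample)
resolved = {entity: resolve_by_majority(c) for entity, c in mismatches.items()}
```

A majority vote suits place names with one dominant reading; labels such as `Home` or `collectie`, which spread across many types, are probably better handled by the filters than by picking a single type.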

## Cleanup Impact Analysis

| Metric | Value |
|--------|-------|
| Total entity occurrences | 13,268 |
| Candidate cleanup occurrences | 1,551 |
| Cleanup percentage | 11.7% |
| Remaining after cleanup | 11,717 |
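
The figures above can be reproduced from the summary table. The candidate count equals the sum of the Language Codes, Generic Labels and Numeric Only occurrences; Single Char and Type Mismatches appear not to be included. That reading is an inference from the numbers, not something the report states.

```python
# Occurrence counts from the Data Quality Issues Summary table.
language_codes = 1098
generic_labels = 87
numeric_only = 366

total_occurrences = 13268  # "Total entity occurrences"

candidates = language_codes + generic_labels + numeric_only
remaining = total_occurrences - candidates
percentage = round(100 * candidates / total_occurrences, 1)

print(candidates, remaining, percentage)
```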

## High-Value Entities for Linking

Entities that appear frequently and are good candidates for Wikidata/VIAF linking:

| Entity | Type | Occurrences | Wikidata? |
|--------|------|-------------|-----------|
| Nederland | `TOP.CTY` | 87 | 🔍 Search |
| Amsterdam | `TOP.SET` | 42 | 🔍 Search |
| Netherlands | `TOP.CTY` | 24 | 🔍 Search |
| Groningen | `TOP.SET` | 21 | 🔍 Search |
| Den Haag | `TOP.SET` | 19 | 🔍 Search |
| bibliotheek | `GRP.HER` | 16 | 🔍 Search |
| Zwolle | `TOP.SET` | 16 | 🔍 Search |
| Rotterdam | `TOP.SET` | 14 | 🔍 Search |
| Johan Fretz | `AGT.PER` | 13 | 🔍 Search |
| Nederlandse | `TOP.CTY` | 12 | 🔍 Search |
| Utrecht | `TOP.SET` | 12 | 🔍 Search |
| Facebook | `WRK.SOC` | 11 | 🔍 Search |
| Roermond | `TOP.SET` | 11 | 🔍 Search |
| Gemeentearchief | `GRP.HER` | 11 | 🔍 Search |
| Overijssel | `TOP.REG` | 11 | 🔍 Search |
| Deventer | `TOP.SET` | 11 | 🔍 Search |
| Friesland | `TOP.REG` | 10 | 🔍 Search |
| Twente | `TOP.REG` | 10 | 🔍 Search |
| Leiden | `TOP.SET` | 10 | 🔍 Search |
| Noord-Holland | `TOP.REG` | 9 | 🔍 Search |
| Geldersch Landschap en Kasteelen | `GRP.HER` | 9 | 🔍 Search |
| Nijmegen | `TOP.SET` | 8 | 🔍 Search |
| Limburg | `TOP.REG` | 8 | 🔍 Search |
| Enschede | `TOP.SET` | 8 | 🔍 Search |
| Zuid-Holland | `TOP.REG` | 8 | 🔍 Search |
| Brabant | `TOP.REG` | 8 | 🔍 Search |
| Google | `GRP.COR` | 7 | 🔍 Search |
| Heerlen | `TOP.SET` | 7 | 🔍 Search |
| Maastricht | `TOP.SET` | 7 | 🔍 Search |
| Ministerie van Algemene Zaken | `GRP.GOV` | 7 | 🔍 Search |
| Kasteel Doorwerth | `GRP.HER` | 7 | 🔍 Search |
| Delft | `TOP.SET` | 7 | 🔍 Search |
| Gelderland | `TOP.REG` | 7 | 🔍 Search |
| Schiedam | `TOP.SET` | 6 | 🔍 Search |
| Tilburg | `TOP.SET` | 6 | 🔍 Search |
| Amelander Musea | `GRP.HER` | 6 | 🔍 Search |
| Alkmaar | `TOP.SET` | 6 | 🔍 Search |
| Franeker | `TOP.SET` | 6 | 🔍 Search |
| Ommen | `TOP.SET` | 6 | 🔍 Search |
| Zutphen | `TOP.SET` | 6 | 🔍 Search |
| Haarlem | `TOP.SET` | 6 | 🔍 Search |
| museum | `GRP.HER` | 6 | 🔍 Search |
| Maassluis | `TOP.SET` | 6 | 🔍 Search |
| Dordrecht | `TOP.SET` | 6 | 🔍 Search |
| Arnhem | `TOP.SET` | 6 | 🔍 Search |
| Government of the Netherlands | `GRP.GOV` | 6 | 🔍 Search |
| Leeuwarden | `TOP.SET` | 6 | 🔍 Search |
| Apeldoorn | `TOP.SET` | 6 | 🔍 Search |
| Archeologie | `THG.CON` | 6 | 🔍 Search |
| Provincie Overijssel | `GRP.GOV` | 6 | 🔍 Search |
| Rijksmuseum | `GRP.HER` | 6 | 🔍 Search |
| Zeeland | `TOP.SET` | 6 | 🔍 Search |
| Kampen | `TOP.SET` | 5 | 🔍 Search |
| Breda | `TOP.SET` | 5 | 🔍 Search |
| Hengelo | `TOP.SET` | 5 | 🔍 Search |
| Midden-Groningen | `TOP.REG` | 5 | 🔍 Search |
| The Netherlands | `TOP.CTY` | 5 | 🔍 Search |
| Erfgoed 's-Hertogenbosch | `GRP.HER` | 5 | 🔍 Search |
| Amersfoort | `TOP.SET` | 5 | 🔍 Search |
| Eindhoven | `TOP.SET` | 5 | 🔍 Search |
| Hoorn | `TOP.SET` | 5 | 🔍 Search |
| Dalfsen | `TOP.SET` | 5 | 🔍 Search |
| Drenthe | `TOP.REG` | 5 | 🔍 Search |
| Oirschot | `TOP.SET` | 5 | 🔍 Search |
| Bollenstreek | `TOP.REG` | 5 | 🔍 Search |
| Medemblik | `TOP.SET` | 5 | 🔍 Search |
| Wageningen | `TOP.SET` | 5 | 🔍 Search |
| Rijksuniversiteit Groningen | `GRP.EDU` | 5 | 🔍 Search |
| Texel | `TOP.SET` | 5 | 🔍 Search |
| Collectie Overijssel | `GRP.HER` | 5 | 🔍 Search |
| Flevoland | `TOP.REG` | 5 | 🔍 Search |
| Venlo | `TOP.SET` | 5 | 🔍 Search |
| Twitter | `APP.SOC` | 4 | 🔍 Search |
| Gemeente Waadhoeke | `GRP.GOV` | 4 | 🔍 Search |
| St.-Annaparochie | `TOP.SET` | 4 | 🔍 Search |
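
The report does not say how the search step is carried out. One possibility is the `wbsearchentities` action of the Wikidata API; the sketch below only builds the request URL and makes no network call. `build_wikidata_search_url` is a hypothetical helper, and the default language `nl` is an assumption based on the Dutch labels in the table.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_wikidata_search_url(label: str, language: str = "nl", limit: int = 5) -> str:
    """Build a wbsearchentities request URL for an entity label."""
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


url = build_wikidata_search_url("Den Haag")
```

Generic labels in the table such as `bibliotheek` or `museum` would return many unrelated candidates, so they probably need manual review rather than automatic linking.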