glam/reports/layout_analysis_full.md
2025-12-14 17:09:55 +01:00

# Web Archive Layout Pattern Analysis
Generated: 2025-12-13T15:05:22.346619
## Summary
- **Annotation files analyzed**: 1525
- **Unique websites**: 1343
- **Total layout claims**: 4302
- **Total entity claims**: 15252
## Layout Categories
Distribution of content by DOM location category:
### other
- **Occurrences**: 896
- **Websites**: 457
**Top XPath patterns:**
- `body/*` (471)
- `body` (24)
- `body/*/table` (23)
- `body/*/dl` (15)
- `*` (14)
**String patterns found:**
- `unknown` (511)
- `person:dutch_name` (263)
- `heritage:institution_type` (119)
- `nav:menu_item` (114)
- `org:foundation_or_association` (41)
**Samples:**
- "Voorstelling 'De Ezelprins' door Spiegelbeesten......" (bibliotheekkampen.nl)
- "Vandaag open... Bibliotheek Kampen 09:00 - 17:00......" (bibliotheekkampen.nl)
- "flexslider fullscreen-carousel hero-slider..." (museumrijswijk.nl)
### meta_title
- **Occurrences**: 828
- **Websites**: 741
**Top XPath patterns:**
- `head/title` (815)
- `head/meta[@property='og:title']` (8)
- `head/meta[@property='og:title']/@content` (2)
- `head/meta[@name='apple-mobile-web-app-title']/@content` (1)
- `head/meta[@name='twitter:title']` (1)
**String patterns found:**
- `person:dutch_name` (651)
- `heritage:institution_type` (266)
- `nav:menu_item` (194)
- `unknown` (99)
- `org:foundation_or_association` (92)
**Samples:**
- "Protestantse Theologische Universiteit: 400 jaar theologisch onderwijs..." (pthu.nl)
- "Heemkundevereniging Heibloem | Heibloem.nu..." (heibloem.nu)
- "Museum Eenigenburg Museum Eenigenburg..." (museumeenigenburg.nl)
### header
- **Occurrences**: 561
- **Websites**: 386
**Top XPath patterns:**
- `body/header` (204)
- `body/*/header` (65)
- `body/*/header/h1` (29)
- `body/*/header/*` (22)
- `body/header/*` (13)
**String patterns found:**
- `unknown` (285)
- `person:dutch_name` (210)
- `nav:menu_item` (79)
- `heritage:institution_type` (63)
- `org:foundation_or_association` (31)
**Samples:**
- "`<header class="header preventselection">`...Protestantse Theologische Universitei..." (pthu.nl)
- "Heemkundevereniging Heibloem..." (heibloem.nu)
- "Museum Eenigenburg..." (museumeenigenburg.nl)
### meta_other
- **Occurrences**: 538
- **Websites**: 436
**Top XPath patterns:**
- `head` (328)
- `head/script[@type='application/ld+json']` (76)
- `head/script` (42)
- `head/link[@rel='canonical']` (22)
- `head/link` (13)
**String patterns found:**
- `person:dutch_name` (264)
- `unknown` (242)
- `heritage:institution_type` (107)
- `nav:menu_item` (81)
- `org:foundation_or_association` (32)
**Samples:**
- "`<title>Home - Hildo Krop Museum</title><link rel="canonical" href="https://www.h...`" (hildokrop.nl)
- "Museum Rijswijk..." (museumrijswijk.nl)
- "Gemeentearchief Ede | Gemeentearchief Ede......" (gemeentearchief.ede.nl)
### paragraph
- **Occurrences**: 295
- **Websites**: 190
**Top XPath patterns:**
- `body/*/p` (202)
- `*/p` (17)
- `body/*/ul/li/*/p` (8)
- `body/p` (4)
- `body/*/p/strong` (4)
**String patterns found:**
- `person:dutch_name` (152)
- `unknown` (87)
- `heritage:institution_type` (73)
- `org:foundation_or_association` (25)
- `nav:menu_item` (23)
**Samples:**
- "Heemkundevereniging Heibloem heeft als doel het in brede kring bevorderen van be..." (heibloem.nu)
- "In Museum Eenigenburg wijst een levensgroot kompas......" (museumeenigenburg.nl)
- "Het Museum is zeer geschikt voor het ontvangen van groepen......" (museumeenigenburg.nl)
### nav
- **Occurrences**: 285
- **Websites**: 247
**Top XPath patterns:**
- `body/*/nav` (57)
- `body/nav` (43)
- `body/header/nav` (34)
- `body/*/header/*/nav` (30)
- `body/header/*/nav` (17)
**String patterns found:**
- `nav:menu_item` (137)
- `unknown` (117)
- `person:dutch_name` (105)
- `heritage:institution_type` (32)
- `org:foundation_or_association` (16)
**Samples:**
- "Actueel HELD Burgerbus Onderwijs & opvang......" (heibloem.nu)
- "Home Het museum Daglonershuis Kasteel De familie Eenigenburg Blog Contact..." (museumeenigenburg.nl)
- "Home Tentoonstellingen Bezoekersinformatie......" (sonnenborgh.nl)
### sub_heading
- **Occurrences**: 224
- **Websites**: 128
**Top XPath patterns:**
- `body/*/h2` (117)
- `body/*/h3` (65)
- `*/h2` (8)
- `body/h2` (6)
- `body/*/h4` (4)
**String patterns found:**
- `unknown` (138)
- `person:dutch_name` (71)
- `heritage:institution_type` (36)
- `nav:menu_item` (9)
- `org:foundation_or_association` (9)
**Samples:**
- "Verenigingsnieuws..." (heibloem.nu)
- "Contactgegevens..." (heibloem.nu)
- "Home..." (bibliotheekkampen.nl)
### main_heading
- **Occurrences**: 222
- **Websites**: 182
**Top XPath patterns:**
- `body/*/h1` (175)
- `*/h1` (16)
- `*/a/*/h1` (2)
- `body/*/ul/li/h1` (2)
- `body/section[@class='uk-section head']/*/div[@class='title']/h1` (2)
**String patterns found:**
- `person:dutch_name` (110)
- `unknown` (90)
- `heritage:institution_type` (67)
- `nav:menu_item` (12)
- `org:foundation_or_association` (9)
**Samples:**
- "Protestantse Theologische Universiteit..." (pthu.nl)
- "Terug in de tijd..." (museumeenigenburg.nl)
- "Ook leuk voor groepen..." (museumeenigenburg.nl)
### meta_tag
- **Occurrences**: 183
- **Websites**: 165
**Top XPath patterns:**
- `head/meta[@name='description']` (69)
- `head/meta` (33)
- `head/meta[@name='description']/@content` (29)
- `head/meta/@content` (22)
- `head/meta[@property='og:description']` (5)
**String patterns found:**
- `person:dutch_name` (98)
- `unknown` (60)
- `heritage:institution_type` (56)
- `nav:menu_item` (16)
- `org:foundation_or_association` (14)
**Samples:**
- "Prachtige tijdelijke en vaste tentoonstellingen zijn te zien bij Museum Catharij..." (catharijneconvent.nl)
- "Energie is overal. We gebruiken het constant en ervaren er veel gemak van. Tegel..." (vindhetuit.nl)
- "Aanmeldformulier Via dit formulier kunt u zich aanmelden bij de Stichting Vriend..." (verbindingsdienst.nl)
### list
- **Occurrences**: 130
- **Websites**: 92
**Top XPath patterns:**
- `body/*/ul` (93)
- `*/ul` (5)
- `body/*/ul[@class='metanav']` (5)
- `body/ul` (2)
- `body/*/ol` (2)
**String patterns found:**
- `unknown` (62)
- `person:dutch_name` (47)
- `nav:menu_item` (29)
- `heritage:institution_type` (19)
- `org:foundation_or_association` (4)
**Samples:**
- "`<ul class="metanav">`......" (bibliotheekkampen.nl)
- "Nu te zien [list of exhibitions]..." (catharijneconvent.nl)
- "Let op: in december hebben we aangepaste openingstijden. Bekijk hier de openings..." (debieb.nl)
### footer
- **Occurrences**: 119
- **Websites**: 92
**Top XPath patterns:**
- `body/footer` (44)
- `body/*/footer` (16)
- `body/footer/*` (10)
- `footer` (6)
- `body/footer/*/p` (5)
**String patterns found:**
- `person:dutch_name` (41)
- `nav:menu_item` (39)
- `unknown` (37)
- `loc:postal_code_nl` (12)
- `loc:street_address` (10)
**Samples:**
- "Heibloem.nu Nieuwsbrief Contact......" (heibloem.nu)
- "Made with ♥ by Encore Digital Agency......" (heibloem.nu)
- "Contactinformatie Kerkweg 5... Toegangsprijzen... OPENINGS TIJDEN......" (museumeenigenburg.nl)
### main_content
- **Occurrences**: 10
- **Websites**: 6
**Top XPath patterns:**
- `body/div[@id='page_wrapper']/main[@id='content_main']/div[@class='cm_column_wrapper']/div[@class='cm_column']/p` (3)
- `body/div[@id='readspeaker']/*/article[@role='region']/*/p` (1)
- `body/main[@id='main']/div[@id='content-wrap']/div[@id='primary']/div[@id='content']/*/div[@class='entry']/div[@data-elementor-type='wp-post']/section[@data-id='150f6235']` (1)
- `body/div[@id='wrapper']/div[@id='mainCntr']/main[@id='contentCntr']/div[@id='generatedContent']/div[@class='timeItem']` (1)
- `body/div[@id='wrapper']/div[@id='mainCntr']/main[@id='contentCntr']/div[@id='generatedContent']/div[@class='locationItem']` (1)
**String patterns found:**
- `unknown` (7)
- `person:dutch_name` (2)
- `info:opening_hours` (1)
**Samples:**
- "Let op: Op donderdag 27 en vrijdag 28 november 2025 is Burgerzaken gesloten......" (veendam.nl)
- "september 2025: Arfdeeltje 53......" (lemelsarfgoed.nl)
- "19 dec. 2024: Arfdeeltje 52......" (lemelsarfgoed.nl)
### contact
- **Occurrences**: 10
- **Websites**: 7
**Top XPath patterns:**
- `body/*/address` (1)
- `body/div[@id='contact']/*/ul` (1)
- `div[@class='xr_group' and ./span[text()='Contact']]` (1)
- `body/div[@id='wrapper']/div[@id='rechterkolom']/div[@class='moduletablecontact']/div[@class='customcontact']` (1)
- `h2[text()='Contact en adressen']` (1)
**String patterns found:**
- `nav:menu_item` (4)
- `loc:street_address` (3)
- `person:dutch_name` (3)
- `loc:postal_code_nl` (2)
- `unknown` (2)
**Samples:**
- "Bongerd 18, 6411 JM Heerlen..." (museum.nl)
- "Winsum, Hoofdstraat W 70 Uithuizen, Hoofdstraat-West 1..." (hethogeland.nl)
- "Contact Contactgegevens..." (heemkundenispen.nl)
### sidebar
- **Occurrences**: 1
- **Websites**: 1
**Top XPath patterns:**
- `body/aside[@id='leftbar']` (1)
**String patterns found:**
- `org:foundation_or_association` (1)
- `nav:menu_item` (1)
- `person:dutch_name` (1)
**Samples:**
- "Vereniging Adres / Contact Kwartaalblad......" (oudhoorn.nl)
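
The report does not say how an XPath is assigned to a layout category. As a rough illustration only, the sketch below reverse-engineers a few ordered rules from the patterns listed above; `RULES` and `layout_category` are hypothetical names, and the rare `main_content` and `contact` categories are deliberately left out because their patterns are site-specific.

```python
import re

# Ordered (category, regex) rules; the first match wins. These rules are a
# guess inferred from the XPath patterns listed in this report, not the
# analysis script's actual logic.
RULES = [
    ("meta_title", r"^head/(title|meta\[@(name|property)='[^']*title'\])"),
    ("meta_tag", r"^head/meta\b"),
    ("meta_other", r"^head\b"),
    ("nav", r"(^|/)nav(\[|/|$)"),
    ("footer", r"(^|/)footer(\[|/|$)"),
    ("header", r"(^|/)header(\[|/|$)"),
    ("sidebar", r"(^|/)aside(\[|/|$)"),
    ("main_heading", r"(^|/)h1(\[|$)"),
    ("sub_heading", r"(^|/)h[2-6](\[|$)"),
    ("paragraph", r"(^|/)p(\[|/|$)"),
    ("list", r"(^|/)[ou]l(\[|$)"),
]


def layout_category(xpath: str) -> str:
    """Return the first matching category, or 'other' if no rule applies."""
    for category, pattern in RULES:
        if re.search(pattern, xpath):
            return category
    return "other"


for xp in ["head/title", "body/header/nav", "body/*/header/h1", "body/*/table"]:
    print(xp, "->", layout_category(xp))
```

Rule order matters in this sketch: container elements such as `nav`, `footer`, and `header` are checked before headings and paragraphs, which is consistent with `body/*/header/h1` being listed under `header` rather than `main_heading`.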
## Entity Type Distribution
| Entity Type | Count | Websites | Primary Location |
|-------------|-------|----------|------------------|
| APP.URL | 3196 | 1176 | meta_other |
| GRP.HER | 1805 | 891 | meta_title |
| TOP.SET | 1344 | 722 | meta_tag |
| THG.LNG | 1309 | 921 | other |
| TMP.DAB | 928 | 476 | meta_tag |
| TOP.ADR | 412 | 233 | footer |
| GRP.ASS | 401 | 230 | meta_title |
| GRP.GOV | 381 | 168 | meta_title |
| APP.TIT | 380 | 278 | meta_title |
| AGT.PER | 334 | 180 | paragraph |
| WRK.WEB | 316 | 191 | meta_other |
| APP.TTL | 293 | 206 | meta_title |
| THG.CON | 260 | 162 | meta_tag |
| APP.EXH | 259 | 166 | other |
| TOP.REG | 254 | 192 | meta_tag |
| TOP.CTY | 234 | 179 | meta_tag |
| TMP.OPH | 217 | 130 | paragraph |
| QTY.MSR | 179 | 85 | meta_tag |
| QTY.CNT | 176 | 106 | meta_tag |
| GRP.COR | 152 | 110 | meta_tag |
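
The "Primary Location" column appears to be the DOM category with the highest count for each entity type. A minimal sketch of that calculation, using the GRP.HER counts from this report; the function name `location_shares` is hypothetical, and the percentages are assumed to be relative to the entity type's total occurrences rather than to the five listed categories.

```python
from collections import Counter


def location_shares(counts: Counter, total: int) -> list[tuple[str, int, float]]:
    """Return (category, count, percent of total), most frequent first."""
    return [(cat, n, round(100 * n / total, 1)) for cat, n in counts.most_common()]


# Top-five DOM category counts for GRP.HER, copied from this report.
grp_her = Counter(
    {"meta_title": 871, "meta_tag": 311, "other": 181, "meta_other": 131, "header": 88}
)

shares = location_shares(grp_her, total=1805)
primary_location = shares[0][0]

print(primary_location)
for category, count, percent in shares:
    print(f"- {category}: {count} ({percent}%)")
```

Because only the top five categories are listed per entity type, the listed percentages do not sum to 100.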
### Entity Type Details
#### APP.URL
- **Total occurrences**: 3196
- **Unique websites**: 1176
**Found in DOM categories:**
- meta_other: 1705 (53.3%)
- meta_tag: 455 (14.2%)
- other: 293 (9.2%)
- list: 162 (5.1%)
- footer: 149 (4.7%)
**Common XPath patterns:**
- `head/link/@href` (441)
- `head/link[@rel='canonical']/@href` (342)
- `head/script[@type='application/ld+json']` (186)
- `head/meta/@content` (175)
- `head/link` (151)
**Samples:**
- "https://www.pthu.nl/" @ `head/link/@href`
- "https://heibloem.nu/vereniging/heemkundevereniging-heibloem" @ `head/meta[@property='og:url']`
- "https://heibloem.nu/held" @ `body/*/footer/*/p/a`
#### GRP.HER
- **Total occurrences**: 1805
- **Unique websites**: 891
**Found in DOM categories:**
- meta_title: 871 (48.3%)
- meta_tag: 311 (17.2%)
- other: 181 (10.0%)
- meta_other: 131 (7.3%)
- header: 88 (4.9%)
**Common XPath patterns:**
- `head/title` (829)
- `head/meta/@content` (129)
- `head/meta[@name='description']/@content` (66)
- `head/script[@type='application/ld+json']` (56)
- `head/meta[@property='og:site_name']/@content` (52)
**Samples:**
- "Museum Eenigenburg" @ `head/title`
- "Musea Schagen" @ `body/*/footer/*/a/*/p`
- "Hildo Krop Museum" @ `head/title`
#### TOP.SET
- **Total occurrences**: 1344
- **Unique websites**: 722
**Found in DOM categories:**
- meta_tag: 356 (26.5%)
- meta_title: 331 (24.6%)
- paragraph: 209 (15.6%)
- other: 126 (9.4%)
- meta_other: 84 (6.2%)
**Common XPath patterns:**
- `head/title` (315)
- `head/meta[@name='description']/@content` (154)
- `head/meta/@content` (106)
- `body/*/p` (101)
- `head/script[@type='application/ld+json']` (32)
**Samples:**
- "Amsterdam" @ `head/script[@type='application/ld+json']`
- "Eenigenburg" @ `body/*/header/*/nav/*/ul/li/a`
- "Steenwijk" @ `head/meta[@property='og:site_name']`
#### THG.LNG
- **Total occurrences**: 1309
- **Unique websites**: 921
**Found in DOM categories:**
- other: 846 (64.6%)
- meta_tag: 207 (15.8%)
- meta_other: 150 (11.5%)
- nav: 32 (2.4%)
- paragraph: 27 (2.1%)
**Common XPath patterns:**
- `@lang` (360)
- `html/@lang` (192)
- `html` (70)
- `head/meta[@property='og:locale']/@content` (54)
- `head/meta/@content` (50)
**Samples:**
- "nl_NL" @ `head/meta[@property='og:locale']`
- "nl-NL" @ `[@lang='nl-NL']`
- "nl_NL" @ `html/@lang`
#### TMP.DAB
- **Total occurrences**: 928
- **Unique websites**: 476
**Found in DOM categories:**
- meta_tag: 396 (42.7%)
- meta_other: 189 (20.4%)
- paragraph: 166 (17.9%)
- other: 97 (10.5%)
- list: 33 (3.6%)
**Common XPath patterns:**
- `head/meta[@property='article:modified_time']/@content` (108)
- `head/meta/@content` (108)
- `head/script[@type='application/ld+json']` (93)
- `body/*/p` (75)
- `head/script/text()` (34)
**Samples:**
- "2024" @ `li[@id='menu-item-2747']/a`
- "2025" @ `li[@id='menu-item-2758']/a`
- "2022-05-30T07:37:02+00:00" @ `head/meta[@property='article:modified_time']/@content`
#### TOP.ADR
- **Total occurrences**: 412
- **Unique websites**: 233
**Found in DOM categories:**
- footer: 104 (25.2%)
- paragraph: 81 (19.7%)
- other: 68 (16.5%)
- meta_other: 60 (14.6%)
- contact: 25 (6.1%)
**Common XPath patterns:**
- `body/*/p` (31)
- `body/footer/*/p` (29)
- `head/script[@type='application/ld+json']` (24)
- `head/script/text()` (19)
- `body/*/p/*` (17)
**Samples:**
- "De Boelelaan 1105, 1081 HV, Amsterdam, Nederland" @ `head/script[@type='application/ld+json']`
- "Kerkweg 5, 1744 JD Eenigenburg, Noord-Holland" @ `body/*/footer/*/ol/li`
- "Herenstraat 67" @ `body/footer/*`
#### GRP.ASS
- **Total occurrences**: 401
- **Unique websites**: 230
**Found in DOM categories:**
- meta_title: 164 (40.9%)
- meta_tag: 58 (14.5%)
- paragraph: 43 (10.7%)
- header: 27 (6.7%)
- other: 27 (6.7%)
**Common XPath patterns:**
- `head/title` (153)
- `head/meta/@content` (25)
- `body/*/p` (24)
- `head/script[@type='application/ld+json']` (13)
- `head/meta[@name='description']/@content` (11)
**Samples:**
- "Heemkundevereniging Heibloem" @ `head/title`
- "Heemkundevereniging Heibloem" @ `body/*/header/h1/*`
- "Heemkundevereniging Heibloem" @ `body/*/p/strong`
#### GRP.GOV
- **Total occurrences**: 381
- **Unique websites**: 168
**Found in DOM categories:**
- meta_title: 131 (34.4%)
- meta_tag: 126 (33.1%)
- header: 41 (10.8%)
- paragraph: 26 (6.8%)
- other: 17 (4.5%)
**Common XPath patterns:**
- `head/title` (127)
- `head/meta/@content` (42)
- `head/meta[@name='description']/@content` (26)
- `head/meta` (15)
- `body/*/p` (13)
**Samples:**
- "gemeente Ede" @ `body/img[@alt='Logo Gemeentearchief Ede']/@alt`
- "Provincie Noord-Holland" @ `head/title`
- "Provincie Overijssel" @ `head/title`
#### APP.TIT
- **Total occurrences**: 380
- **Unique websites**: 278
**Found in DOM categories:**
- meta_title: 179 (47.1%)
- meta_tag: 39 (10.3%)
- other: 39 (10.3%)
- sub_heading: 36 (9.5%)
- paragraph: 24 (6.3%)
**Common XPath patterns:**
- `head/title` (153)
- `head/meta/@content` (15)
- `body/*/h2` (12)
- `head/meta[@property='og:title']/@content` (12)
- `body/*/a/*/h4` (10)
**Samples:**
- "Heemkundevereniging Heibloem" @ `head/title`
- "Uitnodiging lezing "Ontdek de sporen van je voorouders"" @ `body/*/h2`
- "Versiering plaquette "Sporen die bleven"" @ `body/*/h2`
#### AGT.PER
- **Total occurrences**: 334
- **Unique websites**: 180
**Found in DOM categories:**
- paragraph: 107 (32.0%)
- meta_tag: 80 (24.0%)
- other: 63 (18.9%)
- sub_heading: 28 (8.4%)
- meta_other: 17 (5.1%)
**Common XPath patterns:**
- `body/*/p` (44)
- `head/meta/@content` (31)
- `head/meta[@name='description']/@content` (23)
- `body/*` (11)
- `body/*/p/text()` (11)
**Samples:**
- "Erik Pape" @ `body/*/ul/li/*/h2/a`
- "Caren van Herwaarden" @ `body/*/ul/li/*/h2/a`
- "Prof. dr. Chris Stolwijk" @ `body/span[contains(text(), 'Prof. dr. Chris Stolwijk')]`
## XPath → Entity Type Mapping
Which entity types are typically found at which DOM locations:
### `head/title` (1983 entities)
- GRP.HER: 829 (41.8%)
- TOP.SET: 315 (15.9%)
- GRP.ASS: 153 (7.7%)
- APP.TIT: 153 (7.7%)
- GRP.GOV: 127 (6.4%)
### `head/meta/@content` (976 entities)
- APP.URL: 175 (17.9%)
- GRP.HER: 129 (13.2%)
- TMP.DAB: 108 (11.1%)
- TOP.SET: 106 (10.9%)
- THG.LNG: 50 (5.1%)
### `body/*/p` (770 entities)
- TOP.SET: 101 (13.1%)
- TMP.DAB: 75 (9.7%)
- AGT.PER: 44 (5.7%)
- GRP.HER: 36 (4.7%)
- TOP.ADR: 31 (4.0%)
### `head/script[@type='application/ld+json']` (641 entities)
- APP.URL: 186 (29.0%)
- TMP.DAB: 93 (14.5%)
- GRP.HER: 56 (8.7%)
- TOP.SET: 32 (5.0%)
- QTY.MSR: 32 (5.0%)
### `head/meta[@name='description']/@content` (619 entities)
- TOP.SET: 154 (24.9%)
- GRP.HER: 66 (10.7%)
- THG.CON: 43 (6.9%)
- TOP.REG: 35 (5.7%)
- GRP.GOV: 26 (4.2%)
### `head/link/@href` (468 entities)
- APP.URL: 441 (94.2%)
- WRK.WEB: 7 (1.5%)
- GRP.HER: 6 (1.3%)
- TOP.SET: 4 (0.9%)
- APP.WEB: 3 (0.6%)
### `@lang` (365 entities)
- THG.LNG: 360 (98.6%)
- TOP.CTY: 4 (1.1%)
- TOP.SET: 1 (0.3%)
### `head/link[@rel='canonical']/@href` (351 entities)
- APP.URL: 342 (97.4%)
- WRK.WEB: 7 (2.0%)
- WRK.URL: 1 (0.3%)
- GRP.HER: 1 (0.3%)
### `head/script/text()` (342 entities)
- APP.URL: 103 (30.1%)
- GRP.HER: 37 (10.8%)
- TMP.DAB: 34 (9.9%)
- TOP.ADR: 19 (5.6%)
- THG.LNG: 19 (5.6%)
### `head/script` (242 entities)
- APP.URL: 57 (23.6%)
- TMP.DAB: 32 (13.2%)
- GRP.HER: 17 (7.0%)
- THG.LNG: 13 (5.4%)
- QTY.MSR: 11 (4.5%)
### `html/@lang` (196 entities)
- THG.LNG: 192 (98.0%)
- TOP.CTY: 3 (1.5%)
- TOP.LNG: 1 (0.5%)
### `head/meta` (175 entities)
- APP.URL: 32 (18.3%)
- TMP.DAB: 22 (12.6%)
- GRP.GOV: 15 (8.6%)
- THG.LNG: 14 (8.0%)
- GRP.HER: 13 (7.4%)
### `head/link` (161 entities)
- APP.URL: 151 (93.8%)
- THG.LNG: 5 (3.1%)
- WRK.WEB: 2 (1.2%)
- APP.WKD: 1 (0.6%)
- GRP.COR: 1 (0.6%)
### `head/link[@rel='canonical']` (150 entities)
- APP.URL: 143 (95.3%)
- WRK.WEB: 7 (4.7%)
### `body/*` (139 entities)
- TMP.DAB: 24 (17.3%)
- TOP.ADR: 13 (9.4%)
- AGT.PER: 11 (7.9%)
- TMP.TAB: 8 (5.8%)
- WRK.COL: 7 (5.0%)
### `head/meta[@property='article:modified_time']/@content` (108 entities)
- TMP.DAB: 108 (100.0%)
### `head/script[@type='application/ld+json']/text()` (108 entities)
- APP.URL: 27 (25.0%)
- TMP.DAB: 16 (14.8%)
- TOP.SET: 9 (8.3%)
- GRP.HER: 9 (8.3%)
- TOP.REG: 5 (4.6%)
### `body/*/ul/li/a` (100 entities)
- APP.URL: 33 (33.0%)
- APP.COL: 10 (10.0%)
- WRK.COL: 7 (7.0%)
- WRK.WEB: 7 (7.0%)
- APP.EXH: 7 (7.0%)
### `body/*/h2` (98 entities)
- GRP.HER: 21 (21.4%)
- APP.TIT: 12 (12.2%)
- APP.EXH: 11 (11.2%)
- TOP.SET: 11 (11.2%)
- GRP.ASS: 6 (6.1%)
### `head/meta[@property='og:site_name']/@content` (98 entities)
- GRP.HER: 52 (53.1%)
- GRP.GOV: 12 (12.2%)
- GRP.ASS: 6 (6.1%)
- APP.TTL: 5 (5.1%)
- TOP.SET: 4 (4.1%)
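
A mapping like the one above can be built by grouping entity claims by XPath and counting entity types per group. The sketch below assumes each claim is a dict with `xpath` and `entity_type` keys; those field names, the toy claims, and the helper names are illustrative and not taken from the annotation files.

```python
from collections import Counter, defaultdict


def xpath_entity_mapping(claims):
    """Group claims by XPath and count entity types within each group."""
    by_xpath = defaultdict(Counter)
    for claim in claims:
        by_xpath[claim["xpath"]][claim["entity_type"]] += 1
    return by_xpath


def most_likely_type(mapping, xpath):
    """Return (entity_type, share) for the most frequent type at an XPath."""
    counter = mapping.get(xpath)
    if not counter:
        return None
    entity_type, count = counter.most_common(1)[0]
    return entity_type, count / sum(counter.values())


# Toy claims for illustration only; not real data from the report.
claims = [
    {"xpath": "head/title", "entity_type": "GRP.HER"},
    {"xpath": "head/title", "entity_type": "GRP.HER"},
    {"xpath": "head/title", "entity_type": "TOP.SET"},
    {"xpath": "head/link/@href", "entity_type": "APP.URL"},
]

mapping = xpath_entity_mapping(claims)
print(most_likely_type(mapping, "head/title"))
```

Used as a prior, such a lookup is most informative for locations dominated by one type, such as `head/link/@href` (94.2% APP.URL), and much less so for mixed locations such as `body/*/p`.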
## String Pattern → Entity Type Correlation
Which string patterns are most associated with which entity types:
### `unknown` (10443 occurrences)
- APP.URL: 2814 (26.9%)
- THG.LNG: 1307 (12.5%)
- TOP.SET: 1250 (12.0%)
- TMP.DAB: 892 (8.5%)
- GRP.HER: 363 (3.5%)
### `person:dutch_name` (3440 occurrences)
- GRP.HER: 1329 (38.6%)
- GRP.ASS: 313 (9.1%)
- GRP.GOV: 246 (7.2%)
- AGT.PER: 241 (7.0%)
- APP.TIT: 172 (5.0%)
### `heritage:institution_type` (1185 occurrences)
- GRP.HER: 777 (65.6%)
- APP.URL: 140 (11.8%)
- APP.TTL: 56 (4.7%)
- APP.TIT: 49 (4.1%)
- WRK.WEB: 31 (2.6%)
### `org:foundation_or_association` (383 occurrences)
- GRP.ASS: 198 (51.7%)
- GRP.HER: 85 (22.2%)
- APP.URL: 24 (6.3%)
- APP.TIT: 15 (3.9%)
- WRK.WEB: 15 (3.9%)
### `nav:menu_item` (368 occurrences)
- APP.URL: 130 (35.3%)
- APP.TTL: 46 (12.5%)
- APP.TIT: 45 (12.2%)
- WRK.COL: 40 (10.9%)
- WRK.WEB: 34 (9.2%)
### `loc:postal_code_nl` (218 occurrences)
- TOP.ADR: 193 (88.5%)
- TOP.SET: 7 (3.2%)
- APP.NAM: 2 (0.9%)
- QTY.CNT: 2 (0.9%)
- unknown: 2 (0.9%)
### `gov:municipality` (206 occurrences)
- GRP.GOV: 171 (83.0%)
- APP.TTL: 11 (5.3%)
- TOP.SET: 5 (2.4%)
- APP.TIT: 5 (2.4%)
- TOP.REG: 5 (2.4%)
### `info:time_range` (185 occurrences)
- TMP.OPH: 152 (82.2%)
- TMP.TAB: 25 (13.5%)
- TMP.EXP: 3 (1.6%)
- TMP.DAT: 2 (1.1%)
- TMP.RNG: 1 (0.5%)
### `loc:street_address` (178 occurrences)
- TOP.ADR: 158 (88.8%)
- GRP.HER: 8 (4.5%)
- TOP.SET: 2 (1.1%)
- APP.PNM: 2 (1.1%)
- APP.URL: 1 (0.6%)
### `info:opening_hours` (161 occurrences)
- TMP.OPH: 115 (71.4%)
- TMP.DAB: 31 (19.3%)
- TMP.SET: 7 (4.3%)
- TMP.RNG: 2 (1.2%)
- TMP.DAT: 2 (1.2%)
### `contact:email` (126 occurrences)
- APP.URL: 93 (73.8%)
- APP.EMAIL: 12 (9.5%)
- APP.NAM: 5 (4.0%)
- APP.PNM: 5 (4.0%)
- APP.EML: 3 (2.4%)
### `contact:phone` (7 occurrences)
- APP.URL: 5 (71.4%)
- TOP.ADR: 1 (14.3%)
- APP.TTL: 1 (14.3%)
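
The report does not show the definitions behind these string patterns. As one hedged example of what a label such as `loc:postal_code_nl` might look like, the sketch below uses the standard shape of a Dutch postal code (four digits, then two uppercase letters). The regular expression and the `string_patterns` helper are assumptions, as is the reading of `unknown` as "no pattern matched".

```python
import re

# Assumed shape of a Dutch postal code: four digits with no leading zero,
# an optional space, and two uppercase letters. Not the project's actual rule.
POSTAL_CODE_NL = re.compile(r"\b[1-9]\d{3}\s?[A-Z]{2}\b")


def string_patterns(text: str) -> list[str]:
    """Return the labels of all patterns found in text, or ['unknown']."""
    labels = []
    if POSTAL_CODE_NL.search(text):
        labels.append("loc:postal_code_nl")
    return labels or ["unknown"]


# Sample strings taken from this report.
for sample in ["Bongerd 18, 6411 JM Heerlen", "Herenstraat 67"]:
    print(sample, "->", string_patterns(sample))
```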
## Recommendations for dutch_web_patterns.yaml
Based on the analysis, the following layout-aware patterns could be added:
### High-Value XPath Targets
1. **`head/title`** - Most common location for institution names: GRP.HER accounts for 41.8% of the entities found here
2. **`body/*/h1`** - Primary heading, often the institution name
3. **`body/*/nav`** - Navigation menu, mostly menu labels (candidate for discard patterns)
4. **`body/*/footer`** - Contact info, address, social links
### Suggested Pattern Additions
```yaml
# Add xpath_hint to patterns for better precision
entity_patterns:
  organizations:
    heritage_institutions:
      patterns:
        - pattern: '^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$'
          xpath_hints:
            - 'head/title'
            - 'body/*/h1'
          confidence_boost: 0.2  # Higher confidence when found at expected location
```
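
How a consumer might apply `xpath_hints` and `confidence_boost` is not specified here. The sketch below is one possible reading: the pattern entry is mirrored as a plain dict, the XPath is assumed to be already normalized to this report's notation so that an exact string comparison suffices, and the base confidence of 0.5, the case-insensitive matching, and the `score` function are all assumptions.

```python
import re

# Mirrors the suggested YAML entry as a plain dict so that the sketch does not
# depend on a YAML parser.
PATTERN_ENTRY = {
    "pattern": r"^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$",
    "xpath_hints": ["head/title", "body/*/h1"],
    "confidence_boost": 0.2,
}


def score(text, xpath, base_confidence=0.5, entry=PATTERN_ENTRY):
    """Return a confidence for text, or None if the pattern does not match."""
    if not re.match(entry["pattern"], text.strip(), flags=re.IGNORECASE):
        return None
    confidence = base_confidence
    # Boost only when the text was found at one of the expected locations.
    if xpath in entry["xpath_hints"]:
        confidence += entry["confidence_boost"]
    return min(confidence, 1.0)


# "Het Stedelijk Museum" is a made-up input used only to exercise the pattern.
print(score("Het Stedelijk Museum", "head/title"))
print(score("Het Stedelijk Museum", "body/*/p"))
print(score("Museum Eenigenburg", "head/title"))
```

Note that the suggested pattern requires a leading article (`het` or `de`), so a title such as "Museum Eenigenburg" from the samples in this report does not match it.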