glam/reports/layout_analysis_500.md
2025-12-14 17:09:55 +01:00

# Web Archive Layout Pattern Analysis
Generated: 2025-12-13T15:03:35.789833
## Summary
- **Annotation files analyzed**: 500
- **Unique websites**: 468
- **Total layout claims**: 1443
- **Total entity claims**: 5134
## Layout Categories
Distribution of content by DOM location category:
### other
- **Occurrences**: 322
- **Websites**: 162
**Top XPath patterns:**
- `body/*` (162)
- `body/*/table` (8)
- `body` (7)
- `body/*/dl` (6)
- `body/*/script` (6)
**String patterns found:**
- `unknown` (170)
- `person:dutch_name` (104)
- `nav:menu_item` (44)
- `heritage:institution_type` (43)
- `org:foundation_or_association` (13)
**Samples:**
- "Voorstelling 'De Ezelprins' door Spiegelbeesten......" (bibliotheekkampen.nl)
- "Vandaag open... Bibliotheek Kampen 09:00 - 17:00......" (bibliotheekkampen.nl)
- "flexslider fullscreen-carousel hero-slider..." (museumrijswijk.nl)
### meta_title
- **Occurrences**: 282
- **Websites**: 268
**Top XPath patterns:**
- `head/title` (279)
- `head/meta[@property='og:title']` (1)
- `head/meta[@name='apple-mobile-web-app-title']/@content` (1)
- `head/meta[@property='og:title']/@content` (1)
**String patterns found:**
- `person:dutch_name` (224)
- `heritage:institution_type` (85)
- `nav:menu_item` (58)
- `org:foundation_or_association` (41)
- `unknown` (34)
**Samples:**
- "Protestantse Theologische Universiteit: 400 jaar theologisch onderwijs..." (pthu.nl)
- "Heemkundevereniging Heibloem | Heibloem.nu..." (heibloem.nu)
- "Museum Eenigenburg Museum Eenigenburg..." (museumeenigenburg.nl)
### header
- **Occurrences**: 186
- **Websites**: 131
**Top XPath patterns:**
- `body/header` (62)
- `body/*/header` (23)
- `body/*/header/h1` (11)
- `body/*/header/*` (9)
- `body/header/*` (5)
**String patterns found:**
- `unknown` (96)
- `person:dutch_name` (76)
- `nav:menu_item` (26)
- `heritage:institution_type` (18)
- `org:foundation_or_association` (12)
**Samples:**
- "<header class="header preventselection">...Protestantse Theologische Universitei..." (pthu.nl)
- "Heemkundevereniging Heibloem..." (heibloem.nu)
- "Museum Eenigenburg..." (museumeenigenburg.nl)
### meta_other
- **Occurrences**: 164
- **Websites**: 139
**Top XPath patterns:**
- `head` (97)
- `head/script[@type='application/ld+json']` (20)
- `head/script` (14)
- `head/link` (7)
- `head/style` (3)
**String patterns found:**
- `person:dutch_name` (77)
- `unknown` (77)
- `heritage:institution_type` (33)
- `nav:menu_item` (24)
- `gov:municipality` (10)
**Samples:**
- "<title>Home - Hildo Krop Museum</title><link rel="canonical" href="https://www.h..." (hildokrop.nl)
- "Museum Rijswijk..." (museumrijswijk.nl)
- "Gemeentearchief Ede | Gemeentearchief Ede......" (gemeentearchief.ede.nl)
### nav
- **Occurrences**: 106
- **Websites**: 97
**Top XPath patterns:**
- `body/*/nav` (19)
- `body/nav` (16)
- `body/*/header/*/nav` (14)
- `body/header/nav` (14)
- `body/header/*/nav` (8)
**String patterns found:**
- `nav:menu_item` (57)
- `person:dutch_name` (50)
- `unknown` (37)
- `heritage:institution_type` (16)
- `org:foundation_or_association` (9)
**Samples:**
- "Actueel HELD Burgerbus Onderwijs & opvang......" (heibloem.nu)
- "Home Het museum Daglonershuis Kasteel De familie Eenigenburg Blog Contact..." (museumeenigenburg.nl)
- "Home Tentoonstellingen Bezoekersinformatie......" (sonnenborgh.nl)
### paragraph
- **Occurrences**: 93
- **Websites**: 68
**Top XPath patterns:**
- `body/*/p` (67)
- `body/*/ul/li/*/p` (4)
- `body/p` (3)
- `*/p` (2)
- `body/table/tr/td/p` (2)
**String patterns found:**
- `person:dutch_name` (47)
- `unknown` (27)
- `heritage:institution_type` (19)
- `org:foundation_or_association` (11)
- `nav:menu_item` (7)
**Samples:**
- "Heemkundevereniging Heibloem heeft als doel het in brede kring bevorderen van be..." (heibloem.nu)
- "In Museum Eenigenburg wijst een levensgroot kompas......" (museumeenigenburg.nl)
- "Het Museum is zeer geschikt voor het ontvangen van groepen......" (museumeenigenburg.nl)
### main_heading
- **Occurrences**: 74
- **Websites**: 65
**Top XPath patterns:**
- `body/*/h1` (60)
- `*/h1` (4)
- `body/div[@id='pagewrapper']/div[@class='section group']/div[@id='rightcolumn']/div[@class='section group']/div[@class='col span_2_of_2']/h1` (1)
- `body/div[@class='kolomContent home']/*/h1` (1)
- `body/div[@id='wrapper']/div[@id='wrapperinner']/div[@id='content']/div[@id='columnwrapper']/*/div[@id='anchorContent']/h1` (1)
**String patterns found:**
- `person:dutch_name` (37)
- `unknown` (31)
- `heritage:institution_type` (18)
- `nav:menu_item` (5)
- `org:foundation_or_association` (3)
**Samples:**
- "Protestantse Theologische Universiteit..." (pthu.nl)
- "Terug in de tijd..." (museumeenigenburg.nl)
- "Ook leuk voor groepen..." (museumeenigenburg.nl)
### sub_heading
- **Occurrences**: 69
- **Websites**: 46
**Top XPath patterns:**
- `body/*/h2` (39)
- `body/*/h3` (15)
- `body/h2` (3)
- `*/h2` (2)
- `body/*/h4` (2)
**String patterns found:**
- `unknown` (42)
- `person:dutch_name` (23)
- `heritage:institution_type` (5)
- `nav:menu_item` (3)
- `org:foundation_or_association` (3)
**Samples:**
- "Verenigingsnieuws..." (heibloem.nu)
- "Contactgegevens..." (heibloem.nu)
- "Home..." (bibliotheekkampen.nl)
### meta_tag
- **Occurrences**: 60
- **Websites**: 55
**Top XPath patterns:**
- `head/meta[@name='description']` (20)
- `head/meta` (13)
- `head/meta/@content` (9)
- `head/meta[@name='description']/@content` (8)
- `head/meta[@property='og:url']` (2)
**String patterns found:**
- `person:dutch_name` (33)
- `heritage:institution_type` (20)
- `unknown` (20)
- `org:foundation_or_association` (8)
- `nav:menu_item` (6)
**Samples:**
- "Prachtige tijdelijke en vaste tentoonstellingen zijn te zien bij Museum Catharij..." (catharijneconvent.nl)
- "Energie is overal. We gebruiken het constant en ervaren er veel gemak van. Tegel..." (vindhetuit.nl)
- "Aanmeldformulier Via dit formulier kunt u zich aanmelden bij de Stichting Vriend..." (verbindingsdienst.nl)
### footer
- **Occurrences**: 43
- **Websites**: 35
**Top XPath patterns:**
- `body/footer` (17)
- `body/*/footer` (8)
- `body/footer/*` (6)
- `body/*/footer/*` (3)
- `body/footer/*/p` (3)
**String patterns found:**
- `person:dutch_name` (18)
- `nav:menu_item` (13)
- `unknown` (10)
- `loc:postal_code_nl` (5)
- `loc:street_address` (4)
**Samples:**
- "Heibloem.nu Nieuwsbrief Contact......" (heibloem.nu)
- "Made with by Encore Digital Agency......" (heibloem.nu)
- "Contactinformatie Kerkweg 5... Toegangsprijzen... OPENINGS TIJDEN......" (museumeenigenburg.nl)
### list
- **Occurrences**: 39
- **Websites**: 30
**Top XPath patterns:**
- `body/*/ul` (30)
- `*/ul` (2)
- `*/ul/li` (1)
- `body/ul` (1)
- `body/*/ol` (1)
**String patterns found:**
- `person:dutch_name` (17)
- `unknown` (16)
- `nav:menu_item` (9)
- `heritage:institution_type` (5)
- `org:foundation_or_association` (1)
**Samples:**
- "<ul class="metanav">......" (bibliotheekkampen.nl)
- "Nu te zien [list of exhibitions]..." (catharijneconvent.nl)
- "Let op: in december hebben we aangepaste openingstijden. Bekijk hier de openings..." (debieb.nl)
### contact
- **Occurrences**: 3
- **Websites**: 3
**Top XPath patterns:**
- `body/*/address` (1)
- `body/div[@id='contact']/*/ul` (1)
- `div[@class='xr_group' and ./span[text()='Contact']]` (1)
**String patterns found:**
- `loc:postal_code_nl` (1)
- `loc:street_address` (1)
- `nav:menu_item` (1)
- `person:dutch_name` (1)
**Samples:**
- "Bongerd 18, 6411 JM Heerlen..." (museum.nl)
- "Winsum, Hoofdstraat W 70 Uithuizen, Hoofdstraat-West 1..." (hethogeland.nl)
- "Contact Contactgegevens..." (heemkundenispen.nl)
### main_content
- **Occurrences**: 1
- **Websites**: 1
**Top XPath patterns:**
- `body/div[@id='readspeaker']/*/article[@role='region']/*/p` (1)
**String patterns found:**
- `info:opening_hours` (1)
**Samples:**
- "Let op: Op donderdag 27 en vrijdag 28 november 2025 is Burgerzaken gesloten......" (veendam.nl)
### sidebar
- **Occurrences**: 1
- **Websites**: 1
**Top XPath patterns:**
- `body/aside[@id='leftbar']` (1)
**String patterns found:**
- `org:foundation_or_association` (1)
- `nav:menu_item` (1)
- `person:dutch_name` (1)
**Samples:**
- "Vereniging Adres / Contact Kwartaalblad......" (oudhoorn.nl)
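The report does not state how an XPath is assigned to a category. The following is a minimal sketch, assuming ordered first-match rules; `CATEGORY_RULES` and `categorize` are hypothetical names, and the rules are inferred only from the XPath patterns listed under each category above. The `contact`, `main_content`, and `sidebar` categories are left out because the report gives too few examples to infer a rule.

```python
import re

# Hypothetical, ordered rules (first match wins). These are inferred from the
# XPath patterns listed per category in this report, not taken from the generator.
CATEGORY_RULES = [
    ("meta_title", re.compile(r"^head/title$|og:title|web-app-title")),
    ("meta_tag", re.compile(r"^head/meta")),
    ("meta_other", re.compile(r"^head(/|$)")),
    ("nav", re.compile(r"/nav(/|\[|$)")),
    ("footer", re.compile(r"/footer(/|\[|$)")),
    ("header", re.compile(r"/header(/|\[|$)")),
    ("main_heading", re.compile(r"/h1$")),
    ("sub_heading", re.compile(r"/h[2-6]$")),
    ("paragraph", re.compile(r"/p$")),
    ("list", re.compile(r"/(ul|ol)(/li)?$")),
]


def categorize(xpath: str) -> str:
    """Return the first matching layout category, or 'other' as the fallback."""
    for name, rule in CATEGORY_RULES:
        if rule.search(xpath):
            return name
    return "other"
```

Rule order matters in this sketch: `nav` is tested before `header` so that `body/*/header/*/nav` is counted as navigation, which matches how the report lists that pattern.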
## Entity Type Distribution
| Entity Type | Count | Websites | Primary Location |
|-------------|-------|----------|------------------|
| APP.URL | 1071 | 406 | meta_other |
| GRP.HER | 582 | 313 | meta_title |
| TOP.SET | 426 | 245 | meta_tag |
| THG.LNG | 398 | 305 | other |
| TMP.DAB | 285 | 155 | meta_tag |
| WRK.WEB | 145 | 75 | meta_other |
| APP.TIT | 132 | 97 | meta_title |
| TOP.ADR | 130 | 77 | footer |
| AGT.PER | 129 | 68 | paragraph |
| GRP.ASS | 122 | 76 | meta_title |
| APP.TTL | 119 | 78 | meta_title |
| GRP.GOV | 118 | 60 | meta_title |
| THG.CON | 103 | 63 | meta_tag |
| APP.EXH | 98 | 61 | other |
| TOP.REG | 80 | 65 | meta_tag |
| TMP.OPH | 74 | 46 | other |
| QTY.CNT | 69 | 38 | meta_tag |
| TOP.CTY | 68 | 55 | meta_tag |
| THG.EVT | 64 | 31 | paragraph |
| TOP.BLD | 56 | 38 | paragraph |
### Entity Type Details
#### APP.URL
- **Total occurrences**: 1071
- **Unique websites**: 406
**Found in DOM categories:**
- meta_other: 596 (55.6%)
- meta_tag: 126 (11.8%)
- other: 81 (7.6%)
- list: 64 (6.0%)
- nav: 55 (5.1%)
**Common XPath patterns:**
- `head/link/@href` (168)
- `head/link[@rel='canonical']/@href` (108)
- `head/link` (67)
- `head/script[@type='application/ld+json']` (59)
- `head/meta/@content` (47)
**Samples:**
- "https://www.pthu.nl/" @ `head/link/@href`
- "https://heibloem.nu/vereniging/heemkundevereniging-heibloem" @ `head/meta[@property='og:url']`
- "https://heibloem.nu/held" @ `body/*/footer/*/p/a`
#### GRP.HER
- **Total occurrences**: 582
- **Unique websites**: 313
**Found in DOM categories:**
- meta_title: 282 (48.5%)
- meta_tag: 97 (16.7%)
- other: 51 (8.8%)
- meta_other: 43 (7.4%)
- paragraph: 34 (5.8%)
**Common XPath patterns:**
- `head/title` (261)
- `head/meta/@content` (41)
- `head/meta[@name='description']/@content` (24)
- `head/script[@type='application/ld+json']` (20)
- `body/*/dl/*/dt/a` (17)
**Samples:**
- "Museum Eenigenburg" @ `head/title`
- "Musea Schagen" @ `body/*/footer/*/a/*/p`
- "Hildo Krop Museum" @ `head/title`
#### TOP.SET
- **Total occurrences**: 426
- **Unique websites**: 245
**Found in DOM categories:**
- meta_tag: 116 (27.2%)
- meta_title: 94 (22.1%)
- paragraph: 72 (16.9%)
- other: 32 (7.5%)
- meta_other: 28 (6.6%)
**Common XPath patterns:**
- `head/title` (87)
- `head/meta[@name='description']/@content` (45)
- `body/*/p` (38)
- `head/meta/@content` (32)
- `body/*/dl/*/dt/a` (10)
**Samples:**
- "Amsterdam" @ `head/script[@type='application/ld+json']`
- "Eenigenburg" @ `body/*/header/*/nav/*/ul/li/a`
- "Steenwijk" @ `head/meta[@property='og:site_name']`
#### THG.LNG
- **Total occurrences**: 398
- **Unique websites**: 305
**Found in DOM categories:**
- other: 260 (65.3%)
- meta_tag: 66 (16.6%)
- meta_other: 31 (7.8%)
- nav: 17 (4.3%)
- paragraph: 15 (3.8%)
**Common XPath patterns:**
- `@lang` (113)
- `html/@lang` (62)
- `html` (30)
- `head/meta[@property='og:locale']/@content` (18)
- `head/meta/@content` (15)
**Samples:**
- "nl_NL" @ `head/meta[@property='og:locale']`
- "nl-NL" @ `[@lang='nl-NL']`
- "nl_NL" @ `html/@lang`
#### TMP.DAB
- **Total occurrences**: 285
- **Unique websites**: 155
**Found in DOM categories:**
- meta_tag: 100 (35.1%)
- meta_other: 70 (24.6%)
- paragraph: 49 (17.2%)
- other: 37 (13.0%)
- sub_heading: 9 (3.2%)
**Common XPath patterns:**
- `head/script[@type='application/ld+json']` (33)
- `body/*/p` (30)
- `head/meta[@property='article:modified_time']/@content` (29)
- `head/meta/@content` (28)
- `head/script[@type='application/ld+json']/text()` (12)
**Samples:**
- "2024" @ `li[@id='menu-item-2747']/a`
- "2025" @ `li[@id='menu-item-2758']/a`
- "2022-05-30T07:37:02+00:00" @ `head/meta[@property='article:modified_time']/@content`
#### WRK.WEB
- **Total occurrences**: 145
- **Unique websites**: 75
**Found in DOM categories:**
- meta_other: 39 (26.9%)
- nav: 28 (19.3%)
- other: 27 (18.6%)
- meta_title: 15 (10.3%)
- list: 14 (9.7%)
**Common XPath patterns:**
- `body/*/nav/*/ul/li/*/h3/a` (13)
- `head/script[@type='application/ld+json']` (11)
- `HTML/BODY/DIV/DIV/UL/LI/A` (8)
- `head/script/text()` (7)
- `body/ul/li/a` (6)
**Samples:**
- "Home - Hildo Krop Museum" @ `head/title`
- "https://www.sonnenborgh.nl/home" @ `head/link[@rel='canonical']`
- "Strategische visie" @ `body/*/h2`
#### APP.TIT
- **Total occurrences**: 132
- **Unique websites**: 97
**Found in DOM categories:**
- meta_title: 62 (47.0%)
- sub_heading: 18 (13.6%)
- meta_tag: 16 (12.1%)
- other: 13 (9.8%)
- meta_other: 7 (5.3%)
**Common XPath patterns:**
- `head/title` (53)
- `body/*/h2` (8)
- `body/*/a/*/h4` (8)
- `head/meta/@content` (5)
- `head/meta[@property='og:title']/@content` (4)
**Samples:**
- "Heemkundevereniging Heibloem" @ `head/title`
- "Uitnodiging lezing "Ontdek de sporen van je voorouders"" @ `body/*/h2`
- "Versiering plaquette "Sporen die bleven"" @ `body/*/h2`
#### TOP.ADR
- **Total occurrences**: 130
- **Unique websites**: 77
**Found in DOM categories:**
- footer: 31 (23.8%)
- paragraph: 23 (17.7%)
- other: 21 (16.2%)
- meta_other: 18 (13.8%)
- list: 13 (10.0%)
**Common XPath patterns:**
- `body/footer/*/p` (13)
- `body/*/p` (9)
- `body/*/ul/li` (7)
- `head/script/text()` (7)
- `head/script[@type='application/ld+json']` (6)
**Samples:**
- "De Boelelaan 1105, 1081 HV, Amsterdam, Nederland" @ `head/script[@type='application/ld+json']`
- "Kerkweg 5, 1744 JD Eenigenburg, Noord-Holland" @ `body/*/footer/*/ol/li`
- "Herenstraat 67" @ `body/footer/*`
#### AGT.PER
- **Total occurrences**: 129
- **Unique websites**: 68
**Found in DOM categories:**
- paragraph: 47 (36.4%)
- meta_tag: 29 (22.5%)
- other: 21 (16.3%)
- sub_heading: 10 (7.8%)
- list: 5 (3.9%)
**Common XPath patterns:**
- `body/*/p` (12)
- `body/*/p/text()` (10)
- `head/meta[@name='description']/@content` (8)
- `body/*/p/*` (8)
- `head/meta/@content` (8)
**Samples:**
- "Erik Pape" @ `body/*/ul/li/*/h2/a`
- "Caren van Herwaarden" @ `body/*/ul/li/*/h2/a`
- "Prof. dr. Chris Stolwijk" @ `body/span[contains(text(), 'Prof. dr. Chris Stolwijk')]`
#### GRP.ASS
- **Total occurrences**: 122
- **Unique websites**: 76
**Found in DOM categories:**
- meta_title: 48 (39.3%)
- meta_tag: 17 (13.9%)
- paragraph: 15 (12.3%)
- other: 11 (9.0%)
- header: 7 (5.7%)
**Common XPath patterns:**
- `head/title` (44)
- `head/meta/@content` (12)
- `body/*/p` (6)
- `head/script` (4)
- `body/*` (4)
**Samples:**
- "Heemkundevereniging Heibloem" @ `head/title`
- "Heemkundevereniging Heibloem" @ `body/*/header/h1/*`
- "Heemkundevereniging Heibloem" @ `body/*/p/strong`
## XPath → Entity Type Mapping
Which entity types are typically found at which DOM locations:
### `head/title` (608 entities)
- GRP.HER: 261 (42.9%)
- TOP.SET: 87 (14.3%)
- APP.TIT: 53 (8.7%)
- APP.TTL: 45 (7.4%)
- GRP.ASS: 44 (7.2%)
### `head/meta/@content` (278 entities)
- APP.URL: 47 (16.9%)
- GRP.HER: 41 (14.7%)
- TOP.SET: 32 (11.5%)
- TMP.DAB: 28 (10.1%)
- THG.LNG: 15 (5.4%)
### `body/*/p` (277 entities)
- TOP.SET: 38 (13.7%)
- TMP.DAB: 30 (10.8%)
- GRP.HER: 17 (6.1%)
- AGT.PER: 12 (4.3%)
- THG.ART: 11 (4.0%)
### `head/script[@type='application/ld+json']` (219 entities)
- APP.URL: 59 (26.9%)
- TMP.DAB: 33 (15.1%)
- GRP.HER: 20 (9.1%)
- WRK.WEB: 11 (5.0%)
- TOP.SET: 9 (4.1%)
### `head/meta[@name='description']/@content` (189 entities)
- TOP.SET: 45 (23.8%)
- GRP.HER: 24 (12.7%)
- THG.CON: 17 (9.0%)
- TOP.REG: 11 (5.8%)
- GRP.GOV: 10 (5.3%)
### `head/link/@href` (179 entities)
- APP.URL: 168 (93.9%)
- WRK.WEB: 3 (1.7%)
- APP.WEB: 3 (1.7%)
- WRK.URL: 2 (1.1%)
- TOP.SET: 1 (0.6%)
### `@lang` (115 entities)
- THG.LNG: 113 (98.3%)
- TOP.CTY: 2 (1.7%)
### `head/link[@rel='canonical']/@href` (111 entities)
- APP.URL: 108 (97.3%)
- WRK.WEB: 2 (1.8%)
- WRK.URL: 1 (0.9%)
### `head/script` (106 entities)
- APP.URL: 23 (21.7%)
- TMP.DAB: 12 (11.3%)
- GRP.HER: 6 (5.7%)
- QTY.MSR: 6 (5.7%)
- TOP.SET: 5 (4.7%)
### `head/script/text()` (93 entities)
- APP.URL: 31 (33.3%)
- TMP.DAB: 9 (9.7%)
- GRP.HER: 9 (9.7%)
- WRK.WEB: 7 (7.5%)
- TOP.ADR: 7 (7.5%)
### `head/link` (69 entities)
- APP.URL: 67 (97.1%)
- APP.WKD: 1 (1.4%)
- WRK.WEB: 1 (1.4%)
### `html/@lang` (62 entities)
- THG.LNG: 62 (100.0%)
### `head/meta` (61 entities)
- APP.URL: 12 (19.7%)
- TMP.DAB: 5 (8.2%)
- GRP.COR: 4 (6.6%)
- GRP.HER: 4 (6.6%)
- TOP.SET: 4 (6.6%)
### `head/script[@type='application/ld+json']/text()` (57 entities)
- TMP.DAB: 12 (21.1%)
- APP.URL: 11 (19.3%)
- GRP.HER: 6 (10.5%)
- TOP.SET: 5 (8.8%)
- APP.NAM: 3 (5.3%)
### `body/*/h2` (54 entities)
- APP.TIT: 8 (14.8%)
- APP.EXH: 8 (14.8%)
- GRP.HER: 8 (14.8%)
- TOP.SET: 5 (9.3%)
- TMP.DAB: 4 (7.4%)
### `body/*` (54 entities)
- TMP.DAB: 9 (16.7%)
- WRK.COL: 7 (13.0%)
- TOP.ADR: 6 (11.1%)
- GRP.ASS: 4 (7.4%)
- TMP.TAB: 4 (7.4%)
### `head/link[@rel='canonical']` (46 entities)
- APP.URL: 41 (89.1%)
- WRK.WEB: 5 (10.9%)
### `head/meta[@property='og:description']/@content` (41 entities)
- TOP.SET: 6 (14.6%)
- TMP.DAB: 5 (12.2%)
- THG.CON: 4 (9.8%)
- AGT.PER: 3 (7.3%)
- APP.TIT: 3 (7.3%)
### `body/*/p/a` (40 entities)
- APP.URL: 20 (50.0%)
- THG.EVT: 5 (12.5%)
- TOP.SET: 3 (7.5%)
- APP.TTL: 3 (7.5%)
- GRP.GOV: 2 (5.0%)
### `body/*/p/*` (36 entities)
- AGT.PER: 8 (22.2%)
- TOP.ADR: 4 (11.1%)
- TOP.CTY: 3 (8.3%)
- ROL.POS: 2 (5.6%)
- ROL.OCC: 2 (5.6%)
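The per-XPath percentages above can be reproduced from raw claims with a small counting step. This is a sketch only, assuming each claim is available as an `(xpath, entity_type)` pair; the function name `xpath_entity_distribution` and the input shape are hypothetical, not the report generator's actual interface.

```python
from collections import Counter, defaultdict


def xpath_entity_distribution(claims, top_n=5):
    """Group (xpath, entity_type) pairs and compute each type's share per XPath.

    Returns {xpath: (total, [(entity_type, count, percent), ...])}, with at most
    top_n entity types per XPath, most frequent first.
    """
    by_xpath = defaultdict(Counter)
    for xpath, entity_type in claims:
        by_xpath[xpath][entity_type] += 1

    distribution = {}
    for xpath, counts in by_xpath.items():
        total = sum(counts.values())
        rows = [
            (entity_type, n, round(100 * n / total, 1))
            for entity_type, n in counts.most_common(top_n)
        ]
        distribution[xpath] = (total, rows)
    return distribution


# Example using the `@lang` counts reported above (113 THG.LNG, 2 TOP.CTY).
claims = [("@lang", "THG.LNG")] * 113 + [("@lang", "TOP.CTY")] * 2
lang_total, lang_rows = xpath_entity_distribution(claims)["@lang"]
```

Because only the top five types are listed per XPath, the listed percentages do not need to sum to 100%.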
## String Pattern → Entity Type Correlation
Which string patterns are most associated with which entity types:
### `unknown` (3503 occurrences)
- APP.URL: 937 (26.7%)
- TOP.SET: 403 (11.5%)
- THG.LNG: 398 (11.4%)
- TMP.DAB: 269 (7.7%)
- GRP.HER: 132 (3.8%)
### `person:dutch_name` (1146 occurrences)
- GRP.HER: 410 (35.8%)
- GRP.ASS: 101 (8.8%)
- AGT.PER: 89 (7.8%)
- GRP.GOV: 74 (6.5%)
- APP.TIT: 63 (5.5%)
### `heritage:institution_type` (376 occurrences)
- GRP.HER: 233 (62.0%)
- APP.URL: 50 (13.3%)
- APP.TIT: 20 (5.3%)
- APP.TTL: 19 (5.1%)
- WRK.WEB: 13 (3.5%)
### `org:foundation_or_association` (155 occurrences)
- GRP.ASS: 75 (48.4%)
- GRP.HER: 35 (22.6%)
- APP.URL: 15 (9.7%)
- WRK.WEB: 6 (3.9%)
- APP.TIT: 5 (3.2%)
### `nav:menu_item` (141 occurrences)
- APP.URL: 46 (32.6%)
- APP.TTL: 23 (16.3%)
- APP.TIT: 18 (12.8%)
- WRK.WEB: 14 (9.9%)
- WRK.COL: 13 (9.2%)
### `loc:postal_code_nl` (66 occurrences)
- TOP.ADR: 55 (83.3%)
- TOP.SET: 4 (6.1%)
- QTY.CNT: 2 (3.0%)
- APP.COL: 1 (1.5%)
- APP.NAM: 1 (1.5%)
### `gov:municipality` (65 occurrences)
- GRP.GOV: 53 (81.5%)
- TOP.REG: 3 (4.6%)
- TOP.SET: 2 (3.1%)
- APP.TIT: 2 (3.1%)
- APP.TTL: 2 (3.1%)
### `info:time_range` (63 occurrences)
- TMP.OPH: 52 (82.5%)
- TMP.TAB: 9 (14.3%)
- TMP.EXP: 2 (3.2%)
### `loc:street_address` (62 occurrences)
- TOP.ADR: 55 (88.7%)
- GRP.HER: 3 (4.8%)
- GRP.ASS: 1 (1.6%)
- APP.TTL: 1 (1.6%)
- unknown: 1 (1.6%)
### `info:opening_hours` (48 occurrences)
- TMP.OPH: 32 (66.7%)
- TMP.DAB: 13 (27.1%)
- TMP.SET: 2 (4.2%)
- THG.EVT: 1 (2.1%)
### `contact:email` (42 occurrences)
- APP.URL: 25 (59.5%)
- APP.EMAIL: 9 (21.4%)
- APP.NAM: 3 (7.1%)
- APP.PNM: 2 (4.8%)
- THG.PHO: 1 (2.4%)
### `contact:phone` (4 occurrences)
- APP.URL: 3 (75.0%)
- TOP.ADR: 1 (25.0%)
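The string pattern labels above (`loc:postal_code_nl`, `contact:email`, `unknown`, and so on) behave like regex-based tags. The sketch below illustrates the idea for two of them; the regular expressions are assumptions written for illustration, not the definitions in `dutch_web_patterns.yaml`, and `STRING_PATTERNS` and `label_string` are hypothetical names.

```python
import re

# Illustrative regexes only; the project's real pattern definitions are not
# shown in this report.
STRING_PATTERNS = {
    # Dutch postal codes: four digits (no leading zero), optional space, two capitals.
    "loc:postal_code_nl": re.compile(r"\b[1-9]\d{3}\s?[A-Z]{2}\b"),
    # Deliberately simple e-mail shape; not a full address validator.
    "contact:email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def label_string(text: str) -> list[str]:
    """Return every pattern label that matches, or ['unknown'] if none do."""
    labels = [name for name, rx in STRING_PATTERNS.items() if rx.search(text)]
    return labels or ["unknown"]
```

The function returns a list because a single string can carry several labels, as the `sidebar` entry earlier in this report suggests (one occurrence, three string patterns).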
## Recommendations for dutch_web_patterns.yaml
Based on the analysis, the following layout-aware patterns could be added:
### High-Value XPath Targets
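1. **`head/title`** - Most common location for institution names: GRP.HER accounts for 42.9% of the 608 entities found here, GRP.ASS for another 7.2%
2. **`body/*/h1`** - Primary heading (60 of the 74 `main_heading` occurrences), often the institution name
3. **`body/*/nav`** - Navigation menu, dominated by `nav:menu_item` strings (candidates for discard patterns)
4. **`body/*/footer`** - Contact info, address, social links; footer is the primary location for TOP.ADR (23.8%)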
### Suggested Pattern Additions
```yaml
# Add xpath_hints to patterns for better precision
entity_patterns:
  organizations:
    heritage_institutions:
      patterns:
        - pattern: '^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$'
          xpath_hints:
            - 'head/title'
            - 'body/*/h1'
          confidence_boost: 0.2  # Higher confidence when found at expected location
```
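How a matcher might consume `xpath_hints` and `confidence_boost` is not specified in this report. The sketch below assumes a base confidence of 0.6, case-insensitive matching, and glob-style comparison of the XPath against each hint; all three are assumptions, and `RULE` and `score` are hypothetical names.

```python
import re
from fnmatch import fnmatchcase

# Mirrors the YAML entry above as a plain dict (loading the YAML is omitted).
RULE = {
    "pattern": r"^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$",
    "xpath_hints": ["head/title", "body/*/h1"],
    "confidence_boost": 0.2,
}


def score(text, xpath, rule, base_confidence=0.6):
    """Return a confidence for a match, or None if the text does not match.

    base_confidence is an assumed default. Hints are compared glob-style, so
    '*' may span several path steps; hints containing '[' predicates would need
    escaping before being passed to fnmatchcase.
    """
    if not re.search(rule["pattern"], text, flags=re.IGNORECASE):
        return None
    hinted = any(fnmatchcase(xpath, hint) for hint in rule["xpath_hints"])
    boost = rule["confidence_boost"] if hinted else 0.0
    return min(1.0, base_confidence + boost)
```

Note that the pattern as written requires a leading article ("het" or "de"), so a title such as "Museum Eenigenburg" from the samples above would not match it, regardless of where it is found.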