glam/reports/layout_analysis_full.md
2025-12-14 17:09:55 +01:00

Web Archive Layout Pattern Analysis

Generated: 2025-12-13T15:05:22.346619

Summary

  • Annotation files analyzed: 1525
  • Unique websites: 1343
  • Total layout claims: 4302
  • Total entity claims: 15252

Layout Categories

Distribution of content by DOM location category:

other

  • Occurrences: 896
  • Websites: 457

Top XPath patterns:

  • body/* (471)
  • body (24)
  • body/*/table (23)
  • body/*/dl (15)
  • * (14)

String patterns found:

  • unknown (511)
  • person:dutch_name (263)
  • heritage:institution_type (119)
  • nav:menu_item (114)
  • org:foundation_or_association (41)

Samples:

  • "Voorstelling 'De Ezelprins' door Spiegelbeesten......" (bibliotheekkampen.nl)
  • "Vandaag open... Bibliotheek Kampen 09:00 - 17:00......" (bibliotheekkampen.nl)
  • "flexslider fullscreen-carousel hero-slider..." (museumrijswijk.nl)

meta_title

  • Occurrences: 828
  • Websites: 741

Top XPath patterns:

  • head/title (815)
  • head/meta[@property='og:title'] (8)
  • head/meta[@property='og:title']/@content (2)
  • head/meta[@name='apple-mobile-web-app-title']/@content (1)
  • head/meta[@name='twitter:title'] (1)

String patterns found:

  • person:dutch_name (651)
  • heritage:institution_type (266)
  • nav:menu_item (194)
  • unknown (99)
  • org:foundation_or_association (92)

Samples:

  • "Protestantse Theologische Universiteit: 400 jaar theologisch onderwijs..." (pthu.nl)
  • "Heemkundevereniging Heibloem | Heibloem.nu..." (heibloem.nu)
  • "Museum Eenigenburg Museum Eenigenburg..." (museumeenigenburg.nl)

header

  • Occurrences: 561
  • Websites: 386

Top XPath patterns:

  • body/header (204)
  • body/*/header (65)
  • body/*/header/h1 (29)
  • body/*/header/* (22)
  • body/header/* (13)

String patterns found:

  • unknown (285)
  • person:dutch_name (210)
  • nav:menu_item (79)
  • heritage:institution_type (63)
  • org:foundation_or_association (31)

Samples:

  • "...Protestantse Theologische Universitei..." (pthu.nl)
  • "Heemkundevereniging Heibloem..." (heibloem.nu)
  • "Museum Eenigenburg..." (museumeenigenburg.nl)

meta_other

  • Occurrences: 538
  • Websites: 436

Top XPath patterns:

  • head (328)
  • head/script[@type='application/ld+json'] (76)
  • head/script (42)
  • head/link[@rel='canonical'] (22)
  • head/link (13)

String patterns found:

  • person:dutch_name (264)
  • unknown (242)
  • heritage:institution_type (107)
  • nav:menu_item (81)
  • org:foundation_or_association (32)

Samples:

  • "<link rel="canonical" href="https://www.h..." (hildokrop.nl)
  • "Museum Rijswijk..." (museumrijswijk.nl)
  • "Gemeentearchief Ede | Gemeentearchief Ede......" (gemeentearchief.ede.nl)

paragraph

  • Occurrences: 295
  • Websites: 190

Top XPath patterns:

  • body/*/p (202)
  • */p (17)
  • body/*/ul/li/*/p (8)
  • body/p (4)
  • body/*/p/strong (4)

String patterns found:

  • person:dutch_name (152)
  • unknown (87)
  • heritage:institution_type (73)
  • org:foundation_or_association (25)
  • nav:menu_item (23)

Samples:

  • "Heemkundevereniging Heibloem heeft als doel het in brede kring bevorderen van be..." (heibloem.nu)
  • "In Museum Eenigenburg wijst een levensgroot kompas......" (museumeenigenburg.nl)
  • "Het Museum is zeer geschikt voor het ontvangen van groepen......" (museumeenigenburg.nl)

nav

  • Occurrences: 285
  • Websites: 247

Top XPath patterns:

  • body/*/nav (57)
  • body/nav (43)
  • body/header/nav (34)
  • body/*/header/*/nav (30)
  • body/header/*/nav (17)

String patterns found:

  • nav:menu_item (137)
  • unknown (117)
  • person:dutch_name (105)
  • heritage:institution_type (32)
  • org:foundation_or_association (16)

Samples:

  • "Actueel HELD Burgerbus Onderwijs & opvang......" (heibloem.nu)
  • "Home Het museum Daglonershuis Kasteel De familie Eenigenburg Blog Contact..." (museumeenigenburg.nl)
  • "Home Tentoonstellingen Bezoekersinformatie......" (sonnenborgh.nl)

sub_heading

  • Occurrences: 224
  • Websites: 128

Top XPath patterns:

  • body/*/h2 (117)
  • body/*/h3 (65)
  • */h2 (8)
  • body/h2 (6)
  • body/*/h4 (4)

String patterns found:

  • unknown (138)
  • person:dutch_name (71)
  • heritage:institution_type (36)
  • nav:menu_item (9)
  • org:foundation_or_association (9)

Samples:

  • "Verenigingsnieuws..." (heibloem.nu)
  • "Contactgegevens..." (heibloem.nu)
  • "Home..." (bibliotheekkampen.nl)

main_heading

  • Occurrences: 222
  • Websites: 182

Top XPath patterns:

  • body/*/h1 (175)
  • */h1 (16)
  • */a/*/h1 (2)
  • body/*/ul/li/h1 (2)
  • body/section[@class='uk-section head']/*/div[@class='title']/h1 (2)

String patterns found:

  • person:dutch_name (110)
  • unknown (90)
  • heritage:institution_type (67)
  • nav:menu_item (12)
  • org:foundation_or_association (9)

Samples:

  • "Protestantse Theologische Universiteit..." (pthu.nl)
  • "Terug in de tijd..." (museumeenigenburg.nl)
  • "Ook leuk voor groepen..." (museumeenigenburg.nl)

meta_tag

  • Occurrences: 183
  • Websites: 165

Top XPath patterns:

  • head/meta[@name='description'] (69)
  • head/meta (33)
  • head/meta[@name='description']/@content (29)
  • head/meta/@content (22)
  • head/meta[@property='og:description'] (5)

String patterns found:

  • person:dutch_name (98)
  • unknown (60)
  • heritage:institution_type (56)
  • nav:menu_item (16)
  • org:foundation_or_association (14)

Samples:

  • "Prachtige tijdelijke en vaste tentoonstellingen zijn te zien bij Museum Catharij..." (catharijneconvent.nl)
  • "Energie is overal. We gebruiken het constant en ervaren er veel gemak van. Tegel..." (vindhetuit.nl)
  • "Aanmeldformulier Via dit formulier kunt u zich aanmelden bij de Stichting Vriend..." (verbindingsdienst.nl)

list

  • Occurrences: 130
  • Websites: 92

Top XPath patterns:

  • body/*/ul (93)
  • */ul (5)
  • body/*/ul[@class='metanav'] (5)
  • body/ul (2)
  • body/*/ol (2)

String patterns found:

  • unknown (62)
  • person:dutch_name (47)
  • nav:menu_item (29)
  • heritage:institution_type (19)
  • org:foundation_or_association (4)

Samples:

  • "......" (bibliotheekkampen.nl)
  • "Nu te zien [list of exhibitions]..." (catharijneconvent.nl)
  • "Let op: in december hebben we aangepaste openingstijden. Bekijk hier de openings..." (debieb.nl)

    footer

    • Occurrences: 119
    • Websites: 92

    Top XPath patterns:

    • body/footer (44)
    • body/*/footer (16)
    • body/footer/* (10)
    • footer (6)
    • body/footer/*/p (5)

    String patterns found:

    • person:dutch_name (41)
    • nav:menu_item (39)
    • unknown (37)
    • loc:postal_code_nl (12)
    • loc:street_address (10)

    Samples:

    • "Heibloem.nu Nieuwsbrief Contact......" (heibloem.nu)
    • "Made with ♥ by Encore Digital Agency......" (heibloem.nu)
    • "Contactinformatie Kerkweg 5... Toegangsprijzen... OPENINGS TIJDEN......" (museumeenigenburg.nl)

    main_content

    • Occurrences: 10
    • Websites: 6

    Top XPath patterns:

    • body/div[@id='page_wrapper']/main[@id='content_main']/div[@class='cm_column_wrapper']/div[@class='cm_column']/p (3)
    • body/div[@id='readspeaker']/*/article[@role='region']/*/p (1)
    • body/main[@id='main']/div[@id='content-wrap']/div[@id='primary']/div[@id='content']/*/div[@class='entry']/div[@data-elementor-type='wp-post']/section[@data-id='150f6235'] (1)
    • body/div[@id='wrapper']/div[@id='mainCntr']/main[@id='contentCntr']/div[@id='generatedContent']/div[@class='timeItem'] (1)
    • body/div[@id='wrapper']/div[@id='mainCntr']/main[@id='contentCntr']/div[@id='generatedContent']/div[@class='locationItem'] (1)

    String patterns found:

    • unknown (7)
    • person:dutch_name (2)
    • info:opening_hours (1)

    Samples:

    • "Let op: Op donderdag 27 en vrijdag 28 november 2025 is Burgerzaken gesloten......" (veendam.nl)
    • "september 2025: Arfdeeltje 53......" (lemelsarfgoed.nl)
    • "19 dec. 2024: Arfdeeltje 52......" (lemelsarfgoed.nl)

    contact

    • Occurrences: 10
    • Websites: 7

    Top XPath patterns:

    • body/*/address (1)
    • body/div[@id='contact']/*/ul (1)
    • div[@class='xr_group' and ./span[text()='Contact']] (1)
    • body/div[@id='wrapper']/div[@id='rechterkolom']/div[@class='moduletablecontact']/div[@class='customcontact'] (1)
    • h2[text()='Contact en adressen'] (1)

    String patterns found:

    • nav:menu_item (4)
    • loc:street_address (3)
    • person:dutch_name (3)
    • loc:postal_code_nl (2)
    • unknown (2)

    Samples:

    • "Bongerd 18, 6411 JM Heerlen..." (museum.nl)
    • "Winsum, Hoofdstraat W 70 Uithuizen, Hoofdstraat-West 1..." (hethogeland.nl)
    • "Contact Contactgegevens..." (heemkundenispen.nl)

    sidebar

    • Occurrences: 1
    • Websites: 1

    Top XPath patterns:

    • body/aside[@id='leftbar'] (1)

    String patterns found:

    • org:foundation_or_association (1)
    • nav:menu_item (1)
    • person:dutch_name (1)

    Samples:

    • "Vereniging Adres / Contact Kwartaalblad......" (oudhoorn.nl)
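The report does not state the rules that assign an XPath to a DOM location category. The following is a minimal sketch, assuming a first-match rule list; the function name `categorize_xpath` and the rule order are hypothetical, chosen only to be consistent with the top XPath patterns listed per category above. The `main_content` and `contact` categories appear to depend on site-specific ids and classes and are not covered.

```python
import re

# Hypothetical first-match rules (assumption, not the generator's actual logic).
# Order matters: more specific rules come first.
_RULES = [
    ("meta_title", re.compile(r"^head/(title|meta\[@(property|name)='[^']*title'\])")),
    ("meta_tag", re.compile(r"^head/meta\b")),
    ("meta_other", re.compile(r"^head\b")),
    ("footer", re.compile(r"(^|/)footer\b")),
    ("nav", re.compile(r"(^|/)nav\b")),
    ("header", re.compile(r"(^|/)header\b")),
    ("sidebar", re.compile(r"(^|/)aside\b")),
    ("main_heading", re.compile(r"/h1$")),
    ("sub_heading", re.compile(r"/h[2-6]$")),
    ("paragraph", re.compile(r"(^|/)p(/|$)")),
    ("list", re.compile(r"/(ul|ol)(\[[^\]]*\])?$")),
]


def categorize_xpath(xpath: str) -> str:
    """Return the first category whose rule matches, else 'other'."""
    for category, rule in _RULES:
        if rule.search(xpath):
            return category
    return "other"


for xp in ["head/title", "body/header/nav", "body/*/header/h1", "body/*/table"]:
    print(xp, "->", categorize_xpath(xp))
```

Placing `nav` and `footer` before `header` and `paragraph` mirrors the report, where `body/header/nav` is counted under nav and `body/footer/*/p` under footer.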

    Entity Type Distribution

    Entity Type   Count   Websites   Primary Location
    APP.URL        3196       1176   meta_other
    GRP.HER        1805        891   meta_title
    TOP.SET        1344        722   meta_tag
    THG.LNG        1309        921   other
    TMP.DAB         928        476   meta_tag
    TOP.ADR         412        233   footer
    GRP.ASS         401        230   meta_title
    GRP.GOV         381        168   meta_title
    APP.TIT         380        278   meta_title
    AGT.PER         334        180   paragraph
    WRK.WEB         316        191   meta_other
    APP.TTL         293        206   meta_title
    THG.CON         260        162   meta_tag
    APP.EXH         259        166   other
    TOP.REG         254        192   meta_tag
    TOP.CTY         234        179   meta_tag
    TMP.OPH         217        130   paragraph
    QTY.MSR         179         85   meta_tag
    QTY.CNT         176        106   meta_tag
    GRP.COR         152        110   meta_tag
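The "Primary Location" column appears to be the DOM category in which an entity type occurs most often. A minimal sketch under that assumption, using the APP.URL category counts from this report; the helper name `primary_location` is hypothetical.

```python
from collections import Counter


def primary_location(category_counts, total):
    """Return the most frequent DOM category and its share of `total` in percent."""
    category, count = Counter(category_counts).most_common(1)[0]
    return category, round(100 * count / total, 1)


# APP.URL counts per DOM category as listed in this report (top five only).
app_url = {"meta_other": 1705, "meta_tag": 455, "other": 293, "list": 162, "footer": 149}
print(primary_location(app_url, total=3196))  # ('meta_other', 53.3)
```

The share is taken against the total occurrence count (3196), not against the sum of the five listed categories, which matches the 53.3% reported for APP.URL.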

    Entity Type Details

    APP.URL

    • Total occurrences: 3196
    • Unique websites: 1176

    Found in DOM categories:

    • meta_other: 1705 (53.3%)
    • meta_tag: 455 (14.2%)
    • other: 293 (9.2%)
    • list: 162 (5.1%)
    • footer: 149 (4.7%)

    Common XPath patterns:

    • head/link/@href (441)
    • head/link[@rel='canonical']/@href (342)
    • head/script[@type='application/ld+json'] (186)
    • head/meta/@content (175)
    • head/link (151)

    Samples:

    GRP.HER

    • Total occurrences: 1805
    • Unique websites: 891

    Found in DOM categories:

    • meta_title: 871 (48.3%)
    • meta_tag: 311 (17.2%)
    • other: 181 (10.0%)
    • meta_other: 131 (7.3%)
    • header: 88 (4.9%)

    Common XPath patterns:

    • head/title (829)
    • head/meta/@content (129)
    • head/meta[@name='description']/@content (66)
    • head/script[@type='application/ld+json'] (56)
    • head/meta[@property='og:site_name']/@content (52)

    Samples:

    • "Museum Eenigenburg" @ head/title
    • "Musea Schagen" @ body/*/footer/*/a/*/p
    • "Hildo Krop Museum" @ head/title

    TOP.SET

    • Total occurrences: 1344
    • Unique websites: 722

    Found in DOM categories:

    • meta_tag: 356 (26.5%)
    • meta_title: 331 (24.6%)
    • paragraph: 209 (15.6%)
    • other: 126 (9.4%)
    • meta_other: 84 (6.2%)

    Common XPath patterns:

    • head/title (315)
    • head/meta[@name='description']/@content (154)
    • head/meta/@content (106)
    • body/*/p (101)
    • head/script[@type='application/ld+json'] (32)

    Samples:

    • "Amsterdam" @ head/script[@type='application/ld+json']
    • "Eenigenburg" @ body/*/header/*/nav/*/ul/li/a
    • "Steenwijk" @ head/meta[@property='og:site_name']

    THG.LNG

    • Total occurrences: 1309
    • Unique websites: 921

    Found in DOM categories:

    • other: 846 (64.6%)
    • meta_tag: 207 (15.8%)
    • meta_other: 150 (11.5%)
    • nav: 32 (2.4%)
    • paragraph: 27 (2.1%)

    Common XPath patterns:

    • @lang (360)
    • html/@lang (192)
    • html (70)
    • head/meta[@property='og:locale']/@content (54)
    • head/meta/@content (50)

    Samples:

    • "nl_NL" @ head/meta[@property='og:locale']
    • "nl-NL" @ [@lang='nl-NL']
    • "nl_NL" @ html/@lang

    TMP.DAB

    • Total occurrences: 928
    • Unique websites: 476

    Found in DOM categories:

    • meta_tag: 396 (42.7%)
    • meta_other: 189 (20.4%)
    • paragraph: 166 (17.9%)
    • other: 97 (10.5%)
    • list: 33 (3.6%)

    Common XPath patterns:

    • head/meta[@property='article:modified_time']/@content (108)
    • head/meta/@content (108)
    • head/script[@type='application/ld+json'] (93)
    • body/*/p (75)
    • head/script/text() (34)

    Samples:

    • "2024" @ li[@id='menu-item-2747']/a
    • "2025" @ li[@id='menu-item-2758']/a
    • "2022-05-30T07:37:02+00:00" @ head/meta[@property='article:modified_time']/@content

    TOP.ADR

    • Total occurrences: 412
    • Unique websites: 233

    Found in DOM categories:

    • footer: 104 (25.2%)
    • paragraph: 81 (19.7%)
    • other: 68 (16.5%)
    • meta_other: 60 (14.6%)
    • contact: 25 (6.1%)

    Common XPath patterns:

    • body/*/p (31)
    • body/footer/*/p (29)
    • head/script[@type='application/ld+json'] (24)
    • head/script/text() (19)
    • body/*/p/* (17)

    Samples:

    • "De Boelelaan 1105, 1081 HV, Amsterdam, Nederland" @ head/script[@type='application/ld+json']
    • "Kerkweg 5, 1744 JD Eenigenburg, Noord-Holland" @ body/*/footer/*/ol/li
    • "Herenstraat 67" @ body/footer/*

    GRP.ASS

    • Total occurrences: 401
    • Unique websites: 230

    Found in DOM categories:

    • meta_title: 164 (40.9%)
    • meta_tag: 58 (14.5%)
    • paragraph: 43 (10.7%)
    • header: 27 (6.7%)
    • other: 27 (6.7%)

    Common XPath patterns:

    • head/title (153)
    • head/meta/@content (25)
    • body/*/p (24)
    • head/script[@type='application/ld+json'] (13)
    • head/meta[@name='description']/@content (11)

    Samples:

    • "Heemkundevereniging Heibloem" @ head/title
    • "Heemkundevereniging Heibloem" @ body/*/header/h1/*
    • "Heemkundevereniging Heibloem" @ body/*/p/strong

    GRP.GOV

    • Total occurrences: 381
    • Unique websites: 168

    Found in DOM categories:

    • meta_title: 131 (34.4%)
    • meta_tag: 126 (33.1%)
    • header: 41 (10.8%)
    • paragraph: 26 (6.8%)
    • other: 17 (4.5%)

    Common XPath patterns:

    • head/title (127)
    • head/meta/@content (42)
    • head/meta[@name='description']/@content (26)
    • head/meta (15)
    • body/*/p (13)

    Samples:

    • "gemeente Ede" @ body/img[@alt='Logo Gemeentearchief Ede']/@alt
    • "Provincie Noord-Holland" @ head/title
    • "Provincie Overijssel" @ head/title

    APP.TIT

    • Total occurrences: 380
    • Unique websites: 278

    Found in DOM categories:

    • meta_title: 179 (47.1%)
    • meta_tag: 39 (10.3%)
    • other: 39 (10.3%)
    • sub_heading: 36 (9.5%)
    • paragraph: 24 (6.3%)

    Common XPath patterns:

    • head/title (153)
    • head/meta/@content (15)
    • body/*/h2 (12)
    • head/meta[@property='og:title']/@content (12)
    • body/*/a/*/h4 (10)

    Samples:

    • "Heemkundevereniging Heibloem" @ head/title
    • "Uitnodiging lezing "Ontdek de sporen van je voorouders"" @ body/*/h2
    • "Versiering plaquette "Sporen die bleven"" @ body/*/h2

    AGT.PER

    • Total occurrences: 334
    • Unique websites: 180

    Found in DOM categories:

    • paragraph: 107 (32.0%)
    • meta_tag: 80 (24.0%)
    • other: 63 (18.9%)
    • sub_heading: 28 (8.4%)
    • meta_other: 17 (5.1%)

    Common XPath patterns:

    • body/*/p (44)
    • head/meta/@content (31)
    • head/meta[@name='description']/@content (23)
    • body/* (11)
    • body/*/p/text() (11)

    Samples:

    • "Erik Pape" @ body/*/ul/li/*/h2/a
    • "Caren van Herwaarden" @ body/*/ul/li/*/h2/a
    • "Prof. dr. Chris Stolwijk" @ body/span[contains(text(), 'Prof. dr. Chris Stolwijk')]

    XPath → Entity Type Mapping

    Which entity types are typically found at which DOM locations:

    head/title (1983 entities)

    • GRP.HER: 829 (41.8%)
    • TOP.SET: 315 (15.9%)
    • GRP.ASS: 153 (7.7%)
    • APP.TIT: 153 (7.7%)
    • GRP.GOV: 127 (6.4%)

    head/meta/@content (976 entities)

    • APP.URL: 175 (17.9%)
    • GRP.HER: 129 (13.2%)
    • TMP.DAB: 108 (11.1%)
    • TOP.SET: 106 (10.9%)
    • THG.LNG: 50 (5.1%)

    body/*/p (770 entities)

    • TOP.SET: 101 (13.1%)
    • TMP.DAB: 75 (9.7%)
    • AGT.PER: 44 (5.7%)
    • GRP.HER: 36 (4.7%)
    • TOP.ADR: 31 (4.0%)

    head/script[@type='application/ld+json'] (641 entities)

    • APP.URL: 186 (29.0%)
    • TMP.DAB: 93 (14.5%)
    • GRP.HER: 56 (8.7%)
    • TOP.SET: 32 (5.0%)
    • QTY.MSR: 32 (5.0%)

    head/meta[@name='description']/@content (619 entities)

    • TOP.SET: 154 (24.9%)
    • GRP.HER: 66 (10.7%)
    • THG.CON: 43 (6.9%)
    • TOP.REG: 35 (5.7%)
    • GRP.GOV: 26 (4.2%)

    head/link/@href (468 entities)

    • APP.URL: 441 (94.2%)
    • WRK.WEB: 7 (1.5%)
    • GRP.HER: 6 (1.3%)
    • TOP.SET: 4 (0.9%)
    • APP.WEB: 3 (0.6%)

    @lang (365 entities)

    • THG.LNG: 360 (98.6%)
    • TOP.CTY: 4 (1.1%)
    • TOP.SET: 1 (0.3%)

    head/link[@rel='canonical']/@href (351 entities)

    • APP.URL: 342 (97.4%)
    • WRK.WEB: 7 (2.0%)
    • WRK.URL: 1 (0.3%)
    • GRP.HER: 1 (0.3%)

    head/script/text() (342 entities)

    • APP.URL: 103 (30.1%)
    • GRP.HER: 37 (10.8%)
    • TMP.DAB: 34 (9.9%)
    • TOP.ADR: 19 (5.6%)
    • THG.LNG: 19 (5.6%)

    head/script (242 entities)

    • APP.URL: 57 (23.6%)
    • TMP.DAB: 32 (13.2%)
    • GRP.HER: 17 (7.0%)
    • THG.LNG: 13 (5.4%)
    • QTY.MSR: 11 (4.5%)

    html/@lang (196 entities)

    • THG.LNG: 192 (98.0%)
    • TOP.CTY: 3 (1.5%)
    • TOP.LNG: 1 (0.5%)

    head/meta (175 entities)

    • APP.URL: 32 (18.3%)
    • TMP.DAB: 22 (12.6%)
    • GRP.GOV: 15 (8.6%)
    • THG.LNG: 14 (8.0%)
    • GRP.HER: 13 (7.4%)

    head/link (161 entities)

    • APP.URL: 151 (93.8%)
    • THG.LNG: 5 (3.1%)
    • WRK.WEB: 2 (1.2%)
    • APP.WKD: 1 (0.6%)
    • GRP.COR: 1 (0.6%)

    [XPath heading missing in source] (150 entities)

    • APP.URL: 143 (95.3%)
    • WRK.WEB: 7 (4.7%)

    body/* (139 entities)

    • TMP.DAB: 24 (17.3%)
    • TOP.ADR: 13 (9.4%)
    • AGT.PER: 11 (7.9%)
    • TMP.TAB: 8 (5.8%)
    • WRK.COL: 7 (5.0%)

    head/meta[@property='article:modified_time']/@content (108 entities)

    • TMP.DAB: 108 (100.0%)

    head/script[@type='application/ld+json']/text() (108 entities)

    • APP.URL: 27 (25.0%)
    • TMP.DAB: 16 (14.8%)
    • TOP.SET: 9 (8.3%)
    • GRP.HER: 9 (8.3%)
    • TOP.REG: 5 (4.6%)

    body/*/ul/li/a (100 entities)

    • APP.URL: 33 (33.0%)
    • APP.COL: 10 (10.0%)
    • WRK.COL: 7 (7.0%)
    • WRK.WEB: 7 (7.0%)
    • APP.EXH: 7 (7.0%)

    body/*/h2 (98 entities)

    • GRP.HER: 21 (21.4%)
    • APP.TIT: 12 (12.2%)
    • APP.EXH: 11 (11.2%)
    • TOP.SET: 11 (11.2%)
    • GRP.ASS: 6 (6.1%)

    head/meta[@property='og:site_name']/@content (98 entities)

    • GRP.HER: 52 (53.1%)
    • GRP.GOV: 12 (12.2%)
    • GRP.ASS: 6 (6.1%)
    • APP.TTL: 5 (5.1%)
    • TOP.SET: 4 (4.1%)
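Each percentage in this section is an entity type's share of all entities found at that XPath (for example, 829 / 1983 = 41.8% for GRP.HER at head/title). A minimal sketch of that aggregation, assuming the input is a list of (xpath, entity_type) pairs; the function name `entity_shares` and the input shape are assumptions, not the generator's actual interface.

```python
from collections import Counter, defaultdict


def entity_shares(pairs, top_n=5):
    """Map each XPath to its top entity types as (type, count, percent) tuples."""
    by_xpath = defaultdict(Counter)
    for xpath, entity_type in pairs:
        by_xpath[xpath][entity_type] += 1

    result = {}
    for xpath, counts in by_xpath.items():
        total = sum(counts.values())
        result[xpath] = [
            (etype, n, round(100 * n / total, 1))
            for etype, n in counts.most_common(top_n)
        ]
    return result


# Rebuild the html/@lang row from the counts in this report.
pairs = (
    [("html/@lang", "THG.LNG")] * 192
    + [("html/@lang", "TOP.CTY")] * 3
    + [("html/@lang", "TOP.LNG")] * 1
)
print(entity_shares(pairs)["html/@lang"])
```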

    String Pattern → Entity Type Correlation

    Which string patterns are most associated with which entity types:

    unknown (10443 occurrences)

    • APP.URL: 2814 (26.9%)
    • THG.LNG: 1307 (12.5%)
    • TOP.SET: 1250 (12.0%)
    • TMP.DAB: 892 (8.5%)
    • GRP.HER: 363 (3.5%)

    person:dutch_name (3440 occurrences)

    • GRP.HER: 1329 (38.6%)
    • GRP.ASS: 313 (9.1%)
    • GRP.GOV: 246 (7.2%)
    • AGT.PER: 241 (7.0%)
    • APP.TIT: 172 (5.0%)

    heritage:institution_type (1185 occurrences)

    • GRP.HER: 777 (65.6%)
    • APP.URL: 140 (11.8%)
    • APP.TTL: 56 (4.7%)
    • APP.TIT: 49 (4.1%)
    • WRK.WEB: 31 (2.6%)

    org:foundation_or_association (383 occurrences)

    • GRP.ASS: 198 (51.7%)
    • GRP.HER: 85 (22.2%)
    • APP.URL: 24 (6.3%)
    • APP.TIT: 15 (3.9%)
    • WRK.WEB: 15 (3.9%)

    nav:menu_item (368 occurrences)

    • APP.URL: 130 (35.3%)
    • APP.TTL: 46 (12.5%)
    • APP.TIT: 45 (12.2%)
    • WRK.COL: 40 (10.9%)
    • WRK.WEB: 34 (9.2%)

    loc:postal_code_nl (218 occurrences)

    • TOP.ADR: 193 (88.5%)
    • TOP.SET: 7 (3.2%)
    • APP.NAM: 2 (0.9%)
    • QTY.CNT: 2 (0.9%)
    • unknown: 2 (0.9%)

    gov:municipality (206 occurrences)

    • GRP.GOV: 171 (83.0%)
    • APP.TTL: 11 (5.3%)
    • TOP.SET: 5 (2.4%)
    • APP.TIT: 5 (2.4%)
    • TOP.REG: 5 (2.4%)

    info:time_range (185 occurrences)

    • TMP.OPH: 152 (82.2%)
    • TMP.TAB: 25 (13.5%)
    • TMP.EXP: 3 (1.6%)
    • TMP.DAT: 2 (1.1%)
    • TMP.RNG: 1 (0.5%)

    loc:street_address (178 occurrences)

    • TOP.ADR: 158 (88.8%)
    • GRP.HER: 8 (4.5%)
    • TOP.SET: 2 (1.1%)
    • APP.PNM: 2 (1.1%)
    • APP.URL: 1 (0.6%)

    info:opening_hours (161 occurrences)

    • TMP.OPH: 115 (71.4%)
    • TMP.DAB: 31 (19.3%)
    • TMP.SET: 7 (4.3%)
    • TMP.RNG: 2 (1.2%)
    • TMP.DAT: 2 (1.2%)

    contact:email (126 occurrences)

    • APP.URL: 93 (73.8%)
    • APP.EMAIL: 12 (9.5%)
    • APP.NAM: 5 (4.0%)
    • APP.PNM: 5 (4.0%)
    • APP.EML: 3 (2.4%)

    contact:phone (7 occurrences)

    • APP.URL: 5 (71.4%)
    • TOP.ADR: 1 (14.3%)
    • APP.TTL: 1 (14.3%)
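The string pattern names above (for example loc:postal_code_nl and info:time_range) are labels for text-level matchers whose definitions are not shown in this report. A minimal sketch of what two of them might look like, assuming regular expressions; both expressions and the helper `match_string_patterns` are hypothetical illustrations, not the contents of dutch_web_patterns.yaml.

```python
import re

# Hypothetical matchers (assumptions): a Dutch postal code is four digits
# followed by two capital letters; a time range is two clock times joined by "-".
STRING_PATTERNS = {
    "loc:postal_code_nl": re.compile(r"\b[1-9]\d{3}\s?[A-Z]{2}\b"),
    "info:time_range": re.compile(r"\b\d{1,2}[:.]\d{2}\s*-\s*\d{1,2}[:.]\d{2}\b"),
}


def match_string_patterns(text):
    """Return the names of all matching patterns, or ['unknown'] if none match."""
    matched = [name for name, rx in STRING_PATTERNS.items() if rx.search(text)]
    return matched or ["unknown"]


print(match_string_patterns("Bongerd 18, 6411 JM Heerlen"))
print(match_string_patterns("Vandaag open... Bibliotheek Kampen 09:00 - 17:00"))
print(match_string_patterns("Verenigingsnieuws"))
```

Falling back to "unknown" mirrors the report, where unknown is by far the largest pattern bucket (10443 occurrences).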

    Recommendations for dutch_web_patterns.yaml

    Based on the analysis, the following layout-aware patterns could be added:

    High-Value XPath Targets

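    1. head/title - Most common location for institution names (GRP.HER accounts for 829 of the 1983 entities found here, 41.8%)
    2. body/*/h1 - Primary heading, usually institution name
    3. body/*/nav - Navigation menu (candidate for discard patterns)
    4. body/*/footer - Contact info, address, social links (footer is the primary location of TOP.ADR, 25.2%)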

    Suggested Pattern Additions

    # Add xpath_hint to patterns for better precision
    entity_patterns:
      organizations:
        heritage_institutions:
          patterns:
            - pattern: '^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$'
              xpath_hints:
                - 'head/title'
                - 'body/*/h1'
              confidence_boost: 0.2  # Higher confidence when found at expected location
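How a consumer of this configuration might apply `xpath_hints` and `confidence_boost` is not specified in the report. A minimal sketch, assuming the boost is added to a base confidence when a match occurs at a hinted XPath; the function name `score`, the base confidence of 0.5, and case-insensitive matching are assumptions for illustration.

```python
import re

# Pattern and hints copied from the suggested YAML; IGNORECASE is an assumption.
PATTERN = re.compile(r"^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$", re.IGNORECASE)
XPATH_HINTS = {"head/title", "body/*/h1"}
CONFIDENCE_BOOST = 0.2


def score(text, xpath, base_confidence=0.5):
    """Return a confidence for a pattern match, or None if the text does not match."""
    if not PATTERN.match(text.strip()):
        return None
    boost = CONFIDENCE_BOOST if xpath in XPATH_HINTS else 0.0
    # Cap at 1.0 and round to avoid floating-point noise in the sum.
    return round(min(1.0, base_confidence + boost), 2)


print(score("Het Scheepvaartmuseum", "head/title"))  # boosted
print(score("Het Scheepvaartmuseum", "body/*/p"))    # not boosted
print(score("Museum Eenigenburg", "head/title"))     # None: no het/de prefix
```

Note that the suggested pattern requires a leading "het" or "de", so titles such as "Museum Eenigenburg" from the samples in this report would not match it regardless of location.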