glam/reports/layout_analysis_500.md
2025-12-14 17:09:55 +01:00

Web Archive Layout Pattern Analysis

Generated: 2025-12-13T15:03:35.789833

Summary

  • Annotation files analyzed: 500
  • Unique websites: 468
  • Total layout claims: 1443
  • Total entity claims: 5134

Layout Categories

Distribution of content by DOM location category:

other

  • Occurrences: 322
  • Websites: 162

Top XPath patterns:

  • body/* (162)
  • body/*/table (8)
  • body (7)
  • body/*/dl (6)
  • body/*/script (6)

String patterns found:

  • unknown (170)
  • person:dutch_name (104)
  • nav:menu_item (44)
  • heritage:institution_type (43)
  • org:foundation_or_association (13)

Samples:

  • "Voorstelling 'De Ezelprins' door Spiegelbeesten......" (bibliotheekkampen.nl)
  • "Vandaag open... Bibliotheek Kampen 09:00 - 17:00......" (bibliotheekkampen.nl)
  • "flexslider fullscreen-carousel hero-slider..." (museumrijswijk.nl)

meta_title

  • Occurrences: 282
  • Websites: 268

Top XPath patterns:

  • head/title (279)
  • head/meta[@property='og:title'] (1)
  • head/meta[@name='apple-mobile-web-app-title']/@content (1)
  • head/meta[@property='og:title']/@content (1)

String patterns found:

  • person:dutch_name (224)
  • heritage:institution_type (85)
  • nav:menu_item (58)
  • org:foundation_or_association (41)
  • unknown (34)

Samples:

  • "Protestantse Theologische Universiteit: 400 jaar theologisch onderwijs..." (pthu.nl)
  • "Heemkundevereniging Heibloem | Heibloem.nu..." (heibloem.nu)
  • "Museum Eenigenburg Museum Eenigenburg..." (museumeenigenburg.nl)

header

  • Occurrences: 186
  • Websites: 131

Top XPath patterns:

  • body/header (62)
  • body/*/header (23)
  • body/*/header/h1 (11)
  • body/*/header/* (9)
  • body/header/* (5)

String patterns found:

  • unknown (96)
  • person:dutch_name (76)
  • nav:menu_item (26)
  • heritage:institution_type (18)
  • org:foundation_or_association (12)

Samples:

  • "...Protestantse Theologische Universitei..." (pthu.nl)
  • "Heemkundevereniging Heibloem..." (heibloem.nu)
  • "Museum Eenigenburg..." (museumeenigenburg.nl)

meta_other

  • Occurrences: 164
  • Websites: 139

Top XPath patterns:

  • head (97)
  • head/script[@type='application/ld+json'] (20)
  • head/script (14)
  • head/link (7)
  • head/style (3)

String patterns found:

  • person:dutch_name (77)
  • unknown (77)
  • heritage:institution_type (33)
  • nav:menu_item (24)
  • gov:municipality (10)

Samples:

  • "<link rel="canonical" href="https://www.h..." (hildokrop.nl)
  • "Museum Rijswijk..." (museumrijswijk.nl)
  • "Gemeentearchief Ede | Gemeentearchief Ede......" (gemeentearchief.ede.nl)

nav

  • Occurrences: 106
  • Websites: 97

Top XPath patterns:

  • body/*/nav (19)
  • body/nav (16)
  • body/*/header/*/nav (14)
  • body/header/nav (14)
  • body/header/*/nav (8)

String patterns found:

  • nav:menu_item (57)
  • person:dutch_name (50)
  • unknown (37)
  • heritage:institution_type (16)
  • org:foundation_or_association (9)

Samples:

  • "Actueel HELD Burgerbus Onderwijs & opvang......" (heibloem.nu)
  • "Home Het museum Daglonershuis Kasteel De familie Eenigenburg Blog Contact..." (museumeenigenburg.nl)
  • "Home Tentoonstellingen Bezoekersinformatie......" (sonnenborgh.nl)

paragraph

  • Occurrences: 93
  • Websites: 68

Top XPath patterns:

  • body/*/p (67)
  • body/*/ul/li/*/p (4)
  • body/p (3)
  • */p (2)
  • body/table/tr/td/p (2)

String patterns found:

  • person:dutch_name (47)
  • unknown (27)
  • heritage:institution_type (19)
  • org:foundation_or_association (11)
  • nav:menu_item (7)

Samples:

  • "Heemkundevereniging Heibloem heeft als doel het in brede kring bevorderen van be..." (heibloem.nu)
  • "In Museum Eenigenburg wijst een levensgroot kompas......" (museumeenigenburg.nl)
  • "Het Museum is zeer geschikt voor het ontvangen van groepen......" (museumeenigenburg.nl)

main_heading

  • Occurrences: 74
  • Websites: 65

Top XPath patterns:

  • body/*/h1 (60)
  • */h1 (4)
  • body/div[@id='pagewrapper']/div[@class='section group']/div[@id='rightcolumn']/div[@class='section group']/div[@class='col span_2_of_2']/h1 (1)
  • body/div[@class='kolomContent home']/*/h1 (1)
  • body/div[@id='wrapper']/div[@id='wrapperinner']/div[@id='content']/div[@id='columnwrapper']/*/div[@id='anchorContent']/h1 (1)

String patterns found:

  • person:dutch_name (37)
  • unknown (31)
  • heritage:institution_type (18)
  • nav:menu_item (5)
  • org:foundation_or_association (3)

Samples:

  • "Protestantse Theologische Universiteit..." (pthu.nl)
  • "Terug in de tijd..." (museumeenigenburg.nl)
  • "Ook leuk voor groepen..." (museumeenigenburg.nl)

sub_heading

  • Occurrences: 69
  • Websites: 46

Top XPath patterns:

  • body/*/h2 (39)
  • body/*/h3 (15)
  • body/h2 (3)
  • */h2 (2)
  • body/*/h4 (2)

String patterns found:

  • unknown (42)
  • person:dutch_name (23)
  • heritage:institution_type (5)
  • nav:menu_item (3)
  • org:foundation_or_association (3)

Samples:

  • "Verenigingsnieuws..." (heibloem.nu)
  • "Contactgegevens..." (heibloem.nu)
  • "Home..." (bibliotheekkampen.nl)

meta_tag

  • Occurrences: 60
  • Websites: 55

Top XPath patterns:

  • head/meta[@name='description'] (20)
  • head/meta (13)
  • head/meta/@content (9)
  • head/meta[@name='description']/@content (8)
  • head/meta[@property='og:url'] (2)

String patterns found:

  • person:dutch_name (33)
  • heritage:institution_type (20)
  • unknown (20)
  • org:foundation_or_association (8)
  • nav:menu_item (6)

Samples:

  • "Prachtige tijdelijke en vaste tentoonstellingen zijn te zien bij Museum Catharij..." (catharijneconvent.nl)
  • "Energie is overal. We gebruiken het constant en ervaren er veel gemak van. Tegel..." (vindhetuit.nl)
  • "Aanmeldformulier Via dit formulier kunt u zich aanmelden bij de Stichting Vriend..." (verbindingsdienst.nl)

footer

  • Occurrences: 43
  • Websites: 35

Top XPath patterns:

  • body/footer (17)
  • body/*/footer (8)
  • body/footer/* (6)
  • body/*/footer/* (3)
  • body/footer/*/p (3)

String patterns found:

  • person:dutch_name (18)
  • nav:menu_item (13)
  • unknown (10)
  • loc:postal_code_nl (5)
  • loc:street_address (4)

Samples:

  • "Heibloem.nu Nieuwsbrief Contact......" (heibloem.nu)
  • "Made with ♥ by Encore Digital Agency......" (heibloem.nu)
  • "Contactinformatie Kerkweg 5... Toegangsprijzen... OPENINGS TIJDEN......" (museumeenigenburg.nl)

list

  • Occurrences: 39
  • Websites: 30

Top XPath patterns:

  • body/*/ul (30)
  • */ul (2)
  • */ul/li (1)
  • body/ul (1)
  • body/*/ol (1)

String patterns found:

  • person:dutch_name (17)
  • unknown (16)
  • nav:menu_item (9)
  • heritage:institution_type (5)
  • org:foundation_or_association (1)

Samples:

  • "......" (bibliotheekkampen.nl)
  • "Nu te zien [list of exhibitions]..." (catharijneconvent.nl)
  • "Let op: in december hebben we aangepaste openingstijden. Bekijk hier de openings..." (debieb.nl)

    contact

    • Occurrences: 3
    • Websites: 3

    Top XPath patterns:

    • body/*/address (1)
    • body/div[@id='contact']/*/ul (1)
    • div[@class='xr_group' and ./span[text()='Contact']] (1)

    String patterns found:

    • loc:postal_code_nl (1)
    • loc:street_address (1)
    • nav:menu_item (1)
    • person:dutch_name (1)

    Samples:

    • "Bongerd 18, 6411 JM Heerlen..." (museum.nl)
    • "Winsum, Hoofdstraat W 70 Uithuizen, Hoofdstraat-West 1..." (hethogeland.nl)
    • "Contact Contactgegevens..." (heemkundenispen.nl)

    main_content

    • Occurrences: 1
    • Websites: 1

    Top XPath patterns:

    • body/div[@id='readspeaker']/*/article[@role='region']/*/p (1)

    String patterns found:

    • info:opening_hours (1)

    Samples:

    • "Let op: Op donderdag 27 en vrijdag 28 november 2025 is Burgerzaken gesloten......" (veendam.nl)

    sidebar

    • Occurrences: 1
    • Websites: 1

    Top XPath patterns:

    • body/aside[@id='leftbar'] (1)

    String patterns found:

    • org:foundation_or_association (1)
    • nav:menu_item (1)
    • person:dutch_name (1)

    Samples:

    • "Vereniging Adres / Contact Kwartaalblad......" (oudhoorn.nl)
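
The report does not state the rules used to assign an XPath to a DOM location category. The following is a minimal sketch of one rule set that is consistent with most of the patterns listed above; the function names and the precedence order (footer, nav, header before the last element) are assumptions, and it does not reproduce every assignment (for example, the id-based `contact` match).

```python
def strip_predicates(xpath: str) -> str:
    """Remove [...] predicates from an XPath, including nested ones."""
    out, depth = [], 0
    for ch in xpath:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def element_steps(xpath: str) -> list:
    """Lowercase element names per step; attribute and text() steps are dropped."""
    steps = strip_predicates(xpath).lower().split("/")
    return [s for s in steps if s and not s.startswith("@") and s != "text()"]


def categorize_xpath(xpath: str) -> str:
    """Hypothetical mapping from an XPath to one of the report's categories."""
    tags = element_steps(xpath)
    if not tags:
        return "other"
    if "head" in tags:
        # head/title and title-like meta tags (og:title, ...) count as meta_title.
        if "title" in xpath.lower():
            return "meta_title"
        return "meta_tag" if "meta" in tags else "meta_other"
    # Container elements take precedence over the final element.
    for container, category in (
        ("footer", "footer"),
        ("nav", "nav"),
        ("header", "header"),
        ("aside", "sidebar"),
        ("address", "contact"),
    ):
        if container in tags:
            return category
    last = tags[-1]
    if last == "h1":
        return "main_heading"
    if last in {"h2", "h3", "h4", "h5", "h6"}:
        return "sub_heading"
    if last == "p":
        return "paragraph"
    if last in {"ul", "ol", "li"}:
        return "list"
    return "other"
```

With these rules, `body/*/header/*/nav` lands in `nav` and `body/footer/*/p` in `footer`, matching the groupings in the report.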

    Entity Type Distribution

    Entity Type  Count  Websites  Primary Location
    APP.URL       1071       406  meta_other
    GRP.HER        582       313  meta_title
    TOP.SET        426       245  meta_tag
    THG.LNG        398       305  other
    TMP.DAB        285       155  meta_tag
    WRK.WEB        145        75  meta_other
    APP.TIT        132        97  meta_title
    TOP.ADR        130        77  footer
    AGT.PER        129        68  paragraph
    GRP.ASS        122        76  meta_title
    APP.TTL        119        78  meta_title
    GRP.GOV        118        60  meta_title
    THG.CON        103        63  meta_tag
    APP.EXH         98        61  other
    TOP.REG         80        65  meta_tag
    TMP.OPH         74        46  other
    QTY.CNT         69        38  meta_tag
    TOP.CTY         68        55  meta_tag
    THG.EVT         64        31  paragraph
    TOP.BLD         56        38  paragraph
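
The "Primary Location" column appears to be the DOM category in which each entity type occurs most often. A minimal sketch of that aggregation follows; the input shape (a list of `(entity_type, category)` pairs) and the function name are assumptions, not the report generator's actual code.

```python
from collections import Counter, defaultdict


def primary_locations(claims):
    """Return the most frequent DOM category for each entity type.

    `claims` is an iterable of (entity_type, category) pairs (assumed shape).
    """
    by_type = defaultdict(Counter)
    for entity_type, category in claims:
        by_type[entity_type][category] += 1
    # most_common(1) yields [(category, count)] for the top category.
    return {t: counts.most_common(1)[0][0] for t, counts in by_type.items()}


example = [
    ("TOP.ADR", "footer"),
    ("TOP.ADR", "footer"),
    ("TOP.ADR", "paragraph"),
    ("AGT.PER", "paragraph"),
]
print(primary_locations(example))
```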

    Entity Type Details

    APP.URL

    • Total occurrences: 1071
    • Unique websites: 406

    Found in DOM categories:

    • meta_other: 596 (55.6%)
    • meta_tag: 126 (11.8%)
    • other: 81 (7.6%)
    • list: 64 (6.0%)
    • nav: 55 (5.1%)

    Common XPath patterns:

    • head/link/@href (168)
    • head/link[@rel='canonical']/@href (108)
    • head/link (67)
    • head/script[@type='application/ld+json'] (59)
    • head/meta/@content (47)

    Samples:

    GRP.HER

    • Total occurrences: 582
    • Unique websites: 313

    Found in DOM categories:

    • meta_title: 282 (48.5%)
    • meta_tag: 97 (16.7%)
    • other: 51 (8.8%)
    • meta_other: 43 (7.4%)
    • paragraph: 34 (5.8%)

    Common XPath patterns:

    • head/title (261)
    • head/meta/@content (41)
    • head/meta[@name='description']/@content (24)
    • head/script[@type='application/ld+json'] (20)
    • body/*/dl/*/dt/a (17)

    Samples:

    • "Museum Eenigenburg" @ head/title
    • "Musea Schagen" @ body/*/footer/*/a/*/p
    • "Hildo Krop Museum" @ head/title

    TOP.SET

    • Total occurrences: 426
    • Unique websites: 245

    Found in DOM categories:

    • meta_tag: 116 (27.2%)
    • meta_title: 94 (22.1%)
    • paragraph: 72 (16.9%)
    • other: 32 (7.5%)
    • meta_other: 28 (6.6%)

    Common XPath patterns:

    • head/title (87)
    • head/meta[@name='description']/@content (45)
    • body/*/p (38)
    • head/meta/@content (32)
    • body/*/dl/*/dt/a (10)

    Samples:

    • "Amsterdam" @ head/script[@type='application/ld+json']
    • "Eenigenburg" @ body/*/header/*/nav/*/ul/li/a
    • "Steenwijk" @ head/meta[@property='og:site_name']

    THG.LNG

    • Total occurrences: 398
    • Unique websites: 305

    Found in DOM categories:

    • other: 260 (65.3%)
    • meta_tag: 66 (16.6%)
    • meta_other: 31 (7.8%)
    • nav: 17 (4.3%)
    • paragraph: 15 (3.8%)

    Common XPath patterns:

    • @lang (113)
    • html/@lang (62)
    • html (30)
    • head/meta[@property='og:locale']/@content (18)
    • head/meta/@content (15)

    Samples:

    • "nl_NL" @ head/meta[@property='og:locale']
    • "nl-NL" @ [@lang='nl-NL']
    • "nl_NL" @ html/@lang
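
The samples show the same language written as both "nl_NL" and "nl-NL". A small sketch of how such values could be normalized before counting; the function name is hypothetical and only a language code plus an optional two-letter region are handled.

```python
def normalize_lang_tag(raw: str) -> str:
    """Normalize 'nl_NL' / 'nl-nl' style values to 'nl-NL'.

    Assumption: only language and a two-letter region are kept;
    any further subtags are dropped.
    """
    parts = raw.strip().replace("_", "-").split("-")
    lang = parts[0].lower()
    if len(parts) > 1 and len(parts[1]) == 2:
        return f"{lang}-{parts[1].upper()}"
    return lang
```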

    TMP.DAB

    • Total occurrences: 285
    • Unique websites: 155

    Found in DOM categories:

    • meta_tag: 100 (35.1%)
    • meta_other: 70 (24.6%)
    • paragraph: 49 (17.2%)
    • other: 37 (13.0%)
    • sub_heading: 9 (3.2%)

    Common XPath patterns:

    • head/script[@type='application/ld+json'] (33)
    • body/*/p (30)
    • head/meta[@property='article:modified_time']/@content (29)
    • head/meta/@content (28)
    • head/script[@type='application/ld+json']/text() (12)

    Samples:

    • "2024" @ li[@id='menu-item-2747']/a
    • "2025" @ li[@id='menu-item-2758']/a
    • "2022-05-30T07:37:02+00:00" @ head/meta[@property='article:modified_time']/@content
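
The TMP.DAB samples mix bare years from menu items with ISO 8601 timestamps from meta tags. The sketch below shows one way to tell the two apart when post-processing; the function name and the returned dictionary shape are assumptions.

```python
import re
from datetime import datetime


def parse_date_value(raw: str):
    """Classify a date-like string as a bare year or an ISO 8601 timestamp.

    Returns a dict with 'precision' and 'value', or None if neither form applies.
    """
    text = raw.strip()
    if re.fullmatch(r"\d{4}", text):
        return {"precision": "year", "value": int(text)}
    try:
        # Handles values such as '2022-05-30T07:37:02+00:00'.
        return {"precision": "timestamp", "value": datetime.fromisoformat(text)}
    except ValueError:
        return None
```

A date-only string such as "2022-05-30" is also accepted by `datetime.fromisoformat` and would be reported here with midnight as its time.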

    WRK.WEB

    • Total occurrences: 145
    • Unique websites: 75

    Found in DOM categories:

    • meta_other: 39 (26.9%)
    • nav: 28 (19.3%)
    • other: 27 (18.6%)
    • meta_title: 15 (10.3%)
    • list: 14 (9.7%)

    Common XPath patterns:

    • body/*/nav/*/ul/li/*/h3/a (13)
    • head/script[@type='application/ld+json'] (11)
    • HTML/BODY/DIV/DIV/UL/LI/A (8)
    • head/script/text() (7)
    • body/ul/li/a (6)

    Samples:

    APP.TIT

    • Total occurrences: 132
    • Unique websites: 97

    Found in DOM categories:

    • meta_title: 62 (47.0%)
    • sub_heading: 18 (13.6%)
    • meta_tag: 16 (12.1%)
    • other: 13 (9.8%)
    • meta_other: 7 (5.3%)

    Common XPath patterns:

    • head/title (53)
    • body/*/h2 (8)
    • body/*/a/*/h4 (8)
    • head/meta/@content (5)
    • head/meta[@property='og:title']/@content (4)

    Samples:

    • "Heemkundevereniging Heibloem" @ head/title
    • "Uitnodiging lezing "Ontdek de sporen van je voorouders"" @ body/*/h2
    • "Versiering plaquette "Sporen die bleven"" @ body/*/h2

    TOP.ADR

    • Total occurrences: 130
    • Unique websites: 77

    Found in DOM categories:

    • footer: 31 (23.8%)
    • paragraph: 23 (17.7%)
    • other: 21 (16.2%)
    • meta_other: 18 (13.8%)
    • list: 13 (10.0%)

    Common XPath patterns:

    • body/footer/*/p (13)
    • body/*/p (9)
    • body/*/ul/li (7)
    • head/script/text() (7)
    • head/script[@type='application/ld+json'] (6)

    Samples:

    • "De Boelelaan 1105, 1081 HV, Amsterdam, Nederland" @ head/script[@type='application/ld+json']
    • "Kerkweg 5, 1744 JD Eenigenburg, Noord-Holland" @ body/*/footer/*/ol/li
    • "Herenstraat 67" @ body/footer/*

    AGT.PER

    • Total occurrences: 129
    • Unique websites: 68

    Found in DOM categories:

    • paragraph: 47 (36.4%)
    • meta_tag: 29 (22.5%)
    • other: 21 (16.3%)
    • sub_heading: 10 (7.8%)
    • list: 5 (3.9%)

    Common XPath patterns:

    • body/*/p (12)
    • body/*/p/text() (10)
    • head/meta[@name='description']/@content (8)
    • body/*/p/* (8)
    • head/meta/@content (8)

    Samples:

    • "Erik Pape" @ body/*/ul/li/*/h2/a
    • "Caren van Herwaarden" @ body/*/ul/li/*/h2/a
    • "Prof. dr. Chris Stolwijk" @ body/span[contains(text(), 'Prof. dr. Chris Stolwijk')]

    GRP.ASS

    • Total occurrences: 122
    • Unique websites: 76

    Found in DOM categories:

    • meta_title: 48 (39.3%)
    • meta_tag: 17 (13.9%)
    • paragraph: 15 (12.3%)
    • other: 11 (9.0%)
    • header: 7 (5.7%)

    Common XPath patterns:

    • head/title (44)
    • head/meta/@content (12)
    • body/*/p (6)
    • head/script (4)
    • body/* (4)

    Samples:

    • "Heemkundevereniging Heibloem" @ head/title
    • "Heemkundevereniging Heibloem" @ body/*/header/h1/*
    • "Heemkundevereniging Heibloem" @ body/*/p/strong

    XPath → Entity Type Mapping

    Which entity types are typically found at which DOM locations:

    head/title (608 entities)

    • GRP.HER: 261 (42.9%)
    • TOP.SET: 87 (14.3%)
    • APP.TIT: 53 (8.7%)
    • APP.TTL: 45 (7.4%)
    • GRP.ASS: 44 (7.2%)

    head/meta/@content (278 entities)

    • APP.URL: 47 (16.9%)
    • GRP.HER: 41 (14.7%)
    • TOP.SET: 32 (11.5%)
    • TMP.DAB: 28 (10.1%)
    • THG.LNG: 15 (5.4%)

    body/*/p (277 entities)

    • TOP.SET: 38 (13.7%)
    • TMP.DAB: 30 (10.8%)
    • GRP.HER: 17 (6.1%)
    • AGT.PER: 12 (4.3%)
    • THG.ART: 11 (4.0%)

    head/script[@type='application/ld+json'] (219 entities)

    • APP.URL: 59 (26.9%)
    • TMP.DAB: 33 (15.1%)
    • GRP.HER: 20 (9.1%)
    • WRK.WEB: 11 (5.0%)
    • TOP.SET: 9 (4.1%)

    head/meta[@name='description']/@content (189 entities)

    • TOP.SET: 45 (23.8%)
    • GRP.HER: 24 (12.7%)
    • THG.CON: 17 (9.0%)
    • TOP.REG: 11 (5.8%)
    • GRP.GOV: 10 (5.3%)

    head/link/@href (179 entities)

    • APP.URL: 168 (93.9%)
    • WRK.WEB: 3 (1.7%)
    • APP.WEB: 3 (1.7%)
    • WRK.URL: 2 (1.1%)
    • TOP.SET: 1 (0.6%)

    @lang (115 entities)

    • THG.LNG: 113 (98.3%)
    • TOP.CTY: 2 (1.7%)

    head/link[@rel='canonical']/@href (111 entities)

    • APP.URL: 108 (97.3%)
    • WRK.WEB: 2 (1.8%)
    • WRK.URL: 1 (0.9%)

    head/script (106 entities)

    • APP.URL: 23 (21.7%)
    • TMP.DAB: 12 (11.3%)
    • GRP.HER: 6 (5.7%)
    • QTY.MSR: 6 (5.7%)
    • TOP.SET: 5 (4.7%)

    head/script/text() (93 entities)

    • APP.URL: 31 (33.3%)
    • TMP.DAB: 9 (9.7%)
    • GRP.HER: 9 (9.7%)
    • WRK.WEB: 7 (7.5%)
    • TOP.ADR: 7 (7.5%)

    head/link (69 entities)

    • APP.URL: 67 (97.1%)
    • APP.WKD: 1 (1.4%)
    • WRK.WEB: 1 (1.4%)

    html/@lang (62 entities)

    • THG.LNG: 62 (100.0%)

    head/meta (61 entities)

    • APP.URL: 12 (19.7%)
    • TMP.DAB: 5 (8.2%)
    • GRP.COR: 4 (6.6%)
    • GRP.HER: 4 (6.6%)
    • TOP.SET: 4 (6.6%)

    head/script[@type='application/ld+json']/text() (57 entities)

    • TMP.DAB: 12 (21.1%)
    • APP.URL: 11 (19.3%)
    • GRP.HER: 6 (10.5%)
    • TOP.SET: 5 (8.8%)
    • APP.NAM: 3 (5.3%)

    body/*/h2 (54 entities)

    • APP.TIT: 8 (14.8%)
    • APP.EXH: 8 (14.8%)
    • GRP.HER: 8 (14.8%)
    • TOP.SET: 5 (9.3%)
    • TMP.DAB: 4 (7.4%)

    body/* (54 entities)

    • TMP.DAB: 9 (16.7%)
    • WRK.COL: 7 (13.0%)
    • TOP.ADR: 6 (11.1%)
    • GRP.ASS: 4 (7.4%)
    • TMP.TAB: 4 (7.4%)

    [XPath heading not preserved] (46 entities)

    • APP.URL: 41 (89.1%)
    • WRK.WEB: 5 (10.9%)

    head/meta[@property='og:description']/@content (41 entities)

    • TOP.SET: 6 (14.6%)
    • TMP.DAB: 5 (12.2%)
    • THG.CON: 4 (9.8%)
    • AGT.PER: 3 (7.3%)
    • APP.TIT: 3 (7.3%)

    body/*/p/a (40 entities)

    • APP.URL: 20 (50.0%)
    • THG.EVT: 5 (12.5%)
    • TOP.SET: 3 (7.5%)
    • APP.TTL: 3 (7.5%)
    • GRP.GOV: 2 (5.0%)

    body/*/p/* (36 entities)

    • AGT.PER: 8 (22.2%)
    • TOP.ADR: 4 (11.1%)
    • TOP.CTY: 3 (8.3%)
    • ROL.POS: 2 (5.6%)
    • ROL.OCC: 2 (5.6%)
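
The per-XPath shares above can be reproduced with a simple grouped count. This is a sketch under the assumption that claims are available as `(xpath, entity_type)` pairs; the function name and return shape are illustrative only.

```python
from collections import Counter, defaultdict


def xpath_entity_shares(claims, top_n=5):
    """Map each XPath to (total, [(entity_type, count, percent), ...]).

    Only the `top_n` most frequent entity types are listed per XPath,
    so the listed percentages need not sum to 100.
    """
    by_xpath = defaultdict(Counter)
    for xpath, entity_type in claims:
        by_xpath[xpath][entity_type] += 1
    report = {}
    for xpath, counts in by_xpath.items():
        total = sum(counts.values())
        rows = [
            (etype, n, round(100 * n / total, 1))
            for etype, n in counts.most_common(top_n)
        ]
        report[xpath] = (total, rows)
    return report


# Rebuild the '@lang' entry from the counts given in the report.
claims = [("@lang", "THG.LNG")] * 113 + [("@lang", "TOP.CTY")] * 2
print(xpath_entity_shares(claims)["@lang"])
```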

    String Pattern → Entity Type Correlation

    Which string patterns are most associated with which entity types:

    unknown (3503 occurrences)

    • APP.URL: 937 (26.7%)
    • TOP.SET: 403 (11.5%)
    • THG.LNG: 398 (11.4%)
    • TMP.DAB: 269 (7.7%)
    • GRP.HER: 132 (3.8%)

    person:dutch_name (1146 occurrences)

    • GRP.HER: 410 (35.8%)
    • GRP.ASS: 101 (8.8%)
    • AGT.PER: 89 (7.8%)
    • GRP.GOV: 74 (6.5%)
    • APP.TIT: 63 (5.5%)

    heritage:institution_type (376 occurrences)

    • GRP.HER: 233 (62.0%)
    • APP.URL: 50 (13.3%)
    • APP.TIT: 20 (5.3%)
    • APP.TTL: 19 (5.1%)
    • WRK.WEB: 13 (3.5%)

    org:foundation_or_association (155 occurrences)

    • GRP.ASS: 75 (48.4%)
    • GRP.HER: 35 (22.6%)
    • APP.URL: 15 (9.7%)
    • WRK.WEB: 6 (3.9%)
    • APP.TIT: 5 (3.2%)

    nav:menu_item (141 occurrences)

    • APP.URL: 46 (32.6%)
    • APP.TTL: 23 (16.3%)
    • APP.TIT: 18 (12.8%)
    • WRK.WEB: 14 (9.9%)
    • WRK.COL: 13 (9.2%)

    loc:postal_code_nl (66 occurrences)

    • TOP.ADR: 55 (83.3%)
    • TOP.SET: 4 (6.1%)
    • QTY.CNT: 2 (3.0%)
    • APP.COL: 1 (1.5%)
    • APP.NAM: 1 (1.5%)

    gov:municipality (65 occurrences)

    • GRP.GOV: 53 (81.5%)
    • TOP.REG: 3 (4.6%)
    • TOP.SET: 2 (3.1%)
    • APP.TIT: 2 (3.1%)
    • APP.TTL: 2 (3.1%)

    info:time_range (63 occurrences)

    • TMP.OPH: 52 (82.5%)
    • TMP.TAB: 9 (14.3%)
    • TMP.EXP: 2 (3.2%)

    loc:street_address (62 occurrences)

    • TOP.ADR: 55 (88.7%)
    • GRP.HER: 3 (4.8%)
    • GRP.ASS: 1 (1.6%)
    • APP.TTL: 1 (1.6%)
    • unknown: 1 (1.6%)

    info:opening_hours (48 occurrences)

    • TMP.OPH: 32 (66.7%)
    • TMP.DAB: 13 (27.1%)
    • TMP.SET: 2 (4.2%)
    • THG.EVT: 1 (2.1%)

    contact:email (42 occurrences)

    • APP.URL: 25 (59.5%)
    • APP.EMAIL: 9 (21.4%)
    • APP.NAM: 3 (7.1%)
    • APP.PNM: 2 (4.8%)
    • THG.PHO: 1 (2.4%)

    contact:phone (4 occurrences)

    • APP.URL: 3 (75.0%)
    • TOP.ADR: 1 (25.0%)
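
The report names the string patterns but not their definitions. As an illustration only, two of them could be approximated with regular expressions as below; these expressions are assumptions and are not taken from dutch_web_patterns.yaml. The postal code pattern relies on the Dutch format of four digits followed by two capital letters, and the time range pattern is modelled on the "09:00 - 17:00" sample.

```python
import re

# Hypothetical approximations of two of the report's string patterns.
PATTERNS = {
    "loc:postal_code_nl": re.compile(r"\b[1-9]\d{3}\s?[A-Z]{2}\b"),
    "info:time_range": re.compile(r"\b\d{1,2}[:.]\d{2}\s*-\s*\d{1,2}[:.]\d{2}\b"),
}


def match_string_patterns(text: str) -> list:
    """Return the names of all matching patterns, or ['unknown'] if none match."""
    hits = [name for name, rx in PATTERNS.items() if rx.search(text)]
    return hits or ["unknown"]


print(match_string_patterns("Bongerd 18, 6411 JM Heerlen"))
print(match_string_patterns("Vandaag open... Bibliotheek Kampen 09:00 - 17:00"))
```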

    Recommendations for dutch_web_patterns.yaml

    Based on the analysis, the following layout-aware patterns could be added:

    High-Value XPath Targets

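    1. head/title - Most frequent location of the institution name (GRP.HER accounts for 42.9% of the 608 entities found here)
    2. body/*/h1 - Primary heading, usually the institution name
    3. body/*/nav - Navigation menus (candidates for discard patterns)
    4. body/*/footer - Contact info and addresses (footer is the primary location for TOP.ADR), social links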
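
To show how the four targets above could be read from a page, here is a minimal sketch using the standard library's `xml.etree.ElementTree`, whose limited XPath support covers these paths. It assumes well-formed markup; archived real-world HTML would normally need a lenient HTML parser instead. The sample document is made up for illustration and reuses strings from the report's samples.

```python
import xml.etree.ElementTree as ET

TARGETS = ["head/title", "body/*/h1", "body/*/nav", "body/*/footer"]

# Made-up, well-formed page for illustration only.
SAMPLE = """<html><head><title>Museum Eenigenburg</title></head>
<body><div>
<nav><a href="/">Home</a> <a href="/contact">Contact</a></nav>
<h1>Terug in de tijd</h1>
<footer><p>Kerkweg 5, 1744 JD Eenigenburg</p></footer>
</div></body></html>"""


def extract_targets(xhtml: str) -> dict:
    """Return whitespace-normalized text for each target path."""
    root = ET.fromstring(xhtml)  # root is the <html> element
    found = {}
    for path in TARGETS:
        texts = [" ".join("".join(el.itertext()).split()) for el in root.findall(path)]
        found[path] = [t for t in texts if t]
    return found


for path, texts in extract_targets(SAMPLE).items():
    print(path, texts)
```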

    Suggested Pattern Additions

    # Add xpath_hint to patterns for better precision
    entity_patterns:
      organizations:
        heritage_institutions:
          patterns:
            - pattern: '^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$'
              xpath_hints:
                - 'head/title'
                - 'body/*/h1'
              confidence_boost: 0.2  # Higher confidence when found at expected location