Web Archive Layout Pattern Analysis
Generated: 2025-12-13T15:03:35.789833
Summary
- Annotation files analyzed: 500
- Unique websites: 468
- Total layout claims: 1443
- Total entity claims: 5134
Layout Categories
Distribution of content by DOM location category:
other
- Occurrences: 322
- Websites: 162
Top XPath patterns:
body/* (162), body/*/table (8), body (7), body/*/dl (6), body/*/script (6)
String patterns found:
unknown (170), person:dutch_name (104), nav:menu_item (44), heritage:institution_type (43), org:foundation_or_association (13)
Samples:
- "Voorstelling 'De Ezelprins' door Spiegelbeesten......" (bibliotheekkampen.nl)
- "Vandaag open... Bibliotheek Kampen 09:00 - 17:00......" (bibliotheekkampen.nl)
- "flexslider fullscreen-carousel hero-slider..." (museumrijswijk.nl)
meta_title
- Occurrences: 282
- Websites: 268
Top XPath patterns:
head/title (279), head/meta[@property='og:title'] (1), head/meta[@name='apple-mobile-web-app-title']/@content (1), head/meta[@property='og:title']/@content (1)
String patterns found:
person:dutch_name (224), heritage:institution_type (85), nav:menu_item (58), org:foundation_or_association (41), unknown (34)
Samples:
- "Protestantse Theologische Universiteit: 400 jaar theologisch onderwijs..." (pthu.nl)
- "Heemkundevereniging Heibloem | Heibloem.nu..." (heibloem.nu)
- "Museum Eenigenburg – Museum Eenigenburg..." (museumeenigenburg.nl)
header
- Occurrences: 186
- Websites: 131
Top XPath patterns:
body/header (62), body/*/header (23), body/*/header/h1 (11), body/*/header/* (9), body/header/* (5)
String patterns found:
unknown (96), person:dutch_name (76), nav:menu_item (26), heritage:institution_type (18), org:foundation_or_association (12)
Samples:
- "...Protestantse Theologische Universitei..." (pthu.nl)
- "Heemkundevereniging Heibloem..." (heibloem.nu)
- "Museum Eenigenburg..." (museumeenigenburg.nl)
meta_other
- Occurrences: 164
- Websites: 139
Top XPath patterns:
head (97), head/script[@type='application/ld+json'] (20), head/script (14), head/link (7), head/style (3)
String patterns found:
person:dutch_name (77), unknown (77), heritage:institution_type (33), nav:menu_item (24), gov:municipality (10)
Samples:
- "<link rel="canonical" href="https://www.h..." (hildokrop.nl)
- "Museum Rijswijk..." (museumrijswijk.nl)
- "Gemeentearchief Ede | Gemeentearchief Ede......" (gemeentearchief.ede.nl)
nav
- Occurrences: 106
- Websites: 97
Top XPath patterns:
body/*/nav (19), body/nav (16), body/*/header/*/nav (14), body/header/nav (14), body/header/*/nav (8)
String patterns found:
nav:menu_item (57), person:dutch_name (50), unknown (37), heritage:institution_type (16), org:foundation_or_association (9)
Samples:
- "Actueel HELD Burgerbus Onderwijs & opvang......" (heibloem.nu)
- "Home Het museum Daglonershuis Kasteel De familie Eenigenburg Blog Contact..." (museumeenigenburg.nl)
- "Home Tentoonstellingen Bezoekersinformatie......" (sonnenborgh.nl)
paragraph
- Occurrences: 93
- Websites: 68
Top XPath patterns:
body/*/p (67), body/*/ul/li/*/p (4), body/p (3), */p (2), body/table/tr/td/p (2)
String patterns found:
person:dutch_name (47), unknown (27), heritage:institution_type (19), org:foundation_or_association (11), nav:menu_item (7)
Samples:
- "Heemkundevereniging Heibloem heeft als doel het in brede kring bevorderen van be..." (heibloem.nu)
- "In Museum Eenigenburg wijst een levensgroot kompas......" (museumeenigenburg.nl)
- "Het Museum is zeer geschikt voor het ontvangen van groepen......" (museumeenigenburg.nl)
main_heading
- Occurrences: 74
- Websites: 65
Top XPath patterns:
body/*/h1 (60), */h1 (4), body/div[@id='pagewrapper']/div[@class='section group']/div[@id='rightcolumn']/div[@class='section group']/div[@class='col span_2_of_2']/h1 (1), body/div[@class='kolomContent home']/*/h1 (1), body/div[@id='wrapper']/div[@id='wrapperinner']/div[@id='content']/div[@id='columnwrapper']/*/div[@id='anchorContent']/h1 (1)
String patterns found:
person:dutch_name (37), unknown (31), heritage:institution_type (18), nav:menu_item (5), org:foundation_or_association (3)
Samples:
- "Protestantse Theologische Universiteit..." (pthu.nl)
- "Terug in de tijd..." (museumeenigenburg.nl)
- "Ook leuk voor groepen..." (museumeenigenburg.nl)
sub_heading
- Occurrences: 69
- Websites: 46
Top XPath patterns:
body/*/h2 (39), body/*/h3 (15), body/h2 (3), */h2 (2), body/*/h4 (2)
String patterns found:
unknown (42), person:dutch_name (23), heritage:institution_type (5), nav:menu_item (3), org:foundation_or_association (3)
Samples:
- "Verenigingsnieuws..." (heibloem.nu)
- "Contactgegevens..." (heibloem.nu)
- "Home..." (bibliotheekkampen.nl)
meta_tag
- Occurrences: 60
- Websites: 55
Top XPath patterns:
head/meta[@name='description'] (20), head/meta (13), head/meta/@content (9), head/meta[@name='description']/@content (8), head/meta[@property='og:url'] (2)
String patterns found:
person:dutch_name (33), heritage:institution_type (20), unknown (20), org:foundation_or_association (8), nav:menu_item (6)
Samples:
- "Prachtige tijdelijke en vaste tentoonstellingen zijn te zien bij Museum Catharij..." (catharijneconvent.nl)
- "Energie is overal. We gebruiken het constant en ervaren er veel gemak van. Tegel..." (vindhetuit.nl)
- "Aanmeldformulier Via dit formulier kunt u zich aanmelden bij de Stichting Vriend..." (verbindingsdienst.nl)
footer
- Occurrences: 43
- Websites: 35
Top XPath patterns:
body/footer (17), body/*/footer (8), body/footer/* (6), body/*/footer/* (3), body/footer/*/p (3)
String patterns found:
person:dutch_name (18), nav:menu_item (13), unknown (10), loc:postal_code_nl (5), loc:street_address (4)
Samples:
- "Heibloem.nu Nieuwsbrief Contact......" (heibloem.nu)
- "Made with ♥ by Encore Digital Agency......" (heibloem.nu)
- "Contactinformatie Kerkweg 5... Toegangsprijzen... OPENINGS TIJDEN......" (museumeenigenburg.nl)
list
- Occurrences: 39
- Websites: 30
Top XPath patterns:
body/*/ul (30), */ul (2), */ul/li (1), body/ul (1), body/*/ol (1)
String patterns found:
person:dutch_name (17), unknown (16), nav:menu_item (9), heritage:institution_type (5), org:foundation_or_association (1)
Samples:
- "- ......" (bibliotheekkampen.nl)
- "Nu te zien [list of exhibitions]..." (catharijneconvent.nl)
- "Let op: in december hebben we aangepaste openingstijden. Bekijk hier de openings..." (debieb.nl)
contact
- Occurrences: 3
- Websites: 3
Top XPath patterns:
body/*/address (1), body/div[@id='contact']/*/ul (1), div[@class='xr_group' and ./span[text()='Contact']] (1)
String patterns found:
loc:postal_code_nl (1), loc:street_address (1), nav:menu_item (1), person:dutch_name (1)
Samples:
- "Bongerd 18, 6411 JM Heerlen..." (museum.nl)
- "Winsum, Hoofdstraat W 70 Uithuizen, Hoofdstraat-West 1..." (hethogeland.nl)
- "Contact Contactgegevens..." (heemkundenispen.nl)
main_content
- Occurrences: 1
- Websites: 1
Top XPath patterns:
body/div[@id='readspeaker']/*/article[@role='region']/*/p (1)
String patterns found:
info:opening_hours (1)
Samples:
- "Let op: Op donderdag 27 en vrijdag 28 november 2025 is Burgerzaken gesloten......" (veendam.nl)
sidebar
- Occurrences: 1
- Websites: 1
Top XPath patterns:
body/aside[@id='leftbar'] (1)
String patterns found:
org:foundation_or_association (1), nav:menu_item (1), person:dutch_name (1)
Samples:
- "Vereniging Adres / Contact Kwartaalblad......" (oudhoorn.nl)
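The category assignments above suggest a simple precedence: metadata in the head first, then container elements such as nav, header and footer, then the leaf element. The following is a minimal sketch of such a classifier. The rules and the function names `element_tags` and `classify_xpath` are assumptions inferred from the listed patterns, not the logic that produced this report, and rarer cases (for example `body/div[@id='contact']/*/ul`, which the report files under contact, or main_content) are not reproduced.

```python
def element_tags(xpath: str) -> list[str]:
    """Return lowercase element names, dropping predicates, attributes and text()."""
    out, depth = [], 0
    for ch in xpath:
        if ch == "[":
            depth += 1          # entering a predicate such as [@id='x']
        elif ch == "]":
            depth -= 1
        elif depth == 0:
            out.append(ch)      # keep only characters outside predicates
    steps = "".join(out).split("/")
    return [s.lower() for s in steps if s and not s.startswith("@") and s != "text()"]


def classify_xpath(xpath: str) -> str:
    """Map a generalized XPath to a layout category (hypothetical rules)."""
    tags = element_tags(xpath)
    if tags[:1] == ["head"]:
        is_meta = tags[1:2] == ["meta"]
        if tags[1:2] == ["title"] or (is_meta and "title" in xpath.lower()):
            return "meta_title"
        return "meta_tag" if is_meta else "meta_other"
    # Container elements win over the leaf element; nav wins over header.
    containers = (("nav", "nav"), ("header", "header"), ("footer", "footer"),
                  ("aside", "sidebar"), ("address", "contact"))
    for tag, category in containers:
        if tag in tags:
            return category
    leaf = tags[-1] if tags else ""
    if leaf == "p":
        return "paragraph"
    if leaf == "h1":
        return "main_heading"
    if leaf in {"h2", "h3", "h4", "h5", "h6"}:
        return "sub_heading"
    if leaf in {"ul", "ol", "li"}:
        return "list"
    return "other"
```

Stripping predicates before splitting on `/` matters because values such as `application/ld+json` contain a slash themselves.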
Entity Type Distribution
| Entity Type | Count | Websites | Primary Location |
|---|---|---|---|
| APP.URL | 1071 | 406 | meta_other |
| GRP.HER | 582 | 313 | meta_title |
| TOP.SET | 426 | 245 | meta_tag |
| THG.LNG | 398 | 305 | other |
| TMP.DAB | 285 | 155 | meta_tag |
| WRK.WEB | 145 | 75 | meta_other |
| APP.TIT | 132 | 97 | meta_title |
| TOP.ADR | 130 | 77 | footer |
| AGT.PER | 129 | 68 | paragraph |
| GRP.ASS | 122 | 76 | meta_title |
| APP.TTL | 119 | 78 | meta_title |
| GRP.GOV | 118 | 60 | meta_title |
| THG.CON | 103 | 63 | meta_tag |
| APP.EXH | 98 | 61 | other |
| TOP.REG | 80 | 65 | meta_tag |
| TMP.OPH | 74 | 46 | other |
| QTY.CNT | 69 | 38 | meta_tag |
| TOP.CTY | 68 | 55 | meta_tag |
| THG.EVT | 64 | 31 | paragraph |
| TOP.BLD | 56 | 38 | paragraph |

Entity Type Details
APP.URL
- Total occurrences: 1071
- Unique websites: 406
Found in DOM categories:
- meta_other: 596 (55.6%)
- meta_tag: 126 (11.8%)
- other: 81 (7.6%)
- list: 64 (6.0%)
- nav: 55 (5.1%)
Common XPath patterns:
head/link/@href (168), head/link[@rel='canonical']/@href (108), head/link (67), head/script[@type='application/ld+json'] (59), head/meta/@content (47)
Samples:
- "https://www.pthu.nl/" @ head/link/@href
- "https://heibloem.nu/vereniging/heemkundevereniging-heibloem" @ head/meta[@property='og:url']
- "https://heibloem.nu/held" @ body/*/footer/*/p/a
GRP.HER
- Total occurrences: 582
- Unique websites: 313
Found in DOM categories:
- meta_title: 282 (48.5%)
- meta_tag: 97 (16.7%)
- other: 51 (8.8%)
- meta_other: 43 (7.4%)
- paragraph: 34 (5.8%)
Common XPath patterns:
head/title (261), head/meta/@content (41), head/meta[@name='description']/@content (24), head/script[@type='application/ld+json'] (20), body/*/dl/*/dt/a (17)
Samples:
- "Museum Eenigenburg" @ head/title
- "Musea Schagen" @ body/*/footer/*/a/*/p
- "Hildo Krop Museum" @ head/title
TOP.SET
- Total occurrences: 426
- Unique websites: 245
Found in DOM categories:
- meta_tag: 116 (27.2%)
- meta_title: 94 (22.1%)
- paragraph: 72 (16.9%)
- other: 32 (7.5%)
- meta_other: 28 (6.6%)
Common XPath patterns:
head/title (87), head/meta[@name='description']/@content (45), body/*/p (38), head/meta/@content (32), body/*/dl/*/dt/a (10)
Samples:
- "Amsterdam" @ head/script[@type='application/ld+json']
- "Eenigenburg" @ body/*/header/*/nav/*/ul/li/a
- "Steenwijk" @ head/meta[@property='og:site_name']
THG.LNG
- Total occurrences: 398
- Unique websites: 305
Found in DOM categories:
- other: 260 (65.3%)
- meta_tag: 66 (16.6%)
- meta_other: 31 (7.8%)
- nav: 17 (4.3%)
- paragraph: 15 (3.8%)
Common XPath patterns:
@lang (113), html/@lang (62), html (30), head/meta[@property='og:locale']/@content (18), head/meta/@content (15)
Samples:
- "nl_NL" @ head/meta[@property='og:locale']
- "nl-NL" @ [@lang='nl-NL']
- "nl_NL" @ html/@lang
TMP.DAB
- Total occurrences: 285
- Unique websites: 155
Found in DOM categories:
- meta_tag: 100 (35.1%)
- meta_other: 70 (24.6%)
- paragraph: 49 (17.2%)
- other: 37 (13.0%)
- sub_heading: 9 (3.2%)
Common XPath patterns:
head/script[@type='application/ld+json'] (33), body/*/p (30), head/meta[@property='article:modified_time']/@content (29), head/meta/@content (28), head/script[@type='application/ld+json']/text() (12)
Samples:
- "2024" @ li[@id='menu-item-2747']/a
- "2025" @ li[@id='menu-item-2758']/a
- "2022-05-30T07:37:02+00:00" @ head/meta[@property='article:modified_time']/@content
WRK.WEB
- Total occurrences: 145
- Unique websites: 75
Found in DOM categories:
- meta_other: 39 (26.9%)
- nav: 28 (19.3%)
- other: 27 (18.6%)
- meta_title: 15 (10.3%)
- list: 14 (9.7%)
Common XPath patterns:
body/*/nav/*/ul/li/*/h3/a (13), head/script[@type='application/ld+json'] (11), HTML/BODY/DIV/DIV/UL/LI/A (8), head/script/text() (7), body/ul/li/a (6)
Samples:
- "Home - Hildo Krop Museum" @ head/title
- "https://www.sonnenborgh.nl/home" @ head/link[@rel='canonical']
- "Strategische visie" @ body/*/h2
APP.TIT
- Total occurrences: 132
- Unique websites: 97
Found in DOM categories:
- meta_title: 62 (47.0%)
- sub_heading: 18 (13.6%)
- meta_tag: 16 (12.1%)
- other: 13 (9.8%)
- meta_other: 7 (5.3%)
Common XPath patterns:
head/title (53), body/*/h2 (8), body/*/a/*/h4 (8), head/meta/@content (5), head/meta[@property='og:title']/@content (4)
Samples:
- "Heemkundevereniging Heibloem" @ head/title
- "Uitnodiging lezing "Ontdek de sporen van je voorouders"" @ body/*/h2
- "Versiering plaquette "Sporen die bleven"" @ body/*/h2
TOP.ADR
- Total occurrences: 130
- Unique websites: 77
Found in DOM categories:
- footer: 31 (23.8%)
- paragraph: 23 (17.7%)
- other: 21 (16.2%)
- meta_other: 18 (13.8%)
- list: 13 (10.0%)
Common XPath patterns:
body/footer/*/p (13), body/*/p (9), body/*/ul/li (7), head/script/text() (7), head/script[@type='application/ld+json'] (6)
Samples:
- "De Boelelaan 1105, 1081 HV, Amsterdam, Nederland" @ head/script[@type='application/ld+json']
- "Kerkweg 5, 1744 JD Eenigenburg, Noord-Holland" @ body/*/footer/*/ol/li
- "Herenstraat 67" @ body/footer/*
AGT.PER
- Total occurrences: 129
- Unique websites: 68
Found in DOM categories:
- paragraph: 47 (36.4%)
- meta_tag: 29 (22.5%)
- other: 21 (16.3%)
- sub_heading: 10 (7.8%)
- list: 5 (3.9%)
Common XPath patterns:
body/*/p (12), body/*/p/text() (10), head/meta[@name='description']/@content (8), body/*/p/* (8), head/meta/@content (8)
Samples:
- "Erik Pape" @ body/*/ul/li/*/h2/a
- "Caren van Herwaarden" @ body/*/ul/li/*/h2/a
- "Prof. dr. Chris Stolwijk" @ body/span[contains(text(), 'Prof. dr. Chris Stolwijk')]
GRP.ASS
- Total occurrences: 122
- Unique websites: 76
Found in DOM categories:
- meta_title: 48 (39.3%)
- meta_tag: 17 (13.9%)
- paragraph: 15 (12.3%)
- other: 11 (9.0%)
- header: 7 (5.7%)
Common XPath patterns:
head/title (44), head/meta/@content (12), body/*/p (6), head/script (4), body/* (4)
Samples:
- "Heemkundevereniging Heibloem" @ head/title
- "Heemkundevereniging Heibloem" @ body/*/header/h1/*
- "Heemkundevereniging Heibloem" @ body/*/p/strong
XPath → Entity Type Mapping
Which entity types are typically found at which DOM locations:
head/title (608 entities)
- GRP.HER: 261 (42.9%)
- TOP.SET: 87 (14.3%)
- APP.TIT: 53 (8.7%)
- APP.TTL: 45 (7.4%)
- GRP.ASS: 44 (7.2%)
head/meta/@content (278 entities)
- APP.URL: 47 (16.9%)
- GRP.HER: 41 (14.7%)
- TOP.SET: 32 (11.5%)
- TMP.DAB: 28 (10.1%)
- THG.LNG: 15 (5.4%)
body/*/p (277 entities)
- TOP.SET: 38 (13.7%)
- TMP.DAB: 30 (10.8%)
- GRP.HER: 17 (6.1%)
- AGT.PER: 12 (4.3%)
- THG.ART: 11 (4.0%)
head/script[@type='application/ld+json'] (219 entities)
- APP.URL: 59 (26.9%)
- TMP.DAB: 33 (15.1%)
- GRP.HER: 20 (9.1%)
- WRK.WEB: 11 (5.0%)
- TOP.SET: 9 (4.1%)
head/meta[@name='description']/@content (189 entities)
- TOP.SET: 45 (23.8%)
- GRP.HER: 24 (12.7%)
- THG.CON: 17 (9.0%)
- TOP.REG: 11 (5.8%)
- GRP.GOV: 10 (5.3%)
head/link/@href (179 entities)
- APP.URL: 168 (93.9%)
- WRK.WEB: 3 (1.7%)
- APP.WEB: 3 (1.7%)
- WRK.URL: 2 (1.1%)
- TOP.SET: 1 (0.6%)
@lang (115 entities)
- THG.LNG: 113 (98.3%)
- TOP.CTY: 2 (1.7%)
head/link[@rel='canonical']/@href (111 entities)
- APP.URL: 108 (97.3%)
- WRK.WEB: 2 (1.8%)
- WRK.URL: 1 (0.9%)
head/script (106 entities)
- APP.URL: 23 (21.7%)
- TMP.DAB: 12 (11.3%)
- GRP.HER: 6 (5.7%)
- QTY.MSR: 6 (5.7%)
- TOP.SET: 5 (4.7%)
head/script/text() (93 entities)
- APP.URL: 31 (33.3%)
- TMP.DAB: 9 (9.7%)
- GRP.HER: 9 (9.7%)
- WRK.WEB: 7 (7.5%)
- TOP.ADR: 7 (7.5%)
head/link (69 entities)
- APP.URL: 67 (97.1%)
- APP.WKD: 1 (1.4%)
- WRK.WEB: 1 (1.4%)
html/@lang (62 entities)
- THG.LNG: 62 (100.0%)
head/meta (61 entities)
- APP.URL: 12 (19.7%)
- TMP.DAB: 5 (8.2%)
- GRP.COR: 4 (6.6%)
- GRP.HER: 4 (6.6%)
- TOP.SET: 4 (6.6%)
head/script[@type='application/ld+json']/text() (57 entities)
- TMP.DAB: 12 (21.1%)
- APP.URL: 11 (19.3%)
- GRP.HER: 6 (10.5%)
- TOP.SET: 5 (8.8%)
- APP.NAM: 3 (5.3%)
body/*/h2 (54 entities)
- APP.TIT: 8 (14.8%)
- APP.EXH: 8 (14.8%)
- GRP.HER: 8 (14.8%)
- TOP.SET: 5 (9.3%)
- TMP.DAB: 4 (7.4%)
body/* (54 entities)
- TMP.DAB: 9 (16.7%)
- WRK.COL: 7 (13.0%)
- TOP.ADR: 6 (11.1%)
- GRP.ASS: 4 (7.4%)
- TMP.TAB: 4 (7.4%)
head/link[@rel='canonical'] (46 entities)
- APP.URL: 41 (89.1%)
- WRK.WEB: 5 (10.9%)
head/meta[@property='og:description']/@content (41 entities)
- TOP.SET: 6 (14.6%)
- TMP.DAB: 5 (12.2%)
- THG.CON: 4 (9.8%)
- AGT.PER: 3 (7.3%)
- APP.TIT: 3 (7.3%)
body/*/p/a (40 entities)
- APP.URL: 20 (50.0%)
- THG.EVT: 5 (12.5%)
- TOP.SET: 3 (7.5%)
- APP.TTL: 3 (7.5%)
- GRP.GOV: 2 (5.0%)
body/*/p/* (36 entities)
- AGT.PER: 8 (22.2%)
- TOP.ADR: 4 (11.1%)
- TOP.CTY: 3 (8.3%)
- ROL.POS: 2 (5.6%)
- ROL.OCC: 2 (5.6%)
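The percentages in this mapping are per-XPath shares of entity types. As a minimal sketch of how such a table can be derived, the snippet below counts (xpath, entity_type) pairs and converts the counts to percentages. The function name `xpath_entity_distribution` and the input shape are assumptions for illustration; the example input is rebuilt from the `@lang` counts reported above (113 THG.LNG, 2 TOP.CTY) rather than from the underlying annotation files.

```python
from collections import Counter, defaultdict


def xpath_entity_distribution(claims):
    """Group (xpath, entity_type) pairs by xpath and report count and share per type."""
    by_xpath = defaultdict(Counter)
    for xpath, entity_type in claims:
        by_xpath[xpath][entity_type] += 1

    result = {}
    for xpath, counts in by_xpath.items():
        total = sum(counts.values())
        # most_common() orders entity types by descending count, as in the report
        result[xpath] = [
            (entity_type, n, round(100 * n / total, 1))
            for entity_type, n in counts.most_common()
        ]
    return result


# Rebuilt from the @lang counts above, not from the raw annotations.
claims = [("@lang", "THG.LNG")] * 113 + [("@lang", "TOP.CTY")] * 2
distribution = xpath_entity_distribution(claims)
for entity_type, n, pct in distribution["@lang"]:
    print(f"- {entity_type}: {n} ({pct}%)")
```

The same grouping works for the string pattern correlation if the key is the string pattern label instead of the XPath.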
String Pattern → Entity Type Correlation
Which string patterns are most associated with which entity types:
unknown (3503 occurrences)
- APP.URL: 937 (26.7%)
- TOP.SET: 403 (11.5%)
- THG.LNG: 398 (11.4%)
- TMP.DAB: 269 (7.7%)
- GRP.HER: 132 (3.8%)
person:dutch_name (1146 occurrences)
- GRP.HER: 410 (35.8%)
- GRP.ASS: 101 (8.8%)
- AGT.PER: 89 (7.8%)
- GRP.GOV: 74 (6.5%)
- APP.TIT: 63 (5.5%)
heritage:institution_type (376 occurrences)
- GRP.HER: 233 (62.0%)
- APP.URL: 50 (13.3%)
- APP.TIT: 20 (5.3%)
- APP.TTL: 19 (5.1%)
- WRK.WEB: 13 (3.5%)
org:foundation_or_association (155 occurrences)
- GRP.ASS: 75 (48.4%)
- GRP.HER: 35 (22.6%)
- APP.URL: 15 (9.7%)
- WRK.WEB: 6 (3.9%)
- APP.TIT: 5 (3.2%)
nav:menu_item (141 occurrences)
- APP.URL: 46 (32.6%)
- APP.TTL: 23 (16.3%)
- APP.TIT: 18 (12.8%)
- WRK.WEB: 14 (9.9%)
- WRK.COL: 13 (9.2%)
loc:postal_code_nl (66 occurrences)
- TOP.ADR: 55 (83.3%)
- TOP.SET: 4 (6.1%)
- QTY.CNT: 2 (3.0%)
- APP.COL: 1 (1.5%)
- APP.NAM: 1 (1.5%)
gov:municipality (65 occurrences)
- GRP.GOV: 53 (81.5%)
- TOP.REG: 3 (4.6%)
- TOP.SET: 2 (3.1%)
- APP.TIT: 2 (3.1%)
- APP.TTL: 2 (3.1%)
info:time_range (63 occurrences)
- TMP.OPH: 52 (82.5%)
- TMP.TAB: 9 (14.3%)
- TMP.EXP: 2 (3.2%)
loc:street_address (62 occurrences)
- TOP.ADR: 55 (88.7%)
- GRP.HER: 3 (4.8%)
- GRP.ASS: 1 (1.6%)
- APP.TTL: 1 (1.6%)
- unknown: 1 (1.6%)
info:opening_hours (48 occurrences)
- TMP.OPH: 32 (66.7%)
- TMP.DAB: 13 (27.1%)
- TMP.SET: 2 (4.2%)
- THG.EVT: 1 (2.1%)
contact:email (42 occurrences)
- APP.URL: 25 (59.5%)
- APP.EMAIL: 9 (21.4%)
- APP.NAM: 3 (7.1%)
- APP.PNM: 2 (4.8%)
- THG.PHO: 1 (2.4%)
contact:phone (4 occurrences)
- APP.URL: 3 (75.0%)
- TOP.ADR: 1 (25.0%)
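The string pattern labels in this section (such as loc:postal_code_nl and info:time_range) behave like simple surface detectors. The sketch below shows one way such detectors could look. The regular expressions and the function name `detect_string_patterns` are assumptions for illustration only and are not taken from dutch_web_patterns.yaml; the only outside fact relied on is that Dutch postal codes consist of four digits followed by two capital letters.

```python
import re

# Hypothetical detectors; the label names come from the report, the regexes do not.
STRING_PATTERNS = {
    # Dutch postal code: four digits (no leading zero), optional space, two capitals
    "loc:postal_code_nl": re.compile(r"\b[1-9]\d{3}\s?[A-Z]{2}\b"),
    # Time range such as "09:00 - 17:00"
    "info:time_range": re.compile(r"\b\d{1,2}:\d{2}\s*[-\u2013]\s*\d{1,2}:\d{2}\b"),
    # Deliberately loose e-mail shape, enough for labeling purposes
    "contact:email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def detect_string_patterns(text: str) -> list[str]:
    """Return the labels whose regex matches, or ["unknown"] if none do."""
    labels = [label for label, rx in STRING_PATTERNS.items() if rx.search(text)]
    return labels or ["unknown"]


print(detect_string_patterns("Kerkweg 5, 1744 JD Eenigenburg, Noord-Holland"))
print(detect_string_patterns("Vandaag open... Bibliotheek Kampen 09:00 - 17:00"))
print(detect_string_patterns("Home"))
```

Falling back to "unknown" mirrors the report, where unknown is the largest bucket at 3503 occurrences.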
Recommendations for dutch_web_patterns.yaml
Based on the analysis, the following layout-aware patterns could be added:
High-Value XPath Targets
- head/title: most often contains the institution name (GRP.HER accounts for 42.9% of the entities found here)
- body/*/h1: primary heading, usually the institution name
- body/*/nav: navigation menu (discard patterns)
- body/*/footer: contact info, address, social links
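To make these targets concrete, here is a minimal sketch that collects the text found inside title, h1, nav and footer elements using only Python's standard-library `html.parser`. The class name `TargetTextCollector`, the helper `collect_target_text` and the sample HTML are illustrative assumptions. The sketch matches on tag names only and does not check the full path (for example, that a title actually sits under head).

```python
from html.parser import HTMLParser

TARGET_TAGS = {"title", "h1", "nav", "footer"}


class TargetTextCollector(HTMLParser):
    """Collect text that appears inside any of the target elements."""

    def __init__(self):
        super().__init__()
        self.open_targets = []                      # stack of open target tags
        self.text = {tag: [] for tag in TARGET_TAGS}

    def handle_starttag(self, tag, attrs):
        if tag in TARGET_TAGS:
            self.open_targets.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_targets:
            # Pop until the matching target is closed (tolerates sloppy nesting).
            while self.open_targets and self.open_targets.pop() != tag:
                pass

    def handle_data(self, data):
        chunk = data.strip()
        if chunk and self.open_targets:
            # Attribute the text to the innermost open target element.
            self.text[self.open_targets[-1]].append(chunk)


def collect_target_text(html: str) -> dict[str, list[str]]:
    parser = TargetTextCollector()
    parser.feed(html)
    parser.close()
    return parser.text


# Made-up page that reuses sample strings from this report.
sample = (
    "<html><head><title>Museum Eenigenburg</title></head><body>"
    "<header><nav><ul><li><a>Home</a></li><li><a>Contact</a></li></ul></nav></header>"
    "<div><h1>Terug in de tijd</h1><p>Overige tekst</p></div>"
    "<footer><p>Kerkweg 5</p></footer></body></html>"
)
found = collect_target_text(sample)
```

Text from nav would typically be routed to discard patterns, while title, h1 and footer text would go on to entity matching.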
Suggested Pattern Additions
    # Add xpath_hint to patterns for better precision
    entity_patterns:
      organizations:
        heritage_institutions:
          patterns:
            - pattern: '^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$'
              xpath_hints:
                - 'head/title'
                - 'body/*/h1'
              confidence_boost: 0.2  # Higher confidence when found at expected location
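A consumer of the suggested pattern entry might apply it roughly as follows. This is a sketch under several assumptions: the base confidence of 0.6, the case-insensitive matching, the exact string comparison against the generalized XPath, and the function name `score_match` are all hypothetical and not part of dutch_web_patterns.yaml. The test strings are made up for illustration.

```python
import re

# Mirrors the suggested YAML entry; held as a plain dict to stay dependency-free.
RULE = {
    "pattern": r"^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$",
    "xpath_hints": ["head/title", "body/*/h1"],
    "confidence_boost": 0.2,
}

BASE_CONFIDENCE = 0.6  # hypothetical starting confidence, not from the source


def score_match(text, xpath, rule=RULE, base=BASE_CONFIDENCE):
    """Return a confidence if text matches the rule, boosted at hinted XPaths."""
    # Case-insensitive matching is an assumption; the YAML does not say.
    if not re.match(rule["pattern"], text.strip(), flags=re.IGNORECASE):
        return None
    # Assumes xpath is already generalized the way the report writes it.
    boost = rule["confidence_boost"] if xpath in rule["xpath_hints"] else 0.0
    return round(min(1.0, base + boost), 2)


print(score_match("Het Spoorwegmuseum", "head/title"))    # hinted location
print(score_match("De Openbare Bibliotheek", "body/*/p"))  # no boost
print(score_match("Home", "head/title"))                   # pattern does not match
```

Capping the result at 1.0 keeps the boost from pushing a confidence out of range if the base value is already high.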