Web Archive Layout Pattern Analysis
Generated: 2025-12-13T15:05:22.346619
Summary
- Annotation files analyzed: 1525
- Unique websites: 1343
- Total layout claims: 4302
- Total entity claims: 15252
Layout Categories
Distribution of content by DOM location category:
other
- Occurrences: 896
- Websites: 457
Top XPath patterns:
body/* (471), body (24), body/*/table (23), body/*/dl (15), * (14)
String patterns found:
unknown (511), person:dutch_name (263), heritage:institution_type (119), nav:menu_item (114), org:foundation_or_association (41)
Samples:
- "Voorstelling 'De Ezelprins' door Spiegelbeesten......" (bibliotheekkampen.nl)
- "Vandaag open... Bibliotheek Kampen 09:00 - 17:00......" (bibliotheekkampen.nl)
- "flexslider fullscreen-carousel hero-slider..." (museumrijswijk.nl)
meta_title
- Occurrences: 828
- Websites: 741
Top XPath patterns:
head/title (815), head/meta[@property='og:title'] (8), head/meta[@property='og:title']/@content (2), head/meta[@name='apple-mobile-web-app-title']/@content (1), head/meta[@name='twitter:title'] (1)
String patterns found:
person:dutch_name (651), heritage:institution_type (266), nav:menu_item (194), unknown (99), org:foundation_or_association (92)
Samples:
- "Protestantse Theologische Universiteit: 400 jaar theologisch onderwijs..." (pthu.nl)
- "Heemkundevereniging Heibloem | Heibloem.nu..." (heibloem.nu)
- "Museum Eenigenburg – Museum Eenigenburg..." (museumeenigenburg.nl)
header
- Occurrences: 561
- Websites: 386
Top XPath patterns:
body/header (204), body/*/header (65), body/*/header/h1 (29), body/*/header/* (22), body/header/* (13)
String patterns found:
unknown (285), person:dutch_name (210), nav:menu_item (79), heritage:institution_type (63), org:foundation_or_association (31)
Samples:
- "...Protestantse Theologische Universitei..." (pthu.nl)
- "Heemkundevereniging Heibloem..." (heibloem.nu)
- "Museum Eenigenburg..." (museumeenigenburg.nl)
meta_other
- Occurrences: 538
- Websites: 436
Top XPath patterns:
head (328), head/script[@type='application/ld+json'] (76), head/script (42), head/link[@rel='canonical'] (22), head/link (13)
String patterns found:
person:dutch_name (264), unknown (242), heritage:institution_type (107), nav:menu_item (81), org:foundation_or_association (32)
Samples:
- "<link rel="canonical" href="https://www.h..." (hildokrop.nl)
- "Museum Rijswijk..." (museumrijswijk.nl)
- "Gemeentearchief Ede | Gemeentearchief Ede......" (gemeentearchief.ede.nl)
paragraph
- Occurrences: 295
- Websites: 190
Top XPath patterns:
body/*/p (202), */p (17), body/*/ul/li/*/p (8), body/p (4), body/*/p/strong (4)
String patterns found:
person:dutch_name (152), unknown (87), heritage:institution_type (73), org:foundation_or_association (25), nav:menu_item (23)
Samples:
- "Heemkundevereniging Heibloem heeft als doel het in brede kring bevorderen van be..." (heibloem.nu)
- "In Museum Eenigenburg wijst een levensgroot kompas......" (museumeenigenburg.nl)
- "Het Museum is zeer geschikt voor het ontvangen van groepen......" (museumeenigenburg.nl)
nav
- Occurrences: 285
- Websites: 247
Top XPath patterns:
body/*/nav (57), body/nav (43), body/header/nav (34), body/*/header/*/nav (30), body/header/*/nav (17)
String patterns found:
nav:menu_item (137), unknown (117), person:dutch_name (105), heritage:institution_type (32), org:foundation_or_association (16)
Samples:
- "Actueel HELD Burgerbus Onderwijs & opvang......" (heibloem.nu)
- "Home Het museum Daglonershuis Kasteel De familie Eenigenburg Blog Contact..." (museumeenigenburg.nl)
- "Home Tentoonstellingen Bezoekersinformatie......" (sonnenborgh.nl)
sub_heading
- Occurrences: 224
- Websites: 128
Top XPath patterns:
body/*/h2 (117), body/*/h3 (65), */h2 (8), body/h2 (6), body/*/h4 (4)
String patterns found:
unknown (138), person:dutch_name (71), heritage:institution_type (36), nav:menu_item (9), org:foundation_or_association (9)
Samples:
- "Verenigingsnieuws..." (heibloem.nu)
- "Contactgegevens..." (heibloem.nu)
- "Home..." (bibliotheekkampen.nl)
main_heading
- Occurrences: 222
- Websites: 182
Top XPath patterns:
body/*/h1 (175), */h1 (16), */a/*/h1 (2), body/*/ul/li/h1 (2), body/section[@class='uk-section head']/*/div[@class='title']/h1 (2)
String patterns found:
person:dutch_name (110), unknown (90), heritage:institution_type (67), nav:menu_item (12), org:foundation_or_association (9)
Samples:
- "Protestantse Theologische Universiteit..." (pthu.nl)
- "Terug in de tijd..." (museumeenigenburg.nl)
- "Ook leuk voor groepen..." (museumeenigenburg.nl)
meta_tag
- Occurrences: 183
- Websites: 165
Top XPath patterns:
head/meta[@name='description'] (69), head/meta (33), head/meta[@name='description']/@content (29), head/meta/@content (22), head/meta[@property='og:description'] (5)
String patterns found:
person:dutch_name (98), unknown (60), heritage:institution_type (56), nav:menu_item (16), org:foundation_or_association (14)
Samples:
- "Prachtige tijdelijke en vaste tentoonstellingen zijn te zien bij Museum Catharij..." (catharijneconvent.nl)
- "Energie is overal. We gebruiken het constant en ervaren er veel gemak van. Tegel..." (vindhetuit.nl)
- "Aanmeldformulier Via dit formulier kunt u zich aanmelden bij de Stichting Vriend..." (verbindingsdienst.nl)
list
- Occurrences: 130
- Websites: 92
Top XPath patterns:
body/*/ul (93), */ul (5), body/*/ul[@class='metanav'] (5), body/ul (2), body/*/ol (2)
String patterns found:
unknown (62), person:dutch_name (47), nav:menu_item (29), heritage:institution_type (19), org:foundation_or_association (4)
Samples:
- "- ......" (bibliotheekkampen.nl)
- "Nu te zien [list of exhibitions]..." (catharijneconvent.nl)
- "Let op: in december hebben we aangepaste openingstijden. Bekijk hier de openings..." (debieb.nl)
footer
- Occurrences: 119
- Websites: 92
Top XPath patterns:
body/footer (44), body/*/footer (16), body/footer/* (10), footer (6), body/footer/*/p (5)
String patterns found:
person:dutch_name (41), nav:menu_item (39), unknown (37), loc:postal_code_nl (12), loc:street_address (10)
Samples:
- "Heibloem.nu Nieuwsbrief Contact......" (heibloem.nu)
- "Made with ♥ by Encore Digital Agency......" (heibloem.nu)
- "Contactinformatie Kerkweg 5... Toegangsprijzen... OPENINGS TIJDEN......" (museumeenigenburg.nl)
main_content
- Occurrences: 10
- Websites: 6
Top XPath patterns:
- body/div[@id='page_wrapper']/main[@id='content_main']/div[@class='cm_column_wrapper']/div[@class='cm_column']/p (3)
- body/div[@id='readspeaker']/*/article[@role='region']/*/p (1)
- body/main[@id='main']/div[@id='content-wrap']/div[@id='primary']/div[@id='content']/*/div[@class='entry']/div[@data-elementor-type='wp-post']/section[@data-id='150f6235'] (1)
- body/div[@id='wrapper']/div[@id='mainCntr']/main[@id='contentCntr']/div[@id='generatedContent']/div[@class='timeItem'] (1)
- body/div[@id='wrapper']/div[@id='mainCntr']/main[@id='contentCntr']/div[@id='generatedContent']/div[@class='locationItem'] (1)
String patterns found:
unknown (7), person:dutch_name (2), info:opening_hours (1)
Samples:
- "Let op: Op donderdag 27 en vrijdag 28 november 2025 is Burgerzaken gesloten......" (veendam.nl)
- "september 2025: Arfdeeltje 53......" (lemelsarfgoed.nl)
- "19 dec. 2024: Arfdeeltje 52......" (lemelsarfgoed.nl)
contact
- Occurrences: 10
- Websites: 7
Top XPath patterns:
- body/*/address (1)
- body/div[@id='contact']/*/ul (1)
- div[@class='xr_group' and ./span[text()='Contact']] (1)
- body/div[@id='wrapper']/div[@id='rechterkolom']/div[@class='moduletablecontact']/div[@class='customcontact'] (1)
- h2[text()='Contact en adressen'] (1)
String patterns found:
nav:menu_item (4), loc:street_address (3), person:dutch_name (3), loc:postal_code_nl (2), unknown (2)
Samples:
- "Bongerd 18, 6411 JM Heerlen..." (museum.nl)
- "Winsum, Hoofdstraat W 70 Uithuizen, Hoofdstraat-West 1..." (hethogeland.nl)
- "Contact Contactgegevens..." (heemkundenispen.nl)
sidebar
- Occurrences: 1
- Websites: 1
Top XPath patterns:
body/aside[@id='leftbar'] (1)
String patterns found:
org:foundation_or_association (1), nav:menu_item (1), person:dutch_name (1)
Samples:
- "Vereniging Adres / Contact Kwartaalblad......" (oudhoorn.nl)
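The category names above appear to follow from where an XPath sits in the DOM. The sketch below shows one way such a mapping could work, assuming ordered regular-expression rules where the first match wins. The function name `classify_xpath` and every rule are hypothetical and are not taken from the analysis tool; only the category names come from this report. The `main_content` and `contact` categories are left out because they seem to depend on site-specific ids and classes.

```python
import re

# Hypothetical rules, checked in order; the first match wins.
# Category names come from the report; the rules themselves are assumptions.
_RULES = [
    (re.compile(r"^head/title$"), "meta_title"),
    (re.compile(r"^head/meta\[@(property|name)='[^']*title'\]"), "meta_title"),
    (re.compile(r"^head/meta"), "meta_tag"),
    (re.compile(r"^head(/|$)"), "meta_other"),
    (re.compile(r"(^|/)nav(/|\[|$)"), "nav"),
    (re.compile(r"(^|/)header(/|\[|$)"), "header"),
    (re.compile(r"(^|/)footer(/|\[|$)"), "footer"),
    (re.compile(r"(^|/)aside(/|\[|$)"), "sidebar"),
    (re.compile(r"(^|/)h1$"), "main_heading"),
    (re.compile(r"(^|/)h[2-6]$"), "sub_heading"),
    (re.compile(r"(^|/)p(/|$)"), "paragraph"),
    (re.compile(r"(^|/)(ul|ol)(\[[^\]]*\])?$"), "list"),
]


def classify_xpath(xpath: str) -> str:
    """Return the DOM location category for an XPath, or 'other' if no rule matches."""
    for pattern, category in _RULES:
        if pattern.search(xpath):
            return category
    return "other"
```

Under these assumed rules, `body/header/nav` is classified as `nav` and `body/*/header/h1` as `header`, which matches where those XPaths are listed above.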
Entity Type Distribution
Entity Type  Count  Websites  Primary Location
APP.URL       3196      1176  meta_other
GRP.HER       1805       891  meta_title
TOP.SET       1344       722  meta_tag
THG.LNG       1309       921  other
TMP.DAB        928       476  meta_tag
TOP.ADR        412       233  footer
GRP.ASS        401       230  meta_title
GRP.GOV        381       168  meta_title
APP.TIT        380       278  meta_title
AGT.PER        334       180  paragraph
WRK.WEB        316       191  meta_other
APP.TTL        293       206  meta_title
THG.CON        260       162  meta_tag
APP.EXH        259       166  other
TOP.REG        254       192  meta_tag
TOP.CTY        234       179  meta_tag
TMP.OPH        217       130  paragraph
QTY.MSR        179        85  meta_tag
QTY.CNT        176       106  meta_tag
GRP.COR        152       110  meta_tag
Entity Type Details
APP.URL
- Total occurrences: 3196
- Unique websites: 1176
Found in DOM categories:
- meta_other: 1705 (53.3%)
- meta_tag: 455 (14.2%)
- other: 293 (9.2%)
- list: 162 (5.1%)
- footer: 149 (4.7%)
Common XPath patterns:
head/link/@href (441), head/link[@rel='canonical']/@href (342), head/script[@type='application/ld+json'] (186), head/meta/@content (175), head/link (151)
Samples:
- "https://www.pthu.nl/" @ head/link/@href
- "https://heibloem.nu/vereniging/heemkundevereniging-heibloem" @ head/meta[@property='og:url']
- "https://heibloem.nu/held" @ body/*/footer/*/p/a
GRP.HER
- Total occurrences: 1805
- Unique websites: 891
Found in DOM categories:
- meta_title: 871 (48.3%)
- meta_tag: 311 (17.2%)
- other: 181 (10.0%)
- meta_other: 131 (7.3%)
- header: 88 (4.9%)
Common XPath patterns:
head/title (829), head/meta/@content (129), head/meta[@name='description']/@content (66), head/script[@type='application/ld+json'] (56), head/meta[@property='og:site_name']/@content (52)
Samples:
- "Museum Eenigenburg" @ head/title
- "Musea Schagen" @ body/*/footer/*/a/*/p
- "Hildo Krop Museum" @ head/title
TOP.SET
- Total occurrences: 1344
- Unique websites: 722
Found in DOM categories:
- meta_tag: 356 (26.5%)
- meta_title: 331 (24.6%)
- paragraph: 209 (15.6%)
- other: 126 (9.4%)
- meta_other: 84 (6.2%)
Common XPath patterns:
head/title (315), head/meta[@name='description']/@content (154), head/meta/@content (106), body/*/p (101), head/script[@type='application/ld+json'] (32)
Samples:
- "Amsterdam" @ head/script[@type='application/ld+json']
- "Eenigenburg" @ body/*/header/*/nav/*/ul/li/a
- "Steenwijk" @ head/meta[@property='og:site_name']
THG.LNG
- Total occurrences: 1309
- Unique websites: 921
Found in DOM categories:
- other: 846 (64.6%)
- meta_tag: 207 (15.8%)
- meta_other: 150 (11.5%)
- nav: 32 (2.4%)
- paragraph: 27 (2.1%)
Common XPath patterns:
@lang (360), html/@lang (192), html (70), head/meta[@property='og:locale']/@content (54), head/meta/@content (50)
Samples:
- "nl_NL" @ head/meta[@property='og:locale']
- "nl-NL" @ [@lang='nl-NL']
- "nl_NL" @ html/@lang
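The THG.LNG samples mix underscore and hyphen spellings ("nl_NL" and "nl-NL"). If these values are compared or deduplicated downstream, a normalization step may help. The sketch below is one possible approach and not part of the analysis tool; `normalize_lang_tag` is a hypothetical helper that handles only a language code with an optional two-letter region.

```python
def normalize_lang_tag(tag: str) -> str:
    """Normalize tags such as 'nl_NL' or 'NL-nl' to the 'nl-NL' form.

    Simplification: only a language subtag plus an optional two-letter
    region is kept; any other subtags are dropped.
    """
    parts = tag.strip().replace("_", "-").split("-")
    language = parts[0].lower()
    if len(parts) > 1 and len(parts[1]) == 2:
        return f"{language}-{parts[1].upper()}"
    return language


# Both spellings from the samples collapse to the same value.
normalized = {normalize_lang_tag(t) for t in ["nl_NL", "nl-NL"]}
```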
TMP.DAB
- Total occurrences: 928
- Unique websites: 476
Found in DOM categories:
- meta_tag: 396 (42.7%)
- meta_other: 189 (20.4%)
- paragraph: 166 (17.9%)
- other: 97 (10.5%)
- list: 33 (3.6%)
Common XPath patterns:
head/meta[@property='article:modified_time']/@content (108), head/meta/@content (108), head/script[@type='application/ld+json'] (93), body/*/p (75), head/script/text() (34)
Samples:
- "2024" @ li[@id='menu-item-2747']/a
- "2025" @ li[@id='menu-item-2758']/a
- "2022-05-30T07:37:02+00:00" @ head/meta[@property='article:modified_time']/@content
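The TMP.DAB samples include both bare years taken from menu items and full ISO 8601 timestamps from metadata. A consumer of these values may want to tell the two apart. The sketch below is an illustration under that assumption; `parse_tmp_dab` is a hypothetical helper and the three result kinds are not defined anywhere in this report.

```python
import re
from datetime import datetime


def parse_tmp_dab(value: str):
    """Classify a date string as ('year', int), ('datetime', datetime) or ('raw', str)."""
    value = value.strip()
    if re.fullmatch(r"\d{4}", value):
        return ("year", int(value))
    try:
        # Handles timestamps such as 2022-05-30T07:37:02+00:00.
        return ("datetime", datetime.fromisoformat(value))
    except ValueError:
        # Free-text dates are passed through unchanged.
        return ("raw", value)
```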
TOP.ADR
- Total occurrences: 412
- Unique websites: 233
Found in DOM categories:
- footer: 104 (25.2%)
- paragraph: 81 (19.7%)
- other: 68 (16.5%)
- meta_other: 60 (14.6%)
- contact: 25 (6.1%)
Common XPath patterns:
body/*/p (31), body/footer/*/p (29), head/script[@type='application/ld+json'] (24), head/script/text() (19), body/*/p/* (17)
Samples:
- "De Boelelaan 1105, 1081 HV, Amsterdam, Nederland" @ head/script[@type='application/ld+json']
- "Kerkweg 5, 1744 JD Eenigenburg, Noord-Holland" @ body/*/footer/*/ol/li
- "Herenstraat 67" @ body/footer/*
GRP.ASS
- Total occurrences: 401
- Unique websites: 230
Found in DOM categories:
- meta_title: 164 (40.9%)
- meta_tag: 58 (14.5%)
- paragraph: 43 (10.7%)
- header: 27 (6.7%)
- other: 27 (6.7%)
Common XPath patterns:
head/title (153), head/meta/@content (25), body/*/p (24), head/script[@type='application/ld+json'] (13), head/meta[@name='description']/@content (11)
Samples:
- "Heemkundevereniging Heibloem" @ head/title
- "Heemkundevereniging Heibloem" @ body/*/header/h1/*
- "Heemkundevereniging Heibloem" @ body/*/p/strong
GRP.GOV
- Total occurrences: 381
- Unique websites: 168
Found in DOM categories:
- meta_title: 131 (34.4%)
- meta_tag: 126 (33.1%)
- header: 41 (10.8%)
- paragraph: 26 (6.8%)
- other: 17 (4.5%)
Common XPath patterns:
head/title (127), head/meta/@content (42), head/meta[@name='description']/@content (26), head/meta (15), body/*/p (13)
Samples:
- "gemeente Ede" @ body/img[@alt='Logo Gemeentearchief Ede']/@alt
- "Provincie Noord-Holland" @ head/title
- "Provincie Overijssel" @ head/title
APP.TIT
- Total occurrences: 380
- Unique websites: 278
Found in DOM categories:
- meta_title: 179 (47.1%)
- meta_tag: 39 (10.3%)
- other: 39 (10.3%)
- sub_heading: 36 (9.5%)
- paragraph: 24 (6.3%)
Common XPath patterns:
head/title (153), head/meta/@content (15), body/*/h2 (12), head/meta[@property='og:title']/@content (12), body/*/a/*/h4 (10)
Samples:
- "Heemkundevereniging Heibloem" @ head/title
- "Uitnodiging lezing "Ontdek de sporen van je voorouders"" @ body/*/h2
- "Versiering plaquette "Sporen die bleven"" @ body/*/h2
AGT.PER
- Total occurrences: 334
- Unique websites: 180
Found in DOM categories:
- paragraph: 107 (32.0%)
- meta_tag: 80 (24.0%)
- other: 63 (18.9%)
- sub_heading: 28 (8.4%)
- meta_other: 17 (5.1%)
Common XPath patterns:
body/*/p (44), head/meta/@content (31), head/meta[@name='description']/@content (23), body/* (11), body/*/p/text() (11)
Samples:
- "Erik Pape" @ body/*/ul/li/*/h2/a
- "Caren van Herwaarden" @ body/*/ul/li/*/h2/a
- "Prof. dr. Chris Stolwijk" @ body/span[contains(text(), 'Prof. dr. Chris Stolwijk')]
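The percentages and the "Primary Location" column can be reproduced from the raw counts. The sketch below shows the arithmetic using the APP.URL figures; it assumes the share is the category count divided by the entity's total occurrences, rounded to one decimal, and that the primary location is simply the category with the highest count. The function name `category_shares` is hypothetical.

```python
def category_shares(counts: dict[str, int], total: int) -> list[tuple[str, int, float]]:
    """Return (category, count, percent) rows sorted by descending count."""
    rows = [(cat, n, round(100 * n / total, 1)) for cat, n in counts.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Counts copied from the APP.URL entry (top five DOM categories only).
app_url = {"meta_other": 1705, "meta_tag": 455, "other": 293, "list": 162, "footer": 149}
shares = category_shares(app_url, total=3196)

# Assumption: the primary location is the category with the largest count.
primary_location = shares[0][0]
```

With these inputs the first row is `("meta_other", 1705, 53.3)`, which agrees with the 53.3% listed for APP.URL.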
XPath → Entity Type Mapping
Which entity types are typically found at which DOM locations:
head/title (1983 entities)
- GRP.HER: 829 (41.8%)
- TOP.SET: 315 (15.9%)
- GRP.ASS: 153 (7.7%)
- APP.TIT: 153 (7.7%)
- GRP.GOV: 127 (6.4%)
head/meta/@content (976 entities)
- APP.URL: 175 (17.9%)
- GRP.HER: 129 (13.2%)
- TMP.DAB: 108 (11.1%)
- TOP.SET: 106 (10.9%)
- THG.LNG: 50 (5.1%)
body/*/p (770 entities)
- TOP.SET: 101 (13.1%)
- TMP.DAB: 75 (9.7%)
- AGT.PER: 44 (5.7%)
- GRP.HER: 36 (4.7%)
- TOP.ADR: 31 (4.0%)
head/script[@type='application/ld+json'] (641 entities)
- APP.URL: 186 (29.0%)
- TMP.DAB: 93 (14.5%)
- GRP.HER: 56 (8.7%)
- TOP.SET: 32 (5.0%)
- QTY.MSR: 32 (5.0%)
head/meta[@name='description']/@content (619 entities)
- TOP.SET: 154 (24.9%)
- GRP.HER: 66 (10.7%)
- THG.CON: 43 (6.9%)
- TOP.REG: 35 (5.7%)
- GRP.GOV: 26 (4.2%)
head/link/@href (468 entities)
- APP.URL: 441 (94.2%)
- WRK.WEB: 7 (1.5%)
- GRP.HER: 6 (1.3%)
- TOP.SET: 4 (0.9%)
- APP.WEB: 3 (0.6%)
@lang (365 entities)
- THG.LNG: 360 (98.6%)
- TOP.CTY: 4 (1.1%)
- TOP.SET: 1 (0.3%)
head/link[@rel='canonical']/@href (351 entities)
- APP.URL: 342 (97.4%)
- WRK.WEB: 7 (2.0%)
- WRK.URL: 1 (0.3%)
- GRP.HER: 1 (0.3%)
head/script/text() (342 entities)
- APP.URL: 103 (30.1%)
- GRP.HER: 37 (10.8%)
- TMP.DAB: 34 (9.9%)
- TOP.ADR: 19 (5.6%)
- THG.LNG: 19 (5.6%)
head/script (242 entities)
- APP.URL: 57 (23.6%)
- TMP.DAB: 32 (13.2%)
- GRP.HER: 17 (7.0%)
- THG.LNG: 13 (5.4%)
- QTY.MSR: 11 (4.5%)
html/@lang (196 entities)
- THG.LNG: 192 (98.0%)
- TOP.CTY: 3 (1.5%)
- TOP.LNG: 1 (0.5%)
head/meta (175 entities)
- APP.URL: 32 (18.3%)
- TMP.DAB: 22 (12.6%)
- GRP.GOV: 15 (8.6%)
- THG.LNG: 14 (8.0%)
- GRP.HER: 13 (7.4%)
head/link (161 entities)
- APP.URL: 151 (93.8%)
- THG.LNG: 5 (3.1%)
- WRK.WEB: 2 (1.2%)
- APP.WKD: 1 (0.6%)
- GRP.COR: 1 (0.6%)
head/link[@rel='canonical'] (150 entities)
- APP.URL: 143 (95.3%)
- WRK.WEB: 7 (4.7%)
body/* (139 entities)
- TMP.DAB: 24 (17.3%)
- TOP.ADR: 13 (9.4%)
- AGT.PER: 11 (7.9%)
- TMP.TAB: 8 (5.8%)
- WRK.COL: 7 (5.0%)
head/meta[@property='article:modified_time']/@content (108 entities)
- TMP.DAB: 108 (100.0%)
head/script[@type='application/ld+json']/text() (108 entities)
- APP.URL: 27 (25.0%)
- TMP.DAB: 16 (14.8%)
- TOP.SET: 9 (8.3%)
- GRP.HER: 9 (8.3%)
- TOP.REG: 5 (4.6%)
body/*/ul/li/a (100 entities)
- APP.URL: 33 (33.0%)
- APP.COL: 10 (10.0%)
- WRK.COL: 7 (7.0%)
- WRK.WEB: 7 (7.0%)
- APP.EXH: 7 (7.0%)
body/*/h2 (98 entities)
- GRP.HER: 21 (21.4%)
- APP.TIT: 12 (12.2%)
- APP.EXH: 11 (11.2%)
- TOP.SET: 11 (11.2%)
- GRP.ASS: 6 (6.1%)
head/meta[@property='og:site_name']/@content (98 entities)
- GRP.HER: 52 (53.1%)
- GRP.GOV: 12 (12.2%)
- GRP.ASS: 6 (6.1%)
- APP.TTL: 5 (5.1%)
- TOP.SET: 4 (4.1%)
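One way to use this mapping is as a prior: when a single entity type dominates an XPath, an extractor could treat that type as the default for values found there. The sketch below illustrates the idea with a few rows copied from the mapping above. The table name `XPATH_PRIORS`, the function `likely_entity_type`, and the 50% threshold are assumptions made for illustration.

```python
# Shares (percent of entities at that XPath) copied from the mapping above.
# Only a subset of XPaths and entity types is included.
XPATH_PRIORS = {
    "head/title": {"GRP.HER": 41.8, "TOP.SET": 15.9, "GRP.ASS": 7.7, "APP.TIT": 7.7, "GRP.GOV": 6.4},
    "head/link[@rel='canonical']/@href": {"APP.URL": 97.4, "WRK.WEB": 2.0},
    "@lang": {"THG.LNG": 98.6, "TOP.CTY": 1.1, "TOP.SET": 0.3},
}


def likely_entity_type(xpath: str, min_share: float = 50.0):
    """Return (entity_type, share) if one type dominates the XPath, else None."""
    priors = XPATH_PRIORS.get(xpath)
    if not priors:
        return None
    entity_type, share = max(priors.items(), key=lambda item: item[1])
    return (entity_type, share) if share >= min_share else None
```

With the assumed 50% threshold, `@lang` resolves to THG.LNG while `head/title` returns `None`, because GRP.HER accounts for only 41.8% of the entities found there.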
String Pattern → Entity Type Correlation
Which string patterns are most associated with which entity types:
unknown (10443 occurrences)
- APP.URL: 2814 (26.9%)
- THG.LNG: 1307 (12.5%)
- TOP.SET: 1250 (12.0%)
- TMP.DAB: 892 (8.5%)
- GRP.HER: 363 (3.5%)
person:dutch_name (3440 occurrences)
- GRP.HER: 1329 (38.6%)
- GRP.ASS: 313 (9.1%)
- GRP.GOV: 246 (7.2%)
- AGT.PER: 241 (7.0%)
- APP.TIT: 172 (5.0%)
heritage:institution_type (1185 occurrences)
- GRP.HER: 777 (65.6%)
- APP.URL: 140 (11.8%)
- APP.TTL: 56 (4.7%)
- APP.TIT: 49 (4.1%)
- WRK.WEB: 31 (2.6%)
org:foundation_or_association (383 occurrences)
- GRP.ASS: 198 (51.7%)
- GRP.HER: 85 (22.2%)
- APP.URL: 24 (6.3%)
- APP.TIT: 15 (3.9%)
- WRK.WEB: 15 (3.9%)
nav:menu_item (368 occurrences)
- APP.URL: 130 (35.3%)
- APP.TTL: 46 (12.5%)
- APP.TIT: 45 (12.2%)
- WRK.COL: 40 (10.9%)
- WRK.WEB: 34 (9.2%)
loc:postal_code_nl (218 occurrences)
- TOP.ADR: 193 (88.5%)
- TOP.SET: 7 (3.2%)
- APP.NAM: 2 (0.9%)
- QTY.CNT: 2 (0.9%)
- unknown: 2 (0.9%)
gov:municipality (206 occurrences)
- GRP.GOV: 171 (83.0%)
- APP.TTL: 11 (5.3%)
- TOP.SET: 5 (2.4%)
- APP.TIT: 5 (2.4%)
- TOP.REG: 5 (2.4%)
info:time_range (185 occurrences)
- TMP.OPH: 152 (82.2%)
- TMP.TAB: 25 (13.5%)
- TMP.EXP: 3 (1.6%)
- TMP.DAT: 2 (1.1%)
- TMP.RNG: 1 (0.5%)
loc:street_address (178 occurrences)
- TOP.ADR: 158 (88.8%)
- GRP.HER: 8 (4.5%)
- TOP.SET: 2 (1.1%)
- APP.PNM: 2 (1.1%)
- APP.URL: 1 (0.6%)
info:opening_hours (161 occurrences)
- TMP.OPH: 115 (71.4%)
- TMP.DAB: 31 (19.3%)
- TMP.SET: 7 (4.3%)
- TMP.RNG: 2 (1.2%)
- TMP.DAT: 2 (1.2%)
contact:email (126 occurrences)
- APP.URL: 93 (73.8%)
- APP.EMAIL: 12 (9.5%)
- APP.NAM: 5 (4.0%)
- APP.PNM: 5 (4.0%)
- APP.EML: 3 (2.4%)
contact:phone (7 occurrences)
- APP.URL: 5 (71.4%)
- TOP.ADR: 1 (14.3%)
- APP.TTL: 1 (14.3%)
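The strongest correlation above is loc:postal_code_nl with TOP.ADR (88.5%). The report does not show how that string pattern is defined, so the sketch below only illustrates what such a detector might look like. The regex is an assumption based on the usual shape of Dutch postal codes (four digits, optional space, two capital letters) and is not taken from dutch_web_patterns.yaml.

```python
import re

# Assumed shape of a Dutch postal code: four digits with no leading zero,
# an optional space, then two capital letters. Hypothetical pattern.
POSTAL_CODE_NL = re.compile(r"\b[1-9]\d{3}\s?[A-Z]{2}\b")


def find_postal_codes(text: str) -> list[str]:
    """Return all substrings that look like Dutch postal codes."""
    return POSTAL_CODE_NL.findall(text)


# Address strings taken from the samples in this report.
hits = find_postal_codes("De Boelelaan 1105, 1081 HV, Amsterdam, Nederland")
```

In the sample address the house number 1105 is skipped because it is followed by a comma and not by two letters, so only the postal code is returned.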
Recommendations for dutch_web_patterns.yaml
Based on the analysis, the following layout-aware patterns could be added:
High-Value XPath Targets
- head/title - Almost always contains institution name
- body/*/h1 - Primary heading, usually institution name
- body/*/nav - Navigation menu (discard patterns)
- body/*/footer - Contact info, address, social links
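The targets listed above can be tried out with the standard library alone. The sketch below uses `xml.etree.ElementTree`, which supports simple paths such as `body/*/h1`. The page is a hypothetical, well-formed snippet assembled from sample strings in this report; real pages are rarely well-formed XML and would likely need a lenient HTML parser. The names `TARGETS` and `texts_at` are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page built from sample strings in this report.
SAMPLE = """<html lang="nl-NL">
  <head><title>Museum Eenigenburg</title></head>
  <body>
    <div><nav><a>Home</a> <a>Contact</a></nav></div>
    <main><h1>Terug in de tijd</h1></main>
    <div><footer><p>Kerkweg 5, 1744 JD Eenigenburg</p></footer></div>
  </body>
</html>"""

TARGETS = ["head/title", "body/*/h1", "body/*/footer"]


def texts_at(root: ET.Element, path: str) -> list[str]:
    """Collect whitespace-normalized text of every element matching the path."""
    return [" ".join("".join(el.itertext()).split()) for el in root.findall(path)]


root = ET.fromstring(SAMPLE)
candidates = {path: texts_at(root, path) for path in TARGETS}

# Navigation text is collected separately so it can be discarded.
menu_noise = texts_at(root, "body/*/nav")
```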
Suggested Pattern Additions
# Add xpath_hint to patterns for better precision
entity_patterns:
  organizations:
    heritage_institutions:
      patterns:
        - pattern: '^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$'
          xpath_hints:
            - 'head/title'
            - 'body/*/h1'
          confidence_boost: 0.2  # Higher confidence when found at expected location
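How a matcher consumes `xpath_hints` and `confidence_boost` is not specified in this report. The sketch below shows one plausible reading: the boost is added to a base score when the match occurs at a hinted XPath. The suggested entry is mirrored as a Python dict to avoid a YAML dependency. The base confidence of 0.6, the case-insensitive matching, the function name `score`, and the input "De Oude Bibliotheek" are all assumptions made for illustration.

```python
import re

# Mirrors the suggested YAML entry as a plain dict.
RULE = {
    "pattern": r"^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$",
    "xpath_hints": ["head/title", "body/*/h1"],
    "confidence_boost": 0.2,
}

BASE_CONFIDENCE = 0.6  # assumed starting score, not taken from the report


def score(text: str, xpath: str, rule: dict = RULE):
    """Return a confidence in [0, 1] if the text matches the rule, else None."""
    # Case-insensitive matching is an assumption; the entry specifies no flags.
    if not re.match(rule["pattern"], text.strip(), flags=re.IGNORECASE):
        return None
    confidence = BASE_CONFIDENCE
    if xpath in rule["xpath_hints"]:
        confidence += rule["confidence_boost"]
    return min(confidence, 1.0)


in_title = score("De Oude Bibliotheek", "head/title")
in_paragraph = score("De Oude Bibliotheek", "body/*/p")
```

Because the pattern requires a leading "het" or "de", sample names from this report such as "Museum Eenigenburg" would not match it as written, so the pattern may need a variant without the article.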