# Web Archive Layout Pattern Analysis

Generated: 2025-12-13T15:05:22.346619

## Summary

- **Annotation files analyzed**: 1525
- **Unique websites**: 1343
- **Total layout claims**: 4302
- **Total entity claims**: 15252

## Layout Categories

Distribution of content by DOM location category:

### other

- **Occurrences**: 896
- **Websites**: 457

**Top XPath patterns:**

- `body/*` (471)
- `body` (24)
- `body/*/table` (23)
- `body/*/dl` (15)
- `*` (14)

**String patterns found:**

- `unknown` (511)
- `person:dutch_name` (263)
- `heritage:institution_type` (119)
- `nav:menu_item` (114)
- `org:foundation_or_association` (41)

**Samples:**

- "Voorstelling 'De Ezelprins' door Spiegelbeesten......" (bibliotheekkampen.nl)
- "Vandaag open... Bibliotheek Kampen 09:00 - 17:00......" (bibliotheekkampen.nl)
- "flexslider fullscreen-carousel hero-slider..." (museumrijswijk.nl)

### meta_title

- **Occurrences**: 828
- **Websites**: 741

**Top XPath patterns:**

- `head/title` (815)
- `head/meta[@property='og:title']` (8)
- `head/meta[@property='og:title']/@content` (2)
- `head/meta[@name='apple-mobile-web-app-title']/@content` (1)
- `head/meta[@name='twitter:title']` (1)

**String patterns found:**

- `person:dutch_name` (651)
- `heritage:institution_type` (266)
- `nav:menu_item` (194)
- `unknown` (99)
- `org:foundation_or_association` (92)

**Samples:**

- "Protestantse Theologische Universiteit: 400 jaar theologisch onderwijs..." (pthu.nl)
- "Heemkundevereniging Heibloem | Heibloem.nu..." (heibloem.nu)
- "Museum Eenigenburg – Museum Eenigenburg..." (museumeenigenburg.nl)

### header

- **Occurrences**: 561
- **Websites**: 386

**Top XPath patterns:**

- `body/header` (204)
- `body/*/header` (65)
- `body/*/header/h1` (29)
- `body/*/header/*` (22)
- `body/header/*` (13)

**String patterns found:**

- `unknown` (285)
- `person:dutch_name` (210)
- `nav:menu_item` (79)
- `heritage:institution_type` (63)
- `org:foundation_or_association` (31)

**Samples:**

- "
...Protestantse Theologische Universitei..." (pthu.nl)
- "Heemkundevereniging Heibloem..." (heibloem.nu)
- "Museum Eenigenburg..." (museumeenigenburg.nl)

### meta_other

- **Occurrences**: 538
- **Websites**: 436

**Top XPath patterns:**

- `head` (328)
- `head/script[@type='application/ld+json']` (76)
- `head/script` (42)
- `head/link[@rel='canonical']` (22)
- `head/link` (13)

**String patterns found:**

- `person:dutch_name` (264)
- `unknown` (242)
- `heritage:institution_type` (107)
- `nav:menu_item` (81)
- `org:foundation_or_association` (32)

**Samples:**

- "Home - Hildo Krop Museum......" (bibliotheekkampen.nl)
- "Nu te zien [list of exhibitions]..." (catharijneconvent.nl)
- "Let op: in december hebben we aangepaste openingstijden. Bekijk hier de openings..." (debieb.nl)

### footer

- **Occurrences**: 119
- **Websites**: 92

**Top XPath patterns:**

- `body/footer` (44)
- `body/*/footer` (16)
- `body/footer/*` (10)
- `footer` (6)
- `body/footer/*/p` (5)

**String patterns found:**

- `person:dutch_name` (41)
- `nav:menu_item` (39)
- `unknown` (37)
- `loc:postal_code_nl` (12)
- `loc:street_address` (10)

**Samples:**

- "Heibloem.nu Nieuwsbrief Contact......" (heibloem.nu)
- "Made with ♥ by Encore Digital Agency......" (heibloem.nu)
- "Contactinformatie Kerkweg 5... Toegangsprijzen... OPENINGS TIJDEN......"
(museumeenigenburg.nl)

### main_content

- **Occurrences**: 10
- **Websites**: 6

**Top XPath patterns:**

- `body/div[@id='page_wrapper']/main[@id='content_main']/div[@class='cm_column_wrapper']/div[@class='cm_column']/p` (3)
- `body/div[@id='readspeaker']/*/article[@role='region']/*/p` (1)
- `body/main[@id='main']/div[@id='content-wrap']/div[@id='primary']/div[@id='content']/*/div[@class='entry']/div[@data-elementor-type='wp-post']/section[@data-id='150f6235']` (1)
- `body/div[@id='wrapper']/div[@id='mainCntr']/main[@id='contentCntr']/div[@id='generatedContent']/div[@class='timeItem']` (1)
- `body/div[@id='wrapper']/div[@id='mainCntr']/main[@id='contentCntr']/div[@id='generatedContent']/div[@class='locationItem']` (1)

**String patterns found:**

- `unknown` (7)
- `person:dutch_name` (2)
- `info:opening_hours` (1)

**Samples:**

- "Let op: Op donderdag 27 en vrijdag 28 november 2025 is Burgerzaken gesloten......" (veendam.nl)
- "september 2025: Arfdeeltje 53......" (lemelsarfgoed.nl)
- "19 dec. 2024: Arfdeeltje 52......" (lemelsarfgoed.nl)

### contact

- **Occurrences**: 10
- **Websites**: 7

**Top XPath patterns:**

- `body/*/address` (1)
- `body/div[@id='contact']/*/ul` (1)
- `div[@class='xr_group' and ./span[text()='Contact']]` (1)
- `body/div[@id='wrapper']/div[@id='rechterkolom']/div[@class='moduletablecontact']/div[@class='customcontact']` (1)
- `h2[text()='Contact en adressen']` (1)

**String patterns found:**

- `nav:menu_item` (4)
- `loc:street_address` (3)
- `person:dutch_name` (3)
- `loc:postal_code_nl` (2)
- `unknown` (2)

**Samples:**

- "Bongerd 18, 6411 JM Heerlen..." (museum.nl)
- "Winsum, Hoofdstraat W 70 Uithuizen, Hoofdstraat-West 1..." (hethogeland.nl)
- "Contact Contactgegevens..."
(heemkundenispen.nl)

### sidebar

- **Occurrences**: 1
- **Websites**: 1

**Top XPath patterns:**

- `body/aside[@id='leftbar']` (1)

**String patterns found:**

- `org:foundation_or_association` (1)
- `nav:menu_item` (1)
- `person:dutch_name` (1)

**Samples:**

- "Vereniging Adres / Contact Kwartaalblad......" (oudhoorn.nl)

## Entity Type Distribution

| Entity Type | Count | Websites | Primary Location |
|-------------|-------|----------|------------------|
| APP.URL | 3196 | 1176 | meta_other |
| GRP.HER | 1805 | 891 | meta_title |
| TOP.SET | 1344 | 722 | meta_tag |
| THG.LNG | 1309 | 921 | other |
| TMP.DAB | 928 | 476 | meta_tag |
| TOP.ADR | 412 | 233 | footer |
| GRP.ASS | 401 | 230 | meta_title |
| GRP.GOV | 381 | 168 | meta_title |
| APP.TIT | 380 | 278 | meta_title |
| AGT.PER | 334 | 180 | paragraph |
| WRK.WEB | 316 | 191 | meta_other |
| APP.TTL | 293 | 206 | meta_title |
| THG.CON | 260 | 162 | meta_tag |
| APP.EXH | 259 | 166 | other |
| TOP.REG | 254 | 192 | meta_tag |
| TOP.CTY | 234 | 179 | meta_tag |
| TMP.OPH | 217 | 130 | paragraph |
| QTY.MSR | 179 | 85 | meta_tag |
| QTY.CNT | 176 | 106 | meta_tag |
| GRP.COR | 152 | 110 | meta_tag |

### Entity Type Details

#### APP.URL

- **Total occurrences**: 3196
- **Unique websites**: 1176

**Found in DOM categories:**

- meta_other: 1705 (53.3%)
- meta_tag: 455 (14.2%)
- other: 293 (9.2%)
- list: 162 (5.1%)
- footer: 149 (4.7%)

**Common XPath patterns:**

- `head/link/@href` (441)
- `head/link[@rel='canonical']/@href` (342)
- `head/script[@type='application/ld+json']` (186)
- `head/meta/@content` (175)
- `head/link` (151)

**Samples:**

- "https://www.pthu.nl/" @ `head/link/@href`
- "https://heibloem.nu/vereniging/heemkundevereniging-heibloem" @ `head/meta[@property='og:url']`
- "https://heibloem.nu/held" @ `body/*/footer/*/p/a`

#### GRP.HER

- **Total occurrences**: 1805
- **Unique websites**: 891

**Found in DOM categories:**

- meta_title: 871 (48.3%)
- meta_tag: 311 (17.2%)
- other: 181 (10.0%)
- meta_other: 131
(7.3%)
- header: 88 (4.9%)

**Common XPath patterns:**

- `head/title` (829)
- `head/meta/@content` (129)
- `head/meta[@name='description']/@content` (66)
- `head/script[@type='application/ld+json']` (56)
- `head/meta[@property='og:site_name']/@content` (52)

**Samples:**

- "Museum Eenigenburg" @ `head/title`
- "Musea Schagen" @ `body/*/footer/*/a/*/p`
- "Hildo Krop Museum" @ `head/title`

#### TOP.SET

- **Total occurrences**: 1344
- **Unique websites**: 722

**Found in DOM categories:**

- meta_tag: 356 (26.5%)
- meta_title: 331 (24.6%)
- paragraph: 209 (15.6%)
- other: 126 (9.4%)
- meta_other: 84 (6.2%)

**Common XPath patterns:**

- `head/title` (315)
- `head/meta[@name='description']/@content` (154)
- `head/meta/@content` (106)
- `body/*/p` (101)
- `head/script[@type='application/ld+json']` (32)

**Samples:**

- "Amsterdam" @ `head/script[@type='application/ld+json']`
- "Eenigenburg" @ `body/*/header/*/nav/*/ul/li/a`
- "Steenwijk" @ `head/meta[@property='og:site_name']`

#### THG.LNG

- **Total occurrences**: 1309
- **Unique websites**: 921

**Found in DOM categories:**

- other: 846 (64.6%)
- meta_tag: 207 (15.8%)
- meta_other: 150 (11.5%)
- nav: 32 (2.4%)
- paragraph: 27 (2.1%)

**Common XPath patterns:**

- `@lang` (360)
- `html/@lang` (192)
- `html` (70)
- `head/meta[@property='og:locale']/@content` (54)
- `head/meta/@content` (50)

**Samples:**

- "nl_NL" @ `head/meta[@property='og:locale']`
- "nl-NL" @ `[@lang='nl-NL']`
- "nl_NL" @ `html/@lang`

#### TMP.DAB

- **Total occurrences**: 928
- **Unique websites**: 476

**Found in DOM categories:**

- meta_tag: 396 (42.7%)
- meta_other: 189 (20.4%)
- paragraph: 166 (17.9%)
- other: 97 (10.5%)
- list: 33 (3.6%)

**Common XPath patterns:**

- `head/meta[@property='article:modified_time']/@content` (108)
- `head/meta/@content` (108)
- `head/script[@type='application/ld+json']` (93)
- `body/*/p` (75)
- `head/script/text()` (34)

**Samples:**

- "2024" @ `li[@id='menu-item-2747']/a`
- "2025" @ `li[@id='menu-item-2758']/a`
- "2022-05-30T07:37:02+00:00" @ `head/meta[@property='article:modified_time']/@content`

#### TOP.ADR

- **Total occurrences**: 412
- **Unique websites**: 233

**Found in DOM categories:**

- footer: 104 (25.2%)
- paragraph: 81 (19.7%)
- other: 68 (16.5%)
- meta_other: 60 (14.6%)
- contact: 25 (6.1%)

**Common XPath patterns:**

- `body/*/p` (31)
- `body/footer/*/p` (29)
- `head/script[@type='application/ld+json']` (24)
- `head/script/text()` (19)
- `body/*/p/*` (17)

**Samples:**

- "De Boelelaan 1105, 1081 HV, Amsterdam, Nederland" @ `head/script[@type='application/ld+json']`
- "Kerkweg 5, 1744 JD Eenigenburg, Noord-Holland" @ `body/*/footer/*/ol/li`
- "Herenstraat 67" @ `body/footer/*`

#### GRP.ASS

- **Total occurrences**: 401
- **Unique websites**: 230

**Found in DOM categories:**

- meta_title: 164 (40.9%)
- meta_tag: 58 (14.5%)
- paragraph: 43 (10.7%)
- header: 27 (6.7%)
- other: 27 (6.7%)

**Common XPath patterns:**

- `head/title` (153)
- `head/meta/@content` (25)
- `body/*/p` (24)
- `head/script[@type='application/ld+json']` (13)
- `head/meta[@name='description']/@content` (11)

**Samples:**

- "Heemkundevereniging Heibloem" @ `head/title`
- "Heemkundevereniging Heibloem" @ `body/*/header/h1/*`
- "Heemkundevereniging Heibloem" @ `body/*/p/strong`

#### GRP.GOV

- **Total occurrences**: 381
- **Unique websites**: 168

**Found in DOM categories:**

- meta_title: 131 (34.4%)
- meta_tag: 126 (33.1%)
- header: 41 (10.8%)
- paragraph: 26 (6.8%)
- other: 17 (4.5%)

**Common XPath patterns:**

- `head/title` (127)
- `head/meta/@content` (42)
- `head/meta[@name='description']/@content` (26)
- `head/meta` (15)
- `body/*/p` (13)

**Samples:**

- "gemeente Ede" @ `body/img[@alt='Logo Gemeentearchief Ede']/@alt`
- "Provincie Noord-Holland" @ `head/title`
- "Provincie Overijssel" @ `head/title`

#### APP.TIT

- **Total occurrences**: 380
- **Unique websites**: 278

**Found in DOM categories:**

- meta_title: 179 (47.1%)
- meta_tag: 39 (10.3%)
- other: 39 (10.3%)
- sub_heading: 36 (9.5%)
- paragraph: 24 (6.3%)

**Common XPath patterns:**

- `head/title` (153)
- `head/meta/@content` (15)
- `body/*/h2` (12)
- `head/meta[@property='og:title']/@content` (12)
- `body/*/a/*/h4` (10)

**Samples:**

- "Heemkundevereniging Heibloem" @ `head/title`
- "Uitnodiging lezing "Ontdek de sporen van je voorouders"" @ `body/*/h2`
- "Versiering plaquette "Sporen die bleven"" @ `body/*/h2`

#### AGT.PER

- **Total occurrences**: 334
- **Unique websites**: 180

**Found in DOM categories:**

- paragraph: 107 (32.0%)
- meta_tag: 80 (24.0%)
- other: 63 (18.9%)
- sub_heading: 28 (8.4%)
- meta_other: 17 (5.1%)

**Common XPath patterns:**

- `body/*/p` (44)
- `head/meta/@content` (31)
- `head/meta[@name='description']/@content` (23)
- `body/*` (11)
- `body/*/p/text()` (11)

**Samples:**

- "Erik Pape" @ `body/*/ul/li/*/h2/a`
- "Caren van Herwaarden" @ `body/*/ul/li/*/h2/a`
- "Prof. dr. Chris Stolwijk" @ `body/span[contains(text(), 'Prof. dr. Chris Stolwijk')]`

## XPath → Entity Type Mapping

Which entity types are typically found at which DOM locations:

### `head/title` (1983 entities)

- GRP.HER: 829 (41.8%)
- TOP.SET: 315 (15.9%)
- GRP.ASS: 153 (7.7%)
- APP.TIT: 153 (7.7%)
- GRP.GOV: 127 (6.4%)

### `head/meta/@content` (976 entities)

- APP.URL: 175 (17.9%)
- GRP.HER: 129 (13.2%)
- TMP.DAB: 108 (11.1%)
- TOP.SET: 106 (10.9%)
- THG.LNG: 50 (5.1%)

### `body/*/p` (770 entities)

- TOP.SET: 101 (13.1%)
- TMP.DAB: 75 (9.7%)
- AGT.PER: 44 (5.7%)
- GRP.HER: 36 (4.7%)
- TOP.ADR: 31 (4.0%)

### `head/script[@type='application/ld+json']` (641 entities)

- APP.URL: 186 (29.0%)
- TMP.DAB: 93 (14.5%)
- GRP.HER: 56 (8.7%)
- TOP.SET: 32 (5.0%)
- QTY.MSR: 32 (5.0%)

### `head/meta[@name='description']/@content` (619 entities)

- TOP.SET: 154 (24.9%)
- GRP.HER: 66 (10.7%)
- THG.CON: 43 (6.9%)
- TOP.REG: 35 (5.7%)
- GRP.GOV: 26 (4.2%)

### `head/link/@href` (468 entities)

- APP.URL: 441 (94.2%)
- WRK.WEB: 7 (1.5%)
- GRP.HER: 6 (1.3%)
- TOP.SET: 4 (0.9%)
- APP.WEB: 3 (0.6%)

### `@lang` (365 entities)

- THG.LNG: 360
(98.6%)
- TOP.CTY: 4 (1.1%)
- TOP.SET: 1 (0.3%)

### `head/link[@rel='canonical']/@href` (351 entities)

- APP.URL: 342 (97.4%)
- WRK.WEB: 7 (2.0%)
- WRK.URL: 1 (0.3%)
- GRP.HER: 1 (0.3%)

### `head/script/text()` (342 entities)

- APP.URL: 103 (30.1%)
- GRP.HER: 37 (10.8%)
- TMP.DAB: 34 (9.9%)
- TOP.ADR: 19 (5.6%)
- THG.LNG: 19 (5.6%)

### `head/script` (242 entities)

- APP.URL: 57 (23.6%)
- TMP.DAB: 32 (13.2%)
- GRP.HER: 17 (7.0%)
- THG.LNG: 13 (5.4%)
- QTY.MSR: 11 (4.5%)

### `html/@lang` (196 entities)

- THG.LNG: 192 (98.0%)
- TOP.CTY: 3 (1.5%)
- TOP.LNG: 1 (0.5%)

### `head/meta` (175 entities)

- APP.URL: 32 (18.3%)
- TMP.DAB: 22 (12.6%)
- GRP.GOV: 15 (8.6%)
- THG.LNG: 14 (8.0%)
- GRP.HER: 13 (7.4%)

### `head/link` (161 entities)

- APP.URL: 151 (93.8%)
- THG.LNG: 5 (3.1%)
- WRK.WEB: 2 (1.2%)
- APP.WKD: 1 (0.6%)
- GRP.COR: 1 (0.6%)

### `head/link[@rel='canonical']` (150 entities)

- APP.URL: 143 (95.3%)
- WRK.WEB: 7 (4.7%)

### `body/*` (139 entities)

- TMP.DAB: 24 (17.3%)
- TOP.ADR: 13 (9.4%)
- AGT.PER: 11 (7.9%)
- TMP.TAB: 8 (5.8%)
- WRK.COL: 7 (5.0%)

### `head/meta[@property='article:modified_time']/@content` (108 entities)

- TMP.DAB: 108 (100.0%)

### `head/script[@type='application/ld+json']/text()` (108 entities)

- APP.URL: 27 (25.0%)
- TMP.DAB: 16 (14.8%)
- TOP.SET: 9 (8.3%)
- GRP.HER: 9 (8.3%)
- TOP.REG: 5 (4.6%)

### `body/*/ul/li/a` (100 entities)

- APP.URL: 33 (33.0%)
- APP.COL: 10 (10.0%)
- WRK.COL: 7 (7.0%)
- WRK.WEB: 7 (7.0%)
- APP.EXH: 7 (7.0%)

### `body/*/h2` (98 entities)

- GRP.HER: 21 (21.4%)
- APP.TIT: 12 (12.2%)
- APP.EXH: 11 (11.2%)
- TOP.SET: 11 (11.2%)
- GRP.ASS: 6 (6.1%)

### `head/meta[@property='og:site_name']/@content` (98 entities)

- GRP.HER: 52 (53.1%)
- GRP.GOV: 12 (12.2%)
- GRP.ASS: 6 (6.1%)
- APP.TTL: 5 (5.1%)
- TOP.SET: 4 (4.1%)

## String Pattern → Entity Type Correlation

Which string patterns are most associated with which entity types:

### `unknown` (10443 occurrences)

- APP.URL: 2814 (26.9%)
- THG.LNG: 1307 (12.5%)
- TOP.SET: 1250
(12.0%)
- TMP.DAB: 892 (8.5%)
- GRP.HER: 363 (3.5%)

### `person:dutch_name` (3440 occurrences)

- GRP.HER: 1329 (38.6%)
- GRP.ASS: 313 (9.1%)
- GRP.GOV: 246 (7.2%)
- AGT.PER: 241 (7.0%)
- APP.TIT: 172 (5.0%)

### `heritage:institution_type` (1185 occurrences)

- GRP.HER: 777 (65.6%)
- APP.URL: 140 (11.8%)
- APP.TTL: 56 (4.7%)
- APP.TIT: 49 (4.1%)
- WRK.WEB: 31 (2.6%)

### `org:foundation_or_association` (383 occurrences)

- GRP.ASS: 198 (51.7%)
- GRP.HER: 85 (22.2%)
- APP.URL: 24 (6.3%)
- APP.TIT: 15 (3.9%)
- WRK.WEB: 15 (3.9%)

### `nav:menu_item` (368 occurrences)

- APP.URL: 130 (35.3%)
- APP.TTL: 46 (12.5%)
- APP.TIT: 45 (12.2%)
- WRK.COL: 40 (10.9%)
- WRK.WEB: 34 (9.2%)

### `loc:postal_code_nl` (218 occurrences)

- TOP.ADR: 193 (88.5%)
- TOP.SET: 7 (3.2%)
- APP.NAM: 2 (0.9%)
- QTY.CNT: 2 (0.9%)
- unknown: 2 (0.9%)

### `gov:municipality` (206 occurrences)

- GRP.GOV: 171 (83.0%)
- APP.TTL: 11 (5.3%)
- TOP.SET: 5 (2.4%)
- APP.TIT: 5 (2.4%)
- TOP.REG: 5 (2.4%)

### `info:time_range` (185 occurrences)

- TMP.OPH: 152 (82.2%)
- TMP.TAB: 25 (13.5%)
- TMP.EXP: 3 (1.6%)
- TMP.DAT: 2 (1.1%)
- TMP.RNG: 1 (0.5%)

### `loc:street_address` (178 occurrences)

- TOP.ADR: 158 (88.8%)
- GRP.HER: 8 (4.5%)
- TOP.SET: 2 (1.1%)
- APP.PNM: 2 (1.1%)
- APP.URL: 1 (0.6%)

### `info:opening_hours` (161 occurrences)

- TMP.OPH: 115 (71.4%)
- TMP.DAB: 31 (19.3%)
- TMP.SET: 7 (4.3%)
- TMP.RNG: 2 (1.2%)
- TMP.DAT: 2 (1.2%)

### `contact:email` (126 occurrences)

- APP.URL: 93 (73.8%)
- APP.EMAIL: 12 (9.5%)
- APP.NAM: 5 (4.0%)
- APP.PNM: 5 (4.0%)
- APP.EML: 3 (2.4%)

### `contact:phone` (7 occurrences)

- APP.URL: 5 (71.4%)
- TOP.ADR: 1 (14.3%)
- APP.TTL: 1 (14.3%)

## Recommendations for dutch_web_patterns.yaml

Based on the analysis, the following layout-aware patterns could be added:

### High-Value XPath Targets

1. **`head/title`** - Almost always contains institution name
2. **`body/*/h1`** - Primary heading, usually institution name
3. **`body/*/nav`** - Navigation menu (discard patterns)
4. **`body/*/footer`** - Contact info, address, social links

### Suggested Pattern Additions

```yaml
# Add xpath_hint to patterns for better precision
entity_patterns:
  organizations:
    heritage_institutions:
      patterns:
        - pattern: '^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$'
          xpath_hints:
            - 'head/title'
            - 'body/*/h1'
          confidence_boost: 0.2  # Higher confidence when found at expected location
```
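
One way a consumer of `dutch_web_patterns.yaml` might apply `xpath_hints` and `confidence_boost` is sketched below in Python. Several things here are assumptions rather than part of the analysis: the function names `xpath_matches_hint` and `score_candidate`, the base confidence of 0.5, case-insensitive matching, and the reading of `*` in a hint as exactly one path step. The pattern entry is written inline as a dict so the sketch stays standard-library only.

```python
import re
from typing import Optional

# Hypothetical in-memory form of the suggested YAML entry.
HERITAGE_ENTRY = {
    "pattern": r"^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$",
    "xpath_hints": ["head/title", "body/*/h1"],
    "confidence_boost": 0.2,
}


def xpath_matches_hint(xpath: str, hint: str) -> bool:
    """Compare an XPath and a hint step by step.

    Assumption: '*' in the hint stands for exactly one step. Paths whose
    predicates contain '/' (e.g. @type='application/ld+json') are split
    naively and will simply fail to match these short hints.
    """
    xpath_steps = xpath.split("/")
    hint_steps = hint.split("/")
    if len(xpath_steps) != len(hint_steps):
        return False
    return all(h == "*" or h == x for h, x in zip(hint_steps, xpath_steps))


def score_candidate(
    text: str, xpath: str, entry: dict, base_confidence: float = 0.5
) -> Optional[float]:
    """Return a confidence for `text`, or None if the pattern does not match.

    The boost is added only when the XPath matches one of the entry's hints.
    The base confidence of 0.5 and the IGNORECASE flag are assumptions.
    """
    if not re.match(entry["pattern"], text.strip(), flags=re.IGNORECASE):
        return None
    at_expected_location = any(
        xpath_matches_hint(xpath, hint) for hint in entry["xpath_hints"]
    )
    boost = entry["confidence_boost"] if at_expected_location else 0.0
    return min(1.0, base_confidence + boost)


if __name__ == "__main__":
    for text, xpath in [
        ("Het Scheepvaart Museum", "head/title"),
        ("Het Regionaal Archief", "body/footer/p"),
        ("Contact", "head/title"),
    ]:
        print(text, xpath, score_candidate(text, xpath, HERITAGE_ENTRY))
```

Keeping the boost additive and capped at 1.0 means a hint can only raise confidence. A string that matches the pattern at an unexpected location, such as a footer paragraph, still gets the base score.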