# Web Archive Layout Pattern Analysis

Generated: 2025-12-13T15:03:35.789833

## Summary

- **Annotation files analyzed**: 500
- **Unique websites**: 468
- **Total layout claims**: 1443
- **Total entity claims**: 5134

## Layout Categories

Distribution of content by DOM location category:

### other

- **Occurrences**: 322
- **Websites**: 162

**Top XPath patterns:**

- `body/*` (162)
- `body/*/table` (8)
- `body` (7)
- `body/*/dl` (6)
- `body/*/script` (6)

**String patterns found:**

- `unknown` (170)
- `person:dutch_name` (104)
- `nav:menu_item` (44)
- `heritage:institution_type` (43)
- `org:foundation_or_association` (13)

**Samples:**

- "Voorstelling 'De Ezelprins' door Spiegelbeesten......" (bibliotheekkampen.nl)
- "Vandaag open... Bibliotheek Kampen 09:00 - 17:00......" (bibliotheekkampen.nl)
- "flexslider fullscreen-carousel hero-slider..." (museumrijswijk.nl)

### meta_title

- **Occurrences**: 282
- **Websites**: 268

**Top XPath patterns:**

- `head/title` (279)
- `head/meta[@property='og:title']` (1)
- `head/meta[@name='apple-mobile-web-app-title']/@content` (1)
- `head/meta[@property='og:title']/@content` (1)

**String patterns found:**

- `person:dutch_name` (224)
- `heritage:institution_type` (85)
- `nav:menu_item` (58)
- `org:foundation_or_association` (41)
- `unknown` (34)

**Samples:**

- "Protestantse Theologische Universiteit: 400 jaar theologisch onderwijs..." (pthu.nl)
- "Heemkundevereniging Heibloem | Heibloem.nu..." (heibloem.nu)
- "Museum Eenigenburg – Museum Eenigenburg..." (museumeenigenburg.nl)

### header

- **Occurrences**: 186
- **Websites**: 131

**Top XPath patterns:**

- `body/header` (62)
- `body/*/header` (23)
- `body/*/header/h1` (11)
- `body/*/header/*` (9)
- `body/header/*` (5)

**String patterns found:**

- `unknown` (96)
- `person:dutch_name` (76)
- `nav:menu_item` (26)
- `heritage:institution_type` (18)
- `org:foundation_or_association` (12)

**Samples:**

- "...Protestantse Theologische Universitei..." (pthu.nl)
- "Heemkundevereniging Heibloem..." (heibloem.nu)
- "Museum Eenigenburg..." (museumeenigenburg.nl)

### meta_other

- **Occurrences**: 164
- **Websites**: 139

**Top XPath patterns:**

- `head` (97)
- `head/script[@type='application/ld+json']` (20)
- `head/script` (14)
- `head/link` (7)
- `head/style` (3)

**String patterns found:**

- `person:dutch_name` (77)
- `unknown` (77)
- `heritage:institution_type` (33)
- `nav:menu_item` (24)
- `gov:municipality` (10)

**Samples:**

- "Home - Hildo Krop Museum......" (bibliotheekkampen.nl)
- "Nu te zien [list of exhibitions]..." (catharijneconvent.nl)
- "Let op: in december hebben we aangepaste openingstijden. Bekijk hier de openings..." (debieb.nl)

### contact

- **Occurrences**: 3
- **Websites**: 3

**Top XPath patterns:**

- `body/*/address` (1)
- `body/div[@id='contact']/*/ul` (1)
- `div[@class='xr_group' and ./span[text()='Contact']]` (1)

**String patterns found:**

- `loc:postal_code_nl` (1)
- `loc:street_address` (1)
- `nav:menu_item` (1)
- `person:dutch_name` (1)

**Samples:**

- "Bongerd 18, 6411 JM Heerlen..." (museum.nl)
- "Winsum, Hoofdstraat W 70 Uithuizen, Hoofdstraat-West 1..." (hethogeland.nl)
- "Contact Contactgegevens..." (heemkundenispen.nl)

### main_content

- **Occurrences**: 1
- **Websites**: 1

**Top XPath patterns:**

- `body/div[@id='readspeaker']/*/article[@role='region']/*/p` (1)

**String patterns found:**

- `info:opening_hours` (1)

**Samples:**

- "Let op: Op donderdag 27 en vrijdag 28 november 2025 is Burgerzaken gesloten......" (veendam.nl)

### sidebar

- **Occurrences**: 1
- **Websites**: 1

**Top XPath patterns:**

- `body/aside[@id='leftbar']` (1)

**String patterns found:**

- `org:foundation_or_association` (1)
- `nav:menu_item` (1)
- `person:dutch_name` (1)

**Samples:**

- "Vereniging Adres / Contact Kwartaalblad......" (oudhoorn.nl)

## Entity Type Distribution

| Entity Type | Count | Websites | Primary Location |
|-------------|-------|----------|------------------|
| APP.URL | 1071 | 406 | meta_other |
| GRP.HER | 582 | 313 | meta_title |
| TOP.SET | 426 | 245 | meta_tag |
| THG.LNG | 398 | 305 | other |
| TMP.DAB | 285 | 155 | meta_tag |
| WRK.WEB | 145 | 75 | meta_other |
| APP.TIT | 132 | 97 | meta_title |
| TOP.ADR | 130 | 77 | footer |
| AGT.PER | 129 | 68 | paragraph |
| GRP.ASS | 122 | 76 | meta_title |
| APP.TTL | 119 | 78 | meta_title |
| GRP.GOV | 118 | 60 | meta_title |
| THG.CON | 103 | 63 | meta_tag |
| APP.EXH | 98 | 61 | other |
| TOP.REG | 80 | 65 | meta_tag |
| TMP.OPH | 74 | 46 | other |
| QTY.CNT | 69 | 38 | meta_tag |
| TOP.CTY | 68 | 55 | meta_tag |
| THG.EVT | 64 | 31 | paragraph |
| TOP.BLD | 56 | 38 | paragraph |

### Entity Type Details

#### APP.URL

- **Total occurrences**: 1071
- **Unique websites**: 406

**Found in DOM categories:**

- meta_other: 596 (55.6%)
- meta_tag: 126 (11.8%)
- other: 81 (7.6%)
- list: 64 (6.0%)
- nav: 55 (5.1%)

**Common XPath patterns:**

- `head/link/@href` (168)
- `head/link[@rel='canonical']/@href` (108)
- `head/link` (67)
- `head/script[@type='application/ld+json']` (59)
- `head/meta/@content` (47)

**Samples:**

- "https://www.pthu.nl/" @ `head/link/@href`
- "https://heibloem.nu/vereniging/heemkundevereniging-heibloem" @ `head/meta[@property='og:url']`
- "https://heibloem.nu/held" @ `body/*/footer/*/p/a`

#### GRP.HER

- **Total occurrences**: 582
- **Unique websites**: 313

**Found in DOM categories:**

- meta_title: 282 (48.5%)
- meta_tag: 97 (16.7%)
- other: 51 (8.8%)
- meta_other: 43 (7.4%)
- paragraph: 34 (5.8%)

**Common XPath patterns:**

- `head/title` (261)
- `head/meta/@content` (41)
- `head/meta[@name='description']/@content` (24)
- `head/script[@type='application/ld+json']` (20)
- `body/*/dl/*/dt/a` (17)

**Samples:**

- "Museum Eenigenburg" @ `head/title`
- "Musea Schagen" @ `body/*/footer/*/a/*/p`
- "Hildo Krop Museum" @ `head/title`

#### TOP.SET

- **Total occurrences**: 426
- **Unique websites**: 245

**Found in DOM categories:**

- meta_tag: 116 (27.2%)
- meta_title: 94 (22.1%)
- paragraph: 72 (16.9%)
- other: 32 (7.5%)
- meta_other: 28 (6.6%)

**Common XPath patterns:**

- `head/title` (87)
- `head/meta[@name='description']/@content` (45)
- `body/*/p` (38)
- `head/meta/@content` (32)
- `body/*/dl/*/dt/a` (10)

**Samples:**

- "Amsterdam" @ `head/script[@type='application/ld+json']`
- "Eenigenburg" @ `body/*/header/*/nav/*/ul/li/a`
- "Steenwijk" @ `head/meta[@property='og:site_name']`

#### THG.LNG

- **Total occurrences**: 398
- **Unique websites**: 305

**Found in DOM categories:**

- other: 260 (65.3%)
- meta_tag: 66 (16.6%)
- meta_other: 31 (7.8%)
- nav: 17 (4.3%)
- paragraph: 15 (3.8%)

**Common XPath patterns:**

- `@lang` (113)
- `html/@lang` (62)
- `html` (30)
- `head/meta[@property='og:locale']/@content` (18)
- `head/meta/@content` (15)

**Samples:**

- "nl_NL" @ `head/meta[@property='og:locale']`
- "nl-NL" @ `[@lang='nl-NL']`
- "nl_NL" @ `html/@lang`

#### TMP.DAB

- **Total occurrences**: 285
- **Unique websites**: 155

**Found in DOM categories:**

- meta_tag: 100 (35.1%)
- meta_other: 70 (24.6%)
- paragraph: 49 (17.2%)
- other: 37 (13.0%)
- sub_heading: 9 (3.2%)

**Common XPath patterns:**

- `head/script[@type='application/ld+json']` (33)
- `body/*/p` (30)
- `head/meta[@property='article:modified_time']/@content` (29)
- `head/meta/@content` (28)
- `head/script[@type='application/ld+json']/text()` (12)

**Samples:**

- "2024" @ `li[@id='menu-item-2747']/a`
- "2025" @ `li[@id='menu-item-2758']/a`
- "2022-05-30T07:37:02+00:00" @ `head/meta[@property='article:modified_time']/@content`

#### WRK.WEB

- **Total occurrences**: 145
- **Unique websites**: 75

**Found in DOM categories:**

- meta_other: 39 (26.9%)
- nav: 28 (19.3%)
- other: 27 (18.6%)
- meta_title: 15 (10.3%)
- list: 14 (9.7%)

**Common XPath patterns:**

- `body/*/nav/*/ul/li/*/h3/a` (13)
- `head/script[@type='application/ld+json']` (11)
- `HTML/BODY/DIV/DIV/UL/LI/A` (8)
- `head/script/text()` (7)
- `body/ul/li/a` (6)

**Samples:**

- "Home - Hildo Krop Museum" @ `head/title`
- "https://www.sonnenborgh.nl/home" @ `head/link[@rel='canonical']`
- "Strategische visie" @ `body/*/h2`

#### APP.TIT

- **Total occurrences**: 132
- **Unique websites**: 97

**Found in DOM categories:**

- meta_title: 62 (47.0%)
- sub_heading: 18 (13.6%)
- meta_tag: 16 (12.1%)
- other: 13 (9.8%)
- meta_other: 7 (5.3%)

**Common XPath patterns:**

- `head/title` (53)
- `body/*/h2` (8)
- `body/*/a/*/h4` (8)
- `head/meta/@content` (5)
- `head/meta[@property='og:title']/@content` (4)

**Samples:**

- "Heemkundevereniging Heibloem" @ `head/title`
- "Uitnodiging lezing "Ontdek de sporen van je voorouders"" @ `body/*/h2`
- "Versiering plaquette "Sporen die bleven"" @ `body/*/h2`

#### TOP.ADR

- **Total occurrences**: 130
- **Unique websites**: 77

**Found in DOM categories:**

- footer: 31 (23.8%)
- paragraph: 23 (17.7%)
- other: 21 (16.2%)
- meta_other: 18 (13.8%)
- list: 13 (10.0%)

**Common XPath patterns:**

- `body/footer/*/p` (13)
- `body/*/p` (9)
- `body/*/ul/li` (7)
- `head/script/text()` (7)
- `head/script[@type='application/ld+json']` (6)

**Samples:**

- "De Boelelaan 1105, 1081 HV, Amsterdam, Nederland" @ `head/script[@type='application/ld+json']`
- "Kerkweg 5, 1744 JD Eenigenburg, Noord-Holland" @ `body/*/footer/*/ol/li`
- "Herenstraat 67" @ `body/footer/*`

#### AGT.PER

- **Total occurrences**: 129
- **Unique websites**: 68

**Found in DOM categories:**

- paragraph: 47 (36.4%)
- meta_tag: 29 (22.5%)
- other: 21 (16.3%)
- sub_heading: 10 (7.8%)
- list: 5 (3.9%)

**Common XPath patterns:**

- `body/*/p` (12)
- `body/*/p/text()` (10)
- `head/meta[@name='description']/@content` (8)
- `body/*/p/*` (8)
- `head/meta/@content` (8)

**Samples:**

- "Erik Pape" @ `body/*/ul/li/*/h2/a`
- "Caren van Herwaarden" @ `body/*/ul/li/*/h2/a`
- "Prof. dr. Chris Stolwijk" @ `body/span[contains(text(), 'Prof. dr. Chris Stolwijk')]`

#### GRP.ASS

- **Total occurrences**: 122
- **Unique websites**: 76

**Found in DOM categories:**

- meta_title: 48 (39.3%)
- meta_tag: 17 (13.9%)
- paragraph: 15 (12.3%)
- other: 11 (9.0%)
- header: 7 (5.7%)

**Common XPath patterns:**

- `head/title` (44)
- `head/meta/@content` (12)
- `body/*/p` (6)
- `head/script` (4)
- `body/*` (4)

**Samples:**

- "Heemkundevereniging Heibloem" @ `head/title`
- "Heemkundevereniging Heibloem" @ `body/*/header/h1/*`
- "Heemkundevereniging Heibloem" @ `body/*/p/strong`

## XPath → Entity Type Mapping

Which entity types are typically found at which DOM locations:

### `head/title` (608 entities)

- GRP.HER: 261 (42.9%)
- TOP.SET: 87 (14.3%)
- APP.TIT: 53 (8.7%)
- APP.TTL: 45 (7.4%)
- GRP.ASS: 44 (7.2%)

### `head/meta/@content` (278 entities)

- APP.URL: 47 (16.9%)
- GRP.HER: 41 (14.7%)
- TOP.SET: 32 (11.5%)
- TMP.DAB: 28 (10.1%)
- THG.LNG: 15 (5.4%)

### `body/*/p` (277 entities)

- TOP.SET: 38 (13.7%)
- TMP.DAB: 30 (10.8%)
- GRP.HER: 17 (6.1%)
- AGT.PER: 12 (4.3%)
- THG.ART: 11 (4.0%)

### `head/script[@type='application/ld+json']` (219 entities)

- APP.URL: 59 (26.9%)
- TMP.DAB: 33 (15.1%)
- GRP.HER: 20 (9.1%)
- WRK.WEB: 11 (5.0%)
- TOP.SET: 9 (4.1%)

### `head/meta[@name='description']/@content` (189 entities)

- TOP.SET: 45 (23.8%)
- GRP.HER: 24 (12.7%)
- THG.CON: 17 (9.0%)
- TOP.REG: 11 (5.8%)
- GRP.GOV: 10 (5.3%)

### `head/link/@href` (179 entities)

- APP.URL: 168 (93.9%)
- WRK.WEB: 3 (1.7%)
- APP.WEB: 3 (1.7%)
- WRK.URL: 2 (1.1%)
- TOP.SET: 1 (0.6%)

### `@lang` (115 entities)

- THG.LNG: 113 (98.3%)
- TOP.CTY: 2 (1.7%)

### `head/link[@rel='canonical']/@href` (111 entities)

- APP.URL: 108 (97.3%)
- WRK.WEB: 2 (1.8%)
- WRK.URL: 1 (0.9%)

### `head/script` (106 entities)

- APP.URL: 23 (21.7%)
- TMP.DAB: 12 (11.3%)
- GRP.HER: 6 (5.7%)
- QTY.MSR: 6 (5.7%)
- TOP.SET: 5 (4.7%)

### `head/script/text()` (93 entities)

- APP.URL: 31 (33.3%)
- TMP.DAB: 9 (9.7%)
- GRP.HER: 9 (9.7%)
- WRK.WEB: 7 (7.5%)
- TOP.ADR: 7 (7.5%)

### `head/link` (69 entities)

- APP.URL: 67 (97.1%)
- APP.WKD: 1 (1.4%)
- WRK.WEB: 1 (1.4%)

### `html/@lang` (62 entities)

- THG.LNG: 62 (100.0%)

### `head/meta` (61 entities)

- APP.URL: 12 (19.7%)
- TMP.DAB: 5 (8.2%)
- GRP.COR: 4 (6.6%)
- GRP.HER: 4 (6.6%)
- TOP.SET: 4 (6.6%)

### `head/script[@type='application/ld+json']/text()` (57 entities)

- TMP.DAB: 12 (21.1%)
- APP.URL: 11 (19.3%)
- GRP.HER: 6 (10.5%)
- TOP.SET: 5 (8.8%)
- APP.NAM: 3 (5.3%)

### `body/*/h2` (54 entities)

- APP.TIT: 8 (14.8%)
- APP.EXH: 8 (14.8%)
- GRP.HER: 8 (14.8%)
- TOP.SET: 5 (9.3%)
- TMP.DAB: 4 (7.4%)

### `body/*` (54 entities)

- TMP.DAB: 9 (16.7%)
- WRK.COL: 7 (13.0%)
- TOP.ADR: 6 (11.1%)
- GRP.ASS: 4 (7.4%)
- TMP.TAB: 4 (7.4%)

### `head/link[@rel='canonical']` (46 entities)

- APP.URL: 41 (89.1%)
- WRK.WEB: 5 (10.9%)

### `head/meta[@property='og:description']/@content` (41 entities)

- TOP.SET: 6 (14.6%)
- TMP.DAB: 5 (12.2%)
- THG.CON: 4 (9.8%)
- AGT.PER: 3 (7.3%)
- APP.TIT: 3 (7.3%)

### `body/*/p/a` (40 entities)

- APP.URL: 20 (50.0%)
- THG.EVT: 5 (12.5%)
- TOP.SET: 3 (7.5%)
- APP.TTL: 3 (7.5%)
- GRP.GOV: 2 (5.0%)

### `body/*/p/*` (36 entities)

- AGT.PER: 8 (22.2%)
- TOP.ADR: 4 (11.1%)
- TOP.CTY: 3 (8.3%)
- ROL.POS: 2 (5.6%)
- ROL.OCC: 2 (5.6%)

## String Pattern → Entity Type Correlation

Which string patterns are most associated with which entity types:

### `unknown` (3503 occurrences)

- APP.URL: 937 (26.7%)
- TOP.SET: 403 (11.5%)
- THG.LNG: 398 (11.4%)
- TMP.DAB: 269 (7.7%)
- GRP.HER: 132 (3.8%)

### `person:dutch_name` (1146 occurrences)

- GRP.HER: 410 (35.8%)
- GRP.ASS: 101 (8.8%)
- AGT.PER: 89 (7.8%)
- GRP.GOV: 74 (6.5%)
- APP.TIT: 63 (5.5%)

### `heritage:institution_type` (376 occurrences)

- GRP.HER: 233 (62.0%)
- APP.URL: 50 (13.3%)
- APP.TIT: 20 (5.3%)
- APP.TTL: 19 (5.1%)
- WRK.WEB: 13 (3.5%)

### `org:foundation_or_association` (155 occurrences)

- GRP.ASS: 75 (48.4%)
- GRP.HER: 35 (22.6%)
- APP.URL: 15 (9.7%)
- WRK.WEB: 6 (3.9%)
- APP.TIT: 5 (3.2%)

### `nav:menu_item` (141 occurrences)

- APP.URL: 46 (32.6%)
- APP.TTL: 23 (16.3%)
- APP.TIT: 18 (12.8%)
- WRK.WEB: 14 (9.9%)
- WRK.COL: 13 (9.2%)

### `loc:postal_code_nl` (66 occurrences)

- TOP.ADR: 55 (83.3%)
- TOP.SET: 4 (6.1%)
- QTY.CNT: 2 (3.0%)
- APP.COL: 1 (1.5%)
- APP.NAM: 1 (1.5%)

### `gov:municipality` (65 occurrences)

- GRP.GOV: 53 (81.5%)
- TOP.REG: 3 (4.6%)
- TOP.SET: 2 (3.1%)
- APP.TIT: 2 (3.1%)
- APP.TTL: 2 (3.1%)

### `info:time_range` (63 occurrences)

- TMP.OPH: 52 (82.5%)
- TMP.TAB: 9 (14.3%)
- TMP.EXP: 2 (3.2%)

### `loc:street_address` (62 occurrences)

- TOP.ADR: 55 (88.7%)
- GRP.HER: 3 (4.8%)
- GRP.ASS: 1 (1.6%)
- APP.TTL: 1 (1.6%)
- unknown: 1 (1.6%)

### `info:opening_hours` (48 occurrences)

- TMP.OPH: 32 (66.7%)
- TMP.DAB: 13 (27.1%)
- TMP.SET: 2 (4.2%)
- THG.EVT: 1 (2.1%)

### `contact:email` (42 occurrences)

- APP.URL: 25 (59.5%)
- APP.EMAIL: 9 (21.4%)
- APP.NAM: 3 (7.1%)
- APP.PNM: 2 (4.8%)
- THG.PHO: 1 (2.4%)

### `contact:phone` (4 occurrences)

- APP.URL: 3 (75.0%)
- TOP.ADR: 1 (25.0%)

## Recommendations for dutch_web_patterns.yaml

Based on the analysis, the following layout-aware patterns could be added:

### High-Value XPath Targets

1. **`head/title`** - The most common location for institution names: GRP.HER accounts for 261 of its 608 entities (42.9%), and GRP.ASS for another 44 (7.2%)
2. **`body/*/h1`** - Primary heading, usually the institution name
3. **`body/*/nav`** - Navigation menu; a candidate for discard patterns
4. **`body/*/footer`** - Contact info, address, and social links; footer is the primary location for TOP.ADR (31 of 130 occurrences, 23.8%)

### Suggested Pattern Additions

```yaml
# Add xpath_hints to patterns for better precision
entity_patterns:
  organizations:
    heritage_institutions:
      patterns:
        - pattern: '^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$'
          xpath_hints:
            - 'head/title'
            - 'body/*/h1'
          confidence_boost: 0.2  # Higher confidence when found at expected location
```
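
The following is a minimal sketch of how a consumer of `dutch_web_patterns.yaml` might apply `xpath_hints` and `confidence_boost`. It is an illustration under stated assumptions, not the project's actual loader: `PATTERN_ENTRY` is a hand-written mirror of the YAML entry above, and `score_match`, `BASE_CONFIDENCE`, the case-insensitive matching, and the exact string comparison of generalized XPaths are all hypothetical choices. The test name "Het Voorbeeld Museum" is made up.

```python
import re

# Hypothetical in-memory mirror of the suggested YAML entry; a real consumer
# would presumably load this from dutch_web_patterns.yaml instead.
PATTERN_ENTRY = {
    "pattern": r"^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$",
    "xpath_hints": ["head/title", "body/*/h1"],
    "confidence_boost": 0.2,
}

# Assumed baseline confidence for a regex-only match (not from the report).
BASE_CONFIDENCE = 0.6


def score_match(text, xpath, entry=PATTERN_ENTRY, base=BASE_CONFIDENCE):
    """Return a confidence in [0, 1], or None if the pattern does not match.

    `xpath` is expected in the generalized form used in this report
    (for example 'head/title' or 'body/*/h1') and is compared as a string.
    """
    # Case-insensitive matching is an assumption; the YAML specifies no flags.
    if not re.match(entry["pattern"], text.strip(), flags=re.IGNORECASE):
        return None
    # Apply the boost only when the text was found at a hinted location.
    boost = entry["confidence_boost"] if xpath in entry["xpath_hints"] else 0.0
    return round(min(1.0, base + boost), 3)


if __name__ == "__main__":
    # Made-up name with a leading article, at a hinted and a non-hinted XPath.
    print(score_match("Het Voorbeeld Museum", "head/title"))
    print(score_match("Het Voorbeeld Museum", "body/*/p"))
    # Sample from this report, without a leading article.
    print(score_match("Museum Eenigenburg", "head/title"))
```

The first call returns the baseline plus the boost, the second returns the baseline only, and the third returns `None`. The third case shows that the pattern as written requires a leading "het" or "de", so titles such as "Museum Eenigenburg" or "Hildo Krop Museum" from the samples above would not match it; whether the article should be optional is left to the pattern's author.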