# Web Archive Layout Pattern Analysis

Generated: 2025-12-13T15:05:22.346619

## Summary

- **Annotation files analyzed**: 1525
- **Unique websites**: 1343
- **Total layout claims**: 4302
- **Total entity claims**: 15252

## Layout Categories

Distribution of the 4302 layout claims by DOM location category:

### other

- **Occurrences**: 896
- **Websites**: 457

**Top XPath patterns:**
- `body/*` (471)
- `body` (24)
- `body/*/table` (23)
- `body/*/dl` (15)
- `*` (14)

**String patterns found:**
- `unknown` (511)
- `person:dutch_name` (263)
- `heritage:institution_type` (119)
- `nav:menu_item` (114)
- `org:foundation_or_association` (41)

**Samples:**
- "Voorstelling 'De Ezelprins' door Spiegelbeesten......" (bibliotheekkampen.nl)
- "Vandaag open... Bibliotheek Kampen 09:00 - 17:00......" (bibliotheekkampen.nl)
- "flexslider fullscreen-carousel hero-slider..." (museumrijswijk.nl)

### meta_title

- **Occurrences**: 828
- **Websites**: 741

**Top XPath patterns:**
- `head/title` (815)
- `head/meta[@property='og:title']` (8)
- `head/meta[@property='og:title']/@content` (2)
- `head/meta[@name='apple-mobile-web-app-title']/@content` (1)
- `head/meta[@name='twitter:title']` (1)

**String patterns found:**
- `person:dutch_name` (651)
- `heritage:institution_type` (266)
- `nav:menu_item` (194)
- `unknown` (99)
- `org:foundation_or_association` (92)

**Samples:**
- "Protestantse Theologische Universiteit: 400 jaar theologisch onderwijs..." (pthu.nl)
- "Heemkundevereniging Heibloem | Heibloem.nu..." (heibloem.nu)
- "Museum Eenigenburg – Museum Eenigenburg..." (museumeenigenburg.nl)

### header

- **Occurrences**: 561
- **Websites**: 386

**Top XPath patterns:**
- `body/header` (204)
- `body/*/header` (65)
- `body/*/header/h1` (29)
- `body/*/header/*` (22)
- `body/header/*` (13)

**String patterns found:**
- `unknown` (285)
- `person:dutch_name` (210)
- `nav:menu_item` (79)
- `heritage:institution_type` (63)
- `org:foundation_or_association` (31)

**Samples:**
- "<header class="header preventselection">...Protestantse Theologische Universitei..." (pthu.nl)
- "Heemkundevereniging Heibloem..." (heibloem.nu)
- "Museum Eenigenburg..." (museumeenigenburg.nl)

### meta_other

- **Occurrences**: 538
- **Websites**: 436

**Top XPath patterns:**
- `head` (328)
- `head/script[@type='application/ld+json']` (76)
- `head/script` (42)
- `head/link[@rel='canonical']` (22)
- `head/link` (13)

**String patterns found:**
- `person:dutch_name` (264)
- `unknown` (242)
- `heritage:institution_type` (107)
- `nav:menu_item` (81)
- `org:foundation_or_association` (32)

**Samples:**
- "<title>Home - Hildo Krop Museum</title><link rel="canonical" href="https://www.h..." (hildokrop.nl)
- "Museum Rijswijk..." (museumrijswijk.nl)
- "Gemeentearchief Ede | Gemeentearchief Ede......" (gemeentearchief.ede.nl)

### paragraph

- **Occurrences**: 295
- **Websites**: 190

**Top XPath patterns:**
- `body/*/p` (202)
- `*/p` (17)
- `body/*/ul/li/*/p` (8)
- `body/p` (4)
- `body/*/p/strong` (4)

**String patterns found:**
- `person:dutch_name` (152)
- `unknown` (87)
- `heritage:institution_type` (73)
- `org:foundation_or_association` (25)
- `nav:menu_item` (23)

**Samples:**
- "Heemkundevereniging Heibloem heeft als doel het in brede kring bevorderen van be..." (heibloem.nu)
- "In Museum Eenigenburg wijst een levensgroot kompas......" (museumeenigenburg.nl)
- "Het Museum is zeer geschikt voor het ontvangen van groepen......" (museumeenigenburg.nl)

### nav

- **Occurrences**: 285
- **Websites**: 247

**Top XPath patterns:**
- `body/*/nav` (57)
- `body/nav` (43)
- `body/header/nav` (34)
- `body/*/header/*/nav` (30)
- `body/header/*/nav` (17)

**String patterns found:**
- `nav:menu_item` (137)
- `unknown` (117)
- `person:dutch_name` (105)
- `heritage:institution_type` (32)
- `org:foundation_or_association` (16)

**Samples:**
- "Actueel HELD Burgerbus Onderwijs & opvang......" (heibloem.nu)
- "Home Het museum Daglonershuis Kasteel De familie Eenigenburg Blog Contact..." (museumeenigenburg.nl)
- "Home Tentoonstellingen Bezoekersinformatie......" (sonnenborgh.nl)

### sub_heading

- **Occurrences**: 224
- **Websites**: 128

**Top XPath patterns:**
- `body/*/h2` (117)
- `body/*/h3` (65)
- `*/h2` (8)
- `body/h2` (6)
- `body/*/h4` (4)

**String patterns found:**
- `unknown` (138)
- `person:dutch_name` (71)
- `heritage:institution_type` (36)
- `nav:menu_item` (9)
- `org:foundation_or_association` (9)

**Samples:**
- "Verenigingsnieuws..." (heibloem.nu)
- "Contactgegevens..." (heibloem.nu)
- "Home..." (bibliotheekkampen.nl)

### main_heading

- **Occurrences**: 222
- **Websites**: 182

**Top XPath patterns:**
- `body/*/h1` (175)
- `*/h1` (16)
- `*/a/*/h1` (2)
- `body/*/ul/li/h1` (2)
- `body/section[@class='uk-section head']/*/div[@class='title']/h1` (2)

**String patterns found:**
- `person:dutch_name` (110)
- `unknown` (90)
- `heritage:institution_type` (67)
- `nav:menu_item` (12)
- `org:foundation_or_association` (9)

**Samples:**
- "Protestantse Theologische Universiteit..." (pthu.nl)
- "Terug in de tijd..." (museumeenigenburg.nl)
- "Ook leuk voor groepen..." (museumeenigenburg.nl)

### meta_tag

- **Occurrences**: 183
- **Websites**: 165

**Top XPath patterns:**
- `head/meta[@name='description']` (69)
- `head/meta` (33)
- `head/meta[@name='description']/@content` (29)
- `head/meta/@content` (22)
- `head/meta[@property='og:description']` (5)

**String patterns found:**
- `person:dutch_name` (98)
- `unknown` (60)
- `heritage:institution_type` (56)
- `nav:menu_item` (16)
- `org:foundation_or_association` (14)

**Samples:**
- "Prachtige tijdelijke en vaste tentoonstellingen zijn te zien bij Museum Catharij..." (catharijneconvent.nl)
- "Energie is overal. We gebruiken het constant en ervaren er veel gemak van. Tegel..." (vindhetuit.nl)
- "Aanmeldformulier Via dit formulier kunt u zich aanmelden bij de Stichting Vriend..." (verbindingsdienst.nl)

### list

- **Occurrences**: 130
- **Websites**: 92

**Top XPath patterns:**
- `body/*/ul` (93)
- `*/ul` (5)
- `body/*/ul[@class='metanav']` (5)
- `body/ul` (2)
- `body/*/ol` (2)

**String patterns found:**
- `unknown` (62)
- `person:dutch_name` (47)
- `nav:menu_item` (29)
- `heritage:institution_type` (19)
- `org:foundation_or_association` (4)

**Samples:**
- "<ul class="metanav">......" (bibliotheekkampen.nl)
- "Nu te zien [list of exhibitions]..." (catharijneconvent.nl)
- "Let op: in december hebben we aangepaste openingstijden. Bekijk hier de openings..." (debieb.nl)

### footer

- **Occurrences**: 119
- **Websites**: 92

**Top XPath patterns:**
- `body/footer` (44)
- `body/*/footer` (16)
- `body/footer/*` (10)
- `footer` (6)
- `body/footer/*/p` (5)

**String patterns found:**
- `person:dutch_name` (41)
- `nav:menu_item` (39)
- `unknown` (37)
- `loc:postal_code_nl` (12)
- `loc:street_address` (10)

**Samples:**
- "Heibloem.nu Nieuwsbrief Contact......" (heibloem.nu)
- "Made with ♥ by Encore Digital Agency......" (heibloem.nu)
- "Contactinformatie Kerkweg 5... Toegangsprijzen... OPENINGS TIJDEN......" (museumeenigenburg.nl)

### main_content

- **Occurrences**: 10
- **Websites**: 6

**Top XPath patterns:**
- `body/div[@id='page_wrapper']/main[@id='content_main']/div[@class='cm_column_wrapper']/div[@class='cm_column']/p` (3)
- `body/div[@id='readspeaker']/*/article[@role='region']/*/p` (1)
- `body/main[@id='main']/div[@id='content-wrap']/div[@id='primary']/div[@id='content']/*/div[@class='entry']/div[@data-elementor-type='wp-post']/section[@data-id='150f6235']` (1)
- `body/div[@id='wrapper']/div[@id='mainCntr']/main[@id='contentCntr']/div[@id='generatedContent']/div[@class='timeItem']` (1)
- `body/div[@id='wrapper']/div[@id='mainCntr']/main[@id='contentCntr']/div[@id='generatedContent']/div[@class='locationItem']` (1)

**String patterns found:**
- `unknown` (7)
- `person:dutch_name` (2)
- `info:opening_hours` (1)

**Samples:**
- "Let op: Op donderdag 27 en vrijdag 28 november 2025 is Burgerzaken gesloten......" (veendam.nl)
- "september 2025: Arfdeeltje 53......" (lemelsarfgoed.nl)
- "19 dec. 2024: Arfdeeltje 52......" (lemelsarfgoed.nl)

### contact

- **Occurrences**: 10
- **Websites**: 7

**Top XPath patterns:**
- `body/*/address` (1)
- `body/div[@id='contact']/*/ul` (1)
- `div[@class='xr_group' and ./span[text()='Contact']]` (1)
- `body/div[@id='wrapper']/div[@id='rechterkolom']/div[@class='moduletablecontact']/div[@class='customcontact']` (1)
- `h2[text()='Contact en adressen']` (1)

**String patterns found:**
- `nav:menu_item` (4)
- `loc:street_address` (3)
- `person:dutch_name` (3)
- `loc:postal_code_nl` (2)
- `unknown` (2)

**Samples:**
- "Bongerd 18, 6411 JM Heerlen..." (museum.nl)
- "Winsum, Hoofdstraat W 70 Uithuizen, Hoofdstraat-West 1..." (hethogeland.nl)
- "Contact Contactgegevens..." (heemkundenispen.nl)

### sidebar

- **Occurrences**: 1
- **Websites**: 1

**Top XPath patterns:**
- `body/aside[@id='leftbar']` (1)

**String patterns found:**
- `org:foundation_or_association` (1)
- `nav:menu_item` (1)
- `person:dutch_name` (1)

**Samples:**
- "Vereniging Adres / Contact Kwartaalblad......" (oudhoorn.nl)

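The report does not show how an XPath is assigned to a layout category. The sketch below is one plausible rule set, inferred only from the "Top XPath patterns" lists above; the names `RULES` and `layout_category` are hypothetical, and the `main_content`, `contact`, and `sidebar` categories are deliberately left out because their patterns are too site-specific to generalize from this data.

```python
import re

# Hypothetical rules inferred from the "Top XPath patterns" lists in this
# report; the generator's real rules are not shown. First match wins, so the
# container rules (nav, header, footer) precede the element rules (h1, p, ul).
_STEP = r"(^|/)({})(\[|/|$)"  # matches a whole path step such as "/nav" or "/p["

RULES = [
    ("meta_title", re.compile(r"^head/title$|^head/meta\[@(property|name)='[\w:-]*title'\]")),
    ("meta_tag", re.compile(r"^head/meta\b")),
    ("meta_other", re.compile(r"^head\b")),
    ("nav", re.compile(_STEP.format("nav"))),
    ("header", re.compile(_STEP.format("header"))),
    ("footer", re.compile(_STEP.format("footer"))),
    ("main_heading", re.compile(_STEP.format("h1"))),
    ("sub_heading", re.compile(_STEP.format("h[2-6]"))),
    ("paragraph", re.compile(_STEP.format("p"))),
    ("list", re.compile(_STEP.format("ul|ol"))),
]


def layout_category(xpath: str) -> str:
    """Return the first category whose rule matches, else 'other'."""
    for name, rule in RULES:
        if rule.search(xpath):
            return name
    return "other"


if __name__ == "__main__":
    for path in ("head/title", "body/*/header/h1", "body/footer/*/p", "body/*/table"):
        print(path, "->", layout_category(path))
```

The rule order mirrors what the lists suggest: `body/*/header/h1` is counted under `header` rather than `main_heading`, and `body/footer/*/p` under `footer` rather than `paragraph`.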
## Entity Type Distribution

| Entity Type | Count | Websites | Primary Location |
|-------------|-------|----------|------------------|
| APP.URL | 3196 | 1176 | meta_other |
| GRP.HER | 1805 | 891 | meta_title |
| TOP.SET | 1344 | 722 | meta_tag |
| THG.LNG | 1309 | 921 | other |
| TMP.DAB | 928 | 476 | meta_tag |
| TOP.ADR | 412 | 233 | footer |
| GRP.ASS | 401 | 230 | meta_title |
| GRP.GOV | 381 | 168 | meta_title |
| APP.TIT | 380 | 278 | meta_title |
| AGT.PER | 334 | 180 | paragraph |
| WRK.WEB | 316 | 191 | meta_other |
| APP.TTL | 293 | 206 | meta_title |
| THG.CON | 260 | 162 | meta_tag |
| APP.EXH | 259 | 166 | other |
| TOP.REG | 254 | 192 | meta_tag |
| TOP.CTY | 234 | 179 | meta_tag |
| TMP.OPH | 217 | 130 | paragraph |
| QTY.MSR | 179 | 85 | meta_tag |
| QTY.CNT | 176 | 106 | meta_tag |
| GRP.COR | 152 | 110 | meta_tag |

### Entity Type Details

#### APP.URL
- **Total occurrences**: 3196
- **Unique websites**: 1176

**Found in DOM categories:**
- meta_other: 1705 (53.3%)
- meta_tag: 455 (14.2%)
- other: 293 (9.2%)
- list: 162 (5.1%)
- footer: 149 (4.7%)

**Common XPath patterns:**
- `head/link/@href` (441)
- `head/link[@rel='canonical']/@href` (342)
- `head/script[@type='application/ld+json']` (186)
- `head/meta/@content` (175)
- `head/link` (151)

**Samples:**
- "https://www.pthu.nl/" @ `head/link/@href`
- "https://heibloem.nu/vereniging/heemkundevereniging-heibloem" @ `head/meta[@property='og:url']`
- "https://heibloem.nu/held" @ `body/*/footer/*/p/a`

#### GRP.HER
- **Total occurrences**: 1805
- **Unique websites**: 891

**Found in DOM categories:**
- meta_title: 871 (48.3%)
- meta_tag: 311 (17.2%)
- other: 181 (10.0%)
- meta_other: 131 (7.3%)
- header: 88 (4.9%)

**Common XPath patterns:**
- `head/title` (829)
- `head/meta/@content` (129)
- `head/meta[@name='description']/@content` (66)
- `head/script[@type='application/ld+json']` (56)
- `head/meta[@property='og:site_name']/@content` (52)

**Samples:**
- "Museum Eenigenburg" @ `head/title`
- "Musea Schagen" @ `body/*/footer/*/a/*/p`
- "Hildo Krop Museum" @ `head/title`

#### TOP.SET
- **Total occurrences**: 1344
- **Unique websites**: 722

**Found in DOM categories:**
- meta_tag: 356 (26.5%)
- meta_title: 331 (24.6%)
- paragraph: 209 (15.6%)
- other: 126 (9.4%)
- meta_other: 84 (6.2%)

**Common XPath patterns:**
- `head/title` (315)
- `head/meta[@name='description']/@content` (154)
- `head/meta/@content` (106)
- `body/*/p` (101)
- `head/script[@type='application/ld+json']` (32)

**Samples:**
- "Amsterdam" @ `head/script[@type='application/ld+json']`
- "Eenigenburg" @ `body/*/header/*/nav/*/ul/li/a`
- "Steenwijk" @ `head/meta[@property='og:site_name']`

#### THG.LNG
- **Total occurrences**: 1309
- **Unique websites**: 921

**Found in DOM categories:**
- other: 846 (64.6%)
- meta_tag: 207 (15.8%)
- meta_other: 150 (11.5%)
- nav: 32 (2.4%)
- paragraph: 27 (2.1%)

**Common XPath patterns:**
- `@lang` (360)
- `html/@lang` (192)
- `html` (70)
- `head/meta[@property='og:locale']/@content` (54)
- `head/meta/@content` (50)

**Samples:**
- "nl_NL" @ `head/meta[@property='og:locale']`
- "nl-NL" @ `[@lang='nl-NL']`
- "nl_NL" @ `html/@lang`

#### TMP.DAB
- **Total occurrences**: 928
- **Unique websites**: 476

**Found in DOM categories:**
- meta_tag: 396 (42.7%)
- meta_other: 189 (20.4%)
- paragraph: 166 (17.9%)
- other: 97 (10.5%)
- list: 33 (3.6%)

**Common XPath patterns:**
- `head/meta[@property='article:modified_time']/@content` (108)
- `head/meta/@content` (108)
- `head/script[@type='application/ld+json']` (93)
- `body/*/p` (75)
- `head/script/text()` (34)

**Samples:**
- "2024" @ `li[@id='menu-item-2747']/a`
- "2025" @ `li[@id='menu-item-2758']/a`
- "2022-05-30T07:37:02+00:00" @ `head/meta[@property='article:modified_time']/@content`

#### TOP.ADR
- **Total occurrences**: 412
- **Unique websites**: 233

**Found in DOM categories:**
- footer: 104 (25.2%)
- paragraph: 81 (19.7%)
- other: 68 (16.5%)
- meta_other: 60 (14.6%)
- contact: 25 (6.1%)

**Common XPath patterns:**
- `body/*/p` (31)
- `body/footer/*/p` (29)
- `head/script[@type='application/ld+json']` (24)
- `head/script/text()` (19)
- `body/*/p/*` (17)

**Samples:**
- "De Boelelaan 1105, 1081 HV, Amsterdam, Nederland" @ `head/script[@type='application/ld+json']`
- "Kerkweg 5, 1744 JD Eenigenburg, Noord-Holland" @ `body/*/footer/*/ol/li`
- "Herenstraat 67" @ `body/footer/*`

#### GRP.ASS
- **Total occurrences**: 401
- **Unique websites**: 230

**Found in DOM categories:**
- meta_title: 164 (40.9%)
- meta_tag: 58 (14.5%)
- paragraph: 43 (10.7%)
- header: 27 (6.7%)
- other: 27 (6.7%)

**Common XPath patterns:**
- `head/title` (153)
- `head/meta/@content` (25)
- `body/*/p` (24)
- `head/script[@type='application/ld+json']` (13)
- `head/meta[@name='description']/@content` (11)

**Samples:**
- "Heemkundevereniging Heibloem" @ `head/title`
- "Heemkundevereniging Heibloem" @ `body/*/header/h1/*`
- "Heemkundevereniging Heibloem" @ `body/*/p/strong`

#### GRP.GOV
- **Total occurrences**: 381
- **Unique websites**: 168

**Found in DOM categories:**
- meta_title: 131 (34.4%)
- meta_tag: 126 (33.1%)
- header: 41 (10.8%)
- paragraph: 26 (6.8%)
- other: 17 (4.5%)

**Common XPath patterns:**
- `head/title` (127)
- `head/meta/@content` (42)
- `head/meta[@name='description']/@content` (26)
- `head/meta` (15)
- `body/*/p` (13)

**Samples:**
- "gemeente Ede" @ `body/img[@alt='Logo Gemeentearchief Ede']/@alt`
- "Provincie Noord-Holland" @ `head/title`
- "Provincie Overijssel" @ `head/title`

#### APP.TIT
- **Total occurrences**: 380
- **Unique websites**: 278

**Found in DOM categories:**
- meta_title: 179 (47.1%)
- meta_tag: 39 (10.3%)
- other: 39 (10.3%)
- sub_heading: 36 (9.5%)
- paragraph: 24 (6.3%)

**Common XPath patterns:**
- `head/title` (153)
- `head/meta/@content` (15)
- `body/*/h2` (12)
- `head/meta[@property='og:title']/@content` (12)
- `body/*/a/*/h4` (10)

**Samples:**
- "Heemkundevereniging Heibloem" @ `head/title`
- "Uitnodiging lezing "Ontdek de sporen van je voorouders"" @ `body/*/h2`
- "Versiering plaquette "Sporen die bleven"" @ `body/*/h2`

#### AGT.PER
- **Total occurrences**: 334
- **Unique websites**: 180

**Found in DOM categories:**
- paragraph: 107 (32.0%)
- meta_tag: 80 (24.0%)
- other: 63 (18.9%)
- sub_heading: 28 (8.4%)
- meta_other: 17 (5.1%)

**Common XPath patterns:**
- `body/*/p` (44)
- `head/meta/@content` (31)
- `head/meta[@name='description']/@content` (23)
- `body/*` (11)
- `body/*/p/text()` (11)

**Samples:**
- "Erik Pape" @ `body/*/ul/li/*/h2/a`
- "Caren van Herwaarden" @ `body/*/ul/li/*/h2/a`
- "Prof. dr. Chris Stolwijk" @ `body/span[contains(text(), 'Prof. dr. Chris Stolwijk')]`

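The percentages under "Found in DOM categories" are shares of each entity type's total occurrences, and only the five largest categories are listed, so they do not sum to 100%. A minimal sketch that reproduces the APP.URL figures from the counts above (the helper name `category_shares` is hypothetical):

```python
def category_shares(counts: dict[str, int], total: int) -> dict[str, float]:
    """Express each category count as a percentage of the entity type's total."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Counts copied from the APP.URL entry in this report.
app_url_total = 3196
app_url_counts = {
    "meta_other": 1705,
    "meta_tag": 455,
    "other": 293,
    "list": 162,
    "footer": 149,
}

shares = category_shares(app_url_counts, app_url_total)
listed = sum(app_url_counts.values())

if __name__ == "__main__":
    print(shares)
    # Occurrences that fall in categories not listed in the top five:
    print(app_url_total - listed, round(100 * (app_url_total - listed) / app_url_total, 1))
```

For APP.URL the five listed categories cover 2764 of 3196 occurrences, which leaves 432 (13.5%) spread over categories that the report does not list.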
## XPath → Entity Type Mapping

Which entity types are typically found at which DOM locations:

### `head/title` (1983 entities)
- GRP.HER: 829 (41.8%)
- TOP.SET: 315 (15.9%)
- GRP.ASS: 153 (7.7%)
- APP.TIT: 153 (7.7%)
- GRP.GOV: 127 (6.4%)

### `head/meta/@content` (976 entities)
- APP.URL: 175 (17.9%)
- GRP.HER: 129 (13.2%)
- TMP.DAB: 108 (11.1%)
- TOP.SET: 106 (10.9%)
- THG.LNG: 50 (5.1%)

### `body/*/p` (770 entities)
- TOP.SET: 101 (13.1%)
- TMP.DAB: 75 (9.7%)
- AGT.PER: 44 (5.7%)
- GRP.HER: 36 (4.7%)
- TOP.ADR: 31 (4.0%)

### `head/script[@type='application/ld+json']` (641 entities)
- APP.URL: 186 (29.0%)
- TMP.DAB: 93 (14.5%)
- GRP.HER: 56 (8.7%)
- TOP.SET: 32 (5.0%)
- QTY.MSR: 32 (5.0%)

### `head/meta[@name='description']/@content` (619 entities)
- TOP.SET: 154 (24.9%)
- GRP.HER: 66 (10.7%)
- THG.CON: 43 (6.9%)
- TOP.REG: 35 (5.7%)
- GRP.GOV: 26 (4.2%)

### `head/link/@href` (468 entities)
- APP.URL: 441 (94.2%)
- WRK.WEB: 7 (1.5%)
- GRP.HER: 6 (1.3%)
- TOP.SET: 4 (0.9%)
- APP.WEB: 3 (0.6%)

### `@lang` (365 entities)
- THG.LNG: 360 (98.6%)
- TOP.CTY: 4 (1.1%)
- TOP.SET: 1 (0.3%)

### `head/link[@rel='canonical']/@href` (351 entities)
- APP.URL: 342 (97.4%)
- WRK.WEB: 7 (2.0%)
- WRK.URL: 1 (0.3%)
- GRP.HER: 1 (0.3%)

### `head/script/text()` (342 entities)
- APP.URL: 103 (30.1%)
- GRP.HER: 37 (10.8%)
- TMP.DAB: 34 (9.9%)
- TOP.ADR: 19 (5.6%)
- THG.LNG: 19 (5.6%)

### `head/script` (242 entities)
- APP.URL: 57 (23.6%)
- TMP.DAB: 32 (13.2%)
- GRP.HER: 17 (7.0%)
- THG.LNG: 13 (5.4%)
- QTY.MSR: 11 (4.5%)

### `html/@lang` (196 entities)
- THG.LNG: 192 (98.0%)
- TOP.CTY: 3 (1.5%)
- TOP.LNG: 1 (0.5%)

### `head/meta` (175 entities)
- APP.URL: 32 (18.3%)
- TMP.DAB: 22 (12.6%)
- GRP.GOV: 15 (8.6%)
- THG.LNG: 14 (8.0%)
- GRP.HER: 13 (7.4%)

### `head/link` (161 entities)
- APP.URL: 151 (93.8%)
- THG.LNG: 5 (3.1%)
- WRK.WEB: 2 (1.2%)
- APP.WKD: 1 (0.6%)
- GRP.COR: 1 (0.6%)

### `head/link[@rel='canonical']` (150 entities)
- APP.URL: 143 (95.3%)
- WRK.WEB: 7 (4.7%)

### `body/*` (139 entities)
- TMP.DAB: 24 (17.3%)
- TOP.ADR: 13 (9.4%)
- AGT.PER: 11 (7.9%)
- TMP.TAB: 8 (5.8%)
- WRK.COL: 7 (5.0%)

### `head/meta[@property='article:modified_time']/@content` (108 entities)
- TMP.DAB: 108 (100.0%)

### `head/script[@type='application/ld+json']/text()` (108 entities)
- APP.URL: 27 (25.0%)
- TMP.DAB: 16 (14.8%)
- TOP.SET: 9 (8.3%)
- GRP.HER: 9 (8.3%)
- TOP.REG: 5 (4.6%)

### `body/*/ul/li/a` (100 entities)
- APP.URL: 33 (33.0%)
- APP.COL: 10 (10.0%)
- WRK.COL: 7 (7.0%)
- WRK.WEB: 7 (7.0%)
- APP.EXH: 7 (7.0%)

### `body/*/h2` (98 entities)
- GRP.HER: 21 (21.4%)
- APP.TIT: 12 (12.2%)
- APP.EXH: 11 (11.2%)
- TOP.SET: 11 (11.2%)
- GRP.ASS: 6 (6.1%)

### `head/meta[@property='og:site_name']/@content` (98 entities)
- GRP.HER: 52 (53.1%)
- GRP.GOV: 12 (12.2%)
- GRP.ASS: 6 (6.1%)
- APP.TTL: 5 (5.1%)
- TOP.SET: 4 (4.1%)

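A mapping like the one above can be built by counting entity types per XPath. The sketch below assumes each entity claim is available as an `(xpath, entity_type)` pair; that input shape and the function name `xpath_entity_table` are illustrative assumptions, not the generator's actual data model. The example data replays the `@lang` counts from this report.

```python
from collections import Counter, defaultdict


def xpath_entity_table(claims, top_n=5):
    """Group (xpath, entity_type) pairs and report the top entity types per XPath.

    Returns {xpath: (total, [(entity_type, count, percent), ...])}.
    """
    by_xpath = defaultdict(Counter)
    for xpath, entity_type in claims:
        by_xpath[xpath][entity_type] += 1

    table = {}
    for xpath, counter in by_xpath.items():
        total = sum(counter.values())
        rows = [
            (etype, n, round(100 * n / total, 1))
            for etype, n in counter.most_common(top_n)
        ]
        table[xpath] = (total, rows)
    return table


# Illustrative input rebuilt from the `@lang` counts reported above.
claims = (
    [("@lang", "THG.LNG")] * 360
    + [("@lang", "TOP.CTY")] * 4
    + [("@lang", "TOP.SET")] * 1
)
table = xpath_entity_table(claims)

if __name__ == "__main__":
    total, rows = table["@lang"]
    print(f"@lang ({total} entities)")
    for etype, n, pct in rows:
        print(f"- {etype}: {n} ({pct}%)")
```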
## String Pattern → Entity Type Correlation

Which string patterns are most associated with which entity types:

### `unknown` (10443 occurrences)
- APP.URL: 2814 (26.9%)
- THG.LNG: 1307 (12.5%)
- TOP.SET: 1250 (12.0%)
- TMP.DAB: 892 (8.5%)
- GRP.HER: 363 (3.5%)

### `person:dutch_name` (3440 occurrences)
- GRP.HER: 1329 (38.6%)
- GRP.ASS: 313 (9.1%)
- GRP.GOV: 246 (7.2%)
- AGT.PER: 241 (7.0%)
- APP.TIT: 172 (5.0%)

### `heritage:institution_type` (1185 occurrences)
- GRP.HER: 777 (65.6%)
- APP.URL: 140 (11.8%)
- APP.TTL: 56 (4.7%)
- APP.TIT: 49 (4.1%)
- WRK.WEB: 31 (2.6%)

### `org:foundation_or_association` (383 occurrences)
- GRP.ASS: 198 (51.7%)
- GRP.HER: 85 (22.2%)
- APP.URL: 24 (6.3%)
- APP.TIT: 15 (3.9%)
- WRK.WEB: 15 (3.9%)

### `nav:menu_item` (368 occurrences)
- APP.URL: 130 (35.3%)
- APP.TTL: 46 (12.5%)
- APP.TIT: 45 (12.2%)
- WRK.COL: 40 (10.9%)
- WRK.WEB: 34 (9.2%)

### `loc:postal_code_nl` (218 occurrences)
- TOP.ADR: 193 (88.5%)
- TOP.SET: 7 (3.2%)
- APP.NAM: 2 (0.9%)
- QTY.CNT: 2 (0.9%)
- unknown: 2 (0.9%)

### `gov:municipality` (206 occurrences)
- GRP.GOV: 171 (83.0%)
- APP.TTL: 11 (5.3%)
- TOP.SET: 5 (2.4%)
- APP.TIT: 5 (2.4%)
- TOP.REG: 5 (2.4%)

### `info:time_range` (185 occurrences)
- TMP.OPH: 152 (82.2%)
- TMP.TAB: 25 (13.5%)
- TMP.EXP: 3 (1.6%)
- TMP.DAT: 2 (1.1%)
- TMP.RNG: 1 (0.5%)

### `loc:street_address` (178 occurrences)
- TOP.ADR: 158 (88.8%)
- GRP.HER: 8 (4.5%)
- TOP.SET: 2 (1.1%)
- APP.PNM: 2 (1.1%)
- APP.URL: 1 (0.6%)

### `info:opening_hours` (161 occurrences)
- TMP.OPH: 115 (71.4%)
- TMP.DAB: 31 (19.3%)
- TMP.SET: 7 (4.3%)
- TMP.RNG: 2 (1.2%)
- TMP.DAT: 2 (1.2%)

### `contact:email` (126 occurrences)
- APP.URL: 93 (73.8%)
- APP.EMAIL: 12 (9.5%)
- APP.NAM: 5 (4.0%)
- APP.PNM: 5 (4.0%)
- APP.EML: 3 (2.4%)

### `contact:phone` (7 occurrences)
- APP.URL: 5 (71.4%)
- TOP.ADR: 1 (14.3%)
- APP.TTL: 1 (14.3%)

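The report names the string patterns but does not show their definitions. As an illustration of why `loc:postal_code_nl` and `info:time_range` correlate so strongly with a single entity type, the sketch below uses two hypothetical regular expressions (a Dutch postal code is four digits followed by two capital letters; a time range is two clock times joined by a hyphen). The expressions, the `string_patterns` helper, and the fallback to `unknown` are assumptions; the real definitions in `dutch_web_patterns.yaml` may differ.

```python
import re

# Hypothetical expressions: the labels come from this report, the regexes do not.
POSTAL_CODE_NL = re.compile(r"\b[1-9][0-9]{3}\s?[A-Z]{2}\b")
TIME_RANGE = re.compile(
    r"\b(?:[01]?\d|2[0-3])[:.][0-5]\d\s*-\s*(?:[01]?\d|2[0-3])[:.][0-5]\d\b"
)

LABELLED = [
    ("loc:postal_code_nl", POSTAL_CODE_NL),
    ("info:time_range", TIME_RANGE),
]


def string_patterns(text: str) -> list[str]:
    """Return the labels whose regex matches; assume 'unknown' when none do."""
    labels = [label for label, rx in LABELLED if rx.search(text)]
    return labels or ["unknown"]


if __name__ == "__main__":
    # Sample strings quoted from this report.
    for sample in (
        "Bongerd 18, 6411 JM Heerlen",
        "Vandaag open... Bibliotheek Kampen 09:00 - 17:00",
        "Museum Eenigenburg",
    ):
        print(sample, "->", string_patterns(sample))
```

Narrow, format-driven patterns like these leave little room for ambiguity, which is consistent with TOP.ADR taking 88.5% of `loc:postal_code_nl` matches, whereas a broad label such as `person:dutch_name` is spread over several entity types.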
## Recommendations for dutch_web_patterns.yaml

Based on the analysis, the following layout-aware patterns could be added:

### High-Value XPath Targets

1. **`head/title`** - Most common location for institution names: GRP.HER accounts for 829 of its 1983 entities (41.8%), followed by TOP.SET, GRP.ASS, APP.TIT, and GRP.GOV
2. **`body/*/h1`** - Primary heading, often the institution name
3. **`body/*/nav`** - Navigation menu, dominated by `nav:menu_item` strings (candidate for discard patterns)
4. **`body/*/footer`** - Contact info, address, social links; `footer` is the primary location for TOP.ADR (104 of 412 occurrences, 25.2%)

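As a rough illustration of targeting the first two locations, the sketch below pulls the text of `<title>` and the first `<h1>` out of a page using only the standard library. The class name `TitleH1Extractor` and the HTML snippet are made up for this example (the two strings are samples quoted earlier in the report); a real pipeline would more likely evaluate the XPaths with a proper HTML/XPath library.

```python
from html.parser import HTMLParser


class TitleH1Extractor(HTMLParser):
    """Collect the text of <title> and of the first <h1>; ignore everything else."""

    TARGETS = ("title", "h1")

    def __init__(self):
        super().__init__()
        self._current = None   # target tag we are currently inside, if any
        self.found = {}        # tag name -> collected text

    def handle_starttag(self, tag, attrs):
        # Only start capturing a target that has not been captured yet.
        if self._current is None and tag in self.TARGETS and tag not in self.found:
            self._current = tag
            self.found[tag] = ""

    def handle_endtag(self, tag):
        if tag == self._current:
            self.found[tag] = self.found[tag].strip()
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.found[self._current] += data


# Made-up page for illustration only.
PAGE = (
    "<html><head><title>Museum Eenigenburg</title></head>"
    "<body><nav><a href='/'>Home</a></nav>"
    "<div><h1><span>Terug</span> in de tijd</h1><p>tekst</p></div>"
    "</body></html>"
)

extractor = TitleH1Extractor()
extractor.feed(PAGE)
extractor.close()

if __name__ == "__main__":
    print(extractor.found)
```

Text inside `<nav>` is never captured here, which mirrors the recommendation to treat navigation content as a discard location.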
### Suggested Pattern Additions

```yaml
# Add xpath_hints to patterns for better precision
entity_patterns:
  organizations:
    heritage_institutions:
      patterns:
        - pattern: '^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$'
          xpath_hints:
            - 'head/title'
            - 'body/*/h1'
          confidence_boost: 0.2  # Higher confidence when found at expected location
```