# Web Archive Layout Pattern Analysis

Generated: 2025-12-13T15:03:35.789833

## Summary

- **Annotation files analyzed**: 500
- **Unique websites**: 468
- **Total layout claims**: 1443
- **Total entity claims**: 5134

## Layout Categories

Distribution of content by DOM location category:

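
The categories below can be read as a bucketing of simplified XPaths. The following is a minimal sketch of one rule order that reproduces the category assignments listed in this report. It is an assumption, not the generator's actual logic, and `categorize_xpath` is a hypothetical name.

```python
import re


def categorize_xpath(xpath: str) -> str:
    """Bucket a simplified XPath into one of the report's layout categories.

    Hypothetical reconstruction: the rule order was chosen so that the
    XPaths listed in this report land in the category they are listed under.
    """
    raw = xpath.lower()
    # Drop predicates such as [@id='x'] so only element/attribute steps remain.
    steps = re.sub(r"\[[^\]]*\]", "", raw).split("/")

    if steps[0] == "head":
        if "title" in raw:  # head/title, og:title, apple-mobile-web-app-title
            return "meta_title"
        if len(steps) > 1 and steps[1] == "meta":
            return "meta_tag"
        return "meta_other"

    # Container elements win over the leaf element they contain,
    # e.g. body/footer/*/p is counted under footer, not paragraph.
    containers = (
        ("nav", "nav"),
        ("header", "header"),
        ("footer", "footer"),
        ("aside", "sidebar"),
        ("article", "main_content"),
        ("address", "contact"),
    )
    for tag, category in containers:
        if tag in steps:
            return category
    if "contact" in raw:  # e.g. div[@id='contact']
        return "contact"

    leaf = steps[-1]
    if leaf == "h1":
        return "main_heading"
    if leaf in {"h2", "h3", "h4", "h5", "h6"}:
        return "sub_heading"
    if leaf == "p":
        return "paragraph"
    if leaf in {"ul", "ol", "li"}:
        return "list"
    return "other"
```

For example, `categorize_xpath("body/*/header/*/nav")` falls into `nav` because the `nav` rule is checked before `header`, which matches how that XPath is listed in this report.
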
### other
- **Occurrences**: 322
- **Websites**: 162

**Top XPath patterns:**
- `body/*` (162)
- `body/*/table` (8)
- `body` (7)
- `body/*/dl` (6)
- `body/*/script` (6)

**String patterns found:**
- `unknown` (170)
- `person:dutch_name` (104)
- `nav:menu_item` (44)
- `heritage:institution_type` (43)
- `org:foundation_or_association` (13)

**Samples:**
- "Voorstelling 'De Ezelprins' door Spiegelbeesten......" (bibliotheekkampen.nl)
- "Vandaag open... Bibliotheek Kampen 09:00 - 17:00......" (bibliotheekkampen.nl)
- "flexslider fullscreen-carousel hero-slider..." (museumrijswijk.nl)

### meta_title
- **Occurrences**: 282
- **Websites**: 268

**Top XPath patterns:**
- `head/title` (279)
- `head/meta[@property='og:title']` (1)
- `head/meta[@name='apple-mobile-web-app-title']/@content` (1)
- `head/meta[@property='og:title']/@content` (1)

**String patterns found:**
- `person:dutch_name` (224)
- `heritage:institution_type` (85)
- `nav:menu_item` (58)
- `org:foundation_or_association` (41)
- `unknown` (34)

**Samples:**
- "Protestantse Theologische Universiteit: 400 jaar theologisch onderwijs..." (pthu.nl)
- "Heemkundevereniging Heibloem | Heibloem.nu..." (heibloem.nu)
- "Museum Eenigenburg – Museum Eenigenburg..." (museumeenigenburg.nl)

### header
- **Occurrences**: 186
- **Websites**: 131

**Top XPath patterns:**
- `body/header` (62)
- `body/*/header` (23)
- `body/*/header/h1` (11)
- `body/*/header/*` (9)
- `body/header/*` (5)

**String patterns found:**
- `unknown` (96)
- `person:dutch_name` (76)
- `nav:menu_item` (26)
- `heritage:institution_type` (18)
- `org:foundation_or_association` (12)

**Samples:**
- "`<header class="header preventselection">...Protestantse Theologische Universitei...`" (pthu.nl)
- "Heemkundevereniging Heibloem..." (heibloem.nu)
- "Museum Eenigenburg..." (museumeenigenburg.nl)

### meta_other
- **Occurrences**: 164
- **Websites**: 139

**Top XPath patterns:**
- `head` (97)
- `head/script[@type='application/ld+json']` (20)
- `head/script` (14)
- `head/link` (7)
- `head/style` (3)

**String patterns found:**
- `person:dutch_name` (77)
- `unknown` (77)
- `heritage:institution_type` (33)
- `nav:menu_item` (24)
- `gov:municipality` (10)

**Samples:**
- "`<title>Home - Hildo Krop Museum</title><link rel="canonical" href="https://www.h...`" (hildokrop.nl)
- "Museum Rijswijk..." (museumrijswijk.nl)
- "Gemeentearchief Ede | Gemeentearchief Ede......" (gemeentearchief.ede.nl)

### nav
- **Occurrences**: 106
- **Websites**: 97

**Top XPath patterns:**
- `body/*/nav` (19)
- `body/nav` (16)
- `body/*/header/*/nav` (14)
- `body/header/nav` (14)
- `body/header/*/nav` (8)

**String patterns found:**
- `nav:menu_item` (57)
- `person:dutch_name` (50)
- `unknown` (37)
- `heritage:institution_type` (16)
- `org:foundation_or_association` (9)

**Samples:**
- "Actueel HELD Burgerbus Onderwijs & opvang......" (heibloem.nu)
- "Home Het museum Daglonershuis Kasteel De familie Eenigenburg Blog Contact..." (museumeenigenburg.nl)
- "Home Tentoonstellingen Bezoekersinformatie......" (sonnenborgh.nl)

### paragraph
- **Occurrences**: 93
- **Websites**: 68

**Top XPath patterns:**
- `body/*/p` (67)
- `body/*/ul/li/*/p` (4)
- `body/p` (3)
- `*/p` (2)
- `body/table/tr/td/p` (2)

**String patterns found:**
- `person:dutch_name` (47)
- `unknown` (27)
- `heritage:institution_type` (19)
- `org:foundation_or_association` (11)
- `nav:menu_item` (7)

**Samples:**
- "Heemkundevereniging Heibloem heeft als doel het in brede kring bevorderen van be..." (heibloem.nu)
- "In Museum Eenigenburg wijst een levensgroot kompas......" (museumeenigenburg.nl)
- "Het Museum is zeer geschikt voor het ontvangen van groepen......" (museumeenigenburg.nl)

### main_heading
- **Occurrences**: 74
- **Websites**: 65

**Top XPath patterns:**
- `body/*/h1` (60)
- `*/h1` (4)
- `body/div[@id='pagewrapper']/div[@class='section group']/div[@id='rightcolumn']/div[@class='section group']/div[@class='col span_2_of_2']/h1` (1)
- `body/div[@class='kolomContent home']/*/h1` (1)
- `body/div[@id='wrapper']/div[@id='wrapperinner']/div[@id='content']/div[@id='columnwrapper']/*/div[@id='anchorContent']/h1` (1)

**String patterns found:**
- `person:dutch_name` (37)
- `unknown` (31)
- `heritage:institution_type` (18)
- `nav:menu_item` (5)
- `org:foundation_or_association` (3)

**Samples:**
- "Protestantse Theologische Universiteit..." (pthu.nl)
- "Terug in de tijd..." (museumeenigenburg.nl)
- "Ook leuk voor groepen..." (museumeenigenburg.nl)

### sub_heading
- **Occurrences**: 69
- **Websites**: 46

**Top XPath patterns:**
- `body/*/h2` (39)
- `body/*/h3` (15)
- `body/h2` (3)
- `*/h2` (2)
- `body/*/h4` (2)

**String patterns found:**
- `unknown` (42)
- `person:dutch_name` (23)
- `heritage:institution_type` (5)
- `nav:menu_item` (3)
- `org:foundation_or_association` (3)

**Samples:**
- "Verenigingsnieuws..." (heibloem.nu)
- "Contactgegevens..." (heibloem.nu)
- "Home..." (bibliotheekkampen.nl)

### meta_tag
- **Occurrences**: 60
- **Websites**: 55

**Top XPath patterns:**
- `head/meta[@name='description']` (20)
- `head/meta` (13)
- `head/meta/@content` (9)
- `head/meta[@name='description']/@content` (8)
- `head/meta[@property='og:url']` (2)

**String patterns found:**
- `person:dutch_name` (33)
- `heritage:institution_type` (20)
- `unknown` (20)
- `org:foundation_or_association` (8)
- `nav:menu_item` (6)

**Samples:**
- "Prachtige tijdelijke en vaste tentoonstellingen zijn te zien bij Museum Catharij..." (catharijneconvent.nl)
- "Energie is overal. We gebruiken het constant en ervaren er veel gemak van. Tegel..." (vindhetuit.nl)
- "Aanmeldformulier Via dit formulier kunt u zich aanmelden bij de Stichting Vriend..." (verbindingsdienst.nl)

### footer
- **Occurrences**: 43
- **Websites**: 35

**Top XPath patterns:**
- `body/footer` (17)
- `body/*/footer` (8)
- `body/footer/*` (6)
- `body/*/footer/*` (3)
- `body/footer/*/p` (3)

**String patterns found:**
- `person:dutch_name` (18)
- `nav:menu_item` (13)
- `unknown` (10)
- `loc:postal_code_nl` (5)
- `loc:street_address` (4)

**Samples:**
- "Heibloem.nu Nieuwsbrief Contact......" (heibloem.nu)
- "Made with ♥ by Encore Digital Agency......" (heibloem.nu)
- "Contactinformatie Kerkweg 5... Toegangsprijzen... OPENINGS TIJDEN......" (museumeenigenburg.nl)

### list
- **Occurrences**: 39
- **Websites**: 30

**Top XPath patterns:**
- `body/*/ul` (30)
- `*/ul` (2)
- `*/ul/li` (1)
- `body/ul` (1)
- `body/*/ol` (1)

**String patterns found:**
- `person:dutch_name` (17)
- `unknown` (16)
- `nav:menu_item` (9)
- `heritage:institution_type` (5)
- `org:foundation_or_association` (1)

**Samples:**
- "`<ul class="metanav">......`" (bibliotheekkampen.nl)
- "Nu te zien [list of exhibitions]..." (catharijneconvent.nl)
- "Let op: in december hebben we aangepaste openingstijden. Bekijk hier de openings..." (debieb.nl)

### contact
- **Occurrences**: 3
- **Websites**: 3

**Top XPath patterns:**
- `body/*/address` (1)
- `body/div[@id='contact']/*/ul` (1)
- `div[@class='xr_group' and ./span[text()='Contact']]` (1)

**String patterns found:**
- `loc:postal_code_nl` (1)
- `loc:street_address` (1)
- `nav:menu_item` (1)
- `person:dutch_name` (1)

**Samples:**
- "Bongerd 18, 6411 JM Heerlen..." (museum.nl)
- "Winsum, Hoofdstraat W 70 Uithuizen, Hoofdstraat-West 1..." (hethogeland.nl)
- "Contact Contactgegevens..." (heemkundenispen.nl)

### main_content
- **Occurrences**: 1
- **Websites**: 1

**Top XPath patterns:**
- `body/div[@id='readspeaker']/*/article[@role='region']/*/p` (1)

**String patterns found:**
- `info:opening_hours` (1)

**Samples:**
- "Let op: Op donderdag 27 en vrijdag 28 november 2025 is Burgerzaken gesloten......" (veendam.nl)

### sidebar
- **Occurrences**: 1
- **Websites**: 1

**Top XPath patterns:**
- `body/aside[@id='leftbar']` (1)

**String patterns found:**
- `org:foundation_or_association` (1)
- `nav:menu_item` (1)
- `person:dutch_name` (1)

**Samples:**
- "Vereniging Adres / Contact Kwartaalblad......" (oudhoorn.nl)

## Entity Type Distribution

| Entity Type | Count | Websites | Primary Location |
|-------------|-------|----------|------------------|
| APP.URL | 1071 | 406 | meta_other |
| GRP.HER | 582 | 313 | meta_title |
| TOP.SET | 426 | 245 | meta_tag |
| THG.LNG | 398 | 305 | other |
| TMP.DAB | 285 | 155 | meta_tag |
| WRK.WEB | 145 | 75 | meta_other |
| APP.TIT | 132 | 97 | meta_title |
| TOP.ADR | 130 | 77 | footer |
| AGT.PER | 129 | 68 | paragraph |
| GRP.ASS | 122 | 76 | meta_title |
| APP.TTL | 119 | 78 | meta_title |
| GRP.GOV | 118 | 60 | meta_title |
| THG.CON | 103 | 63 | meta_tag |
| APP.EXH | 98 | 61 | other |
| TOP.REG | 80 | 65 | meta_tag |
| TMP.OPH | 74 | 46 | other |
| QTY.CNT | 69 | 38 | meta_tag |
| TOP.CTY | 68 | 55 | meta_tag |
| THG.EVT | 64 | 31 | paragraph |
| TOP.BLD | 56 | 38 | paragraph |

### Entity Type Details

#### APP.URL
- **Total occurrences**: 1071
- **Unique websites**: 406

**Found in DOM categories:**
- meta_other: 596 (55.6%)
- meta_tag: 126 (11.8%)
- other: 81 (7.6%)
- list: 64 (6.0%)
- nav: 55 (5.1%)

**Common XPath patterns:**
- `head/link/@href` (168)
- `head/link[@rel='canonical']/@href` (108)
- `head/link` (67)
- `head/script[@type='application/ld+json']` (59)
- `head/meta/@content` (47)

**Samples:**
- "https://www.pthu.nl/" @ `head/link/@href`
- "https://heibloem.nu/vereniging/heemkundevereniging-heibloem" @ `head/meta[@property='og:url']`
- "https://heibloem.nu/held" @ `body/*/footer/*/p/a`

#### GRP.HER
- **Total occurrences**: 582
- **Unique websites**: 313

**Found in DOM categories:**
- meta_title: 282 (48.5%)
- meta_tag: 97 (16.7%)
- other: 51 (8.8%)
- meta_other: 43 (7.4%)
- paragraph: 34 (5.8%)

**Common XPath patterns:**
- `head/title` (261)
- `head/meta/@content` (41)
- `head/meta[@name='description']/@content` (24)
- `head/script[@type='application/ld+json']` (20)
- `body/*/dl/*/dt/a` (17)

**Samples:**
- "Museum Eenigenburg" @ `head/title`
- "Musea Schagen" @ `body/*/footer/*/a/*/p`
- "Hildo Krop Museum" @ `head/title`

#### TOP.SET
- **Total occurrences**: 426
- **Unique websites**: 245

**Found in DOM categories:**
- meta_tag: 116 (27.2%)
- meta_title: 94 (22.1%)
- paragraph: 72 (16.9%)
- other: 32 (7.5%)
- meta_other: 28 (6.6%)

**Common XPath patterns:**
- `head/title` (87)
- `head/meta[@name='description']/@content` (45)
- `body/*/p` (38)
- `head/meta/@content` (32)
- `body/*/dl/*/dt/a` (10)

**Samples:**
- "Amsterdam" @ `head/script[@type='application/ld+json']`
- "Eenigenburg" @ `body/*/header/*/nav/*/ul/li/a`
- "Steenwijk" @ `head/meta[@property='og:site_name']`

#### THG.LNG
- **Total occurrences**: 398
- **Unique websites**: 305

**Found in DOM categories:**
- other: 260 (65.3%)
- meta_tag: 66 (16.6%)
- meta_other: 31 (7.8%)
- nav: 17 (4.3%)
- paragraph: 15 (3.8%)

**Common XPath patterns:**
- `@lang` (113)
- `html/@lang` (62)
- `html` (30)
- `head/meta[@property='og:locale']/@content` (18)
- `head/meta/@content` (15)

**Samples:**
- "nl_NL" @ `head/meta[@property='og:locale']`
- "nl-NL" @ `[@lang='nl-NL']`
- "nl_NL" @ `html/@lang`

#### TMP.DAB
- **Total occurrences**: 285
- **Unique websites**: 155

**Found in DOM categories:**
- meta_tag: 100 (35.1%)
- meta_other: 70 (24.6%)
- paragraph: 49 (17.2%)
- other: 37 (13.0%)
- sub_heading: 9 (3.2%)

**Common XPath patterns:**
- `head/script[@type='application/ld+json']` (33)
- `body/*/p` (30)
- `head/meta[@property='article:modified_time']/@content` (29)
- `head/meta/@content` (28)
- `head/script[@type='application/ld+json']/text()` (12)

**Samples:**
- "2024" @ `li[@id='menu-item-2747']/a`
- "2025" @ `li[@id='menu-item-2758']/a`
- "2022-05-30T07:37:02+00:00" @ `head/meta[@property='article:modified_time']/@content`

#### WRK.WEB
- **Total occurrences**: 145
- **Unique websites**: 75

**Found in DOM categories:**
- meta_other: 39 (26.9%)
- nav: 28 (19.3%)
- other: 27 (18.6%)
- meta_title: 15 (10.3%)
- list: 14 (9.7%)

**Common XPath patterns:**
- `body/*/nav/*/ul/li/*/h3/a` (13)
- `head/script[@type='application/ld+json']` (11)
- `HTML/BODY/DIV/DIV/UL/LI/A` (8)
- `head/script/text()` (7)
- `body/ul/li/a` (6)

**Samples:**
- "Home - Hildo Krop Museum" @ `head/title`
- "https://www.sonnenborgh.nl/home" @ `head/link[@rel='canonical']`
- "Strategische visie" @ `body/*/h2`

#### APP.TIT
- **Total occurrences**: 132
- **Unique websites**: 97

**Found in DOM categories:**
- meta_title: 62 (47.0%)
- sub_heading: 18 (13.6%)
- meta_tag: 16 (12.1%)
- other: 13 (9.8%)
- meta_other: 7 (5.3%)

**Common XPath patterns:**
- `head/title` (53)
- `body/*/h2` (8)
- `body/*/a/*/h4` (8)
- `head/meta/@content` (5)
- `head/meta[@property='og:title']/@content` (4)

**Samples:**
- "Heemkundevereniging Heibloem" @ `head/title`
- "Uitnodiging lezing "Ontdek de sporen van je voorouders"" @ `body/*/h2`
- "Versiering plaquette "Sporen die bleven"" @ `body/*/h2`

#### TOP.ADR
- **Total occurrences**: 130
- **Unique websites**: 77

**Found in DOM categories:**
- footer: 31 (23.8%)
- paragraph: 23 (17.7%)
- other: 21 (16.2%)
- meta_other: 18 (13.8%)
- list: 13 (10.0%)

**Common XPath patterns:**
- `body/footer/*/p` (13)
- `body/*/p` (9)
- `body/*/ul/li` (7)
- `head/script/text()` (7)
- `head/script[@type='application/ld+json']` (6)

**Samples:**
- "De Boelelaan 1105, 1081 HV, Amsterdam, Nederland" @ `head/script[@type='application/ld+json']`
- "Kerkweg 5, 1744 JD Eenigenburg, Noord-Holland" @ `body/*/footer/*/ol/li`
- "Herenstraat 67" @ `body/footer/*`

#### AGT.PER
- **Total occurrences**: 129
- **Unique websites**: 68

**Found in DOM categories:**
- paragraph: 47 (36.4%)
- meta_tag: 29 (22.5%)
- other: 21 (16.3%)
- sub_heading: 10 (7.8%)
- list: 5 (3.9%)

**Common XPath patterns:**
- `body/*/p` (12)
- `body/*/p/text()` (10)
- `head/meta[@name='description']/@content` (8)
- `body/*/p/*` (8)
- `head/meta/@content` (8)

**Samples:**
- "Erik Pape" @ `body/*/ul/li/*/h2/a`
- "Caren van Herwaarden" @ `body/*/ul/li/*/h2/a`
- "Prof. dr. Chris Stolwijk" @ `body/span[contains(text(), 'Prof. dr. Chris Stolwijk')]`

#### GRP.ASS
- **Total occurrences**: 122
- **Unique websites**: 76

**Found in DOM categories:**
- meta_title: 48 (39.3%)
- meta_tag: 17 (13.9%)
- paragraph: 15 (12.3%)
- other: 11 (9.0%)
- header: 7 (5.7%)

**Common XPath patterns:**
- `head/title` (44)
- `head/meta/@content` (12)
- `body/*/p` (6)
- `head/script` (4)
- `body/*` (4)

**Samples:**
- "Heemkundevereniging Heibloem" @ `head/title`
- "Heemkundevereniging Heibloem" @ `body/*/header/h1/*`
- "Heemkundevereniging Heibloem" @ `body/*/p/strong`

## XPath → Entity Type Mapping

Which entity types are typically found at which DOM locations. Each list shows at most the five most frequent entity types at that XPath, with their share of all entities found there:

### `head/title` (608 entities)
- GRP.HER: 261 (42.9%)
- TOP.SET: 87 (14.3%)
- APP.TIT: 53 (8.7%)
- APP.TTL: 45 (7.4%)
- GRP.ASS: 44 (7.2%)

### `head/meta/@content` (278 entities)
- APP.URL: 47 (16.9%)
- GRP.HER: 41 (14.7%)
- TOP.SET: 32 (11.5%)
- TMP.DAB: 28 (10.1%)
- THG.LNG: 15 (5.4%)

### `body/*/p` (277 entities)
- TOP.SET: 38 (13.7%)
- TMP.DAB: 30 (10.8%)
- GRP.HER: 17 (6.1%)
- AGT.PER: 12 (4.3%)
- THG.ART: 11 (4.0%)

### `head/script[@type='application/ld+json']` (219 entities)
- APP.URL: 59 (26.9%)
- TMP.DAB: 33 (15.1%)
- GRP.HER: 20 (9.1%)
- WRK.WEB: 11 (5.0%)
- TOP.SET: 9 (4.1%)

### `head/meta[@name='description']/@content` (189 entities)
- TOP.SET: 45 (23.8%)
- GRP.HER: 24 (12.7%)
- THG.CON: 17 (9.0%)
- TOP.REG: 11 (5.8%)
- GRP.GOV: 10 (5.3%)

### `head/link/@href` (179 entities)
- APP.URL: 168 (93.9%)
- WRK.WEB: 3 (1.7%)
- APP.WEB: 3 (1.7%)
- WRK.URL: 2 (1.1%)
- TOP.SET: 1 (0.6%)

### `@lang` (115 entities)
- THG.LNG: 113 (98.3%)
- TOP.CTY: 2 (1.7%)

### `head/link[@rel='canonical']/@href` (111 entities)
- APP.URL: 108 (97.3%)
- WRK.WEB: 2 (1.8%)
- WRK.URL: 1 (0.9%)

### `head/script` (106 entities)
- APP.URL: 23 (21.7%)
- TMP.DAB: 12 (11.3%)
- GRP.HER: 6 (5.7%)
- QTY.MSR: 6 (5.7%)
- TOP.SET: 5 (4.7%)

### `head/script/text()` (93 entities)
- APP.URL: 31 (33.3%)
- TMP.DAB: 9 (9.7%)
- GRP.HER: 9 (9.7%)
- WRK.WEB: 7 (7.5%)
- TOP.ADR: 7 (7.5%)

### `head/link` (69 entities)
- APP.URL: 67 (97.1%)
- APP.WKD: 1 (1.4%)
- WRK.WEB: 1 (1.4%)

### `html/@lang` (62 entities)
- THG.LNG: 62 (100.0%)

### `head/meta` (61 entities)
- APP.URL: 12 (19.7%)
- TMP.DAB: 5 (8.2%)
- GRP.COR: 4 (6.6%)
- GRP.HER: 4 (6.6%)
- TOP.SET: 4 (6.6%)

### `head/script[@type='application/ld+json']/text()` (57 entities)
- TMP.DAB: 12 (21.1%)
- APP.URL: 11 (19.3%)
- GRP.HER: 6 (10.5%)
- TOP.SET: 5 (8.8%)
- APP.NAM: 3 (5.3%)

### `body/*/h2` (54 entities)
- APP.TIT: 8 (14.8%)
- APP.EXH: 8 (14.8%)
- GRP.HER: 8 (14.8%)
- TOP.SET: 5 (9.3%)
- TMP.DAB: 4 (7.4%)

### `body/*` (54 entities)
- TMP.DAB: 9 (16.7%)
- WRK.COL: 7 (13.0%)
- TOP.ADR: 6 (11.1%)
- GRP.ASS: 4 (7.4%)
- TMP.TAB: 4 (7.4%)

### `head/link[@rel='canonical']` (46 entities)
- APP.URL: 41 (89.1%)
- WRK.WEB: 5 (10.9%)

### `head/meta[@property='og:description']/@content` (41 entities)
- TOP.SET: 6 (14.6%)
- TMP.DAB: 5 (12.2%)
- THG.CON: 4 (9.8%)
- AGT.PER: 3 (7.3%)
- APP.TIT: 3 (7.3%)

### `body/*/p/a` (40 entities)
- APP.URL: 20 (50.0%)
- THG.EVT: 5 (12.5%)
- TOP.SET: 3 (7.5%)
- APP.TTL: 3 (7.5%)
- GRP.GOV: 2 (5.0%)

### `body/*/p/*` (36 entities)
- AGT.PER: 8 (22.2%)
- TOP.ADR: 4 (11.1%)
- TOP.CTY: 3 (8.3%)
- ROL.POS: 2 (5.6%)
- ROL.OCC: 2 (5.6%)

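
The per-XPath shares in this section can be recomputed from raw (XPath, entity type) pairs. The following is a minimal sketch, assuming the annotations are available as such pairs. The function name `entity_shares` and the input shape are illustrative and not taken from the script that generated this report.

```python
from collections import Counter, defaultdict


def entity_shares(claims, top_n=5):
    """Return {xpath: [(entity_type, count, percent), ...]} for the top_n types.

    `claims` is an iterable of (xpath, entity_type) pairs (assumed shape).
    Percentages are relative to all entities seen at that XPath.
    """
    by_xpath = defaultdict(Counter)
    for xpath, entity_type in claims:
        by_xpath[xpath][entity_type] += 1

    table = {}
    for xpath, counts in by_xpath.items():
        total = sum(counts.values())
        table[xpath] = [
            (etype, n, round(100 * n / total, 1))
            for etype, n in counts.most_common(top_n)
        ]
    return table


# Rebuild the `@lang` entry from the counts reported above.
claims = [("@lang", "THG.LNG")] * 113 + [("@lang", "TOP.CTY")] * 2
print(entity_shares(claims)["@lang"])
```

Fed with the `@lang` counts from this section (113 THG.LNG and 2 TOP.CTY), the sketch yields the same 98.3% and 1.7% shares.
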
## String Pattern → Entity Type Correlation

Which string patterns are most associated with which entity types. Each list shows at most the five most frequent entity types per pattern, with their share of that pattern's occurrences:

### `unknown` (3503 occurrences)
- APP.URL: 937 (26.7%)
- TOP.SET: 403 (11.5%)
- THG.LNG: 398 (11.4%)
- TMP.DAB: 269 (7.7%)
- GRP.HER: 132 (3.8%)

### `person:dutch_name` (1146 occurrences)
- GRP.HER: 410 (35.8%)
- GRP.ASS: 101 (8.8%)
- AGT.PER: 89 (7.8%)
- GRP.GOV: 74 (6.5%)
- APP.TIT: 63 (5.5%)

### `heritage:institution_type` (376 occurrences)
- GRP.HER: 233 (62.0%)
- APP.URL: 50 (13.3%)
- APP.TIT: 20 (5.3%)
- APP.TTL: 19 (5.1%)
- WRK.WEB: 13 (3.5%)

### `org:foundation_or_association` (155 occurrences)
- GRP.ASS: 75 (48.4%)
- GRP.HER: 35 (22.6%)
- APP.URL: 15 (9.7%)
- WRK.WEB: 6 (3.9%)
- APP.TIT: 5 (3.2%)

### `nav:menu_item` (141 occurrences)
- APP.URL: 46 (32.6%)
- APP.TTL: 23 (16.3%)
- APP.TIT: 18 (12.8%)
- WRK.WEB: 14 (9.9%)
- WRK.COL: 13 (9.2%)

### `loc:postal_code_nl` (66 occurrences)
- TOP.ADR: 55 (83.3%)
- TOP.SET: 4 (6.1%)
- QTY.CNT: 2 (3.0%)
- APP.COL: 1 (1.5%)
- APP.NAM: 1 (1.5%)

### `gov:municipality` (65 occurrences)
- GRP.GOV: 53 (81.5%)
- TOP.REG: 3 (4.6%)
- TOP.SET: 2 (3.1%)
- APP.TIT: 2 (3.1%)
- APP.TTL: 2 (3.1%)

### `info:time_range` (63 occurrences)
- TMP.OPH: 52 (82.5%)
- TMP.TAB: 9 (14.3%)
- TMP.EXP: 2 (3.2%)

### `loc:street_address` (62 occurrences)
- TOP.ADR: 55 (88.7%)
- GRP.HER: 3 (4.8%)
- GRP.ASS: 1 (1.6%)
- APP.TTL: 1 (1.6%)
- unknown: 1 (1.6%)

### `info:opening_hours` (48 occurrences)
- TMP.OPH: 32 (66.7%)
- TMP.DAB: 13 (27.1%)
- TMP.SET: 2 (4.2%)
- THG.EVT: 1 (2.1%)

### `contact:email` (42 occurrences)
- APP.URL: 25 (59.5%)
- APP.EMAIL: 9 (21.4%)
- APP.NAM: 3 (7.1%)
- APP.PNM: 2 (4.8%)
- THG.PHO: 1 (2.4%)

### `contact:phone` (4 occurrences)
- APP.URL: 3 (75.0%)
- TOP.ADR: 1 (25.0%)

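
To make the string-pattern labels above concrete, here is a minimal sketch of how three of them might be detected. The label names come from this report, but the regular expressions are assumptions written for illustration and are not the definitions in `dutch_web_patterns.yaml`. The `unknown` fallback for unmatched strings is likewise assumed.

```python
import re

# Illustrative regexes only; the real pattern definitions may differ.
STRING_PATTERNS = {
    # Dutch postal codes: four digits (no leading zero) plus two capitals.
    "loc:postal_code_nl": re.compile(r"\b[1-9]\d{3}\s?[A-Z]{2}\b"),
    # Deliberately simple e-mail shape, not a full RFC 5322 matcher.
    "contact:email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # Clock ranges such as "09:00 - 17:00" or "9.00-17.00".
    "info:time_range": re.compile(
        r"\b\d{1,2}[:.]\d{2}\s*-\s*\d{1,2}[:.]\d{2}\b"
    ),
}


def tag_string(text):
    """Return every matching pattern label, or ['unknown'] if none match."""
    tags = [name for name, rx in STRING_PATTERNS.items() if rx.search(text)]
    return tags or ["unknown"]


print(tag_string("Bongerd 18, 6411 JM Heerlen"))
print(tag_string("Vandaag open... Bibliotheek Kampen 09:00 - 17:00"))
print(tag_string("Verenigingsnieuws"))
```

The three inputs are samples quoted earlier in this report. They are tagged as a postal code, a time range, and `unknown`, respectively.
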
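## Recommendations for dutch_web_patterns.yaml

Based on the analysis, the following layout-aware patterns could be added:

### High-Value XPath Targets

1. **`head/title`** - The most frequent source of institution names: GRP.HER accounts for 261 of the 608 entities found here (42.9%), and GRP.ASS for another 44 (7.2%)
2. **`body/*/h1`** - Primary heading (60 of the 74 `main_heading` occurrences), often the institution name
3. **`body/*/nav`** - Navigation menu; `nav:menu_item` is the most frequent string pattern in the `nav` category (57 matches), so this is a target for discard patterns
4. **`body/*/footer`** - Contact info, address, social links; `footer` is the primary location for TOP.ADR (31 of 130 occurrences, 23.8%). `body/footer` (17) is more common than `body/*/footer` (8), so both forms should be covered
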
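
As an illustration of reading these targets from a page, here is a minimal sketch using the standard library's `xml.etree.ElementTree`, whose limited XPath support includes the `*` child wildcard. The `PAGE` markup is made up for this example (it reuses strings quoted in the samples above), and `extract_targets` is a hypothetical helper. Real archived HTML is rarely well-formed XML, so a production version would need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup for illustration; not a real archived page.
PAGE = """<html>
  <head><title>Museum Eenigenburg</title></head>
  <body>
    <div>
      <h1>Terug in de tijd</h1>
      <nav><a>Home</a> <a>Contact</a></nav>
      <footer><p>Kerkweg 5, 1744 JD Eenigenburg</p></footer>
    </div>
  </body>
</html>"""

TARGETS = ["head/title", "body/*/h1", "body/*/nav", "body/*/footer"]


def extract_targets(markup, targets=TARGETS):
    """Return {xpath: [text, ...]} for each target path under <html>."""
    root = ET.fromstring(markup)
    found = {}
    for path in targets:
        # itertext() gathers the text of the element and its descendants.
        texts = ["".join(el.itertext()).strip() for el in root.findall(path)]
        found[path] = [t for t in texts if t]
    return found


for path, texts in extract_targets(PAGE).items():
    print(path, texts)
```

In ElementTree a `*` step matches exactly one element level, so `body/*/footer` does not match a footer that is a direct child of `body`. Covering that layout needs a separate `body/footer` entry.
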
### Suggested Pattern Additions

```yaml
# Add xpath_hints to patterns for better precision
entity_patterns:
  organizations:
    heritage_institutions:
      patterns:
        - pattern: '^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$'
          xpath_hints:
            - 'head/title'
            - 'body/*/h1'
          confidence_boost: 0.2  # Higher confidence when found at expected location
```
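
A consumer of such a rule might apply the boost roughly as follows. This is a minimal sketch under stated assumptions: the rule is written as a Python dict mirroring the YAML above, the base confidence of 0.6 is a placeholder, matching is case-insensitive, and `*` in a hint is treated as a shell-style wildcard via `fnmatch`. None of this is confirmed behaviour of the existing pattern engine.

```python
import re
from fnmatch import fnmatchcase

# Mirrors the suggested YAML rule; loaded by hand to stay dependency-free.
RULE = {
    "pattern": r"^(het|de)\s+(\w+)\s*(museum|archief|bibliotheek)$",
    "xpath_hints": ["head/title", "body/*/h1"],
    "confidence_boost": 0.2,
}


def score(text, xpath, rule=RULE, base_confidence=0.6):
    """Return 0.0 if the pattern misses, else base confidence plus any boost.

    base_confidence is a hypothetical default, not a value from the report.
    """
    if not re.search(rule["pattern"], text.strip(), flags=re.IGNORECASE):
        return 0.0
    # fnmatch's * also spans "/", and [ ] would be read as a character
    # class, so this only suits hints without XPath predicates.
    hinted = any(fnmatchcase(xpath, hint) for hint in rule["xpath_hints"])
    boost = rule["confidence_boost"] if hinted else 0.0
    return min(1.0, base_confidence + boost)


print(score("Het Scheepvaartmuseum", "head/title"))        # boosted
print(score("De Openbare Bibliotheek", "body/div/div/p"))  # base only
print(score("Museum Eenigenburg", "head/title"))           # pattern misses
```

The last call shows a limitation of the suggested regex: it requires a leading "het" or "de", so a title such as "Museum Eenigenburg" is not matched even at a hinted location.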