glam/schemas/20251121/linkml/modules/classes/WebClaim.yaml
kempersc 4319f38c05 Add archived slots for audience size, audience type, and capacity metrics
- Created new YAML files for audience size and audience type slots, defining their properties and annotations.
- Added archived capacity slots including cubic meters, linear meters, item count, and descriptions, with appropriate URIs and ranges.
- Introduced a template specificity slot for context-aware RAG filtering.
- Consolidated capacity-related slots into a unified structure, including has_or_had_capacity, capacity_type, and capacity_value, with detailed descriptions and examples.
2026-01-17 18:53:23 +01:00

id: https://nde.nl/ontology/hc/class/WebClaim
name: WebClaim
title: WebClaim Class - Verifiable Web-Extracted Claims
prefixes:
  linkml: https://w3id.org/linkml/
  hc: https://nde.nl/ontology/hc/
  schema: http://schema.org/
  dcterms: http://purl.org/dc/terms/
  prov: http://www.w3.org/ns/prov#
  pav: http://purl.org/pav/
  xsd: http://www.w3.org/2001/XMLSchema#
  oa: http://www.w3.org/ns/oa#
  nif: http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#
imports:
- linkml:types
- ../slots/source_url
- ../slots/retrieved_on
- ../slots/has_or_had_provenance_path
- ../slots/html_file
- ../slots/claim_id
- ../slots/claim_type
- ../slots/claim_value
- ../slots/extraction_timestamp
- ../slots/specificity_annotation
- ../slots/has_or_had_score # was: template_specificity - migrated per Rule 53 (2026-01-17)
- ../slots/claim_extraction_method
- ../slots/pipeline_stage
- ../slots/claim_note
- ../enums/ClaimTypeEnum
- ../enums/ExtractionPipelineStageEnum
- ./SpecificityAnnotation
- ./TemplateSpecificityScore # was: TemplateSpecificityScores - migrated per Rule 53 (2026-01-17)
- ./TemplateSpecificityType
- ./TemplateSpecificityTypes
- ./XPath
default_prefix: hc
classes:
  WebClaim:
    class_uri: prov:Entity
    description: |
      A single verifiable claim extracted from a web page.

      **CORE PRINCIPLE: XPATH OR REMOVE**

      Every claim extracted from a webpage MUST have:
      1. `has_or_had_provenance_path` - XPath object pointing to the exact element in the archived HTML
      2. `html_file` - path to the archived HTML (Playwright-rendered, NOT WARC)

      The XPath object contains:
      - `expression` - the XPath string
      - `match_score` - quality of match (0.0-1.0)
      - `matched_text` - actual text found (for verification)

      Claims without these fields are FABRICATED and must be REMOVED.

      **ARCHIVE FORMAT: PLAYWRIGHT-RENDERED HTML**

      We use Playwright (headless browser) to:
      1. Navigate to the target URL
      2. Wait for JavaScript to fully render
      3. Save the complete DOM as an HTML file

      This differs from WARC archives, which capture raw HTTP responses.
      Playwright rendering captures the final DOM state, including:
      - JavaScript-rendered content
      - Dynamically loaded elements
      - Client-side state

      **WHY NOT CONFIDENCE SCORES?**

      Confidence scores like `0.95` are MEANINGLESS because:
      - There is NO methodology defining what these numbers mean
      - They cannot be verified or reproduced
      - They give a false impression of rigor
      - They mask the fact that claims may be fabricated

      Instead, we use VERIFIABLE provenance:
      - XPath points to the exact location
      - Archived HTML can be inspected
      - Match score is computed, not estimated

      **EXTRACTION PIPELINE (4 Stages)**

      Following the GLAM-NER Unified Entity Annotation Convention v1.7.0:

      1. **Entity Recognition** (Stage 1)
         - Detect named entities in text
         - Classify by hypernym type (AGT, GRP, TOP, TMP, etc.)
         - Methods: spaCy NER, transformer models, regex patterns

      2. **Layout Analysis** (Stage 2)
         - Analyze document structure (headers, paragraphs, tables)
         - Assign DOC hypernym types (DOC.HDR, DOC.PAR, DOC.TBL)
         - Generate XPath provenance for each claim location

      3. **Entity Resolution** (Stage 3)
         - Disambiguate entity mentions
         - Merge coreferences and name variants
         - Produce canonical entity clusters

      4. **Entity Linking** (Stage 4)
         - Link resolved entities to knowledge bases
         - Connect to Wikidata, ISIL, GeoNames, etc.
         - Assign link confidence scores

      **WORKFLOW**:

      1. Archive the website using Playwright:
         `python scripts/fetch_website_playwright.py <entry_number> <url>`

         This saves: web/{entry_number}/{domain}/rendered.html

      2. Add XPath provenance to claims:
         `python scripts/add_xpath_provenance.py`

      3. The script REMOVES claims that cannot be verified
         (stores them in `removed_unverified_claims` for audit)

      **EXAMPLES**:

      CORRECT (Verifiable):
      ```yaml
      - claim_type: full_name
        claim_value: Historische Vereniging Nijeveen
        source_url: https://historischeverenigingnijeveen.nl/
        retrieved_on: "2025-11-29T12:28:00Z"
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/h1[1]
          match_score: 1.0
        html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
      ```

      WRONG (Fabricated - Must Be Removed):
      ```yaml
      - claim_type: full_name
        claim_value: Historische Vereniging Nijeveen
        confidence: 0.95  # ← NO! This is meaningless without XPath
      ```

      **MIGRATION NOTE (2026-01-15)**:
      Consolidated xpath, xpath_match_score, xpath_matched_text
      into has_or_had_provenance_path with XPath class.
    exact_mappings:
    - prov:Entity
    close_mappings:
    - schema:PropertyValue
    - oa:Annotation
    slots:
    - claim_extraction_method
    - claim_id
    - claim_note
    - claim_type
    - claim_value
    - extraction_timestamp
    - html_file
    - pipeline_stage
    - retrieved_on
    - source_url
    - specificity_annotation
    - has_or_had_score  # was: template_specificity - migrated per Rule 53 (2026-01-17)
    - has_or_had_provenance_path
    slot_usage:
      claim_type:
        required: true
      claim_value:
        required: true
      source_url:
        required: true
      retrieved_on:
        required: true
      has_or_had_provenance_path:
        required: true
        range: XPath
        inlined: true
        description: >-
          XPath provenance for this claim - pointing to exact element in archived HTML.
          Contains expression, matched_text, and match_score.
      html_file:
        required: true
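    # A minimal sketch of the shape the imported XPath class is assumed to have.
    # It is inferred only from the field names used in this file (expression,
    # match_score, matched_text) and the stated 0.0-1.0 score range. The use of
    # `attributes`, the ranges, and the `required` flag are assumptions for
    # illustration; ./XPath.yaml remains the authoritative definition.
    #
    #   classes:
    #     XPath:
    #       attributes:
    #         expression:
    #           range: string
    #           required: true      # assumption
    #         match_score:
    #           range: float
    #           minimum_value: 0.0  # from "quality of match (0.0-1.0)"
    #           maximum_value: 1.0
    #         matched_text:
    #           range: string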
    rules:
    - preconditions:
        slot_conditions:
          has_or_had_provenance_path:
            value_presence: ABSENT
      postconditions:
        description: Claims without XPath provenance must be removed as unverifiable
    comments:
    - WebClaim requires XPath provenance via has_or_had_provenance_path - claims without it are fabricated
    - XPath class contains expression, matched_text, and match_score in one structure
    - Archived HTML files are Playwright-rendered (NOT WARC format)
    - Use scripts/fetch_website_playwright.py to archive websites
    - Use scripts/add_xpath_provenance.py to add XPath to existing claims
    - 'Follows 4-stage GLAM-NER pipeline: recognition → layout → resolution → linking'
    - 'MIGRATED 2026-01-15: xpath/xpath_match_score/xpath_matched_text → has_or_had_provenance_path (XPath class)'
    see_also:
    - rules/WEB_OBSERVATION_PROVENANCE_RULES.md
    - scripts/fetch_website_playwright.py
    - scripts/add_xpath_provenance.py
    - docs/convention/schema/20251202/entity_annotation_rules_v1.6.0_unified.yaml
    examples:
    - value:
        claim_type: full_name
        claim_value: Historische Vereniging Nijeveen
        source_url: https://historischeverenigingnijeveen.nl/
        retrieved_on: '2025-11-29T12:28:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/h1[1]
          match_score: 1.0
          source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
        html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
      description: Exact match claim for organization name
    - value:
        claim_type: beeldbank_total_photos
        claim_value: '6253'
        source_url: https://historischeverenigingnijeveen.nl/nl/hvn
        retrieved_on: '2025-11-29T12:28:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[1]
          match_score: 1.0
          source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
        html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
      description: Collection count claim from image bank statistics
    - value:
        claim_type: facebook
        claim_value: https://www.facebook.com/HistorischeVerenigingNijeveen/
        source_url: https://historischeverenigingnijeveen.nl/
        retrieved_on: '2025-11-29T12:28:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/footer[1]/div[1]/a[3]
          match_score: 1.0
          source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
        html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: entity_linking
      description: Social media link claim - entity linking stage
    - value:
        claim_type: website
        claim_value: https://www.historischeverenigingnijeveen.nl/
        source_url: https://historischeverenigingnijeveen.nl/nl/hvn
        retrieved_on: '2025-11-28T12:00:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[6]
          matched_text: De Historische Vereniging Nijeveen is ook te vinden op Facebook
          match_score: 0.561
          source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
        html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
      description: Substring match - URL found within longer text