glam/schemas/20251121/linkml/modules/classes/WebClaim.yaml
kempersc f7bf1cc5ae Refactor schema slots and classes
- Deleted obsolete slot definitions: statement_summary, statement_text, statement_type, status_name, supersede_articles, supersede_condition, supersede_name, temporal_dynamics, total_amount, typical_contents, use_cases, was_acquired_through, was_fetched_at, was_retrieved_at.
- Updated existing slot definitions for states_or_stated to enhance clarity and structure.
- Introduced new classes: Article, ConditionofAccess, FinancialStatementType, MaximumQuantity, Series, Summary, Type, and their respective slots to improve schema organization and usability.
- Added new slots: changes_or_changed_through, has_or_had_condition_of_access, has_or_had_heritage_type, is_or_was_part_of_series, is_or_was_retrieved_at, maximum_of_maximum to capture additional metadata and relationships.
2026-01-30 00:29:31 +01:00

id: https://nde.nl/ontology/hc/class/WebClaim
name: WebClaim
title: WebClaim Class - Verifiable Web-Extracted Claims
prefixes:
linkml: https://w3id.org/linkml/
hc: https://nde.nl/ontology/hc/
schema: http://schema.org/
dcterms: http://purl.org/dc/terms/
prov: http://www.w3.org/ns/prov#
pav: http://purl.org/pav/
xsd: http://www.w3.org/2001/XMLSchema#
oa: http://www.w3.org/ns/oa#
nif: http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#
crm: http://www.cidoc-crm.org/cidoc-crm/
skos: http://www.w3.org/2004/02/skos/core#
rdfs: http://www.w3.org/2000/01/rdf-schema#
org: http://www.w3.org/ns/org#
imports:
- linkml:types
- ./Claim
- ../slots/source_url
- ../slots/retrieved_on
- ../slots/has_or_had_provenance_path
- ../slots/has_or_had_file_path
- ./FilePath
- ../slots/has_or_had_identifier
- ./Identifier
- ../slots/has_or_had_type
- ./ClaimType
- ./ClaimTypes
- ../slots/has_or_had_content
- ./Content
- ../slots/temporal_extent
- ../slots/specificity_annotation
- ../slots/has_or_had_score
- ../slots/is_or_was_extracted_using
- ./ExtractionMethod
- ../slots/is_or_was_retrieved_through
- ./RetrievalEvent
- ../slots/pipeline_stage
- ../slots/has_or_had_note
- ./Note
- ../enums/ExtractionPipelineStageEnum
- ./SpecificityAnnotation
- ./TemplateSpecificityScore
- ./TemplateSpecificityType
- ./TemplateSpecificityTypes
- ./XPath
default_prefix: hc
classes:
WebClaim:
is_a: Claim
class_uri: prov:Entity
description: "A single verifiable claim extracted from a web page.\n\n**CORE PRINCIPLE: XPATH OR REMOVE**\n\nEvery claim extracted from a webpage MUST have:\n1. `has_or_had_provenance_path` - XPath object pointing to the exact element in the archived HTML\n2. `has_or_had_file_path` - FilePath object giving the path to the archived HTML (Playwright-rendered, NOT WARC)\n\nThe XPath object contains:\n- `expression` - the XPath string\n- `match_score` - quality of match (0.0-1.0)\n- `matched_text` - actual text found (for verification)\n\nClaims without these fields are FABRICATED and must be REMOVED.\n\n**ARCHIVE FORMAT: PLAYWRIGHT-RENDERED HTML**\n\nWe use Playwright (headless browser) to:\n1. Navigate to the target URL\n2. Wait for JavaScript to fully render\n3. Save the complete DOM as an HTML file\n\nThis differs from WARC archives, which capture raw HTTP responses.\nPlaywright rendering captures the final DOM state, including:\n- JavaScript-rendered content\n- Dynamically loaded elements\n- Client-side state\n\n**WHY NOT CONFIDENCE\
\ SCORES?**\n\nConfidence scores like `0.95` are MEANINGLESS because:\n- There is NO methodology defining what these numbers mean\n- They cannot be verified or reproduced\n- They give a false impression of rigor\n- They mask the fact that claims may be fabricated\n\nInstead, we use VERIFIABLE provenance:\n- XPath points to the exact location\n- Archived HTML can be inspected\n- Match score is computed, not estimated\n\n**EXTRACTION PIPELINE (4 Stages)**\n\nFollowing the GLAM-NER Unified Entity Annotation Convention v1.7.0:\n\n1. **Entity Recognition** (Stage 1)\n   - Detect named entities in text\n   - Classify by hypernym type (AGT, GRP, TOP, TMP, etc.)\n   - Methods: spaCy NER, transformer models, regex patterns\n\n2. **Layout Analysis** (Stage 2)\n   - Analyze document structure (headers, paragraphs, tables)\n   - Assign DOC hypernym types (DOC.HDR, DOC.PAR, DOC.TBL)\n   - Generate XPath provenance for each claim location\n\n3. **Entity Resolution** (Stage 3)\n   - Disambiguate entity\
\ mentions\n   - Merge coreferences and name variants\n   - Produce canonical entity clusters\n\n4. **Entity Linking** (Stage 4)\n   - Link resolved entities to knowledge bases\n   - Connect to Wikidata, ISIL, GeoNames, etc.\n   - Assign link confidence scores\n\n**WORKFLOW**:\n\n1. Archive the website using Playwright:\n   `python scripts/fetch_website_playwright.py <entry_number> <url>`\n\n   This saves: web/{entry_number}/{domain}/rendered.html\n\n2. Add XPath provenance to claims:\n   `python scripts/add_xpath_provenance.py`\n\n3. The script REMOVES claims that cannot be verified\n   (and stores them in `removed_unverified_claims` for audit)\n\n**EXAMPLES**:\n\nCORRECT (Verifiable):\n```yaml\n- has_or_had_type:\n    has_or_had_label: full_name\n  has_or_had_content:\n    has_or_had_label: Historische Vereniging Nijeveen\n  source_url: https://historischeverenigingnijeveen.nl/\n  retrieved_on: \"2025-11-29T12:28:00Z\"\n  has_or_had_provenance_path:\n    expression: /html[1]/body[1]/div[6]/div[1]/h1[1]\n    match_score:\
\ 1.0\n  has_or_had_file_path:\n    has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html\n  pipeline_stage: layout_analysis\n```\n\nWRONG (Fabricated - Must Be Removed):\n```yaml\n- has_or_had_type:\n    has_or_had_label: full_name\n  has_or_had_content:\n    has_or_had_label: Historische Vereniging Nijeveen\n  confidence: 0.95  # \u2190 NO! This is meaningless without XPath\n```\n\n**MIGRATION NOTE (2026-01-15)**:\nConsolidated xpath, xpath_match_score, xpath_matched_text\ninto has_or_had_provenance_path with XPath class.\n\n**MIGRATION NOTE (2026-01-18)**:\nMigrated claim_value to has_or_had_content with Content class per Rule 53/56.\n"
exact_mappings:
- prov:Entity
close_mappings:
- schema:PropertyValue
- oa:Annotation
slots:
- is_or_was_extracted_using
- has_or_had_identifier
- has_or_had_note
- has_or_had_type
- has_or_had_content
- is_or_was_retrieved_through
- has_or_had_file_path
- pipeline_stage
- retrieved_on
- source_url
- specificity_annotation
- has_or_had_score
- has_or_had_provenance_path
slot_usage:
has_or_had_identifier:
description: 'MIGRATED from claim_id per slot_fixes.yaml (Rule 53/56, 2026-01-18).
Unique identifier for the web claim.
Uses Identifier class for structured identifier representation.
'
range: Identifier
inlined: true
required: false
examples:
- value:
identifier_scheme: web_claim_id
identifier_value: claim-2025-11-29-001
description: Web claim identifier
has_or_had_type:
description: 'MIGRATED from claim_type per slot_fixes.yaml (Rule 53/56, 2026-01-19).
The type of claim being made (e.g., full_name, email, facebook).
Uses ClaimType class hierarchy for structured type representation:
- IdentityClaim: full_name, short_name, description, legal_name
- ContactClaim: email, phone, address, website
- SocialMediaClaim: facebook, twitter, instagram, linkedin, youtube
- MediaClaim: logo_url, favicon_url, og_image_url
- OperationalClaim: opening_hours, admission_info, accessibility_info
- CollectionClaim: collection_count, beeldbank (image bank) statistics such as beeldbank_total_photos
- OrganizationalClaim: founding_date, kvk_number, legal_form
- DocumentClaim: annual_report_url, policy_document_url
- GeographicClaim: street_address, postal_code, city, province
- ArchivalClaim: archief_description, beeldbank_description
'
range: ClaimType
inlined: true
required: true
examples:
- value:
has_or_had_label: full_name
description: Identity claim for organization name
- value:
has_or_had_label: facebook
description: Social media claim for Facebook URL
has_or_had_note:
description: 'MIGRATED from claim_note per slot_fixes.yaml (Rule 53/56, 2026-01-18).
Notes about this specific claim extraction.
Uses Note class with note_type, note_content, note_date fields.
**Note Type Mapping**:
- `note_type`: "claim" (default for WebClaim notes)
- `note_content`: The actual note text
- `note_date`: When the note was created
**Use Cases**:
- Document extraction issues
- Note special circumstances
- Record conflicts with other sources
'
range: Note
inlined: true
inlined_as_list: true
multivalued: true
required: false
examples:
- value:
note_type: claim
note_content: Additional verification required for this claim.
note_date: '2026-01-18'
description: Verification note for claim
- value:
note_type: extraction
note_content: Biography truncated from longer text on page.
note_date: '2025-11-29'
description: Extraction processing note
has_or_had_content:
description: 'MIGRATED from claim_value per slot_fixes.yaml (Rule 53/56, 2026-01-18).
The extracted value from the web source - the actual content claimed to exist
at the XPath location.
Uses Content class with has_or_had_label holding the raw extracted string.
**Content Mapping**:
- `has_or_had_label`: The raw extracted value (required)
- `has_or_had_description`: Optional elaboration on the claim content
**Examples of claim values**:
- Organization names: "Historische Vereniging Nijeveen"
- Statistics: "6253" (photo count)
- URLs: "https://www.facebook.com/HistorischeVerenigingNijeveen/"
'
range: Content
inlined: true
required: true
multivalued: false
examples:
- value:
has_or_had_label: Historische Vereniging Nijeveen
description: Organization name claim value
- value:
has_or_had_label: '6253'
description: Numeric statistic claim value
- value:
has_or_had_label: https://www.facebook.com/HistorischeVerenigingNijeveen/
description: URL claim value
source_url:
required: true
retrieved_on:
required: true
has_or_had_provenance_path:
required: true
range: XPath
inlined: true
description: XPath provenance for this claim, pointing to the exact element in the archived HTML. Contains expression, matched_text, and match_score.
has_or_had_file_path:
required: true
range: FilePath
inlined: true
description: Path to the archived HTML file (Playwright-rendered) used for extraction. MIGRATED from html_file.
examples:
- value:
has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
description: Archived HTML file path
is_or_was_retrieved_through:
description: 'Retrieval event containing timestamp.
MIGRATED from extraction_timestamp per Rule 53 (2026-01-26).
'
range: RetrievalEvent
inlined: true
required: false
is_or_was_extracted_using:
description: 'MIGRATED from claim_extraction_method per slot_fixes.yaml (Rule 53/56, 2026-01-19).
Method used to extract this claim from the source document.
Uses ExtractionMethod class to represent structured extraction method information.
**Common Extraction Methods**:
- `xpath_exact_match` - XPath pointed to exact element containing value
- `xpath_fuzzy_match` - XPath with partial/substring match
- `nlp_ner` - Named Entity Recognition extraction
- `json_ld_parse` - Parsed from embedded JSON-LD structured data
- `meta_tag` - Extracted from HTML meta tags
- `manual` - Human-verified extraction
'
range: ExtractionMethod
inlined: true
required: false
examples:
- value:
has_or_had_label: xpath_exact_match
description: XPath extraction with exact match
- value:
has_or_had_label: nlp_ner
description: NLP Named Entity Recognition extraction
rules:
- preconditions:
slot_conditions:
has_or_had_provenance_path:
value_presence: ABSENT
postconditions:
description: Claims without XPath provenance must be removed as unverifiable
comments:
- WebClaim requires XPath provenance via has_or_had_provenance_path - claims without it are fabricated
- XPath class contains expression, matched_text, and match_score in one structure
- Archived HTML files are Playwright-rendered (NOT WARC format)
- Use scripts/fetch_website_playwright.py to archive websites
- Use scripts/add_xpath_provenance.py to add XPath to existing claims
- "Follows 4-stage GLAM-NER pipeline: recognition \u2192 layout \u2192 resolution \u2192 linking"
- "MIGRATED 2026-01-15: xpath/xpath_match_score/xpath_matched_text \u2192 has_or_had_provenance_path (XPath class)"
- "MIGRATED 2026-01-18: claim_value \u2192 has_or_had_content (Content class) per Rule 53/56"
- "MIGRATED 2026-01-18: claim_note \u2192 has_or_had_note (Note class) per Rule 53/56"
- "MIGRATED 2026-01-19: claim_extraction_method \u2192 is_or_was_extracted_using (ExtractionMethod class) per Rule 53/56"
- "MIGRATED 2026-01-19: claim_type \u2192 has_or_had_type (ClaimType/ClaimTypes classes) per Rule 53/56"
see_also:
- rules/WEB_OBSERVATION_PROVENANCE_RULES.md
- scripts/fetch_website_playwright.py
- scripts/add_xpath_provenance.py
- docs/convention/schema/20251202/entity_annotation_rules_v1.6.0_unified.yaml
examples:
- value:
has_or_had_type:
has_or_had_label: full_name
has_or_had_content:
has_or_had_label: Historische Vereniging Nijeveen
source_url: https://historischeverenigingnijeveen.nl/
retrieved_on: '2025-11-29T12:28:00Z'
has_or_had_provenance_path:
expression: /html[1]/body[1]/div[6]/div[1]/h1[1]
match_score: 1.0
source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
has_or_had_file_path:
has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
pipeline_stage: layout_analysis
description: Exact match claim for organization name (claim_type migrated to has_or_had_type)
- value:
has_or_had_type:
has_or_had_label: beeldbank_total_photos
has_or_had_content:
has_or_had_label: '6253'
source_url: https://historischeverenigingnijeveen.nl/nl/hvn
retrieved_on: '2025-11-29T12:28:00Z'
has_or_had_provenance_path:
expression: /html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[1]
match_score: 1.0
source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
has_or_had_file_path:
has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
pipeline_stage: layout_analysis
description: Collection count claim from image bank statistics
- value:
has_or_had_type:
has_or_had_label: facebook
has_or_had_content:
has_or_had_label: https://www.facebook.com/HistorischeVerenigingNijeveen/
source_url: https://historischeverenigingnijeveen.nl/
retrieved_on: '2025-11-29T12:28:00Z'
has_or_had_provenance_path:
expression: /html[1]/body[1]/footer[1]/div[1]/a[3]
match_score: 1.0
source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
has_or_had_file_path:
has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
pipeline_stage: entity_linking
description: Social media link claim - entity linking stage
- value:
has_or_had_type:
has_or_had_label: website
has_or_had_content:
has_or_had_label: https://www.historischeverenigingnijeveen.nl/
source_url: https://historischeverenigingnijeveen.nl/nl/hvn
retrieved_on: '2025-11-28T12:00:00Z'
has_or_had_provenance_path:
expression: /html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[6]
matched_text: De Historische Vereniging Nijeveen is ook te vinden op Facebook
match_score: 0.561
source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
has_or_had_file_path:
has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
pipeline_stage: layout_analysis
description: Substring match - URL found within longer text
annotations:
specificity_score: 0.1
specificity_rationale: Generic utility class/slot created during migration
custodian_types: "['*']"
custodian_types_rationale: Universal utility concept
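
The "XPATH OR REMOVE" rule and workflow step 3 above can be sketched in Python. This is a minimal illustration, not the logic of `scripts/add_xpath_provenance.py`, whose implementation is not shown here. It assumes claims are plain dictionaries shaped like the `examples` in this schema. The function names, the `claims` output key, and the `removal_reason` field are hypothetical. Only `removed_unverified_claims` is taken from the class description.

```python
from typing import Any


def missing_provenance(claim: dict[str, Any]) -> list[str]:
    """Return the provenance fields that are absent or empty on a claim.

    Assumption: a claim is a dict keyed by the slot names of this schema.
    """
    missing = []
    xpath = claim.get("has_or_had_provenance_path")
    if not (isinstance(xpath, dict) and xpath.get("expression")):
        missing.append("has_or_had_provenance_path.expression")
    file_path = claim.get("has_or_had_file_path")
    if not (isinstance(file_path, dict) and file_path.get("has_or_had_label")):
        missing.append("has_or_had_file_path.has_or_had_label")
    return missing


def partition_claims(claims: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Split claims into kept claims and claims removed for lack of provenance.

    Removed claims are retained under `removed_unverified_claims` for audit.
    The `removal_reason` key is a hypothetical addition for illustration.
    """
    kept, removed = [], []
    for claim in claims:
        missing = missing_provenance(claim)
        if missing:
            removed.append({**claim, "removal_reason": missing})
        else:
            kept.append(claim)
    return {"claims": kept, "removed_unverified_claims": removed}
```

A claim that carries only a `confidence` value ends up in `removed_unverified_claims`, while a claim with both an XPath `expression` and an archived file path is kept. The sketch checks only that the fields are present. It does not evaluate the XPath against the archived HTML or compute `match_score`.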