- Deleted obsolete slot definitions: statement_summary, statement_text, statement_type, status_name, supersede_articles, supersede_condition, supersede_name, temporal_dynamics, total_amount, typical_contents, use_cases, was_acquired_through, was_fetched_at, was_retrieved_at.
- Updated existing slot definitions for states_or_stated to enhance clarity and structure.
- Introduced new classes: Article, ConditionofAccess, FinancialStatementType, MaximumQuantity, Series, Summary, Type, and their respective slots to improve schema organization and usability.
- Added new slots: changes_or_changed_through, has_or_had_condition_of_access, has_or_had_heritage_type, is_or_was_part_of_series, is_or_was_retrieved_at, maximum_of_maximum to capture additional metadata and relationships.
368 lines
16 KiB
YAML
id: https://nde.nl/ontology/hc/class/WebClaim
name: WebClaim
title: WebClaim Class - Verifiable Web-Extracted Claims
prefixes:
  linkml: https://w3id.org/linkml/
  hc: https://nde.nl/ontology/hc/
  schema: http://schema.org/
  dcterms: http://purl.org/dc/terms/
  prov: http://www.w3.org/ns/prov#
  pav: http://purl.org/pav/
  xsd: http://www.w3.org/2001/XMLSchema#
  oa: http://www.w3.org/ns/oa#
  nif: http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#
  crm: http://www.cidoc-crm.org/cidoc-crm/
  skos: http://www.w3.org/2004/02/skos/core#
  rdfs: http://www.w3.org/2000/01/rdf-schema#
  org: http://www.w3.org/ns/org#
imports:
- linkml:types
- ./Claim
- ../slots/source_url
- ../slots/retrieved_on
- ../slots/has_or_had_provenance_path
- ../slots/has_or_had_file_path
- ./FilePath
- ../slots/has_or_had_identifier
- ./Identifier
- ../slots/has_or_had_type
- ./ClaimType
- ./ClaimTypes
- ../slots/has_or_had_content
- ./Content
- ../slots/temporal_extent
- ../slots/specificity_annotation
- ../slots/has_or_had_score
- ../slots/is_or_was_extracted_using
- ./ExtractionMethod
- ../slots/is_or_was_retrieved_through
- ./RetrievalEvent
- ../slots/pipeline_stage
- ../slots/has_or_had_note
- ./Note
- ../enums/ExtractionPipelineStageEnum
- ./SpecificityAnnotation
- ./TemplateSpecificityScore
- ./TemplateSpecificityType
- ./TemplateSpecificityTypes
- ./XPath
default_prefix: hc
classes:
  WebClaim:
    is_a: Claim
    class_uri: prov:Entity
    description: "A single verifiable claim extracted from a web page.\n\n**CORE PRINCIPLE: XPATH OR REMOVE**\n\nEvery claim extracted from a webpage MUST have:\n1. `has_or_had_provenance_path` - XPath object pointing to the exact element in the archived HTML\n2. `has_or_had_file_path` - path to the archived HTML (Playwright-rendered, NOT WARC)\n\nThe XPath object contains:\n- `expression` - the XPath string\n- `match_score` - quality of match (0.0-1.0)\n- `matched_text` - actual text found (for verification)\n\nClaims without these fields are FABRICATED and must be REMOVED.\n\n**ARCHIVE FORMAT: PLAYWRIGHT-RENDERED HTML**\n\nWe use Playwright (headless browser) to:\n1. Navigate to the target URL\n2. Wait for JavaScript to fully render\n3. Save the complete DOM as an HTML file\n\nThis differs from WARC archives, which capture raw HTTP responses.\nPlaywright rendering captures the final DOM state including:\n- JavaScript-rendered content\n- Dynamically loaded elements\n- Client-side state\n\n**WHY NOT CONFIDENCE\
      \ SCORES?**\n\nConfidence scores like `0.95` are MEANINGLESS because:\n- There is NO methodology defining what these numbers mean\n- They cannot be verified or reproduced\n- They give a false impression of rigor\n- They mask the fact that claims may be fabricated\n\nInstead, we use VERIFIABLE provenance:\n- XPath points to the exact location\n- Archived HTML can be inspected\n- Match score is computed, not estimated\n\n**EXTRACTION PIPELINE (4 Stages)**\n\nFollowing the GLAM-NER Unified Entity Annotation Convention v1.7.0:\n\n1. **Entity Recognition** (Stage 1)\n   - Detect named entities in text\n   - Classify by hypernym type (AGT, GRP, TOP, TMP, etc.)\n   - Methods: spaCy NER, transformer models, regex patterns\n\n2. **Layout Analysis** (Stage 2)\n   - Analyze document structure (headers, paragraphs, tables)\n   - Assign DOC hypernym types (DOC.HDR, DOC.PAR, DOC.TBL)\n   - Generate XPath provenance for each claim location\n\n3. **Entity Resolution** (Stage 3)\n   - Disambiguate entity\
      \ mentions\n   - Merge coreferences and name variants\n   - Produce canonical entity clusters\n\n4. **Entity Linking** (Stage 4)\n   - Link resolved entities to knowledge bases\n   - Connect to Wikidata, ISIL, GeoNames, etc.\n   - Assign link confidence scores\n\n**WORKFLOW**:\n\n1. Archive website using Playwright:\n   `python scripts/fetch_website_playwright.py <entry_number> <url>`\n\n   This saves: web/{entry_number}/{domain}/rendered.html\n\n2. Add XPath provenance to claims:\n   `python scripts/add_xpath_provenance.py`\n\n3. Script REMOVES claims that cannot be verified\n   (stores in `removed_unverified_claims` for audit)\n\n**EXAMPLES**:\n\nCORRECT (Verifiable):\n```yaml\n- has_or_had_type:\n    has_or_had_label: full_name\n  has_or_had_content:\n    has_or_had_label: Historische Vereniging Nijeveen\n  source_url: https://historischeverenigingnijeveen.nl/\n  retrieved_on: \"2025-11-29T12:28:00Z\"\n  has_or_had_provenance_path:\n    expression: /html[1]/body[1]/div[6]/div[1]/h1[1]\n    match_score:\
      \ 1.0\n  has_or_had_file_path:\n    has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html\n  pipeline_stage: layout_analysis\n```\n\nWRONG (Fabricated - Must Be Removed):\n```yaml\n- has_or_had_type:\n    has_or_had_label: full_name\n  has_or_had_content:\n    has_or_had_label: Historische Vereniging Nijeveen\n  confidence: 0.95  # \u2190 NO! This is meaningless without XPath\n```\n\n**MIGRATION NOTE (2026-01-15)**:\nConsolidated xpath, xpath_match_score, xpath_matched_text\ninto has_or_had_provenance_path with XPath class.\n\n**MIGRATION NOTE (2026-01-18)**:\nMigrated claim_value to has_or_had_content with Content class per Rule 53/56.\n"
    exact_mappings:
    - prov:Entity
    close_mappings:
    - schema:PropertyValue
    - oa:Annotation
    slots:
    - is_or_was_extracted_using
    - has_or_had_identifier
    - has_or_had_note
    - has_or_had_type
    - has_or_had_content
    - is_or_was_retrieved_through
    - has_or_had_file_path
    - pipeline_stage
    - retrieved_on
    - source_url
    - specificity_annotation
    - has_or_had_score
    - has_or_had_provenance_path
    slot_usage:
      has_or_had_identifier:
        description: 'MIGRATED from claim_id per slot_fixes.yaml (Rule 53/56, 2026-01-18).

          Unique identifier for the web claim.

          Uses Identifier class for structured identifier representation.

          '
        range: Identifier
        inlined: true
        required: false
        examples:
        - value:
            identifier_scheme: web_claim_id
            identifier_value: claim-2025-11-29-001
          description: Web claim identifier
      has_or_had_type:
        description: 'MIGRATED from claim_type per slot_fixes.yaml (Rule 53/56, 2026-01-19).

          The type of claim being made (e.g., full_name, email, facebook).


          Uses ClaimType class hierarchy for structured type representation:

          - IdentityClaim: full_name, short_name, description, legal_name

          - ContactClaim: email, phone, address, website

          - SocialMediaClaim: facebook, twitter, instagram, linkedin, youtube

          - MediaClaim: logo_url, favicon_url, og_image_url

          - OperationalClaim: opening_hours, admission_info, accessibility_info

          - CollectionClaim: collection_count, beeldbank statistics

          - OrganizationalClaim: founding_date, kvk_number, legal_form

          - DocumentClaim: annual_report_url, policy_document_url

          - GeographicClaim: street_address, postal_code, city, province

          - ArchivalClaim: archief_description, beeldbank_description

          '
        range: ClaimType
        inlined: true
        required: true
        examples:
        - value:
            has_or_had_label: full_name
          description: Identity claim for organization name
        - value:
            has_or_had_label: facebook
          description: Social media claim for Facebook URL
      has_or_had_note:
        description: 'MIGRATED from claim_note per slot_fixes.yaml (Rule 53/56, 2026-01-18).

          Notes about this specific claim extraction.


          Uses Note class with note_type, note_content, note_date fields.


          **Note Type Mapping**:

          - `note_type`: "claim" (default for WebClaim notes)

          - `note_content`: The actual note text

          - `note_date`: When the note was created


          **Use Cases**:

          - Document extraction issues

          - Note special circumstances

          - Record conflicts with other sources

          '
        range: Note
        inlined: true
        inlined_as_list: true
        multivalued: true
        required: false
        examples:
        - value:
            note_type: claim
            note_content: Additional verification required for this claim.
            note_date: '2026-01-18'
          description: Verification note for claim
        - value:
            note_type: extraction
            note_content: Biography truncated from longer text on page.
            note_date: '2025-11-29'
          description: Extraction processing note
      has_or_had_content:
        description: 'MIGRATED from claim_value per slot_fixes.yaml (Rule 53/56, 2026-01-18).

          The extracted value from the web source - the actual content claimed to exist

          at the XPath location.


          Uses Content class with has_or_had_label holding the raw extracted string.


          **Content Mapping**:

          - `has_or_had_label`: The raw extracted value (required)

          - `has_or_had_description`: Optional elaboration on the claim content


          **Examples of claim values**:

          - Organization names: "Historische Vereniging Nijeveen"

          - Statistics: "6253" (photo count)

          - URLs: "https://www.facebook.com/HistorischeVerenigingNijeveen/"

          '
        range: Content
        inlined: true
        required: true
        multivalued: false
        examples:
        - value:
            has_or_had_label: Historische Vereniging Nijeveen
          description: Organization name claim value
        - value:
            has_or_had_label: '6253'
          description: Numeric statistic claim value
        - value:
            has_or_had_label: https://www.facebook.com/HistorischeVerenigingNijeveen/
          description: URL claim value
      source_url:
        required: true
      retrieved_on:
        required: true
      has_or_had_provenance_path:
        required: true
        range: XPath
        inlined: true
        description: XPath provenance for this claim - pointing to exact element in archived HTML. Contains expression, matched_text, and match_score.
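        # Illustrative example only: it uses the three XPath fields named in the
        # description above (expression, matched_text, match_score), with values
        # copied from the substring-match example under this class's `examples`.
        # It has not been validated against the XPath class definition.
        examples:
        - value:
            expression: /html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[6]
            matched_text: De Historische Vereniging Nijeveen is ook te vinden op Facebook
            match_score: 0.561
          description: Substring match - value found within longer text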
      has_or_had_file_path:
        required: true
        range: FilePath
        inlined: true
        description: Path to the archived HTML file (Playwright-rendered) used for extraction. MIGRATED from html_file.
        examples:
        - value:
            has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
          description: Archived HTML file path
      is_or_was_retrieved_through:
        description: 'Retrieval event containing timestamp.

          MIGRATED from extraction_timestamp per Rule 53 (2026-01-26).

          '
        range: RetrievalEvent
        inlined: true
        required: false
      is_or_was_extracted_using:
        description: 'MIGRATED from claim_extraction_method per slot_fixes.yaml (Rule 53/56, 2026-01-19).

          Method used to extract this claim from the source document.


          Uses ExtractionMethod class to represent structured extraction method information.


          **Common Extraction Methods**:

          - `xpath_exact_match` - XPath pointed to exact element containing value

          - `xpath_fuzzy_match` - XPath with partial/substring match

          - `nlp_ner` - Named Entity Recognition extraction

          - `json_ld_parse` - Parsed from embedded JSON-LD structured data

          - `meta_tag` - Extracted from HTML meta tags

          - `manual` - Human-verified extraction

          '
        range: ExtractionMethod
        inlined: true
        required: false
        examples:
        - value:
            has_or_had_label: xpath_exact_match
          description: XPath extraction with exact match
        - value:
            has_or_had_label: nlp_ner
          description: NLP Named Entity Recognition extraction
    rules:
    - preconditions:
        slot_conditions:
          has_or_had_provenance_path:
            value_presence: ABSENT
      postconditions:
        description: Claims without XPath provenance must be removed as unverifiable
    comments:
    - WebClaim requires XPath provenance via has_or_had_provenance_path - claims without it are fabricated
    - XPath class contains expression, matched_text, and match_score in one structure
    - Archived HTML files are Playwright-rendered (NOT WARC format)
    - Use scripts/fetch_website_playwright.py to archive websites
    - Use scripts/add_xpath_provenance.py to add XPath to existing claims
    - "Follows 4-stage GLAM-NER pipeline: recognition \u2192 layout \u2192 resolution \u2192 linking"
    - "MIGRATED 2026-01-15: xpath/xpath_match_score/xpath_matched_text \u2192 has_or_had_provenance_path (XPath class)"
    - "MIGRATED 2026-01-18: claim_value \u2192 has_or_had_content (Content class) per Rule 53/56"
    - "MIGRATED 2026-01-18: claim_note \u2192 has_or_had_note (Note class) per Rule 53/56"
    - "MIGRATED 2026-01-19: claim_extraction_method \u2192 is_or_was_extracted_using (ExtractionMethod class) per Rule 53/56"
    - "MIGRATED 2026-01-19: claim_type \u2192 has_or_had_type (ClaimType/ClaimTypes classes) per Rule 53/56"
    see_also:
    - rules/WEB_OBSERVATION_PROVENANCE_RULES.md
    - scripts/fetch_website_playwright.py
    - scripts/add_xpath_provenance.py
    - docs/convention/schema/20251202/entity_annotation_rules_v1.6.0_unified.yaml
    examples:
    - value:
        has_or_had_type:
          has_or_had_label: full_name
        has_or_had_content:
          has_or_had_label: Historische Vereniging Nijeveen
        source_url: https://historischeverenigingnijeveen.nl/
        retrieved_on: '2025-11-29T12:28:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/h1[1]
          match_score: 1.0
          source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
        has_or_had_file_path:
          has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
      description: Exact match claim for organization name (claim_type migrated to has_or_had_type)
    - value:
        has_or_had_type:
          has_or_had_label: beeldbank_total_photos
        has_or_had_content:
          has_or_had_label: '6253'
        source_url: https://historischeverenigingnijeveen.nl/nl/hvn
        retrieved_on: '2025-11-29T12:28:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[1]
          match_score: 1.0
          source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
        has_or_had_file_path:
          has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
      description: Collection count claim from image bank statistics
    - value:
        has_or_had_type:
          has_or_had_label: facebook
        has_or_had_content:
          has_or_had_label: https://www.facebook.com/HistorischeVerenigingNijeveen/
        source_url: https://historischeverenigingnijeveen.nl/
        retrieved_on: '2025-11-29T12:28:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/footer[1]/div[1]/a[3]
          match_score: 1.0
          source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
        has_or_had_file_path:
          has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: entity_linking
      description: Social media link claim - entity linking stage
    - value:
        has_or_had_type:
          has_or_had_label: website
        has_or_had_content:
          has_or_had_label: https://www.historischeverenigingnijeveen.nl/
        source_url: https://historischeverenigingnijeveen.nl/nl/hvn
        retrieved_on: '2025-11-28T12:00:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[6]
          matched_text: De Historische Vereniging Nijeveen is ook te vinden op Facebook
          match_score: 0.561
          source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
        has_or_had_file_path:
          has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
      description: Substring match - URL found within longer text
    annotations:
      specificity_score: 0.1
      specificity_rationale: Generic utility class/slot created during migration
      custodian_types: "['*']"
      custodian_types_rationale: Universal utility concept