- Updated WorldCatIdentifier.yaml to remove unnecessary description and ensure consistent formatting.
- Enhanced WorldHeritageSite.yaml by breaking long description into multiple lines for better readability and removed unused attributes.
- Simplified WritingSystem.yaml by removing redundant attributes and ensuring consistent formatting.
- Cleaned up XPathScore.yaml by removing unnecessary attributes and ensuring consistent formatting.
- Improved YoutubeChannel.yaml by breaking long description into multiple lines for better readability.
- Enhanced YoutubeEnrichment.yaml by breaking long description into multiple lines for better readability.
- Updated YoutubeVideo.yaml to break long description into multiple lines and removed legacy field name.
- Refined has_or_had_affiliation.yaml by removing unnecessary comments and ensuring clarity.
- Cleaned up is_or_was_retrieved_at.yaml by removing unnecessary comments and ensuring clarity.
- Added rules for generic slots and avoiding rough edits in schema files to maintain structural integrity.
- Introduced changes_or_changed_through.yaml to define a new slot for linking entities to change events.
218 lines
11 KiB
YAML
id: https://nde.nl/ontology/hc/class/WebClaim
name: WebClaim
title: WebClaim Class - Verifiable Web-Extracted Claims
prefixes:
  linkml: https://w3id.org/linkml/
  hc: https://nde.nl/ontology/hc/
  schema: http://schema.org/
  dcterms: http://purl.org/dc/terms/
  prov: http://www.w3.org/ns/prov#
  pav: http://purl.org/pav/
  xsd: http://www.w3.org/2001/XMLSchema#
  oa: http://www.w3.org/ns/oa#
  nif: http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#
  crm: http://www.cidoc-crm.org/cidoc-crm/
  skos: http://www.w3.org/2004/02/skos/core#
  rdfs: http://www.w3.org/2000/01/rdf-schema#
  org: http://www.w3.org/ns/org#
imports:
- linkml:types
- ../enums/ExtractionPipelineStageEnum
- ../slots/has_or_had_content
- ../slots/has_or_had_file_path
- ../slots/has_or_had_identifier
- ../slots/has_or_had_note
- ../slots/has_or_had_provenance_path
- ../slots/has_or_had_score
- ../slots/has_or_had_type
- ../slots/is_or_was_extracted_using
- ../slots/is_or_was_retrieved_through
- ../slots/pipeline_stage
- ../slots/retrieved_on
- ../slots/source_url
- ../slots/specificity_annotation
- ../slots/temporal_extent
- ./Claim
- ./ClaimType
- ./ClaimTypes
- ./Content
- ./ExtractionMethod
- ./FilePath
- ./Identifier
- ./Note
- ./RetrievalEvent
- ./SpecificityAnnotation
- ./TemplateSpecificityScore
- ./TemplateSpecificityType
- ./TemplateSpecificityTypes
- ./XPath
default_prefix: hc
classes:
  WebClaim:
    is_a: Claim
    class_uri: prov:Entity
    description: |
      A single verifiable claim extracted from a web page.

      **CORE PRINCIPLE: XPATH OR REMOVE**

      Every claim extracted from a webpage MUST have:
      1. `has_or_had_provenance_path` - XPath object pointing to exact element in archived HTML
      2. `html_file` - path to the archived HTML (Playwright-rendered, NOT WARC)

      The XPath object contains:
      - `expression` - the XPath string
      - `match_score` - quality of match (0.0-1.0)
      - `matched_text` - actual text found (for verification)

      Claims without these fields are FABRICATED and must be REMOVED.

      **ARCHIVE FORMAT: PLAYWRIGHT-RENDERED HTML**

      We use Playwright (headless browser) to:
      1. Navigate to the target URL
      2. Wait for JavaScript to fully render
      3. Save the complete DOM as an HTML file

      This differs from WARC archives which capture raw HTTP responses.
      Playwright rendering captures the final DOM state including:
      - JavaScript-rendered content
      - Dynamically loaded elements
      - Client-side state

      **WHY NOT CONFIDENCE SCORES?**

      Confidence scores like `0.95` are MEANINGLESS because:
      - There is NO methodology defining what these numbers mean
      - They cannot be verified or reproduced
      - They give false impression of rigor
      - They mask the fact that claims may be fabricated

      Instead, we use VERIFIABLE provenance:
      - XPath points to exact location
      - Archived HTML can be inspected
      - Match score is computed, not estimated

      **EXTRACTION PIPELINE (4 Stages)**

      Following the GLAM-NER Unified Entity Annotation Convention v1.7.0:

      1. **Entity Recognition** (Stage 1)
         - Detect named entities in text
         - Classify by hypernym type (AGT, GRP, TOP, TMP, etc.)
         - Methods: spaCy NER, transformer models, regex patterns

      2. **Layout Analysis** (Stage 2)
         - Analyze document structure (headers, paragraphs, tables)
         - Assign DOC hypernym types (DOC.HDR, DOC.PAR, DOC.TBL)
         - Generate XPath provenance for each claim location

      3. **Entity Resolution** (Stage 3)
         - Disambiguate entity mentions
         - Merge coreferences and name variants
         - Produce canonical entity clusters

      4. **Entity Linking** (Stage 4)
         - Link resolved entities to knowledge bases
         - Connect to Wikidata, ISIL, GeoNames, etc.
         - Assign link confidence scores

      **WORKFLOW**:

      1. Archive website using Playwright:
         `python scripts/fetch_website_playwright.py <entry_number> <url>`

         This saves: web/{entry_number}/{domain}/rendered.html

      2. Add XPath provenance to claims:
         `python scripts/add_xpath_provenance.py`

      3. Script REMOVES claims that cannot be verified
         (stores in `removed_unverified_claims` for audit)

      **EXAMPLES**:

      CORRECT (Verifiable):
      ```yaml
      - claim_type: full_name
        has_or_had_content:
          has_or_had_label: Historische Vereniging Nijeveen
        source_url: https://historischeverenigingnijeveen.nl/
        retrieved_on: "2025-11-29T12:28:00Z"
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/h1[1]
          match_score: 1.0
        html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
      ```

      WRONG (Fabricated - Must Be Removed):
      ```yaml
      - claim_type: full_name
        has_or_had_content:
          has_or_had_label: Historische Vereniging Nijeveen
        confidence: 0.95  # ← NO! This is meaningless without XPath
      ```

      **MIGRATION NOTE (2026-01-15)**:
      Consolidated xpath, xpath_match_score, xpath_matched_text
      into has_or_had_provenance_path with XPath class.

      **MIGRATION NOTE (2026-01-18)**:
      Migrated claim_value to has_or_had_content with Content class per Rule 53/56.
    exact_mappings:
    - prov:Entity
    close_mappings:
    - schema:PropertyValue
    - oa:Annotation
    slots:
    - is_or_was_extracted_using
    - has_or_had_identifier
    - has_or_had_note
    - has_or_had_type
    - has_or_had_content
    - is_or_was_retrieved_through
    - has_or_had_file_path
    - pipeline_stage
    - retrieved_on
    - source_url
    - specificity_annotation
    - has_or_had_score
    - has_or_had_provenance_path
    slot_usage:
      has_or_had_identifier:
        range: uriorcurie
        inlined: true
        required: false
        examples:
        - value:
      has_or_had_type:
        range: ClaimType
        inlined: true
        required: true
        examples:
        - value:
            has_or_had_label: full_name
        - value:
            has_or_had_label: facebook
      has_or_had_note:
        # Range is the Note class (see the 2026-01-18 migration comment); the examples are Note objects, not strings.
        range: Note
        inlined: true
        inlined_as_list: true
        multivalued: true
        required: false
        examples:
        - value:
            note_type: claim
            note_content: Additional verification required for this claim.
            note_date: '2026-01-18'
        - value:
            note_type: extraction
            note_content: Biography truncated from longer text on page.
            note_date: '2025-11-29'
      has_or_had_content:
        # Range is the Content class (see the 2026-01-18 migration comment); the examples are Content objects, not strings.
        range: Content
        inlined: true
        required: true
        multivalued: false
        examples:
        - value:
            has_or_had_label: Historische Vereniging Nijeveen
        - value:
            has_or_had_label: '6253'
        - value:
            has_or_had_label: https://www.facebook.com/HistorischeVerenigingNijeveen/
      source_url:
        required: true
      retrieved_on:
        required: true
      has_or_had_provenance_path:
        required: true
        range: XPath
        inlined: true
      has_or_had_file_path:
        required: true
        range: FilePath
        inlined: true
        examples:
        - value:
            has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
      is_or_was_retrieved_through:
        range: RetrievalEvent
        inlined: true
        required: false
      is_or_was_extracted_using:
        range: ExtractionMethod
        inlined: true
        required: false
        examples:
        - value:
            has_or_had_label: xpath_exact_match
        - value:
            has_or_had_label: nlp_ner
    rules:
    - preconditions:
        slot_conditions:
          has_or_had_provenance_path:
            value_presence: ABSENT
    comments:
    - WebClaim requires XPath provenance via has_or_had_provenance_path - claims without it are fabricated
    - XPath class contains expression, matched_text, and match_score in one structure
    - Archived HTML files are Playwright-rendered (NOT WARC format)
    - Use scripts/fetch_website_playwright.py to archive websites
    - Use scripts/add_xpath_provenance.py to add XPath to existing claims
    - "Follows 4-stage GLAM-NER pipeline: recognition \u2192 layout \u2192 resolution \u2192 linking"
    - "MIGRATED 2026-01-15: xpath/xpath_match_score/xpath_matched_text \u2192 has_or_had_provenance_path (XPath class)"
    - "MIGRATED 2026-01-18: claim_value \u2192 has_or_had_content (Content class) per Rule 53/56"
    - "MIGRATED 2026-01-18: claim_note \u2192 has_or_had_note (Note class) per Rule 53/56"
    - "MIGRATED 2026-01-19: claim_extraction_method \u2192 is_or_was_extracted_using (ExtractionMethod class) per Rule 53/56"
    - "MIGRATED 2026-01-19: claim_type \u2192 has_or_had_type (ClaimType/ClaimTypes classes) per Rule 53/56"
    see_also:
    - rules/WEB_OBSERVATION_PROVENANCE_RULES.md
    - scripts/fetch_website_playwright.py
    - scripts/add_xpath_provenance.py
    - docs/convention/schema/20251202/entity_annotation_rules_v1.6.0_unified.yaml
    examples:
    - value:
        has_or_had_type:
          has_or_had_label: full_name
        has_or_had_content:
          has_or_had_label: Historische Vereniging Nijeveen
        source_url: https://historischeverenigingnijeveen.nl/
        retrieved_on: '2025-11-29T12:28:00Z'
        has_or_had_provenance_path:
        has_or_had_file_path:
          has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
    - value:
        has_or_had_type:
          has_or_had_label: beeldbank_total_photos
        has_or_had_content:
          has_or_had_label: '6253'
        source_url: https://historischeverenigingnijeveen.nl/nl/hvn
        retrieved_on: '2025-11-29T12:28:00Z'
        has_or_had_provenance_path:
        has_or_had_file_path:
          has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
    - value:
        has_or_had_type:
          has_or_had_label: facebook
        has_or_had_content:
          has_or_had_label: https://www.facebook.com/HistorischeVerenigingNijeveen/
        source_url: https://historischeverenigingnijeveen.nl/
        retrieved_on: '2025-11-29T12:28:00Z'
        has_or_had_provenance_path:
        has_or_had_file_path:
          has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: entity_linking
    - value:
        has_or_had_type:
          has_or_had_label: website
        has_or_had_content:
          has_or_had_label: https://www.historischeverenigingnijeveen.nl/
        source_url: https://historischeverenigingnijeveen.nl/nl/hvn
        retrieved_on: '2025-11-28T12:00:00Z'
        has_or_had_provenance_path:
        has_or_had_file_path:
          has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
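    # Sketch only, not part of the schema: the examples above leave
    # has_or_had_provenance_path empty. Assuming the XPath class exposes the
    # fields named in the class description (expression, match_score,
    # matched_text), a populated value for the first example might look like
    # the fragment below. The expression and match_score are copied from the
    # CORRECT example in the description; matched_text is an assumption.
    #
    #   has_or_had_provenance_path:
    #     expression: /html[1]/body[1]/div[6]/div[1]/h1[1]
    #     match_score: 1.0
    #     matched_text: Historische Vereniging Nijeveen  # assumed equal to the claim content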
    annotations:
      specificity_score: 0.1
      specificity_rationale: Generic utility class/slot created during migration
      custodian_types: "['*']"