Session 2026-01-19: Completed remaining migrations per Rules 53/56/60.

Major migrations:
1. claim_type → has_or_had_type + ClaimType/ClaimTypes (60+ concrete types in 11 categories)
2. circumstances_of_death → is_deceased + DeceasedStatus + CauseOfDeath
3. claims_count → has_or_had_quantity + Quantity (with based_on_claim for provenance)
4. classification_status → has_or_had_type + ClassificationStatusType

Created files:
- ClaimType.yaml, ClaimTypes.yaml (abstract base + 60+ concrete subclasses)
- DeceasedStatus.yaml, CauseOfDeath.yaml, CauseOfDeathTypeEnum.yaml
- ClassificationStatus.yaml, ClassificationStatusType.yaml, ClassificationStatusTypes.yaml
- CITESAppendix.yaml, City.yaml, CertaintyLevel.yaml
- is_deceased.yaml, is_or_was_caused_by.yaml, based_on_claim.yaml

Archived slots:
- claim_type, circumstances_of_death, claims_count, classification_status

Added Rule 60 to AGENTS.md: No Migration Deferral - agents MUST execute all migrations.

All 527 slot_fixes.yaml entries now complete (100%).
327 lines
17 KiB
YAML
id: https://nde.nl/ontology/hc/class/WebClaim
name: WebClaim
title: WebClaim Class - Verifiable Web-Extracted Claims
prefixes:
  linkml: https://w3id.org/linkml/
  hc: https://nde.nl/ontology/hc/
  schema: http://schema.org/
  dcterms: http://purl.org/dc/terms/
  prov: http://www.w3.org/ns/prov#
  pav: http://purl.org/pav/
  xsd: http://www.w3.org/2001/XMLSchema#
  oa: http://www.w3.org/ns/oa#
  nif: http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#
imports:
  - linkml:types
  - ./Claim  # Base class - added 2026-01-19 per Rule 53/56
  - ../slots/source_url
  - ../slots/retrieved_on
  - ../slots/has_or_had_provenance_path
  - ../slots/html_file
  # REMOVED 2026-01-18: ../slots/claim_id - migrated to has_or_had_identifier + Identifier (Rule 53)
  - ../slots/has_or_had_identifier
  - ./Identifier
  # REMOVED 2026-01-19: ../slots/claim_type - migrated to has_or_had_type + ClaimType (Rule 53)
  - ../slots/has_or_had_type
  - ./ClaimType
  - ./ClaimTypes
  # REMOVED 2026-01-18: ../slots/claim_value - migrated to has_or_had_content + Content (Rule 53)
  - ../slots/has_or_had_content
  - ./Content
  - ../slots/extraction_timestamp
  - ../slots/specificity_annotation
  - ../slots/has_or_had_score  # was: template_specificity - migrated per Rule 53 (2026-01-17)
  # REMOVED 2026-01-19: ../slots/claim_extraction_method - migrated to is_or_was_extracted_using + ExtractionMethod (Rule 53)
  - ../slots/is_or_was_extracted_using
  - ./ExtractionMethod
  - ../slots/pipeline_stage
  # REMOVED 2026-01-18: ../slots/claim_note - migrated to has_or_had_note + Note (Rule 53)
  - ../slots/has_or_had_note
  - ./Note
  # REMOVED 2026-01-19: ../enums/ClaimTypeEnum - migrated to ClaimType/ClaimTypes classes (Rule 53)
  - ../enums/ExtractionPipelineStageEnum
  - ./SpecificityAnnotation
  - ./TemplateSpecificityScore  # was: TemplateSpecificityScores - migrated per Rule 53 (2026-01-17)
  - ./TemplateSpecificityType
  - ./TemplateSpecificityTypes
  - ./XPath
default_prefix: hc
classes:
  WebClaim:
    is_a: Claim  # Inherits from base Claim class - added 2026-01-19 per Rule 53/56
    class_uri: prov:Entity
    description: |
      A single verifiable claim extracted from a web page.

      **CORE PRINCIPLE: XPATH OR REMOVE**

      Every claim extracted from a webpage MUST have:
      1. `has_or_had_provenance_path` - XPath object pointing to the exact element in the archived HTML
      2. `html_file` - path to the archived HTML (Playwright-rendered, NOT WARC)

      The XPath object contains:
      - `expression` - the XPath string
      - `match_score` - quality of match (0.0-1.0)
      - `matched_text` - actual text found (for verification)

      Claims without these fields are FABRICATED and must be REMOVED.

      **ARCHIVE FORMAT: PLAYWRIGHT-RENDERED HTML**

      We use Playwright (headless browser) to:
      1. Navigate to the target URL
      2. Wait for JavaScript to fully render
      3. Save the complete DOM as an HTML file

      This differs from WARC archives, which capture raw HTTP responses.
      Playwright rendering captures the final DOM state, including:
      - JavaScript-rendered content
      - Dynamically loaded elements
      - Client-side state

      **WHY NOT CONFIDENCE SCORES?**

      Confidence scores like `0.95` are MEANINGLESS because:
      - There is NO methodology defining what these numbers mean
      - They cannot be verified or reproduced
      - They give a false impression of rigor
      - They mask the fact that claims may be fabricated

      Instead, we use VERIFIABLE provenance:
      - XPath points to the exact location
      - Archived HTML can be inspected
      - Match score is computed, not estimated

      **EXTRACTION PIPELINE (4 Stages)**

      Following the GLAM-NER Unified Entity Annotation Convention v1.7.0:

      1. **Entity Recognition** (Stage 1)
         - Detect named entities in text
         - Classify by hypernym type (AGT, GRP, TOP, TMP, etc.)
         - Methods: spaCy NER, transformer models, regex patterns

      2. **Layout Analysis** (Stage 2)
         - Analyze document structure (headers, paragraphs, tables)
         - Assign DOC hypernym types (DOC.HDR, DOC.PAR, DOC.TBL)
         - Generate XPath provenance for each claim location

      3. **Entity Resolution** (Stage 3)
         - Disambiguate entity mentions
         - Merge coreferences and name variants
         - Produce canonical entity clusters

      4. **Entity Linking** (Stage 4)
         - Link resolved entities to knowledge bases
         - Connect to Wikidata, ISIL, GeoNames, etc.
         - Assign link confidence scores

      **WORKFLOW**:

      1. Archive the website using Playwright:
         `python scripts/fetch_website_playwright.py <entry_number> <url>`

         This saves: web/{entry_number}/{domain}/rendered.html

      2. Add XPath provenance to claims:
         `python scripts/add_xpath_provenance.py`

      3. The script REMOVES claims that cannot be verified
         (stores them in `removed_unverified_claims` for audit)

      **EXAMPLES**:

      CORRECT (Verifiable):
      ```yaml
      - has_or_had_type:
          has_or_had_label: full_name
        has_or_had_content:
          has_or_had_label: Historische Vereniging Nijeveen
        source_url: https://historischeverenigingnijeveen.nl/
        retrieved_on: "2025-11-29T12:28:00Z"
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/h1[1]
          match_score: 1.0
        html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
      ```

      WRONG (Fabricated - Must Be Removed):
      ```yaml
      - has_or_had_type:
          has_or_had_label: full_name
        has_or_had_content:
          has_or_had_label: Historische Vereniging Nijeveen
        confidence: 0.95  # ← NO! This is meaningless without XPath
      ```

      **MIGRATION NOTE (2026-01-15)**:
      Consolidated xpath, xpath_match_score, xpath_matched_text
      into has_or_had_provenance_path with XPath class.

      **MIGRATION NOTE (2026-01-18)**:
      Migrated claim_value to has_or_had_content with Content class per Rule 53/56.
    exact_mappings:
      - prov:Entity
    close_mappings:
      - schema:PropertyValue
      - oa:Annotation
    slots:
      - is_or_was_extracted_using  # was: claim_extraction_method - migrated per Rule 53/56 (2026-01-19)
      # REMOVED 2026-01-18: claim_id - migrated to has_or_had_identifier + Identifier (Rule 53)
      - has_or_had_identifier
      # REMOVED 2026-01-18: claim_note - migrated to has_or_had_note + Note (Rule 53)
      - has_or_had_note
      # REMOVED 2026-01-19: claim_type - migrated to has_or_had_type + ClaimType (Rule 53)
      - has_or_had_type
      # REMOVED 2026-01-18: claim_value - migrated to has_or_had_content + Content (Rule 53)
      - has_or_had_content
      - extraction_timestamp
      - html_file
      - pipeline_stage
      - retrieved_on
      - source_url
      - specificity_annotation
      - has_or_had_score  # was: template_specificity - migrated per Rule 53 (2026-01-17)
      - has_or_had_provenance_path
    slot_usage:
      # MIGRATED 2026-01-18: claim_id → has_or_had_identifier + Identifier (Rule 53/56)
      has_or_had_identifier:
        description: |
          MIGRATED from claim_id per slot_fixes.yaml (Rule 53/56, 2026-01-18).
          Unique identifier for the web claim.
          Uses Identifier class for structured identifier representation.
        range: Identifier
        inlined: true
        required: false
        examples:
          - value:
              identifier_scheme: web_claim_id
              identifier_value: "claim-2025-11-29-001"
            description: Web claim identifier
      # MIGRATED 2026-01-19: claim_type → has_or_had_type + ClaimType (Rule 53/56)
      has_or_had_type:
        description: |
          MIGRATED from claim_type per slot_fixes.yaml (Rule 53/56, 2026-01-19).
          The type of claim being made (e.g., full_name, email, facebook).

          Uses ClaimType class hierarchy for structured type representation:
          - IdentityClaimType: full_name, short_name, description, legal_name
          - ContactClaimType: email, phone, address, website
          - SocialMediaClaimType: facebook, twitter, instagram, linkedin, youtube
          - MediaClaimType: logo_url, favicon_url, og_image_url
          - OperationalClaimType: opening_hours, admission_info, accessibility_info
          - CollectionClaimType: collection_count, beeldbank statistics
          - OrganizationalClaimType: founding_date, kvk_number, legal_form
          - DocumentClaimType: annual_report_url, policy_document_url
          - GeographicClaimType: street_address, postal_code, city, province
          - ArchivalClaimType: archief_description, beeldbank_description
        range: ClaimType
        inlined: true
        required: true
        examples:
          - value:
              has_or_had_label: full_name
            description: Identity claim for organization name
          - value:
              has_or_had_label: facebook
            description: Social media claim for Facebook URL
      # MIGRATED 2026-01-18: claim_note → has_or_had_note + Note (Rule 53/56)
      has_or_had_note:
        description: |
          MIGRATED from claim_note per slot_fixes.yaml (Rule 53/56, 2026-01-18).
          Notes about this specific claim extraction.

          Uses Note class with note_type, note_content, note_date fields.

          **Note Type Mapping**:
          - `note_type`: "claim" (default for WebClaim notes)
          - `note_content`: The actual note text
          - `note_date`: When the note was created

          **Use Cases**:
          - Document extraction issues
          - Note special circumstances
          - Record conflicts with other sources
        range: Note
        inlined: true
        inlined_as_list: true
        multivalued: true
        required: false
        examples:
          - value:
              note_type: claim
              note_content: "Additional verification required for this claim."
              note_date: "2026-01-18"
            description: Verification note for claim
          - value:
              note_type: extraction
              note_content: "Biography truncated from longer text on page."
              note_date: "2025-11-29"
            description: Extraction processing note
      # MIGRATED 2026-01-18: claim_value → has_or_had_content + Content (Rule 53/56)
      has_or_had_content:
        description: |
          MIGRATED from claim_value per slot_fixes.yaml (Rule 53/56, 2026-01-18).
          The extracted value from the web source - the actual content claimed to exist
          at the XPath location.

          Uses Content class with has_or_had_label holding the raw extracted string.

          **Content Mapping**:
          - `has_or_had_label`: The raw extracted value (required)
          - `has_or_had_description`: Optional elaboration on the claim content

          **Examples of claim values**:
          - Organization names: "Historische Vereniging Nijeveen"
          - Statistics: "6253" (photo count)
          - URLs: "https://www.facebook.com/HistorischeVerenigingNijeveen/"
        range: Content
        inlined: true
        required: true
        multivalued: false
        examples:
          - value:
              has_or_had_label: "Historische Vereniging Nijeveen"
            description: Organization name claim value
          - value:
              has_or_had_label: "6253"
            description: Numeric statistic claim value
          - value:
              has_or_had_label: "https://www.facebook.com/HistorischeVerenigingNijeveen/"
            description: URL claim value
      source_url:
        required: true
      retrieved_on:
        required: true
      has_or_had_provenance_path:
        required: true
        range: XPath
        inlined: true
        description: >-
          XPath provenance for this claim - pointing to exact element in archived HTML.
          Contains expression, matched_text, and match_score.
      html_file:
        required: true
      # MIGRATED 2026-01-19: claim_extraction_method → is_or_was_extracted_using + ExtractionMethod (Rule 53/56)
      is_or_was_extracted_using:
        description: |
          MIGRATED from claim_extraction_method per slot_fixes.yaml (Rule 53/56, 2026-01-19).
          Method used to extract this claim from the source document.

          Uses ExtractionMethod class to represent structured extraction method information.

          **Common Extraction Methods**:
          - `xpath_exact_match` - XPath pointed to exact element containing value
          - `xpath_fuzzy_match` - XPath with partial/substring match
          - `nlp_ner` - Named Entity Recognition extraction
          - `json_ld_parse` - Parsed from embedded JSON-LD structured data
          - `meta_tag` - Extracted from HTML meta tags
          - `manual` - Human-verified extraction
        range: ExtractionMethod
        inlined: true
        required: false
        examples:
          - value:
              has_or_had_label: xpath_exact_match
            description: XPath extraction with exact match
          - value:
              has_or_had_label: nlp_ner
            description: NLP Named Entity Recognition extraction
    rules:
      - preconditions:
          slot_conditions:
            has_or_had_provenance_path:
              value_presence: ABSENT
        postconditions:
          description: Claims without XPath provenance must be removed as unverifiable
    comments:
      - WebClaim requires XPath provenance via has_or_had_provenance_path - claims without it are fabricated
      - XPath class contains expression, matched_text, and match_score in one structure
      - Archived HTML files are Playwright-rendered (NOT WARC format)
      - Use scripts/fetch_website_playwright.py to archive websites
      - Use scripts/add_xpath_provenance.py to add XPath to existing claims
      - 'Follows 4-stage GLAM-NER pipeline: recognition → layout → resolution → linking'
      - 'MIGRATED 2026-01-15: xpath/xpath_match_score/xpath_matched_text → has_or_had_provenance_path (XPath class)'
      - 'MIGRATED 2026-01-18: claim_value → has_or_had_content (Content class) per Rule 53/56'
      - 'MIGRATED 2026-01-18: claim_note → has_or_had_note (Note class) per Rule 53/56'
      - 'MIGRATED 2026-01-19: claim_extraction_method → is_or_was_extracted_using (ExtractionMethod class) per Rule 53/56'
      - 'MIGRATED 2026-01-19: claim_type → has_or_had_type (ClaimType/ClaimTypes classes) per Rule 53/56'
    see_also:
      - rules/WEB_OBSERVATION_PROVENANCE_RULES.md
      - scripts/fetch_website_playwright.py
      - scripts/add_xpath_provenance.py
      - docs/convention/schema/20251202/entity_annotation_rules_v1.6.0_unified.yaml
    examples:
      - value:
          has_or_had_type:
            has_or_had_label: full_name
          has_or_had_content:
            has_or_had_label: Historische Vereniging Nijeveen
          source_url: https://historischeverenigingnijeveen.nl/
          retrieved_on: '2025-11-29T12:28:00Z'
          has_or_had_provenance_path:
            expression: /html[1]/body[1]/div[6]/div[1]/h1[1]
            match_score: 1.0
            source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
          html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
          pipeline_stage: layout_analysis
        description: Exact match claim for organization name (claim_type migrated to has_or_had_type)
      - value:
          has_or_had_type:
            has_or_had_label: beeldbank_total_photos
          has_or_had_content:
            has_or_had_label: '6253'
          source_url: https://historischeverenigingnijeveen.nl/nl/hvn
          retrieved_on: '2025-11-29T12:28:00Z'
          has_or_had_provenance_path:
            expression: /html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[1]
            match_score: 1.0
            source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
          html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
          pipeline_stage: layout_analysis
        description: Collection count claim from image bank statistics
      - value:
          has_or_had_type:
            has_or_had_label: facebook
          has_or_had_content:
            has_or_had_label: https://www.facebook.com/HistorischeVerenigingNijeveen/
          source_url: https://historischeverenigingnijeveen.nl/
          retrieved_on: '2025-11-29T12:28:00Z'
          has_or_had_provenance_path:
            expression: /html[1]/body[1]/footer[1]/div[1]/a[3]
            match_score: 1.0
            source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
          html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
          pipeline_stage: entity_linking
        description: Social media link claim - entity linking stage
      - value:
          has_or_had_type:
            has_or_had_label: website
          has_or_had_content:
            has_or_had_label: https://www.historischeverenigingnijeveen.nl/
          source_url: https://historischeverenigingnijeveen.nl/nl/hvn
          retrieved_on: '2025-11-28T12:00:00Z'
          has_or_had_provenance_path:
            expression: /html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[6]
            matched_text: De Historische Vereniging Nijeveen is ook te vinden op Facebook
            match_score: 0.561
            source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
          html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
          pipeline_stage: layout_analysis
        description: Substring match - URL found within longer text
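
The "XPath or remove" rule in the schema above can be illustrated with a minimal Python sketch. This is not the code of `scripts/add_xpath_provenance.py`, which is not shown here. The inline HTML string, the helper names (`resolve_xpath`, `verify_claim`, `partition_claims`), and the use of `difflib.SequenceMatcher` as the match-score formula are all assumptions made for illustration; the schema only says the score is computed, not how. The sketch also assumes well-formed markup so that the standard library's `xml.etree` can parse it; real rendered pages would likely need a proper HTML parser.

```python
import difflib
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for an archived rendered.html file.
ARCHIVED_HTML = (
    "<html><body>"
    "<div><h1>Historische Vereniging Nijeveen</h1></div>"
    "</body></html>"
)


def resolve_xpath(root, expression):
    """Resolve an absolute, index-only XPath against an ElementTree root.

    ElementTree only accepts relative paths, so the leading root step
    (for example '/html[1]') is checked by hand and the remaining steps
    are evaluated relative to the root element.
    """
    steps = expression.strip("/").split("/")
    if root.tag != steps[0].split("[")[0]:
        return None
    if len(steps) == 1:
        return root
    return root.find("./" + "/".join(steps[1:]))


def verify_claim(claim, root):
    """Return a copy of the claim with matched_text and match_score, or None."""
    path = claim.get("has_or_had_provenance_path")
    if not path or "expression" not in path:
        return None  # no XPath provenance: the claim cannot be verified
    element = resolve_xpath(root, path["expression"])
    if element is None:
        return None  # the XPath does not resolve in the archived HTML
    matched_text = "".join(element.itertext()).strip()
    value = claim["has_or_had_content"]["has_or_had_label"]
    # Assumed scoring: 1.0 for an exact match, otherwise a similarity ratio.
    if matched_text == value:
        score = 1.0
    else:
        score = difflib.SequenceMatcher(None, value, matched_text).ratio()
    verified = dict(claim)
    verified["has_or_had_provenance_path"] = {
        **path,
        "matched_text": matched_text,
        "match_score": round(score, 3),
    }
    return verified


def partition_claims(claims, root):
    """Split claims into verified ones and ones set aside for audit."""
    kept, removed = [], []
    for claim in claims:
        verified = verify_claim(claim, root)
        if verified is None:
            removed.append(claim)
        else:
            kept.append(verified)
    return kept, removed


root = ET.fromstring(ARCHIVED_HTML)
claims = [
    {
        "has_or_had_type": {"has_or_had_label": "full_name"},
        "has_or_had_content": {"has_or_had_label": "Historische Vereniging Nijeveen"},
        "has_or_had_provenance_path": {"expression": "/html[1]/body[1]/div[1]/h1[1]"},
    },
    {
        # Carries only a bare confidence number, so it is set aside.
        "has_or_had_type": {"has_or_had_label": "full_name"},
        "has_or_had_content": {"has_or_had_label": "Historische Vereniging Nijeveen"},
        "confidence": 0.95,
    },
]
kept, removed_unverified_claims = partition_claims(claims, root)
print(len(kept), "kept;", len(removed_unverified_claims), "removed")
```

The second claim is dropped because it has no `has_or_had_provenance_path`, mirroring the `value_presence: ABSENT` rule; keeping it in a separate list corresponds to the `removed_unverified_claims` audit store mentioned in the class description.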