# glam/schemas/20251121/linkml/modules/classes/WebClaim.yaml

id: https://nde.nl/ontology/hc/class/WebClaim
name: WebClaim
title: WebClaim Class - Verifiable Web-Extracted Claims
prefixes:
  linkml: https://w3id.org/linkml/
  hc: https://nde.nl/ontology/hc/
  schema: http://schema.org/
  dcterms: http://purl.org/dc/terms/
  prov: http://www.w3.org/ns/prov#
  pav: http://purl.org/pav/
  xsd: http://www.w3.org/2001/XMLSchema#
  oa: http://www.w3.org/ns/oa#
  nif: http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#
  crm: http://www.cidoc-crm.org/cidoc-crm/
  skos: http://www.w3.org/2004/02/skos/core#
  rdfs: http://www.w3.org/2000/01/rdf-schema#
  org: http://www.w3.org/ns/org#
imports:
- linkml:types
- ./Claim
- ../slots/source_url
- ../slots/retrieved_on
- ../slots/has_or_had_provenance_path
- ../slots/has_or_had_file_path
- ./FilePath
- ../slots/has_or_had_identifier
- ./Identifier
- ../slots/has_or_had_type
- ./ClaimType
- ./ClaimTypes
- ../slots/has_or_had_content
- ./Content
- ../slots/temporal_extent
- ../slots/specificity_annotation
- ../slots/has_or_had_score
- ../slots/is_or_was_extracted_using
- ./ExtractionMethod
- ../slots/is_or_was_retrieved_through
- ./RetrievalEvent
- ../slots/pipeline_stage
- ../slots/has_or_had_note
- ./Note
- ../enums/ExtractionPipelineStageEnum
- ./SpecificityAnnotation
- ./TemplateSpecificityScore
- ./TemplateSpecificityType
- ./TemplateSpecificityTypes
- ./XPath
default_prefix: hc
classes:
  WebClaim:
    is_a: Claim
    class_uri: prov:Entity
    description: |
      A single verifiable claim extracted from a web page.

      **CORE PRINCIPLE: XPATH OR REMOVE**

      Every claim extracted from a webpage MUST have:
      1. `has_or_had_provenance_path` - XPath object pointing to the exact element in the archived HTML
      2. `has_or_had_file_path` - path to the archived HTML (Playwright-rendered, NOT WARC); formerly `html_file`

      The XPath object contains:
      - `expression` - the XPath string
      - `match_score` - quality of match (0.0-1.0)
      - `matched_text` - actual text found (for verification)

      Claims without these fields are FABRICATED and must be REMOVED.

      **ARCHIVE FORMAT: PLAYWRIGHT-RENDERED HTML**

      We use Playwright (headless browser) to:
      1. Navigate to the target URL
      2. Wait for JavaScript to fully render
      3. Save the complete DOM as an HTML file

      This differs from WARC archives, which capture raw HTTP responses.
      Playwright rendering captures the final DOM state, including:
      - JavaScript-rendered content
      - Dynamically loaded elements
      - Client-side state

      **WHY NOT CONFIDENCE SCORES?**

      Confidence scores like `0.95` are MEANINGLESS because:
      - There is NO methodology defining what these numbers mean
      - They cannot be verified or reproduced
      - They give a false impression of rigor
      - They mask the fact that claims may be fabricated

      Instead, we use VERIFIABLE provenance:
      - XPath points to the exact location
      - Archived HTML can be inspected
      - Match score is computed, not estimated

      **EXTRACTION PIPELINE (4 Stages)**

      Following the GLAM-NER Unified Entity Annotation Convention v1.7.0:

      1. **Entity Recognition** (Stage 1)
         - Detect named entities in text
         - Classify by hypernym type (AGT, GRP, TOP, TMP, etc.)
         - Methods: spaCy NER, transformer models, regex patterns

      2. **Layout Analysis** (Stage 2)
         - Analyze document structure (headers, paragraphs, tables)
         - Assign DOC hypernym types (DOC.HDR, DOC.PAR, DOC.TBL)
         - Generate XPath provenance for each claim location

      3. **Entity Resolution** (Stage 3)
         - Disambiguate entity mentions
         - Merge coreferences and name variants
         - Produce canonical entity clusters

      4. **Entity Linking** (Stage 4)
         - Link resolved entities to knowledge bases
         - Connect to Wikidata, ISIL, GeoNames, etc.
         - Assign link confidence scores

      **WORKFLOW**:

      1. Archive the website using Playwright:
         `python scripts/fetch_website_playwright.py <entry_number> <url>`

         This saves: web/{entry_number}/{domain}/rendered.html

      2. Add XPath provenance to claims:
         `python scripts/add_xpath_provenance.py`

      3. The script REMOVES claims that cannot be verified
         (stores them in `removed_unverified_claims` for audit)

      **EXAMPLES**:

      CORRECT (Verifiable):
      ```yaml
      - has_or_had_type:
          has_or_had_label: full_name
        has_or_had_content:
          has_or_had_label: Historische Vereniging Nijeveen
        source_url: https://historischeverenigingnijeveen.nl/
        retrieved_on: "2025-11-29T12:28:00Z"
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/h1[1]
          match_score: 1.0
        has_or_had_file_path:
          has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
      ```

      WRONG (Fabricated - Must Be Removed):
      ```yaml
      - has_or_had_type:
          has_or_had_label: full_name
        has_or_had_content:
          has_or_had_label: Historische Vereniging Nijeveen
        confidence: 0.95  # ← NO! This is meaningless without XPath
      ```

      **MIGRATION NOTE (2026-01-15)**:
      Consolidated xpath, xpath_match_score, xpath_matched_text
      into has_or_had_provenance_path with XPath class.

      **MIGRATION NOTE (2026-01-18)**:
      Migrated claim_value to has_or_had_content with Content class per Rule 53/56.
    exact_mappings:
    - prov:Entity
    close_mappings:
    - schema:PropertyValue
    - oa:Annotation
    slots:
    - is_or_was_extracted_using
    - has_or_had_identifier
    - has_or_had_note
    - has_or_had_type
    - has_or_had_content
    - is_or_was_retrieved_through
    - has_or_had_file_path
    - pipeline_stage
    - retrieved_on
    - source_url
    - specificity_annotation
    - has_or_had_score
    - has_or_had_provenance_path
    slot_usage:
      has_or_had_identifier:
        description: 'MIGRATED from claim_id per slot_fixes.yaml (Rule 53/56, 2026-01-18).

          Unique identifier for the web claim.

          Uses Identifier class for structured identifier representation.

          '
        range: Identifier
        inlined: true
        required: false
        examples:
        - value:
            identifier_scheme: web_claim_id
            identifier_value: claim-2025-11-29-001
          description: Web claim identifier
      has_or_had_type:
        description: 'MIGRATED from claim_type per slot_fixes.yaml (Rule 53/56, 2026-01-19).

          The type of claim being made (e.g., full_name, email, facebook).


          Uses ClaimType class hierarchy for structured type representation:

          - IdentityClaim: full_name, short_name, description, legal_name

          - ContactClaim: email, phone, address, website

          - SocialMediaClaim: facebook, twitter, instagram, linkedin, youtube

          - MediaClaim: logo_url, favicon_url, og_image_url

          - OperationalClaim: opening_hours, admission_info, accessibility_info

          - CollectionClaim: collection_count, beeldbank (image bank) statistics

          - OrganizationalClaim: founding_date, kvk_number, legal_form

          - DocumentClaim: annual_report_url, policy_document_url

          - GeographicClaim: street_address, postal_code, city, province

          - ArchivalClaim: archief_description, beeldbank_description

          '
        range: ClaimType
        inlined: true
        required: true
        examples:
        - value:
            has_or_had_label: full_name
          description: Identity claim for organization name
        - value:
            has_or_had_label: facebook
          description: Social media claim for Facebook URL
      has_or_had_note:
        description: 'MIGRATED from claim_note per slot_fixes.yaml (Rule 53/56, 2026-01-18).

          Notes about this specific claim extraction.


          Uses Note class with note_type, note_content, note_date fields.


          **Note Type Mapping**:

          - `note_type`: "claim" (default for WebClaim notes)

          - `note_content`: The actual note text

          - `note_date`: When the note was created


          **Use Cases**:

          - Document extraction issues

          - Note special circumstances

          - Record conflicts with other sources

          '
        range: Note
        inlined: true
        inlined_as_list: true
        multivalued: true
        required: false
        examples:
        - value:
            note_type: claim
            note_content: Additional verification required for this claim.
            note_date: '2026-01-18'
          description: Verification note for claim
        - value:
            note_type: extraction
            note_content: Biography truncated from longer text on page.
            note_date: '2025-11-29'
          description: Extraction processing note
      has_or_had_content:
        description: 'MIGRATED from claim_value per slot_fixes.yaml (Rule 53/56, 2026-01-18).

          The extracted value from the web source - the actual content claimed to exist

          at the XPath location.


          Uses Content class with has_or_had_label holding the raw extracted string.


          **Content Mapping**:

          - `has_or_had_label`: The raw extracted value (required)

          - `has_or_had_description`: Optional elaboration on the claim content


          **Examples of claim values**:

          - Organization names: "Historische Vereniging Nijeveen"

          - Statistics: "6253" (photo count)

          - URLs: "https://www.facebook.com/HistorischeVerenigingNijeveen/"

          '
        range: Content
        inlined: true
        required: true
        multivalued: false
        examples:
        - value:
            has_or_had_label: Historische Vereniging Nijeveen
          description: Organization name claim value
        - value:
            has_or_had_label: '6253'
          description: Numeric statistic claim value
        - value:
            has_or_had_label: https://www.facebook.com/HistorischeVerenigingNijeveen/
          description: URL claim value
      source_url:
        required: true
      retrieved_on:
        required: true
      has_or_had_provenance_path:
        required: true
        range: XPath
        inlined: true
        description: XPath provenance for this claim - pointing to exact element in archived HTML. Contains expression, matched_text, and match_score.
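        # Illustrative examples (a sketch, not original slot documentation). They show the
        # three XPath fields named in the description above. The substring-match values are
        # copied from the class-level WebClaim examples; the matched_text of the exact-match
        # entry is an assumption (an exact match is taken to return the claimed value itself).
        examples:
        - value:
            expression: /html[1]/body[1]/div[6]/div[1]/h1[1]
            matched_text: Historische Vereniging Nijeveen  # assumed, see comment above
            match_score: 1.0
          description: Exact match - element text equals the claimed value
        - value:
            expression: /html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[6]
            matched_text: De Historische Vereniging Nijeveen is ook te vinden op Facebook
            match_score: 0.561
          description: Substring match - claimed value found within longer element text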
      has_or_had_file_path:
        required: true
        range: FilePath
        inlined: true
        description: Path to the archived HTML file (Playwright-rendered) used for extraction. MIGRATED from html_file.
        examples:
        - value:
            has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
          description: Archived HTML file path
      is_or_was_retrieved_through:
        description: 'Retrieval event containing timestamp.

          MIGRATED from extraction_timestamp per Rule 53 (2026-01-26).

          '
        range: RetrievalEvent
        inlined: true
        required: false
      is_or_was_extracted_using:
        description: 'MIGRATED from claim_extraction_method per slot_fixes.yaml (Rule 53/56, 2026-01-19).

          Method used to extract this claim from the source document.


          Uses ExtractionMethod class to represent structured extraction method information.


          **Common Extraction Methods**:

          - `xpath_exact_match` - XPath pointed to exact element containing value

          - `xpath_fuzzy_match` - XPath with partial/substring match

          - `nlp_ner` - Named Entity Recognition extraction

          - `json_ld_parse` - Parsed from embedded JSON-LD structured data

          - `meta_tag` - Extracted from HTML meta tags

          - `manual` - Human-verified extraction

          '
        range: ExtractionMethod
        inlined: true
        required: false
        examples:
        - value:
            has_or_had_label: xpath_exact_match
          description: XPath extraction with exact match
        - value:
            has_or_had_label: nlp_ner
          description: NLP Named Entity Recognition extraction
    rules:
    - preconditions:
        slot_conditions:
          has_or_had_provenance_path:
            value_presence: ABSENT
      postconditions:
        description: Claims without XPath provenance must be removed as unverifiable
    comments:
    - WebClaim requires XPath provenance via has_or_had_provenance_path - claims without it are fabricated
    - XPath class contains expression, matched_text, and match_score in one structure
    - Archived HTML files are Playwright-rendered (NOT WARC format)
    - Use scripts/fetch_website_playwright.py to archive websites
    - Use scripts/add_xpath_provenance.py to add XPath to existing claims
    - "Follows 4-stage GLAM-NER pipeline: recognition \u2192 layout \u2192 resolution \u2192 linking"
    - "MIGRATED 2026-01-15: xpath/xpath_match_score/xpath_matched_text \u2192 has_or_had_provenance_path (XPath class)"
    - "MIGRATED 2026-01-18: claim_value \u2192 has_or_had_content (Content class) per Rule 53/56"
    - "MIGRATED 2026-01-18: claim_note \u2192 has_or_had_note (Note class) per Rule 53/56"
    - "MIGRATED 2026-01-19: claim_extraction_method \u2192 is_or_was_extracted_using (ExtractionMethod class) per Rule 53/56"
    - "MIGRATED 2026-01-19: claim_type \u2192 has_or_had_type (ClaimType/ClaimTypes classes) per Rule 53/56"
    see_also:
    - rules/WEB_OBSERVATION_PROVENANCE_RULES.md
    - scripts/fetch_website_playwright.py
    - scripts/add_xpath_provenance.py
    - docs/convention/schema/20251202/entity_annotation_rules_v1.6.0_unified.yaml
    examples:
    - value:
        has_or_had_type:
          has_or_had_label: full_name
        has_or_had_content:
          has_or_had_label: Historische Vereniging Nijeveen
        source_url: https://historischeverenigingnijeveen.nl/
        retrieved_on: '2025-11-29T12:28:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/h1[1]
          match_score: 1.0
          source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
        has_or_had_file_path:
          has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
      description: Exact match claim for organization name (claim_type migrated to has_or_had_type)
    - value:
        has_or_had_type:
          has_or_had_label: beeldbank_total_photos
        has_or_had_content:
          has_or_had_label: '6253'
        source_url: https://historischeverenigingnijeveen.nl/nl/hvn
        retrieved_on: '2025-11-29T12:28:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[1]
          match_score: 1.0
          source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
        has_or_had_file_path:
          has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
      description: Collection count claim from image bank statistics
    - value:
        has_or_had_type:
          has_or_had_label: facebook
        has_or_had_content:
          has_or_had_label: https://www.facebook.com/HistorischeVerenigingNijeveen/
        source_url: https://historischeverenigingnijeveen.nl/
        retrieved_on: '2025-11-29T12:28:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/footer[1]/div[1]/a[3]
          match_score: 1.0
          source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
        has_or_had_file_path:
          has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: entity_linking
      description: Social media link claim - entity linking stage
    - value:
        has_or_had_type:
          has_or_had_label: website
        has_or_had_content:
          has_or_had_label: https://www.historischeverenigingnijeveen.nl/
        source_url: https://historischeverenigingnijeveen.nl/nl/hvn
        retrieved_on: '2025-11-28T12:00:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[6]
          matched_text: De Historische Vereniging Nijeveen is ook te vinden op Facebook
          match_score: 0.561
          source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
        has_or_had_file_path:
          has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
      description: Substring match - URL found within longer text
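    # Illustrative sketch (hypothetical combination, not an original record): the exact-match
    # full_name claim extended with the optional slots described in slot_usage. All field
    # names and values are reused from the slot_usage examples; pairing them in one claim
    # is an assumption made only to show how the optional slots sit alongside the required ones.
    - value:
        has_or_had_identifier:
          identifier_scheme: web_claim_id
          identifier_value: claim-2025-11-29-001
        has_or_had_type:
          has_or_had_label: full_name
        has_or_had_content:
          has_or_had_label: Historische Vereniging Nijeveen
        source_url: https://historischeverenigingnijeveen.nl/
        retrieved_on: '2025-11-29T12:28:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/h1[1]
          match_score: 1.0
        has_or_had_file_path:
          has_or_had_label: web/0021/historischeverenigingnijeveen.nl/rendered.html
        is_or_was_extracted_using:
          has_or_had_label: xpath_exact_match
        # has_or_had_note is multivalued and inlined_as_list, so notes are given as a list
        has_or_had_note:
        - note_type: claim
          note_content: Additional verification required for this claim.
          note_date: '2026-01-18'
        pipeline_stage: layout_analysis
      description: Illustrative claim combining the optional identifier, extraction method, and note slots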
    annotations:
      specificity_score: 0.1
      specificity_rationale: Generic utility class/slot created during migration
      custodian_types: "['*']"
      custodian_types_rationale: Universal utility concept