# glam/schemas/20251121/linkml/modules/classes/WebClaim.yaml

id: https://nde.nl/ontology/hc/class/WebClaim
name: WebClaim
title: WebClaim Class - Verifiable Web-Extracted Claims
prefixes:
  linkml: https://w3id.org/linkml/
  hc: https://nde.nl/ontology/hc/
  schema: http://schema.org/
  dcterms: http://purl.org/dc/terms/
  prov: http://www.w3.org/ns/prov#
  pav: http://purl.org/pav/
  xsd: http://www.w3.org/2001/XMLSchema#
  oa: http://www.w3.org/ns/oa#
  nif: http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#
imports:
  - linkml:types
  - ./Claim  # Base class - added 2026-01-19 per Rule 53/56
  - ../slots/source_url
  - ../slots/retrieved_on
  - ../slots/has_or_had_provenance_path
  - ../slots/html_file
  # REMOVED 2026-01-18: ../slots/claim_id - migrated to has_or_had_identifier + Identifier (Rule 53)
  - ../slots/has_or_had_identifier
  - ./Identifier
  # REMOVED 2026-01-19: ../slots/claim_type - migrated to has_or_had_type + ClaimType (Rule 53)
  - ../slots/has_or_had_type
  - ./ClaimType
  - ./ClaimTypes
  # REMOVED 2026-01-18: ../slots/claim_value - migrated to has_or_had_content + Content (Rule 53)
  - ../slots/has_or_had_content
  - ./Content
  - ../slots/extraction_timestamp
  - ../slots/specificity_annotation
  - ../slots/has_or_had_score  # was: template_specificity - migrated per Rule 53 (2026-01-17)
  # REMOVED 2026-01-19: ../slots/claim_extraction_method - migrated to is_or_was_extracted_using + ExtractionMethod (Rule 53)
  - ../slots/is_or_was_extracted_using
  - ./ExtractionMethod
  - ../slots/pipeline_stage
  # REMOVED 2026-01-18: ../slots/claim_note - migrated to has_or_had_note + Note (Rule 53)
  - ../slots/has_or_had_note
  - ./Note
  # REMOVED 2026-01-19: ../enums/ClaimTypeEnum - migrated to ClaimType/ClaimTypes classes (Rule 53)
  - ../enums/ExtractionPipelineStageEnum
  - ./SpecificityAnnotation
  - ./TemplateSpecificityScore  # was: TemplateSpecificityScores - migrated per Rule 53 (2026-01-17)
  - ./TemplateSpecificityType
  - ./TemplateSpecificityTypes
  - ./XPath
default_prefix: hc
classes:
  WebClaim:
    is_a: Claim  # Inherits from base Claim class - added 2026-01-19 per Rule 53/56
    class_uri: prov:Entity
    description: "A single verifiable claim extracted from a web page.\n\n**CORE PRINCIPLE: XPATH OR REMOVE**\n\nEvery claim\
      \ extracted from a webpage MUST have:\n1. `has_or_had_provenance_path` - XPath object pointing to exact element in archived HTML\n2. `html_file` - path\
      \ to the archived HTML (Playwright-rendered, NOT WARC)\n\nThe XPath object contains:\n- `expression` - the XPath string\n- `match_score` - quality of match (0.0-1.0)\n- `matched_text` - actual text found (for verification)\n\nClaims without\
      \ these fields are FABRICATED and must be REMOVED.\n\n**ARCHIVE FORMAT: PLAYWRIGHT-RENDERED HTML**\n\nWe use Playwright\
      \ (headless browser) to:\n1. Navigate to the target URL\n2. Wait for JavaScript to fully render\n3. Save the complete\
      \ DOM as an HTML file\n\nThis differs from WARC archives which capture raw HTTP responses.\nPlaywright rendering captures\
      \ the final DOM state including:\n- JavaScript-rendered content\n- Dynamically loaded elements\n- Client-side state\n\
      \n**WHY NOT CONFIDENCE SCORES?**\n\nConfidence scores like `0.95` are MEANINGLESS because:\n- There is NO methodology\
      \ defining what these numbers mean\n- They cannot be verified or reproduced\n- They give a false impression of rigor\n\
      - They mask the fact that claims may be fabricated\n\nInstead, we use VERIFIABLE provenance:\n- XPath points to exact\
      \ location\n- Archived HTML can be inspected\n- Match score is computed, not estimated\n\n**EXTRACTION PIPELINE (4 Stages)**\n\
      \nFollowing the GLAM-NER Unified Entity Annotation Convention v1.7.0:\n\n1. **Entity Recognition** (Stage 1)\n   - Detect\
      \ named entities in text\n   - Classify by hypernym type (AGT, GRP, TOP, TMP, etc.)\n   - Methods: spaCy NER, transformer\
      \ models, regex patterns\n\n2. **Layout Analysis** (Stage 2)\n   - Analyze document structure (headers, paragraphs,\
      \ tables)\n   - Assign DOC hypernym types (DOC.HDR, DOC.PAR, DOC.TBL)\n   - Generate XPath provenance for each claim\
      \ location\n\n3. **Entity Resolution** (Stage 3)\n   - Disambiguate entity mentions\n   - Merge coreferences and name\
      \ variants\n   - Produce canonical entity clusters\n\n4. **Entity Linking** (Stage 4)\n   - Link resolved entities to\
      \ knowledge bases\n   - Connect to Wikidata, ISIL, GeoNames, etc.\n   - Assign link confidence scores\n\n**WORKFLOW**:\n\
      \n1. Archive website using Playwright:\n   `python scripts/fetch_website_playwright.py <entry_number> <url>`\n   \n\
      \   This saves: web/{entry_number}/{domain}/rendered.html\n\n2. Add XPath provenance to claims:\n   `python scripts/add_xpath_provenance.py`\n\
      \n3. Script REMOVES claims that cannot be verified\n   (stores in `removed_unverified_claims` for audit)\n\n**EXAMPLES**:\n\
      \nCORRECT (Verifiable):\n```yaml\n- has_or_had_type:\n    has_or_had_label: full_name\n  has_or_had_content:\n    has_or_had_label: Historische Vereniging Nijeveen\n  source_url:\
      \ https://historischeverenigingnijeveen.nl/\n  retrieved_on: \"2025-11-29T12:28:00Z\"\n  has_or_had_provenance_path:\n    expression: /html[1]/body[1]/div[6]/div[1]/h1[1]\n    match_score: 1.0\n\
      \  html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html\n  pipeline_stage: layout_analysis\n\
      ```\n\nWRONG (Fabricated - Must Be Removed):\n```yaml\n- has_or_had_type:\n    has_or_had_label: full_name\n  has_or_had_content:\n    has_or_had_label: Historische Vereniging\
      \ Nijeveen\n  confidence: 0.95  # ← NO! This is meaningless without XPath\n```\n\n**MIGRATION NOTE (2026-01-15)**:\nConsolidated xpath, xpath_match_score, xpath_matched_text\ninto has_or_had_provenance_path with XPath class.\n\n**MIGRATION NOTE (2026-01-18)**:\nMigrated claim_value to has_or_had_content with Content class per Rule 53/56.\n"
    exact_mappings:
    - prov:Entity
    close_mappings:
    - schema:PropertyValue
    - oa:Annotation
    slots:
    - is_or_was_extracted_using  # was: claim_extraction_method - migrated per Rule 53/56 (2026-01-19)
    # REMOVED 2026-01-18: claim_id - migrated to has_or_had_identifier + Identifier (Rule 53)
    - has_or_had_identifier
    # REMOVED 2026-01-18: claim_note - migrated to has_or_had_note + Note (Rule 53)
    - has_or_had_note
    # REMOVED 2026-01-19: claim_type - migrated to has_or_had_type + ClaimType (Rule 53)
    - has_or_had_type
    # REMOVED 2026-01-18: claim_value - migrated to has_or_had_content + Content (Rule 53)
    - has_or_had_content
    - extraction_timestamp
    - html_file
    - pipeline_stage
    - retrieved_on
    - source_url
    - specificity_annotation
    - has_or_had_score  # was: template_specificity - migrated per Rule 53 (2026-01-17)
    - has_or_had_provenance_path
    slot_usage:
      # MIGRATED 2026-01-18: claim_id → has_or_had_identifier + Identifier (Rule 53/56)
      has_or_had_identifier:
        description: |
          MIGRATED from claim_id per slot_fixes.yaml (Rule 53/56, 2026-01-18).
          Unique identifier for the web claim.
          Uses Identifier class for structured identifier representation.
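
          **Illustrative instance fragment** (a sketch only; the scheme and value
          are the sample values from this slot's examples, not a mandated format):

          ```yaml
          has_or_had_identifier:
            identifier_scheme: web_claim_id
            identifier_value: "claim-2025-11-29-001"
          ```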
        range: Identifier
        inlined: true
        required: false
        examples:
          - value:
              identifier_scheme: web_claim_id
              identifier_value: "claim-2025-11-29-001"
            description: Web claim identifier
      # MIGRATED 2026-01-19: claim_type → has_or_had_type + ClaimType (Rule 53/56)
      has_or_had_type:
        description: |
          MIGRATED from claim_type per slot_fixes.yaml (Rule 53/56, 2026-01-19).
          The type of claim being made (e.g., full_name, email, facebook).

          Uses ClaimType class hierarchy for structured type representation:
          - IdentityClaimType: full_name, short_name, description, legal_name
          - ContactClaimType: email, phone, address, website
          - SocialMediaClaimType: facebook, twitter, instagram, linkedin, youtube
          - MediaClaimType: logo_url, favicon_url, og_image_url
          - OperationalClaimType: opening_hours, admission_info, accessibility_info
          - CollectionClaimType: collection_count, beeldbank statistics
          - OrganizationalClaimType: founding_date, kvk_number, legal_form
          - DocumentClaimType: annual_report_url, policy_document_url
          - GeographicClaimType: street_address, postal_code, city, province
          - ArchivalClaimType: archief_description, beeldbank_description
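
          **Illustrative instance fragment** (a sketch only; it assumes ClaimType
          exposes `has_or_had_label`, as in the examples for this slot):

          ```yaml
          has_or_had_type:
            has_or_had_label: full_name  # listed above under IdentityClaimType
          ```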
        range: ClaimType
        inlined: true
        required: true
        examples:
          - value:
              has_or_had_label: full_name
            description: Identity claim for organization name
          - value:
              has_or_had_label: facebook
            description: Social media claim for Facebook URL
      # MIGRATED 2026-01-18: claim_note → has_or_had_note + Note (Rule 53/56)
      has_or_had_note:
        description: |
          MIGRATED from claim_note per slot_fixes.yaml (Rule 53/56, 2026-01-18).
          Notes about this specific claim extraction.

          Uses Note class with note_type, note_content, note_date fields.

          **Note Type Mapping**:
          - `note_type`: "claim" (default for WebClaim notes)
          - `note_content`: The actual note text
          - `note_date`: When the note was created

          **Use Cases**:
          - Document extraction issues
          - Note special circumstances
          - Record conflicts with other sources
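
          **Illustrative instance fragment** (a sketch only; because the slot is
          multivalued and `inlined_as_list`, notes are assumed to serialize as a
          YAML list of Note objects; the note text is taken from this slot's examples):

          ```yaml
          has_or_had_note:
            - note_type: claim
              note_content: "Additional verification required for this claim."
              note_date: "2026-01-18"
            - note_type: extraction
              note_content: "Biography truncated from longer text on page."
              note_date: "2025-11-29"
          ```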
        range: Note
        inlined: true
        inlined_as_list: true
        multivalued: true
        required: false
        examples:
          - value:
              note_type: claim
              note_content: "Additional verification required for this claim."
              note_date: "2026-01-18"
            description: Verification note for claim
          - value:
              note_type: extraction
              note_content: "Biography truncated from longer text on page."
              note_date: "2025-11-29"
            description: Extraction processing note
      # MIGRATED 2026-01-18: claim_value → has_or_had_content + Content (Rule 53/56)
      has_or_had_content:
        description: |
          MIGRATED from claim_value per slot_fixes.yaml (Rule 53/56, 2026-01-18).
          The extracted value from the web source - the actual content claimed to exist
          at the XPath location.

          Uses Content class with has_or_had_label holding the raw extracted string.

          **Content Mapping**:
          - `has_or_had_label`: The raw extracted value (required)
          - `has_or_had_description`: Optional elaboration on the claim content

          **Examples of claim values**:
          - Organization names: "Historische Vereniging Nijeveen"
          - Statistics: "6253" (photo count)
          - URLs: "https://www.facebook.com/HistorischeVerenigingNijeveen/"
        range: Content
        inlined: true
        required: true
        multivalued: false
        examples:
          - value:
              has_or_had_label: "Historische Vereniging Nijeveen"
            description: Organization name claim value
          - value:
              has_or_had_label: "6253"
            description: Numeric statistic claim value
          - value:
              has_or_had_label: "https://www.facebook.com/HistorischeVerenigingNijeveen/"
            description: URL claim value
      source_url:
        required: true
      retrieved_on:
        required: true
      has_or_had_provenance_path:
        required: true
        range: XPath
        inlined: true
        description: >-
          XPath provenance for this claim, pointing to the exact element in the archived HTML.
          Contains expression, matched_text, and match_score.
      html_file:
        required: true
      # MIGRATED 2026-01-19: claim_extraction_method → is_or_was_extracted_using + ExtractionMethod (Rule 53/56)
      is_or_was_extracted_using:
        description: |
          MIGRATED from claim_extraction_method per slot_fixes.yaml (Rule 53/56, 2026-01-19).
          Method used to extract this claim from the source document.

          Uses ExtractionMethod class to represent structured extraction method information.

          **Common Extraction Methods**:
          - `xpath_exact_match` - XPath pointed to exact element containing value
          - `xpath_fuzzy_match` - XPath with partial/substring match
          - `nlp_ner` - Named Entity Recognition extraction
          - `json_ld_parse` - Parsed from embedded JSON-LD structured data
          - `meta_tag` - Extracted from HTML meta tags
          - `manual` - Human-verified extraction
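
          **Illustrative instance fragment** (a sketch only; pairing the
          `xpath_fuzzy_match` label with a match score below 1.0 is an assumption,
          and the path and score are reused from the class-level examples):

          ```yaml
          is_or_was_extracted_using:
            has_or_had_label: xpath_fuzzy_match
          has_or_had_provenance_path:
            expression: /html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[6]
            match_score: 0.561
          ```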
        range: ExtractionMethod
        inlined: true
        required: false
        examples:
          - value:
              has_or_had_label: xpath_exact_match
            description: XPath extraction with exact match
          - value:
              has_or_had_label: nlp_ner
            description: NLP Named Entity Recognition extraction
    rules:
    - preconditions:
        slot_conditions:
          has_or_had_provenance_path:
            value_presence: ABSENT
      postconditions:
        description: Claims without XPath provenance must be removed as unverifiable
    comments:
    - WebClaim requires XPath provenance via has_or_had_provenance_path - claims without it are fabricated
    - XPath class contains expression, matched_text, and match_score in one structure
    - Archived HTML files are Playwright-rendered (NOT WARC format)
    - Use scripts/fetch_website_playwright.py to archive websites
    - Use scripts/add_xpath_provenance.py to add XPath to existing claims
    - 'Follows 4-stage GLAM-NER pipeline: recognition → layout → resolution → linking'
    - 'MIGRATED 2026-01-15: xpath/xpath_match_score/xpath_matched_text → has_or_had_provenance_path (XPath class)'
    - 'MIGRATED 2026-01-18: claim_value → has_or_had_content (Content class) per Rule 53/56'
    - 'MIGRATED 2026-01-18: claim_note → has_or_had_note (Note class) per Rule 53/56'
    - 'MIGRATED 2026-01-19: claim_extraction_method → is_or_was_extracted_using (ExtractionMethod class) per Rule 53/56'
    - 'MIGRATED 2026-01-19: claim_type → has_or_had_type (ClaimType/ClaimTypes classes) per Rule 53/56'
    see_also:
    - rules/WEB_OBSERVATION_PROVENANCE_RULES.md
    - scripts/fetch_website_playwright.py
    - scripts/add_xpath_provenance.py
    - docs/convention/schema/20251202/entity_annotation_rules_v1.6.0_unified.yaml
    examples:
    - value:
        has_or_had_type:
          has_or_had_label: full_name
        has_or_had_content:
          has_or_had_label: Historische Vereniging Nijeveen
        source_url: https://historischeverenigingnijeveen.nl/
        retrieved_on: '2025-11-29T12:28:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/h1[1]
          match_score: 1.0
          source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
        html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
      description: Exact match claim for organization name (claim_type migrated to has_or_had_type)
    - value:
        has_or_had_type:
          has_or_had_label: beeldbank_total_photos
        has_or_had_content:
          has_or_had_label: '6253'
        source_url: https://historischeverenigingnijeveen.nl/nl/hvn
        retrieved_on: '2025-11-29T12:28:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[1]
          match_score: 1.0
          source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
        html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
      description: Collection count claim from image bank statistics
    - value:
        has_or_had_type:
          has_or_had_label: facebook
        has_or_had_content:
          has_or_had_label: https://www.facebook.com/HistorischeVerenigingNijeveen/
        source_url: https://historischeverenigingnijeveen.nl/
        retrieved_on: '2025-11-29T12:28:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/footer[1]/div[1]/a[3]
          match_score: 1.0
          source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
        html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: entity_linking
      description: Social media link claim - entity linking stage
    - value:
        has_or_had_type:
          has_or_had_label: website
        has_or_had_content:
          has_or_had_label: https://www.historischeverenigingnijeveen.nl/
        source_url: https://historischeverenigingnijeveen.nl/nl/hvn
        retrieved_on: '2025-11-28T12:00:00Z'
        has_or_had_provenance_path:
          expression: /html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[6]
          matched_text: De Historische Vereniging Nijeveen is ook te vinden op Facebook
          match_score: 0.561
          source_document: web/0021/historischeverenigingnijeveen.nl/rendered.html
        html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
        pipeline_stage: layout_analysis
      description: Substring match - URL found within longer text