glam/schemas/20251121/linkml/modules/classes/LLMResponse.yaml

id: https://nde.nl/ontology/hc/class/LLMResponse
name: llm_response_class
title: LLM Response Class
version: 1.0.0
prefixes:
  linkml: https://w3id.org/linkml/
  hc: https://nde.nl/ontology/hc/
  schema: http://schema.org/
  prov: http://www.w3.org/ns/prov#
  dct: http://purl.org/dc/terms/
  xsd: http://www.w3.org/2001/XMLSchema#
imports:
  - linkml:types
  - ../metadata
  - ./SpecificityAnnotation
  - ./TemplateSpecificityScores
  - ../enums/LLMProviderEnum
  - ../enums/FinishReasonEnum
  - ../enums/ThinkingModeEnum
  - ../slots/content
  - ../slots/reasoning_content
  - ../slots/model
  - ../slots/provider
  - ../slots/prompt_token
  - ../slots/completion_token
  - ../slots/total_token
  - ../slots/cached_token
  - ../slots/finish_reason
  - ../slots/latency_ms
  - ../slots/thinking_mode
  - ../slots/clear_thinking
  - ../slots/created
  - ../slots/cost_usd
  - ../slots/request_id
  - ../slots/specificity_annotation
  - ../slots/template_specificity
default_range: string

classes:
  LLMResponse:
    class_uri: prov:Activity
    description: |
      Provenance metadata for LLM API responses, including GLM 4.7 Thinking Modes.

      Captures complete response metadata from LLM providers (ZhipuAI GLM, Anthropic,
      OpenAI, etc.) for traceability and analysis. The key innovation is capturing
      `reasoning_content` - the chain-of-thought reasoning that GLM 4.7 exposes
      through its three thinking modes.

      **GLM 4.7 Thinking Modes** (https://docs.z.ai/guides/capabilities/thinking-mode):

      1. **Interleaved Thinking** (default, since GLM-4.5):
         - Model thinks between tool calls and after receiving tool results
         - Enables complex, step-by-step reasoning with tool chaining
         - Returns `reasoning_content` alongside `content` in every response

      2. **Preserved Thinking** (new in GLM-4.7):
         - Retains reasoning_content from previous assistant turns in context
         - Preserves reasoning continuity across multi-turn conversations
         - Improves model performance and increases cache hit rates
         - **Enabled by default on Coding Plan endpoint**
         - Requires returning EXACT, UNMODIFIED reasoning_content back to API
         - Set via: `"clear_thinking": false` (do NOT clear previous reasoning)

      3. **Turn-level Thinking** (new in GLM-4.7):
         - Control reasoning computation on a per-turn basis
         - Enable/disable thinking independently for each request in a session
         - Useful for balancing speed (simple queries) vs accuracy (complex tasks)
         - Set via: `"thinking": {"type": "enabled"}` or `"thinking": {"type": "disabled"}`

      **Critical Implementation Note for Preserved Thinking**:
      When using Preserved Thinking with tool calls, thinking blocks MUST be:
      1. Explicitly preserved in the messages array
      2. Returned together with tool results
      3. Kept in EXACT original sequence (no reordering/editing)

      **PROV-O Alignment**:
      - LLMResponse IS a prov:Activity (the inference process)
      - content IS prov:Entity (the generated output)
      - model/provider IS prov:Agent (the AI system)
      - reasoning_content documents the prov:Plan (how the agent reasoned)
      - prompt (input) IS prov:used (input to the activity)

      **Use Cases**:
      - DSPy RAG responses with reasoning traces
      - Heritage institution extraction provenance
      - LinkML schema conformity validation
      - Ontology mapping decision logs
      - Multi-turn agent conversations with preserved context
    exact_mappings:
      - prov:Activity
    close_mappings:
      - schema:Action
      - schema:CreativeWork
    slots:
      - cached_token
      - clear_thinking
      - completion_token
      - content
      - cost_usd
      - created
      - finish_reason
      - latency_ms
      - model
      - prompt_token
      - provider
      - reasoning_content
      - request_id
      - specificity_annotation
      - template_specificity
      - thinking_mode
      - total_token
    slot_usage:
      content:
        description: |
          The final LLM response text (message.content from API response).
          PROV-O: prov:generated - the entity produced by this activity.

          This is the primary output shown to users and used for downstream processing.
        slot_uri: prov:generated
        range: string
        required: true
        examples:
          - value: The Rijksmuseum is a national museum in Amsterdam dedicated to Dutch
              arts and history.
            description: Extracted heritage institution description
      reasoning_content:
        description: |
          Interleaved Thinking - the model's chain-of-thought reasoning.
          PROV-O: prov:hadPlan - documents HOW the agent reasoned.

          **GLM 4.7 Interleaved Thinking**:
          GLM 4.7 returns `reasoning_content` in every response, exposing the
          model's step-by-step reasoning process. This enables:

          1. **Schema Validation**: Model reasons about LinkML constraints before generating output
          2. **Ontology Mapping**: Explicit reasoning about CIDOC-CRM, CPOV, TOOI class mappings
          3. **RDF Quality**: Chain-of-thought validates triple construction
          4. **Transparency**: Full audit trail of extraction decisions

          May be null for providers that don't expose reasoning (Claude, GPT-4).
        slot_uri: prov:hadPlan
        range: string
        required: false
        examples:
          - value: 'The user is asking about Dutch heritage institutions. I need to
              identify: 1) Institution name: Rijksmuseum, 2) Type: Museum (maps to InstitutionTypeEnum.MUSEUM),
              3) Location: Amsterdam (city in Noord-Holland province)...'
            description: GLM 4.7 interleaved thinking showing explicit schema reasoning
      model:
        description: |
          The LLM model identifier from the API response.
          PROV-O: Part of prov:wasAssociatedWith - identifies the specific model version.

          Common values:
          - glm-4.7: ZhipuAI GLM 4.7 (with Interleaved Thinking)
          - glm-4.6: ZhipuAI GLM 4.6
          - claude-3-opus-20240229: Anthropic Claude Opus
          - gpt-4-turbo: OpenAI GPT-4 Turbo
        slot_uri: schema:softwareVersion
        range: string
        required: true
        examples:
          - value: glm-4.7
            description: ZhipuAI GLM 4.7 with Interleaved Thinking
      provider:
        description: |
          The LLM provider/platform.
          PROV-O: prov:wasAssociatedWith - the agent (organization) providing the model.

          Used by DSPy to route requests and track provider-specific behavior.
        slot_uri: prov:wasAssociatedWith
        range: LLMProviderEnum
        required: true
        examples:
          - value: zai
            description: ZhipuAI (Z.AI) - GLM models
      request_id:
        description: |
          Unique request ID from the LLM provider API (for tracing/debugging).
          Enables correlation with provider logs for troubleshooting.
        slot_uri: dct:identifier
        range: string
        required: false
        examples:
          - value: req_8f3a2b1c4d5e6f7g
            description: Provider-assigned request identifier
      created:
        description: |
          Timestamp when the LLM response was generated (from API response).
          PROV-O: prov:endedAtTime - when the inference activity completed.
        slot_uri: prov:endedAtTime
        range: datetime
        required: true
        examples:
          - value: '2025-12-23T10:30:00Z'
            description: UTC timestamp of response generation
      prompt_token:
        description: |
          Number of tokens in the input prompt.
          From API response: usage.prompt_tokens
        slot_uri: schema:value
        range: integer
        minimum_value: 0
        examples:
          - value: 150
            description: 150 tokens in the input prompt
      completion_token:
        description: |
          Number of tokens in the model's response (content + reasoning_content).
          From API response: usage.completion_tokens

          Note: For GLM 4.7, this includes tokens from both content and reasoning_content.
        slot_uri: schema:value
        range: integer
        minimum_value: 0
        examples:
          - value: 450
            description: 450 tokens in the completion (content + reasoning)
      total_token:
        description: |
          Total tokens used (prompt + completion).
          From API response: usage.total_tokens
        slot_uri: schema:value
        range: integer
        minimum_value: 0
        examples:
          - value: 600
            description: 600 total tokens (150 prompt + 450 completion)
      cached_token:
        description: |
          Number of prompt tokens served from cache (if provider supports caching).
          From API response: usage.prompt_tokens_details.cached_tokens

          Cached tokens typically have reduced cost and latency.
        slot_uri: schema:value
        range: integer
        minimum_value: 0
        required: false
        examples:
          - value: 50
            description: 50 tokens served from provider's prompt cache
      finish_reason:
        description: |
          Why the model stopped generating (from API response).

          Common values:
          - stop: Natural completion (hit stop token)
          - length: Hit max_tokens limit
          - tool_calls: Model invoked a tool (function calling)
          - content_filter: Response filtered for safety
        slot_uri: schema:status
        range: FinishReasonEnum
        required: false
        examples:
          - value: stop
            description: Model completed naturally
      latency_ms:
        description: |
          Response latency in milliseconds (time from request to response).
          Measured client-side (includes network time).
        slot_uri: schema:duration
        range: integer
        minimum_value: 0
        required: false
        examples:
          - value: 1250
            description: 1.25 seconds total response time
      cost_usd:
        description: |
          Estimated cost in USD for this LLM call.

          For Z.AI Coding Plan: $0.00 (free tier for GLM models)
          For other providers: calculated from token counts and pricing
        slot_uri: schema:price
        range: float
        minimum_value: 0.0
        required: false
        examples:
          - value: 0.0
            description: Free (Z.AI Coding Plan)
          - value: 0.015
            description: OpenAI GPT-4 Turbo cost estimate
      thinking_mode:
        description: |
          The GLM 4.7 thinking mode used for this request.

          **Available Modes**:
          - **enabled**: Thinking enabled (default) - model reasons before responding
          - **disabled**: Thinking disabled - faster responses, no reasoning_content
          - **interleaved**: Interleaved thinking - think between tool calls (default behavior)
          - **preserved**: Preserved thinking - retain reasoning across turns (Coding Plan default)
        slot_uri: schema:actionOption
        range: ThinkingModeEnum
        required: false
        examples:
          - value: preserved
            description: Preserved thinking for multi-turn agent conversations
          - value: interleaved
            description: Default interleaved thinking between tool calls
          - value: disabled
            description: Disabled for fast, simple queries
      clear_thinking:
        description: |
          Whether to clear previous reasoning_content from context.

          **Preserved Thinking Control**:
          - **false**: Preserved Thinking enabled (keep reasoning, better cache hits)
          - **true**: Clear previous reasoning (default for standard API)

          **Z.AI Coding Plan**: Default is `false` (Preserved Thinking enabled)

          **Critical Implementation Note**:
          When clear_thinking is false, you MUST return the EXACT, UNMODIFIED
          reasoning_content back to the API in subsequent turns.
        slot_uri: schema:Boolean
        range: boolean
        required: false
        examples:
          - value: false
            description: Keep reasoning for Preserved Thinking (recommended)
          - value: true
            description: Clear previous reasoning (fresh context each turn)
      specificity_annotation:
        range: SpecificityAnnotation
        inlined: true
      template_specificity:
        range: TemplateSpecificityScores
        inlined: true
    comments:
      - reasoning_content is the key field for Interleaved Thinking (GLM 4.7)
      - Store reasoning_content for debugging, auditing, and DSPy optimization
      - 'Z.AI Coding Plan endpoint: https://api.z.ai/api/coding/paas/v4/chat/completions'
      - 'For DSPy: use LLMResponse to track all LLM calls in the pipeline'
      - See AGENTS.md Rule 11 for Z.AI API configuration
    see_also:
      - https://www.w3.org/TR/prov-o/
      - https://api.z.ai/docs
      - https://dspy-docs.vercel.app/