- Created 'updated_at.yaml' to record the last modified date and time of entities, including multilingual descriptions and structured aliases.
- Created 'written_in.yaml' to specify the language in which content is composed, covering both natural and programming languages, with detailed comments and close ontology mappings.
100 lines
4.3 KiB
YAML
id: https://nde.nl/ontology/hc/class/ExtractionMetadata
name: extraction_metadata_class
title: Extraction Metadata Class
version: 1.0.0
prefixes:
  linkml: https://w3id.org/linkml/
  hc: https://nde.nl/ontology/hc/
  schema: http://schema.org/
  prov: http://www.w3.org/ns/prov#
  dct: http://purl.org/dc/terms/
  xsd: http://www.w3.org/2001/XMLSchema#
imports:
- linkml:types
- ../enums/ProfileExtractionMethodEnum
- ../metadata
- ../slots/20260202_matang/has_expense
- ../slots/20260202_matang/identified_by
- ../slots/20260202_matang/has_method
- ../slots/20260202_matang/has_score
- ../slots/20260202_matang/has_source
- ../slots/20260202_matang/has_url
- ../slots/20260202_matang/retrieved_by
- ../slots/20260202_matang/new/llm_response
- ../slots/20260202_matang/new/retrieval_timestamp
# default_range: string
classes:
  ExtractionMetadata:
    class_uri: prov:Activity
    description: "Provenance metadata for data extraction activities.\n\nRecords how, when, and by what agent data was extracted from \nexternal sources (LinkedIn, web scraping, APIs).\n\n**PROV-O Alignment**:\n- ExtractionMetadata IS a prov:Activity (the extraction process)\n- The extracted data IS the prov:Entity (output of the activity)\n- retrieved_by IS the prov:Agent (software/AI that performed extraction)\n- has_source/has_url IS prov:used (input to the activity)\n\n**Use Cases**:\n- LinkedIn profile extractions via Exa API\n- Web scraping provenance\n- Staff list parsing provenance\n- Connection network extraction\n\n**Example JSON Structure**:\n```json\n{\n \"extraction_metadata\": {\n \"has_source\": \"/path/to/source.json\",\n \"identified_by\": \"org_staff_0001_name\",\n \"retrieval_timestamp\": \"2025-12-12T22:00:00Z\",\n \"has_method\": \"exa_crawling_exa\",\n \"retrieved_by\": \"claude-opus-4.5\",\n \"has_url\": \"https://www.linkedin.com/in/...\"\
      ,\n \"has_expense\": 0.001\n }\n}\n```\n"
    exact_mappings:
    - prov:Activity
    close_mappings:
    - schema:Action
    - dct:ProvenanceStatement
    slots:
    - has_expense
    - retrieved_by
    - retrieval_timestamp
    - has_method
    - has_url
    - llm_response
    - identified_by
    - has_source
    - has_score
    slot_usage:
      has_source:
        # range: string
        examples:
        - value: /data/custodian/person/affiliated/parsed/rijksmuseum_staff_20251210T155416Z.json
      identified_by:
        # range: string
        pattern: ^[a-z0-9-]+_staff_[a-z0-9-_]+$
        examples:
        - value: rijksmuseum_staff_0042_jan_van_der_berg
        - value: exa_12345678-abcd-efgh-ijkl-mnopqrstuv
      retrieval_timestamp:
        range: datetime
        required: true
        examples:
        - value: '2025-12-12T22:00:00Z'
      has_method:
        range: ProfileExtractionMethodEnum
        required: true
        examples:
        - value: exa_crawling_exa
      retrieved_by:
        # range: string
        examples:
        - value: claude-opus-4.5
        - value: ''
      has_url:
        range: uri
        pattern: ^https://www\.linkedin\.com/in/[a-z0-9-]+/?$
        examples:
        - value: https://www.linkedin.com/in/jan-van-der-berg-12345
      has_expense:
        range: float
        minimum_value: 0.0
        examples:
        - value: 0.001
        - value: 0.0
      llm_response:
        range: LLMResponse
        required: false
        inlined: true
        examples:
        - value: "{\n \"content\": \"Extracted institution data...\",\n \"reasoning_content\": \"Analyzing the input for LinkML schema conformity...\",\n \"thinking_mode\": \"preserved\",\n \"clear_thinking\": false,\n \"model\": \"glm-4.7\",\n \"provider\": \"zai\",\n \"created\": \"2025-12-23T10:30:00Z\",\n \"prompt_tokens\": 150,\n \"completion_tokens\": 450,\n \"total_tokens\": 600,\n \"finish_reason\": \"stop\",\n \"cost_usd\": 0.0\n}\n"
    comments:
    - Every person entity file MUST have extraction_metadata
    - See AGENTS.md Rule 20 for required fields
    - retrieved_by should be 'claude-opus-4.5' for manual extraction
    - has_expense enables budget tracking for API-heavy extractions
    see_also:
    - https://www.w3.org/TR/prov-o/
    - https://docs.exa.ai/
    annotations:
      specificity_score: 0.1
      specificity_rationale: Generic utility class/slot created during migration
      custodian_types: "['*']"
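
The constraints that `slot_usage` places on an `extraction_metadata` record (two required slots, two regex patterns, and a non-negative `has_expense`) can be checked by hand with a few lines of Python. This is only a minimal sketch, not the project's validator: `check_extraction_metadata` is a hypothetical helper name, the patterns are copied verbatim from the schema above, and `has_method` is checked for presence only because the members of `ProfileExtractionMethodEnum` are not shown here.

```python
import re
from datetime import datetime

# Patterns copied verbatim from slot_usage (identified_by, has_url).
IDENTIFIED_BY_PATTERN = re.compile(r"^[a-z0-9-]+_staff_[a-z0-9-_]+$")
HAS_URL_PATTERN = re.compile(r"^https://www\.linkedin\.com/in/[a-z0-9-]+/?$")

# Slots marked `required: true` in slot_usage.
REQUIRED_SLOTS = ("retrieval_timestamp", "has_method")


def check_extraction_metadata(record: dict) -> list:
    """Hypothetical helper: return a list of problems (empty if none were found)."""
    problems = []

    for slot in REQUIRED_SLOTS:
        if slot not in record:
            problems.append(f"missing required slot: {slot}")

    timestamp = record.get("retrieval_timestamp")
    if timestamp is not None:
        try:
            # Older Python versions reject a trailing "Z", so normalise it first.
            datetime.fromisoformat(str(timestamp).replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"retrieval_timestamp is not ISO 8601: {timestamp!r}")

    identifier = record.get("identified_by")
    if identifier is not None and not IDENTIFIED_BY_PATTERN.match(identifier):
        problems.append(f"identified_by does not match pattern: {identifier!r}")

    url = record.get("has_url")
    if url is not None and not HAS_URL_PATTERN.match(url):
        problems.append(f"has_url does not match pattern: {url!r}")

    expense = record.get("has_expense")
    if expense is not None and expense < 0.0:
        problems.append(f"has_expense is below minimum_value 0.0: {expense!r}")

    return problems


if __name__ == "__main__":
    # Values taken from the `examples` entries in slot_usage.
    record = {
        "has_source": "/data/custodian/person/affiliated/parsed/rijksmuseum_staff_20251210T155416Z.json",
        "identified_by": "rijksmuseum_staff_0042_jan_van_der_berg",
        "retrieval_timestamp": "2025-12-12T22:00:00Z",
        "has_method": "exa_crawling_exa",
        "retrieved_by": "claude-opus-4.5",
        "has_url": "https://www.linkedin.com/in/jan-van-der-berg-12345",
        "has_expense": 0.001,
    }
    for problem in check_extraction_metadata(record):
        print(problem)
```

Under the `identified_by` pattern as written, the first example value (`rijksmuseum_staff_0042_jan_van_der_berg`) matches, while the second (`exa_12345678-abcd-efgh-ijkl-mnopqrstuv`) does not, because it lacks the `_staff_` segment. For real validation, a LinkML-aware validator that also resolves the imported slots and the enum would be the more complete choice.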