- Added new aliases for existing slots to improve clarity and usability, including:
  - has_deadline: has_embargo_end_date
  - has_extent: has_extent_text
  - has_fonds: has_fond
  - has_laboratory: conservation_lab
  - has_language: has_iso_code639_1, has_iso_code639_3
  - has_legal_basis: legal_basis
  - has_light_exposure: max_light_lux
  - has_measurement_unit: has_unit
  - has_note: has_custodian_observation
  - has_occupation: occupation
  - has_operating_hours: has_operating_hours
  - has_position: position
  - has_quantity: has_artwork_count, link_count
  - has_roadmap: review_date
  - has_skill: skill
  - has_speaker: speaker_label
  - has_specification: specification_url
  - has_statement: rights_statement_url, rights_statement
  - has_type: custodian_only
  - has_user_category: serves_visitors_only
  - hold_record_set: record_count
  - identified_by: has_index_number
  - in_period: has_period
  - in_place: has_place
  - in_series: has_series
  - measure: has_measurement
  - measured_on: measurement_date
  - organized_by: has_organizer
  - originate_from: has_origin
  - part_of: suborganization_of
  - published_on: has_publication_date
  - receive_investment: has_investment
  - related_to: connection_heritage_type
  - require: preservation_requirement
  - safeguarded_by: current_keeper, record_holder_note
  - state: states_or_stated
  - take_comission: takes_or_took_comission
  - take_place_at: takes_or_took_place_at
  - transmit_through: transmits_or_transmitted_through
  - warrant: warrants_or_warranted
- Introduced a new slot definition for evaluated_through to capture evaluation methodologies and review statuses.
100 lines
4.1 KiB
YAML
id: https://nde.nl/ontology/hc/class/ExtractionMetadata
name: extraction_metadata_class
title: Extraction Metadata Class
version: 1.0.0
prefixes:
  linkml: https://w3id.org/linkml/
  hc: https://nde.nl/ontology/hc/
  schema: http://schema.org/
  prov: http://www.w3.org/ns/prov#
  dct: http://purl.org/dc/terms/
  xsd: http://www.w3.org/2001/XMLSchema#
imports:
- linkml:types
- ../enums/ProfileExtractionMethodEnum
- ../metadata
- ../slots/has_expense
- ../slots/identified_by
- ../slots/has_method
- ../slots/has_score
- ../slots/has_source
- ../slots/has_url
- ../slots/retrieved_by
- ../slots/has_provenance
- ../slots/retrieved_at
# default_range: string
classes:
  ExtractionMetadata:
    class_uri: prov:Activity
    description: "Provenance metadata for data extraction activities.\n\nRecords how, when, and by what agent data was extracted from \nexternal sources (LinkedIn, web scraping, APIs).\n\n**PROV-O Alignment**:\n- ExtractionMetadata IS a prov:Activity (the extraction process)\n- The extracted data IS the prov:Entity (output of the activity)\n- retrieved_by IS the prov:Agent (software/AI that performed extraction)\n- has_source/has_url IS prov:used (input to the activity)\n\n**Use Cases**:\n- LinkedIn profile extractions via Exa API\n- Web scraping provenance\n- Staff list parsing provenance\n- Connection network extraction\n\n**Example JSON Structure**:\n```json\n{\n \"extraction_metadata\": {\n \"has_source\": \"/path/to/source.json\",\n \"identified_by\": \"org_staff_0001_name\",\n \"retrieval_timestamp\": \"2025-12-12T22:00:00Z\",\n \"has_method\": \"exa_crawling_exa\",\n \"retrieved_by\": \"claude-opus-4.5\",\n \"has_url\": \"https://www.linkedin.com/in/...\"\
      ,\n \"has_expense\": 0.001\n }\n}\n```\n"
    exact_mappings:
    - prov:Activity
    close_mappings:
    - schema:Action
    - dct:ProvenanceStatement
    slots:
    - has_expense
    - retrieved_by
    - retrieved_at
    - has_method
    - has_url
    - has_provenance
    - identified_by
    - has_source
    - has_score
    slot_usage:
      has_source:
        # range: string
        examples:
        - value: /data/custodian/person/affiliated/parsed/rijksmuseum_staff_20251210T155416Z.json
      identified_by:
        # range: string
        pattern: ^[a-z0-9-]+_staff_[a-z0-9-_]+$
        examples:
        - value: rijksmuseum_staff_0042_jan_van_der_berg
        - value: exa_12345678-abcd-efgh-ijkl-mnopqrstuv
      retrieved_at:
        range: datetime
        required: true
        examples:
        - value: '2025-12-12T22:00:00Z'
      has_method:
        range: ProfileExtractionMethodEnum
        required: true
        examples:
        - value: exa_crawling_exa
      retrieved_by:
        # range: string
        examples:
        - value: claude-opus-4.5
        - value: ''
      has_url:
        range: uri
        pattern: ^https://www\.linkedin\.com/in/[a-z0-9-]+/?$
        examples:
        - value: https://www.linkedin.com/in/jan-van-der-berg-12345
      has_expense:
        range: float
        minimum_value: 0.0
        examples:
        - value: 0.001
        - value: 0.0
      has_provenance:
        range: LLMResponse
        required: false
        inlined: true
        examples:
        - value: "{\n \"content\": \"Extracted institution data...\",\n \"reasoning_content\": \"Analyzing the input for LinkML schema conformity...\",\n \"thinking_mode\": \"preserved\",\n \"clear_thinking\": false,\n \"model\": \"glm-4.7\",\n \"provider\": \"zai\",\n \"created\": \"2025-12-23T10:30:00Z\",\n \"prompt_tokens\": 150,\n \"completion_tokens\": 450,\n \"total_tokens\": 600,\n \"finish_reason\": \"stop\",\n \"cost_usd\": 0.0\n}\n"
    comments:
    - Every person entity file MUST have extraction_metadata
    - See AGENTS.md Rule 20 for required fields
    - retrieved_by should be 'claude-opus-4.5' for manual extraction
    - has_expense enables budget tracking for API-heavy extractions
    see_also:
    - https://www.w3.org/TR/prov-o/
    - https://docs.exa.ai/
    annotations:
      specificity_score: 0.1
      specificity_rationale: Generic utility class/slot created during migration
      custodian_types: "['*']"
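The constraints declared under `slot_usage` (required slots, the two regex patterns, and the `minimum_value` on `has_expense`) can be spot-checked outside LinkML tooling. The following is a minimal sketch using only the Python standard library. The function name `check_extraction_metadata` and the wording of the problem messages are hypothetical and not part of the schema, and this is not a substitute for validating against the schema with LinkML itself.

```python
import re
from datetime import datetime

# Patterns copied verbatim from slot_usage in the schema.
IDENTIFIED_BY = re.compile(r"^[a-z0-9-]+_staff_[a-z0-9-_]+$")
HAS_URL = re.compile(r"^https://www\.linkedin\.com/in/[a-z0-9-]+/?$")


def check_extraction_metadata(record: dict) -> list[str]:
    """Hypothetical helper: return a list of problems (empty means OK)."""
    problems = []

    # retrieved_at and has_method are the slots marked `required: true`.
    for slot in ("retrieved_at", "has_method"):
        if slot not in record:
            problems.append(f"missing required slot: {slot}")

    # retrieved_at has range datetime; accept a trailing 'Z' for UTC.
    if "retrieved_at" in record:
        try:
            datetime.fromisoformat(str(record["retrieved_at"]).replace("Z", "+00:00"))
        except ValueError:
            problems.append("retrieved_at is not an ISO 8601 datetime")

    if "identified_by" in record and not IDENTIFIED_BY.match(record["identified_by"]):
        problems.append("identified_by does not match pattern")

    if "has_url" in record and not HAS_URL.match(record["has_url"]):
        problems.append("has_url does not match pattern")

    # has_expense declares minimum_value: 0.0.
    if "has_expense" in record:
        value = record["has_expense"]
        if not isinstance(value, (int, float)) or value < 0:
            problems.append("has_expense must be a number >= 0.0")

    return problems


example = {
    "identified_by": "rijksmuseum_staff_0042_jan_van_der_berg",
    "retrieved_at": "2025-12-12T22:00:00Z",
    "has_method": "exa_crawling_exa",
    "has_url": "https://www.linkedin.com/in/jan-van-der-berg-12345",
    "has_expense": 0.001,
}
print(check_extraction_metadata(example))
```

Run against the schema's own example values, this sketch accepts `rijksmuseum_staff_0042_jan_van_der_berg` but flags `exa_12345678-abcd-efgh-ijkl-mnopqrstuv`, because that value has no `_staff_` segment as the `identified_by` pattern requires.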