id: https://nde.nl/ontology/hc/class/ExtractionMetadata
name: extraction_metadata_class
title: Extraction Metadata Class
version: 1.0.0
prefixes:
  linkml: https://w3id.org/linkml/
  hc: https://nde.nl/ontology/hc/
  schema: http://schema.org/
  prov: http://www.w3.org/ns/prov#
  dct: http://purl.org/dc/terms/
  xsd: http://www.w3.org/2001/XMLSchema#
imports:
  - linkml:types
  - ../metadata
  - ./LLMResponse
  - ./SpecificityAnnotation
  - ./TemplateSpecificityScore
  - ./TemplateSpecificityType
  - ./TemplateSpecificityTypes
  - ../enums/ProfileExtractionMethodEnum
  - ../slots/is_or_was_retrieved_by
  - ../slots/has_or_had_method
  - ../slots/has_or_had_expense
  - ../slots/has_or_had_source
  - ../slots/has_or_had_identifier
  - ../slots/retrieval_timestamp
  - ../slots/has_or_had_url
  - ../slots/llm_response
  - ../slots/specificity_annotation
  - ../slots/has_or_had_score
default_range: string
classes:
  ExtractionMetadata:
    class_uri: prov:Activity
    description: |
      Provenance metadata for data extraction activities.

      Records how, when, and by what agent data was extracted from
      external sources (LinkedIn, web scraping, APIs).

      **PROV-O Alignment**:
      - ExtractionMetadata IS a prov:Activity (the extraction process)
      - The extracted data IS the prov:Entity (output of the activity)
      - is_or_was_retrieved_by IS the prov:Agent (software/AI that performed extraction)
      - has_or_had_source and has_or_had_url ARE prov:used (inputs to the activity)

      **Use Cases**:
      - LinkedIn profile extractions via Exa API
      - Web scraping provenance
      - Staff list parsing provenance
      - Connection network extraction

      **Example JSON Structure**:
      ```json
      {
        "extraction_metadata": {
          "has_or_had_source": "/path/to/source.json",
          "has_or_had_identifier": "org_staff_0001_name",
          "retrieval_timestamp": "2025-12-12T22:00:00Z",
          "has_or_had_method": "exa_crawling_exa",
          "is_or_was_retrieved_by": "claude-opus-4.5",
          "has_or_had_url": "https://www.linkedin.com/in/...",
          "has_or_had_expense": 0.001
        }
      }
      ```
    exact_mappings:
      - prov:Activity
    close_mappings:
      - schema:Action
      - dct:ProvenanceStatement
    slots:
      - has_or_had_expense
      - is_or_was_retrieved_by
      - retrieval_timestamp
      - has_or_had_method
      - has_or_had_url
      - llm_response
      - has_or_had_identifier
      - has_or_had_source
      - specificity_annotation
      - has_or_had_score
    slot_usage:
      has_or_had_source:
        range: string
        examples:
          - value: /data/custodian/person/affiliated/parsed/rijksmuseum_staff_20251210T155416Z.json
            description: Path to parsed staff list JSON
      has_or_had_identifier:
        range: string
        pattern: ^[a-z0-9-]+_staff_[a-z0-9_-]+$
        examples:
          - value: rijksmuseum_staff_0042_jan_van_der_berg
            description: Staff ID with org prefix, index, and name slug
          - value: exa_12345678-abcd-efgh-ijkl-mnopqrstuv
            description: Exa API request ID
      retrieval_timestamp:
        range: datetime
        required: true
        examples:
          - value: '2025-12-12T22:00:00Z'
            description: UTC timestamp of extraction
      has_or_had_method:
        range: ProfileExtractionMethodEnum
        required: true
        examples:
          - value: exa_crawling_exa
            description: Extracted via Exa AI crawling API
      is_or_was_retrieved_by:
        range: string
        examples:
          - value: claude-opus-4.5
            description: Extracted by Claude Opus 4.5
          - value: ''
            description: Empty string for fully automated extraction
      has_or_had_url:
        range: uri
        pattern: ^https://www\.linkedin\.com/in/[a-z0-9-]+/?$
        examples:
          - value: https://www.linkedin.com/in/jan-van-der-berg-12345
            description: LinkedIn profile URL
      has_or_had_expense:
        range: float
        minimum_value: 0.0
        examples:
          - value: 0.001
            description: Exa API call cost
          - value: 0.0
            description: Free extraction (cached/local)
      llm_response:
        range: LLMResponse
        required: false
        inlined: true
        examples:
          - value: |
              {
                "content": "Extracted institution data...",
                "reasoning_content": "Analyzing the input for LinkML schema conformity...",
                "thinking_mode": "preserved",
                "clear_thinking": false,
                "model": "glm-4.7",
                "provider": "zai",
                "created": "2025-12-23T10:30:00Z",
                "prompt_tokens": 150,
                "completion_tokens": 450,
                "total_tokens": 600,
                "finish_reason": "stop",
                "cost_usd": 0.0
              }
            description: GLM 4.7 response with Preserved Thinking for extraction
    comments:
      - Every person entity file MUST have extraction_metadata
      - See AGENTS.md Rule 20 for required fields
      - is_or_was_retrieved_by should be 'claude-opus-4.5' for manual extraction
      - has_or_had_expense enables budget tracking for API-heavy extractions
    see_also:
      - https://www.w3.org/TR/prov-o/
      - https://docs.exa.ai/
    annotations:
      specificity_score: 0.1
      specificity_rationale: Generic utility class/slot created during migration
      custodian_types: "['*']"
      custodian_types_rationale: Universal utility concept
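# A minimal sketch of an instance that would satisfy the slot_usage
# constraints declared for ExtractionMetadata: retrieval_timestamp and
# has_or_had_method are present (both required), has_or_had_expense is
# non-negative, and has_or_had_url has the LinkedIn profile shape.
# All values are illustrative and are reused from the examples in this
# schema; the enclosing key follows the JSON example in the class
# description. This is an assumption about how instances are laid out,
# not a normative fixture.
#
# extraction_metadata:
#   has_or_had_source: /data/custodian/person/affiliated/parsed/rijksmuseum_staff_20251210T155416Z.json
#   has_or_had_identifier: rijksmuseum_staff_0042_jan_van_der_berg
#   retrieval_timestamp: '2025-12-12T22:00:00Z'
#   has_or_had_method: exa_crawling_exa
#   is_or_was_retrieved_by: claude-opus-4.5
#   has_or_had_url: https://www.linkedin.com/in/jan-van-der-berg-12345
#   has_or_had_expense: 0.001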