glam/ONTOLOGY_ENRICHMENT_PLAN.md

Ontology Enrichment Plan for hyponyms_curated.yaml

Date: 2025-11-21 (Updated)
Total Entries: 2,453 Wikidata entities
Status: In Progress (5/2,453 complete = 0.20%)


🎯 Latest Session: DBpedia Integration Complete

Session Date: 2025-11-21
Focus: DBpedia ontology caching + Q119459808 enrichment
Status: COMPLETE

Major Achievements

  1. DBpedia Ontology Files Cached (276 KB total)

    • data/ontology/dbpedia_wikidata_mappings.ttl (804 lines)
    • data/ontology/dbpedia_classes_sample.ttl (2,514 lines)
    • data/ontology/dbpedia_heritage_classes.ttl (219 lines)
    • data/ontology/dbpedia_glam_mappings_index.md (usage guide)
  2. Q119459808 (scientific facility) Enriched

    • Heritage-first framing note added
    • DBpedia mapping: dbo:ResearchProject (medium confidence)
    • Related classes documented
    • Coverage gap identified: No direct DBpedia class for research infrastructure
  3. 4-Step DBpedia Workflow Established

    • Step 1: Check direct Wikidata mappings (high confidence)
    • Step 2: Semantic keyword search (medium confidence)
    • Step 3: Review heritage classes (validation)
    • Step 4: Document confidence + gaps

See: SESSION_SUMMARY_20251121_DBPEDIA_INTEGRATION_COMPLETE.md for full details.
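
The four workflow steps can be sketched as a small lookup routine. This is a minimal illustration, not the project's actual script: the function name `map_to_dbpedia`, the `DbpediaMapping` record, and the keyword index are assumptions made for the example. Only the class names, QIDs, and confidence labels come from this plan.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Step 1 data: direct Wikidata -> DBpedia equivalences named in this plan.
DIRECT_MAPPINGS = {
    "Q33506": "dbo:Museum",
    "Q7075": "dbo:Library",
    "Q166118": "dbo:Archive",
}

# Step 2 data: HYPOTHETICAL keyword index for semantic approximation.
KEYWORD_INDEX = {
    "scientific": "dbo:ResearchProject",
    "mansion": "dbo:HistoricBuilding",
}

# Step 3 data: heritage classes used to validate a candidate.
HERITAGE_CLASSES = {
    "dbo:Museum", "dbo:Library", "dbo:Archive",
    "dbo:HistoricBuilding", "dbo:HistoricPlace",
}


@dataclass
class DbpediaMapping:
    dbo_class: Optional[str]
    confidence: str
    heritage_validated: bool = False
    gaps: List[str] = field(default_factory=list)


def map_to_dbpedia(qid: str, label: str) -> DbpediaMapping:
    # Step 1: a direct Wikidata mapping gives high confidence.
    dbo_class = DIRECT_MAPPINGS.get(qid)
    confidence = "high" if dbo_class else "none"

    # Step 2: otherwise fall back to a keyword match (medium confidence).
    if dbo_class is None:
        for keyword, candidate in KEYWORD_INDEX.items():
            if keyword in label.lower():
                dbo_class, confidence = candidate, "medium"
                break

    # Step 3: validate the candidate against the heritage classes.
    validated = dbo_class in HERITAGE_CLASSES

    # Step 4: document confidence and any coverage gap.
    gaps = []
    if dbo_class is None:
        gaps.append(f"no DBpedia class found for '{label}'")
    elif not validated:
        gaps.append(f"{dbo_class} is not a heritage class; treat as approximation")
    return DbpediaMapping(dbo_class, confidence, validated, gaps)


result = map_to_dbpedia("Q119459808", "scientific facility")
print(result.dbo_class, result.confidence, result.gaps)
```

With these sample tables, Q119459808 resolves at step 2 and carries a documented gap, which mirrors the medium-confidence outcome recorded above.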


Completed Entries

1. Q1802963 - mansion (RETROFITTED with DBpedia)

  • Hypernym: building
  • Type: F (Features - physical landmarks)
  • Ontology Mapping: Complete + DBpedia
    • Place aspect: crm:E27_Site, schema:LandmarksOrHistoricalBuildings
    • Custodian aspect: cpov:PublicOrganisation (public) OR schema:Museum (private)
    • DBpedia: dbo:Building, dbo:HistoricBuilding, dbo:HistoricPlace
    • Complexity: 9/10
    • Properties: 8 properties mapped

2. Q3694 - vacation property (FIXED heritage-first framing + DBpedia)

  • Hypernym: accommodation
  • Type: F (Features)
  • Ontology Mapping: Complete + DBpedia (heritage-first fix)
    • Place aspect: crm:E27_Site (heritage site focus)
    • schema:Accommodation replaced with heritage-focused classes
    • DBpedia: dbo:HistoricPlace
    • Heritage framing note added
    • Complexity: 8/10

3. Q2927789 - buitenplaats (Dutch country estate) (RETROFITTED with DBpedia)

  • Hypernym: building
  • Type: F (Features)
  • Country: Netherlands
  • Ontology Mapping: Complete + DBpedia
    • Place aspect: crm:E27_Site, schema:LandmarksOrHistoricalBuildings
    • DBpedia: dbo:HistoricBuilding
    • Dutch heritage context: Rijksmonument status, 17th-19th century estates
    • Complexity: 7/10

4. Q2772772 - military museum

  • Hypernym: museum
  • Type: M (Museum)
  • Ontology Mapping: Complete + DBpedia
    • Custodian aspect: cpov:PublicOrganisation, schema:Museum
    • Collections: crm:E78_Curated_Holding (military artifacts), rico:RecordSet (archival records)
    • DBpedia: dbo:Museum (high confidence, direct Wikidata equivalent)
    • Complexity: 4/10 (straightforward museum pattern)

5. Q119459808 - scientific facility (NEW)

  • Hypernym: organisation
  • Type: R (Research) + E (Education)
  • Ontology Mapping: Complete + DBpedia + Heritage-First
    • Custodian aspect: schema:ResearchOrganization, cpov:PublicOrganisation (if public)
    • Place aspect: crm:E27_Site (conditional on permanent facilities)
    • Collections: schema:Dataset (research data), crm:E78_Curated_Holding (specimens)
    • DBpedia: dbo:ResearchProject (medium confidence, semantic approximation)
    • Heritage framing note: Emphasizes scientific facilities as heritage custodians (specimen archives, research data), not generic R&D
    • Coverage gap documented: DBpedia lacks "scientific facility" class
    • Complexity: 7/10 (multi-functional research infrastructure)

Batch Processing Strategy

Given 2,452 entries, we'll process them in batches by hypernym category:

Priority 1: Core Heritage Custodian Types (1,465 entries)

These are the most critical for the heritage custodian ontology:

| Hypernym | Count | Ontology Pattern | Status |
| --- | --- | --- | --- |
| museum | 133 | cpov:PublicOrganisation + schema:Museum + crm:E39_Actor | TODO |
| archive | 117 | cpov:PublicOrganisation + rico:CorporateBody + rico:RecordSet | TODO |
| library | 29 | cpov:PublicOrganisation + schema:Library + bf:Collection | TODO |
| art institution | 77 | cpov:PublicOrganisation + schema:ArtGallery + crm:E78_Curated_Holding | TODO |
| cultural institution | 22 | cpov:PublicOrganisation + schema:Organization | TODO |
| heritage site | 151 | crm:E27_Site + schema:LandmarksOrHistoricalBuildings | TODO |
| organisation | 193 | cpov:PublicOrganisation OR schema:Organization (requires classification) | TODO |
| company | 189 | schema:Corporation + crm:E40_Legal_Body | TODO |
| university | 66 | schema:EducationalOrganization + schema:CollegeOrUniversity | TODO |
| higher education institution | 42 | schema:EducationalOrganization | TODO |
| school | 39 | schema:EducationalOrganization | TODO |
| research center | (in organisation) | schema:ResearchOrganization + cpov:PublicOrganisation | TODO |

Subtotal: ~1,058 entries (43% of total)

Priority 2: Physical Sites and Places (1,183 entries)

Environmental and landscape heritage:

| Hypernym | Count | Ontology Pattern | Status |
| --- | --- | --- | --- |
| protected area | 875 | schema:Place + crm:E27_Site | TODO |
| national park | 74 | schema:Park + environmental heritage mixins | TODO |
| natural monument | 70 | schema:LandmarksOrHistoricalBuildings | TODO |
| building | 35 | crm:E27_Site + schema:Place | 1/35 |
| park | 21 | schema:Park | TODO |
| zoo | 17 | schema:Zoo + crm:E39_Actor | TODO |

Subtotal: ~1,092 entries (45% of total)

Priority 3: Specialized Categories (302 entries)

Collections, groups, and specialized types:

| Hypernym | Count | Ontology Pattern | Status |
| --- | --- | --- | --- |
| group | 28 | crm:E74_Group + schema:Organization | TODO |
| collection | 16 | rico:RecordSet OR crm:E78_Curated_Holding OR bf:Collection | TODO |
| data repository | 19 | schema:DataCatalog + digital platform mixins | TODO |
| historical society | (in organisation) | schema:NGO + crm:E74_Group | TODO |

Subtotal: ~63 entries (3% of total)

Priority 4: Settlement and Administrative Units (139 entries)

Geographic and political entities (low priority for heritage custodian ontology):

| Hypernym | Count | Ontology Pattern | Status |
| --- | --- | --- | --- |
| settlement | varies | schema:Place | TODO |
| province | varies | schema:AdministrativeArea | TODO |
| polity | varies | schema:GovernmentOrganization | TODO |

Subtotal: ~139 entries (6% of total)
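
The subtotals for Priorities 1 to 3 are the sums of the per-hypernym counts listed in each table (rows marked "in organisation" contribute no count of their own). A quick arithmetic check, assuming the header total of 2,453 entries as the denominator; the names `PRIORITY_COUNTS` and `subtotal_and_share` are illustrative only:

```python
TOTAL_ENTRIES = 2453  # total from the plan header (assumed denominator)

# Per-hypernym counts copied from the three priority tables.
PRIORITY_COUNTS = {
    "Priority 1": [133, 117, 29, 77, 22, 151, 193, 189, 66, 42, 39],
    "Priority 2": [875, 74, 70, 35, 21, 17],
    "Priority 3": [28, 16, 19],
}


def subtotal_and_share(counts, total=TOTAL_ENTRIES):
    """Return the subtotal and its share of all entries, in percent."""
    subtotal = sum(counts)
    return subtotal, round(100 * subtotal / total, 1)


for name, counts in PRIORITY_COUNTS.items():
    subtotal, share = subtotal_and_share(counts)
    print(f"{name}: {subtotal} entries ({share}% of total)")
# Priority 1: 1058 entries (43.1% of total)
# Priority 2: 1092 entries (44.5% of total)
# Priority 3: 63 entries (2.6% of total)
```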


Enrichment Workflow

For each entry, add the following YAML structure:

- label: Q1234567
  hypernym:
    - museum
  type:
    - M
  ontology_mapping:
    wikidata_source: Q1234567
    enrichment_date: '2025-11-20T...'
    enriched_by: manual_ontology_mapper
    complexity_score: 7  # 1-10 scale
    complexity_note: "Explanation of why this entity is complex to model"
    
    semantic_aspects:
      - custodian_reference
      - place_reference
      - collections_reference
    
    custodian_ontology:
      primary_class: cpov:PublicOrganisation
      namespace: http://data.europa.eu/m8g/
      secondary_class: schema:Museum
      rdfs_comment: "Description of when to use this class"
      properties:
        - dct:identifier (ISIL code, Wikidata)
        - cpov:hasUnit (organizational structure)
    
    place_ontology:  # If applicable
      primary_class: crm:E27_Site
      properties:
        - schema:geo (coordinates)
    
    collections_ontology:  # If applicable
      primary_class: crm:E78_Curated_Holding
      properties:
        - crm:P147i_was_curated_by (curation activity carried out by the custodian)
    
    temporal_model:
      custodian_aspect: "Founding → Present/Closure"
      collections_aspect: "Accession dates (per object)"
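
A structure like the one above lends itself to a simple consistency check before an entry is marked complete. The sketch below works on an already-parsed entry (a plain `dict`). The set of required keys and the rule that every declared semantic aspect needs a matching `*_ontology` section are assumptions for illustration, not rules stated in this plan.

```python
from typing import List

# ASSUMPTION: keys treated as mandatory inside ontology_mapping.
REQUIRED_KEYS = {
    "wikidata_source", "enrichment_date", "enriched_by",
    "complexity_score", "semantic_aspects", "temporal_model",
}

# Each semantic aspect is expected to have a matching ontology section.
ASPECT_SECTIONS = {
    "custodian_reference": "custodian_ontology",
    "place_reference": "place_ontology",
    "collections_reference": "collections_ontology",
}


def validate_entry(entry: dict) -> List[str]:
    """Return a list of problems; an empty list means the entry looks complete."""
    mapping = entry.get("ontology_mapping")
    if mapping is None:
        return ["missing ontology_mapping"]

    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - mapping.keys())]

    # The template defines complexity on a 1-10 scale.
    score = mapping.get("complexity_score")
    if not isinstance(score, int) or not 1 <= score <= 10:
        problems.append("complexity_score must be an integer from 1 to 10")

    # Every declared aspect needs a section with a primary_class.
    for aspect in mapping.get("semantic_aspects", []):
        section = ASPECT_SECTIONS.get(aspect)
        if section is None:
            problems.append(f"unknown semantic aspect: {aspect}")
        elif "primary_class" not in mapping.get(section, {}):
            problems.append(f"{aspect} declared but {section}.primary_class is missing")
    return problems


example = {
    "label": "Q1234567",
    "ontology_mapping": {
        "wikidata_source": "Q1234567",
        "enrichment_date": "2025-11-20",
        "enriched_by": "manual_ontology_mapper",
        "complexity_score": 7,
        "semantic_aspects": ["custodian_reference"],
        "custodian_ontology": {"primary_class": "cpov:PublicOrganisation"},
        "temporal_model": {"custodian_aspect": "Founding → Present/Closure"},
    },
}
print(validate_entry(example))  # []
```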

Next Steps

Automated Batch Processing

Create a script to process entries in batches:

  1. Batch 1: Museums (133 entries)

    • Pattern: cpov:PublicOrganisation + schema:Museum + crm:E39_Actor
    • Collections: crm:E78_Curated_Holding
    • People: pico:PersonObservation
  2. Batch 2: Archives (117 entries)

    • Pattern: cpov:PublicOrganisation + rico:CorporateBody
    • Collections: rico:RecordSet
  3. Batch 3: Libraries (29 entries)

    • Pattern: cpov:PublicOrganisation + schema:Library
    • Collections: bf:Collection
  4. Batch 4: Buildings (35 entries)

    • Pattern: crm:E27_Site + schema:Place
    • Dual aspect: place + potential custodian
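
One possible shape for the batching script is sketched below. It groups single-hypernym entries into the four batches listed above and attaches the batch pattern as candidate classes. The function names and the `candidate_classes` field are hypothetical; the patterns are copied from the batch list.

```python
# Batch order and class patterns as listed in the plan.
BATCH_PATTERNS = {
    "museum": ["cpov:PublicOrganisation", "schema:Museum", "crm:E39_Actor"],
    "archive": ["cpov:PublicOrganisation", "rico:CorporateBody"],
    "library": ["cpov:PublicOrganisation", "schema:Library"],
    "building": ["crm:E27_Site", "schema:Place"],
}


def build_batches(entries):
    """Group single-hypernym entries by hypernym; everything else is 'remaining'."""
    batches = {hypernym: [] for hypernym in BATCH_PATTERNS}
    batches["remaining"] = []
    for entry in entries:
        hypernyms = entry.get("hypernym", [])
        if len(hypernyms) == 1 and hypernyms[0] in BATCH_PATTERNS:
            batches[hypernyms[0]].append(entry)
        else:
            batches["remaining"].append(entry)
    return batches


def apply_pattern(entry):
    """Return a copy of the entry with its batch pattern attached (hypothetical field)."""
    enriched = dict(entry)
    enriched["candidate_classes"] = list(BATCH_PATTERNS[entry["hypernym"][0]])
    return enriched


entries = [
    {"label": "Q2772772", "hypernym": ["museum"]},
    {"label": "Q1802963", "hypernym": ["building"]},
    {"label": "Q119459808", "hypernym": ["organisation"]},
]
batches = build_batches(entries)
enriched_museums = [apply_pattern(e) for e in batches["museum"]]
print(enriched_museums[0]["candidate_classes"])
```

Entries with several hypernyms, or with a hypernym outside the four batches, fall into `remaining` so that they are never enriched with a pattern by accident.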

Manual Review Required

  • Entries with hypernym "organisation" (193 entries) - need public/private classification
  • Entries with multiple hypernyms - need multi-aspect modeling
  • Entries with complexity score ≥ 7 - require human review
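
The three review criteria can be expressed as a single predicate. This is a sketch under the assumption that entries follow the YAML structure from the Enrichment Workflow section; `manual_review_reasons` is an illustrative name.

```python
REVIEW_THRESHOLD = 7  # complexity score at or above which a human reviews the entry


def manual_review_reasons(entry):
    """Return the reasons an entry needs manual review (empty list if none apply)."""
    reasons = []
    hypernyms = entry.get("hypernym", [])
    if "organisation" in hypernyms:
        reasons.append("generic organisation: needs public/private classification")
    if len(hypernyms) > 1:
        reasons.append("multiple hypernyms: needs multi-aspect modeling")
    score = entry.get("ontology_mapping", {}).get("complexity_score")
    if score is not None and score >= REVIEW_THRESHOLD:
        reasons.append(f"complexity score {score} >= {REVIEW_THRESHOLD}")
    return reasons


simple = {"hypernym": ["museum"], "ontology_mapping": {"complexity_score": 4}}
complex_entry = {"hypernym": ["organisation"], "ontology_mapping": {"complexity_score": 7}}
print(manual_review_reasons(simple))         # []
print(len(manual_review_reasons(complex_entry)))  # 2
```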

Progress Tracking

  • Entry 1/2,452: Q1802963 (mansion)
  • Batch 1: Museums (0/133)
  • Batch 2: Archives (0/117)
  • Batch 3: Libraries (0/29)
  • Batch 4: Buildings (1/35)
  • Remaining: (0/2,138)

Total Progress: 0.04% (1/2,452 entries)
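
The tracking figures follow from simple arithmetic: the size of the "Remaining" bucket is the total minus the four batch sizes, and the percentage is completed entries over the total. A small sketch using the totals quoted in this section (the variable and function names are illustrative):

```python
TOTAL = 2452  # total used in this tracking section
BATCH_SIZES = {"museum": 133, "archive": 117, "library": 29, "building": 35}


def progress_percent(done, total):
    """Completed share in percent, rounded to two decimals."""
    return round(100 * done / total, 2)


remaining_size = TOTAL - sum(BATCH_SIZES.values())
print(remaining_size)              # 2138
print(progress_percent(1, TOTAL))  # 0.04
```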


Automation vs. Manual Work

Can Be Automated (approx. 70% of entries)

  • Single hypernym with clear ontology mapping
  • Standard patterns (museum, archive, library)
  • Protected areas and natural monuments

Requires Manual Review (approx. 30% of entries)

  • Multiple hypernyms (multi-aspect entities)
  • Generic "organisation" classification
  • Complex historical societies (heemkamer, etc.)
  • Ambiguous building types

Estimated Effort

  • Automated enrichment: 2-3 hours processing time
  • Manual review: 20-30 hours for complex entries
  • Quality assurance: 5-10 hours spot-checking

Total: 27-43 hours of work


Resources

  • Ontology files: /data/ontology/
  • Full Wikidata metadata: hyponyms_curated_full.yaml
  • Enrichment target: hyponyms_curated.yaml
  • Rules reference: .opencode/agent/ontology-mapping-rules.md

DBpedia Ontology Integration Discovered - 2025-11-20 23:56:32

Major Discovery: The DBpedia Ontology provides pre-existing mappings from Wikidata entities to formal ontology classes for heritage institutions.

Key Findings:

  1. DBpedia has GLAM classes:

    • dbo:Museum ←→ wd:Q33506 ←→ schema:Museum
    • dbo:Library ←→ wd:Q7075 ←→ schema:Library
    • dbo:Archive ←→ wd:Q166118
  2. DBpedia provides heritage-specific properties:

    • dbo:collection (museum collections)
    • dbo:curator (curator name)
    • dbo:museumType (specialization)
    • dbo:isil (ISIL codes for libraries)
    • dbo:numberOfCollectionItems
  3. Integration benefits:

    • Pre-mapped Wikidata entities save manual mapping work
    • Standardized properties avoid custom property invention
    • OWL reasoning support for ontology inference
    • Validates existing Schema.org mappings

Documentation Created:

  • docs/DBPEDIA_ONTOLOGY_INTEGRATION.md (12,500+ words)
    • DBpedia ontology overview
    • Heritage class mappings (Museum, Library, Archive)
    • Integration workflow (4 steps)
    • SPARQL queries for discovery
    • Implementation recommendations
    • Example enriched YAML with DBpedia references

Next Actions:

  1. Update .opencode/agent/ontology-mapping-rules.md with DBpedia step
  2. Create DBpedia → Wikidata mapping cache script
  3. Retrofit existing mappings (Q1802963, Q3694, Q2927789) with DBpedia
  4. Continue Q119459808 enrichment with DBpedia integration
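
A mapping cache script (Next Action 2) could start from something like the sketch below. It assumes the equivalences are available one statement per line with full URIs (N-Triples style). The real layout of the cached `.ttl` files is not described in this plan, and general Turtle would need a proper RDF parser rather than a regular expression. The function name `build_cache` and the sample lines are illustrative.

```python
import json
import re

# Matches: <dbo class> owl:equivalentClass <wikidata entity>, written with full URIs.
EQUIVALENCE = re.compile(
    r"<http://dbpedia\.org/ontology/(?P<cls>\w+)>\s+"
    r"<http://www\.w3\.org/2002/07/owl#equivalentClass>\s+"
    r"<http://www\.wikidata\.org/entity/(?P<qid>Q\d+)>"
)


def build_cache(lines):
    """Map Wikidata QIDs to dbo: class names for every equivalence statement found."""
    cache = {}
    for line in lines:
        match = EQUIVALENCE.search(line)
        if match:
            cache[match.group("qid")] = "dbo:" + match.group("cls")
    return cache


# Sample input built from the equivalences listed under Key Findings.
SAMPLE = [
    "<http://dbpedia.org/ontology/Museum> <http://www.w3.org/2002/07/owl#equivalentClass> <http://www.wikidata.org/entity/Q33506> .",
    "<http://dbpedia.org/ontology/Library> <http://www.w3.org/2002/07/owl#equivalentClass> <http://www.wikidata.org/entity/Q7075> .",
    "# lines without an equivalence statement are ignored",
]

cache = build_cache(SAMPLE)
print(json.dumps(cache, indent=2, sort_keys=True))
```

Keying the cache by QID keeps the "check direct Wikidata mappings" step a single dictionary lookup.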

Heritage-First Framing Principle Added - 2025-11-20 23:55

Critical Policy Update: Added Heritage-First Framing Principle to ontology mapping rules.

Problem Identified

Initial Q3694 (vacation property) mapping used generic real estate classes:

  • PRIMARY: schema:Accommodation (too generic)
  • RATIONALE: "Most vacation properties are commercial rentals"

This violated the project mission: we model heritage custodians, not generic real estate.

Solution: Heritage-First Framing Principle

New Rule: All entities in the GLAMORCUBESFIXPHDNT taxonomy are evaluated through a heritage-significance lens.

Key Points:

  1. ALWAYS assume heritage significance - entities in our taxonomy have heritage value
  2. ALWAYS use heritage-focused classes - crm:E27_Site, not schema:Accommodation
  3. ALWAYS model the place aspect for sites - physical entities are heritage sites
  4. NEVER use generic classes - schema:Residence and schema:Accommodation are too generic
  5. NEVER require "proof" - an entity included in the Wikidata extraction has heritage potential
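
As a rough illustration, the principle can be encoded as a guard applied when choosing a primary class. This is a sketch, not the decision tree from the mapping rules file: the function name and the `is_physical_site` flag are assumptions, and the list of generic classes contains only the two named above.

```python
# Generic classes the principle rejects (only the two named in this plan).
GENERIC_CLASSES = {"schema:Accommodation", "schema:Residence"}
HERITAGE_SITE_CLASS = "crm:E27_Site"


def select_primary_class(candidates, is_physical_site):
    """Pick a primary class, never returning a generic real-estate class."""
    if is_physical_site:
        # Physical entities are modeled as heritage sites by default.
        return HERITAGE_SITE_CLASS
    allowed = [c for c in candidates if c not in GENERIC_CLASSES]
    if not allowed:
        raise ValueError("only generic classes proposed; choose a heritage-focused class")
    return allowed[0]


# Q3694 (vacation property): the generic proposal is overridden.
print(select_primary_class(["schema:Accommodation"], is_physical_site=True))  # crm:E27_Site
```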

Documentation Updated

File: .opencode/agent/ontology-mapping-rules.md

Added section: "Heritage-First Framing Principle" (60 lines)

  • Heritage Significance Default
  • Examples (vacation properties, mansions, buitenplaatsen)
  • Ontology Selection Decision Tree for Physical Sites
  • Rationale (5 key points)

Entries Retrofitted

Q3694 (vacation property) - Fixed heritage framing:

  • BEFORE: schema:Accommodation (generic)
  • AFTER: crm:E27_Site (heritage site)
  • Added: heritage_framing_note explaining Heritage-First Principle
  • Updated: ontology_rationale with heritage-focused reasoning
  • Added: DBpedia mapping (dbo:HistoricPlace)

Q1802963 (mansion) - Added DBpedia:

  • Added: dbpedia_mapping section
  • Classes: dbo:Building, dbo:HistoricBuilding, dbo:HistoricPlace

Q2927789 (buitenplaats) - Added DBpedia:

  • Added: dbpedia_mapping section
  • Classes: dbo:HistoricBuilding (Dutch heritage estates)

Impact

All future ontology mappings will:

  1. Default to heritage-focused classes (crm:E27_Site, not schema:Place)
  2. Use CIDOC-CRM as PRIMARY for cultural heritage sites
  3. Reject generic real estate classes
  4. Reference Heritage-First Framing Principle in rationale