# Ontology Enrichment Plan for hyponyms_curated.yaml
**Date**: 2025-11-21 (Updated)
**Total Entries**: 2,453 Wikidata entities
**Status**: In Progress (5/2,453 complete = 0.20%)
---
## 🎯 Latest Session: DBpedia Integration Complete
**Session Date**: 2025-11-21
**Focus**: DBpedia ontology caching + Q119459808 enrichment
**Status**: ✅ COMPLETE
### Major Achievements
1. **DBpedia Ontology Files Cached** (276 KB total)
   - `data/ontology/dbpedia_wikidata_mappings.ttl` (804 lines)
   - `data/ontology/dbpedia_classes_sample.ttl` (2,514 lines)
   - `data/ontology/dbpedia_heritage_classes.ttl` (219 lines)
   - `data/ontology/dbpedia_glam_mappings_index.md` (usage guide)
2. **Q119459808 (scientific facility) Enriched**
   - Heritage-first framing note added
   - DBpedia mapping: `dbo:ResearchProject` (medium confidence)
   - Related classes documented
   - Coverage gap identified: No direct DBpedia class for research infrastructure
3. **4-Step DBpedia Workflow Established**
   - Step 1: Check direct Wikidata mappings (high confidence)
   - Step 2: Semantic keyword search (medium confidence)
   - Step 3: Review heritage classes (validation)
   - Step 4: Document confidence + gaps
**See**: `SESSION_SUMMARY_20251121_DBPEDIA_INTEGRATION_COMPLETE.md` for full details.
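
The four workflow steps can be sketched as a small lookup function. This is a minimal illustration under stated assumptions, not the project's actual tooling: `map_to_dbpedia`, `DIRECT`, `KEYWORDS`, and `HERITAGE` are hypothetical names, and the tiny tables stand in for the cached `.ttl` files. The only mappings used are ones stated in this plan (`wd:Q33506` to `dbo:Museum`, and the `dbo:ResearchProject` approximation for Q119459808).

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the cached ontology files.
DIRECT = {"Q33506": "dbo:Museum"}                 # Wikidata QID -> DBpedia class
KEYWORDS = {"scientific": "dbo:ResearchProject"}  # label keyword -> candidate class
HERITAGE = {"dbo:Museum", "dbo:HistoricBuilding", "dbo:HistoricPlace"}


@dataclass
class DbpediaMapping:
    classes: list
    confidence: str                       # "high", "medium", or "none"
    gaps: list = field(default_factory=list)


def map_to_dbpedia(qid, label):
    # Step 1: a direct Wikidata mapping gives high confidence.
    if qid in DIRECT:
        return DbpediaMapping([DIRECT[qid]], "high")

    # Step 2: keyword search over the label gives medium confidence.
    words = label.lower().split()
    candidates = []
    for keyword, cls in KEYWORDS.items():
        if keyword in words and cls not in candidates:
            candidates.append(cls)

    # Step 3: review against heritage classes; heritage classes are listed first.
    candidates.sort(key=lambda cls: cls not in HERITAGE)

    # Step 4: record the confidence level and the coverage gap.
    gap = f"No direct DBpedia class for '{label}'"
    if candidates:
        return DbpediaMapping(candidates, "medium", [gap])
    return DbpediaMapping([], "none", [gap])
```

Calling `map_to_dbpedia("Q119459808", "scientific facility")` returns the medium-confidence `dbo:ResearchProject` candidate together with a recorded coverage gap, which mirrors how that entry was documented.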
---
## Completed Entries
### 1. Q1802963 - mansion (RETROFITTED with DBpedia)
- **Hypernym**: building
- **Type**: F (Features - physical landmarks)
- **Ontology Mapping**: ✅ Complete + DBpedia
  - Place aspect: `crm:E27_Site`, `schema:LandmarksOrHistoricalBuildings`
  - Custodian aspect: `cpov:PublicOrganisation` (public) OR `schema:Museum` (private)
  - DBpedia: `dbo:Building`, `dbo:HistoricBuilding`, `dbo:HistoricPlace`
  - Complexity: 9/10
  - Properties: 8 properties mapped
### 2. Q3694 - vacation property (FIXED heritage-first framing + DBpedia)
- **Hypernym**: accommodation
- **Type**: F (Features)
- **Ontology Mapping**: ✅ Complete + DBpedia (heritage-first fix)
  - Place aspect: `crm:E27_Site` (heritage site focus)
  - ~~`schema:Accommodation`~~ → Changed to heritage-focused classes
  - DBpedia: `dbo:HistoricPlace`
  - Heritage framing note added
  - Complexity: 8/10
### 3. Q2927789 - buitenplaats (Dutch country estate) (RETROFITTED with DBpedia)
- **Hypernym**: building
- **Type**: F (Features)
- **Country**: Netherlands
- **Ontology Mapping**: ✅ Complete + DBpedia
  - Place aspect: `crm:E27_Site`, `schema:LandmarksOrHistoricalBuildings`
  - DBpedia: `dbo:HistoricBuilding`
  - Dutch heritage context: Rijksmonument status, 17th-19th century estates
  - Complexity: 7/10
### 4. Q2772772 - military museum
- **Hypernym**: museum
- **Type**: M (Museum)
- **Ontology Mapping**: ✅ Complete + DBpedia
  - Custodian aspect: `cpov:PublicOrganisation`, `schema:Museum`
  - Collections: `crm:E78_Curated_Holding` (military artifacts), `rico:RecordSet` (archival records)
  - DBpedia: `dbo:Museum` (high confidence, direct Wikidata equivalent)
  - Complexity: 4/10 (straightforward museum pattern)
### 5. Q119459808 - scientific facility ✨ NEW
- **Hypernym**: organisation
- **Type**: R (Research) + E (Education)
- **Ontology Mapping**: ✅ Complete + DBpedia + Heritage-First
  - Custodian aspect: `schema:ResearchOrganization`, `cpov:PublicOrganisation` (if public)
  - Place aspect: `crm:E27_Site` (conditional on permanent facilities)
  - Collections: `schema:Dataset` (research data), `crm:E78_Curated_Holding` (specimens)
  - DBpedia: `dbo:ResearchProject` (medium confidence, semantic approximation)
  - Heritage framing note: Emphasizes scientific facilities as **heritage custodians** (specimen archives, research data), not generic R&D
  - Coverage gap documented: DBpedia lacks "scientific facility" class
  - Complexity: 7/10 (multi-functional research infrastructure)
---
## Batch Processing Strategy
Given 2,453 entries, we'll process them in batches by hypernym category:
### Priority 1: Core Heritage Custodian Types (1,465 entries)
These are the most critical for the heritage custodian ontology:
| Hypernym | Count | Ontology Pattern | Status |
|----------|-------|------------------|--------|
| museum | 133 | `cpov:PublicOrganisation` + `schema:Museum` + `crm:E39_Actor` | TODO |
| archive | 117 | `cpov:PublicOrganisation` + `rico:CorporateBody` + `rico:RecordSet` | TODO |
| library | 29 | `cpov:PublicOrganisation` + `schema:Library` + `bf:Collection` | TODO |
| art institution | 77 | `cpov:PublicOrganisation` + `schema:ArtGallery` + `crm:E78_Curated_Holding` | TODO |
| cultural institution | 22 | `cpov:PublicOrganisation` + `schema:Organization` | TODO |
| heritage site | 151 | `crm:E27_Site` + `schema:LandmarksOrHistoricalBuildings` | TODO |
| organisation | 193 | `cpov:PublicOrganisation` OR `schema:Organization` (requires classification) | TODO |
| company | 189 | `schema:Corporation` + `crm:E40_Legal_Body` | TODO |
| university | 66 | `schema:EducationalOrganization` + `schema:CollegeOrUniversity` | TODO |
| higher education institution | 42 | `schema:EducationalOrganization` | TODO |
| school | 39 | `schema:EducationalOrganization` | TODO |
| research center | (in organisation) | `schema:ResearchOrganization` + `cpov:PublicOrganisation` | TODO |
**Subtotal**: ~1,058 entries (43% of total)
### Priority 2: Physical Sites and Places (1,183 entries)
Environmental and landscape heritage:
| Hypernym | Count | Ontology Pattern | Status |
|----------|-------|------------------|--------|
| protected area | 875 | `schema:Place` + `crm:E27_Site` | TODO |
| national park | 74 | `schema:Park` + environmental heritage mixins | TODO |
| natural monument | 70 | `schema:LandmarksOrHistoricalBuildings` | TODO |
| building | 35 | `crm:E27_Site` + `schema:Place` | ✅ 2/35 |
| park | 21 | `schema:Park` | TODO |
| zoo | 17 | `schema:Zoo` + `crm:E39_Actor` | TODO |
**Subtotal**: ~1,092 entries (45% of total)
### Priority 3: Specialized Categories (302 entries)
Collections, groups, and specialized types:
| Hypernym | Count | Ontology Pattern | Status |
|----------|-------|------------------|--------|
| group | 28 | `crm:E74_Group` + `schema:Organization` | TODO |
| collection | 16 | `rico:RecordSet` OR `crm:E78_Curated_Holding` OR `bf:Collection` | TODO |
| data repository | 19 | `schema:DataCatalog` + digital platform mixins | TODO |
| historical society | (in organisation) | `schema:NGO` + `crm:E74_Group` | TODO |
**Subtotal**: ~63 entries (3% of total)
### Priority 4: Settlement and Administrative Units (139 entries)
Geographic and political entities (low priority for heritage custodian ontology):
| Hypernym | Count | Ontology Pattern | Status |
|----------|-------|------------------|--------|
| settlement | varies | `schema:Place` | TODO |
| province | varies | `schema:AdministrativeArea` | TODO |
| polity | varies | `schema:GovernmentOrganization` | TODO |
**Subtotal**: ~139 entries (6% of total)
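
The tiers above can be applied mechanically by grouping entries on their hypernym. The sketch below assumes each entry is a dict with `label` and `hypernym` keys, as in the enrichment template; `PRIORITY`, `UNASSIGNED`, and `assign_batches` are hypothetical names, and `PRIORITY` copies only a few rows from the tables for brevity.

```python
from collections import defaultdict

# Hypernym -> priority tier, copied from a few rows of the tables above.
PRIORITY = {
    "museum": 1, "archive": 1, "library": 1,
    "protected area": 2, "building": 2,
    "collection": 3, "group": 3,
    "settlement": 4, "province": 4,
}
UNASSIGNED = 99  # hypothetical bucket for hypernyms not listed in any table


def assign_batches(entries):
    """Group entry labels into (tier, hypernym) batches, ordered by tier."""
    batches = defaultdict(list)
    for entry in entries:
        hypernyms = entry.get("hypernym", [])
        # Assumption: an entry with several hypernyms joins its
        # highest-priority batch (it may still need manual review).
        best = min(hypernyms, key=lambda h: PRIORITY.get(h, UNASSIGNED), default=None)
        tier = PRIORITY.get(best, UNASSIGNED)
        batches[(tier, best)].append(entry["label"])
    return dict(sorted(batches.items(), key=lambda kv: (kv[0][0], str(kv[0][1]))))


entries = [
    {"label": "Q2772772", "hypernym": ["museum"]},
    {"label": "Q1802963", "hypernym": ["building"]},
    {"label": "Q2927789", "hypernym": ["building"]},
    {"label": "Q3694", "hypernym": ["accommodation"]},
]
batches = assign_batches(entries)
```

Here `accommodation` lands in the `UNASSIGNED` bucket because it appears in none of the priority tables, which is one way to surface hypernyms the plan has not yet categorised.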
---
## Enrichment Workflow
For each entry, add the following YAML structure:
```yaml
- label: Q1234567
  hypernym:
    - museum
  type:
    - M
  ontology_mapping:
    wikidata_source: Q1234567
    enrichment_date: '2025-11-20T...'
    enriched_by: manual_ontology_mapper
    complexity_score: 7  # 1-10 scale
    complexity_note: "Explanation of why this entity is complex to model"
    semantic_aspects:
      - custodian_reference
      - place_reference
      - collections_reference
    custodian_ontology:
      primary_class: cpov:PublicOrganisation
      namespace: http://data.europa.eu/m8g/
      secondary_class: schema:Museum
      rdfs_comment: "Description of when to use this class"
      properties:
        - dct:identifier (ISIL code, Wikidata)
        - cpov:hasUnit (organizational structure)
    place_ontology:  # If applicable
      primary_class: crm:E27_Site
      properties:
        - schema:geo (coordinates)
    collections_ontology:  # If applicable
      primary_class: crm:E78_Curated_Holding
      properties:
        - crm:P147i_was_curated_by (custodian)
    temporal_model:
      custodian_aspect: "Founding → Present/Closure"
      collections_aspect: "Accession dates (per object)"
```
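
A structural check can catch incomplete entries before they are committed. The sketch below works on an already-parsed entry (a plain dict) and makes two assumptions that the template does not state explicitly: the `*_ontology` blocks are nested under `ontology_mapping`, and each `x_reference` aspect requires a matching `x_ontology` block. `REQUIRED_KEYS` and `validate_entry` are hypothetical names.

```python
REQUIRED_KEYS = ("wikidata_source", "enrichment_date", "enriched_by",
                 "complexity_score", "semantic_aspects")


def validate_entry(entry):
    """Return a list of problems found in one enriched entry (empty if none)."""
    mapping = entry.get("ontology_mapping")
    if not isinstance(mapping, dict):
        return ["missing ontology_mapping"]

    problems = []
    for key in REQUIRED_KEYS:
        if key not in mapping:
            problems.append(f"missing {key}")

    # The template documents complexity_score as a 1-10 scale.
    score = mapping.get("complexity_score")
    if score is not None and not (isinstance(score, int) and 1 <= score <= 10):
        problems.append("complexity_score must be an integer from 1 to 10")

    # Assumed convention: aspect "place_reference" needs a "place_ontology" block.
    for aspect in mapping.get("semantic_aspects", []):
        section = aspect.replace("_reference", "_ontology")
        if "primary_class" not in mapping.get(section, {}):
            problems.append(f"{aspect} has no {section}.primary_class")
    return problems
```

An empty result means the entry has the fields the template shows; it says nothing about whether the chosen classes are semantically appropriate.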
---
## Next Steps
### Automated Batch Processing
Create script to process entries in batches:
1. **Batch 1: Museums (133 entries)**
   - Pattern: `cpov:PublicOrganisation` + `schema:Museum` + `crm:E39_Actor`
   - Collections: `crm:E78_Curated_Holding`
   - People: `pico:PersonObservation`
2. **Batch 2: Archives (117 entries)**
   - Pattern: `cpov:PublicOrganisation` + `rico:CorporateBody`
   - Collections: `rico:RecordSet`
3. **Batch 3: Libraries (29 entries)**
   - Pattern: `cpov:PublicOrganisation` + `schema:Library`
   - Collections: `bf:Collection`
4. **Batch 4: Buildings (35 entries)**
   - Pattern: `crm:E27_Site` + `schema:Place`
   - Dual aspect: place + potential custodian
### Manual Review Required
- Entries with hypernym "organisation" (193 entries) - need public/private classification
- Entries with multiple hypernyms - need multi-aspect modeling
- Entries with complexity score ≥ 7 - require human review
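
The three review criteria translate directly into a routing check. This is a minimal sketch assuming entries are dicts shaped like the enrichment template, with `complexity_score` under `ontology_mapping`; `manual_review_reasons` is a hypothetical helper name.

```python
def manual_review_reasons(entry):
    """Return the reasons an entry needs manual review (empty list if none)."""
    reasons = []
    hypernyms = entry.get("hypernym", [])

    # Generic "organisation" entries need a public/private decision.
    if "organisation" in hypernyms:
        reasons.append("public/private classification needed")

    # Several hypernyms imply several aspects to model.
    if len(hypernyms) > 1:
        reasons.append("multi-aspect modeling needed")

    # Entries scored 7 or higher go to a human reviewer.
    score = entry.get("ontology_mapping", {}).get("complexity_score")
    if score is not None and score >= 7:
        reasons.append("complexity score >= 7")
    return reasons
```

Entries that return an empty list are candidates for the automated batches; everything else is queued with its reasons attached.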
---
## Progress Tracking
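- [x] Entry 1/2,453: Q1802963 (mansion) ✅
- [x] Entry 2/2,453: Q3694 (vacation property) ✅
- [x] Entry 3/2,453: Q2927789 (buitenplaats) ✅
- [x] Entry 4/2,453: Q2772772 (military museum) ✅
- [x] Entry 5/2,453: Q119459808 (scientific facility) ✅
- [ ] Batch 1: Museums (1/133)
- [ ] Batch 2: Archives (0/117)
- [ ] Batch 3: Libraries (0/29)
- [ ] Batch 4: Buildings (2/35)
- [ ] Remaining: (2/2,139)
**Total Progress**: 0.20% (5/2,453 entries)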
---
## Automation vs. Manual Work
### Can Be Automated (70% of entries)
- Single hypernym with clear ontology mapping
- Standard patterns (museum, archive, library)
- Protected areas and natural monuments
### Requires Manual Review (30% of entries)
- Multiple hypernyms (multi-aspect entities)
- Generic "organisation" classification
- Complex historical societies (heemkamer, etc.)
- Ambiguous building types
---
## Estimated Effort
- **Automated enrichment**: 2-3 hours processing time
- **Manual review**: 20-30 hours for complex entries
- **Quality assurance**: 5-10 hours spot-checking
**Total**: 27-43 hours of work
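
The total is the sum of the low and high ends of each range, which is easy to recompute if an estimate changes. A small sketch (`ESTIMATES` and `total_effort` are illustrative names; the figures are the ones listed above):

```python
# (low, high) hour estimates, taken from the list above.
ESTIMATES = {
    "automated enrichment": (2, 3),
    "manual review": (20, 30),
    "quality assurance": (5, 10),
}


def total_effort(estimates):
    """Sum the low ends and the high ends of each estimate range."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low, high


print("Total: %d-%d hours" % total_effort(ESTIMATES))  # Total: 27-43 hours
```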
---
## Resources
- **Ontology files**: `/data/ontology/`
- **Full Wikidata metadata**: `hyponyms_curated_full.yaml`
- **Enrichment target**: `hyponyms_curated.yaml`
- **Rules reference**: `.opencode/agent/ontology-mapping-rules.md`
## DBpedia Ontology Integration Discovered - 2025-11-20 23:56:32
**Major Discovery**: DBpedia Ontology provides pre-existing Wikidata → formal ontology mappings for heritage institutions.
### Key Findings:
1. **DBpedia has GLAM classes**:
   - `dbo:Museum` ←→ `wd:Q33506` ←→ `schema:Museum`
   - `dbo:Library` ←→ `wd:Q7075` ←→ `schema:Library`
   - `dbo:Archive` ←→ `wd:Q166118`
2. **DBpedia provides heritage-specific properties**:
   - `dbo:collection` (museum collections)
   - `dbo:curator` (curator name)
   - `dbo:museumType` (specialization)
   - `dbo:isil` (ISIL codes for libraries)
   - `dbo:numberOfCollectionItems`
3. **Integration benefits**:
   - Pre-mapped Wikidata entities save manual mapping work
   - Standardized properties avoid custom property invention
   - OWL reasoning support for ontology inference
   - Validates existing Schema.org mappings
### Documentation Created:
- `docs/DBPEDIA_ONTOLOGY_INTEGRATION.md` (12,500+ words)
  - DBpedia ontology overview
  - Heritage class mappings (Museum, Library, Archive)
  - Integration workflow (4 steps)
  - SPARQL queries for discovery
  - Implementation recommendations
  - Example enriched YAML with DBpedia references
### Next Actions:
1. Update `.opencode/agent/ontology-mapping-rules.md` with DBpedia step
2. Create DBpedia → Wikidata mapping cache script
3. Retrofit existing mappings (Q1802963, Q3694, Q2927789) with DBpedia
4. Continue Q119459808 enrichment with DBpedia integration
---
## Heritage-First Framing Principle Added - 2025-11-20 23:55
**Critical Policy Update**: Added Heritage-First Framing Principle to ontology mapping rules.
### Problem Identified
Initial Q3694 (vacation property) mapping used generic real estate classes:
- ❌ PRIMARY: `schema:Accommodation` (too generic)
- ❌ RATIONALE: "Most vacation properties are commercial rentals"
This violated the project mission: we model **heritage custodians**, not generic real estate.
### Solution: Heritage-First Framing Principle
**New Rule**: All entities in the GLAMORCUBESFIXPHDNT taxonomy are evaluated through a **heritage significance lens**.
**Key Points**:
1. **ALWAYS assume heritage significance** - entities in our taxonomy have heritage value
2. **ALWAYS use heritage-focused classes** - `crm:E27_Site`, not `schema:Accommodation`
3. **ALWAYS model place aspect for sites** - physical entities are heritage sites
4. **NEVER use generic classes** - `schema:Residence` and `schema:Accommodation` are too generic
5. **NEVER require "proof"** - if an entity is in the Wikidata extraction, it has heritage potential
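
The "never use generic classes" rule lends itself to a simple pre-commit check. The sketch below flags only the two classes the rule names; `GENERIC_CLASSES` and `check_heritage_first` are hypothetical names, and suggesting `crm:E27_Site` as the replacement for `schema:Residence` is an assumption extrapolated from the `schema:Accommodation` case.

```python
# Generic classes named in the rule, with an assumed heritage-focused replacement.
GENERIC_CLASSES = {
    "schema:Accommodation": "crm:E27_Site",
    "schema:Residence": "crm:E27_Site",
}


def check_heritage_first(primary_class):
    """Return (ok, suggestion) for a proposed primary class."""
    suggestion = GENERIC_CLASSES.get(primary_class)
    if suggestion is None:
        # Not on the generic list: accept the class as proposed.
        return True, primary_class
    # Generic class: reject it and suggest the heritage-focused class.
    return False, suggestion
```

For the original Q3694 mapping, `check_heritage_first("schema:Accommodation")` returns `(False, "crm:E27_Site")`, matching the fix that was applied by hand.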
### Documentation Updated
**File**: `.opencode/agent/ontology-mapping-rules.md`
Added section: "Heritage-First Framing Principle" (60 lines)
- Heritage Significance Default
- Examples (vacation properties, mansions, buitenplaatsen)
- Ontology Selection Decision Tree for Physical Sites
- Rationale (5 key points)
### Entries Retrofitted
**Q3694 (vacation property)** - Fixed heritage framing:
- ✅ BEFORE: `schema:Accommodation` (generic)
- ✅ AFTER: `crm:E27_Site` (heritage site)
- ✅ Added: `heritage_framing_note` explaining Heritage-First Principle
- ✅ Updated: `ontology_rationale` with heritage-focused reasoning
- ✅ Added: DBpedia mapping (`dbo:HistoricPlace`)
**Q1802963 (mansion)** - Added DBpedia:
- ✅ Added: `dbpedia_mapping` section
- ✅ Classes: `dbo:Building`, `dbo:HistoricBuilding`, `dbo:HistoricPlace`
**Q2927789 (buitenplaats)** - Added DBpedia:
- ✅ Added: `dbpedia_mapping` section
- ✅ Classes: `dbo:HistoricBuilding` (Dutch heritage estates)
### Impact
**All future ontology mappings** will:
1. Default to heritage-focused classes (`crm:E27_Site`, not `schema:Place`)
2. Use CIDOC-CRM as PRIMARY for cultural heritage sites
3. Reject generic real estate classes
4. Reference Heritage-First Framing Principle in rationale
---