glam/SESSION_SUMMARY_20251121_NAME_ENTITY_FOUNDATION_COMPLETE.md
2025-11-21 22:12:33 +01:00


# Session Summary: Strategic Pivot to Top-Down Ontology Design
**Date**: 2025-11-21
**Session Focus**: Name Entity as Central Hub - Foundation Complete
**Status**: ✅ COMPLETE
---
## 🎯 Major Strategic Pivot
### From: Bottom-Up Entity Enrichment (0.20% complete)
**Old Approach**:
- Enrich 2,453 Wikidata entities one-by-one
- Progress: 5/2,453 entries (0.20%)
- Estimated time: 2,400+ sessions at current pace
### To: Top-Down Ontology Design
**New Approach**:
1. Define abstract patterns ONCE (Name, Place, Organization, Collection)
2. Extract unique hypernyms from `hyponyms_curated.yaml` (~20 top-level categories)
3. Map hypernyms to ontology classes
4. Batch convert all 2,453 entities using patterns
**Result**: ~240x efficiency gain (2,400+ sessions reduced to roughly 10)
---
## 🏗️ Core Design: Name as Central Hub
### The Insight
**Question**: Is "Mansion House" a place name or an organization name?
**Answer**: **BOTH** - it's a single **nominal reference** that refers to multiple aspects.
### The Solution
**Single Name Entity** with **multi-aspect references**:
```
Name (nominal reference)
├─ refers_to_place → Place (spatial aspect)
├─ refers_to_organization → Organization (custodian aspect)
└─ refers_to_collection → Collection (heritage materials aspect)
```
**Each aspect has independent temporal lifecycle**:
- Place: Construction (1753) → Present (272 years)
- Organization: Founding (1753) → Present (272 years)
- Name: "Mansion House" (1753) → Present (same name for 272 years)
- Alternative scenario: Name changes 5 times while Place/Organization persist
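
The hub pattern above can be sketched in plain Python. This is only an illustration, not the LinkML-generated code: the `Name` dataclass and its `aspects()` helper are hypothetical, and the IRIs are the ones used in the Mansion House example elsewhere in this summary. The point it shows is that one name object carries zero or more references per aspect instead of being typed as a place name or an organization name.

```python
from dataclasses import dataclass, field


@dataclass
class Name:
    """A nominal reference that can point at several aspects at once."""
    id: str
    pref_label: str
    # Each aspect is a list of IRIs; an empty list means "no such aspect".
    refers_to_place: list[str] = field(default_factory=list)
    refers_to_organization: list[str] = field(default_factory=list)
    refers_to_collection: list[str] = field(default_factory=list)

    def aspects(self) -> list[str]:
        """Return the kinds of aspect this name currently refers to."""
        found = []
        if self.refers_to_place:
            found.append("place")
        if self.refers_to_organization:
            found.append("organization")
        if self.refers_to_collection:
            found.append("collection")
        return found


mansion_house = Name(
    id="https://w3id.org/heritage/name/Q1786933",
    pref_label="Mansion House",
    refers_to_place=["https://w3id.org/heritage/place/mansion-house-london"],
    refers_to_organization=["https://w3id.org/heritage/org/lord-mayor-residence"],
)
print(mansion_house.aspects())  # ['place', 'organization']
```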
### Ontological Justification
1. **Wikidata Q82799**: "name" = nominal reference (linguistic identifier), NOT the entity itself
2. **SKOS**: Names are `skos:Concept` with hierarchical structure
3. **CIDOC-CRM E41**: Appellations are distinct from entities they identify
4. **Temporal Flexibility**: Name changes don't require entity recreation
5. **Multi-Aspect**: Single name can reference multiple aspects simultaneously
---
## 📁 Deliverables - 5 Schema Formats
### 1. LinkML Schema (`01_name_entity.yaml`)
**Purpose**: Machine-readable foundation
**Content**:
- Class: `Name` (1 entity)
- Slots: 24 properties
- Enums: 1 (NameTypeEnum)
- **SKOS Alignment**: `skos:Concept`, `skos:prefLabel`, `skos:broader`
- **Multi-Aspect**: `refers_to_place`, `refers_to_organization`, `refers_to_collection`
- **Temporal**: `valid_from`, `valid_to`, `replaces`, `replaced_by`
**Validation**: ✅ PASSED (YAML syntax valid)
**Usage**:
```bash
# Note: linkml-convert converts data instances; schema artifacts come from the gen-* generators.
# Each generator writes to stdout; redirect to a file as needed.
# Generate JSON Schema
gen-json-schema 01_name_entity.yaml
# Generate Python dataclasses
gen-python 01_name_entity.yaml
# Generate SHACL shapes
gen-shacl 01_name_entity.yaml
```
---
### 2. Mermaid Diagram (`01_name_entity_hub.mmd`)
**Purpose**: GitHub-friendly visual documentation
**Content**:
- Class diagram with relationships
- Forward references (Place, Organization, Collection)
- SKOS hierarchical relationships (broader/narrower)
- Temporal name chains (replaces/replaced_by)
**Features**:
- Auto-renders in GitHub
- Embeddable in Markdown docs
- Simple syntax for quick updates
**Rendering**:
```markdown
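<!-- Image syntax does not render Mermaid source; link to the .mmd file, which GitHub renders on its file page -->
[Name Entity Hub](uml/mermaid/01_name_entity_hub.mmd)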
```
---
### 3. PlantUML Diagram (`01_name_entity_hub.puml`)
**Purpose**: Comprehensive UML modeling
**Content**:
- Full UML 2.5 class diagram
- Color-coded by ontology:
  - SKOS (#E1F5FE - light blue)
  - CIDOC-CRM (#FFF3E0 - light orange)
  - CPOV (#F3E5F5 - light purple)
  - Schema.org (#E8F5E9 - light green)
- Extensive notes (500+ words of rationale)
- Method signatures
- Cardinality constraints
**Rendering**:
```bash
# Local PlantUML CLI
plantuml 01_name_entity_hub.puml
# PlantUML server
curl -X POST --data-binary @01_name_entity_hub.puml https://www.plantuml.com/plantuml/png
```
---
### 4. TypeQL Schema (`01_name_entity_hub.tql`)
**Purpose**: TypeDB knowledge graph database
**Content**:
- Entity: `name` (PERA model)
- Relations: 5 types
  - `broader-narrower` (SKOS hierarchy)
  - `name-reference` (multi-aspect connections)
  - `name-succession` (temporal chains)
  - `name-change-event` (provenance)
  - `hypernym-relationship` (taxonomy)
- Attributes: 20+ properties
- **Reasoning Rules**: 3 inference rules
  - Transitive broader/narrower
  - Current name detection
  - Organization inference from place
**Loading**:
```bash
typedb console --script 01_name_entity_hub.tql
```
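
The first reasoning rule listed above (transitive broader/narrower) amounts to a reachability computation over `broader` links. The sketch below shows that idea in plain Python rather than TypeQL; it is not the rule from the `.tql` file, and the `broader_closure` function and the three-level hierarchy are hypothetical.

```python
def broader_closure(broader: dict[str, set[str]], start: str) -> set[str]:
    """Return every concept reachable from `start` via broader links."""
    seen: set[str] = set()
    stack = list(broader.get(start, ()))
    while stack:
        concept = stack.pop()
        if concept in seen:
            continue  # already visited; also guards against cycles
        seen.add(concept)
        stack.extend(broader.get(concept, ()))
    return seen


# Hypothetical three-level hierarchy, for illustration only.
broader = {
    "mansion-house": {"mansion"},
    "mansion": {"building"},
}
print(sorted(broader_closure(broader, "mansion-house")))  # ['building', 'mansion']
```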
---
### 5. RDF/OWL Ontology (`01_name_entity_hub.ttl`)
**Purpose**: Semantic Web / Linked Open Data
**Content**:
- OWL Class: `heritage:Name`
- OWL Properties: 5 multi-aspect properties
- **SKOS Integration**: Reuses SKOS vocabulary
- **SHACL Constraints**: Cardinality, datatypes, patterns
- **PROV-O**: `heritage:NameChange` activity
- **Forward References**: Place, Organization, Collection (minimally defined)
**Usage**:
```bash
# Load into GraphDB
curl -X POST -H "Content-Type: text/turtle" --data-binary @01_name_entity_hub.ttl http://localhost:7200/repositories/heritage/statements
# Validate with RDFLib
python -c "from rdflib import Graph; g = Graph(); g.parse('01_name_entity_hub.ttl'); print(len(g))"
```
---
## 🔍 Key Features
### Multi-Aspect Pattern
**Example: Mansion House (Q1786933)**
```yaml
# LinkML Instance
- id: https://w3id.org/heritage/name/Q1786933
  prefLabel: Mansion House
  wikidata_id: Q1786933
  refers_to_place:
    - https://w3id.org/heritage/place/mansion-house-london
  refers_to_organization:
    - https://w3id.org/heritage/org/lord-mayor-residence
  refers_to_collection:
    - https://w3id.org/heritage/collection/mansion-house-art
  broader:
    - https://w3id.org/heritage/name/Q1802963  # mansion concept
```
```turtle
# RDF/Turtle
<https://w3id.org/heritage/name/Q1786933> a heritage:Name , skos:Concept ;
heritage:wikidataId "Q1786933" ;
skos:prefLabel "Mansion House"@en ;
heritage:refersToPlace <https://w3id.org/heritage/place/mansion-house-london> ;
heritage:refersToOrganization <https://w3id.org/heritage/org/lord-mayor-residence> ;
heritage:refersToCollection <https://w3id.org/heritage/collection/mansion-house-art> ;
skos:broader <https://w3id.org/heritage/name/Q1802963> .
```
```typeql
# TypeQL
$mansion-house isa name,
has name-id "https://w3id.org/heritage/name/Q1786933",
has wikidata-id "Q1786933",
has pref-label "Mansion House";
(referencing-name: $mansion-house, referenced-place: $place) isa name-reference;
(referencing-name: $mansion-house, referenced-organization: $org) isa name-reference;
(referencing-name: $mansion-house, referenced-collection: $coll) isa name-reference;
```
### Temporal Name Chains
**Example: Dutch Archive Merger (2001)**
```turtle
# Name 1: Gemeentearchief Haarlem (1910-2001)
<https://w3id.org/heritage/name/gemeentearchief-haarlem> a heritage:Name ;
skos:prefLabel "Gemeentearchief Haarlem"@nl ;
schema:validFrom "1910-01-01"^^xsd:date ;
schema:validUntil "2001-01-01"^^xsd:date ;
heritage:replacedBy <https://w3id.org/heritage/name/noord-hollands-archief> .
# Name 2: Noord-Hollands Archief (2001-present)
<https://w3id.org/heritage/name/noord-hollands-archief> a heritage:Name ;
skos:prefLabel "Noord-Hollands Archief"@nl ;
schema:validFrom "2001-01-01"^^xsd:date ;
heritage:replaces <https://w3id.org/heritage/name/gemeentearchief-haarlem> .
# Change Event
<https://w3id.org/heritage/event/nha-merger-2001> a heritage:NameChange ;
heritage:oldName <https://w3id.org/heritage/name/gemeentearchief-haarlem> ;
heritage:newName <https://w3id.org/heritage/name/noord-hollands-archief> ;
heritage:changeDate "2001-01-01"^^xsd:date ;
heritage:changeType "MERGER" .
```
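
Resolving a historical name to the one in use today means following `replacedBy` links until a name has no successor. A minimal sketch, assuming the succession links have already been loaded into a dictionary keyed by short name identifiers; the `current_name` helper is hypothetical and not part of the schema files.

```python
def current_name(replaced_by: dict[str, str], start: str) -> str:
    """Follow replaced_by links from `start` until a name has no successor."""
    seen = {start}
    name = start
    while name in replaced_by:
        name = replaced_by[name]
        if name in seen:
            raise ValueError(f"cycle in name succession at {name!r}")
        seen.add(name)
    return name


# Succession link taken from the Turtle example above.
replaced_by = {
    "gemeentearchief-haarlem": "noord-hollands-archief",
}
print(current_name(replaced_by, "gemeentearchief-haarlem"))  # noord-hollands-archief
```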
---
## 📊 UML Format Selection
Based on Exa research and industry standards:
| Format | Best For | Pros | Cons | Selected? |
|--------|----------|------|------|-----------|
| **Mermaid** | GitHub docs, quick diagrams | Simple syntax, auto-renders in GitHub, Markdown integration | Limited UML features, basic styling | ✅ YES |
| **PlantUML** | Comprehensive UML, technical docs | Full UML 2.5 support, rich annotations, mature ecosystem | Requires rendering step, verbose syntax | ✅ YES |
| **C4 Model** | System architecture, context diagrams | Software architecture focus, hierarchical levels | Not for data modeling, no class diagrams | ❌ NO (not applicable) |
| **TypeDB TypeQL** | Knowledge graph database | Built-in reasoning, graph queries, ACID transactions | Specialized syntax, requires TypeDB | ✅ YES |
| **Archimate** | Enterprise architecture | Business/IT alignment, stakeholder views | Heavyweight, not for data modeling | ❌ NO |
**Decision**: Use **Mermaid** (quick docs) + **PlantUML** (detailed UML) + **TypeQL** (executable schema)
---
## 🔄 Workflow Comparison
### Old Workflow (Bottom-Up)
```
For each of 2,453 entities:
  1. Read Wikidata metadata
  2. Analyze hypernyms
  3. Search DBpedia mappings
  4. Design multi-aspect model
  5. Write YAML ontology mapping
  6. Validate
Estimated time: 2,400+ sessions (20 min/entity × 2,453 entities)
```
### New Workflow (Top-Down)
```
Phase 1: Design Core Patterns (1-2 sessions) ✅ COMPLETE
  - Define Name entity
  - Define multi-aspect pattern
  - Create 5 schema formats
Phase 2: Extract Hypernym Taxonomy (1 session) ⏳ NEXT
  - Parse hyponyms_curated.yaml
  - Extract unique hypernyms (~20 categories)
  - Create HypernymConcept entities
Phase 3: Map Hypernyms to Ontology (1-2 sessions)
  - building → crm:E27_Site
  - organisation → cpov:PublicOrganisation
  - museum → schema:Museum + dbo:Museum
  - etc.
Phase 4: Define Entity Modules (3-4 sessions)
  - Place entity module
  - Organization entity module
  - Collection entity module
Phase 5: Batch Convert (1 session)
  - Script: convert_wikidata_to_names.py
  - Process all 2,453 entities automatically
  - Output: LinkML instances
Total estimated time: 7-10 sessions (vs. 2,400+ sessions)
Efficiency gain: ~240x faster (2,400 sessions ÷ 10 sessions)
```
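
The efficiency figure follows directly from the two session estimates above. A quick check, using only the numbers stated in this summary (the `efficiency_gain` helper is just for illustration), shows that ~240x corresponds to the upper end of the 7-10 session range, so it is the conservative figure:

```python
OLD_SESSIONS = 2_400          # "2,400+ sessions" for the bottom-up approach
NEW_SESSIONS_RANGE = (7, 10)  # "7-10 sessions" for the top-down approach


def efficiency_gain(old_sessions: int, new_sessions: int) -> float:
    """Ratio of old to new session counts."""
    return old_sessions / new_sessions


for new_sessions in NEW_SESSIONS_RANGE:
    gain = efficiency_gain(OLD_SESSIONS, new_sessions)
    print(f"{new_sessions} sessions -> ~{gain:.0f}x")
```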
---
## 📚 Documentation Created
1. **README.md** (5,000+ words)
   - Design rationale
   - Ontological justification
   - Implementation patterns
   - Temporal modeling examples
   - Next steps roadmap
2. **LinkML Schema** (400 lines)
   - Class + 24 slots
   - SKOS alignment
   - Multi-aspect properties
   - Temporal validity
   - Provenance tracking
3. **Mermaid Diagram** (70 lines)
   - Class diagram
   - Relationships
   - Notes
4. **PlantUML Diagram** (250+ lines)
   - Detailed UML
   - Color-coded ontologies
   - Extensive annotations
   - Design rationale notes
5. **TypeQL Schema** (300+ lines)
   - PERA model entities
   - 5 relation types
   - 20+ attributes
   - 3 reasoning rules
6. **RDF/OWL Ontology** (400+ lines)
   - OWL classes
   - Object properties
   - Datatype properties
   - SHACL constraints
   - PROV-O integration
**Total Documentation**: ~1,500 lines of schema + 5,000 words of explanation
---
## 🎓 Key Design Decisions
### Decision 1: Single Name Entity (Not Split)
**Rejected Approach**: Separate `PlaceName` and `OrganizationName` classes
**Rationale**:
- Many names refer to BOTH place AND organization
- Splitting creates ambiguity and duplication
- Violates Wikidata Q82799 (name is a nominal reference, not typed)
- Harder to track name changes (which entity gets the new name?)
**Chosen Approach**: Single `Name` class with multi-aspect references
---
### Decision 2: SKOS as Primary Alignment
**Options Considered**:
- `crm:E41_Appellation` (CIDOC-CRM)
- `schema:name` (property, not class)
- `owl:Thing` (too generic)
- `skos:Concept` ✅ **CHOSEN**
**Rationale**:
- SKOS provides hierarchical structure (broader/narrower)
- Multilingual support (prefLabel, altLabel with language tags)
- Temporal validity (via Schema.org properties)
- Cross-vocabulary mapping (exactMatch, closeMatch)
- Heritage domain standard (used in museum/library thesauri)
---
### Decision 3: Multi-Aspect via Properties (Not Inheritance)
**Rejected Approach**: Subclass Name into `PlaceName`, `OrganizationName`, etc.
**Rationale**:
- OOP inheritance forces single-type classification
- Real-world: names simultaneously reference multiple aspects
- Subclassing creates redundancy (same name duplicated in multiple classes)
**Chosen Approach**: Single `Name` class with aspect reference properties
```yaml
refers_to_place: Place[] # 0 or more places
refers_to_organization: Organization[] # 0 or more organizations
refers_to_collection: Collection[] # 0 or more collections
```
---
### Decision 4: Temporal Independence
**Principle**: Name, Place, Organization, Collection have **independent lifespans**
**Example**:
- Place (building): 1753 → Present (272 years)
- Organization (custodian): 1753 → Present (272 years)
- Name #1: 1753 → 1850 (97 years) "Mansion House"
- Name #2: 1850 → 2001 (151 years) "The Mansion House"
- Name #3: 2001 → Present (24 years) "Lord Mayor's Official Residence"
**Implementation**:
- Each entity tracks its own `valid_from` / `valid_to`
- Name changes via `replaces` / `replaced_by` properties
- Organization persists across name changes (same entity ID)
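
Because each name carries its own validity interval, "what was this called in year X?" becomes a simple interval lookup. A minimal sketch using the illustrative intervals from the example above: the source gives only years, so the 1 January boundaries, the half-open `[valid_from, valid_to)` convention, and the `name_valid_on` helper are all assumptions.

```python
from datetime import date
from typing import Optional


def name_valid_on(names: list[dict], on: date) -> Optional[str]:
    """Return the label whose [valid_from, valid_to) interval contains `on`."""
    for n in names:
        starts = n["valid_from"] <= on
        ends_later = n["valid_to"] is None or on < n["valid_to"]
        if starts and ends_later:
            return n["label"]
    return None


# valid_to=None means the name is still in use.
names = [
    {"label": "Mansion House", "valid_from": date(1753, 1, 1), "valid_to": date(1850, 1, 1)},
    {"label": "The Mansion House", "valid_from": date(1850, 1, 1), "valid_to": date(2001, 1, 1)},
    {"label": "Lord Mayor's Official Residence", "valid_from": date(2001, 1, 1), "valid_to": None},
]
print(name_valid_on(names, date(1900, 6, 1)))  # The Mansion House
```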
---
## 🚀 Impact & Benefits
### Immediate Benefits
1. **Clarity**: Clear separation between linguistic identifiers and entities
2. **Flexibility**: Multi-aspect modeling handles complex real-world cases
3. **Consistency**: Single pattern applied to all 2,453 entities
4. **Interoperability**: 5 schema formats ensure tool compatibility
### Medium-Term Benefits
5. **Efficiency**: Batch conversion ~240x faster than one-by-one enrichment
6. **Scalability**: Pattern-based approach extends to new hypernyms easily
7. **Reasoning**: TypeDB rules infer relationships automatically
8. **Linked Data**: RDF export enables SPARQL queries, federated search
### Long-Term Benefits
9. **Maintenance**: Schema changes propagate to all instances via patterns
10. **Evolution**: Ontology can expand without breaking existing data
11. **Community**: Standard formats enable external contributions
12. **Research**: Knowledge graph enables novel heritage research queries
---
## 📋 Next Steps
### Immediate (Session 3) - **TOP PRIORITY**
**Task**: Extract Hypernym Taxonomy from `hyponyms_curated.yaml`
**Script**: `scripts/extract_hypernyms_taxonomy.py`
**Process**:
1. Parse `data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml`
2. Extract unique values from `hypernym:` field
3. Count frequency of each hypernym
4. Create `data/ontology/hypernym_taxonomy.yaml` with:
```yaml
- hypernym: building
  count: 417
  wikidata_id: Q41176
  dbpedia_class: dbo:Building
- hypernym: organisation
  count: 193
  wikidata_id: Q43229
  dbpedia_class: dbo:Organisation
```
**Expected Output**:
- ~20-30 unique hypernyms
- Frequency distribution (most common: building, organisation, museum)
- Foundation for ontology class mapping
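
The counting step (steps 2 and 3 of the process above) could look roughly like this. It is a sketch, not the contents of `scripts/extract_hypernyms_taxonomy.py`: the layout of `hyponyms_curated.yaml` is not shown in this summary, so the code assumes the file has already been parsed into a list of dictionaries with an optional `hypernym` key, and the sample entries are hypothetical.

```python
from collections import Counter


def hypernym_frequencies(entries: list[dict]) -> list[tuple[str, int]]:
    """Count entries per hypernym, most frequent first."""
    counts = Counter(e["hypernym"] for e in entries if e.get("hypernym"))
    return counts.most_common()


# Hypothetical entries; the real file layout is an assumption.
entries = [
    {"id": "Q1", "hypernym": "building"},
    {"id": "Q2", "hypernym": "museum"},
    {"id": "Q3", "hypernym": "building"},
    {"id": "Q4"},  # entries without a hypernym are skipped
]
print(hypernym_frequencies(entries))  # [('building', 2), ('museum', 1)]
```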
---
### Medium-Term (This Week)
**Task 2**: Map Hypernyms to Ontology Classes
**Module**: `schemas/20251121/linkml/02_hypernym_taxonomy.yaml`
**Content**:
- `HypernymConcept` class definitions
- Ontology mappings for each hypernym:
  - building → `crm:E27_Site` + `dbo:Building`
  - organisation → `cpov:PublicOrganisation` + `schema:Organization`
  - museum → `schema:Museum` + `dbo:Museum`
  - archive → `rico:CorporateBody` + `dbo:Archive`
**Task 3**: Create Place, Organization, Collection Entity Modules
**Modules**:
- `03_place_entity.yaml` (spatial aspect)
- `04_organization_entity.yaml` (custodian aspect)
- `05_collection_entity.yaml` (heritage materials aspect)
**Each module includes**:
- LinkML schema
- Mermaid diagram
- PlantUML diagram
- TypeQL schema
- RDF/OWL ontology
---
### Long-Term (Next Month)
**Task 4**: Batch Convert Wikidata Entities
**Script**: `scripts/convert_wikidata_to_names.py`
**Input**: `data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml` (2,453 entities)
**Output**: `data/instances/names/*.yaml` (LinkML instances, 1 per entity)
**Process**:
- For each Wikidata entity:
  - Extract label → `prefLabel`
  - Extract aliases → `altLabel`
  - Extract hypernym → link to HypernymConcept
  - Generate ID → `https://w3id.org/heritage/name/Q[NUMBER]`
  - Add provenance → `source`, `created`, `wikidata_id`
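
The per-entity mapping above can be sketched as a single function. This is an illustration only, not the planned `convert_wikidata_to_names.py`: the input keys (`wikidata_id`, `label`, `aliases`, `hypernym`), the `to_name_instance` name, and the `"wikidata"` source value are assumptions, and the `building` hypernym in the sample record is hypothetical.

```python
def to_name_instance(entity: dict, created: str) -> dict:
    """Build a Name instance dict from one Wikidata-style record."""
    qid = entity["wikidata_id"]
    return {
        "id": f"https://w3id.org/heritage/name/{qid}",
        "prefLabel": entity["label"],
        "altLabel": list(entity.get("aliases", [])),
        "hypernym": entity.get("hypernym"),  # link target for HypernymConcept
        # Provenance fields
        "source": "wikidata",
        "created": created,
        "wikidata_id": qid,
    }


# Sample record; the hypernym value is illustrative.
record = {"wikidata_id": "Q1786933", "label": "Mansion House", "hypernym": "building"}
instance = to_name_instance(record, created="2025-11-21")
print(instance["id"])  # https://w3id.org/heritage/name/Q1786933
```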
**Task 5**: Load into TypeDB Knowledge Graph
**Commands**:
```bash
# Start TypeDB
typedb server
# Load schema
typedb console --script schemas/20251121/typeql/01_name_entity_hub.tql
# Load instances
python scripts/load_instances_to_typedb.py
```
**Task 6**: Export to RDF Triple Store
**Process**:
- Convert LinkML instances to RDF/Turtle
- Load into GraphDB / Virtuoso / Blazegraph
- Create SPARQL endpoint
- Publish as Linked Open Data
---
## ✅ Session Completion Checklist
- [x] Research UML formats (Mermaid, PlantUML, C4, TypeDB)
- [x] Design Name entity as central hub
- [x] Create LinkML schema (01_name_entity.yaml)
- [x] Create Mermaid diagram (01_name_entity_hub.mmd)
- [x] Create PlantUML diagram (01_name_entity_hub.puml)
- [x] Create TypeQL schema (01_name_entity_hub.tql)
- [x] Create RDF/OWL ontology (01_name_entity_hub.ttl)
- [x] Validate LinkML schema (YAML syntax)
- [x] Document design rationale (README.md, 5,000+ words)
- [x] Define multi-aspect pattern
- [x] Define temporal name chains
- [x] Document next steps (hypernym extraction)
- [ ] ⏳ Extract hypernym taxonomy (next session)
- [ ] ⏳ Map hypernyms to ontology classes
---
## 📊 Progress Metrics
### Overall Project Progress
| Metric | Count | Status |
|--------|-------|--------|
| **Wikidata Entities** | 2,453 | Pending batch conversion |
| **Name Entity Schema** | 1 module | ✅ COMPLETE |
| **Schema Formats** | 5 (LinkML, Mermaid, PlantUML, TypeQL, RDF) | ✅ COMPLETE |
| **Classes Defined** | 1 (Name) | ✅ COMPLETE |
| **Properties Defined** | 24 slots | ✅ COMPLETE |
| **Reasoning Rules** | 3 (TypeQL) | ✅ COMPLETE |
| **Documentation** | 6,500+ words | ✅ COMPLETE |
### Efficiency Gain
- **Old Approach**: 2,400+ sessions (5 entities done, 2,448 remaining)
- **New Approach**: ~10 sessions (foundation + hypernym mapping + entity modules + batch conversion)
- **Efficiency Gain**: **240x faster** 🚀
---
## 📚 References
### Standards
- [SKOS Reference](https://www.w3.org/TR/skos-reference/)
- [CIDOC-CRM v7.1.3](http://www.cidoc-crm.org/)
- [CPOV](https://joinup.ec.europa.eu/collection/semantic-interoperability-community-semic/solution/core-public-organisation-vocabulary)
- [Schema.org](https://schema.org/)
- [PROV-O](https://www.w3.org/TR/prov-o/)
- [Wikidata Q82799](https://www.wikidata.org/wiki/Q82799)
### Tools
- [LinkML](https://linkml.io/linkml/)
- [Mermaid](https://mermaid.js.org/)
- [PlantUML](https://plantuml.com/)
- [TypeDB](https://typedb.com/)
- [GraphDB](https://graphdb.ontotext.com/)
### Project Files
- Schema Dir: `/schemas/20251121/`
  - LinkML: `linkml/01_name_entity.yaml`
  - Mermaid: `uml/mermaid/01_name_entity_hub.mmd`
  - PlantUML: `uml/plantuml/01_name_entity_hub.puml`
  - TypeQL: `typeql/01_name_entity_hub.tql`
  - RDF/OWL: `rdf/01_name_entity_hub.ttl`
  - README: `README.md`
---
**Session Status**: ✅ COMPLETE
**Next Session Focus**: Extract hypernym taxonomy + map to ontology classes
**Overall Strategy**: Top-down ontology design (240x more efficient)