glam/SESSION_SUMMARY_20251121_NAME_ENTITY_FOUNDATION_COMPLETE.md
2025-11-21 22:12:33 +01:00


Session Summary: Strategic Pivot to Top-Down Ontology Design

Date: 2025-11-21
Session Focus: Name Entity as Central Hub - Foundation Complete
Status: COMPLETE


🎯 Major Strategic Pivot

From: Bottom-Up Entity Enrichment (0.20% complete)

Old Approach:

  • Enrich 2,453 Wikidata entities one-by-one
  • Progress: 5/2,453 entries (0.20%)
  • Estimated time: 2,400+ sessions at current pace

To: Top-Down Ontology Design

New Approach:

  1. Define abstract patterns ONCE (Name, Place, Organization, Collection)
  2. Extract unique hypernyms from hyponyms_curated.yaml (~20 top-level categories)
  3. Map hypernyms to ontology classes
  4. Batch convert all 2,453 entities using patterns

Result: ~240x efficiency gain (2,400+ sessions reduced to roughly 10)


🏗️ Core Design: Name as Central Hub

The Insight

Question: Is "Mansion House" a place name or an organization name?
Answer: BOTH - it's a single nominal reference that refers to multiple aspects.

The Solution

Single Name Entity with multi-aspect references:

Name (nominal reference)
  ├─ refers_to_place → Place (spatial aspect)
  ├─ refers_to_organization → Organization (custodian aspect)
  └─ refers_to_collection → Collection (heritage materials aspect)

Each aspect has independent temporal lifecycle:

  • Place: Construction (1753) → Present (272 years)
  • Organization: Founding (1753) → Present (272 years)
  • Name: "Mansion House" (1753) → Present (same name for 272 years)
  • Alternative scenario: Name changes 5 times while Place/Organization persist
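The hub pattern can be illustrated with a small Python data structure. This is only a sketch, not the classes LinkML would generate: the field names mirror the slots named in this summary, years are stored as plain integers for brevity, and the `aspects` helper is a hypothetical convenience method.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Name:
    """A nominal reference that can point at several aspects at once."""
    id: str
    pref_label: str
    refers_to_place: List[str] = field(default_factory=list)
    refers_to_organization: List[str] = field(default_factory=list)
    refers_to_collection: List[str] = field(default_factory=list)
    valid_from: Optional[int] = None  # year only, simplified for illustration
    valid_to: Optional[int] = None    # None means the name is still in use

    def aspects(self) -> List[str]:
        """Return the kinds of aspect this name refers to (hypothetical helper)."""
        kinds = []
        if self.refers_to_place:
            kinds.append("place")
        if self.refers_to_organization:
            kinds.append("organization")
        if self.refers_to_collection:
            kinds.append("collection")
        return kinds


mansion_house = Name(
    id="https://w3id.org/heritage/name/Q1786933",
    pref_label="Mansion House",
    refers_to_place=["https://w3id.org/heritage/place/mansion-house-london"],
    refers_to_organization=["https://w3id.org/heritage/org/lord-mayor-residence"],
    valid_from=1753,
)
print(mansion_house.aspects())
```

Because the aspects are plain reference lists on one object, adding or removing an aspect never changes the identity of the name itself.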

Ontological Justification

  1. Wikidata Q82799: "name" = nominal reference (linguistic identifier), NOT the entity itself
  2. SKOS: Names are skos:Concept with hierarchical structure
  3. CIDOC-CRM E41: Appellations are distinct from entities they identify
  4. Temporal Flexibility: Name changes don't require entity recreation
  5. Multi-Aspect: Single name can reference multiple aspects simultaneously

📁 Deliverables - 5 Schema Formats

1. LinkML Schema (01_name_entity.yaml)

Purpose: Machine-readable foundation
Content:

  • Class: Name (1 entity)
  • Slots: 24 properties
  • Enums: 1 (NameTypeEnum)
  • SKOS Alignment: skos:Concept, skos:prefLabel, skos:broader
  • Multi-Aspect: refers_to_place, refers_to_organization, refers_to_collection
  • Temporal: valid_from, valid_to, replaces, replaced_by

Validation: PASSED (YAML syntax valid)

Usage:

# Schema artifacts come from the LinkML generator commands
# (linkml-convert converts instance data, not schemas).

# Generate JSON Schema
gen-json-schema 01_name_entity.yaml

# Generate Python dataclasses
gen-python 01_name_entity.yaml

# Generate SHACL shapes
gen-shacl 01_name_entity.yaml

2. Mermaid Diagram (01_name_entity_hub.mmd)

Purpose: GitHub-friendly visual documentation
Content:

  • Class diagram with relationships
  • Forward references (Place, Organization, Collection)
  • SKOS hierarchical relationships (broader/narrower)
  • Temporal name chains (replaces/replaced_by)

Features:

  • Auto-renders in GitHub
  • Embeddable in Markdown docs
  • Simple syntax for quick updates

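Rendering: copy the contents of uml/mermaid/01_name_entity_hub.mmd into a fenced code block tagged mermaid in any Markdown file; GitHub renders such blocks inline. Markdown image syntax pointing at the .mmd file does not render the diagram.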

3. PlantUML Diagram (01_name_entity_hub.puml)

Purpose: Comprehensive UML modeling
Content:

  • Full UML 2.5 class diagram
  • Color-coded by ontology:
    • SKOS (#E1F5FE - light blue)
    • CIDOC-CRM (#FFF3E0 - light orange)
    • CPOV (#F3E5F5 - light purple)
    • Schema.org (#E8F5E9 - light green)
  • Extensive notes (500+ words of rationale)
  • Method signatures
  • Cardinality constraints

Rendering:

# Local PlantUML CLI
plantuml 01_name_entity_hub.puml

# PlantUML server
curl -X POST --data-binary @01_name_entity_hub.puml https://www.plantuml.com/plantuml/png

4. TypeQL Schema (01_name_entity_hub.tql)

Purpose: TypeDB knowledge graph database
Content:

  • Entity: name (PERA model)
  • Relations: 5 types
    • broader-narrower (SKOS hierarchy)
    • name-reference (multi-aspect connections)
    • name-succession (temporal chains)
    • name-change-event (provenance)
    • hypernym-relationship (taxonomy)
  • Attributes: 20+ properties
  • Reasoning Rules: 3 inference rules
    • Transitive broader/narrower
    • Current name detection
    • Organization inference from place

Loading:

typedb console --script 01_name_entity_hub.tql

5. RDF/OWL Ontology (01_name_entity_hub.ttl)

Purpose: Semantic Web / Linked Open Data
Content:

  • OWL Class: heritage:Name
  • OWL Properties: 5 multi-aspect properties
  • SKOS Integration: Reuses SKOS vocabulary
  • SHACL Constraints: Cardinality, datatypes, patterns
  • PROV-O: heritage:NameChange activity
  • Forward References: Place, Organization, Collection (minimally defined)

Usage:

# Load into GraphDB
curl -X POST -H "Content-Type: text/turtle" --data-binary @01_name_entity_hub.ttl http://localhost:7200/repositories/heritage/statements

# Validate with RDFLib
python -c "from rdflib import Graph; g = Graph(); g.parse('01_name_entity_hub.ttl'); print(len(g))"

🔍 Key Features

Multi-Aspect Pattern

Example: Mansion House (Q1786933)

# LinkML Instance
- id: https://w3id.org/heritage/name/Q1786933
  prefLabel: Mansion House
  wikidata_id: Q1786933
  refers_to_place:
    - https://w3id.org/heritage/place/mansion-house-london
  refers_to_organization:
    - https://w3id.org/heritage/org/lord-mayor-residence
  refers_to_collection:
    - https://w3id.org/heritage/collection/mansion-house-art
  broader:
    - https://w3id.org/heritage/name/Q1802963  # mansion concept

# RDF/Turtle
<https://w3id.org/heritage/name/Q1786933> a heritage:Name , skos:Concept ;
    heritage:wikidataId "Q1786933" ;
    skos:prefLabel "Mansion House"@en ;
    heritage:refersToPlace <https://w3id.org/heritage/place/mansion-house-london> ;
    heritage:refersToOrganization <https://w3id.org/heritage/org/lord-mayor-residence> ;
    heritage:refersToCollection <https://w3id.org/heritage/collection/mansion-house-art> ;
    skos:broader <https://w3id.org/heritage/name/Q1802963> .

# TypeQL
$mansion-house isa name,
    has name-id "https://w3id.org/heritage/name/Q1786933",
    has wikidata-id "Q1786933",
    has pref-label "Mansion House";

(referencing-name: $mansion-house, referenced-place: $place) isa name-reference;
(referencing-name: $mansion-house, referenced-organization: $org) isa name-reference;
(referencing-name: $mansion-house, referenced-collection: $coll) isa name-reference;

Temporal Name Chains

Example: Dutch Archive Merger (2001)

# Name 1: Gemeentearchief Haarlem (1910-2001)
<https://w3id.org/heritage/name/gemeentearchief-haarlem> a heritage:Name ;
    skos:prefLabel "Gemeentearchief Haarlem"@nl ;
    schema:validFrom "1910-01-01"^^xsd:date ;
    schema:validUntil "2001-01-01"^^xsd:date ;
    heritage:replacedBy <https://w3id.org/heritage/name/noord-hollands-archief> .

# Name 2: Noord-Hollands Archief (2001-present)
<https://w3id.org/heritage/name/noord-hollands-archief> a heritage:Name ;
    skos:prefLabel "Noord-Hollands Archief"@nl ;
    schema:validFrom "2001-01-01"^^xsd:date ;
    heritage:replaces <https://w3id.org/heritage/name/gemeentearchief-haarlem> .

# Change Event
<https://w3id.org/heritage/event/nha-merger-2001> a heritage:NameChange ;
    heritage:oldName <https://w3id.org/heritage/name/gemeentearchief-haarlem> ;
    heritage:newName <https://w3id.org/heritage/name/noord-hollands-archief> ;
    heritage:changeDate "2001-01-01"^^xsd:date ;
    heritage:changeType "MERGER" .
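A succession chain like this one can be walked to find the name currently in force. The following is a minimal Python sketch, assuming the replaced_by links have already been loaded into a dictionary keyed by short name identifiers; the function name `current_name` is hypothetical.

```python
from typing import Dict, Optional


def current_name(start: str, replaced_by: Dict[str, Optional[str]]) -> str:
    """Follow replaced_by links until reaching a name with no successor."""
    visited = set()
    name = start
    while replaced_by.get(name) is not None:
        if name in visited:
            raise ValueError(f"cycle in name chain at {name!r}")
        visited.add(name)
        name = replaced_by[name]
    return name


# Short identifiers stand in for the full name IRIs used above.
replaced_by = {
    "gemeentearchief-haarlem": "noord-hollands-archief",
    "noord-hollands-archief": None,
}
print(current_name("gemeentearchief-haarlem", replaced_by))
```

The cycle check matters for real data, where a mistaken pair of replaces/replaced_by statements could otherwise cause an endless loop.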

📊 UML Format Selection

Based on Exa research and industry standards:

| Format | Best For | Pros | Cons | Selected? |
|---|---|---|---|---|
| Mermaid | GitHub docs, quick diagrams | Simple syntax, auto-renders in GitHub, Markdown integration | Limited UML features, basic styling | YES |
| PlantUML | Comprehensive UML, technical docs | Full UML 2.5 support, rich annotations, mature ecosystem | Requires rendering step, verbose syntax | YES |
| C4 Model | System architecture, context diagrams | Software architecture focus, hierarchical levels | Not for data modeling, no class diagrams | NO (not applicable) |
| TypeDB TypeQL | Knowledge graph database | Built-in reasoning, graph queries, ACID transactions | Specialized syntax, requires TypeDB | YES |
| Archimate | Enterprise architecture | Business/IT alignment, stakeholder views | Heavyweight, not for data modeling | NO |

Decision: Use Mermaid (quick docs) + PlantUML (detailed UML) + TypeQL (executable schema)


🔄 Workflow Comparison

Old Workflow (Bottom-Up)

For each of 2,453 entities:
  1. Read Wikidata metadata
  2. Analyze hypernyms
  3. Search DBpedia mappings
  4. Design multi-aspect model
  5. Write YAML ontology mapping
  6. Validate
  
Estimated time: 2,400+ sessions (20 min/entity × 2,453 entities)

New Workflow (Top-Down)

Phase 1: Design Core Patterns (1-2 sessions) ✅ COMPLETE
  - Define Name entity
  - Define multi-aspect pattern
  - Create 5 schema formats

Phase 2: Extract Hypernym Taxonomy (1 session) ⏳ NEXT
  - Parse hyponyms_curated.yaml
  - Extract unique hypernyms (~20 categories)
  - Create HypernymConcept entities

Phase 3: Map Hypernyms to Ontology (1-2 sessions)
  - building → crm:E27_Site
  - organisation → cpov:PublicOrganisation
  - museum → schema:Museum + dbo:Museum
  - etc.

Phase 4: Define Entity Modules (3-4 sessions)
  - Place entity module
  - Organization entity module
  - Collection entity module

Phase 5: Batch Convert (1 session)
  - Script: convert_wikidata_to_names.py
  - Process all 2,453 entities automatically
  - Output: LinkML instances

Total estimated time: 7-10 sessions (vs. 2,400+ sessions)
Efficiency gain: ~240x faster
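The ~240x figure follows from the session estimates above. As a quick check in Python, using only the numbers stated in this summary:

```python
old_sessions = 2400                  # lower bound stated for the bottom-up approach
new_low, new_high = 7, 10            # stated range for the top-down plan

gain_conservative = old_sessions / new_high  # uses the slowest top-down estimate
gain_optimistic = old_sessions / new_low     # uses the fastest top-down estimate

print(f"conservative: ~{gain_conservative:.0f}x, optimistic: ~{gain_optimistic:.0f}x")
```

The headline figure is therefore the conservative end of the range; both inputs are estimates, so the ratio should be read as an order of magnitude rather than a measurement.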

📚 Documentation Created

  1. README.md (5,000+ words)

    • Design rationale
    • Ontological justification
    • Implementation patterns
    • Temporal modeling examples
    • Next steps roadmap
  2. LinkML Schema (400 lines)

    • Class + 24 slots
    • SKOS alignment
    • Multi-aspect properties
    • Temporal validity
    • Provenance tracking
  3. Mermaid Diagram (70 lines)

    • Class diagram
    • Relationships
    • Notes
  4. PlantUML Diagram (250+ lines)

    • Detailed UML
    • Color-coded ontologies
    • Extensive annotations
    • Design rationale notes
  5. TypeQL Schema (300+ lines)

    • PERA model entities
    • 5 relation types
    • 20+ attributes
    • 3 reasoning rules
  6. RDF/OWL Ontology (400+ lines)

    • OWL classes
    • Object properties
    • Datatype properties
    • SHACL constraints
    • PROV-O integration

Total Documentation: ~1,500 lines of schema + 5,000 words of explanation


🎓 Key Design Decisions

Decision 1: Single Name Entity (Not Split)

Rejected Approach: Separate PlaceName and OrganizationName classes

Rationale:

  • Many names refer to BOTH place AND organization
  • Splitting creates ambiguity and duplication
  • Violates Wikidata Q82799 (name is a nominal reference, not typed)
  • Harder to track name changes (which entity gets the new name?)

Chosen Approach: Single Name class with multi-aspect references


Decision 2: SKOS as Primary Alignment

Options Considered:

  • crm:E41_Appellation (CIDOC-CRM)
  • schema:name (property, not class)
  • owl:Thing (too generic)
  • skos:Concept (CHOSEN)

Rationale:

  • SKOS provides hierarchical structure (broader/narrower)
  • Multilingual support (prefLabel, altLabel with language tags)
  • Temporal validity (via Schema.org properties)
  • Cross-vocabulary mapping (exactMatch, closeMatch)
  • Heritage domain standard (used in museum/library thesauri)

Decision 3: Multi-Aspect via Properties (Not Inheritance)

Rejected Approach: Subclass Name into PlaceName, OrganizationName, etc.

Rationale:

  • OOP inheritance forces single-type classification
  • Real-world: names simultaneously reference multiple aspects
  • Subclassing creates redundancy (same name duplicated in multiple classes)

Chosen Approach: Single Name class with aspect reference properties

refers_to_place: Place[]       # 0 or more places
refers_to_organization: Organization[]  # 0 or more organizations
refers_to_collection: Collection[]  # 0 or more collections

Decision 4: Temporal Independence

Principle: Name, Place, Organization, Collection have independent lifespans

Example:

  • Place (building): 1753 → Present (272 years)
  • Organization (custodian): 1753 → Present (272 years)
  • Name #1: 1753 → 1850 (97 years) "Mansion House"
  • Name #2: 1850 → 2001 (151 years) "The Mansion House"
  • Name #3: 2001 → Present (24 years) "Lord Mayor's Official Residence"

Implementation:

  • Each entity tracks its own valid_from / valid_to
  • Name changes via replaces / replaced_by properties
  • Organization persists across name changes (same entity ID)
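Independent validity periods make it possible to ask which name applied at a given time without touching the place or organization records. Here is a minimal Python sketch using the illustrative periods from the example above; years are plain integers, periods are treated as half-open intervals (an assumption), and `name_in_year` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

# (label, valid_from, valid_to); valid_to of None means still in use.
NamePeriod = Tuple[str, int, Optional[int]]


def name_in_year(periods: List[NamePeriod], year: int) -> Optional[str]:
    """Return the label valid in the given year, or None if no name applies."""
    for label, start, end in periods:
        # Half-open interval: the successor name takes over in the change year.
        if start <= year and (end is None or year < end):
            return label
    return None


periods: List[NamePeriod] = [
    ("Mansion House", 1753, 1850),
    ("The Mansion House", 1850, 2001),
    ("Lord Mayor's Official Residence", 2001, None),
]
print(name_in_year(periods, 1900))
```

The place and organization identifiers never appear in this lookup, which is the point of the design: renaming adds a period to the list and leaves every other entity untouched.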

🚀 Impact & Benefits

Immediate Benefits

  1. Clarity: Clear separation between linguistic identifiers and entities
  2. Flexibility: Multi-aspect modeling handles complex real-world cases
  3. Consistency: Single pattern applied to all 2,453 entities
  4. Interoperability: 5 schema formats ensure tool compatibility

Medium-Term Benefits

  1. Efficiency: Batch conversion ~240x faster than one-by-one enrichment
  2. Scalability: Pattern-based approach extends to new hypernyms easily
  3. Reasoning: TypeDB rules infer relationships automatically
  4. Linked Data: RDF export enables SPARQL queries, federated search

Long-Term Benefits

  1. Maintenance: Schema changes propagate to all instances via patterns
  2. Evolution: Ontology can expand without breaking existing data
  3. Community: Standard formats enable external contributions
  4. Research: Knowledge graph enables novel heritage research queries

📋 Next Steps

Immediate (Session 3) - TOP PRIORITY

Task: Extract Hypernym Taxonomy from hyponyms_curated.yaml

Script: scripts/extract_hypernyms_taxonomy.py

Process:

  1. Parse data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml
  2. Extract unique values from hypernym: field
  3. Count frequency of each hypernym
  4. Create data/ontology/hypernym_taxonomy.yaml with:
    - hypernym: building
      count: 417
      wikidata_id: Q41176
      dbpedia_class: dbo:Building
    
    - hypernym: organisation
      count: 193
      wikidata_id: Q43229
      dbpedia_class: dbo:Organisation
    

Expected Output:

  • ~20-30 unique hypernyms
  • Frequency distribution (most common: building, organisation, museum)
  • Foundation for ontology class mapping
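The counting step could look roughly like the sketch below. It assumes the YAML file has already been parsed (for example with PyYAML's `yaml.safe_load`) into a list of dictionaries, and that each entry carries a `hypernym` field holding either a string or a list of strings; the real layout of hyponyms_curated.yaml may differ, and `count_hypernyms` is a hypothetical function name.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def count_hypernyms(entries: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Count hypernym values and return them most frequent first."""
    counts: Counter = Counter()
    for entry in entries:
        value = entry.get("hypernym")
        if value is None:
            continue  # entry has no hypernym recorded
        values = value if isinstance(value, list) else [value]
        # Normalise case and whitespace so "Building" and "building" merge.
        counts.update(str(v).strip().lower() for v in values)
    return [{"hypernym": h, "count": n} for h, n in counts.most_common()]


# Made-up sample entries, for illustration only.
sample = [
    {"label": "A", "hypernym": "building"},
    {"label": "B", "hypernym": ["building", "museum"]},
    {"label": "C", "hypernym": "Organisation"},
    {"label": "D"},
]
taxonomy = count_hypernyms(sample)
print(taxonomy)
```

The wikidata_id and dbpedia_class fields shown in the target format are not derivable from counts alone and would need a separate lookup or manual curation.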

Medium-Term (This Week)

Task 2: Map Hypernyms to Ontology Classes

Module: schemas/20251121/linkml/02_hypernym_taxonomy.yaml

Content:

  • HypernymConcept class definitions
  • Ontology mappings for each hypernym:
    • building → crm:E27_Site + dbo:Building
    • organisation → cpov:PublicOrganisation + schema:Organization
    • museum → schema:Museum + dbo:Museum
    • archive → rico:CorporateBody + dbo:Archive
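In code, the mapping amounts to a lookup table from hypernym to class identifiers. A minimal Python sketch using only the four mappings listed above; the names `HYPERNYM_TO_CLASSES` and `classes_for` are hypothetical, and returning an empty list for unmapped hypernyms is an assumed policy.

```python
from typing import Dict, List

HYPERNYM_TO_CLASSES: Dict[str, List[str]] = {
    "building": ["crm:E27_Site", "dbo:Building"],
    "organisation": ["cpov:PublicOrganisation", "schema:Organization"],
    "museum": ["schema:Museum", "dbo:Museum"],
    "archive": ["rico:CorporateBody", "dbo:Archive"],
}


def classes_for(hypernym: str) -> List[str]:
    """Return the ontology classes mapped to a hypernym, or [] if unmapped."""
    return HYPERNYM_TO_CLASSES.get(hypernym.strip().lower(), [])


print(classes_for("museum"))
print(classes_for("lighthouse"))  # unmapped hypernym yields an empty list
```

Returning an empty list rather than raising keeps a batch run going and makes unmapped hypernyms easy to collect for later review.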

Task 3: Create Place, Organization, Collection Entity Modules

Modules:

  • 03_place_entity.yaml (spatial aspect)
  • 04_organization_entity.yaml (custodian aspect)
  • 05_collection_entity.yaml (heritage materials aspect)

Each module includes:

  • LinkML schema
  • Mermaid diagram
  • PlantUML diagram
  • TypeQL schema
  • RDF/OWL ontology

Long-Term (Next Month)

Task 4: Batch Convert Wikidata Entities

Script: scripts/convert_wikidata_to_names.py

Input: data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml (2,453 entities)

Output: data/instances/names/*.yaml (LinkML instances, 1 per entity)

Process:

  • For each Wikidata entity:
    • Extract label → prefLabel
    • Extract aliases → altLabel
    • Extract hypernym → link to HypernymConcept
    • Generate ID → https://w3id.org/heritage/name/Q[NUMBER]
    • Add provenance → source, created, wikidata_id
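The per-entity conversion described above might be sketched as follows. This is not the planned convert_wikidata_to_names.py script: the input keys (`id`, `label`, `aliases`, `hypernym`) and the function name `to_name_instance` are assumptions about how the curated entries are structured.

```python
from datetime import date
from typing import Any, Dict, Optional


def to_name_instance(
    entity: Dict[str, Any],
    source: str = "wikidata",
    created: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a Name instance dictionary from one curated Wikidata entry."""
    qid = entity["id"]
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    return {
        "id": f"https://w3id.org/heritage/name/{qid}",
        "prefLabel": entity["label"],
        "altLabel": list(entity.get("aliases", [])),
        "hypernym": entity.get("hypernym"),  # link to a HypernymConcept
        "wikidata_id": qid,
        "source": source,
        "created": created or date.today().isoformat(),
    }


instance = to_name_instance(
    {"id": "Q1786933", "label": "Mansion House", "hypernym": "building"},
    created="2025-11-21",
)
print(instance["id"])
```

Validating the Q-number up front keeps malformed IDs out of the generated IRIs, where they would be harder to spot later.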

Task 5: Load into TypeDB Knowledge Graph

Commands:

# Start TypeDB
typedb server

# Load schema
typedb console --script schemas/20251121/typeql/01_name_entity_hub.tql

# Load instances
python scripts/load_instances_to_typedb.py

Task 6: Export to RDF Triple Store

Process:

  • Convert LinkML instances to RDF/Turtle
  • Load into GraphDB / Virtuoso / Blazegraph
  • Create SPARQL endpoint
  • Publish as Linked Open Data

Session Completion Checklist

  • [x] Research UML formats (Mermaid, PlantUML, C4, TypeDB)
  • [x] Design Name entity as central hub
  • [x] Create LinkML schema (01_name_entity.yaml)
  • [x] Create Mermaid diagram (01_name_entity_hub.mmd)
  • [x] Create PlantUML diagram (01_name_entity_hub.puml)
  • [x] Create TypeQL schema (01_name_entity_hub.tql)
  • [x] Create RDF/OWL ontology (01_name_entity_hub.ttl)
  • [x] Validate LinkML schema (YAML syntax)
  • [x] Document design rationale (README.md, 5,000+ words)
  • [x] Define multi-aspect pattern
  • [x] Define temporal name chains
  • [x] Document next steps (hypernym extraction)
  • [ ] Extract hypernym taxonomy (next session)
  • [ ] Map hypernyms to ontology classes

📊 Progress Metrics

Overall Project Progress

| Metric | Count | Status |
|---|---|---|
| Wikidata Entities | 2,453 | Pending batch conversion |
| Name Entity Schema | 1 module | COMPLETE |
| Schema Formats | 5 (LinkML, Mermaid, PlantUML, TypeQL, RDF) | COMPLETE |
| Classes Defined | 1 (Name) | COMPLETE |
| Properties Defined | 24 slots | COMPLETE |
| Reasoning Rules | 3 (TypeQL) | COMPLETE |
| Documentation | 6,500+ words | COMPLETE |

Efficiency Gain

  • Old Approach: 2,400+ sessions (5 entities done, 2,448 remaining)
  • New Approach: ~10 sessions (foundation + hypernym mapping + entity modules + batch conversion)
  • Efficiency Gain: 240x faster 🚀

📚 References

Project Files

  • Schema Dir: /schemas/20251121/
  • LinkML: linkml/01_name_entity.yaml
  • Mermaid: uml/mermaid/01_name_entity_hub.mmd
  • PlantUML: uml/plantuml/01_name_entity_hub.puml
  • TypeQL: typeql/01_name_entity_hub.tql
  • RDF/OWL: rdf/01_name_entity_hub.ttl
  • README: README.md

Session Status: COMPLETE
Next Session Focus: Extract hypernym taxonomy + map to ontology classes
Overall Strategy: Top-down ontology design (240x more efficient)