
Hub Architecture Implementation - Next Steps

Session Date: 2025-11-21
Status: Core Implementation Complete, Validation In Progress


Completed Tasks (This Session)

1. PlantUML Diagram Generation

  • File: schemas/20251121/uml/plantuml/custodian_hub_FINAL.puml (6.5 KB)
  • Generated using the fixed generate_plantuml_modular.py
  • Contains all hub connections

2. Additional RDF Format Generation

  • JSON-LD: rdf/custodian_hub.jsonld (257 B)
  • N-Triples: rdf/custodian_hub.nt (266 KB)
  • Existing Turtle/RDF: rdf/custodian_hub_FINAL.ttl (90 KB)

3. Example Instance Files Created

  • examples/valid_custodian_hub.yaml - Minimal Custodian hub
  • examples/valid_observation.yaml - CustodianObservation with proper structure
  • examples/valid_reconstruction.yaml - CustodianReconstruction with PROV-O

4. Mermaid Diagram Verification

  • File: uml/mermaid/custodian_hub_v5_FINAL.mmd (3.6 KB)
  • Verified: Shows 3 hub connections:
    CustodianReconstruction ||--|| Custodian : "refers_to_custodian"
    CustodianName           ||--|| Custodian : "refers_to_custodian"  
    CustodianObservation    ||--|| Custodian : "refers_to_custodian"
    

In-Progress Tasks

5. LinkML Instance Validation

Issue: linkml-validate rejects the example instances; the observed behavior does not match the expected schema validation pattern (see the error below).

Created Example Files:

```yaml
# valid_custodian_hub.yaml
hc_id: https://nde.nl/ontology/hc/nl-nh-ams-m-rm-q190804
created: "2024-11-21T10:00:00Z"
modified: "2024-11-21T10:00:00Z"
```

```yaml
# valid_observation.yaml
refers_to_custodian: https://nde.nl/ontology/hc/nl-nh-ams-m-rm-q190804
observed_name:
  appellation_value: "Rijksmuseum Amsterdam"
  appellation_language: "nl"
source:
  source_uri: "https://rijksmuseum.nl"
  source_creator: "Rijksmuseum"
  source_date: "2024-11-21"
observation_date: "2024-11-21"
observation_source: "rijksmuseum.nl official website"
```

```yaml
# valid_reconstruction.yaml
refers_to_custodian: https://nde.nl/ontology/hc/nl-nh-ams-m-rm-q190804
entity_type: ORGANIZATION
legal_name: "Stichting Rijksmuseum"
legal_form: "V44D"
was_derived_from: [...]
was_generated_by: {...}
```

Validation Attempts:

```bash
# Command used:
linkml-validate -C Custodian -s linkml/01_custodian_name_modular.yaml examples/valid_custodian_hub.yaml

# Error received:
[ERROR] Additional properties are not allowed ('created', 'hc_id', 'modified' were unexpected) in /
```

Root Cause Analysis:

  • The validation tool may require a different instance format
  • Possible container class needed at top level
  • May need to use --legacy-mode flag
  • Alternative: Use Python linkml-runtime directly for validation

Recommended Solutions:

  1. Try legacy mode:

    linkml-validate --legacy-mode -s schema.yaml -C Custodian data.yaml
    
  2. Use Python validation:

    # Requires the Python classes generated from the schema, e.g.:
    #   gen-python linkml/01_custodian_name_modular.yaml > custodian_model.py
    from linkml_runtime.loaders import yaml_loader
    from custodian_model import Custodian
    
    instance = yaml_loader.load("examples/valid_custodian_hub.yaml", target_class=Custodian)
    # Raises an error (e.g. ValueError) if required slots are missing or types mismatch
    
  3. Create container class:

    # Add to schema:
    classes:
      CustodianContainer:
        tree_root: true
        attributes:
          custodians:
            range: Custodian
            multivalued: true
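
If the container route is taken, the matching instance document would wrap hubs under the `custodians` key. A sketch, assuming the `CustodianContainer` class above is added to the schema (file name is hypothetical):

```yaml
# examples/custodian_container.yaml (hypothetical)
custodians:
  - hc_id: https://nde.nl/ontology/hc/nl-nh-ams-m-rm-q190804
    created: "2024-11-21T10:00:00Z"
    modified: "2024-11-21T10:00:00Z"
```

Validation would then target the container class: `linkml-validate -C CustodianContainer -s schema.yaml examples/custodian_container.yaml`.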
    

Immediate Next Steps (Priority Order)

Step 1: Resolve LinkML Validation Issue

Owner: Next agent
Priority: HIGH
Estimated Time: 30 minutes

Options:

  • Try --legacy-mode flag with linkml-validate
  • Write Python script using linkml-runtime for validation
  • Add container/tree_root class to schema
  • Check LinkML documentation for modular schema validation

Success Criteria:

  • At least one instance file validates without errors
  • Validation workflow documented for future use

Step 2: Create SPARQL Query Examples

Owner: Next agent
Priority: MEDIUM
Estimated Time: 1 hour

Queries to Implement:

```sparql
PREFIX hc:      <https://nde.nl/ontology/hc/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX crm:     <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX prov:    <http://www.w3.org/ns/prov#>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX time:    <http://www.w3.org/2006/time#>
# cpov prefix must match the namespace declared in the schema
PREFIX cpov:    <http://data.europa.eu/m8g/>

# Query 1: Get all observations for a hub
SELECT ?observation ?name ?source ?date
WHERE {
  ?observation dcterms:references hc:nl-nh-ams-m-rm-q190804 ;
               crm:P1_is_identified_by/rdf:value ?name ;
               prov:hadPrimarySource ?source ;
               prov:generatedAtTime ?date .
}

# Query 2: Find hubs with conflicting names
SELECT ?hub (COUNT(DISTINCT ?name) AS ?name_count)
WHERE {
  ?obs dcterms:references ?hub ;
       crm:P1_is_identified_by/rdf:value ?name .
}
GROUP BY ?hub
HAVING (COUNT(DISTINCT ?name) > 1)

# Query 3: Get reconstruction timeline
SELECT ?hub ?legal_name ?start ?end
WHERE {
  ?recon dcterms:references ?hub ;
         cpov:legalName ?legal_name ;
         crm:P4_has_time-span/time:hasBeginning/time:inXSDDateTime ?start .
  OPTIONAL {
    ?recon crm:P4_has_time-span/time:hasEnd/time:inXSDDateTime ?end .
  }
}
ORDER BY ?start
```

Deliverables:

  • File: docs/SPARQL_QUERY_EXAMPLES.md with 10+ queries
  • Test data: RDF dataset with 3-5 custodians
  • Jupyter notebook demonstrating queries (optional)
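
A starting point for the test dataset could be a few hand-written Turtle triples per custodian. The following is a hypothetical sample (the crm/prov/dcterms namespaces are the standard ones; the property choices mirror the queries above):

```turtle
@prefix hc:      <https://nde.nl/ontology/hc/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix crm:     <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix prov:    <http://www.w3.org/ns/prov#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

# One observation pointing at the Rijksmuseum hub
hc:obs-001 dcterms:references hc:nl-nh-ams-m-rm-q190804 ;
    crm:P1_is_identified_by [ rdf:value "Rijksmuseum Amsterdam" ] ;
    prov:hadPrimarySource <https://rijksmuseum.nl> ;
    prov:generatedAtTime "2024-11-21T10:00:00Z"^^xsd:dateTime .
```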

Step 3: Build Data Conversion Pipeline

Owner: Future agent
Priority: HIGH (for production)
Estimated Time: 4 hours

Tasks:

  1. Generate Hub IDs from Existing GHCID Data

    def generate_hub_id_from_ghcid(ghcid: str) -> str:
        """
        Convert GHCID to hub ID format.
    
        Example:
          NL-NH-AMS-M-SM-stedelijk_museum_amsterdam → https://nde.nl/ontology/hc/nl-nh-ams-m-sm-stedelijk_museum_amsterdam
    
        Note: Collision suffix uses native language name in snake_case (NOT Wikidata Q-numbers).
        See docs/plan/global_glam/07-ghcid-collision-resolution.md
        """
        return f"https://nde.nl/ontology/hc/{ghcid.lower()}"
    
  2. Create Observations from ISIL Registry

    def isil_to_observation(isil_record: dict) -> CustodianObservation:
        """Convert ISIL CSV record to observation."""
        hub_id = generate_hub_id_from_ghcid(...)
        return CustodianObservation(
            refers_to_custodian=hub_id,
            observed_name=Appellation(
                appellation_value=isil_record['Instelling'],
                appellation_language='nl'
            ),
            source=SourceDocument(
                source_uri='https://isil.org/registry',
                source_creator='ISIL Agency',
                source_date=isil_record['Toegekend op']
            ),
            observation_source='ISIL Registry'
        )
    
  3. Create Observations from Wikidata

    def wikidata_to_observation(qid: str, label: str, lang: str) -> CustodianObservation:
        """Convert Wikidata entity to observation."""
        hub_id = generate_hub_id_from_qid(qid)
        return CustodianObservation(
            refers_to_custodian=hub_id,
            observed_name=Appellation(
                appellation_value=label,
                appellation_language=lang
            ),
            source=SourceDocument(
                source_uri=f'https://www.wikidata.org/wiki/{qid}',
                source_creator='Wikidata Community',
                source_date=datetime.now().date()
            ),
            observation_source='Wikidata SPARQL Query'
        )
    
  4. Synthesize Reconstructions from Merged Data

    def merge_observations_to_reconstruction(
        hub_id: str,
        observations: List[CustodianObservation]
    ) -> CustodianReconstruction:
        """
        Entity resolution: merge multiple observations into single reconstruction.
        """
        # Choose authoritative legal name (prefer ISIL > Wikidata > Website)
        legal_name = choose_authoritative_name(observations)
    
        # Extract legal form (prefer KvK > manual mapping)
        legal_form = extract_legal_form(observations)
    
        return CustodianReconstruction(
            refers_to_custodian=hub_id,
            entity_type=infer_entity_type(observations),
            legal_name=legal_name,
            legal_form=legal_form,
            was_derived_from=observations,
            was_generated_by=ReconstructionActivity(
                activity_type=ActivityType.ENTITY_RESOLUTION,
                responsible_agent=Agent(agent_name="Automated Pipeline"),
                started_at_time=datetime.now()
            )
        )
    

Deliverables:

  • Script: scripts/convert_isil_to_hub_observations.py
  • Script: scripts/convert_wikidata_to_hub_observations.py
  • Script: scripts/synthesize_reconstructions.py
  • Documentation: docs/DATA_CONVERSION_PIPELINE.md
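
The source-preference rule used by `merge_observations_to_reconstruction` (ISIL > Wikidata > Website) can be sketched as a plain ranking function. This is a minimal illustration, not the real `choose_authoritative_name`; the `Observation` shape and the `observation_source` strings are assumptions based on the examples above:

```python
# Minimal sketch of source-priority name selection (shapes are assumed).
from dataclasses import dataclass


@dataclass
class Observation:  # stand-in for CustodianObservation
    observed_name: str
    observation_source: str


# Lower rank = more authoritative; unknown sources sort last.
SOURCE_RANK = {"ISIL Registry": 0, "Wikidata SPARQL Query": 1, "Website": 2}


def choose_authoritative_name(observations: list[Observation]) -> str:
    """Return the observed name from the highest-priority source."""
    best = min(observations, key=lambda o: SOURCE_RANK.get(o.observation_source, 99))
    return best.observed_name
```

Given one Wikidata observation and one ISIL observation for the same hub, the ISIL name wins under this rule.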

Step 4: Create Comprehensive Test Suite

Owner: Future agent
Priority: HIGH (before production)
Estimated Time: 3 hours

Test Categories:

  1. Valid Hub Structures

    • Minimal hub (only hc_id)
    • Hub with created/modified timestamps
    • Hub with related observations
    • Hub with multiple reconstructions
  2. Invalid Reference Tests

    • Observation without refers_to_custodian (should fail)
    • Observation with invalid hub URI format
    • Reconstruction referencing non-existent hub
  3. Temporal Consistency

    • Observation dates within valid range
    • Reconstruction temporal_extent validation
    • PROV-O activity timeline coherence
  4. Provenance Completeness

    • All reconstructions have was_derived_from
    • All reconstructions have was_generated_by
    • Source documents have required fields

Test Framework:

```python
# tests/test_hub_architecture.py
# Assumes Python classes generated from the schema, e.g.:
#   gen-python schemas/20251121/linkml/01_custodian_name_modular.yaml > custodian_model.py
import pytest
from linkml_runtime.loaders import yaml_loader
from custodian_model import (
    Custodian, CustodianObservation, Appellation, SourceDocument,
)

def test_valid_custodian_hub():
    """Test that the minimal hub instance loads against the generated class."""
    hub = yaml_loader.load("examples/valid_custodian_hub.yaml", target_class=Custodian)
    assert hub.hc_id.startswith("https://nde.nl/ontology/hc/")

def test_observation_requires_hub_reference():
    """Test that an observation without refers_to_custodian fails to construct."""
    with pytest.raises(ValueError):
        CustodianObservation(
            observed_name=Appellation(appellation_value="Museum"),
            source=SourceDocument(source_uri="http://example.org"),
            # Missing refers_to_custodian - should fail!
        )
```
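
Until the LinkML validation issue is resolved, the temporal-consistency category can be smoke-tested with a plain stdlib check, independent of linkml. A sketch; the `observation_date` field name follows the examples above, and the `earliest` cutoff is an arbitrary assumption:

```python
# Sketch: check that an ISO observation_date falls in a plausible range.
from datetime import date


def observation_date_in_range(obs_date: str, earliest: str = "1500-01-01") -> bool:
    """True if obs_date (ISO 8601 date) is between `earliest` and today."""
    d = date.fromisoformat(obs_date)
    return date.fromisoformat(earliest) <= d <= date.today()
```

Dates before the cutoff or in the future both fail, which covers the "observation dates within valid range" case above.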

Long-Term Roadmap

Phase 1: Data Integration (Weeks 1-2)

  • Import all ISIL registries as observations
  • Import Wikidata as observations
  • Scrape institutional websites as observations
  • Generate reconstructions from merged observations

Phase 2: Triplestore Deployment (Week 3)

  • Set up GraphDB/Blazegraph/Virtuoso instance
  • Load hub architecture RDF into triplestore
  • Create SPARQL endpoint
  • Implement federated queries

Phase 3: API Development (Week 4)

  • Build REST API for hub/observation/reconstruction CRUD
  • Implement search/filter endpoints
  • Add temporal query support
  • Create API documentation (OpenAPI/Swagger)

Phase 4: UI Development (Weeks 5-6)

  • Hub visualization dashboard
  • Observation comparison tool
  • Reconstruction timeline viewer
  • Conflict resolution interface

Key Files Reference

Schema Files (Source of Truth)

schemas/20251121/linkml/01_custodian_name_modular.yaml         # Main schema
schemas/20251121/linkml/modules/slots/refers_to_custodian.yaml # Hub connector
schemas/20251121/linkml/modules/classes/Custodian.yaml         # Hub class
schemas/20251121/linkml/modules/classes/CustodianObservation.yaml
schemas/20251121/linkml/modules/classes/CustodianReconstruction.yaml

Generated Artifacts

schemas/20251121/rdf/custodian_hub_FINAL.ttl       # 90 KB RDF/OWL (Turtle)
schemas/20251121/rdf/custodian_hub.jsonld          # 257 B JSON-LD context
schemas/20251121/rdf/custodian_hub.nt              # 266 KB N-Triples
schemas/20251121/uml/mermaid/custodian_hub_v5_FINAL.mmd  # Diagram with hub connections
schemas/20251121/uml/plantuml/custodian_hub_FINAL.puml   # PlantUML diagram

Example Instances

schemas/20251121/examples/hub_architecture_rijksmuseum.yaml      # Original (invalid format)
schemas/20251121/examples/valid_custodian_hub.yaml               # Minimal hub
schemas/20251121/examples/valid_observation.yaml                 # Observation example
schemas/20251121/examples/valid_reconstruction.yaml              # Reconstruction example
schemas/20251121/examples/hub_architecture_rijksmuseum_valid.yaml # Multi-document format

Documentation

schemas/20251121/HUB_ARCHITECTURE_VERIFIED_COMPLETE.md    # Technical completion report
schemas/20251121/HUB_ARCHITECTURE_COMPLETION_SUMMARY.md   # Original implementation doc
schemas/20251121/CUSTODIAN_HUB_ARCHITECTURE.md            # Architecture guide
schemas/20251121/HUB_ARCHITECTURE_NEXT_STEPS.md           # This document

Scripts

scripts/generate_mermaid_modular.py    # Fixed to use induced_slot()
scripts/generate_plantuml_modular.py   # Generates PlantUML diagrams

Critical Handoff Notes for Next Agent

What Was Fixed This Session

  1. Hub Disconnection Bug (22:24-22:32)

    • Symptom: Mermaid diagrams showed Custodian hub as isolated class
    • Root Cause: refers_to_custodian slot had range: uriorcurie (string) instead of range: Custodian (class)
    • Fix: Changed slot definition + updated generate_mermaid_modular.py to use induced_slot()
    • Verification: Final diagram shows 3 hub connections
  2. RDF Generation

    • Generated JSON-LD and N-Triples formats
    • All 8 RDF formats now available (TTL, NT, JSONLD, RDF/XML, N3, NQ, TRIG, TRIX)
  3. Example Instances

    • Created valid YAML instances for all 3 main classes
    • Discovered linkml-validate requires further investigation
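
The slot change behind the hub disconnection fix amounts to the following (a sketch of the slot file; the surrounding keys are assumptions):

```yaml
# modules/slots/refers_to_custodian.yaml (sketch; surrounding keys assumed)
slots:
  refers_to_custodian:
    # range: uriorcurie   # old: string range, hub rendered as an isolated class
    range: Custodian      # fix: class range, generators draw the hub edge
```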

What Still Needs Work

  1. Instance Validation

    • linkml-validate tool behavior unclear
    • May need Python-based validation or container class
    • See "Step 1" above for recommended approaches
  2. SPARQL Queries

    • No query examples yet
    • Need test dataset with real hub architecture data
    • See "Step 2" above for query templates
  3. Data Pipeline

    • Conversion scripts not yet written
    • Need to map existing GHCID data to hub IDs
    • See "Step 3" above for implementation outline

Questions for User/Stakeholder

  1. Validation Priority: Should we fix LinkML validation immediately or move forward with SPARQL queries?
  2. Data Sources: Which data source should be converted first (ISIL registry, Wikidata, or institutional websites)?
  3. Deployment Target: What triplestore will be used in production (GraphDB, Blazegraph, Virtuoso)?

Session End: 2025-11-21 22:45
Total Session Time: 29 minutes
Status: Core implementation complete, ready for next phase