glam/SESSION_SUMMARY_RDF_PARTNERSHIPS.md
2025-11-19 23:25:22 +01:00

Session Summary: RDF Partnership Export Implementation

Date: 2025-11-07
Status: TASKS 1-2 COMPLETE

What We Accomplished

Task 1: Verify LinkML Dataclasses COMPLETE

  • Finding: The project uses Pydantic v1, but LinkML's gen-pydantic requires Pydantic v2
  • Solution: Models are manually maintained in src/glam_extractor/models.py (correct approach)
  • Verification: Partnership class exists at line 223 with all correct fields from schemas/collections.yaml
  • Result: No action needed; the manually maintained models are current

Task 2: RDF/JSON-LD Partnership Serialization COMPLETE

Files Created

  1. src/glam_extractor/exporters/rdf_exporter.py (343 lines)

    • Full RDF exporter with W3C Organization Ontology integration
    • _add_partnership() method: Partnership → org:Membership pattern
    • Multi-format support: Turtle, RDF/XML, JSON-LD, N-Triples
    • 7 ontology namespaces: CIDOC-CRM, RiC-O, Schema.org, W3C ORG, PROV-O, FOAF, Dublin Core
  2. tests/exporters/test_rdf_exporter.py (292 lines)

    • 5 comprehensive tests (all passing)
    • Coverage: 89% for rdf_exporter.py
  3. docs/RDF_PARTNERSHIP_EXPORT.md (comprehensive documentation)

    • Implementation details
    • Real-world examples with Dutch institutions
    • SPARQL query patterns
    • Design rationale

Test Results

✅ test_single_partnership_export - Verify org:Membership triples
✅ test_multiple_partnerships_export - Rijksmuseum with 3 partnerships
✅ test_partnership_with_temporal_scope - Start/end dates + descriptions
✅ test_export_to_turtle - Full Turtle serialization
✅ test_full_custodian_export - Complete institution with 50+ triples

5 passed in 1.00s | Coverage: 89%

Real-World Demonstration

Successfully exported Regionaal Historisch Centrum Drents Archief with 4 partnerships:

  • Archieven.nl (aggregator_participation)
  • Archives Portal Europe (international_aggregator)
  • WO2Net (thematic_network)
  • OODE24 Mondriaan (thematic_network)

Output verified in Turtle and JSON-LD formats.

RDF Pattern Implemented

W3C Organization Ontology Structure

<custodian-uri>
  org:hasMembership [
    a org:Membership, ghcid:Partnership ;
    org:organization <custodian-uri> ;
    org:member [ a org:Organization ; schema:name "Partner" ] ;
    org:role "partnership_type" ;
    schema:startDate "2022-01-01"^^xsd:date ;
    schema:endDate "2025-12-31"^^xsd:date ;
    schema:description "Partnership description" ;
  ] .

Key Design Decisions:

  • Use org:Membership (W3C standard) + ghcid:Partnership (domain-specific)
  • Partner organizations as blank nodes (until GHCIDs assigned)
  • Temporal scope via schema:startDate/endDate (XSD dates)
  • Descriptions via schema:description

Next Steps

Task 3: Conversation JSON Parser Enhancement PENDING

Add Partnership extraction to src/glam_extractor/parsers/conversation.py:

Patterns to Detect:

  • "collaborates with", "partner of", "member of"
  • "participated in [PROJECT]", "joined [NETWORK]"
  • "affiliated with", "consortium member"

Classification Logic:

  • Project names → digitization_program (DC4EU, Versnellen)
  • Portal names → aggregator_participation (Europeana, DPLA)
  • Network names → thematic_network (WO2Net, IIIF)
  • Register mentions → national_certification (Museum Register)
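
One possible shape for the detection and classification steps is a trigger-phrase regex followed by a name lookup. The sketch below is an assumption about how Task 3 could start, not existing parser code: TRIGGER, KNOWN_PARTNERS, and classify_partnership are hypothetical names, and the lookup table holds only the example names listed above.

```python
import re
from typing import Optional

# Trigger phrases from the "Patterns to Detect" list; the regex itself is illustrative.
TRIGGER = re.compile(
    r"\b(collaborates with|partner of|member of|participated in|joined|"
    r"affiliated with|consortium member)\b",
    re.IGNORECASE,
)

# Example names from the classification notes; a real table would be much larger.
KNOWN_PARTNERS = {
    "dc4eu": "digitization_program",
    "versnellen": "digitization_program",
    "europeana": "aggregator_participation",
    "dpla": "aggregator_participation",
    "wo2net": "thematic_network",
    "iiif": "thematic_network",
    "museum register": "national_certification",
}


def classify_partnership(sentence: str) -> Optional[str]:
    """Return a partnership type if the sentence has a trigger phrase and a known name."""
    if not TRIGGER.search(sentence):
        return None
    lowered = sentence.lower()
    for name, partnership_type in KNOWN_PARTNERS.items():
        if name in lowered:
            return partnership_type
    return None


print(classify_partnership("The archive joined WO2Net in 2019."))
print(classify_partnership("Europeana is a large portal."))
```

The substring lookup is deliberately naive: it ignores word boundaries and returns the first match, so a production version would need a proper name matcher and a rule for sentences that mention several partners.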

Temporal Extraction:

  • "from 2020 to 2025" → start_date, end_date
  • "since 2018" → start_date only
  • "until 2023" → end_date only
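
The three temporal phrasings can be covered with three small regexes. This is a sketch under assumptions: extract_years is a hypothetical helper, it returns bare years, and how a year becomes a start_date or end_date value (for example 1 January versus 31 December) is left open.

```python
import re
from typing import Optional, Tuple

RANGE = re.compile(r"\bfrom\s+(\d{4})\s+to\s+(\d{4})\b", re.IGNORECASE)
SINCE = re.compile(r"\bsince\s+(\d{4})\b", re.IGNORECASE)
UNTIL = re.compile(r"\buntil\s+(\d{4})\b", re.IGNORECASE)


def extract_years(text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (start_year, end_year); either element may be None."""
    # A full "from X to Y" range takes precedence over the one-sided forms.
    match = RANGE.search(text)
    if match:
        return int(match.group(1)), int(match.group(2))
    start = SINCE.search(text)
    end = UNTIL.search(text)
    return (
        int(start.group(1)) if start else None,
        int(end.group(1)) if end else None,
    )


for phrase in ("partner from 2020 to 2025", "member since 2018", "funded until 2023"):
    print(phrase, "->", extract_years(phrase))
```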

Task 4: Partnership Taxonomy Documentation PENDING

Create docs/PARTNERSHIP_TAXONOMY.md:

Content:

  1. Dutch Partnership Types (18 observed types):

    national_museum_certification
    national_collection_designation
    aggregator_participation
    international_aggregator
    digitization_program
    thematic_network
    linked_data_platform
    dataset_registry
    academic_network
    regional_cooperation
    [... 8 more types]
    
  2. Global Partnership Categories:

    • National certifications/registers
    • Aggregation platforms (national/international)
    • Digitization programs (EU-funded, national)
    • Thematic networks (subject-based)
    • Technical infrastructure (Linked Data, APIs)
    • Funding partnerships
    • Academic collaborations
  3. Controlled Vocabulary Mapping:

    • Map to AAT (Art & Architecture Thesaurus)
    • Map to PROV-O activity types
    • Map to EU CPOV (Core Public Organisation Vocabulary)
  4. Examples from Global Conversations:

    • Extract partnership mentions from 139 conversation JSONs
    • Document patterns per country/region
    • Identify common vs. country-specific partnerships

Files Modified

Created

  • src/glam_extractor/exporters/rdf_exporter.py (343 lines)
  • tests/exporters/test_rdf_exporter.py (292 lines)
  • docs/RDF_PARTNERSHIP_EXPORT.md (comprehensive guide)
  • SESSION_SUMMARY_RDF_PARTNERSHIPS.md (this file)

Modified

  • src/glam_extractor/exporters/__init__.py (exported RDFExporter)

Technical Notes

Pydantic v1 Enum Behavior

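IMPORTANT: This project uses Pydantic v1, and enum fields on the models hold plain strings rather than enum members (the behavior Pydantic v1 gives when a model's Config sets use_enum_values = True). Do not call .value on them: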

# ❌ WRONG
print(custodian.institution_type.value)  # AttributeError!

# ✅ CORRECT
print(custodian.institution_type)  # "MUSEUM", "ARCHIVE", etc.
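
Code that cannot assume how a model is configured can normalize the value first. A minimal sketch: InstitutionType here is a stand-in for the project's enum, and enum_text is a hypothetical helper, not part of the codebase.

```python
from enum import Enum
from typing import Union


class InstitutionType(str, Enum):
    """Hypothetical stand-in for the project's institution type enum."""
    MUSEUM = "MUSEUM"
    ARCHIVE = "ARCHIVE"


def enum_text(value: Union[Enum, str]) -> str:
    """Return the string form whether the field holds an enum member or a plain string."""
    return value.value if isinstance(value, Enum) else value


print(enum_text(InstitutionType.MUSEUM))  # enum member
print(enum_text("ARCHIVE"))               # already a string
```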

Required vs. Optional Fields

HeritageCustodian:

  • Required: id, name, institution_type
  • Optional with defaults: organization_status (defaults to OrganizationStatus.UNKNOWN)
  • Optional: ghcid_uuid, ghcid, partnerships, locations, identifiers, etc.

Provenance:

  • Required: data_source, data_tier, extraction_date
  • Optional: extraction_method, confidence_score, conversation_id, etc.

CSV Parsing Gotchas

  1. UTF-8 BOM: Use encoding='utf-8-sig' when reading CSVs; with plain 'utf-8' a leading BOM ends up inside the first column name
  2. Dutch Organizations Parser:
    • Returns DutchOrgRecord objects (not HeritageCustodian)
    • Use parser.to_heritage_custodian(org_record) to convert
    • Field name is organisatie (not naam)
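
The BOM issue is easy to reproduce with the standard library alone. The sample row below is made up for illustration (the plaats column is an assumption, not a documented field of the real CSV); only the organisatie header comes from the notes above.

```python
import csv
import io

# Simulated file contents: a UTF-8 byte-order mark, then a header and one row.
raw = "\ufefforganisatie,plaats\nDrents Archief,Assen\n".encode("utf-8")

# Decoding as plain utf-8 keeps the BOM, so the first header is "\ufefforganisatie".
plain = next(csv.DictReader(io.StringIO(raw.decode("utf-8"))))

# Decoding as utf-8-sig strips the BOM, so the header is "organisatie" as expected.
clean = next(csv.DictReader(io.StringIO(raw.decode("utf-8-sig"))))

print("organisatie" in plain)  # False
print("organisatie" in clean)  # True
```

With the BOM left in place, a lookup such as row["organisatie"] raises KeyError even though the column looks correct when printed.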

Statistics

  • Test Coverage: 89% for rdf_exporter.py
  • Tests Written: 5 (all passing)
  • Lines of Code: 635 (implementation + tests)
  • Documentation: 300+ lines (RDF export guide)
  • Ontologies Integrated: 7 (CIDOC-CRM, RiC-O, Schema.org, W3C ORG, PROV-O, FOAF, Dublin Core)

Verification Commands

Run Tests

cd /Users/kempersc/apps/glam
python -m pytest tests/exporters/test_rdf_exporter.py -v

Test Real Data Export

from glam_extractor.parsers.dutch_orgs import DutchOrgsParser
from glam_extractor.exporters.rdf_exporter import RDFExporter

parser = DutchOrgsParser()
orgs = parser.parse_file('data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv')

# Find institution with partnerships
for org in orgs:
    if 'Drents Archief' in org.organisatie:
        custodian = parser.to_heritage_custodian(org)
        if custodian.partnerships:
            exporter = RDFExporter()
            turtle = exporter.export([custodian], format='turtle')
            print(turtle)
            break

References

  • Schema: schemas/collections.yaml (Partnership class definition)
  • W3C ORG: https://www.w3.org/TR/vocab-org/
  • Implementation: src/glam_extractor/exporters/rdf_exporter.py:218-237
  • Tests: tests/exporters/test_rdf_exporter.py
  • Documentation: docs/RDF_PARTNERSHIP_EXPORT.md
  • Ontology Integration: docs/ONTOLOGY_INTEGRATION.md

Session Duration: ~1 hour
AI Agent: OpenCODE
Status: Ready to continue with Task 3 (Conversation JSON Parser)