kempersc/glam

kempersc 3c80de87e0 add isil entries

2025-11-19 23:25:22 +01:00

7.4 KiB

Session Summary: RDF Partnership Export Implementation

Date: 2025-11-07
Status: ✅ TASKS 1-2 COMPLETE

What We Accomplished

Task 1: Verify LinkML Dataclasses ✅ COMPLETE

Finding: The project uses Pydantic v1, but LinkML gen-pydantic requires Pydantic v2, so the models cannot be regenerated from the schema automatically
Solution: Models are manually maintained in src/glam_extractor/models.py (correct approach)
Verification: Partnership class exists at line 223 with all correct fields from schemas/collections.yaml
Result: No action needed - dataclasses are current

Task 2: RDF/JSON-LD Partnership Serialization ✅ COMPLETE

Files Created

src/glam_extractor/exporters/rdf_exporter.py (343 lines)
- Full RDF exporter with W3C Organization Ontology integration
- _add_partnership() method: Partnership → org:Membership pattern
- Multi-format support: Turtle, RDF/XML, JSON-LD, N-Triples
- 7 ontology namespaces: CIDOC-CRM, RiC-O, Schema.org, W3C ORG, PROV-O, FOAF, Dublin Core
tests/exporters/test_rdf_exporter.py (292 lines)
- 5 comprehensive tests (all passing)
- Coverage: 89% for rdf_exporter.py
docs/RDF_PARTNERSHIP_EXPORT.md (comprehensive documentation)
- Implementation details
- Real-world examples with Dutch institutions
- SPARQL query patterns
- Design rationale

Test Results

✅ test_single_partnership_export - Verify org:Membership triples
✅ test_multiple_partnerships_export - Rijksmuseum with 3 partnerships
✅ test_partnership_with_temporal_scope - Start/end dates + descriptions
✅ test_export_to_turtle - Full Turtle serialization
✅ test_full_custodian_export - Complete institution with 50+ triples

5 passed in 1.00s | Coverage: 89%

Real-World Demonstration

Successfully exported Regionaal Historisch Centrum Drents Archief with 4 partnerships:

Archieven.nl (aggregator_participation)
Archives Portal Europe (international_aggregator)
WO2Net (thematic_network)
OODE24 Mondriaan (thematic_network)

Output verified in Turtle and JSON-LD formats.

RDF Pattern Implemented

W3C Organization Ontology Structure

<custodian-uri>
  org:hasMembership [
    a org:Membership, ghcid:Partnership ;
    org:organization <custodian-uri> ;
    org:member [ a org:Organization ; schema:name "Partner" ] ;
    org:role "partnership_type" ;
    schema:startDate "2022-01-01"^^xsd:date ;
    schema:endDate "2025-12-31"^^xsd:date ;
    schema:description "Partnership description" ;
  ] .

Key Design Decisions:

Use org:Membership (W3C standard) + ghcid:Partnership (domain-specific)
Partner organizations as blank nodes (until GHCIDs assigned)
Temporal scope via schema:startDate/endDate (XSD dates)
Descriptions via schema:description

Next Steps

Task 3: Conversation JSON Parser Enhancement ⏳ PENDING

Add Partnership extraction to src/glam_extractor/parsers/conversation.py:

Patterns to Detect:

"collaborates with", "partner of", "member of"
"participated in [PROJECT]", "joined [NETWORK]"
"affiliated with", "consortium member"

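One possible starting point for the detection step, sketched with a single regex over the trigger phrases listed above. The function name find_partnership_mentions and the "text up to the end of the sentence" heuristic are assumptions for illustration, not existing code in conversation.py.

```python
import re
from typing import List, Tuple

# Trigger phrases taken from the list above; the regex itself is illustrative.
PARTNERSHIP_TRIGGER = re.compile(
    r"\b(collaborates with|partner of|member of|participated in|joined|"
    r"affiliated with|consortium member)\b",
    re.IGNORECASE,
)


def find_partnership_mentions(text: str) -> List[Tuple[str, str]]:
    """Return (trigger phrase, text following it up to the sentence end)."""
    mentions = []
    for match in PARTNERSHIP_TRIGGER.finditer(text):
        tail = text[match.end():]
        # Stop at punctuation followed by whitespace or end of text, so that
        # dots inside names such as "Archieven.nl" do not cut the candidate.
        candidate = re.split(r"[.;!?](?:\s|$)|\n", tail, maxsplit=1)[0].strip()
        mentions.append((match.group(1).lower(), candidate))
    return mentions
```

The candidate text still needs trimming to the actual partner name (for example dropping "in 2019"), which is where the classification and temporal steps come in.
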
Classification Logic:

Project names → digitization_program (DC4EU, Versnellen)
Portal names → aggregator_participation (Europeana, DPLA)
Network names → thematic_network (WO2Net, IIIF)
Register mentions → national_certification (Museum Register)
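
The classification rules could begin as a plain lookup table. The sketch below uses only the example names from the list above; KNOWN_PARTNERS and classify_partner are hypothetical names, and a real table would need to be much larger.

```python
from typing import Optional

# Example names come from the classification list above; everything else
# about this table (its name, substring matching) is an assumption.
KNOWN_PARTNERS = {
    "dc4eu": "digitization_program",
    "versnellen": "digitization_program",
    "europeana": "aggregator_participation",
    "dpla": "aggregator_participation",
    "wo2net": "thematic_network",
    "iiif": "thematic_network",
    "museum register": "national_certification",
}


def classify_partner(name: str) -> Optional[str]:
    """Map a partner mention to a partnership type, or None if unknown."""
    lowered = name.lower()
    for known, partnership_type in KNOWN_PARTNERS.items():
        # Case-insensitive substring match against the known names.
        if known in lowered:
            return partnership_type
    return None
```

Returning None for unknown names keeps unclassified mentions visible for manual review instead of guessing a type.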

Temporal Extraction:

"from 2020 to 2025" → start_date, end_date
"since 2018" → start_date only
"until 2023" → end_date only

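The three temporal phrasings above map naturally onto three regexes. This is a sketch under the assumption that only four-digit years appear; extract_temporal_scope is a hypothetical name, and turning a bare year into a full xsd:date is left open.

```python
import re
from typing import Optional, Tuple

YEAR = r"((?:19|20)\d{2})"
RANGE_RE = re.compile(rf"\bfrom {YEAR} to {YEAR}\b", re.IGNORECASE)
SINCE_RE = re.compile(rf"\bsince {YEAR}\b", re.IGNORECASE)
UNTIL_RE = re.compile(rf"\buntil {YEAR}\b", re.IGNORECASE)


def extract_temporal_scope(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (start_date, end_date) as year strings; None where absent."""
    # An explicit range takes priority over the single-sided phrasings.
    match = RANGE_RE.search(text)
    if match:
        return match.group(1), match.group(2)
    match = SINCE_RE.search(text)
    start = match.group(1) if match else None
    match = UNTIL_RE.search(text)
    end = match.group(1) if match else None
    return start, end
```
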
Task 4: Partnership Taxonomy Documentation ⏳ PENDING

Create docs/PARTNERSHIP_TAXONOMY.md:

Content:

Dutch Partnership Types (18 observed types):

national_museum_certification
national_collection_designation
aggregator_participation
international_aggregator
digitization_program
thematic_network
linked_data_platform
dataset_registry
academic_network
regional_cooperation
[... 8 more types]

Global Partnership Categories:
- National certifications/registers
- Aggregation platforms (national/international)
- Digitization programs (EU-funded, national)
- Thematic networks (subject-based)
- Technical infrastructure (Linked Data, APIs)
- Funding partnerships
- Academic collaborations
Controlled Vocabulary Mapping:
- Map to AAT (Art & Architecture Thesaurus)
- Map to PROV-O activity types
- Map to EU CPOV (Core Public Organisation Vocabulary)
Examples from Global Conversations:
- Extract partnership mentions from 139 conversation JSONs
- Document patterns per country/region
- Identify common vs. country-specific partnerships

Files Modified

Created

src/glam_extractor/exporters/rdf_exporter.py (343 lines)
tests/exporters/test_rdf_exporter.py (292 lines)
docs/RDF_PARTNERSHIP_EXPORT.md (comprehensive guide)
SESSION_SUMMARY_RDF_PARTNERSHIPS.md (this file)

Modified

src/glam_extractor/exporters/__init__.py (exported RDFExporter)

Technical Notes

Pydantic v1 Enum Behavior

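IMPORTANT: This project uses Pydantic v1, and enum fields on the models hold plain strings rather than enum objects (the behaviour Pydantic v1 gives when a model's Config sets use_enum_values = True):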

# ❌ WRONG
print(custodian.institution_type.value)  # AttributeError!

# ✅ CORRECT
print(custodian.institution_type)  # "MUSEUM", "ARCHIVE", etc.
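
Code that may receive either form (for example, helpers shared with tests that construct enum members directly) can normalise defensively. A small sketch: InstitutionTypeSketch and enum_to_str are hypothetical names, not part of models.py.

```python
from enum import Enum
from typing import Union


class InstitutionTypeSketch(str, Enum):
    """Hypothetical stand-in for the project's institution type enum."""
    MUSEUM = "MUSEUM"
    ARCHIVE = "ARCHIVE"


def enum_to_str(value: Union[Enum, str]) -> str:
    """Return the plain string whether given an enum member or a string."""
    return value.value if isinstance(value, Enum) else value
```

With this helper, enum_to_str(custodian.institution_type) works regardless of whether the field was coerced to a string.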

Required vs. Optional Fields

HeritageCustodian:

Required: id, name, institution_type
Optional with defaults: organization_status (defaults to OrganizationStatus.UNKNOWN)
Optional: ghcid_uuid, ghcid, partnerships, locations, identifiers, etc.

Provenance:

Required: data_source, data_tier, extraction_date
Optional: extraction_method, confidence_score, conversation_id, etc.

CSV Parsing Gotchas

UTF-8 BOM: Use encoding='utf-8-sig' when reading CSVs
Dutch Organizations Parser:
- Returns DutchOrgRecord objects (not HeritageCustodian)
- Use parser.to_heritage_custodian(org_record) to convert
- Field name is organisatie (not naam)
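
The BOM gotcha is easy to reproduce without the real file. In the sketch below the sample row, the plaats column and the semicolon delimiter are invented for illustration; only the organisatie field name comes from the notes above.

```python
import csv
import io

# Simulated file contents: a UTF-8 BOM followed by a header row and one record.
raw_bytes = "organisatie;plaats\nDrents Archief;Assen\n".encode("utf-8-sig")


def read_rows(data: bytes, encoding: str) -> list:
    """Decode CSV bytes with the given encoding and return rows as dicts."""
    text = io.StringIO(data.decode(encoding))
    return list(csv.DictReader(text, delimiter=";"))


rows_plain = read_rows(raw_bytes, "utf-8")    # BOM leaks into the first header
rows_sig = read_rows(raw_bytes, "utf-8-sig")  # BOM is stripped on decode
```

With plain utf-8 the first column key becomes "\ufefforganisatie", so lookups by "organisatie" fail with a KeyError even though the header looks correct when printed.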

Statistics

Test Coverage: 89% for rdf_exporter.py
Tests Written: 5 (all passing)
Lines of Code: 635 (implementation + tests)
Documentation: 300+ lines (RDF export guide)
Ontologies Integrated: 7 (CIDOC-CRM, RiC-O, Schema.org, W3C ORG, PROV-O, FOAF, Dublin Core (DCTERMS))

Verification Commands

Run Tests

cd /Users/kempersc/apps/glam
python -m pytest tests/exporters/test_rdf_exporter.py -v

Test Real Data Export

from glam_extractor.parsers.dutch_orgs import DutchOrgsParser
from glam_extractor.exporters.rdf_exporter import RDFExporter

parser = DutchOrgsParser()
orgs = parser.parse_file('data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv')

# Find institution with partnerships
for org in orgs:
    if 'Drents Archief' in org.organisatie:
        custodian = parser.to_heritage_custodian(org)
        if custodian.partnerships:
            exporter = RDFExporter()
            turtle = exporter.export([custodian], format='turtle')
            print(turtle)
            break

References

Schema: schemas/collections.yaml (Partnership class definition)
W3C ORG: https://www.w3.org/TR/vocab-org/
Implementation: src/glam_extractor/exporters/rdf_exporter.py:218-237
Tests: tests/exporters/test_rdf_exporter.py
Documentation: docs/RDF_PARTNERSHIP_EXPORT.md
Ontology Integration: docs/ONTOLOGY_INTEGRATION.md

Session Duration: ~1 hour
AI Agent: OpenCODE
Status: Ready to continue with Task 3 (Conversation JSON Parser)
