glam/docs/task6_partnership_batch_analysis.md
2025-11-19 23:25:22 +01:00

Task 6: Partnership Batch Processing - Complete Analysis

Date: November 7, 2025
Status: COMPLETE
Output Files: 3 files (RDF, Statistics, Network Graph)

Summary

Successfully extracted partnerships from 160 GLAM conversation files and generated a unified RDF knowledge graph with 242 partnership connections across 115 institutions and 23 unique partner organizations.

Batch Processing Results

Files Processed

  • Total conversation files: 160/160 (100% success rate)
  • Failed: 0
  • Processing time: ~8 seconds

Extraction Statistics

  • Total partnerships extracted: 242
  • Unique partner organizations: 23
  • Institution nodes: 115
  • Network edges: 242

Output Files

1. RDF/Turtle Graph (global_glam_partnerships.ttl)

  • Size: 292 KB
  • Lines: 4,268
  • RDF Triples: 4,744
  • Format: Turtle (.ttl)
  • Ontologies Used:
    • W3C Organization Ontology (ORG) - for partnerships
    • Schema.org - for discoverability
    • PROV-O - for provenance tracking
    • Custom GHCID namespace

SPARQL Validation:

# Query top partners by connection count
SELECT ?partner_name (COUNT(?partnership) as ?count)
WHERE {
    ?partnership <https://w3id.org/heritage/custodian/partner_name> ?partner_name .
}
GROUP BY ?partner_name
ORDER BY DESC(?count)

RDF Structure Example:

<https://w3id.org/heritage/custodian/batch/institution-name> 
    a ghcid:HeritageCustodian, org:Organization, schema:Organization, prov:Entity ;
    schema:name "Institution Name" ;
    ghcid:institution_type "MIXED" ;
    org:hasMembership [
        a org:Membership, ghcid:Partnership ;
        org:member [ a org:Organization ; schema:name "OCLC" ] ;
        org:role "academic_network" ;
        ghcid:partnership_type "academic_network"
    ] ;
    prov:generatedAtTime "2025-11-07T12:49:47Z"^^xsd:dateTime .

2. Statistics Report (partnership_statistics.json)

  • Size: 32 KB
  • Format: JSON
  • Contents:
    • Summary statistics (total files, partnerships, partners)
    • Partnership type distribution
    • Top partners ranking
    • Geographic distribution
    • Failed files tracking

3. Network Graph (partner_network.json)

  • Size: 58 KB
  • Format: D3.js-compatible JSON
  • Nodes: 138 (115 institutions + 23 partners)
  • Edges: 242 (partnership connections)
  • Ready for: Force-directed graph visualization
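
A minimal sketch of how a D3.js-compatible node-link structure of this shape could be assembled. It assumes the common `nodes`/`links` layout; the key names, the `group` field, the `build_network` helper, and the sample records are illustrative assumptions, not the confirmed schema of partner_network.json.

```python
import json

def build_network(partnerships):
    """Build a D3-style node-link dict from (institution, partner, type) tuples."""
    nodes = {}   # keyed by node id so each institution/partner appears once
    links = []   # one entry per partnership connection
    for institution, partner, ptype in partnerships:
        nodes.setdefault(institution, {"id": institution, "group": "institution"})
        nodes.setdefault(partner, {"id": partner, "group": "partner"})
        links.append({"source": institution, "target": partner, "type": ptype})
    return {"nodes": list(nodes.values()), "links": links}

# Hypothetical sample records, for illustration only
sample = [
    ("Institution A", "OCLC", "academic_network"),
    ("Institution A", "IIIF", "thematic_network"),
    ("Institution B", "OCLC", "academic_network"),
]
network = build_network(sample)
print(json.dumps(network, indent=2))
```

Under this layout the node count is the number of distinct institutions plus distinct partners, and the link count equals the number of partnerships, which is the same relationship as the 138 nodes and 242 edges reported above.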

Top Partners Discovered

| Rank | Partner Name | Connections | Type | Description |
|------|--------------|-------------|------|-------------|
| 1 | OCLC | 76 | academic_network | Global library consortium (WorldCat) |
| 2 | UNESCO | 48 | national_certification | World Heritage Site designation |
| 3 | IIIF | 39 | thematic_network | International Image Interoperability Framework |
| 4 | World Heritage | 30 | national_certification | UNESCO World Heritage Sites |
| 5 | linked data | 8 | linked_data_platform | Semantic web technologies |
| 6 | Europeana | 6 | aggregator_participation | European cultural heritage aggregator |
| 7 | WorldCat | 5 | international_aggregator | OCLC bibliographic database |
| 8 | VIAF | 4 | academic_network | Virtual International Authority File |
| 9 | DPLA | 3 | aggregator_participation | Digital Public Library of America |
| 10 | Tainacan | 3 | thematic_network | Brazilian open-source collections platform |

Partnership Type Distribution

| Type | Count | Description |
|------|-------|-------------|
| academic_network | 83 | OCLC, VIAF, OAI-PMH, DSpace |
| national_certification | 79 | UNESCO, World Heritage, Museumregister |
| thematic_network | 41 | IIIF, Tainacan, domain-specific consortia |
| linked_data_platform | 23 | RDF, SPARQL, semantic web technologies |
| aggregator_participation | 10 | Europeana, DPLA, Collectie Nederland |
| international_aggregator | 4 | WorldCat, VIAF |
| digitization_program | 2 | Google Arts & Culture |
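
As a quick consistency check, the per-type counts can be summed and converted to percentage shares. The sketch below simply re-enters the counts from this section by hand; it does not read partnership_statistics.json, whose exact schema is not shown here.

```python
# Counts copied from the Partnership Type Distribution listing above
type_counts = {
    "academic_network": 83,
    "national_certification": 79,
    "thematic_network": 41,
    "linked_data_platform": 23,
    "aggregator_participation": 10,
    "international_aggregator": 4,
    "digitization_program": 2,
}

total = sum(type_counts.values())

# Share of each type as a percentage of all partnerships, one decimal place
shares = {name: round(100 * count / total, 1) for name, count in type_counts.items()}

for name, pct in shares.items():
    print(f"{name:<26} {type_counts[name]:>3}  {pct:>5.1f}%")
print("total:", total)
```

The total comes to 242, matching the number of partnerships extracted, and the two largest categories (academic_network and national_certification) together account for roughly two thirds of all connections.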

Geographic Coverage

Countries Represented

  • Netherlands provinces: Limburg, Zeeland, Gelderland, Drenthe, Groningen, Friesland (7)
  • International: Afghanistan, Argentina, Austria, Belgium, Brazil, Canada, Chile, Egypt, Hungary, Mexico, Saudi Arabia, and more (12+)
  • "Unknown": 137 of 160 source conversations (filename parsing needs improvement)

Geographic Distribution Analysis

The high share of "Unknown" entries (137 of 160 conversations, 85.6%) shows that deriving geography from conversation filenames alone is insufficient. Planned improvements:

  • Parse conversation titles for country names
  • Extract location data from conversation content
  • Cross-reference with institution metadata
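
The first improvement could be sketched as a simple lookup of known country names in a conversation title. This is only an illustration: the `extract_country` helper, the short country list, and the assumption that titles use underscores or hyphens as separators are all hypothetical, and the real filename format is not documented here.

```python
import re

# Illustrative subset only; a real run would load a full country list (e.g. ISO 3166).
COUNTRIES = ["Afghanistan", "Argentina", "Austria", "Belgium", "Brazil",
             "Canada", "Chile", "Egypt", "Hungary", "Mexico", "Saudi Arabia"]

# Try longer names first so multi-word names win over any shorter overlap.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in sorted(COUNTRIES, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)
_CANONICAL = {c.lower(): c for c in COUNTRIES}

def extract_country(title: str) -> str:
    """Return the first known country named in the title, else "Unknown"."""
    # Underscores count as word characters in regex, so turn separators into spaces.
    normalized = re.sub(r"[_\-]+", " ", title)
    match = _PATTERN.search(normalized)
    return _CANONICAL[match.group(1).lower()] if match else "Unknown"

print(extract_country("glam_brazil_museums.json"))
print(extract_country("Saudi-Arabia heritage overview"))
print(extract_country("untitled_conversation"))
```

Exact-name matching like this misses adjectival forms such as "Brazilian" or "Dutch", which is one reason the NER-based and metadata-based approaches listed above would still be needed.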

Data Quality Observations

Strengths

High extraction rate: 242 partnerships from 160 conversations (about 1.5 per conversation)
Consistent parsing: 0 failures, 100% processing success
Rich partnership types: 7 distinct partnership categories
Global coverage: 12+ countries represented
Standard compliance: W3C ORG ontology, Schema.org, PROV-O

Areas for Improvement

⚠️ Geographic extraction: 137 of 160 conversations (85.6%) have their country marked as "Unknown"
⚠️ Institution type inference: All marked as "MIXED" (batch extraction limitation)
⚠️ Partnership descriptions: Sometimes verbose, extracted from conversation context
⚠️ Duplicate partners: Some partner entries may refer to the same or closely related organizations under different names (e.g., "OCLC" vs "WorldCat")

Technical Implementation

Bug Fix Applied

Problem: The script crashed during RDF generation.
Root Cause: RDFExporter.export() was called with an output_path argument, which that method does not accept.
Solution: Switched to RDFExporter.export_to_file(), which takes the destination as filepath.

Before (incorrect):

rdf_output = self.exporter.export(
    custodians=custodians,
    output_path=output_path,  # ← Parameter doesn't exist
    format="turtle"
)

After (correct):

self.exporter.export_to_file(
    custodians=custodians,
    filepath=str(output_path),
    format="turtle"
)

Model Validation

Linter warnings about missing required parameters were false positives:

  • Partnership.partner_id - Optional (line 227 in models.py)
  • HeritageCustodian.organization_status - Has default: OrganizationStatus.UNKNOWN
  • Provenance fields - All optional except core fields

The script ran successfully despite these linter warnings.

Use Cases

1. SPARQL Queries

Query the RDF graph for partnership analysis:

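# Find all institutions using IIIF
# NOTE: the ghcid: and schema: prefix IRIs below are assumptions;
# match them to the @prefix declarations in the .ttl file.
PREFIX ghcid:  <https://w3id.org/heritage/custodian/>
PREFIX org:    <http://www.w3.org/ns/org#>
PREFIX schema: <http://schema.org/>
SELECT ?institution ?name
WHERE {
    ?institution a ghcid:HeritageCustodian ;
                 schema:name ?name ;
                 org:hasMembership ?membership .
    ?membership ghcid:partner_name "IIIF" .
}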

2. Network Visualization

Use partner_network.json with D3.js force-directed graph:

d3.json("partner_network.json").then(data => {
    // Render nodes (institutions + partners)
    // Render edges (partnerships)
    // Color by partnership_type
});

3. Statistics Analysis

Analyze partnership trends from partnership_statistics.json:

  • Most connected partners (OCLC dominates)
  • Partnership type distribution
  • Geographic coverage gaps
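
One way to quantify "OCLC dominates" is to express each top partner's connection count as a share of all partnerships. The sketch below re-enters the top three counts from the ranking in this report rather than reading them from partnership_statistics.json, whose key names are not documented here; the `share` helper is illustrative.

```python
TOTAL_PARTNERSHIPS = 242  # total partnerships extracted, from this report

# Connection counts for the three highest-ranked partners in this report
top_partners = {"OCLC": 76, "UNESCO": 48, "IIIF": 39}

def share(count: int, total: int = TOTAL_PARTNERSHIPS) -> float:
    """Percentage of all partnership connections, rounded to one decimal."""
    return round(100 * count / total, 1)

shares = {name: share(count) for name, count in top_partners.items()}
top3_share = share(sum(top_partners.values()))

print(shares)
print(f"Top 3 combined: {top3_share}%")
```

With these inputs the OCLC share is 31.4%, and the top three partners together account for about two thirds of all connections.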

Next Steps

Short-term Enhancements

  1. Improve geographic extraction:

    • Parse conversation titles for country names
    • Extract locations from conversation content
    • Use NER (Named Entity Recognition) for places
  2. Deduplicate partners:

    • Normalize partner names (OCLC/WorldCat, UNESCO/World Heritage)
    • Create partner authority file with aliases
    • Link to Wikidata for canonical identifiers
  3. Infer institution types:

    • Use conversation topics to classify institutions
    • Pattern matching on institution names
    • Cross-reference with CSV datasets
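
The partner deduplication step (item 2 above) could start from a small alias table. The following is a sketch under stated assumptions: the `AUTHORITY` mapping and `normalize_partner` helper are hypothetical, and whether "WorldCat" should fold into "OCLC" or "World Heritage" into "UNESCO" is an editorial decision this report leaves open.

```python
from collections import Counter

# Hypothetical authority file: canonical name -> known aliases
AUTHORITY = {
    "OCLC": ["WorldCat"],
    "UNESCO": ["World Heritage"],
}

# Reverse lookup from any case-folded name or alias to its canonical form
ALIAS_TO_CANONICAL = {
    alias.casefold(): canonical
    for canonical, aliases in AUTHORITY.items()
    for alias in [canonical, *aliases]
}

def normalize_partner(name: str) -> str:
    """Map a raw partner name to its canonical form; unknown names pass through."""
    key = " ".join(name.split()).casefold()  # collapse whitespace, ignore case
    return ALIAS_TO_CANONICAL.get(key, name.strip())

# Connection counts taken from the top-partners ranking in this report
raw_counts = {"OCLC": 76, "UNESCO": 48, "World Heritage": 30, "WorldCat": 5}

merged = Counter()
for name, count in raw_counts.items():
    merged[normalize_partner(name)] += count

print(merged.most_common())
```

Keeping the aliases in a separate table, rather than hard-coding replacements in the extractor, would also give a natural place to attach Wikidata identifiers later.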

Long-term Integration

  1. Merge with CSV data:

    • Cross-link partnership data with Dutch ISIL/Org datasets
    • Validate partnership claims against authoritative sources
    • Elevate data tier from TIER_4_INFERRED to TIER_2_VERIFIED
  2. Temporal analysis:

    • Track partnership start/end dates
    • Analyze partnership formation trends over time
    • Identify emerging networks (e.g., IIIF growth)
  3. Visualization dashboards:

    • Interactive network graph (D3.js, Cytoscape.js)
    • Geographic map of partnerships (Leaflet, Mapbox)
    • Partnership type distribution charts

Files Generated

data/exports/
├── global_glam_partnerships.ttl     # 292 KB - RDF/Turtle knowledge graph
├── partnership_statistics.json      # 32 KB  - Summary statistics
└── partner_network.json             # 58 KB  - D3.js network graph data

Verification Commands

Validate RDF syntax:

rapper -i turtle -c data/exports/global_glam_partnerships.ttl

Query RDF with SPARQL:

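from rdflib import Graph
# Load the exported Turtle file into an in-memory graph
g = Graph()
g.parse('data/exports/global_glam_partnerships.ttl', format='turtle')
# Print the first 10 triples as a smoke test
for row in g.query('SELECT * WHERE { ?s ?p ?o } LIMIT 10'):
    print(row.s, row.p, row.o)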

View statistics:

jq '.summary' data/exports/partnership_statistics.json

Visualize network:

python -m http.server 8000
# Open browser to D3.js visualization loading partner_network.json

Conclusion

Task 6 successfully completed with:

  • 160/160 conversation files processed (0 failures)
  • 242 partnerships extracted across 23 unique partners
  • RDF knowledge graph exported (4,744 triples)
  • Statistics and network data generated
  • SPARQL-queryable linked data

The partnership extraction reveals OCLC, UNESCO, and IIIF as dominant global heritage partners, with opportunities to improve geographic coverage and institution type classification in future iterations.

Key Finding: OCLC's overwhelming presence (76 connections, 31.4% of all partnerships) highlights the central role of library cataloging infrastructure in global GLAM digital strategy.