Task 6: Partnership Batch Processing - Complete Analysis
Date: November 7, 2025
Status: ✅ COMPLETE
Output Files: 3 files (RDF, Statistics, Network Graph)
Summary
The batch run extracted partnerships from 160 GLAM conversation files and generated a unified RDF knowledge graph with 242 partnership connections linking 115 institutions to 23 unique partner organizations.
Batch Processing Results
Files Processed
- Total conversation files: 160/160 (100% success rate)
- Failed: 0
- Processing time: ~8 seconds
Extraction Statistics
- Total partnerships extracted: 242
- Unique partner organizations: 23
- Institution nodes: 115
- Network edges: 242
Output Files
1. RDF/Turtle Graph (global_glam_partnerships.ttl)
- Size: 292 KB
- Lines: 4,268
- RDF Triples: 4,744
- Format: Turtle (.ttl)
- Ontologies Used:
  - W3C Organization Ontology (ORG) - for partnerships
  - Schema.org - for discoverability
  - PROV-O - for provenance tracking
  - Custom GHCID namespace
SPARQL Validation:
# Query top partners by connection count
SELECT ?partner_name (COUNT(?partnership) AS ?count)
WHERE {
    ?partnership <https://w3id.org/heritage/custodian/partner_name> ?partner_name .
}
GROUP BY ?partner_name
ORDER BY DESC(?count)
RDF Structure Example:
<https://w3id.org/heritage/custodian/batch/institution-name>
    a ghcid:HeritageCustodian, org:Organization, schema:Organization, prov:Entity ;
    schema:name "Institution Name" ;
    ghcid:institution_type "MIXED" ;
    org:hasMembership [
        a org:Membership, ghcid:Partnership ;
        org:member [ a org:Organization ; schema:name "OCLC" ] ;
        org:role "academic_network" ;
        ghcid:partnership_type "academic_network"
    ] ;
    prov:generatedAtTime "2025-11-07T12:49:47Z"^^xsd:dateTime .
2. Statistics Report (partnership_statistics.json)
- Size: 32 KB
- Format: JSON
- Contents:
  - Summary statistics (total files, partnerships, partners)
  - Partnership type distribution
  - Top partners ranking
  - Geographic distribution
  - Failed files tracking
3. Network Graph (partner_network.json)
- Size: 58 KB
- Format: D3.js-compatible JSON
- Nodes: 138 (115 institutions + 23 partners)
- Edges: 242 (partnership connections)
- Ready for: Force-directed graph visualization
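As a quick sanity check on the network file, partner connection counts can be recomputed from the edge list. The sketch below runs on a tiny inline stand-in rather than the real file. The `nodes`/`links` keys and the `id`, `type`, `source`, and `target` fields are assumptions based on common D3.js force-layout conventions, so adjust them to the actual layout of partner_network.json.

```python
import json
from collections import Counter

# Tiny stand-in for partner_network.json. The "nodes"/"links" keys and the
# "id"/"type"/"source"/"target" fields are assumed, not taken from the real file.
sample = json.loads("""
{
  "nodes": [
    {"id": "inst-a", "type": "institution"},
    {"id": "inst-b", "type": "institution"},
    {"id": "OCLC", "type": "partner"},
    {"id": "IIIF", "type": "partner"}
  ],
  "links": [
    {"source": "inst-a", "target": "OCLC"},
    {"source": "inst-b", "target": "OCLC"},
    {"source": "inst-a", "target": "IIIF"}
  ]
}
""")


def partner_degrees(network):
    """Count incoming edges for every node typed as a partner."""
    partner_ids = {n["id"] for n in network["nodes"] if n["type"] == "partner"}
    return Counter(
        link["target"] for link in network["links"] if link["target"] in partner_ids
    )


degrees = partner_degrees(sample)
print(degrees.most_common())
```

Run against the real file (loaded with `json.load`), the totals should sum to the 242 edges reported above if the assumed field names match.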
Top Partners Discovered
| Rank | Partner Name | Connections | Type | Description |
|---|---|---|---|---|
| 1 | OCLC | 76 | academic_network | Global library consortium (WorldCat) |
| 2 | UNESCO | 48 | national_certification | World Heritage Site designation |
| 3 | IIIF | 39 | thematic_network | International Image Interoperability Framework |
| 4 | World Heritage | 30 | national_certification | UNESCO World Heritage Sites |
| 5 | linked data | 8 | linked_data_platform | Semantic web technologies |
| 6 | Europeana | 6 | aggregator_participation | European cultural heritage aggregator |
| 7 | WorldCat | 5 | international_aggregator | OCLC bibliographic database |
| 8 | VIAF | 4 | academic_network | Virtual International Authority File |
| 9 | DPLA | 3 | aggregator_participation | Digital Public Library of America |
| 10 | Tainacan | 3 | thematic_network | Brazilian open-source collections platform |
Partnership Type Distribution
| Type | Count | Description |
|---|---|---|
| academic_network | 83 | OCLC, VIAF, OAI-PMH, DSpace |
| national_certification | 79 | UNESCO, World Heritage, Museumregister |
| thematic_network | 41 | IIIF, Tainacan, domain-specific consortia |
| linked_data_platform | 23 | RDF, SPARQL, semantic web technologies |
| aggregator_participation | 10 | Europeana, DPLA, Collectie Nederland |
| international_aggregator | 4 | WorldCat, VIAF |
| digitization_program | 2 | Google Arts & Culture |
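The counts in this table can be turned into percentage shares with a few lines of arithmetic. The following is a minimal sketch that hard-codes the figures from the table. The `share` helper is introduced here for illustration and is not part of the batch script.

```python
# Counts copied from the Partnership Type Distribution table.
type_counts = {
    "academic_network": 83,
    "national_certification": 79,
    "thematic_network": 41,
    "linked_data_platform": 23,
    "aggregator_participation": 10,
    "international_aggregator": 4,
    "digitization_program": 2,
}

total = sum(type_counts.values())


def share(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {name: share(count, total) for name, count in type_counts.items()}

for name, pct in shares.items():
    print(f"{name:26s} {type_counts[name]:4d} {pct:5.1f}%")
```

The counts sum to the 242 partnerships reported in the summary, and the two largest categories (academic_network and national_certification) together account for roughly two thirds of them.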
Geographic Coverage
Countries Represented
- Netherlands provinces: Limburg, Zeeland, Gelderland, Drenthe, Groningen, Friesland (7)
- International: Afghanistan, Argentina, Austria, Belgium, Brazil, Canada, Chile, Egypt, Hungary, Mexico, Saudi Arabia, and more (12+)
- "Unknown": 137 of 160 conversation files (filename parsing needs improvement)
Geographic Distribution Analysis
The high count of "Unknown" (137/160 = 85.6%) shows that filename-based country extraction fails for most conversations. Possible improvements:
- Parse conversation titles for country names
- Extract location data from conversation content
- Cross-reference with institution metadata
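The first improvement, matching country names in conversation titles, could look roughly like the sketch below. The country list is limited to the names mentioned in this report, the example filenames are hypothetical, and `guess_country` is an illustrative helper rather than a function from the existing script.

```python
import re

# Illustrative lookup limited to countries named in this report.
COUNTRIES = [
    "Afghanistan", "Argentina", "Austria", "Belgium", "Brazil", "Canada",
    "Chile", "Egypt", "Hungary", "Mexico", "Saudi Arabia",
]


def normalize(text):
    """Lowercase the text and collapse separators (-, _, ., spaces) into single spaces."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def guess_country(filename):
    """Return the first listed country found as whole words, else 'Unknown'."""
    haystack = f" {normalize(filename)} "
    for country in COUNTRIES:
        if f" {normalize(country)} " in haystack:
            return country
    return "Unknown"


# Hypothetical filenames; the real naming scheme may differ.
for name in ["saudi-arabia-museum-inventory.json",
             "glam_resources_brazil.json",
             "library_catalog_notes.json"]:
    print(name, "->", guess_country(name))
```

Whole-word matching avoids false hits inside longer words, but it also misses adjectival forms such as "Brazilian", which is one reason the content-based and metadata-based approaches in the list remain useful.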
Data Quality Observations
Strengths
✅ High extraction rate: 242 partnerships from 160 conversations
✅ Consistent parsing: 0 failures, 100% processing success
✅ Rich partnership types: 7 distinct partnership categories
✅ Global coverage: 12+ countries represented
✅ Standard compliance: W3C ORG ontology, Schema.org, PROV-O
Areas for Improvement
⚠️ Geographic extraction: 137/160 (85.6%) marked as "Unknown" country
⚠️ Institution type inference: All marked as "MIXED" (batch extraction limitation)
⚠️ Partnership descriptions: Sometimes verbose, extracted from conversation context
⚠️ Duplicate partners: Some partners may be variations (e.g., "OCLC" vs "WorldCat")
Technical Implementation
Bug Fix Applied
Problem: Script crashed during RDF generation
Root Cause: Incorrect use of RDFExporter.export() API
Solution: Switched to RDFExporter.export_to_file() method
Before (incorrect):
rdf_output = self.exporter.export(
    custodians=custodians,
    output_path=output_path,  # ← Parameter doesn't exist
    format="turtle"
)
After (correct):
self.exporter.export_to_file(
    custodians=custodians,
    filepath=str(output_path),
    format="turtle"
)
Model Validation
Linter warnings about missing required parameters were false positives:
- Partnership.partner_id - Optional (line 227 in models.py)
- HeritageCustodian.organization_status - Has default: OrganizationStatus.UNKNOWN
- Provenance fields - All optional except core fields
The script executed successfully despite these linter warnings.
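To illustrate why such warnings are harmless, here is a stand-in using standard-library dataclasses. The real definitions in models.py are not shown in this report and may use a different framework. The `partner_name` and `name` fields and the enum value are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OrganizationStatus(Enum):
    UNKNOWN = "unknown"  # enum value is assumed; only the member name is from the report


@dataclass
class Partnership:
    partner_name: str                 # assumed required field
    partner_id: Optional[str] = None  # optional, so callers may omit it


@dataclass
class HeritageCustodian:
    name: str                         # assumed required field
    organization_status: OrganizationStatus = OrganizationStatus.UNKNOWN


# Both objects are valid without the fields the linter flagged.
p = Partnership(partner_name="OCLC")
c = HeritageCustodian(name="Institution Name")
print(p, c)
```

Because the flagged fields carry defaults, omitting them at construction time is valid, which matches the observed behavior of the script.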
Use Cases
1. SPARQL Queries
Query the RDF graph for partnership analysis:
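# Find all institutions using IIIF
# Prefix IRIs are assumed; match them to the @prefix lines in the .ttl file
PREFIX ghcid: <https://w3id.org/heritage/custodian/>
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX schema: <http://schema.org/>
SELECT ?institution ?name
WHERE {
    ?institution a ghcid:HeritageCustodian ;
        schema:name ?name ;
        org:hasMembership ?membership .
    ?membership ghcid:partner_name "IIIF" .
}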
2. Network Visualization
Use partner_network.json with D3.js force-directed graph:
d3.json("partner_network.json").then(data => {
    // Render nodes (institutions + partners)
    // Render edges (partnerships)
    // Color by partnership_type
});
3. Statistics Analysis
Analyze partnership trends from partnership_statistics.json:
- Most connected partners (OCLC dominates)
- Partnership type distribution
- Geographic coverage gaps
Next Steps
Short-term Enhancements
- Improve geographic extraction:
  - Parse conversation titles for country names
  - Extract locations from conversation content
  - Use NER (Named Entity Recognition) for places
- Deduplicate partners:
  - Normalize partner names (OCLC/WorldCat, UNESCO/World Heritage)
  - Create partner authority file with aliases
  - Link to Wikidata for canonical identifiers
- Infer institution types:
  - Use conversation topics to classify institutions
  - Pattern matching on institution names
  - Cross-reference with CSV datasets
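The partner deduplication step above could start from a small alias table, as in this sketch. The alias groupings simply follow the pairs suggested in the list and are an editorial assumption. `canonical_partner` is a hypothetical helper, and the counts are taken from the Top Partners table.

```python
from collections import Counter

# Hypothetical alias table; which names count as the same partner is an
# editorial decision. Keys are lowercase raw names.
ALIASES = {
    "oclc": "OCLC",
    "worldcat": "OCLC",
    "unesco": "UNESCO",
    "world heritage": "UNESCO",
}


def canonical_partner(name):
    """Map a raw partner name to its canonical form; unknown names pass through."""
    cleaned = name.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


# Connection counts from the Top Partners table.
raw_counts = {"OCLC": 76, "UNESCO": 48, "IIIF": 39, "World Heritage": 30, "WorldCat": 5}

merged = Counter()
for name, count in raw_counts.items():
    merged[canonical_partner(name)] += count

print(merged.most_common())
```

These merged figures are edge totals. An institution linked to both "OCLC" and "WorldCat" would still be counted twice unless edges are also deduplicated per institution.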
Long-term Integration
- Merge with CSV data:
  - Cross-link partnership data with Dutch ISIL/Org datasets
  - Validate partnership claims against authoritative sources
  - Elevate data tier from TIER_4_INFERRED to TIER_2_VERIFIED
- Temporal analysis:
  - Track partnership start/end dates
  - Analyze partnership formation trends over time
  - Identify emerging networks (e.g., IIIF growth)
- Visualization dashboards:
  - Interactive network graph (D3.js, Cytoscape.js)
  - Geographic map of partnerships (Leaflet, Mapbox)
  - Partnership type distribution charts
Files Generated
data/exports/
├── global_glam_partnerships.ttl # 292 KB - RDF/Turtle knowledge graph
├── partnership_statistics.json # 32 KB - Summary statistics
└── partner_network.json # 58 KB - D3.js network graph data
Verification Commands
Validate RDF syntax:
rapper -i turtle -c data/exports/global_glam_partnerships.ttl
Query RDF with SPARQL:
from rdflib import Graph
g = Graph()
g.parse('data/exports/global_glam_partnerships.ttl', format='turtle')
results = g.query('SELECT * WHERE { ?s ?p ?o } LIMIT 10')
# Print the first ten triples to confirm the graph loaded
for row in results:
    print(row)
View statistics:
jq '.summary' data/exports/partnership_statistics.json
Visualize network:
python -m http.server 8000
# Open browser to D3.js visualization loading partner_network.json
Conclusion
Task 6 completed successfully with:
- ✅ 160/160 conversation files processed (0 failures)
- ✅ 242 partnerships extracted across 23 unique partners
- ✅ RDF knowledge graph exported (4,744 triples)
- ✅ Statistics and network data generated
- ✅ SPARQL-queryable linked data
The partnership extraction reveals OCLC, UNESCO, and IIIF as dominant global heritage partners, with opportunities to improve geographic coverage and institution type classification in future iterations.
Key Finding: OCLC's overwhelming presence (76 connections, 31.4% of all partnerships) highlights the central role of library cataloging infrastructure in global GLAM digital strategy.