# Task 6: Partnership Batch Processing - Complete Analysis

**Date**: November 7, 2025
**Status**: ✅ COMPLETE
**Output Files**: 3 files (RDF, Statistics, Network Graph)

## Summary

Successfully extracted partnerships from 160 GLAM conversation files and generated a unified RDF knowledge graph with 242 partnership connections across 115 institutions and 23 unique partner organizations.

## Batch Processing Results

### Files Processed
- **Total conversation files**: 160/160 (100% success rate)
- **Failed**: 0
- **Processing time**: ~8 seconds

### Extraction Statistics
- **Total partnerships extracted**: 242
- **Unique partner organizations**: 23
- **Institution nodes**: 115
- **Network edges**: 242

## Output Files

### 1. RDF/Turtle Graph (`global_glam_partnerships.ttl`)
- **Size**: 292 KB
- **Lines**: 4,268
- **RDF Triples**: 4,744
- **Format**: Turtle (.ttl)
- **Ontologies Used**:
  - W3C Organization Ontology (ORG) - for partnerships
  - Schema.org - for discoverability
  - PROV-O - for provenance tracking
  - Custom GHCID namespace

**SPARQL Validation**:
```sparql
# Query top partners by connection count
SELECT ?partner_name (COUNT(?partnership) as ?count)
WHERE {
  ?partnership <https://w3id.org/heritage/custodian/partner_name> ?partner_name .
}
GROUP BY ?partner_name
ORDER BY DESC(?count)
```

**RDF Structure Example**:
```turtle
<https://w3id.org/heritage/custodian/batch/institution-name>
    a ghcid:HeritageCustodian, org:Organization, schema:Organization, prov:Entity ;
    schema:name "Institution Name" ;
    ghcid:institution_type "MIXED" ;
    org:hasMembership [
        a org:Membership, ghcid:Partnership ;
        org:member [ a org:Organization ; schema:name "OCLC" ] ;
        org:role "academic_network" ;
        ghcid:partnership_type "academic_network"
    ] ;
    prov:generatedAtTime "2025-11-07T12:49:47Z"^^xsd:dateTime .
```

### 2. Statistics Report (`partnership_statistics.json`)
- **Size**: 32 KB
- **Format**: JSON
- **Contents**:
  - Summary statistics (total files, partnerships, partners)
  - Partnership type distribution
  - Top partners ranking
  - Geographic distribution
  - Failed files tracking

### 3. Network Graph (`partner_network.json`)
- **Size**: 58 KB
- **Format**: D3.js-compatible JSON
- **Nodes**: 138 (115 institutions + 23 partners)
- **Edges**: 242 (partnership connections)
- **Ready for**: Force-directed graph visualization

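As a rough illustration of the node-link layout that D3.js force-directed examples commonly consume, the sketch below builds such a structure from `(institution, partner, partnership_type)` tuples. The key names (`nodes`, `links`, `id`, `group`, `source`, `target`, `type`), the `build_network` helper, and the sample rows are all assumptions made for illustration; the actual schema of `partner_network.json` is not documented in this report.

```python
import json

def build_network(partnerships):
    """Build a node-link dict from (institution, partner, partnership_type) tuples."""
    institutions = sorted({inst for inst, _, _ in partnerships})
    partners = sorted({partner for _, partner, _ in partnerships})
    nodes = [{"id": name, "group": "institution"} for name in institutions]
    nodes += [{"id": name, "group": "partner"} for name in partners]
    # One link per extracted partnership, so repeated pairs are kept as-is.
    links = [
        {"source": inst, "target": partner, "type": ptype}
        for inst, partner, ptype in partnerships
    ]
    return {"nodes": nodes, "links": links}

# Hypothetical sample rows, not taken from the real export.
sample = [
    ("Museum A", "IIIF", "thematic_network"),
    ("Museum A", "UNESCO", "national_certification"),
    ("Library B", "OCLC", "academic_network"),
]
network = build_network(sample)
print(json.dumps(network, indent=2))
```

With this layout the node count is the number of distinct institutions plus the number of distinct partners, which is the same relationship as the reported 138 = 115 + 23.
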
## Top Partners Discovered

| Rank | Partner Name | Connections | Type | Description |
|------|-------------|-------------|------|-------------|
| 1 | **OCLC** | 76 | academic_network | Global library consortium (WorldCat) |
| 2 | **UNESCO** | 48 | national_certification | World Heritage Site designation |
| 3 | **IIIF** | 39 | thematic_network | International Image Interoperability Framework |
| 4 | **World Heritage** | 30 | national_certification | UNESCO World Heritage Sites |
| 5 | **linked data** | 8 | linked_data_platform | Semantic web technologies |
| 6 | **Europeana** | 6 | aggregator_participation | European cultural heritage aggregator |
| 7 | **WorldCat** | 5 | international_aggregator | OCLC bibliographic database |
| 8 | **VIAF** | 4 | academic_network | Virtual International Authority File |
| 9 | **DPLA** | 3 | aggregator_participation | Digital Public Library of America |
| 10 | **Tainacan** | 3 | thematic_network | Brazilian open-source collections platform |

## Partnership Type Distribution

| Type | Count | Example Partners |
|------|-------|------------------|
| **academic_network** | 83 | OCLC, VIAF, OAI-PMH, DSpace |
| **national_certification** | 79 | UNESCO, World Heritage, Museumregister |
| **thematic_network** | 41 | IIIF, Tainacan, domain-specific consortia |
| **linked_data_platform** | 23 | RDF, SPARQL, semantic web technologies |
| **aggregator_participation** | 10 | Europeana, DPLA, Collectie Nederland |
| **international_aggregator** | 4 | WorldCat, VIAF |
| **digitization_program** | 2 | Google Arts & Culture |

## Geographic Coverage

### Countries Represented
- **Netherlands provinces**: Limburg, Zeeland, Gelderland, Drenthe, Groningen, Friesland (reported count: 7; six are named here)
- **International**: Afghanistan, Argentina, Austria, Belgium, Brazil, Canada, Chile, Egypt, Hungary, Mexico, Saudi Arabia, and more (12+)
- **"Unknown"**: 137 of 160 conversation files (filename parsing needs improvement)

### Geographic Distribution Analysis
The high share of "Unknown" (137/160 = 85.6%) shows that extracting geographic information from conversation filenames alone is not sufficient. Future improvements:
- Parse conversation titles for country names
- Extract location data from conversation content
- Cross-reference with institution metadata

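A minimal sketch of the first improvement, assuming that some filenames or titles contain a country name as separate words. The `KNOWN_COUNTRIES` list, the `guess_country` helper, and the example filenames are hypothetical; the real filename convention is not described in this report.

```python
import re

# Illustrative subset; a real list would cover every country of interest.
KNOWN_COUNTRIES = [
    "Afghanistan", "Argentina", "Austria", "Belgium", "Brazil", "Canada",
    "Chile", "Egypt", "Hungary", "Mexico", "Saudi Arabia",
]

def guess_country(filename):
    """Return the first country in KNOWN_COUNTRIES found in the filename, else "Unknown"."""
    # Treat underscores, hyphens and dots as spaces so "saudi_arabia" matches.
    text = re.sub(r"[_\-.]+", " ", filename).lower()
    for country in KNOWN_COUNTRIES:
        if re.search(r"\b" + re.escape(country.lower()) + r"\b", text):
            return country
    return "Unknown"

# Hypothetical filenames for illustration only.
for name in ["glam_institutions_saudi_arabia.json", "Chile-museums.json", "museum_chat_17.json"]:
    print(name, "->", guess_country(name))
```

Whole-word matching avoids false hits inside longer words, but it also misses adjectival forms such as "Brazilian", which is one reason the NER and metadata cross-referencing options remain worth exploring.
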
## Data Quality Observations

### Strengths
- ✅ **High extraction rate**: 242 partnerships from 160 conversations (about 1.5 per conversation)
- ✅ **Consistent parsing**: 0 failures, 100% processing success
- ✅ **Rich partnership types**: 7 distinct partnership categories
- ✅ **Global coverage**: 12+ countries represented
- ✅ **Standard compliance**: W3C ORG ontology, Schema.org, PROV-O

### Areas for Improvement
- ⚠️ **Geographic extraction**: 137/160 (85.6%) marked as "Unknown" country
- ⚠️ **Institution type inference**: All marked as "MIXED" (batch extraction limitation)
- ⚠️ **Partnership descriptions**: Sometimes verbose, because they are extracted from conversation context
- ⚠️ **Duplicate partners**: Some partner names may be variants of the same organization (e.g., "OCLC" vs "WorldCat")

## Technical Implementation

### Bug Fix Applied
**Problem**: Script crashed during RDF generation
**Root Cause**: Incorrect use of the `RDFExporter.export()` API, which was called with an `output_path` parameter it does not accept
**Solution**: Switched to the `RDFExporter.export_to_file()` method

**Before (incorrect)**:
```python
rdf_output = self.exporter.export(
    custodians=custodians,
    output_path=output_path,  # ← Parameter doesn't exist
    format="turtle"
)
```

**After (correct)**:
```python
self.exporter.export_to_file(
    custodians=custodians,
    filepath=str(output_path),
    format="turtle"
)
```

### Model Validation
Linter warnings about missing required parameters were **false positives**:
- `Partnership.partner_id` - Optional (line 227 in models.py)
- `HeritageCustodian.organization_status` - Has default: `OrganizationStatus.UNKNOWN`
- `Provenance` fields - All optional except core fields

The script ran successfully despite these linter warnings.

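## Use Cases

### 1. SPARQL Queries
Query the RDF graph for partnership analysis:
```sparql
# ghcid: is inferred from the full IRI used in the validation query.
PREFIX ghcid:  <https://w3id.org/heritage/custodian/>
PREFIX org:    <http://www.w3.org/ns/org#>
# Assumed form; use <https://schema.org/> if the export binds that IRI.
PREFIX schema: <http://schema.org/>

# Find all institutions using IIIF
SELECT ?institution ?name
WHERE {
  ?institution a ghcid:HeritageCustodian ;
               schema:name ?name ;
               org:hasMembership ?membership .
  ?membership ghcid:partner_name "IIIF" .
}
```
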
### 2. Network Visualization
Use `partner_network.json` with a D3.js force-directed graph:
```javascript
d3.json("partner_network.json").then(data => {
  // Render nodes (institutions + partners)
  // Render edges (partnerships)
  // Color by partnership_type
});
```

### 3. Statistics Analysis
Analyze partnership trends from `partnership_statistics.json`:
- Most connected partners (OCLC dominates)
- Partnership type distribution
- Geographic coverage gaps

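The partner ranking can be turned into shares of all connections with a few lines of Python. This is only a sketch: the counts are copied from the Top Partners table rather than read from `partnership_statistics.json`, because the key names in that file are not shown in this report, and the `share` helper is a name introduced here for illustration.

```python
# Connection counts copied from the Top Partners table in this report.
top_partners = {
    "OCLC": 76, "UNESCO": 48, "IIIF": 39, "World Heritage": 30,
    "linked data": 8, "Europeana": 6, "WorldCat": 5, "VIAF": 4,
    "DPLA": 3, "Tainacan": 3,
}
TOTAL_PARTNERSHIPS = 242  # total reported in the extraction statistics

def share(count, total=TOTAL_PARTNERSHIPS):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)

# Print partners from most to least connected, with their share of all edges.
for name, count in sorted(top_partners.items(), key=lambda kv: -kv[1]):
    print(f"{name:15} {count:3d} {share(count):5.1f}%")

top10_total = sum(top_partners.values())
print("Top 10 combined:", top10_total, f"({share(top10_total)}%)")
```

The ten listed partners account for 222 of the 242 connections, so the remaining 13 partners share 20 connections between them.
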
## Next Steps

### Short-term Enhancements
1. **Improve geographic extraction**:
   - Parse conversation titles for country names
   - Extract locations from conversation content
   - Use NER (Named Entity Recognition) for places

2. **Deduplicate partners**:
   - Normalize partner names (OCLC/WorldCat, UNESCO/World Heritage)
   - Create partner authority file with aliases
   - Link to Wikidata for canonical identifiers

3. **Infer institution types**:
   - Use conversation topics to classify institutions
   - Pattern matching on institution names
   - Cross-reference with CSV datasets

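The partner deduplication step could start from a small alias table, as in the sketch below. The `ALIASES` mapping and both helper functions are hypothetical, and whether pairs such as OCLC/WorldCat should really be merged is an editorial decision that this report leaves open.

```python
from collections import Counter

# Hypothetical alias table: variant name (lowercased) -> canonical partner name.
ALIASES = {
    "worldcat": "OCLC",
    "world heritage": "UNESCO",
}

def canonical_partner(name):
    """Map a raw partner name to its canonical form, if an alias is known."""
    key = " ".join(name.split()).lower()  # collapse whitespace, ignore case
    return ALIASES.get(key, name.strip())

def merge_counts(raw_counts):
    """Sum connection counts across aliases of the same partner."""
    merged = Counter()
    for name, count in raw_counts.items():
        merged[canonical_partner(name)] += count
    return dict(merged)

# Counts taken from the Top Partners table.
merged = merge_counts({"OCLC": 76, "WorldCat": 5, "UNESCO": 48, "World Heritage": 30})
print(merged)
```

Summing counts this way treats every extracted connection as distinct. If one institution mentions both "OCLC" and "WorldCat", merging at the edge level (one edge per institution and canonical partner) would give a lower figure than the simple sum.
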
### Long-term Integration
1. **Merge with CSV data**:
   - Cross-link partnership data with Dutch ISIL/Org datasets
   - Validate partnership claims against authoritative sources
   - Elevate data tier from TIER_4_INFERRED to TIER_2_VERIFIED

2. **Temporal analysis**:
   - Track partnership start/end dates
   - Analyze partnership formation trends over time
   - Identify emerging networks (e.g., IIIF growth)

3. **Visualization dashboards**:
   - Interactive network graph (D3.js, Cytoscape.js)
   - Geographic map of partnerships (Leaflet, Mapbox)
   - Partnership type distribution charts

## Files Generated

```
data/exports/
├── global_glam_partnerships.ttl    # 292 KB - RDF/Turtle knowledge graph
├── partnership_statistics.json     # 32 KB - Summary statistics
└── partner_network.json            # 58 KB - D3.js network graph data
```

## Verification Commands

**Validate RDF syntax**:
```bash
rapper -i turtle -c data/exports/global_glam_partnerships.ttl
```

**Query RDF with SPARQL**:
```python
from rdflib import Graph

g = Graph()
g.parse('data/exports/global_glam_partnerships.ttl', format='turtle')

# Print ten triples as a smoke test that the graph loaded
results = g.query('SELECT * WHERE { ?s ?p ?o } LIMIT 10')
for row in results:
    print(row.s, row.p, row.o)
```

**View statistics**:
```bash
jq '.summary' data/exports/partnership_statistics.json
```

**Visualize network**:
```bash
python -m http.server 8000
# Then open http://localhost:8000/ and load the D3.js page that reads partner_network.json
```

## Conclusion

Task 6 completed successfully with:
- ✅ 160/160 conversation files processed (0 failures)
- ✅ 242 partnerships extracted across 23 unique partners
- ✅ RDF knowledge graph exported (4,744 triples)
- ✅ Statistics and network data generated
- ✅ SPARQL-queryable linked data

The partnership extraction shows OCLC, UNESCO, and IIIF to be the most frequently mentioned global heritage partners. Geographic coverage and institution type classification remain the main areas to improve in future iterations.

**Key Finding**: OCLC's leading share (76 of 242 connections, 31.4% of all partnerships) highlights the central role of library cataloging infrastructure in global GLAM digital strategy.