# GLAM Extractor
Extract and standardize global GLAM (Galleries, Libraries, Archives, Museums) institutional data from conversation transcripts and authoritative registries.
🚀 How to Run the Application: a complete guide to starting the frontend, backend, and servers.
## Overview
This project extracts structured heritage institution data from 139+ Claude conversation JSON files covering worldwide GLAM research, integrates with authoritative CSV datasets (Dutch ISIL registry, Dutch heritage organizations), validates against a comprehensive LinkML schema, and exports to multiple formats (RDF/Turtle, JSON-LD, CSV, Parquet, SQLite).
## Features
- Multi-source data integration: Conversation transcripts, CSV registries, web crawling, Wikidata
- NLP extraction: spaCy NER, transformers-based classification, pattern matching
- LinkML validation: Comprehensive schema with TOOI, Schema.org, CPOV, ISIL, RiC-O, BIBFRAME
- Provenance tracking: Every data point tracks source, confidence, and verification status
- Multi-format export: RDF/Turtle, JSON-LD, CSV, Parquet, SQLite
- Geocoding: Nominatim integration for location enrichment (see the sketch after this list)
- Multilingual support: Handles 60+ countries and languages
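As a sketch of the geocoding step: the project names Nominatim but not a specific client library, so geopy and the example query below are assumptions for illustration.

```python
# Minimal geocoding sketch using geopy's Nominatim client (an assumption:
# the project does not state which client library it uses).
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="glam-extractor-demo")  # Nominatim requires an identifying user agent
location = geolocator.geocode("Rijksmuseum, Amsterdam, Netherlands")
if location is not None:
    print(location.latitude, location.longitude)
```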
## Interactive Frontend (React + TypeScript + D3.js)
- UML Viewer 🎨 - Interactive D3.js visualization of heritage custodian ontology diagrams (docs)
  - Mermaid class diagrams, ER diagrams, PlantUML, GraphViz
  - Zoom, pan, drag nodes, click for details
  - 14 schema diagrams from `schemas/20251121/uml/`
- Query Builder 🔍 - Visual SPARQL query constructor
  - Add variables, triple patterns, filters
  - Live SPARQL generation
  - Execute against endpoints
- Graph Visualizer 🕸️ - RDF graph exploration with D3.js
  - Upload RDF/Turtle files
  - Interactive force-directed layout
  - SPARQL queries
  - Node metadata inspection
- Database 🗄️ - TypeDB integration (optional)
- NDE House Style 🎨 - Netwerk Digitaal Erfgoed branding throughout
Start the frontend: `cd frontend && npm run dev`
## Quick Start

### Installation
```bash
# Install Poetry (if not already installed)
curl -sSL https://install.python-poetry.org | python3 -

# Clone repository and install dependencies
cd glam-extractor
poetry install

# Download spaCy models
poetry run python -m spacy download en_core_web_trf
poetry run python -m spacy download nl_core_news_lg
poetry run python -m spacy download xx_ent_wiki_sm
```
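To confirm the models installed correctly, a minimal smoke test (a sketch, not part of the project CLI) is to load one and tag a sentence:

```python
# Smoke test: load the multilingual model downloaded above and tag entities.
import spacy

nlp = spacy.load("xx_ent_wiki_sm")
doc = nlp("The Rijksmuseum in Amsterdam and the Koninklijke Bibliotheek in The Hague.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # entity spans with ORG/LOC-style labels
```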
### Basic Usage
```bash
# Extract from conversation JSON
poetry run glam extract conversations/Brazilian_GLAM.json -o output.jsonld

# Extract from Dutch CSV
poetry run glam extract data/ISIL-codes_2025-08-01.csv --csv -o dutch_isil.jsonld

# Validate extracted data
poetry run glam validate output.jsonld -s schemas/heritage_custodian.yaml

# Export to RDF
poetry run glam export output.jsonld -o output.ttl -f rdf

# Crawl institutional website
poetry run glam crawl https://www.rijksmuseum.nl -o rijksmuseum.jsonld
```
## Linked Open Data
The project publishes heritage institution data as W3C-compliant RDF aligned with international ontologies.
### Schema RDF Formats (8 Serializations)

The LinkML schema is available in 8 RDF formats, generated from `schemas/20251121/linkml/01_custodian_name_modular.yaml`:
| Format | File | Size | Use Case |
|---|---|---|---|
| Turtle | `01_custodian_name.owl.ttl` | 77 KB | Human-readable, Git-friendly |
| N-Triples | `01_custodian_name.nt` | 233 KB | Line-oriented processing |
| JSON-LD | `01_custodian_name.jsonld` | 191 KB | Web APIs, JavaScript |
| RDF/XML | `01_custodian_name.rdf` | 165 KB | Legacy systems, Java |
| Notation3 | `01_custodian_name.n3` | 77 KB | Logic rules, reasoning |
| TriG | `01_custodian_name.trig` | 103 KB | Named graphs, datasets |
| TriX | `01_custodian_name.trix` | 348 KB | XML with named graphs |
| N-Quads | `01_custodian_name.nq` | 288 KB | Quad-based processing |
All formats are located in `schemas/20251121/rdf/`.
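Since all eight serializations carry the same triples, any one of them can be loaded and re-serialized with RDFLib (the RDF library this project acknowledges); a minimal sketch:

```python
# Load the Turtle serialization and re-serialize it as N-Triples.
from rdflib import Graph

g = Graph()
g.parse("schemas/20251121/rdf/01_custodian_name.owl.ttl", format="turtle")
print(len(g), "triples loaded")
g.serialize(destination="custodian_name.nt", format="nt")
```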
### Published Datasets

#### Denmark 🇩🇰 - ✅ COMPLETE (November 2025)
- 2,348 institutions (555 libraries, 594 archives, 1,199 branches)
- 43,429 RDF triples across 9 ontologies
- 769 Wikidata links (32.8% coverage)
- Formats: Turtle, RDF/XML, JSON-LD, N-Triples
See `data/rdf/README.md` for SPARQL examples and usage.
### Ontology Alignment
| Ontology | Purpose | Coverage |
|---|---|---|
| CPOV (Core Public Organisation Vocabulary) | EU public sector standard | All institutions |
| Schema.org | Web semantics (Library, ArchiveOrganization) | All institutions |
| RiC-O (Records in Contexts Ontology) | Archival description | Archives |
| ORG (W3C Organization Ontology) | Hierarchical relationships | Branches |
| PROV-O (Provenance Ontology) | Data provenance tracking | All institutions |
| OWL | Semantic equivalence (Wikidata links) | 32.8% of Danish institutions |
### SPARQL Examples
```sparql
# Find all libraries in Copenhagen
PREFIX schema: <http://schema.org/>
PREFIX cpov: <http://data.europa.eu/m8g/>

SELECT ?library ?name ?address WHERE {
  ?library a cpov:PublicOrganisation, schema:Library .
  ?library schema:name ?name .
  ?library schema:address ?addrNode .
  ?addrNode schema:addressLocality "København K" .
  ?addrNode schema:streetAddress ?address .
}
```

```sparql
# Find all institutions with Wikidata links
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>

SELECT ?institution ?name ?wikidataID WHERE {
  ?institution schema:name ?name .
  ?institution owl:sameAs ?wikidataURI .
  FILTER(STRSTARTS(STR(?wikidataURI), "http://www.wikidata.org/entity/Q"))
  BIND(STRAFTER(STR(?wikidataURI), "http://www.wikidata.org/entity/") AS ?wikidataID)
}
```
See `data/rdf/README.md` for more examples.
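These queries can also be run without a SPARQL endpoint by executing them in-process with RDFLib against a downloaded dataset file; a sketch (the file name below is hypothetical):

```python
# Execute a SPARQL query in-process against a local RDF file.
from rdflib import Graph

g = Graph()
g.parse("data/rdf/denmark_institutions.ttl", format="turtle")  # hypothetical file name

query = """
PREFIX schema: <http://schema.org/>
SELECT ?name WHERE { ?library a schema:Library ; schema:name ?name . } LIMIT 10
"""
for row in g.query(query):
    print(row.name)
```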
## Project Structure
```text
glam-extractor/
├── pyproject.toml                  # Poetry configuration
├── README.md                       # This file
├── AGENTS.md                       # AI agent instructions
├── .opencode/                      # AI agent documentation
│   ├── HYPER_MODULAR_STRUCTURE.md
│   └── SLOT_NAMING_CONVENTIONS.md
├── src/glam_extractor/             # Main package
│   ├── __init__.py
│   ├── cli.py                      # Command-line interface
│   ├── parsers/                    # Conversation & CSV parsers
│   ├── extractors/                 # NLP extraction engines
│   ├── crawlers/                   # Web crawling (crawl4ai)
│   ├── validators/                 # LinkML validation
│   ├── exporters/                  # Multi-format export
│   ├── geocoding/                  # Nominatim geocoding
│   └── utils/                      # Utilities
├── schemas/20251121/               # LinkML schemas
│   ├── linkml/                     # Hyper-modular schema (78 files)
│   │   ├── 01_custodian_name_modular.yaml
│   │   └── modules/
│   │       ├── metadata.yaml
│   │       ├── classes/            # 12 class modules
│   │       ├── enums/              # 5 enum modules
│   │       └── slots/              # 59 slot modules
│   ├── rdf/                        # 8 RDF serialization formats
│   │   ├── 01_custodian_name.owl.ttl
│   │   ├── 01_custodian_name.nt
│   │   ├── 01_custodian_name.jsonld
│   │   ├── 01_custodian_name.rdf
│   │   ├── 01_custodian_name.n3
│   │   ├── 01_custodian_name.trig
│   │   ├── 01_custodian_name.trix
│   │   └── 01_custodian_name.nq
│   └── examples/                   # LinkML instance examples
├── tests/                          # Test suite
│   ├── unit/
│   ├── integration/
│   └── fixtures/
├── docs/                           # Documentation
│   ├── plan/global_glam/           # Planning documents
│   ├── api/                        # API documentation
│   ├── tutorials/                  # User tutorials
│   └── examples/                   # Usage examples
└── data/                           # Reference data
    ├── ISIL-codes_2025-08-01.csv
    ├── voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv
    └── ontology/                   # Base ontologies (TOOI, CPOV, Schema.org, etc.)
```
## Data Sources

### Conversation JSON Files
139+ conversation files covering global GLAM research:
- Geographic coverage: 60+ countries across all continents
- Content: Institution names, locations, collections, digital platforms, partnerships
- Languages: Multilingual (English, Dutch, Portuguese, Spanish, Vietnamese, Japanese, Arabic, etc.)
### CSV Datasets
- Dutch ISIL Registry (`ISIL-codes_2025-08-01.csv`): ~300 Dutch heritage institutions with authoritative ISIL codes
- Dutch Organizations (`voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv`): Comprehensive metadata including systems, partnerships, collection platforms
### External Sources (Optional Enrichment)
- Wikidata: SPARQL queries for additional metadata (see the sketch after this list)
- VIAF: Authority file linking
- GeoNames: Geographic name authority
- Nominatim: Geocoding service
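As an illustration of the Wikidata step, a hedged sketch using SPARQLWrapper; the client library and query shape are assumptions, not the project's confirmed implementation (P791 is Wikidata's ISIL identifier property, Q55 is the Netherlands):

```python
# Sketch: fetch Dutch institutions with ISIL codes from Wikidata.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql", agent="glam-extractor-demo")
sparql.setQuery("""
SELECT ?item ?itemLabel ?isil WHERE {
  ?item wdt:P791 ?isil .   # P791 = ISIL identifier
  ?item wdt:P17 wd:Q55 .   # country = Netherlands
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,nl". }
} LIMIT 10
""")
sparql.setReturnFormat(JSON)
for b in sparql.query().convert()["results"]["bindings"]:
    print(b["isil"]["value"], b["itemLabel"]["value"])
```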
## Data Quality & Provenance
Every extracted record includes provenance metadata:
```yaml
provenance:
  data_source: CONVERSATION_NLP | ISIL_REGISTRY | DUTCH_ORG_CSV | WEB_CRAWL | WIKIDATA
  data_tier: TIER_1_AUTHORITATIVE | TIER_2_VERIFIED | TIER_3_CROWD_SOURCED | TIER_4_INFERRED
  extraction_date: 2025-11-05T...
  extraction_method: "spaCy NER + GPT-4 classification"
  confidence_score: 0.0-1.0
  conversation_id: "uuid"
  source_url: "https://..."
  verified_date: null
  verified_by: null
```
Data Tiers:
- Tier 1: Official registries (ISIL, national registers) - highest authority
- Tier 2: Verified institutional data (official websites)
- Tier 3: Community-sourced data (Wikidata, OpenStreetMap)
- Tier 4: NLP-extracted or inferred data - requires verification
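In practice, the tiers can drive a simple verification gate. A sketch: the field names follow the provenance block above, but the record shape and thresholds are assumptions.

```python
# Flag records that need human verification based on tier and confidence.
TIER_RANK = {
    "TIER_1_AUTHORITATIVE": 1,
    "TIER_2_VERIFIED": 2,
    "TIER_3_CROWD_SOURCED": 3,
    "TIER_4_INFERRED": 4,
}

def needs_verification(provenance: dict, min_confidence: float = 0.8) -> bool:
    """Tier 4 always needs review; other tiers only when unverified and low-confidence."""
    tier = TIER_RANK.get(provenance.get("data_tier"), 4)
    if tier >= 4:
        return True
    verified = provenance.get("verified_by") is not None
    return not verified and provenance.get("confidence_score", 0.0) < min_confidence
```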
## LinkML Schema

### Hyper-Modular Architecture

The project uses a hyper-modular LinkML schema (`schemas/20251121/linkml/01_custodian_name_modular.yaml`) in which every class, enum, and slot is defined in its own file, for maximum maintainability and version-control granularity.
Schema Structure:
- 78 YAML files total
- 12 class modules (`modules/classes/`)
- 5 enum modules (`modules/enums/`)
- 59 slot modules (`modules/slots/`)
- 1 metadata module (`modules/metadata.yaml`)
- 1 main schema (`01_custodian_name_modular.yaml`)
Direct Import Pattern:

```yaml
imports:
  - linkml:types
  - modules/metadata
  - modules/enums/AgentTypeEnum
  - modules/slots/observed_name
  - modules/classes/CustodianObservation
  # ... 76 total individual module imports
```
Benefits:
- ✅ Complete transparency - all dependencies visible
- ✅ Granular version control - one file per concept
- ✅ Parallel development - no merge conflicts
- ✅ Selective imports - customize schemas easily
See `.opencode/HYPER_MODULAR_STRUCTURE.md` for complete architecture documentation.
### Ontology Alignment
The schema integrates multiple international standards:
- CPOV: Core Public Organisation Vocabulary (EU public sector)
- TOOI: Dutch organizational ontology
- Schema.org: General web semantics
- CIDOC-CRM: Cultural heritage domain model
- RiC-O: Records in Contexts Ontology
- PROV-O: Provenance tracking
- PiCo: Person observations pattern
Key Classes:
- `CustodianObservation`: Source-based references (emic/etic perspectives)
- `CustodianName`: Standardized emic names
- `CustodianReconstruction`: Formal legal entities
- `ReconstructionActivity`: Entity derivation from observations
- `Agent`: People responsible for observations/reconstructions
- `SourceDocument`: Documentary evidence
- `Identifier`: External identifiers (ISIL, VIAF, Wikidata)
- `TimeSpan`: Temporal extents with fuzzy boundaries
- `ConfidenceMeasure`: Data quality metrics
Observation → Reconstruction Pattern:

```text
SourceDocument → CustodianObservation → ReconstructionActivity → CustodianReconstruction
    (text)        (what source says)      (synthesis method)        (formal entity)
```
This pattern distinguishes between source-based references and scholar-derived formal entities, inspired by the PiCo (Persons in Context) ontology.
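A hypothetical instance sketch of this chain in YAML (the class names and the `observed_name` slot come from the schema; the other slot names and identifiers are illustrative assumptions):

```yaml
# Illustrative only: slot names other than observed_name are assumptions.
- class: SourceDocument
  id: ex:annual-report-1923
- class: CustodianObservation
  id: ex:obs-1
  observed_name: "Gemeente-archief Amsterdam"   # the name exactly as the source gives it
  source: ex:annual-report-1923
- class: ReconstructionActivity
  id: ex:act-1
  used_observations: [ex:obs-1]
- class: CustodianReconstruction
  id: ex:rec-1
  derived_from: ex:act-1
```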
## Development

### Run Tests
```bash
poetry run pytest                 # All tests
poetry run pytest -m unit         # Unit tests only
poetry run pytest -m integration  # Integration tests only
poetry run pytest --cov           # With coverage report
```
### Code Quality

```bash
poetry run black src/ tests/       # Format code
poetry run ruff check src/ tests/  # Lint code
poetry run mypy src/               # Type checking
```
### Pre-commit Hooks

```bash
poetry run pre-commit install
poetry run pre-commit run --all-files
```
### Documentation

```bash
poetry run mkdocs serve  # Serve docs locally
poetry run mkdocs build  # Build static docs
```
## Examples

### Extract Brazilian Institutions
```python
from glam_extractor import ConversationParser, InstitutionExtractor

# Parse conversation
parser = ConversationParser()
conversation = parser.load("Brazilian_GLAM_collection_inventories.json")

# Extract institutions
extractor = InstitutionExtractor()
institutions = extractor.extract(conversation)

# Print results
for inst in institutions:
    print(f"{inst.name} ({inst.institution_type})")
    print(f"  Location: {inst.locations[0].city}, {inst.locations[0].country}")
    print(f"  Confidence: {inst.provenance.confidence_score}")
```
### Cross-link Dutch Data
```python
from glam_extractor import CSVParser, InstitutionExtractor
from glam_extractor.validators import LinkMLValidator

# Load Dutch ISIL registry
csv_parser = CSVParser()
dutch_institutions = csv_parser.load_isil_registry("ISIL-codes_2025-08-01.csv")

# Load Dutch organizations
dutch_orgs = csv_parser.load_dutch_organizations("voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv")

# Cross-link and merge
extractor = InstitutionExtractor()
merged = extractor.merge_dutch_data(dutch_institutions, dutch_orgs)

# Validate
validator = LinkMLValidator(schema="schemas/heritage_custodian.yaml")
results = validator.validate_batch(merged)
print(f"Valid: {results.valid_count}, Invalid: {results.invalid_count}")
```
### Export to Multiple Formats
```python
from glam_extractor.exporters import JSONLDExporter, RDFExporter, CSVExporter

# Load extracted data
institutions = load_institutions("output.jsonld")

# Export to RDF/Turtle
rdf_exporter = RDFExporter()
rdf_exporter.export(institutions, "output.ttl")

# Export to CSV
csv_exporter = CSVExporter()
csv_exporter.export(institutions, "output.csv")

# Export to Parquet
csv_exporter.export_parquet(institutions, "output.parquet")
```
## Documentation
- Planning Docs (`docs/plan/global_glam/`):
  - `01-implementation-phases.md`: 7-phase implementation plan
  - `02-architecture.md`: System architecture and data flow
  - `03-dependencies.md`: Technology stack and dependencies
  - `04-data-standardization.md`: Data integration strategies
  - `05-design-patterns.md`: Software design patterns
  - `06-consumers-use-cases.md`: User segments and applications
- AI Agent Instructions (`AGENTS.md`):
  - NLP extraction guidelines
  - Data quality protocols
  - Agent workflow examples
- API Documentation: Generated from docstrings with mkdocstrings
## Contributing
This is a research project. Contributions welcome!
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## License

MIT License - see the `LICENSE` file for details.
## Acknowledgments
- LinkML: Schema framework
- spaCy: NLP processing
- crawl4ai: Web crawling
- RDFLib: RDF processing
- Dutch ISIL Registry: Authoritative institution data
- Claude AI: Conversation data source
## Contact
For questions or collaboration inquiries, please open an issue on GitHub.
**Version**: 0.1.0
**Status**: Alpha, implementation in progress
**Last Updated**: 2025-11-05