GLAM Extractor
Extract and standardize global GLAM (Galleries, Libraries, Archives, Museums) institutional data from conversation transcripts and authoritative registries.
🚀 How to Run the Application: complete guide for starting the frontend, backend, and servers.
Overview
This project extracts structured heritage institution data from 139+ Claude conversation JSON files covering worldwide GLAM research, integrates with authoritative CSV datasets (Dutch ISIL registry, Dutch heritage organizations), validates against a comprehensive LinkML schema, and exports to multiple formats (RDF/Turtle, JSON-LD, CSV, Parquet, SQLite).
Features
- Multi-source data integration: Conversation transcripts, CSV registries, web crawling, Wikidata
- NLP extraction: spaCy NER, transformers-based classification, pattern matching
- LinkML validation: Comprehensive schema with TOOI, Schema.org, CPOV, ISIL, RiC-O, BIBFRAME
- Provenance tracking: Every data point tracks source, confidence, and verification status
- Multi-format export: RDF/Turtle, JSON-LD, CSV, Parquet, SQLite
- Geocoding: Nominatim integration for location enrichment
- Multilingual support: Handles 60+ countries and languages
Interactive Frontend (React + TypeScript + D3.js)
- UML Viewer 🎨 - Interactive D3.js visualization of heritage custodian ontology diagrams (docs)
  - Mermaid class diagrams, ER diagrams, PlantUML, GraphViz
  - Zoom, pan, drag nodes, click for details
  - 14 schema diagrams from schemas/20251121/uml/
- Query Builder 🔍 - Visual SPARQL query constructor
  - Add variables, triple patterns, filters
  - Live SPARQL generation
  - Execute against endpoints
- Graph Visualizer 🕸️ - RDF graph exploration with D3.js
  - Upload RDF/Turtle files
  - Interactive force-directed layout
  - SPARQL queries
  - Node metadata inspection
- Database 🗄️ - TypeDB integration (optional)
- NDE House Style 🎨 - Netwerk Digitaal Erfgoed branding throughout
Start the frontend: cd frontend && npm run dev
Quick Start
Installation
# Install Poetry (if not already installed)
curl -sSL https://install.python-poetry.org | python3 -
# Clone repository and install dependencies
cd glam-extractor
poetry install
# Download spaCy models
poetry run python -m spacy download en_core_web_trf
poetry run python -m spacy download nl_core_news_lg
poetry run python -m spacy download xx_ent_wiki_sm
Basic Usage
# Extract from conversation JSON
poetry run glam extract conversations/Brazilian_GLAM.json -o output.jsonld
# Extract from Dutch CSV
poetry run glam extract data/ISIL-codes_2025-08-01.csv --csv -o dutch_isil.jsonld
# Validate extracted data
poetry run glam validate output.jsonld -s schemas/heritage_custodian.yaml
# Export to RDF
poetry run glam export output.jsonld -o output.ttl -f rdf
# Crawl institutional website
poetry run glam crawl https://www.rijksmuseum.nl -o rijksmuseum.jsonld
Linked Open Data
The project publishes heritage institution data as W3C-compliant RDF aligned with international ontologies.
Schema RDF Formats (8 Serializations)
The LinkML schema is available in 8 RDF formats (generated from schemas/20251121/linkml/01_custodian_name_modular.yaml):
| Format | File | Size | Use Case |
|---|---|---|---|
| Turtle | 01_custodian_name.owl.ttl | 77KB | Human-readable, Git-friendly |
| N-Triples | 01_custodian_name.nt | 233KB | Line-oriented processing |
| JSON-LD | 01_custodian_name.jsonld | 191KB | Web APIs, JavaScript |
| RDF/XML | 01_custodian_name.rdf | 165KB | Legacy systems, Java |
| Notation3 | 01_custodian_name.n3 | 77KB | Logic rules, reasoning |
| TriG | 01_custodian_name.trig | 103KB | Named graphs, datasets |
| TriX | 01_custodian_name.trix | 348KB | XML with named graphs |
| N-Quads | 01_custodian_name.nq | 288KB | Quad-based processing |
All formats are located in schemas/20251121/rdf/.
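These serializations can be regenerated with RDFLib, which is already part of the stack. A minimal sketch, not the project's actual generation script (TriG, TriX, and N-Quads are quad formats and need a context-aware rdflib Dataset rather than a plain Graph):

from rdflib import Graph

# Parse the Turtle source of the schema
g = Graph()
g.parse("schemas/20251121/rdf/01_custodian_name.owl.ttl", format="turtle")

# Re-emit the graph-level serializations
for fmt, ext in [("nt", "nt"), ("json-ld", "jsonld"), ("xml", "rdf"), ("n3", "n3")]:
    g.serialize(destination=f"01_custodian_name.{ext}", format=fmt)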
Published Datasets
Denmark 🇩🇰 - ✅ COMPLETE (November 2025)
- 2,348 institutions (555 libraries, 594 archives, 1,199 branches)
- 43,429 RDF triples across 9 ontologies
- 769 Wikidata links (32.8% coverage)
- Formats: Turtle, RDF/XML, JSON-LD, N-Triples
See data/rdf/README.md for SPARQL examples and usage.
Ontology Alignment
| Ontology | Purpose | Coverage |
|---|---|---|
| CPOV (Core Public Organisation Vocabulary) | EU public sector standard | All institutions |
| Schema.org | Web semantics (Library, ArchiveOrganization) | All institutions |
| RiC-O (Records in Contexts Ontology) | Archival description | Archives |
| ORG (W3C Organization Ontology) | Hierarchical relationships | Branches |
| PROV-O (Provenance Ontology) | Data provenance tracking | All institutions |
| OWL | Semantic equivalence (Wikidata links) | 32.8% Denmark |
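To make the alignment concrete, here is an illustrative RDFLib sketch of one institution typed against several of these ontologies at once; the IRI and Wikidata QID are placeholders, not published data:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDF

SCHEMA = Namespace("http://schema.org/")
CPOV = Namespace("http://data.europa.eu/m8g/")

g = Graph()
lib = URIRef("https://example.org/institution/dk-lib-001")  # hypothetical IRI
g.add((lib, RDF.type, CPOV.PublicOrganisation))             # CPOV typing
g.add((lib, RDF.type, SCHEMA.Library))                      # Schema.org typing
g.add((lib, SCHEMA.name, Literal("Eksempel Bibliotek", lang="da")))
g.add((lib, OWL.sameAs, URIRef("http://www.wikidata.org/entity/Q1")))  # placeholder QID
print(g.serialize(format="turtle"))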
SPARQL Examples
# Find all libraries in Copenhagen
PREFIX schema: <http://schema.org/>
PREFIX cpov: <http://data.europa.eu/m8g/>
SELECT ?library ?name ?address WHERE {
  ?library a cpov:PublicOrganisation, schema:Library .
  ?library schema:name ?name .
  ?library schema:address ?addrNode .
  ?addrNode schema:addressLocality "København K" .
  ?addrNode schema:streetAddress ?address .
}
# Find all institutions with Wikidata links
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>
SELECT ?institution ?name ?wikidataID WHERE {
  ?institution schema:name ?name .
  ?institution owl:sameAs ?wikidataURI .
  FILTER(STRSTARTS(STR(?wikidataURI), "http://www.wikidata.org/entity/Q"))
  BIND(STRAFTER(STR(?wikidataURI), "http://www.wikidata.org/entity/") AS ?wikidataID)
}
See data/rdf/README.md for more examples.
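The same queries can also be run locally with RDFLib once a dataset file is downloaded. A minimal sketch; the filename is hypothetical (see data/rdf/README.md for the actual files):

from rdflib import Graph

g = Graph()
g.parse("data/rdf/denmark.ttl", format="turtle")  # hypothetical filename

results = g.query("""
    PREFIX schema: <http://schema.org/>
    PREFIX cpov: <http://data.europa.eu/m8g/>
    SELECT ?library ?name WHERE {
      ?library a cpov:PublicOrganisation, schema:Library ;
               schema:name ?name .
    }
""")
for row in results:
    print(row.library, row.name)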
Project Structure
glam-extractor/
├── pyproject.toml                  # Poetry configuration
├── README.md                       # This file
├── AGENTS.md                       # AI agent instructions
├── .opencode/                      # AI agent documentation
│   ├── HYPER_MODULAR_STRUCTURE.md
│   └── SLOT_NAMING_CONVENTIONS.md
├── src/glam_extractor/             # Main package
│   ├── __init__.py
│   ├── cli.py                      # Command-line interface
│   ├── parsers/                    # Conversation & CSV parsers
│   ├── extractors/                 # NLP extraction engines
│   ├── crawlers/                   # Web crawling (crawl4ai)
│   ├── validators/                 # LinkML validation
│   ├── exporters/                  # Multi-format export
│   ├── geocoding/                  # Nominatim geocoding
│   └── utils/                      # Utilities
├── schemas/20251121/               # LinkML schemas
│   ├── linkml/                     # Hyper-modular schema (78 files)
│   │   ├── 01_custodian_name_modular.yaml
│   │   └── modules/
│   │       ├── metadata.yaml
│   │       ├── classes/            # 12 class modules
│   │       ├── enums/              # 5 enum modules
│   │       └── slots/              # 59 slot modules
│   ├── rdf/                        # 8 RDF serialization formats
│   │   ├── 01_custodian_name.owl.ttl
│   │   ├── 01_custodian_name.nt
│   │   ├── 01_custodian_name.jsonld
│   │   ├── 01_custodian_name.rdf
│   │   ├── 01_custodian_name.n3
│   │   ├── 01_custodian_name.trig
│   │   ├── 01_custodian_name.trix
│   │   └── 01_custodian_name.nq
│   └── examples/                   # LinkML instance examples
├── tests/                          # Test suite
│   ├── unit/
│   ├── integration/
│   └── fixtures/
├── docs/                           # Documentation
│   ├── plan/global_glam/           # Planning documents
│   ├── api/                        # API documentation
│   ├── tutorials/                  # User tutorials
│   └── examples/                   # Usage examples
└── data/                           # Reference data
    ├── ISIL-codes_2025-08-01.csv
    ├── voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv
    └── ontology/                   # Base ontologies (TOOI, CPOV, Schema.org, etc.)
Data Sources
Conversation JSON Files
139+ conversation files covering global GLAM research:
- Geographic coverage: 60+ countries across all continents
- Content: Institution names, locations, collections, digital platforms, partnerships
- Languages: Multilingual (English, Dutch, Portuguese, Spanish, Vietnamese, Japanese, Arabic, etc.)
CSV Datasets
- Dutch ISIL Registry (ISIL-codes_2025-08-01.csv): ~300 Dutch heritage institutions with authoritative ISIL codes
- Dutch Organizations (voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv): Comprehensive metadata including systems, partnerships, collection platforms
External Sources (Optional Enrichment)
- Wikidata: SPARQL queries for additional metadata (see the sketch after this list)
- VIAF: Authority file linking
- GeoNames: Geographic name authority
- Nominatim: Geocoding service
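A minimal sketch of the optional Wikidata enrichment, assuming the SPARQLWrapper package (wdt:P791 is Wikidata's ISIL property; the user-agent string is illustrative):

from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql", agent="glam-extractor-example/0.1")
endpoint.setQuery("""
    SELECT ?item ?itemLabel ?isil WHERE {
      ?item wdt:P791 ?isil .  # P791 = ISIL identifier
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 10
""")
endpoint.setReturnFormat(JSON)
for b in endpoint.query().convert()["results"]["bindings"]:
    print(b["isil"]["value"], b["itemLabel"]["value"])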
Data Quality & Provenance
Every extracted record includes provenance metadata:
provenance:
  data_source: CONVERSATION_NLP | ISIL_REGISTRY | DUTCH_ORG_CSV | WEB_CRAWL | WIKIDATA
  data_tier: TIER_1_AUTHORITATIVE | TIER_2_VERIFIED | TIER_3_CROWD_SOURCED | TIER_4_INFERRED
  extraction_date: 2025-11-05T...
  extraction_method: "spaCy NER + GPT-4 classification"
  confidence_score: 0.0-1.0
  conversation_id: "uuid"
  source_url: "https://..."
  verified_date: null
  verified_by: null
Data Tiers:
- Tier 1: Official registries (ISIL, national registers) - highest authority
- Tier 2: Verified institutional data (official websites)
- Tier 3: Community-sourced data (Wikidata, OpenStreetMap)
- Tier 4: NLP-extracted or inferred data - requires verification
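A sketch of how a consumer might triage records on these fields; it assumes the export is a flat JSON array, and the 0.8 threshold is illustrative:

import json

with open("output.jsonld") as f:
    records = json.load(f)  # assumed: a flat array of institution records

# Tier 4 data requires verification; flag low-confidence inferred records first
needs_review = [
    r for r in records
    if r.get("provenance", {}).get("data_tier") == "TIER_4_INFERRED"
    and r.get("provenance", {}).get("confidence_score", 0.0) < 0.8
]
print(f"{len(needs_review)} inferred records below 0.8 confidence")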
LinkML Schema
Hyper-Modular Architecture
The project uses a hyper-modular LinkML schema (schemas/20251121/linkml/01_custodian_name_modular.yaml) where every class, enum, and slot is defined in its own individual file for maximum maintainability and version control granularity.
Schema Structure:
- 78 YAML files total
- 12 class modules (modules/classes/)
- 5 enum modules (modules/enums/)
- 59 slot modules (modules/slots/)
- 1 metadata module (modules/metadata.yaml)
- 1 main schema (01_custodian_name_modular.yaml)
Direct Import Pattern:
imports:
  - linkml:types
  - modules/metadata
  - modules/enums/AgentTypeEnum
  - modules/slots/observed_name
  - modules/classes/CustodianObservation
  # ... 76 total individual module imports
Benefits:
- ✅ Complete transparency - all dependencies visible
- ✅ Granular version control - one file per concept
- ✅ Parallel development - no merge conflicts
- ✅ Selective imports - customize schemas easily
See .opencode/HYPER_MODULAR_STRUCTURE.md for complete architecture documentation.
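Because every concept lives in its own file, the easiest way to inspect the merged schema is to let linkml-runtime resolve the import closure. A minimal sketch, assuming linkml-runtime is installed:

from linkml_runtime import SchemaView

sv = SchemaView("schemas/20251121/linkml/01_custodian_name_modular.yaml")
# SchemaView follows the imports closure, so these counts span all module files
print(len(sv.all_classes()), "classes")
print(len(sv.all_enums()), "enums")
print(len(sv.all_slots()), "slots")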
Ontology Alignment
The schema integrates multiple international standards:
- CPOV: Core Public Organisation Vocabulary (EU public sector)
- TOOI: Dutch organizational ontology
- Schema.org: General web semantics
- CIDOC-CRM: Cultural heritage domain model
- RiC-O: Records in Contexts Ontology
- PROV-O: Provenance tracking
- PiCo: Person observations pattern
Key Classes:
- CustodianObservation: Source-based references (emic/etic perspectives)
- CustodianName: Standardized emic names
- CustodianReconstruction: Formal legal entities
- ReconstructionActivity: Entity derivation from observations
- Agent: People responsible for observations/reconstructions
- SourceDocument: Documentary evidence
- Identifier: External identifiers (ISIL, VIAF, Wikidata)
- TimeSpan: Temporal extents with fuzzy boundaries
- ConfidenceMeasure: Data quality metrics
Observation → Reconstruction Pattern:
SourceDocument (text) → CustodianObservation (what the source says) → ReconstructionActivity (synthesis method) → CustodianReconstruction (formal entity)
This pattern distinguishes between source-based references and scholar-derived formal entities, inspired by the PiCo (Persons in Context) ontology.
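An illustrative Python sketch of the pattern; the project's real classes are generated from the LinkML schema, so names and fields are simplified here:

from dataclasses import dataclass, field

@dataclass
class CustodianObservation:
    observed_name: str    # the name exactly as the source gives it (emic)
    source_document: str  # pointer to the SourceDocument it came from

@dataclass
class CustodianReconstruction:
    preferred_name: str   # the formal entity name derived by a researcher
    derived_from: list[CustodianObservation] = field(default_factory=list)

obs = CustodianObservation("Rijks-Museum te Amsterdam", "conversation:uuid-123")  # hypothetical values
entity = CustodianReconstruction("Rijksmuseum", derived_from=[obs])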
Development
Run Tests
poetry run pytest # All tests
poetry run pytest -m unit # Unit tests only
poetry run pytest -m integration # Integration tests only
poetry run pytest --cov # With coverage report
Code Quality
poetry run black src/ tests/ # Format code
poetry run ruff check src/ tests/ # Lint code
poetry run mypy src/ # Type checking
Pre-commit Hooks
poetry run pre-commit install
poetry run pre-commit run --all-files
Documentation
poetry run mkdocs serve # Serve docs locally
poetry run mkdocs build # Build static docs
Examples
Extract Brazilian Institutions
from glam_extractor import ConversationParser, InstitutionExtractor
# Parse conversation
parser = ConversationParser()
conversation = parser.load("Brazilian_GLAM_collection_inventories.json")
# Extract institutions
extractor = InstitutionExtractor()
institutions = extractor.extract(conversation)
# Print results
for inst in institutions:
    print(f"{inst.name} ({inst.institution_type})")
    print(f" Location: {inst.locations[0].city}, {inst.locations[0].country}")
    print(f" Confidence: {inst.provenance.confidence_score}")
Cross-link Dutch Data
from glam_extractor import CSVParser, InstitutionExtractor
from glam_extractor.validators import LinkMLValidator
# Load Dutch ISIL registry
csv_parser = CSVParser()
dutch_institutions = csv_parser.load_isil_registry("ISIL-codes_2025-08-01.csv")
# Load Dutch organizations
dutch_orgs = csv_parser.load_dutch_organizations("voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv")
# Cross-link and merge
extractor = InstitutionExtractor()
merged = extractor.merge_dutch_data(dutch_institutions, dutch_orgs)
# Validate
validator = LinkMLValidator(schema="schemas/heritage_custodian.yaml")
results = validator.validate_batch(merged)
print(f"Valid: {results.valid_count}, Invalid: {results.invalid_count}")
Export to Multiple Formats
from glam_extractor.exporters import JSONLDExporter, RDFExporter, CSVExporter
# Load extracted data
institutions = load_institutions("output.jsonld")
# Export to RDF/Turtle
rdf_exporter = RDFExporter()
rdf_exporter.export(institutions, "output.ttl")
# Export to CSV
csv_exporter = CSVExporter()
csv_exporter.export(institutions, "output.csv")
# Export to Parquet
csv_exporter.export_parquet(institutions, "output.parquet")
Documentation
- Planning Docs (docs/plan/global_glam/):
  - 01-implementation-phases.md: 7-phase implementation plan
  - 02-architecture.md: System architecture and data flow
  - 03-dependencies.md: Technology stack and dependencies
  - 04-data-standardization.md: Data integration strategies
  - 05-design-patterns.md: Software design patterns
  - 06-consumers-use-cases.md: User segments and applications
- AI Agent Instructions (AGENTS.md):
  - NLP extraction guidelines
  - Data quality protocols
  - Agent workflow examples
- API Documentation: Generated from docstrings with mkdocstrings
Contributing
This is a research project. Contributions welcome!
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
MIT License - see LICENSE file for details
Acknowledgments
- LinkML: Schema framework
- spaCy: NLP processing
- crawl4ai: Web crawling
- RDFLib: RDF processing
- Dutch ISIL Registry: Authoritative institution data
- Claude AI: Conversation data source
Contact
For questions or collaboration inquiries, please open an issue on GitHub.
Version: 0.1.0
Status: Alpha - Implementation in progress
Last Updated: 2025-11-05