# GLAM Extractor
Extract and standardize global GLAM (Galleries, Libraries, Archives, Museums) institutional data from conversation transcripts and authoritative registries.
## Overview
This project extracts structured heritage institution data from 139+ Claude conversation JSON files covering worldwide GLAM research, integrates with authoritative CSV datasets (Dutch ISIL registry, Dutch heritage organizations), validates against a comprehensive LinkML schema, and exports to multiple formats (RDF/Turtle, JSON-LD, CSV, Parquet, SQLite).
## Features
- **Multi-source data integration**: Conversation transcripts, CSV registries, web crawling, Wikidata
- **NLP extraction**: spaCy NER, transformers-based classification, pattern matching
- **LinkML validation**: Comprehensive schema integrating TOOI, Schema.org, CPOV, ISIL, RiC-O, and BIBFRAME
- **Provenance tracking**: Every data point tracks source, confidence, and verification status
- **Multi-format export**: RDF/Turtle, JSON-LD, CSV, Parquet, SQLite
- **Geocoding**: Nominatim integration for location enrichment
- **Multilingual support**: Handles 60+ countries and languages
## Quick Start
### Installation
```bash
# Install Poetry (if not already installed)
curl -sSL https://install.python-poetry.org | python3 -
# Clone repository and install dependencies
cd glam-extractor
poetry install
# Download spaCy models
poetry run python -m spacy download en_core_web_trf
poetry run python -m spacy download nl_core_news_lg
poetry run python -m spacy download xx_ent_wiki_sm
```
### Basic Usage
```bash
# Extract from conversation JSON
poetry run glam extract conversations/Brazilian_GLAM.json -o output.jsonld
# Extract from Dutch CSV
poetry run glam extract data/ISIL-codes_2025-08-01.csv --csv -o dutch_isil.jsonld
# Validate extracted data
poetry run glam validate output.jsonld -s schemas/heritage_custodian.yaml
# Export to RDF
poetry run glam export output.jsonld -o output.ttl -f rdf
# Crawl institutional website
poetry run glam crawl https://www.rijksmuseum.nl -o rijksmuseum.jsonld
```
## Linked Open Data
The project publishes heritage institution data as **W3C-compliant RDF** (Turtle, RDF/XML, JSON-LD, N-Triples) aligned with international ontologies:
### Published Datasets
**Denmark 🇩🇰** - ✅ **COMPLETE** (November 2025)
- **2,348 institutions** (555 libraries, 594 archives, 1,199 branches)
- **43,429 RDF triples** across 9 ontologies
- **769 Wikidata links** (32.8% coverage)
- **Formats**: [Turtle](data/rdf/denmark_complete.ttl), [RDF/XML](data/rdf/denmark_complete.rdf), [JSON-LD](data/rdf/denmark_complete.jsonld), [N-Triples](data/rdf/denmark_complete.nt)
See [data/rdf/README.md](data/rdf/README.md) for SPARQL examples and usage.
### Ontology Alignment
| Ontology | Purpose | Coverage |
|----------|---------|----------|
| **CPOV** (Core Public Organisation Vocabulary) | EU public sector standard | All institutions |
| **Schema.org** | Web semantics (Library, ArchiveOrganization) | All institutions |
| **RiC-O** (Records in Contexts Ontology) | Archival description | Archives |
| **ORG** (W3C Organization Ontology) | Hierarchical relationships | Branches |
| **PROV-O** (Provenance Ontology) | Data provenance tracking | All institutions |
| **OWL** | Semantic equivalence (Wikidata links) | 32.8% Denmark |
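As a concrete illustration of how these ontologies combine, a single institution can be described with classes and properties from several of them at once. The URI, names, and Wikidata ID below are placeholders, not records from the published dataset:

```turtle
@prefix schema: <http://schema.org/> .
@prefix cpov:   <http://data.europa.eu/m8g/> .
@prefix org:    <http://www.w3.org/ns/org#> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .

# Hypothetical example record; all identifiers are placeholders
<https://example.org/glam/dk/library/001>
    a cpov:PublicOrganisation, schema:Library ;
    schema:name "Eksempel Bibliotek"@da ;
    schema:address [
        a schema:PostalAddress ;
        schema:addressLocality "København K" ;
        schema:streetAddress "Eksempelgade 1"
    ] ;
    org:hasSite <https://example.org/glam/dk/library/001/branch-1> ;
    owl:sameAs <http://www.wikidata.org/entity/Q0000000> .
```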
### SPARQL Examples
```sparql
# Find all libraries in Copenhagen
PREFIX schema: <http://schema.org/>
PREFIX cpov: <http://data.europa.eu/m8g/>
SELECT ?library ?name ?address WHERE {
?library a cpov:PublicOrganisation, schema:Library .
?library schema:name ?name .
?library schema:address ?addrNode .
?addrNode schema:addressLocality "København K" .
?addrNode schema:streetAddress ?address .
}
```
```sparql
# Find all institutions with Wikidata links
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>
SELECT ?institution ?name ?wikidataID WHERE {
?institution schema:name ?name .
?institution owl:sameAs ?wikidataURI .
FILTER(STRSTARTS(STR(?wikidataURI), "http://www.wikidata.org/entity/Q"))
BIND(STRAFTER(STR(?wikidataURI), "http://www.wikidata.org/entity/") AS ?wikidataID)
}
```
See [data/rdf/README.md](data/rdf/README.md) for more examples.
## Project Structure
```
glam-extractor/
├── pyproject.toml          # Poetry configuration
├── README.md               # This file
├── AGENTS.md               # AI agent instructions
├── src/glam_extractor/     # Main package
│   ├── __init__.py
│   ├── cli.py              # Command-line interface
│   ├── parsers/            # Conversation & CSV parsers
│   ├── extractors/         # NLP extraction engines
│   ├── crawlers/           # Web crawling (crawl4ai)
│   ├── validators/         # LinkML validation
│   ├── exporters/          # Multi-format export
│   ├── geocoding/          # Nominatim geocoding
│   └── utils/              # Utilities
├── schemas/                # LinkML schemas
│   └── heritage_custodian.yaml
├── tests/                  # Test suite
│   ├── unit/
│   ├── integration/
│   └── fixtures/
├── docs/                   # Documentation
│   ├── plan/global_glam/   # Planning documents
│   ├── api/                # API documentation
│   ├── tutorials/          # User tutorials
│   └── examples/           # Usage examples
└── data/                   # Reference data
    ├── ISIL-codes_2025-08-01.csv
    └── voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv
```
## Data Sources
### Conversation JSON Files
139+ conversation files covering global GLAM research:
- **Geographic coverage**: 60+ countries across all continents
- **Content**: Institution names, locations, collections, digital platforms, partnerships
- **Languages**: Multilingual (English, Dutch, Portuguese, Spanish, Vietnamese, Japanese, Arabic, etc.)
### CSV Datasets
1. **Dutch ISIL Registry** (`ISIL-codes_2025-08-01.csv`): ~300 Dutch heritage institutions with authoritative ISIL codes
2. **Dutch Organizations** (`voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv`): Comprehensive metadata including systems, partnerships, collection platforms
### External Sources (Optional Enrichment)
- **Wikidata**: SPARQL queries for additional metadata
- **VIAF**: Authority file linking
- **GeoNames**: Geographic name authority
- **Nominatim**: Geocoding service
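A minimal sketch of the Wikidata enrichment step, assuming lookup by ISIL code (Wikidata property P791); the function names here are illustrative, not part of the package API:

```python
# Build a SPARQL query for Wikidata's public endpoint that finds an item by
# its ISIL code (property P791). Only the standard library is used; sending
# the request (with an Accept: application/sparql-results+json header) is
# left to the caller.
from urllib.parse import urlencode

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def build_isil_query(isil_code: str) -> str:
    """Return a SPARQL query matching a Wikidata item by its ISIL (P791)."""
    return f'''
    SELECT ?item ?itemLabel WHERE {{
      ?item wdt:P791 "{isil_code}" .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,nl". }}
    }}'''

def build_request_url(isil_code: str) -> str:
    """Full GET URL for the query; append &format=json handling as needed."""
    return WIKIDATA_SPARQL + "?" + urlencode({"query": build_isil_query(isil_code),
                                              "format": "json"})

# Placeholder ISIL code, for illustration only
url = build_request_url("NL-XXXX")
```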
## Data Quality & Provenance
Every extracted record includes provenance metadata:
```yaml
provenance:
data_source: CONVERSATION_NLP | ISIL_REGISTRY | DUTCH_ORG_CSV | WEB_CRAWL | WIKIDATA
data_tier: TIER_1_AUTHORITATIVE | TIER_2_VERIFIED | TIER_3_CROWD_SOURCED | TIER_4_INFERRED
extraction_date: 2025-11-05T...
extraction_method: "spaCy NER + GPT-4 classification"
confidence_score: 0.0-1.0
conversation_id: "uuid"
source_url: "https://..."
verified_date: null
verified_by: null
```
**Data Tiers**:
- **Tier 1**: Official registries (ISIL, national registers) - highest authority
- **Tier 2**: Verified institutional data (official websites)
- **Tier 3**: Community-sourced data (Wikidata, OpenStreetMap)
- **Tier 4**: NLP-extracted or inferred data - requires verification
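The tier and confidence fields combine naturally into a review rule: inferred or low-confidence records stay flagged until a human verifies them. A sketch, with hypothetical class names and an assumed 0.7 threshold (the project's real classes and cutoffs may differ):

```python
# Illustrative model of the provenance record above; the enum members mirror
# the tier names in this README, but the class names and the 0.7 threshold
# are assumptions, not the package's actual API.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class DataTier(Enum):
    TIER_1_AUTHORITATIVE = 1   # official registries (ISIL, national registers)
    TIER_2_VERIFIED = 2        # verified institutional data (official websites)
    TIER_3_CROWD_SOURCED = 3   # community sources (Wikidata, OpenStreetMap)
    TIER_4_INFERRED = 4        # NLP-extracted; requires verification

@dataclass
class ProvenanceRecord:
    data_source: str
    data_tier: DataTier
    confidence_score: float          # 0.0 .. 1.0
    verified_by: Optional[str] = None

    def needs_review(self) -> bool:
        """Flag inferred or low-confidence records that nobody has verified."""
        return (self.data_tier is DataTier.TIER_4_INFERRED
                or self.confidence_score < 0.7) and self.verified_by is None

rec = ProvenanceRecord("CONVERSATION_NLP", DataTier.TIER_4_INFERRED, 0.85)
```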
## LinkML Schema
The `heritage_custodian.yaml` schema integrates multiple standards:
- **TOOI**: Dutch organizational ontology
- **Schema.org**: General web semantics
- **CPOV**: Core Public Organisation Vocabulary
- **ISIL**: International Standard Identifier for Libraries
- **RiC-O**: Records in Contexts Ontology
- **BIBFRAME**: Bibliographic Framework
- **CIDOC-CRM**: Conceptual Reference Model
Key classes:
- `HeritageCustodian`: Base class for all heritage institutions
- `DutchHeritageCustodian`: Dutch-specific subclass with KvK, gemeente codes
- `Location`: Geographic locations
- `Identifier`: External identifiers (ISIL, VIAF, Wikidata)
- `Collection`: Collections held by institutions
- `DigitalPlatform`: Digital systems used
- `Provenance`: Data quality tracking
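A hypothetical fragment in the style such a schema might take; the actual class and slot definitions live in `schemas/heritage_custodian.yaml` and may differ:

```yaml
# Illustrative LinkML fragment; slot names, ranges, and URIs are assumptions
classes:
  HeritageCustodian:
    class_uri: schema:Organization
    attributes:
      name:
        required: true
      identifiers:
        range: Identifier
        multivalued: true
      provenance:
        range: Provenance
  DutchHeritageCustodian:
    is_a: HeritageCustodian
    attributes:
      kvk_number:
        description: Dutch Chamber of Commerce (KvK) number
```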
## Development
### Run Tests
```bash
poetry run pytest # All tests
poetry run pytest -m unit # Unit tests only
poetry run pytest -m integration # Integration tests only
poetry run pytest --cov # With coverage report
```
### Code Quality
```bash
poetry run black src/ tests/ # Format code
poetry run ruff check src/ tests/ # Lint code
poetry run mypy src/ # Type checking
```
### Pre-commit Hooks
```bash
poetry run pre-commit install
poetry run pre-commit run --all-files
```
### Documentation
```bash
poetry run mkdocs serve # Serve docs locally
poetry run mkdocs build # Build static docs
```
## Examples
### Extract Brazilian Institutions
```python
from glam_extractor import ConversationParser, InstitutionExtractor
# Parse conversation
parser = ConversationParser()
conversation = parser.load("Brazilian_GLAM_collection_inventories.json")
# Extract institutions
extractor = InstitutionExtractor()
institutions = extractor.extract(conversation)
# Print results
for inst in institutions:
print(f"{inst.name} ({inst.institution_type})")
print(f" Location: {inst.locations[0].city}, {inst.locations[0].country}")
print(f" Confidence: {inst.provenance.confidence_score}")
```
### Cross-link Dutch Data
```python
from glam_extractor import CSVParser, InstitutionExtractor
from glam_extractor.validators import LinkMLValidator
# Load Dutch ISIL registry
csv_parser = CSVParser()
dutch_institutions = csv_parser.load_isil_registry("ISIL-codes_2025-08-01.csv")
# Load Dutch organizations
dutch_orgs = csv_parser.load_dutch_organizations("voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv")
# Cross-link and merge
extractor = InstitutionExtractor()
merged = extractor.merge_dutch_data(dutch_institutions, dutch_orgs)
# Validate
validator = LinkMLValidator(schema="schemas/heritage_custodian.yaml")
results = validator.validate_batch(merged)
print(f"Valid: {results.valid_count}, Invalid: {results.invalid_count}")
```
### Export to Multiple Formats
```python
from glam_extractor.exporters import JSONLDExporter, RDFExporter, CSVExporter
# Load extracted data
institutions = load_institutions("output.jsonld")
# Export to RDF/Turtle
rdf_exporter = RDFExporter()
rdf_exporter.export(institutions, "output.ttl")
# Export to CSV
csv_exporter = CSVExporter()
csv_exporter.export(institutions, "output.csv")
# Export to Parquet
csv_exporter.export_parquet(institutions, "output.parquet")
```
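The SQLite target mentioned in the feature list can be sketched with the standard library alone; the table layout and function name here are assumptions, not the exporter's actual schema:

```python
# Minimal sketch of a SQLite export path using only the standard library.
import sqlite3

def export_sqlite(institutions: list[dict], db_path: str = ":memory:") -> sqlite3.Connection:
    """Write (name, type, country) rows to a SQLite table; return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS institutions (
        name TEXT NOT NULL,
        institution_type TEXT,
        country TEXT)""")
    # Named placeholders pull values from each dict by key
    conn.executemany(
        "INSERT INTO institutions VALUES (:name, :institution_type, :country)",
        institutions)
    conn.commit()
    return conn

# Placeholder record, for illustration only
conn = export_sqlite([
    {"name": "Example Museum", "institution_type": "museum", "country": "NL"},
])
```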
## Documentation
- **Planning Docs**: `docs/plan/global_glam/`
- `01-implementation-phases.md`: 7-phase implementation plan
- `02-architecture.md`: System architecture and data flow
- `03-dependencies.md`: Technology stack and dependencies
- `04-data-standardization.md`: Data integration strategies
- `05-design-patterns.md`: Software design patterns
- `06-consumers-use-cases.md`: User segments and applications
- **AI Agent Instructions**: `AGENTS.md`
- NLP extraction guidelines
- Data quality protocols
- Agent workflow examples
- **API Documentation**: Generated from docstrings with mkdocstrings
## Contributing
This is a research project. Contributions welcome!
1. Fork the repository
2. Create feature branch (`git checkout -b feature/amazing-feature`)
3. Commit changes (`git commit -m 'Add amazing feature'`)
4. Push to branch (`git push origin feature/amazing-feature`)
5. Open Pull Request
## License
MIT License - see LICENSE file for details
## Acknowledgments
- **LinkML**: Schema framework
- **spaCy**: NLP processing
- **crawl4ai**: Web crawling
- **RDFLib**: RDF processing
- **Dutch ISIL Registry**: Authoritative institution data
- **Claude AI**: Conversation data source
## Contact
For questions or collaboration inquiries, please open an issue on GitHub.
---
**Version**: 0.1.0
**Status**: Alpha - Implementation in progress
**Last Updated**: 2025-11-05