# GLAM Extractor

Extract and standardize global GLAM (Galleries, Libraries, Archives, Museums) institutional data from conversation transcripts and authoritative registries.

## Overview

This project extracts structured heritage institution data from 139+ Claude conversation JSON files covering worldwide GLAM research, integrates it with authoritative CSV datasets (the Dutch ISIL registry and Dutch heritage organizations), validates it against a comprehensive LinkML schema, and exports it to multiple formats (RDF/Turtle, JSON-LD, CSV, Parquet, SQLite).

## Features

- **Multi-source data integration**: Conversation transcripts, CSV registries, web crawling, Wikidata
- **NLP extraction**: spaCy NER, transformer-based classification, pattern matching
- **LinkML validation**: Comprehensive schema with TOOI, Schema.org, CPOV, ISIL, RiC-O, BIBFRAME
- **Provenance tracking**: Every data point tracks source, confidence, and verification status
- **Multi-format export**: RDF/Turtle, JSON-LD, CSV, Parquet, SQLite
- **Geocoding**: Nominatim integration for location enrichment
- **Multilingual support**: Handles 60+ countries and languages

## Quick Start

### Installation

```bash
# Install Poetry (if not already installed)
curl -sSL https://install.python-poetry.org | python3 -

# Enter the repository and install dependencies
cd glam-extractor
poetry install

# Download spaCy models
poetry run python -m spacy download en_core_web_trf
poetry run python -m spacy download nl_core_news_lg
poetry run python -m spacy download xx_ent_wiki_sm
```

### Basic Usage

```bash
# Extract from conversation JSON
poetry run glam extract conversations/Brazilian_GLAM.json -o output.jsonld

# Extract from Dutch CSV
poetry run glam extract data/ISIL-codes_2025-08-01.csv --csv -o dutch_isil.jsonld

# Validate extracted data
poetry run glam validate output.jsonld -s schemas/heritage_custodian.yaml

# Export to RDF
poetry run glam export output.jsonld -o output.ttl -f rdf

# Crawl institutional website
poetry run glam crawl https://www.rijksmuseum.nl -o rijksmuseum.jsonld
```

## Linked Open Data

The project publishes heritage institution data as **W3C-compliant RDF** aligned with international ontologies.

### Schema RDF Formats (8 Serializations)

The LinkML schema is available in 8 RDF formats (generated from `schemas/20251121/linkml/01_custodian_name_modular.yaml`):

| Format | File | Size | Use Case |
|--------|------|------|----------|
| **Turtle** | `01_custodian_name.owl.ttl` | 77KB | Human-readable, Git-friendly |
| **N-Triples** | `01_custodian_name.nt` | 233KB | Line-oriented processing |
| **JSON-LD** | `01_custodian_name.jsonld` | 191KB | Web APIs, JavaScript |
| **RDF/XML** | `01_custodian_name.rdf` | 165KB | Legacy systems, Java |
| **Notation3** | `01_custodian_name.n3` | 77KB | Logic rules, reasoning |
| **TriG** | `01_custodian_name.trig` | 103KB | Named graphs, datasets |
| **TriX** | `01_custodian_name.trix` | 348KB | XML with named graphs |
| **N-Quads** | `01_custodian_name.nq` | 288KB | Quad-based processing |

All formats are located in `schemas/20251121/rdf/`.

### Published Datasets

**Denmark 🇩🇰** - ✅ **COMPLETE** (November 2025)

- **2,348 institutions** (555 libraries, 594 archives, 1,199 branches)
- **43,429 RDF triples** across 9 ontologies
- **769 Wikidata links** (32.8% coverage)
- **Formats**: [Turtle](data/rdf/denmark_complete.ttl), [RDF/XML](data/rdf/denmark_complete.rdf), [JSON-LD](data/rdf/denmark_complete.jsonld), [N-Triples](data/rdf/denmark_complete.nt)

See [data/rdf/README.md](data/rdf/README.md) for SPARQL examples and usage.

### Ontology Alignment

| Ontology | Purpose | Coverage |
|----------|---------|----------|
| **CPOV** (Core Public Organisation Vocabulary) | EU public sector standard | All institutions |
| **Schema.org** | Web semantics (Library, ArchiveOrganization) | All institutions |
| **RiC-O** (Records in Contexts Ontology) | Archival description | Archives |
| **ORG** (W3C Organization Ontology) | Hierarchical relationships | Branches |
| **PROV-O** (Provenance Ontology) | Data provenance tracking | All institutions |
| **OWL** | Semantic equivalence (Wikidata links) | 32.8% of Denmark |

### SPARQL Examples

```sparql
# Find all libraries in Copenhagen
PREFIX schema: <http://schema.org/>
PREFIX cpov: <http://data.europa.eu/m8g/>

SELECT ?library ?name ?address WHERE {
  ?library a cpov:PublicOrganisation, schema:Library .
  ?library schema:name ?name .
  ?library schema:address ?addrNode .
  ?addrNode schema:addressLocality "København K" .
  ?addrNode schema:streetAddress ?address .
}
```

```sparql
# Find all institutions with Wikidata links
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>

SELECT ?institution ?name ?wikidataID WHERE {
  ?institution schema:name ?name .
  ?institution owl:sameAs ?wikidataURI .
  FILTER(STRSTARTS(STR(?wikidataURI), "http://www.wikidata.org/entity/Q"))
  BIND(STRAFTER(STR(?wikidataURI), "http://www.wikidata.org/entity/") AS ?wikidataID)
}
```

See [data/rdf/README.md](data/rdf/README.md) for more examples.

## Project Structure

```
glam-extractor/
├── pyproject.toml           # Poetry configuration
├── README.md                # This file
├── AGENTS.md                # AI agent instructions
├── .opencode/               # AI agent documentation
│   ├── HYPER_MODULAR_STRUCTURE.md
│   └── SLOT_NAMING_CONVENTIONS.md
├── src/glam_extractor/      # Main package
│   ├── __init__.py
│   ├── cli.py               # Command-line interface
│   ├── parsers/             # Conversation & CSV parsers
│   ├── extractors/          # NLP extraction engines
│   ├── crawlers/            # Web crawling (crawl4ai)
│   ├── validators/          # LinkML validation
│   ├── exporters/           # Multi-format export
│   ├── geocoding/           # Nominatim geocoding
│   └── utils/               # Utilities
├── schemas/20251121/        # LinkML schemas
│   ├── linkml/              # Hyper-modular schema (78 files)
│   │   ├── 01_custodian_name_modular.yaml
│   │   └── modules/
│   │       ├── metadata.yaml
│   │       ├── classes/     # 12 class modules
│   │       ├── enums/       # 5 enum modules
│   │       └── slots/       # 59 slot modules
│   ├── rdf/                 # 8 RDF serialization formats
│   │   ├── 01_custodian_name.owl.ttl
│   │   ├── 01_custodian_name.nt
│   │   ├── 01_custodian_name.jsonld
│   │   ├── 01_custodian_name.rdf
│   │   ├── 01_custodian_name.n3
│   │   ├── 01_custodian_name.trig
│   │   ├── 01_custodian_name.trix
│   │   └── 01_custodian_name.nq
│   └── examples/            # LinkML instance examples
├── tests/                   # Test suite
│   ├── unit/
│   ├── integration/
│   └── fixtures/
├── docs/                    # Documentation
│   ├── plan/global_glam/    # Planning documents
│   ├── api/                 # API documentation
│   ├── tutorials/           # User tutorials
│   └── examples/            # Usage examples
└── data/                    # Reference data
    ├── ISIL-codes_2025-08-01.csv
    ├── voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv
    └── ontology/            # Base ontologies (TOOI, CPOV, Schema.org, etc.)
```

## Data Sources

### Conversation JSON Files

139+ conversation files covering global GLAM research:

- **Geographic coverage**: 60+ countries across all continents
- **Content**: Institution names, locations, collections, digital platforms, partnerships
- **Languages**: Multilingual (English, Dutch, Portuguese, Spanish, Vietnamese, Japanese, Arabic, etc.)

### CSV Datasets

1. **Dutch ISIL Registry** (`ISIL-codes_2025-08-01.csv`): ~300 Dutch heritage institutions with authoritative ISIL codes
2. **Dutch Organizations** (`voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv`): Comprehensive metadata including systems, partnerships, collection platforms

### External Sources (Optional Enrichment)

- **Wikidata**: SPARQL queries for additional metadata
- **VIAF**: Authority file linking
- **GeoNames**: Geographic name authority
- **Nominatim**: Geocoding service

## Data Quality & Provenance

Every extracted record includes provenance metadata:

```yaml
provenance:
  data_source: CONVERSATION_NLP | ISIL_REGISTRY | DUTCH_ORG_CSV | WEB_CRAWL | WIKIDATA
  data_tier: TIER_1_AUTHORITATIVE | TIER_2_VERIFIED | TIER_3_CROWD_SOURCED | TIER_4_INFERRED
  extraction_date: 2025-11-05T...
  extraction_method: "spaCy NER + GPT-4 classification"
  confidence_score: 0.0-1.0
  conversation_id: "uuid"
  source_url: "https://..."
  verified_date: null
  verified_by: null
```

**Data Tiers**:

- **Tier 1**: Official registries (ISIL, national registers) - highest authority
- **Tier 2**: Verified institutional data (official websites)
- **Tier 3**: Community-sourced data (Wikidata, OpenStreetMap)
- **Tier 4**: NLP-extracted or inferred data - requires verification
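
The tiers can drive a simple verification queue. A minimal sketch, assuming a default source-to-tier mapping; `DEFAULT_TIER` and `needs_verification` are illustrative names, not the project's actual API:

```python
# Hypothetical mapping from data source to a default tier; the project's
# actual assignment logic may differ.
DEFAULT_TIER = {
    "ISIL_REGISTRY": "TIER_1_AUTHORITATIVE",
    "DUTCH_ORG_CSV": "TIER_1_AUTHORITATIVE",
    "WEB_CRAWL": "TIER_2_VERIFIED",
    "WIKIDATA": "TIER_3_CROWD_SOURCED",
    "CONVERSATION_NLP": "TIER_4_INFERRED",
}

def needs_verification(provenance: dict) -> bool:
    """Tier 4 (NLP-extracted or inferred) records always need human review."""
    tier = provenance.get("data_tier") or DEFAULT_TIER[provenance["data_source"]]
    return tier == "TIER_4_INFERRED"

print(needs_verification({"data_source": "CONVERSATION_NLP"}))  # prints True
```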

## LinkML Schema

### Hyper-Modular Architecture

The project uses a **hyper-modular LinkML schema** (`schemas/20251121/linkml/01_custodian_name_modular.yaml`) where every class, enum, and slot is defined in its own individual file for maximum maintainability and version control granularity.

**Schema Structure**:

- **78 YAML files** total
- **12 class modules** (`modules/classes/`)
- **5 enum modules** (`modules/enums/`)
- **59 slot modules** (`modules/slots/`)
- **1 metadata module** (`modules/metadata.yaml`)
- **1 main schema** (`01_custodian_name_modular.yaml`)

**Direct Import Pattern**:

```yaml
imports:
  - linkml:types
  - modules/metadata
  - modules/enums/AgentTypeEnum
  - modules/slots/observed_name
  - modules/classes/CustodianObservation
  # ... 76 total individual module imports
```

**Benefits**:

- ✅ Complete transparency - all dependencies visible
- ✅ Granular version control - one file per concept
- ✅ Parallel development - fewer merge conflicts
- ✅ Selective imports - customize schemas easily

See [.opencode/HYPER_MODULAR_STRUCTURE.md](.opencode/HYPER_MODULAR_STRUCTURE.md) for complete architecture documentation.

### Ontology Alignment

The schema integrates multiple international standards:

- **CPOV**: Core Public Organisation Vocabulary (EU public sector)
- **TOOI**: Dutch organizational ontology
- **Schema.org**: General web semantics
- **CIDOC-CRM**: Cultural heritage domain model
- **RiC-O**: Records in Contexts Ontology
- **PROV-O**: Provenance tracking
- **PiCo**: Person observations pattern

**Key Classes**:

- `CustodianObservation`: Source-based references (emic/etic perspectives)
- `CustodianName`: Standardized emic names
- `CustodianReconstruction`: Formal legal entities
- `ReconstructionActivity`: Entity derivation from observations
- `Agent`: People responsible for observations/reconstructions
- `SourceDocument`: Documentary evidence
- `Identifier`: External identifiers (ISIL, VIAF, Wikidata)
- `TimeSpan`: Temporal extents with fuzzy boundaries
- `ConfidenceMeasure`: Data quality metrics

**Observation → Reconstruction Pattern**:

```
SourceDocument → CustodianObservation → ReconstructionActivity → CustodianReconstruction
    (text)        (what source says)      (synthesis method)        (formal entity)
```

This pattern distinguishes between source-based references and scholar-derived formal entities, inspired by the PiCo (Persons in Context) ontology.
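
As a purely hypothetical instance of this chain (slot names other than `observed_name` are illustrative; real instance data lives in `schemas/20251121/examples/`):

```yaml
# Hypothetical instance data illustrating the chain above
source_documents:
  - id: doc-001
    title: "Annual report"
custodian_observations:
  - id: obs-001
    observed_name: "Det Kgl. Bibliotek"
    documented_in: doc-001
reconstruction_activities:
  - id: act-001
    used: [obs-001]
custodian_reconstructions:
  - id: rec-001
    preferred_name: "Royal Danish Library"
    derived_from: act-001
```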

## Development

### Run Tests

```bash
poetry run pytest                  # All tests
poetry run pytest -m unit          # Unit tests only
poetry run pytest -m integration   # Integration tests only
poetry run pytest --cov            # With coverage report
```

### Code Quality

```bash
poetry run black src/ tests/       # Format code
poetry run ruff check src/ tests/  # Lint code
poetry run mypy src/               # Type checking
```

### Pre-commit Hooks

```bash
poetry run pre-commit install
poetry run pre-commit run --all-files
```

### Documentation

```bash
poetry run mkdocs serve   # Serve docs locally
poetry run mkdocs build   # Build static docs
```

## Examples

### Extract Brazilian Institutions

```python
from glam_extractor import ConversationParser, InstitutionExtractor

# Parse conversation
parser = ConversationParser()
conversation = parser.load("Brazilian_GLAM_collection_inventories.json")

# Extract institutions
extractor = InstitutionExtractor()
institutions = extractor.extract(conversation)

# Print results
for inst in institutions:
    print(f"{inst.name} ({inst.institution_type})")
    print(f"  Location: {inst.locations[0].city}, {inst.locations[0].country}")
    print(f"  Confidence: {inst.provenance.confidence_score}")
```

### Cross-link Dutch Data

```python
from glam_extractor import CSVParser, InstitutionExtractor
from glam_extractor.validators import LinkMLValidator

# Load Dutch ISIL registry
csv_parser = CSVParser()
dutch_institutions = csv_parser.load_isil_registry("ISIL-codes_2025-08-01.csv")

# Load Dutch organizations
dutch_orgs = csv_parser.load_dutch_organizations("voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv")

# Cross-link and merge
extractor = InstitutionExtractor()
merged = extractor.merge_dutch_data(dutch_institutions, dutch_orgs)

# Validate
validator = LinkMLValidator(schema="schemas/heritage_custodian.yaml")
results = validator.validate_batch(merged)
print(f"Valid: {results.valid_count}, Invalid: {results.invalid_count}")
```

### Export to Multiple Formats

```python
from glam_extractor.exporters import JSONLDExporter, RDFExporter, CSVExporter

# Load extracted data
institutions = load_institutions("output.jsonld")

# Export to RDF/Turtle
rdf_exporter = RDFExporter()
rdf_exporter.export(institutions, "output.ttl")

# Export to CSV
csv_exporter = CSVExporter()
csv_exporter.export(institutions, "output.csv")

# Export to Parquet
csv_exporter.export_parquet(institutions, "output.parquet")
```

## Documentation

- **Planning Docs** (`docs/plan/global_glam/`):
  - `01-implementation-phases.md`: 7-phase implementation plan
  - `02-architecture.md`: System architecture and data flow
  - `03-dependencies.md`: Technology stack and dependencies
  - `04-data-standardization.md`: Data integration strategies
  - `05-design-patterns.md`: Software design patterns
  - `06-consumers-use-cases.md`: User segments and applications
- **AI Agent Instructions** (`AGENTS.md`):
  - NLP extraction guidelines
  - Data quality protocols
  - Agent workflow examples
- **API Documentation**: Generated from docstrings with mkdocstrings

## Contributing

This is a research project; contributions are welcome!

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

MIT License - see LICENSE file for details

## Acknowledgments

- **LinkML**: Schema framework
- **spaCy**: NLP processing
- **crawl4ai**: Web crawling
- **RDFLib**: RDF processing
- **Dutch ISIL Registry**: Authoritative institution data
- **Claude AI**: Conversation data source

## Contact

For questions or collaboration inquiries, please open an issue on GitHub.

---

**Version**: 0.1.0
**Status**: Alpha - implementation in progress
**Last Updated**: 2025-11-05