# GLAM Extractor

Extract and standardize global GLAM (Galleries, Libraries, Archives, Museums) institutional data from conversation transcripts and authoritative registries.

## Overview

This project extracts structured heritage institution data from 139+ Claude conversation JSON files covering worldwide GLAM research, integrates with authoritative CSV datasets (Dutch ISIL registry, Dutch heritage organizations), validates against a comprehensive LinkML schema, and exports to multiple formats (RDF/Turtle, JSON-LD, CSV, Parquet, SQLite).

## Features

- **Multi-source data integration**: Conversation transcripts, CSV registries, web crawling, Wikidata
- **NLP extraction**: spaCy NER, transformers-based classification, pattern matching
- **LinkML validation**: Comprehensive schema with TOOI, Schema.org, CPOV, ISIL, RiC-O, BIBFRAME
- **Provenance tracking**: Every data point tracks source, confidence, and verification status
- **Multi-format export**: RDF/Turtle, JSON-LD, CSV, Parquet, SQLite
- **Geocoding**: Nominatim integration for location enrichment
- **Multilingual support**: Handles 60+ countries and languages

## Quick Start

### Installation

```bash
# Install Poetry (if not already installed)
curl -sSL https://install.python-poetry.org | python3 -

# Clone repository and install dependencies
cd glam-extractor
poetry install

# Download spaCy models
poetry run python -m spacy download en_core_web_trf
poetry run python -m spacy download nl_core_news_lg
poetry run python -m spacy download xx_ent_wiki_sm
```

### Basic Usage

```bash
# Extract from conversation JSON
poetry run glam extract conversations/Brazilian_GLAM.json -o output.jsonld

# Extract from Dutch CSV
poetry run glam extract data/ISIL-codes_2025-08-01.csv --csv -o dutch_isil.jsonld

# Validate extracted data
poetry run glam validate output.jsonld -s schemas/heritage_custodian.yaml

# Export to RDF
poetry run glam export output.jsonld -o output.ttl -f rdf

# Crawl institutional website
poetry run glam crawl https://www.rijksmuseum.nl -o rijksmuseum.jsonld
```

## Linked Open Data

The project publishes heritage institution data as **W3C-compliant RDF** aligned with international ontologies.

### Schema RDF Formats (8 Serializations)

The LinkML schema is available in 8 RDF formats (generated from `schemas/20251121/linkml/01_custodian_name_modular.yaml`):

| Format | File | Size | Use Case |
|--------|------|------|----------|
| **Turtle** | `01_custodian_name.owl.ttl` | 77KB | Human-readable, Git-friendly |
| **N-Triples** | `01_custodian_name.nt` | 233KB | Line-oriented processing |
| **JSON-LD** | `01_custodian_name.jsonld` | 191KB | Web APIs, JavaScript |
| **RDF/XML** | `01_custodian_name.rdf` | 165KB | Legacy systems, Java |
| **Notation3** | `01_custodian_name.n3` | 77KB | Logic rules, reasoning |
| **TriG** | `01_custodian_name.trig` | 103KB | Named graphs, datasets |
| **TriX** | `01_custodian_name.trix` | 348KB | XML with named graphs |
| **N-Quads** | `01_custodian_name.nq` | 288KB | Quad-based processing |

All formats are located in `schemas/20251121/rdf/`.

### Published Datasets

**Denmark πŸ‡©πŸ‡°** - βœ… **COMPLETE** (November 2025)

- **2,348 institutions** (555 libraries, 594 archives, 1,199 branches)
- **43,429 RDF triples** across 9 ontologies
- **769 Wikidata links** (32.8% coverage)
- **Formats**: [Turtle](data/rdf/denmark_complete.ttl), [RDF/XML](data/rdf/denmark_complete.rdf), [JSON-LD](data/rdf/denmark_complete.jsonld), [N-Triples](data/rdf/denmark_complete.nt)

See [data/rdf/README.md](data/rdf/README.md) for SPARQL examples and usage.
### Ontology Alignment

| Ontology | Purpose | Coverage |
|----------|---------|----------|
| **CPOV** (Core Public Organisation Vocabulary) | EU public sector standard | All institutions |
| **Schema.org** | Web semantics (Library, ArchiveOrganization) | All institutions |
| **RiC-O** (Records in Contexts) | Archival description | Archives |
| **ORG** (W3C Organization Ontology) | Hierarchical relationships | Branches |
| **PROV-O** (Provenance Ontology) | Data provenance tracking | All institutions |
| **OWL** | Semantic equivalence (Wikidata links) | 32.8% Denmark |

### SPARQL Examples

```sparql
# Find all libraries in Copenhagen
PREFIX schema: <http://schema.org/>
PREFIX cpov: <http://data.europa.eu/m8g/>

SELECT ?library ?name ?address
WHERE {
  ?library a cpov:PublicOrganisation, schema:Library .
  ?library schema:name ?name .
  ?library schema:address ?addrNode .
  ?addrNode schema:addressLocality "KΓΈbenhavn K" .
  ?addrNode schema:streetAddress ?address .
}
```

```sparql
# Find all institutions with Wikidata links
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>

SELECT ?institution ?name ?wikidataID
WHERE {
  ?institution schema:name ?name .
  ?institution owl:sameAs ?wikidataURI .
  FILTER(STRSTARTS(STR(?wikidataURI), "http://www.wikidata.org/entity/Q"))
  BIND(STRAFTER(STR(?wikidataURI), "http://www.wikidata.org/entity/") AS ?wikidataID)
}
```

See [data/rdf/README.md](data/rdf/README.md) for more examples.
## Project Structure

```
glam-extractor/
β”œβ”€β”€ pyproject.toml           # Poetry configuration
β”œβ”€β”€ README.md                # This file
β”œβ”€β”€ AGENTS.md                # AI agent instructions
β”œβ”€β”€ .opencode/               # AI agent documentation
β”‚   β”œβ”€β”€ HYPER_MODULAR_STRUCTURE.md
β”‚   └── SLOT_NAMING_CONVENTIONS.md
β”œβ”€β”€ src/glam_extractor/      # Main package
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ cli.py               # Command-line interface
β”‚   β”œβ”€β”€ parsers/             # Conversation & CSV parsers
β”‚   β”œβ”€β”€ extractors/          # NLP extraction engines
β”‚   β”œβ”€β”€ crawlers/            # Web crawling (crawl4ai)
β”‚   β”œβ”€β”€ validators/          # LinkML validation
β”‚   β”œβ”€β”€ exporters/           # Multi-format export
β”‚   β”œβ”€β”€ geocoding/           # Nominatim geocoding
β”‚   └── utils/               # Utilities
β”œβ”€β”€ schemas/20251121/        # LinkML schemas
β”‚   β”œβ”€β”€ linkml/              # Hyper-modular schema (78 files)
β”‚   β”‚   β”œβ”€β”€ 01_custodian_name_modular.yaml
β”‚   β”‚   └── modules/
β”‚   β”‚       β”œβ”€β”€ metadata.yaml
β”‚   β”‚       β”œβ”€β”€ classes/     # 12 class modules
β”‚   β”‚       β”œβ”€β”€ enums/       # 5 enum modules
β”‚   β”‚       └── slots/       # 59 slot modules
β”‚   β”œβ”€β”€ rdf/                 # 8 RDF serialization formats
β”‚   β”‚   β”œβ”€β”€ 01_custodian_name.owl.ttl
β”‚   β”‚   β”œβ”€β”€ 01_custodian_name.nt
β”‚   β”‚   β”œβ”€β”€ 01_custodian_name.jsonld
β”‚   β”‚   β”œβ”€β”€ 01_custodian_name.rdf
β”‚   β”‚   β”œβ”€β”€ 01_custodian_name.n3
β”‚   β”‚   β”œβ”€β”€ 01_custodian_name.trig
β”‚   β”‚   β”œβ”€β”€ 01_custodian_name.trix
β”‚   β”‚   └── 01_custodian_name.nq
β”‚   └── examples/            # LinkML instance examples
β”œβ”€β”€ tests/                   # Test suite
β”‚   β”œβ”€β”€ unit/
β”‚   β”œβ”€β”€ integration/
β”‚   └── fixtures/
β”œβ”€β”€ docs/                    # Documentation
β”‚   β”œβ”€β”€ plan/global_glam/    # Planning documents
β”‚   β”œβ”€β”€ api/                 # API documentation
β”‚   β”œβ”€β”€ tutorials/           # User tutorials
β”‚   └── examples/            # Usage examples
└── data/                    # Reference data
    β”œβ”€β”€ ISIL-codes_2025-08-01.csv
    β”œβ”€β”€ voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv
    └── ontology/            # Base ontologies (TOOI, CPOV, Schema.org, etc.)
```

## Data Sources

### Conversation JSON Files

139+ conversation files covering global GLAM research:

- **Geographic coverage**: 60+ countries across all continents
- **Content**: Institution names, locations, collections, digital platforms, partnerships
- **Languages**: Multilingual (English, Dutch, Portuguese, Spanish, Vietnamese, Japanese, Arabic, etc.)

### CSV Datasets

1. **Dutch ISIL Registry** (`ISIL-codes_2025-08-01.csv`): ~300 Dutch heritage institutions with authoritative ISIL codes
2. **Dutch Organizations** (`voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv`): Comprehensive metadata including systems, partnerships, collection platforms

### External Sources (Optional Enrichment)

- **Wikidata**: SPARQL queries for additional metadata
- **VIAF**: Authority file linking
- **GeoNames**: Geographic name authority
- **Nominatim**: Geocoding service

## Data Quality & Provenance

Every extracted record includes provenance metadata:

```yaml
provenance:
  data_source: CONVERSATION_NLP | ISIL_REGISTRY | DUTCH_ORG_CSV | WEB_CRAWL | WIKIDATA
  data_tier: TIER_1_AUTHORITATIVE | TIER_2_VERIFIED | TIER_3_CROWD_SOURCED | TIER_4_INFERRED
  extraction_date: 2025-11-05T...
  extraction_method: "spaCy NER + GPT-4 classification"
  confidence_score: 0.0-1.0
  conversation_id: "uuid"
  source_url: "https://..."
  verified_date: null
  verified_by: null
```

**Data Tiers**:

- **Tier 1**: Official registries (ISIL, national registers) - highest authority
- **Tier 2**: Verified institutional data (official websites)
- **Tier 3**: Community-sourced data (Wikidata, OpenStreetMap)
- **Tier 4**: NLP-extracted or inferred data - requires verification

## LinkML Schema

### Hyper-Modular Architecture

The project uses a **hyper-modular LinkML schema** (`schemas/20251121/linkml/01_custodian_name_modular.yaml`) in which every class, enum, and slot is defined in its own file, for maximum maintainability and version-control granularity.
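One file per concept also makes the schema easy to tool over. A hypothetical sketch assembling two per-concept modules into a single schema dict with PyYAML; the module contents below are inlined for illustration (in the repository they live under `modules/`), and the merge logic is not the LinkML loader itself, which resolves `imports` natively:

```python
import yaml

# Inlined stand-ins for two per-concept module files
modules = {
    "modules/enums/AgentTypeEnum.yaml": """
enums:
  AgentTypeEnum:
    permissible_values:
      PERSON: {}
      ORGANIZATION: {}
""",
    "modules/slots/observed_name.yaml": """
slots:
  observed_name:
    range: string
    required: true
""",
}

# Merge selected modules into one schema dict
schema = {"name": "custom_custodian_schema", "enums": {}, "slots": {}}
for path, text in modules.items():  # in the repo these would be open()ed
    doc = yaml.safe_load(text)
    for section in ("enums", "slots"):
        schema[section].update(doc.get(section, {}))

print(sorted(schema["slots"]))
```

Because each concept is a self-contained file, a downstream project can pick exactly the classes, enums, and slots it needs.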
**Schema Structure**:

- **78 YAML files** total
- **12 class modules** (`modules/classes/`)
- **5 enum modules** (`modules/enums/`)
- **59 slot modules** (`modules/slots/`)
- **1 metadata module** (`modules/metadata.yaml`)
- **1 main schema** (`01_custodian_name_modular.yaml`)

**Direct Import Pattern**:

```yaml
imports:
  - linkml:types
  - modules/metadata
  - modules/enums/AgentTypeEnum
  - modules/slots/observed_name
  - modules/classes/CustodianObservation
  # ... 76 total individual module imports
```

**Benefits**:

- βœ… Complete transparency - all dependencies visible
- βœ… Granular version control - one file per concept
- βœ… Parallel development - no merge conflicts
- βœ… Selective imports - customize schemas easily

See [.opencode/HYPER_MODULAR_STRUCTURE.md](.opencode/HYPER_MODULAR_STRUCTURE.md) for complete architecture documentation.

### Ontology Alignment

The schema integrates multiple international standards:

- **CPOV**: Core Public Organisation Vocabulary (EU public sector)
- **TOOI**: Dutch organizational ontology
- **Schema.org**: General web semantics
- **CIDOC-CRM**: Cultural heritage domain model
- **RiC-O**: Records in Contexts Ontology
- **PROV-O**: Provenance tracking
- **PiCo**: Person observations pattern

**Key Classes**:

- `CustodianObservation`: Source-based references (emic/etic perspectives)
- `CustodianName`: Standardized emic names
- `CustodianReconstruction`: Formal legal entities
- `ReconstructionActivity`: Entity derivation from observations
- `Agent`: People responsible for observations/reconstructions
- `SourceDocument`: Documentary evidence
- `Identifier`: External identifiers (ISIL, VIAF, Wikidata)
- `TimeSpan`: Temporal extents with fuzzy boundaries
- `ConfidenceMeasure`: Data quality metrics

**Observation β†’ Reconstruction Pattern**:

```
SourceDocument β†’ CustodianObservation β†’ ReconstructionActivity β†’ CustodianReconstruction
     (text)        (what source says)     (synthesis method)          (formal entity)
```

This pattern distinguishes between source-based references and scholar-derived formal entities, inspired by the PiCo (Persons in Context) ontology.

## Development

### Run Tests

```bash
poetry run pytest                  # All tests
poetry run pytest -m unit          # Unit tests only
poetry run pytest -m integration   # Integration tests only
poetry run pytest --cov            # With coverage report
```

### Code Quality

```bash
poetry run black src/ tests/       # Format code
poetry run ruff check src/ tests/  # Lint code
poetry run mypy src/               # Type checking
```

### Pre-commit Hooks

```bash
poetry run pre-commit install
poetry run pre-commit run --all-files
```

### Documentation

```bash
poetry run mkdocs serve   # Serve docs locally
poetry run mkdocs build   # Build static docs
```

## Examples

### Extract Brazilian Institutions

```python
from glam_extractor import ConversationParser, InstitutionExtractor

# Parse conversation
parser = ConversationParser()
conversation = parser.load("Brazilian_GLAM_collection_inventories.json")

# Extract institutions
extractor = InstitutionExtractor()
institutions = extractor.extract(conversation)

# Print results
for inst in institutions:
    print(f"{inst.name} ({inst.institution_type})")
    print(f"  Location: {inst.locations[0].city}, {inst.locations[0].country}")
    print(f"  Confidence: {inst.provenance.confidence_score}")
```

### Cross-link Dutch Data

```python
from glam_extractor import CSVParser, InstitutionExtractor
from glam_extractor.validators import LinkMLValidator

# Load Dutch ISIL registry
csv_parser = CSVParser()
dutch_institutions = csv_parser.load_isil_registry("ISIL-codes_2025-08-01.csv")

# Load Dutch organizations
dutch_orgs = csv_parser.load_dutch_organizations("voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv")

# Cross-link and merge
extractor = InstitutionExtractor()
merged = extractor.merge_dutch_data(dutch_institutions, dutch_orgs)

# Validate
validator = LinkMLValidator(schema="schemas/heritage_custodian.yaml")
results = validator.validate_batch(merged)
print(f"Valid: {results.valid_count}, Invalid: {results.invalid_count}")
```

### Export to Multiple Formats

```python
from glam_extractor.exporters import JSONLDExporter, RDFExporter, CSVExporter

# Load extracted data
institutions = load_institutions("output.jsonld")

# Export to RDF/Turtle
rdf_exporter = RDFExporter()
rdf_exporter.export(institutions, "output.ttl")

# Export to CSV
csv_exporter = CSVExporter()
csv_exporter.export(institutions, "output.csv")

# Export to Parquet
csv_exporter.export_parquet(institutions, "output.parquet")
```

## Documentation

- **Planning Docs**: `docs/plan/global_glam/`
  - `01-implementation-phases.md`: 7-phase implementation plan
  - `02-architecture.md`: System architecture and data flow
  - `03-dependencies.md`: Technology stack and dependencies
  - `04-data-standardization.md`: Data integration strategies
  - `05-design-patterns.md`: Software design patterns
  - `06-consumers-use-cases.md`: User segments and applications
- **AI Agent Instructions**: `AGENTS.md`
  - NLP extraction guidelines
  - Data quality protocols
  - Agent workflow examples
- **API Documentation**: Generated from docstrings with mkdocstrings

## Contributing

This is a research project. Contributions welcome!

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

MIT License - see LICENSE file for details

## Acknowledgments

- **LinkML**: Schema framework
- **spaCy**: NLP processing
- **crawl4ai**: Web crawling
- **RDFLib**: RDF processing
- **Dutch ISIL Registry**: Authoritative institution data
- **Claude AI**: Conversation data source

## Contact

For questions or collaboration inquiries, please open an issue on GitHub.

---

**Version**: 0.1.0
**Status**: Alpha - Implementation in progress
**Last Updated**: 2025-11-05