GLAM heritage institution data extraction and management

GLAM Extractor

Extract and standardize global GLAM (Galleries, Libraries, Archives, Museums) institutional data from conversation transcripts and authoritative registries.

Overview

This project extracts structured heritage institution data from 139+ Claude conversation JSON files covering worldwide GLAM research. It integrates that data with authoritative CSV datasets (the Dutch ISIL registry and a list of Dutch heritage organizations), validates the result against a LinkML schema, and exports it to multiple formats (RDF/Turtle, JSON-LD, CSV, Parquet, SQLite).

Features

  • Multi-source data integration: Conversation transcripts, CSV registries, web crawling, Wikidata
  • NLP extraction: spaCy NER, transformers-based classification, pattern matching
  • LinkML validation: Comprehensive schema with TOOI, Schema.org, CPOV, ISIL, RiC-O, BIBFRAME
  • Provenance tracking: Every data point tracks source, confidence, and verification status
  • Multi-format export: RDF/Turtle, JSON-LD, CSV, Parquet, SQLite
  • Geocoding: Nominatim integration for location enrichment
  • Multilingual support: Handles source material about institutions in 60+ countries, written in multiple languages

Quick Start

Installation

# Install Poetry (if not already installed)
curl -sSL https://install.python-poetry.org | python3 -

# Clone repository and install dependencies
cd glam-extractor
poetry install

# Download spaCy models
poetry run python -m spacy download en_core_web_trf
poetry run python -m spacy download nl_core_news_lg
poetry run python -m spacy download xx_ent_wiki_sm

Basic Usage

# Extract from conversation JSON
poetry run glam extract conversations/Brazilian_GLAM.json -o output.jsonld

# Extract from Dutch CSV
poetry run glam extract data/ISIL-codes_2025-08-01.csv --csv -o dutch_isil.jsonld

# Validate extracted data
poetry run glam validate output.jsonld -s schemas/heritage_custodian.yaml

# Export to RDF
poetry run glam export output.jsonld -o output.ttl -f rdf

# Crawl institutional website
poetry run glam crawl https://www.rijksmuseum.nl -o rijksmuseum.jsonld

Linked Open Data

The project publishes heritage institution data as W3C-compliant RDF (Turtle, RDF/XML, JSON-LD, N-Triples) aligned with international ontologies.

Published Datasets

Denmark 🇩🇰 - COMPLETE (November 2025)

  • 2,348 institutions (555 libraries, 594 archives, 1,199 branches)
  • 43,429 RDF triples across 9 ontologies
  • 769 Wikidata links (32.8% coverage)
  • Formats: Turtle, RDF/XML, JSON-LD, N-Triples

See data/rdf/README.md for SPARQL examples and usage.
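
The Denmark figures can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted in this section; it does not read the published RDF.

```python
# Figures quoted in the Denmark summary in this README.
libraries, archives, branches = 555, 594, 1_199
wikidata_links = 769

# The three categories should add up to the reported institution count.
total = libraries + archives + branches
coverage = wikidata_links / total

print(total)              # 2348
print(f"{coverage:.1%}")  # 32.8%
```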

Ontology Alignment

| Ontology | Purpose | Coverage |
|---|---|---|
| CPOV (Core Public Organisation Vocabulary) | EU public sector standard | All institutions |
| Schema.org | Web semantics (Library, ArchiveOrganization) | All institutions |
| RiC-O (Records in Contexts) | Archival description | Archives |
| ORG (W3C Organization Ontology) | Hierarchical relationships | Branches |
| PROV-O (Provenance Ontology) | Data provenance tracking | All institutions |
| OWL | Semantic equivalence (Wikidata links) | 32.8% of Danish institutions |

SPARQL Examples

# Find all libraries in Copenhagen
PREFIX schema: <http://schema.org/>
PREFIX cpov: <http://data.europa.eu/m8g/>

SELECT ?library ?name ?address WHERE {
  ?library a cpov:PublicOrganisation, schema:Library .
  ?library schema:name ?name .
  ?library schema:address ?addrNode .
  ?addrNode schema:addressLocality "København K" .
  ?addrNode schema:streetAddress ?address .
}

# Find all institutions with Wikidata links
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>

SELECT ?institution ?name ?wikidataID WHERE {
  ?institution schema:name ?name .
  ?institution owl:sameAs ?wikidataURI .
  FILTER(STRSTARTS(STR(?wikidataURI), "http://www.wikidata.org/entity/Q"))
  BIND(STRAFTER(STR(?wikidataURI), "http://www.wikidata.org/entity/") AS ?wikidataID)
}

See data/rdf/README.md for more examples.
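
When query results are post-processed outside a SPARQL engine, the `STRSTARTS` / `STRAFTER` step of the second query can be mirrored in plain Python. The sketch below is illustrative only: the function name `wikidata_qid` is hypothetical and not part of the package, and the sample URIs are placeholders rather than records from the dataset.

```python
from typing import Optional

WIKIDATA_ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def wikidata_qid(uri: str) -> Optional[str]:
    """Return the Q-identifier of a Wikidata entity URI, or None for other URIs.

    Hypothetical helper: mirrors FILTER(STRSTARTS(...)) followed by
    BIND(STRAFTER(...)) from the SPARQL example.
    """
    if not uri.startswith(WIKIDATA_ENTITY_PREFIX + "Q"):
        return None
    return uri[len(WIKIDATA_ENTITY_PREFIX):]


# Placeholder owl:sameAs targets, not taken from the published data.
same_as_targets = [
    "http://www.wikidata.org/entity/Q42",
    "http://viaf.org/viaf/000000000",
]

print([wikidata_qid(u) for u in same_as_targets])  # ['Q42', None]
```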

Project Structure

glam-extractor/
├── pyproject.toml           # Poetry configuration
├── README.md                # This file
├── AGENTS.md                # AI agent instructions
├── src/glam_extractor/      # Main package
│   ├── __init__.py
│   ├── cli.py               # Command-line interface
│   ├── parsers/             # Conversation & CSV parsers
│   ├── extractors/          # NLP extraction engines
│   ├── crawlers/            # Web crawling (crawl4ai)
│   ├── validators/          # LinkML validation
│   ├── exporters/           # Multi-format export
│   ├── geocoding/           # Nominatim geocoding
│   └── utils/               # Utilities
├── schemas/                 # LinkML schemas
│   └── heritage_custodian.yaml
├── tests/                   # Test suite
│   ├── unit/
│   ├── integration/
│   └── fixtures/
├── docs/                    # Documentation
│   ├── plan/global_glam/    # Planning documents
│   ├── api/                 # API documentation
│   ├── tutorials/           # User tutorials
│   └── examples/            # Usage examples
└── data/                    # Reference data
    ├── ISIL-codes_2025-08-01.csv
    └── voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv

Data Sources

Conversation JSON Files

139+ conversation files covering global GLAM research:

  • Geographic coverage: 60+ countries across all continents
  • Content: Institution names, locations, collections, digital platforms, partnerships
  • Languages: Multilingual (English, Dutch, Portuguese, Spanish, Vietnamese, Japanese, Arabic, etc.)

CSV Datasets

  1. Dutch ISIL Registry (ISIL-codes_2025-08-01.csv): ~300 Dutch heritage institutions with authoritative ISIL codes
  2. Dutch Organizations (voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv): Comprehensive metadata including systems, partnerships, collection platforms
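
Before registry rows are trusted, a cheap shape check on the ISIL code can catch obvious data-entry problems. The following is a minimal sketch, not the project's parser: the column names `isil` and `name` are assumptions about the CSV layout, the sample rows are made up, and the pattern is a simplified reading of the ISIL structure (short prefix, hyphen, identifier of at most 11 characters) rather than a full ISO 15511 validator.

```python
import csv
import io
import re

# Simplified shape check (assumption): 1-4 character prefix, a hyphen,
# then 1-11 characters drawn from letters, digits, ':', '/' and '-'.
ISIL_SHAPE = re.compile(r"[A-Za-z0-9]{1,4}-[A-Za-z0-9:/-]{1,11}")

# Made-up rows with hypothetical column names, standing in for the real CSV.
SAMPLE_CSV = """isil,name
NL-DemoA01,Example Archive
NL DemoB02,Example Library
NL-DemoC03TooLongIdentifier,Example Museum
"""


def split_by_isil_shape(csv_text):
    """Split rows into (valid, rejected) according to the ISIL shape check."""
    valid, rejected = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        code = row["isil"].strip()
        target = valid if ISIL_SHAPE.fullmatch(code) else rejected
        target.append(row)
    return valid, rejected


valid, rejected = split_by_isil_shape(SAMPLE_CSV)
print([r["name"] for r in valid])     # ['Example Archive']
print([r["name"] for r in rejected])  # ['Example Library', 'Example Museum']
```

Rejected rows are kept rather than dropped so they can be reviewed by hand.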

External Sources (Optional Enrichment)

  • Wikidata: SPARQL queries for additional metadata
  • VIAF: Authority file linking
  • GeoNames: Geographic name authority
  • Nominatim: Geocoding service

Data Quality & Provenance

Every extracted record includes provenance metadata:

provenance:
  data_source: CONVERSATION_NLP | ISIL_REGISTRY | DUTCH_ORG_CSV | WEB_CRAWL | WIKIDATA
  data_tier: TIER_1_AUTHORITATIVE | TIER_2_VERIFIED | TIER_3_CROWD_SOURCED | TIER_4_INFERRED
  extraction_date: 2025-11-05T...
  extraction_method: "spaCy NER + GPT-4 classification"
  confidence_score: 0.0-1.0
  conversation_id: "uuid"
  source_url: "https://..."
  verified_date: null
  verified_by: null

Data Tiers:

  • Tier 1: Official registries (ISIL, national registers) - highest authority
  • Tier 2: Verified institutional data (official websites)
  • Tier 3: Community-sourced data (Wikidata, OpenStreetMap)
  • Tier 4: NLP-extracted or inferred data - requires verification
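
One way the tiers could be used when two sources disagree is sketched below. This is an assumption about a reasonable policy (prefer the more authoritative tier, then the higher confidence score), not a description of the project's actual merge logic; the `Claim` class and `pick_preferred` function are hypothetical names.

```python
from dataclasses import dataclass
from enum import IntEnum


class DataTier(IntEnum):
    """Tier names from the provenance block; a lower number means more authority."""
    TIER_1_AUTHORITATIVE = 1
    TIER_2_VERIFIED = 2
    TIER_3_CROWD_SOURCED = 3
    TIER_4_INFERRED = 4


@dataclass(frozen=True)
class Claim:
    """A single value for one field, with the provenance needed to rank it."""
    field: str
    value: str
    tier: DataTier
    confidence_score: float  # 0.0-1.0, as in the provenance block


def pick_preferred(claims):
    """Prefer the most authoritative tier; break ties by higher confidence."""
    return min(claims, key=lambda c: (c.tier, -c.confidence_score))


# Illustrative claims about the same field from two sources.
claims = [
    Claim("city", "Amsterdm", DataTier.TIER_4_INFERRED, 0.7),
    Claim("city", "Amsterdam", DataTier.TIER_1_AUTHORITATIVE, 1.0),
]

print(pick_preferred(claims).value)  # Amsterdam
```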

LinkML Schema

The heritage_custodian.yaml schema integrates multiple standards:

  • TOOI: Dutch organizational ontology
  • Schema.org: General web semantics
  • CPOV: Core Public Organisation Vocabulary
  • ISIL: International Standard Identifier for Libraries and Related Organizations
  • RiC-O: Records in Contexts Ontology
  • BIBFRAME: Bibliographic Framework
  • CIDOC-CRM: Conceptual Reference Model

Key classes:

  • HeritageCustodian: Base class for all heritage institutions
  • DutchHeritageCustodian: Dutch-specific subclass with KvK, gemeente codes
  • Location: Geographic locations
  • Identifier: External identifiers (ISIL, VIAF, Wikidata)
  • Collection: Collections held by institutions
  • DigitalPlatform: Digital systems used
  • Provenance: Data quality tracking

Development

Run Tests

poetry run pytest                    # All tests
poetry run pytest -m unit           # Unit tests only
poetry run pytest -m integration    # Integration tests only
poetry run pytest --cov             # With coverage report

Code Quality

poetry run black src/ tests/        # Format code
poetry run ruff check src/ tests/   # Lint code
poetry run mypy src/                # Type checking

Pre-commit Hooks

poetry run pre-commit install
poetry run pre-commit run --all-files

Documentation

poetry run mkdocs serve    # Serve docs locally
poetry run mkdocs build    # Build static docs

Examples

Extract Brazilian Institutions

from glam_extractor import ConversationParser, InstitutionExtractor

# Parse conversation
parser = ConversationParser()
conversation = parser.load("Brazilian_GLAM_collection_inventories.json")

# Extract institutions
extractor = InstitutionExtractor()
institutions = extractor.extract(conversation)

# Print results
for inst in institutions:
    print(f"{inst.name} ({inst.institution_type})")
    print(f"  Location: {inst.locations[0].city}, {inst.locations[0].country}")
    print(f"  Confidence: {inst.provenance.confidence_score}")

Merge and Validate Dutch Datasets

from glam_extractor import CSVParser, InstitutionExtractor
from glam_extractor.validators import LinkMLValidator

# Load Dutch ISIL registry
csv_parser = CSVParser()
dutch_institutions = csv_parser.load_isil_registry("ISIL-codes_2025-08-01.csv")

# Load Dutch organizations
dutch_orgs = csv_parser.load_dutch_organizations("voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv")

# Cross-link and merge
extractor = InstitutionExtractor()
merged = extractor.merge_dutch_data(dutch_institutions, dutch_orgs)

# Validate
validator = LinkMLValidator(schema="schemas/heritage_custodian.yaml")
results = validator.validate_batch(merged)
print(f"Valid: {results.valid_count}, Invalid: {results.invalid_count}")
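
The cross-link step is not spelled out in this README. A minimal sketch of one plausible approach, joining the two datasets on the ISIL code, is shown below. The function `crosslink_by_isil`, the dictionary keys, and the sample rows are all hypothetical, and matching codes case-insensitively is an assumption; the real `merge_dutch_data` may work differently.

```python
def crosslink_by_isil(registry_rows, org_rows):
    """Attach organization metadata to registry rows that share an ISIL code.

    Hypothetical sketch: registry values win on conflicting keys because the
    registry is the more authoritative source; unmatched rows are kept.
    """
    orgs_by_isil = {
        row["isil"].strip().casefold(): row
        for row in org_rows
        if row.get("isil")
    }
    merged = []
    for reg in registry_rows:
        key = reg["isil"].strip().casefold()
        extra = orgs_by_isil.get(key, {})
        merged.append({**extra, **reg, "matched": key in orgs_by_isil})
    return merged


# Made-up rows standing in for the two CSV datasets.
registry = [
    {"isil": "NL-DemoA01", "name": "Example Archive"},
    {"isil": "NL-DemoB02", "name": "Example Library"},
]
organizations = [
    {"isil": "nl-demoa01 ", "name": "Ex. Archive", "platform": "Hypothetical CMS"},
]

for record in crosslink_by_isil(registry, organizations):
    print(record["isil"], record["matched"], record.get("platform"))
```

Keeping a `matched` flag on every record makes it easy to report how many registry entries found a counterpart.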

Export to Multiple Formats

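from glam_extractor.exporters import JSONLDExporter, RDFExporter, CSVExporter

# Load extracted data
# NOTE: load_institutions is not imported in this snippet; it is assumed to be
# a project helper that reads records back from a JSON-LD file.
institutions = load_institutions("output.jsonld")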

# Export to RDF/Turtle
rdf_exporter = RDFExporter()
rdf_exporter.export(institutions, "output.ttl")

# Export to CSV
csv_exporter = CSVExporter()
csv_exporter.export(institutions, "output.csv")

# Export to Parquet
csv_exporter.export_parquet(institutions, "output.parquet")
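
For readers without the package installed, the tabular exports can be approximated with the standard library alone. The sketch below is not the project's exporter: the field names, the `institutions` table, and the sample records are assumptions made for illustration, and it covers only CSV and SQLite.

```python
import csv
import io
import sqlite3

# Hypothetical flat records; the real model has nested locations and provenance.
records = [
    {"name": "Example Archive", "institution_type": "ARCHIVE", "city": "Utrecht"},
    {"name": "Example Library", "institution_type": "LIBRARY", "city": "Leiden"},
]
FIELDS = ["name", "institution_type", "city"]


def to_csv_text(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_sqlite(rows, path=":memory:"):
    """Write rows to an 'institutions' table and return the open connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE institutions (name TEXT, institution_type TEXT, city TEXT)"
    )
    conn.executemany(
        "INSERT INTO institutions VALUES (:name, :institution_type, :city)", rows
    )
    conn.commit()
    return conn


csv_text = to_csv_text(records)
conn = to_sqlite(records)
count = conn.execute("SELECT COUNT(*) FROM institutions").fetchone()[0]

print(csv_text.splitlines()[0])  # name,institution_type,city
print(count)                     # 2
```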

Documentation

  • Planning Docs: docs/plan/global_glam/

    • 01-implementation-phases.md: 7-phase implementation plan
    • 02-architecture.md: System architecture and data flow
    • 03-dependencies.md: Technology stack and dependencies
    • 04-data-standardization.md: Data integration strategies
    • 05-design-patterns.md: Software design patterns
    • 06-consumers-use-cases.md: User segments and applications
  • AI Agent Instructions: AGENTS.md

    • NLP extraction guidelines
    • Data quality protocols
    • Agent workflow examples
  • API Documentation: Generated from docstrings with mkdocstrings

Contributing

This is a research project. Contributions welcome!

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open Pull Request

License

MIT License - see LICENSE file for details

Acknowledgments

  • LinkML: Schema framework
  • spaCy: NLP processing
  • crawl4ai: Web crawling
  • RDFLib: RDF processing
  • Dutch ISIL Registry: Authoritative institution data
  • Claude AI: Conversation data source

Contact

For questions or collaboration inquiries, please open an issue on GitHub.


Version: 0.1.0
Status: Alpha - Implementation in progress
Last Updated: 2025-11-05