GLAM Extractor
Extract and standardize global GLAM (Galleries, Libraries, Archives, Museums) institutional data from conversation transcripts and authoritative registries.
Overview
This project extracts structured heritage institution data from 139+ Claude conversation JSON files covering worldwide GLAM research, integrates with authoritative CSV datasets (Dutch ISIL registry, Dutch heritage organizations), validates against a comprehensive LinkML schema, and exports to multiple formats (RDF/Turtle, JSON-LD, CSV, Parquet, SQLite).
Features
- Multi-source data integration: Conversation transcripts, CSV registries, web crawling, Wikidata
- NLP extraction: spaCy NER, transformers-based classification, pattern matching
- LinkML validation: Comprehensive schema with TOOI, Schema.org, CPOV, ISIL, RiC-O, BIBFRAME
- Provenance tracking: Every data point tracks source, confidence, and verification status
- Multi-format export: RDF/Turtle, JSON-LD, CSV, Parquet, SQLite
- Geocoding: Nominatim integration for location enrichment
- Multilingual support: Handles 60+ countries and languages
Quick Start
Installation
```bash
# Install Poetry (if not already installed)
curl -sSL https://install.python-poetry.org | python3 -

# Clone repository and install dependencies
cd glam-extractor
poetry install

# Download spaCy models
poetry run python -m spacy download en_core_web_trf
poetry run python -m spacy download nl_core_news_lg
poetry run python -m spacy download xx_ent_wiki_sm
```
Basic Usage
```bash
# Extract from conversation JSON
poetry run glam extract conversations/Brazilian_GLAM.json -o output.jsonld

# Extract from Dutch CSV
poetry run glam extract data/ISIL-codes_2025-08-01.csv --csv -o dutch_isil.jsonld

# Validate extracted data
poetry run glam validate output.jsonld -s schemas/heritage_custodian.yaml

# Export to RDF
poetry run glam export output.jsonld -o output.ttl -f rdf

# Crawl institutional website
poetry run glam crawl https://www.rijksmuseum.nl -o rijksmuseum.jsonld
```
Linked Open Data
The project publishes heritage institution data as W3C-compliant RDF (Turtle, RDF/XML, JSON-LD, N-Triples) aligned with international ontologies:
Published Datasets
Denmark 🇩🇰 - ✅ COMPLETE (November 2025)
- 2,348 institutions (555 libraries, 594 archives, 1,199 branches)
- 43,429 RDF triples across 9 ontologies
- 769 Wikidata links (32.8% coverage)
- Formats: Turtle, RDF/XML, JSON-LD, N-Triples
See data/rdf/README.md for SPARQL examples and usage.
Ontology Alignment
| Ontology | Purpose | Coverage |
|---|---|---|
| CPOV (Core Public Organisation Vocabulary) | EU public sector standard | All institutions |
| Schema.org | Web semantics (Library, ArchiveOrganization) | All institutions |
| RiC-O (Records in Contexts Ontology) | Archival description | Archives |
| ORG (W3C Organization Ontology) | Hierarchical relationships | Branches |
| PROV-O (Provenance Ontology) | Data provenance tracking | All institutions |
| OWL | Semantic equivalence (Wikidata links) | 32.8% Denmark |
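The sketch below shows how rdflib can assert several of these ontology types on a single institution. It is illustrative only: the URI, name, and Q-number are placeholders, not records from the published dataset.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDF

SCHEMA = Namespace("http://schema.org/")
CPOV = Namespace("http://data.europa.eu/m8g/")

g = Graph()
g.bind("schema", SCHEMA)
g.bind("cpov", CPOV)

# Hypothetical URI; real records follow the project's own minting scheme.
inst = URIRef("https://example.org/glam/dk/example-library")

# A single institution carries types from several of the ontologies above.
g.add((inst, RDF.type, CPOV.PublicOrganisation))
g.add((inst, RDF.type, SCHEMA.Library))
g.add((inst, SCHEMA.name, Literal("Example Library", lang="da")))
g.add((inst, OWL.sameAs, URIRef("http://www.wikidata.org/entity/Q1")))  # placeholder Q-number

print(g.serialize(format="turtle"))
```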
SPARQL Examples
```sparql
# Find all libraries in Copenhagen
PREFIX schema: <http://schema.org/>
PREFIX cpov: <http://data.europa.eu/m8g/>

SELECT ?library ?name ?address WHERE {
  ?library a cpov:PublicOrganisation, schema:Library .
  ?library schema:name ?name .
  ?library schema:address ?addrNode .
  ?addrNode schema:addressLocality "København K" .
  ?addrNode schema:streetAddress ?address .
}
```

```sparql
# Find all institutions with Wikidata links
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>

SELECT ?institution ?name ?wikidataID WHERE {
  ?institution schema:name ?name .
  ?institution owl:sameAs ?wikidataURI .
  FILTER(STRSTARTS(STR(?wikidataURI), "http://www.wikidata.org/entity/Q"))
  BIND(STRAFTER(STR(?wikidataURI), "http://www.wikidata.org/entity/") AS ?wikidataID)
}
```
See data/rdf/README.md for more examples.
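These queries can also be run locally with rdflib; a minimal sketch, assuming the Denmark Turtle export lives at data/rdf/denmark.ttl (adjust to the actual published filename):

```python
from rdflib import Graph

g = Graph()
# Assumed filename; adjust to the actual Turtle export under data/rdf/.
g.parse("data/rdf/denmark.ttl", format="turtle")

query = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>
SELECT ?name ?wikidataURI WHERE {
  ?institution schema:name ?name ;
               owl:sameAs ?wikidataURI .
  FILTER(STRSTARTS(STR(?wikidataURI), "http://www.wikidata.org/entity/Q"))
}
"""

for row in g.query(query):
    print(row.name, row.wikidataURI)
```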
Project Structure
```
glam-extractor/
├── pyproject.toml          # Poetry configuration
├── README.md               # This file
├── AGENTS.md               # AI agent instructions
├── src/glam_extractor/     # Main package
│   ├── __init__.py
│   ├── cli.py              # Command-line interface
│   ├── parsers/            # Conversation & CSV parsers
│   ├── extractors/         # NLP extraction engines
│   ├── crawlers/           # Web crawling (crawl4ai)
│   ├── validators/         # LinkML validation
│   ├── exporters/          # Multi-format export
│   ├── geocoding/          # Nominatim geocoding
│   └── utils/              # Utilities
├── schemas/                # LinkML schemas
│   └── heritage_custodian.yaml
├── tests/                  # Test suite
│   ├── unit/
│   ├── integration/
│   └── fixtures/
├── docs/                   # Documentation
│   ├── plan/global_glam/   # Planning documents
│   ├── api/                # API documentation
│   ├── tutorials/          # User tutorials
│   └── examples/           # Usage examples
└── data/                   # Reference data
    ├── ISIL-codes_2025-08-01.csv
    └── voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv
```
Data Sources
Conversation JSON Files
139+ conversation files covering global GLAM research:
- Geographic coverage: 60+ countries across all continents
- Content: Institution names, locations, collections, digital platforms, partnerships
- Languages: Multilingual (English, Dutch, Portuguese, Spanish, Vietnamese, Japanese, Arabic, etc.)
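A quick way to inspect one of these files before extraction. This is a sketch only: the exact export layout is project-specific, and the "chat_messages" and "text" field names are assumptions here.

```python
import json
from pathlib import Path

# Sketch only: assumes each file holds an object with a "chat_messages"
# list of {"text": ...} entries (field names are assumptions).
path = Path("conversations/Brazilian_GLAM.json")
conversation = json.loads(path.read_text(encoding="utf-8"))

texts = [m.get("text", "") for m in conversation.get("chat_messages", [])]
print(f"{path.name}: {len(texts)} messages, {sum(map(len, texts))} characters")
```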
CSV Datasets
- Dutch ISIL Registry (ISIL-codes_2025-08-01.csv): ~300 Dutch heritage institutions with authoritative ISIL codes
- Dutch Organizations (voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv): comprehensive metadata including systems, partnerships, and collection platforms
External Sources (Optional Enrichment)
- Wikidata: SPARQL queries for additional metadata
- VIAF: Authority file linking
- GeoNames: Geographic name authority
- Nominatim: Geocoding service
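For Nominatim, a minimal standalone sketch using the geopy client (the project's own geocoding module may differ; the user agent string is illustrative):

```python
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim

# Nominatim's usage policy requires an identifying user agent and ~1 request/s.
geolocator = Nominatim(user_agent="glam-extractor-example")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

location = geocode("Rijksmuseum, Amsterdam, Netherlands")
if location is not None:
    print(location.latitude, location.longitude)
```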
Data Quality & Provenance
Every extracted record includes provenance metadata:
```yaml
provenance:
  data_source: CONVERSATION_NLP | ISIL_REGISTRY | DUTCH_ORG_CSV | WEB_CRAWL | WIKIDATA
  data_tier: TIER_1_AUTHORITATIVE | TIER_2_VERIFIED | TIER_3_CROWD_SOURCED | TIER_4_INFERRED
  extraction_date: 2025-11-05T...
  extraction_method: "spaCy NER + GPT-4 classification"
  confidence_score: 0.0-1.0
  conversation_id: "uuid"
  source_url: "https://..."
  verified_date: null
  verified_by: null
```
Data Tiers:
- Tier 1: Official registries (ISIL, national registers) - highest authority
- Tier 2: Verified institutional data (official websites)
- Tier 3: Community-sourced data (Wikidata, OpenStreetMap)
- Tier 4: NLP-extracted or inferred data - requires verification
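A downstream consumer might use these tiers to route records for review. A minimal sketch, assuming records are serialized as a plain JSON list carrying the provenance block shown above; the 0.7 threshold is an arbitrary example, not a project policy.

```python
import json

# Sketch only: assumes a plain JSON list of records (a JSON-LD @graph
# would need one extra unwrapping step).
with open("output.jsonld", encoding="utf-8") as f:
    records = json.load(f)

# Route Tier 4 (inferred) or low-confidence records to manual review.
needs_review = [
    r for r in records
    if r["provenance"]["data_tier"] == "TIER_4_INFERRED"
    or r["provenance"]["confidence_score"] < 0.7
]
print(f"{len(needs_review)} of {len(records)} records need verification")
```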
LinkML Schema
The heritage_custodian.yaml schema integrates multiple standards:
- TOOI: Dutch organizational ontology
- Schema.org: General web semantics
- CPOV: Core Public Organisation Vocabulary
- ISIL: International Standard Identifier for Libraries
- RiC-O: Records in Contexts Ontology
- BIBFRAME: Bibliographic Framework
- CIDOC-CRM: Conceptual Reference Model
Key classes:
- HeritageCustodian: Base class for all heritage institutions
- DutchHeritageCustodian: Dutch-specific subclass with KvK and gemeente codes
- Location: Geographic locations
- Identifier: External identifiers (ISIL, VIAF, Wikidata)
- Collection: Collections held by institutions
- DigitalPlatform: Digital systems used
- Provenance: Data quality tracking
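The classes above can be introspected with linkml-runtime; a short sketch using its SchemaView API:

```python
from linkml_runtime.utils.schemaview import SchemaView

# Load the schema, then list its classes and the slots on the base class.
sv = SchemaView("schemas/heritage_custodian.yaml")

print(sorted(sv.all_classes()))              # includes HeritageCustodian, Location, ...
print(sv.class_slots("HeritageCustodian"))   # slot names available on the base class
```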
Development
Run Tests
```bash
poetry run pytest                 # All tests
poetry run pytest -m unit         # Unit tests only
poetry run pytest -m integration  # Integration tests only
poetry run pytest --cov           # With coverage report
```
Code Quality
```bash
poetry run black src/ tests/       # Format code
poetry run ruff check src/ tests/  # Lint code
poetry run mypy src/               # Type checking
```
Pre-commit Hooks
```bash
poetry run pre-commit install
poetry run pre-commit run --all-files
```
Documentation
```bash
poetry run mkdocs serve  # Serve docs locally
poetry run mkdocs build  # Build static docs
```
Examples
Extract Brazilian Institutions
```python
from glam_extractor import ConversationParser, InstitutionExtractor

# Parse conversation
parser = ConversationParser()
conversation = parser.load("Brazilian_GLAM_collection_inventories.json")

# Extract institutions
extractor = InstitutionExtractor()
institutions = extractor.extract(conversation)

# Print results
for inst in institutions:
    print(f"{inst.name} ({inst.institution_type})")
    print(f"  Location: {inst.locations[0].city}, {inst.locations[0].country}")
    print(f"  Confidence: {inst.provenance.confidence_score}")
```
Cross-link Dutch Data
```python
from glam_extractor import CSVParser, InstitutionExtractor
from glam_extractor.validators import LinkMLValidator

# Load Dutch ISIL registry
csv_parser = CSVParser()
dutch_institutions = csv_parser.load_isil_registry("ISIL-codes_2025-08-01.csv")

# Load Dutch organizations
dutch_orgs = csv_parser.load_dutch_organizations("voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv")

# Cross-link and merge
extractor = InstitutionExtractor()
merged = extractor.merge_dutch_data(dutch_institutions, dutch_orgs)

# Validate
validator = LinkMLValidator(schema="schemas/heritage_custodian.yaml")
results = validator.validate_batch(merged)
print(f"Valid: {results.valid_count}, Invalid: {results.invalid_count}")
```
Export to Multiple Formats
```python
from glam_extractor.exporters import JSONLDExporter, RDFExporter, CSVExporter

# Load extracted data (load_institutions stands in for your own loading helper)
institutions = load_institutions("output.jsonld")

# Export to RDF/Turtle
rdf_exporter = RDFExporter()
rdf_exporter.export(institutions, "output.ttl")

# Export to CSV
csv_exporter = CSVExporter()
csv_exporter.export(institutions, "output.csv")

# Export to Parquet
csv_exporter.export_parquet(institutions, "output.parquet")
```
Documentation
- Planning Docs (docs/plan/global_glam/):
  - 01-implementation-phases.md: 7-phase implementation plan
  - 02-architecture.md: System architecture and data flow
  - 03-dependencies.md: Technology stack and dependencies
  - 04-data-standardization.md: Data integration strategies
  - 05-design-patterns.md: Software design patterns
  - 06-consumers-use-cases.md: User segments and applications
- AI Agent Instructions (AGENTS.md):
  - NLP extraction guidelines
  - Data quality protocols
  - Agent workflow examples
- API Documentation: Generated from docstrings with mkdocstrings
Contributing
This is a research project. Contributions welcome!
1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request
License
MIT License - see LICENSE file for details
Acknowledgments
- LinkML: Schema framework
- spaCy: NLP processing
- crawl4ai: Web crawling
- RDFLib: RDF processing
- Dutch ISIL Registry: Authoritative institution data
- Claude AI: Conversation data source
Contact
For questions or collaboration inquiries, please open an issue on GitHub.
Version: 0.1.0
Status: Alpha - Implementation in progress
Last Updated: 2025-11-05