Global GLAM Dataset: System Architecture
Overview
This document describes the technical architecture for extracting, enriching, and publishing a comprehensive global GLAM (Galleries, Libraries, Archives, Museums) dataset from Claude conversation files.
System Context Diagram
┌─────────────────────────────────────────────────────────────────┐
│ Global GLAM Extraction System │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Conversation │ ───> │ Extraction │ ───> │ LinkML │ │
│ │ Parsers │ │ Pipeline │ │ Instances │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Text │ │ Web Crawler │ │ Export │ │
│ │ Corpus │ │ (crawl4ai) │ │ Formats │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
[139 JSON Files] [Web Sources] [Published Dataset]
High-Level Architecture
Extraction Architecture: Hybrid Approach
This system uses a hybrid extraction architecture combining pattern matching and subagent-based NER:
Pattern Matching (Main Code):
- Extract structured identifiers: ISIL codes, Wikidata IDs, VIAF IDs, KvK numbers
- URL extraction and validation
- Fast, deterministic, no dependencies on NLP libraries
- Implemented in: src/glam_extractor/extractors/identifiers.py
Subagent-Based NER:
- Extract unstructured entities: institution names, locations, relationships
- Coding subagents autonomously choose NER tools (spaCy, transformers, GPT-4, etc.)
- Main application stays lightweight (no PyTorch, spaCy, transformers dependencies)
- Flexible: can swap extraction methods without changing main code
- Implemented via: Task tool invocation in extractors
Rationale: See docs/plan/global_glam/07-subagent-architecture.md for detailed architectural decision record.
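The pattern-matching half can be sketched with a few regular expressions. The patterns below are illustrative only; the project's actual rules live in src/glam_extractor/extractors/identifiers.py and would follow the ISIL (ISO 15511), Wikidata, and VIAF identifier conventions more strictly.

```python
import re

# Illustrative patterns, not the production rules.
PATTERNS = {
    "wikidata_id": re.compile(r"\bQ\d{1,10}\b"),
    "viaf_id": re.compile(r"viaf\.org/viaf/(\d+)"),
    # ISIL: 1-4 uppercase prefix, hyphen, up to 11 unit-identifier chars.
    "isil_code": re.compile(r"\b[A-Z]{1,4}-[A-Za-z0-9:/\-]{1,11}\b"),
}

def extract_identifiers(text: str) -> dict[str, list[str]]:
    """Return all identifier matches found in a block of text, deduplicated."""
    results: dict[str, list[str]] = {}
    for name, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            results[name] = sorted(set(matches))
    return results
```

Because these extractors are deterministic regexes, they stay fast and dependency-free, which is the point of keeping them in the main code rather than in subagents.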
1. Data Ingestion Layer
1.1 Conversation Parser
Purpose: Extract structured content from Claude conversation JSON files
Components:
- ConversationReader: Read and validate JSON files
- MessageExtractor: Extract message content and metadata
- CitationExtractor: Parse URLs and references
- ConversationIndexer: Create searchable index of conversations
Inputs:
- Claude conversation JSON files (139 files covering global GLAM research)
Outputs:
- Structured conversation objects
- Text corpus (markdown content)
- Citation database (URLs, references)
- Conversation metadata index
Technology:
- Python json module for parsing
- pydantic for data validation
- DuckDB or SQLite for indexing
1.2 Data Model
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class Conversation:
id: str
created_at: datetime
title: str
messages: List[Message]
citations: List[Citation]
metadata: ConversationMetadata
@dataclass
class Message:
role: str # 'user' or 'assistant'
content: str # Markdown text
timestamp: datetime
@dataclass
class Citation:
url: str
title: Optional[str]
context: str # Surrounding text
@dataclass
class ConversationMetadata:
country: Optional[str]
region: Optional[str]
glam_types: List[str] # ['museum', 'archive', 'library', 'gallery']
languages: List[str]
2. Extraction & Processing Layer
2.1 NLP Extraction Pipeline
Purpose: Extract heritage institutions and attributes using pattern matching and subagent-based NER
Architecture: This pipeline uses a hybrid approach:
- Pattern matching (in main code) for identifiers and structured data
- Coding subagents (via Task tool) for Named Entity Recognition
- See docs/plan/global_glam/07-subagent-architecture.md for the detailed rationale
Components:
Entity Extractor (Main Code - Pattern Matching)
- IdentifierExtractor: Extract ISIL codes, Wikidata IDs, VIAF IDs, KvK numbers using regex patterns
- URLExtractor: URL pattern matching and validation
Entity Extractor (Subagent-Based NER)
- InstitutionExtractor: Launch subagents to extract institution names using NLP/NER
- LocationExtractor: Launch subagents for geographic entity extraction
- RelationshipExtractor: Launch subagents to extract organizational relationships
Attribute Extractor (Subagent-Based)
- TypeClassifier: Subagents classify institution type (museum, archive, etc.)
- CollectionExtractor: Subagents extract collection subjects/themes
- StandardsExtractor: Subagents identify metadata/preservation standards used
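Whatever tools a subagent chooses internally, the main code only sees its returned payload, so a defensive parser at the boundary keeps the pipeline robust. The contract below (a JSON array of entity objects with name, institution_type, and confidence fields) is a hypothetical illustration, not a format fixed by this document:

```python
import json

# Hypothetical result contract for an NER subagent; field names are
# illustrative, not prescribed by this architecture.
REQUIRED_FIELDS = {"name", "institution_type", "confidence"}

def parse_subagent_result(raw: str) -> list[dict]:
    """Validate and normalize the JSON payload returned by an NER subagent."""
    entities = json.loads(raw)
    for entity in entities:
        missing = REQUIRED_FIELDS - entity.keys()
        if missing:
            raise ValueError(f"entity missing fields: {sorted(missing)}")
        # Clamp confidence into [0, 1] so downstream thresholds behave predictably.
        entity["confidence"] = min(max(float(entity["confidence"]), 0.0), 1.0)
    return entities
```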
Data Model
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class ExtractedInstitution:
name: str
name_variants: List[str]
institution_type: List[str] # Can be multiple
# Location
country: Optional[str]
region: Optional[str]
city: Optional[str]
address: Optional[str]
coordinates: Optional[Tuple[float, float]]
# Digital presence
urls: List[str]
repositories: List[DigitalRepository]
# Identifiers
isil_code: Optional[str]
wikidata_id: Optional[str]
national_ids: Dict[str, str]
# Collections
collection_subjects: List[str]
collection_formats: List[str]
# Technical
metadata_standards: List[str]
preservation_standards: List[str]
access_protocols: List[str]
# Organizational
parent_organization: Optional[str]
consortia: List[str]
partnerships: List[str]
# Provenance
source_conversations: List[str]
extraction_confidence: float
extraction_method: str
@dataclass
class DigitalRepository:
url: str
platform: Optional[str] # e.g., "DSpace", "Omeka", "CollectiveAccess"
repository_type: str # e.g., "institutional repository", "catalog", "portal"
access_type: str # "open", "restricted", "mixed"
Technology:
- Pattern Matching (main code): Python re module, rapidfuzz for fuzzy matching
- NER & Extraction (subagents): Coding subagents choose appropriate tools (spaCy, transformers, etc.)
- Task Orchestration: Task tool for subagent invocation
- Text Processing: langdetect for language identification (main code)
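The fuzzy-matching step (rapidfuzz above) is used to reconcile institution name variants. As a dependency-free sketch of the same idea, stdlib difflib works; the 0.85 threshold is an illustrative choice, not a tuned value:

```python
from difflib import SequenceMatcher

def best_match(name: str, candidates: list[str], threshold: float = 0.85):
    """Return the candidate most similar to `name`, or None if all fall below threshold."""
    def score(candidate: str) -> float:
        return SequenceMatcher(None, name.lower(), candidate.lower()).ratio()

    best = max(candidates, key=score, default=None)
    if best is not None and score(best) >= threshold:
        return best
    return None
```

In production, rapidfuzz provides the same ratio-style scoring with much better performance over large candidate lists.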
2.2 Web Crawling & Enrichment
Purpose: Fetch and extract data from URLs cited in conversations
Components:
Web Crawler (crawl4ai)
- URLValidator: Check URL availability and redirects
- PageFetcher: Async crawling with rate limiting
- MetadataExtractor: Extract structured metadata from pages
- ContentExtractor: Platform-specific content extraction
Enrichment Services
- GeocodingService: Resolve addresses to coordinates (Nominatim)
- WikidataLinker: Link institutions to Wikidata entities
- VIAFLinker: Link to VIAF authority records
- RegistryChecker: Verify against national/international registries
Technology:
- crawl4ai: Async web crawling with LLM extraction
- httpx: Modern async HTTP client
- beautifulsoup4: HTML parsing
- lxml: XML/HTML processing
- geopy: Geocoding
- SPARQLWrapper: Query Wikidata/VIAF
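As a sketch of the WikidataLinker step, a query string can be composed for SPARQLWrapper to send to the Wikidata endpoint. The class filter wd:Q43229 (organization) and the exact-label matching strategy are illustrative choices, not prescribed here:

```python
def build_wikidata_lookup(label: str, language: str = "en") -> str:
    """Compose a SPARQL query matching organizations by exact label."""
    return f'''
SELECT ?item ?itemLabel WHERE {{
  ?item rdfs:label "{label}"@{language} .
  ?item wdt:P31/wdt:P279* wd:Q43229 .   # instance of (a subclass of) organization
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}" . }}
}}
LIMIT 5
'''
```

A production linker would likely combine this with fuzzy label search and narrower type filters (e.g. museum, archive) before accepting a match.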
Data Flow:
Citations → URL Validation → Crawl Queue
↓
crawl4ai Async Crawler
↓
┌─────────┴─────────┐
▼ ▼
Metadata Extract Content Extract
│ │
└─────────┬─────────┘
▼
Enrichment Pipeline
▼
Enriched Institution Data
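The crawl-queue stage in the flow above can be sketched with stdlib asyncio. The actual fetcher (crawl4ai/httpx) is injected as `fetch` so the scheduling and rate-limiting logic stays dependency-free:

```python
import asyncio

async def crawl_all(urls, fetch, concurrency: int = 5, delay: float = 0.5):
    """Crawl URLs with bounded concurrency and a per-request politeness delay.

    `fetch` is any async callable taking a URL and returning a result.
    """
    semaphore = asyncio.Semaphore(concurrency)
    results = {}

    async def worker(url):
        async with semaphore:
            results[url] = await fetch(url)
            await asyncio.sleep(delay)  # simple politeness delay

    await asyncio.gather(*(worker(u) for u in urls))
    return results
```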
3. Schema & Validation Layer
3.1 LinkML Schema Definition
Purpose: Define comprehensive heritage custodian ontology
Schema Structure:
# heritage_custodian.yaml (simplified)
id: https://w3id.org/heritage-custodian
name: heritage-custodian-schema
title: Heritage Custodian Ontology
prefixes:
hc: https://w3id.org/heritage-custodian/
schema: http://schema.org/
dcterms: http://purl.org/dc/terms/
skos: http://www.w3.org/2004/02/skos/core#
isil: http://id.loc.gov/vocabulary/organizations/
rico: https://www.ica.org/standards/RiC/ontology#
imports:
- linkml:types
- schema_org_subset
- cpoc_subset
- tooi_subset
classes:
HeritageCustodian:
is_a: schema:Organization
mixins:
- CPOCPublicOrganization
- TOOIOrganization
slots:
- name
- institution_types
- geographic_coverage
- digital_platforms
- collections
- identifiers
DigitalPlatform:
is_a: schema:WebSite
slots:
- platform_type
- software_platform
- access_protocol
- metadata_standard
Collection:
slots:
- collection_name
- subjects
- formats
- temporal_coverage
- rights_statements
slots:
institution_types:
multivalued: true
range: InstitutionTypeEnum
identifiers:
multivalued: true
range: Identifier
enums:
InstitutionTypeEnum:
permissible_values:
ARCHIVE:
meaning: rico:RecordCreator
LIBRARY:
meaning: schema:Library
MUSEUM:
meaning: schema:Museum
GALLERY:
meaning: schema:ArtGallery
Modular Design:
- core/ - Core classes (HeritageCustodian, Collection)
- mixins/ - Reusable components (TOOI, CPOC, Schema.org)
- identifiers/ - Identifier systems (ISIL, Wikidata, etc.)
- standards/ - Domain standards (RiC-O, BIBFRAME, LIDO)
- enums/ - Controlled vocabularies
3.2 Validation & Mapping
Components:
- SchemaValidator: Validate instances against LinkML schema
- EntityMapper: Map extracted data to LinkML classes
- IRIGenerator: Generate unique IRIs for instances
- ProvenanceTracker: Track extraction provenance
Technology:
- linkml-runtime: Schema validation and instance generation
- linkml: Schema development tools
- rdflib: RDF graph manipulation
- jsonschema: JSON validation
4. Storage Layer
4.1 Intermediate Storage
Purpose: Store processed data during extraction pipeline
Databases:
- Conversation Index: DuckDB
  - Fast analytical queries
  - Full-text search on conversation content
  - Lightweight, embedded database
- Extraction Cache: SQLite
  - Store intermediate extraction results
  - Cache web crawling results
  - Track processing status
Schema:
-- Conversations
CREATE TABLE conversations (
id TEXT PRIMARY KEY,
created_at TIMESTAMP,
title TEXT,
country TEXT,
region TEXT,
content TEXT, -- Full markdown content
processed BOOLEAN DEFAULT FALSE
);
CREATE TABLE citations (
id INTEGER PRIMARY KEY,
conversation_id TEXT,
url TEXT,
title TEXT,
context TEXT,
crawled BOOLEAN DEFAULT FALSE,
crawl_status TEXT,
FOREIGN KEY (conversation_id) REFERENCES conversations(id)
);
-- Extracted entities
CREATE TABLE extracted_institutions (
id TEXT PRIMARY KEY,
name TEXT,
institution_type TEXT, -- JSON-encoded list (SQLite has no array type)
country TEXT,
raw_data TEXT, -- Full extracted data as JSON
confidence FLOAT,
status TEXT, -- 'pending', 'validated', 'rejected'
created_at TIMESTAMP
);
-- Crawl results
CREATE TABLE crawl_results (
url TEXT PRIMARY KEY,
status_code INTEGER,
content_hash TEXT,
metadata TEXT, -- JSON-encoded page metadata
crawled_at TIMESTAMP
);
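Bootstrapping the extraction cache is straightforward with stdlib sqlite3; the sketch below creates two of the tables (abridged) and inserts a sample row:

```python
import sqlite3

# In-memory database for illustration; the pipeline would use a file path.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    created_at TIMESTAMP,
    title TEXT,
    country TEXT,
    region TEXT,
    content TEXT,
    processed BOOLEAN DEFAULT FALSE
);
CREATE TABLE citations (
    id INTEGER PRIMARY KEY,
    conversation_id TEXT,
    url TEXT,
    crawled BOOLEAN DEFAULT FALSE,
    FOREIGN KEY (conversation_id) REFERENCES conversations(id)
);
""")
conn.execute(
    "INSERT INTO conversations (id, title, country, content) VALUES (?, ?, ?, ?)",
    ("conv_001", "Dutch archives overview", "NL", "# Notes ..."),
)
```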
4.2 Output Storage
Purpose: Store final dataset in multiple formats
Formats:
- RDF Store: Apache Jena TDB or Oxigraph
  - Native RDF/SPARQL support
  - Linked data queries
- Document Store: JSON-LD files
  - One file per institution
  - GitHub-friendly for version control
- Analytical: Parquet + DuckDB
  - Efficient columnar storage
  - Fast analytical queries
- Relational: PostgreSQL (optional)
  - If needed for applications
  - Can be generated from RDF
5. Export & Publishing Layer
5.1 Export Pipeline
Components:
- RDFExporter: Export to Turtle/N-Triples/JSON-LD
- TabularExporter: Export to CSV/TSV/Excel
- SQLExporter: Generate SQL dumps
- ParquetExporter: Generate Parquet files
- StatisticsGenerator: Dataset statistics and reports
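The RDFExporter's JSON-LD output can be sketched with stdlib json. The @context and @type choices below are illustrative; the real export would derive them from the LinkML schema (heritage_custodian.yaml):

```python
import json

# Illustrative context, not the schema-derived one.
CONTEXT = {
    "@vocab": "http://schema.org/",
    "hc": "https://w3id.org/heritage-custodian/",
}

def to_jsonld(institution: dict) -> str:
    """Serialize one institution record as a standalone JSON-LD document."""
    doc = {"@context": CONTEXT, "@type": "Organization", **institution}
    return json.dumps(doc, indent=2, ensure_ascii=False)
```

One file per institution keeps the output diff-friendly for version control, matching the publishing layout below.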
5.2 Publishing Formats
dataset/
├── rdf/
│ ├── institutions.ttl # All institutions in Turtle
│ ├── institutions.nt # N-Triples format
│ └── jsonld/ # Individual JSON-LD files
│ ├── inst_001.jsonld
│ └── ...
├── tabular/
│ ├── institutions.csv # Flattened institution data
│ ├── collections.csv # Collections table
│ ├── digital_platforms.csv # Digital platforms table
│ └── data_dictionary.csv # Column descriptions
├── database/
│ ├── glam_dataset.db # SQLite database
│ ├── glam_dataset.duckdb # DuckDB database
│ └── schema.sql # SQL schema
├── analytics/
│ └── glam_dataset.parquet # Parquet format
├── metadata/
│ ├── dataset_metadata.yaml # Dataset description
│ ├── statistics.json # Dataset statistics
│ └── provenance.jsonld # PROV-O provenance
└── docs/
├── README.md # Dataset documentation
├── schema.html # Schema documentation
└── query_examples.md # SPARQL/SQL examples
Component Integration
Data Flow Architecture
┌─────────────────┐
│ Conversation │
│ JSON Files │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Conversation │ ──> [DuckDB Index]
│ Parser │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Pattern-based │ ──> Extract: ISIL, Wikidata, URLs
│ Extraction │ (IdentifierExtractor in main code)
└────────┬────────┘
│
▼
┌─────────────────┐
│ SUBAGENT │ ──> Launch subagents for NER
│ BOUNDARY │ (InstitutionExtractor, LocationExtractor)
└────────┬────────┘
│
▼
┌─────────────────┐
│ Coding │ ──> Subagents use spaCy/transformers
│ Subagents │ Return structured JSON results
└────────┬────────┘
│
▼
┌─────────────────┐
│ Web Crawler │ ──> [Crawl Results DB]
│ (crawl4ai) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Enrichment │ ──> [External APIs]
│ Services │ (Wikidata, VIAF)
└────────┬────────┘
│
▼
┌─────────────────┐
│ LinkML Mapper │ ──> [Validation Errors]
└────────┬────────┘
│
▼
┌─────────────────┐
│ Quality Check │ ──> [Review Queue]
└────────┬────────┘
│
▼
┌─────────────────┐
│ Exporters │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Published │
│ Dataset │
└─────────────────┘
Processing Modes
1. Batch Processing (Default)
- Process all 139 conversations sequentially
- Suitable for initial dataset creation
- Can be parallelized by conversation
2. Incremental Processing
- Process new/updated conversations only
- Track processing timestamps
- Merge with existing dataset
3. Targeted Processing
- Process specific conversations (by country, date, etc.)
- For dataset updates or testing
Technology Stack
Core Languages
- Python 3.11+: Main implementation language
- YAML: Schema definition (LinkML)
- SPARQL: RDF queries
Key Libraries
- Data Processing: pandas, polars (for tabular data)
- Pattern Matching: Python re module, rapidfuzz (fuzzy matching)
- Subagent Orchestration: Task tool (for NER and complex extraction)
- Web: crawl4ai, httpx, beautifulsoup4
- Semantic Web: linkml-runtime, rdflib, pyshacl
- Database: duckdb, sqlite3, sqlalchemy
- Validation: pydantic, jsonschema
- Export: pyarrow (Parquet), openpyxl (Excel)
Note: NLP libraries (spaCy, transformers, PyTorch) are NOT dependencies of the main application. They are used by coding subagents when needed for entity extraction. See docs/plan/global_glam/07-subagent-architecture.md for details.
Development Tools
- Package Management: Poetry
- Code Quality: ruff (linting), mypy (type checking)
- Testing: pytest, hypothesis
- Documentation: mkdocs, sphinx
- CI/CD: GitHub Actions
Scalability Considerations
Performance Optimizations
- Parallel Processing: Process conversations concurrently
- Caching: Cache web crawl results and API calls
- Incremental Processing: Only process changed data
- Efficient Storage: Use columnar formats (Parquet) for analytics
Resource Management
- Memory: Stream large files, avoid loading all data in memory
- Disk: Compress intermediate results
- Network: Rate limiting for web crawling, batch API calls
Security & Ethics
Web Crawling
- Respect robots.txt
- Implement rate limiting
- User-agent identification
- No aggressive crawling
Data Privacy
- No personal data extraction
- Public information only
- Comply with institutional terms of service
Licensing
- Clear dataset license (CC0, CC-BY, or ODbL)
- Respect source data licenses
- Attribution to source institutions
Monitoring & Observability
Logging
- Structured logging (JSON format)
- Log levels: DEBUG, INFO, WARNING, ERROR
- Per-component logging
Metrics
- Extraction success rates
- Web crawl success rates
- Validation pass/fail rates
- Processing time per conversation
- Dataset statistics over time
Error Handling
- Graceful degradation
- Retry logic for transient failures
- Error categorization and reporting
- Manual review queue for edge cases
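Retry logic for transient failures can be sketched as exponential backoff over a configurable set of retriable exceptions (the defaults below are illustrative):

```python
import time

def retry(func, attempts: int = 3, base_delay: float = 0.5,
          retriable=(TimeoutError, ConnectionError)):
    """Call `func`, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))
```

Non-retriable errors propagate immediately and, per the pipeline design, land in the manual review queue rather than being silently swallowed.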
Deployment Architecture
Local Development
poetry install
poetry run glam-extract --input data/conversations/ --output data/dataset/
Docker Container (Future)
FROM python:3.11-slim
RUN pip install poetry
COPY . /app
WORKDIR /app
RUN poetry install --only main
ENTRYPOINT ["poetry", "run", "glam-extract"]
CI/CD Pipeline
- GitHub Actions: Run on push/PR
- Tests: Unit tests, integration tests
- Quality Gates: Linting, type checking, coverage
- Dataset Build: Weekly automated builds
- Publication: Auto-publish to GitHub Releases
Future Enhancements
Phase 2 Features
- Web UI for manual review and curation
- SPARQL endpoint for queries
- REST API for dataset access
- Automated duplicate detection across conversations
- Multi-language support (NLP for non-English content)
- Machine learning for institution type classification
- Automated monitoring of institution URLs (link rot detection)
Integration Opportunities
- Wikidata integration (read/write)
- National heritage registries
- OpenStreetMap for geocoding
- IIIF for image collections
- OAI-PMH for metadata harvesting
Related Documentation
- Subagent Architecture: docs/plan/global_glam/07-subagent-architecture.md - Detailed explanation of subagent-based NER approach
- Design Patterns: docs/plan/global_glam/05-design-patterns.md - Code patterns and best practices
- Data Standardization: docs/plan/global_glam/04-data-standardization.md - Data tier system and provenance
- Dependencies: docs/plan/global_glam/03-dependencies.md - Technology stack and library choices
- LinkML Schema: schemas/heritage_custodian.yaml - Data model definition
- Agent Instructions: AGENTS.md - Instructions for AI agents working on this project