# Global GLAM Dataset: System Architecture

## Overview

This document describes the technical architecture for extracting, enriching, and publishing a comprehensive global GLAM (Galleries, Libraries, Archives, Museums) dataset from Claude conversation files.

## System Context Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                 Global GLAM Extraction System                    │
│                                                                   │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐    │
│  │ Conversation │ ───> │  Extraction  │ ───> │    LinkML    │    │
│  │   Parsers    │      │   Pipeline   │      │  Instances   │    │
│  └──────────────┘      └──────────────┘      └──────────────┘    │
│         │                     │                     │             │
│         ▼                     ▼                     ▼             │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐    │
│  │     Text     │      │ Web Crawler  │      │    Export    │    │
│  │    Corpus    │      │  (crawl4ai)  │      │   Formats    │    │
│  └──────────────┘      └──────────────┘      └──────────────┘    │
└─────────────────────────────────────────────────────────────────┘
         │                     │                     │
         ▼                     ▼                     ▼
  [139 JSON Files]       [Web Sources]      [Published Dataset]
```

## High-Level Architecture

### Extraction Architecture: Hybrid Approach

This system uses a **hybrid extraction architecture** combining pattern matching and subagent-based NER:

**Pattern Matching (Main Code)**:
- Extract structured identifiers: ISIL codes, Wikidata IDs, VIAF IDs, KvK numbers
- URL extraction and validation
- Fast, deterministic, no dependencies on NLP libraries
- Implemented in: `src/glam_extractor/extractors/identifiers.py`

**Subagent-Based NER**:
- Extract unstructured entities: institution names, locations, relationships
- Coding subagents autonomously choose NER tools (spaCy, transformers, GPT-4, etc.)
- Main application stays lightweight (no PyTorch, spaCy, or transformers dependencies)
- Flexible: extraction methods can be swapped without changing main code
- Implemented via: Task tool invocation in extractors

**Rationale**: See `docs/plan/global_glam/07-subagent-architecture.md` for the detailed architectural decision record.

### 1. Data Ingestion Layer

#### 1.1 Conversation Parser

**Purpose**: Extract structured content from Claude conversation JSON files

**Components**:
- `ConversationReader`: Read and validate JSON files
- `MessageExtractor`: Extract message content and metadata
- `CitationExtractor`: Parse URLs and references
- `ConversationIndexer`: Create searchable index of conversations

**Inputs**:
- Claude conversation JSON files (139 files covering global GLAM research)

**Outputs**:
- Structured conversation objects
- Text corpus (markdown content)
- Citation database (URLs, references)
- Conversation metadata index

**Technology**:
- Python `json` module for parsing
- `pydantic` for data validation
- DuckDB or SQLite for indexing

#### 1.2 Data Model

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Conversation:
    id: str
    created_at: datetime
    title: str
    messages: List[Message]
    citations: List[Citation]
    metadata: ConversationMetadata


@dataclass
class Message:
    role: str  # 'user' or 'assistant'
    content: str  # Markdown text
    timestamp: datetime


@dataclass
class Citation:
    url: str
    title: Optional[str]
    context: str  # Surrounding text


@dataclass
class ConversationMetadata:
    country: Optional[str]
    region: Optional[str]
    glam_types: List[str]  # ['museum', 'archive', 'library', 'gallery']
    languages: List[str]
```
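The parser components in 1.1 populate these dataclasses. As a rough illustration, a minimal `ConversationReader`-style function might look like the sketch below. It reuses the dataclasses above; the JSON field names (`uuid`, `name`, `chat_messages`, `sender`, `text`, `created_at`) are assumptions about the export format, not confirmed structure.

```python
import json
from datetime import datetime
from pathlib import Path


def read_conversation(path: Path) -> Conversation:
    """Minimal sketch of a conversation reader (field names are assumed)."""
    raw = json.loads(path.read_text(encoding="utf-8"))
    messages = [
        Message(
            role=m.get("sender", "unknown"),
            content=m.get("text", ""),
            # Assumes ISO 8601 timestamps; Python 3.11+ accepts a trailing "Z".
            timestamp=datetime.fromisoformat(m["created_at"]),
        )
        for m in raw.get("chat_messages", [])
    ]
    return Conversation(
        id=raw["uuid"],
        created_at=datetime.fromisoformat(raw["created_at"]),
        title=raw.get("name", ""),
        messages=messages,
        citations=[],  # filled in later by CitationExtractor
        metadata=ConversationMetadata(
            country=None, region=None, glam_types=[], languages=[]
        ),
    )
```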
### 2. Extraction & Processing Layer

#### 2.1 NLP Extraction Pipeline

**Purpose**: Extract heritage institutions and attributes using pattern matching and subagent-based NER

**Architecture**: This pipeline uses a **hybrid approach**:
- **Pattern matching** (in main code) for identifiers and structured data
- **Coding subagents** (via Task tool) for Named Entity Recognition
- See `docs/plan/global_glam/07-subagent-architecture.md` for detailed rationale

**Components**:

##### Entity Extractor (Main Code - Pattern Matching)
- **IdentifierExtractor**: Extract ISIL codes, Wikidata IDs, VIAF IDs, KvK numbers using regex patterns
- **URLExtractor**: URL pattern matching and validation

##### Entity Extractor (Subagent-Based NER)
- **InstitutionExtractor**: Launch subagents to extract institution names using NLP/NER
- **LocationExtractor**: Launch subagents for geographic entity extraction
- **RelationshipExtractor**: Launch subagents to extract organizational relationships

##### Attribute Extractor (Subagent-Based)
- **TypeClassifier**: Subagents classify institution type (museum, archive, etc.)
- **CollectionExtractor**: Subagents extract collection subjects/themes
- **StandardsExtractor**: Subagents identify metadata/preservation standards used

##### Data Model

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ExtractedInstitution:
    name: str
    name_variants: List[str]
    institution_type: List[str]  # Can be multiple

    # Location
    country: Optional[str]
    region: Optional[str]
    city: Optional[str]
    address: Optional[str]
    coordinates: Optional[Tuple[float, float]]

    # Digital presence
    urls: List[str]
    repositories: List[DigitalRepository]

    # Identifiers
    isil_code: Optional[str]
    wikidata_id: Optional[str]
    national_ids: Dict[str, str]

    # Collections
    collection_subjects: List[str]
    collection_formats: List[str]

    # Technical
    metadata_standards: List[str]
    preservation_standards: List[str]
    access_protocols: List[str]

    # Organizational
    parent_organization: Optional[str]
    consortia: List[str]
    partnerships: List[str]

    # Provenance
    source_conversations: List[str]
    extraction_confidence: float
    extraction_method: str


@dataclass
class DigitalRepository:
    url: str
    platform: Optional[str]  # e.g., "DSpace", "Omeka", "CollectiveAccess"
    repository_type: str  # e.g., "institutional repository", "catalog", "portal"
    access_type: str  # "open", "restricted", "mixed"
```

**Technology**:
- **Pattern Matching** (main code): Python `re` module, `rapidfuzz` for fuzzy matching
- **NER & Extraction** (subagents): Coding subagents choose appropriate tools (spaCy, transformers, etc.)
- **Task Orchestration**: Task tool for subagent invocation
- **Text Processing**: `langdetect` for language identification (main code)
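The pattern-matching half of this pipeline needs nothing beyond the standard library and can be sketched directly. The regexes below are illustrative assumptions (real ISIL, VIAF, and KvK values have more edge cases) and are not the project's actual patterns from `identifiers.py`.

```python
import re
from typing import Dict, List

# Illustrative patterns only; real identifier formats have more edge cases.
PATTERNS: Dict[str, re.Pattern] = {
    "wikidata_id": re.compile(r"\bQ\d{1,10}\b"),
    "isil_code": re.compile(r"\b[A-Z]{1,4}-[A-Za-z0-9/:\-]{1,11}\b"),
    "viaf_id": re.compile(r"viaf\.org/viaf/(\d+)"),
    "kvk_number": re.compile(r"\b\d{8}\b"),  # Dutch KvK numbers are 8 digits
    "url": re.compile(r"https?://[^\s<>\"')\]]+"),
}


def extract_identifiers(text: str) -> Dict[str, List[str]]:
    """Return de-duplicated matches per identifier type for one text chunk."""
    results: Dict[str, List[str]] = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            results[label] = sorted(set(matches))
    return results
```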
#### 2.2 Web Crawling & Enrichment

**Purpose**: Fetch and extract data from URLs cited in conversations

**Components**:

##### Web Crawler (crawl4ai)
- **URLValidator**: Check URL availability and redirects
- **PageFetcher**: Async crawling with rate limiting
- **MetadataExtractor**: Extract structured metadata from pages
- **ContentExtractor**: Platform-specific content extraction

##### Enrichment Services
- **GeocodingService**: Resolve addresses to coordinates (Nominatim)
- **WikidataLinker**: Link institutions to Wikidata entities
- **VIAFLinker**: Link to VIAF authority records
- **RegistryChecker**: Verify against national/international registries

**Technology**:
- **crawl4ai**: Async web crawling with LLM extraction
- **httpx**: Modern async HTTP client
- **beautifulsoup4**: HTML parsing
- **lxml**: XML/HTML processing
- **geopy**: Geocoding
- **SPARQLWrapper**: Query Wikidata/VIAF

**Data Flow**:

```
Citations → URL Validation → Crawl Queue
                   ↓
         crawl4ai Async Crawler
                   ↓
         ┌─────────┴─────────┐
         ▼                   ▼
  Metadata Extract    Content Extract
         │                   │
         └─────────┬─────────┘
                   ▼
          Enrichment Pipeline
                   ▼
      Enriched Institution Data
```

### 3. Schema & Validation Layer

#### 3.1 LinkML Schema Definition

**Purpose**: Define comprehensive heritage custodian ontology

**Schema Structure**:

```yaml
# heritage_custodian.yaml (simplified)
id: https://w3id.org/heritage-custodian
name: heritage-custodian-schema
title: Heritage Custodian Ontology

prefixes:
  hc: https://w3id.org/heritage-custodian/
  schema: http://schema.org/
  dcterms: http://purl.org/dc/terms/
  skos: http://www.w3.org/2004/02/skos/core#
  isil: http://id.loc.gov/vocabulary/organizations/
  rico: https://www.ica.org/standards/RiC/ontology#

imports:
  - linkml:types
  - schema_org_subset
  - cpoc_subset
  - tooi_subset

classes:
  HeritageCustodian:
    is_a: schema:Organization
    mixins:
      - CPOCPublicOrganization
      - TOOIOrganization
    slots:
      - name
      - institution_types
      - geographic_coverage
      - digital_platforms
      - collections
      - identifiers

  DigitalPlatform:
    is_a: schema:WebSite
    slots:
      - platform_type
      - software_platform
      - access_protocol
      - metadata_standard

  Collection:
    slots:
      - collection_name
      - subjects
      - formats
      - temporal_coverage
      - rights_statements

slots:
  institution_types:
    multivalued: true
    range: InstitutionTypeEnum
  identifiers:
    multivalued: true
    range: Identifier

enums:
  InstitutionTypeEnum:
    permissible_values:
      ARCHIVE:
        meaning: rico:RecordCreator
      LIBRARY:
        meaning: schema:Library
      MUSEUM:
        meaning: schema:Museum
      GALLERY:
        meaning: schema:ArtGallery
```

**Modular Design**:
- `core/` - Core classes (HeritageCustodian, Collection)
- `mixins/` - Reusable components (TOOI, CPOC, Schema.org)
- `identifiers/` - Identifier systems (ISIL, Wikidata, etc.)
- `standards/` - Domain standards (RiC-O, BIBFRAME, LIDO)
- `enums/` - Controlled vocabularies

#### 3.2 Validation & Mapping

**Components**:
- **SchemaValidator**: Validate instances against LinkML schema
- **EntityMapper**: Map extracted data to LinkML classes
- **IRIGenerator**: Generate unique IRIs for instances
- **ProvenanceTracker**: Track extraction provenance

**Technology**:
- **linkml-runtime**: Schema validation and instance generation
- **linkml**: Schema development tools
- **rdflib**: RDF graph manipulation
- **jsonschema**: JSON validation
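One way to picture the `SchemaValidator` component: generate a JSON Schema artifact from the LinkML schema (LinkML ships a `gen-json-schema` generator for this) and check candidate records with the `jsonschema` library. A minimal sketch, where the schema path and example record are illustrative assumptions:

```python
import json
from pathlib import Path

from jsonschema import Draft7Validator

# Assumes a JSON Schema was generated beforehand, e.g.:
#   gen-json-schema schemas/heritage_custodian.yaml > schemas/heritage_custodian.schema.json
schema = json.loads(Path("schemas/heritage_custodian.schema.json").read_text())

candidate = {
    "name": "Example City Museum",  # illustrative record, not real data
    "institution_types": ["MUSEUM"],
    "identifiers": [],
}

validator = Draft7Validator(schema)
for err in validator.iter_errors(candidate):
    # Report the offending path and message; an empty loop means the record passed.
    print(f"{list(err.path)}: {err.message}")
```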
### 4. Storage Layer

#### 4.1 Intermediate Storage

**Purpose**: Store processed data during extraction pipeline

**Databases**:
- **Conversation Index**: DuckDB
  - Fast analytical queries
  - Full-text search on conversation content
  - Lightweight, embedded database
- **Extraction Cache**: SQLite
  - Store intermediate extraction results
  - Cache web crawling results
  - Track processing status

**Schema**:

```sql
-- Conversations
CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    created_at TIMESTAMP,
    title TEXT,
    country TEXT,
    region TEXT,
    content TEXT,  -- Full markdown content
    processed BOOLEAN DEFAULT FALSE
);

CREATE TABLE citations (
    id INTEGER PRIMARY KEY,
    conversation_id TEXT,
    url TEXT,
    title TEXT,
    context TEXT,
    crawled BOOLEAN DEFAULT FALSE,
    crawl_status TEXT,
    FOREIGN KEY (conversation_id) REFERENCES conversations(id)
);

-- Extracted entities
CREATE TABLE extracted_institutions (
    id TEXT PRIMARY KEY,
    name TEXT,
    institution_type TEXT[],
    country TEXT,
    raw_data JSON,  -- Full extracted data
    confidence FLOAT,
    status TEXT,  -- 'pending', 'validated', 'rejected'
    created_at TIMESTAMP
);

-- Crawl results
CREATE TABLE crawl_results (
    url TEXT PRIMARY KEY,
    status_code INTEGER,
    content_hash TEXT,
    metadata JSON,
    crawled_at TIMESTAMP
);
```

#### 4.2 Output Storage

**Purpose**: Store final dataset in multiple formats

**Formats**:
- **RDF Store**: Apache Jena TDB or Oxigraph
  - Native RDF/SPARQL support
  - Linked data queries
- **Document Store**: JSON-LD files
  - One file per institution
  - GitHub-friendly for version control
- **Analytical**: Parquet + DuckDB
  - Efficient columnar storage
  - Fast analytical queries
- **Relational**: PostgreSQL (optional)
  - If needed for applications
  - Can be generated from RDF
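As an illustration of the intermediate storage in use, the sketch below opens the DuckDB conversation index and runs a small analytical query against the `conversations` table defined above. The database path is an assumption, and the table is assumed to have been created already.

```python
import duckdb

# Open (or create) the embedded conversation index; file name is illustrative.
con = duckdb.connect("data/intermediate/conversations.duckdb")

# Example analytical query: unprocessed conversations per country.
rows = con.execute(
    """
    SELECT country, COUNT(*) AS n
    FROM conversations
    WHERE processed = FALSE
    GROUP BY country
    ORDER BY n DESC
    """
).fetchall()

for country, n in rows:
    print(f"{country or 'unknown'}: {n} conversations pending")

con.close()
```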
### 5. Export & Publishing Layer

#### 5.1 Export Pipeline

**Components**:
- **RDFExporter**: Export to Turtle/N-Triples/JSON-LD
- **TabularExporter**: Export to CSV/TSV/Excel
- **SQLExporter**: Generate SQL dumps
- **ParquetExporter**: Generate Parquet files
- **StatisticsGenerator**: Dataset statistics and reports

#### 5.2 Publishing Formats

```
dataset/
├── rdf/
│   ├── institutions.ttl          # All institutions in Turtle
│   ├── institutions.nt           # N-Triples format
│   └── jsonld/                   # Individual JSON-LD files
│       ├── inst_001.jsonld
│       └── ...
├── tabular/
│   ├── institutions.csv          # Flattened institution data
│   ├── collections.csv           # Collections table
│   ├── digital_platforms.csv     # Digital platforms table
│   └── data_dictionary.csv       # Column descriptions
├── database/
│   ├── glam_dataset.db           # SQLite database
│   ├── glam_dataset.duckdb       # DuckDB database
│   └── schema.sql                # SQL schema
├── analytics/
│   └── glam_dataset.parquet      # Parquet format
├── metadata/
│   ├── dataset_metadata.yaml     # Dataset description
│   ├── statistics.json           # Dataset statistics
│   └── provenance.jsonld         # PROV-O provenance
└── docs/
    ├── README.md                 # Dataset documentation
    ├── schema.html               # Schema documentation
    └── query_examples.md         # SPARQL/SQL examples
```

## Component Integration

### Data Flow Architecture

```
┌─────────────────┐
│  Conversation   │
│   JSON Files    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Conversation   │ ──> [DuckDB Index]
│     Parser      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Pattern-based  │ ──> Extract: ISIL, Wikidata, URLs
│   Extraction    │     (IdentifierExtractor in main code)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    SUBAGENT     │ ──> Launch subagents for NER
│    BOUNDARY     │     (InstitutionExtractor, LocationExtractor)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Coding      │ ──> Subagents use spaCy/transformers
│    Subagents    │     Return structured JSON results
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Web Crawler   │ ──> [Crawl Results DB]
│   (crawl4ai)    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Enrichment    │ ──> [External APIs]
│    Services     │     (Wikidata, VIAF)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  LinkML Mapper  │ ──> [Validation Errors]
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Quality Check  │ ──> [Review Queue]
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Exporters    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Published    │
│     Dataset     │
└─────────────────┘
```

### Processing Modes

#### 1. Batch Processing (Default)
- Process all 139 conversations sequentially
- Suitable for initial dataset creation
- Can be parallelized by conversation

#### 2. Incremental Processing
- Process new/updated conversations only
- Track processing timestamps
- Merge with existing dataset

#### 3. Targeted Processing
- Process specific conversations (by country, date, etc.)
- For dataset updates or testing

## Technology Stack

### Core Languages
- **Python 3.11+**: Main implementation language
- **YAML**: Schema definition (LinkML)
- **SPARQL**: RDF queries

### Key Libraries
- **Data Processing**: pandas, polars (for tabular data)
- **Pattern Matching**: Python `re` module, `rapidfuzz` (fuzzy matching)
- **Subagent Orchestration**: Task tool (for NER and complex extraction)
- **Web**: crawl4ai, httpx, beautifulsoup4
- **Semantic Web**: linkml-runtime, rdflib, pyshacl
- **Database**: duckdb, sqlite3, sqlalchemy
- **Validation**: pydantic, jsonschema
- **Export**: pyarrow (Parquet), openpyxl (Excel)

**Note**: NLP libraries (spaCy, transformers, PyTorch) are NOT dependencies of the main application. They are used by coding subagents when needed for entity extraction. See `docs/plan/global_glam/07-subagent-architecture.md` for details.

### Development Tools
- **Package Management**: Poetry
- **Code Quality**: ruff (linting), mypy (type checking)
- **Testing**: pytest, hypothesis
- **Documentation**: mkdocs, sphinx
- **CI/CD**: GitHub Actions

## Scalability Considerations

### Performance Optimizations

1. **Parallel Processing**: Process conversations concurrently (see the sketch below)
2. **Caching**: Cache web crawl results and API calls
3. **Incremental Processing**: Only process changed data
4. **Efficient Storage**: Use columnar formats (Parquet) for analytics
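Item 1 needs nothing beyond the standard library. A minimal sketch, assuming a hypothetical `process_conversation` function that runs the per-conversation extraction pipeline:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path


def process_conversation(path: Path) -> str:
    """Placeholder for the per-conversation extraction pipeline (hypothetical)."""
    # parse -> extract identifiers -> queue subagent NER -> stage results
    return path.stem


def run_batch(input_dir: Path, workers: int = 4) -> None:
    paths = sorted(input_dir.glob("*.json"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_conversation, p): p for p in paths}
        for fut in as_completed(futures):
            src = futures[fut]
            try:
                print(f"done: {fut.result()}")
            except Exception as exc:  # route failures to the manual review queue
                print(f"failed: {src.name}: {exc}")


if __name__ == "__main__":
    run_batch(Path("data/conversations"))
```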
### Resource Management

- **Memory**: Stream large files, avoid loading all data in memory
- **Disk**: Compress intermediate results
- **Network**: Rate limiting for web crawling, batch API calls

## Security & Ethics

### Web Crawling
- Respect robots.txt
- Implement rate limiting
- User-agent identification
- No aggressive crawling

### Data Privacy
- No personal data extraction
- Public information only
- Comply with institutional terms of service

### Licensing
- Clear dataset license (CC0, CC-BY, or ODbL)
- Respect source data licenses
- Attribution to source institutions

## Monitoring & Observability

### Logging
- Structured logging (JSON format)
- Log levels: DEBUG, INFO, WARNING, ERROR
- Per-component logging

### Metrics
- Extraction success rates
- Web crawl success rates
- Validation pass/fail rates
- Processing time per conversation
- Dataset statistics over time

### Error Handling
- Graceful degradation
- Retry logic for transient failures
- Error categorization and reporting
- Manual review queue for edge cases

## Deployment Architecture

### Local Development

```
poetry install
poetry run glam-extract --input data/conversations/ --output data/dataset/
```

### Docker Container (Future)

```dockerfile
FROM python:3.11-slim
RUN pip install poetry
COPY . /app
WORKDIR /app
RUN poetry install --only main
ENTRYPOINT ["poetry", "run", "glam-extract"]
```

### CI/CD Pipeline
- **GitHub Actions**: Run on push/PR
- **Tests**: Unit tests, integration tests
- **Quality Gates**: Linting, type checking, coverage
- **Dataset Build**: Weekly automated builds
- **Publication**: Auto-publish to GitHub Releases

## Future Enhancements

### Phase 2 Features
- Web UI for manual review and curation
- SPARQL endpoint for queries
- REST API for dataset access
- Automated duplicate detection across conversations
- Multi-language support (NLP for non-English content)
- Machine learning for institution type classification
- Automated monitoring of institution URLs (link rot detection)

### Integration Opportunities
- Wikidata integration (read/write)
- National heritage registries
- OpenStreetMap for geocoding
- IIIF for image collections
- OAI-PMH for metadata harvesting

## Related Documentation

- **Subagent Architecture**: `docs/plan/global_glam/07-subagent-architecture.md` - Detailed explanation of the subagent-based NER approach
- **Design Patterns**: `docs/plan/global_glam/05-design-patterns.md` - Code patterns and best practices
- **Data Standardization**: `docs/plan/global_glam/04-data-standardization.md` - Data tier system and provenance
- **Dependencies**: `docs/plan/global_glam/03-dependencies.md` - Technology stack and library choices
- **LinkML Schema**: `schemas/heritage_custodian.yaml` - Data model definition
- **Agent Instructions**: `AGENTS.md` - Instructions for AI agents working on this project