Global GLAM Dataset: System Architecture

Overview

This document describes the technical architecture for extracting, enriching, and publishing a comprehensive global GLAM (Galleries, Libraries, Archives, Museums) dataset from Claude conversation files.

System Context Diagram

┌─────────────────────────────────────────────────────────────────┐
│                     Global GLAM Extraction System                │
│                                                                   │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐  │
│  │ Conversation │ ───> │  Extraction  │ ───> │   LinkML     │  │
│  │   Parsers    │      │   Pipeline   │      │  Instances   │  │
│  └──────────────┘      └──────────────┘      └──────────────┘  │
│         │                      │                      │          │
│         ▼                      ▼                      ▼          │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐  │
│  │   Text       │      │  Web Crawler │      │   Export     │  │
│  │   Corpus     │      │  (crawl4ai)  │      │   Formats    │  │
│  └──────────────┘      └──────────────┘      └──────────────┘  │
└─────────────────────────────────────────────────────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
  [139 JSON Files]        [Web Sources]        [Published Dataset]

High-Level Architecture

Extraction Architecture: Hybrid Approach

This system uses a hybrid extraction architecture combining pattern matching and subagent-based NER:

Pattern Matching (Main Code):

  • Extract structured identifiers: ISIL codes, Wikidata IDs, VIAF IDs, KvK numbers
  • URL extraction and validation
  • Fast, deterministic, no dependencies on NLP libraries
  • Implemented in: src/glam_extractor/extractors/identifiers.py
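
A minimal sketch of what these identifier patterns might look like. The regexes below are illustrative simplifications for this document, not the exact patterns used in identifiers.py:

import re

# Illustrative patterns; the production patterns in identifiers.py may be stricter.
IDENTIFIER_PATTERNS = {
    # ISIL: country/agency prefix, hyphen, local identifier (ISO 15511)
    "isil": re.compile(r"\b[A-Z]{1,4}-[A-Za-z0-9:/\-]{1,11}\b"),
    # Wikidata item IDs, e.g. Q190804
    "wikidata": re.compile(r"\bQ\d{1,10}\b"),
    # VIAF numeric identifiers, usually cited as viaf.org URLs
    "viaf": re.compile(r"viaf\.org/viaf/(\d+)"),
    # Dutch KvK (Chamber of Commerce) numbers: 8 digits
    "kvk": re.compile(r"\bKvK[:\s-]*(\d{8})\b", re.IGNORECASE),
}

def extract_identifiers(text: str) -> dict[str, list[str]]:
    """Return all identifier matches found in the text, grouped per identifier system."""
    return {name: pattern.findall(text) for name, pattern in IDENTIFIER_PATTERNS.items()}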

Subagent-Based NER:

  • Extract unstructured entities: institution names, locations, relationships
  • Coding subagents autonomously choose NER tools (spaCy, transformers, GPT-4, etc.)
  • Main application stays lightweight (no PyTorch, spaCy, transformers dependencies)
  • Flexible: can swap extraction methods without changing main code
  • Implemented via: Task tool invocation in extractors

Rationale: See docs/plan/global_glam/07-subagent-architecture.md for the detailed architectural decision record.

1. Data Ingestion Layer

1.1 Conversation Parser

Purpose: Extract structured content from Claude conversation JSON files

Components:

  • ConversationReader: Read and validate JSON files
  • MessageExtractor: Extract message content and metadata
  • CitationExtractor: Parse URLs and references
  • ConversationIndexer: Create searchable index of conversations

Inputs:

  • Claude conversation JSON files (139 files covering global GLAM research)

Outputs:

  • Structured conversation objects
  • Text corpus (markdown content)
  • Citation database (URLs, references)
  • Conversation metadata index

Technology:

  • Python json module for parsing
  • pydantic for data validation
  • DuckDB or SQLite for indexing
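
A hedged sketch of the ConversationReader. The top-level keys assumed here (uuid, name, chat_messages) are placeholders; the real Claude export format may use different field names and should be checked against the actual files:

import json
from pathlib import Path
from typing import Iterator

from pydantic import BaseModel, ValidationError

class RawConversation(BaseModel):
    # Assumed top-level keys; adjust to the real export structure.
    uuid: str
    name: str | None = None
    chat_messages: list[dict] = []

def read_conversations(directory: Path) -> Iterator[RawConversation]:
    """Read and validate each conversation JSON file, skipping invalid ones."""
    for path in sorted(directory.glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            yield RawConversation.model_validate(data)
        except (json.JSONDecodeError, ValidationError) as exc:
            # Invalid files are logged and skipped rather than aborting the whole run.
            print(f"Skipping {path.name}: {exc}")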

1.2 Data Model

from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class Conversation:
    id: str
    created_at: datetime
    title: str
    messages: List[Message]
    citations: List[Citation]
    metadata: ConversationMetadata

@dataclass
class Message:
    role: str  # 'user' or 'assistant'
    content: str  # Markdown text
    timestamp: datetime
    
@dataclass
class Citation:
    url: str
    title: Optional[str]
    context: str  # Surrounding text
    
@dataclass
class ConversationMetadata:
    country: Optional[str]
    region: Optional[str]
    glam_types: List[str]  # ['museum', 'archive', 'library', 'gallery']
    languages: List[str]

2. Extraction & Processing Layer

2.1 NLP Extraction Pipeline

Purpose: Extract heritage institutions and attributes using pattern matching and subagent-based NER

Architecture: This pipeline uses a hybrid approach:

  • Pattern matching (in main code) for identifiers and structured data
  • Coding subagents (via Task tool) for Named Entity Recognition
  • See docs/plan/global_glam/07-subagent-architecture.md for detailed rationale

Components:

Entity Extractor (Main Code - Pattern Matching)
  • IdentifierExtractor: Extract ISIL codes, Wikidata IDs, VIAF IDs, KvK numbers using regex patterns
  • URLExtractor: URL pattern matching and validation
Entity Extractor (Subagent-Based NER)
  • InstitutionExtractor: Launch subagents to extract institution names using NLP/NER
  • LocationExtractor: Launch subagents for geographic entity extraction
  • RelationshipExtractor: Launch subagents to extract organizational relationships
Attribute Extractor (Subagent-Based)
  • TypeClassifier: Subagents classify institution type (museum, archive, etc.)
  • CollectionExtractor: Subagents extract collection subjects/themes
  • StandardsExtractor: Subagents identify metadata/preservation standards used
Data Model
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class ExtractedInstitution:
    name: str
    name_variants: List[str]
    institution_type: List[str]  # Can be multiple
    
    # Location
    country: Optional[str]
    region: Optional[str]
    city: Optional[str]
    address: Optional[str]
    coordinates: Optional[Tuple[float, float]]
    
    # Digital presence
    urls: List[str]
    repositories: List[DigitalRepository]
    
    # Identifiers
    isil_code: Optional[str]
    wikidata_id: Optional[str]
    national_ids: Dict[str, str]
    
    # Collections
    collection_subjects: List[str]
    collection_formats: List[str]
    
    # Technical
    metadata_standards: List[str]
    preservation_standards: List[str]
    access_protocols: List[str]
    
    # Organizational
    parent_organization: Optional[str]
    consortia: List[str]
    partnerships: List[str]
    
    # Provenance
    source_conversations: List[str]
    extraction_confidence: float
    extraction_method: str
    
@dataclass
class DigitalRepository:
    url: str
    platform: Optional[str]  # e.g., "DSpace", "Omeka", "CollectiveAccess"
    repository_type: str  # e.g., "institutional repository", "catalog", "portal"
    access_type: str  # "open", "restricted", "mixed"

Technology:

  • Pattern Matching (main code): Python re module, rapidfuzz for fuzzy matching
  • NER & Extraction (subagents): Coding subagents choose appropriate tools (spaCy, transformers, etc.)
  • Task Orchestration: Task tool for subagent invocation
  • Text Processing: langdetect for language identification (main code)
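
The exact contract between the main code and coding subagents is specified in 07-subagent-architecture.md. Purely as a sketch, the main-code side might validate a subagent's returned JSON along these lines (field names mirror ExtractedInstitution above; the function name is hypothetical):

import json

REQUIRED_FIELDS = {"name", "institution_type", "source_conversations", "extraction_confidence"}

def parse_subagent_result(raw: str) -> list[dict]:
    """Parse the JSON a NER subagent returns and keep only well-formed records."""
    try:
        records = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed output is routed to the manual review queue instead
    valid = []
    for record in records if isinstance(records, list) else []:
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            continue  # incomplete record; could also be logged for review
        record["extraction_method"] = "subagent_ner"
        valid.append(record)
    return valid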

2.2 Web Crawling & Enrichment

Purpose: Fetch and extract data from URLs cited in conversations

Components:

Web Crawler (crawl4ai)
  • URLValidator: Check URL availability and redirects
  • PageFetcher: Async crawling with rate limiting
  • MetadataExtractor: Extract structured metadata from pages
  • ContentExtractor: Platform-specific content extraction
Enrichment Services
  • GeocodingService: Resolve addresses to coordinates (Nominatim)
  • WikidataLinker: Link institutions to Wikidata entities
  • VIAFLinker: Link to VIAF authority records
  • RegistryChecker: Verify against national/international registries

Technology:

  • crawl4ai: Async web crawling with LLM extraction
  • httpx: Modern async HTTP client
  • beautifulsoup4: HTML parsing
  • lxml: XML/HTML processing
  • geopy: Geocoding
  • SPARQLWrapper: Query Wikidata/VIAF
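
As an illustration of the URLValidator with basic rate limiting, a sketch using httpx (the actual crawling sits behind crawl4ai; the User-Agent string is a placeholder):

import asyncio
import httpx

async def validate_urls(urls: list[str], max_concurrent: int = 5) -> dict[str, int | None]:
    """Check availability of cited URLs with a bounded number of concurrent requests."""
    semaphore = asyncio.Semaphore(max_concurrent)
    results: dict[str, int | None] = {}

    async def check(client: httpx.AsyncClient, url: str) -> None:
        async with semaphore:
            try:
                response = await client.head(url, follow_redirects=True, timeout=10.0)
                results[url] = response.status_code
            except httpx.HTTPError:
                results[url] = None  # unreachable; flag for the review queue

    headers = {"User-Agent": "glam-extractor (research project; see repository README)"}
    async with httpx.AsyncClient(headers=headers) as client:
        await asyncio.gather(*(check(client, url) for url in urls))
    return results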

Data Flow:

Citations → URL Validation → Crawl Queue
                              ↓
                        crawl4ai Async Crawler
                              ↓
                    ┌─────────┴─────────┐
                    ▼                   ▼
              Metadata Extract    Content Extract
                    │                   │
                    └─────────┬─────────┘
                              ▼
                      Enrichment Pipeline
                              ▼
                      Enriched Institution Data
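
A sketch of how the WikidataLinker might look up candidate entities by label via SPARQL. The helper name and exact-label query are illustrative; real linking would need escaping, fuzzier matching, and filtering by institution type:

from SPARQLWrapper import SPARQLWrapper, JSON

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def wikidata_candidates(label: str, language: str = "en", limit: int = 5) -> list[str]:
    """Return Wikidata QIDs whose label exactly matches the institution name."""
    sparql = SPARQLWrapper(WDQS_ENDPOINT, agent="glam-extractor (research project)")
    sparql.setReturnFormat(JSON)
    # Note: the label should be escaped before interpolation in production code.
    sparql.setQuery(f"""
        SELECT ?item WHERE {{
          ?item rdfs:label "{label}"@{language} .
        }} LIMIT {limit}
    """)
    bindings = sparql.query().convert()["results"]["bindings"]
    return [b["item"]["value"].rsplit("/", 1)[-1] for b in bindings]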

3. Schema & Validation Layer

3.1 LinkML Schema Definition

Purpose: Define comprehensive heritage custodian ontology

Schema Structure:

# heritage_custodian.yaml (simplified)
id: https://w3id.org/heritage-custodian
name: heritage-custodian-schema
title: Heritage Custodian Ontology

prefixes:
  hc: https://w3id.org/heritage-custodian/
  schema: http://schema.org/
  dcterms: http://purl.org/dc/terms/
  skos: http://www.w3.org/2004/02/skos/core#
  isil: http://id.loc.gov/vocabulary/organizations/
  rico: https://www.ica.org/standards/RiC/ontology#
  
imports:
  - linkml:types
  - schema_org_subset
  - cpoc_subset
  - tooi_subset

classes:
  HeritageCustodian:
    is_a: schema:Organization
    mixins:
      - CPOCPublicOrganization
      - TOOIOrganization
    slots:
      - name
      - institution_types
      - geographic_coverage
      - digital_platforms
      - collections
      - identifiers
      
  DigitalPlatform:
    is_a: schema:WebSite
    slots:
      - platform_type
      - software_platform
      - access_protocol
      - metadata_standard
      
  Collection:
    slots:
      - collection_name
      - subjects
      - formats
      - temporal_coverage
      - rights_statements

slots:
  institution_types:
    multivalued: true
    range: InstitutionTypeEnum
    
  identifiers:
    multivalued: true
    range: Identifier
    
enums:
  InstitutionTypeEnum:
    permissible_values:
      ARCHIVE:
        meaning: rico:RecordCreator
      LIBRARY:
        meaning: schema:Library
      MUSEUM:
        meaning: schema:Museum
      GALLERY:
        meaning: schema:ArtGallery

Modular Design:

  • core/ - Core classes (HeritageCustodian, Collection)
  • mixins/ - Reusable components (TOOI, CPOC, Schema.org)
  • identifiers/ - Identifier systems (ISIL, Wikidata, etc.)
  • standards/ - Domain standards (RiC-O, BIBFRAME, LIDO)
  • enums/ - Controlled vocabularies

3.2 Validation & Mapping

Components:

  • SchemaValidator: Validate instances against LinkML schema
  • EntityMapper: Map extracted data to LinkML classes
  • IRIGenerator: Generate unique IRIs for instances
  • ProvenanceTracker: Track extraction provenance

Technology:

  • linkml-runtime: Schema validation and instance generation
  • linkml: Schema development tools
  • rdflib: RDF graph manipulation
  • jsonschema: JSON validation
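
One possible validation path, sketched under the assumption that a JSON Schema is generated from the LinkML schema with linkml's gen-json-schema command and instances are checked with jsonschema; linkml-runtime loaders could equally be used directly:

import json
import subprocess

import jsonschema

def validate_instances(instances: list[dict], linkml_schema: str = "schemas/heritage_custodian.yaml") -> list[str]:
    """Validate instance dicts against a JSON Schema derived from the LinkML schema."""
    # gen-json-schema is part of the linkml toolchain; the target class name is assumed here.
    generated = subprocess.run(
        ["gen-json-schema", "--top-class", "HeritageCustodian", linkml_schema],
        capture_output=True, text=True, check=True,
    )
    schema = json.loads(generated.stdout)
    errors = []
    for instance in instances:
        try:
            jsonschema.validate(instance=instance, schema=schema)
        except jsonschema.ValidationError as exc:
            errors.append(f"{instance.get('name', '<unnamed>')}: {exc.message}")
    return errors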

4. Storage Layer

4.1 Intermediate Storage

Purpose: Store processed data during extraction pipeline

Databases:

  • Conversation Index: DuckDB

    • Fast analytical queries
    • Full-text search on conversation content
    • Lightweight, embedded database
  • Extraction Cache: SQLite

    • Store intermediate extraction results
    • Cache web crawling results
    • Track processing status

Schema:

-- Conversations
CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    created_at TIMESTAMP,
    title TEXT,
    country TEXT,
    region TEXT,
    content TEXT,  -- Full markdown content
    processed BOOLEAN DEFAULT FALSE
);

CREATE TABLE citations (
    id INTEGER PRIMARY KEY,
    conversation_id TEXT,
    url TEXT,
    title TEXT,
    context TEXT,
    crawled BOOLEAN DEFAULT FALSE,
    crawl_status TEXT,
    FOREIGN KEY (conversation_id) REFERENCES conversations(id)
);

-- Extracted entities
CREATE TABLE extracted_institutions (
    id TEXT PRIMARY KEY,
    name TEXT,
    institution_type TEXT[],
    country TEXT,
    raw_data JSON,  -- Full extracted data (DuckDB/SQLite have no JSONB type)
    confidence FLOAT,
    status TEXT,  -- 'pending', 'validated', 'rejected'
    created_at TIMESTAMP
);

-- Crawl results
CREATE TABLE crawl_results (
    url TEXT PRIMARY KEY,
    status_code INTEGER,
    content_hash TEXT,
    metadata JSON,
    crawled_at TIMESTAMP
);
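
Usage from Python might look like this sketch, using the duckdb client API; the database path and sample values are illustrative, and the table names follow the schema above:

import duckdb

con = duckdb.connect("data/intermediate/glam_index.duckdb")

# Register a processed conversation so later incremental runs can skip it.
con.execute(
    "INSERT INTO conversations (id, created_at, title, country, content, processed) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ["conv_001", "2025-01-15 10:00:00", "Museums in Ghana", "Ghana", "...", False],
)

# Simple analytical query: pending (uncrawled) citations per country.
pending = con.execute("""
    SELECT c.country, count(*) AS uncrawled
    FROM citations ci JOIN conversations c ON ci.conversation_id = c.id
    WHERE NOT ci.crawled
    GROUP BY c.country
    ORDER BY uncrawled DESC
""").fetchall()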

4.2 Output Storage

Purpose: Store final dataset in multiple formats

Formats:

  • RDF Store: Apache Jena TDB or Oxigraph

    • Native RDF/SPARQL support
    • Linked data queries
  • Document Store: JSON-LD files

    • One file per institution
    • GitHub-friendly for version control
  • Analytical: Parquet + DuckDB

    • Efficient columnar storage
    • Fast analytical queries
  • Relational: PostgreSQL (optional)

    • If needed for applications
    • Can be generated from RDF

5. Export & Publishing Layer

5.1 Export Pipeline

Components:

  • RDFExporter: Export to Turtle/N-Triples/JSON-LD
  • TabularExporter: Export to CSV/TSV/Excel
  • SQLExporter: Generate SQL dumps
  • ParquetExporter: Generate Parquet files
  • StatisticsGenerator: Dataset statistics and reports
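
A sketch of the RDFExporter using rdflib; the IRI pattern and property choices are illustrative, since the actual mapping is driven by the LinkML schema:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

HC = Namespace("https://w3id.org/heritage-custodian/")
SCHEMA = Namespace("http://schema.org/")

def export_rdf(institutions: list[dict], out_dir: str = "dataset/rdf") -> None:
    """Serialize institution records to Turtle and N-Triples."""
    g = Graph()
    g.bind("hc", HC)
    g.bind("schema", SCHEMA)
    for inst in institutions:
        subject = URIRef(HC[inst["id"]])
        g.add((subject, RDF.type, SCHEMA.Organization))
        g.add((subject, RDFS.label, Literal(inst["name"])))
        for url in inst.get("urls", []):
            g.add((subject, SCHEMA.url, URIRef(url)))
    g.serialize(destination=f"{out_dir}/institutions.ttl", format="turtle")
    g.serialize(destination=f"{out_dir}/institutions.nt", format="nt")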

5.2 Publishing Formats

dataset/
├── rdf/
│   ├── institutions.ttl         # All institutions in Turtle
│   ├── institutions.nt          # N-Triples format
│   └── jsonld/                  # Individual JSON-LD files
│       ├── inst_001.jsonld
│       └── ...
├── tabular/
│   ├── institutions.csv         # Flattened institution data
│   ├── collections.csv          # Collections table
│   ├── digital_platforms.csv   # Digital platforms table
│   └── data_dictionary.csv     # Column descriptions
├── database/
│   ├── glam_dataset.db          # SQLite database
│   ├── glam_dataset.duckdb      # DuckDB database
│   └── schema.sql               # SQL schema
├── analytics/
│   └── glam_dataset.parquet     # Parquet format
├── metadata/
│   ├── dataset_metadata.yaml    # Dataset description
│   ├── statistics.json          # Dataset statistics
│   └── provenance.jsonld        # PROV-O provenance
└── docs/
    ├── README.md                # Dataset documentation
    ├── schema.html              # Schema documentation
    └── query_examples.md        # SPARQL/SQL examples

Component Integration

Data Flow Architecture

┌─────────────────┐
│  Conversation   │
│   JSON Files    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Conversation    │ ──> [DuckDB Index]
│   Parser        │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Pattern-based  │ ──> Extract: ISIL, Wikidata, URLs
│  Extraction     │     (IdentifierExtractor in main code)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  SUBAGENT       │ ──> Launch subagents for NER
│  BOUNDARY       │     (InstitutionExtractor, LocationExtractor)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Coding         │ ──> Subagents use spaCy/transformers
│  Subagents      │     Return structured JSON results
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Web Crawler    │ ──> [Crawl Results DB]
│   (crawl4ai)    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Enrichment     │ ──> [External APIs]
│   Services      │     (Wikidata, VIAF)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  LinkML Mapper  │ ──> [Validation Errors]
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Quality Check  │ ──> [Review Queue]
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Exporters    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Published       │
│   Dataset       │
└─────────────────┘

Processing Modes

1. Batch Processing (Default)

  • Process all 139 conversations sequentially
  • Suitable for initial dataset creation
  • Can be parallelized by conversation (see the sketch after this list)

2. Incremental Processing

  • Process new/updated conversations only
  • Track processing timestamps
  • Merge with existing dataset

3. Targeted Processing

  • Process specific conversations (by country, date, etc.)
  • For dataset updates or testing
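
A sketch of how batch mode could be parallelized per conversation with the standard library; the worker function and paths are placeholders for the real pipeline steps:

from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

def process_conversation(path: Path) -> str:
    """Placeholder for the per-conversation pipeline:
    parse -> pattern extraction -> subagent NER -> store intermediate results."""
    return path.stem

def run_batch(input_dir: Path, max_workers: int = 4) -> None:
    files = sorted(input_dir.glob("*.json"))
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_conversation, f): f for f in files}
        for future in as_completed(futures):
            source = futures[future]
            try:
                print(f"done: {future.result()}")
            except Exception as exc:
                print(f"failed: {source.name}: {exc}")  # routed to the review queue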

Technology Stack

Core Languages

  • Python 3.11+: Main implementation language
  • YAML: Schema definition (LinkML)
  • SPARQL: RDF queries

Key Libraries

  • Data Processing: pandas, polars (for tabular data)
  • Pattern Matching: Python re module, rapidfuzz (fuzzy matching)
  • Subagent Orchestration: Task tool (for NER and complex extraction)
  • Web: crawl4ai, httpx, beautifulsoup4
  • Semantic Web: linkml-runtime, rdflib, pyshacl
  • Database: duckdb, sqlite3, sqlalchemy
  • Validation: pydantic, jsonschema
  • Export: pyarrow (Parquet), openpyxl (Excel)

Note: NLP libraries (spaCy, transformers, PyTorch) are NOT dependencies of the main application. They are used by coding subagents when needed for entity extraction. See docs/plan/global_glam/07-subagent-architecture.md for details.

Development Tools

  • Package Management: Poetry
  • Code Quality: ruff (linting), mypy (type checking)
  • Testing: pytest, hypothesis
  • Documentation: mkdocs, sphinx
  • CI/CD: GitHub Actions

Scalability Considerations

Performance Optimizations

  1. Parallel Processing: Process conversations concurrently
  2. Caching: Cache web crawl results and API calls
  3. Incremental Processing: Only process changed data
  4. Efficient Storage: Use columnar formats (Parquet) for analytics

Resource Management

  • Memory: Stream large files, avoid loading all data in memory
  • Disk: Compress intermediate results
  • Network: Rate limiting for web crawling, batch API calls

Security & Ethics

Web Crawling

  • Respect robots.txt
  • Implement rate limiting
  • User-agent identification
  • No aggressive crawling

Data Privacy

  • No personal data extraction
  • Public information only
  • Comply with institutional terms of service

Licensing

  • Clear dataset license (CC0, CC-BY, or ODbL)
  • Respect source data licenses
  • Attribution to source institutions

Monitoring & Observability

Logging

  • Structured logging (JSON format)
  • Log levels: DEBUG, INFO, WARNING, ERROR
  • Per-component logging
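
A minimal sketch of structured JSON logging with the standard library; a dedicated library such as structlog would also fit here:

import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON object."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
            "time": self.formatTime(record),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("glam_extractor.crawler")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("crawl completed")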

Metrics

  • Extraction success rates
  • Web crawl success rates
  • Validation pass/fail rates
  • Processing time per conversation
  • Dataset statistics over time

Error Handling

  • Graceful degradation
  • Retry logic for transient failures
  • Error categorization and reporting
  • Manual review queue for edge cases
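
Retry logic for transient failures could be as simple as this sketch (a library such as tenacity is another option); the helper name and the set of retried exceptions are assumptions:

import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(func: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call func, retrying with exponential backoff on transient errors."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise  # retries exhausted; surface to error reporting / review queue
            time.sleep(base_delay * 2 ** (attempt - 1))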

Deployment Architecture

Local Development

poetry install
poetry run glam-extract --input data/conversations/ --output data/dataset/

Docker Container (Future)

FROM python:3.11-slim
RUN pip install poetry
COPY . /app
WORKDIR /app
RUN poetry install --only main  # '--no-dev' on Poetry 1.x
ENTRYPOINT ["poetry", "run", "glam-extract"]

CI/CD Pipeline

  • GitHub Actions: Run on push/PR
  • Tests: Unit tests, integration tests
  • Quality Gates: Linting, type checking, coverage
  • Dataset Build: Weekly automated builds
  • Publication: Auto-publish to GitHub Releases

Future Enhancements

Phase 2 Features

  • Web UI for manual review and curation
  • SPARQL endpoint for queries
  • REST API for dataset access
  • Automated duplicate detection across conversations
  • Multi-language support (NLP for non-English content)
  • Machine learning for institution type classification
  • Automated monitoring of institution URLs (link rot detection)

Integration Opportunities

  • Wikidata integration (read/write)
  • National heritage registries
  • OpenStreetMap for geocoding
  • IIIF for image collections
  • OAI-PMH for metadata harvesting

Related Documents

  • Subagent Architecture: docs/plan/global_glam/07-subagent-architecture.md - Detailed explanation of subagent-based NER approach
  • Design Patterns: docs/plan/global_glam/05-design-patterns.md - Code patterns and best practices
  • Data Standardization: docs/plan/global_glam/04-data-standardization.md - Data tier system and provenance
  • Dependencies: docs/plan/global_glam/03-dependencies.md - Technology stack and library choices
  • LinkML Schema: schemas/heritage_custodian.yaml - Data model definition
  • Agent Instructions: AGENTS.md - Instructions for AI agents working on this project