# Global GLAM Dataset: System Architecture
## Overview
This document describes the technical architecture for extracting, enriching, and publishing a comprehensive global GLAM (Galleries, Libraries, Archives, Museums) dataset from Claude conversation files.
## System Context Diagram
```
┌────────────────────────────────────────────────────────────────┐
│                 Global GLAM Extraction System                  │
│                                                                │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐  │
│  │ Conversation │ ───> │  Extraction  │ ───> │    LinkML    │  │
│  │   Parsers    │      │   Pipeline   │      │  Instances   │  │
│  └──────────────┘      └──────────────┘      └──────────────┘  │
│         │                     │                     │          │
│         ▼                     ▼                     ▼          │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐  │
│  │     Text     │      │ Web Crawler  │      │    Export    │  │
│  │    Corpus    │      │  (crawl4ai)  │      │   Formats    │  │
│  └──────────────┘      └──────────────┘      └──────────────┘  │
└────────────────────────────────────────────────────────────────┘
          │                     │                     │
          ▼                     ▼                     ▼
  [139 JSON Files]        [Web Sources]      [Published Dataset]
```
## High-Level Architecture
### Extraction Architecture: Hybrid Approach
This system uses a **hybrid extraction architecture** combining pattern matching and subagent-based NER:
**Pattern Matching (Main Code)**:
- Extract structured identifiers: ISIL codes, Wikidata IDs, VIAF IDs, KvK numbers
- URL extraction and validation
- Fast, deterministic, no dependencies on NLP libraries
- Implemented in: `src/glam_extractor/extractors/identifiers.py`
**Subagent-Based NER**:
- Extract unstructured entities: institution names, locations, relationships
- Coding subagents autonomously choose NER tools (spaCy, transformers, GPT-4, etc.)
- Main application stays lightweight (no PyTorch, spaCy, transformers dependencies)
- Flexible: can swap extraction methods without changing main code
- Implemented via: Task tool invocation in extractors
**Rationale**: See `docs/plan/global_glam/07-subagent-architecture.md` for detailed architectural decision record.
### 1. Data Ingestion Layer
#### 1.1 Conversation Parser
**Purpose**: Extract structured content from Claude conversation JSON files
**Components**:
- `ConversationReader`: Read and validate JSON files
- `MessageExtractor`: Extract message content and metadata
- `CitationExtractor`: Parse URLs and references
- `ConversationIndexer`: Create searchable index of conversations
**Inputs**:
- Claude conversation JSON files (139 files covering global GLAM research)
**Outputs**:
- Structured conversation objects
- Text corpus (markdown content)
- Citation database (URLs, references)
- Conversation metadata index
**Technology**:
- Python `json` module for parsing
- `pydantic` for data validation
- DuckDB or SQLite for indexing
#### 1.2 Data Model
```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Conversation:
    id: str
    created_at: datetime
    title: str
    messages: List[Message]
    citations: List[Citation]
    metadata: ConversationMetadata


@dataclass
class Message:
    role: str                # 'user' or 'assistant'
    content: str             # Markdown text
    timestamp: datetime


@dataclass
class Citation:
    url: str
    title: Optional[str]
    context: str             # Surrounding text


@dataclass
class ConversationMetadata:
    country: Optional[str]
    region: Optional[str]
    glam_types: List[str]    # ['museum', 'archive', 'library', 'gallery']
    languages: List[str]
```
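The Technology list above names `pydantic` for validation. A minimal sketch of how a raw conversation JSON payload might be validated before it is mapped onto the dataclasses above; the field names (`uuid`, `name`, `chat_messages`, `sender`, `text`) are assumptions for illustration, not a documented export schema.
```python
# Sketch only: the JSON field names below are assumptions about the export
# format, not a documented schema. Adjust them to the real file structure.
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class RawMessage(BaseModel):
    sender: str                      # hypothetical field name
    text: str                        # hypothetical field name


class RawConversation(BaseModel):
    uuid: str                        # hypothetical field name
    name: Optional[str] = None       # hypothetical field name
    chat_messages: List[RawMessage]  # hypothetical field name


def load_raw_conversation(payload: dict) -> Optional[RawConversation]:
    """Validate one conversation JSON payload, returning None on failure."""
    try:
        return RawConversation.model_validate(payload)
    except ValidationError:
        return None
```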
### 2. Extraction & Processing Layer
#### 2.1 NLP Extraction Pipeline
**Purpose**: Extract heritage institutions and attributes using pattern matching and subagent-based NER
**Architecture**: This pipeline uses a **hybrid approach**:
- **Pattern matching** (in main code) for identifiers and structured data
- **Coding subagents** (via Task tool) for Named Entity Recognition
- See `docs/plan/global_glam/07-subagent-architecture.md` for detailed rationale
**Components**:
##### Entity Extractor (Main Code - Pattern Matching)
- **IdentifierExtractor**: Extract ISIL codes, Wikidata IDs, VIAF IDs, KvK numbers using regex patterns
- **URLExtractor**: URL pattern matching and validation
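To illustrate the pattern-matching side, the regexes below sketch how ISIL codes and Wikidata Q-identifiers might be matched. They are illustrative only and not the actual contents of `identifiers.py`, which may use stricter rules.
```python
import re

# Illustrative patterns only; the real extractor may differ.
# ISIL (ISO 15511): country/agency prefix, a hyphen, then a local identifier.
ISIL_PATTERN = re.compile(r"\b[A-Z]{1,4}-[A-Za-z0-9/:\-]{1,11}\b")
# Wikidata item identifiers: "Q" followed by digits.
WIKIDATA_PATTERN = re.compile(r"\bQ\d{1,10}\b")


def extract_identifiers(text: str) -> dict[str, list[str]]:
    """Return candidate identifiers found in a block of conversation text."""
    return {
        "isil": sorted(set(ISIL_PATTERN.findall(text))),
        "wikidata": sorted(set(WIKIDATA_PATTERN.findall(text))),
    }
```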
##### Entity Extractor (Subagent-Based NER)
- **InstitutionExtractor**: Launch subagents to extract institution names using NLP/NER
- **LocationExtractor**: Launch subagents for geographic entity extraction
- **RelationshipExtractor**: Launch subagents to extract organizational relationships
##### Attribute Extractor (Subagent-Based)
- **TypeClassifier**: Subagents classify institution type (museum, archive, etc.)
- **CollectionExtractor**: Subagents extract collection subjects/themes
- **StandardsExtractor**: Subagents identify metadata/preservation standards used
##### Data Model
```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ExtractedInstitution:
    name: str
    name_variants: List[str]
    institution_type: List[str]        # Can be multiple
    # Location
    country: Optional[str]
    region: Optional[str]
    city: Optional[str]
    address: Optional[str]
    coordinates: Optional[Tuple[float, float]]
    # Digital presence
    urls: List[str]
    repositories: List[DigitalRepository]
    # Identifiers
    isil_code: Optional[str]
    wikidata_id: Optional[str]
    national_ids: Dict[str, str]
    # Collections
    collection_subjects: List[str]
    collection_formats: List[str]
    # Technical
    metadata_standards: List[str]
    preservation_standards: List[str]
    access_protocols: List[str]
    # Organizational
    parent_organization: Optional[str]
    consortia: List[str]
    partnerships: List[str]
    # Provenance
    source_conversations: List[str]
    extraction_confidence: float
    extraction_method: str


@dataclass
class DigitalRepository:
    url: str
    platform: Optional[str]            # e.g., "DSpace", "Omeka", "CollectiveAccess"
    repository_type: str               # e.g., "institutional repository", "catalog", "portal"
    access_type: str                   # "open", "restricted", "mixed"
```
**Technology**:
- **Pattern Matching** (main code): Python `re` module, `rapidfuzz` for fuzzy matching
- **NER & Extraction** (subagents): Coding subagents choose appropriate tools (spaCy, transformers, etc.)
- **Task Orchestration**: Task tool for subagent invocation
- **Text Processing**: `langdetect` for language identification (main code)
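To make the subagent boundary concrete, the sketch below shows one way the main code might consume a subagent's structured JSON result. The payload shape (`institutions`, `name`, `type`, `confidence`) and the intermediate `InstitutionCandidate` record are hypothetical, not a contract taken from the subagent architecture document.
```python
import json
from dataclasses import dataclass, field


@dataclass
class InstitutionCandidate:
    """Hypothetical intermediate record returned across the subagent boundary."""
    name: str
    institution_type: list[str] = field(default_factory=list)
    country: str | None = None
    confidence: float = 0.0
    source_conversation: str | None = None


def parse_subagent_result(raw_json: str, conversation_id: str) -> list[InstitutionCandidate]:
    """Turn a subagent's JSON output into candidate records for later mapping
    onto ExtractedInstitution. The payload shape is an assumed contract."""
    payload = json.loads(raw_json)
    candidates = []
    for item in payload.get("institutions", []):
        candidates.append(
            InstitutionCandidate(
                name=item["name"],
                institution_type=item.get("type", []),
                country=item.get("country"),
                confidence=float(item.get("confidence", 0.0)),
                source_conversation=conversation_id,
            )
        )
    return candidates
```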
#### 2.2 Web Crawling & Enrichment
**Purpose**: Fetch and extract data from URLs cited in conversations
**Components**:
##### Web Crawler (crawl4ai)
- **URLValidator**: Check URL availability and redirects
- **PageFetcher**: Async crawling with rate limiting
- **MetadataExtractor**: Extract structured metadata from pages
- **ContentExtractor**: Platform-specific content extraction
##### Enrichment Services
- **GeocodingService**: Resolve addresses to coordinates (Nominatim)
- **WikidataLinker**: Link institutions to Wikidata entities
- **VIAFLinker**: Link to VIAF authority records
- **RegistryChecker**: Verify against national/international registries
**Technology**:
- **crawl4ai**: Async web crawling with LLM extraction
- **httpx**: Modern async HTTP client
- **beautifulsoup4**: HTML parsing
- **lxml**: XML/HTML processing
- **geopy**: Geocoding
- **SPARQLWrapper**: Query Wikidata/VIAF
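As a minimal sketch of what the WikidataLinker might do with SPARQLWrapper: look up candidate QIDs by exact label match. The query and label handling are illustrative, not the final linking logic.
```python
import json

from SPARQLWrapper import JSON, SPARQLWrapper

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def find_wikidata_candidates(label: str, language: str = "en") -> list[str]:
    """Return QIDs of Wikidata items whose label exactly matches the name."""
    sparql = SPARQLWrapper(WIKIDATA_ENDPOINT, agent="glam-extractor (research prototype)")
    sparql.setReturnFormat(JSON)
    # json.dumps() quotes and escapes the label for use as a SPARQL string literal.
    query = f"SELECT ?item WHERE {{ ?item rdfs:label {json.dumps(label)}@{language} . }} LIMIT 5"
    sparql.setQuery(query)
    results = sparql.query().convert()
    return [
        row["item"]["value"].rsplit("/", 1)[-1]
        for row in results["results"]["bindings"]
    ]
```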
**Data Flow**:
```
Citations → URL Validation → Crawl Queue
                     │
                     ▼
          crawl4ai Async Crawler
           ┌─────────┴─────────┐
           ▼                   ▼
   Metadata Extract     Content Extract
           │                   │
           └─────────┬─────────┘
                     ▼
            Enrichment Pipeline
                     │
                     ▼
         Enriched Institution Data
```
### 3. Schema & Validation Layer
#### 3.1 LinkML Schema Definition
**Purpose**: Define comprehensive heritage custodian ontology
**Schema Structure**:
```yaml
# heritage_custodian.yaml (simplified)
id: https://w3id.org/heritage-custodian
name: heritage-custodian-schema
title: Heritage Custodian Ontology

prefixes:
  hc: https://w3id.org/heritage-custodian/
  schema: http://schema.org/
  dcterms: http://purl.org/dc/terms/
  skos: http://www.w3.org/2004/02/skos/core#
  isil: http://id.loc.gov/vocabulary/organizations/
  rico: https://www.ica.org/standards/RiC/ontology#

imports:
  - linkml:types
  - schema_org_subset
  - cpoc_subset
  - tooi_subset

classes:
  HeritageCustodian:
    is_a: schema:Organization
    mixins:
      - CPOCPublicOrganization
      - TOOIOrganization
    slots:
      - name
      - institution_types
      - geographic_coverage
      - digital_platforms
      - collections
      - identifiers

  DigitalPlatform:
    is_a: schema:WebSite
    slots:
      - platform_type
      - software_platform
      - access_protocol
      - metadata_standard

  Collection:
    slots:
      - collection_name
      - subjects
      - formats
      - temporal_coverage
      - rights_statements

slots:
  institution_types:
    multivalued: true
    range: InstitutionTypeEnum
  identifiers:
    multivalued: true
    range: Identifier

enums:
  InstitutionTypeEnum:
    permissible_values:
      ARCHIVE:
        meaning: rico:RecordCreator
      LIBRARY:
        meaning: schema:Library
      MUSEUM:
        meaning: schema:Museum
      GALLERY:
        meaning: schema:ArtGallery
```
**Modular Design**:
- `core/` - Core classes (HeritageCustodian, Collection)
- `mixins/` - Reusable components (TOOI, CPOC, Schema.org)
- `identifiers/` - Identifier systems (ISIL, Wikidata, etc.)
- `standards/` - Domain standards (RiC-O, BIBFRAME, LIDO)
- `enums/` - Controlled vocabularies
#### 3.2 Validation & Mapping
**Components**:
- **SchemaValidator**: Validate instances against LinkML schema
- **EntityMapper**: Map extracted data to LinkML classes
- **IRIGenerator**: Generate unique IRIs for instances (see the sketch below)
- **ProvenanceTracker**: Track extraction provenance
**Technology**:
- **linkml-runtime**: Schema validation and instance generation
- **linkml**: Schema development tools
- **rdflib**: RDF graph manipulation
- **jsonschema**: JSON validation
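A minimal sketch of the IRIGenerator idea: derive a stable, readable IRI from the institution name under the schema's base namespace, with a short hash to disambiguate collisions. The slug rules and the `/institution/` namespace layout are assumptions, not a fixed convention.
```python
import hashlib
import re
import unicodedata

BASE_IRI = "https://w3id.org/heritage-custodian/institution/"  # assumed namespace layout


def generate_iri(name: str, country: str | None = None) -> str:
    """Build a stable IRI from the institution name (plus country, if known)."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
    # A short hash keeps IRIs unique when two institutions share a slug.
    digest = hashlib.sha1(f"{name}|{country or ''}".encode("utf-8")).hexdigest()[:8]
    return f"{BASE_IRI}{slug}-{digest}"


# generate_iri("Rijksmuseum", "Netherlands")
# -> "https://w3id.org/heritage-custodian/institution/rijksmuseum-<hash>"
```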
### 4. Storage Layer
#### 4.1 Intermediate Storage
**Purpose**: Store processed data during extraction pipeline
**Databases**:
- **Conversation Index**: DuckDB
- Fast analytical queries
- Full-text search on conversation content
- Lightweight, embedded database
- **Extraction Cache**: SQLite
- Store intermediate extraction results
- Cache web crawling results
- Track processing status
**Schema**:
```sql
-- Conversations
CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    created_at TIMESTAMP,
    title TEXT,
    country TEXT,
    region TEXT,
    content TEXT,                    -- Full markdown content
    processed BOOLEAN DEFAULT FALSE
);

CREATE TABLE citations (
    id INTEGER PRIMARY KEY,
    conversation_id TEXT,
    url TEXT,
    title TEXT,
    context TEXT,
    crawled BOOLEAN DEFAULT FALSE,
    crawl_status TEXT,
    FOREIGN KEY (conversation_id) REFERENCES conversations(id)
);

-- Extracted entities
CREATE TABLE extracted_institutions (
    id TEXT PRIMARY KEY,
    name TEXT,
    institution_type TEXT[],
    country TEXT,
    raw_data JSONB,                  -- Full extracted data
    confidence FLOAT,
    status TEXT,                     -- 'pending', 'validated', 'rejected'
    created_at TIMESTAMP
);

-- Crawl results
CREATE TABLE crawl_results (
    url TEXT PRIMARY KEY,
    status_code INTEGER,
    content_hash TEXT,
    metadata JSONB,
    crawled_at TIMESTAMP
);
```
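A short sketch of how the conversation index might be populated and queried through DuckDB's Python API (the table is simplified here and the file name is illustrative):
```python
import duckdb

con = duckdb.connect("conversation_index.duckdb")  # illustrative file name

# Simplified two-column version of the conversations table above.
con.execute("CREATE TABLE IF NOT EXISTS conversations (id TEXT PRIMARY KEY, content TEXT)")
con.execute(
    "INSERT OR REPLACE INTO conversations VALUES (?, ?)",
    ["conv-001", "Research notes on Dutch museum registries ..."],
)

# Simple keyword filter; full-text search would use DuckDB's FTS extension.
rows = con.execute(
    "SELECT id FROM conversations WHERE content ILIKE '%museum%'"
).fetchall()
```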
#### 4.2 Output Storage
**Purpose**: Store final dataset in multiple formats
**Formats**:
- **RDF Store**: Apache Jena TDB or Oxigraph
- Native RDF/SPARQL support
- Linked data queries
- **Document Store**: JSON-LD files
- One file per institution
- GitHub-friendly for version control
- **Analytical**: Parquet + DuckDB
- Efficient columnar storage
- Fast analytical queries
- **Relational**: PostgreSQL (optional)
- If needed for applications
- Can be generated from RDF
### 5. Export & Publishing Layer
#### 5.1 Export Pipeline
**Components**:
- **RDFExporter**: Export to Turtle/N-Triples/JSON-LD (see the sketch below)
- **TabularExporter**: Export to CSV/TSV/Excel
- **SQLExporter**: Generate SQL dumps
- **ParquetExporter**: Generate Parquet files
- **StatisticsGenerator**: Dataset statistics and reports
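A hedged sketch of the RDFExporter and ParquetExporter ideas using rdflib and pyarrow. The predicate choices and record shape are illustrative, not the schema-driven mapping the real exporters would use.
```python
import pyarrow as pa
import pyarrow.parquet as pq
from rdflib import RDF, Graph, Literal, Namespace, URIRef

HC = Namespace("https://w3id.org/heritage-custodian/")


def export_turtle(records: list[dict], path: str) -> None:
    """Serialize institution records to Turtle (illustrative predicates only)."""
    graph = Graph()
    graph.bind("hc", HC)
    for record in records:
        subject = URIRef(record["iri"])
        graph.add((subject, RDF.type, HC.HeritageCustodian))
        graph.add((subject, HC.name, Literal(record["name"])))
    graph.serialize(destination=path, format="turtle")


def export_parquet(records: list[dict], path: str) -> None:
    """Write flattened institution records to a Parquet file."""
    pq.write_table(pa.Table.from_pylist(records), path)
```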
#### 5.2 Publishing Formats
```
dataset/
├── rdf/
│ ├── institutions.ttl # All institutions in Turtle
│ ├── institutions.nt # N-Triples format
│ └── jsonld/ # Individual JSON-LD files
│ ├── inst_001.jsonld
│ └── ...
├── tabular/
│ ├── institutions.csv # Flattened institution data
│ ├── collections.csv # Collections table
│ ├── digital_platforms.csv # Digital platforms table
│ └── data_dictionary.csv # Column descriptions
├── database/
│ ├── glam_dataset.db # SQLite database
│ ├── glam_dataset.duckdb # DuckDB database
│ └── schema.sql # SQL schema
├── analytics/
│ └── glam_dataset.parquet # Parquet format
├── metadata/
│ ├── dataset_metadata.yaml # Dataset description
│ ├── statistics.json # Dataset statistics
│ └── provenance.jsonld # PROV-O provenance
└── docs/
├── README.md # Dataset documentation
├── schema.html # Schema documentation
└── query_examples.md # SPARQL/SQL examples
```
## Component Integration
### Data Flow Architecture
```
┌─────────────────┐
│  Conversation   │
│   JSON Files    │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Conversation   │ ──> [DuckDB Index]
│     Parser      │
└────────┬────────┘
         ▼
┌─────────────────┐
│ Pattern-based   │ ──> Extract: ISIL, Wikidata, URLs
│   Extraction    │     (IdentifierExtractor in main code)
└────────┬────────┘
         ▼
┌─────────────────┐
│    SUBAGENT     │ ──> Launch subagents for NER
│    BOUNDARY     │     (InstitutionExtractor, LocationExtractor)
└────────┬────────┘
         ▼
┌─────────────────┐
│     Coding      │ ──> Subagents use spaCy/transformers
│    Subagents    │     Return structured JSON results
└────────┬────────┘
         ▼
┌─────────────────┐
│   Web Crawler   │ ──> [Crawl Results DB]
│   (crawl4ai)    │
└────────┬────────┘
         ▼
┌─────────────────┐
│   Enrichment    │ ──> [External APIs]
│    Services     │     (Wikidata, VIAF)
└────────┬────────┘
         ▼
┌─────────────────┐
│  LinkML Mapper  │ ──> [Validation Errors]
└────────┬────────┘
         ▼
┌─────────────────┐
│  Quality Check  │ ──> [Review Queue]
└────────┬────────┘
         ▼
┌─────────────────┐
│    Exporters    │
└────────┬────────┘
         ▼
┌─────────────────┐
│   Published     │
│    Dataset      │
└─────────────────┘
```
### Processing Modes
#### 1. Batch Processing (Default)
- Process all 139 conversations sequentially
- Suitable for initial dataset creation
- Can be parallelized by conversation
#### 2. Incremental Processing
- Process new/updated conversations only
- Track processing timestamps
- Merge with existing dataset
#### 3. Targeted Processing
- Process specific conversations (by country, date, etc.)
- For dataset updates or testing
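The three modes could surface as options on the planned `glam-extract` command line. The flag names below are illustrative only, not a committed interface.
```python
# Illustrative CLI surface for the processing modes; flag names are not final.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="glam-extract")
    parser.add_argument("--input", required=True, help="Directory of conversation JSON files")
    parser.add_argument("--output", required=True, help="Dataset output directory")
    parser.add_argument("--incremental", action="store_true",
                        help="Only process conversations not yet marked as processed")
    parser.add_argument("--country", help="Restrict processing to one country (targeted mode)")
    parser.add_argument("--since", help="Only process conversations created after this date")
    return parser
```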
## Technology Stack
### Core Languages
- **Python 3.11+**: Main implementation language
- **YAML**: Schema definition (LinkML)
- **SPARQL**: RDF queries
### Key Libraries
- **Data Processing**: pandas, polars (for tabular data)
- **Pattern Matching**: Python `re` module, `rapidfuzz` (fuzzy matching)
- **Subagent Orchestration**: Task tool (for NER and complex extraction)
- **Web**: crawl4ai, httpx, beautifulsoup4
- **Semantic Web**: linkml-runtime, rdflib, pyshacl
- **Database**: duckdb, sqlite3, sqlalchemy
- **Validation**: pydantic, jsonschema
- **Export**: pyarrow (Parquet), openpyxl (Excel)
**Note**: NLP libraries (spaCy, transformers, PyTorch) are NOT dependencies of the main application. They are used by coding subagents when needed for entity extraction. See `docs/plan/global_glam/07-subagent-architecture.md` for details.
### Development Tools
- **Package Management**: Poetry
- **Code Quality**: ruff (linting), mypy (type checking)
- **Testing**: pytest, hypothesis
- **Documentation**: mkdocs, sphinx
- **CI/CD**: GitHub Actions
## Scalability Considerations
### Performance Optimizations
1. **Parallel Processing**: Process conversations concurrently
2. **Caching**: Cache web crawl results and API calls
3. **Incremental Processing**: Only process changed data
4. **Efficient Storage**: Use columnar formats (Parquet) for analytics
### Resource Management
- **Memory**: Stream large files, avoid loading all data in memory
- **Disk**: Compress intermediate results
- **Network**: Rate limiting for web crawling, batch API calls
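As a sketch of the rate-limiting idea (illustrative limits, using httpx and asyncio rather than crawl4ai's own configuration):
```python
import asyncio

import httpx

MAX_CONCURRENT_REQUESTS = 5   # illustrative limit
REQUEST_DELAY_SECONDS = 1.0   # illustrative per-request delay


async def fetch_all(urls: list[str]) -> dict[str, int]:
    """Fetch URLs with bounded concurrency; return status codes per URL."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
    results: dict[str, int] = {}

    async with httpx.AsyncClient(headers={"User-Agent": "glam-extractor"}) as client:
        async def fetch(url: str) -> None:
            async with semaphore:
                response = await client.get(url, follow_redirects=True, timeout=30.0)
                results[url] = response.status_code
                await asyncio.sleep(REQUEST_DELAY_SECONDS)

        await asyncio.gather(*(fetch(url) for url in urls), return_exceptions=True)
    return results
```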
## Security & Ethics
### Web Crawling
- Respect robots.txt
- Implement rate limiting
- User-agent identification
- No aggressive crawling
### Data Privacy
- No personal data extraction
- Public information only
- Comply with institutional terms of service
### Licensing
- Clear dataset license (CC0, CC-BY, or ODbL)
- Respect source data licenses
- Attribution to source institutions
## Monitoring & Observability
### Logging
- Structured logging (JSON format)
- Log levels: DEBUG, INFO, WARNING, ERROR
- Per-component logging
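A minimal stdlib sketch of the structured-logging approach (a JSON formatter on the standard logging module; a library such as structlog would also work):
```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
            "time": self.formatTime(record),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("glam_extractor").addHandler(handler)
```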
### Metrics
- Extraction success rates
- Web crawl success rates
- Validation pass/fail rates
- Processing time per conversation
- Dataset statistics over time
### Error Handling
- Graceful degradation
- Retry logic for transient failures
- Error categorization and reporting
- Manual review queue for edge cases
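A small sketch of retry logic for transient failures (exponential backoff; the parameters are illustrative, and a library such as tenacity could replace this):
```python
import time


def with_retries(func, *, attempts: int = 3, base_delay: float = 1.0):
    """Call func(), retrying on exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise                      # give up: surface to the review queue
            time.sleep(base_delay * (2 ** attempt))
```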
## Deployment Architecture
### Local Development
```
poetry install
poetry run glam-extract --input data/conversations/ --output data/dataset/
```
### Docker Container (Future)
```dockerfile
FROM python:3.11-slim
RUN pip install poetry
COPY . /app
WORKDIR /app
RUN poetry install --no-dev
ENTRYPOINT ["poetry", "run", "glam-extract"]
```
### CI/CD Pipeline
- **GitHub Actions**: Run on push/PR
- **Tests**: Unit tests, integration tests
- **Quality Gates**: Linting, type checking, coverage
- **Dataset Build**: Weekly automated builds
- **Publication**: Auto-publish to GitHub Releases
## Future Enhancements
### Phase 2 Features
- Web UI for manual review and curation
- SPARQL endpoint for queries
- REST API for dataset access
- Automated duplicate detection across conversations
- Multi-language support (NLP for non-English content)
- Machine learning for institution type classification
- Automated monitoring of institution URLs (link rot detection)
### Integration Opportunities
- Wikidata integration (read/write)
- National heritage registries
- OpenStreetMap for geocoding
- IIIF for image collections
- OAI-PMH for metadata harvesting
## Related Documentation
- **Subagent Architecture**: `docs/plan/global_glam/07-subagent-architecture.md` - Detailed explanation of subagent-based NER approach
- **Design Patterns**: `docs/plan/global_glam/05-design-patterns.md` - Code patterns and best practices
- **Data Standardization**: `docs/plan/global_glam/04-data-standardization.md` - Data tier system and provenance
- **Dependencies**: `docs/plan/global_glam/03-dependencies.md` - Technology stack and library choices
- **LinkML Schema**: `schemas/heritage_custodian.yaml` - Data model definition
- **Agent Instructions**: `AGENTS.md` - Instructions for AI agents working on this project