glam/docs/plan/global_glam/03-dependencies.md
2025-11-19 23:25:22 +01:00


# Global GLAM Dataset: Dependencies
## Overview
This document catalogs all technical, data, and organizational dependencies for the Global GLAM Dataset extraction project.
## Technical Dependencies
### 1. Python Runtime
- **Python 3.11+** (required)
- Rationale: Modern type hints, performance improvements, better error messages
- Alternative: Python 3.10 (minimum supported)
### 2. Core Python Libraries
#### Data Processing
```toml
[tool.poetry.dependencies]
python = "^3.11"
# Data manipulation
pandas = "^2.1.0" # Tabular data processing
polars = "^0.19.0" # High-performance dataframes (optional)
pyarrow = "^13.0.0" # Parquet format support
# Data validation
pydantic = "^2.4.0" # Data validation and settings
jsonschema = "^4.19.0" # JSON schema validation
```
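As a concrete sketch of the validation layer (pydantic in the real project), a stdlib-only stand-in; the `InstitutionRecord` fields are illustrative, not the project schema:

```python
from dataclasses import dataclass, field

# Hypothetical record shape - field names are illustrative, not the project schema
@dataclass
class InstitutionRecord:
    name: str
    country_code: str                 # ISO 3166-1 alpha-2
    urls: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject obviously malformed records before they enter the pipeline
        if not self.name.strip():
            raise ValueError("name must be non-empty")
        if len(self.country_code) != 2 or not self.country_code.isalpha():
            raise ValueError(f"invalid ISO 3166-1 alpha-2 code: {self.country_code!r}")
        self.country_code = self.country_code.upper()

record = InstitutionRecord(name="Rijksmuseum", country_code="nl")
print(record.country_code)  # normalized to "NL"
```

pydantic adds the same checks declaratively (types, validators) plus settings management, which is why it is the planned dependency.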
#### NLP & Text Processing
**NOTE**: NLP extraction (NER, entity recognition) is handled by **coding subagents** via the Task tool, not directly in the main codebase. Subagents may use spaCy, transformers, or other tools internally, but these are NOT direct dependencies of the main application.
```toml
# Text utilities (direct dependencies)
langdetect = "^1.0.9" # Language detection
unidecode = "^1.3.7" # Unicode transliteration
ftfy = "^6.1.1" # Fix text encoding issues
rapidfuzz = "^3.0.0" # Fuzzy string matching for deduplication
# NLP libraries (used by subagents only - NOT direct dependencies)
# spacy = "^3.7.0" # ❌ Used by subagents, not main code
# transformers = "^4.34.0" # ❌ Used by subagents, not main code
# torch = "^2.1.0" # ❌ Used by subagents, not main code
```
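rapidfuzz will handle fuzzy matching in practice; as an illustration of threshold-based deduplication, a stdlib stand-in using `difflib` (the 0.85 threshold and the sample names are assumptions):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Case-insensitive ratio in [0, 1]; rapidfuzz computes the same idea, much faster
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dedupe(names: list[str], threshold: float = 0.85) -> list[str]:
    # Keep a name only if it is not near-identical to one already kept
    kept: list[str] = []
    for name in names:
        if all(similarity(name, k) < threshold for k in kept):
            kept.append(name)
    return kept

names = ["British Library", "The British Library", "British Library ", "Louvre"]
print(dedupe(names))  # ['British Library', 'Louvre']
```

Swapping `similarity` for `rapidfuzz.fuzz.ratio` (scaled to 0-100) keeps the same structure while scaling to thousands of institution names.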
#### Web Crawling & HTTP
```toml
# Web crawling
crawl4ai = "^0.1.0" # AI-powered web crawling
httpx = "^0.25.0" # Modern async HTTP client
aiohttp = "^3.8.6" # Alternative async HTTP
# HTML/XML parsing
beautifulsoup4 = "^4.12.0" # HTML parsing
lxml = "^4.9.3" # Fast XML/HTML processing
selectolax = "^0.3.17" # Fast HTML5 parser (optional)
# URL utilities
urllib3 = "^2.0.7" # HTTP primitives and URL parsing helpers
validators = "^0.22.0" # URL/email validation
```
#### Semantic Web & LinkML
```toml
# LinkML ecosystem
linkml = "^1.6.0" # Schema development tools
linkml-runtime = "^1.6.0" # Runtime validation and generation
linkml-model = "^1.6.0" # LinkML metamodel
# RDF/Semantic Web
rdflib = "^7.0.0" # RDF manipulation
pyshacl = "^0.24.0" # SHACL validation
sparqlwrapper = "^2.0.0" # SPARQL queries
```
#### Database & Storage
```toml
# Embedded databases
duckdb = "^0.9.0" # Analytical database
# sqlite3 ships with Python's standard library - no package to install
# SQL toolkit
sqlalchemy = "^2.0.0" # SQL abstraction layer
alembic = "^1.12.0" # Database migrations (if needed)
```
#### Geolocation & Geographic Data
```toml
# Geocoding
geopy = "^2.4.0" # Geocoding library
pycountry = "^22.3.5" # Country/language data
phonenumbers = "^8.13.0" # Phone number parsing
```
#### Development Tools
```toml
[tool.poetry.group.dev.dependencies]
# Code quality
ruff = "^0.1.0" # Fast linter (replaces flake8, black, isort)
mypy = "^1.6.0" # Static type checking
pre-commit = "^3.5.0" # Git hooks
# Testing
pytest = "^7.4.0" # Testing framework
pytest-cov = "^4.1.0" # Coverage plugin
pytest-asyncio = "^0.21.0" # Async testing
hypothesis = "^6.88.0" # Property-based testing
# Documentation
mkdocs = "^1.5.0" # Documentation generator
mkdocs-material = "^9.4.0" # Material theme
```
### 3. External Services & APIs
#### Required External Services
1. **None (self-contained by default)**
- System designed to work offline with conversation files only
#### Optional External Services (for enrichment)
1. **Wikidata SPARQL Endpoint**
- URL: `https://query.wikidata.org/sparql`
- Purpose: Entity linking and enrichment
- Rate limit: No official limit, be respectful
- Fallback: Can work without, just less enrichment
2. **Nominatim (OpenStreetMap Geocoding)**
- URL: `https://nominatim.openstreetmap.org`
- Purpose: Geocoding addresses
- Rate limit: 1 request/second for public instance
- Fallback: Can skip geocoding
- Alternative: Self-hosted Nominatim instance
3. **VIAF (Virtual International Authority File)**
- URL: `https://www.viaf.org/`
- Purpose: Authority linking for organizations
- Rate limit: Reasonable use
- Fallback: Can work without
4. **OCLC APIs** (future)
- Purpose: Library registry lookups
- Requires: API key
- Fallback: Optional enrichment only
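Every service above is optional, so the fallback behavior can be captured in one wrapper; `enrich_with` and the stub lookup below are hypothetical names, not project API:

```python
from collections.abc import Callable
import logging

logger = logging.getLogger("glam.enrichment")

def enrich_with(record: dict, service: Callable[[dict], dict], name: str) -> dict:
    """Apply an optional enrichment service; on any failure, return the record unchanged."""
    try:
        return service(record)
    except Exception as exc:  # network errors, rate limits, timeouts ...
        logger.warning("enrichment %s skipped: %s", name, exc)
        return record

def wikidata_lookup(record: dict) -> dict:
    # Stand-in for a real SPARQL lookup; here it simulates an unreachable endpoint
    raise TimeoutError("query.wikidata.org timed out")

record = {"name": "Louvre"}
print(enrich_with(record, wikidata_lookup, "wikidata"))  # {'name': 'Louvre'}
```

Combined with caching, this gives the graceful degradation called for in the risk section: enrichment improves records when services are up and never blocks extraction when they are down.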
### 4. Data Dependencies
#### Input Data
1. **Conversation JSON Files** (REQUIRED)
- Location: `/Users/kempersc/Documents/claude/glam/*.json`
- Count: 139 files
- Format: Claude conversation export format
- Size: ~200MB total (estimated)
2. **Ontology Research Files** (REFERENCE)
- Location: `/Users/kempersc/Documents/claude/glam/ontology/*.json`
- Purpose: Schema design reference
- Count: 14 files
#### Reference Data (to be included in project)
1. **Country Codes & Names**
- Source: ISO 3166-1 (via pycountry)
- Purpose: Normalize country names
- Format: Built into library
2. **Language Codes**
- Source: ISO 639 (via pycountry)
- Purpose: Identify content languages
- Format: Built into library
3. **GLAM Type Vocabulary**
- Source: Custom controlled vocabulary
- Purpose: Standardize institution types
- Format: LinkML enum
- Location: `data/vocabularies/institution_types.yaml`
4. **Metadata Standards Registry**
- Source: Curated list
- Purpose: Recognize metadata standards mentioned in text
- Format: YAML
- Location: `data/vocabularies/metadata_standards.yaml`
- Examples: Dublin Core, EAD, MARC21, LIDO, etc.
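A possible shape for `metadata_standards.yaml` (keys and entries are illustrative only, not the final vocabulary):

```yaml
# data/vocabularies/metadata_standards.yaml (illustrative structure)
standards:
  - id: dublin_core
    label: Dublin Core
    aliases: [DC, DCMI Metadata Terms]
  - id: marc21
    label: MARC 21
    aliases: [MARC, MARC21]
  - id: ead
    label: EAD
    aliases: [Encoded Archival Description]
```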
#### Downloadable Reference Data (optional)
1. **ISIL Registry**
- Source: Library of Congress
- URL: `https://www.loc.gov/marc/organizations/orgshome.html`
- Purpose: Validate ISIL codes
- Format: MARC XML
- Frequency: Annual updates
- Storage: `data/reference/isil/`
2. **Wikidata Dumps** (for offline work)
- Source: Wikidata
- URL: `https://dumps.wikimedia.org/wikidatawiki/entities/`
- Purpose: Offline entity linking
- Format: JSON
- Size: Very large (100GB+)
- Recommendation: Use SPARQL endpoint instead
### 5. Model Dependencies
#### spaCy Models
```bash
# Required for English content
python -m spacy download en_core_web_lg
# Recommended for multilingual content
python -m spacy download xx_ent_wiki_sm # Multilingual NER
python -m spacy download es_core_news_lg # Spanish
python -m spacy download fr_core_news_lg # French
python -m spacy download de_core_news_lg # German
python -m spacy download nl_core_news_lg # Dutch
python -m spacy download ja_core_news_lg # Japanese
python -m spacy download zh_core_web_lg # Chinese
```
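Subagents that run spaCy locally need to map detected language codes to the models above; a minimal sketch over those model names (the multilingual fallback choice is an assumption):

```python
# Maps ISO 639-1 codes (e.g. from langdetect) to the spaCy models listed above
SPACY_MODELS = {
    "en": "en_core_web_lg",
    "es": "es_core_news_lg",
    "fr": "fr_core_news_lg",
    "de": "de_core_news_lg",
    "nl": "nl_core_news_lg",
    "ja": "ja_core_news_lg",
    "zh": "zh_core_web_lg",
}

def model_for(lang_code: str) -> str:
    # Fall back to the small multilingual NER model for unlisted languages
    return SPACY_MODELS.get(lang_code, "xx_ent_wiki_sm")

print(model_for("nl"), model_for("sv"))  # nl_core_news_lg xx_ent_wiki_sm
```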
#### Transformer Models (optional, for better NER)
```python
# Hugging Face models (downloaded on first use by the subagents)
from transformers import pipeline

# General English NER
ner = pipeline("ner", model="dslim/bert-base-NER")
# Multilingual NER
# ner = pipeline("ner", model="Babelscape/wikineural-multilingual-ner")
```
### 6. Schema Dependencies
#### Imported Schemas (LinkML)
1. **linkml:types**
- Source: LinkML standard library
- Purpose: Base types (string, integer, etc.)
2. **Schema.org Vocabulary**
- Source: https://schema.org/
- Subset: Organization, Place, WebSite, CreativeWork
- Format: LinkML representation
- Location: `schemas/imports/schema_org_subset.yaml`
3. **W3C Organization Ontology (org) / Core Public Organization Vocabulary**
- Source: W3C
- URL: https://www.w3.org/ns/org
- Format: LinkML representation
- Location: `schemas/imports/cpoc_subset.yaml`
4. **TOOI (Dutch Organizations Ontology)**
- Source: Dutch government
- URL: https://standaarden.overheid.nl/tooi
- Format: LinkML representation (to be created)
- Location: `schemas/imports/tooi_subset.yaml`
5. **RiC-O (Records in Contexts Ontology)** (subset)
- Source: International Council on Archives
- URL: https://www.ica.org/standards/RiC/ontology
- Purpose: Archives-specific classes
- Location: `schemas/imports/rico_subset.yaml`
### 7. Operating System Dependencies
#### Cross-Platform
- Works on: macOS, Linux, Windows
- Recommended: macOS or Linux for development
#### System Tools (optional, for development)
- **Git**: Version control
- **Docker**: Containerization (future)
- **Make**: Build automation (optional)
#### File System
- Minimum disk space: 5GB (for data, models, cache)
- Recommended: 20GB (with all optional models)
## Dependency Management Strategy
### Version Pinning
```toml
# pyproject.toml
[tool.poetry.dependencies]
# Core dependencies: Use ^ (compatible version)
pandas = "^2.1.0" # Allows 2.1.x, 2.2.x, but not 3.x
# Critical dependencies: Use ~ (patch version)
linkml-runtime = "~1.6.0" # Allows 1.6.x only
# Unstable dependencies: Pin exact version
crawl4ai = "==0.1.0"
```
### Dependency Updates
- **Weekly**: Check for security updates
- **Monthly**: Update development dependencies
- **Quarterly**: Update core dependencies with testing
- **As needed**: Update spaCy models
### Dependency Security
```bash
# Check for vulnerabilities
poetry run safety check
# Update dependencies
poetry update
# Audit dependencies
poetry show --tree
```
## External Data Sources (URLs to crawl)
### Source URLs in Conversations
- **~2000+ URLs estimated** across 139 conversation files
- Types:
- Institution homepages
- Digital repositories
- National library catalogs
- Archive finding aids
- Museum collection databases
- Government heritage portals
- Wikipedia/Wikidata pages
### Crawling Constraints
- Respect robots.txt
- Rate limiting: Max 1 request/second per domain
- User-Agent: `GLAM-Dataset-Extractor/0.1.0 (+https://github.com/user/glam-dataset)`
- Timeout: 30 seconds per request
- Retry: Max 3 attempts with exponential backoff
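The per-domain rate limit and retry schedule above can be sketched with the stdlib alone (class and function names are illustrative; the delay values follow the constraints listed):

```python
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock          # injectable for testing
        self.sleep = sleep
        self.last_hit: dict[str, float] = {}

    def wait(self, url: str) -> None:
        domain = urlparse(url).netloc
        now = self.clock()
        last = self.last_hit.get(domain)
        if last is not None and now - last < self.min_interval:
            self.sleep(self.min_interval - (now - last))
        self.last_hit[domain] = self.clock()

def backoff_delays(attempts: int = 3, base: float = 1.0) -> list[float]:
    # Exponential backoff: 1s, 2s, 4s for the default three retries
    return [base * 2 ** i for i in range(attempts)]

print(backoff_delays())  # [1.0, 2.0, 4.0]
```

The injectable `clock`/`sleep` hooks keep the throttle unit-testable without real waiting; the same wrapper would sit between crawl4ai (or the httpx fallback) and the network.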
## Optional Dependencies
### Performance Optimization
```toml
# Fast JSON parsing
orjson = "^3.9.0"
# Round-trip YAML parsing (dotted package name must be quoted in TOML)
"ruamel.yaml" = "^0.18.0"
# Progress bars
tqdm = "^4.66.0"
rich = "^13.6.0" # Beautiful terminal output
```
### Export Formats
```toml
# Excel export
openpyxl = "^3.1.0"
# XML generation
xmltodict = "^0.13.0"
```
### Monitoring & Observability
```toml
# Structured logging
structlog = "^23.2.0"
# Metrics (future)
prometheus-client = "^0.18.0"
```
## Dependency Graph
```
Global GLAM Extractor
├─ Data Processing
│ ├─ pandas (tabular data)
│ ├─ pydantic (validation)
│ └─ duckdb (analytics)
├─ NLP Pipeline (spaCy/transformers/torch run inside subagents; langdetect is direct)
│ ├─ spaCy (core NLP)
│ │ └─ spacy models (language-specific)
│ ├─ transformers (advanced NER)
│ │ └─ torch (ML backend)
│ └─ langdetect (language ID)
├─ Web Crawling
│ ├─ crawl4ai (AI crawling)
│ ├─ httpx (HTTP client)
│ ├─ beautifulsoup4 (HTML parsing)
│ └─ lxml (XML processing)
├─ Semantic Web
│ ├─ linkml-runtime (schema validation)
│ ├─ rdflib (RDF manipulation)
│ └─ sparqlwrapper (SPARQL queries)
├─ Enrichment Services
│ ├─ geopy (geocoding)
│ ├─ pycountry (country/language data)
│ └─ External APIs (Wikidata, VIAF)
└─ Development
├─ pytest (testing)
├─ ruff (linting)
├─ mypy (type checking)
└─ mkdocs (documentation)
```
## Installation
### Complete Installation
```bash
# Clone repository
git clone https://github.com/user/glam-dataset.git
cd glam-dataset
# Install with Poetry
poetry install
# Install spaCy models
poetry run python -m spacy download en_core_web_lg
poetry run python -m spacy download xx_ent_wiki_sm
# Optional: Install all language models
poetry run bash scripts/install_language_models.sh
```
### Minimal Installation (no ML models)
```bash
poetry install --without dev
poetry run python -m spacy download en_core_web_sm # Smaller model
```
### Docker Installation (future)
```bash
docker pull glamdataset/extractor:latest
docker run -v ./data:/data glamdataset/extractor
```
## Dependency Risks & Mitigation
### High-Risk Dependencies
1. **crawl4ai** (new/unstable)
- Risk: API changes, bugs
- Mitigation: Pin exact version, wrap in adapter pattern
- Fallback: Direct httpx/beautifulsoup4 implementation
2. **transformers** (large, heavy)
- Risk: Memory usage, slow startup
- Mitigation: Optional dependency, use spaCy by default
- Fallback: spaCy-only NER
### Medium-Risk Dependencies
1. **External APIs** (Wikidata, VIAF)
- Risk: Rate limiting, downtime
- Mitigation: Aggressive caching, graceful degradation
- Fallback: Skip enrichment, work with extracted data only
### Low-Risk Dependencies
1. **Core libraries** (pandas, rdflib, etc.)
- Risk: Minimal, well-maintained
- Mitigation: Regular updates
## License Compatibility
All dependencies use permissive licenses compatible with the project:
- **MIT**: pydantic, beautifulsoup4, spaCy, duckdb, geopy
- **Apache 2.0**: transformers
- **BSD**: pandas, httpx, rdflib, lxml
No GPL dependencies that would require copyleft distribution.