glam/docs/plan/global_glam/03-dependencies.md
2025-11-19 23:25:22 +01:00


# Global GLAM Dataset: Dependencies
## Overview
This document catalogs all technical, data, and organizational dependencies for the Global GLAM Dataset extraction project.
## Technical Dependencies
### 1. Python Runtime
- **Python 3.11+** (required)
- Rationale: Modern type hints, performance improvements, better error messages
- Alternative: Python 3.10 (minimum supported)
### 2. Core Python Libraries
#### Data Processing
```toml
[tool.poetry.dependencies]
python = "^3.11"
# Data manipulation
pandas = "^2.1.0" # Tabular data processing
polars = "^0.19.0" # High-performance dataframes (optional)
pyarrow = "^13.0.0" # Parquet format support
# Data validation
pydantic = "^2.4.0" # Data validation and settings
jsonschema = "^4.19.0" # JSON schema validation
```
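As a concrete sketch of the validation layer (pydantic in the real project), a stdlib-only stand-in; the `InstitutionRecord` fields are illustrative, not the project schema:

```python
from dataclasses import dataclass, field

# Hypothetical record shape - field names are illustrative, not the project schema
@dataclass
class InstitutionRecord:
    name: str
    country_code: str                 # ISO 3166-1 alpha-2
    urls: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject obviously malformed records before they enter the pipeline
        if not self.name.strip():
            raise ValueError("name must be non-empty")
        if len(self.country_code) != 2 or not self.country_code.isalpha():
            raise ValueError(f"invalid ISO 3166-1 alpha-2 code: {self.country_code!r}")
        self.country_code = self.country_code.upper()

record = InstitutionRecord(name="Rijksmuseum", country_code="nl")
print(record.country_code)  # normalized to "NL"
```

pydantic adds the same checks declaratively (types, validators) plus settings management, which is why it is the planned dependency.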
#### NLP & Text Processing
**NOTE**: NLP extraction (NER, entity recognition) is handled by **coding subagents** via the Task tool, not directly in the main codebase. Subagents may use spaCy, transformers, or other tools internally, but these are NOT direct dependencies of the main application.
```toml
# Text utilities (direct dependencies)
langdetect = "^1.0.9" # Language detection
unidecode = "^1.3.7" # Unicode transliteration
ftfy = "^6.1.1" # Fix text encoding issues
rapidfuzz = "^3.0.0" # Fuzzy string matching for deduplication
# NLP libraries (used by subagents only - NOT direct dependencies)
# spacy = "^3.7.0" # ❌ Used by subagents, not main code
# transformers = "^4.34.0" # ❌ Used by subagents, not main code
# torch = "^2.1.0" # ❌ Used by subagents, not main code
```
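rapidfuzz will handle fuzzy matching in practice; as an illustration of threshold-based deduplication, a stdlib stand-in using `difflib` (the 0.85 threshold and the sample names are assumptions):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Case-insensitive ratio in [0, 1]; rapidfuzz computes the same idea, much faster
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dedupe(names: list[str], threshold: float = 0.85) -> list[str]:
    # Keep a name only if it is not near-identical to one already kept
    kept: list[str] = []
    for name in names:
        if all(similarity(name, k) < threshold for k in kept):
            kept.append(name)
    return kept

names = ["British Library", "The British Library", "British Library ", "Louvre"]
print(dedupe(names))  # ['British Library', 'Louvre']
```

Swapping `similarity` for `rapidfuzz.fuzz.ratio` (scaled to 0-100) keeps the same structure while scaling to thousands of institution names.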
#### Web Crawling & HTTP
```toml
# Web crawling
crawl4ai = "^0.1.0" # AI-powered web crawling
httpx = "^0.25.0" # Modern async HTTP client
aiohttp = "^3.8.6" # Alternative async HTTP
# HTML/XML parsing
beautifulsoup4 = "^4.12.0" # HTML parsing
lxml = "^4.9.3" # Fast XML/HTML processing
selectolax = "^0.3.17" # Fast HTML5 parser (optional)
# URL utilities
urllib3 = "^2.0.7" # HTTP primitives and URL parsing helpers
validators = "^0.22.0" # URL/email validation
```
#### Semantic Web & LinkML
```toml
# LinkML ecosystem
linkml = "^1.6.0" # Schema development tools
linkml-runtime = "^1.6.0" # Runtime validation and generation
linkml-model = "^1.6.0" # LinkML metamodel
# RDF/Semantic Web
rdflib = "^7.0.0" # RDF manipulation
pyshacl = "^0.24.0" # SHACL validation
sparqlwrapper = "^2.0.0" # SPARQL queries
```
#### Database & Storage
```toml
# Embedded databases
duckdb = "^0.9.0" # Analytical database
# sqlite3 ships with Python's standard library - no package to install
# SQL toolkit
sqlalchemy = "^2.0.0" # SQL abstraction layer
alembic = "^1.12.0" # Database migrations (if needed)
```
#### Geolocation & Geographic Data
```toml
# Geocoding
geopy = "^2.4.0" # Geocoding library
pycountry = "^22.3.5" # Country/language data
phonenumbers = "^8.13.0" # Phone number parsing
```
#### Development Tools
```toml
[tool.poetry.group.dev.dependencies]
# Code quality
ruff = "^0.1.0" # Fast linter (replaces flake8, black, isort)
mypy = "^1.6.0" # Static type checking
pre-commit = "^3.5.0" # Git hooks
# Testing
pytest = "^7.4.0" # Testing framework
pytest-cov = "^4.1.0" # Coverage plugin
pytest-asyncio = "^0.21.0" # Async testing
hypothesis = "^6.88.0" # Property-based testing
# Documentation
mkdocs = "^1.5.0" # Documentation generator
mkdocs-material = "^9.4.0" # Material theme
```
### 3. External Services & APIs
#### Required External Services
1. **None (self-contained by default)**
- System designed to work offline with conversation files only
#### Optional External Services (for enrichment)
1. **Wikidata SPARQL Endpoint**
- URL: `https://query.wikidata.org/sparql`
- Purpose: Entity linking and enrichment
- Rate limit: No official limit, be respectful
- Fallback: Can work without, just less enrichment
2. **Nominatim (OpenStreetMap Geocoding)**
- URL: `https://nominatim.openstreetmap.org`
- Purpose: Geocoding addresses
- Rate limit: 1 request/second for public instance
- Fallback: Can skip geocoding
- Alternative: Self-hosted Nominatim instance
3. **VIAF (Virtual International Authority File)**
- URL: `https://www.viaf.org/`
- Purpose: Authority linking for organizations
- Rate limit: Reasonable use
- Fallback: Can work without
4. **OCLC APIs** (future)
- Purpose: Library registry lookups
- Requires: API key
- Fallback: Optional enrichment only
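Every service above is optional, so the fallback behavior can be captured in one wrapper; `enrich_with` and the stub lookup below are hypothetical names, not project API:

```python
from collections.abc import Callable
import logging

logger = logging.getLogger("glam.enrichment")

def enrich_with(record: dict, service: Callable[[dict], dict], name: str) -> dict:
    """Apply an optional enrichment service; on any failure, return the record unchanged."""
    try:
        return service(record)
    except Exception as exc:  # network errors, rate limits, timeouts ...
        logger.warning("enrichment %s skipped: %s", name, exc)
        return record

def wikidata_lookup(record: dict) -> dict:
    # Stand-in for a real SPARQL lookup; here it simulates an unreachable endpoint
    raise TimeoutError("query.wikidata.org timed out")

record = {"name": "Louvre"}
print(enrich_with(record, wikidata_lookup, "wikidata"))  # {'name': 'Louvre'}
```

Combined with caching, this gives the graceful degradation called for in the risk section: enrichment improves records when services are up and never blocks extraction when they are down.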
### 4. Data Dependencies
#### Input Data
1. **Conversation JSON Files** (REQUIRED)
- Location: `/Users/kempersc/Documents/claude/glam/*.json`
- Count: 139 files
- Format: Claude conversation export format
- Size: ~200MB total (estimated)
2. **Ontology Research Files** (REFERENCE)
- Location: `/Users/kempersc/Documents/claude/glam/ontology/*.json`
- Purpose: Schema design reference
- Count: 14 files
#### Reference Data (to be included in project)
1. **Country Codes & Names**
- Source: ISO 3166-1 (via pycountry)
- Purpose: Normalize country names
- Format: Built into library
2. **Language Codes**
- Source: ISO 639 (via pycountry)
- Purpose: Identify content languages
- Format: Built into library
3. **GLAM Type Vocabulary**
- Source: Custom controlled vocabulary
- Purpose: Standardize institution types
- Format: LinkML enum
- Location: `data/vocabularies/institution_types.yaml`
4. **Metadata Standards Registry**
- Source: Curated list
- Purpose: Recognize metadata standards mentioned in text
- Format: YAML
- Location: `data/vocabularies/metadata_standards.yaml`
- Examples: Dublin Core, EAD, MARC21, LIDO, etc.
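A possible shape for `metadata_standards.yaml` (keys and entries are illustrative only, not the final vocabulary):

```yaml
# data/vocabularies/metadata_standards.yaml (illustrative structure)
standards:
  - id: dublin_core
    label: Dublin Core
    aliases: [DC, DCMI Metadata Terms]
  - id: marc21
    label: MARC 21
    aliases: [MARC, MARC21]
  - id: ead
    label: EAD
    aliases: [Encoded Archival Description]
```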
#### Downloadable Reference Data (optional)
1. **ISIL Registry**
- Source: Library of Congress
- URL: `https://www.loc.gov/marc/organizations/orgshome.html`
- Purpose: Validate ISIL codes
- Format: MARC XML
- Frequency: Annual updates
- Storage: `data/reference/isil/`
2. **Wikidata Dumps** (for offline work)
- Source: Wikidata
- URL: `https://dumps.wikimedia.org/wikidatawiki/entities/`
- Purpose: Offline entity linking
- Format: JSON
- Size: Very large (100GB+)
- Recommendation: Use SPARQL endpoint instead
### 5. Model Dependencies
#### spaCy Models
```bash
# Required for English content
python -m spacy download en_core_web_lg
# Recommended for multilingual content
python -m spacy download xx_ent_wiki_sm # Multilingual NER
python -m spacy download es_core_news_lg # Spanish
python -m spacy download fr_core_news_lg # French
python -m spacy download de_core_news_lg # German
python -m spacy download nl_core_news_lg # Dutch
python -m spacy download ja_core_news_lg # Japanese
python -m spacy download zh_core_web_lg # Chinese
```
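Subagents that run spaCy locally need to map detected language codes to the models above; a minimal sketch over those model names (the multilingual fallback choice is an assumption):

```python
# Maps ISO 639-1 codes (e.g. from langdetect) to the spaCy models listed above
SPACY_MODELS = {
    "en": "en_core_web_lg",
    "es": "es_core_news_lg",
    "fr": "fr_core_news_lg",
    "de": "de_core_news_lg",
    "nl": "nl_core_news_lg",
    "ja": "ja_core_news_lg",
    "zh": "zh_core_web_lg",
}

def model_for(lang_code: str) -> str:
    # Fall back to the small multilingual NER model for unlisted languages
    return SPACY_MODELS.get(lang_code, "xx_ent_wiki_sm")

print(model_for("nl"), model_for("sv"))  # nl_core_news_lg xx_ent_wiki_sm
```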
#### Transformer Models (optional, for better NER)
```python
# Hugging Face models (downloaded on first use by the subagents)
from transformers import pipeline

# General English NER
ner = pipeline("ner", model="dslim/bert-base-NER")
# Multilingual NER
# ner = pipeline("ner", model="Babelscape/wikineural-multilingual-ner")
```
### 6. Schema Dependencies
#### Imported Schemas (LinkML)
1. **linkml:types**
- Source: LinkML standard library
- Purpose: Base types (string, integer, etc.)
2. **Schema.org Vocabulary**
- Source: https://schema.org/
- Subset: Organization, Place, WebSite, CreativeWork
- Format: LinkML representation
- Location: `schemas/imports/schema_org_subset.yaml`
3. **W3C Organization Ontology (org) / Core Public Organization Vocabulary**
- Source: W3C
- URL: https://www.w3.org/ns/org
- Format: LinkML representation
- Location: `schemas/imports/cpoc_subset.yaml`
4. **TOOI (Dutch Organizations Ontology)**
- Source: Dutch government
- URL: https://standaarden.overheid.nl/tooi
- Format: LinkML representation (to be created)
- Location: `schemas/imports/tooi_subset.yaml`
5. **RiC-O (Records in Contexts Ontology)** (subset)
- Source: International Council on Archives
- URL: https://www.ica.org/standards/RiC/ontology
- Purpose: Archives-specific classes
- Location: `schemas/imports/rico_subset.yaml`
### 7. Operating System Dependencies
#### Cross-Platform
- Works on: macOS, Linux, Windows
- Recommended: macOS or Linux for development
#### System Tools (optional, for development)
- **Git**: Version control
- **Docker**: Containerization (future)
- **Make**: Build automation (optional)
#### File System
- Minimum disk space: 5GB (for data, models, cache)
- Recommended: 20GB (with all optional models)
## Dependency Management Strategy
### Version Pinning
```toml
# pyproject.toml
[tool.poetry.dependencies]
# Core dependencies: Use ^ (compatible version)
pandas = "^2.1.0" # Allows 2.1.x, 2.2.x, but not 3.x
# Critical dependencies: Use ~ (patch version)
linkml-runtime = "~1.6.0" # Allows 1.6.x only
# Unstable dependencies: Pin exact version
crawl4ai = "==0.1.0"
```
### Dependency Updates
- **Weekly**: Check for security updates
- **Monthly**: Update development dependencies
- **Quarterly**: Update core dependencies with testing
- **As needed**: Update spaCy models
### Dependency Security
```bash
# Check for vulnerabilities
poetry run safety check
# Update dependencies
poetry update
# Audit dependencies
poetry show --tree
```
## External Data Sources (URLs to crawl)
### Source URLs in Conversations
- **~2000+ URLs estimated** across 139 conversation files
- Types:
- Institution homepages
- Digital repositories
- National library catalogs
- Archive finding aids
- Museum collection databases
- Government heritage portals
- Wikipedia/Wikidata pages
### Crawling Constraints
- Respect robots.txt
- Rate limiting: Max 1 request/second per domain
- User-Agent: `GLAM-Dataset-Extractor/0.1.0 (+https://github.com/user/glam-dataset)`
- Timeout: 30 seconds per request
- Retry: Max 3 attempts with exponential backoff
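The per-domain rate limit and retry schedule above can be sketched with the stdlib alone (class and function names are illustrative; the delay values follow the constraints listed):

```python
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock          # injectable for testing
        self.sleep = sleep
        self.last_hit: dict[str, float] = {}

    def wait(self, url: str) -> None:
        domain = urlparse(url).netloc
        now = self.clock()
        last = self.last_hit.get(domain)
        if last is not None and now - last < self.min_interval:
            self.sleep(self.min_interval - (now - last))
        self.last_hit[domain] = self.clock()

def backoff_delays(attempts: int = 3, base: float = 1.0) -> list[float]:
    # Exponential backoff: 1s, 2s, 4s for the default three retries
    return [base * 2 ** i for i in range(attempts)]

print(backoff_delays())  # [1.0, 2.0, 4.0]
```

The injectable `clock`/`sleep` hooks keep the throttle unit-testable without real waiting; the same wrapper would sit between crawl4ai (or the httpx fallback) and the network.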
## Optional Dependencies
### Performance Optimization
```toml
# Fast JSON parsing
orjson = "^3.9.0"
# Round-trip YAML parsing (dotted package name must be quoted in TOML)
"ruamel.yaml" = "^0.18.0"
# Progress bars
tqdm = "^4.66.0"
rich = "^13.6.0" # Beautiful terminal output
```
### Export Formats
```toml
# Excel export
openpyxl = "^3.1.0"
# XML generation
xmltodict = "^0.13.0"
```
### Monitoring & Observability
```toml
# Structured logging
structlog = "^23.2.0"
# Metrics (future)
prometheus-client = "^0.18.0"
```
## Dependency Graph
```
Global GLAM Extractor
├─ Data Processing
│ ├─ pandas (tabular data)
│ ├─ pydantic (validation)
│ └─ duckdb (analytics)
├─ NLP Pipeline (spaCy/transformers/torch run inside subagents; langdetect is direct)
│ ├─ spaCy (core NLP)
│ │ └─ spacy models (language-specific)
│ ├─ transformers (advanced NER)
│ │ └─ torch (ML backend)
│ └─ langdetect (language ID)
├─ Web Crawling
│ ├─ crawl4ai (AI crawling)
│ ├─ httpx (HTTP client)
│ ├─ beautifulsoup4 (HTML parsing)
│ └─ lxml (XML processing)
├─ Semantic Web
│ ├─ linkml-runtime (schema validation)
│ ├─ rdflib (RDF manipulation)
│ └─ sparqlwrapper (SPARQL queries)
├─ Enrichment Services
│ ├─ geopy (geocoding)
│ ├─ pycountry (country/language data)
│ └─ External APIs (Wikidata, VIAF)
└─ Development
├─ pytest (testing)
├─ ruff (linting)
├─ mypy (type checking)
└─ mkdocs (documentation)
```
## Installation
### Complete Installation
```bash
# Clone repository
git clone https://github.com/user/glam-dataset.git
cd glam-dataset
# Install with Poetry
poetry install
# Install spaCy models
poetry run python -m spacy download en_core_web_lg
poetry run python -m spacy download xx_ent_wiki_sm
# Optional: Install all language models
poetry run bash scripts/install_language_models.sh
```
### Minimal Installation (no ML models)
```bash
poetry install --without dev
poetry run python -m spacy download en_core_web_sm # Smaller model
```
### Docker Installation (future)
```bash
docker pull glamdataset/extractor:latest
docker run -v ./data:/data glamdataset/extractor
```
## Dependency Risks & Mitigation
### High-Risk Dependencies
1. **crawl4ai** (new/unstable)
- Risk: API changes, bugs
- Mitigation: Pin exact version, wrap in adapter pattern
- Fallback: Direct httpx/beautifulsoup4 implementation
2. **transformers** (large, heavy)
- Risk: Memory usage, slow startup
- Mitigation: Optional dependency, use spaCy by default
- Fallback: spaCy-only NER
### Medium-Risk Dependencies
1. **External APIs** (Wikidata, VIAF)
- Risk: Rate limiting, downtime
- Mitigation: Aggressive caching, graceful degradation
- Fallback: Skip enrichment, work with extracted data only
### Low-Risk Dependencies
1. **Core libraries** (pandas, rdflib, etc.)
- Risk: Minimal, well-maintained
- Mitigation: Regular updates
## License Compatibility
All dependencies use permissive licenses compatible with the project:
- **MIT**: pydantic, beautifulsoup4, spaCy, duckdb, geopy
- **Apache 2.0**: transformers
- **BSD**: pandas, httpx, rdflib, lxml
No GPL dependencies that would require copyleft distribution.