# Global GLAM Dataset: Dependencies

## Overview

This document catalogs all technical, data, and organizational dependencies for the Global GLAM Dataset extraction project.

## Technical Dependencies

### 1. Python Runtime

- **Python 3.11+** (required)
  - Rationale: Modern type hints, performance improvements, better error messages
  - Alternative: Python 3.10 (minimum supported)

### 2. Core Python Libraries

#### Data Processing

```toml
[tool.poetry.dependencies]
python = "^3.11"

# Data manipulation
pandas = "^2.1.0"      # Tabular data processing
polars = "^0.19.0"     # High-performance dataframes (optional)
pyarrow = "^13.0.0"    # Parquet format support

# Data validation
pydantic = "^2.4.0"    # Data validation and settings
jsonschema = "^4.19.0" # JSON schema validation
```

#### NLP & Text Processing

**NOTE**: NLP extraction (NER, entity recognition) is handled by **coding subagents** via the Task tool, not directly in the main codebase. Subagents may use spaCy, transformers, or other tools internally, but these are NOT direct dependencies of the main application.

```toml
# Text utilities (direct dependencies)
langdetect = "^1.0.9" # Language detection
unidecode = "^1.3.7"  # Unicode transliteration
ftfy = "^6.1.1"       # Fix text encoding issues
rapidfuzz = "^3.0.0"  # Fuzzy string matching for deduplication

# NLP libraries (used by subagents only - NOT direct dependencies)
# spacy = "^3.7.0"         # ❌ Used by subagents, not main code
# transformers = "^4.34.0" # ❌ Used by subagents, not main code
# torch = "^2.1.0"         # ❌ Used by subagents, not main code
```

#### Web Crawling & HTTP

```toml
# Web crawling
crawl4ai = "^0.1.0" # AI-powered web crawling
httpx = "^0.25.0"   # Modern async HTTP client
aiohttp = "^3.8.6"  # Alternative async HTTP

# HTML/XML parsing
beautifulsoup4 = "^4.12.0" # HTML parsing
lxml = "^4.9.3"            # Fast XML/HTML processing
selectolax = "^0.3.17"     # Fast HTML5 parser (optional)

# URL utilities
urllib3 = "^2.0.7"     # URL utilities
validators = "^0.22.0" # URL/email validation
```

#### Semantic Web & LinkML

```toml
# LinkML ecosystem
linkml = "^1.6.0"         # Schema development tools
linkml-runtime = "^1.6.0" # Runtime validation and generation
linkml-model = "^1.6.0"   # LinkML metamodel

# RDF/Semantic Web
rdflib = "^7.0.0"        # RDF manipulation
pyshacl = "^0.24.0"      # SHACL validation
sparqlwrapper = "^2.0.0" # SPARQL queries
```

#### Database & Storage

```toml
# Embedded databases
duckdb = "^0.9.0" # Analytical database
# sqlite3 is built into Python's standard library (no Poetry entry needed)

# SQL toolkit
sqlalchemy = "^2.0.0" # SQL abstraction layer
alembic = "^1.12.0"   # Database migrations (if needed)
```

#### Geolocation & Geographic Data

```toml
# Geocoding
geopy = "^2.4.0"         # Geocoding library
pycountry = "^22.3.5"    # Country/language data
phonenumbers = "^8.13.0" # Phone number parsing
```

#### Development Tools

```toml
[tool.poetry.group.dev.dependencies]
# Code quality
ruff = "^0.1.0"       # Fast linter (replaces flake8, black, isort)
mypy = "^1.6.0"       # Static type checking
pre-commit = "^3.5.0" # Git hooks

# Testing
pytest = "^7.4.0"          # Testing framework
pytest-cov = "^4.1.0"      # Coverage plugin
pytest-asyncio = "^0.21.0" # Async testing
hypothesis = "^6.88.0"     # Property-based testing

# Documentation
mkdocs = "^1.5.0"          # Documentation generator
mkdocs-material = "^9.4.0" # Material theme
```

### 3. External Services & APIs

#### Required External Services

1. **None (self-contained by default)**
   - The system is designed to work offline with conversation files only

#### Optional External Services (for enrichment)

1. **Wikidata SPARQL Endpoint**
   - URL: `https://query.wikidata.org/sparql`
   - Purpose: Entity linking and enrichment
   - Rate limit: No official limit; be respectful
   - Fallback: Works without it, with less enrichment

2. **Nominatim (OpenStreetMap Geocoding)**
   - URL: `https://nominatim.openstreetmap.org`
   - Purpose: Geocoding addresses
   - Rate limit: 1 request/second for the public instance
   - Fallback: Skip geocoding
   - Alternative: Self-hosted Nominatim instance

3. **VIAF (Virtual International Authority File)**
   - URL: `https://www.viaf.org/`
   - Purpose: Authority linking for organizations
   - Rate limit: Reasonable use
   - Fallback: Works without it

4. **OCLC APIs** (future)
   - Purpose: Library registry lookups
   - Requires: API key
   - Fallback: Optional enrichment only

### 4. Data Dependencies

#### Input Data

1. **Conversation JSON Files** (REQUIRED)
   - Location: `/Users/kempersc/Documents/claude/glam/*.json`
   - Count: 139 files
   - Format: Claude conversation export format
   - Size: ~200 MB total (estimated)

2. **Ontology Research Files** (REFERENCE)
   - Location: `/Users/kempersc/Documents/claude/glam/ontology/*.json`
   - Purpose: Schema design reference
   - Count: 14 files

#### Reference Data (to be included in project)

1. **Country Codes & Names**
   - Source: ISO 3166-1 (via pycountry)
   - Purpose: Normalize country names
   - Format: Built into library

2. **Language Codes**
   - Source: ISO 639 (via pycountry)
   - Purpose: Identify content languages
   - Format: Built into library

3. **GLAM Type Vocabulary**
   - Source: Custom controlled vocabulary
   - Purpose: Standardize institution types
   - Format: LinkML enum
   - Location: `data/vocabularies/institution_types.yaml`

4. **Metadata Standards Registry**
   - Source: Curated list
   - Purpose: Recognize metadata standards mentioned in text
   - Format: YAML
   - Location: `data/vocabularies/metadata_standards.yaml`
   - Examples: Dublin Core, EAD, MARC21, LIDO, etc.

#### Downloadable Reference Data (optional)

1. **ISIL Registry**
   - Source: Library of Congress
   - URL: `https://www.loc.gov/marc/organizations/orgshome.html`
   - Purpose: Validate ISIL codes
   - Format: MARC XML
   - Frequency: Annual updates
   - Storage: `data/reference/isil/`

2. **Wikidata Dumps** (for offline work)
   - Source: Wikidata
   - URL: `https://dumps.wikimedia.org/wikidatawiki/entities/`
   - Purpose: Offline entity linking
   - Format: JSON
   - Size: Very large (100 GB+)
   - Recommendation: Use the SPARQL endpoint instead

### 5. Model Dependencies

The model downloads below apply to the subagent environments that perform NLP extraction (see the note under *NLP & Text Processing*); they are not dependencies of the main application.

#### spaCy Models

```bash
# Required for English content
python -m spacy download en_core_web_lg

# Recommended for multilingual content
python -m spacy download xx_ent_wiki_sm  # Multilingual NER
python -m spacy download es_core_news_lg # Spanish
python -m spacy download fr_core_news_lg # French
python -m spacy download de_core_news_lg # German
python -m spacy download nl_core_news_lg # Dutch
python -m spacy download ja_core_news_lg # Japanese
python -m spacy download zh_core_web_lg  # Chinese
```

#### Transformer Models (optional, for better NER)

```python
# Hugging Face models (downloaded on first use)
from transformers import pipeline

# Organization NER - general English model
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# Multilingual NER alternative
# ner = pipeline("ner", model="Babelscape/wikineural-multilingual-ner",
#                aggregation_strategy="simple")
```

### 6. Schema Dependencies

#### Imported Schemas (LinkML)

1. **linkml:types**
   - Source: LinkML standard library
   - Purpose: Base types (string, integer, etc.)

2. **Schema.org Vocabulary**
   - Source: https://schema.org/
   - Subset: Organization, Place, WebSite, CreativeWork
   - Format: LinkML representation
   - Location: `schemas/imports/schema_org_subset.yaml`

3. **CPOC (Core Public Organization Vocabulary)**
   - Source: W3C
   - URL: https://www.w3.org/ns/org
   - Format: LinkML representation
   - Location: `schemas/imports/cpoc_subset.yaml`

4. **TOOI (Dutch Organizations Ontology)**
   - Source: Dutch government
   - URL: https://standaarden.overheid.nl/tooi
   - Format: LinkML representation (to be created)
   - Location: `schemas/imports/tooi_subset.yaml`

5. **RiC-O (Records in Contexts Ontology)** (subset)
   - Source: International Council on Archives
   - URL: https://www.ica.org/standards/RiC/ontology
   - Purpose: Archives-specific classes
   - Location: `schemas/imports/rico_subset.yaml`

### 7. Operating System Dependencies

#### Cross-Platform

- Works on: macOS, Linux, Windows
- Recommended: macOS or Linux for development

#### System Tools (optional, for development)

- **Git**: Version control
- **Docker**: Containerization (future)
- **Make**: Build automation (optional)

#### File System

- Minimum disk space: 5 GB (for data, models, cache)
- Recommended: 20 GB (with all optional models)

## Dependency Management Strategy

### Version Pinning

```toml
# pyproject.toml
[tool.poetry.dependencies]
# Core dependencies: Use ^ (compatible version)
pandas = "^2.1.0"         # Allows 2.1.x, 2.2.x, but not 3.x

# Critical dependencies: Use ~ (patch version)
linkml-runtime = "~1.6.0" # Allows 1.6.x only

# Unstable dependencies: Pin the exact version
crawl4ai = "0.1.0"
```

### Dependency Updates

- **Weekly**: Check for security updates
- **Monthly**: Update development dependencies
- **Quarterly**: Update core dependencies, with testing
- **As needed**: Update spaCy models

### Dependency Security

```bash
# Check for vulnerabilities
poetry run safety check

# Update dependencies
poetry update

# Audit the dependency tree
poetry show --tree
```

## External Data Sources (URLs to crawl)

### Source URLs in Conversations

- **~2,000 URLs (estimated)** across 139 conversation files
- Types:
  - Institution homepages
  - Digital repositories
  - National library catalogs
  - Archive finding aids
  - Museum collection databases
  - Government heritage portals
  - Wikipedia/Wikidata pages

### Crawling Constraints

- Respect robots.txt
- Rate limiting: Max 1 request/second per domain
- User-Agent: `GLAM-Dataset-Extractor/0.1.0 (+https://github.com/user/glam-dataset)`
- Timeout: 30 seconds per request
- Retry: Max 3 attempts with exponential backoff
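
The retry rule above can be sketched with a small stdlib-only helper; `fetch_fn` is a placeholder for whatever request function the crawler uses, and the delay schedule is kept as a pure function so it is easy to test:

```python
import time

def backoff_delays(attempts: int = 3, base: float = 1.0) -> list[float]:
    # Exponential backoff: 1s, 2s, 4s for the default 3 attempts
    return [base * (2 ** i) for i in range(attempts)]

def fetch_with_retry(fetch_fn, url: str, attempts: int = 3):
    last_error = None
    for i, delay in enumerate(backoff_delays(attempts)):
        try:
            return fetch_fn(url)
        except Exception as exc:  # narrow to network errors in real code
            last_error = exc
            if i < attempts - 1:  # no sleep after the final failure
                time.sleep(delay)
    raise last_error

print(backoff_delays())  # [1.0, 2.0, 4.0]
```

Per-domain rate limiting (max 1 request/second) would sit on top of this, e.g. by recording the last request timestamp per hostname and sleeping the remainder of the interval.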

## Optional Dependencies

### Performance Optimization

```toml
# Fast JSON parsing
orjson = "^3.9.0"

# Fast YAML parsing
"ruamel.yaml" = "^0.18.0"

# Progress bars
tqdm = "^4.66.0"
rich = "^13.6.0" # Beautiful terminal output
```

### Export Formats

```toml
# Excel export
openpyxl = "^3.1.0"

# XML generation
xmltodict = "^0.13.0"
```

### Monitoring & Observability

```toml
# Structured logging
structlog = "^23.2.0"

# Metrics (future)
prometheus-client = "^0.18.0"
```

## Dependency Graph

```
Global GLAM Extractor
│
├─ Data Processing
│  ├─ pandas (tabular data)
│  ├─ pydantic (validation)
│  └─ duckdb (analytics)
│
├─ NLP Pipeline
│  ├─ spaCy (core NLP)
│  │  └─ spaCy models (language-specific)
│  ├─ transformers (advanced NER)
│  │  └─ torch (ML backend)
│  └─ langdetect (language ID)
│
├─ Web Crawling
│  ├─ crawl4ai (AI crawling)
│  ├─ httpx (HTTP client)
│  ├─ beautifulsoup4 (HTML parsing)
│  └─ lxml (XML processing)
│
├─ Semantic Web
│  ├─ linkml-runtime (schema validation)
│  ├─ rdflib (RDF manipulation)
│  └─ sparqlwrapper (SPARQL queries)
│
├─ Enrichment Services
│  ├─ geopy (geocoding)
│  ├─ pycountry (country/language data)
│  └─ External APIs (Wikidata, VIAF)
│
└─ Development
   ├─ pytest (testing)
   ├─ ruff (linting)
   ├─ mypy (type checking)
   └─ mkdocs (documentation)
```

## Installation

### Complete Installation

```bash
# Clone repository
git clone https://github.com/user/glam-dataset.git
cd glam-dataset

# Install with Poetry
poetry install

# Install spaCy models
poetry run python -m spacy download en_core_web_lg
poetry run python -m spacy download xx_ent_wiki_sm

# Optional: Install all language models
poetry run bash scripts/install_language_models.sh
```

### Minimal Installation (small models only)

```bash
poetry install --without dev
poetry run python -m spacy download en_core_web_sm # Smaller model
```

### Docker Installation (future)

```bash
docker pull glamdataset/extractor:latest
docker run -v "$(pwd)/data:/data" glamdataset/extractor
```

## Dependency Risks & Mitigation

### High-Risk Dependencies

1. **crawl4ai** (new/unstable)
   - Risk: API changes, bugs
   - Mitigation: Pin the exact version; wrap it behind an adapter
   - Fallback: Direct httpx/beautifulsoup4 implementation

2. **transformers** (large, heavy)
   - Risk: Memory usage, slow startup
   - Mitigation: Optional dependency; use spaCy by default
   - Fallback: spaCy-only NER

### Medium-Risk Dependencies

1. **External APIs** (Wikidata, VIAF)
   - Risk: Rate limiting, downtime
   - Mitigation: Aggressive caching, graceful degradation
   - Fallback: Skip enrichment; work with extracted data only

### Low-Risk Dependencies

1. **Core libraries** (pandas, rdflib, etc.)
   - Risk: Minimal; well-maintained
   - Mitigation: Regular updates

## License Compatibility

All dependencies use permissive licenses compatible with the project:

- **MIT**: pydantic, beautifulsoup4, spaCy, duckdb, geopy
- **Apache 2.0**: transformers
- **BSD**: pandas, httpx, rdflib, lxml

No GPL dependencies that would require copyleft distribution.