Global GLAM Dataset: Dependencies
Overview
This document catalogs all technical, data, and organizational dependencies for the Global GLAM Dataset extraction project.
Technical Dependencies
1. Python Runtime
- Python 3.11+ (required)
- Rationale: Modern type hints, performance improvements, better error messages
- Alternative: Python 3.10 (minimum supported)
2. Core Python Libraries
Data Processing
[tool.poetry.dependencies]
python = "^3.11"
# Data manipulation
pandas = "^2.1.0" # Tabular data processing
polars = "^0.19.0" # High-performance dataframes (optional)
pyarrow = "^13.0.0" # Parquet format support
# Data validation
pydantic = "^2.4.0" # Data validation and settings
jsonschema = "^4.19.0" # JSON schema validation
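The validation role pydantic plays here can be sketched with a stdlib dataclass; `InstitutionRecord` and its fields are illustrative, not the project's actual schema:

```python
from dataclasses import dataclass

# Stdlib sketch of the record validation pydantic would provide.
# The model name and fields are hypothetical examples.
@dataclass
class InstitutionRecord:
    name: str
    country_code: str  # ISO 3166-1 alpha-2
    url: str = ""

    def __post_init__(self) -> None:
        if not self.name.strip():
            raise ValueError("name must be non-empty")
        if len(self.country_code) != 2 or not self.country_code.isalpha():
            raise ValueError(f"invalid country code: {self.country_code!r}")
        # Normalize to uppercase ISO form
        self.country_code = self.country_code.upper()

rec = InstitutionRecord(name="Rijksmuseum", country_code="nl")
```

A pydantic `BaseModel` would add the same checks declaratively, plus JSON (de)serialization.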
NLP & Text Processing
NOTE: NLP extraction (NER, entity recognition) is handled by coding subagents via the Task tool, not directly in the main codebase. Subagents may use spaCy, transformers, or other tools internally, but these are NOT direct dependencies of the main application.
# Text utilities (direct dependencies)
langdetect = "^1.0.9" # Language detection
unidecode = "^1.3.7" # Unicode transliteration
ftfy = "^6.1.1" # Fix text encoding issues
rapidfuzz = "^3.0.0" # Fuzzy string matching for deduplication
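The deduplication role of rapidfuzz can be illustrated with the stdlib's `difflib` (rapidfuzz's `fuzz.ratio` plays the same role, scaled 0-100 and much faster); the names and threshold below are illustrative:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dedupe(names: list[str], threshold: float = 0.85) -> list[str]:
    """Keep only names that are not near-duplicates of an already-kept name."""
    kept: list[str] = []
    for name in names:
        if all(similarity(name, k) < threshold for k in kept):
            kept.append(name)
    return kept

names = ["British Library", "The British Library", "Louvre", "british library"]
unique = dedupe(names)  # near-duplicates collapse onto the first occurrence
```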
# NLP libraries (used by subagents only - NOT direct dependencies)
# spacy = "^3.7.0" # ❌ Used by subagents, not main code
# transformers = "^4.34.0" # ❌ Used by subagents, not main code
# torch = "^2.1.0" # ❌ Used by subagents, not main code
Web Crawling & HTTP
# Web crawling
crawl4ai = "^0.1.0" # AI-powered web crawling
httpx = "^0.25.0" # Modern async HTTP client
aiohttp = "^3.8.6" # Alternative async HTTP
# HTML/XML parsing
beautifulsoup4 = "^4.12.0" # HTML parsing
lxml = "^4.9.3" # Fast XML/HTML processing
selectolax = "^0.3.17" # Fast HTML5 parser (optional)
# URL utilities
urllib3 = "^2.0.7" # URL utilities
validators = "^0.22.0" # URL/email validation
Semantic Web & LinkML
# LinkML ecosystem
linkml = "^1.6.0" # Schema development tools
linkml-runtime = "^1.6.0" # Runtime validation and generation
linkml-model = "^1.6.0" # LinkML metamodel
# RDF/Semantic Web
rdflib = "^7.0.0" # RDF manipulation
pyshacl = "^0.24.0" # SHACL validation
sparqlwrapper = "^2.0.0" # SPARQL queries
Database & Storage
# Embedded databases
duckdb = "^0.9.0" # Analytical database
# sqlite3 — part of Python's standard library; no Poetry entry is needed
# SQL toolkit
sqlalchemy = "^2.0.0" # SQL abstraction layer
alembic = "^1.12.0" # Database migrations (if needed)
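Since sqlite3 ships with Python, a staging table for extracted records needs no extra dependency; the table and column names below are illustrative, not the project's schema:

```python
import sqlite3

# Minimal sketch: staging extracted institutions in an in-memory SQLite DB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE institution (id INTEGER PRIMARY KEY, name TEXT NOT NULL, country TEXT)"
)
conn.executemany(
    "INSERT INTO institution (name, country) VALUES (?, ?)",
    [("Rijksmuseum", "NL"), ("British Library", "GB")],
)
rows = conn.execute(
    "SELECT name FROM institution WHERE country = ? ORDER BY name", ("NL",)
).fetchall()
```

DuckDB offers an almost identical embedded workflow when analytical queries over Parquet are needed.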
Geolocation & Geographic Data
# Geocoding
geopy = "^2.4.0" # Geocoding library
pycountry = "^22.3.5" # Country/language data
phonenumbers = "^8.13.0" # Phone number parsing
Development Tools
[tool.poetry.group.dev.dependencies]
# Code quality
ruff = "^0.1.0" # Fast linter (replaces flake8, black, isort)
mypy = "^1.6.0" # Static type checking
pre-commit = "^3.5.0" # Git hooks
# Testing
pytest = "^7.4.0" # Testing framework
pytest-cov = "^4.1.0" # Coverage plugin
pytest-asyncio = "^0.21.0" # Async testing
hypothesis = "^6.88.0" # Property-based testing
# Documentation
mkdocs = "^1.5.0" # Documentation generator
mkdocs-material = "^9.4.0" # Material theme
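A pytest-style test in this setup is just a `test_*` function with bare asserts; the text-cleaning helper below is a hypothetical example, runnable as-is or via `poetry run pytest`:

```python
def clean_whitespace(text: str) -> str:
    """Collapse runs of whitespace and trim the ends (illustrative helper)."""
    return " ".join(text.split())

def test_clean_whitespace() -> None:
    # pytest discovers test_* functions automatically
    assert clean_whitespace("  British\t Library \n") == "British Library"
    assert clean_whitespace("") == ""

test_clean_whitespace()  # also runnable without pytest
```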
3. External Services & APIs
Required External Services
- None (self-contained by default)
- System designed to work offline with conversation files only
Optional External Services (for enrichment)
- Wikidata SPARQL Endpoint
  - URL: https://query.wikidata.org/sparql
  - Purpose: Entity linking and enrichment
  - Rate limit: No official limit; be respectful
  - Fallback: Works without it, with less enrichment
- Nominatim (OpenStreetMap Geocoding)
  - URL: https://nominatim.openstreetmap.org
  - Purpose: Geocoding addresses
  - Rate limit: 1 request/second on the public instance
  - Fallback: Geocoding can be skipped
  - Alternative: Self-hosted Nominatim instance
- VIAF (Virtual International Authority File)
  - URL: https://www.viaf.org/
  - Purpose: Authority linking for organizations
  - Rate limit: Reasonable use
  - Fallback: Works without it
- OCLC APIs (future)
  - Purpose: Library registry lookups
  - Requires: API key
  - Fallback: Optional enrichment only
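Because every external service is optional, enrichment calls should degrade gracefully. A minimal sketch, where `fetch_wikidata_label` is a hypothetical enrichment call stubbed out to avoid network access:

```python
def fetch_wikidata_label(qid: str) -> str:
    # Stub standing in for a real SPARQL/API call; simulates downtime.
    raise ConnectionError("endpoint unavailable")

def enrich(record: dict) -> dict:
    """Return an enriched copy of the record; fall back to the original on failure."""
    enriched = dict(record)
    try:
        enriched["wikidata_label"] = fetch_wikidata_label(record["wikidata_qid"])
    except (ConnectionError, KeyError):
        pass  # degrade gracefully: keep the record without enrichment
    return enriched

result = enrich({"name": "Louvre", "wikidata_qid": "Q19675"})
```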
4. Data Dependencies
Input Data
- Conversation JSON Files (REQUIRED)
  - Location: /Users/kempersc/Documents/claude/glam/*.json
  - Count: 139 files
  - Format: Claude conversation export format
  - Size: ~200 MB total (estimated)
- Ontology Research Files (REFERENCE)
  - Location: /Users/kempersc/Documents/claude/glam/ontology/*.json
  - Purpose: Schema design reference
  - Count: 14 files
Reference Data (to be included in project)
- Country Codes & Names
  - Source: ISO 3166-1 (via pycountry)
  - Purpose: Normalize country names
  - Format: Built into library
- Language Codes
  - Source: ISO 639 (via pycountry)
  - Purpose: Identify content languages
  - Format: Built into library
- GLAM Type Vocabulary
  - Source: Custom controlled vocabulary
  - Purpose: Standardize institution types
  - Format: LinkML enum
  - Location: data/vocabularies/institution_types.yaml
- Metadata Standards Registry
  - Source: Curated list
  - Purpose: Recognize metadata standards mentioned in text
  - Format: YAML
  - Location: data/vocabularies/metadata_standards.yaml
  - Examples: Dublin Core, EAD, MARC21, LIDO, etc.
Downloadable Reference Data (optional)
- ISIL Registry
  - Source: Library of Congress
  - URL: https://www.loc.gov/marc/organizations/orgshome.html
  - Purpose: Validate ISIL codes
  - Format: MARC XML
  - Frequency: Annual updates
  - Storage: data/reference/isil/
- Wikidata Dumps (for offline work)
  - Source: Wikidata
  - URL: https://dumps.wikimedia.org/wikidatawiki/entities/
  - Purpose: Offline entity linking
  - Format: JSON
  - Size: Very large (100 GB+)
  - Recommendation: Use the SPARQL endpoint instead
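Reading the conversation exports listed above needs only the stdlib; the top-level JSON structure is an assumption here, so this sketch just yields parsed documents:

```python
import json
from pathlib import Path

def iter_conversations(directory: str):
    """Yield (filename, parsed JSON) for every conversation export in a directory."""
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            yield path.name, json.load(fh)
```

In the project this would point at the conversation directory given under Input Data.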
5. Model Dependencies
NOTE: The models below are used by the extraction subagents (see the note in Section 2), not by the main application, so installing them is optional for the core pipeline.
spaCy Models
# Required for English content
python -m spacy download en_core_web_lg
# Recommended for multilingual content
python -m spacy download xx_ent_wiki_sm # Multilingual NER
python -m spacy download es_core_news_lg # Spanish
python -m spacy download fr_core_news_lg # French
python -m spacy download de_core_news_lg # German
python -m spacy download nl_core_news_lg # Dutch
python -m spacy download ja_core_news_lg # Japanese
python -m spacy download zh_core_web_lg # Chinese
Transformer Models (optional, for better NER)
# Hugging Face models are downloaded on first use, e.g.:
from transformers import pipeline
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")  # General NER
# Alternative: "Babelscape/wikineural-multilingual-ner" for multilingual NER
6. Schema Dependencies
Imported Schemas (LinkML)
- linkml:types
  - Source: LinkML standard library
  - Purpose: Base types (string, integer, etc.)
- Schema.org Vocabulary
  - Source: https://schema.org/
  - Subset: Organization, Place, WebSite, CreativeWork
  - Format: LinkML representation
  - Location: schemas/imports/schema_org_subset.yaml
- CPOC (Core Public Organization Vocabulary)
  - Source: W3C
  - URL: https://www.w3.org/ns/org
  - Format: LinkML representation
  - Location: schemas/imports/cpoc_subset.yaml
- TOOI (Dutch Organizations Ontology)
  - Source: Dutch government
  - URL: https://standaarden.overheid.nl/tooi
  - Format: LinkML representation (to be created)
  - Location: schemas/imports/tooi_subset.yaml
- RiC-O (Records in Contexts Ontology) (subset)
  - Source: International Council on Archives
  - URL: https://www.ica.org/standards/RiC/ontology
  - Purpose: Archives-specific classes
  - Location: schemas/imports/rico_subset.yaml
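The imports above would appear in the project schema roughly as follows; the schema id, name, and class are hypothetical, not the project's actual files:

```yaml
# Hypothetical LinkML schema sketch showing the import mechanism
id: https://example.org/glam/schema
name: glam-institutions
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
  - imports/schema_org_subset
default_range: string
classes:
  Institution:
    description: A GLAM institution extracted from the conversations
    attributes:
      name:
        required: true
      country: {}
```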
7. Operating System Dependencies
Cross-Platform
- Works on: macOS, Linux, Windows
- Recommended: macOS or Linux for development
System Tools (optional, for development)
- Git: Version control
- Docker: Containerization (future)
- Make: Build automation (optional)
File System
- Minimum disk space: 5GB (for data, models, cache)
- Recommended: 20GB (with all optional models)
Dependency Management Strategy
Version Pinning
# pyproject.toml
[tool.poetry.dependencies]
# Core dependencies: Use ^ (compatible version)
pandas = "^2.1.0" # Allows 2.1.x, 2.2.x, but not 3.x
# Critical dependencies: Use ~ (patch version)
linkml-runtime = "~1.6.0" # Allows 1.6.x only
# Unstable dependencies: Pin the exact version
crawl4ai = "0.1.0" # Exact pin; Poetry uses a bare version rather than pip's ==
Dependency Updates
- Weekly: Check for security updates
- Monthly: Update development dependencies
- Quarterly: Update core dependencies with testing
- As needed: Update spaCy models
Dependency Security
# Check for vulnerabilities (requires the `safety` package as a dev dependency)
poetry run safety check
# Update dependencies
poetry update
# Audit dependencies
poetry show --tree
External Data Sources (URLs to crawl)
Source URLs in Conversations
- An estimated 2,000+ URLs across the 139 conversation files
- Types:
- Institution homepages
- Digital repositories
- National library catalogs
- Archive finding aids
- Museum collection databases
- Government heritage portals
- Wikipedia/Wikidata pages
Crawling Constraints
- Respect robots.txt
- Rate limiting: Max 1 request/second per domain
- User-Agent: GLAM-Dataset-Extractor/0.1.0 (+https://github.com/user/glam-dataset)
- Timeout: 30 seconds per request
- Retry: Max 3 attempts with exponential backoff
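The retry rule above (max 3 attempts, exponential backoff) can be sketched as a small helper; `sleep` is injected so the policy is testable without real waiting, and `fetch` stands for any HTTP call:

```python
import time

def retry_with_backoff(fetch, attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call fetch(), retrying up to `attempts` times with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; propagate the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, ...
```

A production crawler would also narrow the caught exception types and add per-domain rate limiting.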
Optional Dependencies
Performance Optimization
# Fast JSON parsing
orjson = "^3.9.0"
# Fast YAML parsing (key must be quoted in TOML because of the dot)
"ruamel.yaml" = "^0.18.0"
# Progress bars
tqdm = "^4.66.0"
rich = "^13.6.0" # Beautiful terminal output
Export Formats
# Excel export
openpyxl = "^3.1.0"
# XML generation
xmltodict = "^0.13.0"
Monitoring & Observability
# Structured logging
structlog = "^23.2.0"
# Metrics (future)
prometheus-client = "^0.18.0"
Dependency Graph
Global GLAM Extractor
│
├─ Data Processing
│ ├─ pandas (tabular data)
│ ├─ pydantic (validation)
│ └─ duckdb (analytics)
│
├─ NLP Pipeline (runs inside subagents; see Section 2)
│ ├─ spaCy (core NLP)
│ │ └─ spacy models (language-specific)
│ ├─ transformers (advanced NER)
│ │ └─ torch (ML backend)
│ └─ langdetect (language ID)
│
├─ Web Crawling
│ ├─ crawl4ai (AI crawling)
│ ├─ httpx (HTTP client)
│ ├─ beautifulsoup4 (HTML parsing)
│ └─ lxml (XML processing)
│
├─ Semantic Web
│ ├─ linkml-runtime (schema validation)
│ ├─ rdflib (RDF manipulation)
│ └─ sparqlwrapper (SPARQL queries)
│
├─ Enrichment Services
│ ├─ geopy (geocoding)
│ ├─ pycountry (country/language data)
│ └─ External APIs (Wikidata, VIAF)
│
└─ Development
├─ pytest (testing)
├─ ruff (linting)
├─ mypy (type checking)
└─ mkdocs (documentation)
Installation
Complete Installation
# Clone repository
git clone https://github.com/user/glam-dataset.git
cd glam-dataset
# Install with Poetry
poetry install
# Install spaCy models
poetry run python -m spacy download en_core_web_lg
poetry run python -m spacy download xx_ent_wiki_sm
# Optional: Install all language models
poetry run bash scripts/install_language_models.sh
Minimal Installation (smaller English model, no dev extras)
poetry install --without dev
poetry run python -m spacy download en_core_web_sm # Smaller model
Docker Installation (future)
docker pull glamdataset/extractor:latest
docker run -v ./data:/data glamdataset/extractor
Dependency Risks & Mitigation
High-Risk Dependencies
-
crawl4ai (new/unstable)
- Risk: API changes, bugs
- Mitigation: Pin exact version, wrap in adapter pattern
- Fallback: Direct httpx/beautifulsoup4 implementation
-
transformers (large, heavy)
- Risk: Memory usage, slow startup
- Mitigation: Optional dependency, use spaCy by default
- Fallback: spaCy-only NER
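The adapter-pattern mitigation for crawl4ai can be sketched as a small protocol the rest of the code depends on, so the backend is swappable; `Fetcher`, `StubFetcher`, and `page_title` are illustrative names:

```python
from html.parser import HTMLParser
from typing import Protocol

class Fetcher(Protocol):
    """Narrow interface the pipeline depends on; crawl4ai or httpx adapters implement it."""
    def fetch(self, url: str) -> str: ...

class StubFetcher:
    """Stand-in for a crawl4ai- or httpx-based adapter; no network access."""
    def fetch(self, url: str) -> str:
        return "<html><head><title>Example GLAM</title></head></html>"

class _TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

def page_title(fetcher: Fetcher, url: str) -> str:
    """Pipeline code sees only the Fetcher interface, never the backend."""
    parser = _TitleParser()
    parser.feed(fetcher.fetch(url))
    return parser.title

title = page_title(StubFetcher(), "https://example.org")
```

Swapping backends then means writing one new adapter class, not touching pipeline code.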
Medium-Risk Dependencies
- External APIs (Wikidata, VIAF)
- Risk: Rate limiting, downtime
- Mitigation: Aggressive caching, graceful degradation
- Fallback: Skip enrichment, work with extracted data only
Low-Risk Dependencies
- Core libraries (pandas, rdflib, etc.)
- Risk: Minimal, well-maintained
- Mitigation: Regular updates
License Compatibility
All dependencies use permissive licenses compatible with the project:
- MIT: pydantic, beautifulsoup4, spaCy, duckdb, geopy
- Apache 2.0: transformers
- BSD: pandas, httpx, rdflib, lxml
No GPL dependencies that would require copyleft distribution.