# Global GLAM Dataset: Dependencies

## Overview

This document catalogs all technical, data, and organizational dependencies for the Global GLAM Dataset extraction project.

## Technical Dependencies

### 1. Python Runtime

- **Python 3.11+** (required)
  - Rationale: Modern type hints, performance improvements, better error messages
  - Alternative: Python 3.10 (minimum supported)

### 2. Core Python Libraries

#### Data Processing

```toml
[tool.poetry.dependencies]
python = "^3.11"

# Data manipulation
pandas = "^2.1.0"      # Tabular data processing
polars = "^0.19.0"     # High-performance dataframes (optional)
pyarrow = "^13.0.0"    # Parquet format support

# Data validation
pydantic = "^2.4.0"    # Data validation and settings
jsonschema = "^4.19.0" # JSON schema validation
```

#### NLP & Text Processing

**NOTE**: NLP extraction (e.g., named entity recognition) is handled by **coding subagents** via the Task tool, not directly in the main codebase. Subagents may use spaCy, transformers, or other tools internally, but these are NOT direct dependencies of the main application.
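Because entity extraction happens inside subagents, the main application's job is to validate what they return. A minimal sketch of that boundary using only the standard library (pydantic would express the same checks declaratively as a `BaseModel`); the record shape and field names are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractedInstitution:
    """Hypothetical shape of a subagent-extracted entity (illustrative fields)."""
    name: str
    country_code: str              # ISO 3166-1 alpha-2, e.g. "NL"
    homepage: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject obviously malformed subagent output before it enters the dataset.
        if not self.name.strip():
            raise ValueError("name must be non-empty")
        if len(self.country_code) != 2 or not self.country_code.isalpha():
            raise ValueError(f"not an alpha-2 country code: {self.country_code!r}")
        self.country_code = self.country_code.upper()  # normalize "nl" -> "NL"
        if self.homepage and not self.homepage.startswith(("http://", "https://")):
            raise ValueError(f"homepage must be an http(s) URL: {self.homepage!r}")
```

pydantic's `BaseModel` would add type coercion, JSON parsing, and aggregated error reporting on top of this idea.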
```toml
# Text utilities (direct dependencies)
langdetect = "^1.0.9"  # Language detection
unidecode = "^1.3.7"   # Unicode transliteration
ftfy = "^6.1.1"        # Fix text encoding issues
rapidfuzz = "^3.0.0"   # Fuzzy string matching for deduplication

# NLP libraries (used by subagents only - NOT direct dependencies)
# spacy = "^3.7.0"        # ❌ Used by subagents, not main code
# transformers = "^4.34.0" # ❌ Used by subagents, not main code
# torch = "^2.1.0"        # ❌ Used by subagents, not main code
```

#### Web Crawling & HTTP

```toml
# Web crawling
crawl4ai = "^0.1.0"    # AI-powered web crawling
httpx = "^0.25.0"      # Modern async HTTP client
aiohttp = "^3.8.6"     # Alternative async HTTP client

# HTML/XML parsing
beautifulsoup4 = "^4.12.0" # HTML parsing
lxml = "^4.9.3"            # Fast XML/HTML processing
selectolax = "^0.3.17"     # Fast HTML5 parser (optional)

# URL utilities
urllib3 = "^2.0.7"     # Low-level HTTP client and URL utilities
validators = "^0.22.0" # URL/email validation
```

#### Semantic Web & LinkML

```toml
# LinkML ecosystem
linkml = "^1.6.0"         # Schema development tools
linkml-runtime = "^1.6.0" # Runtime validation and generation
linkml-model = "^1.6.0"   # LinkML metamodel

# RDF/Semantic Web
rdflib = "^7.0.0"         # RDF manipulation
pyshacl = "^0.24.0"       # SHACL validation
sparqlwrapper = "^2.0.0"  # SPARQL queries
```

#### Database & Storage

```toml
# Embedded databases
duckdb = "^0.9.0"      # Analytical database
# sqlite3 ships with Python's standard library and needs no Poetry entry

# SQL toolkit
sqlalchemy = "^2.0.0"  # SQL abstraction layer
alembic = "^1.12.0"    # Database migrations (if needed)
```

#### Geolocation & Geographic Data

```toml
# Geocoding
geopy = "^2.4.0"         # Geocoding library
pycountry = "^22.3.5"    # Country/language data
phonenumbers = "^8.13.0" # Phone number parsing
```

#### Development Tools

```toml
[tool.poetry.group.dev.dependencies]
# Code quality
ruff = "^0.1.0"       # Fast linter (replaces flake8, black, isort)
mypy = "^1.6.0"       # Static type checking
pre-commit = "^3.5.0" # Git hooks

# Testing
pytest = "^7.4.0"          # Testing framework
pytest-cov = "^4.1.0"      # Coverage plugin
pytest-asyncio = "^0.21.0" # Async testing
hypothesis = "^6.88.0"     # Property-based testing

# Documentation
mkdocs = "^1.5.0"          # Documentation generator
mkdocs-material = "^9.4.0" # Material theme
```

### 3. External Services & APIs

#### Required External Services

1. **None (self-contained by default)**
   - The system is designed to work offline with conversation files only.

#### Optional External Services (for enrichment)

1. **Wikidata SPARQL Endpoint**
   - URL: `https://query.wikidata.org/sparql`
   - Purpose: Entity linking and enrichment
   - Rate limit: No official limit; be respectful
   - Fallback: Works without it, with less enrichment

2. **Nominatim (OpenStreetMap Geocoding)**
   - URL: `https://nominatim.openstreetmap.org`
   - Purpose: Geocoding addresses
   - Rate limit: 1 request/second for the public instance
   - Fallback: Geocoding can be skipped
   - Alternative: Self-hosted Nominatim instance

3. **VIAF (Virtual International Authority File)**
   - URL: `https://www.viaf.org/`
   - Purpose: Authority linking for organizations
   - Rate limit: Reasonable use
   - Fallback: Works without it

4. **OCLC APIs** (future)
   - Purpose: Library registry lookups
   - Requires: API key
   - Fallback: Optional enrichment only

### 4. Data Dependencies

#### Input Data

1. **Conversation JSON Files** (REQUIRED)
   - Location: `/Users/kempersc/Documents/claude/glam/*.json`
   - Count: 139 files
   - Format: Claude conversation export format
   - Size: ~200 MB total (estimated)

2. **Ontology Research Files** (REFERENCE)
   - Location: `/Users/kempersc/Documents/claude/glam/ontology/*.json`
   - Purpose: Schema design reference
   - Count: 14 files

#### Reference Data (to be included in project)

1. **Country Codes & Names**
   - Source: ISO 3166-1 (via pycountry)
   - Purpose: Normalize country names
   - Format: Built into the library

2. **Language Codes**
   - Source: ISO 639 (via pycountry)
   - Purpose: Identify content languages
   - Format: Built into the library

3. **GLAM Type Vocabulary**
   - Source: Custom controlled vocabulary
   - Purpose: Standardize institution types
   - Format: LinkML enum
   - Location: `data/vocabularies/institution_types.yaml`

4. **Metadata Standards Registry**
   - Source: Curated list
   - Purpose: Recognize metadata standards mentioned in text
   - Format: YAML
   - Location: `data/vocabularies/metadata_standards.yaml`
   - Examples: Dublin Core, EAD, MARC21, LIDO, etc.

#### Downloadable Reference Data (optional)

1. **ISIL Registry**
   - Source: Library of Congress
   - URL: `https://www.loc.gov/marc/organizations/orgshome.html`
   - Purpose: Validate ISIL codes
   - Format: MARC XML
   - Frequency: Annual updates
   - Storage: `data/reference/isil/`

2. **Wikidata Dumps** (for offline work)
   - Source: Wikidata
   - URL: `https://dumps.wikimedia.org/wikidatawiki/entities/`
   - Purpose: Offline entity linking
   - Format: JSON
   - Size: Very large (100 GB+)
   - Recommendation: Use the SPARQL endpoint instead

### 5. Model Dependencies

These models are needed only for the subagent-driven NLP path described in the NLP & Text Processing note above; the main application does not load them.

#### spaCy Models

```bash
# Required for English content
python -m spacy download en_core_web_lg

# Recommended for multilingual content
python -m spacy download xx_ent_wiki_sm  # Multilingual NER
python -m spacy download es_core_news_lg # Spanish
python -m spacy download fr_core_news_lg # French
python -m spacy download de_core_news_lg # German
python -m spacy download nl_core_news_lg # Dutch
python -m spacy download ja_core_news_lg # Japanese
python -m spacy download zh_core_web_lg  # Chinese
```

#### Transformer Models (optional, for better NER)

```python
# Hugging Face models (downloaded on first use)
from transformers import pipeline

# Organization/general NER
ner = pipeline("ner", model="dslim/bert-base-NER")

# Multilingual NER
multilingual_ner = pipeline("ner", model="Babelscape/wikineural-multilingual-ner")
```

### 6. Schema Dependencies

#### Imported Schemas (LinkML)

1. **linkml:types**
   - Source: LinkML standard library
   - Purpose: Base types (string, integer, etc.)

2. **Schema.org Vocabulary**
   - Source: https://schema.org/
   - Subset: Organization, Place, WebSite, CreativeWork
   - Format: LinkML representation
   - Location: `schemas/imports/schema_org_subset.yaml`

3. **CPOC (Core Public Organization Vocabulary)**
   - Source: W3C
   - URL: https://www.w3.org/ns/org
   - Format: LinkML representation
   - Location: `schemas/imports/cpoc_subset.yaml`

4. **TOOI (Dutch Organizations Ontology)**
   - Source: Dutch government
   - URL: https://standaarden.overheid.nl/tooi
   - Format: LinkML representation (to be created)
   - Location: `schemas/imports/tooi_subset.yaml`

5. **RiC-O (Records in Contexts Ontology)** (subset)
   - Source: International Council on Archives
   - URL: https://www.ica.org/standards/RiC/ontology
   - Purpose: Archives-specific classes
   - Location: `schemas/imports/rico_subset.yaml`

### 7. Operating System Dependencies

#### Cross-Platform

- Works on: macOS, Linux, Windows
- Recommended: macOS or Linux for development

#### System Tools (optional, for development)

- **Git**: Version control
- **Docker**: Containerization (future)
- **Make**: Build automation (optional)

#### File System

- Minimum disk space: 5 GB (for data, models, cache)
- Recommended: 20 GB (with all optional models)

## Dependency Management Strategy

### Version Pinning

```toml
# pyproject.toml
[tool.poetry.dependencies]

# Core dependencies: use ^ (compatible version)
pandas = "^2.1.0"         # Allows 2.1.x, 2.2.x, but not 3.x

# Critical dependencies: use ~ (patch version)
linkml-runtime = "~1.6.0" # Allows 1.6.x only

# Unstable dependencies: pin exact version
crawl4ai = "0.1.0"        # Poetry exact pins use a bare version, not "=="
```

### Dependency Updates

- **Weekly**: Check for security updates
- **Monthly**: Update development dependencies
- **Quarterly**: Update core dependencies with testing
- **As needed**: Update spaCy models

### Dependency Security

```bash
# Check for vulnerabilities
poetry run safety check

# Update dependencies
poetry update

# Audit the dependency tree
poetry show --tree
```

## External Data Sources (URLs to crawl)

### Source URLs in Conversations

- **~2,000 URLs (estimated)** across 139 conversation files
- Types:
  - Institution homepages
  - Digital repositories
  - National library catalogs
  - Archive finding aids
  - Museum collection databases
  - Government heritage portals
  - Wikipedia/Wikidata pages

### Crawling Constraints

- Respect robots.txt
- Rate limiting: Max 1 request/second per domain
- User-Agent: `GLAM-Dataset-Extractor/0.1.0 (+https://github.com/user/glam-dataset)`
- Timeout: 30 seconds per request
- Retry: Max 3 attempts with exponential backoff

## Optional Dependencies

### Performance Optimization

```toml
# Fast JSON parsing
orjson = "^3.9.0"

# Fast YAML parsing
"ruamel.yaml" = "^0.18.0" # Quoted: a bare dotted key would nest in TOML

# Progress bars
tqdm = "^4.66.0"
rich = "^13.6.0" # Beautiful terminal output
```

### Export Formats

```toml
# Excel export
openpyxl = "^3.1.0"

# XML generation
xmltodict = "^0.13.0"
```

### Monitoring & Observability

```toml
# Structured logging
structlog = "^23.2.0"

# Metrics (future)
prometheus-client = "^0.18.0"
```

## Dependency Graph

```
Global GLAM Extractor
│
├─ Data Processing
│  ├─ pandas (tabular data)
│  ├─ pydantic (validation)
│  └─ duckdb (analytics)
│
├─ NLP Pipeline (via subagents)
│  ├─ spaCy (core NLP)
│  │  └─ spaCy models (language-specific)
│  ├─ transformers (advanced NER)
│  │  └─ torch (ML backend)
│  └─ langdetect (language ID)
│
├─ Web Crawling
│  ├─ crawl4ai (AI crawling)
│  ├─ httpx (HTTP client)
│  ├─ beautifulsoup4 (HTML parsing)
│  └─ lxml (XML processing)
│
├─ Semantic Web
│  ├─ linkml-runtime (schema validation)
│  ├─ rdflib (RDF manipulation)
│  └─ sparqlwrapper (SPARQL queries)
│
├─ Enrichment Services
│  ├─ geopy (geocoding)
│  ├─ pycountry (country/language data)
│  └─ External APIs (Wikidata, VIAF)
│
└─ Development
   ├─ pytest (testing)
   ├─ ruff (linting)
   ├─ mypy (type checking)
   └─ mkdocs (documentation)
```

## Installation

### Complete Installation

```bash
# Clone repository
git clone https://github.com/user/glam-dataset.git
cd glam-dataset

# Install with Poetry
poetry install

# Install spaCy models
poetry run python -m spacy download en_core_web_lg
poetry run python -m spacy download xx_ent_wiki_sm

# Optional: install all language models
poetry run bash scripts/install_language_models.sh
```

### Minimal Installation (no transformer models)

```bash
poetry install --without dev
poetry run python -m spacy download en_core_web_sm # Smaller model
```

### Docker Installation (future)

```bash
docker pull glamdataset/extractor:latest
docker run -v ./data:/data glamdataset/extractor
```

## Dependency Risks & Mitigation

### High-Risk Dependencies

1. **crawl4ai** (new/unstable)
   - Risk: API changes, bugs
   - Mitigation: Pin exact version; wrap in an adapter pattern
   - Fallback: Direct httpx/beautifulsoup4 implementation

2. **transformers** (large, heavy)
   - Risk: Memory usage, slow startup
   - Mitigation: Optional dependency; use spaCy by default
   - Fallback: spaCy-only NER

### Medium-Risk Dependencies

1. **External APIs** (Wikidata, VIAF)
   - Risk: Rate limiting, downtime
   - Mitigation: Aggressive caching, graceful degradation
   - Fallback: Skip enrichment; work with extracted data only

### Low-Risk Dependencies

1. **Core libraries** (pandas, rdflib, etc.)
   - Risk: Minimal; well maintained
   - Mitigation: Regular updates

## License Compatibility

All dependencies use permissive licenses compatible with the project:

- **MIT**: pydantic, spaCy, duckdb, beautifulsoup4, geopy
- **Apache 2.0**: transformers
- **BSD**: pandas, httpx, rdflib, lxml

No GPL dependencies that would require copyleft distribution.
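As a closing illustration, the crawling constraints listed earlier (max 1 request/second per domain; max 3 attempts with exponential backoff) can be sketched with the standard library. The fallback noted for crawl4ai, a direct httpx/beautifulsoup4 crawler, would need the same bookkeeping. Class and function names here are illustrative, and the 1-second base retry delay is an assumption:

```python
MAX_ATTEMPTS = 3  # per the crawling constraints above
BASE_DELAY = 1.0  # seconds before the first retry (assumed base)

def backoff_delay(attempt: int) -> float:
    """Seconds to wait before retry `attempt` (1-based): 1s, 2s, 4s, ..."""
    return BASE_DELAY * (2 ** (attempt - 1))

class DomainRateLimiter:
    """Track last-request times so each domain sees at most one request per `interval` seconds."""

    def __init__(self, interval: float = 1.0) -> None:
        self.interval = interval
        self._last_request: dict[str, float] = {}

    def wait_time(self, domain: str, now: float) -> float:
        """Seconds the caller must still sleep before hitting `domain` again."""
        last = self._last_request.get(domain)
        if last is None:
            return 0.0  # never seen this domain; no wait needed
        return max(0.0, self.interval - (now - last))

    def record(self, domain: str, now: float) -> None:
        """Remember when `domain` was last requested."""
        self._last_request[domain] = now
```

A real crawler would wrap this around httpx calls with the 30-second timeout from the constraints and give up after `MAX_ATTEMPTS` failures.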