Global GLAM Dataset: Dependencies
Overview
This document catalogs all technical, data, and organizational dependencies for the Global GLAM Dataset extraction project.
Technical Dependencies
1. Python Runtime
- Python 3.11+ (required)
- Rationale: Modern type hints, performance improvements, better error messages
- Alternative: Python 3.10 (minimum supported)
2. Core Python Libraries
Data Processing
[tool.poetry.dependencies]
python = "^3.11"
# Data manipulation
pandas = "^2.1.0" # Tabular data processing
polars = "^0.19.0" # High-performance dataframes (optional)
pyarrow = "^13.0.0" # Parquet format support
# Data validation
pydantic = "^2.4.0" # Data validation and settings
jsonschema = "^4.19.0" # JSON schema validation
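The validation role pydantic plays here can be sketched with a stdlib dataclass; `InstitutionRecord` and its fields are illustrative, not the project's actual schema:

```python
from dataclasses import dataclass

# Stdlib sketch of the record validation pydantic would provide.
# The model name and fields are hypothetical examples.
@dataclass
class InstitutionRecord:
    name: str
    country_code: str  # ISO 3166-1 alpha-2
    url: str = ""

    def __post_init__(self) -> None:
        if not self.name.strip():
            raise ValueError("name must be non-empty")
        if len(self.country_code) != 2 or not self.country_code.isalpha():
            raise ValueError(f"invalid country code: {self.country_code!r}")
        # Normalize to uppercase ISO form
        self.country_code = self.country_code.upper()

rec = InstitutionRecord(name="Rijksmuseum", country_code="nl")
```

A pydantic `BaseModel` would add the same checks declaratively, plus JSON (de)serialization.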
NLP & Text Processing
NOTE: NLP extraction (NER, entity recognition) is handled by coding subagents via the Task tool, not directly in the main codebase. Subagents may use spaCy, transformers, or other tools internally, but these are NOT direct dependencies of the main application.
# Text utilities (direct dependencies)
langdetect = "^1.0.9" # Language detection
unidecode = "^1.3.7" # Unicode transliteration
ftfy = "^6.1.1" # Fix text encoding issues
rapidfuzz = "^3.0.0" # Fuzzy string matching for deduplication
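The deduplication role of rapidfuzz can be illustrated with the stdlib's `difflib` (rapidfuzz's `fuzz.ratio` plays the same role, scaled 0-100 and much faster); the names and threshold below are illustrative:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dedupe(names: list[str], threshold: float = 0.85) -> list[str]:
    """Keep only names that are not near-duplicates of an already-kept name."""
    kept: list[str] = []
    for name in names:
        if all(similarity(name, k) < threshold for k in kept):
            kept.append(name)
    return kept

names = ["British Library", "The British Library", "Louvre", "british library"]
unique = dedupe(names)  # near-duplicates collapse onto the first occurrence
```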
# NLP libraries (used by subagents only - NOT direct dependencies)
# spacy = "^3.7.0" # ❌ Used by subagents, not main code
# transformers = "^4.34.0" # ❌ Used by subagents, not main code
# torch = "^2.1.0" # ❌ Used by subagents, not main code
Web Crawling & HTTP
# Web crawling
crawl4ai = "^0.1.0" # AI-powered web crawling
httpx = "^0.25.0" # Modern async HTTP client
aiohttp = "^3.8.6" # Alternative async HTTP
# HTML/XML parsing
beautifulsoup4 = "^4.12.0" # HTML parsing
lxml = "^4.9.3" # Fast XML/HTML processing
selectolax = "^0.3.17" # Fast HTML5 parser (optional)
# URL utilities
urllib3 = "^2.0.7" # URL utilities
validators = "^0.22.0" # URL/email validation
Semantic Web & LinkML
# LinkML ecosystem
linkml = "^1.6.0" # Schema development tools
linkml-runtime = "^1.6.0" # Runtime validation and generation
linkml-model = "^1.6.0" # LinkML metamodel
# RDF/Semantic Web
rdflib = "^7.0.0" # RDF manipulation
pyshacl = "^0.24.0" # SHACL validation
sparqlwrapper = "^2.0.0" # SPARQL queries
Database & Storage
# Embedded databases
duckdb = "^0.9.0" # Analytical database
# sqlite3 — part of Python's standard library; no Poetry entry is needed
# SQL toolkit
sqlalchemy = "^2.0.0" # SQL abstraction layer
alembic = "^1.12.0" # Database migrations (if needed)
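Since sqlite3 ships with Python, a staging table for extracted records needs no extra dependency; the table and column names below are illustrative, not the project's schema:

```python
import sqlite3

# Minimal sketch: staging extracted institutions in an in-memory SQLite DB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE institution (id INTEGER PRIMARY KEY, name TEXT NOT NULL, country TEXT)"
)
conn.executemany(
    "INSERT INTO institution (name, country) VALUES (?, ?)",
    [("Rijksmuseum", "NL"), ("British Library", "GB")],
)
rows = conn.execute(
    "SELECT name FROM institution WHERE country = ? ORDER BY name", ("NL",)
).fetchall()
```

DuckDB offers an almost identical embedded workflow when analytical queries over Parquet are needed.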
Geolocation & Geographic Data
# Geocoding
geopy = "^2.4.0" # Geocoding library
pycountry = "^22.3.5" # Country/language data
phonenumbers = "^8.13.0" # Phone number parsing
Development Tools
[tool.poetry.group.dev.dependencies]
# Code quality
ruff = "^0.1.0" # Fast linter (replaces flake8, black, isort)
mypy = "^1.6.0" # Static type checking
pre-commit = "^3.5.0" # Git hooks
# Testing
pytest = "^7.4.0" # Testing framework
pytest-cov = "^4.1.0" # Coverage plugin
pytest-asyncio = "^0.21.0" # Async testing
hypothesis = "^6.88.0" # Property-based testing
# Documentation
mkdocs = "^1.5.0" # Documentation generator
mkdocs-material = "^9.4.0" # Material theme
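A pytest-style test in this setup is just a `test_*` function with bare asserts; the text-cleaning helper below is a hypothetical example, runnable as-is or via `poetry run pytest`:

```python
def clean_whitespace(text: str) -> str:
    """Collapse runs of whitespace and trim the ends (illustrative helper)."""
    return " ".join(text.split())

def test_clean_whitespace() -> None:
    # pytest discovers test_* functions automatically
    assert clean_whitespace("  British\t Library \n") == "British Library"
    assert clean_whitespace("") == ""

test_clean_whitespace()  # also runnable without pytest
```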
3. External Services & APIs
Required External Services
- None (self-contained by default)
- System designed to work offline with conversation files only
Optional External Services (for enrichment)
- Wikidata SPARQL Endpoint
  - URL: https://query.wikidata.org/sparql
  - Purpose: Entity linking and enrichment
  - Rate limit: No official limit; be respectful
  - Fallback: Works without it, with less enrichment
- Nominatim (OpenStreetMap Geocoding)
  - URL: https://nominatim.openstreetmap.org
  - Purpose: Geocoding addresses
  - Rate limit: 1 request/second on the public instance
  - Fallback: Geocoding can be skipped
  - Alternative: Self-hosted Nominatim instance
- VIAF (Virtual International Authority File)
  - URL: https://www.viaf.org/
  - Purpose: Authority linking for organizations
  - Rate limit: Reasonable use
  - Fallback: Works without it
- OCLC APIs (future)
  - Purpose: Library registry lookups
  - Requires: API key
  - Fallback: Optional enrichment only
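Because every external service is optional, enrichment calls should degrade gracefully. A minimal sketch, where `fetch_wikidata_label` is a hypothetical enrichment call stubbed out to avoid network access:

```python
def fetch_wikidata_label(qid: str) -> str:
    # Stub standing in for a real SPARQL/API call; simulates downtime.
    raise ConnectionError("endpoint unavailable")

def enrich(record: dict) -> dict:
    """Return an enriched copy of the record; fall back to the original on failure."""
    enriched = dict(record)
    try:
        enriched["wikidata_label"] = fetch_wikidata_label(record["wikidata_qid"])
    except (ConnectionError, KeyError):
        pass  # degrade gracefully: keep the record without enrichment
    return enriched

result = enrich({"name": "Louvre", "wikidata_qid": "Q19675"})
```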
4. Data Dependencies
Input Data
- Conversation JSON Files (REQUIRED)
  - Location: /Users/kempersc/Documents/claude/glam/*.json
  - Count: 139 files
  - Format: Claude conversation export format
  - Size: ~200 MB total (estimated)
- Ontology Research Files (REFERENCE)
  - Location: /Users/kempersc/Documents/claude/glam/ontology/*.json
  - Purpose: Schema design reference
  - Count: 14 files
Reference Data (to be included in project)
- Country Codes & Names
  - Source: ISO 3166-1 (via pycountry)
  - Purpose: Normalize country names
  - Format: Built into library
- Language Codes
  - Source: ISO 639 (via pycountry)
  - Purpose: Identify content languages
  - Format: Built into library
- GLAM Type Vocabulary
  - Source: Custom controlled vocabulary
  - Purpose: Standardize institution types
  - Format: LinkML enum
  - Location: data/vocabularies/institution_types.yaml
- Metadata Standards Registry
  - Source: Curated list
  - Purpose: Recognize metadata standards mentioned in text
  - Format: YAML
  - Location: data/vocabularies/metadata_standards.yaml
  - Examples: Dublin Core, EAD, MARC21, LIDO, etc.
Downloadable Reference Data (optional)
- ISIL Registry
  - Source: Library of Congress
  - URL: https://www.loc.gov/marc/organizations/orgshome.html
  - Purpose: Validate ISIL codes
  - Format: MARC XML
  - Frequency: Annual updates
  - Storage: data/reference/isil/
- Wikidata Dumps (for offline work)
  - Source: Wikidata
  - URL: https://dumps.wikimedia.org/wikidatawiki/entities/
  - Purpose: Offline entity linking
  - Format: JSON
  - Size: Very large (100 GB+)
  - Recommendation: Use the SPARQL endpoint instead
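Reading the conversation exports listed above needs only the stdlib; the top-level JSON structure is an assumption here, so this sketch just yields parsed documents:

```python
import json
from pathlib import Path

def iter_conversations(directory: str):
    """Yield (filename, parsed JSON) for every conversation export in a directory."""
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            yield path.name, json.load(fh)
```

In the project this would point at the conversation directory given under Input Data.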
5. Model Dependencies
NOTE: The models below are used by the extraction subagents (see the note in Section 2), not by the main application, so installing them is optional for the core pipeline.
spaCy Models
# Required for English content
python -m spacy download en_core_web_lg
# Recommended for multilingual content
python -m spacy download xx_ent_wiki_sm # Multilingual NER
python -m spacy download es_core_news_lg # Spanish
python -m spacy download fr_core_news_lg # French
python -m spacy download de_core_news_lg # German
python -m spacy download nl_core_news_lg # Dutch
python -m spacy download ja_core_news_lg # Japanese
python -m spacy download zh_core_web_lg # Chinese
Transformer Models (optional, for better NER)
# Hugging Face models are downloaded on first use, e.g.:
from transformers import pipeline
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")  # General NER
# Alternative: "Babelscape/wikineural-multilingual-ner" for multilingual NER
6. Schema Dependencies
Imported Schemas (LinkML)
- linkml:types
  - Source: LinkML standard library
  - Purpose: Base types (string, integer, etc.)
- Schema.org Vocabulary
  - Source: https://schema.org/
  - Subset: Organization, Place, WebSite, CreativeWork
  - Format: LinkML representation
  - Location: schemas/imports/schema_org_subset.yaml
- CPOC (Core Public Organization Vocabulary)
  - Source: W3C
  - URL: https://www.w3.org/ns/org
  - Format: LinkML representation
  - Location: schemas/imports/cpoc_subset.yaml
- TOOI (Dutch Organizations Ontology)
  - Source: Dutch government
  - URL: https://standaarden.overheid.nl/tooi
  - Format: LinkML representation (to be created)
  - Location: schemas/imports/tooi_subset.yaml
- RiC-O (Records in Contexts Ontology) (subset)
  - Source: International Council on Archives
  - URL: https://www.ica.org/standards/RiC/ontology
  - Purpose: Archives-specific classes
  - Location: schemas/imports/rico_subset.yaml
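The imports above would appear in the project schema roughly as follows; the schema id, name, and class are hypothetical, not the project's actual files:

```yaml
# Hypothetical LinkML schema sketch showing the import mechanism
id: https://example.org/glam/schema
name: glam-institutions
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
  - imports/schema_org_subset
default_range: string
classes:
  Institution:
    description: A GLAM institution extracted from the conversations
    attributes:
      name:
        required: true
      country: {}
```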
7. Operating System Dependencies
Cross-Platform
- Works on: macOS, Linux, Windows
- Recommended: macOS or Linux for development
System Tools (optional, for development)
- Git: Version control
- Docker: Containerization (future)
- Make: Build automation (optional)
File System
- Minimum disk space: 5GB (for data, models, cache)
- Recommended: 20GB (with all optional models)
Dependency Management Strategy
Version Pinning
# pyproject.toml
[tool.poetry.dependencies]
# Core dependencies: Use ^ (compatible version)
pandas = "^2.1.0" # Allows 2.1.x, 2.2.x, but not 3.x
# Critical dependencies: Use ~ (patch version)
linkml-runtime = "~1.6.0" # Allows 1.6.x only
# Unstable dependencies: Pin the exact version
crawl4ai = "0.1.0" # Exact pin; Poetry uses a bare version rather than pip's ==
Dependency Updates
- Weekly: Check for security updates
- Monthly: Update development dependencies
- Quarterly: Update core dependencies with testing
- As needed: Update spaCy models
Dependency Security
# Check for vulnerabilities (requires the `safety` package as a dev dependency)
poetry run safety check
# Update dependencies
poetry update
# Audit dependencies
poetry show --tree
External Data Sources (URLs to crawl)
Source URLs in Conversations
- An estimated 2,000+ URLs across the 139 conversation files
- Types:
- Institution homepages
- Digital repositories
- National library catalogs
- Archive finding aids
- Museum collection databases
- Government heritage portals
- Wikipedia/Wikidata pages
Crawling Constraints
- Respect robots.txt
- Rate limiting: Max 1 request/second per domain
- User-Agent: GLAM-Dataset-Extractor/0.1.0 (+https://github.com/user/glam-dataset)
- Timeout: 30 seconds per request
- Retry: Max 3 attempts with exponential backoff
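The retry rule above (max 3 attempts, exponential backoff) can be sketched as a small helper; `sleep` is injected so the policy is testable without real waiting, and `fetch` stands for any HTTP call:

```python
import time

def retry_with_backoff(fetch, attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call fetch(), retrying up to `attempts` times with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; propagate the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, ...
```

A production crawler would also narrow the caught exception types and add per-domain rate limiting.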
Optional Dependencies
Performance Optimization
# Fast JSON parsing
orjson = "^3.9.0"
# Fast YAML parsing (key must be quoted in TOML because of the dot)
"ruamel.yaml" = "^0.18.0"
# Progress bars
tqdm = "^4.66.0"
rich = "^13.6.0" # Beautiful terminal output
Export Formats
# Excel export
openpyxl = "^3.1.0"
# XML generation
xmltodict = "^0.13.0"
Monitoring & Observability
# Structured logging
structlog = "^23.2.0"
# Metrics (future)
prometheus-client = "^0.18.0"
Dependency Graph
Global GLAM Extractor
│
├─ Data Processing
│ ├─ pandas (tabular data)
│ ├─ pydantic (validation)
│ └─ duckdb (analytics)
│
├─ NLP Pipeline (runs inside subagents; see Section 2)
│ ├─ spaCy (core NLP)
│ │ └─ spacy models (language-specific)
│ ├─ transformers (advanced NER)
│ │ └─ torch (ML backend)
│ └─ langdetect (language ID)
│
├─ Web Crawling
│ ├─ crawl4ai (AI crawling)
│ ├─ httpx (HTTP client)
│ ├─ beautifulsoup4 (HTML parsing)
│ └─ lxml (XML processing)
│
├─ Semantic Web
│ ├─ linkml-runtime (schema validation)
│ ├─ rdflib (RDF manipulation)
│ └─ sparqlwrapper (SPARQL queries)
│
├─ Enrichment Services
│ ├─ geopy (geocoding)
│ ├─ pycountry (country/language data)
│ └─ External APIs (Wikidata, VIAF)
│
└─ Development
├─ pytest (testing)
├─ ruff (linting)
├─ mypy (type checking)
└─ mkdocs (documentation)
Installation
Complete Installation
# Clone repository
git clone https://github.com/user/glam-dataset.git
cd glam-dataset
# Install with Poetry
poetry install
# Install spaCy models
poetry run python -m spacy download en_core_web_lg
poetry run python -m spacy download xx_ent_wiki_sm
# Optional: Install all language models
poetry run bash scripts/install_language_models.sh
Minimal Installation (smaller English model, no dev extras)
poetry install --without dev
poetry run python -m spacy download en_core_web_sm # Smaller model
Docker Installation (future)
docker pull glamdataset/extractor:latest
docker run -v ./data:/data glamdataset/extractor
Dependency Risks & Mitigation
High-Risk Dependencies
-
crawl4ai (new/unstable)
- Risk: API changes, bugs
- Mitigation: Pin exact version, wrap in adapter pattern
- Fallback: Direct httpx/beautifulsoup4 implementation
-
transformers (large, heavy)
- Risk: Memory usage, slow startup
- Mitigation: Optional dependency, use spaCy by default
- Fallback: spaCy-only NER
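The adapter-pattern mitigation for crawl4ai can be sketched as a small protocol the rest of the code depends on, so the backend is swappable; `Fetcher`, `StubFetcher`, and `page_title` are illustrative names:

```python
from html.parser import HTMLParser
from typing import Protocol

class Fetcher(Protocol):
    """Narrow interface the pipeline depends on; crawl4ai or httpx adapters implement it."""
    def fetch(self, url: str) -> str: ...

class StubFetcher:
    """Stand-in for a crawl4ai- or httpx-based adapter; no network access."""
    def fetch(self, url: str) -> str:
        return "<html><head><title>Example GLAM</title></head></html>"

class _TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

def page_title(fetcher: Fetcher, url: str) -> str:
    """Pipeline code sees only the Fetcher interface, never the backend."""
    parser = _TitleParser()
    parser.feed(fetcher.fetch(url))
    return parser.title

title = page_title(StubFetcher(), "https://example.org")
```

Swapping backends then means writing one new adapter class, not touching pipeline code.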
Medium-Risk Dependencies
- External APIs (Wikidata, VIAF)
- Risk: Rate limiting, downtime
- Mitigation: Aggressive caching, graceful degradation
- Fallback: Skip enrichment, work with extracted data only
Low-Risk Dependencies
- Core libraries (pandas, rdflib, etc.)
- Risk: Minimal, well-maintained
- Mitigation: Regular updates
License Compatibility
All dependencies use permissive licenses compatible with the project:
- MIT: pydantic, beautifulsoup4, spaCy, duckdb, geopy
- Apache 2.0: transformers
- BSD: pandas, httpx, rdflib, lxml
No GPL dependencies that would require copyleft distribution.