Contributing to GLAM Extractor
Thank you for your interest in contributing to the GLAM Extractor project! This document provides guidelines and instructions for contributors.
Development Setup
Prerequisites
- Python 3.11 or higher
- Poetry (Python package manager)
- Git
Installation
# Clone the repository
git clone <repository-url>
cd glam-extractor
# Install dependencies
poetry install
# Install pre-commit hooks
poetry run pre-commit install
# Download spaCy models (required for NLP features)
poetry run python -m spacy download en_core_web_trf
poetry run python -m spacy download nl_core_news_lg
poetry run python -m spacy download xx_ent_wiki_sm
Project Structure
glam-extractor/
├── src/glam_extractor/ # Main package
│ ├── parsers/ # Conversation & CSV parsers
│ ├── extractors/ # NLP extraction engines
│ ├── crawlers/ # Web crawling (crawl4ai)
│ ├── validators/ # LinkML validation
│ ├── exporters/ # Multi-format export
│ ├── geocoding/ # Nominatim geocoding
│ └── utils/ # Utilities
├── tests/ # Test suite
│ ├── unit/ # Unit tests
│ ├── integration/ # Integration tests
│ └── fixtures/ # Test data
├── schemas/ # LinkML schemas
├── docs/ # Documentation
└── data/ # Reference data (CSVs)
Development Workflow
1. Create a Branch
git checkout -b feature/your-feature-name
# or
git checkout -b fix/bug-description
2. Make Changes
Follow the coding standards (see below) and ensure all tests pass.
3. Write Tests
All new features should include:
- Unit tests in tests/unit/
- Integration tests in tests/integration/ (if applicable)
- Docstring examples that serve as documentation
4. Run Tests
# Run all tests
poetry run pytest
# Run specific test types
poetry run pytest -m unit
poetry run pytest -m integration
# Run with coverage
poetry run pytest --cov
# Run specific test file
poetry run pytest tests/unit/test_parsers.py
5. Code Quality Checks
# Format code with black
poetry run black src/ tests/
# Lint with ruff
poetry run ruff check src/ tests/
# Type check with mypy
poetry run mypy src/
# Run all pre-commit hooks
poetry run pre-commit run --all-files
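Assuming these tools are configured in pyproject.toml (hypothetical excerpt — check the repository's actual configuration), keeping Black and Ruff on the same line length prevents the two from fighting over reflows:

```toml
# pyproject.toml (hypothetical excerpt)
[tool.black]
line-length = 100

[tool.ruff]
line-length = 100

[tool.mypy]
strict = true
```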
6. Commit Changes
git add .
git commit -m "feat: add institution name extractor"
Commit Message Format:
- feat: New feature
- fix: Bug fix
- docs: Documentation changes
- test: Test changes
- refactor: Code refactoring
- chore: Maintenance tasks
7. Push and Create PR
git push origin feature/your-feature-name
Then create a Pull Request on GitHub.
Coding Standards
Python Style
- Follow PEP 8
- Use Black for formatting (line length: 100)
- Use type hints for all function signatures
- Write docstrings for all public functions/classes
Example
from typing import List
from pathlib import Path


def extract_institution_names(
    conversation_path: Path,
    confidence_threshold: float = 0.7,
) -> List[str]:
    """
    Extract heritage institution names from a conversation JSON file.

    Args:
        conversation_path: Path to conversation JSON file
        confidence_threshold: Minimum confidence score (0.0-1.0)

    Returns:
        List of institution names with confidence above threshold

    Raises:
        FileNotFoundError: If conversation file doesn't exist
        ValueError: If confidence_threshold is out of range

    Examples:
        >>> extract_institution_names(Path("brazilian_glam.json"))
        ['Biblioteca Nacional do Brasil', 'Museu Nacional']
    """
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0.0 and 1.0")
    # Implementation here
    ...
Type Hints
Use type hints for:
- Function parameters
- Return types
- Class attributes
from typing import Any, Dict, List, Optional
from datetime import datetime


class HeritageCustodian:
    """Represents a heritage institution."""

    name: str
    institution_type: str
    founded_date: Optional[datetime]
    identifiers: List[Dict[str, Any]]

    def __init__(
        self,
        name: str,
        institution_type: str,
        founded_date: Optional[datetime] = None,
    ) -> None:
        self.name = name
        self.institution_type = institution_type
        self.founded_date = founded_date
        self.identifiers = []
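The same annotated attributes can also be expressed as a dataclass, which generates __init__ automatically. This is an equivalent sketch, not the project's required style:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class HeritageCustodian:
    """Represents a heritage institution (dataclass form)."""

    name: str
    institution_type: str
    founded_date: Optional[datetime] = None
    # default_factory gives each instance its own empty list
    identifiers: List[Dict[str, Any]] = field(default_factory=list)
```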
Docstrings
Use Google-style docstrings:
def merge_institutions(
    source1: List[HeritageCustodian],
    source2: List[HeritageCustodian],
    merge_strategy: str = "isil_code",
) -> List[HeritageCustodian]:
    """
    Merge two lists of heritage institutions using the specified strategy.

    Args:
        source1: First list of institutions
        source2: Second list of institutions
        merge_strategy: Strategy to use ("isil_code", "name_fuzzy", "location")

    Returns:
        Merged list with duplicates resolved

    Raises:
        ValueError: If merge_strategy is not recognized

    Note:
        When conflicts occur, source1 takes precedence for TIER_1 data;
        otherwise the highest data tier wins.

    Examples:
        >>> csv_institutions = load_csv_institutions(...)
        >>> conversation_institutions = extract_from_conversations(...)
        >>> merged = merge_institutions(csv_institutions, conversation_institutions)
    """
    ...
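For the default "isil_code" strategy, the core of such a merge can be sketched with plain dicts (hypothetical field names — the real HeritageCustodian objects would be accessed through their attributes instead):

```python
from typing import Dict, List


def merge_by_isil(
    source1: List[Dict[str, str]],
    source2: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Deduplicate two institution lists on their ISIL code.

    Entries from source1 are applied last, so they win on conflicts.
    """
    merged: Dict[str, Dict[str, str]] = {}
    for inst in source2 + source1:
        # Fall back to the name when no ISIL code is present
        key = inst.get("isil_code") or inst["name"]
        merged[key] = inst
    return list(merged.values())
```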
Testing Guidelines
Unit Tests
Test individual functions/classes in isolation:
# tests/unit/test_extractors.py
import pytest
from glam_extractor.extractors import extract_isil_codes


def test_extract_isil_codes_single():
    text = "The ISIL code NL-AsdAM identifies Amsterdam Museum"
    codes = extract_isil_codes(text)
    assert len(codes) == 1
    assert codes[0]["value"] == "NL-AsdAM"


def test_extract_isil_codes_multiple():
    text = "Codes include NL-AsdAM and NL-AmfRCE"
    codes = extract_isil_codes(text)
    assert len(codes) == 2


def test_extract_isil_codes_none():
    text = "No ISIL codes here"
    codes = extract_isil_codes(text)
    assert len(codes) == 0
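The tests above assume an extract_isil_codes helper whose implementation is not shown in this guide. A minimal regex-based sketch that would satisfy them might look like this (ISIL identifiers follow ISO 15511's "prefix-unit id" shape, e.g. NL-AsdAM; this pattern is deliberately loose):

```python
import re
from typing import Dict, List

# Hypothetical sketch: matches "<1-4 uppercase letters>-<unit id>".
# A production extractor would validate prefixes against the ISIL registry.
ISIL_PATTERN = re.compile(r"\b([A-Z]{1,4})-([A-Za-z0-9:/]+)\b")


def extract_isil_codes(text: str) -> List[Dict[str, str]]:
    """Return candidate ISIL codes found in free text."""
    return [
        {"value": match.group(0), "prefix": match.group(1)}
        for match in ISIL_PATTERN.finditer(text)
    ]
```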
Integration Tests
Test full workflows:
# tests/integration/test_pipeline.py
import pytest
from pathlib import Path
from glam_extractor import ConversationParser, InstitutionExtractor, LinkMLValidator


@pytest.mark.integration
def test_full_extraction_pipeline(tmp_path):
    # Setup
    conversation_file = Path("tests/fixtures/brazilian_glam.json")
    output_file = tmp_path / "output.jsonld"

    # Execute
    parser = ConversationParser()
    conversation = parser.load(conversation_file)

    extractor = InstitutionExtractor()
    institutions = extractor.extract(conversation)

    validator = LinkMLValidator(schema="schemas/heritage_custodian.yaml")
    valid_institutions = validator.validate_batch(institutions)

    # Assert
    assert len(institutions) > 0
    assert all(inst.provenance.data_source == "CONVERSATION_NLP" for inst in institutions)
    assert len(valid_institutions) == len(institutions)
Test Fixtures
Create reusable test data in tests/fixtures/:
# tests/fixtures/sample_conversation.py
import json
from pathlib import Path


def create_sample_conversation(tmp_path: Path) -> Path:
    """Create a minimal conversation JSON for testing."""
    conversation = {
        "uuid": "test-uuid",
        "name": "Test Conversation",
        "chat_messages": [
            {
                "uuid": "msg-1",
                "text": "Amsterdam Museum (ISIL: NL-AsdAM) is located in Amsterdam.",
                "sender": "assistant",
                "content": [{"type": "text", "text": "..."}],
            }
        ],
    }
    fixture_path = tmp_path / "test_conversation.json"
    fixture_path.write_text(json.dumps(conversation, indent=2))
    return fixture_path
Documentation
API Documentation
All public functions/classes must have docstrings. We use mkdocstrings to auto-generate API docs.
# Serve docs locally
poetry run mkdocs serve
# Build static docs
poetry run mkdocs build
Tutorials
Add tutorials to docs/tutorials/ with step-by-step examples.
Examples
Add working code examples to docs/examples/.
Areas for Contribution
High Priority
- Parser Implementation (src/glam_extractor/parsers/)
  - Conversation JSON parser
  - CSV parser for Dutch datasets
  - Schema-compliant object builders
- Extractor Implementation (src/glam_extractor/extractors/)
  - spaCy NER integration
  - Institution type classifier
  - Identifier pattern extractors
- Validator Implementation (src/glam_extractor/validators/)
  - LinkML schema validator
  - Cross-reference validator
- Exporter Implementation (src/glam_extractor/exporters/)
  - JSON-LD exporter
  - RDF/Turtle exporter
  - CSV/Parquet exporters
Medium Priority
- Geocoding Module (src/glam_extractor/geocoding/)
  - Nominatim client
  - GeoNames integration
  - Caching layer
- Web Crawler (src/glam_extractor/crawlers/)
  - crawl4ai integration
  - Institution website scraping
Lower Priority
- CLI Enhancements (src/glam_extractor/cli.py)
  - Progress bars
  - Better error reporting
  - Configuration file support
- Performance Optimization
  - Parallel processing
  - Caching strategies
  - Memory optimization
Design Patterns
Follow the patterns documented in docs/plan/global_glam/05-design-patterns.md:
- Pipeline Pattern: For data processing workflows
- Repository Pattern: For data access
- Strategy Pattern: For configurable algorithms
- Builder Pattern: For complex object construction
- Result Pattern: For explicit error handling
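As an illustration of the last of these, a minimal Result sketch could look like the following (hypothetical — the project's actual Result type lives in the design-patterns doc and may differ):

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Result(Generic[T]):
    """Holds either a value or an error message, never both."""

    value: Optional[T] = None
    error: Optional[str] = None

    @property
    def is_ok(self) -> bool:
        return self.error is None


def parse_founding_year(raw: str) -> Result[int]:
    """Parse a year string without raising, returning an explicit Result."""
    try:
        return Result(value=int(raw))
    except ValueError:
        return Result(error=f"not a valid year: {raw!r}")
```

Callers then branch on is_ok instead of wrapping every call site in try/except, which keeps error handling explicit in pipeline code.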
Questions?
- Check existing documentation in docs/
- Read AGENTS.md for AI agent instructions
- Review planning docs in docs/plan/global_glam/
- Open an issue on GitHub
License
By contributing, you agree that your contributions will be licensed under the MIT License.