
Contributing to GLAM Extractor

Thank you for your interest in contributing to the GLAM Extractor project! This document provides guidelines and instructions for contributors.

Development Setup

Prerequisites

  • Python 3.11 or higher
  • Poetry (Python package manager)
  • Git

Installation

# Clone the repository
git clone <repository-url>
cd glam-extractor

# Install dependencies
poetry install

# Install pre-commit hooks
poetry run pre-commit install

# Download spaCy models (required for NLP features)
poetry run python -m spacy download en_core_web_trf
poetry run python -m spacy download nl_core_news_lg
poetry run python -m spacy download xx_ent_wiki_sm

Project Structure

glam-extractor/
├── src/glam_extractor/      # Main package
│   ├── parsers/             # Conversation & CSV parsers
│   ├── extractors/          # NLP extraction engines
│   ├── crawlers/            # Web crawling (crawl4ai)
│   ├── validators/          # LinkML validation
│   ├── exporters/           # Multi-format export
│   ├── geocoding/           # Nominatim geocoding
│   └── utils/               # Utilities
├── tests/                   # Test suite
│   ├── unit/                # Unit tests
│   ├── integration/         # Integration tests
│   └── fixtures/            # Test data
├── schemas/                 # LinkML schemas
├── docs/                    # Documentation
└── data/                    # Reference data (CSVs)

Development Workflow

1. Create a Branch

git checkout -b feature/your-feature-name
# or
git checkout -b fix/bug-description

2. Make Changes

Follow the coding standards (see below) and ensure all tests pass.

3. Write Tests

All new features should include:

  • Unit tests in tests/unit/
  • Integration tests in tests/integration/ (if applicable)
  • Docstring examples that serve as documentation

4. Run Tests

# Run all tests
poetry run pytest

# Run specific test types
poetry run pytest -m unit
poetry run pytest -m integration

# Run with coverage
poetry run pytest --cov

# Run specific test file
poetry run pytest tests/unit/test_parsers.py
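For the `-m unit` and `-m integration` selections above to work without warnings, the markers must be registered with pytest. A pyproject.toml fragment along these lines would do it (this is an assumption about the project's configuration, not a verbatim copy of it):

```toml
[tool.pytest.ini_options]
markers = [
    "unit: fast, isolated tests",
    "integration: tests that exercise the full pipeline",
]
```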

5. Code Quality Checks

# Format code with black
poetry run black src/ tests/

# Lint with ruff
poetry run ruff check src/ tests/

# Type check with mypy
poetry run mypy src/

# Run all pre-commit hooks
poetry run pre-commit run --all-files

6. Commit Changes

git add .
git commit -m "feat: add institution name extractor"

Commit Message Format:

  • feat: New feature
  • fix: Bug fix
  • docs: Documentation changes
  • test: Test changes
  • refactor: Code refactoring
  • chore: Maintenance tasks

7. Push and Create PR

git push origin feature/your-feature-name

Then create a Pull Request on GitHub.

Coding Standards

Python Style

  • Follow PEP 8
  • Use Black for formatting (line length: 100)
  • Use type hints for all function signatures
  • Write docstrings for all public functions/classes

Example

from typing import Optional, List
from pathlib import Path


def extract_institution_names(
    conversation_path: Path,
    confidence_threshold: float = 0.7
) -> List[str]:
    """
    Extract heritage institution names from a conversation JSON file.

    Args:
        conversation_path: Path to conversation JSON file
        confidence_threshold: Minimum confidence score (0.0-1.0)

    Returns:
        List of institution names with confidence above threshold

    Raises:
        FileNotFoundError: If conversation file doesn't exist
        ValueError: If confidence_threshold is out of range

    Examples:
        >>> extract_institution_names(Path("brazilian_glam.json"))
        ['Biblioteca Nacional do Brasil', 'Museu Nacional']
    """
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0.0 and 1.0")
    
    # Implementation here
    ...

Type Hints

Use type hints for:

  • Function parameters
  • Return types
  • Class attributes

from typing import Optional, List, Dict, Any
from pathlib import Path
from datetime import datetime


class HeritageCustodian:
    """Represents a heritage institution"""
    
    name: str
    institution_type: str
    founded_date: Optional[datetime]
    identifiers: List[Dict[str, Any]]
    
    def __init__(
        self,
        name: str,
        institution_type: str,
        founded_date: Optional[datetime] = None
    ) -> None:
        self.name = name
        self.institution_type = institution_type
        self.founded_date = founded_date
        self.identifiers = []

Docstrings

Use Google-style docstrings:

def merge_institutions(
    source1: List[HeritageCustodian],
    source2: List[HeritageCustodian],
    merge_strategy: str = "isil_code"
) -> List[HeritageCustodian]:
    """
    Merge two lists of heritage institutions using specified strategy.

    Args:
        source1: First list of institutions
        source2: Second list of institutions
        merge_strategy: Strategy to use ("isil_code", "name_fuzzy", "location")

    Returns:
        Merged list with duplicates resolved

    Raises:
        ValueError: If merge_strategy is not recognized

    Note:
        When conflicts occur, source1 takes precedence for TIER_1 data;
        otherwise the record with the higher data tier is kept.

    Examples:
        >>> csv_institutions = load_csv_institutions(...)
        >>> conversation_institutions = extract_from_conversations(...)
        >>> merged = merge_institutions(csv_institutions, conversation_institutions)
    """
    ...
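A simplified sketch of the "isil_code" strategy described in the docstring above, using plain dicts in place of HeritageCustodian and ignoring data tiers (both simplifications are ours, not the project's), might look like:

```python
from typing import Dict, List


def merge_by_isil(source1: List[Dict], source2: List[Dict]) -> List[Dict]:
    """Merge two institution lists, deduplicating on ISIL code.

    Records from source1 win on conflicts, mirroring the precedence rule
    in the docstring above. Records without an ISIL code are kept as-is.
    """
    merged: Dict[str, Dict] = {}
    unkeyed: List[Dict] = []
    # Iterate source2 first so that source1 overwrites duplicate codes.
    for record in source2 + source1:
        isil = record.get("isil_code")
        if isil is None:
            unkeyed.append(record)
        else:
            merged[isil] = record
    return list(merged.values()) + unkeyed
```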

Testing Guidelines

Unit Tests

Test individual functions/classes in isolation:

# tests/unit/test_extractors.py
import pytest
from glam_extractor.extractors import extract_isil_codes


def test_extract_isil_codes_single():
    text = "The ISIL code NL-AsdAM identifies Amsterdam Museum"
    codes = extract_isil_codes(text)
    assert len(codes) == 1
    assert codes[0]["value"] == "NL-AsdAM"


def test_extract_isil_codes_multiple():
    text = "Codes include NL-AsdAM and NL-AmfRCE"
    codes = extract_isil_codes(text)
    assert len(codes) == 2


def test_extract_isil_codes_none():
    text = "No ISIL codes here"
    codes = extract_isil_codes(text)
    assert len(codes) == 0
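A function satisfying these tests might be implemented as a regex scan. The sketch below is illustrative only: the pattern is a rough simplification of the ISO 15511 ISIL format, and the dict shape is inferred from the assertions above rather than from the project's actual code.

```python
import re
from typing import Dict, List

# Rough sketch of an ISIL: a short uppercase prefix, a hyphen, and an
# alphanumeric unit identifier. Real ISO 15511 codes allow more forms;
# this pattern is a simplification for illustration.
ISIL_PATTERN = re.compile(r"\b([A-Z]{1,4})-([A-Za-z0-9/:-]{1,11})\b")


def extract_isil_codes(text: str) -> List[Dict[str, str]]:
    """Return candidate ISIL codes found in free text."""
    return [
        {"value": match.group(0), "prefix": match.group(1)}
        for match in ISIL_PATTERN.finditer(text)
    ]
```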

Integration Tests

Test full workflows:

# tests/integration/test_pipeline.py
import pytest
from pathlib import Path
from glam_extractor import ConversationParser, InstitutionExtractor, LinkMLValidator


@pytest.mark.integration
def test_full_extraction_pipeline(tmp_path):
    # Setup
    conversation_file = Path("tests/fixtures/brazilian_glam.json")
    output_file = tmp_path / "output.jsonld"
    
    # Execute
    parser = ConversationParser()
    conversation = parser.load(conversation_file)
    
    extractor = InstitutionExtractor()
    institutions = extractor.extract(conversation)
    
    validator = LinkMLValidator(schema="schemas/heritage_custodian.yaml")
    valid_institutions = validator.validate_batch(institutions)
    
    # Assert
    assert len(institutions) > 0
    assert all(inst.provenance.data_source == "CONVERSATION_NLP" for inst in institutions)
    assert len(valid_institutions) == len(institutions)

Test Fixtures

Create reusable test data in tests/fixtures/:

# tests/fixtures/sample_conversation.py
import json
from pathlib import Path


def create_sample_conversation(tmp_path: Path) -> Path:
    """Create a minimal conversation JSON for testing"""
    conversation = {
        "uuid": "test-uuid",
        "name": "Test Conversation",
        "chat_messages": [
            {
                "uuid": "msg-1",
                "text": "Amsterdam Museum (ISIL: NL-AsdAM) is located in Amsterdam.",
                "sender": "assistant",
                "content": [{"type": "text", "text": "..."}]
            }
        ]
    }
    
    fixture_path = tmp_path / "test_conversation.json"
    fixture_path.write_text(json.dumps(conversation, indent=2))
    return fixture_path

Documentation

API Documentation

All public functions/classes must have docstrings. We use mkdocstrings to auto-generate API docs.

# Serve docs locally
poetry run mkdocs serve

# Build static docs
poetry run mkdocs build

Tutorials

Add tutorials to docs/tutorials/ with step-by-step examples.

Examples

Add working code examples to docs/examples/.

Areas for Contribution

High Priority

  1. Parser Implementation (src/glam_extractor/parsers/)
     • Conversation JSON parser
     • CSV parser for Dutch datasets
     • Schema-compliant object builders
  2. Extractor Implementation (src/glam_extractor/extractors/)
     • spaCy NER integration
     • Institution type classifier
     • Identifier pattern extractors
  3. Validator Implementation (src/glam_extractor/validators/)
     • LinkML schema validator
     • Cross-reference validator
  4. Exporter Implementation (src/glam_extractor/exporters/)
     • JSON-LD exporter
     • RDF/Turtle exporter
     • CSV/Parquet exporters

Medium Priority

  1. Geocoding Module (src/glam_extractor/geocoding/)
     • Nominatim client
     • GeoNames integration
     • Caching layer
  2. Web Crawler (src/glam_extractor/crawlers/)
     • crawl4ai integration
     • Institution website scraping

Lower Priority

  1. CLI Enhancements (src/glam_extractor/cli.py)
     • Progress bars
     • Better error reporting
     • Configuration file support
  2. Performance Optimization
     • Parallel processing
     • Caching strategies
     • Memory optimization

Design Patterns

Follow the patterns documented in docs/plan/global_glam/05-design-patterns.md:

  • Pipeline Pattern: For data processing workflows
  • Repository Pattern: For data access
  • Strategy Pattern: For configurable algorithms
  • Builder Pattern: For complex object construction
  • Result Pattern: For explicit error handling
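As an illustration of the last of these, a minimal Result type for explicit error handling could look like the sketch below. This is illustrative only; the project may define its own variant, and `parse_confidence` is a hypothetical example function, not part of the codebase.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Result(Generic[T]):
    """Explicit success/failure wrapper, returned instead of raising."""

    value: Optional[T] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def parse_confidence(raw: str) -> Result[float]:
    """Parse a confidence score, returning a Result rather than raising."""
    try:
        score = float(raw)
    except ValueError:
        return Result(error=f"not a number: {raw!r}")
    if not 0.0 <= score <= 1.0:
        return Result(error=f"out of range: {score}")
    return Result(value=score)
```

Callers then branch on `result.ok` instead of wrapping every call in try/except, which keeps failure handling visible at each pipeline stage.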

Questions?

  • Check existing documentation in docs/
  • Read AGENTS.md for AI agent instructions
  • Review planning docs in docs/plan/global_glam/
  • Open an issue on GitHub

License

By contributing, you agree that your contributions will be licensed under the MIT License.