# Contributing to GLAM Extractor

Thank you for your interest in contributing to the GLAM Extractor project! This document provides guidelines and instructions for contributors.

## Development Setup

### Prerequisites

- Python 3.11 or higher
- Poetry (Python package manager)
- Git

### Installation

```bash
# Clone the repository
git clone <repository-url>
cd glam-extractor

# Install dependencies
poetry install

# Install pre-commit hooks
poetry run pre-commit install

# Download spaCy models (required for NLP features)
poetry run python -m spacy download en_core_web_trf
poetry run python -m spacy download nl_core_news_lg
poetry run python -m spacy download xx_ent_wiki_sm
```

## Project Structure

```
glam-extractor/
├── src/glam_extractor/      # Main package
│   ├── parsers/             # Conversation & CSV parsers
│   ├── extractors/          # NLP extraction engines
│   ├── crawlers/            # Web crawling (crawl4ai)
│   ├── validators/          # LinkML validation
│   ├── exporters/           # Multi-format export
│   ├── geocoding/           # Nominatim geocoding
│   └── utils/               # Utilities
├── tests/                   # Test suite
│   ├── unit/                # Unit tests
│   ├── integration/         # Integration tests
│   └── fixtures/            # Test data
├── schemas/                 # LinkML schemas
├── docs/                    # Documentation
└── data/                    # Reference data (CSVs)
```

## Development Workflow

### 1. Create a Branch

```bash
git checkout -b feature/your-feature-name
# or
git checkout -b fix/bug-description
```

### 2. Make Changes

Follow the coding standards (see below) and ensure all tests pass.

### 3. Write Tests

All new features should include:

- Unit tests in `tests/unit/`
- Integration tests in `tests/integration/` (if applicable)
- Docstring examples that serve as documentation

### 4. Run Tests

```bash
# Run all tests
poetry run pytest

# Run specific test types
poetry run pytest -m unit
poetry run pytest -m integration

# Run with coverage
poetry run pytest --cov

# Run a specific test file
poetry run pytest tests/unit/test_parsers.py
```
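
The `unit` and `integration` markers must be registered with pytest to avoid warnings. A minimal configuration fragment — assuming markers are declared in `pyproject.toml`, which may differ in this repo — might look like:

```toml
[tool.pytest.ini_options]
markers = [
    "unit: fast, isolated tests",
    "integration: tests exercising full workflows",
]
```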

### 5. Code Quality Checks

```bash
# Format code with black
poetry run black src/ tests/

# Lint with ruff
poetry run ruff check src/ tests/

# Type check with mypy
poetry run mypy src/

# Run all pre-commit hooks
poetry run pre-commit run --all-files
```

### 6. Commit Changes

```bash
git add .
git commit -m "feat: add institution name extractor"
```

**Commit Message Format**:

- `feat:` New feature
- `fix:` Bug fix
- `docs:` Documentation changes
- `test:` Test changes
- `refactor:` Code refactoring
- `chore:` Maintenance tasks
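
A quick local sanity check for messages can be a small regex; this sketch assumes the `type: subject` form shown above, plus an optional `(scope)` as used in conventional commits (the hook itself is hypothetical, not part of this repo):

```python
import re

# Matches the commit types listed above, an optional "(scope)",
# then ": " and a non-empty subject.
COMMIT_RE = re.compile(r"^(feat|fix|docs|test|refactor|chore)(\([\w-]+\))?: \S.*$")


def is_valid_commit_message(message: str) -> bool:
    """Return True if the first line follows the commit format above."""
    return COMMIT_RE.match(message.splitlines()[0]) is not None
```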

### 7. Push and Create PR

```bash
git push origin feature/your-feature-name
```

Then create a Pull Request on GitHub.

## Coding Standards

### Python Style

- Follow PEP 8
- Use Black for formatting (line length: 100)
- Use type hints for all function signatures
- Write docstrings for all public functions/classes

### Example

```python
from pathlib import Path
from typing import List


def extract_institution_names(
    conversation_path: Path,
    confidence_threshold: float = 0.7
) -> List[str]:
    """
    Extract heritage institution names from a conversation JSON file.

    Args:
        conversation_path: Path to conversation JSON file
        confidence_threshold: Minimum confidence score (0.0-1.0)

    Returns:
        List of institution names with confidence above threshold

    Raises:
        FileNotFoundError: If conversation file doesn't exist
        ValueError: If confidence_threshold is out of range

    Examples:
        >>> extract_institution_names(Path("brazilian_glam.json"))
        ['Biblioteca Nacional do Brasil', 'Museu Nacional']
    """
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0.0 and 1.0")

    # Implementation here
    ...
```

### Type Hints

Use type hints for:

- Function parameters
- Return types
- Class attributes

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


class HeritageCustodian:
    """Represents a heritage institution."""

    name: str
    institution_type: str
    founded_date: Optional[datetime]
    identifiers: List[Dict[str, Any]]

    def __init__(
        self,
        name: str,
        institution_type: str,
        founded_date: Optional[datetime] = None
    ) -> None:
        self.name = name
        self.institution_type = institution_type
        self.founded_date = founded_date
        self.identifiers = []
```

### Docstrings

Use Google-style docstrings:

```python
def merge_institutions(
    source1: List[HeritageCustodian],
    source2: List[HeritageCustodian],
    merge_strategy: str = "isil_code"
) -> List[HeritageCustodian]:
    """
    Merge two lists of heritage institutions using the specified strategy.

    Args:
        source1: First list of institutions
        source2: Second list of institutions
        merge_strategy: Strategy to use ("isil_code", "name_fuzzy", "location")

    Returns:
        Merged list with duplicates resolved

    Raises:
        ValueError: If merge_strategy is not recognized

    Note:
        When conflicts occur, source1 takes precedence for TIER_1 data;
        otherwise the entry with the highest data tier wins.

    Examples:
        >>> csv_institutions = load_csv_institutions(...)
        >>> conversation_institutions = extract_from_conversations(...)
        >>> merged = merge_institutions(csv_institutions, conversation_institutions)
    """
    ...
```
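
For the default `"isil_code"` strategy, deduplication can be as simple as keying both lists by ISIL and letting `source1` overwrite on collisions. The sketch below uses plain dicts rather than `HeritageCustodian` objects and ignores the data-tier tie-breaking described in the docstring, so treat it as an illustration only:

```python
from typing import Dict, List


def merge_by_isil(
    source1: List[Dict[str, str]],
    source2: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Merge two institution lists, treating a shared ISIL as a duplicate."""
    # dict.update lets later entries win, so insert source2 first
    # and let source1 overwrite on conflicting ISILs.
    merged = {inst["isil"]: inst for inst in source2}
    merged.update({inst["isil"]: inst for inst in source1})
    return list(merged.values())
```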

## Testing Guidelines

### Unit Tests

Test individual functions/classes in isolation:

```python
# tests/unit/test_extractors.py
from glam_extractor.extractors import extract_isil_codes


def test_extract_isil_codes_single():
    text = "The ISIL code NL-AsdAM identifies Amsterdam Museum"
    codes = extract_isil_codes(text)
    assert len(codes) == 1
    assert codes[0]["value"] == "NL-AsdAM"


def test_extract_isil_codes_multiple():
    text = "Codes include NL-AsdAM and NL-AmfRCE"
    codes = extract_isil_codes(text)
    assert len(codes) == 2


def test_extract_isil_codes_none():
    text = "No ISIL codes here"
    codes = extract_isil_codes(text)
    assert len(codes) == 0
```
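
If `extract_isil_codes` is not yet implemented, a regex-based sketch that would satisfy the tests above could look like this. The pattern is a simplification of ISO 15511 — real ISILs have a registered prefix and a restricted character set — so treat it as a starting point rather than the project's actual implementation:

```python
import re
from typing import Dict, List

# Simplified ISIL shape: 1-4 uppercase letters, a hyphen, then up to 11
# identifier characters. Real validation should check registered prefixes.
ISIL_PATTERN = re.compile(r"\b([A-Z]{1,4}-[A-Za-z0-9/:]{1,11})\b")


def extract_isil_codes(text: str) -> List[Dict[str, str]]:
    """Return ISIL-shaped identifiers found in free text."""
    return [{"type": "isil", "value": match} for match in ISIL_PATTERN.findall(text)]
```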

### Integration Tests

Test full workflows:

```python
# tests/integration/test_pipeline.py
import pytest
from pathlib import Path
from glam_extractor import ConversationParser, InstitutionExtractor, LinkMLValidator


@pytest.mark.integration
def test_full_extraction_pipeline(tmp_path):
    # Setup
    conversation_file = Path("tests/fixtures/brazilian_glam.json")
    output_file = tmp_path / "output.jsonld"

    # Execute
    parser = ConversationParser()
    conversation = parser.load(conversation_file)

    extractor = InstitutionExtractor()
    institutions = extractor.extract(conversation)

    validator = LinkMLValidator(schema="schemas/heritage_custodian.yaml")
    valid_institutions = validator.validate_batch(institutions)

    # Assert
    assert len(institutions) > 0
    assert all(inst.provenance.data_source == "CONVERSATION_NLP" for inst in institutions)
    assert len(valid_institutions) == len(institutions)
```

### Test Fixtures

Create reusable test data in `tests/fixtures/`:

```python
# tests/fixtures/sample_conversation.py
import json
from pathlib import Path


def create_sample_conversation(tmp_path: Path) -> Path:
    """Create a minimal conversation JSON for testing."""
    conversation = {
        "uuid": "test-uuid",
        "name": "Test Conversation",
        "chat_messages": [
            {
                "uuid": "msg-1",
                "text": "Amsterdam Museum (ISIL: NL-AsdAM) is located in Amsterdam.",
                "sender": "assistant",
                "content": [{"type": "text", "text": "..."}]
            }
        ]
    }

    fixture_path = tmp_path / "test_conversation.json"
    fixture_path.write_text(json.dumps(conversation, indent=2))
    return fixture_path
```

## Documentation

### API Documentation

All public functions/classes must have docstrings. We use mkdocstrings to auto-generate API docs.

```bash
# Serve docs locally
poetry run mkdocs serve

# Build static docs
poetry run mkdocs build
```

### Tutorials

Add tutorials to `docs/tutorials/` with step-by-step examples.

### Examples

Add working code examples to `docs/examples/`.

## Areas for Contribution

### High Priority

1. **Parser Implementation** (`src/glam_extractor/parsers/`)
   - Conversation JSON parser
   - CSV parser for Dutch datasets
   - Schema-compliant object builders

2. **Extractor Implementation** (`src/glam_extractor/extractors/`)
   - spaCy NER integration
   - Institution type classifier
   - Identifier pattern extractors

3. **Validator Implementation** (`src/glam_extractor/validators/`)
   - LinkML schema validator
   - Cross-reference validator

4. **Exporter Implementation** (`src/glam_extractor/exporters/`)
   - JSON-LD exporter
   - RDF/Turtle exporter
   - CSV/Parquet exporters

### Medium Priority

5. **Geocoding Module** (`src/glam_extractor/geocoding/`)
   - Nominatim client
   - GeoNames integration
   - Caching layer

6. **Web Crawler** (`src/glam_extractor/crawlers/`)
   - crawl4ai integration
   - Institution website scraping

### Lower Priority

7. **CLI Enhancements** (`src/glam_extractor/cli.py`)
   - Progress bars
   - Better error reporting
   - Configuration file support

8. **Performance Optimization**
   - Parallel processing
   - Caching strategies
   - Memory optimization

## Design Patterns

Follow the patterns documented in `docs/plan/global_glam/05-design-patterns.md`:

- **Pipeline Pattern**: For data processing workflows
- **Repository Pattern**: For data access
- **Strategy Pattern**: For configurable algorithms
- **Builder Pattern**: For complex object construction
- **Result Pattern**: For explicit error handling
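
As one illustration, the Result pattern can be sketched in a few lines. This is a minimal, hypothetical version — the project's actual helper, if one exists, may differ:

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Result(Generic[T]):
    """Explicit success-or-failure value, instead of raising exceptions."""
    value: Optional[T] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None
```

Callers then branch on `result.ok` at each pipeline stage rather than wrapping every call in `try`/`except`.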

## Questions?

- Check existing documentation in `docs/`
- Read `AGENTS.md` for AI agent instructions
- Review planning docs in `docs/plan/global_glam/`
- Open an issue on GitHub

## License

By contributing, you agree that your contributions will be licensed under the MIT License.