Contributing to GLAM Extractor
Thank you for your interest in contributing to the GLAM Extractor project! This document provides guidelines and instructions for contributors.
Development Setup
Prerequisites
- Python 3.11 or higher
- Poetry (Python package manager)
- Git
Installation
# Clone the repository
git clone <repository-url>
cd glam-extractor
# Install dependencies
poetry install
# Install pre-commit hooks
poetry run pre-commit install
# Download spaCy models (required for NLP features)
poetry run python -m spacy download en_core_web_trf
poetry run python -m spacy download nl_core_news_lg
poetry run python -m spacy download xx_ent_wiki_sm
Project Structure
glam-extractor/
├── src/glam_extractor/ # Main package
│ ├── parsers/ # Conversation & CSV parsers
│ ├── extractors/ # NLP extraction engines
│ ├── crawlers/ # Web crawling (crawl4ai)
│ ├── validators/ # LinkML validation
│ ├── exporters/ # Multi-format export
│ ├── geocoding/ # Nominatim geocoding
│ └── utils/ # Utilities
├── tests/ # Test suite
│ ├── unit/ # Unit tests
│ ├── integration/ # Integration tests
│ └── fixtures/ # Test data
├── schemas/ # LinkML schemas
├── docs/ # Documentation
└── data/ # Reference data (CSVs)
Development Workflow
1. Create a Branch
git checkout -b feature/your-feature-name
# or
git checkout -b fix/bug-description
2. Make Changes
Follow the coding standards (see below) and ensure all tests pass.
3. Write Tests
All new features should include:
- Unit tests in tests/unit/
- Integration tests in tests/integration/ (if applicable)
- Docstring examples that serve as documentation
4. Run Tests
# Run all tests
poetry run pytest
# Run specific test types
poetry run pytest -m unit
poetry run pytest -m integration
# Run with coverage
poetry run pytest --cov
# Run specific test file
poetry run pytest tests/unit/test_parsers.py
5. Code Quality Checks
# Format code with black
poetry run black src/ tests/
# Lint with ruff
poetry run ruff check src/ tests/
# Type check with mypy
poetry run mypy src/
# Run all pre-commit hooks
poetry run pre-commit run --all-files
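Assuming these tools are configured in pyproject.toml (hypothetical excerpt — check the repository's actual configuration), keeping Black and Ruff on the same line length prevents the two from fighting over reflows:

```toml
# pyproject.toml (hypothetical excerpt)
[tool.black]
line-length = 100

[tool.ruff]
line-length = 100

[tool.mypy]
strict = true
```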
6. Commit Changes
git add .
git commit -m "feat: add institution name extractor"
Commit Message Format:
- feat: New feature
- fix: Bug fix
- docs: Documentation changes
- test: Test changes
- refactor: Code refactoring
- chore: Maintenance tasks
7. Push and Create PR
git push origin feature/your-feature-name
Then create a Pull Request on GitHub.
Coding Standards
Python Style
- Follow PEP 8
- Use Black for formatting (line length: 100)
- Use type hints for all function signatures
- Write docstrings for all public functions/classes
Example
from typing import List
from pathlib import Path


def extract_institution_names(
    conversation_path: Path,
    confidence_threshold: float = 0.7,
) -> List[str]:
    """
    Extract heritage institution names from a conversation JSON file.

    Args:
        conversation_path: Path to conversation JSON file
        confidence_threshold: Minimum confidence score (0.0-1.0)

    Returns:
        List of institution names with confidence above threshold

    Raises:
        FileNotFoundError: If conversation file doesn't exist
        ValueError: If confidence_threshold is out of range

    Examples:
        >>> extract_institution_names(Path("brazilian_glam.json"))
        ['Biblioteca Nacional do Brasil', 'Museu Nacional']
    """
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0.0 and 1.0")
    # Implementation here
    ...
Type Hints
Use type hints for:
- Function parameters
- Return types
- Class attributes
from typing import Any, Dict, List, Optional
from datetime import datetime


class HeritageCustodian:
    """Represents a heritage institution."""

    name: str
    institution_type: str
    founded_date: Optional[datetime]
    identifiers: List[Dict[str, Any]]

    def __init__(
        self,
        name: str,
        institution_type: str,
        founded_date: Optional[datetime] = None,
    ) -> None:
        self.name = name
        self.institution_type = institution_type
        self.founded_date = founded_date
        self.identifiers = []
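The same annotated attributes can also be expressed as a dataclass, which generates __init__ automatically. This is an equivalent sketch, not the project's required style:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class HeritageCustodian:
    """Represents a heritage institution (dataclass form)."""

    name: str
    institution_type: str
    founded_date: Optional[datetime] = None
    # default_factory gives each instance its own empty list
    identifiers: List[Dict[str, Any]] = field(default_factory=list)
```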
Docstrings
Use Google-style docstrings:
def merge_institutions(
    source1: List[HeritageCustodian],
    source2: List[HeritageCustodian],
    merge_strategy: str = "isil_code",
) -> List[HeritageCustodian]:
    """
    Merge two lists of heritage institutions using the specified strategy.

    Args:
        source1: First list of institutions
        source2: Second list of institutions
        merge_strategy: Strategy to use ("isil_code", "name_fuzzy", "location")

    Returns:
        Merged list with duplicates resolved

    Raises:
        ValueError: If merge_strategy is not recognized

    Note:
        When conflicts occur, source1 takes precedence for TIER_1 data;
        otherwise the highest data tier wins.

    Examples:
        >>> csv_institutions = load_csv_institutions(...)
        >>> conversation_institutions = extract_from_conversations(...)
        >>> merged = merge_institutions(csv_institutions, conversation_institutions)
    """
    ...
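For the default "isil_code" strategy, the core of such a merge can be sketched with plain dicts (hypothetical field names — the real HeritageCustodian objects would be accessed through their attributes instead):

```python
from typing import Dict, List


def merge_by_isil(
    source1: List[Dict[str, str]],
    source2: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Deduplicate two institution lists on their ISIL code.

    Entries from source1 are applied last, so they win on conflicts.
    """
    merged: Dict[str, Dict[str, str]] = {}
    for inst in source2 + source1:
        # Fall back to the name when no ISIL code is present
        key = inst.get("isil_code") or inst["name"]
        merged[key] = inst
    return list(merged.values())
```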
Testing Guidelines
Unit Tests
Test individual functions/classes in isolation:
# tests/unit/test_extractors.py
import pytest
from glam_extractor.extractors import extract_isil_codes


def test_extract_isil_codes_single():
    text = "The ISIL code NL-AsdAM identifies Amsterdam Museum"
    codes = extract_isil_codes(text)
    assert len(codes) == 1
    assert codes[0]["value"] == "NL-AsdAM"


def test_extract_isil_codes_multiple():
    text = "Codes include NL-AsdAM and NL-AmfRCE"
    codes = extract_isil_codes(text)
    assert len(codes) == 2


def test_extract_isil_codes_none():
    text = "No ISIL codes here"
    codes = extract_isil_codes(text)
    assert len(codes) == 0
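The tests above assume an extract_isil_codes helper whose implementation is not shown in this guide. A minimal regex-based sketch that would satisfy them might look like this (ISIL identifiers follow ISO 15511's "prefix-unit id" shape, e.g. NL-AsdAM; this pattern is deliberately loose):

```python
import re
from typing import Dict, List

# Hypothetical sketch: matches "<1-4 uppercase letters>-<unit id>".
# A production extractor would validate prefixes against the ISIL registry.
ISIL_PATTERN = re.compile(r"\b([A-Z]{1,4})-([A-Za-z0-9:/]+)\b")


def extract_isil_codes(text: str) -> List[Dict[str, str]]:
    """Return candidate ISIL codes found in free text."""
    return [
        {"value": match.group(0), "prefix": match.group(1)}
        for match in ISIL_PATTERN.finditer(text)
    ]
```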
Integration Tests
Test full workflows:
# tests/integration/test_pipeline.py
import pytest
from pathlib import Path
from glam_extractor import ConversationParser, InstitutionExtractor, LinkMLValidator


@pytest.mark.integration
def test_full_extraction_pipeline(tmp_path):
    # Setup
    conversation_file = Path("tests/fixtures/brazilian_glam.json")
    output_file = tmp_path / "output.jsonld"

    # Execute
    parser = ConversationParser()
    conversation = parser.load(conversation_file)

    extractor = InstitutionExtractor()
    institutions = extractor.extract(conversation)

    validator = LinkMLValidator(schema="schemas/heritage_custodian.yaml")
    valid_institutions = validator.validate_batch(institutions)

    # Assert
    assert len(institutions) > 0
    assert all(inst.provenance.data_source == "CONVERSATION_NLP" for inst in institutions)
    assert len(valid_institutions) == len(institutions)
Test Fixtures
Create reusable test data in tests/fixtures/:
# tests/fixtures/sample_conversation.py
import json
from pathlib import Path


def create_sample_conversation(tmp_path: Path) -> Path:
    """Create a minimal conversation JSON for testing."""
    conversation = {
        "uuid": "test-uuid",
        "name": "Test Conversation",
        "chat_messages": [
            {
                "uuid": "msg-1",
                "text": "Amsterdam Museum (ISIL: NL-AsdAM) is located in Amsterdam.",
                "sender": "assistant",
                "content": [{"type": "text", "text": "..."}],
            }
        ],
    }
    fixture_path = tmp_path / "test_conversation.json"
    fixture_path.write_text(json.dumps(conversation, indent=2))
    return fixture_path
Documentation
API Documentation
All public functions/classes must have docstrings. We use mkdocstrings to auto-generate API docs.
# Serve docs locally
poetry run mkdocs serve
# Build static docs
poetry run mkdocs build
Tutorials
Add tutorials to docs/tutorials/ with step-by-step examples.
Examples
Add working code examples to docs/examples/.
Areas for Contribution
High Priority
- Parser Implementation (src/glam_extractor/parsers/)
  - Conversation JSON parser
  - CSV parser for Dutch datasets
  - Schema-compliant object builders
- Extractor Implementation (src/glam_extractor/extractors/)
  - spaCy NER integration
  - Institution type classifier
  - Identifier pattern extractors
- Validator Implementation (src/glam_extractor/validators/)
  - LinkML schema validator
  - Cross-reference validator
- Exporter Implementation (src/glam_extractor/exporters/)
  - JSON-LD exporter
  - RDF/Turtle exporter
  - CSV/Parquet exporters
Medium Priority
- Geocoding Module (src/glam_extractor/geocoding/)
  - Nominatim client
  - GeoNames integration
  - Caching layer
- Web Crawler (src/glam_extractor/crawlers/)
  - crawl4ai integration
  - Institution website scraping
Lower Priority
- CLI Enhancements (src/glam_extractor/cli.py)
  - Progress bars
  - Better error reporting
  - Configuration file support
- Performance Optimization
  - Parallel processing
  - Caching strategies
  - Memory optimization
Design Patterns
Follow the patterns documented in docs/plan/global_glam/05-design-patterns.md:
- Pipeline Pattern: For data processing workflows
- Repository Pattern: For data access
- Strategy Pattern: For configurable algorithms
- Builder Pattern: For complex object construction
- Result Pattern: For explicit error handling
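As an illustration of the last of these, a minimal Result sketch could look like the following (hypothetical — the project's actual Result type lives in the design-patterns doc and may differ):

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Result(Generic[T]):
    """Holds either a value or an error message, never both."""

    value: Optional[T] = None
    error: Optional[str] = None

    @property
    def is_ok(self) -> bool:
        return self.error is None


def parse_founding_year(raw: str) -> Result[int]:
    """Parse a year string without raising, returning an explicit Result."""
    try:
        return Result(value=int(raw))
    except ValueError:
        return Result(error=f"not a valid year: {raw!r}")
```

Callers then branch on is_ok instead of wrapping every call site in try/except, which keeps error handling explicit in pipeline code.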
Questions?
- Check existing documentation in docs/
- Read AGENTS.md for AI agent instructions
- Review planning docs in docs/plan/global_glam/
- Open an issue on GitHub
License
By contributing, you agree that your contributions will be licensed under the MIT License.