# Contributing to GLAM Extractor
Thank you for your interest in contributing to the GLAM Extractor project! This document provides guidelines and instructions for contributors.
## Development Setup
### Prerequisites
- Python 3.11 or higher
- Poetry (Python package manager)
- Git
### Installation
```bash
# Clone the repository
git clone <repository-url>
cd glam-extractor
# Install dependencies
poetry install
# Install pre-commit hooks
poetry run pre-commit install
# Download spaCy models (required for NLP features)
poetry run python -m spacy download en_core_web_trf
poetry run python -m spacy download nl_core_news_lg
poetry run python -m spacy download xx_ent_wiki_sm
```
## Project Structure
```
glam-extractor/
├── src/glam_extractor/   # Main package
│   ├── parsers/          # Conversation & CSV parsers
│   ├── extractors/       # NLP extraction engines
│   ├── crawlers/         # Web crawling (crawl4ai)
│   ├── validators/       # LinkML validation
│   ├── exporters/        # Multi-format export
│   ├── geocoding/        # Nominatim geocoding
│   └── utils/            # Utilities
├── tests/                # Test suite
│   ├── unit/             # Unit tests
│   ├── integration/      # Integration tests
│   └── fixtures/         # Test data
├── schemas/              # LinkML schemas
├── docs/                 # Documentation
└── data/                 # Reference data (CSVs)
```
## Development Workflow
### 1. Create a Branch
```bash
git checkout -b feature/your-feature-name
# or
git checkout -b fix/bug-description
```
### 2. Make Changes
Follow the coding standards (see below) and ensure all tests pass.
### 3. Write Tests
All new features should include:
- Unit tests in `tests/unit/`
- Integration tests in `tests/integration/` (if applicable)
- Docstring examples that serve as documentation
### 4. Run Tests
```bash
# Run all tests
poetry run pytest
# Run specific test types
poetry run pytest -m unit
poetry run pytest -m integration
# Run with coverage
poetry run pytest --cov
# Run specific test file
poetry run pytest tests/unit/test_parsers.py
```
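The `-m unit` and `-m integration` selectors above assume the markers are registered with pytest; a typical `pyproject.toml` fragment for that looks like the following (illustrative only, check the project's actual configuration):

```toml
[tool.pytest.ini_options]
markers = [
    "unit: fast, isolated unit tests",
    "integration: slower tests that exercise full workflows",
]
```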
### 5. Code Quality Checks
```bash
# Format code with black
poetry run black src/ tests/
# Lint with ruff
poetry run ruff check src/ tests/
# Type check with mypy
poetry run mypy src/
# Run all pre-commit hooks
poetry run pre-commit run --all-files
```
### 6. Commit Changes
```bash
git add .
git commit -m "feat: add institution name extractor"
```
**Commit Message Format**:
- `feat:` New feature
- `fix:` Bug fix
- `docs:` Documentation changes
- `test:` Test changes
- `refactor:` Code refactoring
- `chore:` Maintenance tasks
### 7. Push and Create PR
```bash
git push origin feature/your-feature-name
```
Then create a Pull Request on GitHub.
## Coding Standards
### Python Style
- Follow PEP 8
- Use Black for formatting (line length: 100)
- Use type hints for all function signatures
- Write docstrings for all public functions/classes
### Example
```python
from typing import Optional, List
from pathlib import Path


def extract_institution_names(
    conversation_path: Path,
    confidence_threshold: float = 0.7
) -> List[str]:
    """
    Extract heritage institution names from a conversation JSON file.

    Args:
        conversation_path: Path to conversation JSON file
        confidence_threshold: Minimum confidence score (0.0-1.0)

    Returns:
        List of institution names with confidence above threshold

    Raises:
        FileNotFoundError: If conversation file doesn't exist
        ValueError: If confidence_threshold is out of range

    Examples:
        >>> extract_institution_names(Path("brazilian_glam.json"))
        ['Biblioteca Nacional do Brasil', 'Museu Nacional']
    """
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0.0 and 1.0")
    # Implementation here
    ...
```
### Type Hints
Use type hints for:
- Function parameters
- Return types
- Class attributes
```python
from typing import Optional, List, Dict, Any
from pathlib import Path
from datetime import datetime


class HeritageCustodian:
    """Represents a heritage institution"""

    name: str
    institution_type: str
    founded_date: Optional[datetime]
    identifiers: List[Dict[str, Any]]

    def __init__(
        self,
        name: str,
        institution_type: str,
        founded_date: Optional[datetime] = None
    ) -> None:
        self.name = name
        self.institution_type = institution_type
        self.founded_date = founded_date
        self.identifiers = []
```
### Docstrings
Use Google-style docstrings:
```python
def merge_institutions(
    source1: List[HeritageCustodian],
    source2: List[HeritageCustodian],
    merge_strategy: str = "isil_code"
) -> List[HeritageCustodian]:
    """
    Merge two lists of heritage institutions using specified strategy.

    Args:
        source1: First list of institutions
        source2: Second list of institutions
        merge_strategy: Strategy to use ("isil_code", "name_fuzzy", "location")

    Returns:
        Merged list with duplicates resolved

    Raises:
        ValueError: If merge_strategy is not recognized

    Note:
        When conflicts occur, source1 takes precedence for TIER_1 data,
        otherwise uses highest data tier.

    Examples:
        >>> csv_institutions = load_csv_institutions(...)
        >>> conversation_institutions = extract_from_conversations(...)
        >>> merged = merge_institutions(csv_institutions, conversation_institutions)
    """
    ...
```
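To make the `isil_code` strategy concrete, here is a minimal sketch using plain dicts in place of `HeritageCustodian` objects; the `isil` key and the omission of tier handling are simplifying assumptions, not the project's actual merge logic:

```python
from typing import Any, Dict, List


def merge_by_isil(
    source1: List[Dict[str, Any]],
    source2: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Merge two institution lists keyed on ISIL code; source1 wins on conflict."""
    merged: Dict[str, Dict[str, Any]] = {}
    # Insert source2 first so that source1 entries overwrite duplicates.
    for inst in source2 + source1:
        merged[inst["isil"]] = inst
    return list(merged.values())


csv_records = [{"isil": "NL-AsdAM", "name": "Amsterdam Museum"}]
nlp_records = [
    {"isil": "NL-AsdAM", "name": "amsterdam museum"},  # duplicate, lower quality
    {"isil": "NL-AmfRCE", "name": "Rijksdienst voor het Cultureel Erfgoed"},
]
merged = merge_by_isil(csv_records, nlp_records)
print(len(merged))  # 2
```

A real implementation would also compare data tiers before overwriting, per the Note in the docstring above.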
## Testing Guidelines
### Unit Tests
Test individual functions/classes in isolation:
```python
# tests/unit/test_extractors.py
import pytest
from glam_extractor.extractors import extract_isil_codes


def test_extract_isil_codes_single():
    text = "The ISIL code NL-AsdAM identifies Amsterdam Museum"
    codes = extract_isil_codes(text)
    assert len(codes) == 1
    assert codes[0]["value"] == "NL-AsdAM"


def test_extract_isil_codes_multiple():
    text = "Codes include NL-AsdAM and NL-AmfRCE"
    codes = extract_isil_codes(text)
    assert len(codes) == 2


def test_extract_isil_codes_none():
    text = "No ISIL codes here"
    codes = extract_isil_codes(text)
    assert len(codes) == 0
```
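For orientation, a regex-based sketch of what these tests expect from `extract_isil_codes` might look like the one below; the pattern is a rough simplification of the ISO 15511 ISIL grammar, not the project's actual extractor:

```python
import re
from typing import Dict, List

# An ISIL like "NL-AsdAM" is a short uppercase prefix, a hyphen, and a local
# identifier. This pattern is a simplification of the real ISO 15511 grammar.
ISIL_PATTERN = re.compile(r"\b([A-Z]{1,4}-[A-Za-z0-9/:\-]{1,11})\b")


def extract_isil_codes(text: str) -> List[Dict[str, str]]:
    """Return each ISIL-like token as {"type": "ISIL", "value": ...}."""
    return [
        {"type": "ISIL", "value": m.group(1)}
        for m in ISIL_PATTERN.finditer(text)
    ]


print(extract_isil_codes("The ISIL code NL-AsdAM identifies Amsterdam Museum"))
# [{'type': 'ISIL', 'value': 'NL-AsdAM'}]
```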
### Integration Tests
Test full workflows:
```python
# tests/integration/test_pipeline.py
import pytest
from pathlib import Path
from glam_extractor import ConversationParser, InstitutionExtractor, LinkMLValidator


@pytest.mark.integration
def test_full_extraction_pipeline(tmp_path):
    # Setup
    conversation_file = Path("tests/fixtures/brazilian_glam.json")
    output_file = tmp_path / "output.jsonld"

    # Execute
    parser = ConversationParser()
    conversation = parser.load(conversation_file)

    extractor = InstitutionExtractor()
    institutions = extractor.extract(conversation)

    validator = LinkMLValidator(schema="schemas/heritage_custodian.yaml")
    valid_institutions = validator.validate_batch(institutions)

    # Assert
    assert len(institutions) > 0
    assert all(inst.provenance.data_source == "CONVERSATION_NLP" for inst in institutions)
    assert len(valid_institutions) == len(institutions)
```
### Test Fixtures
Create reusable test data in `tests/fixtures/`:
```python
# tests/fixtures/sample_conversation.py
import json
from pathlib import Path


def create_sample_conversation(tmp_path: Path) -> Path:
    """Create a minimal conversation JSON for testing"""
    conversation = {
        "uuid": "test-uuid",
        "name": "Test Conversation",
        "chat_messages": [
            {
                "uuid": "msg-1",
                "text": "Amsterdam Museum (ISIL: NL-AsdAM) is located in Amsterdam.",
                "sender": "assistant",
                "content": [{"type": "text", "text": "..."}],
            }
        ],
    }
    fixture_path = tmp_path / "test_conversation.json"
    fixture_path.write_text(json.dumps(conversation, indent=2))
    return fixture_path
```
## Documentation
### API Documentation
All public functions/classes must have docstrings. We use mkdocstrings to auto-generate API docs.
```bash
# Serve docs locally
poetry run mkdocs serve
# Build static docs
poetry run mkdocs build
```
### Tutorials
Add tutorials to `docs/tutorials/` with step-by-step examples.
### Examples
Add working code examples to `docs/examples/`.
## Areas for Contribution
### High Priority
1. **Parser Implementation** (`src/glam_extractor/parsers/`)
- Conversation JSON parser
- CSV parser for Dutch datasets
- Schema-compliant object builders
2. **Extractor Implementation** (`src/glam_extractor/extractors/`)
- spaCy NER integration
- Institution type classifier
- Identifier pattern extractors
3. **Validator Implementation** (`src/glam_extractor/validators/`)
- LinkML schema validator
- Cross-reference validator
4. **Exporter Implementation** (`src/glam_extractor/exporters/`)
- JSON-LD exporter
- RDF/Turtle exporter
- CSV/Parquet exporters
### Medium Priority
5. **Geocoding Module** (`src/glam_extractor/geocoding/`)
- Nominatim client
- GeoNames integration
- Caching layer
6. **Web Crawler** (`src/glam_extractor/crawlers/`)
- crawl4ai integration
- Institution website scraping
### Lower Priority
7. **CLI Enhancements** (`src/glam_extractor/cli.py`)
- Progress bars
- Better error reporting
- Configuration file support
8. **Performance Optimization**
- Parallel processing
- Caching strategies
- Memory optimization
## Design Patterns
Follow the patterns documented in `docs/plan/global_glam/05-design-patterns.md`:
- **Pipeline Pattern**: For data processing workflows
- **Repository Pattern**: For data access
- **Strategy Pattern**: For configurable algorithms
- **Builder Pattern**: For complex object construction
- **Result Pattern**: For explicit error handling
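As one illustration, the Result pattern can be sketched as a small generic dataclass; this is a hedged example, not the implementation prescribed in the design-patterns doc (`parse_year` and its year format are invented for demonstration):

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Result(Generic[T]):
    """Explicit success/failure wrapper instead of raising exceptions."""
    value: Optional[T] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def parse_year(raw: str) -> Result[int]:
    """Parse a four-digit year, returning an error Result instead of raising."""
    if raw.isdigit() and len(raw) == 4:
        return Result(value=int(raw))
    return Result(error=f"not a four-digit year: {raw!r}")


print(parse_year("1926").ok)  # True
print(parse_year("n/a").error)
```

Callers then branch on `.ok` rather than wrapping every call site in `try`/`except`, which keeps failure handling explicit in pipeline code.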
## Questions?
- Check existing documentation in `docs/`
- Read `AGENTS.md` for AI agent instructions
- Review planning docs in `docs/plan/global_glam/`
- Open an issue on GitHub
## License
By contributing, you agree that your contributions will be licensed under the MIT License.