# Contributing to GLAM Extractor

Thank you for your interest in contributing to the GLAM Extractor project! This document provides guidelines and instructions for contributors.

## Development Setup

### Prerequisites

- Python 3.11 or higher
- Poetry (Python package manager)
- Git

### Installation

```bash
# Clone the repository
git clone <repository-url>
cd glam-extractor

# Install dependencies
poetry install

# Install pre-commit hooks
poetry run pre-commit install

# Download spaCy models (required for NLP features)
poetry run python -m spacy download en_core_web_trf
poetry run python -m spacy download nl_core_news_lg
poetry run python -m spacy download xx_ent_wiki_sm
```

## Project Structure

```
glam-extractor/
├── src/glam_extractor/      # Main package
│   ├── parsers/             # Conversation & CSV parsers
│   ├── extractors/          # NLP extraction engines
│   ├── crawlers/            # Web crawling (crawl4ai)
│   ├── validators/          # LinkML validation
│   ├── exporters/           # Multi-format export
│   ├── geocoding/           # Nominatim geocoding
│   └── utils/               # Utilities
├── tests/                   # Test suite
│   ├── unit/                # Unit tests
│   ├── integration/         # Integration tests
│   └── fixtures/            # Test data
├── schemas/                 # LinkML schemas
├── docs/                    # Documentation
└── data/                    # Reference data (CSVs)
```

## Development Workflow

### 1. Create a Branch

```bash
git checkout -b feature/your-feature-name
# or
git checkout -b fix/bug-description
```

### 2. Make Changes

Follow the coding standards (see below) and ensure all tests pass.

### 3. Write Tests

All new features should include:

- Unit tests in `tests/unit/`
- Integration tests in `tests/integration/` (if applicable)
- Docstring examples that serve as documentation

### 4. Run Tests

```bash
# Run all tests
poetry run pytest

# Run specific test types
poetry run pytest -m unit
poetry run pytest -m integration

# Run with coverage
poetry run pytest --cov

# Run a specific test file
poetry run pytest tests/unit/test_parsers.py
```
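
The `unit` and `integration` markers must be registered with pytest to avoid warnings. A minimal configuration fragment — assuming markers are declared in `pyproject.toml`, which may differ in this repo — might look like:

```toml
[tool.pytest.ini_options]
markers = [
    "unit: fast, isolated tests",
    "integration: tests exercising full workflows",
]
```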

### 5. Code Quality Checks

```bash
# Format code with black
poetry run black src/ tests/

# Lint with ruff
poetry run ruff check src/ tests/

# Type check with mypy
poetry run mypy src/

# Run all pre-commit hooks
poetry run pre-commit run --all-files
```

### 6. Commit Changes

```bash
git add .
git commit -m "feat: add institution name extractor"
```

**Commit Message Format**:

- `feat:` New feature
- `fix:` Bug fix
- `docs:` Documentation changes
- `test:` Test changes
- `refactor:` Code refactoring
- `chore:` Maintenance tasks
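
A quick local sanity check for messages can be a small regex; this sketch assumes the `type: subject` form shown above, plus an optional `(scope)` as used in conventional commits (the hook itself is hypothetical, not part of this repo):

```python
import re

# Matches the commit types listed above, an optional "(scope)",
# then ": " and a non-empty subject.
COMMIT_RE = re.compile(r"^(feat|fix|docs|test|refactor|chore)(\([\w-]+\))?: \S.*$")


def is_valid_commit_message(message: str) -> bool:
    """Return True if the first line follows the commit format above."""
    return COMMIT_RE.match(message.splitlines()[0]) is not None
```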

### 7. Push and Create PR

```bash
git push origin feature/your-feature-name
```

Then create a Pull Request on GitHub.

## Coding Standards

### Python Style

- Follow PEP 8
- Use Black for formatting (line length: 100)
- Use type hints for all function signatures
- Write docstrings for all public functions/classes

### Example

```python
from pathlib import Path
from typing import List


def extract_institution_names(
    conversation_path: Path,
    confidence_threshold: float = 0.7
) -> List[str]:
    """
    Extract heritage institution names from a conversation JSON file.

    Args:
        conversation_path: Path to conversation JSON file
        confidence_threshold: Minimum confidence score (0.0-1.0)

    Returns:
        List of institution names with confidence above threshold

    Raises:
        FileNotFoundError: If conversation file doesn't exist
        ValueError: If confidence_threshold is out of range

    Examples:
        >>> extract_institution_names(Path("brazilian_glam.json"))
        ['Biblioteca Nacional do Brasil', 'Museu Nacional']
    """
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0.0 and 1.0")

    # Implementation here
    ...
```

### Type Hints

Use type hints for:

- Function parameters
- Return types
- Class attributes

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


class HeritageCustodian:
    """Represents a heritage institution."""

    name: str
    institution_type: str
    founded_date: Optional[datetime]
    identifiers: List[Dict[str, Any]]

    def __init__(
        self,
        name: str,
        institution_type: str,
        founded_date: Optional[datetime] = None
    ) -> None:
        self.name = name
        self.institution_type = institution_type
        self.founded_date = founded_date
        self.identifiers = []
```

### Docstrings

Use Google-style docstrings:

```python
def merge_institutions(
    source1: List[HeritageCustodian],
    source2: List[HeritageCustodian],
    merge_strategy: str = "isil_code"
) -> List[HeritageCustodian]:
    """
    Merge two lists of heritage institutions using the specified strategy.

    Args:
        source1: First list of institutions
        source2: Second list of institutions
        merge_strategy: Strategy to use ("isil_code", "name_fuzzy", "location")

    Returns:
        Merged list with duplicates resolved

    Raises:
        ValueError: If merge_strategy is not recognized

    Note:
        When conflicts occur, source1 takes precedence for TIER_1 data;
        otherwise the entry with the highest data tier wins.

    Examples:
        >>> csv_institutions = load_csv_institutions(...)
        >>> conversation_institutions = extract_from_conversations(...)
        >>> merged = merge_institutions(csv_institutions, conversation_institutions)
    """
    ...
```
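
For the default `"isil_code"` strategy, deduplication can be as simple as keying both lists by ISIL and letting `source1` overwrite on collisions. The sketch below uses plain dicts rather than `HeritageCustodian` objects and ignores the data-tier tie-breaking described in the docstring, so treat it as an illustration only:

```python
from typing import Dict, List


def merge_by_isil(
    source1: List[Dict[str, str]],
    source2: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Merge two institution lists, treating a shared ISIL as a duplicate."""
    # dict.update lets later entries win, so insert source2 first
    # and let source1 overwrite on conflicting ISILs.
    merged = {inst["isil"]: inst for inst in source2}
    merged.update({inst["isil"]: inst for inst in source1})
    return list(merged.values())
```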

## Testing Guidelines

### Unit Tests

Test individual functions/classes in isolation:

```python
# tests/unit/test_extractors.py
from glam_extractor.extractors import extract_isil_codes


def test_extract_isil_codes_single():
    text = "The ISIL code NL-AsdAM identifies Amsterdam Museum"
    codes = extract_isil_codes(text)
    assert len(codes) == 1
    assert codes[0]["value"] == "NL-AsdAM"


def test_extract_isil_codes_multiple():
    text = "Codes include NL-AsdAM and NL-AmfRCE"
    codes = extract_isil_codes(text)
    assert len(codes) == 2


def test_extract_isil_codes_none():
    text = "No ISIL codes here"
    codes = extract_isil_codes(text)
    assert len(codes) == 0
```
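
If `extract_isil_codes` is not yet implemented, a regex-based sketch that would satisfy the tests above could look like this. The pattern is a simplification of ISO 15511 — real ISILs have a registered prefix and a restricted character set — so treat it as a starting point rather than the project's actual implementation:

```python
import re
from typing import Dict, List

# Simplified ISIL shape: 1-4 uppercase letters, a hyphen, then up to 11
# identifier characters. Real validation should check registered prefixes.
ISIL_PATTERN = re.compile(r"\b([A-Z]{1,4}-[A-Za-z0-9/:]{1,11})\b")


def extract_isil_codes(text: str) -> List[Dict[str, str]]:
    """Return ISIL-shaped identifiers found in free text."""
    return [{"type": "isil", "value": match} for match in ISIL_PATTERN.findall(text)]
```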

### Integration Tests

Test full workflows:

```python
# tests/integration/test_pipeline.py
import pytest
from pathlib import Path
from glam_extractor import ConversationParser, InstitutionExtractor, LinkMLValidator


@pytest.mark.integration
def test_full_extraction_pipeline(tmp_path):
    # Setup
    conversation_file = Path("tests/fixtures/brazilian_glam.json")
    output_file = tmp_path / "output.jsonld"

    # Execute
    parser = ConversationParser()
    conversation = parser.load(conversation_file)

    extractor = InstitutionExtractor()
    institutions = extractor.extract(conversation)

    validator = LinkMLValidator(schema="schemas/heritage_custodian.yaml")
    valid_institutions = validator.validate_batch(institutions)

    # Assert
    assert len(institutions) > 0
    assert all(inst.provenance.data_source == "CONVERSATION_NLP" for inst in institutions)
    assert len(valid_institutions) == len(institutions)
```

### Test Fixtures

Create reusable test data in `tests/fixtures/`:

```python
# tests/fixtures/sample_conversation.py
import json
from pathlib import Path


def create_sample_conversation(tmp_path: Path) -> Path:
    """Create a minimal conversation JSON for testing."""
    conversation = {
        "uuid": "test-uuid",
        "name": "Test Conversation",
        "chat_messages": [
            {
                "uuid": "msg-1",
                "text": "Amsterdam Museum (ISIL: NL-AsdAM) is located in Amsterdam.",
                "sender": "assistant",
                "content": [{"type": "text", "text": "..."}]
            }
        ]
    }

    fixture_path = tmp_path / "test_conversation.json"
    fixture_path.write_text(json.dumps(conversation, indent=2))
    return fixture_path
```

## Documentation

### API Documentation

All public functions/classes must have docstrings. We use mkdocstrings to auto-generate API docs.

```bash
# Serve docs locally
poetry run mkdocs serve

# Build static docs
poetry run mkdocs build
```

### Tutorials

Add tutorials to `docs/tutorials/` with step-by-step examples.

### Examples

Add working code examples to `docs/examples/`.

## Areas for Contribution

### High Priority

1. **Parser Implementation** (`src/glam_extractor/parsers/`)
   - Conversation JSON parser
   - CSV parser for Dutch datasets
   - Schema-compliant object builders

2. **Extractor Implementation** (`src/glam_extractor/extractors/`)
   - spaCy NER integration
   - Institution type classifier
   - Identifier pattern extractors

3. **Validator Implementation** (`src/glam_extractor/validators/`)
   - LinkML schema validator
   - Cross-reference validator

4. **Exporter Implementation** (`src/glam_extractor/exporters/`)
   - JSON-LD exporter
   - RDF/Turtle exporter
   - CSV/Parquet exporters

### Medium Priority

5. **Geocoding Module** (`src/glam_extractor/geocoding/`)
   - Nominatim client
   - GeoNames integration
   - Caching layer

6. **Web Crawler** (`src/glam_extractor/crawlers/`)
   - crawl4ai integration
   - Institution website scraping

### Lower Priority

7. **CLI Enhancements** (`src/glam_extractor/cli.py`)
   - Progress bars
   - Better error reporting
   - Configuration file support

8. **Performance Optimization**
   - Parallel processing
   - Caching strategies
   - Memory optimization

## Design Patterns

Follow the patterns documented in `docs/plan/global_glam/05-design-patterns.md`:

- **Pipeline Pattern**: For data processing workflows
- **Repository Pattern**: For data access
- **Strategy Pattern**: For configurable algorithms
- **Builder Pattern**: For complex object construction
- **Result Pattern**: For explicit error handling
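
As one illustration, the Result pattern can be sketched in a few lines. This is a minimal, hypothetical version — the project's actual helper, if one exists, may differ:

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Result(Generic[T]):
    """Explicit success-or-failure value, instead of raising exceptions."""
    value: Optional[T] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None
```

Callers then branch on `result.ok` at each pipeline stage rather than wrapping every call in `try`/`except`.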

## Questions?

- Check existing documentation in `docs/`
- Read `AGENTS.md` for AI agent instructions
- Review planning docs in `docs/plan/global_glam/`
- Open an issue on GitHub

## License

By contributing, you agree that your contributions will be licensed under the MIT License.