# Contributing to GLAM Extractor

Thank you for your interest in contributing to the GLAM Extractor project! This document provides guidelines and instructions for contributors.

## Development Setup

### Prerequisites

- Python 3.11 or higher
- Poetry (Python package manager)
- Git

### Installation

```bash
# Clone the repository
git clone
cd glam-extractor

# Install dependencies
poetry install

# Install pre-commit hooks
poetry run pre-commit install

# Download spaCy models (required for NLP features)
poetry run python -m spacy download en_core_web_trf
poetry run python -m spacy download nl_core_news_lg
poetry run python -m spacy download xx_ent_wiki_sm
```

## Project Structure

```
glam-extractor/
├── src/glam_extractor/      # Main package
│   ├── parsers/             # Conversation & CSV parsers
│   ├── extractors/          # NLP extraction engines
│   ├── crawlers/            # Web crawling (crawl4ai)
│   ├── validators/          # LinkML validation
│   ├── exporters/           # Multi-format export
│   ├── geocoding/           # Nominatim geocoding
│   └── utils/               # Utilities
├── tests/                   # Test suite
│   ├── unit/                # Unit tests
│   ├── integration/         # Integration tests
│   └── fixtures/            # Test data
├── schemas/                 # LinkML schemas
├── docs/                    # Documentation
└── data/                    # Reference data (CSVs)
```

## Development Workflow

### 1. Create a Branch

```bash
git checkout -b feature/your-feature-name
# or
git checkout -b fix/bug-description
```

### 2. Make Changes

Follow the coding standards (see below) and ensure all tests pass.

### 3. Write Tests

All new features should include:

- Unit tests in `tests/unit/`
- Integration tests in `tests/integration/` (if applicable)
- Docstring examples that serve as documentation

### 4. Run Tests

```bash
# Run all tests
poetry run pytest

# Run specific test types
poetry run pytest -m unit
poetry run pytest -m integration

# Run with coverage
poetry run pytest --cov

# Run specific test file
poetry run pytest tests/unit/test_parsers.py
```

### 5. Code Quality Checks

```bash
# Format code with black
poetry run black src/ tests/

# Lint with ruff
poetry run ruff check src/ tests/

# Type check with mypy
poetry run mypy src/

# Run all pre-commit hooks
poetry run pre-commit run --all-files
```

### 6. Commit Changes

```bash
git add .
git commit -m "feat: add institution name extractor"
```

**Commit Message Format**:

- `feat:` New feature
- `fix:` Bug fix
- `docs:` Documentation changes
- `test:` Test changes
- `refactor:` Code refactoring
- `chore:` Maintenance tasks

### 7. Push and Create PR

```bash
git push origin feature/your-feature-name
```

Then create a Pull Request on GitHub.

## Coding Standards

### Python Style

- Follow PEP 8
- Use Black for formatting (line length: 100)
- Use type hints for all function signatures
- Write docstrings for all public functions/classes

### Example

```python
from typing import List
from pathlib import Path


def extract_institution_names(
    conversation_path: Path,
    confidence_threshold: float = 0.7
) -> List[str]:
    """
    Extract heritage institution names from a conversation JSON file.

    Args:
        conversation_path: Path to conversation JSON file
        confidence_threshold: Minimum confidence score (0.0-1.0)

    Returns:
        List of institution names with confidence above threshold

    Raises:
        FileNotFoundError: If conversation file doesn't exist
        ValueError: If confidence_threshold is out of range

    Examples:
        >>> extract_institution_names(Path("brazilian_glam.json"))
        ['Biblioteca Nacional do Brasil', 'Museu Nacional']
    """
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0.0 and 1.0")

    # Implementation here
    ...
```

### Type Hints

Use type hints for:

- Function parameters
- Return types
- Class attributes

```python
from typing import Optional, List, Dict, Any
from datetime import datetime


class HeritageCustodian:
    """Represents a heritage institution."""

    name: str
    institution_type: str
    founded_date: Optional[datetime]
    identifiers: List[Dict[str, Any]]

    def __init__(
        self,
        name: str,
        institution_type: str,
        founded_date: Optional[datetime] = None
    ) -> None:
        self.name = name
        self.institution_type = institution_type
        self.founded_date = founded_date
        self.identifiers = []
```

### Docstrings

Use Google-style docstrings:

```python
def merge_institutions(
    source1: List[HeritageCustodian],
    source2: List[HeritageCustodian],
    merge_strategy: str = "isil_code"
) -> List[HeritageCustodian]:
    """
    Merge two lists of heritage institutions using the specified strategy.

    Args:
        source1: First list of institutions
        source2: Second list of institutions
        merge_strategy: Strategy to use ("isil_code", "name_fuzzy", "location")

    Returns:
        Merged list with duplicates resolved

    Raises:
        ValueError: If merge_strategy is not recognized

    Note:
        When conflicts occur, source1 takes precedence for TIER_1 data;
        otherwise the highest data tier wins.

    Examples:
        >>> csv_institutions = load_csv_institutions(...)
        >>> conversation_institutions = extract_from_conversations(...)
        >>> merged = merge_institutions(csv_institutions, conversation_institutions)
    """
    ...
```

## Testing Guidelines

### Unit Tests

Test individual functions/classes in isolation:

```python
# tests/unit/test_extractors.py
from glam_extractor.extractors import extract_isil_codes


def test_extract_isil_codes_single():
    text = "The ISIL code NL-AsdAM identifies Amsterdam Museum"
    codes = extract_isil_codes(text)
    assert len(codes) == 1
    assert codes[0]["value"] == "NL-AsdAM"


def test_extract_isil_codes_multiple():
    text = "Codes include NL-AsdAM and NL-AmfRCE"
    codes = extract_isil_codes(text)
    assert len(codes) == 2


def test_extract_isil_codes_none():
    text = "No ISIL codes here"
    codes = extract_isil_codes(text)
    assert len(codes) == 0
```

### Integration Tests

Test full workflows:

```python
# tests/integration/test_pipeline.py
import pytest
from pathlib import Path

from glam_extractor import ConversationParser, InstitutionExtractor, LinkMLValidator


@pytest.mark.integration
def test_full_extraction_pipeline(tmp_path):
    # Setup
    conversation_file = Path("tests/fixtures/brazilian_glam.json")
    output_file = tmp_path / "output.jsonld"

    # Execute
    parser = ConversationParser()
    conversation = parser.load(conversation_file)

    extractor = InstitutionExtractor()
    institutions = extractor.extract(conversation)

    validator = LinkMLValidator(schema="schemas/heritage_custodian.yaml")
    valid_institutions = validator.validate_batch(institutions)

    # Assert
    assert len(institutions) > 0
    assert all(
        inst.provenance.data_source == "CONVERSATION_NLP" for inst in institutions
    )
    assert len(valid_institutions) == len(institutions)
```

### Test Fixtures

Create reusable test data in `tests/fixtures/`:

```python
# tests/fixtures/sample_conversation.py
import json
from pathlib import Path


def create_sample_conversation(tmp_path: Path) -> Path:
    """Create a minimal conversation JSON for testing."""
    conversation = {
        "uuid": "test-uuid",
        "name": "Test Conversation",
        "chat_messages": [
            {
                "uuid": "msg-1",
                "text": "Amsterdam Museum (ISIL: NL-AsdAM) is located in Amsterdam.",
                "sender": "assistant",
                "content": [{"type": "text", "text": "..."}]
            }
        ]
    }

    fixture_path = tmp_path / "test_conversation.json"
    fixture_path.write_text(json.dumps(conversation, indent=2))
    return fixture_path
```

## Documentation

### API Documentation

All public functions/classes must have docstrings. We use mkdocstrings to auto-generate API docs.

```bash
# Serve docs locally
poetry run mkdocs serve

# Build static docs
poetry run mkdocs build
```

### Tutorials

Add tutorials to `docs/tutorials/` with step-by-step examples.

### Examples

Add working code examples to `docs/examples/`.

## Areas for Contribution

### High Priority

1. **Parser Implementation** (`src/glam_extractor/parsers/`)
   - Conversation JSON parser
   - CSV parser for Dutch datasets
   - Schema-compliant object builders

2. **Extractor Implementation** (`src/glam_extractor/extractors/`)
   - spaCy NER integration
   - Institution type classifier
   - Identifier pattern extractors

3. **Validator Implementation** (`src/glam_extractor/validators/`)
   - LinkML schema validator
   - Cross-reference validator

4. **Exporter Implementation** (`src/glam_extractor/exporters/`)
   - JSON-LD exporter
   - RDF/Turtle exporter
   - CSV/Parquet exporters

### Medium Priority

5. **Geocoding Module** (`src/glam_extractor/geocoding/`)
   - Nominatim client
   - GeoNames integration
   - Caching layer

6. **Web Crawler** (`src/glam_extractor/crawlers/`)
   - crawl4ai integration
   - Institution website scraping

### Lower Priority

7. **CLI Enhancements** (`src/glam_extractor/cli.py`)
   - Progress bars
   - Better error reporting
   - Configuration file support

8. **Performance Optimization**
   - Parallel processing
   - Caching strategies
   - Memory optimization

## Design Patterns

Follow the patterns documented in `docs/plan/global_glam/05-design-patterns.md`:

- **Pipeline Pattern**: For data processing workflows
- **Repository Pattern**: For data access
- **Strategy Pattern**: For configurable algorithms
- **Builder Pattern**: For complex object construction
- **Result Pattern**: For explicit error handling

## Questions?

- Check existing documentation in `docs/`
- Read `AGENTS.md` for AI agent instructions
- Review planning docs in `docs/plan/global_glam/`
- Open an issue on GitHub

## License

By contributing, you agree that your contributions will be licensed under the MIT License.
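
## Appendix: Result Pattern Sketch

The Result pattern listed under Design Patterns makes error handling explicit by returning success or failure values instead of raising exceptions. The sketch below is a minimal illustration only: the names `Ok`, `Err`, `Result`, and `parse_isil_code` are hypothetical and not part of the project's actual API (see `docs/plan/global_glam/05-design-patterns.md` for the canonical design), and the ISIL check is a loose approximation of the ISO 15511 shape.

```python
import re
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar("T")


@dataclass(frozen=True)
class Ok(Generic[T]):
    """Successful result carrying a value."""
    value: T


@dataclass(frozen=True)
class Err:
    """Failed result carrying an error message."""
    error: str


# A Result is either Ok[T] or Err; callers must branch on which one they got.
Result = Union[Ok[T], Err]

# Loose approximation of an ISIL code (ISO 15511): prefix, hyphen, local part.
_ISIL_RE = re.compile(r"[A-Z]{1,4}-[A-Za-z0-9:/-]{1,11}")


def parse_isil_code(raw: str) -> Result[str]:
    """Return Ok(code) for a plausible ISIL code, Err with a reason otherwise."""
    if _ISIL_RE.fullmatch(raw):
        return Ok(raw)
    return Err(f"not a plausible ISIL code: {raw!r}")


result = parse_isil_code("NL-AsdAM")
if isinstance(result, Ok):
    print(result.value)  # NL-AsdAM
else:
    print(result.error)
```

Branching on the result type (with `isinstance` or a structural `match`) keeps failure paths visible in function signatures rather than hidden behind exceptions.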