# UNESCO Data Extraction - Design Patterns

**Project**: Global GLAM Dataset - UNESCO World Heritage Sites Extraction
**Document**: 05 - Design Patterns
**Version**: 1.1
**Date**: 2025-11-10
**Status**: Draft

---

## Executive Summary

This document defines the architectural design patterns for the UNESCO World Heritage Sites extraction project. These patterns ensure consistency with the existing Global GLAM Dataset architecture while addressing UNESCO-specific requirements such as institution type classification, multi-language handling, and authoritative data tier management.

**Key Principles**:

1. **Consistency with Global GLAM patterns** - Reuse existing patterns from `docs/plan/global_glam/05-design-patterns.md`
2. **UNESCO-specific adaptations** - Handle unique UNESCO data characteristics
3. **SOLID principles** - Single Responsibility, Open/Closed, Liskov Substitution, Interface Segregation, Dependency Inversion
4. **Testability** - All components designed for easy unit testing

---

## Core Design Patterns

### Pattern 1: Repository Pattern (Data Access Layer)

**Purpose**: Abstract data access logic to allow flexible data sources (API, cache, fixtures).
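The essence of the pattern — one abstract interface with interchangeable backends, and business logic written only against the interface — can be shown in a toy sketch before the full project implementation below (the names here, `SiteRepository` and `InMemoryRepository`, are simplified stand-ins, not the project's actual classes):

```python
from abc import ABC, abstractmethod
from typing import Dict

class SiteRepository(ABC):
    """Toy repository interface: callers depend only on this."""
    @abstractmethod
    def fetch(self, site_id: int) -> Dict:
        ...

class InMemoryRepository(SiteRepository):
    """One interchangeable backend: a plain dict (e.g. for tests)."""
    def __init__(self, data: Dict[int, Dict]):
        self._data = data

    def fetch(self, site_id: int) -> Dict:
        return self._data[site_id]

def site_name(repo: SiteRepository, site_id: int) -> str:
    """Business logic written against the interface, not a backend."""
    return repo.fetch(site_id)["name"]

repo = InMemoryRepository({600: {"name": "Paris, Banks of the Seine"}})
print(site_name(repo, 600))  # → Paris, Banks of the Seine
```

Swapping `InMemoryRepository` for an API-backed or cached backend requires no change to `site_name` — that is the Dependency Inversion payoff the full implementation relies on.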
**Implementation**:

```python
# src/glam_extractor/repositories/unesco_repository.py
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List

# UNESCOAPIClient and Cache are defined elsewhere in the project
from glam_extractor.models import HeritageCustodian


class UNESCORepository(ABC):
    """Abstract repository for UNESCO site data."""

    @abstractmethod
    def fetch_site_list(self) -> List[Dict]:
        """Fetch list of all UNESCO World Heritage Sites."""
        pass

    @abstractmethod
    def fetch_site_details(self, whc_id: int) -> Dict:
        """Fetch detailed information for a specific site."""
        pass

    @abstractmethod
    def fetch_sites_by_country(self, country_code: str) -> List[Dict]:
        """Fetch all sites in a specific country."""
        pass


class UNESCOAPIRepository(UNESCORepository):
    """Repository implementation using UNESCO DataHub API."""

    def __init__(self, api_client: 'UNESCOAPIClient'):
        self.api_client = api_client

    def fetch_site_list(self) -> List[Dict]:
        return self.api_client.get("/catalog/datasets/whc-sites-2021/records")

    def fetch_site_details(self, whc_id: int) -> Dict:
        return self.api_client.get(f"/catalog/datasets/whc-sites-2021/records/{whc_id}")

    def fetch_sites_by_country(self, country_code: str) -> List[Dict]:
        all_sites = self.fetch_site_list()
        # OpenDataSoft nested structure: record.fields.states_name_en
        return [
            s for s in all_sites
            if country_code in s.get('record', {}).get('fields', {}).get('states_name_en', '')
        ]


class UNESCOCachedRepository(UNESCORepository):
    """Repository with caching layer."""

    def __init__(self, api_repository: UNESCOAPIRepository, cache: 'Cache'):
        self.api_repository = api_repository
        self.cache = cache

    def fetch_site_list(self) -> List[Dict]:
        return self.api_repository.fetch_site_list()

    def fetch_sites_by_country(self, country_code: str) -> List[Dict]:
        return self.api_repository.fetch_sites_by_country(country_code)

    def fetch_site_details(self, whc_id: int) -> Dict:
        cache_key = f"unesco_site_{whc_id}"

        # Try cache first
        cached = self.cache.get(cache_key)
        if cached:
            return cached

        # Fetch from API and cache
        data = self.api_repository.fetch_site_details(whc_id)
        self.cache.set(cache_key, data, ttl=86400)  # 24 hours
        return data


class UNESCOFixtureRepository(UNESCORepository):
    """Repository using test fixtures (for testing)."""

    def __init__(self, fixtures_dir: Path):
        self.fixtures_dir = fixtures_dir

    def fetch_site_list(self) -> List[Dict]:
        return [
            json.loads(p.read_text())
            for p in sorted(self.fixtures_dir.glob("site_*.json"))
        ]

    def fetch_sites_by_country(self, country_code: str) -> List[Dict]:
        return [
            s for s in self.fetch_site_list()
            if country_code in s.get('record', {}).get('fields', {}).get('states_name_en', '')
        ]

    def fetch_site_details(self, whc_id: int) -> Dict:
        fixture_path = self.fixtures_dir / f"site_{whc_id}.json"
        with open(fixture_path) as f:
            return json.load(f)
```

**Benefits**:

- ✅ Easy to mock for testing (use `UNESCOFixtureRepository`)
- ✅ Separation of concerns (API logic vs. business logic)
- ✅ Flexible data sources (API, cache, fixtures, future: database)
- ✅ Adheres to Dependency Inversion Principle

**Usage**:

```python
# Production: API with caching
api_client = UNESCOAPIClient()
api_repo = UNESCOAPIRepository(api_client)
cache = SQLiteCache("cache/unesco.db")
repo = UNESCOCachedRepository(api_repo, cache)

# Testing: Fixtures
repo = UNESCOFixtureRepository(Path("tests/fixtures/unesco_api_responses"))

# Extract data (same interface regardless of source)
site_data = repo.fetch_site_details(600)
```

---

### Pattern 2: Strategy Pattern (Institution Type Classification)

**Purpose**: Allow multiple classification strategies (keyword-based, ML-based, rule-based) with the same interface.

**Implementation**:

```python
# src/glam_extractor/classifiers/strategy.py
from abc import ABC, abstractmethod
from typing import Any, Dict, List

from glam_extractor.models import InstitutionTypeEnum


class ClassificationStrategy(ABC):
    """Abstract strategy for institution type classification."""

    @abstractmethod
    def classify(self, unesco_data: Dict) -> Dict[str, Any]:
        """
        Classify institution type from UNESCO data.

        Returns:
            {
                'institution_type': InstitutionTypeEnum,
                'confidence_score': float,
                'reasoning': str
            }
        """
        pass


class KeywordClassificationStrategy(ClassificationStrategy):
    """Classification based on keyword matching."""

    MUSEUM_KEYWORDS = ['museum', 'musée', 'museo', 'muzeum', 'gallery']
    LIBRARY_KEYWORDS = ['library', 'bibliothèque', 'biblioteca', 'bibliotheek']
    ARCHIVE_KEYWORDS = ['archive', 'archiv', 'archivo', 'archief']
    BOTANICAL_KEYWORDS = ['botanical garden', 'jardin botanique', 'arboretum', 'botanic']
    HOLY_SITE_KEYWORDS = ['cathedral', 'church', 'monastery', 'abbey', 'temple', 'mosque', 'synagogue']

    def classify(self, unesco_data: Dict) -> Dict[str, Any]:
        # Extract from OpenDataSoft nested structure
        fields = unesco_data.get('record', {}).get('fields', {})
        description = fields.get('short_description_en', '').lower()
        justification = fields.get('justification_en', '').lower()
        text = f"{description} {justification}"

        scores = {
            InstitutionTypeEnum.MUSEUM: self._count_keywords(text, self.MUSEUM_KEYWORDS),
            InstitutionTypeEnum.LIBRARY: self._count_keywords(text, self.LIBRARY_KEYWORDS),
            InstitutionTypeEnum.ARCHIVE: self._count_keywords(text, self.ARCHIVE_KEYWORDS),
            InstitutionTypeEnum.BOTANICAL_ZOO: self._count_keywords(text, self.BOTANICAL_KEYWORDS),
            InstitutionTypeEnum.HOLY_SITES: self._count_keywords(text, self.HOLY_SITE_KEYWORDS),
        }

        # Find highest scoring type
        best_type = max(scores, key=scores.get)
        max_score = scores[best_type]

        if max_score == 0:
            return {
                'institution_type': InstitutionTypeEnum.MIXED,
                'confidence_score': 0.5,
                'reasoning': 'No keywords matched, defaulted to MIXED'
            }

        confidence = min(1.0, max_score / 3)  # Normalize to 0-1 range
        return {
            'institution_type': best_type,
            'confidence_score': confidence,
            'reasoning': f'Matched {max_score} keyword(s) for {best_type}'
        }

    def _count_keywords(self, text: str, keywords: List[str]) -> int:
        return sum(1 for kw in keywords if kw in text)


class RuleBasedClassificationStrategy(ClassificationStrategy):
    """Classification using explicit rules and heuristics."""

    def classify(self, unesco_data: Dict) -> Dict[str, Any]:
        # Extract from OpenDataSoft nested structure
        fields = unesco_data.get('record', {}).get('fields', {})
        category = fields.get('category', '')
        description = fields.get('short_description_en', '').lower()

        # Rule 1: Natural sites are rarely GLAM institutions
        if category == 'Natural':
            if 'visitor center' in description or 'interpretation' in description:
                return {
                    'institution_type': InstitutionTypeEnum.MIXED,
                    'confidence_score': 0.6,
                    'reasoning': 'Natural site with visitor facilities'
                }
            else:
                return {
                    'institution_type': InstitutionTypeEnum.MIXED,
                    'confidence_score': 0.4,
                    'reasoning': 'Natural site, likely not a GLAM custodian'
                }

        # Rule 2: Sites with "library" in name are libraries
        site_name = fields.get('name_en', '').lower()
        if 'library' in site_name or 'bibliothèque' in site_name:
            return {
                'institution_type': InstitutionTypeEnum.LIBRARY,
                'confidence_score': 0.95,
                'reasoning': 'Site name explicitly mentions library'
            }

        # Rule 3: Religious sites with collections are HOLY_SITES
        if any(word in description for word in ['cathedral', 'monastery', 'abbey', 'church']):
            if any(word in description for word in ['collection', 'archive', 'library', 'manuscript']):
                return {
                    'institution_type': InstitutionTypeEnum.HOLY_SITES,
                    'confidence_score': 0.85,
                    'reasoning': 'Religious site with heritage collection'
                }

        # Fallback: Use keyword strategy
        keyword_strategy = KeywordClassificationStrategy()
        return keyword_strategy.classify(unesco_data)


class CompositeClassificationStrategy(ClassificationStrategy):
    """Combine multiple strategies and vote."""

    def __init__(self, strategies: List[ClassificationStrategy]):
        self.strategies = strategies

    def classify(self, unesco_data: Dict) -> Dict[str, Any]:
        results = [strategy.classify(unesco_data) for strategy in self.strategies]

        # Weighted voting (weight by confidence score)
        votes = {}
        for result in results:
            inst_type = result['institution_type']
            confidence = result['confidence_score']
            votes[inst_type] = votes.get(inst_type, 0) + confidence

        # Find winner
        best_type = max(votes, key=votes.get)
        avg_confidence = votes[best_type] / len(self.strategies)

        return {
            'institution_type': best_type,
            'confidence_score': avg_confidence,
            'reasoning': f'Composite vote from {len(self.strategies)} strategies'
        }


class InstitutionTypeClassifier:
    """Context class for classification strategies."""

    def __init__(self, strategy: ClassificationStrategy):
        self.strategy = strategy

    def classify(self, unesco_data: Dict) -> Dict[str, Any]:
        return self.strategy.classify(unesco_data)

    def set_strategy(self, strategy: ClassificationStrategy):
        """Allow runtime strategy switching."""
        self.strategy = strategy
```

**Benefits**:

- ✅ Easy to add new classification strategies (Open/Closed Principle)
- ✅ Strategies can be tested independently
- ✅ Composite strategy allows voting across multiple approaches
- ✅ Runtime strategy switching for A/B testing

**Usage**:

```python
# Use keyword strategy
classifier = InstitutionTypeClassifier(KeywordClassificationStrategy())
result = classifier.classify(unesco_site_data)

# Switch to rule-based strategy
classifier.set_strategy(RuleBasedClassificationStrategy())
result = classifier.classify(unesco_site_data)

# Use composite (voting) strategy
composite = CompositeClassificationStrategy([
    KeywordClassificationStrategy(),
    RuleBasedClassificationStrategy()
])
classifier.set_strategy(composite)
result = classifier.classify(unesco_site_data)
```

---

### Pattern 3: Builder Pattern (HeritageCustodian Construction)

**Purpose**: Construct complex `HeritageCustodian` objects step-by-step from UNESCO data.
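Before the full project builder, the fluent, chainable style it relies on can be shown with a stripped-down stand-in (the `Site` model and `SiteBuilder` below are hypothetical illustrations, not the project's `HeritageCustodian` classes):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Site:
    name: Optional[str] = None
    names: List[str] = field(default_factory=list)

class SiteBuilder:
    """Each step mutates internal state and returns self, enabling chaining."""
    def __init__(self):
        self._site = Site()

    def with_name(self, name: str) -> "SiteBuilder":
        self._site.name = name
        return self

    def with_alt_name(self, name: str) -> "SiteBuilder":
        self._site.names.append(name)
        return self

    def build(self) -> Site:
        # Final validation happens once, at build time
        if not self._site.name:
            raise ValueError("Site must have a name")
        return self._site

site = SiteBuilder().with_name("Louvre").with_alt_name("Musée du Louvre@fr").build()
print(site.name)  # → Louvre
```

Returning `self` from every `with_*` method is what makes the chained call in the usage example below read as a single declarative construction.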
**Implementation**:

```python
# src/glam_extractor/builders/heritage_custodian_builder.py
from datetime import datetime, timezone
from typing import Dict, List, Optional

from glam_extractor.classifiers.strategy import InstitutionTypeClassifier
from glam_extractor.models import (
    HeritageCustodian, Location, Identifier, Provenance, DataSource, DataTier
)


class HeritageCustodianBuilder:
    """Builder for constructing HeritageCustodian from UNESCO data."""

    def __init__(self):
        self._custodian = HeritageCustodian()
        self._unesco_data = None

    def from_unesco_data(self, unesco_data: Dict) -> 'HeritageCustodianBuilder':
        """Initialize builder with UNESCO site data."""
        self._unesco_data = unesco_data
        return self

    def with_name(self, name: Optional[str] = None) -> 'HeritageCustodianBuilder':
        """Set institution name (defaults to UNESCO site name)."""
        if name:
            self._custodian.name = name
        elif self._unesco_data:
            # Extract from OpenDataSoft nested structure
            fields = self._unesco_data.get('record', {}).get('fields', {})
            self._custodian.name = fields.get('name_en', '')
        return self

    def with_institution_type(
        self, classifier: InstitutionTypeClassifier
    ) -> 'HeritageCustodianBuilder':
        """Classify and set institution type."""
        if self._unesco_data:
            result = classifier.classify(self._unesco_data)
            self._custodian.institution_type = result['institution_type']

            # Store confidence in provenance
            if not self._custodian.provenance:
                self._custodian.provenance = Provenance()
            self._custodian.provenance.confidence_score = result['confidence_score']
        return self

    def with_location(self) -> 'HeritageCustodianBuilder':
        """Extract and set location from UNESCO data."""
        if not self._unesco_data:
            return self

        # Extract from OpenDataSoft nested structure
        fields = self._unesco_data.get('record', {}).get('fields', {})
        location = Location(
            city=self._extract_city(),
            country=self._extract_country_code(),
            region=fields.get('region_en', ''),
            latitude=fields.get('latitude'),
            longitude=fields.get('longitude')
        )
        self._custodian.locations = [location]
        return self

    def with_unesco_identifiers(self) -> 'HeritageCustodianBuilder':
        """Add UNESCO WHC ID and any linked identifiers."""
        if not self._unesco_data:
            return self

        identifiers = []
        # Extract from OpenDataSoft nested structure
        fields = self._unesco_data.get('record', {}).get('fields', {})

        # UNESCO WHC ID
        whc_id = fields.get('unique_number')
        if whc_id:
            identifiers.append(Identifier(
                identifier_scheme='UNESCO_WHC',
                identifier_value=str(whc_id),
                identifier_url=f'https://whc.unesco.org/en/list/{whc_id}'
            ))

        # Wikidata Q-number from http_url field (if present)
        http_url = fields.get('http_url', '')
        if 'wikidata.org' in http_url:
            q_number = http_url.split('/')[-1]
            identifiers.append(Identifier(
                identifier_scheme='Wikidata',
                identifier_value=q_number,
                identifier_url=http_url
            ))

        self._custodian.identifiers = identifiers
        return self

    def with_alternative_names(self) -> 'HeritageCustodianBuilder':
        """Extract multi-language names as alternative names."""
        if not self._unesco_data:
            return self

        alt_names = []
        primary_name = self._custodian.name
        # Extract from OpenDataSoft nested structure
        fields = self._unesco_data.get('record', {}).get('fields', {})

        # Check multi-language name fields
        for lang_suffix in ['fr', 'es', 'ar', 'ru', 'zh']:
            name_field = f'name_{lang_suffix}'
            if name_field in fields:
                name = fields[name_field]
                if name and name != primary_name:
                    alt_names.append(f"{name}@{lang_suffix}")

        if alt_names:
            self._custodian.alternative_names = alt_names
        return self

    def with_description(self) -> 'HeritageCustodianBuilder':
        """Set description from UNESCO short_description."""
        if self._unesco_data:
            # Extract from OpenDataSoft nested structure
            fields = self._unesco_data.get('record', {}).get('fields', {})
            description = fields.get('short_description_en', '')
            if description:
                self._custodian.description = description
        return self

    def with_provenance(
        self, extraction_method: str = "UNESCO DataHub API extraction"
    ) -> 'HeritageCustodianBuilder':
        """Set provenance metadata."""
        self._custodian.provenance = Provenance(
            data_source=DataSource.UNESCO_WORLD_HERITAGE,
            data_tier=DataTier.TIER_1_AUTHORITATIVE,
            extraction_date=datetime.now(timezone.utc).isoformat(),
            extraction_method=extraction_method,
            confidence_score=self._custodian.provenance.confidence_score
            if self._custodian.provenance else 1.0
        )
        return self

    def with_ghcid(
        self, ghcid_generator: 'GHCIDGenerator'
    ) -> 'HeritageCustodianBuilder':
        """Generate and set GHCID."""
        if self._custodian.name and self._custodian.institution_type and self._custodian.locations:
            self._custodian.ghcid = ghcid_generator.generate(self._custodian)
            self._custodian.ghcid_uuid = ghcid_generator.generate_uuid_v5(self._custodian.ghcid)
        return self

    def build(self) -> HeritageCustodian:
        """Return the constructed HeritageCustodian."""
        # Validate that required fields are set
        if not self._custodian.name:
            raise ValueError("HeritageCustodian must have a name")
        if not self._custodian.provenance:
            self.with_provenance()
        return self._custodian

    # Helper methods

    def _extract_city(self) -> str:
        """Extract city name from UNESCO data."""
        # UNESCO doesn't always provide city; fall back to the states field
        fields = self._unesco_data.get('record', {}).get('fields', {})
        states = fields.get('states_name_en', '')
        # Simplified: In real implementation, use geocoding or NLP
        return states.split(',')[0] if ',' in states else states

    def _extract_country_code(self) -> str:
        """Extract ISO country code from UNESCO states field."""
        fields = self._unesco_data.get('record', {}).get('fields', {})
        states = fields.get('states_name_en', '')
        # Map country name to ISO code (simplified, use pycountry in real implementation)
        country_map = {
            'France': 'FR',
            'United Kingdom': 'GB',
            'Brazil': 'BR',
            'India': 'IN',
            'Japan': 'JP',
            # ... full mapping
        }
        return country_map.get(states.strip(), 'XX')
```

**Benefits**:

- ✅ Fluent interface (method chaining)
- ✅ Step-by-step construction with validation
- ✅ Easy to add new construction steps
- ✅ Separates construction logic from model

**Usage**:

```python
# Construct HeritageCustodian with fluent API
custodian = (
    HeritageCustodianBuilder()
    .from_unesco_data(unesco_site_data)
    .with_name()
    .with_institution_type(classifier)
    .with_location()
    .with_unesco_identifiers()
    .with_alternative_names()
    .with_description()
    .with_provenance()
    .with_ghcid(ghcid_generator)
    .build()
)
```

---

### Pattern 4: Result Pattern (Error Handling)

**Purpose**: Handle errors without exceptions, making error handling explicit in type signatures.

**Implementation**:

```python
# src/glam_extractor/utils/result.py
from dataclasses import dataclass
from typing import Callable, Generic, NoReturn, TypeVar, Union

T = TypeVar('T')
U = TypeVar('U')
E = TypeVar('E')


@dataclass
class Ok(Generic[T]):
    """Success result."""
    value: T

    def is_ok(self) -> bool:
        return True

    def is_err(self) -> bool:
        return False

    def unwrap(self) -> T:
        return self.value

    def map(self, func: Callable[[T], U]) -> 'Result[U, E]':
        return Ok(func(self.value))

    def and_then(self, func: Callable[[T], 'Result[U, E]']) -> 'Result[U, E]':
        return func(self.value)


@dataclass
class Err(Generic[E]):
    """Error result."""
    error: E

    def is_ok(self) -> bool:
        return False

    def is_err(self) -> bool:
        return True

    def unwrap(self) -> NoReturn:
        raise ValueError(f"Called unwrap on Err: {self.error}")

    def map(self, func: Callable[[T], U]) -> 'Result[U, E]':
        return self  # Propagate error

    def and_then(self, func: Callable[[T], 'Result[U, E]']) -> 'Result[U, E]':
        return self  # Propagate error


Result = Union[Ok[T], Err[E]]
```

**Usage**:

```python
# src/glam_extractor/parsers/unesco_parser.py
def parse_unesco_site(unesco_data: Dict) -> Result[HeritageCustodian, str]:
    """Parse UNESCO site data to HeritageCustodian."""
    try:
        # Validate required fields
        if 'id_number' not in unesco_data:
            return Err("Missing required field: id_number")
        if 'site' not in unesco_data:
            return Err("Missing required field: site")

        # Build custodian (classifier instantiated elsewhere, e.g. at module level)
        builder = HeritageCustodianBuilder()
        custodian = (
            builder
            .from_unesco_data(unesco_data)
            .with_name()
            .with_institution_type(classifier)
            .with_location()
            .with_unesco_identifiers()
            .with_provenance()
            .build()
        )
        return Ok(custodian)
    except Exception as e:
        return Err(f"Parsing error: {str(e)}")


# Usage
result = parse_unesco_site(site_data)
if result.is_ok():
    custodian = result.unwrap()
    print(f"Successfully parsed: {custodian.name}")
else:
    print(f"Parsing failed: {result.error}")

# Functional chaining
result = (
    parse_unesco_site(site_data)
    .map(lambda c: enrich_with_wikidata(c))
    .map(lambda c: generate_ghcid(c))
)
```

**Benefits**:

- ✅ Explicit error handling (no hidden exceptions)
- ✅ Type-safe (mypy can verify error handling)
- ✅ Functional composition (map, and_then)
- ✅ Forces callers to handle errors

---

### Pattern 5: Adapter Pattern (LinkML Map Integration)

**Purpose**: Adapt UNESCO JSON structure to LinkML schema using LinkML Map.

**Implementation**:

```python
# src/glam_extractor/adapters/linkml_adapter.py
from datetime import datetime, timezone
from typing import Dict

from linkml_map import Mapper

from glam_extractor.classifiers.strategy import InstitutionTypeClassifier
from glam_extractor.models import HeritageCustodian


class LinkMLAdapter:
    """Adapter for transforming UNESCO JSON to LinkML instances."""

    def __init__(self, map_schema_path: str):
        self.mapper = Mapper(schema_path=map_schema_path)

    def adapt(self, unesco_data: Dict) -> Dict:
        """
        Transform UNESCO JSON to LinkML-compliant dictionary.

        Args:
            unesco_data: Raw UNESCO API response

        Returns:
            Dictionary conforming to HeritageCustodian schema
        """
        return self.mapper.transform(unesco_data)

    def adapt_to_model(self, unesco_data: Dict) -> HeritageCustodian:
        """
        Transform UNESCO JSON directly to HeritageCustodian model.

        Args:
            unesco_data: Raw UNESCO API response

        Returns:
            HeritageCustodian instance
        """
        adapted = self.adapt(unesco_data)
        return HeritageCustodian(**adapted)


# Alternative: Manual adapter (if LinkML Map not sufficient)
class ManualLinkMLAdapter:
    """Manual transformation when LinkML Map can't handle complex logic."""

    def __init__(self, classifier: InstitutionTypeClassifier):
        self.classifier = classifier

    def adapt(self, unesco_data: Dict) -> Dict:
        """Manually transform UNESCO data to LinkML format."""
        # Extract multi-language names
        primary_name = unesco_data.get('site', '')
        alt_names = [
            f"{n['name']}@{n['lang']}"
            for n in unesco_data.get('names', [])
            if n['name'] != primary_name
        ]

        # Classify institution type
        classification = self.classifier.classify(unesco_data)

        # Build location
        location = {
            'city': self._extract_city(unesco_data),
            'country': self._extract_country_code(unesco_data),
            'latitude': unesco_data.get('latitude'),
            'longitude': unesco_data.get('longitude')
        }

        # Build identifiers
        identifiers = [
            {
                'identifier_scheme': 'UNESCO_WHC',
                'identifier_value': str(unesco_data['id_number']),
                'identifier_url': f"https://whc.unesco.org/en/list/{unesco_data['id_number']}"
            }
        ]

        # Extract Wikidata from links
        for link in unesco_data.get('links', []):
            if 'wikidata.org' in link.get('url', ''):
                q_number = link['url'].split('/')[-1]
                identifiers.append({
                    'identifier_scheme': 'Wikidata',
                    'identifier_value': q_number,
                    'identifier_url': link['url']
                })

        return {
            'name': primary_name,
            'alternative_names': alt_names,
            'institution_type': classification['institution_type'],
            'locations': [location],
            'identifiers': identifiers,
            'description': unesco_data.get('short_description', ''),
            'provenance': {
                'data_source': 'UNESCO_WORLD_HERITAGE',
                'data_tier': 'TIER_1_AUTHORITATIVE',
                'extraction_date': datetime.now(timezone.utc).isoformat(),
                'confidence_score': classification['confidence_score']
            }
        }

    # City/country helpers mirror those in HeritageCustodianBuilder
    def _extract_city(self, unesco_data: Dict) -> str:
        states = unesco_data.get('states', '')
        return states.split(',')[0] if ',' in states else states

    def _extract_country_code(self, unesco_data: Dict) -> str:
        return 'XX'  # Simplified; use pycountry in real implementation
```

**Benefits**:

- ✅ Abstracts transformation logic
- ✅ Easy to switch between LinkML Map and manual
transformation
- ✅ Testable in isolation
- ✅ Handles complex transformations LinkML Map can't express

**Usage**:

```python
# Use LinkML Map adapter
adapter = LinkMLAdapter("schemas/maps/unesco_to_heritage_custodian.yaml")
custodian_dict = adapter.adapt(unesco_site_data)

# Or use manual adapter for complex cases
manual_adapter = ManualLinkMLAdapter(classifier)
custodian_dict = manual_adapter.adapt(unesco_site_data)
```

---

### Pattern 6: Factory Pattern (GHCID Generation)

**Purpose**: Encapsulate GHCID generation logic with different strategies (base GHCID, with Q-number, etc.).

**Implementation**:

```python
# src/glam_extractor/identifiers/ghcid_factory.py
import hashlib
import uuid
from abc import ABC, abstractmethod
from typing import List, Optional

from glam_extractor.models import HeritageCustodian


class GHCIDGenerator(ABC):
    """Abstract factory for GHCID generation."""

    @abstractmethod
    def generate(self, custodian: HeritageCustodian) -> str:
        """Generate GHCID string."""
        pass

    def generate_uuid_v5(self, ghcid: str) -> uuid.UUID:
        """Generate deterministic UUID v5 from GHCID."""
        namespace = uuid.UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')  # DNS namespace
        return uuid.uuid5(namespace, ghcid)

    def generate_uuid_v8_sha256(self, ghcid: str) -> uuid.UUID:
        """Generate deterministic UUID v8 using SHA-256."""
        hash_bytes = hashlib.sha256(ghcid.encode('utf-8')).digest()[:16]
        # Set version (8) and variant bits
        hash_bytes = bytearray(hash_bytes)
        hash_bytes[6] = (hash_bytes[6] & 0x0F) | 0x80  # Version 8
        hash_bytes[8] = (hash_bytes[8] & 0x3F) | 0x80  # Variant RFC 4122
        return uuid.UUID(bytes=bytes(hash_bytes))

    def generate_numeric_id(self, ghcid: str) -> int:
        """Generate 64-bit numeric ID from GHCID."""
        hash_bytes = hashlib.sha256(ghcid.encode('utf-8')).digest()
        return int.from_bytes(hash_bytes[:8], byteorder='big')


class BaseGHCIDGenerator(GHCIDGenerator):
    """Generate base GHCID without Q-number."""

    def generate(self, custodian: HeritageCustodian) -> str:
        if not custodian.locations:
            raise ValueError("Cannot generate GHCID without location")

        location = custodian.locations[0]
        country = location.country
        region = self._get_region_code(location.region)
        city = self._get_city_code(location.city)
        inst_type = self._get_type_code(custodian.institution_type)
        abbrev = self._get_name_abbreviation(custodian.name)

        return f"{country}-{region}-{city}-{inst_type}-{abbrev}"

    def _get_region_code(self, region: str) -> str:
        """Get 2-3 letter region code."""
        # Simplified: Use first 2 letters of region
        return region[:2].upper() if region else "XX"

    def _get_city_code(self, city: str) -> str:
        """Get 3-letter city code."""
        # Use UN/LOCODE or first 3 letters
        return city[:3].upper() if city else "XXX"

    def _get_type_code(self, inst_type: str) -> str:
        """Get single-letter type code."""
        type_map = {
            'MUSEUM': 'M',
            'LIBRARY': 'L',
            'ARCHIVE': 'A',
            'GALLERY': 'G',
            'BOTANICAL_ZOO': 'B',
            'HOLY_SITES': 'H',
            'MIXED': 'X'
        }
        return type_map.get(inst_type, 'X')

    def _get_name_abbreviation(self, name: str) -> str:
        """Get 2-5 letter abbreviation from name."""
        # Simple: Use first letters of words
        words = name.split()
        if len(words) == 1:
            return words[0][:3].upper()
        else:
            return ''.join(w[0] for w in words[:5]).upper()


class QNumberGHCIDGenerator(GHCIDGenerator):
    """Generate GHCID with Wikidata Q-number appended."""

    def __init__(self, base_generator: BaseGHCIDGenerator):
        self.base_generator = base_generator

    def generate(self, custodian: HeritageCustodian) -> str:
        base_ghcid = self.base_generator.generate(custodian)

        # Extract Wikidata Q-number
        q_number = self._extract_wikidata_qnumber(custodian)
        if q_number:
            return f"{base_ghcid}-{q_number}"
        else:
            return base_ghcid

    def _extract_wikidata_qnumber(self, custodian: HeritageCustodian) -> Optional[str]:
        """Extract Wikidata Q-number from identifiers."""
        if not custodian.identifiers:
            return None
        for identifier in custodian.identifiers:
            if identifier.identifier_scheme == 'Wikidata':
                return identifier.identifier_value
        return None


class CollisionAwareGHCIDGenerator(GHCIDGenerator):
    """Generate GHCID with collision detection and Q-number resolution."""

    def __init__(
        self,
        base_generator: BaseGHCIDGenerator,
        existing_dataset: List[HeritageCustodian]
    ):
        self.base_generator = base_generator
        self.q_generator = QNumberGHCIDGenerator(base_generator)
        self.existing_ghcids = {c.ghcid for c in existing_dataset}

    def generate(self, custodian: HeritageCustodian) -> str:
        base_ghcid = self.base_generator.generate(custodian)

        # Check for collision
        if base_ghcid in self.existing_ghcids:
            # Collision detected, use Q-number
            return self.q_generator.generate(custodian)
        else:
            return base_ghcid
```

**Benefits**:

- ✅ Separation of concerns (base GHCID vs. Q-number logic)
- ✅ Easy to extend with new generation strategies
- ✅ Collision detection encapsulated
- ✅ Testable components

**Usage**:

```python
# Generate base GHCID
base_gen = BaseGHCIDGenerator()
ghcid = base_gen.generate(custodian)  # "FR-IDF-PAR-M-LOU"

# Generate with Q-number
q_gen = QNumberGHCIDGenerator(base_gen)
ghcid = q_gen.generate(custodian)  # "FR-IDF-PAR-M-LOU-Q19675"

# Generate with collision detection
existing_dataset = load_existing_dataset()
collision_gen = CollisionAwareGHCIDGenerator(base_gen, existing_dataset)
ghcid = collision_gen.generate(custodian)  # Auto-appends Q-number if collision
```

---

### Pattern 7: Observer Pattern (Progress Tracking)

**Purpose**: Track extraction progress and notify listeners (logging, progress bars, metrics).
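The subject/observer wiring used in the full implementation below reduces to a small core: a subject that holds a list of listeners and notifies them through one interface. The `Subject`/`Collector` names here are hypothetical stand-ins, independent of the project's classes:

```python
from abc import ABC, abstractmethod
from typing import List

class Observer(ABC):
    @abstractmethod
    def on_event(self, event: str) -> None:
        ...

class Collector(Observer):
    """One concrete listener: records every notification it receives."""
    def __init__(self):
        self.events: List[str] = []

    def on_event(self, event: str) -> None:
        self.events.append(event)

class Subject:
    """The subject knows only the Observer interface, never concrete listeners."""
    def __init__(self):
        self._observers: List[Observer] = []

    def add_observer(self, obs: Observer) -> None:
        self._observers.append(obs)

    def notify(self, event: str) -> None:
        for obs in self._observers:
            obs.on_event(event)

collector = Collector()
subject = Subject()
subject.add_observer(collector)
subject.notify("start")
subject.notify("done")
print(collector.events)  # → ['start', 'done']
```

Adding a second listener (a progress bar, a metrics collector) requires no change to `Subject` — the same property the extractor below exploits.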
**Implementation**: ```python # src/glam_extractor/observers/progress_observer.py from abc import ABC, abstractmethod from typing import List, Dict class ProgressObserver(ABC): """Abstract observer for extraction progress.""" @abstractmethod def on_start(self, total_sites: int): """Called when extraction starts.""" pass @abstractmethod def on_site_success(self, site_id: int, custodian: HeritageCustodian): """Called when a site is successfully extracted.""" pass @abstractmethod def on_site_error(self, site_id: int, error: str): """Called when a site extraction fails.""" pass @abstractmethod def on_complete(self, success_count: int, error_count: int): """Called when extraction completes.""" pass class LoggingObserver(ProgressObserver): """Observer that logs to a file.""" def __init__(self, log_file: Path): self.log_file = log_file def on_start(self, total_sites: int): with open(self.log_file, 'a') as f: f.write(f"[{datetime.now()}] Starting extraction of {total_sites} sites\n") def on_site_success(self, site_id: int, custodian: HeritageCustodian): with open(self.log_file, 'a') as f: f.write(f"[{datetime.now()}] SUCCESS: Site {site_id} - {custodian.name}\n") def on_site_error(self, site_id: int, error: str): with open(self.log_file, 'a') as f: f.write(f"[{datetime.now()}] ERROR: Site {site_id} - {error}\n") def on_complete(self, success_count: int, error_count: int): with open(self.log_file, 'a') as f: f.write(f"[{datetime.now()}] Completed: {success_count} success, {error_count} errors\n") class ProgressBarObserver(ProgressObserver): """Observer that displays a progress bar.""" def __init__(self): self.pbar = None def on_start(self, total_sites: int): from tqdm import tqdm self.pbar = tqdm(total=total_sites, desc="Extracting UNESCO sites") def on_site_success(self, site_id: int, custodian: HeritageCustodian): if self.pbar: self.pbar.update(1) self.pbar.set_postfix_str(f"Last: {custodian.name[:30]}") def on_site_error(self, site_id: int, error: str): if self.pbar: 
self.pbar.update(1) def on_complete(self, success_count: int, error_count: int): if self.pbar: self.pbar.close() class MetricsObserver(ProgressObserver): """Observer that collects metrics.""" def __init__(self): self.metrics = { 'start_time': None, 'end_time': None, 'total_sites': 0, 'success_count': 0, 'error_count': 0, 'institution_types': {}, 'countries': {} } def on_start(self, total_sites: int): self.metrics['start_time'] = datetime.now() self.metrics['total_sites'] = total_sites def on_site_success(self, site_id: int, custodian: HeritageCustodian): self.metrics['success_count'] += 1 # Track institution types inst_type = custodian.institution_type self.metrics['institution_types'][inst_type] = \ self.metrics['institution_types'].get(inst_type, 0) + 1 # Track countries if custodian.locations: country = custodian.locations[0].country self.metrics['countries'][country] = \ self.metrics['countries'].get(country, 0) + 1 def on_site_error(self, site_id: int, error: str): self.metrics['error_count'] += 1 def on_complete(self, success_count: int, error_count: int): self.metrics['end_time'] = datetime.now() duration = self.metrics['end_time'] - self.metrics['start_time'] self.metrics['duration_seconds'] = duration.total_seconds() def get_metrics(self) -> Dict: return self.metrics class UNESCOExtractor: """Extraction orchestrator with observer pattern.""" def __init__(self, repository: UNESCORepository): self.repository = repository self.observers: List[ProgressObserver] = [] def add_observer(self, observer: ProgressObserver): self.observers.append(observer) def extract_all_sites(self) -> List[HeritageCustodian]: """Extract all UNESCO sites with progress tracking.""" site_list = self.repository.fetch_site_list() total_sites = len(site_list) # Notify observers: start for observer in self.observers: observer.on_start(total_sites) results = [] errors = 0 for site_data in site_list: site_id = site_data['id_number'] try: # Fetch details and parse details = 
self.repository.fetch_site_details(site_id) custodian = parse_unesco_site(details) results.append(custodian) # Notify observers: success for observer in self.observers: observer.on_site_success(site_id, custodian) except Exception as e: errors += 1 # Notify observers: error for observer in self.observers: observer.on_site_error(site_id, str(e)) # Notify observers: complete for observer in self.observers: observer.on_complete(len(results), errors) return results ``` **Benefits**: - ✅ Decouples extraction logic from logging/progress tracking - ✅ Easy to add new observers without modifying extractor - ✅ Multiple observers can run simultaneously (logging + progress bar + metrics) - ✅ Testable (mock observers in tests) **Usage**: ```python # Set up extraction with observers extractor = UNESCOExtractor(repository) extractor.add_observer(LoggingObserver(Path("logs/unesco_extraction.log"))) extractor.add_observer(ProgressBarObserver()) metrics_observer = MetricsObserver() extractor.add_observer(metrics_observer) # Run extraction custodians = extractor.extract_all_sites() # Get metrics after completion metrics = metrics_observer.get_metrics() print(f"Extracted {metrics['success_count']} sites in {metrics['duration_seconds']:.2f}s") ``` --- ## Integration Patterns with Existing GLAM Architecture ### Pattern 8: Consistent Provenance Tracking **Align with existing GLAM patterns**: ```python # Match provenance structure from Dutch ISIL/Orgs parsers provenance = Provenance( data_source=DataSource.UNESCO_WORLD_HERITAGE, # Add to enums.yaml data_tier=DataTier.TIER_1_AUTHORITATIVE, extraction_date=datetime.now(timezone.utc).isoformat(), extraction_method="UNESCO DataHub API with LinkML Map transformation", confidence_score=classification_result['confidence_score'], # Notes go in custodian.description, NOT provenance (schema quirk!) 
)
```

### Pattern 9: GHCID Collision Resolution

**Follow existing collision handling**:

```python
# From AGENTS.md - Temporal priority rule
def resolve_ghcid_collision(
    new_custodian: HeritageCustodian,
    existing_custodian: HeritageCustodian
) -> str:
    """Apply the temporal priority rule for a GHCID collision."""
    new_date = new_custodian.provenance.extraction_date
    existing_date = existing_custodian.provenance.extraction_date

    if new_date == existing_date:
        # First-batch collision: BOTH custodians get Q-numbers
        return generate_ghcid_with_qnumber(new_custodian)
    else:
        # Historical addition: ONLY the new custodian gets a Q-number;
        # the existing GHCID is preserved (PID stability)
        return generate_ghcid_with_qnumber(new_custodian)
```

### Pattern 10: Multi-format Export

**Reuse existing exporter patterns**:

```python
# Match structure from Dutch dataset exporters
class UNESCOExportPipeline:
    """Export pipeline matching existing GLAM patterns."""

    def __init__(self, dataset: List[HeritageCustodian]):
        self.dataset = dataset

    def export_all_formats(self, output_dir: Path):
        """Export to all formats (RDF, JSON-LD, CSV, Parquet, SQLite)."""
        self.export_rdf(output_dir / "unesco_glam.ttl")
        self.export_jsonld(output_dir / "unesco_glam.jsonld")
        self.export_csv(output_dir / "unesco_glam.csv")
        self.export_parquet(output_dir / "unesco_glam.parquet")
        self.export_sqlite(output_dir / "unesco_glam.db")
```

---

## Testing Patterns

### Pattern 11: Test Fixture Management

```python
# Centralized fixture loading
@pytest.fixture(scope="session")
def unesco_fixtures():
    """Load all UNESCO test fixtures."""
    return UNESCOFixtureRepository(Path("tests/fixtures/unesco_api_responses"))


@pytest.fixture
def sample_museum_site(unesco_fixtures):
    """Fixture for a museum site."""
    return unesco_fixtures.fetch_site_details(600)  # Louvre


@pytest.fixture
def sample_library_site(unesco_fixtures):
    """Fixture for a library site."""
    return unesco_fixtures.fetch_site_details(1110)  # National Library of Vietnam
```

### Pattern 12: Property-Based Testing for GHCIDs

```python
from hypothesis import given, strategies as st


@given(
    country=st.sampled_from(["FR", "NL", "BR", "IN", "JP"]),
    city=st.text(min_size=3, max_size=20),
    inst_type=st.sampled_from(list(InstitutionTypeEnum))
)
def test_ghcid_determinism(country, city, inst_type):
    """Property: GHCID generation is deterministic."""
    custodian = create_test_custodian(country, city, inst_type)

    ghcid1 = generator.generate(custodian)
    ghcid2 = generator.generate(custodian)

    assert ghcid1 == ghcid2
```

---

## Anti-Patterns to Avoid

### ❌ Anti-Pattern 1: God Class

**Bad**:

```python
class UNESCOExtractor:
    def fetch_data(self): ...
    def parse_data(self): ...
    def classify_type(self): ...
    def generate_ghcid(self): ...
    def validate(self): ...
    def export_rdf(self): ...
    # 1000+ lines, does everything!
```

**Good**: Separate concerns (Repository, Parser, Classifier, Generator, Exporter)

### ❌ Anti-Pattern 2: Primitive Obsession

**Bad**:

```python
def generate_ghcid(country: str, city: str, type: str, name: str) -> str:
    # 4 primitive parameters, easy to mix up
    ...
```

**Good**: Use domain objects

```python
def generate_ghcid(custodian: HeritageCustodian) -> str:
    # Single parameter, type-safe
    ...
```

### ❌ Anti-Pattern 3: Exception-Driven Control Flow

**Bad**:

```python
try:
    custodian = parse(data)
    return custodian
except ValueError:
    return None  # Silent failure, hard to debug
```

**Good**: Use the Result pattern

```python
result = parse(data)
if result.is_err():
    log.error(f"Parse error: {result.error}")
    return None
return result.unwrap()
```

---

## Summary of Patterns

| Pattern | Purpose | Benefit |
|---------|---------|---------|
| **Repository** | Abstract data access | Testability, flexibility |
| **Strategy** | Pluggable classification | Extensibility, A/B testing |
| **Builder** | Complex object construction | Fluent API, step-by-step validation |
| **Result** | Explicit error handling | Type safety, no hidden exceptions |
| **Adapter** | LinkML integration | Abstraction, testability |
| **Factory** | GHCID generation | Separation of concerns, extensibility |
| **Observer** | Progress tracking | Decoupling, multiple listeners |
| **Provenance** | Consistent metadata | Integration with existing GLAM patterns |
| **Collision Resolution** | GHCID uniqueness | PID stability, determinism |
| **Multi-format Export** | Standard outputs | Reusability, consumer flexibility |

---

## Next Steps

**Completed**:
- ✅ `01-dependencies.md` - Technical dependencies
- ✅ `02-consumers.md` - Use cases and data consumers
- ✅ `03-implementation-phases.md` - 6-week timeline
- ✅ `04-tdd-strategy.md` - Testing strategy
- ✅ `05-design-patterns.md` - **THIS DOCUMENT** - Architectural patterns

**Next to Create**:
1. **`06-linkml-map-schema.md`** - **CRITICAL** - Complete LinkML Map transformation rules
2. **`07-master-checklist.md`** - Implementation checklist tying everything together

---

**Document Status**: Complete
**Next Document**: `06-linkml-map-schema.md` - LinkML Map transformation rules
**Version**: 1.1