# Global GLAM Dataset: Design Patterns

## Overview

This document describes the software design patterns and architectural principles used throughout the Global GLAM Dataset extraction system.

## Core Design Principles

### 1. Separation of Concerns

- **Parsing** ≠ **Extraction** ≠ **Validation** ≠ **Export**
- Each component has a single, well-defined responsibility
- Components communicate through clearly defined interfaces

### 2. Data Pipeline Architecture

- Immutable intermediate stages
- Clear data transformations at each step
- Traceable provenance from source to output

### 3. Fail-Safe Processing

- Graceful degradation when external services are unavailable
- Partial results are better than no results
- Continue processing even when individual items fail
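The fail-safe principle can be sketched in a few lines. This is an illustrative helper (the names `BatchResult` and `process_all` are not part of the system's API): each item is processed independently, failures are recorded alongside the partial results, and the batch runs to completion.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class BatchResult:
    """Partial results plus the failures that produced them."""
    succeeded: List[str] = field(default_factory=list)
    failed: List[Tuple[str, str]] = field(default_factory=list)

def process_all(items: List[str], handler: Callable[[str], str]) -> BatchResult:
    """Process every item; record failures instead of aborting the batch."""
    result = BatchResult()
    for item in items:
        try:
            result.succeeded.append(handler(item))
        except Exception as exc:
            # One bad item must not sink the whole batch
            result.failed.append((item, str(exc)))
    return result
```

A caller can then decide whether the failure rate warrants a retry, while still keeping everything that did succeed.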

### 4. Provenance Tracking

- Every piece of data tracks its source
- Confidence scores for derived data
- Audit trail for transformations
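A minimal sketch of the idea, with hypothetical stand-in types (the full `ProvenanceRecord` value object appears under Pattern 9): every mutation appends an audit entry recording source and confidence.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class FieldProvenance:
    """Illustrative audit entry: where a field's value came from."""
    field_name: str
    source: str
    confidence: float
    recorded_at: datetime = field(default_factory=datetime.now)

@dataclass
class Record:
    """Illustrative record whose mutations are all audited."""
    name: str = ""
    audit_trail: List[FieldProvenance] = field(default_factory=list)

    def set_name(self, value: str, source: str, confidence: float = 1.0) -> None:
        # Every write appends an audit entry rather than overwriting history
        self.name = value
        self.audit_trail.append(FieldProvenance("name", source, confidence))
```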

## Architectural Patterns

### Pattern 1: Pipeline Pattern

**Intent**: Process data through a series of sequential transformations

**Structure**:

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar, List

TInput = TypeVar('TInput')
TOutput = TypeVar('TOutput')

class PipelineStage(ABC, Generic[TInput, TOutput]):
    """Base class for pipeline stages"""

    @abstractmethod
    def process(self, input_data: TInput) -> TOutput:
        """Transform input to output"""
        ...

    @abstractmethod
    def validate_input(self, input_data: TInput) -> bool:
        """Validate input before processing"""
        ...

    def handle_error(self, error: Exception, input_data: TInput) -> TOutput:
        """Handle processing errors; re-raise by default.

        Subclasses override this to degrade gracefully, e.g. by
        returning a partial result instead of propagating.
        """
        raise error

class Pipeline(Generic[TInput, TOutput]):
    """Compose multiple stages into a pipeline"""

    def __init__(self):
        self.stages: List[PipelineStage] = []

    def add_stage(self, stage: PipelineStage) -> 'Pipeline':
        self.stages.append(stage)
        return self  # Fluent interface

    def execute(self, input_data: TInput) -> TOutput:
        """Execute all stages in sequence"""
        current_data = input_data

        for stage in self.stages:
            if stage.validate_input(current_data):
                try:
                    current_data = stage.process(current_data)
                except Exception as e:
                    current_data = stage.handle_error(e, current_data)
            else:
                raise ValueError(f"Invalid input for stage {stage.__class__.__name__}")

        return current_data
```

**Usage**:

```python
# Build the extraction pipeline
pipeline = (Pipeline()
    .add_stage(ConversationParser())
    .add_stage(EntityExtractor())
    .add_stage(AttributeExtractor())
    .add_stage(WebCrawler())
    .add_stage(LinkMLMapper())
    .add_stage(RDFExporter())
)

# Process a conversation file
result = pipeline.execute(conversation_file)
```

**Benefits**:

- Clear data flow
- Easy to add or remove stages
- Stages are testable in isolation
- Composable

### Pattern 2: Repository Pattern

**Intent**: Separate data access logic from business logic

**Structure**:

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List, Optional, TypeVar, Generic

import duckdb

T = TypeVar('T')

class Repository(ABC, Generic[T]):
    """Abstract repository for data access"""

    @abstractmethod
    def get_by_id(self, id: str) -> Optional[T]:
        """Retrieve a single entity by ID"""
        ...

    @abstractmethod
    def get_all(self) -> List[T]:
        """Retrieve all entities"""
        ...

    @abstractmethod
    def save(self, entity: T) -> None:
        """Persist an entity"""
        ...

    @abstractmethod
    def delete(self, id: str) -> None:
        """Delete an entity"""
        ...

    @abstractmethod
    def find_by(self, **criteria) -> List[T]:
        """Find entities matching criteria"""
        ...

class ConversationRepository(Repository[Conversation]):
    """Repository for conversation files"""

    def __init__(self, data_dir: Path):
        self.data_dir = data_dir
        self._index = self._build_index()

    def get_by_id(self, id: str) -> Optional[Conversation]:
        if id in self._index:
            return self._load_conversation(self._index[id])
        return None

    def find_by(self, **criteria) -> List[Conversation]:
        """
        Find conversations by criteria.

        Examples:
        - find_by(country='Brazil')
        - find_by(date_after='2025-09-01')
        - find_by(glam_type='museum')
        """
        return [conv for conv in self.get_all()
                if self._matches_criteria(conv, criteria)]

class InstitutionRepository(Repository[HeritageCustodian]):
    """Repository for heritage institutions"""

    def __init__(self, db_path: Path):
        self.db = duckdb.connect(str(db_path))

    def get_by_id(self, id: str) -> Optional[HeritageCustodian]:
        result = self.db.execute(
            "SELECT * FROM institutions WHERE id = ?", [id]
        ).fetchone()
        return self._to_entity(result) if result else None

    def find_by_isil(self, isil_code: str) -> Optional[HeritageCustodian]:
        """Domain-specific query method"""
        matches = self.find_by(isil_code=isil_code)  # query once, not twice
        return matches[0] if matches else None
```

**Benefits**:

- Swap storage backends (SQLite, DuckDB, files)
- Easier testing (mock repositories)
- Centralized data access logic
- Domain-specific query methods

### Pattern 3: Strategy Pattern

**Intent**: Define a family of algorithms and make them interchangeable

**Structure**:

```python
import re
from abc import ABC, abstractmethod
from typing import Dict, List

import spacy

class ExtractionStrategy(ABC):
    """Base strategy for entity extraction"""

    @abstractmethod
    def extract_institutions(self, text: str) -> List[ExtractedInstitution]:
        ...

class SpacyNERStrategy(ExtractionStrategy):
    """Use spaCy for entity extraction"""

    def __init__(self, model_name: str = "en_core_web_lg"):
        self.nlp = spacy.load(model_name)

    def extract_institutions(self, text: str) -> List[ExtractedInstitution]:
        doc = self.nlp(text)
        institutions = []

        for ent in doc.ents:
            if ent.label_ in ["ORG", "FAC"]:
                # Organisation or facility entity: parse it as an institution
                institutions.append(self._parse_entity(ent))

        return institutions

class TransformerNERStrategy(ExtractionStrategy):
    """Use transformer models for entity extraction"""

    def __init__(self, model_name: str = "dslim/bert-base-NER"):
        from transformers import pipeline
        self.ner = pipeline("ner", model=model_name)

    def extract_institutions(self, text: str) -> List[ExtractedInstitution]:
        entities = self.ner(text)
        institutions: List[ExtractedInstitution] = []
        # Process transformer output...
        return institutions

class RegexPatternStrategy(ExtractionStrategy):
    """Use regex patterns for extraction"""

    def __init__(self, patterns: Dict[str, str]):
        self.patterns = {k: re.compile(v) for k, v in patterns.items()}

    def extract_institutions(self, text: str) -> List[ExtractedInstitution]:
        institutions: List[ExtractedInstitution] = []
        # Pattern-based extraction...
        return institutions

# Usage
class EntityExtractor:
    def __init__(self, strategy: ExtractionStrategy):
        self.strategy = strategy

    def set_strategy(self, strategy: ExtractionStrategy):
        """Change extraction strategy at runtime"""
        self.strategy = strategy

    def extract(self, text: str) -> List[ExtractedInstitution]:
        return self.strategy.extract_institutions(text)

# Configure different strategies for different use cases
extractor = EntityExtractor(SpacyNERStrategy())   # Default
extractor.set_strategy(TransformerNERStrategy())  # Switch to transformer
```

**Benefits**:

- Easy to add new extraction methods
- Switch algorithms at runtime
- Compare the performance of different approaches
- A/B test different strategies

### Pattern 4: Builder Pattern

**Intent**: Construct complex objects step by step

**Structure**:

```python
from typing import Tuple

class HeritageCustodianBuilder:
    """Builder for HeritageCustodian instances"""

    def __init__(self):
        self._institution = HeritageCustodian()
        self._provenance = []

    def with_name(self, name: str, source: str = None) -> 'HeritageCustodianBuilder':
        self._institution.name = name
        if source:
            self._provenance.append(ProvenanceRecord(
                field="name", source=source, confidence=1.0
            ))
        return self

    def with_isil_code(self, code: str, source: str = None) -> 'HeritageCustodianBuilder':
        self._institution.isil_code = code
        if source:
            self._provenance.append(ProvenanceRecord(
                field="isil_code", source=source, confidence=1.0
            ))
        return self

    def with_type(self, inst_type: str, confidence: float = 1.0) -> 'HeritageCustodianBuilder':
        self._institution.institution_types.append(inst_type)
        self._provenance.append(ProvenanceRecord(
            field="institution_types",
            source="inferred",
            confidence=confidence
        ))
        return self

    def with_location(self, city: str, country: str,
                      coordinates: Tuple[float, float] = None) -> 'HeritageCustodianBuilder':
        self._institution.geographic_coverage = GeographicCoverage(
            municipality=city,
            country=country,
            coordinates=coordinates
        )
        return self

    def from_csv_record(self, record: DutchOrganizationRecord) -> 'HeritageCustodianBuilder':
        """Bulk population from a CSV record"""
        self.with_name(record.organization_name, source="dutch_org_registry")
        if record.isil_code:
            self.with_isil_code(record.isil_code, source="dutch_org_registry")
        self.with_location(record.city, "Netherlands", record.coordinates)
        # ... more fields
        return self

    def from_extracted_entity(self, entity: ExtractedInstitution) -> 'HeritageCustodianBuilder':
        """Bulk population from NLP extraction"""
        self.with_name(entity.name, source="nlp_extraction")
        if entity.isil_code:
            self.with_isil_code(entity.isil_code, source="nlp_extraction")
        for inst_type in entity.institution_type:
            self.with_type(inst_type, confidence=entity.extraction_confidence)
        # ... more fields
        return self

    def merge_from(self, other: HeritageCustodian,
                   priority: str = "authoritative") -> 'HeritageCustodianBuilder':
        """Merge data from another HeritageCustodian instance"""
        # Complex merge logic...
        return self

    def build(self) -> HeritageCustodian:
        """Construct the final instance with provenance"""
        self._institution.provenance = self._provenance
        return self._institution

# Usage
institution = (HeritageCustodianBuilder()
    .with_name("Rijksmuseum")
    .with_isil_code("NL-AsdRM")
    .with_type("museum")
    .with_location("Amsterdam", "Netherlands", (52.36, 4.88))
    .build()
)

# Or from existing data sources
institution = (HeritageCustodianBuilder()
    .from_csv_record(csv_record)
    .from_extracted_entity(extracted_entity)
    .build()
)
```

**Benefits**:

- Step-by-step construction of complex objects
- Fluent, readable API
- Provenance is tracked during construction
- Merge data from multiple sources

### Pattern 5: Adapter Pattern

**Intent**: Convert interfaces to work together

**Structure**:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

import httpx
from bs4 import BeautifulSoup

class WebCrawlerAdapter(ABC):
    """Adapter for different web crawling libraries"""

    @abstractmethod
    async def fetch(self, url: str) -> CrawlResult:
        ...

    @abstractmethod
    async def extract_metadata(self, url: str) -> Dict[str, Any]:
        ...

class Crawl4AIAdapter(WebCrawlerAdapter):
    """Adapter for the crawl4ai library"""

    def __init__(self):
        from crawl4ai import AsyncWebCrawler
        self.crawler = AsyncWebCrawler()

    async def fetch(self, url: str) -> CrawlResult:
        result = await self.crawler.arun(url)
        # Convert the crawl4ai result to our CrawlResult format
        return CrawlResult(
            url=url,
            content=result.extracted_content,
            metadata=result.metadata,
            status_code=result.status_code
        )

    async def extract_metadata(self, url: str) -> Dict[str, Any]:
        result = await self.fetch(url)
        # Extract Schema.org, OpenGraph, etc.
        return self._parse_metadata(result.content)

class BeautifulSoupAdapter(WebCrawlerAdapter):
    """Adapter for BeautifulSoup (fallback)"""

    async def fetch(self, url: str) -> CrawlResult:
        async with httpx.AsyncClient() as client:
            response = await client.get(url)

        soup = BeautifulSoup(response.text, 'lxml')

        return CrawlResult(
            url=url,
            content=soup.get_text(),
            metadata=self._extract_meta_tags(soup),
            status_code=response.status_code
        )

# Usage - swap implementations transparently
class WebCrawler:
    def __init__(self, adapter: WebCrawlerAdapter):
        self.adapter = adapter

    async def crawl(self, url: str) -> CrawlResult:
        return await self.adapter.fetch(url)

# Try crawl4ai, fall back to BeautifulSoup
try:
    crawler = WebCrawler(Crawl4AIAdapter())
except ImportError:
    crawler = WebCrawler(BeautifulSoupAdapter())
```

**Benefits**:

- Swap implementations easily
- Isolate external dependencies
- Fallback mechanisms
- Easier testing with mock adapters

### Pattern 6: Observer Pattern

**Intent**: Notify observers when processing events occur

**Structure**:

```python
import logging
import time
from abc import ABC, abstractmethod
from collections import defaultdict
from datetime import datetime
from typing import Any, List

class ProcessingEvent:
    """Base class for events"""

    def __init__(self, stage: str, data: Any):
        self.stage = stage
        self.data = data
        self.timestamp = datetime.now()

class Observer(ABC):
    """Observer interface"""

    @abstractmethod
    def update(self, event: ProcessingEvent) -> None:
        ...

class LoggingObserver(Observer):
    """Log all processing events"""

    def update(self, event: ProcessingEvent) -> None:
        logging.info(f"[{event.stage}] {event.timestamp}: {event.data}")

class MetricsObserver(Observer):
    """Collect metrics on processing"""

    def __init__(self):
        self.metrics = defaultdict(list)

    def update(self, event: ProcessingEvent) -> None:
        self.metrics[event.stage].append({
            'timestamp': event.timestamp,
            'success': event.data.get('success', True),
            'duration': event.data.get('duration', 0)
        })

class ProgressObserver(Observer):
    """Display progress bars"""

    def __init__(self):
        self.progress_bars = {}

    def update(self, event: ProcessingEvent) -> None:
        from tqdm import tqdm  # lazy import at the point of use
        if event.stage not in self.progress_bars:
            self.progress_bars[event.stage] = tqdm(desc=event.stage)
        self.progress_bars[event.stage].update(1)

class Observable:
    """Observable base class for pipeline stages"""

    def __init__(self):
        self._observers: List[Observer] = []

    def attach(self, observer: Observer) -> None:
        self._observers.append(observer)

    def detach(self, observer: Observer) -> None:
        self._observers.remove(observer)

    def notify(self, event: ProcessingEvent) -> None:
        for observer in self._observers:
            observer.update(event)

# Usage
class ObservablePipeline(Pipeline, Observable):
    """Pipeline with event notifications"""

    def __init__(self):
        Pipeline.__init__(self)
        Observable.__init__(self)

    def execute(self, input_data):
        self.notify(ProcessingEvent('pipeline_start', {'input': input_data}))

        for stage in self.stages:
            self.notify(ProcessingEvent(stage.__class__.__name__, {'status': 'starting'}))

            start = time.time()
            input_data = stage.process(input_data)  # chain output into the next stage
            duration = time.time() - start

            self.notify(ProcessingEvent(stage.__class__.__name__, {
                'status': 'completed',
                'duration': duration,
                'success': True
            }))

        self.notify(ProcessingEvent('pipeline_complete', {'output': input_data}))
        return input_data

# Attach observers
pipeline = ObservablePipeline()
pipeline.attach(LoggingObserver())
pipeline.attach(MetricsObserver())
pipeline.attach(ProgressObserver())
```

**Benefits**:

- Decouple logging, metrics, and progress tracking
- Easy to add new observers
- Monitor pipeline execution
- Collect performance data

### Pattern 7: Factory Pattern

**Intent**: Create objects without specifying their exact class

**Structure**:

```python
from typing import List, Type

class ExporterFactory:
    """Factory for creating exporters"""

    _exporters = {}

    @classmethod
    def register(cls, format_name: str, exporter_class: Type[Exporter]):
        """Register an exporter for a format"""
        cls._exporters[format_name] = exporter_class

    @classmethod
    def create(cls, format_name: str, **kwargs) -> Exporter:
        """Create an exporter instance"""
        if format_name not in cls._exporters:
            raise ValueError(f"Unknown format: {format_name}")

        exporter_class = cls._exporters[format_name]
        return exporter_class(**kwargs)

    @classmethod
    def available_formats(cls) -> List[str]:
        """List available export formats"""
        return list(cls._exporters.keys())

# Register exporters
ExporterFactory.register('rdf', RDFExporter)
ExporterFactory.register('csv', CSVExporter)
ExporterFactory.register('jsonld', JSONLDExporter)
ExporterFactory.register('parquet', ParquetExporter)

# Usage
exporter = ExporterFactory.create('rdf', format='turtle')
exporter.export(institutions, output_path)

# Or export to multiple formats
for format_name in ['rdf', 'csv', 'jsonld']:
    exporter = ExporterFactory.create(format_name)
    exporter.export(institutions, output_dir / f"output.{format_name}")
```

**Benefits**:

- Extensible (register new formats)
- Centralized creation logic
- Easy to discover available formats

### Pattern 8: Command Pattern

**Intent**: Encapsulate requests as objects

**Structure**:

```python
from abc import ABC, abstractmethod
from typing import Any, List

class Command(ABC):
    """Base command interface"""

    @abstractmethod
    def execute(self) -> Any:
        ...

    @abstractmethod
    def undo(self) -> None:
        ...

class ExtractConversationCommand(Command):
    """Extract institutions from a conversation"""

    def __init__(self, conversation_id: str,
                 conversation_repo: ConversationRepository,
                 institution_repo: InstitutionRepository,
                 extractor: EntityExtractor):
        self.conversation_id = conversation_id
        self.conversation_repo = conversation_repo
        self.institution_repo = institution_repo
        self.extractor = extractor
        self.extracted_institutions = []

    def execute(self) -> List[HeritageCustodian]:
        # Load the conversation
        conversation = self.conversation_repo.get_by_id(self.conversation_id)

        # Extract institutions
        self.extracted_institutions = self.extractor.extract(conversation.content)

        # Save to the repository
        for inst in self.extracted_institutions:
            self.institution_repo.save(inst)

        return self.extracted_institutions

    def undo(self) -> None:
        # Remove the extracted institutions
        for inst in self.extracted_institutions:
            self.institution_repo.delete(inst.id)

class CommandQueue:
    """Queue and execute commands"""

    def __init__(self):
        self.commands = []
        self.executed = []

    def add_command(self, command: Command) -> None:
        self.commands.append(command)

    def execute_all(self) -> List[Any]:
        results = []
        for command in self.commands:
            result = command.execute()
            self.executed.append(command)
            results.append(result)
        return results

    def undo_last(self) -> None:
        if self.executed:
            command = self.executed.pop()
            command.undo()

# Usage - queue up work
queue = CommandQueue()

for conversation_id in conversation_ids:
    cmd = ExtractConversationCommand(
        conversation_id,
        conversation_repo,
        institution_repo,
        extractor
    )
    queue.add_command(cmd)

# Execute all
results = queue.execute_all()

# Undo if needed
if error_detected:
    queue.undo_last()
```

**Benefits**:

- Queue operations
- Undo/redo support
- Batch processing
- Deferred execution

## Data Patterns

### Pattern 9: Value Object Pattern

**Intent**: Objects that represent values (immutable, equality by value)

**Structure**:

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Tuple, Optional

@dataclass(frozen=True)  # Immutable
class GeographicCoverage:
    """Value object for geographic location"""

    country: str
    municipality: Optional[str] = None
    province: Optional[str] = None
    coordinates: Optional[Tuple[float, float]] = None

    def __post_init__(self):
        # Validation
        if self.coordinates:
            lat, lon = self.coordinates
            if not (-90 <= lat <= 90 and -180 <= lon <= 180):
                raise ValueError("Invalid coordinates")

    def distance_to(self, other: 'GeographicCoverage') -> Optional[float]:
        """Calculate distance to another location"""
        if self.coordinates and other.coordinates:
            # Haversine formula (calculate_distance defined elsewhere)
            return calculate_distance(self.coordinates, other.coordinates)
        return None

@dataclass(frozen=True)
class Identifier:
    """Value object for identifiers"""

    scheme: str  # 'isil', 'wikidata', 'kvk', etc.
    value: str
    assigned_date: Optional[date] = None

    def __str__(self):
        return f"{self.scheme}:{self.value}"

@dataclass(frozen=True)
class ProvenanceRecord:
    """Value object for provenance tracking"""

    source: str  # 'isil_registry', 'nlp_extraction', etc.
    field: Optional[str] = None
    confidence: float = 1.0
    timestamp: datetime = field(default_factory=datetime.now)

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("Confidence must be between 0 and 1")
```

**Benefits**:

- Immutable (thread-safe)
- Equality by value
- Self-validating
- Reusable across entities
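Frozen dataclasses give equality by value, hashability, and immutability for free. A quick check, using a simplified stand-in (`Location`) rather than the classes above:

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional, Tuple

@dataclass(frozen=True)
class Location:
    """Simplified stand-in for a geographic value object."""
    country: str
    coordinates: Optional[Tuple[float, float]] = None

a = Location("Netherlands", (52.36, 4.88))
b = Location("Netherlands", (52.36, 4.88))

assert a == b            # equality by value, not identity
assert a is not b
assert len({a, b}) == 1  # hashable: duplicates collapse in sets

try:
    a.country = "Brazil"  # mutation is rejected
except FrozenInstanceError:
    pass
```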

### Pattern 10: Entity Pattern

**Intent**: Objects with identity that persists over time

**Structure**:

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class HeritageCustodian:
    """Entity with persistent identity"""

    # Identity
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

    # Essential attributes
    name: str = ""
    institution_types: List[str] = field(default_factory=list)

    # Value objects
    geographic_coverage: Optional[GeographicCoverage] = None
    identifiers: List[Identifier] = field(default_factory=list)

    # Provenance
    provenance: List[ProvenanceRecord] = field(default_factory=list)

    # Version tracking
    version: int = 1
    last_updated: datetime = field(default_factory=datetime.now)

    def __eq__(self, other):
        """Equality by ID, not by value"""
        if not isinstance(other, HeritageCustodian):
            return False
        return self.id == other.id

    def __hash__(self):
        return hash(self.id)

    def update_name(self, new_name: str, source: str):
        """Update with provenance tracking"""
        self.name = new_name
        self.provenance.append(ProvenanceRecord(
            source=source,
            field="name",
            confidence=1.0
        ))
        self.version += 1
        self.last_updated = datetime.now()
```

**Benefits**:

- Clear identity
- Mutable (can be updated)
- Track changes over time
- Distinct from value objects
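The contrast with value objects is easiest to see in a tiny demo. Using a hypothetical stand-in (`Entity`), identity survives attribute changes, while identical attributes do not imply equality:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Entity:
    """Simplified stand-in: equality and hashing by ID only."""
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    name: str = ""

    def __eq__(self, other):
        return isinstance(other, Entity) and self.id == other.id

    def __hash__(self):
        return hash(self.id)

e = Entity(name="Museu Nacional")
renamed = Entity(id=e.id, name="National Museum of Brazil")

assert e == renamed                         # same identity, different attributes
assert e != Entity(name="Museu Nacional")   # same attributes, different identity
```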

## Error Handling Patterns

### Pattern 11: Result Pattern

**Intent**: Explicit error handling without exceptions

**Structure**:

```python
import re
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar('T')
E = TypeVar('E')

@dataclass
class Success(Generic[T]):
    """Successful result"""
    value: T

    def is_success(self) -> bool:
        return True

    def is_failure(self) -> bool:
        return False

@dataclass
class Failure(Generic[E]):
    """Failed result"""
    error: E

    def is_success(self) -> bool:
        return False

    def is_failure(self) -> bool:
        return True

Result = Union[Success[T], Failure[E]]

# Usage
def extract_isil_code(text: str) -> Result[str, str]:
    """Extract an ISIL code from text"""
    pattern = r'(NL-[A-Za-z0-9]+)'
    match = re.search(pattern, text)

    if match:
        return Success(match.group(1))
    else:
        return Failure("No ISIL code found in text")

# Consumer
result = extract_isil_code(text)
if result.is_success():
    print(f"Found ISIL: {result.value}")
else:
    print(f"Error: {result.error}")
```

**Benefits**:

- Explicit error handling
- No silent failures
- Type-safe
- Forces error consideration
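Because successes and failures are ordinary values, results also compose. A sketch of a `map_result` helper (not part of the code above, but built on the same `Success`/`Failure` shape, redefined here so the block is self-contained): a function is applied only to successes, and failures short-circuit through unchanged.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar, Union

T = TypeVar('T')
U = TypeVar('U')
E = TypeVar('E')

@dataclass
class Success(Generic[T]):
    value: T

    def is_success(self) -> bool:
        return True

@dataclass
class Failure(Generic[E]):
    error: E

    def is_success(self) -> bool:
        return False

Result = Union[Success[T], Failure[E]]

def map_result(result: Result, fn: Callable[[T], U]) -> Result:
    """Apply fn to a Success value; pass a Failure through unchanged."""
    if result.is_success():
        return Success(fn(result.value))
    return result

assert map_result(Success("nl-asdrm"), str.upper) == Success("NL-ASDRM")
assert map_result(Failure("no match"), str.upper) == Failure("no match")
```

Chaining several such calls keeps the happy path linear while errors fall straight through to the consumer.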

## Testing Patterns

### Pattern 12: Test Fixture Pattern

**Intent**: Reusable test data and setup

**Structure**:

```python
from datetime import datetime
from pathlib import Path

import pytest

@pytest.fixture
def sample_conversation():
    """Fixture providing a sample conversation"""
    return Conversation(
        id="test-001",
        title="Brazilian GLAM institutions",
        created_at=datetime(2025, 9, 22, 14, 40),
        messages=[
            Message(role="user", content="Tell me about Brazilian museums"),
            Message(role="assistant", content="The National Museum of Brazil...")
        ],
        citations=[
            Citation(url="https://www.museu.br", title="National Museum")
        ]
    )

@pytest.fixture
def sample_csv_record():
    """Fixture providing a sample CSV record"""
    return DutchOrganizationRecord(
        organization_name="Rijksmuseum",
        organization_type="museum",
        city="Amsterdam",
        province="Noord-Holland",
        isil_code="NL-AsdRM",
        website_url="https://www.rijksmuseum.nl"
    )

@pytest.fixture
def temp_output_dir(tmp_path):
    """Fixture providing a temporary output directory"""
    output_dir = tmp_path / "output"
    output_dir.mkdir()
    return output_dir

# Usage in tests
def test_conversation_parser(sample_conversation):
    parser = ConversationParser()
    result = parser.parse(sample_conversation)
    assert len(result.messages) == 2

def test_csv_to_linkml(sample_csv_record, temp_output_dir):
    mapper = LinkMLMapper()
    instance = mapper.map_from_csv(sample_csv_record)

    exporter = YAMLExporter()
    exporter.export(instance, temp_output_dir / "test.yaml")

    assert (temp_output_dir / "test.yaml").exists()
```

## Configuration Patterns

### Pattern 13: Configuration Object

**Intent**: Centralized, typed configuration

**Structure**:

```python
from pathlib import Path
from typing import List

from pydantic import BaseSettings  # pydantic v1; in v2, BaseSettings moved to the pydantic-settings package

class ExtractionConfig(BaseSettings):
    """Configuration for the extraction pipeline"""

    # NLP settings
    spacy_model: str = "en_core_web_lg"
    use_transformers: bool = False
    transformer_model: str = "dslim/bert-base-NER"

    # Web crawling
    max_concurrent_requests: int = 10
    request_timeout: int = 30
    user_agent: str = "GLAM-Dataset-Extractor/0.1.0"

    # Geocoding
    enable_geocoding: bool = True
    geocoding_provider: str = "nominatim"

    # Enrichment
    enable_wikidata_linking: bool = True
    enable_viaf_linking: bool = False

    # Output
    output_formats: List[str] = ["rdf", "csv", "jsonld"]
    output_directory: Path = Path("output/")

    # Performance
    batch_size: int = 100
    max_workers: int = 4

    class Config:
        env_file = ".env"
        env_prefix = "GLAM_"

# Usage
config = ExtractionConfig()  # Loads from the environment

# Override specific values
config = ExtractionConfig(
    spacy_model="nl_core_news_lg",  # Dutch model
    enable_geocoding=False
)
```

**Benefits**:

- Type-safe configuration
- Environment variable support
- Validation
- Self-documenting

## Summary

These design patterns provide:

1. **Modularity**: Components can be developed and tested independently
2. **Flexibility**: Easy to swap implementations (Strategy, Adapter)
3. **Maintainability**: Clear structure and responsibilities
4. **Extensibility**: New features can be added without modifying existing code
5. **Testability**: Patterns like Repository and Fixture make testing easier
6. **Observability**: The Observer pattern enables monitoring
7. **Error Handling**: The Result pattern makes errors explicit
8. **Data Quality**: Value Objects and Entities ensure valid state

The combination of these patterns creates a robust, maintainable system for processing diverse data sources into a unified heritage dataset.