GLAMORCUBEPSXHF Multilingual Vocabulary - Design Patterns
Version: 1.0
Date: 2025-11-11
Status: Planning Phase (Phase 0)
Dependencies: 01-architecture.md, 00-MASTER_CHECKLIST.md
Table of Contents
- Overview
- Multilingual Data Patterns
- Script Handling Patterns
- Matching and Disambiguation Patterns
- Error Handling Patterns
- Testing Patterns
- Data Validation Patterns
- Provenance Tracking Patterns
Overview
This document defines reusable design patterns for building the GLAMORCUBEPSXHF multilingual vocabulary system. These patterns address common challenges in multilingual NLP, Unicode handling, fuzzy matching, and semantic disambiguation.
Design Principles
- Language Neutrality: No hardcoded language assumptions (avoid "English-first" bias)
- Script Awareness: Treat Latin, Cyrillic, Arabic, CJK, etc. as first-class citizens
- Provenance First: Every term carries source metadata for verification
- Fail Gracefully: Degrade to fallback behavior rather than crashing
- Test-Driven: Validate patterns with real multilingual test cases
Multilingual Data Patterns
Pattern 1: Language-Tagged Text
Problem: Store terms with their language codes without losing metadata
Solution: Use BCP 47 language tags with ISO 639-3 codes
from dataclasses import dataclass
from typing import Optional
@dataclass
class LanguageTaggedText:
"""Text with explicit language tagging."""
text: str
language: str # ISO 639-3 code (e.g., "eng", "ara", "ind")
script: Optional[str] = None # ISO 15924 code (e.g., "Latn", "Arab", "Hans")
region: Optional[str] = None # ISO 3166-1 alpha-2 (e.g., "US", "SA", "ID")
@property
def bcp47_tag(self) -> str:
"""Generate BCP 47 language tag."""
tag = self.language
if self.script:
tag += f"-{self.script}"
if self.region:
tag += f"-{self.region}"
return tag
def __repr__(self):
return f'"{self.text}"@{self.bcp47_tag}'
# Example usage
terms = [
LanguageTaggedText("temple", "eng", "Latn"),
LanguageTaggedText("معبد", "ara", "Arab"),
LanguageTaggedText("寺", "jpn", "Hani"), # Japanese kanji
LanguageTaggedText("keramat", "ind", "Latn", "ID"), # Indonesian in Indonesia
LanguageTaggedText("kramat", "msa", "Latn", "MY"), # Malay in Malaysia
]
for term in terms:
print(term) # "temple"@eng-Latn, "معبد"@ara-Arab, etc.
Benefits:
- ✅ Unambiguous language identification
- ✅ Supports regional variants (Indonesian vs. Malaysian Malay)
- ✅ Script-aware (Serbian Cyrillic vs. Latin)
- ✅ Compatible with RDF language tags ("معبد"@ar)
When to Use:
- Storing terms in JSON/YAML
- RDF/JSON-LD serialization
- Multilingual UI display
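As a sketch of the RDF/JSON-LD case, a language-tagged string can be emitted as a JSON-LD value object keyed by `@value` and `@language`; the `to_jsonld` helper below is illustrative and not part of the dataclass above.

```python
def to_jsonld(tagged: LanguageTaggedText) -> dict:
    """Render a language-tagged string as a JSON-LD value object."""
    return {"@value": tagged.text, "@language": tagged.bcp47_tag}

print(to_jsonld(LanguageTaggedText("معبد", "ara", "Arab")))
# {'@value': 'معبد', '@language': 'ara-Arab'}
```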
Pattern 2: Multilingual Dictionary
Problem: Efficiently look up terms across multiple languages
Solution: Use nested dictionaries with language keys
from typing import Dict, List, Optional
from collections import defaultdict
class MultilingualDictionary:
"""Dictionary supporting multilingual key-value lookups."""
def __init__(self):
# Structure: {language: {normalized_term: [metadata]}}
self.data: Dict[str, Dict[str, List[Dict]]] = defaultdict(lambda: defaultdict(list))
def add_term(
self,
term: str,
language: str,
metadata: Dict
):
"""Add a term with metadata."""
normalized = self._normalize(term, language)
self.data[language][normalized].append(metadata)
def lookup(
self,
term: str,
language: str
) -> List[Dict]:
"""Look up term in specific language."""
normalized = self._normalize(term, language)
return self.data[language].get(normalized, [])
def lookup_all_languages(
self,
term: str
) -> Dict[str, List[Dict]]:
"""Look up term across all languages."""
results = {}
for lang in self.data:
matches = self.lookup(term, lang)
if matches:
results[lang] = matches
return results
def _normalize(self, term: str, language: str) -> str:
"""Language-aware normalization."""
import unicodedata
# NFC normalization (canonical composition)
normalized = unicodedata.normalize('NFC', term)
# Lowercase only for case-insensitive scripts
if self._is_case_insensitive_script(language):
normalized = normalized.lower()
return normalized
def _is_case_insensitive_script(self, language: str) -> bool:
"""Check whether the language's script has a case distinction that should be folded."""
# Latin, Cyrillic, and Greek scripts distinguish case, so terms are lowercased for matching
# Arabic, Hebrew, and CJK scripts have no case distinction, so no folding is applied
case_insensitive_langs = {'eng', 'fra', 'deu', 'rus', 'pol', 'ces', ...} # abbreviated list
return language in case_insensitive_langs
# Example usage
vocab = MultilingualDictionary()
vocab.add_term("temple", "eng", {
"qid": "Q44539",
"class": "H",
"confidence": 0.95
})
vocab.add_term("معبد", "ara", {
"qid": "Q44539",
"class": "H",
"confidence": 0.95
})
vocab.add_term("keramat", "ind", {
"class": "H",
"confidence": 0.75,
"source": "exa"
})
# Lookup
print(vocab.lookup("temple", "eng")) # [{qid: Q44539, ...}]
print(vocab.lookup("TEMPLE", "eng")) # Same (case-insensitive)
print(vocab.lookup("معبد", "ara")) # [{qid: Q44539, ...}]
print(vocab.lookup_all_languages("keramat")) # {ind: [{class: H, ...}]}
Benefits:
- ✅ O(1) lookup per language
- ✅ Language-aware normalization
- ✅ Supports multiple matches per term (disambiguation)
When to Use:
- In-memory vocabulary cache
- Real-time NLP extraction
- Testing term coverage
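For the in-memory cache use case, the dictionary is typically populated once at startup from a vocabulary export. The loader below is a minimal sketch that assumes a hypothetical JSON layout (a flat list of term records); the file format and helper name are illustrative only.

```python
import json
from pathlib import Path

def load_vocabulary_cache(path: Path) -> MultilingualDictionary:
    """Build the in-memory cache from a JSON export.

    Assumes a hypothetical layout: a flat list of records such as
    {"term": "keramat", "language": "ind", "class": "H", "confidence": 0.75}.
    """
    vocab = MultilingualDictionary()
    for record in json.loads(path.read_text(encoding="utf-8")):
        # Everything except the term and language code becomes lookup metadata
        metadata = {k: v for k, v in record.items() if k not in ("term", "language")}
        vocab.add_term(record["term"], record["language"], metadata)
    return vocab
```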
Pattern 3: Translation Equivalence
Problem: Link terms that represent the same concept across languages
Solution: Use Wikidata Q-numbers as universal identifiers
from typing import Dict, List, Set
from dataclasses import dataclass
@dataclass
class ConceptCluster:
"""Group of terms representing the same concept."""
qid: str # Wikidata Q-number (universal identifier)
labels: Dict[str, str] # language → term
glamorcubepsxhf_type: str
confidence: float
def get_translations(self, term: str, source_lang: str) -> Dict[str, str]:
"""Get translations of a term into other languages."""
if source_lang not in self.labels or self.labels[source_lang] != term:
return {}
# Return all other languages
return {lang: label for lang, label in self.labels.items() if lang != source_lang}
# Example: "temple" concept cluster
temple_cluster = ConceptCluster(
qid="Q44539",
labels={
"eng": "temple",
"fra": "temple",
"deu": "Tempel",
"spa": "templo",
"ara": "معبد",
"jpn": "寺",
"zho": "寺庙",
"hin": "मंदिर",
"ind": "pura",
"tha": "วัด"
},
glamorcubepsxhf_type="H",
confidence=0.95
)
# Get translations
translations = temple_cluster.get_translations("temple", "eng")
print(translations)
# {
# "fra": "temple",
# "deu": "Tempel",
# "ara": "معبد",
# "jpn": "寺",
# ...
# }
Benefits:
- ✅ Semantic equivalence via Wikidata
- ✅ Multilingual synonym detection
- ✅ Cross-language validation
When to Use:
- Building translation dictionaries
- Validating extracted terms
- Cross-lingual search
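As a sketch of the cross-lingual search case, concept clusters can be inverted into a label-to-QID index so that a query in any language resolves to the shared concept. The function name below is illustrative, not part of the pattern's API.

```python
from collections import defaultdict
from typing import Dict, List

def build_crosslingual_index(clusters: List[ConceptCluster]) -> Dict[str, List[str]]:
    """Map every label, in any language, to the Q-numbers it may denote."""
    index: Dict[str, List[str]] = defaultdict(list)
    for cluster in clusters:
        for label in cluster.labels.values():
            index[label].append(cluster.qid)
    return index

index = build_crosslingual_index([temple_cluster])
print(index["मंदिर"])  # ['Q44539'] — the Hindi label resolves to the same concept
```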
Script Handling Patterns
Pattern 4: Script Detection
Problem: Determine which writing system a term uses (Latin, Arabic, CJK, etc.)
Solution: Use Unicode block analysis
import unicodedata
from typing import Dict, Set
from collections import Counter
class ScriptDetector:
"""Detect Unicode script(s) in text."""
# Unicode block ranges for major scripts
SCRIPT_RANGES = {
'Latin': [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
'Cyrillic': [(0x0400, 0x04FF), (0x0500, 0x052F)],
'Arabic': [(0x0600, 0x06FF), (0x0750, 0x077F), (0x08A0, 0x08FF)],
'Devanagari': [(0x0900, 0x097F)],
'Bengali': [(0x0980, 0x09FF)],
'Han': [(0x4E00, 0x9FFF), (0x3400, 0x4DBF)], # CJK Unified Ideographs
'Hiragana': [(0x3040, 0x309F)],
'Katakana': [(0x30A0, 0x30FF)],
'Hangul': [(0xAC00, 0xD7AF)],
'Thai': [(0x0E00, 0x0E7F)],
'Hebrew': [(0x0590, 0x05FF)],
'Greek': [(0x0370, 0x03FF)],
}
def detect(self, text: str) -> str:
"""Detect dominant script in text."""
scripts = self.detect_all(text)
if not scripts:
return 'Unknown'
# Return most frequent script
most_common = scripts.most_common(1)[0]
return most_common[0]
def detect_all(self, text: str) -> Counter:
"""Detect all scripts in text with character counts."""
script_counts = Counter()
for char in text:
if char.isspace() or not char.isalnum():
continue
script = self._char_to_script(char)
if script:
script_counts[script] += 1
return script_counts
def _char_to_script(self, char: str) -> str:
"""Map a character to its script."""
code_point = ord(char)
for script, ranges in self.SCRIPT_RANGES.items():
for start, end in ranges:
if start <= code_point <= end:
return script
# Fallback to Unicode character name
try:
char_name = unicodedata.name(char, '')
if 'LATIN' in char_name:
return 'Latin'
elif 'ARABIC' in char_name:
return 'Arabic'
elif 'CJK' in char_name or 'IDEOGRAPH' in char_name:
return 'Han'
except ValueError:
pass
return 'Unknown'
def is_rtl_script(self, script: str) -> bool:
"""Check if script is right-to-left."""
return script in {'Arabic', 'Hebrew'}
# Example usage
detector = ScriptDetector()
test_terms = [
"temple", # Latin
"معبد", # Arabic
"寺", # Han (CJK)
"มัสยิด", # Thai
"музей", # Cyrillic
"museum المتحف", # Mixed Latin + Arabic
]
for term in test_terms:
script = detector.detect(term)
all_scripts = detector.detect_all(term)
rtl = detector.is_rtl_script(script)
print(f"{term:20} → {script:15} (RTL: {rtl}) {dict(all_scripts)}")
# Output:
# temple → Latin (RTL: False) {'Latin': 6}
# معبد → Arabic (RTL: True) {'Arabic': 4}
# 寺 → Han (RTL: False) {'Han': 1}
# มัสยิด → Thai (RTL: False) {'Thai': 4} (combining vowel marks are not alphanumeric, so they are skipped)
# музей → Cyrillic (RTL: False) {'Cyrillic': 5}
# museum المتحف → Latin (RTL: False) {'Latin': 6, 'Arabic': 6}
Benefits:
- ✅ Accurate script detection for 12+ writing systems
- ✅ Handles mixed-script text
- ✅ RTL language detection for UI rendering
When to Use:
- Text normalization pipeline
- Choosing transliteration algorithm
- UI rendering (RTL vs. LTR)
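For the RTL/LTR rendering case, the detector output maps directly onto the HTML dir attribute; a minimal sketch (the helper name is illustrative):

```python
def html_direction(term: str, detector: ScriptDetector = ScriptDetector()) -> str:
    """Pick the HTML dir attribute value for a term based on its dominant script."""
    return "rtl" if detector.is_rtl_script(detector.detect(term)) else "ltr"

print(html_direction("معبد"))    # rtl
print(html_direction("temple"))  # ltr
```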
Pattern 5: Unicode Normalization
Problem: Equivalent strings have different byte representations (é vs. e + combining accent)
Solution: Apply NFC normalization consistently
import unicodedata
class UnicodeNormalizer:
"""Normalize Unicode text for consistent comparison."""
def normalize(self, text: str, form: str = 'NFC') -> str:
"""
Normalize text using Unicode normalization form.
Forms:
- NFC: Canonical composition (default)
- NFD: Canonical decomposition
- NFKC: Compatibility composition
- NFKD: Compatibility decomposition
"""
return unicodedata.normalize(form, text)
def are_equivalent(self, text1: str, text2: str) -> bool:
"""Check if two strings are equivalent after normalization."""
return self.normalize(text1) == self.normalize(text2)
# Example: Vietnamese text with combining diacritics
normalizer = UnicodeNormalizer()
# Two ways to write "Bảo tàng" (museum in Vietnamese)
text1 = "Bảo tàng" # Precomposed characters (NFC)
text2 = "Bảo tàng" # Combining diacritics (NFD)
print(f"Byte length: {len(text1.encode('utf-8'))} vs {len(text2.encode('utf-8'))}")
print(f"Equivalent: {normalizer.are_equivalent(text1, text2)}") # True
# After normalization
norm1 = normalizer.normalize(text1)
norm2 = normalizer.normalize(text2)
print(f"Normalized: '{norm1}' == '{norm2}': {norm1 == norm2}") # True
Benefits:
- ✅ Consistent string comparison
- ✅ Fixes diacritic encoding issues
- ✅ Database indexing compatibility
When to Use:
- Before storing terms in database
- Before fuzzy matching
- When comparing user input to vocabulary
Pattern 6: Transliteration
Problem: Enable fuzzy matching across scripts (e.g., Arabic "معبد" → "ma'bad")
Solution: Generate Latin transliterations for non-Latin scripts
from typing import Dict, Optional
import unicodedata
class Transliterator:
"""Transliterate non-Latin scripts to Latin alphabet."""
# Simple transliteration tables (production should use ICU or transliterate library)
ARABIC_TO_LATIN = {
'ا': 'a', 'ب': 'b', 'ت': 't', 'ث': 'th',
'ج': 'j', 'ح': 'h', 'خ': 'kh', 'د': 'd',
'ذ': 'dh', 'ر': 'r', 'ز': 'z', 'س': 's',
'ش': 'sh', 'ص': 's', 'ض': 'd', 'ط': 't',
'ظ': 'z', 'ع': "'", 'غ': 'gh', 'ف': 'f',
'ق': 'q', 'ك': 'k', 'ل': 'l', 'م': 'm',
'ن': 'n', 'ه': 'h', 'و': 'w', 'ي': 'y'
}
CYRILLIC_TO_LATIN = {
'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g',
'д': 'd', 'е': 'e', 'ё': 'yo', 'ж': 'zh',
'з': 'z', 'и': 'i', 'й': 'y', 'к': 'k',
'л': 'l', 'м': 'm', 'н': 'n', 'о': 'o',
'п': 'p', 'р': 'r', 'с': 's', 'т': 't',
'у': 'u', 'ф': 'f', 'х': 'h', 'ц': 'ts',
'ч': 'ch', 'ш': 'sh', 'щ': 'shch', 'ъ': '',
'ы': 'y', 'ь': '', 'э': 'e', 'ю': 'yu',
'я': 'ya'
}
def transliterate(self, text: str, script: str) -> Optional[str]:
"""
Transliterate text to Latin alphabet.
Args:
text: Input text
script: Source script (Arabic, Cyrillic, Han, etc.)
Returns:
Latin transliteration or None if not supported
"""
if script == 'Latin':
# Latin text may still carry diacritics (e.g., Vietnamese); fold them to ASCII
return self._nfkd_to_ascii(text)
if script == 'Arabic':
return self._transliterate_arabic(text)
elif script == 'Cyrillic':
return self._transliterate_cyrillic(text)
elif script == 'Han':
return self._transliterate_han(text)
else:
# Fallback: NFKD decomposition (removes diacritics)
return self._nfkd_to_ascii(text)
def _transliterate_arabic(self, text: str) -> str:
"""Transliterate Arabic to Latin."""
result = []
for char in text:
if char in self.ARABIC_TO_LATIN:
result.append(self.ARABIC_TO_LATIN[char])
elif char.isspace():
result.append(' ')
return ''.join(result)
def _transliterate_cyrillic(self, text: str) -> str:
"""Transliterate Cyrillic to Latin."""
result = []
for char in text:
lower_char = char.lower()
if lower_char in self.CYRILLIC_TO_LATIN:
latin = self.CYRILLIC_TO_LATIN[lower_char]
# Preserve case
if char.isupper():
latin = latin.capitalize()
result.append(latin)
elif char.isspace():
result.append(' ')
return ''.join(result)
def _transliterate_han(self, text: str) -> Optional[str]:
"""Transliterate CJK to Pinyin/Romaji (requires external library)."""
# Production: Use pypinyin for Chinese, pykakasi for Japanese
# Placeholder implementation
return None
def _nfkd_to_ascii(self, text: str) -> str:
"""Fallback: NFKD normalization + ASCII conversion."""
nfkd = unicodedata.normalize('NFKD', text)
ascii_text = nfkd.encode('ascii', 'ignore').decode('ascii')
return ascii_text
# Example usage
transliterator = Transliterator()
test_cases = [
("معبد", "Arabic"), # ma'bad
("музей", "Cyrillic"), # muzey
("Bảo tàng", "Latin"), # Bao tang (with NFKD fallback)
]
for text, script in test_cases:
latin = transliterator.transliterate(text, script)
print(f"{text:20} ({script:10}) → {latin}")
# Output:
# معبد (Arabic ) → m'bd
# музей (Cyrillic ) → muzey
# Bảo tàng (Latin ) → Bao tang
Benefits:
- ✅ Enables cross-script fuzzy matching
- ✅ Searchability for non-Latin terms
- ✅ Fallback for unsupported scripts
When to Use:
- Fuzzy matching fallback
- Search indexing
- User input normalization
Production Note: Use specialized libraries:
- Arabic: arabic-reshaper + ALA-LC romanization
- Cyrillic: ISO 9 standard
- Chinese: pypinyin (Mandarin Pinyin)
- Japanese: pykakasi (Hepburn romanization)
- Korean: hangul-romanize (Revised Romanization)
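A possible shape for the Han fallback, assuming pypinyin and pykakasi are installed; the standalone `transliterate_han` function here is a sketch rather than the class method above, and it requires the language to be known so it can choose between the two libraries.

```python
from pypinyin import lazy_pinyin   # Mandarin Pinyin, tone marks omitted
import pykakasi                    # Japanese romanization (Hepburn)

def transliterate_han(text: str, language: str) -> str:
    """Sketch: romanize Han-script text when the language is known."""
    if language == "zho":
        return " ".join(lazy_pinyin(text))  # e.g. 博物馆 → "bo wu guan"
    if language == "jpn":
        kks = pykakasi.kakasi()
        return " ".join(item["hepburn"] for item in kks.convert(text))
    raise ValueError(f"Unsupported Han-script language: {language}")
```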
Matching and Disambiguation Patterns
Pattern 7: Fuzzy Matching
Problem: Handle spelling variants, typos, and transliteration differences
Solution: Use Levenshtein distance with language-aware thresholds
from rapidfuzz import fuzz, process
from typing import List, Tuple, Dict, Optional
from dataclasses import dataclass
@dataclass
class FuzzyMatch:
"""Result of fuzzy matching."""
matched_term: str
query_term: str
score: float # 0.0 to 1.0
method: str # levenshtein, token_sort, partial
metadata: Dict
class FuzzyMatcher:
"""Fuzzy matching with language-aware scoring."""
# Language-specific thresholds
THRESHOLDS = {
'default': 0.85,
'ara': 0.80, # Arabic: lower threshold (transliteration variance)
'zho': 0.90, # Chinese: higher threshold (ideographs)
'jpn': 0.90, # Japanese: higher threshold
'kor': 0.90, # Korean: higher threshold
}
def match(
self,
query: str,
candidates: List[str],
language: str = 'eng',
threshold: Optional[float] = None
) -> List[FuzzyMatch]:
"""
Fuzzy match query against candidate terms.
Args:
query: Term to match
candidates: List of candidate terms
language: ISO 639-3 language code
threshold: Custom threshold (overrides language default)
Returns:
List of matches above threshold, sorted by score
"""
threshold = threshold or self.THRESHOLDS.get(language, self.THRESHOLDS['default'])
matches = []
for candidate in candidates:
# Try multiple matching strategies
scores = {
'levenshtein': fuzz.ratio(query, candidate) / 100.0,
'token_sort': fuzz.token_sort_ratio(query, candidate) / 100.0,
'partial': fuzz.partial_ratio(query, candidate) / 100.0,
}
# Use best score
best_method = max(scores, key=scores.get)
best_score = scores[best_method]
if best_score >= threshold:
matches.append(FuzzyMatch(
matched_term=candidate,
query_term=query,
score=best_score,
method=best_method,
metadata={}
))
# Sort by score descending
matches.sort(key=lambda m: m.score, reverse=True)
return matches
# Example usage
matcher = FuzzyMatcher()
# Test case: Indonesian "keramat" with spelling variants
candidates = [
"keramat",
"kramat",
"karamat",
"kermat",
"keramot",
"sacred shrine",
"makam"
]
query = "kramat"
matches = matcher.match(query, candidates, language='ind', threshold=0.80)
for match in matches:
print(f"{match.query_term} → {match.matched_term:15} "
f"(score: {match.score:.2f}, method: {match.method})")
# Output (scores are approximate):
# kramat → kramat (score: 1.00, method: levenshtein)
# kramat → keramat (score: ~0.92, method: levenshtein)
# kramat → karamat (score: ~0.92, method: levenshtein)
# kramat → kermat (score: ~0.83, method: levenshtein)
Benefits:
- ✅ Handles typos and OCR errors
- ✅ Language-specific thresholds
- ✅ Multiple matching strategies
When to Use:
- User input matching
- Cross-referencing datasets
- Spelling variant detection
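Cross-script matching combines Patterns 6 and 7: transliterate the query to Latin first, then fuzzy-match against Latin-script candidates. The candidate list below is made up for illustration; the relaxed threshold is an assumption, not a recommended default.

```python
# Sketch: match a Cyrillic query against Latin-script variants
transliterator = Transliterator()
matcher = FuzzyMatcher()

query = "музей"                                                  # Cyrillic query
latin_query = transliterator.transliterate(query, "Cyrillic")    # "muzey"
candidates = ["museum", "muzej", "musee", "makam"]               # illustrative only

for m in matcher.match(latin_query, candidates, language="rus", threshold=0.75):
    print(m.matched_term, round(m.score, 2))
# Expected to surface "muzej" above the relaxed threshold
```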
Pattern 8: Semantic Disambiguation
Problem: Ambiguous terms (e.g., "collection" could indicate a museum, library, archive, or personal collection)
Solution: Use context clues and confidence scoring
from typing import List, Tuple, Optional, Dict
from dataclasses import dataclass
@dataclass
class DisambiguatedTerm:
"""Term with disambiguated type assignments."""
term: str
language: str
primary_type: str # One GLAMORCUBEPSXHF type letter: G, L, A, M, O, R, C, U, B, E, P, S, X, H, F
primary_confidence: float
alternative_types: List[Tuple[str, float]] # [(type, confidence), ...]
disambiguation_method: str
context: Optional[str] = None
class SemanticDisambiguator:
"""Disambiguate ambiguous heritage terms using context."""
# Contextual keywords for disambiguation
CONTEXT_KEYWORDS = {
'M': {
'strong': ['exhibition', 'curator', 'artifacts', 'gallery', 'display'],
'weak': ['collection', 'objects', 'holdings']
},
'L': {
'strong': ['books', 'lending', 'catalog', 'bibliographic', 'reading room'],
'weak': ['collection', 'holdings', 'materials']
},
'A': {
'strong': ['records', 'fonds', 'provenance', 'manuscript', 'documents'],
'weak': ['collection', 'holdings', 'materials']
},
'P': {
'strong': ['private', 'personal', 'individual collector', 'family'],
'weak': ['collection']
},
'H': {
'strong': ['worship', 'sacred', 'religious', 'pilgrimage', 'prayer', 'ritual'],
'weak': ['holy', 'spiritual']
},
'S': {
'strong': ['society', 'club', 'association', 'numismatic', 'philatelic'],
'weak': ['collectors', 'members']
}
}
def disambiguate(
self,
term: str,
language: str,
candidate_types: List[Tuple[str, float]],
context: Optional[str] = None
) -> DisambiguatedTerm:
"""
Disambiguate term using context clues.
Args:
term: Ambiguous term
language: ISO 639-3 code
candidate_types: [(type, base_confidence), ...]
context: Surrounding text for context analysis
Returns:
DisambiguatedTerm with adjusted confidences
"""
if not context:
# No context: return as-is
primary = max(candidate_types, key=lambda x: x[1])
alternatives = [t for t in candidate_types if t != primary]
return DisambiguatedTerm(
term=term,
language=language,
primary_type=primary[0],
primary_confidence=primary[1],
alternative_types=alternatives,
disambiguation_method='no_context'
)
# Adjust confidences based on context
adjusted_scores = {}
for inst_type, base_conf in candidate_types:
score = base_conf
# Check for strong keywords
strong_keywords = self.CONTEXT_KEYWORDS.get(inst_type, {}).get('strong', [])
weak_keywords = self.CONTEXT_KEYWORDS.get(inst_type, {}).get('weak', [])
context_lower = context.lower()
for keyword in strong_keywords:
if keyword in context_lower:
score += 0.2 # Strong boost
for keyword in weak_keywords:
if keyword in context_lower:
score += 0.05 # Weak boost
# Cap at 1.0
adjusted_scores[inst_type] = min(score, 1.0)
# Sort by adjusted scores
sorted_types = sorted(adjusted_scores.items(), key=lambda x: x[1], reverse=True)
primary = sorted_types[0]
alternatives = sorted_types[1:]
return DisambiguatedTerm(
term=term,
language=language,
primary_type=primary[0],
primary_confidence=primary[1],
alternative_types=alternatives,
disambiguation_method='context_analysis',
context=(context[:200] + '...') if len(context) > 200 else context
)
# Example usage
disambiguator = SemanticDisambiguator()
# Ambiguous term: "collection"
candidate_types = [
('M', 0.5), # Museum
('L', 0.5), # Library
('A', 0.5), # Archive
('P', 0.6), # Personal collection
]
# Test 1: Museum context
context1 = "The museum's collection includes 5,000 artifacts from ancient Egypt, displayed in rotating exhibitions curated by expert scholars."
result1 = disambiguator.disambiguate("collection", "eng", candidate_types, context1)
print(f"Museum context: {result1.primary_type} (confidence: {result1.primary_confidence:.2f})")
# Output: Museum context: M (confidence: 1.00, capped — three strong keywords matched)
# Test 2: Personal collection context
context2 = "This private collection was assembled by an individual collector over 40 years, consisting of rare stamps from Southeast Asia."
result2 = disambiguator.disambiguate("collection", "eng", candidate_types, context2)
print(f"Personal context: {result2.primary_type} (confidence: {result2.primary_confidence:.2f})")
# Output: Personal context: P (confidence: 1.00, capped — two strong keywords matched)
# Test 3: No context
result3 = disambiguator.disambiguate("collection", "eng", candidate_types, context=None)
print(f"No context: {result3.primary_type} (confidence: {result3.primary_confidence:.2f})")
# Output: No context: P (confidence: 0.60)
Benefits:
- ✅ Context-aware type assignment
- ✅ Adjustable confidence scores
- ✅ Transparent disambiguation method
When to Use:
- Ambiguous terms (collection, center, archive)
- Terms with multiple institution types
- Low-confidence classifications
Error Handling Patterns
Pattern 9: Result Type for Errors
Problem: Avoid exceptions during batch processing; track failures gracefully
Solution: Use Result[T, E] pattern for recoverable errors
from typing import Generic, TypeVar, Union, Callable, Dict
from dataclasses import dataclass
from enum import Enum
T = TypeVar('T')
E = TypeVar('E')
@dataclass
class Ok(Generic[T]):
"""Successful result."""
value: T
def is_ok(self) -> bool:
return True
def is_err(self) -> bool:
return False
def unwrap(self) -> T:
return self.value
def map(self, func: Callable[[T], 'Result']) -> 'Result':
return func(self.value)
@dataclass
class Err(Generic[E]):
"""Error result."""
error: E
def is_ok(self) -> bool:
return False
def is_err(self) -> bool:
return True
def unwrap(self) -> E:
raise ValueError(f"Called unwrap() on Err: {self.error}")
def map(self, func: Callable) -> 'Result':
return self # Pass through error
Result = Union[Ok[T], Err[E]]
# Example: Term extraction with error handling
class ExtractionError(Enum):
INVALID_LANGUAGE_CODE = "Invalid ISO 639-3 language code"
UNSUPPORTED_SCRIPT = "Script not supported for extraction"
MALFORMED_TERM = "Term contains invalid characters"
NETWORK_ERROR = "Network request failed"
def extract_term(text: str, language: str) -> Result[Dict, ExtractionError]:
"""Extract term with error handling."""
# Validate language code
if len(language) != 3:
return Err(ExtractionError.INVALID_LANGUAGE_CODE)
# Detect script
from glam_extractor.vocab.normalization import ScriptDetector
detector = ScriptDetector()
script = detector.detect(text)
if script == 'Unknown':
return Err(ExtractionError.UNSUPPORTED_SCRIPT)
# Extract term
try:
term_data = {
'term': text,
'language': language,
'script': script
}
return Ok(term_data)
except Exception as e:
return Err(ExtractionError.MALFORMED_TERM)
# Example usage: Batch processing
terms = [
("temple", "eng"),
("معبد", "ara"),
("invalid", "x"), # Invalid language code
("寺", "jpn"),
]
results = [extract_term(text, lang) for text, lang in terms]
# Process results
successful = [r.unwrap() for r in results if r.is_ok()]
failed = [(terms[i], r.error) for i, r in enumerate(results) if r.is_err()]
print(f"Successful: {len(successful)}")
print(f"Failed: {len(failed)}")
for (text, lang), error in failed:
print(f" - '{text}' ({lang}): {error.value}")
# Output:
# Successful: 3
# Failed: 1
# - 'invalid' (x): Invalid ISO 639-3 language code
Benefits:
- ✅ No exceptions during batch processing
- ✅ Explicit error types
- ✅ Composable error handling
When to Use:
- Batch term extraction
- Network requests (Wikidata, Exa)
- File parsing
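For the network-request case, the same Result type can wrap a Wikidata label lookup so batch jobs keep going when the network fails. This is a minimal sketch using requests and the standard wbgetentities call; the timeout is illustrative and the sketch assumes the QID exists.

```python
import requests

def fetch_wikidata_labels(qid: str) -> Result[Dict, ExtractionError]:
    """Return the labels for a Wikidata entity, or a typed error on network failure."""
    try:
        resp = requests.get(
            "https://www.wikidata.org/w/api.php",
            params={"action": "wbgetentities", "ids": qid,
                    "props": "labels", "format": "json"},
            timeout=10,
        )
        resp.raise_for_status()
        return Ok(resp.json()["entities"][qid]["labels"])
    except requests.RequestException:
        return Err(ExtractionError.NETWORK_ERROR)

result = fetch_wikidata_labels("Q44539")
if result.is_ok():
    labels = result.unwrap()
```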
Pattern 10: Logging with Provenance
Problem: Debug extraction issues; trace where terms came from
Solution: Structured logging with provenance metadata
import logging
import json
from datetime import datetime, timezone
from typing import Dict, Any
class ProvenanceLogger:
"""Logger with provenance tracking for vocabulary extraction."""
def __init__(self, name: str):
self.logger = logging.getLogger(name)
self.logger.setLevel(logging.INFO)
# JSON formatter for structured logs
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(message)s'))
self.logger.addHandler(handler)
def log_extraction(
self,
term: str,
language: str,
source: str,
metadata: Dict[str, Any]
):
"""Log term extraction with provenance."""
log_entry = {
'timestamp': datetime.now(timezone.utc).isoformat(),
'event': 'term_extracted',
'term': term,
'language': language,
'source': source,
'metadata': metadata
}
self.logger.info(json.dumps(log_entry))
def log_error(
self,
term: str,
error_type: str,
error_message: str,
context: Dict[str, Any]
):
"""Log extraction error."""
log_entry = {
'timestamp': datetime.now(timezone.utc).isoformat(),
'event': 'extraction_error',
'term': term,
'error_type': error_type,
'error_message': error_message,
'context': context
}
self.logger.error(json.dumps(log_entry))
# Example usage
logger = ProvenanceLogger('vocab.extractor')
# Log successful extraction
logger.log_extraction(
term="keramat",
language="ind",
source="exa",
metadata={
'url': 'https://example.com/keramat-malaysia',
'confidence': 0.75,
'context': 'sacred shrine in Malay culture'
}
)
# Log error
logger.log_error(
term="invalid_term",
error_type="UnsupportedScript",
error_message="Script detection failed",
context={'language': 'xxx', 'script': 'Unknown'}
)
# JSON output (structured logging):
# {"timestamp": "2025-11-11T...", "event": "term_extracted", "term": "keramat", ...}
# {"timestamp": "2025-11-11T...", "event": "extraction_error", "term": "invalid_term", ...}
Benefits:
- ✅ Structured JSON logs
- ✅ Provenance tracking
- ✅ Easy filtering and analysis
When to Use:
- Production extraction pipelines
- Debugging NLP models
- Audit trails
Testing Patterns
Pattern 11: Multilingual Test Cases
Problem: Ensure code works across all scripts and languages
Solution: Comprehensive test fixtures with real multilingual data
import pytest
from typing import List, Dict
@pytest.fixture
def multilingual_test_terms() -> List[Dict]:
"""Fixture with terms in 20+ languages across 10+ scripts."""
return [
# Latin script
{"term": "museum", "language": "eng", "script": "Latin", "type": "M"},
{"term": "musée", "language": "fra", "script": "Latin", "type": "M"},
{"term": "biblioteca", "language": "spa", "script": "Latin", "type": "L"},
{"term": "arquivo", "language": "por", "script": "Latin", "type": "A"},
{"term": "keramat", "language": "ind", "script": "Latin", "type": "H"},
# Arabic script
{"term": "متحف", "language": "ara", "script": "Arabic", "type": "M"},
{"term": "مكتبة", "language": "ara", "script": "Arabic", "type": "L"},
{"term": "مسجد", "language": "ara", "script": "Arabic", "type": "H"},
# Cyrillic script
{"term": "музей", "language": "rus", "script": "Cyrillic", "type": "M"},
{"term": "библиотека", "language": "rus", "script": "Cyrillic", "type": "L"},
# Han script (CJK)
{"term": "博物馆", "language": "zho", "script": "Han", "type": "M"},
{"term": "图书馆", "language": "zho", "script": "Han", "type": "L"},
{"term": "寺", "language": "jpn", "script": "Han", "type": "H"},
# Devanagari script
{"term": "संग्रहालय", "language": "hin", "script": "Devanagari", "type": "M"},
{"term": "पुस्तकालय", "language": "hin", "script": "Devanagari", "type": "L"},
{"term": "मंदिर", "language": "hin", "script": "Devanagari", "type": "H"},
# Thai script
{"term": "พิพิธภัณฑ์", "language": "tha", "script": "Thai", "type": "M"},
{"term": "วัด", "language": "tha", "script": "Thai", "type": "H"},
# Hangul script
{"term": "박물관", "language": "kor", "script": "Hangul", "type": "M"},
{"term": "도서관", "language": "kor", "script": "Hangul", "type": "L"},
]
def test_script_detection(multilingual_test_terms):
"""Test script detection across 10+ scripts."""
from glam_extractor.vocab.normalization import ScriptDetector
detector = ScriptDetector()
for test_case in multilingual_test_terms:
detected_script = detector.detect(test_case['term'])
expected_script = test_case['script']
assert detected_script == expected_script, \
f"Script detection failed for '{test_case['term']}': " \
f"expected {expected_script}, got {detected_script}"
def test_unicode_normalization(multilingual_test_terms):
"""Test Unicode NFC normalization."""
from glam_extractor.vocab.normalization import UnicodeNormalizer
normalizer = UnicodeNormalizer()
for test_case in multilingual_test_terms:
term = test_case['term']
normalized = normalizer.normalize(term)
# Check NFC normalization (idempotent)
double_normalized = normalizer.normalize(normalized)
assert normalized == double_normalized, \
f"NFC normalization not idempotent for '{term}'"
def test_fuzzy_matching_multilingual(multilingual_test_terms):
"""Test fuzzy matching across languages."""
from glam_extractor.vocab.matching import FuzzyMatcher
matcher = FuzzyMatcher()
# Test case: Match "museo" against Spanish terms
candidates = [tc['term'] for tc in multilingual_test_terms if tc['language'] == 'spa']
if candidates:
matches = matcher.match("museo", candidates, language='spa')
# Should match "musée" with high score
# (Note: this is a cross-language test, adjust expectations)
Benefits:
- ✅ Real-world multilingual coverage
- ✅ Comprehensive script testing
- ✅ Reusable test fixtures
When to Use:
- Unit tests for all components
- Integration tests for extraction pipeline
- Regression testing
Data Validation Patterns
Pattern 12: Schema Validation
Problem: Ensure extracted terms conform to schema before storage
Solution: Use Pydantic for runtime validation
from pydantic import BaseModel, Field, validator
from typing import List, Optional, Dict
from datetime import datetime
from enum import Enum
class Script(str, Enum):
"""Unicode scripts."""
LATIN = "Latin"
CYRILLIC = "Cyrillic"
ARABIC = "Arabic"
HAN = "Han"
DEVANAGARI = "Devanagari"
THAI = "Thai"
HANGUL = "Hangul"
class ProvenanceSource(str, Enum):
"""Data sources."""
WIKIDATA = "wikidata"
EXA = "exa"
MANUAL = "manual"
class VocabularyTerm(BaseModel):
"""Validated vocabulary term."""
term_id: str = Field(..., regex=r'^[A-Z]-[a-z]{3}-[\w-]+-\d{3}$')
term: str = Field(..., min_length=1, max_length=200)
language: str = Field(..., regex=r'^[a-z]{3}$') # ISO 639-3
script: Script
glamorcubepsxhf_type: str = Field(..., regex=r'^[GLAMORCUBEPSXHF]$')
confidence: float = Field(..., ge=0.0, le=1.0)
provenance_source: ProvenanceSource
provenance_id: Optional[str] = None # Q-number or URL
extraction_date: datetime
transliteration: Optional[str] = None
definition: Optional[str] = None
variants: List[str] = []
@validator('term')
def term_not_empty(cls, v):
"""Validate term is not empty or whitespace."""
if not v.strip():
raise ValueError("Term cannot be empty")
return v.strip()
@validator('provenance_id')
def validate_provenance_id(cls, v, values):
"""Validate provenance ID matches source."""
if v is None:
return v
source = values.get('provenance_source')
if source == ProvenanceSource.WIKIDATA:
if not v.startswith('Q'):
raise ValueError(f"Wikidata ID must start with 'Q', got: {v}")
elif source == ProvenanceSource.EXA:
if not v.startswith('http'):
raise ValueError(f"Exa provenance must be URL, got: {v}")
return v
class Config:
json_encoders = {
datetime: lambda v: v.isoformat()
}
# Example usage
try:
term = VocabularyTerm(
term_id="H-ind-keramat-001",
term="keramat",
language="ind",
script=Script.LATIN,
glamorcubepsxhf_type="H",
confidence=0.75,
provenance_source=ProvenanceSource.EXA,
provenance_id="https://example.com/keramat",
extraction_date=datetime.now(),
variants=["kramat", "karamat"]
)
print(f"Valid term: {term.term_id}")
except ValueError as e:
print(f"Validation error: {e}")
# Invalid example
try:
invalid_term = VocabularyTerm(
term_id="INVALID", # Wrong format
term="", # Empty term
language="english", # Should be 3-letter code
script=Script.LATIN,
glamorcubepsxhf_type="Z", # Invalid type
confidence=1.5, # > 1.0
provenance_source=ProvenanceSource.WIKIDATA,
provenance_id="invalid", # Should start with Q
extraction_date=datetime.now()
)
except ValueError as e:
print(f"Validation caught errors: {e}")
Benefits:
- ✅ Runtime type checking
- ✅ Clear validation errors
- ✅ Self-documenting schema
When to Use:
- Before database insertion
- API input validation
- Data import/export
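For the import/export case, validation is usually run over a whole batch, keeping valid terms and recording why the rest were rejected. A minimal sketch (the helper name is illustrative, and the records are assumed to be dicts whose keys match the model's fields):

```python
from typing import Dict, List, Tuple
from pydantic import ValidationError

def validate_batch(records: List[Dict]) -> Tuple[List[VocabularyTerm], List[Tuple[Dict, str]]]:
    """Split an import batch into validated terms and rejected records with reasons."""
    valid, rejected = [], []
    for record in records:
        try:
            valid.append(VocabularyTerm(**record))  # field names must match the model
        except ValidationError as exc:
            rejected.append((record, str(exc)))
    return valid, rejected
```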
Provenance Tracking Patterns
Pattern 13: Provenance Chains
Problem: Track how a term was discovered and processed
Solution: Build provenance chains showing transformation history
from dataclasses import dataclass, field
from typing import List, Optional
from datetime import datetime, timezone
from enum import Enum
class ProvenanceAction(str, Enum):
"""Actions in provenance chain."""
EXTRACTED = "extracted"
NORMALIZED = "normalized"
TRANSLITERATED = "transliterated"
CLASSIFIED = "classified"
DISAMBIGUATED = "disambiguated"
VALIDATED = "validated"
@dataclass
class ProvenanceStep:
"""Single step in provenance chain."""
action: ProvenanceAction
timestamp: datetime
agent: str # Component that performed action
input_value: Optional[str] = None
output_value: Optional[str] = None
metadata: dict = field(default_factory=dict)
@dataclass
class ProvenanceChain:
"""Complete provenance chain for a term."""
term_id: str
steps: List[ProvenanceStep] = field(default_factory=list)
def add_step(
self,
action: ProvenanceAction,
agent: str,
input_value: Optional[str] = None,
output_value: Optional[str] = None,
**metadata
):
"""Add a step to provenance chain."""
step = ProvenanceStep(
action=action,
timestamp=datetime.now(timezone.utc),
agent=agent,
input_value=input_value,
output_value=output_value,
metadata=metadata
)
self.steps.append(step)
def to_dict(self) -> dict:
"""Serialize provenance chain."""
return {
'term_id': self.term_id,
'steps': [
{
'action': step.action.value,
'timestamp': step.timestamp.isoformat(),
'agent': step.agent,
'input': step.input_value,
'output': step.output_value,
'metadata': step.metadata
}
for step in self.steps
]
}
# Example: Track term extraction and processing
provenance = ProvenanceChain(term_id="H-ind-keramat-001")
# Step 1: Extraction from Exa
provenance.add_step(
action=ProvenanceAction.EXTRACTED,
agent="EXaResearchEngine",
output_value="keramat",
url="https://example.com/keramat",
confidence=0.75
)
# Step 2: Normalization
provenance.add_step(
action=ProvenanceAction.NORMALIZED,
agent="UnicodeNormalizer",
input_value="keramat",
output_value="keramat", # No change (already NFC)
normalization_form="NFC"
)
# Step 3: Classification
provenance.add_step(
action=ProvenanceAction.CLASSIFIED,
agent="ClassificationEngine",
input_value="keramat",
output_value="H",
confidence=0.75,
method="context_analysis"
)
# Step 4: Disambiguation
provenance.add_step(
action=ProvenanceAction.DISAMBIGUATED,
agent="SemanticDisambiguator",
input_value="H",
output_value="H",
alternative_types=[],
method="no_ambiguity"
)
# Step 5: Validation
provenance.add_step(
action=ProvenanceAction.VALIDATED,
agent="SchemaValidator",
input_value="keramat",
output_value="keramat",
valid=True
)
# Export provenance chain
import json
print(json.dumps(provenance.to_dict(), indent=2))
Benefits:
- ✅ Complete audit trail
- ✅ Reproducibility
- ✅ Debugging support
When to Use:
- Production extraction pipelines
- Data quality audits
- Research reproducibility
Summary
This document defines 13 core design patterns for building the GLAMORCUBEPSXHF multilingual vocabulary system:
Multilingual Data Patterns
- Language-Tagged Text - BCP 47 language tagging
- Multilingual Dictionary - Efficient cross-language lookups
- Translation Equivalence - Wikidata-based concept clusters
Script Handling Patterns
- Script Detection - Unicode block analysis for 12+ scripts
- Unicode Normalization - NFC normalization for consistency
- Transliteration - Latin transliterations for fuzzy matching
Matching and Disambiguation Patterns
- Fuzzy Matching - Levenshtein distance with language thresholds
- Semantic Disambiguation - Context-aware type assignment
Error Handling Patterns
- Result Type for Errors - Recoverable error handling
- Logging with Provenance - Structured logging with metadata
Testing Patterns
- Multilingual Test Cases - Comprehensive test fixtures
Data Validation Patterns
- Schema Validation - Pydantic runtime validation
Provenance Tracking Patterns
- Provenance Chains - Complete transformation history
Next Steps:
- Implement patterns in src/glam_extractor/vocab/ modules
- Create unit tests validating each pattern
- Document pattern usage in component code
See Also:
- 01-architecture.md - System architecture
- 03-mcp-tools.md - MCP tool configuration (to be created)
- 04-sparql-templates.md - SPARQL query templates (to be created)
Version: 1.0
Status: ✅ Design patterns complete
Pattern Count: 13 core patterns across 7 categories