# Transliteration Standards for Non-Latin Scripts **Rule ID**: TRANSLIT-ISO **Status**: MANDATORY **Applies To**: GHCID abbreviation generation from emic names in non-Latin scripts **Created**: 2025-12-08 --- ## Summary **When generating GHCID abbreviations from institution names written in non-Latin scripts, the emic name MUST first be transliterated to Latin characters using the designated ISO or recognized standard for that script.** This rule affects **170 institutions** across **21 languages** with non-Latin writing systems. ### Key Principles 1. **Emic name is preserved** - The original script is stored in `custodian_name.emic_name` 2. **Transliteration is for processing only** - Used to generate abbreviations 3. **ISO/recognized standards required** - No ad-hoc romanization 4. **Deterministic output** - Same input always produces same Latin output 5. **Existing GHCIDs grandfathered** - Only applies to NEW custodians --- ## Transliteration Standards by Script/Language ### Cyrillic Scripts | Language | ISO Code | Standard | Library/Tool | Notes | |----------|----------|----------|--------------|-------| | **Russian** | ru | ISO 9:1995 | `transliterate` | Scientific transliteration | | **Ukrainian** | uk | ISO 9:1995 | `transliterate` | Includes Ukrainian-specific letters | | **Bulgarian** | bg | ISO 9:1995 | `transliterate` | Uses same Cyrillic base | | **Serbian** | sr | ISO 9:1995 | `transliterate` | Serbian Cyrillic variant | | **Kazakh** | kk | ISO 9:1995 | `transliterate` | Cyrillic-based (pre-2023) | **ISO 9:1995 Mapping (Core Characters)**: | Cyrillic | Latin | Cyrillic | Latin | |----------|-------|----------|-------| | А а | A a | П п | P p | | Б б | B b | Р р | R r | | В в | V v | С с | S s | | Г г | G g | Т т | T t | | Д д | D d | У у | U u | | Е е | E e | Ф ф | F f | | Ё ё | Ë ë | Х х | H h | | Ж ж | Ž ž | Ц ц | C c | | З з | Z z | Ч ч | Č č | | И и | I i | Ш ш | Š š | | Й й | J j | Щ щ | Ŝ ŝ | | К к | K k | Ъ ъ | ʺ (hard sign) | | Л л | L l | Ы ы 
| Y y | | М м | Ь ь | ʹ (soft sign) | | Н н | N n | Э э | È è | | О о | O o | Ю ю | Û û | | | | Я я | Â â | **Example**: ``` Input: Институт восточных рукописей РАН ISO 9: Institut vostočnyh rukopisej RAN Abbrev: IVRRAN → IVRRAN (after diacritic normalization) ``` --- ### CJK Scripts #### Chinese (Hanzi) | Variant | Standard | Library/Tool | Notes | |---------|----------|--------------|-------| | Simplified | Hanyu Pinyin (ISO 7098) | `pypinyin` | Standard PRC romanization | | Traditional | Hanyu Pinyin | `pypinyin` | Same standard applies | **Pinyin Rules**: - Tone marks are OMITTED for abbreviation (diacritics removed anyway) - Word boundaries follow natural spacing - Proper nouns capitalized **Example**: ``` Input: 东巴文化博物院 Pinyin: Dōngbā Wénhuà Bówùyuàn ASCII: Dongba Wenhua Bowuyuan Abbrev: DWB ``` #### Japanese (Kanji/Kana) | Standard | Library/Tool | Notes | |----------|--------------|-------| | Modified Hepburn | `pykakasi`, `romkan` | Most widely used internationally | **Hepburn Rules**: - Long vowels: ō, ū (normalized to o, u for abbreviation) - Particles: は (wa), を (o), へ (e) - Syllabic n: ん = n (before vowels: n') **Example**: ``` Input: 国立中央博物館 Romaji: Kokuritsu Chūō Hakubutsukan ASCII: Kokuritsu Chuo Hakubutsukan Abbrev: KCH ``` #### Korean (Hangul) | Standard | Library/Tool | Notes | |----------|--------------|-------| | Revised Romanization (RR) | `korean-romanizer`, `hangul-romanize` | Official South Korean standard (2000) | **RR Rules**: - No diacritics (unlike McCune-Reischauer) - Consonant assimilation reflected in spelling - Word boundaries at natural breaks **Example**: ``` Input: 독립기념관 RR: Dongnip Ginyeomgwan Abbrev: DG ``` --- ### Arabic Script | Language | ISO Code | Standard | Library/Tool | Notes | |----------|----------|----------|--------------|-------| | **Arabic** | ar | ISO 233-2:1993 | `arabic-transliteration` | Simplified standard | | **Persian/Farsi** | fa | ISO 233-3:1999 | `persian-transliteration` | Persian extensions | | 
**Urdu** | ur | ISO 233-3 + Urdu extensions | `urdu-transliteration` | Additional characters | **ISO 233 Mapping (Core Arabic)**: | Arabic | Name | Latin | |--------|------|-------| | ا | Alif | ā / a | | ب | Ba | b | | ت | Ta | t | | ث | Tha | ṯ | | ج | Jim | ǧ / j | | ح | Ha | ḥ | | خ | Kha | ḫ / kh | | د | Dal | d | | ذ | Dhal | ḏ | | ر | Ra | r | | ز | Zay | z | | س | Sin | s | | ش | Shin | š / sh | | ص | Sad | ṣ | | ض | Dad | ḍ | | ط | Ta | ṭ | | ظ | Za | ẓ | | ع | Ayn | ʿ | | غ | Ghayn | ġ / gh | | ف | Fa | f | | ق | Qaf | q | | ك | Kaf | k | | ل | Lam | l | | م | Mim | m | | ن | Nun | n | | ه | Ha | h | | و | Waw | w / ū | | ي | Ya | y / ī | **Example (Arabic)**: ``` Input: المكتبة الوطنية للمملكة المغربية ISO: al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya ASCII: al-Maktaba al-Wataniya lil-Mamlaka al-Maghribiya Abbrev: MWMM (skip "al-" articles) ``` **Example (Persian)**: ``` Input: وزارت امور خارجه ایران ISO: Vezārat-e Omur-e Khāreǧe-ye Īrān ASCII: Vezarat-e Omur-e Khareje-ye Iran Abbrev: VOKI (skip "e" connector) ``` --- ### Hebrew Script | Standard | Library/Tool | Notes | |----------|--------------|-------| | ISO 259-3:1999 | `hebrew-transliteration` | Simplified romanization | **ISO 259 Mapping**: | Hebrew | Name | Latin | |--------|------|-------| | א | Aleph | ʾ / (silent) | | ב | Bet | b / v | | ג | Gimel | g | | ד | Dalet | d | | ה | He | h | | ו | Vav | v / o / u | | ז | Zayin | z | | ח | Chet | ḥ / ch | | ט | Tet | ṭ / t | | י | Yod | y / i | | כ ך | Kaf | k / kh | | ל | Lamed | l | | מ ם | Mem | m | | נ ן | Nun | n | | ס | Samekh | s | | ע | Ayin | ʿ / (silent) | | פ ף | Pe | p / f | | צ ץ | Tsade | ṣ / ts | | ק | Qof | q / k | | ר | Resh | r | | ש | Shin/Sin | š / s | | ת | Tav | t | **Example**: ``` Input: ארכיון הסיפור העממי בישראל ISO: Arḵiyon ha-Sipur ha-ʿAmami be-Yiśraʾel ASCII: Arkhiyon ha-Sipur ha-Amami be-Yisrael Abbrev: ASAY (skip "ha-" and "be-" articles) ``` --- ### Greek Script | Standard | Library/Tool | Notes | 
|----------|--------------|-------| | ISO 843:1997 | `greek-transliteration` | Romanization of Greek | **ISO 843 Mapping**: | Greek | Latin | Greek | Latin | |-------|-------|-------|-------| | Α α | A a | Ν ν | N n | | Β β | V v | Ξ ξ | X x | | Γ γ | G g | Ο ο | O o | | Δ δ | D d | Π π | P p | | Ε ε | E e | Ρ ρ | R r | | Ζ ζ | Z z | Σ σ ς | S s | | Η η | Ī ī | Τ τ | T t | | Θ θ | Th th | Υ υ | Y y | | Ι ι | I i | Φ φ | F f | | Κ κ | K k | Χ χ | Ch ch | | Λ λ | L l | Ψ ψ | Ps ps | | Μ μ | M m | Ω ω | Ō ō | **Example**: ``` Input: Αρχαιολογικό Μουσείο Θεσσαλονίκης ISO: Archaiologikó Mouseío Thessaloníkīs ASCII: Archaiologiko Mouseio Thessalonikis Abbrev: AMT ``` --- ### Indic Scripts | Language | Script | Standard | Library/Tool | |----------|--------|----------|--------------| | **Hindi** | Devanagari | ISO 15919 | `indic-transliteration` | | **Bengali** | Bengali | ISO 15919 | `indic-transliteration` | | **Nepali** | Devanagari | ISO 15919 | `indic-transliteration` | | **Sinhala** | Sinhala | ISO 15919 | `indic-transliteration` | **ISO 15919 Core Consonants (Devanagari)**: | Devanagari | Latin | Devanagari | Latin | |------------|-------|------------|-------| | क | ka | त | ta | | ख | kha | थ | tha | | ग | ga | द | da | | घ | gha | ध | dha | | ङ | ṅa | न | na | | च | ca | प | pa | | छ | cha | फ | pha | | ज | ja | ब | ba | | झ | jha | भ | bha | | ञ | ña | म | ma | | ट | ṭa | य | ya | | ठ | ṭha | र | ra | | ड | ḍa | ल | la | | ढ | ḍha | व | va | | ण | ṇa | श | śa | | | | ष | ṣa | | | | स | sa | | | | ह | ha | **Example (Hindi)**: ``` Input: राजस्थान प्राच्यविद्या प्रतिष्ठान ISO: Rājasthāna Prācyavidyā Pratiṣṭhāna ASCII: Rajasthana Pracyavidya Pratishthana Abbrev: RPP ``` --- ### Southeast Asian Scripts | Language | Script | Standard | Library/Tool | |----------|--------|----------|--------------| | **Thai** | Thai | ISO 11940-2 | `thai-romanization` | | **Khmer** | Khmer | ALA-LC | `khmer-romanization` | **Thai Example**: ``` Input: สำนักหอจดหมายเหตุแห่งชาติ ISO: 
Samnak Ho Chotmaihet Haeng Chat Abbrev: SHCHC ``` **Khmer Example**: ``` Input: សារមន្ទីរទួលស្លែង ALA-LC: Sāramanṭīr Tūl Slèṅ ASCII: Saramantir Tuol Sleng Abbrev: STS ``` --- ### Other Scripts | Language | Script | Standard | Library/Tool | |----------|--------|----------|--------------| | **Armenian** | Armenian | ISO 9985 | `armenian-transliteration` | | **Georgian** | Georgian | ISO 9984 | `georgian-transliteration` | **Armenian Example**: ``` Input: Մատենադարան ISO: Matenadaran Abbrev: M ``` **Georgian Example**: ``` Input: ხელნაწერთა ეროვნული ცენტრი ISO: Xelnawerti Erovnuli C'ent'ri ASCII: Khelnawerti Erovnuli Centri Abbrev: KEC ``` --- ## Implementation ### Python Transliteration Utility ```python #!/usr/bin/env python3 """ Transliteration utility for GHCID abbreviation generation. Uses ISO and recognized standards for each script/language. """ import unicodedata from typing import Optional # Try importing transliteration libraries try: from pypinyin import pinyin, Style HAS_PYPINYIN = True except ImportError: HAS_PYPINYIN = False try: import pykakasi HAS_PYKAKASI = True except ImportError: HAS_PYKAKASI = False try: from transliterate import translit HAS_TRANSLITERATE = True except ImportError: HAS_TRANSLITERATE = False def detect_script(text: str) -> str: """ Detect the primary script of the input text. 
Returns one of: - 'latin': Latin alphabet - 'cyrillic': Cyrillic script - 'chinese': Chinese characters (Hanzi) - 'japanese': Japanese (mixed Kanji/Kana) - 'korean': Korean Hangul - 'arabic': Arabic script (includes Persian, Urdu) - 'hebrew': Hebrew script - 'greek': Greek script - 'devanagari': Devanagari (Hindi, Nepali, Sanskrit) - 'bengali': Bengali script - 'thai': Thai script - 'armenian': Armenian script - 'georgian': Georgian script - 'unknown': Cannot determine """ script_ranges = { 'cyrillic': (0x0400, 0x04FF), 'arabic': (0x0600, 0x06FF), 'hebrew': (0x0590, 0x05FF), 'devanagari': (0x0900, 0x097F), 'bengali': (0x0980, 0x09FF), 'thai': (0x0E00, 0x0E7F), 'greek': (0x0370, 0x03FF), 'armenian': (0x0530, 0x058F), 'georgian': (0x10A0, 0x10FF), 'korean': (0xAC00, 0xD7AF), # Hangul syllables 'japanese_hiragana': (0x3040, 0x309F), 'japanese_katakana': (0x30A0, 0x30FF), 'chinese': (0x4E00, 0x9FFF), # CJK Unified Ideographs } script_counts = {script: 0 for script in script_ranges} latin_count = 0 for char in text: code = ord(char) # Check Latin if ('a' <= char <= 'z') or ('A' <= char <= 'Z'): latin_count += 1 continue # Check other scripts for script, (start, end) in script_ranges.items(): if start <= code <= end: script_counts[script] += 1 break # Determine primary script if latin_count > 0 and all(c == 0 for c in script_counts.values()): return 'latin' # Find max non-Latin script max_script = max(script_counts, key=script_counts.get) if script_counts[max_script] > 0: # Handle Japanese (can be Kanji + Kana) if max_script in ('japanese_hiragana', 'japanese_katakana', 'chinese'): if script_counts['japanese_hiragana'] > 0 or script_counts['japanese_katakana'] > 0: return 'japanese' return 'chinese' return max_script return 'latin' if latin_count > 0 else 'unknown' def transliterate_cyrillic(text: str, lang: str = 'ru') -> str: """Transliterate Cyrillic text using ISO 9.""" if HAS_TRANSLITERATE: try: return translit(text, lang, reversed=True) except Exception: pass # 
Fallback: basic Cyrillic to Latin mapping cyrillic_map = { 'А': 'A', 'Б': 'B', 'В': 'V', 'Г': 'G', 'Д': 'D', 'Е': 'E', 'Ё': 'E', 'Ж': 'Zh', 'З': 'Z', 'И': 'I', 'Й': 'Y', 'К': 'K', 'Л': 'L', 'М': 'M', 'Н': 'N', 'О': 'O', 'П': 'P', 'Р': 'R', 'С': 'S', 'Т': 'T', 'У': 'U', 'Ф': 'F', 'Х': 'Kh', 'Ц': 'Ts', 'Ч': 'Ch', 'Ш': 'Sh', 'Щ': 'Shch', 'Ъ': '', 'Ы': 'Y', 'Ь': '', 'Э': 'E', 'Ю': 'Yu', 'Я': 'Ya', 'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e', 'ё': 'e', 'ж': 'zh', 'з': 'z', 'и': 'i', 'й': 'y', 'к': 'k', 'л': 'l', 'м': 'm', 'н': 'n', 'о': 'o', 'п': 'p', 'р': 'r', 'с': 's', 'т': 't', 'у': 'u', 'ф': 'f', 'х': 'kh', 'ц': 'ts', 'ч': 'ch', 'ш': 'sh', 'щ': 'shch', 'ъ': '', 'ы': 'y', 'ь': '', 'э': 'e', 'ю': 'yu', 'я': 'ya', # Ukrainian additions 'І': 'I', 'і': 'i', 'Ї': 'Yi', 'ї': 'yi', 'Є': 'Ye', 'є': 'ye', 'Ґ': 'G', 'ґ': 'g', } return ''.join(cyrillic_map.get(c, c) for c in text) def transliterate_chinese(text: str) -> str: """Transliterate Chinese to Pinyin.""" if HAS_PYPINYIN: # Get pinyin without tone marks result = pinyin(text, style=Style.NORMAL) return ' '.join([''.join(p) for p in result]) # Fallback: return as-is (requires manual handling) return text def transliterate_japanese(text: str) -> str: """Transliterate Japanese to Romaji (Hepburn).""" if HAS_PYKAKASI: kakasi = pykakasi.kakasi() result = kakasi.convert(text) return ' '.join([item['hepburn'] for item in result]) # Fallback: return as-is return text def transliterate_korean(text: str) -> str: """Transliterate Korean Hangul to Revised Romanization.""" # Korean romanization is complex - use library if available try: from korean_romanizer.romanizer import Romanizer r = Romanizer(text) return r.romanize() except ImportError: pass # Fallback: basic Hangul syllable decomposition # This is a simplified implementation return text def transliterate_arabic(text: str) -> str: """Transliterate Arabic script to Latin (ISO 233 simplified).""" arabic_map = { 'ا': 'a', 'أ': 'a', 'إ': 'i', 'آ': 'a', 'ب': 'b', 
'ت': 't', 'ث': 'th', 'ج': 'j', 'ح': 'h', 'خ': 'kh', 'د': 'd', 'ذ': 'dh', 'ر': 'r', 'ز': 'z', 'س': 's', 'ش': 'sh', 'ص': 's', 'ض': 'd', 'ط': 't', 'ظ': 'z', 'ع': "'", 'غ': 'gh', 'ف': 'f', 'ق': 'q', 'ك': 'k', 'ل': 'l', 'م': 'm', 'ن': 'n', 'ه': 'h', 'و': 'w', 'ي': 'y', 'ى': 'a', 'ة': 'a', 'ء': "'", # Persian additions 'پ': 'p', 'چ': 'ch', 'ژ': 'zh', 'گ': 'g', 'ک': 'k', 'ی': 'i', } result = [] for c in text: if c in arabic_map: result.append(arabic_map[c]) elif c == ' ' or c.isalnum(): result.append(c) return ''.join(result) def transliterate_hebrew(text: str) -> str: """Transliterate Hebrew to Latin (ISO 259 simplified).""" hebrew_map = { 'א': '', 'ב': 'v', 'ג': 'g', 'ד': 'd', 'ה': 'h', 'ו': 'v', 'ז': 'z', 'ח': 'ch', 'ט': 't', 'י': 'y', 'כ': 'k', 'ך': 'k', 'ל': 'l', 'מ': 'm', 'ם': 'm', 'נ': 'n', 'ן': 'n', 'ס': 's', 'ע': '', 'פ': 'f', 'ף': 'f', 'צ': 'ts', 'ץ': 'ts', 'ק': 'k', 'ר': 'r', 'ש': 'sh', 'ת': 't', } result = [] for c in text: if c in hebrew_map: result.append(hebrew_map[c]) elif c == ' ' or c.isalnum(): result.append(c) return ''.join(result) def transliterate_greek(text: str) -> str: """Transliterate Greek to Latin (ISO 843).""" greek_map = { 'Α': 'A', 'α': 'a', 'Β': 'V', 'β': 'v', 'Γ': 'G', 'γ': 'g', 'Δ': 'D', 'δ': 'd', 'Ε': 'E', 'ε': 'e', 'Ζ': 'Z', 'ζ': 'z', 'Η': 'I', 'η': 'i', 'Θ': 'Th', 'θ': 'th', 'Ι': 'I', 'ι': 'i', 'Κ': 'K', 'κ': 'k', 'Λ': 'L', 'λ': 'l', 'Μ': 'M', 'μ': 'm', 'Ν': 'N', 'ν': 'n', 'Ξ': 'X', 'ξ': 'x', 'Ο': 'O', 'ο': 'o', 'Π': 'P', 'π': 'p', 'Ρ': 'R', 'ρ': 'r', 'Σ': 'S', 'σ': 's', 'ς': 's', 'Τ': 'T', 'τ': 't', 'Υ': 'Y', 'υ': 'y', 'Φ': 'F', 'φ': 'f', 'Χ': 'Ch', 'χ': 'ch', 'Ψ': 'Ps', 'ψ': 'ps', 'Ω': 'O', 'ω': 'o', } return ''.join(greek_map.get(c, c) for c in text) def transliterate_devanagari(text: str) -> str: """Transliterate Devanagari to Latin (ISO 15919 simplified).""" try: from indic_transliteration import sanscript from indic_transliteration.sanscript import transliterate as indic_translit return indic_translit(text, sanscript.DEVANAGARI, 
sanscript.IAST) except ImportError: pass # Fallback: basic mapping # This would need a full Devanagari character map return text def transliterate_thai(text: str) -> str: """Transliterate Thai to Latin (Royal Thai General System).""" try: from pythainlp.transliterate import romanize # assumes the pythainlp package is installed return romanize(text, engine='royin') # 'royin' engine: Royal Thai General System except ImportError: pass # Fallback: return as-is (requires manual handling) return text def transliterate(text: str, lang: Optional[str] = None) -> str: """ Transliterate text from non-Latin script to Latin. Args: text: Input text in any script lang: Optional ISO 639-1 language code (e.g., 'ru', 'zh', 'ko') If not provided, script is auto-detected. Returns: Transliterated text in Latin characters. """ if not text: return text # Map the language code to a script; auto-detect if no code is given or the code is unknown if lang: script_map = { 'ru': 'cyrillic', 'uk': 'cyrillic', 'bg': 'cyrillic', 'sr': 'cyrillic', 'kk': 'cyrillic', 'zh': 'chinese', 'ja': 'japanese', 'ko': 'korean', 'ar': 'arabic', 'fa': 'arabic', 'ur': 'arabic', 'he': 'hebrew', 'el': 'greek', 'hi': 'devanagari', 'ne': 'devanagari', 'bn': 'bengali', 'th': 'thai', 'hy': 'armenian', 'ka': 'georgian', } script = script_map.get(lang) or detect_script(text) else: script = detect_script(text) # Apply appropriate transliteration transliterators = { 'cyrillic': lambda t: transliterate_cyrillic(t, lang or 'ru'), 'chinese': transliterate_chinese, 'japanese': transliterate_japanese, 'korean': transliterate_korean, 'arabic': transliterate_arabic, 'hebrew': transliterate_hebrew, 'greek': transliterate_greek, 'devanagari': transliterate_devanagari, 'thai': transliterate_thai, 'latin': lambda t: t, # No transliteration needed } translit_func = transliterators.get(script, lambda t: t) # Scripts without an entry (bengali, armenian, georgian, unknown) pass through unchanged result = translit_func(text) # Normalize diacritics to ASCII normalized = unicodedata.normalize('NFD', result) ascii_result = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn') return ascii_result def transliterate_for_abbreviation(emic_name: str, lang: str) -> str: """ Transliterate emic name for 
GHCID abbreviation generation. This is the main entry point for GHCID generation scripts. Args: emic_name: Institution name in original script lang: ISO 639-1 language code Returns: Transliterated name ready for abbreviation extraction """ # Step 1: Transliterate to Latin latin = transliterate(emic_name, lang) # Step 2: Normalize diacritics (handled in transliterate()) # Step 3: Remove special characters (except spaces) import re clean = re.sub(r'[^a-zA-Z\s]', ' ', latin) # Step 4: Normalize whitespace clean = ' '.join(clean.split()) return clean # Example usage if __name__ == '__main__': test_cases = [ ('Институт восточных рукописей РАН', 'ru'), ('东巴文化博物院', 'zh'), ('독립기념관', 'ko'), ('राजस्थान प्राच्यविद्या प्रतिष्ठान', 'hi'), ('المكتبة الوطنية للمملكة المغربية', 'ar'), ('ארכיון הסיפור העממי בישראל', 'he'), ('Αρχαιολογικό Μουσείο Θεσσαλονίκης', 'el'), ] for name, lang in test_cases: result = transliterate_for_abbreviation(name, lang) print(f'{lang}: {name}') print(f' → {result}') print() ``` --- ## Skip Words by Language When extracting abbreviations from transliterated text, skip these articles/prepositions: ### Arabic - `al-` (the definite article) - `bi-`, `li-`, `fi-` (prepositions) ### Hebrew - `ha-` (the) - `ve-` (and) - `be-`, `le-`, `me-` (prepositions) ### Persian - `-e`, `-ye` (ezafe connector) - `va` (and) ### CJK Languages - No skip words (particles are integral to meaning) ### Indic Languages - `ka`, `ki`, `ke` (Hindi: of) - `aur` (Hindi: and) --- ## Validation ### Check Transliteration Output ```python def validate_transliteration(result: str) -> bool: """ Validate that transliteration output contains only ASCII letters and spaces. """ import re return bool(re.match(r'^[a-zA-Z\s]+$', result)) ``` ### Manual Review Queue Non-Latin institutions should be flagged for manual review if: 1. Transliteration library not available for that script 2. Confidence in transliteration is low 3. 
Institution has multiple official romanizations --- ## Related Documentation - `AGENTS.md` - Rule 12: Transliteration Standards - `ABBREVIATION_SPECIAL_CHAR_RULE.md` - Character filtering after transliteration - `docs/TRANSLITERATION_CONVENTIONS.md` - Extended examples and edge cases - `scripts/transliterate_emic_names.py` - Production transliteration script --- ## Changelog | Date | Change | |------|--------| | 2025-12-08 | Initial standards document created |
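
---

## Appendix: Abbreviation Extraction Sketch

The skip-word lists under "Skip Words by Language" say which tokens to ignore, but the utility above stops at producing a clean Latin string. The block below is a minimal sketch of the remaining step, not the production logic in `scripts/transliterate_emic_names.py`. The names `SKIP_WORDS` and `extract_abbreviation` are hypothetical, and the Arabic entry `lil` (the contraction of `li-` and `al-`) is an assumption added so that the Arabic example yields `MWMM`.

```python
import re

# Skip words taken from the "Skip Words by Language" section, keyed by
# ISO 639-1 code. The dictionary itself is illustrative, not part of the rule.
SKIP_WORDS = {
    "ar": {"al", "bi", "li", "fi", "lil"},  # "lil" is an assumption (li- + al-)
    "he": {"ha", "ve", "be", "le", "me"},
    "fa": {"e", "ye", "va"},
    "hi": {"ka", "ki", "ke", "aur"},
}


def extract_abbreviation(latin_name: str, lang: str) -> str:
    """Return the upper-cased initials of all words that are not skip words."""
    skip = SKIP_WORDS.get(lang, set())
    # Split on whitespace and hyphens so that "al-Maktaba" yields "al" and
    # "Maktaba"; this also handles input where hyphens were already replaced
    # by spaces.
    tokens = [t for t in re.split(r"[\s\-]+", latin_name) if t]
    return "".join(t[0].upper() for t in tokens if t.lower() not in skip)


print(extract_abbreviation("al-Maktaba al-Wataniya lil-Mamlaka al-Maghribiya", "ar"))  # MWMM
print(extract_abbreviation("Arkhiyon ha-Sipur ha-Amami be-Yisrael", "he"))  # ASAY
print(extract_abbreviation("Vezarat-e Omur-e Khareje-ye Iran", "fa"))  # VOKI
```

Languages without an entry, such as the CJK languages, simply keep every word. The sketch reduces an all-caps acronym such as `RAN` to its first letter, whereas the Cyrillic example keeps it whole (`IVRRAN`), so acronym handling would need a separate rule.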