glam/.opencode/TRANSLITERATION_STANDARDS.md
kempersc 271545fa8b docs: add Z.AI GLM API and transliteration rules to AGENTS.md
- Add Rule 11 for Z.AI Coding Plan API usage (not BigModel)
- Add transliteration standards for non-Latin scripts
- Document GLM model options and Python implementation
2025-12-08 14:58:22 +01:00


# Transliteration Standards for Non-Latin Scripts
**Rule ID**: TRANSLIT-ISO
**Status**: MANDATORY
**Applies To**: GHCID abbreviation generation from emic names in non-Latin scripts
**Created**: 2025-12-08

---
## Summary
**When generating GHCID abbreviations from institution names written in non-Latin scripts, the emic name MUST first be transliterated to Latin characters using the designated ISO or recognized standard for that script.**
This rule affects **170 institutions** across **21 languages** with non-Latin writing systems.
### Key Principles
1. **Emic name is preserved** - The original script is stored in `custodian_name.emic_name`
2. **Transliteration is for processing only** - Used to generate abbreviations
3. **ISO/recognized standards required** - No ad-hoc romanization
4. **Deterministic output** - Same input always produces same Latin output
5. **Existing GHCIDs grandfathered** - Only applies to NEW custodians
---
## Transliteration Standards by Script/Language
### Cyrillic Scripts
| Language | ISO Code | Standard | Library/Tool | Notes |
|----------|----------|----------|--------------|-------|
| **Russian** | ru | ISO 9:1995 | `transliterate` | Scientific transliteration |
| **Ukrainian** | uk | ISO 9:1995 | `transliterate` | Includes Ukrainian-specific letters |
| **Bulgarian** | bg | ISO 9:1995 | `transliterate` | Uses same Cyrillic base |
| **Serbian** | sr | ISO 9:1995 | `transliterate` | Serbian Cyrillic variant |
| **Kazakh** | kk | ISO 9:1995 | `transliterate` | Cyrillic-based (pre-2023) |
**ISO 9:1995 Mapping (Core Characters)**:
| Cyrillic | Latin | Cyrillic | Latin |
|----------|-------|----------|-------|
| А а | A a | П п | P p |
| Б б | B b | Р р | R r |
| В в | V v | С с | S s |
| Г г | G g | Т т | T t |
| Д д | D d | У у | U u |
| Е е | E e | Ф ф | F f |
| Ё ё | Ë ë | Х х | H h |
| Ж ж | Ž ž | Ц ц | C c |
| З з | Z z | Ч ч | Č č |
| И и | I i | Ш ш | Š š |
| Й й | J j | Щ щ | Ŝ ŝ |
| К к | K k | Ъ ъ | ʺ (hard sign) |
| Л л | L l | Ы ы | Y y |
| М м | M m | Ь ь | ʹ (soft sign) |
| Н н | N n | Э э | È è |
| О о | O o | Ю ю | Û û |
| | | Я я | Â â |
**Example**:
```
Input: Институт восточных рукописей РАН
ISO 9: Institut vostočnyh rukopisej RAN
Abbrev: IVRRAN → IVRRAN (after diacritic normalization)
```
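The core-character table above can be applied mechanically. The following is a minimal sketch, not the production mapping: the names `ISO9_LOWER` and `iso9_transliterate` are hypothetical, and the dictionary covers only the Russian letters listed in the table, so language-specific letters (Ukrainian, Serbian, Kazakh) pass through unchanged.

```python
# Lowercase ISO 9:1995 pairs copied from the core-character table in this document.
ISO9_LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "ë",
    "ж": "ž", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "ш": "š", "щ": "ŝ", "ъ": "ʺ",
    "ы": "y", "ь": "ʹ", "э": "è", "ю": "û", "я": "â",
}


def iso9_transliterate(text: str) -> str:
    """Map each Cyrillic letter to its ISO 9 equivalent, preserving case."""
    out = []
    for ch in text:
        latin = ISO9_LOWER.get(ch.lower())
        if latin is None:
            out.append(ch)             # not in the table: keep as-is
        elif ch.isupper():
            out.append(latin.upper())  # one-to-one mapping, so upper() is safe
        else:
            out.append(latin)
    return "".join(out)


print(iso9_transliterate("Институт восточных рукописей РАН"))
# → Institut vostočnyh rukopisej RAN
```

Because ISO 9 maps one Cyrillic letter to exactly one Latin letter, the same input always yields the same output, which is what the determinism principle requires.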
---
### CJK Scripts
#### Chinese (Hanzi)
| Variant | Standard | Library/Tool | Notes |
|---------|----------|--------------|-------|
| Simplified | Hanyu Pinyin (ISO 7098) | `pypinyin` | Standard PRC romanization |
| Traditional | Hanyu Pinyin | `pypinyin` | Same standard applies |
**Pinyin Rules**:
- Tone marks are OMITTED for abbreviation (diacritics removed anyway)
- Word boundaries follow natural spacing
- Proper nouns capitalized
**Example**:
```
Input: 东巴文化博物院
Pinyin: Dōngbā Wénhuà Bówùyuàn
ASCII: Dongba Wenhua Bowuyuan
Abbrev: DWB
```
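The step from the Pinyin line to the ASCII line is plain diacritic removal. As a rough sketch using only the standard library (the helper names `strip_diacritics` and `initials` are illustrative, not part of any existing script), it can be done with Unicode decomposition:

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop combining marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def initials(text: str) -> str:
    """Take the first letter of each space-separated word, uppercased."""
    return "".join(word[0].upper() for word in text.split())


ascii_name = strip_diacritics("Dōngbā Wénhuà Bówùyuàn")
print(ascii_name)            # → Dongba Wenhua Bowuyuan
print(initials(ascii_name))  # → DWB
```

The same helper handles Hepburn macrons, so `strip_diacritics("Kokuritsu Chūō Hakubutsukan")` gives `Kokuritsu Chuo Hakubutsukan`.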
#### Japanese (Kanji/Kana)
| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| Modified Hepburn | `pykakasi`, `romkan` | Most widely used internationally |
**Hepburn Rules**:
- Long vowels: ō, ū (normalized to o, u for abbreviation)
- Particles: は (wa), を (o), へ (e)
- Syllabic n: ん = n (written n' before vowels and y)
**Example**:
```
Input: 国立中央博物館
Romaji: Kokuritsu Chūō Hakubutsukan
ASCII: Kokuritsu Chuo Hakubutsukan
Abbrev: KCH
```
#### Korean (Hangul)
| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| Revised Romanization (RR) | `korean-romanizer`, `hangul-romanize` | Official South Korean standard (2000) |
**RR Rules**:
- No diacritics (unlike McCune-Reischauer)
- Consonant assimilation reflected in spelling
- Word boundaries at natural breaks
**Example**:
```
Input: 독립기념관
RR: Dongnip Ginyeomgwan
Abbrev: DG
```
---
### Arabic Script
| Language | ISO Code | Standard | Library/Tool | Notes |
|----------|----------|----------|--------------|-------|
| **Arabic** | ar | ISO 233-2:1993 | `arabic-transliteration` | Simplified standard |
| **Persian/Farsi** | fa | ISO 233-3:1999 | `persian-transliteration` | Persian extensions |
| **Urdu** | ur | ISO 233-3 + Urdu extensions | `urdu-transliteration` | Additional characters |
**ISO 233 Mapping (Core Arabic)**:
| Arabic | Name | Latin |
|--------|------|-------|
| ا | Alif | ā / a |
| ب | Ba | b |
| ت | Ta | t |
| ث | Tha | ṯ |
| ج | Jim | ǧ / j |
| ح | Ha | ḥ |
| خ | Kha | ḫ / kh |
| د | Dal | d |
| ذ | Dhal | ḏ |
| ر | Ra | r |
| ز | Zay | z |
| س | Sin | s |
| ش | Shin | š / sh |
| ص | Sad | ṣ |
| ض | Dad | ḍ |
| ط | Ta | ṭ |
| ظ | Za | ẓ |
| ع | Ayn | ʿ |
| غ | Ghayn | ġ / gh |
| ف | Fa | f |
| ق | Qaf | q |
| ك | Kaf | k |
| ل | Lam | l |
| م | Mim | m |
| ن | Nun | n |
| ه | Ha | h |
| و | Waw | w / ū |
| ي | Ya | y / ī |
**Example (Arabic)**:
```
Input: المكتبة الوطنية للمملكة المغربية
ISO: al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya
ASCII: al-Maktaba al-Wataniya lil-Mamlaka al-Maghribiya
Abbrev: MWMM (skip "al-" articles)
```
**Example (Persian)**:
```
Input: وزارت امور خارجه ایران
ISO: Vezārat-e Omur-e Khāreǧe-ye Īrān
ASCII: Vezarat-e Omur-e Khareje-ye Iran
Abbrev: VOKI (skip "e" connector)
```
---
### Hebrew Script
| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| ISO 259-3:1999 | `hebrew-transliteration` | Simplified romanization |
**ISO 259 Mapping**:
| Hebrew | Name | Latin |
|--------|------|-------|
| א | Aleph | ʾ / (silent) |
| ב | Bet | b / v |
| ג | Gimel | g |
| ד | Dalet | d |
| ה | He | h |
| ו | Vav | v / o / u |
| ז | Zayin | z |
| ח | Chet | ḥ / ch |
| ט | Tet | ṭ / t |
| י | Yod | y / i |
| כ ך | Kaf | k / kh |
| ל | Lamed | l |
| מ ם | Mem | m |
| נ ן | Nun | n |
| ס | Samekh | s |
| ע | Ayin | ʿ / (silent) |
| פ ף | Pe | p / f |
| צ ץ | Tsade | ṣ / ts |
| ק | Qof | q / k |
| ר | Resh | r |
| ש | Shin/Sin | š / s |
| ת | Tav | t |
**Example**:
```
Input: ארכיון הסיפור העממי בישראל
ISO: Arḵiyon ha-Sipur ha-ʿAmami be-Yiśraʾel
ASCII: Arkhiyon ha-Sipur ha-Amami be-Yisrael
Abbrev: ASAY (skip "ha-" and "be-" articles)
```
---
### Greek Script
| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| ISO 843:1997 | `greek-transliteration` | Romanization of Greek |
**ISO 843 Mapping**:
| Greek | Latin | Greek | Latin |
|-------|-------|-------|-------|
| Α α | A a | Ν ν | N n |
| Β β | V v | Ξ ξ | X x |
| Γ γ | G g | Ο ο | O o |
| Δ δ | D d | Π π | P p |
| Ε ε | E e | Ρ ρ | R r |
| Ζ ζ | Z z | Σ σ ς | S s |
| Η η | Ī ī | Τ τ | T t |
| Θ θ | Th th | Υ υ | Y y |
| Ι ι | I i | Φ φ | F f |
| Κ κ | K k | Χ χ | Ch ch |
| Λ λ | L l | Ψ ψ | Ps ps |
| Μ μ | M m | Ω ω | Ō ō |
**Example**:
```
Input: Αρχαιολογικό Μουσείο Θεσσαλονίκης
ISO: Archaiologikó Mouseío Thessaloníkīs
ASCII: Archaiologiko Mouseio Thessalonikis
Abbrev: AMT
```
---
### Indic Scripts
| Language | Script | Standard | Library/Tool |
|----------|--------|----------|--------------|
| **Hindi** | Devanagari | ISO 15919 | `indic-transliteration` |
| **Bengali** | Bengali | ISO 15919 | `indic-transliteration` |
| **Nepali** | Devanagari | ISO 15919 | `indic-transliteration` |
| **Sinhala** | Sinhala | ISO 15919 | `indic-transliteration` |
**ISO 15919 Core Consonants (Devanagari)**:
| Devanagari | Latin | Devanagari | Latin |
|------------|-------|------------|-------|
| क | ka | त | ta |
| ख | kha | थ | tha |
| ग | ga | द | da |
| घ | gha | ध | dha |
| ङ | ṅa | न | na |
| च | ca | प | pa |
| छ | cha | फ | pha |
| ज | ja | ब | ba |
| झ | jha | भ | bha |
| ञ | ña | म | ma |
| ट | ṭa | य | ya |
| ठ | ṭha | र | ra |
| ड | ḍa | ल | la |
| ढ | ḍha | व | va |
| ण | ṇa | श | śa |
| | | ष | ṣa |
| | | स | sa |
| | | ह | ha |
**Example (Hindi)**:
```
Input: राजस्थान प्राच्यविद्या प्रतिष्ठान
ISO: Rājasthāna Prācyavidyā Pratiṣṭhāna
ASCII: Rajasthana Pracyavidya Pratishthana
Abbrev: RPP
```
---
### Southeast Asian Scripts
| Language | Script | Standard | Library/Tool |
|----------|--------|----------|--------------|
| **Thai** | Thai | ISO 11940-2 | `thai-romanization` |
| **Khmer** | Khmer | ALA-LC | `khmer-romanization` |
**Thai Example**:
```
Input: สำนักหอจดหมายเหตุแห่งชาติ
ISO: Samnak Ho Chotmaihet Haeng Chat
Abbrev: SHCHC
```
**Khmer Example**:
```
Input: សារមន្ទីរទួលស្លែង
ALA-LC: Sāramanṭīr Tūl Slèṅ
ASCII: Saramantir Tuol Sleng
Abbrev: STS
```
---
### Other Scripts
| Language | Script | Standard | Library/Tool |
|----------|--------|----------|--------------|
| **Armenian** | Armenian | ISO 9985 | `armenian-transliteration` |
| **Georgian** | Georgian | ISO 9984 | `georgian-transliteration` |
**Armenian Example**:
```
Input: Մատենադարան
ISO: Matenadaran
Abbrev: M
```
**Georgian Example**:
```
Input: ხელნაწერთა ეროვნული ცენტრი
ISO: Xelnawerti Erovnuli C'ent'ri
ASCII: Khelnawerti Erovnuli Centri
Abbrev: KEC
```
---
## Implementation
### Python Transliteration Utility
```python
#!/usr/bin/env python3
"""
Transliteration utility for GHCID abbreviation generation.
Uses ISO and recognized standards for each script/language.
"""
import unicodedata
from typing import Optional

# Try importing transliteration libraries
try:
    from pypinyin import pinyin, Style
    HAS_PYPINYIN = True
except ImportError:
    HAS_PYPINYIN = False

try:
    import pykakasi
    HAS_PYKAKASI = True
except ImportError:
    HAS_PYKAKASI = False

try:
    from transliterate import translit
    HAS_TRANSLITERATE = True
except ImportError:
    HAS_TRANSLITERATE = False
def detect_script(text: str) -> str:
    """
    Detect the primary script of the input text.

    Returns one of:
    - 'latin': Latin alphabet
    - 'cyrillic': Cyrillic script
    - 'chinese': Chinese characters (Hanzi)
    - 'japanese': Japanese (mixed Kanji/Kana)
    - 'korean': Korean Hangul
    - 'arabic': Arabic script (includes Persian, Urdu)
    - 'hebrew': Hebrew script
    - 'greek': Greek script
    - 'devanagari': Devanagari (Hindi, Nepali, Sanskrit)
    - 'bengali': Bengali script
    - 'thai': Thai script
    - 'armenian': Armenian script
    - 'georgian': Georgian script
    - 'unknown': Cannot determine
    """
    script_ranges = {
        'cyrillic': (0x0400, 0x04FF),
        'arabic': (0x0600, 0x06FF),
        'hebrew': (0x0590, 0x05FF),
        'devanagari': (0x0900, 0x097F),
        'bengali': (0x0980, 0x09FF),
        'thai': (0x0E00, 0x0E7F),
        'greek': (0x0370, 0x03FF),
        'armenian': (0x0530, 0x058F),
        'georgian': (0x10A0, 0x10FF),
        'korean': (0xAC00, 0xD7AF),  # Hangul syllables
        'japanese_hiragana': (0x3040, 0x309F),
        'japanese_katakana': (0x30A0, 0x30FF),
        'chinese': (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    }
    script_counts = {script: 0 for script in script_ranges}
    latin_count = 0

    for char in text:
        code = ord(char)
        # Check Latin
        if ('a' <= char <= 'z') or ('A' <= char <= 'Z'):
            latin_count += 1
            continue
        # Check other scripts
        for script, (start, end) in script_ranges.items():
            if start <= code <= end:
                script_counts[script] += 1
                break

    # Determine primary script
    if latin_count > 0 and all(c == 0 for c in script_counts.values()):
        return 'latin'

    # Find max non-Latin script
    max_script = max(script_counts, key=script_counts.get)
    if script_counts[max_script] > 0:
        # Handle Japanese (can be Kanji + Kana).
        # Kanji-only names are reported as 'chinese'; pass lang='ja' to override.
        if max_script in ('japanese_hiragana', 'japanese_katakana', 'chinese'):
            if script_counts['japanese_hiragana'] > 0 or script_counts['japanese_katakana'] > 0:
                return 'japanese'
            return 'chinese'
        return max_script

    return 'latin' if latin_count > 0 else 'unknown'
def transliterate_cyrillic(text: str, lang: str = 'ru') -> str:
    """Transliterate Cyrillic text to Latin (target standard: ISO 9)."""
    if HAS_TRANSLITERATE:
        try:
            # Note: the language packs shipped with `transliterate` use their
            # own mappings, which may differ from the ISO 9 table above.
            return translit(text, lang, reversed=True)
        except Exception:
            # Unsupported language code or library error: use the fallback.
            pass

    # Fallback: basic Cyrillic to Latin mapping. This is a simplified
    # ASCII digraph scheme (Zh, Kh, Ts, ...), not strict ISO 9.
    cyrillic_map = {
        'А': 'A', 'Б': 'B', 'В': 'V', 'Г': 'G', 'Д': 'D', 'Е': 'E',
        'Ё': 'E', 'Ж': 'Zh', 'З': 'Z', 'И': 'I', 'Й': 'Y', 'К': 'K',
        'Л': 'L', 'М': 'M', 'Н': 'N', 'О': 'O', 'П': 'P', 'Р': 'R',
        'С': 'S', 'Т': 'T', 'У': 'U', 'Ф': 'F', 'Х': 'Kh', 'Ц': 'Ts',
        'Ч': 'Ch', 'Ш': 'Sh', 'Щ': 'Shch', 'Ъ': '', 'Ы': 'Y', 'Ь': '',
        'Э': 'E', 'Ю': 'Yu', 'Я': 'Ya',
        'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e',
        'ё': 'e', 'ж': 'zh', 'з': 'z', 'и': 'i', 'й': 'y', 'к': 'k',
        'л': 'l', 'м': 'm', 'н': 'n', 'о': 'o', 'п': 'p', 'р': 'r',
        'с': 's', 'т': 't', 'у': 'u', 'ф': 'f', 'х': 'kh', 'ц': 'ts',
        'ч': 'ch', 'ш': 'sh', 'щ': 'shch', 'ъ': '', 'ы': 'y', 'ь': '',
        'э': 'e', 'ю': 'yu', 'я': 'ya',
        # Ukrainian additions
        'І': 'I', 'і': 'i', 'Ї': 'Yi', 'ї': 'yi', 'Є': 'Ye', 'є': 'ye',
        'Ґ': 'G', 'ґ': 'g',
    }
    return ''.join(cyrillic_map.get(c, c) for c in text)
def transliterate_chinese(text: str) -> str:
    """Transliterate Chinese to Pinyin."""
    if HAS_PYPINYIN:
        # Get pinyin without tone marks.
        # pypinyin returns one entry per character, so the result is
        # syllable-separated ("dong ba wen hua ..."), not word-separated.
        result = pinyin(text, style=Style.NORMAL)
        return ' '.join([''.join(p) for p in result])
    # Fallback: return as-is (requires manual handling)
    return text


def transliterate_japanese(text: str) -> str:
    """Transliterate Japanese to Romaji (Hepburn)."""
    if HAS_PYKAKASI:
        kakasi = pykakasi.kakasi()
        result = kakasi.convert(text)
        return ' '.join([item['hepburn'] for item in result])
    # Fallback: return as-is
    return text


def transliterate_korean(text: str) -> str:
    """Transliterate Korean Hangul to Revised Romanization."""
    # Korean romanization is complex - use library if available
    try:
        from korean_romanizer.romanizer import Romanizer
        r = Romanizer(text)
        return r.romanize()
    except ImportError:
        pass
    # Fallback: no built-in romanization; the text is returned unchanged
    # and should be routed to manual review.
    return text
def transliterate_arabic(text: str) -> str:
    """Transliterate Arabic script to Latin (ISO 233 simplified)."""
    arabic_map = {
        'ا': 'a', 'أ': 'a', 'إ': 'i', 'آ': 'a',
        'ب': 'b', 'ت': 't', 'ث': 'th', 'ج': 'j',
        'ح': 'h', 'خ': 'kh', 'د': 'd', 'ذ': 'dh',
        'ر': 'r', 'ز': 'z', 'س': 's', 'ش': 'sh',
        'ص': 's', 'ض': 'd', 'ط': 't', 'ظ': 'z',
        'ع': "'", 'غ': 'gh', 'ف': 'f', 'ق': 'q',
        'ك': 'k', 'ل': 'l', 'م': 'm', 'ن': 'n',
        'ه': 'h', 'و': 'w', 'ي': 'y', 'ى': 'a',
        'ة': 'a', 'ء': "'",
        # Persian additions
        'پ': 'p', 'چ': 'ch', 'ژ': 'zh', 'گ': 'g',
        'ک': 'k', 'ی': 'i',
    }
    result = []
    for c in text:
        if c in arabic_map:
            result.append(arabic_map[c])
        elif c == ' ' or c.isalnum():
            result.append(c)
    return ''.join(result)


def transliterate_hebrew(text: str) -> str:
    """Transliterate Hebrew to Latin (ISO 259 simplified)."""
    hebrew_map = {
        'א': '', 'ב': 'v', 'ג': 'g', 'ד': 'd', 'ה': 'h',
        'ו': 'v', 'ז': 'z', 'ח': 'ch', 'ט': 't', 'י': 'y',
        'כ': 'k', 'ך': 'k', 'ל': 'l', 'מ': 'm', 'ם': 'm',
        'נ': 'n', 'ן': 'n', 'ס': 's', 'ע': '', 'פ': 'f',
        'ף': 'f', 'צ': 'ts', 'ץ': 'ts', 'ק': 'k', 'ר': 'r',
        'ש': 'sh', 'ת': 't',
    }
    result = []
    for c in text:
        if c in hebrew_map:
            result.append(hebrew_map[c])
        elif c == ' ' or c.isalnum():
            result.append(c)
    return ''.join(result)
def transliterate_greek(text: str) -> str:
    """Transliterate Greek to Latin (ISO 843)."""
    greek_map = {
        'Α': 'A', 'α': 'a', 'Β': 'V', 'β': 'v', 'Γ': 'G', 'γ': 'g',
        'Δ': 'D', 'δ': 'd', 'Ε': 'E', 'ε': 'e', 'Ζ': 'Z', 'ζ': 'z',
        'Η': 'I', 'η': 'i', 'Θ': 'Th', 'θ': 'th', 'Ι': 'I', 'ι': 'i',
        'Κ': 'K', 'κ': 'k', 'Λ': 'L', 'λ': 'l', 'Μ': 'M', 'μ': 'm',
        'Ν': 'N', 'ν': 'n', 'Ξ': 'X', 'ξ': 'x', 'Ο': 'O', 'ο': 'o',
        'Π': 'P', 'π': 'p', 'Ρ': 'R', 'ρ': 'r', 'Σ': 'S', 'σ': 's',
        'ς': 's', 'Τ': 'T', 'τ': 't', 'Υ': 'Y', 'υ': 'y', 'Φ': 'F',
        'φ': 'f', 'Χ': 'Ch', 'χ': 'ch', 'Ψ': 'Ps', 'ψ': 'ps',
        'Ω': 'O', 'ω': 'o',
    }
    return ''.join(greek_map.get(c, c) for c in text)


def transliterate_devanagari(text: str) -> str:
    """Transliterate Devanagari to Latin (ISO 15919 simplified)."""
    try:
        from indic_transliteration import sanscript
        from indic_transliteration.sanscript import transliterate as indic_translit
        return indic_translit(text, sanscript.DEVANAGARI, sanscript.IAST)
    except ImportError:
        pass
    # Fallback: basic mapping
    # This would need a full Devanagari character map
    return text


def transliterate_thai(text: str) -> str:
    """Transliterate Thai to Latin (Royal Thai General System)."""
    try:
        from thaispellcheck import transliterate as thai_translit
        return thai_translit(text)
    except ImportError:
        pass
    # Fallback
    return text
def transliterate(text: str, lang: Optional[str] = None) -> str:
    """
    Transliterate text from non-Latin script to Latin.

    Args:
        text: Input text in any script
        lang: Optional ISO 639-1 language code (e.g., 'ru', 'zh', 'ko')
              If not provided, script is auto-detected.

    Returns:
        Transliterated text in Latin characters.
    """
    if not text:
        return text

    # Map the language code to a script; auto-detect when it is missing or unknown
    if lang:
        script_map = {
            'ru': 'cyrillic', 'uk': 'cyrillic', 'bg': 'cyrillic',
            'sr': 'cyrillic', 'kk': 'cyrillic',
            'zh': 'chinese',
            'ja': 'japanese',
            'ko': 'korean',
            'ar': 'arabic', 'fa': 'arabic', 'ur': 'arabic',
            'he': 'hebrew',
            'el': 'greek',
            'hi': 'devanagari', 'ne': 'devanagari',
            'bn': 'bengali',
            'th': 'thai',
            'hy': 'armenian',
            'ka': 'georgian',
        }
        script = script_map.get(lang, detect_script(text))
    else:
        script = detect_script(text)

    # Apply appropriate transliteration
    transliterators = {
        'cyrillic': lambda t: transliterate_cyrillic(t, lang or 'ru'),
        'chinese': transliterate_chinese,
        'japanese': transliterate_japanese,
        'korean': transliterate_korean,
        'arabic': transliterate_arabic,
        'hebrew': transliterate_hebrew,
        'greek': transliterate_greek,
        'devanagari': transliterate_devanagari,
        'thai': transliterate_thai,
        'latin': lambda t: t,  # No transliteration needed
    }
    # Scripts without an entry (bengali, armenian, georgian) pass through unchanged
    translit_func = transliterators.get(script, lambda t: t)
    result = translit_func(text)

    # Normalize diacritics to ASCII
    normalized = unicodedata.normalize('NFD', result)
    ascii_result = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    return ascii_result
def transliterate_for_abbreviation(emic_name: str, lang: str) -> str:
    """
    Transliterate emic name for GHCID abbreviation generation.
    This is the main entry point for GHCID generation scripts.

    Args:
        emic_name: Institution name in original script
        lang: ISO 639-1 language code

    Returns:
        Transliterated name ready for abbreviation extraction
    """
    import re

    # Step 1: Transliterate to Latin
    latin = transliterate(emic_name, lang)
    # Step 2: Normalize diacritics (handled in transliterate())
    # Step 3: Remove special characters (except spaces)
    clean = re.sub(r'[^a-zA-Z\s]', ' ', latin)
    # Step 4: Normalize whitespace
    clean = ' '.join(clean.split())
    return clean


# Example usage
if __name__ == '__main__':
    test_cases = [
        ('Институт восточных рукописей РАН', 'ru'),
        ('东巴文化博物院', 'zh'),
        ('독립기념관', 'ko'),
        ('राजस्थान प्राच्यविद्या प्रतिष्ठान', 'hi'),
        ('المكتبة الوطنية للمملكة المغربية', 'ar'),
        ('ארכיון הסיפור העממי בישראל', 'he'),
        ('Αρχαιολογικό Μουσείο Θεσσαλονίκης', 'el'),
    ]
    for name, lang in test_cases:
        result = transliterate_for_abbreviation(name, lang)
        print(f'{lang}: {name}')
        print(f'  → {result}')
        print()
```
---
## Skip Words by Language
When extracting abbreviations from transliterated text, skip these articles/prepositions:
### Arabic
- `al-` (the definite article)
- `bi-`, `li-`, `fi-` (prepositions)
### Hebrew
- `ha-` (the)
- `ve-` (and)
- `be-`, `le-`, `me-` (prepositions)
### Persian
- `-e`, `-ye` (ezafe connector)
- `va` (and)
### CJK Languages
- No skip words (particles are integral to meaning)
### Indic Languages
- `ka`, `ki`, `ke` (Hindi: of)
- `aur` (Hindi: and)
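
The skip rules above can be sketched as a small abbreviation helper. This is an illustration under assumptions, not the production extractor: the function name `abbreviate` and the table names are hypothetical, and `lil-` (the contraction of `li-` and `al-` seen in the Arabic example) is added as an assumed extra prefix. The Persian ezafe (`-e`, `-ye`) is attached to the end of the preceding word, so it never contributes an initial and needs no special handling here.

```python
# Hypothetical skip tables derived from the lists in this section.
SKIP_PREFIXES = {
    "ar": ("al-", "lil-", "bi-", "li-", "fi-"),   # "lil-" is an assumed addition
    "he": ("ha-", "ve-", "be-", "le-", "me-"),
}
SKIP_WORDS = {
    "fa": {"va"},
    "hi": {"ka", "ki", "ke", "aur"},
}


def abbreviate(latin_name: str, lang: str) -> str:
    """Build an abbreviation from word initials, honoring per-language skips."""
    letters = []
    for token in latin_name.split():
        lowered = token.lower()
        if lowered in SKIP_WORDS.get(lang, set()):
            continue  # standalone connector such as "va" or "aur"
        for prefix in SKIP_PREFIXES.get(lang, ()):
            if lowered.startswith(prefix):
                token = token[len(prefix):]  # drop the attached article/preposition
                break
        if token and token[0].isalpha():
            letters.append(token[0].upper())
    return "".join(letters)


print(abbreviate("al-Maktaba al-Wataniya lil-Mamlaka al-Maghribiya", "ar"))  # → MWMM
print(abbreviate("Arkhiyon ha-Sipur ha-Amami be-Yisrael", "he"))             # → ASAY
print(abbreviate("Vezarat-e Omur-e Khareje-ye Iran", "fa"))                  # → VOKI
```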
---
## Validation
### Check Transliteration Output
```python
import re


def validate_transliteration(result: str) -> bool:
    """
    Validate that transliteration output contains only ASCII letters and spaces.
    """
    # fullmatch avoids accepting a trailing newline, which '$' would allow
    return bool(re.fullmatch(r'[a-zA-Z\s]+', result))
```
### Manual Review Queue
Non-Latin institutions should be flagged for manual review if:
1. Transliteration library not available for that script
2. Confidence in transliteration is low
3. Institution has multiple official romanizations
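
One possible way to turn these criteria into a check is sketched below. The function `review_reasons` and its parameters are hypothetical. "Low confidence" is approximated here by the output failing the ASCII-letters-and-spaces check, which is an assumption; a real pipeline might use a different signal.

```python
import re
from typing import List, Sequence


def review_reasons(
    transliterated: str,
    library_available: bool,
    official_romanizations: Sequence[str] = (),
) -> List[str]:
    """Return the reasons an institution should go to manual review (empty if none)."""
    reasons = []
    if not library_available:
        reasons.append("no transliteration library for this script")
    # Assumed proxy for low confidence: output is not ASCII letters and spaces.
    if not re.fullmatch(r"[A-Za-z ]+", transliterated):
        reasons.append("low confidence: output is not ASCII letters and spaces")
    if len(set(official_romanizations)) > 1:
        reasons.append("multiple official romanizations")
    return reasons


# An untransliterated fallback result trips the first two checks.
print(review_reasons("독립기념관", library_available=False))
# A clean result with a single known romanization needs no review.
print(review_reasons("Matenadaran", library_available=True))  # → []
```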
---
## Related Documentation
- `AGENTS.md` - Rule 12: Transliteration Standards
- `ABBREVIATION_SPECIAL_CHAR_RULE.md` - Character filtering after transliteration
- `docs/TRANSLITERATION_CONVENTIONS.md` - Extended examples and edge cases
- `scripts/transliterate_emic_names.py` - Production transliteration script
---
## Changelog
| Date | Change |
|------|--------|
| 2025-12-08 | Initial standards document created |