# Transliteration Standards for Non-Latin Scripts

**Rule ID**: TRANSLIT-ISO
**Status**: MANDATORY
**Applies To**: GHCID abbreviation generation from emic names in non-Latin scripts
**Created**: 2025-12-08

---

## Summary

**When generating GHCID abbreviations from institution names written in non-Latin scripts, the emic name MUST first be transliterated to Latin characters using the designated ISO or recognized standard for that script.**

This rule affects **170 institutions** across **21 languages** with non-Latin writing systems.

### Key Principles

1. **Emic name is preserved** - The original script is stored in `custodian_name.emic_name`
2. **Transliteration is for processing only** - Used to generate abbreviations
3. **ISO/recognized standards required** - No ad-hoc romanization
4. **Deterministic output** - Same input always produces same Latin output
5. **Existing GHCIDs grandfathered** - Only applies to NEW custodians

---

## Transliteration Standards by Script/Language

### Cyrillic Scripts

| Language | ISO Code | Standard | Library/Tool | Notes |
|----------|----------|----------|--------------|-------|
| **Russian** | ru | ISO 9:1995 | `transliterate` | Scientific transliteration |
| **Ukrainian** | uk | ISO 9:1995 | `transliterate` | Includes Ukrainian-specific letters |
| **Bulgarian** | bg | ISO 9:1995 | `transliterate` | Uses same Cyrillic base |
| **Serbian** | sr | ISO 9:1995 | `transliterate` | Serbian Cyrillic variant |
| **Kazakh** | kk | ISO 9:1995 | `transliterate` | Cyrillic-based (pre-2023) |

**ISO 9:1995 Mapping (Core Characters)**:

| Cyrillic | Latin | Cyrillic | Latin |
|----------|-------|----------|-------|
| А а | A a | П п | P p |
| Б б | B b | Р р | R r |
| В в | V v | С с | S s |
| Г г | G g | Т т | T t |
| Д д | D d | У у | U u |
| Е е | E e | Ф ф | F f |
| Ё ё | Ë ë | Х х | H h |
| Ж ж | Ž ž | Ц ц | C c |
| З з | Z z | Ч ч | Č č |
| И и | I i | Ш ш | Š š |
| Й й | J j | Щ щ | Ŝ ŝ |
| К к | K k | Ъ ъ | ʺ (hard sign) |
| Л л | L l | Ы ы | Y y |
| М м | M m | Ь ь | ʹ (soft sign) |
| Н н | N n | Э э | È è |
| О о | O o | Ю ю | Û û |
| | | Я я | Â â |

**Example**:

```
Input:  Институт восточных рукописей РАН
ISO 9:  Institut vostočnyh rukopisej RAN
Abbrev: IVRRAN → IVRRAN (after diacritic normalization)
```

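
The mapping table and example above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the project's implementation: the names `ISO9_LOWER`, `iso9`, and `to_ascii` are hypothetical, only the Russian letters from the table are covered, and diacritics are folded with `unicodedata` to mirror the diacritic normalization step.

```python
import unicodedata

# Lowercase ISO 9:1995 mappings copied from the table above (Russian letters only).
ISO9_LOWER = {
    'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e', 'ё': 'ë',
    'ж': 'ž', 'з': 'z', 'и': 'i', 'й': 'j', 'к': 'k', 'л': 'l', 'м': 'm',
    'н': 'n', 'о': 'o', 'п': 'p', 'р': 'r', 'с': 's', 'т': 't', 'у': 'u',
    'ф': 'f', 'х': 'h', 'ц': 'c', 'ч': 'č', 'ш': 'š', 'щ': 'ŝ', 'ъ': 'ʺ',
    'ы': 'y', 'ь': 'ʹ', 'э': 'è', 'ю': 'û', 'я': 'â',
}


def iso9(text: str) -> str:
    """Map each Cyrillic letter to its ISO 9 Latin letter, preserving case."""
    out = []
    for ch in text:
        latin = ISO9_LOWER.get(ch.lower())
        if latin is None:
            out.append(ch)  # spaces, digits, punctuation, unmapped letters
        elif ch.isupper():
            out.append(latin.upper())
        else:
            out.append(latin)
    return ''.join(out)


def to_ascii(text: str) -> str:
    """Strip combining marks after NFD decomposition (ž -> z, č -> c)."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')


romanized = iso9('Институт восточных рукописей РАН')
print(romanized)            # Institut vostočnyh rukopisej RAN
print(to_ascii(romanized))  # Institut vostocnyh rukopisej RAN
```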

---

### CJK Scripts

#### Chinese (Hanzi)

| Variant | Standard | Library/Tool | Notes |
|---------|----------|--------------|-------|
| Simplified | Hanyu Pinyin (ISO 7098) | `pypinyin` | Standard PRC romanization |
| Traditional | Hanyu Pinyin | `pypinyin` | Same standard applies |

**Pinyin Rules**:
- Tone marks are OMITTED for abbreviation (diacritics removed anyway)
- Word boundaries follow natural spacing
- Proper nouns capitalized

**Example**:

```
Input:  东巴文化博物院
Pinyin: Dōngbā Wénhuà Bówùyuàn
ASCII:  Dongba Wenhua Bowuyuan
Abbrev: DWB
```

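
The last two steps of the example (tone-mark removal and initial extraction) need no third-party library. A minimal sketch, assuming the Pinyin string has already been segmented into words as in the example; `strip_tone_marks` and `initials` are hypothetical helper names. Note that this folding also turns `ü` into `u`.

```python
import unicodedata


def strip_tone_marks(pinyin_text: str) -> str:
    """Remove tone diacritics by decomposing (NFD) and dropping combining marks."""
    decomposed = unicodedata.normalize('NFD', pinyin_text)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')


def initials(words: str) -> str:
    """Take the first letter of each space-separated word, uppercased."""
    return ''.join(word[0].upper() for word in words.split())


ascii_name = strip_tone_marks('Dōngbā Wénhuà Bówùyuàn')
print(ascii_name)            # Dongba Wenhua Bowuyuan
print(initials(ascii_name))  # DWB
```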

#### Japanese (Kanji/Kana)

| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| Modified Hepburn | `pykakasi`, `romkan` | Most widely used internationally |

**Hepburn Rules**:
- Long vowels: ō, ū (normalized to o, u for abbreviation)
- Particles: は (wa), を (o), へ (e)
- Syllabic n: ん = n (written n' before vowels and y)

**Example**:

```
Input:  国立中央博物館
Romaji: Kokuritsu Chūō Hakubutsukan
ASCII:  Kokuritsu Chuo Hakubutsukan
Abbrev: KCH
```


#### Korean (Hangul)

| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| Revised Romanization (RR) | `korean-romanizer`, `hangul-romanize` | Official South Korean standard (2000) |

**RR Rules**:
- No diacritics (unlike McCune-Reischauer)
- Consonant assimilation reflected in spelling
- Word boundaries at natural breaks

**Example**:

```
Input:  독립기념관
RR:     Dongnip Ginyeomgwan
Abbrev: DG
```

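
Precomposed Hangul syllables are laid out arithmetically in Unicode, so a letter-by-letter romanization can be sketched with plain integer division. The tables below follow RR letter values as an approximation; `romanize_naive` is a hypothetical name, and the sketch deliberately ignores the consonant assimilation mentioned above, so 독립 comes out as `dokrip` rather than `dongnip`. It illustrates why a dedicated library is preferred and is not a substitute for one.

```python
# Jamo tables in Unicode order; values approximate Revised Romanization letters.
INITIALS = ['g', 'kk', 'n', 'd', 'tt', 'r', 'm', 'b', 'pp', 's', 'ss', '',
            'j', 'jj', 'ch', 'k', 't', 'p', 'h']
MEDIALS = ['a', 'ae', 'ya', 'yae', 'eo', 'e', 'yeo', 'ye', 'o', 'wa', 'wae',
           'oe', 'yo', 'u', 'wo', 'we', 'wi', 'yu', 'eu', 'ui', 'i']
FINALS = ['', 'k', 'k', 'k', 'n', 'n', 'n', 't', 'l', 'k', 'm', 'l', 'l',
          'l', 'p', 'l', 'm', 'p', 'p', 't', 't', 'ng', 't', 't', 'k', 't',
          'p', 't']

HANGUL_BASE = 0xAC00  # first precomposed syllable
HANGUL_LAST = 0xD7A3  # last precomposed syllable


def romanize_naive(text: str) -> str:
    """Romanize syllable by syllable, without any sound-change rules."""
    out = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            index = code - HANGUL_BASE
            # 21 medials x 28 finals per initial consonant
            initial, rest = divmod(index, 21 * 28)
            medial, final = divmod(rest, 28)
            out.append(INITIALS[initial] + MEDIALS[medial] + FINALS[final])
        else:
            out.append(ch)  # leave non-Hangul characters untouched
    return ''.join(out)


print(romanize_naive('기념관'))  # ginyeomgwan
print(romanize_naive('독립'))    # dokrip (RR with assimilation: dongnip)
```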

---

### Arabic Script

| Language | ISO Code | Standard | Library/Tool | Notes |
|----------|----------|----------|--------------|-------|
| **Arabic** | ar | ISO 233-2:1993 | `arabic-transliteration` | Simplified standard |
| **Persian/Farsi** | fa | ISO 233-3:1999 | `persian-transliteration` | Persian extensions |
| **Urdu** | ur | ISO 233-3 + Urdu extensions | `urdu-transliteration` | Additional characters |

**ISO 233 Mapping (Core Arabic)**:

| Arabic | Name | Latin |
|--------|------|-------|
| ا | Alif | ā / a |
| ب | Ba | b |
| ت | Ta | t |
| ث | Tha | ṯ |
| ج | Jim | ǧ / j |
| ح | Ha | ḥ |
| خ | Kha | ḫ / kh |
| د | Dal | d |
| ذ | Dhal | ḏ |
| ر | Ra | r |
| ز | Zay | z |
| س | Sin | s |
| ش | Shin | š / sh |
| ص | Sad | ṣ |
| ض | Dad | ḍ |
| ط | Ta | ṭ |
| ظ | Za | ẓ |
| ع | Ayn | ʿ |
| غ | Ghayn | ġ / gh |
| ف | Fa | f |
| ق | Qaf | q |
| ك | Kaf | k |
| ل | Lam | l |
| م | Mim | m |
| ن | Nun | n |
| ه | Ha | h |
| و | Waw | w / ū |
| ي | Ya | y / ī |

**Example (Arabic)**:

```
Input:  المكتبة الوطنية للمملكة المغربية
ISO:    al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya
ASCII:  al-Maktaba al-Wataniya lil-Mamlaka al-Maghribiya
Abbrev: MWMM (skip "al-" articles and the "lil-" prefix)
```

**Example (Persian)**:

```
Input:  وزارت امور خارجه ایران
ISO:    Vezārat-e Omur-e Khāreǧe-ye Īrān
ASCII:  Vezarat-e Omur-e Khareje-ye Iran
Abbrev: VOKI (skip "e" connector)
```


---

### Hebrew Script

| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| ISO 259-3:1999 | `hebrew-transliteration` | Simplified romanization |

**ISO 259 Mapping**:

| Hebrew | Name | Latin |
|--------|------|-------|
| א | Aleph | ʾ / (silent) |
| ב | Bet | b / v |
| ג | Gimel | g |
| ד | Dalet | d |
| ה | He | h |
| ו | Vav | v / o / u |
| ז | Zayin | z |
| ח | Chet | ḥ / ch |
| ט | Tet | ṭ / t |
| י | Yod | y / i |
| כ ך | Kaf | k / kh |
| ל | Lamed | l |
| מ ם | Mem | m |
| נ ן | Nun | n |
| ס | Samekh | s |
| ע | Ayin | ʿ / (silent) |
| פ ף | Pe | p / f |
| צ ץ | Tsade | ṣ / ts |
| ק | Qof | q / k |
| ר | Resh | r |
| ש | Shin/Sin | š / s |
| ת | Tav | t |

**Example**:

```
Input:  ארכיון הסיפור העממי בישראל
ISO:    Arḵiyon ha-Sipur ha-ʿAmami be-Yiśraʾel
ASCII:  Arkhiyon ha-Sipur ha-Amami be-Yisrael
Abbrev: ASAY (skip the "ha-" article and the "be-" preposition)
```

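
The "skip" annotations in the Arabic, Persian, and Hebrew examples above can be sketched as a small filter over the ASCII form. This is an illustration under stated assumptions: `SKIP_TOKENS` and `abbreviate` are hypothetical names, the token sets are limited to the particles that occur in those three examples, and hyphens are treated as word separators so that prefixes and the ezafe connector become standalone tokens.

```python
import re

# Particles seen in the examples above, compared case-insensitively.
SKIP_TOKENS = {
    'ar': {'al', 'lil'},
    'fa': {'e', 'ye'},
    'he': {'ha', 'be'},
}


def abbreviate(ascii_name: str, lang: str) -> str:
    """Build an abbreviation from word initials, ignoring skipped particles."""
    skip = SKIP_TOKENS.get(lang, set())
    # Split on whitespace and hyphens so "al-Maktaba" yields "al" and "Maktaba"
    tokens = re.split(r'[\s-]+', ascii_name.strip())
    return ''.join(t[0].upper() for t in tokens if t and t.lower() not in skip)


print(abbreviate('al-Maktaba al-Wataniya lil-Mamlaka al-Maghribiya', 'ar'))  # MWMM
print(abbreviate('Vezarat-e Omur-e Khareje-ye Iran', 'fa'))                  # VOKI
print(abbreviate('Arkhiyon ha-Sipur ha-Amami be-Yisrael', 'he'))             # ASAY
```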

---

### Greek Script

| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| ISO 843:1997 | `greek-transliteration` | Romanization of Greek |

**ISO 843 Mapping**:

| Greek | Latin | Greek | Latin |
|-------|-------|-------|-------|
| Α α | A a | Ν ν | N n |
| Β β | V v | Ξ ξ | X x |
| Γ γ | G g | Ο ο | O o |
| Δ δ | D d | Π π | P p |
| Ε ε | E e | Ρ ρ | R r |
| Ζ ζ | Z z | Σ σ ς | S s |
| Η η | Ī ī | Τ τ | T t |
| Θ θ | Th th | Υ υ | Y y |
| Ι ι | I i | Φ φ | F f |
| Κ κ | K k | Χ χ | Ch ch |
| Λ λ | L l | Ψ ψ | Ps ps |
| Μ μ | M m | Ω ω | Ō ō |

**Example**:

```
Input:  Αρχαιολογικό Μουσείο Θεσσαλονίκης
ISO:    Archaiologikó Mouseío Thessaloníkīs
ASCII:  Archaiologiko Mouseio Thessalonikis
Abbrev: AMT
```


---

### Indic Scripts

| Language | Script | Standard | Library/Tool |
|----------|--------|----------|--------------|
| **Hindi** | Devanagari | ISO 15919 | `indic-transliteration` |
| **Bengali** | Bengali | ISO 15919 | `indic-transliteration` |
| **Nepali** | Devanagari | ISO 15919 | `indic-transliteration` |
| **Sinhala** | Sinhala | ISO 15919 | `indic-transliteration` |

**ISO 15919 Core Consonants (Devanagari)**:

| Devanagari | Latin | Devanagari | Latin |
|------------|-------|------------|-------|
| क | ka | त | ta |
| ख | kha | थ | tha |
| ग | ga | द | da |
| घ | gha | ध | dha |
| ङ | ṅa | न | na |
| च | ca | प | pa |
| छ | cha | फ | pha |
| ज | ja | ब | ba |
| झ | jha | भ | bha |
| ञ | ña | म | ma |
| ट | ṭa | य | ya |
| ठ | ṭha | र | ra |
| ड | ḍa | ल | la |
| ढ | ḍha | व | va |
| ण | ṇa | श | śa |
| | | ष | ṣa |
| | | स | sa |
| | | ह | ha |

**Example (Hindi)**:

```
Input:  राजस्थान प्राच्यविद्या प्रतिष्ठान
ISO:    Rājasthāna Prācyavidyā Pratiṣṭhāna
ASCII:  Rajasthana Pracyavidya Pratishthana
Abbrev: RPP
```

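
Devanagari consonants carry an inherent `a` that is replaced by a vowel sign or cancelled by the virama (्), which is why the table lists them as `ka`, `kha`, and so on. The sketch below shows that logic for a handful of letters only. `translit_devanagari_sketch` and the partial tables are hypothetical and cover just the characters in the example above; capitalization is not handled. Plain diacritic stripping yields `pratisthana`, whereas the example writes `Pratishthana`; the abbreviation initials are the same either way.

```python
import unicodedata

VIRAMA = '\u094d'  # cancels the inherent vowel of the preceding consonant

# Partial tables: only the letters needed for the example above.
CONSONANTS = {'च': 'c', 'ज': 'j', 'ठ': 'ṭh', 'त': 't', 'थ': 'th', 'द': 'd',
              'न': 'n', 'प': 'p', 'य': 'y', 'र': 'r', 'व': 'v', 'ष': 'ṣ',
              'स': 's'}
VOWEL_SIGNS = {'\u093e': 'ā', '\u093f': 'i'}  # dependent signs for ā and i


def translit_devanagari_sketch(text: str) -> str:
    """Romanize consonant clusters, vowel signs, and the inherent vowel."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ''
        if ch in CONSONANTS:
            if nxt == VIRAMA:
                out.append(CONSONANTS[ch])                     # bare consonant
                i += 2
            elif nxt in VOWEL_SIGNS:
                out.append(CONSONANTS[ch] + VOWEL_SIGNS[nxt])  # explicit vowel
                i += 2
            else:
                out.append(CONSONANTS[ch] + 'a')               # inherent vowel
                i += 1
        else:
            out.append(ch)                                     # spaces, unmapped
            i += 1
    return ''.join(out)


name = translit_devanagari_sketch('राजस्थान प्राच्यविद्या प्रतिष्ठान')
ascii_name = ''.join(c for c in unicodedata.normalize('NFD', name)
                     if unicodedata.category(c) != 'Mn')
print(name)        # rājasthāna prācyavidyā pratiṣṭhāna
print(ascii_name)  # rajasthana pracyavidya pratisthana
```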

---

### Southeast Asian Scripts

| Language | Script | Standard | Library/Tool |
|----------|--------|----------|--------------|
| **Thai** | Thai | ISO 11940-2 | `thai-romanization` |
| **Khmer** | Khmer | ALA-LC | `khmer-romanization` |

**Thai Example**:

```
Input:  สำนักหอจดหมายเหตุแห่งชาติ
ISO:    Samnak Ho Chotmaihet Haeng Chat
Abbrev: SHCHC
```

**Khmer Example**:

```
Input:  សារមន្ទីរទួលស្លែង
ALA-LC: Sāramanṭīr Tūl Slèṅ
ASCII:  Saramantir Tuol Sleng
Abbrev: STS
```


---

### Other Scripts

| Language | Script | Standard | Library/Tool |
|----------|--------|----------|--------------|
| **Armenian** | Armenian | ISO 9985 | `armenian-transliteration` |
| **Georgian** | Georgian | ISO 9984 | `georgian-transliteration` |

**Armenian Example**:

```
Input:  Մատենադարան
ISO:    Matenadaran
Abbrev: M
```

**Georgian Example**:

```
Input:  ხელნაწერთა ეროვნული ცენტრი
ISO:    Xelnacert'a Erovnuli C'entri
ASCII:  Khelnacerta Erovnuli Centri
Abbrev: KEC
```


---

## Implementation

### Python Transliteration Utility

```python
#!/usr/bin/env python3
"""
Transliteration utility for GHCID abbreviation generation.
Uses ISO and recognized standards for each script/language.
"""

import unicodedata
from typing import Optional

# Try importing transliteration libraries
try:
    from pypinyin import pinyin, Style
    HAS_PYPINYIN = True
except ImportError:
    HAS_PYPINYIN = False

try:
    import pykakasi
    HAS_PYKAKASI = True
except ImportError:
    HAS_PYKAKASI = False

try:
    from transliterate import translit
    HAS_TRANSLITERATE = True
except ImportError:
    HAS_TRANSLITERATE = False


def detect_script(text: str) -> str:
    """
    Detect the primary script of the input text.

    Returns one of:
    - 'latin': Latin alphabet
    - 'cyrillic': Cyrillic script
    - 'chinese': Chinese characters (Hanzi)
    - 'japanese': Japanese (mixed Kanji/Kana)
    - 'korean': Korean Hangul
    - 'arabic': Arabic script (includes Persian, Urdu)
    - 'hebrew': Hebrew script
    - 'greek': Greek script
    - 'devanagari': Devanagari (Hindi, Nepali, Sanskrit)
    - 'bengali': Bengali script
    - 'thai': Thai script
    - 'armenian': Armenian script
    - 'georgian': Georgian script
    - 'unknown': Cannot determine
    """
    script_ranges = {
        'cyrillic': (0x0400, 0x04FF),
        'arabic': (0x0600, 0x06FF),
        'hebrew': (0x0590, 0x05FF),
        'devanagari': (0x0900, 0x097F),
        'bengali': (0x0980, 0x09FF),
        'thai': (0x0E00, 0x0E7F),
        'greek': (0x0370, 0x03FF),
        'armenian': (0x0530, 0x058F),
        'georgian': (0x10A0, 0x10FF),
        'korean': (0xAC00, 0xD7AF),  # Hangul syllables
        'japanese_hiragana': (0x3040, 0x309F),
        'japanese_katakana': (0x30A0, 0x30FF),
        'chinese': (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    }

    script_counts = {script: 0 for script in script_ranges}
    latin_count = 0

    for char in text:
        code = ord(char)

        # Check Latin
        if ('a' <= char <= 'z') or ('A' <= char <= 'Z'):
            latin_count += 1
            continue

        # Check other scripts
        for script, (start, end) in script_ranges.items():
            if start <= code <= end:
                script_counts[script] += 1
                break

    # Determine primary script
    if latin_count > 0 and all(c == 0 for c in script_counts.values()):
        return 'latin'

    # Find max non-Latin script
    max_script = max(script_counts, key=script_counts.get)
    if script_counts[max_script] > 0:
        # Handle Japanese (can be Kanji + Kana).
        # Kanji-only Japanese names contain no kana and are reported as
        # 'chinese'; callers should pass lang='ja' to transliterate().
        if max_script in ('japanese_hiragana', 'japanese_katakana', 'chinese'):
            if script_counts['japanese_hiragana'] > 0 or script_counts['japanese_katakana'] > 0:
                return 'japanese'
            return 'chinese'
        return max_script

    return 'latin' if latin_count > 0 else 'unknown'


def transliterate_cyrillic(text: str, lang: str = 'ru') -> str:
    """Transliterate Cyrillic text to Latin (target standard: ISO 9)."""
    if HAS_TRANSLITERATE:
        try:
            # NOTE: the language packs shipped with `transliterate` are not
            # ISO 9 tables, so output may differ from the mapping table above.
            return translit(text, lang, reversed=True)
        except Exception:
            # Unsupported language code or other failure: use the fallback
            pass

    # Fallback: basic Cyrillic to Latin mapping
    # (simplified ASCII digraphs such as 'zh' and 'kh', not ISO 9)
    cyrillic_map = {
        'А': 'A', 'Б': 'B', 'В': 'V', 'Г': 'G', 'Д': 'D', 'Е': 'E',
        'Ё': 'E', 'Ж': 'Zh', 'З': 'Z', 'И': 'I', 'Й': 'Y', 'К': 'K',
        'Л': 'L', 'М': 'M', 'Н': 'N', 'О': 'O', 'П': 'P', 'Р': 'R',
        'С': 'S', 'Т': 'T', 'У': 'U', 'Ф': 'F', 'Х': 'Kh', 'Ц': 'Ts',
        'Ч': 'Ch', 'Ш': 'Sh', 'Щ': 'Shch', 'Ъ': '', 'Ы': 'Y', 'Ь': '',
        'Э': 'E', 'Ю': 'Yu', 'Я': 'Ya',
        'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e',
        'ё': 'e', 'ж': 'zh', 'з': 'z', 'и': 'i', 'й': 'y', 'к': 'k',
        'л': 'l', 'м': 'm', 'н': 'n', 'о': 'o', 'п': 'p', 'р': 'r',
        'с': 's', 'т': 't', 'у': 'u', 'ф': 'f', 'х': 'kh', 'ц': 'ts',
        'ч': 'ch', 'ш': 'sh', 'щ': 'shch', 'ъ': '', 'ы': 'y', 'ь': '',
        'э': 'e', 'ю': 'yu', 'я': 'ya',
        # Ukrainian additions
        'І': 'I', 'і': 'i', 'Ї': 'Yi', 'ї': 'yi', 'Є': 'Ye', 'є': 'ye',
        'Ґ': 'G', 'ґ': 'g',
    }
    return ''.join(cyrillic_map.get(c, c) for c in text)


def transliterate_chinese(text: str) -> str:
    """Transliterate Chinese to Pinyin."""
    if HAS_PYPINYIN:
        # Get pinyin without tone marks
        result = pinyin(text, style=Style.NORMAL)
        # NOTE: pinyin() returns one entry per character, so the joined
        # string is space-separated by syllable ("dong ba wen hua ..."),
        # not by word as in the documented example ("Dongba Wenhua ...").
        return ' '.join([''.join(p) for p in result])

    # Fallback: return as-is (requires manual handling)
    return text


def transliterate_japanese(text: str) -> str:
    """Transliterate Japanese to Romaji (Hepburn)."""
    if HAS_PYKAKASI:
        kakasi = pykakasi.kakasi()
        result = kakasi.convert(text)
        return ' '.join([item['hepburn'] for item in result])

    # Fallback: return as-is
    return text


def transliterate_korean(text: str) -> str:
    """Transliterate Korean Hangul to Revised Romanization."""
    # Korean romanization is complex - use library if available
    try:
        from korean_romanizer.romanizer import Romanizer
        r = Romanizer(text)
        return r.romanize()
    except ImportError:
        pass

    # Fallback: Hangul syllable decomposition is not implemented here,
    # so the text is returned unchanged and needs manual handling
    return text


def transliterate_arabic(text: str) -> str:
    """Transliterate Arabic script to Latin (ISO 233 simplified)."""
    arabic_map = {
        'ا': 'a', 'أ': 'a', 'إ': 'i', 'آ': 'a',
        'ب': 'b', 'ت': 't', 'ث': 'th', 'ج': 'j',
        'ح': 'h', 'خ': 'kh', 'د': 'd', 'ذ': 'dh',
        'ر': 'r', 'ز': 'z', 'س': 's', 'ش': 'sh',
        'ص': 's', 'ض': 'd', 'ط': 't', 'ظ': 'z',
        'ع': "'", 'غ': 'gh', 'ف': 'f', 'ق': 'q',
        'ك': 'k', 'ل': 'l', 'م': 'm', 'ن': 'n',
        'ه': 'h', 'و': 'w', 'ي': 'y', 'ى': 'a',
        'ة': 'a', 'ء': "'",
        # Persian additions
        'پ': 'p', 'چ': 'ch', 'ژ': 'zh', 'گ': 'g',
        'ک': 'k', 'ی': 'i',
    }
    result = []
    for c in text:
        if c in arabic_map:
            result.append(arabic_map[c])
        elif c == ' ' or c.isalnum():
            result.append(c)
    return ''.join(result)


def transliterate_hebrew(text: str) -> str:
    """Transliterate Hebrew to Latin (ISO 259 simplified)."""
    hebrew_map = {
        'א': '', 'ב': 'v', 'ג': 'g', 'ד': 'd', 'ה': 'h',
        'ו': 'v', 'ז': 'z', 'ח': 'ch', 'ט': 't', 'י': 'y',
        'כ': 'k', 'ך': 'k', 'ל': 'l', 'מ': 'm', 'ם': 'm',
        'נ': 'n', 'ן': 'n', 'ס': 's', 'ע': '', 'פ': 'f',
        'ף': 'f', 'צ': 'ts', 'ץ': 'ts', 'ק': 'k', 'ר': 'r',
        'ש': 'sh', 'ת': 't',
    }
    result = []
    for c in text:
        if c in hebrew_map:
            result.append(hebrew_map[c])
        elif c == ' ' or c.isalnum():
            result.append(c)
    return ''.join(result)


def transliterate_greek(text: str) -> str:
    """Transliterate Greek to Latin (ISO 843)."""
    greek_map = {
        'Α': 'A', 'α': 'a', 'Β': 'V', 'β': 'v', 'Γ': 'G', 'γ': 'g',
        'Δ': 'D', 'δ': 'd', 'Ε': 'E', 'ε': 'e', 'Ζ': 'Z', 'ζ': 'z',
        'Η': 'I', 'η': 'i', 'Θ': 'Th', 'θ': 'th', 'Ι': 'I', 'ι': 'i',
        'Κ': 'K', 'κ': 'k', 'Λ': 'L', 'λ': 'l', 'Μ': 'M', 'μ': 'm',
        'Ν': 'N', 'ν': 'n', 'Ξ': 'X', 'ξ': 'x', 'Ο': 'O', 'ο': 'o',
        'Π': 'P', 'π': 'p', 'Ρ': 'R', 'ρ': 'r', 'Σ': 'S', 'σ': 's',
        'ς': 's', 'Τ': 'T', 'τ': 't', 'Υ': 'Y', 'υ': 'y', 'Φ': 'F',
        'φ': 'f', 'Χ': 'Ch', 'χ': 'ch', 'Ψ': 'Ps', 'ψ': 'ps',
        'Ω': 'O', 'ω': 'o',
    }
    return ''.join(greek_map.get(c, c) for c in text)


def transliterate_devanagari(text: str) -> str:
    """Transliterate Devanagari to Latin (ISO 15919 simplified)."""
    try:
        from indic_transliteration import sanscript
        from indic_transliteration.sanscript import transliterate as indic_translit
        # NOTE: this requests the IAST scheme, which is close to but not
        # identical with ISO 15919 (for example, ISO 15919 writes ē and ō).
        # The difference disappears after diacritic normalization.
        return indic_translit(text, sanscript.DEVANAGARI, sanscript.IAST)
    except ImportError:
        pass

    # Fallback: basic mapping
    # This would need a full Devanagari character map
    return text


def transliterate_thai(text: str) -> str:
    """Transliterate Thai to Latin (Royal Thai General System)."""
    # NOTE: the standards table names ISO 11940-2 for Thai, while this
    # function targets RTGS. Verify that the import below is available in
    # your environment; without it the input is returned unchanged.
    try:
        from thaispellcheck import transliterate as thai_translit
        return thai_translit(text)
    except ImportError:
        pass

    # Fallback
    return text


def transliterate(text: str, lang: Optional[str] = None) -> str:
    """
    Transliterate text from non-Latin script to Latin.

    Args:
        text: Input text in any script
        lang: Optional ISO 639-1 language code (e.g., 'ru', 'zh', 'ko')
              If not provided, script is auto-detected.

    Returns:
        Transliterated text in Latin characters.
    """
    if not text:
        return text

    # Detect script if language not provided
    if lang:
        script_map = {
            'ru': 'cyrillic', 'uk': 'cyrillic', 'bg': 'cyrillic',
            'sr': 'cyrillic', 'kk': 'cyrillic',
            'zh': 'chinese',
            'ja': 'japanese',
            'ko': 'korean',
            'ar': 'arabic', 'fa': 'arabic', 'ur': 'arabic',
            'he': 'hebrew',
            'el': 'greek',
            'hi': 'devanagari', 'ne': 'devanagari',
            'bn': 'bengali',
            'th': 'thai',
            'hy': 'armenian',
            'ka': 'georgian',
        }
        script = script_map.get(lang, detect_script(text))
    else:
        script = detect_script(text)

    # Apply appropriate transliteration.
    # 'bengali', 'armenian', and 'georgian' have no transliterator here, so
    # they fall through to the identity function below and come back in
    # their original script.
    transliterators = {
        'cyrillic': lambda t: transliterate_cyrillic(t, lang or 'ru'),
        'chinese': transliterate_chinese,
        'japanese': transliterate_japanese,
        'korean': transliterate_korean,
        'arabic': transliterate_arabic,
        'hebrew': transliterate_hebrew,
        'greek': transliterate_greek,
        'devanagari': transliterate_devanagari,
        'thai': transliterate_thai,
        'latin': lambda t: t,  # No transliteration needed
    }

    translit_func = transliterators.get(script, lambda t: t)
    result = translit_func(text)

    # Normalize diacritics to ASCII
    normalized = unicodedata.normalize('NFD', result)
    ascii_result = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

    return ascii_result


def transliterate_for_abbreviation(emic_name: str, lang: str) -> str:
    """
    Transliterate emic name for GHCID abbreviation generation.

    This is the main entry point for GHCID generation scripts.

    Args:
        emic_name: Institution name in original script
        lang: ISO 639-1 language code

    Returns:
        Transliterated name ready for abbreviation extraction
    """
    # Step 1: Transliterate to Latin
    latin = transliterate(emic_name, lang)

    # Step 2: Normalize diacritics (handled in transliterate())

    # Step 3: Remove special characters (except spaces)
    import re
    # Drop apostrophe-like signs (soft/hard sign, ayin, hamza) first so they
    # do not split one word into two when other characters become spaces
    latin = re.sub(r"['’ʹʺʾʿ]", '', latin)
    clean = re.sub(r'[^a-zA-Z\s]', ' ', latin)

    # Step 4: Normalize whitespace
    clean = ' '.join(clean.split())

    return clean


# Example usage
if __name__ == '__main__':
    test_cases = [
        ('Институт восточных рукописей РАН', 'ru'),
        ('东巴文化博物院', 'zh'),
        ('독립기념관', 'ko'),
        ('राजस्थान प्राच्यविद्या प्रतिष्ठान', 'hi'),
        ('المكتبة الوطنية للمملكة المغربية', 'ar'),
        ('ארכיון הסיפור העממי בישראל', 'he'),
        ('Αρχαιολογικό Μουσείο Θεσσαλονίκης', 'el'),
    ]

    for name, lang in test_cases:
        result = transliterate_for_abbreviation(name, lang)
        print(f'{lang}: {name}')
        print(f'  → {result}')
        print()
```

---

## Skip Words by Language

When extracting abbreviations from transliterated text, skip these articles, prepositions, and conjunctions:

### Arabic

- `al-` (the definite article)
- `bi-`, `li-`, `fi-` (prepositions)

### Hebrew

- `ha-` (the)
- `ve-` (and)
- `be-`, `le-`, `me-` (prepositions)

### Persian

- `-e`, `-ye` (ezafe connector)
- `va` (and)

### CJK Languages

- No skip words (particles are integral to meaning)

### Indic Languages

- `ka`, `ki`, `ke` (Hindi: of)
- `aur` (Hindi: and)

---

## Validation

### Check Transliteration Output

```python
def validate_transliteration(result: str) -> bool:
    """
    Validate that transliteration output contains only ASCII letters and spaces.
    """
    import re
    return bool(re.match(r'^[a-zA-Z\s]+$', result))
```
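
A brief usage sketch for the check above. The function is restated so the snippet runs on its own, and the sample strings are illustrative only. It shows that output with a leftover diacritic, untransliterated text returned by a fallback, and an empty string all fail validation, which is one possible signal for sending a record to manual review.

```python
import re


def validate_transliteration(result: str) -> bool:
    """Return True if result consists only of ASCII letters and whitespace."""
    return bool(re.match(r'^[a-zA-Z\s]+$', result))


samples = [
    'Dongnip Ginyeomgwan',  # clean ASCII output
    'Institut vostočnyh',   # diacritic left in place
    '독립기념관',             # a fallback returned the original script
    '',                     # empty result
]
for sample in samples:
    print(repr(sample), validate_transliteration(sample))
```

Only the first sample passes; the other three return `False`.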

### Manual Review Queue

Non-Latin institutions should be flagged for manual review if:
1. Transliteration library not available for that script
2. Confidence in transliteration is low
3. Institution has multiple official romanizations

---

## Related Documentation

- `AGENTS.md` - Rule 12: Transliteration Standards
- `ABBREVIATION_SPECIAL_CHAR_RULE.md` - Character filtering after transliteration
- `docs/TRANSLITERATION_CONVENTIONS.md` - Extended examples and edge cases
- `scripts/transliterate_emic_names.py` - Production transliteration script

---

## Changelog

| Date | Change |
|------|--------|
| 2025-12-08 | Initial standards document created |