Transliteration Conventions for Heritage Custodian Names
Document Type: User Guide
Version: 1.0
Last Updated: 2025-12-08
Related Rules: .opencode/TRANSLITERATION_STANDARDS.md, AGENTS.md (Rule 12)
Overview
This document provides examples and guidance for transliterating heritage institution names from non-Latin scripts to Latin characters. Transliteration is required for generating GHCID abbreviations, but the original emic name is always preserved.
Key Principles
- Emic name preserved - Original script stored in custodian_name.emic_name
- ISO standards used - Recognized international transliteration standards
- Deterministic output - Same input always produces same Latin output
- Abbreviation purpose only - Transliteration is for GHCID generation, not display
Language-by-Language Examples
Russian (Cyrillic - ISO 9:1995)
Dataset Statistics: 13 institutions
| Emic Name | Transliterated | Abbreviation |
|---|---|---|
| Институт восточных рукописей РАН | Institut vostočnyh rukopisej RAN | IVRR |
| Российская государственная библиотека | Rossijskaâ gosudarstvennaâ biblioteka | RGB |
| Государственный архив Российской Федерации | Gosudarstvennyj arhiv Rossijskoj Federacii | GARF |
Skip Words (Russian): None significant (Russian doesn't use articles)
Character Mapping:
А → A Б → B В → V Г → G Д → D Е → E
Ё → Ë Ж → Ž З → Z И → I Й → J К → K
Л → L М → M Н → N О → O П → P Р → R
С → S Т → T У → U Ф → F Х → H Ц → C
Ч → Č Ш → Š Щ → Ŝ Ъ → ʺ Ы → Y Ь → ʹ
Э → È Ю → Û Я → Â
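The mapping table above can be applied mechanically, one character at a time. The following is a minimal sketch of that lookup, not the project's implementation; the names ISO9_RU_UPPER, ISO9_RU and transliterate_ru are hypothetical, and lowercase pairs are assumed to follow from the uppercase ones.

```python
# Hypothetical sketch: ISO 9:1995 lookup for Russian, copied from the table above.
ISO9_RU_UPPER = {
    "А": "A", "Б": "B", "В": "V", "Г": "G", "Д": "D", "Е": "E",
    "Ё": "Ë", "Ж": "Ž", "З": "Z", "И": "I", "Й": "J", "К": "K",
    "Л": "L", "М": "M", "Н": "N", "О": "O", "П": "P", "Р": "R",
    "С": "S", "Т": "T", "У": "U", "Ф": "F", "Х": "H", "Ц": "C",
    "Ч": "Č", "Ш": "Š", "Щ": "Ŝ", "Ъ": "ʺ", "Ы": "Y", "Ь": "ʹ",
    "Э": "È", "Ю": "Û", "Я": "Â",
}

# Derive the lowercase pairs; the modifier letters ʺ and ʹ have no case.
ISO9_RU = dict(ISO9_RU_UPPER)
ISO9_RU.update({k.lower(): v.lower() for k, v in ISO9_RU_UPPER.items()})


def transliterate_ru(text: str) -> str:
    """Map each Cyrillic letter to its ISO 9 equivalent; pass other characters through."""
    return "".join(ISO9_RU.get(ch, ch) for ch in text)


print(transliterate_ru("Государственный архив Российской Федерации"))
# Gosudarstvennyj arhiv Rossijskoj Federacii
```

Because the lookup is a pure function of the input string, the same emic name always yields the same Latin output, which is what the deterministic-output principle asks for.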
Ukrainian (Cyrillic - ISO 9:1995)
Dataset Statistics: 8 institutions
| Emic Name | Transliterated | Abbreviation |
|---|---|---|
| Центральний державний архів громадських об'єднань України | Centralnyj deržavnyj arhiv gromadskyh objednan Ukrainy | CDAGOU |
| Національна бібліотека України | Nacionalna biblioteka Ukrainy | NBU |
Ukrainian-specific characters:
І → I Ї → Ji Є → Je Ґ → G'
Chinese (Hanyu Pinyin - ISO 7098)
Dataset Statistics: 27 institutions
| Emic Name | Pinyin | Abbreviation |
|---|---|---|
| 东巴文化博物院 | Dōngbā Wénhuà Bówùyuàn | DWB |
| 中国第一历史档案馆 | Zhōngguó Dìyī Lìshǐ Dàng'ànguǎn | ZDLD |
| 北京故宫博物院 | Běijīng Gùgōng Bówùyuàn | BGB |
| 中国国家图书馆 | Zhōngguó Guójiā Túshūguǎn | ZGT |
Notes:
- Tone marks are removed for abbreviation (diacritics normalization)
- Word boundaries follow natural semantic breaks
- Multi-syllable words are kept together
Skip Words: None (Chinese doesn't use separate articles/prepositions)
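Tone-mark removal can be done with Unicode decomposition from the standard library. The sketch below is illustrative only: the helper names strip_diacritics and first_letters are hypothetical, and it assumes the Pinyin string is already split into words by spaces.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD and drop combining marks (tone marks, macrons)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def first_letters(text: str) -> str:
    """Take the first letter of each space-separated word, uppercased."""
    return "".join(word[0].upper() for word in text.split())


pinyin = "Zhōngguó Dìyī Lìshǐ Dàng'ànguǎn"
plain = strip_diacritics(pinyin)
print(plain)                 # Zhongguo Diyi Lishi Dang'anguan
print(first_letters(plain))  # ZDLD
```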
Japanese (Modified Hepburn)
Dataset Statistics: 19 institutions
| Emic Name | Romaji | Abbreviation |
|---|---|---|
| 国立中央博物館 | Kokuritsu Chūō Hakubutsukan | KCH |
| 東京国立博物館 | Tōkyō Kokuritsu Hakubutsukan | TKH |
| 国立国会図書館 | Kokuritsu Kokkai Toshokan | KKT |
Notes:
- Long vowels (ō, ū) normalized to (o, u)
- Particles typically attached to preceding word
- Kanji compounds transliterated as single words
Korean (Revised Romanization)
Dataset Statistics: 36 institutions
| Emic Name | RR Romanization | Abbreviation |
|---|---|---|
| 독립기념관 | Dongnip Ginyeomgwan | DG |
| 국립중앙박물관 | Gungnip Jungang Bangmulgwan | GJB |
| 서울대학교 규장각한국학연구원 | Seoul Daehakgyo Gyujanggak Hangukhak Yeonguwon | SDGHY |
| 국립한글박물관 | Gungnip Hangeul Bangmulgwan | GHB |
Notes:
- No diacritics in Revised Romanization (unlike McCune-Reischauer)
- Consonant assimilation reflected in spelling
- Spaces at natural word boundaries
Arabic (ISO 233-2)
Dataset Statistics: 8 institutions
| Emic Name | Transliterated | Abbreviation |
|---|---|---|
| المكتبة الوطنية للمملكة المغربية | al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya | MWMM |
| دار الكتب المصرية | Dār al-Kutub al-Miṣrīya | DKM |
| المتحف الوطني العراقي | al-Matḥaf al-Waṭanī al-ʿIrāqī | MWI |
Skip Words:
- al- (definite article "the")
- After skip word removal: Maktaba, Wataniya, Mamlaka, Maghribiya → MWMM
Notes:
- Right-to-left script
- Definite article "al-" always skipped
- Diacritics normalized (ā→a, ī→i, etc.)
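Skipping the hyphenated article amounts to stripping a known prefix from each word before taking first letters. The following sketch assumes a prefix list of "al-" (from the skip words above) and "lil-" (treated as a skip word in the National Library of Morocco example in this guide); the function names are hypothetical.

```python
import unicodedata

# Assumed prefix list; "lil-" is checked first so it is not confused with other prefixes.
SKIP_PREFIXES = ("lil-", "al-")


def strip_skip_prefix(word: str) -> str:
    """Remove one leading hyphenated skip prefix, if present."""
    lowered = word.lower()
    for prefix in SKIP_PREFIXES:
        if lowered.startswith(prefix):
            return word[len(prefix):]
    return word


def to_ascii(text: str) -> str:
    """Drop combining marks, then any remaining non-ASCII characters such as ʿ."""
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_marks.encode("ascii", "ignore").decode("ascii")


def abbreviate_ar(transliterated: str) -> str:
    """Strip skip prefixes, normalize to ASCII, and join the first letters."""
    words = [to_ascii(strip_skip_prefix(w)) for w in transliterated.split()]
    return "".join(w[0].upper() for w in words if w)


print(abbreviate_ar("al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya"))  # MWMM
print(abbreviate_ar("al-Matḥaf al-Waṭanī al-ʿIrāqī"))                    # MWI
```

Normalizing before taking the first letter matters for names such as al-ʿIrāqī, where the first character after the article is not an ASCII letter.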
Persian/Farsi (ISO 233-3)
Dataset Statistics: 11 institutions
| Emic Name | Transliterated | Abbreviation |
|---|---|---|
| وزارت امور خارجه ایران | Vezārat-e Omūr-e Khārejeh-ye Īrān | VOKI |
| کتابخانه آستان قدس رضوی | Ketābkhāneh-ye Āstān-e Qods-e Raẓavī | KAQR |
| مجلس شورای اسلامی | Majles-e Showrā-ye Eslāmī | MSE |
Skip Words:
- -e, -ye (ezafe connector, "of")
- va ("and")
Persian-specific characters:
پ → p چ → č ژ → ž گ → g
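Unlike the Arabic article, the ezafe connector is attached to the end of the preceding word, so it is removed as a suffix. This is a rough sketch under that assumption; abbreviate_fa and the constants are hypothetical names, and "va" is handled as a standalone word.

```python
import re
import unicodedata

# Ezafe connectors appear as hyphenated suffixes; "va" is a standalone skip word.
EZAFE_SUFFIX = re.compile(r"-(?:ye|e)$")
SKIP_WORDS = {"va"}


def abbreviate_fa(transliterated: str) -> str:
    """Drop ezafe suffixes and skip words, then join ASCII first letters."""
    letters = []
    for word in transliterated.split():
        word = EZAFE_SUFFIX.sub("", word)
        if not word or word.lower() in SKIP_WORDS:
            continue
        # NFD puts the base letter first, so Ī becomes I followed by a combining macron.
        letters.append(unicodedata.normalize("NFD", word[0])[0].upper())
    return "".join(letters)


print(abbreviate_fa("Vezārat-e Omūr-e Khārejeh-ye Īrān"))     # VOKI
print(abbreviate_fa("Ketābkhāneh-ye Āstān-e Qods-e Raẓavī"))  # KAQR
```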
Hebrew (ISO 259-3)
Dataset Statistics: 4 institutions
| Emic Name | Transliterated | Abbreviation |
|---|---|---|
| ארכיון הסיפור העממי בישראל | Arḵiyon ha-Sipur ha-ʿAmami be-Yiśraʾel | ASAY |
| הספרייה הלאומית | ha-Sifriya ha-Leʾumit | SL |
| ארכיון המדינה | Arḵiyon ha-Medina | AM |
Skip Words:
- ha- (definite article "the")
- be- ("in")
- le- ("to")
- ve- ("and")
Notes:
- Right-to-left script
- Articles attached with hyphen
- Silent letters (aleph, ayin) often omitted in abbreviation
Hindi (Devanagari - ISO 15919)
Dataset Statistics: 14 institutions
| Emic Name | Transliterated | Abbreviation |
|---|---|---|
| राजस्थान प्राच्यविद्या प्रतिष्ठान | Rājasthāna Prācyavidyā Pratiṣṭhāna | RPP |
| राष्ट्रीय अभिलेखागार | Rāṣṭrīya Abhilekhāgāra | RA |
| राष्ट्रीय संग्रहालय नई दिल्ली | Rāṣṭrīya Saṅgrahālaya Naī Dillī | RSND |
Skip Words:
- ka, ki, ke ("of")
- aur ("and")
- mein ("in")
Notes:
- Conjunct consonants transliterated as cluster
- Long vowels marked (ā, ī, ū) then normalized
Greek (ISO 843)
Dataset Statistics: 2 institutions
| Emic Name | Transliterated | Abbreviation |
|---|---|---|
| Αρχαιολογικό Μουσείο Θεσσαλονίκης | Archaiologikó Mouseío Thessaloníkīs | AMT |
| Εθνική Βιβλιοθήκη της Ελλάδας | Ethnikī́ Vivliothī́kī tīs Elládas | EVE |
Skip Words:
- tīs, tou ("of the")
- kai ("and")
Character Mapping:
Α → A Β → V Γ → G Δ → D Ε → E Ζ → Z
Η → Ī Θ → Th Ι → I Κ → K Λ → L Μ → M
Ν → N Ξ → X Ο → O Π → P Ρ → R Σ → S
Τ → T Υ → Y Φ → F Χ → Ch Ψ → Ps Ω → Ō
Thai (ISO 11940-2)
Dataset Statistics: 6 institutions
| Emic Name | Transliterated | Abbreviation |
|---|---|---|
| สำนักหอจดหมายเหตุแห่งชาติ | Samnak Ho Chotmaihet Haeng Chat | SHCHC |
| หอสมุดแห่งชาติ | Ho Samut Haeng Chat | HSHC |
Notes:
- Thai script is an abugida (consonant-vowel syllables)
- No spaces in Thai; word boundaries determined by meaning
- Royal Thai General System also acceptable
Armenian (ISO 9985)
Dataset Statistics: 4 institutions
| Emic Name | Transliterated | Abbreviation |
|---|---|---|
| Մատենադարան | Matenadaran | M |
| Ազգային Մատենադարան | Azgayin Matenadaran | AM |
Notes:
- Armenian alphabet unique to Armenian language
- Transliteration straightforward letter-for-letter
Georgian (ISO 9984)
Dataset Statistics: 2 institutions
| Emic Name | Transliterated | Abbreviation |
|---|---|---|
| ხელნაწერთა ეროვნული ცენტრი | Xelnawerta Erovnuli C'ent'ri | XEC |
| საქართველოს ეროვნული არქივი | Sakartvelos Erovnuli Arkivi | SEA |
Notes:
- Georgian Mkhedruli script
- Apostrophes mark ejective consonants (removed in abbreviation)
Complete Workflow Example
Step-by-Step: Korean Institution
Institution: National Museum of Korea
- Emic Name (Original Script): 국립중앙박물관
- Language Detection: Korean (ko)
- Transliterate using Revised Romanization: Gungnip Jungang Bangmulgwan
- Identify Skip Words: None for Korean
- Extract First Letters: G + J + B = GJB
- Diacritic Normalization: N/A (RR has no diacritics)
- Final Abbreviation: GJB
- Store in YAML:

      custodian_name:
        emic_name: 국립중앙박물관
        name_language: ko
        english_name: National Museum of Korea
      ghcid:
        ghcid_current: KR-SO-SEO-M-GJB
        abbreviation_source: transliterated_emic
Step-by-Step: Arabic Institution
Institution: National Library of Morocco
- Emic Name (Original Script): المكتبة الوطنية للمملكة المغربية
- Language Detection: Arabic (ar)
- Transliterate using ISO 233-2: al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya
- Identify Skip Words: al- (3 occurrences), lil- (1)
- After Skip Word Removal: Maktaba Waṭanīya Mamlaka Maġribīya
- Extract First Letters: M + W + M + M = MWMM
- Diacritic Normalization: ṭ→t, ī→i, ġ→g; the first letters are already ASCII, so MWMM is unchanged
- Final Abbreviation: MWMM
Edge Cases and Special Handling
Mixed Scripts
Some institution names mix scripts (e.g., Latin brand names in Chinese text):
Example: 中国IBM研究院
- Transliterate Chinese: Zhongguo IBM Yanjiuyuan
- Keep "IBM" as-is (already Latin)
- Abbreviation: ZIY
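One way to find the segments that need transliteration is to group consecutive characters by script. The sketch below uses Unicode character names as a coarse script test; script_of and split_by_script are hypothetical helpers, and the two-way CJK/Latin classification is an assumption that would need extending for other scripts.

```python
import unicodedata
from itertools import groupby
from typing import List, Tuple


def script_of(ch: str) -> str:
    """Coarse script label derived from the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK"):
        return "cjk"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def split_by_script(text: str) -> List[Tuple[str, str]]:
    """Group consecutive characters that share a script label."""
    return [(script, "".join(chars)) for script, chars in groupby(text, key=script_of)]


print(split_by_script("中国IBM研究院"))
# [('cjk', '中国'), ('latin', 'IBM'), ('cjk', '研究院')]
```

Only the segments labeled cjk would then be passed to the Pinyin step, while latin segments such as IBM are kept unchanged.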
Transliteration Ambiguity
When multiple valid transliterations exist, prefer:
- ISO standard spelling
- Institution's own romanization (if consistent)
- Most commonly used academic romanization
Very Long Names
If abbreviation exceeds 10 characters after applying rules:
- Truncate to 10 characters
- Ensure truncation doesn't create ambiguous abbreviation
- Document truncation in ghcid.notes
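The truncation rule can be sketched as a small helper that returns both the shortened abbreviation and a note suitable for ghcid.notes. The function name, the wording of the note, and the choice to raise an error on a collision are all assumptions made for illustration.

```python
from typing import Optional, Set, Tuple

MAX_LENGTH = 10  # limit stated in the rule above


def truncate_abbreviation(abbreviation: str, existing: Set[str]) -> Tuple[str, Optional[str]]:
    """Return (abbreviation, note); note is None when no truncation was needed."""
    if len(abbreviation) <= MAX_LENGTH:
        return abbreviation, None
    truncated = abbreviation[:MAX_LENGTH]
    # A truncated form that matches an abbreviation already in use would be ambiguous.
    if truncated in existing:
        raise ValueError(f"Truncated abbreviation {truncated!r} is ambiguous")
    note = f"Abbreviation truncated from {abbreviation!r} to {MAX_LENGTH} characters"
    return truncated, note


print(truncate_abbreviation("ABCDEFGHIJKL", set()))
# ('ABCDEFGHIJ', "Abbreviation truncated from 'ABCDEFGHIJKL' to 10 characters")
```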
Python Implementation Reference
For the complete Python implementation of transliteration functions, see:
- .opencode/TRANSLITERATION_STANDARDS.md - Full code with all language handlers
- scripts/transliterate_emic_names.py - Production script for batch transliteration
Quick Reference Function
    from transliteration import transliterate_for_abbreviation

    # Example usage for a subset of the supported languages
    examples = {
        'ru': 'Российская государственная библиотека',
        'zh': '中国国家图书馆',
        'ja': '国立国会図書館',
        'ko': '국립중앙박물관',
        'ar': 'المكتبة الوطنية للمملكة المغربية',
        'he': 'הספרייה הלאומית',
        'hi': 'राष्ट्रीय अभिलेखागार',
        'el': 'Εθνική Βιβλιοθήκη της Ελλάδας',
    }

    # Print each emic name followed by its Latin transliteration
    for lang, name in examples.items():
        latin = transliterate_for_abbreviation(name, lang)
        print(f'{lang}: {name}')
        print(f'  → {latin}')
Validation Checklist
Before finalizing a transliterated abbreviation:
- Original emic name preserved in custodian_name.emic_name
- Language code stored in custodian_name.name_language
- Correct ISO standard applied for script
- Skip words removed (articles, prepositions)
- Diacritics normalized to ASCII
- Special characters removed
- Abbreviation ≤ 10 characters
- No conflicts with existing GHCIDs
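The mechanically checkable items in this list can be sketched as a small validator. It assumes the record layout from the Korean YAML example and that the abbreviation is the last hyphen-separated component of ghcid_current; both are inferences from this guide, and validate_record is a hypothetical name. Items that need human judgment (correct ISO standard, skip word removal) are not covered.

```python
from typing import Dict, List, Set


def validate_record(record: Dict, existing_ghcids: Set[str]) -> List[str]:
    """Return a list of checklist failures; an empty list means the record passes."""
    problems = []
    name = record.get("custodian_name", {})
    ghcid = record.get("ghcid", {})

    if not name.get("emic_name"):
        problems.append("emic_name missing")
    if not name.get("name_language"):
        problems.append("name_language missing")

    current = ghcid.get("ghcid_current", "")
    # Assumption: the abbreviation is the final component, e.g. GJB in KR-SO-SEO-M-GJB.
    abbreviation = current.rsplit("-", 1)[-1]
    if not (abbreviation.isascii() and abbreviation.isalnum()):
        problems.append("abbreviation contains non-ASCII or special characters")
    if len(abbreviation) > 10:
        problems.append("abbreviation longer than 10 characters")
    if current in existing_ghcids:
        problems.append("GHCID conflicts with an existing identifier")
    return problems


record = {
    "custodian_name": {"emic_name": "국립중앙박물관", "name_language": "ko"},
    "ghcid": {"ghcid_current": "KR-SO-SEO-M-GJB"},
}
print(validate_record(record, existing_ghcids=set()))  # []
```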
See Also
- .opencode/TRANSLITERATION_STANDARDS.md - Technical rules and Python code
- .opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md - Character filtering rules
- AGENTS.md - Rule 12: Non-Latin Script Transliteration
- docs/PERSISTENT_IDENTIFIERS.md - GHCID specification
Changelog
| Date | Change |
|---|---|
| 2025-12-08 | Initial document created with examples for 13 languages |