# Transliteration Conventions for Heritage Custodian Names

**Document Type**: User Guide
**Version**: 1.0
**Last Updated**: 2025-12-08
**Related Rules**: `.opencode/TRANSLITERATION_STANDARDS.md`, `AGENTS.md` (Rule 12)

---

## Overview

This document provides comprehensive examples and guidance for transliterating heritage institution names from non-Latin scripts to Latin characters. Transliteration is **required** for generating GHCID abbreviations, but the **original emic name is always preserved**.

### Key Principles

1. **Emic name preserved** - Original script stored in `custodian_name.emic_name`
2. **ISO standards used** - Recognized international transliteration standards
3. **Deterministic output** - Same input always produces same Latin output
4. **Abbreviation purpose only** - Transliteration is for GHCID generation, not display

---

## Language-by-Language Examples

### Russian (Cyrillic - ISO 9:1995)

**Dataset Statistics**: 13 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Институт восточных рукописей РАН | Institut vostočnyh rukopisej RAN | IVRR |
| Российская государственная библиотека | Rossijskaâ gosudarstvennaâ biblioteka | RGB |
| Государственный архив Российской Федерации | Gosudarstvennyj arhiv Rossijskoj Federacii | GARF |

**Skip Words (Russian)**: None significant (Russian doesn't use articles)

**Character Mapping**:
```
А → A   Б → B   В → V   Г → G   Д → D   Е → E   Ё → Ë   Ж → Ž
З → Z   И → I   Й → J   К → K   Л → L   М → M   Н → N   О → O
П → P   Р → R   С → S   Т → T   У → U   Ф → F   Х → H   Ц → C
Ч → Č   Ш → Š   Щ → Ŝ   Ъ → ʺ   Ы → Y   Ь → ʹ   Э → È   Ю → Û
Я → Â
```

---

### Ukrainian (Cyrillic - ISO 9:1995)

**Dataset Statistics**: 8 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Центральний державний архів громадських об'єднань України | Centralnyj deržavnyj arhiv gromadskyh objednan Ukrainy | CDAGOU |
| Національна бібліотека України | Nacionalna biblioteka Ukrainy | NBU |

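The abbreviation column in the Russian and Ukrainian tables above follows from taking the first letter of each transliterated word after diacritics are normalized. The sketch below illustrates that step only; `abbreviate` and its `skip_words` parameter are hypothetical names chosen for this example and are not the project's actual implementation, which lives in `.opencode/TRANSLITERATION_STANDARDS.md`.

```python
import unicodedata


def abbreviate(transliterated, skip_words=frozenset()):
    """Build an abbreviation from the first letter of each transliterated word.

    Hypothetical helper for illustration; not the production function.
    """
    initials = []
    for word in transliterated.split():
        if word.lower() in skip_words:
            continue  # drop articles/prepositions when a language defines them
        # NFD splits "č" into "c" + combining caron; keeping only ASCII
        # letters drops the mark and any modifier primes (ʹ, ʺ).
        decomposed = unicodedata.normalize("NFD", word)
        letters = [ch for ch in decomposed if ch.isascii() and ch.isalpha()]
        if letters:
            initials.append(letters[0].upper())
    return "".join(initials)


print(abbreviate("Institut vostočnyh rukopisej RAN"))            # IVRR
print(abbreviate("Gosudarstvennyj arhiv Rossijskoj Federacii"))  # GARF
print(abbreviate("Nacionalna biblioteka Ukrainy"))               # NBU
```

Neither language defines skip words here, so the default empty set suffices; languages with articles or connectors would pass their own set.
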
**Ukrainian-specific characters**:
```
І → I   Ї → Ji   Є → Je   Ґ → G'
```

---

### Chinese (Hanyu Pinyin - ISO 7098)

**Dataset Statistics**: 27 institutions

| Emic Name | Pinyin | Abbreviation |
|-----------|--------|--------------|
| 东巴文化博物院 | Dōngbā Wénhuà Bówùyuàn | DWB |
| 中国第一历史档案馆 | Zhōngguó Dìyī Lìshǐ Dàng'ànguǎn | ZDLD |
| 北京故宫博物院 | Běijīng Gùgōng Bówùyuàn | BGB |
| 中国国家图书馆 | Zhōngguó Guójiā Túshūguǎn | ZGT |

**Notes**:
- Tone marks are removed for abbreviation (diacritics normalization)
- Word boundaries follow natural semantic breaks
- Multi-syllable words are kept together

**Skip Words**: None (Chinese doesn't use separate articles/prepositions)

---

### Japanese (Modified Hepburn)

**Dataset Statistics**: 19 institutions

| Emic Name | Romaji | Abbreviation |
|-----------|--------|--------------|
| 国立中央博物館 | Kokuritsu Chūō Hakubutsukan | KCH |
| 東京国立博物館 | Tōkyō Kokuritsu Hakubutsukan | TKH |
| 国立国会図書館 | Kokuritsu Kokkai Toshokan | KKT |

**Notes**:
- Long vowels (ō, ū) normalized to (o, u)
- Particles typically attached to preceding word
- Kanji compounds transliterated as single words

---

### Korean (Revised Romanization)

**Dataset Statistics**: 36 institutions

| Emic Name | RR Romanization | Abbreviation |
|-----------|-----------------|--------------|
| 독립기념관 | Dongnip Ginyeomgwan | DG |
| 국립중앙박물관 | Gungnip Jungang Bangmulgwan | GJB |
| 서울대학교 규장각한국학연구원 | Seoul Daehakgyo Gyujanggak Hangukhak Yeonguwon | SDGHY |
| 국립한글박물관 | Gungnip Hangeul Bangmulgwan | GHB |

**Notes**:
- No diacritics in Revised Romanization (unlike McCune-Reischauer)
- Consonant assimilation reflected in spelling (e.g., 국립 → Gungnip, 박물관 → Bangmulgwan)
- Spaces at natural word boundaries

---

### Arabic (ISO 233-2)

**Dataset Statistics**: 8 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| المكتبة الوطنية للمملكة المغربية | al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya | MWMM |
| دار الكتب المصرية | Dār al-Kutub al-Miṣrīya | DKM |
| المتحف الوطني العراقي | al-Matḥaf al-Waṭanī al-ʿIrāqī | MWI |

**Skip Words**:
- `al-` (definite article "the")
- `lil-` (preposition `li-` contracted with the article, "for the")
- After skip word removal: Maktaba, Wataniya, Mamlaka, Maghribiya → MWMM

**Notes**:
- Right-to-left script
- Definite article "al-" always skipped
- Diacritics normalized (ā→a, ī→i, etc.)

---

### Persian/Farsi (ISO 233-3)

**Dataset Statistics**: 11 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| وزارت امور خارجه ایران | Vezārat-e Omūr-e Khārejeh-ye Īrān | VOKI |
| کتابخانه آستان قدس رضوی | Ketābkhāneh-ye Āstān-e Qods-e Raẓavī | KAQR |
| مجلس شورای اسلامی | Majles-e Showrā-ye Eslāmī | MSE |

**Skip Words**:
- `-e`, `-ye` (ezafe connector, "of")
- `va` ("and")

**Persian-specific characters**:
```
پ → p   چ → č   ژ → ž   گ → g
```

---

### Hebrew (ISO 259-3)

**Dataset Statistics**: 4 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| ארכיון הסיפור העממי בישראל | Arḵiyon ha-Sipur ha-ʿAmami be-Yiśraʾel | ASAY |
| הספרייה הלאומית | ha-Sifriya ha-Leʾumit | SL |
| ארכיון המדינה | Arḵiyon ha-Medina | AM |

**Skip Words**:
- `ha-` (definite article "the")
- `be-` ("in")
- `le-` ("to")
- `ve-` ("and")

**Notes**:
- Right-to-left script
- Articles attached with hyphen
- Silent letters (aleph, ayin) often omitted in abbreviation

---

### Hindi (Devanagari - ISO 15919)

**Dataset Statistics**: 14 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| राजस्थान प्राच्यविद्या प्रतिष्ठान | Rājasthāna Prācyavidyā Pratiṣṭhāna | RPP |
| राष्ट्रीय अभिलेखागार | Rāṣṭrīya Abhilekhāgāra | RA |
| राष्ट्रीय संग्रहालय नई दिल्ली | Rāṣṭrīya Saṅgrahālaya Naī Dillī | RSND |

**Skip Words**:
- `ka`, `ki`, `ke` ("of")
- `aur` ("and")
- `mein` ("in")

**Notes**:
- Conjunct consonants transliterated as cluster
- Long vowels marked (ā, ī, ū) then normalized

---

### Greek (ISO 843)

**Dataset Statistics**: 2 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Αρχαιολογικό Μουσείο Θεσσαλονίκης | Archaiologikó Mouseío Thessaloníkīs | AMT |
| Εθνική Βιβλιοθήκη της Ελλάδας | Ethnikī́ Vivliothī́kī tīs Elládas | EVE |

**Skip Words**:
- `tīs`, `tou` ("of the")
- `kai` ("and")

**Character Mapping**:
```
Α → A   Β → V   Γ → G   Δ → D   Ε → E   Ζ → Z   Η → Ī   Θ → Th
Ι → I   Κ → K   Λ → L   Μ → M   Ν → N   Ξ → X   Ο → O   Π → P
Ρ → R   Σ → S   Τ → T   Υ → Y   Φ → F   Χ → Ch  Ψ → Ps  Ω → Ō
```

---

### Thai (ISO 11940-2)

**Dataset Statistics**: 6 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| สำนักหอจดหมายเหตุแห่งชาติ | Samnak Ho Chotmaihet Haeng Chat | SHCHC |
| หอสมุดแห่งชาติ | Ho Samut Haeng Chat | HSHC |

**Notes**:
- Thai script is an abugida (consonant-vowel syllables)
- No spaces in Thai; word boundaries determined by meaning
- Royal Thai General System also acceptable

---

### Armenian (ISO 9985)

**Dataset Statistics**: 4 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Մատենադարան | Matenadaran | M |
| Ազգային Մատենադարան | Azgayin Matenadaran | AM |

**Notes**:
- Armenian alphabet unique to Armenian language
- Transliteration straightforward letter-for-letter

---

### Georgian (ISO 9984)

**Dataset Statistics**: 2 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| ხელნაწერთა ეროვნული ცენტრი | Xelnawerta Erovnuli C'ent'ri | XEC |
| საქართველოს ეროვნული არქივი | Sakartvelos Erovnuli Arkivi | SEA |

**Notes**:
- Georgian Mkhedruli script
- Apostrophes mark ejective consonants (removed in abbreviation)

---

## Complete Workflow Example

### Step-by-Step: Korean Institution

**Institution**: National Museum of Korea

1. **Emic Name (Original Script)**:
   ```
   국립중앙박물관
   ```

2. **Language Detection**: Korean (ko)

3. **Transliterate using Revised Romanization**:
   ```
   Gungnip Jungang Bangmulgwan
   ```

4. **Identify Skip Words**: None for Korean

5. **Extract First Letters**:
   ```
   G + J + B = GJB
   ```

6. **Diacritic Normalization**: N/A (RR has no diacritics)

7. **Final Abbreviation**: `GJB`

8. **Store in YAML**:
   ```yaml
   custodian_name:
     emic_name: 국립중앙박물관
     name_language: ko
     english_name: National Museum of Korea
   ghcid:
     ghcid_current: KR-SO-SEO-M-GJB
     abbreviation_source: transliterated_emic
   ```

---

### Step-by-Step: Arabic Institution

**Institution**: National Library of Morocco

1. **Emic Name (Original Script)**:
   ```
   المكتبة الوطنية للمملكة المغربية
   ```

2. **Language Detection**: Arabic (ar)

3. **Transliterate using ISO 233-2**:
   ```
   al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya
   ```

4. **Identify Skip Words**: `al-` (3 occurrences), `lil-` (1)

5. **After Skip Word Removal**:
   ```
   Maktaba Waṭanīya Mamlaka Maġribīya
   ```

6. **Extract First Letters**:
   ```
   M + W + M + M = MWMM
   ```

7. **Diacritic Normalization**: ṭ→t, ī→i, ġ→g
   ```
   MWMM (already ASCII)
   ```

8. **Final Abbreviation**: `MWMM`

---

## Edge Cases and Special Handling

### Mixed Scripts

Some institution names mix scripts (e.g., Latin brand names in Chinese text):

**Example**: 中国IBM研究院
- Transliterate Chinese: Zhongguo IBM Yanjiuyuan
- Keep "IBM" as-is (already Latin)
- Abbreviation: ZIY

### Transliteration Ambiguity

When multiple valid transliterations exist, prefer:
1. ISO standard spelling
2. Institution's own romanization (if consistent)
3. Most commonly used academic romanization

### Very Long Names

If the abbreviation exceeds 10 characters after applying the rules:
1. Truncate to 10 characters
2. Ensure truncation doesn't create an ambiguous abbreviation
3. Document truncation in `ghcid.notes`

---

## Python Implementation Reference

For the complete Python implementation of transliteration functions, see:
- `.opencode/TRANSLITERATION_STANDARDS.md` - Full code with all language handlers
- `scripts/transliterate_emic_names.py` - Production script for batch transliteration

### Quick Reference Function

```python
from transliteration import transliterate_for_abbreviation

# Example usage for several supported languages (language code → emic name)
examples = {
    'ru': 'Российская государственная библиотека',
    'zh': '中国国家图书馆',
    'ja': '国立国会図書館',
    'ko': '국립중앙박물관',
    'ar': 'المكتبة الوطنية للمملكة المغربية',
    'he': 'הספרייה הלאומית',
    'hi': 'राष्ट्रीय अभिलेखागार',
    'el': 'Εθνική Βιβλιοθήκη της Ελλάδας',
}

# Print each emic name followed by its Latin transliteration
for lang, name in examples.items():
    latin = transliterate_for_abbreviation(name, lang)
    print(f'{lang}: {name}')
    print(f'  → {latin}')
```

---

## Validation Checklist

Before finalizing a transliterated abbreviation:

- [ ] Original emic name preserved in `custodian_name.emic_name`
- [ ] Language code stored in `custodian_name.name_language`
- [ ] Correct ISO standard applied for script
- [ ] Skip words removed (articles, prepositions)
- [ ] Diacritics normalized to ASCII
- [ ] Special characters removed
- [ ] Abbreviation ≤ 10 characters
- [ ] No conflicts with existing GHCIDs

---

## See Also

- `.opencode/TRANSLITERATION_STANDARDS.md` - Technical rules and Python code
- `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` - Character filtering rules
- `AGENTS.md` - Rule 12: Non-Latin Script Transliteration
- `docs/PERSISTENT_IDENTIFIERS.md` - GHCID specification

---

## Changelog

| Date | Change |
|------|--------|
| 2025-12-08 | Initial document created with 21 language examples |