# Transliteration Conventions for Heritage Custodian Names

**Document Type**: User Guide
**Version**: 1.0
**Last Updated**: 2025-12-08
**Related Rules**: `.opencode/TRANSLITERATION_STANDARDS.md`, `AGENTS.md` (Rule 12)

---

## Overview

This document provides examples and guidance for transliterating heritage institution names from non-Latin scripts to Latin characters. Transliteration is **required** for generating GHCID abbreviations, but the **original emic name is always preserved**.

### Key Principles

1. **Emic name preserved** - Original script stored in `custodian_name.emic_name`
2. **ISO standards used** - Recognized international transliteration standards
3. **Deterministic output** - Same input always produces same Latin output
4. **Abbreviation purpose only** - Transliteration is for GHCID generation, not display

---

## Language-by-Language Examples

### Russian (Cyrillic - ISO 9:1995)

**Dataset Statistics**: 13 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Институт восточных рукописей РАН | Institut vostočnyh rukopisej RAN | IVRR |
| Российская государственная библиотека | Rossijskaâ gosudarstvennaâ biblioteka | RGB |
| Государственный архив Российской Федерации | Gosudarstvennyj arhiv Rossijskoj Federacii | GARF |

**Skip Words (Russian)**: None significant (Russian has no articles)

**Character Mapping**:
```
А → A    Б → B    В → V    Г → G    Д → D    Е → E
Ё → Ë    Ж → Ž    З → Z    И → I    Й → J    К → K
Л → L    М → M    Н → N    О → O    П → P    Р → R
С → S    Т → T    У → U    Ф → F    Х → H    Ц → C
Ч → Č    Ш → Š    Щ → Ŝ    Ъ → ʺ    Ы → Y    Ь → ʹ
Э → È    Ю → Û    Я → Â
```
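
The mapping table can be applied mechanically, one letter at a time. The following is a minimal sketch, assuming a hypothetical helper named `iso9_russian`; it is not the code in `scripts/transliterate_emic_names.py`, and it covers only the Russian letters listed in the table.

```python
# Hypothetical helper: applies the ISO 9:1995 table above letter by letter.
ISO9_RUSSIAN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ё": "ë", "ж": "ž", "з": "z", "и": "i", "й": "j", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "х": "h", "ц": "c",
    "ч": "č", "ш": "š", "щ": "ŝ", "ъ": "ʺ", "ы": "y", "ь": "ʹ",
    "э": "è", "ю": "û", "я": "â",
}


def iso9_russian(text: str) -> str:
    """Transliterate Russian Cyrillic to Latin, preserving case."""
    out = []
    for ch in text:
        latin = ISO9_RUSSIAN.get(ch.lower())
        if latin is None:
            out.append(ch)  # spaces, punctuation and Latin letters pass through
        else:
            out.append(latin.upper() if ch.isupper() else latin)
    return "".join(out)


print(iso9_russian("Институт восточных рукописей РАН"))
# Institut vostočnyh rukopisej RAN
```

Because the table is a one-to-one character mapping, the same input always yields the same output, which is the determinism the key principles ask for.
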

---

### Ukrainian (Cyrillic - ISO 9:1995)

**Dataset Statistics**: 8 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Центральний державний архів громадських об'єднань України | Centralʹnij deržavnij arhìv gromadsʹkih ob'êdnanʹ Ukraïni | CDAGOU |
| Національна бібліотека України | Nacìonalʹna bìblìoteka Ukraïni | NBU |

**Ukrainian-specific characters**:
```
І → Ì    Ї → Ï    Є → Ê    Ґ → G̀
```

---

### Chinese (Hanyu Pinyin - ISO 7098)

**Dataset Statistics**: 27 institutions

| Emic Name | Pinyin | Abbreviation |
|-----------|--------|--------------|
| 东巴文化博物院 | Dōngbā Wénhuà Bówùyuàn | DWB |
| 中国第一历史档案馆 | Zhōngguó Dìyī Lìshǐ Dàng'ànguǎn | ZDLD |
| 北京故宫博物院 | Běijīng Gùgōng Bówùyuàn | BGB |
| 中国国家图书馆 | Zhōngguó Guójiā Túshūguǎn | ZGT |

**Notes**:
- Tone marks are removed for abbreviation (diacritics normalization)
- Word boundaries follow natural semantic breaks
- Multi-syllable words are kept together

**Skip Words**: None (Chinese doesn't use separate articles/prepositions)
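
Tone-mark removal can be done with Unicode decomposition rather than a hand-written table. This is a minimal sketch using only the standard library; the function names `strip_diacritics` and `initials` are illustrative and not taken from the project code.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD and drop combining marks (tone marks, macrons, carons)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def initials(transliterated: str) -> str:
    """Take the first letter of each space-separated word, after normalization."""
    words = strip_diacritics(transliterated).split()
    return "".join(word[0] for word in words).upper()


print(strip_diacritics("Zhōngguó Guójiā Túshūguǎn"))  # Zhongguo Guojia Tushuguan
print(initials("Zhōngguó Guójiā Túshūguǎn"))          # ZGT
```

The same normalization applies to the macrons and dots used in the other romanizations in this guide (for example ō, ā, ṭ), since they also decompose into a base letter plus a combining mark.
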

---

### Japanese (Modified Hepburn)

**Dataset Statistics**: 19 institutions

| Emic Name | Romaji | Abbreviation |
|-----------|--------|--------------|
| 国立中央博物館 | Kokuritsu Chūō Hakubutsukan | KCH |
| 東京国立博物館 | Tōkyō Kokuritsu Hakubutsukan | TKH |
| 国立国会図書館 | Kokuritsu Kokkai Toshokan | KKT |

**Notes**:
- Long vowels (ō, ū) normalized to (o, u)
- Particles typically attached to preceding word
- Kanji compounds transliterated as single words

---

### Korean (Revised Romanization)

**Dataset Statistics**: 36 institutions

| Emic Name | RR Romanization | Abbreviation |
|-----------|-----------------|--------------|
| 독립기념관 | Dongnip Ginyeomgwan | DG |
| 국립중앙박물관 | Gungnip Jungang Bangmulgwan | GJB |
| 서울대학교 규장각한국학연구원 | Seoul Daehakgyo Gyujanggak Hangukhak Yeonguwon | SDGHY |
| 국립한글박물관 | Gungnip Hangeul Bangmulgwan | GHB |

**Notes**:
- No diacritics in Revised Romanization (unlike McCune-Reischauer)
- Consonant assimilation reflected in spelling (e.g., 국립 → Gungnip, 박물관 → Bangmulgwan)
- Spaces at natural word boundaries

---

### Arabic (ISO 233-2)

**Dataset Statistics**: 8 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| المكتبة الوطنية للمملكة المغربية | al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya | MWMM |
| دار الكتب المصرية | Dār al-Kutub al-Miṣrīya | DKM |
| المتحف الوطني العراقي | al-Matḥaf al-Waṭanī al-ʿIrāqī | MWI |

**Skip Words**:
- `al-` (definite article "the")
- `lil-` (preposition plus article, "for the")
- After skip word removal: Maktaba, Wataniya, Mamlaka, Maghribiya → MWMM

**Notes**:
- Right-to-left script
- Definite article "al-" always skipped
- Diacritics normalized (ā→a, ī→i, etc.)
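
Skip-word removal for hyphenated prefixes can be sketched as follows. This is an illustration only: `arabic_abbreviation` is a hypothetical name, the prefix list contains just the `al-` and `lil-` prefixes that occur in this guide's Arabic examples, and non-letter signs such as ʿ are assumed to be ignored when picking the initial.

```python
import unicodedata

# Assumption: only the prefixes seen in this guide's examples.
ARABIC_SKIP_PREFIXES = ("al-", "lil-")


def strip_diacritics(text: str) -> str:
    """Decompose to NFD and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def arabic_abbreviation(transliterated: str) -> str:
    """Drop skip prefixes, then take the first ASCII letter of each word."""
    letters = []
    for word in transliterated.split():
        lowered = word.lower()
        for prefix in ARABIC_SKIP_PREFIXES:
            if lowered.startswith(prefix):
                word = word[len(prefix):]
                break
        normalized = strip_diacritics(word)
        # Skip leading signs such as ʿ (ayn) that are not ASCII letters.
        first = next((c for c in normalized if c.isascii() and c.isalpha()), "")
        letters.append(first.upper())
    return "".join(letters)


print(arabic_abbreviation("al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya"))  # MWMM
```
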

---

### Persian/Farsi (ISO 233-3)

**Dataset Statistics**: 11 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| وزارت امور خارجه ایران | Vezārat-e Omūr-e Khārejeh-ye Īrān | VOKI |
| کتابخانه آستان قدس رضوی | Ketābkhāneh-ye Āstān-e Qods-e Raẓavī | KAQR |
| مجلس شورای اسلامی | Majles-e Showrā-ye Eslāmī | MSE |

**Skip Words**:
- `-e`, `-ye` (ezafe connector, "of")
- `va` ("and")

**Persian-specific characters**:
```
پ → p    چ → č    ژ → ž    گ → g
```

---

### Hebrew (ISO 259-3)

**Dataset Statistics**: 4 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| ארכיון הסיפור העממי בישראל | Arḵiyon ha-Sipur ha-ʿAmami be-Yiśraʾel | ASAY |
| הספרייה הלאומית | ha-Sifriya ha-Leʾumit | SL |
| ארכיון המדינה | Arḵiyon ha-Medina | AM |

**Skip Words**:
- `ha-` (definite article "the")
- `be-` ("in")
- `le-` ("to")
- `ve-` ("and")

**Notes**:
- Right-to-left script
- Articles attached with hyphen
- Silent letters (aleph, ayin) often omitted in abbreviation

---

### Hindi (Devanagari - ISO 15919)

**Dataset Statistics**: 14 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| राजस्थान प्राच्यविद्या प्रतिष्ठान | Rājasthāna Prācyavidyā Pratiṣṭhāna | RPP |
| राष्ट्रीय अभिलेखागार | Rāṣṭrīya Abhilekhāgāra | RA |
| राष्ट्रीय संग्रहालय नई दिल्ली | Rāṣṭrīya Saṅgrahālaya Naī Dillī | RSND |

**Skip Words**:
- `ka`, `ki`, `ke` ("of")
- `aur` ("and")
- `mein` ("in")

**Notes**:
- Conjunct consonants transliterated as cluster
- Long vowels marked (ā, ī, ū) then normalized

---

### Greek (ISO 843)

**Dataset Statistics**: 2 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Αρχαιολογικό Μουσείο Θεσσαλονίκης | Archaiologikó Mouseío Thessaloníkīs | AMT |
| Εθνική Βιβλιοθήκη της Ελλάδας | Ethnikī́ Vivliothī́kī tīs Elládas | EVE |

**Skip Words**:
- `tīs`, `tou` ("of the")
- `kai` ("and")

**Character Mapping**:
```
Α → A    Β → V    Γ → G    Δ → D    Ε → E    Ζ → Z
Η → Ī    Θ → Th   Ι → I    Κ → K    Λ → L    Μ → M
Ν → N    Ξ → X    Ο → O    Π → P    Ρ → R    Σ → S
Τ → T    Υ → Y    Φ → F    Χ → Ch   Ψ → Ps   Ω → Ō
```

---

### Thai (ISO 11940-2)

**Dataset Statistics**: 6 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| สำนักหอจดหมายเหตุแห่งชาติ | Samnak Ho Chotmaihet Haeng Chat | SHCHC |
| หอสมุดแห่งชาติ | Ho Samut Haeng Chat | HSHC |

**Notes**:
- Thai script is an abugida (consonant-vowel syllables)
- Thai is written without spaces between words; word boundaries are determined by meaning
- Royal Thai General System also acceptable

---

### Armenian (ISO 9985)

**Dataset Statistics**: 4 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Մատենադարան | Matenadaran | M |
| Ազգային Մատենադարան | Azgayin Matenadaran | AM |

**Notes**:
- Armenian alphabet unique to Armenian language
- Transliteration straightforward letter-for-letter

---

### Georgian (ISO 9984)

**Dataset Statistics**: 2 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| ხელნაწერთა ეროვნული ცენტრი | Xelnawerta Erovnuli C'ent'ri | XEC |
| საქართველოს ეროვნული არქივი | Sakartvelos Erovnuli Arkivi | SEA |

**Notes**:
- Georgian Mkhedruli script
- Apostrophes mark ejective consonants (removed in abbreviation)

---

## Complete Workflow Example

### Step-by-Step: Korean Institution

**Institution**: National Museum of Korea

1. **Emic Name (Original Script)**:
   ```
   국립중앙박물관
   ```

2. **Language Detection**: Korean (ko)

3. **Transliterate using Revised Romanization**:
   ```
   Gungnip Jungang Bangmulgwan
   ```

4. **Identify Skip Words**: None for Korean

5. **Extract First Letters**:
   ```
   G + J + B = GJB
   ```

6. **Diacritic Normalization**: N/A (RR has no diacritics)

7. **Final Abbreviation**: `GJB`

8. **Store in YAML**:
   ```yaml
   custodian_name:
     emic_name: 국립중앙박물관
     name_language: ko
     english_name: National Museum of Korea
   ghcid:
     ghcid_current: KR-SO-SEO-M-GJB
     abbreviation_source: transliterated_emic
   ```
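
Steps 5 to 8 can be expressed as a small record builder. This is a sketch under stated assumptions: the function names are hypothetical, the record layout simply mirrors the YAML in step 8, and the abbreviation is assumed to be the final hyphen-separated component of `ghcid_current` (the `KR-SO-SEO-M` prefix is passed in rather than derived).

```python
def first_letters(transliterated: str) -> str:
    """Step 5: first letter of each space-separated word, upper-cased."""
    return "".join(word[0].upper() for word in transliterated.split())


def build_record(emic_name, name_language, english_name, ghcid_prefix, transliterated):
    """Assemble a dict shaped like the YAML record in step 8 (hypothetical helper)."""
    abbreviation = first_letters(transliterated)
    return {
        "custodian_name": {
            "emic_name": emic_name,  # original script is always kept
            "name_language": name_language,
            "english_name": english_name,
        },
        "ghcid": {
            # Assumption: the abbreviation is appended as the last component.
            "ghcid_current": f"{ghcid_prefix}-{abbreviation}",
            "abbreviation_source": "transliterated_emic",
        },
    }


record = build_record(
    "국립중앙박물관", "ko", "National Museum of Korea",
    "KR-SO-SEO-M", "Gungnip Jungang Bangmulgwan",
)
print(record["ghcid"]["ghcid_current"])  # KR-SO-SEO-M-GJB
```
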

---

### Step-by-Step: Arabic Institution

**Institution**: National Library of Morocco

1. **Emic Name (Original Script)**:
   ```
   المكتبة الوطنية للمملكة المغربية
   ```

2. **Language Detection**: Arabic (ar)

3. **Transliterate using ISO 233-2**:
   ```
   al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya
   ```

4. **Identify Skip Words**: `al-` (3 occurrences), `lil-` (1)

5. **After Skip Word Removal**:
   ```
   Maktaba Waṭanīya Mamlaka Maġribīya
   ```

6. **Extract First Letters**:
   ```
   M + W + M + M = MWMM
   ```

7. **Diacritic Normalization**: ṭ→t, ī→i, ġ→g
   ```
   MWMM (already ASCII)
   ```

8. **Final Abbreviation**: `MWMM`

---

## Edge Cases and Special Handling

### Mixed Scripts

Some institution names mix scripts (e.g., Latin brand names in Chinese text):

**Example**: 中国IBM研究院
- Transliterate Chinese: Zhongguo IBM Yanjiuyuan
- Keep "IBM" as-is (already Latin)
- Abbreviation: ZIY
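
One way to handle mixed scripts is to split the name into Latin and non-Latin runs and transliterate only the latter. The sketch below assumes a caller-supplied `transliterate` callable; the tiny lookup table stands in for a real Pinyin converter and is purely illustrative.

```python
import re


def split_script_runs(name: str) -> list:
    """Split into runs of ASCII alphanumerics and runs of everything else."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]+", name)


def transliterate_mixed(name: str, transliterate) -> str:
    """Keep Latin runs as-is; pass other runs to the supplied callable."""
    runs = split_script_runs(name)
    return " ".join(run if run.isascii() else transliterate(run) for run in runs)


# Stand-in for a real Chinese transliterator (assumption for this example only).
fake_pinyin = {"中国": "Zhongguo", "研究院": "Yanjiuyuan"}.get

latin = transliterate_mixed("中国IBM研究院", fake_pinyin)
print(latin)                                      # Zhongguo IBM Yanjiuyuan
print("".join(w[0] for w in latin.split()))       # ZIY
```
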

### Transliteration Ambiguity

When multiple valid transliterations exist, prefer:
1. ISO standard spelling
2. Institution's own romanization (if consistent)
3. Most commonly used academic romanization
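
The preference order amounts to a first-available lookup. A minimal sketch, with a hypothetical function name and parameters; how "consistent" is judged for an institution's own romanization is left to the caller.

```python
def choose_romanization(iso=None, institutional=None, academic=None,
                        institutional_is_consistent=True):
    """Return the first available candidate in the documented priority order."""
    candidates = [
        iso,
        institutional if institutional_is_consistent else None,
        academic,
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    raise ValueError("no romanization available")


# Illustrative values only: with no ISO form supplied, the institution's
# own spelling wins over the academic one.
print(choose_romanization(institutional="Spelling A", academic="Spelling B"))
```
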

### Very Long Names

If the abbreviation exceeds 10 characters after applying the rules:
1. Truncate to 10 characters
2. Ensure truncation doesn't create an ambiguous abbreviation
3. Document truncation in `ghcid.notes`
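
The three truncation steps can be sketched as a single function. The name `truncate_abbreviation`, the wording of the note, and the choice to raise an error on a collision are assumptions made for illustration; the source only requires that ambiguity be avoided and the truncation be documented.

```python
MAX_ABBREVIATION_LENGTH = 10


def truncate_abbreviation(abbreviation: str, existing: set):
    """Return (abbreviation, note); note is None when nothing was truncated."""
    if len(abbreviation) <= MAX_ABBREVIATION_LENGTH:
        return abbreviation, None
    truncated = abbreviation[:MAX_ABBREVIATION_LENGTH]
    if truncated in existing:
        # Assumption: a clash needs manual resolution rather than silent reuse.
        raise ValueError(f"truncated abbreviation {truncated} is ambiguous")
    note = f"Abbreviation truncated from {abbreviation} to {truncated}"
    return truncated, note


abbr, note = truncate_abbreviation("ABCDEFGHIJKLM", existing={"GJB"})
print(abbr)  # ABCDEFGHIJ
print(note)  # text intended for ghcid.notes
```
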

---

## Python Implementation Reference

For the complete Python implementation of transliteration functions, see:

- `.opencode/TRANSLITERATION_STANDARDS.md` - Full code with all language handlers
- `scripts/transliterate_emic_names.py` - Production script for batch transliteration

### Quick Reference Function

```python
from transliteration import transliterate_for_abbreviation

# Example usage for several of the supported languages
examples = {
    'ru': 'Российская государственная библиотека',
    'zh': '中国国家图书馆',
    'ja': '国立国会図書館',
    'ko': '국립중앙박물관',
    'ar': 'المكتبة الوطنية للمملكة المغربية',
    'he': 'הספרייה הלאומית',
    'hi': 'राष्ट्रीय अभिलेखागार',
    'el': 'Εθνική Βιβλιοθήκη της Ελλάδας',
}

for lang, name in examples.items():
    latin = transliterate_for_abbreviation(name, lang)
    print(f'{lang}: {name}')
    print(f'  → {latin}')
```

---

## Validation Checklist

Before finalizing a transliterated abbreviation:

- [ ] Original emic name preserved in `custodian_name.emic_name`
- [ ] Language code stored in `custodian_name.name_language`
- [ ] Correct ISO standard applied for script
- [ ] Skip words removed (articles, prepositions)
- [ ] Diacritics normalized to ASCII
- [ ] Special characters removed
- [ ] Abbreviation ≤ 10 characters
- [ ] No conflicts with existing GHCIDs
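
Several checklist items can be checked mechanically. The sketch below is illustrative: `validate_record` is a hypothetical name, the record layout follows the YAML example in this guide, and the abbreviation is assumed to be the last hyphen-separated component of `ghcid_current`. Whether the right standard was applied and whether skip words were removed still need human review.

```python
def validate_record(record: dict, existing_ghcids: set) -> list:
    """Return a list of problems; an empty list means the automatic checks pass."""
    problems = []
    name = record.get("custodian_name", {})
    ghcid = record.get("ghcid", {}).get("ghcid_current", "")
    abbreviation = ghcid.rsplit("-", 1)[-1]  # assumption: final component

    if not name.get("emic_name"):
        problems.append("emic_name missing")
    if not name.get("name_language"):
        problems.append("name_language missing")
    if not (abbreviation.isascii() and abbreviation.isalnum()):
        problems.append("abbreviation contains non-ASCII or special characters")
    if len(abbreviation) > 10:
        problems.append("abbreviation longer than 10 characters")
    if ghcid in existing_ghcids:
        problems.append("GHCID conflicts with an existing identifier")
    return problems


record = {
    "custodian_name": {"emic_name": "국립중앙박물관", "name_language": "ko"},
    "ghcid": {"ghcid_current": "KR-SO-SEO-M-GJB"},
}
print(validate_record(record, existing_ghcids=set()))  # []
```
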

---

## See Also

- `.opencode/TRANSLITERATION_STANDARDS.md` - Technical rules and Python code
- `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` - Character filtering rules
- `AGENTS.md` - Rule 12: Non-Latin Script Transliteration
- `docs/PERSISTENT_IDENTIFIERS.md` - GHCID specification

---

## Changelog

| Date | Change |
|------|--------|
| 2025-12-08 | Initial document created with 21 language examples |