glam/docs/TRANSLITERATION_CONVENTIONS.md
kempersc 271545fa8b docs: add Z.AI GLM API and transliteration rules to AGENTS.md
- Add Rule 11 for Z.AI Coding Plan API usage (not BigModel)
- Add transliteration standards for non-Latin scripts
- Document GLM model options and Python implementation
2025-12-08 14:58:22 +01:00

# Transliteration Conventions for Heritage Custodian Names
**Document Type**: User Guide
**Version**: 1.0
**Last Updated**: 2025-12-08
**Related Rules**: `.opencode/TRANSLITERATION_STANDARDS.md`, `AGENTS.md` (Rule 12)
---
## Overview
This document provides examples and guidance for transliterating heritage institution names from non-Latin scripts to Latin characters. Transliteration is **required** for generating GHCID abbreviations, but the **original emic name is always preserved**.
### Key Principles
1. **Emic name preserved** - Original script stored in `custodian_name.emic_name`
2. **ISO standards used** - Recognized international transliteration standards
3. **Deterministic output** - Same input always produces same Latin output
4. **Abbreviation purpose only** - Transliteration is for GHCID generation, not display
---
## Language-by-Language Examples
### Russian (Cyrillic - ISO 9:1995)
**Dataset Statistics**: 13 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Институт восточных рукописей РАН | Institut vostočnyh rukopisej RAN | IVRR |
| Российская государственная библиотека | Rossijskaâ gosudarstvennaâ biblioteka | RGB |
| Государственный архив Российской Федерации | Gosudarstvennyj arhiv Rossijskoj Federacii | GARF |
**Skip Words (Russian)**: None significant (Russian doesn't use articles)
**Character Mapping**:
```
А → A Б → B В → V Г → G Д → D Е → E
Ё → Ë Ж → Ž З → Z И → I Й → J К → K
Л → L М → M Н → N О → O П → P Р → R
С → S Т → T У → U Ф → F Х → H Ц → C
Ч → Č Ш → Š Щ → Ŝ Ъ → ʺ Ы → Y Ь → ʹ
Э → È Ю → Û Я → Â
```
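The mapping above can be applied mechanically. The following is a minimal sketch, not the project's production code: the names `ISO9_UPPER`, `ISO9_TABLE`, and `transliterate_ru` are hypothetical, and lower-case pairs are assumed to be the lower-case forms of the upper-case pairs in the table.

```python
# Upper-case ISO 9:1995 pairs, copied from the character mapping above.
ISO9_UPPER = {
    "А": "A", "Б": "B", "В": "V", "Г": "G", "Д": "D", "Е": "E",
    "Ё": "Ë", "Ж": "Ž", "З": "Z", "И": "I", "Й": "J", "К": "K",
    "Л": "L", "М": "M", "Н": "N", "О": "O", "П": "P", "Р": "R",
    "С": "S", "Т": "T", "У": "U", "Ф": "F", "Х": "H", "Ц": "C",
    "Ч": "Č", "Ш": "Š", "Щ": "Ŝ", "Ъ": "ʺ", "Ы": "Y", "Ь": "ʹ",
    "Э": "È", "Ю": "Û", "Я": "Â",
}

# Assumption: lower-case letters map to the lower-case form of each pair.
ISO9_LOWER = {src.lower(): dst.lower() for src, dst in ISO9_UPPER.items()}

# str.translate needs a table keyed by code point; str.maketrans builds it.
ISO9_TABLE = str.maketrans({**ISO9_UPPER, **ISO9_LOWER})


def transliterate_ru(text: str) -> str:
    """Replace each Cyrillic letter by its ISO 9 counterpart; keep the rest."""
    return text.translate(ISO9_TABLE)


print(transliterate_ru("Государственный архив Российской Федерации"))
# Gosudarstvennyj arhiv Rossijskoj Federacii
```

Because the table is a fixed one-to-one lookup, the same input always yields the same output, which is the determinism the Key Principles ask for.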
---
### Ukrainian (Cyrillic - ISO 9:1995)
**Dataset Statistics**: 8 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Центральний державний архів громадських об'єднань України | Centralnyj deržavnyj arhiv gromadskyh objednan Ukrainy | CDAGOU |
| Національна бібліотека України | Nacionalna biblioteka Ukrainy | NBU |
**Ukrainian-specific characters**:
```
І → I Ї → Ji Є → Je Ґ → G'
```
---
### Chinese (Hanyu Pinyin - ISO 7098)
**Dataset Statistics**: 27 institutions
| Emic Name | Pinyin | Abbreviation |
|-----------|--------|--------------|
| 东巴文化博物院 | Dōngbā Wénhuà Bówùyuàn | DWB |
| 中国第一历史档案馆 | Zhōngguó Dìyī Lìshǐ Dàng'ànguǎn | ZDLD |
| 北京故宫博物院 | Běijīng Gùgōng Bówùyuàn | BGB |
| 中国国家图书馆 | Zhōngguó Guójiā Túshūguǎn | ZGT |
**Notes**:
- Tone marks are removed for abbreviation (diacritics normalization)
- Word boundaries follow natural semantic breaks
- Multi-syllable words are kept together
**Skip Words**: None (Chinese doesn't use separate articles/prepositions)
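Tone-mark removal can be done with Unicode decomposition from the standard library. This is a sketch under that assumption; the helper names `strip_diacritics` and `initials` are hypothetical and are not taken from the project's scripts.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop combining marks such as tone marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def initials(text: str) -> str:
    """Upper-cased first letter of every space-separated word."""
    return "".join(word[0].upper() for word in strip_diacritics(text).split())


pinyin = "Zhōngguó Dìyī Lìshǐ Dàng'ànguǎn"
print(strip_diacritics(pinyin))  # Zhongguo Diyi Lishi Dang'anguan
print(initials(pinyin))          # ZDLD
```

The same helper covers the macrons in Hepburn romaji (ō, ū), since they decompose in the same way.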
---
### Japanese (Modified Hepburn)
**Dataset Statistics**: 19 institutions
| Emic Name | Romaji | Abbreviation |
|-----------|--------|--------------|
| 国立中央博物館 | Kokuritsu Chūō Hakubutsukan | KCH |
| 東京国立博物館 | Tōkyō Kokuritsu Hakubutsukan | TKH |
| 国立国会図書館 | Kokuritsu Kokkai Toshokan | KKT |
**Notes**:
- Long vowels (ō, ū) normalized to (o, u)
- Particles typically attached to preceding word
- Kanji compounds transliterated as single words
---
### Korean (Revised Romanization)
**Dataset Statistics**: 36 institutions
| Emic Name | RR Romanization | Abbreviation |
|-----------|-----------------|--------------|
| 독립기념관 | Dongnip Ginyeomgwan | DG |
| 국립중앙박물관 | Gungnip Jungang Bangmulgwan | GJB |
| 서울대학교 규장각한국학연구원 | Seoul Daehakgyo Gyujanggak Hangukhak Yeonguwon | SDGHY |
| 국립한글박물관 | Gungnip Hangeul Bangmulgwan | GHB |
**Notes**:
- No diacritics in Revised Romanization (unlike McCune-Reischauer)
- Consonant assimilation reflected in spelling
- Spaces at natural word boundaries
---
### Arabic (ISO 233-2)
**Dataset Statistics**: 8 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| المكتبة الوطنية للمملكة المغربية | al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya | MWMM |
| دار الكتب المصرية | Dār al-Kutub al-Miṣrīya | DKM |
| المتحف الوطني العراقي | al-Matḥaf al-Waṭanī al-ʿIrāqī | MWI |
**Skip Words**:
- `al-` (definite article "the")
- `lil-` (preposition `li-` fused with the definite article, "for the")
- After skip word removal: Maktaba, Wataniya, Mamlaka, Maghribiya → MWMM
**Notes**:
- Right-to-left script
- Definite article "al-" always skipped
- Diacritics normalized (ā→a, ī→i, etc.)
---
### Persian/Farsi (ISO 233-3)
**Dataset Statistics**: 11 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| وزارت امور خارجه ایران | Vezārat-e Omūr-e Khārejeh-ye Īrān | VOKI |
| کتابخانه آستان قدس رضوی | Ketābkhāneh-ye Āstān-e Qods-e Raẓavī | KAQR |
| مجلس شورای اسلامی | Majles-e Showrā-ye Eslāmī | MSE |
**Skip Words**:
- `-e`, `-ye` (ezafe connector, "of")
- `va` ("and")
**Persian-specific characters**:
```
پ → p چ → č ژ → ž گ → g
```
---
### Hebrew (ISO 259-3)
**Dataset Statistics**: 4 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| ארכיון הסיפור העממי בישראל | Arḵiyon ha-Sipur ha-ʿAmami be-Yiśraʾel | ASAY |
| הספרייה הלאומית | ha-Sifriya ha-Leʾumit | SL |
| ארכיון המדינה | Arḵiyon ha-Medina | AM |
**Skip Words**:
- `ha-` (definite article "the")
- `be-` ("in")
- `le-` ("to")
- `ve-` ("and")
**Notes**:
- Right-to-left script
- Articles attached with hyphen
- Silent letters (aleph, ayin) often omitted in abbreviation
---
### Hindi (Devanagari - ISO 15919)
**Dataset Statistics**: 14 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| राजस्थान प्राच्यविद्या प्रतिष्ठान | Rājasthāna Prācyavidyā Pratiṣṭhāna | RPP |
| राष्ट्रीय अभिलेखागार | Rāṣṭrīya Abhilekhāgāra | RA |
| राष्ट्रीय संग्रहालय नई दिल्ली | Rāṣṭrīya Saṅgrahālaya Naī Dillī | RSND |
**Skip Words**:
- `ka`, `ki`, `ke` ("of")
- `aur` ("and")
- `mein` ("in")
**Notes**:
- Conjunct consonants transliterated as cluster
- Long vowels marked (ā, ī, ū) then normalized
---
### Greek (ISO 843)
**Dataset Statistics**: 2 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Αρχαιολογικό Μουσείο Θεσσαλονίκης | Archaiologikó Mouseío Thessaloníkīs | AMT |
| Εθνική Βιβλιοθήκη της Ελλάδας | Ethnikī́ Vivliothī́kī tīs Elládas | EVE |
**Skip Words**:
- `tīs`, `tou` ("of the")
- `kai` ("and")
**Character Mapping**:
```
Α → A Β → V Γ → G Δ → D Ε → E Ζ → Z
Η → Ī Θ → Th Ι → I Κ → K Λ → L Μ → M
Ν → N Ξ → X Ο → O Π → P Ρ → R Σ → S
Τ → T Υ → Y Φ → F Χ → Ch Ψ → Ps Ω → Ō
```
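The skip-word step for Greek can be sketched as a filter over whole words. This is an illustration only: `GREEK_SKIP_WORDS`, `to_ascii`, and `abbreviate` are hypothetical names, and the skip words are compared after diacritics are removed, so `tīs` is stored as `tis`.

```python
import unicodedata

# Skip words from the list above, with diacritics already removed.
GREEK_SKIP_WORDS = {"tis", "tou", "kai"}


def to_ascii(text: str) -> str:
    """Decompose to NFD and keep only ASCII characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if ch.isascii())


def abbreviate(transliterated: str, skip_words: set) -> str:
    """Drop skip words, then join the upper-cased first letters."""
    words = [to_ascii(word) for word in transliterated.split()]
    kept = [w for w in words if w and w.lower() not in skip_words]
    return "".join(w[0].upper() for w in kept)


print(abbreviate("Ethnikī́ Vivliothī́kī tīs Elládas", GREEK_SKIP_WORDS))  # EVE
```

Passing a different skip-word set (for example the Hindi list `ka`, `ki`, `ke`, `aur`, `mein`) reuses the same function for other languages whose skip words are standalone tokens.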
---
### Thai (ISO 11940-2)
**Dataset Statistics**: 6 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| สำนักหอจดหมายเหตุแห่งชาติ | Samnak Ho Chotmaihet Haeng Chat | SHCHC |
| หอสมุดแห่งชาติ | Ho Samut Haeng Chat | HSHC |
**Notes**:
- Thai script is an abugida (consonant-vowel syllables)
- Thai is written without spaces; word boundaries are determined by meaning
- The Royal Thai General System is also acceptable
---
### Armenian (ISO 9985)
**Dataset Statistics**: 4 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Մատենադարան | Matenadaran | M |
| Ազգային Մատենադարան | Azgayin Matenadaran | AM |
**Notes**:
- Armenian alphabet unique to Armenian language
- Transliteration straightforward letter-for-letter
---
### Georgian (ISO 9984)
**Dataset Statistics**: 2 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| ხელნაწერთა ეროვნული ცენტრი | Xelnac'erta Erovnuli Cent'ri | XEC |
| საქართველოს ეროვნული არქივი | Sakartvelos Erovnuli Arkivi | SEA |
**Notes**:
- Georgian Mkhedruli script
- Apostrophes mark ejective consonants (removed in abbreviation)
---
## Complete Workflow Example
### Step-by-Step: Korean Institution
**Institution**: National Museum of Korea
1. **Emic Name (Original Script)**:
```
국립중앙박물관
```
2. **Language Detection**: Korean (ko)
3. **Transliterate using Revised Romanization**:
```
Gungnip Jungang Bangmulgwan
```
4. **Identify Skip Words**: None for Korean
5. **Extract First Letters**:
```
G + J + B = GJB
```
6. **Diacritic Normalization**: N/A (RR has no diacritics)
7. **Final Abbreviation**: `GJB`
8. **Store in YAML**:
```yaml
custodian_name:
  emic_name: 국립중앙박물관
  name_language: ko
  english_name: National Museum of Korea
ghcid:
  ghcid_current: KR-SO-SEO-M-GJB
  abbreviation_source: transliterated_emic
```
---
### Step-by-Step: Arabic Institution
**Institution**: National Library of Morocco
1. **Emic Name (Original Script)**:
```
المكتبة الوطنية للمملكة المغربية
```
2. **Language Detection**: Arabic (ar)
3. **Transliterate using ISO 233-2**:
```
al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya
```
4. **Identify Skip Words**: `al-` (3 occurrences), `lil-` (1)
5. **After Skip Word Removal**:
```
Maktaba Waṭanīya Mamlaka Maġribīya
```
6. **Extract First Letters**:
```
M + W + M + M = MWMM
```
7. **Diacritic Normalization**: ṭ→t, ī→i, ġ→g in the transliterated words; the extracted initials are unaffected
```
MWMM (already ASCII)
```
8. **Final Abbreviation**: `MWMM`
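Steps 4 to 8 can be sketched in a few lines. This is a hedged illustration rather than the project's implementation: `ARABIC_PREFIXES`, `strip_prefix`, `first_ascii_letter`, and `abbreviate` are hypothetical names, and skip words are assumed to be hyphen-attached prefixes, as in the examples in this guide.

```python
import unicodedata

# Hyphen-attached skip prefixes named in step 4.
ARABIC_PREFIXES = ("al-", "lil-")


def strip_prefix(word: str, prefixes: tuple) -> str:
    """Remove the first matching skip prefix, if any (step 5)."""
    lowered = word.lower()
    for prefix in prefixes:
        if lowered.startswith(prefix):
            return word[len(prefix):]
    return word


def first_ascii_letter(word: str) -> str:
    """First ASCII letter after NFD decomposition (steps 6 and 7).

    Non-ASCII signs such as ʿ are passed over, and Ī yields I.
    """
    for ch in unicodedata.normalize("NFD", word):
        if ch.isascii() and ch.isalpha():
            return ch.upper()
    return ""


def abbreviate(transliterated: str, prefixes: tuple = ARABIC_PREFIXES) -> str:
    words = [strip_prefix(w, prefixes) for w in transliterated.split()]
    return "".join(first_ascii_letter(w) for w in words)


print(abbreviate("al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya"))  # MWMM
print(abbreviate("al-Matḥaf al-Waṭanī al-ʿIrāqī"))                    # MWI
```

The same function handles the Hebrew examples when called with the Hebrew prefixes, for instance `abbreviate(name, ("ha-", "be-", "le-", "ve-"))`.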
---
## Edge Cases and Special Handling
### Mixed Scripts
Some institution names mix scripts (e.g., Latin brand names in Chinese text):
**Example**: 中国IBM研究院
- Transliterate Chinese: Zhongguo IBM Yanjiuyuan
- Keep "IBM" as-is (already Latin)
- Abbreviation: ZIY
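One way to handle mixed scripts is to split the name into Latin and non-Latin runs and transliterate only the latter. The sketch below assumes that Latin runs consist of ASCII letters and digits; `split_by_script` and `romanize_mixed` are hypothetical names, and the `transliterate` argument stands in for whatever language handler is used (here a small lookup table, purely for illustration).

```python
import re

# Assumption: a "Latin run" is a sequence of ASCII letters and digits.
LATIN_RUN = re.compile(r"[A-Za-z0-9]+")


def split_by_script(name: str) -> list:
    """Return (segment, is_latin) pairs in their original order."""
    segments = []
    pos = 0
    for match in LATIN_RUN.finditer(name):
        if match.start() > pos:
            segments.append((name[pos:match.start()], False))
        segments.append((match.group(), True))
        pos = match.end()
    if pos < len(name):
        segments.append((name[pos:], False))
    return segments


def romanize_mixed(name: str, transliterate) -> str:
    """Keep Latin runs as-is; pass every other run to `transliterate`."""
    parts = [
        segment if is_latin else transliterate(segment)
        for segment, is_latin in split_by_script(name)
    ]
    return " ".join(parts)


# Stand-in transliterator for the example above (not a real Pinyin converter).
lookup = {"中国": "Zhongguo", "研究院": "Yanjiuyuan"}
print(romanize_mixed("中国IBM研究院", lookup.get))  # Zhongguo IBM Yanjiuyuan
```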
### Transliteration Ambiguity
When multiple valid transliterations exist, prefer:
1. ISO standard spelling
2. Institution's own romanization (if consistent)
3. Most commonly used academic romanization
### Very Long Names
If abbreviation exceeds 10 characters after applying rules:
1. Truncate to 10 characters
2. Ensure truncation doesn't create ambiguous abbreviation
3. Document truncation in `ghcid.notes`
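The truncation rule can be sketched as follows. The function name `truncate_abbreviation` is hypothetical, and "ambiguous" is interpreted here as a collision with an abbreviation that is already in use, which is an assumption. The returned note is the text one might store in `ghcid.notes`.

```python
MAX_ABBREVIATION_LENGTH = 10


def truncate_abbreviation(abbreviation, existing, max_length=MAX_ABBREVIATION_LENGTH):
    """Return (abbreviation, note); note is None when nothing was truncated.

    `existing` is the set of abbreviations already in use. A collision after
    truncation raises ValueError so that it can be resolved by hand.
    """
    if len(abbreviation) <= max_length:
        return abbreviation, None
    truncated = abbreviation[:max_length]
    if truncated in existing:
        raise ValueError(f"truncated abbreviation {truncated!r} is already in use")
    note = (
        f"Abbreviation truncated from {abbreviation} to {truncated} "
        f"({max_length}-character limit)"
    )
    return truncated, note


print(truncate_abbreviation("GJB", set()))
# ('GJB', None)
print(truncate_abbreviation("ABCDEFGHIJKL", set())[0])
# ABCDEFGHIJ
```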
---
## Python Implementation Reference
For the complete Python implementation of transliteration functions, see:
- `.opencode/TRANSLITERATION_STANDARDS.md` - Full code with all language handlers
- `scripts/transliterate_emic_names.py` - Production script for batch transliteration
### Quick Reference Function
```python
from transliteration import transliterate_for_abbreviation

# Example usage for a selection of supported languages
examples = {
    'ru': 'Российская государственная библиотека',
    'zh': '中国国家图书馆',
    'ja': '国立国会図書館',
    'ko': '국립중앙박물관',
    'ar': 'المكتبة الوطنية للمملكة المغربية',
    'he': 'הספרייה הלאומית',
    'hi': 'राष्ट्रीय अभिलेखागार',
    'el': 'Εθνική Βιβλιοθήκη της Ελλάδας',
}

# Print each emic name followed by its Latin transliteration
for lang, name in examples.items():
    latin = transliterate_for_abbreviation(name, lang)
    print(f'{lang}: {name}')
    print(f'  → {latin}')
```
---
## Validation Checklist
Before finalizing a transliterated abbreviation:
- [ ] Original emic name preserved in `custodian_name.emic_name`
- [ ] Language code stored in `custodian_name.name_language`
- [ ] Correct ISO standard applied for script
- [ ] Skip words removed (articles, prepositions)
- [ ] Diacritics normalized to ASCII
- [ ] Special characters removed
- [ ] Abbreviation ≤ 10 characters
- [ ] No conflicts with existing GHCIDs
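Some of these checks can be automated. The sketch below is an assumption-laden illustration: `validate_record` is a hypothetical helper, the record layout follows the YAML example in this guide, and abbreviations are assumed to consist of ASCII upper-case letters only. Whether the correct ISO standard and skip-word list were applied still has to be reviewed by hand.

```python
import re

# Assumption: a valid abbreviation is 1 to 10 ASCII upper-case letters.
ABBREVIATION_PATTERN = re.compile(r"[A-Z]{1,10}")


def validate_record(record: dict, abbreviation: str, existing_ghcids: set) -> list:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    name = record.get("custodian_name", {})
    if not name.get("emic_name"):
        problems.append("custodian_name.emic_name is missing")
    if not name.get("name_language"):
        problems.append("custodian_name.name_language is missing")
    if not ABBREVIATION_PATTERN.fullmatch(abbreviation):
        problems.append("abbreviation must be 1-10 ASCII upper-case letters")
    ghcid = record.get("ghcid", {}).get("ghcid_current")
    if ghcid in existing_ghcids:
        problems.append(f"GHCID {ghcid} conflicts with an existing record")
    return problems


record = {
    "custodian_name": {"emic_name": "국립중앙박물관", "name_language": "ko"},
    "ghcid": {"ghcid_current": "KR-SO-SEO-M-GJB"},
}
print(validate_record(record, "GJB", set()))  # []
```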
---
## See Also
- `.opencode/TRANSLITERATION_STANDARDS.md` - Technical rules and Python code
- `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` - Character filtering rules
- `AGENTS.md` - Rule 12: Non-Latin Script Transliteration
- `docs/PERSISTENT_IDENTIFIERS.md` - GHCID specification
---
## Changelog
| Date | Change |
|------|--------|
| 2025-12-08 | Initial document created with 21 language examples |