

Transliteration Conventions for Heritage Custodian Names

Document Type: User Guide
Version: 1.0
Last Updated: 2025-12-08
Related Rules: .opencode/TRANSLITERATION_STANDARDS.md, AGENTS.md (Rule 12)


Overview

This document provides examples and guidance for transliterating heritage institution names from non-Latin scripts to Latin characters. Transliteration is required for generating GHCID abbreviations, but the original emic name is always preserved.

Key Principles

  1. Emic name preserved - Original script stored in custodian_name.emic_name
  2. ISO standards used - Recognized international transliteration standards
  3. Deterministic output - Same input always produces same Latin output
  4. Abbreviation purpose only - Transliteration is for GHCID generation, not display

Language-by-Language Examples

Russian (Cyrillic - ISO 9:1995)

Dataset Statistics: 13 institutions

Emic Name | Transliterated | Abbreviation
Институт восточных рукописей РАН | Institut vostočnyh rukopisej RAN | IVRR
Российская государственная библиотека | Rossijskaâ gosudarstvennaâ biblioteka | RGB
Государственный архив Российской Федерации | Gosudarstvennyj arhiv Rossijskoj Federacii | GARF

Skip Words (Russian): None significant (Russian doesn't use articles)

Character Mapping:

А → A    Б → B    В → V    Г → G    Д → D    Е → E
Ё → Ë    Ж → Ž    З → Z    И → I    Й → J    К → K
Л → L    М → M    Н → N    О → O    П → P    Р → R
С → S    Т → T    У → U    Ф → F    Х → H    Ц → C
Ч → Č    Ш → Š    Щ → Ŝ    Ъ → ʺ    Ы → Y    Ь → ʹ
Э → È    Ю → Û    Я → Â
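
The character mapping above can be applied with a plain lookup table. The following is a minimal sketch, not the project's implementation: the names `ISO9_RU` and `transliterate_ru` are hypothetical, and the table covers only the Russian letters listed above.

```python
# Hypothetical helper: ISO 9:1995 mapping for the Russian letters listed above.
ISO9_RU = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ё": "ë", "ж": "ž", "з": "z", "и": "i", "й": "j", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "х": "h", "ц": "c",
    "ч": "č", "ш": "š", "щ": "ŝ", "ъ": "ʺ", "ы": "y", "ь": "ʹ",
    "э": "è", "ю": "û", "я": "â",
}


def transliterate_ru(text: str) -> str:
    """Map each Cyrillic letter to its Latin counterpart, preserving case."""
    out = []
    for ch in text:
        latin = ISO9_RU.get(ch.lower())
        if latin is None:
            out.append(ch)             # spaces, punctuation, Latin text pass through
        elif ch.isupper():
            out.append(latin.upper())  # keep the capitalisation of the source
        else:
            out.append(latin)
    return "".join(out)


print(transliterate_ru("Государственный архив Российской Федерации"))
```

Because the table is a fixed one-to-one lookup, the same input always yields the same output, which is what the "deterministic output" principle asks for.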

Ukrainian (Cyrillic - ISO 9:1995)

Dataset Statistics: 8 institutions

Emic Name | Transliterated | Abbreviation
Центральний державний архів громадських об'єднань України | Centralnyj deržavnyj arhiv gromadskyh objednan Ukrainy | CDAGOU
Національна бібліотека України | Nacionalna biblioteka Ukrainy | NBU

Ukrainian-specific characters:

І → I    Ї → Ji    Є → Je    Ґ → G'

Chinese (Hanyu Pinyin - ISO 7098)

Dataset Statistics: 27 institutions

Emic Name | Pinyin | Abbreviation
东巴文化博物院 | Dōngbā Wénhuà Bówùyuàn | DWB
中国第一历史档案馆 | Zhōngguó Dìyī Lìshǐ Dàng'ànguǎn | ZDLD
北京故宫博物院 | Běijīng Gùgōng Bówùyuàn | BGB
中国国家图书馆 | Zhōngguó Guójiā Túshūguǎn | ZGT

Notes:

  • Tone marks are removed for abbreviation (diacritics normalization)
  • Word boundaries follow natural semantic breaks
  • Multi-syllable words are kept together

Skip Words: None (Chinese doesn't use separate articles/prepositions)
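
Tone-mark removal can be done with Unicode decomposition from the standard library. This is a minimal sketch; the function name `strip_diacritics` is hypothetical and not taken from the project code.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop combining marks (tone marks, macrons, etc.)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


pinyin = "Zhōngguó Guójiā Túshūguǎn"
plain = strip_diacritics(pinyin)
print(plain)                                     # Zhongguo Guojia Tushuguan
print("".join(word[0] for word in plain.split()))  # ZGT
```

The same helper covers the long-vowel macrons used in Japanese romaji (ō, ū), since they decompose in the same way.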


Japanese (Modified Hepburn)

Dataset Statistics: 19 institutions

Emic Name | Romaji | Abbreviation
国立中央博物館 | Kokuritsu Chūō Hakubutsukan | KCH
東京国立博物館 | Tōkyō Kokuritsu Hakubutsukan | TKH
国立国会図書館 | Kokuritsu Kokkai Toshokan | KKT

Notes:

  • Long vowels (ō, ū) normalized to (o, u)
  • Particles typically attached to preceding word
  • Kanji compounds transliterated as single words

Korean (Revised Romanization)

Dataset Statistics: 36 institutions

Emic Name | RR Romanization | Abbreviation
독립기념관 | Dongnip Ginyeomgwan | DG
국립중앙박물관 | Gungnip Jungang Bangmulgwan | GJB
서울대학교 규장각한국학연구원 | Seoul Daehakgyo Gyujanggak Hangukhak Yeonguwon | SDGHY
국립한글박물관 | Gungnip Hangeul Bangmulgwan | GHB

Notes:

  • No diacritics in Revised Romanization (unlike McCune-Reischauer)
  • Consonant assimilation reflected in spelling
  • Spaces at natural word boundaries

Arabic (ISO 233-2)

Dataset Statistics: 8 institutions

Emic Name | Transliterated | Abbreviation
المكتبة الوطنية للمملكة المغربية | al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya | MWMM
دار الكتب المصرية | Dār al-Kutub al-Miṣrīya | DKM
المتحف الوطني العراقي | al-Matḥaf al-Waṭanī al-ʿIrāqī | MWI

Skip Words:

  • al- (definite article "the")
  • lil- (preposition li- fused with the article, "for the" / "of the")
  • After skip word removal: Maktaba, Waṭanīya, Mamlaka, Maġribīya → MWMM

Notes:

  • Right-to-left script
  • Definite article "al-" always skipped
  • Diacritics normalized (ā→a, ī→i, etc.)
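
Because the Arabic article is attached as a prefix rather than written as a separate word, skip handling has to strip it from the front of each word before the initial is taken. The sketch below illustrates one way to do that; the names `ARABIC_SKIP_PREFIXES`, `first_ascii_letter`, and `arabic_abbreviation` are hypothetical, and the prefix list is an assumption drawn only from the examples in this guide.

```python
import unicodedata

# Assumption: only the prefixes that occur in this guide's examples.
ARABIC_SKIP_PREFIXES = ("al-", "lil-")


def first_ascii_letter(word: str) -> str:
    """Return the first ASCII letter of a word, uppercased.

    NFD decomposition turns ṭ into t + combining dot, so the base letter is
    found; signs such as ʿ (ayin) are not ASCII and are passed over.
    """
    for ch in unicodedata.normalize("NFD", word):
        if ch.isascii() and ch.isalpha():
            return ch.upper()
    return ""


def arabic_abbreviation(transliterated: str) -> str:
    """Strip skip prefixes from each word, then join the initials."""
    letters = []
    for word in transliterated.split():
        lowered = word.lower()
        for prefix in ARABIC_SKIP_PREFIXES:
            if lowered.startswith(prefix):
                word = word[len(prefix):]
                break
        letters.append(first_ascii_letter(word))
    return "".join(letters)


print(arabic_abbreviation("al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya"))  # MWMM
```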

Persian/Farsi (ISO 233-3)

Dataset Statistics: 11 institutions

Emic Name | Transliterated | Abbreviation
وزارت امور خارجه ایران | Vezārat-e Omūr-e Khārejeh-ye Īrān | VOKI
کتابخانه آستان قدس رضوی | Ketābkhāneh-ye Āstān-e Qods-e Raẓavī | KAQR
مجلس شورای اسلامی | Majles-e Showrā-ye Eslāmī | MSE

Skip Words:

  • -e, -ye (ezafe connector, "of")
  • va ("and")

Persian-specific characters:

پ → p    چ → č    ژ → ž    گ → g

Hebrew (ISO 259-3)

Dataset Statistics: 4 institutions

Emic Name | Transliterated | Abbreviation
ארכיון הסיפור העממי בישראל | Arḵiyon ha-Sipur ha-ʿAmami be-Yiśraʾel | ASAY
הספרייה הלאומית | ha-Sifriya ha-Leʾumit | SL
ארכיון המדינה | Arḵiyon ha-Medina | AM

Skip Words:

  • ha- (definite article "the")
  • be- ("in")
  • le- ("to")
  • ve- ("and")

Notes:

  • Right-to-left script
  • Articles attached with hyphen
  • Silent letters (aleph, ayin) often omitted in abbreviation

Hindi (Devanagari - ISO 15919)

Dataset Statistics: 14 institutions

Emic Name | Transliterated | Abbreviation
राजस्थान प्राच्यविद्या प्रतिष्ठान | Rājasthāna Prācyavidyā Pratiṣṭhāna | RPP
राष्ट्रीय अभिलेखागार | Rāṣṭrīya Abhilekhāgāra | RA
राष्ट्रीय संग्रहालय नई दिल्ली | Rāṣṭrīya Saṅgrahālaya Naī Dillī | RSND

Skip Words:

  • ka, ki, ke ("of")
  • aur ("and")
  • mein ("in")

Notes:

  • Conjunct consonants transliterated as cluster
  • Long vowels marked (ā, ī, ū) then normalized

Greek (ISO 843)

Dataset Statistics: 2 institutions

Emic Name | Transliterated | Abbreviation
Αρχαιολογικό Μουσείο Θεσσαλονίκης | Archaiologikó Mouseío Thessaloníkīs | AMT
Εθνική Βιβλιοθήκη της Ελλάδας | Ethnikī́ Vivliothī́kī tīs Elládas | EVE

Skip Words:

  • tīs, tou ("of the")
  • kai ("and")
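
Unlike the Arabic article, the Greek skip words are free-standing tokens, so they can be filtered at the word level. A minimal sketch under that assumption follows; `GREEK_SKIP_WORDS`, `to_ascii`, and `greek_abbreviation` are hypothetical names, and the skip set holds the diacritic-free forms of the words listed above.

```python
import unicodedata

# Diacritic-free forms of the skip words listed above (tīs, tou, kai).
GREEK_SKIP_WORDS = {"tis", "tou", "kai"}


def to_ascii(text: str) -> str:
    """Drop combining marks after NFD decomposition (ī́ becomes i)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def greek_abbreviation(transliterated: str) -> str:
    """Normalize first, so that 'tīs' matches the plain entry 'tis'."""
    words = to_ascii(transliterated).split()
    kept = [w for w in words if w.lower() not in GREEK_SKIP_WORDS]
    return "".join(w[0].upper() for w in kept)


print(greek_abbreviation("Ethnikī́ Vivliothī́kī tīs Elládas"))  # EVE
```

Normalizing before comparison is a design choice: it keeps the skip list in plain ASCII and makes the match independent of how accents were typed.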

Character Mapping:

Α → A    Β → V    Γ → G    Δ → D    Ε → E    Ζ → Z
Η → Ī    Θ → Th   Ι → I    Κ → K    Λ → L    Μ → M
Ν → N    Ξ → X    Ο → O    Π → P    Ρ → R    Σ → S
Τ → T    Υ → Y    Φ → F    Χ → Ch   Ψ → Ps   Ω → Ō

Thai (ISO 11940-2)

Dataset Statistics: 6 institutions

Emic Name | Transliterated | Abbreviation
สำนักหอจดหมายเหตุแห่งชาติ | Samnak Ho Chotmaihet Haeng Chat | SHCHC
หอสมุดแห่งชาติ | Ho Samut Haeng Chat | HSHC

Notes:

  • Thai script is an abugida (consonant-vowel syllables)
  • No spaces in Thai; word boundaries determined by meaning
  • Royal Thai General System also acceptable

Armenian (ISO 9985)

Dataset Statistics: 4 institutions

Emic Name | Transliterated | Abbreviation
Մատենադարան | Matenadaran | M
Ազգային Մատենադարան | Azgayin Matenadaran | AM

Notes:

  • Armenian alphabet unique to Armenian language
  • Transliteration straightforward letter-for-letter

Georgian (ISO 9984)

Dataset Statistics: 2 institutions

Emic Name | Transliterated | Abbreviation
ხელნაწერთა ეროვნული ცენტრი | Xelnacert'a Erovnuli C'entri | XEC
საქართველოს ეროვნული არქივი | Sak'art'velos Erovnuli Ark'ivi | SEA

Notes:

  • Georgian Mkhedruli script
  • In ISO 9984, apostrophes mark aspirated consonants (removed in abbreviation)

Complete Workflow Example

Step-by-Step: Korean Institution

Institution: National Museum of Korea

  1. Emic Name (Original Script):

    국립중앙박물관
    
  2. Language Detection: Korean (ko)

  3. Transliterate using Revised Romanization:

    Gungnip Jungang Bangmulgwan
    
  4. Identify Skip Words: None for Korean

  5. Extract First Letters:

    G + J + B = GJB
    
  6. Diacritic Normalization: N/A (RR has no diacritics)

  7. Final Abbreviation: GJB

  8. Store in YAML:

    custodian_name:
      emic_name: 국립중앙박물관
      name_language: ko
      english_name: National Museum of Korea
    ghcid:
      ghcid_current: KR-SO-SEO-M-GJB
      abbreviation_source: transliterated_emic
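
Steps 5 to 8 can be sketched as a small record builder. This is illustrative only: `initials` and `build_record` are hypothetical names, the GHCID prefix is passed in as-is because how it is derived is outside the scope of this guide, and the field names follow the YAML above.

```python
def initials(transliterated: str) -> str:
    """Step 5: take the first letter of each space-separated word."""
    return "".join(word[0].upper() for word in transliterated.split())


def build_record(emic_name: str, language: str, english_name: str,
                 transliterated: str, ghcid_prefix: str) -> dict:
    """Step 8: assemble the record, keeping the emic name untouched."""
    abbreviation = initials(transliterated)
    return {
        "custodian_name": {
            "emic_name": emic_name,        # original script is always preserved
            "name_language": language,
            "english_name": english_name,
        },
        "ghcid": {
            "ghcid_current": f"{ghcid_prefix}-{abbreviation}",
            "abbreviation_source": "transliterated_emic",
        },
    }
```

Called with the values from the steps above and the prefix KR-SO-SEO-M, the builder returns a record whose ghcid_current is KR-SO-SEO-M-GJB.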
    

Step-by-Step: Arabic Institution

Institution: National Library of Morocco

  1. Emic Name (Original Script):

    المكتبة الوطنية للمملكة المغربية
    
  2. Language Detection: Arabic (ar)

  3. Transliterate using ISO 233-2:

    al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya
    
  4. Identify Skip Words: al- (3 occurrences), lil- (1)

  5. After Skip Word Removal:

    Maktaba Waṭanīya Mamlaka Maġribīya
    
  6. Extract First Letters:

    M + W + M + M = MWMM
    
  7. Diacritic Normalization: ṭ→t, ī→i, ġ→g apply within the words; the extracted initials are unaffected

    MWMM (already ASCII)
    
  8. Final Abbreviation: MWMM


Edge Cases and Special Handling

Mixed Scripts

Some institution names mix scripts (e.g., Latin brand names in Chinese text):

Example: 中国IBM研究院

  • Transliterate Chinese: Zhongguo IBM Yanjiuyuan
  • Keep "IBM" as-is (already Latin)
  • Abbreviation: ZIY
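
One way to separate the Latin part from the rest is to split the name into runs before transliterating. The sketch below uses a regular expression from the standard library; `split_script_runs` is a hypothetical helper, and treating only ASCII letters and digits as "already Latin" is a simplifying assumption.

```python
import re

# ASCII letters/digits form one kind of run; any other non-space text forms another.
_RUN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]+")


def split_script_runs(name: str) -> list[tuple[str, bool]]:
    """Return (run, is_latin) pairs in their original order."""
    return [(run, run.isascii()) for run in _RUN.findall(name)]


for run, is_latin in split_script_runs("中国IBM研究院"):
    action = "keep as-is" if is_latin else "transliterate"
    print(f"{run}: {action}")
```

Each non-Latin run is then transliterated on its own, the Latin run is kept verbatim, and the initials of the resulting words (Zhongguo, IBM, Yanjiuyuan) give ZIY.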

Transliteration Ambiguity

When multiple valid transliterations exist, prefer:

  1. ISO standard spelling
  2. Institution's own romanization (if consistent)
  3. Most commonly used academic romanization

Very Long Names

If abbreviation exceeds 10 characters after applying rules:

  1. Truncate to 10 characters
  2. Ensure truncation doesn't create ambiguous abbreviation
  3. Document truncation in ghcid.notes
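
The truncation rule can be sketched as follows. This is not the project's implementation: `truncate_abbreviation` is a hypothetical name, "ambiguous" is interpreted here as "collides with an abbreviation already in use", and the returned note is meant to be stored in ghcid.notes by the caller.

```python
MAX_ABBREVIATION_LENGTH = 10


def truncate_abbreviation(abbreviation: str, existing: set[str]):
    """Return (abbreviation, note); note is None when nothing was truncated.

    Assumption: a truncated form that matches an existing abbreviation is
    treated as ambiguous and rejected rather than silently accepted.
    """
    if len(abbreviation) <= MAX_ABBREVIATION_LENGTH:
        return abbreviation, None
    truncated = abbreviation[:MAX_ABBREVIATION_LENGTH]
    if truncated in existing:
        raise ValueError(
            f"truncated abbreviation {truncated!r} collides with an existing one"
        )
    note = f"Abbreviation truncated from {abbreviation} to {truncated}"
    return truncated, note
```

Raising on a collision forces a manual decision instead of guessing at a disambiguation scheme, which this guide does not define.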

Python Implementation Reference

For the complete Python implementation of transliteration functions, see:

  • .opencode/TRANSLITERATION_STANDARDS.md - Full code with all language handlers
  • scripts/transliterate_emic_names.py - Production script for batch transliteration

Quick Reference Function

from transliteration import transliterate_for_abbreviation

# Example usage for all supported languages
examples = {
    'ru': 'Российская государственная библиотека',
    'zh': '中国国家图书馆',
    'ja': '国立国会図書館',
    'ko': '국립중앙박물관',
    'ar': 'المكتبة الوطنية للمملكة المغربية',
    'he': 'הספרייה הלאומית',
    'hi': 'राष्ट्रीय अभिलेखागार',
    'el': 'Εθνική Βιβλιοθήκη της Ελλάδας',
}

for lang, name in examples.items():
    latin = transliterate_for_abbreviation(name, lang)
    print(f'{lang}: {name}')
    print(f'    → {latin}')

Validation Checklist

Before finalizing a transliterated abbreviation:

  • Original emic name preserved in custodian_name.emic_name
  • Language code stored in custodian_name.name_language
  • Correct ISO standard applied for script
  • Skip words removed (articles, prepositions)
  • Diacritics normalized to ASCII
  • Special characters removed
  • Abbreviation ≤ 10 characters
  • No conflicts with existing GHCIDs
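
Several of the checklist items can be checked mechanically. The sketch below is illustrative: `validate_record` is a hypothetical function, the record layout follows the YAML shown earlier, and it assumes that the abbreviation is the last hyphen-separated component of ghcid_current. Whether the correct ISO standard was applied still needs human review.

```python
def validate_record(record: dict, existing_ghcids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means all automated checks pass."""
    problems = []
    name = record.get("custodian_name", {})
    ghcid = record.get("ghcid", {}).get("ghcid_current", "")
    # Assumption: the abbreviation is the final component, e.g. KR-SO-SEO-M-GJB -> GJB.
    abbreviation = ghcid.rsplit("-", 1)[-1]

    if not name.get("emic_name"):
        problems.append("emic_name is missing")
    if not name.get("name_language"):
        problems.append("name_language is missing")
    if not (abbreviation.isascii() and abbreviation.isalpha() and abbreviation.isupper()):
        problems.append("abbreviation must consist of uppercase ASCII letters only")
    if len(abbreviation) > 10:
        problems.append("abbreviation exceeds 10 characters")
    if ghcid in existing_ghcids:
        problems.append("GHCID conflicts with an existing identifier")
    return problems
```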

See Also

  • .opencode/TRANSLITERATION_STANDARDS.md - Technical rules and Python code
  • .opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md - Character filtering rules
  • AGENTS.md - Rule 12: Non-Latin Script Transliteration
  • docs/PERSISTENT_IDENTIFIERS.md - GHCID specification

Changelog

Date | Change
2025-12-08 | Initial document created with 21 language examples