docs: add Z.AI GLM API and transliteration rules to AGENTS.md

- Add Rule 11 for Z.AI Coding Plan API usage (not BigModel)
- Add transliteration standards for non-Latin scripts
- Document GLM model options and Python implementation
kempersc 2025-12-08 14:58:22 +01:00
parent 40bd3cb8f5
commit 271545fa8b
6 changed files with 2172 additions and 7 deletions


@ -1,17 +1,102 @@
# Abbreviation Character Filtering Rules
**Rule ID**: ABBREV-CHAR-FILTER
**Status**: MANDATORY
**Applies To**: GHCID abbreviation component generation
**Created**: 2025-12-07
**Updated**: 2025-12-08 (added diacritics rule)
---
## Summary
**When generating abbreviations for GHCID, ONLY ASCII uppercase letters (A-Z) are permitted. Both special characters AND diacritics MUST be removed/normalized.**
This is a **MANDATORY** rule. Abbreviations containing special characters or diacritics are INVALID and must be regenerated.
### Two Mandatory Sub-Rules:
1. **ABBREV-SPECIAL-CHAR**: Remove all special characters and symbols
2. **ABBREV-DIACRITICS**: Normalize all diacritics to ASCII equivalents
---
## Rule 1: Diacritics MUST Be Normalized to ASCII (ABBREV-DIACRITICS)
**Diacritics (accented characters) MUST be normalized to their ASCII base letter equivalents.**
### Example (Real Case)
```
❌ WRONG: CZ-VY-TEL-L-VHSPAOČRZS (contains Č)
✅ CORRECT: CZ-VY-TEL-L-VHSPAOCRZS (ASCII only)
```
### Diacritics Normalization Table
| Diacritic | ASCII | Example |
|-----------|-------|---------|
| Á, À, Â, Ã, Ä, Å, Ā | A | "Ålborg" → A |
| Č, Ć, Ç | C | "Český" → C |
| Ď | D | "Ďáblice" → D |
| É, È, Ê, Ë, Ě, Ē | E | "Éire" → E |
| Í, Ì, Î, Ï, Ī | I | "Ísland" → I |
| Ñ, Ń, Ň | N | "España" → N |
| Ó, Ò, Ô, Õ, Ö, Ø, Ō | O | "Österreich" → O |
| Ř | R | "Říčany" → R |
| Š, Ś, Ş | S | "Šumperk" → S |
| Ť | T | "Ťažký" → T |
| Ú, Ù, Û, Ü, Ů, Ū | U | "Ústí" → U |
| Ý, Ÿ | Y | "Ýmir" → Y |
| Ž, Ź, Ż | Z | "Žilina" → Z |
| Ł | L | "Łódź" → L |
| Æ | AE | "Ærø" → AE |
| Œ | OE | "Œuvre" → OE |
| ß | SS | "Straße" → SS |
### Implementation
```python
import unicodedata

def normalize_diacritics(text: str) -> str:
    """
    Normalize diacritics to ASCII equivalents.
    Examples:
        "Č" → "C"
        "Ř" → "R"
        "Ö" → "O"
        "ñ" → "n"
    """
    # Letters that NFD cannot decompose into base letter + combining mark
    # (required by the normalization table: ß, Æ, Œ, Ø, Ł)
    special = {'ß': 'SS', 'Æ': 'AE', 'æ': 'ae', 'Œ': 'OE', 'œ': 'oe',
               'Ø': 'O', 'ø': 'o', 'Ł': 'L', 'ł': 'l'}
    text = ''.join(special.get(c, c) for c in text)
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', text)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    return ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

# Example
normalize_diacritics("VHSPAOČRZS")  # Returns "VHSPAOCRZS"
```
### Languages Commonly Affected
| Language | Common Diacritics | Example Institution |
|----------|-------------------|---------------------|
| **Czech** | Č, Ř, Š, Ž, Ě, Ů | Vlastivědné muzeum → VM |
| **Polish** | Ł, Ń, Ó, Ś, Ź, Ż, Ą, Ę | Biblioteka Łódzka → BL |
| **German** | Ä, Ö, Ü, ß | Österreichische Nationalbibliothek → ON |
| **French** | É, È, Ê, Ç, Ô | Bibliothèque nationale → BN |
| **Spanish** | Ñ, Á, É, Í, Ó, Ú | Museo Nacional → MN |
| **Portuguese** | Ã, Õ, Ç, Á, É | Biblioteca Nacional → BN |
| **Nordic** | Å, Ä, Ö, Ø, Æ | Nationalmuseet → N |
| **Turkish** | Ç, Ğ, İ, Ö, Ş, Ü | İstanbul Üniversitesi → IU |
| **Hungarian** | Á, É, Í, Ó, Ö, Ő, Ú, Ü, Ű | Országos Levéltár → OL |
| **Romanian** | Ă, Â, Î, Ș, Ț | Biblioteca Națională → BN |
---
## Rule 2: Special Characters MUST Be Removed (ABBREV-SPECIAL-CHAR)
---


@ -0,0 +1,787 @@
# Transliteration Standards for Non-Latin Scripts
**Rule ID**: TRANSLIT-ISO
**Status**: MANDATORY
**Applies To**: GHCID abbreviation generation from emic names in non-Latin scripts
**Created**: 2025-12-08
---
## Summary
**When generating GHCID abbreviations from institution names written in non-Latin scripts, the emic name MUST first be transliterated to Latin characters using the designated ISO or recognized standard for that script.**
This rule affects **170 institutions** across **21 languages** with non-Latin writing systems.
### Key Principles
1. **Emic name is preserved** - The original script is stored in `custodian_name.emic_name`
2. **Transliteration is for processing only** - Used to generate abbreviations
3. **ISO/recognized standards required** - No ad-hoc romanization
4. **Deterministic output** - Same input always produces same Latin output
5. **Existing GHCIDs grandfathered** - Only applies to NEW custodians
---
## Transliteration Standards by Script/Language
### Cyrillic Scripts
| Language | ISO Code | Standard | Library/Tool | Notes |
|----------|----------|----------|--------------|-------|
| **Russian** | ru | ISO 9:1995 | `transliterate` | Scientific transliteration |
| **Ukrainian** | uk | ISO 9:1995 | `transliterate` | Includes Ukrainian-specific letters |
| **Bulgarian** | bg | ISO 9:1995 | `transliterate` | Uses same Cyrillic base |
| **Serbian** | sr | ISO 9:1995 | `transliterate` | Serbian Cyrillic variant |
| **Kazakh** | kk | ISO 9:1995 | `transliterate` | Cyrillic-based (pre-2023) |
**ISO 9:1995 Mapping (Core Characters)**:
| Cyrillic | Latin | Cyrillic | Latin |
|----------|-------|----------|-------|
| А а | A a | П п | P p |
| Б б | B b | Р р | R r |
| В в | V v | С с | S s |
| Г г | G g | Т т | T t |
| Д д | D d | У у | U u |
| Е е | E e | Ф ф | F f |
| Ё ё | Ë ë | Х х | H h |
| Ж ж | Ž ž | Ц ц | C c |
| З з | Z z | Ч ч | Č č |
| И и | I i | Ш ш | Š š |
| Й й | J j | Щ щ | Ŝ ŝ |
| К к | K k | Ъ ъ | ʺ (hard sign) |
| Л л | L l | Ы ы | Y y |
| М м | M m | Ь ь | ʹ (soft sign) |
| Н н | N n | Э э | È è |
| О о | O o | Ю ю | Û û |
| | | Я я | Â â |
**Example**:
```
Input: Институт восточных рукописей РАН
ISO 9: Institut vostočnyh rukopisej RAN
Abbrev: IVRRAN → IVRRAN (after diacritic normalization)
```
---
### CJK Scripts
#### Chinese (Hanzi)
| Variant | Standard | Library/Tool | Notes |
|---------|----------|--------------|-------|
| Simplified | Hanyu Pinyin (ISO 7098) | `pypinyin` | Standard PRC romanization |
| Traditional | Hanyu Pinyin | `pypinyin` | Same standard applies |
**Pinyin Rules**:
- Tone marks are OMITTED for abbreviation (diacritics removed anyway)
- Word boundaries follow natural spacing
- Proper nouns capitalized
**Example**:
```
Input: 东巴文化博物院
Pinyin: Dōngbā Wénhuà Bówùyuàn
ASCII: Dongba Wenhua Bowuyuan
Abbrev: DWB
```
#### Japanese (Kanji/Kana)
| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| Modified Hepburn | `pykakasi`, `romkan` | Most widely used internationally |
**Hepburn Rules**:
- Long vowels: ō, ū (normalized to o, u for abbreviation)
- Particles: は (wa), を (wo), へ (e)
- Syllabic n: ん = n (before vowels: n')
**Example**:
```
Input: 国立中央博物館
Romaji: Kokuritsu Chūō Hakubutsukan
ASCII: Kokuritsu Chuo Hakubutsukan
Abbrev: KCH
```
#### Korean (Hangul)
| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| Revised Romanization (RR) | `korean-romanizer`, `hangul-romanize` | Official South Korean standard (2000) |
**RR Rules**:
- No diacritics (unlike McCune-Reischauer)
- Consonant assimilation reflected in spelling
- Word boundaries at natural breaks
**Example**:
```
Input: 독립기념관
RR: Dongnip Ginyeomgwan
Abbrev: DG
```
---
### Arabic Script
| Language | ISO Code | Standard | Library/Tool | Notes |
|----------|----------|----------|--------------|-------|
| **Arabic** | ar | ISO 233-2:1993 | `arabic-transliteration` | Simplified standard |
| **Persian/Farsi** | fa | ISO 233-3:1999 | `persian-transliteration` | Persian extensions |
| **Urdu** | ur | ISO 233-3 + Urdu extensions | `urdu-transliteration` | Additional characters |
**ISO 233 Mapping (Core Arabic)**:
| Arabic | Name | Latin |
|--------|------|-------|
| ا | Alif | ā / a |
| ب | Ba | b |
| ت | Ta | t |
| ث | Tha | ṯ |
| ج | Jim | ǧ / j |
| ح | Ha | ḥ |
| خ | Kha | ḫ / kh |
| د | Dal | d |
| ذ | Dhal | ḏ |
| ر | Ra | r |
| ز | Zay | z |
| س | Sin | s |
| ش | Shin | š / sh |
| ص | Sad | ṣ |
| ض | Dad | ḍ |
| ط | Ta | ṭ |
| ظ | Za | ẓ |
| ع | Ayn | ʿ |
| غ | Ghayn | ġ / gh |
| ف | Fa | f |
| ق | Qaf | q |
| ك | Kaf | k |
| ل | Lam | l |
| م | Mim | m |
| ن | Nun | n |
| ه | Ha | h |
| و | Waw | w / ū |
| ي | Ya | y / ī |
**Example (Arabic)**:
```
Input: المكتبة الوطنية للمملكة المغربية
ISO: al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya
ASCII: al-Maktaba al-Wataniya lil-Mamlaka al-Maghribiya
Abbrev: MWMM (skip "al-" articles)
```
**Example (Persian)**:
```
Input: وزارت امور خارجه ایران
ISO: Vezārat-e Omur-e Khāreǧe-ye Īrān
ASCII: Vezarat-e Omur-e Khareje-ye Iran
Abbrev: VOKI (skip "e" connector)
```
---
### Hebrew Script
| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| ISO 259-3:1999 | `hebrew-transliteration` | Simplified romanization |
**ISO 259 Mapping**:
| Hebrew | Name | Latin |
|--------|------|-------|
| א | Aleph | ʾ / (silent) |
| ב | Bet | b / v |
| ג | Gimel | g |
| ד | Dalet | d |
| ה | He | h |
| ו | Vav | v / o / u |
| ז | Zayin | z |
| ח | Chet | ḥ / ch |
| ט | Tet | ṭ / t |
| י | Yod | y / i |
| כ ך | Kaf | k / kh |
| ל | Lamed | l |
| מ ם | Mem | m |
| נ ן | Nun | n |
| ס | Samekh | s |
| ע | Ayin | ʿ / (silent) |
| פ ף | Pe | p / f |
| צ ץ | Tsade | ṣ / ts |
| ק | Qof | q / k |
| ר | Resh | r |
| ש | Shin/Sin | š / s |
| ת | Tav | t |
**Example**:
```
Input: ארכיון הסיפור העממי בישראל
ISO: Arḵiyon ha-Sipur ha-ʿAmami be-Yiśraʾel
ASCII: Arkhiyon ha-Sipur ha-Amami be-Yisrael
Abbrev: ASAY (skip "ha-" and "be-" articles)
```
---
### Greek Script
| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| ISO 843:1997 | `greek-transliteration` | Romanization of Greek |
**ISO 843 Mapping**:
| Greek | Latin | Greek | Latin |
|-------|-------|-------|-------|
| Α α | A a | Ν ν | N n |
| Β β | V v | Ξ ξ | X x |
| Γ γ | G g | Ο ο | O o |
| Δ δ | D d | Π π | P p |
| Ε ε | E e | Ρ ρ | R r |
| Ζ ζ | Z z | Σ σ ς | S s |
| Η η | Ī ī | Τ τ | T t |
| Θ θ | Th th | Υ υ | Y y |
| Ι ι | I i | Φ φ | F f |
| Κ κ | K k | Χ χ | Ch ch |
| Λ λ | L l | Ψ ψ | Ps ps |
| Μ μ | M m | Ω ω | Ō ō |
**Example**:
```
Input: Αρχαιολογικό Μουσείο Θεσσαλονίκης
ISO: Archaiologikó Mouseío Thessaloníkīs
ASCII: Archaiologiko Mouseio Thessalonikis
Abbrev: AMT
```
---
### Indic Scripts
| Language | Script | Standard | Library/Tool |
|----------|--------|----------|--------------|
| **Hindi** | Devanagari | ISO 15919 | `indic-transliteration` |
| **Bengali** | Bengali | ISO 15919 | `indic-transliteration` |
| **Nepali** | Devanagari | ISO 15919 | `indic-transliteration` |
| **Sinhala** | Sinhala | ISO 15919 | `indic-transliteration` |
**ISO 15919 Core Consonants (Devanagari)**:
| Devanagari | Latin | Devanagari | Latin |
|------------|-------|------------|-------|
| क | ka | त | ta |
| ख | kha | थ | tha |
| ग | ga | द | da |
| घ | gha | ध | dha |
| ङ | ṅa | न | na |
| च | ca | प | pa |
| छ | cha | फ | pha |
| ज | ja | ब | ba |
| झ | jha | भ | bha |
| ञ | ña | म | ma |
| ट | ṭa | य | ya |
| ठ | ṭha | र | ra |
| ड | ḍa | ल | la |
| ढ | ḍha | व | va |
| ण | ṇa | श | śa |
| | | ष | ṣa |
| | | स | sa |
| | | ह | ha |
**Example (Hindi)**:
```
Input: राजस्थान प्राच्यविद्या प्रतिष्ठान
ISO: Rājasthāna Prācyavidyā Pratiṣṭhāna
ASCII: Rajasthana Pracyavidya Pratishthana
Abbrev: RPP
```
---
### Southeast Asian Scripts
| Language | Script | Standard | Library/Tool |
|----------|--------|----------|--------------|
| **Thai** | Thai | ISO 11940-2 | `thai-romanization` |
| **Khmer** | Khmer | ALA-LC | `khmer-romanization` |
**Thai Example**:
```
Input: สำนักหอจดหมายเหตุแห่งชาติ
ISO: Samnak Ho Chotmaihet Haeng Chat
Abbrev: SHCHC
```
**Khmer Example**:
```
Input: សារមន្ទីរទួលស្លែង
ALA-LC: Sāramanṭīr Tūl Slèṅ
ASCII: Saramantir Tuol Sleng
Abbrev: STS
```
---
### Other Scripts
| Language | Script | Standard | Library/Tool |
|----------|--------|----------|--------------|
| **Armenian** | Armenian | ISO 9985 | `armenian-transliteration` |
| **Georgian** | Georgian | ISO 9984 | `georgian-transliteration` |
**Armenian Example**:
```
Input: Մատենադարան
ISO: Matenadaran
Abbrev: M
```
**Georgian Example**:
```
Input: ხელნაწერთა ეროვნული ცენტრი
ISO: Xelnawerti Erovnuli C'ent'ri
ASCII: Khelnawerti Erovnuli Centri
Abbrev: KEC
```
---
## Implementation
### Python Transliteration Utility
```python
#!/usr/bin/env python3
"""
Transliteration utility for GHCID abbreviation generation.
Uses ISO and recognized standards for each script/language.
"""
import re
import unicodedata
from typing import Optional

# Try importing transliteration libraries
try:
    from pypinyin import pinyin, Style
    HAS_PYPINYIN = True
except ImportError:
    HAS_PYPINYIN = False

try:
    import pykakasi
    HAS_PYKAKASI = True
except ImportError:
    HAS_PYKAKASI = False

try:
    from transliterate import translit
    HAS_TRANSLITERATE = True
except ImportError:
    HAS_TRANSLITERATE = False


def detect_script(text: str) -> str:
    """
    Detect the primary script of the input text.

    Returns one of:
    - 'latin': Latin alphabet
    - 'cyrillic': Cyrillic script
    - 'chinese': Chinese characters (Hanzi)
    - 'japanese': Japanese (mixed Kanji/Kana)
    - 'korean': Korean Hangul
    - 'arabic': Arabic script (includes Persian, Urdu)
    - 'hebrew': Hebrew script
    - 'greek': Greek script
    - 'devanagari': Devanagari (Hindi, Nepali, Sanskrit)
    - 'bengali': Bengali script
    - 'thai': Thai script
    - 'armenian': Armenian script
    - 'georgian': Georgian script
    - 'unknown': Cannot determine
    """
    script_ranges = {
        'cyrillic': (0x0400, 0x04FF),
        'arabic': (0x0600, 0x06FF),
        'hebrew': (0x0590, 0x05FF),
        'devanagari': (0x0900, 0x097F),
        'bengali': (0x0980, 0x09FF),
        'thai': (0x0E00, 0x0E7F),
        'greek': (0x0370, 0x03FF),
        'armenian': (0x0530, 0x058F),
        'georgian': (0x10A0, 0x10FF),
        'korean': (0xAC00, 0xD7AF),  # Hangul syllables
        'japanese_hiragana': (0x3040, 0x309F),
        'japanese_katakana': (0x30A0, 0x30FF),
        'chinese': (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    }
    script_counts = {script: 0 for script in script_ranges}
    latin_count = 0

    for char in text:
        code = ord(char)
        # Check Latin
        if ('a' <= char <= 'z') or ('A' <= char <= 'Z'):
            latin_count += 1
            continue
        # Check other scripts
        for script, (start, end) in script_ranges.items():
            if start <= code <= end:
                script_counts[script] += 1
                break

    # Determine primary script
    if latin_count > 0 and all(c == 0 for c in script_counts.values()):
        return 'latin'

    # Find max non-Latin script
    max_script = max(script_counts, key=script_counts.get)
    if script_counts[max_script] > 0:
        # Handle Japanese (can be Kanji + Kana)
        if max_script in ('japanese_hiragana', 'japanese_katakana', 'chinese'):
            if script_counts['japanese_hiragana'] > 0 or script_counts['japanese_katakana'] > 0:
                return 'japanese'
            return 'chinese'
        return max_script
    return 'latin' if latin_count > 0 else 'unknown'


def transliterate_cyrillic(text: str, lang: str = 'ru') -> str:
    """Transliterate Cyrillic text using ISO 9."""
    if HAS_TRANSLITERATE:
        try:
            return translit(text, lang, reversed=True)
        except Exception:
            pass
    # Fallback: basic Cyrillic to Latin mapping
    cyrillic_map = {
        'А': 'A', 'Б': 'B', 'В': 'V', 'Г': 'G', 'Д': 'D', 'Е': 'E',
        'Ё': 'E', 'Ж': 'Zh', 'З': 'Z', 'И': 'I', 'Й': 'Y', 'К': 'K',
        'Л': 'L', 'М': 'M', 'Н': 'N', 'О': 'O', 'П': 'P', 'Р': 'R',
        'С': 'S', 'Т': 'T', 'У': 'U', 'Ф': 'F', 'Х': 'Kh', 'Ц': 'Ts',
        'Ч': 'Ch', 'Ш': 'Sh', 'Щ': 'Shch', 'Ъ': '', 'Ы': 'Y', 'Ь': '',
        'Э': 'E', 'Ю': 'Yu', 'Я': 'Ya',
        'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e',
        'ё': 'e', 'ж': 'zh', 'з': 'z', 'и': 'i', 'й': 'y', 'к': 'k',
        'л': 'l', 'м': 'm', 'н': 'n', 'о': 'o', 'п': 'p', 'р': 'r',
        'с': 's', 'т': 't', 'у': 'u', 'ф': 'f', 'х': 'kh', 'ц': 'ts',
        'ч': 'ch', 'ш': 'sh', 'щ': 'shch', 'ъ': '', 'ы': 'y', 'ь': '',
        'э': 'e', 'ю': 'yu', 'я': 'ya',
        # Ukrainian additions
        'І': 'I', 'і': 'i', 'Ї': 'Yi', 'ї': 'yi', 'Є': 'Ye', 'є': 'ye',
        'Ґ': 'G', 'ґ': 'g',
    }
    return ''.join(cyrillic_map.get(c, c) for c in text)


def transliterate_chinese(text: str) -> str:
    """Transliterate Chinese to Pinyin."""
    if HAS_PYPINYIN:
        # Get pinyin without tone marks
        result = pinyin(text, style=Style.NORMAL)
        return ' '.join(''.join(p) for p in result)
    # Fallback: return as-is (requires manual handling)
    return text


def transliterate_japanese(text: str) -> str:
    """Transliterate Japanese to Romaji (Hepburn)."""
    if HAS_PYKAKASI:
        kakasi = pykakasi.kakasi()
        result = kakasi.convert(text)
        return ' '.join(item['hepburn'] for item in result)
    # Fallback: return as-is
    return text


def transliterate_korean(text: str) -> str:
    """Transliterate Korean Hangul to Revised Romanization."""
    # Korean romanization is complex - use library if available
    try:
        from korean_romanizer.romanizer import Romanizer
        r = Romanizer(text)
        return r.romanize()
    except ImportError:
        pass
    # Fallback: return as-is (full RR requires Hangul syllable decomposition)
    return text


def transliterate_arabic(text: str) -> str:
    """Transliterate Arabic script to Latin (ISO 233 simplified)."""
    arabic_map = {
        'ا': 'a', 'أ': 'a', 'إ': 'i', 'آ': 'a',
        'ب': 'b', 'ت': 't', 'ث': 'th', 'ج': 'j',
        'ح': 'h', 'خ': 'kh', 'د': 'd', 'ذ': 'dh',
        'ر': 'r', 'ز': 'z', 'س': 's', 'ش': 'sh',
        'ص': 's', 'ض': 'd', 'ط': 't', 'ظ': 'z',
        'ع': "'", 'غ': 'gh', 'ف': 'f', 'ق': 'q',
        'ك': 'k', 'ل': 'l', 'م': 'm', 'ن': 'n',
        'ه': 'h', 'و': 'w', 'ي': 'y', 'ى': 'a',
        'ة': 'a', 'ء': "'",
        # Persian additions
        'پ': 'p', 'چ': 'ch', 'ژ': 'zh', 'گ': 'g',
        'ک': 'k', 'ی': 'i',
    }
    result = []
    for c in text:
        if c in arabic_map:
            result.append(arabic_map[c])
        elif c == ' ' or c.isalnum():
            result.append(c)
    return ''.join(result)


def transliterate_hebrew(text: str) -> str:
    """Transliterate Hebrew to Latin (ISO 259 simplified)."""
    hebrew_map = {
        'א': '', 'ב': 'v', 'ג': 'g', 'ד': 'd', 'ה': 'h',
        'ו': 'v', 'ז': 'z', 'ח': 'ch', 'ט': 't', 'י': 'y',
        'כ': 'k', 'ך': 'k', 'ל': 'l', 'מ': 'm', 'ם': 'm',
        'נ': 'n', 'ן': 'n', 'ס': 's', 'ע': '', 'פ': 'f',
        'ף': 'f', 'צ': 'ts', 'ץ': 'ts', 'ק': 'k', 'ר': 'r',
        'ש': 'sh', 'ת': 't',
    }
    result = []
    for c in text:
        if c in hebrew_map:
            result.append(hebrew_map[c])
        elif c == ' ' or c.isalnum():
            result.append(c)
    return ''.join(result)


def transliterate_greek(text: str) -> str:
    """Transliterate Greek to Latin (ISO 843)."""
    greek_map = {
        'Α': 'A', 'α': 'a', 'Β': 'V', 'β': 'v', 'Γ': 'G', 'γ': 'g',
        'Δ': 'D', 'δ': 'd', 'Ε': 'E', 'ε': 'e', 'Ζ': 'Z', 'ζ': 'z',
        'Η': 'I', 'η': 'i', 'Θ': 'Th', 'θ': 'th', 'Ι': 'I', 'ι': 'i',
        'Κ': 'K', 'κ': 'k', 'Λ': 'L', 'λ': 'l', 'Μ': 'M', 'μ': 'm',
        'Ν': 'N', 'ν': 'n', 'Ξ': 'X', 'ξ': 'x', 'Ο': 'O', 'ο': 'o',
        'Π': 'P', 'π': 'p', 'Ρ': 'R', 'ρ': 'r', 'Σ': 'S', 'σ': 's',
        'ς': 's', 'Τ': 'T', 'τ': 't', 'Υ': 'Y', 'υ': 'y', 'Φ': 'F',
        'φ': 'f', 'Χ': 'Ch', 'χ': 'ch', 'Ψ': 'Ps', 'ψ': 'ps',
        'Ω': 'O', 'ω': 'o',
    }
    return ''.join(greek_map.get(c, c) for c in text)


def transliterate_devanagari(text: str) -> str:
    """Transliterate Devanagari to Latin (ISO 15919 simplified)."""
    try:
        from indic_transliteration import sanscript
        from indic_transliteration.sanscript import transliterate as indic_translit
        return indic_translit(text, sanscript.DEVANAGARI, sanscript.IAST)
    except ImportError:
        pass
    # Fallback: return as-is (a full Devanagari character map would be needed)
    return text


def transliterate_thai(text: str) -> str:
    """Transliterate Thai to Latin (Royal Thai General System)."""
    try:
        from thaispellcheck import transliterate as thai_translit
        return thai_translit(text)
    except ImportError:
        pass
    # Fallback
    return text


def transliterate(text: str, lang: Optional[str] = None) -> str:
    """
    Transliterate text from non-Latin script to Latin.

    Args:
        text: Input text in any script
        lang: Optional ISO 639-1 language code (e.g., 'ru', 'zh', 'ko').
              If not provided, script is auto-detected.

    Returns:
        Transliterated text in Latin characters.
    """
    if not text:
        return text

    # Detect script if language not provided
    if lang:
        script_map = {
            'ru': 'cyrillic', 'uk': 'cyrillic', 'bg': 'cyrillic',
            'sr': 'cyrillic', 'kk': 'cyrillic',
            'zh': 'chinese',
            'ja': 'japanese',
            'ko': 'korean',
            'ar': 'arabic', 'fa': 'arabic', 'ur': 'arabic',
            'he': 'hebrew',
            'el': 'greek',
            'hi': 'devanagari', 'ne': 'devanagari',
            'bn': 'bengali',
            'th': 'thai',
            'hy': 'armenian',
            'ka': 'georgian',
        }
        script = script_map.get(lang, detect_script(text))
    else:
        script = detect_script(text)

    # Apply appropriate transliteration
    transliterators = {
        'cyrillic': lambda t: transliterate_cyrillic(t, lang or 'ru'),
        'chinese': transliterate_chinese,
        'japanese': transliterate_japanese,
        'korean': transliterate_korean,
        'arabic': transliterate_arabic,
        'hebrew': transliterate_hebrew,
        'greek': transliterate_greek,
        'devanagari': transliterate_devanagari,
        'thai': transliterate_thai,
        'latin': lambda t: t,  # No transliteration needed
    }
    translit_func = transliterators.get(script, lambda t: t)
    result = translit_func(text)

    # Normalize diacritics to ASCII
    normalized = unicodedata.normalize('NFD', result)
    return ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')


def transliterate_for_abbreviation(emic_name: str, lang: str) -> str:
    """
    Transliterate emic name for GHCID abbreviation generation.

    This is the main entry point for GHCID generation scripts.

    Args:
        emic_name: Institution name in original script
        lang: ISO 639-1 language code

    Returns:
        Transliterated name ready for abbreviation extraction
    """
    # Step 1: Transliterate to Latin
    latin = transliterate(emic_name, lang)
    # Step 2: Normalize diacritics (handled in transliterate())
    # Step 3: Remove special characters (except spaces)
    clean = re.sub(r'[^a-zA-Z\s]', ' ', latin)
    # Step 4: Normalize whitespace
    return ' '.join(clean.split())


# Example usage
if __name__ == '__main__':
    test_cases = [
        ('Институт восточных рукописей РАН', 'ru'),
        ('东巴文化博物院', 'zh'),
        ('독립기념관', 'ko'),
        ('राजस्थान प्राच्यविद्या प्रतिष्ठान', 'hi'),
        ('المكتبة الوطنية للمملكة المغربية', 'ar'),
        ('ארכיון הסיפור העממי בישראל', 'he'),
        ('Αρχαιολογικό Μουσείο Θεσσαλονίκης', 'el'),
    ]
    for name, lang in test_cases:
        result = transliterate_for_abbreviation(name, lang)
        print(f'{lang}: {name}')
        print(f'  → {result}')
        print()
```
---
## Skip Words by Language
When extracting abbreviations from transliterated text, skip these articles/prepositions:
### Arabic
- `al-` (the definite article)
- `bi-`, `li-`, `fi-` (prepositions)
### Hebrew
- `ha-` (the)
- `ve-` (and)
- `be-`, `le-`, `me-` (prepositions)
### Persian
- `-e`, `-ye` (ezafe connector)
- `va` (and)
### CJK Languages
- No skip words (particles are integral to meaning)
### Indic Languages
- `ka`, `ki`, `ke` (Hindi: of)
- `aur` (Hindi: and)
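
The skip lists above can be applied during first-letter extraction roughly as follows (a minimal sketch: the function name and the exact word sets are illustrative, not normative; the production logic lives in the GHCID generation scripts):

```python
# Articles/prepositions to skip, keyed by ISO 639-1 code (illustrative subset)
SKIP_WORDS = {
    'ar': {'al', 'bi', 'li', 'fi'},
    'he': {'ha', 've', 'be', 'le', 'me'},
    'fa': {'e', 'ye', 'va'},
    'hi': {'ka', 'ki', 'ke', 'aur'},
}

def extract_abbreviation(latin_name: str, lang: str) -> str:
    """First letter of each significant word, skipping articles/prepositions."""
    skip = SKIP_WORDS.get(lang, set())
    # Hyphenated prefixes like "al-Maktaba" split into ["al", "Maktaba"]
    words = latin_name.replace('-', ' ').split()
    return ''.join(w[0].upper() for w in words if w.lower() not in skip)
```

For example, the Hebrew case above yields `extract_abbreviation("Arkhiyon ha-Sipur ha-Amami be-Yisrael", "he") == "ASAY"`.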
---
## Validation
### Check Transliteration Output
```python
import re

def validate_transliteration(result: str) -> bool:
    """
    Validate that transliteration output contains only ASCII letters and spaces.
    """
    return bool(re.match(r'^[a-zA-Z\s]+$', result))
```
### Manual Review Queue
Non-Latin institutions should be flagged for manual review if:
1. Transliteration library not available for that script
2. Confidence in transliteration is low
3. Institution has multiple official romanizations
---
## Related Documentation
- `AGENTS.md` - Rule 12: Transliteration Standards
- `ABBREVIATION_SPECIAL_CHAR_RULE.md` - Character filtering after transliteration
- `docs/TRANSLITERATION_CONVENTIONS.md` - Extended examples and edge cases
- `scripts/transliterate_emic_names.py` - Production transliteration script
---
## Changelog
| Date | Change |
|------|--------|
| 2025-12-08 | Initial standards document created |


@ -0,0 +1,277 @@
# Z.AI GLM API Rules for AI Agents
**Last Updated**: 2025-12-08
**Status**: MANDATORY for all LLM API calls in scripts
---
## CRITICAL: Use Z.AI Coding Plan, NOT BigModel API
**This project uses the Z.AI Coding Plan endpoint, which is the SAME endpoint that OpenCode uses internally.**
The regular BigModel API (`open.bigmodel.cn`) will NOT work with the tokens stored in this project. You MUST use the Z.AI Coding Plan endpoint.
---
## API Configuration
### Correct Endpoint
| Property | Value |
|----------|-------|
| **API URL** | `https://api.z.ai/api/coding/paas/v4/chat/completions` |
| **Auth Header** | `Authorization: Bearer {ZAI_API_TOKEN}` |
| **Content-Type** | `application/json` |
### Available Models
| Model | Description | Cost |
|-------|-------------|------|
| `glm-4.5` | Standard GLM-4.5 | Free (0 per token) |
| `glm-4.5-air` | GLM-4.5 Air variant | Free |
| `glm-4.5-flash` | Fast GLM-4.5 | Free |
| `glm-4.5v` | Vision-capable GLM-4.5 | Free |
| `glm-4.6` | Latest GLM-4.6 (recommended) | Free |
**Recommended Model**: `glm-4.6` for best quality
---
## Authentication
### Token Location
The Z.AI API token can be obtained from two locations:
1. **Environment Variable** (preferred for scripts):
```bash
# In .env file at project root
ZAI_API_TOKEN=your_token_here
```
2. **OpenCode Auth File** (reference only):
```
~/.local/share/opencode/auth.json
```
The token is stored under key `zai-coding-plan`.
### Getting the Token
If you need to set up the token:
1. The token is shared with OpenCode's Z.AI Coding Plan
2. Check `~/.local/share/opencode/auth.json` for existing token
3. Add to `.env` file as `ZAI_API_TOKEN`
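
A minimal sketch of reading the token in a script (assumes the `python-dotenv` package for `.env` loading; the helper name is ours, not a project API):

```python
import os

def get_zai_token() -> str:
    """Read the Z.AI Coding Plan token, loading .env first when python-dotenv is present."""
    try:
        from dotenv import load_dotenv  # optional dependency
        load_dotenv()  # reads ZAI_API_TOKEN from a .env file; does not override existing env vars
    except ImportError:
        pass  # fall back to the process environment
    token = os.environ.get("ZAI_API_TOKEN")
    if not token:
        raise RuntimeError("ZAI_API_TOKEN not set; see 'Getting the Token' above")
    return token
```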
---
## Python Implementation
### Correct Implementation
```python
import os
import httpx

class GLMClient:
    """Client for Z.AI GLM API (Coding Plan endpoint)."""

    # CORRECT endpoint - Z.AI Coding Plan
    API_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions"

    def __init__(self, model: str = "glm-4.6"):
        self.api_key = os.environ.get("ZAI_API_TOKEN")
        if not self.api_key:
            raise ValueError("ZAI_API_TOKEN not found in environment")
        self.model = model
        self.client = httpx.AsyncClient(
            timeout=60.0,
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json",
            },
        )

    async def chat(self, messages: list) -> dict:
        """Send chat completion request."""
        response = await self.client.post(
            self.API_URL,
            json={
                "model": self.model,
                "messages": messages,
                "temperature": 0.3,
            },
        )
        response.raise_for_status()
        return response.json()
```
### WRONG Implementation (DO NOT USE)
```python
# WRONG - This endpoint will fail with quota errors
API_URL = "https://open.bigmodel.cn/api/paas/v4/chat/completions"
# WRONG - This is for regular BigModel API, not Z.AI Coding Plan
api_key = os.environ.get("ZHIPU_API_KEY")
```
---
## Request Format
### Chat Completion Request
```json
{
"model": "glm-4.6",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Your prompt here"
}
],
"temperature": 0.3,
"max_tokens": 4096
}
```
### Response Format
```json
{
"id": "request-id",
"created": 1733651234,
"model": "glm-4.6",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Response text here"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 100,
"completion_tokens": 50,
"total_tokens": 150
}
}
```
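
Given a response of that shape, the assistant text and token usage can be pulled out as below (a sketch; the helper name is ours, and error handling for malformed responses is omitted):

```python
def parse_chat_response(resp: dict) -> tuple:
    """Return (assistant_text, total_tokens) from a chat completion response dict."""
    # First choice carries the assistant message
    content = resp["choices"][0]["message"]["content"]
    # Usage block is informational; default to 0 if absent
    total = resp.get("usage", {}).get("total_tokens", 0)
    return content, total
```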
---
## Error Handling
### Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `401 Unauthorized` | Invalid or missing token | Check ZAI_API_TOKEN in .env |
| `403 Quota exceeded` | Wrong endpoint (BigModel) | Use Z.AI Coding Plan endpoint |
| `429 Rate limited` | Too many requests | Add delay between requests |
| `500 Server error` | API issue | Retry with exponential backoff |
### Retry Strategy
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
)
async def call_api_with_retry(client, messages):
    return await client.chat(messages)
```
---
## Integration with CH-Annotator
When using GLM for entity recognition or verification, always reference CH-Annotator v1.7.0:
```python
PROMPT = """You are a heritage institution classifier following CH-Annotator v1.7.0 convention.
## CH-Annotator GRP.HER Definition
Heritage institutions are organizations that:
- Collect, preserve, and provide access to cultural heritage materials
- Include: museums (GRP.HER.MUS), libraries (GRP.HER.LIB), archives (GRP.HER.ARC), galleries (GRP.HER.GAL)
## Entity to Analyze
...
"""
```
See `.opencode/CH_ANNOTATOR_CONVENTION.md` for full convention details.
---
## Scripts Using GLM API
The following scripts use the Z.AI GLM API:
| Script | Purpose |
|--------|---------|
| `scripts/reenrich_wikidata_with_verification.py` | Wikidata entity verification using GLM-4.6 |
When creating new scripts that need LLM capabilities, follow this pattern.
---
## Environment Setup Checklist
When setting up a new environment:
- [ ] Check `~/.local/share/opencode/auth.json` for existing Z.AI token
- [ ] Add `ZAI_API_TOKEN` to `.env` file
- [ ] Verify endpoint is `https://api.z.ai/api/coding/paas/v4/chat/completions`
- [ ] Test with `glm-4.6` model
- [ ] Reference CH-Annotator v1.7.0 for entity recognition tasks
---
## AI Agent Rules
### DO
- Use `https://api.z.ai/api/coding/paas/v4/chat/completions` endpoint
- Get token from `ZAI_API_TOKEN` environment variable
- Use `glm-4.6` as the default model
- Reference CH-Annotator v1.7.0 for entity tasks
- Add retry logic with exponential backoff
- Handle JSON parsing errors gracefully
### DO NOT
- Use `open.bigmodel.cn` endpoint (wrong API)
- Use `ZHIPU_API_KEY` environment variable (wrong key)
- Hard-code API tokens in scripts
- Skip error handling for API calls
- Forget to load `.env` file before accessing environment
---
## Related Documentation
- **CH-Annotator Convention**: `.opencode/CH_ANNOTATOR_CONVENTION.md`
- **Entity Annotation**: `data/entity_annotation/ch_annotator-v1_7_0.yaml`
- **Wikidata Enrichment Script**: `scripts/reenrich_wikidata_with_verification.py`
---
## Version History
| Date | Change |
|------|--------|
| 2025-12-08 | Initial documentation - Fixed API endpoint discovery |

AGENTS.md

@ -720,6 +720,66 @@ claim:
---
### Rule 11: Z.AI GLM API for LLM Tasks (NOT BigModel)
**CRITICAL: When using GLM models in scripts, use the Z.AI Coding Plan endpoint, NOT the regular BigModel API.**
The project uses the same Z.AI Coding Plan that OpenCode uses internally. The regular BigModel API (`open.bigmodel.cn`) will NOT work with our tokens.
**Correct API Configuration**:
| Property | Value |
|----------|-------|
| **API URL** | `https://api.z.ai/api/coding/paas/v4/chat/completions` |
| **Environment Variable** | `ZAI_API_TOKEN` |
| **Recommended Model** | `glm-4.6` |
| **Cost** | Free (0 per token for all GLM models) |
**Available Models**: `glm-4.5`, `glm-4.5-air`, `glm-4.5-flash`, `glm-4.5v`, `glm-4.6`
**Python Implementation**:
```python
import os
import httpx

# CORRECT - Z.AI Coding Plan endpoint
API_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions"
api_key = os.environ.get("ZAI_API_TOKEN")

client = httpx.AsyncClient(
    timeout=60.0,
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    },
)

# WRONG - This will fail with quota errors!
# API_URL = "https://open.bigmodel.cn/api/paas/v4/chat/completions"
# api_key = os.environ.get("ZHIPU_API_KEY")
```
**Integration with CH-Annotator**: When using GLM for entity recognition, always reference CH-Annotator v1.7.0 in prompts:
```python
PROMPT = """You are following CH-Annotator v1.7.0 convention.
Heritage institutions are type GRP.HER with subtypes:
- GRP.HER.MUS (museums)
- GRP.HER.LIB (libraries)
- GRP.HER.ARC (archives)
- GRP.HER.GAL (galleries)
..."""
```
**Token Location**:
1. **Environment**: Add `ZAI_API_TOKEN` to `.env` file
2. **OpenCode Auth**: Token stored in `~/.local/share/opencode/auth.json` under key `zai-coding-plan`
**See**: `.opencode/ZAI_GLM_API_RULES.md` for complete documentation
---
## Project Overview
**Goal**: Extract structured data about worldwide GLAMORCUBESFIXPHDNT (Galleries, Libraries, Archives, Museums, Official institutions, Research centers, Corporations, Unknown, Botanical gardens/zoos, Educational providers, Societies, Features, Intangible heritage groups, miXed, Personal collections, Holy sites, Digital platforms, NGOs, Taste/smell heritage) institutions from 139+ Claude conversation JSON files and integrate with authoritative CSV datasets.
**The institution abbreviation component uses the FIRST LETTER of each significant word in the official emic (native language) name.**
**⚠️ GRANDFATHERING POLICY (PID STABILITY)**
Existing GHCIDs created before December 2025 are **grandfathered** - their abbreviations will NOT be updated even if derived from English translations rather than emic names. This preserves PID stability per the "Cool URIs Don't Change" principle.
**Applies to:**
- 817 UNESCO Memory of the World custodian files enriched with `custodian_name.emic_name`
- Abbreviations like `NLP` (National Library of Peru) remain unchanged even though emic name is "Biblioteca Nacional del Perú" (would be `BNP`)
**For NEW custodians only:** Apply emic name abbreviation protocol described below.
**Abbreviation Rules**:
1. Use the **CustodianName** (official emic name), NOT an English translation
2. Take the **first letter** of each word
**See**: `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` for complete documentation
### 🚨 CRITICAL: Diacritics MUST Be Normalized to ASCII in Abbreviations 🚨
**When generating abbreviations for GHCID, diacritics (accented characters) MUST be normalized to their ASCII base letter equivalents. Only ASCII uppercase letters (A-Z) are permitted.**
This rule applies to ALL languages with diacritical marks including Czech, Polish, German, French, Spanish, Portuguese, Nordic languages, Hungarian, Romanian, Turkish, and others.
**RATIONALE**:
1. **URI/URL safety** - Non-ASCII characters require percent-encoding
2. **Cross-system compatibility** - ASCII is universally supported
3. **Filename safety** - Some systems have issues with non-ASCII filenames
4. **Human readability** - Easier to type and communicate
**DIACRITICS NORMALIZATION TABLE**:
| Language | Diacritics | ASCII Equivalent |
|----------|------------|------------------|
| **Czech** | Č, Ř, Š, Ž, Ě, Ů | C, R, S, Z, E, U |
| **Polish** | Ł, Ń, Ó, Ś, Ź, Ż, Ą, Ę | L, N, O, S, Z, Z, A, E |
| **German** | Ä, Ö, Ü, ß | A, O, U, SS |
| **French** | É, È, Ê, Ç, Ô, Â | E, E, E, C, O, A |
| **Spanish** | Ñ, Á, É, Í, Ó, Ú | N, A, E, I, O, U |
| **Portuguese** | Ã, Õ, Ç, Á, É | A, O, C, A, E |
| **Nordic** | Å, Ä, Ö, Ø, Æ | A, A, O, O, AE |
| **Hungarian** | Á, É, Í, Ó, Ö, Ő, Ú, Ü, Ű | A, E, I, O, O, O, U, U, U |
| **Turkish** | Ç, Ğ, İ, Ö, Ş, Ü | C, G, I, O, S, U |
| **Romanian** | Ă, Â, Î, Ș, Ț | A, A, I, S, T |
**REAL-WORLD EXAMPLE** (Czech institution):
```yaml
# INCORRECT - Contains diacritics:
ghcid_current: CZ-VY-TEL-L-VHSPAOČRZS # ❌ Contains "Č"
# CORRECT - ASCII only:
ghcid_current: CZ-VY-TEL-L-VHSPAOCRZS # ✅ "Č" → "C"
```
**IMPLEMENTATION**:
```python
import unicodedata

def normalize_diacritics(text: str) -> str:
    """Normalize diacritics to ASCII equivalents."""
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', text)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    return ascii_text

# Example
normalize_diacritics("VHSPAOČRZS")  # Returns "VHSPAOCRZS"
```
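Note that NFD decomposition alone does not cover letters from the table above that have no combining-mark decomposition (ß, Ø, Æ, Ł, Þ, ð). A sketch that pre-maps those letters before stripping marks; the map and function name are illustrative assumptions, not the project's canonical implementation (Þ → T follows the Þjóðminjasafn Íslands → TI example below):

```python
import unicodedata

# Letters with no NFD decomposition; mappings follow the normalization
# tables above (illustrative subset, not project code)
SPECIAL_MAP = {
    'ß': 'SS', 'Ø': 'O', 'ø': 'o', 'Æ': 'AE', 'æ': 'ae',
    'Ł': 'L', 'ł': 'l', 'Đ': 'D', 'đ': 'd',
    'Þ': 'T', 'þ': 't', 'Ð': 'D', 'ð': 'd',
}

def normalize_to_ascii(text: str) -> str:
    """Pre-map non-decomposable letters, then strip combining marks."""
    mapped = ''.join(SPECIAL_MAP.get(c, c) for c in text)
    decomposed = unicodedata.normalize('NFD', mapped)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

normalize_to_ascii("Łódź")        # "Lodz"
normalize_to_ascii("VHSPAOČRZS")  # "VHSPAOCRZS"
```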
**EXAMPLES**:
| Emic Name (with diacritics) | Abbreviation | Wrong |
|-----------------------------|--------------|-------|
| Vlastivědné muzeum v Šumperku | VMS | VMŠ ❌ |
| Österreichische Nationalbibliothek | ON | ÖN ❌ |
| Bibliothèque nationale de France | BNF | BNF (OK - è not in first letter) |
| Múzeum Łódzkie | ML | MŁ ❌ |
| Þjóðminjasafn Íslands | TI | ÞI ❌ |
**See**: `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` for complete documentation (covers both special characters and diacritics)
### 🚨 CRITICAL: Non-Latin Scripts MUST Be Transliterated Before Abbreviation 🚨
**When generating GHCID abbreviations from institution names in non-Latin scripts (Cyrillic, Chinese, Japanese, Korean, Arabic, Hebrew, Greek, Devanagari, Thai, etc.), the emic name MUST first be transliterated to Latin characters using ISO or recognized standards.**
This rule affects **170 institutions** across **21 languages** with non-Latin writing systems.
**CORE PRINCIPLE**: The emic name is PRESERVED in original script in `custodian_name.emic_name`. Transliteration is only used for abbreviation generation.
**TRANSLITERATION STANDARDS BY SCRIPT**:
| Script | Languages | Standard | Example |
|--------|-----------|----------|---------|
| **Cyrillic** | ru, uk, bg, sr, kk | ISO 9:1995 | Институт → Institut |
| **Chinese** | zh | Hanyu Pinyin (ISO 7098) | 东巴文化博物院 → Dongba Wenhua Bowuyuan |
| **Japanese** | ja | Modified Hepburn | 国立博物館 → Kokuritsu Hakubutsukan |
| **Korean** | ko | Revised Romanization | 독립기념관 → Dongnip Ginyeomgwan |
| **Arabic** | ar, fa, ur | ISO 233-2/3 | المكتبة الوطنية → al-Maktaba al-Wataniya |
| **Hebrew** | he | ISO 259-3 | ארכיון → Arkhiyon |
| **Greek** | el | ISO 843 | Μουσείο → Mouseio |
| **Devanagari** | hi, ne | ISO 15919 | राजस्थान → Rajasthana |
| **Bengali** | bn | ISO 15919 | বাংলাদেশ → Bangladesh |
| **Thai** | th | ISO 11940-2 | สำนักหอ → Samnak Ho |
| **Armenian** | hy | ISO 9985 | Մատենադարան → Matenadaran |
| **Georgian** | ka | ISO 9984 | ხელნაწერთა → Khelnawerti |
**WORKFLOW**:
```
1. Emic Name (original script)
2. Transliterate to Latin (ISO standard)
3. Normalize diacritics (remove accents)
4. Skip articles/prepositions
5. Extract first letters → Abbreviation
```
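Steps 3-5 of the workflow above can be sketched as a single helper, assuming the name has already been transliterated to Latin in step 2 (`abbreviate` is a hypothetical name, not project code):

```python
import unicodedata

def abbreviate(latin_name: str, skip_words: frozenset = frozenset()) -> str:
    """Steps 3-5: normalize diacritics, skip stop words, take initials."""
    nfd = unicodedata.normalize('NFD', latin_name)
    ascii_name = ''.join(c for c in nfd if unicodedata.category(c) != 'Mn')
    words = [w for w in ascii_name.split() if w.lower() not in skip_words]
    return ''.join(w[0].upper() for w in words if w and w[0].isalpha())

abbreviate("Institut Vostochnykh Rukopisey RAN")  # "IVRR"
abbreviate("Dōngbā Wénhuà Bówùyuàn")              # "DWB"
```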
**EXAMPLES**:
| Language | Emic Name | Transliterated | Abbreviation |
|----------|-----------|----------------|--------------|
| **Russian** | Институт восточных рукописей РАН | Institut Vostochnykh Rukopisey RAN | IVRR |
| **Chinese** | 东巴文化博物院 | Dongba Wenhua Bowuyuan | DWB |
| **Korean** | 독립기념관 | Dongnip Ginyeomgwan | DG |
| **Hindi** | राजस्थान प्राच्यविद्या प्रतिष्ठान | Rajasthana Pracyavidya Pratishthana | RPP |
| **Arabic** | المكتبة الوطنية للمملكة المغربية | al-Maktaba al-Wataniya lil-Mamlaka al-Maghribiya | MWMM |
| **Hebrew** | ארכיון הסיפור העממי בישראל | Arkhiyon ha-Sipur ha-Amami be-Yisrael | ASAY |
| **Greek** | Αρχαιολογικό Μουσείο Θεσσαλονίκης | Archaiologiko Mouseio Thessalonikis | AMT |
**SCRIPT-SPECIFIC SKIP WORDS**:
| Language | Skip Words (Articles/Prepositions) |
|----------|-------------------------------------|
| **Arabic** | al- (the), bi-, li-, fi- (prepositions) |
| **Hebrew** | ha- (the), ve- (and), be-, le-, me- |
| **Persian** | -e, -ye (ezafe connector), va (and) |
| **CJK** | None (particles integral to meaning) |
**IMPLEMENTATION**:
```python
from transliteration import transliterate_for_abbreviation
# Input: emic name in non-Latin script + language code
emic_name = "Институт восточных рукописей РАН"
lang = "ru"
# Step 1: Transliterate to Latin using ISO standard
latin = transliterate_for_abbreviation(emic_name, lang)
# Result: "Institut Vostochnykh Rukopisey RAN"
# Step 2: Apply standard abbreviation extraction
abbreviation = extract_abbreviation_from_name(latin)
# Result: "IVRR"
```
**GRANDFATHERING POLICY**: Existing abbreviations from 817 UNESCO MoW custodians are grandfathered. This transliteration standard applies only to **NEW custodians** created after December 2025.
**See**: `.opencode/TRANSLITERATION_STANDARDS.md` for complete ISO standards, mapping tables, and Python implementation
---
GHCID uses a **four-identifier strategy** for maximum flexibility and transparency:
---
**Version**: 0.2.1
**Schema Version**: v0.2.1 (modular)
**Last Updated**: 2025-12-08
**Maintained By**: GLAM Data Extraction Project

docs/GLM_API_SETUP.md
# GLM API Setup Guide
This guide explains how to configure and use the GLM-4 language model for entity recognition, verification, and enrichment tasks in the GLAM project.
## Overview
The GLAM project uses **GLM-4.6** via the **Z.AI Coding Plan** endpoint for LLM-powered tasks such as:
- **Entity Verification**: Verify that Wikidata entities are heritage institutions
- **Description Enrichment**: Generate rich descriptions from multiple data sources
- **Entity Resolution**: Match institution names across different data sources
- **Claim Validation**: Verify extracted claims against source documents
**Cost**: All GLM models are FREE (0 cost per token) on the Z.AI Coding Plan.
## Prerequisites
- Python 3.10+
- `httpx` library for async HTTP requests
- Access to Z.AI Coding Plan (same as OpenCode)
## Quick Start
### 1. Set Up Environment Variable
Add your Z.AI API token to the `.env` file in the project root:
```bash
# .env file
ZAI_API_TOKEN=your_token_here
```
### 2. Find Your Token
The token is shared with OpenCode. Check:
```bash
# View OpenCode auth file
cat ~/.local/share/opencode/auth.json | jq '.["zai-coding-plan"]'
```
Copy this token to your `.env` file.
### 3. Basic Python Usage
```python
import os
import asyncio

import httpx
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

async def call_glm():
    api_url = "https://api.z.ai/api/coding/paas/v4/chat/completions"
    api_key = os.environ.get("ZAI_API_TOKEN")

    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            api_url,
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            },
            json={
                "model": "glm-4.6",
                "messages": [
                    {"role": "user", "content": "Hello, GLM!"}
                ],
                "temperature": 0.3,
            },
        )
        result = response.json()
        print(result["choices"][0]["message"]["content"])

asyncio.run(call_glm())
```
## API Configuration
### Endpoint Details
| Property | Value |
|----------|-------|
| **Base URL** | `https://api.z.ai/api/coding/paas/v4` |
| **Chat Endpoint** | `/chat/completions` |
| **Auth Method** | Bearer Token |
| **Header** | `Authorization: Bearer {token}` |
### Available Models
| Model | Speed | Quality | Use Case |
|-------|-------|---------|----------|
| `glm-4.6` | Medium | Highest | Complex reasoning, verification |
| `glm-4.5` | Medium | High | General tasks |
| `glm-4.5-air` | Fast | Good | High-volume processing |
| `glm-4.5-flash` | Fastest | Good | Quick responses |
| `glm-4.5v` | Medium | High | Vision/image tasks |
**Recommendation**: Use `glm-4.6` for entity verification and complex tasks.
## Integration with CH-Annotator
When using GLM for entity recognition tasks, always reference the CH-Annotator convention:
### Heritage Institution Verification
````python
VERIFICATION_PROMPT = """You are a heritage institution classifier following CH-Annotator v1.7.0 convention.

## CH-Annotator GRP.HER Definition
Heritage institutions are organizations that:
- Collect, preserve, and provide access to cultural heritage materials
- Include: museums (GRP.HER.MUS), libraries (GRP.HER.LIB), archives (GRP.HER.ARC), galleries (GRP.HER.GAL)

## Entity Types That Are NOT Heritage Institutions
- Cities, towns, municipalities (places, not institutions)
- General businesses or companies
- People/individuals
- Events, festivals, exhibitions (temporary)

## Your Task
Analyze the entity and respond in JSON:
```json
{
  "is_heritage_institution": true/false,
  "subtype": "MUS|LIB|ARC|GAL|OTHER|null",
  "confidence": 0.95,
  "reasoning": "Brief explanation"
}
```
"""
````
### Entity Type Mapping
| CH-Annotator Type | GLAM Institution Type |
|-------------------|----------------------|
| GRP.HER.MUS | MUSEUM |
| GRP.HER.LIB | LIBRARY |
| GRP.HER.ARC | ARCHIVE |
| GRP.HER.GAL | GALLERY |
| GRP.HER.RES | RESEARCH_CENTER |
| GRP.HER.BOT | BOTANICAL_ZOO |
| GRP.HER.EDU | EDUCATION_PROVIDER |
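The mapping table above translates directly into a lookup dict; a minimal sketch (names are illustrative, not from the codebase):

```python
# CH-Annotator subtype → GLAM institution type (from the table above)
CH_TO_GLAM = {
    "GRP.HER.MUS": "MUSEUM",
    "GRP.HER.LIB": "LIBRARY",
    "GRP.HER.ARC": "ARCHIVE",
    "GRP.HER.GAL": "GALLERY",
    "GRP.HER.RES": "RESEARCH_CENTER",
    "GRP.HER.BOT": "BOTANICAL_ZOO",
    "GRP.HER.EDU": "EDUCATION_PROVIDER",
}

def glam_type(ch_type: str) -> str:
    """Resolve a CH-Annotator subtype to its GLAM type, defaulting to UNKNOWN."""
    return CH_TO_GLAM.get(ch_type, "UNKNOWN")

glam_type("GRP.HER.MUS")  # "MUSEUM"
```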
## Complete Implementation Example
### Wikidata Verification Script
See `scripts/reenrich_wikidata_with_verification.py` for a complete example:
```python
import os
import re
import json
from typing import Any, Dict, List

import httpx

class GLMHeritageVerifier:
    """Verify Wikidata entities using GLM-4.6 and CH-Annotator."""

    API_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions"

    def __init__(self, model: str = "glm-4.6"):
        self.api_key = os.environ.get("ZAI_API_TOKEN")
        if not self.api_key:
            raise ValueError("ZAI_API_TOKEN not found in environment")
        self.model = model
        self.client = httpx.AsyncClient(
            timeout=60.0,
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json",
            },
        )

    async def verify_heritage_institution(
        self,
        institution_name: str,
        wikidata_label: str,
        wikidata_description: str,
        instance_of_types: List[str],
    ) -> Dict[str, Any]:
        """Check if a Wikidata entity is a heritage institution."""
        prompt = f"""Analyze if this entity is a heritage institution (GRP.HER):

Institution Name: {institution_name}
Wikidata Label: {wikidata_label}
Description: {wikidata_description}
Instance Of: {', '.join(instance_of_types)}

Respond with JSON only."""

        response = await self.client.post(
            self.API_URL,
            json={
                "model": self.model,
                "messages": [
                    # VERIFICATION_PROMPT is the system prompt defined above
                    {"role": "system", "content": VERIFICATION_PROMPT},
                    {"role": "user", "content": prompt},
                ],
                "temperature": 0.1,
            },
        )
        result = response.json()
        content = result["choices"][0]["message"]["content"]

        # Parse JSON from response
        json_match = re.search(r'\{.*\}', content, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        return {"is_heritage_institution": False, "error": "No JSON found"}
```
## Error Handling
### Common Errors
| Error Code | Meaning | Solution |
|------------|---------|----------|
| 401 | Unauthorized | Check ZAI_API_TOKEN |
| 403 | Forbidden/Quota | Using wrong endpoint (use Z.AI, not BigModel) |
| 429 | Rate Limited | Add delays between requests |
| 500 | Server Error | Retry with backoff |
### Retry Pattern
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
)
async def call_with_retry(client, messages):
    response = await client.post(API_URL, json={"model": "glm-4.6", "messages": messages})
    response.raise_for_status()
    return response.json()
```
### JSON Parsing
LLM responses may contain text around JSON. Always parse safely:
```python
import re
import json

def parse_json_from_response(content: str) -> dict:
    """Extract JSON from LLM response text."""
    # Try to find a fenced JSON block first
    json_match = re.search(r'```json\s*(\{.*?\})\s*```', content, re.DOTALL)
    if json_match:
        return json.loads(json_match.group(1))
    # Fall back to bare JSON
    json_match = re.search(r'\{.*\}', content, re.DOTALL)
    if json_match:
        return json.loads(json_match.group())
    return {"error": "No JSON found in response"}
```
## Best Practices
### 1. Use Low Temperature for Verification
```python
{
    "temperature": 0.1  # Low for consistent, deterministic responses
}
```
### 2. Request JSON Output
Always request JSON format in your prompts for structured responses:
````
Respond in JSON format only:
```json
{"key": "value"}
```
````
### 3. Batch Processing
Process multiple entities with rate limiting:
```python
import asyncio
from typing import List

async def batch_verify(entities: List[dict], rate_limit: float = 0.5):
    """Verify entities with rate limiting."""
    results = []
    for entity in entities:
        result = await verifier.verify(entity)
        results.append(result)
        await asyncio.sleep(rate_limit)  # Respect rate limits
    return results
```
### 4. Always Reference CH-Annotator
For entity recognition tasks, include CH-Annotator context:
```python
system_prompt = """You are following CH-Annotator v1.7.0 convention.
Heritage institutions are type GRP.HER with subtypes for museums, libraries, archives, and galleries.
"""
```
## Related Scripts
| Script | Purpose |
|--------|---------|
| `scripts/reenrich_wikidata_with_verification.py` | Wikidata entity verification |
## Related Documentation
- **Agent Rules**: `AGENTS.md` (Rule 11: Z.AI GLM API)
- **Agent Config**: `.opencode/ZAI_GLM_API_RULES.md`
- **CH-Annotator**: `.opencode/CH_ANNOTATOR_CONVENTION.md`
- **Entity Annotation**: `data/entity_annotation/ch_annotator-v1_7_0.yaml`
## Troubleshooting
### "Quota exceeded" Error
**Symptom**: 403 error with "quota exceeded" message
**Cause**: Using wrong API endpoint (`open.bigmodel.cn` instead of `api.z.ai`)
**Solution**: Update API URL to `https://api.z.ai/api/coding/paas/v4/chat/completions`
### "Token not found" Error
**Symptom**: ValueError about missing ZAI_API_TOKEN
**Solution**:
1. Check `~/.local/share/opencode/auth.json` for token
2. Add to `.env` file as `ZAI_API_TOKEN=your_token`
3. Ensure `load_dotenv()` is called before accessing environment
### JSON Parsing Failures
**Symptom**: LLM returns text that can't be parsed as JSON
**Solution**: Use the `parse_json_from_response()` helper function with fallback handling
---
**Last Updated**: 2025-12-08

# Transliteration Conventions for Heritage Custodian Names
**Document Type**: User Guide
**Version**: 1.0
**Last Updated**: 2025-12-08
**Related Rules**: `.opencode/TRANSLITERATION_STANDARDS.md`, `AGENTS.md` (Rule 12)
---
## Overview
This document provides comprehensive examples and guidance for transliterating heritage institution names from non-Latin scripts to Latin characters. Transliteration is **required** for generating GHCID abbreviations but the **original emic name is always preserved**.
### Key Principles
1. **Emic name preserved** - Original script stored in `custodian_name.emic_name`
2. **ISO standards used** - Recognized international transliteration standards
3. **Deterministic output** - Same input always produces same Latin output
4. **Abbreviation purpose only** - Transliteration is for GHCID generation, not display
---
## Language-by-Language Examples
### Russian (Cyrillic - ISO 9:1995)
**Dataset Statistics**: 13 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Институт восточных рукописей РАН | Institut vostočnyh rukopisej RAN | IVRR |
| Российская государственная библиотека | Rossijskaja gosudarstvennaja biblioteka | RGB |
| Государственный архив Российской Федерации | Gosudarstvennyj arhiv Rossijskoj Federacii | GARF |
**Skip Words (Russian)**: None significant (Russian doesn't use articles)
**Character Mapping**:
```
А → A Б → B В → V Г → G Д → D Е → E
Ё → Ë Ж → Ž З → Z И → I Й → J К → K
Л → L М → M Н → N О → O П → P Р → R
С → S Т → T У → U Ф → F Х → H Ц → C
Ч → Č Ш → Š Щ → Ŝ Ъ → ʺ Ы → Y Ь → ʹ
Э → È Ю → Û Я → Â
```
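The mapping above can be expressed as a small lookup; this is an illustrative sketch covering only the uppercase letters tabulated here, not the project's transliteration module:

```python
# Minimal ISO 9:1995 sketch covering the letters tabulated above
ISO9 = {
    'А': 'A', 'Б': 'B', 'В': 'V', 'Г': 'G', 'Д': 'D', 'Е': 'E',
    'Ё': 'Ë', 'Ж': 'Ž', 'З': 'Z', 'И': 'I', 'Й': 'J', 'К': 'K',
    'Л': 'L', 'М': 'M', 'Н': 'N', 'О': 'O', 'П': 'P', 'Р': 'R',
    'С': 'S', 'Т': 'T', 'У': 'U', 'Ф': 'F', 'Х': 'H', 'Ц': 'C',
    'Ч': 'Č', 'Ш': 'Š', 'Щ': 'Ŝ', 'Ъ': 'ʺ', 'Ы': 'Y', 'Ь': 'ʹ',
    'Э': 'È', 'Ю': 'Û', 'Я': 'Â',
}

def iso9(text: str) -> str:
    """Transliterate Cyrillic letter-by-letter, preserving case."""
    out = []
    for ch in text:
        latin = ISO9.get(ch.upper(), ch)
        out.append(latin if ch.isupper() or not ch.isalpha() else latin.lower())
    return ''.join(out)

iso9("Институт")  # "Institut"
```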
---
### Ukrainian (Cyrillic - ISO 9:1995)
**Dataset Statistics**: 8 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Центральний державний архів громадських об'єднань України | Centralnyj deržavnyj arhiv hromadskyx objednan Ukrainy | CDAGOU |
| Національна бібліотека України | Nacionalna biblioteka Ukrainy | NBU |
**Ukrainian-specific characters**:
```
І → I Ї → Ji Є → Je Ґ → G'
```
---
### Chinese (Hanyu Pinyin - ISO 7098)
**Dataset Statistics**: 27 institutions
| Emic Name | Pinyin | Abbreviation |
|-----------|--------|--------------|
| 东巴文化博物院 | Dōngbā Wénhuà Bówùyuàn | DWB |
| 中国第一历史档案馆 | Zhōngguó Dìyī Lìshǐ Dàng'ànguǎn | ZDLD |
| 北京故宫博物院 | Běijīng Gùgōng Bówùyuàn | BGB |
| 中国国家图书馆 | Zhōngguó Guójiā Túshūguǎn | ZGT |
**Notes**:
- Tone marks are removed for abbreviation (diacritics normalization)
- Word boundaries follow natural semantic breaks
- Multi-syllable words keep together
**Skip Words**: None (Chinese doesn't use separate articles/prepositions)
---
### Japanese (Modified Hepburn)
**Dataset Statistics**: 19 institutions
| Emic Name | Romaji | Abbreviation |
|-----------|--------|--------------|
| 国立中央博物館 | Kokuritsu Chūō Hakubutsukan | KCH |
| 東京国立博物館 | Tōkyō Kokuritsu Hakubutsukan | TKH |
| 国立国会図書館 | Kokuritsu Kokkai Toshokan | KKT |
**Notes**:
- Long vowels (ō, ū) normalized to (o, u)
- Particles typically attached to preceding word
- Kanji compounds transliterated as single words
---
### Korean (Revised Romanization)
**Dataset Statistics**: 36 institutions
| Emic Name | RR Romanization | Abbreviation |
|-----------|-----------------|--------------|
| 독립기념관 | Dongnip Ginyeomgwan | DG |
| 국립중앙박물관 | Gungnip Jungang Bakmulgwan | GJB |
| 서울대학교 규장각한국학연구원 | Seoul Daehakgyo Gyujanggak Hangukhak Yeonguwon | SDGHY |
| 국립한글박물관 | Gungnip Hangeul Bakmulgwan | GHB |
**Notes**:
- No diacritics in Revised Romanization (unlike McCune-Reischauer)
- Consonant assimilation reflected in spelling
- Spaces at natural word boundaries
---
### Arabic (ISO 233-2)
**Dataset Statistics**: 8 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| المكتبة الوطنية للمملكة المغربية | al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya | MWMM |
| دار الكتب المصرية | Dār al-Kutub al-Miṣrīya | DKM |
| المتحف الوطني العراقي | al-Matḥaf al-Waṭanī al-ʿIrāqī | MWI |
**Skip Words**:
- `al-` (definite article "the")
- After skip word removal: Maktaba, Wataniya, Mamlaka, Maghribiya → MWMM
**Notes**:
- Right-to-left script
- Definite article "al-" always skipped
- Diacritics normalized (ā→a, ī→i, etc.)
---
### Persian/Farsi (ISO 233-3)
**Dataset Statistics**: 11 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| وزارت امور خارجه ایران | Vezārat-e Omūr-e Khārejeh-ye Īrān | VOKI |
| کتابخانه آستان قدس رضوی | Ketābkhāneh-ye Āstān-e Qods-e Raẓavī | KAQR |
| مجلس شورای اسلامی | Majles-e Showrā-ye Eslāmī | MSE |
**Skip Words**:
- `-e`, `-ye` (ezafe connector, "of")
- `va` ("and")
**Persian-specific characters**:
```
پ → p چ → č ژ → ž گ → g
```
---
### Hebrew (ISO 259-3)
**Dataset Statistics**: 4 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| ארכיון הסיפור העממי בישראל | Arḵiyon ha-Sipur ha-ʿAmami be-Yiśraʾel | ASAY |
| הספרייה הלאומית | ha-Sifriya ha-Leʾumit | SL |
| ארכיון המדינה | Arḵiyon ha-Medina | AM |
**Skip Words**:
- `ha-` (definite article "the")
- `be-` ("in")
- `le-` ("to")
- `ve-` ("and")
**Notes**:
- Right-to-left script
- Articles attached with hyphen
- Silent letters (aleph, ayin) often omitted in abbreviation
---
### Hindi (Devanagari - ISO 15919)
**Dataset Statistics**: 14 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| राजस्थान प्राच्यविद्या प्रतिष्ठान | Rājasthāna Prācyavidyā Pratiṣṭhāna | RPP |
| राष्ट्रीय अभिलेखागार | Rāṣṭrīya Abhilekhāgāra | RA |
| राष्ट्रीय संग्रहालय नई दिल्ली | Rāṣṭrīya Saṅgrahālaya Naī Dillī | RSND |
**Skip Words**:
- `ka`, `ki`, `ke` ("of")
- `aur` ("and")
- `mein` ("in")
**Notes**:
- Conjunct consonants transliterated as cluster
- Long vowels marked (ā, ī, ū) then normalized
---
### Greek (ISO 843)
**Dataset Statistics**: 2 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Αρχαιολογικό Μουσείο Θεσσαλονίκης | Archaiologikó Mouseío Thessaloníkīs | AMT |
| Εθνική Βιβλιοθήκη της Ελλάδας | Ethnikī́ Vivliothī́kī tīs Elládas | EVE |
**Skip Words**:
- `tīs`, `tou` ("of the")
- `kai` ("and")
**Character Mapping**:
```
Α → A Β → V Γ → G Δ → D Ε → E Ζ → Z
Η → Ī Θ → Th Ι → I Κ → K Λ → L Μ → M
Ν → N Ξ → X Ο → O Π → P Ρ → R Σ → S
Τ → T Υ → Y Φ → F Χ → Ch Ψ → Ps Ω → Ō
```
---
### Thai (ISO 11940-2)
**Dataset Statistics**: 6 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| สำนักหอจดหมายเหตุแห่งชาติ | Samnak Ho Chotmaihet Haeng Chat | SHCHC |
| หอสมุดแห่งชาติ | Ho Samut Haeng Chat | HSHC |
**Notes**:
- Thai script is abugida (consonant-vowel syllables)
- No spaces in Thai; word boundaries determined by meaning
- Royal Thai General System also acceptable
---
### Armenian (ISO 9985)
**Dataset Statistics**: 4 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Մատենադարան | Matenadaran | M |
| Ազգային Մատենադարան | Azgayin Matenadaran | AM |
**Notes**:
- Armenian alphabet unique to Armenian language
- Transliteration straightforward letter-for-letter
---
### Georgian (ISO 9984)
**Dataset Statistics**: 2 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| ხელნაწერთა ეროვნული ცენტრი | Xelnawerti Erovnuli C'ent'ri | XEC |
| საქართველოს ეროვნული არქივი | Sakartvelos Erovnuli Arkivi | SEA |
**Notes**:
- Georgian Mkhedruli script
- Apostrophes mark ejective consonants (removed in abbreviation)
---
## Complete Workflow Example
### Step-by-Step: Korean Institution
**Institution**: National Museum of Korea
1. **Emic Name (Original Script)**:
```
국립중앙박물관
```
2. **Language Detection**: Korean (ko)
3. **Transliterate using Revised Romanization**:
```
Gungnip Jungang Bakmulgwan
```
4. **Identify Skip Words**: None for Korean
5. **Extract First Letters**:
```
G + J + B = GJB
```
6. **Diacritic Normalization**: N/A (RR has no diacritics)
7. **Final Abbreviation**: `GJB`
8. **Store in YAML**:
```yaml
custodian_name:
  emic_name: 국립중앙박물관
  name_language: ko
  english_name: National Museum of Korea
ghcid:
  ghcid_current: KR-SO-SEO-M-GJB
  abbreviation_source: transliterated_emic
```
---
### Step-by-Step: Arabic Institution
**Institution**: National Library of Morocco
1. **Emic Name (Original Script)**:
```
المكتبة الوطنية للمملكة المغربية
```
2. **Language Detection**: Arabic (ar)
3. **Transliterate using ISO 233-2**:
```
al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya
```
4. **Identify Skip Words**: `al-` (4 occurrences), `lil-` (1)
5. **After Skip Word Removal**:
```
Maktaba Waṭanīya Mamlaka Maġribīya
```
6. **Extract First Letters**:
```
M + W + M + M = MWMM
```
7. **Diacritic Normalization**: ṭ→t, ī→i, ġ→g
```
MWMM (already ASCII)
```
8. **Final Abbreviation**: `MWMM`
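The attached-prefix handling in steps 4-6 can be sketched as follows; the prefix list and function names are illustrative assumptions drawn from the skip-word tables above, not project code:

```python
import unicodedata

# Attached articles/prepositions from the skip-word tables (illustrative subset)
PREFIXES = ("al-", "lil-", "ha-", "be-", "le-", "ve-", "me-")

def strip_prefix(word: str) -> str:
    """Drop one attached article/preposition prefix, if present."""
    lower = word.lower()
    for p in PREFIXES:
        if lower.startswith(p) and len(word) > len(p):
            return word[len(p):]
    return word

def initials(name: str) -> str:
    """Normalize diacritics, strip prefixes, then take first letters."""
    nfd = unicodedata.normalize('NFD', name)
    ascii_name = ''.join(c for c in nfd if unicodedata.category(c) != 'Mn')
    return ''.join(
        w[0].upper()
        for w in (strip_prefix(word) for word in ascii_name.split())
        if w and w[0].isalpha()
    )

initials("al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya")  # "MWMM"
initials("Arkhiyon ha-Sipur ha-Amami be-Yisrael")            # "ASAY"
```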
---
## Edge Cases and Special Handling
### Mixed Scripts
Some institution names mix scripts (e.g., Latin brand names in Chinese text):
**Example**: 中国IBM研究院
- Transliterate Chinese: Zhongguo IBM Yanjiuyuan
- Keep "IBM" as-is (already Latin)
- Abbreviation: ZIY
### Transliteration Ambiguity
When multiple valid transliterations exist, prefer:
1. ISO standard spelling
2. Institution's own romanization (if consistent)
3. Most commonly used academic romanization
### Very Long Names
If abbreviation exceeds 10 characters after applying rules:
1. Truncate to 10 characters
2. Ensure truncation doesn't create ambiguous abbreviation
3. Document truncation in `ghcid.notes`
---
## Python Implementation Reference
For the complete Python implementation of transliteration functions, see:
- `.opencode/TRANSLITERATION_STANDARDS.md` - Full code with all language handlers
- `scripts/transliterate_emic_names.py` - Production script for batch transliteration
### Quick Reference Function
```python
from transliteration import transliterate_for_abbreviation

# Example usage for all supported languages
examples = {
    'ru': 'Российская государственная библиотека',
    'zh': '中国国家图书馆',
    'ja': '国立国会図書館',
    'ko': '국립중앙박물관',
    'ar': 'المكتبة الوطنية للمملكة المغربية',
    'he': 'הספרייה הלאומית',
    'hi': 'राष्ट्रीय अभिलेखागार',
    'el': 'Εθνική Βιβλιοθήκη της Ελλάδας',
}

for lang, name in examples.items():
    latin = transliterate_for_abbreviation(name, lang)
    print(f'{lang}: {name}')
    print(f'  → {latin}')
```
---
## Validation Checklist
Before finalizing a transliterated abbreviation:
- [ ] Original emic name preserved in `custodian_name.emic_name`
- [ ] Language code stored in `custodian_name.name_language`
- [ ] Correct ISO standard applied for script
- [ ] Skip words removed (articles, prepositions)
- [ ] Diacritics normalized to ASCII
- [ ] Special characters removed
- [ ] Abbreviation ≤ 10 characters
- [ ] No conflicts with existing GHCIDs
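The mechanical items on this checklist (character set, length, uniqueness) can be validated automatically; a minimal sketch, with function name and messages as illustrative assumptions rather than codebase fixtures:

```python
import re

def validate_abbreviation(abbrev: str, existing: set) -> list:
    """Return a list of checklist violations (empty list = passes)."""
    problems = []
    if not re.fullmatch(r'[A-Z]+', abbrev):
        problems.append('only ASCII uppercase A-Z permitted')
    if len(abbrev) > 10:
        problems.append('must be 10 characters or fewer')
    if abbrev in existing:
        problems.append('conflicts with an existing GHCID abbreviation')
    return problems

validate_abbreviation('MWMM', set())        # []
validate_abbreviation('VHSPAOČRZS', set())  # ['only ASCII uppercase A-Z permitted']
```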
---
## See Also
- `.opencode/TRANSLITERATION_STANDARDS.md` - Technical rules and Python code
- `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` - Character filtering rules
- `AGENTS.md` - Rule 12: Non-Latin Script Transliteration
- `docs/PERSISTENT_IDENTIFIERS.md` - GHCID specification
---
## Changelog
| Date | Change |
|------|--------|
| 2025-12-08 | Initial document created with 21 language examples |