diff --git a/.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md b/.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md index 288ba852aa..0107a5a77f 100644 --- a/.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md +++ b/.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md @@ -1,17 +1,102 @@ -# Abbreviation Special Character Filtering Rule +# Abbreviation Character Filtering Rules -**Rule ID**: ABBREV-SPECIAL-CHAR +**Rule ID**: ABBREV-CHAR-FILTER **Status**: MANDATORY **Applies To**: GHCID abbreviation component generation **Created**: 2025-12-07 +**Updated**: 2025-12-08 (added diacritics rule) --- ## Summary -**When generating abbreviations for GHCID, special characters and symbols MUST be completely removed. Only alphabetic characters (A-Z) are permitted in the abbreviation component of the GHCID.** +**When generating abbreviations for GHCID, ONLY ASCII uppercase letters (A-Z) are permitted. Both special characters AND diacritics MUST be removed/normalized.** -This is a **MANDATORY** rule. Abbreviations containing special characters are INVALID and must be regenerated. +This is a **MANDATORY** rule. Abbreviations containing special characters or diacritics are INVALID and must be regenerated. + +### Two Mandatory Sub-Rules: + +1. **ABBREV-SPECIAL-CHAR**: Remove all special characters and symbols +2. 
**ABBREV-DIACRITICS**: Normalize all diacritics to ASCII equivalents + +--- + +## Rule 1: Diacritics MUST Be Normalized to ASCII (ABBREV-DIACRITICS) + +**Diacritics (accented characters) MUST be normalized to their ASCII base letter equivalents.** + +### Example (Real Case) + +``` +❌ WRONG: CZ-VY-TEL-L-VHSPAOČRZS (contains Č) +✅ CORRECT: CZ-VY-TEL-L-VHSPAOCRZS (ASCII only) +``` + +### Diacritics Normalization Table + +| Diacritic | ASCII | Example | +|-----------|-------|---------| +| Á, À, Â, Ã, Ä, Å, Ā | A | "Ålborg" → A | +| Č, Ć, Ç | C | "Český" → C | +| Ď | D | "Ďáblice" → D | +| É, È, Ê, Ë, Ě, Ē | E | "Éire" → E | +| Í, Ì, Î, Ï, Ī | I | "Ísland" → I | +| Ñ, Ń, Ň | N | "España" → N | +| Ó, Ò, Ô, Õ, Ö, Ø, Ō | O | "Österreich" → O | +| Ř | R | "Říčany" → R | +| Š, Ś, Ş | S | "Šumperk" → S | +| Ť | T | "Ťažký" → T | +| Ú, Ù, Û, Ü, Ů, Ū | U | "Ústí" → U | +| Ý, Ÿ | Y | "Ýmir" → Y | +| Ž, Ź, Ż | Z | "Žilina" → Z | +| Ł | L | "Łódź" → L | +| Æ | AE | "Ærø" → AE | +| Œ | OE | "Œuvre" → OE | +| ß | SS | "Straße" → SS | + +### Implementation + +```python +import unicodedata + +def normalize_diacritics(text: str) -> str: + """ + Normalize diacritics to ASCII equivalents. 
+
+    Examples:
+        "Č" → "C"
+        "Ř" → "R"
+        "Ö" → "O"
+        "ñ" → "n"
+        "ß" → "SS"
+    """
+    # Letters with no combining-mark decomposition (Æ, Œ, ß, Ø, Ł from the
+    # table above) must be mapped explicitly; NFD alone cannot reduce them
+    special = {'Æ': 'AE', 'æ': 'ae', 'Œ': 'OE', 'œ': 'oe', 'ß': 'SS',
+               'Ø': 'O', 'ø': 'o', 'Ł': 'L', 'ł': 'l'}
+    text = ''.join(special.get(c, c) for c in text)
+    # NFD decomposition separates base characters from combining marks
+    normalized = unicodedata.normalize('NFD', text)
+    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
+    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
+    return ascii_text
+
+# Examples
+normalize_diacritics("VHSPAOČRZS")  # Returns "VHSPAOCRZS"
+normalize_diacritics("Łódź")        # Returns "Lodz"
+```
+
+### Languages Commonly Affected
+
+| Language | Common Diacritics | Example Institution |
+|----------|-------------------|---------------------|
+| **Czech** | Č, Ř, Š, Ž, Ě, Ů | Vlastivědné muzeum → VM (ASCII only) |
+| **Polish** | Ł, Ń, Ó, Ś, Ź, Ż, Ą, Ę | Biblioteka Łódzka → BL |
+| **German** | Ä, Ö, Ü, ß | Österreichische Nationalbibliothek → ON |
+| **French** | É, È, Ê, Ç, Ô | Bibliothèque nationale → BN |
+| **Spanish** | Ñ, Á, É, Í, Ó, Ú | Museo Nacional → MN |
+| **Portuguese** | Ã, Õ, Ç, Á, É | Biblioteca Nacional → BN |
+| **Nordic** | Å, Ä, Ö, Ø, Æ | Nationalmuseet → N |
+| **Turkish** | Ç, Ğ, İ, Ö, Ş, Ü | İstanbul Üniversitesi → IU |
+| **Hungarian** | Á, É, Í, Ó, Ö, Ő, Ú, Ü, Ű | Országos Levéltár → OL |
+| **Romanian** | Ă, Â, Î, Ș, Ț | Biblioteca Națională → BN |
+
+---
+
+## Rule 2: Special Characters MUST Be Removed (ABBREV-SPECIAL-CHAR)

---

diff --git a/.opencode/TRANSLITERATION_STANDARDS.md b/.opencode/TRANSLITERATION_STANDARDS.md
new file mode 100644
index 0000000000..660c0012f3
--- /dev/null
+++ b/.opencode/TRANSLITERATION_STANDARDS.md
@@ -0,0 +1,787 @@
+# Transliteration Standards for Non-Latin Scripts
+
+**Rule ID**: TRANSLIT-ISO
+**Status**: MANDATORY
+**Applies To**: GHCID abbreviation generation from emic names in non-Latin scripts
+**Created**: 2025-12-08
+
+---
+
+## Summary
+
+**When generating GHCID abbreviations from institution names written in non-Latin scripts, the emic name MUST first be transliterated to Latin characters using the designated ISO or recognized standard for that script.**
+
+This rule
affects **170 institutions** across **21 languages** with non-Latin writing systems.
+
+### Key Principles
+
+1. **Emic name is preserved** - The original script is stored in `custodian_name.emic_name`
+2. **Transliteration is for processing only** - Used to generate abbreviations
+3. **ISO/recognized standards required** - No ad-hoc romanization
+4. **Deterministic output** - Same input always produces same Latin output
+5. **Existing GHCIDs grandfathered** - Only applies to NEW custodians
+
+---
+
+## Transliteration Standards by Script/Language
+
+### Cyrillic Scripts
+
+| Language | ISO Code | Standard | Library/Tool | Notes |
+|----------|----------|----------|--------------|-------|
+| **Russian** | ru | ISO 9:1995 | `transliterate` | Scientific transliteration |
+| **Ukrainian** | uk | ISO 9:1995 | `transliterate` | Includes Ukrainian-specific letters |
+| **Bulgarian** | bg | ISO 9:1995 | `transliterate` | Uses same Cyrillic base |
+| **Serbian** | sr | ISO 9:1995 | `transliterate` | Serbian Cyrillic variant |
+| **Kazakh** | kk | ISO 9:1995 | `transliterate` | Cyrillic-based (pre-2023) |
+
+**ISO 9:1995 Mapping (Core Characters)**:
+
+| Cyrillic | Latin | Cyrillic | Latin |
+|----------|-------|----------|-------|
+| А а | A a | П п | P p |
+| Б б | B b | Р р | R r |
+| В в | V v | С с | S s |
+| Г г | G g | Т т | T t |
+| Д д | D d | У у | U u |
+| Е е | E e | Ф ф | F f |
+| Ё ё | Ë ë | Х х | H h |
+| Ж ж | Ž ž | Ц ц | C c |
+| З з | Z z | Ч ч | Č č |
+| И и | I i | Ш ш | Š š |
+| Й й | J j | Щ щ | Ŝ ŝ |
+| К к | K k | Ъ ъ | ʺ (hard sign) |
+| Л л | L l | Ы ы | Y y |
+| М м | M m | Ь ь | ʹ (soft sign) |
+| Н н | N n | Э э | È è |
+| О о | O o | Ю ю | Û û |
+| | | Я я | Â â |
+
+**Example**:
+```
+Input: Институт восточных рукописей РАН
+ISO 9: Institut vostočnyh rukopisej RAN
+Abbrev: IVRRAN → IVRRAN (after diacritic normalization)
+```
+
+---
+
+### CJK Scripts
+
+#### Chinese (Hanzi)
+
+| Variant | Standard | Library/Tool | Notes |
+|---------|----------|--------------|-------| +| Simplified | Hanyu Pinyin (ISO 7098) | `pypinyin` | Standard PRC romanization | +| Traditional | Hanyu Pinyin | `pypinyin` | Same standard applies | + +**Pinyin Rules**: +- Tone marks are OMITTED for abbreviation (diacritics removed anyway) +- Word boundaries follow natural spacing +- Proper nouns capitalized + +**Example**: +``` +Input: 东巴文化博物院 +Pinyin: Dōngbā Wénhuà Bówùyuàn +ASCII: Dongba Wenhua Bowuyuan +Abbrev: DWB +``` + +#### Japanese (Kanji/Kana) + +| Standard | Library/Tool | Notes | +|----------|--------------|-------| +| Modified Hepburn | `pykakasi`, `romkan` | Most widely used internationally | + +**Hepburn Rules**: +- Long vowels: ō, ū (normalized to o, u for abbreviation) +- Particles: は (wa), を (wo), へ (e) +- Syllabic n: ん = n (before vowels: n') + +**Example**: +``` +Input: 国立中央博物館 +Romaji: Kokuritsu Chūō Hakubutsukan +ASCII: Kokuritsu Chuo Hakubutsukan +Abbrev: KCH +``` + +#### Korean (Hangul) + +| Standard | Library/Tool | Notes | +|----------|--------------|-------| +| Revised Romanization (RR) | `korean-romanizer`, `hangul-romanize` | Official South Korean standard (2000) | + +**RR Rules**: +- No diacritics (unlike McCune-Reischauer) +- Consonant assimilation reflected in spelling +- Word boundaries at natural breaks + +**Example**: +``` +Input: 독립기념관 +RR: Dongnip Ginyeomgwan +Abbrev: DG +``` + +--- + +### Arabic Script + +| Language | ISO Code | Standard | Library/Tool | Notes | +|----------|----------|----------|--------------|-------| +| **Arabic** | ar | ISO 233-2:1993 | `arabic-transliteration` | Simplified standard | +| **Persian/Farsi** | fa | ISO 233-3:1999 | `persian-transliteration` | Persian extensions | +| **Urdu** | ur | ISO 233-3 + Urdu extensions | `urdu-transliteration` | Additional characters | + +**ISO 233 Mapping (Core Arabic)**: + +| Arabic | Name | Latin | +|--------|------|-------| +| ا | Alif | ā / a | +| ب | Ba | b | +| ت | Ta | t | +| ث | Tha | ṯ | +| ج | Jim | ǧ / j | 
+| ح | Ha | ḥ | +| خ | Kha | ḫ / kh | +| د | Dal | d | +| ذ | Dhal | ḏ | +| ر | Ra | r | +| ز | Zay | z | +| س | Sin | s | +| ش | Shin | š / sh | +| ص | Sad | ṣ | +| ض | Dad | ḍ | +| ط | Ta | ṭ | +| ظ | Za | ẓ | +| ع | Ayn | ʿ | +| غ | Ghayn | ġ / gh | +| ف | Fa | f | +| ق | Qaf | q | +| ك | Kaf | k | +| ل | Lam | l | +| م | Mim | m | +| ن | Nun | n | +| ه | Ha | h | +| و | Waw | w / ū | +| ي | Ya | y / ī | + +**Example (Arabic)**: +``` +Input: المكتبة الوطنية للمملكة المغربية +ISO: al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya +ASCII: al-Maktaba al-Wataniya lil-Mamlaka al-Maghribiya +Abbrev: MWMM (skip "al-" articles) +``` + +**Example (Persian)**: +``` +Input: وزارت امور خارجه ایران +ISO: Vezārat-e Omur-e Khāreǧe-ye Īrān +ASCII: Vezarat-e Omur-e Khareje-ye Iran +Abbrev: VOKI (skip "e" connector) +``` + +--- + +### Hebrew Script + +| Standard | Library/Tool | Notes | +|----------|--------------|-------| +| ISO 259-3:1999 | `hebrew-transliteration` | Simplified romanization | + +**ISO 259 Mapping**: + +| Hebrew | Name | Latin | +|--------|------|-------| +| א | Aleph | ʾ / (silent) | +| ב | Bet | b / v | +| ג | Gimel | g | +| ד | Dalet | d | +| ה | He | h | +| ו | Vav | v / o / u | +| ז | Zayin | z | +| ח | Chet | ḥ / ch | +| ט | Tet | ṭ / t | +| י | Yod | y / i | +| כ ך | Kaf | k / kh | +| ל | Lamed | l | +| מ ם | Mem | m | +| נ ן | Nun | n | +| ס | Samekh | s | +| ע | Ayin | ʿ / (silent) | +| פ ף | Pe | p / f | +| צ ץ | Tsade | ṣ / ts | +| ק | Qof | q / k | +| ר | Resh | r | +| ש | Shin/Sin | š / s | +| ת | Tav | t | + +**Example**: +``` +Input: ארכיון הסיפור העממי בישראל +ISO: Arḵiyon ha-Sipur ha-ʿAmami be-Yiśraʾel +ASCII: Arkhiyon ha-Sipur ha-Amami be-Yisrael +Abbrev: ASAY (skip "ha-" and "be-" articles) +``` + +--- + +### Greek Script + +| Standard | Library/Tool | Notes | +|----------|--------------|-------| +| ISO 843:1997 | `greek-transliteration` | Romanization of Greek | + +**ISO 843 Mapping**: + +| Greek | Latin | Greek | Latin | 
+|-------|-------|-------|-------| +| Α α | A a | Ν ν | N n | +| Β β | V v | Ξ ξ | X x | +| Γ γ | G g | Ο ο | O o | +| Δ δ | D d | Π π | P p | +| Ε ε | E e | Ρ ρ | R r | +| Ζ ζ | Z z | Σ σ ς | S s | +| Η η | Ī ī | Τ τ | T t | +| Θ θ | Th th | Υ υ | Y y | +| Ι ι | I i | Φ φ | F f | +| Κ κ | K k | Χ χ | Ch ch | +| Λ λ | L l | Ψ ψ | Ps ps | +| Μ μ | M m | Ω ω | Ō ō | + +**Example**: +``` +Input: Αρχαιολογικό Μουσείο Θεσσαλονίκης +ISO: Archaiologikó Mouseío Thessaloníkīs +ASCII: Archaiologiko Mouseio Thessalonikis +Abbrev: AMT +``` + +--- + +### Indic Scripts + +| Language | Script | Standard | Library/Tool | +|----------|--------|----------|--------------| +| **Hindi** | Devanagari | ISO 15919 | `indic-transliteration` | +| **Bengali** | Bengali | ISO 15919 | `indic-transliteration` | +| **Nepali** | Devanagari | ISO 15919 | `indic-transliteration` | +| **Sinhala** | Sinhala | ISO 15919 | `indic-transliteration` | + +**ISO 15919 Core Consonants (Devanagari)**: + +| Devanagari | Latin | Devanagari | Latin | +|------------|-------|------------|-------| +| क | ka | त | ta | +| ख | kha | थ | tha | +| ग | ga | द | da | +| घ | gha | ध | dha | +| ङ | ṅa | न | na | +| च | ca | प | pa | +| छ | cha | फ | pha | +| ज | ja | ब | ba | +| झ | jha | भ | bha | +| ञ | ña | म | ma | +| ट | ṭa | य | ya | +| ठ | ṭha | र | ra | +| ड | ḍa | ल | la | +| ढ | ḍha | व | va | +| ण | ṇa | श | śa | +| | | ष | ṣa | +| | | स | sa | +| | | ह | ha | + +**Example (Hindi)**: +``` +Input: राजस्थान प्राच्यविद्या प्रतिष्ठान +ISO: Rājasthāna Prācyavidyā Pratiṣṭhāna +ASCII: Rajasthana Pracyavidya Pratishthana +Abbrev: RPP +``` + +--- + +### Southeast Asian Scripts + +| Language | Script | Standard | Library/Tool | +|----------|--------|----------|--------------| +| **Thai** | Thai | ISO 11940-2 | `thai-romanization` | +| **Khmer** | Khmer | ALA-LC | `khmer-romanization` | + +**Thai Example**: +``` +Input: สำนักหอจดหมายเหตุแห่งชาติ +ISO: Samnak Ho Chotmaihet Haeng Chat +Abbrev: SHCHC +``` + +**Khmer 
Example**:
+```
+Input: សារមន្ទីរទួលស្លែង
+ALA-LC: Sāramanṭīr Tūl Slèṅ
+ASCII: Saramantir Tuol Sleng
+Abbrev: STS
+```
+
+---
+
+### Other Scripts
+
+| Language | Script | Standard | Library/Tool |
+|----------|--------|----------|--------------|
+| **Armenian** | Armenian | ISO 9985 | `armenian-transliteration` |
+| **Georgian** | Georgian | ISO 9984 | `georgian-transliteration` |
+
+**Armenian Example**:
+```
+Input: Մատենադարան
+ISO: Matenadaran
+Abbrev: M
+```
+
+**Georgian Example**:
+```
+Input: ხელნაწერთა ეროვნული ცენტრი
+ISO: Xelnawerti Erovnuli C'ent'ri
+ASCII: Khelnawerti Erovnuli Centri
+Abbrev: KEC
+```
+
+---
+
+## Implementation
+
+### Python Transliteration Utility
+
+```python
+#!/usr/bin/env python3
+"""
+Transliteration utility for GHCID abbreviation generation.
+Uses ISO and recognized standards for each script/language.
+""" + +import unicodedata +from typing import Optional + +# Try importing transliteration libraries +try: + from pypinyin import pinyin, Style + HAS_PYPINYIN = True +except ImportError: + HAS_PYPINYIN = False + +try: + import pykakasi + HAS_PYKAKASI = True +except ImportError: + HAS_PYKAKASI = False + +try: + from transliterate import translit + HAS_TRANSLITERATE = True +except ImportError: + HAS_TRANSLITERATE = False + + +def detect_script(text: str) -> str: + """ + Detect the primary script of the input text. + + Returns one of: + - 'latin': Latin alphabet + - 'cyrillic': Cyrillic script + - 'chinese': Chinese characters (Hanzi) + - 'japanese': Japanese (mixed Kanji/Kana) + - 'korean': Korean Hangul + - 'arabic': Arabic script (includes Persian, Urdu) + - 'hebrew': Hebrew script + - 'greek': Greek script + - 'devanagari': Devanagari (Hindi, Nepali, Sanskrit) + - 'bengali': Bengali script + - 'thai': Thai script + - 'armenian': Armenian script + - 'georgian': Georgian script + - 'unknown': Cannot determine + """ + script_ranges = { + 'cyrillic': (0x0400, 0x04FF), + 'arabic': (0x0600, 0x06FF), + 'hebrew': (0x0590, 0x05FF), + 'devanagari': (0x0900, 0x097F), + 'bengali': (0x0980, 0x09FF), + 'thai': (0x0E00, 0x0E7F), + 'greek': (0x0370, 0x03FF), + 'armenian': (0x0530, 0x058F), + 'georgian': (0x10A0, 0x10FF), + 'korean': (0xAC00, 0xD7AF), # Hangul syllables + 'japanese_hiragana': (0x3040, 0x309F), + 'japanese_katakana': (0x30A0, 0x30FF), + 'chinese': (0x4E00, 0x9FFF), # CJK Unified Ideographs + } + + script_counts = {script: 0 for script in script_ranges} + latin_count = 0 + + for char in text: + code = ord(char) + + # Check Latin + if ('a' <= char <= 'z') or ('A' <= char <= 'Z'): + latin_count += 1 + continue + + # Check other scripts + for script, (start, end) in script_ranges.items(): + if start <= code <= end: + script_counts[script] += 1 + break + + # Determine primary script + if latin_count > 0 and all(c == 0 for c in script_counts.values()): + return 'latin' + 
+ # Find max non-Latin script + max_script = max(script_counts, key=script_counts.get) + if script_counts[max_script] > 0: + # Handle Japanese (can be Kanji + Kana) + if max_script in ('japanese_hiragana', 'japanese_katakana', 'chinese'): + if script_counts['japanese_hiragana'] > 0 or script_counts['japanese_katakana'] > 0: + return 'japanese' + return 'chinese' + return max_script + + return 'latin' if latin_count > 0 else 'unknown' + + +def transliterate_cyrillic(text: str, lang: str = 'ru') -> str: + """Transliterate Cyrillic text using ISO 9.""" + if HAS_TRANSLITERATE: + try: + return translit(text, lang, reversed=True) + except Exception: + pass + + # Fallback: basic Cyrillic to Latin mapping + cyrillic_map = { + 'А': 'A', 'Б': 'B', 'В': 'V', 'Г': 'G', 'Д': 'D', 'Е': 'E', + 'Ё': 'E', 'Ж': 'Zh', 'З': 'Z', 'И': 'I', 'Й': 'Y', 'К': 'K', + 'Л': 'L', 'М': 'M', 'Н': 'N', 'О': 'O', 'П': 'P', 'Р': 'R', + 'С': 'S', 'Т': 'T', 'У': 'U', 'Ф': 'F', 'Х': 'Kh', 'Ц': 'Ts', + 'Ч': 'Ch', 'Ш': 'Sh', 'Щ': 'Shch', 'Ъ': '', 'Ы': 'Y', 'Ь': '', + 'Э': 'E', 'Ю': 'Yu', 'Я': 'Ya', + 'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e', + 'ё': 'e', 'ж': 'zh', 'з': 'z', 'и': 'i', 'й': 'y', 'к': 'k', + 'л': 'l', 'м': 'm', 'н': 'n', 'о': 'o', 'п': 'p', 'р': 'r', + 'с': 's', 'т': 't', 'у': 'u', 'ф': 'f', 'х': 'kh', 'ц': 'ts', + 'ч': 'ch', 'ш': 'sh', 'щ': 'shch', 'ъ': '', 'ы': 'y', 'ь': '', + 'э': 'e', 'ю': 'yu', 'я': 'ya', + # Ukrainian additions + 'І': 'I', 'і': 'i', 'Ї': 'Yi', 'ї': 'yi', 'Є': 'Ye', 'є': 'ye', + 'Ґ': 'G', 'ґ': 'g', + } + return ''.join(cyrillic_map.get(c, c) for c in text) + + +def transliterate_chinese(text: str) -> str: + """Transliterate Chinese to Pinyin.""" + if HAS_PYPINYIN: + # Get pinyin without tone marks + result = pinyin(text, style=Style.NORMAL) + return ' '.join([''.join(p) for p in result]) + + # Fallback: return as-is (requires manual handling) + return text + + +def transliterate_japanese(text: str) -> str: + """Transliterate Japanese to Romaji 
(Hepburn).""" + if HAS_PYKAKASI: + kakasi = pykakasi.kakasi() + result = kakasi.convert(text) + return ' '.join([item['hepburn'] for item in result]) + + # Fallback: return as-is + return text + + +def transliterate_korean(text: str) -> str: + """Transliterate Korean Hangul to Revised Romanization.""" + # Korean romanization is complex - use library if available + try: + from korean_romanizer.romanizer import Romanizer + r = Romanizer(text) + return r.romanize() + except ImportError: + pass + + # Fallback: basic Hangul syllable decomposition + # This is a simplified implementation + return text + + +def transliterate_arabic(text: str) -> str: + """Transliterate Arabic script to Latin (ISO 233 simplified).""" + arabic_map = { + 'ا': 'a', 'أ': 'a', 'إ': 'i', 'آ': 'a', + 'ب': 'b', 'ت': 't', 'ث': 'th', 'ج': 'j', + 'ح': 'h', 'خ': 'kh', 'د': 'd', 'ذ': 'dh', + 'ر': 'r', 'ز': 'z', 'س': 's', 'ش': 'sh', + 'ص': 's', 'ض': 'd', 'ط': 't', 'ظ': 'z', + 'ع': "'", 'غ': 'gh', 'ف': 'f', 'ق': 'q', + 'ك': 'k', 'ل': 'l', 'م': 'm', 'ن': 'n', + 'ه': 'h', 'و': 'w', 'ي': 'y', 'ى': 'a', + 'ة': 'a', 'ء': "'", + # Persian additions + 'پ': 'p', 'چ': 'ch', 'ژ': 'zh', 'گ': 'g', + 'ک': 'k', 'ی': 'i', + } + result = [] + for c in text: + if c in arabic_map: + result.append(arabic_map[c]) + elif c == ' ' or c.isalnum(): + result.append(c) + return ''.join(result) + + +def transliterate_hebrew(text: str) -> str: + """Transliterate Hebrew to Latin (ISO 259 simplified).""" + hebrew_map = { + 'א': '', 'ב': 'v', 'ג': 'g', 'ד': 'd', 'ה': 'h', + 'ו': 'v', 'ז': 'z', 'ח': 'ch', 'ט': 't', 'י': 'y', + 'כ': 'k', 'ך': 'k', 'ל': 'l', 'מ': 'm', 'ם': 'm', + 'נ': 'n', 'ן': 'n', 'ס': 's', 'ע': '', 'פ': 'f', + 'ף': 'f', 'צ': 'ts', 'ץ': 'ts', 'ק': 'k', 'ר': 'r', + 'ש': 'sh', 'ת': 't', + } + result = [] + for c in text: + if c in hebrew_map: + result.append(hebrew_map[c]) + elif c == ' ' or c.isalnum(): + result.append(c) + return ''.join(result) + + +def transliterate_greek(text: str) -> str: + """Transliterate Greek to 
Latin (ISO 843)."""
+    greek_map = {
+        'Α': 'A', 'α': 'a', 'Β': 'V', 'β': 'v', 'Γ': 'G', 'γ': 'g',
+        'Δ': 'D', 'δ': 'd', 'Ε': 'E', 'ε': 'e', 'Ζ': 'Z', 'ζ': 'z',
+        'Η': 'I', 'η': 'i', 'Θ': 'Th', 'θ': 'th', 'Ι': 'I', 'ι': 'i',
+        'Κ': 'K', 'κ': 'k', 'Λ': 'L', 'λ': 'l', 'Μ': 'M', 'μ': 'm',
+        'Ν': 'N', 'ν': 'n', 'Ξ': 'X', 'ξ': 'x', 'Ο': 'O', 'ο': 'o',
+        'Π': 'P', 'π': 'p', 'Ρ': 'R', 'ρ': 'r', 'Σ': 'S', 'σ': 's',
+        'ς': 's', 'Τ': 'T', 'τ': 't', 'Υ': 'Y', 'υ': 'y', 'Φ': 'F',
+        'φ': 'f', 'Χ': 'Ch', 'χ': 'ch', 'Ψ': 'Ps', 'ψ': 'ps',
+        'Ω': 'O', 'ω': 'o',
+    }
+    return ''.join(greek_map.get(c, c) for c in text)
+
+
+def transliterate_devanagari(text: str) -> str:
+    """Transliterate Devanagari to Latin (ISO 15919 simplified)."""
+    try:
+        from indic_transliteration import sanscript
+        from indic_transliteration.sanscript import transliterate as indic_translit
+        return indic_translit(text, sanscript.DEVANAGARI, sanscript.IAST)
+    except ImportError:
+        pass
+
+    # Fallback: basic mapping
+    # This would need a full Devanagari character map
+    return text
+
+
+def transliterate_thai(text: str) -> str:
+    """Transliterate Thai to Latin (Royal Thai General System)."""
+    try:
+        from pythainlp.transliterate import romanize
+        return romanize(text, engine='royin')
+    except ImportError:
+        pass
+
+    # Fallback
+    return text
+
+
+def transliterate(text: str, lang: Optional[str] = None) -> str:
+    """
+    Transliterate text from non-Latin script to Latin.
+
+    Args:
+        text: Input text in any script
+        lang: Optional ISO 639-1 language code (e.g., 'ru', 'zh', 'ko')
+              If not provided, script is auto-detected.
+
+    Returns:
+        Transliterated text in Latin characters.
+ """ + if not text: + return text + + # Detect script if language not provided + if lang: + script_map = { + 'ru': 'cyrillic', 'uk': 'cyrillic', 'bg': 'cyrillic', + 'sr': 'cyrillic', 'kk': 'cyrillic', + 'zh': 'chinese', + 'ja': 'japanese', + 'ko': 'korean', + 'ar': 'arabic', 'fa': 'arabic', 'ur': 'arabic', + 'he': 'hebrew', + 'el': 'greek', + 'hi': 'devanagari', 'ne': 'devanagari', + 'bn': 'bengali', + 'th': 'thai', + 'hy': 'armenian', + 'ka': 'georgian', + } + script = script_map.get(lang, detect_script(text)) + else: + script = detect_script(text) + + # Apply appropriate transliteration + transliterators = { + 'cyrillic': lambda t: transliterate_cyrillic(t, lang or 'ru'), + 'chinese': transliterate_chinese, + 'japanese': transliterate_japanese, + 'korean': transliterate_korean, + 'arabic': transliterate_arabic, + 'hebrew': transliterate_hebrew, + 'greek': transliterate_greek, + 'devanagari': transliterate_devanagari, + 'thai': transliterate_thai, + 'latin': lambda t: t, # No transliteration needed + } + + translit_func = transliterators.get(script, lambda t: t) + result = translit_func(text) + + # Normalize diacritics to ASCII + normalized = unicodedata.normalize('NFD', result) + ascii_result = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn') + + return ascii_result + + +def transliterate_for_abbreviation(emic_name: str, lang: str) -> str: + """ + Transliterate emic name for GHCID abbreviation generation. + + This is the main entry point for GHCID generation scripts. 
+ + Args: + emic_name: Institution name in original script + lang: ISO 639-1 language code + + Returns: + Transliterated name ready for abbreviation extraction + """ + # Step 1: Transliterate to Latin + latin = transliterate(emic_name, lang) + + # Step 2: Normalize diacritics (handled in transliterate()) + + # Step 3: Remove special characters (except spaces) + import re + clean = re.sub(r'[^a-zA-Z\s]', ' ', latin) + + # Step 4: Normalize whitespace + clean = ' '.join(clean.split()) + + return clean + + +# Example usage +if __name__ == '__main__': + test_cases = [ + ('Институт восточных рукописей РАН', 'ru'), + ('东巴文化博物院', 'zh'), + ('독립기념관', 'ko'), + ('राजस्थान प्राच्यविद्या प्रतिष्ठान', 'hi'), + ('المكتبة الوطنية للمملكة المغربية', 'ar'), + ('ארכיון הסיפור העממי בישראל', 'he'), + ('Αρχαιολογικό Μουσείο Θεσσαλονίκης', 'el'), + ] + + for name, lang in test_cases: + result = transliterate_for_abbreviation(name, lang) + print(f'{lang}: {name}') + print(f' → {result}') + print() +``` + +--- + +## Skip Words by Language + +When extracting abbreviations from transliterated text, skip these articles/prepositions: + +### Arabic +- `al-` (the definite article) +- `bi-`, `li-`, `fi-` (prepositions) + +### Hebrew +- `ha-` (the) +- `ve-` (and) +- `be-`, `le-`, `me-` (prepositions) + +### Persian +- `-e`, `-ye` (ezafe connector) +- `va` (and) + +### CJK Languages +- No skip words (particles are integral to meaning) + +### Indic Languages +- `ka`, `ki`, `ke` (Hindi: of) +- `aur` (Hindi: and) + +--- + +## Validation + +### Check Transliteration Output + +```python +def validate_transliteration(result: str) -> bool: + """ + Validate that transliteration output contains only ASCII letters and spaces. + """ + import re + return bool(re.match(r'^[a-zA-Z\s]+$', result)) +``` + +### Manual Review Queue + +Non-Latin institutions should be flagged for manual review if: +1. Transliteration library not available for that script +2. Confidence in transliteration is low +3. 
Institution has multiple official romanizations + +--- + +## Related Documentation + +- `AGENTS.md` - Rule 12: Transliteration Standards +- `ABBREVIATION_SPECIAL_CHAR_RULE.md` - Character filtering after transliteration +- `docs/TRANSLITERATION_CONVENTIONS.md` - Extended examples and edge cases +- `scripts/transliterate_emic_names.py` - Production transliteration script + +--- + +## Changelog + +| Date | Change | +|------|--------| +| 2025-12-08 | Initial standards document created | diff --git a/.opencode/ZAI_GLM_API_RULES.md b/.opencode/ZAI_GLM_API_RULES.md new file mode 100644 index 0000000000..ec6811fac3 --- /dev/null +++ b/.opencode/ZAI_GLM_API_RULES.md @@ -0,0 +1,277 @@ +# Z.AI GLM API Rules for AI Agents + +**Last Updated**: 2025-12-08 +**Status**: MANDATORY for all LLM API calls in scripts + +--- + +## CRITICAL: Use Z.AI Coding Plan, NOT BigModel API + +**This project uses the Z.AI Coding Plan endpoint, which is the SAME endpoint that OpenCode uses internally.** + +The regular BigModel API (`open.bigmodel.cn`) will NOT work with the tokens stored in this project. You MUST use the Z.AI Coding Plan endpoint. + +--- + +## API Configuration + +### Correct Endpoint + +| Property | Value | +|----------|-------| +| **API URL** | `https://api.z.ai/api/coding/paas/v4/chat/completions` | +| **Auth Header** | `Authorization: Bearer {ZAI_API_TOKEN}` | +| **Content-Type** | `application/json` | + +### Available Models + +| Model | Description | Cost | +|-------|-------------|------| +| `glm-4.5` | Standard GLM-4.5 | Free (0 per token) | +| `glm-4.5-air` | GLM-4.5 Air variant | Free | +| `glm-4.5-flash` | Fast GLM-4.5 | Free | +| `glm-4.5v` | Vision-capable GLM-4.5 | Free | +| `glm-4.6` | Latest GLM-4.6 (recommended) | Free | + +**Recommended Model**: `glm-4.6` for best quality + +--- + +## Authentication + +### Token Location + +The Z.AI API token can be obtained from two locations: + +1. 
**Environment Variable** (preferred for scripts): + ```bash + # In .env file at project root + ZAI_API_TOKEN=your_token_here + ``` + +2. **OpenCode Auth File** (reference only): + ``` + ~/.local/share/opencode/auth.json + ``` + The token is stored under key `zai-coding-plan`. + +### Getting the Token + +If you need to set up the token: + +1. The token is shared with OpenCode's Z.AI Coding Plan +2. Check `~/.local/share/opencode/auth.json` for existing token +3. Add to `.env` file as `ZAI_API_TOKEN` + +--- + +## Python Implementation + +### Correct Implementation + +```python +import os +import httpx + +class GLMClient: + """Client for Z.AI GLM API (Coding Plan endpoint).""" + + # CORRECT endpoint - Z.AI Coding Plan + API_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions" + + def __init__(self, model: str = "glm-4.6"): + self.api_key = os.environ.get("ZAI_API_TOKEN") + if not self.api_key: + raise ValueError("ZAI_API_TOKEN not found in environment") + + self.model = model + self.client = httpx.AsyncClient( + timeout=60.0, + headers={ + "Authorization": f"Bearer {self.api_key}", + "Content-Type": "application/json", + } + ) + + async def chat(self, messages: list) -> dict: + """Send chat completion request.""" + response = await self.client.post( + self.API_URL, + json={ + "model": self.model, + "messages": messages, + "temperature": 0.3, + } + ) + response.raise_for_status() + return response.json() +``` + +### WRONG Implementation (DO NOT USE) + +```python +# WRONG - This endpoint will fail with quota errors +API_URL = "https://open.bigmodel.cn/api/paas/v4/chat/completions" + +# WRONG - This is for regular BigModel API, not Z.AI Coding Plan +api_key = os.environ.get("ZHIPU_API_KEY") +``` + +--- + +## Request Format + +### Chat Completion Request + +```json +{ + "model": "glm-4.6", + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." 
+ }, + { + "role": "user", + "content": "Your prompt here" + } + ], + "temperature": 0.3, + "max_tokens": 4096 +} +``` + +### Response Format + +```json +{ + "id": "request-id", + "created": 1733651234, + "model": "glm-4.6", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Response text here" + }, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 100, + "completion_tokens": 50, + "total_tokens": 150 + } +} +``` + +--- + +## Error Handling + +### Common Errors + +| Error | Cause | Solution | +|-------|-------|----------| +| `401 Unauthorized` | Invalid or missing token | Check ZAI_API_TOKEN in .env | +| `403 Quota exceeded` | Wrong endpoint (BigModel) | Use Z.AI Coding Plan endpoint | +| `429 Rate limited` | Too many requests | Add delay between requests | +| `500 Server error` | API issue | Retry with exponential backoff | + +### Retry Strategy + +```python +import asyncio +from tenacity import retry, stop_after_attempt, wait_exponential + +@retry( + stop=stop_after_attempt(3), + wait=wait_exponential(multiplier=1, min=2, max=10) +) +async def call_api_with_retry(client, messages): + return await client.chat(messages) +``` + +--- + +## Integration with CH-Annotator + +When using GLM for entity recognition or verification, always reference CH-Annotator v1.7.0: + +```python +PROMPT = """You are a heritage institution classifier following CH-Annotator v1.7.0 convention. + +## CH-Annotator GRP.HER Definition +Heritage institutions are organizations that: +- Collect, preserve, and provide access to cultural heritage materials +- Include: museums (GRP.HER.MUS), libraries (GRP.HER.LIB), archives (GRP.HER.ARC), galleries (GRP.HER.GAL) + +## Entity to Analyze +... +""" +``` + +See `.opencode/CH_ANNOTATOR_CONVENTION.md` for full convention details. 
+ +--- + +## Scripts Using GLM API + +The following scripts use the Z.AI GLM API: + +| Script | Purpose | +|--------|---------| +| `scripts/reenrich_wikidata_with_verification.py` | Wikidata entity verification using GLM-4.6 | + +When creating new scripts that need LLM capabilities, follow this pattern. + +--- + +## Environment Setup Checklist + +When setting up a new environment: + +- [ ] Check `~/.local/share/opencode/auth.json` for existing Z.AI token +- [ ] Add `ZAI_API_TOKEN` to `.env` file +- [ ] Verify endpoint is `https://api.z.ai/api/coding/paas/v4/chat/completions` +- [ ] Test with `glm-4.6` model +- [ ] Reference CH-Annotator v1.7.0 for entity recognition tasks + +--- + +## AI Agent Rules + +### DO + +- Use `https://api.z.ai/api/coding/paas/v4/chat/completions` endpoint +- Get token from `ZAI_API_TOKEN` environment variable +- Use `glm-4.6` as the default model +- Reference CH-Annotator v1.7.0 for entity tasks +- Add retry logic with exponential backoff +- Handle JSON parsing errors gracefully + +### DO NOT + +- Use `open.bigmodel.cn` endpoint (wrong API) +- Use `ZHIPU_API_KEY` environment variable (wrong key) +- Hard-code API tokens in scripts +- Skip error handling for API calls +- Forget to load `.env` file before accessing environment + +--- + +## Related Documentation + +- **CH-Annotator Convention**: `.opencode/CH_ANNOTATOR_CONVENTION.md` +- **Entity Annotation**: `data/entity_annotation/ch_annotator-v1_7_0.yaml` +- **Wikidata Enrichment Script**: `scripts/reenrich_wikidata_with_verification.py` + +--- + +## Version History + +| Date | Change | +|------|--------| +| 2025-12-08 | Initial documentation - Fixed API endpoint discovery | + diff --git a/AGENTS.md b/AGENTS.md index 7ea52d8690..5822654655 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -720,6 +720,66 @@ claim: --- +### Rule 11: Z.AI GLM API for LLM Tasks (NOT BigModel) + +**CRITICAL: When using GLM models in scripts, use the Z.AI Coding Plan endpoint, NOT the regular BigModel API.** + +The 
project uses the same Z.AI Coding Plan that OpenCode uses internally. The regular BigModel API (`open.bigmodel.cn`) will NOT work with our tokens. + +**Correct API Configuration**: + +| Property | Value | +|----------|-------| +| **API URL** | `https://api.z.ai/api/coding/paas/v4/chat/completions` | +| **Environment Variable** | `ZAI_API_TOKEN` | +| **Recommended Model** | `glm-4.6` | +| **Cost** | Free (0 per token for all GLM models) | + +**Available Models**: `glm-4.5`, `glm-4.5-air`, `glm-4.5-flash`, `glm-4.5v`, `glm-4.6` + +**Python Implementation**: + +```python +import os +import httpx + +# CORRECT - Z.AI Coding Plan endpoint +API_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions" +api_key = os.environ.get("ZAI_API_TOKEN") + +client = httpx.AsyncClient( + timeout=60.0, + headers={ + "Authorization": f"Bearer {api_key}", + "Content-Type": "application/json", + } +) + +# WRONG - This will fail with quota errors! +# API_URL = "https://open.bigmodel.cn/api/paas/v4/chat/completions" +# api_key = os.environ.get("ZHIPU_API_KEY") +``` + +**Integration with CH-Annotator**: When using GLM for entity recognition, always reference CH-Annotator v1.7.0 in prompts: + +```python +PROMPT = """You are following CH-Annotator v1.7.0 convention. +Heritage institutions are type GRP.HER with subtypes: +- GRP.HER.MUS (museums) +- GRP.HER.LIB (libraries) +- GRP.HER.ARC (archives) +- GRP.HER.GAL (galleries) +...""" +``` + +**Token Location**: +1. **Environment**: Add `ZAI_API_TOKEN` to `.env` file +2. 
**OpenCode Auth**: Token stored in `~/.local/share/opencode/auth.json` under key `zai-coding-plan` + +**See**: `.opencode/ZAI_GLM_API_RULES.md` for complete documentation + +--- + ## Project Overview **Goal**: Extract structured data about worldwide GLAMORCUBESFIXPHDNT (Galleries, Libraries, Archives, Museums, Official institutions, Research centers, Corporations, Unknown, Botanical gardens/zoos, Educational providers, Societies, Features, Intangible heritage groups, miXed, Personal collections, Holy sites, Digital platforms, NGOs, Taste/smell heritage) institutions from 139+ Claude conversation JSON files and integrate with authoritative CSV datasets. @@ -2571,6 +2631,16 @@ location_resolution: **The institution abbreviation component uses the FIRST LETTER of each significant word in the official emic (native language) name.** +**⚠️ GRANDFATHERING POLICY (PID STABILITY)** + +Existing GHCIDs created before December 2025 are **grandfathered** - their abbreviations will NOT be updated even if derived from English translations rather than emic names. This preserves PID stability per the "Cool URIs Don't Change" principle. + +**Applies to:** +- 817 UNESCO Memory of the World custodian files enriched with `custodian_name.emic_name` +- Abbreviations like `NLP` (National Library of Peru) remain unchanged even though emic name is "Biblioteca Nacional del Perú" (would be `BNP`) + +**For NEW custodians only:** Apply emic name abbreviation protocol described below. + **Abbreviation Rules**: 1. Use the **CustodianName** (official emic name), NOT an English translation 2. Take the **first letter** of each word @@ -2681,6 +2751,154 @@ ghcid_current: SX-XX-PHI-O-DRIMSM # ✅ Alphabetic only **See**: `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` for complete documentation +### 🚨 CRITICAL: Diacritics MUST Be Normalized to ASCII in Abbreviations 🚨 + +**When generating abbreviations for GHCID, diacritics (accented characters) MUST be normalized to their ASCII base letter equivalents. 
Only ASCII uppercase letters (A-Z) are permitted.**
+
+This rule applies to ALL languages with diacritical marks including Czech, Polish, German, French, Spanish, Portuguese, Nordic languages, Hungarian, Romanian, Turkish, and others.
+
+**RATIONALE**:
+1. **URI/URL safety** - Non-ASCII characters require percent-encoding
+2. **Cross-system compatibility** - ASCII is universally supported
+3. **Filename safety** - Some systems have issues with non-ASCII filenames
+4. **Human readability** - Easier to type and communicate
+
+**DIACRITICS NORMALIZATION TABLE**:
+
+| Language | Diacritics | ASCII Equivalent |
+|----------|------------|------------------|
+| **Czech** | Č, Ř, Š, Ž, Ě, Ů | C, R, S, Z, E, U |
+| **Polish** | Ł, Ń, Ó, Ś, Ź, Ż, Ą, Ę | L, N, O, S, Z, Z, A, E |
+| **German** | Ä, Ö, Ü, ß | A, O, U, SS |
+| **French** | É, È, Ê, Ç, Ô, Â | E, E, E, C, O, A |
+| **Spanish** | Ñ, Á, É, Í, Ó, Ú | N, A, E, I, O, U |
+| **Portuguese** | Ã, Õ, Ç, Á, É | A, O, C, A, E |
+| **Nordic** | Å, Ä, Ö, Ø, Æ | A, A, O, O, AE |
+| **Hungarian** | Á, É, Í, Ó, Ö, Ő, Ú, Ü, Ű | A, E, I, O, O, O, U, U, U |
+| **Turkish** | Ç, Ğ, İ, Ö, Ş, Ü | C, G, I, O, S, U |
+| **Romanian** | Ă, Â, Î, Ș, Ț | A, A, I, S, T |
+
+**REAL-WORLD EXAMPLE** (Czech institution):
+
+```yaml
+# INCORRECT - Contains diacritics:
+ghcid_current: CZ-VY-TEL-L-VHSPAOČRZS  # ❌ Contains "Č"
+
+# CORRECT - ASCII only:
+ghcid_current: CZ-VY-TEL-L-VHSPAOCRZS  # ✅ "Č" → "C"
+```
+
+**IMPLEMENTATION**:
+
+```python
+import unicodedata
+
+def normalize_diacritics(text: str) -> str:
+    """Normalize diacritics to ASCII equivalents."""
+    # NFD decomposition separates base characters from combining marks
+    normalized = unicodedata.normalize('NFD', text)
+    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
+    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
+    return ascii_text
+
+# Example
+normalize_diacritics("VHSPAOČRZS")  # Returns "VHSPAOCRZS"
+```
+
+**EXAMPLES**:
+
+| Emic Name (with diacritics) | Abbreviation | Wrong |
+|-----------------------------|--------------|-------|
+| Vlastivědné muzeum v Šumperku | VMS | VMŠ ❌ |
+| Österreichische Nationalbibliothek | ON | ÖN ❌ |
+| Bibliothèque nationale de France | BNF | BNF (OK - è not in first letter) |
+| Muzeum Łódzkie | ML | MŁ ❌ |
+| Þjóðminjasafn Íslands | TI | ÞI ❌ |
+
+**See**: `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` for complete documentation (covers both special characters and diacritics)
+
+### 🚨 CRITICAL: Non-Latin Scripts MUST Be Transliterated Before Abbreviation 🚨
+
+**When generating GHCID abbreviations from institution names in non-Latin scripts (Cyrillic, Chinese, Japanese, Korean, Arabic, Hebrew, Greek, Devanagari, Thai, etc.), the emic name MUST first be transliterated to Latin characters using ISO or recognized standards.**
+
+This rule affects **170 institutions** across **21 languages** with non-Latin writing systems.
+
+**CORE PRINCIPLE**: The emic name is PRESERVED in original script in `custodian_name.emic_name`. Transliteration is only used for abbreviation generation.
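This separation can be sketched in record form. The field names `emic_name`, `name_language`, and `abbreviation_source` follow the YAML examples elsewhere in this project; the `abbreviation` key and the helper itself are illustrative, not a production API:

```python
def build_custodian_record(emic_name: str, lang: str, abbreviation: str) -> dict:
    """Illustrative sketch: the emic name is stored untouched, while the
    abbreviation (derived from the Latin transliteration) lives under ghcid."""
    return {
        "custodian_name": {
            "emic_name": emic_name,  # original script, never altered
            "name_language": lang,
        },
        "ghcid": {
            "abbreviation": abbreviation,  # from the transliterated form only
            "abbreviation_source": "transliterated_emic",
        },
    }

record = build_custodian_record("Российская государственная библиотека", "ru", "RGB")
```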
+
+**TRANSLITERATION STANDARDS BY SCRIPT**:
+
+| Script | Languages | Standard | Example |
+|--------|-----------|----------|---------|
+| **Cyrillic** | ru, uk, bg, sr, kk | ISO 9:1995 | Институт → Institut |
+| **Chinese** | zh | Hanyu Pinyin (ISO 7098) | 东巴文化博物院 → Dongba Wenhua Bowuyuan |
+| **Japanese** | ja | Modified Hepburn | 国立博物館 → Kokuritsu Hakubutsukan |
+| **Korean** | ko | Revised Romanization | 독립기념관 → Dongnip Ginyeomgwan |
+| **Arabic** | ar, fa, ur | ISO 233-2/3 | المكتبة الوطنية → al-Maktaba al-Wataniya |
+| **Hebrew** | he | ISO 259-3 | ארכיון → Arkhiyon |
+| **Greek** | el | ISO 843 | Μουσείο → Mouseio |
+| **Devanagari** | hi, ne | ISO 15919 | राजस्थान → Rajasthana |
+| **Bengali** | bn | ISO 15919 | বাংলাদেশ → Bangladesh |
+| **Thai** | th | ISO 11940-2 | สำนักหอ → Samnak Ho |
+| **Armenian** | hy | ISO 9985 | Մատենադարան → Matenadaran |
+| **Georgian** | ka | ISO 9984 | ხელნაწერთა → Khelnawerti |
+
+**WORKFLOW**:
+
+```
+1. Emic Name (original script)
+   ↓
+2. Transliterate to Latin (ISO standard)
+   ↓
+3. Normalize diacritics (remove accents)
+   ↓
+4. Skip articles/prepositions
+   ↓
+5. Extract first letters → Abbreviation
+```
+
+**EXAMPLES**:
+
+| Language | Emic Name | Transliterated | Abbreviation |
+|----------|-----------|----------------|--------------|
+| **Russian** | Институт восточных рукописей РАН | Institut Vostochnykh Rukopisey RAN | IVRR |
+| **Chinese** | 东巴文化博物院 | Dongba Wenhua Bowuyuan | DWB |
+| **Korean** | 독립기념관 | Dongnip Ginyeomgwan | DG |
+| **Hindi** | राजस्थान प्राच्यविद्या प्रतिष्ठान | Rajasthana Pracyavidya Pratishthana | RPP |
+| **Arabic** | المكتبة الوطنية للمملكة المغربية | al-Maktaba al-Wataniya lil-Mamlaka al-Maghribiya | MWMM |
+| **Hebrew** | ארכיון הסיפור העממי בישראל | Arkhiyon ha-Sipur ha-Amami be-Yisrael | ASAY |
+| **Greek** | Αρχαιολογικό Μουσείο Θεσσαλονίκης | Archaiologiko Mouseio Thessalonikis | AMT |
+
+**SCRIPT-SPECIFIC SKIP WORDS**:
+
+| Language | Skip Words (Articles/Prepositions) |
+|----------|-------------------------------------|
+| **Arabic** | al- (the), bi-, li-, fi- (prepositions) |
+| **Hebrew** | ha- (the), ve- (and), be-, le-, me- |
+| **Persian** | -e, -ye (ezafe connector), va (and) |
+| **CJK** | None (particles integral to meaning) |
+
+**IMPLEMENTATION**:
+
+```python
+from transliteration import transliterate_for_abbreviation
+
+# Input: emic name in non-Latin script + language code
+emic_name = "Институт восточных рукописей РАН"
+lang = "ru"
+
+# Step 1: Transliterate to Latin using ISO standard
+latin = transliterate_for_abbreviation(emic_name, lang)
+# Result: "Institut Vostochnykh Rukopisey RAN"
+
+# Step 2: Apply standard abbreviation extraction (Russian has no skip words)
+abbreviation = extract_abbreviation_from_name(latin)
+# Result: "IVRR"
+```
+
+**GRANDFATHERING POLICY**: Existing abbreviations from 817 UNESCO MoW custodians are grandfathered. This transliteration standard applies only to **NEW custodians** created after December 2025.
+ +**See**: `.opencode/TRANSLITERATION_STANDARDS.md` for complete ISO standards, mapping tables, and Python implementation + --- GHCID uses a **four-identifier strategy** for maximum flexibility and transparency: @@ -3115,7 +3333,7 @@ def test_historical_addition(): --- -**Version**: 0.2.0 -**Schema Version**: v0.2.0 (modular) -**Last Updated**: 2025-11-05 +**Version**: 0.2.1 +**Schema Version**: v0.2.1 (modular) +**Last Updated**: 2025-12-08 **Maintained By**: GLAM Data Extraction Project diff --git a/docs/GLM_API_SETUP.md b/docs/GLM_API_SETUP.md new file mode 100644 index 0000000000..eba9e86b35 --- /dev/null +++ b/docs/GLM_API_SETUP.md @@ -0,0 +1,357 @@ +# GLM API Setup Guide + +This guide explains how to configure and use the GLM-4 language model for entity recognition, verification, and enrichment tasks in the GLAM project. + +## Overview + +The GLAM project uses **GLM-4.6** via the **Z.AI Coding Plan** endpoint for LLM-powered tasks such as: + +- **Entity Verification**: Verify that Wikidata entities are heritage institutions +- **Description Enrichment**: Generate rich descriptions from multiple data sources +- **Entity Resolution**: Match institution names across different data sources +- **Claim Validation**: Verify extracted claims against source documents + +**Cost**: All GLM models are FREE (0 cost per token) on the Z.AI Coding Plan. + +## Prerequisites + +- Python 3.10+ +- `httpx` library for async HTTP requests +- Access to Z.AI Coding Plan (same as OpenCode) + +## Quick Start + +### 1. Set Up Environment Variable + +Add your Z.AI API token to the `.env` file in the project root: + +```bash +# .env file +ZAI_API_TOKEN=your_token_here +``` + +### 2. Find Your Token + +The token is shared with OpenCode. Check: + +```bash +# View OpenCode auth file +cat ~/.local/share/opencode/auth.json | jq '.["zai-coding-plan"]' +``` + +Copy this token to your `.env` file. + +### 3. 
Basic Python Usage + +```python +import os +import httpx +import asyncio +from dotenv import load_dotenv + +# Load environment variables +load_dotenv() + +async def call_glm(): + api_url = "https://api.z.ai/api/coding/paas/v4/chat/completions" + api_key = os.environ.get("ZAI_API_TOKEN") + + async with httpx.AsyncClient(timeout=60.0) as client: + response = await client.post( + api_url, + headers={ + "Authorization": f"Bearer {api_key}", + "Content-Type": "application/json", + }, + json={ + "model": "glm-4.6", + "messages": [ + {"role": "user", "content": "Hello, GLM!"} + ], + "temperature": 0.3, + } + ) + result = response.json() + print(result["choices"][0]["message"]["content"]) + +asyncio.run(call_glm()) +``` + +## API Configuration + +### Endpoint Details + +| Property | Value | +|----------|-------| +| **Base URL** | `https://api.z.ai/api/coding/paas/v4` | +| **Chat Endpoint** | `/chat/completions` | +| **Auth Method** | Bearer Token | +| **Header** | `Authorization: Bearer {token}` | + +### Available Models + +| Model | Speed | Quality | Use Case | +|-------|-------|---------|----------| +| `glm-4.6` | Medium | Highest | Complex reasoning, verification | +| `glm-4.5` | Medium | High | General tasks | +| `glm-4.5-air` | Fast | Good | High-volume processing | +| `glm-4.5-flash` | Fastest | Good | Quick responses | +| `glm-4.5v` | Medium | High | Vision/image tasks | + +**Recommendation**: Use `glm-4.6` for entity verification and complex tasks. + +## Integration with CH-Annotator + +When using GLM for entity recognition tasks, always reference the CH-Annotator convention: + +### Heritage Institution Verification + +```python +VERIFICATION_PROMPT = """You are a heritage institution classifier following CH-Annotator v1.7.0 convention. 
+
+## CH-Annotator GRP.HER Definition
+Heritage institutions are organizations that:
+- Collect, preserve, and provide access to cultural heritage materials
+- Include: museums (GRP.HER.MUS), libraries (GRP.HER.LIB), archives (GRP.HER.ARC), galleries (GRP.HER.GAL)
+
+## Entity Types That Are NOT Heritage Institutions
+- Cities, towns, municipalities (places, not institutions)
+- General businesses or companies
+- People/individuals
+- Events, festivals, exhibitions (temporary)
+
+## Your Task
+Analyze the entity and respond with JSON only, in exactly this shape:
+
+{
+  "is_heritage_institution": true/false,
+  "subtype": "MUS|LIB|ARC|GAL|OTHER|null",
+  "confidence": 0.95,
+  "reasoning": "Brief explanation"
+}
+"""
+```
+
+### Entity Type Mapping
+
+| CH-Annotator Type | GLAM Institution Type |
+|-------------------|----------------------|
+| GRP.HER.MUS | MUSEUM |
+| GRP.HER.LIB | LIBRARY |
+| GRP.HER.ARC | ARCHIVE |
+| GRP.HER.GAL | GALLERY |
+| GRP.HER.RES | RESEARCH_CENTER |
+| GRP.HER.BOT | BOTANICAL_ZOO |
+| GRP.HER.EDU | EDUCATION_PROVIDER |
+
+## Complete Implementation Example
+
+### Wikidata Verification Script
+
+See `scripts/reenrich_wikidata_with_verification.py` for a complete example:
+
+```python
+import os
+import re
+import json
+import httpx
+from typing import Any, Dict, List
+
+class GLMHeritageVerifier:
+    """Verify Wikidata entities using GLM-4.6 and CH-Annotator."""
+
+    API_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions"
+
+    def __init__(self, model: str = "glm-4.6"):
+        self.api_key = os.environ.get("ZAI_API_TOKEN")
+        if not self.api_key:
+            raise ValueError("ZAI_API_TOKEN not found in environment")
+
+        self.model = model
+        self.client = httpx.AsyncClient(
+            timeout=60.0,
+            headers={
+                "Authorization": f"Bearer {self.api_key}",
+                "Content-Type": "application/json",
+            }
+        )
+
+    async def verify_heritage_institution(
+        self,
+        institution_name: str,
+        wikidata_label: str,
+        wikidata_description: str,
+        instance_of_types: List[str],
+    ) -> Dict[str, Any]:
+        """Check if a Wikidata entity is a heritage institution."""
+
+        prompt = f"""Analyze if this entity is a heritage institution (GRP.HER):
+
+Institution Name: {institution_name}
+Wikidata Label: {wikidata_label}
+Description: {wikidata_description}
+Instance Of: {', '.join(instance_of_types)}
+
+Respond with JSON only."""
+
+        response = await self.client.post(
+            self.API_URL,
+            json={
+                "model": self.model,
+                "messages": [
+                    # VERIFICATION_PROMPT is the module-level system prompt defined above
+                    {"role": "system", "content": VERIFICATION_PROMPT},
+                    {"role": "user", "content": prompt}
+                ],
+                "temperature": 0.1,
+            }
+        )
+
+        result = response.json()
+        content = result["choices"][0]["message"]["content"]
+
+        # Parse JSON from response
+        json_match = re.search(r'\{.*\}', content, re.DOTALL)
+        if json_match:
+            return json.loads(json_match.group())
+
+        return {"is_heritage_institution": False, "error": "No JSON found"}
+```
+
+## Error Handling
+
+### Common Errors
+
+| Error Code | Meaning | Solution |
+|------------|---------|----------|
+| 401 | Unauthorized | Check ZAI_API_TOKEN |
+| 403 | Forbidden/Quota | Using wrong endpoint (use Z.AI, not BigModel) |
+| 429 | Rate Limited | Add delays between requests |
+| 500 | Server Error | Retry with backoff |
+
+### Retry Pattern
+
+```python
+from tenacity import retry, stop_after_attempt, wait_exponential
+
+@retry(
+    stop=stop_after_attempt(3),
+    wait=wait_exponential(multiplier=1, min=2, max=10)
+)
+async def call_with_retry(client, messages):
+    response = await client.post(API_URL, json={"model": "glm-4.6", "messages": messages})
+    response.raise_for_status()
+    return response.json()
+```
+
+### JSON Parsing
+
+LLM responses may contain text around JSON.
Always parse safely:
+
+```python
+import re
+import json
+
+def parse_json_from_response(content: str) -> dict:
+    """Extract JSON from LLM response text."""
+    # Try to find a fenced JSON block first
+    json_match = re.search(r'```json\s*(\{.*?\})\s*```', content, re.DOTALL)
+    if json_match:
+        return json.loads(json_match.group(1))
+
+    # Fall back to bare JSON
+    json_match = re.search(r'\{.*\}', content, re.DOTALL)
+    if json_match:
+        return json.loads(json_match.group())
+
+    return {"error": "No JSON found in response"}
+```
+
+## Best Practices
+
+### 1. Use Low Temperature for Verification
+
+```python
+{
+    "temperature": 0.1  # Low for consistent, deterministic responses
+}
+```
+
+### 2. Request JSON Output
+
+Always request JSON format in your prompts for structured responses, e.g. end the prompt with:
+
+```
+Respond in JSON format only:
+{"key": "value"}
+```
+
+### 3. Batch Processing
+
+Process multiple entities with rate limiting:
+
+```python
+import asyncio
+from typing import List
+
+async def batch_verify(entities: List[dict], rate_limit: float = 0.5):
+    """Verify entities with rate limiting."""
+    results = []
+    for entity in entities:
+        result = await verifier.verify(entity)
+        results.append(result)
+        await asyncio.sleep(rate_limit)  # Respect rate limits
+    return results
+```
+
+### 4. Always Reference CH-Annotator
+
+For entity recognition tasks, include CH-Annotator context:
+
+```python
+system_prompt = """You are following CH-Annotator v1.7.0 convention.
+Heritage institutions are type GRP.HER with subtypes for museums, libraries, archives, and galleries.
+""" +``` + +## Related Scripts + +| Script | Purpose | +|--------|---------| +| `scripts/reenrich_wikidata_with_verification.py` | Wikidata entity verification | + +## Related Documentation + +- **Agent Rules**: `AGENTS.md` (Rule 11: Z.AI GLM API) +- **Agent Config**: `.opencode/ZAI_GLM_API_RULES.md` +- **CH-Annotator**: `.opencode/CH_ANNOTATOR_CONVENTION.md` +- **Entity Annotation**: `data/entity_annotation/ch_annotator-v1_7_0.yaml` + +## Troubleshooting + +### "Quota exceeded" Error + +**Symptom**: 403 error with "quota exceeded" message + +**Cause**: Using wrong API endpoint (`open.bigmodel.cn` instead of `api.z.ai`) + +**Solution**: Update API URL to `https://api.z.ai/api/coding/paas/v4/chat/completions` + +### "Token not found" Error + +**Symptom**: ValueError about missing ZAI_API_TOKEN + +**Solution**: +1. Check `~/.local/share/opencode/auth.json` for token +2. Add to `.env` file as `ZAI_API_TOKEN=your_token` +3. Ensure `load_dotenv()` is called before accessing environment + +### JSON Parsing Failures + +**Symptom**: LLM returns text that can't be parsed as JSON + +**Solution**: Use the `parse_json_from_response()` helper function with fallback handling + +--- + +**Last Updated**: 2025-12-08 diff --git a/docs/TRANSLITERATION_CONVENTIONS.md b/docs/TRANSLITERATION_CONVENTIONS.md new file mode 100644 index 0000000000..0a472d7b2f --- /dev/null +++ b/docs/TRANSLITERATION_CONVENTIONS.md @@ -0,0 +1,441 @@ +# Transliteration Conventions for Heritage Custodian Names + +**Document Type**: User Guide +**Version**: 1.0 +**Last Updated**: 2025-12-08 +**Related Rules**: `.opencode/TRANSLITERATION_STANDARDS.md`, `AGENTS.md` (Rule 12) + +--- + +## Overview + +This document provides comprehensive examples and guidance for transliterating heritage institution names from non-Latin scripts to Latin characters. Transliteration is **required** for generating GHCID abbreviations but the **original emic name is always preserved**. + +### Key Principles + +1. 
**Emic name preserved** - Original script stored in `custodian_name.emic_name`
+2. **ISO standards used** - Recognized international transliteration standards
+3. **Deterministic output** - Same input always produces same Latin output
+4. **Abbreviation purpose only** - Transliteration is for GHCID generation, not display
+
+---
+
+## Language-by-Language Examples
+
+### Russian (Cyrillic - ISO 9:1995)
+
+**Dataset Statistics**: 13 institutions
+
+| Emic Name | Transliterated | Abbreviation |
+|-----------|----------------|--------------|
+| Институт восточных рукописей РАН | Institut vostočnyh rukopisej RAN | IVRR |
+| Российская государственная библиотека | Rossijskaja gosudarstvennaja biblioteka | RGB |
+| Государственный архив Российской Федерации | Gosudarstvennyj arhiv Rossijskoj Federacii | GARF |
+
+**Skip Words (Russian)**: None significant (Russian doesn't use articles)
+
+**Character Mapping**:
+```
+А → A Б → B В → V Г → G Д → D Е → E
+Ё → Ë Ж → Ž З → Z И → I Й → J К → K
+Л → L М → M Н → N О → O П → P Р → R
+С → S Т → T У → U Ф → F Х → H Ц → C
+Ч → Č Ш → Š Щ → Ŝ Ъ → ʺ Ы → Y Ь → ʹ
+Э → È Ю → Û Я → Â
+```
+
+---
+
+### Ukrainian (Cyrillic - ISO 9:1995)
+
+**Dataset Statistics**: 8 institutions
+
+| Emic Name | Transliterated | Abbreviation |
+|-----------|----------------|--------------|
+| Центральний державний архів громадських об'єднань України | Centralnyj deržavnyj arhiv gromadskyx objednan Ukrainy | CDAGOU |
+| Національна бібліотека України | Nacionalna biblioteka Ukrainy | NBU |
+
+**Ukrainian-specific characters**:
+```
+І → I Ї → Ji Є → Je Ґ → G'
+```
+
+---
+
+### Chinese (Hanyu Pinyin - ISO 7098)
+
+**Dataset Statistics**: 27 institutions
+
+| Emic Name | Pinyin | Abbreviation |
+|-----------|--------|--------------|
+| 东巴文化博物院 | Dōngbā Wénhuà Bówùyuàn | DWB |
+| 中国第一历史档案馆 | Zhōngguó Dìyī Lìshǐ Dàng'ànguǎn | ZDLD |
+| 北京故宫博物院 | Běijīng Gùgōng Bówùyuàn | BGB |
+| 中国国家图书馆 | Zhōngguó Guójiā Túshūguǎn | ZGT |
+
+**Notes**:
+- Tone marks are removed
for abbreviation (diacritics normalization) +- Word boundaries follow natural semantic breaks +- Multi-syllable words keep together + +**Skip Words**: None (Chinese doesn't use separate articles/prepositions) + +--- + +### Japanese (Modified Hepburn) + +**Dataset Statistics**: 19 institutions + +| Emic Name | Romaji | Abbreviation | +|-----------|--------|--------------| +| 国立中央博物館 | Kokuritsu Chūō Hakubutsukan | KCH | +| 東京国立博物館 | Tōkyō Kokuritsu Hakubutsukan | TKH | +| 国立国会図書館 | Kokuritsu Kokkai Toshokan | KKT | + +**Notes**: +- Long vowels (ō, ū) normalized to (o, u) +- Particles typically attached to preceding word +- Kanji compounds transliterated as single words + +--- + +### Korean (Revised Romanization) + +**Dataset Statistics**: 36 institutions + +| Emic Name | RR Romanization | Abbreviation | +|-----------|-----------------|--------------| +| 독립기념관 | Dongnip Ginyeomgwan | DG | +| 국립중앙박물관 | Gungnip Jungang Bakmulgwan | GJB | +| 서울대학교 규장각한국학연구원 | Seoul Daehakgyo Gyujanggak Hangukhak Yeonguwon | SDGHY | +| 국립한글박물관 | Gungnip Hangeul Bakmulgwan | GHB | + +**Notes**: +- No diacritics in Revised Romanization (unlike McCune-Reischauer) +- Consonant assimilation reflected in spelling +- Spaces at natural word boundaries + +--- + +### Arabic (ISO 233-2) + +**Dataset Statistics**: 8 institutions + +| Emic Name | Transliterated | Abbreviation | +|-----------|----------------|--------------| +| المكتبة الوطنية للمملكة المغربية | al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya | MWMM | +| دار الكتب المصرية | Dār al-Kutub al-Miṣrīya | DKM | +| المتحف الوطني العراقي | al-Matḥaf al-Waṭanī al-ʿIrāqī | MWI | + +**Skip Words**: +- `al-` (definite article "the") +- After skip word removal: Maktaba, Wataniya, Mamlaka, Maghribiya → MWMM + +**Notes**: +- Right-to-left script +- Definite article "al-" always skipped +- Diacritics normalized (ā→a, ī→i, etc.) 
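The Arabic skip-word handling described above can be sketched end-to-end. The prefix list and function names here are illustrative (not a production implementation), and ASCII folding reuses the NFD technique from the diacritics rules:

```python
import unicodedata

# Illustrative prefix list drawn from the Skip Words notes above
ARABIC_SKIP_PREFIXES = ("al-", "lil-", "bi-", "li-", "fi-")  # "lil-" before "li-"

def ascii_fold(text: str) -> str:
    """Drop combining marks after NFD decomposition (ṭ → t, ī → i, ġ → g)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def abbreviate_arabic_translit(latin: str) -> str:
    """First letters of each word after stripping article/preposition prefixes."""
    letters = []
    for word in latin.split():
        for prefix in ARABIC_SKIP_PREFIXES:
            if word.lower().startswith(prefix):
                word = word[len(prefix):]
                break
        if word:
            letters.append(ascii_fold(word[0]).upper())
    return "".join(letters)

abbrev = abbreviate_arabic_translit("al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya")
```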
+ +--- + +### Persian/Farsi (ISO 233-3) + +**Dataset Statistics**: 11 institutions + +| Emic Name | Transliterated | Abbreviation | +|-----------|----------------|--------------| +| وزارت امور خارجه ایران | Vezārat-e Omūr-e Khārejeh-ye Īrān | VOKI | +| کتابخانه آستان قدس رضوی | Ketābkhāneh-ye Āstān-e Qods-e Raẓavī | KAQR | +| مجلس شورای اسلامی | Majles-e Showrā-ye Eslāmī | MSE | + +**Skip Words**: +- `-e`, `-ye` (ezafe connector, "of") +- `va` ("and") + +**Persian-specific characters**: +``` +پ → p چ → č ژ → ž گ → g +``` + +--- + +### Hebrew (ISO 259-3) + +**Dataset Statistics**: 4 institutions + +| Emic Name | Transliterated | Abbreviation | +|-----------|----------------|--------------| +| ארכיון הסיפור העממי בישראל | Arḵiyon ha-Sipur ha-ʿAmami be-Yiśraʾel | ASAY | +| הספרייה הלאומית | ha-Sifriya ha-Leʾumit | SL | +| ארכיון המדינה | Arḵiyon ha-Medina | AM | + +**Skip Words**: +- `ha-` (definite article "the") +- `be-` ("in") +- `le-` ("to") +- `ve-` ("and") + +**Notes**: +- Right-to-left script +- Articles attached with hyphen +- Silent letters (aleph, ayin) often omitted in abbreviation + +--- + +### Hindi (Devanagari - ISO 15919) + +**Dataset Statistics**: 14 institutions + +| Emic Name | Transliterated | Abbreviation | +|-----------|----------------|--------------| +| राजस्थान प्राच्यविद्या प्रतिष्ठान | Rājasthāna Prācyavidyā Pratiṣṭhāna | RPP | +| राष्ट्रीय अभिलेखागार | Rāṣṭrīya Abhilekhāgāra | RA | +| राष्ट्रीय संग्रहालय नई दिल्ली | Rāṣṭrīya Saṅgrahālaya Naī Dillī | RSND | + +**Skip Words**: +- `ka`, `ki`, `ke` ("of") +- `aur` ("and") +- `mein` ("in") + +**Notes**: +- Conjunct consonants transliterated as cluster +- Long vowels marked (ā, ī, ū) then normalized + +--- + +### Greek (ISO 843) + +**Dataset Statistics**: 2 institutions + +| Emic Name | Transliterated | Abbreviation | +|-----------|----------------|--------------| +| Αρχαιολογικό Μουσείο Θεσσαλονίκης | Archaiologikó Mouseío Thessaloníkīs | AMT | +| Εθνική Βιβλιοθήκη της Ελλάδας | Ethnikī́ 
Vivliothī́kī tīs Elládas | EVE |
+
+**Skip Words**:
+- `tīs`, `tou` ("of the")
+- `kai` ("and")
+
+**Character Mapping**:
+```
+Α → A Β → V Γ → G Δ → D Ε → E Ζ → Z
+Η → Ī Θ → Th Ι → I Κ → K Λ → L Μ → M
+Ν → N Ξ → X Ο → O Π → P Ρ → R Σ → S
+Τ → T Υ → Y Φ → F Χ → Ch Ψ → Ps Ω → Ō
+```
+
+---
+
+### Thai (ISO 11940-2)
+
+**Dataset Statistics**: 6 institutions
+
+| Emic Name | Transliterated | Abbreviation |
+|-----------|----------------|--------------|
+| สำนักหอจดหมายเหตุแห่งชาติ | Samnak Ho Chotmaihet Haeng Chat | SHCHC |
+| หอสมุดแห่งชาติ | Ho Samut Haeng Chat | HSHC |
+
+**Notes**:
+- Thai script is abugida (consonant-vowel syllables)
+- No spaces in Thai; word boundaries determined by meaning
+- Royal Thai General System also acceptable
+
+---
+
+### Armenian (ISO 9985)
+
+**Dataset Statistics**: 4 institutions
+
+| Emic Name | Transliterated | Abbreviation |
+|-----------|----------------|--------------|
+| Մատենադարան | Matenadaran | M |
+| Ազգային Մատենադարան | Azgayin Matenadaran | AM |
+
+**Notes**:
+- Armenian alphabet unique to Armenian language
+- Transliteration straightforward letter-for-letter
+
+---
+
+### Georgian (ISO 9984)
+
+**Dataset Statistics**: 2 institutions
+
+| Emic Name | Transliterated | Abbreviation |
+|-----------|----------------|--------------|
+| ხელნაწერთა ეროვნული ცენტრი | Xelnawerti Erovnuli C'ent'ri | XEC |
+| საქართველოს ეროვნული არქივი | Sakartvelos Erovnuli Arkivi | SEA |
+
+**Notes**:
+- Georgian Mkhedruli script
+- Apostrophes mark ejective consonants (removed in abbreviation)
+
+---
+
+## Complete Workflow Example
+
+### Step-by-Step: Korean Institution
+
+**Institution**: National Museum of Korea
+
+1. **Emic Name (Original Script)**:
+   ```
+   국립중앙박물관
+   ```
+
+2. **Language Detection**: Korean (ko)
+
+3. 
**Transliterate using Revised Romanization**: + ``` + Gungnip Jungang Bakmulgwan + ``` + +4. **Identify Skip Words**: None for Korean + +5. **Extract First Letters**: + ``` + G + J + B = GJB + ``` + +6. **Diacritic Normalization**: N/A (RR has no diacritics) + +7. **Final Abbreviation**: `GJB` + +8. **Store in YAML**: + ```yaml + custodian_name: + emic_name: 국립중앙박물관 + name_language: ko + english_name: National Museum of Korea + ghcid: + ghcid_current: KR-SO-SEO-M-GJB + abbreviation_source: transliterated_emic + ``` + +--- + +### Step-by-Step: Arabic Institution + +**Institution**: National Library of Morocco + +1. **Emic Name (Original Script)**: + ``` + المكتبة الوطنية للمملكة المغربية + ``` + +2. **Language Detection**: Arabic (ar) + +3. **Transliterate using ISO 233-2**: + ``` + al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya + ``` + +4. **Identify Skip Words**: `al-` (4 occurrences), `lil-` (1) + +5. **After Skip Word Removal**: + ``` + Maktaba Waṭanīya Mamlaka Maġribīya + ``` + +6. **Extract First Letters**: + ``` + M + W + M + M = MWMM + ``` + +7. **Diacritic Normalization**: ṭ→t, ī→i, ġ→g + ``` + MWMM (already ASCII) + ``` + +8. **Final Abbreviation**: `MWMM` + +--- + +## Edge Cases and Special Handling + +### Mixed Scripts + +Some institution names mix scripts (e.g., Latin brand names in Chinese text): + +**Example**: 中国IBM研究院 +- Transliterate Chinese: Zhongguo IBM Yanjiuyuan +- Keep "IBM" as-is (already Latin) +- Abbreviation: ZIY + +### Transliteration Ambiguity + +When multiple valid transliterations exist, prefer: +1. ISO standard spelling +2. Institution's own romanization (if consistent) +3. Most commonly used academic romanization + +### Very Long Names + +If abbreviation exceeds 10 characters after applying rules: +1. Truncate to 10 characters +2. Ensure truncation doesn't create ambiguous abbreviation +3. 
Document truncation in `ghcid.notes` + +--- + +## Python Implementation Reference + +For the complete Python implementation of transliteration functions, see: + +- `.opencode/TRANSLITERATION_STANDARDS.md` - Full code with all language handlers +- `scripts/transliterate_emic_names.py` - Production script for batch transliteration + +### Quick Reference Function + +```python +from transliteration import transliterate_for_abbreviation + +# Example usage for all supported languages +examples = { + 'ru': 'Российская государственная библиотека', + 'zh': '中国国家图书馆', + 'ja': '国立国会図書館', + 'ko': '국립중앙박물관', + 'ar': 'المكتبة الوطنية للمملكة المغربية', + 'he': 'הספרייה הלאומית', + 'hi': 'राष्ट्रीय अभिलेखागार', + 'el': 'Εθνική Βιβλιοθήκη της Ελλάδας', +} + +for lang, name in examples.items(): + latin = transliterate_for_abbreviation(name, lang) + print(f'{lang}: {name}') + print(f' → {latin}') +``` + +--- + +## Validation Checklist + +Before finalizing a transliterated abbreviation: + +- [ ] Original emic name preserved in `custodian_name.emic_name` +- [ ] Language code stored in `custodian_name.name_language` +- [ ] Correct ISO standard applied for script +- [ ] Skip words removed (articles, prepositions) +- [ ] Diacritics normalized to ASCII +- [ ] Special characters removed +- [ ] Abbreviation ≤ 10 characters +- [ ] No conflicts with existing GHCIDs + +--- + +## See Also + +- `.opencode/TRANSLITERATION_STANDARDS.md` - Technical rules and Python code +- `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` - Character filtering rules +- `AGENTS.md` - Rule 12: Non-Latin Script Transliteration +- `docs/PERSISTENT_IDENTIFIERS.md` - GHCID specification + +--- + +## Changelog + +| Date | Change | +|------|--------| +| 2025-12-08 | Initial document created with 21 language examples |