docs: add Z.AI GLM API and transliteration rules to AGENTS.md

- Add Rule 11 for Z.AI Coding Plan API usage (not BigModel)
- Add transliteration standards for non-Latin scripts
- Document GLM model options and Python implementation
kempersc 2025-12-08 14:58:22 +01:00
parent 40bd3cb8f5
commit 271545fa8b
6 changed files with 2172 additions and 7 deletions


@ -1,17 +1,102 @@
# Abbreviation Character Filtering Rules
**Rule ID**: ABBREV-CHAR-FILTER
**Status**: MANDATORY
**Applies To**: GHCID abbreviation component generation
**Created**: 2025-12-07
**Updated**: 2025-12-08 (added diacritics rule)
---
## Summary
**When generating abbreviations for GHCID, ONLY ASCII uppercase letters (A-Z) are permitted. Both special characters AND diacritics MUST be removed/normalized.**
This is a **MANDATORY** rule. Abbreviations containing special characters or diacritics are INVALID and must be regenerated.
### Two Mandatory Sub-Rules:
1. **ABBREV-SPECIAL-CHAR**: Remove all special characters and symbols
2. **ABBREV-DIACRITICS**: Normalize all diacritics to ASCII equivalents
---
## Rule 1: Diacritics MUST Be Normalized to ASCII (ABBREV-DIACRITICS)
**Diacritics (accented characters) MUST be normalized to their ASCII base letter equivalents.**
### Example (Real Case)
```
❌ WRONG: CZ-VY-TEL-L-VHSPAOČRZS (contains Č)
✅ CORRECT: CZ-VY-TEL-L-VHSPAOCRZS (ASCII only)
```
### Diacritics Normalization Table
| Diacritic | ASCII | Example |
|-----------|-------|---------|
| Á, À, Â, Ã, Ä, Å, Ā | A | "Ålborg" → A |
| Č, Ć, Ç | C | "Český" → C |
| Ď | D | "Ďáblice" → D |
| É, È, Ê, Ë, Ě, Ē | E | "Éire" → E |
| Í, Ì, Î, Ï, Ī | I | "Ísland" → I |
| Ñ, Ń, Ň | N | "España" → N |
| Ó, Ò, Ô, Õ, Ö, Ø, Ō | O | "Österreich" → O |
| Ř | R | "Říčany" → R |
| Š, Ś, Ş | S | "Šumperk" → S |
| Ť | T | "Ťažký" → T |
| Ú, Ù, Û, Ü, Ů, Ū | U | "Ústí" → U |
| Ý, Ÿ | Y | "Ýmir" → Y |
| Ž, Ź, Ż | Z | "Žilina" → Z |
| Ł | L | "Łódź" → L |
| Æ | AE | "Ærø" → AE |
| Œ | OE | "Œuvre" → OE |
| ß | SS | "Straße" → SS |
### Implementation
```python
import unicodedata

def normalize_diacritics(text: str) -> str:
    """
    Normalize diacritics to ASCII equivalents.
    Examples:
        "Č" → "C"
        "Ř" → "R"
        "Ö" → "O"
        "ñ" → "n"
    """
    # Letters that NFD cannot decompose into base letter + combining mark
    # (required by the normalization table: ß, Æ, Œ, Ø, Ł)
    special = {'ß': 'SS', 'Æ': 'AE', 'æ': 'ae', 'Œ': 'OE', 'œ': 'oe',
               'Ø': 'O', 'ø': 'o', 'Ł': 'L', 'ł': 'l'}
    text = ''.join(special.get(c, c) for c in text)
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', text)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    return ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

# Example
normalize_diacritics("VHSPAOČRZS")  # Returns "VHSPAOCRZS"
```
### Languages Commonly Affected
| Language | Common Diacritics | Example Institution |
|----------|-------------------|---------------------|
| **Czech** | Č, Ř, Š, Ž, Ě, Ů | Vlastivědné muzeum → VM |
| **Polish** | Ł, Ń, Ó, Ś, Ź, Ż, Ą, Ę | Biblioteka Łódzka → BL |
| **German** | Ä, Ö, Ü, ß | Österreichische Nationalbibliothek → ON |
| **French** | É, È, Ê, Ç, Ô | Bibliothèque nationale → BN |
| **Spanish** | Ñ, Á, É, Í, Ó, Ú | Museo Nacional → MN |
| **Portuguese** | Ã, Õ, Ç, Á, É | Biblioteca Nacional → BN |
| **Nordic** | Å, Ä, Ö, Ø, Æ | Nationalmuseet → N |
| **Turkish** | Ç, Ğ, İ, Ö, Ş, Ü | İstanbul Üniversitesi → IU |
| **Hungarian** | Á, É, Í, Ó, Ö, Ő, Ú, Ü, Ű | Országos Levéltár → OL |
| **Romanian** | Ă, Â, Î, Ș, Ț | Biblioteca Națională → BN |
---
## Rule 2: Special Characters MUST Be Removed (ABBREV-SPECIAL-CHAR)
---


@ -0,0 +1,787 @@
# Transliteration Standards for Non-Latin Scripts
**Rule ID**: TRANSLIT-ISO
**Status**: MANDATORY
**Applies To**: GHCID abbreviation generation from emic names in non-Latin scripts
**Created**: 2025-12-08
---
## Summary
**When generating GHCID abbreviations from institution names written in non-Latin scripts, the emic name MUST first be transliterated to Latin characters using the designated ISO or recognized standard for that script.**
This rule affects **170 institutions** across **21 languages** with non-Latin writing systems.
### Key Principles
1. **Emic name is preserved** - The original script is stored in `custodian_name.emic_name`
2. **Transliteration is for processing only** - Used to generate abbreviations
3. **ISO/recognized standards required** - No ad-hoc romanization
4. **Deterministic output** - Same input always produces same Latin output
5. **Existing GHCIDs grandfathered** - Only applies to NEW custodians
---
## Transliteration Standards by Script/Language
### Cyrillic Scripts
| Language | ISO Code | Standard | Library/Tool | Notes |
|----------|----------|----------|--------------|-------|
| **Russian** | ru | ISO 9:1995 | `transliterate` | Scientific transliteration |
| **Ukrainian** | uk | ISO 9:1995 | `transliterate` | Includes Ukrainian-specific letters |
| **Bulgarian** | bg | ISO 9:1995 | `transliterate` | Uses same Cyrillic base |
| **Serbian** | sr | ISO 9:1995 | `transliterate` | Serbian Cyrillic variant |
| **Kazakh** | kk | ISO 9:1995 | `transliterate` | Cyrillic-based (pre-2023) |
**ISO 9:1995 Mapping (Core Characters)**:
| Cyrillic | Latin | Cyrillic | Latin |
|----------|-------|----------|-------|
| А а | A a | П п | P p |
| Б б | B b | Р р | R r |
| В в | V v | С с | S s |
| Г г | G g | Т т | T t |
| Д д | D d | У у | U u |
| Е е | E e | Ф ф | F f |
| Ё ё | Ë ë | Х х | H h |
| Ж ж | Ž ž | Ц ц | C c |
| З з | Z z | Ч ч | Č č |
| И и | I i | Ш ш | Š š |
| Й й | J j | Щ щ | Ŝ ŝ |
| К к | K k | Ъ ъ | ʺ (hard sign) |
| Л л | L l | Ы ы | Y y |
| М м | M m | Ь ь | ʹ (soft sign) |
| Н н | N n | Э э | È è |
| О о | O o | Ю ю | Û û |
| | | Я я | Â â |
**Example**:
```
Input: Институт восточных рукописей РАН
ISO 9: Institut vostočnyh rukopisej RAN
Abbrev: IVRRAN → IVRRAN (after diacritic normalization)
```
---
### CJK Scripts
#### Chinese (Hanzi)
| Variant | Standard | Library/Tool | Notes |
|---------|----------|--------------|-------|
| Simplified | Hanyu Pinyin (ISO 7098) | `pypinyin` | Standard PRC romanization |
| Traditional | Hanyu Pinyin | `pypinyin` | Same standard applies |
**Pinyin Rules**:
- Tone marks are OMITTED for abbreviation (diacritics removed anyway)
- Word boundaries follow natural spacing
- Proper nouns capitalized
**Example**:
```
Input: 东巴文化博物院
Pinyin: Dōngbā Wénhuà Bówùyuàn
ASCII: Dongba Wenhua Bowuyuan
Abbrev: DWB
```
#### Japanese (Kanji/Kana)
| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| Modified Hepburn | `pykakasi`, `romkan` | Most widely used internationally |
**Hepburn Rules**:
- Long vowels: ō, ū (normalized to o, u for abbreviation)
- Particles: は (wa), を (wo), へ (e)
- Syllabic n: ん = n (before vowels: n')
**Example**:
```
Input: 国立中央博物館
Romaji: Kokuritsu Chūō Hakubutsukan
ASCII: Kokuritsu Chuo Hakubutsukan
Abbrev: KCH
```
#### Korean (Hangul)
| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| Revised Romanization (RR) | `korean-romanizer`, `hangul-romanize` | Official South Korean standard (2000) |
**RR Rules**:
- No diacritics (unlike McCune-Reischauer)
- Consonant assimilation reflected in spelling
- Word boundaries at natural breaks
**Example**:
```
Input: 독립기념관
RR: Dongnip Ginyeomgwan
Abbrev: DG
```
---
### Arabic Script
| Language | ISO Code | Standard | Library/Tool | Notes |
|----------|----------|----------|--------------|-------|
| **Arabic** | ar | ISO 233-2:1993 | `arabic-transliteration` | Simplified standard |
| **Persian/Farsi** | fa | ISO 233-3:1999 | `persian-transliteration` | Persian extensions |
| **Urdu** | ur | ISO 233-3 + Urdu extensions | `urdu-transliteration` | Additional characters |
**ISO 233 Mapping (Core Arabic)**:
| Arabic | Name | Latin |
|--------|------|-------|
| ا | Alif | ā / a |
| ب | Ba | b |
| ت | Ta | t |
| ث | Tha | ṯ |
| ج | Jim | ǧ / j |
| ح | Ha | ḥ |
| خ | Kha | ḫ / kh |
| د | Dal | d |
| ذ | Dhal | ḏ |
| ر | Ra | r |
| ز | Zay | z |
| س | Sin | s |
| ش | Shin | š / sh |
| ص | Sad | ṣ |
| ض | Dad | ḍ |
| ط | Ta | ṭ |
| ظ | Za | ẓ |
| ع | Ayn | ʿ |
| غ | Ghayn | ġ / gh |
| ف | Fa | f |
| ق | Qaf | q |
| ك | Kaf | k |
| ل | Lam | l |
| م | Mim | m |
| ن | Nun | n |
| ه | Ha | h |
| و | Waw | w / ū |
| ي | Ya | y / ī |
**Example (Arabic)**:
```
Input: المكتبة الوطنية للمملكة المغربية
ISO: al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya
ASCII: al-Maktaba al-Wataniya lil-Mamlaka al-Maghribiya
Abbrev: MWMM (skip "al-" articles)
```
**Example (Persian)**:
```
Input: وزارت امور خارجه ایران
ISO: Vezārat-e Omur-e Khāreǧe-ye Īrān
ASCII: Vezarat-e Omur-e Khareje-ye Iran
Abbrev: VOKI (skip "e" connector)
```
---
### Hebrew Script
| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| ISO 259-3:1999 | `hebrew-transliteration` | Simplified romanization |
**ISO 259 Mapping**:
| Hebrew | Name | Latin |
|--------|------|-------|
| א | Aleph | ʾ / (silent) |
| ב | Bet | b / v |
| ג | Gimel | g |
| ד | Dalet | d |
| ה | He | h |
| ו | Vav | v / o / u |
| ז | Zayin | z |
| ח | Chet | ḥ / ch |
| ט | Tet | ṭ / t |
| י | Yod | y / i |
| כ ך | Kaf | k / kh |
| ל | Lamed | l |
| מ ם | Mem | m |
| נ ן | Nun | n |
| ס | Samekh | s |
| ע | Ayin | ʿ / (silent) |
| פ ף | Pe | p / f |
| צ ץ | Tsade | ṣ / ts |
| ק | Qof | q / k |
| ר | Resh | r |
| ש | Shin/Sin | š / s |
| ת | Tav | t |
**Example**:
```
Input: ארכיון הסיפור העממי בישראל
ISO: Arḵiyon ha-Sipur ha-ʿAmami be-Yiśraʾel
ASCII: Arkhiyon ha-Sipur ha-Amami be-Yisrael
Abbrev: ASAY (skip "ha-" and "be-" articles)
```
---
### Greek Script
| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| ISO 843:1997 | `greek-transliteration` | Romanization of Greek |
**ISO 843 Mapping**:
| Greek | Latin | Greek | Latin |
|-------|-------|-------|-------|
| Α α | A a | Ν ν | N n |
| Β β | V v | Ξ ξ | X x |
| Γ γ | G g | Ο ο | O o |
| Δ δ | D d | Π π | P p |
| Ε ε | E e | Ρ ρ | R r |
| Ζ ζ | Z z | Σ σ ς | S s |
| Η η | Ī ī | Τ τ | T t |
| Θ θ | Th th | Υ υ | Y y |
| Ι ι | I i | Φ φ | F f |
| Κ κ | K k | Χ χ | Ch ch |
| Λ λ | L l | Ψ ψ | Ps ps |
| Μ μ | M m | Ω ω | Ō ō |
**Example**:
```
Input: Αρχαιολογικό Μουσείο Θεσσαλονίκης
ISO: Archaiologikó Mouseío Thessaloníkīs
ASCII: Archaiologiko Mouseio Thessalonikis
Abbrev: AMT
```
---
### Indic Scripts
| Language | Script | Standard | Library/Tool |
|----------|--------|----------|--------------|
| **Hindi** | Devanagari | ISO 15919 | `indic-transliteration` |
| **Bengali** | Bengali | ISO 15919 | `indic-transliteration` |
| **Nepali** | Devanagari | ISO 15919 | `indic-transliteration` |
| **Sinhala** | Sinhala | ISO 15919 | `indic-transliteration` |
**ISO 15919 Core Consonants (Devanagari)**:
| Devanagari | Latin | Devanagari | Latin |
|------------|-------|------------|-------|
| क | ka | त | ta |
| ख | kha | थ | tha |
| ग | ga | द | da |
| घ | gha | ध | dha |
| ङ | ṅa | न | na |
| च | ca | प | pa |
| छ | cha | फ | pha |
| ज | ja | ब | ba |
| झ | jha | भ | bha |
| ञ | ña | म | ma |
| ट | ṭa | य | ya |
| ठ | ṭha | र | ra |
| ड | ḍa | ल | la |
| ढ | ḍha | व | va |
| ण | ṇa | श | śa |
| | | ष | ṣa |
| | | स | sa |
| | | ह | ha |
**Example (Hindi)**:
```
Input: राजस्थान प्राच्यविद्या प्रतिष्ठान
ISO: Rājasthāna Prācyavidyā Pratiṣṭhāna
ASCII: Rajasthana Pracyavidya Pratishthana
Abbrev: RPP
```
---
### Southeast Asian Scripts
| Language | Script | Standard | Library/Tool |
|----------|--------|----------|--------------|
| **Thai** | Thai | ISO 11940-2 | `thai-romanization` |
| **Khmer** | Khmer | ALA-LC | `khmer-romanization` |
**Thai Example**:
```
Input: สำนักหอจดหมายเหตุแห่งชาติ
ISO: Samnak Ho Chotmaihet Haeng Chat
Abbrev: SHCHC
```
**Khmer Example**:
```
Input: សារមន្ទីរទួលស្លែង
ALA-LC: Sāramanṭīr Tūl Slèṅ
ASCII: Saramantir Tuol Sleng
Abbrev: STS
```
---
### Other Scripts
| Language | Script | Standard | Library/Tool |
|----------|--------|----------|--------------|
| **Armenian** | Armenian | ISO 9985 | `armenian-transliteration` |
| **Georgian** | Georgian | ISO 9984 | `georgian-transliteration` |
**Armenian Example**:
```
Input: Մատենադարան
ISO: Matenadaran
Abbrev: M
```
**Georgian Example**:
```
Input: ხელნაწერთა ეროვნული ცენტრი
ISO: Xelnawerti Erovnuli C'ent'ri
ASCII: Khelnawerti Erovnuli Centri
Abbrev: KEC
```
---
## Implementation
### Python Transliteration Utility
```python
#!/usr/bin/env python3
"""
Transliteration utility for GHCID abbreviation generation.
Uses ISO and recognized standards for each script/language.
"""
import re
import unicodedata
from typing import Optional

# Try importing transliteration libraries
try:
    from pypinyin import pinyin, Style
    HAS_PYPINYIN = True
except ImportError:
    HAS_PYPINYIN = False

try:
    import pykakasi
    HAS_PYKAKASI = True
except ImportError:
    HAS_PYKAKASI = False

try:
    from transliterate import translit
    HAS_TRANSLITERATE = True
except ImportError:
    HAS_TRANSLITERATE = False


def detect_script(text: str) -> str:
    """
    Detect the primary script of the input text.

    Returns one of:
    - 'latin': Latin alphabet
    - 'cyrillic': Cyrillic script
    - 'chinese': Chinese characters (Hanzi)
    - 'japanese': Japanese (mixed Kanji/Kana)
    - 'korean': Korean Hangul
    - 'arabic': Arabic script (includes Persian, Urdu)
    - 'hebrew': Hebrew script
    - 'greek': Greek script
    - 'devanagari': Devanagari (Hindi, Nepali, Sanskrit)
    - 'bengali': Bengali script
    - 'thai': Thai script
    - 'armenian': Armenian script
    - 'georgian': Georgian script
    - 'unknown': Cannot determine
    """
    script_ranges = {
        'cyrillic': (0x0400, 0x04FF),
        'arabic': (0x0600, 0x06FF),
        'hebrew': (0x0590, 0x05FF),
        'devanagari': (0x0900, 0x097F),
        'bengali': (0x0980, 0x09FF),
        'thai': (0x0E00, 0x0E7F),
        'greek': (0x0370, 0x03FF),
        'armenian': (0x0530, 0x058F),
        'georgian': (0x10A0, 0x10FF),
        'korean': (0xAC00, 0xD7AF),  # Hangul syllables
        'japanese_hiragana': (0x3040, 0x309F),
        'japanese_katakana': (0x30A0, 0x30FF),
        'chinese': (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    }
    script_counts = {script: 0 for script in script_ranges}
    latin_count = 0

    for char in text:
        code = ord(char)
        # Check Latin
        if ('a' <= char <= 'z') or ('A' <= char <= 'Z'):
            latin_count += 1
            continue
        # Check other scripts
        for script, (start, end) in script_ranges.items():
            if start <= code <= end:
                script_counts[script] += 1
                break

    # Determine primary script
    if latin_count > 0 and all(c == 0 for c in script_counts.values()):
        return 'latin'

    # Find max non-Latin script
    max_script = max(script_counts, key=script_counts.get)
    if script_counts[max_script] > 0:
        # Handle Japanese (can be Kanji + Kana)
        if max_script in ('japanese_hiragana', 'japanese_katakana', 'chinese'):
            if script_counts['japanese_hiragana'] > 0 or script_counts['japanese_katakana'] > 0:
                return 'japanese'
            return 'chinese'
        return max_script
    return 'latin' if latin_count > 0 else 'unknown'


def transliterate_cyrillic(text: str, lang: str = 'ru') -> str:
    """Transliterate Cyrillic text using ISO 9."""
    if HAS_TRANSLITERATE:
        try:
            return translit(text, lang, reversed=True)
        except Exception:
            pass
    # Fallback: basic Cyrillic to Latin mapping
    cyrillic_map = {
        'А': 'A', 'Б': 'B', 'В': 'V', 'Г': 'G', 'Д': 'D', 'Е': 'E',
        'Ё': 'E', 'Ж': 'Zh', 'З': 'Z', 'И': 'I', 'Й': 'Y', 'К': 'K',
        'Л': 'L', 'М': 'M', 'Н': 'N', 'О': 'O', 'П': 'P', 'Р': 'R',
        'С': 'S', 'Т': 'T', 'У': 'U', 'Ф': 'F', 'Х': 'Kh', 'Ц': 'Ts',
        'Ч': 'Ch', 'Ш': 'Sh', 'Щ': 'Shch', 'Ъ': '', 'Ы': 'Y', 'Ь': '',
        'Э': 'E', 'Ю': 'Yu', 'Я': 'Ya',
        'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e',
        'ё': 'e', 'ж': 'zh', 'з': 'z', 'и': 'i', 'й': 'y', 'к': 'k',
        'л': 'l', 'м': 'm', 'н': 'n', 'о': 'o', 'п': 'p', 'р': 'r',
        'с': 's', 'т': 't', 'у': 'u', 'ф': 'f', 'х': 'kh', 'ц': 'ts',
        'ч': 'ch', 'ш': 'sh', 'щ': 'shch', 'ъ': '', 'ы': 'y', 'ь': '',
        'э': 'e', 'ю': 'yu', 'я': 'ya',
        # Ukrainian additions
        'І': 'I', 'і': 'i', 'Ї': 'Yi', 'ї': 'yi', 'Є': 'Ye', 'є': 'ye',
        'Ґ': 'G', 'ґ': 'g',
    }
    return ''.join(cyrillic_map.get(c, c) for c in text)


def transliterate_chinese(text: str) -> str:
    """Transliterate Chinese to Pinyin."""
    if HAS_PYPINYIN:
        # Get pinyin without tone marks
        result = pinyin(text, style=Style.NORMAL)
        return ' '.join(''.join(p) for p in result)
    # Fallback: return as-is (requires manual handling)
    return text


def transliterate_japanese(text: str) -> str:
    """Transliterate Japanese to Romaji (Hepburn)."""
    if HAS_PYKAKASI:
        kakasi = pykakasi.kakasi()
        result = kakasi.convert(text)
        return ' '.join(item['hepburn'] for item in result)
    # Fallback: return as-is
    return text


def transliterate_korean(text: str) -> str:
    """Transliterate Korean Hangul to Revised Romanization."""
    # Korean romanization is complex - use library if available
    try:
        from korean_romanizer.romanizer import Romanizer
        r = Romanizer(text)
        return r.romanize()
    except ImportError:
        pass
    # Fallback: return as-is (full RR requires Hangul syllable decomposition)
    return text


def transliterate_arabic(text: str) -> str:
    """Transliterate Arabic script to Latin (ISO 233 simplified)."""
    arabic_map = {
        'ا': 'a', 'أ': 'a', 'إ': 'i', 'آ': 'a',
        'ب': 'b', 'ت': 't', 'ث': 'th', 'ج': 'j',
        'ح': 'h', 'خ': 'kh', 'د': 'd', 'ذ': 'dh',
        'ر': 'r', 'ز': 'z', 'س': 's', 'ش': 'sh',
        'ص': 's', 'ض': 'd', 'ط': 't', 'ظ': 'z',
        'ع': "'", 'غ': 'gh', 'ف': 'f', 'ق': 'q',
        'ك': 'k', 'ل': 'l', 'م': 'm', 'ن': 'n',
        'ه': 'h', 'و': 'w', 'ي': 'y', 'ى': 'a',
        'ة': 'a', 'ء': "'",
        # Persian additions
        'پ': 'p', 'چ': 'ch', 'ژ': 'zh', 'گ': 'g',
        'ک': 'k', 'ی': 'i',
    }
    result = []
    for c in text:
        if c in arabic_map:
            result.append(arabic_map[c])
        elif c == ' ' or c.isalnum():
            result.append(c)
    return ''.join(result)


def transliterate_hebrew(text: str) -> str:
    """Transliterate Hebrew to Latin (ISO 259 simplified)."""
    hebrew_map = {
        'א': '', 'ב': 'v', 'ג': 'g', 'ד': 'd', 'ה': 'h',
        'ו': 'v', 'ז': 'z', 'ח': 'ch', 'ט': 't', 'י': 'y',
        'כ': 'k', 'ך': 'k', 'ל': 'l', 'מ': 'm', 'ם': 'm',
        'נ': 'n', 'ן': 'n', 'ס': 's', 'ע': '', 'פ': 'f',
        'ף': 'f', 'צ': 'ts', 'ץ': 'ts', 'ק': 'k', 'ר': 'r',
        'ש': 'sh', 'ת': 't',
    }
    result = []
    for c in text:
        if c in hebrew_map:
            result.append(hebrew_map[c])
        elif c == ' ' or c.isalnum():
            result.append(c)
    return ''.join(result)


def transliterate_greek(text: str) -> str:
    """Transliterate Greek to Latin (ISO 843)."""
    greek_map = {
        'Α': 'A', 'α': 'a', 'Β': 'V', 'β': 'v', 'Γ': 'G', 'γ': 'g',
        'Δ': 'D', 'δ': 'd', 'Ε': 'E', 'ε': 'e', 'Ζ': 'Z', 'ζ': 'z',
        'Η': 'I', 'η': 'i', 'Θ': 'Th', 'θ': 'th', 'Ι': 'I', 'ι': 'i',
        'Κ': 'K', 'κ': 'k', 'Λ': 'L', 'λ': 'l', 'Μ': 'M', 'μ': 'm',
        'Ν': 'N', 'ν': 'n', 'Ξ': 'X', 'ξ': 'x', 'Ο': 'O', 'ο': 'o',
        'Π': 'P', 'π': 'p', 'Ρ': 'R', 'ρ': 'r', 'Σ': 'S', 'σ': 's',
        'ς': 's', 'Τ': 'T', 'τ': 't', 'Υ': 'Y', 'υ': 'y', 'Φ': 'F',
        'φ': 'f', 'Χ': 'Ch', 'χ': 'ch', 'Ψ': 'Ps', 'ψ': 'ps',
        'Ω': 'O', 'ω': 'o',
    }
    return ''.join(greek_map.get(c, c) for c in text)


def transliterate_devanagari(text: str) -> str:
    """Transliterate Devanagari to Latin (ISO 15919 simplified)."""
    try:
        from indic_transliteration import sanscript
        from indic_transliteration.sanscript import transliterate as indic_translit
        return indic_translit(text, sanscript.DEVANAGARI, sanscript.IAST)
    except ImportError:
        pass
    # Fallback: return as-is (a full Devanagari character map would be needed)
    return text


def transliterate_thai(text: str) -> str:
    """Transliterate Thai to Latin (Royal Thai General System)."""
    try:
        from thaispellcheck import transliterate as thai_translit
        return thai_translit(text)
    except ImportError:
        pass
    # Fallback
    return text


def transliterate(text: str, lang: Optional[str] = None) -> str:
    """
    Transliterate text from non-Latin script to Latin.

    Args:
        text: Input text in any script
        lang: Optional ISO 639-1 language code (e.g., 'ru', 'zh', 'ko').
              If not provided, script is auto-detected.

    Returns:
        Transliterated text in Latin characters.
    """
    if not text:
        return text

    # Detect script if language not provided
    if lang:
        script_map = {
            'ru': 'cyrillic', 'uk': 'cyrillic', 'bg': 'cyrillic',
            'sr': 'cyrillic', 'kk': 'cyrillic',
            'zh': 'chinese',
            'ja': 'japanese',
            'ko': 'korean',
            'ar': 'arabic', 'fa': 'arabic', 'ur': 'arabic',
            'he': 'hebrew',
            'el': 'greek',
            'hi': 'devanagari', 'ne': 'devanagari',
            'bn': 'bengali',
            'th': 'thai',
            'hy': 'armenian',
            'ka': 'georgian',
        }
        script = script_map.get(lang, detect_script(text))
    else:
        script = detect_script(text)

    # Apply appropriate transliteration
    transliterators = {
        'cyrillic': lambda t: transliterate_cyrillic(t, lang or 'ru'),
        'chinese': transliterate_chinese,
        'japanese': transliterate_japanese,
        'korean': transliterate_korean,
        'arabic': transliterate_arabic,
        'hebrew': transliterate_hebrew,
        'greek': transliterate_greek,
        'devanagari': transliterate_devanagari,
        'thai': transliterate_thai,
        'latin': lambda t: t,  # No transliteration needed
    }
    translit_func = transliterators.get(script, lambda t: t)
    result = translit_func(text)

    # Normalize diacritics to ASCII
    normalized = unicodedata.normalize('NFD', result)
    return ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')


def transliterate_for_abbreviation(emic_name: str, lang: str) -> str:
    """
    Transliterate emic name for GHCID abbreviation generation.

    This is the main entry point for GHCID generation scripts.

    Args:
        emic_name: Institution name in original script
        lang: ISO 639-1 language code

    Returns:
        Transliterated name ready for abbreviation extraction
    """
    # Step 1: Transliterate to Latin
    latin = transliterate(emic_name, lang)
    # Step 2: Normalize diacritics (handled in transliterate())
    # Step 3: Remove special characters (except spaces)
    clean = re.sub(r'[^a-zA-Z\s]', ' ', latin)
    # Step 4: Normalize whitespace
    return ' '.join(clean.split())


# Example usage
if __name__ == '__main__':
    test_cases = [
        ('Институт восточных рукописей РАН', 'ru'),
        ('东巴文化博物院', 'zh'),
        ('독립기념관', 'ko'),
        ('राजस्थान प्राच्यविद्या प्रतिष्ठान', 'hi'),
        ('المكتبة الوطنية للمملكة المغربية', 'ar'),
        ('ארכיון הסיפור העממי בישראל', 'he'),
        ('Αρχαιολογικό Μουσείο Θεσσαλονίκης', 'el'),
    ]
    for name, lang in test_cases:
        result = transliterate_for_abbreviation(name, lang)
        print(f'{lang}: {name}')
        print(f'  → {result}')
        print()
```
---
## Skip Words by Language
When extracting abbreviations from transliterated text, skip these articles/prepositions:
### Arabic
- `al-` (the definite article)
- `bi-`, `li-`, `fi-` (prepositions)
### Hebrew
- `ha-` (the)
- `ve-` (and)
- `be-`, `le-`, `me-` (prepositions)
### Persian
- `-e`, `-ye` (ezafe connector)
- `va` (and)
### CJK Languages
- No skip words (particles are integral to meaning)
### Indic Languages
- `ka`, `ki`, `ke` (Hindi: of)
- `aur` (Hindi: and)
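
The skip lists above can be applied during first-letter extraction roughly as follows (a minimal sketch: the function name and the exact word sets are illustrative, not normative; the production logic lives in the GHCID generation scripts):

```python
# Articles/prepositions to skip, keyed by ISO 639-1 code (illustrative subset)
SKIP_WORDS = {
    'ar': {'al', 'bi', 'li', 'fi'},
    'he': {'ha', 've', 'be', 'le', 'me'},
    'fa': {'e', 'ye', 'va'},
    'hi': {'ka', 'ki', 'ke', 'aur'},
}

def extract_abbreviation(latin_name: str, lang: str) -> str:
    """First letter of each significant word, skipping articles/prepositions."""
    skip = SKIP_WORDS.get(lang, set())
    # Hyphenated prefixes like "al-Maktaba" split into ["al", "Maktaba"]
    words = latin_name.replace('-', ' ').split()
    return ''.join(w[0].upper() for w in words if w.lower() not in skip)
```

For example, the Hebrew case above yields `extract_abbreviation("Arkhiyon ha-Sipur ha-Amami be-Yisrael", "he") == "ASAY"`.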
---
## Validation
### Check Transliteration Output
```python
import re

def validate_transliteration(result: str) -> bool:
    """
    Validate that transliteration output contains only ASCII letters and spaces.
    """
    return bool(re.match(r'^[a-zA-Z\s]+$', result))
```
### Manual Review Queue
Non-Latin institutions should be flagged for manual review if:
1. Transliteration library not available for that script
2. Confidence in transliteration is low
3. Institution has multiple official romanizations
---
## Related Documentation
- `AGENTS.md` - Rule 12: Transliteration Standards
- `ABBREVIATION_SPECIAL_CHAR_RULE.md` - Character filtering after transliteration
- `docs/TRANSLITERATION_CONVENTIONS.md` - Extended examples and edge cases
- `scripts/transliterate_emic_names.py` - Production transliteration script
---
## Changelog
| Date | Change |
|------|--------|
| 2025-12-08 | Initial standards document created |


@ -0,0 +1,277 @@
# Z.AI GLM API Rules for AI Agents
**Last Updated**: 2025-12-08
**Status**: MANDATORY for all LLM API calls in scripts
---
## CRITICAL: Use Z.AI Coding Plan, NOT BigModel API
**This project uses the Z.AI Coding Plan endpoint, which is the SAME endpoint that OpenCode uses internally.**
The regular BigModel API (`open.bigmodel.cn`) will NOT work with the tokens stored in this project. You MUST use the Z.AI Coding Plan endpoint.
---
## API Configuration
### Correct Endpoint
| Property | Value |
|----------|-------|
| **API URL** | `https://api.z.ai/api/coding/paas/v4/chat/completions` |
| **Auth Header** | `Authorization: Bearer {ZAI_API_TOKEN}` |
| **Content-Type** | `application/json` |
### Available Models
| Model | Description | Cost |
|-------|-------------|------|
| `glm-4.5` | Standard GLM-4.5 | Free (0 per token) |
| `glm-4.5-air` | GLM-4.5 Air variant | Free |
| `glm-4.5-flash` | Fast GLM-4.5 | Free |
| `glm-4.5v` | Vision-capable GLM-4.5 | Free |
| `glm-4.6` | Latest GLM-4.6 (recommended) | Free |
**Recommended Model**: `glm-4.6` for best quality
---
## Authentication
### Token Location
The Z.AI API token can be obtained from two locations:
1. **Environment Variable** (preferred for scripts):
```bash
# In .env file at project root
ZAI_API_TOKEN=your_token_here
```
2. **OpenCode Auth File** (reference only):
```
~/.local/share/opencode/auth.json
```
The token is stored under key `zai-coding-plan`.
### Getting the Token
If you need to set up the token:
1. The token is shared with OpenCode's Z.AI Coding Plan
2. Check `~/.local/share/opencode/auth.json` for existing token
3. Add to `.env` file as `ZAI_API_TOKEN`
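
A minimal sketch of reading the token in a script (assumes the `python-dotenv` package for `.env` loading; the helper name is ours, not a project API):

```python
import os

def get_zai_token() -> str:
    """Read the Z.AI Coding Plan token, loading .env first when python-dotenv is present."""
    try:
        from dotenv import load_dotenv  # optional dependency
        load_dotenv()  # reads ZAI_API_TOKEN from a .env file; does not override existing env vars
    except ImportError:
        pass  # fall back to the process environment
    token = os.environ.get("ZAI_API_TOKEN")
    if not token:
        raise RuntimeError("ZAI_API_TOKEN not set; see 'Getting the Token' above")
    return token
```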
---
## Python Implementation
### Correct Implementation
```python
import os
import httpx

class GLMClient:
    """Client for Z.AI GLM API (Coding Plan endpoint)."""

    # CORRECT endpoint - Z.AI Coding Plan
    API_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions"

    def __init__(self, model: str = "glm-4.6"):
        self.api_key = os.environ.get("ZAI_API_TOKEN")
        if not self.api_key:
            raise ValueError("ZAI_API_TOKEN not found in environment")
        self.model = model
        self.client = httpx.AsyncClient(
            timeout=60.0,
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json",
            },
        )

    async def chat(self, messages: list) -> dict:
        """Send chat completion request."""
        response = await self.client.post(
            self.API_URL,
            json={
                "model": self.model,
                "messages": messages,
                "temperature": 0.3,
            },
        )
        response.raise_for_status()
        return response.json()
```
### WRONG Implementation (DO NOT USE)
```python
# WRONG - This endpoint will fail with quota errors
API_URL = "https://open.bigmodel.cn/api/paas/v4/chat/completions"
# WRONG - This is for regular BigModel API, not Z.AI Coding Plan
api_key = os.environ.get("ZHIPU_API_KEY")
```
---
## Request Format
### Chat Completion Request
```json
{
"model": "glm-4.6",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Your prompt here"
}
],
"temperature": 0.3,
"max_tokens": 4096
}
```
### Response Format
```json
{
"id": "request-id",
"created": 1733651234,
"model": "glm-4.6",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Response text here"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 100,
"completion_tokens": 50,
"total_tokens": 150
}
}
```
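
Given a response of that shape, the assistant text and token usage can be pulled out as below (a sketch; the helper name is ours, and error handling for malformed responses is omitted):

```python
def parse_chat_response(resp: dict) -> tuple:
    """Return (assistant_text, total_tokens) from a chat completion response dict."""
    # First choice carries the assistant message
    content = resp["choices"][0]["message"]["content"]
    # Usage block is informational; default to 0 if absent
    total = resp.get("usage", {}).get("total_tokens", 0)
    return content, total
```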
---
## Error Handling
### Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `401 Unauthorized` | Invalid or missing token | Check ZAI_API_TOKEN in .env |
| `403 Quota exceeded` | Wrong endpoint (BigModel) | Use Z.AI Coding Plan endpoint |
| `429 Rate limited` | Too many requests | Add delay between requests |
| `500 Server error` | API issue | Retry with exponential backoff |
### Retry Strategy
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
)
async def call_api_with_retry(client, messages):
    return await client.chat(messages)
```
---
## Integration with CH-Annotator
When using GLM for entity recognition or verification, always reference CH-Annotator v1.7.0:
```python
PROMPT = """You are a heritage institution classifier following CH-Annotator v1.7.0 convention.
## CH-Annotator GRP.HER Definition
Heritage institutions are organizations that:
- Collect, preserve, and provide access to cultural heritage materials
- Include: museums (GRP.HER.MUS), libraries (GRP.HER.LIB), archives (GRP.HER.ARC), galleries (GRP.HER.GAL)
## Entity to Analyze
...
"""
```
See `.opencode/CH_ANNOTATOR_CONVENTION.md` for full convention details.
---
## Scripts Using GLM API
The following scripts use the Z.AI GLM API:
| Script | Purpose |
|--------|---------|
| `scripts/reenrich_wikidata_with_verification.py` | Wikidata entity verification using GLM-4.6 |
When creating new scripts that need LLM capabilities, follow this pattern.
---
## Environment Setup Checklist
When setting up a new environment:
- [ ] Check `~/.local/share/opencode/auth.json` for existing Z.AI token
- [ ] Add `ZAI_API_TOKEN` to `.env` file
- [ ] Verify endpoint is `https://api.z.ai/api/coding/paas/v4/chat/completions`
- [ ] Test with `glm-4.6` model
- [ ] Reference CH-Annotator v1.7.0 for entity recognition tasks
---
## AI Agent Rules
### DO
- Use `https://api.z.ai/api/coding/paas/v4/chat/completions` endpoint
- Get token from `ZAI_API_TOKEN` environment variable
- Use `glm-4.6` as the default model
- Reference CH-Annotator v1.7.0 for entity tasks
- Add retry logic with exponential backoff
- Handle JSON parsing errors gracefully
### DO NOT
- Use `open.bigmodel.cn` endpoint (wrong API)
- Use `ZHIPU_API_KEY` environment variable (wrong key)
- Hard-code API tokens in scripts
- Skip error handling for API calls
- Forget to load `.env` file before accessing environment
---
## Related Documentation
- **CH-Annotator Convention**: `.opencode/CH_ANNOTATOR_CONVENTION.md`
- **Entity Annotation**: `data/entity_annotation/ch_annotator-v1_7_0.yaml`
- **Wikidata Enrichment Script**: `scripts/reenrich_wikidata_with_verification.py`
---
## Version History
| Date | Change |
|------|--------|
| 2025-12-08 | Initial documentation - Fixed API endpoint discovery |

AGENTS.md

@ -720,6 +720,66 @@ claim:
---
### Rule 11: Z.AI GLM API for LLM Tasks (NOT BigModel)
**CRITICAL: When using GLM models in scripts, use the Z.AI Coding Plan endpoint, NOT the regular BigModel API.**
The project uses the same Z.AI Coding Plan that OpenCode uses internally. The regular BigModel API (`open.bigmodel.cn`) will NOT work with our tokens.
**Correct API Configuration**:
| Property | Value |
|----------|-------|
| **API URL** | `https://api.z.ai/api/coding/paas/v4/chat/completions` |
| **Environment Variable** | `ZAI_API_TOKEN` |
| **Recommended Model** | `glm-4.6` |
| **Cost** | Free (0 per token for all GLM models) |
**Available Models**: `glm-4.5`, `glm-4.5-air`, `glm-4.5-flash`, `glm-4.5v`, `glm-4.6`
**Python Implementation**:
```python
import os
import httpx

# CORRECT - Z.AI Coding Plan endpoint
API_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions"
api_key = os.environ.get("ZAI_API_TOKEN")

client = httpx.AsyncClient(
    timeout=60.0,
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    },
)

# WRONG - This will fail with quota errors!
# API_URL = "https://open.bigmodel.cn/api/paas/v4/chat/completions"
# api_key = os.environ.get("ZHIPU_API_KEY")
```
**Integration with CH-Annotator**: When using GLM for entity recognition, always reference CH-Annotator v1.7.0 in prompts:
```python
PROMPT = """You are following CH-Annotator v1.7.0 convention.
Heritage institutions are type GRP.HER with subtypes:
- GRP.HER.MUS (museums)
- GRP.HER.LIB (libraries)
- GRP.HER.ARC (archives)
- GRP.HER.GAL (galleries)
..."""
```
**Token Location**:
1. **Environment**: Add `ZAI_API_TOKEN` to `.env` file
2. **OpenCode Auth**: Token stored in `~/.local/share/opencode/auth.json` under key `zai-coding-plan`
**See**: `.opencode/ZAI_GLM_API_RULES.md` for complete documentation
---
## Project Overview
**Goal**: Extract structured data about worldwide GLAMORCUBESFIXPHDNT (Galleries, Libraries, Archives, Museums, Official institutions, Research centers, Corporations, Unknown, Botanical gardens/zoos, Educational providers, Societies, Features, Intangible heritage groups, miXed, Personal collections, Holy sites, Digital platforms, NGOs, Taste/smell heritage) institutions from 139+ Claude conversation JSON files and integrate with authoritative CSV datasets.
**The institution abbreviation component uses the FIRST LETTER of each significant word in the official emic (native language) name.**
**⚠️ GRANDFATHERING POLICY (PID STABILITY)**
Existing GHCIDs created before December 2025 are **grandfathered** - their abbreviations will NOT be updated even if derived from English translations rather than emic names. This preserves PID stability per the "Cool URIs Don't Change" principle.
**Applies to:**
- 817 UNESCO Memory of the World custodian files enriched with `custodian_name.emic_name`
- Abbreviations like `NLP` (National Library of Peru) remain unchanged even though emic name is "Biblioteca Nacional del Perú" (would be `BNP`)
**For NEW custodians only:** Apply emic name abbreviation protocol described below.
**Abbreviation Rules**:
1. Use the **CustodianName** (official emic name), NOT an English translation
2. Take the **first letter** of each word
**See**: `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` for complete documentation
### 🚨 CRITICAL: Diacritics MUST Be Normalized to ASCII in Abbreviations 🚨
**When generating abbreviations for GHCID, diacritics (accented characters) MUST be normalized to their ASCII base letter equivalents. Only ASCII uppercase letters (A-Z) are permitted.**
This rule applies to ALL languages with diacritical marks including Czech, Polish, German, French, Spanish, Portuguese, Nordic languages, Hungarian, Romanian, Turkish, and others.
**RATIONALE**:
1. **URI/URL safety** - Non-ASCII characters require percent-encoding
2. **Cross-system compatibility** - ASCII is universally supported
3. **Filename safety** - Some systems have issues with non-ASCII filenames
4. **Human readability** - Easier to type and communicate
**DIACRITICS NORMALIZATION TABLE**:
| Language | Diacritics | ASCII Equivalent |
|----------|------------|------------------|
| **Czech** | Č, Ř, Š, Ž, Ě, Ů | C, R, S, Z, E, U |
| **Polish** | Ł, Ń, Ó, Ś, Ź, Ż, Ą, Ę | L, N, O, S, Z, Z, A, E |
| **German** | Ä, Ö, Ü, ß | A, O, U, SS |
| **French** | É, È, Ê, Ç, Ô, Â | E, E, E, C, O, A |
| **Spanish** | Ñ, Á, É, Í, Ó, Ú | N, A, E, I, O, U |
| **Portuguese** | Ã, Õ, Ç, Á, É | A, O, C, A, E |
| **Nordic** | Å, Ä, Ö, Ø, Æ | A, A, O, O, AE |
| **Hungarian** | Á, É, Í, Ó, Ö, Ő, Ú, Ü, Ű | A, E, I, O, O, O, U, U, U |
| **Turkish** | Ç, Ğ, İ, Ö, Ş, Ü | C, G, I, O, S, U |
| **Romanian** | Ă, Â, Î, Ș, Ț | A, A, I, S, T |
**REAL-WORLD EXAMPLE** (Czech institution):
```yaml
# INCORRECT - Contains diacritics:
ghcid_current: CZ-VY-TEL-L-VHSPAOČRZS # ❌ Contains "Č"
# CORRECT - ASCII only:
ghcid_current: CZ-VY-TEL-L-VHSPAOCRZS # ✅ "Č" → "C"
```
**IMPLEMENTATION**:
```python
import unicodedata

def normalize_diacritics(text: str) -> str:
    """Normalize diacritics to ASCII equivalents."""
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', text)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    return ascii_text

# Example
normalize_diacritics("VHSPAOČRZS")  # Returns "VHSPAOCRZS"
```
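Note that NFD decomposition alone does not cover letters from the table above that have no combining-mark decomposition (ß, Ø, Æ, Ł, Þ, ð). A sketch that pre-maps those letters before stripping marks; the map and function name are illustrative assumptions, not the project's canonical implementation (Þ → T follows the Þjóðminjasafn Íslands → TI example below):

```python
import unicodedata

# Letters with no NFD decomposition; mappings follow the normalization
# tables above (illustrative subset, not project code)
SPECIAL_MAP = {
    'ß': 'SS', 'Ø': 'O', 'ø': 'o', 'Æ': 'AE', 'æ': 'ae',
    'Ł': 'L', 'ł': 'l', 'Đ': 'D', 'đ': 'd',
    'Þ': 'T', 'þ': 't', 'Ð': 'D', 'ð': 'd',
}

def normalize_to_ascii(text: str) -> str:
    """Pre-map non-decomposable letters, then strip combining marks."""
    mapped = ''.join(SPECIAL_MAP.get(c, c) for c in text)
    decomposed = unicodedata.normalize('NFD', mapped)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

normalize_to_ascii("Łódź")        # "Lodz"
normalize_to_ascii("VHSPAOČRZS")  # "VHSPAOCRZS"
```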
**EXAMPLES**:
| Emic Name (with diacritics) | Abbreviation | Wrong |
|-----------------------------|--------------|-------|
| Vlastivědné muzeum v Šumperku | VMS | VMŠ ❌ |
| Österreichische Nationalbibliothek | ON | ÖN ❌ |
| Bibliothèque nationale de France | BNF | BNF (OK - è not in first letter) |
| Múzeum Łódzkie | ML | MŁ ❌ |
| Þjóðminjasafn Íslands | TI | ÞI ❌ |
**See**: `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` for complete documentation (covers both special characters and diacritics)
### 🚨 CRITICAL: Non-Latin Scripts MUST Be Transliterated Before Abbreviation 🚨
**When generating GHCID abbreviations from institution names in non-Latin scripts (Cyrillic, Chinese, Japanese, Korean, Arabic, Hebrew, Greek, Devanagari, Thai, etc.), the emic name MUST first be transliterated to Latin characters using ISO or recognized standards.**
This rule affects **170 institutions** across **21 languages** with non-Latin writing systems.
**CORE PRINCIPLE**: The emic name is PRESERVED in original script in `custodian_name.emic_name`. Transliteration is only used for abbreviation generation.
**TRANSLITERATION STANDARDS BY SCRIPT**:
| Script | Languages | Standard | Example |
|--------|-----------|----------|---------|
| **Cyrillic** | ru, uk, bg, sr, kk | ISO 9:1995 | Институт → Institut |
| **Chinese** | zh | Hanyu Pinyin (ISO 7098) | 东巴文化博物院 → Dongba Wenhua Bowuyuan |
| **Japanese** | ja | Modified Hepburn | 国立博物館 → Kokuritsu Hakubutsukan |
| **Korean** | ko | Revised Romanization | 독립기념관 → Dongnip Ginyeomgwan |
| **Arabic** | ar, fa, ur | ISO 233-2/3 | المكتبة الوطنية → al-Maktaba al-Wataniya |
| **Hebrew** | he | ISO 259-3 | ארכיון → Arkhiyon |
| **Greek** | el | ISO 843 | Μουσείο → Mouseio |
| **Devanagari** | hi, ne | ISO 15919 | राजस्थान → Rajasthana |
| **Bengali** | bn | ISO 15919 | বাংলাদেশ → Bangladesh |
| **Thai** | th | ISO 11940-2 | สำนักหอ → Samnak Ho |
| **Armenian** | hy | ISO 9985 | Մատենադարան → Matenadaran |
| **Georgian** | ka | ISO 9984 | ხელნაწერთა → Khelnawerti |
**WORKFLOW**:
```
1. Emic Name (original script)
2. Transliterate to Latin (ISO standard)
3. Normalize diacritics (remove accents)
4. Skip articles/prepositions
5. Extract first letters → Abbreviation
```
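Steps 3-5 of the workflow above can be sketched as a single helper, assuming the name has already been transliterated to Latin in step 2 (`abbreviate` is a hypothetical name, not project code):

```python
import unicodedata

def abbreviate(latin_name: str, skip_words: frozenset = frozenset()) -> str:
    """Steps 3-5: normalize diacritics, skip stop words, take initials."""
    nfd = unicodedata.normalize('NFD', latin_name)
    ascii_name = ''.join(c for c in nfd if unicodedata.category(c) != 'Mn')
    words = [w for w in ascii_name.split() if w.lower() not in skip_words]
    return ''.join(w[0].upper() for w in words if w and w[0].isalpha())

abbreviate("Institut Vostochnykh Rukopisey RAN")  # "IVRR"
abbreviate("Dōngbā Wénhuà Bówùyuàn")              # "DWB"
```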
**EXAMPLES**:
| Language | Emic Name | Transliterated | Abbreviation |
|----------|-----------|----------------|--------------|
| **Russian** | Институт восточных рукописей РАН | Institut Vostochnykh Rukopisey RAN | IVRR |
| **Chinese** | 东巴文化博物院 | Dongba Wenhua Bowuyuan | DWB |
| **Korean** | 독립기념관 | Dongnip Ginyeomgwan | DG |
| **Hindi** | राजस्थान प्राच्यविद्या प्रतिष्ठान | Rajasthana Pracyavidya Pratishthana | RPP |
| **Arabic** | المكتبة الوطنية للمملكة المغربية | al-Maktaba al-Wataniya lil-Mamlaka al-Maghribiya | MWMM |
| **Hebrew** | ארכיון הסיפור העממי בישראל | Arkhiyon ha-Sipur ha-Amami be-Yisrael | ASAY |
| **Greek** | Αρχαιολογικό Μουσείο Θεσσαλονίκης | Archaiologiko Mouseio Thessalonikis | AMT |
**SCRIPT-SPECIFIC SKIP WORDS**:
| Language | Skip Words (Articles/Prepositions) |
|----------|-------------------------------------|
| **Arabic** | al- (the), bi-, li-, fi- (prepositions) |
| **Hebrew** | ha- (the), ve- (and), be-, le-, me- |
| **Persian** | -e, -ye (ezafe connector), va (and) |
| **CJK** | None (particles integral to meaning) |
**IMPLEMENTATION**:
```python
from transliteration import transliterate_for_abbreviation
# Input: emic name in non-Latin script + language code
emic_name = "Институт восточных рукописей РАН"
lang = "ru"
# Step 1: Transliterate to Latin using ISO standard
latin = transliterate_for_abbreviation(emic_name, lang)
# Result: "Institut Vostochnykh Rukopisey RAN"
# Step 2: Apply standard abbreviation extraction
abbreviation = extract_abbreviation_from_name(latin)
# Result: "IVRR"
```
**GRANDFATHERING POLICY**: Existing abbreviations from 817 UNESCO MoW custodians are grandfathered. This transliteration standard applies only to **NEW custodians** created after December 2025.
**See**: `.opencode/TRANSLITERATION_STANDARDS.md` for complete ISO standards, mapping tables, and Python implementation
---
GHCID uses a **four-identifier strategy** for maximum flexibility and transparency:
---
**Version**: 0.2.1
**Schema Version**: v0.2.1 (modular)
**Last Updated**: 2025-12-08
**Maintained By**: GLAM Data Extraction Project

docs/GLM_API_SETUP.md
# GLM API Setup Guide
This guide explains how to configure and use the GLM-4 language model for entity recognition, verification, and enrichment tasks in the GLAM project.
## Overview
The GLAM project uses **GLM-4.6** via the **Z.AI Coding Plan** endpoint for LLM-powered tasks such as:
- **Entity Verification**: Verify that Wikidata entities are heritage institutions
- **Description Enrichment**: Generate rich descriptions from multiple data sources
- **Entity Resolution**: Match institution names across different data sources
- **Claim Validation**: Verify extracted claims against source documents
**Cost**: All GLM models are FREE (0 cost per token) on the Z.AI Coding Plan.
## Prerequisites
- Python 3.10+
- `httpx` library for async HTTP requests
- Access to Z.AI Coding Plan (same as OpenCode)
## Quick Start
### 1. Set Up Environment Variable
Add your Z.AI API token to the `.env` file in the project root:
```bash
# .env file
ZAI_API_TOKEN=your_token_here
```
### 2. Find Your Token
The token is shared with OpenCode. Check:
```bash
# View OpenCode auth file
cat ~/.local/share/opencode/auth.json | jq '.["zai-coding-plan"]'
```
Copy this token to your `.env` file.
### 3. Basic Python Usage
```python
import os
import asyncio

import httpx
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

async def call_glm():
    api_url = "https://api.z.ai/api/coding/paas/v4/chat/completions"
    api_key = os.environ.get("ZAI_API_TOKEN")

    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            api_url,
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            },
            json={
                "model": "glm-4.6",
                "messages": [
                    {"role": "user", "content": "Hello, GLM!"}
                ],
                "temperature": 0.3,
            },
        )
        result = response.json()
        print(result["choices"][0]["message"]["content"])

asyncio.run(call_glm())
```
## API Configuration
### Endpoint Details
| Property | Value |
|----------|-------|
| **Base URL** | `https://api.z.ai/api/coding/paas/v4` |
| **Chat Endpoint** | `/chat/completions` |
| **Auth Method** | Bearer Token |
| **Header** | `Authorization: Bearer {token}` |
### Available Models
| Model | Speed | Quality | Use Case |
|-------|-------|---------|----------|
| `glm-4.6` | Medium | Highest | Complex reasoning, verification |
| `glm-4.5` | Medium | High | General tasks |
| `glm-4.5-air` | Fast | Good | High-volume processing |
| `glm-4.5-flash` | Fastest | Good | Quick responses |
| `glm-4.5v` | Medium | High | Vision/image tasks |
**Recommendation**: Use `glm-4.6` for entity verification and complex tasks.
## Integration with CH-Annotator
When using GLM for entity recognition tasks, always reference the CH-Annotator convention:
### Heritage Institution Verification
````python
VERIFICATION_PROMPT = """You are a heritage institution classifier following CH-Annotator v1.7.0 convention.

## CH-Annotator GRP.HER Definition
Heritage institutions are organizations that:
- Collect, preserve, and provide access to cultural heritage materials
- Include: museums (GRP.HER.MUS), libraries (GRP.HER.LIB), archives (GRP.HER.ARC), galleries (GRP.HER.GAL)

## Entity Types That Are NOT Heritage Institutions
- Cities, towns, municipalities (places, not institutions)
- General businesses or companies
- People/individuals
- Events, festivals, exhibitions (temporary)

## Your Task
Analyze the entity and respond in JSON:
```json
{
  "is_heritage_institution": true/false,
  "subtype": "MUS|LIB|ARC|GAL|OTHER|null",
  "confidence": 0.95,
  "reasoning": "Brief explanation"
}
```
"""
````
### Entity Type Mapping
| CH-Annotator Type | GLAM Institution Type |
|-------------------|----------------------|
| GRP.HER.MUS | MUSEUM |
| GRP.HER.LIB | LIBRARY |
| GRP.HER.ARC | ARCHIVE |
| GRP.HER.GAL | GALLERY |
| GRP.HER.RES | RESEARCH_CENTER |
| GRP.HER.BOT | BOTANICAL_ZOO |
| GRP.HER.EDU | EDUCATION_PROVIDER |
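The mapping table above translates directly into a lookup dict; a minimal sketch (names are illustrative, not from the codebase):

```python
# CH-Annotator subtype → GLAM institution type (from the table above)
CH_TO_GLAM = {
    "GRP.HER.MUS": "MUSEUM",
    "GRP.HER.LIB": "LIBRARY",
    "GRP.HER.ARC": "ARCHIVE",
    "GRP.HER.GAL": "GALLERY",
    "GRP.HER.RES": "RESEARCH_CENTER",
    "GRP.HER.BOT": "BOTANICAL_ZOO",
    "GRP.HER.EDU": "EDUCATION_PROVIDER",
}

def glam_type(ch_type: str) -> str:
    """Resolve a CH-Annotator subtype to its GLAM type, defaulting to UNKNOWN."""
    return CH_TO_GLAM.get(ch_type, "UNKNOWN")

glam_type("GRP.HER.MUS")  # "MUSEUM"
```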
## Complete Implementation Example
### Wikidata Verification Script
See `scripts/reenrich_wikidata_with_verification.py` for a complete example:
```python
import os
import re
import json
from typing import Any, Dict, List

import httpx

class GLMHeritageVerifier:
    """Verify Wikidata entities using GLM-4.6 and CH-Annotator."""

    API_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions"

    def __init__(self, model: str = "glm-4.6"):
        self.api_key = os.environ.get("ZAI_API_TOKEN")
        if not self.api_key:
            raise ValueError("ZAI_API_TOKEN not found in environment")
        self.model = model
        self.client = httpx.AsyncClient(
            timeout=60.0,
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json",
            },
        )

    async def verify_heritage_institution(
        self,
        institution_name: str,
        wikidata_label: str,
        wikidata_description: str,
        instance_of_types: List[str],
    ) -> Dict[str, Any]:
        """Check if a Wikidata entity is a heritage institution."""
        prompt = f"""Analyze if this entity is a heritage institution (GRP.HER):

Institution Name: {institution_name}
Wikidata Label: {wikidata_label}
Description: {wikidata_description}
Instance Of: {', '.join(instance_of_types)}

Respond with JSON only."""

        response = await self.client.post(
            self.API_URL,
            json={
                "model": self.model,
                "messages": [
                    # VERIFICATION_PROMPT is the system prompt defined above
                    {"role": "system", "content": VERIFICATION_PROMPT},
                    {"role": "user", "content": prompt},
                ],
                "temperature": 0.1,
            },
        )
        result = response.json()
        content = result["choices"][0]["message"]["content"]

        # Parse JSON from response
        json_match = re.search(r'\{.*\}', content, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        return {"is_heritage_institution": False, "error": "No JSON found"}
```
## Error Handling
### Common Errors
| Error Code | Meaning | Solution |
|------------|---------|----------|
| 401 | Unauthorized | Check ZAI_API_TOKEN |
| 403 | Forbidden/Quota | Using wrong endpoint (use Z.AI, not BigModel) |
| 429 | Rate Limited | Add delays between requests |
| 500 | Server Error | Retry with backoff |
### Retry Pattern
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
)
async def call_with_retry(client, messages):
    response = await client.post(API_URL, json={"model": "glm-4.6", "messages": messages})
    response.raise_for_status()
    return response.json()
```
### JSON Parsing
LLM responses may contain text around JSON. Always parse safely:
```python
import re
import json

def parse_json_from_response(content: str) -> dict:
    """Extract JSON from LLM response text."""
    # Try to find a fenced JSON block first
    json_match = re.search(r'```json\s*(\{.*?\})\s*```', content, re.DOTALL)
    if json_match:
        return json.loads(json_match.group(1))
    # Fall back to bare JSON
    json_match = re.search(r'\{.*\}', content, re.DOTALL)
    if json_match:
        return json.loads(json_match.group())
    return {"error": "No JSON found in response"}
```
## Best Practices
### 1. Use Low Temperature for Verification
```python
{
    "temperature": 0.1  # Low for consistent, deterministic responses
}
```
### 2. Request JSON Output
Always request JSON format in your prompts for structured responses:
````
Respond in JSON format only:
```json
{"key": "value"}
```
````
### 3. Batch Processing
Process multiple entities with rate limiting:
```python
import asyncio
from typing import List

async def batch_verify(entities: List[dict], rate_limit: float = 0.5):
    """Verify entities with rate limiting."""
    results = []
    for entity in entities:
        result = await verifier.verify(entity)
        results.append(result)
        await asyncio.sleep(rate_limit)  # Respect rate limits
    return results
```
### 4. Always Reference CH-Annotator
For entity recognition tasks, include CH-Annotator context:
```python
system_prompt = """You are following CH-Annotator v1.7.0 convention.
Heritage institutions are type GRP.HER with subtypes for museums, libraries, archives, and galleries.
"""
```
## Related Scripts
| Script | Purpose |
|--------|---------|
| `scripts/reenrich_wikidata_with_verification.py` | Wikidata entity verification |
## Related Documentation
- **Agent Rules**: `AGENTS.md` (Rule 11: Z.AI GLM API)
- **Agent Config**: `.opencode/ZAI_GLM_API_RULES.md`
- **CH-Annotator**: `.opencode/CH_ANNOTATOR_CONVENTION.md`
- **Entity Annotation**: `data/entity_annotation/ch_annotator-v1_7_0.yaml`
## Troubleshooting
### "Quota exceeded" Error
**Symptom**: 403 error with "quota exceeded" message
**Cause**: Using wrong API endpoint (`open.bigmodel.cn` instead of `api.z.ai`)
**Solution**: Update API URL to `https://api.z.ai/api/coding/paas/v4/chat/completions`
### "Token not found" Error
**Symptom**: ValueError about missing ZAI_API_TOKEN
**Solution**:
1. Check `~/.local/share/opencode/auth.json` for token
2. Add to `.env` file as `ZAI_API_TOKEN=your_token`
3. Ensure `load_dotenv()` is called before accessing environment
### JSON Parsing Failures
**Symptom**: LLM returns text that can't be parsed as JSON
**Solution**: Use the `parse_json_from_response()` helper function with fallback handling
---
**Last Updated**: 2025-12-08

# Transliteration Conventions for Heritage Custodian Names
**Document Type**: User Guide
**Version**: 1.0
**Last Updated**: 2025-12-08
**Related Rules**: `.opencode/TRANSLITERATION_STANDARDS.md`, `AGENTS.md` (Rule 12)
---
## Overview
This document provides comprehensive examples and guidance for transliterating heritage institution names from non-Latin scripts to Latin characters. Transliteration is **required** for generating GHCID abbreviations but the **original emic name is always preserved**.
### Key Principles
1. **Emic name preserved** - Original script stored in `custodian_name.emic_name`
2. **ISO standards used** - Recognized international transliteration standards
3. **Deterministic output** - Same input always produces same Latin output
4. **Abbreviation purpose only** - Transliteration is for GHCID generation, not display
---
## Language-by-Language Examples
### Russian (Cyrillic - ISO 9:1995)
**Dataset Statistics**: 13 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Институт восточных рукописей РАН | Institut vostočnyh rukopisej RAN | IVRR |
| Российская государственная библиотека | Rossijskaja gosudarstvennaja biblioteka | RGB |
| Государственный архив Российской Федерации | Gosudarstvennyj arhiv Rossijskoj Federacii | GARF |
**Skip Words (Russian)**: None significant (Russian doesn't use articles)
**Character Mapping**:
```
А → A Б → B В → V Г → G Д → D Е → E
Ё → Ë Ж → Ž З → Z И → I Й → J К → K
Л → L М → M Н → N О → O П → P Р → R
С → S Т → T У → U Ф → F Х → H Ц → C
Ч → Č Ш → Š Щ → Ŝ Ъ → ʺ Ы → Y Ь → ʹ
Э → È Ю → Û Я → Â
```
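The mapping above can be expressed as a small lookup; this is an illustrative sketch covering only the uppercase letters tabulated here, not the project's transliteration module:

```python
# Minimal ISO 9:1995 sketch covering the letters tabulated above
ISO9 = {
    'А': 'A', 'Б': 'B', 'В': 'V', 'Г': 'G', 'Д': 'D', 'Е': 'E',
    'Ё': 'Ë', 'Ж': 'Ž', 'З': 'Z', 'И': 'I', 'Й': 'J', 'К': 'K',
    'Л': 'L', 'М': 'M', 'Н': 'N', 'О': 'O', 'П': 'P', 'Р': 'R',
    'С': 'S', 'Т': 'T', 'У': 'U', 'Ф': 'F', 'Х': 'H', 'Ц': 'C',
    'Ч': 'Č', 'Ш': 'Š', 'Щ': 'Ŝ', 'Ъ': 'ʺ', 'Ы': 'Y', 'Ь': 'ʹ',
    'Э': 'È', 'Ю': 'Û', 'Я': 'Â',
}

def iso9(text: str) -> str:
    """Transliterate Cyrillic letter-by-letter, preserving case."""
    out = []
    for ch in text:
        latin = ISO9.get(ch.upper(), ch)
        out.append(latin if ch.isupper() or not ch.isalpha() else latin.lower())
    return ''.join(out)

iso9("Институт")  # "Institut"
```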
---
### Ukrainian (Cyrillic - ISO 9:1995)
**Dataset Statistics**: 8 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Центральний державний архів громадських об'єднань України | Centralnyj deržavnyj arhiv hromadskyx objednan Ukrainy | CDAGOU |
| Національна бібліотека України | Nacionalna biblioteka Ukrainy | NBU |
**Ukrainian-specific characters**:
```
І → I Ї → Ji Є → Je Ґ → G'
```
---
### Chinese (Hanyu Pinyin - ISO 7098)
**Dataset Statistics**: 27 institutions
| Emic Name | Pinyin | Abbreviation |
|-----------|--------|--------------|
| 东巴文化博物院 | Dōngbā Wénhuà Bówùyuàn | DWB |
| 中国第一历史档案馆 | Zhōngguó Dìyī Lìshǐ Dàng'ànguǎn | ZDLD |
| 北京故宫博物院 | Běijīng Gùgōng Bówùyuàn | BGB |
| 中国国家图书馆 | Zhōngguó Guójiā Túshūguǎn | ZGT |
**Notes**:
- Tone marks are removed for abbreviation (diacritics normalization)
- Word boundaries follow natural semantic breaks
- Multi-syllable words keep together
**Skip Words**: None (Chinese doesn't use separate articles/prepositions)
---
### Japanese (Modified Hepburn)
**Dataset Statistics**: 19 institutions
| Emic Name | Romaji | Abbreviation |
|-----------|--------|--------------|
| 国立中央博物館 | Kokuritsu Chūō Hakubutsukan | KCH |
| 東京国立博物館 | Tōkyō Kokuritsu Hakubutsukan | TKH |
| 国立国会図書館 | Kokuritsu Kokkai Toshokan | KKT |
**Notes**:
- Long vowels (ō, ū) normalized to (o, u)
- Particles typically attached to preceding word
- Kanji compounds transliterated as single words
---
### Korean (Revised Romanization)
**Dataset Statistics**: 36 institutions
| Emic Name | RR Romanization | Abbreviation |
|-----------|-----------------|--------------|
| 독립기념관 | Dongnip Ginyeomgwan | DG |
| 국립중앙박물관 | Gungnip Jungang Bakmulgwan | GJB |
| 서울대학교 규장각한국학연구원 | Seoul Daehakgyo Gyujanggak Hangukhak Yeonguwon | SDGHY |
| 국립한글박물관 | Gungnip Hangeul Bakmulgwan | GHB |
**Notes**:
- No diacritics in Revised Romanization (unlike McCune-Reischauer)
- Consonant assimilation reflected in spelling
- Spaces at natural word boundaries
---
### Arabic (ISO 233-2)
**Dataset Statistics**: 8 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| المكتبة الوطنية للمملكة المغربية | al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya | MWMM |
| دار الكتب المصرية | Dār al-Kutub al-Miṣrīya | DKM |
| المتحف الوطني العراقي | al-Matḥaf al-Waṭanī al-ʿIrāqī | MWI |
**Skip Words**:
- `al-` (definite article "the")
- After skip word removal: Maktaba, Wataniya, Mamlaka, Maghribiya → MWMM
**Notes**:
- Right-to-left script
- Definite article "al-" always skipped
- Diacritics normalized (ā→a, ī→i, etc.)
---
### Persian/Farsi (ISO 233-3)
**Dataset Statistics**: 11 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| وزارت امور خارجه ایران | Vezārat-e Omūr-e Khārejeh-ye Īrān | VOKI |
| کتابخانه آستان قدس رضوی | Ketābkhāneh-ye Āstān-e Qods-e Raẓavī | KAQR |
| مجلس شورای اسلامی | Majles-e Showrā-ye Eslāmī | MSE |
**Skip Words**:
- `-e`, `-ye` (ezafe connector, "of")
- `va` ("and")
**Persian-specific characters**:
```
پ → p چ → č ژ → ž گ → g
```
---
### Hebrew (ISO 259-3)
**Dataset Statistics**: 4 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| ארכיון הסיפור העממי בישראל | Arḵiyon ha-Sipur ha-ʿAmami be-Yiśraʾel | ASAY |
| הספרייה הלאומית | ha-Sifriya ha-Leʾumit | SL |
| ארכיון המדינה | Arḵiyon ha-Medina | AM |
**Skip Words**:
- `ha-` (definite article "the")
- `be-` ("in")
- `le-` ("to")
- `ve-` ("and")
**Notes**:
- Right-to-left script
- Articles attached with hyphen
- Silent letters (aleph, ayin) often omitted in abbreviation
---
### Hindi (Devanagari - ISO 15919)
**Dataset Statistics**: 14 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| राजस्थान प्राच्यविद्या प्रतिष्ठान | Rājasthāna Prācyavidyā Pratiṣṭhāna | RPP |
| राष्ट्रीय अभिलेखागार | Rāṣṭrīya Abhilekhāgāra | RA |
| राष्ट्रीय संग्रहालय नई दिल्ली | Rāṣṭrīya Saṅgrahālaya Naī Dillī | RSND |
**Skip Words**:
- `ka`, `ki`, `ke` ("of")
- `aur` ("and")
- `mein` ("in")
**Notes**:
- Conjunct consonants transliterated as cluster
- Long vowels marked (ā, ī, ū) then normalized
---
### Greek (ISO 843)
**Dataset Statistics**: 2 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Αρχαιολογικό Μουσείο Θεσσαλονίκης | Archaiologikó Mouseío Thessaloníkīs | AMT |
| Εθνική Βιβλιοθήκη της Ελλάδας | Ethnikī́ Vivliothī́kī tīs Elládas | EVE |
**Skip Words**:
- `tīs`, `tou` ("of the")
- `kai` ("and")
**Character Mapping**:
```
Α → A Β → V Γ → G Δ → D Ε → E Ζ → Z
Η → Ī Θ → Th Ι → I Κ → K Λ → L Μ → M
Ν → N Ξ → X Ο → O Π → P Ρ → R Σ → S
Τ → T Υ → Y Φ → F Χ → Ch Ψ → Ps Ω → Ō
```
---
### Thai (ISO 11940-2)
**Dataset Statistics**: 6 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| สำนักหอจดหมายเหตุแห่งชาติ | Samnak Ho Chotmaihet Haeng Chat | SHCHC |
| หอสมุดแห่งชาติ | Ho Samut Haeng Chat | HSHC |
**Notes**:
- Thai script is abugida (consonant-vowel syllables)
- No spaces in Thai; word boundaries determined by meaning
- Royal Thai General System also acceptable
---
### Armenian (ISO 9985)
**Dataset Statistics**: 4 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Մատենադարան | Matenadaran | M |
| Ազգային Մատենադարան | Azgayin Matenadaran | AM |
**Notes**:
- Armenian alphabet unique to Armenian language
- Transliteration straightforward letter-for-letter
---
### Georgian (ISO 9984)
**Dataset Statistics**: 2 institutions
| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| ხელნაწერთა ეროვნული ცენტრი | Xelnawerti Erovnuli C'ent'ri | XEC |
| საქართველოს ეროვნული არქივი | Sakartvelos Erovnuli Arkivi | SEA |
**Notes**:
- Georgian Mkhedruli script
- Apostrophes mark ejective consonants (removed in abbreviation)
---
## Complete Workflow Example
### Step-by-Step: Korean Institution
**Institution**: National Museum of Korea
1. **Emic Name (Original Script)**:
```
국립중앙박물관
```
2. **Language Detection**: Korean (ko)
3. **Transliterate using Revised Romanization**:
```
Gungnip Jungang Bakmulgwan
```
4. **Identify Skip Words**: None for Korean
5. **Extract First Letters**:
```
G + J + B = GJB
```
6. **Diacritic Normalization**: N/A (RR has no diacritics)
7. **Final Abbreviation**: `GJB`
8. **Store in YAML**:
```yaml
custodian_name:
  emic_name: 국립중앙박물관
  name_language: ko
  english_name: National Museum of Korea
ghcid:
  ghcid_current: KR-SO-SEO-M-GJB
  abbreviation_source: transliterated_emic
```
---
### Step-by-Step: Arabic Institution
**Institution**: National Library of Morocco
1. **Emic Name (Original Script)**:
```
المكتبة الوطنية للمملكة المغربية
```
2. **Language Detection**: Arabic (ar)
3. **Transliterate using ISO 233-2**:
```
al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya
```
4. **Identify Skip Words**: `al-` (4 occurrences), `lil-` (1)
5. **After Skip Word Removal**:
```
Maktaba Waṭanīya Mamlaka Maġribīya
```
6. **Extract First Letters**:
```
M + W + M + M = MWMM
```
7. **Diacritic Normalization**: ṭ→t, ī→i, ġ→g
```
MWMM (already ASCII)
```
8. **Final Abbreviation**: `MWMM`
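The attached-prefix handling in steps 4-6 can be sketched as follows; the prefix list and function names are illustrative assumptions drawn from the skip-word tables above, not project code:

```python
import unicodedata

# Attached articles/prepositions from the skip-word tables (illustrative subset)
PREFIXES = ("al-", "lil-", "ha-", "be-", "le-", "ve-", "me-")

def strip_prefix(word: str) -> str:
    """Drop one attached article/preposition prefix, if present."""
    lower = word.lower()
    for p in PREFIXES:
        if lower.startswith(p) and len(word) > len(p):
            return word[len(p):]
    return word

def initials(name: str) -> str:
    """Normalize diacritics, strip prefixes, then take first letters."""
    nfd = unicodedata.normalize('NFD', name)
    ascii_name = ''.join(c for c in nfd if unicodedata.category(c) != 'Mn')
    return ''.join(
        w[0].upper()
        for w in (strip_prefix(word) for word in ascii_name.split())
        if w and w[0].isalpha()
    )

initials("al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya")  # "MWMM"
initials("Arkhiyon ha-Sipur ha-Amami be-Yisrael")            # "ASAY"
```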
---
## Edge Cases and Special Handling
### Mixed Scripts
Some institution names mix scripts (e.g., Latin brand names in Chinese text):
**Example**: 中国IBM研究院
- Transliterate Chinese: Zhongguo IBM Yanjiuyuan
- Keep "IBM" as-is (already Latin)
- Abbreviation: ZIY
### Transliteration Ambiguity
When multiple valid transliterations exist, prefer:
1. ISO standard spelling
2. Institution's own romanization (if consistent)
3. Most commonly used academic romanization
### Very Long Names
If abbreviation exceeds 10 characters after applying rules:
1. Truncate to 10 characters
2. Ensure truncation doesn't create ambiguous abbreviation
3. Document truncation in `ghcid.notes`
---
## Python Implementation Reference
For the complete Python implementation of transliteration functions, see:
- `.opencode/TRANSLITERATION_STANDARDS.md` - Full code with all language handlers
- `scripts/transliterate_emic_names.py` - Production script for batch transliteration
### Quick Reference Function
```python
from transliteration import transliterate_for_abbreviation

# Example usage for all supported languages
examples = {
    'ru': 'Российская государственная библиотека',
    'zh': '中国国家图书馆',
    'ja': '国立国会図書館',
    'ko': '국립중앙박물관',
    'ar': 'المكتبة الوطنية للمملكة المغربية',
    'he': 'הספרייה הלאומית',
    'hi': 'राष्ट्रीय अभिलेखागार',
    'el': 'Εθνική Βιβλιοθήκη της Ελλάδας',
}

for lang, name in examples.items():
    latin = transliterate_for_abbreviation(name, lang)
    print(f'{lang}: {name}')
    print(f'  → {latin}')
```
---
## Validation Checklist
Before finalizing a transliterated abbreviation:
- [ ] Original emic name preserved in `custodian_name.emic_name`
- [ ] Language code stored in `custodian_name.name_language`
- [ ] Correct ISO standard applied for script
- [ ] Skip words removed (articles, prepositions)
- [ ] Diacritics normalized to ASCII
- [ ] Special characters removed
- [ ] Abbreviation ≤ 10 characters
- [ ] No conflicts with existing GHCIDs
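The mechanical items on this checklist (character set, length, uniqueness) can be validated automatically; a minimal sketch, with function name and messages as illustrative assumptions rather than codebase fixtures:

```python
import re

def validate_abbreviation(abbrev: str, existing: set) -> list:
    """Return a list of checklist violations (empty list = passes)."""
    problems = []
    if not re.fullmatch(r'[A-Z]+', abbrev):
        problems.append('only ASCII uppercase A-Z permitted')
    if len(abbrev) > 10:
        problems.append('must be 10 characters or fewer')
    if abbrev in existing:
        problems.append('conflicts with an existing GHCID abbreviation')
    return problems

validate_abbreviation('MWMM', set())        # []
validate_abbreviation('VHSPAOČRZS', set())  # ['only ASCII uppercase A-Z permitted']
```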
---
## See Also
- `.opencode/TRANSLITERATION_STANDARDS.md` - Technical rules and Python code
- `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` - Character filtering rules
- `AGENTS.md` - Rule 12: Non-Latin Script Transliteration
- `docs/PERSISTENT_IDENTIFIERS.md` - GHCID specification
---
## Changelog
| Date | Change |
|------|--------|
| 2025-12-08 | Initial document created with 21 language examples |