docs: add Z.AI GLM API and transliteration rules to AGENTS.md
- Add Rule 11 for Z.AI Coding Plan API usage (not BigModel)
- Add transliteration standards for non-Latin scripts
- Document GLM model options and Python implementation
Parent: 40bd3cb8f5
Commit: 271545fa8b
6 changed files with 2172 additions and 7 deletions
@@ -1,17 +1,102 @@
-# Abbreviation Special Character Filtering Rule
+# Abbreviation Character Filtering Rules

-**Rule ID**: ABBREV-SPECIAL-CHAR
+**Rule ID**: ABBREV-CHAR-FILTER
**Status**: MANDATORY
**Applies To**: GHCID abbreviation component generation
**Created**: 2025-12-07
+**Updated**: 2025-12-08 (added diacritics rule)

---

## Summary

-**When generating abbreviations for GHCID, special characters and symbols MUST be completely removed. Only alphabetic characters (A-Z) are permitted in the abbreviation component of the GHCID.**
+**When generating abbreviations for GHCID, ONLY ASCII uppercase letters (A-Z) are permitted. Both special characters AND diacritics MUST be removed/normalized.**

-This is a **MANDATORY** rule. Abbreviations containing special characters are INVALID and must be regenerated.
+This is a **MANDATORY** rule. Abbreviations containing special characters or diacritics are INVALID and must be regenerated.

### Two Mandatory Sub-Rules

1. **ABBREV-SPECIAL-CHAR**: Remove all special characters and symbols
2. **ABBREV-DIACRITICS**: Normalize all diacritics to ASCII equivalents

---

## Rule 1: Diacritics MUST Be Normalized to ASCII (ABBREV-DIACRITICS)

**Diacritics (accented characters) MUST be normalized to their ASCII base letter equivalents.**

### Example (Real Case)

```
❌ WRONG: CZ-VY-TEL-L-VHSPAOČRZS (contains Č)
✅ CORRECT: CZ-VY-TEL-L-VHSPAOCRZS (ASCII only)
```
### Diacritics Normalization Table

| Diacritic | ASCII | Example |
|-----------|-------|---------|
| Á, À, Â, Ã, Ä, Å, Ā | A | "Ålborg" → A |
| Č, Ć, Ç | C | "Český" → C |
| Ď | D | "Ďáblice" → D |
| É, È, Ê, Ë, Ě, Ē | E | "Éire" → E |
| Í, Ì, Î, Ï, Ī | I | "Ísland" → I |
| Ñ, Ń, Ň | N | "España" → N |
| Ó, Ò, Ô, Õ, Ö, Ø, Ō | O | "Österreich" → O |
| Ř | R | "Říčany" → R |
| Š, Ś, Ş | S | "Šumperk" → S |
| Ť | T | "Ťažký" → T |
| Ú, Ù, Û, Ü, Ů, Ū | U | "Ústí" → U |
| Ý, Ÿ | Y | "Ýmir" → Y |
| Ž, Ź, Ż | Z | "Žilina" → Z |
| Ł | L | "Łódź" → L |
| Æ | AE | "Ærø" → AE |
| Œ | OE | "Œuvre" → OE |
| ß | SS | "Straße" → SS |
### Implementation

```python
import unicodedata


def normalize_diacritics(text: str) -> str:
    """
    Normalize diacritics to ASCII equivalents.

    Examples:
        "Č" → "C"
        "Ř" → "R"
        "Ö" → "O"
        "ñ" → "n"
    """
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', text)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    return ascii_text


# Example
normalize_diacritics("VHSPAOČRZS")  # Returns "VHSPAOCRZS"
```
### Languages Commonly Affected

| Language | Common Diacritics | Example Institution |
|----------|-------------------|---------------------|
| **Czech** | Č, Ř, Š, Ž, Ě, Ů | Vlastivědné muzeum → VM |
| **Polish** | Ł, Ń, Ó, Ś, Ź, Ż, Ą, Ę | Biblioteka Łódzka → BL |
| **German** | Ä, Ö, Ü, ß | Österreichische Nationalbibliothek → ON |
| **French** | É, È, Ê, Ç, Ô | Bibliothèque nationale → BN |
| **Spanish** | Ñ, Á, É, Í, Ó, Ú | Museo Nacional → MN |
| **Portuguese** | Ã, Õ, Ç, Á, É | Biblioteca Nacional → BN |
| **Nordic** | Å, Ä, Ö, Ø, Æ | Nationalmuseet → N |
| **Turkish** | Ç, Ğ, İ, Ö, Ş, Ü | İstanbul Üniversitesi → IU |
| **Hungarian** | Á, É, Í, Ó, Ö, Ő, Ú, Ü, Ű | Országos Levéltár → OL |
| **Romanian** | Ă, Â, Î, Ș, Ț | Biblioteca Națională → BN |

---
## Rule 2: Special Characters MUST Be Removed (ABBREV-SPECIAL-CHAR)

---
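Taken together, the two sub-rules reduce to a single character filter. A minimal stdlib sketch (the helper name `filter_abbreviation_chars` is illustrative, not part of the codebase):

```python
import re
import unicodedata


def filter_abbreviation_chars(text: str) -> str:
    """Apply ABBREV-DIACRITICS then ABBREV-SPECIAL-CHAR: keep only A-Z."""
    # Rule 1: NFD decomposition splits off combining marks, which are dropped
    decomposed = unicodedata.normalize('NFD', text)
    no_marks = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    # Rule 2: uppercase, then drop everything outside A-Z
    # Note: letters with no Unicode decomposition (Ø, Æ, Œ, Ł) would be dropped
    # here; they need the explicit mapping table from Rule 1
    return re.sub(r'[^A-Z]', '', no_marks.upper())
```

For example, `filter_abbreviation_chars('VHSPAOČRZS')` yields the corrected abbreviation component from the real case above.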
787	.opencode/TRANSLITERATION_STANDARDS.md	Normal file
@@ -0,0 +1,787 @@
# Transliteration Standards for Non-Latin Scripts

**Rule ID**: TRANSLIT-ISO
**Status**: MANDATORY
**Applies To**: GHCID abbreviation generation from emic names in non-Latin scripts
**Created**: 2025-12-08

---
## Summary

**When generating GHCID abbreviations from institution names written in non-Latin scripts, the emic name MUST first be transliterated to Latin characters using the designated ISO or recognized standard for that script.**

This rule affects **170 institutions** across **21 languages** with non-Latin writing systems.

### Key Principles

1. **Emic name is preserved** - The original script is stored in `custodian_name.emic_name`
2. **Transliteration is for processing only** - Used to generate abbreviations
3. **ISO/recognized standards required** - No ad-hoc romanization
4. **Deterministic output** - The same input always produces the same Latin output
5. **Existing GHCIDs grandfathered** - Applies only to NEW custodians

---
## Transliteration Standards by Script/Language

### Cyrillic Scripts

| Language | ISO Code | Standard | Library/Tool | Notes |
|----------|----------|----------|--------------|-------|
| **Russian** | ru | ISO 9:1995 | `transliterate` | Scientific transliteration |
| **Ukrainian** | uk | ISO 9:1995 | `transliterate` | Includes Ukrainian-specific letters |
| **Bulgarian** | bg | ISO 9:1995 | `transliterate` | Uses same Cyrillic base |
| **Serbian** | sr | ISO 9:1995 | `transliterate` | Serbian Cyrillic variant |
| **Kazakh** | kk | ISO 9:1995 | `transliterate` | Cyrillic-based (pre-2023) |

**ISO 9:1995 Mapping (Core Characters)**:

| Cyrillic | Latin | Cyrillic | Latin |
|----------|-------|----------|-------|
| А а | A a | П п | P p |
| Б б | B b | Р р | R r |
| В в | V v | С с | S s |
| Г г | G g | Т т | T t |
| Д д | D d | У у | U u |
| Е е | E e | Ф ф | F f |
| Ё ё | Ë ë | Х х | H h |
| Ж ж | Ž ž | Ц ц | C c |
| З з | Z z | Ч ч | Č č |
| И и | I i | Ш ш | Š š |
| Й й | J j | Щ щ | Ŝ ŝ |
| К к | K k | Ъ ъ | ʺ (hard sign) |
| Л л | L l | Ы ы | Y y |
| М м | M m | Ь ь | ʹ (soft sign) |
| Н н | N n | Э э | È è |
| О о | O o | Ю ю | Û û |
| | | Я я | Â â |
**Example**:
```
Input: Институт восточных рукописей РАН
ISO 9: Institut vostočnyh rukopisej RAN
Abbrev: IVRRAN (unchanged after diacritic normalization)
```
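The ISO 9 rows above, followed by the diacritic normalization from the abbreviation rules, can be traced by hand with a small excerpt map (a sketch only; `iso9_excerpt` covers just the letters in this example, not the full alphabet):

```python
import unicodedata

# Excerpt of the ISO 9:1995 table above (only the letters used below)
ISO9_EXCERPT = {
    'И': 'I', 'н': 'n', 'с': 's', 'т': 't', 'и': 'i', 'у': 'u',
    'в': 'v', 'о': 'o', 'ч': 'č', 'ы': 'y', 'х': 'h',
}


def iso9_excerpt(text: str) -> str:
    """Apply the excerpt map, leaving unmapped characters unchanged."""
    return ''.join(ISO9_EXCERPT.get(c, c) for c in text)


def to_ascii(text: str) -> str:
    """Strip combining marks after NFD decomposition (č → c)."""
    return ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')
```

Here `iso9_excerpt('восточных')` gives `vostočnyh`, and `to_ascii` then reduces it to `vostocnyh`, matching the pipeline shown in the example.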
---

### CJK Scripts

#### Chinese (Hanzi)

| Variant | Standard | Library/Tool | Notes |
|---------|----------|--------------|-------|
| Simplified | Hanyu Pinyin (ISO 7098) | `pypinyin` | Standard PRC romanization |
| Traditional | Hanyu Pinyin | `pypinyin` | Same standard applies |

**Pinyin Rules**:
- Tone marks are OMITTED for abbreviation (diacritics removed anyway)
- Word boundaries follow natural spacing
- Proper nouns capitalized

**Example**:
```
Input: 东巴文化博物院
Pinyin: Dōngbā Wénhuà Bówùyuàn
ASCII: Dongba Wenhua Bowuyuan
Abbrev: DWB
```

#### Japanese (Kanji/Kana)

| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| Modified Hepburn | `pykakasi`, `romkan` | Most widely used internationally |

**Hepburn Rules**:
- Long vowels: ō, ū (normalized to o, u for abbreviation)
- Particles: は (wa), を (wo), へ (e)
- Syllabic n: ん = n (before vowels: n')

**Example**:
```
Input: 国立中央博物館
Romaji: Kokuritsu Chūō Hakubutsukan
ASCII: Kokuritsu Chuo Hakubutsukan
Abbrev: KCH
```

#### Korean (Hangul)

| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| Revised Romanization (RR) | `korean-romanizer`, `hangul-romanize` | Official South Korean standard (2000) |

**RR Rules**:
- No diacritics (unlike McCune-Reischauer)
- Consonant assimilation reflected in spelling
- Word boundaries at natural breaks

**Example**:
```
Input: 독립기념관
RR: Dongnip Ginyeomgwan
Abbrev: DG
```
---

### Arabic Script

| Language | ISO Code | Standard | Library/Tool | Notes |
|----------|----------|----------|--------------|-------|
| **Arabic** | ar | ISO 233-2:1993 | `arabic-transliteration` | Simplified standard |
| **Persian/Farsi** | fa | ISO 233-3:1999 | `persian-transliteration` | Persian extensions |
| **Urdu** | ur | ISO 233-3 + Urdu extensions | `urdu-transliteration` | Additional characters |

**ISO 233 Mapping (Core Arabic)**:

| Arabic | Name | Latin |
|--------|------|-------|
| ا | Alif | ā / a |
| ب | Ba | b |
| ت | Ta | t |
| ث | Tha | ṯ |
| ج | Jim | ǧ / j |
| ح | Ha | ḥ |
| خ | Kha | ḫ / kh |
| د | Dal | d |
| ذ | Dhal | ḏ |
| ر | Ra | r |
| ز | Zay | z |
| س | Sin | s |
| ش | Shin | š / sh |
| ص | Sad | ṣ |
| ض | Dad | ḍ |
| ط | Ta | ṭ |
| ظ | Za | ẓ |
| ع | Ayn | ʿ |
| غ | Ghayn | ġ / gh |
| ف | Fa | f |
| ق | Qaf | q |
| ك | Kaf | k |
| ل | Lam | l |
| م | Mim | m |
| ن | Nun | n |
| ه | Ha | h |
| و | Waw | w / ū |
| ي | Ya | y / ī |

**Example (Arabic)**:
```
Input: المكتبة الوطنية للمملكة المغربية
ISO: al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya
ASCII: al-Maktaba al-Wataniya lil-Mamlaka al-Maghribiya
Abbrev: MWMM (skip "al-" articles)
```

**Example (Persian)**:
```
Input: وزارت امور خارجه ایران
ISO: Vezārat-e Omur-e Khāreǧe-ye Īrān
ASCII: Vezarat-e Omur-e Khareje-ye Iran
Abbrev: VOKI (skip "-e" ezafe connector)
```
---

### Hebrew Script

| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| ISO 259-3:1999 | `hebrew-transliteration` | Simplified romanization |

**ISO 259 Mapping**:

| Hebrew | Name | Latin |
|--------|------|-------|
| א | Aleph | ʾ / (silent) |
| ב | Bet | b / v |
| ג | Gimel | g |
| ד | Dalet | d |
| ה | He | h |
| ו | Vav | v / o / u |
| ז | Zayin | z |
| ח | Chet | ḥ / ch |
| ט | Tet | ṭ / t |
| י | Yod | y / i |
| כ ך | Kaf | k / kh |
| ל | Lamed | l |
| מ ם | Mem | m |
| נ ן | Nun | n |
| ס | Samekh | s |
| ע | Ayin | ʿ / (silent) |
| פ ף | Pe | p / f |
| צ ץ | Tsade | ṣ / ts |
| ק | Qof | q / k |
| ר | Resh | r |
| ש | Shin/Sin | š / s |
| ת | Tav | t |

**Example**:
```
Input: ארכיון הסיפור העממי בישראל
ISO: Arḵiyon ha-Sipur ha-ʿAmami be-Yiśraʾel
ASCII: Arkhiyon ha-Sipur ha-Amami be-Yisrael
Abbrev: ASAY (skip "ha-" and "be-" articles)
```

---
### Greek Script

| Standard | Library/Tool | Notes |
|----------|--------------|-------|
| ISO 843:1997 | `greek-transliteration` | Romanization of Greek |

**ISO 843 Mapping**:

| Greek | Latin | Greek | Latin |
|-------|-------|-------|-------|
| Α α | A a | Ν ν | N n |
| Β β | V v | Ξ ξ | X x |
| Γ γ | G g | Ο ο | O o |
| Δ δ | D d | Π π | P p |
| Ε ε | E e | Ρ ρ | R r |
| Ζ ζ | Z z | Σ σ ς | S s |
| Η η | Ī ī | Τ τ | T t |
| Θ θ | Th th | Υ υ | Y y |
| Ι ι | I i | Φ φ | F f |
| Κ κ | K k | Χ χ | Ch ch |
| Λ λ | L l | Ψ ψ | Ps ps |
| Μ μ | M m | Ω ω | Ō ō |

**Example**:
```
Input: Αρχαιολογικό Μουσείο Θεσσαλονίκης
ISO: Archaiologikó Mouseío Thessaloníkīs
ASCII: Archaiologiko Mouseio Thessalonikis
Abbrev: AMT
```

---
### Indic Scripts

| Language | Script | Standard | Library/Tool |
|----------|--------|----------|--------------|
| **Hindi** | Devanagari | ISO 15919 | `indic-transliteration` |
| **Bengali** | Bengali | ISO 15919 | `indic-transliteration` |
| **Nepali** | Devanagari | ISO 15919 | `indic-transliteration` |
| **Sinhala** | Sinhala | ISO 15919 | `indic-transliteration` |

**ISO 15919 Core Consonants (Devanagari)**:

| Devanagari | Latin | Devanagari | Latin |
|------------|-------|------------|-------|
| क | ka | त | ta |
| ख | kha | थ | tha |
| ग | ga | द | da |
| घ | gha | ध | dha |
| ङ | ṅa | न | na |
| च | ca | प | pa |
| छ | cha | फ | pha |
| ज | ja | ब | ba |
| झ | jha | भ | bha |
| ञ | ña | म | ma |
| ट | ṭa | य | ya |
| ठ | ṭha | र | ra |
| ड | ḍa | ल | la |
| ढ | ḍha | व | va |
| ण | ṇa | श | śa |
| | | ष | ṣa |
| | | स | sa |
| | | ह | ha |

**Example (Hindi)**:
```
Input: राजस्थान प्राच्यविद्या प्रतिष्ठान
ISO: Rājasthāna Prācyavidyā Pratiṣṭhāna
ASCII: Rajasthana Pracyavidya Pratishthana
Abbrev: RPP
```

---
### Southeast Asian Scripts

| Language | Script | Standard | Library/Tool |
|----------|--------|----------|--------------|
| **Thai** | Thai | ISO 11940-2 | `thai-romanization` |
| **Khmer** | Khmer | ALA-LC | `khmer-romanization` |

**Thai Example**:
```
Input: สำนักหอจดหมายเหตุแห่งชาติ
ISO: Samnak Ho Chotmaihet Haeng Chat
Abbrev: SHCHC
```

**Khmer Example**:
```
Input: សារមន្ទីរទួលស្លែង
ALA-LC: Sāramanṭīr Tūl Slèṅ
ASCII: Saramantir Tuol Sleng
Abbrev: STS
```

---
### Other Scripts

| Language | Script | Standard | Library/Tool |
|----------|--------|----------|--------------|
| **Armenian** | Armenian | ISO 9985 | `armenian-transliteration` |
| **Georgian** | Georgian | ISO 9984 | `georgian-transliteration` |

**Armenian Example**:
```
Input: Մատենադարան
ISO: Matenadaran
Abbrev: M
```
**Georgian Example**:
```
Input: ხელნაწერთა ეროვნული ცენტრი
ISO: Xelnawerti Erovnuli C'ent'ri
ASCII: Khelnawerti Erovnuli Centri
Abbrev: KEC
```

---
## Implementation

### Python Transliteration Utility

```python
#!/usr/bin/env python3
"""
Transliteration utility for GHCID abbreviation generation.
Uses ISO and recognized standards for each script/language.
"""

import unicodedata
from typing import Optional

# Try importing transliteration libraries
try:
    from pypinyin import pinyin, Style
    HAS_PYPINYIN = True
except ImportError:
    HAS_PYPINYIN = False

try:
    import pykakasi
    HAS_PYKAKASI = True
except ImportError:
    HAS_PYKAKASI = False

try:
    from transliterate import translit
    HAS_TRANSLITERATE = True
except ImportError:
    HAS_TRANSLITERATE = False


def detect_script(text: str) -> str:
    """
    Detect the primary script of the input text.

    Returns one of:
    - 'latin': Latin alphabet
    - 'cyrillic': Cyrillic script
    - 'chinese': Chinese characters (Hanzi)
    - 'japanese': Japanese (mixed Kanji/Kana)
    - 'korean': Korean Hangul
    - 'arabic': Arabic script (includes Persian, Urdu)
    - 'hebrew': Hebrew script
    - 'greek': Greek script
    - 'devanagari': Devanagari (Hindi, Nepali, Sanskrit)
    - 'bengali': Bengali script
    - 'thai': Thai script
    - 'armenian': Armenian script
    - 'georgian': Georgian script
    - 'unknown': Cannot determine
    """
    script_ranges = {
        'cyrillic': (0x0400, 0x04FF),
        'arabic': (0x0600, 0x06FF),
        'hebrew': (0x0590, 0x05FF),
        'devanagari': (0x0900, 0x097F),
        'bengali': (0x0980, 0x09FF),
        'thai': (0x0E00, 0x0E7F),
        'greek': (0x0370, 0x03FF),
        'armenian': (0x0530, 0x058F),
        'georgian': (0x10A0, 0x10FF),
        'korean': (0xAC00, 0xD7AF),  # Hangul syllables
        'japanese_hiragana': (0x3040, 0x309F),
        'japanese_katakana': (0x30A0, 0x30FF),
        'chinese': (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    }

    script_counts = {script: 0 for script in script_ranges}
    latin_count = 0

    for char in text:
        code = ord(char)

        # Check Latin
        if ('a' <= char <= 'z') or ('A' <= char <= 'Z'):
            latin_count += 1
            continue

        # Check other scripts
        for script, (start, end) in script_ranges.items():
            if start <= code <= end:
                script_counts[script] += 1
                break

    # Determine primary script
    if latin_count > 0 and all(c == 0 for c in script_counts.values()):
        return 'latin'

    # Find the most frequent non-Latin script
    max_script = max(script_counts, key=script_counts.get)
    if script_counts[max_script] > 0:
        # Handle Japanese (can be Kanji + Kana)
        if max_script in ('japanese_hiragana', 'japanese_katakana', 'chinese'):
            if script_counts['japanese_hiragana'] > 0 or script_counts['japanese_katakana'] > 0:
                return 'japanese'
            return 'chinese'
        return max_script

    return 'latin' if latin_count > 0 else 'unknown'


def transliterate_cyrillic(text: str, lang: str = 'ru') -> str:
    """Transliterate Cyrillic text using ISO 9."""
    if HAS_TRANSLITERATE:
        try:
            return translit(text, lang, reversed=True)
        except Exception:
            pass

    # Fallback: simplified Cyrillic-to-Latin map (practical romanization,
    # not strict ISO 9 - e.g. Ж → Zh rather than Ž)
    cyrillic_map = {
        'А': 'A', 'Б': 'B', 'В': 'V', 'Г': 'G', 'Д': 'D', 'Е': 'E',
        'Ё': 'E', 'Ж': 'Zh', 'З': 'Z', 'И': 'I', 'Й': 'Y', 'К': 'K',
        'Л': 'L', 'М': 'M', 'Н': 'N', 'О': 'O', 'П': 'P', 'Р': 'R',
        'С': 'S', 'Т': 'T', 'У': 'U', 'Ф': 'F', 'Х': 'Kh', 'Ц': 'Ts',
        'Ч': 'Ch', 'Ш': 'Sh', 'Щ': 'Shch', 'Ъ': '', 'Ы': 'Y', 'Ь': '',
        'Э': 'E', 'Ю': 'Yu', 'Я': 'Ya',
        'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e',
        'ё': 'e', 'ж': 'zh', 'з': 'z', 'и': 'i', 'й': 'y', 'к': 'k',
        'л': 'l', 'м': 'm', 'н': 'n', 'о': 'o', 'п': 'p', 'р': 'r',
        'с': 's', 'т': 't', 'у': 'u', 'ф': 'f', 'х': 'kh', 'ц': 'ts',
        'ч': 'ch', 'ш': 'sh', 'щ': 'shch', 'ъ': '', 'ы': 'y', 'ь': '',
        'э': 'e', 'ю': 'yu', 'я': 'ya',
        # Ukrainian additions
        'І': 'I', 'і': 'i', 'Ї': 'Yi', 'ї': 'yi', 'Є': 'Ye', 'є': 'ye',
        'Ґ': 'G', 'ґ': 'g',
    }
    return ''.join(cyrillic_map.get(c, c) for c in text)


def transliterate_chinese(text: str) -> str:
    """Transliterate Chinese to Pinyin."""
    if HAS_PYPINYIN:
        # Get pinyin without tone marks
        result = pinyin(text, style=Style.NORMAL)
        return ' '.join([''.join(p) for p in result])

    # Fallback: return as-is (requires manual handling)
    return text


def transliterate_japanese(text: str) -> str:
    """Transliterate Japanese to Romaji (Hepburn)."""
    if HAS_PYKAKASI:
        kakasi = pykakasi.kakasi()
        result = kakasi.convert(text)
        return ' '.join([item['hepburn'] for item in result])

    # Fallback: return as-is
    return text


def transliterate_korean(text: str) -> str:
    """Transliterate Korean Hangul to Revised Romanization."""
    # Korean romanization is complex - use a library if available
    try:
        from korean_romanizer.romanizer import Romanizer
        r = Romanizer(text)
        return r.romanize()
    except ImportError:
        pass

    # Fallback: return as-is (a full implementation would decompose
    # Hangul syllables into jamo and map them to RR)
    return text


def transliterate_arabic(text: str) -> str:
    """Transliterate Arabic script to Latin (ISO 233 simplified)."""
    arabic_map = {
        'ا': 'a', 'أ': 'a', 'إ': 'i', 'آ': 'a',
        'ب': 'b', 'ت': 't', 'ث': 'th', 'ج': 'j',
        'ح': 'h', 'خ': 'kh', 'د': 'd', 'ذ': 'dh',
        'ر': 'r', 'ز': 'z', 'س': 's', 'ش': 'sh',
        'ص': 's', 'ض': 'd', 'ط': 't', 'ظ': 'z',
        'ع': "'", 'غ': 'gh', 'ف': 'f', 'ق': 'q',
        'ك': 'k', 'ل': 'l', 'م': 'm', 'ن': 'n',
        'ه': 'h', 'و': 'w', 'ي': 'y', 'ى': 'a',
        'ة': 'a', 'ء': "'",
        # Persian additions
        'پ': 'p', 'چ': 'ch', 'ژ': 'zh', 'گ': 'g',
        'ک': 'k', 'ی': 'i',
    }
    result = []
    for c in text:
        if c in arabic_map:
            result.append(arabic_map[c])
        elif c == ' ' or c.isalnum():
            result.append(c)
    return ''.join(result)


def transliterate_hebrew(text: str) -> str:
    """Transliterate Hebrew to Latin (ISO 259 simplified)."""
    hebrew_map = {
        'א': '', 'ב': 'v', 'ג': 'g', 'ד': 'd', 'ה': 'h',
        'ו': 'v', 'ז': 'z', 'ח': 'ch', 'ט': 't', 'י': 'y',
        'כ': 'k', 'ך': 'k', 'ל': 'l', 'מ': 'm', 'ם': 'm',
        'נ': 'n', 'ן': 'n', 'ס': 's', 'ע': '', 'פ': 'f',
        'ף': 'f', 'צ': 'ts', 'ץ': 'ts', 'ק': 'k', 'ר': 'r',
        'ש': 'sh', 'ת': 't',
    }
    result = []
    for c in text:
        if c in hebrew_map:
            result.append(hebrew_map[c])
        elif c == ' ' or c.isalnum():
            result.append(c)
    return ''.join(result)


def transliterate_greek(text: str) -> str:
    """Transliterate Greek to Latin (ISO 843)."""
    greek_map = {
        'Α': 'A', 'α': 'a', 'Β': 'V', 'β': 'v', 'Γ': 'G', 'γ': 'g',
        'Δ': 'D', 'δ': 'd', 'Ε': 'E', 'ε': 'e', 'Ζ': 'Z', 'ζ': 'z',
        'Η': 'I', 'η': 'i', 'Θ': 'Th', 'θ': 'th', 'Ι': 'I', 'ι': 'i',
        'Κ': 'K', 'κ': 'k', 'Λ': 'L', 'λ': 'l', 'Μ': 'M', 'μ': 'm',
        'Ν': 'N', 'ν': 'n', 'Ξ': 'X', 'ξ': 'x', 'Ο': 'O', 'ο': 'o',
        'Π': 'P', 'π': 'p', 'Ρ': 'R', 'ρ': 'r', 'Σ': 'S', 'σ': 's',
        'ς': 's', 'Τ': 'T', 'τ': 't', 'Υ': 'Y', 'υ': 'y', 'Φ': 'F',
        'φ': 'f', 'Χ': 'Ch', 'χ': 'ch', 'Ψ': 'Ps', 'ψ': 'ps',
        'Ω': 'O', 'ω': 'o',
    }
    return ''.join(greek_map.get(c, c) for c in text)


def transliterate_devanagari(text: str) -> str:
    """Transliterate Devanagari to Latin (ISO 15919 simplified)."""
    try:
        from indic_transliteration import sanscript
        from indic_transliteration.sanscript import transliterate as indic_translit
        return indic_translit(text, sanscript.DEVANAGARI, sanscript.IAST)
    except ImportError:
        pass

    # Fallback: return as-is (a full Devanagari character map would be needed)
    return text


def transliterate_thai(text: str) -> str:
    """Transliterate Thai to Latin (Royal Thai General System)."""
    try:
        from thaispellcheck import transliterate as thai_translit
        return thai_translit(text)
    except ImportError:
        pass

    # Fallback: return as-is
    return text


def transliterate(text: str, lang: Optional[str] = None) -> str:
    """
    Transliterate text from a non-Latin script to Latin.

    Args:
        text: Input text in any script
        lang: Optional ISO 639-1 language code (e.g., 'ru', 'zh', 'ko').
            If not provided, the script is auto-detected.

    Returns:
        Transliterated text in Latin characters.
    """
    if not text:
        return text

    # Detect script if language not provided
    if lang:
        script_map = {
            'ru': 'cyrillic', 'uk': 'cyrillic', 'bg': 'cyrillic',
            'sr': 'cyrillic', 'kk': 'cyrillic',
            'zh': 'chinese',
            'ja': 'japanese',
            'ko': 'korean',
            'ar': 'arabic', 'fa': 'arabic', 'ur': 'arabic',
            'he': 'hebrew',
            'el': 'greek',
            'hi': 'devanagari', 'ne': 'devanagari',
            'bn': 'bengali',
            'th': 'thai',
            'hy': 'armenian',
            'ka': 'georgian',
        }
        script = script_map.get(lang, detect_script(text))
    else:
        script = detect_script(text)

    # Apply the appropriate transliteration
    transliterators = {
        'cyrillic': lambda t: transliterate_cyrillic(t, lang or 'ru'),
        'chinese': transliterate_chinese,
        'japanese': transliterate_japanese,
        'korean': transliterate_korean,
        'arabic': transliterate_arabic,
        'hebrew': transliterate_hebrew,
        'greek': transliterate_greek,
        'devanagari': transliterate_devanagari,
        'thai': transliterate_thai,
        'latin': lambda t: t,  # No transliteration needed
    }

    translit_func = transliterators.get(script, lambda t: t)
    result = translit_func(text)

    # Normalize diacritics to ASCII
    normalized = unicodedata.normalize('NFD', result)
    ascii_result = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

    return ascii_result


def transliterate_for_abbreviation(emic_name: str, lang: str) -> str:
    """
    Transliterate an emic name for GHCID abbreviation generation.

    This is the main entry point for GHCID generation scripts.

    Args:
        emic_name: Institution name in the original script
        lang: ISO 639-1 language code

    Returns:
        Transliterated name ready for abbreviation extraction
    """
    # Step 1: Transliterate to Latin
    latin = transliterate(emic_name, lang)

    # Step 2: Normalize diacritics (handled in transliterate())

    # Step 3: Remove special characters (except spaces)
    import re
    clean = re.sub(r'[^a-zA-Z\s]', ' ', latin)

    # Step 4: Normalize whitespace
    clean = ' '.join(clean.split())

    return clean


# Example usage
if __name__ == '__main__':
    test_cases = [
        ('Институт восточных рукописей РАН', 'ru'),
        ('东巴文化博物院', 'zh'),
        ('독립기념관', 'ko'),
        ('राजस्थान प्राच्यविद्या प्रतिष्ठान', 'hi'),
        ('المكتبة الوطنية للمملكة المغربية', 'ar'),
        ('ארכיון הסיפור העממי בישראל', 'he'),
        ('Αρχαιολογικό Μουσείο Θεσσαλονίκης', 'el'),
    ]

    for name, lang in test_cases:
        result = transliterate_for_abbreviation(name, lang)
        print(f'{lang}: {name}')
        print(f' → {result}')
        print()
```
---

## Skip Words by Language

When extracting abbreviations from transliterated text, skip these articles/prepositions:

### Arabic
- `al-` (the definite article)
- `bi-`, `li-`, `fi-` (prepositions)

### Hebrew
- `ha-` (the)
- `ve-` (and)
- `be-`, `le-`, `me-` (prepositions)

### Persian
- `-e`, `-ye` (ezafe connector)
- `va` (and)

### CJK Languages
- No skip words (particles are integral to meaning)

### Indic Languages
- `ka`, `ki`, `ke` (Hindi: of)
- `aur` (Hindi: and)
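The lists above can be sketched as a single helper that strips article prefixes and connector tokens before taking initials (hypothetical names and a deliberately small word list; the production extraction logic may differ):

```python
ARTICLE_PREFIXES = {
    'ar': ('lil-', 'al-', 'bi-', 'li-', 'fi-'),  # longest first
    'he': ('ha-', 've-', 'be-', 'le-', 'me-'),
}
EZAFE_SUFFIXES = {'fa': ('-ye', '-e')}  # Persian ezafe attaches as a suffix
SKIP_TOKENS = {'fa': {'va'}, 'hi': {'ka', 'ki', 'ke', 'aur'}}


def abbreviate(latin_name: str, lang: str) -> str:
    """Take the initial of each word, skipping articles and connectors."""
    initials = []
    for token in latin_name.split():
        for prefix in ARTICLE_PREFIXES.get(lang, ()):
            if token.lower().startswith(prefix):
                token = token[len(prefix):]
                break
        for suffix in EZAFE_SUFFIXES.get(lang, ()):
            if token.lower().endswith(suffix):
                token = token[:-len(suffix)]
                break
        if not token or token.lower() in SKIP_TOKENS.get(lang, set()):
            continue
        initials.append(token[0].upper())
    return ''.join(initials)
```

This reproduces the worked examples earlier in the document: MWMM for the Moroccan national library, VOKI for the Persian ministry, ASAY for the Israeli folktale archive.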
---

## Validation

### Check Transliteration Output

```python
import re


def validate_transliteration(result: str) -> bool:
    """Validate that transliteration output contains only ASCII letters and spaces."""
    return bool(re.match(r'^[a-zA-Z\s]+$', result))
```

### Manual Review Queue

Non-Latin institutions should be flagged for manual review if:
1. No transliteration library is available for that script
2. Confidence in the transliteration is low
3. The institution has multiple official romanizations

---
## Related Documentation

- `AGENTS.md` - Rule 12: Transliteration Standards
- `ABBREVIATION_SPECIAL_CHAR_RULE.md` - Character filtering after transliteration
- `docs/TRANSLITERATION_CONVENTIONS.md` - Extended examples and edge cases
- `scripts/transliterate_emic_names.py` - Production transliteration script

---

## Changelog

| Date | Change |
|------|--------|
| 2025-12-08 | Initial standards document created |
277	.opencode/ZAI_GLM_API_RULES.md	Normal file
@@ -0,0 +1,277 @@
# Z.AI GLM API Rules for AI Agents

**Last Updated**: 2025-12-08
**Status**: MANDATORY for all LLM API calls in scripts

---

## CRITICAL: Use Z.AI Coding Plan, NOT BigModel API

**This project uses the Z.AI Coding Plan endpoint, which is the SAME endpoint that OpenCode uses internally.**

The regular BigModel API (`open.bigmodel.cn`) will NOT work with the tokens stored in this project. You MUST use the Z.AI Coding Plan endpoint.

---
## API Configuration

### Correct Endpoint

| Property | Value |
|----------|-------|
| **API URL** | `https://api.z.ai/api/coding/paas/v4/chat/completions` |
| **Auth Header** | `Authorization: Bearer {ZAI_API_TOKEN}` |
| **Content-Type** | `application/json` |

### Available Models

| Model | Description | Cost |
|-------|-------------|------|
| `glm-4.5` | Standard GLM-4.5 | Free (0 per token) |
| `glm-4.5-air` | GLM-4.5 Air variant | Free |
| `glm-4.5-flash` | Fast GLM-4.5 | Free |
| `glm-4.5v` | Vision-capable GLM-4.5 | Free |
| `glm-4.6` | Latest GLM-4.6 (recommended) | Free |

**Recommended Model**: `glm-4.6` for best quality
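As a quick sanity check of the endpoint and headers, the request can also be assembled with the standard library alone (a sketch; `build_glm_request` is illustrative and only builds the request object, it sends nothing):

```python
import json
import os
import urllib.request

API_URL = 'https://api.z.ai/api/coding/paas/v4/chat/completions'


def build_glm_request(messages: list, model: str = 'glm-4.6') -> urllib.request.Request:
    """Build (but do not send) a chat-completion request for the Coding Plan endpoint."""
    payload = {'model': model, 'messages': messages, 'temperature': 0.3}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode('utf-8'),
        headers={
            'Authorization': f"Bearer {os.environ.get('ZAI_API_TOKEN', '')}",
            'Content-Type': 'application/json',
        },
        method='POST',
    )
```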
---

## Authentication

### Token Location

The Z.AI API token can be obtained from two locations:

1. **Environment Variable** (preferred for scripts):
   ```bash
   # In .env file at project root
   ZAI_API_TOKEN=your_token_here
   ```

2. **OpenCode Auth File** (reference only):
   ```
   ~/.local/share/opencode/auth.json
   ```
   The token is stored under the key `zai-coding-plan`.

### Getting the Token

If you need to set up the token:

1. The token is shared with OpenCode's Z.AI Coding Plan
2. Check `~/.local/share/opencode/auth.json` for an existing token
3. Add it to the `.env` file as `ZAI_API_TOKEN`
|
||||
|
||||
---
|
||||
|
||||
## Python Implementation
|
||||
|
||||
### Correct Implementation
|
||||
|
||||
```python
|
||||
import os
|
||||
import httpx
|
||||
|
||||
class GLMClient:
|
||||
"""Client for Z.AI GLM API (Coding Plan endpoint)."""
|
||||
|
||||
# CORRECT endpoint - Z.AI Coding Plan
|
||||
API_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions"
|
||||
|
||||
def __init__(self, model: str = "glm-4.6"):
|
||||
self.api_key = os.environ.get("ZAI_API_TOKEN")
|
||||
if not self.api_key:
|
||||
raise ValueError("ZAI_API_TOKEN not found in environment")
|
||||
|
||||
self.model = model
|
||||
self.client = httpx.AsyncClient(
|
||||
timeout=60.0,
|
||||
headers={
|
||||
"Authorization": f"Bearer {self.api_key}",
|
||||
"Content-Type": "application/json",
|
||||
}
|
||||
)
|
||||
|
||||
async def chat(self, messages: list) -> dict:
|
||||
"""Send chat completion request."""
|
||||
response = await self.client.post(
|
||||
self.API_URL,
|
||||
json={
|
||||
"model": self.model,
|
||||
"messages": messages,
|
||||
"temperature": 0.3,
|
||||
}
|
||||
)
|
||||
response.raise_for_status()
|
||||
return response.json()
|
||||
```
|
||||
|
||||
### WRONG Implementation (DO NOT USE)
|
||||
|
||||
```python
|
||||
# WRONG - This endpoint will fail with quota errors
|
||||
API_URL = "https://open.bigmodel.cn/api/paas/v4/chat/completions"
|
||||
|
||||
# WRONG - This is for regular BigModel API, not Z.AI Coding Plan
|
||||
api_key = os.environ.get("ZHIPU_API_KEY")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Request Format
|
||||
|
||||
### Chat Completion Request
|
||||
|
||||
```json
|
||||
{
|
||||
"model": "glm-4.6",
|
||||
"messages": [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "You are a helpful assistant."
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Your prompt here"
|
||||
}
|
||||
],
|
||||
"temperature": 0.3,
|
||||
"max_tokens": 4096
|
||||
}
|
||||
```
|
||||
|
||||
### Response Format
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "request-id",
|
||||
"created": 1733651234,
|
||||
"model": "glm-4.6",
|
||||
"choices": [
|
||||
{
|
||||
"index": 0,
|
||||
"message": {
|
||||
"role": "assistant",
|
||||
"content": "Response text here"
|
||||
},
|
||||
"finish_reason": "stop"
|
||||
}
|
||||
],
|
||||
"usage": {
|
||||
"prompt_tokens": 100,
|
||||
"completion_tokens": 50,
|
||||
"total_tokens": 150
|
||||
}
|
||||
}
|
||||
```
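For scripts consuming this response, the fields of interest are usually the assistant message and the token usage. A minimal sketch against the response shape shown above (the helper name `extract_reply` is illustrative, not a project convention):

```python
def extract_reply(response: dict) -> tuple:
    """Return (assistant content, total tokens) from a chat-completion response."""
    content = response["choices"][0]["message"]["content"]
    total_tokens = response["usage"]["total_tokens"]
    return content, total_tokens

# Sample response in the documented shape
sample = {
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Response text here"},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 100, "completion_tokens": 50, "total_tokens": 150},
}
extract_reply(sample)  # ('Response text here', 150)
```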
|
||||
|
||||
---
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Common Errors
|
||||
|
||||
| Error | Cause | Solution |
|
||||
|-------|-------|----------|
|
||||
| `401 Unauthorized` | Invalid or missing token | Check ZAI_API_TOKEN in .env |
|
||||
| `403 Quota exceeded` | Wrong endpoint (BigModel) | Use Z.AI Coding Plan endpoint |
|
||||
| `429 Rate limited` | Too many requests | Add delay between requests |
|
||||
| `500 Server error` | API issue | Retry with exponential backoff |
|
||||
|
||||
### Retry Strategy
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from tenacity import retry, stop_after_attempt, wait_exponential
|
||||
|
||||
@retry(
|
||||
stop=stop_after_attempt(3),
|
||||
wait=wait_exponential(multiplier=1, min=2, max=10)
|
||||
)
|
||||
async def call_api_with_retry(client, messages):
|
||||
return await client.chat(messages)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Integration with CH-Annotator
|
||||
|
||||
When using GLM for entity recognition or verification, always reference CH-Annotator v1.7.0:
|
||||
|
||||
```python
|
||||
PROMPT = """You are a heritage institution classifier following CH-Annotator v1.7.0 convention.
|
||||
|
||||
## CH-Annotator GRP.HER Definition
|
||||
Heritage institutions are organizations that:
|
||||
- Collect, preserve, and provide access to cultural heritage materials
|
||||
- Include: museums (GRP.HER.MUS), libraries (GRP.HER.LIB), archives (GRP.HER.ARC), galleries (GRP.HER.GAL)
|
||||
|
||||
## Entity to Analyze
|
||||
...
|
||||
"""
|
||||
```
|
||||
|
||||
See `.opencode/CH_ANNOTATOR_CONVENTION.md` for full convention details.
|
||||
|
||||
---
|
||||
|
||||
## Scripts Using GLM API
|
||||
|
||||
The following scripts use the Z.AI GLM API:
|
||||
|
||||
| Script | Purpose |
|
||||
|--------|---------|
|
||||
| `scripts/reenrich_wikidata_with_verification.py` | Wikidata entity verification using GLM-4.6 |
|
||||
|
||||
When creating new scripts that need LLM capabilities, follow this pattern.
|
||||
|
||||
---
|
||||
|
||||
## Environment Setup Checklist
|
||||
|
||||
When setting up a new environment:
|
||||
|
||||
- [ ] Check `~/.local/share/opencode/auth.json` for existing Z.AI token
|
||||
- [ ] Add `ZAI_API_TOKEN` to `.env` file
|
||||
- [ ] Verify endpoint is `https://api.z.ai/api/coding/paas/v4/chat/completions`
|
||||
- [ ] Test with `glm-4.6` model
|
||||
- [ ] Reference CH-Annotator v1.7.0 for entity recognition tasks
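Parts of this checklist can be sanity-checked in code. A minimal sketch, assuming the endpoint string documented above; the `check_env` helper and the `GLM_API_URL` override variable are hypothetical, not project conventions:

```python
EXPECTED_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions"

def check_env(env: dict) -> list:
    """Return a list of setup problems for the GLM environment."""
    problems = []
    if not env.get("ZAI_API_TOKEN"):
        problems.append("ZAI_API_TOKEN missing (check .env and auth.json)")
    # Hypothetical override variable; defaults to the documented endpoint
    if env.get("GLM_API_URL", EXPECTED_URL) != EXPECTED_URL:
        problems.append("Wrong endpoint: use the Z.AI Coding Plan URL, not BigModel")
    return problems

check_env({"ZAI_API_TOKEN": "example"})  # []
```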
|
||||
|
||||
---
|
||||
|
||||
## AI Agent Rules
|
||||
|
||||
### DO
|
||||
|
||||
- Use `https://api.z.ai/api/coding/paas/v4/chat/completions` endpoint
|
||||
- Get token from `ZAI_API_TOKEN` environment variable
|
||||
- Use `glm-4.6` as the default model
|
||||
- Reference CH-Annotator v1.7.0 for entity tasks
|
||||
- Add retry logic with exponential backoff
|
||||
- Handle JSON parsing errors gracefully
|
||||
|
||||
### DO NOT
|
||||
|
||||
- Use `open.bigmodel.cn` endpoint (wrong API)
|
||||
- Use `ZHIPU_API_KEY` environment variable (wrong key)
|
||||
- Hard-code API tokens in scripts
|
||||
- Skip error handling for API calls
|
||||
- Forget to load `.env` file before accessing environment
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- **CH-Annotator Convention**: `.opencode/CH_ANNOTATOR_CONVENTION.md`
|
||||
- **Entity Annotation**: `data/entity_annotation/ch_annotator-v1_7_0.yaml`
|
||||
- **Wikidata Enrichment Script**: `scripts/reenrich_wikidata_with_verification.py`
|
||||
|
||||
---
|
||||
|
||||
## Version History
|
||||
|
||||
| Date | Change |
|
||||
|------|--------|
|
||||
| 2025-12-08 | Initial documentation - Fixed API endpoint discovery |
|
||||
|
||||
224
AGENTS.md
|
|
@ -720,6 +720,66 @@ claim:
|
|||
|
||||
---
|
||||
|
||||
### Rule 11: Z.AI GLM API for LLM Tasks (NOT BigModel)
|
||||
|
||||
**CRITICAL: When using GLM models in scripts, use the Z.AI Coding Plan endpoint, NOT the regular BigModel API.**
|
||||
|
||||
The project uses the same Z.AI Coding Plan that OpenCode uses internally. The regular BigModel API (`open.bigmodel.cn`) will NOT work with our tokens.
|
||||
|
||||
**Correct API Configuration**:
|
||||
|
||||
| Property | Value |
|
||||
|----------|-------|
|
||||
| **API URL** | `https://api.z.ai/api/coding/paas/v4/chat/completions` |
|
||||
| **Environment Variable** | `ZAI_API_TOKEN` |
|
||||
| **Recommended Model** | `glm-4.6` |
|
||||
| **Cost** | Free ($0 per token for all GLM models) |
|
||||
|
||||
**Available Models**: `glm-4.5`, `glm-4.5-air`, `glm-4.5-flash`, `glm-4.5v`, `glm-4.6`
|
||||
|
||||
**Python Implementation**:
|
||||
|
||||
```python
|
||||
import os
|
||||
import httpx
|
||||
|
||||
# CORRECT - Z.AI Coding Plan endpoint
|
||||
API_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions"
|
||||
api_key = os.environ.get("ZAI_API_TOKEN")
|
||||
|
||||
client = httpx.AsyncClient(
|
||||
timeout=60.0,
|
||||
headers={
|
||||
"Authorization": f"Bearer {api_key}",
|
||||
"Content-Type": "application/json",
|
||||
}
|
||||
)
|
||||
|
||||
# WRONG - This will fail with quota errors!
|
||||
# API_URL = "https://open.bigmodel.cn/api/paas/v4/chat/completions"
|
||||
# api_key = os.environ.get("ZHIPU_API_KEY")
|
||||
```
|
||||
|
||||
**Integration with CH-Annotator**: When using GLM for entity recognition, always reference CH-Annotator v1.7.0 in prompts:
|
||||
|
||||
```python
|
||||
PROMPT = """You are following CH-Annotator v1.7.0 convention.
|
||||
Heritage institutions are type GRP.HER with subtypes:
|
||||
- GRP.HER.MUS (museums)
|
||||
- GRP.HER.LIB (libraries)
|
||||
- GRP.HER.ARC (archives)
|
||||
- GRP.HER.GAL (galleries)
|
||||
..."""
|
||||
```
|
||||
|
||||
**Token Location**:
|
||||
1. **Environment**: Add `ZAI_API_TOKEN` to `.env` file
|
||||
2. **OpenCode Auth**: Token stored in `~/.local/share/opencode/auth.json` under key `zai-coding-plan`
|
||||
|
||||
**See**: `.opencode/ZAI_GLM_API_RULES.md` for complete documentation
|
||||
|
||||
---
|
||||
|
||||
## Project Overview
|
||||
|
||||
**Goal**: Extract structured data about worldwide GLAMORCUBESFIXPHDNT (Galleries, Libraries, Archives, Museums, Official institutions, Research centers, Corporations, Unknown, Botanical gardens/zoos, Educational providers, Societies, Features, Intangible heritage groups, miXed, Personal collections, Holy sites, Digital platforms, NGOs, Taste/smell heritage) institutions from 139+ Claude conversation JSON files and integrate with authoritative CSV datasets.
|
||||
|
|
@ -2571,6 +2631,16 @@ location_resolution:
|
|||
|
||||
**The institution abbreviation component uses the FIRST LETTER of each significant word in the official emic (native language) name.**
|
||||
|
||||
**⚠️ GRANDFATHERING POLICY (PID STABILITY)**
|
||||
|
||||
Existing GHCIDs created before December 2025 are **grandfathered** - their abbreviations will NOT be updated even if derived from English translations rather than emic names. This preserves PID stability per the "Cool URIs Don't Change" principle.
|
||||
|
||||
**Applies to:**
|
||||
- 817 UNESCO Memory of the World custodian files enriched with `custodian_name.emic_name`
|
||||
- Abbreviations like `NLP` (National Library of Peru) remain unchanged even though emic name is "Biblioteca Nacional del Perú" (would be `BNP`)
|
||||
|
||||
**For NEW custodians only:** Apply emic name abbreviation protocol described below.
|
||||
|
||||
**Abbreviation Rules**:
|
||||
1. Use the **CustodianName** (official emic name), NOT an English translation
|
||||
2. Take the **first letter** of each word
|
||||
|
|
@ -2681,6 +2751,154 @@ ghcid_current: SX-XX-PHI-O-DRIMSM # ✅ Alphabetic only
|
|||
|
||||
**See**: `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` for complete documentation
|
||||
|
||||
### 🚨 CRITICAL: Diacritics MUST Be Normalized to ASCII in Abbreviations 🚨
|
||||
|
||||
**When generating abbreviations for GHCID, diacritics (accented characters) MUST be normalized to their ASCII base letter equivalents. Only ASCII uppercase letters (A-Z) are permitted.**
|
||||
|
||||
This rule applies to ALL languages with diacritical marks including Czech, Polish, German, French, Spanish, Portuguese, Nordic languages, Hungarian, Romanian, Turkish, and others.
|
||||
|
||||
**RATIONALE**:
|
||||
1. **URI/URL safety** - Non-ASCII characters require percent-encoding
|
||||
2. **Cross-system compatibility** - ASCII is universally supported
|
||||
3. **Filename safety** - Some systems have issues with non-ASCII filenames
|
||||
4. **Human readability** - Easier to type and communicate
|
||||
|
||||
**DIACRITICS NORMALIZATION TABLE**:
|
||||
|
||||
| Language | Diacritics | ASCII Equivalent |
|
||||
|----------|------------|------------------|
|
||||
| **Czech** | Č, Ř, Š, Ž, Ě, Ů | C, R, S, Z, E, U |
|
||||
| **Polish** | Ł, Ń, Ó, Ś, Ź, Ż, Ą, Ę | L, N, O, S, Z, Z, A, E |
|
||||
| **German** | Ä, Ö, Ü, ß | A, O, U, SS |
|
||||
| **French** | É, È, Ê, Ç, Ô, Â | E, E, E, C, O, A |
|
||||
| **Spanish** | Ñ, Á, É, Í, Ó, Ú | N, A, E, I, O, U |
|
||||
| **Portuguese** | Ã, Õ, Ç, Á, É | A, O, C, A, E |
|
||||
| **Nordic** | Å, Ä, Ö, Ø, Æ | A, A, O, O, AE |
|
||||
| **Hungarian** | Á, É, Í, Ó, Ö, Ő, Ú, Ü, Ű | A, E, I, O, O, O, U, U, U |
|
||||
| **Turkish** | Ç, Ğ, İ, Ö, Ş, Ü | C, G, I, O, S, U |
|
||||
| **Romanian** | Ă, Â, Î, Ș, Ț | A, A, I, S, T |
|
||||
|
||||
**REAL-WORLD EXAMPLE** (Czech institution):
|
||||
|
||||
```yaml
|
||||
# INCORRECT - Contains diacritics:
|
||||
ghcid_current: CZ-VY-TEL-L-VHSPAOČRZS # ❌ Contains "Č"
|
||||
|
||||
# CORRECT - ASCII only:
|
||||
ghcid_current: CZ-VY-TEL-L-VHSPAOCRZS # ✅ "Č" → "C"
|
||||
```
|
||||
|
||||
**IMPLEMENTATION**:
|
||||
|
||||
```python
import unicodedata

# Letters with no combining-mark decomposition; NFD alone cannot reduce these,
# so they are mapped explicitly per the tables above (Þ → T per the examples)
SPECIAL_MAP = str.maketrans({
    'ß': 'SS', 'Ø': 'O', 'ø': 'o', 'Æ': 'AE', 'æ': 'ae',
    'Ł': 'L', 'ł': 'l', 'Þ': 'T', 'þ': 't', 'Đ': 'D', 'đ': 'd',
})

def normalize_diacritics(text: str) -> str:
    """Normalize diacritics to ASCII equivalents."""
    # Handle standalone special letters first
    text = text.translate(SPECIAL_MAP)
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', text)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    return ascii_text

# Example
normalize_diacritics("VHSPAOČRZS")  # Returns "VHSPAOCRZS"
```
|
||||
|
||||
**EXAMPLES**:
|
||||
|
||||
| Emic Name (with diacritics) | Abbreviation | Wrong |
|
||||
|-----------------------------|--------------|-------|
|
||||
| Vlastivědné muzeum v Šumperku | VMS | VMŠ ❌ |
|
||||
| Österreichische Nationalbibliothek | ON | ÖN ❌ |
|
||||
| Bibliothèque nationale de France | BNF | BNF (OK: no first letter carries a diacritic) |
|
||||
| Múzeum Łódzkie | ML | MŁ ❌ |
|
||||
| Þjóðminjasafn Íslands | TI | ÞI ❌ |
|
||||
|
||||
**See**: `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` for complete documentation (covers both special characters and diacritics)
|
||||
|
||||
### 🚨 CRITICAL: Non-Latin Scripts MUST Be Transliterated Before Abbreviation 🚨
|
||||
|
||||
**When generating GHCID abbreviations from institution names in non-Latin scripts (Cyrillic, Chinese, Japanese, Korean, Arabic, Hebrew, Greek, Devanagari, Thai, etc.), the emic name MUST first be transliterated to Latin characters using ISO or recognized standards.**
|
||||
|
||||
This rule affects **170 institutions** across **21 languages** with non-Latin writing systems.
|
||||
|
||||
**CORE PRINCIPLE**: The emic name is PRESERVED in original script in `custodian_name.emic_name`. Transliteration is only used for abbreviation generation.
|
||||
|
||||
**TRANSLITERATION STANDARDS BY SCRIPT**:
|
||||
|
||||
| Script | Languages | Standard | Example |
|
||||
|--------|-----------|----------|---------|
|
||||
| **Cyrillic** | ru, uk, bg, sr, kk | ISO 9:1995 | Институт → Institut |
|
||||
| **Chinese** | zh | Hanyu Pinyin (ISO 7098) | 东巴文化博物院 → Dongba Wenhua Bowuyuan |
|
||||
| **Japanese** | ja | Modified Hepburn | 国立博物館 → Kokuritsu Hakubutsukan |
|
||||
| **Korean** | ko | Revised Romanization | 독립기념관 → Dongnip Ginyeomgwan |
|
||||
| **Arabic** | ar, fa, ur | ISO 233-2/3 | المكتبة الوطنية → al-Maktaba al-Wataniya |
|
||||
| **Hebrew** | he | ISO 259-3 | ארכיון → Arkhiyon |
|
||||
| **Greek** | el | ISO 843 | Μουσείο → Mouseio |
|
||||
| **Devanagari** | hi, ne | ISO 15919 | राजस्थान → Rajasthana |
|
||||
| **Bengali** | bn | ISO 15919 | বাংলাদেশ → Bangladesh |
|
||||
| **Thai** | th | ISO 11940-2 | สำนักหอ → Samnak Ho |
|
||||
| **Armenian** | hy | ISO 9985 | Մատենադարան → Matenadaran |
|
||||
| **Georgian** | ka | ISO 9984 | ხელნაწერთა → Khelnawerti |
|
||||
|
||||
**WORKFLOW**:
|
||||
|
||||
```
|
||||
1. Emic Name (original script)
|
||||
↓
|
||||
2. Transliterate to Latin (ISO standard)
|
||||
↓
|
||||
3. Normalize diacritics (remove accents)
|
||||
↓
|
||||
4. Skip articles/prepositions
|
||||
↓
|
||||
5. Extract first letters → Abbreviation
|
||||
```
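Steps 3 to 5 of this workflow can be sketched as follows; the `abbreviate` helper is illustrative, and step 2 (transliteration) is assumed to have already produced Latin text:

```python
import unicodedata

def abbreviate(latin_name: str, skip_words: set = frozenset()) -> str:
    """Normalize diacritics, skip articles/prepositions, take first letters."""
    # Step 3: strip combining marks via NFD decomposition
    ascii_name = ''.join(
        c for c in unicodedata.normalize('NFD', latin_name)
        if unicodedata.category(c) != 'Mn'
    )
    # Step 4: drop skip words; Step 5: first letters, uppercased
    words = [w for w in ascii_name.split() if w.lower() not in skip_words]
    return ''.join(w[0].upper() for w in words)

abbreviate("Biblioteca Nacional del Perú", skip_words={"del"})  # 'BNP'
```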
|
||||
|
||||
**EXAMPLES**:
|
||||
|
||||
| Language | Emic Name | Transliterated | Abbreviation |
|
||||
|----------|-----------|----------------|--------------|
|
||||
| **Russian** | Институт восточных рукописей РАН | Institut Vostochnykh Rukopisey RAN | IVRR |
|
||||
| **Chinese** | 东巴文化博物院 | Dongba Wenhua Bowuyuan | DWB |
|
||||
| **Korean** | 독립기념관 | Dongnip Ginyeomgwan | DG |
|
||||
| **Hindi** | राजस्थान प्राच्यविद्या प्रतिष्ठान | Rajasthana Pracyavidya Pratishthana | RPP |
|
||||
| **Arabic** | المكتبة الوطنية للمملكة المغربية | al-Maktaba al-Wataniya lil-Mamlaka al-Maghribiya | MWMM |
|
||||
| **Hebrew** | ארכיון הסיפור העממי בישראל | Arkhiyon ha-Sipur ha-Amami be-Yisrael | ASAY |
|
||||
| **Greek** | Αρχαιολογικό Μουσείο Θεσσαλονίκης | Archaiologiko Mouseio Thessalonikis | AMT |
|
||||
|
||||
**SCRIPT-SPECIFIC SKIP WORDS**:
|
||||
|
||||
| Language | Skip Words (Articles/Prepositions) |
|
||||
|----------|-------------------------------------|
|
||||
| **Arabic** | al- (the), bi-, li-, fi- (prepositions) |
|
||||
| **Hebrew** | ha- (the), ve- (and), be-, le-, me- |
|
||||
| **Persian** | -e, -ye (ezafe connector), va (and) |
|
||||
| **CJK** | None (particles integral to meaning) |
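Because the Arabic and Hebrew skip words are hyphen-attached prefixes rather than separate words, they are stripped from the word before first-letter extraction. A minimal sketch using the prefixes from the table above (illustrative only, not the production transliteration script):

```python
# Prefixes taken from the skip-words table; matched case-insensitively
PREFIXES = {
    "ar": ("al-", "bi-", "li-", "fi-"),
    "he": ("ha-", "ve-", "be-", "le-", "me-"),
}

def strip_prefixes(word: str, lang: str) -> str:
    """Remove a leading attached article/preposition for the given language."""
    for prefix in PREFIXES.get(lang, ()):
        if word.lower().startswith(prefix):
            return word[len(prefix):]
    return word

words = "al-Maktaba al-Wataniya".split()
''.join(strip_prefixes(w, "ar")[0] for w in words)  # 'MW'
```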
|
||||
|
||||
**IMPLEMENTATION**:
|
||||
|
||||
```python
|
||||
from transliteration import transliterate_for_abbreviation
|
||||
|
||||
# Input: emic name in non-Latin script + language code
|
||||
emic_name = "Институт восточных рукописей РАН"
|
||||
lang = "ru"
|
||||
|
||||
# Step 1: Transliterate to Latin using ISO standard
|
||||
latin = transliterate_for_abbreviation(emic_name, lang)
|
||||
# Result: "Institut Vostochnykh Rukopisey RAN"
|
||||
|
||||
# Step 2: Apply standard abbreviation extraction
abbreviation = extract_abbreviation_from_name(latin)
# Result: "IVRR" (first letters of Institut, Vostochnykh, Rukopisey, RAN)
|
||||
```
|
||||
|
||||
**GRANDFATHERING POLICY**: Existing abbreviations from 817 UNESCO MoW custodians are grandfathered. This transliteration standard applies only to **NEW custodians** created after December 2025.
|
||||
|
||||
**See**: `.opencode/TRANSLITERATION_STANDARDS.md` for complete ISO standards, mapping tables, and Python implementation
|
||||
|
||||
---
|
||||
|
||||
GHCID uses a **four-identifier strategy** for maximum flexibility and transparency:
|
||||
|
|
@ -3115,7 +3333,7 @@ def test_historical_addition():
|
|||
|
||||
---
|
||||
|
||||
**Version**: 0.2.0
|
||||
**Schema Version**: v0.2.0 (modular)
|
||||
**Last Updated**: 2025-11-05
|
||||
**Version**: 0.2.1
|
||||
**Schema Version**: v0.2.1 (modular)
|
||||
**Last Updated**: 2025-12-08
|
||||
**Maintained By**: GLAM Data Extraction Project
|
||||
|
|
|
|||
357
docs/GLM_API_SETUP.md
Normal file
|
|
@ -0,0 +1,357 @@
|
|||
# GLM API Setup Guide
|
||||
|
||||
This guide explains how to configure and use the GLM-4 language model for entity recognition, verification, and enrichment tasks in the GLAM project.
|
||||
|
||||
## Overview
|
||||
|
||||
The GLAM project uses **GLM-4.6** via the **Z.AI Coding Plan** endpoint for LLM-powered tasks such as:
|
||||
|
||||
- **Entity Verification**: Verify that Wikidata entities are heritage institutions
|
||||
- **Description Enrichment**: Generate rich descriptions from multiple data sources
|
||||
- **Entity Resolution**: Match institution names across different data sources
|
||||
- **Claim Validation**: Verify extracted claims against source documents
|
||||
|
||||
**Cost**: All GLM models are FREE (0 cost per token) on the Z.AI Coding Plan.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Python 3.10+
|
||||
- `httpx` library for async HTTP requests
|
||||
- Access to Z.AI Coding Plan (same as OpenCode)
|
||||
|
||||
## Quick Start
|
||||
|
||||
### 1. Set Up Environment Variable
|
||||
|
||||
Add your Z.AI API token to the `.env` file in the project root:
|
||||
|
||||
```bash
|
||||
# .env file
|
||||
ZAI_API_TOKEN=your_token_here
|
||||
```
|
||||
|
||||
### 2. Find Your Token
|
||||
|
||||
The token is shared with OpenCode. Check:
|
||||
|
||||
```bash
|
||||
# View OpenCode auth file
|
||||
cat ~/.local/share/opencode/auth.json | jq '.["zai-coding-plan"]'
|
||||
```
|
||||
|
||||
Copy this token to your `.env` file.
|
||||
|
||||
### 3. Basic Python Usage
|
||||
|
||||
```python
|
||||
import os
|
||||
import httpx
|
||||
import asyncio
|
||||
from dotenv import load_dotenv
|
||||
|
||||
# Load environment variables
|
||||
load_dotenv()
|
||||
|
||||
async def call_glm():
|
||||
api_url = "https://api.z.ai/api/coding/paas/v4/chat/completions"
|
||||
api_key = os.environ.get("ZAI_API_TOKEN")
|
||||
|
||||
async with httpx.AsyncClient(timeout=60.0) as client:
|
||||
response = await client.post(
|
||||
api_url,
|
||||
headers={
|
||||
"Authorization": f"Bearer {api_key}",
|
||||
"Content-Type": "application/json",
|
||||
},
|
||||
json={
|
||||
"model": "glm-4.6",
|
||||
"messages": [
|
||||
{"role": "user", "content": "Hello, GLM!"}
|
||||
],
|
||||
"temperature": 0.3,
|
||||
}
|
||||
)
|
||||
result = response.json()
|
||||
print(result["choices"][0]["message"]["content"])
|
||||
|
||||
asyncio.run(call_glm())
|
||||
```
|
||||
|
||||
## API Configuration
|
||||
|
||||
### Endpoint Details
|
||||
|
||||
| Property | Value |
|
||||
|----------|-------|
|
||||
| **Base URL** | `https://api.z.ai/api/coding/paas/v4` |
|
||||
| **Chat Endpoint** | `/chat/completions` |
|
||||
| **Auth Method** | Bearer Token |
|
||||
| **Header** | `Authorization: Bearer {token}` |
|
||||
|
||||
### Available Models
|
||||
|
||||
| Model | Speed | Quality | Use Case |
|
||||
|-------|-------|---------|----------|
|
||||
| `glm-4.6` | Medium | Highest | Complex reasoning, verification |
|
||||
| `glm-4.5` | Medium | High | General tasks |
|
||||
| `glm-4.5-air` | Fast | Good | High-volume processing |
|
||||
| `glm-4.5-flash` | Fastest | Good | Quick responses |
|
||||
| `glm-4.5v` | Medium | High | Vision/image tasks |
|
||||
|
||||
**Recommendation**: Use `glm-4.6` for entity verification and complex tasks.
|
||||
|
||||
## Integration with CH-Annotator
|
||||
|
||||
When using GLM for entity recognition tasks, always reference the CH-Annotator convention:
|
||||
|
||||
### Heritage Institution Verification
|
||||
|
||||
````python
VERIFICATION_PROMPT = """You are a heritage institution classifier following CH-Annotator v1.7.0 convention.

## CH-Annotator GRP.HER Definition
Heritage institutions are organizations that:
- Collect, preserve, and provide access to cultural heritage materials
- Include: museums (GRP.HER.MUS), libraries (GRP.HER.LIB), archives (GRP.HER.ARC), galleries (GRP.HER.GAL)

## Entity Types That Are NOT Heritage Institutions
- Cities, towns, municipalities (places, not institutions)
- General businesses or companies
- People/individuals
- Events, festivals, exhibitions (temporary)

## Your Task
Analyze the entity and respond in JSON:
```json
{
  "is_heritage_institution": true/false,
  "subtype": "MUS|LIB|ARC|GAL|OTHER|null",
  "confidence": 0.95,
  "reasoning": "Brief explanation"
}
```
"""
````
|
||||
|
||||
### Entity Type Mapping
|
||||
|
||||
| CH-Annotator Type | GLAM Institution Type |
|
||||
|-------------------|----------------------|
|
||||
| GRP.HER.MUS | MUSEUM |
|
||||
| GRP.HER.LIB | LIBRARY |
|
||||
| GRP.HER.ARC | ARCHIVE |
|
||||
| GRP.HER.GAL | GALLERY |
|
||||
| GRP.HER.RES | RESEARCH_CENTER |
|
||||
| GRP.HER.BOT | BOTANICAL_ZOO |
|
||||
| GRP.HER.EDU | EDUCATION_PROVIDER |
|
||||
|
||||
## Complete Implementation Example
|
||||
|
||||
### Wikidata Verification Script
|
||||
|
||||
See `scripts/reenrich_wikidata_with_verification.py` for a complete example:
|
||||
|
||||
```python
|
||||
import os
|
||||
import httpx
|
||||
import json
import re
|
||||
from typing import Any, Dict, List, Optional
|
||||
|
||||
class GLMHeritageVerifier:
|
||||
"""Verify Wikidata entities using GLM-4.6 and CH-Annotator."""
|
||||
|
||||
API_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions"
|
||||
|
||||
def __init__(self, model: str = "glm-4.6"):
|
||||
self.api_key = os.environ.get("ZAI_API_TOKEN")
|
||||
if not self.api_key:
|
||||
raise ValueError("ZAI_API_TOKEN not found in environment")
|
||||
|
||||
self.model = model
|
||||
self.client = httpx.AsyncClient(
|
||||
timeout=60.0,
|
||||
headers={
|
||||
"Authorization": f"Bearer {self.api_key}",
|
||||
"Content-Type": "application/json",
|
||||
}
|
||||
)
|
||||
|
||||
async def verify_heritage_institution(
|
||||
self,
|
||||
institution_name: str,
|
||||
wikidata_label: str,
|
||||
wikidata_description: str,
|
||||
instance_of_types: List[str],
|
||||
) -> Dict[str, Any]:
|
||||
"""Check if a Wikidata entity is a heritage institution."""
|
||||
|
||||
prompt = f"""Analyze if this entity is a heritage institution (GRP.HER):
|
||||
|
||||
Institution Name: {institution_name}
|
||||
Wikidata Label: {wikidata_label}
|
||||
Description: {wikidata_description}
|
||||
Instance Of: {', '.join(instance_of_types)}
|
||||
|
||||
Respond with JSON only."""
|
||||
|
||||
response = await self.client.post(
|
||||
self.API_URL,
|
||||
json={
|
||||
"model": self.model,
|
||||
"messages": [
|
||||
{"role": "system", "content": VERIFICATION_PROMPT},  # module-level prompt shown earlier
|
||||
{"role": "user", "content": prompt}
|
||||
],
|
||||
"temperature": 0.1,
|
||||
}
|
||||
)
|
||||
|
||||
result = response.json()
|
||||
content = result["choices"][0]["message"]["content"]
|
||||
|
||||
# Parse JSON from response
|
||||
json_match = re.search(r'\{.*\}', content, re.DOTALL)
|
||||
if json_match:
|
||||
return json.loads(json_match.group())
|
||||
|
||||
return {"is_heritage_institution": False, "error": "No JSON found"}
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Common Errors
|
||||
|
||||
| Error Code | Meaning | Solution |
|
||||
|------------|---------|----------|
|
||||
| 401 | Unauthorized | Check ZAI_API_TOKEN |
|
||||
| 403 | Forbidden/Quota | Using wrong endpoint (use Z.AI, not BigModel) |
|
||||
| 429 | Rate Limited | Add delays between requests |
|
||||
| 500 | Server Error | Retry with backoff |
|
||||
|
||||
### Retry Pattern
|
||||
|
||||
```python
|
||||
from tenacity import retry, stop_after_attempt, wait_exponential
|
||||
|
||||
@retry(
|
||||
stop=stop_after_attempt(3),
|
||||
wait=wait_exponential(multiplier=1, min=2, max=10)
|
||||
)
|
||||
async def call_with_retry(client, messages):
|
||||
response = await client.post(API_URL, json={"model": "glm-4.6", "messages": messages})
|
||||
response.raise_for_status()
|
||||
return response.json()
|
||||
```
|
||||
|
||||
### JSON Parsing
|
||||
|
||||
LLM responses may contain text around JSON. Always parse safely:
|
||||
|
||||
```python
|
||||
import re
|
||||
import json
|
||||
|
||||
def parse_json_from_response(content: str) -> dict:
|
||||
"""Extract JSON from LLM response text."""
|
||||
# Try to find JSON block
|
||||
json_match = re.search(r'```json\s*(\{.*?\})\s*```', content, re.DOTALL)
|
||||
if json_match:
|
||||
return json.loads(json_match.group(1))
|
||||
|
||||
# Try bare JSON
|
||||
json_match = re.search(r'\{.*\}', content, re.DOTALL)
|
||||
if json_match:
|
||||
return json.loads(json_match.group())
|
||||
|
||||
return {"error": "No JSON found in response"}
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Use Low Temperature for Verification
|
||||
|
||||
```python
|
||||
{
|
||||
"temperature": 0.1 # Low for consistent, deterministic responses
|
||||
}
|
||||
```
|
||||
|
||||
### 2. Request JSON Output
|
||||
|
||||
Always request JSON format in your prompts for structured responses:
|
||||
|
||||
````
Respond in JSON format only:
```json
{"key": "value"}
```
````
|
||||
|
||||
### 3. Batch Processing
|
||||
|
||||
Process multiple entities with rate limiting:
|
||||
|
||||
```python
|
||||
import asyncio
from typing import List

async def batch_verify(entities: List[dict], rate_limit: float = 0.5):
    """Verify entities with rate limiting."""
    results = []
    for entity in entities:
        # verifier: a GLMHeritageVerifier-style instance created elsewhere
        result = await verifier.verify(entity)
        results.append(result)
        await asyncio.sleep(rate_limit)  # Respect rate limits
    return results
|
||||
```
|
||||
|
||||
### 4. Always Reference CH-Annotator
|
||||
|
||||
For entity recognition tasks, include CH-Annotator context:
|
||||
|
||||
```python
|
||||
system_prompt = """You are following CH-Annotator v1.7.0 convention.
|
||||
Heritage institutions are type GRP.HER with subtypes for museums, libraries, archives, and galleries.
|
||||
"""
|
||||
```
|
||||
|
||||
## Related Scripts
|
||||
|
||||
| Script | Purpose |
|
||||
|--------|---------|
|
||||
| `scripts/reenrich_wikidata_with_verification.py` | Wikidata entity verification |
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- **Agent Rules**: `AGENTS.md` (Rule 11: Z.AI GLM API)
|
||||
- **Agent Config**: `.opencode/ZAI_GLM_API_RULES.md`
|
||||
- **CH-Annotator**: `.opencode/CH_ANNOTATOR_CONVENTION.md`
|
||||
- **Entity Annotation**: `data/entity_annotation/ch_annotator-v1_7_0.yaml`
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### "Quota exceeded" Error
|
||||
|
||||
**Symptom**: 403 error with "quota exceeded" message
|
||||
|
||||
**Cause**: Using wrong API endpoint (`open.bigmodel.cn` instead of `api.z.ai`)
|
||||
|
||||
**Solution**: Update API URL to `https://api.z.ai/api/coding/paas/v4/chat/completions`
|
||||
|
||||
### "Token not found" Error
|
||||
|
||||
**Symptom**: ValueError about missing ZAI_API_TOKEN
|
||||
|
||||
**Solution**:
|
||||
1. Check `~/.local/share/opencode/auth.json` for token
|
||||
2. Add to `.env` file as `ZAI_API_TOKEN=your_token`
|
||||
3. Ensure `load_dotenv()` is called before accessing environment
|
||||
|
||||
### JSON Parsing Failures
|
||||
|
||||
**Symptom**: LLM returns text that can't be parsed as JSON
|
||||
|
||||
**Solution**: Use the `parse_json_from_response()` helper function with fallback handling
|
||||
|
||||
---
|
||||
|
||||
**Last Updated**: 2025-12-08
|
||||
441
docs/TRANSLITERATION_CONVENTIONS.md
Normal file
|
|
@ -0,0 +1,441 @@
|
|||
# Transliteration Conventions for Heritage Custodian Names
|
||||
|
||||
**Document Type**: User Guide
|
||||
**Version**: 1.0
|
||||
**Last Updated**: 2025-12-08
|
||||
**Related Rules**: `.opencode/TRANSLITERATION_STANDARDS.md`, `AGENTS.md` (Rule 12)
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
This document provides comprehensive examples and guidance for transliterating heritage institution names from non-Latin scripts to Latin characters. Transliteration is **required** for generating GHCID abbreviations but the **original emic name is always preserved**.
|
||||
|
||||
### Key Principles
|
||||
|
||||
1. **Emic name preserved** - Original script stored in `custodian_name.emic_name`
|
||||
2. **ISO standards used** - Recognized international transliteration standards
|
||||
3. **Deterministic output** - Same input always produces same Latin output
|
||||
4. **Abbreviation purpose only** - Transliteration is for GHCID generation, not display
|
||||
|
||||
---
|
||||
|
||||
## Language-by-Language Examples
|
||||
|
||||
### Russian (Cyrillic - ISO 9:1995)

**Dataset Statistics**: 13 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Институт восточных рукописей РАН | Institut vostočnyh rukopisej RAN | IVRR |
| Российская государственная библиотека | Rossijskaja gosudarstvennaja biblioteka | RGB |
| Государственный архив Российской Федерации | Gosudarstvennyj arhiv Rossijskoj Federacii | GARF |

**Skip Words (Russian)**: None significant (Russian doesn't use articles)

**Character Mapping**:
```
А → A   Б → B   В → V   Г → G   Д → D   Е → E
Ё → Ë   Ж → Ž   З → Z   И → I   Й → J   К → K
Л → L   М → M   Н → N   О → O   П → P   Р → R
С → S   Т → T   У → U   Ф → F   Х → H   Ц → C
Ч → Č   Ш → Š   Щ → Ŝ   Ъ → ʺ   Ы → Y   Ь → ʹ
Э → È   Ю → Û   Я → Â
```

---
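
As a minimal illustration, the mapping above can be expressed as a lookup table. The helper name `transliterate_ru` is hypothetical; the production implementation lives in `scripts/transliterate_emic_names.py`:

```python
# ISO 9:1995 mapping from the table above (uppercase; lowercase derived below).
# Illustrative sketch only, not the production transliterator.
ISO9_RU = {
    'А': 'A', 'Б': 'B', 'В': 'V', 'Г': 'G', 'Д': 'D', 'Е': 'E',
    'Ё': 'Ë', 'Ж': 'Ž', 'З': 'Z', 'И': 'I', 'Й': 'J', 'К': 'K',
    'Л': 'L', 'М': 'M', 'Н': 'N', 'О': 'O', 'П': 'P', 'Р': 'R',
    'С': 'S', 'Т': 'T', 'У': 'U', 'Ф': 'F', 'Х': 'H', 'Ц': 'C',
    'Ч': 'Č', 'Ш': 'Š', 'Щ': 'Ŝ', 'Ъ': 'ʺ', 'Ы': 'Y', 'Ь': 'ʹ',
    'Э': 'È', 'Ю': 'Û', 'Я': 'Â',
}
# Extend with lowercase forms so mixed-case input works.
ISO9_RU.update({k.lower(): v.lower() for k, v in ISO9_RU.items()})

def transliterate_ru(text: str) -> str:
    """Map each Cyrillic letter via the table; pass everything else through."""
    return ''.join(ISO9_RU.get(ch, ch) for ch in text)
```

The non-ASCII outputs (Č, Š, Â, …) are intentional at this stage; the ABBREV-DIACRITICS normalization folds them to ASCII later, when the abbreviation is assembled.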

### Ukrainian (Cyrillic - ISO 9:1995)

**Dataset Statistics**: 8 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Центральний державний архів громадських об'єднань України | Centralnyj deržavnyj arhiv hromadskyx objednan Ukrainy | CDAGOU |
| Національна бібліотека України | Nacionalna biblioteka Ukrainy | NBU |

**Ukrainian-specific characters**:
```
І → I   Ї → Ji   Є → Je   Ґ → G'
```

---

### Chinese (Hanyu Pinyin - ISO 7098)

**Dataset Statistics**: 27 institutions

| Emic Name | Pinyin | Abbreviation |
|-----------|--------|--------------|
| 东巴文化博物院 | Dōngbā Wénhuà Bówùyuàn | DWB |
| 中国第一历史档案馆 | Zhōngguó Dìyī Lìshǐ Dàng'ànguǎn | ZDLD |
| 北京故宫博物院 | Běijīng Gùgōng Bówùyuàn | BGB |
| 中国国家图书馆 | Zhōngguó Guójiā Túshūguǎn | ZGT |

**Notes**:
- Tone marks are removed for abbreviation (diacritics normalization)
- Word boundaries follow natural semantic breaks
- Multi-syllable words are kept together

**Skip Words**: None (Chinese doesn't use separate articles/prepositions)

---
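
Tone-mark removal is ordinary diacritics normalization, so it can be sketched with the standard library alone (`strip_tone_marks` is an illustrative name, not the production API):

```python
import unicodedata

def strip_tone_marks(pinyin: str) -> str:
    """Drop Pinyin tone marks by decomposing to NFD and discarding
    combining marks, as the ABBREV-DIACRITICS rule requires."""
    decomposed = unicodedata.normalize('NFD', pinyin)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For example, `strip_tone_marks('Zhōngguó Guójiā Túshūguǎn')` yields `Zhongguo Guojia Tushuguan`, from which the first letters give ZGT.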

### Japanese (Modified Hepburn)

**Dataset Statistics**: 19 institutions

| Emic Name | Romaji | Abbreviation |
|-----------|--------|--------------|
| 国立中央博物館 | Kokuritsu Chūō Hakubutsukan | KCH |
| 東京国立博物館 | Tōkyō Kokuritsu Hakubutsukan | TKH |
| 国立国会図書館 | Kokuritsu Kokkai Toshokan | KKT |

**Notes**:
- Long vowels (ō, ū) normalized to (o, u)
- Particles typically attached to preceding word
- Kanji compounds transliterated as single words

---

### Korean (Revised Romanization)

**Dataset Statistics**: 36 institutions

| Emic Name | RR Romanization | Abbreviation |
|-----------|-----------------|--------------|
| 독립기념관 | Dongnip Ginyeomgwan | DG |
| 국립중앙박물관 | Gungnip Jungang Bakmulgwan | GJB |
| 서울대학교 규장각한국학연구원 | Seoul Daehakgyo Gyujanggak Hangukhak Yeonguwon | SDGHY |
| 국립한글박물관 | Gungnip Hangeul Bakmulgwan | GHB |

**Notes**:
- No diacritics in Revised Romanization (unlike McCune-Reischauer)
- Consonant assimilation reflected in spelling
- Spaces at natural word boundaries

---

### Arabic (ISO 233-2)

**Dataset Statistics**: 8 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| المكتبة الوطنية للمملكة المغربية | al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya | MWMM |
| دار الكتب المصرية | Dār al-Kutub al-Miṣrīya | DKM |
| المتحف الوطني العراقي | al-Matḥaf al-Waṭanī al-ʿIrāqī | MWI |

**Skip Words**:
- `al-` (definite article "the")
- After skip word removal: Maktaba, Wataniya, Mamlaka, Maghribiya → MWMM

**Notes**:
- Right-to-left script
- Definite article "al-" always skipped
- Diacritics normalized (ā→a, ī→i, etc.)

---
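
The skip-word rule above can be sketched as a prefix check before first-letter extraction. `strip_arabic_articles` and `abbreviate_ar` are hypothetical helper names, and `lil-` is included because it appears in the worked example later in this document:

```python
# Prefixes skipped per the rule above (illustrative sketch).
ARABIC_SKIP = ('al-', 'lil-')

def strip_arabic_articles(word: str) -> str:
    """Drop the definite article 'al-' and the contracted 'lil-'."""
    lowered = word.lower()
    for prefix in ARABIC_SKIP:
        if lowered.startswith(prefix):
            return word[len(prefix):]
    return word

def abbreviate_ar(name: str) -> str:
    """First letter of each word that survives article stripping."""
    stripped = [strip_arabic_articles(w) for w in name.split()]
    return ''.join(w[0].upper() for w in stripped if w)
```

The ABBREV-DIACRITICS folding (ā→a, ṭ→t, …) would still run on the result; it is omitted here because none of the example first letters carry diacritics.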

### Persian/Farsi (ISO 233-3)

**Dataset Statistics**: 11 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| وزارت امور خارجه ایران | Vezārat-e Omūr-e Khārejeh-ye Īrān | VOKI |
| کتابخانه آستان قدس رضوی | Ketābkhāneh-ye Āstān-e Qods-e Raẓavī | KAQR |
| مجلس شورای اسلامی | Majles-e Showrā-ye Eslāmī | MSE |

**Skip Words**:
- `-e`, `-ye` (ezafe connector, "of")
- `va` ("and")

**Persian-specific characters**:
```
پ → p   چ → č   ژ → ž   گ → g
```

---
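
Unlike the Arabic definite article, the ezafe attaches as a *suffix*, so a sketch has to trim word endings and drop the standalone connector `va`. The names `persian_words` and `abbreviate_fa` are illustrative, not the production API:

```python
import unicodedata

# Check the longer suffix first so 'Khārejeh-ye' loses '-ye', not just '-e'.
PERSIAN_EZAFE = ('-ye', '-e')

def persian_words(name: str):
    """Yield words with ezafe suffixes removed and 'va' skipped."""
    for word in name.split():
        for suffix in PERSIAN_EZAFE:
            if word.lower().endswith(suffix):
                word = word[: -len(suffix)]
                break
        if word.lower() != 'va':
            yield word

def abbreviate_fa(name: str) -> str:
    """First letters, then ASCII-folded per ABBREV-DIACRITICS."""
    letters = ''.join(w[0] for w in persian_words(name))
    folded = unicodedata.normalize('NFD', letters)
    return ''.join(ch for ch in folded if not unicodedata.combining(ch)).upper()
```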

### Hebrew (ISO 259-3)

**Dataset Statistics**: 4 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| ארכיון הסיפור העממי בישראל | Arḵiyon ha-Sipur ha-ʿAmami be-Yiśraʾel | ASAY |
| הספרייה הלאומית | ha-Sifriya ha-Leʾumit | SL |
| ארכיון המדינה | Arḵiyon ha-Medina | AM |

**Skip Words**:
- `ha-` (definite article "the")
- `be-` ("in")
- `le-` ("to")
- `ve-` ("and")

**Notes**:
- Right-to-left script
- Articles attached with hyphen
- Silent letters (aleph, ayin) often omitted in abbreviation

---
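
Because these particles are attached with a hyphen, a sketch can strip them as prefixes before taking first letters. Names are illustrative, and the silent-letter handling mentioned above (e.g. the omitted ʿayin in ASAY) is deliberately not shown:

```python
import unicodedata

HEBREW_PREFIXES = ('ha-', 'be-', 'le-', 've-')

def strip_hebrew_prefix(word: str) -> str:
    """Drop one attached article/preposition from the word, if present."""
    lowered = word.lower()
    for prefix in HEBREW_PREFIXES:
        if lowered.startswith(prefix):
            return word[len(prefix):]
    return word

def abbreviate_he(name: str) -> str:
    """First letters after prefix stripping, ASCII-folded and uppercased."""
    letters = ''.join(strip_hebrew_prefix(w)[0] for w in name.split())
    folded = unicodedata.normalize('NFD', letters)
    return ''.join(ch for ch in folded if not unicodedata.combining(ch)).upper()
```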

### Hindi (Devanagari - ISO 15919)

**Dataset Statistics**: 14 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| राजस्थान प्राच्यविद्या प्रतिष्ठान | Rājasthāna Prācyavidyā Pratiṣṭhāna | RPP |
| राष्ट्रीय अभिलेखागार | Rāṣṭrīya Abhilekhāgāra | RA |
| राष्ट्रीय संग्रहालय नई दिल्ली | Rāṣṭrīya Saṅgrahālaya Naī Dillī | RSND |

**Skip Words**:
- `ka`, `ki`, `ke` ("of")
- `aur` ("and")
- `mein` ("in")

**Notes**:
- Conjunct consonants transliterated as cluster
- Long vowels marked (ā, ī, ū) then normalized

---

### Greek (ISO 843)

**Dataset Statistics**: 2 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Αρχαιολογικό Μουσείο Θεσσαλονίκης | Archaiologikó Mouseío Thessaloníkīs | AMT |
| Εθνική Βιβλιοθήκη της Ελλάδας | Ethnikī́ Vivliothī́kī tīs Elládas | EVE |

**Skip Words**:
- `tīs`, `tou` ("of the")
- `kai` ("and")

**Character Mapping**:
```
Α → A   Β → V   Γ → G    Δ → D   Ε → E    Ζ → Z
Η → Ī   Θ → Th  Ι → I    Κ → K   Λ → L    Μ → M
Ν → N   Ξ → X   Ο → O    Π → P   Ρ → R    Σ → S
Τ → T   Υ → Y   Φ → F    Χ → Ch  Ψ → Ps   Ω → Ō
```

---
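
A naive letter-by-letter sketch of the table above, mainly to show that some Greek letters map to *two* Latin letters (Th, Ch, Ps). This is illustrative only: real ISO 843 treats digraphs such as ου → ou and accented vowels contextually, which this sketch does not:

```python
# Letter-by-letter ISO 843 sketch (no digraph or accent handling).
ISO843_EL = {
    'Α': 'A', 'Β': 'V', 'Γ': 'G', 'Δ': 'D', 'Ε': 'E', 'Ζ': 'Z',
    'Η': 'Ī', 'Θ': 'Th', 'Ι': 'I', 'Κ': 'K', 'Λ': 'L', 'Μ': 'M',
    'Ν': 'N', 'Ξ': 'X', 'Ο': 'O', 'Π': 'P', 'Ρ': 'R', 'Σ': 'S',
    'Τ': 'T', 'Υ': 'Y', 'Φ': 'F', 'Χ': 'Ch', 'Ψ': 'Ps', 'Ω': 'Ō',
}
ISO843_EL.update({k.lower(): v.lower() for k, v in ISO843_EL.items()})
ISO843_EL['ς'] = 's'  # final sigma

def transliterate_el(text: str) -> str:
    """Map each Greek letter; pass unmapped characters through."""
    return ''.join(ISO843_EL.get(ch, ch) for ch in text)
```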

### Thai (ISO 11940-2)

**Dataset Statistics**: 6 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| สำนักหอจดหมายเหตุแห่งชาติ | Samnak Ho Chotmaihet Haeng Chat | SHCHC |
| หอสมุดแห่งชาติ | Ho Samut Haeng Chat | HSHC |

**Notes**:
- Thai script is an abugida (consonant-vowel syllables)
- No spaces in Thai; word boundaries determined by meaning
- Royal Thai General System also acceptable

---

### Armenian (ISO 9985)

**Dataset Statistics**: 4 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| Մատենադարան | Matenadaran | M |
| Ազգային Մատենադարան | Azgayin Matenadaran | AM |

**Notes**:
- Armenian alphabet unique to Armenian language
- Transliteration straightforward letter-for-letter

---

### Georgian (ISO 9984)

**Dataset Statistics**: 2 institutions

| Emic Name | Transliterated | Abbreviation |
|-----------|----------------|--------------|
| ხელნაწერთა ეროვნული ცენტრი | Xelnawerti Erovnuli C'ent'ri | XEC |
| საქართველოს ეროვნული არქივი | Sakartvelos Erovnuli Arkivi | SEA |

**Notes**:
- Georgian Mkhedruli script
- Apostrophes mark ejective consonants (removed in abbreviation)

---

## Complete Workflow Example

### Step-by-Step: Korean Institution

**Institution**: National Museum of Korea

1. **Emic Name (Original Script)**:
   ```
   국립중앙박물관
   ```

2. **Language Detection**: Korean (ko)

3. **Transliterate using Revised Romanization**:
   ```
   Gungnip Jungang Bakmulgwan
   ```

4. **Identify Skip Words**: None for Korean

5. **Extract First Letters**:
   ```
   G + J + B = GJB
   ```

6. **Diacritic Normalization**: N/A (RR has no diacritics)

7. **Final Abbreviation**: `GJB`

8. **Store in YAML**:
   ```yaml
   custodian_name:
     emic_name: 국립중앙박물관
     name_language: ko
     english_name: National Museum of Korea
   ghcid:
     ghcid_current: KR-SO-SEO-M-GJB
     abbreviation_source: transliterated_emic
   ```

---
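
Steps 4-7 of the workflow above can be sketched as a single helper. The name `first_letter_abbreviation` is hypothetical, and it only handles *standalone* skip words, not attached prefixes (Arabic `al-`) or suffixes (Persian ezafe):

```python
import unicodedata

def first_letter_abbreviation(romanized: str, skip_words=()) -> str:
    """Drop standalone skip words, take first letters, fold diacritics
    to ASCII, and uppercase (steps 4-7 of the workflow)."""
    skip = {w.lower() for w in skip_words}
    letters = ''.join(w[0] for w in romanized.split() if w.lower() not in skip)
    decomposed = unicodedata.normalize('NFD', letters)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch)).upper()
```

For the Korean example: `first_letter_abbreviation('Gungnip Jungang Bakmulgwan')` → `GJB`.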

### Step-by-Step: Arabic Institution

**Institution**: National Library of Morocco

1. **Emic Name (Original Script)**:
   ```
   المكتبة الوطنية للمملكة المغربية
   ```

2. **Language Detection**: Arabic (ar)

3. **Transliterate using ISO 233-2**:
   ```
   al-Maktaba al-Waṭanīya lil-Mamlaka al-Maġribīya
   ```

4. **Identify Skip Words**: `al-` (4 occurrences), `lil-` (1)

5. **After Skip Word Removal**:
   ```
   Maktaba Waṭanīya Mamlaka Maġribīya
   ```

6. **Extract First Letters**:
   ```
   M + W + M + M = MWMM
   ```

7. **Diacritic Normalization**: ṭ→t, ī→i, ġ→g
   ```
   MWMM (already ASCII)
   ```

8. **Final Abbreviation**: `MWMM`

---

## Edge Cases and Special Handling

### Mixed Scripts

Some institution names mix scripts (e.g., Latin brand names in Chinese text):

**Example**: 中国IBM研究院
- Transliterate Chinese: Zhongguo IBM Yanjiuyuan
- Keep "IBM" as-is (already Latin)
- Abbreviation: ZIY

### Transliteration Ambiguity

When multiple valid transliterations exist, prefer:
1. ISO standard spelling
2. Institution's own romanization (if consistent)
3. Most commonly used academic romanization

### Very Long Names

If abbreviation exceeds 10 characters after applying rules:
1. Truncate to 10 characters
2. Ensure truncation doesn't create ambiguous abbreviation
3. Document truncation in `ghcid.notes`

---
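
The truncation rule can be sketched as follows. `truncate_abbreviation` is a hypothetical helper: it applies the 10-character cap and does only a minimal ambiguity check; documenting the truncation in `ghcid.notes` remains the caller's job:

```python
MAX_ABBREV_LEN = 10  # cap from the rule above

def truncate_abbreviation(abbrev: str, existing=frozenset()) -> str:
    """Truncate to 10 characters, refusing truncations that collide
    with an already-assigned abbreviation."""
    if len(abbrev) <= MAX_ABBREV_LEN:
        return abbrev
    truncated = abbrev[:MAX_ABBREV_LEN]
    if truncated in existing:
        raise ValueError(f'{truncated!r} collides with an existing abbreviation')
    return truncated
```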

## Python Implementation Reference

For the complete Python implementation of transliteration functions, see:

- `.opencode/TRANSLITERATION_STANDARDS.md` - Full code with all language handlers
- `scripts/transliterate_emic_names.py` - Production script for batch transliteration

### Quick Reference Function

```python
from transliteration import transliterate_for_abbreviation

# Example usage for all supported languages
examples = {
    'ru': 'Российская государственная библиотека',
    'zh': '中国国家图书馆',
    'ja': '国立国会図書館',
    'ko': '국립중앙박물관',
    'ar': 'المكتبة الوطنية للمملكة المغربية',
    'he': 'הספרייה הלאומית',
    'hi': 'राष्ट्रीय अभिलेखागार',
    'el': 'Εθνική Βιβλιοθήκη της Ελλάδας',
}

for lang, name in examples.items():
    latin = transliterate_for_abbreviation(name, lang)
    print(f'{lang}: {name}')
    print(f'  → {latin}')
```

---

## Validation Checklist

Before finalizing a transliterated abbreviation:

- [ ] Original emic name preserved in `custodian_name.emic_name`
- [ ] Language code stored in `custodian_name.name_language`
- [ ] Correct ISO standard applied for script
- [ ] Skip words removed (articles, prepositions)
- [ ] Diacritics normalized to ASCII
- [ ] Special characters removed
- [ ] Abbreviation ≤ 10 characters
- [ ] No conflicts with existing GHCIDs

---

## See Also

- `.opencode/TRANSLITERATION_STANDARDS.md` - Technical rules and Python code
- `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` - Character filtering rules
- `AGENTS.md` - Rule 12: Non-Latin Script Transliteration
- `docs/PERSISTENT_IDENTIFIERS.md` - GHCID specification

---

## Changelog

| Date | Change |
|------|--------|
| 2025-12-08 | Initial document created with 21 language examples |