kempersc 271545fa8b docs: add Z.AI GLM API and transliteration rules to AGENTS.md

- Add Rule 11 for Z.AI Coding Plan API usage (not BigModel)
- Add transliteration standards for non-Latin scripts
- Document GLM model options and Python implementation

2025-12-08 14:58:22 +01:00


Abbreviation Character Filtering Rules

Rule ID: ABBREV-CHAR-FILTER
Status: MANDATORY
Applies To: GHCID abbreviation component generation
Created: 2025-12-07
Updated: 2025-12-08 (added diacritics rule)

Summary

When generating abbreviations for GHCID, ONLY ASCII uppercase letters (A-Z) are permitted. Both special characters AND diacritics MUST be removed/normalized.

This is a MANDATORY rule. Abbreviations containing special characters or diacritics are INVALID and must be regenerated.

Two Mandatory Sub-Rules:

ABBREV-SPECIAL-CHAR: Remove all special characters and symbols
ABBREV-DIACRITICS: Normalize all diacritics to ASCII equivalents

Rule 1: Diacritics MUST Be Normalized to ASCII (ABBREV-DIACRITICS)

Diacritics (accented characters) MUST be normalized to their ASCII base letter equivalents.

Example (Real Case)

❌ WRONG:  CZ-VY-TEL-L-VHSPAOČRZS  (contains Č)
✅ CORRECT: CZ-VY-TEL-L-VHSPAOCRZS  (ASCII only)

Diacritics Normalization Table

Diacritic	ASCII	Example
Á, À, Â, Ã, Ä, Å, Ā	A	"Ålborg" → A
Č, Ć, Ç	C	"Český" → C
Ď	D	"Ďáblice" → D
É, È, Ê, Ë, Ě, Ē	E	"Éire" → E
Í, Ì, Î, Ï, Ī	I	"Ísland" → I
Ñ, Ń, Ň	N	"España" → N
Ó, Ò, Ô, Õ, Ö, Ø, Ō	O	"Österreich" → O
Ř	R	"Říčany" → R
Š, Ś, Ş	S	"Šumperk" → S
Ť	T	"Ťažký" → T
Ú, Ù, Û, Ü, Ů, Ū	U	"Ústí" → U
Ý, Ÿ	Y	"Ýmir" → Y
Ž, Ź, Ż	Z	"Žilina" → Z
Ł	L	"Łódź" → L
Æ	AE	"Ærø" → AE
Œ	OE	"Œuvre" → OE
ß	SS	"Straße" → SS

Implementation

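import unicodedata

# Letters from the table above that have no Unicode decomposition,
# so NFD cannot split them into base letter + combining mark
_NO_DECOMPOSITION = str.maketrans({
    "Ł": "L", "ł": "l", "Ø": "O", "ø": "o",
    "Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe",
    "ß": "ss", "ẞ": "SS",
})

def normalize_diacritics(text: str) -> str:
    """
    Normalize diacritics to ASCII equivalents.
    
    Examples:
        "Č" → "C"
        "Ř" → "R"
        "Ö" → "O"
        "ñ" → "n"
        "Ł" → "L"
    """
    # Map letters that NFD leaves untouched (Ł, Ø, Æ, Œ, ß)
    text = text.translate(_NO_DECOMPOSITION)
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', text)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    return ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

# Example
normalize_diacritics("VHSPAOČRZS")  # Returns "VHSPAOCRZS"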
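NFD decomposition only helps for characters that Unicode defines as a base letter plus a combining mark. The check below is a minimal sketch (the helper name `strip_combining_marks` is hypothetical) that shows which characters from the normalization table pass through NFD stripping unchanged and therefore need an explicit mapping:

```python
import unicodedata

def strip_combining_marks(text: str) -> str:
    """Hypothetical helper: NFD-decompose, then drop nonspacing marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

# Representative characters taken from the normalization table
table_chars = "ÁČĎÉÍÑÓŘŠŤÚÝŽŁØÆŒß"

# Characters that NFD stripping leaves untouched
unchanged = [ch for ch in table_chars if strip_combining_marks(ch) == ch]
print(unchanged)  # ['Ł', 'Ø', 'Æ', 'Œ', 'ß']
```

Any character reported here would otherwise survive normalization and later be discarded as a non-letter, so the affected word would lose its initial.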

Languages Commonly Affected

Language	Common Diacritics	Example Institution
Czech	Č, Ř, Š, Ž, Ě, Ů	Vlastivědné muzeum → VM
Polish	Ł, Ń, Ó, Ś, Ź, Ż, Ą, Ę	Biblioteka Łódzka → BL
German	Ä, Ö, Ü, ß	Österreichische Nationalbibliothek → ON
French	É, È, Ê, Ç, Ô	Bibliothèque nationale → BN
Spanish	Ñ, Á, É, Í, Ó, Ú	Museo Nacional → MN
Portuguese	Ã, Õ, Ç, Á, É	Biblioteca Nacional → BN
Nordic	Å, Ä, Ö, Ø, Æ	Nationalmuseet → N
Turkish	Ç, Ğ, İ, Ö, Ş, Ü	İstanbul Üniversitesi → IU
Hungarian	Á, É, Í, Ó, Ö, Ő, Ú, Ü, Ű	Országos Levéltár → OL
Romanian	Ă, Â, Î, Ș, Ț	Biblioteca Națională → BN

Rule 2: Special Characters MUST Be Removed (ABBREV-SPECIAL-CHAR)

Rationale

1. URL/URI Safety

Special characters require percent-encoding in URIs. For example:

& becomes %26
+ becomes %2B

This makes identifiers harder to share, copy, and verify.
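To make the effect concrete, the standard library's `urllib.parse.quote` can percent-encode a candidate identifier. This is only an illustration, using the identifiers that appear elsewhere in this document:

```python
from urllib.parse import quote

# safe="" forces every reserved character to be percent-encoded
invalid_id = quote("SX-XX-PHI-O-DR&IMSM", safe="")
valid_id = quote("SX-XX-PHI-O-DRIMSM", safe="")

print(invalid_id)  # SX-XX-PHI-O-DR%26IMSM
print(valid_id)    # SX-XX-PHI-O-DRIMSM
```

The A-Z-only identifier is unchanged by encoding, so the value in a URL is the same string as the value on disk.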

2. Filename Safety

Many special characters are invalid in filenames across operating systems:

Windows: \ / : * ? " < > |
macOS/Linux: / and null bytes

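Although & is a legal filename character on most systems, it is a control operator in POSIX shells and in cmd.exe, so a file like SX-XX-PHI-O-DR&IMSM.yaml must be quoted or escaped on the command line.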

3. Parsing Consistency

Special characters can conflict with delimiters in data pipelines:

& is used in query strings
: is used in YAML, JSON
/ is a path separator
| is a common CSV delimiter alternative
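As an illustration of the delimiter conflict, the sketch below passes an unencoded identifier through the standard library's query-string parser. The parameter name `ghcid` is a hypothetical choice for this example:

```python
from urllib.parse import parse_qs

# Hypothetical query string carrying an unencoded GHCID
parsed = parse_qs("ghcid=SX-XX-PHI-O-DR&IMSM")

# "&" is read as a parameter separator, so the identifier is cut short
print(parsed)  # {'ghcid': ['SX-XX-PHI-O-DR']}
```

The identifier is truncated without any error being raised, so this kind of failure is easy to miss.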

4. Cross-System Compatibility

Identifiers should work across all systems:

Databases (SQL, TypeDB, Neo4j)
RDF/SPARQL endpoints
REST APIs
Command-line tools
Spreadsheets

5. Human Readability

Clean identifiers are easier to:

Communicate verbally
Type correctly
Proofread
Remember

Characters to Remove

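The following characters MUST be completely removed when generating abbreviations. They are never replaced by a substitute character in the output; during extraction they act only as word boundaries (see Algorithm below).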

Character	Name	Example Issue
`&`	Ampersand	"R&A" in URLs, HTML entities
`/`	Slash	Path separator confusion
`\`	Backslash	Escape sequence issues
`+`	Plus	URL encoding (`+` = space)
`@`	At sign	Email/handle confusion
`#`	Hash/Pound	Fragment identifier in URLs
`%`	Percent	URL encoding prefix
`$`	Dollar	Variable prefix in shells
`*`	Asterisk	Glob/wildcard character
`(` `)`	Parentheses	Grouping in regex, code
`[` `]`	Square brackets	Array notation
`{` `}`	Curly braces	Object notation
`|`	Pipe	Command chaining, OR operator
`:`	Colon	YAML key-value, namespace separator
`;`	Semicolon	Statement terminator
`"` `'` `` ` ``	Quotes	String delimiters
`,`	Comma	List separator
`.`	Period	File extension, namespace
`-`	Hyphen	Already used as GHCID component separator
`_`	Underscore	Reserved for name suffix in collisions
`=`	Equals	Assignment operator
`?`	Question mark	Query string indicator
`!`	Exclamation	Negation, shell history
`~`	Tilde	Home directory, bitwise NOT
`^`	Caret	Regex anchor, power operator
`<` `>`	Angle brackets	HTML tags, redirects

Implementation

Algorithm

When extracting abbreviation from institution name:

import re

def extract_abbreviation_from_name(name: str, skip_words: set[str]) -> str:
    """
    Extract abbreviation from institution name.
    
    Args:
        name: Full institution name (emic)
        skip_words: Set of lowercase prepositions/articles to skip
    
    Returns:
        Uppercase abbreviation with only A-Z characters
    """
    # Step 1: Normalize unicode (remove diacritics)
    # normalize_diacritics() is the function defined under Rule 1
    ascii_name = normalize_diacritics(name)
    
    # Step 2: Replace special characters with spaces (to split words)
    # This handles cases like "Records&Information" -> "Records Information"
    clean_name = re.sub(r'[^a-zA-Z\s]', ' ', ascii_name)
    
    # Step 3: Split into words (split() never yields empty strings)
    words = clean_name.split()
    
    # Step 4: Filter out skip words (prepositions, articles)
    significant_words = [w for w in words if w.lower() not in skip_words]
    
    # Step 5: Take first letter of each significant word
    abbreviation = ''.join(w[0].upper() for w in significant_words)
    
    # Step 6: Limit to 10 characters
    return abbreviation[:10]

Handling Special Cases

Case 1: "Records & Information Management"

Input: "Records & Information Management"
After special char removal: "Records Information Management"
After split: ["Records", "Information", "Management"]
Abbreviation: RIM

Case 2: "Art/Design Museum"

Input: "Art/Design Museum"
After special char removal: "Art Design Museum"
After split: ["Art", "Design", "Museum"]
Abbreviation: ADM

Case 3: "Culture+"

Input: "Culture+"
After special char removal: "Culture"
After split: ["Culture"]
Abbreviation: C

Examples

Institution Name	Correct	Incorrect
Department of Records & Information Management	DRIM	DR&IM
Art + Culture Center	ACC	A+CC
Museum/Gallery Amsterdam	MGA	M/GA
Heritage@Digital	HD	H@D
Archives (Historical)	AH	A(H)
Research & Development Institute	RDI	R&DI
Sint Maarten Records & Information	SMRI	SMR&I
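The rows above can be checked mechanically. The sketch below restates the algorithm in compact form (the name `abbreviate` and the one-word skip list are assumptions made for this example, not the project's actual code) and asserts each row of the table:

```python
import re
import unicodedata

def abbreviate(name: str, skip_words: set[str]) -> str:
    """Compact restatement of the algorithm, for checking the examples."""
    decomposed = unicodedata.normalize("NFD", name)
    ascii_name = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    # Non-letters become spaces so they act as word boundaries
    words = re.sub(r"[^a-zA-Z\s]", " ", ascii_name).split()
    initials = [w[0].upper() for w in words if w.lower() not in skip_words]
    return "".join(initials)[:10]

SKIP_WORDS = {"of"}  # assumption: the real skip list is longer

examples = {
    "Department of Records & Information Management": "DRIM",
    "Art + Culture Center": "ACC",
    "Museum/Gallery Amsterdam": "MGA",
    "Heritage@Digital": "HD",
    "Archives (Historical)": "AH",
    "Research & Development Institute": "RDI",
    "Sint Maarten Records & Information": "SMRI",
}

for name, expected in examples.items():
    assert abbreviate(name, SKIP_WORDS) == expected, name
print("all examples match")
```

Because special characters become word boundaries, "Heritage@Digital" yields two initials even though the name contains no space.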

Real-World Case Study

File: data/custodian/SX-XX-PHI-O-DR&IMSM.yaml

Institution: Department of Records & Information Management of Sint Maarten

Current (INCORRECT):

ghcid:
  ghcid_current: SX-XX-PHI-O-DR&IMSM  # Contains "&"

Required (CORRECT):

ghcid:
  ghcid_current: SX-XX-PHI-O-DRIMSM   # Alphabetic only

Derivation:

Full name: "Department of Records & Information Management of Sint Maarten"
Skip prepositions: "of" (appears twice)
Handle special character: "&" → split "Records & Information" into separate words
Significant words: Department, Records, Information, Management, Sint, Maarten
First letters: D, R, I, M, S, M → "DRIMSM"
Maximum 10 chars: "DRIMSM" (6 chars, OK)

Note: File should be renamed from SX-XX-PHI-O-DR&IMSM.yaml to SX-XX-PHI-O-DRIMSM.yaml

Validation

Check for Invalid Abbreviations

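# Find GHCID files whose ghcid_current value contains a character other than
# letters, digits, hyphen, or underscore (assumes unquoted YAML values)
find data/custodian -name "*.yaml" -print0 | xargs -0 grep -lE 'ghcid_current:[[:space:]]*[A-Za-z0-9_-]*[^A-Za-z0-9_[:space:]-]' | head -20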

# Specifically check for & in filenames
find data/custodian -name "*&*.yaml"

Programmatic Validation

import re

def validate_abbreviation(abbrev: str) -> bool:
    """
    Validate that abbreviation contains only A-Z.
    
    Returns True if valid, False if contains special characters.
    """
    # fullmatch also rejects a trailing newline, which '^...$' with re.match would accept
    return bool(re.fullmatch(r'[A-Z]+', abbrev))

# Examples
validate_abbreviation("DRIMSM")   # True - valid
validate_abbreviation("DR&IMSM")  # False - contains &
validate_abbreviation("A+CC")     # False - contains +

Migration

For existing files with special characters in abbreviations:

Identify affected files:

# Single quotes stop the shell from expanding $* inside the pattern
find data/custodian -name '*[&+@#%$*|]*' -type f

For each file:
- Read the file
- Regenerate the abbreviation following this rule
- Update ghcid_current and ghcid_original
- Add entry to ghcid_history documenting the correction
- Rename the file to match new GHCID

Update GHCID history:

ghcid_history:
  - ghcid: SX-XX-PHI-O-DRIMSM
    ghcid_numeric: <new_numeric>
    valid_from: "2025-12-07T00:00:00Z"
    reason: "Corrected abbreviation to remove special character '&' per ABBREV-SPECIAL-CHAR rule"
  - ghcid: SX-XX-PHI-O-DR&IMSM
    ghcid_numeric: 12942843033761857211
    valid_from: "2025-12-06T21:07:22.140567+00:00"
    valid_to: "2025-12-07T00:00:00Z"
    reason: "Initial GHCID (contained invalid character)"
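The two history entries follow a fixed pattern: the new entry starts at the moment the old one ends. A minimal sketch of building them is shown below. The function name `build_history` and its parameters are hypothetical, and the new numeric ID is passed in because its derivation is not covered by this rule:

```python
def build_history(old_ghcid: str, old_numeric: int, old_valid_from: str,
                  new_ghcid: str, new_numeric: int, corrected_at: str,
                  reason: str) -> list[dict]:
    """Return ghcid_history entries, newest first (hypothetical helper)."""
    return [
        {
            "ghcid": new_ghcid,
            "ghcid_numeric": new_numeric,
            "valid_from": corrected_at,
            "reason": reason,
        },
        {
            "ghcid": old_ghcid,
            "ghcid_numeric": old_numeric,
            "valid_from": old_valid_from,
            # The old entry closes exactly when the corrected one opens
            "valid_to": corrected_at,
            "reason": "Initial GHCID (contained invalid character)",
        },
    ]

history = build_history(
    old_ghcid="SX-XX-PHI-O-DR&IMSM",
    old_numeric=12942843033761857211,
    old_valid_from="2025-12-06T21:07:22.140567+00:00",
    new_ghcid="SX-XX-PHI-O-DRIMSM",
    new_numeric=0,  # placeholder only; the real value must be regenerated
    corrected_at="2025-12-07T00:00:00Z",
    reason="Corrected abbreviation to remove special character '&' "
           "per ABBREV-SPECIAL-CHAR rule",
)
```

Reusing one timestamp for `valid_to` and `valid_from` leaves no gap or overlap between the two validity periods.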

Related Documentation

AGENTS.md - Section "INSTITUTION ABBREVIATION: EMIC NAME FIRST-LETTER PROTOCOL"
schemas/20251121/linkml/modules/classes/CustodianName.yaml - Schema description
.opencode/LEGAL_FORM_FILTERING_RULE.md - Related filtering rule for legal forms
docs/PERSISTENT_IDENTIFIERS.md - GHCID specification

Changelog

Date	Change
2025-12-07	Initial rule created after discovery of `&` in GHCID
2025-12-08	Added diacritics normalization rule (ABBREV-DIACRITICS)
