glam/.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md

Abbreviation Character Filtering Rules

Rule ID: ABBREV-CHAR-FILTER
Status: MANDATORY
Applies To: GHCID abbreviation component generation
Created: 2025-12-07
Updated: 2025-12-08 (added diacritics rule)


Summary

When generating abbreviations for GHCID, ONLY ASCII uppercase letters (A-Z) are permitted. Special characters MUST be removed, and diacritics MUST be normalized to their ASCII base letters.

This is a MANDATORY rule. Abbreviations containing special characters or diacritics are INVALID and must be regenerated.

Two Mandatory Sub-Rules:

  1. ABBREV-DIACRITICS: Normalize all diacritics to ASCII equivalents
  2. ABBREV-SPECIAL-CHAR: Remove all special characters and symbols

Rule 1: Diacritics MUST Be Normalized to ASCII (ABBREV-DIACRITICS)

Diacritics (accented characters) MUST be normalized to their ASCII base letter equivalents.

Example (Real Case)

❌ WRONG:  CZ-VY-TEL-L-VHSPAOČRZS  (contains Č)
✅ CORRECT: CZ-VY-TEL-L-VHSPAOCRZS  (ASCII only)

Diacritics Normalization Table

| Diacritic | ASCII | Example |
|-----------|-------|---------|
| Á, À, Â, Ã, Ä, Å, Ā | A | "Ålborg" → A |
| Č, Ć, Ç | C | "Český" → C |
| Ď | D | "Ďáblice" → D |
| É, È, Ê, Ë, Ě, Ē | E | "Éire" → E |
| Í, Ì, Î, Ï, Ī | I | "Ísland" → I |
| Ñ, Ń, Ň | N | "España" → N |
| Ó, Ò, Ô, Õ, Ö, Ø, Ō | O | "Österreich" → O |
| Ř | R | "Říčany" → R |
| Š, Ś, Ş | S | "Šumperk" → S |
| Ť | T | "Ťažký" → T |
| Ú, Ù, Û, Ü, Ů, Ū | U | "Ústí" → U |
| Ý, Ÿ | Y | "Ýmir" → Y |
| Ž, Ź, Ż | Z | "Žilina" → Z |
| Ł | L | "Łódź" → L |
| Æ | AE | "Ærø" → AE |
| Œ | OE | "Œuvre" → OE |
| ß | SS | "Straße" → SS |

Implementation

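import unicodedata

# Letters with no canonical Unicode decomposition. NFD leaves them unchanged,
# so they are mapped explicitly (values taken from the table above).
NON_DECOMPOSABLE = str.maketrans({
    "Ł": "L", "ł": "l",
    "Ø": "O", "ø": "o",
    "Æ": "AE", "æ": "ae",
    "Œ": "OE", "œ": "oe",
    "ß": "SS",
})

def normalize_diacritics(text: str) -> str:
    """
    Normalize diacritics to ASCII equivalents.

    Examples:
        "Č" → "C"
        "Ř" → "R"
        "Ö" → "O"
        "ñ" → "n"
        "Ł" → "L"
    """
    # Map letters that NFD cannot split into base letter + combining mark
    text = text.translate(NON_DECOMPOSABLE)
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', text)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    return ascii_text

# Examples
normalize_diacritics("VHSPAOČRZS")  # Returns "VHSPAOCRZS"
normalize_diacritics("Łódź")        # Returns "Lodz"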
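
Some characters in the normalization table have no canonical Unicode decomposition, so NFD followed by mark stripping leaves them unchanged. The check below is a minimal sketch using only the standard library; `strip_combining_marks` and `samples` are illustrative names, not part of the codebase. It lists the sample characters that need an explicit mapping.

```python
import unicodedata

def strip_combining_marks(text: str) -> str:
    """NFD-decompose the text and drop nonspacing marks (category 'Mn')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

# A few characters taken from the normalization table
samples = ["Č", "Ř", "Ö", "Ñ", "Ł", "Ø", "Æ", "Œ", "ß"]

# Characters that are still non-ASCII after mark stripping
unresolved = [c for c in samples if not strip_combining_marks(c).isascii()]
print(unresolved)  # ['Ł', 'Ø', 'Æ', 'Œ', 'ß']
```

Stroked letters and ligatures are separate code points rather than a base letter plus an accent, which is why they need their own entries.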

Languages Commonly Affected

| Language | Common Diacritics | Example Institution |
|----------|-------------------|---------------------|
| Czech | Č, Ř, Š, Ž, Ě, Ů | Vlastivědné muzeum → VM (not VM with háček) |
| Polish | Ł, Ń, Ó, Ś, Ź, Ż, Ą, Ę | Biblioteka Łódzka → BL |
| German | Ä, Ö, Ü, ß | Österreichische Nationalbibliothek → ON |
| French | É, È, Ê, Ç, Ô | Bibliothèque nationale → BN |
| Spanish | Ñ, Á, É, Í, Ó, Ú | Museo Nacional → MN |
| Portuguese | Ã, Õ, Ç, Á, É | Biblioteca Nacional → BN |
| Nordic | Å, Ä, Ö, Ø, Æ | Nationalmuseet → N |
| Turkish | Ç, Ğ, İ, Ö, Ş, Ü | İstanbul Üniversitesi → IU |
| Hungarian | Á, É, Í, Ó, Ö, Ő, Ú, Ü, Ű | Országos Levéltár → OL |
| Romanian | Ă, Â, Î, Ș, Ț | Biblioteca Națională → BN |

Rule 2: Special Characters MUST Be Removed (ABBREV-SPECIAL-CHAR)


Rationale

1. URL/URI Safety

Special characters require percent-encoding in URIs. For example:

  • & becomes %26
  • + becomes %2B

This makes identifiers harder to share, copy, and verify.
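
As a small illustration, Python's standard `urllib.parse.quote` shows how an identifier containing `&` changes when placed in a URI, while a clean identifier passes through untouched. This is only a sketch; `safe=""` is chosen here so that every reserved character is encoded.

```python
from urllib.parse import quote

for ghcid in ("SX-XX-PHI-O-DR&IMSM", "SX-XX-PHI-O-DRIMSM"):
    # Hyphens and A-Z are unreserved and stay as-is; "&" is percent-encoded
    print(ghcid, "->", quote(ghcid, safe=""))
# SX-XX-PHI-O-DR&IMSM -> SX-XX-PHI-O-DR%26IMSM
# SX-XX-PHI-O-DRIMSM -> SX-XX-PHI-O-DRIMSM
```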

2. Filename Safety

Many special characters are invalid in filenames across operating systems:

  • Windows: \ / : * ? " < > |
  • macOS/Linux: / and null bytes

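Characters that are legal in filenames can still cause trouble. The & in SX-XX-PHI-O-DR&IMSM.yaml is accepted by all three systems, but most shells treat an unquoted & as a control operator, so the file is easy to mishandle on the command line and in scripts.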
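
The sketch below separates two concerns: whether a name contains a character from the Windows list above, and whether it needs shell quoting. `WINDOWS_RESERVED` and `windows_reserved_in` are hypothetical helpers written for this example; `shlex.quote` is the standard-library function.

```python
import shlex

# The Windows-reserved characters listed above
WINDOWS_RESERVED = set('\\/:*?"<>|')

def windows_reserved_in(filename: str) -> set[str]:
    """Return the Windows-reserved characters present in a file name."""
    return set(filename) & WINDOWS_RESERVED

# "&" is not reserved on Windows, so nothing is reported here
print(windows_reserved_in("SX-XX-PHI-O-DR&IMSM.yaml"))  # set()

# ...but a POSIX shell needs the name quoted, unlike the corrected name
print(shlex.quote("SX-XX-PHI-O-DR&IMSM.yaml"))  # 'SX-XX-PHI-O-DR&IMSM.yaml'
print(shlex.quote("SX-XX-PHI-O-DRIMSM.yaml"))   # SX-XX-PHI-O-DRIMSM.yaml
```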

3. Parsing Consistency

Special characters can conflict with delimiters in data pipelines:

  • & is used in query strings
  • : is used in YAML, JSON
  • / is a path separator
  • | is a common CSV delimiter alternative
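
A short sketch of the delimiter conflict, using the standard `urllib.parse.parse_qs`. The query parameter names `ghcid` and `abbr` are made up for the example.

```python
from urllib.parse import parse_qs

# An unencoded "&" ends the parameter, so the identifier is silently truncated
print(parse_qs("ghcid=SX-XX-PHI-O-DR&IMSM"))  # {'ghcid': ['SX-XX-PHI-O-DR']}

# An unencoded "+" is decoded as a space
print(parse_qs("abbr=A+CC"))                  # {'abbr': ['A CC']}

# A clean identifier survives unchanged
print(parse_qs("ghcid=SX-XX-PHI-O-DRIMSM"))   # {'ghcid': ['SX-XX-PHI-O-DRIMSM']}
```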

4. Cross-System Compatibility

Identifiers should work across all systems:

  • Databases (SQL, TypeDB, Neo4j)
  • RDF/SPARQL endpoints
  • REST APIs
  • Command-line tools
  • Spreadsheets

5. Human Readability

Clean identifiers are easier to:

  • Communicate verbally
  • Type correctly
  • Proofread
  • Remember

Characters to Remove

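The following characters MUST never appear in an abbreviation and are never substituted with another character. During extraction they are treated as word boundaries, so "Records&Information" yields two words, and are then dropped: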

| Character | Name | Example Issue |
|-----------|------|---------------|
| & | Ampersand | "R&A" in URLs, HTML entities |
| / | Slash | Path separator confusion |
| \ | Backslash | Escape sequence issues |
| + | Plus | URL encoding (+ = space) |
| @ | At sign | Email/handle confusion |
| # | Hash/Pound | Fragment identifier in URLs |
| % | Percent | URL encoding prefix |
| $ | Dollar | Variable prefix in shells |
| * | Asterisk | Glob/wildcard character |
| ( ) | Parentheses | Grouping in regex, code |
| [ ] | Square brackets | Array notation |
| { } | Curly braces | Object notation |
| \| | Pipe | Command chaining, OR operator |
| : | Colon | YAML key-value, namespace separator |
| ; | Semicolon | Statement terminator |
| " ' ` | Quotes | String delimiters |
| , | Comma | List separator |
| . | Period | File extension, namespace |
| - | Hyphen | Already used as GHCID component separator |
| _ | Underscore | Reserved for name suffix in collisions |
| = | Equals | Assignment operator |
| ? | Question mark | Query string indicator |
| ! | Exclamation | Negation, shell history |
| ~ | Tilde | Home directory, bitwise NOT |
| ^ | Caret | Regex anchor, power operator |
| < > | Angle brackets | HTML tags, redirects |

Implementation

Algorithm

When extracting abbreviation from institution name:

import re
import unicodedata

# Letters that NFD cannot decompose. Without this mapping Step 2 would discard
# them, and "Łódzka" would contribute "O" instead of "L".
ASCII_FALLBACKS = str.maketrans({
    "Ł": "L", "ł": "l", "Ø": "O", "ø": "o",
    "Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe", "ß": "SS",
})

def extract_abbreviation_from_name(name: str, skip_words: set) -> str:
    """
    Extract abbreviation from institution name.

    Args:
        name: Full institution name (emic)
        skip_words: Set of lowercase prepositions/articles to skip

    Returns:
        Uppercase abbreviation with only A-Z characters
    """
    # Step 1: Normalize unicode (map non-decomposable letters, remove diacritics)
    normalized = unicodedata.normalize('NFD', name.translate(ASCII_FALLBACKS))
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

    # Step 2: Replace special characters with spaces (to split words)
    # This handles cases like "Records&Information" -> "Records Information"
    clean_name = re.sub(r'[^a-zA-Z\s]', ' ', ascii_name)

    # Step 3: Split into words
    words = clean_name.split()

    # Step 4: Filter out skip words (prepositions, articles)
    significant_words = [w for w in words if w.lower() not in skip_words]

    # Step 5: Take first letter of each significant word
    abbreviation = ''.join(w[0].upper() for w in significant_words if w)

    # Step 6: Limit to 10 characters
    return abbreviation[:10]

Handling Special Cases

Case 1: "Records & Information Management"

  1. Input: "Records & Information Management"
  2. After special char removal: "Records Information Management"
  3. After split: ["Records", "Information", "Management"]
  4. Abbreviation: RIM

Case 2: "Art/Design Museum"

  1. Input: "Art/Design Museum"
  2. After special char removal: "Art Design Museum"
  3. After split: ["Art", "Design", "Museum"]
  4. Abbreviation: ADM

Case 3: "Culture+"

  1. Input: "Culture+"
  2. After special char removal: "Culture"
  3. After split: ["Culture"]
  4. Abbreviation: C
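
The three cases can be traced with a reduced version of the algorithm. This is a sketch only: `trace_abbreviation` is a hypothetical helper that covers the special-character and splitting steps, and it omits skip words and diacritics handling because none of the three inputs need them.

```python
import re

def trace_abbreviation(name: str) -> tuple[str, list[str], str]:
    """Return the cleaned name, its words, and the first-letter abbreviation."""
    # Non-letters become spaces so that "Art/Design" splits into two words
    cleaned = re.sub(r"[^a-zA-Z\s]", " ", name)
    words = cleaned.split()
    abbreviation = "".join(w[0].upper() for w in words)
    return " ".join(words), words, abbreviation

for name in ("Records & Information Management", "Art/Design Museum", "Culture+"):
    cleaned, words, abbreviation = trace_abbreviation(name)
    print(f"{name!r}: {cleaned!r} -> {words} -> {abbreviation}")
```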

Examples

| Institution Name | Correct | Incorrect |
|------------------|---------|-----------|
| Department of Records & Information Management | DRIM | DR&IM |
| Art + Culture Center | ACC | A+CC |
| Museum/Gallery Amsterdam | MGA | M/GA |
| Heritage@Digital | HD | H@D |
| Archives (Historical) | AH | A(H) |
| Research & Development Institute | RDI | R&DI |
| Sint Maarten Records & Information | SMRI | SMR&I |

Real-World Case Study

File: data/custodian/SX-XX-PHI-O-DR&IMSM.yaml

Institution: Department of Records & Information Management of Sint Maarten

Current (INCORRECT):

ghcid:
  ghcid_current: SX-XX-PHI-O-DR&IMSM  # Contains "&"

Required (CORRECT):

ghcid:
  ghcid_current: SX-XX-PHI-O-DRIMSM   # Alphabetic only

Derivation:

  1. Full name: "Department of Records & Information Management of Sint Maarten"
  2. Skip prepositions: "of" (appears twice)
  3. Handle special character: "&" → split "Records & Information" into separate words
  4. Significant words: Department, Records, Information, Management, Sint, Maarten
  5. First letters: D, R, I, M, S, M → "DRIMSM"
  6. Maximum 10 chars: "DRIMSM" (6 chars, OK)

Note: File should be renamed from SX-XX-PHI-O-DR&IMSM.yaml to SX-XX-PHI-O-DRIMSM.yaml


Validation

Check for Invalid Abbreviations

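# Find GHCID files whose ghcid_current value contains a special character.
# The match is limited to the ghcid_current line: characters such as ":" and "#"
# occur in every YAML file, so a whole-file search would flag everything.
find data/custodian -name '*.yaml' -exec grep -lE 'ghcid_current:[[:space:]]*[^[:space:]#]*[&+@%$*|;?!=~^<>]' {} + | head -20

# Specifically check for & in filenames
find data/custodian -name '*&*.yaml'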

Programmatic Validation

import re

def validate_abbreviation(abbrev: str) -> bool:
    """
    Validate that abbreviation contains only A-Z.

    Returns True if valid, False if it is empty or contains any other character.
    """
    # fullmatch avoids the '$' anchor, which would also accept a trailing newline
    return bool(re.fullmatch(r'[A-Z]+', abbrev))

# Examples
validate_abbreviation("DRIMSM")      # True - valid
validate_abbreviation("DR&IMSM")     # False - contains &
validate_abbreviation("A+CC")        # False - contains +
validate_abbreviation("VHSPAOČRZS")  # False - contains Č
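
The same check can be applied to whole GHCIDs. The sketch below assumes, based on the examples in this document, that the abbreviation is everything after the fourth hyphen; `invalid_ghcids` is a hypothetical helper, and it does not account for the underscore name suffix used in collisions.

```python
import re

def invalid_ghcids(ghcids: list[str]) -> list[str]:
    """Return the GHCIDs whose abbreviation component is not purely A-Z."""
    bad = []
    for ghcid in ghcids:
        # Assumption: four leading components, then the abbreviation
        parts = ghcid.split("-", 4)
        abbrev = parts[4] if len(parts) == 5 else ""
        if not re.fullmatch(r"[A-Z]+", abbrev):
            bad.append(ghcid)
    return bad

print(invalid_ghcids([
    "SX-XX-PHI-O-DRIMSM",
    "SX-XX-PHI-O-DR&IMSM",
    "CZ-VY-TEL-L-VHSPAOČRZS",
]))
# ['SX-XX-PHI-O-DR&IMSM', 'CZ-VY-TEL-L-VHSPAOČRZS']
```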

Migration

For existing files with special characters in abbreviations:

  1. Identify affected files:

    find data/custodian -name '*[&+@#%$*|]*' -type f
    
  2. For each file:

    • Read the file
    • Regenerate the abbreviation following this rule
    • Update ghcid_current and ghcid_original
    • Add entry to ghcid_history documenting the correction
    • Rename the file to match new GHCID
  3. Update GHCID history:

    ghcid_history:
      - ghcid: SX-XX-PHI-O-DRIMSM
        ghcid_numeric: <new_numeric>
        valid_from: "2025-12-07T00:00:00Z"
        reason: "Corrected abbreviation to remove special character '&' per ABBREV-SPECIAL-CHAR rule"
      - ghcid: SX-XX-PHI-O-DR&IMSM
        ghcid_numeric: 12942843033761857211
        valid_from: "2025-12-06T21:07:22.140567+00:00"
        valid_to: "2025-12-07T00:00:00Z"
        reason: "Initial GHCID (contained invalid character)"
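
The history update in step 3 could be scripted roughly as follows. This is a sketch under assumptions: `correct_ghcid_history` is a hypothetical helper, the history is assumed to be a newest-first list of mappings as in the YAML above, and the new numeric value must be supplied by the caller because this rule does not define how it is computed.

```python
from datetime import datetime, timezone

def correct_ghcid_history(history: list[dict], new_ghcid: str, new_numeric: int,
                          reason: str, when: datetime) -> list[dict]:
    """Close the current entry and prepend the corrected one (newest first)."""
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    updated = [dict(entry) for entry in history]  # copy; input stays untouched
    if updated and "valid_to" not in updated[0]:
        updated[0]["valid_to"] = stamp            # old GHCID ends when the new one starts
    new_entry = {
        "ghcid": new_ghcid,
        "ghcid_numeric": new_numeric,
        "valid_from": stamp,
        "reason": reason,
    }
    return [new_entry] + updated

old_history = [{
    "ghcid": "SX-XX-PHI-O-DR&IMSM",
    "ghcid_numeric": 12942843033761857211,
    "valid_from": "2025-12-06T21:07:22.140567+00:00",
    "reason": "Initial GHCID (contained invalid character)",
}]
new_history = correct_ghcid_history(
    old_history,
    new_ghcid="SX-XX-PHI-O-DRIMSM",
    new_numeric=0,  # placeholder only; the real value is not known here
    reason="Corrected abbreviation to remove special character '&' per ABBREV-SPECIAL-CHAR rule",
    when=datetime(2025, 12, 7, tzinfo=timezone.utc),
)
```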
    

Related Documentation

  • AGENTS.md - Section "INSTITUTION ABBREVIATION: EMIC NAME FIRST-LETTER PROTOCOL"
  • schemas/20251121/linkml/modules/classes/CustodianName.yaml - Schema description
  • .opencode/LEGAL_FORM_FILTERING_RULE.md - Related filtering rule for legal forms
  • docs/PERSISTENT_IDENTIFIERS.md - GHCID specification

Changelog

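| Date | Change |
|------|--------|
| 2025-12-07 | Initial rule created after discovery of & in GHCID |
| 2025-12-08 | Added diacritics normalization rule (ABBREV-DIACRITICS) |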