Abbreviation Character Filtering Rules
Rule ID: ABBREV-CHAR-FILTER
Status: MANDATORY
Applies To: GHCID abbreviation component generation
Created: 2025-12-07
Updated: 2025-12-08 (added diacritics rule)
Summary
When generating abbreviations for GHCID, ONLY ASCII uppercase letters (A-Z) are permitted. Special characters MUST be removed, and diacritics MUST be normalized to their ASCII base letters.
This is a MANDATORY rule. Abbreviations containing special characters or diacritics are INVALID and must be regenerated.
Two Mandatory Sub-Rules:
- ABBREV-SPECIAL-CHAR: Remove all special characters and symbols
- ABBREV-DIACRITICS: Normalize all diacritics to ASCII equivalents
Rule 1: Diacritics MUST Be Normalized to ASCII (ABBREV-DIACRITICS)
Diacritics (accented characters) MUST be normalized to their ASCII base letter equivalents.
Example (Real Case)
❌ WRONG: CZ-VY-TEL-L-VHSPAOČRZS (contains Č)
✅ CORRECT: CZ-VY-TEL-L-VHSPAOCRZS (ASCII only)
Diacritics Normalization Table
| Diacritic | ASCII | Example |
|---|---|---|
| Á, À, Â, Ã, Ä, Å, Ā | A | "Ålborg" → A |
| Č, Ć, Ç | C | "Český" → C |
| Ď | D | "Ďáblice" → D |
| É, È, Ê, Ë, Ě, Ē | E | "Éire" → E |
| Í, Ì, Î, Ï, Ī | I | "Ísland" → I |
| Ñ, Ń, Ň | N | "España" → N |
| Ó, Ò, Ô, Õ, Ö, Ø, Ō | O | "Österreich" → O |
| Ř | R | "Říčany" → R |
| Š, Ś, Ş | S | "Šumperk" → S |
| Ť | T | "Ťažký" → T |
| Ú, Ù, Û, Ü, Ů, Ū | U | "Ústí" → U |
| Ý, Ÿ | Y | "Ýmir" → Y |
| Ž, Ź, Ż | Z | "Žilina" → Z |
| Ł | L | "Łódź" → L |
| Æ | AE | "Ærø" → AE |
| Œ | OE | "Œuvre" → OE |
| ß | SS | "Straße" → SS |
Implementation
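import unicodedata

# Letters with no Unicode decomposition: NFD leaves them unchanged,
# so they are mapped explicitly (values follow the table above).
NON_DECOMPOSING = str.maketrans({
    "Ø": "O", "ø": "o", "Ł": "L", "ł": "l",
    "Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe", "ß": "SS",
})

def normalize_diacritics(text: str) -> str:
    """
    Normalize diacritics to ASCII equivalents.

    Examples:
        "Č" → "C"
        "Ř" → "R"
        "Ö" → "O"
        "ñ" → "n"
        "Ł" → "L"
    """
    # Map letters that NFD cannot decompose (Ø, Ł, Æ, Œ, ß)
    text = text.translate(NON_DECOMPOSING)
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', text)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    return ascii_text

# Example
normalize_diacritics("VHSPAOČRZS")  # Returns "VHSPAOCRZS"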
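One caveat is worth checking: NFD only helps for letters that Unicode defines as a base letter plus a combining mark. The sketch below uses the standard `unicodedata.decomposition` call to show which table entries decompose and which do not. The helper name `has_decomposition` is illustrative and not part of the existing codebase.

```python
import unicodedata

def has_decomposition(ch: str) -> bool:
    """Return True if Unicode defines a decomposition mapping for ch."""
    return unicodedata.decomposition(ch) != ""

# Sample letters taken from the normalization table
decomposable = [c for c in "ČŘÖÑŮ" if has_decomposition(c)]
atomic = [c for c in "ØŁÆŒß" if not has_decomposition(c)]

print(decomposable)  # letters NFD can split into base letter + combining mark
print(atomic)        # letters NFD leaves unchanged
```

Letters in the second group pass through NFD-based stripping untouched, so reaching the ASCII values in the table requires a separate lookup for them.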
Languages Commonly Affected
| Language | Common Diacritics | Example Institution |
|---|---|---|
| Czech | Č, Ř, Š, Ž, Ě, Ů | Vlastivědné muzeum → VM |
| Polish | Ł, Ń, Ó, Ś, Ź, Ż, Ą, Ę | Biblioteka Łódzka → BL |
| German | Ä, Ö, Ü, ß | Österreichische Nationalbibliothek → ON |
| French | É, È, Ê, Ç, Ô | Bibliothèque nationale → BN |
| Spanish | Ñ, Á, É, Í, Ó, Ú | Museo Nacional → MN |
| Portuguese | Ã, Õ, Ç, Á, É | Biblioteca Nacional → BN |
| Nordic | Å, Ä, Ö, Ø, Æ | Nationalmuseet → N |
| Turkish | Ç, Ğ, İ, Ö, Ş, Ü | İstanbul Üniversitesi → IU |
| Hungarian | Á, É, Í, Ó, Ö, Ő, Ú, Ü, Ű | Országos Levéltár → OL |
| Romanian | Ă, Â, Î, Ș, Ț | Biblioteca Națională → BN |
Rule 2: Special Characters MUST Be Removed (ABBREV-SPECIAL-CHAR)
Rationale
1. URL/URI Safety
Special characters require percent-encoding in URIs. For example:
- `&` becomes `%26`
- `+` becomes `%2B`
This makes identifiers harder to share, copy, and verify.
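As a small illustration of the encoding cost, the following sketch runs two identifiers through the standard library's `urllib.parse.quote`. Passing `safe=""` is a choice made here so that every reserved character is encoded; it is not a setting taken from the project.

```python
from urllib.parse import quote

with_ampersand = quote("SX-XX-PHI-O-DR&IMSM", safe="")
with_plus = quote("A+CC", safe="")
clean = quote("SX-XX-PHI-O-DRIMSM", safe="")

print(with_ampersand)  # the "&" is percent-encoded
print(with_plus)       # the "+" is percent-encoded
print(clean)           # identical to the input
```

The clean identifier survives encoding unchanged, so the form a user copies from a URL matches the form stored in the data.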
2. Filename Safety
Many special characters are invalid in filenames across operating systems:
- Windows: `\ / : * ? " < > |`
- macOS/Linux: `/` and null bytes
Files like SX-XX-PHI-O-DR&IMSM.yaml may cause issues on some systems.
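A rough way to see the problem is to check a filename against the Windows list above and ask whether a POSIX shell would need it quoted. This is only a sketch: `filename_problems` and `WINDOWS_RESERVED` are hypothetical names, and `shlex.quote` is used here merely as a proxy for "contains shell metacharacters".

```python
import shlex

WINDOWS_RESERVED = set('\\/:*?"<>|')  # characters listed above for Windows

def filename_problems(filename: str) -> dict:
    """Report Windows-reserved characters and whether a POSIX shell needs quoting."""
    return {
        "windows_reserved": sorted(WINDOWS_RESERVED & set(filename)),
        # shlex.quote returns its input unchanged only when no quoting is needed
        "needs_shell_quoting": shlex.quote(filename) != filename,
    }

print(filename_problems("SX-XX-PHI-O-DR&IMSM.yaml"))
print(filename_problems("SX-XX-PHI-O-DRIMSM.yaml"))
```

The `&` is legal on Windows but still forces quoting on the command line, which is the kind of issue the corrected filename avoids.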
3. Parsing Consistency
Special characters can conflict with delimiters in data pipelines:
- `&` is used in query strings
- `:` is used in YAML, JSON
- `/` is a path separator
- `|` is a common CSV delimiter alternative
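To make the delimiter conflict concrete, the sketch below parses a hypothetical query string that carries a GHCID in an `id` parameter (the parameter name is an assumption for illustration). It relies only on the standard `urllib.parse.parse_qs`.

```python
from urllib.parse import parse_qs

# Hypothetical query strings; "id" is an illustrative parameter name
broken = parse_qs("id=SX-XX-PHI-O-DR&IMSM")
clean = parse_qs("id=SX-XX-PHI-O-DRIMSM")

print(broken)  # the identifier is cut at the "&"
print(clean)   # the identifier arrives intact
```

The unencoded `&` is read as a parameter separator, so the receiving system sees a truncated identifier without raising any error.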
4. Cross-System Compatibility
Identifiers should work across all systems:
- Databases (SQL, TypeDB, Neo4j)
- RDF/SPARQL endpoints
- REST APIs
- Command-line tools
- Spreadsheets
5. Human Readability
Clean identifiers are easier to:
- Communicate verbally
- Type correctly
- Proofread
- Remember
Characters to Remove
The following characters MUST NOT appear in a generated abbreviation. They are dropped, never replaced by a substitute character; during extraction they act only as word boundaries:
| Character | Name | Example Issue |
|---|---|---|
| `&` | Ampersand | "R&A" in URLs, HTML entities |
| `/` | Slash | Path separator confusion |
| `\` | Backslash | Escape sequence issues |
| `+` | Plus | URL encoding (+ = space) |
| `@` | At sign | Email/handle confusion |
| `#` | Hash/Pound | Fragment identifier in URLs |
| `%` | Percent | URL encoding prefix |
| `$` | Dollar | Variable prefix in shells |
| `*` | Asterisk | Glob/wildcard character |
| `( )` | Parentheses | Grouping in regex, code |
| `[ ]` | Square brackets | Array notation |
| `{ }` | Curly braces | Object notation |
| `\|` | Pipe | Command chaining, OR operator |
| `:` | Colon | YAML key-value, namespace separator |
| `;` | Semicolon | Statement terminator |
| `"` `'` `` ` `` | Quotes | String delimiters |
| `,` | Comma | List separator |
| `.` | Period | File extension, namespace |
| `-` | Hyphen | Already used as GHCID component separator |
| `_` | Underscore | Reserved for name suffix in collisions |
| `=` | Equals | Assignment operator |
| `?` | Question mark | Query string indicator |
| `!` | Exclamation | Negation, shell history |
| `~` | Tilde | Home directory, bitwise NOT |
| `^` | Caret | Regex anchor, power operator |
| `< >` | Angle brackets | HTML tags, redirects |
Implementation
Algorithm
When extracting an abbreviation from an institution name:

import re
import unicodedata

# Letters with no Unicode decomposition; NFD leaves them intact, so map them first.
# Without this, Step 2 would treat e.g. "Ł" as a special character and split the word.
NON_DECOMPOSING = str.maketrans({
    "Ø": "O", "ø": "o", "Ł": "L", "ł": "l",
    "Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe", "ß": "SS",
})

def extract_abbreviation_from_name(name: str, skip_words: set[str]) -> str:
    """
    Extract abbreviation from institution name.

    Args:
        name: Full institution name (emic)
        skip_words: Set of lowercase prepositions/articles to skip

    Returns:
        Uppercase abbreviation with only A-Z characters
    """
    # Step 1: Normalize unicode (remove diacritics)
    normalized = unicodedata.normalize('NFD', name.translate(NON_DECOMPOSING))
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

    # Step 2: Replace special characters with spaces (to split words)
    # This handles cases like "Records&Information" -> "Records Information"
    clean_name = re.sub(r'[^a-zA-Z\s]', ' ', ascii_name)

    # Step 3: Split into words
    words = clean_name.split()

    # Step 4: Filter out skip words (prepositions, articles)
    significant_words = [w for w in words if w.lower() not in skip_words]

    # Step 5: Take first letter of each significant word
    abbreviation = ''.join(w[0].upper() for w in significant_words if w)

    # Step 6: Limit to 10 characters
    return abbreviation[:10]
Handling Special Cases
Case 1: "Records & Information Management"
- Input: "Records & Information Management"
- After special char removal: "Records Information Management"
- After split: ["Records", "Information", "Management"]
- Abbreviation: RIM
Case 2: "Art/Design Museum"
- Input: "Art/Design Museum"
- After special char removal: "Art Design Museum"
- After split: ["Art", "Design", "Museum"]
- Abbreviation: ADM
Case 3: "Culture+"
- Input: "Culture+"
- After special char removal: "Culture"
- After split: ["Culture"]
- Abbreviation: C
Examples
| Institution Name | Correct | Incorrect |
|---|---|---|
| Department of Records & Information Management | DRIM | DR&IM |
| Art + Culture Center | ACC | A+CC |
| Museum/Gallery Amsterdam | MGA | M/GA |
| Heritage@Digital | HD | H@D |
| Archives (Historical) | AH | A(H) |
| Research & Development Institute | RDI | R&DI |
| Sint Maarten Records & Information | SMRI | SMR&I |
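The table can double as a regression check. The sketch below condenses the extraction steps into a single helper and runs it over every row. `abbreviate`, `SKIP_WORDS`, and `EXPECTED` are names made up for this example, and the skip-word set is an assumption, since the project's actual list is not shown in this document.

```python
import re
import unicodedata

SKIP_WORDS = {"of", "the", "and"}  # assumed for illustration; not the project's real list

def abbreviate(name: str, skip_words: set[str] = SKIP_WORDS) -> str:
    """Condensed sketch: strip diacritics, split on non-letters, take first letters."""
    decomposed = unicodedata.normalize("NFD", name)
    ascii_name = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    words = re.sub(r"[^a-zA-Z\s]", " ", ascii_name).split()
    return "".join(w[0].upper() for w in words if w.lower() not in skip_words)[:10]

# Rows copied from the examples table: institution name -> correct abbreviation
EXPECTED = {
    "Department of Records & Information Management": "DRIM",
    "Art + Culture Center": "ACC",
    "Museum/Gallery Amsterdam": "MGA",
    "Heritage@Digital": "HD",
    "Archives (Historical)": "AH",
    "Research & Development Institute": "RDI",
    "Sint Maarten Records & Information": "SMRI",
}

mismatches = {n: abbreviate(n) for n, want in EXPECTED.items() if abbreviate(n) != want}
print(mismatches or "all examples match")
```

Keeping the expected values next to the names means a future change to the algorithm that alters any documented abbreviation shows up immediately.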
Real-World Case Study
File: data/custodian/SX-XX-PHI-O-DR&IMSM.yaml
Institution: Department of Records & Information Management of Sint Maarten
Current (INCORRECT):
ghcid:
  ghcid_current: SX-XX-PHI-O-DR&IMSM  # Contains "&"
Required (CORRECT):
ghcid:
  ghcid_current: SX-XX-PHI-O-DRIMSM  # Alphabetic only
Derivation:
- Full name: "Department of Records & Information Management of Sint Maarten"
- Skip prepositions: "of" (appears twice)
- Handle special character: "&" → split "Records & Information" into separate words
- Significant words: Department, Records, Information, Management, Sint, Maarten
- First letters: D, R, I, M, S, M → "DRIMSM"
- Maximum 10 chars: "DRIMSM" (6 chars, OK)
Note: File should be renamed from SX-XX-PHI-O-DR&IMSM.yaml to SX-XX-PHI-O-DRIMSM.yaml
Validation
Check for Invalid Abbreviations
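# Find GHCID files whose ghcid_current value contains a special character.
# The pattern is anchored to the key: a bare character class would match
# every YAML file, since all of them contain ":".
find data/custodian -name "*.yaml" -print0 | xargs -0 grep -lE 'ghcid_current: *[^ ]*[&+@#%$*|:;?!=~^<>]' | head -20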
# Specifically check for & in filenames
find data/custodian -name "*&*.yaml"
Programmatic Validation
import re

def validate_abbreviation(abbrev: str) -> bool:
    """
    Validate that abbreviation contains only A-Z.

    Returns True if valid, False otherwise (empty string, special
    characters, diacritics, lowercase letters, or digits).
    """
    # fullmatch avoids the "$" quirk where "ABC\n" would pass r'^[A-Z]+$'
    return bool(re.fullmatch(r'[A-Z]+', abbrev))

# Examples
validate_abbreviation("DRIMSM")   # True - valid
validate_abbreviation("DR&IMSM")  # False - contains &
validate_abbreviation("A+CC")     # False - contains +
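To apply the check to files rather than bare abbreviations, the abbreviation first has to be cut out of the GHCID. The sketch below assumes the five-part layout seen in the examples (COUNTRY-REGION-CITY-TYPE-ABBREV) and assumes that a collision suffix, if present, follows an underscore; both are assumptions for illustration, and `abbreviation_of` and `invalid_files` are hypothetical helper names.

```python
import re
from pathlib import Path

def abbreviation_of(ghcid: str) -> str:
    """Return the fifth hyphen-separated component, minus any "_suffix" (assumed layout)."""
    parts = ghcid.split("-", 4)
    if len(parts) < 5:
        return ""  # not a five-part GHCID; treated as invalid by the caller
    return parts[4].split("_", 1)[0]

def invalid_files(filenames: list[str]) -> list[str]:
    """Return the filenames whose abbreviation component is not purely A-Z."""
    return [
        f for f in filenames
        if not re.fullmatch(r"[A-Z]+", abbreviation_of(Path(f).stem))
    ]

print(invalid_files(["SX-XX-PHI-O-DR&IMSM.yaml", "SX-XX-PHI-O-DRIMSM.yaml"]))
```

Splitting with a maximum of four cuts keeps any stray hyphen inside the abbreviation, so it is reported as invalid instead of being silently dropped.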
Migration
For existing files with special characters in abbreviations:
- Identify affected files (single quotes keep the shell from expanding `$*` inside the pattern):
  find data/custodian -name '*[&+@#%$*|]*' -type f
- For each file:
  - Read the file
  - Regenerate the abbreviation following this rule
  - Update ghcid_current and ghcid_original
  - Add an entry to ghcid_history documenting the correction
  - Rename the file to match the new GHCID
- Update GHCID history:
  ghcid_history:
    - ghcid: SX-XX-PHI-O-DRIMSM
      ghcid_numeric: <new_numeric>
      valid_from: "2025-12-07T00:00:00Z"
      reason: "Corrected abbreviation to remove special character '&' per ABBREV-SPECIAL-CHAR rule"
    - ghcid: SX-XX-PHI-O-DR&IMSM
      ghcid_numeric: 12942843033761857211
      valid_from: "2025-12-06T21:07:22.140567+00:00"
      valid_to: "2025-12-07T00:00:00Z"
      reason: "Initial GHCID (contained invalid character)"
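The history update can be sketched as a pure function over plain dictionaries. This is an illustration under assumptions, not the project's migration code: `correct_history` is a hypothetical helper, the history list is assumed to be non-empty and ordered newest first, and `ghcid_numeric` is left as `None` because deriving it is outside the scope of this rule.

```python
def correct_history(history: list[dict], new_ghcid: str,
                    corrected_at: str, reason: str) -> list[dict]:
    """Close the current (first) entry and prepend the corrected one."""
    current, *older = history  # assumes newest-first, non-empty history
    closed = {**current, "valid_to": corrected_at}  # copy; input is not mutated
    new_entry = {
        "ghcid": new_ghcid,
        "ghcid_numeric": None,  # placeholder: computed elsewhere
        "valid_from": corrected_at,
        "reason": reason,
    }
    return [new_entry, closed, *older]

history = [{
    "ghcid": "SX-XX-PHI-O-DR&IMSM",
    "ghcid_numeric": 12942843033761857211,
    "valid_from": "2025-12-06T21:07:22.140567+00:00",
    "reason": "Initial GHCID (contained invalid character)",
}]
updated = correct_history(
    history,
    new_ghcid="SX-XX-PHI-O-DRIMSM",
    corrected_at="2025-12-07T00:00:00Z",
    reason="Corrected abbreviation to remove special character '&' per ABBREV-SPECIAL-CHAR rule",
)
print([entry["ghcid"] for entry in updated])
```

Using the same timestamp for the old entry's `valid_to` and the new entry's `valid_from` leaves no gap or overlap between the two validity periods.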
Related Documentation
- AGENTS.md - Section "INSTITUTION ABBREVIATION: EMIC NAME FIRST-LETTER PROTOCOL"
- schemas/20251121/linkml/modules/classes/CustodianName.yaml - Schema description
- .opencode/LEGAL_FORM_FILTERING_RULE.md - Related filtering rule for legal forms
- docs/PERSISTENT_IDENTIFIERS.md - GHCID specification
Changelog
| Date | Change |
|---|---|
| 2025-12-07 | Initial rule created after discovery of & in GHCID |
| 2025-12-08 | Added diacritics rule (ABBREV-DIACRITICS) |