# Abbreviation Character Filtering Rules

**Rule ID**: ABBREV-CHAR-FILTER  
**Status**: MANDATORY  
**Applies To**: GHCID abbreviation component generation  
**Created**: 2025-12-07  
**Updated**: 2025-12-08 (added diacritics rule)

---

## Summary

**When generating abbreviations for GHCID, ONLY ASCII uppercase letters (A-Z) are permitted. Special characters MUST be removed and diacritics MUST be normalized to ASCII.**

This is a **MANDATORY** rule. Abbreviations containing special characters or diacritics are INVALID and must be regenerated.

### Two Mandatory Sub-Rules:

1. **ABBREV-DIACRITICS**: Normalize all diacritics to ASCII equivalents
2. **ABBREV-SPECIAL-CHAR**: Remove all special characters and symbols

---

## Rule 1: Diacritics MUST Be Normalized to ASCII (ABBREV-DIACRITICS)

**Diacritics (accented characters) MUST be normalized to their ASCII base letter equivalents.**

### Example (Real Case)

```
❌ WRONG:   CZ-VY-TEL-L-VHSPAOČRZS  (contains Č)
✅ CORRECT: CZ-VY-TEL-L-VHSPAOCRZS  (ASCII only)
```

### Diacritics Normalization Table

| Diacritic | ASCII | Example |
|-----------|-------|---------|
| Á, À, Â, Ã, Ä, Å, Ā | A | "Ålborg" → A |
| Č, Ć, Ç | C | "Český" → C |
| Ď | D | "Ďáblice" → D |
| É, È, Ê, Ë, Ě, Ē | E | "Éire" → E |
| Í, Ì, Î, Ï, Ī | I | "Ísland" → I |
| Ñ, Ń, Ň | N | "España" → N |
| Ó, Ò, Ô, Õ, Ö, Ø, Ō | O | "Österreich" → O |
| Ř | R | "Říčany" → R |
| Š, Ś, Ş | S | "Šumperk" → S |
| Ť | T | "Ťažký" → T |
| Ú, Ù, Û, Ü, Ů, Ū | U | "Ústí" → U |
| Ý, Ÿ | Y | "Ýmir" → Y |
| Ž, Ź, Ż | Z | "Žilina" → Z |
| Ł | L | "Łódź" → L |
| Æ | AE | "Ærø" → AE |
| Œ | OE | "Œuvre" → OE |
| ß | SS | "Straße" → SS |

The table is illustrative, not exhaustive: any letter that carries a combining mark is reduced to its base letter.

### Implementation

```python
import unicodedata

# Letters from the table above that have no Unicode decomposition.
# NFD leaves them unchanged, so they need an explicit mapping.
NON_DECOMPOSABLE = str.maketrans({
    "Ł": "L", "ł": "l",
    "Ø": "O", "ø": "o",
    "Æ": "AE", "æ": "ae",
    "Œ": "OE", "œ": "oe",
    "ß": "ss",  # becomes "SS" once the abbreviation is uppercased
})


def normalize_diacritics(text: str) -> str:
    """
    Normalize diacritics to ASCII equivalents.

    Examples:
        "Č" → "C"
        "Ř" → "R"
        "Ö" → "O"
        "ñ" → "n"
        "Ł" → "L"
    """
    # Map the letters that NFD cannot decompose (Ł, Ø, Æ, Œ, ß)
    text = text.translate(NON_DECOMPOSABLE)

    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', text)

    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

    return ascii_text


# Example
normalize_diacritics("VHSPAOČRZS")  # Returns "VHSPAOCRZS"
```

### Languages Commonly Affected

| Language | Common Diacritics | Example Institution |
|----------|-------------------|---------------------|
| **Czech** | Č, Ř, Š, Ž, Ě, Ů | Vlastivědné muzeum → VM (not VM with háček) |
| **Polish** | Ł, Ń, Ó, Ś, Ź, Ż, Ą, Ę | Biblioteka Łódzka → BL |
| **German** | Ä, Ö, Ü, ß | Österreichische Nationalbibliothek → ON |
| **French** | É, È, Ê, Ç, Ô | Bibliothèque nationale → BN |
| **Spanish** | Ñ, Á, É, Í, Ó, Ú | Museo Nacional → MN |
| **Portuguese** | Ã, Õ, Ç, Á, É | Biblioteca Nacional → BN |
| **Nordic** | Å, Ä, Ö, Ø, Æ | Nationalmuseet → N |
| **Turkish** | Ç, Ğ, İ, Ö, Ş, Ü | İstanbul Üniversitesi → IU |
| **Hungarian** | Á, É, Í, Ó, Ö, Ő, Ú, Ü, Ű | Országos Levéltár → OL |
| **Romanian** | Ă, Â, Î, Ș, Ț | Biblioteca Națională → BN |

---

## Rule 2: Special Characters MUST Be Removed (ABBREV-SPECIAL-CHAR)

**Special characters and symbols MUST be removed from abbreviations.** The rationale, the full list of characters, and the extraction algorithm follow in the sections below.

---

## Rationale

### 1. URL/URI Safety

Special characters require percent-encoding in URIs. For example:

- `&` becomes `%26`
- `+` becomes `%2B`

This makes identifiers harder to share, copy, and verify.

### 2. Filename Safety

Many special characters are invalid in filenames across operating systems:

- Windows: `\ / : * ? " < > |`
- macOS/Linux: `/` and null bytes

Files like `SX-XX-PHI-O-DR&IMSM.yaml` also need quoting or escaping in most shells, where `&` is a control operator.

### 3. Parsing Consistency

Special characters can conflict with delimiters in data pipelines:

- `&` is used in query strings
- `:` is used in YAML, JSON
- `/` is a path separator
- `|` is a common CSV delimiter alternative

### 4. Cross-System Compatibility

Identifiers should work across all systems:

- Databases (SQL, TypeDB, Neo4j)
- RDF/SPARQL endpoints
- REST APIs
- Command-line tools
- Spreadsheets

### 5. Human Readability

Clean identifiers are easier to:

- Communicate verbally
- Type correctly
- Proofread
- Remember

---

## Characters to Remove

The following characters MUST be completely removed when generating abbreviations. They are never substituted with another character in the result; during extraction they only act as word boundaries.

| Character | Name | Example Issue |
|-----------|------|---------------|
| `&` | Ampersand | "R&A" in URLs, HTML entities |
| `/` | Slash | Path separator confusion |
| `\` | Backslash | Escape sequence issues |
| `+` | Plus | URL encoding (`+` = space) |
| `@` | At sign | Email/handle confusion |
| `#` | Hash/Pound | Fragment identifier in URLs |
| `%` | Percent | URL encoding prefix |
| `$` | Dollar | Variable prefix in shells |
| `*` | Asterisk | Glob/wildcard character |
| `(` `)` | Parentheses | Grouping in regex, code |
| `[` `]` | Square brackets | Array notation |
| `{` `}` | Curly braces | Object notation |
| `\|` | Pipe | Command chaining, OR operator |
| `:` | Colon | YAML key-value, namespace separator |
| `;` | Semicolon | Statement terminator |
| `"` `'` `` ` `` | Quotes | String delimiters |
| `,` | Comma | List separator |
| `.` | Period | File extension, namespace |
| `-` | Hyphen | Already used as GHCID component separator |
| `_` | Underscore | Reserved for name suffix in collisions |
| `=` | Equals | Assignment operator |
| `?` | Question mark | Query string indicator |
| `!` | Exclamation | Negation, shell history |
| `~` | Tilde | Home directory, bitwise NOT |
| `^` | Caret | Regex anchor, power operator |
| `<` `>` | Angle brackets | HTML tags, redirects |

---

## Implementation

### Algorithm

When extracting abbreviation from institution name:

```python
import re
import unicodedata

# Letters with no Unicode decomposition; NFD leaves them unchanged
NON_DECOMPOSABLE = str.maketrans({
    "Ł": "L", "ł": "l", "Ø": "O", "ø": "o",
    "Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe", "ß": "ss",
})


def extract_abbreviation_from_name(name: str, skip_words: set) -> str:
    """
    Extract abbreviation from institution name.

    Args:
        name: Full institution name (emic)
        skip_words: Set of lowercase prepositions/articles to skip

    Returns:
        Uppercase abbreviation with only A-Z characters
    """
    # Step 1: Normalize unicode (remove diacritics)
    mapped = name.translate(NON_DECOMPOSABLE)
    normalized = unicodedata.normalize('NFD', mapped)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

    # Step 2: Replace special characters with spaces (to split words)
    # This handles cases like "Records&Information" -> "Records Information"
    clean_name = re.sub(r'[^a-zA-Z\s]', ' ', ascii_name)

    # Step 3: Split into words
    words = clean_name.split()

    # Step 4: Filter out skip words (prepositions, articles)
    significant_words = [w for w in words if w.lower() not in skip_words]

    # Step 5: Take first letter of each significant word
    abbreviation = ''.join(w[0].upper() for w in significant_words)

    # Step 6: Limit to 10 characters
    return abbreviation[:10]
```

### Handling Special Cases

**Case 1: "Records & Information Management"**

1. Input: `"Records & Information Management"`
2. After special char removal: `"Records Information Management"`
3. After split: `["Records", "Information", "Management"]`
4. Abbreviation: `RIM`

**Case 2: "Art/Design Museum"**

1. Input: `"Art/Design Museum"`
2. After special char removal: `"Art Design Museum"`
3. After split: `["Art", "Design", "Museum"]`
4. Abbreviation: `ADM`

**Case 3: "Culture+"**

1. Input: `"Culture+"`
2. After special char removal: `"Culture"`
3. After split: `["Culture"]`
4. Abbreviation: `C`

---

## Examples

| Institution Name | Correct | Incorrect |
|------------------|---------|-----------|
| Department of Records & Information Management | DRIM | DR&IM |
| Art + Culture Center | ACC | A+CC |
| Museum/Gallery Amsterdam | MGA | M/GA |
| Heritage@Digital | HD | H@D |
| Archives (Historical) | AH | A(H) |
| Research & Development Institute | RDI | R&DI |
| Sint Maarten Records & Information | SMRI | SMR&I |

---

## Real-World Case Study

**File**: `data/custodian/SX-XX-PHI-O-DR&IMSM.yaml`

**Institution**: Department of Records & Information Management of Sint Maarten

**Current (INCORRECT)**:

```yaml
ghcid:
  ghcid_current: SX-XX-PHI-O-DR&IMSM  # Contains "&"
```

**Required (CORRECT)**:

```yaml
ghcid:
  ghcid_current: SX-XX-PHI-O-DRIMSM  # Alphabetic only
```

**Derivation**:

1. Full name: "Department of Records & Information Management of Sint Maarten"
2. Skip prepositions: "of" (appears twice)
3. Handle special character: "&" → split "Records & Information" into separate words
4. Significant words: Department, Records, Information, Management, Sint, Maarten
5. First letters: D, R, I, M, S, M → "DRIMSM"
6. Maximum 10 chars: "DRIMSM" (6 chars, OK)

**Note**: File should be renamed from `SX-XX-PHI-O-DR&IMSM.yaml` to `SX-XX-PHI-O-DRIMSM.yaml`

---

## Validation

### Check for Invalid Abbreviations

```bash
# Find GHCID files whose ghcid_current value contains a special character.
# The match is limited to the ghcid_current line: matching ':' anywhere
# in the file would flag every YAML file.
find data/custodian -name "*.yaml" \
  -exec grep -lE 'ghcid_current: .*[&+@#%$*|:;?!=~^<>]' {} + | head -20

# Specifically check for & in filenames
find data/custodian -name "*&*.yaml"
```

### Programmatic Validation

```python
import re


def validate_abbreviation(abbrev: str) -> bool:
    """
    Validate that abbreviation contains only A-Z.

    Returns True if valid, False if it is empty or contains anything
    other than ASCII uppercase letters (special characters, diacritics).
    """
    # fullmatch, unlike match with '$', also rejects a trailing newline
    return bool(re.fullmatch(r'[A-Z]+', abbrev))


# Examples
validate_abbreviation("DRIMSM")      # True - valid
validate_abbreviation("DR&IMSM")     # False - contains &
validate_abbreviation("A+CC")        # False - contains +
validate_abbreviation("VHSPAOČRZS")  # False - contains Č
```

---

## Migration

For existing files with special characters in abbreviations:

1. **Identify affected files**:

   ```bash
   # Single quotes keep the shell from expanding $* inside the pattern
   find data/custodian -name '*[&+@#%$*|]*' -type f
   ```

2. **For each file**:
   - Read the file
   - Regenerate the abbreviation following this rule
   - Update `ghcid_current` and `ghcid_original`
   - Add entry to `ghcid_history` documenting the correction
   - Rename the file to match new GHCID

3. **Update GHCID history**:

   ```yaml
   ghcid_history:
     - ghcid: SX-XX-PHI-O-DRIMSM
       ghcid_numeric:
       valid_from: "2025-12-07T00:00:00Z"
       reason: "Corrected abbreviation to remove special character '&' per ABBREV-SPECIAL-CHAR rule"
     - ghcid: SX-XX-PHI-O-DR&IMSM
       ghcid_numeric: 12942843033761857211
       valid_from: "2025-12-06T21:07:22.140567+00:00"
       valid_to: "2025-12-07T00:00:00Z"
       reason: "Initial GHCID (contained invalid character)"
   ```

---

## Related Documentation

- `AGENTS.md` - Section "INSTITUTION ABBREVIATION: EMIC NAME FIRST-LETTER PROTOCOL"
- `schemas/20251121/linkml/modules/classes/CustodianName.yaml` - Schema description
- `.opencode/LEGAL_FORM_FILTERING_RULE.md` - Related filtering rule for legal forms
- `docs/PERSISTENT_IDENTIFIERS.md` - GHCID specification

---

## Changelog

| Date | Change |
|------|--------|
| 2025-12-07 | Initial rule created after discovery of `&` in GHCID |
| 2025-12-08 | Added diacritics rule (ABBREV-DIACRITICS) |
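
---

## Appendix: Filename Check Sketch

The validation commands earlier in this document work on the shell level. As a complement, here is a minimal Python sketch that applies the A-Z rule directly to GHCID filenames. It rests on two assumptions that this document does not confirm: that the abbreviation is the final hyphen-separated component of the GHCID, and that a collision suffix, if present, follows the first underscore. The names `abbreviation_of` and `invalid_ghcid_files` are hypothetical helpers, not part of any existing tooling.

```python
import re
from pathlib import PurePath

# Only ASCII uppercase letters, at most 10 of them (per the extraction rule)
ABBREV_RE = re.compile(r"[A-Z]{1,10}")


def abbreviation_of(filename: str) -> str:
    """Return the abbreviation component of a GHCID filename.

    Assumption: the abbreviation is the last hyphen-separated component,
    and anything after the first underscore is a collision suffix.
    """
    stem = PurePath(filename).stem           # drop directory and ".yaml"
    ghcid = stem.split("_", 1)[0]            # drop an assumed collision suffix
    return ghcid.rsplit("-", 1)[-1]          # keep the final component


def invalid_ghcid_files(filenames):
    """Return the filenames whose abbreviation is not pure A-Z."""
    return [
        f for f in filenames
        if not ABBREV_RE.fullmatch(abbreviation_of(f))
    ]


if __name__ == "__main__":
    sample = [
        "data/custodian/SX-XX-PHI-O-DR&IMSM.yaml",
        "data/custodian/SX-XX-PHI-O-DRIMSM.yaml",
        "data/custodian/CZ-VY-TEL-L-VHSPAOČRZS.yaml",
    ]
    for path in invalid_ghcid_files(sample):
        print(path)
```

One limitation of this sketch: an abbreviation that itself contains a hyphen would be split, and only its last fragment checked, so the shell-based checks remain useful as a second pass.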