# Abbreviation Special Character Filtering Rule

**Rule ID**: ABBREV-SPECIAL-CHAR  
**Status**: MANDATORY  
**Applies To**: GHCID abbreviation component generation  
**Created**: 2025-12-07

---

## Summary

**When generating abbreviations for GHCID, special characters and symbols MUST be completely removed. Only alphabetic characters (A-Z) are permitted in the abbreviation component of the GHCID.**

This is a **MANDATORY** rule. Abbreviations containing special characters are INVALID and must be regenerated.

---

## Rationale

### 1. URL/URI Safety

Special characters require percent-encoding in URIs. For example:

- `&` becomes `%26`
- `+` becomes `%2B`

This makes identifiers harder to share, copy, and verify.

### 2. Filename Safety

Many special characters are invalid in filenames across operating systems:

- Windows: `\ / : * ? " < > |`
- macOS/Linux: `/` and null bytes

Files like `SX-XX-PHI-O-DR&IMSM.yaml` must be quoted or escaped in most shells (an unquoted `&` starts a background job) and may cause issues on some systems.

### 3. Parsing Consistency

Special characters can conflict with delimiters in data pipelines:

- `&` is used in query strings
- `:` is used in YAML, JSON
- `/` is a path separator
- `|` is a common CSV delimiter alternative

### 4. Cross-System Compatibility

Identifiers should work across all systems:

- Databases (SQL, TypeDB, Neo4j)
- RDF/SPARQL endpoints
- REST APIs
- Command-line tools
- Spreadsheets

### 5. Human Readability

Clean identifiers are easier to:

- Communicate verbally
- Type correctly
- Proofread
- Remember

---

## Characters to Remove

The following characters MUST NOT appear in an abbreviation. They are dropped, not substituted with another character; during extraction they act only as word boundaries:

| Character | Name | Example Issue |
|-----------|------|---------------|
| `&` | Ampersand | "R&A" in URLs, HTML entities |
| `/` | Slash | Path separator confusion |
| `\` | Backslash | Escape sequence issues |
| `+` | Plus | URL encoding (`+` = space) |
| `@` | At sign | Email/handle confusion |
| `#` | Hash/Pound | Fragment identifier in URLs |
| `%` | Percent | URL encoding prefix |
| `$` | Dollar | Variable prefix in shells |
| `*` | Asterisk | Glob/wildcard character |
| `(` `)` | Parentheses | Grouping in regex, code |
| `[` `]` | Square brackets | Array notation |
| `{` `}` | Curly braces | Object notation |
| `\|` | Pipe | Command chaining, OR operator |
| `:` | Colon | YAML key-value, namespace separator |
| `;` | Semicolon | Statement terminator |
| `"` `'` `` ` `` | Quotes | String delimiters |
| `,` | Comma | List separator |
| `.` | Period | File extension, namespace |
| `-` | Hyphen | Already used as GHCID component separator |
| `_` | Underscore | Reserved for name suffix in collisions |
| `=` | Equals | Assignment operator |
| `?` | Question mark | Query string indicator |
| `!` | Exclamation | Negation, shell history |
| `~` | Tilde | Home directory, bitwise NOT |
| `^` | Caret | Regex anchor, power operator |
| `<` `>` | Angle brackets | HTML tags, redirects |

---

## Implementation

### Algorithm

When extracting an abbreviation from an institution name:

```python
import re
import unicodedata


def extract_abbreviation_from_name(name: str, skip_words: set[str]) -> str:
    """
    Extract abbreviation from institution name.

    Args:
        name: Full institution name (emic)
        skip_words: Set of lowercase prepositions/articles to skip

    Returns:
        Uppercase abbreviation with only A-Z characters
    """
    # Step 1: Normalize unicode (remove diacritics)
    normalized = unicodedata.normalize('NFD', name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

    # Step 2: Replace special characters with spaces (to split words)
    # This handles cases like "Records&Information" -> "Records Information"
    # Digits and letters without an ASCII decomposition (e.g. "ø") are replaced too
    clean_name = re.sub(r'[^a-zA-Z\s]', ' ', ascii_name)

    # Step 3: Split into words
    words = clean_name.split()

    # Step 4: Filter out skip words (prepositions, articles)
    significant_words = [w for w in words if w.lower() not in skip_words]

    # Step 5: Take first letter of each significant word
    abbreviation = ''.join(w[0].upper() for w in significant_words if w)

    # Step 6: Limit to 10 characters
    return abbreviation[:10]
```

### Handling Special Cases

**Case 1: "Records & Information Management"**

1. Input: `"Records & Information Management"`
2. After special char removal: `"Records Information Management"`
3. After split: `["Records", "Information", "Management"]`
4. Abbreviation: `RIM`

**Case 2: "Art/Design Museum"**

1. Input: `"Art/Design Museum"`
2. After special char removal: `"Art Design Museum"`
3. After split: `["Art", "Design", "Museum"]`
4. Abbreviation: `ADM`

**Case 3: "Culture+"**

1. Input: `"Culture+"`
2. After special char removal: `"Culture"`
3. After split: `["Culture"]`
4. Abbreviation: `C`

---

## Examples

| Institution Name | Correct | Incorrect |
|------------------|---------|-----------|
| Department of Records & Information Management | DRIM | DR&IM |
| Art + Culture Center | ACC | A+CC |
| Museum/Gallery Amsterdam | MGA | M/GA |
| Heritage@Digital | HD | H@D |
| Archives (Historical) | AH | A(H) |
| Research & Development Institute | RDI | R&DI |
| Sint Maarten Records & Information | SMRI | SMR&I |

---

## Real-World Case Study

**File**: `data/custodian/SX-XX-PHI-O-DR&IMSM.yaml`

**Institution**: Department of Records & Information Management of Sint Maarten

**Current (INCORRECT)**:

```yaml
ghcid:
  ghcid_current: SX-XX-PHI-O-DR&IMSM  # Contains "&"
```

**Required (CORRECT)**:

```yaml
ghcid:
  ghcid_current: SX-XX-PHI-O-DRIMSM  # Alphabetic only
```

**Derivation**:

1. Full name: "Department of Records & Information Management of Sint Maarten"
2. Skip prepositions: "of" (appears twice)
3. Handle special character: "&" → split "Records & Information" into separate words
4. Significant words: Department, Records, Information, Management, Sint, Maarten
5. First letters: D, R, I, M, S, M → "DRIMSM"
6. Maximum 10 chars: "DRIMSM" (6 chars, OK)

**Note**: File should be renamed from `SX-XX-PHI-O-DR&IMSM.yaml` to `SX-XX-PHI-O-DRIMSM.yaml`

---

## Validation

### Check for Invalid Abbreviations

```bash
# Find files whose ghcid_current value contains a special character.
# The pattern is anchored to the ghcid_current line: an unanchored search
# would match every file, because ":" and "#" are ordinary YAML syntax.
grep -rlE --include='*.yaml' \
  'ghcid_current:[[:space:]]*[^[:space:]#]*[&+@#%$*|:;?!=~^<>]' \
  data/custodian | head -20

# Specifically check for & in filenames
find data/custodian -name "*&*.yaml"
```

### Programmatic Validation

```python
import re


def validate_abbreviation(abbrev: str) -> bool:
    """
    Validate that abbreviation contains only A-Z.

    Returns True if valid, False if it is empty or contains any other character.
    """
    # fullmatch is used because '^...$' with re.match also accepts a trailing newline
    return bool(re.fullmatch(r'[A-Z]+', abbrev))


# Examples
validate_abbreviation("DRIMSM")   # True - valid
validate_abbreviation("DR&IMSM")  # False - contains &
validate_abbreviation("A+CC")     # False - contains +
```

---

## Migration

For existing files with special characters in abbreviations:

1. **Identify affected files** (single quotes keep the shell from expanding `$*` inside the pattern):

   ```bash
   find data/custodian -name '*[&+@#%$*|]*' -type f
   ```

2. **For each file**:
   - Read the file
   - Regenerate the abbreviation following this rule
   - Update `ghcid_current` and `ghcid_original`
   - Add entry to `ghcid_history` documenting the correction
   - Rename the file to match new GHCID

3. **Update GHCID history**:

   ```yaml
   ghcid_history:
     - ghcid: SX-XX-PHI-O-DRIMSM
       ghcid_numeric:
       valid_from: "2025-12-07T00:00:00Z"
       reason: "Corrected abbreviation to remove special character '&' per ABBREV-SPECIAL-CHAR rule"
     - ghcid: SX-XX-PHI-O-DR&IMSM
       ghcid_numeric: 12942843033761857211
       valid_from: "2025-12-06T21:07:22.140567+00:00"
       valid_to: "2025-12-07T00:00:00Z"
       reason: "Initial GHCID (contained invalid character)"
   ```

---

## Related Documentation

- `AGENTS.md` - Section "INSTITUTION ABBREVIATION: EMIC NAME FIRST-LETTER PROTOCOL"
- `schemas/20251121/linkml/modules/classes/CustodianName.yaml` - Schema description
- `.opencode/LEGAL_FORM_FILTERING_RULE.md` - Related filtering rule for legal forms
- `docs/PERSISTENT_IDENTIFIERS.md` - GHCID specification

---

## Changelog

| Date | Change |
|------|--------|
| 2025-12-07 | Initial rule created after discovery of `&` in GHCID |