# Abbreviation Special Character Filtering Rule

**Rule ID**: ABBREV-SPECIAL-CHAR  
**Status**: MANDATORY  
**Applies To**: GHCID abbreviation component generation  
**Created**: 2025-12-07  

---

## Summary

**When generating abbreviations for GHCID, special characters and symbols MUST be completely removed. Only alphabetic characters (A-Z) are permitted in the abbreviation component of the GHCID.**

This is a **MANDATORY** rule. Abbreviations containing special characters are INVALID and must be regenerated.

---

## Rationale

### 1. URL/URI Safety
Special characters require percent-encoding in URIs. For example:
- `&` becomes `%26`
- `+` becomes `%2B`

This makes identifiers harder to share, copy, and verify.
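
As a small illustration, the sketch below uses the standard library's `urllib.parse.quote` to show how identifiers from this document change once they are percent-encoded, while an alphabetic-only abbreviation passes through untouched. Passing `safe=""` is a choice made for this example so that every reserved character is encoded.

```python
from urllib.parse import quote

# Identifiers with special characters change when placed in a URI.
encoded_amp = quote("SX-XX-PHI-O-DR&IMSM", safe="")
encoded_plus = quote("A+CC", safe="")

# An alphabetic-only abbreviation is identical before and after encoding.
encoded_clean = quote("SX-XX-PHI-O-DRIMSM", safe="")

print(encoded_amp)    # SX-XX-PHI-O-DR%26IMSM
print(encoded_plus)   # A%2BCC
print(encoded_clean)  # SX-XX-PHI-O-DRIMSM
```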

### 2. Filename Safety
Many special characters are invalid in filenames across operating systems:
- Windows: `\ / : * ? " < > |`
- macOS/Linux: `/` and null bytes

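A file name such as `SX-XX-PHI-O-DR&IMSM.yaml` is legal on most file systems, but `&` is a control operator in POSIX shells and in Windows `cmd.exe`, so the name must be quoted or escaped in every command that touches it.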

### 3. Parsing Consistency
Special characters can conflict with delimiters in data pipelines:
- `&` is used in query strings
- `:` separates keys from values in YAML and JSON
- `/` is a path separator
- `|` is a common CSV delimiter alternative
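
The query-string conflict is easy to reproduce. In this sketch, `urllib.parse.parse_qs` from the standard library parses a query string that carries an unencoded GHCID; the parameter name `ghcid` is a hypothetical choice for the example.

```python
from urllib.parse import parse_qs

# A GHCID containing "&" passed unencoded as a query parameter.
params = parse_qs("ghcid=SX-XX-PHI-O-DR&IMSM")

# The parser splits on "&", so the value is truncated at that point;
# the remainder "IMSM" has no "=" and is dropped by default.
print(params)  # {'ghcid': ['SX-XX-PHI-O-DR']}
```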

### 4. Cross-System Compatibility
Identifiers should work across all systems:
- Databases (SQL, TypeDB, Neo4j)
- RDF/SPARQL endpoints
- REST APIs
- Command-line tools
- Spreadsheets

### 5. Human Readability
Clean identifiers are easier to:
- Communicate verbally
- Type correctly
- Proofread
- Remember

---

## Characters to Remove

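The following characters MUST be completely removed when generating abbreviations. They are never replaced with a substitute character in the result; during extraction they only act as word boundaries, so `Records&Information` contributes `R` and `I`: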

| Character | Name | Example Issue |
|-----------|------|---------------|
| `&` | Ampersand | "R&A" in URLs, HTML entities |
| `/` | Slash | Path separator confusion |
| `\` | Backslash | Escape sequence issues |
| `+` | Plus | URL encoding (`+` = space) |
| `@` | At sign | Email/handle confusion |
| `#` | Hash/Pound | Fragment identifier in URLs |
| `%` | Percent | URL encoding prefix |
| `$` | Dollar | Variable prefix in shells |
| `*` | Asterisk | Glob/wildcard character |
| `(` `)` | Parentheses | Grouping in regex, code |
| `[` `]` | Square brackets | Array notation |
| `{` `}` | Curly braces | Object notation |
| `\|` | Pipe | Command chaining, OR operator |
| `:` | Colon | YAML key-value, namespace separator |
| `;` | Semicolon | Statement terminator |
| `"` `'` `` ` `` | Quotes | String delimiters |
| `,` | Comma | List separator |
| `.` | Period | File extension, namespace |
| `-` | Hyphen | Already used as GHCID component separator |
| `_` | Underscore | Reserved for name suffix in collisions |
| `=` | Equals | Assignment operator |
| `?` | Question mark | Query string indicator |
| `!` | Exclamation | Negation, shell history |
| `~` | Tilde | Home directory, bitwise NOT |
| `^` | Caret | Regex anchor, power operator |
| `<` `>` | Angle brackets | HTML tags, redirects |

---

## Implementation

### Algorithm

When extracting an abbreviation from an institution name:

```python
import re
import unicodedata

def extract_abbreviation_from_name(name: str, skip_words: set) -> str:
    """
    Extract abbreviation from institution name.
    
    Args:
        name: Full institution name (emic)
        skip_words: Set of prepositions/articles to skip
    
    Returns:
        Uppercase abbreviation with only A-Z characters
    """
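    # Step 1: Normalize unicode (decompose, then drop combining marks to remove diacritics).
    # Letters without a decomposition (e.g. "ø", "ß") pass through unchanged here
    # and are treated like any other non-A-Z character in Step 2.
    normalized = unicodedata.normalize('NFD', name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')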
    
    # Step 2: Replace special characters with spaces (to split words)
    # This handles cases like "Records&Information" -> "Records Information"
    clean_name = re.sub(r'[^a-zA-Z\s]', ' ', ascii_name)
    
    # Step 3: Split into words
    words = clean_name.split()
    
    # Step 4: Filter out skip words (prepositions, articles)
    significant_words = [w for w in words if w.lower() not in skip_words]
    
    # Step 5: Take first letter of each significant word
    abbreviation = ''.join(w[0].upper() for w in significant_words if w)
    
    # Step 6: Limit to 10 characters
    return abbreviation[:10]
```

### Handling Special Cases

**Case 1: "Records & Information Management"**
1. Input: `"Records & Information Management"`
2. After special char removal: `"Records   Information Management"`
3. After split: `["Records", "Information", "Management"]`
4. Abbreviation: `RIM`

**Case 2: "Art/Design Museum"**
1. Input: `"Art/Design Museum"`
2. After special char removal: `"Art Design Museum"`
3. After split: `["Art", "Design", "Museum"]`
4. Abbreviation: `ADM`

**Case 3: "Culture+"**
1. Input: `"Culture+"`
2. After special char removal: `"Culture "` (the `+` becomes a trailing space)
3. After split: `["Culture"]`
4. Abbreviation: `C`
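
The three cases can be checked mechanically. The sketch below restates the algorithm from this section in compact form so that it runs on its own; `SKIP_WORDS` is an assumed minimal skip list for illustration, not the project's actual list.

```python
import re
import unicodedata

# Assumed minimal skip list for illustration; the real list is defined elsewhere.
SKIP_WORDS = {"of", "the", "and", "for"}

def extract_abbreviation_from_name(name: str, skip_words: set) -> str:
    """Compact restatement of the algorithm in this section."""
    normalized = unicodedata.normalize("NFD", name)
    stripped = "".join(c for c in normalized if unicodedata.category(c) != "Mn")
    # Non-letters become spaces so that "Art/Design" splits into two words.
    words = re.sub(r"[^a-zA-Z\s]", " ", stripped).split()
    significant = [w for w in words if w.lower() not in skip_words]
    return "".join(w[0].upper() for w in significant)[:10]

cases = {
    "Records & Information Management": "RIM",
    "Art/Design Museum": "ADM",
    "Culture+": "C",
}
for name, expected in cases.items():
    result = extract_abbreviation_from_name(name, SKIP_WORDS)
    print(f"{name!r} -> {result}")
    assert result == expected
```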

---

## Examples

| Institution Name | Correct | Incorrect |
|------------------|---------|-----------|
| Department of Records & Information Management | DRIM | DR&IM |
| Art + Culture Center | ACC | A+CC |
| Museum/Gallery Amsterdam | MGA | M/GA |
| Heritage@Digital | HD | H@D |
| Archives (Historical) | AH | A(H) |
| Research & Development Institute | RDI | R&DI |
| Sint Maarten Records & Information | SMRI | SMR&I |

---

## Real-World Case Study

**File**: `data/custodian/SX-XX-PHI-O-DR&IMSM.yaml`

**Institution**: Department of Records & Information Management of Sint Maarten

**Current (INCORRECT)**:
```yaml
ghcid:
  ghcid_current: SX-XX-PHI-O-DR&IMSM  # Contains "&"
```

**Required (CORRECT)**:
```yaml
ghcid:
  ghcid_current: SX-XX-PHI-O-DRIMSM   # Alphabetic only
```

**Derivation**:
1. Full name: "Department of Records & Information Management of Sint Maarten"
2. Skip prepositions: "of" (appears twice)
3. Handle special character: "&" → split "Records & Information" into separate words
4. Significant words: Department, Records, Information, Management, Sint, Maarten
5. First letters: D, R, I, M, S, M → "DRIMSM"
6. Maximum 10 chars: "DRIMSM" (6 chars, OK)

**Note**: The file should be renamed from `SX-XX-PHI-O-DR&IMSM.yaml` to `SX-XX-PHI-O-DRIMSM.yaml`.

---

## Validation

### Check for Invalid Abbreviations

```bash
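# Find GHCID files whose ghcid_current value contains a special character.
# Restricting the match to the ghcid_current line avoids flagging every file
# for ordinary YAML syntax such as ':'.
find data/custodian -name "*.yaml" -print0 \
  | xargs -0 grep -lE 'ghcid_current:[[:space:]]*[^[:space:]]*[&+@#%$*|:;?!=~^<>]' \
  | head -20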

# Specifically check for & in filenames
find data/custodian -name "*&*.yaml"
```

### Programmatic Validation

```python
import re

def validate_abbreviation(abbrev: str) -> bool:
    """
    Validate that abbreviation contains only A-Z.
    
    Returns True if valid, False if contains special characters.
    """
    # fullmatch rejects a trailing newline, which '^...$' with re.match would accept
    return bool(re.fullmatch(r'[A-Z]+', abbrev))

# Examples
validate_abbreviation("DRIMSM")   # True - valid
validate_abbreviation("DR&IMSM")  # False - contains &
validate_abbreviation("A+CC")     # False - contains +
```
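
A portable alternative to the shell checks is to scan file names from Python. This is a minimal sketch under stated assumptions: the function name `find_invalid_filenames` is hypothetical, and `INVALID_CHARS` mirrors the "Characters to Remove" table minus `-` and `_` (which the table marks as structural parts of a GHCID) and `.` (assumed harmless in a file name).

```python
from pathlib import Path

# Characters from the "Characters to Remove" table, minus "-", "_" and ".".
# "/" is omitted because it cannot appear inside a file name.
INVALID_CHARS = set("&\\+@#%$*()[]{}|:;\"'`,=?!~^<>")

def find_invalid_filenames(root: str) -> list[Path]:
    """Return YAML files under root whose name contains an invalid character."""
    return sorted(
        path
        for path in Path(root).rglob("*.yaml")
        if INVALID_CHARS & set(path.stem)
    )

# Example (the directory name is taken from this document):
# for path in find_invalid_filenames("data/custodian"):
#     print(path)
```

Checking the file name only catches GHCIDs that already reached disk under an invalid name; `validate_abbreviation` remains the check to run at generation time.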

---

## Migration

For existing files with special characters in abbreviations:

1. **Identify affected files**:
   ```bash
   find data/custodian -type f -name '*[&+@#%$*|]*'
   ```

2. **For each file**:
   - Read the file
   - Regenerate the abbreviation following this rule
   - Update `ghcid_current` and `ghcid_original`
   - Add entry to `ghcid_history` documenting the correction
   - Rename the file to match new GHCID

3. **Update GHCID history**:
   ```yaml
   ghcid_history:
     - ghcid: SX-XX-PHI-O-DRIMSM
       ghcid_numeric: <new_numeric>
       valid_from: "2025-12-07T00:00:00Z"
       reason: "Corrected abbreviation to remove special character '&' per ABBREV-SPECIAL-CHAR rule"
     - ghcid: SX-XX-PHI-O-DR&IMSM
       ghcid_numeric: 12942843033761857211
       valid_from: "2025-12-06T21:07:22.140567+00:00"
       valid_to: "2025-12-07T00:00:00Z"
       reason: "Initial GHCID (contained invalid character)"
   ```
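
The history update in steps 2 and 3 could be sketched as follows, operating on the `ghcid` mapping after the YAML has been loaded. The function name `record_ghcid_correction` and its parameters are hypothetical, and the sketch leaves `ghcid_numeric` to the caller because its derivation is not described in this rule.

```python
from datetime import datetime, timezone

def record_ghcid_correction(ghcid_block: dict, new_ghcid: str, reason: str,
                            corrected_at: datetime) -> dict:
    """Return a copy of ghcid_block with the corrected GHCID and updated history."""
    timestamp = corrected_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    history = [dict(entry) for entry in ghcid_block.get("ghcid_history", [])]
    # Close any entry that is still open (has no valid_to yet).
    for entry in history:
        if "valid_to" not in entry:
            entry["valid_to"] = timestamp
    new_entry = {"ghcid": new_ghcid, "valid_from": timestamp, "reason": reason}
    updated = dict(ghcid_block)
    # Step 2 directs updating both fields.
    updated["ghcid_current"] = new_ghcid
    updated["ghcid_original"] = new_ghcid
    updated["ghcid_history"] = [new_entry] + history  # newest entry first
    return updated

block = {
    "ghcid_current": "SX-XX-PHI-O-DR&IMSM",
    "ghcid_original": "SX-XX-PHI-O-DR&IMSM",
    "ghcid_history": [{
        "ghcid": "SX-XX-PHI-O-DR&IMSM",
        "valid_from": "2025-12-06T21:07:22.140567+00:00",
        "reason": "Initial GHCID (contained invalid character)",
    }],
}
fixed = record_ghcid_correction(
    block,
    new_ghcid="SX-XX-PHI-O-DRIMSM",
    reason="Corrected abbreviation to remove special character '&' per ABBREV-SPECIAL-CHAR rule",
    corrected_at=datetime(2025, 12, 7, tzinfo=timezone.utc),
)
print(fixed["ghcid_history"][0]["valid_from"])  # 2025-12-07T00:00:00Z
```

Returning a copy rather than mutating the input keeps the original record available for comparison until the renamed file has been written.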

---

## Related Documentation

- `AGENTS.md` - Section "INSTITUTION ABBREVIATION: EMIC NAME FIRST-LETTER PROTOCOL"
- `schemas/20251121/linkml/modules/classes/CustodianName.yaml` - Schema description
- `.opencode/LEGAL_FORM_FILTERING_RULE.md` - Related filtering rule for legal forms
- `docs/PERSISTENT_IDENTIFIERS.md` - GHCID specification

---

## Changelog

| Date | Change |
|------|--------|
| 2025-12-07 | Initial rule created after discovery of `&` in GHCID |