2025-12-07 00:26:01 +01:00

Abbreviation Special Character Filtering Rule

Rule ID: ABBREV-SPECIAL-CHAR
Status: MANDATORY
Applies To: GHCID abbreviation component generation
Created: 2025-12-07

Summary

When generating abbreviations for GHCID, special characters and symbols MUST be completely removed. Only alphabetic characters (A-Z) are permitted in the abbreviation component of the GHCID.

This is a MANDATORY rule. Abbreviations containing special characters are INVALID and must be regenerated.

Rationale

1. URL/URI Safety

Special characters require percent-encoding in URIs. For example:

& becomes %26
+ becomes %2B

This makes identifiers harder to share, copy, and verify.
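
The effect can be reproduced with `urllib.parse.quote` from the Python standard library. This is a minimal sketch that uses identifiers already quoted in this document as input:

```python
from urllib.parse import quote

# safe="" asks quote() to escape every reserved character, including "/"
for raw in ("SX-XX-PHI-O-DR&IMSM", "A+CC", "SX-XX-PHI-O-DRIMSM"):
    print(raw, "->", quote(raw, safe=""))

# SX-XX-PHI-O-DR&IMSM -> SX-XX-PHI-O-DR%26IMSM
# A+CC -> A%2BCC
# SX-XX-PHI-O-DRIMSM -> SX-XX-PHI-O-DRIMSM
```

Only the alphabetic-only identifier survives encoding unchanged.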

2. Filename Safety

Many special characters are invalid in filenames across operating systems:

Windows: \ / : * ? " < > |
macOS/Linux: / and null bytes

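A filename such as SX-XX-PHI-O-DR&IMSM.yaml is legal on most filesystems, but an unquoted & is a control operator in POSIX shells and in cmd.exe, so the file is easy to mishandle in scripts and on the command line.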

3. Parsing Consistency

Special characters can conflict with delimiters in data pipelines:

& is used in query strings
: separates keys from values in YAML and JSON
/ is a path separator
| is a common CSV delimiter alternative
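
As an illustration of the first conflict, the sketch below passes an identifier through a query string without escaping it. The parameter name `id` is hypothetical; `parse_qs` is the standard-library parser:

```python
from urllib.parse import parse_qs

# Hypothetical query strings; "id" is an illustrative parameter name
broken = parse_qs("id=SX-XX-PHI-O-DR&IMSM")
clean = parse_qs("id=SX-XX-PHI-O-DRIMSM")

print(broken)  # {'id': ['SX-XX-PHI-O-DR']}
print(clean)   # {'id': ['SX-XX-PHI-O-DRIMSM']}
```

The parser treats & as a field separator, so the first identifier is silently truncated rather than rejected.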

4. Cross-System Compatibility

Identifiers should work across all systems:

Databases (SQL, TypeDB, Neo4j)
RDF/SPARQL endpoints
REST APIs
Command-line tools
Spreadsheets
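
For command-line tools specifically, one rough check is whether a name needs shell quoting at all. The sketch below uses `shlex.quote`, which returns its argument unchanged when it contains only characters that are safe in a POSIX shell:

```python
import shlex

for name in ("SX-XX-PHI-O-DRIMSM.yaml", "SX-XX-PHI-O-DR&IMSM.yaml"):
    quoted = shlex.quote(name)
    status = "safe as-is" if quoted == name else f"must be quoted as {quoted}"
    print(f"{name}: {status}")
```

This covers only POSIX shells; it says nothing about the other systems in the list above.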

5. Human Readability

Clean identifiers are easier to:

Communicate verbally
Type correctly
Proofread
Remember

Characters to Remove

The following characters MUST NOT appear in a generated abbreviation. They are dropped outright, never substituted with another character; in the source name they act only as word boundaries:

Character	Name	Example Issue
`&`	Ampersand	"R&A" in URLs, HTML entities
`/`	Slash	Path separator confusion
`\`	Backslash	Escape sequence issues
`+`	Plus	URL encoding (`+` = space)
`@`	At sign	Email/handle confusion
`#`	Hash/Pound	Fragment identifier in URLs
`%`	Percent	URL encoding prefix
`$`	Dollar	Variable prefix in shells
`*`	Asterisk	Glob/wildcard character
`(` `)`	Parentheses	Grouping in regex, code
`[` `]`	Square brackets	Array notation
`{` `}`	Curly braces	Object notation
`|`	Pipe	Command chaining, OR operator
`:`	Colon	YAML key-value, namespace separator
`;`	Semicolon	Statement terminator
`"` `'` `` ` ``	Quotes	String delimiters
`,`	Comma	List separator
`.`	Period	File extension, namespace
`-`	Hyphen	Already used as GHCID component separator
`_`	Underscore	Reserved for name suffix in collisions
`=`	Equals	Assignment operator
`?`	Question mark	Query string indicator
`!`	Exclamation	Negation, shell history
`~`	Tilde	Home directory, bitwise NOT
`^`	Caret	Regex anchor, power operator
`<` `>`	Angle brackets	HTML tags, redirects
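
Because the permitted set (A-Z) is much smaller than the list above, a checker can report everything outside it instead of enumerating each forbidden symbol. A minimal sketch follows; the helper name `disallowed_characters` is hypothetical and not part of the existing codebase:

```python
def disallowed_characters(abbrev: str) -> list[str]:
    """Return the distinct characters that are not A-Z, in order of first appearance."""
    found: list[str] = []
    for ch in abbrev:
        if not ("A" <= ch <= "Z") and ch not in found:
            found.append(ch)
    return found

print(disallowed_characters("DR&IMSM"))  # ['&']
print(disallowed_characters("A(H)"))     # ['(', ')']
print(disallowed_characters("DRIMSM"))   # []
```

Note that this also flags lowercase letters and digits, which matches the A-Z-only requirement stated in the Summary.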

Implementation

Algorithm

To extract an abbreviation from an institution name:

import re
import unicodedata

def extract_abbreviation_from_name(name: str, skip_words: set) -> str:
    """
    Extract abbreviation from institution name.
    
    Args:
        name: Full institution name (emic)
        skip_words: Set of prepositions/articles to skip
    
    Returns:
        Uppercase abbreviation with only A-Z characters
    """
    # Step 1: Normalize unicode (remove diacritics)
    normalized = unicodedata.normalize('NFD', name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    
    # Step 2: Replace special characters with spaces (to split words)
    # This handles cases like "Records&Information" -> "Records Information"
    clean_name = re.sub(r'[^a-zA-Z\s]', ' ', ascii_name)
    
    # Step 3: Split into words
    words = clean_name.split()
    
    # Step 4: Filter out skip words (prepositions, articles)
    significant_words = [w for w in words if w.lower() not in skip_words]
    
    # Step 5: Take first letter of each significant word
    abbreviation = ''.join(w[0].upper() for w in significant_words if w)
    
    # Step 6: Limit to 10 characters
    return abbreviation[:10]
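
Step 1 is the least obvious part, so here it is in isolation. This is a sketch only: the helper name `strip_diacritics` and the sample words are illustrative and do not come from the dataset.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD splits "é" into "e" plus a combining accent (category Mn), which is then dropped
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(strip_diacritics("Musée"))         # Musee
print(strip_diacritics("Bibliothèque"))  # Bibliotheque
print(strip_diacritics("ø"))             # ø
```

Letters that have no canonical decomposition, such as ø, pass through this step unchanged; Step 2 of the function above then treats them like any other non-A-Z character and turns them into a word boundary.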

Handling Special Cases

Case 1: "Records & Information Management"

Input: "Records & Information Management"
After special char removal: "Records Information Management"
After split: ["Records", "Information", "Management"]
Abbreviation: RIM

Case 2: "Art/Design Museum"

Input: "Art/Design Museum"
After special char removal: "Art Design Museum"
After split: ["Art", "Design", "Museum"]
Abbreviation: ADM

Case 3: "Culture+"

Input: "Culture+"
After special char removal: "Culture"
After split: ["Culture"]
Abbreviation: C

Examples

Institution Name	Correct	Incorrect
Department of Records & Information Management	DRIM	DR&IM
Art + Culture Center	ACC	A+CC
Museum/Gallery Amsterdam	MGA	M/GA
Heritage@Digital	HD	H@D
Archives (Historical)	AH	A(H)
Research & Development Institute	RDI	R&DI
Sint Maarten Records & Information	SMRI	SMR&I
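
The "Correct" column can be checked mechanically. The sketch below is a compact restatement of the algorithm in this document, not a replacement for it. The skip-word set is an assumption, since the real list is maintained elsewhere.

```python
import re
import unicodedata

SKIP_WORDS = frozenset({"of", "the"})  # assumption: stand-in for the real skip-word list

def abbreviate(name: str, skip_words: frozenset = SKIP_WORDS) -> str:
    decomposed = unicodedata.normalize("NFD", name)
    no_marks = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    # Special characters become spaces, so they split words and never reach the output
    words = re.sub(r"[^a-zA-Z\s]", " ", no_marks).split()
    return "".join(w[0].upper() for w in words if w.lower() not in skip_words)[:10]

EXPECTED = {
    "Department of Records & Information Management": "DRIM",
    "Art + Culture Center": "ACC",
    "Museum/Gallery Amsterdam": "MGA",
    "Heritage@Digital": "HD",
    "Archives (Historical)": "AH",
    "Research & Development Institute": "RDI",
    "Sint Maarten Records & Information": "SMRI",
}

for name, expected in EXPECTED.items():
    result = abbreviate(name)
    print(f"{name!r}: {result} ({'ok' if result == expected else 'MISMATCH'})")
```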

Real-World Case Study

File: data/custodian/SX-XX-PHI-O-DR&IMSM.yaml

Institution: Department of Records & Information Management of Sint Maarten

Current (INCORRECT):

ghcid:
  ghcid_current: SX-XX-PHI-O-DR&IMSM  # Contains "&"

Required (CORRECT):

ghcid:
  ghcid_current: SX-XX-PHI-O-DRIMSM   # Alphabetic only

Derivation:

Full name: "Department of Records & Information Management of Sint Maarten"
Skip prepositions: "of" (appears twice)
Handle special character: "&" → split "Records & Information" into separate words
Significant words: Department, Records, Information, Management, Sint, Maarten
First letters: D, R, I, M, S, M → "DRIMSM"
Maximum 10 chars: "DRIMSM" (6 chars, OK)

Note: The file must also be renamed from SX-XX-PHI-O-DR&IMSM.yaml to SX-XX-PHI-O-DRIMSM.yaml so that the filename matches the corrected GHCID.

Validation

Check for Invalid Abbreviations

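# Find files whose ghcid_current value contains a special character
# (matching on the value only; a bare ':' test would match every YAML file)
grep -rlE --include='*.yaml' 'ghcid_current:[[:space:]]*[^[:space:]#]*[&+@%$*|;?!=~^<>]' data/custodian | head -20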

# Specifically check for & in filenames
find data/custodian -name "*&*.yaml"
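
The filename check can also be done in Python, which avoids shell-quoting pitfalls. This sketch assumes that a valid GHCID filename stem contains only letters, digits, hyphens and underscores; that character set is an assumption and should be confirmed against the GHCID specification. The function name `find_suspect_files` is hypothetical.

```python
import re
from pathlib import Path

# Assumption: legal characters in a GHCID filename stem
ALLOWED_STEM = re.compile(r"[A-Za-z0-9_-]+")

def find_suspect_files(root: Path) -> list[Path]:
    """Return .yaml files under root whose stem contains any other character."""
    return sorted(
        p for p in root.rglob("*.yaml") if not ALLOWED_STEM.fullmatch(p.stem)
    )

if __name__ == "__main__":
    for path in find_suspect_files(Path("data/custodian")):
        print(path)
```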

Programmatic Validation

import re

def validate_abbreviation(abbrev: str) -> bool:
    """
    Validate that abbreviation contains only A-Z.
    
    Returns True if valid, False if contains special characters.
    """
    # fullmatch is used because '^...$' with re.match also accepts a trailing newline
    return bool(re.fullmatch(r'[A-Z]+', abbrev))

# Examples
validate_abbreviation("DRIMSM")   # True - valid
validate_abbreviation("DR&IMSM")  # False - contains &
validate_abbreviation("A+CC")     # False - contains +

Migration

For existing files with special characters in abbreviations:

Identify affected files:

# Single quotes keep the shell from expanding $* inside the pattern
find data/custodian -type f -name '*[&+@#%$*|]*'

For each file:
- Read the file
- Regenerate the abbreviation following this rule
- Update ghcid_current and ghcid_original
- Add entry to ghcid_history documenting the correction
- Rename the file to match new GHCID

Update GHCID history:

ghcid_history:
  - ghcid: SX-XX-PHI-O-DRIMSM
    ghcid_numeric: <new_numeric>
    valid_from: "2025-12-07T00:00:00Z"
    reason: "Corrected abbreviation to remove special character '&' per ABBREV-SPECIAL-CHAR rule"
  - ghcid: SX-XX-PHI-O-DR&IMSM
    ghcid_numeric: 12942843033761857211
    valid_from: "2025-12-06T21:07:22.140567+00:00"
    valid_to: "2025-12-07T00:00:00Z"
    reason: "Initial GHCID (contained invalid character)"
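
The history update can be sketched as a small helper that closes the old entry and prepends the corrected one. The function name and signature are hypothetical. The new numeric ID is passed in by the caller because its derivation is not covered by this rule.

```python
from datetime import datetime, timezone

def correction_history(old_entry: dict, new_ghcid: str, new_numeric,
                       bad_char: str, when: datetime) -> list[dict]:
    """Return [corrected_entry, closed_old_entry]; `when` must be timezone-aware."""
    ts = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    closed = {
        **old_entry,
        "valid_to": ts,
        "reason": "Initial GHCID (contained invalid character)",
    }
    corrected = {
        "ghcid": new_ghcid,
        "ghcid_numeric": new_numeric,  # supplied by the caller
        "valid_from": ts,
        "reason": (
            f"Corrected abbreviation to remove special character '{bad_char}' "
            "per ABBREV-SPECIAL-CHAR rule"
        ),
    }
    return [corrected, closed]

old = {
    "ghcid": "SX-XX-PHI-O-DR&IMSM",
    "ghcid_numeric": 12942843033761857211,
    "valid_from": "2025-12-06T21:07:22.140567+00:00",
}
history = correction_history(
    old, "SX-XX-PHI-O-DRIMSM", None, "&",
    datetime(2025, 12, 7, tzinfo=timezone.utc),
)
```

Here `None` stands in for the numeric ID, which is left as a placeholder in the YAML above as well.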

Related Documentation

AGENTS.md - Section "INSTITUTION ABBREVIATION: EMIC NAME FIRST-LETTER PROTOCOL"
schemas/20251121/linkml/modules/classes/CustodianName.yaml - Schema description
.opencode/LEGAL_FORM_FILTERING_RULE.md - Related filtering rule for legal forms
docs/PERSISTENT_IDENTIFIERS.md - GHCID specification

Changelog

Date	Change
2025-12-07	Initial rule created after discovery of `&` in GHCID
