# Abbreviation Character Filtering Rules

**Rule ID**: ABBREV-CHAR-FILTER
**Status**: MANDATORY
**Applies To**: GHCID abbreviation component generation
**Created**: 2025-12-07
**Updated**: 2025-12-08 (added diacritics rule)

---

## Summary

**When generating abbreviations for GHCID, ONLY ASCII uppercase letters (A-Z) are permitted. Both special characters AND diacritics MUST be removed/normalized.**

This is a **MANDATORY** rule. Abbreviations containing special characters or diacritics are INVALID and must be regenerated.

### Two Mandatory Sub-Rules:

1. **ABBREV-SPECIAL-CHAR**: Remove all special characters and symbols
2. **ABBREV-DIACRITICS**: Normalize all diacritics to ASCII equivalents

---

## Rule 1: Diacritics MUST Be Normalized to ASCII (ABBREV-DIACRITICS)

**Diacritics (accented characters) MUST be normalized to their ASCII base letter equivalents.**

### Example (Real Case)

```
❌ WRONG: CZ-VY-TEL-L-VHSPAOČRZS (contains Č)
✅ CORRECT: CZ-VY-TEL-L-VHSPAOCRZS (ASCII only)
```

### Diacritics Normalization Table

| Diacritic | ASCII | Example |
|-----------|-------|---------|
| Á, À, Â, Ã, Ä, Å, Ā | A | "Ålborg" → A |
| Č, Ć, Ç | C | "Český" → C |
| Ď | D | "Ďáblice" → D |
| É, È, Ê, Ë, Ě, Ē | E | "Éire" → E |
| Í, Ì, Î, Ï, Ī | I | "Ísland" → I |
| Ñ, Ń, Ň | N | "España" → N |
| Ó, Ò, Ô, Õ, Ö, Ø, Ō | O | "Österreich" → O |
| Ř | R | "Říčany" → R |
| Š, Ś, Ş | S | "Šumperk" → S |
| Ť | T | "Ťažký" → T |
| Ú, Ù, Û, Ü, Ů, Ū | U | "Ústí" → U |
| Ý, Ÿ | Y | "Ýmir" → Y |
| Ž, Ź, Ż | Z | "Žilina" → Z |
| Ł | L | "Łódź" → L |
| Æ | AE | "Ærø" → AE |
| Œ | OE | "Œuvre" → OE |
| ß | SS | "Straße" → SS |

### Implementation

```python
import unicodedata

# Letters from the table above that have no Unicode decomposition: NFD leaves
# them untouched, so they are mapped explicitly ('ß' becomes 'SS' once uppercased).
SPECIAL_LETTERS = str.maketrans({
    'Ø': 'O', 'ø': 'o', 'Ł': 'L', 'ł': 'l',
    'Æ': 'AE', 'æ': 'ae', 'Œ': 'OE', 'œ': 'oe', 'ß': 'ss',
})


def normalize_diacritics(text: str) -> str:
    """
    Normalize diacritics to ASCII equivalents.

    Examples:
        "Č" → "C"
        "Ř" → "R"
        "Ö" → "O"
        "ñ" → "n"
    """
    # Map the letters that NFD cannot decompose (Ø, Ł, Æ, Œ, ß)
    text = text.translate(SPECIAL_LETTERS)
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', text)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    return ascii_text


# Example
normalize_diacritics("VHSPAOČRZS")  # Returns "VHSPAOCRZS"
```
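
NFD decomposition only helps for letters that Unicode defines as a base letter plus a combining mark. As a quick check, the sketch below runs plain NFD stripping over the letters of the normalization table and collects those that remain non-ASCII. `TABLE_LETTERS` and `strip_marks` are names introduced here for illustration only.

```python
import unicodedata

# Letters copied from the Diacritics Normalization Table (illustrative constant).
TABLE_LETTERS = "ÁÀÂÃÄÅĀČĆÇĎÉÈÊËĚĒÍÌÎÏĪÑŃŇÓÒÔÕÖØŌŘŠŚŞŤÚÙÛÜŮŪÝŸŽŹŻŁÆŒß"


def strip_marks(text: str) -> str:
    """NFD-decompose and drop nonspacing combining marks (category 'Mn')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# Letters that are still non-ASCII after NFD stripping
not_covered = [c for c in TABLE_LETTERS if not strip_marks(c).isascii()]
print(not_covered)  # ['Ø', 'Ł', 'Æ', 'Œ', 'ß']
```

These five letters have no canonical decomposition, so they need an explicit mapping (Ø → O, Ł → L, Æ → AE, Œ → OE, ß → SS) alongside the NFD step.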

### Languages Commonly Affected

| Language | Common Diacritics | Example Institution |
|----------|-------------------|---------------------|
| **Czech** | Č, Ř, Š, Ž, Ě, Ů | Vlastivědné muzeum → VM (not VM with háček) |
| **Polish** | Ł, Ń, Ó, Ś, Ź, Ż, Ą, Ę | Biblioteka Łódzka → BL |
| **German** | Ä, Ö, Ü, ß | Österreichische Nationalbibliothek → ON |
| **French** | É, È, Ê, Ç, Ô | Bibliothèque nationale → BN |
| **Spanish** | Ñ, Á, É, Í, Ó, Ú | Museo Nacional → MN |
| **Portuguese** | Ã, Õ, Ç, Á, É | Biblioteca Nacional → BN |
| **Nordic** | Å, Ä, Ö, Ø, Æ | Nationalmuseet → N |
| **Turkish** | Ç, Ğ, İ, Ö, Ş, Ü | İstanbul Üniversitesi → IU |
| **Hungarian** | Á, É, Í, Ó, Ö, Ő, Ú, Ü, Ű | Országos Levéltár → OL |
| **Romanian** | Ă, Â, Î, Ș, Ț | Biblioteca Națională → BN |

---

## Rule 2: Special Characters MUST Be Removed (ABBREV-SPECIAL-CHAR)

**Special characters and symbols MUST be removed from abbreviations; only A-Z may remain.**

---

## Rationale

### 1. URL/URI Safety
Special characters require percent-encoding in URIs. For example:
- `&` becomes `%26`
- `+` becomes `%2B`

This makes identifiers harder to share, copy, and verify.
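
To make the encoding cost concrete, here is a small sketch using Python's standard `urllib.parse.quote`. The variable names are illustrative; the GHCID strings are the ones used elsewhere in this document.

```python
from urllib.parse import quote

# safe="" asks quote() to percent-encode every reserved character.
ampersand = quote("&", safe="")
plus = quote("+", safe="")
encoded_bad = quote("SX-XX-PHI-O-DR&IMSM", safe="")
encoded_good = quote("SX-XX-PHI-O-DRIMSM", safe="")

print(ampersand, plus)   # %26 %2B
print(encoded_bad)       # SX-XX-PHI-O-DR%26IMSM
print(encoded_good)      # SX-XX-PHI-O-DRIMSM (unchanged)
```

An identifier limited to A-Z and hyphens survives encoding unchanged, so the string a person reads is the same string that appears in the URI.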

### 2. Filename Safety
Many special characters are invalid in filenames on at least one operating system:
- Windows: `\ / : * ? " < > |`
- macOS/Linux: `/` and null bytes

Other characters are legal but awkward. In a file like `SX-XX-PHI-O-DR&IMSM.yaml`, the `&` is a shell control operator, so the name must be quoted or escaped on the command line.
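
As a rough illustration, the following sketch reports two kinds of filename trouble: characters Windows rejects outright, and names a POSIX shell would need quoted. `filename_problems` and `WINDOWS_RESERVED` are hypothetical helpers, not part of the project; `shlex.quote` is from the standard library.

```python
import shlex

# Characters Windows does not allow in filenames (from the list above).
WINDOWS_RESERVED = set('\\/:*?"<>|')


def filename_problems(filename: str) -> dict:
    """Report Windows-reserved characters and whether a POSIX shell needs quoting."""
    return {
        "windows_reserved": sorted(set(filename) & WINDOWS_RESERVED),
        # shlex.quote returns its input unchanged only when no quoting is needed
        "needs_shell_quoting": shlex.quote(filename) != filename,
    }


bad = filename_problems("SX-XX-PHI-O-DR&IMSM.yaml")
good = filename_problems("SX-XX-PHI-O-DRIMSM.yaml")
print(bad)   # {'windows_reserved': [], 'needs_shell_quoting': True}
print(good)  # {'windows_reserved': [], 'needs_shell_quoting': False}
```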

### 3. Parsing Consistency
Special characters can conflict with delimiters in data pipelines:
- `&` is used in query strings
- `:` is used in YAML, JSON
- `/` is a path separator
- `|` is a common CSV delimiter alternative
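
The `&` conflict is easy to reproduce. In this sketch a GHCID is passed unencoded as a hypothetical `id` query parameter (the parameter name is an assumption for illustration), and the standard-library parser silently truncates it.

```python
from urllib.parse import parse_qs

# Hypothetical query string carrying an unencoded GHCID.
parsed = parse_qs("id=SX-XX-PHI-O-DR&IMSM")
print(parsed)  # {'id': ['SX-XX-PHI-O-DR']}
```

The parser treats `&` as a field separator, so everything after it is read as a second, valueless field and discarded.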

### 4. Cross-System Compatibility
Identifiers should work across all systems:
- Databases (SQL, TypeDB, Neo4j)
- RDF/SPARQL endpoints
- REST APIs
- Command-line tools
- Spreadsheets

### 5. Human Readability
Clean identifiers are easier to:
- Communicate verbally
- Type correctly
- Proofread
- Remember

---

## Characters to Remove

The following characters MUST NOT appear in an abbreviation. They are removed outright, never replaced by a substitute character; during extraction they act only as word boundaries:

| Character | Name | Example Issue |
|-----------|------|---------------|
| `&` | Ampersand | "R&A" in URLs, HTML entities |
| `/` | Slash | Path separator confusion |
| `\` | Backslash | Escape sequence issues |
| `+` | Plus | URL encoding (`+` = space) |
| `@` | At sign | Email/handle confusion |
| `#` | Hash/Pound | Fragment identifier in URLs |
| `%` | Percent | URL encoding prefix |
| `$` | Dollar | Variable prefix in shells |
| `*` | Asterisk | Glob/wildcard character |
| `(` `)` | Parentheses | Grouping in regex, code |
| `[` `]` | Square brackets | Array notation |
| `{` `}` | Curly braces | Object notation |
| `\|` | Pipe | Command chaining, OR operator |
| `:` | Colon | YAML key-value, namespace separator |
| `;` | Semicolon | Statement terminator |
| `"` `'` `` ` `` | Quotes | String delimiters |
| `,` | Comma | List separator |
| `.` | Period | File extension, namespace |
| `-` | Hyphen | Already used as GHCID component separator |
| `_` | Underscore | Reserved for name suffix in collisions |
| `=` | Equals | Assignment operator |
| `?` | Question mark | Query string indicator |
| `!` | Exclamation | Negation, shell history |
| `~` | Tilde | Home directory, bitwise NOT |
| `^` | Caret | Regex anchor, power operator |
| `<` `>` | Angle brackets | HTML tags, redirects |
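
The 32 characters in this table coincide with Python's `string.punctuation`, which makes a lookup set easy to build. The sketch below reports which forbidden characters an abbreviation contains; `forbidden_characters` is a hypothetical helper name, not an existing project function.

```python
import string

# string.punctuation holds the same 32 ASCII symbols listed in the table above.
FORBIDDEN = set(string.punctuation)


def forbidden_characters(abbrev: str) -> list[str]:
    """Return the distinct forbidden characters in an abbreviation, in order of appearance."""
    found = []
    for ch in abbrev:
        if ch in FORBIDDEN and ch not in found:
            found.append(ch)
    return found


print(forbidden_characters("DR&IMSM"))  # ['&']
print(forbidden_characters("DRIMSM"))   # []
```

Reporting the offending characters, rather than only a pass/fail flag, can make correction logs easier to write.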

---

## Implementation

### Algorithm

When extracting an abbreviation from an institution name:

```python
import re
import unicodedata

# Letters without a Unicode decomposition (Ø, Ł, Æ, Œ, ß): NFD leaves them
# untouched and Step 2 would then split the word, so map them explicitly.
SPECIAL_LETTERS = str.maketrans({
    'Ø': 'O', 'ø': 'o', 'Ł': 'L', 'ł': 'l',
    'Æ': 'AE', 'æ': 'ae', 'Œ': 'OE', 'œ': 'oe', 'ß': 'ss',
})


def extract_abbreviation_from_name(name: str, skip_words: set[str]) -> str:
    """
    Extract abbreviation from institution name.

    Args:
        name: Full institution name (emic)
        skip_words: Set of lowercase prepositions/articles to skip

    Returns:
        Uppercase abbreviation with only A-Z characters
    """
    # Step 1: Normalize unicode (remove diacritics)
    normalized = unicodedata.normalize('NFD', name.translate(SPECIAL_LETTERS))
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

    # Step 2: Replace special characters with spaces (to split words)
    # This handles cases like "Records&Information" -> "Records Information"
    clean_name = re.sub(r'[^a-zA-Z\s]', ' ', ascii_name)

    # Step 3: Split into words
    words = clean_name.split()

    # Step 4: Filter out skip words (prepositions, articles)
    significant_words = [w for w in words if w.lower() not in skip_words]

    # Step 5: Take first letter of each significant word
    abbreviation = ''.join(w[0].upper() for w in significant_words if w)

    # Step 6: Limit to 10 characters
    return abbreviation[:10]
```

### Handling Special Cases

**Case 1: "Records & Information Management"**
1. Input: `"Records & Information Management"`
2. After special char removal: `"Records Information Management"`
3. After split: `["Records", "Information", "Management"]`
4. Abbreviation: `RIM`

**Case 2: "Art/Design Museum"**
1. Input: `"Art/Design Museum"`
2. After special char removal: `"Art Design Museum"`
3. After split: `["Art", "Design", "Museum"]`
4. Abbreviation: `ADM`

**Case 3: "Culture+"**
1. Input: `"Culture+"`
2. After special char removal: `"Culture"`
3. After split: `["Culture"]`
4. Abbreviation: `C`
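
The three cases can be traced with a reduced version of the algorithm. This is a sketch only: `trace_abbreviation` is a hypothetical helper that omits diacritic normalization and skip-word filtering, since none of these inputs need them.

```python
import re


def trace_abbreviation(name: str) -> dict:
    """Return the intermediate word list and the resulting abbreviation."""
    cleaned = re.sub(r"[^a-zA-Z\s]", " ", name)  # special characters become spaces
    words = cleaned.split()                       # split() collapses repeated spaces
    return {"words": words, "abbreviation": "".join(w[0].upper() for w in words)}


case1 = trace_abbreviation("Records & Information Management")
case2 = trace_abbreviation("Art/Design Museum")
case3 = trace_abbreviation("Culture+")
print(case1["abbreviation"], case2["abbreviation"], case3["abbreviation"])  # RIM ADM C
```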

---

## Examples

| Institution Name | Correct | Incorrect |
|------------------|---------|-----------|
| Department of Records & Information Management | DRIM | DR&IM |
| Art + Culture Center | ACC | A+CC |
| Museum/Gallery Amsterdam | MGA | M/GA |
| Heritage@Digital | HD | H@D |
| Archives (Historical) | AH | A(H) |
| Research & Development Institute | RDI | R&DI |
| Sint Maarten Records & Information | SMRI | SMR&I |

---

## Real-World Case Study

**File**: `data/custodian/SX-XX-PHI-O-DR&IMSM.yaml`

**Institution**: Department of Records & Information Management of Sint Maarten

**Current (INCORRECT)**:
```yaml
ghcid:
  ghcid_current: SX-XX-PHI-O-DR&IMSM  # Contains "&"
```

**Required (CORRECT)**:
```yaml
ghcid:
  ghcid_current: SX-XX-PHI-O-DRIMSM  # Alphabetic only
```

**Derivation**:
1. Full name: "Department of Records & Information Management of Sint Maarten"
2. Skip prepositions: "of" (appears twice)
3. Handle special character: "&" → split "Records & Information" into separate words
4. Significant words: Department, Records, Information, Management, Sint, Maarten
5. First letters: D, R, I, M, S, M → "DRIMSM"
6. Maximum 10 chars: "DRIMSM" (6 chars, OK)

**Note**: File should be renamed from `SX-XX-PHI-O-DR&IMSM.yaml` to `SX-XX-PHI-O-DRIMSM.yaml`
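
The derivation can be reproduced in a few lines. This sketch assumes a skip-word set containing only "of", which is enough for this name; the project's real skip-word list is defined elsewhere, and `abbreviate` is a hypothetical condensed helper.

```python
import re

SKIP_WORDS = {"of"}  # illustrative subset, not the project's full list


def abbreviate(name: str, skip_words: frozenset = frozenset(SKIP_WORDS)) -> str:
    """Condensed form of the derivation: split on non-letters, skip, take initials."""
    words = re.sub(r"[^a-zA-Z\s]", " ", name).split()
    significant = [w for w in words if w.lower() not in skip_words]
    return "".join(w[0].upper() for w in significant)[:10]


full_name = "Department of Records & Information Management of Sint Maarten"
ghcid = "SX-XX-PHI-O-" + abbreviate(full_name)
print(ghcid)  # SX-XX-PHI-O-DRIMSM
```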

---

## Validation

### Check for Invalid Abbreviations

```bash
# Find GHCID files whose ghcid_current value contains anything other than
# letters, digits, "-" or "_". (Grepping whole files for ":" and similar
# characters would match every YAML file.)
grep -rlE --include='*.yaml' \
  'ghcid_current:[[:space:]]*[A-Za-z0-9_-]*[^A-Za-z0-9_[:space:]-]' \
  data/custodian | head -20

# Specifically check for & in filenames
find data/custodian -name '*&*.yaml'
```

### Programmatic Validation

```python
import re


def validate_abbreviation(abbrev: str) -> bool:
    """
    Validate that abbreviation contains only A-Z.

    Returns True if valid, False if empty or if it contains any other character.
    """
    # fullmatch avoids the trailing-newline loophole of re.match with '^...$'
    return re.fullmatch(r'[A-Z]+', abbrev) is not None


# Examples
validate_abbreviation("DRIMSM")      # True - valid
validate_abbreviation("DR&IMSM")     # False - contains &
validate_abbreviation("A+CC")        # False - contains +
validate_abbreviation("VHSPAOČRZS")  # False - contains Č
```
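
Validation usually starts from a full GHCID rather than a bare abbreviation. The sketch below assumes the five-component layout seen in this document's examples (`SX-XX-PHI-O-DRIMSM`), with the abbreviation last and an optional `_suffix` for collisions; that layout is inferred here, and `docs/PERSISTENT_IDENTIFIERS.md` remains the authority. Both function names are hypothetical.

```python
import re


def abbreviation_from_ghcid(ghcid: str) -> str:
    """Return the abbreviation component, assuming four leading components."""
    parts = ghcid.split("-", 4)  # split on the first four hyphens only
    if len(parts) != 5:
        raise ValueError(f"unexpected GHCID layout: {ghcid!r}")
    return parts[4].split("_", 1)[0]  # drop an optional collision suffix


def ghcid_abbreviation_is_valid(ghcid: str) -> bool:
    """True if the abbreviation component is 1 to 10 letters from A-Z."""
    return re.fullmatch(r"[A-Z]{1,10}", abbreviation_from_ghcid(ghcid)) is not None


print(ghcid_abbreviation_is_valid("SX-XX-PHI-O-DRIMSM"))      # True
print(ghcid_abbreviation_is_valid("SX-XX-PHI-O-DR&IMSM"))     # False
print(ghcid_abbreviation_is_valid("CZ-VY-TEL-L-VHSPAOČRZS"))  # False
```

The `{1,10}` bound mirrors the 10-character limit in Step 6 of the algorithm.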

---

## Migration

For existing files with special characters in abbreviations:

1. **Identify affected files** (single quotes keep the shell from expanding `$*` inside the pattern):
   ```bash
   find data/custodian -name '*[&+@#%$*|]*' -type f
   ```

2. **For each file**:
   - Read the file
   - Regenerate the abbreviation following this rule
   - Update `ghcid_current` and `ghcid_original`
   - Add entry to `ghcid_history` documenting the correction
   - Rename the file to match new GHCID

3. **Update GHCID history**:
   ```yaml
   ghcid_history:
     - ghcid: SX-XX-PHI-O-DRIMSM
       ghcid_numeric: <new_numeric>
       valid_from: "2025-12-07T00:00:00Z"
       reason: "Corrected abbreviation to remove special character '&' per ABBREV-SPECIAL-CHAR rule"
     - ghcid: SX-XX-PHI-O-DR&IMSM
       ghcid_numeric: 12942843033761857211
       valid_from: "2025-12-06T21:07:22.140567+00:00"
       valid_to: "2025-12-07T00:00:00Z"
       reason: "Initial GHCID (contained invalid character)"
   ```
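
The per-file steps could be planned in code before anything is touched on disk. This is a minimal sketch under stated assumptions: `plan_correction` is a hypothetical helper, it only computes the new path and history entry, and it leaves `ghcid_numeric` empty because the numeric ID derivation is not described in this document.

```python
from pathlib import Path


def plan_correction(old_path: Path, new_ghcid: str, corrected_at: str, reason: str) -> dict:
    """Build the target filename and new history entry; nothing is written to disk."""
    return {
        "new_path": old_path.with_name(f"{new_ghcid}{old_path.suffix}"),
        "history_entry": {
            "ghcid": new_ghcid,
            "ghcid_numeric": None,  # placeholder: must be recomputed by the real pipeline
            "valid_from": corrected_at,
            "reason": reason,
        },
    }


plan = plan_correction(
    Path("data/custodian/SX-XX-PHI-O-DR&IMSM.yaml"),
    "SX-XX-PHI-O-DRIMSM",
    "2025-12-07T00:00:00Z",
    "Corrected abbreviation to remove special character '&' per ABBREV-SPECIAL-CHAR rule",
)
print(plan["new_path"].name)  # SX-XX-PHI-O-DRIMSM.yaml
```

Separating planning from writing allows the proposed renames to be reviewed as a batch before any file is changed.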

---

## Related Documentation

- `AGENTS.md` - Section "INSTITUTION ABBREVIATION: EMIC NAME FIRST-LETTER PROTOCOL"
- `schemas/20251121/linkml/modules/classes/CustodianName.yaml` - Schema description
- `.opencode/LEGAL_FORM_FILTERING_RULE.md` - Related filtering rule for legal forms
- `docs/PERSISTENT_IDENTIFIERS.md` - GHCID specification

---

## Changelog

| Date | Change |
|------|--------|
| 2025-12-07 | Initial rule created after discovery of `&` in GHCID |
| 2025-12-08 | Added diacritics rule (ABBREV-DIACRITICS) |