glam/schemas/20251121/linkml/modules/classes/CustodianName.yaml
kempersc 6a6557bbe8 feat(enrichment): add emic name enrichment and update CustodianName schema
- Add emic_name, name_language, standardized_name to CustodianName
- Add scripts for enriching custodian emic names from Wikidata
- Add YouTube and Google Maps enrichment scripts
- Update DuckLake loader for new schema fields
2025-12-08 14:58:50 +01:00

id: https://nde.nl/ontology/hc/class/CustodianName
name: CustodianName
title: Custodian Name Class
imports:
- linkml:types
- ./Custodian
- ./CustodianObservation
- ./ReconstructionActivity
- ./TimeSpan
- ./ReconstructedEntity
classes:
CustodianName:
is_a: ReconstructedEntity
class_uri: skos:Concept
description: |
Standardized emic (insider) name DERIVED FROM CustodianObservation(s).
CRITICAL: CustodianName is NOT a subclass of CustodianObservation!
- CustodianObservation = Evidence seen in sources (input)
- CustodianName = Standardized interpretation (output)
- Relationship: CustodianName prov:wasDerivedFrom CustodianObservation
CustodianName represents the CANONICAL LABEL - the standardized form
accepted by the custodian itself for public identification.
IMPORTANT: CustodianName ≠ Legal Name
- CustodianName = How custodian presents itself (emic, operational)
- Legal Name = Formal registered name (in CustodianLegalStatus)
- Example: "Rijksmuseum" (emic) vs "Stichting Rijksmuseum" (legal)
===========================================================================
MANDATORY RULE: Legal Form Terms MUST Be Filtered
===========================================================================
Legal form designations (Stichting, Foundation, Inc., Ltd., GmbH, etc.)
MUST ALWAYS be removed from CustodianName, even when the custodian
self-identifies with them. This is the ONE EXCEPTION to the emic principle.
RATIONALE:
1. Legal form is METADATA about the entity, not part of its identity
2. Legal forms change (foundation→corporation) but identity persists
3. Enables consistent cross-jurisdictional comparison
4. Prevents duplicate entries ("X Foundation" vs "X")
5. Aligns with ISO 20275 (Legal Entity Identifier) principles
EXAMPLES:
- "Stichting Rijksmuseum" → CustodianName: "Rijksmuseum"
- "Hidde Nijland Stichting" → CustodianName: "Hidde Nijland"
- "The Getty Foundation" → CustodianName: "The Getty"
- "British Museum Trust Ltd" → CustodianName: "British Museum"
- "Fundação Biblioteca Nacional" → CustodianName: "Biblioteca Nacional"
LEGAL FORM TERMS TO FILTER (partial list by jurisdiction):
- Dutch: Stichting, Vereniging, Coöperatie, B.V., N.V., V.O.F.
- English: Foundation, Trust, Inc., Ltd., LLC, Corp., Association
- German: Stiftung, Verein, e.V., GmbH, AG
- French: Fondation, Association, S.A., S.A.R.L.
- Spanish: Fundación, Asociación, S.A., S.L.
- Portuguese: Fundação, Associação, Ltda., S.A.
- Italian: Fondazione, Associazione, S.p.A., S.r.l.
See: .opencode/LEGAL_FORM_FILTERING_RULE.md for comprehensive global list
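ILLUSTRATIVE SKETCH (not the project's implementation):
The filtering rule above could be applied roughly as follows. The names
strip_legal_form, _key and LEGAL_FORM_TERMS are hypothetical, the term set is
only the partial list given above, and matching is done token by token, which
may be too aggressive where a term such as "Trust" is part of the identity.

```python
# Partial term list copied from the rule above (assumption: not exhaustive).
LEGAL_FORM_TERMS = {
    "stichting", "vereniging", "coöperatie", "b.v.", "n.v.", "v.o.f.",
    "foundation", "trust", "inc.", "ltd.", "llc", "corp.", "association",
    "stiftung", "verein", "e.v.", "gmbh", "ag",
    "fondation", "s.a.", "s.a.r.l.",
    "fundación", "asociación", "s.l.",
    "fundação", "associação", "ltda.",
    "fondazione", "associazione", "s.p.a.", "s.r.l.",
}


def _key(token: str) -> str:
    """Casefold and drop periods/commas so 'Ltd', 'Ltd.' and 'LTD' compare equal."""
    return token.casefold().replace(".", "").strip(",")


_LEGAL_KEYS = {_key(term) for term in LEGAL_FORM_TERMS}


def strip_legal_form(name: str) -> str:
    """Return the name with every whitespace-separated legal form token removed."""
    kept = [tok for tok in name.split() if _key(tok) not in _LEGAL_KEYS]
    return " ".join(kept)
```

Applied to the examples above, strip_legal_form("British Museum Trust Ltd")
drops both "Trust" and "Ltd". A production version would likely need the
comprehensive list and handling for multi-word legal forms.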
===========================================================================
MANDATORY RULE: Special Characters MUST Be Excluded from Abbreviations
===========================================================================
When generating abbreviations for GHCID, special characters and symbols
MUST be completely removed. Only alphabetic characters (A-Z) are permitted
in the abbreviation component of the GHCID.
RATIONALE:
1. URL/URI safety - Special characters require encoding in URIs
2. Filename safety - Characters like &, /, \, : are invalid or need escaping in filenames and shells
3. Parsing consistency - Avoids delimiter conflicts in data pipelines
4. Cross-system compatibility - Ensures interoperability with all systems
5. Human readability - Clean identifiers are easier to communicate
CHARACTERS TO REMOVE (exhaustive list):
- Ampersand: & (e.g., "Records & Archives" → "RA", not "R&A")
- Slash: / (e.g., "Art/Design Museum" → "ADM", not "A/DM")
- Backslash: \
- Plus: + (e.g., "Culture+" → "C")
- At sign: @
- Hash/Pound: #
- Percent: %
- Dollar: $
- Asterisk: *
- Parentheses: ( )
- Brackets: [ ] { }
- Pipe: |
- Colon: :
- Semicolon: ;
- Quotation marks: " ' `
- Comma: ,
- Period: . (including periods inside abbreviations, e.g., "U.S." → "US")
- Hyphen: - (skip, do not replace with letter)
- Underscore: _
- Equals: =
- Question mark: ?
- Exclamation: !
- Tilde: ~
- Caret: ^
- Less/Greater than: < >
EXAMPLES:
- "Department of Records & Information Management" → "DRIM" (not "DR&IM")
- "Art + Culture Center" → "ACC" (not "A+CC")
- "Museum/Gallery Amsterdam" → "MGA" (not "M/GA")
- "Heritage@Digital" → "HD" (not "H@D")
- "Archives (Historical)" → "AH" (not "A(H)")
See: .opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md for complete documentation
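ILLUSTRATIVE SKETCH (not the project's implementation):
One way to satisfy this rule is to treat every non-letter as a word separator
and build the abbreviation from word initials. The function name abbreviate and
the SKIP_WORDS set are hypothetical; skipping "of" is inferred from the "DRIM"
example above, and treating the hyphen as a separator is an assumption.
Diacritics on initials are not handled here.

```python
import re

# Hypothetical list of words that contribute no initial (assumption).
SKIP_WORDS = {"of", "the", "and"}

# A run of letters: word characters excluding digits and underscore.
_LETTER_RUN = re.compile(r"[^\W\d_]+")


def abbreviate(name: str) -> str:
    """Build an abbreviation from word initials, ignoring all non-letter characters."""
    words = _LETTER_RUN.findall(name)
    initials = [w[0].upper() for w in words if w.lower() not in SKIP_WORDS]
    return "".join(initials)
```

Because symbols only ever act as separators, they can never leak into the
result: abbreviate("Heritage@Digital") yields "HD" and abbreviate("U.S.")
yields "US".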
===========================================================================
MANDATORY RULE: Diacritics MUST Be Normalized to ASCII in Abbreviations
===========================================================================
When generating abbreviations for GHCID, diacritics (accented characters)
MUST be normalized to their ASCII base letter equivalents. Only ASCII
uppercase letters (A-Z) are permitted in the abbreviation component.
RATIONALE:
1. URI/URL safety - Non-ASCII requires percent-encoding
2. Cross-system compatibility - ASCII is universally supported
3. Parsing consistency - No special character handling needed
4. Human readability - Easier to type and communicate
DIACRITICS TO NORMALIZE (examples by language):
- Czech: Č→C, Ř→R, Š→S, Ž→Z, Ě→E, Ů→U
- Polish: Ł→L, Ń→N, Ó→O, Ś→S, Ź→Z, Ż→Z, Ą→A, Ę→E
- German: Ä→A, Ö→O, Ü→U, ß→SS
- French: É→E, È→E, Ê→E, Ç→C, Ô→O
- Spanish: Ñ→N, Á→A, É→E, Í→I, Ó→O, Ú→U
- Nordic: Å→A, Ä→A, Ö→O, Ø→O, Æ→AE
EXAMPLES:
- "Vlastivědné muzeum" (Czech) → "VM" (the ě is not an initial, so the initials are unaffected)
- "Österreichische Nationalbibliothek" (German) → "ON"
- "Bibliothèque nationale" (French) → "BN"
REAL-WORLD EXAMPLE:
- ❌ WRONG: CZ-VY-TEL-L-VHSPAOČRZS (contains Č)
- ✅ CORRECT: CZ-VY-TEL-L-VHSPAOCRZS (ASCII only)
IMPLEMENTATION:
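```python
import unicodedata
text = "Vlastivědné muzeum"  # sample input
# Decompose to NFD, then drop combining marks (category 'Mn')
normalized = unicodedata.normalize('NFD', text)
ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
# Caveat: Ł, Ø, Æ and ß do not decompose and need an explicit mapping
```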
See: .opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md for complete documentation
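ILLUSTRATIVE SKETCH (not the project's implementation):
NFD decomposition alone leaves Ł, Ø, Æ and ß untouched, because these letters
have no canonical decomposition. A minimal sketch that also covers them,
assuming the mappings listed under DIACRITICS TO NORMALIZE, could look like
this. The names to_ascii_letters and _NO_DECOMPOSITION are hypothetical.

```python
import unicodedata

# Letters without a canonical decomposition; targets follow the table above.
_NO_DECOMPOSITION = str.maketrans({
    "Ł": "L", "ł": "l",
    "Ø": "O", "ø": "o",
    "Æ": "AE", "æ": "ae",
    "ß": "SS",
})


def to_ascii_letters(text: str) -> str:
    """Map special letters explicitly, then strip combining marks after NFD."""
    text = text.translate(_NO_DECOMPOSITION)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
```

With this, the abbreviation component from the real-world example normalizes
from "VHSPAOČRZS" to "VHSPAOCRZS". Characters outside the Latin script are not
addressed by this sketch.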
Can be generated by:
1. ReconstructionActivity (formal entity resolution) - was_generated_by link
2. Direct extraction (simple standardization) - no was_generated_by link
exact_mappings:
- skos:prefLabel
- schema:name
- foaf:name
close_mappings:
- rdfs:label
- dcterms:title
- org:legalName
- tooi:officieleNaamInclSoort
- rico:name
related_mappings:
- skos:altLabel
- schema:alternateName
- foaf:nick
- gleif:hasOtherName
slots:
- emic_name
- name_language
- standardized_name
- alternative_names
- endorsement_source
- name_authority
- valid_from
- valid_to
- name_validity_period
- supersedes
- superseded_by
- was_derived_from
- was_generated_by
- refers_to_custodian
slot_usage:
emic_name:
slot_uri: skos:prefLabel
description: |
The observed name as the custodian refers to itself in source materials,
preserving the custodian's own naming convention. This is descriptive
data, not an identifier - the custodian is identified by its hc_id.
range: string
required: true
name_language:
slot_uri: dcterms:language
description: |
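The language or locale code of the emic name: an ISO 639-1 two-letter
language code, optionally followed by a two-letter region subtag in
BCP 47 style. Longer BCP 47 tags (e.g., script subtags) do not match
the pattern declared for this slot.
Examples: 'nl', 'en', 'pt-BR'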
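ILLUSTRATIVE SKETCH (not the project's implementation):
The pattern declared for this slot can be checked outside LinkML with a
regular expression. The helper name is_valid_name_language is hypothetical,
and LinkML's own pattern matching may differ in edge cases.

```python
import re

# Pattern copied from the name_language slot definition.
NAME_LANGUAGE_PATTERN = re.compile(r"^[a-z]{2}(-[A-Z]{2})?$")


def is_valid_name_language(code: str) -> bool:
    """Return True if the whole string is 'xx' or 'xx-YY'."""
    return NAME_LANGUAGE_PATTERN.fullmatch(code) is not None
```

Under this check 'nl' and 'pt-BR' are accepted, while three-letter codes such
as 'nld' and script-tagged values such as 'zh-Hant' are rejected.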
range: string
pattern: "^[a-z]{2}(-[A-Z]{2})?$"
standardized_name:
slot_uri: skos:prefLabel
description: "The canonical emic name accepted by the custodian itself (REQUIRED)"
range: string
required: true
alternative_names:
slot_uri: skos:altLabel
description: |
Alternative names and label variants for this custodian name.
SKOS: altLabel for alternative lexical labels.
W3C Org: Recommended for trading names, colloquial names, abbreviations.
Examples:
- "BnF" (abbreviation for "Bibliothèque nationale de France")
- "Rijks" (colloquial for "Rijksmuseum")
- "National Library of France" (English translation)
- Historical spellings and variants
These are NOT the preferred/canonical name but are recognized variants
that people use to refer to the same custodian.
range: CustodianAppellation
multivalued: true
inlined_as_list: true
endorsement_source:
slot_uri: prov:hadPrimarySource
description: "Source proving custodian acceptance of this name (REQUIRED)"
range: uriorcurie
required: true
name_authority:
slot_uri: prov:wasAttributedTo
description: "Authority that authorized this name"
range: string
valid_from:
slot_uri: schema:validFrom
description: "Date when this name became official/valid"
range: date
valid_to:
slot_uri: schema:validUntil
description: "Date when this name ceased to be valid (null if current)"
range: date
name_validity_period:
slot_uri: crm:P4_has_time-span
description: |
Temporal period during which this name was valid (with fuzzy boundaries).
CIDOC-CRM: P4_has_time-span links to E52_Time-Span for uncertain validity periods.
Use this when name validity dates are uncertain:
- "Name adopted sometime in the 1920s"
- "Name changed around 1950"
- "Name used from approximately 1800 to 1850"
For precise dates, use valid_from/valid_to instead.
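ILLUSTRATIVE SKETCH (not the project's implementation):
A fuzzy validity period can be modelled with four boundary dates, using the
field names begin_of_the_begin, end_of_the_begin, begin_of_the_end and
end_of_the_end. The class FuzzyTimeSpan is a hypothetical stand-in for
TimeSpan, and the ordering check and the "certainly"/"possibly" readings are
assumptions about how the boundaries are meant to be interpreted.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FuzzyTimeSpan:
    """Hypothetical stand-in for TimeSpan with four fuzzy boundaries."""
    begin_of_the_begin: date  # earliest possible start
    end_of_the_begin: date    # latest possible start
    begin_of_the_end: date    # earliest possible end
    end_of_the_end: date      # latest possible end

    def __post_init__(self) -> None:
        # Assumed invariant: each boundary pair is in chronological order.
        if self.begin_of_the_begin > self.end_of_the_begin:
            raise ValueError("begin_of_the_begin is after end_of_the_begin")
        if self.begin_of_the_end > self.end_of_the_end:
            raise ValueError("begin_of_the_end is after end_of_the_end")

    def possibly_valid_on(self, day: date) -> bool:
        """True if the name could have been in use on this day."""
        return self.begin_of_the_begin <= day <= self.end_of_the_end

    def certainly_valid_on(self, day: date) -> bool:
        """True if the day lies after the latest start and before the earliest end."""
        return self.end_of_the_begin <= day <= self.begin_of_the_end


span = FuzzyTimeSpan(
    date(1920, 1, 1), date(1929, 12, 31), date(1945, 1, 1), date(1955, 12, 31)
)
```

For a name adopted sometime in the 1920s and changed around 1950, a date in
1935 is certainly within the period, whereas a date in 1950 is only possibly
within it.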
range: TimeSpan
examples:
- value:
begin_of_the_begin: "1920-01-01"
end_of_the_begin: "1929-12-31"
begin_of_the_end: "1945-01-01"
end_of_the_end: "1955-12-31"
description: "Name adopted sometime in the 1920s, changed around 1950"
supersedes:
slot_uri: dcterms:replaces
description: "Previous CustodianName replaced by this one"
range: CustodianName
superseded_by:
slot_uri: dcterms:isReplacedBy
description: "Subsequent CustodianName that replaced this name"
range: CustodianName
was_derived_from:
slot_uri: prov:wasDerivedFrom
description: |
CustodianObservation(s) from which this name was derived (REQUIRED).
PROV-O: wasDerivedFrom establishes observation→name derivation.
A name can be derived from multiple observations through consolidation:
- "Rijks" (letterhead) + "Rijksmuseum Amsterdam" (ISIL) → "Rijksmuseum"
This is NOT inheritance (is_a) but transformation (derived_from).
range: CustodianObservation
multivalued: true
required: true
was_generated_by:
slot_uri: prov:wasGeneratedBy
description: |
ReconstructionActivity that generated this standardized name (optional).
If present: Name created through formal entity resolution process
If null: Name extracted directly without reconstruction activity
PROV-O: wasGeneratedBy links Entity (CustodianName) to generating Activity.
range: ReconstructionActivity
required: false
inverse: generates
refers_to_custodian:
slot_uri: dcterms:references
description: |
The Custodian hub that this name identifies (REQUIRED).
Links the standardized name back to the hub it represents.
The hub may also link back via skos:prefLabel if this is the preferred name.
range: Custodian
required: true