PPID-GHCID Alignment: Revised Identifier Structure
Version: 0.1.0
Last Updated: 2025-01-09
Status: DRAFT - Supersedes opaque identifier design in 05_identifier_structure_design.md
Related: GHCID Specification | PiCo Ontology
1. Executive Summary
This document proposes a revised PPID structure that aligns with GHCID's geographic-semantic identifier pattern while accommodating the unique challenges of person identification across historical records.
1.1 Key Changes from Original Design
| Aspect | Original (Doc 05) | Revised (This Document) |
|---|---|---|
| Format | Opaque hex (POID-7a3b-c4d5-...) | Semantic (PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG) |
| Type Distinction | POID vs PRID | ID (temporary) vs PID (persistent) |
| Geographic | None in identifier | Dual anchors: first + last observation |
| Temporal | None in identifier | Century range |
| Name | None in identifier | First + last token of emic label |
| Persistence | Always persistent | May remain ID indefinitely |
1.2 Design Philosophy
The revised PPID follows the same principles as GHCID:
- Human-readable semantic components that aid discovery and deduplication
- Geographic anchoring to physical locations using GeoNames
- Temporal anchoring to enable disambiguation across time
- Emic authenticity using names from primary sources
- Collision resolution via full emic label suffix
- Dual representation as both semantic string and UUID/numeric
2. Identifier Type: ID vs PID
2.1 The Epistemic Uncertainty Problem
Unlike institutions (which typically have founding documents, legal registrations, and clear organizational boundaries), persons in historical records often exist in epistemic uncertainty:
- Incomplete records (many records lost to time)
- Ambiguous references (common names, no surnames)
- Conflicting sources (different dates, spellings)
- Undiscovered archives (unexplored record sets)
2.2 Two-Class Identifier System
| Type | Prefix | Description | Persistence | Promotion Path |
|---|---|---|---|---|
| ID | ID- | Temporary identifier | May change | Can become PID |
| PID | PID- | Persistent identifier | Permanent | Cannot revert to ID |
2.3 Promotion Criteria: ID → PID
An identifier can be promoted from ID to PID when ALL of the following are satisfied:
from dataclasses import dataclass

@dataclass
class PIDPromotionCriteria:
    """
    Criteria for promoting an ID to a PID.
    ALL conditions must be True for promotion.
    """
    # Geographic anchors
    first_observation_verified: bool      # Birth or equivalent
    last_observation_verified: bool       # Death or equivalent
    # Temporal anchors
    century_range_established: bool       # From verified observations
    # Identity anchors
    emic_label_verified: bool             # From primary sources
    no_unexplored_archives: bool          # Reasonable assumption
    # Quality checks
    no_unresolved_conflicts: bool         # No conflicting claims
    multiple_corroborating_sources: bool  # At least 2 independent sources

    def is_promotable(self) -> bool:
        return all([
            self.first_observation_verified,
            self.last_observation_verified,
            self.century_range_established,
            self.emic_label_verified,
            self.no_unexplored_archives,
            self.no_unresolved_conflicts,
            self.multiple_corroborating_sources,
        ])
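The all-or-nothing nature of promotion can be illustrated with a minimal standalone sketch, using a plain dict of flags rather than the dataclass (purely for illustration):

```python
# Hypothetical verification state for one person record.
criteria = {
    "first_observation_verified": True,
    "last_observation_verified": True,
    "century_range_established": True,
    "emic_label_verified": True,
    "no_unexplored_archives": True,
    "no_unresolved_conflicts": True,
    "multiple_corroborating_sources": False,  # only one source found so far
}

# Promotion requires ALL criteria; a single False keeps the identifier an ID.
promotable = all(criteria.values())
print("PID" if promotable else "ID")  # → ID
```

One missing corroborating source is enough to keep the identifier in ID status indefinitely.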
2.4 Permanent ID Status
Some identifiers may forever remain IDs due to:
- Fragmentary records: Only one surviving document mentions the person
- Uncertain dates: Cannot establish century range
- Unknown location: Cannot anchor geographically
- Anonymous figures: No emic label recoverable
- Ongoing research: Archives not yet explored
This is acceptable and expected. An ID is still a valid identifier for internal use; it simply cannot be cited as a persistent identifier in scholarly work.
3. Identifier Structure
3.1 Full Format Specification
{TYPE}-{FC}-{FR}-{FP}-{LC}-{LR}-{LP}-{CR}-{FT}-{LT}[-{FULL_EMIC}]
│ │ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │ └── Collision suffix (optional)
│ │ │ │ │ │ │ │ │ └── Last Token of emic label
│ │ │ │ │ │ │ │ └── First Token of emic label
│ │ │ │ │ │ │ └── Century Range (e.g., 19-20)
│ │ │ │ │ │ └── Last observation Place (GeoNames 3-letter)
│ │ │ │ │ └── Last observation Region (ISO 3166-2)
│ │ │ │ └── Last observation Country (ISO 3166-1 alpha-2)
│ │ │ └── First observation Place (GeoNames 3-letter)
│ │ └── First observation Region (ISO 3166-2)
│ └── First observation Country (ISO 3166-1 alpha-2)
└── Type: ID or PID
3.2 Component Definitions
| Component | Format | Description | Example |
|---|---|---|---|
| TYPE | ID or PID | Identifier class | PID |
| FC | ISO 3166-1 α2 | First observation country (modern) | NL |
| FR | ISO 3166-2 suffix | First observation region | NH |
| FP | 3 letters | First observation place (GeoNames) | AMS |
| LC | ISO 3166-1 α2 | Last observation country (modern) | NL |
| LR | ISO 3166-2 suffix | Last observation region | NH |
| LP | 3 letters | Last observation place (GeoNames) | HAA |
| CR | CC-CC | Century range (CE) | 19-20 |
| FT | UPPERCASE | First token of emic label | JAN |
| LT | UPPERCASE | Last token of emic label | BERG |
| FULL_EMIC | snake_case | Full emic label (collision only) | jan_van_den_berg |
3.3 Examples
| Person | Full Emic Label | PPID |
|---|---|---|
| Jan van den Berg, born Amsterdam 1895, died Haarlem 1970 | Jan van den Berg | PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG |
| Rembrandt, born Leiden 1606, died Amsterdam 1669 | Rembrandt van Rijn | PID-NL-ZH-LEI-NL-NH-AMS-17-17-REMBRANDT-RIJN |
| Maria Sibylla Merian, born Frankfurt 1647, died Amsterdam 1717 | Maria Sibylla Merian | PID-DE-HE-FRA-NL-NH-AMS-17-18-MARIA-MERIAN |
| Unknown soldier, found Normandy, died 1944 | (unknown) | ID-XX-XX-XXX-FR-NOR-OMH-20-20-UNKNOWN- |
| Henry VIII, born London 1491, died London 1547 | Henry VIII | PID-GB-ENG-LON-GB-ENG-LON-15-16-HENRY-VIII |
Notes on Emic Labels:
- Always use formal/complete emic names from primary sources, not modern colloquial short forms
- "Rembrandt" alone is a modern convention; the emic label from his lifetime was "Rembrandt van Rijn"
- Tussenvoegsels (particles) like "van", "de", "den", "der", "van de", "van den", "van der" are skipped when extracting the last token (see §4.5)
- This follows the same pattern as GHCID abbreviation rules (AGENTS.md Rule 8)
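For illustration, assembling the semantic string from its components is a straightforward join; the helper name build_ppid below is hypothetical, not part of the spec:

```python
def build_ppid(type_, first_geo, last_geo, century_range,
               first_token, last_token, collision_suffix=None):
    """Join PPID components per the format in §3.1 (illustrative helper)."""
    parts = [type_, *first_geo, *last_geo, century_range, first_token, last_token]
    base = "-".join(parts)
    return f"{base}-{collision_suffix}" if collision_suffix else base

ppid = build_ppid("PID", ("NL", "NH", "AMS"), ("NL", "NH", "HAA"),
                  "19-20", "JAN", "BERG")
print(ppid)  # → PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG
```

Note that an empty last token (mononym) naturally produces the trailing hyphen seen in the unknown-soldier example.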
4. Component Rules
4.1 First Observation (Birth or Earliest)
from dataclasses import dataclass
from enum import Enum
from typing import Optional
class ObservationType(Enum):
BIRTH_CERTIFICATE = "birth_certificate" # Highest authority
BAPTISM_RECORD = "baptism_record" # Common for pre-civil registration
BIRTH_STATEMENT = "birth_statement" # Stated birth in other document
EARLIEST_REFERENCE = "earliest_reference" # Earliest surviving mention
INFERRED = "inferred" # Inferred from context
@dataclass
class FirstObservation:
"""
First observation of a person during their lifetime.
Ideally a birth record, but may be another early record.
"""
observation_type: ObservationType
# Modern geographic codes (mapped from historical)
country_code: str # ISO 3166-1 alpha-2
region_code: str # ISO 3166-2 subdivision
place_code: str # GeoNames 3-letter code
# Original historical reference
historical_place_name: str # As named in source
historical_date: str # As stated in source
# Mapping provenance
modern_mapping_method: str # How historical → modern mapping done
geonames_id: Optional[int] # GeoNames ID for place
# Quality indicators
is_birth_record: bool
can_assume_earliest: bool # No unexplored archives likely
source_confidence: float # 0.0 - 1.0
def is_valid_for_pid(self) -> bool:
"""
Determine if this observation is valid for PID generation.
"""
if self.is_birth_record:
return True
if self.observation_type == ObservationType.EARLIEST_REFERENCE:
# Must be able to assume this is actually the earliest
return self.can_assume_earliest and self.source_confidence >= 0.8
return False
4.2 Last Observation (Death or Latest During Lifetime)
@dataclass
class LastObservation:
"""
    Last observation of a person during their lifetime or immediately after death.
    Ideally a death record, but may be the last known living reference.
    """
    observation_type: ObservationType  # Reuses ObservationType; death-specific members (e.g., DEATH_CERTIFICATE) would extend it
# Modern geographic codes
country_code: str
region_code: str
place_code: str
# Original historical reference
historical_place_name: str
historical_date: str
# Critical distinction
is_death_record: bool
is_lifetime_observation: bool # True if person still alive at observation
is_immediate_post_death: bool # First record after death
# Quality
can_assume_latest: bool
source_confidence: float
def is_valid_for_pid(self) -> bool:
if self.is_death_record:
return True
if self.is_immediate_post_death:
# First mention of death
return self.source_confidence >= 0.8
if self.is_lifetime_observation:
# Last known alive, but not death record
return self.can_assume_latest and self.source_confidence >= 0.8
return False
4.3 Geographic Mapping: Historical → Modern
from dataclasses import dataclass
from datetime import datetime

class ManualResearchRequired(Exception):
    """Raised when no automatic historical-to-modern mapping is possible."""
@dataclass
class HistoricalPlaceMapping:
"""
Map historical place names to modern ISO/GeoNames codes.
Historical places must be mapped to their MODERN equivalents
as of the PPID generation date. This ensures stability even
when historical boundaries shifted.
"""
# Historical input
historical_name: str
historical_date: str # When the place was referenced
# Modern output (at PPID generation time)
modern_country_code: str # ISO 3166-1 alpha-2
modern_region_code: str # ISO 3166-2 suffix (e.g., "NH" not "NL-NH")
modern_place_code: str # 3-letter from GeoNames
# GeoNames reference
geonames_id: int
geonames_name: str # Modern canonical name
geonames_feature_class: str # P = populated place
geonames_feature_code: str # PPL, PPLA, PPLC, etc.
# Mapping provenance
mapping_method: str # "direct", "successor", "enclosing", "manual"
mapping_confidence: float
mapping_notes: str
ppid_generation_date: str # When mapping was performed
def map_historical_to_modern(
historical_name: str,
historical_date: str,
db
) -> HistoricalPlaceMapping:
"""
Map a historical place name to modern ISO/GeoNames codes.
Strategies (in order):
1. Direct match: Place still exists with same name
2. Successor: Place renamed but geographically same
3. Enclosing: Place absorbed into larger entity
4. Manual: Requires human research
"""
# Strategy 1: Direct GeoNames lookup
direct_match = db.geonames_search(historical_name)
if direct_match and direct_match.is_populated_place:
return HistoricalPlaceMapping(
historical_name=historical_name,
historical_date=historical_date,
modern_country_code=direct_match.country_code,
modern_region_code=direct_match.admin1_code,
modern_place_code=generate_place_code(direct_match.name),
geonames_id=direct_match.geonames_id,
geonames_name=direct_match.name,
geonames_feature_class=direct_match.feature_class,
geonames_feature_code=direct_match.feature_code,
mapping_method="direct",
mapping_confidence=0.95,
mapping_notes="Direct GeoNames match",
ppid_generation_date=datetime.utcnow().isoformat()
)
# Strategy 2: Historical name lookup (renamed places)
# e.g., "Batavia" → "Jakarta"
historical_match = db.historical_place_names.get(historical_name)
if historical_match:
modern = db.geonames_by_id(historical_match.modern_geonames_id)
return HistoricalPlaceMapping(
historical_name=historical_name,
historical_date=historical_date,
modern_country_code=modern.country_code,
modern_region_code=modern.admin1_code,
modern_place_code=generate_place_code(modern.name),
geonames_id=modern.geonames_id,
geonames_name=modern.name,
geonames_feature_class=modern.feature_class,
geonames_feature_code=modern.feature_code,
mapping_method="successor",
mapping_confidence=0.90,
mapping_notes=f"Historical name '{historical_name}' → modern '{modern.name}'",
ppid_generation_date=datetime.utcnow().isoformat()
)
# Strategy 3: Geographic coordinates (if available from source)
# Reverse geocode to find enclosing modern settlement
# Strategy 4: Manual research required
raise ManualResearchRequired(
f"Cannot automatically map '{historical_name}' ({historical_date}) to modern location"
)
def generate_place_code(place_name: str) -> str:
"""
Generate 3-letter place code from GeoNames name.
Rules (same as GHCID):
- Single word: First 3 letters → "Amsterdam" → "AMS"
- Multi-word: Initials → "New York" → "NYO" (or "NYC" if registered)
- Dutch articles: Article initial + 2 from main → "Den Haag" → "DHA"
"""
# Implementation follows GHCID rules
# See AGENTS.md: "SETTLEMENT STANDARDIZATION: GEONAMES IS AUTHORITATIVE"
pass
4.4 Century Range Calculation
def calculate_century_range(
first_observation: FirstObservation,
last_observation: LastObservation
) -> str:
"""
Calculate the CE century range for a person's lifetime.
Returns format: "CC-CC" (e.g., "19-20" for 1850-1925)
Rules:
- Centuries are 1-indexed: 1-100 AD = 1st century, 1901-2000 = 20th century
- BCE dates: Use negative century numbers (e.g., "-5--4" for 5th-4th century BCE)
This follows ISO 8601 extended format which uses negative years for BCE
- Range must be from verified observations
"""
def year_to_century(year: int) -> int:
"""
Convert year to century number.
Positive years (CE): 1-100 = century 1, 1901-2000 = century 20
Negative years (BCE): -500 to -401 = century -5
Note: There is no year 0 in the proleptic Gregorian calendar.
Year 1 BCE is followed directly by year 1 CE.
"""
if year > 0:
return ((year - 1) // 100) + 1
else:
# BCE: year -500 → century -5, year -1 → century -1
return (year // 100)
def parse_year(date_str: str) -> int:
"""Extract year from various date formats."""
# Handle: "1895", "1895-03-15", "March 1895", "c. 1895", etc.
# Also handle BCE: "-500", "500 BCE", "500 BC", "c. 500 BCE"
import re
# Check for BCE indicators
        bce_match = re.search(r'(\d+)\s*(BCE|BC|B\.C\.(E\.)?|v\.Chr\.)', date_str, re.IGNORECASE)
if bce_match:
return -int(bce_match.group(1))
# Check for negative year (ISO 8601 extended)
neg_match = re.search(r'-(\d+)', date_str)
if neg_match and date_str.strip().startswith('-'):
return -int(neg_match.group(1))
# Standard positive year
match = re.search(r'\b(\d{4})\b', date_str)
if match:
return int(match.group(1))
# 3-digit year (ancient dates)
match = re.search(r'\b(\d{3})\b', date_str)
if match:
return int(match.group(1))
raise ValueError(f"Cannot parse year from: {date_str}")
first_year = parse_year(first_observation.historical_date)
last_year = parse_year(last_observation.historical_date)
first_century = year_to_century(first_year)
last_century = year_to_century(last_year)
# Validation
if last_century < first_century:
raise ValueError(
f"Last observation ({last_year}) cannot be before "
f"first observation ({first_year})"
)
return f"{first_century}-{last_century}"
# Examples (CE):
# 1850 → century 19
# 1925 → century 20
# Range: "19-20"
# 1606 → century 17
# 1669 → century 17
# Range: "17-17" (same century)
# 1895 → century 19
# 2005 → century 21
# Range: "19-21" (centenarian)
# Examples (BCE):
# -500 (500 BCE) → century -5
# -401 (401 BCE) → century -5
# Range: "-5--5" (same century)
# -469 (469 BCE, Socrates birth) → century -5
# -399 (399 BCE, Socrates death) → century -4
# Range: "-5--4"
# -100 (100 BCE) → century -1
# 14 (14 CE) → century 1
# Range: "-1-1" (crossing BCE/CE boundary)
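The year-to-century mapping can be spot-checked with a standalone copy of the inner helper; the assertions mirror the documented examples:

```python
def year_to_century(year: int) -> int:
    # CE years are 1-indexed; BCE years rely on floor division (no year 0).
    return ((year - 1) // 100) + 1 if year > 0 else year // 100

assert year_to_century(1850) == 19   # 19th century CE
assert year_to_century(2000) == 20   # century boundary: 2000 is still the 20th
assert year_to_century(2001) == 21
assert year_to_century(14) == 1      # 14 CE → 1st century
assert year_to_century(-500) == -5   # 500 BCE → 5th century BCE
assert year_to_century(-401) == -5   # still 5th century BCE
assert year_to_century(-399) == -4   # Socrates' death → 4th century BCE
```

Python's floor division handles the BCE side correctly because -399 // 100 rounds toward negative infinity, yielding -4.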
4.5 Emic Label Tokens
from dataclasses import dataclass
from typing import Optional, List
import re
@dataclass
class EmicLabel:
"""
The common contemporary emic label of a person.
"Emic" = from the insider perspective, as the person was known
during their lifetime in primary sources.
"Etic" = from the outsider perspective, how we refer to them now.
Prefer emic; fall back to etic only if emic unrecoverable.
"""
full_label: str # Complete emic label
first_token: str # First word/token
last_token: str # Last word/token (empty if mononym)
# Source provenance
source_type: str # "primary" or "etic_fallback"
source_document: str # Reference to source
source_date: str # When source was created
# Quality
is_from_primary_source: bool
is_vernacular: bool # From vernacular (non-official) source
confidence: float
@classmethod
def from_full_label(cls, label: str, **kwargs) -> 'EmicLabel':
"""Parse full label into first and last tokens."""
tokens = tokenize_emic_label(label)
first_token = tokens[0].upper() if tokens else ""
last_token = tokens[-1].upper() if len(tokens) > 1 else ""
return cls(
full_label=label,
first_token=first_token,
last_token=last_token,
**kwargs
)
def tokenize_emic_label(label: str) -> List[str]:
"""
Tokenize an emic label into words.
Rules:
- Split on whitespace
- Preserve numeric tokens (e.g., "VIII" in "Henry VIII")
- Do NOT split compound words
- Normalize to uppercase for identifier
"""
# Basic whitespace split
tokens = label.strip().split()
# Filter empty tokens
tokens = [t for t in tokens if t]
return tokens
def extract_name_tokens(
full_emic_label: str
) -> tuple[str, str]:
"""
Extract first and last tokens from emic label.
Rules:
1. First token: First word of the emic label
2. Last token: Last word AFTER skipping tussenvoegsels (name particles)
Tussenvoegsels are common prefixes in Dutch and other languages that are
NOT part of the surname proper. They are skipped when extracting the
last token (same as GHCID abbreviation rules - AGENTS.md Rule 8).
Examples:
- "Jan van den Berg" → ("JAN", "BERG") # "van den" skipped
- "Rembrandt van Rijn" → ("REMBRANDT", "RIJN") # "van" skipped
- "Henry VIII" → ("HENRY", "VIII")
- "Maria Sibylla Merian" → ("MARIA", "MERIAN")
- "Ludwig van Beethoven" → ("LUDWIG", "BEETHOVEN") # "van" skipped
- "Vincent van Gogh" → ("VINCENT", "GOGH") # "van" skipped
- "Leonardo da Vinci" → ("LEONARDO", "VINCI") # "da" skipped
- "中村 太郎" → transliterated: ("NAKAMURA", "TARO")
"""
# Tussenvoegsels (name particles) to skip when finding last token
# Following GHCID pattern (AGENTS.md Rule 8: Legal Form Filtering)
TUSSENVOEGSELS = {
# Dutch
'van', 'de', 'den', 'der', 'het', "'t", 'te', 'ten', 'ter',
'van de', 'van den', 'van der', 'van het', "van 't",
'in de', 'in den', 'in het', "in 't",
'op de', 'op den', 'op het', "op 't",
# German
'von', 'vom', 'zu', 'zum', 'zur', 'von und zu',
# French
'de', 'du', 'des', 'de la', 'le', 'la', 'les',
# Italian
'da', 'di', 'del', 'della', 'dei', 'degli', 'delle',
# Spanish
'de', 'del', 'de la', 'de los', 'de las',
# Portuguese
'da', 'do', 'dos', 'das', 'de',
}
tokens = tokenize_emic_label(full_emic_label)
if len(tokens) == 0:
raise ValueError("Empty emic label")
first_token = tokens[0].upper()
if len(tokens) == 1:
# Mononym
last_token = ""
else:
# Find last token that is NOT a tussenvoegsel
# Work backwards from the end
last_token = ""
for token in reversed(tokens[1:]): # Skip first token
token_lower = token.lower()
if token_lower not in TUSSENVOEGSELS:
last_token = token.upper()
break
# If all remaining tokens are tussenvoegsels, use the actual last token
if not last_token:
last_token = tokens[-1].upper()
# Normalize: remove diacritics, special characters
first_token = normalize_token(first_token)
last_token = normalize_token(last_token)
return (first_token, last_token)
def normalize_token(token: str) -> str:
"""
Normalize token for PPID.
- Remove diacritics (é → E)
- Uppercase
- Allow alphanumeric only (for Roman numerals like VIII)
- Transliterate non-Latin scripts
"""
import unicodedata
# NFD decomposition + remove combining marks
normalized = unicodedata.normalize('NFD', token)
ascii_token = ''.join(
c for c in normalized
if unicodedata.category(c) != 'Mn'
)
# Uppercase
ascii_token = ascii_token.upper()
# Keep only alphanumeric
ascii_token = re.sub(r'[^A-Z0-9]', '', ascii_token)
return ascii_token
4.6 Emic vs Etic Fallback
@dataclass
class EmicLabelResolution:
"""
Resolution of emic label for a person.
Priority:
1. Emic from primary sources (documents from their lifetime)
2. Etic fallback (only if emic truly unrecoverable)
"""
resolved_label: EmicLabel
resolution_method: str # "emic_primary", "emic_vernacular", "etic_fallback"
emic_search_exhausted: bool
vernacular_sources_checked: List[str]
fallback_justification: Optional[str]
from datetime import datetime

class EmicLabelNotYetResolvable(Exception):
    """Raised when unexplored vernacular archives block an etic fallback."""

def resolve_emic_label(
person_observations: List['PersonObservation'],
db
) -> EmicLabelResolution:
"""
Resolve the emic label for a person from their observations.
Rules:
1. Search all primary sources for emic names
2. Prefer most frequently used name in primary sources
3. Only use etic fallback if emic truly unrecoverable
4. Vernacular sources must have clear pedigrees
5. Oral traditions without documentation not valid
"""
# Collect all name mentions from primary sources
emic_candidates = []
for obs in person_observations:
if obs.is_primary_source and obs.is_from_lifetime:
for claim in obs.claims:
if claim.claim_type in ('full_name', 'given_name', 'title'):
emic_candidates.append({
'label': claim.claim_value,
'source': obs.source_url,
'date': obs.source_date,
'is_vernacular': obs.is_vernacular_source
})
if emic_candidates:
# Find most common emic label
from collections import Counter
label_counts = Counter(c['label'] for c in emic_candidates)
most_common = label_counts.most_common(1)[0][0]
best_candidate = next(
c for c in emic_candidates if c['label'] == most_common
)
return EmicLabelResolution(
resolved_label=EmicLabel.from_full_label(
most_common,
source_type="primary",
source_document=best_candidate['source'],
source_date=best_candidate['date'],
is_from_primary_source=True,
is_vernacular=best_candidate['is_vernacular'],
confidence=0.95
),
resolution_method="emic_primary",
emic_search_exhausted=True,
vernacular_sources_checked=[c['source'] for c in emic_candidates if c['is_vernacular']],
fallback_justification=None
)
# Check if etic fallback is justified
unexplored_vernacular = db.get_unexplored_vernacular_archives(person_observations)
if unexplored_vernacular:
raise EmicLabelNotYetResolvable(
f"Emic label not found in explored sources. "
f"Unexplored vernacular archives exist: {unexplored_vernacular}. "
f"Cannot use etic fallback until these are explored."
)
# Etic fallback (rare)
etic_label = db.get_most_common_etic_label(person_observations)
return EmicLabelResolution(
resolved_label=EmicLabel.from_full_label(
etic_label,
source_type="etic_fallback",
source_document="Modern scholarly consensus",
source_date=datetime.utcnow().isoformat(),
is_from_primary_source=False,
is_vernacular=False,
confidence=0.70
),
resolution_method="etic_fallback",
emic_search_exhausted=True,
vernacular_sources_checked=[],
fallback_justification=(
"No emic label found in explored primary sources. "
"All known vernacular sources checked. "
"Using most common modern scholarly reference."
)
)
5. Collision Handling
5.1 Collision Detection
Two PPIDs collide when all components except the collision suffix match:
from typing import Set

def detect_collision(new_ppid: str, existing_ppids: Set[str]) -> bool:
"""
Check if new PPID collides with existing identifiers.
Collision = same base components (before any collision suffix).
"""
base_new = get_base_ppid(new_ppid)
for existing in existing_ppids:
base_existing = get_base_ppid(existing)
if base_new == base_existing:
return True
return False
def get_base_ppid(ppid: str) -> str:
"""Extract base PPID without collision suffix."""
# Full PPID may have collision suffix after last token
# e.g., "PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG-jan_van_den_berg"
# Base: "PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG"
parts = ppid.split('-')
    # A standard PPID splits into 11 parts: TYPE + 6 geographic codes
    # + 2 century numbers + first and last name tokens; anything extra
    # is the collision suffix.
if len(parts) > 11:
return '-'.join(parts[:11])
return ppid
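A standalone restatement of get_base_ppid, with the worked example from above as a self-check:

```python
def get_base_ppid(ppid: str) -> str:
    """Strip the collision suffix, if any (standalone restatement)."""
    parts = ppid.split('-')
    # 11 parts = type + 6 geo codes + 2 century numbers + 2 name tokens.
    return '-'.join(parts[:11]) if len(parts) > 11 else ppid

base = "PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG"
assert get_base_ppid(base) == base
assert get_base_ppid(base + "-jan_van_den_berg") == base
```

Collision detection then reduces to comparing these base forms.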
5.2 Collision Resolution via Full Emic Label
When collision occurs, append full emic label in snake_case:
def resolve_collision(
base_ppid: str,
full_emic_label: str,
existing_ppids: Set[str]
) -> str:
"""
Resolve collision by appending full emic label.
Example:
Base: "PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG"
Emic: "Jan van den Berg"
Result: "PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG-jan_van_den_berg"
"""
suffix = generate_collision_suffix(full_emic_label)
resolved = f"{base_ppid}-{suffix}"
# Check if still collides (extremely rare)
if resolved in existing_ppids:
# Add numeric discriminator
counter = 2
while f"{resolved}_{counter}" in existing_ppids:
counter += 1
resolved = f"{resolved}_{counter}"
return resolved
def generate_collision_suffix(full_emic_label: str) -> str:
"""
Generate collision suffix from full emic label.
Same rules as GHCID collision suffix:
- Convert to lowercase snake_case
- Remove diacritics
- Remove punctuation
"""
import unicodedata
import re
# Normalize unicode
normalized = unicodedata.normalize('NFD', full_emic_label)
ascii_name = ''.join(
c for c in normalized
if unicodedata.category(c) != 'Mn'
)
# Lowercase
lowercase = ascii_name.lower()
# Remove punctuation
no_punct = re.sub(r"[''`\",.:;!?()[\]{}]", '', lowercase)
# Replace spaces with underscores
underscored = re.sub(r'\s+', '_', no_punct)
# Remove non-alphanumeric except underscore
clean = re.sub(r'[^a-z0-9_]', '', underscored)
# Collapse multiple underscores
final = re.sub(r'_+', '_', clean).strip('_')
return final
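A compact standalone restatement of the suffix rules (the function name collision_suffix is illustrative) can serve as a self-check:

```python
import re
import unicodedata

def collision_suffix(full_emic_label: str) -> str:
    # Strip diacritics via NFD decomposition, then lowercase snake_case.
    decomposed = unicodedata.normalize('NFD', full_emic_label)
    ascii_label = ''.join(c for c in decomposed
                          if unicodedata.category(c) != 'Mn')
    underscored = re.sub(r'\s+', '_', ascii_label.lower())
    cleaned = re.sub(r'[^a-z0-9_]', '', underscored)
    return re.sub(r'_+', '_', cleaned).strip('_')

assert collision_suffix("Jan van den Berg") == "jan_van_den_berg"
assert collision_suffix("José María  Aznar") == "jose_maria_aznar"
```

Unlike the name tokens in the base PPID, the suffix keeps tussenvoegsels, which is what makes it an effective discriminator.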
6. Unknown Components: XX and XXX Placeholders
6.1 When Components Are Unknown
Unlike GHCID (where XX/XXX are temporary and require research), PPID may have permanently unknown components:
| Scenario | Placeholder | Can be PID? |
|---|---|---|
| Unknown birth country | XX | No (remains ID) |
| Unknown birth region | XX | No (remains ID) |
| Unknown birth place | XXX | No (remains ID) |
| Unknown death country | XX | No (remains ID) |
| Unknown death region | XX | No (remains ID) |
| Unknown death place | XXX | No (remains ID) |
| Unknown century | XX-XX | No (remains ID) |
| Unknown first token | UNKNOWN | No (remains ID) |
| Unknown last token | (empty) | Yes (if mononym) |
6.2 ID Examples with Unknown Components
ID-XX-XX-XXX-FR-NOR-OMH-20-20-UNKNOWN- # Unknown soldier, Normandy
ID-NL-NH-AMS-XX-XX-XXX-17-17-REMBRANDT- # Rembrandt, death place unknown
ID-XX-XX-XXX-XX-XX-XXX-XX-XX-ANONYMOUS- # Completely unknown person
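The rule that any placeholder blocks promotion can be sketched as a simple membership test; must_remain_id and the component lists below are illustrative only:

```python
PLACEHOLDERS = {"XX", "XXX", "XX-XX", "UNKNOWN"}

def must_remain_id(components):
    """Per the table in §6.1: any placeholder component blocks promotion
    to PID. An empty last token (mononym) is deliberately NOT a blocker."""
    return any(c in PLACEHOLDERS for c in components)

# Unknown soldier: birth anchors and name unknown, so the identifier stays an ID.
soldier = ["XX", "XX", "XXX", "FR", "NOR", "OMH", "20-20", "UNKNOWN", ""]
assert must_remain_id(soldier)

# Fully anchored person: eligible for PID (subject to the criteria in §2.3).
jan = ["NL", "NH", "AMS", "NL", "NH", "HAA", "19-20", "JAN", "BERG"]
assert not must_remain_id(jan)
```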
7. UUID and Numeric Generation
7.1 Dual Representation (Same as GHCID)
Every PPID generates three representations:
| Format | Purpose | Example |
|---|---|---|
| Semantic String | Human-readable | PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG |
| UUID v5 | Linked data, URIs | 550e8400-e29b-41d4-a716-446655440000 |
| Numeric (64-bit) | Database keys, CSV | 213324328442227739 |
7.2 Generation Algorithm
import uuid
import hashlib
# PPID namespace UUID (different from GHCID namespace)
PPID_NAMESPACE = uuid.UUID('f47ac10b-58cc-4372-a567-0e02b2c3d479')
def generate_ppid_identifiers(semantic_ppid: str) -> dict:
"""
Generate all identifier formats from semantic PPID string.
Returns:
{
'semantic': 'PID-NL-NH-AMS-...',
'uuid_v5': '550e8400-...',
'numeric': 213324328442227739
}
"""
# UUID v5 from semantic string
ppid_uuid = uuid.uuid5(PPID_NAMESPACE, semantic_ppid)
# Numeric from SHA-256 (64-bit)
sha256 = hashlib.sha256(semantic_ppid.encode()).digest()
numeric = int.from_bytes(sha256[:8], byteorder='big')
return {
'semantic': semantic_ppid,
'uuid_v5': str(ppid_uuid),
'numeric': numeric
}
# Example:
ppid = "PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG"
identifiers = generate_ppid_identifiers(ppid)
# {
# 'semantic': 'PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG',
# 'uuid_v5': 'a1b2c3d4-e5f6-5a1b-9c2d-3e4f5a6b7c8d',
# 'numeric': 1234567890123456789
# }
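Because both derived forms are pure functions of the semantic string, regeneration is deterministic, a property worth asserting in tests. A sketch using the namespace UUID defined above:

```python
import hashlib
import uuid

# PPID namespace UUID from §7.2.
PPID_NAMESPACE = uuid.UUID('f47ac10b-58cc-4372-a567-0e02b2c3d479')
semantic = "PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG"

# Same input always yields the same UUID v5 and the same 64-bit integer.
uuid_a = uuid.uuid5(PPID_NAMESPACE, semantic)
uuid_b = uuid.uuid5(PPID_NAMESPACE, semantic)
assert uuid_a == uuid_b

numeric = int.from_bytes(hashlib.sha256(semantic.encode()).digest()[:8], 'big')
assert 0 <= numeric < 2 ** 64

# A different semantic string yields a different UUID.
assert uuid.uuid5(PPID_NAMESPACE, semantic + "-x") != uuid_a
```

This means the UUID and numeric forms never need to be stored authoritatively; they can always be recomputed from the semantic string.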
8. Relationship to Person Observations
8.1 Distinction: PPID vs Observation Identifiers
| Identifier | Purpose | Structure | Persistence |
|---|---|---|---|
| PPID | Identify a person (reconstruction) | Geographic + temporal + emic | Permanent (if PID) |
| Observation ID | Identify a specific source observation | GHCID-based + RiC-O | Permanent |
8.2 Observation Identifier Structure (Forthcoming)
Observation identifiers will use a different pattern:
{REPOSITORY_GHCID}/{CREATOR_GHCID}/{RICO_RECORD_PATH}
Where:
- REPOSITORY_GHCID: GHCID of the institution holding the record
- CREATOR_GHCID: GHCID of the institution that created the record (may be same)
- RICO_RECORD_PATH: RiC-O derived path to RecordSet/Record/RecordPart
Example:
NL-NH-HAA-A-NHA/NL-NH-HAA-A-NHA/burgerlijke-stand/geboorten/1895/003/045
│ │ │
│ │ └── RiC-O path: fonds/series/file/item
│ └── Creator (same institution)
└── Repository
This is separate from PPID and will be specified in a future document.
9. Comparison with Original POID/PRID Design
9.1 What Changes
| Aspect | POID/PRID (Doc 05) | Revised PPID (This Doc) |
|---|---|---|
| Identifier opacity | Opaque (no semantic content) | Semantic (human-readable) |
| Geographic anchoring | None | Dual (birth + death locations) |
| Temporal anchoring | None | Century range |
| Name in identifier | None | First + last token |
| Type prefix | POID/PRID | ID/PID |
| Observation vs Person | Different identifier types | Completely separate systems |
| UUID backing | Primary | Secondary (derived) |
| Collision handling | UUID collision (rare) | Semantic collision (more common) |
9.2 What Stays the Same
- Dual identifier generation (UUID + numeric)
- Deterministic generation from input
- Permanent persistence (once PID)
- Integration with GHCID for institution links
- Claim-based provenance model
- PiCo ontology alignment
9.3 Transition Plan
If this revised structure is adopted:
- Document 05 becomes historical reference
- This document becomes the authoritative identifier spec
- No existing identifiers need migration (this is a new system)
- Code examples in other documents need updates
10. Implementation Considerations
10.1 Character Set and Length
# Maximum lengths
MAX_COUNTRY_CODE = 2 # ISO 3166-1 alpha-2
MAX_REGION_CODE = 3 # ISO 3166-2 suffix (some are 3 chars)
MAX_PLACE_CODE = 3 # GeoNames convention
MAX_CENTURY_RANGE = 5 # "XX-XX"
MAX_TOKEN_LENGTH = 20 # Reasonable limit for names
MAX_COLLISION_SUFFIX = 50 # Full emic label
# Maximum total PPID length (without collision suffix):
# "PID-" (4) + two geographic anchors "XX-XXX-XXX-" (11 each)
# + century range "XX-XX-" (6) + two name tokens (20 + 1 + 20)
# = ~73 characters
# With collision suffix: ~125 characters max
10.2 Validation Regex
import re
PPID_PATTERN = re.compile(
r'^(ID|PID)-' # Type
r'([A-Z]{2}|XX)-' # First country
r'([A-Z]{2,3}|XX)-' # First region
r'([A-Z]{3}|XXX)-' # First place
r'([A-Z]{2}|XX)-' # Last country
r'([A-Z]{2,3}|XX)-' # Last region
r'([A-Z]{3}|XXX)-' # Last place
r'(\d{1,2}-\d{1,2}|XX-XX)-' # Century range
r'([A-Z0-9]+)-' # First token
r'([A-Z0-9]*)' # Last token (may be empty)
r'(-[a-z0-9_]+)?$' # Collision suffix (optional)
)
def validate_ppid(ppid: str) -> tuple[bool, str]:
"""Validate PPID format."""
if not PPID_PATTERN.match(ppid):
return False, "Invalid PPID format"
# Additional semantic validation
parts = ppid.split('-')
# Century range validation
if len(parts) >= 9:
century_range = f"{parts[7]}-{parts[8]}"
if century_range != "XX-XX":
try:
first_c, last_c = map(int, [parts[7], parts[8]])
if last_c < first_c:
return False, "Last century cannot be before first century"
if first_c < 1 or last_c > 22: # Reasonable bounds
return False, "Century out of reasonable range"
except ValueError:
pass
return True, "Valid"
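The pattern can be smoke-tested against the worked examples from §3.3 (restated standalone here):

```python
import re

PPID_PATTERN = re.compile(
    r'^(ID|PID)-'
    r'([A-Z]{2}|XX)-([A-Z]{2,3}|XX)-([A-Z]{3}|XXX)-'  # first observation
    r'([A-Z]{2}|XX)-([A-Z]{2,3}|XX)-([A-Z]{3}|XXX)-'  # last observation
    r'(\d{1,2}-\d{1,2}|XX-XX)-'                        # century range
    r'([A-Z0-9]+)-([A-Z0-9]*)'                         # name tokens
    r'(-[a-z0-9_]+)?$'                                 # collision suffix
)

assert PPID_PATTERN.match("PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG")
assert PPID_PATTERN.match("PID-GB-ENG-LON-GB-ENG-LON-15-16-HENRY-VIII")
assert PPID_PATTERN.match("ID-XX-XX-XXX-XX-XX-XXX-XX-XX-ANONYMOUS-")
assert not PPID_PATTERN.match("pid-nl-nh-ams")  # wrong case, wrong shape
```

Note that as written the century-range alternative does not admit negative (BCE) centuries; that is consistent with §11.1 treating BCE dates as an open question.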
11. Open Questions
11.1 BCE Dates
How to handle persons from before Common Era?
Options:
- Negative century numbers: -5--4 for 5th-4th century BCE (the approach sketched in §4.4)
- BCE prefix: BCE5-BCE4
- Separate identifier scheme for ancient persons
11.2 Non-Latin Name Tokens
How to handle names in non-Latin scripts?
Options:
- Require transliteration (current approach)
- Allow Unicode tokens with normalization
- Dual representation (original + transliterated)
11.3 Disputed Locations
What if birth/death locations are historically disputed?
Options:
- Use most likely location with note
- Use XX/XXX until resolved
- Create multiple IDs, one for each interpretation
11.4 Living Persons
How to handle persons still alive (no death observation)?
Options:
- Cannot be PID until death
- Use XX-XX-XXX for the death location and the current century for the range
- Separate identifier class for living persons
12. References
GHCID Documentation
Related PPID Documents
Standards
- ISO 3166-1: Country codes
- ISO 3166-2: Subdivision codes
- GeoNames: Geographic names database