# PPID-GHCID Alignment: Revised Identifier Structure **Version**: 0.1.0 **Last Updated**: 2025-01-09 **Status**: DRAFT - Supersedes opaque identifier design in [05_identifier_structure_design.md](./05_identifier_structure_design.md) **Related**: [GHCID Specification](../../GHCID_PID_SCHEME.md) | [PiCo Ontology](./03_pico_ontology_analysis.md) --- ## 1. Executive Summary This document proposes a **revised PPID structure** that aligns with GHCID's geographic-semantic identifier pattern while accommodating the unique challenges of person identification across historical records. ### 1.1 Key Changes from Original Design | Aspect | Original (Doc 05) | Revised (This Document) | |--------|-------------------|-------------------------| | **Format** | Opaque hex (`POID-7a3b-c4d5-...`) | Semantic (`PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG`) | | **Type Distinction** | POID vs PRID | ID (temporary) vs PID (persistent) | | **Geographic** | None in identifier | Dual anchors: first + last observation location | | **Temporal** | None in identifier | ISO 8601 dates with variable precision (YYYY-MM-DD / YYYY-MM / YYYY) | | **Name** | None in identifier | First + last token of emic label (skip particles) | | **Delimiters** | Single type | Hierarchical: `_` (major groups) + `-` (within groups) | | **Persistence** | Always persistent | May remain ID indefinitely | ### 1.2 Design Philosophy The revised PPID follows the same principles as GHCID: 1. **Human-readable semantic components** that aid discovery and deduplication 2. **Geographic anchoring** to physical locations using GeoNames 3. **Temporal anchoring** to enable disambiguation across time 4. **Emic authenticity** using names from primary sources 5. **Collision resolution** via full emic label suffix 6. **Dual representation** as both semantic string and UUID/numeric --- ## 2. 
Identifier Type: ID vs PID

### 2.1 The Epistemic Uncertainty Problem

Unlike institutions (which typically have founding documents, legal registrations, and clear organizational boundaries), **persons in historical records often exist in epistemic uncertainty**:

- Incomplete records (many records lost to time)
- Ambiguous references (common names, no surnames)
- Conflicting sources (different dates, spellings)
- Undiscovered archives (unexplored record sets)

### 2.2 Two-Class Identifier System

| Type | Prefix | Description | Persistence | Promotion Path |
|------|--------|-------------|-------------|----------------|
| **ID** | `ID_` | Temporary identifier | May change | Can become PID |
| **PID** | `PID_` | Persistent identifier | Permanent | Cannot revert to ID |

(The prefix uses an underscore, consistent with the major delimiter defined in §3.1.)

### 2.3 Promotion Criteria: ID → PID

An identifier can be promoted from ID to PID when ALL of the following are satisfied:

```python
from dataclasses import dataclass


@dataclass
class PIDPromotionCriteria:
    """
    Criteria for promoting an ID to a PID.

    ALL conditions must be True for promotion.
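    Example (illustrative flag values, not derived from a real record):

        criteria = PIDPromotionCriteria(
            first_observation_verified=True,
            last_observation_verified=True,
            century_range_established=True,
            emic_label_verified=True,
            no_unexplored_archives=True,
            no_unresolved_conflicts=True,
            multiple_corroborating_sources=False,
        )
        criteria.is_promotable()  # False: one criterion is still unmet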
""" # Geographic anchors first_observation_verified: bool # Birth or equivalent last_observation_verified: bool # Death or equivalent # Temporal anchors century_range_established: bool # From verified observations # Identity anchors emic_label_verified: bool # From primary sources no_unexplored_archives: bool # Reasonable assumption # Quality checks no_unresolved_conflicts: bool # No conflicting claims multiple_corroborating_sources: bool # At least 2 independent sources def is_promotable(self) -> bool: return all([ self.first_observation_verified, self.last_observation_verified, self.century_range_established, self.emic_label_verified, self.no_unexplored_archives, self.no_unresolved_conflicts, self.multiple_corroborating_sources, ]) ``` ### 2.4 Permanent ID Status Some identifiers may **forever remain IDs** due to: - **Fragmentary records**: Only one surviving document mentions the person - **Uncertain dates**: Cannot establish century range - **Unknown location**: Cannot anchor geographically - **Anonymous figures**: No emic label recoverable - **Ongoing research**: Archives not yet explored This is acceptable and expected. An ID is still a valid identifier for internal use; it simply cannot be cited as a persistent identifier in scholarly work. --- ## 3. 
Identifier Structure ### 3.1 Full Format Specification ``` {TYPE}_{FL}_{FD}_{LL}_{LD}_{NT}[-{FULL_EMIC}] │ │ │ │ │ │ │ │ │ │ │ │ │ └── Collision suffix (optional, snake_case) │ │ │ │ │ └── Name Tokens (FIRST-LAST, hyphen-joined) │ │ │ │ └── Last observation Date (ISO 8601: YYYY-MM-DD or reduced) │ │ │ └── Last observation Location (CC-RR-PPP, hyphen-joined) │ │ └── First observation Date (ISO 8601: YYYY-MM-DD or reduced) │ └── First observation Location (CC-RR-PPP, hyphen-joined) └── Type: ID or PID Delimiters: - Underscore (_) = Major delimiter between logical groups - Hyphen (-) = Minor delimiter within groups ``` **Expanded with all components:** ``` {TYPE}_{FC-FR-FP}_{FD}_{LC-LR-LP}_{LD}_{FT-LT}[-{full_emic_label}] Example: PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG ``` ### 3.2 Component Definitions | Component | Format | Description | Example | |-----------|--------|-------------|---------| | **TYPE** | `ID` or `PID` | Identifier class | `PID` | | **FL** | `CC-RR-PPP` | First observation Location | `NL-NH-AMS` | | → FC | ISO 3166-1 α2 | Country code | `NL` | | → FR | ISO 3166-2 suffix | Region code | `NH` | | → FP | 3 letters | Place code (GeoNames) | `AMS` | | **FD** | ISO 8601 | First observation Date | `1895-03-15` or `1895` | | **LL** | `CC-RR-PPP` | Last observation Location | `NL-NH-HAA` | | → LC | ISO 3166-1 α2 | Country code | `NL` | | → LR | ISO 3166-2 suffix | Region code | `NH` | | → LP | 3 letters | Place code (GeoNames) | `HAA` | | **LD** | ISO 8601 | Last observation Date | `1970-08-22` or `1970` | | **NT** | `FIRST-LAST` | Name Tokens (emic label) | `JAN-BERG` | | → FT | UPPERCASE | First token | `JAN` | | → LT | UPPERCASE | Last token (skip particles) | `BERG` | | **FULL_EMIC** | snake_case | Full emic label (collision only) | `jan_van_den_berg` | ### 3.3 ISO 8601 Date Precision Levels Dates use ISO 8601 format with **variable precision** based on what can be verified: | Precision | Format | Example | When to Use | 
|-----------|--------|---------|-------------|
| **Day** | `YYYY-MM-DD` | `1895-03-15` | Birth/death certificate with exact date |
| **Month** | `YYYY-MM` | `1895-03` | Record states month but not day |
| **Year** | `YYYY` | `1895` | Only year known (common for historical figures) |
| **Unknown** | `XXXX` | `XXXX` | Cannot determine; identifier remains ID class |

**BCE Dates** (negative years, following the ISO 8601 extended sign convention):

| Year | PPID Format | Example PPID |
|------|-------------|--------------|
| 470 BCE | `-0470` | `PID_GR-AT-ATH_-0470_GR-AT-ATH_-0399_SOCRATES-` |
| 44 BCE | `-0044` | `PID_IT-RM-ROM_-0100_IT-RM-ROM_-0044_GAIUS-CAESAR` |

**Note on the sign convention**: PPID negates the BCE year directly (470 BCE → `-0470`), matching the formatting code in §4.4. Strict ISO 8601 instead uses astronomical year numbering, in which 1 BCE is year `0000` and 470 BCE is year `-0469`; PPID accepts this off-by-one divergence for readability.

**No Time Components**: Hours, minutes, and seconds are never included (impractical for historical persons).

### 3.4 Examples

| Person | Full Emic Label | PPID |
|--------|-----------------|------|
| Jan van den Berg (1895-03-15 → 1970-08-22) | Jan van den Berg | `PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG` |
| Rembrandt (1606-07-15 → 1669-10-04) | Rembrandt van Rijn | `PID_NL-ZH-LEI_1606-07-15_NL-NH-AMS_1669-10-04_REMBRANDT-RIJN` |
| Maria Sibylla Merian (1647 → 1717) | Maria Sibylla Merian | `PID_DE-HE-FRA_1647_NL-NH-AMS_1717_MARIA-MERIAN` |
| Socrates (c. 470 BCE → 399 BCE) | Σωκράτης (Sōkrátēs) | `PID_GR-AT-ATH_-0470_GR-AT-ATH_-0399_SOCRATES-` |
| Julius Caesar (100 BCE → 44 BCE) | Gaius Iulius Caesar | `PID_IT-RM-ROM_-0100_IT-RM-ROM_-0044_GAIUS-CAESAR` |
| Unknown soldier (?
→ 1944-06-06) | (unknown) | `ID_XX-XX-XXX_XXXX_FR-NM-OMH_1944-06-06_UNKNOWN-` | | Henry VIII (1491-06-28 → 1547-01-28) | Henry VIII | `PID_GB-ENG-LON_1491-06-28_GB-ENG-LON_1547-01-28_HENRY-VIII` | | Vincent van Gogh (1853-03-30 → 1890-07-29) | Vincent Willem van Gogh | `PID_NL-NB-ZUN_1853-03-30_FR-IDF-AUV_1890-07-29_VINCENT-GOGH` | **Notes on Emic Labels**: - Always use **formal/complete emic names** from primary sources, not modern colloquial short forms - "Rembrandt" alone is a modern convention; the emic label from his lifetime was "Rembrandt van Rijn" - **Tussenvoegsels (particles)** like "van", "de", "den", "der", "van de", "van den", "van der" are **skipped** when extracting the last token (see §4.5) - Non-Latin names are transliterated following GHCID transliteration standards (see AGENTS.md) - This follows the same pattern as GHCID abbreviation rules (AGENTS.md Rule 8) --- ## 4. Component Rules ### 4.1 First Observation (Birth or Earliest) ```python from dataclasses import dataclass from enum import Enum from typing import Optional class ObservationType(Enum): BIRTH_CERTIFICATE = "birth_certificate" # Highest authority BAPTISM_RECORD = "baptism_record" # Common for pre-civil registration BIRTH_STATEMENT = "birth_statement" # Stated birth in other document EARLIEST_REFERENCE = "earliest_reference" # Earliest surviving mention INFERRED = "inferred" # Inferred from context @dataclass class FirstObservation: """ First observation of a person during their lifetime. Ideally a birth record, but may be another early record. 
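    Example (illustrative field values; the GeoNames ID shown assumes
    Amsterdam's entry 2759794):

        obs = FirstObservation(
            observation_type=ObservationType.BIRTH_CERTIFICATE,
            country_code="NL",
            region_code="NH",
            place_code="AMS",
            historical_place_name="Amsterdam",
            historical_date="15 maart 1895",
            modern_mapping_method="direct",
            geonames_id=2759794,
            is_birth_record=True,
            can_assume_earliest=True,
            source_confidence=0.95,
        )
        obs.is_valid_for_pid()  # True: birth records are always valid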
""" observation_type: ObservationType # Modern geographic codes (mapped from historical) country_code: str # ISO 3166-1 alpha-2 region_code: str # ISO 3166-2 subdivision place_code: str # GeoNames 3-letter code # Original historical reference historical_place_name: str # As named in source historical_date: str # As stated in source # Mapping provenance modern_mapping_method: str # How historical → modern mapping done geonames_id: Optional[int] # GeoNames ID for place # Quality indicators is_birth_record: bool can_assume_earliest: bool # No unexplored archives likely source_confidence: float # 0.0 - 1.0 def is_valid_for_pid(self) -> bool: """ Determine if this observation is valid for PID generation. """ if self.is_birth_record: return True if self.observation_type == ObservationType.EARLIEST_REFERENCE: # Must be able to assume this is actually the earliest return self.can_assume_earliest and self.source_confidence >= 0.8 return False ``` ### 4.2 Last Observation (Death or Latest During Lifetime) ```python @dataclass class LastObservation: """ Last observation of a person during their lifetime or immediate after death. Ideally a death record, but may be last known living reference. """ observation_type: ObservationType # Reusing enum, but DEATH_CERTIFICATE etc. 
    # Modern geographic codes
    country_code: str
    region_code: str
    place_code: str

    # Original historical reference
    historical_place_name: str
    historical_date: str

    # Critical distinction
    is_death_record: bool
    is_lifetime_observation: bool   # True if person still alive at observation
    is_immediate_post_death: bool   # First record after death

    # Quality
    can_assume_latest: bool
    source_confidence: float

    def is_valid_for_pid(self) -> bool:
        if self.is_death_record:
            return True
        if self.is_immediate_post_death:
            # First mention of death
            return self.source_confidence >= 0.8
        if self.is_lifetime_observation:
            # Last known alive, but not death record
            return self.can_assume_latest and self.source_confidence >= 0.8
        return False
```

### 4.3 Geographic Mapping: Historical → Modern

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


class ManualResearchRequired(Exception):
    """Raised when a historical place cannot be mapped automatically."""


@dataclass
class HistoricalPlaceMapping:
    """
    Map historical place names to modern ISO/GeoNames codes.

    Historical places must be mapped to their MODERN equivalents as of the
    PPID generation date. This ensures stability even when historical
    boundaries shifted.
    """
    # Historical input
    historical_name: str
    historical_date: str          # When the place was referenced

    # Modern output (at PPID generation time)
    modern_country_code: str      # ISO 3166-1 alpha-2
    modern_region_code: str       # ISO 3166-2 suffix (e.g., "NH" not "NL-NH")
    modern_place_code: str        # 3-letter from GeoNames

    # GeoNames reference
    geonames_id: int
    geonames_name: str            # Modern canonical name
    geonames_feature_class: str   # P = populated place
    geonames_feature_code: str    # PPL, PPLA, PPLC, etc.

    # Mapping provenance
    mapping_method: str           # "direct", "successor", "enclosing", "manual"
    mapping_confidence: float
    mapping_notes: str
    ppid_generation_date: str     # When mapping was performed


def map_historical_to_modern(
    historical_name: str,
    historical_date: str,
    db
) -> HistoricalPlaceMapping:
    """
    Map a historical place name to modern ISO/GeoNames codes.

    Strategies (in order):
    1.
Direct match: Place still exists with same name 2. Successor: Place renamed but geographically same 3. Enclosing: Place absorbed into larger entity 4. Manual: Requires human research """ # Strategy 1: Direct GeoNames lookup direct_match = db.geonames_search(historical_name) if direct_match and direct_match.is_populated_place: return HistoricalPlaceMapping( historical_name=historical_name, historical_date=historical_date, modern_country_code=direct_match.country_code, modern_region_code=direct_match.admin1_code, modern_place_code=generate_place_code(direct_match.name), geonames_id=direct_match.geonames_id, geonames_name=direct_match.name, geonames_feature_class=direct_match.feature_class, geonames_feature_code=direct_match.feature_code, mapping_method="direct", mapping_confidence=0.95, mapping_notes="Direct GeoNames match", ppid_generation_date=datetime.utcnow().isoformat() ) # Strategy 2: Historical name lookup (renamed places) # e.g., "Batavia" → "Jakarta" historical_match = db.historical_place_names.get(historical_name) if historical_match: modern = db.geonames_by_id(historical_match.modern_geonames_id) return HistoricalPlaceMapping( historical_name=historical_name, historical_date=historical_date, modern_country_code=modern.country_code, modern_region_code=modern.admin1_code, modern_place_code=generate_place_code(modern.name), geonames_id=modern.geonames_id, geonames_name=modern.name, geonames_feature_class=modern.feature_class, geonames_feature_code=modern.feature_code, mapping_method="successor", mapping_confidence=0.90, mapping_notes=f"Historical name '{historical_name}' → modern '{modern.name}'", ppid_generation_date=datetime.utcnow().isoformat() ) # Strategy 3: Geographic coordinates (if available from source) # Reverse geocode to find enclosing modern settlement # Strategy 4: Manual research required raise ManualResearchRequired( f"Cannot automatically map '{historical_name}' ({historical_date}) to modern location" ) def generate_place_code(place_name: 
str) -> str: """ Generate 3-letter place code from GeoNames name. Rules (same as GHCID): - Single word: First 3 letters → "Amsterdam" → "AMS" - Multi-word: Initials → "New York" → "NYO" (or "NYC" if registered) - Dutch articles: Article initial + 2 from main → "Den Haag" → "DHA" """ # Implementation follows GHCID rules # See AGENTS.md: "SETTLEMENT STANDARDIZATION: GEONAMES IS AUTHORITATIVE" pass ``` ### 4.4 ISO 8601 Date Formatting ```python from dataclasses import dataclass from typing import Optional from datetime import date import re @dataclass class ObservationDate: """ Date of an observation formatted for PPID. Uses ISO 8601 with variable precision: - Full date: YYYY-MM-DD (1895-03-15) - Month precision: YYYY-MM (1895-03) - Year precision: YYYY (1895) - BCE dates: Negative years (-0469 for 469 BCE) - Unknown: XXXX No time components (hours/minutes/seconds) - impractical for historical persons. """ year: Optional[int] # None if unknown month: Optional[int] = None # 1-12, None if unknown day: Optional[int] = None # 1-31, None if unknown is_bce: bool = False # True for BCE dates # Provenance precision_source: str = "" # What source established this precision confidence: float = 1.0 # 0.0-1.0 def to_ppid_format(self) -> str: """ Format date for PPID component. Examples: - Full date: "1895-03-15" - Month only: "1895-03" - Year only: "1895" - BCE: "-0469" (469 BCE) - Unknown: "XXXX" """ if self.year is None: return "XXXX" # Handle BCE with leading minus and 4-digit padding if self.is_bce or self.year < 0: year_str = f"-{abs(self.year):04d}" else: year_str = f"{self.year:04d}" if self.month is None: return year_str if self.day is None: return f"{year_str}-{self.month:02d}" return f"{year_str}-{self.month:02d}-{self.day:02d}" @classmethod def from_historical_date(cls, date_str: str) -> 'ObservationDate': """ Parse a historical date string into ObservationDate. 
        Handles various formats:
        - ISO 8601: "1895-03-15", "1895-03", "1895"
        - BCE indicators: "469 BCE", "469 BC", "469 v.Chr."
        - Approximate: "c. 1895", "circa 1895", "~1895"
        - Ranges: "1890-1895" (uses midpoint with lower confidence)
        """
        if not date_str or date_str.upper() in ('UNKNOWN', '?', 'XXXX', ''):
            return cls(year=None)

        # Check for BCE indicators
        bce_match = re.search(r'(\d+)\s*(BCE|BC|B\.C\.E?\.?|v\.Chr\.)', date_str, re.IGNORECASE)
        if bce_match:
            return cls(year=int(bce_match.group(1)), is_bce=True)

        # Check for negative year (ISO 8601 extended)
        if date_str.strip().startswith('-'):
            neg_match = re.match(r'-(\d+)(?:-(\d{2}))?(?:-(\d{2}))?', date_str.strip())
            if neg_match:
                year = int(neg_match.group(1))
                month = int(neg_match.group(2)) if neg_match.group(2) else None
                day = int(neg_match.group(3)) if neg_match.group(3) else None
                return cls(year=year, month=month, day=day, is_bce=True)

        # Year range, e.g. "1890-1895": use midpoint with lower confidence.
        # Checked BEFORE the ISO match, which would otherwise misread the
        # range as year 1890 with invalid month 18.
        range_match = re.fullmatch(r'(?:c\.|circa|~)?\s*(\d{4})\s*[-–]\s*(\d{4})', date_str.strip())
        if range_match:
            midpoint = (int(range_match.group(1)) + int(range_match.group(2))) // 2
            return cls(year=midpoint, confidence=0.5)

        # Standard ISO 8601 format
        iso_match = re.match(r'(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?', date_str)
        if iso_match:
            year = int(iso_match.group(1))
            month = int(iso_match.group(2)) if iso_match.group(2) else None
            day = int(iso_match.group(3)) if iso_match.group(3) else None
            if month is not None and not 1 <= month <= 12:
                # Trailing digits were not a real month; keep year only
                month, day = None, None
            return cls(year=year, month=month, day=day)

        # Fallback: extract any 4-digit year
        year_match = re.search(r'\b(\d{4})\b', date_str)
        if year_match:
            return cls(year=int(year_match.group(1)), confidence=0.8)

        # 3-digit year (ancient dates)
        year_match = re.search(r'\b(\d{3})\b', date_str)
        if year_match:
            return cls(year=int(year_match.group(1)), confidence=0.7)

        return cls(year=None)


def validate_date_ordering(
    first_date: ObservationDate,
    last_date: ObservationDate
) -> tuple[bool, str]:
    """
    Validate that first observation is before last observation.
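    Example:

        first = ObservationDate(year=1895, month=3, day=15)
        last = ObservationDate(year=1970)
        validate_date_ordering(first, last)            # (True, "Valid")

        bce_first = ObservationDate(year=470, is_bce=True)
        bce_last = ObservationDate(year=399, is_bce=True)
        validate_date_ordering(bce_first, bce_last)    # (True, "Valid")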
""" if first_date.year is None or last_date.year is None: return True, "Unknown dates cannot be validated" first_year = -first_date.year if first_date.is_bce else first_date.year last_year = -last_date.year if last_date.is_bce else last_date.year if last_year < first_year: return False, f"Last observation ({last_date.to_ppid_format()}) before first ({first_date.to_ppid_format()})" return True, "Valid" # Examples: # ObservationDate(year=1895, month=3, day=15).to_ppid_format() -> "1895-03-15" # ObservationDate(year=1895, month=3).to_ppid_format() -> "1895-03" # ObservationDate(year=1895).to_ppid_format() -> "1895" # ObservationDate(year=469, is_bce=True).to_ppid_format() -> "-0469" # ObservationDate(year=None).to_ppid_format() -> "XXXX" ``` ### 4.5 Emic Label Tokens ```python from dataclasses import dataclass from typing import Optional, List import re @dataclass class EmicLabel: """ The common contemporary emic label of a person. "Emic" = from the insider perspective, as the person was known during their lifetime in primary sources. "Etic" = from the outsider perspective, how we refer to them now. Prefer emic; fall back to etic only if emic unrecoverable. 
""" full_label: str # Complete emic label first_token: str # First word/token last_token: str # Last word/token (empty if mononym) # Source provenance source_type: str # "primary" or "etic_fallback" source_document: str # Reference to source source_date: str # When source was created # Quality is_from_primary_source: bool is_vernacular: bool # From vernacular (non-official) source confidence: float @classmethod def from_full_label(cls, label: str, **kwargs) -> 'EmicLabel': """Parse full label into first and last tokens.""" tokens = tokenize_emic_label(label) first_token = tokens[0].upper() if tokens else "" last_token = tokens[-1].upper() if len(tokens) > 1 else "" return cls( full_label=label, first_token=first_token, last_token=last_token, **kwargs ) def tokenize_emic_label(label: str) -> List[str]: """ Tokenize an emic label into words. Rules: - Split on whitespace - Preserve numeric tokens (e.g., "VIII" in "Henry VIII") - Do NOT split compound words - Normalize to uppercase for identifier """ # Basic whitespace split tokens = label.strip().split() # Filter empty tokens tokens = [t for t in tokens if t] return tokens def extract_name_tokens( full_emic_label: str ) -> tuple[str, str]: """ Extract first and last tokens from emic label. Rules: 1. First token: First word of the emic label 2. Last token: Last word AFTER skipping tussenvoegsels (name particles) Tussenvoegsels are common prefixes in Dutch and other languages that are NOT part of the surname proper. They are skipped when extracting the last token (same as GHCID abbreviation rules - AGENTS.md Rule 8). 
    Examples:
    - "Jan van den Berg" → ("JAN", "BERG")          # "van den" skipped
    - "Rembrandt van Rijn" → ("REMBRANDT", "RIJN")  # "van" skipped
    - "Henry VIII" → ("HENRY", "VIII")
    - "Maria Sibylla Merian" → ("MARIA", "MERIAN")
    - "Ludwig van Beethoven" → ("LUDWIG", "BEETHOVEN")  # "van" skipped
    - "Vincent van Gogh" → ("VINCENT", "GOGH")          # "van" skipped
    - "Leonardo da Vinci" → ("LEONARDO", "VINCI")       # "da" skipped
    - "中村 太郎" → transliterated: ("NAKAMURA", "TARO")
    """
    # Tussenvoegsels (name particles) to skip when finding the last token.
    # Following GHCID pattern (AGENTS.md Rule 8: Legal Form Filtering).
    # Matching is done per whitespace token, so multi-word particles such as
    # "van den", "de la", or "von und zu" are covered by listing their
    # single-word components.
    TUSSENVOEGSELS = {
        # Dutch
        'van', 'de', 'den', 'der', 'het', "'t", 'te', 'ten', 'ter', 'in', 'op',
        # German
        'von', 'vom', 'zu', 'zum', 'zur', 'und',
        # French ('de' shared with Dutch)
        'du', 'des', 'le', 'la', 'les',
        # Italian
        'da', 'di', 'del', 'della', 'dei', 'degli', 'delle',
        # Spanish ('de', 'del' shared)
        'los', 'las',
        # Portuguese ('da', 'de' shared)
        'do', 'dos', 'das',
    }

    tokens = tokenize_emic_label(full_emic_label)
    if len(tokens) == 0:
        raise ValueError("Empty emic label")

    first_token = tokens[0].upper()

    if len(tokens) == 1:
        # Mononym
        last_token = ""
    else:
        # Find the last token that is NOT a tussenvoegsel, working backwards
        last_token = ""
        for token in reversed(tokens[1:]):  # Skip first token
            if token.lower() not in TUSSENVOEGSELS:
                last_token = token.upper()
                break
        # If all remaining tokens are tussenvoegsels, use the actual last token
        if not last_token:
            last_token = tokens[-1].upper()

    # Normalize: remove diacritics, special characters
    first_token = normalize_token(first_token)
    last_token = normalize_token(last_token)

    return (first_token, last_token)


def normalize_token(token: str) -> str:
    """
    Normalize token for PPID.
- Remove diacritics (é → E) - Uppercase - Allow alphanumeric only (for Roman numerals like VIII) - Transliterate non-Latin scripts """ import unicodedata # NFD decomposition + remove combining marks normalized = unicodedata.normalize('NFD', token) ascii_token = ''.join( c for c in normalized if unicodedata.category(c) != 'Mn' ) # Uppercase ascii_token = ascii_token.upper() # Keep only alphanumeric ascii_token = re.sub(r'[^A-Z0-9]', '', ascii_token) return ascii_token ``` ### 4.6 Emic vs Etic Fallback ```python @dataclass class EmicLabelResolution: """ Resolution of emic label for a person. Priority: 1. Emic from primary sources (documents from their lifetime) 2. Etic fallback (only if emic truly unrecoverable) """ resolved_label: EmicLabel resolution_method: str # "emic_primary", "emic_vernacular", "etic_fallback" emic_search_exhausted: bool vernacular_sources_checked: List[str] fallback_justification: Optional[str] def resolve_emic_label( person_observations: List['PersonObservation'], db ) -> EmicLabelResolution: """ Resolve the emic label for a person from their observations. Rules: 1. Search all primary sources for emic names 2. Prefer most frequently used name in primary sources 3. Only use etic fallback if emic truly unrecoverable 4. Vernacular sources must have clear pedigrees 5. 
Oral traditions without documentation not valid """ # Collect all name mentions from primary sources emic_candidates = [] for obs in person_observations: if obs.is_primary_source and obs.is_from_lifetime: for claim in obs.claims: if claim.claim_type in ('full_name', 'given_name', 'title'): emic_candidates.append({ 'label': claim.claim_value, 'source': obs.source_url, 'date': obs.source_date, 'is_vernacular': obs.is_vernacular_source }) if emic_candidates: # Find most common emic label from collections import Counter label_counts = Counter(c['label'] for c in emic_candidates) most_common = label_counts.most_common(1)[0][0] best_candidate = next( c for c in emic_candidates if c['label'] == most_common ) return EmicLabelResolution( resolved_label=EmicLabel.from_full_label( most_common, source_type="primary", source_document=best_candidate['source'], source_date=best_candidate['date'], is_from_primary_source=True, is_vernacular=best_candidate['is_vernacular'], confidence=0.95 ), resolution_method="emic_primary", emic_search_exhausted=True, vernacular_sources_checked=[c['source'] for c in emic_candidates if c['is_vernacular']], fallback_justification=None ) # Check if etic fallback is justified unexplored_vernacular = db.get_unexplored_vernacular_archives(person_observations) if unexplored_vernacular: raise EmicLabelNotYetResolvable( f"Emic label not found in explored sources. " f"Unexplored vernacular archives exist: {unexplored_vernacular}. " f"Cannot use etic fallback until these are explored." 
) # Etic fallback (rare) etic_label = db.get_most_common_etic_label(person_observations) return EmicLabelResolution( resolved_label=EmicLabel.from_full_label( etic_label, source_type="etic_fallback", source_document="Modern scholarly consensus", source_date=datetime.utcnow().isoformat(), is_from_primary_source=False, is_vernacular=False, confidence=0.70 ), resolution_method="etic_fallback", emic_search_exhausted=True, vernacular_sources_checked=[], fallback_justification=( "No emic label found in explored primary sources. " "All known vernacular sources checked. " "Using most common modern scholarly reference." ) ) ``` --- ## 5. Collision Handling ### 5.1 Collision Detection Two PPIDs collide when all components except the collision suffix match: ```python def detect_collision(new_ppid: str, existing_ppids: Set[str]) -> bool: """ Check if new PPID collides with existing identifiers. Collision = same base components (before any collision suffix). """ base_new = get_base_ppid(new_ppid) for existing in existing_ppids: base_existing = get_base_ppid(existing) if base_new == base_existing: return True return False def get_base_ppid(ppid: str) -> str: """Extract base PPID without collision suffix. 
    New format uses underscore as major delimiter:

        PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG-jan_van_den_berg
                                                              ↑ collision suffix starts here

    Base PPID has exactly 6 underscore-delimited parts: TYPE_FL_FD_LL_LD_NT
    """
    # Split by underscore (major delimiter). A snake_case collision suffix
    # contains underscores, so it produces extra parts after the split, and
    # its first word stays hyphen-attached to the name-token part
    # (e.g. "JAN-BERG-jan").
    parts = ppid.split('_')

    # Standard PPID has 6 parts: TYPE, FL, FD, LL, LD, NT
    if len(parts) < 6:
        return ppid  # Invalid format; return unchanged

    # The NT part is FIRST-LAST (or "FIRST-" for a mononym). Anything after
    # the first two hyphen-joined tokens belongs to a collision suffix or a
    # Tier 2/3 discriminator and must be stripped.
    nt_tokens = parts[5].split('-')
    if len(nt_tokens) > 2:
        parts[5] = '-'.join(nt_tokens[:2])

    # Drop any extra parts produced by underscores inside the suffix
    return '_'.join(parts[:6])
```

### 5.2 Collision Resolution Strategy

Collisions are resolved through a **three-tier escalation** strategy:

1. **Tier 1**: Append full emic label in snake_case
2. **Tier 2**: If still collides, add 8-character hash discriminator
3. **Tier 3**: If still collides (virtually impossible), add timestamp-based discriminator

```python
import hashlib
import secrets
from datetime import datetime
from typing import Set


def resolve_collision(
    base_ppid: str,
    full_emic_label: str,
    existing_ppids: Set[str],
    distinguishing_data: dict = None
) -> str:
    """
    Resolve collision using three-tier escalation strategy.
Args: base_ppid: The base PPID without collision suffix full_emic_label: The person's full emic name existing_ppids: Set of existing PPIDs to check against distinguishing_data: Optional dict with additional data for hashing (e.g., occupation, parent names, source document ID) Example: Base: "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG" Emic: "Jan van den Berg" Tier 1 Result: "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG-jan_van_den_berg" Tier 2 Result: "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG-jan_van_den_berg-a7b3c2d1" Tier 3 Result: "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG-jan_van_den_berg-20250109143022" """ # Tier 1: Full emic label suffix emic_suffix = generate_collision_suffix(full_emic_label) tier1_ppid = f"{base_ppid}-{emic_suffix}" if tier1_ppid not in existing_ppids: return tier1_ppid # Tier 2: Add deterministic hash discriminator # Hash is based on distinguishing data if provided, otherwise random if distinguishing_data: # Deterministic: hash of distinguishing data hash_input = f"{tier1_ppid}|{sorted(distinguishing_data.items())}" hash_bytes = hashlib.sha256(hash_input.encode()).digest() discriminator = hash_bytes[:4].hex() # 8 hex characters else: # Fallback: cryptographically secure random discriminator = secrets.token_hex(4) # 8 hex characters tier2_ppid = f"{tier1_ppid}-{discriminator}" if tier2_ppid not in existing_ppids: return tier2_ppid # Tier 3: Timestamp-based (virtually impossible to reach) # This should never happen with random discriminator, but provides safety timestamp = datetime.utcnow().strftime("%Y%m%d%H%M%S") tier3_ppid = f"{tier1_ppid}-{timestamp}" # Final fallback: add microseconds if still colliding while tier3_ppid in existing_ppids: timestamp = datetime.utcnow().strftime("%Y%m%d%H%M%S%f") tier3_ppid = f"{tier1_ppid}-{timestamp}" return tier3_ppid def generate_collision_suffix(full_emic_label: str) -> str: """ Generate collision suffix from full emic label. 
Same rules as GHCID collision suffix: - Convert to lowercase snake_case - Remove diacritics - Remove punctuation """ import unicodedata import re # Normalize unicode normalized = unicodedata.normalize('NFD', full_emic_label) ascii_name = ''.join( c for c in normalized if unicodedata.category(c) != 'Mn' ) # Lowercase lowercase = ascii_name.lower() # Remove punctuation no_punct = re.sub(r"[''`\",.:;!?()[\]{}]", '', lowercase) # Replace spaces with underscores underscored = re.sub(r'\s+', '_', no_punct) # Remove non-alphanumeric except underscore clean = re.sub(r'[^a-z0-9_]', '', underscored) # Collapse multiple underscores final = re.sub(r'_+', '_', clean).strip('_') return final ``` ### 5.3 Distinguishing Data for Tier 2 Hash When two persons have identical base PPID and emic label, use **distinguishing data** to generate a deterministic hash: | Priority | Distinguishing Data | Example | |----------|---------------------|---------| | 1 | Source document ID | `"NL-NH-HAA/BS/Geb/1895/123"` | | 2 | Parent names | `"father:Pieter_van_den_Berg"` | | 3 | Occupation | `"occupation:timmerman"` | | 4 | Spouse name | `"spouse:Maria_Jansen"` | | 5 | Unique claim from observation | Any distinguishing fact | ```python # Example: Two "Jan van den Berg" born same day, same place distinguishing_data_person_1 = { "source_document": "NL-NH-HAA/BS/Geb/1895/123", "father_name": "Pieter van den Berg", "occupation": "timmerman" } distinguishing_data_person_2 = { "source_document": "NL-NH-HAA/BS/Geb/1895/456", "father_name": "Hendrik van den Berg", "occupation": "bakker" } # Results in different deterministic hashes: # Person 1: PID_NL-NH-AMS_1895-03-15_..._JAN-BERG-jan_van_den_berg-a7b3c2d1 # Person 2: PID_NL-NH-AMS_1895-03-15_..._JAN-BERG-jan_van_den_berg-f2e8d4a9 ``` ### 5.4 Collision Probability Analysis | Tier | Collision Probability | When Triggered | |------|----------------------|----------------| | **Base PPID** | ~1/10,000 for common names | Same location, date, name tokens | | 
**Tier 1** (+emic) | ~1/1,000,000 | Same full emic label | | **Tier 2** (+hash) | ~1/4.3 billion | Same emic AND no distinguishing data | | **Tier 3** (+time) | ~0 | Cryptographic failure | **Practical Impact**: For a dataset of 10 million persons, expected Tier 2 collisions ≈ 0.002 (effectively zero). --- ## 6. Unknown Components: XX and XXX Placeholders ### 6.1 When Components Are Unknown Unlike GHCID (where `XX`/`XXX` are temporary and require research), PPID may have permanently unknown components: | Scenario | Placeholder | Can be PID? | |----------|-------------|-------------| | Unknown birth country | `XX` | No (remains ID) | | Unknown birth region | `XX` | No (remains ID) | | Unknown birth place | `XXX` | No (remains ID) | | Unknown death country | `XX` | No (remains ID) | | Unknown death region | `XX` | No (remains ID) | | Unknown death place | `XXX` | No (remains ID) | | Unknown date | `XXXX` | No (remains ID) | | Unknown first token | `UNKNOWN` | No (remains ID) | | Unknown last token | (empty after hyphen) | Yes (if mononym) | ### 6.2 ID Examples with Unknown Components ``` ID_XX-XX-XXX_XXXX_FR-NM-OMH_1944-06-06_UNKNOWN- # Unknown soldier, Normandy ID_NL-NH-AMS_1606_XX-XX-XXX_XXXX_REMBRANDT- # Rembrandt, death unknown (hypothetical) ID_XX-XX-XXX_XXXX_XX-XX-XXX_XXXX_ANONYMOUS- # Completely unknown person ID_NL-ZH-LEI_1606-07_NL-NH-AMS_1669_REMBRANDT-RIJN # Rembrandt, month known for birth, only year for death ``` --- ## 7. 
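The placeholder rules in the table of section 6.1 amount to a mechanical check over the six underscore-delimited groups of the base identifier. A minimal sketch (the `must_remain_id` helper is hypothetical, not part of the spec):

```python
def must_remain_id(identifier: str) -> bool:
    """Return True if the identifier contains unknown components and
    therefore cannot be promoted to PID (per the table in section 6.1).

    Hypothetical helper: inspects the six underscore-delimited groups
    for XX/XXX location placeholders, XXXX dates, and the UNKNOWN
    first name token.
    """
    parts = identifier.split('_')
    if len(parts) < 6:
        raise ValueError("expected at least 6 underscore-delimited groups")
    first_loc, first_date, last_loc, last_date, name = parts[1:6]

    # Placeholder location codes: XX (country/region) or XXX (place)
    for loc in (first_loc, last_loc):
        if any(code in ('XX', 'XXX') for code in loc.split('-')):
            return True

    # Placeholder dates
    if 'XXXX' in (first_date, last_date):
        return True

    # Unknown first name token
    if name.split('-')[0] == 'UNKNOWN':
        return True

    return False

# The unknown soldier stays an ID; a fully anchored person need not:
assert must_remain_id("ID_XX-XX-XXX_XXXX_FR-NM-OMH_1944-06-06_UNKNOWN-")
assert not must_remain_id("PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG")
```

Note that an empty last name token (the mononym case) is deliberately not flagged, matching the table's "Yes (if mononym)" entry.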
UUID and Numeric Generation

### 7.1 Dual Representation (Same as GHCID)

Every PPID generates three representations:

| Format | Purpose | Example (illustrative) |
|--------|---------|------------------------|
| **Semantic String** | Human-readable | `PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG` |
| **UUID v5** | Linked data, URIs | `550e8400-e29b-41d4-a716-446655440000` |
| **Numeric (64-bit)** | Database keys, CSV | `213324328442227739` |

### 7.2 Generation Algorithm

```python
import uuid
import hashlib

# PPID namespace UUID (different from GHCID namespace)
PPID_NAMESPACE = uuid.UUID('f47ac10b-58cc-4372-a567-0e02b2c3d479')

def generate_ppid_identifiers(semantic_ppid: str) -> dict:
    """
    Generate all identifier formats from semantic PPID string.

    Returns:
        {
            'semantic': 'PID_NL-NH-AMS_1895-03-15_...',
            'uuid_v5': '550e8400-...',
            'numeric': 213324328442227739
        }
    """
    # UUID v5 from semantic string
    ppid_uuid = uuid.uuid5(PPID_NAMESPACE, semantic_ppid)

    # Numeric from SHA-256 (64-bit)
    sha256 = hashlib.sha256(semantic_ppid.encode()).digest()
    numeric = int.from_bytes(sha256[:8], byteorder='big')

    return {
        'semantic': semantic_ppid,
        'uuid_v5': str(ppid_uuid),
        'numeric': numeric
    }

# Example:
ppid = "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG"
identifiers = generate_ppid_identifiers(ppid)
# Output shape (UUID and numeric values below are illustrative
# placeholders; the actual values are deterministic for the input):
# {
#     'semantic': 'PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG',
#     'uuid_v5': 'a1b2c3d4-e5f6-5a1b-9c2d-3e4f5a6b7c8d',
#     'numeric': 1234567890123456789
# }
```

---

## 8.
Relationship to Person Observations

### 8.1 Distinction: PPID vs Observation Identifiers

| Identifier | Purpose | Structure | Persistence |
|------------|---------|-----------|-------------|
| **PPID** | Identify a person (reconstruction) | Geographic + temporal + emic | Permanent (if PID) |
| **Observation ID** | Identify a specific source observation | GHCID-based + RiC-O | Permanent |

### 8.2 Observation Identifier Structure (Forthcoming)

As noted earlier, observation identifiers will use a different pattern:

```
{REPOSITORY_GHCID}/{CREATOR_GHCID}/{RICO_RECORD_PATH}
```

Where:

- **REPOSITORY_GHCID**: GHCID of the institution holding the record
- **CREATOR_GHCID**: GHCID of the institution that created the record (may be the same)
- **RICO_RECORD_PATH**: RiC-O derived path to RecordSet/Record/RecordPart

Example:

```
NL-NH-HAA-A-NHA/NL-NH-HAA-A-NHA/burgerlijke-stand/geboorten/1895/003/045
│               │               └── RiC-O path: fonds/series/file/item
│               └── Creator (same institution)
└── Repository
```

This is **separate from PPID** and will be specified in a future document.

---

## 9.
Comparison with Original POID/PRID Design

### 9.1 What Changes

| Aspect | POID/PRID (Doc 05) | Revised PPID (This Doc) |
|--------|-------------------|-------------------------|
| **Identifier opacity** | Opaque (no semantic content) | Semantic (human-readable) |
| **Geographic anchoring** | None | Dual (birth + death locations) |
| **Temporal anchoring** | None | Dual ISO 8601 dates (first + last observation) |
| **Name in identifier** | None | First + last token |
| **Type prefix** | POID/PRID | ID/PID |
| **Observation vs Person** | Different identifier types | Completely separate systems |
| **UUID backing** | Primary | Secondary (derived) |
| **Collision handling** | UUID collision (rare) | Semantic collision (more common) |

### 9.2 What Stays the Same

- Dual identifier generation (UUID + numeric)
- Deterministic generation from input
- Permanent persistence (once PID)
- Integration with GHCID for institution links
- Claim-based provenance model
- PiCo ontology alignment

### 9.3 Transition Plan

If this revised structure is adopted:

1. **Document 05** becomes a historical reference
2. **This document** becomes the authoritative identifier spec
3. No existing identifiers need migration (this is a new system)
4. Code examples in other documents need updates

---

## 10.
Implementation Considerations

### 10.1 Character Set and Length

```python
# Component lengths
MAX_COUNTRY_CODE = 2       # ISO 3166-1 alpha-2
MAX_REGION_CODE = 3        # ISO 3166-2 suffix (some are 3 chars)
MAX_PLACE_CODE = 3         # GeoNames convention
MAX_DATE = 10              # YYYY-MM-DD (ISO 8601)
MAX_TOKEN_LENGTH = 20      # Reasonable limit for names
MAX_COLLISION_SUFFIX = 50  # Full emic label in snake_case

# Example PPID structure (without collision suffix):
# PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG
#  3 +1+ 10 +1+ 10 +1+ 10 +1+ 10 +1+ 20
# ≈ 70 characters maximum
# With collision suffix: ~120 characters max
```

### 10.2 Validation Regex

```python
import re

# Location pattern: CC-RR-PPP (country-region-place)
LOCATION_PATTERN = r'([A-Z]{2}|XX)-([A-Z]{2,3}|XX)-([A-Z]{3}|XXX)'

# Date pattern: YYYY-MM-DD or YYYY-MM or YYYY or XXXX (including BCE with leading -)
DATE_PATTERN = r'(-?\d{4}(?:-\d{2}(?:-\d{2})?)?|XXXX)'

# Name tokens pattern: FIRST-LAST or FIRST- (for mononyms)
NAME_PATTERN = r'([A-Z0-9]+)-([A-Z0-9]*)'

# Full PPID pattern
PPID_PATTERN = re.compile(
    r'^(ID|PID)'                       # Type
    r'_' + LOCATION_PATTERN +          # First location (underscore + CC-RR-PPP)
    r'_' + DATE_PATTERN +              # First date (underscore + ISO 8601)
    r'_' + LOCATION_PATTERN +          # Last location (underscore + CC-RR-PPP)
    r'_' + DATE_PATTERN +              # Last date (underscore + ISO 8601)
    r'_' + NAME_PATTERN +              # Name tokens (underscore + FIRST-LAST)
    r'(-[a-z0-9_]+(-[a-f0-9]{8})?)?$'  # Collision suffix (optional Tier 1 emic label + optional Tier 2 hash)
)

def validate_ppid(ppid: str) -> tuple[bool, str]:
    """Validate PPID format."""
    if not PPID_PATTERN.match(ppid):
        return False, "Invalid PPID format"

    # Split by major delimiter (underscore); a Tier 1 collision suffix may
    # contain underscores, so require at least (not exactly) 6 parts
    parts = ppid.split('_')
    if len(parts) < 6:
        return False, "Incomplete PPID - requires 6 underscore-delimited parts"

    # Extract dates for validation
    first_date = parts[2]
    last_date = parts[4]

    # Date ordering validation (if both are known)
    if first_date != 'XXXX' and last_date != 'XXXX':
        # Parse years (handle BCE with leading -)
        try:
            first_year = (int(first_date.split('-')[0])
                          if not first_date.startswith('-')
                          else -int(first_date.split('-')[1]))
            last_year = (int(last_date.split('-')[0])
                         if not last_date.startswith('-')
                         else -int(last_date.split('-')[1]))
            if last_year < first_year:
                return False, "Last observation date cannot be before first observation date"
        except (ValueError, IndexError):
            pass  # Invalid date format caught by regex

    return True, "Valid"

# Example validations:
assert validate_ppid("PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG")[0]
assert validate_ppid("PID_GR-AT-ATH_-0470_GR-AT-ATH_-0399_SOCRATES-")[0]
assert validate_ppid("ID_XX-XX-XXX_XXXX_FR-NM-OMH_1944-06-06_UNKNOWN-")[0]
assert not validate_ppid("PID_NL-NH-AMS_1970_NL-NH-HAA_1895_JAN-BERG")[0]  # Dates reversed
```

---

## 11. Open Questions

### 11.1 BCE Dates

**RESOLVED**: Use ISO 8601 extended format with negative years.

- `-0469` for 469 BCE
- `-0044` for 44 BCE
- Examples in sections 3.3 and 3.4

### 11.2 Non-Latin Name Tokens

**RESOLVED**: Apply the same transliteration rules as GHCID (see AGENTS.md).

| Script | Standard |
|--------|----------|
| Cyrillic | ISO 9:1995 |
| Chinese | Hanyu Pinyin (ISO 7098) |
| Japanese | Modified Hepburn |
| Korean | Revised Romanization |
| Arabic | ISO 233-2/3 |

### 11.3 Disputed Locations

**RESOLVED**: Not a PPID concern - handled by ISO standardization. Use modern ISO-standardized location codes; document disputes in observation metadata.

### 11.4 Living Persons

**RESOLVED**: Living persons are **always ID class** and can be promoted to PID only after death.

- Living persons have no verified last observation (death date/location)
- Use `XXXX` for the unknown death date and `XX-XX-XXX` for the unknown death location
- Example: `ID_NL-NH-AMS_1985-06-15_XX-XX-XXX_XXXX_JAN-BERG`
- Can be promoted to PID only after a death observation is verified

**Rationale**:

1. PID requires a verified last observation (death)
2. Living persons have incomplete lifecycle data
3. Future observations may change the identity assessment
4. Privacy considerations for living individuals

---

## 12. References

### GHCID Documentation

- [GHCID PID Scheme](../../GHCID_PID_SCHEME.md)
- [AGENTS.md: Persistent Identifiers](../../AGENTS.md#persistent-identifiers-ghcid)

### Related PPID Documents

- [Original Identifier Structure (superseded)](./05_identifier_structure_design.md)
- [PiCo Ontology Analysis](./03_pico_ontology_analysis.md)
- [Cultural Naming Conventions](./04_cultural_naming_conventions.md)

### Standards

- ISO 3166-1: Country codes
- ISO 3166-2: Subdivision codes
- ISO 8601: Date and time format (including BCE with negative years)
- GeoNames: Geographic names database