PPID-GHCID Alignment: Revised Identifier Structure
Version: 0.1.0
Last Updated: 2025-01-09
Status: DRAFT - Supersedes opaque identifier design in 05_identifier_structure_design.md
Related: GHCID Specification | PiCo Ontology
1. Executive Summary
This document proposes a revised PPID structure that aligns with GHCID's geographic-semantic identifier pattern while accommodating the unique challenges of person identification across historical records.
1.1 Key Changes from Original Design
| Aspect | Original (Doc 05) | Revised (This Document) |
|---|---|---|
| Format | Opaque hex (POID-7a3b-c4d5-...) | Semantic (PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG) |
| Type Distinction | POID vs PRID | ID (temporary) vs PID (persistent) |
| Geographic | None in identifier | Dual anchors: first + last observation location |
| Temporal | None in identifier | ISO 8601 dates with variable precision (YYYY-MM-DD / YYYY-MM / YYYY) |
| Name | None in identifier | First + last token of emic label (skip particles) |
| Delimiters | Single type | Hierarchical: _ (major groups) + - (within groups) |
| Persistence | Always persistent | May remain ID indefinitely |
1.2 Design Philosophy
The revised PPID follows the same principles as GHCID:
- Human-readable semantic components that aid discovery and deduplication
- Geographic anchoring to physical locations using GeoNames
- Temporal anchoring to enable disambiguation across time
- Emic authenticity using names from primary sources
- Collision resolution via full emic label suffix
- Dual representation as both semantic string and UUID/numeric
2. Identifier Type: ID vs PID
2.1 The Epistemic Uncertainty Problem
Unlike institutions (which typically have founding documents, legal registrations, and clear organizational boundaries), persons in historical records often exist in epistemic uncertainty:
- Incomplete records (many records lost to time)
- Ambiguous references (common names, no surnames)
- Conflicting sources (different dates, spellings)
- Undiscovered archives (unexplored record sets)
2.2 Two-Class Identifier System
| Type | Prefix | Description | Persistence | Promotion Path |
|---|---|---|---|---|
| ID | ID- | Temporary identifier | May change | Can become PID |
| PID | PID- | Persistent identifier | Permanent | Cannot revert to ID |
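The one-way promotion rule can be sketched as follows. `IdentifierClass` and `promote` are hypothetical illustrations of the table above, not part of the specification:

```python
from enum import Enum

class IdentifierClass(Enum):
    ID = "ID"    # temporary; may change
    PID = "PID"  # persistent; permanent

def promote(current: IdentifierClass, criteria_met: bool) -> IdentifierClass:
    """IDs may be promoted when all criteria hold; PIDs never revert."""
    if current is IdentifierClass.PID:
        return IdentifierClass.PID  # cannot revert to ID
    return IdentifierClass.PID if criteria_met else IdentifierClass.ID
```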
2.3 Promotion Criteria: ID → PID
An identifier can be promoted from ID to PID when ALL of the following are satisfied:
from dataclasses import dataclass

@dataclass
class PIDPromotionCriteria:
    """
    Criteria for promoting an ID to a PID.
    ALL conditions must be True for promotion.
    """
    # Geographic anchors
    first_observation_verified: bool      # Birth or equivalent
    last_observation_verified: bool       # Death or equivalent
    # Temporal anchors
    century_range_established: bool       # From verified observations
    # Identity anchors
    emic_label_verified: bool             # From primary sources
    no_unexplored_archives: bool          # Reasonable assumption
    # Quality checks
    no_unresolved_conflicts: bool         # No conflicting claims
    multiple_corroborating_sources: bool  # At least 2 independent sources

    def is_promotable(self) -> bool:
        return all([
            self.first_observation_verified,
            self.last_observation_verified,
            self.century_range_established,
            self.emic_label_verified,
            self.no_unexplored_archives,
            self.no_unresolved_conflicts,
            self.multiple_corroborating_sources,
        ])
2.4 Permanent ID Status
Some identifiers may forever remain IDs due to:
- Fragmentary records: Only one surviving document mentions the person
- Uncertain dates: Cannot establish century range
- Unknown location: Cannot anchor geographically
- Anonymous figures: No emic label recoverable
- Ongoing research: Archives not yet explored
This is acceptable and expected. An ID is still a valid identifier for internal use; it simply cannot be cited as a persistent identifier in scholarly work.
3. Identifier Structure
3.1 Full Format Specification
{TYPE}_{FL}_{FD}_{LL}_{LD}_{NT}[-{FULL_EMIC}]
   │    │    │    │    │    │       │
   │    │    │    │    │    │       └── Collision suffix (optional, snake_case)
   │    │    │    │    │    └── Name Tokens (FIRST-LAST, hyphen-joined)
   │    │    │    │    └── Last observation Date (ISO 8601: YYYY-MM-DD or reduced)
   │    │    │    └── Last observation Location (CC-RR-PPP, hyphen-joined)
   │    │    └── First observation Date (ISO 8601: YYYY-MM-DD or reduced)
   │    └── First observation Location (CC-RR-PPP, hyphen-joined)
   └── Type: ID or PID
Delimiters:
- Underscore (_) = Major delimiter between logical groups
- Hyphen (-) = Minor delimiter within groups
Expanded with all components:
{TYPE}_{FC-FR-FP}_{FD}_{LC-LR-LP}_{LD}_{FT-LT}[-{full_emic_label}]
Example: PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG
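Because the major groups are underscore-delimited while hyphens only appear within groups (location codes, dates, name tokens), a base PPID always splits cleanly into six parts. A minimal parsing sketch; `parse_ppid` is a hypothetical helper, not part of the specification:

```python
def parse_ppid(ppid: str) -> dict:
    """Split a base PPID (no collision suffix) into labelled components."""
    parts = ppid.split('_')  # major delimiter
    if len(parts) != 6:
        raise ValueError(f"Expected 6 major components, got {len(parts)}")
    type_, fl, fd, ll, ld, nt = parts
    return {
        'type': type_,                              # ID or PID
        'first_location': tuple(fl.split('-')),     # (CC, RR, PPP)
        'first_date': fd,                           # ISO 8601, any precision
        'last_location': tuple(ll.split('-')),      # (CC, RR, PPP)
        'last_date': ld,
        'name_tokens': tuple(nt.split('-')),        # (FIRST, LAST)
    }
```

Note that dates such as `1895-03-15` contain hyphens, but since they form a single underscore-delimited group, they survive the major split intact.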
3.2 Component Definitions
| Component | Format | Description | Example |
|---|---|---|---|
| TYPE | ID or PID | Identifier class | PID |
| FL | CC-RR-PPP | First observation Location | NL-NH-AMS |
| → FC | ISO 3166-1 α2 | Country code | NL |
| → FR | ISO 3166-2 suffix | Region code | NH |
| → FP | 3 letters | Place code (GeoNames) | AMS |
| FD | ISO 8601 | First observation Date | 1895-03-15 or 1895 |
| LL | CC-RR-PPP | Last observation Location | NL-NH-HAA |
| → LC | ISO 3166-1 α2 | Country code | NL |
| → LR | ISO 3166-2 suffix | Region code | NH |
| → LP | 3 letters | Place code (GeoNames) | HAA |
| LD | ISO 8601 | Last observation Date | 1970-08-22 or 1970 |
| NT | FIRST-LAST | Name Tokens (emic label) | JAN-BERG |
| → FT | UPPERCASE | First token | JAN |
| → LT | UPPERCASE | Last token (skip particles) | BERG |
| FULL_EMIC | snake_case | Full emic label (collision only) | jan_van_den_berg |
3.3 ISO 8601 Date Precision Levels
Dates use ISO 8601 format with variable precision based on what can be verified:
| Precision | Format | Example | When to Use |
|---|---|---|---|
| Day | YYYY-MM-DD | 1895-03-15 | Birth/death certificate with exact date |
| Month | YYYY-MM | 1895-03 | Record states month but not day |
| Year | YYYY | 1895 | Only year known (common for historical figures) |
| Unknown | XXXX | XXXX | Cannot determine; identifier remains ID class |
BCE Dates (negative years per ISO 8601 extended):
| Year | ISO 8601 | Example PPID |
|---|---|---|
| 469 BCE | -0469 | PID_GR-AT-ATH_-0469_GR-AT-ATH_-0399_SOCRATES- |
| 44 BCE | -0044 | PID_IT-RM-ROM_-0100_IT-RM-ROM_-0044_GAIUS-CAESAR |
No Time Components: Hours, minutes, seconds are never included (impractical for historical persons).
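The precision rules above can be sketched as a small formatter; `format_ppid_date` is a hypothetical helper illustrating the table, not part of the specification:

```python
from typing import Optional

def format_ppid_date(year: Optional[int] = None,
                     month: Optional[int] = None,
                     day: Optional[int] = None) -> str:
    """Format a date at the highest precision that can be verified."""
    if year is None:
        return "XXXX"  # unknown: the identifier stays ID class
    # ISO 8601 extended: BCE years are negative; pad to 4 digits
    # (width 5 for negative years includes the sign, e.g. -469 → "-0469")
    y = f"{year:05d}" if year < 0 else f"{year:04d}"
    if month is None:
        return y
    if day is None:
        return f"{y}-{month:02d}"
    return f"{y}-{month:02d}-{day:02d}"
```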
3.4 Examples
| Person | Full Emic Label | PPID |
|---|---|---|
| Jan van den Berg (1895-03-15 → 1970-08-22) | Jan van den Berg | PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG |
| Rembrandt (1606-07-15 → 1669-10-04) | Rembrandt van Rijn | PID_NL-ZH-LEI_1606-07-15_NL-NH-AMS_1669-10-04_REMBRANDT-RIJN |
| Maria Sibylla Merian (1647 → 1717) | Maria Sibylla Merian | PID_DE-HE-FRA_1647_NL-NH-AMS_1717_MARIA-MERIAN |
| Socrates (c. 470 BCE → 399 BCE) | Σωκράτης (Sōkrátēs) | PID_GR-AT-ATH_-0470_GR-AT-ATH_-0399_SOCRATES- |
| Julius Caesar (100 BCE → 44 BCE) | Gaius Iulius Caesar | PID_IT-RM-ROM_-0100_IT-RM-ROM_-0044_GAIUS-CAESAR |
| Unknown soldier (? → 1944-06-06) | (unknown) | ID_XX-XX-XXX_XXXX_FR-NM-OMH_1944-06-06_UNKNOWN- |
| Henry VIII (1491-06-28 → 1547-01-28) | Henry VIII | PID_GB-ENG-LON_1491-06-28_GB-ENG-LON_1547-01-28_HENRY-VIII |
| Vincent van Gogh (1853-03-30 → 1890-07-29) | Vincent Willem van Gogh | PID_NL-NB-ZUN_1853-03-30_FR-IDF-AUV_1890-07-29_VINCENT-GOGH |
Notes on Emic Labels:
- Always use formal/complete emic names from primary sources, not modern colloquial short forms
- "Rembrandt" alone is a modern convention; the emic label from his lifetime was "Rembrandt van Rijn"
- Tussenvoegsels (particles) like "van", "de", "den", "der", "van de", "van den", "van der" are skipped when extracting the last token (see §4.5)
- Non-Latin names are transliterated following GHCID transliteration standards (see AGENTS.md)
- This follows the same pattern as GHCID abbreviation rules (AGENTS.md Rule 8)
4. Component Rules
4.1 First Observation (Birth or Earliest)
from dataclasses import dataclass
from enum import Enum
from typing import Optional
class ObservationType(Enum):
    BIRTH_CERTIFICATE = "birth_certificate"    # Highest authority
    BAPTISM_RECORD = "baptism_record"          # Common for pre-civil registration
    BIRTH_STATEMENT = "birth_statement"        # Stated birth in other document
    EARLIEST_REFERENCE = "earliest_reference"  # Earliest surviving mention
    INFERRED = "inferred"                      # Inferred from context

@dataclass
class FirstObservation:
    """
    First observation of a person during their lifetime.
    Ideally a birth record, but may be another early record.
    """
    observation_type: ObservationType
    # Modern geographic codes (mapped from historical)
    country_code: str  # ISO 3166-1 alpha-2
    region_code: str   # ISO 3166-2 subdivision
    place_code: str    # GeoNames 3-letter code
    # Original historical reference
    historical_place_name: str  # As named in source
    historical_date: str        # As stated in source
    # Mapping provenance
    modern_mapping_method: str  # How historical → modern mapping done
    geonames_id: Optional[int]  # GeoNames ID for place
    # Quality indicators
    is_birth_record: bool
    can_assume_earliest: bool  # No unexplored archives likely
    source_confidence: float   # 0.0 - 1.0

    def is_valid_for_pid(self) -> bool:
        """
        Determine if this observation is valid for PID generation.
        """
        if self.is_birth_record:
            return True
        if self.observation_type == ObservationType.EARLIEST_REFERENCE:
            # Must be able to assume this is actually the earliest
            return self.can_assume_earliest and self.source_confidence >= 0.8
        return False
4.2 Last Observation (Death or Latest During Lifetime)
@dataclass
class LastObservation:
    """
    Last observation of a person during their lifetime or immediately after death.
    Ideally a death record, but may be the last known living reference.
    """
    observation_type: ObservationType  # Reusing enum, but DEATH_CERTIFICATE etc.
    # Modern geographic codes
    country_code: str
    region_code: str
    place_code: str
    # Original historical reference
    historical_place_name: str
    historical_date: str
    # Critical distinction
    is_death_record: bool
    is_lifetime_observation: bool  # True if person still alive at observation
    is_immediate_post_death: bool  # First record after death
    # Quality
    can_assume_latest: bool
    source_confidence: float

    def is_valid_for_pid(self) -> bool:
        if self.is_death_record:
            return True
        if self.is_immediate_post_death:
            # First mention of death
            return self.source_confidence >= 0.8
        if self.is_lifetime_observation:
            # Last known alive, but not a death record
            return self.can_assume_latest and self.source_confidence >= 0.8
        return False
4.3 Geographic Mapping: Historical → Modern
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple

class ManualResearchRequired(Exception):
    """Raised when a historical place cannot be mapped automatically."""

@dataclass
class HistoricalPlaceMapping:
    """
    Map historical place names to modern ISO/GeoNames codes.
    Historical places must be mapped to their MODERN equivalents
    as of the PPID generation date. This ensures stability even
    when historical boundaries shifted.
    """
    # Historical input
    historical_name: str
    historical_date: str  # When the place was referenced
    # Modern output (at PPID generation time)
    modern_country_code: str  # ISO 3166-1 alpha-2
    modern_region_code: str   # ISO 3166-2 suffix (e.g., "NH" not "NL-NH")
    modern_place_code: str    # 3-letter from GeoNames
    # GeoNames reference
    geonames_id: int
    geonames_name: str           # Modern canonical name
    geonames_feature_class: str  # P = populated place
    geonames_feature_code: str   # PPL, PPLA, PPLC, etc.
    # Mapping provenance
    mapping_method: str  # "direct", "successor", "enclosing", "manual"
    mapping_confidence: float
    mapping_notes: str
    ppid_generation_date: str  # When mapping was performed

def map_historical_to_modern(
    historical_name: str,
    historical_date: str,
    db
) -> HistoricalPlaceMapping:
    """
    Map a historical place name to modern ISO/GeoNames codes.
    Strategies (in order):
    1. Direct match: Place still exists with same name
    2. Successor: Place renamed but geographically same
    3. Enclosing: Place absorbed into larger entity
    4. Manual: Requires human research
    """
    # Strategy 1: Direct GeoNames lookup
    direct_match = db.geonames_search(historical_name)
    if direct_match and direct_match.is_populated_place:
        return HistoricalPlaceMapping(
            historical_name=historical_name,
            historical_date=historical_date,
            modern_country_code=direct_match.country_code,
            modern_region_code=direct_match.admin1_code,
            modern_place_code=generate_place_code(direct_match.name),
            geonames_id=direct_match.geonames_id,
            geonames_name=direct_match.name,
            geonames_feature_class=direct_match.feature_class,
            geonames_feature_code=direct_match.feature_code,
            mapping_method="direct",
            mapping_confidence=0.95,
            mapping_notes="Direct GeoNames match",
            ppid_generation_date=datetime.utcnow().isoformat()
        )
    # Strategy 2: Historical name lookup (renamed places)
    # e.g., "Batavia" → "Jakarta"
    historical_match = db.historical_place_names.get(historical_name)
    if historical_match:
        modern = db.geonames_by_id(historical_match.modern_geonames_id)
        return HistoricalPlaceMapping(
            historical_name=historical_name,
            historical_date=historical_date,
            modern_country_code=modern.country_code,
            modern_region_code=modern.admin1_code,
            modern_place_code=generate_place_code(modern.name),
            geonames_id=modern.geonames_id,
            geonames_name=modern.name,
            geonames_feature_class=modern.feature_class,
            geonames_feature_code=modern.feature_code,
            mapping_method="successor",
            mapping_confidence=0.90,
            mapping_notes=f"Historical name '{historical_name}' → modern '{modern.name}'",
            ppid_generation_date=datetime.utcnow().isoformat()
        )
    # Strategy 3: Geographic coordinates (if available from source)
    # Reverse geocode to find enclosing modern settlement
    # Strategy 4: Manual research required
    raise ManualResearchRequired(
        f"Cannot automatically map '{historical_name}' ({historical_date}) to modern location"
    )
def generate_place_code(place_name: str) -> str:
    """
    Generate 3-letter place code from GeoNames name.
    Rules (same as GHCID):
    - Single word: First 3 letters → "Amsterdam" → "AMS"
    - Multi-word: Initials → "New York" → "NYO" (or "NYC" if registered)
    - Dutch articles: Article initial + 2 from main → "Den Haag" → "DHA"
    """
    # Implementation follows GHCID rules; fail loudly until implemented.
    # See AGENTS.md: "SETTLEMENT STANDARDIZATION: GEONAMES IS AUTHORITATIVE"
    raise NotImplementedError("Follow GHCID place-code rules in AGENTS.md")
4.4 Century Range Calculation
def calculate_century_range(
    first_observation: FirstObservation,
    last_observation: LastObservation
) -> str:
    """
    Calculate the CE century range for a person's lifetime.
    Returns format: "CC-CC" (e.g., "19-20" for 1850-1925)
    Rules:
    - Centuries are 1-indexed: 1-100 AD = 1st century, 1901-2000 = 20th century
    - BCE dates: Use negative century numbers (e.g., "-5--4" for 5th-4th century BCE)
      This follows ISO 8601 extended format, which uses negative years for BCE
    - Range must be from verified observations
    """
    def year_to_century(year: int) -> int:
        """
        Convert year to century number.
        Positive years (CE): 1-100 = century 1, 1901-2000 = century 20
        Negative years (BCE): -500 to -401 = century -5
        Note: There is no year 0 in the proleptic Gregorian calendar.
        Year 1 BCE is followed directly by year 1 CE.
        """
        if year > 0:
            return ((year - 1) // 100) + 1
        else:
            # BCE: year -500 → century -5, year -1 → century -1
            # (floor division handles the off-by-one: -401 // 100 == -5)
            return year // 100

    def parse_year(date_str: str) -> int:
        """Extract year from various date formats."""
        # Handle: "1895", "1895-03-15", "March 1895", "c. 1895", etc.
        # Also handle BCE: "-500", "500 BCE", "500 BC", "c. 500 BCE"
        import re
        # Check for BCE indicators
        bce_match = re.search(r'(\d+)\s*(BCE|BC|B\.C\.E?\.|v\.Chr\.)', date_str, re.IGNORECASE)
        if bce_match:
            return -int(bce_match.group(1))
        # Check for negative year (ISO 8601 extended)
        neg_match = re.search(r'-(\d+)', date_str)
        if neg_match and date_str.strip().startswith('-'):
            return -int(neg_match.group(1))
        # Standard positive year
        match = re.search(r'\b(\d{4})\b', date_str)
        if match:
            return int(match.group(1))
        # 3-digit year (ancient dates)
        match = re.search(r'\b(\d{3})\b', date_str)
        if match:
            return int(match.group(1))
        raise ValueError(f"Cannot parse year from: {date_str}")

    first_year = parse_year(first_observation.historical_date)
    last_year = parse_year(last_observation.historical_date)
    first_century = year_to_century(first_year)
    last_century = year_to_century(last_year)
    # Validation
    if last_century < first_century:
        raise ValueError(
            f"Last observation ({last_year}) cannot be before "
            f"first observation ({first_year})"
        )
    return f"{first_century}-{last_century}"
# Examples (CE):
# 1850 → century 19
# 1925 → century 20
# Range: "19-20"
# 1606 → century 17
# 1669 → century 17
# Range: "17-17" (same century)
# 1895 → century 19
# 2005 → century 21
# Range: "19-21" (centenarian)
# Examples (BCE):
# -500 (500 BCE) → century -5
# -401 (401 BCE) → century -5
# Range: "-5--5" (same century)
# -469 (469 BCE, Socrates birth) → century -5
# -399 (399 BCE, Socrates death) → century -4
# Range: "-5--4"
# -100 (100 BCE) → century -1
# 14 (14 CE) → century 1
# Range: "-1-1" (crossing BCE/CE boundary)
4.5 Emic Label Tokens
from dataclasses import dataclass
from typing import Optional, List
import re

@dataclass
class EmicLabel:
    """
    The common contemporary emic label of a person.
    "Emic" = from the insider perspective, as the person was known
    during their lifetime in primary sources.
    "Etic" = from the outsider perspective, how we refer to them now.
    Prefer emic; fall back to etic only if emic unrecoverable.
    """
    full_label: str   # Complete emic label
    first_token: str  # First word/token
    last_token: str   # Last word/token (empty if mononym)
    # Source provenance
    source_type: str      # "primary" or "etic_fallback"
    source_document: str  # Reference to source
    source_date: str      # When source was created
    # Quality
    is_from_primary_source: bool
    is_vernacular: bool  # From vernacular (non-official) source
    confidence: float

    @classmethod
    def from_full_label(cls, label: str, **kwargs) -> 'EmicLabel':
        """Parse full label into first and last tokens."""
        tokens = tokenize_emic_label(label)
        first_token = tokens[0].upper() if tokens else ""
        last_token = tokens[-1].upper() if len(tokens) > 1 else ""
        return cls(
            full_label=label,
            first_token=first_token,
            last_token=last_token,
            **kwargs
        )

def tokenize_emic_label(label: str) -> List[str]:
    """
    Tokenize an emic label into words.
    Rules:
    - Split on whitespace
    - Preserve numeric tokens (e.g., "VIII" in "Henry VIII")
    - Do NOT split compound words
    - Normalize to uppercase for identifier
    """
    # Basic whitespace split
    tokens = label.strip().split()
    # Filter empty tokens
    tokens = [t for t in tokens if t]
    return tokens
def extract_name_tokens(
    full_emic_label: str
) -> tuple[str, str]:
    """
    Extract first and last tokens from emic label.
    Rules:
    1. First token: First word of the emic label
    2. Last token: Last word AFTER skipping tussenvoegsels (name particles)
    Tussenvoegsels are common prefixes in Dutch and other languages that are
    NOT part of the surname proper. They are skipped when extracting the
    last token (same as GHCID abbreviation rules - AGENTS.md Rule 8).
    Examples:
    - "Jan van den Berg" → ("JAN", "BERG")              # "van den" skipped
    - "Rembrandt van Rijn" → ("REMBRANDT", "RIJN")      # "van" skipped
    - "Henry VIII" → ("HENRY", "VIII")
    - "Maria Sibylla Merian" → ("MARIA", "MERIAN")
    - "Ludwig van Beethoven" → ("LUDWIG", "BEETHOVEN")  # "van" skipped
    - "Vincent van Gogh" → ("VINCENT", "GOGH")          # "van" skipped
    - "Leonardo da Vinci" → ("LEONARDO", "VINCI")       # "da" skipped
    - "中村 太郎" → transliterated: ("NAKAMURA", "TARO")
    """
    # Tussenvoegsels (name particles) to skip when finding the last token.
    # Matching is per single token, so the multi-word entries below are
    # documentation only; their single-word components do the matching.
    # Following GHCID pattern (AGENTS.md Rule 8: Legal Form Filtering)
    TUSSENVOEGSELS = {
        # Dutch
        'van', 'de', 'den', 'der', 'het', "'t", 'te', 'ten', 'ter',
        'van de', 'van den', 'van der', 'van het', "van 't",
        'in de', 'in den', 'in het', "in 't",
        'op de', 'op den', 'op het', "op 't",
        # German
        'von', 'vom', 'zu', 'zum', 'zur', 'von und zu',
        # French
        'de', 'du', 'des', 'de la', 'le', 'la', 'les',
        # Italian
        'da', 'di', 'del', 'della', 'dei', 'degli', 'delle',
        # Spanish
        'de', 'del', 'de la', 'de los', 'de las',
        # Portuguese
        'da', 'do', 'dos', 'das', 'de',
    }
    tokens = tokenize_emic_label(full_emic_label)
    if len(tokens) == 0:
        raise ValueError("Empty emic label")
    first_token = tokens[0].upper()
    if len(tokens) == 1:
        # Mononym
        last_token = ""
    else:
        # Find the last token that is NOT a tussenvoegsel,
        # working backwards from the end
        last_token = ""
        for token in reversed(tokens[1:]):  # Skip first token
            token_lower = token.lower()
            if token_lower not in TUSSENVOEGSELS:
                last_token = token.upper()
                break
        # If all remaining tokens are tussenvoegsels, use the actual last token
        if not last_token:
            last_token = tokens[-1].upper()
    # Normalize: remove diacritics, special characters
    first_token = normalize_token(first_token)
    last_token = normalize_token(last_token)
    return (first_token, last_token)

def normalize_token(token: str) -> str:
    """
    Normalize token for PPID.
    - Remove diacritics (é → E)
    - Uppercase
    - Allow alphanumeric only (for Roman numerals like VIII)
    - Transliterate non-Latin scripts
    """
    import unicodedata
    # NFD decomposition + remove combining marks
    normalized = unicodedata.normalize('NFD', token)
    ascii_token = ''.join(
        c for c in normalized
        if unicodedata.category(c) != 'Mn'
    )
    # Uppercase
    ascii_token = ascii_token.upper()
    # Keep only alphanumeric
    ascii_token = re.sub(r'[^A-Z0-9]', '', ascii_token)
    return ascii_token
4.6 Emic vs Etic Fallback
from datetime import datetime

class EmicLabelNotYetResolvable(Exception):
    """Raised when unexplored vernacular archives block etic fallback."""

@dataclass
class EmicLabelResolution:
    """
    Resolution of emic label for a person.
    Priority:
    1. Emic from primary sources (documents from their lifetime)
    2. Etic fallback (only if emic truly unrecoverable)
    """
    resolved_label: EmicLabel
    resolution_method: str  # "emic_primary", "emic_vernacular", "etic_fallback"
    emic_search_exhausted: bool
    vernacular_sources_checked: List[str]
    fallback_justification: Optional[str]

def resolve_emic_label(
    person_observations: List['PersonObservation'],
    db
) -> EmicLabelResolution:
    """
    Resolve the emic label for a person from their observations.
    Rules:
    1. Search all primary sources for emic names
    2. Prefer the most frequently used name in primary sources
    3. Only use etic fallback if emic truly unrecoverable
    4. Vernacular sources must have clear pedigrees
    5. Oral traditions without documentation are not valid
    """
    # Collect all name mentions from primary sources
    emic_candidates = []
    for obs in person_observations:
        if obs.is_primary_source and obs.is_from_lifetime:
            for claim in obs.claims:
                if claim.claim_type in ('full_name', 'given_name', 'title'):
                    emic_candidates.append({
                        'label': claim.claim_value,
                        'source': obs.source_url,
                        'date': obs.source_date,
                        'is_vernacular': obs.is_vernacular_source
                    })
    if emic_candidates:
        # Find the most common emic label
        from collections import Counter
        label_counts = Counter(c['label'] for c in emic_candidates)
        most_common = label_counts.most_common(1)[0][0]
        best_candidate = next(
            c for c in emic_candidates if c['label'] == most_common
        )
        return EmicLabelResolution(
            resolved_label=EmicLabel.from_full_label(
                most_common,
                source_type="primary",
                source_document=best_candidate['source'],
                source_date=best_candidate['date'],
                is_from_primary_source=True,
                is_vernacular=best_candidate['is_vernacular'],
                confidence=0.95
            ),
            resolution_method="emic_primary",
            emic_search_exhausted=True,
            vernacular_sources_checked=[
                c['source'] for c in emic_candidates if c['is_vernacular']
            ],
            fallback_justification=None
        )
    # Check if etic fallback is justified
    unexplored_vernacular = db.get_unexplored_vernacular_archives(person_observations)
    if unexplored_vernacular:
        raise EmicLabelNotYetResolvable(
            f"Emic label not found in explored sources. "
            f"Unexplored vernacular archives exist: {unexplored_vernacular}. "
            f"Cannot use etic fallback until these are explored."
        )
    # Etic fallback (rare)
    etic_label = db.get_most_common_etic_label(person_observations)
    return EmicLabelResolution(
        resolved_label=EmicLabel.from_full_label(
            etic_label,
            source_type="etic_fallback",
            source_document="Modern scholarly consensus",
            source_date=datetime.utcnow().isoformat(),
            is_from_primary_source=False,
            is_vernacular=False,
            confidence=0.70
        ),
        resolution_method="etic_fallback",
        emic_search_exhausted=True,
        vernacular_sources_checked=[],
        fallback_justification=(
            "No emic label found in explored primary sources. "
            "All known vernacular sources checked. "
            "Using most common modern scholarly reference."
        )
    )
5. Collision Handling
5.1 Collision Detection
Two PPIDs collide when all components except the collision suffix match:
from typing import Set

def detect_collision(new_ppid: str, existing_ppids: Set[str]) -> bool:
    """
    Check if new PPID collides with existing identifiers.
    Collision = same base components (before any collision suffix).
    """
    base_new = get_base_ppid(new_ppid)
    for existing in existing_ppids:
        base_existing = get_base_ppid(existing)
        if base_new == base_existing:
            return True
    return False

def get_base_ppid(ppid: str) -> str:
    """Extract base PPID without collision suffix.
    The format uses underscore as major delimiter:
        PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG-jan_van_den_berg
                                                              ↑ collision suffix starts here
    Base PPID has exactly 6 underscore-delimited parts:
        TYPE_FL_FD_LL_LD_NT
    A snake_case collision suffix contains underscores, so after splitting
    on '_' it appears as extra parts, with its first fragment hyphen-joined
    to the NT part (e.g. "JAN-BERG-jan"). Both must be stripped.
    """
    # Split by underscore (major delimiter)
    parts = ppid.split('_')
    # Standard PPID has 6 parts: TYPE, FL, FD, LL, LD, NT
    if len(parts) < 6:
        return ppid  # Invalid format
    # Strip any suffix fragment hyphen-joined to the name tokens:
    # the first two hyphen-parts are FIRST-LAST; a lowercase third
    # part marks the start of a collision suffix
    nt_parts = parts[5].split('-')
    if len(nt_parts) > 2 and nt_parts[2].islower():
        parts[5] = '-'.join(nt_parts[:2])
    # Drop any extra underscore parts contributed by a snake_case suffix
    return '_'.join(parts[:6])
5.2 Collision Resolution Strategy
Collisions are resolved through a three-tier escalation strategy:
- Tier 1: Append full emic label in snake_case
- Tier 2: If still collides, add 8-character hash discriminator
- Tier 3: If still collides (virtually impossible), add timestamp-based discriminator
import hashlib
import secrets
from datetime import datetime
from typing import Set

def resolve_collision(
    base_ppid: str,
    full_emic_label: str,
    existing_ppids: Set[str],
    distinguishing_data: dict = None
) -> str:
    """
    Resolve collision using three-tier escalation strategy.
    Args:
        base_ppid: The base PPID without collision suffix
        full_emic_label: The person's full emic name
        existing_ppids: Set of existing PPIDs to check against
        distinguishing_data: Optional dict with additional data for hashing
            (e.g., occupation, parent names, source document ID)
    Example:
        Base:   "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG"
        Emic:   "Jan van den Berg"
        Tier 1: "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG-jan_van_den_berg"
        Tier 2: "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG-jan_van_den_berg-a7b3c2d1"
        Tier 3: "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG-jan_van_den_berg-20250109143022"
    """
    # Tier 1: Full emic label suffix
    emic_suffix = generate_collision_suffix(full_emic_label)
    tier1_ppid = f"{base_ppid}-{emic_suffix}"
    if tier1_ppid not in existing_ppids:
        return tier1_ppid
    # Tier 2: Add deterministic hash discriminator
    # Hash is based on distinguishing data if provided, otherwise random
    if distinguishing_data:
        # Deterministic: hash of distinguishing data
        hash_input = f"{tier1_ppid}|{sorted(distinguishing_data.items())}"
        hash_bytes = hashlib.sha256(hash_input.encode()).digest()
        discriminator = hash_bytes[:4].hex()  # 8 hex characters
    else:
        # Fallback: cryptographically secure random
        discriminator = secrets.token_hex(4)  # 8 hex characters
    tier2_ppid = f"{tier1_ppid}-{discriminator}"
    if tier2_ppid not in existing_ppids:
        return tier2_ppid
    # Tier 3: Timestamp-based (virtually impossible to reach)
    # This should never happen with a random discriminator, but provides safety
    timestamp = datetime.utcnow().strftime("%Y%m%d%H%M%S")
    tier3_ppid = f"{tier1_ppid}-{timestamp}"
    # Final fallback: add microseconds if still colliding
    while tier3_ppid in existing_ppids:
        timestamp = datetime.utcnow().strftime("%Y%m%d%H%M%S%f")
        tier3_ppid = f"{tier1_ppid}-{timestamp}"
    return tier3_ppid

def generate_collision_suffix(full_emic_label: str) -> str:
    """
    Generate collision suffix from full emic label.
    Same rules as GHCID collision suffix:
    - Convert to lowercase snake_case
    - Remove diacritics
    - Remove punctuation
    """
    import unicodedata
    import re
    # Normalize unicode
    normalized = unicodedata.normalize('NFD', full_emic_label)
    ascii_name = ''.join(
        c for c in normalized
        if unicodedata.category(c) != 'Mn'
    )
    # Lowercase
    lowercase = ascii_name.lower()
    # Remove punctuation
    no_punct = re.sub(r"['’`\",.:;!?()\[\]{}]", '', lowercase)
    # Replace spaces with underscores
    underscored = re.sub(r'\s+', '_', no_punct)
    # Remove non-alphanumeric except underscore
    clean = re.sub(r'[^a-z0-9_]', '', underscored)
    # Collapse multiple underscores
    final = re.sub(r'_+', '_', clean).strip('_')
    return final
5.3 Distinguishing Data for Tier 2 Hash
When two persons have identical base PPID and emic label, use distinguishing data to generate a deterministic hash:
| Priority | Distinguishing Data | Example |
|---|---|---|
| 1 | Source document ID | "NL-NH-HAA/BS/Geb/1895/123" |
| 2 | Parent names | "father:Pieter_van_den_Berg" |
| 3 | Occupation | "occupation:timmerman" |
| 4 | Spouse name | "spouse:Maria_Jansen" |
| 5 | Unique claim from observation | Any distinguishing fact |
# Example: Two "Jan van den Berg" born same day, same place
distinguishing_data_person_1 = {
    "source_document": "NL-NH-HAA/BS/Geb/1895/123",
    "father_name": "Pieter van den Berg",
    "occupation": "timmerman"
}
distinguishing_data_person_2 = {
    "source_document": "NL-NH-HAA/BS/Geb/1895/456",
    "father_name": "Hendrik van den Berg",
    "occupation": "bakker"
}
# Results in different deterministic hashes:
# Person 1: PID_NL-NH-AMS_1895-03-15_..._JAN-BERG-jan_van_den_berg-a7b3c2d1
# Person 2: PID_NL-NH-AMS_1895-03-15_..._JAN-BERG-jan_van_den_berg-f2e8d4a9
5.4 Collision Probability Analysis
| Tier | Collision Probability | When Triggered |
|---|---|---|
| Base PPID | ~1/10,000 for common names | Same location, date, name tokens |
| Tier 1 (+emic) | ~1/1,000,000 | Same full emic label |
| Tier 2 (+hash) | ~1/4.3 billion | Same emic AND no distinguishing data |
| Tier 3 (+time) | ~0 | Cryptographic failure |
Practical Impact: For a dataset of 10 million persons, expected Tier 2 collisions ≈ 0.002 (effectively zero).
6. Unknown Components: XX and XXX Placeholders
6.1 When Components Are Unknown
Unlike GHCID (where XX/XXX are temporary and require research), PPID may have permanently unknown components:
| Scenario | Placeholder | Can be PID? |
|---|---|---|
| Unknown birth country | XX | No (remains ID) |
| Unknown birth region | XX | No (remains ID) |
| Unknown birth place | XXX | No (remains ID) |
| Unknown death country | XX | No (remains ID) |
| Unknown death region | XX | No (remains ID) |
| Unknown death place | XXX | No (remains ID) |
| Unknown date | XXXX | No (remains ID) |
| Unknown first token | UNKNOWN | No (remains ID) |
| Unknown last token | (empty after hyphen) | Yes (if mononym) |
6.2 ID Examples with Unknown Components
ID_XX-XX-XXX_XXXX_FR-NM-OMH_1944-06-06_UNKNOWN- # Unknown soldier, Normandy
ID_NL-NH-AMS_1606_XX-XX-XXX_XXXX_REMBRANDT- # Rembrandt, death unknown (hypothetical)
ID_XX-XX-XXX_XXXX_XX-XX-XXX_XXXX_ANONYMOUS- # Completely unknown person
ID_NL-ZH-LEI_1606-07_NL-NH-AMS_1669_REMBRANDT-RIJN # Rembrandt, month known for birth, only year for death
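The placeholder rules above can be checked mechanically; `must_remain_id` is a hypothetical helper illustrating the table, not part of the specification:

```python
def must_remain_id(ppid: str) -> bool:
    """True if the identifier contains unknown placeholders and so
    cannot be promoted to PID class."""
    parts = ppid.split('_')
    if len(parts) < 6:
        return True  # malformed: cannot be a PID
    _, fl, fd, ll, ld, nt = parts[:6]
    # XX / XXX placeholders in either location group
    for loc in (fl, ll):
        if any(code in ('XX', 'XXX') for code in loc.split('-')):
            return True
    # XXXX placeholder in either date
    if 'XXXX' in (fd, ld):
        return True
    # UNKNOWN first name token
    if nt.split('-')[0] == 'UNKNOWN':
        return True
    return False
```

An empty last token (a mononym's trailing hyphen) deliberately does not trigger the check, matching the last row of the table.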
7. UUID and Numeric Generation
7.1 Dual Representation (Same as GHCID)
Every PPID generates three representations:
| Format | Purpose | Example |
|---|---|---|
| Semantic String | Human-readable | PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG |
| UUID v5 | Linked data, URIs | 550e8400-e29b-41d4-a716-446655440000 |
| Numeric (64-bit) | Database keys, CSV | 213324328442227739 |
7.2 Generation Algorithm
```python
import uuid
import hashlib

# PPID namespace UUID (different from GHCID namespace)
PPID_NAMESPACE = uuid.UUID('f47ac10b-58cc-4372-a567-0e02b2c3d479')

def generate_ppid_identifiers(semantic_ppid: str) -> dict:
    """
    Generate all identifier formats from semantic PPID string.

    Returns:
        {
            'semantic': 'PID_NL-NH-AMS_1895-03-15_...',
            'uuid_v5': '550e8400-...',
            'numeric': 213324328442227739
        }
    """
    # UUID v5 from semantic string
    ppid_uuid = uuid.uuid5(PPID_NAMESPACE, semantic_ppid)

    # Numeric from SHA-256 (64-bit)
    sha256 = hashlib.sha256(semantic_ppid.encode()).digest()
    numeric = int.from_bytes(sha256[:8], byteorder='big')

    return {
        'semantic': semantic_ppid,
        'uuid_v5': str(ppid_uuid),
        'numeric': numeric
    }

# Example (output values illustrative):
ppid = "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG"
identifiers = generate_ppid_identifiers(ppid)
# {
#     'semantic': 'PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG',
#     'uuid_v5': 'a1b2c3d4-e5f6-5a1b-9c2d-3e4f5a6b7c8d',
#     'numeric': 1234567890123456789
# }
```
8. Relationship to Person Observations
8.1 Distinction: PPID vs Observation Identifiers
| Identifier | Purpose | Structure | Persistence |
|---|---|---|---|
| PPID | Identify a person (reconstruction) | Geographic + temporal + emic | Permanent (if PID) |
| Observation ID | Identify a specific source observation | GHCID-based + RiC-O | Permanent |
8.2 Observation Identifier Structure (Forthcoming)
Observation identifiers will use a different pattern:

```
{REPOSITORY_GHCID}/{CREATOR_GHCID}/{RICO_RECORD_PATH}
```
Where:
- REPOSITORY_GHCID: GHCID of the institution holding the record
- CREATOR_GHCID: GHCID of the institution that created the record (may be same)
- RICO_RECORD_PATH: RiC-O derived path to RecordSet/Record/RecordPart
Example:
```
NL-NH-HAA-A-NHA/NL-NH-HAA-A-NHA/burgerlijke-stand/geboorten/1895/003/045
│               │               │
│               │               └── RiC-O path: fonds/series/file/item
│               └── Creator (same institution)
└── Repository
```
This is separate from PPID and will be specified in a future document.
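Under the pattern above, assembly is a simple path join. The helper below is hypothetical, since the observation-identifier spec is still forthcoming:

```python
from typing import List

def observation_id(repository_ghcid: str,
                   creator_ghcid: str,
                   rico_path: List[str]) -> str:
    """Hypothetical assembly of an observation identifier: repository GHCID,
    creator GHCID, then the RiC-O record path, joined with '/'."""
    return "/".join([repository_ghcid, creator_ghcid, *rico_path])

print(observation_id(
    "NL-NH-HAA-A-NHA", "NL-NH-HAA-A-NHA",
    ["burgerlijke-stand", "geboorten", "1895", "003", "045"],
))
# NL-NH-HAA-A-NHA/NL-NH-HAA-A-NHA/burgerlijke-stand/geboorten/1895/003/045
```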
9. Comparison with Original POID/PRID Design
9.1 What Changes
| Aspect | POID/PRID (Doc 05) | Revised PPID (This Doc) |
|---|---|---|
| Identifier opacity | Opaque (no semantic content) | Semantic (human-readable) |
| Geographic anchoring | None | Dual (birth + death locations) |
| Temporal anchoring | None | ISO 8601 dates (variable precision) |
| Name in identifier | None | First + last token |
| Type prefix | POID/PRID | ID/PID |
| Observation vs Person | Different identifier types | Completely separate systems |
| UUID backing | Primary | Secondary (derived) |
| Collision handling | UUID collision (rare) | Semantic collision (more common) |
9.2 What Stays the Same
- Dual identifier generation (UUID + numeric)
- Deterministic generation from input
- Permanent persistence (once PID)
- Integration with GHCID for institution links
- Claim-based provenance model
- PiCo ontology alignment
9.3 Transition Plan
If this revised structure is adopted:
- Document 05 becomes historical reference
- This document becomes the authoritative identifier spec
- No existing identifiers need migration (this is a new system)
- Code examples in other documents need updates
10. Implementation Considerations
10.1 Character Set and Length
```python
# Component lengths
MAX_COUNTRY_CODE = 2       # ISO 3166-1 alpha-2
MAX_REGION_CODE = 3        # ISO 3166-2 suffix (some are 3 chars)
MAX_PLACE_CODE = 3         # GeoNames convention
MAX_DATE = 10              # YYYY-MM-DD (ISO 8601)
MAX_TOKEN_LENGTH = 20      # Reasonable limit for names
MAX_COLLISION_SUFFIX = 50  # Full emic label in snake_case

# Example PPID structure (without collision suffix):
# PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG
# type (3) + 2 locations (10 each) + 2 dates (10 each)
#   + name tokens (20 + 1 + 20) + 5 underscores
# = 3 + 20 + 20 + 41 + 5 ≈ 90 characters maximum
# With collision suffix (+ 1 + 50): ~140 characters max
```
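To make the bounds concrete, the example identifier used throughout this document can be measured directly (a quick sanity check, not part of the spec):

```python
# Worked example from earlier sections: typical identifiers sit well
# below the component maximums
example = "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG"
assert len(example) == 54  # 3 + 9 + 10 + 9 + 10 + 8 tokens + 5 underscores
```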
10.2 Validation Regex
```python
import re

# Location pattern: CC-RR-PPP (country-region-place)
LOCATION_PATTERN = r'([A-Z]{2}|XX)-([A-Z]{2,3}|XX)-([A-Z]{3}|XXX)'

# Date pattern: YYYY-MM-DD or YYYY-MM or YYYY or XXXX (including BCE with leading -)
DATE_PATTERN = r'(-?\d{4}(?:-\d{2}(?:-\d{2})?)?|XXXX)'

# Name tokens pattern: FIRST-LAST or FIRST- (for mononyms)
NAME_PATTERN = r'([A-Z0-9]+)-([A-Z0-9]*)'

# Full PPID pattern
PPID_PATTERN = re.compile(
    r'^(ID|PID)'              # Type
    r'_' + LOCATION_PATTERN + # First location (underscore + CC-RR-PPP)
    r'_' + DATE_PATTERN +     # First date (underscore + ISO 8601)
    r'_' + LOCATION_PATTERN + # Last location (underscore + CC-RR-PPP)
    r'_' + DATE_PATTERN +     # Last date (underscore + ISO 8601)
    r'_' + NAME_PATTERN +     # Name tokens (underscore + FIRST-LAST)
    r'(-[a-z0-9_]+)*$'        # Collision suffixes (optional; Tier 1 emic and Tier 2 hash may both appear)
)

def validate_ppid(ppid: str) -> tuple[bool, str]:
    """Validate PPID format."""
    if not PPID_PATTERN.match(ppid):
        return False, "Invalid PPID format"

    # Split by major delimiter (underscore)
    parts = ppid.split('_')
    if len(parts) < 6:
        return False, "Incomplete PPID - requires 6 underscore-delimited parts"

    # Extract dates for validation
    first_date = parts[2]
    last_date = parts[4]

    # Date ordering validation (if both are known)
    if first_date != 'XXXX' and last_date != 'XXXX':
        # Parse years (handle BCE with leading -)
        try:
            first_year = int(first_date.split('-')[0]) if not first_date.startswith('-') else -int(first_date.split('-')[1])
            last_year = int(last_date.split('-')[0]) if not last_date.startswith('-') else -int(last_date.split('-')[1])
            if last_year < first_year:
                return False, "Last observation date cannot be before first observation date"
        except (ValueError, IndexError):
            pass  # Invalid date format caught by regex

    return True, "Valid"

# Example validations:
assert validate_ppid("PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG")[0]
assert validate_ppid("PID_GR-AT-ATH_-0470_GR-AT-ATH_-0399_SOCRATES-")[0]
assert validate_ppid("ID_XX-XX-XXX_XXXX_FR-NM-OMH_1944-06-06_UNKNOWN-")[0]
assert not validate_ppid("PID_NL-NH-AMS_1970_NL-NH-HAA_1895_JAN-BERG")[0]  # Dates reversed
```

Note the collision-suffix group uses `*` rather than `?`: a Tier 2 identifier carries both the full emic label and the 8-character hash as consecutive hyphen-delimited suffixes.
11. Open Questions
11.1 BCE Dates
RESOLVED: Use ISO 8601 extended format with negative years.
- `-0469` for 469 BCE
- `-0044` for 44 BCE
- Examples in sections 3.3 and 3.4
11.2 Non-Latin Name Tokens
RESOLVED: Apply same transliteration rules as GHCID (see AGENTS.md).
| Script | Standard |
|---|---|
| Cyrillic | ISO 9:1995 |
| Chinese | Hanyu Pinyin (ISO 7098) |
| Japanese | Modified Hepburn |
| Korean | Revised Romanization |
| Arabic | ISO 233-2/3 |
11.3 Disputed Locations
RESOLVED: Not a PPID concern - handled by ISO standardization. Use modern ISO-standardized location codes; document disputes in observation metadata.
11.4 Living Persons
RESOLVED: Living persons are always ID class and can only be promoted to PID after death.
- Living persons have no verified last observation (death date/location)
- Use `XXXX` for unknown death date and `XX-XX-XXX` for unknown death location
- Example: `ID_NL-NH-AMS_1985-06-15_XX-XX-XXX_XXXX_JAN-BERG`
- Can be promoted to PID only after death observation is verified
Rationale:
- PID requires verified last observation (death)
- Living persons have incomplete lifecycle data
- Future observations may change identity assessment
- Privacy considerations for living individuals
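Promotion after a verified death observation could look like the following sketch (`promote_to_pid` is a hypothetical helper; the death date and location are invented for illustration):

```python
def promote_to_pid(id_string: str, death_location: str, death_date: str) -> str:
    """Hypothetical promotion: fill in the verified last observation and
    switch the type prefix from ID to PID. Only ID-class identifiers with
    placeholder last observations are eligible."""
    parts = id_string.split('_')
    assert parts[0] == 'ID', "only ID-class identifiers can be promoted"
    assert parts[3] == 'XX-XX-XXX' and parts[4] == 'XXXX', "last observation already set"
    parts[0] = 'PID'
    parts[3] = death_location  # verified death location (CC-RR-PPP)
    parts[4] = death_date      # verified death date (ISO 8601)
    return '_'.join(parts)

print(promote_to_pid("ID_NL-NH-AMS_1985-06-15_XX-XX-XXX_XXXX_JAN-BERG",
                     "NL-NH-HAA", "2070-01-01"))
# PID_NL-NH-AMS_1985-06-15_NL-NH-HAA_2070-01-01_JAN-BERG
```

Note the promotion changes the identifier string itself, which is consistent with the design: only the resulting PID carries the permanence guarantee, while the ID form is explicitly temporary.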
12. References
GHCID Documentation
Related PPID Documents
Standards
- ISO 3166-1: Country codes
- ISO 3166-2: Subdivision codes
- ISO 8601: Date and time format (including BCE with negative years)
- GeoNames: Geographic names database