From 7f53ec6074b8be7b2f540aebf48a9f840e8d8fb3 Mon Sep 17 00:00:00 2001 From: kempersc Date: Fri, 9 Jan 2026 15:57:26 +0100 Subject: [PATCH] docs(person_pid): add PPID-GHCID alignment and PiCo comparison docs --- .../person_pid/10_ppid_ghcid_alignment.md | 1150 +++++++++++++++++ .../person_pid/11_pico_ppid_comparison.md | 475 +++++++ 2 files changed, 1625 insertions(+) create mode 100644 docs/plan/person_pid/10_ppid_ghcid_alignment.md create mode 100644 docs/plan/person_pid/11_pico_ppid_comparison.md diff --git a/docs/plan/person_pid/10_ppid_ghcid_alignment.md b/docs/plan/person_pid/10_ppid_ghcid_alignment.md new file mode 100644 index 0000000000..a9fec96913 --- /dev/null +++ b/docs/plan/person_pid/10_ppid_ghcid_alignment.md @@ -0,0 +1,1150 @@ +# PPID-GHCID Alignment: Revised Identifier Structure + +**Version**: 0.1.0 +**Last Updated**: 2025-01-09 +**Status**: DRAFT - Supersedes opaque identifier design in [05_identifier_structure_design.md](./05_identifier_structure_design.md) +**Related**: [GHCID Specification](../../GHCID_PID_SCHEME.md) | [PiCo Ontology](./03_pico_ontology_analysis.md) + +--- + +## 1. Executive Summary + +This document proposes a **revised PPID structure** that aligns with GHCID's geographic-semantic identifier pattern while accommodating the unique challenges of person identification across historical records. 
+ +### 1.1 Key Changes from Original Design + +| Aspect | Original (Doc 05) | Revised (This Document) | +|--------|-------------------|-------------------------| +| **Format** | Opaque hex (`POID-7a3b-c4d5-...`) | Semantic (`PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG`) | +| **Type Distinction** | POID vs PRID | ID (temporary) vs PID (persistent) | +| **Geographic** | None in identifier | Dual anchors: first + last observation | +| **Temporal** | None in identifier | Century range | +| **Name** | None in identifier | First + last token of emic label | +| **Persistence** | Always persistent | May remain ID indefinitely | + +### 1.2 Design Philosophy + +The revised PPID follows the same principles as GHCID: + +1. **Human-readable semantic components** that aid discovery and deduplication +2. **Geographic anchoring** to physical locations using GeoNames +3. **Temporal anchoring** to enable disambiguation across time +4. **Emic authenticity** using names from primary sources +5. **Collision resolution** via full emic label suffix +6. **Dual representation** as both semantic string and UUID/numeric + +--- + +## 2. 
Identifier Type: ID vs PID + +### 2.1 The Epistemic Uncertainty Problem + +Unlike institutions (which typically have founding documents, legal registrations, and clear organizational boundaries), **persons in historical records often exist in epistemic uncertainty**: + +- Incomplete records (many records lost to time) +- Ambiguous references (common names, no surnames) +- Conflicting sources (different dates, spellings) +- Undiscovered archives (unexplored record sets) + +### 2.2 Two-Class Identifier System + +| Type | Prefix | Description | Persistence | Promotion Path | +|------|--------|-------------|-------------|----------------| +| **ID** | `ID-` | Temporary identifier | May change | Can become PID | +| **PID** | `PID-` | Persistent identifier | Permanent | Cannot revert to ID | + +### 2.3 Promotion Criteria: ID → PID + +An identifier can be promoted from ID to PID when ALL of the following are satisfied: + +```python +@dataclass +class PIDPromotionCriteria: + """ + Criteria for promoting an ID to a PID. + ALL conditions must be True for promotion. 
+ """ + + # Geographic anchors + first_observation_verified: bool # Birth or equivalent + last_observation_verified: bool # Death or equivalent + + # Temporal anchors + century_range_established: bool # From verified observations + + # Identity anchors + emic_label_verified: bool # From primary sources + no_unexplored_archives: bool # Reasonable assumption + + # Quality checks + no_unresolved_conflicts: bool # No conflicting claims + multiple_corroborating_sources: bool # At least 2 independent sources + + def is_promotable(self) -> bool: + return all([ + self.first_observation_verified, + self.last_observation_verified, + self.century_range_established, + self.emic_label_verified, + self.no_unexplored_archives, + self.no_unresolved_conflicts, + self.multiple_corroborating_sources, + ]) +``` + +### 2.4 Permanent ID Status + +Some identifiers may **forever remain IDs** due to: + +- **Fragmentary records**: Only one surviving document mentions the person +- **Uncertain dates**: Cannot establish century range +- **Unknown location**: Cannot anchor geographically +- **Anonymous figures**: No emic label recoverable +- **Ongoing research**: Archives not yet explored + +This is acceptable and expected. An ID is still a valid identifier for internal use; it simply cannot be cited as a persistent identifier in scholarly work. + +--- + +## 3. 
Identifier Structure + +### 3.1 Full Format Specification + +``` +{TYPE}-{FC}-{FR}-{FP}-{LC}-{LR}-{LP}-{CR}-{FT}-{LT}[-{FULL_EMIC}] + │ │ │ │ │ │ │ │ │ │ │ + │ │ │ │ │ │ │ │ │ │ └── Collision suffix (optional) + │ │ │ │ │ │ │ │ │ └── Last Token of emic label + │ │ │ │ │ │ │ │ └── First Token of emic label + │ │ │ │ │ │ │ └── Century Range (e.g., 19-20) + │ │ │ │ │ │ └── Last observation Place (GeoNames 3-letter) + │ │ │ │ │ └── Last observation Region (ISO 3166-2) + │ │ │ │ └── Last observation Country (ISO 3166-1 alpha-2) + │ │ │ └── First observation Place (GeoNames 3-letter) + │ │ └── First observation Region (ISO 3166-2) + │ └── First observation Country (ISO 3166-1 alpha-2) + └── Type: ID or PID +``` + +### 3.2 Component Definitions + +| Component | Format | Description | Example | +|-----------|--------|-------------|---------| +| **TYPE** | `ID` or `PID` | Identifier class | `PID` | +| **FC** | ISO 3166-1 α2 | First observation country (modern) | `NL` | +| **FR** | ISO 3166-2 suffix | First observation region | `NH` | +| **FP** | 3 letters | First observation place (GeoNames) | `AMS` | +| **LC** | ISO 3166-1 α2 | Last observation country (modern) | `NL` | +| **LR** | ISO 3166-2 suffix | Last observation region | `NH` | +| **LP** | 3 letters | Last observation place (GeoNames) | `HAA` | +| **CR** | `CC-CC` | Century range (CE) | `19-20` | +| **FT** | UPPERCASE | First token of emic label | `JAN` | +| **LT** | UPPERCASE | Last token of emic label | `BERG` | +| **FULL_EMIC** | snake_case | Full emic label (collision only) | `jan_van_den_berg` | + +### 3.3 Examples + +| Person | Full Emic Label | PPID | +|--------|-----------------|------| +| Jan van den Berg, born Amsterdam 1895, died Haarlem 1970 | Jan van den Berg | `PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG` | +| Rembrandt, born Leiden 1606, died Amsterdam 1669 | Rembrandt van Rijn | `PID-NL-ZH-LEI-NL-NH-AMS-17-17-REMBRANDT-RIJN` | +| Maria Sibylla Merian, born Frankfurt 1647, died Amsterdam 1717 | Maria 
Sibylla Merian | `PID-DE-HE-FRA-NL-NH-AMS-17-18-MARIA-MERIAN` | +| Unknown soldier, found Normandy, died 1944 | (unknown) | `ID-XX-XX-XXX-FR-NM-OMH-20-20-UNKNOWN-` | +| Henry VIII, born London 1491, died London 1547 | Henry VIII | `PID-GB-ENG-LON-GB-ENG-LON-15-16-HENRY-VIII` | + +**Notes on Emic Labels**: +- Always use **formal/complete emic names** from primary sources, not modern colloquial short forms +- "Rembrandt" alone is a modern convention; the emic label from his lifetime was "Rembrandt van Rijn" +- **Tussenvoegsels (particles)** like "van", "de", "den", "der", "van de", "van den", "van der" are **skipped** when extracting the last token (see §4.5) +- This follows the same pattern as GHCID abbreviation rules (AGENTS.md Rule 8) + +--- + +## 4. Component Rules + +### 4.1 First Observation (Birth or Earliest) + +```python +from dataclasses import dataclass +from enum import Enum +from typing import Optional + +class ObservationType(Enum): + BIRTH_CERTIFICATE = "birth_certificate" # Highest authority + BAPTISM_RECORD = "baptism_record" # Common for pre-civil registration + BIRTH_STATEMENT = "birth_statement" # Stated birth in other document + EARLIEST_REFERENCE = "earliest_reference" # Earliest surviving mention + INFERRED = "inferred" # Inferred from context + +@dataclass +class FirstObservation: + """ + First observation of a person during their lifetime. + Ideally a birth record, but may be another early record. 
+ """ + + observation_type: ObservationType + + # Modern geographic codes (mapped from historical) + country_code: str # ISO 3166-1 alpha-2 + region_code: str # ISO 3166-2 subdivision + place_code: str # GeoNames 3-letter code + + # Original historical reference + historical_place_name: str # As named in source + historical_date: str # As stated in source + + # Mapping provenance + modern_mapping_method: str # How historical → modern mapping done + geonames_id: Optional[int] # GeoNames ID for place + + # Quality indicators + is_birth_record: bool + can_assume_earliest: bool # No unexplored archives likely + source_confidence: float # 0.0 - 1.0 + + def is_valid_for_pid(self) -> bool: + """ + Determine if this observation is valid for PID generation. + """ + if self.is_birth_record: + return True + + if self.observation_type == ObservationType.EARLIEST_REFERENCE: + # Must be able to assume this is actually the earliest + return self.can_assume_earliest and self.source_confidence >= 0.8 + + return False +``` + +### 4.2 Last Observation (Death or Latest During Lifetime) + +```python +@dataclass +class LastObservation: + """ + Last observation of a person during their lifetime or immediately after death. + Ideally a death record, but may be the last known living reference. + """ + + observation_type: ObservationType # Reuses ObservationType; death-specific members (e.g. DEATH_CERTIFICATE) still need to be added. 
+ + # Modern geographic codes + country_code: str + region_code: str + place_code: str + + # Original historical reference + historical_place_name: str + historical_date: str + + # Critical distinction + is_death_record: bool + is_lifetime_observation: bool # True if person still alive at observation + is_immediate_post_death: bool # First record after death + + # Quality + can_assume_latest: bool + source_confidence: float + + def is_valid_for_pid(self) -> bool: + if self.is_death_record: + return True + + if self.is_immediate_post_death: + # First mention of death + return self.source_confidence >= 0.8 + + if self.is_lifetime_observation: + # Last known alive, but not death record + return self.can_assume_latest and self.source_confidence >= 0.8 + + return False +``` + +### 4.3 Geographic Mapping: Historical → Modern + +```python +from dataclasses import dataclass +from typing import Optional, Tuple + +@dataclass +class HistoricalPlaceMapping: + """ + Map historical place names to modern ISO/GeoNames codes. + + Historical places must be mapped to their MODERN equivalents + as of the PPID generation date. This ensures stability even + when historical boundaries shifted. + """ + + # Historical input + historical_name: str + historical_date: str # When the place was referenced + + # Modern output (at PPID generation time) + modern_country_code: str # ISO 3166-1 alpha-2 + modern_region_code: str # ISO 3166-2 suffix (e.g., "NH" not "NL-NH") + modern_place_code: str # 3-letter from GeoNames + + # GeoNames reference + geonames_id: int + geonames_name: str # Modern canonical name + geonames_feature_class: str # P = populated place + geonames_feature_code: str # PPL, PPLA, PPLC, etc. 
+ + # Mapping provenance + mapping_method: str # "direct", "successor", "enclosing", "manual" + mapping_confidence: float + mapping_notes: str + ppid_generation_date: str # When mapping was performed + +def map_historical_to_modern( + historical_name: str, + historical_date: str, + db +) -> HistoricalPlaceMapping: + """ + Map a historical place name to modern ISO/GeoNames codes. + + Strategies (in order): + 1. Direct match: Place still exists with same name + 2. Successor: Place renamed but geographically same + 3. Enclosing: Place absorbed into larger entity + 4. Manual: Requires human research + """ + + # Strategy 1: Direct GeoNames lookup + direct_match = db.geonames_search(historical_name) + if direct_match and direct_match.is_populated_place: + return HistoricalPlaceMapping( + historical_name=historical_name, + historical_date=historical_date, + modern_country_code=direct_match.country_code, + modern_region_code=direct_match.admin1_code, + modern_place_code=generate_place_code(direct_match.name), + geonames_id=direct_match.geonames_id, + geonames_name=direct_match.name, + geonames_feature_class=direct_match.feature_class, + geonames_feature_code=direct_match.feature_code, + mapping_method="direct", + mapping_confidence=0.95, + mapping_notes="Direct GeoNames match", + ppid_generation_date=datetime.utcnow().isoformat() + ) + + # Strategy 2: Historical name lookup (renamed places) + # e.g., "Batavia" → "Jakarta" + historical_match = db.historical_place_names.get(historical_name) + if historical_match: + modern = db.geonames_by_id(historical_match.modern_geonames_id) + return HistoricalPlaceMapping( + historical_name=historical_name, + historical_date=historical_date, + modern_country_code=modern.country_code, + modern_region_code=modern.admin1_code, + modern_place_code=generate_place_code(modern.name), + geonames_id=modern.geonames_id, + geonames_name=modern.name, + geonames_feature_class=modern.feature_class, + geonames_feature_code=modern.feature_code, + 
mapping_method="successor", + mapping_confidence=0.90, + mapping_notes=f"Historical name '{historical_name}' → modern '{modern.name}'", + ppid_generation_date=datetime.utcnow().isoformat() + ) + + # Strategy 3: Geographic coordinates (if available from source) + # Reverse geocode to find enclosing modern settlement + + # Strategy 4: Manual research required + raise ManualResearchRequired( + f"Cannot automatically map '{historical_name}' ({historical_date}) to modern location" + ) + + +def generate_place_code(place_name: str) -> str: + """ + Generate 3-letter place code from GeoNames name. + + Rules (same as GHCID): + - Single word: First 3 letters → "Amsterdam" → "AMS" + - Multi-word: Initials → "New York" → "NYO" (or "NYC" if registered) + - Dutch articles: Article initial + 2 from main → "Den Haag" → "DHA" + """ + # Implementation follows GHCID rules + # See AGENTS.md: "SETTLEMENT STANDARDIZATION: GEONAMES IS AUTHORITATIVE" + pass +``` + +### 4.4 Century Range Calculation + +```python +def calculate_century_range( + first_observation: FirstObservation, + last_observation: LastObservation +) -> str: + """ + Calculate the century range for a person's lifetime. + + Returns format: "CC-CC" (e.g., "19-20" for 1850-1925) + + Rules: + - Centuries are 1-indexed: 1-100 CE = 1st century, 1901-2000 = 20th century + - BCE dates: Use negative century numbers (e.g., "-5--4" for 5th-4th century BCE) + Year -N denotes N BCE; this is not ISO 8601 numbering, where year 0000 is 1 BCE + - Range must be from verified observations + """ + + def year_to_century(year: int) -> int: + """ + Convert year to century number. + + Positive years (CE): 1-100 = century 1, 1901-2000 = century 20 + Negative years (BCE): -500 to -401 = century -5 + + Note: There is no year 0 in this historical (BCE/CE) numbering. + Year 1 BCE is followed directly by year 1 CE. 
+ """ + if year > 0: + return ((year - 1) // 100) + 1 + else: + # BCE: year -500 → century -5, year -1 → century -1 + return (year // 100) + + def parse_year(date_str: str) -> int: + """Extract year from various date formats.""" + # Handle: "1895", "1895-03-15", "March 1895", "c. 1895", etc. + # Also handle BCE: "-500", "500 BCE", "500 BC", "c. 500 BCE" + import re + + # Check for BCE indicators + bce_match = re.search(r'(\d+)\s*(BCE|BC|B\.C\.E?\.|v\.Chr\.)', date_str, re.IGNORECASE) + if bce_match: + return -int(bce_match.group(1)) + + # Check for a leading minus sign (year -N is read as N BCE) + neg_match = re.search(r'-(\d+)', date_str) + if neg_match and date_str.strip().startswith('-'): + return -int(neg_match.group(1)) + + # Standard positive year + match = re.search(r'\b(\d{4})\b', date_str) + if match: + return int(match.group(1)) + + # 3-digit year (ancient dates) + match = re.search(r'\b(\d{3})\b', date_str) + if match: + return int(match.group(1)) + + raise ValueError(f"Cannot parse year from: {date_str}") + + first_year = parse_year(first_observation.historical_date) + last_year = parse_year(last_observation.historical_date) + + first_century = year_to_century(first_year) + last_century = year_to_century(last_year) + + # Validation: compare years, so reversed dates within one century are also caught + if last_year < first_year: + raise ValueError( + f"Last observation ({last_year}) cannot be before " + f"first observation ({first_year})" + ) + + return f"{first_century}-{last_century}" + + +# Examples (CE): +# 1850 → century 19 +# 1925 → century 20 +# Range: "19-20" + +# 1606 → century 17 +# 1669 → century 17 +# Range: "17-17" (same century) + +# 1895 → century 19 +# 2005 → century 21 +# Range: "19-21" (centenarian) + +# Examples (BCE): +# -500 (500 BCE) → century -5 +# -401 (401 BCE) → century -5 +# Range: "-5--5" (same century) + +# -469 (469 BCE, Socrates birth) → century -5 +# -399 (399 BCE, Socrates death) → century -4 +# Range: "-5--4" + +# -100 (100 BCE) → century -1 +# 14 (14 CE) → century 1 +# Range: "-1-1" (crossing 
BCE/CE boundary) +``` + +### 4.5 Emic Label Tokens + +```python +from dataclasses import dataclass +from typing import Optional, List +import re + +@dataclass +class EmicLabel: + """ + The common contemporary emic label of a person. + + "Emic" = from the insider perspective, as the person was known + during their lifetime in primary sources. + + "Etic" = from the outsider perspective, how we refer to them now. + + Prefer emic; fall back to etic only if emic unrecoverable. + """ + + full_label: str # Complete emic label + first_token: str # First word/token + last_token: str # Last word/token (empty if mononym) + + # Source provenance + source_type: str # "primary" or "etic_fallback" + source_document: str # Reference to source + source_date: str # When source was created + + # Quality + is_from_primary_source: bool + is_vernacular: bool # From vernacular (non-official) source + confidence: float + + @classmethod + def from_full_label(cls, label: str, **kwargs) -> 'EmicLabel': + """Parse full label into first and last tokens.""" + tokens = tokenize_emic_label(label) + + first_token = tokens[0].upper() if tokens else "" + last_token = tokens[-1].upper() if len(tokens) > 1 else "" + + return cls( + full_label=label, + first_token=first_token, + last_token=last_token, + **kwargs + ) + + +def tokenize_emic_label(label: str) -> List[str]: + """ + Tokenize an emic label into words. + + Rules: + - Split on whitespace + - Preserve numeric tokens (e.g., "VIII" in "Henry VIII") + - Do NOT split compound words + - Normalize to uppercase for identifier + """ + # Basic whitespace split + tokens = label.strip().split() + + # Filter empty tokens + tokens = [t for t in tokens if t] + + return tokens + + +def extract_name_tokens( + full_emic_label: str +) -> tuple[str, str]: + """ + Extract first and last tokens from emic label. + + Rules: + 1. First token: First word of the emic label + 2. 
Last token: Last word AFTER skipping tussenvoegsels (name particles) + + Tussenvoegsels are common prefixes in Dutch and other languages that are + NOT part of the surname proper. They are skipped when extracting the + last token (same as GHCID abbreviation rules - AGENTS.md Rule 8). + + Examples: + - "Jan van den Berg" → ("JAN", "BERG") # "van den" skipped + - "Rembrandt van Rijn" → ("REMBRANDT", "RIJN") # "van" skipped + - "Henry VIII" → ("HENRY", "VIII") + - "Maria Sibylla Merian" → ("MARIA", "MERIAN") + - "Ludwig van Beethoven" → ("LUDWIG", "BEETHOVEN") # "van" skipped + - "Vincent van Gogh" → ("VINCENT", "GOGH") # "van" skipped + - "Leonardo da Vinci" → ("LEONARDO", "VINCI") # "da" skipped + - "中村 太郎" → transliterated: ("NAKAMURA", "TARO") + """ + # Tussenvoegsels (name particles) to skip when finding last token + # Following GHCID pattern (AGENTS.md Rule 8: Legal Form Filtering) + TUSSENVOEGSELS = { + # Dutch + 'van', 'de', 'den', 'der', 'het', "'t", 'te', 'ten', 'ter', + 'van de', 'van den', 'van der', 'van het', "van 't", + 'in de', 'in den', 'in het', "in 't", + 'op de', 'op den', 'op het', "op 't", + # German + 'von', 'vom', 'zu', 'zum', 'zur', 'von und zu', + # French + 'de', 'du', 'des', 'de la', 'le', 'la', 'les', + # Italian + 'da', 'di', 'del', 'della', 'dei', 'degli', 'delle', + # Spanish + 'de', 'del', 'de la', 'de los', 'de las', + # Portuguese + 'da', 'do', 'dos', 'das', 'de', + } + + tokens = tokenize_emic_label(full_emic_label) + + if len(tokens) == 0: + raise ValueError("Empty emic label") + + first_token = tokens[0].upper() + + if len(tokens) == 1: + # Mononym + last_token = "" + else: + # Find last token that is NOT a tussenvoegsel + # Work backwards from the end + last_token = "" + for token in reversed(tokens[1:]): # Skip first token + token_lower = token.lower() + if token_lower not in TUSSENVOEGSELS: + last_token = token.upper() + break + + # If all remaining tokens are tussenvoegsels, use the actual last token + if not last_token: + 
last_token = tokens[-1].upper() + + # Normalize: remove diacritics, special characters + first_token = normalize_token(first_token) + last_token = normalize_token(last_token) + + return (first_token, last_token) + + +def normalize_token(token: str) -> str: + """ + Normalize token for PPID. + + - Remove diacritics (é → E) + - Uppercase + - Allow alphanumeric only (for Roman numerals like VIII) + - Expects Latin-script input: non-Latin scripts must be transliterated first (other characters are dropped) + """ + import unicodedata + + # NFD decomposition + remove combining marks + normalized = unicodedata.normalize('NFD', token) + ascii_token = ''.join( + c for c in normalized + if unicodedata.category(c) != 'Mn' + ) + + # Uppercase + ascii_token = ascii_token.upper() + + # Keep only A-Z and 0-9 + ascii_token = re.sub(r'[^A-Z0-9]', '', ascii_token) + + return ascii_token +``` + +### 4.6 Emic vs Etic Fallback + +```python +@dataclass +class EmicLabelResolution: + """ + Resolution of emic label for a person. + + Priority: + 1. Emic from primary sources (documents from their lifetime) + 2. Etic fallback (only if emic truly unrecoverable) + """ + + resolved_label: EmicLabel + resolution_method: str # "emic_primary", "emic_vernacular", "etic_fallback" + emic_search_exhausted: bool + vernacular_sources_checked: List[str] + fallback_justification: Optional[str] + +def resolve_emic_label( + person_observations: List['PersonObservation'], + db +) -> EmicLabelResolution: + """ + Resolve the emic label for a person from their observations. + + Rules: + 1. Search all primary sources for emic names + 2. Prefer most frequently used name in primary sources + 3. Only use etic fallback if emic truly unrecoverable + 4. Vernacular sources must have clear pedigrees + 5. 
Oral traditions without documentation not valid + """ + + # Collect all name mentions from primary sources + emic_candidates = [] + + for obs in person_observations: + if obs.is_primary_source and obs.is_from_lifetime: + for claim in obs.claims: + if claim.claim_type in ('full_name', 'given_name', 'title'): + emic_candidates.append({ + 'label': claim.claim_value, + 'source': obs.source_url, + 'date': obs.source_date, + 'is_vernacular': obs.is_vernacular_source + }) + + if emic_candidates: + # Find most common emic label + from collections import Counter + label_counts = Counter(c['label'] for c in emic_candidates) + most_common = label_counts.most_common(1)[0][0] + + best_candidate = next( + c for c in emic_candidates if c['label'] == most_common + ) + + return EmicLabelResolution( + resolved_label=EmicLabel.from_full_label( + most_common, + source_type="primary", + source_document=best_candidate['source'], + source_date=best_candidate['date'], + is_from_primary_source=True, + is_vernacular=best_candidate['is_vernacular'], + confidence=0.95 + ), + resolution_method="emic_primary", + emic_search_exhausted=True, + vernacular_sources_checked=[c['source'] for c in emic_candidates if c['is_vernacular']], + fallback_justification=None + ) + + # Check if etic fallback is justified + unexplored_vernacular = db.get_unexplored_vernacular_archives(person_observations) + + if unexplored_vernacular: + raise EmicLabelNotYetResolvable( + f"Emic label not found in explored sources. " + f"Unexplored vernacular archives exist: {unexplored_vernacular}. " + f"Cannot use etic fallback until these are explored." 
+ ) + + # Etic fallback (rare) + etic_label = db.get_most_common_etic_label(person_observations) + + return EmicLabelResolution( + resolved_label=EmicLabel.from_full_label( + etic_label, + source_type="etic_fallback", + source_document="Modern scholarly consensus", + source_date=datetime.utcnow().isoformat(), + is_from_primary_source=False, + is_vernacular=False, + confidence=0.70 + ), + resolution_method="etic_fallback", + emic_search_exhausted=True, + vernacular_sources_checked=[], + fallback_justification=( + "No emic label found in explored primary sources. " + "All known vernacular sources checked. " + "Using most common modern scholarly reference." + ) + ) +``` + +--- + +## 5. Collision Handling + +### 5.1 Collision Detection + +Two PPIDs collide when all components except the collision suffix match: + +```python +def detect_collision(new_ppid: str, existing_ppids: set[str]) -> bool: + """ + Check if new PPID collides with existing identifiers. + + Collision = same base components (before any collision suffix). + """ + base_new = get_base_ppid(new_ppid) + + for existing in existing_ppids: + base_existing = get_base_ppid(existing) + if base_new == base_existing: + return True + + return False + +def get_base_ppid(ppid: str) -> str: + """Extract base PPID without collision suffix.""" + # Full PPID may have collision suffix after last token + # e.g., "PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG-jan_van_den_berg" + # Base: "PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG" + + parts = ppid.split('-') + + # Standard PPID has 11 parts (TYPE + 6 geo + 2 for CR + FT + LT); negative (BCE) century values add extra hyphens and are not handled here + # If more parts, the extra is collision suffix + if len(parts) > 11: + return '-'.join(parts[:11]) + + return ppid +``` + +### 5.2 Collision Resolution via Full Emic Label + +When a collision occurs, append the full emic label in snake_case: + +```python +def resolve_collision( + base_ppid: str, + full_emic_label: str, + existing_ppids: set[str] +) -> str: + """ + Resolve collision by appending full emic label. 
+ + Example: + Base: "PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG" + Emic: "Jan van den Berg" + Result: "PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG-jan_van_den_berg" + """ + suffix = generate_collision_suffix(full_emic_label) + resolved = f"{base_ppid}-{suffix}" + + # Check if still collides (extremely rare) + if resolved in existing_ppids: + # Add numeric discriminator + counter = 2 + while f"{resolved}_{counter}" in existing_ppids: + counter += 1 + resolved = f"{resolved}_{counter}" + + return resolved + +def generate_collision_suffix(full_emic_label: str) -> str: + """ + Generate collision suffix from full emic label. + + Same rules as GHCID collision suffix: + - Convert to lowercase snake_case + - Remove diacritics + - Remove punctuation + """ + import unicodedata + import re + + # Normalize unicode + normalized = unicodedata.normalize('NFD', full_emic_label) + ascii_name = ''.join( + c for c in normalized + if unicodedata.category(c) != 'Mn' + ) + + # Lowercase + lowercase = ascii_name.lower() + + # Remove punctuation + no_punct = re.sub(r"[''`\",.:;!?()[\]{}]", '', lowercase) + + # Replace spaces with underscores + underscored = re.sub(r'\s+', '_', no_punct) + + # Remove non-alphanumeric except underscore + clean = re.sub(r'[^a-z0-9_]', '', underscored) + + # Collapse multiple underscores + final = re.sub(r'_+', '_', clean).strip('_') + + return final +``` + +--- + +## 6. Unknown Components: XX and XXX Placeholders + +### 6.1 When Components Are Unknown + +Unlike GHCID (where `XX`/`XXX` are temporary and require research), PPID may have permanently unknown components: + +| Scenario | Placeholder | Can be PID? 
| +|----------|-------------|-------------| +| Unknown birth country | `XX` | No (remains ID) | +| Unknown birth region | `XX` | No (remains ID) | +| Unknown birth place | `XXX` | No (remains ID) | +| Unknown death country | `XX` | No (remains ID) | +| Unknown death region | `XX` | No (remains ID) | +| Unknown death place | `XXX` | No (remains ID) | +| Unknown century | `XX-XX` | No (remains ID) | +| Unknown first token | `UNKNOWN` | No (remains ID) | +| Unknown last token | (empty) | Yes (if mononym) | + +### 6.2 ID Examples with Unknown Components + +``` +ID-XX-XX-XXX-FR-NM-OMH-20-20-UNKNOWN- # Unknown soldier, Normandy +ID-NL-NH-AMS-XX-XX-XXX-17-17-REMBRANDT- # Rembrandt, death place unknown +ID-XX-XX-XXX-XX-XX-XXX-XX-XX-ANONYMOUS- # Completely unknown person +``` + +--- + +## 7. UUID and Numeric Generation + +### 7.1 Dual Representation (Same as GHCID) + +Every PPID generates three representations: + +| Format | Purpose | Example | +|--------|---------|---------| +| **Semantic String** | Human-readable | `PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG` | +| **UUID v5** | Linked data, URIs | `550e8400-e29b-41d4-a716-446655440000` | +| **Numeric (64-bit)** | Database keys, CSV | `213324328442227739` | + +### 7.2 Generation Algorithm + +```python +import uuid +import hashlib + +# PPID namespace UUID (different from GHCID namespace) +PPID_NAMESPACE = uuid.UUID('f47ac10b-58cc-4372-a567-0e02b2c3d479') + +def generate_ppid_identifiers(semantic_ppid: str) -> dict: + """ + Generate all identifier formats from semantic PPID string. 
+ + Returns: + { + 'semantic': 'PID-NL-NH-AMS-...', + 'uuid_v5': '550e8400-...', + 'numeric': 213324328442227739 + } + """ + # UUID v5 from semantic string + ppid_uuid = uuid.uuid5(PPID_NAMESPACE, semantic_ppid) + + # Numeric from the first 8 bytes of the SHA-256 digest (64-bit, unsigned) + sha256 = hashlib.sha256(semantic_ppid.encode()).digest() + numeric = int.from_bytes(sha256[:8], byteorder='big') + + return { + 'semantic': semantic_ppid, + 'uuid_v5': str(ppid_uuid), + 'numeric': numeric + } + + +# Example (the uuid_v5 and numeric values shown are illustrative placeholders, not actual output): +ppid = "PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG" +identifiers = generate_ppid_identifiers(ppid) +# { +# 'semantic': 'PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG', +# 'uuid_v5': 'a1b2c3d4-e5f6-5a1b-9c2d-3e4f5a6b7c8d', +# 'numeric': 1234567890123456789 +# } +``` + +--- + +## 8. Relationship to Person Observations + +### 8.1 Distinction: PPID vs Observation Identifiers + +| Identifier | Purpose | Structure | Persistence | +|------------|---------|-----------|-------------| +| **PPID** | Identify a person (reconstruction) | Geographic + temporal + emic | Permanent (if PID) | +| **Observation ID** | Identify a specific source observation | GHCID-based + RiC-O | Permanent | + +### 8.2 Observation Identifier Structure (Forthcoming) + +Observation identifiers will use a different pattern: + +``` +{REPOSITORY_GHCID}/{CREATOR_GHCID}/{RICO_RECORD_PATH} +``` + +Where: +- **REPOSITORY_GHCID**: GHCID of the institution holding the record +- **CREATOR_GHCID**: GHCID of the institution that created the record (may be the same) +- **RICO_RECORD_PATH**: RiC-O derived path to RecordSet/Record/RecordPart + +Example: +``` +NL-NH-HAA-A-NHA/NL-NH-HAA-A-NHA/burgerlijke-stand/geboorten/1895/003/045 +│ │ │ +│ │ └── RiC-O path: fonds/series/file/item +│ └── Creator (same institution) +└── Repository +``` + +This is **separate from PPID** and will be specified in a future document. + +--- + +## 9. 
Comparison with Original POID/PRID Design + +### 9.1 What Changes + +| Aspect | POID/PRID (Doc 05) | Revised PPID (This Doc) | +|--------|-------------------|-------------------------| +| **Identifier opacity** | Opaque (no semantic content) | Semantic (human-readable) | +| **Geographic anchoring** | None | Dual (birth + death locations) | +| **Temporal anchoring** | None | Century range | +| **Name in identifier** | None | First + last token | +| **Type prefix** | POID/PRID | ID/PID | +| **Observation vs Person** | Different identifier types | Completely separate systems | +| **UUID backing** | Primary | Secondary (derived) | +| **Collision handling** | UUID collision (rare) | Semantic collision (more common) | + +### 9.2 What Stays the Same + +- Dual identifier generation (UUID + numeric) +- Deterministic generation from input +- Permanent persistence (once PID) +- Integration with GHCID for institution links +- Claim-based provenance model +- PiCo ontology alignment + +### 9.3 Transition Plan + +If this revised structure is adopted: + +1. **Document 05** becomes historical reference +2. **This document** becomes the authoritative identifier spec +3. No existing identifiers need migration (this is a new system) +4. Code examples in other documents need updates + +--- + +## 10. 
Implementation Considerations + +### 10.1 Character Set and Length + +```python +# Maximum lengths +MAX_COUNTRY_CODE = 2 # ISO 3166-1 alpha-2 +MAX_REGION_CODE = 3 # ISO 3166-2 suffix (some are 3 chars) +MAX_PLACE_CODE = 3 # GeoNames convention +MAX_CENTURY_RANGE = 5 # "XX-XX" +MAX_TOKEN_LENGTH = 20 # Reasonable limit for names +MAX_COLLISION_SUFFIX = 50 # Full emic label + +# Maximum total PPID length (without collision suffix) +# "PID-" + "XX-XXX-XXX-" * 2 + "XX-XX-" + "TOKEN-TOKEN" +# = 4 + (2+3+3+4)*2 + 6 + 20 + 20 = ~70 characters + +# With collision suffix: ~120 characters max +``` + +### 10.2 Validation Regex + +```python +import re + +PPID_PATTERN = re.compile( + r'^(ID|PID)-' # Type + r'([A-Z]{2}|XX)-' # First country + r'([A-Z]{2,3}|XX)-' # First region + r'([A-Z]{3}|XXX)-' # First place + r'([A-Z]{2}|XX)-' # Last country + r'([A-Z]{2,3}|XX)-' # Last region + r'([A-Z]{3}|XXX)-' # Last place + r'(\d{1,2}-\d{1,2}|XX-XX)-' # Century range + r'([A-Z0-9]+)-' # First token + r'([A-Z0-9]*)' # Last token (may be empty) + r'(-[a-z0-9_]+)?$' # Collision suffix (optional) +) + +def validate_ppid(ppid: str) -> tuple[bool, str]: + """Validate PPID format.""" + if not PPID_PATTERN.match(ppid): + return False, "Invalid PPID format" + + # Additional semantic validation + parts = ppid.split('-') + + # Century range validation + if len(parts) >= 9: + century_range = f"{parts[7]}-{parts[8]}" + if century_range != "XX-XX": + try: + first_c, last_c = map(int, [parts[7], parts[8]]) + if last_c < first_c: + return False, "Last century cannot be before first century" + if first_c < 1 or last_c > 22: # Reasonable bounds + return False, "Century out of reasonable range" + except ValueError: + pass + + return True, "Valid" +``` + +--- + +## 11. Open Questions + +### 11.1 BCE Dates + +How to handle persons from before Common Era? + +**Options**: +1. Negative century numbers: `-5--4` for 5th-4th century BCE +2. BCE prefix: `BCE5-BCE4` +3. 
Separate identifier scheme for ancient persons + +### 11.2 Non-Latin Name Tokens + +How to handle names in non-Latin scripts? + +**Options**: +1. Require transliteration (current approach) +2. Allow Unicode tokens with normalization +3. Dual representation (original + transliterated) + +### 11.3 Disputed Locations + +What if birth/death locations are historically disputed? + +**Options**: +1. Use most likely location with note +2. Use `XX`/`XXX` until resolved +3. Create multiple IDs for each interpretation + +### 11.4 Living Persons + +How to handle persons still alive (no death observation)? + +**Options**: +1. Cannot be PID until death +2. Use `XX-XX-XXX` for death location, current century for range +3. Separate identifier class for living persons + +--- + +## 12. References + +### GHCID Documentation +- [GHCID PID Scheme](../../GHCID_PID_SCHEME.md) +- [AGENTS.md: Persistent Identifiers](../../AGENTS.md#persistent-identifiers-ghcid) + +### Related PPID Documents +- [Original Identifier Structure (superseded)](./05_identifier_structure_design.md) +- [PiCo Ontology Analysis](./03_pico_ontology_analysis.md) +- [Cultural Naming Conventions](./04_cultural_naming_conventions.md) + +### Standards +- ISO 3166-1: Country codes +- ISO 3166-2: Subdivision codes +- GeoNames: Geographic names database diff --git a/docs/plan/person_pid/11_pico_ppid_comparison.md b/docs/plan/person_pid/11_pico_ppid_comparison.md new file mode 100644 index 0000000000..ccc6a74413 --- /dev/null +++ b/docs/plan/person_pid/11_pico_ppid_comparison.md @@ -0,0 +1,475 @@ +# PiCo vs PPID: Comparative Analysis + +**Version**: 0.1.0 +**Last Updated**: 2025-01-09 +**Related**: [PPID-GHCID Alignment](./10_ppid_ghcid_alignment.md) | [PiCo Ontology Analysis](./03_pico_ontology_analysis.md) + +--- + +## 1. Executive Summary + +This document compares the **PiCo (Persons in Context)** ontology developed by CBG|Centrum voor Familiegeschiedenis with our proposed **PPID (Person Persistent Identifier)** system. 
The analysis is based on deep research into PiCo's implementation in Open Archives (openarchieven.nl) and the WieWasWie platform. + +### 1.1 Key Finding + +PiCo and PPID serve **complementary purposes**: + +| System | Primary Purpose | Identifier Style | Scope | +|--------|-----------------|------------------|-------| +| **PiCo** | Data model for person observations in genealogical sources | Opaque UUIDs | Genealogical records (civil registries, church books) | +| **PPID** | Persistent identifiers for heritage sector persons | Semantic geographic-temporal | Heritage custodian staff and historical figures | + +**Recommendation**: PPID should **adopt PiCo's ontological distinctions** (PersonObservation vs PersonReconstruction) while using its own **semantic identifier format** aligned with GHCID conventions. + +--- + +## 2. PiCo Architecture (From Research) + +### 2.1 Core Classes + +From the PiCo specification at `personsincontext.org/model`: + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ PiCo MODEL │ +├─────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Person │ │ +│ │ (Container class - not used directly) │ │ +│ │ │ │ +│ │ ┌─────────────────┐ ┌─────────────────┐ │ │ +│ │ │ PersonObservation│ │PersonReconstruction │ │ +│ │ │ │ │ │ │ │ +│ │ │ - Data as found │ │ - Curated identity│ │ │ +│ │ │ on Source │ │ - Links multiple │ │ │ +│ │ │ - hadPrimarySource │ observations │ │ │ +│ │ │ - hasRole │ │ - wasDerivedFrom │ │ │ +│ │ │ - hasAge │ │ - wasGeneratedBy │ │ │ +│ │ │ - hasOccupation │ │ - wasRevisionOf │ │ │ +│ │ └─────────────────┘ └─────────────────┘ │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Source │ │ +│ │ (schema:ArchiveComponent) │ │ +│ │ - name, dateCreated, holdingArchive, associatedMedia │ │ +│ 
└─────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ PersonName (PNV) │ │ +│ │ - literalName, givenName, baseSurname, surnamePrefix │ │ +│ │ - patronym, initials │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### 2.2 PiCo Identifier Structure in Open Archives + +From the Open Archives API documentation: + +``` +URI Format: https://www.openarchieven.nl/{3-letter-archive-code}:{uuid}[/{token}] + +Examples: +- https://www.openarchieven.nl/rat:48c2b836-385f-11e0-bcd1-8edf61960649 +- https://www.openarchieven.nl/elo:f5169776-db74-70a3-51e3-20c15291429c + +Components: +- rat = Regionaal Archief Tilburg (3-letter archive code) +- 48c2b836-385f-11e0-bcd1-8edf61960649 = UUID of the record +- /ttl:pico = Optional token for content negotiation (Turtle + PiCo profile) +``` + +### 2.3 PiCo PersonObservation Example (Actual Data) + +From Open Archives API response: + +```turtle +@prefix oa: . +@prefix pico: . +@prefix prov: . +@prefix sdo: . + +oa:record_rat_48c2b836-385f-11e0-bcd1-8edf61960649_Person_22f30464-3867-11e0-bcd1-8edf61960649 + a pico:PersonObservation ; + prov:hadPrimarySource oa:record_rat_48c2b836-385f-11e0-bcd1-8edf61960649 ; + pico:hasRole "Moeder" ; + sdo:children oa:record_rat_48c2b836-385f-11e0-bcd1-8edf61960649_Person_22f2ae9c-... ; + sdo:spouse oa:record_rat_48c2b836-385f-11e0-bcd1-8edf61960649_Person_22f2da16-... ; + sdo:gender sdo:Female ; + sdo:name "Cornelia Verhulst" ; + sdo:familyName "Verhulst" ; + sdo:givenName "Cornelia" . 
+``` + +### 2.4 PiCo PersonReconstruction Example + +From PiCo specification: + +```turtle +cbg:person_reconstruction_2 + a pico:PersonReconstruction ; + sdo:name "Anna Maria Koppen" ; + sdo:familyName "Koppen" ; + sdo:givenName "Anna" ; + sdo:gender sdo:Female ; + sdo:birthPlace "Haarlem" ; + sdo:birthDate "1860-03-31"^^xsd:date ; + sdo:deathPlace "Detroit, VSA" ; + sdo:deathDate "1926"^^xsd:gYear ; + prov:wasDerivedFrom nha:huwelijksakte_1885_321_po_1, + cbg:NL-HaCBG_1755_0341_142_po_1 ; + prov:wasGeneratedBy cbg:reconstruction_activity_01 . +``` + +--- + +## 3. Detailed Comparison + +### 3.1 Identifier Format + +| Aspect | PiCo (CBG/Open Archives) | PPID (Proposed) | +|--------|--------------------------|-----------------| +| **Format** | `{archive}:{uuid}` | `{TYPE}-{FC}-{FR}-{FP}-{LC}-{LR}-{LP}-{CR}-{FT}-{LT}` | +| **Example** | `rat:48c2b836-385f-11e0-bcd1-8edf61960649` | `PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG` | +| **Human Readable** | No (opaque UUID) | Yes (semantic components) | +| **Archive Prefix** | Yes (3-letter code) | No (implicit via source) | +| **Geographic** | No | Yes (birth + death locations) | +| **Temporal** | No | Yes (century range) | +| **Name** | No | Yes (first + last token) | + +### 3.2 Conceptual Model + +| Concept | PiCo | PPID | +|---------|------|------| +| **Raw Observation** | `PersonObservation` | Observation (separate system) | +| **Curated Identity** | `PersonReconstruction` | `PID` (promoted from `ID`) | +| **Temporary State** | Not explicit | `ID` class | +| **Permanent State** | All URIs persistent | `PID` class only | +| **Provenance** | PROV-O (wasGeneratedBy, wasDerivedFrom) | PROV-O + XPath claims | +| **Name Vocabulary** | PNV (Person Name Vocabulary) | Emic labels from sources | + +### 3.3 Persistence Philosophy + +| Aspect | PiCo | PPID | +|--------|------|------| +| **All identifiers persistent?** | Yes | No - only PID class | +| **Temporary identifiers?** | No explicit concept | Yes - ID class | +| **Promotion 
mechanism?** | N/A | ID → PID when criteria met | +| **Epistemic uncertainty?** | Implicit (multiple observations) | Explicit (ID vs PID distinction) | +| **Living persons?** | Can have PersonReconstruction | Must remain ID until death | + +### 3.4 Geographic Handling + +| Aspect | PiCo | PPID | +|--------|------|------| +| **In identifier?** | No | Yes | +| **In properties?** | Yes (birthPlace, deathPlace) | Also yes | +| **Format** | Free text or URI | ISO 3166-1/2 + GeoNames | +| **Historical mapping?** | Encouraged (link to thesaurus) | Required (historical → modern) | +| **Example** | `sdo:birthPlace "Haarlem"` | `...-NL-NH-HAA-...` | + +### 3.5 Temporal Handling + +| Aspect | PiCo | PPID | +|--------|------|------| +| **In identifier?** | No | Yes (century range) | +| **Date format** | ISO 8601 (xsd:date) | Century numbers | +| **BCE support** | Via negative years | Via negative centuries (-5--4) | +| **Precision** | Day-level possible | Century-level only in ID | +| **Example** | `sdo:birthDate "1860-03-31"^^xsd:date` | `...-19-20-...` | + +--- + +## 4. Key Differences Explained + +### 4.1 Why PiCo Uses Opaque UUIDs + +PiCo's design goals (from GitHub README): + +1. **Successor to A2A**: Designed to replace XML-based Archive-to-Archive standard +2. **Genealogical focus**: Primary use case is WieWasWie ancestor search +3. **Linked Data**: Interoperability via RDF, not human-readable identifiers +4. **Archive-centric**: Identifiers include archive code prefix + +PiCo's UUID approach is appropriate for: +- Massive genealogical databases (millions of records) +- Automated conversion from A2A +- Machine-to-machine data exchange + +### 4.2 Why PPID Uses Semantic Identifiers + +PPID's design goals: + +1. **GHCID alignment**: Consistent identifier philosophy across GLAM project +2. **Heritage sector focus**: Staff of heritage institutions, historical figures +3. **Human discovery**: Identifiers aid browsing and deduplication +4. 
**Epistemic honesty**: Explicit distinction between ID (uncertain) and PID (verified) +5. **Scholarly citation**: Identifiers can be meaningfully cited in publications + +PPID's semantic approach is appropriate for: +- Smaller, curated datasets +- Human curation workflows +- Cross-system deduplication +- Scholarly reference + +### 4.3 The ID/PID Distinction (Unique to PPID) + +PiCo assumes all identifiers are permanent once created. PPID introduces explicit epistemic states: + +``` +PiCo: + PersonObservation (always permanent) + ↓ prov:wasDerivedFrom + PersonReconstruction (always permanent) + +PPID: + Observation (separate system, permanent) + ↓ + ID (temporary, may change) + ↓ promotion when criteria met + PID (permanent, never changes) +``` + +**Why this matters for the heritage sector**: + +- **Living persons**: Cannot have verified death observation → must remain ID +- **Incomplete records**: May never have enough data for PID promotion +- **Ongoing research**: Archives not yet explored → cannot claim PID status +- **Scholarly integrity**: Prevents overclaiming certainty + +--- + +## 5. Integration Recommendations + +### 5.1 Adopt PiCo Ontological Distinctions + +PPID should use PiCo's class hierarchy: + +```turtle +@prefix ppid: . +@prefix pico: . + +# PPID extends PiCo +ppid:PersonID rdfs:subClassOf pico:PersonReconstruction . +ppid:PersonPID rdfs:subClassOf pico:PersonReconstruction . + +# PPID reconstructions link to source observations +ppid:hasSourceObservation rdfs:subPropertyOf prov:wasDerivedFrom ; + rdfs:range pico:PersonObservation . +``` + +### 5.2 Maintain PPID Semantic Identifier Format + +Do not adopt PiCo's opaque UUID format. Keep the semantic, GHCID-aligned format: + +``` +PID-NL-NH-AMS-NL-NH-HAA-19-20-JAN-BERG +``` + +**Rationale**: GHCID project-wide consistency, human discoverability, scholarly citation. + +### 5.3 Use PNV for Name Properties + +Adopt PiCo's use of Person Name Vocabulary for structured name data: + +```turtle +ppid:PID-... 
pnv:hasName [ + a pnv:PersonName ; + pnv:literalName "Jan van den Berg" ; + pnv:givenName "Jan" ; + pnv:surnamePrefix "van den" ; + pnv:baseSurname "Berg" +] . +``` + +### 5.4 Use PROV-O for Provenance + +Adopt PiCo's PROV-O patterns for reconstruction provenance: + +```turtle +ppid:PID-NL-NH-AMS-... + prov:wasDerivedFrom , ; + prov:wasGeneratedBy [ + a prov:Activity ; + prov:startedAtTime "2025-01-09T00:00:00"^^xsd:dateTime ; + prov:wasAssociatedWith ppid:curator-001 + ] . +``` + +### 5.5 Separate Observation Identifiers + +As noted in the revised PPID design, observations use a **different identifier system**: + +``` +{REPOSITORY_GHCID}/{CREATOR_GHCID}/{RiC-O-PATH} + +Example: +NL-NH-HAA-A-NHA/NL-NH-HAA-A-NHA/burgerlijke-stand/geboorten/1895/003/045 +``` + +This is distinct from PiCo's `{archive}:{uuid}` but serves similar purposes. + +--- + +## 6. Resolved Open Questions + +The open questions raised in the PPID-GHCID alignment document are resolved as follows: + +### 6.1 BCE Date Handling + +**Resolution**: Use negative century numbers. + +``` +Format: {first_century}-{last_century} + +Examples: +- 5th century BCE to 4th century BCE: "-5--4" +- 1st century BCE to 1st century CE: "-1-1" +- 5th century BCE to 3rd century CE: "-5-3" +``` + +This is analogous to the ISO 8601 expanded year representation, which uses signed years for BCE dates. Note that ISO 8601 uses astronomical year numbering (year 0000 is 1 BCE), whereas the century scheme above has no century 0. + +### 6.2 Non-Latin Script Transliteration + +**Resolution**: Apply the same transliteration rules as GHCID (documented in AGENTS.md). + +| Script | Standard | +|--------|----------| +| Cyrillic | ISO 9:1995 | +| Chinese | Hanyu Pinyin (ISO 7098) | +| Japanese | Modified Hepburn | +| Korean | Revised Romanization | +| Arabic | ISO 233-2/3 | +| Hebrew | ISO 259-3 | +| Greek | ISO 843 | + +### 6.3 Disputed Locations + +**Resolution**: Not a PPID concern - handled by ISO standardization. 
+ +When historical locations are disputed: +- Use the ISO-standardized modern location +- Document the dispute in observation metadata +- Do not encode uncertainty in the identifier itself + +### 6.4 Living Persons + +**Resolution**: Living persons are **always ID class** and can only be promoted to PID after death. + +```python +def can_promote_to_pid(person_id: str, observations: list) -> bool: + """ + Check if ID can be promoted to PID. + + Living persons can NEVER be promoted. + """ + # Check for death observation + death_obs = [o for o in observations if o.is_death_record or o.is_post_death] + + if not death_obs: + # No death observation = person may be alive = cannot be PID + return False + + # Continue with other promotion criteria... + return check_other_criteria(observations) +``` + +**Rationale**: +1. PID requires verified last observation (death) +2. Living persons have incomplete lifecycle +3. Future observations may change identity assessment +4. Privacy considerations for living individuals + +--- + +## 7. 
Implementation Alignment + +### 7.1 Class Mapping + +| PiCo Class | PPID Equivalent | Notes | +|------------|-----------------|-------| +| `pico:Person` | (Container) | Not used directly | +| `pico:PersonObservation` | Observation (separate system) | Different identifier format | +| `pico:PersonReconstruction` | `ppid:PersonID` or `ppid:PersonPID` | Split by epistemic certainty | +| `pico:Source` | `schema:ArchiveComponent` | Same as PiCo | +| `pnv:PersonName` | `pnv:PersonName` | Adopt PNV | + +### 7.2 Property Mapping + +| PiCo Property | PPID Usage | Notes | +|---------------|------------|-------| +| `prov:hadPrimarySource` | Same | For observations | +| `prov:wasDerivedFrom` | Same | ID/PID derived from observations | +| `prov:wasGeneratedBy` | Same | Activity provenance | +| `prov:wasRevisionOf` | Same | Version history | +| `sdo:birthDate` | Same | In properties | +| `sdo:birthPlace` | Same + in identifier | Dual representation | +| `sdo:deathDate` | Same | In properties | +| `sdo:deathPlace` | Same + in identifier | Dual representation | +| `pico:hasRole` | Same | For observations | +| `pico:hasAge` | Same | When birthDate unknown | + +### 7.3 Namespace Declarations + +```turtle +@prefix ppid: . +@prefix pico: . +@prefix pnv: . +@prefix prov: . +@prefix sdo: . +@prefix xsd: . +``` + +--- + +## 8. Conclusion + +### 8.1 What PPID Adopts from PiCo + +1. **PersonObservation/PersonReconstruction distinction** - Core ontological pattern +2. **PROV-O provenance model** - wasDerivedFrom, wasGeneratedBy, wasRevisionOf +3. **Person Name Vocabulary (PNV)** - Structured name representation +4. **Schema.org properties** - birthDate, deathDate, birthPlace, deathPlace, etc. +5. **Source linking** - hadPrimarySource, holdingArchive + +### 8.2 What PPID Does Differently + +1. **Semantic identifier format** - Geographic-temporal-emic instead of opaque UUID +2. **ID/PID epistemic distinction** - Explicit uncertainty modeling +3. **Living person handling** - Must remain ID until death +4. 
**GHCID alignment** - Consistent with heritage custodian identifier philosophy +5. **Century range encoding** - Temporal disambiguation in identifier +6. **Emic label tokens** - Name components in identifier for discoverability + +### 8.3 Interoperability Path + +PPID can be fully interoperable with PiCo systems via: + +1. **OWL mappings**: `ppid:PersonPID rdfs:subClassOf pico:PersonReconstruction` +2. **SPARQL federation**: Query across PPID and PiCo endpoints +3. **Bidirectional links**: `owl:sameAs` between PPID and PiCo identifiers +4. **Profile negotiation**: Serve data in PiCo format via content negotiation + +--- + +## 9. References + +### PiCo Resources +- PiCo Specification: https://personsincontext.org/model +- PiCo GitHub: https://github.com/CBG-Centrum-voor-familiegeschiedenis/PiCo +- Open Archives API: https://www.openarchieven.nl/api/docs/uri.php +- CBG: https://cbg.nl/ + +### Standards +- Person Name Vocabulary (PNV): https://w3id.org/pnv +- PROV-O: https://www.w3.org/TR/prov-o/ +- Schema.org: https://schema.org/ + +### Related PPID Documents +- [PPID-GHCID Alignment](./10_ppid_ghcid_alignment.md) +- [PiCo Ontology Analysis](./03_pico_ontology_analysis.md) +- [Identifier Structure Design](./05_identifier_structure_design.md)
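
A note on the negative-century format resolved in Section 6.1: because the range separator and the minus sign are both `-`, a value such as `-5--4` cannot be split naively on hyphens. The following is a minimal sketch of one way to parse such ranges, not part of the specification. The helper name `parse_century_range` is hypothetical, and rejecting a century `0` is an assumption inferred from the `-1-1` example.

```python
import re

# Hypothetical helper, not defined in the PPID documents.
# Each side of the range is an optionally negative 1-2 digit century number.
_CENTURY_RANGE = re.compile(r'(-?\d{1,2})-(-?\d{1,2})')


def parse_century_range(text: str) -> tuple[int, int]:
    """Parse a '{first_century}-{last_century}' string such as '-5--4'."""
    match = _CENTURY_RANGE.fullmatch(text)
    if match is None:
        raise ValueError(f"Not a century range: {text!r}")

    first, last = int(match.group(1)), int(match.group(2))

    # Assumption: the scheme jumps from -1 (1st century BCE) to 1 (1st century CE).
    if first == 0 or last == 0:
        raise ValueError("Century 0 is not defined in this scheme")
    if last < first:
        raise ValueError("Last century cannot be before first century")

    return first, last


print(parse_century_range("-5--4"))
print(parse_century_range("-1-1"))
print(parse_century_range("19-20"))
```

Matching the whole range with one pattern keeps the sign attached to its number, which a plain `split('-')` would lose.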