feat(merge): add script to merge PENDING files by matching emic names with existing files
Parent: 7f53ec6074 · Commit: 7ec4e05dd4
2 changed files with 492 additions and 117 deletions
@@ -15,11 +15,12 @@ This document proposes a **revised PPID structure** that aligns with GHCID's geo

| Aspect | Original (Doc 05) | Revised (This Document) |
|--------|-------------------|-------------------------|
| **Format** | Opaque hex (`POID-7a3b-c4d5-...`) | Semantic (`PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG`) |
| **Type Distinction** | POID vs PRID | ID (temporary) vs PID (persistent) |
| **Geographic** | None in identifier | Dual anchors: first + last observation location |
| **Temporal** | None in identifier | ISO 8601 dates with variable precision (YYYY-MM-DD / YYYY-MM / YYYY) |
| **Name** | None in identifier | First + last token of emic label (skip particles) |
| **Delimiters** | Single type | Hierarchical: `_` (major groups) + `-` (within groups) |
| **Persistence** | Always persistent | May remain ID indefinitely |

### 1.2 Design Philosophy
@@ -111,19 +112,27 @@ This is acceptable and expected. An ID is still a valid identifier for internal

### 3.1 Full Format Specification

```
{TYPE}_{FL}_{FD}_{LL}_{LD}_{NT}[-{FULL_EMIC}]
   │    │    │    │    │    │        │
   │    │    │    │    │    │        └── Collision suffix (optional, snake_case)
   │    │    │    │    │    └── Name Tokens (FIRST-LAST, hyphen-joined)
   │    │    │    │    └── Last observation Date (ISO 8601: YYYY-MM-DD or reduced)
   │    │    │    └── Last observation Location (CC-RR-PPP, hyphen-joined)
   │    │    └── First observation Date (ISO 8601: YYYY-MM-DD or reduced)
   │    └── First observation Location (CC-RR-PPP, hyphen-joined)
   └── Type: ID or PID

Delimiters:
- Underscore (_) = Major delimiter between logical groups
- Hyphen (-) = Minor delimiter within groups
```

**Expanded with all components:**

```
{TYPE}_{FC-FR-FP}_{FD}_{LC-LR-LP}_{LD}_{FT-LT}[-{full_emic_label}]

Example: PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG
```
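The six underscore-delimited groups above can be pulled apart with a plain string split. A minimal illustrative sketch (the function and field names here are ours, not part of the specification; input is assumed to carry no collision suffix):

```python
# Minimal illustrative parser for the PPID layout above (assumption: no
# collision suffix; field names mirror the diagram, not a defined API)
def split_ppid(ppid: str) -> dict:
    type_, fl, fd, ll, ld, nt = ppid.split("_")[:6]
    return {
        "type": type_,                    # ID or PID
        "first_location": fl.split("-"),  # [CC, RR, PPP]
        "first_date": fd,                 # ISO 8601, variable precision
        "last_location": ll.split("-"),   # [CC, RR, PPP]
        "last_date": ld,
        "name_tokens": nt.split("-"),     # [FIRST, LAST]
    }

parts = split_ppid("PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG")
# parts["first_location"] → ['NL', 'NH', 'AMS']
```

Because the major delimiter `_` never appears inside a group, even BCE dates such as `-0470` split cleanly.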

### 3.2 Component Definitions
@@ -131,31 +140,58 @@ This is acceptable and expected. An ID is still a valid identifier for internal

| Component | Format | Description | Example |
|-----------|--------|-------------|---------|
| **TYPE** | `ID` or `PID` | Identifier class | `PID` |
| **FL** | `CC-RR-PPP` | First observation Location | `NL-NH-AMS` |
| → FC | ISO 3166-1 α2 | Country code | `NL` |
| → FR | ISO 3166-2 suffix | Region code | `NH` |
| → FP | 3 letters | Place code (GeoNames) | `AMS` |
| **FD** | ISO 8601 | First observation Date | `1895-03-15` or `1895` |
| **LL** | `CC-RR-PPP` | Last observation Location | `NL-NH-HAA` |
| → LC | ISO 3166-1 α2 | Country code | `NL` |
| → LR | ISO 3166-2 suffix | Region code | `NH` |
| → LP | 3 letters | Place code (GeoNames) | `HAA` |
| **LD** | ISO 8601 | Last observation Date | `1970-08-22` or `1970` |
| **NT** | `FIRST-LAST` | Name Tokens (emic label) | `JAN-BERG` |
| → FT | UPPERCASE | First token | `JAN` |
| → LT | UPPERCASE | Last token (skip particles) | `BERG` |
| **FULL_EMIC** | snake_case | Full emic label (collision only) | `jan_van_den_berg` |

### 3.3 ISO 8601 Date Precision Levels

Dates use ISO 8601 format with **variable precision** based on what can be verified:

| Precision | Format | Example | When to Use |
|-----------|--------|---------|-------------|
| **Day** | `YYYY-MM-DD` | `1895-03-15` | Birth/death certificate with exact date |
| **Month** | `YYYY-MM` | `1895-03` | Record states month but not day |
| **Year** | `YYYY` | `1895` | Only year known (common for historical figures) |
| **Unknown** | `XXXX` | `XXXX` | Cannot determine; identifier remains ID class |

**BCE Dates** (negative years per ISO 8601 extended):

| Year | ISO 8601 | Example PPID |
|------|----------|--------------|
| 469 BCE | `-0469` | `PID_GR-AT-ATH_-0469_GR-AT-ATH_-0399_SOCRATES-` |
| 44 BCE | `-0044` | `PID_IT-RM-ROM_-0100_IT-RM-ROM_-0044_GAIUS-CAESAR` |

**No Time Components**: Hours, minutes, and seconds are never included (impractical for historical persons).
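The precision level of a date field can be read off its shape. A small sketch, assuming the four levels in the table above (`date_precision` is an illustrative helper, not a name from this specification):

```python
# Classify a PPID date field into its precision level (hypothetical helper,
# not part of the specification)
def date_precision(date: str) -> str:
    if date == "XXXX":
        return "unknown"
    # Strip a leading BCE sign so it is not counted as a separator
    year_part = date[1:] if date.startswith("-") else date
    dashes = year_part.count("-")
    return {0: "year", 1: "month", 2: "day"}[dashes]

date_precision("1895-03-15")  # → 'day'
date_precision("1895-03")     # → 'month'
date_precision("-0469")       # → 'year' (BCE)
date_precision("XXXX")        # → 'unknown'
```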

### 3.4 Examples

| Person | Full Emic Label | PPID |
|--------|-----------------|------|
| Jan van den Berg (1895-03-15 → 1970-08-22) | Jan van den Berg | `PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG` |
| Rembrandt (1606-07-15 → 1669-10-04) | Rembrandt van Rijn | `PID_NL-ZH-LEI_1606-07-15_NL-NH-AMS_1669-10-04_REMBRANDT-RIJN` |
| Maria Sibylla Merian (1647 → 1717) | Maria Sibylla Merian | `PID_DE-HE-FRA_1647_NL-NH-AMS_1717_MARIA-MERIAN` |
| Socrates (c. 470 BCE → 399 BCE) | Σωκράτης (Sōkrátēs) | `PID_GR-AT-ATH_-0470_GR-AT-ATH_-0399_SOCRATES-` |
| Julius Caesar (100 BCE → 44 BCE) | Gaius Iulius Caesar | `PID_IT-RM-ROM_-0100_IT-RM-ROM_-0044_GAIUS-CAESAR` |
| Unknown soldier (? → 1944-06-06) | (unknown) | `ID_XX-XX-XXX_XXXX_FR-NM-OMH_1944-06-06_UNKNOWN-` |
| Henry VIII (1491-06-28 → 1547-01-28) | Henry VIII | `PID_GB-ENG-LON_1491-06-28_GB-ENG-LON_1547-01-28_HENRY-VIII` |
| Vincent van Gogh (1853-03-30 → 1890-07-29) | Vincent Willem van Gogh | `PID_NL-NB-ZUN_1853-03-30_FR-IDF-AUV_1890-07-29_VINCENT-GOGH` |

**Notes on Emic Labels**:
- Always use **formal/complete emic names** from primary sources, not modern colloquial short forms
- "Rembrandt" alone is a modern convention; the emic label from his lifetime was "Rembrandt van Rijn"
- **Tussenvoegsels (particles)** like "van", "de", "den", "der", "van de", "van den", "van der" are **skipped** when extracting the last token (see §4.5)
- Non-Latin names are transliterated following GHCID transliteration standards (see AGENTS.md)
- This follows the same pattern as GHCID abbreviation rules (AGENTS.md Rule 8)
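The particle-skipping rule can be sketched as follows. The particle set comes from the note above (the multi-word forms "van de", "van den", "van der" reduce to the single-word members once the label is split on whitespace); the helper name and the simple backwards scan are illustrative, and the normative rule lives in §4.5:

```python
# Dutch particles (tussenvoegsels) skipped when picking the last token;
# set taken from the note above, helper name is illustrative
PARTICLES = {"van", "de", "den", "der"}

def name_tokens(emic_label: str) -> tuple[str, str]:
    words = emic_label.split()
    first = words[0].upper()
    # Walk from the end past any particles to find the last real token;
    # a mononym yields an empty last token
    last = ""
    for w in reversed(words[1:]):
        if w.lower() not in PARTICLES:
            last = w.upper()
            break
    return first, last

name_tokens("Jan van den Berg")    # → ('JAN', 'BERG')
name_tokens("Rembrandt van Rijn")  # → ('REMBRANDT', 'RIJN')
name_tokens("Rembrandt")           # → ('REMBRANDT', '') — mononym
```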

---
@@ -787,51 +823,123 @@ def detect_collision(new_ppid: str, existing_ppids: Set[str]) -> bool:
    return False


def get_base_ppid(ppid: str) -> str:
    """Extract base PPID without collision suffix.

    New format uses underscore as the major delimiter:
        PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG-jan_van_den_berg
                                                               ↑ collision suffix

    Base PPID has exactly 6 underscore-delimited parts:
        TYPE_FL_FD_LL_LD_NT
    """
    # Split by underscore (major delimiter)
    parts = ppid.split('_')

    # Standard PPID has 6 parts: TYPE, FL, FD, LL, LD, NT
    if len(parts) < 6:
        return ppid  # Invalid format

    # A snake_case collision suffix contains underscores of its own, so after
    # splitting by '_' it shows up as extra parts beyond the sixth
    if len(parts) > 6:
        return '_'.join(parts[:6])

    # The suffix may also be hyphen-appended inside the NT part without any
    # internal underscore, e.g. "JAN-BERG-suffix"; detect it by case
    name_tokens_part = parts[5]
    if '-' in name_tokens_part:
        nt_parts = name_tokens_part.split('-')
        # First two hyphen parts are the name tokens; a lowercase third part
        # indicates a collision suffix
        if len(nt_parts) > 2 and nt_parts[2].islower():
            parts[5] = '-'.join(nt_parts[:2])
            return '_'.join(parts[:6])

    return ppid
```

### 5.2 Collision Resolution Strategy

Collisions are resolved through a **three-tier escalation** strategy:

1. **Tier 1**: Append full emic label in snake_case
2. **Tier 2**: If it still collides, add an 8-character hash discriminator
3. **Tier 3**: If it still collides (virtually impossible), add a timestamp-based discriminator

```python
import hashlib
import secrets
from datetime import datetime, timezone
from typing import Optional, Set


def resolve_collision(
    base_ppid: str,
    full_emic_label: str,
    existing_ppids: Set[str],
    distinguishing_data: Optional[dict] = None
) -> str:
    """
    Resolve collision using the three-tier escalation strategy.

    Args:
        base_ppid: The base PPID without collision suffix
        full_emic_label: The person's full emic name
        existing_ppids: Set of existing PPIDs to check against
        distinguishing_data: Optional dict with additional data for hashing
            (e.g., occupation, parent names, source document ID)

    Example:
        Base: "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG"
        Emic: "Jan van den Berg"

        Tier 1 Result: "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG-jan_van_den_berg"
        Tier 2 Result: "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG-jan_van_den_berg-a7b3c2d1"
        Tier 3 Result: "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG-jan_van_den_berg-20250109143022"
    """
    # Tier 1: Full emic label suffix
    emic_suffix = generate_collision_suffix(full_emic_label)
    tier1_ppid = f"{base_ppid}-{emic_suffix}"

    if tier1_ppid not in existing_ppids:
        return tier1_ppid

    # Tier 2: 8-hex-character discriminator; deterministic when
    # distinguishing data is available, otherwise random
    if distinguishing_data:
        hash_input = f"{tier1_ppid}|{sorted(distinguishing_data.items())}"
        hash_bytes = hashlib.sha256(hash_input.encode()).digest()
        discriminator = hash_bytes[:4].hex()  # 8 hex characters
    else:
        # Fallback: cryptographically secure random
        discriminator = secrets.token_hex(4)  # 8 hex characters

    tier2_ppid = f"{tier1_ppid}-{discriminator}"

    if tier2_ppid not in existing_ppids:
        return tier2_ppid

    # Tier 3: Timestamp-based (virtually impossible to reach, but provides safety)
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    tier3_ppid = f"{tier1_ppid}-{timestamp}"

    # Final fallback: add microseconds if still colliding
    while tier3_ppid in existing_ppids:
        timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S%f")
        tier3_ppid = f"{tier1_ppid}-{timestamp}"

    return tier3_ppid


def generate_collision_suffix(full_emic_label: str) -> str:
    """

@@ -870,6 +978,48 @@ def generate_collision_suffix(full_emic_label: str) -> str:
    return final
```

### 5.3 Distinguishing Data for Tier 2 Hash

When two persons have identical base PPID and emic label, use **distinguishing data** to generate a deterministic hash:

| Priority | Distinguishing Data | Example |
|----------|---------------------|---------|
| 1 | Source document ID | `"NL-NH-HAA/BS/Geb/1895/123"` |
| 2 | Parent names | `"father:Pieter_van_den_Berg"` |
| 3 | Occupation | `"occupation:timmerman"` |
| 4 | Spouse name | `"spouse:Maria_Jansen"` |
| 5 | Unique claim from observation | Any distinguishing fact |

```python
# Example: Two "Jan van den Berg" born same day, same place
distinguishing_data_person_1 = {
    "source_document": "NL-NH-HAA/BS/Geb/1895/123",
    "father_name": "Pieter van den Berg",
    "occupation": "timmerman"
}

distinguishing_data_person_2 = {
    "source_document": "NL-NH-HAA/BS/Geb/1895/456",
    "father_name": "Hendrik van den Berg",
    "occupation": "bakker"
}

# Results in different deterministic hashes:
# Person 1: PID_NL-NH-AMS_1895-03-15_..._JAN-BERG-jan_van_den_berg-a7b3c2d1
# Person 2: PID_NL-NH-AMS_1895-03-15_..._JAN-BERG-jan_van_den_berg-f2e8d4a9
```
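The deterministic Tier 2 discriminator for two such records can be reproduced standalone. A sketch following the hashing scheme in §5.2 (the hex digits are whatever SHA-256 yields, so the values `a7b3c2d1`/`f2e8d4a9` shown above are illustrative):

```python
import hashlib

# Deterministic Tier 2 discriminator, following the scheme in section 5.2
def tier2_discriminator(tier1_ppid: str, distinguishing_data: dict) -> str:
    hash_input = f"{tier1_ppid}|{sorted(distinguishing_data.items())}"
    return hashlib.sha256(hash_input.encode()).digest()[:4].hex()

tier1 = "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG-jan_van_den_berg"
d1 = tier2_discriminator(tier1, {"source_document": "NL-NH-HAA/BS/Geb/1895/123"})
d2 = tier2_discriminator(tier1, {"source_document": "NL-NH-HAA/BS/Geb/1895/456"})
assert d1 != d2 and len(d1) == 8  # different inputs, different 8-hex discriminators
```

Because the hash input is fully determined by the PPID and the sorted distinguishing data, re-running the assignment always reproduces the same discriminator.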

### 5.4 Collision Probability Analysis

| Tier | Collision Probability | When Triggered |
|------|----------------------|----------------|
| **Base PPID** | ~1/10,000 for common names | Same location, date, name tokens |
| **Tier 1** (+emic) | ~1/1,000,000 | Same full emic label |
| **Tier 2** (+hash) | ~1/4.3 billion | Same emic AND no distinguishing data |
| **Tier 3** (+time) | ~0 | Cryptographic failure |

**Practical Impact**: For a dataset of 10 million persons, expected Tier 2 collisions ≈ 0.002 (effectively zero).

---
## 6. Unknown Components: XX and XXX Placeholders

@@ -886,16 +1036,17 @@ Unlike GHCID (where `XX`/`XXX` are temporary and require research), PPID may hav

| Unknown death country | `XX` | No (remains ID) |
| Unknown death region | `XX` | No (remains ID) |
| Unknown death place | `XXX` | No (remains ID) |
| Unknown date | `XXXX` | No (remains ID) |
| Unknown first token | `UNKNOWN` | No (remains ID) |
| Unknown last token | (empty after hyphen) | Yes (if mononym) |

### 6.2 ID Examples with Unknown Components

```
ID_XX-XX-XXX_XXXX_FR-NM-OMH_1944-06-06_UNKNOWN-     # Unknown soldier, Normandy
ID_NL-NH-AMS_1606_XX-XX-XXX_XXXX_REMBRANDT-         # Rembrandt, death unknown (hypothetical)
ID_XX-XX-XXX_XXXX_XX-XX-XXX_XXXX_ANONYMOUS-         # Completely unknown person
ID_NL-ZH-LEI_1606-07_NL-NH-AMS_1669_REMBRANDT-RIJN  # Birth month known, only year for death
```
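Whether an identifier may carry the `PID` class can be checked mechanically: any placeholder component (`XX`, `XXX`, `XXXX`) keeps it in the `ID` class, while an empty last name token (mononym) is still allowed per §6.1. A sketch, with an illustrative helper name:

```python
# An identifier qualifies for the PID class only when no location or date
# component is a placeholder (helper name is illustrative, not from the spec);
# an empty last name token is permitted for mononyms per section 6.1
def may_be_pid(ppid: str) -> bool:
    _type, fl, fd, ll, ld, _nt = ppid.split("_")[:6]
    for location in (fl, ll):
        if "XX" in location.split("-") or "XXX" in location.split("-"):
            return False
    if fd == "XXXX" or ld == "XXXX":
        return False
    return True

may_be_pid("PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG")  # → True
may_be_pid("ID_XX-XX-XXX_XXXX_FR-NM-OMH_1944-06-06_UNKNOWN-")         # → False
```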

---

@@ -908,7 +1059,7 @@ Every PPID generates three representations:

| Format | Purpose | Example |
|--------|---------|---------|
| **Semantic String** | Human-readable | `PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG` |
| **UUID v5** | Linked data, URIs | `550e8400-e29b-41d4-a716-446655440000` |
| **Numeric (64-bit)** | Database keys, CSV | `213324328442227739` |
@@ -927,7 +1078,7 @@ def generate_ppid_identifiers(semantic_ppid: str) -> dict:

    Returns:
        {
            'semantic': 'PID_NL-NH-AMS_1895-03-15_...',
            'uuid_v5': '550e8400-...',
            'numeric': 213324328442227739
        }

@@ -947,10 +1098,10 @@ def generate_ppid_identifiers(semantic_ppid: str) -> dict:

# Example:
ppid = "PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG"
identifiers = generate_ppid_identifiers(ppid)
# {
#     'semantic': 'PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG',
#     'uuid_v5': 'a1b2c3d4-e5f6-5a1b-9c2d-3e4f5a6b7c8d',
#     'numeric': 1234567890123456789
# }
@@ -1033,17 +1184,18 @@ If this revised structure is adopted:

### 10.1 Character Set and Length

```python
# Component lengths
MAX_COUNTRY_CODE = 2       # ISO 3166-1 alpha-2
MAX_REGION_CODE = 3        # ISO 3166-2 suffix (some are 3 chars)
MAX_PLACE_CODE = 3         # GeoNames convention
MAX_DATE = 10              # YYYY-MM-DD (ISO 8601)
MAX_TOKEN_LENGTH = 20      # Reasonable limit for names
MAX_COLLISION_SUFFIX = 50  # Full emic label in snake_case

# Example PPID structure (without collision suffix):
#   PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG
#   3 +1+ 10 +1+ 10 +1+ 10 +1+ 10 +1+ 20
#   ≈ 70 characters maximum

# With collision suffix: ~120 characters max
```
@@ -1053,18 +1205,24 @@ MAX_COLLISION_SUFFIX = 50  # Full emic label in snake_case

```python
import re

# Location pattern: CC-RR-PPP (country-region-place)
LOCATION_PATTERN = r'([A-Z]{2}|XX)-([A-Z]{2,3}|XX)-([A-Z]{3}|XXX)'

# Date pattern: YYYY-MM-DD or YYYY-MM or YYYY or XXXX (including BCE with leading -)
DATE_PATTERN = r'(-?\d{4}(?:-\d{2}(?:-\d{2})?)?|XXXX)'

# Name tokens pattern: FIRST-LAST or FIRST- (for mononyms)
NAME_PATTERN = r'([A-Z0-9]+)-([A-Z0-9]*)'

# Full PPID pattern
PPID_PATTERN = re.compile(
    r'^(ID|PID)'               # Type
    r'_' + LOCATION_PATTERN +  # First location (underscore + CC-RR-PPP)
    r'_' + DATE_PATTERN +      # First date (underscore + ISO 8601)
    r'_' + LOCATION_PATTERN +  # Last location (underscore + CC-RR-PPP)
    r'_' + DATE_PATTERN +      # Last date (underscore + ISO 8601)
    r'_' + NAME_PATTERN +      # Name tokens (underscore + FIRST-LAST)
    r'(-[a-z0-9_]+)?$'         # Collision suffix (optional, hyphen + snake_case)
)

def validate_ppid(ppid: str) -> tuple[bool, str]:
@@ -1072,23 +1230,36 @@ def validate_ppid(ppid: str) -> tuple[bool, str]:
    if not PPID_PATTERN.match(ppid):
        return False, "Invalid PPID format"

    # Additional semantic validation
    # Split by major delimiter (underscore)
    parts = ppid.split('_')

    if len(parts) < 6:
        return False, "Incomplete PPID - requires 6 underscore-delimited parts"

    # Extract dates for validation
    first_date = parts[2]
    last_date = parts[4]

    # Date ordering validation (if both are known)
    if first_date != 'XXXX' and last_date != 'XXXX':
        # Parse years (handle BCE with leading -)
        try:
            first_year = int(first_date.split('-')[0]) if not first_date.startswith('-') else -int(first_date.split('-')[1])
            last_year = int(last_date.split('-')[0]) if not last_date.startswith('-') else -int(last_date.split('-')[1])

            if last_year < first_year:
                return False, "Last observation date cannot be before first observation date"
        except (ValueError, IndexError):
            pass  # Invalid date format caught by regex

    return True, "Valid"


# Example validations:
assert validate_ppid("PID_NL-NH-AMS_1895-03-15_NL-NH-HAA_1970-08-22_JAN-BERG")[0]
assert validate_ppid("PID_GR-AT-ATH_-0470_GR-AT-ATH_-0399_SOCRATES-")[0]
assert validate_ppid("ID_XX-XX-XXX_XXXX_FR-NM-OMH_1944-06-06_UNKNOWN-")[0]
assert not validate_ppid("PID_NL-NH-AMS_1970_NL-NH-HAA_1895_JAN-BERG")[0]  # Dates reversed
```

---
@@ -1097,39 +1268,42 @@ def validate_ppid(ppid: str) -> tuple[bool, str]:

### 11.1 BCE Dates

How to handle persons from before the Common Era?
**RESOLVED**: Use ISO 8601 extended format with negative years.

- `-0469` for 469 BCE
- `-0044` for 44 BCE
- Examples in sections 3.3 and 3.4

### 11.2 Non-Latin Name Tokens

How to handle names in non-Latin scripts?
**RESOLVED**: Apply the same transliteration rules as GHCID (see AGENTS.md).

| Script | Standard |
|--------|----------|
| Cyrillic | ISO 9:1995 |
| Chinese | Hanyu Pinyin (ISO 7098) |
| Japanese | Modified Hepburn |
| Korean | Revised Romanization |
| Arabic | ISO 233-2/3 |

### 11.3 Disputed Locations

What if birth/death locations are historically disputed?

**RESOLVED**: Not a PPID concern; handled by ISO standardization. Use modern ISO-standardized location codes and document disputes in observation metadata.

### 11.4 Living Persons

How to handle persons still alive (no death observation)?
**RESOLVED**: Living persons are **always ID class** and can only be promoted to PID after death.

- Living persons have no verified last observation (death date/location)
- Use `XXXX` for unknown death date and `XX-XX-XXX` for unknown death location
- Example: `ID_NL-NH-AMS_1985-06-15_XX-XX-XXX_XXXX_JAN-BERG`
- Can be promoted to PID only after the death observation is verified

**Rationale**:
1. PID requires a verified last observation (death)
2. Living persons have incomplete lifecycle data
3. Future observations may change identity assessment
4. Privacy considerations for living individuals

---
@@ -1147,4 +1321,5 @@ How to handle persons still alive (no death observation)?

### Standards
- ISO 3166-1: Country codes
- ISO 3166-2: Subdivision codes
- ISO 8601: Date and time format (including BCE with negative years)
- GeoNames: Geographic names database

scripts/merge_pending_by_name.py (new executable file, 200 lines)

@@ -0,0 +1,200 @@
#!/usr/bin/env python3
"""
Find and merge PENDING files that have matching emic names with existing files.

This script:
1. Scans all existing custodian files to build a name -> file mapping
2. Scans all PENDING files to find matches
3. Merges staff data from PENDING into existing files
4. Archives merged PENDING files

Usage:
    python scripts/merge_pending_by_name.py --dry-run  # Preview
    python scripts/merge_pending_by_name.py            # Apply
"""

import argparse
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, Optional

import yaml


def load_yaml_fast(filepath: Path) -> Optional[Dict]:
    """Load YAML file, return None on error."""
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            return yaml.safe_load(f)
    except Exception:
        return None


def save_yaml(filepath: Path, data: Dict):
    """Save YAML file."""
    with open(filepath, 'w', encoding='utf-8') as f:
        yaml.dump(data, f, allow_unicode=True, default_flow_style=False,
                  sort_keys=False, width=120)


def normalize_name(name: str) -> str:
    """Normalize name for matching."""
    if not name:
        return ""
    return name.lower().strip()


def merge_staff(source_data: Dict, target_data: Dict, source_name: str) -> int:
    """Merge staff from source into target. Returns count of staff added."""
    if 'staff' not in source_data:
        return 0

    source_staff = source_data['staff']
    staff_list = source_staff.get('staff_list', [])

    if not staff_list:
        return 0

    # Skip if target already has staff
    if 'staff' in target_data and target_data['staff'].get('staff_list'):
        return 0

    # Add staff section
    target_data['staff'] = {
        'provenance': source_staff.get('provenance', {}),
        'staff_list': staff_list
    }

    # Add provenance note
    if 'provenance' not in target_data:
        target_data['provenance'] = {}
    notes = target_data['provenance'].get('notes', [])
    if isinstance(notes, str):
        notes = [notes]
    notes.append(f"Staff data merged from {source_name} on {datetime.now(timezone.utc).isoformat()}")
    target_data['provenance']['notes'] = notes

    return len(staff_list)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--dry-run', action='store_true')
    parser.add_argument('--custodian-dir', type=Path,
                        default=Path('/Users/kempersc/apps/glam/data/custodian'))
    args = parser.parse_args()

    custodian_dir = args.custodian_dir
    archive_dir = custodian_dir / 'archive' / 'pending_merged_20250109'

    print("=" * 80)
    print("MERGING PENDING FILES BY NAME MATCH")
    print("=" * 80)
    print(f"Mode: {'DRY RUN' if args.dry_run else 'LIVE'}")
    print()

    # Step 1: Build name -> file mapping for existing files
    print("Building name index from existing files...")
    existing_by_name = {}

    for f in custodian_dir.glob('[A-Z][A-Z]-[A-Z][A-Z]-*.yaml'):
        if 'PENDING' in f.name or 'archive' in str(f):
            continue
        data = load_yaml_fast(f)
        if data:
            name = data.get('custodian_name', {}).get('emic_name', '')
            if name:
                normalized = normalize_name(name)
                existing_by_name[normalized] = (f, data)

    print(f"  Indexed {len(existing_by_name)} existing files")

    # Step 2: Find matching PENDING files
    print("\nScanning PENDING files for matches...")
    matches = []
    no_matches = []

    for f in sorted(custodian_dir.glob('*-XX-XXX-PENDING-*.yaml')):
        if 'archive' in str(f):
            continue
        data = load_yaml_fast(f)
        if data:
            name = data.get('custodian_name', {}).get('emic_name', '')
            normalized = normalize_name(name)
            staff_count = len(data.get('staff', {}).get('staff_list', []))

            if normalized in existing_by_name:
                existing_file, existing_data = existing_by_name[normalized]
                matches.append({
                    'pending_file': f,
                    'pending_data': data,
                    'existing_file': existing_file,
                    'existing_data': existing_data,
                    'name': name,
                    'staff_count': staff_count
                })
            else:
                no_matches.append({
                    'file': f,
                    'name': name,
                    'staff_count': staff_count
                })

    print(f"  Found {len(matches)} PENDING files with matching existing files")
    print(f"  Found {len(no_matches)} PENDING files without matches")

    # Step 3: Merge matches (counters initialized here so the summary below
    # is correct even when no matches were found)
    total_staff = 0
    merged_count = 0
    skipped_count = 0

    if matches:
        print("\n" + "=" * 80)
        print("MERGING MATCHED FILES")
        print("=" * 80)

        if not args.dry_run:
            archive_dir.mkdir(parents=True, exist_ok=True)

        for m in matches:
            pending_file = m['pending_file']
            existing_file = m['existing_file']
            pending_data = m['pending_data']
            existing_data = m['existing_data']

            # Skip if the existing file already has staff
            existing_staff = len(existing_data.get('staff', {}).get('staff_list', []))
            if existing_staff > 0:
                skipped_count += 1
                continue

            # Skip if the PENDING file has no staff to contribute
            staff_added = m['staff_count']
            if staff_added == 0:
                skipped_count += 1
                continue

            print(f"\n[{'DRY RUN' if args.dry_run else 'MERGE'}] {m['name'][:50]}")
            print(f"  From: {pending_file.name}")
            print(f"  To:   {existing_file.name}")
            print(f"  Staff: {staff_added}")

            if not args.dry_run:
                # Merge staff and persist the target file
                merge_staff(pending_data, existing_data, pending_file.name)
                save_yaml(existing_file, existing_data)

                # Move PENDING to archive
                shutil.move(str(pending_file), str(archive_dir / pending_file.name))

            total_staff += staff_added
            merged_count += 1

    print("\n" + "=" * 80)
    print("SUMMARY")
    print("=" * 80)
    print(f"Files merged: {merged_count}")
    print(f"Files skipped (already has staff or no staff): {skipped_count}")
    print(f"Total staff added: {total_staff}")
    print(f"Unmatched PENDING files remaining: {len(no_matches)}")

    if not args.dry_run:
        print(f"\nArchived to: {archive_dir}")


if __name__ == '__main__':
    main()