glam/data/entity_annotation/modules/integrations/pico/schema/temporal.yaml
kempersc 505c12601a Add test script for PiCo extraction from Arabic waqf documents
- Implemented a new script `test_pico_arabic_waqf.py` to test the GLM annotator's ability to extract person observations from Arabic historical documents.
- The script includes environment variable handling for API token, structured prompts for the GLM API, and validation of extraction results.
- Added comprehensive logging for API responses, extraction results, and validation errors.
- Included a sample Arabic waqf text for testing purposes, following the PiCo ontology pattern.
2025-12-12 17:50:17 +01:00

# =============================================================================
# PiCo Integration Module: Temporal Patterns & Calendar Systems
# =============================================================================
# Part of: data/entity_annotation/modules/integrations/pico/
# Parent: _index.yaml
#
# Description: Temporal expression handling, calendar systems, date normalization,
# and PROV-O provenance model for tracking observation/reconstruction
# activities.
#
# Last Updated: 2025-12-12
# =============================================================================
# -----------------------------------------------------------------------------
# Calendar Systems
# -----------------------------------------------------------------------------
# Historical documents use various calendar systems. This section defines
# how to handle and normalize dates from different calendrical traditions.
calendar_systems:
description: |
Historical sources use diverse calendar systems depending on culture,
religion, and time period. Proper extraction requires:
1. Identifying the source calendar
2. Preserving the original date expression
3. Providing normalized ISO 8601 equivalents where possible
supported_calendars:
gregorian:
id: "gregorian"
label: "Gregorian Calendar"
uri: "https://www.wikidata.org/wiki/Q12138"
description: |
The civil calendar now used almost worldwide. Introduced in 1582 in
Catholic countries and adopted later in Protestant and Orthodox countries.
adoption_dates:
catholic: "1582-10-15"
protestant: "1700-03-01"
british_empire: "1752-09-14"
russia: "1918-02-14"
greece: "1923-03-01"
usage_notes: |
- Default for modern documents
- Used in civil registrations after adoption
- Standard for ISO 8601 normalization
example:
original: "15 October 1582"
normalized: "1582-10-15"
julian:
id: "julian"
label: "Julian Calendar"
uri: "https://www.wikidata.org/wiki/Q11184"
description: |
Calendar introduced by Julius Caesar in 45 BCE. Used in Europe
until the Gregorian reform, and by some Eastern Orthodox churches today.
offset_from_gregorian:
16th_century: 10
17th_century: 10
18th_century: 11
19th_century: 12
20th_century: 13
21st_century: 13
usage_notes: |
- Greek Orthodox Church records use Julian calendar
- Russian Empire used Julian until 1918
- Dual dating common in transition periods
- Format: "Julian date / Gregorian date" or "O.S./N.S." notation
example:
original: "14 March 1875 (O.S.)"
gregorian_equivalent: "26 March 1875"
normalized: "1875-03-26"
note: "Greek Orthodox used Julian; Gregorian equivalent calculated"
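# Worked sketch (illustrative only, not part of the schema): applying the
# offset_from_gregorian table to an Old Style date. The offset grows by one
# day after the Julian 29 February of 1700, 1800 and 1900, so the century
# labels are approximate at those boundaries. Key names are hypothetical.
#   original: "25 December 1917 (O.S.)"
#   offset_days: 13            # 20th_century row of the table
#   gregorian_equivalent: "7 January 1918"
#   normalized: "1918-01-07"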
hijri:
id: "hijri"
label: "Islamic/Hijri Calendar"
uri: "https://www.wikidata.org/wiki/Q28892"
alternative_names:
- "Islamic Calendar"
- "Muslim Calendar"
- "Lunar Hijri"
- "Anno Hegirae (AH)"
description: |
Lunar calendar used in Islamic societies. Year 1 = 622 CE (Hijra).
354 or 355 days per year (12 lunar months).
months:
1: "Muharram"
2: "Safar"
3: "Rabi' al-Awwal"
4: "Rabi' al-Thani"
5: "Jumada al-Awwal"
6: "Jumada al-Thani"
7: "Rajab"
8: "Sha'ban"
9: "Ramadan"
10: "Shawwal"
11: "Dhu al-Qa'dah"
12: "Dhu al-Hijjah"
usage_notes: |
- Ottoman Empire, Waqf documents, Sijill records
- Year conversion: Gregorian = (Hijri * 0.97) + 622 (approx)
- Month-level precision often sufficient
- Some documents use both Hijri and local calendars
example:
original: "month of Rajab, year 1225 Hijri"
normalized: "1810-08"
note: "Approximate month - exact day unknown"
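# Worked sketch (illustrative only, not part of the schema): the year formula
# from usage_notes applied to a Hijri year. A Hijri year is about 11 days
# shorter than a Gregorian year and usually straddles two Gregorian years, so
# the result is an estimate. Key names are hypothetical.
#   hijri_year: 1225
#   calculation: "(1225 * 0.97) + 622 = 1810.25"
#   approximate_gregorian_year: 1810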
hebrew:
id: "hebrew"
label: "Hebrew Calendar"
uri: "https://www.wikidata.org/wiki/Q9644"
alternative_names:
- "Jewish Calendar"
- "Anno Mundi"
description: |
Lunisolar calendar used in Jewish religious and civil life.
Year 1 = 3761 BCE (traditional Creation date).
months:
1: "Nisan"
2: "Iyar"
3: "Sivan"
4: "Tammuz"
5: "Av"
6: "Elul"
7: "Tishrei"
8: "Cheshvan"
9: "Kislev"
10: "Tevet"
11: "Shevat"
12: "Adar"
usage_notes: |
- Ketubot (marriage contracts)
- Get (divorce documents)
- Synagogue records
- Year conversion: Gregorian = Hebrew - 3760 from January to the Hebrew new year (1 Tishrei); Hebrew - 3761 from 1 Tishrei to December
- Month names often transliterated in various ways
example:
original: "23 Elul 5656"
normalized: "1896-09-01"
note: "Hebrew date from Creation (anno mundi)"
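# Worked sketch (illustrative only, not part of the schema): year arithmetic
# for a Hebrew date. The Hebrew year number changes at 1 Tishrei (September or
# October), so the value subtracted depends on the season. Key names are
# hypothetical.
#   hebrew_year: 5656
#   month: "Elul"              # last month before the new year
#   calculation: "5656 - 3760 = 1896"
#   gregorian_year: 1896
# For a date between 1 Tishrei and 31 December the subtraction is 3761:
#   calculation: "5657 - 3761 = 1896"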
french_republican:
id: "french_republican"
label: "French Republican Calendar"
uri: "https://www.wikidata.org/wiki/Q181974"
description: |
Calendar used in France 1793-1805. Year 1 = 1792 CE.
12 months of 30 days + 5-6 supplementary days.
months:
1: "Vendemiaire"
2: "Brumaire"
3: "Frimaire"
4: "Nivose"
5: "Pluviose"
6: "Ventose"
7: "Germinal"
8: "Floreal"
9: "Prairial"
10: "Messidor"
11: "Thermidor"
12: "Fructidor"
usage_notes: |
- French civil registrations 1793-1805
- Some Belgian/Dutch territories
- Conversion tables widely available
example:
original: "14 Vendemiaire an IV"
normalized: "1795-10-06"
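# Worked sketch (illustrative only, not part of the schema): year and day
# arithmetic for the example above. Republican years begin at the autumn
# equinox (22-24 September), so "an N" starts in Gregorian year 1791 + N.
# Key names are hypothetical.
#   republican_year: 4         # "an IV"
#   year_start: "1795-09-23"   # 1 Vendemiaire an IV
#   day_offset: 13             # 14 Vendemiaire is 13 days after 1 Vendemiaire
#   normalized: "1795-10-06"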
chinese:
id: "chinese"
label: "Chinese Calendar"
uri: "https://www.wikidata.org/wiki/Q32823"
description: |
Lunisolar calendar used in China and East Asia.
Combines 60-year cycle with lunar months.
usage_notes: |
- Emperor reign year + lunar month + day
- Gregorian adopted 1912 (Republic of China)
- Traditional dates still used for festivals
example:
original: "Guangxu 22, 8th month, 15th day"
normalized: "1896-09-21"
# -----------------------------------------------------------------------------
# Date Expression Patterns
# -----------------------------------------------------------------------------
date_expression_patterns:
description: |
Common patterns for expressing dates in historical sources.
GLM annotators should recognize these patterns and extract:
1. The original expression (exact transcription)
2. The calendar system used
3. A normalized ISO 8601 date (where possible)
patterns:
full_date:
description: "Complete date with day, month, and year"
examples:
- pattern: "15 October 1582"
calendar: "gregorian"
normalized: "1582-10-15"
- pattern: "the fifteenth day of October in the year 1582"
calendar: "gregorian"
normalized: "1582-10-15"
- pattern: "23 Elul 5656"
calendar: "hebrew"
normalized: "1896-09-01"
partial_date:
description: "Date with some components missing"
examples:
- pattern: "March 1875"
calendar: "gregorian"
normalized: "1875-03"
precision: "month"
- pattern: "in the year 1810"
calendar: "gregorian"
normalized: "1810"
precision: "year"
- pattern: "month of Rajab, 1225 AH"
calendar: "hijri"
normalized: "1810-08"
precision: "month"
dual_dating:
description: "Documents showing both Julian and Gregorian dates"
notation_styles:
- "O.S. (Old Style = Julian)"
- "N.S. (New Style = Gregorian)"
- "Slash notation: 14/26 March 1875"
examples:
- pattern: "14/26 March 1875"
interpretation: "14 March (Julian) = 26 March (Gregorian)"
normalized: "1875-03-26"
note: "Use Gregorian for normalization"
- pattern: "6 January 1894 (Gregorian)"
normalized: "1894-01-06"
note: "Explicit calendar indicator"
relative_dating:
description: "Dates relative to events or other dates"
examples:
- pattern: "three days after Easter"
requires: "Year context to calculate"
- pattern: "the Sunday before St. Martin's Day"
requires: "Year context and liturgical calendar"
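# Worked sketch (illustrative only, not part of the schema): resolving a
# relative date once the year is known. The year 1875 is an assumed context
# and the key names are hypothetical. Gregorian Easter Sunday fell on
# 28 March in 1875.
#   pattern: "three days after Easter"
#   year_context: 1875
#   anchor: "1875-03-28"       # Easter Sunday (Gregorian)
#   normalized: "1875-03-31"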
floruit:
description: "Period when person was known to be active"
notation: "fl."
examples:
- pattern: "fl. 1780-1820"
interpretation: "Active between 1780 and 1820"
- pattern: "fl. c. 1850"
interpretation: "Active around 1850"
# -----------------------------------------------------------------------------
# Temporal Properties in PiCo
# -----------------------------------------------------------------------------
temporal_properties:
description: |
Properties for capturing temporal information about persons
observed in historical sources.
biographical_dates:
birth_date:
property: "sdo:birthDate"
property_uri: "https://schema.org/birthDate"
range: "xsd:date or xsd:gYearMonth or xsd:gYear"
description: "Date of birth"
extraction_notes: |
- May be explicitly stated or inferred from age
- Capture calendar system if non-Gregorian
- Normalize to ISO 8601 for querying
death_date:
property: "sdo:deathDate"
property_uri: "https://schema.org/deathDate"
range: "xsd:date or xsd:gYearMonth or xsd:gYear"
description: "Date of death"
extraction_notes: |
- "deceased" annotation indicates death before document date
- Infer approximate date from context when possible
baptism_date:
property: "pico:baptismDate"
range: "xsd:date"
description: "Date of baptism/christening"
note: "Common in church records; often within days of birth"
burial_date:
property: "pico:burialDate"
range: "xsd:date"
description: "Date of burial"
note: "Common in church/cemetery records"
event_dates:
marriage_date:
property: "pico:marriageDate"
range: "xsd:date"
description: "Date of marriage event"
divorce_date:
property: "pico:divorceDate"
range: "xsd:date"
description: "Date of divorce"
document_date:
property: "sdo:dateCreated"
property_uri: "https://schema.org/dateCreated"
range: "xsd:date"
description: "Date the source document was created"
note: "Critical for temporal context of observations"
age_expressions:
age_at_event:
property: "pico:ageAtEvent"
range: "xsd:string"
description: "Age as stated in document"
examples:
- "25 years"
- "about 30 years old"
- "minor (under legal age)"
- "of full age (adult)"
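# Worked sketch (illustrative only, not part of the schema): estimating a
# birth year from a stated age. The document date is an assumed context and
# the key names are hypothetical. The birthday may fall before or after the
# document date, so the estimate is a two-year range, not a single year.
#   age_at_event: "25 years"
#   document_date: "1875-03-14"
#   birth_year_earliest: 1849
#   birth_year_latest: 1850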
note: |
Preserve original expression; calculate birth year if needed.
"oud 25 jaar" (Dutch) = "25 years old"
# -----------------------------------------------------------------------------
# PROV-O Provenance Model
# -----------------------------------------------------------------------------
provenance_model:
description: |
PiCo uses W3C PROV-O for provenance tracking at two levels:
1. OBSERVATION LEVEL: Where did this observation come from?
- prov:hadPrimarySource -> Source document
- prov:wasGeneratedBy -> Extraction activity (optional)
2. RECONSTRUCTION LEVEL: How was this person entity created?
- prov:wasDerivedFrom -> Source observation(s)
- prov:wasGeneratedBy -> Reconstruction activity
- prov:wasRevisionOf -> Previous reconstruction version
activity_class:
class: "prov:Activity"
class_uri: "http://www.w3.org/ns/prov#Activity"
description: "The activity that generated a PersonReconstruction"
properties:
- property: "prov:wasAssociatedWith"
description: "Agent responsible for the activity"
range: "prov:Agent"
- property: "prov:startedAtTime"
description: "When the activity started"
range: "xsd:dateTime"
- property: "prov:endedAtTime"
description: "When the activity completed"
range: "xsd:dateTime"
- property: "prov:used"
description: "Resources/tools used in the activity"
range: "prov:Entity"
note: "E.g., ML model, matching algorithm, rule set"
activity_types:
human_reconstruction:
description: "Manual reconstruction by researcher"
note: "Provide: time, place, knowledge sources, researcher name"
algorithmic_reconstruction:
description: "Automated reconstruction by software"
note: "Provide: algorithm name, version, configuration, parameters"
agent_class:
class: "prov:Agent"
class_uri: "http://www.w3.org/ns/prov#Agent"
description: "Person or organization responsible for reconstruction"
properties:
- property: "sdo:name"
description: "Name of the agent"
range: "xsd:string"
- property: "sdo:url"
description: "URL identifying the agent"
range: "sdo:URL"
examples:
- name: "CBG Center for Family History"
url: "https://cbg.nl"
type: "organization"
- name: "GLM-4.6 Person Extractor v1.0"
url: null
type: "software"
derivation_properties:
- property: "prov:wasDerivedFrom"
property_uri: "http://www.w3.org/ns/prov#wasDerivedFrom"
description: "Links PersonReconstruction to source PersonObservation(s)"
domain: "pico:PersonReconstruction"
range: "pico:PersonObservation"
cardinality: "1..*"
note: "REQUIRED for all PersonReconstructions"
- property: "prov:wasRevisionOf"
property_uri: "http://www.w3.org/ns/prov#wasRevisionOf"
description: "Links to previous version of reconstruction"
domain: "pico:PersonReconstruction"
range: "pico:PersonReconstruction"
cardinality: "0..1"
note: "For tracking reconstruction updates over time"
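# Illustrative sketch (not part of the schema): how the derivation properties
# above might appear on one reconstruction. All identifiers and timestamps are
# hypothetical; the agent name is taken from the agent_class examples.
#   id: "https://example.org/reconstruction/2"
#   type: "pico:PersonReconstruction"
#   prov:wasDerivedFrom:       # cardinality 1..*, required
#     - "https://example.org/observation/17"
#     - "https://example.org/observation/42"
#   prov:wasRevisionOf: "https://example.org/reconstruction/1"   # 0..1
#   prov:wasGeneratedBy:
#     type: "prov:Activity"
#     prov:wasAssociatedWith: "CBG Center for Family History"
#     prov:startedAtTime: "2025-12-12T17:00:00+01:00"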
# -----------------------------------------------------------------------------
# PiCo Vocabularies/Thesauri
# -----------------------------------------------------------------------------
pico_vocabularies:
description: |
PiCo defines three SKOS concept schemes for controlled terminology:
- Roles: The role a person plays in a source (child, declarant, witness, etc.)
- SourceTypes: Types of historical sources (birth certificate, census, etc.)
- EventTypes: Types of life events (birth, marriage, death, etc.)
roles_thesaurus:
id: "picot_roles"
uri: "https://terms.personsincontext.org/roles/"
type: "skos:ConceptScheme"
label: "Persons in Context role thesaurus"
description: "Roles that persons can have in historical sources"
usage: |
Use pico:hasRole property with a term from this thesaurus.
Example: picot_roles:575 (child), picot_roles:489 (declarant)
example_concepts:
- id: "575"
label: "child"
description: "Person appearing as child in a record"
- id: "489"
label: "declarant"
description: "Person declaring/reporting an event"
- id: "witness"
label: "witness"
description: "Person witnessing an event or signing a document"
- id: "bride"
label: "bride"
description: "Female partner in a marriage"
- id: "groom"
label: "groom"
description: "Male partner in a marriage"
sourcetypes_thesaurus:
id: "picot_sourcetypes"
uri: "https://terms.personsincontext.org/sourcetypes/"
type: "skos:ConceptScheme"
label: "Persons in Context sourceType thesaurus"
description: "Types of historical sources containing person observations"
usage: |
Use sdo:additionalType property on sdo:ArchiveComponent.
Example: picot_sourcetypes:551 (civil registry: birth)
example_concepts:
- id: "551"
label: "civil registry: birth"
description: "Birth certificate from civil registration"
- id: "marriage"
label: "civil registry: marriage"
description: "Marriage certificate"
- id: "death"
label: "civil registry: death"
description: "Death certificate"
- id: "census"
label: "census"
description: "Population census record"
- id: "church_baptism"
label: "church record: baptism"
description: "Baptismal record from church register"
- id: "notarial"
label: "notarial record"
description: "Notarial act or protocol"
eventtypes_thesaurus:
id: "picot_eventtypes"
uri: "https://terms.personsincontext.org/eventtypes/"
type: "skos:ConceptScheme"
label: "Persons in Context eventType thesaurus"
description: "Types of life events documented in sources"
example_concepts:
- id: "birth"
label: "birth"
- id: "baptism"
label: "baptism"
- id: "marriage"
label: "marriage"
- id: "death"
label: "death"
- id: "burial"
label: "burial"
- id: "emigration"
label: "emigration"
- id: "immigration"
label: "immigration"
# -----------------------------------------------------------------------------
# CH-Annotator Hypernym Integration for Temporal
# -----------------------------------------------------------------------------
temporal_hypernym_mapping:
description: |
Mapping between temporal expressions and CH-Annotator hypernyms.
mappings:
- pico_property: "sdo:birthDate"
ch_hypernym: "TMP.DAT"
ch_code: "TMP.DAT"
note: "Birth date temporal expression"
- pico_property: "sdo:deathDate"
ch_hypernym: "TMP.DAT"
ch_code: "TMP.DAT"
note: "Death date temporal expression"
- pico_property: "sdo:dateCreated"
ch_hypernym: "TMP.DAT"
ch_code: "TMP.DAT"
note: "Document creation date"
- calendar_expression: "Hijri date"
ch_hypernym: "TMP.DAT"
normalization: "Convert to Gregorian ISO 8601"
- calendar_expression: "Hebrew date"
ch_hypernym: "TMP.DAT"
normalization: "Convert to Gregorian ISO 8601"
- calendar_expression: "Julian date"
ch_hypernym: "TMP.DAT"
normalization: "Convert to Gregorian ISO 8601"