- Implemented a new script `test_pico_arabic_waqf.py` to test the GLM annotator's ability to extract person observations from Arabic historical documents.
- The script includes environment variable handling for API token, structured prompts for the GLM API, and validation of extraction results.
- Added comprehensive logging for API responses, extraction results, and validation errors.
- Included a sample Arabic waqf text for testing purposes, following the PiCo ontology pattern.
570 lines
18 KiB
YAML
# =============================================================================
# PiCo Integration Module: Temporal Patterns & Calendar Systems
# =============================================================================
# Part of: data/entity_annotation/modules/integrations/pico/
# Parent: _index.yaml
#
# Description: Temporal expression handling, calendar systems, date normalization,
#              and PROV-O provenance model for tracking observation/reconstruction
#              activities.
#
# Last Updated: 2025-12-12
# =============================================================================

# -----------------------------------------------------------------------------
# Calendar Systems
# -----------------------------------------------------------------------------
# Historical documents use various calendar systems. This section defines
# how to handle and normalize dates from different calendrical traditions.

calendar_systems:
  description: |
    Historical sources use diverse calendar systems depending on culture,
    religion, and time period. Proper extraction requires:
    1. Identifying the source calendar
    2. Preserving the original date expression
    3. Providing normalized ISO 8601 equivalents where possible

  supported_calendars:

    gregorian:
      id: "gregorian"
      label: "Gregorian Calendar"
      uri: "https://www.wikidata.org/wiki/Q12138"
      description: |
        The civil calendar now used worldwide. Adopted in 1582 (Catholic
        countries) or later (Protestant/Orthodox countries).
      adoption_dates:
        catholic: "1582-10-15"
        protestant: "1700-03-01"
        british_empire: "1752-09-14"
        russia: "1918-02-14"
        greece: "1923-03-01"
      usage_notes: |
        - Default for modern documents
        - Used in civil registrations after adoption
        - Standard for ISO 8601 normalization
      example:
        original: "15 October 1582"
        normalized: "1582-10-15"

    julian:
      id: "julian"
      label: "Julian Calendar"
      uri: "https://www.wikidata.org/wiki/Q11184"
      description: |
        Calendar introduced by Julius Caesar in 45 BCE. Used in Europe
        until the Gregorian reform, and still used by some Eastern
        Orthodox churches today.
      offset_from_gregorian:
        16th_century: 10
        17th_century: 10
        18th_century: 11
        19th_century: 12
        20th_century: 13
        21st_century: 13
      usage_notes: |
        - Greek Orthodox Church records use the Julian calendar
        - Russian Empire used Julian until 1918
        - Dual dating common in transition periods
        - Format: "Julian date / Gregorian date" or "O.S./N.S." notation
      example:
        original: "14 March 1875 (O.S.)"
        gregorian_equivalent: "26 March 1875"
        normalized: "1875-03-26"
        note: "Greek Orthodox used Julian; Gregorian equivalent calculated (19th-century offset: 12 days)"

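    # Worked example (illustrative sketch, not part of the module schema):
    # applying the offset_from_gregorian table of the julian entry to an
    # Old Style date. The key names below are assumptions made for this
    # sketch only.
    #
    #   julian_to_gregorian_example:
    #     original: "1 June 1750 (O.S.)"
    #     century: "18th_century"
    #     offset_days: 11              # looked up in offset_from_gregorian
    #     gregorian: "12 June 1750"    # 1 + 11 = 12
    #     normalized: "1750-06-12"
    #
    # The offset grows by one day at the end of February (Julian) in 1700,
    # 1800, and 1900, so dates in those three years need extra care.
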
    hijri:
      id: "hijri"
      label: "Islamic/Hijri Calendar"
      uri: "https://www.wikidata.org/wiki/Q28892"
      alternative_names:
        - "Islamic Calendar"
        - "Muslim Calendar"
        - "Lunar Hijri"
        - "Anno Hegirae (AH)"
      description: |
        Lunar calendar used in Islamic societies. Year 1 = 622 CE (Hijra).
        354 or 355 days per year (12 lunar months).
      months:
        1: "Muharram"
        2: "Safar"
        3: "Rabi' al-Awwal"
        4: "Rabi' al-Thani"
        5: "Jumada al-Awwal"
        6: "Jumada al-Thani"
        7: "Rajab"
        8: "Sha'ban"
        9: "Ramadan"
        10: "Shawwal"
        11: "Dhu al-Qa'dah"
        12: "Dhu al-Hijjah"
      usage_notes: |
        - Ottoman Empire, Waqf documents, Sijill records
        - Year conversion: Gregorian = (Hijri * 0.97) + 622 (approx)
        - Month-level precision often sufficient
        - Some documents use both Hijri and local calendars
      example:
        original: "month of Rajab, year 1225 Hijri"
        normalized: "1810-08"
        note: "Approximate month - exact day unknown"

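    # Worked example (illustrative sketch, not part of the module schema):
    # the approximate year conversion from the hijri usage_notes. The key
    # names below are assumptions made for this sketch only.
    #
    #   hijri_year_estimate:
    #     hijri_year: 1225
    #     calculation: "1225 * 0.97 + 622 = 1810.25"
    #     gregorian_year_estimate: 1810
    #
    # The formula yields a year estimate only. A Hijri year usually
    # straddles two Gregorian years, so month-level or day-level
    # normalization needs a conversion table or a calendar library.
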
    hebrew:
      id: "hebrew"
      label: "Hebrew Calendar"
      uri: "https://www.wikidata.org/wiki/Q9644"
      alternative_names:
        - "Jewish Calendar"
        - "Anno Mundi"
      description: |
        Lunisolar calendar used in Jewish religious and civil life.
        Year 1 = 3761 BCE (traditional Creation date).
      months:
        1: "Nisan"
        2: "Iyar"
        3: "Sivan"
        4: "Tammuz"
        5: "Av"
        6: "Elul"
        7: "Tishrei"
        8: "Cheshvan"
        9: "Kislev"
        10: "Tevet"
        11: "Shevat"
        12: "Adar"
      usage_notes: |
        - Ketubot (marriage contracts)
        - Get (divorce documents)
        - Synagogue records
        - Year conversion: Gregorian = Hebrew - 3760 (approx)
        - Month names often transliterated in various ways
      example:
        original: "23 Elul 5656"
        normalized: "1896-09-01"
        note: "Hebrew date from Creation (anno mundi)"

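    # Worked example (illustrative sketch, not part of the module schema):
    # the approximate year conversion from the hebrew usage_notes. The key
    # names below are assumptions made for this sketch only.
    #
    #   hebrew_year_estimate:
    #     original: "23 Elul 5656"
    #     calculation: "5656 - 3760 = 1896"
    #     gregorian_year_estimate: 1896
    #
    # The year number changes at 1 Tishrei (month 7 in the list above), in
    # September or October. Dates from 1 Tishrei through the following
    # 31 December therefore fall in Hebrew year - 3761, not - 3760.
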
    french_republican:
      id: "french_republican"
      label: "French Republican Calendar"
      uri: "https://www.wikidata.org/wiki/Q181974"
      description: |
        Calendar used in France 1793-1805. Year 1 = 1792 CE.
        12 months of 30 days + 5-6 supplementary days.
      months:
        1: "Vendemiaire"
        2: "Brumaire"
        3: "Frimaire"
        4: "Nivose"
        5: "Pluviose"
        6: "Ventose"
        7: "Germinal"
        8: "Floreal"
        9: "Prairial"
        10: "Messidor"
        11: "Thermidor"
        12: "Fructidor"
      usage_notes: |
        - French civil registrations 1793-1805
        - Some Belgian/Dutch territories
        - Conversion tables widely available
      example:
        original: "14 Vendemiaire an IV"
        normalized: "1795-10-06"

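    # Worked example (illustrative sketch, not part of the module schema):
    # estimating the Gregorian year of a Republican date from
    # "Year 1 = 1792 CE". The key names below are assumptions made for
    # this sketch only.
    #
    #   republican_year_estimate:
    #     original: "14 Vendemiaire an IV"
    #     calculation: "1792 + 4 - 1 = 1795"
    #     gregorian_year_estimate: 1795
    #
    # A Republican year starts in late September, so dates from about
    # 11 or 12 Nivose onward fall in the following Gregorian year
    # (1792 + year number). Exact days should come from a conversion table.
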
    chinese:
      id: "chinese"
      label: "Chinese Calendar"
      uri: "https://www.wikidata.org/wiki/Q32823"
      description: |
        Lunisolar calendar used in China and East Asia.
        Combines 60-year cycle with lunar months.
      usage_notes: |
        - Emperor reign year + lunar month + day
        - Gregorian adopted 1912 (Republic of China)
        - Traditional dates still used for festivals
      example:
        original: "Guangxu 22, 8th month, 15th day"
        normalized: "1896-09-21"

# -----------------------------------------------------------------------------
# Date Expression Patterns
# -----------------------------------------------------------------------------

date_expression_patterns:
  description: |
    Common patterns for expressing dates in historical sources.
    GLM annotators should recognize these patterns and extract:
    1. The original expression (exact transcription)
    2. The calendar system used
    3. A normalized ISO 8601 date (where possible)

  patterns:

    full_date:
      description: "Complete date with day, month, and year"
      examples:
        - pattern: "15 October 1582"
          calendar: "gregorian"
          normalized: "1582-10-15"

        - pattern: "the fifteenth day of October in the year 1582"
          calendar: "gregorian"
          normalized: "1582-10-15"

        - pattern: "23 Elul 5656"
          calendar: "hebrew"
          normalized: "1896-09-01"

    partial_date:
      description: "Date with some components missing"
      examples:
        - pattern: "March 1875"
          calendar: "gregorian"
          normalized: "1875-03"
          precision: "month"

        - pattern: "in the year 1810"
          calendar: "gregorian"
          normalized: "1810"
          precision: "year"

        - pattern: "month of Rajab, 1225 AH"
          calendar: "hijri"
          normalized: "1810-08"
          precision: "month"

    dual_dating:
      description: "Documents showing both Julian and Gregorian dates"
      notation_styles:
        - "O.S. (Old Style = Julian)"
        - "N.S. (New Style = Gregorian)"
        - "Slash notation: 14/26 March 1875"
      examples:
        - pattern: "14/26 March 1875"
          interpretation: "14 March (Julian) = 26 March (Gregorian)"
          normalized: "1875-03-26"
          note: "Use Gregorian for normalization"

        - pattern: "6 January 1894 (Gregorian)"
          normalized: "1894-01-06"
          note: "Explicit calendar indicator"

    relative_dating:
      description: "Dates relative to events or other dates"
      examples:
        - pattern: "three days after Easter"
          requires: "Year context to calculate"

        - pattern: "the Sunday before St. Martin's Day"
          requires: "Year context and liturgical calendar"

    floruit:
      description: "Period when person was known to be active"
      notation: "fl."
      examples:
        - pattern: "fl. 1780-1820"
          interpretation: "Active between 1780 and 1820"

        - pattern: "fl. c. 1850"
          interpretation: "Active around 1850"

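# Illustrative sketch (not part of the module schema): a record carrying the
# three items that the date_expression_patterns description asks annotators
# to extract. The wrapper key "date_annotation" and the field "original" are
# assumptions; calendar, normalized, and precision reuse the field names of
# the pattern examples.
#
#   date_annotation:
#     original: "in the year 1810"   # exact transcription
#     calendar: "gregorian"          # calendar system used
#     normalized: "1810"             # ISO 8601, reduced precision
#     precision: "year"
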
# -----------------------------------------------------------------------------
# Temporal Properties in PiCo
# -----------------------------------------------------------------------------

temporal_properties:
  description: |
    Properties for capturing temporal information about persons
    observed in historical sources.

  biographical_dates:
    birth_date:
      property: "sdo:birthDate"
      property_uri: "https://schema.org/birthDate"
      range: "xsd:date or xsd:gYearMonth or xsd:gYear"
      description: "Date of birth"
      extraction_notes: |
        - May be explicitly stated or inferred from age
        - Capture calendar system if non-Gregorian
        - Normalize to ISO 8601 for querying

    death_date:
      property: "sdo:deathDate"
      property_uri: "https://schema.org/deathDate"
      range: "xsd:date or xsd:gYearMonth or xsd:gYear"
      description: "Date of death"
      extraction_notes: |
        - "deceased" annotation indicates death before document date
        - Infer approximate date from context when possible

    baptism_date:
      property: "pico:baptismDate"
      range: "xsd:date"
      description: "Date of baptism/christening"
      note: "Common in church records; often within days of birth"

    burial_date:
      property: "pico:burialDate"
      range: "xsd:date"
      description: "Date of burial"
      note: "Common in church/cemetery records"

  event_dates:
    marriage_date:
      property: "pico:marriageDate"
      range: "xsd:date"
      description: "Date of marriage event"

    divorce_date:
      property: "pico:divorceDate"
      range: "xsd:date"
      description: "Date of divorce"

    document_date:
      property: "sdo:dateCreated"
      property_uri: "https://schema.org/dateCreated"
      range: "xsd:date"
      description: "Date the source document was created"
      note: "Critical for temporal context of observations"

  age_expressions:
    age_at_event:
      property: "pico:ageAtEvent"
      range: "xsd:string"
      description: "Age as stated in document"
      examples:
        - "25 years"
        - "about 30 years old"
        - "minor (under legal age)"
        - "of full age (adult)"
      note: |
        Preserve original expression; calculate birth year if needed.
        "oud 25 jaar" (Dutch) = "25 years old"

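# Worked example (illustrative sketch, not part of the module schema):
# estimating a birth year from pico:ageAtEvent and the document date. The key
# names below are assumptions made for this sketch only.
#
#   birth_year_estimate:
#     document_year: 1875
#     age_at_event: "25 years"       # original expression, preserved
#     calculation: "1875 - 25 = 1850"
#     birth_year_range: "1849/1850"  # 1849 if the birthday had not yet
#                                    # passed on the document date
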
# -----------------------------------------------------------------------------
# PROV-O Provenance Model
# -----------------------------------------------------------------------------

provenance_model:
  description: |
    PiCo uses W3C PROV-O for provenance tracking at two levels:

    1. OBSERVATION LEVEL: Where did this observation come from?
       - prov:hadPrimarySource -> Source document
       - prov:wasGeneratedBy -> Extraction activity (optional)

    2. RECONSTRUCTION LEVEL: How was this person entity created?
       - prov:wasDerivedFrom -> Source observation(s)
       - prov:wasGeneratedBy -> Reconstruction activity
       - prov:wasRevisionOf -> Previous reconstruction version

  activity_class:
    class: "prov:Activity"
    class_uri: "http://www.w3.org/ns/prov#Activity"
    description: "The activity that generated a PersonReconstruction"

    properties:
      - property: "prov:wasAssociatedWith"
        description: "Agent responsible for the activity"
        range: "prov:Agent"

      - property: "prov:startedAtTime"
        description: "When the activity started"
        range: "xsd:dateTime"

      - property: "prov:endedAtTime"
        description: "When the activity completed"
        range: "xsd:dateTime"

      - property: "prov:used"
        description: "Resources/tools used in the activity"
        range: "prov:Entity"
        note: "E.g., ML model, matching algorithm, rule set"

    activity_types:
      human_reconstruction:
        description: "Manual reconstruction by researcher"
        note: "Provide: time, place, knowledge sources, researcher name"

      algorithmic_reconstruction:
        description: "Automated reconstruction by software"
        note: "Provide: algorithm name, version, configuration, parameters"

  agent_class:
    class: "prov:Agent"
    class_uri: "http://www.w3.org/ns/prov#Agent"
    description: "Person or organization responsible for reconstruction"

    properties:
      - property: "sdo:name"
        description: "Name of the agent"
        range: "xsd:string"

      - property: "sdo:url"
        description: "URL identifying the agent"
        range: "sdo:URL"

    examples:
      - name: "CBG Center for Family History"
        url: "https://cbg.nl"
        type: "organization"

      - name: "GLM-4.6 Person Extractor v1.0"
        url: null
        type: "software"

  derivation_properties:
    - property: "prov:wasDerivedFrom"
      property_uri: "http://www.w3.org/ns/prov#wasDerivedFrom"
      description: "Links PersonReconstruction to source PersonObservation(s)"
      domain: "pico:PersonReconstruction"
      range: "pico:PersonObservation"
      cardinality: "1..*"
      note: "REQUIRED for all PersonReconstructions"

    - property: "prov:wasRevisionOf"
      property_uri: "http://www.w3.org/ns/prov#wasRevisionOf"
      description: "Links to previous version of reconstruction"
      domain: "pico:PersonReconstruction"
      range: "pico:PersonReconstruction"
      cardinality: "0..1"
      note: "For tracking reconstruction updates over time"

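# Illustrative sketch (not part of the module schema): how a
# PersonReconstruction might carry the provenance properties described in
# provenance_model. All identifiers with the "ex:" prefix, the wrapper keys,
# and the timestamps are hypothetical.
#
#   person_reconstruction:
#     id: "ex:reconstruction-2"
#     prov:wasDerivedFrom:                 # cardinality 1..*
#       - "ex:observation-1"
#       - "ex:observation-2"
#     prov:wasRevisionOf: "ex:reconstruction-1"   # cardinality 0..1
#     prov:wasGeneratedBy:
#       type: "prov:Activity"
#       prov:wasAssociatedWith: "ex:agent-1"
#       prov:startedAtTime: "2025-01-01T09:00:00Z"
#       prov:endedAtTime: "2025-01-01T09:05:00Z"
#       prov:used: "ex:matching-rule-set"
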
# -----------------------------------------------------------------------------
# PiCo Vocabularies/Thesauri
# -----------------------------------------------------------------------------

pico_vocabularies:
  description: |
    PiCo defines three SKOS concept schemes for controlled terminology:

    - Roles: The role a person plays in a source (child, declarant, witness, etc.)
    - SourceTypes: Types of historical sources (birth certificate, census, etc.)
    - EventTypes: Types of life events (birth, marriage, death, etc.)

  roles_thesaurus:
    id: "picot_roles"
    uri: "https://terms.personsincontext.org/roles/"
    type: "skos:ConceptScheme"
    label: "Persons in Context role thesaurus"
    description: "Roles that persons can have in historical sources"
    usage: |
      Use pico:hasRole property with a term from this thesaurus.
      Example: picot_roles:575 (child), picot_roles:489 (declarant)
    example_concepts:
      - id: "575"
        label: "child"
        description: "Person appearing as child in a record"

      - id: "489"
        label: "declarant"
        description: "Person declaring/reporting an event"

      - id: "witness"
        label: "witness"
        description: "Person witnessing an event or signing a document"

      - id: "bride"
        label: "bride"
        description: "Female partner in a marriage"

      - id: "groom"
        label: "groom"
        description: "Male partner in a marriage"

  sourcetypes_thesaurus:
    id: "picot_sourcetypes"
    uri: "https://terms.personsincontext.org/sourcetypes/"
    type: "skos:ConceptScheme"
    label: "Persons in Context sourceType thesaurus"
    description: "Types of historical sources containing person observations"
    usage: |
      Use sdo:additionalType property on sdo:ArchiveComponent.
      Example: picot_sourcetypes:551 (civil registry: birth)
    example_concepts:
      - id: "551"
        label: "civil registry: birth"
        description: "Birth certificate from civil registration"

      - id: "marriage"
        label: "civil registry: marriage"
        description: "Marriage certificate"

      - id: "death"
        label: "civil registry: death"
        description: "Death certificate"

      - id: "census"
        label: "census"
        description: "Population census record"

      - id: "church_baptism"
        label: "church record: baptism"
        description: "Baptismal record from church register"

      - id: "notarial"
        label: "notarial record"
        description: "Notarial act or protocol"

  eventtypes_thesaurus:
    id: "picot_eventtypes"
    uri: "https://terms.personsincontext.org/eventtypes/"
    type: "skos:ConceptScheme"
    label: "Persons in Context eventType thesaurus"
    description: "Types of life events documented in sources"
    example_concepts:
      - id: "birth"
        label: "birth"

      - id: "baptism"
        label: "baptism"

      - id: "marriage"
        label: "marriage"

      - id: "death"
        label: "death"

      - id: "burial"
        label: "burial"

      - id: "emigration"
        label: "emigration"

      - id: "immigration"
        label: "immigration"

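# Illustrative sketch (not part of the module schema): a person observation
# that references the thesauri through the prefixed identifiers shown in the
# usage notes of pico_vocabularies. The wrapper keys "person_observation" and
# "source" are assumptions.
#
#   person_observation:
#     pico:hasRole: "picot_roles:575"                 # child
#     source:
#       type: "sdo:ArchiveComponent"
#       sdo:additionalType: "picot_sourcetypes:551"   # civil registry: birth
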
# -----------------------------------------------------------------------------
# CH-Annotator Hypernym Integration for Temporal
# -----------------------------------------------------------------------------

temporal_hypernym_mapping:
  description: |
    Mapping between temporal expressions and CH-Annotator hypernyms.

  mappings:
    - pico_property: "sdo:birthDate"
      ch_hypernym: "TMP.DAT"
      ch_code: "TMP.DAT"
      note: "Birth date temporal expression"

    - pico_property: "sdo:deathDate"
      ch_hypernym: "TMP.DAT"
      ch_code: "TMP.DAT"
      note: "Death date temporal expression"

    - pico_property: "sdo:dateCreated"
      ch_hypernym: "TMP.DAT"
      ch_code: "TMP.DAT"
      note: "Document creation date"

    - calendar_expression: "Hijri date"
      ch_hypernym: "TMP.DAT"
      normalization: "Convert to Gregorian ISO 8601"

    - calendar_expression: "Hebrew date"
      ch_hypernym: "TMP.DAT"
      normalization: "Convert to Gregorian ISO 8601"

    - calendar_expression: "Julian date"
      ch_hypernym: "TMP.DAT"
      normalization: "Convert to Gregorian ISO 8601"
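
# Illustrative sketch (not part of the module schema): a temporal annotation
# that combines a hypernym from temporal_hypernym_mapping with the
# normalization it calls for. The wrapper key "temporal_annotation" and the
# field "original" are assumptions; the date values repeat the Hebrew example
# given under supported_calendars.
#
#   temporal_annotation:
#     original: "23 Elul 5656"
#     ch_hypernym: "TMP.DAT"
#     calendar: "hebrew"
#     normalized: "1896-09-01"       # Gregorian ISO 8601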