

Provenance Sources for PiCo Historical Document Examples

This document records the provenance of the real historical documents used in the PiCo (Person in Context) ontology integration examples within the CH-Annotator convention.

Last Updated: 2025-12-12
Author: GLAM Project
Version: 1.0.0


Table of Contents

  1. Hebrew Ketubah (Jewish Marriage Contracts)
  2. Arabic Waqf Documents (Islamic Endowments)
  3. Ottoman Turkish Sijill (Sharia Court Registers)
  4. Russian Metrical Books (Church Records)
  5. Spanish Colonial Baptism Records
  6. Italian Notarial Records
  7. Greek Orthodox Church Records
  8. Dutch Civil Registry Records
  9. License and Attribution Requirements

1. Hebrew Ketubah (Jewish Marriage Contracts)

1.1 Yale Beinecke Library - Mashhad Ketubah (1896)

| Field | Value |
| --- | --- |
| Archive | Yale University, Beinecke Rare Book & Manuscript Library |
| Collection | Hebrew Manuscripts Supplement |
| Call Number | Hebrew MSS suppl 194 |
| Digital URL | https://digital.library.yale.edu/catalog/2067542 |
| Document Type | Ketubah (Jewish marriage contract) |
| Date | 23 Elul 5656 (September 1, 1896 CE) |
| Place | Mashhad, Iran |
| Language | Hebrew, Aramaic |
| Access Date | 2025-12-12 |
| License | Public Domain (pre-1929) |

Persons Identified:

  • Groom: Mosheh ben Mashiah (משה בן משיאח)
  • Bride: Rivkah bat Ya'akov (רבקה בת יעקב)

Notes: This ketubah is from the crypto-Jewish community of Mashhad, known as the Jadid al-Islam, who maintained Jewish practices in secret after forced conversion in 1839. The document follows standard Sephardic/Mizrahi ketubah format.


1.2 Philadelphia Mikveh Israel Ketubah (1842)

| Field | Value |
| --- | --- |
| Archive | Congregation Mikveh Israel, Philadelphia |
| Collection | Philadelphia Congregations Records |
| Digital URL | https://philadelphiacongregations.org/records/item/MikvehIsrael.MarriageCertificate1842 |
| Document Type | Ketubah (Jewish marriage contract) |
| Date | 1842 CE |
| Place | Philadelphia, Pennsylvania, USA |
| Language | Aramaic (traditional text), English (translation provided) |
| Access Date | 2025-12-12 |
| License | Educational use permitted |

Key Features:

  • Full Aramaic text transcription available
  • English translation provided by archive
  • Example of American Sephardic ketubah format

Sample Aramaic Text (from source):

בשבת... בשבת... יום... לחדש... שנת... לבריאת עולם למנין שאנו מונין כאן...
איך החתן... בר... אמר לה להדא בתולתא... בת...

1.3 College of Charleston Ketubah (1908)

| Field | Value |
| --- | --- |
| Archive | College of Charleston, Special Collections |
| Collection | Jewish Heritage Collection |
| Document Type | Ketubah |
| Date | 1908 CE |
| Language | Hebrew, Aramaic |
| Access Date | 2025-12-12 |

Persons Identified:

  • Bride: Esther Devorah bat Rabbi Abraham (אסתר דבורה בת ר׳ אברהם)
  • Groom: Rabbi Yitzchak (ר׳ יצחק)

1.4 Rhodes Jewish Museum Collection

| Field | Value |
| --- | --- |
| Archive | Rhodes Jewish Museum |
| Location | Rhodes, Greece |
| Collection | Historical Documents |
| Document Types | Ketubot, community records |
| Period | 19th-20th century |
| Language | Ladino, Hebrew, Greek |

Notes: Documents from the historic Sephardic Jewish community of Rhodes, with unique Ladino elements.


2. Arabic Waqf Documents (Islamic Endowments)

2.1 Cambridge Digital Library - Islamic Collections

| Field | Value |
| --- | --- |
| Archive | Cambridge University Library |
| Collection | Islamic Manuscripts |
| Digital URL | https://cudl.lib.cam.ac.uk/collections/islamic |
| Document Types | Waqfiyya, legal documents, correspondence |
| Period | 8th-20th century CE |
| Languages | Arabic, Persian, Ottoman Turkish |
| License | CC BY-NC 4.0 |
| Access Date | 2025-12-12 |

Key Collections:

  • Genizah Collection (Cairo Genizah fragments)
  • Arabic Scientific Manuscripts
  • Islamic Legal Documents

2.2 UPenn OPenn - Manuscripts of the Muslim World

| Field | Value |
| --- | --- |
| Archive | University of Pennsylvania Libraries |
| Collection | Manuscripts of the Muslim World |
| Digital URL | https://openn.library.upenn.edu/html/muslimworld_contents.html |
| Document Types | Waqfiyya, Quranic manuscripts, legal documents |
| Period | 9th-20th century CE |
| Languages | Arabic, Persian, Ottoman Turkish |
| License | Public Domain / CC0 |
| Access Date | 2025-12-12 |

Notable Holdings:

  • Waqfiyya documents from Egypt, Syria, Turkey
  • Legal formularies with waqf templates
  • Property deeds and endowment records

2.3 Singapore National Heritage Board - Istanbul Waqf

| Field | Value |
| --- | --- |
| Archive | Singapore National Heritage Board |
| Collection | Roots.gov.sg |
| Accession Number | 1115401 |
| Digital URL | https://www.roots.gov.sg/Collection-Landing/listing/1115401 |
| Document Type | Waqf document |
| Donor/Creator | Muhammad b. Abd al-Ghani (محمد بن عبد الغني) |
| Properties | Istanbul (various locations) |
| Language | Ottoman Turkish, Arabic |
| Access Date | 2025-12-12 |

Key Features:

  • Complete waqf document with property descriptions
  • Lists endowed properties in Istanbul
  • Named beneficiaries and conditions

2.4 Haseki Sultan Waqfiyya (1552 CE)

| Field | Value |
| --- | --- |
| Archive | Various (studied in UC Berkeley eScholarship) |
| Document Type | Waqfiyya (imperial endowment deed) |
| Date | 1552 CE |
| Founder | Haseki Hürrem Sultan (Roxelana) |
| Language | Ottoman Turkish, Arabic |
| Research URL | UC Berkeley eScholarship |

Significance: One of the largest waqf endowments in Ottoman history, establishing charitable institutions across the empire.


3. Ottoman Turkish Sijill (Sharia Court Registers)

3.1 OpenJerusalem Project - Jerusalem Sharia Court Registers

| Field | Value |
| --- | --- |
| Archive | OpenJerusalem Project |
| Collection | Jerusalem Sharia Court Registers |
| Digital URL | https://www.openjerusalem.org/ |
| ARK Identifier | ark:/58142/PfV7b |
| Volume Count | 102 registers |
| Period | 1834-1920 CE |
| Language | Ottoman Turkish, Arabic |
| License | Open Access |
| Access Date | 2025-12-12 |
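The ARK identifier recorded above can be split into its parts before it is stored in a provenance record. The following is a minimal sketch, assuming the general `ark:/NAAN/name` shape; `parse_ark` is a hypothetical helper and not part of any OpenJerusalem tooling.

```python
import re

# Assumed general shape: "ark:/<NAAN>/<name>".
# The slash directly after "ark:" is treated as optional.
ARK_PATTERN = re.compile(r"^ark:/?(?P<naan>[^/]+)/(?P<name>.+)$")


def parse_ark(identifier: str) -> dict[str, str]:
    """Split an ARK string into its authority number (NAAN) and assigned name."""
    match = ARK_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not an ARK identifier: {identifier!r}")
    return match.groupdict()


print(parse_ark("ark:/58142/PfV7b"))  # {'naan': '58142', 'name': 'PfV7b'}
```

Keeping the two parts separate makes it easier to spot records whose identifier was copied with a missing or truncated authority number.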

Document Types:

  • Property sales (بيع)
  • Marriage contracts (نكاح)
  • Inheritance divisions (قسمة)
  • Waqf registrations
  • Debt acknowledgments (إقرار)
  • Court testimonies (شهادة)

Key Features:

  • Searchable database with document transcriptions
  • Photographs of original registers
  • Multi-language metadata (Arabic, English, French)

3.2 ISAM Istanbul Kadi Registers (Kadı Sicilleri)

| Field | Value |
| --- | --- |
| Archive | İslam Araştırmaları Merkezi (ISAM) |
| Collection | Istanbul Kadı Sicilleri |
| Digital URL | http://www.kadisicilleri.org/ |
| Volume Count | 40+ volumes online |
| Document Count | 40,000+ documents |
| Period | 16th-19th century CE |
| Language | Ottoman Turkish |
| License | Research access |
| Access Date | 2025-12-12 |

Coverage:

  • Istanbul courts (multiple districts)
  • Galata, Üsküdar, Eyüp
  • Complete transcriptions with original images

3.3 Istanbul Historical Kadi Registers Corpus

| Field | Value |
| --- | --- |
| Archive | Istanbul Metropolitan Municipality |
| Project | History of Istanbul |
| Digital URL | https://istanbultarihi.ist/434-istanbul-sharia-court-registers |
| Volume Count | ~10,000 volumes |
| Courts | 26 different courts |
| Period | 1453-1922 CE |
| Language | Ottoman Turkish |

Significance: Largest collection of Ottoman court records in existence.


3.4 Harvard Ottoman Court Records Project

| Field | Value |
| --- | --- |
| Archive | Harvard University |
| Project | Ottoman Court Records Project (OCRP) |
| Digital URL | https://cmes.fas.harvard.edu/projects/ocrp |
| Document Types | Sijill transcriptions, translations |
| Period | 16th-19th century CE |
| Languages | Ottoman Turkish (original), English (translations) |

3.5 Bulgarian National Library - Ottoman Sijills

| Field | Value |
| --- | --- |
| Archive | Bulgarian National Library |
| Collection | Oriental Department |
| Sijill Count | 160+ volumes |
| Defter Count | 1000+ registers |
| Coverage | Bulgarian Ottoman provinces |
| Period | 16th-19th century CE |
| Language | Ottoman Turkish, Arabic |

4. Russian Metrical Books (Church Records)

4.1 BYU Script Tutorial - Russian Metrical Books

| Field | Value |
| --- | --- |
| Institution | Brigham Young University |
| Project | Script Tutorial |
| Digital URL | https://script.byu.edu/russian-handwriting/documents/record-types/metrical-books/births |
| Document Type | Tutorial with real transcription examples |
| Languages | Russian (Cyrillic), English (translation) |
| License | Educational use |
| Access Date | 2025-12-12 |

Content Includes:

  • Complete birth record format explanation
  • Vocabulary lists with translations
  • Sample transcriptions from actual metrical books
  • Handwriting recognition guides

Sample Birth Record Structure (from tutorial):

В метрической книге записано:
Родился: [date]
Крещён: [date]
Имя: [name]
Родители: [father's full name with rank/status], законная жена его [mother's name]
Восприемники: [godparents]
Священник: [officiating priest]
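Because each line of the structure above is a `label: value` pair, it maps naturally onto a dictionary. The sketch below is illustrative only: the English key names are assumptions rather than PiCo terms, and the input is the template itself with its placeholders, not a real record.

```python
# Russian labels from the template mapped to English keys.
# The key names are illustrative assumptions, not PiCo vocabulary.
FIELD_LABELS = {
    "Родился": "birth_date",
    "Крещён": "baptism_date",
    "Имя": "given_name",
    "Родители": "parents",
    "Восприемники": "godparents",
    "Священник": "priest",
}

TEMPLATE = """В метрической книге записано:
Родился: [date]
Крещён: [date]
Имя: [name]
Родители: [father's full name with rank/status], законная жена его [mother's name]
Восприемники: [godparents]
Священник: [officiating priest]"""


def parse_birth_record(text: str) -> dict[str, str]:
    """Collect the known 'label: value' lines; unknown labels are skipped."""
    record = {}
    for line in text.splitlines():
        label, separator, value = line.partition(":")
        key = FIELD_LABELS.get(label.strip())
        if separator and key:
            record[key] = value.strip()
    return record


for key, value in parse_birth_record(TEMPLATE).items():
    print(f"{key}: {value}")
```

The labels here are the masculine forms used in the template. Records for girls use feminine verb forms, so the label table would need additional entries before this could be applied to real transcriptions.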

4.2 FamilySearch Russia Church Records

| Field | Value |
| --- | --- |
| Archive | FamilySearch |
| Collection | Russia Church Records |
| Wiki URL | https://www.familysearch.org/en/wiki/Russia_Church_Records |
| Document Types | Metrical books (births, marriages, deaths) |
| Period | 1722-1918 CE |
| Languages | Russian, Church Slavonic |
| Access | Free with registration |

Key Information:

  • Metrical books (метрические книги) mandated from 1722
  • Three-part structure: births/baptisms, marriages, deaths
  • Contains estate/class (сословие) information

4.3 Polish Archives - Kłobuck Parish Records

| Field | Value |
| --- | --- |
| Archive | Szukaj w Archiwach (Polish State Archives) |
| Parish | Kłobuck |
| Document Type | Roman Catholic metrical books |
| Period | 18th-19th century |
| Languages | Latin, Polish, Russian |

Notes: Example of Russian-era Polish parish records with parallel Latin/Russian entries.


4.4 RGIA St. Petersburg

| Field | Value |
| --- | --- |
| Archive | Russian State Historical Archive (RGIA) |
| Location | St. Petersburg, Russia |
| Holdings | 300+ metrical books |
| Period | 1832-1892 CE |
| Document Types | Orthodox, Catholic, Lutheran, Jewish metrical books |

5. Spanish Colonial Baptism Records

5.1 BYU Script Tutorial - Spanish Colonial Baptisms

| Field | Value |
| --- | --- |
| Institution | Brigham Young University |
| Project | Script Tutorial |
| Digital URL | https://script.byu.edu/spanish-handwriting/documents/church-records/baptisms |
| Document Type | Tutorial with real transcription examples |
| Languages | Spanish (colonial), English |
| License | Educational use |
| Access Date | 2025-12-12 |

Standard Baptism Entry Structure:

En [place] a [date] bauticé solemnemente a [name], [legitimacy status] de [father] y de [mother].
Fueron padrinos [godparents].
Y para que conste lo firmo.
[Priest signature]

Key Vocabulary:

  • hijo/hija legítimo/a = legitimate child (born to married parents)
  • hijo/hija natural = child born out of wedlock
  • párvulo/a = infant
  • español/a, indio/a, mestizo/a, mulato/a = casta categories
  • padrino/madrina = godfather/godmother (padrinos = godparents)

5.2 FamilySearch Mexico - Yucatán Catholic Church Records

| Field | Value |
| --- | --- |
| Archive | FamilySearch |
| Collection | Mexico, Yucatán, Catholic Church Records, 1543-1977 |
| Collection ID | 1909116 |
| Digital URL | https://www.familysearch.org/en/search/collection/1909116 |
| Period | 1543-1977 CE |
| Document Types | Baptisms, marriages, deaths, confirmations |
| Language | Spanish, Latin, Maya |
| Access | Free with registration |

Coverage:

  • 200+ parishes
  • Some of the earliest New World records (from 1543)
  • Indigenous Maya populations

5.3 Archivo General de la Nación (AGN) Mexico

| Field | Value |
| --- | --- |
| Archive | Archivo General de la Nación |
| Location | Mexico City, Mexico |
| Holdings | Colonial parish records, civil registry |
| Period | 16th-20th century CE |
| Languages | Spanish, Nahuatl, Latin |

6. Italian Notarial Records

6.1 Antenati - Italian State Archives Portal

| Field | Value |
| --- | --- |
| Archive | Italian Ministry of Culture |
| Project | Antenati (Ancestors) |
| Digital URL | https://antenati.cultura.gov.it/ |
| Venice URL | https://antenati.cultura.gov.it/archivio/state-archives-of-venezia/?lang=en |
| Document Types | Civil registry, notarial acts, parish records |
| Period | 1806-present (civil); 15th century+ (notarial) |
| Languages | Italian, Latin, Venetian |
| License | Open Access |
| Access Date | 2025-12-12 |

Venice State Archive Holdings:

  • Civil Registry (Stato Civile) 1806-1815 (Napoleonic period)
  • Notarial archives (Archivio Notarile)
  • Guild records (Arti e Mestieri)

6.2 OAC California Digital Library - Italian Notarial Documents

| Field | Value |
| --- | --- |
| Archive | University of California Libraries |
| Collection | Italian Notarial Documents Collection |
| Finding Aid | https://oac.cdlib.org/findaid/ark:%2F13030%2Fc8v412zd |
| Document Count | 168 documents |
| Period | 1465-1635 CE |
| Locations | Venice, Padua, Verona |
| Languages | Latin, Italian (Venetian) |
| Access Date | 2025-12-12 |
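The finding aid URL above carries its ARK in percent-encoded form (`%2F` stands for `/`). A small sketch of recovering the readable identifier with the standard library follows; `ark_from_finding_aid` is a hypothetical helper name.

```python
from urllib.parse import unquote, urlsplit

FINDING_AID_URL = "https://oac.cdlib.org/findaid/ark:%2F13030%2Fc8v412zd"


def ark_from_finding_aid(url: str) -> str:
    """Return the decoded ARK embedded in a finding aid URL path."""
    path = unquote(urlsplit(url).path)  # percent-escapes become plain characters
    start = path.find("ark:")
    if start == -1:
        raise ValueError(f"no ARK found in {url!r}")
    return path[start:]


print(ark_from_finding_aid(FINDING_AID_URL))  # ark:/13030/c8v412zd
```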

Document Types:

  • Contracts (contratti)
  • Wills (testamenti)
  • Property transfers
  • Marriage agreements (sponsalia)
  • Business partnerships

6.3 SION-Digit Project - Jewish Notarial Records

| Field | Value |
| --- | --- |
| Project | SION-Digit (Sources for the History of Italian Jewish Notarial Documents) |
| Coverage | Venice, Bordeaux, Amsterdam |
| Period | 16th-18th century CE |
| Focus | Jewish community notarial acts |
| Languages | Italian, Hebrew, Ladino |

7. Greek Orthodox Church Records

7.1 FamilySearch Greece Church Records

| Field | Value |
| --- | --- |
| Archive | FamilySearch |
| Wiki URL | https://www.familysearch.org/en/wiki/Greece_Church_Records |
| Document Types | Baptisms, marriages, deaths |
| Period | 17th century - 1925 CE |
| Language | Greek |
| Access | Free with registration |

Key Information:

  • Greek Orthodox church records are the primary source before civil registration began in 1925
  • Male registers (μητρώα αρρένων) for military service
  • Some records in Ottoman Turkish for pre-independence period

7.2 General State Archives of Greece (GAK)

| Field | Value |
| --- | --- |
| Archive | Γενικά Αρχεία του Κράτους (GAK) |
| Document Types | Church records, civil registry, Ottoman-era documents |
| Period | 15th century - present |
| Languages | Greek, Ottoman Turkish |

7.3 Greek Ancestry Resources

| Field | Value |
| --- | --- |
| Resource | Greek Ancestry |
| Coverage | Village church records guide |
| Document Types | Baptismal registers, marriage registers |
| Key Features | Guides to accessing island and mainland records |

8. Dutch Civil Registry Records

8.1 WieWasWie (Dutch Genealogical Database)

| Field | Value |
| --- | --- |
| Archive | Centraal Bureau voor Genealogie (CBG) |
| Project | WieWasWie |
| Digital URL | https://www.wiewaswie.nl/ |
| Document Types | Birth, marriage, death certificates |
| Period | 1811-present (civil); 1600s+ (church) |
| Languages | Dutch |
| Access | Subscription / Free at archives |

8.2 Dutch Provincial Archives

| Province | Archive | Holdings |
| --- | --- | --- |
| Noord-Holland | Noord-Hollands Archief | Civil registry from 1811, church records from 1600s |
| Zuid-Holland | Nationaal Archief | Central government records |
| Gelderland | Gelders Archief | Regional archives |
| Noord-Brabant | Brabants Historisch Informatie Centrum | Catholic parish records |

8.3 Dutch Marriage Certificate Format

Standard 19th-Century Format:

Heden den [date] compareerden voor ons [official name],
Ambtenaar van den Burgerlijken Stand der Gemeente [municipality]:

De Bruidegom: [groom's name], oud [age] jaren, [occupation],
geboren te [birthplace], wonende te [residence],
zoon van [father] en van [mother];

De Bruid: [bride's name], oud [age] jaren,
geboren te [birthplace], wonende te [residence],
dochter van [father] en van [mother];

Getuigen: [4 witnesses with ages, occupations, relationships]

En hebben wij dit huwelijk voltrokken in tegenwoordigheid van voornoemde getuigen.
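The groom and bride clauses of the format above share one shape, so a single pattern can capture both. This is a sketch under that assumption, run against the template text with its placeholders; the group names are hypothetical, and the occupation is treated as optional because the bride clause omits it.

```python
import re

# One pattern for both parties; group names are illustrative assumptions.
PARTY_PATTERN = re.compile(
    r"De (?P<role>Bruidegom|Bruid): (?P<name>.+?), oud (?P<age>.+?) jaren,"
    r"(?: (?P<occupation>[^,]+),)? "
    r"geboren te (?P<birthplace>.+?), wonende te (?P<residence>.+?), "
    r"(?:zoon|dochter) van (?P<father>.+?) en van (?P<mother>.+?);"
)

TEMPLATE = """De Bruidegom: [groom's name], oud [age] jaren, [occupation],
geboren te [birthplace], wonende te [residence],
zoon van [father] en van [mother];

De Bruid: [bride's name], oud [age] jaren,
geboren te [birthplace], wonende te [residence],
dochter van [father] en van [mother];"""


def parse_parties(text: str) -> list[dict]:
    """Return one dict per party; a missing occupation is reported as None."""
    flat = " ".join(text.split())  # collapse line breaks so clauses match as one line
    return [match.groupdict() for match in PARTY_PATTERN.finditer(flat)]


for party in parse_parties(TEMPLATE):
    print(party["role"], party["name"], party["occupation"])
```

Collapsing whitespace first keeps the pattern independent of where a clerk or transcriber happened to break the lines.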

9. License and Attribution Requirements

Open Access Resources

| Source | License | Attribution Required |
| --- | --- | --- |
| Cambridge Digital Library | CC BY-NC 4.0 | Yes |
| UPenn OPenn | Public Domain / CC0 | No (but encouraged) |
| OpenJerusalem | Open Access | Yes |
| Antenati | Open Access | Yes |
| FamilySearch | Terms of Service | Yes |
| BYU Script Tutorial | Educational Use | Yes |
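The table above can be kept as a small lookup so that extraction scripts check attribution requirements in one place. A minimal sketch follows; the `needs_attribution` helper and its default for unlisted sources are assumptions, not project rules.

```python
# (license, attribution required) per source, transcribed from the table above.
SOURCE_LICENSES = {
    "Cambridge Digital Library": ("CC BY-NC 4.0", True),
    "UPenn OPenn": ("Public Domain / CC0", False),  # attribution encouraged
    "OpenJerusalem": ("Open Access", True),
    "Antenati": ("Open Access", True),
    "FamilySearch": ("Terms of Service", True),
    "BYU Script Tutorial": ("Educational Use", True),
}


def needs_attribution(source: str) -> bool:
    """Unlisted sources default to True, the conservative choice (an assumption)."""
    _license_name, required = SOURCE_LICENSES.get(source, ("unknown", True))
    return required


for name in SOURCE_LICENSES:
    print(f"{name}: attribution required = {needs_attribution(name)}")
```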

For PiCo extraction examples, use the following provenance block in YAML:

provenance:
  source_url: "https://example.org/document/12345"
  archive_name: "Example Archive"
  collection: "Collection Name"
  document_id: "Document Identifier"
  access_date: "2025-12-12"
  license: "CC BY-NC 4.0"
  attribution: "Courtesy of Example Archive. Used under CC BY-NC 4.0 license."
  notes: "Transcription verified against original digital image."
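One way to keep such blocks consistent is to generate them from a small data class. The sketch below is an assumption about how that could look, not the project's implementation. It uses only the placeholder values from the template, derives the attribution line from the archive name and license, and relies on `json.dumps` for quoting, since JSON strings are valid YAML double-quoted scalars.

```python
import json
from dataclasses import asdict, dataclass

# Field order follows the template above.
FIELD_ORDER = ("source_url", "archive_name", "collection", "document_id",
               "access_date", "license", "attribution", "notes")


@dataclass
class Provenance:
    source_url: str
    archive_name: str
    collection: str
    document_id: str
    access_date: str
    license: str
    notes: str = ""

    @property
    def attribution(self) -> str:
        # Wording copied from the template's attribution line.
        return f"Courtesy of {self.archive_name}. Used under {self.license} license."

    def to_yaml(self) -> str:
        values = asdict(self)
        values["attribution"] = self.attribution
        lines = ["provenance:"]
        for key in FIELD_ORDER:
            lines.append(f"  {key}: {json.dumps(values[key], ensure_ascii=False)}")
        return "\n".join(lines)


example = Provenance(
    source_url="https://example.org/document/12345",
    archive_name="Example Archive",
    collection="Collection Name",
    document_id="Document Identifier",
    access_date="2025-12-12",
    license="CC BY-NC 4.0",
    notes="Transcription verified against original digital image.",
)
print(example.to_yaml())
```

Deriving the attribution from the other fields avoids the two drifting apart when a license value is corrected later.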

Data Fabrication Prohibition

CRITICAL: Per project rules (AGENTS.md Rule 21), all extraction examples MUST use real data from these verified sources. No fabrication of person names, dates, relationships, or document content is permitted.

When real data is not available from a source, the extraction example should be marked as:

provenance:
  source_url: null
  data_status: "SYNTHETIC_EXAMPLE"
  notes: "This example uses synthetic data for demonstration purposes only. Do not cite as historical evidence."
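A check along these lines could enforce the rule before an example is committed. It is a sketch only: which fields count as mandatory for real data is an assumption here, and the dictionaries stand in for provenance blocks that have already been loaded from YAML.

```python
# Assumption: these fields are treated as mandatory for real-source examples.
REQUIRED_FOR_REAL_DATA = ("source_url", "archive_name", "access_date", "license")


def provenance_problems(provenance: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the block passes."""
    if provenance.get("data_status") == "SYNTHETIC_EXAMPLE":
        if provenance.get("source_url") is not None:
            return ["synthetic examples must set source_url to null"]
        return []
    return [f"missing {field}"
            for field in REQUIRED_FOR_REAL_DATA
            if not provenance.get(field)]


synthetic = {"source_url": None, "data_status": "SYNTHETIC_EXAMPLE"}
incomplete = {"archive_name": "Example Archive", "license": "CC BY-NC 4.0"}

print(provenance_problems(synthetic))
print(provenance_problems(incomplete))
```

A block that is neither marked synthetic nor carries a source URL is reported as incomplete, which is the case the prohibition is meant to catch.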

Document Type Coverage Summary

| Document Type | Real Sources Available | Examples with Provenance |
| --- | --- | --- |
| Hebrew Ketubah | 4+ archives | Yale (1896), Philadelphia (1842) |
| Arabic Waqf | 3+ archives | Cambridge, UPenn, Singapore |
| Ottoman Sijill | 5+ archives | OpenJerusalem, ISAM, Harvard |
| Russian Metrical | 4+ archives | BYU Tutorial, RGIA |
| Spanish Colonial Baptism | 3+ archives | BYU Tutorial, FamilySearch |
| Italian Notarial | 3+ archives | Antenati, OAC/CDL |
| Greek Orthodox | 3+ archives | FamilySearch, GAK |
| Dutch Civil Registry | 3+ archives | WieWasWie, Provincial |

Changelog

| Date | Version | Changes |
| --- | --- | --- |
| 2025-12-12 | 1.0.0 | Initial compilation of provenance sources |