glam/AGENTS.md
kempersc fa5680f0dd Add initial versions of custodian hub UML diagrams in Mermaid and PlantUML formats
- Introduced custodian_hub_v3.mmd, custodian_hub_v4_final.mmd, and custodian_hub_v5_FINAL.mmd for Mermaid representation.
- Created custodian_hub_FINAL.puml and custodian_hub_v3.puml for PlantUML representation.
- Defined entities such as CustodianReconstruction, Identifier, TimeSpan, Agent, CustodianName, CustodianObservation, ReconstructionActivity, Appellation, ConfidenceMeasure, Custodian, LanguageCode, and SourceDocument.
- Established relationships and associations between entities, including temporal extents, observations, and reconstruction activities.
- Incorporated enumerations for various types, statuses, and classifications relevant to custodians and their activities.
2025-11-22 14:33:51 +01:00


AI Agent Instructions for GLAM Data Extraction

This document provides instructions for AI agents (particularly OpenCODE and Claude) to assist with extracting heritage institution data from conversation JSON files and other sources.


🎯 PROJECT CORE MISSION

PRIMARY OBJECTIVE: Create a comprehensive, nuanced ontology that accurately represents the complex, temporal, multi-faceted nature of heritage custodian institutions worldwide.

This is NOT a simple data extraction project. This is an ontology engineering project that:

  • Models heritage entities as multi-aspect temporal entities (place, custodian, legal form, collections, people)
  • Integrates multiple base ontologies (CPOV, TOOI, CIDOC-CRM, RiC-O, Schema.org, PiCo)
  • Captures organizational change events over time (custody transfers, mergers, transformations)
  • Distinguishes between nominal references and formal organizational structures
  • Links heritage custodians to people, collections, and locations with independent temporal lifecycles

If you're looking for simple NER extraction, this is not the right project.


🚨 CRITICAL RULES FOR ALL AGENTS

Rule 0: LinkML Schemas Are the Single Source of Truth

MASTER SCHEMA LOCATION: schemas/20251121/linkml/

The LinkML schema files are the authoritative, canonical definition of the Heritage Custodian Ontology:

Primary Schema File (SINGLE SOURCE OF TRUTH):

  • schemas/20251121/linkml/01_custodian_name.yaml - Complete Heritage Custodian Ontology
    • Defines CustodianObservation (source-based references to heritage keepers)
    • Defines CustodianName (standardized emic names)
    • Defines CustodianReconstruction (formal entities: individuals, groups, organizations, governments, corporations)
    • Includes ISO 20275 legal form codes (for legal entities)
    • PiCo-inspired observation/reconstruction pattern
    • Based on CIDOC-CRM E39_Actor (broader than organization)

ALL OTHER FILES ARE DERIVED/GENERATED from these LinkML schemas:

DO NOT edit these derived files directly:

  • schemas/20251121/rdf/*.{ttl,nt,jsonld,rdf,n3,trig,trix} - GENERATED from LinkML via gen-owl + rdfpipe
  • schemas/20251121/typedb/*.tql - DERIVED TypeDB schema (manual translation from LinkML)
  • schemas/20251121/uml/mermaid/*.mmd - DERIVED UML diagrams (manual visualization of LinkML)
  • schemas/20251121/examples/*.yaml - INSTANCES conforming to LinkML schema

Workflow for Schema Changes:

1. EDIT LinkML schema (01_custodian_name.yaml)
   ↓
2. REGENERATE RDF formats:
   $ gen-owl -f ttl schemas/20251121/linkml/01_custodian_name.yaml > schemas/20251121/rdf/01_custodian_name.owl.ttl
   $ rdfpipe schemas/20251121/rdf/01_custodian_name.owl.ttl -o nt > schemas/20251121/rdf/01_custodian_name.nt
   $ rdfpipe schemas/20251121/rdf/01_custodian_name.owl.ttl -o json-ld > schemas/20251121/rdf/01_custodian_name.jsonld
   $ # ... repeat for all 8 formats
   ↓
3. UPDATE TypeDB schema (manual translation)
   ↓
4. UPDATE UML/Mermaid diagrams (manual visualization)
   ↓
5. VALIDATE example instances:
   $ linkml-validate -s schemas/20251121/linkml/01_custodian_name.yaml schemas/20251121/examples/example.yaml

Why LinkML is the Master:

  • Formal specification: Type-safe, validation rules, cardinality constraints
  • Multi-format generation: Single source → RDF, JSON-LD, Python, SQL, GraphQL
  • Version control: Clear diffs, semantic versioning, change tracking
  • Ontology alignment: Explicit class_uri and slot_uri mappings to base ontologies
  • Documentation: Rich inline documentation with examples

NEVER:

  • Edit RDF files directly (they will be overwritten on next generation)
  • Consider TypeDB schema as authoritative (it's a translation target)
  • Treat UML diagrams as specification (they're visualizations)

ALWAYS:

  • Refer to LinkML schemas for class definitions
  • Update LinkML first, then regenerate derived formats
  • Validate changes against LinkML metamodel
  • Document schema changes in LinkML YAML comments

See also:

  • schemas/20251121/RDF_GENERATION_SUMMARY.md - RDF generation process documentation
  • docs/MIGRATION_GUIDE.md - Schema migration procedures
  • LinkML documentation: https://linkml.io/

Rule 1: Ontology Files Are Your Primary Reference

BEFORE designing any schema, class, or property:

  1. READ the base ontology files in /data/ontology/
  2. SEARCH for existing classes and properties that match your needs
  3. DOCUMENT your ontology alignment with explicit rationale
  4. NEVER invent custom properties when ontology equivalents exist

Available Ontologies:

  • data/ontology/core-public-organisation-ap.ttl - CPOV (EU public sector)
  • data/ontology/tooiont.ttl - TOOI (Dutch government)
  • data/ontology/schemaorg.owl - Schema.org (web semantics, private sector)
  • data/ontology/CIDOC_CRM_v7.1.3.rdf - CIDOC-CRM (cultural heritage domain)
  • data/ontology/RiC-O_1-1.rdf - Records in Contexts (archival description)
  • data/ontology/bibframe_vocabulary.rdf - BIBFRAME (libraries)
  • data/ontology/pico.ttl - PiCo (person observations, staff roles)

See .opencode/agent/ontology-mapping-rules.md for complete ontology consultation workflow.

Rule 2: Wikidata Entities Are NOT Ontology Classes

Files:

  • data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml
  • data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml

These files contain:

  • Wikidata entity identifiers (Q-numbers) for heritage institution TYPES
  • Multilingual labels and descriptions
  • Hypernym classifications (upper-level categories)
  • Source data for ontology mapping analysis

These files DO NOT contain:

  • Formal ontology class definitions
  • Direct class_uri mappings for LinkML
  • Ontology properties or relationships

REQUIRED WORKFLOW:

hyponyms_curated.yaml (Wikidata Q-numbers)
    ↓
ANALYZE semantic meaning + hypernyms
    ↓
SEARCH base ontologies for matching classes
    ↓
MAP Wikidata entity → Ontology class(es)
    ↓
DOCUMENT rationale + properties
    ↓
CREATE LinkML schema with ontology class_uri

Example - WRONG:

Mansion:
  class_uri: wd:Q1802963  # ← This is an ENTITY, not a CLASS!

Example - CORRECT:

Mansion:
  # Wikidata source: Q1802963
  place_aspect:
    class_uri: crm:E27_Site  # CIDOC-CRM ontology class
  custodian_aspect:
    class_uri: cpov:PublicOrganisation  # If operates as museum

Rule 3: Multi-Aspect Modeling is Mandatory

Every heritage entity has MULTIPLE ontological aspects with INDEPENDENT temporal lifecycles.

Required Aspects:

  1. Place Aspect (physical location/site)

    • Ontology: crm:E27_Site + schema:Place
    • Temporal: Construction → Demolition/Present
  2. Custodian Aspect (organization managing heritage)

    • Ontology: cpov:PublicOrganisation OR schema:Organization
    • Temporal: Founding → Dissolution/Present
  3. Legal Form Aspect (legal entity registration)

    • Ontology: org:FormalOrganization + tooi:Overheidsorganisatie (Dutch)
    • Temporal: Registration → Deregistration/Present
  4. Collections Aspect (heritage materials)

    • Ontology: rico:RecordSet OR crm:E78_Curated_Holding OR bf:Collection
    • Temporal: Accession → Deaccession (per item)
  5. People Aspect (staff, curators)

    • Ontology: pico:PersonObservation + crm:E21_Person
    • Temporal: Employment start → Employment end (per person)
  6. Temporal Events (organizational changes)

    • Ontology: crm:E10_Transfer_of_Custody, rico:Event
    • Tracks custody transfers, mergers, relocations, transformations

Example: A historic mansion operating as a museum has:

  • Place aspect: Building constructed 1880, still standing (145 years as of 2025)
  • Custodian aspect: Foundation established 1994 to operate museum (31 years as of 2025)
  • Legal form: Dutch stichting registered 1994, KvK #12345678
  • Collections: Mondrian artworks acquired 1994-2024
  • People: Current curator employed 2020-present

Each aspect changes independently over time!
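
To make "independent temporal lifecycles" concrete, here is a minimal sketch of the mansion example in Python. The `Aspect` dataclass and its year-granularity fields are hypothetical simplifications for illustration. The authoritative model is the LinkML schema, not this code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Aspect:
    """One ontological aspect of a heritage entity, with its own lifecycle."""
    name: str
    class_uri: str
    start_year: int
    end_year: Optional[int] = None  # None means the aspect is still active

    def active_in(self, year: int) -> bool:
        return self.start_year <= year and (self.end_year is None or year <= self.end_year)


# The mansion example: every aspect starts (and could end) on its own schedule.
mansion_aspects = [
    Aspect("place", "crm:E27_Site", 1880),
    Aspect("custodian", "cpov:PublicOrganisation", 1994),
    Aspect("legal_form", "org:FormalOrganization", 1994),
    Aspect("curator", "pico:PersonObservation", 2020),
]


def aspects_active_in(aspects: List[Aspect], year: int) -> List[str]:
    """Return the names of all aspects that exist in the given year."""
    return [aspect.name for aspect in aspects if aspect.active_in(year)]


print(aspects_active_in(mansion_aspects, 1950))  # ['place']
print(aspects_active_in(mansion_aspects, 2000))  # ['place', 'custodian', 'legal_form']
```

In 1950 only the building exists. A single flat record with one pair of start and end dates could not express that.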


Project Overview

Goal: Extract structured data about worldwide GLAMORCUBESFIXPHDNT (Galleries, Libraries, Archives, Museums, Official institutions, Research centers, Corporations, Unknown, Botanical gardens/zoos, Educational providers, Societies, Features, Intangible heritage groups, miXed, Personal collections, Holy sites, Digital platforms, NGOs, Taste/smell heritage) institutions from 139+ Claude conversation JSON files and integrate with authoritative CSV datasets.

Output: Validated LinkML-compliant records representing heritage custodian organizations with provenance tracking, geographic data, identifiers, and relationship information.

Schema: See the modular LinkML schema v0.2.1 with 19-type GLAMORCUBESFIXPHDNT taxonomy described below.

Schema Reference (v0.2.1)

The project uses a modular LinkML schema organized into 6 specialized modules:

  1. schemas/heritage_custodian.yaml - Main schema (import-only structure)

    • Top-level schema that imports all modules
    • Defines schema metadata and namespace
  2. schemas/core.yaml - Core Classes

    • HeritageCustodian - Main institution entity
    • Location - Geographic data
    • Identifier - External identifiers (ISIL, Wikidata, VIAF, etc.)
    • DigitalPlatform - Online systems and platforms
    • GHCID - Global Heritage Custodian Identifier
  3. schemas/enums.yaml - Enumerations

    • InstitutionTypeEnum - 19 institution types (GALLERY, LIBRARY, ARCHIVE, MUSEUM, etc.; see Institution Type Taxonomy)
    • ChangeTypeEnum - 11 organizational change types (FOUNDING, MERGER, CLOSURE, etc.)
    • DataSource - Data origin types (CSV_REGISTRY, CONVERSATION_NLP, etc.)
    • DataTier - Data quality tiers (TIER_1_AUTHORITATIVE through TIER_4_INFERRED)
    • PlatformTypeEnum - Digital platform categories
  4. schemas/provenance.yaml - Provenance Tracking

    • Provenance - Data source and quality metadata
    • ChangeEvent - Organizational change history (mergers, relocations, etc.)
    • GHCIDHistoryEntry - GHCID change tracking over time
  5. schemas/collections.yaml - Collection Metadata

    • Collection - Collection descriptions
    • Accession - Acquisition records
    • DigitalObject - Digital surrogates
  6. schemas/dutch.yaml - Dutch-Specific Extensions

    • DutchHeritageCustodian - Netherlands heritage institutions
    • Extensions for ISIL registry, platform integrations, KvK numbers

See /docs/SCHEMA_MODULES.md for detailed architecture and design patterns.

Base Ontologies for Global GLAM Data

CRITICAL: Before designing extraction pipelines or extending the schema, AI agents MUST consult the base ontologies that the LinkML schema builds upon. These ontologies provide standardized vocabularies and patterns for modeling heritage institutions.

Foundation Ontologies

The GLAM project integrates with three primary ontologies, each serving different geographic and semantic scopes:

1. TOOI - Dutch Government Organizational Ontology

File: /data/ontology/tooiont.ttl
Namespace: https://identifier.overheid.nl/tooi/def/ont/
Scope: Dutch heritage institutions (government archives, state museums, public cultural organizations)

When to Use:

  • Extracting Dutch heritage institutions from conversations
  • Modeling Dutch organizational change events (mergers, splits, reorganizations)
  • Integrating with Dutch ISIL registry or KvK (Chamber of Commerce) data
  • Parsing Dutch government heritage agency data

Key Classes:

  • tooi:Overheidsorganisatie - Government organization (extends to DutchHeritageCustodian)
  • tooi:Wijzigingsgebeurtenis - Change event (founding, merger, closure, relocation)

Key Properties:

  • tooi:officieleNaamInclSoort - Official name including type
  • tooi:begindatum / tooi:einddatum - Temporal validity (start/end dates)
  • tooi:organisatieIdentificatie - Formal identifiers (ISIL codes, etc.)

LinkML Mapping:

# schemas/dutch.yaml extends TOOI
DutchHeritageCustodian:
  is_a: HeritageCustodian
  class_uri: tooi:Overheidsorganisatie  # ← Maps to TOOI

Reference: See /docs/ONTOLOGY_EXTENSIONS.md for complete TOOI integration patterns.


2. CPOV - EU Core Public Organisation Vocabulary

Files:

  • /data/ontology/core-public-organisation-ap.ttl (RDF schema)
  • /data/ontology/core-public-organisation-ap.jsonld (JSON-LD context)

Namespace: http://data.europa.eu/m8g/
Scope: EU-wide and global public sector heritage organizations

When to Use:

  • Extracting European heritage institutions (France, Germany, Belgium, etc.)
  • Modeling international/global heritage organizations
  • Aligning with EU Linked Open Data initiatives (e.g., Europeana)
  • Extracting non-Dutch institutions from conversations

Key Classes:

  • cpov:PublicOrganisation - Public sector organization (base for HeritageCustodian)
  • cv:ChangeEvent - Organizational change events
  • locn:Address - Physical location data

Key Properties:

  • skos:prefLabel / skos:altLabel - Preferred and alternative names
  • dct:identifier - Formal identifiers (ISIL, Wikidata, VIAF)
  • dct:temporal - Temporal coverage (founding to closure dates)
  • locn:address - Physical addresses

LinkML Mapping:

# schemas/core.yaml aligns with CPOV
HeritageCustodian:
  class_uri: cpov:PublicOrganisation  # ← Maps to CPOV
  
  slots:
    name:
      slot_uri: skos:prefLabel
    alternative_names:
      slot_uri: skos:altLabel
    identifiers:
      slot_uri: dct:identifier

Reference: See /docs/ONTOLOGY_EXTENSIONS.md for complete CPOV integration patterns.


3. Schema.org - Web Vocabulary for Structured Data

File: /data/ontology/schemaorg.owl
Namespace: http://schema.org/
Scope: Universal web semantics (museums, galleries, collections, events, learning resources)

When to Use:

  • Extracting private collections or non-governmental organizations
  • Modeling digital platforms (learning management systems, discovery portals)
  • Web discoverability and SEO optimization
  • Fallback when TOOI/CPOV don't apply

Key Classes:

  • schema:Museum / schema:Library / schema:ArchiveOrganization - Heritage institution types
  • schema:Place - Geographic locations
  • schema:LearningResource - Educational platforms (LMS, online courses)
  • schema:Event - Organizational events (founding, exhibitions)

LinkML Mapping:

# schemas/enums.yaml maps platform types to Schema.org
DigitalPlatformTypeEnum:
  LEARNING_MANAGEMENT:
    meaning: schema:LearningResource  # ← Maps to Schema.org

Reference: See /docs/ONTOLOGY_EXTENSIONS.md for Schema.org usage examples.


Ontology Decision Tree for Agents

When extracting heritage institution data, choose the appropriate ontology:

START: Extract institution from conversation
  ↓
Is the institution Dutch?
  ├─ YES → Use TOOI ontology
  │         - Map to schemas/dutch.yaml
  │         - Extract ISIL codes (NL-* format)
  │         - Extract KvK numbers (8-digit)
  │         - Model change events as tooi:Wijzigingsgebeurtenis
  │
  └─ NO → Is it a public/government organization?
           ├─ YES → Use CPOV ontology
           │         - Map to schemas/core.yaml
           │         - Extract standard identifiers (ISIL, Wikidata, VIAF)
           │         - Model change events as cv:ChangeEvent
           │
           └─ NO → Use Schema.org
                    - Map to schemas/core.yaml
                    - Use schema:Museum, schema:Library, etc.
                    - Emphasize web discoverability

Multi-Ontology Support: Institutions can implement MULTIPLE ontology classes simultaneously:

<https://w3id.org/heritage/custodian/nl/rijksmuseum>
    a tooi:Overheidsorganisatie,      # Dutch government organization
      cpov:PublicOrganisation,        # EU public sector
      schema:Museum .                 # Schema.org web semantics

Required Ontology Consultation Workflow

Before extracting data, agents MUST perform these steps:

Step 1: Identify Institution Geographic Scope

# Determine which ontology applies
if institution_country == "NL":
    primary_ontology = "TOOI"
    ontology_file = "/data/ontology/tooiont.ttl"
elif institution_in_europe or institution_public_sector:
    primary_ontology = "CPOV"
    ontology_file = "/data/ontology/core-public-organisation-ap.ttl"
else:
    primary_ontology = "Schema.org"
    ontology_file = "/data/ontology/schemaorg.owl"

Step 2: Review Ontology Classes and Properties

Search ontology files for relevant classes:

# Dutch institutions - search TOOI
rg "tooi:Overheidsorganisatie|Wijzigingsgebeurtenis|begindatum" /data/ontology/tooiont.ttl

# EU/global institutions - search CPOV
rg "cpov:PublicOrganisation|cv:ChangeEvent|locn:Address" /data/ontology/core-public-organisation-ap.ttl

# All institutions - search Schema.org
rg "schema:Museum|schema:Library|schema:ArchiveOrganization" /data/ontology/schemaorg.owl

Step 3: Map Conversation Data to Ontology Properties

Create a mapping table before extraction:

| Extracted Field   | TOOI Property                  | CPOV Property    | Schema.org Property    |
|-------------------|--------------------------------|------------------|------------------------|
| Institution name  | tooi:officieleNaamInclSoort    | skos:prefLabel   | schema:name            |
| Alternative names | -                              | skos:altLabel    | schema:alternateName   |
| Founding date     | tooi:begindatum                | schema:startDate | schema:foundingDate    |
| Closure date      | tooi:einddatum                 | schema:endDate   | schema:dissolutionDate |
| ISIL code         | tooi:organisatieIdentificatie  | dct:identifier   | schema:identifier      |
| Address           | (use locn:Address)             | locn:address     | schema:address         |
| Merger event      | tooi:Wijzigingsgebeurtenis     | cv:ChangeEvent   | schema:Event           |
| Website           | -                              | schema:url       | schema:url             |
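
The mapping table above can also be kept as data, so that fields without an equivalent in the chosen ontology are reported rather than silently dropped. This is a sketch only: the snake_case field keys and the `map_fields` helper are hypothetical names, and the TOOI address entry follows the table's hint to fall back on LOCN.

```python
from typing import Dict, List, Optional, Tuple

# Extracted field -> property per ontology, transcribed from the mapping table.
PROPERTY_MAP: Dict[str, Dict[str, Optional[str]]] = {
    "institution_name": {"TOOI": "tooi:officieleNaamInclSoort", "CPOV": "skos:prefLabel", "Schema.org": "schema:name"},
    "alternative_names": {"TOOI": None, "CPOV": "skos:altLabel", "Schema.org": "schema:alternateName"},
    "founding_date": {"TOOI": "tooi:begindatum", "CPOV": "schema:startDate", "Schema.org": "schema:foundingDate"},
    "closure_date": {"TOOI": "tooi:einddatum", "CPOV": "schema:endDate", "Schema.org": "schema:dissolutionDate"},
    "isil_code": {"TOOI": "tooi:organisatieIdentificatie", "CPOV": "dct:identifier", "Schema.org": "schema:identifier"},
    # The table says "(use locn:Address)" for TOOI, so LOCN is used as the fallback here.
    "address": {"TOOI": "locn:address", "CPOV": "locn:address", "Schema.org": "schema:address"},
    "merger_event": {"TOOI": "tooi:Wijzigingsgebeurtenis", "CPOV": "cv:ChangeEvent", "Schema.org": "schema:Event"},
    "website": {"TOOI": None, "CPOV": "schema:url", "Schema.org": "schema:url"},
}


def map_fields(record: Dict[str, str], ontology: str) -> Tuple[Dict[str, str], List[str]]:
    """Translate extracted fields to ontology properties; also return the unmapped field names."""
    mapped: Dict[str, str] = {}
    unmapped: List[str] = []
    for field, value in record.items():
        prop = PROPERTY_MAP.get(field, {}).get(ontology)
        if prop is None:
            unmapped.append(field)
        else:
            mapped[prop] = value
    return mapped, unmapped


record = {"institution_name": "Noord-Hollands Archief", "website": "https://example.org"}
print(map_fields(record, "TOOI"))
# ({'tooi:officieleNaamInclSoort': 'Noord-Hollands Archief'}, ['website'])
```

An unmapped field signals that a second ontology (here Schema.org for the website) is needed alongside the primary one.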

Step 4: Document Ontology Alignment in Provenance

Always include ontology references in extraction metadata:

provenance:
  data_source: CONVERSATION_NLP
  extraction_method: "NLP extraction following CPOV ontology patterns"
  base_ontology: "http://data.europa.eu/m8g/"  # ← Document which ontology used
  ontology_alignment:
    - "cpov:PublicOrganisation"
    - "cv:ChangeEvent"
  extraction_date: "2025-11-09T..."
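
A small helper keeps this provenance block consistent across extractions. The sketch below assumes provenance is assembled as a plain dict before validation. `build_provenance` and `ONTOLOGY_NAMESPACES` are hypothetical names, and the namespace URIs are the ones listed under Foundation Ontologies.

```python
from datetime import datetime, timezone
from typing import List, Optional

ONTOLOGY_NAMESPACES = {
    "TOOI": "https://identifier.overheid.nl/tooi/def/ont/",
    "CPOV": "http://data.europa.eu/m8g/",
    "Schema.org": "http://schema.org/",
}


def build_provenance(ontology: str, classes: List[str],
                     extracted_at: Optional[datetime] = None) -> dict:
    """Build a provenance dict that always records the base ontology and aligned classes."""
    if ontology not in ONTOLOGY_NAMESPACES:
        raise ValueError(f"Unknown base ontology: {ontology}")
    if not classes:
        raise ValueError("ontology_alignment must list at least one class")
    moment = extracted_at or datetime.now(timezone.utc)
    return {
        "data_source": "CONVERSATION_NLP",
        "extraction_method": f"NLP extraction following {ontology} ontology patterns",
        "base_ontology": ONTOLOGY_NAMESPACES[ontology],
        "ontology_alignment": list(classes),
        "extraction_date": moment.isoformat(),
    }


provenance = build_provenance(
    "CPOV",
    ["cpov:PublicOrganisation", "cv:ChangeEvent"],
    extracted_at=datetime(2025, 11, 9, 12, 0, tzinfo=timezone.utc),
)
print(provenance["base_ontology"])    # http://data.europa.eu/m8g/
print(provenance["extraction_date"])  # 2025-11-09T12:00:00+00:00
```

Raising on an empty class list enforces the rule that every extraction documents its ontology alignment.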

Common Ontology Patterns

Pattern 1: Organizational Change Events

When extracting mergers, splits, relocations, name changes:

# TOOI pattern (Dutch institutions)
change_history:
  - event_id: https://w3id.org/heritage/custodian/event/nha-merger-2001
    change_type: MERGER  # Maps to tooi:Wijzigingsgebeurtenis
    event_date: "2001-01-01"
    event_description: "Merger of Gemeentearchief Haarlem and Rijksarchief in Noord-Holland"
    ontology_class: "tooi:Wijzigingsgebeurtenis"

# CPOV pattern (EU/global institutions)
change_history:
  - event_id: https://w3id.org/heritage/custodian/event/bnf-founding
    change_type: FOUNDING  # Maps to cv:ChangeEvent
    event_date: "1461-01-01"
    event_description: "Founded by King Louis XI as Royal Library"
    ontology_class: "cv:ChangeEvent"

Pattern 2: Multilingual Names

CPOV and Schema.org support language-tagged literals:

name: Bibliothèque nationale de France
alternative_names:
  - National Library of France@en
  - BnF@fr
  - Französische Nationalbibliothek@de

# RDF serialization:
# skos:prefLabel "Bibliothèque nationale de France"@fr ;
# skos:altLabel "National Library of France"@en, "BnF"@fr ;
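
The `label@lang` convention in the YAML above has to be split before it can be serialized as an RDF language-tagged literal. A minimal sketch, assuming the tag is a short language code after the last `@`; both helper names are hypothetical.

```python
import re
from typing import Optional, Tuple

# A trailing "@xx" or "@xx-Region" is treated as a language tag.
_LANG_TAG = re.compile(r"^(?P<label>.+)@(?P<lang>[A-Za-z]{2,3}(?:-[A-Za-z0-9]+)*)$")


def split_language_tag(value: str, default_lang: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Split 'Label@en' into ('Label', 'en'); untagged values get default_lang."""
    match = _LANG_TAG.match(value)
    if match:
        return match.group("label"), match.group("lang").lower()
    return value, default_lang


def to_turtle_literal(label: str, lang: Optional[str]) -> str:
    """Render a Turtle literal, escaping backslashes and double quotes."""
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}' if lang else f'"{escaped}"'


for raw in ["National Library of France@en", "BnF@fr", "Französische Nationalbibliothek@de"]:
    print(to_turtle_literal(*split_language_tag(raw)))
# "National Library of France"@en
# "BnF"@fr
# "Französische Nationalbibliothek"@de
```

The primary `name` field carries no tag in the YAML, so its language (French in this example) has to be supplied through `default_lang`.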

Pattern 3: Hierarchical Relationships

Use W3C Org Ontology patterns (integrated in CPOV):

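# Parent institution (relationship read from the parent's side: parent org:hasUnit this institution)
parent_organization:
  name: Ministry of Culture
  relationship_type: "org:hasUnit"  # CPOV uses W3C Org Ontology

# Branch institutions (relationship read from the branch's side: branch org:subOrganizationOf this institution)
branches:
  - name: Regional Archive Noord-Brabant
    relationship_type: "org:subOrganizationOf"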

Anti-Patterns to Avoid

DON'T: Invent custom properties when ontology equivalents exist

# BAD - Custom property instead of ontology reuse
institution_official_name: "Rijksarchief"  # Use skos:prefLabel instead!

DON'T: Ignore ontology namespace conventions

# BAD - No ontology reference
change_type: "merger"  # Use cv:ChangeEvent with proper namespace!

DON'T: Extract without reviewing ontology files

# BAD - Extracting Dutch institutions without reading TOOI
agent: "I'll extract Dutch archives using Schema.org only"
# This loses semantic precision and ignores domain-specific patterns!

DO: Always map to base ontologies and document alignment

# GOOD - Ontology-aligned extraction
name: Rijksarchief in Noord-Holland
institution_type: ARCHIVE
ontology_class: tooi:Overheidsorganisatie  # ← Documented
provenance:
  base_ontology: "https://identifier.overheid.nl/tooi/def/ont/"
  ontology_alignment:
    - tooi:Overheidsorganisatie
    - prov:Organization  # TOOI uses PROV-O for temporal tracking

Additional Ontology Resources

CIDOC-CRM (Cultural Heritage Domain):

  • File: /data/ontology/CIDOC_CRM_v7.1.3.rdf
  • Use for: Museum object cataloging, provenance, conservation
  • Key classes: crm:E74_Group (organizations), crm:E5_Event (historical events)

RiC-O (Records in Contexts - Archival Description):

  • Use for: Archival collections, fonds, series, items
  • Key classes: rico:CorporateBody, rico:RecordSet
  • Integration: Planned for future schema extension

BIBFRAME (Bibliographic Resources):

  • Use for: Library catalogs, bibliographic metadata
  • Key classes: bf:Organization, bf:Work, bf:Instance
  • Integration: For library-specific extensions

Reference Documentation: See /docs/ONTOLOGY_EXTENSIONS.md for comprehensive integration patterns, RDF serialization examples, and extension workflows.


Institution Type Taxonomy

The project uses a 19-type GLAMORCUBESFIXPHDNT taxonomy (expanded November 2025) with single-letter codes for GHCID identifier generation:

| Type | Code | Description | Example Use Cases |
|------|------|-------------|-------------------|
| GALLERY | G | Art gallery or exhibition space | Commercial galleries, kunsthallen |
| LIBRARY | L | Library (public, academic, specialized) | National libraries, university libraries |
| ARCHIVE | A | Archive (government, corporate, personal) | National archives, city archives |
| MUSEUM | M | Museum (art, history, science, etc.) | Rijksmuseum, natural history museums |
| OFFICIAL_INSTITUTION | O | Government heritage agencies | Provincial archives, heritage platforms |
| RESEARCH_CENTER | R | Research institutes and documentation centers | Knowledge centers, research libraries |
| CORPORATION | C | Corporate heritage collections | Company archives, corporate museums |
| UNKNOWN | U | Institution type cannot be determined | Ambiguous or unclassifiable organizations |
| BOTANICAL_ZOO | B | Botanical gardens and zoological parks | Arboreta, botanical gardens, zoos |
| EDUCATION_PROVIDER | E | Educational institutions with collections | Schools, training centers with heritage materials, universities |
| COLLECTING_SOCIETY | S | Societies collecting specialized materials | Numismatic societies, heritage societies (heemkundige kring) |
| FEATURES | F | Physical landscape features with heritage significance | Monuments, sculptures, statues, memorials, landmarks, cemeteries |
| INTANGIBLE_HERITAGE_GROUP | I | Organizations preserving intangible heritage | Traditional performance groups, oral history societies, folklore organizations |
| MIXED | X | Multiple types (uses X code) | Combined museum/archive facilities |
| PERSONAL_COLLECTION | P | Private personal collections | Individual collectors |
| HOLY_SITES | H | Religious heritage sites and institutions | Churches, temples, mosques, synagogues with collections |
| DIGITAL_PLATFORM | D | Digital heritage platforms and repositories | Online archives, digital libraries, virtual museums |
| NGO | N | Non-governmental heritage organizations | Heritage advocacy groups, preservation societies |
| TASTE_SMELL | T | Culinary and olfactory heritage institutions | Historic restaurants, parfumeries, distilleries preserving traditional recipes and formulations |

Notes:

  • MIXED institutions use "X" as the GHCID code and document all actual types in metadata
  • HOLY_SITES includes religious institutions managing cultural heritage collections (archives, libraries, artifacts)
  • FEATURES includes physical monuments and landscape features with heritage value (not institutions maintaining collections)
  • COLLECTING_SOCIETY includes historical societies (historische vereniging), philatelic societies, numismatic clubs, ephemera collectors
  • OFFICIAL_INSTITUTION includes aggregation platforms, provincial heritage services, and government heritage agencies
  • INTANGIBLE_HERITAGE_GROUP covers organizations preserving UNESCO-recognized intangible cultural heritage
  • DIGITAL_PLATFORM includes born-digital heritage platforms and digitization aggregators
  • NGO includes non-profit heritage organizations that don't fit other categories
  • TASTE_SMELL includes establishments actively preserving culinary traditions, historic recipes, perfume formulations, and sensory heritage
  • When institution type is unknown, records default to UNKNOWN pending verification

Mnemonic: GLAMORCUBESFIXPHDNT - Galleries, Libraries, Archives, Museums, Official institutions, Research centers, Corporations, Unknown, Botanical gardens/zoos, Education providers, Societies, Features, Intangible heritage groups, miXed, Personal collections, Holy sites, Digital platforms, NGOs, Taste/smell heritage

Note on order: The mnemonic GLAMORCUBESFIXPHDNT lists the single-letter codes in mnemonic order, not alphabetical order: G-L-A-M-O-R-C-U-B-E-S-F-I-X-P-H-D-N-T

Note: Universities are classified under E (EDUCATION_PROVIDER), not U. The U-class is reserved for institutions where the type cannot be determined during data extraction.
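
The taxonomy is small enough to hold as a lookup table. The sketch below transcribes the type names and codes from the table and notes above. The helper `type_code` is a hypothetical name, and falling back to "U" for unrecognized values is an assumption that extends the documented default for unknown types.

```python
from typing import Optional

# Institution type -> single-letter GHCID code, in mnemonic order.
INSTITUTION_TYPE_CODES = {
    "GALLERY": "G", "LIBRARY": "L", "ARCHIVE": "A", "MUSEUM": "M",
    "OFFICIAL_INSTITUTION": "O", "RESEARCH_CENTER": "R", "CORPORATION": "C",
    "UNKNOWN": "U", "BOTANICAL_ZOO": "B", "EDUCATION_PROVIDER": "E",
    "COLLECTING_SOCIETY": "S", "FEATURES": "F", "INTANGIBLE_HERITAGE_GROUP": "I",
    "MIXED": "X", "PERSONAL_COLLECTION": "P", "HOLY_SITES": "H",
    "DIGITAL_PLATFORM": "D", "NGO": "N", "TASTE_SMELL": "T",
}


def type_code(institution_type: Optional[str]) -> str:
    """Return the GHCID letter for a type; missing or unrecognized types map to 'U'."""
    if not institution_type:
        return INSTITUTION_TYPE_CODES["UNKNOWN"]
    return INSTITUTION_TYPE_CODES.get(institution_type.upper(), "U")


# Sanity checks: 19 types, unique codes, and the mnemonic spelled by the codes.
assert len(INSTITUTION_TYPE_CODES) == 19
assert len(set(INSTITUTION_TYPE_CODES.values())) == 19
assert "".join(INSTITUTION_TYPE_CODES.values()) == "GLAMORCUBESFIXPHDNT"

print(type_code("EDUCATION_PROVIDER"))  # E  (universities belong here, not under U)
print(type_code(None))                  # U
```

Joining the codes reproduces the mnemonic, which makes an accidental reordering or duplicate code easy to spot.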

Data Sources

Primary Sources

  1. Conversation JSON files (/Users/kempersc/Documents/claude/glam/*.json)

    • 139 conversation files covering global GLAM research
    • Countries and regions include: Brazil, Vietnam, Chile, Japan, Mexico, Norway, Thailand, Taiwan, Belgium, Azerbaijan, Estonia, Namibia, Argentina, Tunisia, Ghana, Iran, Russia, Uzbekistan, Armenia, Georgia, Croatia, Greece, Nigeria, Somalia, Yemen, Oman, South Korea, Malaysia, Colombia, Switzerland, Moldova, Romania, Albania, Bosnia, Pakistan, Suriname, Nicaragua, Congo, Denmark, Austria, Australia, Myanmar, Cambodia, Sri Lanka, Tajikistan, Turkmenistan, Philippines, Latvia, Palestine, Limburg (NL), Gelderland (NL), Drenthe (NL), Groningen (NL), Slovakia, Kenya, Paraguay, Honduras, Mozambique, Eritrea, Sudan, Rwanda, Kiribati, Jamaica, Indonesia, Italy, Zimbabwe, East Timor, UAE, Kuwait, Lebanon, Syria, Maldives, Benin
    • Also 14 ontology research conversations
  2. Dutch ISIL Registry (data/ISIL-codes_2025-08-01.csv)

    • About 366 records for Dutch heritage institutions (364 valid after parsing; see Implementation Status below)
    • Fields: Volgnr, Plaats, Instelling, ISIL code, Toegekend op, Opmerking
    • Authoritative source (Tier 1)
  3. Dutch Organizations CSV (data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv)

    • Comprehensive Dutch heritage organizations
    • 40+ metadata columns including: name, address, ISIL code, organization type, partnerships, systems used, metadata standards
    • Rich integration data (Museum register, Rijkscollectie, Collectie Nederland, Archieven.nl, etc.)
    • Authoritative source (Tier 1)

Implementation Status (Updated Nov 2025)

Both Dutch datasets have been successfully parsed and cross-linked:

ISIL Registry:

  • 364 institutions parsed (2 invalid codes rejected)
  • 203 cities covered
  • Parser: src/glam_extractor/parsers/isil_registry.py
  • Tests: 10/10 passing (84% coverage)

Dutch Organizations:

  • 1,351 institutions parsed
  • 475 cities covered
  • 1,119 organizations with digital platforms
  • Parser: src/glam_extractor/parsers/dutch_orgs.py
  • Tests: 18/18 passing (98% coverage)

Cross-linking Results 🔗:

  • 340 institutions matched by ISIL code (92.1% overlap)
  • 198 records enriched with platform data
  • 127 name conflicts detected (require manual review)
  • 1,004 organizations without ISIL codes (candidates for assignment)
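
The cross-linking categories above (matched, name conflict, no ISIL code) come from a join on the ISIL code. The following is an illustrative sketch of that join, not the project's crosslink_dutch_datasets.py. The record layout (`isil`, `name` keys), the sample rows, and the name-comparison rule are all hypothetical.

```python
from typing import Dict, List, Tuple


def crosslink_by_isil(registry: List[Dict[str, str]],
                      organizations: List[Dict[str, str]]
                      ) -> Tuple[List[str], List[Tuple[str, str, str]], List[str]]:
    """Join two datasets on ISIL code and report matches, name conflicts, and gaps."""
    by_isil = {row["isil"]: row for row in registry}
    matched: List[str] = []
    conflicts: List[Tuple[str, str, str]] = []
    without_isil: List[str] = []
    for org in organizations:
        isil = org.get("isil")
        if not isil:
            without_isil.append(org["name"])  # candidate for ISIL assignment
            continue
        registry_row = by_isil.get(isil)
        if registry_row is None:
            continue  # code not present in the registry; ignored in this sketch
        matched.append(isil)
        # Case-insensitive comparison only; real matching would need fuzzier rules.
        if registry_row["name"].strip().casefold() != org["name"].strip().casefold():
            conflicts.append((isil, registry_row["name"], org["name"]))
    return matched, conflicts, without_isil


# Made-up sample rows; the codes are placeholders, not real ISIL assignments.
registry_rows = [
    {"isil": "NL-Example1", "name": "Stadsarchief Voorbeeld"},
    {"isil": "NL-Example2", "name": "Museum Voorbeeld"},
]
organization_rows = [
    {"isil": "NL-Example1", "name": "stadsarchief voorbeeld"},
    {"isil": "NL-Example2", "name": "Voorbeeld Museum"},
    {"isil": "", "name": "Heemkundekring Voorbeeld"},
]

matched, conflicts, without_isil = crosslink_by_isil(registry_rows, organization_rows)
print(matched)       # ['NL-Example1', 'NL-Example2']
print(conflicts)     # [('NL-Example2', 'Museum Voorbeeld', 'Voorbeeld Museum')]
print(without_isil)  # ['Heemkundekring Voorbeeld']
```

Name conflicts are collected rather than resolved automatically, in line with the manual-review requirement noted above.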

Analysis Scripts:

  • compare_dutch_datasets.py - Dataset comparison
  • crosslink_dutch_datasets.py - TIER_1 data merging demo
  • test_real_dutch_orgs.py - Real data validation

See PROGRESS.md for detailed statistics and findings.


Conversation JSON Structure

Each conversation JSON file has the following structure:

{
  "uuid": "conversation-uuid",
  "name": "Conversation name (often includes country/region)",
  "summary": "Optional summary",
  "created_at": "ISO 8601 timestamp",
  "updated_at": "ISO 8601 timestamp",
  "chat_messages": [
    {
      "uuid": "message-uuid",
      "text": "User or assistant message text",
      "sender": "human" | "assistant",
      "content": [
        {
          "type": "text" | "tool_use" | "tool_result",
          "text": "Message content (may contain markdown, lists, etc.)",
          ...
        }
      ]
    }
  ]
}
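
Given this structure, the text to run extraction on can be collected with a short traversal. This is a minimal sketch, assuming that `content` blocks of type `text` are preferred and that the top-level `text` field is a fallback when no such block exists. That precedence is an assumption, and `iter_message_texts` is a hypothetical name.

```python
import json
from typing import Iterator, Tuple


def iter_message_texts(conversation: dict) -> Iterator[Tuple[str, str]]:
    """Yield (sender, text) for every message that carries text."""
    for message in conversation.get("chat_messages", []):
        sender = message.get("sender", "unknown")
        parts = [
            block.get("text", "")
            for block in message.get("content", [])
            if block.get("type") == "text"
        ]
        # Skip tool_use / tool_result blocks; fall back to the top-level text field.
        text = "\n".join(part for part in parts if part) or message.get("text", "")
        if text:
            yield sender, text


sample = json.loads("""
{
  "uuid": "conversation-uuid",
  "name": "Brazilian GLAM institutions",
  "chat_messages": [
    {"uuid": "m1", "sender": "human", "text": "List national libraries.", "content": []},
    {"uuid": "m2", "sender": "assistant", "text": "",
     "content": [
       {"type": "tool_use", "text": "ignored"},
       {"type": "text", "text": "The Biblioteca Nacional do Brasil is in Rio de Janeiro."}
     ]}
  ]
}
""")

for sender, text in iter_message_texts(sample):
    print(f"{sender}: {text}")
# human: List national libraries.
# assistant: The Biblioteca Nacional do Brasil is in Rio de Janeiro.
```

With real files, `json.load` on an opened conversation file would replace the inline sample.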

NLP Extraction Tasks

All extraction tasks map to the modular LinkML schema v0.2.1. See Schema Reference section above for module details.

Task 1: Entity Recognition - Institution Names

Objective: Extract heritage institution names from conversation text.

Schema Mapping: Populates HeritageCustodian class from schemas/core.yaml

Patterns to Look For:

  • Organization names (proper nouns)
  • Museum names (often contain "Museum", "Museu", "Museo", "Muzeum", etc.)
  • Library names (contain "Library", "Biblioteca", "Bibliothek", "Bibliotheek", etc.)
  • Archive names (contain "Archive", "Archivo", "Archiv", "Archief", etc.)
  • Gallery names
  • Cultural centers
  • Holy sites with collections (churches, temples, mosques, synagogues, monasteries, abbeys, cathedrals managing heritage materials)

Contextual Indicators:

  • Lists of institutions
  • Descriptions like "The X is a museum in Y"
  • URLs containing institution names
  • Mentions of collections, exhibitions, or holdings

Example Extraction:

Input: "The Biblioteca Nacional do Brasil in Rio de Janeiro holds over 9 million items..."

Output:
- name: "Biblioteca Nacional do Brasil"  # HeritageCustodian.name
- institution_type: LIBRARY  # InstitutionTypeEnum from schemas/enums.yaml
- city: "Rio de Janeiro"  # Location.city from schemas/core.yaml
- confidence_score: 0.95  # Provenance.confidence_score from schemas/provenance.yaml
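
The name patterns listed above can serve as a first-pass type guess before any deeper NLP. The sketch below uses only the keywords named in this task. `guess_institution_type` is a hypothetical helper, and substring matching is a deliberately crude heuristic: a match should be treated as a candidate, not a verified classification.

```python
# Keyword fragments taken from the "Patterns to Look For" list, checked in order.
TYPE_KEYWORDS = [
    ("MUSEUM", ("museum", "museu", "museo", "muzeum")),
    ("LIBRARY", ("library", "biblioteca", "bibliothek", "bibliotheek")),
    ("ARCHIVE", ("archive", "archivo", "archiv", "archief")),
]


def guess_institution_type(name: str) -> str:
    """Guess an institution type from keywords in its name; default to UNKNOWN."""
    lowered = name.lower()
    for institution_type, keywords in TYPE_KEYWORDS:
        if any(keyword in lowered for keyword in keywords):
            return institution_type
    return "UNKNOWN"


for candidate in ["Biblioteca Nacional do Brasil", "Amsterdam Museum",
                  "Noord-Hollands Archief", "Nationaal Onderduikmuseum",
                  "Cultural Center of the Arts"]:
    print(candidate, "->", guess_institution_type(candidate))
# Biblioteca Nacional do Brasil -> LIBRARY
# Amsterdam Museum -> MUSEUM
# Noord-Hollands Archief -> ARCHIVE
# Nationaal Onderduikmuseum -> MUSEUM
# Cultural Center of the Arts -> UNKNOWN
```

Names without a recognizable keyword fall through to UNKNOWN, which matches the taxonomy's default for undetermined types.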

Task 2: Location Extraction

Objective: Extract geographic information associated with institutions.

Schema Mapping: Populates Location class from schemas/core.yaml

Extract:

  • City names
  • Street addresses (when mentioned)
  • Postal codes
  • Provinces/states/regions
  • Country (can often be inferred from conversation title)

Geocoding:

  • Use Nominatim API to geocode addresses to lat/lon
  • Link to GeoNames IDs when possible
  • Handle multilingual place names

Example:

Input: "Nationaal Onderduikmuseum, Aalten"

Output:
- city: "Aalten"  # Location.city
- country: "NL"  # Location.country (ISO 3166-1 alpha-2)
- geonames_id: "2759899"  # Location.geonames_id (lookup via API)
- latitude: 51.9167  # from geocoding
- longitude: 6.5833
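
For the geocoding step, the request to Nominatim's public search endpoint can be assembled as below. This sketch only builds the URL and performs no network call. `nominatim_search_url` is a hypothetical helper, and a real client would also need to follow Nominatim's usage policy (identifying User-Agent, rate limiting).

```python
from typing import Optional
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def nominatim_search_url(query: str, country_code: Optional[str] = None, limit: int = 1) -> str:
    """Build a Nominatim search URL for a free-text place or institution query."""
    params = {"q": query, "format": "jsonv2", "limit": limit}
    if country_code:
        # Restricting by ISO 3166-1 alpha-2 code reduces false matches for common place names.
        params["countrycodes"] = country_code.lower()
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


print(nominatim_search_url("Nationaal Onderduikmuseum, Aalten", country_code="NL"))
# https://nominatim.openstreetmap.org/search?q=Nationaal+Onderduikmuseum%2C+Aalten&format=jsonv2&limit=1&countrycodes=nl
```

Passing the country inferred from the conversation title as `country_code` is one way to handle place names that exist in several countries.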

Task 3: Identifier Extraction

Objective: Extract external identifiers mentioned in conversations.

Schema Mapping: Populates Identifier class from schemas/core.yaml

Identifier Types:

  • ISIL codes (format: NL-XXXXX, US-XXXXX, etc.)
  • Wikidata IDs (format: Q12345)
  • VIAF IDs (format: numeric)
  • URLs to institutional websites
  • KvK numbers (Dutch: 8-digit format)

Patterns:

ISIL: [A-Z]{2}-[A-Za-z0-9]+
Wikidata: Q[0-9]+
VIAF: viaf.org/viaf/[0-9]+
KvK: [0-9]{8}

Example:

Input: "ISIL code NL-AsdAM for Amsterdam Museum"

Output:
- identifier_scheme: "ISIL"  # Identifier.identifier_scheme
- identifier_value: "NL-AsdAM"  # Identifier.identifier_value
- institution_name: "Amsterdam Museum"  # HeritageCustodian.name (for linking)
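
The patterns above translate directly into compiled regular expressions. In this sketch, word boundaries and digit lookarounds are added so that, for example, an 8-digit KvK match is not taken from inside a longer number. Those additions and the `extract_identifiers` helper are assumptions layered on the listed patterns. The KvK pattern still matches any standalone 8-digit number, so surrounding context should be checked before accepting it.

```python
import re
from typing import Dict, List

IDENTIFIER_PATTERNS = {
    "ISIL": re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]+\b"),
    "Wikidata": re.compile(r"\bQ[0-9]+\b"),
    "VIAF": re.compile(r"viaf\.org/viaf/([0-9]+)"),      # capture only the numeric ID
    "KvK": re.compile(r"(?<![0-9])[0-9]{8}(?![0-9])"),   # exactly 8 digits, not part of a longer run
}


def extract_identifiers(text: str) -> List[Dict[str, str]]:
    """Return every identifier found in text as scheme/value pairs."""
    found: List[Dict[str, str]] = []
    for scheme, pattern in IDENTIFIER_PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group(1) if pattern.groups else match.group(0)
            found.append({"identifier_scheme": scheme, "identifier_value": value})
    return found


print(extract_identifiers("ISIL code NL-AsdAM for Amsterdam Museum"))
# [{'identifier_scheme': 'ISIL', 'identifier_value': 'NL-AsdAM'}]

print(extract_identifiers("See Q12345, https://viaf.org/viaf/123456789 and KvK 12345678"))
# [{'identifier_scheme': 'Wikidata', 'identifier_value': 'Q12345'},
#  {'identifier_scheme': 'VIAF', 'identifier_value': '123456789'},
#  {'identifier_scheme': 'KvK', 'identifier_value': '12345678'}]
```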

Task 4: Relationship Extraction

Objective: Extract relationships between institutions.

Schema Mapping: Maps to ChangeEvent class from schemas/provenance.yaml (for mergers, splits) and future relationship modeling

Relationship Types:

  • Parent-child (e.g., "X is part of Y")
  • Partnerships (e.g., "X collaborates with Y")
  • Network memberships (e.g., "X is a member of Z consortium")
  • Merged organizations (e.g., "X merged with Y") → ChangeTypeEnum.MERGER

Indicators:

  • "part of", "branch of", "division of"
  • "in partnership with", "collaborates with"
  • "member of", "belongs to"
  • "merged with", "absorbed by" → Use ChangeEvent from schemas/provenance.yaml

Task 5: Collection Metadata Extraction

Objective: Extract information about collections held by institutions.

Schema Mapping: Populates Collection class from schemas/collections.yaml

Extract:

  • Collection names → Collection.collection_name
  • Collection types (archival, bibliographic, museum objects)
  • Subject areas → Collection.subject_areas
  • Time periods covered → Collection.temporal_coverage
  • Item counts (when mentioned) → Collection.extent
  • Access information → Collection.access_rights

Example:

Input: "The archive holds 15,000 documents from the 18th-19th centuries..."

Output:
- collection_type: "archival"  # Collection metadata
- item_count: 15000  # Collection.extent
- time_period_start: "1700-01-01"  # Collection.temporal_coverage
- time_period_end: "1899-12-31"

Task 6: Digital Platform Identification

Objective: Identify digital platforms and systems used by institutions.

Schema Mapping: Populates DigitalPlatform class from schemas/core.yaml

Platform Types:

  • Collection management systems (Atlantis, MAIS, CollectiveAccess, etc.)
  • Digital repositories (DSpace, EPrints, Fedora)
  • Discovery portals
  • SPARQL endpoints
  • APIs

Extract:

  • Platform name → DigitalPlatform.platform_name
  • Platform URL → DigitalPlatform.platform_url
  • Metadata standards used → DigitalPlatform.metadata_standards
  • Integration with aggregators (Europeana, DPLA, etc.)

Task 7: Metadata Standards Detection

Objective: Identify which metadata standards institutions use.

Schema Mapping: Stores in DigitalPlatform.metadata_standards (list of strings)

Standards to Detect:

  • Dublin Core
  • MARC21
  • EAD (Encoded Archival Description)
  • BIBFRAME
  • LIDO
  • CIDOC-CRM
  • Schema.org
  • RiC-O (Records in Contexts)
  • MODS, PREMIS, SPECTRUM, DACS

Indicators:

  • Explicit mentions: "uses Dublin Core", "MARC21 records"
  • Implicit: technical discussions about cataloging practices

Task 8: Organizational Change Event Extraction (NEW - v0.2.0)

Objective: Extract significant organizational change events from conversation history.

Schema Mapping: Populates ChangeEvent class from schemas/provenance.yaml

Change Types to Detect (from ChangeTypeEnum in schemas/enums.yaml):

  • FOUNDING: "established", "founded", "created", "opened"
  • CLOSURE: "closed", "dissolved", "ceased operations", "shut down"
  • MERGER: "merged with", "combined with", "joined with", "absorbed"
  • SPLIT: "split into", "divided into", "separated from", "spun off"
  • ACQUISITION: "acquired", "took over", "purchased"
  • RELOCATION: "moved to", "relocated to", "transferred to"
  • NAME_CHANGE: "renamed to", "formerly known as", "changed name to"
  • TYPE_CHANGE: "became a museum", "converted to archive", "now operates as"
  • STATUS_CHANGE: "reopened", "temporarily closed", "suspended operations"
  • RESTRUCTURING: "reorganized", "restructured", "reformed"
  • LEGAL_CHANGE: "incorporated as", "became a foundation", "legal status changed"
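
The keyword lists above could drive a cheap pre-filter that proposes a change type and a year before an agent reads the passage in full. The sketch below is an illustration under assumptions: `detect_change_event` is a hypothetical helper, only four of the change types are included, and matching is plain substring search on the literal phrases.

```python
import re

# Abridged from the change types above. Order matters: the first type with a
# keyword hit wins, so MERGER is checked before the more generic FOUNDING.
CHANGE_KEYWORDS = {
    "MERGER": ["merged with", "combined with", "joined with", "absorbed"],
    "RELOCATION": ["moved to", "relocated to", "transferred to"],
    "NAME_CHANGE": ["renamed to", "formerly known as", "changed name to"],
    "FOUNDING": ["established", "founded", "created", "opened"],
}

# First four-digit year between 1000 and 2099.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def detect_change_event(sentence):
    """Return (change_type, year_or_None) for the first keyword hit, else None."""
    lowered = sentence.lower()
    for change_type, keywords in CHANGE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            year_match = YEAR.search(sentence)
            year = int(year_match.group(1)) if year_match else None
            return change_type, year
    return None


print(detect_change_event("In 2001, the museum merged with the city archive"))
```

Literal phrases miss common variants: "The archive was relocated from X to Y in 1923" contains none of the RELOCATION keywords and returns `None`. Treat the result as a hint for the extracting agent, not as a final classification.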

Extract for Each Event:

change_history:  # HeritageCustodian.change_history (list of ChangeEvent)
  - event_id: "https://w3id.org/heritage/custodian/event/unique-id"  # ChangeEvent.event_id
    change_type: MERGER  # ChangeEvent.change_type (ChangeTypeEnum from schemas/enums.yaml)
    event_date: "2001-01-01"  # ChangeEvent.event_date
    event_description: >-  # ChangeEvent.event_description
      Merger of Institution A and Institution B to form new organization C.
      Detailed description from conversation.
    affected_organization: null  # ChangeEvent.affected_organization (optional)
    resulting_organization: null  # ChangeEvent.resulting_organization (optional)
    related_organizations: []  # ChangeEvent.related_organizations (optional)
    source_documentation: "https://..."  # ChangeEvent.source_documentation (optional)

Temporal Context Indicators:

  • "In 2001, the museum merged with..."
  • "After the renovation in 1985..."
  • "Following the name change in 1968..."
  • "The archive was relocated from X to Y in 1923"

PROV-O Integration:

  • Map to prov:Activity in RDF serialization
  • Link with prov:wasInfluencedBy from HeritageCustodian
  • Use prov:atTime for event timestamps
  • Track prov:entity (affected) and prov:generated (resulting) organizations

Example Extraction:

Input: "The Noord-Hollands Archief was formed in 2001 through a merger of 
        Gemeentearchief Haarlem (founded 1910) and Rijksarchief in Noord-Holland 
        (founded 1802). The merger created a unified regional archive serving both 
        the city and province."

Output:
- event_id: "https://w3id.org/heritage/custodian/event/nha-merger-2001"
- change_type: MERGER  # ChangeTypeEnum.MERGER
- event_date: "2001-01-01"
- event_description: "Merger of Gemeentearchief Haarlem (municipal archive, founded 
                      1910) and Rijksarchief in Noord-Holland (state archive, founded 
                      1802) to form Noord-Hollands Archief."
- confidence_score: 0.95  # From Provenance metadata

GHCID Impact:

  • When institutions merge, relocate, or change names, GHCID may change
  • Track old GHCID in ghcid_history with valid_to timestamp matching event date → GHCIDHistoryEntry from schemas/provenance.yaml
  • Create new GHCIDHistoryEntry with valid_from matching event date
  • Link change event to GHCID change via temporal correlation
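
The history update described above can be sketched as follows. The dataclass here is a stand-in with only the fields named in the text (`valid_from`, `valid_to`) plus an assumed `ghcid` field; the real `GHCIDHistoryEntry` is defined in schemas/provenance.yaml and may differ. The GHCID strings in the usage lines are placeholders, not real identifiers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GHCIDHistoryEntry:
    # Stand-in for the schema class; field set is an assumption.
    ghcid: str
    valid_from: str
    valid_to: Optional[str] = None


def apply_ghcid_change(history: List[GHCIDHistoryEntry], new_ghcid: str, event_date: str) -> None:
    """Close every open entry at event_date, then append the new GHCID."""
    for entry in history:
        if entry.valid_to is None:
            entry.valid_to = event_date  # old GHCID ends on the event date
    history.append(GHCIDHistoryEntry(ghcid=new_ghcid, valid_from=event_date))


history = [GHCIDHistoryEntry(ghcid="OLD-GHCID-PLACEHOLDER", valid_from="1910-01-01")]
apply_ghcid_change(history, "NEW-GHCID-PLACEHOLDER", "2001-01-01")
print(history)
```

Using the same `event_date` for both `valid_to` and `valid_from` is what makes the temporal correlation with the ChangeEvent possible.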

Task 9: Holy Sites Heritage Collection Identification

Objective: Identify religious sites that function as heritage custodians by maintaining cultural collections.

Schema Mapping: Populates HeritageCustodian class with institution_type: HOLY_SITES

When to Classify as HOLY_SITES:

Religious institutions qualify as HOLY_SITES heritage custodians when they manage:

  • Archival collections: Historical documents, parish registers, ecclesiastical records
  • Library collections: Rare manuscripts, theological texts, historical books
  • Museum collections: Religious artifacts, liturgical objects, art collections
  • Cultural heritage: Historical buildings with guided tours, preservation programs

Patterns to Look For:

  • Church archives (parish records, baptismal registers, historical documents)
  • Monastery libraries (manuscript collections, rare books)
  • Cathedral treasuries (liturgical objects, religious art)
  • Temple museums (Buddhist artifacts, historical collections)
  • Mosque libraries (Islamic manuscripts, Quranic texts)
  • Synagogue archives (Jewish community records, Torah scrolls)
  • Abbey collections (medieval manuscripts, historical artifacts)

Keywords and Indicators:

  • "church archive", "parish records", "ecclesiastical archive"
  • "monastery library", "monastic collection", "scriptorium"
  • "cathedral treasury", "cathedral museum"
  • "temple library", "temple collection"
  • "mosque library", "Islamic manuscript collection"
  • "synagogue archive", "Jewish heritage collection"
  • "religious heritage site", "pilgrimage site with museum"

NOT Holy Sites (use other types):

  • Secular museums about religion (use MUSEUM)
  • Academic religious studies centers (use RESEARCH_CENTER or UNIVERSITY)
  • Government archives of church records (use ARCHIVE)
  • Religious organizations without heritage collections (not heritage custodians)

Example Extraction:

Input: "The Vatican Apostolic Archive holds over 85 km of shelving with 
        documents dating back to the 8th century, including papal bulls, 
        correspondence, and medieval manuscripts."

Output:
- name: Vatican Apostolic Archive
  institution_type: HOLY_SITES  # Religious institution managing heritage collections
  description: >-
    The Vatican Apostolic Archive (formerly Vatican Secret Archives) is 
    the central repository for papal and Vatican documents, holding over 
    35,000 volumes of historical records spanning 12 centuries.    
  locations:
    - city: Vatican City
      country: VA
  collections:
    - collection_name: Papal Documents
      collection_type: archival
      temporal_coverage: "0800-01-01/2024-12-31"
      extent: "85 kilometers of shelving, 35,000+ volumes"
  provenance:
    data_source: CONVERSATION_NLP
    confidence_score: 0.95

Schema.org Mapping:

  • HOLY_SITES maps to schema:PlaceOfWorship in RDF serialization
  • Can also use schema:ArchiveOrganization or schema:Library for collection-specific context
  • Use multiple type assertions when appropriate

Cross-Cultural Considerations:

  • Christianity: churches, cathedrals, monasteries, abbeys, convents
  • Islam: mosques, madrasas (with historical libraries)
  • Judaism: synagogues, yeshivas (with archival collections)
  • Buddhism: temples, monasteries, pagodas (with artifact collections)
  • Hinduism: temples (with historical collections)
  • Sikhism: gurdwaras (with historical manuscripts)
  • Other faiths: shrines, pilgrimage sites with documented heritage collections

Data Quality and Provenance

Provenance Tracking

Every extracted record MUST include:

provenance:
  data_source: CONVERSATION_NLP
  data_tier: TIER_4_INFERRED
  extraction_date: "2025-11-05T..."
  extraction_method: "Subagent NER + pattern matching"
  confidence_score: 0.85
  conversation_id: "conversation-uuid"
  source_url: null
  verified_date: null
  verified_by: null

Confidence Scoring

Assign confidence scores (0.0-1.0) based on:

  • 0.9-1.0: Explicit, unambiguous mentions with context
  • 0.7-0.9: Clear mentions with some ambiguity
  • 0.5-0.7: Inferred from context, may need verification
  • 0.3-0.5: Low confidence, likely needs verification
  • 0.0-0.3: Very uncertain, flag for manual review
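
The bands above share their boundary values (0.9 appears in two ranges). One way to make the mapping unambiguous is to treat each lower bound as inclusive, as in this sketch. The band labels and the function name `confidence_band` are illustrative assumptions, not schema values.

```python
def confidence_band(score):
    """Map a confidence score to a band, treating each lower bound as inclusive."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence score out of range: {score}")
    if score >= 0.9:
        return "explicit"        # explicit, unambiguous mention with context
    if score >= 0.7:
        return "clear"           # clear mention with some ambiguity
    if score >= 0.5:
        return "inferred"        # inferred from context, may need verification
    if score >= 0.3:
        return "low"             # low confidence, likely needs verification
    return "manual_review"       # very uncertain, flag for manual review


print(confidence_band(0.85))
```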

Data Tier Assignment

  • TIER_1_AUTHORITATIVE: CSV registries (ISIL, Dutch orgs)
  • TIER_2_VERIFIED: Data from institutional websites (crawl4ai)
  • TIER_3_CROWD_SOURCED: Wikidata, OpenStreetMap
  • TIER_4_INFERRED: NLP-extracted from conversations

Integration with CSV Data

Cross-linking Strategy

  1. ISIL Code Matching (primary)

    • If conversation mentions ISIL code, link to CSV record
    • High confidence match
  2. Name Matching (secondary)

    • Normalize names (lowercase, remove punctuation, handle abbreviations)
    • Fuzzy matching with threshold > 0.85
    • Check for alternative names
  3. Location + Type Matching (tertiary)

    • Match by city + institution type
    • Lower confidence, requires manual verification
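
The three-step cascade above might look like the sketch below. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for a dedicated fuzzy-matching library; records are plain dicts with assumed keys (`isil`, `name`, `city`, `institution_type`), and `match_record` is a hypothetical helper.

```python
import string
from difflib import SequenceMatcher


def normalize_name(name):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    cleaned = name.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


def match_record(candidate, csv_records, threshold=0.85):
    """Return (csv_record, strategy) or (None, None)."""
    # 1. ISIL code match (primary, high confidence)
    isil = candidate.get("isil")
    if isil:
        for record in csv_records:
            if record.get("isil") == isil:
                return record, "isil"

    # 2. Fuzzy name match (secondary)
    target = normalize_name(candidate["name"])
    best, best_score = None, 0.0
    for record in csv_records:
        score = SequenceMatcher(None, target, normalize_name(record["name"])).ratio()
        if score > best_score:
            best, best_score = record, score
    if best is not None and best_score > threshold:
        return best, "name"

    # 3. City + institution type (tertiary, requires manual verification)
    city, kind = candidate.get("city"), candidate.get("institution_type")
    if city and kind:
        for record in csv_records:
            if record.get("city") == city and record.get("institution_type") == kind:
                return record, "location_type"
    return None, None


registry = [{"name": "Amsterdam Museum", "isil": "NL-AsdAM",
             "city": "Amsterdam", "institution_type": "MUSEUM"}]
print(match_record({"name": "amsterdam museum."}, registry))
```

Returning the strategy alongside the record lets the caller lower the confidence score for `"location_type"` matches and queue them for manual verification. Alternative-name checking is not covered here.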

Conflict Resolution

When conversation data conflicts with CSV data:

  • CSV data takes precedence (higher tier)
  • Mark conversation data with verified: false
  • Note conflict in provenance metadata
  • Create separate record if institutions are genuinely different
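
A minimal sketch of tier-based precedence for a single field follows. The tier names come from the Data Tier Assignment list; the `resolve_conflict` helper, the dict shape (`value`, `data_tier`), and the wording of the conflict note are assumptions made for illustration.

```python
# Lower rank means higher authority.
TIER_RANK = {
    "TIER_1_AUTHORITATIVE": 1,
    "TIER_2_VERIFIED": 2,
    "TIER_3_CROWD_SOURCED": 3,
    "TIER_4_INFERRED": 4,
}


def resolve_conflict(field, a, b):
    """Pick the value from the higher tier; return (value, conflict_note_or_None)."""
    winner, loser = sorted((a, b), key=lambda item: TIER_RANK[item["data_tier"]])
    if winner["value"] == loser["value"]:
        return winner["value"], None  # no conflict to record
    note = (
        f"{field}: kept {winner['value']!r} ({winner['data_tier']}), "
        f"rejected {loser['value']!r} ({loser['data_tier']})"
    )
    return winner["value"], note


value, note = resolve_conflict(
    "city",
    {"value": "Haarlem", "data_tier": "TIER_4_INFERRED"},
    {"value": "Amsterdam", "data_tier": "TIER_1_AUTHORITATIVE"},
)
print(value, "|", note)
```

The note is returned rather than written anywhere, so the caller decides where conflict information is stored. The sketch does not decide whether two records describe genuinely different institutions; that still needs a separate check.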

NLP Models and Tools

IMPORTANT: Instead of directly using spaCy or other NER libraries in the main codebase, use coding subagents via the Task tool to conduct Named Entity Recognition and text extraction.

Why Subagents:

  • Keeps the main codebase clean and maintainable
  • Allows flexible experimentation with different NER approaches
  • Subagents can choose the best tool for each specific extraction task
  • Better separation of concerns: extraction logic vs. data pipeline

How to Use Subagents for NER:

  1. Use the Task tool with subagent_type="general" for NER tasks
  2. Provide clear prompts describing what entities to extract
  3. Subagent will autonomously choose and apply appropriate NER tools (spaCy, transformers, regex, etc.)
  4. Subagent returns structured extraction results
  5. Main code validates and processes the results

CRITICAL: Creating LinkML Instance Files

Agent Capabilities Go Beyond Traditional NER

IMPORTANT: AI extraction agents are NOT limited to simple Named Entity Recognition. Unlike traditional NER tools that only identify entity boundaries and types, AI agents have comprehensive understanding and can:

  1. Extract Complete Records: Capture ALL relevant information for each institution in one pass
  2. Infer Missing Data: Use context to fill in fields that aren't explicitly stated
  3. Cross-Reference Within Documents: Link related entities (locations, identifiers, events) automatically
  4. Maintain Consistency: Ensure all extracted data conforms to the LinkML schema
  5. Generate Rich Metadata: Create complete provenance tracking and confidence scores

Mandatory: Create Complete LinkML Instance Files

When extracting data from conversations or other sources, agents MUST:

DO THIS: Create complete LinkML-compliant YAML instance files with ALL available information

# Example: data/instances/brazil_museums_001.yaml
---
# From schemas/core.yaml - HeritageCustodian class

- id: https://w3id.org/heritage/custodian/br/bnb-001
  name: Biblioteca Nacional do Brasil
  institution_type: LIBRARY  # From schemas/enums.yaml
  alternative_names:
    - National Library of Brazil
    - BNB
  description: >-
    The National Library of Brazil, located in Rio de Janeiro, is the largest 
    library in Latin America with over 9 million items. Founded in 1810 by 
    King João VI of Portugal. Collections include rare manuscripts, maps, 
    photographs, and Brazilian historical documents.    
  
  locations:  # From schemas/core.yaml - Location class
    - city: Rio de Janeiro
      street_address: Avenida Rio Branco, 219
      postal_code: "20040-008"
      region: Rio de Janeiro
      country: BR
      # Note: lat/lon can be geocoded later if not in text
  
  identifiers:  # From schemas/core.yaml - Identifier class
    - identifier_scheme: ISIL
      identifier_value: BR-RjBN
      identifier_url: https://isil.org/BR-RjBN
    - identifier_scheme: VIAF
      identifier_value: "123556639"
      identifier_url: https://viaf.org/viaf/123556639
    - identifier_scheme: Wikidata
      identifier_value: Q1526131
      identifier_url: https://www.wikidata.org/wiki/Q1526131
    - identifier_scheme: Website
      identifier_value: https://www.bn.gov.br
      identifier_url: https://www.bn.gov.br
  
  digital_platforms:  # From schemas/core.yaml - DigitalPlatform class
    - platform_name: Digital Library of the National Library of Brazil
      platform_url: https://bndigital.bn.gov.br
      platform_type: DISCOVERY_PORTAL
      metadata_standards:
        - Dublin Core
        - MARC21
  
  collections:  # From schemas/collections.yaml - Collection class
    - collection_name: Brazilian Historical Documents
      collection_type: archival
      subject_areas:
        - Brazilian History
        - Colonial Period
        - Imperial Brazil
      temporal_coverage: "1500-01-01/1889-11-15"
      extent: "Approximately 2.5 million documents"
  
  change_history:  # From schemas/provenance.yaml - ChangeEvent class
    - event_id: https://w3id.org/heritage/custodian/event/bnb-founding-1810
      change_type: FOUNDING
      event_date: "1810-01-01"
      event_description: >-
        Founded by King João VI of Portugal as the Royal Library 
        (Biblioteca Real) when the Portuguese court relocated to Brazil.        
      source_documentation: https://www.bn.gov.br/sobre-bn/historia
  
  provenance:  # From schemas/provenance.yaml - Provenance class
    data_source: CONVERSATION_NLP
    data_tier: TIER_4_INFERRED
    extraction_date: "2025-11-05T14:30:00Z"
    extraction_method: "AI agent comprehensive extraction from Brazilian GLAM conversation"
    confidence_score: 0.92
    conversation_id: "2025-09-22T14-40-15-0102c00a-4c0a-4488-bdca-5dd9fb94c9c5"
    notes: >-
      Extracted from conversation about Brazilian GLAM institutions. 
      Historical founding information cross-referenced from institutional website.      

DO NOT DO THIS: Return minimal JSON with only name and type

// BAD - This is insufficient!
{
  "name": "Biblioteca Nacional do Brasil",
  "institution_type": "LIBRARY"
}

Extraction Workflow for Agents

When processing a conversation or document:

  1. Read Entire Document First: Don't extract piecemeal - understand the full context
  2. Identify ALL Entities: Find every institution, location, identifier, event mentioned
  3. Gather Complete Information: For each institution, extract:
    • Basic metadata (name, type, description)
    • All locations mentioned (even if just city/country)
    • All identifiers (ISIL, Wikidata, VIAF, URLs)
    • Digital platforms and systems
    • Collection information
    • Historical events (founding, mergers, relocations)
    • Relationships to other institutions
  4. Create LinkML YAML: Write a complete instance file with ALL extracted data
  5. Add Provenance: Always include extraction metadata with confidence scores
  6. Validate: Ensure output conforms to schema (use linkml-validate if available)

Example Agent Prompt for Comprehensive Extraction

Extract ALL heritage institutions from the following conversation about Brazilian GLAM institutions.

For EACH institution found, create a COMPLETE LinkML-compliant record including:
- Institution name, type, and description
- ALL locations mentioned (cities, addresses, regions)
- ALL identifiers (ISIL codes, Wikidata IDs, VIAF IDs, URLs)
- Digital platforms, systems, or websites
- Collection information (types, subjects, time periods, extent)
- Historical events (founding dates, mergers, relocations, name changes)
- Relationships to other organizations

Output: YAML file conforming to schemas/core.yaml, schemas/enums.yaml, 
schemas/provenance.yaml, and schemas/collections.yaml

Use your understanding to:
- Infer missing fields from context (e.g., country from city names)
- Consolidate information scattered across multiple conversation turns
- Create rich descriptions summarizing key facts
- Assign appropriate confidence scores based on explicitness of mentions

Remember: You are NOT a simple NER tool. Use your full comprehension abilities 
to create the most complete, accurate, and useful records possible.

Multiple Institutions Per File

When a conversation discusses many institutions, create ONE YAML file with a list:

---
# data/instances/netherlands_limburg_museums.yaml

- id: https://w3id.org/heritage/custodian/nl/bonnefantenmuseum
  name: Bonnefantenmuseum
  institution_type: MUSEUM
  # ... complete record ...

- id: https://w3id.org/heritage/custodian/nl/thermenmuseum
  name: Thermenmuseum
  institution_type: MUSEUM
  # ... complete record ...

- id: https://w3id.org/heritage/custodian/nl/limburgs-museum
  name: Limburgs Museum
  institution_type: MUSEUM
  # ... complete record ...

Field Completion Strategies

Even when information is incomplete, do your best:

  • No explicit institution type? Infer from context ("national library" → LIBRARY)
  • Only city mentioned? That's fine - add locations: [{city: "Amsterdam", country: "NL"}]
  • No ISIL code? Check if you can infer the format (NL-CityCode) or leave it out
  • No description? Create one from available facts
  • Uncertain data? Lower the confidence score but still include it

Validation and Quality Control

After creating instance files:

  1. Schema Validation: If possible, run linkml-validate -s schemas/heritage_custodian.yaml data/instances/your_file.yaml
  2. Completeness Check: Ensure every institution has at minimum:
    • id (generate from country + institution name slug)
    • name
    • institution_type
    • provenance (with data_source, extraction_date, confidence_score)
  3. Consistency Check: Same institution mentioned multiple times? Merge into one record
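  4. Quality Flags: If confidence < 0.5, add a note explaining the uncertainty. The Provenance model has no notes field (see "Provenance Model Quirks" below), so record the explanation in the custodian's description.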
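
The completeness check can be automated before schema validation runs. The sketch below works on plain dicts as loaded from a YAML instance file; `completeness_problems` is a hypothetical helper, and the required-field lists simply mirror the minimum stated above.

```python
REQUIRED_FIELDS = ("id", "name", "institution_type", "provenance")
REQUIRED_PROVENANCE = ("data_source", "extraction_date", "confidence_score")


def completeness_problems(record):
    """Return a list of human-readable problems; an empty list means the minimum is met."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not record.get(field)]
    provenance = record.get("provenance") or {}
    problems += [
        f"missing provenance.{field}"
        for field in REQUIRED_PROVENANCE
        if provenance.get(field) is None
    ]
    score = provenance.get("confidence_score")
    if score is not None and score < 0.5:
        problems.append("low confidence: explain the uncertainty")
    return problems


record = {
    "id": "https://w3id.org/heritage/custodian/nl/bonnefantenmuseum",
    "name": "Bonnefantenmuseum",
    "institution_type": "MUSEUM",
    "provenance": {
        "data_source": "CONVERSATION_NLP",
        "extraction_date": "2025-11-05T14:30:00Z",
        "confidence_score": 0.4,
    },
}
print(completeness_problems(record))
```

This only checks presence. Enum values, date formats, and nested classes are left to `linkml-validate`.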

Extraction Stack (for Subagents)

When subagents perform extraction, they may use:

  1. Pattern matching for identifiers (primary approach)

    • Regex for ISIL, VIAF, Wikidata IDs
    • URL extraction and normalization
    • High precision, no dependencies
  2. NER libraries (via subagents only)

    • spaCy: en_core_web_trf, nl_core_news_lg, xx_ent_wiki_sm
    • Transformers for classification
    • Used by subagents, not directly in main code
  3. Fuzzy matching for deduplication

    • rapidfuzz library
    • Levenshtein distance for name matching

Processing Pipeline

Conversation JSON
    ↓
Parse & Extract Text
    ↓
[SUBAGENT] NER Extraction
  - Subagent uses spaCy/transformers/patterns
  - Returns structured entities
    ↓
Pattern Matching (identifiers, URLs)
    ↓
Classification (institution type, standards)
    ↓
Geocoding (locations)
    ↓
Cross-link with CSV (ISIL/name matching)
    ↓
LinkML Validation
    ↓
Export (RDF, JSON-LD, CSV, Parquet)

Agent Interaction Patterns

When Asked to Extract Data from Conversations

  1. Start Small: Begin with 1-2 conversation files to test extraction logic
  2. Show Examples: Display extracted entities with confidence scores
  3. Ask for Validation: Show uncertain extractions for user confirmation
  4. Iterate: Refine patterns based on feedback
  5. Batch Process: Once patterns are validated, process all 139 files

When Asked to Design NLP Components

  1. Reference Schema: Always refer to the modular schema v0.2.1:
    • Core classes: schemas/core.yaml (HeritageCustodian, Location, Identifier, etc.)
    • Enumerations: schemas/enums.yaml (InstitutionTypeEnum, ChangeTypeEnum, etc.)
    • Provenance: schemas/provenance.yaml (Provenance, ChangeEvent, etc.)
    • See schema overview in the "Schema Reference (v0.2.1)" section above
  2. Consult Base Ontologies: BEFORE designing extraction logic, review relevant ontologies:
    • Dutch institutions: Study TOOI ontology (/data/ontology/tooiont.ttl)
    • EU/global institutions: Study CPOV ontology (/data/ontology/core-public-organisation-ap.ttl)
    • All institutions: Reference Schema.org patterns (/data/ontology/schemaorg.owl)
    • See "Base Ontologies for Global GLAM Data" section above for decision tree
  3. Use Design Patterns: Follow patterns in docs/plan/global_glam/05-design-patterns.md
  4. Track Provenance: Every extraction must include provenance metadata (from schemas/provenance.yaml)
  5. Handle Multilingual: Conversations cover 60+ countries, expect multilingual content
  6. Error Handling: Use Result pattern, never fail silently
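
For the Result pattern mentioned in the last item, a minimal sketch is shown below. The `Result` class and `parse_confidence` function are illustrative assumptions; the project's actual Result type, if one exists, may have a different shape.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Result(Generic[T]):
    """Carries either a value or an error message, never an exception."""
    value: Optional[T] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def parse_confidence(raw: str) -> Result[float]:
    """Parse a confidence score, reporting problems instead of raising or hiding them."""
    try:
        score = float(raw)
    except ValueError:
        return Result(error=f"not a number: {raw!r}")
    if not 0.0 <= score <= 1.0:
        return Result(error=f"out of range: {score}")
    return Result(value=score)


for raw in ("0.85", "high", "1.5"):
    result = parse_confidence(raw)
    print(raw, "->", result.value if result.ok else f"ERROR: {result.error}")
```

The caller has to look at `ok` before using `value`, which is what keeps failures from passing silently through the pipeline.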

When Asked to Validate Data

  1. LinkML Validation: Use linkml-validate to check schema compliance
  2. Cross-reference: Compare with CSV data when applicable
  3. Check Identifiers: Validate ISIL format, check Wikidata exists
  4. Geographic Verification: Geocode addresses, verify country codes
  5. Duplicate Detection: Use fuzzy matching to find potential duplicates

Example Agent Workflows

Workflow 1: Extract Brazilian Institutions

# User request
"Extract all museum, library, and archive names from the Brazilian GLAM conversation"

# Agent actions
1. Read conversation: 2025-09-22T14-40-15-0102c00a-4c0a-4488-bdca-5dd9fb94c9c5-Brazilian_GLAM_collection_inventories.json
2. Parse chat_messages array
3. **Launch subagent** to extract institutions using NER
   - Subagent analyzes text and extracts ORG entities
   - Filters for heritage-related keywords
   - Classifies institution types
   - Returns structured results
4. Extract locations (cities in Brazil)
5. Geocode using Nominatim
6. Create HeritageCustodian records
7. Add provenance metadata (data_source: CONVERSATION_NLP, extraction_method: "Subagent NER")
8. Validate with LinkML schema
9. Export to JSON-LD
10. Report results with confidence scores

Workflow 2: Cross-link Dutch CSV with Conversations

# User request
"Cross-link the Dutch organizations CSV with any Dutch institutions found in conversations"

# Agent actions
1. Load data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv
2. Parse into DutchHeritageCustodian records
3. Extract all NL-* ISIL codes
4. Search all conversation files for mentions of these ISIL codes
5. Fuzzy match organization names
6. For matches:
   - Merge metadata
   - Mark CSV data as TIER_1
   - Mark conversation data as TIER_4
   - Resolve conflicts (CSV wins)
7. For Dutch institutions in conversations NOT in CSV:
   - Create new records
   - Mark as TIER_4
   - Flag for verification
8. Export merged dataset

Workflow 3: Build Global Institution Map

# User request
"Create a geographic distribution map of all extracted institutions"

# Agent actions
1. Process all 139 conversation files
2. **Launch subagent(s)** to extract institution names + locations from each file
3. Geocode all addresses
4. Group by country
5. Count institutions per country
6. Generate GeoJSON for mapping
7. Create visualization (Leaflet, Mapbox, etc.)
8. Export statistics:
   - Institutions per country
   - Institutions per type
   - Geographic coverage
   - Data quality (tier distribution)

Multi-language Considerations

Language Detection

  • Detect language of conversation content
  • Subagents will choose appropriate NER models per language
  • Multilingual support handled by subagents

Common Languages in Dataset

  • English (international institutions)
  • Dutch (Netherlands institutions)
  • Portuguese (Brazil)
  • Spanish (Latin America, Spain)
  • Vietnamese, Japanese, Thai, Korean, Arabic, Russian, etc.

Translation Strategy

  • DO NOT translate institution names (preserve original)
  • Optionally translate descriptions for searchability
  • Store language tags with text fields
  • Use multilingual identifiers (Wikidata) for linking

Output Formats

Primary Output: JSON-LD

Linked Data format for semantic web integration:

{
  "@context": "https://w3id.org/heritage/custodian/context.jsonld",
  "@type": "HeritageCustodian",
  "@id": "https://example.org/institution/123",
  "name": "Amsterdam Museum",
  "institution_type": "MUSEUM",
  ...
}

Secondary Outputs

  • RDF/Turtle: For SPARQL querying
  • CSV: For spreadsheet analysis
  • Parquet: For data warehousing
  • SQLite: For local querying

Testing and Validation

Unit Tests

Test extraction functions with known inputs:

def test_extract_isil_codes():
    text = "The ISIL code NL-AsdAM identifies Amsterdam Museum"
    codes = extract_isil_codes(text)
    assert codes == [{"scheme": "ISIL", "value": "NL-AsdAM"}]

Integration Tests

Test full pipeline with sample conversations:

def test_brazilian_museum_extraction():
    conversation = load_json("Brazilian_GLAM_collection_inventories.json")
    records = extract_heritage_custodians(conversation)
    assert len(records) > 0
    assert all(r.provenance.data_source == "CONVERSATION_NLP" for r in records)

Validation Tests

Ensure LinkML schema compliance:

def test_linkml_validation():
    record = create_heritage_custodian(...)
    validator = SchemaValidator(schema="heritage_custodian.yaml")
    result = validator.validate(record)
    assert result.is_valid

Performance Optimization

Batch Processing

  • Process conversations in parallel (multiprocessing)
  • Cache geocoding results (15-minute TTL)
  • Deduplicate entity extraction

Incremental Updates

  • Track last processed timestamp
  • Only process new/updated conversations
  • Maintain state in SQLite database
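
One possible shape for the processing state is a single SQLite table keyed by filename. The table name `processed`, its columns, and the three helper functions are assumptions for this sketch. Timestamps are compared as strings, which is only correct if they all use the same ISO 8601 format and timezone.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open (or create) the state database. ':memory:' is used here for demonstration."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed ("
        "filename TEXT PRIMARY KEY, modified_at TEXT NOT NULL)"
    )
    return conn


def needs_processing(conn, filename, modified_at):
    """True if the file is new or has changed since it was last processed."""
    row = conn.execute(
        "SELECT modified_at FROM processed WHERE filename = ?", (filename,)
    ).fetchone()
    return row is None or row[0] < modified_at


def mark_processed(conn, filename, modified_at):
    """Record that the file was processed at its current modification time."""
    conn.execute(
        "INSERT OR REPLACE INTO processed (filename, modified_at) VALUES (?, ?)",
        (filename, modified_at),
    )
    conn.commit()


conn = open_state()
print(needs_processing(conn, "conversation.json", "2025-11-05T00:00:00Z"))
mark_processed(conn, "conversation.json", "2025-11-05T00:00:00Z")
print(needs_processing(conn, "conversation.json", "2025-11-05T00:00:00Z"))
```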

Resource Management

  • Limit concurrent API calls (Nominatim: 1 req/sec)
  • Use connection pooling for HTTP requests
  • Stream large JSON files instead of loading into memory
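
The geocoding cache and the 1 request/second limit can be combined in one wrapper. The sketch below makes no network calls: `fetch` is whatever function actually queries the geocoder, and the `ThrottledCache` class, its parameter names, and the injectable `clock` / `sleep` hooks are assumptions introduced for illustration. It is single-threaded, so parallel workers would need a shared limiter.

```python
import time


class ThrottledCache:
    """Wrap a lookup function with a TTL cache and a minimum gap between real calls."""

    def __init__(self, fetch, min_interval=1.0, ttl=900.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch
        self._min_interval = min_interval  # seconds between real calls (1 req/sec)
        self._ttl = ttl                    # cache lifetime in seconds (15 minutes)
        self._clock = clock
        self._sleep = sleep
        self._cache = {}                   # key -> (expires_at, value)
        self._last_call = None

    def get(self, key):
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                  # fresh cache hit: no call, no waiting
        if self._last_call is not None:
            wait = self._min_interval - (now - self._last_call)
            if wait > 0:
                self._sleep(wait)          # respect the rate limit
        value = self._fetch(key)
        self._last_call = self._clock()
        self._cache[key] = (self._last_call + self._ttl, value)
        return value


# Demonstration with a fake lookup instead of a real geocoder.
geocoder = ThrottledCache(lambda place: {"query": place}, min_interval=0.01)
print(geocoder.get("Aalten"))
print(geocoder.get("Aalten"))  # served from cache
```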

Error Handling

Common Errors and Solutions

  1. JSON Parsing Errors

    • Malformed JSON files
    • Solution: Validate JSON schema, report file path
  2. NER Model Errors

    • Missing spaCy model
    • Solution: Provide installation instructions, download automatically
  3. Geocoding Failures

    • Unknown location, rate limit exceeded
    • Solution: Cache results, implement backoff, mark as unverified
  4. LinkML Validation Failures

    • Required field missing, invalid enum value
    • Solution: Log validation errors, provide field mapping
  5. Encoding Issues

    • Non-UTF-8 characters
    • Solution: Use UTF-8 everywhere, handle decode errors gracefully

Schema Quirks and Implementation Notes

IMPORTANT: These are critical implementation details discovered during development. Read carefully to avoid bugs.

Provenance Model Quirks

The Provenance model does NOT have a notes field:

# ❌ WRONG - Provenance has no 'notes' field
provenance = Provenance(
    data_source=DataSource.CSV_REGISTRY,
    notes="Some observation"  # This will fail!
)

# ✅ CORRECT - Use HeritageCustodian.description instead
custodian = HeritageCustodian(
    name="Museum Name",
    description="Notes and remarks go here",  # Put notes here
    provenance=Provenance(...)
)

Field Naming Conventions

Always use the correct field names (check the schema when in doubt):

# ❌ WRONG
custodian.institution_types  # Plural, list
custodian.location           # Singular

# ✅ CORRECT
custodian.institution_type   # Singular, single enum value
custodian.locations          # Plural, always a list (even with one item)

Pydantic v1 Enum Behavior

This project uses Pydantic v1. Enum fields are already strings, not enum objects:

# ❌ WRONG - Don't use .value accessor
print(custodian.institution_type.value)  # AttributeError!

# ✅ CORRECT - Enum fields are already strings
print(custodian.institution_type)  # "MUSEUM", "ARCHIVE", etc.

# Same for platform types
platform.platform_type  # Already a string, not an enum object

Required vs. Optional Fields

Many fields are optional but have validation rules. Always check for None:

# Optional fields that may be None
custodian.locations          # Optional[List[Location]]
custodian.identifiers        # Optional[List[Identifier]]
custodian.digital_platforms  # Optional[List[DigitalPlatform]]
custodian.description        # Optional[str]

# Always check before iterating
if custodian.locations:
    for location in custodian.locations:
        print(location.city)

CSV Parsing Best Practices

  1. Handle UTF-8 BOM: Use encoding='utf-8-sig' when reading CSVs
  2. Normalize headers: Strip whitespace, handle multiline headers
  3. Warn on errors: Skip invalid rows but log warnings
  4. Preserve originals: Store raw CSV data in intermediate models before conversion

Example:

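import csv

from pydantic import ValidationError  # assuming parse_row raises Pydantic's ValidationError

# csv_path and parse_row are assumed to be supplied by the calling parser
with open(csv_path, 'r', encoding='utf-8-sig', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        try:
            record = parse_row(row)
        except ValidationError as e:
            # Skip the invalid row but leave a visible warning
            print(f"Warning: Skipping row {row}: {e}")
            continue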

Date Handling

Dates may be in various formats or empty:

from datetime import datetime, timezone

# Handle empty dates (the value may be missing, None, or whitespace)
date_str = (row.get('toegekend_op') or '').strip()
assigned_date = datetime.fromisoformat(date_str) if date_str else None

# Provenance extraction_date is required (use current time)
extraction_date = datetime.now(timezone.utc)

Testing Strategies

  1. Unit tests: Test model validation with known inputs
  2. Integration tests: Test full file parsing with fixtures
  3. Edge case tests: Empty files, malformed rows, minimal data
  4. Real data tests: Always validate with actual CSV files

Fixture placement matters:

import pytest

# ❌ WRONG - Fixture defined inside a class is not visible to other test classes
class TestFoo:
    @pytest.fixture
    def sample_file(self):
        ...

# ✅ CORRECT - Module-level fixture is available to every test class in the module
@pytest.fixture
def sample_file():  # At module level, not in a class
    ...

Next Steps for Agents

When continuing this project, agents should:

  1. Implement Parser Module (src/glam_extractor/parsers/) COMPLETE

    • ISIL registry parser (10 tests, 84% coverage)
    • Dutch organizations parser (18 tests, 98% coverage)
    • Conversation JSON parser (next priority)
  2. Implement Extractor Module (src/glam_extractor/extractors/)

    • spaCy NER integration
    • Pattern-based identifier extraction
    • Institution type classifier
    • Relationship extractor
  3. Implement Geocoder Module (src/glam_extractor/geocoding/)

    • Nominatim client with caching
    • GeoNames integration
    • Coordinate validation
  4. Implement Validator Module (src/glam_extractor/validators/)

    • LinkML schema validator
    • Cross-reference validator (CSV vs. conversation)
    • Duplicate detector
  5. Implement Exporter Module (src/glam_extractor/exporters/)

    • JSON-LD exporter
    • RDF/Turtle exporter
    • CSV exporter
    • Parquet exporter
    • SQLite database builder
  6. Create Test Fixtures (tests/fixtures/)

    • Sample conversation JSONs
    • Expected extraction outputs
    • Validation test cases
  7. Document Agent Prompts (docs/agent-prompts/)

    • Reusable prompts for common extraction tasks
    • Few-shot examples for LLM-based extraction
    • Quality review checklists

Persistent Identifiers (GHCID)

🚨 CRITICAL POLICY: REAL IDENTIFIERS ONLY 🚨

SYNTHETIC Q-NUMBERS ARE STRICTLY PROHIBITED IN THIS PROJECT.

All Wikidata Q-numbers used in GHCIDs MUST be:

  • Real Wikidata entity identifiers (verified via API query)
  • Confirmed to match the institution (fuzzy match score > 0.85)
  • Resolvable at https://www.wikidata.org/wiki/Q[number]

NEVER generate synthetic/fake Q-numbers from hashes, numeric IDs, or algorithms
NEVER append Q-numbers that don't correspond to real Wikidata entities
NEVER use placeholder Q-numbers like Q99999999, Q90000000, etc.

If no Wikidata Q-number is available:

  1. Use base GHCID without Q-suffix (e.g., NL-NH-AMS-M-HM)
  2. Flag institution with needs_wikidata_enrichment: true
  3. Run Wikidata enrichment workflow to obtain real Q-number
  4. Update GHCID only after real Q-number is verified

Rationale: Q-numbers are part of the Linked Open Data ecosystem. Using fake Q-numbers breaks semantic web integrity, creates citation errors, and violates W3C best practices for persistent identifiers.
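
A syntactic pre-check can catch malformed or known placeholder Q-numbers before any API call. The sketch below is only a first filter, under the assumption that Q-numbers are stored as strings like "Q621531"; it cannot prove that an identifier is real or matches the institution, which still requires the Wikidata query described in the policy. The function names are hypothetical.

```python
import re

Q_NUMBER_PATTERN = re.compile(r"Q[1-9][0-9]*")
# Placeholder values named in the policy text; always rejected
KNOWN_PLACEHOLDERS = {"Q99999999", "Q90000000"}


def is_plausible_q_number(value):
    """True if value looks like a Wikidata Q-number and is not a known placeholder."""
    if not isinstance(value, str):
        return False
    return bool(Q_NUMBER_PATTERN.fullmatch(value)) and value not in KNOWN_PLACEHOLDERS


def wikidata_url(q_number):
    """Build the entity URL that a real Q-number must resolve at."""
    if not is_plausible_q_number(q_number):
        raise ValueError(f"Not a plausible Wikidata Q-number: {q_number!r}")
    return f"https://www.wikidata.org/wiki/{q_number}"
```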


GHCID uses a four-identifier strategy for maximum flexibility and transparency:

Four Identifier Formats

  1. UUID v5 (SHA-1) - PRIMARY persistent identifier

    • Deterministic (same GHCID string → same UUID)
    • RFC 4122 standard, universal library support
    • Transparent algorithm (anyone can verify)
    • Field: ghcid_uuid
  2. UUID v8 (SHA-256) - Secondary persistent identifier (future-proofing)

    • Deterministic with stronger cryptographic hash
    • Uses a hash function with no known practical collision attacks
    • Field: ghcid_uuid_sha256
  3. UUID v7 - Database record ID ONLY (NOT for persistent identification)

    • Time-ordered for database performance
    • NOT deterministic (different each time)
    • Use for database primary keys, NOT for citations or cross-system references
    • Field: record_id
  4. Numeric (64-bit) - Compact identifier for CSV exports

    • Deterministic (SHA-256 → 64-bit integer)
    • Database optimization, spreadsheet-friendly
    • Field: ghcid_numeric
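
The four formats can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's algorithm: the namespace UUID, the choice to hash the bare GHCID string, and which digest bytes are kept are all hypothetical here. docs/PERSISTENT_IDENTIFIERS.md remains the authority for the real derivation.

```python
import hashlib
import os
import time
import uuid

# Hypothetical namespace; the project's actual namespace UUID is not given in this section
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://w3id.org/heritage/custodian/")


def ghcid_uuid_v5(ghcid):
    """UUID v5 (SHA-1, name-based): same GHCID string always gives the same UUID."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


def ghcid_uuid_v8_sha256(ghcid):
    """UUID v8 built from the first 16 bytes of SHA-256 (assumed layout)."""
    raw = bytearray(hashlib.sha256(ghcid.encode("utf-8")).digest()[:16])
    raw[6] = (raw[6] & 0x0F) | 0x80  # set version nibble to 8
    raw[8] = (raw[8] & 0x3F) | 0x80  # set RFC variant bits
    return uuid.UUID(bytes=bytes(raw))


def ghcid_numeric(ghcid):
    """64-bit integer from the first 8 bytes of SHA-256 (assumed truncation)."""
    return int.from_bytes(hashlib.sha256(ghcid.encode("utf-8")).digest()[:8], "big")


def record_id_v7():
    """UUID v7: 48-bit millisecond timestamp plus random bits; not deterministic."""
    ts_ms = time.time_ns() // 1_000_000
    raw = bytearray(ts_ms.to_bytes(6, "big") + os.urandom(10))
    raw[6] = (raw[6] & 0x0F) | 0x70  # set version nibble to 7
    raw[8] = (raw[8] & 0x3F) | 0x80  # set RFC variant bits
    return uuid.UUID(bytes=bytes(raw))
```

Only the first three functions are reproducible from the GHCID string; `record_id_v7()` returns a new value on every call, which is why it is unsuitable for citations.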

Critical Understanding: UUID v5 is Primary

Why UUID v5 (SHA-1) over UUID v8 (SHA-256)?

The primary identifier is UUID v5 because:

  • Transparency - Anyone can verify using standard uuid.uuid5() function
  • Reproducibility - No custom algorithm to share, RFC 4122 defines it
  • Interoperability - Most mainstream programming languages support UUID v5 in their standard library or a widely used package
  • Community Trust - Public, standardized algorithm builds confidence

SHA-1 Safety for Identifiers:

SHA-1 is deprecated for cryptographic security (digital signatures, TLS, passwords) but appropriate for identifier generation:

  • Heritage institution identifiers are non-adversarial (no attacker trying to forge museum IDs)
  • The 122 hash-derived bits of a UUID v5 (6 of the 128 bits encode version and variant) make accidental collisions negligible (birthday bound: P(collision) ≈ 9.4×10^-26 for 1M institutions)
  • UUID v5 remains a current IETF standard (RFC 9562, which obsoletes RFC 4122, retains it)
  • See Why GHCID Uses UUID v5 and SHA-1 for detailed rationale
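
The accidental-collision estimate can be checked with the birthday bound, P ≈ n(n-1)/2 divided by 2^bits. A minimal sketch, assuming the hash output behaves like uniformly random bits and that a UUID v5 carries 122 hash-derived bits; the function name is hypothetical:

```python
def birthday_collision_probability(n_items, hash_bits):
    """Approximate probability that at least two of n_items share an identifier.

    Uses the small-probability birthday approximation n(n-1)/2 / 2**hash_bits,
    which is accurate when the result is far below 1.
    """
    pairs = n_items * (n_items - 1) // 2
    return pairs / 2 ** hash_bits


p_uuid5 = birthday_collision_probability(1_000_000, 122)
print(f"{p_uuid5:.2e}")  # roughly 9.4e-26
```

This covers accidental collisions only; deliberately crafted collisions are the separate, non-adversarial consideration discussed in the list above.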

Future-Proofing:

  • We generate both UUID v5 and UUID v8 for every institution
  • Can migrate to the SHA-256 identifier as primary if the UUID standard adds a SHA-256 name-based version
  • Both are deterministic - no data loss in migration

When Extracting Data

Agents should generate ALL four identifiers for every institution:

# Example extraction output (UUID and numeric values are illustrative placeholders)
- id: https://w3id.org/heritage/custodian/br/bnb-001
  name: Biblioteca Nacional do Brasil
  ghcid: BR-RJ-RIO-L-BNB
  ghcid_uuid: "550e8400-e29b-51d4-a716-446655440000"  # UUID v5 - PRIMARY
  ghcid_uuid_sha256: "a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d"  # UUID v8 - Secondary
  ghcid_numeric: 213324328442227739  # 64-bit numeric
  # Note: UUID v7 (record_id) generated at database insertion, not during extraction

GHCID Collision Handling for AI Agents

CRITICAL: When extracting heritage institution data, AI agents MUST understand and apply temporal collision resolution rules to maintain PID stability.

The Collision Problem

Multiple institutions may generate the same base GHCID (before Q-number addition):

  • Two museums in Amsterdam abbreviated "SM": NL-NH-AMS-M-SM
  • Two historical societies in Utrecht: NL-UT-UTR-S-HK
  • Two libraries in São Paulo abbreviated "BM": BR-SP-SAO-L-BM

Decision Tree for Collision Resolution

When extracting data, agents should follow this decision process:

1. Generate base GHCID (without Q-number)
   ↓
2. Check if base GHCID exists in published dataset
   ↓
   NO → Use base GHCID as-is, record extraction_date
   ↓
   YES → Temporal priority check
   ↓
3. Compare extraction_date with existing publication_date
   ↓
   SAME DATE (batch import) → First Batch Collision
      ├─ ALL institutions get Q-numbers
      ├─ Extract Wikidata Q-number from identifiers
      └─ Append to GHCID: NL-NH-AMS-M-SM-Q621531
   ↓
   LATER DATE (historical addition) → Historical Addition
      ├─ PRESERVE existing GHCID (no modification)
      ├─ ONLY new institution gets Q-number
      └─ New GHCID: NL-NH-AMS-M-HM-Q17339437
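
The decision tree above can be sketched as a small classifier. Assumptions, none of them confirmed by the specification: "same date" is compared at calendar-day granularity, the published dataset is represented as a dict from base GHCID to publication timestamp, and an extraction date earlier than the publication date is treated as an error because the tree does not cover it. All names are hypothetical.

```python
from datetime import datetime
from enum import Enum


class CollisionKind(Enum):
    NONE = "none"
    FIRST_BATCH = "first_batch"
    HISTORICAL_ADDITION = "historical_addition"


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def classify_collision(base_ghcid, extraction_date, published):
    """Classify a new record against published base GHCIDs.

    published maps base GHCID -> publication timestamp (ISO 8601 string).
    """
    existing = published.get(base_ghcid)
    if existing is None:
        return CollisionKind.NONE  # step 2: base GHCID not yet published
    new_day = parse_timestamp(extraction_date).date()
    old_day = parse_timestamp(existing).date()
    if new_day == old_day:
        return CollisionKind.FIRST_BATCH  # all colliding records get Q-numbers
    if new_day > old_day:
        return CollisionKind.HISTORICAL_ADDITION  # only the new record gets one
    raise ValueError("extraction_date precedes publication_date; not covered by the tree")
```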

Implementation Rules for Agents

Rule 1: Always Track Provenance Timestamp

provenance:
  data_source: CONVERSATION_NLP
  data_tier: TIER_4_INFERRED
  extraction_date: "2025-11-15T14:30:00Z"  # ← REQUIRED for collision detection
  extraction_method: "AI agent NER extraction"
  confidence_score: 0.92

Rule 2: Detect Collisions by Base GHCID

Before adding Q-numbers, group institutions by base GHCID:

# Collision detection pseudocode for agents
base_ghcid = generate_base_ghcid(institution)  # Without Q-number
existing_records = published_dataset.filter(base_ghcid=base_ghcid)

if len(existing_records) > 0:
    # Collision detected - apply temporal priority
    apply_collision_resolution(institution, existing_records)
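
Within a single batch, the same grouping idea can be sketched with a dictionary. This assumes each record is a dict carrying a precomputed `base_ghcid` key, as in the unit tests later in this document; `find_collisions` is a hypothetical helper name:

```python
from collections import defaultdict


def find_collisions(institutions):
    """Group records by base GHCID and keep only groups with more than one member."""
    groups = defaultdict(list)
    for institution in institutions:
        groups[institution["base_ghcid"]].append(institution)
    return {base: members for base, members in groups.items() if len(members) > 1}


batch = [
    {"name": "Stedelijk Museum Amsterdam", "base_ghcid": "NL-NH-AMS-M-SM"},
    {"name": "Science Museum Amsterdam", "base_ghcid": "NL-NH-AMS-M-SM"},
    {"name": "Hermitage Amsterdam", "base_ghcid": "NL-NH-AMS-M-HM"},
]
collisions = find_collisions(batch)
```

Grouping before any Q-number is appended matters: once suffixes are added, the colliding records no longer share a key.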

Rule 3: First Batch - ALL Get Q-Numbers

If ALL colliding institutions have the same extraction_date:

# Example: 2025-11-01 batch import discovers two institutions (Q-numbers shown for illustration; always verify against Wikidata)
- name: Stedelijk Museum Amsterdam
  ghcid: NL-NH-AMS-M-SM-Q621531  # Gets Q-number
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"

- name: Science Museum Amsterdam  
  ghcid: NL-NH-AMS-M-SM-Q98765432  # Gets Q-number
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"  # Same date = first batch

Rule 4: Historical Addition - ONLY New Gets Q-Number

If the new institution's extraction_date is later than the existing record's:

# EXISTING (2025-11-01, already published):
- name: Hermitage Amsterdam
  ghcid: NL-NH-AMS-M-HM  # ← NO CHANGE (PID stability!)
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"

# NEW (2025-11-15, historical addition):
- name: Historical Museum Amsterdam
  ghcid: NL-NH-AMS-M-HM-Q17339437  # ← ONLY new gets Q-number
  provenance:
    extraction_date: "2025-11-15T14:30:00Z"

Q-Number Assignment Priority

CRITICAL POLICY: SYNTHETIC Q-NUMBERS ARE STRICTLY PROHIBITED

When collision requires Q-number, agents MUST obtain REAL Wikidata Q-numbers. Synthetic/generated Q-numbers are NEVER acceptable.

Required Process:

  1. Extract Wikidata Q-number from existing identifiers (if available):

    wikidata_ids = [
        i for i in institution['identifiers']
        if i['identifier_scheme'] == 'Wikidata'
    ]
    q_number = wikidata_ids[0]['identifier_value'] if wikidata_ids else None
    # Result: Q621531 → NL-NH-AMS-M-SM-Q621531
    
  2. Query Wikidata API to find real Q-number (if not in identifiers):

    # Search Wikidata by name, location, and institution type
    q_number = query_wikidata_api(
        name=institution['name'],
        location=institution['locations'][0]['city'],
        country=institution['locations'][0]['country'],
        instance_of='museum'  # or library, archive, etc.
    )
    # Use fuzzy matching with threshold > 0.85
    # Verify match quality before accepting
    
  3. NO Q-number available: LEAVE GHCID WITHOUT Q-SUFFIX

    if not q_number:
        # ✅ CORRECT - Use base GHCID without Q-number
        ghcid = base_ghcid  # e.g., NL-NH-AMS-M-HM
    
        # Mark for manual Wikidata lookup
        institution['provenance']['notes'] = (
            "Collision detected but no Wikidata Q-number available. "
            "Manual Wikidata search required before GHCID can be finalized."
        )
    
        # ❌ NEVER DO THIS - Generate synthetic Q-number
        # synthetic_q = f"Q{ghcid_numeric % 100000000}"  # FORBIDDEN!
    

Why Synthetic Q-Numbers Are Prohibited:

  • Fake identifiers - Not real Wikidata entities, breaks Linked Data integrity
  • Collision risk - Synthetic Q-numbers may conflict with real future Wikidata IDs
  • Loss of trust - Consumers expect Q-numbers to resolve to Wikidata entities
  • Semantic web violation - RDF triples that use fake Q-numbers point to nonexistent or unrelated Wikidata entities
  • Data quality degradation - Masks the need for proper Wikidata enrichment

Acceptable Workflow for Missing Q-Numbers:

  1. Extract institution data with base GHCID (no Q-suffix)
  2. Flag institution for Wikidata enrichment
  3. Run Wikidata query script to find real Q-numbers
  4. Update GHCID with real Q-number after verification
  5. Record GHCID change in ghcid_history

See also: Section "Data Tier Assignment" - Wikidata identifiers are TIER_3_CROWD_SOURCED and require verification against authoritative source.

Wikidata Enrichment Workflow:

When institutions need Q-numbers (for collision resolution or identifier completeness):

  1. Batch Query Wikidata API:

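    from SPARQLWrapper import SPARQLWrapper, JSON
    
    def query_wikidata_for_institution(name, city, country, inst_type):
        """Query Wikidata SPARQL endpoint for heritage institutions.
    
        Returns the list of result bindings (one dict per candidate item).
        The name is not used in the query; candidates are matched by name in step 2.
        get_wikidata_class() and get_wikidata_location() are project helpers that
        map an institution type and a place to Wikidata Q-identifiers.
        """
        endpoint = "https://query.wikidata.org/sparql"
    
        # SPARQL query for museums, libraries, archives in specific location
        query = f"""
        SELECT ?item ?itemLabel ?viaf ?isil WHERE {{
          ?item wdt:P31/wdt:P279* wd:{get_wikidata_class(inst_type)} .
          ?item wdt:P131* wd:{get_wikidata_location(city, country)} .
          OPTIONAL {{ ?item wdt:P214 ?viaf }}
          OPTIONAL {{ ?item wdt:P791 ?isil }}
          SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,nl,pt,es" }}
        }}
        """
    
        # Wikidata asks clients to send a descriptive User-Agent (example value)
        sparql = SPARQLWrapper(endpoint, agent="glam-extractor Wikidata enrichment")
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)
        # Return only the bindings so callers can iterate over results directly
        return sparql.query().convert()["results"]["bindings"]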
    
  2. Fuzzy Match Results:

    from rapidfuzz import fuzz
    
    def match_institution_to_wikidata(institution_name, wikidata_results):
        """Match institution name to Wikidata query results."""
        best_match = None
        best_score = 0
    
        for result in wikidata_results:
            wd_label = result['itemLabel']['value']
            score = fuzz.ratio(institution_name.lower(), wd_label.lower())
    
            # rapidfuzz scores range 0-100, so 85 corresponds to the 0.85 policy threshold
            if score > best_score and score > 85:
                best_match = result
                best_score = score
    
        return best_match, best_score
    
  3. Update Institution Record:

    from datetime import datetime, timezone
    
    if wikidata_match and match_score > 85:
        # Extract Q-number from Wikidata URI
        q_number = wikidata_match['item']['value'].split('/')[-1]
    
        # Add to identifiers
        institution['identifiers'].append({
            'identifier_scheme': 'Wikidata',
            'identifier_value': q_number,
            'identifier_url': f'https://www.wikidata.org/wiki/{q_number}'
        })
    
        # Update GHCID if collision requires it
        if collision_detected:
            institution['ghcid'] = f"{base_ghcid}-{q_number}"
    
        # Record enrichment in provenance (append so earlier enrichments are kept)
        institution['provenance'].setdefault('enrichment_history', []).append({
            'enrichment_date': datetime.now(timezone.utc).isoformat(),
            'enrichment_method': 'Wikidata SPARQL query',
            'match_score': match_score,
            'verified': True
        })
    

Existing Wikidata Enrichment Scripts:

  • scripts/enrich_latam_institutions_fuzzy.py - Latin America Wikidata enrichment
  • scripts/enrich_global_with_wikidata.py - Global enrichment (to be created)

See also: docs/WIKIDATA_ENRICHMENT.md for detailed Wikidata query strategies

GHCID History Tracking

When Q-number is added to resolve collision, update ghcid_history:

ghcid_history:
  - ghcid: NL-NH-AMS-M-HM-Q17339437  # Current (with Q-number)
    ghcid_numeric: 789012345678
    valid_from: "2025-11-15T14:30:00Z"  # When Q-number added
    valid_to: null
    reason: "Q-number added to resolve collision with existing NL-NH-AMS-M-HM (Hermitage Amsterdam)"
  
  - ghcid: NL-NH-AMS-M-HM  # Original (without Q-number)
    ghcid_numeric: 123456789012
    valid_from: "2025-11-15T14:00:00Z"  # When first extracted
    valid_to: "2025-11-15T14:30:00Z"   # When collision detected
    reason: "Base GHCID from geographic location and institution name"
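
A history update like the one above can be sketched as a helper that closes the currently valid entry and prepends the new one, keeping the newest-first ordering shown in the example. The function name and the plain-dict representation are assumptions; the schema classes in schemas/provenance.yaml are authoritative.

```python
def record_ghcid_change(history, new_ghcid, new_numeric, changed_at, reason):
    """Close the open history entry and prepend the new current entry.

    history is a list of dicts ordered newest first; changed_at is an
    ISO 8601 timestamp string used as both valid_to and valid_from.
    """
    for entry in history:
        if entry.get("valid_to") is None:
            entry["valid_to"] = changed_at  # the previous GHCID stops being current
    history.insert(0, {
        "ghcid": new_ghcid,
        "ghcid_numeric": new_numeric,
        "valid_from": changed_at,
        "valid_to": None,
        "reason": reason,
    })
    return history
```

Using one timestamp for both the closing `valid_to` and the opening `valid_from` leaves no gap or overlap between consecutive entries.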

PID Stability Principle - "Cool URIs Don't Change"

NEVER modify a published GHCID. Once exported to RDF, JSON-LD, or CSV, a GHCID becomes a persistent identifier that may be:

  • Cited in academic papers - Journal articles referencing heritage collections
  • Used in external APIs - Third-party systems querying our data
  • Embedded in linked data - RDF triples in knowledge graphs
  • Referenced in finding aids - Archival descriptions linking to institutions

Changing a published GHCID breaks these external references. Per W3C "Cool URIs Don't Change":

  • Correct: Add Q-number to NEW institution (historical addition)
  • WRONG: Retroactively add Q-number to EXISTING published GHCID

Error Handling for Agents

Scenario 1: Missing Provenance Timestamp

import logging
from datetime import datetime, timezone

log = logging.getLogger(__name__)
if 'extraction_date' not in institution['provenance']:
    # Use current timestamp as fallback
    institution['provenance']['extraction_date'] = datetime.now(timezone.utc).isoformat()
    # Log warning for manual review
    log.warning(f"Missing extraction_date for {institution['name']}, using current time")

Scenario 2: Multiple Historical Additions

# Three institutions generate NL-UT-UTR-S-HK
# Extraction dates: 2025-11-01, 2025-11-15, 2025-12-01

# Result:
# 2025-11-01: NL-UT-UTR-S-HK (first, no Q-number)
# 2025-11-15: NL-UT-UTR-S-HK-Q45678 (second, gets Q-number)
# 2025-12-01: NL-UT-UTR-S-HK-Q91234 (third, gets Q-number)

Scenario 3: No Wikidata Q-Number Available

if not wikidata_q_number:
    # ✅ CORRECT - Use base GHCID without Q-suffix
    ghcid = base_ghcid  # e.g., NL-NH-AMS-M-HM
    
    # Mark for manual Wikidata enrichment
    institution['needs_wikidata_enrichment'] = True
    institution['provenance']['notes'] = (
        "Collision detected but no Wikidata Q-number available. "
        "Institution flagged for manual Wikidata lookup before GHCID finalization."
    )
    
    # Log warning for human review
    log.warning(
        f"Institution '{institution['name']}' requires Wikidata Q-number "
        f"to resolve GHCID collision with base '{base_ghcid}'"
    )
    
    # ❌ NEVER DO THIS - Generate synthetic Q-number (FORBIDDEN!)
    # synthetic_q = f"Q{institution['ghcid_numeric'] % 100000000}"
    # This violates the project's data quality policy

Validation Checklist for Agents

Before publishing extracted data, verify:

  • All institutions have extraction_date in provenance metadata
  • Collisions detected by grouping on base GHCID (without Q-number)
  • First batch collisions: ALL instances have Q-numbers
  • Historical additions: ONLY new instances have Q-numbers
  • No published GHCIDs modified (PID stability test)
  • GHCID history entries created with valid temporal ordering
  • All Q-numbers are real, verified Wikidata identifiers; institutions without one keep the base GHCID and are flagged needs_wikidata_enrichment
  • Collision reasons documented in ghcid_history
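
Two of these checks (presence of extraction_date and PID stability) lend themselves to automation. A minimal sketch, assuming records are dicts with `id`, `ghcid`, and `provenance` keys and that the previously published state is available as a dict from record id to GHCID; both the representation and the name `check_records` are hypothetical:

```python
def check_records(records, previously_published):
    """Return a list of human-readable problems; an empty list means both checks pass."""
    problems = []
    for record in records:
        record_id = record.get("id", "<no id>")
        provenance = record.get("provenance") or {}
        if not provenance.get("extraction_date"):
            problems.append(f"{record_id}: missing provenance.extraction_date")
        old_ghcid = previously_published.get(record_id)
        if old_ghcid is not None and old_ghcid != record.get("ghcid"):
            problems.append(
                f"{record_id}: published GHCID changed from {old_ghcid} to {record.get('ghcid')}"
            )
    return problems
```

The remaining checklist items depend on collision grouping and Wikidata verification and are left to the dedicated steps described earlier.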

Example Extraction Prompts for Agents

Prompt Template for NLP Extraction:

Extract heritage institutions from this conversation about [REGION] GLAM institutions.

For EACH institution:
1. Generate base GHCID using geographic location, institution type, and name abbreviation
2. Check for collisions with previously published GHCIDs
3. Apply temporal priority rule:
   - If collision with same extraction_date → First Batch (all get Q-numbers)
   - If collision with earlier publication_date → Historical Addition (only new gets Q-number)
4. Extract Wikidata Q-number from conversation text if mentioned, and verify it against Wikidata before use
5. Create GHCID history entry documenting collision resolution
6. Include extraction_date in provenance metadata

Output: LinkML-compliant YAML with complete collision handling

Prompt Template for CSV Parsing:

Parse this heritage institution CSV file dated [DATE].

All rows have the same extraction_date ([DATE]).

If multiple institutions generate the same base GHCID:
- This is a FIRST BATCH collision
- ALL colliding institutions MUST receive Q-numbers
- Extract Q-numbers from Wikidata_ID column
- Document collision in ghcid_history

Output: YAML with collision resolution applied

Testing Strategies for Collision Handling

Unit Test: First Batch Collision

def test_first_batch_collision():
    """Two institutions extracted same day with same base GHCID."""
    institutions = [
        {
            'name': 'Stedelijk Museum Amsterdam',
            'base_ghcid': 'NL-NH-AMS-M-SM',
            'identifiers': [{'identifier_scheme': 'Wikidata', 'identifier_value': 'Q621531'}],
            'provenance': {'extraction_date': '2025-11-01T10:00:00Z'}
        },
        {
            'name': 'Science Museum Amsterdam',
            'base_ghcid': 'NL-NH-AMS-M-SM',
            'identifiers': [{'identifier_scheme': 'Wikidata', 'identifier_value': 'Q98765432'}],
            'provenance': {'extraction_date': '2025-11-01T10:00:00Z'}
        }
    ]
    
    resolved = resolve_collisions(institutions)
    
    # Both should have Q-numbers
    assert resolved[0]['ghcid'] == 'NL-NH-AMS-M-SM-Q621531'
    assert resolved[1]['ghcid'] == 'NL-NH-AMS-M-SM-Q98765432'

Unit Test: Historical Addition

def test_historical_addition():
    """New institution added later with same base GHCID."""
    published = {
        'name': 'Hermitage Amsterdam',
        'ghcid': 'NL-NH-AMS-M-HM',  # Already published
        'provenance': {'extraction_date': '2025-11-01T10:00:00Z'}
    }
    
    new_institution = {
        'name': 'Historical Museum Amsterdam',
        'base_ghcid': 'NL-NH-AMS-M-HM',  # Collision!
        'identifiers': [{'identifier_scheme': 'Wikidata', 'identifier_value': 'Q17339437'}],
        'provenance': {'extraction_date': '2025-11-15T14:30:00Z'}
    }
    
    resolved = resolve_collision(new_institution, published_dataset=[published])
    
    # Published GHCID unchanged
    assert published['ghcid'] == 'NL-NH-AMS-M-HM'
    
    # New institution gets Q-number
    assert resolved['ghcid'] == 'NL-NH-AMS-M-HM-Q17339437'
    
    # GHCID history created
    assert len(resolved['ghcid_history']) == 2
    assert resolved['ghcid_history'][0]['ghcid'] == 'NL-NH-AMS-M-HM-Q17339437'

References for Collision Handling

  • Specification: docs/PERSISTENT_IDENTIFIERS.md - "Historical Collision Resolution" section
  • Algorithm: docs/plan/global_glam/07-ghcid-collision-resolution.md - Temporal dimension and decision logic
  • Examples: docs/GHCID_PID_SCHEME.md - Timeline examples with real institutions
  • Implementation: scripts/regenerate_historical_ghcids.py - Code comments documenting collision handling
  • Schema: schemas/provenance.yaml - GHCIDHistoryEntry and ChangeEvent classes

See also:

  • docs/PERSISTENT_IDENTIFIERS.md - Complete identifier format documentation
  • docs/UUID_STRATEGY.md - UUID v5 vs v7 vs v8 comparison
  • docs/WHY_UUID_V5_SHA1.md - SHA-1 safety rationale

References

  • Schema (v0.2.0):
    • Main: schemas/heritage_custodian.yaml
    • Core classes: schemas/core.yaml
    • Enumerations: schemas/enums.yaml
    • Provenance: schemas/provenance.yaml
    • Collections: schemas/collections.yaml
    • Dutch extensions: schemas/dutch.yaml
    • Architecture: docs/SCHEMA_MODULES.md
  • Persistent Identifiers:
    • Overview: docs/PERSISTENT_IDENTIFIERS.md
    • UUID Strategy: docs/UUID_STRATEGY.md
    • SHA-1 Rationale: docs/WHY_UUID_V5_SHA1.md
    • GHCID PID Scheme: docs/GHCID_PID_SCHEME.md
    • Collision Resolution: docs/plan/global_glam/07-ghcid-collision-resolution.md
  • Architecture: docs/plan/global_glam/02-architecture.md
  • Data Standardization: docs/plan/global_glam/04-data-standardization.md
  • Design Patterns: docs/plan/global_glam/05-design-patterns.md
  • Dependencies: docs/plan/global_glam/03-dependencies.md

Version: 0.2.0
Schema Version: v0.2.0 (modular)
Last Updated: 2025-11-05
Maintained By: GLAM Data Extraction Project