kempersc c50c35fd3a enrich person custodian

2025-12-14 17:09:55 +01:00

182 KiB

Raw Blame History

AI Agent Instructions for GLAM Data Extraction

This document provides instructions for AI agents (particularly OpenCODE and Claude) to assist with extracting heritage institution data from conversation JSON files and other sources.

🎯 PROJECT CORE MISSION

PRIMARY OBJECTIVE: Create a comprehensive, nuanced ontology that accurately represents the complex, temporal, multi-faceted nature of heritage custodian institutions worldwide.

This is NOT a simple data extraction project. This is an ontology engineering project that:

Models heritage entities as multi-aspect temporal entities (place, custodian, legal form, collections, people)
Integrates multiple base ontologies (CPOV, TOOI, CIDOC-CRM, RiC-O, Schema.org, PiCo)
Captures organizational change events over time (custody transfers, mergers, transformations)
Distinguishes between nominal references and formal organizational structures
Links heritage custodians to people, collections, and locations with independent temporal lifecycles

If you're looking for simple NER extraction, this is not the right project.

🚨 CRITICAL RULES FOR ALL AGENTS

Rule 0: LinkML Schemas Are the Single Source of Truth

MASTER SCHEMA LOCATION: schemas/20251121/linkml/

The LinkML schema files are the authoritative, canonical definition of the Heritage Custodian Ontology:

Primary Schema File (SINGLE SOURCE OF TRUTH):

schemas/20251121/linkml/01_custodian_name.yaml - Complete Heritage Custodian Ontology
- Defines CustodianObservation (source-based references to heritage keepers)
- Defines CustodianName (standardized emic names)
- Defines CustodianReconstruction (formal entities: individuals, groups, organizations, governments, corporations)
- Includes ISO 20275 legal form codes (for legal entities)
- PiCo-inspired observation/reconstruction pattern
- Based on CIDOC-CRM E39_Actor (broader than organization)

ALL OTHER FILES ARE DERIVED/GENERATED from these LinkML schemas:

❌ DO NOT edit these derived files directly:

schemas/20251121/rdf/*.{ttl,nt,jsonld,rdf,n3,trig,trix} - GENERATED from LinkML via gen-owl + rdfpipe
schemas/20251121/typedb/*.tql - DERIVED TypeDB schema (manual translation from LinkML)
schemas/20251121/uml/mermaid/*.mmd - DERIVED UML diagrams (manual visualization of LinkML)
schemas/20251121/examples/*.yaml - INSTANCES conforming to LinkML schema

Workflow for Schema Changes:

1. EDIT LinkML schema (01_custodian_name.yaml)
   ↓
2. REGENERATE RDF formats (WITH FULL TIMESTAMPS):
   $ TIMESTAMP=$(date +%Y%m%d_%H%M%S)
   $ gen-owl -f ttl schemas/20251121/linkml/01_custodian_name.yaml 2>/dev/null > schemas/20251121/rdf/01_custodian_name_${TIMESTAMP}.owl.ttl
   $ rdfpipe schemas/20251121/rdf/01_custodian_name_${TIMESTAMP}.owl.ttl -o nt > schemas/20251121/rdf/01_custodian_name_${TIMESTAMP}.nt
   $ rdfpipe schemas/20251121/rdf/01_custodian_name_${TIMESTAMP}.owl.ttl -o jsonld > schemas/20251121/rdf/01_custodian_name_${TIMESTAMP}.jsonld
   $ # ... repeat for all 8 formats (ALL with same timestamp)
   ↓
3. REGENERATE UML diagrams (WITH FULL TIMESTAMPS):
   $ gen-yuml schemas/20251121/linkml/01_custodian_name.yaml > schemas/20251121/uml/mermaid/01_custodian_name_${TIMESTAMP}.mmd
   ↓
4. UPDATE TypeDB schema (manual translation)
   ↓
5. VALIDATE example instances:
   $ linkml-validate -s schemas/20251121/linkml/01_custodian_name.yaml schemas/20251121/examples/example.yaml

🚨 CRITICAL RULE: Full Timestamps Required

ALL generated files MUST include full timestamps (date AND time) in filenames:

Format: {base_name}_{YYYYMMDD}_{HHMMSS}.{extension}

Examples:

# ✅ CORRECT - Full timestamp (date + time)
custodian_multi_aspect_20251122_154430.owl.ttl
custodian_multi_aspect_20251122_154430.nt
custodian_multi_aspect_20251122_154430.jsonld
custodian_multi_aspect_20251122_154136.mmd

# ❌ WRONG - Date only (MISSING TIME!)
custodian_multi_aspect_20251122.owl.ttl

# ❌ WRONG - No timestamp
01_custodian_name.owl.ttl

Rationale: Full timestamps (date + time) allow multiple generation runs per day without filename conflicts, enable precise version tracking, and provide clear audit trails for schema evolution.

See: .opencode/SCHEMA_GENERATION_RULES.md for complete generation rules

Why LinkML is the Master:

✅ Formal specification: Type-safe, validation rules, cardinality constraints
✅ Multi-format generation: Single source → RDF, JSON-LD, Python, SQL, GraphQL
✅ Version control: Clear diffs, semantic versioning, change tracking
✅ Ontology alignment: Explicit class_uri and slot_uri mappings to base ontologies
✅ Documentation: Rich inline documentation with examples

NEVER:

❌ Edit RDF files directly (they will be overwritten on next generation)
❌ Consider TypeDB schema as authoritative (it's a translation target)
❌ Treat UML diagrams as specification (they're visualizations)

ALWAYS:

✅ Refer to LinkML schemas for class definitions
✅ Update LinkML first, then regenerate derived formats
✅ Validate changes against LinkML metamodel
✅ Document schema changes in LinkML YAML comments

See also:

schemas/20251121/RDF_GENERATION_SUMMARY.md - RDF generation process documentation
docs/MIGRATION_GUIDE.md - Schema migration procedures
LinkML documentation: https://linkml.io/

Rule 1: Ontology Files Are Your Primary Reference

BEFORE designing any schema, class, or property:

READ the base ontology files in /data/ontology/
SEARCH for existing classes and properties that match your needs
DOCUMENT your ontology alignment with explicit rationale
NEVER invent custom properties when ontology equivalents exist

Available Ontologies:

data/ontology/core-public-organisation-ap.ttl - CPOV (EU public sector)
data/ontology/tooiont.ttl - TOOI (Dutch government)
data/ontology/schemaorg.owl - Schema.org (web semantics, private sector)
data/ontology/CIDOC_CRM_v7.1.3.rdf - CIDOC-CRM (cultural heritage domain)
data/ontology/RiC-O_1-1.rdf - Records in Contexts (archival description)
data/ontology/bibframe_vocabulary.rdf - BIBFRAME (libraries)
data/ontology/pico.ttl - PiCo (person observations, staff roles)

See .opencode/agent/ontology-mapping-rules.md for complete ontology consultation workflow.

Rule 2: Wikidata Entities Are NOT Ontology Classes

Files:

data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml
data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml

These files contain:

✅ Wikidata entity identifiers (Q-numbers) for heritage institution TYPES
✅ Multilingual labels and descriptions
✅ Hypernym classifications (upper-level categories)
✅ Source data for ontology mapping analysis

These files DO NOT contain:

❌ Formal ontology class definitions
❌ Direct class_uri mappings for LinkML
❌ Ontology properties or relationships

REQUIRED WORKFLOW:

hyponyms_curated.yaml (Wikidata Q-numbers)
    ↓
ANALYZE semantic meaning + hypernyms
    ↓
SEARCH base ontologies for matching classes
    ↓
MAP Wikidata entity → Ontology class(es)
    ↓
DOCUMENT rationale + properties
    ↓
CREATE LinkML schema with ontology class_uri

Example - WRONG ❌:

Mansion:
  class_uri: wd:Q1802963  # ← This is an ENTITY, not a CLASS!

Example - CORRECT ✅:

Mansion:
  # Wikidata source: Q1802963
  place_aspect:
    class_uri: crm:E27_Site  # CIDOC-CRM ontology class
  custodian_aspect:
    class_uri: cpov:PublicOrganisation  # If operates as museum

Rule 3: Multi-Aspect Modeling is Mandatory

Every heritage entity has MULTIPLE ontological aspects with INDEPENDENT temporal lifecycles.

Required Aspects:

Place Aspect (physical location/site)
- Ontology: crm:E27_Site + schema:Place
- Temporal: Construction → Demolition/Present
Custodian Aspect (organization managing heritage)
- Ontology: cpov:PublicOrganisation OR schema:Organization
- Temporal: Founding → Dissolution/Present
Legal Form Aspect (legal entity registration)
- Ontology: org:FormalOrganization + tooi:Overheidsorganisatie (Dutch)
- Temporal: Registration → Deregistration/Present
Collections Aspect (heritage materials)
- Ontology: rico:RecordSet OR crm:E78_Curated_Holding OR bf:Collection
- Temporal: Accession → Deaccession (per item)
People Aspect (staff, curators)
- Ontology: pico:PersonObservation + crm:E21_Person
- Temporal: Employment start → Employment end (per person)
Temporal Events (organizational changes)
- Ontology: crm:E10_Transfer_of_Custody, rico:Event
- Tracks custody transfers, mergers, relocations, transformations

Example: A historic mansion operating as a museum has:

Place aspect: Building constructed 1880, still standing (143 years)
Custodian aspect: Foundation established 1994 to operate museum (30 years)
Legal form: Dutch stichting registered 1994, KvK #12345678
Collections: Mondrian artworks acquired 1994-2024
People: Current curator employed 2020-present

Each aspect changes independently over time!

Rule 5: NEVER Delete Enriched Data - Additive Only

🚨 CRITICAL: Data enrichment is ADDITIVE ONLY. Never delete or overwrite existing enriched content.

When restructuring or updating enriched institution records:

✅ ALLOWED (Additive Operations):

Add new fields or sections
Restructure YAML/JSON layout while preserving all content
Rename files (e.g., _unknown.yaml → _museum_name.yaml)
Add provenance metadata
Merge data from multiple sources (preserving all)

❌ FORBIDDEN (Destructive Operations):

Delete Google Maps data (reviews, ratings, photo counts, popular times)
Remove OpenStreetMap metadata
Overwrite website scrape results
Delete Wikidata enrichment data
Remove any *_enrichment sections
Truncate or summarize detailed content

Data Types That Must NEVER Be Deleted:

Data Source	Protected Fields
Google Maps	`reviews`, `rating`, `total_ratings`, `photo_count`, `popular_times`, `place_id`, `business_status`
OpenStreetMap	`osm_id`, `osm_type`, `osm_tags`, `amenity`, `building`, `heritage`
Wikidata	`wikidata_id`, `claims`, `sitelinks`, `aliases`, `descriptions`
Website Scrape	`organization_details`, `collections`, `exhibitions`, `contact`, `social_media`, `accessibility`
ISIL Registry	`isil_code`, `assigned_date`, `remarks`

Example - CORRECT Restructuring:

# BEFORE (flat structure)
google_maps_rating: 4.5
google_maps_reviews: 127
website_description: "Historic museum..."

# AFTER (nested structure) - ALL DATA PRESERVED
enrichment_sources:
  google_maps:
    rating: 4.5          # ← PRESERVED
    reviews: 127         # ← PRESERVED
  website:
    description: "Historic museum..."  # ← PRESERVED

Example - WRONG (Data Loss):

# BEFORE
google_maps_enrichment:
  rating: 4.5
  reviews: 127
  popular_times: {...}
  photos: [...]

# AFTER - WRONG! Data deleted!
enrichment_status: enriched
# Where did the rating, reviews, popular_times go?!

Rationale:

Enriched data is expensive to collect (API calls, rate limits, web scraping)
Google Maps data changes over time - historical snapshots are valuable
Reviews and ratings provide quality signals for heritage institutions
Photo metadata enables visual discovery and verification
Deleting data violates data provenance principles

If Unsure: When restructuring files, first READ the entire file, then WRITE a new version that includes ALL original content in the new structure.

Rule 6: WebObservation Claims MUST Have XPath Provenance

Every claim extracted from a webpage MUST have an XPath pointer to the exact location in archived HTML where that value appears. Claims without XPath provenance are FABRICATED and must be removed.

This is not about "confidence" or "uncertainty" - it's about verifiability. Either the claim value exists in the HTML at a specific XPath, or it was hallucinated/fabricated by an LLM.

Required Fields for WebObservation Claims:

Field	Required	Description
`claim_type`	YES	Type of claim (full_name, description, email, etc.)
`claim_value`	YES	The extracted value
`source_url`	YES	URL the claim was extracted from
`retrieved_on`	YES	ISO 8601 timestamp when page was archived
`xpath`	YES	XPath to the element containing this value
`html_file`	YES	Relative path to archived HTML file
`xpath_match_score`	YES	1.0 for exact match, <1.0 for fuzzy match

Example - CORRECT (Verifiable):

web_enrichment:
  claims:
    - claim_type: full_name
      claim_value: Historische Vereniging Nijeveen
      source_url: https://historischeverenigingnijeveen.nl/
      retrieved_on: "2025-11-29T12:28:00Z"
      xpath: /[document][1]/html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[6]
      html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
      xpath_match_score: 1.0

Example - WRONG (Fabricated - Must Be Removed):

web_enrichment:
  claims:
    - claim_type: full_name
      claim_value: Historische Vereniging Nijeveen
      confidence: 0.95  # ← NO! This is meaningless without XPath

Workflow:

Archive website using Playwright: python scripts/fetch_website_playwright.py <entry> <url>
Add XPath provenance: python scripts/add_xpath_provenance.py
Script removes fabricated claims (stored in removed_unverified_claims for audit)

See:

.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md for complete documentation
schemas/20251121/linkml/modules/classes/WebClaim.yaml for LinkML schema definition

Rule 4: Technical Classes Are Excluded from Visualizations

Some LinkML classes exist solely for validation purposes and have NO semantic significance.

These "scaffolding" classes are essential for building (validation, parsing) but are not part of the final ontology structure. They MUST be excluded from UML diagrams, entity-relationship diagrams, and other semantic visualizations.

Currently Excluded Classes:

Class	Purpose	Why Excluded
`Container`	LinkML `tree_root` for instance validation	No `class_uri`, not serialized to RDF, purely structural

The Container Class:

Definition: schemas/20251121/linkml/modules/classes/Container.yaml

Container:
  description: >-
    Root container for validating complete instance documents.
    Not used in RDF serialization (flattened).    
  tree_root: true  # ← This is why it exists
  # Notice: NO class_uri - not mapped to any ontology

Purpose:

Enables linkml-validate -C Container instance.yaml validation
Provides entry point for parsing YAML/JSON instance documents
Has NO semantic meaning in the domain model

Exclusion Implementation:

Python generators: EXCLUDED_CLASSES = {"Container"} in scripts/generate_*.py
Frontend parsers: EXCLUDED_CLASSES Set in frontend/src/components/uml/UMLParser.ts

Verification:

# Check that Container is NOT in diagrams
grep "Container" schemas/20251121/uml/mermaid/complete_schema_*.mmd
# Should return no matches

# Verify class count (should be 71, not 72)
grep -c "class " schemas/20251121/uml/mermaid/complete_schema_*.mmd | tail -1

See: .opencode/LINKML_TECHNICAL_CLASSES.md for complete documentation and how to add new exclusions.

Rule 7: Deployment is LOCAL via SSH/rsync (NO CI/CD)

🚨 CRITICAL: This project has NO GitHub Actions, NO CI/CD pipelines. ALL deployments are executed LOCALLY via SSH and rsync.

Server Information:

Provider: Hetzner Cloud
IP Address: 91.98.224.44
SSH User: root
Frontend Path: /var/www/glam-frontend/
Data Path: /mnt/data/

Deployment Script: infrastructure/deploy.sh

# Deploy frontend only (most common)
./infrastructure/deploy.sh --frontend

# Deploy data files only
./infrastructure/deploy.sh --data

# Deploy infrastructure configs
./infrastructure/deploy.sh --infra

# Deploy everything
./infrastructure/deploy.sh --all

# Check server status
./infrastructure/deploy.sh --status

# Reload Caddy server
./infrastructure/deploy.sh --reload

Prerequisites:

.env file with HETZNER_HC_API_TOKEN (for Hetzner Cloud API)
SSH access to server (ssh root@91.98.224.44)
Local dependencies installed (npm, node for frontend builds)

Frontend Deployment Process:

1. npm run build (in frontend/)
2. Sync LinkML schemas to frontend/public/schemas/
3. Generate manifest.json for schema versions
4. rsync build output to server /var/www/glam-frontend/

AI Agent Rules:

✅ DO:

Use ./infrastructure/deploy.sh --frontend after frontend changes
Verify deployment with ./infrastructure/deploy.sh --status
Build frontend locally before deploying (cd frontend && npm run build)
Check SSH connectivity before attempting deployment

❌ DON'T:

Create GitHub Actions workflows
Create CI/CD configuration files
Assume automatic deployment on git push
Deploy without building first
Modify server configurations directly (use --infra flag)

Post-Deployment Verification:

# Check server is responding
curl -I https://91.98.224.44

# SSH and check files
ssh root@91.98.224.44 "ls -la /var/www/glam-frontend/"

# Check Caddy status
ssh root@91.98.224.44 "systemctl status caddy"

See:

.opencode/DEPLOYMENT_RULES.md for complete deployment rules
docs/DEPLOYMENT_GUIDE.md for comprehensive deployment documentation
infrastructure/deploy.sh for script source code

Rule 8: Legal Form Terms MUST Be Filtered from CustodianName

🚨 CRITICAL EXCEPTION TO EMIC PRINCIPLE: Legal form designations are ALWAYS filtered from CustodianName, even when the custodian self-identifies with them.

This is the ONE EXCEPTION to the emic (insider name) principle. Legal forms are metadata about the entity, not part of its identity.

Why This Rule Exists:

Legal Form ≠ Identity: "Stichting Rijksmuseum" has legal form "Stichting" (foundation), but the IDENTITY is "Rijksmuseum"
Legal Forms Change: Foundations become corporations, associations become foundations - the identity persists
Cross-Jurisdictional Consistency: "Getty Foundation" (US) and "Stichting Getty" (NL) are the same identity
Deduplication: Prevents "Museum X" and "Stichting Museum X" appearing as separate entities
ISO 20275 Alignment: Legal Entity Identifier standard separates legal form from entity name

Examples:

Source Name	CustodianName	Legal Form (metadata)
Stichting Rijksmuseum	Rijksmuseum	Stichting
Hidde Nijland Stichting	Hidde Nijland	Stichting
The Getty Foundation	The Getty	Foundation
British Museum Trust Ltd	British Museum	Trust Ltd
Fundação Biblioteca Nacional	Biblioteca Nacional	Fundação
Verein Deutsches Museum	Deutsches Museum	Verein
Association des Amis du Louvre	Amis du Louvre	Association

Legal Form Terms to Filter (partial list by language):

Language	Terms to Remove
Dutch	Stichting, Coöperatie, Maatschap, B.V., N.V., V.O.F., C.V.
English	Foundation, Trust, Inc., Incorporated, Ltd., Limited, LLC, Corp., Corporation
German	Stiftung, e.V., GmbH, AG, KG, OHG
French	Fondation, S.A., S.A.R.L., S.C.I., S.N.C.
Spanish	Fundación, S.A., S.L., S.L.L.
Portuguese	Fundação, Ltda., S.A.
Italian	Fondazione, S.p.A., S.r.l.

⚠️ Terms NOT to Filter (these describe organizational purpose, not legal form):

Language	Keep These (Part of Identity)
Dutch	Vereniging, Genootschap, Kring, Bond
English	Association, Society, Guild, Club
German	Verein, Gesellschaft, Bund
French	Association, Société, Cercle
Spanish	Asociación, Sociedad
Portuguese	Associação, Sociedade
Italian	Associazione, Società

Rationale: "Vereniging" (Dutch), "Association" (English/French), "Verein" (German), etc. describe what the organization IS (a voluntary association of members), not just how it's legally registered. "Historische Vereniging Nijeveen" is fundamentally different from "Stichting Rijksmuseum" - the former is a membership organization, the latter is a foundation. Filtering these terms destroys meaningful identity information.

Implementation:

CustodianName.yaml: Legal form filtering documented in schema description
Extraction Scripts: Must strip legal form terms before storing custodian_name
GHCID Generation: Uses filtered name for abbreviation generation
Validation: Scripts should flag names containing legal form terms

Where Legal Form IS Stored:

Legal form is NOT discarded - it is stored in separate metadata fields:

CustodianLegalStatus.legal_form - The ISO 20275 legal form code
CustodianLegalStatus.legal_name - Full registered name including legal form
CustodianObservation.observed_name - Original name as observed in source (may include legal form)

See: .opencode/LEGAL_FORM_FILTERING_RULE.md for comprehensive global legal form list

Rule 9: Enum-to-Class Promotion - Single Source of Truth

🚨 CRITICAL: When an enum is promoted to a class hierarchy, the original enum MUST be deleted. Never maintain parallel enum/class definitions.

This project follows the Single Source of Truth principle for type systems. An enum and a class hierarchy representing the same concept creates dangerous duplication that leads to:

Inconsistent values between enum and class definitions
Confusion about which to use in slot ranges
Maintenance burden of keeping both in sync
Validation failures from mismatched values

Lifecycle of Enum-to-Class Promotion:

1. ENUM (Scaffolding Phase)
   - Simple list of values
   - No properties needed yet
   - Used in slot range: MyEnum
   ↓
2. DECISION: Needs properties/hierarchy?
   - Does concept need properties (description, category, related_to)?
   - Does concept need inheritance (subtypes)?
   - Does concept need rich documentation per value?
   ↓
3. CLASS HIERARCHY (Production Phase)
   - Abstract base class with shared properties
   - Concrete subclasses for each value
   - Used in slot range: MyClass (polymorphic)
   ↓
4. DELETE ENUM ← MANDATORY!
   - Archive enum file: modules/enums/archive/MyEnum.yaml.archived_YYYYMMDD
   - Remove from manifest.json
   - Remove from instance files
   - Update all slot ranges to use class

Enums That Remain Permanent (based on external standards, no properties needed):

DataTierEnum - Fixed 4-tier quality model
CountryCodeEnum - ISO 3166-1 alpha-2 codes
LanguageCodeEnum - ISO 639-1 codes
Simple status enums with no semantic richness

Example: StaffRoleTypeEnum → StaffRole Class Hierarchy

# BEFORE (Enum phase - scaffolding)
StaffRoleTypeEnum:
  permissible_values:
    CURATOR: {}
    ARCHIVIST: {}
    CONSERVATOR: {}

staff_role:
  range: StaffRoleTypeEnum  # Simple enum reference

# AFTER (Class hierarchy - production)
StaffRole:
  abstract: true
  slots:
    - role_category      # RoleCategoryEnum
    - typical_domains    # List of CustodianType
    - pico_alignment     # PiCo ontology class_uri

Curator:
  is_a: StaffRole
  slot_usage:
    role_category: { equals_string: CURATORIAL }

staff_role:
  range: StaffRole  # Polymorphic class reference

# StaffRoleTypeEnum.yaml → ARCHIVED (deleted from active schema)

When You Encounter This Situation:

Check for existing class hierarchy: Search modules/classes/ for equivalent class
If class exists: Use class in slot range, ensure enum is archived
If promoting enum: Create class hierarchy, update slot ranges, archive enum
Document migration: Add comment in archived file explaining reason

Archive Location: schemas/20251121/linkml/archive/enums/

See:

.opencode/ENUM_TO_CLASS_PRINCIPLE.md for complete documentation
docs/ENUM_CLASS_SINGLE_SOURCE.md for case study (StaffRoleTypeEnum migration)

Rule 10: CH-Annotator is the Entity Annotation Convention

🚨 CRITICAL: All entity annotation in this project MUST follow the CH-Annotator convention (ch_annotator-v1_7_0).

Convention Identity:

ID: ch_annotator-v1_7_0
Full Name: CH-Annotator (Cultural Heritage Annotator)
Version: 1.7.0
Status: PRODUCTION
Formerly Known As: GLAM-NER v1.7.0-unified (renamed 2025-12-06)

Primary File Location:

data/entity_annotation/ch_annotator-v1_7_0.yaml

What CH-Annotator Covers:

Named Entity Recognition (NER)
Property Extraction
Entity Resolution
Entity Linking
Claim Validation
Document Structure Annotation

Hypernym Entity Types (9 categories):

Code	Hypernym	Primary Ontology
AGT	AGENT	`crm:E39_Actor`
GRP	GROUP	`crm:E74_Group`
TOP	TOPONYM	`crm:E53_Place`
GEO	GEOMETRY	`geo:Geometry`
TMP	TEMPORAL	`crm:E52_Time-Span`
APP	APPELLATION	`crm:E41_Appellation`
ROL	ROLE	`org:Role`
WRK	WORK	`frbroo:F1_Work`
QTY	QUANTITY	`crm:E54_Dimension`

Heritage Institution Subtype:

GRP.HER:  # Heritage Custodian
  subtypes:
    - GRP.HER.GAL  # Gallery
    - GRP.HER.LIB  # Library
    - GRP.HER.ARC  # Archive
    - GRP.HER.MUS  # Museum
    # ... (matches GLAMORCUBESFIXPHDNT taxonomy)

When Using CH-Annotator, Reference in Provenance:

provenance:
  data_source: CONVERSATION_NLP
  extraction_method: ch_annotator-v1_7_0  # ← Convention ID
  extraction_date: "2025-12-06T10:00:00Z"
  confidence_score: 0.92

Claim Provenance Model (5 required components):

claim:
  claim_type: full_name
  claim_value: "Rijksmuseum Amsterdam"
  provenance:
    namespace: skos                        # 1. Ontology prefix
    path: /html/body/h1[1]                 # 2. XPath/JSONPath
    timestamp: "2025-12-06T10:00:00Z"      # 3. ISO 8601
    agent: ch_annotator-v1_7_0             # 4. Extraction model
    context_convention: ch_annotator-v1_7_0 # 5. Convention version

Digital Humanities Authority Stack (Primary):

TEI P5 (document structure, names)
CIDOC-CRM 7.1.3 (cultural heritage modeling)
TimeML/TIMEX3 (temporal expressions)
FRBR/LRM (bibliographic entities)
GeoSPARQL (spatial geometry)

NERD Status: DEPRECATED for semantic precision. Retained ONLY for NLP pipeline interchange.

See:

.opencode/CH_ANNOTATOR_CONVENTION.md for complete documentation
data/entity_annotation/ch_annotator-v1_7_0.yaml for full convention
data/entity_annotation/modules/ for modular hypernym definitions

Rule 11: Z.AI GLM API for LLM Tasks (NOT BigModel)

CRITICAL: When using GLM models in scripts, use the Z.AI Coding Plan endpoint, NOT the regular BigModel API.

The project uses the same Z.AI Coding Plan that OpenCode uses internally. The regular BigModel API (open.bigmodel.cn) will NOT work with our tokens.

Correct API Configuration:

Property	Value
API URL	`https://api.z.ai/api/coding/paas/v4/chat/completions`
Environment Variable	`ZAI_API_TOKEN`
Recommended Model	`glm-4.6`
Cost	Free (0 per token for all GLM models)

Available Models: glm-4.5, glm-4.5-air, glm-4.5-flash, glm-4.5v, glm-4.6

Python Implementation:

import os
import httpx

# CORRECT - Z.AI Coding Plan endpoint
API_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions"
api_key = os.environ.get("ZAI_API_TOKEN")

client = httpx.AsyncClient(
    timeout=60.0,
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
)

# WRONG - This will fail with quota errors!
# API_URL = "https://open.bigmodel.cn/api/paas/v4/chat/completions"
# api_key = os.environ.get("ZHIPU_API_KEY")

Integration with CH-Annotator: When using GLM for entity recognition, always reference CH-Annotator v1.7.0 in prompts:

PROMPT = """You are following CH-Annotator v1.7.0 convention.
Heritage institutions are type GRP.HER with subtypes:
- GRP.HER.MUS (museums)
- GRP.HER.LIB (libraries)
- GRP.HER.ARC (archives)
- GRP.HER.GAL (galleries)
..."""

Token Location:

Environment: Add ZAI_API_TOKEN to .env file
OpenCode Auth: Token stored in ~/.local/share/opencode/auth.json under key zai-coding-plan

See: .opencode/ZAI_GLM_API_RULES.md for complete documentation

Rule 12: Person Data Reference Pattern - Avoid Inline Duplication

🚨 CRITICAL: When person profile data is stored in data/custodian/person/, custodian files MUST reference the file path instead of duplicating the full profile inline.

This pattern:

Reduces data duplication (50+ lines → 5 lines per person)
Ensures single-source-of-truth for person data
Enables cross-custodian references (same person at multiple institutions)
Makes updates easier to manage

Directory Structure:

data/custodian/
├── person/                              # Canonical person profile storage
│   ├── alexandr-belov-bb547b46_20251210T120000Z.json
│   ├── giovanna-fossati_20251209T170000Z.json
│   └── ...
├── NL-NH-AMS-U-EFM-eye_filmmuseum.yaml  # References person files
└── ...

File Naming Convention: {linkedin-slug}_{ISO-timestamp}.json

✅ CORRECT - Reference Pattern:

collection_management_specialist:
- name: Alexandr Belov
  role: Collection/Information Specialist
  linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
  current: true
  person_profile_path: data/custodian/person/alexandr-belov-bb547b46_20251210T120000Z.json

❌ WRONG - Inline Duplication (50+ lines of profile data inline when file exists):

collection_management_specialist:
- name: Alexandr Belov
  role: Collection/Information Specialist
  linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
  about: "60+ lines of profile data..."
  skills: [...]
  languages: [...]
  career_history: [...]
  # ... etc

When to Use Each Pattern:

Use File Reference When	Use Inline Data When
Full profile already extracted and saved	Only basic info available
Person has extensive career history	Minimal profile data
Profile data exceeds ~10 lines	Quick placeholder entry
Same person referenced by multiple custodians	Before full extraction

See: .opencode/PERSON_DATA_REFERENCE_PATTERN.md for complete documentation

Rule 13: Custodian Type Annotations on LinkML Schema Elements

All LinkML schema elements (classes, slots, enums) MUST be annotated with their applicable GLAMORCUBESFIXPHDNT custodian type codes using the annotations block.

This convention enables:

Visual categorization in UML diagrams (cube face highlighting)
Semantic filtering by heritage institution type
Validation of slot/enum applicability to custodian types

Annotation Keys:

Key	Type	Required	Description
`custodian_types`	list[string]	YES	List of applicable type codes (e.g., `[A, L]`)
`custodian_types_rationale`	string	NO	Explanation of why these types apply
`custodian_types_primary`	string	NO	Primary type if multiple apply

Type Codes (single letters from GLAMORCUBESFIXPHDNT):

Code	Type	Code	Type
G	Gallery	F	Feature custodian
L	Library	I	Intangible heritage
A	Archive	X	Mixed types
M	Museum	P	Personal collection
O	Official institution	H	Holy/sacred site
R	Research center	D	Digital platform
C	Corporation	N	NGO
U	Unknown	T	Taste/smell heritage
B	Botanical/Zoo
E	Education provider
S	Collecting society

Example - Class Annotation:

classes:
  ArchivalFonds:
    class_uri: rico:RecordSet
    description: An archival fonds representing a collection of records
    annotations:
      custodian_types: [A]
      custodian_types_rationale: "Archival fonds are specific to archive institutions"

Example - Slot Annotation:

slots:
  arrangement_system:
    description: The archival arrangement system used
    range: ArrangementSystemEnum
    annotations:
      custodian_types: [A]
      custodian_types_rationale: "Arrangement systems are an archival concept"

Example - Enum Value Annotation:

enums:
  CollectionTypeEnum:
    permissible_values:
      ARCHIVAL_FONDS:
        description: A complete archival fonds
        annotations:
          custodian_types: [A]

Universal Types: Use ["*"] for elements that apply to all custodian types:

slots:
  preferred_label:
    annotations:
      custodian_types: ["*"]

Validation Rules:

Values MUST be valid single-letter GLAMORCUBESFIXPHDNT codes
Child class types MUST be subset of or equal to parent class types
Slot annotations should align with using class annotations

See: .opencode/CUSTODIAN_TYPE_ANNOTATION_CONVENTION.md for complete documentation

Rule 14: Exa MCP LinkedIn Profile Extraction

When extracting LinkedIn profile data for heritage custodian staff, use the exa_crawling_exa tool with direct profile URL for comprehensive extraction.

Tool Selection:

Scenario	Tool	Parameters
Profile URL known	`exa_crawling_exa`	`url`, `maxCharacters: 10000`
Profile URL unknown	`exa_linkedin_search_exa`	`query`, `searchType: "profiles"`
Fallback search	`exa_web_search_exa`	`query: "site:linkedin.com/in/ {name}"`

Preferred Workflow (when LinkedIn URL is available):

1. Use exa_crawling_exa with direct URL
   ↓
2. Extract comprehensive profile data
   ↓
3. Save to data/custodian/person/{linkedin-slug}_{timestamp}.json
   ↓
4. Update custodian file with person_profile_path reference

Example - Direct URL Extraction:

Tool: exa_crawling_exa
Parameters:
  url: "https://www.linkedin.com/in/alexandr-belov-bb547b46"
  maxCharacters: 10000

Why exa_crawling_exa over exa_linkedin_search_exa:

Returns complete career history with dates, durations, company metadata
Includes all education entries with institutions and degrees
Captures full about section text
Returns profile image URL
More reliable than search (which may return irrelevant profiles)

Output Location: data/custodian/person/{linkedin-slug}_{ISO-timestamp}.json

Required JSON Fields:

exa_search_metadata - Tool, timestamp, request ID, cost
linkedin_profile_url - Source URL
profile_data.career_history[] - Complete work history
profile_data.education[] - All education entries
profile_data.heritage_relevant_experience[] - Tagged heritage roles

See: .opencode/EXA_LINKEDIN_EXTRACTION_RULES.md for complete documentation

Rule 15: Connection Data Registration - Full Network Preservation

🚨 CRITICAL: When connection data is manually recorded for a person, ALL connections MUST be fully registered in a dedicated connections file in data/custodian/person/.

This rule ensures complete preservation of professional network data, enabling heritage sector network analysis and cross-custodian relationship discovery.

Connection File Naming Convention:

{linkedin-slug}_connections_{ISO-timestamp}.json

Examples:

alexandr-belov-bb547b46_connections_20251210T160000Z.json
giovanna-fossati_connections_20251211T140000Z.json

Required Connection File Structure:

{
  "source_metadata": {
    "source_url": "https://www.linkedin.com/search/results/people/?...",
    "scraped_timestamp": "2025-12-10T16:00:00Z",
    "scrape_method": "manual_linkedin_browse",
    "target_profile": "{linkedin-slug}",
    "target_name": "Full Name",
    "connections_extracted": 107
  },
  "connections": [
    {
      "name": "Connection Name",
      "degree": "1st" | "2nd" | "3rd+",
      "headline": "Current role or description",
      "location": "City, Region, Country",
      "organization": "Primary organization",
      "heritage_relevant": true | false,
      "heritage_type": "A" | "L" | "M" | "R" | "E" | "D" | "G" | etc.
    }
  ],
  "network_analysis": {
    "total_connections_extracted": 107,
    "heritage_relevant_count": 56,
    "heritage_relevant_percentage": 52.3,
    "connections_by_heritage_type": {...}
  }
}

Minimum Required Fields per Connection:

Field	Type	Description
`name`	string	Full name of connection
`degree`	string	Connection degree: `1st`, `2nd`, `3rd+`
`headline`	string	Current role/description
`heritage_relevant`	boolean	Is this person in heritage sector?

Heritage Type Codes: Use single-letter GLAMORCUBESFIXPHDNT codes (G, L, A, M, O, R, C, U, B, E, S, F, I, X, P, H, D, N, T).

Referencing from Custodian Files:

collection_management_specialist:
- name: Alexandr Belov
  role: Collection/Information Specialist
  linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
  current: true
  person_profile_path: data/custodian/person/alexandr-belov-bb547b46_20251210T120000Z.json
  person_connections_path: data/custodian/person/alexandr-belov-bb547b46_connections_20251210T160000Z.json

Why This Rule Matters:

Complete Data Preservation: Connections are expensive to extract (manual scraping, rate limits)
Heritage Sector Mapping: Understanding who knows whom in the heritage community
Cross-Custodian Discovery: Find staff who work at multiple institutions
Network Analysis: Identify key influencers and knowledge hubs

See: .opencode/CONNECTION_DATA_REGISTRATION_RULE.md for complete documentation including network analysis sections

Rule 16: LinkedIn Photo URLs - Store CDN URLs, Not Overlay Pages

🚨 CRITICAL: When storing LinkedIn profile photo URLs, store the ACTUAL CDN image URL from media.licdn.com, NOT the overlay page URL.

The overlay page URL (/overlay/photo/) is trivially derivable from any LinkedIn profile URL and provides zero informational value. The actual CDN URL requires extraction effort and is the only URL that directly serves the image.

URL Patterns:

Type	Pattern	Value
❌ WRONG (overlay page)	`https://www.linkedin.com/in/{slug}/overlay/photo/`	Derivable, useless
✅ CORRECT (CDN image)	`https://media.licdn.com/dms/image/v2/{ID}/profile-displayphoto-shrink_800_800/...`	Actual image

CDN URL Structure:

https://media.licdn.com/dms/image/v2/{image_id}/profile-displayphoto-shrink_{size}/{encoded_params}

Where {size} can be: 100_100, 200_200, 400_400, 800_800

Person Profile JSON Schema:

{
  "linkedin_profile_url": "https://www.linkedin.com/in/giovannafossati",
  "linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/...",
  "photo_urls": null
}

When CDN URL is Not Available (profile viewed without login, or no photo):

{
  "linkedin_profile_url": "https://www.linkedin.com/in/example-person",
  "linkedin_photo_url": null,
  "photo_urls": {
    "primary": "https://example.org/staff/person-headshot.jpg",
    "sources": [
      {
        "url": "https://example.org/staff/person-headshot.jpg",
        "source": "Institutional website",
        "retrieved_date": "2025-12-09"
      }
    ]
  }
}

Extraction Methods (in order of preference):

Browser DevTools - Inspect <img> element in profile photo overlay, copy src attribute
Exa Crawling - Use exa_crawling_exa tool with profile URL, extract from returned content
Alternative Sources - If LinkedIn CDN unavailable, use institutional websites, conference photos, etc.

Rationale:

Overlay URL = profile_url + "/overlay/photo/" → No information gain
CDN URL contains unique image identifiers → Actual data
CDN URLs can be used to display images directly
Multiple size variants available by changing size parameter

See:

.opencode/LINKEDIN_PHOTO_CDN_RULE.md for agent rule reference
docs/LINKEDIN_PHOTO_URL_EXTRACTION.md for complete extraction documentation

Rule 17: LinkedIn Connection Unique Identifiers - Include Abbreviated and Anonymous Names

🚨 CRITICAL: When parsing LinkedIn connections, EVERY connection MUST receive a unique connection_id, including abbreviated names (e.g., "Amy B.") and anonymous entries (e.g., "LinkedIn Member").

This rule ensures complete data preservation for heritage sector network analysis, even when LinkedIn privacy settings obscure full names.

Connection ID Format:

{target_slug}_conn_{index:04d}_{name_slug}

Examples:

elif-rongen-kaynakci-35295a17_conn_0042_amy_b
giovannafossati_conn_0156_linkedin_member
anne-gant-59908a18_conn_0003_tina_m_bastajian

Name Type Classification:

Type	Pattern	Example	`name_type` Value
Full Name	First + Last name	"John Smith"	`full`
Abbreviated	Contains single initial	"Amy B.", "S. Buse Yildirim", "Tina M. Bastajian"	`abbreviated`
Anonymous	Privacy-hidden profile	"LinkedIn Member"	`anonymous`

Required Fields per Connection:

{
  "connection_id": "target-slug_conn_0042_amy_b",
  "name": "Amy B.",
  "name_type": "abbreviated",
  "degree": "2nd",
  "headline": "Film Archivist at EYE Filmmuseum",
  "location": "Amsterdam, Netherlands",
  "heritage_relevant": true,
  "heritage_type": "A"
}

Abbreviated Name Detection Patterns:

# Single letter followed by period
"Amy B."           # Last name abbreviated
"S. Buse Yildirim" # First name abbreviated  
"Tina M. Bastajian" # Middle initial
"İ. Can Koç"       # Turkish first initial

Name Slug Generation Rules:

Normalize unicode (NFD decomposition)
Remove diacritics (é→e, ö→o, ñ→n, İ→i)
Convert to lowercase
Replace non-alphanumeric with underscores
Collapse multiple underscores
Truncate to 30 characters maximum

Implementation Script: scripts/parse_linkedin_connections.py

Key Functions:

is_abbreviated_name(name) - Detects single-letter initials
is_anonymous_name(name) - Detects "LinkedIn Member" patterns
generate_connection_id(name, index, target_slug) - Creates unique IDs

Usage:

python scripts/parse_linkedin_connections.py \
    data/custodian/person/manual_register/{slug}_connections_{timestamp}.md \
    data/custodian/person/{slug}_connections_{timestamp}.json \
    --target-name "Full Name" \
    --target-slug "linkedin-slug"

Why This Rule Matters:

Deduplication: Same abbreviated name across different connection lists can be linked
Privacy Respect: Preserves privacy while enabling analysis
Complete Data: No connections are silently dropped
Network Analysis: Enables heritage sector relationship mapping even with partial data

Output Statistics (from real extractions):

Person	Total	Full Names	Abbreviated	Anonymous
Elif Rongen-Kaynakçi	475	449 (94.5%)	26 (5.5%)	0
Giovanna Fossati	776	746 (96.1%)	30 (3.9%)	0

Understanding Duplicates in Raw Connection Data:

🚨 IMPORTANT: Duplicates in raw manually-registered connection data are expected and normal, not a data quality issue.

When manually registering LinkedIn connections, the same person can appear multiple times because:

Connection degree is relative to the VIEWER (person conducting the search), NOT the target profile being analyzed
In a social network graph, multiple paths of different lengths can exist to the same node - a person can be your 1st degree (direct connection) AND also reachable as 2nd degree via another connection
LinkedIn may display different degree values for the same person depending on which path it evaluates

Key Constraint: The degree field (1st, 2nd, 3rd+) reflects the relationship between the viewer and the connection, NOT the relationship between the target profile and the connection. This is a LinkedIn UI limitation.

Example: If you (Alice) are browsing Bob's connections and see Carol listed as "2nd", that means Carol is YOUR 2nd degree connection, not Bob's.

Processing: Raw MD files typically have 15-25% duplicates. Deduplicate by name when creating final JSON output.

See: .opencode/LINKEDIN_CONNECTION_ID_RULE.md for complete documentation

Rule 18: Custodian Staff Parsing from LinkedIn Company Pages

When manually registering heritage custodian staff from LinkedIn company "People" pages, use the parse_custodian_staff.py script to convert raw text files into structured JSON.

File Locations:

Type	Location
Raw input files	`data/custodian/person/manual_hc/{slug}-{timestamp}.md`
Parsed output files	`data/custodian/person/{slug}_staff_{timestamp}.json`
Parser script	`scripts/parse_custodian_staff.py`

Usage:

python scripts/parse_custodian_staff.py <input_file> <output_file> \
    --custodian-name "Custodian Name" \
    --custodian-slug "custodian-slug"

Example:

python scripts/parse_custodian_staff.py \
    data/custodian/person/manual_hc/collectie_overijssel-20251210T0055.md \
    data/custodian/person/collectie_overijssel_staff_20251210T0055.json \
    --custodian-name "Collectie Overijssel" \
    --custodian-slug "collectie-overijssel"

Output Structure:

{
  "custodian_metadata": {
    "custodian_name": "Collectie Overijssel",
    "industry": "Museums, Historical Sites, and Zoos",
    "location": { "city": "Zwolle", "region": "Overijssel" },
    "employee_count": "51-200",
    "associated_members": 58
  },
  "staff": [
    {
      "staff_id": "collectie-overijssel_staff_0000_annelien_vos_keen",
      "name": "Annelien Vos-Keen",
      "name_type": "full",
      "degree": "2nd",
      "headline": "Data Analist / KPI- en procesexpert",
      "mutual_connections": "Thomas van Maaren, Bob Coret, and 4 other mutual connections",
      "heritage_relevant": true,
      "heritage_type": "D"
    }
  ],
  "staff_analysis": {
    "total_staff_extracted": 39,
    "heritage_relevant_count": 25,
    "heritage_relevant_percentage": 64.1,
    "staff_by_heritage_type": { "A": 4, "D": 1, "M": 18 }
  }
}

Staff ID Format: {custodian_slug}_staff_{index:04d}_{name_slug}

Heritage Type Detection: Staff headlines are analyzed using GLAMORCUBESFIXPHDNT keywords:

Code	Type	Keywords
A	Archive	archief, archivist, archivaris, nationaal archief
M	Museum	museum, curator, conservator, collectie
D	Digital	digital, data, developer, digitalisering
E	Education	university, professor, docent, educatie

Existing Files:

Custodian	Staff	Heritage-Relevant
Collectie Overijssel	39	25 (64%)
Nationaal Archief	373	268 (72%)

Relationship to Rule 15 & 17: This script complements the connection parsing scripts:

Script	Purpose
`parse_linkedin_connections.py`	Parse PERSON's connections
`parse_custodian_staff.py`	Parse ORGANIZATION's staff

See: .opencode/CUSTODIAN_STAFF_PARSING_RULE.md for complete documentation

Rule 19: HTML-Only LinkedIn Extraction (Preferred Method)

Use ONLY manually saved HTML files (not MD copy-paste) for extracting LinkedIn data.

When extracting people data from LinkedIn (custodian staff, person connections), the HTML source contains 100% of the data. Markdown copy-paste loses critical information (especially profile URLs) and should NOT be used as a primary source.

Data Completeness Comparison:

Data Source	Profile URLs	Headlines	Names	Connection Degree	Coverage
HTML (saved page)	100%	100%	100%	100%	100%
MD (copy-paste)	0%	90%	100%	90%	~90%

Primary Script: scripts/parse_linkedin_html.py

Usage:

python scripts/parse_linkedin_html.py \
    "data/custodian/person/manual_hc/Rijksmuseum_ People _ LinkedIn.html" \
    --custodian-name "Rijksmuseum" \
    --custodian-slug "rijksmuseum" \
    --output data/custodian/person/rijksmuseum_staff_20251211T000000Z.json

Key Features:

Extracts complete profile data including LinkedIn profile URLs (linkedin.com/in/{slug})
Handles anonymous "LinkedIn Member" entries (privacy-protected profiles)
Deduplicates regular profiles by name but preserves each anonymous entry
Detects heritage-relevant staff using GLAMORCUBESFIXPHDNT keywords

How to Save LinkedIn HTML:

Navigate to company's LinkedIn page (e.g., linkedin.com/company/rijksmuseum/people/)
Scroll down to load ALL profiles (LinkedIn uses infinite scroll)
File > Save Page As... > "Webpage, Complete"
Save to data/custodian/person/manual_hc/

Note on Count Discrepancies: The "associated members" count in the page header may differ slightly from extracted count (typically within 1-3%) because:

Header count updates asynchronously
HTML contains actual rendered profiles which may be more current

See: .opencode/HTML_ONLY_LINKEDIN_EXTRACTION_RULE.md for complete documentation

Rule 20: Person Entity Profiles - Individual File Storage

🚨 CRITICAL: Person entity profiles MUST be stored as individual files in /Users/kempersc/apps/glam/data/custodian/person/entity directory.

This rule ensures complete preservation of professional profile data and enables cross-custodian relationship analysis.

Directory Structure:

data/custodian/person/
├── entity/                              # Canonical person profile storage
│   ├── alexandr-belov-bb547b46_20251210T120000Z.json
│   ├── giovanna-fossati_20251209T170000Z.json
│   └── ...
├── affiliated/                            # Custodian staff affiliations
│   └── parsed/
│       └── 3-october-vereniging_staff_20251210T155412Z.json
└── connection/                           # Professional network data
    └── manual/
        └── {slug}_connections_{timestamp}.json

File Naming Convention: {linkedin-slug}_{ISO-timestamp}.json

When to Create Person Entity Files:

Scenario	Create Entity File?	Reason
Exa LinkedIn extraction	✅ YES	Complete profile data available
Connection data available	✅ YES	Network analysis requires individual profiles
Heritage-relevant staff	✅ YES	Professional background analysis
Multiple affiliations	✅ YES	Cross-institution career tracking

Exa Tool Limitations:

Person profiles: ✅ Can extract with exa_crawling_exa (direct URL)
Connection data: ❌ CANNOT extract with Exa (requires manual registration)
Custodian affiliations: ❌ CANNOT extract with Exa (requires manual registration)

Manual Registration Requirements:

Connection data: Save to /data/custodian/person/connection/manual/ from LinkedIn browsing
Custodian staff: Save to /data/custodian/person/affiliated/manual/ from LinkedIn company pages
Parse with scripts: Convert to structured JSON using dedicated parsers

Entity Profile JSON Schema:

🚨 CRITICAL: ALL profiles MUST use structured JSON format with extraction_agent: "claude-opus-4.5". Raw content dumps are NOT acceptable.

{
  "extraction_metadata": {
    "source_file": "path/to/source/file",
    "staff_id": "unique_identifier",
    "extraction_date": "2025-12-10T16:00:00Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/profile",
    "cost_usd": 0,
    "request_id": "exa_request_id"
  },
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "https://www.linkedin.com/in/profile",
    "headline": "Professional headline",
    "location": "City, Region, Country",
    "connections": "500 connections • 2,135 followers",
    "about": "Professional summary text",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "languages": [...],
    "profile_image_url": "https://media.licdn.com/..."
  }
}

Required extraction_metadata Fields:

Field	Required	Description
`source_file`	YES	Path to source staff list file
`staff_id`	YES	Unique staff identifier
`extraction_date`	YES	ISO 8601 timestamp
`extraction_method`	YES	`exa_contents`, `exa_crawling_exa`, `exa_crawling_glm46`, or `manual`
`extraction_agent`	YES	Agent identifier: `"claude-opus-4.5"` for manual extraction, `""` (empty) for automated scripts
`linkedin_url`	YES	Full LinkedIn profile URL
`cost_usd`	YES	API cost (0 for Exa contents endpoint)
`request_id`	NO	Exa request ID for tracing

See: .opencode/PERSON_ENTITY_PROFILE_FORMAT_RULE.md for complete format documentation


**Referencing from Custodian Files**:
```yaml
collection_management_specialist:
- name: Alexandr Belov
  role: Collection/Information Specialist
  linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
  current: true
  person_profile_path: data/custodian/person/alexandr-belov-bb547b46_20251210T120000Z.json

Why This Rule Matters:

Data Preservation: Complete professional profiles are valuable research assets
Cross-Institution Analysis: Track staff careers across multiple heritage organizations
Network Analysis: Build comprehensive heritage sector professional networks
Update Efficiency: Modify individual profiles without affecting custodian files
Research Value: Enables longitudinal studies of heritage sector employment

Implementation Status: ✅ COMPLETE - All 18 profiles from 3 October Vereeniging successfully stored as individual entity files

Rule 21: Data Fabrication is Strictly Prohibited

🚨 CRITICAL: ALL DATA MUST BE REAL AND VERIFIABLE. Fabricating any data - even as placeholders - is strictly prohibited and violates project integrity.

This rule applies to ALL data extraction, including:

LinkedIn profile data
Heritage institution records
Person connections
Staff affiliations
Any scraped or API-fetched content

What Constitutes Data Fabrication:

❌ FORBIDDEN (Never do this)	✅ ALLOWED (What you CAN do)
Creating fake names for people	Skip profiles that cannot be extracted
Inventing job titles or companies	Return `null` or empty fields for missing data
Making up education history	Mark profiles with `extraction_error: true`
Generating placeholder skills/experience	Log why extraction failed
Creating fictional LinkedIn URLs	Use "Unknown" only for display, not stored data
Any "fallback" data when real data unavailable	Gracefully fail and move to next record

Implementation in Extraction Scripts:

# ❌ WRONG - Creating fallback data
if not profile_data:
    profile_data = {
        "name": "Unknown Person",
        "headline": "No data available",
        "experience": []
    }

# ✅ CORRECT - Skip when no real data
if not profile_data or profile_data.get("extraction_error"):
    print(f"Skipping {url} - extraction failed")
    return None  # Do NOT save anything

Validation Requirements:

Before saving any profile, verify:

Name exists and is not generic ("LinkedIn Member", "Unknown", "N/A")
Name length is at least 2 characters
LinkedIn URL matches expected format
No extraction_error flag is set

Real-World Example of Violation:

The LinkedIn profile fetcher script created a completely fabricated profile for "Celyna Keates" when the Exa API failed. This profile contained:

Invented name
Made-up job history
Fictional education
Completely fabricated LinkedIn URL

This violated project integrity and had to be removed.

User Directive: "ALL PROFILES SHOULD BE REAL!!! Fabricating data is strictly prohibited."

See: .opencode/DATA_FABRICATION_PROHIBITION.md for complete documentation

Rule 22: Custodian YAML Files Are the Single Source of Truth for Enrichment Data

🚨 CRITICAL: The data/custodian/*.yaml files are the SINGLE SOURCE OF TRUTH for all heritage institution enrichment data. ALL enrichment pipelines MUST write to these files.

This rule ensures data consistency, traceability, and prevents "ghost data" from appearing in APIs that doesn't exist in the canonical data files.

The Custodian Data Hierarchy:

data/custodian/*.yaml          ← SINGLE SOURCE OF TRUTH (edit this!)
       ↓
+------+------+------+------+------+
|      |      |      |      |      |
↓      ↓      ↓      ↓      ↓      ↓
Ducklake  PostgreSQL  TypeDB  Oxigraph  Qdrant
(analytics) (geo API)  (graph) (RDF/SPARQL) (vector)
       ↓
REST API responses             ← DERIVED (serve from databases)
       ↓
Frontend display               ← DERIVED (render from API)

ALL databases are DERIVED from custodian YAML files. Databases MUST NEVER add enrichments independently.

Database	Purpose	Data Flow
Ducklake	Analytics, aggregations	Import from YAML → Query
PostgreSQL	Geographic API, PostGIS	Import from YAML → Serve API
TypeDB	Graph queries, relationships	Import from YAML → Graph traversal
Oxigraph	RDF/SPARQL, Linked Data	Import from YAML → SPARQL endpoint
Qdrant	Vector search, semantic	Import from YAML → Similarity search

What MUST Be Stored in Custodian YAML Files:

Data Type	Store in YAML?	Example Fields
Basic metadata	✅ YES	`name`, `ghcid`, `institution_type`, `description`
Location data	✅ YES	`location.city`, `location.country`, `coordinates`
Identifiers	✅ YES	`isil_code`, `wikidata_id`, `kvk_number`
Social media	✅ YES	`social_media.facebook`, `social_media.twitter`
Opening hours	✅ YES	`opening_hours`
Contact info	✅ YES	`phone`, `email`, `website`
Google Maps enrichment	✅ YES	`google_maps_enrichment.*`
Wikidata enrichment	✅ YES	`wikidata_enrichment.*`
Web scrape data	✅ YES	`web_enrichment.*`

Workflow for ALL Enrichment Operations:

1. FETCH data from external source (Google Maps, Wikidata, LinkedIn, etc.)
   ↓
2. VALIDATE data quality (see Rule 23 for social media validation)
   ↓
3. WRITE to data/custodian/{GHCID}.yaml  ← MANDATORY!
   ↓
4. IMPORT into database (if needed for API serving)
   ↓
5. VERIFY API returns same data as YAML file

Example - CORRECT Enrichment Workflow:

# Step 1: Fetch data
google_data = fetch_google_maps_data(place_id)

# Step 2: Validate
if is_valid_social_media(google_data.get('facebook')):
    # Step 3: Write to YAML (MANDATORY!)
    custodian_file = f"data/custodian/{ghcid}.yaml"
    with open(custodian_file, 'r') as f:
        custodian = yaml.safe_load(f)
    
    custodian['social_media'] = custodian.get('social_media', {})
    custodian['social_media']['facebook'] = google_data['facebook']
    
    with open(custodian_file, 'w') as f:
        yaml.dump(custodian, f)

Example - WRONG (Data Not in YAML):

# ❌ WRONG - Writing directly to database without updating YAML
cursor.execute(
    "UPDATE institutions SET facebook = %s WHERE ghcid = %s",
    (google_data['facebook'], ghcid)
)
# Now API returns data that doesn't exist in custodian YAML!
# This creates "ghost data" that can't be tracked or verified.

Why This Rule Exists:

Traceability: YAML files are version-controlled in Git
Verification: Anyone can check the source of truth
Consistency: API responses should match YAML files
Debugging: "Ghost data" in API but not in YAML is impossible to trace
Data governance: Clear ownership of data changes

Detecting "Ghost Data":

If API returns data that doesn't exist in the custodian YAML file, this is a data integrity violation:

# Check if API data matches YAML
curl -s "http://localhost:8002/institution/NL-NH-MID-M-AMW" | jq '.social_media'
# Returns: {"facebook": "https://www.facebook.com/facebook"}

# But YAML file has no social_media section!
grep -A5 "social_media:" data/custodian/NL-NH-MID-M-AMW.yaml
# (no output - section doesn't exist)

# This is WRONG! Ghost data detected.

Resolution for Ghost Data:

Option A: Add valid data to YAML (if data is correct but missing)
Option B: Remove invalid data from database (if data is garbage)
NEVER: Leave ghost data in database without YAML source

See:

.opencode/CUSTODIAN_DATA_SOURCE_OF_TRUTH.md for complete documentation
.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md for social media validation rules

🚨 CRITICAL: Social media links MUST point to the specific institution's page, NOT to generic platform homepages or the platform's own account.

Generic social media links provide zero informational value and must be validated before storage.

Generic Links (INVALID - Must Remove):

Platform	Invalid Patterns	Why Invalid
Facebook	`facebook.com/`	Generic homepage
	`facebook.com/facebook`	Facebook's own page
	`facebook.com/home.php`	User homepage
Twitter/X	`twitter.com/`	Generic homepage
	`twitter.com/twitter`	Twitter's own account
	`x.com/`	Generic homepage
Instagram	`instagram.com/`	Generic homepage
	`instagram.com/instagram`	Instagram's own account
LinkedIn	`linkedin.com/`	Generic homepage
	`linkedin.com/company/linkedin`	LinkedIn's own page
YouTube	`youtube.com/`	Generic homepage
	`youtube.com/youtube`	YouTube's own channel

Valid Links (KEEP):

Platform	Valid Pattern	Example
Facebook	`facebook.com/{page_name}/`	`facebook.com/rijksmuseum/`
	`facebook.com/pages/{name}/{id}`	`facebook.com/pages/Museum/123456`
Twitter/X	`twitter.com/{username}`	`twitter.com/rijksmuseum`
	`x.com/{username}`	`x.com/rijksmuseum`
Instagram	`instagram.com/{username}`	`instagram.com/rijksmuseum`
LinkedIn	`linkedin.com/company/{slug}`	`linkedin.com/company/rijksmuseum`
YouTube	`youtube.com/@{channel}`	`youtube.com/@Rijksmuseum`
	`youtube.com/c/{channel}`	`youtube.com/c/rijksmuseum`
	`youtube.com/channel/{id}`	`youtube.com/channel/UCxyz123`

Validation Implementation:

INVALID_SOCIAL_MEDIA_PATTERNS = {
    'facebook': [
        r'^https?://(www\.)?facebook\.com/?$',
        r'^https?://(www\.)?facebook\.com/facebook/?$',
        r'^https?://(www\.)?facebook\.com/home\.php',
    ],
    'twitter': [
        r'^https?://(www\.)?(twitter|x)\.com/?$',
        r'^https?://(www\.)?(twitter|x)\.com/twitter/?$',
    ],
    'instagram': [
        r'^https?://(www\.)?instagram\.com/?$',
        r'^https?://(www\.)?instagram\.com/instagram/?$',
    ],
    'linkedin': [
        r'^https?://(www\.)?linkedin\.com/?$',
        r'^https?://(www\.)?linkedin\.com/company/linkedin/?$',
    ],
    'youtube': [
        r'^https?://(www\.)?youtube\.com/?$',
        r'^https?://(www\.)?youtube\.com/youtube/?$',
    ],
}

def is_valid_social_media_link(platform: str, url: str) -> bool:
    """Return True if URL is a valid institution-specific social media link."""
    if not url:
        return False
    
    patterns = INVALID_SOCIAL_MEDIA_PATTERNS.get(platform, [])
    for pattern in patterns:
        if re.match(pattern, url, re.IGNORECASE):
            return False  # Generic link detected
    
    return True

Enrichment Workflow with Validation:

# When enriching social media data
social_media = {}

if google_data.get('facebook'):
    fb_url = google_data['facebook']
    if is_valid_social_media_link('facebook', fb_url):
        social_media['facebook'] = fb_url
    else:
        log.warning(f"Skipping generic Facebook link: {fb_url}")

# Only write validated links to YAML
if social_media:
    custodian['social_media'] = social_media

Cleanup Existing Data:

If generic links exist in the database, they must be:

NOT copied to custodian YAML files
Removed from database during cleanup
Logged for audit purposes

See: .opencode/SOCIAL_MEDIA_LINK_VALIDATION.md for complete validation rules and regex patterns

Rule 24: Unused Import Investigation - Check Before Removing

🚨 CRITICAL: Before removing unused imports, INVESTIGATE whether they indicate incomplete implementations. Unused code is a signal that requires analysis, not automatic deletion.

This rule prevents accidental removal of imports that support planned features, incomplete implementations, or conditional code paths.

Investigation Checklist:

Step	Check	How to Verify
1	Was it recently used?	`git log -p --all -S 'ImportName' -- file.py`
2	Is there a TODO/FIXME?	Search file for TODO, FIXME, XXX comments
3	Pattern mismatch?	Check if file mixes old (`Optional[X]`) and new (`X \| None`) syntax
4	Incomplete feature?	Look for stub functions, empty implementations, `pass` statements
5	Conditional usage?	Check for `TYPE_CHECKING` blocks, feature flags

When Removal IS Appropriate (after investigation confirms):

Legacy cruft: File consistently uses modern syntax, import is leftover from Python 3.9 era
Refactored code: Import was part of deleted/refactored functionality
Template artifact: Import was copy-pasted from template but never used

When Removal is NOT Appropriate:

TODO comment references the import
Partial implementation exists that should use the import
Import supports a feature flag or conditional code path
Import is inside TYPE_CHECKING block for static analysis

Example - Safe Removal (Legacy Cruft):

# File: multi_embedding_retriever.py
from typing import Optional  # ← UNUSED

# Investigation results:
# 1. git log: No recent changes involving Optional
# 2. No TODO comments  
# 3. File uses `X | None` syntax consistently (60+ occurrences)
# 4. No incomplete implementations
# 
# Verdict: Safe to remove - legacy cruft from Python 3.9 migration

Example - DO NOT Remove (Incomplete Implementation):

# File: data_processor.py
from dataclasses import dataclass  # ← Appears unused
from typing import Protocol  # ← Appears unused

# TODO: Implement DataProcessor protocol pattern
class DataProcessor:
    def process(self, data):
        pass  # Stub implementation

# Verdict: DO NOT remove - imports support planned feature

Workflow:

1. Linter/IDE flags unused import
   ↓
2. STOP - Do not auto-remove
   ↓
3. Run investigation checklist (all 5 steps)
   ↓
4. Document findings in commit message
   ↓
5. Either:
   a) Remove with clear rationale, OR
   b) Keep and add TODO explaining why, OR
   c) Complete the incomplete implementation

See: .opencode/UNUSED_IMPORT_INVESTIGATION_RULE.md for complete documentation, git commands, and additional examples

Rule 25: Digital Platform Discovery Enrichment

🚨 CRITICAL: Every heritage custodian MUST be enriched with digital platform discovery data. All discovered platforms MUST have complete provenance tracking.

Digital platform discovery is essential for understanding how heritage institutions make their collections accessible online. This rule defines the required structure and provenance for documenting an institution's digital presence.

What to Discover:

Platform Category	Examples	Required Fields
Collection Management System	MAIS-Flexis, Adlib, CollectiveAccess, ArchivesSpace	`system_name`, `vendor`, `version` (if known)
Discovery Portals	Beeldbank, Archiefstukken, Genealogie, Kranten	`platform_name`, `platform_url`, `items_indexed`
External Integrations	Archieven.nl, Europeana, Memorix, Collectie Nederland	`platform_name`, `integration_url`, `integration_type`
APIs & Data Services	OAI-PMH endpoints, SPARQL endpoints, REST APIs	`endpoint_url`, `protocol`, `documentation_url`

Required Provenance Fields:

Every digital platform discovery MUST include provenance with these fields:

digital_platform_discovery_summary:
  discovery_metadata:
    retrieval_agent: firecrawl | playwright | manual | exa
    retrieval_timestamp: "2025-01-15T10:30:00Z"  # ISO 8601
    source_url: https://www.example-archive.nl/onderzoeken
    xpath_base: /html/body/main/section[2]  # XPath to platform listing section
    html_file: web/GHCID/example-archive.nl/rendered.html  # Archived HTML
  platforms_discovered: 7
  total_items_indexed: 545393

Retrieval Agent Values:

Agent	When to Use
`firecrawl`	Primary tool for web scraping (preferred)
`playwright`	JavaScript-heavy sites requiring browser rendering
`manual`	Manual inspection/copy from website
`exa`	Exa web search/crawling tools

Platform Discovery Structure:

# Collection Management System
collection_management_system:
  system_name: MAIS-Flexis
  vendor: DE REE Archiefsystemen
  vendor_url: https://www.de-ree.nl/
  version: null  # If unknown
  primary_use: archival_description
  provenance:
    source_url: https://www.example-archive.nl/over-ons
    xpath: /html/body/main/div[2]/p[3]
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl

# Auxiliary Digital Platforms
auxiliary_digital_platforms:
  - platform_name: Beeldbank
    platform_url: https://beeldbank.example-archive.nl/
    platform_type: DISCOVERY_PORTAL
    content_type: images
    items_indexed: 125000
    description: "Digitized photographs, maps, and visual materials"
    provenance:
      source_url: https://www.example-archive.nl/onderzoeken
      xpath: /html/body/main/section[2]/div[1]/a
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl

# External Platform Integrations
external_platform_integrations:
  - platform_name: Archieven.nl
    integration_type: discovery_aggregator
    integration_url: https://www.archieven.nl/nl/zoeken?mivast=123
    items_contributed: 450000
    sync_frequency: daily
    provenance:
      source_url: https://www.example-archive.nl/onderzoeken
      xpath: /html/body/main/section[3]/div[1]
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: firecrawl

Discovery Workflow:

1. NAVIGATE to custodian website using FireCrawl
   ↓
2. IDENTIFY key pages: /onderzoeken, /collecties, /zoeken, /about, /over-ons
   ↓
3. EXTRACT platform information with XPath locations
   ↓
4. ARCHIVE HTML to data/custodian/web/{GHCID}/
   ↓
5. DOCUMENT provenance for each discovered platform
   ↓
6. UPDATE custodian YAML with digital_platform_discovery_summary

Minimum Discovery Requirements:

Every custodian file SHOULD have at minimum:

collection_management_system (if identifiable)
auxiliary_digital_platforms (all public-facing discovery tools)
external_platform_integrations (aggregators like Archieven.nl, Europeana)
digital_platform_discovery_summary with provenance metadata

Example Reference: See data/custodian/NL-DR-ASS-A-DA.yaml (Drents Archief) for a comprehensive digital platform discovery implementation with 7 platforms documented.

Tools to Use:

Tool	MCP Name	Best For
FireCrawl Scrape	`firecrawl_firecrawl_scrape`	Single page content extraction
FireCrawl Map	`firecrawl_firecrawl_map`	Discovering all URLs on a site
FireCrawl Search	`firecrawl_firecrawl_search`	Finding specific platform pages
Playwright Snapshot	`playwright_browser_snapshot`	JS-heavy pages

See:

.opencode/DIGITAL_PLATFORM_DISCOVERY_RULE.md for complete documentation
docs/DIGITAL_PLATFORM_DISCOVERY_GUIDE.md for step-by-step guide
data/custodian/NL-DR-ASS-A-DA.yaml for reference implementation

Rule 26: Person Data Provenance - Web Claims for Staff Information

🚨 CRITICAL: All person/staff data associated with heritage custodians MUST have web claim provenance. Staff information without verifiable sources is unacceptable.

This rule ensures that person data (staff, directors, curators, etc.) can be traced back to authoritative sources with complete provenance chains.

What Requires Web Claims:

Data Type	Source Types	Required Provenance
Staff Names	LinkedIn, institutional pages	`source_url`, `xpath`, `retrieved_on`
Job Titles/Roles	LinkedIn, about pages	`source_url`, `xpath`, `retrieved_on`
Contact Information	Staff directories, contact pages	`source_url`, `xpath`, `retrieved_on`
Professional History	LinkedIn profiles	`linkedin_url`, `retrieved_on`, `retrieval_agent`
Education	LinkedIn, CVs	`source_url`, `retrieved_on`

Person Web Claim Structure:

staff:
  - person_id: "rijksmuseum_staff_0001_taco_dibbits"
    name: Taco Dibbits
    role: General Director
    current: true
    
    # Web claims with provenance
    web_claims:
      - claim_type: full_name
        claim_value: Taco Dibbits
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/h2
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0
        
      - claim_type: role_title
        claim_value: General Director
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/p[1]
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0
    
    # LinkedIn profile reference (if available)
    linkedin_claim:
      linkedin_url: https://www.linkedin.com/in/taco-dibbits-12345
      profile_data_path: data/custodian/person/entity/taco-dibbits-12345_20250115T103000Z.json
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: exa_crawling_exa

Claim Types for Person Data:

Claim Type	Description	Example Value
`full_name`	Person's full name	"Taco Dibbits"
`role_title`	Current job title	"General Director"
`department`	Department/division	"Curatorial Department"
`email`	Work email address	"t.dibbits@rijksmuseum.nl"
`phone`	Work phone number	"+31 20 674 7000"
`start_date`	Role start date	"2020-01-01"
`education`	Educational background	"PhD Art History, University of Amsterdam"
`specialization`	Area of expertise	"17th Century Dutch Painting"

Provenance Requirements:

Every person claim MUST have:

Field	Required	Description
`claim_type`	YES	Type of claim (from table above)
`claim_value`	YES	The extracted value
`source_url`	YES	URL where data was found
`retrieved_on`	YES	ISO 8601 timestamp
`retrieval_agent`	YES	Tool used: `firecrawl`, `playwright`, `exa`, `manual`
`xpath`	RECOMMENDED	XPath to element (if from HTML)
`xpath_match_score`	RECOMMENDED	1.0 for exact, <1.0 for fuzzy

LinkedIn Profile Integration:

When LinkedIn data is available, create a separate profile file and reference it:

# In custodian YAML
staff:
  - name: Alexandr Belov
    role: Collection/Information Specialist
    linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
    person_profile_path: data/custodian/person/entity/alexandr-belov-bb547b46_20251210T120000Z.json
    
    # Claims from institutional website (separate from LinkedIn)
    institutional_claims:
      - claim_type: role_title
        claim_value: Collection Information Specialist
        source_url: https://www.eyefilm.nl/en/about/team
        xpath: /html/body/main/div[2]/ul/li[15]/span
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl

Staff Discovery Workflow:

1. SCRAPE institutional "About Us" / "Team" / "Staff" pages
   ↓
2. EXTRACT names and roles with XPath provenance
   ↓
3. SEARCH LinkedIn for professional profiles (if public)
   ↓
4. CREATE person entity files in data/custodian/person/entity/
   ↓
5. LINK profiles to custodian YAML with web claims
   ↓
6. DOCUMENT all provenance sources

Tools for Person Data Extraction:

Source	Tool	Method
Institutional websites	FireCrawl	`firecrawl_firecrawl_scrape` with XPath extraction
LinkedIn profiles	Exa	`exa_crawling_exa` with direct URL
LinkedIn search	Exa	`exa_linkedin_search_exa` for unknown profiles
Staff directories	Playwright	`playwright_browser_snapshot` for JS-rendered pages

Relationship to Other Rules:

Rule 6: WebObservation claims MUST have XPath provenance (applies to person data too)
Rule 12: Person data reference pattern (use file paths, not inline duplication)
Rule 14: Exa LinkedIn profile extraction
Rule 20: Person entity profiles stored individually
Rule 21: Data fabrication prohibited (all person data must be real)

See:

.opencode/PERSON_DATA_PROVENANCE_RULE.md for complete documentation
schemas/20251121/linkml/modules/classes/PersonObservation.yaml for schema
schemas/20251121/linkml/modules/classes/StaffRole.yaml for role definitions

Rule 27: Person-Custodian Data Architecture - Single Source of Truth

🚨 CRITICAL: Person entity files are the SINGLE SOURCE OF TRUTH for all person data. Custodian YAML files store only references and affiliation provenance - NEVER web claims or profile data.

This architecture ensures no data duplication, clean separation of concerns, and enables cross-custodian career tracking.

Data Location Rules:

Data Type	Location	Example
Profile data (name, headline, about, experience, education, skills)	Person entity file	`data/custodian/person/entity/{slug}_{timestamp}.json`
Web claims (provenance for name, role, etc.)	Person entity file	`web_claims` array in entity JSON
Affiliations (all custodians this person works at)	Person entity file	`affiliations` array in entity JSON
Affiliation provenance (when/how association was observed)	Custodian YAML	`affiliation_provenance` block
Reference to entity file	Custodian YAML	`linkedin_profile_path` field

Person Entity File Structure (data/custodian/person/entity/{slug}_{timestamp}.json):

{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/{custodian}_staff_{timestamp}.json",
    "staff_id": "{custodian}_staff_0001_{name_slug}",
    "extraction_date": "2025-12-14T11:21:47Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/{slug}"
  },
  "profile_data": {
    "name": "Full Name",
    "headline": "Current Role at Organization",
    "location": "City, Region, Country",
    "about": "Professional summary...",
    "experience": [...],
    "education": [...],
    "skills": [...]
  },
  "web_claims": [
    {
      "claim_type": "full_name",
      "claim_value": "Full Name",
      "source_url": "https://www.linkedin.com/in/{slug}",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    },
    {
      "claim_type": "role_title",
      "claim_value": "Current Role at Organization",
      "source_url": "https://www.linkedin.com/in/{slug}",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    }
  ],
  "affiliations": [
    {
      "custodian_name": "Nationaal Archief",
      "custodian_slug": "nationaal-archief",
      "role_title": "Role at this institution",
      "heritage_relevant": true,
      "heritage_type": "A",
      "current": true,
      "observed_on": "2025-12-14T11:21:47Z",
      "source_url": "https://www.linkedin.com/company/nationaal-archief/people/"
    }
  ]
}

Custodian YAML Structure (data/custodian/{GHCID}.yaml):

person_observations:
  staff:
  - person_id: nationaal-archief_staff_0001_bibian_van_reeken
    person_name: Bibian van Reeken
    role_title: Projectmanager Digitalisering bij het Nationaal Archief
    heritage_relevant: true
    heritage_type: A
    current: true
    
    # ONLY affiliation provenance - when/how was this association observed?
    affiliation_provenance:
      source_url: https://www.linkedin.com/company/nationaal-archief/people/
      retrieved_on: '2025-12-14T11:21:47Z'
      retrieval_agent: linkedin_html_parser
    
    # Reference to entity file (contains full profile + web claims)
    linkedin_profile_url: https://www.linkedin.com/in/bibianvanreeken
    linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json

✅ CORRECT - What Goes Where:

In Custodian YAML	In Person Entity File
`person_id` (identifier)	`extraction_metadata` (full provenance)
`person_name` (for display)	`profile_data` (complete profile)
`role_title` (current role)	`web_claims` (provenance for each claim)
`affiliation_provenance` (when observed)	`affiliations` (all custodians)
`linkedin_profile_path` (reference)	Full profile content

❌ WRONG - Web Claims in Custodian File:

# WRONG! Do NOT put web_claims in custodian YAML
person_observations:
  staff:
  - person_id: nationaal-archief_staff_0001_name
    web_claims:  # ← WRONG! These belong in entity file
      - claim_type: full_name
        claim_value: ...

Benefits of This Architecture:

No Data Duplication: Same person at multiple institutions → ONE entity file, multiple references
Single Source of Truth: Update person's profile once → All custodian references automatically current
Clean Separation: Entity file = who is this person; Custodian file = how are they affiliated
Cross-Custodian Tracking: Query all affiliations from entity file's affiliations array
Network Analysis Ready: Entity files support building relationship graphs

Relationship to Other Rules:

Rule 5: Never delete enriched data - additive only
Rule 12: Person data reference pattern (file paths, not inline duplication)
Rule 20: Person entity profiles stored individually
Rule 22: Custodian YAML is single source of truth for custodian enrichment data
Rule 26: Person data provenance - web claims required (now clarified: in entity files)

Scripts:

Script	Purpose
`scripts/parse_linkedin_html.py`	Parse LinkedIn company staff pages → affiliated/parsed/
`scripts/link_person_observations.py`	Link entity files to custodian YAML
`scripts/fetch_linkedin_profiles_exa.py`	Extract full profiles via Exa → entity/

See:

.opencode/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md for complete documentation
.opencode/PERSON_DATA_REFERENCE_PATTERN.md for reference pattern details
docs/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md for detailed architecture guide

Rule 28: Web Claims Deduplication - No Redundant Claims

🚨 CRITICAL: Do not state the same claim value multiple times unless there is strong variation in its value AND genuine uncertainty about its accuracy.

Web claims extracted from institutional websites often contain duplicate or near-duplicate information that must be deduplicated during verification.

Common Duplicate Patterns to Eliminate:

Duplicate Type	Example	Action
Favicon variants	5 different favicon sizes	Keep only `/favicon.ico`
Same value, different extraction	`page_title` and `org_name` from same `<title>` tag	Keep `org_name` only
Dynamic content	`image_count: 13`, `gallery_detected`	Remove (changes frequently)
Redundant social links	Same LinkedIn URL extracted twice	Keep one instance

When to Keep Multiple Claims:

Only keep multiple claims for the same property type when:

Genuine variation exists: Different regional social media accounts
Uncertainty about accuracy: Conflicting values from different sources
Temporal tracking: Historical vs current values (use valid_from/valid_to)

Implementation Pattern:

web_enrichment:
  verified_claims:
    verification_date: '2025-01-14T00:00:00Z'
    verification_method: firecrawl_live_scrape
    claims:
    - claim_type: org_name
      claim_value: Institution Name
      verification_status: verified
    # ONE social_linkedin, ONE favicon, etc.
  
  removed_claims:
    removal_date: '2025-01-14T00:00:00Z'
    removal_reason: Duplicates or low-value dynamic content
    claims:
    - claim_type: page_title
      reason: Duplicate of org_name (same xpath, same value)
    - claim_type: favicon
      original_values: [/favicon-32x32.png, /favicon-16x16.png]
      reason: Duplicate favicon variants, primary /favicon.ico retained

Audit Trail Requirement:

Always document removed claims in removed_claims section with:

claim_type: What was removed
reason: Why it was removed
original_values: (optional) What the duplicate values were

This preserves data governance audit trail while eliminating redundancy.

See: .opencode/WEB_CLAIMS_DEDUPLICATION_RULE.md for complete documentation and examples

Project Overview

Goal: Extract structured data about worldwide GLAMORCUBESFIXPHDNT (Galleries, Libraries, Archives, Museums, Official institutions, Research centers, Corporations, Unknown, Botanical gardens/zoos, Educational providers, Societies, Features, Intangible heritage groups, miXed, Personal collections, Holy sites, Digital platforms, NGOs, Taste/smell heritage) institutions from 139+ Claude conversation JSON files and integrate with authoritative CSV datasets.

Output: Validated LinkML-compliant records representing heritage custodian organizations with provenance tracking, geographic data, identifiers, and relationship information.

Schema: See the modular LinkML schema v0.2.1 with 19-type GLAMORCUBESFIXPHDNT taxonomy described below.

Schema Reference (v0.2.1)

The project uses a modular LinkML schema organized into 6 specialized modules:

schemas/heritage_custodian.yaml - Main schema (import-only structure)
- Top-level schema that imports all modules
- Defines schema metadata and namespace
schemas/core.yaml - Core Classes
- HeritageCustodian - Main institution entity
- Location - Geographic data
- Identifier - External identifiers (ISIL, Wikidata, VIAF, etc.)
- DigitalPlatform - Online systems and platforms
- GHCID - Global Heritage Custodian Identifier
schemas/enums.yaml - Enumerations
- InstitutionTypeEnum - 13 institution types (GALLERY, LIBRARY, ARCHIVE, MUSEUM, etc.)
- ChangeTypeEnum - 11 organizational change types (FOUNDING, MERGER, CLOSURE, etc.)
- DataSource - Data origin types (CSV_REGISTRY, CONVERSATION_NLP, etc.)
- DataTier - Data quality tiers (TIER_1_AUTHORITATIVE through TIER_4_INFERRED)
- PlatformTypeEnum - Digital platform categories
schemas/provenance.yaml - Provenance Tracking
- Provenance - Data source and quality metadata
- ChangeEvent - Organizational change history (mergers, relocations, etc.)
- GHCIDHistoryEntry - GHCID change tracking over time
schemas/collections.yaml - Collection Metadata
- Collection - Collection descriptions
- Accession - Acquisition records
- DigitalObject - Digital surrogates
schemas/dutch.yaml - Dutch-Specific Extensions
- DutchHeritageCustodian - Netherlands heritage institutions
- Extensions for ISIL registry, platform integrations, KvK numbers

See /docs/SCHEMA_MODULES.md for detailed architecture and design patterns.

Base Ontologies for Global GLAM Data

CRITICAL: Before designing extraction pipelines or extending the schema, AI agents MUST consult the base ontologies that the LinkML schema builds upon. These ontologies provide standardized vocabularies and patterns for modeling heritage institutions.

Foundation Ontologies

The GLAM project integrates with three primary ontologies, each serving different geographic and semantic scopes:

1. TOOI - Dutch Government Organizational Ontology

File: /data/ontology/tooiont.ttl
Namespace: https://identifier.overheid.nl/tooi/def/ont/
Scope: Dutch heritage institutions (government archives, state museums, public cultural organizations)

When to Use:

✅ Extracting Dutch heritage institutions from conversations
✅ Modeling Dutch organizational change events (mergers, splits, reorganizations)
✅ Integrating with Dutch ISIL registry or KvK (Chamber of Commerce) data
✅ Parsing Dutch government heritage agency data

Key Classes:

tooi:Overheidsorganisatie - Government organization (extends to DutchHeritageCustodian)
tooi:Wijzigingsgebeurtenis - Change event (founding, merger, closure, relocation)

Key Properties:

tooi:officieleNaamInclSoort - Official name including type
tooi:begindatum / tooi:einddatum - Temporal validity (start/end dates)
tooi:organisatieIdentificatie - Formal identifiers (ISIL codes, etc.)

LinkML Mapping:

# schemas/dutch.yaml extends TOOI
DutchHeritageCustodian:
  is_a: HeritageCustodian
  class_uri: tooi:Overheidsorganisatie  # ← Maps to TOOI

Reference: See /docs/ONTOLOGY_EXTENSIONS.md for complete TOOI integration patterns.

2. CPOV - EU Core Public Organisation Vocabulary

Files:

/data/ontology/core-public-organisation-ap.ttl (RDF schema)
/data/ontology/core-public-organisation-ap.jsonld (JSON-LD context)

Namespace: http://data.europa.eu/m8g/
Scope: EU-wide and global public sector heritage organizations

When to Use:

✅ Extracting European heritage institutions (France, Germany, Belgium, etc.)
✅ Modeling international/global heritage organizations
✅ Aligning with EU Linked Open Data initiatives (Europeana, DPLA)
✅ Extracting non-Dutch institutions from conversations

Key Classes:

cpov:PublicOrganisation - Public sector organization (base for HeritageCustodian)
cv:ChangeEvent - Organizational change events
locn:Address - Physical location data

Key Properties:

skos:prefLabel / skos:altLabel - Preferred and alternative names
dct:identifier - Formal identifiers (ISIL, Wikidata, VIAF)
dct:temporal - Temporal coverage (founding to closure dates)
locn:address - Physical addresses

LinkML Mapping:

# schemas/core.yaml aligns with CPOV
HeritageCustodian:
  class_uri: cpov:PublicOrganisation  # ← Maps to CPOV
  
  slots:
    name:
      slot_uri: skos:prefLabel
    alternative_names:
      slot_uri: skos:altLabel
    identifiers:
      slot_uri: dct:identifier

Reference: See /docs/ONTOLOGY_EXTENSIONS.md for complete CPOV integration patterns.

3. Schema.org - Web Vocabulary for Structured Data

File: /data/ontology/schemaorg.owl
Namespace: http://schema.org/
Scope: Universal web semantics (museums, galleries, collections, events, learning resources)

When to Use:

✅ Extracting private collections or non-governmental organizations
✅ Modeling digital platforms (learning management systems, discovery portals)
✅ Web discoverability and SEO optimization
✅ Fallback when TOOI/CPOV don't apply

Key Classes:

schema:Museum / schema:Library / schema:ArchiveOrganization - Heritage institution types
schema:Place - Geographic locations
schema:LearningResource - Educational platforms (LMS, online courses)
schema:Event - Organizational events (founding, exhibitions)

LinkML Mapping:

# schemas/enums.yaml maps platform types to Schema.org
DigitalPlatformTypeEnum:
  LEARNING_MANAGEMENT:
    meaning: schema:LearningResource  # ← Maps to Schema.org

Reference: See /docs/ONTOLOGY_EXTENSIONS.md for Schema.org usage examples.

Ontology Decision Tree for Agents

When extracting heritage institution data, choose the appropriate ontology:

START: Extract institution from conversation
  ↓
Is the institution Dutch?
  ├─ YES → Use TOOI ontology
  │         - Map to schemas/dutch.yaml
  │         - Extract ISIL codes (NL-* format)
  │         - Extract KvK numbers (8-digit)
  │         - Model change events as tooi:Wijzigingsgebeurtenis
  │
  └─ NO → Is it a public/government organization?
           ├─ YES → Use CPOV ontology
           │         - Map to schemas/core.yaml
           │         - Extract standard identifiers (ISIL, Wikidata, VIAF)
           │         - Model change events as cv:ChangeEvent
           │
           └─ NO → Use Schema.org
                    - Map to schemas/core.yaml
                    - Use schema:Museum, schema:Library, etc.
                    - Emphasize web discoverability

Multi-Ontology Support: Institutions can implement MULTIPLE ontology classes simultaneously:

<https://w3id.org/heritage/custodian/nl/rijksmuseum>
    a tooi:Overheidsorganisatie,  # Dutch government organization
      cpov:PublicOrganisation,        # EU public sector
      schema:Museum ;                 # Schema.org web semantics

Required Ontology Consultation Workflow

Before extracting data, agents MUST perform these steps:

Step 1: Identify Institution Geographic Scope

# Determine which ontology applies
if institution_country == "NL":
    primary_ontology = "TOOI"
    ontology_file = "/data/ontology/tooiont.ttl"
elif institution_in_europe or institution_public_sector:
    primary_ontology = "CPOV"
    ontology_file = "/data/ontology/core-public-organisation-ap.ttl"
else:
    primary_ontology = "Schema.org"
    ontology_file = "/data/ontology/schemaorg.owl"

Step 2: Review Ontology Classes and Properties

Search ontology files for relevant classes:

# Dutch institutions - search TOOI
rg "tooi:Overheidsorganisatie|Wijzigingsgebeurtenis|begindatum" /data/ontology/tooiont.ttl

# EU/global institutions - search CPOV
rg "cpov:PublicOrganisation|cv:ChangeEvent|locn:Address" /data/ontology/core-public-organisation-ap.ttl

# All institutions - search Schema.org
rg "schema:Museum|schema:Library|schema:ArchiveOrganization" /data/ontology/schemaorg.owl

Step 3: Map Conversation Data to Ontology Properties

Create a mapping table before extraction:

Extracted Field	TOOI Property	CPOV Property	Schema.org Property
Institution name	`tooi:officieleNaamInclSoort`	`skos:prefLabel`	`schema:name`
Alternative names	-	`skos:altLabel`	`schema:alternateName`
Founding date	`tooi:begindatum`	`schema:startDate`	`schema:foundingDate`
Closure date	`tooi:einddatum`	`schema:endDate`	`schema:dissolutionDate`
ISIL code	`tooi:organisatieIdentificatie`	`dct:identifier`	`schema:identifier`
Address	(use `locn:Address`)	`locn:address`	`schema:address`
Merger event	`tooi:Wijzigingsgebeurtenis`	`cv:ChangeEvent`	`schema:Event`
Website	-	`schema:url`	`schema:url`

Step 4: Document Ontology Alignment in Provenance

Always include ontology references in extraction metadata:

provenance:
  data_source: CONVERSATION_NLP
  extraction_method: "NLP extraction following CPOV ontology patterns"
  base_ontology: "http://data.europa.eu/m8g/"  # ← Document which ontology used
  ontology_alignment:
    - "cpov:PublicOrganisation"
    - "cv:ChangeEvent"
  extraction_date: "2025-11-09T..."

Common Ontology Patterns

Pattern 1: Organizational Change Events

When extracting mergers, splits, relocations, name changes:

# TOOI pattern (Dutch institutions)
change_history:
  - event_id: https://w3id.org/heritage/custodian/event/nha-merger-2001
    change_type: MERGER  # Maps to tooi:Wijzigingsgebeurtenis
    event_date: "2001-01-01"
    event_description: "Merger of Gemeentearchief Haarlem and Rijksarchief in Noord-Holland"
    ontology_class: "tooi:Wijzigingsgebeurtenis"

# CPOV pattern (EU/global institutions)
change_history:
  - event_id: https://w3id.org/heritage/custodian/event/bnf-founding
    change_type: FOUNDING  # Maps to cv:ChangeEvent
    event_date: "1461-01-01"
    event_description: "Founded by King Louis XI as Royal Library"
    ontology_class: "cv:ChangeEvent"

Pattern 2: Multilingual Names

CPOV and Schema.org support language-tagged literals:

name: Bibliothèque nationale de France
alternative_names:
  - National Library of France@en
  - BnF@fr
  - Französische Nationalbibliothek@de

# RDF serialization:
# skos:prefLabel "Bibliothèque nationale de France"@fr ;
# skos:altLabel "National Library of France"@en, "BnF"@fr ;

Pattern 3: Hierarchical Relationships

Use W3C Org Ontology patterns (integrated in CPOV):

# Parent institution
parent_organization:
  name: Ministry of Culture
  relationship_type: "org:hasUnit"  # CPOV uses W3C Org Ontology
  
# Branch institutions
branches:
  - name: Regional Archive Noord-Brabant
    relationship_type: "org:subOrganizationOf"

Anti-Patterns to Avoid

❌ DON'T: Invent custom properties when ontology equivalents exist

# BAD - Custom property instead of ontology reuse
institution_official_name: "Rijksarchief"  # Use skos:prefLabel instead!

❌ DON'T: Ignore ontology namespace conventions

# BAD - No ontology reference
change_type: "merger"  # Use cv:ChangeEvent with proper namespace!

❌ DON'T: Extract without reviewing ontology files

# BAD - Extracting Dutch institutions without reading TOOI
agent: "I'll extract Dutch archives using Schema.org only"
# This loses semantic precision and ignores domain-specific patterns!

✅ DO: Always map to base ontologies and document alignment

# GOOD - Ontology-aligned extraction
name: Rijksarchief in Noord-Holland
institution_type: ARCHIVE
ontology_class: tooi:Overheidsorganisatie  # ← Documented
provenance:
  base_ontology: "https://identifier.overheid.nl/tooi/def/ont/"
  ontology_alignment:
    - tooi:Overheidsorganisatie
    - prov:Organization  # TOOI uses PROV-O for temporal tracking

Additional Ontology Resources

CIDOC-CRM (Cultural Heritage Domain):

File: /data/ontology/CIDOC_CRM_v7.1.3.rdf
Use for: Museum object cataloging, provenance, conservation
Key classes: crm:E74_Group (organizations), crm:E5_Event (historical events)

RiC-O (Records in Contexts - Archival Description):

Use for: Archival collections, fonds, series, items
Key classes: rico:CorporateBody, rico:RecordSet
Integration: Planned for future schema extension

BIBFRAME (Bibliographic Resources):

Use for: Library catalogs, bibliographic metadata
Key classes: bf:Organization, bf:Work, bf:Instance
Integration: For library-specific extensions

Reference Documentation: See /docs/ONTOLOGY_EXTENSIONS.md for comprehensive integration patterns, RDF serialization examples, and extension workflows.

Institution Type Taxonomy

The project uses a 19-type GLAMORCUBESFIXPHDNT taxonomy (expanded November 2025) with single-letter codes for GHCID identifier generation:

Type	Code	Description	Example Use Cases
GALLERY	G	Art gallery or exhibition space	Commercial galleries, kunsthallen
LIBRARY	L	Library (public, academic, specialized)	National libraries, university libraries
ARCHIVE	A	Archive (government, corporate, personal)	National archives, city archives
MUSEUM	M	Museum (art, history, science, etc.)	Rijksmuseum, natural history museums
OFFICIAL_INSTITUTION	O	Government heritage agencies	Provincial archives, heritage platforms
RESEARCH_CENTER	R	Research institutes and documentation centers	Knowledge centers, research libraries
CORPORATION	C	Corporate heritage collections	Company archives, corporate museums
UNKNOWN	U	Institution type cannot be determined	Ambiguous or unclassifiable organizations
BOTANICAL_ZOO	B	Botanical gardens and zoological parks	Arboreta, botanical gardens, zoos
EDUCATION_PROVIDER	E	Educational institutions with collections	Schools, training centers with heritage materials, universities
COLLECTING_SOCIETY	S	Societies collecting specialized materials	Numismatic societies, heritage societies (heemkundige kring)
FEATURES	F	Physical landscape features with heritage significance	Monuments, sculptures, statues, memorials, landmarks, cemeteries
INTANGIBLE_HERITAGE_GROUP	I	Organizations preserving intangible heritage	Traditional performance groups, oral history societies, folklore organizations
MIXED	X	Multiple types (uses X code)	Combined museum/archive facilities
PERSONAL_COLLECTION	P	Private personal collections	Individual collectors
HOLY_SITES	H	Religious heritage sites and institutions	Churches, temples, mosques, synagogues with collections
DIGITAL_PLATFORM	D	Digital heritage platforms and repositories	Online archives, digital libraries, virtual museums
NGO	N	Non-governmental heritage organizations	Heritage advocacy groups, preservation societies
TASTE_SMELL	T	Culinary and olfactory heritage institutions	Historic restaurants, parfumeries, distilleries preserving traditional recipes and formulations

Notes:

MIXED institutions use "X" as the GHCID code and document all actual types in metadata
HOLY_SITES includes religious institutions managing cultural heritage collections (archives, libraries, artifacts)
FEATURES includes physical monuments and landscape features with heritage value (not institutions maintaining collections)
COLLECTING_SOCIETY includes historical societies (historische vereniging), philatelic societies, numismatic clubs, ephemera collectors
OFFICIAL_INSTITUTION includes aggregation platforms, provincial heritage services, and government heritage agencies
INTANGIBLE_HERITAGE_GROUP covers organizations preserving UNESCO-recognized intangible cultural heritage
DIGITAL_PLATFORM includes born-digital heritage platforms and digitization aggregators
NGO includes non-profit heritage organizations that don't fit other categories
TASTE_SMELL includes establishments actively preserving culinary traditions, historic recipes, perfume formulations, and sensory heritage
When institution type is unknown, records default to UNKNOWN pending verification

Mnemonic: GLAMORCUBESFIXPHDNT - Galleries, Libraries, Archives, Museums, Official institutions, Research centers, Corporations, Unknown, Botanical gardens/zoos, Education providers, Societies, Features, Intangible heritage groups, miXed, Personal collections, Holy sites, Digital platforms, NGOs, Taste/smell heritage

Note on order: The mnemonic GLAMORCUBESFIXPHDNT represents the alphabetical ordering by code: G-L-A-M-O-R-C-U-B-E-S-F-I-X-P-H-D-N-T

Note: Universities are classified under E (EDUCATION_PROVIDER), not U. The U-class is reserved for institutions where the type cannot be determined during data extraction.

Data Sources

Primary Sources

Conversation JSON files (/Users/kempersc/Documents/claude/glam/*.json)
- 139 conversation files covering global GLAM research
- Countries include: Brazil, Vietnam, Chile, Japan, Mexico, Norway, Thailand, Taiwan, Belgium, Azerbaijan, Estonia, Namibia, Argentina, Tunisia, Ghana, Iran, Russia, Uzbekistan, Armenia, Georgia, Croatia, Greece, Nigeria, Somalia, Yemen, Oman, South Korea, Malaysia, Colombia, Switzerland, Moldova, Romania, Albania, Bosnia, Pakistan, Suriname, Nicaragua, Congo, Denmark, Austria, Australia, Myanmar, Cambodia, Sri Lanka, Tajikistan, Turkmenistan, Philippines, Latvia, Palestine, Limburg (NL), Gelderland (NL), Drenthe (NL), Groningen (NL), Slovakia, Kenya, Paraguay, Honduras, Mozambique, Eritrea, Sudan, Rwanda, Kiribati, Jamaica, Indonesia, Italy, Zimbabwe, East Timor, UAE, Kuwait, Lebanon, Syria, Maldives, Benin
- Also 14 ontology research conversations
Dutch ISIL Registry (data/ISIL-codes_2025-08-01.csv)
- ~300 Dutch heritage institutions
- Fields: Volgnr, Plaats, Instelling, ISIL code, Toegekend op, Opmerking
- Authoritative source (Tier 1)
Dutch Organizations CSV (data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv)
- Comprehensive Dutch heritage organizations
- 40+ metadata columns including: name, address, ISIL code, organization type, partnerships, systems used, metadata standards
- Rich integration data (Museum register, Rijkscollectie, Collectie Nederland, Archieven.nl, etc.)
- Authoritative source (Tier 1)

Implementation Status (Updated Nov 2025)

Both Dutch datasets have been successfully parsed and cross-linked:

ISIL Registry ✅:

364 institutions parsed (2 invalid codes rejected)
203 cities covered
Parser: src/glam_extractor/parsers/isil_registry.py
Tests: 10/10 passing (84% coverage)

Dutch Organizations ✅:

1,351 institutions parsed
475 cities covered
1,119 organizations with digital platforms
Parser: src/glam_extractor/parsers/dutch_orgs.py
Tests: 18/18 passing (98% coverage)

Cross-linking Results 🔗:

340 institutions matched by ISIL code (92.1% overlap)
198 records enriched with platform data
127 name conflicts detected (require manual review)
1,004 organizations without ISIL codes (candidates for assignment)

Analysis Scripts:

compare_dutch_datasets.py - Dataset comparison
crosslink_dutch_datasets.py - TIER_1 data merging demo
test_real_dutch_orgs.py - Real data validation

See PROGRESS.md for detailed statistics and findings.

Conversation JSON Structure

Each conversation JSON file has the following structure:

{
  "uuid": "conversation-uuid",
  "name": "Conversation name (often includes country/region)",
  "summary": "Optional summary",
  "created_at": "ISO 8601 timestamp",
  "updated_at": "ISO 8601 timestamp",
  "chat_messages": [
    {
      "uuid": "message-uuid",
      "text": "User or assistant message text",
      "sender": "human" | "assistant",
      "content": [
        {
          "type": "text" | "tool_use" | "tool_result",
          "text": "Message content (may contain markdown, lists, etc.)",
          ...
        }
      ]
    }
  ]
}

NLP Extraction Tasks

All extraction tasks map to the modular LinkML schema v0.2.0. See Schema Reference section above for module details.

Task 1: Entity Recognition - Institution Names

Objective: Extract heritage institution names from conversation text.

Schema Mapping: Populates HeritageCustodian class from schemas/core.yaml

Patterns to Look For:

Organization names (proper nouns)
Museum names (often contain "Museum", "Museu", "Museo", "Muzeum", etc.)
Library names (contain "Library", "Biblioteca", "Bibliothek", "Bibliotheek", etc.)
Archive names (contain "Archive", "Archivo", "Archiv", "Archief", etc.)
Gallery names
Cultural centers
Holy sites with collections (churches, temples, mosques, synagogues, monasteries, abbeys, cathedrals managing heritage materials)

Contextual Indicators:

Lists of institutions
Descriptions like "The X is a museum in Y"
URLs containing institution names
Mentions of collections, exhibitions, or holdings

Example Extraction:

Input: "The Biblioteca Nacional do Brasil in Rio de Janeiro holds over 9 million items..."

Output:
- name: "Biblioteca Nacional do Brasil"  # HeritageCustodian.name
- institution_type: LIBRARY  # InstitutionTypeEnum from schemas/enums.yaml
- city: "Rio de Janeiro"  # Location.city from schemas/core.yaml
- confidence_score: 0.95  # Provenance.confidence_score from schemas/provenance.yaml

Task 2: Location Extraction

Objective: Extract geographic information associated with institutions.

Schema Mapping: Populates Location class from schemas/core.yaml

Extract:

City names
Street addresses (when mentioned)
Postal codes
Provinces/states/regions
Country (can often be inferred from conversation title)

Geocoding:

Use Nominatim API to geocode addresses to lat/lon
Link to GeoNames IDs when possible
Handle multilingual place names

Example:

Input: "Nationaal Onderduikmuseum, Aalten"

Output:
- city: "Aalten"  # Location.city
- country: "NL"  # Location.country (ISO 3166-1 alpha-2)
- geonames_id: "2759899" (lookup via API)  # Location.geonames_id
- latitude: 51.9167 (from geocoding)
- longitude: 6.5833

Task 3: Identifier Extraction

Objective: Extract external identifiers mentioned in conversations.

Schema Mapping: Populates Identifier class from schemas/core.yaml

Identifier Types:

ISIL codes (format: NL-XXXXX, US-XXXXX, etc.)
Wikidata IDs (format: Q12345)
VIAF IDs (format: numeric)
URLs to institutional websites
KvK numbers (Dutch: 8-digit format)

Patterns:

ISIL: [A-Z]{2}-[A-Za-z0-9]+
Wikidata: Q[0-9]+
VIAF: viaf.org/viaf/[0-9]+
KvK: [0-9]{8}

Example:

Input: "ISIL code NL-AsdAM for Amsterdam Museum"

Output:
- identifier_scheme: "ISIL"  # Identifier.identifier_scheme
- identifier_value: "NL-AsdAM"  # Identifier.identifier_value
- institution_name: "Amsterdam Museum"  # HeritageCustodian.name (for linking)

Task 4: Relationship Extraction

Objective: Extract relationships between institutions.

Schema Mapping: Maps to ChangeEvent class from schemas/provenance.yaml (for mergers, splits) and future relationship modeling

Relationship Types:

Parent-child (e.g., "X is part of Y")
Partnerships (e.g., "X collaborates with Y")
Network memberships (e.g., "X is a member of Z consortium")
Merged organizations (e.g., "X merged with Y") → ChangeTypeEnum.MERGER

Indicators:

"part of", "branch of", "division of"
"in partnership with", "collaborates with"
"member of", "belongs to"
"merged with", "absorbed by" → Use ChangeEvent from schemas/provenance.yaml

Task 5: Collection Metadata Extraction

Objective: Extract information about collections held by institutions.

Schema Mapping: Populates Collection class from schemas/collections.yaml

Extract:

Collection names → Collection.collection_name
Collection types (archival, bibliographic, museum objects)
Subject areas → Collection.subject_areas
Time periods covered → Collection.temporal_coverage
Item counts (when mentioned) → Collection.extent
Access information → Collection.access_rights

Example:

Input: "The archive holds 15,000 documents from the 18th-19th centuries..."

Output:
- collection_type: "archival"  # Collection metadata
- item_count: 15000  # Collection.extent
- time_period_start: "1700-01-01"  # Collection.temporal_coverage
- time_period_end: "1899-12-31"

Task 6: Digital Platform Identification

Objective: Identify digital platforms and systems used by institutions.

Schema Mapping: Populates DigitalPlatform class from schemas/core.yaml

Platform Types:

Collection management systems (Atlantis, MAIS, CollectiveAccess, etc.)
Digital repositories (DSpace, EPrints, Fedora)
Discovery portals
SPARQL endpoints
APIs

Extract:

Platform name → DigitalPlatform.platform_name
Platform URL → DigitalPlatform.platform_url
Metadata standards used → DigitalPlatform.metadata_standards
Integration with aggregators (Europeana, DPLA, etc.)

Task 7: Metadata Standards Detection

Objective: Identify which metadata standards institutions use.

Schema Mapping: Stores in DigitalPlatform.metadata_standards (list of strings)

Standards to Detect:

Dublin Core
MARC21
EAD (Encoded Archival Description)
BIBFRAME
LIDO
CIDOC-CRM
Schema.org
RiC-O (Records in Contexts)
MODS, PREMIS, SPECTRUM, DACS

Indicators:

Explicit mentions: "uses Dublin Core", "MARC21 records"
Implicit: technical discussions about cataloging practices

Task 8: Organizational Change Event Extraction (NEW - v0.2.0)

Objective: Extract significant organizational change events from conversation history.

Schema Mapping: Populates ChangeEvent class from schemas/provenance.yaml

Change Types to Detect (from ChangeTypeEnum in schemas/enums.yaml):

FOUNDING: "established", "founded", "created", "opened"
CLOSURE: "closed", "dissolved", "ceased operations", "shut down"
MERGER: "merged with", "combined with", "joined with", "absorbed"
SPLIT: "split into", "divided into", "separated from", "spun off"
ACQUISITION: "acquired", "took over", "purchased"
RELOCATION: "moved to", "relocated to", "transferred to"
NAME_CHANGE: "renamed to", "formerly known as", "changed name to"
TYPE_CHANGE: "became a museum", "converted to archive", "now operates as"
STATUS_CHANGE: "reopened", "temporarily closed", "suspended operations"
RESTRUCTURING: "reorganized", "restructured", "reformed"
LEGAL_CHANGE: "incorporated as", "became a foundation", "legal status changed"

Extract for Each Event:

change_history:  # HeritageCustodian.change_history (list of ChangeEvent)
  - event_id: "https://w3id.org/heritage/custodian/event/unique-id"  # ChangeEvent.event_id
    change_type: MERGER  # ChangeEvent.change_type (ChangeTypeEnum from schemas/enums.yaml)
    event_date: "2001-01-01"  # ChangeEvent.event_date
    event_description: >-  # ChangeEvent.event_description
      Merger of Institution A and Institution B to form new organization C.
      Detailed description from conversation.
    affected_organization: null  # ChangeEvent.affected_organization (optional)
    resulting_organization: null  # ChangeEvent.resulting_organization (optional)
    related_organizations: []  # ChangeEvent.related_organizations (optional)
    source_documentation: "https://..."  # ChangeEvent.source_documentation (optional)

Temporal Context Indicators:

"In 2001, the museum merged with..."
"After the renovation in 1985..."
"Following the name change in 1968..."
"The archive was relocated from X to Y in 1923"

PROV-O Integration:

Map to prov:Activity in RDF serialization
Link with prov:wasInfluencedBy from HeritageCustodian
Use prov:atTime for event timestamps
Track prov:entity (affected) and prov:generated (resulting) organizations

Example Extraction:

Input: "The Noord-Hollands Archief was formed in 2001 through a merger of 
        Gemeentearchief Haarlem (founded 1910) and Rijksarchief in Noord-Holland 
        (founded 1802). The merger created a unified regional archive serving both 
        the city and province."

Output:
- event_id: "https://w3id.org/heritage/custodian/event/nha-merger-2001"
- change_type: MERGER  # ChangeTypeEnum.MERGER
- event_date: "2001-01-01"
- event_description: "Merger of Gemeentearchief Haarlem (municipal archive, founded 
                      1910) and Rijksarchief in Noord-Holland (state archive, founded 
                      1802) to form Noord-Hollands Archief."
- confidence_score: 0.95  # From Provenance metadata

GHCID Impact:

When institutions merge, relocate, or change names, GHCID may change
Track old GHCID in ghcid_history with valid_to timestamp matching event date → GHCIDHistoryEntry from schemas/provenance.yaml
Create new GHCIDHistoryEntry with valid_from matching event date
Link change event to GHCID change via temporal correlation

Indicators:

Task 9: Holy Sites Heritage Collection Identification

Objective: Identify religious sites that function as heritage custodians by maintaining cultural collections.

Schema Mapping: Populates HeritageCustodian class with institution_type: HOLY_SITES

When to Classify as HOLY_SITES:

Religious institutions qualify as HOLY_SITES heritage custodians when they manage:

Archival collections: Historical documents, parish registers, ecclesiastical records
Library collections: Rare manuscripts, theological texts, historical books
Museum collections: Religious artifacts, liturgical objects, art collections
Cultural heritage: Historical buildings with guided tours, preservation programs

Patterns to Look For:

Church archives (parish records, baptismal registers, historical documents)
Monastery libraries (manuscript collections, rare books)
Cathedral treasuries (liturgical objects, religious art)
Temple museums (Buddhist artifacts, historical collections)
Mosque libraries (Islamic manuscripts, Quranic texts)
Synagogue archives (Jewish community records, Torah scrolls)
Abbey collections (medieval manuscripts, historical artifacts)

Keywords and Indicators:

"church archive", "parish records", "ecclesiastical archive"
"monastery library", "monastic collection", "scriptorium"
"cathedral treasury", "cathedral museum"
"temple library", "temple collection"
"mosque library", "Islamic manuscript collection"
"synagogue archive", "Jewish heritage collection"
"religious heritage site", "pilgrimage site with museum"

NOT Holy Sites (use other types):

Secular museums about religion (use MUSEUM)
Academic religious studies centers (use RESEARCH_CENTER or UNIVERSITY)
Government archives of church records (use ARCHIVE)
Religious organizations without heritage collections (not heritage custodians)

Example Extraction:

Input: "The Vatican Apostolic Archive holds over 85 km of shelving with 
        documents dating back to the 8th century, including papal bulls, 
        correspondence, and medieval manuscripts."

Output:
- name: Vatican Apostolic Archive
  institution_type: HOLY_SITES  # Religious institution managing heritage collections
  description: >-
    The Vatican Apostolic Archive (formerly Vatican Secret Archives) is 
    the central repository for papal and Vatican documents, holding over 
    35,000 volumes of historical records spanning 12 centuries.    
  locations:
    - city: Vatican City
      country: VA
  collections:
    - collection_name: Papal Documents
      collection_type: archival
      temporal_coverage: "0800-01-01/2024-12-31"
      extent: "85 kilometers of shelving, 35,000+ volumes"
  provenance:
    data_source: CONVERSATION_NLP
    confidence_score: 0.95

Schema.org Mapping:

HOLY_SITES maps to schema:PlaceOfWorship in RDF serialization
Can also use schema:ArchiveOrganization or schema:Library for collection-specific context
Use multiple type assertions when appropriate

Cross-Cultural Considerations:

Christianity: churches, cathedrals, monasteries, abbeys, convents
Islam: mosques, madrasas (with historical libraries)
Judaism: synagogues, yeshivas (with archival collections)
Buddhism: temples, monasteries, pagodas (with artifact collections)
Hinduism: temples (with historical collections)
Sikhism: gurdwaras (with historical manuscripts)
Other faiths: shrines, pilgrimage sites with documented heritage collections

Data Quality and Provenance

Provenance Tracking

Every extracted record MUST include:

provenance:
  data_source: CONVERSATION_NLP
  data_tier: TIER_4_INFERRED
  extraction_date: "2025-11-05T..."
  extraction_method: "Subagent NER + pattern matching"
  confidence_score: 0.85
  conversation_id: "conversation-uuid"
  source_url: null
  verified_date: null
  verified_by: null

Confidence Scoring

Assign confidence scores (0.0-1.0) based on:

0.9-1.0: Explicit, unambiguous mentions with context
0.7-0.9: Clear mentions with some ambiguity
0.5-0.7: Inferred from context, may need verification
0.3-0.5: Low confidence, likely needs verification
0.0-0.3: Very uncertain, flag for manual review

Data Tier Assignment

TIER_1_AUTHORITATIVE: CSV registries (ISIL, Dutch orgs)
TIER_2_VERIFIED: Data from institutional websites (crawl4ai)
TIER_3_CROWD_SOURCED: Wikidata, OpenStreetMap
TIER_4_INFERRED: NLP-extracted from conversations

Integration with CSV Data

Cross-linking Strategy

ISIL Code Matching (primary)
- If conversation mentions ISIL code, link to CSV record
- High confidence match
Name Matching (secondary)
- Normalize names (lowercase, remove punctuation, handle abbreviations)
- Fuzzy matching with threshold > 0.85
- Check for alternative names
Location + Type Matching (tertiary)
- Match by city + institution type
- Lower confidence, requires manual verification

Conflict Resolution

When conversation data conflicts with CSV data:

CSV data takes precedence (higher tier)
Mark conversation data with verified: false
Note conflict in provenance metadata
Create separate record if institutions are genuinely different

NLP Models and Tools

Recommended Approach: Agent-Based NER

IMPORTANT: Instead of directly using spaCy or other NER libraries in the main codebase, use coding subagents via the Task tool to conduct Named Entity Recognition and text extraction.

Why Subagents:

Keeps the main codebase clean and maintainable
Allows flexible experimentation with different NER approaches
Subagents can choose the best tool for each specific extraction task
Better separation of concerns: extraction logic vs. data pipeline

How to Use Subagents for NER:

Use the Task tool with subagent_type="general" for NER tasks
Provide clear prompts describing what entities to extract
Subagent will autonomously choose and apply appropriate NER tools (spaCy, transformers, regex, etc.)
Subagent returns structured extraction results
Main code validates and processes the results

CRITICAL: Creating LinkML Instance Files

Agent Capabilities Go Beyond Traditional NER

IMPORTANT: AI extraction agents are NOT limited to simple Named Entity Recognition. Unlike traditional NER tools that only identify entity boundaries and types, AI agents have comprehensive understanding and can:

Extract Complete Records: Capture ALL relevant information for each institution in one pass
Infer Missing Data: Use context to fill in fields that aren't explicitly stated
Cross-Reference Within Documents: Link related entities (locations, identifiers, events) automatically
Maintain Consistency: Ensure all extracted data conforms to the LinkML schema
Generate Rich Metadata: Create complete provenance tracking and confidence scores

Mandatory: Create Complete LinkML Instance Files

When extracting data from conversations or other sources, agents MUST:

✅ DO THIS: Create complete LinkML-compliant YAML instance files with ALL available information

# Example: data/instances/brazil_museums_001.yaml
---
# From schemas/core.yaml - HeritageCustodian class

- id: https://w3id.org/heritage/custodian/br/bnb-001
  name: Biblioteca Nacional do Brasil
  institution_type: LIBRARY  # From schemas/enums.yaml
  alternative_names:
    - National Library of Brazil
    - BNB
  description: >-
    The National Library of Brazil, located in Rio de Janeiro, is the largest 
    library in Latin America with over 9 million items. Founded in 1810 by 
    King João VI of Portugal. Collections include rare manuscripts, maps, 
    photographs, and Brazilian historical documents.    
  
  locations:  # From schemas/core.yaml - Location class
    - city: Rio de Janeiro
      street_address: Avenida Rio Branco, 219
      postal_code: "20040-008"
      region: Rio de Janeiro
      country: BR
      # Note: lat/lon can be geocoded later if not in text
  
  identifiers:  # From schemas/core.yaml - Identifier class
    - identifier_scheme: ISIL
      identifier_value: BR-RjBN
      identifier_url: https://isil.org/BR-RjBN
    - identifier_scheme: VIAF
      identifier_value: "123556639"
      identifier_url: https://viaf.org/viaf/123556639
    - identifier_scheme: Wikidata
      identifier_value: Q1526131
      identifier_url: https://www.wikidata.org/wiki/Q1526131
    - identifier_scheme: Website
      identifier_value: https://www.bn.gov.br
      identifier_url: https://www.bn.gov.br
  
  digital_platforms:  # From schemas/core.yaml - DigitalPlatform class
    - platform_name: Digital Library of the National Library of Brazil
      platform_url: https://bndigital.bn.gov.br
      platform_type: DISCOVERY_PORTAL
      metadata_standards:
        - Dublin Core
        - MARC21
  
  collections:  # From schemas/collections.yaml - Collection class
    - collection_name: Brazilian Historical Documents
      collection_type: archival
      subject_areas:
        - Brazilian History
        - Colonial Period
        - Imperial Brazil
      temporal_coverage: "1500-01-01/1889-11-15"
      extent: "Approximately 2.5 million documents"
  
  change_history:  # From schemas/provenance.yaml - ChangeEvent class
    - event_id: https://w3id.org/heritage/custodian/event/bnb-founding-1810
      change_type: FOUNDING
      event_date: "1810-01-01"
      event_description: >-
        Founded by King João VI of Portugal as the Royal Library 
        (Biblioteca Real) when the Portuguese court relocated to Brazil.        
      source_documentation: https://www.bn.gov.br/sobre-bn/historia
  
  provenance:  # From schemas/provenance.yaml - Provenance class
    data_source: CONVERSATION_NLP
    data_tier: TIER_4_INFERRED
    extraction_date: "2025-11-05T14:30:00Z"
    extraction_method: "AI agent comprehensive extraction from Brazilian GLAM conversation"
    confidence_score: 0.92
    conversation_id: "2025-09-22T14-40-15-0102c00a-4c0a-4488-bdca-5dd9fb94c9c5"
    notes: >-
      Extracted from conversation about Brazilian GLAM institutions. 
      Historical founding information cross-referenced from institutional website.

❌ DO NOT DO THIS: Return minimal JSON with only name and type

// BAD - This is insufficient!
{
  "name": "Biblioteca Nacional do Brasil",
  "institution_type": "LIBRARY"
}

Extraction Workflow for Agents

When processing a conversation or document:

Read Entire Document First: Don't extract piecemeal - understand the full context
Identify ALL Entities: Find every institution, location, identifier, event mentioned
Gather Complete Information: For each institution, extract:
- Basic metadata (name, type, description)
- All locations mentioned (even if just city/country)
- All identifiers (ISIL, Wikidata, VIAF, URLs)
- Digital platforms and systems
- Collection information
- Historical events (founding, mergers, relocations)
- Relationships to other institutions
Create LinkML YAML: Write a complete instance file with ALL extracted data
Add Provenance: Always include extraction metadata with confidence scores
Validate: Ensure output conforms to schema (use linkml-validate if available)

Example Agent Prompt for Comprehensive Extraction

Extract ALL heritage institutions from the following conversation about Brazilian GLAM institutions.

For EACH institution found, create a COMPLETE LinkML-compliant record including:
- Institution name, type, and description
- ALL locations mentioned (cities, addresses, regions)
- ALL identifiers (ISIL codes, Wikidata IDs, VIAF IDs, URLs)
- Digital platforms, systems, or websites
- Collection information (types, subjects, time periods, extent)
- Historical events (founding dates, mergers, relocations, name changes)
- Relationships to other organizations

Output: YAML file conforming to schemas/core.yaml, schemas/enums.yaml, 
schemas/provenance.yaml, and schemas/collections.yaml

Use your understanding to:
- Infer missing fields from context (e.g., country from city names)
- Consolidate information scattered across multiple conversation turns
- Create rich descriptions summarizing key facts
- Assign appropriate confidence scores based on explicitness of mentions

Remember: You are NOT a simple NER tool. Use your full comprehension abilities 
to create the most complete, accurate, and useful records possible.

Multiple Institutions Per File

When a conversation discusses many institutions, create ONE YAML file with a list:

---
# data/instances/netherlands_limburg_museums.yaml

- id: https://w3id.org/heritage/custodian/nl/bonnefantenmuseum
  name: Bonnefantenmuseum
  institution_type: MUSEUM
  # ... complete record ...

- id: https://w3id.org/heritage/custodian/nl/thermenmuseum
  name: Thermenmuseum
  institution_type: MUSEUM
  # ... complete record ...

- id: https://w3id.org/heritage/custodian/nl/limburgs-museum
  name: Limburgs Museum
  institution_type: MUSEUM
  # ... complete record ...

Field Completion Strategies

Even when information is incomplete, do your best:

No explicit institution type? Infer from context ("national library" → LIBRARY)
Only city mentioned? That's fine - add locations: [{city: "Amsterdam", country: "NL"}]
No ISIL code? Check if you can infer the format (NL-CityCode) or leave it out
No description? Create one from available facts
Uncertain data? Lower the confidence score but still include it

Validation and Quality Control

After creating instance files:

Schema Validation: If possible, run linkml-validate -s schemas/heritage_custodian.yaml data/instances/your_file.yaml
Completeness Check: Ensure every institution has at minimum:
- id (generate from country + institution name slug)
- name
- institution_type
- provenance (with data_source, extraction_date, confidence_score)
Consistency Check: Same institution mentioned multiple times? Merge into one record
Quality Flags: If confidence < 0.5, add note in provenance.notes explaining uncertainty

Extraction Stack (for Subagents)

When subagents perform extraction, they may use:

Pattern matching for identifiers (primary approach)
- Regex for ISIL, VIAF, Wikidata IDs
- URL extraction and normalization
- High precision, no dependencies
NER libraries (via subagents only)
- spaCy: en_core_web_trf, nl_core_news_lg, xx_ent_wiki_sm
- Transformers for classification
- Used by subagents, not directly in main code
Fuzzy matching for deduplication
- rapidfuzz library
- Levenshtein distance for name matching

Processing Pipeline

Conversation JSON
    ↓
Parse & Extract Text
    ↓
[SUBAGENT] NER Extraction
  - Subagent uses spaCy/transformers/patterns
  - Returns structured entities
    ↓
Pattern Matching (identifiers, URLs)
    ↓
Classification (institution type, standards)
    ↓
Geocoding (locations)
    ↓
Cross-link with CSV (ISIL/name matching)
    ↓
LinkML Validation
    ↓
Export (RDF, JSON-LD, CSV, Parquet)

Agent Interaction Patterns

When Asked to Extract Data from Conversations

Start Small: Begin with 1-2 conversation files to test extraction logic
Show Examples: Display extracted entities with confidence scores
Ask for Validation: Show uncertain extractions for user confirmation
Iterate: Refine patterns based on feedback
Batch Process: Once patterns are validated, process all 139 files

When Asked to Design NLP Components

Reference Schema: Always refer to the modular schema v0.2.1:
- Core classes: schemas/core.yaml (HeritageCustodian, Location, Identifier, etc.)
- Enumerations: schemas/enums.yaml (InstitutionTypeEnum, ChangeTypeEnum, etc.)
- Provenance: schemas/provenance.yaml (Provenance, ChangeEvent, etc.)
- See schema overview in the "Schema Reference (v0.2.1)" section above
Consult Base Ontologies: BEFORE designing extraction logic, review relevant ontologies:
- Dutch institutions: Study TOOI ontology (/data/ontology/tooiont.ttl)
- EU/global institutions: Study CPOV ontology (/data/ontology/core-public-organisation-ap.ttl)
- All institutions: Reference Schema.org patterns (/data/ontology/schemaorg.owl)
- See "Base Ontologies for Global GLAM Data" section above for decision tree
Use Design Patterns: Follow patterns in docs/plan/global_glam/05-design-patterns.md
Track Provenance: Every extraction must include provenance metadata (from schemas/provenance.yaml)
Handle Multilingual: Conversations cover 60+ countries, expect multilingual content
Error Handling: Use Result pattern, never fail silently

When Asked to Validate Data

LinkML Validation: Use linkml-validate to check schema compliance
Cross-reference: Compare with CSV data when applicable
Check Identifiers: Validate ISIL format, check Wikidata exists
Geographic Verification: Geocode addresses, verify country codes
Duplicate Detection: Use fuzzy matching to find potential duplicates

Example Agent Workflows

Workflow 1: Extract Brazilian Institutions

# User request
"Extract all museum, library, and archive names from the Brazilian GLAM conversation"

# Agent actions
1. Read conversation: 2025-09-22T14-40-15-0102c00a-4c0a-4488-bdca-5dd9fb94c9c5-Brazilian_GLAM_collection_inventories.json
2. Parse chat_messages array
3. **Launch subagent** to extract institutions using NER
   - Subagent analyzes text and extracts ORG entities
   - Filters for heritage-related keywords
   - Classifies institution types
   - Returns structured results
4. Extract locations (cities in Brazil)
5. Geocode using Nominatim
6. Create HeritageCustodian records
7. Add provenance metadata (data_source: CONVERSATION_NLP, extraction_method: "Subagent NER")
8. Validate with LinkML schema
9. Export to JSON-LD
10. Report results with confidence scores

Workflow 2: Cross-link Dutch Institutions

# User request
"Cross-link the Dutch organizations CSV with any Dutch institutions found in conversations"

# Agent actions
1. Load data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv
2. Parse into DutchHeritageCustodian records
3. Extract all NL-* ISIL codes
4. Search all conversation files for mentions of these ISIL codes
5. Fuzzy match organization names
6. For matches:
   - Merge metadata
   - Mark CSV data as TIER_1
   - Mark conversation data as TIER_4
   - Resolve conflicts (CSV wins)
7. For Dutch institutions in conversations NOT in CSV:
   - Create new records
   - Mark as TIER_4
   - Flag for verification
8. Export merged dataset

Workflow 3: Build Global Institution Map

# User request
"Create a geographic distribution map of all extracted institutions"

# Agent actions
1. Process all 139 conversation files
2. **Launch subagent(s)** to extract institution names + locations from each file
3. Geocode all addresses
4. Group by country
5. Count institutions per country
6. Generate GeoJSON for mapping
7. Create visualization (Leaflet, Mapbox, etc.)
8. Export statistics:
   - Institutions per country
   - Institutions per type
   - Geographic coverage
   - Data quality (tier distribution)

Multi-language Considerations

Language Detection

Detect language of conversation content
Subagents will choose appropriate NER models per language
Multilingual support handled by subagents

Common Languages in Dataset

English (international institutions)
Dutch (Netherlands institutions)
Portuguese (Brazil)
Spanish (Latin America, Spain)
Vietnamese, Japanese, Thai, Korean, Arabic, Russian, etc.

Translation Strategy

DO NOT translate institution names (preserve original)
Optionally translate descriptions for searchability
Store language tags with text fields
Use multilingual identifiers (Wikidata) for linking

Output Formats

Primary Output: JSON-LD

Linked Data format for semantic web integration:

{
  "@context": "https://w3id.org/heritage/custodian/context.jsonld",
  "@type": "HeritageCustodian",
  "@id": "https://example.org/institution/123",
  "name": "Amsterdam Museum",
  "institution_type": "MUSEUM",
  ...
}

Secondary Outputs

RDF/Turtle: For SPARQL querying
CSV: For spreadsheet analysis
Parquet: For data warehousing
SQLite: For local querying

Testing and Validation

Unit Tests

Test extraction functions with known inputs:

def test_extract_isil_codes():
    text = "The ISIL code NL-AsdAM identifies Amsterdam Museum"
    codes = extract_isil_codes(text)
    assert codes == [{"scheme": "ISIL", "value": "NL-AsdAM"}]

Integration Tests

Test full pipeline with sample conversations:

def test_brazilian_museum_extraction():
    conversation = load_json("Brazilian_GLAM_collection_inventories.json")
    records = extract_heritage_custodians(conversation)
    assert len(records) > 0
    assert all(r.provenance.data_source == "CONVERSATION_NLP" for r in records)

Validation Tests

Ensure LinkML schema compliance:

def test_linkml_validation():
    record = create_heritage_custodian(...)
    validator = SchemaValidator(schema="heritage_custodian.yaml")
    result = validator.validate(record)
    assert result.is_valid

Performance Optimization

Batch Processing

Process conversations in parallel (multiprocessing)
Cache geocoding results (15-minute TTL)
Deduplicate entity extraction

Incremental Updates

Track last processed timestamp
Only process new/updated conversations
Maintain state in SQLite database

Resource Management

Limit concurrent API calls (Nominatim: 1 req/sec)
Use connection pooling for HTTP requests
Stream large JSON files instead of loading into memory

Error Handling

Common Errors and Solutions

JSON Parsing Errors
- Malformed JSON files
- Solution: Validate JSON schema, report file path
NER Model Errors
- Missing spaCy model
- Solution: Provide installation instructions, download automatically
Geocoding Failures
- Unknown location, rate limit exceeded
- Solution: Cache results, implement backoff, mark as unverified
LinkML Validation Failures
- Required field missing, invalid enum value
- Solution: Log validation errors, provide field mapping
Encoding Issues
- Non-UTF-8 characters
- Solution: Use UTF-8 everywhere, handle decode errors gracefully

Schema Quirks and Implementation Notes

IMPORTANT: These are critical implementation details discovered during development. Read carefully to avoid bugs.

Provenance Model Quirks

The Provenance model does NOT have a notes field:

# ❌ WRONG - Provenance has no 'notes' field
provenance = Provenance(
    data_source=DataSource.CSV_REGISTRY,
    notes="Some observation"  # This will fail!
)

# ✅ CORRECT - Use HeritageCustodian.description instead
custodian = HeritageCustodian(
    name="Museum Name",
    description="Notes and remarks go here",  # Put notes here
    provenance=Provenance(...)
)

Field Naming Conventions

Always use the correct field names (check the schema when in doubt):

# ❌ WRONG
custodian.institution_types  # Plural, list
custodian.location           # Singular

# ✅ CORRECT
custodian.institution_type   # Singular, single enum value
custodian.locations          # Plural, always a list (even with one item)

Pydantic v1 Enum Behavior

This project uses Pydantic v1. Enum fields are already strings, not enum objects:

# ❌ WRONG - Don't use .value accessor
print(custodian.institution_type.value)  # AttributeError!

# ✅ CORRECT - Enum fields are already strings
print(custodian.institution_type)  # "MUSEUM", "ARCHIVE", etc.

# Same for platform types
platform.platform_type  # Already a string, not an enum object

Required vs. Optional Fields

Many fields are optional but have validation rules. Always check for None:

# Optional fields that may be None
custodian.locations          # Optional[List[Location]]
custodian.identifiers        # Optional[List[Identifier]]
custodian.digital_platforms  # Optional[List[DigitalPlatform]]
custodian.description        # Optional[str]

# Always check before iterating
if custodian.locations:
    for location in custodian.locations:
        print(location.city)

CSV Parsing Best Practices

Handle UTF-8 BOM: Use encoding='utf-8-sig' when reading CSVs
Normalize headers: Strip whitespace, handle multiline headers
Warn on errors: Skip invalid rows but log warnings
Preserve originals: Store raw CSV data in intermediate models before conversion

Example:

with open(csv_path, 'r', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f)
    for row in reader:
        try:
            record = parse_row(row)
        except ValidationError as e:
            print(f"Warning: Skipping row {row}: {e}")
            continue

Date Handling

Dates may be in various formats or empty:

# Handle empty dates
date_str = row.get('toegekend_op', '').strip()
assigned_date = datetime.fromisoformat(date_str) if date_str else None

# Provenance extraction_date is required (use current time)
from datetime import datetime, timezone
extraction_date = datetime.now(timezone.utc)

Testing Strategies

Unit tests: Test model validation with known inputs
Integration tests: Test full file parsing with fixtures
Edge case tests: Empty files, malformed rows, minimal data
Real data tests: Always validate with actual CSV files

Fixture scope matters:

# ❌ WRONG - Class-scoped fixture not available to other classes
class TestFoo:
    @pytest.fixture
    def sample_file(self):
        ...

# ✅ CORRECT - Module-scoped fixture available to all test classes
@pytest.fixture
def sample_file():  # At module level, not in a class
    ...

Next Steps for Agents

When continuing this project, agents should:

Implement Parser Module (src/glam_extractor/parsers/) ✅ COMPLETE
- ✅ ISIL registry parser (10 tests, 84% coverage)
- ✅ Dutch organizations parser (18 tests, 98% coverage)
- ⏳ Conversation JSON parser (next priority)
Implement Extractor Module (src/glam_extractor/extractors/)
- spaCy NER integration
- Pattern-based identifier extraction
- Institution type classifier
- Relationship extractor
Implement Geocoder Module (src/glam_extractor/geocoding/)
- Nominatim client with caching
- GeoNames integration
- Coordinate validation
Implement Validator Module (src/glam_extractor/validators/)
- LinkML schema validator
- Cross-reference validator (CSV vs. conversation)
- Duplicate detector
Implement Exporter Module (src/glam_extractor/exporters/)
- JSON-LD exporter
- RDF/Turtle exporter
- CSV exporter
- Parquet exporter
- SQLite database builder
Create Test Fixtures (tests/fixtures/)
- Sample conversation JSONs
- Expected extraction outputs
- Validation test cases
Document Agent Prompts (docs/agent-prompts/)
- Reusable prompts for common extraction tasks
- Few-shot examples for LLM-based extraction
- Quality review checklists

Persistent Identifiers (GHCID)

🚨 COLLISION RESOLUTION: NATIVE LANGUAGE NAME SUFFIX 🚨

When multiple institutions generate the same base GHCID, collisions are resolved by appending the full legal name in native language in snake_case format.

Collision Suffix Rules:

✅ Use the institution's full official name in its native language
✅ Convert to snake_case (lowercase, underscores for spaces)
✅ Remove apostrophes, accents, commas, and other punctuation/diacritics
✅ Only add suffix on collision (not by default)
✅ First-added institution keeps base GHCID; later additions get name suffix

Examples:

Base GHCID collision: NL-NH-AMS-M-SM (two museums with "SM" abbreviation)
First institution: NL-NH-AMS-M-SM (Stedelijk Museum, added first - no suffix)
Second institution: NL-NH-AMS-M-SM-science_museum_amsterdam (added later - gets suffix)

Name Normalization:

"Musée d'Orsay" → "musee_dorsay"
"Biblioteca Nacional do Brasil" → "biblioteca_nacional_do_brasil"
"北京故宫博物院" → "beijing_gugong_bowuyuan" (pinyin transliteration)
"Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"

Note: The GHCID string (including any name suffix) gets hashed to UUID, so the longer name won't be visible to end users - they see only the UUID.

🚨 SETTLEMENT STANDARDIZATION: GEONAMES IS AUTHORITATIVE 🚨

ALL settlement names in GHCID MUST be derived from GeoNames, not from source data.

The GeoNames geographical database (/data/reference/geonames.db) is the single source of truth for:

Settlement names (cities, towns, villages)
Settlement 3-letter codes
Administrative region codes (admin1 → ISO 3166-2)

Why GeoNames?

Consistency: Same coordinates → same settlement → same GHCID component
Disambiguation: Handles duplicate settlement names across regions
Standardization: Provides ASCII-safe names for identifiers
Persistence: Geographic reality is stable, ensuring GHCID stability

Settlement Resolution Process:

Coordinates Available (Preferred): Use reverse geocoding to find nearest GeoNames settlement
Name Only (Fallback): Look up settlement name in GeoNames with fuzzy matching
Manual (Last Resort): Flag entry with settlement_code: XXX for review

🚨 CRITICAL: GeoNames Feature Code Filtering 🚨

NEVER use neighborhoods or districts (PPLX) for GHCID generation. ONLY use proper settlements (cities, towns, villages).

GeoNames classifies populated places with feature codes. When reverse geocoding coordinates to find a settlement, you MUST filter by feature code to ensure you get a city/town/village, NOT a neighborhood or district.

ALLOWED Feature Codes (use these for GHCID settlements):

Code	Description	Example
PPL	Populated place (city/town/village)	Apeldoorn, Hamont, Lelystad
PPLA	Seat of first-order admin division	Provincial capitals
PPLA2	Seat of second-order admin division	Municipal seats
PPLA3	Seat of third-order admin division	District seats
PPLA4	Seat of fourth-order admin division	Sub-district seats
PPLC	Capital of a political entity	Amsterdam, Brussels
PPLS	Populated places (multiple)	Settlement clusters
PPLG	Seat of government	The Hague (when different from capital)

EXCLUDED Feature Codes (NEVER use for GHCID):

Code	Description	Why Excluded
PPLX	Section of populated place	Neighborhoods, districts, quarters (e.g., "Binnenstad", "Amsterdam Binnenstad")

Example of the Problem:

-- BAD: Query without feature code filter returns neighborhoods
SELECT name, feature_code, population FROM cities 
WHERE country_code='NL' ORDER BY distance LIMIT 1;
-- Result: "Binnenstad" (PPLX, pop 4,900) ❌ WRONG

-- GOOD: Query WITH feature code filter returns proper settlements
SELECT name, feature_code, population FROM cities 
WHERE country_code='NL' 
  AND feature_code IN ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')
ORDER BY distance LIMIT 1;
-- Result: "Apeldoorn" (PPL, pop 136,670) ✅ CORRECT

Implementation in SQL:

-- Correct reverse geocoding query with feature code filter
SELECT 
    name, ascii_name, admin1_code, admin1_name,
    latitude, longitude, geonames_id, population, feature_code,
    ((latitude - ?) * (latitude - ?) + (longitude - ?) * (longitude - ?)) as distance_sq
FROM cities
WHERE country_code = ?
  AND feature_code IN ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')
ORDER BY distance_sq
LIMIT 1

Verification: Always check feature_code in location_resolution metadata:

location_resolution:
  method: REVERSE_GEOCODE
  geonames_id: 2759706
  geonames_name: Apeldoorn
  feature_code: PPL  # ← MUST be PPL, PPLA, PPLA2, PPLA3, PPLA4, PPLC, PPLS, or PPLG
  admin1_code: '03'
  region_code: GE
  country_code: NL

If you see feature_code: PPLX, the GHCID is WRONG and must be regenerated.

Country Code Detection for GeoNames Lookups

CRITICAL: Determine country code from entry data BEFORE calling GeoNames reverse geocoding.

GeoNames queries are country-specific. Using the wrong country code will return incorrect results or no results.

Country Code Resolution Priority:

zcbs_enrichment.country - Most explicit source
location.country - Direct location field
locations[].country - Array location field
original_entry.country - CSV source field
google_maps_enrichment.address - Parse country from address string (", Belgium", ", Germany")
wikidata_enrichment.located_in.label - Infer from Wikidata location
Default: "NL" (Netherlands) - Only if no other source available

Example Country Detection Code:

# Determine country code FIRST
country_code = "NL"  # Default

if entry.get('zcbs_enrichment', {}).get('country'):
    country_code = entry['zcbs_enrichment']['country']
elif entry.get('location', {}).get('country'):
    country_code = entry['location']['country']
elif entry.get('google_maps_enrichment', {}).get('address', ''):
    address = entry['google_maps_enrichment']['address']
    if ', Belgium' in address or ', België' in address:
        country_code = "BE"
    elif ', Germany' in address or ', Deutschland' in address:
        country_code = "DE"

# THEN call reverse geocoding with correct country
result = reverse_geocode_to_city(latitude, longitude, country_code)

GHCID Settlement Code Format:

NL-{REGION}-{SETTLEMENT}-{TYPE}-{ABBREV}
           ^^^^^^^^^^^
           3-letter code from GeoNames

Code Generation Rules:

Single word: First 3 letters → Amsterdam = AMS, Lelystad = LEL
Dutch article (de, het, den, 's): Article initial + 2 from main word → Den Haag = DHA
Multi-word: Initials (up to 3) → Nieuw Amsterdam = NAM

Historical Custodians - Measurement Point Rule:

For heritage custodians that no longer exist or have historical locations:

Use the modern-day settlement (as of 2025-12-01) where the coordinates fall
GeoNames reflects current geographic reality
Historical place names should NOT be used for GHCID generation

Example: A museum operating 1900-1950 in what is now Lelystad (before Flevoland existed) uses LEL, not historical names.

🚨 CRITICAL: XXX Placeholders Are TEMPORARY - Research Required 🚨

XXX placeholders for region/settlement codes are NEVER acceptable as a final state. They indicate missing data that MUST be researched and resolved.

When you encounter or generate entries with XX (unknown region) or XXX (unknown settlement):

Step 1: Identify the Last Known Physical Location

For destroyed/historical institutions:

Use the last recorded physical location where the institution operated
Example: Gaza Cultural Center destroyed in 2024 → use Gaza City coordinates (PS-GZ-GAZ-M-GCC)

For refugee/diaspora organizations:

Use the location of their current headquarters OR original founding location
Document which location type was used in location_resolution.notes

For digital-only platforms:

Use the location of the parent/founding organization
Example: Interactive Encyclopedia of Palestine Question → Institute for Palestine Studies → Beirut (LB-BA-BEI-D-IEPQ)

Step 2: Research Sources (Priority Order)

Wikidata - Search for the institution, check P131 (located in) or P159 (headquarters location)
Google Maps - Search institution name, extract coordinates
Official Website - Look for contact page, about page with address
Web Archive - Use archive.org for destroyed/closed institutions
Academic Sources - Papers, reports mentioning the institution
News Articles - Particularly useful for destroyed heritage sites

Step 3: Update Entry with Resolved Location

# BEFORE (unacceptable)
ghcid:
  ghcid_current: PS-XX-XXX-A-NAPR
  location_resolution:
    method: NAME_LOOKUP
    country_code: PS
    region_code: XX
    city_code: XXX

# AFTER (properly researched)
ghcid:
  ghcid_current: PS-GZ-GAZ-A-NAPR
  location_resolution:
    method: MANUAL_RESEARCH
    country_code: PS
    region_code: GZ
    region_name: Gaza Strip
    city_code: GAZ
    city_name: Gaza City
    geonames_id: 281133
    research_date: "2025-12-06T00:00:00Z"
    research_sources:
      - type: wikidata
        id: Q123456
        claim: P131
      - type: web_archive
        url: https://web.archive.org/web/20231001/https://institution-website.org/contact
    notes: "Located in Gaza City prior to destruction in 2024"

Step 4: Rename File to Match New GHCID

Files MUST be renamed when GHCID changes:

# Old file
data/custodian/PS-XX-XXX-A-NAPR.yaml

# New file after research
data/custodian/PS-GZ-GAZ-A-NAPR.yaml

Common XXX Placeholder Scenarios and Solutions:

Scenario	Solution
Destroyed Gaza institution	Use pre-destruction coordinates (Gaza City, Khan Yunis, etc.)
Refugee archive (diaspora)	Use current headquarters OR founding camp location
Digital platform (online only)	Use parent organization headquarters
Decentralized initiative	Use founding location or primary organizer location
Historical institution (closed)	Use last operating location
Institution with country but no city	Research using name + country in Wikidata/Google

NEVER:

❌ Leave XXX placeholders in production data
❌ Use "Online" or "Palestine" as location values
❌ Skip location research because it's "difficult"
❌ Use XX/XXX for diaspora organizations (they have real locations)

ALWAYS:

✅ Document research sources in location_resolution.research_sources
✅ Add notes explaining location choice for complex cases
✅ Update GHCID history when location is resolved
✅ Rename files to match corrected GHCID

Netherlands Admin1 Code Mapping (GeoNames → ISO 3166-2):

GeoNames	Province	ISO Code
01	Drenthe	DR
02	Friesland	FR
03	Gelderland	GE
04	Groningen	GR
05	Limburg	LI
06	Noord-Brabant	NB
07	Noord-Holland	NH
09	Utrecht	UT
10	Zeeland	ZE
11	Zuid-Holland	ZH
15	Overijssel	OV
16	Flevoland	FL

Provenance Tracking: Record GeoNames resolution in entry metadata:

location_resolution:
  method: REVERSE_GEOCODE  # or NAME_LOOKUP or MANUAL
  geonames_id: 2751792
  geonames_name: Lelystad
  settlement_code: LEL
  admin1_code: "16"
  region_code: FL
  resolution_date: "2025-12-01T00:00:00Z"

See: .opencode/GEONAMES_SETTLEMENT_RULES.md for complete documentation.

🚨 INSTITUTION ABBREVIATION: EMIC NAME FIRST-LETTER PROTOCOL 🚨

The institution abbreviation component uses the FIRST LETTER of each significant word in the official emic (native language) name.

⚠️ GRANDFATHERING POLICY (PID STABILITY)

Existing GHCIDs created before December 2025 are grandfathered - their abbreviations will NOT be updated even if derived from English translations rather than emic names. This preserves PID stability per the "Cool URIs Don't Change" principle.

Applies to:

817 UNESCO Memory of the World custodian files enriched with custodian_name.emic_name
Abbreviations like NLP (National Library of Peru) remain unchanged even though emic name is "Biblioteca Nacional del Perú" (would be BNP)

For NEW custodians only: Apply emic name abbreviation protocol described below.

Abbreviation Rules:

Use the CustodianName (official emic name), NOT an English translation
Take the first letter of each word
Skip prepositions, articles, and conjunctions in all languages
Skip digits and numeric tokens (e.g., "40-45", "1945", "III")
Convert to UPPERCASE
Remove accents/diacritics (á→A, ñ→N, ö→O)
Maximum 10 characters

Skipped Words (prepositions/articles/conjunctions by language):

Dutch: de, het, een, van, voor, in, op, te, den, der, des, 's, aan, bij, met, naar, om, tot, uit, over, onder, door, en, of
English: a, an, the, of, in, at, on, to, for, with, from, by, as, under, and, or, but
French: le, la, les, un, une, des, de, d, du, à, au, aux, en, dans, sur, sous, pour, par, avec, l, et, ou
German: der, die, das, den, dem, des, ein, eine, einer, einem, einen, von, zu, für, mit, bei, nach, aus, vor, über, unter, durch, und, oder
Spanish: el, la, los, las, un, una, unos, unas, de, del, a, al, en, con, por, para, sobre, bajo, y, o, e, u
Portuguese: o, a, os, as, um, uma, uns, umas, de, do, da, dos, das, em, no, na, nos, nas, para, por, com, sobre, sob, e, ou
Italian: il, lo, la, i, gli, le, un, uno, una, di, del, dello, della, dei, degli, delle, a, al, allo, alla, ai, agli, alle, da, dal, dallo, dalla, dai, dagli, dalle, in, nel, nello, nella, nei, negli, nelle, su, sul, sullo, sulla, sui, sugli, sulle, con, per, tra, fra, e, ed, o, od

TODO: Expand to comprehensive global coverage for all ISO 639-1 languages as project expands.

Examples:

Emic Name	Abbreviation	Explanation
Heemkundige Kring De Goede Stede	HKGS	Skip "De"
De Hollandse Cirkel	HC	Skip "De"
Historische Vereniging Nijeveen	HVN	All significant words
Rijksmuseum Amsterdam	RA	All significant words
Musée d'Orsay	MO	Skip "d'" (d = de)
Biblioteca Nacional do Brasil	BNB	Skip "do"
L'Académie française	AF	Skip "L'"
Museum van de Twintigste Eeuw	MTE	Skip "van", "de"
Koninklijke Bibliotheek van België	KBB	Skip "van"

GHCID Format with Abbreviation:

NL-{REGION}-{SETTLEMENT}-{TYPE}-{ABBREV}
                                ^^^^^^^^
                                First letter of each significant word in emic name

Implementation: See src/glam_extractor/identifiers/ghcid.py:extract_abbreviation_from_name()

🚨 CRITICAL: Special Characters MUST Be Excluded from Abbreviations 🚨

When generating abbreviations for GHCID, special characters and symbols MUST be completely removed. Only alphabetic characters (A-Z) are permitted in the abbreviation component.

RATIONALE:

URL/URI safety - Special characters require encoding in URIs
Filename safety - Characters like &, /, \, : are invalid in filenames
Parsing consistency - Avoids delimiter conflicts in data pipelines
Cross-system compatibility - Ensures interoperability with all systems
Human readability - Clean identifiers are easier to communicate

CHARACTERS TO REMOVE (exhaustive list):

Ampersand: & (e.g., "Records & Archives" → "RA", NOT "R&A")
Slash: / (e.g., "Art/Design Museum" → "ADM", NOT "A/DM")
Backslash: \
Plus: + (e.g., "Culture+" → "C")
At sign: @
Hash/Pound: #
Percent: %
Dollar: $
Asterisk: *
Parentheses: ( )
Brackets: [ ] { }
Pipe: |
Colon: :
Semicolon: ;
Quotation marks: `" ' ``
Comma: ,
Period: . (unless part of abbreviation like "U.S." → "US")
Hyphen: - (skip, do not replace with letter)
Underscore: _
Equals: =
Question mark: ?
Exclamation: !
Tilde: ~
Caret: ^
Less/Greater than: < >

EXAMPLES:

Source Name	Correct Abbreviation	Incorrect (WRONG)
Department of Records & Information Management	DRIM	DR&IM ❌
Art + Culture Center	ACC	A+CC ❌
Museum/Gallery Amsterdam	MGA	M/GA ❌
Heritage@Digital	HD	H@D ❌
Archives (Historical)	AH	A(H) ❌
Research & Development Institute	RDI	R&DI ❌

REAL-WORLD EXAMPLE (from data/custodian/SX-XX-PHI-O-DR&IMSM.yaml):

# INCORRECT (current file - needs correction):
ghcid_current: SX-XX-PHI-O-DR&IMSM  # ❌ Contains "&"

# CORRECT (should be):
ghcid_current: SX-XX-PHI-O-DRIMSM   # ✅ Alphabetic only

Implementation: When extracting first letters from words containing special characters:

Split the word on special characters: "Records&Information" → ["Records", "Information"]
Take first letter from each resulting segment: "R" + "I" = "RI"
Or skip the special character entirely and treat as one word if no space around it

See: .opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md for complete documentation

🚨 CRITICAL: Diacritics MUST Be Normalized to ASCII in Abbreviations 🚨

When generating abbreviations for GHCID, diacritics (accented characters) MUST be normalized to their ASCII base letter equivalents. Only ASCII uppercase letters (A-Z) are permitted.

This rule applies to ALL languages with diacritical marks including Czech, Polish, German, French, Spanish, Portuguese, Nordic languages, Hungarian, Romanian, Turkish, and others.

RATIONALE:

URI/URL safety - Non-ASCII characters require percent-encoding
Cross-system compatibility - ASCII is universally supported
Filename safety - Some systems have issues with non-ASCII filenames
Human readability - Easier to type and communicate

DIACRITICS NORMALIZATION TABLE:

Language	Diacritics	ASCII Equivalent
Czech	Č, Ř, Š, Ž, Ě, Ů	C, R, S, Z, E, U
Polish	Ł, Ń, Ó, Ś, Ź, Ż, Ą, Ę	L, N, O, S, Z, Z, A, E
German	Ä, Ö, Ü, ß	A, O, U, SS
French	É, È, Ê, Ç, Ô, Â	E, E, E, C, O, A
Spanish	Ñ, Á, É, Í, Ó, Ú	N, A, E, I, O, U
Portuguese	Ã, Õ, Ç, Á, É	A, O, C, A, E
Nordic	Å, Ä, Ö, Ø, Æ	A, A, O, O, AE
Hungarian	Á, É, Í, Ó, Ö, Ő, Ú, Ü, Ű	A, E, I, O, O, O, U, U, U
Turkish	Ç, Ğ, İ, Ö, Ş, Ü	C, G, I, O, S, U
Romanian	Ă, Â, Î, Ș, Ț	A, A, I, S, T

REAL-WORLD EXAMPLE (Czech institution):

# INCORRECT - Contains diacritics:
ghcid_current: CZ-VY-TEL-L-VHSPAOČRZS  # ❌ Contains "Č"

# CORRECT - ASCII only:
ghcid_current: CZ-VY-TEL-L-VHSPAOCRZS  # ✅ "Č" → "C"

IMPLEMENTATION:

import unicodedata

def normalize_diacritics(text: str) -> str:
    """Normalize diacritics to ASCII equivalents."""
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', text)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    return ascii_text

# Example
normalize_diacritics("VHSPAOČRZS")  # Returns "VHSPAOCRZS"

EXAMPLES:

Emic Name (with diacritics)	Abbreviation	Wrong
Vlastivědné muzeum v Šumperku	VMS	VMŠ ❌
Österreichische Nationalbibliothek	ON	ÖN ❌
Bibliothèque nationale de France	BNF	BNF (OK - è not in first letter)
Múzeum Łódzkie	ML	MŁ ❌
Þjóðminjasafn Íslands	TI	ÞI ❌

See: .opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md for complete documentation (covers both special characters and diacritics)

🚨 CRITICAL: Non-Latin Scripts MUST Be Transliterated Before Abbreviation 🚨

When generating GHCID abbreviations from institution names in non-Latin scripts (Cyrillic, Chinese, Japanese, Korean, Arabic, Hebrew, Greek, Devanagari, Thai, etc.), the emic name MUST first be transliterated to Latin characters using ISO or recognized standards.

This rule affects 170 institutions across 21 languages with non-Latin writing systems.

CORE PRINCIPLE: The emic name is PRESERVED in original script in custodian_name.emic_name. Transliteration is only used for abbreviation generation.

TRANSLITERATION STANDARDS BY SCRIPT:

Script	Languages	Standard	Example
Cyrillic	ru, uk, bg, sr, kk	ISO 9:1995	Институт → Institut
Chinese	zh	Hanyu Pinyin (ISO 7098)	东巴文化博物院 → Dongba Wenhua Bowuyuan
Japanese	ja	Modified Hepburn	国立博物館 → Kokuritsu Hakubutsukan
Korean	ko	Revised Romanization	독립기념관 → Dongnip Ginyeomgwan
Arabic	ar, fa, ur	ISO 233-2/3	المكتبة الوطنية → al-Maktaba al-Wataniya
Hebrew	he	ISO 259-3	ארכיון → Arkhiyon
Greek	el	ISO 843	Μουσείο → Mouseio
Devanagari	hi, ne	ISO 15919	राजस्थान → Rajasthana
Bengali	bn	ISO 15919	বাংলাদেশ → Bangladesh
Thai	th	ISO 11940-2	สำนักหอ → Samnak Ho
Armenian	hy	ISO 9985	Մdelays → Matenadaran
Georgian	ka	ISO 9984	ხელნაწერთა → Khelnawerti

WORKFLOW:

1. Emic Name (original script)
   ↓
2. Transliterate to Latin (ISO standard)
   ↓
3. Normalize diacritics (remove accents)
   ↓
4. Skip articles/prepositions
   ↓
5. Extract first letters → Abbreviation

EXAMPLES:

Language	Emic Name	Transliterated	Abbreviation
Russian	Институт восточных рукописей РАН	Institut Vostochnykh Rukopisey RAN	IVRR
Chinese	东巴文化博物院	Dongba Wenhua Bowuyuan	DWB
Korean	독립기념관	Dongnip Ginyeomgwan	DG
Hindi	राजस्थान प्राच्यविद्या प्रतिष्ठान	Rajasthana Pracyavidya Pratishthana	RPP
Arabic	المكتبة الوطنية للمملكة المغربية	al-Maktaba al-Wataniya lil-Mamlaka	MWMM
Hebrew	ארכיון הסיפור העממי בישראל	Arkhiyon ha-Sipur ha-Amami	ASAY
Greek	Αρχαιολογικό Μουσείο Θεσσαλονίκης	Archaiologiko Mouseio Thessalonikis	AMT

SCRIPT-SPECIFIC SKIP WORDS:

Language	Skip Words (Articles/Prepositions)
Arabic	al- (the), bi-, li-, fi- (prepositions)
Hebrew	ha- (the), ve- (and), be-, le-, me-
Persian	-e, -ye (ezafe connector), va (and)
CJK	None (particles integral to meaning)

IMPLEMENTATION:

from transliteration import transliterate_for_abbreviation

# Input: emic name in non-Latin script + language code
emic_name = "Институт восточных рукописей РАН"
lang = "ru"

# Step 1: Transliterate to Latin using ISO standard
latin = transliterate_for_abbreviation(emic_name, lang)
# Result: "Institut Vostochnykh Rukopisey RAN"

# Step 2: Apply standard abbreviation extraction
abbreviation = extract_abbreviation_from_name(latin, skip_words={'vostochnykh'})
# Result: "IVRRAN"

GRANDFATHERING POLICY: Existing abbreviations from 817 UNESCO MoW custodians are grandfathered. This transliteration standard applies only to NEW custodians created after December 2025.

See: .opencode/TRANSLITERATION_STANDARDS.md for complete ISO standards, mapping tables, and Python implementation

GHCID uses a four-identifier strategy for maximum flexibility and transparency:

Four Identifier Formats

UUID v5 (SHA-1) - PRIMARY persistent identifier
- Deterministic (same GHCID string → same UUID)
- RFC 4122 standard, universal library support
- Transparent algorithm (anyone can verify)
- Field: ghcid_uuid
UUID v8 (SHA-256) - Secondary persistent identifier (future-proofing)
- Deterministic with stronger cryptographic hash
- SOTA security compliance
- Field: ghcid_uuid_sha256
UUID v7 - Database record ID ONLY (NOT for persistent identification)
- Time-ordered for database performance
- NOT deterministic (different each time)
- Use for database primary keys, NOT for citations or cross-system references
- Field: record_id
Numeric (64-bit) - Compact identifier for CSV exports
- Deterministic (SHA-256 → 64-bit integer)
- Database optimization, spreadsheet-friendly
- Field: ghcid_numeric

Critical Understanding: UUID v5 is Primary

Why UUID v5 (SHA-1) over UUID v8 (SHA-256)?

The primary identifier is UUID v5 because:

✅ Transparency - Anyone can verify using standard uuid.uuid5() function
✅ Reproducibility - No custom algorithm to share, RFC 4122 defines it
✅ Interoperability - Every programming language has built-in UUID v5 support
✅ Community Trust - Public, standardized algorithm builds confidence

SHA-1 Safety for Identifiers:

SHA-1 is deprecated for cryptographic security (digital signatures, TLS, passwords) but appropriate for identifier generation:

Heritage institution identifiers are non-adversarial (no attacker trying to forge museum IDs)
128-bit collision resistance is sufficient (P(collision) ≈ 1.5×10^-29 for 1M institutions)
RFC 4122 (UUID v5) remains active standard (not deprecated by IETF)
See Why GHCID Uses UUID v5 and SHA-1 for detailed rationale

Future-Proofing:

We generate both UUID v5 and UUID v8 for every institution
Can migrate to SHA-256 primary if RFC 4122 is updated
Both are deterministic - no data loss in migration

When Extracting Data

Agents should generate ALL four identifiers for every institution:

# Example extraction output
- id: https://w3id.org/heritage/custodian/br/bnb-001
  name: Biblioteca Nacional do Brasil
  ghcid: BR-RJ-RIO-L-BNB
  ghcid_uuid: "550e8400-e29b-41d4-a716-446655440000"  # UUID v5 - PRIMARY
  ghcid_uuid_sha256: "a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d"  # UUID v8 - Secondary
  ghcid_numeric: 213324328442227739  # 64-bit numeric
  # Note: UUID v7 (record_id) generated at database insertion, not during extraction

GHCID Collision Handling for AI Agents

CRITICAL: When extracting heritage institution data, AI agents MUST understand and apply temporal collision resolution rules to maintain PID stability.

The Collision Problem

Multiple institutions may generate the same base GHCID (before name suffix addition):

Two museums in Amsterdam abbreviated "SM": NL-NH-AMS-M-SM
Two historical societies in Utrecht: NL-UT-UTR-S-HK
Two libraries in São Paulo abbreviated "BM": BR-SP-SAO-L-BM

Decision Tree for Collision Resolution

When extracting data, agents should follow this decision process:

1. Generate base GHCID (without name suffix)
   ↓
2. Check if base GHCID exists in published dataset
   ↓
   NO → Use base GHCID as-is, record extraction_date
   ↓
   YES → Temporal priority check
   ↓
3. Compare extraction_date with existing publication_date
   ↓
   SAME DATE (batch import) → First Batch Collision
      ├─ ALL institutions get name suffixes
      ├─ Convert native language name to snake_case
      └─ Append to GHCID: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
   ↓
   LATER DATE (historical addition) → Historical Addition
      ├─ PRESERVE existing GHCID (no modification)
      ├─ ONLY new institution gets name suffix
      └─ New GHCID: NL-NH-AMS-M-SM-science_museum_amsterdam

Implementation Rules for Agents

Rule 1: Always Track Provenance Timestamp

provenance:
  data_source: CONVERSATION_NLP
  data_tier: TIER_4_INFERRED
  extraction_date: "2025-11-15T14:30:00Z"  # ← REQUIRED for collision detection
  extraction_method: "AI agent NER extraction"
  confidence_score: 0.92

Rule 2: Detect Collisions by Base GHCID

Before adding name suffixes, group institutions by base GHCID:

# Collision detection pseudocode for agents
base_ghcid = generate_base_ghcid(institution)  # Without name suffix
existing_records = published_dataset.filter(base_ghcid=base_ghcid)

if len(existing_records) > 0:
    # Collision detected - apply temporal priority
    apply_collision_resolution(institution, existing_records)

Rule 3: First Batch - ALL Get Name Suffixes

If ALL colliding institutions have the same extraction_date:

# Example: 2025-11-01 batch import discovers two institutions
- name: Stedelijk Museum Amsterdam
  ghcid: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam  # Gets name suffix
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"

- name: Science Museum Amsterdam  
  ghcid: NL-NH-AMS-M-SM-science_museum_amsterdam  # Gets name suffix
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"  # Same date = first batch

Rule 4: Historical Addition - ONLY New Gets Name Suffix

If new institution's extraction_date is later than existing record:

# EXISTING (2025-11-01, already published):
- name: Hermitage Amsterdam
  ghcid: NL-NH-AMS-M-HM  # ← NO CHANGE (PID stability!)
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"

# NEW (2025-11-15, historical addition):
- name: Historical Museum Amsterdam
  ghcid: NL-NH-AMS-M-HM-historical_museum_amsterdam  # ← ONLY new gets name suffix
  provenance:
    extraction_date: "2025-11-15T14:30:00Z"

Name Suffix Generation

Converting institution names to snake_case suffixes:

import re
import unicodedata

def generate_name_suffix(native_name: str) -> str:
    """Convert native language institution name to snake_case suffix.
    
    Examples:
        "Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
        "Musée d'Orsay" → "musee_dorsay"
        "Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"
    """
    # Normalize unicode (NFD decomposition) and remove diacritics
    normalized = unicodedata.normalize('NFD', native_name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    
    # Convert to lowercase
    lowercase = ascii_name.lower()
    
    # Remove apostrophes, commas, and other punctuation
    no_punct = re.sub(r"[''`\",.:;!?()[\]{}]", '', lowercase)
    
    # Replace spaces and hyphens with underscores
    underscored = re.sub(r'[\s\-]+', '_', no_punct)
    
    # Remove any remaining non-alphanumeric characters (except underscores)
    clean = re.sub(r'[^a-z0-9_]', '', underscored)
    
    # Collapse multiple underscores
    final = re.sub(r'_+', '_', clean).strip('_')
    
    return final

Name suffix rules:

Use the institution's full official name in its native language
Transliterate non-Latin scripts to ASCII (e.g., Pinyin for Chinese)
Remove all diacritics (é → e, ö → o, ñ → n)
Remove punctuation (apostrophes, commas, periods)
Replace spaces with underscores
All lowercase

GHCID History Tracking

When name suffix is added to resolve collision, update ghcid_history:

ghcid_history:
  - ghcid: NL-NH-AMS-M-HM-historical_museum_amsterdam  # Current (with name suffix)
    ghcid_numeric: 789012345678
    valid_from: "2025-11-15T14:30:00Z"  # When name suffix added
    valid_to: null
    reason: "Name suffix added to resolve collision with existing NL-NH-AMS-M-HM (Hermitage Amsterdam)"
  
  - ghcid: NL-NH-AMS-M-HM  # Original (without name suffix)
    ghcid_numeric: 123456789012
    valid_from: "2025-11-15T14:00:00Z"  # When first extracted
    valid_to: "2025-11-15T14:30:00Z"   # When collision detected
    reason: "Base GHCID from geographic location and institution name"

PID Stability Principle - "Cool URIs Don't Change"

NEVER modify a published GHCID. Once exported to RDF, JSON-LD, or CSV, a GHCID becomes a persistent identifier that may be:

Cited in academic papers - Journal articles referencing heritage collections
Used in external APIs - Third-party systems querying our data
Embedded in linked data - RDF triples in knowledge graphs
Referenced in finding aids - Archival descriptions linking to institutions

Changing a published GHCID breaks these external references. Per W3C "Cool URIs Don't Change":

✅ Correct: Add name suffix to NEW institution (historical addition)
❌ WRONG: Retroactively add name suffix to EXISTING published GHCID

Error Handling for Agents

Scenario 1: Missing Provenance Timestamp

if 'extraction_date' not in institution['provenance']:
    # Use current timestamp as fallback
    institution['provenance']['extraction_date'] = datetime.now(timezone.utc).isoformat()
    # Log warning for manual review
    log.warning(f"Missing extraction_date for {institution['name']}, using current time")

Scenario 2: Multiple Historical Additions

# Three institutions generate NL-UT-UTR-S-HK
# Extraction dates: 2025-11-01, 2025-11-15, 2025-12-01

# Result:
# 2025-11-01: NL-UT-UTR-S-HK (first, no name suffix)
# 2025-11-15: NL-UT-UTR-S-HK-historische_kring_utrecht (second, gets name suffix)
# 2025-12-01: NL-UT-UTR-S-HK-heemkundige_kring_utrecht (third, gets name suffix)

Scenario 3: Collision Resolution with Name Suffix

if collision_detected:
    # Generate name suffix from native language name
    name_suffix = generate_name_suffix(institution['name'])
    
    # Append to base GHCID
    ghcid = f"{base_ghcid}-{name_suffix}"  # e.g., NL-NH-AMS-M-HM-historical_museum_amsterdam
    
    # Record collision resolution
    institution['provenance']['notes'] = (
        f"Name suffix added to resolve collision with existing {base_ghcid}."
    )

Validation Checklist for Agents

Before publishing extracted data, verify:

All institutions have extraction_date in provenance metadata
Collisions detected by grouping on base GHCID (without name suffix)
First batch collisions: ALL instances have name suffixes
Historical additions: ONLY new instances have name suffixes
No published GHCIDs modified (PID stability test)
GHCID history entries created with valid temporal ordering
Name suffixes derived from native language institution names
Collision reasons documented in ghcid_history

Example Extraction Prompts for Agents

Prompt Template for NLP Extraction:

Extract heritage institutions from this conversation about [REGION] GLAM institutions.

For EACH institution:
1. Generate base GHCID using geographic location and institution type
2. Check for collisions with previously published GHCIDs
3. Apply temporal priority rule:
   - If collision with same extraction_date → First Batch (all get name suffixes)
   - If collision with earlier publication_date → Historical Addition (only new gets name suffix)
4. Generate snake_case name suffix from native language institution name
5. Create GHCID history entry documenting collision resolution
6. Include extraction_date in provenance metadata

Output: LinkML-compliant YAML with complete collision handling

Prompt Template for CSV Parsing:

Parse this heritage institution CSV file dated [DATE].

All rows have the same extraction_date ([DATE]).

If multiple institutions generate the same base GHCID:
- This is a FIRST BATCH collision
- ALL colliding institutions MUST receive name suffixes
- Generate name suffix from institution's native language name
- Document collision in ghcid_history

Output: YAML with collision resolution applied

Testing Strategies for Collision Handling

Unit Test: First Batch Collision

def test_first_batch_collision():
    """Two institutions extracted same day with same base GHCID."""
    institutions = [
        {
            'name': 'Stedelijk Museum Amsterdam',
            'base_ghcid': 'NL-NH-AMS-M-SM',
            'identifiers': [{'identifier_scheme': 'Wikidata', 'identifier_value': 'Q621531'}],
            'provenance': {'extraction_date': '2025-11-01T10:00:00Z'}
        },
        {
            'name': 'Science Museum Amsterdam',
            'base_ghcid': 'NL-NH-AMS-M-SM',
            'identifiers': [{'identifier_scheme': 'Wikidata', 'identifier_value': 'Q98765432'}],
            'provenance': {'extraction_date': '2025-11-01T10:00:00Z'}
        }
    ]
    
    resolved = resolve_collisions(institutions)
    
    # Both should have name suffixes
    assert resolved[0]['ghcid'] == 'NL-NH-AMS-M-SM-stedelijk_museum_amsterdam'
    assert resolved[1]['ghcid'] == 'NL-NH-AMS-M-SM-science_museum_amsterdam'

Unit Test: Historical Addition

def test_historical_addition():
    """New institution added later with same base GHCID."""
    published = {
        'name': 'Hermitage Amsterdam',
        'ghcid': 'NL-NH-AMS-M-HM',  # Already published
        'provenance': {'extraction_date': '2025-11-01T10:00:00Z'}
    }
    
    new_institution = {
        'name': 'Historical Museum Amsterdam',
        'base_ghcid': 'NL-NH-AMS-M-HM',  # Collision!
        'identifiers': [{'identifier_scheme': 'Wikidata', 'identifier_value': 'Q17339437'}],
        'provenance': {'extraction_date': '2025-11-15T14:30:00Z'}
    }
    
    resolved = resolve_collision(new_institution, published_dataset=[published])
    
    # Published GHCID unchanged
    assert published['ghcid'] == 'NL-NH-AMS-M-HM'
    
    # New institution gets name suffix
    assert resolved['ghcid'] == 'NL-NH-AMS-M-HM-historical_museum_amsterdam'
    
    # GHCID history created
    assert len(resolved['ghcid_history']) == 2
    assert resolved['ghcid_history'][0]['ghcid'] == 'NL-NH-AMS-M-HM-historical_museum_amsterdam'

References for Collision Handling

Specification: docs/PERSISTENT_IDENTIFIERS.md - "Historical Collision Resolution" section
Algorithm: docs/plan/global_glam/07-ghcid-collision-resolution.md - Temporal dimension and decision logic
Examples: docs/GHCID_PID_SCHEME.md - Timeline examples with real institutions
Implementation: scripts/regenerate_historical_ghcids.py - Code comments documenting collision handling
Schema: schemas/provenance.yaml - GHCIDHistoryEntry and ChangeEvent classes

See also:

docs/PERSISTENT_IDENTIFIERS.md - Complete identifier format documentation
docs/UUID_STRATEGY.md - UUID v5 vs v7 vs v8 comparison
docs/WHY_UUID_V5_SHA1.md - SHA-1 safety rationale

References

Schema (v0.2.0):
- Main: schemas/heritage_custodian.yaml
- Core classes: schemas/core.yaml
- Enumerations: schemas/enums.yaml
- Provenance: schemas/provenance.yaml
- Collections: schemas/collections.yaml
- Dutch extensions: schemas/dutch.yaml
- Architecture: /docs/SCHEMA_MODULES.md
Persistent Identifiers:
- Overview: docs/PERSISTENT_IDENTIFIERS.md
- UUID Strategy: docs/UUID_STRATEGY.md
- SHA-1 Rationale: docs/WHY_UUID_V5_SHA1.md
- GHCID PID Scheme: docs/GHCID_PID_SCHEME.md
- Collision Resolution: docs/plan/global_glam/07-ghcid-collision-resolution.md
Architecture: docs/plan/global_glam/02-architecture.md
Data Standardization: docs/plan/global_glam/04-data-standardization.md
Design Patterns: docs/plan/global_glam/05-design-patterns.md
Dependencies: docs/plan/global_glam/03-dependencies.md

Version: 0.2.1
Schema Version: v0.2.1 (modular)
Last Updated: 2025-12-08
Maintained By: GLAM Data Extraction Project

182 KiB Raw Blame History Unescape Escape

AI Agent Instructions for GLAM Data Extraction

🎯 PROJECT CORE MISSION

🚨 CRITICAL RULES FOR ALL AGENTS

Rule 0: LinkML Schemas Are the Single Source of Truth

Rule 1: Ontology Files Are Your Primary Reference

Rule 2: Wikidata Entities Are NOT Ontology Classes

Rule 3: Multi-Aspect Modeling is Mandatory

Rule 5: NEVER Delete Enriched Data - Additive Only

Rule 6: WebObservation Claims MUST Have XPath Provenance

Rule 4: Technical Classes Are Excluded from Visualizations

Rule 7: Deployment is LOCAL via SSH/rsync (NO CI/CD)

Rule 8: Legal Form Terms MUST Be Filtered from CustodianName

Rule 9: Enum-to-Class Promotion - Single Source of Truth

Rule 10: CH-Annotator is the Entity Annotation Convention

Rule 11: Z.AI GLM API for LLM Tasks (NOT BigModel)

Rule 12: Person Data Reference Pattern - Avoid Inline Duplication

Rule 13: Custodian Type Annotations on LinkML Schema Elements

Rule 14: Exa MCP LinkedIn Profile Extraction

Rule 15: Connection Data Registration - Full Network Preservation

Rule 16: LinkedIn Photo URLs - Store CDN URLs, Not Overlay Pages

Rule 17: LinkedIn Connection Unique Identifiers - Include Abbreviated and Anonymous Names

Rule 18: Custodian Staff Parsing from LinkedIn Company Pages

Rule 19: HTML-Only LinkedIn Extraction (Preferred Method)

Rule 20: Person Entity Profiles - Individual File Storage

Rule 21: Data Fabrication is Strictly Prohibited

Rule 22: Custodian YAML Files Are the Single Source of Truth for Enrichment Data

Rule 23: Social Media Link Validation - No Generic Links

Rule 24: Unused Import Investigation - Check Before Removing

Rule 25: Digital Platform Discovery Enrichment

Rule 26: Person Data Provenance - Web Claims for Staff Information

Rule 27: Person-Custodian Data Architecture - Single Source of Truth

Rule 28: Web Claims Deduplication - No Redundant Claims

Project Overview

Schema Reference (v0.2.1)

Base Ontologies for Global GLAM Data

Foundation Ontologies

1. TOOI - Dutch Government Organizational Ontology

2. CPOV - EU Core Public Organisation Vocabulary

3. Schema.org - Web Vocabulary for Structured Data

Ontology Decision Tree for Agents

Required Ontology Consultation Workflow

Step 1: Identify Institution Geographic Scope

Step 2: Review Ontology Classes and Properties

Step 3: Map Conversation Data to Ontology Properties

Step 4: Document Ontology Alignment in Provenance

Common Ontology Patterns

Anti-Patterns to Avoid

Additional Ontology Resources

Institution Type Taxonomy

Data Sources

Primary Sources

Implementation Status (Updated Nov 2025)

Conversation JSON Structure

NLP Extraction Tasks

Task 1: Entity Recognition - Institution Names

Task 2: Location Extraction

Task 3: Identifier Extraction

Task 4: Relationship Extraction

Task 5: Collection Metadata Extraction

Task 6: Digital Platform Identification

Task 7: Metadata Standards Detection

Task 8: Organizational Change Event Extraction (NEW - v0.2.0)

Task 9: Holy Sites Heritage Collection Identification

Data Quality and Provenance

Provenance Tracking

Confidence Scoring

Data Tier Assignment

Integration with CSV Data

Cross-linking Strategy

Conflict Resolution

NLP Models and Tools

Recommended Approach: Agent-Based NER

CRITICAL: Creating LinkML Instance Files

Agent Capabilities Go Beyond Traditional NER

Mandatory: Create Complete LinkML Instance Files

Extraction Workflow for Agents

Example Agent Prompt for Comprehensive Extraction

Multiple Institutions Per File

Field Completion Strategies

182 KiB

Raw Blame History