# AI Agent Instructions for GLAM Data Extraction

This document provides instructions for AI agents (particularly OpenCODE and Claude) to assist with extracting heritage institution data from conversation JSON files and other sources.

---

## 🎯 PROJECT CORE MISSION

**PRIMARY OBJECTIVE**: Create a comprehensive, nuanced ontology that accurately represents the complex, temporal, multi-faceted nature of heritage custodian institutions worldwide.

This is NOT a simple data extraction project. This is an **ontology engineering project** that:

- Models heritage entities as multi-aspect temporal entities (place, custodian, legal form, collections, people)
- Integrates multiple base ontologies (CPOV, TOOI, CIDOC-CRM, RiC-O, Schema.org, PiCo)
- Captures organizational change events over time (custody transfers, mergers, transformations)
- Distinguishes between nominal references and formal organizational structures
- Links heritage custodians to people, collections, and locations with independent temporal lifecycles

**If you're looking for simple NER extraction, this is not the right project.**

---

## 🚨 CRITICAL RULES FOR ALL AGENTS

### ⚠️ DATA QUALITY IS OF UTMOST IMPORTANCE ⚠️

**Wrong data is worse than no data.** All enrichments MUST be double-checked before being committed to the dataset. A single false claim (birth year from wrong person, social media from random account) corrupts the entire profile.

**Mandatory verification for ALL enrichments**:

1. **Source verification**: Is the source about the SAME person, not a namesake?
2. **Cross-reference check**: Do at least 3 identity attributes match (employer, location, profession, age, education)?
3. **Conflict detection**: Any conflicting signal (actress vs curator, Venezuela vs UK) = REJECT
4. **Provenance documentation**: Every claim must have full provenance with retrieval timestamp and source URL

**If in doubt, DO NOT add the claim.** It is better to have incomplete data than incorrect data.

### 🚫 AUTOMATED WEB ENRICHMENT IS PROHIBITED 🚫

**DO NOT USE** automated scripts to enrich person profiles with web search data. The `enrich_person_comprehensive.py` script has been DEPRECATED due to catastrophic entity resolution failures.

**What happened**: Automated enrichment attributed data from wrong people with similar names:

- Birth year from Venezuelan actress attributed to UK art curator
- ResearchGate profile of Mexican hydrogeologist attributed to Dutch museum curator
- Wikipedia article about Nazi doctor attributed to heritage worker
- 540+ false claims had to be manually removed

**ALL person enrichment must be done MANUALLY** with human verification that the source refers to the correct person.

---

This section summarizes 60 critical rules. Each rule has complete documentation in `.opencode/` files.

### Rule 0: LinkML Schemas Are the Single Source of Truth

🚨 **CRITICAL**: LinkML schema files in `schemas/20251121/linkml/` are the authoritative definition of the Heritage Custodian Ontology.

**Key Points**:

- ALL derived files (RDF, TypeDB, UML) are GENERATED - never edit them directly
- Always use full timestamps (`YYYYMMDD_HHMMSS`) in generated filenames
- Primary schema: `schemas/20251121/linkml/01_custodian_name.yaml`

**Workflow**:

```
1. EDIT LinkML schema
2. REGENERATE: gen-owl → rdfpipe → all 8 RDF formats
3. REGENERATE: gen-yuml → UML diagrams
4. UPDATE: TypeDB schema (manual)
5. VALIDATE: linkml-validate
```

**See**: `.opencode/SCHEMA_GENERATION_RULES.md` for complete generation rules

---

### Rule 0b: LinkML Type/Types File Naming Convention

🚨 **CRITICAL**: When creating class hierarchies that replace enums, follow the **Type/Types** naming pattern.

**Naming Pattern**:

- **`[Entity]Type.yaml`** (singular): Abstract base class defining the type taxonomy
- **`[Entity]Types.yaml`** (plural): File containing all concrete subclasses

**Examples**:

| Base Class File | Subclasses File | Description |
|-----------------|-----------------|-------------|
| `DigitalPlatformType.yaml` | `DigitalPlatformTypes.yaml` | Digital platform type taxonomy (69 types) |
| `WebPortalType.yaml` | `WebPortalTypes.yaml` | Web portal type taxonomy |
| `CustodianType.yaml` | `CustodianTypes.yaml` | Heritage custodian type taxonomy (GLAMORCUBESFIXPHDNT) |

**Import Pattern**:

```yaml
# In DigitalPlatformTypes.yaml (subclasses file)
imports:
  - ./DigitalPlatformType  # Import base class

classes:
  DigitalLibrary:
    is_a: DigitalPlatformType  # Inherit from base
    # ...
```

**Rationale**:

1. **Clarity**: "Type" (singular) = one abstract concept; "Types" (plural) = many concrete subclasses
2. **Discoverability**: Related files are adjacent in directory listings
3. **Consistency**: Follows established pattern across schema (CustodianType/CustodianTypes, WebPortalType/WebPortalTypes)

**Anti-Pattern**:

- ❌ `DigitalPlatformTypeBase.yaml` - "Base" suffix is redundant; use singular "Type" instead
- ❌ `DigitalPlatformTypeClasses.yaml` - "Classes" is less intuitive than "Types"

**See**: `.opencode/rules/type-naming-convention.md` for complete documentation

---

### Rule 1: Ontology Files Are Your Primary Reference

🚨 **CRITICAL**: Before designing any schema, class, or property, consult base ontologies.

**Required Steps**:

1. READ base ontology files in `/data/ontology/`
2. SEARCH for existing classes and properties
3. DOCUMENT your ontology alignment with rationale
4. NEVER invent custom properties when ontology equivalents exist

**Available Ontologies**:

- `tooiont.ttl` - TOOI (Dutch government)
- `core-public-organisation-ap.ttl` - CPOV (EU public sector)
- `schemaorg.owl` - Schema.org (web semantics)
- `CIDOC_CRM_v7.1.3.rdf` - CIDOC-CRM (cultural heritage)
- `RiC-O_1-1.rdf` - Records in Contexts (archival)
- `pico.ttl` - PiCo (person observations)

**See**: `.opencode/HYPER_MODULAR_STRUCTURE.md` for complete documentation

---

### Rule 2: Wikidata Entities Are NOT Ontology Classes

🚨 **CRITICAL**: Files in `data/wikidata/GLAMORCUBEPSXHFN/` contain Wikidata Q-numbers for institution TYPES, NOT formal ontology class definitions.

**Workflow**: `Wikidata Q-number → Analyze semantics → Search ontologies → Map to ontology class → Document rationale`

**Note**: Full rule content preserved in Appendix below (no .opencode equivalent).

---

### Rule 3: Multi-Aspect Modeling is Mandatory

🚨 **CRITICAL**: Every heritage entity has MULTIPLE ontological aspects with INDEPENDENT temporal lifecycles.

**Required Aspects**:

| Aspect | Ontology Class | Temporal Example |
|--------|---------------|------------------|
| Place | `crm:E27_Site` | Building: 1880-present |
| Custodian | `cpov:PublicOrganisation` | Foundation: 1994-present |
| Legal Form | `org:FormalOrganization` | Registration: 1994-present |
| Collections | `rico:RecordSet` | Accession dates vary |
| People | `pico:PersonObservation` | Employment: 2020-present |
| Events | `crm:E10_Transfer_of_Custody` | Discrete timestamps |

**Note**: Full rule content preserved in Appendix below (no .opencode equivalent).

---

### Rule 4: Technical Classes Are Excluded from Visualizations

🚨 **CRITICAL**: Some LinkML classes exist solely for validation (e.g., `Container` with `tree_root: true`). These have NO semantic significance and MUST be excluded from UML diagrams.

**Excluded Classes**: `Container` (tree_root for validation only)

**See**: `.opencode/LINKML_TECHNICAL_CLASSES.md` for complete documentation

---

### Rule 5: NEVER Delete Enriched Data - Additive Only

🚨 **CRITICAL**: Data enrichment is ADDITIVE ONLY. Never delete or overwrite existing enriched content.

**Protected Data Types**:

| Source | Protected Fields |
|--------|------------------|
| Google Maps | `reviews`, `rating`, `photo_count`, `popular_times`, `place_id` |
| OpenStreetMap | `osm_id`, `osm_type`, `osm_tags`, `amenity`, `heritage` |
| Wikidata | `wikidata_id`, `claims`, `sitelinks`, `aliases` |
| Website Scrape | `organization_details`, `collections`, `contact`, `social_media` |
| ISIL Registry | `isil_code`, `assigned_date`, `remarks` |
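
The additive-only policy can be sketched as a merge that only fills gaps. This is a minimal illustration, not an existing project function; treating empty strings/lists/dicts as fillable gaps is an assumption on top of the rule:

```python
def additive_merge(existing: dict, incoming: dict) -> dict:
    """Merge enrichment data without deleting or overwriting existing values."""
    merged = dict(existing)
    for key, value in incoming.items():
        # Only fill keys that are absent or empty; never replace enriched content
        # (gap test for empty values is an assumption, not part of the rule)
        if key not in merged or merged[key] in (None, "", [], {}):
            merged[key] = value
    return merged
```

A protected field such as `rating` therefore survives even if a later scrape returns a different value, while genuinely new fields are still added.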

**See**: `.opencode/DATA_PRESERVATION_RULES.md` for complete documentation

---

### Rule 6: WebObservation Claims MUST Have XPath Provenance

🚨 **CRITICAL**: Every claim extracted from a webpage MUST have an XPath pointer to the exact location in archived HTML. Claims without XPath provenance are FABRICATED.

**Required Fields**:

```yaml
claim_type: full_name
claim_value: "Institution Name"
source_url: https://example.org/about
retrieved_on: "2025-11-29T12:28:00Z"
xpath: /html/body/div[1]/h1
html_file: web/GHCID/example.org/rendered.html
xpath_match_score: 1.0
```
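
A claim can be screened for these fields before it is accepted. A minimal sketch (the field tuple mirrors the example above; treating `xpath_match_score` as optional is an assumption, and `missing_provenance` is a hypothetical helper, not an existing script):

```python
REQUIRED_CLAIM_FIELDS = (
    "claim_type", "claim_value", "source_url",
    "retrieved_on", "xpath", "html_file",
)

def missing_provenance(claim: dict) -> list:
    """Return the required provenance fields that are absent or empty."""
    return [field for field in REQUIRED_CLAIM_FIELDS if not claim.get(field)]
```

A non-empty result means the claim must be rejected as unverifiable rather than stored.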

**Scope**: Applies to `WebClaim` and `WebObservation` classes. Other classes (CustodianTimelineEvent, GoogleMapsEnrichment) have different provenance models.

**See**: `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` for complete documentation

---

### Rule 7: Deployment is LOCAL via SSH/rsync (NO CI/CD)

🚨 **CRITICAL**: NO GitHub Actions. ALL deployments executed locally via SSH and rsync.

**Server**: `91.98.224.44` (Hetzner Cloud)

**Two Frontend Apps** (MONOREPO):

| Domain | Local Directory | Server Path |
|--------|-----------------|-------------|
| bronhouder.nl | `/frontend/` | `/var/www/glam-frontend/` |
| archief.support | `/apps/archief-assistent/` | `/var/www/archief-assistent/` |

**Deployment Commands**:

```bash
./infrastructure/deploy.sh --frontend   # bronhouder.nl
./infrastructure/deploy.sh --data       # Data files only
./infrastructure/deploy.sh --status     # Check server
```

**See**: `.opencode/DEPLOYMENT_RULES.md` and `.opencode/MONOREPO_FRONTEND_APPS.md`

---

### Rule 8: Legal Form Terms MUST Be Filtered from CustodianName

🚨 **CRITICAL**: Exception to the emic principle - legal forms are ALWAYS filtered from CustodianName.

**Example**: `Stichting Rijksmuseum` → CustodianName: `Rijksmuseum`, Legal Form: `Stichting`

**Terms to Filter** (by language):

- Dutch: Stichting, B.V., N.V., Coöperatie
- English: Foundation, Inc., Ltd., LLC, Corp.
- German: Stiftung, e.V., GmbH, AG
- French: Fondation, S.A., S.A.R.L.

**NOT Filtered** (part of identity): Vereniging, Association, Society, Verein
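
The filtering step can be sketched as a name splitter. The term tuples below mirror the examples above and are not the exhaustive filter; `split_legal_form` is a hypothetical helper, not an existing project function:

```python
LEGAL_FORM_TERMS = {
    "nl": ("Stichting", "B.V.", "N.V.", "Coöperatie"),
    "en": ("Foundation", "Inc.", "Ltd.", "LLC", "Corp."),
    "de": ("Stiftung", "e.V.", "GmbH", "AG"),
    "fr": ("Fondation", "S.A.", "S.A.R.L."),
}

def split_legal_form(raw_name: str, language: str):
    """Strip a leading/trailing legal form term; return (custodian_name, legal_form)."""
    for term in LEGAL_FORM_TERMS.get(language, ()):
        if raw_name.startswith(term + " "):
            return raw_name[len(term):].strip(), term
        if raw_name.endswith(" " + term):
            return raw_name[: -len(term)].strip(), term
    # No legal form found, or the term is part of identity (e.g. Vereniging)
    return raw_name, None
```

Note that identity terms like `Vereniging` are simply absent from the filter lists, so they pass through untouched.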

**See**: `.opencode/LEGAL_FORM_FILTERING_RULE.md` for complete documentation

---

### Rule 9: Enum-to-Class Promotion - Single Source of Truth

🚨 **CRITICAL**: When an enum is promoted to a class hierarchy, the original enum MUST be deleted. Never maintain parallel enum/class definitions.

**Archive Location**: `schemas/20251121/linkml/archive/enums/`

**See**: `.opencode/ENUM_TO_CLASS_PRINCIPLE.md` for complete documentation

---

### Rule 10: CH-Annotator is the Entity Annotation Convention

🚨 **CRITICAL**: All entity annotation follows the `ch_annotator-v1_7_0` convention.

**9 Hypernym Types**: AGT (Agent), GRP (Group), TOP (Toponym), GEO (Geometry), TMP (Temporal), APP (Appellation), ROL (Role), WRK (Work), QTY (Quantity)

**Heritage Institutions**: `GRP.HER` with GLAMORCUBESFIXPHDNT subtypes (GRP.HER.MUS, GRP.HER.LIB, GRP.HER.ARC, etc.)

**See**: `.opencode/CH_ANNOTATOR_CONVENTION.md` for complete documentation

---

### Rule 11: Z.AI GLM API for LLM Tasks (NOT BigModel)

🚨 **CRITICAL**: Use the Z.AI Coding Plan endpoint, NOT the regular BigModel API.

**Configuration**:

- API URL: `https://api.z.ai/api/coding/paas/v4/chat/completions`
- Environment Variable: `ZAI_API_TOKEN`
- Models: `glm-4.5`, `glm-4.5-air`, `glm-4.5-flash`, `glm-4.6`
- Cost: free (no per-token charge)

**See**: `.opencode/ZAI_GLM_API_RULES.md` for complete documentation

---

### Rule 12: Person Data Reference Pattern - Avoid Inline Duplication

🚨 **CRITICAL**: Person profiles are stored in `data/custodian/person/entity/`. Custodian files reference them via `person_profile_path` - NEVER duplicate 50+ lines of profile data inline.

**File Naming**: `{linkedin-slug}_{ISO-timestamp}.json`

**See**: `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` for complete documentation

---

### Rule 13: Custodian Type Annotations on LinkML Schema Elements

🚨 **CRITICAL**: All schema elements MUST have `custodian_types` annotation with GLAMORCUBESFIXPHDNT single-letter codes.

**Annotation Keys**: `custodian_types` (list), `custodian_types_rationale` (string), `custodian_types_primary` (string)

**Universal**: Use `["*"]` for elements applying to all types.

**See**: `.opencode/CUSTODIAN_TYPE_ANNOTATION_CONVENTION.md` for complete documentation

---

### Rule 14: Exa MCP LinkedIn Profile Extraction

🚨 **CRITICAL**: Use `exa_crawling_exa` with direct URL for comprehensive LinkedIn profile extraction.

**Tool Priority**:

1. `exa_crawling_exa` - Profile URL known (preferred)
2. `exa_linkedin_search_exa` - Profile URL unknown
3. `exa_web_search_exa` - Fallback search

**Output**: `data/custodian/person/entity/{linkedin-slug}_{timestamp}.json`

**See**: `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` for complete documentation

---

### Rule 15: Connection Data Registration - Full Network Preservation

🚨 **CRITICAL**: ALL LinkedIn connections must be fully registered in dedicated connections files.

**File Location**: `data/custodian/person/{slug}_connections_{timestamp}.json`

**Required**: `source_metadata`, `connections[]` array, `network_analysis` with heritage type breakdown

**See**: `.opencode/CONNECTION_DATA_REGISTRATION_RULE.md` for complete documentation

---

### Rule 16: LinkedIn Photo URLs - Store CDN URLs, Not Overlay Pages

🚨 **CRITICAL**: Store actual CDN URL, NOT overlay page URL.

- ❌ WRONG: `linkedin.com/in/{slug}/overlay/photo/` (derivable, useless)
- ✅ CORRECT: `media.licdn.com/dms/image/v2/{ID}/profile-displayphoto-shrink_800_800/...`

**See**: `.opencode/LINKEDIN_PHOTO_CDN_RULE.md` for complete documentation

---

### Rule 17: LinkedIn Connection Unique Identifiers

🚨 **CRITICAL**: Every connection gets a unique ID, including abbreviated and anonymous names.

**Format**: `{target_slug}_conn_{index:04d}_{name_slug}`

**Name Types**: `full`, `abbreviated` (Amy B.), `anonymous` (LinkedIn Member)
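
The ID format can be sketched as a small formatter that slugifies any of the three name types. The slugification regex is an assumption (lowercase, non-alphanumerics collapsed to hyphens), and `connection_id` is a hypothetical helper:

```python
import re

def connection_id(target_slug: str, index: int, display_name: str) -> str:
    """Build `{target_slug}_conn_{index:04d}_{name_slug}` for full,
    abbreviated, or anonymous display names."""
    # Collapse anything that is not a-z/0-9 into single hyphens
    name_slug = re.sub(r"[^a-z0-9]+", "-", display_name.lower()).strip("-")
    return f"{target_slug}_conn_{index:04d}_{name_slug}"
```

The same function therefore handles "Amy B." and "LinkedIn Member" without special cases.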

**See**: `.opencode/LINKEDIN_CONNECTION_ID_RULE.md` for complete documentation

---

### Rule 18: Custodian Staff Parsing from LinkedIn Company Pages

🚨 **CRITICAL**: Use `scripts/parse_custodian_staff.py` for staff registration parsing.

**Staff ID Format**: `{custodian_slug}_staff_{index:04d}_{name_slug}`

**See**: `.opencode/CUSTODIAN_STAFF_PARSING_RULE.md` for complete documentation

---

### Rule 19: HTML-Only LinkedIn Extraction (Preferred Method)

🚨 **CRITICAL**: Use ONLY manually saved HTML files for LinkedIn data extraction.

**Data Completeness**: HTML = 100% (including profile URLs), MD copy-paste = ~90%

**Script**: `scripts/parse_linkedin_html.py`

**How to Save**: Navigate → Scroll to load all → File > Save Page As > "Webpage, Complete"

**See**: `.opencode/HTML_ONLY_LINKEDIN_EXTRACTION_RULE.md` for complete documentation

---

### Rule 20: Person Entity Profiles - Individual File Storage

🚨 **CRITICAL**: Person profiles stored as individual files in `data/custodian/person/entity/`.

**File Naming**: `{linkedin-slug}_{ISO-timestamp}.json`

**Required**: ALL profiles MUST use structured JSON with `extraction_agent: "claude-opus-4.5"`. Raw content dumps are NOT acceptable.

**See**: `.opencode/PERSON_ENTITY_PROFILE_FORMAT_RULE.md` for complete documentation

---

### Rule 21: Data Fabrication is Strictly Prohibited

🚨 **CRITICAL**: ALL DATA MUST BE REAL AND VERIFIABLE. Fabricating any data is strictly prohibited.

**❌ FORBIDDEN**:

- Creating fake names, job titles, companies
- Inventing education history or skills
- Generating placeholder data when extraction fails
- Creating fictional LinkedIn URLs

**✅ ALLOWED**:

- Skip profiles that cannot be extracted
- Return `null` or empty fields for missing data
- Mark profiles with `extraction_error: true`
- Log why extraction failed

**See**: `.opencode/DATA_FABRICATION_PROHIBITION.md` for complete documentation

---

### Rule 22: Custodian YAML Files Are the Single Source of Truth

🚨 **CRITICAL**: `data/custodian/*.yaml` is the SINGLE SOURCE OF TRUTH for all enrichment data.

**Data Hierarchy**:

```
data/custodian/*.yaml  ← SINGLE SOURCE OF TRUTH
          ↓
Ducklake → PostgreSQL → TypeDB → Oxigraph → Qdrant
(All databases are DERIVED - never add data independently)
          ↓
REST API → Frontend (both DERIVED)
```

**Workflow**: FETCH → VALIDATE → WRITE TO YAML → Import to database → Verify

**See**: `.opencode/CUSTODIAN_DATA_SOURCE_OF_TRUTH.md` for complete documentation

---

### Rule 23: Social Media Link Validation - No Generic Links

🚨 **CRITICAL**: Social media links MUST be institution-specific, NOT generic platform homepages.

**Invalid**: `facebook.com/`, `facebook.com/facebook`, `twitter.com/twitter`

**Valid**: `facebook.com/rijksmuseum/`, `twitter.com/rijksmuseum`
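
A validator for this check can be sketched as follows; the per-platform sets of self-referential handles are illustrative (built from the examples above), not a complete list:

```python
from urllib.parse import urlparse

# Illustrative self-referential handles per platform; extend as needed
GENERIC_HANDLES = {
    "facebook.com": {"", "facebook"},
    "twitter.com": {"", "twitter"},
    "instagram.com": {"", "instagram"},
}

def is_institution_specific(url: str) -> bool:
    """Reject platform homepages and self-referential handles."""
    parsed = urlparse(url if "://" in url else "https://" + url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    handle = parsed.path.strip("/").split("/")[0].lower()
    if host in GENERIC_HANDLES and handle in GENERIC_HANDLES[host]:
        return False
    return bool(handle)  # an empty path is never institution-specific
```
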

**See**: `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` for complete documentation

---

### Rule 24: Unused Import Investigation - Check Before Removing

🚨 **CRITICAL**: Before removing unused imports, INVESTIGATE whether they indicate incomplete implementations.

**Checklist**:

1. Was it recently used? (`git log -p --all -S 'ImportName'`)
2. Is there a TODO/FIXME?
3. Pattern mismatch (old vs new syntax)?
4. Incomplete feature?
5. Conditional usage (`TYPE_CHECKING` blocks)?

**See**: `.opencode/UNUSED_IMPORT_INVESTIGATION_RULE.md` for complete documentation

---

### Rule 25: Digital Platform Discovery Enrichment

🚨 **CRITICAL**: Every heritage custodian MUST be enriched with digital platform discovery data.

**Discover**: Collection management systems, discovery portals, external integrations, APIs

**Required Provenance**: `retrieval_agent`, `retrieval_timestamp`, `source_url`, `xpath_base`, `html_file`

**See**: `.opencode/DIGITAL_PLATFORM_DISCOVERY_RULE.md` for complete documentation

---

### Rule 26: Person Data Provenance - Web Claims for Staff Information

🚨 **CRITICAL**: All person/staff data MUST have web claim provenance with verifiable sources.

**Required Fields**: `claim_type`, `claim_value`, `source_url`, `retrieved_on`, `retrieval_agent`

**Recommended**: `xpath`, `xpath_match_score`

**See**: `.opencode/PERSON_DATA_PROVENANCE_RULE.md` for complete documentation

---

### Rule 27: Person-Custodian Data Architecture

🚨 **CRITICAL**: Person entity files are the SINGLE SOURCE OF TRUTH for all person data.

**In Person Entity File**: `extraction_metadata`, `profile_data`, `web_claims`, `affiliations`

**In Custodian YAML**: `person_id`, `person_name`, `role_title`, `affiliation_provenance`, `linkedin_profile_path` (reference only)

**NEVER**: Put `web_claims` in custodian YAML files

**See**: `.opencode/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md` for complete documentation

---

### Rule 28: Web Claims Deduplication - No Redundant Claims

🚨 **CRITICAL**: Do not duplicate claims unless genuine variation exists with uncertainty.

**Eliminate**: Favicon variants, same value from different extractions, dynamic content

**Document**: Removed claims in `removed_claims` section for audit trail
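
The simplest version of this pass can be sketched as keying claims on their `(claim_type, claim_value)` pair; this is a simplification, since the rule also permits genuinely varying values with uncertainty to coexist:

```python
def deduplicate_claims(claims: list):
    """Keep the first claim per (claim_type, claim_value); route the rest
    to a removed list for the `removed_claims` audit trail."""
    kept, removed, seen = [], [], set()
    for claim in claims:
        key = (claim.get("claim_type"), claim.get("claim_value"))
        if key in seen:
            removed.append(claim)  # preserved, not discarded: audit trail
        else:
            seen.add(key)
            kept.append(claim)
    return kept, removed
```
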

**See**: `.opencode/WEB_CLAIMS_DEDUPLICATION_RULE.md` for complete documentation

---

### Rule 29: Anonymous Profile Name Derivation from LinkedIn Slugs

🚨 **CRITICAL**: Names CAN be derived from hyphenated LinkedIn slugs - this is data transformation, NOT fabrication.

**Dutch Particles**: Keep lowercase when not first word (van, de, den, der)

**Known Compound Slugs**: Use mapping for `jponjee` → "J. Ponjee", etc.
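
The derivation can be sketched as a slug-to-name transform; dropping trailing numeric disambiguators is an assumption, and the compound-slug mapping holds only the one example given above:

```python
DUTCH_PARTICLES = {"van", "de", "den", "der"}

# Known compound slugs that cannot be split mechanically (illustrative entry)
KNOWN_COMPOUND_SLUGS = {"jponjee": "J. Ponjee"}

def name_from_slug(slug: str) -> str:
    """Derive a display name from a hyphenated LinkedIn slug
    (data transformation, not fabrication)."""
    if slug in KNOWN_COMPOUND_SLUGS:
        return KNOWN_COMPOUND_SLUGS[slug]
    words = [w for w in slug.split("-") if not w.isdigit()]  # drop numeric suffixes
    return " ".join(
        w if i > 0 and w in DUTCH_PARTICLES else w.capitalize()
        for i, w in enumerate(words)
    )
```
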

**See**: `.opencode/ANONYMOUS_PROFILE_NAME_RULE.md` for complete documentation

---

### Rule 30: Person Profile Extraction Confidence Scoring

🚨 **CRITICAL**: Every enriched profile MUST have a confidence score (0.50-0.95) for data extraction quality.

**Distinct from**: Heritage sector relevance score (different purpose)

**Scoring Factors**:

- Clear job title: +0.10 to +0.15
- Named institution: +0.05 to +0.10
- Privacy-abbreviated name: -0.15 to -0.20
- Intern/trainee: -0.10
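
The scoring can be sketched as additive adjustments clamped into the required range. The 0.70 baseline is an assumption (the rule only specifies the adjustments), and the low end of each factor range is used here:

```python
def extraction_confidence(profile: dict) -> float:
    """Score extraction quality from the rule's factors, clamped to [0.50, 0.95].

    The 0.70 neutral baseline is an assumption, not part of the rule.
    """
    score = 0.70
    if profile.get("clear_job_title"):
        score += 0.10
    if profile.get("named_institution"):
        score += 0.05
    if profile.get("privacy_abbreviated_name"):
        score -= 0.15
    if profile.get("intern_or_trainee"):
        score -= 0.10
    return round(min(0.95, max(0.50, score)), 2)
```
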

**See**: `.opencode/PERSON_PROFILE_CONFIDENCE_SCORING.md` for complete documentation

---

### Rule 31: Organizational Subdivision Extraction

🚨 **CRITICAL**: ALWAYS capture organizational subdivisions as structured data.

**Types**: department, team, unit, division, section, lab_or_center, office

**Store in**: `affiliations[].subdivision` with `type`, `name`, `parent_subdivision`, `extraction_source`

**See**: `.opencode/ORGANIZATIONAL_SUBDIVISION_EXTRACTION.md` for complete documentation

---

### Rule 32: Government Ministries Are Heritage Custodians (Type O)

🚨 **CRITICAL**: Government ministries ARE heritage custodians due to statutory record-keeping obligations.

**Heritage Relevance Scores**:

| Role Category | Score Range |
|---------------|-------------|
| Records Management | 0.40-0.50 |
| IT/Systems (records) | 0.30-0.40 |
| Policy/Advisory | 0.25-0.35 |
| Administrative | 0.15-0.25 |

**See**: `.opencode/GOVERNMENT_MINISTRY_HERITAGE_RULE.md` for complete documentation

---

### Rule 33: GHCID Collision Duplicate Detection

🚨 **CRITICAL**: Duplicate detection is MANDATORY in GHCID collision resolution.

**Decision Matrix**:

- ALL details match → DUPLICATE (keep earliest, archive later)
- Same name, different city → NOT DUPLICATE (keep both, add suffix)
- Same name, same city, different Wikidata IDs → NOT DUPLICATE
- When in doubt → Keep both files (can merge later)
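
The decision matrix can be sketched as a comparison over candidate records with the same name. The comparison keys are illustrative; a real check would compare all available details:

```python
def is_duplicate(a: dict, b: dict) -> bool:
    """Apply the GHCID collision decision matrix; conservative when in doubt."""
    if a.get("city") != b.get("city"):
        return False  # same name, different city: keep both
    wid_a, wid_b = a.get("wikidata_id"), b.get("wikidata_id")
    if wid_a and wid_b and wid_a != wid_b:
        return False  # distinct Wikidata IDs: keep both
    if all(a.get(k) == b.get(k) for k in ("name", "city", "wikidata_id")):
        return True  # all details match: duplicate (keep earliest, archive later)
    return False  # when in doubt, keep both files (can merge later)
```
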

**See**: `.opencode/GHCID_COLLISION_DUPLICATE_DETECTION.md` for complete documentation

---

### Rule 34: Linkup is the Preferred Web Scraper

🚨 **CRITICAL**: Use Linkup as the primary web scraper. Firecrawl credits are limited.

**Tool Priority**:

| Priority | Tool | When to Use |
|----------|------|-------------|
| 1st | `linkup_linkup-search` | General research, finding pages |
| 2nd | `linkup_linkup-fetch` | Fetching known URL |
| 3rd | `firecrawl_*` | Only when Linkup fails |
| 4th | `playwright_*` | Interactive pages, HTML archival |

**Two-Phase for XPath Provenance** (Rule 6 compliance):

1. Linkup for discovery
2. Playwright for archival with XPath extraction

**See**: `.opencode/LINKUP_PREFERRED_WEB_SCRAPER_RULE.md` for complete documentation

---

### Rule 35: Provenance Statements MUST Have Dual Timestamps

🚨 **CRITICAL**: Every provenance statement MUST include at least TWO timestamps to distinguish when the claim was created from when the source was archived.

**MANDATORY Timestamps**:

| Timestamp | Purpose | Example |
|-----------|---------|---------|
| `statement_created_at` | When the claim/annotation was extracted/created | `2025-12-30T14:30:00Z` |
| `source_archived_at` | When the source material was archived/captured | `2025-12-29T10:15:00Z` |

**Optional (Encouraged)**:

- `source_created_at` - When the original source content was published
- `source_last_modified_at` - When the source content was last updated
- `last_verified_at` - When the claim was last re-verified
- `next_verification_due` - When the claim should be re-verified

**Example - CORRECT (Dual Timestamps)**:

```yaml
provenance:
  statement_created_at: "2025-12-30T14:30:00Z"  # When we extracted this claim
  source_archived_at: "2025-12-29T10:15:00Z"    # When we archived the webpage
  source_created_at: "2022-07-15T00:00:00Z"     # Optional: article publish date
```

**Example - WRONG (Single Timestamp)**:

```yaml
# INVALID - Only one timestamp, vague agent
extraction_provenance:
  timestamp: '2025-11-06T08:02:44Z'  # Which timestamp is this?!
  agent: claude-conversation         # Too vague - which model?
```

**Agent Identifier Standards**:

| ❌ Invalid | ✅ Valid |
|------------|----------|
| `claude-conversation` | `opencode-claude-sonnet-4` |
| `claude` | `opencode-claude-opus-4` |
| `ai` | `batch-script-python-3.11` |
| `opencode` | `manual-human-curator` |

**Validation Rule**: `source_archived_at` MUST be ≤ `statement_created_at` (source archived before/when statement created)
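
The presence and ordering checks can be sketched as a small validator (`validate_provenance` is a hypothetical helper, not an existing project script; the `Z` suffix handling assumes UTC ISO-8601 timestamps as in the examples above):

```python
from datetime import datetime

REQUIRED_TIMESTAMPS = ("statement_created_at", "source_archived_at")

def _parse(ts: str) -> datetime:
    # fromisoformat in older Pythons does not accept a trailing "Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def validate_provenance(prov: dict) -> list:
    """Return violations of the dual-timestamp rule (empty list = valid)."""
    errors = [f"missing required timestamp: {k}"
              for k in REQUIRED_TIMESTAMPS if k not in prov]
    if not errors and _parse(prov["source_archived_at"]) > _parse(prov["statement_created_at"]):
        errors.append("source_archived_at must be <= statement_created_at")
    return errors
```
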

**Migration Note**: 24,328 files in `data/custodian/` with `agent: claude-conversation` require migration to dual timestamp format.

**See**: `.opencode/PROVENANCE_TIMESTAMP_RULES.md` for complete documentation

---

### Rule 36: Original Language Preservation in Web Content Extraction

🚨 **CRITICAL**: ALL extracted text content MUST be preserved in its original source language. Translation is STRICTLY FORBIDDEN during extraction.

**Applies to**:

- Mission statements
- Vision statements
- Organizational descriptions
- About us content
- Historical narratives
- Collection descriptions
- Any textual content extracted from institutional websites

**Rationale**:

1. **Emic Authenticity** - The institution's own voice and terminology must be preserved
2. **Semantic Fidelity** - Translation introduces interpretation and potential distortion
3. **Provenance Integrity** - Translated content breaks XPath provenance and content hash verification
4. **Downstream Flexibility** - Original content allows users to request translations in their preferred language

**Required Fields**:

```yaml
mission_statement:
  text: "Het Rijksmuseum is het museum van Nederland..."  # Original Dutch
  language: "nl"  # ISO 639-1 code
  source_url: "https://www.rijksmuseum.nl/nl/over-ons"
  extracted_verbatim: true  # Confirms no translation occurred
```

**LLM Prompt Requirements**:

All LLM prompts for content extraction MUST include explicit no-translation instructions:

```
CRITICAL: Extract the text EXACTLY as it appears on the webpage.
DO NOT TRANSLATE. Preserve the original language.
If the source is in Dutch, the output must be in Dutch.
If the source is in Spanish, the output must be in Spanish.
```

**Anti-Patterns (FORBIDDEN)**:

| Scenario | Status |
|----------|--------|
| Translate Dutch → English during extraction | ❌ FORBIDDEN |
| Store English text with Dutch `source_url` | ❌ FORBIDDEN |
| Mix languages in extracted content | ❌ FORBIDDEN |
| Omit `language` field | ❌ FORBIDDEN |

**Validation Checklist**:

- [ ] Text is in original source language
- [ ] `language` field matches content language
- [ ] `language` matches expected from GHCID (or mismatch documented)
- [ ] `extracted_verbatim: true` is set

**See**: `.opencode/ORIGINAL_LANGUAGE_PRESERVATION_RULE.md` for complete documentation including language-specific LLM prompts

---

### Rule 37: Specificity Score Annotations for LinkML Classes

🚨 **CRITICAL**: Every LinkML class MUST have specificity score annotations to enable intelligent RAG retrieval filtering and UML visualization.

**Annotation Format**:

```yaml
classes:
  ClassName:
    annotations:
      specificity_score: 0.75        # Required: 0.0-1.0
      specificity_rationale: "..."   # Required: Why this score
      template_specificity:          # Optional: Template-specific scores
        archive_search: 0.95
        museum_search: 0.20
```

**Score Semantics** (LOWER = more broadly relevant):

| Score Range | Meaning | Examples |
|-------------|---------|----------|
| 0.00-0.20 | Universal | `HeritageCustodian`, `Location` |
| 0.20-0.40 | Broadly useful | `Collection`, `Identifier` |
| 0.40-0.60 | Moderately specific | `ChangeEvent`, `PersonProfile` |
| 0.60-0.80 | Fairly specific | `Archive`, `Museum`, `Library` |
| 0.80-1.00 | Highly specific | `LinkedInConnectionExtraction` |

**10 Conversation Templates** for `template_specificity`:

- `archive_search`, `museum_search`, `library_search`
- `collection_discovery`, `person_research`, `location_browse`
- `identifier_lookup`, `organizational_change`, `digital_platform`
- `general_heritage` (fallback - uses `specificity_score` directly)

**Validation Rules**:

1. Score must be in range [0.0, 1.0]
2. Rationale must not be empty
3. Child class score must be ≥ parent class score (inheritance consistency)
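
The three validation rules can be sketched as a checker over a class's annotation dict (`validate_specificity` is a hypothetical helper operating on plain dicts, not the LinkML runtime):

```python
def validate_specificity(annotations: dict, parent_annotations: dict = None) -> list:
    """Check the three validation rules for a class's specificity annotations."""
    errors = []
    score = annotations.get("specificity_score")
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        errors.append("specificity_score must be a number in [0.0, 1.0]")
    if not annotations.get("specificity_rationale"):
        errors.append("specificity_rationale must not be empty")
    # Inheritance consistency: child must be at least as specific as parent
    if parent_annotations and not errors:
        if score < parent_annotations.get("specificity_score", 0.0):
            errors.append("child score must be >= parent score")
    return errors
```
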

**Use Cases**:

- **RAG Retrieval**: Filter schema classes by relevance to user query
- **UML Visualization**: Generate focused diagrams showing only relevant classes
- **Context Management**: Reduce token usage by excluding low-relevance classes

**See**: `.opencode/rules/specificity-score-convention.md` for complete documentation

---

### Rule 38: Slot Centralization and Semantic URI Requirements

🚨 **CRITICAL**: All LinkML slots MUST be centralized in `schemas/20251121/linkml/modules/slots/` and MUST have semantically sound `slot_uri` predicates from base ontologies.

**Key Requirements**:

1. **Centralization**: All slots MUST be defined in `modules/slots/`, never inline in class files
2. **slot_uri**: Every slot MUST have a `slot_uri` from base ontologies (`data/ontology/`)
3. **Mappings**: Use `exact_mappings`, `close_mappings`, `related_mappings`, `narrow_mappings`, `broad_mappings` for additional semantic relationships

**Why This Matters**:

- **Frontend UML visualization** depends on centralized slots for edge rendering
- **Semantic URIs** enable linked data interoperability and RDF serialization
- **Mapping annotations** connect to SKOS-based vocabulary alignment standards

**Common slot_uri Sources**:

| Ontology | Prefix | Example Predicates |
|----------|--------|-------------------|
| SKOS | `skos:` | `prefLabel`, `altLabel`, `definition`, `note` |
| Schema.org | `schema:` | `name`, `description`, `url`, `dateCreated` |
| Dublin Core | `dcterms:` | `identifier`, `title`, `creator`, `date` |
| PROV-O | `prov:` | `wasGeneratedBy`, `wasAttributedTo`, `atTime` |
| RiC-O | `rico:` | `hasRecordSetType`, `isOrWasPartOf` |
| CIDOC-CRM | `crm:` | `P1_is_identified_by`, `P2_has_type` |

**Workflow for New Slots**:

1. Search `data/ontology/` for existing predicate
2. Create file in `modules/slots/` with `slot_uri`
3. Add mappings to related predicates in other ontologies
4. Update `manifest.json` with new slot file

**See**: `.opencode/rules/slot-centralization-and-semantic-uri-rule.md` for complete documentation

---
|
||
|
||
### Rule 39: Slot Naming Convention (RiC-O Style)

🚨 **CRITICAL**: LinkML slots representing relational predicates MUST follow RiC-O-style naming conventions to express temporal semantics accurately.

**Core Naming Patterns**:

| Pattern | Use Case | Examples |
|---------|----------|----------|
| `hasOrHad*` | Temporal relationship (active voice) | `hasOrHadHolder`, `hasOrHadPart`, `hasOrHadType` |
| `isOrWas*` | Temporal relationship (inverse) | `isOrWasPartOf`, `isOrWasMemberOf`, `isOrWasHolderOf` |
| `has*` | Permanent/immutable facts | `hasBeginningDate`, `hasBirthPlace`, `hasIdentifier` |
| `*Transitive` | Hierarchical (through chain) | `isIncludedInTransitive` |
| `directly*` | Hierarchical (immediate only) | `directlyIncludes`, `isDirectlyIncludedIn` |

**Semantic Distinction: Hierarchy vs Association**:

🚨 The same slot name can mask **different semantics**. Always analyze intent:

| Category | Semantic | Pattern | Ontology |
|----------|----------|---------|----------|
| **Organizational Hierarchy** | "This org is part of that org" | `isOrWas*` / `hasOrHad*` | RiC-O |
| **Event Association** | "This event happened to that entity" | `wasAssociatedWith` | PROV-O |

**Example**: The deprecated `parent_custodian` was used for TWO different semantics:
- `CustodianLegalStatus.parent_custodian` → **Hierarchy** → Now: `is_or_was_suborganization_of`
- `OrganizationalChangeEvent.parent_custodian` → **Event association** → Now: `associated_custodian`

**Migration Mapping** (Key Slots):

| Deprecated | Replacement | Pattern |
|------------|-------------|---------|
| `parent_custodian` (hierarchy) | `is_or_was_suborganization_of` | RiC-O |
| `parent_custodian` (event) | `associated_custodian` | PROV-O |
| `has_suborganization` | `has_or_had_suborganization` | RiC-O |
| `parent_collection` | `is_or_was_sub_collection_of` | RiC-O |
| `sub_collections` | `has_or_had_sub_collection` | RiC-O |
| `has_collection` | `has_or_had_collection` | RiC-O |
| `encompassing_body` | `is_or_was_encompassed_by` | RiC-O |
| `has_member` | `has_or_had_member` | RiC-O |
| `is_member_of` | `is_or_was_member_of` | RiC-O |

**Decision Tree**:

```
Is relationship about organizational structure?
├─ YES (child → parent)  → isOrWas{Relationship}Of
├─ YES (parent → children) → hasOrHad{Relationship}
└─ NO (event → entity affected) → associated_custodian (PROV-O)
```

**LinkML Slot Naming**: Convert RiC-O predicates to snake_case:
- `rico:hasOrHadPart` → `has_or_had_part`
- `rico:isOrWasPartOf` → `is_or_was_part_of`
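The CURIE-to-slot-name conversion is mechanical and combines two rules: strip the namespace prefix (Rule 42) and convert camelCase to snake_case (this rule). A minimal sketch; the helper name is illustrative, and CIDOC-CRM predicates with embedded digits would need extra handling:

```python
import re

def rico_to_snake_case(predicate: str) -> str:
    """Convert a CURIE like 'rico:hasOrHadPart' to a LinkML slot name."""
    local = predicate.split(":", 1)[-1]             # strip "rico:" prefix
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", local)  # insert "_" before capitals
    return snake.lower()

print(rico_to_snake_case("rico:hasOrHadPart"))   # has_or_had_part
print(rico_to_snake_case("rico:isOrWasPartOf"))  # is_or_was_part_of
```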
**Class Promotion Principle**:

🚨 **ALL slot values should be modeled as CLASSES, not primitive strings.** Even simple-seeming values like locality names, regions, and countries represent real-world entities with:
- **Identity** - Can be referenced from multiple places
- **Temporal properties** - Names change over time (e.g., "Peking" → "Beijing")
- **Relationships** - Linked to other entities (regions contain localities)
- **Provenance** - Different sources may provide different values

**Address Component Example**:

```yaml
# WRONG - String-valued slots
Address:
  slots:
    - locality      # range: string ❌
    - region        # range: string ❌
    - country_name  # range: string ❌

# CORRECT - Class-valued slots with temporal semantics
Address:
  slots:
    - has_or_had_locality  # range: Locality ✅
    - has_or_had_region    # range: Region ✅
    - has_or_had_country   # range: Country ✅
```

**Summary Decision Rules**:

| Question | Answer | Naming Pattern |
|----------|--------|----------------|
| Can this value change over time? | YES | `has_or_had_*` |
| Does this represent a real-world entity? | YES | Make it a CLASS |
| Is this a permanent/immutable fact? | YES | `has_*` (rare) |
| Is this truly just a label/literal? | YES | Simple slot name (very rare) |

**See**: `.opencode/rules/slot-naming-convention-rico-style.md` for complete documentation

---
### Rule 40: KIEN Registry is Authoritative for Intangible Heritage Custodians

🚨 **CRITICAL**: For Intangible Heritage Custodians (Type I), the KIEN registry at `https://www.immaterieelerfgoed.nl/` is **TIER_1_AUTHORITATIVE**. Google Maps is **TIER_3_CROWD_SOURCED** and frequently returns false matches.

**Why Google Maps Fails for Type I**:
- Virtual organizations without commercial storefronts
- Name collisions with unrelated businesses (e.g., "Platform" → "Platform 9 BV")
- No physical Google Maps presence for intangible heritage networks
- Volunteer-run organizations with residential addresses

**Data Tier Hierarchy for Type I**:

| Priority | Source | Data Tier |
|----------|--------|-----------|
| 1st | KIEN Registry (`immaterieelerfgoed.nl`) | TIER_1_AUTHORITATIVE |
| 2nd | Organization's Official Website | TIER_2_VERIFIED |
| 3rd | Wikidata | TIER_3_CROWD_SOURCED |
| 4th | Google Maps | TIER_3_CROWD_SOURCED (verify!) |

**Required Workflow**:
1. **Scrape KIEN page first** - Extract address from Contact section
2. **Validate Google Maps** - Compare domain/name against KIEN data
3. **Mark false matches** - Set `status: FALSE_MATCH` with documentation

**Marking False Matches**:

```yaml
google_maps_enrichment:
  status: FALSE_MATCH
  false_match_reason: "Google Maps returned different organization"
  original_false_match:
    place_id: ChIJ...
    name: "Wrong Business Name"
    website: "http://wrong-domain.nl/"
  correction_timestamp: "2025-01-08T00:00:00Z"
```

**Location Resolution**: Use the KIEN address → Geocode with Nominatim → NOT Google Maps coordinates

**See**: `.opencode/rules/kien-authoritative-source-rule.md` for complete documentation

---
### Rule 41: LinkML "Types" Classes Define SPARQL Template Variables

🚨 **CRITICAL**: LinkML classes following the `*Type` / `*Types` naming pattern (Rule 0b) serve as the **single source of truth** for valid values in SPARQL template slot variables.

When designing SPARQL templates, **extract variables from the schema** rather than hardcoding separate templates for each institution type or geographic level.

**Why This Matters**:
- Same template works across ALL institution types (musea, archieven, bibliotheken, etc.)
- Same template works across ALL geographic levels (country, subregion, settlement)
- Adding new types to the schema automatically extends template capabilities
- Multilingual support comes for free from schema labels

**Template Variable Sources**:

| Variable | Schema Source | Examples |
|----------|---------------|----------|
| `institution_type` | `CustodianType` + 19 subclasses | M (Museum), A (Archive), L (Library) |
| `location` | Hierarchical: Country/Subregion/Settlement | NL, NL-NH, Amsterdam |
| `platform_type` | `DigitalPlatformTypes.yaml` (69+ types) | DigitalLibrary, Aggregator |

**Template Design Pattern**:

```yaml
# CORRECT: Single parameterized template
count_institutions_by_type_location:
  slots:
    institution_type:
      schema_source: "modules/classes/CustodianType.yaml"
    location:
      resolution_order: [settlement, subregion, country]

  # SlotExtractor detects level and selects appropriate SPARQL variant
  sparql_template: |
    SELECT (COUNT(?s) AS ?count) WHERE {
      ?s hc:institutionType "{{ institution_type }}" ;
         hc:settlementName "{{ location }}" .
    }
  sparql_template_region: |
    SELECT (COUNT(?s) AS ?count) WHERE {
      ?s hc:institutionType "{{ institution_type }}" ;
         hc:subregionCode "{{ location }}" .
    }
```

**SlotExtractor Responsibilities**:
1. **Detect institution type** from query: "musea" → M, "archieven" → A
2. **Detect location level**: "Amsterdam" → settlement, "Noord-Holland" → subregion
3. **Normalize values**: "Noord-Holland" → "NL-NH"
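The three responsibilities can be sketched as a tiny extractor. The lookup tables here are illustrative stand-ins for the schema-derived value sets (`CustodianType`, settlement/subregion gazetteers), not the real SlotExtractor:

```python
# Illustrative stand-ins for schema-derived value sets
TYPE_TERMS = {"musea": "M", "archieven": "A", "bibliotheken": "L"}
SUBREGIONS = {"noord-holland": "NL-NH", "utrecht": "NL-UT"}
SETTLEMENTS = {"amsterdam", "rotterdam"}

def extract_slots(query: str) -> dict:
    """Detect institution type and location level, then normalize."""
    slots = {}
    words = query.lower().split()
    for w in words:
        if w in TYPE_TERMS:
            slots["institution_type"] = TYPE_TERMS[w]
    for w in words:
        if w in SETTLEMENTS:  # settlement resolves first (resolution_order)
            slots["location"] = w.capitalize()
            slots["template"] = "sparql_template"
            break
        if w in SUBREGIONS:
            slots["location"] = SUBREGIONS[w]  # normalize to ISO code
            slots["template"] = "sparql_template_region"
            break
    return slots

print(extract_slots("hoeveel musea in amsterdam"))
# {'institution_type': 'M', 'location': 'Amsterdam', 'template': 'sparql_template'}
```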
**See**: `.opencode/rules/types-classes-as-template-variables.md` for complete documentation

---
### Rule 42: No Ontology Prefixes in Slot Names

🚨 **CRITICAL**: LinkML slot names MUST NOT include ontology namespace prefixes. Ontology references belong in mapping properties (`slot_uri`, `exact_mappings`, `close_mappings`, etc.), NOT in element names.

**Why This Matters**:
- Slot names should be human-readable, domain-focused terminology
- Ontology mappings are documented via LinkML's dedicated mapping properties
- Embedding prefixes creates coupling between naming and specific ontology versions
- Clean separation allows renaming slots without changing ontology bindings

**Prohibited Prefixes**:

| Prefix | Ontology | Example Violation |
|--------|----------|-------------------|
| `rico_` | Records in Contexts | `rico_organizational_principle` |
| `skos_` | SKOS | `skos_broader`, `skos_narrower` |
| `schema_` | Schema.org | `schema_name` |
| `dcterms_` | Dublin Core | `dcterms_created` |
| `prov_` | PROV-O | `prov_generated_by` |
| `org_` | W3C Organization | `org_has_member` |
| `crm_` | CIDOC-CRM | `crm_carried_out_by` |
| `foaf_` | FOAF | `foaf_knows` |

**Correct Pattern**:

```yaml
# CORRECT: Clean name with ontology reference in slot_uri and mappings
slots:
  record_holder:
    description: The custodian that holds or held this record set.
    slot_uri: rico:hasOrHadHolder
    exact_mappings:
      - rico:hasOrHadHolder
    close_mappings:
      - schema:holdingArchive
    range: Custodian
```

**WRONG Pattern**:

```yaml
# WRONG: Ontology prefix embedded in slot name
slots:
  rico_has_or_had_holder:  # BAD - "rico_" prefix duplicates slot_uri info
    slot_uri: rico:hasOrHadHolder
    range: string
```

**Exceptions**:
- **External identifier slots**: `wikidata_id`, `viaf_id`, `isil_code` (system names, not ontology prefixes)
- **Internal technical slots**: `internal_wd_namespace_force` (prefixed with `internal_`)
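The prohibited-prefix table plus the exceptions reduce to a small lint check. A sketch assuming slot names come in as a flat list; the function name is illustrative:

```python
PROHIBITED = ("rico_", "skos_", "schema_", "dcterms_",
              "prov_", "org_", "crm_", "foaf_")
EXCEPTIONS = {"wikidata_id", "viaf_id", "isil_code"}  # system names, allowed

def find_prefix_violations(slot_names: list) -> list:
    """Return slot names that embed an ontology namespace prefix."""
    violations = []
    for name in slot_names:
        if name in EXCEPTIONS or name.startswith("internal_"):
            continue  # documented exceptions
        if name.startswith(PROHIBITED):
            violations.append(name)
    return violations

print(find_prefix_violations(
    ["record_holder", "rico_has_or_had_holder", "wikidata_id", "skos_broader"]
))  # ['rico_has_or_had_holder', 'skos_broader']
```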
**See**: `.opencode/rules/no-ontology-prefix-in-slot-names.md` for complete documentation and migration examples

---
### Rule 43: Slot Nouns Must Be Singular

🚨 **CRITICAL**: LinkML slot names MUST use singular nouns, even for multivalued slots. The `multivalued: true` property indicates cardinality, NOT the slot name.

**Rationale**:
1. **Predicate semantics**: Slots represent predicates/relationships. In RDF, `hasCollection` can have multiple objects without changing the predicate name.
2. **Consistency**: Singular names work for both single-valued and multivalued slots.
3. **Ontology alignment**: Standard ontologies use singular predicates (`skos:broader`, `org:hasMember`, `rico:hasOrHadHolder`).
4. **Readability**: `custodian.has_or_had_custodian_type` reads naturally as "custodian has (or had) custodian type".

**Correct Pattern**:

```yaml
slots:
  has_or_had_custodian_type:  # ✅ CORRECT - singular noun
    slot_uri: org:classification
    range: CustodianType
    multivalued: true  # Cardinality expressed here, not in name

  has_or_had_collection:  # ✅ CORRECT - singular noun
    range: CustodianCollection
    multivalued: true

  has_or_had_member:  # ✅ CORRECT - singular noun
    range: Custodian
    multivalued: true
```

**WRONG Pattern**:

```yaml
slots:
  has_or_had_custodian_types:  # ❌ WRONG - plural noun
    multivalued: true

  collections:  # ❌ WRONG - plural noun
    multivalued: true
```

**Migration Examples**:

| Old (Plural) | New (Singular) |
|--------------|----------------|
| `custodian_types` | `has_or_had_custodian_type` |
| `collections` | `has_or_had_collection` |
| `identifiers` | `identifier` |
| `alternative_names` | `alternative_name` |
| `staff_members` | `staff_member` |

**Exceptions** (compound concepts where the plural is part of the proper noun):
- `archives_regionales` - French administrative term
- `united_states` - Geographic proper noun

**See**: `.opencode/rules/slot-noun-singular-convention.md` for complete documentation

---
### Rule 44: PPID Birth Date Enrichment and EDTF Unknown Date Notation

🚨 **CRITICAL**: When birth/death dates are missing from person entity sources, agents MUST first attempt enrichment via web search. Only after comprehensive search fails should EDTF unknown notation be used.

**Enrichment Workflow**:
1. Search Exa: `"{full_name}" born birthday birth date`
2. Search Linkup: `"{name}" biography`
3. If found → Record as `web_claim` with provenance
4. If NOT found → Use EDTF notation with `enrichment_metadata` recording the failed search

**EDTF Notation (Library of Congress Standard)**:

| Character | Meaning | Example |
|-----------|---------|---------|
| `X` | Unspecified digit | `197X` = some year 1970-1979 |
| `~` | Approximate (circa) | `1985~` = circa 1985 |
| `?` | Uncertain | `1985?` = possibly 1985 |
| `S` | Significant digits | `1975S3` = estimated 1975, accurate to decade |
| `[..]` | One of set | `[197X,198X]` = 1970s or 1980s |

**Common Patterns**:

| Scenario | EDTF Format |
|----------|-------------|
| Decade known (1970s) | `197X` |
| Century known (1900s) | `19XX` |
| Completely unknown | `XXXX` |
| Multiple possible decades | `[197X,198X]` |
| Estimated from career | `1975S3` |

**Filename Safety**: PPID filenames must avoid `?`, `%`, `[]`, `/`, `|`. Use the simplified form in the filename and the full EDTF in metadata.

**Anti-Patterns**:
- ❌ `"1970s"` - Use `197X` instead
- ❌ `"circa 1985"` - Use `1985~` instead
- ❌ `"unknown"` - Use `XXXX` instead
- ❌ Custom notation like `197~8~` - Not EDTF compliant

**Validation**:
- Cannot use `XXXX` without `enrichment_metadata.birth_date_search.attempted: true`
- All dates must parse as valid EDTF
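The anti-pattern fixes and the filename-safety rule can be sketched as two small helpers. The names and regex heuristics are illustrative; a production validator should delegate to a full EDTF parser rather than pattern-match:

```python
import re

def normalize_to_edtf(raw: str) -> str:
    """Map common non-EDTF date strings to EDTF (Rule 44 anti-patterns)."""
    raw = raw.strip().lower()
    if raw in ("unknown", ""):
        return "XXXX"
    m = re.fullmatch(r"(\d{3})0s", raw)  # "1970s" -> "197X"
    if m:
        return m.group(1) + "X"
    m = re.fullmatch(r"(?:circa|ca\.?|c\.)\s*(\d{4})", raw)  # "circa 1985" -> "1985~"
    if m:
        return m.group(1) + "~"
    return raw  # assume already EDTF; validate separately

def filename_safe(edtf: str) -> str:
    """Simplify EDTF for filenames: strip ?, ~, and set brackets."""
    if edtf.startswith("["):  # "[197X,198X]" -> first option
        edtf = edtf[1:-1].split(",")[0]
    return edtf.replace("?", "").replace("~", "")

print(normalize_to_edtf("1970s"), normalize_to_edtf("circa 1985"),
      filename_safe("[197X,198X]"))  # 197X 1985~ 197X
```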
**See**: `.opencode/rules/ppid-birth-date-enrichment-rule.md` for complete documentation

---
### Rule 45: Inferred Data Must Be Explicit with Provenance

🚨 **CRITICAL**: All inferred data MUST be stored in explicit `inferred_*` fields with full provenance statements. Inferred values MUST NEVER silently replace or merge with verified data.

**Required Inferred Fields** (for person profiles):

| Inferred Field | Source Observations | Heuristic |
|----------------|---------------------|-----------|
| `inferred_birth_decade` | Earliest education/job dates | Entry age assumptions |
| `inferred_birth_settlement` | School/university location | Residential proximity |
| `inferred_current_settlement` | Profile location, current job | Direct extraction |

**Required Structure**:

```json
{
  "inferred_birth_decade": {
    "value": "196X",
    "edtf": "196X",
    "confidence": "low",
    "inference_provenance": {
      "method": "earliest_education_heuristic",
      "inference_chain": [
        {"step": 1, "observation": "University start 1986", "source_field": "profile_data.education[0]"},
        {"step": 2, "assumption": "University entry at age 18", "rationale": "Dutch standard"},
        {"step": 3, "calculation": "1986 - 18 = 1968", "result": "Birth year ~1968"},
        {"step": 4, "generalization": "Round to decade", "result": "196X"}
      ],
      "inferred_at": "2025-01-09T18:00:00Z",
      "inferred_by": "enrich_ppids.py"
    }
  }
}
```

**Anti-Patterns**:
- ❌ Silent replacement: Putting an inferred value directly in `birth_date.edtf` without marking it as inferred
- ❌ Hidden metadata: Separating the inference flag from the value
- ❌ Missing chain: Not documenting HOW the value was derived

**PPID Component Tracking**:

```json
{
  "ppid_components": {
    "first_date": "196X",
    "first_date_source": "inferred_birth_decade",
    "first_location": "NL-UT-UTR",
    "first_location_source": "inferred_birth_settlement"
  }
}
```

**List-Valued Inferred Data** (EDTF Set Notation):

When inference yields multiple plausible values (e.g., decade boundary cases), store them as a list:

```json
{
  "inferred_birth_decade": {
    "values": ["196X", "197X"],
    "edtf": "[196X,197X]",
    "primary_value": "196X",
    "primary_rationale": "1965 is in 196X, but the range extends into 197X",
    "confidence": "very_low"
  }
}
```

For PPID generation, use `primary_value`:
- `first_date_source: "inferred_birth_decade.primary_value"`
- `first_date_alternatives: ["197X"]`
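The earliest-education heuristic and its explicit inference chain can be sketched as a record builder. A minimal sketch of the JSON structure above, assuming the same age-18 Dutch-standard entry assumption; the function name is illustrative:

```python
from datetime import datetime, timezone

def infer_birth_decade(university_start_year: int) -> dict:
    """Build an explicit inferred_birth_decade record with its inference chain.

    Assumes university entry at age 18 (the Dutch standard from Rule 45).
    """
    birth_year = university_start_year - 18
    decade = str(birth_year)[:3] + "X"  # round to decade, EDTF style
    return {
        "value": decade,
        "edtf": decade,
        "confidence": "low",
        "inference_provenance": {
            "method": "earliest_education_heuristic",
            "inference_chain": [
                {"step": 1, "observation": f"University start {university_start_year}",
                 "source_field": "profile_data.education[0]"},
                {"step": 2, "assumption": "University entry at age 18",
                 "rationale": "Dutch standard"},
                {"step": 3, "calculation": f"{university_start_year} - 18 = {birth_year}",
                 "result": f"Birth year ~{birth_year}"},
                {"step": 4, "generalization": "Round to decade", "result": decade},
            ],
            "inferred_at": datetime.now(timezone.utc).isoformat(),
            "inferred_by": "enrich_ppids.py",
        },
    }

record = infer_birth_decade(1986)
print(record["value"])  # 196X
```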
**See**: `.opencode/rules/inferred-data-explicit-provenance-rule.md` for complete documentation

---
### Rule 46: Entity Resolution - Names Are NEVER Sufficient

🚨 **CRITICAL**: Similar or identical names are NEVER sufficient for entity resolution. When enriching person profiles via web search, verify MULTIPLE identity attributes before attributing ANY claim.

**⚠️ ALL ENRICHMENTS MUST BE DOUBLE-CHECKED ⚠️**

Wrong data is worse than no data. A single false claim (a birth year from the wrong person, a spouse from a different namesake, social media from a random account) corrupts the entire profile and undermines dataset trustworthiness.

**The Core Problem**:
- Web searches for "Carmen Juliá" return data about Carmen Julia Álvarez (Venezuelan actress), Carmen Julia Navarro (Mexican hydrogeologist), and Carmen Julia Gutiérrez (Spanish medievalist)
- None of these is the actual Carmen Juliá who is a UK art curator
- Name matching alone caused 122+ false birth year attributions

**Required Identity Attributes** (need 3 of 5 to match):

| # | Attribute | Check | Conflict Example |
|---|-----------|-------|------------------|
| 1 | **Career/Profession** | Same field | Source: "actress", Profile: "curator" → REJECT |
| 2 | **Employer** | Same institution | Source: "film studio", Profile: "museum" → REJECT |
| 3 | **Location** | Same city/country | Source: "Venezuela", Profile: "UK" → REJECT |
| 4 | **Age Range** | Plausible for career | Source: "born 1922", Profile: active 2025 → REJECT |
| 5 | **Education** | Same field | Source: "medical school", Profile: "art history" → REJECT |

**High-Risk Sources** (require stricter verification, NOT forbidden):
- Genealogy sites (geni.com, ancestry.*, myheritage.*) → **Require 5/5 matches**
- IMDB (actors with the same name) → **Require 5/5 matches** (unless the person works in film/TV)
- Wikipedia articles → **Require 4/5 matches**
- Social media → **Require 4/5 matches** + verify employer/location in bio

**Red Flags Requiring Investigation** (not automatic rejection):
- Profession difference (actress vs curator) → Investigate: did they change careers?
- Location difference (Venezuela vs UK) → Investigate: did they relocate?
- Time gap in career → Investigate: career break or different person?

**When to REJECT**: If investigation shows no plausible connection (e.g., a Venezuelan actress active 1970s-2000s cannot be a UK curator active in the 2020s - overlapping timelines, different continents)

**Name Commonality Thresholds**:

| Name Type | Required Matches |
|-----------|------------------|
| Unique (e.g., "Xander Vermeulen-Oosterhuis") | 2 of 5 |
| Moderate (e.g., "Carmen Juliá") | 3 of 5 |
| Common (e.g., "Jan de Vries") | 4 of 5 |
| Very common (e.g., "John Smith") | 5 of 5 or reject |
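The attribute-counting part of this rule can be sketched as a threshold check. This covers only the mechanical count; the red-flag investigation steps above are judgment calls a counter cannot replace. Names and dict layout are illustrative:

```python
REQUIRED_MATCHES = {"unique": 2, "moderate": 3, "common": 4, "very_common": 5}
ATTRIBUTES = ("profession", "employer", "location", "age_range", "education")

def accept_claim(profile: dict, source: dict, name_commonality: str) -> bool:
    """Count matching identity attributes and apply the commonality threshold.

    Both dicts map attribute name -> normalized value; a missing attribute
    never counts as a match.
    """
    matches = sum(
        1 for attr in ATTRIBUTES
        if attr in profile and attr in source and profile[attr] == source[attr]
    )
    return matches >= REQUIRED_MATCHES[name_commonality]

profile = {"profession": "curator", "location": "UK", "employer": "museum"}
actress = {"profession": "actress", "location": "Venezuela"}
curator_source = {"profession": "curator", "location": "UK", "employer": "museum"}

print(accept_claim(profile, actress, "moderate"))         # False
print(accept_claim(profile, curator_source, "moderate"))  # True
```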
**See**: `.opencode/rules/entity-resolution-no-heuristics.md` for complete documentation

---
### Rule 47: Disambiguation Entity Profiles - Prevent Repeated Errors

🚨 **CRITICAL**: When entity resolution determines that a web source describes a **different person** with a similar name, **create a PPID profile for that person**. The PPID system is universal - ANY person who ever lived can have a profile.

**The Universal PPID Principle**:
- **ALL persons on Earth can be assigned PPIDs** - heritage workers and non-heritage alike
- **Historical persons** (deceased from any era) can have profiles
- The `heritage_relevance` field indicates heritage sector relevance, NOT profile eligibility
- **Anyone can have a PPID** - the actress, the doctor, the footballer

**Why This Matters**:
- Prevents future enrichment from making the same mistake
- Documents investigation work
- Builds a comprehensive person database
- Enables bidirectional linking between profiles

**When to Create Namesake Profiles**:
- Entity resolution rejects a claim due to identity mismatch
- The namesake is notable enough to appear in search results repeatedly
- The confusion risk is high (similar name, some overlapping attributes)

**Example**: Carmen Julia Álvarez (Venezuelan actress)
- Discovered during enrichment of Carmen Juliá (UK curator)
- Different profession (actress vs curator), different location (Venezuela vs UK)
- **Create regular PPID**: `ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ.json`
- Set `heritage_relevance.is_heritage_relevant: false`
- Link both profiles via `disambiguation_notes`

**Namesake Profile Structure**:

```json
{
  "ppid": "ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ",
  "profile_data": {
    "full_name": "Carmen Julia Álvarez",
    "profession": "actress",
    "birth_year": 1952
  },
  "heritage_relevance": {
    "is_heritage_relevant": false,
    "relevance_score": 0.0
  },
  "disambiguation_notes": {
    "commonly_confused_with": [
      {"ppid": "ID_UK-XX-XXX_...", "name": "Carmen Juliá", "profession": "curator"}
    ]
  }
}
```

**See**: `.opencode/rules/disambiguation-entity-profiles.md` for complete documentation

---
### Rule 48: Class Files Must Not Define Inline Slots

🚨 **CRITICAL**: LinkML class files in `schemas/20251121/linkml/modules/classes/` MUST NOT define their own slots inline. All slots MUST be imported from the centralized `modules/slots/` directory.

**Architecture Requirement**:

```
schemas/20251121/linkml/
├── modules/
│   ├── classes/    # Class definitions ONLY
│   │   └── *.yaml  # NO inline slot definitions
│   ├── slots/      # ALL slot definitions go here
│   │   └── *.yaml  # One file per slot or logical group
│   └── enums/      # Enumeration definitions
```

**Correct Pattern**:

```yaml
# In modules/classes/Address.yaml
imports:
  - linkml:types
  - ../slots/street_address  # Import from centralized location
  - ../slots/postal_code
  - ../slots/locality

classes:
  Address:
    slots:
      - street_address  # Reference by name only
      - postal_code
      - locality
```

**Anti-Pattern (WRONG)**:

```yaml
# WRONG - slots defined inline in class file
classes:
  Address:
    slots:
      - street_address

slots:  # ❌ DO NOT define slots here
  street_address:
    description: Street address
    range: string
```

**Why This Matters**:
- **Frontend UML visualization** depends on centralized slots for edge rendering
- **Reusability**: Slots can be used across multiple classes
- **Semantic Consistency**: Single source of truth for slot semantics
- **Maintainability**: Changes propagate automatically to all classes
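The distinction between referencing slots (allowed) and defining them (forbidden) is easy to lint on parsed YAML. A sketch assuming the class file has already been loaded into a dict with `yaml.safe_load`; the function name is illustrative:

```python
def find_inline_slot_definitions(class_file: dict) -> list:
    """Return names of slots defined inline in a parsed class-file dict.

    A class file may *reference* slots by name under classes/<name>/slots,
    but a top-level `slots:` mapping with definitions violates Rule 48.
    """
    return sorted(class_file.get("slots", {}).keys())

bad_file = {
    "classes": {"Address": {"slots": ["street_address"]}},
    "slots": {"street_address": {"description": "Street address", "range": "string"}},
}
good_file = {
    "imports": ["../slots/street_address"],
    "classes": {"Address": {"slots": ["street_address"]}},
}

print(find_inline_slot_definitions(bad_file))   # ['street_address']
print(find_inline_slot_definitions(good_file))  # []
```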
**See**: `.opencode/rules/class-files-no-inline-slots.md` for complete documentation

---
### Rule 49: Slot Usage Minimization - No Redundant Overrides

🚨 **CRITICAL**: LinkML `slot_usage` entries MUST provide meaningful modifications to the generic slot definition. Redundant entries that merely re-declare identical values MUST be removed.

**What is slot_usage?**

[`slot_usage`](https://linkml.io/linkml-model/latest/docs/slot_usage/) allows a class to customize inherited slot behavior (narrowing range, adding constraints, class-specific descriptions).

**The Problem**:

Analysis found **874 redundant `slot_usage` entries** across **374 class files** that simply re-declare the same `range` and `inlined` values already in the generic slot:

```yaml
# REDUNDANT - Same as generic slot definition
slot_usage:
  template_specificity:
    range: TemplateSpecificityScores  # Already in generic!
    inlined: true                     # Already in generic!
```

**Decision Matrix**:

| Scenario | Action |
|----------|--------|
| All properties match generic exactly | **REMOVE** |
| Only `range` and/or `inlined` match generic | **REMOVE** |
| Only `description` differs by adding articles (e.g., "the record sets" vs "record sets") | **TOLERATE** (semantic definiteness) |
| `description` provides substantive new information | **KEEP** |
| Any other property modified (`required`, `pattern`, `examples`, etc.) | **KEEP** |

**Example - Tolerated Description-Only Modification**:

```yaml
# Generic slot
slots:
  has_or_had_record_set:
    description: Record sets associated with a custodian.

# Class-specific - TOLERABLE (adds definiteness)
slot_usage:
  has_or_had_record_set:
    description: The record sets held by this archive.  # "The" = definite reference
```

**Rationale**: Adding articles like "the" changes an indefinite reference to a definite one, which is semantically significant (it refers to a specific entity rather than a general concept).
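The "all properties match generic exactly" case from the decision matrix reduces to a dict comparison. A sketch that covers only that first case; distinguishing article-only description changes from substantive ones needs more logic. The function name is illustrative:

```python
def classify_slot_usage(override: dict, generic: dict) -> str:
    """Classify a slot_usage entry against the generic slot (Rule 49).

    Returns "REMOVE" when every overridden property just repeats the
    generic definition, otherwise "KEEP" (including description-only
    changes, which may add definiteness or substantive information).
    """
    differing = {k for k, v in override.items() if generic.get(k) != v}
    if not differing:
        return "REMOVE"  # all properties match generic exactly
    return "KEEP"

generic = {"range": "TemplateSpecificityScores", "inlined": True,
           "description": "Record sets associated with a custodian."}

redundant = {"range": "TemplateSpecificityScores", "inlined": True}
meaningful = {"range": "TemplateSpecificityScores", "required": True}

print(classify_slot_usage(redundant, generic))   # REMOVE
print(classify_slot_usage(meaningful, generic))  # KEEP
```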
**See**: `.opencode/rules/slot-usage-minimization-rule.md` for complete documentation

---
### Rule 50: Ontology-to-LinkML Mapping Convention

🚨 **CRITICAL**: When mapping base ontology classes and predicates to LinkML schema elements, use LinkML's dedicated mapping properties as documented at https://linkml.io/linkml-model/latest/docs/mappings/

**What "LinkML mapping" means**:
- Connecting LinkML schema elements (classes, slots, enums) to external ontology URIs
- Using LinkML's built-in mapping properties (`class_uri`, `slot_uri`, `*_mappings`)
- Following SKOS-based vocabulary alignment standards

**Primary Identity Properties**:

| Property | Applies To | Purpose |
|----------|-----------|---------|
| `class_uri` | Classes | Primary RDF class URI |
| `slot_uri` | Slots | Primary RDF predicate URI |
| `enum_uri` | Enums | Enum namespace URI |

**SKOS-Based Mapping Properties**:

| Property | SKOS Predicate | Use When |
|----------|---------------|----------|
| `exact_mappings` | `skos:exactMatch` | Different ontology, same semantics |
| `close_mappings` | `skos:closeMatch` | Similar but not identical |
| `related_mappings` | `skos:relatedMatch` | Related concept, neither equivalent nor hierarchical |
| `narrow_mappings` | `skos:narrowMatch` | External term is narrower |
| `broad_mappings` | `skos:broadMatch` | External term is broader |

**Example - Aggregation vs. Linking Distinction**:

```yaml
# Aggregation (data duplication)
classes:
  DataAggregator:
    class_uri: ore:Aggregation  # Primary identity
    exact_mappings:
      - edm:EuropeanaAggregation
    annotations:
      data_storage_pattern: AGGREGATION

# Linking (single source of truth)
classes:
  FederatedDiscoveryPortal:
    class_uri: dcat:DataService  # Links, doesn't store
    close_mappings:
      - schema:SearchAction
    annotations:
      data_storage_pattern: LINKING
```

**Validation Checklist**:
- [ ] `class_uri` / `slot_uri` points to a real URI in `data/ontology/` files
- [ ] Description includes the ontology definition
- [ ] `exact_mappings` used ONLY for truly equivalent terms
- [ ] All prefixes declared in the schema's `prefixes:` block

**See**: `.opencode/rules/ontology-to-linkml-mapping-convention.md` for complete documentation

---
### Rule 51: No Hallucinated Ontology References
|
||
|
||
**CRITICAL**: All ontology references (`class_uri`, `slot_uri`, `*_mappings`) MUST be verifiable against actual files in `/data/ontology/`.
|
||
|
||
**The Problem**: AI agents may suggest predicates like `dqv:value` or `premis:hasFrameRate` without verifying they exist in local ontology files. This causes RDF serialization failures.
|
||
|
||
**Available Ontologies** (verified 2025-01-13):
|
||
|
||
| Prefix | File | Verified |
|
||
|--------|------|----------|
|
||
| `prov:` | `prov-o.ttl` | ✅ |
|
||
| `schema:` | `schemaorg.owl` | ✅ |
|
||
| `org:` | `org.rdf` | ✅ |
|
||
| `skos:` | `skos.rdf` | ✅ |
|
||
| `dcterms:` | `dublin_core_elements.rdf` | ✅ |
|
||
| `foaf:` | `foaf.ttl` | ✅ |
|
||
| `rico:` | `RiC-O_1-1.rdf` | ✅ |
|
||
| `dqv:` | `dqv.ttl` | ✅ |
|
||
| `adms:` | `adms.ttl` | ✅ |
|
||
| `dcat:` | `dcat3.ttl` | ✅ |
|
||
|
||
**Verification Workflow**:
|
||
```bash
|
||
# 1. Check ontology file exists
|
||
ls data/ontology/ | grep -i "<prefix>"
|
||
|
||
# 2. Search for predicate
|
||
grep -l "<predicate>" data/ontology/*
|
||
```
|
||
|
||
**When No Standard Exists**: Use `hc:` prefix with documentation:
|
||
```yaml
|
||
slots:
|
||
heritage_relevance_score:
|
||
slot_uri: hc:heritageRelevanceScore # Always valid
|
||
annotations:
|
||
ontology_note: "No standard ontology equivalent exists"
|
||
```
|
||
|
||
**See**: `.opencode/rules/no-hallucinated-ontology-references.md` for complete documentation
|
||
|
||
---
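The verification workflow above can be batched in a few lines. The following is a minimal sketch, not production code: the prefix-to-file table mirrors the table in this rule, and `ontology_dir_contents` is a stand-in for reading files from `data/ontology/`.

```python
import re

# Mirror of the prefix -> ontology file table above (subset for illustration)
PREFIX_FILES = {
    "prov": "prov-o.ttl",
    "schema": "schemaorg.owl",
    "rico": "RiC-O_1-1.rdf",
}

def check_uri(curie: str, ontology_dir_contents: dict) -> bool:
    """Return True if the CURIE's local name appears in the mapped ontology file.

    ontology_dir_contents maps filename -> file text; in practice you would
    read each file from data/ontology/ instead of passing text in.
    """
    prefix, _, local = curie.partition(":")
    filename = PREFIX_FILES.get(prefix)
    if filename is None:
        # hc: is the project's own namespace and is always considered valid
        return prefix == "hc"
    text = ontology_dir_contents.get(filename, "")
    return re.search(re.escape(local), text) is not None

# Demo with an inline stand-in for the ontology file contents
fake_files = {"prov-o.ttl": "prov:wasDerivedFrom a owl:ObjectProperty ."}
print(check_uri("prov:wasDerivedFrom", fake_files))        # True
print(check_uri("prov:hasFrameRate", fake_files))          # False (hallucinated)
print(check_uri("hc:heritageRelevanceScore", fake_files))  # True (project namespace)
```

A plain substring search is deliberately loose; it catches hallucinated predicates without parsing RDF.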
### Rule 52: No Duplicate Ontology Mappings

🚨 **CRITICAL**: Each ontology URI MUST appear in only ONE mapping category per schema element. A URI cannot have multiple semantic relationships to the same class or slot.

**The Problem**: LinkML mapping properties (`exact_mappings`, `close_mappings`, `related_mappings`, `narrow_mappings`, `broad_mappings`) are mutually exclusive based on SKOS semantics. The same URI appearing in multiple categories creates logical contradictions.

**Anti-Pattern (WRONG)**:

```yaml
slots:
  source_url:
    exact_mappings:
      - schema:url  # Says "source_url IS schema:url"
    broad_mappings:
      - schema:url  # Says "schema:url is MORE GENERAL than source_url"
    # CONTRADICTION: source_url cannot both BE schema:url AND be more specific than it
```

**Correct Pattern**:

```yaml
slots:
  source_url:
    exact_mappings:
      - schema:url  # Keep only the most precise mapping
```

**Decision Guide** - When duplicates found, keep the MOST PRECISE:

| Precedence | Mapping Type | Meaning |
|------------|--------------|---------|
| 1st (keep) | `exact_mappings` | Semantic equivalence |
| 2nd | `close_mappings` | Nearly equivalent |
| 3rd | `narrow_mappings` | This is more specific |
| 4th | `broad_mappings` | This is more general |
| 5th | `related_mappings` | Conceptual association |

**Quick Reference**:

| If URI in... | Action |
|--------------|--------|
| exact + broad | Keep exact, remove broad |
| close + broad | Keep close, remove broad |
| related + broad | Keep related, remove broad |
| narrow + broad | ERROR - investigate (contradictory) |

**See**: `.opencode/rules/no-duplicate-ontology-mappings.md` for complete documentation

---
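Duplicates like the anti-pattern above are easy to detect mechanically. A minimal sketch (in practice the element dict would come from `yaml.safe_load` on a schema module):

```python
from collections import defaultdict

# Mapping categories in precedence order, as in the decision guide above
MAPPING_KEYS = [
    "exact_mappings", "close_mappings", "narrow_mappings",
    "broad_mappings", "related_mappings",
]

def find_duplicate_mappings(element: dict) -> dict:
    """Map each URI to the mapping categories it appears in,
    keeping only URIs that appear in more than one category."""
    seen = defaultdict(list)
    for key in MAPPING_KEYS:
        for uri in element.get(key, []):
            seen[uri].append(key)
    return {uri: cats for uri, cats in seen.items() if len(cats) > 1}

# The anti-pattern from the rule above
source_url = {
    "exact_mappings": ["schema:url"],
    "broad_mappings": ["schema:url"],
}
print(find_duplicate_mappings(source_url))
# {'schema:url': ['exact_mappings', 'broad_mappings']}
```

Because `MAPPING_KEYS` is in precedence order, the first category listed for a duplicate URI is the one to keep.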
### Rule 53: Full Slot Migration - slot_fixes.yaml is AUTHORITATIVE

🚨 **CRITICAL**: The file `schemas/20251121/linkml/modules/slots/slot_fixes.yaml` is the **AUTHORITATIVE, CURATED SOURCE** for all slot migrations. Follow it **TO THE LETTER**.

**⚠️ slot_fixes.yaml Revisions Are Mandatory ⚠️**

The `revision` section in `slot_fixes.yaml` specifies the EXACT slots and classes to use. These revisions were manually curated based on:
- Ontology analysis (CIDOC-CRM, RiC-O, PROV-O, Schema.org alignment)
- Semantic correctness
- Pattern consistency (Rule 39: RiC-O style naming)
- Type/Types class hierarchy design (Rule 0b)

**YOU MUST**:
- ✅ Follow the `revision` section **EXACTLY**
- ✅ Use the EXACT slots and classes specified
- ✅ Apply ALL components of multi-part revisions
- ✅ Perform **FULL MIGRATION** - completely remove deprecated slot
- ✅ Update `processed.status: true` after migration

**YOU MUST NOT**:
- ❌ Substitute different slots than specified
- ❌ Use your own judgment to pick "similar" slots
- ❌ Partially apply revisions
- ❌ Add deprecation notes (keeping both old and new)

**Understanding `link_branch` in Revisions**:

The `link_branch` field indicates **nested class attributes**:

| Revision Item | Meaning |
|---------------|---------|
| Items **WITHOUT** `link_branch` | PRIMARY slot and class to create |
| Items **WITH** `link_branch: 1` | First attribute the primary class needs |
| Items **WITH** `link_branch: 2` | Second attribute the primary class needs |

**Example - `visitor_count` with `link_branch`**:

```yaml
- original_slot_id: https://nde.nl/ontology/hc/slot/visitor_count
  revision:
    - label: has_or_had_quantity          # PRIMARY SLOT
      type: slot
    - label: Quantity                     # PRIMARY CLASS
      type: class
    - label: has_or_had_measurement_unit  # Quantity.has_or_had_measurement_unit
      type: slot
      link_branch: 1                      # ← Branch 1
    - label: MeasureUnit                  # Range of branch 1 slot
      type: class
      link_branch: 1
    - label: temporal_extent              # Quantity.temporal_extent
      type: slot
      link_branch: 2                      # ← Branch 2
    - label: TimeSpan                     # Range of branch 2 slot
      type: class
      link_branch: 2
```

**Interpretation**: Create `Quantity` class with TWO slots:
- `has_or_had_measurement_unit` → `MeasureUnit` (branch 1)
- `temporal_extent` → `TimeSpan` (branch 2)

**Example - Following slot_fixes.yaml**:

```yaml
# slot_fixes.yaml specifies:
- original_slot_id: https://nde.nl/ontology/hc/slot/actual_start
  revision:
    - label: begin_of_the_begin  # ← USE THIS SLOT
      type: slot
    - label: TimeSpan            # ← WITH THIS CLASS
      type: class
```

| Action | Status |
|--------|--------|
| Use `temporal_extent` with `TimeSpan.begin_of_the_begin` | ✅ CORRECT |
| Invent `has_actual_start_date` slot | ❌ WRONG - not in revision |
| Use `begin_of_the_begin` WITHOUT `TimeSpan` | ❌ WRONG - incomplete |

**Migration Steps**:
1. **READ** the `revision` section completely
2. **IDENTIFY** all slots and classes specified
3. **REMOVE** old slot from imports, slots list, and slot_usage
4. **ADD** new slot(s) and class import(s) per revision
5. **UPDATE** all examples to use new slots
6. **VALIDATE** with `linkml-lint` or `gen-owl`
7. **UPDATE** slot_fixes.yaml with `status: true` and notes

**Anti-Pattern (WRONG)**:

```yaml
# WRONG - Keeping deprecated slot OR using wrong replacement
classes:
  TemporaryLocation:
    slots:
      - actual_start           # ❌ OLD slot still present
      - has_actual_start_date  # ❌ Invented, not in revision
```

**Correct Pattern**:

```yaml
# CORRECT - Following slot_fixes.yaml revision exactly
classes:
  TemporaryLocation:
    slots:
      - temporal_extent  # ✅ Uses TimeSpan with begin_of_the_begin
      # OLD slots completely removed
```

**See**: `.opencode/rules/full-slot-migration-rule.md` for complete documentation

---
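The `link_branch` interpretation rule lends itself to a small helper. This is an illustrative sketch only (the `group_revision` name is not part of the project): it groups revision items by branch, with key `0` holding the primary slot/class pair.

```python
from collections import defaultdict

def group_revision(revision: list) -> dict:
    """Group revision items by link_branch; key 0 holds the primary
    slot/class pair (items without a link_branch field)."""
    branches = defaultdict(list)
    for item in revision:
        branches[item.get("link_branch", 0)].append(item["label"])
    return dict(branches)

# The visitor_count revision from the example above, as parsed YAML
revision = [
    {"label": "has_or_had_quantity", "type": "slot"},
    {"label": "Quantity", "type": "class"},
    {"label": "has_or_had_measurement_unit", "type": "slot", "link_branch": 1},
    {"label": "MeasureUnit", "type": "class", "link_branch": 1},
    {"label": "temporal_extent", "type": "slot", "link_branch": 2},
    {"label": "TimeSpan", "type": "class", "link_branch": 2},
]
print(group_revision(revision))
# {0: ['has_or_had_quantity', 'Quantity'],
#  1: ['has_or_had_measurement_unit', 'MeasureUnit'],
#  2: ['temporal_extent', 'TimeSpan']}
```

Each non-zero branch reads as "the primary class needs this slot, ranging over this class".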
### Rule 54: RAG API Podman Containerization

🚨 **CRITICAL**: The GLAM RAG API MUST be deployed via Podman container, NOT via venv/rsync. This solves Python import consistency issues between local development and production gunicorn.

**Why Podman**:
- **Import consistency**: gunicorn requires absolute imports (`from provenance import`) not relative (`from .provenance import`)
- **Isolation**: RAG API dependencies don't conflict with other server services
- **Reproducibility**: Dockerfile defines exact Python environment
- **Security**: Container runs as non-root user

**Deployment**:
```bash
# Deploy RAG API via Podman
./infrastructure/deploy.sh --rag
```

**Key Files**:

| File | Purpose |
|------|---------|
| `backend/rag/Dockerfile` | Container image definition |
| `backend/rag/requirements.txt` | Python dependencies (includes gunicorn) |
| `infrastructure/deploy.sh` | Deployment script (`--rag` flag) |

**Server Configuration**:
- Container image: `glam-rag-api:latest`
- Systemd service: `glam-rag-api.service`
- Network: Host mode (accesses localhost backends)
- Health endpoint: `https://bronhouder.nl/api/rag/health`

**Import Style** (for container deployment):
```python
# CORRECT - Works in container with gunicorn
from provenance import ProvenanceTracker

# WRONG - Fails with gunicorn
from .provenance import ProvenanceTracker
```

**See**: `.opencode/rules/podman-containerization-rule.md` for complete documentation

---
### Rule 55: Broaden Generic Predicate Ranges Instead of Creating Bespoke Predicates

🚨 **CRITICAL**: When fixing gen-owl "Ambiguous type" warnings, **broaden the range of generic predicates** rather than creating specialized bespoke predicates.

**The Problem**: gen-owl warnings occur when a slot is used as both:
- **DatatypeProperty** (base range: `string`, `integer`, `uri`)
- **ObjectProperty** (slot_usage override range: a class like `SubtitleFormatEnum`)

**❌ WRONG Approach - Create Bespoke Predicates**:

```yaml
# DON'T DO THIS - creates proliferation of rare-use predicates
slots:
  has_or_had_subtitle_format:    # Only used by VideoSubtitle
    range: SubtitleFormatEnum
  has_or_had_transcript_format:  # Only used by VideoTranscript
    range: TranscriptFormat
```

**Why This Is Wrong**:
- Creates **predicate proliferation** (schema bloat)
- Bespoke predicates are **rarely reused**
- **Fragments the ontology** unnecessarily

**✅ CORRECT Approach - Broaden Generic Predicate Ranges**:

```yaml
# DO THIS - broaden the base slot range
slots:
  has_or_had_format:
    range: uriorcurie  # Broadened from string

# Classes narrow via slot_usage (this is fine)
classes:
  VideoSubtitle:
    slot_usage:
      has_or_had_format:
        range: SubtitleFormatEnum  # Valid narrowing
```

**Range Broadening Options**:

| Original Range | Broadened Range | When to Use |
|----------------|-----------------|-------------|
| `string` | `uriorcurie` | When class overrides use URI-identified types/enums |
| `string` | `Any` | When truly polymorphic (strings AND class instances) |
| Specific class | Common base class | When multiple subclasses are used |

**Implementation Workflow**:
1. Identify warning: `gen-owl ... 2>&1 | grep "Ambiguous type for:"`
2. Broaden base slot range: `range: string` → `range: uriorcurie`
3. Keep class-level slot_usage overrides (they narrow the range)
4. Verify fix: Run gen-owl and confirm warning is gone

**See**: `.opencode/rules/broaden-generic-predicate-ranges-rule.md` for complete documentation

---
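The workflow above can be pre-checked before running gen-owl. The following is a rough heuristic sketch, not gen-owl's actual logic: it flags slots whose base range is a literal datatype but that some class overrides to a class/enum range via `slot_usage`. The schema dict would normally come from `yaml.safe_load`; here it is inlined.

```python
# Literal datatype ranges (heuristic list, an assumption for this sketch)
LITERAL_RANGES = {"string", "integer", "boolean", "date", "datetime", "float", "double"}

def ambiguous_slots(schema: dict) -> set:
    """Rough pre-check for gen-owl 'Ambiguous type' warnings: slots with a
    literal base range that some class overrides to a non-literal range.
    Broadening the base range (e.g. to uriorcurie) clears the flag."""
    base = {name: spec.get("range", "string")
            for name, spec in schema.get("slots", {}).items()}
    flagged = set()
    for cls in schema.get("classes", {}).values():
        for slot, usage in cls.get("slot_usage", {}).items():
            override = usage.get("range")
            if (override and override not in LITERAL_RANGES
                    and base.get(slot) in LITERAL_RANGES):
                flagged.add(slot)
    return flagged

schema = {
    "slots": {"has_or_had_format": {"range": "string"}},
    "classes": {"VideoSubtitle": {
        "slot_usage": {"has_or_had_format": {"range": "SubtitleFormatEnum"}}}},
}
print(ambiguous_slots(schema))  # {'has_or_had_format'}
schema["slots"]["has_or_had_format"]["range"] = "uriorcurie"  # the Rule 55 fix
print(ambiguous_slots(schema))  # set()
```

This only approximates the warning; gen-owl remains the source of truth for verification.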
### Rule 56: Semantic Consistency Over Simplicity - Always Execute slot_fixes.yaml Revisions

🚨 **CRITICAL**: When `slot_fixes.yaml` specifies a revision, agents MUST execute it. Perceived simplicity of the existing slot is NOT a valid reason to reject a migration.

**The Core Problem**: Previous agents marked migrations as "NO MIGRATION NEEDED" citing reasons like "simple enum appropriate", "would add unnecessary indirection", "already has proper slot_uri". These judgments were **INCORRECT**.

**Why Revisions MUST Be Executed**:

| Principle | Explanation |
|-----------|-------------|
| **Schema Consistency** | Ontology achieves semantic power through consistent patterns, not local optimizations |
| **LinkML Mapping Separation** | `slot_uri` handles external ontology alignment; slot structure handles internal consistency |
| **Single Responsibility Principle** | Predicates should have single, focused purposes |
| **Extensibility First** | Structured classes enable future extension even if current use is simple |

**Invalid Reasons to Reject Migrations**:

| Rejected Reason | Why It's Invalid |
|-----------------|------------------|
| "Already has proper slot_uri" | slot_uri is for external mapping; internal structure is separate |
| "Simple string/enum is sufficient" | Consistency and extensibility trump local simplicity |
| "Would add unnecessary indirection" | Indirection enables reuse and future extension |
| "Creating a class would over-engineer" | Ontology design favors class-based modeling |

**Valid Reasons to Pause Migrations** (warrant discussion, not unilateral rejection):
- Semantic conflict - Proposed slot_uri contradicts semantic intent
- Class already exists under different name
- Circular dependency would be created
- Breaking change would affect external consumers

**Key Insight**: Agents confused "has good external mapping" with "needs no migration". These are independent concerns.

**See**: `.opencode/rules/semantic-consistency-over-simplicity.md` for complete documentation

---
### Rule 57: slot_fixes.yaml Revision Key is IMMUTABLE

🚨 **CRITICAL**: The `revision` key in `schemas/20251121/linkml/modules/slots/slot_fixes.yaml` is **IMMUTABLE**. AI agents MUST follow revision specifications exactly and are NEVER permitted to modify the content of revision entries.

**The Authoritative Source**:

The `revision` section in each slot_fixes.yaml entry was **manually curated** based on:
- Ontology analysis (CIDOC-CRM, RiC-O, PROV-O, Schema.org alignment)
- Semantic correctness
- Pattern consistency (Rule 39: RiC-O style naming)
- Type/Types class hierarchy design (Rule 0b)

**What Agents CAN Do**:

| Action | Permitted | Location |
|--------|-----------|----------|
| Add completion notes | ✅ YES | `processed.notes` |
| Update status | ✅ YES | `processed.status` |
| Add feedback responses | ✅ YES | `feedback.response` |
| Mark feedback as done | ✅ YES | `feedback.done` |
| Execute the migration per revision | ✅ YES | Class/slot files |

**What Agents CANNOT Do**:

| Action | Permitted | Reason |
|--------|-----------|--------|
| Modify `revision` content | ❌ NEVER | Authoritative specification |
| Substitute different slots | ❌ NEVER | Violates curated design |
| Skip revision components | ❌ NEVER | Incomplete migration |
| Add new revision items | ❌ NEVER | Requires human curation |
| Change revision labels | ❌ NEVER | Breaks semantic mapping |

**Example Structure**:

```yaml
- original_slot_id: https://nde.nl/ontology/hc/slot/example_slot
  revision:                      # ← IMMUTABLE - DO NOT MODIFY
    - label: has_or_had_example  # Generic slot to use
      type: slot
    - label: Example             # Class for range
      type: class
  processed:
    status: false                # ← CAN UPDATE to true
    notes: ""                    # ← CAN ADD notes here
  feedback:
    done: false                  # ← CAN UPDATE to true
    response: ""                 # ← CAN ADD response here
```

**Rationale**:
1. **Curated Quality**: Revisions were manually designed with ontology expertise
2. **Consistency**: Same patterns applied across all migrations
3. **Auditability**: Clear record of intended vs. actual changes
4. **Trust**: Users can rely on revision specifications being stable

**See**: `.opencode/rules/slot-fixes-revision-immutability-rule.md` for complete documentation

---
### Rule 58: Feedback vs Revision Distinction in slot_fixes.yaml

🚨 **CRITICAL**: The `feedback` and `revision` fields in `slot_fixes.yaml` serve distinct purposes and MUST NOT be conflated or renamed.

**Field Definitions**:

| Field | Purpose | Authority |
|-------|---------|-----------|
| `revision` | Defines WHAT the migration target is (slots/classes to create) | IMMUTABLE (Rule 57) |
| `feedback` | Contains user instructions on HOW the revision should be applied | User directives that override previous `notes` |

**Feedback Formats**:

1. **Structured** (with `done` field):
   ```yaml
   feedback:
     - timestamp: '2026-01-17T00:01:57Z'
       user: Simon C. Kemper
       done: false  # Becomes true after agent processes
       comment: |
         The migration should use X instead of Y.
   ```

2. **String** (direct instruction):
   ```yaml
   feedback: I reject this! type_id should be migrated to has_or_had_identifier + Identifier
   ```

**Interpretation Rules**:

| Feedback Contains | Meaning | Action Required |
|-------------------|---------|-----------------|
| "I reject this" | Previous `notes` were WRONG | Follow `revision` field instead |
| "I altered the revision" | User updated `revision` | Execute migration per NEW revision |
| "Conduct the migration" | Migration not yet done | Execute migration now |
| "Please conduct accordingly" | Migration pending | Execute migration now |
| "ADDRESSED" or `done: true` | Already processed | No action needed |

**Decision Tree**:
```
Is feedback field present?
├─ NO → Check processed.status, execute revision if false
└─ YES → Parse format:
    ├─ Structured with done: true → No action needed
    ├─ Structured with done: false → Process, then set done: true
    └─ String format → Parse for keywords:
        ├─ "reject" → Follow revision, ignore previous notes
        ├─ "altered/adjusted" → Execute NEW revision
        └─ "conduct/please" → Migration pending, execute now
```

**See**: `.opencode/rules/feedback-vs-revision-distinction.md` for complete documentation

---
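The decision tree above is mechanical enough to sketch in code. This is an illustrative triage helper, not a project script; the action labels it returns are made up for the example.

```python
def triage(entry: dict) -> str:
    """Return the action implied by the Rule 58 decision tree for one
    slot_fixes.yaml entry (parsed to a dict)."""
    fb = entry.get("feedback")
    if fb is None:
        # No feedback: fall back to processed.status
        if entry.get("processed", {}).get("status"):
            return "no_action"
        return "execute_revision"
    if isinstance(fb, str):
        text = fb.lower()
        if "reject" in text:
            return "follow_revision_ignore_notes"
        if "alter" in text or "adjust" in text:
            return "execute_new_revision"
        if "conduct" in text or "please" in text:
            return "execute_now"
        return "human_review"
    # Structured list format: done on every item means fully processed
    if all(item.get("done") for item in fb):
        return "no_action"
    return "process_then_mark_done"

print(triage({"feedback": "I reject this! type_id should be migrated"}))  # follow_revision_ignore_notes
print(triage({"processed": {"status": False}}))                           # execute_revision
```

Keyword order matters: "reject" is checked before "conduct", matching the tree's precedence.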
### Rule 59: LinkML Union Types Require `range: Any`

🚨 **CRITICAL**: When using `any_of` for union types in LinkML, you MUST also specify `range: Any` at the attribute level. Without it, the union type validation does NOT work.

**The Problem**: LinkML's `any_of` construct allows defining slots that accept multiple types (e.g., string OR integer). However, without `range: Any`, the `any_of` constraint is silently ignored during validation.

**Correct Pattern**:

```yaml
slots:
  identifier_value:
    range: Any  # ← REQUIRED for any_of to work
    any_of:
      - range: string
      - range: integer
    description: The identifier value (can be string or integer)
```

**Incorrect Pattern (WILL FAIL)**:

```yaml
slots:
  identifier_value:
    # Missing range: Any - validation will fail!
    any_of:
      - range: string
      - range: integer
```

**Common Use Cases**:

| Use Case | Types | Example Fields |
|----------|-------|----------------|
| Identifier values | string \| integer | `identifier_value`, `geonames_id`, `viaf_id` |
| Social media IDs | string \| array | `youtube_channel_id`, `facebook_id`, `twitter_username` |
| Flexible identifiers | object \| array | `identifiers` (dict or list format) |
| Numeric strings | string \| integer | `postal_code`, `kvk_number` |

**Validation Behavior**:

| Schema Definition | Integer Data | String Data | Result |
|-------------------|--------------|-------------|--------|
| `range: string` | ❌ FAIL | ✅ PASS | Strict string only |
| `range: integer` | ✅ PASS | ❌ FAIL | Strict integer only |
| `any_of` without `range: Any` | ❌ FAIL | ❌ FAIL | Broken - nothing works |
| `any_of` with `range: Any` | ✅ PASS | ✅ PASS | Correct union behavior |

**See**: `.opencode/rules/linkml-union-type-range-any-rule.md` for complete documentation

---
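A schema lint for this rule is a one-liner over the parsed slots. A minimal sketch (the `slots` dict stands in for `yaml.safe_load` output on a schema module):

```python
def broken_unions(slots: dict) -> list:
    """Slot names that declare any_of but lack range: Any - the silent
    failure mode described in Rule 59."""
    return [name for name, spec in slots.items()
            if "any_of" in spec and spec.get("range") != "Any"]

slots = {
    # Correct pattern: range: Any alongside any_of
    "identifier_value": {
        "range": "Any",
        "any_of": [{"range": "string"}, {"range": "integer"}],
    },
    # Incorrect pattern: any_of without range: Any
    "geonames_id": {
        "any_of": [{"range": "string"}, {"range": "integer"}],
    },
}
print(broken_unions(slots))  # ['geonames_id']
```

Running this over every slot module before `linkml-lint` catches the broken-union case early.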
### Rule 60: No Migration Deferral - Execute All slot_fixes.yaml Revisions

🚨 **CRITICAL**: AI agents MUST execute ALL migrations specified in `slot_fixes.yaml`. The pattern of marking migrations as "WELL_STRUCTURED_NO_MIGRATION_NEEDED" or similar deferral notes is **STRICTLY PROHIBITED**.

**The Problem**: Previous agents marked 95%+ of slot_fixes.yaml entries with notes like "Already well-structured" without executing migrations. This violated the curated intent of slot_fixes.yaml revisions.

**Invalid Justifications** (NEVER acceptable for skipping migration):

| Rejected Reason | Why It's Invalid |
|-----------------|------------------|
| "Already has proper slot_uri" | slot_uri is for external mapping; internal structure is separate concern |
| "Simple string/enum is sufficient" | Consistency and extensibility trump local simplicity |
| "Would add unnecessary indirection" | Indirection enables reuse and future extension |
| "Current structure is well-designed" | Revisions were curated with full ontology context |
| "WELL_STRUCTURED_NO_MIGRATION_NEEDED" | This exact phrase is now a red flag |

**Valid Pause Reasons** (require `feedback` entry, NOT deferral via notes):
- Genuine semantic conflict between revision and documented intent
- Circular dependency would be created
- Breaking change affecting known external consumers

**Statistics**: >95% of slot_fixes.yaml entries MUST be executed. <5% may have genuine conflicts requiring human review via `feedback` mechanism.

**Workflow**:
1. Read the `revision` section completely
2. Execute the migration exactly as specified
3. Mark `processed.status: true`
4. If genuine conflict exists, add `feedback` entry (NOT a deferral note)

**See**: `.opencode/rules/no-migration-deferral-rule.md` for complete documentation

---
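Compliance with the >95% execution target can be audited from the parsed entries. An illustrative sketch (the stats function and its field names beyond `processed.status`/`processed.notes` are assumptions for this example):

```python
def migration_stats(entries: list) -> dict:
    """Summarize slot_fixes.yaml entries: how many were executed, and how
    many carry the red-flag deferral phrase in their notes."""
    executed = sum(1 for e in entries if e.get("processed", {}).get("status"))
    deferred = sum(
        1 for e in entries
        if "WELL_STRUCTURED" in str(e.get("processed", {}).get("notes", "")))
    pct = round(100 * executed / len(entries), 1) if entries else 0.0
    return {"total": len(entries), "executed": executed,
            "red_flag_deferrals": deferred, "executed_pct": pct}

entries = [
    {"processed": {"status": True, "notes": "migrated to has_or_had_quantity"}},
    {"processed": {"status": False, "notes": "WELL_STRUCTURED_NO_MIGRATION_NEEDED"}},
]
print(migration_stats(entries))
# {'total': 2, 'executed': 1, 'red_flag_deferrals': 1, 'executed_pct': 50.0}
```

Any nonzero `red_flag_deferrals` count means entries need to be re-opened and executed.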
## Appendix: Full Rule Content (No .opencode Equivalent)

The following rules have no separate .opencode file and are preserved in full:

### Rule 2: Wikidata Entities Are NOT Ontology Classes

**Files**:
- `data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml`
- `data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml`

**These files contain**:
- ✅ Wikidata entity identifiers (Q-numbers) for heritage institution TYPES
- ✅ Multilingual labels and descriptions
- ✅ Hypernym classifications (upper-level categories)
- ✅ Source data for ontology mapping analysis

**These files DO NOT contain**:
- ❌ Formal ontology class definitions
- ❌ Direct `class_uri` mappings for LinkML
- ❌ Ontology properties or relationships

**REQUIRED WORKFLOW**:
```
hyponyms_curated.yaml (Wikidata Q-numbers)
    ↓
ANALYZE semantic meaning + hypernyms
    ↓
SEARCH base ontologies for matching classes
    ↓
MAP Wikidata entity → Ontology class(es)
    ↓
DOCUMENT rationale + properties
    ↓
CREATE LinkML schema with ontology class_uri
```

**Example - WRONG** ❌:
```yaml
Mansion:
  class_uri: wd:Q1802963  # ← This is an ENTITY, not a CLASS!
```

**Example - CORRECT** ✅:
```yaml
Mansion:
  # Wikidata source: Q1802963
  place_aspect:
    class_uri: crm:E27_Site  # CIDOC-CRM ontology class
  custodian_aspect:
    class_uri: cpov:PublicOrganisation  # If operates as museum
```
### Rule 3: Multi-Aspect Modeling is Mandatory

**Every heritage entity has MULTIPLE ontological aspects with INDEPENDENT temporal lifecycles.**

**Required Aspects**:

1. **Place Aspect** (physical location/site)
   - Ontology: `crm:E27_Site` + `schema:Place`
   - Temporal: Construction → Demolition/Present

2. **Custodian Aspect** (organization managing heritage)
   - Ontology: `cpov:PublicOrganisation` OR `schema:Organization`
   - Temporal: Founding → Dissolution/Present

3. **Legal Form Aspect** (legal entity registration)
   - Ontology: `org:FormalOrganization` + `tooi:Overheidsorganisatie` (Dutch)
   - Temporal: Registration → Deregistration/Present

4. **Collections Aspect** (heritage materials)
   - Ontology: `rico:RecordSet` OR `crm:E78_Curated_Holding` OR `bf:Collection`
   - Temporal: Accession → Deaccession (per item)

5. **People Aspect** (staff, curators)
   - Ontology: `pico:PersonObservation` + `crm:E21_Person`
   - Temporal: Employment start → Employment end (per person)

6. **Temporal Events** (organizational changes)
   - Ontology: `crm:E10_Transfer_of_Custody`, `rico:Event`
   - Tracks custody transfers, mergers, relocations, transformations

**Example**: A historic mansion operating as a museum has:
- **Place aspect**: Building constructed 1880, still standing (143 years)
- **Custodian aspect**: Foundation established 1994 to operate museum (30 years)
- **Legal form**: Dutch stichting registered 1994, KvK #12345678
- **Collections**: Mondrian artworks acquired 1994-2024
- **People**: Current curator employed 2020-present

**Each aspect changes independently over time!**
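The independent-lifecycle idea can be sketched as data structures. This is a conceptual illustration only, not the project's LinkML model; the `Aspect` and `HeritageEntity` names and year-granular lifecycles are assumptions for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Aspect:
    """One ontological aspect with its own temporal lifecycle."""
    class_uri: str
    start_year: int
    end_year: Optional[int] = None  # None = ongoing

    def active_in(self, year: int) -> bool:
        # Active if the year falls inside this aspect's own lifecycle
        return self.start_year <= year and (
            self.end_year is None or year <= self.end_year)

@dataclass
class HeritageEntity:
    name: str
    aspects: dict = field(default_factory=dict)

# The mansion example above: place and custodian start a century apart
mansion = HeritageEntity("Historic mansion museum", {
    "place": Aspect("crm:E27_Site", 1880),
    "custodian": Aspect("cpov:PublicOrganisation", 1994),
})
print(mansion.aspects["place"].active_in(1950))      # True
print(mansion.aspects["custodian"].active_in(1950))  # False
```

The point of the sketch: asking "did this entity exist in 1950?" has no single answer; each aspect must be queried separately.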
### Rule 5: NEVER Delete Enriched Data - Additive Only

**🚨 CRITICAL: Data enrichment is ADDITIVE ONLY. Never delete or overwrite existing enriched content.**

When restructuring or updating enriched institution records:

**✅ ALLOWED (Additive Operations)**:
- Add new fields or sections
- Restructure YAML/JSON layout while preserving all content
- Rename files (e.g., `_unknown.yaml` → `_museum_name.yaml`)
- Add provenance metadata
- Merge data from multiple sources (preserving all)

**❌ FORBIDDEN (Destructive Operations)**:
- Delete Google Maps data (reviews, ratings, photo counts, popular times)
- Remove OpenStreetMap metadata
- Overwrite website scrape results
- Delete Wikidata enrichment data
- Remove any `*_enrichment` sections
- Truncate or summarize detailed content

**Data Types That Must NEVER Be Deleted**:

| Data Source | Protected Fields |
|-------------|------------------|
| **Google Maps** | `reviews`, `rating`, `total_ratings`, `photo_count`, `popular_times`, `place_id`, `business_status` |
| **OpenStreetMap** | `osm_id`, `osm_type`, `osm_tags`, `amenity`, `building`, `heritage` |
| **Wikidata** | `wikidata_id`, `claims`, `sitelinks`, `aliases`, `descriptions` |
| **Website Scrape** | `organization_details`, `collections`, `exhibitions`, `contact`, `social_media`, `accessibility` |
| **ISIL Registry** | `isil_code`, `assigned_date`, `remarks` |

**Example - CORRECT Restructuring**:

```yaml
# BEFORE (flat structure)
google_maps_rating: 4.5
google_maps_reviews: 127
website_description: "Historic museum..."

# AFTER (nested structure) - ALL DATA PRESERVED
enrichment_sources:
  google_maps:
    rating: 4.5   # ← PRESERVED
    reviews: 127  # ← PRESERVED
  website:
    description: "Historic museum..."  # ← PRESERVED
```

**Example - WRONG (Data Loss)**:

```yaml
# BEFORE
google_maps_enrichment:
  rating: 4.5
  reviews: 127
  popular_times: {...}
  photos: [...]

# AFTER - WRONG! Data deleted!
enrichment_status: enriched
# Where did the rating, reviews, popular_times go?!
```

**Rationale**:
- Enriched data is expensive to collect (API calls, rate limits, web scraping)
- Google Maps data changes over time - historical snapshots are valuable
- Reviews and ratings provide quality signals for heritage institutions
- Photo metadata enables visual discovery and verification
- Deleting data violates data provenance principles

**If Unsure**: When restructuring files, first READ the entire file, then WRITE a new version that includes ALL original content in the new structure.

---
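The additive-only constraint maps naturally onto a merge policy. A minimal sketch, assuming records parsed to nested dicts (the `additive_merge` helper is illustrative, not a project script):

```python
def additive_merge(existing: dict, incoming: dict) -> dict:
    """Merge incoming into existing without ever deleting or overwriting
    existing enriched values; only genuinely new keys are added."""
    merged = dict(existing)
    for key, value in incoming.items():
        if key not in merged:
            merged[key] = value
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = additive_merge(merged[key], value)
        # Existing scalar values win: never overwrite enriched data
    return merged

before = {"google_maps": {"rating": 4.5, "reviews": 127}}
update = {"google_maps": {"rating": 4.6},
          "website": {"description": "Historic museum"}}
print(additive_merge(before, update))
# {'google_maps': {'rating': 4.5, 'reviews': 127},
#  'website': {'description': 'Historic museum'}}
```

Note the design choice: an incoming value for an existing scalar key is dropped rather than applied, because under Rule 5 a new snapshot should be recorded alongside the old value (e.g. with a timestamp), never in its place.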
### Rule 6: WebObservation Claims MUST Have XPath Provenance

**Every claim extracted from a webpage MUST have an XPath pointer to the exact location in archived HTML where that value appears. Claims without XPath provenance are FABRICATED and must be removed.**

This is not about "confidence" or "uncertainty" - it's about **verifiability**. Either the claim value exists in the HTML at a specific XPath, or it was hallucinated/fabricated by an LLM.

**Required Fields for WebObservation Claims**:

| Field | Required | Description |
|-------|----------|-------------|
| `claim_type` | YES | Type of claim (full_name, description, email, etc.) |
| `claim_value` | YES | The extracted value |
| `source_url` | YES | URL the claim was extracted from |
| `retrieved_on` | YES | ISO 8601 timestamp when page was archived |
| `xpath` | YES | XPath to the element containing this value |
| `html_file` | YES | Relative path to archived HTML file |
| `xpath_match_score` | YES | 1.0 for exact match, <1.0 for fuzzy match |

**Example - CORRECT (Verifiable)**:
```yaml
web_enrichment:
  claims:
    - claim_type: full_name
      claim_value: Historische Vereniging Nijeveen
      source_url: https://historischeverenigingnijeveen.nl/
      retrieved_on: "2025-11-29T12:28:00Z"
      xpath: /[document][1]/html[1]/body[1]/div[6]/div[1]/table[3]/tbody[1]/tr[1]/td[1]/p[6]
      html_file: web/0021/historischeverenigingnijeveen.nl/rendered.html
      xpath_match_score: 1.0
```

**Example - WRONG (Fabricated - Must Be Removed)**:
```yaml
web_enrichment:
  claims:
    - claim_type: full_name
      claim_value: Historische Vereniging Nijeveen
      confidence: 0.95  # ← NO! This is meaningless without XPath
```

**Workflow**:
1. Archive website using Playwright: `python scripts/fetch_website_playwright.py <entry> <url>`
2. Add XPath provenance: `python scripts/add_xpath_provenance.py`
3. Script removes fabricated claims (stored in `removed_unverified_claims` for audit)

**See**:
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` for complete documentation
- `schemas/20251121/linkml/modules/classes/WebClaim.yaml` for LinkML schema definition

**Scope Clarification**: This rule applies to `WebClaim` and `WebObservation` classes only. Other data classes have different provenance models:
- **CustodianTimelineEvent**: Source-agnostic design - use `extraction_notes` for API queries/XPaths, and `observation_ref` to link to WebObservation/CustodianObservation for detailed provenance. See `.opencode/PROVENANCE_SEPARATION_RULE.md`.
- **GoogleMapsEnrichment**: Uses Place ID and API response provenance.
- **WikidataEnrichment**: Uses entity ID and SPARQL query provenance.

---
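The required-fields check that `add_xpath_provenance.py` performs can be sketched as a simple split. This is an illustrative sketch, not the actual script: it separates verifiable claims from fabricated ones based purely on field presence.

```python
# The seven required provenance fields from the table in Rule 6
REQUIRED = {"claim_type", "claim_value", "source_url",
            "retrieved_on", "xpath", "html_file", "xpath_match_score"}

def split_claims(claims: list) -> tuple:
    """Separate verifiable claims from fabricated ones (missing XPath
    provenance); fabricated claims go to removed_unverified_claims."""
    verified = [c for c in claims if REQUIRED.issubset(c)]
    removed = [c for c in claims if not REQUIRED.issubset(c)]
    return verified, removed

claims = [
    {"claim_type": "full_name",
     "claim_value": "Historische Vereniging Nijeveen",
     "source_url": "https://historischeverenigingnijeveen.nl/",
     "retrieved_on": "2025-11-29T12:28:00Z",
     "xpath": "/[document][1]/html[1]/body[1]/div[6]",
     "html_file": "web/0021/historischeverenigingnijeveen.nl/rendered.html",
     "xpath_match_score": 1.0},
    # Fabricated: a confidence score is no substitute for an XPath
    {"claim_type": "full_name", "claim_value": "Some Name", "confidence": 0.95},
]
ok, removed = split_claims(claims)
print(len(ok), len(removed))  # 1 1
```

Beyond field presence, the real script also resolves each XPath against the archived HTML; that step needs an HTML parser and is omitted here.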
## Project Overview

**Goal**: Extract structured data about worldwide GLAMORCUBESFIXPHDNT (Galleries, Libraries, Archives, Museums, Official institutions, Research centers, Corporations, Unknown, Botanical gardens/zoos, Educational providers, Societies, Features, Intangible heritage groups, miXed, Personal collections, Holy sites, Digital platforms, NGOs, Taste/smell heritage) institutions from 139+ Claude conversation JSON files and integrate with authoritative CSV datasets.

**Output**: Validated LinkML-compliant records representing heritage custodian organizations with provenance tracking, geographic data, identifiers, and relationship information.

**Schema**: See the modular LinkML schema v0.2.1 with 19-type GLAMORCUBESFIXPHDNT taxonomy described below.
## Schema Reference (v0.2.1)

The project uses a **modular LinkML schema** organized into 6 specialized modules:

1. **`schemas/heritage_custodian.yaml`** - Main schema (import-only structure)
   - Top-level schema that imports all modules
   - Defines schema metadata and namespace

2. **`schemas/core.yaml`** - Core Classes
   - `HeritageCustodian` - Main institution entity
   - `Location` - Geographic data
   - `Identifier` - External identifiers (ISIL, Wikidata, VIAF, etc.)
   - `DigitalPlatform` - Online systems and platforms
   - `GHCID` - Global Heritage Custodian Identifier

3. **`schemas/enums.yaml`** - Enumerations
   - `InstitutionTypeEnum` - 13 institution types (GALLERY, LIBRARY, ARCHIVE, MUSEUM, etc.)
   - `ChangeTypeEnum` - 11 organizational change types (FOUNDING, MERGER, CLOSURE, etc.)
   - `DataSource` - Data origin types (CSV_REGISTRY, CONVERSATION_NLP, etc.)
   - `DataTier` - Data quality tiers (TIER_1_AUTHORITATIVE through TIER_4_INFERRED)
   - `PlatformTypeEnum` - Digital platform categories

4. **`schemas/provenance.yaml`** - Provenance Tracking
   - `Provenance` - Data source and quality metadata
   - `ChangeEvent` - Organizational change history (mergers, relocations, etc.)
   - `GHCIDHistoryEntry` - GHCID change tracking over time

5. **`schemas/collections.yaml`** - Collection Metadata
   - `Collection` - Collection descriptions
   - `Accession` - Acquisition records
   - `DigitalObject` - Digital surrogates

6. **`schemas/dutch.yaml`** - Dutch-Specific Extensions
   - `DutchHeritageCustodian` - Netherlands heritage institutions
   - Extensions for ISIL registry, platform integrations, KvK numbers

See `/docs/SCHEMA_MODULES.md` for detailed architecture and design patterns.

## Base Ontologies for Global GLAM Data

**CRITICAL**: Before designing extraction pipelines or extending the schema, AI agents MUST consult the base ontologies that the LinkML schema builds upon. These ontologies provide standardized vocabularies and patterns for modeling heritage institutions.

### Foundation Ontologies

The GLAM project integrates with three primary ontologies, each serving different geographic and semantic scopes:

#### 1. TOOI - Dutch Government Organizational Ontology

**File**: `/data/ontology/tooiont.ttl`
**Namespace**: `https://identifier.overheid.nl/tooi/def/ont/`
**Scope**: Dutch heritage institutions (government archives, state museums, public cultural organizations)

**When to Use**:
- ✅ Extracting Dutch heritage institutions from conversations
- ✅ Modeling Dutch organizational change events (mergers, splits, reorganizations)
- ✅ Integrating with Dutch ISIL registry or KvK (Chamber of Commerce) data
- ✅ Parsing Dutch government heritage agency data

**Key Classes**:
- `tooi:Overheidsorganisatie` - Government organization (extends to `DutchHeritageCustodian`)
- `tooi:Wijzigingsgebeurtenis` - Change event (founding, merger, closure, relocation)

**Key Properties**:
- `tooi:officieleNaamInclSoort` - Official name including type
- `tooi:begindatum` / `tooi:einddatum` - Temporal validity (start/end dates)
- `tooi:organisatieIdentificatie` - Formal identifiers (ISIL codes, etc.)

**LinkML Mapping**:
```yaml
# schemas/dutch.yaml extends TOOI
DutchHeritageCustodian:
  is_a: HeritageCustodian
  class_uri: tooi:Overheidsorganisatie  # ← Maps to TOOI
```

**Reference**: See `/docs/ONTOLOGY_EXTENSIONS.md` for complete TOOI integration patterns.

---

#### 2. CPOV - EU Core Public Organisation Vocabulary

**Files**:
- `/data/ontology/core-public-organisation-ap.ttl` (RDF schema)
- `/data/ontology/core-public-organisation-ap.jsonld` (JSON-LD context)

**Namespace**: `http://data.europa.eu/m8g/`
**Scope**: EU-wide and global public sector heritage organizations

**When to Use**:
- ✅ Extracting European heritage institutions (France, Germany, Belgium, etc.)
- ✅ Modeling international/global heritage organizations
- ✅ Aligning with EU Linked Open Data initiatives (Europeana, DPLA)
- ✅ Extracting non-Dutch institutions from conversations

**Key Classes**:
- `cpov:PublicOrganisation` - Public sector organization (base for `HeritageCustodian`)
- `cv:ChangeEvent` - Organizational change events
- `locn:Address` - Physical location data

**Key Properties**:
- `skos:prefLabel` / `skos:altLabel` - Preferred and alternative names
- `dct:identifier` - Formal identifiers (ISIL, Wikidata, VIAF)
- `dct:temporal` - Temporal coverage (founding to closure dates)
- `locn:address` - Physical addresses

**LinkML Mapping**:
```yaml
# schemas/core.yaml aligns with CPOV
HeritageCustodian:
  class_uri: cpov:PublicOrganisation  # ← Maps to CPOV

  slots:
    name:
      slot_uri: skos:prefLabel
    alternative_names:
      slot_uri: skos:altLabel
    identifiers:
      slot_uri: dct:identifier
```

**Reference**: See `/docs/ONTOLOGY_EXTENSIONS.md` for complete CPOV integration patterns.

---

#### 3. Schema.org - Web Vocabulary for Structured Data

**File**: `/data/ontology/schemaorg.owl`
**Namespace**: `http://schema.org/`
**Scope**: Universal web semantics (museums, galleries, collections, events, learning resources)

**When to Use**:
- ✅ Extracting private collections or non-governmental organizations
- ✅ Modeling digital platforms (learning management systems, discovery portals)
- ✅ Web discoverability and SEO optimization
- ✅ Fallback when TOOI/CPOV don't apply

**Key Classes**:
- `schema:Museum` / `schema:Library` / `schema:ArchiveOrganization` - Heritage institution types
- `schema:Place` - Geographic locations
- `schema:LearningResource` - Educational platforms (LMS, online courses)
- `schema:Event` - Organizational events (founding, exhibitions)

**LinkML Mapping**:
```yaml
# schemas/enums.yaml maps platform types to Schema.org
DigitalPlatformTypeEnum:
  LEARNING_MANAGEMENT:
    meaning: schema:LearningResource  # ← Maps to Schema.org
```

**Reference**: See `/docs/ONTOLOGY_EXTENSIONS.md` for Schema.org usage examples.

---

### Ontology Decision Tree for Agents

When extracting heritage institution data, choose the appropriate ontology:

```
START: Extract institution from conversation
  ↓
Is the institution Dutch?
├─ YES → Use TOOI ontology
│        - Map to schemas/dutch.yaml
│        - Extract ISIL codes (NL-* format)
│        - Extract KvK numbers (8-digit)
│        - Model change events as tooi:Wijzigingsgebeurtenis
│
└─ NO → Is it a public/government organization?
        ├─ YES → Use CPOV ontology
        │        - Map to schemas/core.yaml
        │        - Extract standard identifiers (ISIL, Wikidata, VIAF)
        │        - Model change events as cv:ChangeEvent
        │
        └─ NO → Use Schema.org
                - Map to schemas/core.yaml
                - Use schema:Museum, schema:Library, etc.
                - Emphasize web discoverability
```

**Multi-Ontology Support**: Institutions can implement MULTIPLE ontology classes simultaneously:

```turtle
<https://w3id.org/heritage/custodian/nl/rijksmuseum>
    a tooi:Overheidsorganisatie,  # Dutch government organization
      cpov:PublicOrganisation,    # EU public sector
      schema:Museum ;             # Schema.org web semantics
```

---

### Required Ontology Consultation Workflow

**Before extracting data**, agents MUST perform these steps:

#### Step 1: Identify Institution Geographic Scope

```python
# Determine which ontology applies (sketch; institution_country,
# institution_in_europe, and institution_public_sector come from the
# parsed institution record)
if institution_country == "NL":
    primary_ontology = "TOOI"
    ontology_file = "/data/ontology/tooiont.ttl"
elif institution_in_europe or institution_public_sector:
    primary_ontology = "CPOV"
    ontology_file = "/data/ontology/core-public-organisation-ap.ttl"
else:
    primary_ontology = "Schema.org"
    ontology_file = "/data/ontology/schemaorg.owl"
```

#### Step 2: Review Ontology Classes and Properties

**Search ontology files** for relevant classes:

```bash
# Dutch institutions - search TOOI
rg "tooi:Overheidsorganisatie|Wijzigingsgebeurtenis|begindatum" /data/ontology/tooiont.ttl

# EU/global institutions - search CPOV
rg "cpov:PublicOrganisation|cv:ChangeEvent|locn:Address" /data/ontology/core-public-organisation-ap.ttl

# All institutions - search Schema.org
rg "schema:Museum|schema:Library|schema:ArchiveOrganization" /data/ontology/schemaorg.owl
```

#### Step 3: Map Conversation Data to Ontology Properties

Create a mapping table before extraction:

| Extracted Field | TOOI Property | CPOV Property | Schema.org Property |
|-----------------|---------------|---------------|---------------------|
| Institution name | `tooi:officieleNaamInclSoort` | `skos:prefLabel` | `schema:name` |
| Alternative names | - | `skos:altLabel` | `schema:alternateName` |
| Founding date | `tooi:begindatum` | `schema:startDate` | `schema:foundingDate` |
| Closure date | `tooi:einddatum` | `schema:endDate` | `schema:dissolutionDate` |
| ISIL code | `tooi:organisatieIdentificatie` | `dct:identifier` | `schema:identifier` |
| Address | (use `locn:Address`) | `locn:address` | `schema:address` |
| Merger event | `tooi:Wijzigingsgebeurtenis` | `cv:ChangeEvent` | `schema:Event` |
| Website | - | `schema:url` | `schema:url` |

#### Step 4: Document Ontology Alignment in Provenance

**Always include** ontology references in extraction metadata:

```yaml
provenance:
  data_source: CONVERSATION_NLP
  extraction_method: "NLP extraction following CPOV ontology patterns"
  base_ontology: "http://data.europa.eu/m8g/"  # ← Document which ontology used
  ontology_alignment:
    - "cpov:PublicOrganisation"
    - "cv:ChangeEvent"
  extraction_date: "2025-11-09T..."
```

---

### Common Ontology Patterns

**Pattern 1: Organizational Change Events**

When extracting mergers, splits, relocations, name changes:

```yaml
# TOOI pattern (Dutch institutions)
change_history:
  - event_id: https://w3id.org/heritage/custodian/event/nha-merger-2001
    change_type: MERGER  # Maps to tooi:Wijzigingsgebeurtenis
    event_date: "2001-01-01"
    event_description: "Merger of Gemeentearchief Haarlem and Rijksarchief in Noord-Holland"
    ontology_class: "tooi:Wijzigingsgebeurtenis"

# CPOV pattern (EU/global institutions)
change_history:
  - event_id: https://w3id.org/heritage/custodian/event/bnf-founding
    change_type: FOUNDING  # Maps to cv:ChangeEvent
    event_date: "1461-01-01"
    event_description: "Founded by King Louis XI as Royal Library"
    ontology_class: "cv:ChangeEvent"
```

**Pattern 2: Multilingual Names**

CPOV and Schema.org support language-tagged literals:

```yaml
name: Bibliothèque nationale de France
alternative_names:
  - National Library of France@en
  - BnF@fr
  - Französische Nationalbibliothek@de

# RDF serialization:
# skos:prefLabel "Bibliothèque nationale de France"@fr ;
# skos:altLabel "National Library of France"@en, "BnF"@fr ;
```

**Pattern 3: Hierarchical Relationships**

Use W3C Org Ontology patterns (integrated in CPOV):

```yaml
# Parent institution
parent_organization:
  name: Ministry of Culture
  relationship_type: "org:hasUnit"  # CPOV uses W3C Org Ontology

# Branch institutions
branches:
  - name: Regional Archive Noord-Brabant
    relationship_type: "org:subOrganizationOf"
```

---

### Anti-Patterns to Avoid

**❌ DON'T**: Invent custom properties when ontology equivalents exist

```yaml
# BAD - Custom property instead of ontology reuse
institution_official_name: "Rijksarchief"  # Use skos:prefLabel instead!
```

**❌ DON'T**: Ignore ontology namespace conventions

```yaml
# BAD - No ontology reference
change_type: "merger"  # Use cv:ChangeEvent with proper namespace!
```

**❌ DON'T**: Extract without reviewing ontology files

```bash
# BAD - Extracting Dutch institutions without reading TOOI
agent: "I'll extract Dutch archives using Schema.org only"
# This loses semantic precision and ignores domain-specific patterns!
```

**✅ DO**: Always map to base ontologies and document alignment

```yaml
# GOOD - Ontology-aligned extraction
name: Rijksarchief in Noord-Holland
institution_type: ARCHIVE
ontology_class: tooi:Overheidsorganisatie  # ← Documented
provenance:
  base_ontology: "https://identifier.overheid.nl/tooi/def/ont/"
  ontology_alignment:
    - tooi:Overheidsorganisatie
    - prov:Organization  # TOOI uses PROV-O for temporal tracking
```

---

### Additional Ontology Resources

**CIDOC-CRM** (Cultural Heritage Domain):
- File: `/data/ontology/CIDOC_CRM_v7.1.3.rdf`
- Use for: Museum object cataloging, provenance, conservation
- Key classes: `crm:E74_Group` (organizations), `crm:E5_Event` (historical events)

**RiC-O** (Records in Contexts - Archival Description):
- Use for: Archival collections, fonds, series, items
- Key classes: `rico:CorporateBody`, `rico:RecordSet`
- Integration: Planned for future schema extension

**BIBFRAME** (Bibliographic Resources):
- Use for: Library catalogs, bibliographic metadata
- Key classes: `bf:Organization`, `bf:Work`, `bf:Instance`
- Integration: For library-specific extensions

**Reference Documentation**: See `/docs/ONTOLOGY_EXTENSIONS.md` for comprehensive integration patterns, RDF serialization examples, and extension workflows.

---

## Institution Type Taxonomy

The project uses a 19-type GLAMORCUBESFIXPHDNT taxonomy (expanded November 2025) with single-letter codes for GHCID identifier generation:

| Type | Code | Description | Example Use Cases |
|------|------|-------------|-------------------|
| **GALLERY** | G | Art gallery or exhibition space | Commercial galleries, kunsthallen |
| **LIBRARY** | L | Library (public, academic, specialized) | National libraries, university libraries |
| **ARCHIVE** | A | Archive (government, corporate, personal) | National archives, city archives |
| **MUSEUM** | M | Museum (art, history, science, etc.) | Rijksmuseum, natural history museums |
| **OFFICIAL_INSTITUTION** | O | Government heritage agencies | Provincial archives, heritage platforms |
| **RESEARCH_CENTER** | R | Research institutes and documentation centers | Knowledge centers, research libraries |
| **CORPORATION** | C | Corporate heritage collections | Company archives, corporate museums |
| **UNKNOWN** | U | Institution type cannot be determined | Ambiguous or unclassifiable organizations |
| **BOTANICAL_ZOO** | B | Botanical gardens and zoological parks | Arboreta, botanical gardens, zoos |
| **EDUCATION_PROVIDER** | E | Educational institutions with collections | Schools, training centers with heritage materials, universities |
| **COLLECTING_SOCIETY** | S | Societies collecting specialized materials | Numismatic societies, heritage societies (heemkundige kring) |
| **FEATURES** | F | Physical landscape features with heritage significance | Monuments, sculptures, statues, memorials, landmarks, cemeteries |
| **INTANGIBLE_HERITAGE_GROUP** | I | Organizations preserving intangible heritage | Traditional performance groups, oral history societies, folklore organizations |
| **MIXED** | X | Multiple types (uses X code) | Combined museum/archive facilities |
| **PERSONAL_COLLECTION** | P | Private personal collections | Individual collectors |
| **HOLY_SITES** | H | Religious heritage sites and institutions | Churches, temples, mosques, synagogues with collections |
| **DIGITAL_PLATFORM** | D | Digital heritage platforms and repositories | Online archives, digital libraries, virtual museums |
| **NGO** | N | Non-governmental heritage organizations | Heritage advocacy groups, preservation societies |
| **TASTE_SMELL** | T | Culinary and olfactory heritage institutions | Historic restaurants, parfumeries, distilleries preserving traditional recipes and formulations |

**Notes**:
- MIXED institutions use "X" as the GHCID code and document all actual types in metadata
- HOLY_SITES includes religious institutions managing cultural heritage collections (archives, libraries, artifacts)
- FEATURES includes physical monuments and landscape features with heritage value (not institutions maintaining collections)
- COLLECTING_SOCIETY includes historical societies (historische vereniging), philatelic societies, numismatic clubs, ephemera collectors
- OFFICIAL_INSTITUTION includes aggregation platforms, provincial heritage services, and government heritage agencies
- INTANGIBLE_HERITAGE_GROUP covers organizations preserving UNESCO-recognized intangible cultural heritage
- DIGITAL_PLATFORM includes born-digital heritage platforms and digitization aggregators
- NGO includes non-profit heritage organizations that don't fit other categories
- TASTE_SMELL includes establishments actively preserving culinary traditions, historic recipes, perfume formulations, and sensory heritage
- When institution type is unknown, records default to UNKNOWN pending verification

**Mnemonic**: **GLAMORCUBESFIXPHDNT** - Galleries, Libraries, Archives, Museums, Official institutions, Research centers, Corporations, Unknown, Botanical gardens/zoos, Education providers, Societies, Features, Intangible heritage groups, miXed, Personal collections, Holy sites, Digital platforms, NGOs, Taste/smell heritage

**Note on order**: The mnemonic GLAMORCUBESFIXPHDNT spells out the taxonomy's canonical code order: G-L-A-M-O-R-C-U-B-E-S-F-I-X-P-H-D-N-T

**Note**: Universities are classified under **E (EDUCATION_PROVIDER)**, not U. The U-class is reserved for institutions where the type cannot be determined during data extraction.
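
The single-letter codes above feed directly into GHCID identifier generation. A minimal sketch of the type-to-code lookup (the dictionary mirrors the table exactly; the helper name `ghcid_type_code` is illustrative and not part of the schema):

```python
# Hypothetical helper: map the 19 GLAMORCUBESFIXPHDNT institution types to
# their single-letter GHCID codes, as listed in the taxonomy table above.
GHCID_TYPE_CODES = {
    "GALLERY": "G", "LIBRARY": "L", "ARCHIVE": "A", "MUSEUM": "M",
    "OFFICIAL_INSTITUTION": "O", "RESEARCH_CENTER": "R", "CORPORATION": "C",
    "UNKNOWN": "U", "BOTANICAL_ZOO": "B", "EDUCATION_PROVIDER": "E",
    "COLLECTING_SOCIETY": "S", "FEATURES": "F",
    "INTANGIBLE_HERITAGE_GROUP": "I", "MIXED": "X",
    "PERSONAL_COLLECTION": "P", "HOLY_SITES": "H",
    "DIGITAL_PLATFORM": "D", "NGO": "N", "TASTE_SMELL": "T",
}


def ghcid_type_code(institution_type: str) -> str:
    """Return the single-letter GHCID code; unrecognized types fall back to U."""
    return GHCID_TYPE_CODES.get(institution_type.upper(), "U")
```

Falling back to `U` mirrors the rule that unverified institutions default to UNKNOWN.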
## Data Sources

### Primary Sources

1. **Conversation JSON files** (`/Users/kempersc/Documents/claude/glam/*.json`)
   - 139 conversation files covering global GLAM research
   - Countries include: Brazil, Vietnam, Chile, Japan, Mexico, Norway, Thailand, Taiwan, Belgium, Azerbaijan, Estonia, Namibia, Argentina, Tunisia, Ghana, Iran, Russia, Uzbekistan, Armenia, Georgia, Croatia, Greece, Nigeria, Somalia, Yemen, Oman, South Korea, Malaysia, Colombia, Switzerland, Moldova, Romania, Albania, Bosnia, Pakistan, Suriname, Nicaragua, Congo, Denmark, Austria, Australia, Myanmar, Cambodia, Sri Lanka, Tajikistan, Turkmenistan, Philippines, Latvia, Palestine, Limburg (NL), Gelderland (NL), Drenthe (NL), Groningen (NL), Slovakia, Kenya, Paraguay, Honduras, Mozambique, Eritrea, Sudan, Rwanda, Kiribati, Jamaica, Indonesia, Italy, Zimbabwe, East Timor, UAE, Kuwait, Lebanon, Syria, Maldives, Benin
   - Also 14 ontology research conversations

2. **Dutch ISIL Registry** (`data/ISIL-codes_2025-08-01.csv`)
   - ~300 Dutch heritage institutions
   - Fields: Volgnr, Plaats, Instelling, ISIL code, Toegekend op, Opmerking
   - Authoritative source (Tier 1)

3. **Dutch Organizations CSV** (`data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv`)
   - Comprehensive Dutch heritage organizations
   - 40+ metadata columns including: name, address, ISIL code, organization type, partnerships, systems used, metadata standards
   - Rich integration data (Museum register, Rijkscollectie, Collectie Nederland, Archieven.nl, etc.)
   - Authoritative source (Tier 1)

### Implementation Status (Updated Nov 2025)

Both Dutch datasets have been **successfully parsed and cross-linked**:

**ISIL Registry** ✅:
- 364 institutions parsed (2 invalid codes rejected)
- 203 cities covered
- Parser: `src/glam_extractor/parsers/isil_registry.py`
- Tests: 10/10 passing (84% coverage)

**Dutch Organizations** ✅:
- 1,351 institutions parsed
- 475 cities covered
- 1,119 organizations with digital platforms
- Parser: `src/glam_extractor/parsers/dutch_orgs.py`
- Tests: 18/18 passing (98% coverage)

**Cross-linking Results** 🔗:
- 340 institutions matched by ISIL code (92.1% overlap)
- 198 records enriched with platform data
- 127 name conflicts detected (require manual review)
- 1,004 organizations without ISIL codes (candidates for assignment)
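
The cross-linking step can be sketched as a dictionary join on ISIL code. Field names below are illustrative; the real logic lives in `crosslink_dutch_datasets.py` and the parsers under `src/glam_extractor/parsers/`:

```python
def crosslink_by_isil(isil_records, org_records):
    """Match records from both Dutch datasets on ISIL code.

    Returns (matched, name_conflicts, unmatched_orgs). Matched records are
    TIER_1 merges where the organizations CSV contributes platform data;
    name mismatches are flagged for manual review.
    """
    by_isil = {r["isil"]: r for r in isil_records if r.get("isil")}
    matched, conflicts, unmatched = [], [], []
    for org in org_records:
        registry = by_isil.get(org.get("isil"))
        if registry is None:
            unmatched.append(org)  # candidate for ISIL assignment
            continue
        matched.append({**registry, **org})
        if registry["name"].strip().lower() != org["name"].strip().lower():
            conflicts.append((registry["name"], org["name"]))
    return matched, conflicts, unmatched
```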

**Analysis Scripts**:
- `compare_dutch_datasets.py` - Dataset comparison
- `crosslink_dutch_datasets.py` - TIER_1 data merging demo
- `test_real_dutch_orgs.py` - Real data validation

See `PROGRESS.md` for detailed statistics and findings.

---

## Conversation JSON Structure

Each conversation JSON file has the following structure:

```json
{
  "uuid": "conversation-uuid",
  "name": "Conversation name (often includes country/region)",
  "summary": "Optional summary",
  "created_at": "ISO 8601 timestamp",
  "updated_at": "ISO 8601 timestamp",
  "chat_messages": [
    {
      "uuid": "message-uuid",
      "text": "User or assistant message text",
      "sender": "human" | "assistant",
      "content": [
        {
          "type": "text" | "tool_use" | "tool_result",
          "text": "Message content (may contain markdown, lists, etc.)",
          ...
        }
      ]
    }
  ]
}
```
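
A minimal sketch for walking this structure (assuming a conversation already parsed with `json.load`; only `text` content blocks are extraction candidates, so `tool_use` and `tool_result` blocks are skipped):

```python
import json


def iter_message_texts(conversation: dict):
    """Yield (sender, text) pairs from one conversation dict, walking the
    nested content blocks and skipping tool_use/tool_result entries."""
    for message in conversation.get("chat_messages", []):
        for block in message.get("content", []):
            if block.get("type") == "text" and block.get("text"):
                yield message.get("sender"), block["text"]


def load_conversation(path: str) -> dict:
    """Load one conversation JSON file from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```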

## NLP Extraction Tasks

All extraction tasks map to the modular LinkML schema v0.2.1. See **Schema Reference** section above for module details.

### Task 1: Entity Recognition - Institution Names

**Objective**: Extract heritage institution names from conversation text.

**Schema Mapping**: Populates `HeritageCustodian` class from `schemas/core.yaml`

**Patterns to Look For**:
- Organization names (proper nouns)
- Museum names (often contain "Museum", "Museu", "Museo", "Muzeum", etc.)
- Library names (contain "Library", "Biblioteca", "Bibliothek", "Bibliotheek", etc.)
- Archive names (contain "Archive", "Archivo", "Archiv", "Archief", etc.)
- Gallery names
- Cultural centers
- Holy sites with collections (churches, temples, mosques, synagogues, monasteries, abbeys, cathedrals managing heritage materials)

**Contextual Indicators**:
- Lists of institutions
- Descriptions like "The X is a museum in Y"
- URLs containing institution names
- Mentions of collections, exhibitions, or holdings

**Example Extraction**:
```markdown
Input: "The Biblioteca Nacional do Brasil in Rio de Janeiro holds over 9 million items..."

Output:
- name: "Biblioteca Nacional do Brasil"  # HeritageCustodian.name
- institution_type: LIBRARY              # InstitutionTypeEnum from schemas/enums.yaml
- city: "Rio de Janeiro"                 # Location.city from schemas/core.yaml
- confidence_score: 0.95                 # Provenance.confidence_score from schemas/provenance.yaml
```
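
The multilingual name cues above can be turned into a first-pass type guesser. The keyword lists below are illustrative and deliberately non-exhaustive; a real pipeline would combine this with NER rather than rely on substrings alone:

```python
# Substring matching (not word-boundary) so compounds like "Rijksmuseum"
# and "Stadsarchief" are caught. Check order matters only when a name
# contains cues for multiple types.
INSTITUTION_CUES = {
    "MUSEUM": ["museum", "museu", "museo", "muzeum"],
    "LIBRARY": ["library", "biblioteca", "bibliothek", "bibliotheek"],
    "ARCHIVE": ["archive", "archivo", "archiv", "archief"],
}


def guess_institution_type(name: str) -> str:
    """Return a coarse InstitutionTypeEnum guess from name cues."""
    lowered = name.lower()
    for institution_type, cues in INSTITUTION_CUES.items():
        if any(cue in lowered for cue in cues):
            return institution_type
    return "UNKNOWN"  # default pending verification, per the taxonomy rules
```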

### Task 2: Location Extraction

**Objective**: Extract geographic information associated with institutions.

**Schema Mapping**: Populates `Location` class from `schemas/core.yaml`

**Extract**:
- City names
- Street addresses (when mentioned)
- Postal codes
- Provinces/states/regions
- Country (can often be inferred from conversation title)

**Geocoding**:
- Use Nominatim API to geocode addresses to lat/lon
- Link to GeoNames IDs when possible
- Handle multilingual place names

**Example**:
```markdown
Input: "Nationaal Onderduikmuseum, Aalten"

Output:
- city: "Aalten"                           # Location.city
- country: "NL"                            # Location.country (ISO 3166-1 alpha-2)
- geonames_id: "2759899" (lookup via API)  # Location.geonames_id
- latitude: 51.9167 (from geocoding)
- longitude: 6.5833
```
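
A hedged sketch of the Nominatim lookup, covering only URL building and response parsing. The live HTTP call, the mandatory `User-Agent` header, and the 1 request/second throttling required by the Nominatim usage policy are left to the caller:

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def nominatim_query_url(city, country=None):
    """Build a Nominatim search URL for a city (optionally scoped to a country)."""
    query = f"{city}, {country}" if country else city
    params = {"q": query, "format": "jsonv2", "limit": 1}
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


def parse_nominatim_hit(hit):
    """Map one Nominatim JSON result onto Location-style fields.

    Nominatim returns lat/lon as strings, so convert to float here."""
    return {
        "latitude": float(hit["lat"]),
        "longitude": float(hit["lon"]),
        "display_name": hit.get("display_name"),
    }
```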

### Task 3: Identifier Extraction

**Objective**: Extract external identifiers mentioned in conversations.

**Schema Mapping**: Populates `Identifier` class from `schemas/core.yaml`

**Identifier Types**:
- ISIL codes (format: `NL-XXXXX`, `US-XXXXX`, etc.)
- Wikidata IDs (format: `Q12345`)
- VIAF IDs (format: numeric)
- URLs to institutional websites
- KvK numbers (Dutch: 8-digit format)

**Patterns**:
```regex
ISIL: [A-Z]{2}-[A-Za-z0-9]+
Wikidata: Q[0-9]+
VIAF: viaf.org/viaf/[0-9]+
KvK: [0-9]{8}
```

**Example**:
```markdown
Input: "ISIL code NL-AsdAM for Amsterdam Museum"

Output:
- identifier_scheme: "ISIL"            # Identifier.identifier_scheme
- identifier_value: "NL-AsdAM"         # Identifier.identifier_value
- institution_name: "Amsterdam Museum" # HeritageCustodian.name (for linking)
```
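
The patterns above can be compiled into a small classifier. Anchoring ISIL, Wikidata, and KvK with `^...$` is an assumption added here to avoid partial matches; the VIAF pattern stays unanchored because the ID is embedded in a URL:

```python
import re

# Compiled once; checked in order, most specific first.
IDENTIFIER_PATTERNS = {
    "ISIL": re.compile(r"^[A-Z]{2}-[A-Za-z0-9]+$"),
    "WIKIDATA": re.compile(r"^Q[0-9]+$"),
    "VIAF": re.compile(r"viaf\.org/viaf/([0-9]+)"),
    "KVK": re.compile(r"^[0-9]{8}$"),
}


def classify_identifier(value: str):
    """Return (scheme, normalized_value) for the first matching pattern, else None."""
    for scheme, pattern in IDENTIFIER_PATTERNS.items():
        match = pattern.search(value)
        if match:
            # For VIAF, keep only the numeric ID captured from the URL.
            return scheme, match.group(1) if match.groups() else value
    return None
```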

### Task 4: Relationship Extraction

**Objective**: Extract relationships between institutions.

**Schema Mapping**: Maps to `ChangeEvent` class from `schemas/provenance.yaml` (for mergers, splits) and future relationship modeling

**Relationship Types**:
- Parent-child (e.g., "X is part of Y")
- Partnerships (e.g., "X collaborates with Y")
- Network memberships (e.g., "X is a member of Z consortium")
- Merged organizations (e.g., "X merged with Y") → `ChangeTypeEnum.MERGER`

**Indicators**:
- "part of", "branch of", "division of"
- "in partnership with", "collaborates with"
- "member of", "belongs to"
- "merged with", "absorbed by" → Use `ChangeEvent` from `schemas/provenance.yaml`

### Task 5: Collection Metadata Extraction

**Objective**: Extract information about collections held by institutions.

**Schema Mapping**: Populates `Collection` class from `schemas/collections.yaml`

**Extract**:
- Collection names → `Collection.collection_name`
- Collection types (archival, bibliographic, museum objects)
- Subject areas → `Collection.subject_areas`
- Time periods covered → `Collection.temporal_coverage`
- Item counts (when mentioned) → `Collection.extent`
- Access information → `Collection.access_rights`

**Example**:
```markdown
Input: "The archive holds 15,000 documents from the 18th-19th centuries..."

Output:
- collection_type: "archival"      # Collection metadata
- item_count: 15000                # Collection.extent
- time_period_start: "1700-01-01"  # Collection.temporal_coverage
- time_period_end: "1899-12-31"
```

### Task 6: Digital Platform Identification

**Objective**: Identify digital platforms and systems used by institutions.

**Schema Mapping**: Populates `DigitalPlatform` class from `schemas/core.yaml`

**Platform Types**:
- Collection management systems (Atlantis, MAIS, CollectiveAccess, etc.)
- Digital repositories (DSpace, EPrints, Fedora)
- Discovery portals
- SPARQL endpoints
- APIs

**Extract**:
- Platform name → `DigitalPlatform.platform_name`
- Platform URL → `DigitalPlatform.platform_url`
- Metadata standards used → `DigitalPlatform.metadata_standards`
- Integration with aggregators (Europeana, DPLA, etc.)

### Task 7: Metadata Standards Detection

**Objective**: Identify which metadata standards institutions use.

**Schema Mapping**: Stores in `DigitalPlatform.metadata_standards` (list of strings)

**Standards to Detect**:
- Dublin Core
- MARC21
- EAD (Encoded Archival Description)
- BIBFRAME
- LIDO
- CIDOC-CRM
- Schema.org
- RiC-O (Records in Contexts)
- MODS, PREMIS, SPECTRUM, DACS

**Indicators**:
- Explicit mentions: "uses Dublin Core", "MARC21 records"
- Implicit: technical discussions about cataloging practices
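
For explicit mentions, detection can be a word-boundary scan over the standards list above. Short names like EAD or MODS need `\b` anchors so they do not match inside ordinary words such as "instead"; implicit detection would need real NLP and is out of scope for this sketch:

```python
import re

METADATA_STANDARDS = [
    "Dublin Core", "MARC21", "EAD", "BIBFRAME", "LIDO", "CIDOC-CRM",
    "Schema.org", "RiC-O", "MODS", "PREMIS", "SPECTRUM", "DACS",
]


def detect_metadata_standards(text: str):
    """Return the standards explicitly mentioned in text (case-insensitive,
    word-boundary matched to avoid substring false positives)."""
    found = []
    for standard in METADATA_STANDARDS:
        if re.search(rf"\b{re.escape(standard)}\b", text, flags=re.IGNORECASE):
            found.append(standard)
    return found
```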
### Task 8: Organizational Change Event Extraction (NEW - v0.2.0)

**Objective**: Extract significant organizational change events from conversation history.

**Schema Mapping**: Populates `ChangeEvent` class from `schemas/provenance.yaml`

**Change Types to Detect** (from `ChangeTypeEnum` in `schemas/enums.yaml`):
- **FOUNDING**: "established", "founded", "created", "opened"
- **CLOSURE**: "closed", "dissolved", "ceased operations", "shut down"
- **MERGER**: "merged with", "combined with", "joined with", "absorbed"
- **SPLIT**: "split into", "divided into", "separated from", "spun off"
- **ACQUISITION**: "acquired", "took over", "purchased"
- **RELOCATION**: "moved to", "relocated to", "transferred to"
- **NAME_CHANGE**: "renamed to", "formerly known as", "changed name to"
- **TYPE_CHANGE**: "became a museum", "converted to archive", "now operates as"
- **STATUS_CHANGE**: "reopened", "temporarily closed", "suspended operations"
- **RESTRUCTURING**: "reorganized", "restructured", "reformed"
- **LEGAL_CHANGE**: "incorporated as", "became a foundation", "legal status changed"
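
A first-pass keyword mapper for these cues might look like this. Only a subset of the 11 types is shown, and ordering matters: generic FOUNDING cues like "created" also appear in merger sentences, so more specific cues are checked first:

```python
# (change_type, cue phrases) pairs, checked in order: specific before generic.
CHANGE_TYPE_CUES = [
    ("MERGER", ["merged with", "combined with", "joined with", "absorbed"]),
    ("SPLIT", ["split into", "divided into", "separated from", "spun off"]),
    ("ACQUISITION", ["acquired", "took over", "purchased"]),
    ("RELOCATION", ["moved to", "relocated to", "transferred to"]),
    ("NAME_CHANGE", ["renamed to", "formerly known as", "changed name to"]),
    ("CLOSURE", ["closed", "dissolved", "ceased operations", "shut down"]),
    ("FOUNDING", ["established", "founded", "created", "opened"]),
]


def detect_change_type(sentence: str):
    """Return the first matching ChangeTypeEnum value, or None."""
    lowered = sentence.lower()
    for change_type, cues in CHANGE_TYPE_CUES:
        if any(cue in lowered for cue in cues):
            return change_type
    return None
```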

**Extract for Each Event**:
```yaml
change_history:  # HeritageCustodian.change_history (list of ChangeEvent)
  - event_id: "https://w3id.org/heritage/custodian/event/unique-id"  # ChangeEvent.event_id
    change_type: MERGER  # ChangeEvent.change_type (ChangeTypeEnum from schemas/enums.yaml)
    event_date: "2001-01-01"  # ChangeEvent.event_date
    event_description: >-  # ChangeEvent.event_description
      Merger of Institution A and Institution B to form new organization C.
      Detailed description from conversation.
    affected_organization: null  # ChangeEvent.affected_organization (optional)
    resulting_organization: null  # ChangeEvent.resulting_organization (optional)
    related_organizations: []  # ChangeEvent.related_organizations (optional)
    source_documentation: "https://..."  # ChangeEvent.source_documentation (optional)
```

**Temporal Context Indicators**:
- "In 2001, the museum merged with..."
- "After the renovation in 1985..."
- "Following the name change in 1968..."
- "The archive was relocated from X to Y in 1923"

**PROV-O Integration**:
- Map to `prov:Activity` in RDF serialization
- Link with `prov:wasInfluencedBy` from `HeritageCustodian`
- Use `prov:atTime` for event timestamps
- Track `prov:entity` (affected) and `prov:generated` (resulting) organizations

**Example Extraction**:
```markdown
Input: "The Noord-Hollands Archief was formed in 2001 through a merger of
Gemeentearchief Haarlem (founded 1910) and Rijksarchief in Noord-Holland
(founded 1802). The merger created a unified regional archive serving both
the city and province."

Output:
- event_id: "https://w3id.org/heritage/custodian/event/nha-merger-2001"
- change_type: MERGER  # ChangeTypeEnum.MERGER
- event_date: "2001-01-01"
- event_description: "Merger of Gemeentearchief Haarlem (municipal archive, founded
  1910) and Rijksarchief in Noord-Holland (state archive, founded
  1802) to form Noord-Hollands Archief."
- confidence_score: 0.95  # From Provenance metadata
```

**GHCID Impact**:
- When institutions merge, relocate, or change names, GHCID may change
- Track old GHCID in `ghcid_history` with `valid_to` timestamp matching event date → `GHCIDHistoryEntry` from `schemas/provenance.yaml`
- Create new `GHCIDHistoryEntry` with `valid_from` matching event date
- Link change event to GHCID change via temporal correlation
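
The `valid_from`/`valid_to` bookkeeping can be sketched as follows. The entry shape mirrors `GHCIDHistoryEntry`, but the field names and helper name are illustrative:

```python
def rotate_ghcid(history, new_ghcid, event_date):
    """Close the open GHCID history entry at event_date and append a new one.

    history: list of dicts with ghcid / valid_from / valid_to keys, where the
    current entry has valid_to = None. Returns a new list; input not mutated.
    """
    history = [dict(entry) for entry in history]
    if history and history[-1].get("valid_to") is None:
        history[-1]["valid_to"] = event_date  # retire the current identifier
    history.append({"ghcid": new_ghcid, "valid_from": event_date, "valid_to": None})
    return history
```

Tying the close/open dates to the same `ChangeEvent.event_date` is what makes the temporal correlation between the event and the GHCID change recoverable later.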
### Task 9: Holy Sites Heritage Collection Identification

**Objective**: Identify religious sites that function as heritage custodians by maintaining cultural collections.

**Schema Mapping**: Populates `HeritageCustodian` class with `institution_type: HOLY_SITES`

**When to Classify as HOLY_SITES**:

Religious institutions qualify as HOLY_SITES heritage custodians when they manage:
- **Archival collections**: Historical documents, parish registers, ecclesiastical records
- **Library collections**: Rare manuscripts, theological texts, historical books
- **Museum collections**: Religious artifacts, liturgical objects, art collections
- **Cultural heritage**: Historical buildings with guided tours, preservation programs

**Patterns to Look For**:
- Church archives (parish records, baptismal registers, historical documents)
- Monastery libraries (manuscript collections, rare books)
- Cathedral treasuries (liturgical objects, religious art)
- Temple museums (Buddhist artifacts, historical collections)
- Mosque libraries (Islamic manuscripts, Quranic texts)
- Synagogue archives (Jewish community records, Torah scrolls)
- Abbey collections (medieval manuscripts, historical artifacts)

**Keywords and Indicators**:
- "church archive", "parish records", "ecclesiastical archive"
- "monastery library", "monastic collection", "scriptorium"
- "cathedral treasury", "cathedral museum"
- "temple library", "temple collection"
- "mosque library", "Islamic manuscript collection"
- "synagogue archive", "Jewish heritage collection"
- "religious heritage site", "pilgrimage site with museum"

**NOT Holy Sites** (use other types):
- Secular museums about religion (use MUSEUM)
- Academic religious studies centers (use RESEARCH_CENTER or EDUCATION_PROVIDER)
- Government archives of church records (use ARCHIVE)
- Religious organizations without heritage collections (not heritage custodians)
|
||
|
||
**Example Extraction**:
|
||
```yaml
|
||
Input: "The Vatican Apostolic Archive holds over 85 km of shelving with
|
||
documents dating back to the 8th century, including papal bulls,
|
||
correspondence, and medieval manuscripts."
|
||
|
||
Output:
|
||
- name: Vatican Apostolic Archive
|
||
institution_type: HOLY_SITES # Religious institution managing heritage collections
|
||
description: >-
|
||
The Vatican Apostolic Archive (formerly Vatican Secret Archives) is
|
||
the central repository for papal and Vatican documents, holding over
|
||
35,000 volumes of historical records spanning 12 centuries.
|
||
locations:
|
||
- city: Vatican City
|
||
country: VA
|
||
collections:
|
||
- collection_name: Papal Documents
|
||
collection_type: archival
|
||
temporal_coverage: "0800-01-01/2024-12-31"
|
||
extent: "85 kilometers of shelving, 35,000+ volumes"
|
||
provenance:
|
||
data_source: CONVERSATION_NLP
|
||
confidence_score: 0.95
|
||
```
|
||
|
||
**Schema.org Mapping**:
|
||
- HOLY_SITES maps to `schema:PlaceOfWorship` in RDF serialization
|
||
- Can also use `schema:ArchiveOrganization` or `schema:Library` for collection-specific context
|
||
- Use multiple type assertions when appropriate
|
||
|
||
**Cross-Cultural Considerations**:
|
||
- Christianity: churches, cathedrals, monasteries, abbeys, convents
|
||
- Islam: mosques, madrasas (with historical libraries)
|
||
- Judaism: synagogues, yeshivas (with archival collections)
|
||
- Buddhism: temples, monasteries, pagodas (with artifact collections)
|
||
- Hinduism: temples (with historical collections)
|
||
- Sikhism: gurdwaras (with historical manuscripts)
|
||
- Other faiths: shrines, pilgrimage sites with documented heritage collections
|
||
|
||
## Data Quality and Provenance
|
||
|
||
### Provenance Tracking
|
||
|
||
**Every extracted record MUST include**:
|
||
```yaml
|
||
provenance:
|
||
data_source: CONVERSATION_NLP
|
||
data_tier: TIER_4_INFERRED
|
||
extraction_date: "2025-11-05T..."
|
||
extraction_method: "Subagent NER + pattern matching"
|
||
confidence_score: 0.85
|
||
conversation_id: "conversation-uuid"
|
||
source_url: null
|
||
verified_date: null
|
||
verified_by: null
|
||
```
|
||
|
||
### Confidence Scoring

Assign confidence scores (0.0-1.0) based on:
- **0.9-1.0**: Explicit, unambiguous mentions with context
- **0.7-0.9**: Clear mentions with some ambiguity
- **0.5-0.7**: Inferred from context, may need verification
- **0.3-0.5**: Low confidence, likely needs verification
- **0.0-0.3**: Very uncertain, flag for manual review
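The bands above can be encoded as a small helper so that downstream steps apply them consistently; the band labels themselves are illustrative, not schema values.

```python
# Hypothetical helper: map a confidence score to a review action.
def confidence_band(score: float) -> str:
    """Map a 0.0-1.0 confidence score to a review action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence_score out of range: {score}")
    if score >= 0.9:
        return "accept"            # explicit, unambiguous
    if score >= 0.7:
        return "accept_with_note"  # clear but somewhat ambiguous
    if score >= 0.5:
        return "verify"            # inferred from context
    if score >= 0.3:
        return "needs_verification"
    return "manual_review"         # very uncertain
```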
### Data Tier Assignment

- **TIER_1_AUTHORITATIVE**: CSV registries (ISIL, Dutch orgs)
- **TIER_2_VERIFIED**: Data from institutional websites (crawl4ai)
- **TIER_3_CROWD_SOURCED**: Wikidata, OpenStreetMap
- **TIER_4_INFERRED**: NLP-extracted from conversations

## Integration with CSV Data

### Cross-linking Strategy

1. **ISIL Code Matching** (primary)
   - If a conversation mentions an ISIL code, link to the CSV record
   - High-confidence match

2. **Name Matching** (secondary)
   - Normalize names (lowercase, remove punctuation, handle abbreviations)
   - Fuzzy matching with threshold > 0.85
   - Check for alternative names

3. **Location + Type Matching** (tertiary)
   - Match by city + institution type
   - Lower confidence, requires manual verification
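The name-matching step can be sketched with the standard library's `difflib`; the project recommends `rapidfuzz`, which uses the same ratio idea (on a 0-100 scale). The `normalize` helper is illustrative.

```python
import difflib
import string

def normalize(name: str) -> str:
    """Lowercase and strip punctuation before comparison."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(name.lower().translate(table).split())

def name_similarity(a: str, b: str) -> float:
    """Similarity of two normalized names in [0.0, 1.0]."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def is_match(a: str, b: str, threshold: float = 0.85) -> bool:
    return name_similarity(a, b) > threshold
```

For example, `is_match("Noord-Hollands Archief", "Noordhollands archief")` is true because normalization removes the hyphen and case difference, while unrelated names fall well below the 0.85 threshold.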
### Conflict Resolution

When conversation data conflicts with CSV data:
- **CSV data takes precedence** (higher tier)
- Mark conversation data with `verified: false`
- Note the conflict in provenance metadata
- Create a separate record if the institutions are genuinely different
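The "CSV wins" rule can be sketched as field-level merging over plain dicts; the record shape and the `conflicting_fields` key are illustrative assumptions, not schema fields.

```python
# Hypothetical sketch: higher-tier (CSV) values overwrite conversation values.
def merge_records(csv_rec: dict, conv_rec: dict) -> dict:
    """Prefer CSV (TIER_1) values; keep conversation values only for gaps."""
    merged = dict(conv_rec)
    conflicts = []
    for field, value in csv_rec.items():
        if field in merged and merged[field] != value:
            conflicts.append(field)  # note for provenance metadata
        merged[field] = value        # higher tier wins
    merged["conflicting_fields"] = conflicts
    return merged

m = merge_records(
    {"name": "Amsterdam Museum", "city": "Amsterdam"},
    {"name": "Amsterdams Museum", "founded": "1975"},
)
```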
## NLP Models and Tools

### Recommended Approach: Agent-Based NER

**IMPORTANT**: Instead of directly using spaCy or other NER libraries in the main codebase, use **coding subagents** via the Task tool to conduct Named Entity Recognition and text extraction.

**Why Subagents**:
- Keeps the main codebase clean and maintainable
- Allows flexible experimentation with different NER approaches
- Subagents can choose the best tool for each specific extraction task
- Better separation of concerns: extraction logic vs. data pipeline

**How to Use Subagents for NER**:
1. Use the Task tool with `subagent_type="general"` for NER tasks
2. Provide clear prompts describing what entities to extract
3. The subagent autonomously chooses and applies appropriate NER tools (spaCy, transformers, regex, etc.)
4. The subagent returns structured extraction results
5. The main code validates and processes the results

## CRITICAL: Creating LinkML Instance Files

### Agent Capabilities Go Beyond Traditional NER

**IMPORTANT**: AI extraction agents are NOT limited to simple Named Entity Recognition. Unlike traditional NER tools that only identify entity boundaries and types, AI agents have **comprehensive understanding** and can:

1. **Extract Complete Records**: Capture ALL relevant information for each institution in one pass
2. **Infer Missing Data**: Use context to fill in fields that aren't explicitly stated
3. **Cross-Reference Within Documents**: Link related entities (locations, identifiers, events) automatically
4. **Maintain Consistency**: Ensure all extracted data conforms to the LinkML schema
5. **Generate Rich Metadata**: Create complete provenance tracking and confidence scores

### Mandatory: Create Complete LinkML Instance Files

When extracting data from conversations or other sources, agents MUST:

**✅ DO THIS**: Create complete LinkML-compliant YAML instance files with ALL available information
```yaml
# Example: data/instances/brazil_museums_001.yaml
---
# From schemas/core.yaml - HeritageCustodian class

- id: https://w3id.org/heritage/custodian/br/bnb-001
  name: Biblioteca Nacional do Brasil
  institution_type: LIBRARY # From schemas/enums.yaml
  alternative_names:
    - National Library of Brazil
    - BNB
  description: >-
    The National Library of Brazil, located in Rio de Janeiro, is the largest
    library in Latin America with over 9 million items. Founded in 1810 by
    King João VI of Portugal. Collections include rare manuscripts, maps,
    photographs, and Brazilian historical documents.

  locations: # From schemas/core.yaml - Location class
    - city: Rio de Janeiro
      street_address: Avenida Rio Branco, 219
      postal_code: "20040-008"
      region: Rio de Janeiro
      country: BR
      # Note: lat/lon can be geocoded later if not in text

  identifiers: # From schemas/core.yaml - Identifier class
    - identifier_scheme: ISIL
      identifier_value: BR-RjBN
      identifier_url: https://isil.org/BR-RjBN
    - identifier_scheme: VIAF
      identifier_value: "123556639"
      identifier_url: https://viaf.org/viaf/123556639
    - identifier_scheme: Wikidata
      identifier_value: Q1526131
      identifier_url: https://www.wikidata.org/wiki/Q1526131
    - identifier_scheme: Website
      identifier_value: https://www.bn.gov.br
      identifier_url: https://www.bn.gov.br

  digital_platforms: # From schemas/core.yaml - DigitalPlatform class
    - platform_name: Digital Library of the National Library of Brazil
      platform_url: https://bndigital.bn.gov.br
      platform_type: DISCOVERY_PORTAL
      metadata_standards:
        - Dublin Core
        - MARC21

  collections: # From schemas/collections.yaml - Collection class
    - collection_name: Brazilian Historical Documents
      collection_type: archival
      subject_areas:
        - Brazilian History
        - Colonial Period
        - Imperial Brazil
      temporal_coverage: "1500-01-01/1889-11-15"
      extent: "Approximately 2.5 million documents"

  change_history: # From schemas/provenance.yaml - ChangeEvent class
    - event_id: https://w3id.org/heritage/custodian/event/bnb-founding-1810
      change_type: FOUNDING
      event_date: "1810-01-01"
      event_description: >-
        Founded by King João VI of Portugal as the Royal Library
        (Biblioteca Real) when the Portuguese court relocated to Brazil.
      source_documentation: https://www.bn.gov.br/sobre-bn/historia

  provenance: # From schemas/provenance.yaml - Provenance class
    data_source: CONVERSATION_NLP
    data_tier: TIER_4_INFERRED
    extraction_date: "2025-11-05T14:30:00Z"
    extraction_method: "AI agent comprehensive extraction from Brazilian GLAM conversation"
    confidence_score: 0.92
    conversation_id: "2025-09-22T14-40-15-0102c00a-4c0a-4488-bdca-5dd9fb94c9c5"
    notes: >-
      Extracted from conversation about Brazilian GLAM institutions.
      Historical founding information cross-referenced from institutional website.
```
**❌ DO NOT DO THIS**: Return minimal JSON with only name and type

```json
// BAD - This is insufficient!
{
  "name": "Biblioteca Nacional do Brasil",
  "institution_type": "LIBRARY"
}
```

### Extraction Workflow for Agents

When processing a conversation or document:

1. **Read the Entire Document First**: Don't extract piecemeal - understand the full context
2. **Identify ALL Entities**: Find every institution, location, identifier, and event mentioned
3. **Gather Complete Information**: For each institution, extract:
   - Basic metadata (name, type, description)
   - All locations mentioned (even if just city/country)
   - All identifiers (ISIL, Wikidata, VIAF, URLs)
   - Digital platforms and systems
   - Collection information
   - Historical events (founding, mergers, relocations)
   - Relationships to other institutions
4. **Create LinkML YAML**: Write a complete instance file with ALL extracted data
5. **Add Provenance**: Always include extraction metadata with confidence scores
6. **Validate**: Ensure the output conforms to the schema (use `linkml-validate` if available)

### Example Agent Prompt for Comprehensive Extraction

```
Extract ALL heritage institutions from the following conversation about Brazilian GLAM institutions.

For EACH institution found, create a COMPLETE LinkML-compliant record including:
- Institution name, type, and description
- ALL locations mentioned (cities, addresses, regions)
- ALL identifiers (ISIL codes, Wikidata IDs, VIAF IDs, URLs)
- Digital platforms, systems, or websites
- Collection information (types, subjects, time periods, extent)
- Historical events (founding dates, mergers, relocations, name changes)
- Relationships to other organizations

Output: YAML file conforming to schemas/core.yaml, schemas/enums.yaml,
schemas/provenance.yaml, and schemas/collections.yaml

Use your understanding to:
- Infer missing fields from context (e.g., country from city names)
- Consolidate information scattered across multiple conversation turns
- Create rich descriptions summarizing key facts
- Assign appropriate confidence scores based on explicitness of mentions

Remember: You are NOT a simple NER tool. Use your full comprehension abilities
to create the most complete, accurate, and useful records possible.
```
### Multiple Institutions Per File

When a conversation discusses many institutions, create ONE YAML file with a list:

```yaml
---
# data/instances/netherlands_limburg_museums.yaml

- id: https://w3id.org/heritage/custodian/nl/bonnefantenmuseum
  name: Bonnefantenmuseum
  institution_type: MUSEUM
  # ... complete record ...

- id: https://w3id.org/heritage/custodian/nl/thermenmuseum
  name: Thermenmuseum
  institution_type: MUSEUM
  # ... complete record ...

- id: https://w3id.org/heritage/custodian/nl/limburgs-museum
  name: Limburgs Museum
  institution_type: MUSEUM
  # ... complete record ...
```

### Field Completion Strategies

Even when information is incomplete, do your best:

- **No explicit institution type?** Infer from context ("national library" → LIBRARY)
- **Only city mentioned?** That's fine - add `locations: [{city: "Amsterdam", country: "NL"}]`
- **No ISIL code?** Check if you can infer the format (NL-CityCode) or leave it out
- **No description?** Create one from available facts
- **Uncertain data?** Lower the confidence score but still include it
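The first bullet (type inference from context) can be sketched as a keyword lookup; the keyword table below is an illustrative assumption, while the enum values come from `schemas/enums.yaml`.

```python
from typing import Optional

# Hypothetical keyword table; extend per language as needed.
TYPE_KEYWORDS = {
    "LIBRARY": ["library", "bibliotheek", "biblioteca"],
    "ARCHIVE": ["archive", "archief", "arquivo"],
    "MUSEUM": ["museum", "museu", "musée"],
}

def infer_institution_type(name_or_context: str) -> Optional[str]:
    """Return the first matching InstitutionTypeEnum value, or None."""
    text = name_or_context.lower()
    for enum_value, keywords in TYPE_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return enum_value
    return None
```

When no keyword matches, leave `institution_type` for manual review rather than guessing.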
### Validation and Quality Control

After creating instance files:

1. **Schema Validation**: If possible, run `linkml-validate -s schemas/heritage_custodian.yaml data/instances/your_file.yaml`
2. **Completeness Check**: Ensure every institution has at minimum:
   - `id` (generate from country + institution name slug)
   - `name`
   - `institution_type`
   - `provenance` (with data_source, extraction_date, confidence_score)
3. **Consistency Check**: Same institution mentioned multiple times? Merge into one record
4. **Quality Flags**: If confidence < 0.5, add a note in `provenance.notes` explaining the uncertainty

### Extraction Stack (for Subagents)

When subagents perform extraction, they may use:

1. **Pattern matching** for identifiers (primary approach)
   - Regex for ISIL, VIAF, and Wikidata IDs
   - URL extraction and normalization
   - High precision, no dependencies

2. **NER libraries** (via subagents only)
   - spaCy: `en_core_web_trf`, `nl_core_news_lg`, `xx_ent_wiki_sm`
   - Transformers for classification
   - Used by subagents, not directly in main code

3. **Fuzzy matching** for deduplication
   - `rapidfuzz` library
   - Levenshtein distance for name matching
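The pattern-matching step can be sketched as below. The ISIL pattern is a loose approximation of ISO 15511 (1-4 character prefix, hyphen, local part up to 11 characters); tighten it against the real registry before relying on it.

```python
import re

PATTERNS = {
    # Loose ISIL approximation: uppercase prefix, hyphen, short local part.
    "ISIL": re.compile(r"\b[A-Z]{1,4}-[A-Za-z0-9/:-]{1,11}\b"),
    "Wikidata": re.compile(r"\bQ\d+\b"),
    # Capture only the numeric ID from a VIAF URL.
    "VIAF": re.compile(r"viaf\.org/viaf/(\d+)"),
}

def extract_identifiers(text: str) -> list[dict]:
    """Return Identifier-style dicts for every pattern match in text."""
    results = []
    for scheme, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group(1) if pattern.groups else match.group(0)
            results.append({"identifier_scheme": scheme, "identifier_value": value})
    return results

ids = extract_identifiers(
    "See NL-AsdAM (Wikidata Q1820897, https://viaf.org/viaf/139503837)."
)
```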
### Processing Pipeline

```
Conversation JSON
  ↓
Parse & Extract Text
  ↓
[SUBAGENT] NER Extraction
  - Subagent uses spaCy/transformers/patterns
  - Returns structured entities
  ↓
Pattern Matching (identifiers, URLs)
  ↓
Classification (institution type, standards)
  ↓
Geocoding (locations)
  ↓
Cross-link with CSV (ISIL/name matching)
  ↓
LinkML Validation
  ↓
Export (RDF, JSON-LD, CSV, Parquet)
```

## Agent Interaction Patterns

### When Asked to Extract Data from Conversations

1. **Start Small**: Begin with 1-2 conversation files to test extraction logic
2. **Show Examples**: Display extracted entities with confidence scores
3. **Ask for Validation**: Show uncertain extractions for user confirmation
4. **Iterate**: Refine patterns based on feedback
5. **Batch Process**: Once patterns are validated, process all 139 files

### When Asked to Design NLP Components

1. **Reference the Schema**: Always refer to the modular schema v0.2.1:
   - Core classes: `schemas/core.yaml` (HeritageCustodian, Location, Identifier, etc.)
   - Enumerations: `schemas/enums.yaml` (InstitutionTypeEnum, ChangeTypeEnum, etc.)
   - Provenance: `schemas/provenance.yaml` (Provenance, ChangeEvent, etc.)
   - See the schema overview in the "Schema Reference (v0.2.1)" section above
2. **Consult Base Ontologies**: BEFORE designing extraction logic, review the relevant ontologies:
   - **Dutch institutions**: Study the TOOI ontology (`/data/ontology/tooiont.ttl`)
   - **EU/global institutions**: Study the CPOV ontology (`/data/ontology/core-public-organisation-ap.ttl`)
   - **All institutions**: Reference Schema.org patterns (`/data/ontology/schemaorg.owl`)
   - See the "Base Ontologies for Global GLAM Data" section above for the decision tree
3. **Use Design Patterns**: Follow the patterns in `docs/plan/global_glam/05-design-patterns.md`
4. **Track Provenance**: Every extraction must include provenance metadata (from `schemas/provenance.yaml`)
5. **Handle Multilingual Content**: Conversations cover 60+ countries; expect multilingual content
6. **Error Handling**: Use the Result pattern, never fail silently

### When Asked to Validate Data

1. **LinkML Validation**: Use `linkml-validate` to check schema compliance
2. **Cross-reference**: Compare with CSV data when applicable
3. **Check Identifiers**: Validate ISIL format, check that the Wikidata item exists
4. **Geographic Verification**: Geocode addresses, verify country codes
5. **Duplicate Detection**: Use fuzzy matching to find potential duplicates
## Example Agent Workflows

### Workflow 1: Extract Brazilian Institutions

```bash
# User request
"Extract all museum, library, and archive names from the Brazilian GLAM conversation"

# Agent actions
1. Read conversation: 2025-09-22T14-40-15-0102c00a-4c0a-4488-bdca-5dd9fb94c9c5-Brazilian_GLAM_collection_inventories.json
2. Parse chat_messages array
3. **Launch subagent** to extract institutions using NER
   - Subagent analyzes text and extracts ORG entities
   - Filters for heritage-related keywords
   - Classifies institution types
   - Returns structured results
4. Extract locations (cities in Brazil)
5. Geocode using Nominatim
6. Create HeritageCustodian records
7. Add provenance metadata (data_source: CONVERSATION_NLP, extraction_method: "Subagent NER")
8. Validate with LinkML schema
9. Export to JSON-LD
10. Report results with confidence scores
```

### Workflow 2: Cross-link Dutch Institutions

```bash
# User request
"Cross-link the Dutch organizations CSV with any Dutch institutions found in conversations"

# Agent actions
1. Load data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv
2. Parse into DutchHeritageCustodian records
3. Extract all NL-* ISIL codes
4. Search all conversation files for mentions of these ISIL codes
5. Fuzzy match organization names
6. For matches:
   - Merge metadata
   - Mark CSV data as TIER_1
   - Mark conversation data as TIER_4
   - Resolve conflicts (CSV wins)
7. For Dutch institutions in conversations NOT in the CSV:
   - Create new records
   - Mark as TIER_4
   - Flag for verification
8. Export merged dataset
```

### Workflow 3: Build Global Institution Map

```bash
# User request
"Create a geographic distribution map of all extracted institutions"

# Agent actions
1. Process all 139 conversation files
2. **Launch subagent(s)** to extract institution names + locations from each file
3. Geocode all addresses
4. Group by country
5. Count institutions per country
6. Generate GeoJSON for mapping
7. Create visualization (Leaflet, Mapbox, etc.)
8. Export statistics:
   - Institutions per country
   - Institutions per type
   - Geographic coverage
   - Data quality (tier distribution)
```
## Multi-language Considerations

### Language Detection
- Detect the language of conversation content
- Subagents will choose appropriate NER models per language
- Multilingual support is handled by subagents

### Common Languages in Dataset
- English (international institutions)
- Dutch (Netherlands institutions)
- Portuguese (Brazil)
- Spanish (Latin America, Spain)
- Vietnamese, Japanese, Thai, Korean, Arabic, Russian, etc.

### Translation Strategy
- DO NOT translate institution names (preserve the original)
- Optionally translate descriptions for searchability
- Store language tags with text fields
- Use multilingual identifiers (Wikidata) for linking

## Output Formats

### Primary Output: JSON-LD
Linked Data format for semantic web integration:
```jsonld
{
  "@context": "https://w3id.org/heritage/custodian/context.jsonld",
  "@type": "HeritageCustodian",
  "@id": "https://example.org/institution/123",
  "name": "Amsterdam Museum",
  "institution_type": "MUSEUM",
  ...
}
```

### Secondary Outputs
- **RDF/Turtle**: For SPARQL querying
- **CSV**: For spreadsheet analysis
- **Parquet**: For data warehousing
- **SQLite**: For local querying

## Testing and Validation

### Unit Tests
Test extraction functions with known inputs:
```python
def test_extract_isil_codes():
    text = "The ISIL code NL-AsdAM identifies Amsterdam Museum"
    codes = extract_isil_codes(text)
    assert codes == [{"scheme": "ISIL", "value": "NL-AsdAM"}]
```

### Integration Tests
Test the full pipeline with sample conversations:
```python
def test_brazilian_museum_extraction():
    conversation = load_json("Brazilian_GLAM_collection_inventories.json")
    records = extract_heritage_custodians(conversation)
    assert len(records) > 0
    assert all(r.provenance.data_source == "CONVERSATION_NLP" for r in records)
```

### Validation Tests
Ensure LinkML schema compliance:
```python
def test_linkml_validation():
    record = create_heritage_custodian(...)
    validator = SchemaValidator(schema="heritage_custodian.yaml")
    result = validator.validate(record)
    assert result.is_valid
```
## Performance Optimization

### Batch Processing
- Process conversations in parallel (multiprocessing)
- Cache geocoding results (15-minute TTL)
- Deduplicate entity extraction

### Incremental Updates
- Track the last processed timestamp
- Only process new/updated conversations
- Maintain state in a SQLite database

### Resource Management
- Limit concurrent API calls (Nominatim: 1 req/sec)
- Use connection pooling for HTTP requests
- Stream large JSON files instead of loading them into memory
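The 1-request-per-second Nominatim limit can be enforced with a minimal blocking rate limiter. This is a sketch: a real client would also set a User-Agent and handle HTTP errors, and the geocoding call itself is omitted.

```python
import time

class RateLimiter:
    """Block so that successive acquire() calls are at least `interval` apart."""

    def __init__(self, interval_seconds: float = 1.0):
        self.interval = interval_seconds
        self._last = 0.0

    def acquire(self) -> None:
        wait = self._last + self.interval - time.monotonic()
        if wait > 0:
            time.sleep(wait)  # respect the per-request interval
        self._last = time.monotonic()

limiter = RateLimiter(interval_seconds=0.05)  # short interval for the demo
start = time.monotonic()
for _ in range(3):
    limiter.acquire()  # the actual geocoding request would go here
elapsed = time.monotonic() - start
```

With `interval_seconds=1.0`, three calls take at least two seconds; the demo uses a shorter interval only to keep the example fast.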
## Error Handling

### Common Errors and Solutions

1. **JSON Parsing Errors**
   - Malformed JSON files
   - Solution: Validate the JSON schema, report the file path

2. **NER Model Errors**
   - Missing spaCy model
   - Solution: Provide installation instructions, download automatically

3. **Geocoding Failures**
   - Unknown location, rate limit exceeded
   - Solution: Cache results, implement backoff, mark as unverified

4. **LinkML Validation Failures**
   - Required field missing, invalid enum value
   - Solution: Log validation errors, provide field mapping

5. **Encoding Issues**
   - Non-UTF-8 characters
   - Solution: Use UTF-8 everywhere, handle decode errors gracefully

## Schema Quirks and Implementation Notes

**IMPORTANT**: These are critical implementation details discovered during development. Read them carefully to avoid bugs.

### Provenance Model Quirks

The `Provenance` model does **NOT** have a `notes` field:

```python
# ❌ WRONG - Provenance has no 'notes' field
provenance = Provenance(
    data_source=DataSource.CSV_REGISTRY,
    notes="Some observation"  # This will fail!
)

# ✅ CORRECT - Use HeritageCustodian.description instead
custodian = HeritageCustodian(
    name="Museum Name",
    description="Notes and remarks go here",  # Put notes here
    provenance=Provenance(...)
)
```
### Field Naming Conventions

Always use the correct field names (check the schema when in doubt):

```python
# ❌ WRONG
custodian.institution_types  # Plural, list
custodian.location           # Singular

# ✅ CORRECT
custodian.institution_type   # Singular, single enum value
custodian.locations          # Plural, always a list (even with one item)
```

### Pydantic v1 Enum Behavior

This project uses Pydantic v1. Enum fields are **already strings**, not enum objects:

```python
# ❌ WRONG - Don't use the .value accessor
print(custodian.institution_type.value)  # AttributeError!

# ✅ CORRECT - Enum fields are already strings
print(custodian.institution_type)  # "MUSEUM", "ARCHIVE", etc.

# Same for platform types
platform.platform_type  # Already a string, not an enum object
```

### Required vs. Optional Fields

Many fields are optional but have validation rules. Always check for `None`:

```python
# Optional fields that may be None
custodian.locations          # Optional[List[Location]]
custodian.identifiers        # Optional[List[Identifier]]
custodian.digital_platforms  # Optional[List[DigitalPlatform]]
custodian.description        # Optional[str]

# Always check before iterating
if custodian.locations:
    for location in custodian.locations:
        print(location.city)
```
### CSV Parsing Best Practices

1. **Handle the UTF-8 BOM**: Use `encoding='utf-8-sig'` when reading CSVs
2. **Normalize headers**: Strip whitespace, handle multiline headers
3. **Warn on errors**: Skip invalid rows but log warnings
4. **Preserve originals**: Store raw CSV data in intermediate models before conversion

Example:
```python
import csv

with open(csv_path, 'r', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f)
    for row in reader:
        try:
            record = parse_row(row)
        except ValidationError as e:
            print(f"Warning: Skipping row {row}: {e}")
            continue
```

### Date Handling

Dates may be in various formats or empty:

```python
from datetime import datetime, timezone

# Handle empty dates
date_str = row.get('toegekend_op', '').strip()
assigned_date = datetime.fromisoformat(date_str) if date_str else None

# Provenance extraction_date is required (use the current time)
extraction_date = datetime.now(timezone.utc)
```

### Testing Strategies

1. **Unit tests**: Test model validation with known inputs
2. **Integration tests**: Test full file parsing with fixtures
3. **Edge case tests**: Empty files, malformed rows, minimal data
4. **Real data tests**: Always validate with actual CSV files

Fixture scope matters:
```python
# ❌ WRONG - Class-scoped fixture not available to other classes
class TestFoo:
    @pytest.fixture
    def sample_file(self):
        ...

# ✅ CORRECT - Module-scoped fixture available to all test classes
@pytest.fixture
def sample_file():  # At module level, not in a class
    ...
```
## Next Steps for Agents

When continuing this project, agents should:

1. **Implement Parser Module** (`src/glam_extractor/parsers/`) ✅ **COMPLETE**
   - ✅ ISIL registry parser (10 tests, 84% coverage)
   - ✅ Dutch organizations parser (18 tests, 98% coverage)
   - ⏳ Conversation JSON parser (next priority)

2. **Implement Extractor Module** (`src/glam_extractor/extractors/`)
   - spaCy NER integration
   - Pattern-based identifier extraction
   - Institution type classifier
   - Relationship extractor

3. **Implement Geocoder Module** (`src/glam_extractor/geocoding/`)
   - Nominatim client with caching
   - GeoNames integration
   - Coordinate validation

4. **Implement Validator Module** (`src/glam_extractor/validators/`)
   - LinkML schema validator
   - Cross-reference validator (CSV vs. conversation)
   - Duplicate detector

5. **Implement Exporter Module** (`src/glam_extractor/exporters/`)
   - JSON-LD exporter
   - RDF/Turtle exporter
   - CSV exporter
   - Parquet exporter
   - SQLite database builder

6. **Create Test Fixtures** (`tests/fixtures/`)
   - Sample conversation JSONs
   - Expected extraction outputs
   - Validation test cases

7. **Document Agent Prompts** (`docs/agent-prompts/`)
   - Reusable prompts for common extraction tasks
   - Few-shot examples for LLM-based extraction
   - Quality review checklists
## Persistent Identifiers (GHCID)

**🚨 COLLISION RESOLUTION: NATIVE LANGUAGE NAME SUFFIX 🚨**

When multiple institutions generate the same base GHCID, collisions are resolved by appending the **full legal name in the native language, in snake_case format**.

**Collision Suffix Rules**:
- ✅ Use the institution's full official name in its native language
- ✅ Convert to snake_case (lowercase, underscores for spaces)
- ✅ Remove apostrophes, accents, commas, and other punctuation/diacritics
- ✅ Only add the suffix on collision (not by default)
- ✅ The first-added institution keeps the base GHCID; later additions get the name suffix

**Examples**:
- Base GHCID collision: `NL-NH-AMS-M-SM` (two museums with the "SM" abbreviation)
- First institution: `NL-NH-AMS-M-SM` (Stedelijk Museum, added first - no suffix)
- Second institution: `NL-NH-AMS-M-SM-science_museum_amsterdam` (added later - gets suffix)
**Name Normalization**:

```
"Musée d'Orsay" → "musee_dorsay"
"Biblioteca Nacional do Brasil" → "biblioteca_nacional_do_brasil"
"北京故宫博物院" → "beijing_gugong_bowuyuan" (pinyin transliteration)
"Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"
```
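For Latin-script names, the normalization above can be sketched with NFKD decomposition. Non-Latin scripts (such as the pinyin example) require a transliteration library and are not covered by this sketch.

```python
import re
import unicodedata

def normalize_name(name: str) -> str:
    """Strip diacritics and punctuation, lowercase, and snake_case a name."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks (the accents split off by NFKD)
    ascii_only = "".join(c for c in decomposed if not unicodedata.combining(c))
    lowered = ascii_only.lower()
    # Remove apostrophes, commas, and other punctuation; keep spaces
    no_punct = re.sub(r"[^a-z0-9\s]", "", lowered)
    return "_".join(no_punct.split())
```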
**Note**: The GHCID string (including any name suffix) gets hashed to a UUID, so the longer name is not visible to end users - they see only the UUID.

---

### 🚨 SETTLEMENT STANDARDIZATION: GEONAMES IS AUTHORITATIVE 🚨

**ALL settlement names in GHCID MUST be derived from GeoNames, not from source data.**

The GeoNames geographical database (`/data/reference/geonames.db`) is the **single source of truth** for:
- Settlement names (cities, towns, villages)
- Settlement 3-letter codes
- Administrative region codes (admin1 → ISO 3166-2)

**Why GeoNames?**
- **Consistency**: Same coordinates → same settlement → same GHCID component
- **Disambiguation**: Handles duplicate settlement names across regions
- **Standardization**: Provides ASCII-safe names for identifiers
- **Persistence**: Geographic reality is stable, ensuring GHCID stability

**Settlement Resolution Process**:

1. **Coordinates Available (Preferred)**: Use reverse geocoding to find the nearest GeoNames settlement
2. **Name Only (Fallback)**: Look up the settlement name in GeoNames with fuzzy matching
3. **Manual (Last Resort)**: Flag the entry with `settlement_code: XXX` for review
### 🚨 CRITICAL: GeoNames Feature Code Filtering 🚨

**NEVER use neighborhoods or districts (PPLX) for GHCID generation. ONLY use proper settlements (cities, towns, villages).**

GeoNames classifies populated places with feature codes. When reverse geocoding coordinates to find a settlement, you MUST filter by feature code to ensure you get a city/town/village, NOT a neighborhood or district.

**ALLOWED Feature Codes** (use these for GHCID settlements):

| Code | Description | Example |
|------|-------------|---------|
| **PPL** | Populated place (city/town/village) | Apeldoorn, Hamont, Lelystad |
| **PPLA** | Seat of first-order admin division | Provincial capitals |
| **PPLA2** | Seat of second-order admin division | Municipal seats |
| **PPLA3** | Seat of third-order admin division | District seats |
| **PPLA4** | Seat of fourth-order admin division | Sub-district seats |
| **PPLC** | Capital of a political entity | Amsterdam, Brussels |
| **PPLS** | Populated places (multiple) | Settlement clusters |
| **PPLG** | Seat of government | The Hague (when different from capital) |

**EXCLUDED Feature Codes** (NEVER use for GHCID):

| Code | Description | Why Excluded |
|------|-------------|--------------|
| **PPLX** | Section of populated place | Neighborhoods, districts, quarters (e.g., "Binnenstad", "Amsterdam Binnenstad") |

**Example of the Problem**:

```sql
-- BAD: Query without feature code filter returns neighborhoods
SELECT name, feature_code, population FROM cities
WHERE country_code='NL' ORDER BY distance LIMIT 1;
-- Result: "Binnenstad" (PPLX, pop 4,900) ❌ WRONG

-- GOOD: Query WITH feature code filter returns proper settlements
SELECT name, feature_code, population FROM cities
WHERE country_code='NL'
  AND feature_code IN ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')
ORDER BY distance LIMIT 1;
-- Result: "Apeldoorn" (PPL, pop 136,670) ✅ CORRECT
```

**Implementation in SQL**:

```sql
-- Correct reverse geocoding query with feature code filter
SELECT
    name, ascii_name, admin1_code, admin1_name,
    latitude, longitude, geonames_id, population, feature_code,
    ((latitude - ?) * (latitude - ?) + (longitude - ?) * (longitude - ?)) as distance_sq
FROM cities
WHERE country_code = ?
  AND feature_code IN ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')
ORDER BY distance_sq
LIMIT 1
```

**Verification**: Always check `feature_code` in location_resolution metadata:

```yaml
location_resolution:
  method: REVERSE_GEOCODE
  geonames_id: 2759706
  geonames_name: Apeldoorn
  feature_code: PPL  # ← MUST be PPL, PPLA, PPLA2, PPLA3, PPLA4, PPLC, PPLS, or PPLG
  admin1_code: '03'
  region_code: GE
  country_code: NL
```

**If you see `feature_code: PPLX`**, the GHCID is WRONG and must be regenerated.

### Country Code Detection for GeoNames Lookups

**CRITICAL**: Determine country code from entry data BEFORE calling GeoNames reverse geocoding.

GeoNames queries are country-specific. Using the wrong country code will return incorrect results or no results.

**Country Code Resolution Priority**:

1. `zcbs_enrichment.country` - Most explicit source
2. `location.country` - Direct location field
3. `locations[].country` - Array location field
4. `original_entry.country` - CSV source field
5. `google_maps_enrichment.address` - Parse country from address string (", Belgium", ", Germany")
6. `wikidata_enrichment.located_in.label` - Infer from Wikidata location
7. Default: `"NL"` (Netherlands) - Only if no other source available

**Example Country Detection Code**:

```python
# Determine country code FIRST
country_code = "NL"  # Default

if entry.get('zcbs_enrichment', {}).get('country'):
    country_code = entry['zcbs_enrichment']['country']
elif entry.get('location', {}).get('country'):
    country_code = entry['location']['country']
elif entry.get('google_maps_enrichment', {}).get('address', ''):
    address = entry['google_maps_enrichment']['address']
    if ', Belgium' in address or ', België' in address:
        country_code = "BE"
    elif ', Germany' in address or ', Deutschland' in address:
        country_code = "DE"

# THEN call reverse geocoding with correct country
result = reverse_geocode_to_city(latitude, longitude, country_code)
```

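Combining the country detection above with the feature-code filter, a `reverse_geocode_to_city` helper might be sketched like this against the `cities` table of `geonames.db`. The column names are taken from the SQL above; the exact schema and function signature are assumptions for illustration:

```python
import sqlite3

# Proper settlements only - PPLX (neighborhoods/districts) deliberately absent
ALLOWED_FEATURE_CODES = ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')

def reverse_geocode_to_city(lat, lon, country_code, db_path="/data/reference/geonames.db"):
    """Return the nearest proper settlement row for coordinates within a country."""
    placeholders = ','.join('?' * len(ALLOWED_FEATURE_CODES))
    sql = f"""
        SELECT name, ascii_name, admin1_code, geonames_id, feature_code,
               ((latitude - ?) * (latitude - ?) + (longitude - ?) * (longitude - ?)) AS distance_sq
        FROM cities
        WHERE country_code = ?
          AND feature_code IN ({placeholders})
        ORDER BY distance_sq
        LIMIT 1
    """
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            sql, (lat, lat, lon, lon, country_code, *ALLOWED_FEATURE_CODES)
        ).fetchone()
```

Even when a PPLX entry is geographically closer (the "Binnenstad" problem above), the filter guarantees a city/town/village is returned.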
**GHCID Settlement Code Format**:

```
NL-{REGION}-{SETTLEMENT}-{TYPE}-{ABBREV}
            ^^^^^^^^^^^^
            3-letter code from GeoNames
```

**Code Generation Rules**:
- Single word: First 3 letters → `Amsterdam` = `AMS`, `Lelystad` = `LEL`
- Dutch article (`de`, `het`, `den`, `'s`): Article initial + 2 from main word → `Den Haag` = `DHA`
- Multi-word: Initials (up to 3) → `Nieuw Amsterdam` = `NAM`

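The three rules can be sketched in a few lines. Note that the padding of two-word names to three letters (so that `Nieuw Amsterdam` yields `NAM`, not `NA`) is inferred from the example, not stated explicitly above:

```python
DUTCH_ARTICLES = {"de", "het", "den", "der", "des", "'s"}

def settlement_code(name: str) -> str:
    """3-letter GHCID settlement code from a GeoNames ASCII name (sketch)."""
    words = name.split()
    if len(words) == 1:
        return words[0][:3].upper()                  # Amsterdam → AMS
    if words[0].lower() in DUTCH_ARTICLES:
        return (words[0][0] + words[1][:2]).upper()  # Den Haag → DHA
    initials = ''.join(w[0] for w in words[:3])
    # Pad from the last word so two-word names still reach 3 letters
    return (initials + words[-1][1:])[:3].upper()    # Nieuw Amsterdam → NAM
```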
**Historical Custodians - Measurement Point Rule**:

For heritage custodians that no longer exist or have historical locations:
- Use the **modern-day settlement** (as of 2025-12-01) where the coordinates fall
- GeoNames reflects current geographic reality
- Historical place names should NOT be used for GHCID generation

Example: A museum operating 1900-1950 in what is now Lelystad (before Flevoland existed) uses `LEL`, not historical names.

### 🚨 CRITICAL: XXX Placeholders Are TEMPORARY - Research Required 🚨

**XXX placeholders for region/settlement codes are NEVER acceptable as a final state. They indicate missing data that MUST be researched and resolved.**

When you encounter or generate entries with `XX` (unknown region) or `XXX` (unknown settlement):

**Step 1: Identify the Last Known Physical Location**

For **destroyed/historical institutions**:
- Use the **last recorded physical location** where the institution operated
- Example: Gaza Cultural Center destroyed in 2024 → use Gaza City coordinates (`PS-GZ-GAZ-M-GCC`)

For **refugee/diaspora organizations**:
- Use the location of their **current headquarters** OR **original founding location**
- Document which location type was used in `location_resolution.notes`

For **digital-only platforms**:
- Use the location of the **parent/founding organization**
- Example: Interactive Encyclopedia of Palestine Question → Institute for Palestine Studies → Beirut (`LB-BA-BEI-D-IEPQ`)

**Step 2: Research Sources (Priority Order)**

1. **Wikidata** - Search for the institution, check P131 (located in) or P159 (headquarters location)
2. **Google Maps** - Search institution name, extract coordinates
3. **Official Website** - Look for contact page, about page with address
4. **Web Archive** - Use archive.org for destroyed/closed institutions
5. **Academic Sources** - Papers, reports mentioning the institution
6. **News Articles** - Particularly useful for destroyed heritage sites

**Step 3: Update Entry with Resolved Location**

```yaml
# BEFORE (unacceptable)
ghcid:
  ghcid_current: PS-XX-XXX-A-NAPR
location_resolution:
  method: NAME_LOOKUP
  country_code: PS
  region_code: XX
  city_code: XXX

# AFTER (properly researched)
ghcid:
  ghcid_current: PS-GZ-GAZ-A-NAPR
location_resolution:
  method: MANUAL_RESEARCH
  country_code: PS
  region_code: GZ
  region_name: Gaza Strip
  city_code: GAZ
  city_name: Gaza City
  geonames_id: 281133
  research_date: "2025-12-06T00:00:00Z"
  research_sources:
    - type: wikidata
      id: Q123456
      claim: P131
    - type: web_archive
      url: https://web.archive.org/web/20231001/https://institution-website.org/contact
  notes: "Located in Gaza City prior to destruction in 2024"
```

**Step 4: Rename File to Match New GHCID**

Files MUST be renamed when GHCID changes:
```bash
# Old file
data/custodian/PS-XX-XXX-A-NAPR.yaml

# New file after research
data/custodian/PS-GZ-GAZ-A-NAPR.yaml
```

**Common XXX Placeholder Scenarios and Solutions**:

| Scenario | Solution |
|----------|----------|
| Destroyed Gaza institution | Use pre-destruction coordinates (Gaza City, Khan Yunis, etc.) |
| Refugee archive (diaspora) | Use current headquarters OR founding camp location |
| Digital platform (online only) | Use parent organization headquarters |
| Decentralized initiative | Use founding location or primary organizer location |
| Historical institution (closed) | Use last operating location |
| Institution with country but no city | Research using name + country in Wikidata/Google |

**NEVER**:
- ❌ Leave XXX placeholders in production data
- ❌ Use "Online" or "Palestine" as location values
- ❌ Skip location research because it's "difficult"
- ❌ Use XX/XXX for diaspora organizations (they have real locations)

**ALWAYS**:
- ✅ Document research sources in `location_resolution.research_sources`
- ✅ Add notes explaining location choice for complex cases
- ✅ Update GHCID history when location is resolved
- ✅ Rename files to match corrected GHCID

**Netherlands Admin1 Code Mapping** (GeoNames → ISO 3166-2):

| GeoNames | Province | ISO Code |
|----------|----------|----------|
| 01 | Drenthe | DR |
| 02 | Friesland | FR |
| 03 | Gelderland | GE |
| 04 | Groningen | GR |
| 05 | Limburg | LI |
| 06 | Noord-Brabant | NB |
| 07 | Noord-Holland | NH |
| 09 | Utrecht | UT |
| 10 | Zeeland | ZE |
| 11 | Zuid-Holland | ZH |
| 15 | Overijssel | OV |
| 16 | Flevoland | FL |

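The mapping table above is small enough to inline in code. A sketch with a hypothetical `region_code` helper that falls back to the `XX` placeholder (flagging the entry for research) when an admin1 code is unmapped:

```python
# GeoNames admin1 codes for NL mapped to ISO 3166-2 province codes
NL_ADMIN1_TO_ISO = {
    "01": "DR", "02": "FR", "03": "GE", "04": "GR", "05": "LI", "06": "NB",
    "07": "NH", "09": "UT", "10": "ZE", "11": "ZH", "15": "OV", "16": "FL",
}

def region_code(admin1_code: str) -> str:
    """ISO region code for the GHCID; 'XX' flags an unmapped admin1 for review."""
    return NL_ADMIN1_TO_ISO.get(admin1_code, "XX")
```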
**Provenance Tracking**: Record GeoNames resolution in entry metadata:

```yaml
location_resolution:
  method: REVERSE_GEOCODE  # or NAME_LOOKUP or MANUAL
  geonames_id: 2751792
  geonames_name: Lelystad
  settlement_code: LEL
  admin1_code: "16"
  region_code: FL
  resolution_date: "2025-12-01T00:00:00Z"
```

**See**: `.opencode/GEONAMES_SETTLEMENT_RULES.md` for complete documentation.

---

### 🚨 INSTITUTION ABBREVIATION: EMIC NAME FIRST-LETTER PROTOCOL 🚨

**The institution abbreviation component uses the FIRST LETTER of each significant word in the official emic (native language) name.**

**⚠️ GRANDFATHERING POLICY (PID STABILITY)**

Existing GHCIDs created before December 2025 are **grandfathered** - their abbreviations will NOT be updated even if derived from English translations rather than emic names. This preserves PID stability per the "Cool URIs Don't Change" principle.

**Applies to:**
- 817 UNESCO Memory of the World custodian files enriched with `custodian_name.emic_name`
- Abbreviations like `NLP` (National Library of Peru) remain unchanged even though the emic name is "Biblioteca Nacional del Perú" (which would yield `BNP`)

**For NEW custodians only:** Apply the emic name abbreviation protocol described below.

**Abbreviation Rules**:
1. Use the **CustodianName** (official emic name), NOT an English translation
2. Take the **first letter** of each word
3. **Skip prepositions, articles, and conjunctions** in all languages
4. **Skip digits and numeric tokens** (e.g., "40-45", "1945", "III")
5. Convert to **UPPERCASE**
6. Remove accents/diacritics (á→A, ñ→N, ö→O)
7. Maximum **10 characters**

**Skipped Words** (prepositions/articles/conjunctions by language):
- **Dutch**: de, het, een, van, voor, in, op, te, den, der, des, 's, aan, bij, met, naar, om, tot, uit, over, onder, door, en, of
- **English**: a, an, the, of, in, at, on, to, for, with, from, by, as, under, and, or, but
- **French**: le, la, les, un, une, des, de, d, du, à, au, aux, en, dans, sur, sous, pour, par, avec, l, et, ou
- **German**: der, die, das, den, dem, des, ein, eine, einer, einem, einen, von, zu, für, mit, bei, nach, aus, vor, über, unter, durch, und, oder
- **Spanish**: el, la, los, las, un, una, unos, unas, de, del, a, al, en, con, por, para, sobre, bajo, y, o, e, u
- **Portuguese**: o, a, os, as, um, uma, uns, umas, de, do, da, dos, das, em, no, na, nos, nas, para, por, com, sobre, sob, e, ou
- **Italian**: il, lo, la, i, gli, le, un, uno, una, di, del, dello, della, dei, degli, delle, a, al, allo, alla, ai, agli, alle, da, dal, dallo, dalla, dai, dagli, dalle, in, nel, nello, nella, nei, negli, nelle, su, sul, sullo, sulla, sui, sugli, sulle, con, per, tra, fra, e, ed, o, od

**TODO**: Expand to comprehensive global coverage for all ISO 639-1 languages as the project expands.

**Examples**:

| Emic Name | Abbreviation | Explanation |
|-----------|--------------|-------------|
| Heemkundige Kring De Goede Stede | HKGS | Skip "De" |
| De Hollandse Cirkel | HC | Skip "De" |
| Historische Vereniging Nijeveen | HVN | All significant words |
| Rijksmuseum Amsterdam | RA | All significant words |
| Musée d'Orsay | MO | Skip "d'" (d = de) |
| Biblioteca Nacional do Brasil | BNB | Skip "do" |
| L'Académie française | AF | Skip "L'" |
| Museum van de Twintigste Eeuw | MTE | Skip "van", "de" |
| Koninklijke Bibliotheek van België | KBB | Skip "van" |

**GHCID Format with Abbreviation**:

```
NL-{REGION}-{SETTLEMENT}-{TYPE}-{ABBREV}
                                ^^^^^^^^
                                First letter of each significant word in emic name
```

**Implementation**: See `src/glam_extractor/identifiers/ghcid.py:extract_abbreviation_from_name()`

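A condensed sketch of what `extract_abbreviation_from_name()` might look like, using only a small illustrative subset of the skip-word lists above (the real implementation lives in `ghcid.py` and carries the full multilingual lists):

```python
import re
import unicodedata

# Illustrative subset (Dutch/French/English) of the skip-word lists above
SKIP_WORDS = {"de", "het", "van", "den", "der", "des", "d", "l", "la", "le",
              "du", "the", "of", "and", "in", "a", "an"}

def extract_abbreviation_from_name(name: str, skip_words=SKIP_WORDS, max_len=10) -> str:
    """First letter of each significant word in an emic name (sketch)."""
    # Strip diacritics: NFD decomposition, then drop combining marks
    ascii_name = ''.join(c for c in unicodedata.normalize('NFD', name)
                         if unicodedata.category(c) != 'Mn')
    letters = []
    # Splitting on non-alphanumerics also handles apostrophes, "&", "/", "+"
    for word in re.split(r"[^A-Za-z0-9]+", ascii_name):
        if not word or word.lower() in skip_words or word[0].isdigit():
            continue  # skip articles/prepositions and numeric tokens
        letters.append(word[0].upper())
    return ''.join(letters)[:max_len]
```

With these skip words, the table's examples come out as expected: "Heemkundige Kring De Goede Stede" → `HKGS`, "Musée d'Orsay" → `MO`, "Museum van de Twintigste Eeuw" → `MTE`.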
### 🚨 CRITICAL: Special Characters MUST Be Excluded from Abbreviations 🚨

**When generating abbreviations for GHCID, special characters and symbols MUST be completely removed. Only alphabetic characters (A-Z) are permitted in the abbreviation component.**

**RATIONALE**:
1. **URL/URI safety** - Special characters require encoding in URIs
2. **Filename safety** - Characters like `&`, `/`, `\`, `:` are invalid in filenames
3. **Parsing consistency** - Avoids delimiter conflicts in data pipelines
4. **Cross-system compatibility** - Ensures interoperability with all systems
5. **Human readability** - Clean identifiers are easier to communicate

**CHARACTERS TO REMOVE** (exhaustive list):
- **Ampersand**: `&` (e.g., "Records & Archives" → "RA", NOT "R&A")
- **Slash**: `/` (e.g., "Art/Design Museum" → "ADM", NOT "A/DM")
- **Backslash**: `\`
- **Plus**: `+` (e.g., "Culture+" → "C")
- **At sign**: `@`
- **Hash/Pound**: `#`
- **Percent**: `%`
- **Dollar**: `$`
- **Asterisk**: `*`
- **Parentheses**: `( )`
- **Brackets**: `[ ] { }`
- **Pipe**: `|`
- **Colon**: `:`
- **Semicolon**: `;`
- **Quotation marks**: `" ' \``
- **Comma**: `,`
- **Period**: `.` (unless part of abbreviation like "U.S." → "US")
- **Hyphen**: `-` (skip, do not replace with letter)
- **Underscore**: `_`
- **Equals**: `=`
- **Question mark**: `?`
- **Exclamation**: `!`
- **Tilde**: `~`
- **Caret**: `^`
- **Less/Greater than**: `< >`

**EXAMPLES**:

| Source Name | Correct Abbreviation | Incorrect (WRONG) |
|-------------|---------------------|-------------------|
| Department of Records & Information Management | DRIM | DR&IM ❌ |
| Art + Culture Center | ACC | A+CC ❌ |
| Museum/Gallery Amsterdam | MGA | M/GA ❌ |
| Heritage@Digital | HD | H@D ❌ |
| Archives (Historical) | AH | A(H) ❌ |
| Research & Development Institute | RDI | R&DI ❌ |

**REAL-WORLD EXAMPLE** (from `data/custodian/SX-XX-PHI-O-DR&IMSM.yaml`):

```yaml
# INCORRECT (current file - needs correction):
ghcid_current: SX-XX-PHI-O-DR&IMSM  # ❌ Contains "&"

# CORRECT (should be):
ghcid_current: SX-XX-PHI-O-DRIMSM  # ✅ Alphabetic only
```

**Implementation**: When extracting first letters from words containing special characters:
1. Split the word on special characters: "Records&Information" → ["Records", "Information"]
2. Take first letter from each resulting segment: "R" + "I" = "RI"
3. Or skip the special character entirely and treat as one word if no space around it

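The segment-splitting step can be expressed in one regex; `split_on_specials` is a hypothetical helper name used only for this sketch:

```python
import re

def split_on_specials(word: str) -> list:
    """Split a token on embedded special characters before taking first letters."""
    return [seg for seg in re.split(r"[^A-Za-z]+", word) if seg]

# "Records&Information" contributes "R" + "I" to the abbreviation, never "&"
segments = split_on_specials("Records&Information")
first_letters = ''.join(seg[0].upper() for seg in segments)
# first_letters == "RI"
```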
**See**: `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` for complete documentation

### 🚨 CRITICAL: Diacritics MUST Be Normalized to ASCII in Abbreviations 🚨

**When generating abbreviations for GHCID, diacritics (accented characters) MUST be normalized to their ASCII base letter equivalents. Only ASCII uppercase letters (A-Z) are permitted.**

This rule applies to ALL languages with diacritical marks including Czech, Polish, German, French, Spanish, Portuguese, Nordic languages, Hungarian, Romanian, Turkish, and others.

**RATIONALE**:
1. **URI/URL safety** - Non-ASCII characters require percent-encoding
2. **Cross-system compatibility** - ASCII is universally supported
3. **Filename safety** - Some systems have issues with non-ASCII filenames
4. **Human readability** - Easier to type and communicate

**DIACRITICS NORMALIZATION TABLE**:

| Language | Diacritics | ASCII Equivalent |
|----------|------------|------------------|
| **Czech** | Č, Ř, Š, Ž, Ě, Ů | C, R, S, Z, E, U |
| **Polish** | Ł, Ń, Ó, Ś, Ź, Ż, Ą, Ę | L, N, O, S, Z, Z, A, E |
| **German** | Ä, Ö, Ü, ß | A, O, U, SS |
| **French** | É, È, Ê, Ç, Ô, Â | E, E, E, C, O, A |
| **Spanish** | Ñ, Á, É, Í, Ó, Ú | N, A, E, I, O, U |
| **Portuguese** | Ã, Õ, Ç, Á, É | A, O, C, A, E |
| **Nordic** | Å, Ä, Ö, Ø, Æ | A, A, O, O, AE |
| **Hungarian** | Á, É, Í, Ó, Ö, Ő, Ú, Ü, Ű | A, E, I, O, O, O, U, U, U |
| **Turkish** | Ç, Ğ, İ, Ö, Ş, Ü | C, G, I, O, S, U |
| **Romanian** | Ă, Â, Î, Ș, Ț | A, A, I, S, T |

**REAL-WORLD EXAMPLE** (Czech institution):

```yaml
# INCORRECT - Contains diacritics:
ghcid_current: CZ-VY-TEL-L-VHSPAOČRZS  # ❌ Contains "Č"

# CORRECT - ASCII only:
ghcid_current: CZ-VY-TEL-L-VHSPAOCRZS  # ✅ "Č" → "C"
```

**IMPLEMENTATION**:

```python
import unicodedata

# Letters that do NOT decompose under NFD (they are standalone letters, not
# base + combining mark) and therefore need an explicit mapping
SPECIAL = str.maketrans({'ß': 'SS', 'Ø': 'O', 'ø': 'o', 'Æ': 'AE', 'æ': 'ae',
                         'Ł': 'L', 'ł': 'l', 'Þ': 'T', 'þ': 't', 'Đ': 'D', 'đ': 'd'})

def normalize_diacritics(text: str) -> str:
    """Normalize diacritics to ASCII equivalents."""
    text = text.translate(SPECIAL)
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', text)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    return ascii_text

# Example
normalize_diacritics("VHSPAOČRZS")  # Returns "VHSPAOCRZS"
```

Note: plain NFD decomposition alone would leave ß, Ø, Æ, Ł, and Þ untouched, which is why the explicit `SPECIAL` mapping is applied first - it makes the table's German, Nordic, Polish, and Icelandic rows hold.

**EXAMPLES**:

| Emic Name (with diacritics) | Abbreviation | Wrong |
|-----------------------------|--------------|-------|
| Vlastivědné muzeum v Šumperku | VMS | VMŠ ❌ |
| Österreichische Nationalbibliothek | ON | ÖN ❌ |
| Bibliothèque nationale de France | BNF | BNF (OK - è not in first letter) |
| Múzeum Łódzkie | ML | MŁ ❌ |
| Þjóðminjasafn Íslands | TI | ÞI ❌ |

**See**: `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` for complete documentation (covers both special characters and diacritics)

### 🚨 CRITICAL: Non-Latin Scripts MUST Be Transliterated Before Abbreviation 🚨

**When generating GHCID abbreviations from institution names in non-Latin scripts (Cyrillic, Chinese, Japanese, Korean, Arabic, Hebrew, Greek, Devanagari, Thai, etc.), the emic name MUST first be transliterated to Latin characters using ISO or recognized standards.**

This rule affects **170 institutions** across **21 languages** with non-Latin writing systems.

**CORE PRINCIPLE**: The emic name is PRESERVED in original script in `custodian_name.emic_name`. Transliteration is only used for abbreviation generation.

**TRANSLITERATION STANDARDS BY SCRIPT**:

| Script | Languages | Standard | Example |
|--------|-----------|----------|---------|
| **Cyrillic** | ru, uk, bg, sr, kk | ISO 9:1995 | Институт → Institut |
| **Chinese** | zh | Hanyu Pinyin (ISO 7098) | 东巴文化博物院 → Dongba Wenhua Bowuyuan |
| **Japanese** | ja | Modified Hepburn | 国立博物館 → Kokuritsu Hakubutsukan |
| **Korean** | ko | Revised Romanization | 독립기념관 → Dongnip Ginyeomgwan |
| **Arabic** | ar, fa, ur | ISO 233-2/3 | المكتبة الوطنية → al-Maktaba al-Wataniya |
| **Hebrew** | he | ISO 259-3 | ארכיון → Arkhiyon |
| **Greek** | el | ISO 843 | Μουσείο → Mouseio |
| **Devanagari** | hi, ne | ISO 15919 | राजस्थान → Rajasthana |
| **Bengali** | bn | ISO 15919 | বাংলাদেশ → Bangladesh |
| **Thai** | th | ISO 11940-2 | สำนักหอ → Samnak Ho |
| **Armenian** | hy | ISO 9985 | Մատենադարան → Matenadaran |
| **Georgian** | ka | ISO 9984 | ხელნაწერთა → Khelnawerti |

**WORKFLOW**:

```
1. Emic Name (original script)
        ↓
2. Transliterate to Latin (ISO standard)
        ↓
3. Normalize diacritics (remove accents)
        ↓
4. Skip articles/prepositions
        ↓
5. Extract first letters → Abbreviation
```

**EXAMPLES**:

| Language | Emic Name | Transliterated | Abbreviation |
|----------|-----------|----------------|--------------|
| **Russian** | Институт восточных рукописей РАН | Institut Vostochnykh Rukopisey RAN | IVRR |
| **Chinese** | 东巴文化博物院 | Dongba Wenhua Bowuyuan | DWB |
| **Korean** | 독립기념관 | Dongnip Ginyeomgwan | DG |
| **Hindi** | राजस्थान प्राच्यविद्या प्रतिष्ठान | Rajasthana Pracyavidya Pratishthana | RPP |
| **Arabic** | المكتبة الوطنية للمملكة المغربية | al-Maktaba al-Wataniya lil-Mamlaka al-Maghribiya | MWMM |
| **Hebrew** | ארכיון הסיפור העממי בישראל | Arkhiyon ha-Sipur ha-Amami be-Yisrael | ASAY |
| **Greek** | Αρχαιολογικό Μουσείο Θεσσαλονίκης | Archaiologiko Mouseio Thessalonikis | AMT |

**SCRIPT-SPECIFIC SKIP WORDS**:

| Language | Skip Words (Articles/Prepositions) |
|----------|-------------------------------------|
| **Arabic** | al- (the), bi-, li-, fi- (prepositions) |
| **Hebrew** | ha- (the), ve- (and), be-, le-, me- |
| **Persian** | -e, -ye (ezafe connector), va (and) |
| **CJK** | None (particles integral to meaning) |

**IMPLEMENTATION**:

```python
from transliteration import transliterate_for_abbreviation

# Input: emic name in non-Latin script + language code
emic_name = "Институт восточных рукописей РАН"
lang = "ru"

# Step 1: Transliterate to Latin using ISO standard
latin = transliterate_for_abbreviation(emic_name, lang)
# Result: "Institut Vostochnykh Rukopisey RAN"

# Step 2: Apply standard abbreviation extraction
abbreviation = extract_abbreviation_from_name(latin)
# Result: "IVRR" - first letters of all four significant words;
# "RAN" is an acronym, so only its initial "R" is taken
```

**GRANDFATHERING POLICY**: Existing abbreviations from 817 UNESCO MoW custodians are grandfathered. This transliteration standard applies only to **NEW custodians** created after December 2025.

**See**: `.opencode/TRANSLITERATION_STANDARDS.md` for complete ISO standards, mapping tables, and Python implementation

---

GHCID uses a **four-identifier strategy** for maximum flexibility and transparency:

### Four Identifier Formats

1. **UUID v5 (SHA-1)** - **PRIMARY** persistent identifier
   - Deterministic (same GHCID string → same UUID)
   - RFC 4122 standard, universal library support
   - Transparent algorithm (anyone can verify)
   - Field: `ghcid_uuid`

2. **UUID v8 (SHA-256)** - Secondary persistent identifier (future-proofing)
   - Deterministic with stronger cryptographic hash
   - SOTA security compliance
   - Field: `ghcid_uuid_sha256`

3. **UUID v7** - Database record ID ONLY (NOT for persistent identification)
   - Time-ordered for database performance
   - NOT deterministic (different each time)
   - Use for database primary keys, NOT for citations or cross-system references
   - Field: `record_id`

4. **Numeric (64-bit)** - Compact identifier for CSV exports
   - Deterministic (SHA-256 → 64-bit integer)
   - Database optimization, spreadsheet-friendly
   - Field: `ghcid_numeric`

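The deterministic formats can all be derived from one GHCID string. This sketch uses an illustrative namespace UUID and assumes the 64-bit numeric form is taken from the first 8 bytes of the SHA-256 digest - the project's exact derivation (and its UUID v8 construction) may differ:

```python
import hashlib
import uuid

# Illustrative namespace; the project pins its own fixed namespace UUID
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://w3id.org/heritage/ghcid")

def ghcid_identifiers(ghcid: str) -> dict:
    """Deterministic identifier set for one GHCID string (sketch)."""
    sha256 = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return {
        "ghcid": ghcid,
        "ghcid_uuid": str(uuid.uuid5(NAMESPACE, ghcid)),     # UUID v5 (SHA-1) - primary
        "ghcid_numeric": int.from_bytes(sha256[:8], "big"),  # 64-bit numeric
        # UUID v8 (SHA-256) layout is project-specific; UUID v7 (record_id)
        # is generated at database insertion time, never here.
    }
```

Calling this twice with the same GHCID must always return the same values - that determinism is what makes the identifiers citable.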
### Critical Understanding: UUID v5 is Primary

**Why UUID v5 (SHA-1) over UUID v8 (SHA-256)?**

The primary identifier is **UUID v5** because:

- ✅ **Transparency** - Anyone can verify using standard `uuid.uuid5()` function
- ✅ **Reproducibility** - No custom algorithm to share, RFC 4122 defines it
- ✅ **Interoperability** - Every programming language has built-in UUID v5 support
- ✅ **Community Trust** - Public, standardized algorithm builds confidence

**SHA-1 Safety for Identifiers:**

SHA-1 is deprecated for **cryptographic security** (digital signatures, TLS, passwords) but **appropriate for identifier generation**:

- Heritage institution identifiers are **non-adversarial** (no attacker trying to forge museum IDs)
- 128-bit collision resistance is sufficient (P(collision) ≈ 1.5×10^-29 for 1M institutions)
- RFC 4122 (UUID v5) remains active standard (not deprecated by IETF)
- See **[Why GHCID Uses UUID v5 and SHA-1](docs/WHY_UUID_V5_SHA1.md)** for detailed rationale

**Future-Proofing:**

- We generate **both UUID v5 and UUID v8** for every institution
- Can migrate to SHA-256 primary if RFC 4122 is updated
- Both are deterministic - no data loss in migration

### When Extracting Data

Agents should generate ALL four identifiers for every institution:

```yaml
# Example extraction output
- id: https://w3id.org/heritage/custodian/br/bnb-001
  name: Biblioteca Nacional do Brasil
  ghcid: BR-RJ-RIO-L-BNB
  ghcid_uuid: "550e8400-e29b-41d4-a716-446655440000"  # UUID v5 - PRIMARY
  ghcid_uuid_sha256: "a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d"  # UUID v8 - Secondary
  ghcid_numeric: 213324328442227739  # 64-bit numeric
  # Note: UUID v7 (record_id) generated at database insertion, not during extraction
```

### GHCID Collision Handling for AI Agents

**CRITICAL**: When extracting heritage institution data, AI agents MUST understand and apply temporal collision resolution rules to maintain PID stability.

#### The Collision Problem

Multiple institutions may generate the same base GHCID (before name suffix addition):
- Two museums in Amsterdam abbreviated "SM": `NL-NH-AMS-M-SM`
- Two historical societies in Utrecht: `NL-UT-UTR-S-HK`
- Two libraries in São Paulo abbreviated "BM": `BR-SP-SAO-L-BM`

#### Decision Tree for Collision Resolution

When extracting data, agents should follow this decision process:

```
1. Generate base GHCID (without name suffix)
   ↓
2. Check if base GHCID exists in published dataset
   ↓
   NO → Use base GHCID as-is, record extraction_date
   ↓
   YES → Temporal priority check
   ↓
3. Compare extraction_date with existing publication_date
   ↓
   SAME DATE (batch import) → First Batch Collision
   ├─ ALL institutions get name suffixes
   ├─ Convert native language name to snake_case
   └─ Append to GHCID: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
   ↓
   LATER DATE (historical addition) → Historical Addition
   ├─ PRESERVE existing GHCID (no modification)
   ├─ ONLY new institution gets name suffix
   └─ New GHCID: NL-NH-AMS-M-SM-science_museum_amsterdam
```

#### Implementation Rules for Agents

**Rule 1: Always Track Provenance Timestamp**

```yaml
provenance:
  data_source: CONVERSATION_NLP
  data_tier: TIER_4_INFERRED
  extraction_date: "2025-11-15T14:30:00Z"  # ← REQUIRED for collision detection
  extraction_method: "AI agent NER extraction"
  confidence_score: 0.92
```

**Rule 2: Detect Collisions by Base GHCID**

Before adding name suffixes, group institutions by base GHCID:

```python
# Collision detection pseudocode for agents
base_ghcid = generate_base_ghcid(institution)  # Without name suffix
existing_records = published_dataset.filter(base_ghcid=base_ghcid)

if len(existing_records) > 0:
    # Collision detected - apply temporal priority
    apply_collision_resolution(institution, existing_records)
```

**Rule 3: First Batch - ALL Get Name Suffixes**

If ALL colliding institutions have the **same** `extraction_date`:

```yaml
# Example: 2025-11-01 batch import discovers two institutions
- name: Stedelijk Museum Amsterdam
  ghcid: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam  # Gets name suffix
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"

- name: Science Museum Amsterdam
  ghcid: NL-NH-AMS-M-SM-science_museum_amsterdam  # Gets name suffix
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"  # Same date = first batch
```

**Rule 4: Historical Addition - ONLY New Gets Name Suffix**

If the new institution's `extraction_date` is **later** than the existing record's:

```yaml
# EXISTING (2025-11-01, already published):
- name: Hermitage Amsterdam
  ghcid: NL-NH-AMS-M-HM  # ← NO CHANGE (PID stability!)
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"

# NEW (2025-11-15, historical addition):
- name: Historical Museum Amsterdam
  ghcid: NL-NH-AMS-M-HM-historical_museum_amsterdam  # ← ONLY new gets name suffix
  provenance:
    extraction_date: "2025-11-15T14:30:00Z"
```

#### Name Suffix Generation

**Converting institution names to snake_case suffixes:**

```python
import re
import unicodedata

def generate_name_suffix(native_name: str) -> str:
    """Convert native language institution name to snake_case suffix.

    Examples:
        "Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
        "Musée d'Orsay" → "musee_dorsay"
        "Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"
    """
    # Normalize unicode (NFD decomposition) and remove diacritics
    normalized = unicodedata.normalize('NFD', native_name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

    # Convert to lowercase
    lowercase = ascii_name.lower()

    # Remove apostrophes, commas, and other punctuation
    no_punct = re.sub(r"['’`\",.:;!?()[\]{}]", '', lowercase)

    # Replace spaces and hyphens with underscores
    underscored = re.sub(r'[\s\-]+', '_', no_punct)

    # Remove any remaining non-alphanumeric characters (except underscores)
    clean = re.sub(r'[^a-z0-9_]', '', underscored)

    # Collapse multiple underscores
    final = re.sub(r'_+', '_', clean).strip('_')

    return final
```

**Name suffix rules**:
- Use the institution's **full official name** in its **native language**
- Transliterate non-Latin scripts to ASCII (e.g., Pinyin for Chinese)
- Remove all diacritics (é → e, ö → o, ñ → n)
- Remove punctuation (apostrophes, commas, periods)
- Replace spaces with underscores
- All lowercase

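The diacritics rule relies on Unicode NFD normalization, which splits an accented character into its base letter plus combining marks that can then be dropped (the same approach used by `generate_name_suffix` above). A minimal illustration:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD splits "é" into "e" + combining U+0301; category 'Mn' marks the accents
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

print(strip_diacritics("Musée d'Orsay"))                       # Musee d'Orsay
print(strip_diacritics("Österreichische Nationalbibliothek"))  # Osterreichische Nationalbibliothek
```

Note that characters without a canonical decomposition (e.g. ø, ß, đ) pass through unchanged, which is why the rules above list transliteration as a separate step.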
#### GHCID History Tracking

When a name suffix is added to resolve a collision, update `ghcid_history`:

```yaml
ghcid_history:
  - ghcid: NL-NH-AMS-M-HM-historical_museum_amsterdam  # Current (with name suffix)
    ghcid_numeric: 789012345678
    valid_from: "2025-11-15T14:30:00Z"  # When name suffix added
    valid_to: null
    reason: "Name suffix added to resolve collision with existing NL-NH-AMS-M-HM (Hermitage Amsterdam)"

  - ghcid: NL-NH-AMS-M-HM  # Original (without name suffix)
    ghcid_numeric: 123456789012
    valid_from: "2025-11-15T14:00:00Z"  # When first extracted
    valid_to: "2025-11-15T14:30:00Z"  # When collision detected
    reason: "Base GHCID from geographic location and institution name"
```

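Maintaining this structure by hand is error-prone. A sketch of a helper that closes the currently valid entry and prepends the new one (a hypothetical function; field names follow the example above, newest entry first):

```python
from datetime import datetime, timezone

def add_ghcid_history_entry(record: dict, new_ghcid: str,
                            new_numeric: int, reason: str) -> None:
    """Prepend a new history entry and close the currently valid one."""
    now = datetime.now(timezone.utc).isoformat()
    history = record.setdefault('ghcid_history', [])
    if history:
        history[0]['valid_to'] = now  # close the currently valid entry
    history.insert(0, {
        'ghcid': new_ghcid,
        'ghcid_numeric': new_numeric,
        'valid_from': now,
        'valid_to': None,
        'reason': reason,
    })
    record['ghcid'] = new_ghcid
```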
#### PID Stability Principle - "Cool URIs Don't Change"

**NEVER modify a published GHCID.** Once exported to RDF, JSON-LD, or CSV, a GHCID becomes a persistent identifier that may be:

- **Cited in academic papers** - Journal articles referencing heritage collections
- **Used in external APIs** - Third-party systems querying our data
- **Embedded in linked data** - RDF triples in knowledge graphs
- **Referenced in finding aids** - Archival descriptions linking to institutions

Changing a published GHCID breaks these external references. Per W3C "Cool URIs Don't Change":

- ✅ **Correct**: Add name suffix to NEW institution (historical addition)
- ❌ **WRONG**: Retroactively add name suffix to EXISTING published GHCID

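One way to make this principle testable is a pre-publication guard that compares a merge result against the already-published dataset. A hypothetical sketch (keying records on a stable external identifier such as a Wikidata QID is an assumption, not project convention):

```python
def assert_pid_stability(published: dict, merged: dict) -> None:
    """Raise if a merge would change any already-published GHCID.

    Both arguments map a stable key (e.g. a Wikidata QID) to institution records.
    """
    for key, old in published.items():
        new = merged.get(key)
        if new is not None and new['ghcid'] != old['ghcid']:
            raise ValueError(
                f"PID stability violation for {key}: "
                f"{old['ghcid']} would become {new['ghcid']}"
            )
```

Running such a check in CI turns the "no published GHCIDs modified" rule from a convention into an enforced invariant.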
#### Error Handling for Agents

**Scenario 1: Missing Provenance Timestamp**

```python
from datetime import datetime, timezone
import logging

log = logging.getLogger(__name__)

if 'extraction_date' not in institution['provenance']:
    # Use current timestamp as fallback
    institution['provenance']['extraction_date'] = datetime.now(timezone.utc).isoformat()
    # Log warning for manual review
    log.warning(f"Missing extraction_date for {institution['name']}, using current time")
```

**Scenario 2: Multiple Historical Additions**

```python
# Three institutions generate NL-UT-UTR-S-HK
# Extraction dates: 2025-11-01, 2025-11-15, 2025-12-01

# Result:
# 2025-11-01: NL-UT-UTR-S-HK (first, no name suffix)
# 2025-11-15: NL-UT-UTR-S-HK-historische_kring_utrecht (second, gets name suffix)
# 2025-12-01: NL-UT-UTR-S-HK-heemkundige_kring_utrecht (third, gets name suffix)
```

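For this purely historical case (all extraction dates distinct), the sequence can be reproduced by resolving records in date order. The sketch below uses a simplified lowercase-and-underscore stand-in for the full `generate_name_suffix` logic, and the first institution's name is illustrative:

```python
def resolve_sequential(institutions: list[dict]) -> list[dict]:
    """First arrival keeps the bare base GHCID; later arrivals get a suffix."""
    assigned = set()
    resolved = []
    for inst in sorted(institutions, key=lambda i: i['provenance']['extraction_date']):
        ghcid = inst['base_ghcid']
        if ghcid in assigned:
            # Simplified stand-in for generate_name_suffix()
            ghcid = f"{ghcid}-{inst['name'].lower().replace(' ', '_')}"
        assigned.add(ghcid)
        resolved.append({**inst, 'ghcid': ghcid})
    return resolved
```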
**Scenario 3: Collision Resolution with Name Suffix**

```python
if collision_detected:
    # Generate name suffix from native language name
    name_suffix = generate_name_suffix(institution['name'])

    # Append to base GHCID
    ghcid = f"{base_ghcid}-{name_suffix}"  # e.g., NL-NH-AMS-M-HM-historical_museum_amsterdam

    # Record collision resolution
    institution['provenance']['notes'] = (
        f"Name suffix added to resolve collision with existing {base_ghcid}."
    )
```

#### Validation Checklist for Agents

Before publishing extracted data, verify:

- [ ] All institutions have `extraction_date` in provenance metadata
- [ ] Collisions detected by grouping on base GHCID (without name suffix)
- [ ] First batch collisions: ALL instances have name suffixes
- [ ] Historical additions: ONLY new instances have name suffixes
- [ ] No published GHCIDs modified (PID stability test)
- [ ] GHCID history entries created with valid temporal ordering
- [ ] Name suffixes derived from native language institution names
- [ ] Collision reasons documented in `ghcid_history`

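The first two checklist items lend themselves to automation. A sketch of a batch validator (hypothetical helper; field names follow the examples in this document):

```python
def validate_batch(institutions: list[dict]) -> list[str]:
    """Return a list of problems for the checklist above (empty list = pass)."""
    problems = []
    by_base: dict[str, list[str]] = {}
    for inst in institutions:
        if 'extraction_date' not in inst.get('provenance', {}):
            problems.append(f"{inst['name']}: missing extraction_date")
        by_base.setdefault(inst['base_ghcid'], []).append(inst['ghcid'])
    for base, ghcids in by_base.items():
        # Unresolved collision: more than one record still carries the bare base GHCID
        if ghcids.count(base) > 1:
            problems.append(f"{base}: collision not resolved with name suffixes")
    return problems
```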
#### Example Extraction Prompts for Agents

**Prompt Template for NLP Extraction**:

```
Extract heritage institutions from this conversation about [REGION] GLAM institutions.

For EACH institution:
1. Generate base GHCID using geographic location and institution type
2. Check for collisions with previously published GHCIDs
3. Apply temporal priority rule:
   - If collision with same extraction_date → First Batch (all get name suffixes)
   - If collision with earlier publication_date → Historical Addition (only new gets name suffix)
4. Generate snake_case name suffix from native language institution name
5. Create GHCID history entry documenting collision resolution
6. Include extraction_date in provenance metadata

Output: LinkML-compliant YAML with complete collision handling
```

**Prompt Template for CSV Parsing**:

```
Parse this heritage institution CSV file dated [DATE].

All rows have the same extraction_date ([DATE]).

If multiple institutions generate the same base GHCID:
- This is a FIRST BATCH collision
- ALL colliding institutions MUST receive name suffixes
- Generate name suffix from institution's native language name
- Document collision in ghcid_history

Output: YAML with collision resolution applied
```

#### Testing Strategies for Collision Handling

**Unit Test: First Batch Collision**

```python
def test_first_batch_collision():
    """Two institutions extracted same day with same base GHCID."""
    institutions = [
        {
            'name': 'Stedelijk Museum Amsterdam',
            'base_ghcid': 'NL-NH-AMS-M-SM',
            'identifiers': [{'identifier_scheme': 'Wikidata', 'identifier_value': 'Q621531'}],
            'provenance': {'extraction_date': '2025-11-01T10:00:00Z'}
        },
        {
            'name': 'Science Museum Amsterdam',
            'base_ghcid': 'NL-NH-AMS-M-SM',
            'identifiers': [{'identifier_scheme': 'Wikidata', 'identifier_value': 'Q98765432'}],
            'provenance': {'extraction_date': '2025-11-01T10:00:00Z'}
        }
    ]

    resolved = resolve_collisions(institutions)

    # Both should have name suffixes
    assert resolved[0]['ghcid'] == 'NL-NH-AMS-M-SM-stedelijk_museum_amsterdam'
    assert resolved[1]['ghcid'] == 'NL-NH-AMS-M-SM-science_museum_amsterdam'
```

**Unit Test: Historical Addition**

```python
def test_historical_addition():
    """New institution added later with same base GHCID."""
    published = {
        'name': 'Hermitage Amsterdam',
        'ghcid': 'NL-NH-AMS-M-HM',  # Already published
        'provenance': {'extraction_date': '2025-11-01T10:00:00Z'}
    }

    new_institution = {
        'name': 'Historical Museum Amsterdam',
        'base_ghcid': 'NL-NH-AMS-M-HM',  # Collision!
        'identifiers': [{'identifier_scheme': 'Wikidata', 'identifier_value': 'Q17339437'}],
        'provenance': {'extraction_date': '2025-11-15T14:30:00Z'}
    }

    resolved = resolve_collision(new_institution, published_dataset=[published])

    # Published GHCID unchanged
    assert published['ghcid'] == 'NL-NH-AMS-M-HM'

    # New institution gets name suffix
    assert resolved['ghcid'] == 'NL-NH-AMS-M-HM-historical_museum_amsterdam'

    # GHCID history created
    assert len(resolved['ghcid_history']) == 2
    assert resolved['ghcid_history'][0]['ghcid'] == 'NL-NH-AMS-M-HM-historical_museum_amsterdam'
```

#### References for Collision Handling

- **Specification**: `docs/PERSISTENT_IDENTIFIERS.md` - "Historical Collision Resolution" section
- **Algorithm**: `docs/plan/global_glam/07-ghcid-collision-resolution.md` - Temporal dimension and decision logic
- **Examples**: `docs/GHCID_PID_SCHEME.md` - Timeline examples with real institutions
- **Implementation**: `scripts/regenerate_historical_ghcids.py` - Code comments documenting collision handling
- **Schema**: `schemas/provenance.yaml` - `GHCIDHistoryEntry` and `ChangeEvent` classes

**See also:**
- `docs/PERSISTENT_IDENTIFIERS.md` - Complete identifier format documentation
- `docs/UUID_STRATEGY.md` - UUID v5 vs v7 vs v8 comparison
- `docs/WHY_UUID_V5_SHA1.md` - SHA-1 safety rationale

---

## References

- **Schema (v0.2.0)**:
  - Main: `schemas/heritage_custodian.yaml`
  - Core classes: `schemas/core.yaml`
  - Enumerations: `schemas/enums.yaml`
  - Provenance: `schemas/provenance.yaml`
  - Collections: `schemas/collections.yaml`
  - Dutch extensions: `schemas/dutch.yaml`
  - Architecture: `/docs/SCHEMA_MODULES.md`
- **Persistent Identifiers**:
  - Overview: `docs/PERSISTENT_IDENTIFIERS.md`
  - UUID Strategy: `docs/UUID_STRATEGY.md`
  - SHA-1 Rationale: `docs/WHY_UUID_V5_SHA1.md`
  - GHCID PID Scheme: `docs/GHCID_PID_SCHEME.md`
  - Collision Resolution: `docs/plan/global_glam/07-ghcid-collision-resolution.md`
- **Architecture**: `docs/plan/global_glam/02-architecture.md`
- **Data Standardization**: `docs/plan/global_glam/04-data-standardization.md`
- **Design Patterns**: `docs/plan/global_glam/05-design-patterns.md`
- **Dependencies**: `docs/plan/global_glam/03-dependencies.md`

---

**Version**: 0.2.1
**Schema Version**: v0.2.1 (modular)
**Last Updated**: 2025-12-08
**Maintained By**: GLAM Data Extraction Project

### Rule 61: Slot Fixes Authoritative File

🚨 **CRITICAL**: The file `/Users/kempersc/apps/glam/data/fixes/slot_fixes.yaml` is the AUTHORITATIVE source for slot migrations. NEVER delete entries from this file. Always mark completed migrations with `processed: {status: true}`.

**See**: `.opencode/rules/slot-fixes-authoritative-rule.md` for complete documentation

### Rule 62: Verified Ontology Terms Reference

🚨 **CRITICAL**: All `class_uri`, `slot_uri`, and mappings MUST use verified classes and predicates from local ontology files in `data/ontology/`.

**See**: `.opencode/rules/verified-ontology-terms.md` for the list of verified ontologies and verification procedures.