glam/.opencode/rules/no-hallucinated-ontology-references.md
kempersc 1fb924c412 feat: add ontology mappings to LinkML schema and enhance entity resolution
2026-01-13 13:51:02 +01:00

# Rule 51: No Hallucinated Ontology References
**Priority**: CRITICAL

**Scope**: All LinkML schema files (`schemas/20251121/linkml/`)

**Created**: 2025-01-13

---
## Summary
All ontology references in LinkML schema files (`class_uri`, `slot_uri`, `*_mappings`) MUST be verifiable against actual ontology files in `/data/ontology/`. References to predicates or classes that do not exist in local ontology files are considered **hallucinated** and are prohibited.

---
## The Problem
AI agents may suggest ontology mappings based on training data without verifying that:
1. The ontology file exists in `/data/ontology/`
2. The specific predicate/class exists within that ontology file
3. The prefix is declared and resolvable

This leads to schema files containing references like `dqv:value` or `adms:status` that cannot be validated or serialized to RDF.

---
## Requirements
### 1. All Ontology Prefixes Must Have Local Files
Before using a prefix (e.g., `prov:`, `schema:`, `org:`), verify the ontology file exists:
```bash
# Check if ontology exists
ls data/ontology/ | grep -i "prov\|schema\|org"
```
**Available Ontologies** (as of 2025-01-13):
| Prefix | File | Verified |
|--------|------|----------|
| `prov:` | `prov-o.ttl`, `prov.ttl` | ✅ |
| `schema:` | `schemaorg.owl` | ✅ |
| `org:` | `org.rdf` | ✅ |
| `skos:` | `skos.rdf` | ✅ |
| `dcterms:` | `dublin_core_elements.rdf` | ✅ |
| `foaf:` | `foaf.ttl` | ✅ |
| `rico:` | `RiC-O_1-1.rdf` | ✅ |
| `crm:` | `CIDOC_CRM_v7.1.3.rdf` | ✅ |
| `geo:` | `geo.ttl` | ✅ |
| `sosa:` | `sosa.ttl` | ✅ |
| `bf:` | `bibframe.rdf` | ✅ |
| `edm:` | `edm.owl` | ✅ |
| `premis:` | `premis3.owl` | ✅ |
| `dcat:` | `dcat3.ttl` | ✅ |
| `ore:` | `ore.rdf` | ✅ |
| `pico:` | `pico.ttl` | ✅ |
| `gn:` | `geonames_ontology.rdf` | ✅ |
| `time:` | `time.ttl` | ✅ |
| `locn:` | `locn.ttl` | ✅ |
| `dqv:` | `dqv.ttl` | ✅ |
| `adms:` | `adms.ttl` | ✅ |

**NOT Available** (do not use without adding):

| Prefix | Status | Alternative |
|--------|--------|-------------|
| `qudt:` | Only referenced in `era_ontology.ttl` | Use `hc:` with a `close_mappings` annotation |
### 2. Predicates Must Exist in Ontology Files
Before using a predicate, verify it exists:
```bash
# Verify predicate exists
grep -l "hasFrameRate\|frameRate" data/ontology/premis3.owl
# Check specific predicate definition
grep -E "premis:hasFrameRate|:hasFrameRate" data/ontology/premis3.owl
```
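
Plain substring matches can produce false positives: `grep "act"` also matches `characteristic`, for example. A minimal sketch of a stricter check follows, assuming terms appear in the ontology file either as the tail of an IRI (after `/` or `#`) or as a prefixed name (after `:`). The function name `term_defined` is hypothetical, and the delimiter characters are assumptions about how the local RDF/XML and Turtle files are written.

```bash
# term_defined LOCAL_NAME FILE
# Succeeds when LOCAL_NAME occurs as a complete IRI tail or prefixed name,
# i.e. preceded by '/', '#' or ':' and followed by a delimiter or end of line.
# Hypothetical helper, not part of the project tooling.
term_defined() {
  grep -Eq "[/#:]$1([\">;,.[:space:]]|\$)" "$2"
}

# Usage (paths as used elsewhere in this rule):
#   term_defined atTime data/ontology/prov-o.ttl && echo "found"
```

A match only shows that the term is mentioned in the file, not that it is declared as a class or property, so the definition should still be inspected by hand.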
### 3. Use hc: Prefix for Domain-Specific Concepts
When no standard ontology predicate exists, use the Heritage Custodian namespace:
```yaml
# CORRECT - Use hc: with documentation
slots:
  heritage_relevance_score:
    slot_uri: hc:heritageRelevanceScore
    description: Heritage sector relevance score (0.0-1.0)
    annotations:
      ontology_note: >-
        No standard ontology predicate for heritage relevance scoring.
        Domain-specific metric for this project.

# WRONG - Hallucinated predicate
slots:
  heritage_relevance_score:
    slot_uri: dqv:heritageScore  # Does not exist!
```
### 4. Document External References in close_mappings
When a similar concept exists in an ontology we don't have locally, document it in `close_mappings` with a note:
```yaml
slots:
  confidence_score:
    slot_uri: hc:confidenceScore
    close_mappings:
      - dqv:value  # W3C Data Quality Vocabulary (not in local files)
    annotations:
      external_ontology_note: >-
        dqv:value from W3C Data Quality Vocabulary would be semantically
        appropriate but ontology not included in project. See
        https://www.w3.org/TR/vocab-dqv/
```
---
## Verification Workflow
### Before Adding New Mappings
1. **Check if ontology file exists**:
```bash
ls data/ontology/ | grep -i "<ontology-name>"
```
2. **Search for predicate in ontology**:
```bash
grep -l "<predicate-name>" data/ontology/*
```
3. **Verify predicate definition**:
```bash
grep -B2 -A5 "<predicate-name>" data/ontology/<file>
```
4. **If not found**: Use `hc:` prefix with appropriate documentation
### When Reviewing Existing Mappings
Run the validation script:
```bash
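# List the distinct prefixes used in slot_uri values (hc: excluded)
grep -rh "slot_uri:" schemas/20251121/linkml/modules/slots/ | \
  grep -v "hc:" | \
  sed -E 's/.*slot_uri:[[:space:]]*([A-Za-z0-9_-]+):.*/\1/' | \
  sort -u

# Verify each prefix has a local file. File names are taken from the
# Available Ontologies table because they do not always contain the prefix
# (e.g. dcterms -> dublin_core_elements.rdf, rico -> RiC-O_1-1.rdf).
for pair in prov:prov-o.ttl schema:schemaorg.owl org:org.rdf skos:skos.rdf \
            dcterms:dublin_core_elements.rdf foaf:foaf.ttl rico:RiC-O_1-1.rdf; do
  prefix="${pair%%:*}"
  file="${pair#*:}"
  echo "Checking $prefix:"
  ls "data/ontology/$file" 2>/dev/null || echo "  NOT FOUND!"
done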
```
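
This rule covers `class_uri` and `*_mappings` entries as well as `slot_uri`. A sketch of a broader extraction follows, assuming mappings are written as block-style YAML lists (one `- prefix:term` per line) and values are unquoted. The function name `list_curies` is hypothetical, and a real YAML parser would be more robust than this line-based approach.

```bash
# list_curies SCHEMA_DIR
# Prints the distinct non-hc: CURIEs found in class_uri, slot_uri and
# *_mappings entries of all .yaml files under SCHEMA_DIR.
# Hypothetical helper; inline lists ([a, b]) and quoted values are not handled.
list_curies() {
  find "$1" -name '*.yaml' -exec awk '
    /^[ \t]*(class_uri|slot_uri):/ { in_map = 0; print $2; next }
    /^[ \t]*[a-z_]+_mappings:/     { in_map = 1; next }
    in_map && /^[ \t]*-[ \t]/      { print $2; next }
                                   { in_map = 0 }
  ' {} + | grep -v '^hc:' | sort -u
}

# Usage (directory as named in the Scope line of this rule):
#   list_curies schemas/20251121/linkml/
```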
---
## Ontology Addition Process
If a new ontology is genuinely needed:
1. **Download the ontology**:
```bash
curl -L -o data/ontology/<name>.ttl "<url>" -H "Accept: text/turtle"
```
2. **Update ONTOLOGY_CATALOG.md**:
```bash
# Add entry to data/ontology/ONTOLOGY_CATALOG.md
```
3. **Verify predicates exist**:
```bash
grep "<predicate>" data/ontology/<name>.ttl
```
4. **Update LinkML prefixes** in schema files
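
Content negotiation does not always succeed: some servers ignore the `Accept` header, and `curl -L` saves whatever the server returns, including an HTML landing page. A rough sanity check on the downloaded file is sketched below, under the assumption that a valid file shows a Turtle or RDF/XML marker within its first 50 lines. The name `looks_like_rdf` is hypothetical, and the check is a heuristic rather than a parser.

```bash
# looks_like_rdf FILE
# Fails when the first 50 lines look like HTML, succeeds when they contain
# a Turtle or RDF/XML marker. Hypothetical helper; heuristic only.
looks_like_rdf() {
  if head -n 50 "$1" | grep -Eiq '<!DOCTYPE html|<html'; then
    return 1  # an HTML page was saved instead of the ontology
  fi
  head -n 50 "$1" | grep -Eq '@prefix|@base|^PREFIX|<rdf:RDF|<\?xml'
}

# Usage, with <name> standing for the file downloaded in step 1:
#   looks_like_rdf "data/ontology/<name>.ttl" || echo "download is not RDF"
```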
---
## Examples
### CORRECT: Verified Mapping
```yaml
slots:
  retrieval_timestamp:
    slot_uri: prov:atTime  # Verified in data/ontology/prov-o.ttl
    range: datetime
```
### CORRECT: Domain-Specific with External Reference
```yaml
slots:
  confidence_score:
    slot_uri: hc:confidenceScore  # HC namespace (always valid)
    range: float
    close_mappings:
      - dqv:value  # External reference (documented, not required locally)
    annotations:
      ontology_note: >-
        Uses HC namespace as dqv: ontology not in local files.
        dqv:value would be semantically appropriate alternative.
```
### WRONG: Hallucinated Mapping
```yaml
slots:
  confidence_score:
    slot_uri: dqv:value  # INVALID - dqv: not in data/ontology/!
    range: float
```
### WRONG: Non-Existent Predicate
```yaml
slots:
  frame_rate:
    slot_uri: premis:hasFrameRate  # INVALID - predicate not in premis3.owl!
    range: float
```
---
## Consequences of Violation
1. **RDF serialization fails** - Invalid prefixes cause gen-owl errors
2. **Schema validation errors** - LinkML validates prefix declarations
3. **Broken interoperability** - External systems cannot resolve URIs
4. **Data quality issues** - Semantic web tooling cannot process data
---
## PREMIS Ontology Reference (premis3.owl)
**CRITICAL**: PREMIS terms are frequently hallucinated. ALL `premis:` references MUST be verified against `premis3.owl`.
### Valid PREMIS Classes
```
Action, Agent, Bitstream, Copyright, Dependency, EnvironmentCharacteristic,
Event, File, Fixity, HardwareAgent, Identifier, Inhibitor, InstitutionalPolicy,
IntellectualEntity, License, Object, Organization, OutcomeStatus, Person,
PreservationPolicy, Representation, RightsBasis, RightsStatus, Rule, Signature,
SignatureEncoding, SignificantProperties, SoftwareAgent, Statute,
StorageLocation, StorageMedium
```
### Valid PREMIS Properties
```
act, allows, basis, characteristic, citation, compositionLevel, dependency,
determinationDate, documentation, encoding, endDate, fixity, governs,
identifier, inhibitedBy, inhibits, jurisdiction, key, medium, note,
originalName, outcome, outcomeNote, policy, prohibits, purpose, rationale,
relationship, restriction, rightsStatus, signature, size, startDate,
storedAt, terms, validationRules, version
```
### Known Hallucinated PREMIS Terms (DO NOT USE)
| Hallucinated Term | Correction |
|-------------------|------------|
| `premis:PreservationEvent` | Use `premis:Event` |
| `premis:RightsDeclaration` | Use `premis:RightsBasis` or `premis:RightsStatus` |
| `premis:hasRightsStatement` | Use `premis:rightsStatus` |
| `premis:hasRightsDeclaration` | Use `premis:rightsStatus` |
| `premis:hasRepresentation` | Use `premis:relationship` or `dcterms:hasFormat` |
| `premis:hasRelatedStatementInformation` | Use `premis:note` or `adms:status` |
| `premis:hasObjectCharacteristics` | Use `premis:characteristic` |
| `premis:rightsGranted` | Use `premis:RightsStatus` class with `premis:restriction` |
| `premis:rightsEndDate` | Use `premis:endDate` |
| `premis:linkingAgentIdentifier` | Use `premis:Agent` class |
| `premis:storageLocation` (lowercase) | Use `premis:storedAt` property or `premis:StorageLocation` class |
| `premis:hasFrameRate` | Does not exist - use `hc:frameRate` |
| `premis:environmentCharacteristic` (lowercase) | Use `premis:EnvironmentCharacteristic` (class) |
### PREMIS Verification Commands
```bash
# List all PREMIS classes
grep -E "owl:Class.*premis" data/ontology/premis3.owl | \
sed 's/.*v3\///' | sed 's/".*//' | sort -u
# List all PREMIS properties
grep -E "ObjectProperty|DatatypeProperty" data/ontology/premis3.owl | \
grep -oP 'v3/\K[^"]+' | sort -u
# Verify a specific term exists
grep -c "YourTermHere" data/ontology/premis3.owl
```
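
Two entries in the Known Hallucinated PREMIS Terms table differ from a valid term only by letter case (`storageLocation` vs. `StorageLocation`, `environmentCharacteristic` vs. `EnvironmentCharacteristic`). Because `grep -c` is case-sensitive, it reports 0 for such terms without hinting at the near miss. The sketch below lists the spellings that do occur, assuming terms appear as IRI tails or prefixed names; `suggest_case` is a hypothetical name.

```bash
# suggest_case LOCAL_NAME FILE
# Prints each spelling under which LOCAL_NAME occurs in FILE, ignoring case.
# Prints nothing when the term is absent in every casing.
# Hypothetical helper, not part of the project tooling.
suggest_case() {
  grep -oiE "[/#:]$1([\">;,.[:space:]]|\$)" "$2" | tr -d '/#:">;,. \t' | sort -u
}

# Usage:
#   suggest_case storageLocation data/ontology/premis3.owl
```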
---
## See Also
- Rule 38: Slot Centralization and Semantic URI Requirements
- Rule 50: Ontology-to-LinkML Mapping Convention
- `/data/ontology/ONTOLOGY_CATALOG.md` - Available ontologies
- `.opencode/rules/slot-ontology-mapping-reference.md` - Mapping reference
---
## Version History
- **2025-01-13**: Added 7 more hallucinated PREMIS terms discovered during schema audit:
  - `premis:hasRightsStatement`, `premis:hasRightsDeclaration`, `premis:hasRepresentation`
  - `premis:hasRelatedStatementInformation`, `premis:rightsGranted`, `premis:rightsEndDate`
  - `premis:linkingAgentIdentifier`
- **2025-01-13**: Initial creation after discovering dqv:, adms:, qudt: references without local files