# U-Class (UNKNOWN) - Not a Wikidata Classification

## Overview

The **U-class** represents institutions where the type **cannot be determined** during data extraction. This is NOT a Wikidata class or concept, but rather a **fallback classification** used in the GLAMORCUBEPSXHF taxonomy.

## Key Points

### What U-Class Represents

- **UNKNOWN institution type** - Cannot be classified into G, L, A, M, O, R, C, B, E, P, S, H, or F categories
- **Ambiguous organizations** - Source data provides conflicting or unclear type information
- **Incomplete data** - Institution description lacks sufficient details for classification

### What U-Class Is NOT

- ❌ **NOT for Universities** - Universities are classified under **E (EDUCATION_PROVIDER)**
- ❌ **NOT a Wikidata class** - No SPARQL query or hyponym extraction needed
- ❌ **NOT a permanent classification** - Institutions should be reclassified when more information becomes available

## No SPARQL Query Required

Unlike other GLAMORCUBEPSXHF classes (G, L, A, M, O, R, C, B, E, P, S, H, F, X), the U-class does **NOT have a Wikidata query** because:

1. UNKNOWN is a data quality indicator, not a heritage institution type
2. No Wikidata classes correspond to "institution of unknown type"
3. U-class is assigned during data extraction when type cannot be determined

## When to Use U-Class

### ✅ Appropriate Use Cases

- Extracting from incomplete source data (CSV missing type column)
- Parsing conversations where institution purpose is unclear
- Processing records with conflicting type indicators
- Temporary classification pending manual review

### ❌ Inappropriate Use Cases

- **Universities** → Use **E (EDUCATION_PROVIDER)** instead
- **Mixed-type institutions** → Use **X (MIXED)** and document all actual types
- **Unknown location** → Use U-class only for unknown TYPE, not location
- **Lazy classification** → Always attempt proper classification first
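
The use cases above can be sketched as a small decision helper. This is a minimal illustration under stated assumptions, not the project's extraction code: the function name `resolve_institution_type`, the `TYPE_ALIASES` table, and the `all_types_confirmed` flag are hypothetical, and only the type names `UNKNOWN`, `MIXED`, `MUSEUM`, and `EDUCATION_PROVIDER` come from this document.

```python
# Hypothetical helper: names and inputs are illustrative assumptions.
# Raw labels that must never end up as U-class (universities are E-class).
TYPE_ALIASES = {"UNIVERSITY": "EDUCATION_PROVIDER"}


def resolve_institution_type(candidates, all_types_confirmed=False):
    """Pick an institution_type from the type labels found in a source record.

    candidates: iterable of raw type labels (may be empty).
    all_types_confirmed: True only when a reviewer has confirmed that every
        candidate type genuinely applies to the institution.
    """
    normalized = {TYPE_ALIASES.get(c.upper(), c.upper()) for c in candidates}
    normalized.discard("UNKNOWN")  # an UNKNOWN label carries no type information

    if not normalized:
        return "UNKNOWN"  # incomplete data: no usable type indicator
    if len(normalized) == 1:
        return next(iter(normalized))  # exactly one type: classify properly
    if all_types_confirmed:
        return "MIXED"  # genuinely mixed-type institution (X-class)
    return "UNKNOWN"  # conflicting indicators pending manual review
```

For example, `resolve_institution_type(["MUSEUM", "LIBRARY"])` falls back to `"UNKNOWN"` because the indicators conflict, while the same call with `all_types_confirmed=True` yields `"MIXED"`.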

## Workflow for U-Class Institutions

### 1. Initial Extraction

```yaml
- id: https://w3id.org/heritage/custodian/unknown/inst-001
  name: Heritage Organization XYZ
  institution_type: UNKNOWN  # U-class assigned
  description: "Organization mentioned in conversation with unclear purpose"
  provenance:
    data_source: CONVERSATION_NLP
    data_tier: TIER_4_INFERRED
    confidence_score: 0.45  # Low confidence
    notes: "Institution type unclear - requires manual review"
```

### 2. Manual Review

- Research the institution's website, Wikipedia, or other authoritative sources
- Determine the actual institution type (museum, library, archive, etc.)
- Update the `institution_type` field with the correct classification

### 3. Reclassification

```yaml
- id: https://w3id.org/heritage/custodian/unknown/inst-001
  name: Heritage Organization XYZ
  institution_type: MUSEUM  # Reclassified from UNKNOWN to M-class
  description: "Local history museum (verified via website)"
  provenance:
    data_source: CONVERSATION_NLP
    data_tier: TIER_4_INFERRED
    confidence_score: 0.85  # Higher confidence after verification
    notes: "Originally classified as UNKNOWN, reclassified to MUSEUM after website verification"
    verified_date: "2025-11-12T..."
    verified_by: "manual_review"
```
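
The reclassification step can also be applied programmatically once a reviewer has decided on the correct type. The sketch below is an assumption-laden illustration: the function `reclassify` and its parameters are hypothetical, records are assumed to be plain dicts loaded from the YAML above, and `notes`, `verified_date`, and `verified_by` are assumed to live under `provenance`.

```python
from copy import deepcopy
from datetime import datetime, timezone


def reclassify(record, new_type, confidence, evidence, verified_by="manual_review"):
    """Return a reclassified copy of an UNKNOWN record (hypothetical helper).

    The input record is left untouched so the original extraction can be kept.
    """
    if record.get("institution_type") != "UNKNOWN":
        raise ValueError("only UNKNOWN records are reclassified by this helper")
    if new_type == "UNKNOWN":
        raise ValueError("new_type must be a concrete institution type")

    updated = deepcopy(record)
    updated["institution_type"] = new_type

    # Record how and when the type was verified, mirroring the YAML example.
    provenance = updated.setdefault("provenance", {})
    provenance["confidence_score"] = confidence
    provenance["notes"] = (
        f"Originally classified as UNKNOWN, reclassified to {new_type} after {evidence}"
    )
    provenance["verified_date"] = datetime.now(timezone.utc).isoformat()
    provenance["verified_by"] = verified_by
    return updated
```

Calling `reclassify(record, "MUSEUM", 0.85, "website verification")` on the record from step 1 would produce a record shaped like the one shown in step 3, with the current UTC timestamp as `verified_date`.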

## Universities Clarification

**CRITICAL**: Universities are **NOT U-class**.

### Correct Classification for Universities

| Institution Type | Correct Class | Example |
|------------------|---------------|---------|
| University | **E (EDUCATION_PROVIDER)** | Harvard University, Oxford University |
| University library | **E (EDUCATION_PROVIDER)** | Bodleian Library, Widener Library |
| University museum | **E (EDUCATION_PROVIDER)** | Yale University Art Gallery |
| University archive | **E (EDUCATION_PROVIDER)** | MIT Archives and Special Collections |

### Why Universities Are E-Class

- Universities are **educational institutions** with heritage collections
- E-class includes schools, training centers, vocational schools, AND universities
- Wikidata classes: Q3918 (university), Q875538 (university library), Q1143635 (academic library)
- See updated E-class query: `/data/wikidata/GLAMORCUBEPSXH/E/sparql/education_provider_hyponyms.sparql`

## Directory Structure

```
data/wikidata/GLAMORCUBEPSXH/U/
├── README.md              # This file (explanation of U-class)
├── sparql/                # Empty - no SPARQL query needed
└── (no query files)       # UNKNOWN is not a Wikidata concept
```

## Data Quality Implications

### TIER Assignment for U-Class

U-class institutions typically receive **TIER_4_INFERRED** (lowest tier) because:
- Type cannot be determined from available data
- Requires manual verification
- May contain errors or ambiguities

### Confidence Scoring

U-class institutions should have **low confidence scores** (< 0.6):
- Signals need for human review
- Flags incomplete data extraction
- Prioritizes records for verification
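
The confidence rule above lends itself to a mechanical check. The following is a minimal sketch, assuming records are dicts shaped like the YAML examples in this file with `confidence_score` nested under `provenance`; the function name `unknown_confidence_violations` is hypothetical, and the `0.6` default simply restates the threshold given above.

```python
def unknown_confidence_violations(records, threshold=0.6):
    """Return ids of UNKNOWN records whose confidence is missing or too high.

    A record violates the rule when its institution_type is UNKNOWN and its
    confidence_score is absent or not strictly below the threshold.
    """
    violations = []
    for record in records:
        if record.get("institution_type") != "UNKNOWN":
            continue  # the rule only applies to U-class records
        score = record.get("provenance", {}).get("confidence_score")
        if score is None or score >= threshold:
            violations.append(record.get("id"))
    return violations
```

An empty result means every U-class record carries a score below the threshold; any returned ids point at records whose score contradicts their UNKNOWN type.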

### Review Priority

U-class institutions should be **high priority for manual review**:
1. Search the institution's official website
2. Query Wikidata for existing records
3. Check the ISIL registry or national heritage registries
4. Reclassify to the appropriate type (G, L, A, M, etc.)
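
One way to put this priority into practice is to sort the dataset so that U-class records surface first. The sketch below is illustrative only: `build_review_queue` is a hypothetical name, records are assumed to be dicts shaped like the YAML examples in this file, and treating a missing `confidence_score` as `0.0` is an assumption made so that such records are reviewed earliest.

```python
def build_review_queue(records):
    """Order records for manual review (hypothetical helper).

    UNKNOWN records come first; within each group, lower confidence scores
    come earlier. A missing score is treated as 0.0.
    """
    def priority(record):
        is_unknown = record.get("institution_type") == "UNKNOWN"
        score = record.get("provenance", {}).get("confidence_score", 0.0)
        return (0 if is_unknown else 1, score)

    # sorted() is stable, so records with equal priority keep their input order.
    return sorted(records, key=priority)
```

The input list is not modified; reviewers can work through the returned list from the top.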

## Statistics and Monitoring

### Expected U-Class Distribution

In a well-extracted dataset:
- **< 5%** of institutions should be U-class (most types should be determinable)
- **High U-class percentage** indicates data quality issues
- **Zero U-class** is ideal (all institutions properly classified)

### Monitoring Metrics

```python
# Monitor U-class usage. Assumes each record is a dict with an
# "institution_type" key, as in the YAML examples above.
def unknown_rate_status(institutions):
    """Return a status message based on the share of UNKNOWN records."""
    total_institutions = len(institutions)
    if total_institutions == 0:
        return "OK: No institutions to evaluate"  # avoid division by zero
    u_class_count = sum(
        1 for inst in institutions if inst.get("institution_type") == "UNKNOWN"
    )
    u_class_percentage = (u_class_count / total_institutions) * 100

    if u_class_percentage > 10:
        return "WARNING: High UNKNOWN classification rate - review extraction logic"
    if u_class_percentage > 5:
        return "CAUTION: Elevated UNKNOWN rate - some manual review needed"
    return "OK: Low UNKNOWN rate - extraction quality acceptable"
```

## References

- **Taxonomy Table**: `/AGENTS.md` (lines 391-419)
- **Schema Definition**: `/schemas/enums.yaml` (InstitutionTypeEnum)
- **E-Class Query**: `/data/wikidata/GLAMORCUBEPSXH/E/sparql/education_provider_hyponyms.sparql`
- **Query Execution Guide**: `/data/wikidata/GLAMORCUBEPSXH/00-QUERY_EXECUTION_GUIDE.md`

---

**Last Updated**: 2025-11-12
**Taxonomy Version**: GLAMORCUBEPSXHF v2.0
**Status**: U-class correctly defined as UNKNOWN (not UNIVERSITY)