glam/data/wikidata/GLAMORCUBEPSXHFN/U/README.md
2025-11-19 23:25:22 +01:00

# U-Class (UNKNOWN) - Not a Wikidata Classification
## Overview
The **U-class** represents institutions where the type **cannot be determined** during data extraction. This is NOT a Wikidata class or concept, but rather a **fallback classification** used in the GLAMORCUBEPSXHF taxonomy.
## Key Points
### What U-Class Represents
- **UNKNOWN institution type** - Cannot be classified into the G, L, A, M, O, R, C, B, E, P, S, H, F, or X categories
- **Ambiguous organizations** - Source data provides conflicting or unclear type information
- **Incomplete data** - Institution description lacks sufficient details for classification
### What U-Class Is NOT
- **NOT for Universities** - Universities are classified under **E (EDUCATION_PROVIDER)**
- **NOT a Wikidata class** - No SPARQL query or hyponym extraction needed
- **NOT a permanent classification** - Institutions should be reclassified when more information becomes available
## No SPARQL Query Required
Unlike other GLAMORCUBEPSXHF classes (G, L, A, M, O, R, C, B, E, P, S, H, F, X), the U-class does **NOT have a Wikidata query** because:
1. UNKNOWN is a data quality indicator, not a heritage institution type
2. No Wikidata classes correspond to "institution of unknown type"
3. U-class is assigned during data extraction when type cannot be determined
## When to Use U-Class
### ✅ Appropriate Use Cases
- Extracting from incomplete source data (CSV missing type column)
- Parsing conversations where institution purpose is unclear
- Processing records with conflicting type indicators
- Temporary classification pending manual review
### ❌ Inappropriate Use Cases
- **Universities** → Use **E (EDUCATION_PROVIDER)** instead
- **Mixed-type institutions** → Use **X (MIXED)** and document all actual types
- **Unknown location** → Use U-class only for unknown TYPE, not location
- **Lazy classification** → Always attempt proper classification first
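
The rules above can be sketched as a small decision function. This is only an illustration, not the project's extraction code: the function name `assign_institution_type`, its input (a list of type labels found for one institution), and the `all_types_verified` flag are all assumptions introduced here. Only enum values that appear in this README are used.

```python
def assign_institution_type(candidate_types, all_types_verified=False):
    """Choose an institution_type value from the type labels found in the source.

    candidate_types: labels the extractor found for one institution (may be empty).
    all_types_verified: True only if every label was confirmed as a real role.
    Both parameters are hypothetical and exist for illustration only.
    """
    distinct = {t for t in candidate_types if t and t != "UNKNOWN"}
    if len(distinct) == 1:
        # Exactly one clear type: classify properly, never fall back to U.
        return next(iter(distinct))
    if len(distinct) > 1 and all_types_verified:
        # Several confirmed roles: this is X (MIXED), not U.
        return "MIXED"
    # No type information, or conflicting and unconfirmed labels: U-class.
    return "UNKNOWN"


print(assign_institution_type([]))
print(assign_institution_type(["EDUCATION_PROVIDER"]))
print(assign_institution_type(["MUSEUM", "EDUCATION_PROVIDER"]))
```

In this sketch, an empty list and an unconfirmed conflict both yield `UNKNOWN`, while a single label is returned unchanged, so a university identified as `EDUCATION_PROVIDER` never ends up in U-class.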
## Workflow for U-Class Institutions
### 1. Initial Extraction
```yaml
- id: https://w3id.org/heritage/custodian/unknown/inst-001
  name: Heritage Organization XYZ
  institution_type: UNKNOWN  # U-class assigned
  description: "Organization mentioned in conversation with unclear purpose"
  provenance:
    data_source: CONVERSATION_NLP
    data_tier: TIER_4_INFERRED
    confidence_score: 0.45  # Low confidence
    notes: "Institution type unclear - requires manual review"
```
### 2. Manual Review
- Research institution's website, Wikipedia, or authoritative sources
- Determine actual institution type (museum, library, archive, etc.)
- Update `institution_type` field with correct classification
### 3. Reclassification
```yaml
- id: https://w3id.org/heritage/custodian/unknown/inst-001
  name: Heritage Organization XYZ
  institution_type: MUSEUM  # Reclassified from UNKNOWN to M-class
  description: "Local history museum (verified via website)"
  provenance:
    data_source: CONVERSATION_NLP
    data_tier: TIER_4_INFERRED
    confidence_score: 0.85  # Higher confidence after verification
    notes: "Originally classified as UNKNOWN, reclassified to MUSEUM after website verification"
    verified_date: "2025-11-12T..."
    verified_by: "manual_review"
```
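
The reclassification step could be automated along these lines once a reviewer has decided on the type. This is a minimal sketch under stated assumptions: records are plain dictionaries, the verification fields live under `provenance`, and the helper name `reclassify` and its parameters are hypothetical rather than part of the project's tooling.

```python
from copy import deepcopy
from datetime import datetime, timezone


def reclassify(record, new_type, evidence, confidence):
    """Return a copy of an UNKNOWN record reclassified to new_type.

    Hypothetical helper: the field layout follows the YAML examples in this
    README, with verification fields assumed to sit under `provenance`.
    """
    if record.get("institution_type") != "UNKNOWN":
        raise ValueError("this helper only reclassifies UNKNOWN records")
    if new_type == "UNKNOWN":
        raise ValueError("new_type must be a concrete institution type")

    updated = deepcopy(record)  # leave the caller's record untouched
    updated["institution_type"] = new_type
    provenance = updated.setdefault("provenance", {})
    provenance["confidence_score"] = confidence
    provenance["notes"] = (
        f"Originally classified as UNKNOWN, reclassified to {new_type} after {evidence}"
    )
    provenance["verified_date"] = datetime.now(timezone.utc).isoformat()
    provenance["verified_by"] = "manual_review"
    return updated


original = {
    "name": "Heritage Organization XYZ",
    "institution_type": "UNKNOWN",
    "provenance": {"data_source": "CONVERSATION_NLP", "confidence_score": 0.45},
}
reviewed = reclassify(original, "MUSEUM", "website verification", 0.85)
print(reviewed["institution_type"], reviewed["provenance"]["notes"])
```

Returning a copy rather than mutating in place keeps the original extraction result available for auditing; whether that matters depends on how the pipeline stores records.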
## Universities Clarification
**CRITICAL**: Universities are **NOT U-class**.
### Correct Classification for Universities
| Institution Type | Correct Class | Example |
|------------------|---------------|---------|
| University | **E (EDUCATION_PROVIDER)** | Harvard University, Oxford University |
| University library | **E (EDUCATION_PROVIDER)** | Bodleian Library, Widener Library |
| University museum | **E (EDUCATION_PROVIDER)** | Yale University Art Gallery |
| University archive | **E (EDUCATION_PROVIDER)** | MIT Archives and Special Collections |
### Why Universities Are E-Class
- Universities are **educational institutions** with heritage collections
- E-class includes schools, training centers, vocational schools, AND universities
- Wikidata classes: Q3918 (university), Q875538 (university library), Q1143635 (academic library)
- See updated E-class query: `/data/wikidata/GLAMORCUBEPSXH/E/sparql/education_provider_hyponyms.sparql`
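
Because the letter U invites confusion with "University", a simple lint pass can flag UNKNOWN records whose names suggest a university. The sketch below is a heuristic only: it matches the English word "university", assumes records are dictionaries with `id`, `name`, and `institution_type` keys, and uses a hypothetical function name and made-up sample records.

```python
import re

# Heuristic only: an English keyword match, not a substitute for manual review.
UNIVERSITY_PATTERN = re.compile(r"\buniversity\b", re.IGNORECASE)


def find_misfiled_universities(records):
    """Return ids of UNKNOWN records whose name suggests a university."""
    return [
        r["id"]
        for r in records
        if r.get("institution_type") == "UNKNOWN"
        and UNIVERSITY_PATTERN.search(r.get("name", ""))
    ]


# Illustrative records, not real data.
sample = [
    {"id": "inst-001", "name": "Heritage Organization XYZ", "institution_type": "UNKNOWN"},
    {"id": "inst-002", "name": "Example University Library", "institution_type": "UNKNOWN"},
    {"id": "inst-003", "name": "Example University", "institution_type": "EDUCATION_PROVIDER"},
]
print(find_misfiled_universities(sample))
```

Records flagged this way are candidates for **E (EDUCATION_PROVIDER)** and still need a reviewer to confirm the match.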
## Directory Structure
```
data/wikidata/GLAMORCUBEPSXH/U/
├── README.md # This file (explanation of U-class)
├── sparql/ # Empty - no SPARQL query needed
└── (no query files) # UNKNOWN is not a Wikidata concept
```
## Data Quality Implications
### TIER Assignment for U-Class
U-class institutions typically receive **TIER_4_INFERRED** (lowest tier) because:
- Type cannot be determined from available data
- Requires manual verification
- May contain errors or ambiguities
### Confidence Scoring
U-class institutions should have **low confidence scores** (< 0.6):
- Signals need for human review
- Flags incomplete data extraction
- Prioritizes records for verification
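
These expectations can be checked mechanically. The following sketch returns warnings for a U-class record that does not match the tier and confidence guidance above; the function name `check_u_class_record` and the assumption that `data_tier` and `confidence_score` sit under `provenance` are illustrative, not taken from the project schema.

```python
def check_u_class_record(record):
    """Return a list of warnings for a record typed UNKNOWN (empty if none).

    Hypothetical helper; records of any other type are not checked.
    """
    if record.get("institution_type") != "UNKNOWN":
        return []

    provenance = record.get("provenance", {})
    warnings = []

    score = provenance.get("confidence_score")
    if score is None:
        warnings.append("missing confidence_score")
    elif score >= 0.6:
        warnings.append(f"confidence_score {score} is not low (expected < 0.6)")

    # TIER_4_INFERRED is typical rather than mandatory, so this is a warning.
    if provenance.get("data_tier") != "TIER_4_INFERRED":
        warnings.append("data_tier is not TIER_4_INFERRED")

    return warnings


record = {
    "institution_type": "UNKNOWN",
    "provenance": {"data_tier": "TIER_4_INFERRED", "confidence_score": 0.45},
}
print(check_u_class_record(record))
```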
### Review Priority
U-class institutions should be **high priority for manual review**:
1. Search institution's official website
2. Query Wikidata for existing records
3. Check ISIL registry or national heritage registries
4. Reclassify to appropriate type (G, L, A, M, etc.)
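
A review queue for these steps might be built as follows. The README does not say how to order U-class records among themselves, so sorting by lowest confidence first is an assumption of this sketch, as are the function name `build_review_queue` and the dictionary record layout.

```python
def build_review_queue(records):
    """Return UNKNOWN records, lowest confidence first (assumed ordering).

    Records without a confidence score are treated as 0.0 so they surface first.
    """
    unknown = [r for r in records if r.get("institution_type") == "UNKNOWN"]
    return sorted(
        unknown,
        key=lambda r: r.get("provenance", {}).get("confidence_score", 0.0),
    )


# Illustrative records, not real data.
records = [
    {"id": "inst-001", "institution_type": "UNKNOWN", "provenance": {"confidence_score": 0.45}},
    {"id": "inst-002", "institution_type": "MUSEUM", "provenance": {"confidence_score": 0.85}},
    {"id": "inst-003", "institution_type": "UNKNOWN", "provenance": {"confidence_score": 0.20}},
]
for item in build_review_queue(records):
    print(item["id"])
```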
## Statistics and Monitoring
### Expected U-Class Distribution
In a well-extracted dataset:
- **< 5%** of institutions should be U-class (most types should be determinable)
- **High U-class percentage** indicates data quality issues
- **Zero U-class** is ideal (all institutions properly classified)
### Monitoring Metrics
```python
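# Monitor U-class usage. Both counts are supplied by the caller
# (e.g. from a database query over the extracted records).
def unknown_rate_status(total_institutions: int, u_class_count: int) -> str:
    """Return a status message for the share of UNKNOWN classifications."""
    if total_institutions == 0:
        return "NO DATA: no institutions to evaluate"  # avoid division by zero
    u_class_percentage = (u_class_count / total_institutions) * 100
    if u_class_percentage > 10:
        return "WARNING: High UNKNOWN classification rate - review extraction logic"
    if u_class_percentage > 5:
        return "CAUTION: Elevated UNKNOWN rate - some manual review needed"
    return "OK: Low UNKNOWN rate - extraction quality acceptable"


print(unknown_rate_status(total_institutions=200, u_class_count=7))  # example counts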
```
## References
- **Taxonomy Table**: `/AGENTS.md` (lines 391-419)
- **Schema Definition**: `/schemas/enums.yaml` (InstitutionTypeEnum)
- **E-Class Query**: `/data/wikidata/GLAMORCUBEPSXH/E/sparql/education_provider_hyponyms.sparql`
- **Query Execution Guide**: `/data/wikidata/GLAMORCUBEPSXH/00-QUERY_EXECUTION_GUIDE.md`
---
**Last Updated**: 2025-11-12
**Taxonomy Version**: GLAMORCUBEPSXHF v2.0
**Status**: U-class correctly defined as UNKNOWN (not UNIVERSITY)