YAML Integrity Issues & Solutions
Date: 2025-11-16
File: data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml
Issues Discovered
1. Q-Number Extraction Incomplete ⚠️
Problem: The current Q-number extraction function in generate_gallery_query_with_exclusions.py misses one Q-number (Q10418031).
Root Cause: Typo in the YAML file: the key is written labels: (plural) instead of label: (singular)
Location: Line 13354 in hyponyms_curated.yaml
# WRONG (current):
- labels: Q10418031 # ← Typo: "labels" should be "label"
  hypernym:
  - university colege
  class:
  - E
# CORRECT (should be):
- label: Q10418031
  hypernym:
  - university college # Also fix typo: "colege" → "college"
  class:
  - E
Impact:
- ❌ Q10418031 excluded from SPARQL query exclusions
- ❌ May appear in query results as "new" entity (false positive)
- ⚠️ Queries may return duplicate results
Current Extraction Method (in scripts/generate_gallery_query_with_exclusions.py):
# Pattern 1: Extract from "label: Q<digits>" lines
label_pattern = r'^\s*-?\s*label:\s+(Q\d+)' # ← Only matches "label:", not "labels:"
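To see why the entry is skipped, the pattern above can be run against both spellings. This is a small illustration only: it reuses the regex quoted above, and the two sample lines are taken from the YAML snippets in this document.

```python
import re

# Pattern quoted above from generate_gallery_query_with_exclusions.py
label_pattern = r'^\s*-?\s*label:\s+(Q\d+)'

sample_lines = [
    "- label: Q1759852",    # well-formed entry
    "- labels: Q10418031",  # entry with the "labels:" typo
]

matched = []
for line in sample_lines:
    m = re.match(label_pattern, line)
    if m:
        matched.append(m.group(1))

# Only the well-formed entry is captured; the typo'd line never matches
# because the pattern requires ":" immediately after "label".
print(matched)
```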
2. Nested Field Corruption Risk ✅ (Currently OK, but needs protection)
Problem: Scripts that process hyponyms_curated.yaml MUST preserve nested structures like:
- rico: fields (e.g., rico: [{'label': 'recordSetTypes'}])
- time: fields (e.g., time: [{'label': 'Renaissance'}])
Good News: Current file has NO corruption (tested 2025-11-16)
- 135 rico fields: ✅ All correctly structured
- 35 time fields: ✅ All correctly structured
Risk: Future scripts or manual edits could corrupt nested structures
Expected Structure:
# CORRECT - Nested structure preserved:
- label: Q1759852
  rico:
  - label: recordSetTypes # ← Nested dict with 'label' key
  time:
  - label: Renaissance # ← Nested dict with 'label' key
  hypernym:
  - museum
# WRONG - Corrupted structure:
- label: Q1759852
  rico: recordSetTypes # ← Lost nesting!
  time: Renaissance # ← Lost nesting!
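The difference between the two forms can also be checked on the parsed data. The helper below is a minimal sketch: the name is_nested_label_list is hypothetical, and the two dicts are written by hand to mirror what a YAML parser would produce for the snippets above.

```python
from typing import Any

def is_nested_label_list(value: Any) -> bool:
    """True if value is a non-empty list of dicts that each carry a 'label' key."""
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(isinstance(item, dict) and 'label' in item for item in value)
    )

# Hand-written equivalents of the CORRECT and WRONG entries above
good = {
    'label': 'Q1759852',
    'rico': [{'label': 'recordSetTypes'}],
    'time': [{'label': 'Renaissance'}],
    'hypernym': ['museum'],
}
bad = {'label': 'Q1759852', 'rico': 'recordSetTypes', 'time': 'Renaissance'}

for name, entity in [('good', good), ('bad', bad)]:
    for field in ('rico', 'time'):
        print(name, field, is_nested_label_list(entity[field]))
```

This sketch checks every list item rather than only the first, which is a slightly stricter choice than strictly necessary; either works for catching a flattened scalar.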
Solutions
Solution 1: Fix Typo in YAML File
Manual Fix (immediate):
# Line 13354 in hyponyms_curated.yaml
# Change:
- labels: Q10418031
  hypernym:
  - university colege
# To:
- label: Q10418031
  hypernym:
  - university college
Verification:
python scripts/test_yaml_integrity.py
# Should report: "✅ NO ISSUES FOUND - All tests passed!"
Solution 2: Use Robust Q-Number Extraction
Recommended: Update all scripts to use the new robust extraction function.
New Function (in scripts/extract_q_numbers_robust.py):
from scripts.extract_q_numbers_robust import extract_all_q_numbers
q_numbers = extract_all_q_numbers(yaml_path)
# Returns 2190 Q-numbers (includes Q10418031)
Update Required In:
- scripts/generate_gallery_query_with_exclusions.py ✅ (highest priority)
- scripts/generate_botanical_query_with_exclusions.py
- scripts/execute_archive_query_corrected.py
- Any future scripts that extract Q-numbers
Implementation:
# OLD (incomplete):
from scripts.generate_gallery_query_with_exclusions import extract_q_numbers_from_yaml
# NEW (complete):
from scripts.extract_q_numbers_robust import extract_all_q_numbers
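The body of extract_all_q_numbers() is not reproduced in this document. As a rough sketch of the idea only, a tolerant extractor could accept both key spellings with a single pattern. The name extract_q_numbers_tolerant and its text-based interface are assumptions for illustration and are not the contents of extract_q_numbers_robust.py.

```python
import re
from typing import List

# Hypothetical helper for illustration; not the real extract_all_q_numbers().
# "labels?" accepts both "label:" and the typo "labels:" on an entry line.
TOLERANT_PATTERN = re.compile(r'^\s*-?\s*labels?:\s+(Q\d+)\b', re.MULTILINE)

def extract_q_numbers_tolerant(yaml_text: str) -> List[str]:
    """Return unique Q-numbers in first-seen order."""
    seen = set()
    result = []
    for q_number in TOLERANT_PATTERN.findall(yaml_text):
        if q_number not in seen:
            seen.add(q_number)
            result.append(q_number)
    return result

sample = """\
- label: Q1759852
  rico:
  - label: recordSetTypes
  hypernym:
  - museum
- labels: Q10418031
  hypernym:
  - university college
"""

print(extract_q_numbers_tolerant(sample))
```

Nested values such as recordSetTypes are ignored because the capture group requires a Q followed by digits. A tolerant pattern hides the typo rather than fixing it, so it works best alongside the manual fix in Solution 1.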
Solution 3: Protect Against Nested Field Corruption
Guideline: When processing hyponyms_curated.yaml, follow these rules:
Rule 1: Preserve Original Entity Dict
# ✅ CORRECT - Preserve entire entity structure
from datetime import datetime
from typing import Any, Dict

def enrich_entity(entity: Dict[str, Any]) -> Dict[str, Any]:
    return {
        'curated': entity, # ← Entire dict preserved (includes rico, time, etc.)
        'wikidata': fetch_wikidata(entity['label']),
        'enrichment_date': datetime.now().isoformat()
    }
Rule 2: Use YAML Safe Operations
# ✅ CORRECT - Use yaml.safe_load() and yaml.dump() with proper settings
import yaml

# Load
with open(yaml_path, 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f) # ← Preserves nested structures

# Save
with open(output_path, 'w', encoding='utf-8') as f:
    yaml.dump(data, f,
              allow_unicode=True,
              default_flow_style=False, # ← Use block style (preserves nesting)
              sort_keys=False, # ← Preserve key order
              width=120) # ← Reasonable line width
Rule 3: Test Before Saving
# ✅ CORRECT - Validate enriched data before saving
from typing import Any, Dict

def validate_enriched_data(enriched_data: Dict[str, Any]) -> bool:
    """Ensure nested fields are preserved."""
    for section_name in ['hypernym', 'entity', 'entity_list']:
        if section_name not in enriched_data:
            continue
        for entity in enriched_data[section_name]:
            curated = entity.get('curated', {})
            # Check rico field preservation
            if 'rico' in curated:
                rico = curated['rico']
                if not (isinstance(rico, list) and
                        len(rico) > 0 and
                        isinstance(rico[0], dict) and
                        'label' in rico[0]):
                    raise ValueError(f"Rico field corrupted in entity: {curated.get('label')}")
            # Check time field preservation
            if 'time' in curated:
                time_val = curated['time']
                if not (isinstance(time_val, list) and
                        len(time_val) > 0 and
                        isinstance(time_val[0], dict) and
                        'label' in time_val[0]):
                    raise ValueError(f"Time field corrupted in entity: {curated.get('label')}")
    return True

# Use before saving
enriched_data = enricher.enrich_all()
validate_enriched_data(enriched_data) # ← Will raise error if corruption detected
enricher.save_output(enriched_data)
Testing Tools
Tool 1: Full Integrity Test
python scripts/test_yaml_integrity.py
Checks:
- Q-number extraction completeness (3 methods compared)
- Nested field corruption (rico, time)
- Reports missing Q-numbers and corruption examples
Expected Output (after fixing typo):
✅ NO ISSUES FOUND - All tests passed!
Tool 2: Robust Extraction Test
python scripts/extract_q_numbers_robust.py
Checks:
- Comprehensive Q-number extraction
- Data quality issues (typos, formatting)
Expected Output (after fixing typo):
✅ Extracted 2190 Q-numbers
✅ No 'labels:' typos found
Action Plan
Immediate Actions (Today)
- Fix typo in YAML file ✅ (manually edit line 13354)
  # Change "labels:" to "label:" on line 13354
- Verify fix
  python scripts/test_yaml_integrity.py
  # Should report: "✅ NO ISSUES FOUND"
- Update generate_gallery_query_with_exclusions.py ✅
  - Replace extract_q_numbers_from_yaml() with robust version
  - Handles both "label:" and "labels:" cases
- Regenerate gallery query with complete Q-numbers
  python scripts/generate_gallery_query_with_exclusions.py
  # Should now exclude all 2190 Q-numbers
Short-Term Actions (This Week)
- Update other scripts
  - generate_botanical_query_with_exclusions.py
  - execute_archive_query_corrected.py
  - Any other scripts that extract Q-numbers
- Add validation to enrichment workflow
  - Update enrich_hyponyms_with_wikidata.py to use validate_enriched_data()
  - Run before saving output
Long-Term Actions (Next Sprint)
- Add pre-commit hook
  - Validate YAML structure before commits
  - Check for "labels:" typos
  - Verify nested field integrity
- Document YAML schema
  - Create formal schema definition
  - Document expected structure for rico/time fields
  - Add validation script to CI/CD
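The "labels:" check planned for the pre-commit hook could be a small line scanner. The following is a sketch under assumptions: the function name find_labels_typos is hypothetical, and how it gets wired into a hook or CI job is left open.

```python
import re
from typing import List, Tuple

# Matches an entry line that uses the plural key "labels:" by mistake.
TYPO_PATTERN = re.compile(r'^\s*-?\s*labels:')

def find_labels_typos(yaml_text: str) -> List[Tuple[int, str]]:
    """Return (1-based line number, stripped line) for each 'labels:' key found."""
    hits = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        if TYPO_PATTERN.match(line):
            hits.append((lineno, line.strip()))
    return hits

sample = "- label: Q1759852\n  hypernym:\n  - museum\n- labels: Q10418031\n"

for lineno, line in find_labels_typos(sample):
    print(f"line {lineno}: {line}")
```

A hook built on this would typically read the staged YAML file, call the function, and exit with a non-zero status when the returned list is non-empty.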
Prevention Guidelines
For Manual YAML Editing
❌ DON'T:
- Use "labels:" (plural) - always use "label:" (singular)
- Flatten nested structures (rico, time)
- Edit YAML with tools that don't preserve structure
✅ DO:
- Use "label:" (singular) for all entity identifiers
- Preserve nested list structures: [{'label': 'value'}]
- Use YAML-aware editors (VS Code with YAML extension)
For Script Development
❌ DON'T:
- Assume Q-numbers only appear in "label:" fields
- Use string replacement instead of YAML parsing
- Save without validating nested structures
✅ DO:
- Use extract_all_q_numbers() for comprehensive extraction
- Use yaml.safe_load() and yaml.dump() with proper settings
- Validate before saving using validate_enriched_data()
- Test with test_yaml_integrity.py after any changes
Status Summary
| Issue | Status | Priority | Action |
|---|---|---|---|
| Q10418031 missing from extraction | ⚠️ Found | High | Fix typo: "labels:" → "label:" |
| Rico field corruption | ✅ No issues | Monitor | Add validation to enrichment |
| Time field corruption | ✅ No issues | Monitor | Add validation to enrichment |
| Incomplete Q-extraction function | ⚠️ Found | High | Use extract_all_q_numbers() |
Files Modified/Created
New Files
- ✅ scripts/test_yaml_integrity.py - Comprehensive integrity test
- ✅ scripts/extract_q_numbers_robust.py - Robust Q-number extraction
- ✅ docs/YAML_INTEGRITY_ISSUES.md - This document
Files Requiring Updates
- ⏳ data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml - Fix line 13354 typo
- ⏳ scripts/generate_gallery_query_with_exclusions.py - Use robust extraction
- ⏳ scripts/generate_botanical_query_with_exclusions.py - Use robust extraction
- ⏳ scripts/enrich_hyponyms_with_wikidata.py - Add validation (optional)
Next Steps
- Fix the typo manually (line 13354)
- Run python scripts/test_yaml_integrity.py to verify
- Update query generation scripts to use robust extraction
- Regenerate SPARQL queries with complete Q-number exclusions
- Document the fix in session notes
Document Created: 2025-11-16
Last Updated: 2025-11-16
Author: AI Agent (OpenCODE Session)