# YAML Integrity Issues & Solutions

**Date**: 2025-11-16

**File**: `data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml`

---

## Issues Discovered

### 1. **Q-Number Extraction Incomplete** ⚠️

**Problem**: The current Q-number extraction function in `generate_gallery_query_with_exclusions.py` misses 1 Q-number.

**Root Cause**: Typo in YAML file - `labels:` (plural) instead of `label:` (singular)

**Location**: Line 13354 in `hyponyms_curated.yaml`

```yaml
# WRONG (current):
- labels: Q10418031  # ← Typo: "labels" should be "label"
  hypernym:
  - university colege
  class:
  - E

# CORRECT (should be):
- label: Q10418031
  hypernym:
  - university college  # Also fix typo: "colege" → "college"
  class:
  - E
```

**Impact**:
- ❌ Q10418031 is missing from the SPARQL query exclusion list
- ❌ May appear in query results as a "new" entity (false positive)
- ⚠️ Queries may return duplicate results

**Current Extraction Method** (in `scripts/generate_gallery_query_with_exclusions.py`):

```python
# Pattern 1: Extract from "label: Q" lines
label_pattern = r'^\s*-?\s*label:\s+(Q\d+)'  # ← Only matches "label:", not "labels:"
```

---

### 2. **Nested Field Corruption Risk** ✅ (Currently OK, but needs protection)

**Problem**: Scripts that process `hyponyms_curated.yaml` MUST preserve nested structures like:
- `rico:` fields (e.g., `rico: [{'label': 'recordSetTypes'}]`)
- `time:` fields (e.g., `time: [{'label': 'Renaissance'}]`)

**Good News**: Current file has **NO corruption** (tested 2025-11-16)
- 135 rico fields: ✅ All correctly structured
- 35 time fields: ✅ All correctly structured

**Risk**: Future scripts or manual edits could corrupt nested structures

**Expected Structure**:

```yaml
# CORRECT - Nested structure preserved:
- label: Q1759852
  rico:
  - label: recordSetTypes  # ← Nested dict with 'label' key
  time:
  - label: Renaissance  # ← Nested dict with 'label' key
  hypernym:
  - museum

# WRONG - Corrupted structure:
- label: Q1759852
  rico: recordSetTypes  # ← Lost nesting!
  time: Renaissance  # ← Lost nesting!
```

---

## Solutions

### Solution 1: Fix Typo in YAML File

**Manual Fix** (immediate):

```yaml
# Line 13354 in hyponyms_curated.yaml
# Change:
- labels: Q10418031
  hypernym:
  - university colege

# To:
- label: Q10418031
  hypernym:
  - university college
```

**Verification**:

```bash
python scripts/test_yaml_integrity.py
# Should report: "✅ NO ISSUES FOUND - All tests passed!"
```

---

### Solution 2: Use Robust Q-Number Extraction

**Recommended**: Update all scripts to use the new robust extraction function.

**New Function** (in `scripts/extract_q_numbers_robust.py`):

```python
from scripts.extract_q_numbers_robust import extract_all_q_numbers

q_numbers = extract_all_q_numbers(yaml_path)
# Returns 2190 Q-numbers (includes Q10418031)
```

**Update Required In**:
1. `scripts/generate_gallery_query_with_exclusions.py` ✅ (highest priority)
2. `scripts/generate_botanical_query_with_exclusions.py`
3. `scripts/execute_archive_query_corrected.py`
4. Any future scripts that extract Q-numbers

**Implementation**:

```python
# OLD (incomplete):
from scripts.generate_gallery_query_with_exclusions import extract_q_numbers_from_yaml

# NEW (complete):
from scripts.extract_q_numbers_robust import extract_all_q_numbers
```

---

### Solution 3: Protect Against Nested Field Corruption

**Guideline**: When processing `hyponyms_curated.yaml`, follow these rules:

#### Rule 1: Preserve Original Entity Dict

```python
from datetime import datetime
from typing import Any, Dict

# ✅ CORRECT - Preserve entire entity structure
def enrich_entity(entity: Dict[str, Any]) -> Dict[str, Any]:
    return {
        'curated': entity,  # ← Entire dict preserved (includes rico, time, etc.)
        'wikidata': fetch_wikidata(entity['label']),  # fetch_wikidata(): the project's lookup helper (not shown)
        'enrichment_date': datetime.now().isoformat()
    }
```

#### Rule 2: Use YAML Safe Operations

```python
# ✅ CORRECT - Use yaml.safe_load() and yaml.dump() with proper settings
import yaml

# Load
with open(yaml_path, 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)  # ← Preserves nested structures

# Save
with open(output_path, 'w', encoding='utf-8') as f:
    yaml.dump(data, f,
              allow_unicode=True,
              default_flow_style=False,  # ← Use block style (preserves nesting)
              sort_keys=False,           # ← Preserve key order
              width=120)                 # ← Reasonable line width
```

#### Rule 3: Test Before Saving

```python
from typing import Any, Dict

# ✅ CORRECT - Validate enriched data before saving
def validate_enriched_data(enriched_data: Dict[str, Any]) -> bool:
    """Ensure nested fields are preserved."""
    for section_name in ['hypernym', 'entity', 'entity_list']:
        if section_name not in enriched_data:
            continue

        for entity in enriched_data[section_name]:
            curated = entity.get('curated', {})

            # Check rico field preservation
            if 'rico' in curated:
                rico = curated['rico']
                if not (isinstance(rico, list) and len(rico) > 0 and
                        isinstance(rico[0], dict) and 'label' in rico[0]):
                    raise ValueError(f"Rico field corrupted in entity: {curated.get('label')}")

            # Check time field preservation
            if 'time' in curated:
                time_val = curated['time']
                if not (isinstance(time_val, list) and len(time_val) > 0 and
                        isinstance(time_val[0], dict) and 'label' in time_val[0]):
                    raise ValueError(f"Time field corrupted in entity: {curated.get('label')}")

    return True

# Use before saving
enriched_data = enricher.enrich_all()
validate_enriched_data(enriched_data)  # ← Will raise error if corruption detected
enricher.save_output(enriched_data)
```

---

## Testing Tools

### Tool 1: Full Integrity Test

```bash
python scripts/test_yaml_integrity.py
```

**Checks**:
- Q-number extraction completeness (3 methods compared)
- Nested field corruption (rico, time)
- Reports missing Q-numbers and corruption examples

**Expected Output** (after fixing typo):

```
✅ NO ISSUES FOUND - All tests passed!
```

### Tool 2: Robust Extraction Test

```bash
python scripts/extract_q_numbers_robust.py
```

**Checks**:
- Comprehensive Q-number extraction
- Data quality issues (typos, formatting)

**Expected Output** (after fixing typo):

```
✅ Extracted 2190 Q-numbers
✅ No 'labels:' typos found
```

---

## Action Plan

### Immediate Actions (Today)

1. **Fix typo in YAML file** ✅ (manually edit line 13354)

   ```bash
   # Change "labels:" to "label:" on line 13354
   ```

2. **Verify fix**

   ```bash
   python scripts/test_yaml_integrity.py
   # Should report: "✅ NO ISSUES FOUND"
   ```

3. **Update `generate_gallery_query_with_exclusions.py`** ✅
   - Replace `extract_q_numbers_from_yaml()` with robust version
   - Handles both "label:" and "labels:" cases

4. **Regenerate gallery query with complete Q-numbers**

   ```bash
   python scripts/generate_gallery_query_with_exclusions.py
   # Should now exclude all 2190 Q-numbers
   ```

### Short-Term Actions (This Week)

5. **Update other scripts**
   - `generate_botanical_query_with_exclusions.py`
   - `execute_archive_query_corrected.py`
   - Any other scripts that extract Q-numbers

6. **Add validation to enrichment workflow**
   - Update `enrich_hyponyms_with_wikidata.py` to use `validate_enriched_data()`
   - Run before saving output

### Long-Term Actions (Next Sprint)

7. **Add pre-commit hook**
   - Validate YAML structure before commits
   - Check for "labels:" typos
   - Verify nested field integrity

8. **Document YAML schema**
   - Create formal schema definition
   - Document expected structure for rico/time fields
   - Add validation script to CI/CD

---

## Prevention Guidelines

### For Manual YAML Editing

❌ **DON'T**:
- Use "labels:" (plural) - always use "label:" (singular)
- Flatten nested structures (rico, time)
- Edit YAML with tools that don't preserve structure

✅ **DO**:
- Use "label:" (singular) for all entity identifiers
- Preserve nested list structures: `[{'label': 'value'}]`
- Use YAML-aware editors (VS Code with YAML extension)

### For Script Development

❌ **DON'T**:
- Assume Q-numbers only appear in "label:" fields
- Use string replacement instead of YAML parsing
- Save without validating nested structures

✅ **DO**:
- Use `extract_all_q_numbers()` for comprehensive extraction
- Use `yaml.safe_load()` and `yaml.dump()` with proper settings
- Validate before saving using `validate_enriched_data()`
- Test with `test_yaml_integrity.py` after any changes

---

## Status Summary

| Issue | Status | Priority | Action |
|-------|--------|----------|--------|
| Q10418031 missing from extraction | ⚠️ Found | High | Fix typo: "labels:" → "label:" |
| Rico field corruption | ✅ No issues | Monitor | Add validation to enrichment |
| Time field corruption | ✅ No issues | Monitor | Add validation to enrichment |
| Incomplete Q-extraction function | ⚠️ Found | High | Use `extract_all_q_numbers()` |

---

## Files Modified/Created

### New Files
- ✅ `scripts/test_yaml_integrity.py` - Comprehensive integrity test
- ✅ `scripts/extract_q_numbers_robust.py` - Robust Q-number extraction
- ✅ `docs/YAML_INTEGRITY_ISSUES.md` - This document

### Files Requiring Updates
- ⏳ `data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml` - Fix line 13354 typo
- ⏳ `scripts/generate_gallery_query_with_exclusions.py` - Use robust extraction
- ⏳ `scripts/generate_botanical_query_with_exclusions.py` - Use robust extraction
- ⏳ `scripts/enrich_hyponyms_with_wikidata.py` - Add validation (optional)

---

## Next Steps

1. Fix the typo manually (line 13354)
2. Run `python scripts/test_yaml_integrity.py` to verify
3. Update query generation scripts to use robust extraction
4. Regenerate SPARQL queries with complete Q-number exclusions
5. Document the fix in session notes

---

**Document Created**: 2025-11-16

**Last Updated**: 2025-11-16

**Author**: AI Agent (OpenCODE Session)
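
---

## Appendix: Tolerant Extraction Sketch

The document describes an extraction step that handles both `label:` and `labels:`, plus a check that flags `labels:` typos, but it does not show either. The following is a minimal sketch of how such logic might look. It is not the contents of `scripts/extract_q_numbers_robust.py`: the names `extract_q_numbers_tolerant` and `find_plural_label_typos`, the regex patterns, and the sample data are all assumptions made for illustration.

```python
import re
from typing import List, Set, Tuple

# Assumption: the only variation to tolerate is "labels:" in place of "label:".
# These patterns are illustrative, not the ones used by the project's scripts.
LABEL_PATTERN = re.compile(r'^\s*-?\s*labels?:\s+(Q\d+)\b', re.MULTILINE)
TYPO_PATTERN = re.compile(r'^\s*-?\s*labels:')


def extract_q_numbers_tolerant(text: str) -> Set[str]:
    """Collect Q-numbers from 'label:' and 'labels:' lines (hypothetical helper)."""
    return set(LABEL_PATTERN.findall(text))


def find_plural_label_typos(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, stripped_line) for each line that uses 'labels:'."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if TYPO_PATTERN.match(line)
    ]


# Made-up sample mirroring the structures shown in this document.
sample = """\
- labels: Q10418031
  hypernym:
  - university college
- label: Q1759852
  rico:
  - label: recordSetTypes
"""

q_numbers = extract_q_numbers_tolerant(sample)
typos = find_plural_label_typos(sample)
print(sorted(q_numbers))  # both Q-numbers, including the one under "labels:"
print(typos)              # the single "labels:" line with its line number
```

Nested `label:` values such as `recordSetTypes` are skipped because the pattern requires a `Q` followed by digits. The typo finder could also serve as the "labels:" check in the planned pre-commit hook, for example by exiting non-zero when the returned list is not empty.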