glam/docs/YAML_INTEGRITY_ISSUES.md

# YAML Integrity Issues & Solutions
**Date**: 2025-11-16
**File**: `data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml`

---

## Issues Discovered

### 1. **Q-Number Extraction Incomplete** ⚠️

**Problem**: The current Q-number extraction function in `generate_gallery_query_with_exclusions.py` misses one Q-number (Q10418031).

**Root Cause**: Typo in YAML file - `labels:` (plural) instead of `label:` (singular)

**Location**: Line 13354 in `hyponyms_curated.yaml`

```yaml
# WRONG (current):
  - labels: Q10418031  # ← Typo: "labels" should be "label"
    hypernym:
      - university colege
    class:
      - E

# CORRECT (should be):
  - label: Q10418031
    hypernym:
      - university college  # Also fix typo: "colege" → "college"
    class:
      - E
```

**Impact**:
- ❌ Q10418031 is left out of the SPARQL query exclusions
- ❌ It may appear in query results as a "new" entity (false positive)
- ⚠️  Queries may return duplicates of entities that are already curated

**Current Extraction Method** (in `scripts/generate_gallery_query_with_exclusions.py`):
```python
# Pattern 1: Extract from "label: Q<digits>" lines
label_pattern = r'^\s*-?\s*label:\s+(Q\d+)'  # ← Only matches "label:", not "labels:"
```
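To see why the pattern skips the record, a minimal sketch can run it against both spellings. The `tolerant_pattern` below is an illustrative assumption, not the pattern used in `extract_q_numbers_robust.py`; it only makes the trailing `s` optional. Applying the pattern with `re.MULTILINE` is also an assumption about how the script scans the file.

```python
import re

# Pattern quoted from generate_gallery_query_with_exclusions.py
label_pattern = r'^\s*-?\s*label:\s+(Q\d+)'

# Hypothetical tolerant variant (assumption): accept "label:" or "labels:"
tolerant_pattern = r'^\s*-?\s*labels?:\s+(Q\d+)'

sample = """\
  - label: Q1759852
  - labels: Q10418031
"""

strict_hits = re.findall(label_pattern, sample, flags=re.MULTILINE)
tolerant_hits = re.findall(tolerant_pattern, sample, flags=re.MULTILINE)

print(strict_hits)    # ['Q1759852']
print(tolerant_hits)  # ['Q1759852', 'Q10418031']
```

A tolerant pattern hides the typo rather than fixing it, so it is best treated as a safety net alongside the YAML correction.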

---

### 2. **Nested Field Corruption Risk** ✅ (Currently OK, but needs protection)

**Problem**: Scripts that process `hyponyms_curated.yaml` MUST preserve nested structures like:
- `rico:` fields (e.g., `rico: [{'label': 'recordSetTypes'}]`)
- `time:` fields (e.g., `time: [{'label': 'Renaissance'}]`)

**Good News**: Current file has **NO corruption** (tested 2025-11-16)
- 135 rico fields: ✅ All correctly structured
- 35 time fields: ✅ All correctly structured

**Risk**: Future scripts or manual edits could corrupt nested structures

**Expected Structure**:
```yaml
# CORRECT - Nested structure preserved:
- label: Q1759852
  rico:
    - label: recordSetTypes  # ← Nested dict with 'label' key
  time:
    - label: Renaissance     # ← Nested dict with 'label' key
  hypernym:
    - museum

# WRONG - Corrupted structure:
- label: Q1759852
  rico: recordSetTypes       # ← Lost nesting!
  time: Renaissance          # ← Lost nesting!
```
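A check of the kind behind the counts above could be sketched roughly as follows. The function name `audit_nested_fields` and the sample records are hypothetical, and the sketch works on data that has already been parsed into Python objects; it is not the code in `test_yaml_integrity.py`.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Tuple

def audit_nested_fields(entities: Iterable[Dict[str, Any]],
                        fields: Tuple[str, ...] = ('rico', 'time')) -> Counter:
    """Count well-formed and flattened nested fields across entity records."""
    counts: Counter = Counter()
    for entity in entities:
        for field in fields:
            if field not in entity:
                continue
            value = entity[field]
            # Well-formed means: a non-empty list of dicts, each with a 'label' key
            ok = (isinstance(value, list) and len(value) > 0
                  and all(isinstance(v, dict) and 'label' in v for v in value))
            counts[f"{field}_{'ok' if ok else 'corrupted'}"] += 1
    return counts

entities = [
    # What the CORRECT block parses to
    {'label': 'Q1759852',
     'rico': [{'label': 'recordSetTypes'}],
     'time': [{'label': 'Renaissance'}],
     'hypernym': ['museum']},
    # What the WRONG block parses to: plain strings instead of lists of dicts
    {'label': 'Q1759852', 'rico': 'recordSetTypes', 'time': 'Renaissance'},
]
print(audit_nested_fields(entities))
```

On the sample data each field is counted once as well-formed and once as corrupted.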

---

## Solutions

### Solution 1: Fix Typo in YAML File

**Manual Fix** (immediate):
```yaml
# Line 13354 in hyponyms_curated.yaml
# Change:
  - labels: Q10418031
    hypernym:
      - university colege

# To:
  - label: Q10418031
    hypernym:
      - university college
```

**Verification**:
```bash
python scripts/test_yaml_integrity.py
# Should report: "✅ NO ISSUES FOUND - All tests passed!"
```

---

### Solution 2: Use Robust Q-Number Extraction

**Recommended**: Update all scripts to use the new robust extraction function.

**New Function** (in `scripts/extract_q_numbers_robust.py`):
```python
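from scripts.extract_q_numbers_robust import extract_all_q_numbers

# Path to the curated file named at the top of this document
# (relative to the repository root; adjust for your checkout)
yaml_path = "data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml"

q_numbers = extract_all_q_numbers(yaml_path)
# Returns 2190 Q-numbers (includes Q10418031)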
```

**Update Required In**:
1. `scripts/generate_gallery_query_with_exclusions.py` ✅ (highest priority)
2. `scripts/generate_botanical_query_with_exclusions.py`
3. `scripts/execute_archive_query_corrected.py`
4. Any future scripts that extract Q-numbers

**Implementation**:
```python
# OLD (incomplete):
from scripts.generate_gallery_query_with_exclusions import extract_q_numbers_from_yaml

# NEW (complete):
from scripts.extract_q_numbers_robust import extract_all_q_numbers
```
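This document does not show the body of `extract_all_q_numbers`. As a rough sketch of one approach that does not depend on key names, the hypothetical helper below (`collect_q_numbers`, an assumed name) walks already-parsed YAML data and collects every string shaped like a Q-number, whatever key it sits under.

```python
import re
from typing import Any, Set

# A Q-number is the letter Q followed by digits, with nothing else in the string
Q_NUMBER = re.compile(r'^Q\d+$')

def collect_q_numbers(node: Any) -> Set[str]:
    """Recursively collect strings shaped like Q-numbers from parsed YAML data."""
    found: Set[str] = set()
    if isinstance(node, dict):
        for value in node.values():
            found |= collect_q_numbers(value)
    elif isinstance(node, list):
        for item in node:
            found |= collect_q_numbers(item)
    elif isinstance(node, str) and Q_NUMBER.match(node.strip()):
        found.add(node.strip())
    return found

# Illustrative data only: one record uses the misspelled "labels" key
parsed = {
    'hypernym': [
        {'label': 'Q1759852', 'rico': [{'label': 'recordSetTypes'}]},
        {'labels': 'Q10418031', 'hypernym': ['university college']},
    ]
}
print(sorted(collect_q_numbers(parsed)))  # ['Q10418031', 'Q1759852']
```

Because the walk ignores key names, a misspelled key no longer drops a Q-number, while nested `label` values such as `recordSetTypes` are skipped because they do not match the Q-number shape.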

---

### Solution 3: Protect Against Nested Field Corruption

**Guideline**: When processing `hyponyms_curated.yaml`, follow these rules:

#### Rule 1: Preserve Original Entity Dict
```python
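from datetime import datetime
from typing import Any, Dict

# ✅ CORRECT - Preserve entire entity structure
# (fetch_wikidata is assumed to be defined elsewhere in the enrichment script)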
def enrich_entity(entity: Dict[str, Any]) -> Dict[str, Any]:
    return {
        'curated': entity,  # ← Entire dict preserved (includes rico, time, etc.)
        'wikidata': fetch_wikidata(entity['label']),
        'enrichment_date': datetime.now().isoformat()
    }
```

#### Rule 2: Use YAML Safe Operations
```python
# ✅ CORRECT - Use yaml.safe_load() and yaml.dump() with proper settings
import yaml

# Load
with open(yaml_path, 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)  # ← Preserves nested structures

# Save
with open(output_path, 'w', encoding='utf-8') as f:
    yaml.dump(data, f,
             allow_unicode=True,
             default_flow_style=False,  # ← Use block style (preserves nesting)
             sort_keys=False,           # ← Preserve key order
             width=120)                 # ← Reasonable line width
```

#### Rule 3: Test Before Saving
```python
from typing import Any, Dict

# ✅ CORRECT - Validate enriched data before saving
def validate_enriched_data(enriched_data: Dict[str, Any]) -> bool:
    """Ensure nested fields are preserved."""
    for section_name in ['hypernym', 'entity', 'entity_list']:
        if section_name not in enriched_data:
            continue

        for entity in enriched_data[section_name]:
            curated = entity.get('curated', {})

            # Check rico field preservation
            if 'rico' in curated:
                rico = curated['rico']
                if not (isinstance(rico, list) and
                       len(rico) > 0 and
                       isinstance(rico[0], dict) and
                       'label' in rico[0]):
                    raise ValueError(f"Rico field corrupted in entity: {curated.get('label')}")

            # Check time field preservation
            if 'time' in curated:
                time_val = curated['time']
                if not (isinstance(time_val, list) and
                       len(time_val) > 0 and
                       isinstance(time_val[0], dict) and
                       'label' in time_val[0]):
                    raise ValueError(f"Time field corrupted in entity: {curated.get('label')}")

    return True

# Use before saving
enriched_data = enricher.enrich_all()
validate_enriched_data(enriched_data)  # ← Will raise error if corruption detected
enricher.save_output(enriched_data)
```

---

## Testing Tools

### Tool 1: Full Integrity Test
```bash
python scripts/test_yaml_integrity.py
```

**Checks**:
- Q-number extraction completeness (3 methods compared)
- Nested field corruption (rico, time)
- Reports missing Q-numbers and corruption examples

**Expected Output** (after fixing typo):
```
✅ NO ISSUES FOUND - All tests passed!
```

### Tool 2: Robust Extraction Test
```bash
python scripts/extract_q_numbers_robust.py
```

**Checks**:
- Comprehensive Q-number extraction
- Data quality issues (typos, formatting)

**Expected Output** (after fixing typo):
```
✅ Extracted 2190 Q-numbers
✅ No 'labels:' typos found
```

---

## Action Plan

### Immediate Actions (Today)

1. **Fix typo in YAML file** ✅ (manually edit line 13354)
   ```bash
   # Change "labels:" to "label:" on line 13354
   ```

2. **Verify fix**
   ```bash
   python scripts/test_yaml_integrity.py
   # Should report: "✅ NO ISSUES FOUND"
   ```

3. **Update `generate_gallery_query_with_exclusions.py`** ✅
   - Replace `extract_q_numbers_from_yaml()` with robust version
   - Handles both "label:" and "labels:" cases

4. **Regenerate gallery query with complete Q-numbers**
   ```bash
   python scripts/generate_gallery_query_with_exclusions.py
   # Should now exclude all 2190 Q-numbers
   ```
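To illustrate why a single missing Q-number matters at this step, here is a minimal sketch of how a list of Q-numbers might be turned into a SPARQL exclusion clause. The function name `build_exclusion_filter`, the `?item` variable, and the use of `FILTER ... NOT IN` are assumptions for illustration; the actual query generator may build its exclusions differently.

```python
from typing import Iterable

def build_exclusion_filter(q_numbers: Iterable[str], variable: str = 'item') -> str:
    """Build a SPARQL FILTER clause excluding the given Wikidata entities."""
    # Deduplicate, then sort numerically so regenerated queries are stable
    ordered = sorted(set(q_numbers), key=lambda q: int(q[1:]))
    if not ordered:
        return ''
    values = ', '.join(f'wd:{q}' for q in ordered)
    return f'FILTER(?{variable} NOT IN ({values}))'

print(build_exclusion_filter(['Q10418031', 'Q1759852']))
# FILTER(?item NOT IN (wd:Q1759852, wd:Q10418031))
```

Any Q-number missing from the input list is simply absent from the clause, so the corresponding entity can come back in the results as if it were new.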

### Short-Term Actions (This Week)

5. **Update other scripts**
   - `generate_botanical_query_with_exclusions.py`
   - `execute_archive_query_corrected.py`
   - Any other scripts that extract Q-numbers

6. **Add validation to enrichment workflow**
   - Update `enrich_hyponyms_with_wikidata.py` to use `validate_enriched_data()`
   - Run before saving output

### Long-Term Actions (Next Sprint)

7. **Add pre-commit hook**
   - Validate YAML structure before commits
   - Check for "labels:" typos
   - Verify nested field integrity

8. **Document YAML schema**
   - Create formal schema definition
   - Document expected structure for rico/time fields
   - Add validation script to CI/CD
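The `labels:` typo check from item 7 could be sketched as a small scanner like the one below. The name `find_labels_typos` is hypothetical, and the wiring into a pre-commit hook (reading the file, exiting non-zero when the list is non-empty) is left out.

```python
import re
from typing import Iterable, List, Tuple

# Matches an optional list dash followed by the misspelled key "labels:"
LABELS_TYPO = re.compile(r'^\s*-?\s*labels:')

def find_labels_typos(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for each line that uses 'labels:' as a key."""
    return [(number, line.rstrip('\n'))
            for number, line in enumerate(lines, start=1)
            if LABELS_TYPO.match(line)]

sample = [
    "  - label: Q1759852\n",
    "  - labels: Q10418031\n",
    "    hypernym:\n",
]
for number, text in find_labels_typos(sample):
    print(f"line {number}: {text.strip()}")
```

Reporting line numbers makes it quick to jump to the offending record in a file as large as `hyponyms_curated.yaml`.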

---

## Prevention Guidelines

### For Manual YAML Editing

❌ **DON'T**:
- Use "labels:" (plural) - always use "label:" (singular)
- Flatten nested structures (rico, time)
- Edit YAML with tools that don't preserve structure

✅ **DO**:
- Use "label:" (singular) for all entity identifiers
- Preserve nested list structures: `[{'label': 'value'}]`
- Use YAML-aware editors (VS Code with YAML extension)

### For Script Development

❌ **DON'T**:
- Assume Q-numbers only appear in "label:" fields
- Use string replacement instead of YAML parsing
- Save without validating nested structures

✅ **DO**:
- Use `extract_all_q_numbers()` for comprehensive extraction
- Use `yaml.safe_load()` and `yaml.dump()` with proper settings
- Validate before saving using `validate_enriched_data()`
- Test with `test_yaml_integrity.py` after any changes

---

## Status Summary

| Issue | Status | Priority | Action |
|-------|--------|----------|--------|
| Q10418031 missing from extraction | ⚠️  Found | High | Fix typo: "labels:" → "label:" |
| Rico field corruption | ✅ No issues | Monitor | Add validation to enrichment |
| Time field corruption | ✅ No issues | Monitor | Add validation to enrichment |
| Incomplete Q-extraction function | ⚠️  Found | High | Use `extract_all_q_numbers()` |

---

## Files Modified/Created

### New Files
- ✅ `scripts/test_yaml_integrity.py` - Comprehensive integrity test
- ✅ `scripts/extract_q_numbers_robust.py` - Robust Q-number extraction
- ✅ `docs/YAML_INTEGRITY_ISSUES.md` - This document

### Files Requiring Updates
- ⏳ `data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml` - Fix line 13354 typo
- ⏳ `scripts/generate_gallery_query_with_exclusions.py` - Use robust extraction
- ⏳ `scripts/generate_botanical_query_with_exclusions.py` - Use robust extraction
- ⏳ `scripts/execute_archive_query_corrected.py` - Use robust extraction
- ⏳ `scripts/enrich_hyponyms_with_wikidata.py` - Add validation (optional)

---

## Next Steps

1. Fix the typo manually (line 13354)
2. Run `python scripts/test_yaml_integrity.py` to verify
3. Update query generation scripts to use robust extraction
4. Regenerate SPARQL queries with complete Q-number exclusions
5. Document the fix in session notes

---

**Document Created**: 2025-11-16
**Last Updated**: 2025-11-16
**Author**: AI Agent (OpenCODE Session)