YAML Integrity Issues & Solutions

Date: 2025-11-16
File: data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml


Issues Discovered

1. Q-Number Extraction Incomplete ⚠️

Problem: The current Q-number extraction function in generate_gallery_query_with_exclusions.py misses one Q-number (Q10418031).

Root Cause: Typo in YAML file - labels: (plural) instead of label: (singular)

Location: Line 13354 in hyponyms_curated.yaml

# WRONG (current):
  - labels: Q10418031  # ← Typo: "labels" should be "label"
    hypernym:
      - university colege
    class:
      - E

# CORRECT (should be):
  - label: Q10418031
    hypernym:
      - university college  # Also fix typo: "colege" → "college"
    class:
      - E

Impact:

  • Q10418031 is missing from the SPARQL query exclusion list
  • It may appear in query results as a "new" entity (false positive)
  • ⚠️ Queries may return entities that are already curated (duplicates)

Current Extraction Method (in scripts/generate_gallery_query_with_exclusions.py):

# Pattern 1: Extract from "label: Q<digits>" lines
label_pattern = r'^\s*-?\s*label:\s+(Q\d+)'  # ← Only matches "label:", not "labels:"
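
To make the mismatch concrete, the following minimal sketch runs the strict pattern quoted above against a two-entry sample and compares it with a tolerant variant. The tolerant pattern (`labels?:`) and the sample text are assumptions for illustration; they are not taken from the project's scripts.

```python
import re

# Strict pattern quoted above: matches "label:" only.
STRICT = re.compile(r'^\s*-?\s*label:\s+(Q\d+)', re.MULTILINE)

# Tolerant variant (an assumption, not the project's code): also accepts "labels:".
TOLERANT = re.compile(r'^\s*-?\s*labels?:\s+(Q\d+)', re.MULTILINE)

# Small sample mirroring the structure of hyponyms_curated.yaml.
SAMPLE = """\
  - label: Q1759852
    hypernym:
      - museum
  - labels: Q10418031
    hypernym:
      - university colege
"""

strict_hits = STRICT.findall(SAMPLE)
tolerant_hits = TOLERANT.findall(SAMPLE)

# Q-numbers that only the tolerant pattern finds.
missed = sorted(set(tolerant_hits) - set(strict_hits))
print(missed)  # ['Q10418031']
```

A tolerant pattern only hides the typo from the extractor; the YAML file itself still needs the fix described under Solution 1.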

2. Nested Field Corruption Risk (Currently OK, but needs protection)

Problem: Scripts that process hyponyms_curated.yaml MUST preserve nested structures like:

  • rico: fields (e.g., rico: [{'label': 'recordSetTypes'}])
  • time: fields (e.g., time: [{'label': 'Renaissance'}])

Good News: Current file has NO corruption (tested 2025-11-16)

  • 135 rico fields: All correctly structured
  • 35 time fields: All correctly structured

Risk: Future scripts or manual edits could corrupt nested structures

Expected Structure:

# CORRECT - Nested structure preserved:
- label: Q1759852
  rico:
    - label: recordSetTypes  # ← Nested dict with 'label' key
  time:
    - label: Renaissance     # ← Nested dict with 'label' key
  hypernym:
    - museum

# WRONG - Corrupted structure:
- label: Q1759852
  rico: recordSetTypes       # ← Lost nesting!
  time: Renaissance          # ← Lost nesting!

Solutions

Solution 1: Fix Typo in YAML File

Manual Fix (immediate):

# Line 13354 in hyponyms_curated.yaml
# Change:
  - labels: Q10418031
    hypernym:
      - university colege

# To:
  - label: Q10418031
    hypernym:
      - university college

Verification:

python scripts/test_yaml_integrity.py
# Should report: "✅ NO ISSUES FOUND - All tests passed!"

Solution 2: Use Robust Q-Number Extraction

Recommended: Update all scripts to use the new robust extraction function.

New Function (in scripts/extract_q_numbers_robust.py):

from scripts.extract_q_numbers_robust import extract_all_q_numbers

q_numbers = extract_all_q_numbers(yaml_path)
# Returns 2190 Q-numbers (includes Q10418031)

Update Required In:

  1. scripts/generate_gallery_query_with_exclusions.py (highest priority)
  2. scripts/generate_botanical_query_with_exclusions.py
  3. scripts/execute_archive_query_corrected.py
  4. Any future scripts that extract Q-numbers

Implementation:

# OLD (incomplete):
from scripts.generate_gallery_query_with_exclusions import extract_q_numbers_from_yaml

# NEW (complete):
from scripts.extract_q_numbers_robust import extract_all_q_numbers

Solution 3: Protect Against Nested Field Corruption

Guideline: When processing hyponyms_curated.yaml, follow these rules:

Rule 1: Preserve Original Entity Dict

# ✅ CORRECT - Preserve entire entity structure
from datetime import datetime
from typing import Any, Dict

def enrich_entity(entity: Dict[str, Any]) -> Dict[str, Any]:
    return {
        'curated': entity,  # ← Entire dict preserved (includes rico, time, etc.)
        'wikidata': fetch_wikidata(entity['label']),  # project helper, not shown here
        'enrichment_date': datetime.now().isoformat()
    }
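
As a self-contained illustration of Rule 1, the sketch below wraps a sample entity and confirms that the nested rico and time lists survive unchanged. `fetch_wikidata_stub` is a hypothetical stand-in for the project's fetch function, and the `copy.deepcopy` call is an added precaution (an assumption, not part of the original rule) so that later edits to the enriched record cannot alter the source entity.

```python
import copy
from datetime import datetime
from typing import Any, Dict


def fetch_wikidata_stub(q_number: str) -> Dict[str, Any]:
    """Hypothetical stand-in for the project's Wikidata fetch."""
    return {"id": q_number}


def enrich_entity(entity: Dict[str, Any]) -> Dict[str, Any]:
    return {
        # Deep copy keeps the nested lists intact and independent of the input.
        "curated": copy.deepcopy(entity),
        "wikidata": fetch_wikidata_stub(entity["label"]),
        "enrichment_date": datetime.now().isoformat(),
    }


entity = {
    "label": "Q1759852",
    "rico": [{"label": "recordSetTypes"}],
    "time": [{"label": "Renaissance"}],
    "hypernym": ["museum"],
}

enriched = enrich_entity(entity)
print(enriched["curated"]["rico"])  # [{'label': 'recordSetTypes'}]
```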

Rule 2: Use YAML Safe Operations

# ✅ CORRECT - Use yaml.safe_load() and yaml.dump() with proper settings
import yaml

# Load
with open(yaml_path, 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)  # ← Preserves nested structures

# Save
with open(output_path, 'w', encoding='utf-8') as f:
    yaml.dump(data, f,
             allow_unicode=True,
             default_flow_style=False,  # ← Block style (nested items on their own lines)
             sort_keys=False,           # ← Preserve key order
             width=120)                 # ← Reasonable line width

Rule 3: Test Before Saving

# ✅ CORRECT - Validate enriched data before saving
from typing import Any, Dict

def validate_enriched_data(enriched_data: Dict[str, Any]) -> bool:
    """Ensure nested fields are preserved."""
    for section_name in ['hypernym', 'entity', 'entity_list']:
        if section_name not in enriched_data:
            continue
        
        for entity in enriched_data[section_name]:
            curated = entity.get('curated', {})
            
            # Check rico field preservation
            if 'rico' in curated:
                rico = curated['rico']
                if not (isinstance(rico, list) and 
                       len(rico) > 0 and 
                       isinstance(rico[0], dict) and 
                       'label' in rico[0]):
                    raise ValueError(f"Rico field corrupted in entity: {curated.get('label')}")
            
            # Check time field preservation
            if 'time' in curated:
                time_val = curated['time']
                if not (isinstance(time_val, list) and 
                       len(time_val) > 0 and 
                       isinstance(time_val[0], dict) and 
                       'label' in time_val[0]):
                    raise ValueError(f"Time field corrupted in entity: {curated.get('label')}")
    
    return True

# Use before saving
enriched_data = enricher.enrich_all()
validate_enriched_data(enriched_data)  # ← Will raise error if corruption detected
enricher.save_output(enriched_data)

Testing Tools

Tool 1: Full Integrity Test

python scripts/test_yaml_integrity.py

Checks:

  • Q-number extraction completeness (3 methods compared)
  • Nested field corruption (rico, time)
  • Reports missing Q-numbers and corruption examples

Expected Output (after fixing typo):

✅ NO ISSUES FOUND - All tests passed!
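
The internals of test_yaml_integrity.py are not reproduced here. A completeness check of this kind can be sketched as a set comparison: run several extraction methods over the same text and report what each one misses relative to their union. The three patterns and the name `compare_methods` below are assumptions chosen for illustration.

```python
import re
from typing import Dict, Set

# Three illustrative extraction methods (assumed, not the script's actual ones).
METHODS = {
    "label_only": re.compile(r'^\s*-?\s*label:\s+(Q\d+)', re.MULTILINE),
    "label_or_labels": re.compile(r'^\s*-?\s*labels?:\s+(Q\d+)', re.MULTILINE),
    "any_q_token": re.compile(r'\b(Q\d+)\b'),
}


def compare_methods(text: str) -> Dict[str, Set[str]]:
    """Return, per method, the Q-numbers it missed relative to the union of all methods."""
    found = {name: set(pattern.findall(text)) for name, pattern in METHODS.items()}
    union = set().union(*found.values())
    return {name: union - hits for name, hits in found.items()}


SAMPLE = """\
- label: Q1759852
  hypernym:
    - museum
- labels: Q10418031
  hypernym:
    - university college
"""

missing = compare_methods(SAMPLE)
for name, gaps in missing.items():
    print(name, sorted(gaps))
```

On this sample only the `label_only` method reports a gap (Q10418031), which mirrors the issue described in this document.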

Tool 2: Robust Extraction Test

python scripts/extract_q_numbers_robust.py

Checks:

  • Comprehensive Q-number extraction
  • Data quality issues (typos, formatting)

Expected Output (after fixing typo):

✅ Extracted 2190 Q-numbers
✅ No 'labels:' typos found

Action Plan

Immediate Actions (Today)

  1. Fix typo in YAML file (manually edit line 13354)

    # Change "labels:" to "label:" on line 13354
    
  2. Verify fix

    python scripts/test_yaml_integrity.py
    # Should report: "✅ NO ISSUES FOUND"
    
  3. Update generate_gallery_query_with_exclusions.py

    • Replace extract_q_numbers_from_yaml() with extract_all_q_numbers()
    • The robust version handles both "label:" and "labels:" cases
  4. Regenerate gallery query with complete Q-numbers

    python scripts/generate_gallery_query_with_exclusions.py
    # Should now exclude all 2190 Q-numbers
    

Short-Term Actions (This Week)

  1. Update other scripts

    • generate_botanical_query_with_exclusions.py
    • execute_archive_query_corrected.py
    • Any other scripts that extract Q-numbers
  2. Add validation to enrichment workflow

    • Update enrich_hyponyms_with_wikidata.py to use validate_enriched_data()
    • Run before saving output

Long-Term Actions (Next Sprint)

  1. Add pre-commit hook

    • Validate YAML structure before commits
    • Check for "labels:" typos
    • Verify nested field integrity
  2. Document YAML schema

    • Create formal schema definition
    • Document expected structure for rico/time fields
    • Add validation script to CI/CD
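
The "labels:" check proposed for the pre-commit hook could be sketched as follows. This is a minimal, line-based scan under stated assumptions: the function names are hypothetical, the hook wiring is left out, and nested-field validation would still need a YAML parser.

```python
import re
from typing import Iterable, List, Tuple

# Matches a key spelled "labels" at the start of a (possibly list-item) line.
TYPO = re.compile(r'^\s*-?\s*labels\s*:')


def find_labels_typos(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for every line using 'labels:' as a key."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if TYPO.match(line)
    ]


def check_file(path: str) -> int:
    """Print each typo found in one file; return 1 if any were found, else 0."""
    with open(path, encoding="utf-8") as handle:
        hits = find_labels_typos(handle)
    for number, line in hits:
        print(f"{path}:{number}: 'labels:' should be 'label:': {line.strip()}")
    return 1 if hits else 0


sample_lines = ["- label: Q1759852\n", "  - labels: Q10418031\n", "    - label: Renaissance\n"]
hits = find_labels_typos(sample_lines)
print(hits)  # [(2, '  - labels: Q10418031')]
```

A hook script would call `check_file()` for each staged YAML file and exit with a non-zero status if any call returns 1, which blocks the commit.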

Prevention Guidelines

For Manual YAML Editing

DON'T:

  • Use "labels:" (plural) - always use "label:" (singular)
  • Flatten nested structures (rico, time)
  • Edit YAML with tools that don't preserve structure

DO:

  • Use "label:" (singular) for all entity identifiers
  • Preserve nested list structures: [{'label': 'value'}]
  • Use YAML-aware editors (VS Code with YAML extension)

For Script Development

DON'T:

  • Assume Q-numbers only appear in "label:" fields
  • Use string replacement instead of YAML parsing
  • Save without validating nested structures

DO:

  • Use extract_all_q_numbers() for comprehensive extraction
  • Use yaml.safe_load() and yaml.dump() with proper settings
  • Validate before saving using validate_enriched_data()
  • Test with test_yaml_integrity.py after any changes

Status Summary

| Issue | Status | Priority | Action |
|---|---|---|---|
| Q10418031 missing from extraction | ⚠️ Found | High | Fix typo: "labels:" → "label:" |
| Rico field corruption | No issues | Monitor | Add validation to enrichment |
| Time field corruption | No issues | Monitor | Add validation to enrichment |
| Incomplete Q-extraction function | ⚠️ Found | High | Use extract_all_q_numbers() |

Files Modified/Created

New Files

  • scripts/test_yaml_integrity.py - Comprehensive integrity test
  • scripts/extract_q_numbers_robust.py - Robust Q-number extraction
  • docs/YAML_INTEGRITY_ISSUES.md - This document

Files Requiring Updates

  • data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml - Fix line 13354 typo
  • scripts/generate_gallery_query_with_exclusions.py - Use robust extraction
  • scripts/generate_botanical_query_with_exclusions.py - Use robust extraction
  • scripts/execute_archive_query_corrected.py - Use robust extraction
  • scripts/enrich_hyponyms_with_wikidata.py - Add validation (optional)

Next Steps

  1. Fix the typo manually (line 13354)
  2. Run python scripts/test_yaml_integrity.py to verify
  3. Update query generation scripts to use robust extraction
  4. Regenerate SPARQL queries with complete Q-number exclusions
  5. Document the fix in session notes

Document Created: 2025-11-16
Last Updated: 2025-11-16
Author: AI Agent (OpenCODE Session)