# YAML Integrity Issues & Solutions

**Date**: 2025-11-16
**File**: `data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml`

---

## Issues Discovered

### 1. **Q-Number Extraction Incomplete** ⚠️

**Problem**: The current Q-number extraction function in `generate_gallery_query_with_exclusions.py` misses one Q-number, Q10418031.

**Root Cause**: A typo in the YAML file: the key is written `labels:` (plural) instead of `label:` (singular).

**Location**: Line 13354 in `hyponyms_curated.yaml`

```yaml
# WRONG (current):
- labels: Q10418031  # ← Typo: "labels" should be "label"
  hypernym:
  - university colege
  class:
  - E

# CORRECT (should be):
- label: Q10418031
  hypernym:
  - university college  # Also fix typo: "colege" → "college"
  class:
  - E
```

**Impact**:
- ❌ Q10418031 is missing from the SPARQL query's exclusion list
- ❌ It may appear in query results as a "new" entity (false positive)
- ⚠️ Queries may return duplicate results

**Current Extraction Method** (in `scripts/generate_gallery_query_with_exclusions.py`):
```python
# Pattern 1: Extract from "label: Q<digits>" lines
label_pattern = r'^\s*-?\s*label:\s+(Q\d+)'  # ← Only matches "label:", not "labels:"
```

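To make the failure concrete, the pattern quoted above can be run against one well-formed line and one line carrying the typo. This is a minimal sketch: the helper name `extract_with_current_pattern` and the two sample lines are illustrative assumptions, not code taken from the script.

```python
import re

# Same pattern as the current extraction method quoted above.
LABEL_PATTERN = re.compile(r'^\s*-?\s*label:\s+(Q\d+)')


def extract_with_current_pattern(lines):
    """Return the Q-numbers the strict pattern matches, line by line."""
    found = []
    for line in lines:
        match = LABEL_PATTERN.match(line)
        if match:
            found.append(match.group(1))
    return found


# Two illustrative lines: one well-formed, one with the "labels:" typo.
sample_lines = [
    "- label: Q1759852",
    "- labels: Q10418031",
]

# The typo line is skipped: after "label" the pattern requires ":" but finds "s".
print(extract_with_current_pattern(sample_lines))  # ['Q1759852']
```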
---

### 2. **Nested Field Corruption Risk** ✅ (Currently OK, but needs protection)

**Problem**: Scripts that process `hyponyms_curated.yaml` MUST preserve nested structures like:
- `rico:` fields (e.g., `rico: [{'label': 'recordSetTypes'}]`)
- `time:` fields (e.g., `time: [{'label': 'Renaissance'}]`)

**Good News**: Current file has **NO corruption** (tested 2025-11-16)
- 135 rico fields: ✅ All correctly structured
- 35 time fields: ✅ All correctly structured

**Risk**: Future scripts or manual edits could corrupt nested structures

**Expected Structure**:
```yaml
# CORRECT - Nested structure preserved:
- label: Q1759852
  rico:
  - label: recordSetTypes  # ← Nested dict with 'label' key
  time:
  - label: Renaissance  # ← Nested dict with 'label' key
  hypernym:
  - museum

# WRONG - Corrupted structure:
- label: Q1759852
  rico: recordSetTypes  # ← Lost nesting!
  time: Renaissance  # ← Lost nesting!
```

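Once parsed, the difference between the two forms is a list of dicts versus a bare string. The sketch below writes the parsed values out as plain Python literals, so no YAML library is needed, and flags flattened fields. The helper name `is_nested_label_list` is hypothetical, and it checks every list item, which is an assumption about how strict the check should be.

```python
from typing import Any


def is_nested_label_list(value: Any) -> bool:
    """True if value is a non-empty list whose items are all dicts with a 'label' key."""
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(isinstance(item, dict) and 'label' in item for item in value)
    )


# What a YAML parser would hand back for the two examples above,
# written out as plain Python literals.
correct_entity = {
    'label': 'Q1759852',
    'rico': [{'label': 'recordSetTypes'}],
    'time': [{'label': 'Renaissance'}],
    'hypernym': ['museum'],
}
corrupted_entity = {
    'label': 'Q1759852',
    'rico': 'recordSetTypes',
    'time': 'Renaissance',
}

for name, entity in [('correct', correct_entity), ('corrupted', corrupted_entity)]:
    # Collect the nested fields that are present but no longer list-of-dict shaped.
    flattened = [key for key in ('rico', 'time')
                 if key in entity and not is_nested_label_list(entity[key])]
    print(name, flattened)
```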
---

## Solutions

### Solution 1: Fix Typo in YAML File

**Manual Fix** (immediate):
```yaml
# Line 13354 in hyponyms_curated.yaml
# Change:
- labels: Q10418031
  hypernym:
  - university colege

# To:
- label: Q10418031
  hypernym:
  - university college
```

**Verification**:
```bash
python scripts/test_yaml_integrity.py
# Should report: "✅ NO ISSUES FOUND - All tests passed!"
```

---

### Solution 2: Use Robust Q-Number Extraction

**Recommended**: Update all scripts to use the new robust extraction function.

**New Function** (in `scripts/extract_q_numbers_robust.py`):
```python
from scripts.extract_q_numbers_robust import extract_all_q_numbers

q_numbers = extract_all_q_numbers(yaml_path)
# Returns 2190 Q-numbers (includes Q10418031)
```

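The body of `extract_all_q_numbers()` is not reproduced in this document, so the following is only a sketch of one way a more tolerant extraction could work: the same line-anchored pattern, with the key widened to accept `labels:` as well. The name `extract_q_numbers_sketch` and the sample text are assumptions for illustration; the real function may use a different strategy.

```python
import re
from typing import Set

# Current pattern, as quoted from generate_gallery_query_with_exclusions.py.
STRICT = re.compile(r'^\s*-?\s*label:\s+(Q\d+)')
# Hypothetical tolerant variant: the optional "s" also accepts the "labels:" typo.
TOLERANT = re.compile(r'^\s*-?\s*labels?:\s+(Q\d+)')


def extract_q_numbers_sketch(yaml_text: str, pattern: re.Pattern = TOLERANT) -> Set[str]:
    """Collect Q-numbers from lines whose key matches the given pattern."""
    found = set()
    for line in yaml_text.splitlines():
        match = pattern.match(line)
        if match:
            found.add(match.group(1))
    return found


# Illustrative excerpt containing one well-formed entry and the typo entry.
sample = """\
- label: Q1759852
  hypernym:
  - museum
- labels: Q10418031
  hypernym:
  - university colege
"""

print(sorted(extract_q_numbers_sketch(sample, STRICT)))  # ['Q1759852']
print(sorted(extract_q_numbers_sketch(sample)))          # ['Q10418031', 'Q1759852']
```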
**Update Required In**:
1. `scripts/generate_gallery_query_with_exclusions.py` ✅ (highest priority)
2. `scripts/generate_botanical_query_with_exclusions.py`
3. `scripts/execute_archive_query_corrected.py`
4. Any future scripts that extract Q-numbers

**Implementation**:
```python
# OLD (incomplete):
from scripts.generate_gallery_query_with_exclusions import extract_q_numbers_from_yaml

# NEW (complete):
from scripts.extract_q_numbers_robust import extract_all_q_numbers
```

---

### Solution 3: Protect Against Nested Field Corruption

**Guideline**: When processing `hyponyms_curated.yaml`, follow these rules:

#### Rule 1: Preserve Original Entity Dict
```python
from datetime import datetime
from typing import Any, Dict

# ✅ CORRECT - Preserve entire entity structure
# (fetch_wikidata() is assumed to be defined elsewhere in the enrichment script)
def enrich_entity(entity: Dict[str, Any]) -> Dict[str, Any]:
    return {
        'curated': entity,  # ← Entire dict preserved (includes rico, time, etc.)
        'wikidata': fetch_wikidata(entity['label']),
        'enrichment_date': datetime.now().isoformat()
    }
```

#### Rule 2: Use YAML Safe Operations
```python
# ✅ CORRECT - Use yaml.safe_load() and yaml.dump() with proper settings
import yaml

# Load
with open(yaml_path, 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)  # ← Preserves nested structures

# Save
with open(output_path, 'w', encoding='utf-8') as f:
    yaml.dump(data, f,
              allow_unicode=True,
              default_flow_style=False,  # ← Use block style (preserves nesting)
              sort_keys=False,           # ← Preserve key order
              width=120)                 # ← Reasonable line width
```

#### Rule 3: Test Before Saving
```python
from typing import Any, Dict

# ✅ CORRECT - Validate enriched data before saving
def validate_enriched_data(enriched_data: Dict[str, Any]) -> bool:
    """Ensure nested fields are preserved."""
    for section_name in ['hypernym', 'entity', 'entity_list']:
        if section_name not in enriched_data:
            continue

        for entity in enriched_data[section_name]:
            curated = entity.get('curated', {})

            # Check rico field preservation
            if 'rico' in curated:
                rico = curated['rico']
                if not (isinstance(rico, list) and
                        len(rico) > 0 and
                        isinstance(rico[0], dict) and
                        'label' in rico[0]):
                    raise ValueError(f"Rico field corrupted in entity: {curated.get('label')}")

            # Check time field preservation
            if 'time' in curated:
                time_val = curated['time']
                if not (isinstance(time_val, list) and
                        len(time_val) > 0 and
                        isinstance(time_val[0], dict) and
                        'label' in time_val[0]):
                    raise ValueError(f"Time field corrupted in entity: {curated.get('label')}")

    return True

# Use before saving
enriched_data = enricher.enrich_all()
validate_enriched_data(enriched_data)  # ← Will raise error if corruption detected
enricher.save_output(enriched_data)
```

---

## Testing Tools

### Tool 1: Full Integrity Test
```bash
python scripts/test_yaml_integrity.py
```

**Checks**:
- Q-number extraction completeness (3 methods compared)
- Nested field corruption (rico, time)
- Reports missing Q-numbers and corruption examples

**Expected Output** (after fixing typo):
```
✅ NO ISSUES FOUND - All tests passed!
```

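The completeness check compares the results of several extraction methods. One plausible way to do that is a set difference against the union of all results, sketched below. The three method names and their sample outputs are hypothetical placeholders, since this document does not list the methods `test_yaml_integrity.py` actually compares.

```python
from typing import Dict, Set


def missing_per_method(results: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each method, return the Q-numbers found by another method but not by it."""
    union = set().union(*results.values())
    return {name: union - found for name, found in results.items()}


# Hypothetical outputs of three extraction methods on the same file.
results = {
    'strict_label_regex': {'Q1759852'},
    'tolerant_label_regex': {'Q1759852', 'Q10418031'},
    'any_q_token': {'Q1759852', 'Q10418031'},
}

for name, missing in missing_per_method(results).items():
    # An empty list means the method found everything the others found.
    print(name, sorted(missing))
```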
### Tool 2: Robust Extraction Test
```bash
python scripts/extract_q_numbers_robust.py
```

**Checks**:
- Comprehensive Q-number extraction
- Data quality issues (typos, formatting)

**Expected Output** (after fixing typo):
```
✅ Extracted 2190 Q-numbers
✅ No 'labels:' typos found
```

---

## Action Plan

### Immediate Actions (Today)

1. **Fix typo in YAML file** ✅ (manually edit line 13354)
   ```bash
   # Change "labels:" to "label:" on line 13354
   ```

2. **Verify fix**
   ```bash
   python scripts/test_yaml_integrity.py
   # Should report: "✅ NO ISSUES FOUND"
   ```

3. **Update `generate_gallery_query_with_exclusions.py`** ✅
   - Replace `extract_q_numbers_from_yaml()` with robust version
   - Handles both "label:" and "labels:" cases

4. **Regenerate gallery query with complete Q-numbers**
   ```bash
   python scripts/generate_gallery_query_with_exclusions.py
   # Should now exclude all 2190 Q-numbers
   ```

### Short-Term Actions (This Week)

5. **Update other scripts**
   - `generate_botanical_query_with_exclusions.py`
   - `execute_archive_query_corrected.py`
   - Any other scripts that extract Q-numbers

6. **Add validation to enrichment workflow**
   - Update `enrich_hyponyms_with_wikidata.py` to use `validate_enriched_data()`
   - Run before saving output

### Long-Term Actions (Next Sprint)

7. **Add pre-commit hook**
   - Validate YAML structure before commits
   - Check for "labels:" typos
   - Verify nested field integrity

8. **Document YAML schema**
   - Create formal schema definition
   - Document expected structure for rico/time fields
   - Add validation script to CI/CD

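The typo check in the pre-commit hook (item 7) could be as small as a line scan that reports each offending line and returns a non-zero status. The sketch below covers only the `labels:` check, not the nested-field verification. The function names `find_labels_typos` and `check_files` are hypothetical, and how the script is wired into a hook is left open.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Matches a "labels:" key at the start of a mapping entry or list item.
LABELS_TYPO = re.compile(r'^\s*-?\s*labels:')


def find_labels_typos(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, stripped_line) for every line using the plural key."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if LABELS_TYPO.match(line)
    ]


def check_files(paths: List[str]) -> int:
    """Return 1 if any file contains the typo, else 0 (usable as an exit code)."""
    status = 0
    for path in paths:
        text = Path(path).read_text(encoding='utf-8')
        for number, line in find_labels_typos(text):
            print(f"{path}:{number}: found 'labels:' (expected 'label:'): {line}")
            status = 1
    return status


# A hook entry point would typically end with:
#     sys.exit(check_files(sys.argv[1:]))
# Here the scanner is exercised on an in-memory sample instead.
sample = "- label: Q1759852\n- labels: Q10418031\n"
print(find_labels_typos(sample))  # [(2, '- labels: Q10418031')]
```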
---

## Prevention Guidelines

### For Manual YAML Editing

❌ **DON'T**:
- Use "labels:" (plural) - always use "label:" (singular)
- Flatten nested structures (rico, time)
- Edit YAML with tools that don't preserve structure

✅ **DO**:
- Use "label:" (singular) for all entity identifiers
- Preserve nested list structures: `[{'label': 'value'}]`
- Use YAML-aware editors (VS Code with YAML extension)

### For Script Development

❌ **DON'T**:
- Assume Q-numbers only appear in "label:" fields
- Use string replacement instead of YAML parsing
- Save without validating nested structures

✅ **DO**:
- Use `extract_all_q_numbers()` for comprehensive extraction
- Use `yaml.safe_load()` and `yaml.dump()` with proper settings
- Validate before saving using `validate_enriched_data()`
- Test with `test_yaml_integrity.py` after any changes

---

## Status Summary

| Issue | Status | Priority | Action |
|-------|--------|----------|--------|
| Q10418031 missing from extraction | ⚠️ Found | High | Fix typo: "labels:" → "label:" |
| Rico field corruption | ✅ No issues | Monitor | Add validation to enrichment |
| Time field corruption | ✅ No issues | Monitor | Add validation to enrichment |
| Incomplete Q-extraction function | ⚠️ Found | High | Use `extract_all_q_numbers()` |

---

## Files Modified/Created

### New Files
- ✅ `scripts/test_yaml_integrity.py` - Comprehensive integrity test
- ✅ `scripts/extract_q_numbers_robust.py` - Robust Q-number extraction
- ✅ `docs/YAML_INTEGRITY_ISSUES.md` - This document

### Files Requiring Updates
- ⏳ `data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml` - Fix line 13354 typo
- ⏳ `scripts/generate_gallery_query_with_exclusions.py` - Use robust extraction
- ⏳ `scripts/generate_botanical_query_with_exclusions.py` - Use robust extraction
- ⏳ `scripts/enrich_hyponyms_with_wikidata.py` - Add validation (optional)

---

## Next Steps

1. Fix the typo manually (line 13354)
2. Run `python scripts/test_yaml_integrity.py` to verify
3. Update query generation scripts to use robust extraction
4. Regenerate SPARQL queries with complete Q-number exclusions
5. Document the fix in session notes

---

**Document Created**: 2025-11-16
**Last Updated**: 2025-11-16
**Author**: AI Agent (OpenCODE Session)