glam/data/wikidata/GLAMORCUBEPSXHFN/QUERY_LOGIC_EXPLANATION.md
2025-11-19 23:25:22 +01:00

# SPARQL Query Logic Explanation: Traversal vs. Result Exclusion
## The Confusion
When we say "exclude Q-numbers from curated vocabulary," there are TWO different meanings:

1. **Exclude from traversal** - Don't follow subclass paths through these Q-numbers
2. **Exclude from results** - Don't return these Q-numbers, but DO follow their paths

Our queries implement **#2** (exclude from results), which is CORRECT for hyponym discovery.

---
## How It Works
### Query Structure
```sparql
SELECT ?hyponym WHERE {
  ?hyponym wdt:P279+ wd:Q167346 .                        # Find ALL subclasses of "botanical garden"
  FILTER(?hyponym NOT IN (wd:Q167346, wd:Q46169, ...))   # Exclude curated items from RESULTS
}
```
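The query above has a fixed shape: one base class and one exclusion list. As a minimal sketch (the helper name `build_hyponym_query` is hypothetical and not part of this repository), the query text could be assembled from a base class and a list of curated Q-numbers like this:

```python
def build_hyponym_query(base_class: str, curated: list[str]) -> str:
    """Return a SPARQL query listing subclasses of base_class, minus curated IDs.

    Hypothetical helper for illustration; the real query files may be built differently.
    """
    # The base class is excluded from results as well, followed by the curated IDs.
    excluded = [base_class] + [q for q in curated if q != base_class]
    not_in = ", ".join(f"wd:{q}" for q in excluded)
    return (
        "SELECT ?hyponym WHERE {\n"
        f"  ?hyponym wdt:P279+ wd:{base_class} .\n"   # traversal is never restricted
        f"  FILTER(?hyponym NOT IN ({not_in}))\n"      # only the result set is filtered
        "}"
    )


query = build_hyponym_query("Q167346", ["Q43501", "Q2322153"])
print(query)
```

Note that the curated list only ever reaches the `FILTER` line. The property path itself is untouched, which is what keeps traversal unrestricted.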
### Visual Example
```
Botanical Garden (Q167346) [CURATED - excluded from results]
├── Arboretum [CURATED - excluded from results]
│   ├── Historic arboretum (Q999001) [NEW - appears in results ✅]
│   └── University arboretum (Q999002) [NEW - appears in results ✅]
├── Zoo (Q43501) [CURATED - excluded from results]
│   ├── Safari park (Q2322153) [CURATED - excluded from results]
│   │   └── Drive-through safari (Q999003) [NEW - appears in results ✅]
│   └── Petting zoo (Q999004) [NEW - appears in results ✅]
└── Botanical conservatory (Q999005) [NEW - appears in results ✅]
```
### What The Query Does
1. **Starts** at base class (e.g., Q167346 "botanical garden")
2. **Traverses** through ALL subclasses, INCLUDING curated ones like Q43501 (zoo)
3. **Continues** traversing through curated items to find their children
4. **Filters** out curated Q-numbers from the final result set
5. **Returns** only NEW hyponyms not in curated vocabulary
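The five steps above can be mimicked on a toy graph. This is only a sketch of the logic, not how the Wikidata Query Service evaluates property paths. The edges are copied from the visual example, and the `Q999xxx` IDs are placeholders, not real Wikidata items.

```python
from collections import deque

# Toy "subclass of" edges (child -> parents), taken from the visual example.
SUBCLASS_OF = {
    "Q43501": ["Q167346"],     # zoo -> botanical garden
    "Q2322153": ["Q43501"],    # safari park -> zoo
    "Q999003": ["Q2322153"],   # drive-through safari -> safari park (placeholder ID)
    "Q999004": ["Q43501"],     # petting zoo -> zoo (placeholder ID)
    "Q999005": ["Q167346"],    # botanical conservatory -> botanical garden (placeholder ID)
}
CURATED = {"Q167346", "Q43501", "Q2322153"}


def all_subclasses(base, subclass_of):
    """Collect every direct and indirect subclass of base (the role of wdt:P279+)."""
    children = {}
    for child, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    seen, queue = set(), deque([base])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:      # curated or not, every child is expanded
                seen.add(child)
                queue.append(child)
    return seen


def new_hyponyms(base, subclass_of, curated):
    """Traverse everything first, then drop curated IDs from the result set."""
    return all_subclasses(base, subclass_of) - curated


result = new_hyponyms("Q167346", SUBCLASS_OF, CURATED)
print(sorted(result))  # ['Q999003', 'Q999004', 'Q999005']
```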
---
## Why This Is Correct
### Goal: Find Missing Hyponyms
We want to discover institution types that are:
- Subclasses of our base classes (botanical garden, zoo, etc.)
- NOT already in our curated vocabulary
- **Including** subclasses of curated items
### Example Scenario
**Curated vocabulary includes:**

- Q167346: botanical garden
- Q43501: zoo
- Q473972: protected area

**Query should find:**

- "Drive-through safari" (subclass of zoo)
- "Marine protected area" (subclass of protected area)
- "Alpine garden" (subclass of botanical garden)

**Query should NOT find:**

- Q43501 (zoo) - already curated
- Q473972 (protected area) - already curated
---
## What Would Go Wrong If We Excluded From Traversal
### Incorrect Approach (Hypothetical)
```sparql
# WRONG: This would drop every hyponym that sits below a curated item
SELECT ?hyponym WHERE {
  ?hyponym wdt:P279+ wd:Q167346 .
  FILTER NOT EXISTS {
    ?hyponym wdt:P279* ?ancestor .      # Any class on the way up (including ?hyponym itself)
    ?ancestor wdt:P279+ wd:Q167346 .    # ...that lies below the base class
    FILTER(?ancestor IN (wd:Q43501, wd:Q46169, ...))   # Exclude curated from PATH
  }
}
```
**Problem:** This would NOT find "drive-through safari" because:
1. Drive-through safari → subclass of → Safari park (Q2322153, curated)
2. Safari park → subclass of → Zoo (Q43501, curated)
3. Query stops at Zoo because it's in the exclusion list
4. Never reaches drive-through safari
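The same failure can be reproduced on a toy graph. The sketch below assumes a hypothetical `blocked_traversal` function that refuses to expand curated nodes. The edges follow the visual example, and `Q999xxx` IDs are placeholders.

```python
# Toy graph as parent -> children, following the visual example.
CHILDREN = {
    "Q167346": ["Q43501", "Q999005"],    # botanical garden -> zoo, botanical conservatory
    "Q43501": ["Q2322153", "Q999004"],   # zoo -> safari park, petting zoo
    "Q2322153": ["Q999003"],             # safari park -> drive-through safari
}
CURATED = {"Q167346", "Q43501", "Q2322153"}


def blocked_traversal(base, children, curated):
    """Walk down from base but never step onto a curated node (the WRONG behaviour)."""
    found, stack = set(), [base]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child in curated:
                continue               # curated item blocks the path; its children stay hidden
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


missed = blocked_traversal("Q167346", CHILDREN, CURATED)
print(sorted(missed))  # ['Q999005']
```

Only the botanical conservatory survives. The drive-through safari (Q999003) and the petting zoo (Q999004) are lost because the walk never passes through zoo.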
### Correct Approach (What We Do)
```sparql
# CORRECT: Traverse through everything, filter results
SELECT ?hyponym WHERE {
  ?hyponym wdt:P279+ wd:Q167346 .                       # Traverse through ALL subclasses
  FILTER(?hyponym NOT IN (wd:Q43501, wd:Q46169, ...))   # Exclude curated from RESULTS only
}
```
**Success:** This DOES find "drive-through safari" because:
1. Query starts at botanical garden
2. Finds zoo (curated) as subclass
3. **Continues** traversing through zoo to find safari park
4. **Continues** through safari park to find drive-through safari
5. Filters out zoo and safari park from results (in curated list)
6. Returns drive-through safari (NOT in curated list)
---
## Summary
**What Our Queries Do (CORRECT)**

- Traverse through curated Q-numbers to find their children
- Exclude curated Q-numbers from appearing in results
- Return only NEW hyponyms not in curated vocabulary

**What We DO NOT Do (Would Be Wrong)**

- Block traversal through curated Q-numbers
- Stop searching when we encounter a curated item
- Prevent discovery of subclasses of curated items
---
## Expected Results
When running these queries, you should see:

- ✅ NEW institution types (children of curated items)
- ✅ Specific subtypes not in our vocabulary
- ❌ Base classes already in curated list
- ❌ Parent classes we already know about

**Example Expected Results:**

- "Tropical botanical garden" (subclass of botanical garden)
- "Children's zoo" (subclass of zoo)
- "Marine wildlife sanctuary" (subclass of wildlife sanctuary)
- "Geopark" (subclass of protected area)

**Example EXCLUDED Results (correctly filtered out):**

- "Botanical garden" (Q167346) - base class, curated
- "Zoo" (Q43501) - base class, curated
- "Protected area" (Q473972) - parent class, curated
---
## Verification
To verify the query is working correctly:
1. Run query on Wikidata Query Service
2. Check first 10 results
3. Verify NONE are in hyponyms_curated.yaml
4. Verify ALL are institution types (not random objects)
5. Verify they are subclasses of base classes
- If results include curated Q-numbers → the query is broken (this should not happen)
- If results include non-institution items → data quality issue in Wikidata, not a problem with the query logic
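Step 3 of the checklist is easy to automate. The sketch below assumes the curated Q-numbers have already been read from `hyponyms_curated.yaml` into a plain list. Loading the file is left out because its structure is not described here. The function name `curated_leaks` and the sample IDs are hypothetical.

```python
def curated_leaks(result_ids, curated_ids):
    """Return the result Q-numbers that also appear in the curated vocabulary."""
    curated = set(curated_ids)
    return [qid for qid in result_ids if qid in curated]


# Placeholder data: in practice these come from the query output and the YAML file.
results = ["Q999003", "Q999004", "Q999005"]
curated = ["Q167346", "Q43501", "Q2322153"]

leaks = curated_leaks(results[:10], curated)   # check the first 10 results
print("OK" if not leaks else f"Query is broken, curated IDs returned: {leaks}")
```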
---
**Generated**: 2025-11-13T16:52:19+00:00
**Query Version**: botanical_query_updated_20251113T165219.sparql