SPARQL Query Logic Explanation: Traversal vs. Result Exclusion
The Confusion
When we say "exclude Q-numbers from curated vocabulary," there are TWO different meanings:
- ❌ Exclude from traversal - Don't follow subclass paths through these Q-numbers
- ✅ Exclude from results - Don't return these Q-numbers, but DO follow their paths
Our queries implement the second meaning (exclude from results), which is CORRECT for hyponym discovery.
How It Works
Query Structure
SELECT ?hyponym WHERE {
  ?hyponym wdt:P279+ wd:Q167346 .                        # Find ALL subclasses of "botanical garden"
  FILTER(?hyponym NOT IN (wd:Q167346, wd:Q46169, ...))   # Exclude curated items from RESULTS
}
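When the curated list is long, the FILTER clause is easier to generate than to maintain by hand. The following is a minimal sketch of that idea, not the project's actual tooling: the function name `build_hyponym_query` and the way the curated Q-numbers are passed in are assumptions made for illustration.

```python
def build_hyponym_query(base_qid: str, curated_qids: list[str]) -> str:
    """Build a query that traverses every subclass of base_qid and
    removes curated Q-numbers from the RESULTS only.

    Hypothetical helper: the name and signature are illustrative.
    """
    excluded = ", ".join(f"wd:{qid}" for qid in curated_qids)
    return (
        "SELECT ?hyponym WHERE {\n"
        f"  ?hyponym wdt:P279+ wd:{base_qid} .\n"         # traversal is unrestricted
        f"  FILTER(?hyponym NOT IN ({excluded}))\n"        # filter applies to results
        "}"
    )


query = build_hyponym_query("Q167346", ["Q167346", "Q43501", "Q46169"])
print(query)
```

Because the curated list only ever reaches the FILTER on `?hyponym`, the property path itself stays untouched, which is what keeps traversal and result exclusion separate.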
Visual Example
Botanical Garden (Q167346) [CURATED - excluded from results]
├── Arboretum [CURATED - excluded from results]
│ ├── Historic arboretum (Q999001) [NEW - appears in results ✅]
│ └── University arboretum (Q999002) [NEW - appears in results ✅]
├── Zoo (Q43501) [CURATED - excluded from results]
│ ├── Safari park (Q2322153) [CURATED - excluded from results]
│ │ └── Drive-through safari (Q999003) [NEW - appears in results ✅]
│ └── Petting zoo (Q999004) [NEW - appears in results ✅]
└── Botanical conservatory (Q999005) [NEW - appears in results ✅]
What The Query Does
- Starts at base class (e.g., Q167346 "botanical garden")
- Traverses through ALL subclasses, INCLUDING curated ones like Q43501 (zoo)
- Continues traversing through curated items to find their children
- Filters out curated Q-numbers from the final result set
- Returns only NEW hyponyms not in curated vocabulary
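The steps above can be sketched on a toy copy of the hierarchy from the visual example. This is an illustration in Python rather than SPARQL: labels stand in for Q-numbers, the data is hand-written and not fetched from Wikidata, and the function names are made up for this sketch.

```python
from collections import deque

# Toy hierarchy: parent -> direct subclasses (illustrative data only).
SUBCLASSES = {
    "botanical garden": ["arboretum", "zoo", "botanical conservatory"],
    "arboretum": ["historic arboretum", "university arboretum"],
    "zoo": ["safari park", "petting zoo"],
    "safari park": ["drive-through safari"],
}
CURATED = {"botanical garden", "arboretum", "zoo", "safari park"}


def all_descendants(base):
    """Collect every transitive subclass of base, similar to wdt:P279+."""
    found, seen, queue = [], {base}, deque([base])
    while queue:
        for child in SUBCLASSES.get(queue.popleft(), []):
            if child not in seen:  # guard against cycles in the data
                seen.add(child)
                found.append(child)
                queue.append(child)  # keep walking, curated or not
    return found


def new_hyponyms(base):
    """Traverse everything first, then drop curated items from the results."""
    return [item for item in all_descendants(base) if item not in CURATED]


results = new_hyponyms("botanical garden")
print(results)
```

The curated set is consulted only in `new_hyponyms`, after the walk has finished, so children of curated items such as "drive-through safari" are still reached.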
Why This Is Correct
Goal: Find Missing Hyponyms
We want to discover institution types that are:
- Subclasses of our base classes (botanical garden, zoo, etc.)
- NOT already in our curated vocabulary
- Possibly subclasses of curated items (these count too and must not be skipped)
Example Scenario
Curated vocabulary includes:
- Q167346: botanical garden
- Q43501: zoo
- Q473972: protected area
Query should find:
- "Drive-through safari" (subclass of zoo)
- "Marine protected area" (subclass of protected area)
- "Alpine garden" (subclass of botanical garden)
Query should NOT find:
- Q43501 (zoo) - already curated
- Q473972 (protected area) - already curated
What Would Go Wrong If We Excluded From Traversal
Incorrect Approach (Hypothetical)
# WRONG: This drops every item that sits on or below a curated class
SELECT ?hyponym WHERE {
  ?hyponym wdt:P279+ wd:Q167346 .
  FILTER NOT EXISTS {
    ?hyponym wdt:P279* ?curated .                      # The item itself or any class above it ...
    ?curated wdt:P279+ wd:Q167346 .                    # ... that lies below the base class
    FILTER(?curated IN (wd:Q43501, wd:Q46169, ...))    # Curated item on the PATH
  }
}
Problem: This would NOT find "drive-through safari" because:
- Drive-through safari → subclass of → Safari park (Q2322153, curated)
- Safari park → subclass of → Zoo (Q43501, curated)
- The path to the base class runs through Zoo, which is in the exclusion list
- Everything below Zoo is rejected, so drive-through safari never appears in the results
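The failure mode described above can be reproduced on a small toy hierarchy. The sketch below is a Python illustration, not a translation of any real query: labels stand in for Q-numbers, the data is hand-written, and the `blocked` parameter is a hypothetical way to model "exclude from traversal".

```python
# Toy hierarchy: parent -> direct subclasses (illustrative data only).
SUBCLASSES = {
    "botanical garden": ["zoo", "botanical conservatory"],
    "zoo": ["safari park", "petting zoo"],
    "safari park": ["drive-through safari"],
}
CURATED = {"botanical garden", "zoo", "safari park"}


def descendants(node, blocked=frozenset()):
    """Yield transitive subclasses of node, never descending into blocked items."""
    for child in SUBCLASSES.get(node, []):
        if child in blocked:
            continue  # traversal stops here; the whole subtree is never visited
        yield child
        yield from descendants(child, blocked)


base = "botanical garden"

# Exclude from traversal: curated items cut off their own subtrees.
blocked_walk = list(descendants(base, blocked=CURATED - {base}))

# Exclude from results: walk everything, filter afterwards.
full_walk = [item for item in descendants(base) if item not in CURATED]

lost = set(full_walk) - set(blocked_walk)
print(sorted(lost))
```

On this toy data the blocked walk returns only "botanical conservatory", while the items under zoo ("petting zoo" and "drive-through safari") are lost.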
Correct Approach (What We Do)
# CORRECT: Traverse through everything, filter results
SELECT ?hyponym WHERE {
  ?hyponym wdt:P279+ wd:Q167346 .                        # Traverse through ALL subclasses
  FILTER(?hyponym NOT IN (wd:Q43501, wd:Q46169, ...))    # Exclude curated from RESULTS only
}
Success: This DOES find "drive-through safari" because:
- Query starts at botanical garden
- Finds zoo (curated) as subclass
- Continues traversing through zoo to find safari park
- Continues through safari park to find drive-through safari
- Returns drive-through safari (NOT in curated list)
- Filters out zoo and safari park from results (in curated list)
Summary
✅ What Our Queries Do (CORRECT)
- Traverse through curated Q-numbers to find their children
- Exclude curated Q-numbers from appearing in results
- Return only NEW hyponyms not in curated vocabulary
❌ What We DO NOT Do (Would Be Wrong)
- Block traversal through curated Q-numbers
- Stop searching when we encounter a curated item
- Prevent discovery of subclasses of curated items
Expected Results
When running these queries, you should see:
- ✅ NEW institution types (children of curated items)
- ✅ Specific subtypes not in our vocabulary
- ❌ Base classes already in curated list
- ❌ Parent classes we already know about
Example Expected Results:
- "Tropical botanical garden" (subclass of botanical garden)
- "Children's zoo" (subclass of zoo)
- "Marine wildlife sanctuary" (subclass of wildlife sanctuary)
- "Geopark" (subclass of protected area)
Example EXCLUDED Results (correctly filtered out):
- "Botanical garden" (Q167346) - base class, curated
- "Zoo" (Q43501) - base class, curated
- "Protected area" (Q473972) - parent class, curated
Verification
To verify the query is working correctly:
- Run query on Wikidata Query Service
- Check first 10 results
- Verify NONE are in hyponyms_curated.yaml
- Verify ALL are institution types (not random objects)
- Verify they are subclasses of base classes
If results include curated Q-numbers → the query is broken (this should not happen).
If results include non-institution items → data quality issue in Wikidata, not a query logic problem.
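The check against the curated list can be automated. The sketch below assumes the results were downloaded in the standard SPARQL JSON results format and that the curated Q-numbers are already available as a set; loading hyponyms_curated.yaml is left out because its structure is not shown here. The function name `curated_leaks` is hypothetical.

```python
import re

QID_AT_END = re.compile(r"Q\d+$")


def curated_leaks(sparql_json, curated_qids, var="hyponym"):
    """Return the curated Q-numbers that appear in the result bindings.

    An empty list means the result filter worked as intended.
    """
    leaks = []
    for binding in sparql_json["results"]["bindings"]:
        # Entity URIs end in the Q-number, e.g. .../entity/Q43501
        match = QID_AT_END.search(binding[var]["value"])
        if match and match.group() in curated_qids:
            leaks.append(match.group())
    return leaks


# Hand-written sample with one deliberate leak (Q43501) to show the check firing.
sample = {
    "results": {
        "bindings": [
            {"hyponym": {"type": "uri", "value": "http://www.wikidata.org/entity/Q999003"}},
            {"hyponym": {"type": "uri", "value": "http://www.wikidata.org/entity/Q43501"}},
        ]
    }
}
leaks = curated_leaks(sample, {"Q167346", "Q43501"})
print(leaks)
```

This only covers the "no curated Q-numbers" check; whether each result is really an institution type still needs a manual look.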
Generated: 2025-11-13T16:52:19+00:00
Query Version: botanical_query_updated_20251113T165219.sparql