# SPARQL Query Logic Explanation: Traversal vs. Result Exclusion

## The Confusion

When we say "exclude Q-numbers from curated vocabulary," we could mean two different things:

1. ❌ **Exclude from traversal** - Don't follow subclass paths through these Q-numbers
2. ✅ **Exclude from results** - Don't return these Q-numbers, but DO follow their paths

Our queries implement **#2** (exclude from results), which is the correct behavior for hyponym discovery.

---

## How It Works

### Query Structure

```sparql
SELECT ?hyponym WHERE {
  ?hyponym wdt:P279+ wd:Q167346 .  # Find ALL subclasses of "botanical garden"

  FILTER(?hyponym NOT IN (wd:Q167346, wd:Q46169, ...))  # Exclude curated items from RESULTS
}
```

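Since the curated list tends to grow, the `FILTER` clause is easier to generate than to maintain by hand. The following is a minimal Python sketch of that idea, not the project's actual tooling: `build_hyponym_query` is a hypothetical helper name, and the three Q-numbers passed in are just curated examples taken from this document.

```python
from typing import Iterable


def build_hyponym_query(base_class: str, curated: Iterable[str]) -> str:
    """Return a SPARQL query that walks every subclass of base_class and
    excludes the curated Q-numbers from the results only (hypothetical helper)."""
    # Sort and de-duplicate so the generated query text is deterministic.
    excluded = ", ".join(f"wd:{qid}" for qid in sorted(set(curated)))
    return (
        "SELECT ?hyponym WHERE {\n"
        f"  ?hyponym wdt:P279+ wd:{base_class} .\n"
        f"  FILTER(?hyponym NOT IN ({excluded}))\n"
        "}"
    )


# Example: botanical garden as base class, three curated items.
query = build_hyponym_query("Q167346", ["Q167346", "Q43501", "Q2322153"])
print(query)
```

Because the curated IDs appear only inside `FILTER`, adding an ID to the list never changes which paths `wdt:P279+` explores.
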
### Visual Example

```
Botanical Garden (Q167346) [CURATED - excluded from results]
├── Arboretum [CURATED - excluded from results]
│   ├── Historic arboretum (Q999001) [NEW - appears in results ✅]
│   └── University arboretum (Q999002) [NEW - appears in results ✅]
├── Zoo (Q43501) [CURATED - excluded from results]
│   ├── Safari park (Q2322153) [CURATED - excluded from results]
│   │   └── Drive-through safari (Q999003) [NEW - appears in results ✅]
│   └── Petting zoo (Q999004) [NEW - appears in results ✅]
└── Botanical conservatory (Q999005) [NEW - appears in results ✅]
```

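The tree above can be replayed on a toy graph to see which nodes survive. This Python sketch is only an illustration under stated assumptions: the `SUBCLASS_OF` dictionary is a hand-written copy of the diagram (not Wikidata data), and `descendants` is a hypothetical stand-in for what `wdt:P279+` computes.

```python
# Toy copy of the diagram: child -> parent, standing in for wdt:P279.
SUBCLASS_OF = {
    "Arboretum": "Botanical Garden",
    "Historic arboretum": "Arboretum",
    "University arboretum": "Arboretum",
    "Zoo": "Botanical Garden",
    "Safari park": "Zoo",
    "Drive-through safari": "Safari park",
    "Petting zoo": "Zoo",
    "Botanical conservatory": "Botanical Garden",
}
CURATED = {"Botanical Garden", "Arboretum", "Zoo", "Safari park"}


def descendants(base):
    """All nodes reachable by one or more subclass hops (like wdt:P279+)."""
    found, frontier = set(), [base]
    while frontier:
        parent = frontier.pop()
        for child, child_parent in SUBCLASS_OF.items():
            if child_parent == parent and child not in found:
                found.add(child)
                frontier.append(child)  # keep walking, curated or not
    return found


# Curated nodes are removed only after the full walk has finished.
new_hyponyms = descendants("Botanical Garden") - CURATED
print(sorted(new_hyponyms))
```

The subtraction happens only after the full walk, which is why nodes below curated items, such as "Drive-through safari", are still reported.
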
### What The Query Does

1. **Starts** at the base class (e.g., Q167346 "botanical garden")
2. **Traverses** through ALL subclasses, INCLUDING curated ones like Q43501 (zoo)
3. **Continues** traversing through curated items to find their children
4. **Filters** out curated Q-numbers from the final result set
5. **Returns** only NEW hyponyms not in the curated vocabulary

---

## Why This Is Correct

### Goal: Find Missing Hyponyms

We want to discover institution types that are:

- Subclasses of our base classes (botanical garden, zoo, etc.)
- NOT already in our curated vocabulary
- Located anywhere in the hierarchy, **including** below curated items

### Example Scenario

**Curated vocabulary includes:**

- Q167346: botanical garden
- Q43501: zoo
- Q473972: protected area

**Query should find:**

- "Drive-through safari" (subclass of zoo)
- "Marine protected area" (subclass of protected area)
- "Alpine garden" (subclass of botanical garden)

**Query should NOT find:**

- Q43501 (zoo) - already curated
- Q473972 (protected area) - already curated

---

## What Would Go Wrong If We Excluded From Traversal

### Incorrect Approach (Hypothetical)

```sparql
# WRONG: This drops every hyponym that has a curated item on its subclass path
SELECT ?hyponym WHERE {
  ?hyponym wdt:P279+ wd:Q167346 .
  FILTER NOT EXISTS {
    ?hyponym wdt:P279+ ?curated .    # ?curated is an ancestor of the candidate...
    ?curated wdt:P279+ wd:Q167346 .  # ...that itself sits below the base class
    FILTER(?curated IN (wd:Q43501, wd:Q46169, ...))  # Exclude curated from PATH
  }
}
```

**Problem:** This would NOT find "drive-through safari" because:

1. Drive-through safari → subclass of → Safari park (Q2322153, curated)
2. Safari park → subclass of → Zoo (Q43501, curated)
3. Zoo is in the exclusion list, so the `FILTER NOT EXISTS` pattern matches
4. Drive-through safari is discarded and never reaches the results

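To make the failure concrete, the same branch can be walked in Python with a traversal that refuses to pass through curated nodes. This is a toy simulation, not a translation of any real query: `SUBCLASS_OF` is a hand-written excerpt of the example hierarchy, and `descendants_pruned` is a hypothetical function name.

```python
# Hand-written excerpt of the example hierarchy: child -> parent.
SUBCLASS_OF = {
    "Zoo": "Botanical Garden",
    "Safari park": "Zoo",
    "Drive-through safari": "Safari park",
    "Botanical conservatory": "Botanical Garden",
}
CURATED = {"Zoo", "Safari park"}


def descendants_pruned(base, blocked):
    """Walk subclass links but never expand a blocked (curated) node."""
    found, frontier = set(), [base]
    while frontier:
        parent = frontier.pop()
        for child, child_parent in SUBCLASS_OF.items():
            if child_parent != parent or child in found:
                continue
            if child in blocked:
                continue  # traversal stops here; its children are never visited
            found.add(child)
            frontier.append(child)
    return found


reachable = descendants_pruned("Botanical Garden", CURATED)
print(sorted(reachable))
```

In this sketch only "Botanical conservatory" is reachable: once "Zoo" is skipped, nothing beneath it can be discovered.
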
### Correct Approach (What We Do)

```sparql
# CORRECT: Traverse through everything, filter results
SELECT ?hyponym WHERE {
  ?hyponym wdt:P279+ wd:Q167346 .  # Traverse through ALL subclasses
  FILTER(?hyponym NOT IN (wd:Q43501, wd:Q46169, ...))  # Exclude curated from RESULTS only
}
```

**Success:** This DOES find "drive-through safari" because:

1. Query starts at botanical garden
2. Finds zoo (curated) as subclass
3. **Continues** traversing through zoo to find safari park
4. **Continues** through safari park to find drive-through safari
5. Returns drive-through safari (NOT in curated list)
6. Filters out zoo and safari park from results (in curated list)

---

## Summary

✅ **What Our Queries Do (CORRECT)**

- Traverse through curated Q-numbers to find their children
- Exclude curated Q-numbers from appearing in results
- Return only NEW hyponyms not in curated vocabulary

❌ **What We DO NOT Do (Would Be Wrong)**

- Block traversal through curated Q-numbers
- Stop searching when we encounter a curated item
- Prevent discovery of subclasses of curated items

---

## Expected Results

When running these queries, expect the following (✅ should appear, ❌ should not):

- ✅ NEW institution types (children of curated items)
- ✅ Specific subtypes not in our vocabulary
- ❌ Base classes already in curated list
- ❌ Parent classes we already know about

**Example Expected Results:**

- "Tropical botanical garden" (subclass of botanical garden)
- "Children's zoo" (subclass of zoo)
- "Marine wildlife sanctuary" (subclass of wildlife sanctuary)
- "Geopark" (subclass of protected area)

**Example EXCLUDED Results (correctly filtered out):**

- "Botanical garden" (Q167346) - base class, curated
- "Zoo" (Q43501) - base class, curated
- "Protected area" (Q473972) - parent class, curated

---

## Verification

To verify the query is working correctly:

1. Run the query on the Wikidata Query Service
2. Check the first 10 results
3. Verify NONE are in hyponyms_curated.yaml
4. Verify ALL are institution types (not random objects)
5. Verify they are subclasses of the base classes

How to read the outcome:

- If results include curated Q-numbers → the query is broken (this should not happen)
- If results include non-institution items → data quality issue in Wikidata (not query logic)

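Step 3 can be automated with a small check. This is a sketch assuming the curated Q-numbers have already been loaded from `hyponyms_curated.yaml` into a Python set (the file's layout is not shown here, so loading is left out); `find_leaked_curated` is a hypothetical helper name, and the sample IDs are the placeholder values from the examples in this document.

```python
def find_leaked_curated(result_qids, curated_qids):
    """Return the curated Q-numbers that wrongly appear in the query results."""
    return sorted(set(result_qids) & set(curated_qids))


# Assumed inputs: curated IDs from the vocabulary, result IDs from the query.
curated = {"Q167346", "Q43501", "Q473972"}
results = ["Q999003", "Q999004", "Q999005"]

leaked = find_leaked_curated(results, curated)
if leaked:
    print("Query is broken, curated items in results:", leaked)
else:
    print("OK: no curated Q-numbers in results")
```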
---

**Generated**: 2025-11-13T16:52:19+00:00

**Query Version**: botanical_query_updated_20251113T165219.sparql