# SPARQL Query Logic Explanation: Traversal vs. Result Exclusion

## The Confusion

When we say "exclude Q-numbers from curated vocabulary," there are two different things this could mean:

1. ❌ **Exclude from traversal** - Don't follow subclass paths through these Q-numbers
2. ✅ **Exclude from results** - Don't return these Q-numbers, but DO follow their paths

Our queries implement **#2** (exclude from results), which is the correct behavior for hyponym discovery.

---

## How It Works

### Query Structure

```sparql
SELECT ?hyponym WHERE {
  ?hyponym wdt:P279+ wd:Q167346 .    # Find ALL subclasses of "botanical garden"

  FILTER(?hyponym NOT IN (wd:Q167346, wd:Q46169, ...))    # Exclude curated items from RESULTS
}
```

### Visual Example

```
Botanical Garden (Q167346) [CURATED - excluded from results]
├── Arboretum [CURATED - excluded from results]
│   ├── Historic arboretum (Q999001) [NEW - appears in results ✅]
│   └── University arboretum (Q999002) [NEW - appears in results ✅]
├── Zoo (Q43501) [CURATED - excluded from results]
│   ├── Safari park (Q2322153) [CURATED - excluded from results]
│   │   └── Drive-through safari (Q999003) [NEW - appears in results ✅]
│   └── Petting zoo (Q999004) [NEW - appears in results ✅]
└── Botanical conservatory (Q999005) [NEW - appears in results ✅]
```

### What The Query Does

1. **Starts** at the base class (e.g., Q167346 "botanical garden")
2. **Traverses** through ALL subclasses, INCLUDING curated ones like Q43501 (zoo)
3. **Continues** traversing through curated items to find their children
4. **Filters** out curated Q-numbers from the final result set
5. **Returns** only NEW hyponyms not in the curated vocabulary

---

## Why This Is Correct

### Goal: Find Missing Hyponyms

We want to discover institution types that are:

- Subclasses of our base classes (botanical garden, zoo, etc.)
- NOT already in our curated vocabulary
- **Including** subclasses of curated items

### Example Scenario

**Curated vocabulary includes:**

- Q167346: botanical garden
- Q43501: zoo
- Q473972: protected area

**Query should find:**

- "Drive-through safari" (subclass of zoo)
- "Marine protected area" (subclass of protected area)
- "Alpine garden" (subclass of botanical garden)

**Query should NOT find:**

- Q43501 (zoo) - already curated
- Q473972 (protected area) - already curated

---

## What Would Go Wrong If We Excluded From Traversal

### Incorrect Approach (Hypothetical)

```sparql
# WRONG: This would exclude curated items from the subclass path
SELECT ?hyponym WHERE {
  ?hyponym wdt:P279+ ?baseClass .
  ?baseClass wdt:P279* wd:Q167346 .

  # Only traverse through NON-curated items
  FILTER(?baseClass NOT IN (wd:Q167346, wd:Q43501, wd:Q46169, ...))    # Exclude curated from PATH
}
```

**Problem:** This would NOT find "drive-through safari" because:

1. Drive-through safari → subclass of → Safari park (Q2322153, curated)
2. Safari park → subclass of → Zoo (Q43501, curated)
3. Zoo → subclass of → Botanical garden (Q167346, curated)
4. Every ancestor is in the exclusion list, so none of them can bind to `?baseClass`, and the query never reaches drive-through safari

### Correct Approach (What We Do)

```sparql
# CORRECT: Traverse through everything, filter results
SELECT ?hyponym WHERE {
  ?hyponym wdt:P279+ wd:Q167346 .    # Traverse through ALL subclasses

  FILTER(?hyponym NOT IN (wd:Q43501, wd:Q46169, ...))    # Exclude curated from RESULTS only
}
```

**Success:** This DOES find "drive-through safari" because:

1. Query starts at botanical garden
2. Finds zoo (curated) as a subclass
3. **Continues** traversing through zoo to find safari park
4. **Continues** through safari park to find drive-through safari
5. Returns drive-through safari (NOT in curated list)
6. Filters out zoo and safari park from results (in curated list)

---

## Summary

✅ **What Our Queries Do (CORRECT)**

- Traverse through curated Q-numbers to find their children
- Exclude curated Q-numbers from appearing in results
- Return only NEW hyponyms not in the curated vocabulary

❌ **What We DO NOT Do (Would Be Wrong)**

- Block traversal through curated Q-numbers
- Stop searching when we encounter a curated item
- Prevent discovery of subclasses of curated items

---

## Expected Results

When running these queries, you should see:

- ✅ NEW institution types (children of curated items)
- ✅ Specific subtypes not in our vocabulary
- ❌ Base classes already in curated list
- ❌ Parent classes we already know about

**Example Expected Results:**

- "Tropical botanical garden" (subclass of botanical garden)
- "Children's zoo" (subclass of zoo)
- "Marine wildlife sanctuary" (subclass of wildlife sanctuary)
- "Geopark" (subclass of protected area)

**Example EXCLUDED Results (correctly filtered out):**

- "Botanical garden" (Q167346) - base class, curated
- "Zoo" (Q43501) - base class, curated
- "Protected area" (Q473972) - parent class, curated

---

## Verification

To verify the query is working correctly:

1. Run the query on the Wikidata Query Service
2. Check the first 10 results
3. Verify NONE are in `hyponyms_curated.yaml`
4. Verify ALL are institution types (not random objects)
5. Verify they are subclasses of the base classes

How to read the outcome:

- If results include curated Q-numbers → the query is broken (this should not happen)
- If results include non-institution items → data quality issue in Wikidata (not query logic)

---

**Generated**: 2025-11-13T16:52:19+00:00

**Query Version**: botanical_query_updated_20251113T165219.sparql
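
### Appendix: Offline Sketch of the Two Strategies

The difference between traversal exclusion and result exclusion can be reproduced without a SPARQL endpoint. The following is a minimal Python sketch, not part of the query pipeline. It assumes the toy hierarchy from the Visual Example, uses labels in place of Q-numbers, and introduces hypothetical helper names (`descendants`, `exclude_from_results`, `exclude_from_traversal`) purely for illustration.

```python
from collections import deque

# Toy parent -> children hierarchy taken from the Visual Example.
# Labels stand in for Q-numbers (illustrative assumption).
CHILDREN = {
    "botanical garden": ["arboretum", "zoo", "botanical conservatory"],
    "arboretum": ["historic arboretum", "university arboretum"],
    "zoo": ["safari park", "petting zoo"],
    "safari park": ["drive-through safari"],
}

# Items marked CURATED in the Visual Example.
CURATED = {"botanical garden", "arboretum", "zoo", "safari park"}


def descendants(root, blocked=frozenset()):
    """Breadth-first walk returning every transitive subclass of root.

    Nodes in `blocked` are neither returned nor expanded.
    """
    found = []
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in CHILDREN.get(node, []):
            if child in seen or child in blocked:
                continue
            seen.add(child)
            found.append(child)
            queue.append(child)
    return found


def exclude_from_results(root):
    """Walk through everything, then drop curated items from the output."""
    return sorted(n for n in descendants(root) if n not in CURATED)


def exclude_from_traversal(root):
    """Refuse to walk through curated items at all (the incorrect strategy)."""
    return sorted(descendants(root, blocked=CURATED - {root}))


if __name__ == "__main__":
    print("result exclusion:   ", exclude_from_results("botanical garden"))
    print("traversal exclusion:", exclude_from_traversal("botanical garden"))
```

On this toy graph, result exclusion returns all five NEW items, including "drive-through safari", while traversal exclusion returns only "botanical conservatory", because every other NEW item sits below a curated node.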