glam/data/wikidata/GLAMORCUBEPSXHFN/QUERY_LOGIC_EXPLANATION.md
2025-11-19 23:25:22 +01:00

SPARQL Query Logic Explanation: Traversal vs. Result Exclusion

The Confusion

When we say "exclude Q-numbers from curated vocabulary," there are TWO different meanings:

  1. Exclude from traversal - Don't follow subclass paths through these Q-numbers
  2. Exclude from results - Don't return these Q-numbers, but DO follow their paths

Our queries implement #2 (exclude from results), which is CORRECT for hyponym discovery.


How It Works

Query Structure

SELECT ?hyponym WHERE {
  ?hyponym wdt:P279+ wd:Q167346 .  # Find ALL subclasses of "botanical garden"
  
  FILTER(?hyponym NOT IN (wd:Q167346, wd:Q46169, ...))  # Exclude curated items from RESULTS
}
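The exclusion list in the FILTER is long in practice, so a query like this is usually generated rather than typed by hand. The following is a minimal sketch of such a generator; the function name `build_hyponym_query` and its parameters are hypothetical and are not taken from the project's tooling.

```python
def build_hyponym_query(base_qid: str, curated_qids: list[str]) -> str:
    """Return a SPARQL query that traverses all subclasses of base_qid
    and filters curated Q-numbers out of the results only."""
    # Keep order stable and drop duplicates so the generated text is reproducible.
    unique = list(dict.fromkeys([base_qid, *curated_qids]))
    exclusions = ", ".join(f"wd:{qid}" for qid in unique)
    return (
        "SELECT ?hyponym WHERE {\n"
        f"  ?hyponym wdt:P279+ wd:{base_qid} .\n"
        f"  FILTER(?hyponym NOT IN ({exclusions}))\n"
        "}"
    )


# Q-numbers taken from the curated examples in this document.
query = build_hyponym_query("Q167346", ["Q43501", "Q473972"])
print(query)
```

The traversal pattern never mentions the curated list; only the FILTER does. That separation is what keeps curated items on the path while removing them from the output.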

Visual Example

Botanical Garden (Q167346) [CURATED - excluded from results]
  ├── Arboretum [CURATED - excluded from results]
  │   ├── Historic arboretum (Q999001) [NEW - appears in results ✅]
  │   └── University arboretum (Q999002) [NEW - appears in results ✅]
  ├── Zoo (Q43501) [CURATED - excluded from results]
  │   ├── Safari park (Q2322153) [CURATED - excluded from results]
  │   │   └── Drive-through safari (Q999003) [NEW - appears in results ✅]
  │   └── Petting zoo (Q999004) [NEW - appears in results ✅]
  └── Botanical conservatory (Q999005) [NEW - appears in results ✅]

What The Query Does

  1. Starts at base class (e.g., Q167346 "botanical garden")
  2. Traverses through ALL subclasses, INCLUDING curated ones like Q43501 (zoo)
  3. Continues traversing through curated items to find their children
  4. Filters out curated Q-numbers from the final result set
  5. Returns only NEW hyponyms not in curated vocabulary
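The five steps above can be modelled on a toy graph. This sketch assumes the hierarchy from the visual example, uses labels instead of Q-numbers, and stands in for the query service with a plain dictionary; it illustrates the logic and is not the project's implementation.

```python
# Toy hierarchy mirroring the visual example: child -> parent ("subclass of").
# The structure is illustrative only, not real Wikidata.
PARENT = {
    "arboretum": "botanical garden",
    "historic arboretum": "arboretum",
    "university arboretum": "arboretum",
    "zoo": "botanical garden",
    "safari park": "zoo",
    "drive-through safari": "safari park",
    "petting zoo": "zoo",
    "botanical conservatory": "botanical garden",
}
CURATED = {"botanical garden", "arboretum", "zoo", "safari park"}


def all_descendants(base):
    """Every class reachable from base via one or more subclass steps (like P279+)."""
    children = {}
    for child, parent in PARENT.items():
        children.setdefault(parent, []).append(child)
    found, stack = set(), [base]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def new_hyponyms(base):
    # Steps 1-3: traverse everything, curated or not.
    # Steps 4-5: drop curated names from the result set only.
    return sorted(all_descendants(base) - CURATED)


print(new_hyponyms("botanical garden"))
```

All five NEW leaves from the visual example come back, including the ones that sit below curated classes.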

Why This Is Correct

Goal: Find Missing Hyponyms

We want to discover institution types that are:

  • Subclasses of our base classes (botanical garden, zoo, etc.)
  • NOT already in our curated vocabulary
  • Possibly subclasses of curated items (these must still be found)

Example Scenario

Curated vocabulary includes:

  • Q167346: botanical garden
  • Q43501: zoo
  • Q473972: protected area

Query should find:

  • "Drive-through safari" (subclass of zoo)
  • "Marine protected area" (subclass of protected area)
  • "Alpine garden" (subclass of botanical garden)

Query should NOT find:

  • Q43501 (zoo) - already curated
  • Q473972 (protected area) - already curated

What Would Go Wrong If We Excluded From Traversal

Incorrect Approach (Hypothetical)

# WRONG: This drops curated items AND everything beneath them
SELECT ?hyponym WHERE {
  ?hyponym wdt:P279+ wd:Q167346 .
  FILTER NOT EXISTS {
    ?hyponym wdt:P279* ?ancestor .    # The candidate itself or any of its ancestors...
    ?ancestor wdt:P279+ wd:Q167346 .  # ...that lies below the base class...
    FILTER(?ancestor IN (wd:Q43501, wd:Q46169, ...))  # ...and is curated: blocks the PATH
  }
}

Problem: This would NOT find "drive-through safari" because:

  1. Drive-through safari → subclass of → Safari park (Q2322153, curated)
  2. Safari park → subclass of → Zoo (Q43501, curated)
  3. Query stops at Zoo because it's in the exclusion list
  4. Never reaches drive-through safari
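The failure mode can be reproduced on a toy graph. This sketch assumes the zoo chain from the visual example plus one uncurated sibling, with labels in place of Q-numbers; `descendants_blocking_curated` is a hypothetical helper that models the traversal-blocking behaviour described above.

```python
# Chain from the example above: child -> parent ("subclass of").
# Illustrative structure only, not real Wikidata.
PARENT = {
    "zoo": "botanical garden",
    "safari park": "zoo",
    "drive-through safari": "safari park",
    "botanical conservatory": "botanical garden",
}
CURATED = {"zoo", "safari park"}


def descendants_blocking_curated(base):
    """Walk down from base but refuse to enter any curated class."""
    found, stack = set(), [base]
    while stack:
        current = stack.pop()
        for child, parent in PARENT.items():
            if parent == current and child not in CURATED and child not in found:
                found.add(child)
                stack.append(child)
    return found


print(sorted(descendants_blocking_curated("botanical garden")))
```

Only the conservatory is found. The walk never enters zoo, so safari park and drive-through safari are never visited, which is exactly the loss the numbered steps describe.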

Correct Approach (What We Do)

# CORRECT: Traverse through everything, filter results
SELECT ?hyponym WHERE {
  ?hyponym wdt:P279+ wd:Q167346 .  # Traverse through ALL subclasses
  FILTER(?hyponym NOT IN (wd:Q43501, wd:Q46169, ...))  # Exclude curated from RESULTS only
}

Success: This DOES find "drive-through safari" because:

  1. Query starts at botanical garden
  2. Finds zoo (curated) as subclass
  3. Continues traversing through zoo to find safari park
  4. Continues through safari park to find drive-through safari
  5. Filters out zoo and safari park from results (in curated list)
  6. Returns drive-through safari (NOT in curated list)

Summary

What Our Queries Do (CORRECT)

  • Traverse through curated Q-numbers to find their children
  • Exclude curated Q-numbers from appearing in results
  • Return only NEW hyponyms not in curated vocabulary

What We DO NOT Do (Would Be Wrong)

  • Block traversal through curated Q-numbers
  • Stop searching when we encounter a curated item
  • Prevent discovery of subclasses of curated items

Expected Results

When running these queries, you should see:

  • NEW institution types (children of curated items)
  • Specific subtypes not in our vocabulary

You should NOT see:

  • Base classes already in curated list
  • Parent classes we already know about

Example Expected Results:

  • "Tropical botanical garden" (subclass of botanical garden)
  • "Children's zoo" (subclass of zoo)
  • "Marine wildlife sanctuary" (subclass of wildlife sanctuary)
  • "Geopark" (subclass of protected area)

Example EXCLUDED Results (correctly filtered out):

  • "Botanical garden" (Q167346) - base class, curated
  • "Zoo" (Q43501) - base class, curated
  • "Protected area" (Q473972) - parent class, curated

Verification

To verify the query is working correctly:

  1. Run query on Wikidata Query Service
  2. Check first 10 results
  3. Verify NONE are in hyponyms_curated.yaml
  4. Verify ALL are institution types (not random objects)
  5. Verify they are subclasses of base classes

  • If results include curated Q-numbers → the query is broken (this should not happen)
  • If results include non-institution items → data quality issue in Wikidata (not query logic)
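Step 3 of the checklist can be automated. This is a minimal sketch, assuming the results are available as Wikidata entity URIs and the curated Q-numbers have already been loaded from hyponyms_curated.yaml into a list (the file's schema is not shown here, so loading is left out). The helper names and the sample data are hypothetical.

```python
WD_ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def qid_from_uri(uri: str) -> str:
    """Strip the Wikidata entity prefix, e.g. '.../entity/Q43501' -> 'Q43501'."""
    return uri.removeprefix(WD_ENTITY_PREFIX)


def curated_leaks(result_uris, curated_qids):
    """Return curated Q-numbers that wrongly appear in the query results."""
    curated = set(curated_qids)
    return sorted(q for q in map(qid_from_uri, result_uris) if q in curated)


# Hypothetical sample data; real input would come from the query service
# and from hyponyms_curated.yaml.
results = [
    "http://www.wikidata.org/entity/Q999003",
    "http://www.wikidata.org/entity/Q43501",
]
print(curated_leaks(results, ["Q167346", "Q43501", "Q473972"]))
```

An empty list means no curated Q-number leaked into the results. Anything else points at a broken FILTER or an out-of-date exclusion list.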


Generated: 2025-11-13T16:52:19+00:00
Query Version: botanical_query_updated_20251113T165219.sparql