glam/data/wikidata/GLAMORCUBEPSXHFN/B/README_QUERY_GENERATION.md
2025-11-19 23:25:22 +01:00

Botanical (B-Class) SPARQL Query Generation

Overview

This directory contains automatically generated SPARQL queries for discovering B-class (Botanical/Zoo) heritage institutions that are missing from the curated vocabulary but present in Wikidata.

The queries search for hyponyms of 27 base classes (botanical gardens, zoos, aquariums, etc.) while excluding Q-numbers that have already been curated in hyponyms_curated.yaml.

Automated Generation Script

Location

scripts/generate_botanical_query_with_exclusions.py

Purpose

  • Extracts all Q-numbers from data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml
    • From label: fields (primary Q-numbers)
    • From duplicate: fields (alternative Q-numbers for same entity)
  • Generates SPARQL query with FILTER exclusions (50 Q-numbers per chunk)
  • Creates metadata YAML with generation timestamp and statistics
  • Ensures consistency between curated vocabulary and query exclusions

Usage

# From the project root (adjust the path to your own checkout)
cd /Users/kempersc/apps/glam
python3 scripts/generate_botanical_query_with_exclusions.py

Output Files

The script generates two files with timestamps:

  1. SPARQL Query: botanical_query_updated_<timestamp>.sparql

    • Complete query with all Q-number exclusions
    • 27 base classes in UNION blocks
    • FILTER chunks (50 Q-numbers each)
    • LIMIT 10000 to prevent timeout
  2. Metadata YAML: botanical_query_updated_<timestamp>.yaml

    • Generation timestamp
    • Q-number count
    • Filter chunk count
    • Base class list
    • Source file reference
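
The shared-timestamp naming described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `output_paths` is hypothetical, and the timestamp format is inferred from the example filenames shown in this README.

```python
from datetime import datetime
from pathlib import Path


def output_paths(out_dir: Path, now: datetime) -> tuple[Path, Path]:
    """Return (sparql_path, yaml_path) that share one timestamp."""
    # Compact timestamp such as 20251116T090734 (format inferred, not confirmed).
    stamp = now.strftime("%Y%m%dT%H%M%S")
    stem = f"botanical_query_updated_{stamp}"
    return out_dir / f"{stem}.sparql", out_dir / f"{stem}.yaml"


sparql_path, yaml_path = output_paths(
    Path("B/queries"), datetime(2025, 11, 16, 9, 7, 34)
)
print(sparql_path.name)  # botanical_query_updated_20251116T090734.sparql
print(yaml_path.name)    # botanical_query_updated_20251116T090734.yaml
```

Deriving both names from a single timestamp keeps each query paired with its metadata file.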

Example Output

============================================================
BOTANICAL QUERY GENERATOR
============================================================
Reading Q-numbers from /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml
Extracted 1786 Q-numbers
Created 36 FILTER chunks
✅ Wrote query: .../B/queries/botanical_query_updated_20251116T090734.sparql
✅ Wrote metadata: .../B/queries/botanical_query_updated_20251116T090734.yaml
============================================================

Query Structure

Base Classes (27)

The query searches hyponyms of these Wikidata classes:

| Q-Number | Class Name |
|----------|------------|
| Q167346 | Botanical gardens |
| Q43501 | Zoos |
| Q2281788 | Aquariums |
| Q7712619 | Arboreta |
| Q181916 | Herbarium |
| Q1970365 | Natural history museums |
| Q20268591 | Wildlife reserves |
| Q179049 | Nature reserves |
| Q46169 | National parks |
| Q473972 | Protected areas |
| Q158454 | Biosphere reserves |
| Q21164403 | Safari parks |
| Q9480202 | Safari parks (alt) |
| Q8085554 | Wildlife sanctuaries |
| Q2616170 | Marine reserve |
| Q936257 | Conservation areas |
| Q1426613 | Seed banks |
| Q4915239 | Biorepository |
| Q2982911 | Natural history collection |
| Q1905347 | Gene bank |
| Q864217 | Biobank |
| Q2189151 | Soilbank |
| Q8508664 | Herbaria |
| Q11489453 | Culture collections |
| Q23790 | Natural monuments |
| Q386426 | Natural heritage |
| Q526826 | Natural heritage (alt) |

Exclusion Mechanism

IMPORTANT: The FILTER statements exclude Q-numbers from results, not from traversal.

# The query DOES traverse the hyponym tree from base classes
?hyponym wdt:P279+ wd:Q167346 .  # Find ALL hyponyms of botanical gardens

# Then FILTERS remove curated Q-numbers from results
FILTER(?hyponym NOT IN (wd:Q1234, wd:Q5678, ...))

This means:

  • The query explores the full hyponym tree under each base class
  • Curated Q-numbers are removed from the final results only
  • Uncurated hyponyms, including those below an already-curated class, still appear in the results
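
The traverse-then-filter structure can be sketched in Python as below. This is an assumption-laden illustration rather than the generator's real template: the function name `build_query`, the `SELECT DISTINCT ?hyponym` projection, and the numeric ordering of exclusions are all hypothetical choices. It relies on the `wd:` and `wdt:` prefixes being predefined by the Wikidata Query Service.

```python
def build_query(base_classes, excluded, chunk_size=50, limit=10000):
    """Assemble a SPARQL query: one UNION block per base class, then FILTER chunks."""
    # Traversal: every base class gets its own transitive subclass pattern.
    unions = "\n  UNION\n".join(
        f"  {{ ?hyponym wdt:P279+ wd:{q} . }}" for q in base_classes
    )
    # Exclusion: curated Q-numbers are filtered from results in fixed-size chunks.
    ordered = sorted(excluded, key=lambda q: int(q[1:]))
    filters = []
    for start in range(0, len(ordered), chunk_size):
        ids = ", ".join(f"wd:{q}" for q in ordered[start:start + chunk_size])
        filters.append(f"  FILTER(?hyponym NOT IN ({ids}))")
    body = "\n".join([unions, *filters])
    return f"SELECT DISTINCT ?hyponym WHERE {{\n{body}\n}}\nLIMIT {limit}\n"


# Two base classes from the table above and two placeholder exclusions.
print(build_query(["Q167346", "Q43501"], {"Q1234", "Q5678"}))
```

Because the FILTER lines sit after the UNION blocks, removing a Q-number from the results never prunes the subtree beneath it.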

Duplicate Q-number Handling

The script extracts Q-numbers from two sources:

  1. label: fields - Primary Q-number for each entity (e.g., label: Q9259)

  2. duplicate: fields - Alternative Q-numbers for the same entity

Example from hyponyms_curated.yaml:

- label: Q9259
  hypernym:
    - heritage site
  type:
    - B
    - F
    - L
    - A
    - M
  duplicate:
    - Q31838911  # ← Alternative ID excluded from query

Why exclude duplicates? These Q-numbers represent:

  • Wikidata merge operations (two entries merged into one)
  • Initial misidentification (entity was curated under different ID)
  • Alternative identifiers in external databases

By excluding both the primary label and duplicate IDs, we avoid rediscovering the same real-world entity under different Wikidata Q-numbers.
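
A line-based regex extraction of both sources might look like the sketch below. The patterns and the function name `extract_q_numbers` are assumptions based on the YAML layout shown in the example above; the actual script may differ.

```python
import re

LABEL_RE = re.compile(r"^\s*-?\s*label:\s*(Q\d+)\b")
DUP_START_RE = re.compile(r"^\s*duplicate:\s*$")
LIST_ITEM_RE = re.compile(r"^\s*-\s*(Q\d+)\b")


def extract_q_numbers(text):
    """Return (labels, duplicates) as sets of Q-numbers found in the YAML text."""
    labels, duplicates = set(), set()
    in_duplicates = False
    for line in text.splitlines():
        match = LABEL_RE.match(line)
        if match:
            labels.add(match.group(1))
            in_duplicates = False
        elif DUP_START_RE.match(line):
            in_duplicates = True
        elif in_duplicates:
            item = LIST_ITEM_RE.match(line)
            if item:
                duplicates.add(item.group(1))
            elif line.strip():
                # Any other non-blank line ends the duplicate: list.
                in_duplicates = False
    return labels, duplicates


sample = """\
- label: Q9259
  hypernym:
    - heritage site
  duplicate:
    - Q31838911  # alternative ID
"""
labels, duplicates = extract_q_numbers(sample)
print(sorted(labels | duplicates))  # ['Q31838911', 'Q9259']
```

Taking the union of the two sets yields the exclusion list, and using sets means a Q-number that appears in both sources is counted once.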

Current statistics:

  • 18 duplicate Q-numbers found across the curated vocabulary
  • Examples: Q31838911 (heritage site), Q1257025 (garden), Q21164403 (safari park)

Result Limit

LIMIT 10000
  • Reduces the risk of SPARQL timeout errors by capping the result set
  • Matches original query behavior
  • With 1,786+ exclusions, expected results: < 5,000

Workflow for Updates

When to Regenerate Query

Run the generation script whenever:

  1. New Q-numbers added to hyponyms_curated.yaml
  2. Batch curation complete and ready for next discovery round
  3. Query execution failed due to outdated exclusions

Step-by-Step Process

  1. Update curated vocabulary:

    # Add new Q-numbers to hyponyms_curated.yaml
    vim data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml
    
  2. Regenerate query:

    python3 scripts/generate_botanical_query_with_exclusions.py
    
  3. Verify output:

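    # Paths below are relative to data/wikidata/GLAMORCUBEPSXHFN/
    # Check Q-number count increased (last line = most recent file)
    grep "excluded_q_numbers_count:" B/queries/botanical_query_updated_*.yaml | tail -1
    
    # Check FILTER chunks look correct (-F matches the pattern as a literal string)
    grep -F "FILTER(?hyponym NOT IN" B/queries/botanical_query_updated_*.sparql | head -3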
    
  4. Execute query:

    # Use Wikidata Query Service: https://query.wikidata.org
    # Copy-paste the generated SPARQL query
    # Download results as JSON
    
  5. Process results:

    # Parse JSON results and update hyponyms_curated.yaml
    # Repeat cycle
    

Integration with Other Classes

The script targets the B class but can be adapted for any of the 19 GLAMORCUBEPSXHFN classes:

  • A - Archives
  • B - Botanical gardens/Zoos (this class)
  • C - Corporations
  • D - Digital platforms
  • E - Education providers
  • F - Features
  • G - Galleries
  • H - Holy sites
  • I - Intangible heritage
  • L - Libraries
  • M - Museums
  • N - NGOs
  • O - Official institutions
  • P - Personal collections
  • R - Research centers
  • S - Societies
  • T - Taste/smell heritage
  • U - Unknown
  • X - Mixed

To Adapt for Other Classes

  1. Copy script: generate_botanical_query_with_exclusions.py
  2. Modify base classes: Update UNION blocks with relevant Q-numbers
  3. Update output path: Change B/queries to appropriate class directory
  4. Run generation: Execute with updated template

Query Statistics (Latest Generation)

File: botanical_query_updated_20251116T093744.sparql

  • Generated: 2025-11-16T09:37:44+00:00
  • Base Classes: 27
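  • Excluded Q-numbers: 1,814 (unique)
    • From label: fields: 1,797
    • From duplicate: fields: 18 (alternative IDs for same entities)
    • The two sources sum to 1,815; a total of 1,814 is consistent with one Q-number appearing in both sources and being counted once after deduplication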
  • FILTER Chunks: 37
  • Expected Results: < 5,000 (with LIMIT 10000)
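
The chunk counts in this README follow from ceiling division by the chunk size of 50. A short check (the helper name `filter_chunk_count` is illustrative):

```python
import math

CHUNK_SIZE = 50  # Q-numbers per FILTER statement, as stated in this README


def filter_chunk_count(n_excluded, chunk_size=CHUNK_SIZE):
    """Number of FILTER statements needed for n_excluded Q-numbers."""
    return math.ceil(n_excluded / chunk_size)


print(filter_chunk_count(1786))  # 36, matching the example output
print(filter_chunk_count(1814))  # 37, matching the latest generation
```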

Troubleshooting

Q-number Count Mismatch

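Symptom: Extracted Q-number count differs from (typically is lower than) a manual count

Cause: The same Q-number appears more than once in hyponyms_curated.yaml, for example as both a label: and a duplicate: value

Solution: No action needed. The script deduplicates with set(), so it reports the number of unique Q-numbers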

YAML Parsing Errors

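Symptom: Loading hyponyms_curated.yaml with a YAML parser fails with yaml.scanner.ScannerError

Cause: Tab characters or formatting issues in the YAML file

Solution: The generation script extracts Q-numbers with regular expressions rather than a YAML parser, so it tolerates these issues. Fix the file itself if other tools need to load it as YAML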

SPARQL Timeout

Symptom: Query exceeds Wikidata Query Service time limit

Cause: Too many results or complex traversal

Solution:

  • Check LIMIT clause is present (10000)
  • Add more exclusions to hyponyms_curated.yaml
  • Break into smaller base class groups
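
Breaking the base classes into smaller groups could be done along these lines. The helper names and the group size of 9 are illustrative assumptions; each group would be turned into its own query, and the per-group results merged afterwards.

```python
def split_into_groups(base_classes, group_size):
    """Split the base class list into consecutive groups of at most group_size."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [
        base_classes[i:i + group_size]
        for i in range(0, len(base_classes), group_size)
    ]


def merge_results(per_group_results):
    """Union the Q-numbers returned by each group's query, dropping repeats."""
    merged = set()
    for result in per_group_results:
        merged.update(result)
    return merged


groups = split_into_groups([f"Q{i}" for i in range(1, 28)], 9)
print([len(g) for g in groups])  # [9, 9, 9]
```

Merging with a set matters because one hyponym can descend from base classes that land in different groups.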

No New Results

Symptom: Query returns zero results

Cause: All hyponyms have been curated

Solution: Success! Move to next class or add more base classes

References

  • Curated Vocabulary: data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml
  • Generation Script: scripts/generate_botanical_query_with_exclusions.py
  • Wikidata Query Service: https://query.wikidata.org
  • Project Documentation: AGENTS.md (B-class taxonomy)

Version History

| Date | Version | Changes |
|------|---------|---------|
| 2025-11-16 | 1.0.0 | Initial automated generation script |
| 2025-11-13 | 0.9.x | Manual query updates (573 Q-numbers) |

Next Steps

  1. Execute query on Wikidata Query Service
  2. Review results for valid B-class institutions
  3. Add to curated vocabulary in hyponyms_curated.yaml
  4. Regenerate query with updated exclusions
  5. Repeat until no new hyponyms found