Botanical (B-Class) SPARQL Query Generation
Overview
This directory contains automatically generated SPARQL queries for discovering B-class (Botanical/Zoo) heritage institution classes in Wikidata that are still missing from the curated vocabulary.
The queries search for hyponyms of 27 base classes (botanical gardens, zoos, aquariums, etc.) while excluding Q-numbers that have already been curated in hyponyms_curated.yaml.
Automated Generation Script
Location
scripts/generate_botanical_query_with_exclusions.py
Purpose
- Extracts all Q-numbers from data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml
  - From label: fields (primary Q-numbers)
  - From duplicate: fields (alternative Q-numbers for same entity)
- Generates SPARQL query with FILTER exclusions (50 Q-numbers per chunk)
- Creates metadata YAML with generation timestamp and statistics
- Ensures consistency between curated vocabulary and query exclusions
Usage
# From project root
cd /Users/kempersc/apps/glam
python3 scripts/generate_botanical_query_with_exclusions.py
Output Files
The script generates two files with timestamps:
- SPARQL Query: botanical_query_updated_<timestamp>.sparql
  - Complete query with all Q-number exclusions
  - 27 base classes in UNION blocks
  - FILTER chunks (50 Q-numbers each)
  - LIMIT 10000 to prevent timeout
- Metadata YAML: botanical_query_updated_<timestamp>.yaml
  - Generation timestamp
  - Q-number count
  - Filter chunk count
  - Base class list
  - Source file reference
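As an illustration of how the two output names and the metadata file could be derived from a single timestamp, here is a minimal sketch. It is not the script's actual code: the function name `build_outputs` and every metadata key except `excluded_q_numbers_count` (which the verification step greps for) are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path


def build_outputs(q_count, chunk_count, base_classes, out_dir="B/queries", now=None):
    """Return (sparql_path, metadata_path, metadata_text) sharing one timestamp."""
    now = now or datetime.now(timezone.utc)
    # Compact timestamp as seen in the generated file names, e.g. 20251116T093744
    stamp = now.strftime("%Y%m%dT%H%M%S")
    stem = f"botanical_query_updated_{stamp}"
    out = Path(out_dir)
    # Key names below are assumptions, except excluded_q_numbers_count.
    lines = [
        f"generated: '{now.isoformat(timespec='seconds')}'",
        f"excluded_q_numbers_count: {q_count}",
        f"filter_chunks: {chunk_count}",
        "base_classes:",
        *[f"  - {q}" for q in base_classes],
        "source_file: data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml",
    ]
    return out / f"{stem}.sparql", out / f"{stem}.yaml", "\n".join(lines) + "\n"


if __name__ == "__main__":
    sparql_path, meta_path, meta_text = build_outputs(1814, 37, ["Q167346", "Q43501"])
    print(sparql_path)
    print(meta_path)
    print(meta_text)
```

Passing `now` explicitly keeps both file names on the same timestamp and makes the function easy to test.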
Example Output
============================================================
BOTANICAL QUERY GENERATOR
============================================================
Reading Q-numbers from /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml
Extracted 1786 Q-numbers
Created 36 FILTER chunks
✅ Wrote query: .../B/queries/botanical_query_updated_20251116T090734.sparql
✅ Wrote metadata: .../B/queries/botanical_query_updated_20251116T090734.yaml
============================================================
Query Structure
Base Classes (27)
The query searches hyponyms of these Wikidata classes:
| Q-Number | Class Name |
|---|---|
| Q167346 | Botanical gardens |
| Q43501 | Zoos |
| Q2281788 | Aquariums |
| Q7712619 | Arboreta |
| Q181916 | Herbarium |
| Q1970365 | Natural history museums |
| Q20268591 | Wildlife reserves |
| Q179049 | Nature reserves |
| Q46169 | National parks |
| Q473972 | Protected areas |
| Q158454 | Biosphere reserves |
| Q21164403 | Safari parks |
| Q9480202 | Safari parks (alt) |
| Q8085554 | Wildlife sanctuaries |
| Q2616170 | Marine reserve |
| Q936257 | Conservation areas |
| Q1426613 | Seed banks |
| Q4915239 | Biorepository |
| Q2982911 | Natural history collection |
| Q1905347 | Gene bank |
| Q864217 | Biobank |
| Q2189151 | Soilbank |
| Q8508664 | Herbaria |
| Q11489453 | Culture collections |
| Q23790 | Natural monuments |
| Q386426 | Natural heritage |
| Q526826 | Natural heritage (alt) |
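The UNION blocks can be produced mechanically from this table. The sketch below assumes each base class gets one group of the form `?hyponym wdt:P279+ wd:Q... .` (the pattern shown under Exclusion Mechanism); the helper name `build_union_blocks` is hypothetical and only the first three table rows are listed.

```python
# First three rows of the base class table; the real list has 27 entries.
BASE_CLASSES = ["Q167346", "Q43501", "Q2281788"]


def build_union_blocks(base_classes):
    """Return one hyponym-traversal group per base class, joined by UNION."""
    groups = [f"  {{ ?hyponym wdt:P279+ wd:{q} . }}" for q in base_classes]
    return "\n  UNION\n".join(groups)


if __name__ == "__main__":
    print(build_union_blocks(BASE_CLASSES))
```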
Exclusion Mechanism
IMPORTANT: The FILTER statements exclude Q-numbers from results, not from traversal.
# The query DOES traverse the hyponym tree from base classes
?hyponym wdt:P279+ wd:Q167346 . # Find ALL hyponyms of botanical gardens
# Then FILTERS remove curated Q-numbers from results
FILTER(?hyponym NOT IN (wd:Q1234, wd:Q5678, ...))
This means:
- ✅ Query explores the full hyponym tree
- ✅ Curated Q-numbers are excluded from final results
- ✅ New hyponyms are discovered in subsequent runs
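A minimal sketch of the chunking step, assuming the exclusions are split into `FILTER(?hyponym NOT IN (...))` statements of at most 50 Q-numbers each. The function name `build_filter_chunks` and the numeric sort order are illustrative choices, not taken from the script.

```python
CHUNK_SIZE = 50  # Q-numbers per FILTER statement, as stated above


def build_filter_chunks(q_numbers, chunk_size=CHUNK_SIZE):
    """Split Q-numbers into FILTER statements of at most chunk_size IDs each."""
    # Sort numerically so repeated runs produce a stable, diff-friendly query.
    ordered = sorted(set(q_numbers), key=lambda q: int(q[1:]))
    filters = []
    for start in range(0, len(ordered), chunk_size):
        chunk = ordered[start:start + chunk_size]
        values = ", ".join(f"wd:{q}" for q in chunk)
        filters.append(f"FILTER(?hyponym NOT IN ({values}))")
    return filters


if __name__ == "__main__":
    placeholder_ids = [f"Q{i}" for i in range(1, 1787)]  # 1,786 placeholder IDs
    print(len(build_filter_chunks(placeholder_ids)))
```

With 1,786 Q-numbers this yields 36 chunks (35 full chunks plus one of 36 IDs), and with 1,814 it yields 37, which matches the counts reported in this document.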
Duplicate Q-number Handling
The script extracts Q-numbers from two sources:
- label: fields (primary Q-number for each entity, e.g., label: Q9259)
- duplicate: fields (alternative Q-numbers for the same entity)
Example from hyponyms_curated.yaml:
- label: Q9259
  hypernym:
    - heritage site
  type:
    - B
    - F
    - L
    - A
    - M
  duplicate:
    - Q31838911 # ← Alternative ID excluded from query
Why exclude duplicates? These Q-numbers represent:
- Wikidata merge operations (two entries merged into one)
- Initial misidentification (entity was curated under different ID)
- Alternative identifiers in external databases
By excluding both the primary label and duplicate IDs, we avoid rediscovering the same real-world entity under different Wikidata Q-numbers.
Current statistics:
- 18 duplicate Q-numbers found across the curated vocabulary
- Examples: Q31838911 (heritage site), Q1257025 (garden), Q21164403 (safari park)
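The extraction of both kinds of Q-number can be sketched with line-based regular expressions. This is an assumption about the approach (the Troubleshooting section says the script uses regex extraction), not the script's actual code. The patterns, the function name `extract_q_numbers`, and the indentation of the sample are hypothetical.

```python
import re

LABEL_RE = re.compile(r"^\s*-?\s*label:\s*(Q\d+)\b")
DUPLICATE_KEY_RE = re.compile(r"^\s*duplicate:\s*(#.*)?$")
LIST_ITEM_RE = re.compile(r"^\s*-\s*(Q\d+)\b")

SAMPLE = """\
- label: Q9259
  hypernym:
    - heritage site
  type:
    - B
    - F
  duplicate:
    - Q31838911  # alternative ID
"""


def extract_q_numbers(text):
    """Return (labels, duplicates) as two sets of Q-numbers."""
    labels, duplicates = set(), set()
    in_duplicate_block = False
    for line in text.splitlines():
        label = LABEL_RE.match(line)
        if label:
            labels.add(label.group(1))
            in_duplicate_block = False
            continue
        if DUPLICATE_KEY_RE.match(line):
            in_duplicate_block = True
            continue
        item = LIST_ITEM_RE.match(line)
        if in_duplicate_block and item:
            duplicates.add(item.group(1))
        else:
            # Any other line ends the duplicate: list.
            in_duplicate_block = False
    return labels, duplicates


if __name__ == "__main__":
    labels, duplicates = extract_q_numbers(SAMPLE)
    print(sorted(labels | duplicates))
```

The union `labels | duplicates` is the exclusion set. Because it is a set, a Q-number that appears in both roles is counted once.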
Result Limit
LIMIT 10000
- Caps the result set to reduce the risk of SPARQL timeout errors
- Matches original query behavior
- With 1,786+ exclusions, expected results: < 5,000
Workflow for Updates
When to Regenerate Query
Run the generation script whenever:
- New Q-numbers added to hyponyms_curated.yaml
- Batch curation complete and ready for next discovery round
- Query execution failed due to outdated exclusions
Step-by-Step Process
- Update curated vocabulary:
  # Add new Q-numbers to hyponyms_curated.yaml
  vim data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml
- Regenerate query:
  python3 scripts/generate_botanical_query_with_exclusions.py
- Verify output:
  # Check Q-number count increased
  grep "excluded_q_numbers_count:" B/queries/botanical_query_updated_*.yaml | tail -1
  # Check FILTER chunks look correct
  grep "FILTER(?hyponym NOT IN" B/queries/botanical_query_updated_*.sparql | head -3
- Execute query:
  # Use Wikidata Query Service: https://query.wikidata.org
  # Copy-paste the generated SPARQL query
  # Download results as JSON
- Process results:
  # Parse JSON results and update hyponyms_curated.yaml
  # Repeat cycle
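For the result-processing step, a sketch along the following lines may help. It assumes the download uses the standard SPARQL JSON results layout (`results.bindings`, with entity URIs under `http://www.wikidata.org/entity/`) and that the query variable is `?hyponym`. The function name `new_q_numbers` is hypothetical, and the sample IDs are placeholders rather than real query results.

```python
import json

ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def new_q_numbers(results_json, curated, var="hyponym"):
    """Return Q-numbers present in the results but absent from the curated set."""
    data = json.loads(results_json)
    found = set()
    for binding in data["results"]["bindings"]:
        cell = binding.get(var)
        if cell and cell["value"].startswith(ENTITY_PREFIX):
            found.add(cell["value"][len(ENTITY_PREFIX):])
    return sorted(found - set(curated), key=lambda q: int(q[1:]))


if __name__ == "__main__":
    # Placeholder response; Q1234 and Q5678 are illustrative IDs only.
    sample = json.dumps({
        "head": {"vars": ["hyponym"]},
        "results": {"bindings": [
            {"hyponym": {"type": "uri", "value": ENTITY_PREFIX + "Q1234"}},
            {"hyponym": {"type": "uri", "value": ENTITY_PREFIX + "Q5678"}},
        ]},
    })
    print(new_q_numbers(sample, curated={"Q1234"}))
```

The returned list contains candidates for manual review. Whether each one is a valid B-class entry still has to be decided during curation.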
Integration with Other Classes
This script is designed to be reusable for all 19 GLAMORCUBEPSXHFN classes:
- A - Archives
- B - Botanical gardens/Zoos (this class)
- C - Corporations
- D - Digital platforms
- E - Education providers
- F - Features
- G - Galleries
- H - Holy sites
- I - Intangible heritage
- L - Libraries
- M - Museums
- N - NGOs
- O - Official institutions
- P - Personal collections
- R - Research centers
- S - Societies
- T - Taste/smell heritage
- U - Unknown
- X - Mixed
To Adapt for Other Classes
- Copy script: generate_botanical_query_with_exclusions.py
- Modify base classes: Update UNION blocks with relevant Q-numbers
- Update output path: Change B/queries to appropriate class directory
- Run generation: Execute with updated template
Query Statistics (Latest Generation)
File: botanical_query_updated_20251116T093744.sparql
- Generated: 2025-11-16T09:37:44+00:00
- Base Classes: 27
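- Excluded Q-numbers: 1,814
  - From label: fields: 1,797
  - From duplicate: fields: 18 (alternative IDs for same entities)
  - The two counts sum to 1,815; the total is deduplicated, so presumably one Q-number appears in both roles and is counted once
- FILTER Chunks: 37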
- Expected Results: < 5,000 (with LIMIT 10000)
Troubleshooting
Q-number Count Mismatch
Symptom: Extracted Q-number count differs from manual count
Cause: Duplicates in hyponyms_curated.yaml
Solution: Script uses set() to deduplicate - this is correct behavior
YAML Parsing Errors
Symptom: Script fails with yaml.scanner.ScannerError
Cause: Tab characters or formatting issues in YAML file
Solution: Script uses regex extraction, which is robust to formatting issues
SPARQL Timeout
Symptom: Query exceeds Wikidata Query Service time limit
Cause: Too many results or complex traversal
Solution:
- Check LIMIT clause is present (10000)
- Add more exclusions to hyponyms_curated.yaml
- Break into smaller base class groups
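Splitting the base classes into smaller groups could look like the sketch below, which builds one query per group. The query shape (`SELECT DISTINCT ?hyponym ... LIMIT 10000`), the group size of 9, and both function names are assumptions for illustration. The generated query may differ.

```python
def split_into_groups(base_classes, group_size):
    """Split the base class list into consecutive groups of at most group_size."""
    return [base_classes[i:i + group_size]
            for i in range(0, len(base_classes), group_size)]


def assemble_query(group, filter_lines, limit=10000):
    """Assemble one query for a single group of base classes."""
    unions = "\n  UNION\n".join(
        f"  {{ ?hyponym wdt:P279+ wd:{q} . }}" for q in group
    )
    filters = "\n".join(f"  {line}" for line in filter_lines)
    return (
        "SELECT DISTINCT ?hyponym WHERE {\n"
        f"{unions}\n{filters}\n"
        "}\n"
        f"LIMIT {limit}\n"
    )


if __name__ == "__main__":
    classes = ["Q167346", "Q43501", "Q2281788", "Q7712619"]  # first four base classes
    filters = ["FILTER(?hyponym NOT IN (wd:Q1234, wd:Q5678))"]  # placeholder IDs
    for group in split_into_groups(classes, group_size=2):
        print(assemble_query(group, filters))
```

Each group's query keeps the full exclusion list, so results from the separate runs can be merged and deduplicated afterwards.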
No New Results
Symptom: Query returns zero results
Cause: All hyponyms have been curated
Solution: Success! Move to next class or add more base classes
References
- Curated Vocabulary: data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml
- Generation Script: scripts/generate_botanical_query_with_exclusions.py
- Wikidata Query Service: https://query.wikidata.org
- Project Documentation: AGENTS.md (B-class taxonomy)
Version History
| Date | Version | Changes |
|---|---|---|
| 2025-11-16 | 1.0.0 | Initial automated generation script |
| 2025-11-13 | 0.9.x | Manual query updates (573 Q-numbers) |
Next Steps
- Execute query on Wikidata Query Service
- Review results for valid B-class institutions
- Add to curated vocabulary in hyponyms_curated.yaml
- Regenerate query with updated exclusions
- Repeat until no new hyponyms found