kempersc 3c80de87e0 add isil entries

2025-11-19 23:25:22 +01:00

Botanical (B-Class) SPARQL Query Generation

Overview

This directory contains automatically generated SPARQL queries for discovering B-class (Botanical/Zoo) heritage institution types in Wikidata that are missing from the curated vocabulary.

The queries search for hyponyms of 27 base classes (botanical gardens, zoos, aquariums, etc.) while excluding Q-numbers that have already been curated in hyponyms_curated.yaml.

Automated Generation Script

Location

scripts/generate_botanical_query_with_exclusions.py

Purpose

- Extracts all Q-numbers from data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml
  - From label: fields (primary Q-numbers)
  - From duplicate: fields (alternative Q-numbers for the same entity)
- Generates a SPARQL query with FILTER exclusions (50 Q-numbers per chunk)
- Creates a metadata YAML file with generation timestamp and statistics
- Ensures consistency between the curated vocabulary and the query exclusions

Usage

# From project root
cd /Users/kempersc/apps/glam
python3 scripts/generate_botanical_query_with_exclusions.py

Output Files

The script generates two files with timestamps:

- SPARQL Query: botanical_query_updated_<timestamp>.sparql
  - Complete query with all Q-number exclusions
  - 27 base classes in UNION blocks
  - FILTER chunks (50 Q-numbers each)
  - LIMIT 10000 to reduce the risk of timeouts
- Metadata YAML: botanical_query_updated_<timestamp>.yaml
  - Generation timestamp
  - Q-number count
  - Filter chunk count
  - Base class list
  - Source file reference
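
The naming scheme for the two output files can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `output_paths` is hypothetical, and the timestamp format `%Y%m%dT%H%M%S` is inferred from example file names such as botanical_query_updated_20251116T090734.sparql. Whether the script stamps files in UTC or local time is an assumption here.

```python
from datetime import datetime, timezone
from pathlib import Path


def output_paths(out_dir: Path, now: datetime) -> tuple[Path, Path]:
    """Return (sparql_path, yaml_path) that share one timestamped stem."""
    # Compact timestamp, e.g. 20251116T090734 (format inferred from file names).
    stamp = now.strftime("%Y%m%dT%H%M%S")
    stem = f"botanical_query_updated_{stamp}"
    return out_dir / f"{stem}.sparql", out_dir / f"{stem}.yaml"


if __name__ == "__main__":
    # Assumption: UTC is used for the timestamp.
    query_path, meta_path = output_paths(Path("B/queries"), datetime.now(timezone.utc))
    print(query_path)
    print(meta_path)
```

Sharing one stem keeps each query next to the metadata that describes it, so a directory listing sorted by name is also sorted by generation time.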

Example Output

============================================================
BOTANICAL QUERY GENERATOR
============================================================
Reading Q-numbers from /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml
Extracted 1786 Q-numbers
Created 36 FILTER chunks
✅ Wrote query: .../B/queries/botanical_query_updated_20251116T090734.sparql
✅ Wrote metadata: .../B/queries/botanical_query_updated_20251116T090734.yaml
============================================================
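
The chunk count in this output follows from the 50-per-chunk rule by ceiling division. A short check, assuming the last chunk is simply left partially filled (the helper name `filter_chunk_count` is illustrative):

```python
import math

CHUNK_SIZE = 50  # Q-numbers per FILTER statement, as stated in this README


def filter_chunk_count(q_number_count: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of FILTER chunks needed; the last chunk may be partially filled."""
    return math.ceil(q_number_count / chunk_size)


print(filter_chunk_count(1786))  # 36, as in the example output
print(filter_chunk_count(1814))  # 37, as in the latest generation statistics
```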

Query Structure

Base Classes (27)

The query searches hyponyms of these Wikidata classes:

Q-Number	Class Name
Q167346	Botanical gardens
Q43501	Zoos
Q2281788	Aquariums
Q7712619	Arboreta
Q181916	Herbarium
Q1970365	Natural history museums
Q20268591	Wildlife reserves
Q179049	Nature reserves
Q46169	National parks
Q473972	Protected areas
Q158454	Biosphere reserves
Q21164403	Safari parks
Q9480202	Safari parks (alt)
Q8085554	Wildlife sanctuaries
Q2616170	Marine reserve
Q936257	Conservation areas
Q1426613	Seed banks
Q4915239	Biorepository
Q2982911	Natural history collection
Q1905347	Gene bank
Q864217	Biobank
Q2189151	Soilbank
Q8508664	Herbaria
Q11489453	Culture collections
Q23790	Natural monuments
Q386426	Natural heritage
Q526826	Natural heritage (alt)

Exclusion Mechanism

IMPORTANT: The FILTER statements exclude Q-numbers from results, not from traversal.

# The query DOES traverse the hyponym tree from base classes
?hyponym wdt:P279+ wd:Q167346 .  # Find ALL hyponyms of botanical gardens

# Then FILTERS remove curated Q-numbers from results
FILTER(?hyponym NOT IN (wd:Q1234, wd:Q5678, ...))

This means:

✅ Query explores the full hyponym tree
✅ Curated Q-numbers are excluded from final results
✅ New hyponyms are discovered in subsequent runs
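
A minimal sketch of how such a query could be assembled from a list of base classes and a set of curated Q-numbers. The function names, the `SELECT DISTINCT ?hyponym` shape, and the numeric sort order are assumptions made for illustration; the actual template used by the generation script is not shown in this README.

```python
def build_filters(q_numbers, chunk_size=50):
    """Return one FILTER line per chunk of Q-numbers."""
    # Sort numerically so regenerated queries diff cleanly (an assumption).
    ordered = sorted(q_numbers, key=lambda q: int(q[1:]))
    lines = []
    for start in range(0, len(ordered), chunk_size):
        chunk = ordered[start:start + chunk_size]
        values = ", ".join(f"wd:{q}" for q in chunk)
        lines.append(f"  FILTER(?hyponym NOT IN ({values}))")
    return lines


def build_query(base_classes, excluded, limit=10000):
    """Assemble a query: UNION of subclass traversals, then exclusion FILTERs."""
    unions = "\n  UNION\n".join(
        f"  {{ ?hyponym wdt:P279+ wd:{q} . }}" for q in base_classes
    )
    filters = "\n".join(build_filters(excluded))
    return (
        "SELECT DISTINCT ?hyponym WHERE {\n"
        f"{unions}\n"
        f"{filters}\n"
        "}\n"
        f"LIMIT {limit}\n"
    )


if __name__ == "__main__":
    # Two base classes and two placeholder exclusions, for illustration only.
    print(build_query(["Q167346", "Q43501"], {"Q1234", "Q5678"}))
```

Because the FILTER lines only constrain `?hyponym` after the `wdt:P279+` patterns have matched, a curated class is dropped from the results while its own subclasses remain reachable.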

Duplicate Q-number Handling

The script extracts Q-numbers from two sources:

label: fields - Primary Q-number for each entity (e.g., label: Q9259)
duplicate: fields - Alternative Q-numbers for the same entity

Example from hyponyms_curated.yaml:

- label: Q9259
  hypernym:
    - heritage site
  type:
    - B
    - F
    - L
    - A
    - M
  duplicate:
    - Q31838911  # ← Alternative ID excluded from query

Why exclude duplicates? These Q-numbers represent:

Wikidata merge operations (two entries merged into one)
Initial misidentification (entity was curated under different ID)
Alternative identifiers in external databases

By excluding both the primary label and duplicate IDs, we avoid rediscovering the same real-world entity under different Wikidata Q-numbers.
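
The extraction of both kinds of Q-number can be sketched with line-based regular expressions. This is an illustrative approximation under stated assumptions, not the script's code: the patterns and the function name `extract_q_numbers` are hypothetical, and the sketch assumes `duplicate:` values are either a block list (as in the example above) or an inline list on the same line.

```python
import re

LABEL_RE = re.compile(r"^\s*-?\s*label:\s*(Q\d+)\b")
DUPLICATE_KEY_RE = re.compile(r"^\s*duplicate:\s*(.*)$")
LIST_ITEM_RE = re.compile(r"^\s*-\s*(Q\d+)\b")
Q_RE = re.compile(r"\bQ\d+\b")


def extract_q_numbers(text: str) -> tuple[set[str], set[str]]:
    """Return (label Q-numbers, duplicate Q-numbers) found in the YAML text."""
    labels, duplicates = set(), set()
    in_duplicate_block = False
    for line in text.splitlines():
        label_match = LABEL_RE.match(line)
        if label_match:
            labels.add(label_match.group(1))
            in_duplicate_block = False
            continue
        dup_match = DUPLICATE_KEY_RE.match(line)
        if dup_match:
            # Inline form such as "duplicate: [Q1, Q2]"; drop trailing comments.
            inline = dup_match.group(1).split("#", 1)[0]
            duplicates.update(Q_RE.findall(inline))
            in_duplicate_block = True
            continue
        if in_duplicate_block:
            item = LIST_ITEM_RE.match(line)
            if item:
                duplicates.add(item.group(1))
            elif line.strip():
                in_duplicate_block = False  # any other key ends the list
    return labels, duplicates


if __name__ == "__main__":
    sample = (
        "- label: Q9259\n"
        "  hypernym:\n"
        "    - heritage site\n"
        "  duplicate:\n"
        "    - Q31838911  # alternative ID\n"
    )
    labels, duplicates = extract_q_numbers(sample)
    print(sorted(labels | duplicates))  # union of both sets feeds the exclusions
```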

Current statistics:

18 duplicate Q-numbers found across the curated vocabulary
Examples: Q31838911 (heritage site), Q1257025 (garden), Q21164403 (safari park)

Result Limit

LIMIT 10000

Caps the result set, which reduces the risk of SPARQL timeout errors
Matches the original query's behavior
With more than 1,786 exclusions, fewer than 5,000 results are expected

Workflow for Updates

When to Regenerate Query

Run the generation script whenever:

New Q-numbers have been added to hyponyms_curated.yaml
A curation batch is complete and ready for the next discovery round
A query execution failed because of outdated exclusions

Step-by-Step Process

Update curated vocabulary:

# Add new Q-numbers to hyponyms_curated.yaml
vim data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml

Regenerate query:

python3 scripts/generate_botanical_query_with_exclusions.py

Verify output:

# Check Q-number count increased
grep "excluded_q_numbers_count:" B/queries/botanical_query_updated_*.yaml | tail -1

# Check FILTER chunks look correct
grep "FILTER(?hyponym NOT IN" B/queries/botanical_query_updated_*.sparql | head -3

Execute query:

# Use Wikidata Query Service: https://query.wikidata.org
# Copy-paste the generated SPARQL query
# Download results as JSON

Process results:

# Parse JSON results and update hyponyms_curated.yaml
# Repeat cycle
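
The result-processing step could start from a sketch like the one below, which pulls Q-numbers out of a downloaded JSON file and keeps only those not yet curated. It assumes the variable is named `hyponym` (as in the query fragments in this README) and accepts two layouts: the standard SPARQL JSON results format with `results.bindings`, and a flat list of row objects, which is assumed to be the simplified export. Function names are hypothetical.

```python
import json
import re

ENTITY_RE = re.compile(r"/entity/(Q\d+)$")


def q_numbers_from_results(raw: str, var: str = "hyponym") -> set[str]:
    """Extract Q-numbers bound to `var` from a JSON result document."""
    data = json.loads(raw)
    # Standard SPARQL JSON nests rows under results.bindings; otherwise a
    # flat list of {var: uri} objects is assumed.
    rows = data["results"]["bindings"] if isinstance(data, dict) else data
    found = set()
    for row in rows:
        cell = row.get(var)
        uri = cell.get("value") if isinstance(cell, dict) else cell
        match = ENTITY_RE.search(uri or "")
        if match:
            found.add(match.group(1))
    return found


def new_candidates(raw: str, curated: set[str]) -> list[str]:
    """Q-numbers in the results that are not yet curated, in numeric order."""
    fresh = q_numbers_from_results(raw) - curated
    return sorted(fresh, key=lambda q: int(q[1:]))


if __name__ == "__main__":
    example = json.dumps([{"hyponym": "http://www.wikidata.org/entity/Q42"}])
    print(new_candidates(example, curated=set()))
```

Subtracting the curated set again on the client side is a cheap safeguard in case the results were produced by an older query with fewer exclusions.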

Integration with Other Classes

This script is designed to be reusable for all 19 GLAMORCUBEPSXHFN classes:

A - Archives
B - Botanical gardens/Zoos (this class)
C - Corporations
D - Digital platforms
E - Education providers
F - Features
G - Galleries
H - Holy sites
I - Intangible heritage
L - Libraries
M - Museums
N - NGOs
O - Official institutions
P - Personal collections
R - Research centers
S - Societies
T - Taste/smell heritage
U - Unknown
X - Mixed

To Adapt for Other Classes

Copy script: generate_botanical_query_with_exclusions.py
Modify base classes: Update UNION blocks with relevant Q-numbers
Update output path: Change B/queries to appropriate class directory
Run generation: Execute with updated template

Query Statistics (Latest Generation)

File: botanical_query_updated_20251116T093744.sparql

Generated: 2025-11-16T09:37:44+00:00
Base Classes: 27
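Excluded Q-numbers: 1,814 (unique)
- From label: fields: 1,797
- From duplicate: fields: 18 (alternative IDs for same entities)
- The two counts sum to 1,815; the unique total is one lower, presumably because one Q-number occurs in both fields and is deduplicated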
FILTER Chunks: 37
Expected Results: < 5,000 (with LIMIT 10000)

Troubleshooting

Q-number Count Mismatch

Symptom: Extracted Q-number count differs from manual count

Cause: The same Q-number appears more than once in hyponyms_curated.yaml

Solution: None needed. The script deduplicates with set(), so an extracted count lower than the manual count is expected
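
To see which Q-numbers account for the difference, the repeated ones can be listed directly. A small sketch, assuming every `Q<digits>` token in the file counts as an occurrence (including any that appear in comments); the helper name is illustrative:

```python
import re
from collections import Counter


def repeated_q_numbers(text: str) -> dict[str, int]:
    """Map each Q-number that occurs more than once to its occurrence count."""
    counts = Counter(re.findall(r"\bQ\d+\b", text))
    return {q: n for q, n in counts.items() if n > 1}


if __name__ == "__main__":
    sample = "- label: Q1\n  duplicate:\n    - Q2\n- label: Q2\n- label: Q3\n"
    print(repeated_q_numbers(sample))
```

The manual count minus the sum of `n - 1` over the repeated entries should equal the deduplicated count reported by the script.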

YAML Parsing Errors

Symptom: Script fails with yaml.scanner.ScannerError

Cause: Tab characters or formatting issues in YAML file

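Solution: The script extracts Q-numbers with regular expressions instead of a YAML parser, so it tolerates such formatting issues. If the error is raised by another tool that parses the file, fix the tab characters or formatting in the YAML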

SPARQL Timeout

Symptom: Query exceeds Wikidata Query Service time limit

Cause: Too many results or complex traversal

Solution:

Check that the LIMIT 10000 clause is present
Add more exclusions to hyponyms_curated.yaml
Split the base classes into smaller groups and run one query per group
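
Splitting the base classes could look like the sketch below. The group size is an arbitrary choice for illustration, and the helper name is hypothetical; each group would be run as its own query with the same exclusions, and the results merged afterwards.

```python
def split_into_groups(items, group_size):
    """Split a list into consecutive groups of at most group_size items."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]


# First five base classes from the table in this README, for illustration.
BASE_CLASSES = ["Q167346", "Q43501", "Q2281788", "Q7712619", "Q181916"]

for number, group in enumerate(split_into_groups(BASE_CLASSES, 2), start=1):
    print(f"query {number}: {', '.join(group)}")
```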

No New Results

Symptom: Query returns zero results

Cause: All hyponyms have been curated

Solution: Success! Move to next class or add more base classes

References

Curated Vocabulary: data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml
Generation Script: scripts/generate_botanical_query_with_exclusions.py
Wikidata Query Service: https://query.wikidata.org
Project Documentation: AGENTS.md (B-class taxonomy)

Version History

Date	Version	Changes
2025-11-16	1.0.0	Initial automated generation script
2025-11-13	0.9.x	Manual query updates (573 Q-numbers)

Next Steps

Execute query on Wikidata Query Service
Review results for valid B-class institutions
Add to curated vocabulary in hyponyms_curated.yaml
Regenerate query with updated exclusions
Repeat until no new hyponyms found
