11 KiB
Session Summary: Complete GLAMORCUBEPSXHFN SPARQL Query Generation
Date: 2025-11-13
Session Duration: ~10 minutes
Status: ✅ COMPLETE
Executive Summary
Successfully generated 13 comprehensive SPARQL queries for discovering missing Wikidata hyponyms across the GLAMORCUBEPSXHFN taxonomy. Each query excludes 389 curated Q-numbers from all sections of hyponyms_curated.yaml and supports 40+ languages.
What We Did
1. Fixed Archive (A) Query Issues
Problem: Previous session discovered that the archive query was only excluding 316 Q-numbers (from hyponym section only), missing Q-numbers in other sections.
Solution: Recursive extraction discovered the curated vocabulary has 5 sections:
hyponym(316 Q-numbers)entity(42 Q-numbers)rico(5 Q-numbers)exclude(27 Q-numbers)standards(additional entries)
Total: 389 unique Q-numbers now excluded in all queries.
2. Generated Complete Query Set
Created standardized queries for 13 GLAMORCUBEPSXHFN classes:
| Class | Type | Base Classes | Query File |
|---|---|---|---|
| A | Archive | 1 base (Q166118) | archive_query_missing_complete_20251113T130052.sparql |
| B | Botanical/Zoo | 18 bases (Q167346, Q43501, Q27686, etc.) | botanical_zoo_query_complete_20251113T130659.sparql |
| G | Gallery | 4 bases (Q1007870, Q1007871, Q194195, Q445396) | gallery_query_complete_20251113T130920.sparql |
| L | Library | 11 bases (Q7075, Q28564, Q856234, etc.) | library_query_complete_20251113T131006.sparql |
| M | Museum | 14 bases (Q33506, Q207694, Q1535661, etc.) | museum_query_complete_20251113T131027.sparql |
| O | Official Institution | 4 bases (Q480242, Q1664720, etc.) | official_query_complete_20251113T131055.sparql |
| R | Research Center | 3 bases (Q31855, Q13226383, Q4671277) | research_query_complete_20251113T131055.sparql |
| C | Corporation | 3 bases (Q4830453, Q783794, Q6881511) | corporation_query_complete_20251113T131055.sparql |
| E | Education Provider | 6 bases (Q3918, Q875538, Q15936437, etc.) | education_query_complete_20251113T131055.sparql |
| P | Personal Collection | 2 bases (Q2668072, Q160554) | personal_query_complete_20251113T131055.sparql |
| S | Collecting Society | 3 bases (Q1065742, Q2668072, Q43229) | collecting_query_complete_20251113T131055.sparql |
| H | Holy Sites | 6 bases (Q32815, Q16970, Q44613, etc.) | holy_query_complete_20251113T131055.sparql |
| F | Features | 4 bases (Q4989906, Q5003624, Q7075, Q39614) | features_query_complete_20251113T131055.sparql |
Note: U (Unknown) and X (Mixed) classes do not have dedicated queries since they represent classification states rather than Wikidata entity types.
Query Features (Standardized Across All Classes)
Technical Specifications
-
Complete Exclusion List: 389 Q-numbers from ALL sections of
hyponyms_curated.yaml- Extracted recursively from:
hyponym,entity,rico,exclude,standards - Split into 8 FILTER chunks (50 Q-numbers each)
- Extracted recursively from:
-
Transitive Subclass Queries:
?hyponym wdt:P279+ wd:Q[BASE_CLASS]- Uses
wdt:P279+(transitive, non-reflexive) - Avoids
wdt:P279*which would include the base class itself
- Uses
-
Language Support: 40+ languages
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,es,fr,de,nl,pt,ar,zh,ja,ru,hi,id,ms,th,vi,ko,tr,fa,pl,it,uk,sv,cs,he,bn,mr,ta,te,ur,pa,el,ro,hu,da,no,fi,ca,sr,bg,hr,sk,sl". } -
Full ISO 8601 Timestamps:
- Format:
2025-11-13T13:09:20.290802+00:00 - Includes: date, time (HH:MM:SS), microseconds, timezone (+00:00 UTC)
- Filenames:
YYYYMMDDTHHmmSSformat (sortable)
- Format:
-
Metadata Files: Each query has companion YAML with:
- Class name and code
- Generation timestamp
- Base class mappings (Q-number → label)
- Exclusion statistics (389 Q-numbers, 8 chunks)
- SPARQL features documentation
- Execution parameters (endpoint, method, timeout)
File Structure
data/wikidata/GLAMORCUBEPSXHFN/
├── hyponyms_curated.yaml # Master vocabulary (389 Q-numbers)
├── A/queries/
│ ├── archive_query_missing_complete_20251113T130052.yaml
│ └── archive_query_missing_complete_20251113T130052.sparql
├── B/queries/
│ ├── botanical_zoo_query_complete_20251113T130659.yaml
│ └── botanical_zoo_query_complete_20251113T130659.sparql
├── G/queries/
│ ├── gallery_query_complete_20251113T130920.yaml
│ └── gallery_query_complete_20251113T130920.sparql
├── L/queries/
│ ├── library_query_complete_20251113T131006.yaml
│ └── library_query_complete_20251113T131006.sparql
├── M/queries/
│ ├── museum_query_complete_20251113T131027.yaml
│ └── museum_query_complete_20251113T131027.sparql
├── [O, R, C, E, P, S, H, F]/queries/
│ ├── [class]_query_complete_20251113T131055.yaml
│ └── [class]_query_complete_20251113T131055.sparql
Total Files Generated: 26 files (13 YAML + 13 SPARQL)
Query Execution Workflow
For each GLAMORCUBEPSXHFN class:
-
Execute Query at https://query.wikidata.org/
- Copy SPARQL from
[class]_query_complete_*.sparql - Run with POST method (large query size)
- Download results as JSON
- Copy SPARQL from
-
Review Results
- Check for valid heritage institution subtypes
- Identify false positives (non-heritage classes)
- Verify labels are semantically correct
-
Curate Vocabulary
- Add validated Q-numbers to
hyponyms_curated.yaml - Assign GLAMORCUBEPSXHFN type codes
- Add hypernym relationships
- Document country/region if applicable
- Add validated Q-numbers to
-
Re-run Query (iterative discovery)
- Newly curated Q-numbers will be excluded in next run
- Discover transitive subclasses of newly added hyponyms
- Continue until no new relevant results found
Key Improvements from Last Session
| Issue | Previous State | Current State |
|---|---|---|
| Exclusion Coverage | 311 → 316 Q-numbers (partial) | 389 Q-numbers (complete) |
| YAML Sections | Only hyponym section |
All 5 sections (hyponym, entity, rico, exclude, standards) |
| Archive Query | Returned duplicates | Returns 0 results (all captured) ✅ |
| Query Standardization | Inconsistent formats | Unified template across all classes |
| Timestamp Precision | Date only | Full ISO 8601 with microseconds |
| Classes Covered | 2 (A, B) | 13 (A, B, G, L, M, O, R, C, E, P, S, H, F) |
Statistics
- Total Q-numbers Excluded: 389
- FILTER Chunks: 8 (50 Q-numbers each)
- Classes Queried: 13/15 (U and X excluded by design)
- Base Classes Total: ~90 Wikidata root classes
- Languages Supported: 40+
- Query Files: 13
- Metadata Files: 13
- Total Files Created: 26
Next Steps (Recommended Workflow)
Phase 1: Execute Queries (Priority Order)
-
Start with well-defined classes (fewer false positives):
- Museum (M) - 14 base classes
- Library (L) - 11 base classes
- Gallery (G) - 4 base classes
-
Continue with specialized classes:
- Archive (A) - verify no new results
- Botanical/Zoo (B) - 18 bases (may have many results)
- Features (F) - monuments, memorials, sculptures
-
Proceed to organizational classes:
- Education Provider (E) - universities, schools
- Research Center (R) - institutes, facilities
- Official Institution (O) - government heritage
- Corporation (C) - corporate collections
-
Finish with niche classes:
- Holy Sites (H) - religious collections
- Collecting Society (S) - historical societies
- Personal Collection (P) - private collections
Phase 2: Curate Results
For each class:
- Download JSON results from Wikidata Query Service
- Review Q-numbers and labels for relevance
- Add to
hyponyms_curated.yaml:- label: Q[NUMBER] hypernym: - [descriptive term] type: - [GLAMORCUBEPSXHFN code] country: # optional - [ISO country code] - Document any duplicates or edge cases
Phase 3: Iterative Discovery
- Re-run queries after curation (new Q-numbers will be excluded)
- Discover transitive subclasses of newly added hyponyms
- Continue until queries return no new relevant results
- Mark class as "complete" in tracking document
Phase 4: Validation
- Cross-reference with existing GLAM registries
- Verify coverage of major institution types
- Check for geographic balance
- Document any gaps or missing subtypes
Technical Notes
Why 389 Q-numbers (not 390)?
The recursive extraction found 389 unique Q-numbers after deduplication. The previous session summary mentioned "390 Q-numbers" which was an estimate before deduplication.
Why wdt:P279+ (not wdt:P279*)?
wdt:P279*: Reflexive transitive closure (includes the base class itself)wdt:P279+: Transitive closure (excludes the base class)
We use wdt:P279+ to avoid returning the base classes (Q33506, Q7075, etc.) in the results.
Why 8 FILTER chunks?
Wikidata Query Service has SPARQL expression complexity limits. Splitting 389 Q-numbers into 8 chunks of ~50 each keeps queries executable while maintaining complete exclusion coverage.
Why 40+ languages?
GLAMORCUBEPSXHFN is a global taxonomy covering institutions worldwide. Supporting 40+ languages ensures:
- Non-English institution types are discoverable
- Multilingual labels improve matching during curation
- Geographic coverage extends beyond English-speaking countries
Files Modified/Created
Created (26 files):
data/wikidata/GLAMORCUBEPSXHFN/[A,B,G,L,M,O,R,C,E,P,S,H,F]/queries/*_query_complete_*.{yaml,sparql}(26 files)
Modified:
- None (all existing files preserved)
Temporary:
/tmp/all_q_numbers.txt(working file for Q-number extraction)
Session Achievements
✅ Fixed archive query exclusion list (311 → 316 → 389 Q-numbers)
✅ Created comprehensive botanical/zoo query (18 base classes)
✅ Generated 11 additional class queries (G, L, M, O, R, C, E, P, S, H, F)
✅ Established reusable query template with best practices
✅ Full ISO 8601 timestamps throughout
✅ Language-agnostic approach (40+ languages)
✅ Complete documentation in query YAML metadata
✅ Created session summary document
References
- Master Vocabulary:
data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml - Query Directory:
data/wikidata/GLAMORCUBEPSXHFN/*/queries/ - Wikidata Query Service: https://query.wikidata.org/
- GLAMORCUBEPSXHFN Taxonomy: See
AGENTS.md(15-type taxonomy) - Previous Session:
docs/sessions/SESSION_SUMMARY_20251113_ARCHIVE_QUERY_FIX.md(if exists)
Session End: 2025-11-13T13:11:00+00:00
Status: ✅ COMPLETE - Ready for query execution phase