# Session Summary: Complete GLAMORCUBEPSXHFN SPARQL Query Generation
**Date**: 2025-11-13
**Session Duration**: ~10 minutes
**Status**: ✅ **COMPLETE**
---
## Executive Summary
Successfully generated **13 comprehensive SPARQL queries** for discovering missing Wikidata hyponyms across the GLAMORCUBEPSXHFN taxonomy. Each query excludes 389 curated Q-numbers from all sections of `hyponyms_curated.yaml` and supports 40+ languages.
---
## What We Did
### 1. Fixed Archive (A) Query Issues
**Problem**: Previous session discovered that the archive query was only excluding 316 Q-numbers (from `hyponym` section only), missing Q-numbers in other sections.
**Solution**: Recursive extraction discovered the curated vocabulary has **5 sections**:
- `hyponym` (316 Q-numbers)
- `entity` (42 Q-numbers)
- `rico` (5 Q-numbers)
- `exclude` (27 Q-numbers)
- `standards` (additional entries)
**Total**: **389 unique Q-numbers** after deduplication, now excluded from all queries.
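
The recursive extraction described above can be sketched roughly as follows. This is a minimal illustration, not the script used in the session: it assumes the YAML has already been loaded into nested dicts and lists (for example with `yaml.safe_load`), and the `sample` structure and the `collect_q_numbers` helper are hypothetical stand-ins for the real vocabulary and tooling.

```python
import re
from typing import Any

# Matches Wikidata identifiers such as Q166118.
Q_NUMBER = re.compile(r"\bQ\d+\b")


def collect_q_numbers(node: Any) -> set:
    """Recursively gather every Q-number found in the strings of a nested structure."""
    found = set()
    if isinstance(node, str):
        found.update(Q_NUMBER.findall(node))
    elif isinstance(node, dict):
        for key, value in node.items():
            found |= collect_q_numbers(key)
            found |= collect_q_numbers(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            found |= collect_q_numbers(item)
    return found


# Hypothetical miniature of the vocabulary; the real file has five sections.
sample = {
    "hyponym": [{"label": "Q166118"}, {"label": "Q7075"}],
    "entity": [{"label": "Q33506"}],
    "exclude": [{"label": "Q7075"}],  # duplicate across sections, counted once
}

unique = sorted(collect_q_numbers(sample), key=lambda q: int(q[1:]))
print(len(unique), unique)
```

Because the walk ignores section names entirely, a Q-number that appears in several sections is counted once, which is why the unique total can be lower than the sum of the per-section counts.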
### 2. Generated Complete Query Set
Created standardized queries for **13 GLAMORCUBEPSXHFN classes**:
| Class | Type | Base Classes | Query File |
|-------|------|--------------|------------|
| **A** | Archive | 1 base (Q166118) | `archive_query_missing_complete_20251113T130052.sparql` |
| **B** | Botanical/Zoo | 18 bases (Q167346, Q43501, Q27686, etc.) | `botanical_zoo_query_complete_20251113T130659.sparql` |
| **G** | Gallery | 4 bases (Q1007870, Q1007871, Q194195, Q445396) | `gallery_query_complete_20251113T130920.sparql` |
| **L** | Library | 11 bases (Q7075, Q28564, Q856234, etc.) | `library_query_complete_20251113T131006.sparql` |
| **M** | Museum | 14 bases (Q33506, Q207694, Q1535661, etc.) | `museum_query_complete_20251113T131027.sparql` |
| **O** | Official Institution | 4 bases (Q480242, Q1664720, etc.) | `official_query_complete_20251113T131055.sparql` |
| **R** | Research Center | 3 bases (Q31855, Q13226383, Q4671277) | `research_query_complete_20251113T131055.sparql` |
| **C** | Corporation | 3 bases (Q4830453, Q783794, Q6881511) | `corporation_query_complete_20251113T131055.sparql` |
| **E** | Education Provider | 6 bases (Q3918, Q875538, Q15936437, etc.) | `education_query_complete_20251113T131055.sparql` |
| **P** | Personal Collection | 2 bases (Q2668072, Q160554) | `personal_query_complete_20251113T131055.sparql` |
| **S** | Collecting Society | 3 bases (Q1065742, Q2668072, Q43229) | `collecting_query_complete_20251113T131055.sparql` |
| **H** | Holy Sites | 6 bases (Q32815, Q16970, Q44613, etc.) | `holy_query_complete_20251113T131055.sparql` |
| **F** | Features | 4 bases (Q4989906, Q5003624, Q7075, Q39614) | `features_query_complete_20251113T131055.sparql` |
**Note**: U (Unknown) and X (Mixed) classes do not have dedicated queries since they represent classification states rather than Wikidata entity types.
---
## Query Features (Standardized Across All Classes)
### Technical Specifications
1. **Complete Exclusion List**: 389 Q-numbers from ALL sections of `hyponyms_curated.yaml`
- Extracted recursively from: `hyponym`, `entity`, `rico`, `exclude`, `standards`
- Split into 8 FILTER chunks (up to 50 Q-numbers each)
2. **Transitive Subclass Queries**:
```sparql
?hyponym wdt:P279+ wd:Q[BASE_CLASS]
```
- Uses `wdt:P279+` (transitive, **non-reflexive**)
- Avoids `wdt:P279*` which would include the base class itself
3. **Language Support**: 40+ languages
```sparql
SERVICE wikibase:label {
bd:serviceParam wikibase:language "en,es,fr,de,nl,pt,ar,zh,ja,ru,hi,id,ms,th,vi,ko,tr,fa,pl,it,uk,sv,cs,he,bn,mr,ta,te,ur,pa,el,ro,hu,da,no,fi,ca,sr,bg,hr,sk,sl".
}
```
4. **Full ISO 8601 Timestamps**:
- Format: `2025-11-13T13:09:20.290802+00:00`
- Includes: date, time (HH:MM:SS), microseconds, timezone (+00:00 UTC)
- Filenames: `YYYYMMDDTHHmmSS` format (sortable)
5. **Metadata Files**: Each query has companion YAML with:
- Class name and code
- Generation timestamp
- Base class mappings (Q-number → label)
- Exclusion statistics (389 Q-numbers, 8 chunks)
- SPARQL features documentation
- Execution parameters (endpoint, method, timeout)
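
The two timestamp forms from item 4 can be produced from a single UTC `datetime`. The snippet below is a sketch under that assumption; `make_stamps` is a hypothetical helper name, and the fixed `moment` simply reuses the example timestamp quoted above.

```python
from datetime import datetime, timezone


def make_stamps(now: datetime) -> tuple:
    """Return (full ISO 8601 timestamp, compact sortable filename stamp) in UTC."""
    utc = now.astimezone(timezone.utc)
    # timespec="microseconds" keeps the fractional part even when it is zero.
    full = utc.isoformat(timespec="microseconds")
    compact = utc.strftime("%Y%m%dT%H%M%S")
    return full, compact


# Fixed moment taken from the example timestamp in this document.
moment = datetime(2025, 11, 13, 13, 9, 20, 290802, tzinfo=timezone.utc)
iso_stamp, file_stamp = make_stamps(moment)

print(iso_stamp)  # 2025-11-13T13:09:20.290802+00:00
print(f"gallery_query_complete_{file_stamp}.sparql")
```

In real use the argument would be `datetime.now(timezone.utc)`; a fixed value is used here only so the output is reproducible.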
---
## File Structure
```
data/wikidata/GLAMORCUBEPSXHFN/
├── hyponyms_curated.yaml          # Master vocabulary (389 Q-numbers)
├── A/queries/
│   ├── archive_query_missing_complete_20251113T130052.yaml
│   └── archive_query_missing_complete_20251113T130052.sparql
├── B/queries/
│   ├── botanical_zoo_query_complete_20251113T130659.yaml
│   └── botanical_zoo_query_complete_20251113T130659.sparql
├── G/queries/
│   ├── gallery_query_complete_20251113T130920.yaml
│   └── gallery_query_complete_20251113T130920.sparql
├── L/queries/
│   ├── library_query_complete_20251113T131006.yaml
│   └── library_query_complete_20251113T131006.sparql
├── M/queries/
│   ├── museum_query_complete_20251113T131027.yaml
│   └── museum_query_complete_20251113T131027.sparql
└── [O, R, C, E, P, S, H, F]/queries/
    ├── [class]_query_complete_20251113T131055.yaml
    └── [class]_query_complete_20251113T131055.sparql
```
**Total Files Generated**: 26 files (13 YAML + 13 SPARQL)
---
## Query Execution Workflow
For each GLAMORCUBEPSXHFN class:
1. **Execute Query** at https://query.wikidata.org/
- Copy SPARQL from `[class]_query_complete_*.sparql`
- Run with POST method (large query size)
- Download results as JSON
2. **Review Results**
- Check for valid heritage institution subtypes
- Identify false positives (non-heritage classes)
- Verify labels are semantically correct
3. **Curate Vocabulary**
- Add validated Q-numbers to `hyponyms_curated.yaml`
- Assign GLAMORCUBEPSXHFN type codes
- Add hypernym relationships
- Document country/region if applicable
4. **Re-run Query** (iterative discovery)
- Newly curated Q-numbers will be excluded in next run
- Discover transitive subclasses of newly added hyponyms
- Continue until no new relevant results found
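
For scripted runs, step 1 can be sketched with the standard library. The example below only builds the POST request and does not send it. It assumes the public SPARQL endpoint `https://query.wikidata.org/sparql`; the `build_wdqs_request` helper, the shortened sample query, and the User-Agent string are hypothetical placeholders.

```python
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_wdqs_request(query: str, user_agent: str) -> urllib.request.Request:
    """Build (but do not send) a form-encoded POST request carrying the query."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        WDQS_ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
            "User-Agent": user_agent,
        },
    )


# Shortened stand-in for one of the generated .sparql files.
sample_query = "SELECT ?hyponym WHERE { ?hyponym wdt:P279+ wd:Q166118 } LIMIT 5"
request = build_wdqs_request(sample_query, "glam-hyponym-discovery/0.1 (hypothetical)")
print(request.get_method(), request.full_url)

# To actually execute it (requires network access):
# with urllib.request.urlopen(request, timeout=60) as response:
#     raw_json = response.read().decode("utf-8")
```

Sending the query in the request body rather than in the URL avoids URL length limits, which matters once the exclusion list runs to several hundred Q-numbers.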
---
## Key Improvements from Last Session
| Issue | Previous State | Current State |
|-------|----------------|---------------|
| **Exclusion Coverage** | 311 → 316 Q-numbers (partial) | **389 Q-numbers (complete)** |
| **YAML Sections** | Only `hyponym` section | All 5 sections (`hyponym`, `entity`, `rico`, `exclude`, `standards`) |
| **Archive Query** | Returned duplicates | Returns 0 results (all captured) ✅ |
| **Query Standardization** | Inconsistent formats | Unified template across all classes |
| **Timestamp Precision** | Date only | Full ISO 8601 with microseconds |
| **Classes Covered** | 2 (A, B) | **13 (A, B, G, L, M, O, R, C, E, P, S, H, F)** |
---
## Statistics
- **Total Q-numbers Excluded**: 389
- **FILTER Chunks**: 8 (up to 50 Q-numbers each)
- **Classes Queried**: 13/15 (U and X excluded by design)
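- **Base Classes Total**: 79 base-class entries across the 13 queries (sum of the per-class counts in the table above; a few Q-numbers, e.g. Q7075 and Q2668072, are listed under more than one class)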
- **Languages Supported**: 40+
- **Query Files**: 13
- **Metadata Files**: 13
- **Total Files Created**: 26
---
## Next Steps (Recommended Workflow)
### Phase 1: Execute Queries (Priority Order)
1. **Start with well-defined classes** (fewer false positives):
- Museum (M) - 14 base classes
- Library (L) - 11 base classes
- Gallery (G) - 4 base classes
2. **Continue with specialized classes**:
- Archive (A) - verify no new results
- Botanical/Zoo (B) - 18 bases (may have many results)
- Features (F) - monuments, memorials, sculptures
3. **Proceed to organizational classes**:
- Education Provider (E) - universities, schools
- Research Center (R) - institutes, facilities
- Official Institution (O) - government heritage
- Corporation (C) - corporate collections
4. **Finish with niche classes**:
- Holy Sites (H) - religious collections
- Collecting Society (S) - historical societies
- Personal Collection (P) - private collections
### Phase 2: Curate Results
For each class:
1. Download JSON results from Wikidata Query Service
2. Review Q-numbers and labels for relevance
3. Add to `hyponyms_curated.yaml`:
```yaml
- label: Q[NUMBER]
  hypernym:
    - [descriptive term]
  type:
    - [GLAMORCUBEPSXHFN code]
  country:  # optional
    - [ISO country code]
```
4. Document any duplicates or edge cases
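
Steps 1 and 2 can be partly automated before manual review. The sketch below assumes the results are in the standard SPARQL JSON results format (as returned by the endpoint for `Accept: application/sparql-results+json`) and that the query selects `?hyponym` and `?hyponymLabel`; the label variable name, the `new_candidates` helper, and the `Q999…` identifiers are hypothetical placeholders, not real query output.

```python
import json

ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def new_candidates(results_json: str, curated: set) -> list:
    """Return (Q-number, label) pairs from a results file that are not yet curated."""
    bindings = json.loads(results_json)["results"]["bindings"]
    candidates = []
    seen = set()
    for row in bindings:
        uri = row["hyponym"]["value"]
        if not uri.startswith(ENTITY_PREFIX):
            continue  # skip anything that is not a Wikidata entity URI
        qid = uri[len(ENTITY_PREFIX):]
        if qid in curated or qid in seen:
            continue  # already in the vocabulary, or a duplicate row
        seen.add(qid)
        # Fall back to the Q-number when no label is available.
        label = row.get("hyponymLabel", {}).get("value", qid)
        candidates.append((qid, label))
    return candidates


# Placeholder results: the Q999... identifiers are made up for illustration.
sample_results = json.dumps({
    "results": {"bindings": [
        {"hyponym": {"value": ENTITY_PREFIX + "Q166118"},
         "hyponymLabel": {"value": "archive"}},
        {"hyponym": {"value": ENTITY_PREFIX + "Q999000001"},
         "hyponymLabel": {"value": "placeholder type A"}},
        {"hyponym": {"value": ENTITY_PREFIX + "Q999000001"},
         "hyponymLabel": {"value": "placeholder type A"}},
        {"hyponym": {"value": ENTITY_PREFIX + "Q999000002"}},
    ]}
})

candidates = new_candidates(sample_results, curated={"Q166118"})
print(candidates)
```

The output is only a shortlist. Judging whether each candidate is a genuine heritage institution subtype remains a manual step.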
### Phase 3: Iterative Discovery
1. Re-run queries after curation (new Q-numbers will be excluded)
2. Discover transitive subclasses of newly added hyponyms
3. Continue until queries return no new relevant results
4. Mark class as "complete" in tracking document
### Phase 4: Validation
1. Cross-reference with existing GLAM registries
2. Verify coverage of major institution types
3. Check for geographic balance
4. Document any gaps or missing subtypes
---
## Technical Notes
### Why 389 Q-numbers (not 390)?
The recursive extraction found **389 unique Q-numbers** after deduplication. The previous session summary mentioned "390 Q-numbers" which was an estimate before deduplication.
### Why wdt:P279+ (not wdt:P279*)?
- `wdt:P279*`: zero or more `subclass of` steps (reflexive transitive closure, so the base class itself matches)
- `wdt:P279+`: one or more `subclass of` steps (transitive closure, so the base class is excluded)
We use `wdt:P279+` to avoid returning the base classes (Q33506, Q7075, etc.) in the results.
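
The difference between the two path operators can be illustrated on a toy hierarchy. This is a plain-Python analogy rather than SPARQL evaluation; the node names and the `subclasses_plus` / `subclasses_star` helpers are hypothetical.

```python
from collections import deque

# Hypothetical toy hierarchy: each node maps to its direct superclasses (P279 edges).
SUBCLASS_OF = {
    "base": [],
    "child": ["base"],
    "grandchild": ["child"],
    "unrelated": [],
}


def subclasses_plus(base: str, edges: dict) -> set:
    """Nodes that reach `base` through ONE OR MORE subclass-of edges (like P279+)."""
    # Invert the edges so we can walk from a class down to its subclasses.
    children = {}
    for child, parents in edges.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    found = set()
    queue = deque(children.get(base, []))
    while queue:
        node = queue.popleft()
        if node in found:
            continue
        found.add(node)
        queue.extend(children.get(node, []))
    return found


def subclasses_star(base: str, edges: dict) -> set:
    """Nodes that reach `base` through ZERO OR MORE edges (like P279*)."""
    return subclasses_plus(base, edges) | {base}


print(sorted(subclasses_plus("base", SUBCLASS_OF)))  # ['child', 'grandchild']
print(sorted(subclasses_star("base", SUBCLASS_OF)))  # ['base', 'child', 'grandchild']
```

The only difference between the two result sets is the base node itself, which is exactly what the `+` form keeps out of the query results.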
### Why 8 FILTER chunks?
Wikidata Query Service has SPARQL expression complexity limits. Splitting 389 Q-numbers into 8 chunks of up to 50 each keeps queries executable while maintaining complete exclusion coverage.
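
The chunking itself is simple to sketch. The exact FILTER syntax in the generated queries is not shown in this summary, so the `NOT IN` form below is an assumption, as are the `exclusion_filters` helper and the placeholder identifiers.

```python
def chunked(items: list, size: int) -> list:
    """Split a list into consecutive slices of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def exclusion_filters(q_numbers: list, variable: str = "?hyponym", size: int = 50) -> list:
    """Render one FILTER ... NOT IN (...) clause per chunk of Q-numbers."""
    filters = []
    for chunk in chunked(q_numbers, size):
        values = ", ".join(f"wd:{q}" for q in chunk)
        filters.append(f"FILTER({variable} NOT IN ({values}))")
    return filters


# 389 placeholder identifiers standing in for the curated Q-numbers.
placeholder_ids = [f"Q{n}" for n in range(1, 390)]
filters = exclusion_filters(placeholder_ids)

print(len(filters), [f.count("wd:") for f in filters])
```

With 389 identifiers and a chunk size of 50 this yields eight clauses: seven full chunks and a final one with the remaining 39.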
### Why 40+ languages?
GLAMORCUBEPSXHFN is a global taxonomy covering institutions worldwide. Supporting 40+ languages ensures:
- Non-English institution types are discoverable
- Multilingual labels improve matching during curation
- Geographic coverage extends beyond English-speaking countries
---
## Files Modified/Created
### Created (26 files):
- `data/wikidata/GLAMORCUBEPSXHFN/[A,B,G,L,M,O,R,C,E,P,S,H,F]/queries/*_query_complete_*.{yaml,sparql}`
### Modified:
- None (all existing files preserved)
### Temporary:
- `/tmp/all_q_numbers.txt` (working file for Q-number extraction)
---
## Session Achievements
✅ Fixed archive query exclusion list (311 → 316 → 389 Q-numbers)
✅ Created comprehensive botanical/zoo query (18 base classes)
✅ Generated 11 additional class queries (G, L, M, O, R, C, E, P, S, H, F)
✅ Established reusable query template with best practices
✅ Full ISO 8601 timestamps throughout
✅ Language-agnostic approach (40+ languages)
✅ Complete documentation in query YAML metadata
✅ Created session summary document
---
## References
- **Master Vocabulary**: `data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml`
- **Query Directory**: `data/wikidata/GLAMORCUBEPSXHFN/*/queries/`
- **Wikidata Query Service**: https://query.wikidata.org/
- **GLAMORCUBEPSXHFN Taxonomy**: See `AGENTS.md` (15-type taxonomy)
- **Previous Session**: `docs/sessions/SESSION_SUMMARY_20251113_ARCHIVE_QUERY_FIX.md` (if it exists)
---
**Session End**: 2025-11-13T13:11:00+00:00
**Status**: ✅ COMPLETE - Ready for query execution phase