glam/docs/sessions/SESSION_SUMMARY_20251113_SPARQL_GENERATION.md
2025-11-19 23:25:22 +01:00

11 KiB

Session Summary: Complete GLAMORCUBEPSXHFN SPARQL Query Generation

Date: 2025-11-13
Session Duration: ~10 minutes
Status: COMPLETE


Executive Summary

Successfully generated 13 comprehensive SPARQL queries for discovering missing Wikidata hyponyms across the GLAMORCUBEPSXHFN taxonomy. Each query excludes 389 curated Q-numbers from all sections of hyponyms_curated.yaml and supports 40+ languages.


What We Did

1. Fixed Archive (A) Query Issues

Problem: Previous session discovered that the archive query was only excluding 316 Q-numbers (from hyponym section only), missing Q-numbers in other sections.

Solution: Recursive extraction discovered the curated vocabulary has 5 sections:

  • hyponym (316 Q-numbers)
  • entity (42 Q-numbers)
  • rico (5 Q-numbers)
  • exclude (27 Q-numbers)
  • standards (additional entries)

Total: 389 unique Q-numbers now excluded in all queries.

2. Generated Complete Query Set

Created standardized queries for 13 GLAMORCUBEPSXHFN classes:

Class Type Base Classes Query File
A Archive 1 base (Q166118) archive_query_missing_complete_20251113T130052.sparql
B Botanical/Zoo 18 bases (Q167346, Q43501, Q27686, etc.) botanical_zoo_query_complete_20251113T130659.sparql
G Gallery 4 bases (Q1007870, Q1007871, Q194195, Q445396) gallery_query_complete_20251113T130920.sparql
L Library 11 bases (Q7075, Q28564, Q856234, etc.) library_query_complete_20251113T131006.sparql
M Museum 14 bases (Q33506, Q207694, Q1535661, etc.) museum_query_complete_20251113T131027.sparql
O Official Institution 4 bases (Q480242, Q1664720, etc.) official_query_complete_20251113T131055.sparql
R Research Center 3 bases (Q31855, Q13226383, Q4671277) research_query_complete_20251113T131055.sparql
C Corporation 3 bases (Q4830453, Q783794, Q6881511) corporation_query_complete_20251113T131055.sparql
E Education Provider 6 bases (Q3918, Q875538, Q15936437, etc.) education_query_complete_20251113T131055.sparql
P Personal Collection 2 bases (Q2668072, Q160554) personal_query_complete_20251113T131055.sparql
S Collecting Society 3 bases (Q1065742, Q2668072, Q43229) collecting_query_complete_20251113T131055.sparql
H Holy Sites 6 bases (Q32815, Q16970, Q44613, etc.) holy_query_complete_20251113T131055.sparql
F Features 4 bases (Q4989906, Q5003624, Q7075, Q39614) features_query_complete_20251113T131055.sparql

Note: U (Unknown) and X (Mixed) classes do not have dedicated queries since they represent classification states rather than Wikidata entity types.


Query Features (Standardized Across All Classes)

Technical Specifications

  1. Complete Exclusion List: 389 Q-numbers from ALL sections of hyponyms_curated.yaml

    • Extracted recursively from: hyponym, entity, rico, exclude, standards
    • Split into 8 FILTER chunks (50 Q-numbers each)
  2. Transitive Subclass Queries:

    ?hyponym wdt:P279+ wd:Q[BASE_CLASS]
    
    • Uses wdt:P279+ (transitive, non-reflexive)
    • Avoids wdt:P279* which would include the base class itself
  3. Language Support: 40+ languages

    SERVICE wikibase:label {
      bd:serviceParam wikibase:language "en,es,fr,de,nl,pt,ar,zh,ja,ru,hi,id,ms,th,vi,ko,tr,fa,pl,it,uk,sv,cs,he,bn,mr,ta,te,ur,pa,el,ro,hu,da,no,fi,ca,sr,bg,hr,sk,sl".
    }
    
  4. Full ISO 8601 Timestamps:

    • Format: 2025-11-13T13:09:20.290802+00:00
    • Includes: date, time (HH:MM:SS), microseconds, timezone (+00:00 UTC)
    • Filenames: YYYYMMDDTHHmmSS format (sortable)
  5. Metadata Files: Each query has companion YAML with:

    • Class name and code
    • Generation timestamp
    • Base class mappings (Q-number → label)
    • Exclusion statistics (389 Q-numbers, 8 chunks)
    • SPARQL features documentation
    • Execution parameters (endpoint, method, timeout)

File Structure

data/wikidata/GLAMORCUBEPSXHFN/
├── hyponyms_curated.yaml                 # Master vocabulary (389 Q-numbers)
├── A/queries/
│   ├── archive_query_missing_complete_20251113T130052.yaml
│   └── archive_query_missing_complete_20251113T130052.sparql
├── B/queries/
│   ├── botanical_zoo_query_complete_20251113T130659.yaml
│   └── botanical_zoo_query_complete_20251113T130659.sparql
├── G/queries/
│   ├── gallery_query_complete_20251113T130920.yaml
│   └── gallery_query_complete_20251113T130920.sparql
├── L/queries/
│   ├── library_query_complete_20251113T131006.yaml
│   └── library_query_complete_20251113T131006.sparql
├── M/queries/
│   ├── museum_query_complete_20251113T131027.yaml
│   └── museum_query_complete_20251113T131027.sparql
├── [O, R, C, E, P, S, H, F]/queries/
│   ├── [class]_query_complete_20251113T131055.yaml
│   └── [class]_query_complete_20251113T131055.sparql

Total Files Generated: 26 files (13 YAML + 13 SPARQL)


Query Execution Workflow

For each GLAMORCUBEPSXHFN class:

  1. Execute Query at https://query.wikidata.org/

    • Copy SPARQL from [class]_query_complete_*.sparql
    • Run with POST method (large query size)
    • Download results as JSON
  2. Review Results

    • Check for valid heritage institution subtypes
    • Identify false positives (non-heritage classes)
    • Verify labels are semantically correct
  3. Curate Vocabulary

    • Add validated Q-numbers to hyponyms_curated.yaml
    • Assign GLAMORCUBEPSXHFN type codes
    • Add hypernym relationships
    • Document country/region if applicable
  4. Re-run Query (iterative discovery)

    • Newly curated Q-numbers will be excluded in next run
    • Discover transitive subclasses of newly added hyponyms
    • Continue until no new relevant results found

Key Improvements from Last Session

Issue Previous State Current State
Exclusion Coverage 311 → 316 Q-numbers (partial) 389 Q-numbers (complete)
YAML Sections Only hyponym section All 5 sections (hyponym, entity, rico, exclude, standards)
Archive Query Returned duplicates Returns 0 results (all captured)
Query Standardization Inconsistent formats Unified template across all classes
Timestamp Precision Date only Full ISO 8601 with microseconds
Classes Covered 2 (A, B) 13 (A, B, G, L, M, O, R, C, E, P, S, H, F)

Statistics

  • Total Q-numbers Excluded: 389
  • FILTER Chunks: 8 (50 Q-numbers each)
  • Classes Queried: 13/15 (U and X excluded by design)
  • Base Classes Total: ~90 Wikidata root classes
  • Languages Supported: 40+
  • Query Files: 13
  • Metadata Files: 13
  • Total Files Created: 26

Phase 1: Execute Queries (Priority Order)

  1. Start with well-defined classes (fewer false positives):

    • Museum (M) - 14 base classes
    • Library (L) - 11 base classes
    • Gallery (G) - 4 base classes
  2. Continue with specialized classes:

    • Archive (A) - verify no new results
    • Botanical/Zoo (B) - 18 bases (may have many results)
    • Features (F) - monuments, memorials, sculptures
  3. Proceed to organizational classes:

    • Education Provider (E) - universities, schools
    • Research Center (R) - institutes, facilities
    • Official Institution (O) - government heritage
    • Corporation (C) - corporate collections
  4. Finish with niche classes:

    • Holy Sites (H) - religious collections
    • Collecting Society (S) - historical societies
    • Personal Collection (P) - private collections

Phase 2: Curate Results

For each class:

  1. Download JSON results from Wikidata Query Service
  2. Review Q-numbers and labels for relevance
  3. Add to hyponyms_curated.yaml:
    - label: Q[NUMBER]
      hypernym:
        - [descriptive term]
      type:
        - [GLAMORCUBEPSXHFN code]
      country:  # optional
        - [ISO country code]
    
  4. Document any duplicates or edge cases

Phase 3: Iterative Discovery

  1. Re-run queries after curation (new Q-numbers will be excluded)
  2. Discover transitive subclasses of newly added hyponyms
  3. Continue until queries return no new relevant results
  4. Mark class as "complete" in tracking document

Phase 4: Validation

  1. Cross-reference with existing GLAM registries
  2. Verify coverage of major institution types
  3. Check for geographic balance
  4. Document any gaps or missing subtypes

Technical Notes

Why 389 Q-numbers (not 390)?

The recursive extraction found 389 unique Q-numbers after deduplication. The previous session summary mentioned "390 Q-numbers" which was an estimate before deduplication.

Why wdt:P279+ (not wdt:P279*)?

  • wdt:P279*: Reflexive transitive closure (includes the base class itself)
  • wdt:P279+: Transitive closure (excludes the base class)

We use wdt:P279+ to avoid returning the base classes (Q33506, Q7075, etc.) in the results.

Why 8 FILTER chunks?

Wikidata Query Service has SPARQL expression complexity limits. Splitting 389 Q-numbers into 8 chunks of ~50 each keeps queries executable while maintaining complete exclusion coverage.

Why 40+ languages?

GLAMORCUBEPSXHFN is a global taxonomy covering institutions worldwide. Supporting 40+ languages ensures:

  • Non-English institution types are discoverable
  • Multilingual labels improve matching during curation
  • Geographic coverage extends beyond English-speaking countries

Files Modified/Created

Created (26 files):

  • data/wikidata/GLAMORCUBEPSXHFN/[A,B,G,L,M,O,R,C,E,P,S,H,F]/queries/*_query_complete_*.{yaml,sparql} (26 files)

Modified:

  • None (all existing files preserved)

Temporary:

  • /tmp/all_q_numbers.txt (working file for Q-number extraction)

Session Achievements

Fixed archive query exclusion list (311 → 316 → 389 Q-numbers)
Created comprehensive botanical/zoo query (18 base classes)
Generated 11 additional class queries (G, L, M, O, R, C, E, P, S, H, F)
Established reusable query template with best practices
Full ISO 8601 timestamps throughout
Language-agnostic approach (40+ languages)
Complete documentation in query YAML metadata
Created session summary document


References

  • Master Vocabulary: data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml
  • Query Directory: data/wikidata/GLAMORCUBEPSXHFN/*/queries/
  • Wikidata Query Service: https://query.wikidata.org/
  • GLAMORCUBEPSXHFN Taxonomy: See AGENTS.md (15-type taxonomy)
  • Previous Session: docs/sessions/SESSION_SUMMARY_20251113_ARCHIVE_QUERY_FIX.md (if exists)

Session End: 2025-11-13T13:11:00+00:00
Status: COMPLETE - Ready for query execution phase