GLAMORCUBEPSXHFN Query Execution Guide

Date: 2025-11-13
Status: Ready for execution


Quick Start

1. Choose a Query

Navigate to the query directory for your target class:

cd /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN

# Example: Museum queries
cat M/queries/museum_query_complete_20251113T131027.sparql

2. Execute Query

  1. Open https://query.wikidata.org/
  2. Paste SPARQL query content
  3. Click "Run" (or press Ctrl+Enter)
  4. Wait for results (may take 10-60 seconds for large result sets)

3. Download Results

Click "Download" → Choose format:

  • JSON (recommended for processing)
  • CSV (for spreadsheet analysis)
  • TSV (for data import)

4. Review Results

Check the downloaded file for:

  • Valid heritage institution subtypes
  • False positives (non-heritage classes)
  • Semantic correctness of labels

Query Inventory

Complete Queries (13 classes)

| Class | File | Base Classes | Expected Results |
|-------|------|--------------|------------------|
| A (Archive) | A/queries/archive_query_missing_complete_20251113T130052.sparql | 1 | 0 (already captured) |
| B (Botanical/Zoo) | B/queries/botanical_zoo_query_complete_20251113T130659.sparql | 18 | High (many subtypes) |
| G (Gallery) | G/queries/gallery_query_complete_20251113T130920.sparql | 4 | Low-Medium |
| L (Library) | L/queries/library_query_complete_20251113T131006.sparql | 11 | Medium-High |
| M (Museum) | M/queries/museum_query_complete_20251113T131027.sparql | 14 | Very High |
| O (Official Inst.) | O/queries/official_query_complete_20251113T131055.sparql | 4 | Low-Medium |
| R (Research Ctr) | R/queries/research_query_complete_20251113T131055.sparql | 3 | Medium |
| C (Corporation) | C/queries/corporation_query_complete_20251113T131055.sparql | 3 | Low |
| E (Education) | E/queries/education_query_complete_20251113T131055.sparql | 6 | High |
| P (Personal Coll.) | P/queries/personal_query_complete_20251113T131055.sparql | 2 | Low |
| S (Coll. Society) | S/queries/collecting_query_complete_20251113T131055.sparql | 3 | Low-Medium |
| H (Holy Sites) | H/queries/holy_query_complete_20251113T131055.sparql | 6 | Medium |
| F (Features) | F/queries/features_query_complete_20251113T131055.sparql | 4 | Medium-High |

Note: U (Unknown) and X (Mixed) classes do not have queries (special classification states).


Priority 1: High-Value, Low-Noise Classes

Start with well-defined institutional classes:

  1. Museum (M) - Expected to return many valid museum subtypes
  2. Library (L) - Well-structured taxonomy in Wikidata
  3. Gallery (G) - Focused domain, clear boundaries

Why first? These classes have:

  • Clear semantic boundaries
  • High-quality Wikidata curation
  • Low false positive rate
  • Immediate value for GLAM curation

Priority 2: Specialized Heritage Classes

Continue with niche heritage types:

  1. Archive (A) - Verify completeness (should return 0 results)
  2. Botanical/Zoo (B) - Large taxonomy, needs careful review
  3. Features (F) - Monuments, memorials, sculptures

Curation note: Features (F) may include non-heritage physical objects. Review each result carefully.

Priority 3: Organizational Classes

Proceed to organizational entities:

  1. Education Provider (E) - Universities, colleges, schools with collections
  2. Research Center (R) - Scientific institutes, documentation centers
  3. Official Institution (O) - Government heritage agencies

Curation note: Filter for institutions that actually maintain heritage collections (not all universities have museums/archives).

Priority 4: Niche/Low-Volume Classes

Finish with specialized collection types:

  1. Holy Sites (H) - Religious institutions with heritage collections
  2. Collecting Society (S) - Historical societies, numismatic clubs
  3. Personal Collection (P) - Private collections
  4. Corporation (C) - Corporate archives/museums

Curation note: These classes often overlap with others (e.g., corporate museums are also museums). Document multi-type classifications.


Query Execution Checklist

For each query execution:

  • Copy SPARQL from [class]/queries/*_complete_*.sparql
  • Execute at https://query.wikidata.org/
  • Download results as JSON
  • Save JSON to [class]/sparql/results_[YYYYMMDD].json
  • Review results for:
    • Valid heritage institution subtypes
    • False positives (non-heritage)
    • Semantic correctness
    • Geographic diversity
  • Document results in [class]/CURATION_LOG.md
  • Add validated Q-numbers to hyponyms_curated.yaml
  • Re-run query to discover next batch
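
The "save JSON" step of the checklist can be scripted so that file names stay consistent. The following is a minimal sketch that assumes only the `[class]/sparql/results_[YYYYMMDD].json` convention stated above; `results_path` and `save_results` are hypothetical helper names, not existing project code.

```python
import json
from datetime import date
from pathlib import Path


def results_path(class_code, run_date, root=Path(".")):
    """Build [class]/sparql/results_[YYYYMMDD].json under root."""
    return Path(root) / class_code / "sparql" / f"results_{run_date:%Y%m%d}.json"


def save_results(results, class_code, run_date, root=Path(".")):
    """Write a parsed SPARQL JSON result document to its conventional path."""
    target = results_path(class_code, run_date, root)
    target.parent.mkdir(parents=True, exist_ok=True)  # create [class]/sparql/ if missing
    target.write_text(json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8")
    return target
```

Calling `save_results(data, "M", date(2025, 11, 13))` from the data directory would write to `M/sparql/results_20251113.json`.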

Curation Workflow

Step 1: Review Query Results

Open the downloaded JSON file:

jq '.results.bindings[] | {q: .hyponym.value, label: .hyponymLabel.value}' M/sparql/results_20251113.json
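
For review inside a script rather than the shell, the same extraction can be sketched in Python. This assumes the standard SPARQL JSON results layout and the `hyponym` / `hyponymLabel` variable names used in the jq command above; `extract_hyponyms` and `load_hyponyms` are illustrative names only.

```python
import json


def extract_hyponyms(results):
    """Return (Q-number, label) pairs from a parsed SPARQL JSON result document."""
    pairs = []
    for binding in results["results"]["bindings"]:
        uri = binding["hyponym"]["value"]  # e.g. http://www.wikidata.org/entity/Q207694
        qid = uri.rsplit("/", 1)[-1]       # keep only the trailing Q-number
        # The label variable may be unbound for some rows, so default to "".
        label = binding.get("hyponymLabel", {}).get("value", "")
        pairs.append((qid, label))
    return pairs


def load_hyponyms(path):
    """Read a downloaded results file and extract its (Q-number, label) pairs."""
    with open(path, encoding="utf-8") as handle:
        return extract_hyponyms(json.load(handle))
```

For example, `load_hyponyms("M/sparql/results_20251113.json")` returns one pair per result row, ready for the validation in Step 2.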

Step 2: Validate Each Q-number

For each result, check:

  1. Is it a heritage institution type?
    • Museums, libraries, archives, galleries, etc.
    • Collections, societies, cultural organizations
    • NOT: administrative units, geographic features (unless F-class)
  2. What GLAMORCUBEPSXHFN class(es)?
    • Single type: M (museum), L (library), A (archive), etc.
    • Multiple types: Use X (mixed) or list all applicable codes
  3. Geographic/cultural context?
    • Country-specific types (note in country: field)
    • Regional variations (note in subregion: field)
  4. Historical context?
    • Defunct institution types (note in time: field)
    • Historical periods (e.g., "Imperial Russia", "Medieval")

Step 3: Add to Curated Vocabulary

Edit hyponyms_curated.yaml:

hyponym:
  - label: Q[NUMBER]
    hypernym:
      - [descriptive term from Wikidata label]
    type:
      - [GLAMORCUBEPSXHFN code: A, B, C, E, F, G, H, L, M, O, P, R, S, or X]
    country:  # optional
      - [ISO 3166-1 alpha-2 country code]
    subregion:  # optional
      - [region name]
    time:  # optional
      - [temporal context, e.g., "1900-1950", "< 1948"]
    rico:  # optional (for archival record types)
      - label: recordSetTypes
    duplicate:  # optional (if merged with another Q-number)
      - Q[DUPLICATE_NUMBER]

Example:

  - label: Q123456
    hypernym:
      - maritime museum
    type:
      - M
    country:
      - NL
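
Before committing an entry, a quick shape check can catch common slips such as a full country name where an ISO 3166-1 alpha-2 code is expected. The sketch below works on an entry already loaded as a Python dict. It assumes the fields not marked optional in the template (`label`, `hypernym`, `type`) are required, and it only checks the two-letter shape of country codes, not whether the code is actually assigned. `validate_entry` is a hypothetical helper.

```python
import re

# Type codes listed in the template above.
VALID_TYPE_CODES = set("ABCEFGHLMOPRSX")
QID_PATTERN = re.compile(r"^Q[1-9][0-9]*$")
ALPHA2_PATTERN = re.compile(r"^[A-Z]{2}$")  # shape check only


def validate_entry(entry):
    """Return a list of human-readable problems; an empty list means the entry looks fine."""
    problems = []
    if not QID_PATTERN.match(str(entry.get("label", ""))):
        problems.append("label must be a Q-number such as Q123456")
    if not entry.get("hypernym"):
        problems.append("hypernym must list at least one term")
    types = entry.get("type") or []
    if not types:
        problems.append("type must list at least one class code")
    for code in types:
        if str(code) not in VALID_TYPE_CODES:
            problems.append(f"unknown type code: {code}")
    for country in entry.get("country") or []:
        if not ALPHA2_PATTERN.match(str(country)):
            problems.append(f"country is not a two-letter code: {country}")
    return problems
```

An entry with `country: [Netherlands]` would be reported, while `country: [NL]` passes.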

Step 4: Re-run Query (Iterative Discovery)

After adding Q-numbers to hyponyms_curated.yaml:

  1. Queries automatically exclude newly curated Q-numbers on the next execution
  2. Run the query again to discover transitive subclasses
  3. Continue until no new relevant results are found
  4. Mark the class as "complete" in the tracking doc
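
The stopping condition of this loop can also be checked on the client side, which is useful as a sanity check on the exclusion. A minimal sketch, assuming the curated Q-numbers have already been read from hyponyms_curated.yaml into a collection of strings; `new_candidates` is a hypothetical name.

```python
def new_candidates(result_qids, curated_qids):
    """Return Q-numbers from a query run that are not curated yet, in first-seen order."""
    curated = set(curated_qids)
    seen = set()
    fresh = []
    for qid in result_qids:
        if qid not in curated and qid not in seen:
            seen.add(qid)  # drop duplicates produced by multiple result rows
            fresh.append(qid)
    return fresh
```

When `new_candidates(...)` returns an empty list for a re-run, the iteration for that class has converged.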

Common False Positives

Museums (M)

  • False positive: Museum websites (Q386724) - Digital platforms, not institution types
  • False positive: Museum collections (Q2668072) - Collection types, not institutions
  • False positive: Museum buildings (Q41176) - Architecture, not organizations
  • Valid: Museum subtypes (e.g., Q207694 "art museum")

Libraries (L)

  • False positive: Library catalogs (Q5994) - Systems, not institutions
  • False positive: Library software (Q7375) - Technology, not organizations
  • Valid: Library types (e.g., Q28564 "public library")

Education (E)

  • False positive: All universities - Only include those that maintain heritage collections
  • False positive: Primary schools - Rarely have heritage significance
  • Valid: Universities with archives/museums, if documented

Features (F)

  • False positive: Natural features (mountains, rivers) - Not heritage custodians
  • False positive: Living people - Not physical features
  • Valid: Monuments, memorials, sculptures, cemeteries

Query Performance Tips

Timeout Issues

If query times out (>60 seconds):

  1. Split query by base class: Run separate queries for each UNION clause
  2. Add temporal filter: Limit to items created after a certain year
  3. Reduce language list: Focus on 10-15 major languages
  4. Use LIMIT: Add LIMIT 1000 for initial exploration

Large Result Sets

If query returns >10,000 results:

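  1. Prioritize by usage: Bind a statement count (e.g., ?hyponym wikibase:statements ?usageCount) and add ORDER BY DESC(?usageCount)
  2. Filter by sitelinks: Bind ?hyponym wikibase:sitelinks ?sitelinks, then add FILTER(?sitelinks > 5) to focus on well-documented items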
  3. Geographic focus: Add country/region filters for phased curation

Memory Issues

If browser/WDQS crashes:

  1. Use LIMIT: Start with LIMIT 100, increase gradually
  2. Download in batches: Run the query multiple times with LIMIT and an increasing OFFSET, keeping a fixed ORDER BY so that batches do not overlap
  3. Use API: Query via https://query.wikidata.org/sparql (programmatic)
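
Batching and programmatic access can be combined in a short standard-library script. This is a sketch under stated assumptions: the `USER_AGENT` value is a placeholder to replace with real contact details, `paged`, `build_request` and `run_query` are hypothetical names, and the query passed to `paged` is expected to already end with an ORDER BY clause and to contain no LIMIT of its own.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
# Placeholder: identify your own tool and contact address here.
USER_AGENT = "glam-hyponym-curation/0.1 (contact: you@example.org)"


def paged(query, limit, offset):
    """Append LIMIT/OFFSET to a query that already has a stable ORDER BY."""
    return f"{query.rstrip()}\nLIMIT {int(limit)}\nOFFSET {int(offset)}"


def build_request(query):
    """Create a GET request asking WDQS for SPARQL JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{WDQS_ENDPOINT}?{params}",
        headers={"Accept": "application/sparql-results+json", "User-Agent": USER_AGENT},
    )


def run_query(query, timeout=90):
    """Send the query and return the parsed JSON document (performs a network call)."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as response:
        return json.load(response)
```

A batch loop would call `run_query(paged(query, 1000, offset))` with `offset` increasing by 1000 until a batch returns no bindings.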

Automation Scripts (Future)

Batch Query Execution

# Planned: scripts/execute_wikidata_queries.py
# - Read all *.sparql files
# - Execute via WDQS API
# - Save results to class-specific directories
# - Generate curation dashboard
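
The first step of the planned batch script, reading the query files, could look roughly like the sketch below. It is not the planned script itself: it only assumes the `[class]/queries/*_complete_*.sparql` layout from the checklist, and `find_queries` is a hypothetical name.

```python
from pathlib import Path


def find_queries(root):
    """Map each class directory name (e.g. 'M') to its complete query files."""
    root = Path(root)
    found = {}
    for path in sorted(root.glob("*/queries/*_complete_*.sparql")):
        class_code = path.relative_to(root).parts[0]  # first directory below root
        found.setdefault(class_code, []).append(path)
    return found
```

Run from the data directory, `find_queries(".")` returns one key per class that has at least one matching query file.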

Result Analysis

# Planned: scripts/analyze_query_results.py
# - Parse JSON results
# - Identify potential false positives
# - Suggest hypernym relationships
# - Generate curation candidates

Iterative Curation

# Planned: scripts/iterative_hyponym_discovery.py
# - Execute query
# - Present results for human review
# - Add validated Q-numbers to hyponyms_curated.yaml
# - Re-run query
# - Repeat until no new results

Troubleshooting

"Query timeout" error

Cause: Query takes >60 seconds to execute.

Solution: Simplify query (see "Query Performance Tips" above).

"Too many results" warning

Cause: Result set >10,000 rows.

Solution: Add LIMIT 1000 or use batch execution with OFFSET.

"Malformed query" error

Cause: SPARQL syntax error (rare, all queries pre-validated).

Solution: Check that every FILTER clause has balanced parentheses.

Query returns base classes

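Cause: Using wdt:P279* instead of wdt:P279+. The * path matches zero or more subclass steps, so the base class matches itself; the + path requires at least one step.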

Solution: Already corrected in all queries (use wdt:P279+).
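
The difference between the two property paths can be reproduced on a toy graph without touching Wikidata. The sketch below simulates zero-or-more (`*`) versus one-or-more (`+`) traversal of subclass edges. The graph, its labels, and the `subclasses` function are invented for illustration, and the graph is assumed to be acyclic.

```python
def subclasses(subclass_of, base, include_base):
    """Collect transitive subclasses of base.

    subclass_of maps each class to the set of its direct superclasses.
    include_base=True mimics wdt:P279* (zero or more steps);
    include_base=False mimics wdt:P279+ (one or more steps) on an acyclic graph.
    """
    found = {base} if include_base else set()
    visited = {base}
    frontier = [base]
    while frontier:
        current = frontier.pop()
        for child, parents in subclass_of.items():
            if current in parents and child not in visited:
                visited.add(child)
                found.add(child)
                frontier.append(child)
    return found


toy_graph = {
    "art museum": {"museum"},
    "maritime museum": {"museum"},
    "naval museum": {"maritime museum"},
}
star_result = subclasses(toy_graph, "museum", include_base=True)   # contains "museum"
plus_result = subclasses(toy_graph, "museum", include_base=False)  # does not
```

The only difference between the two results is the base class itself, which is why the `*` form returns base classes as false hits.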


Progress Tracking

Execution Log Template

Create [class]/EXECUTION_LOG.md for each class:

# [Class] Query Execution Log

## Execution 1
- **Date**: 2025-11-13
- **Query**: [filename]
- **Results**: [count] hyponyms
- **Curated**: [count] added to hyponyms_curated.yaml
- **Rejected**: [count] false positives
- **Notes**: [observations]

## Execution 2
- **Date**: 2025-11-XX
- **Query**: [filename] (re-run)
- **Results**: [count] new hyponyms (after exclusion)
- **Status**: [Complete / Continue / Review]
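
Entries in this shape can be generated rather than typed by hand. A minimal sketch that mirrors the "Execution 1" fields of the template above; `render_execution` is a hypothetical helper and writes nothing to disk.

```python
from datetime import date


def render_execution(number, run_date, query_file, results, curated, rejected, notes=""):
    """Render one execution entry in the log template's Markdown layout."""
    lines = [
        f"## Execution {number}",
        f"- **Date**: {run_date.isoformat()}",
        f"- **Query**: {query_file}",
        f"- **Results**: {results} hyponyms",
        f"- **Curated**: {curated} added to hyponyms_curated.yaml",
        f"- **Rejected**: {rejected} false positives",
        f"- **Notes**: {notes}",
    ]
    return "\n".join(lines) + "\n"
```

The returned string can be appended to `[class]/EXECUTION_LOG.md` after each run.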

Completion Criteria

Mark class as "complete" when:

  • Query returns <10 new relevant results
  • All major institution subtypes are captured
  • Geographic coverage is adequate
  • Iterative discovery yields diminishing returns

Version: 1.0
Last Updated: 2025-11-13
Status: Ready for execution