# GLAMORCUBEPSXHFN Query Execution Guide
**Date**: 2025-11-13
**Status**: Ready for execution
---
## Quick Start
### 1. Choose a Query
Navigate to the query directory for your target class:
```bash
cd /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN
# Example: Museum queries
cat M/queries/museum_query_complete_20251113T131027.sparql
```
### 2. Execute Query
1. Open https://query.wikidata.org/
2. Paste SPARQL query content
3. Click **"Run"** (or press Ctrl+Enter)
4. Wait for results (may take 10-60 seconds for large result sets)
### 3. Download Results
Click **"Download"** → Choose format:
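- **JSON** (recommended for processing; pick the verbose variant if the menu offers one, since the `jq` filter under "Curation Workflow" reads the `results.bindings` layout)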
- **CSV** (for spreadsheet analysis)
- **TSV** (for data import)
### 4. Review Results
Check the downloaded file for:
- Valid heritage institution subtypes
- False positives (non-heritage classes)
- Semantic correctness of labels
---
## Query Inventory
### ✅ Complete Queries (13 classes)
| Class | File | Base Classes | Expected Results |
|-------|------|--------------|------------------|
| **A** (Archive) | `A/queries/archive_query_missing_complete_20251113T130052.sparql` | 1 | 0 (already captured) |
| **B** (Botanical/Zoo) | `B/queries/botanical_zoo_query_complete_20251113T130659.sparql` | 18 | High (many subtypes) |
| **G** (Gallery) | `G/queries/gallery_query_complete_20251113T130920.sparql` | 4 | Low-Medium |
| **L** (Library) | `L/queries/library_query_complete_20251113T131006.sparql` | 11 | Medium-High |
| **M** (Museum) | `M/queries/museum_query_complete_20251113T131027.sparql` | 14 | Very High |
| **O** (Official Inst.) | `O/queries/official_query_complete_20251113T131055.sparql` | 4 | Low-Medium |
| **R** (Research Ctr) | `R/queries/research_query_complete_20251113T131055.sparql` | 3 | Medium |
| **C** (Corporation) | `C/queries/corporation_query_complete_20251113T131055.sparql` | 3 | Low |
| **E** (Education) | `E/queries/education_query_complete_20251113T131055.sparql` | 6 | High |
| **P** (Personal Coll.) | `P/queries/personal_query_complete_20251113T131055.sparql` | 2 | Low |
| **S** (Coll. Society) | `S/queries/collecting_query_complete_20251113T131055.sparql` | 3 | Low-Medium |
| **H** (Holy Sites) | `H/queries/holy_query_complete_20251113T131055.sparql` | 6 | Medium |
| **F** (Features) | `F/queries/features_query_complete_20251113T131055.sparql` | 4 | Medium-High |
**Note**: U (Unknown) and X (Mixed) classes do not have queries (special classification states).
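The inventory follows one naming pattern, `[class]/queries/*_complete_*.sparql`, so the files can be grouped per class programmatically. A minimal Python sketch, assuming that layout holds for every class; the function name `group_queries_by_class` is hypothetical and not part of the repository:

```python
from pathlib import Path
from typing import Iterable


def group_queries_by_class(paths: Iterable[Path]) -> dict[str, list[Path]]:
    """Group query files by class directory, e.g. "M" for M/queries/<file>."""
    grouped: dict[str, list[Path]] = {}
    for path in sorted(paths):
        # Keep only files that match the *_complete_*.sparql naming pattern.
        if path.suffix != ".sparql" or "_complete_" not in path.name:
            continue
        class_code = path.parent.parent.name  # [class]/queries/<file>
        grouped.setdefault(class_code, []).append(path)
    return grouped


# Usage, run from the GLAMORCUBEPSXHFN directory:
# grouped = group_queries_by_class(Path(".").glob("*/queries/*.sparql"))
```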
---
## Recommended Execution Order
### Priority 1: High-Value, Low-Noise Classes
Start with well-defined institutional classes:
1. **Museum (M)** - Expected to return many valid museum subtypes
2. **Library (L)** - Well-structured taxonomy in Wikidata
3. **Gallery (G)** - Focused domain, clear boundaries
**Why first?** These classes have:
- Clear semantic boundaries
- High-quality Wikidata curation
- Low false positive rate
- Immediate value for GLAM curation
### Priority 2: Specialized Heritage Classes
Continue with niche heritage types:
4. **Archive (A)** - Verify completeness (should return 0 results)
5. **Botanical/Zoo (B)** - Large taxonomy, needs careful review
6. **Features (F)** - Monuments, memorials, sculptures
**Curation note**: Features (F) may include non-heritage physical objects. Review each result carefully.
### Priority 3: Organizational Classes
Proceed to organizational entities:
7. **Education Provider (E)** - Universities, colleges, schools with collections
8. **Research Center (R)** - Scientific institutes, documentation centers
9. **Official Institution (O)** - Government heritage agencies
**Curation note**: Filter for institutions that actually maintain heritage collections (not all universities have museums/archives).
### Priority 4: Niche/Low-Volume Classes
Finish with specialized collection types:
10. **Holy Sites (H)** - Religious institutions with heritage collections
11. **Collecting Society (S)** - Historical societies, numismatic clubs
12. **Personal Collection (P)** - Private collections
13. **Corporation (C)** - Corporate archives/museums
**Curation note**: These classes often overlap with others (e.g., corporate museums are also museums). Document multi-type classifications.
---
## Query Execution Checklist
For each query execution:
- [ ] Copy SPARQL from `[class]/queries/*_complete_*.sparql`
- [ ] Execute at https://query.wikidata.org/
- [ ] Download results as JSON
- [ ] Save JSON to `[class]/sparql/results_[YYYYMMDD].json`
- [ ] Review results for:
  - [ ] Valid heritage institution subtypes
  - [ ] False positives (non-heritage)
  - [ ] Semantic correctness
  - [ ] Geographic diversity
- [ ] Document results in `[class]/CURATION_LOG.md`
- [ ] Add validated Q-numbers to `hyponyms_curated.yaml`
- [ ] Re-run query to discover next batch
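The results filename in the checklist can be derived from the class code and the run date. A small sketch, assuming the `[class]/sparql/results_[YYYYMMDD].json` convention given above; `results_path` is a hypothetical helper name:

```python
from datetime import date
from pathlib import Path


def results_path(class_code: str, run_date: date) -> Path:
    """Build [class]/sparql/results_[YYYYMMDD].json for one query run."""
    return Path(class_code) / "sparql" / f"results_{run_date:%Y%m%d}.json"


# Usage:
# target = results_path("M", date.today())
# target.parent.mkdir(parents=True, exist_ok=True)
```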
---
## Curation Workflow
### Step 1: Review Query Results
Open the downloaded JSON file:
```bash
# Print each result's entity URI and label
jq '.results.bindings[] | {q: .hyponym.value, label: .hyponymLabel.value}' M/sparql/results_20251113.json
```
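The same extraction can be done in Python when the results feed further processing. This is a sketch that assumes the file uses the `results.bindings` layout and the `hyponym` / `hyponymLabel` variable names that the `jq` filter above relies on; the function names are hypothetical:

```python
import json
from pathlib import Path


def extract_hyponyms(results: dict) -> list[tuple[str, str]]:
    """Return (Q-number, label) pairs from a parsed results document."""
    pairs = []
    for binding in results["results"]["bindings"]:
        uri = binding["hyponym"]["value"]  # e.g. http://www.wikidata.org/entity/Q207694
        qid = uri.rsplit("/", 1)[-1]       # keep only the trailing Q-number
        label = binding.get("hyponymLabel", {}).get("value", "")
        pairs.append((qid, label))
    return pairs


def load_hyponyms(path: Path) -> list[tuple[str, str]]:
    """Read a downloaded results file and extract its (Q-number, label) pairs."""
    with path.open(encoding="utf-8") as handle:
        return extract_hyponyms(json.load(handle))
```

A row without a label binding yields an empty string rather than raising, so unlabeled items remain visible for review.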
### Step 2: Validate Each Q-number
For each result, check:
1. **Is it a heritage institution type?**
   - Museums, libraries, archives, galleries, etc.
   - Collections, societies, cultural organizations
   - NOT: administrative units, geographic features (unless F-class)
2. **What GLAMORCUBEPSXHFN class(es)?**
   - Single type: M (museum), L (library), A (archive), etc.
   - Multiple types: Use X (mixed) or list all applicable codes
3. **Geographic/cultural context?**
   - Country-specific types (note in `country:` field)
   - Regional variations (note in `subregion:` field)
4. **Historical context?**
   - Defunct institution types (note in `time:` field)
   - Historical periods (e.g., "Imperial Russia", "Medieval")
### Step 3: Add to Curated Vocabulary
Edit `hyponyms_curated.yaml`:
```yaml
hyponym:
  - label: Q[NUMBER]
    hypernym:
      - [descriptive term from Wikidata label]
    type:
      - [GLAMORCUBEPSXHFN code: A, B, C, E, F, G, H, L, M, O, P, R, S, or X]
    country:  # optional
      - [ISO 3166-1 alpha-2 country code]
    subregion:  # optional
      - [region name]
    time:  # optional
      - [temporal context, e.g., "1900-1950", "< 1948"]
    rico:  # optional (for archival record types)
      - label: recordSetTypes
    duplicate:  # optional (if merged with another Q-number)
      - Q[DUPLICATE_NUMBER]
```
**Example**:
```yaml
- label: Q123456
  hypernym:
    - maritime museum
  type:
    - M
  country:
    - NL  # ISO 3166-1 alpha-2 code for the Netherlands
```
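Before an entry is committed, its fields can be checked against the template's descriptions. The sketch below is an assumption-laden illustration, not the project's validator: it treats `label`, `hypernym`, and `type` as required because the template does not mark them optional, and it checks only the shape of country codes (two uppercase letters), not membership in ISO 3166-1. `validate_entry` is a hypothetical name.

```python
import re

VALID_TYPE_CODES = set("ABCEFGHLMOPRSX")     # codes listed in the template
QID_PATTERN = re.compile(r"^Q[1-9]\d*$")     # e.g. Q123456
COUNTRY_PATTERN = re.compile(r"^[A-Z]{2}$")  # shape check only


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one curated entry (empty if none)."""
    problems = []
    if not QID_PATTERN.match(str(entry.get("label", ""))):
        problems.append("label must be a Q-number such as Q123456")
    if not entry.get("hypernym"):
        problems.append("hypernym must list at least one term")
    types = entry.get("type") or []
    if not types:
        problems.append("type must list at least one code")
    for code in types:
        if code not in VALID_TYPE_CODES:
            problems.append(f"unknown type code: {code}")
    for country in entry.get("country") or []:
        if not COUNTRY_PATTERN.match(country):
            problems.append(f"country is not an alpha-2 code: {country}")
    return problems
```

Returning a list of messages instead of raising on the first error lets a curator fix all problems in an entry in one pass.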
### Step 4: Re-run Query (Iterative Discovery)
After adding Q-numbers to `hyponyms_curated.yaml`:
1. On their next execution, the queries automatically exclude the newly curated Q-numbers
2. Run query again to discover transitive subclasses
3. Continue until no new relevant results found
4. Mark class as "complete" in tracking doc
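The loop described in these steps can be sketched as follows. Everything here is illustrative: `run_query` and `review` are hypothetical callables standing in for a query execution and a human review step, and rejected candidates are remembered so they are not presented again (an assumption, since the steps above do not say how rejections are tracked).

```python
from typing import Callable, Iterable


def discover_new(results: Iterable[str], known: set[str]) -> list[str]:
    """Q-numbers in `results` that are not yet known, in order, without repeats."""
    seen = set(known)
    fresh = []
    for qid in results:
        if qid not in seen:
            seen.add(qid)
            fresh.append(qid)
    return fresh


def iterate_until_stable(
    run_query: Callable[[set[str]], Iterable[str]],
    review: Callable[[list[str]], Iterable[str]],
    curated: set[str],
    max_rounds: int = 10,
) -> int:
    """Repeat query -> review -> curate until a round finds nothing new.

    Mutates `curated` in place and returns the number of rounds executed.
    """
    rejected: set[str] = set()
    for round_no in range(1, max_rounds + 1):
        candidates = discover_new(run_query(curated), curated | rejected)
        if not candidates:
            return round_no
        accepted = set(review(candidates))
        curated.update(accepted)
        rejected.update(set(candidates) - accepted)
    return max_rounds
```

The `max_rounds` cap is a safety net so that a query which keeps producing new candidates cannot loop indefinitely.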
---
## Common False Positives
### Museums (M)
- ❌ **Museum websites** (Q386724) - Digital platforms, not institution types
- ❌ **Museum collections** (Q2668072) - Collection types, not institutions
- ❌ **Museum buildings** (Q41176) - Architecture, not organizations
- ✅ **Museum subtypes** (e.g., Q207694 "art museum") - Valid!
### Libraries (L)
- ❌ **Library catalogs** (Q5994) - Systems, not institutions
- ❌ **Library software** (Q7375) - Technology, not organizations
- ✅ **Library types** (e.g., Q28564 "public library") - Valid!
### Education (E)
- ❌ **All universities** - Only include if they maintain heritage collections
- ❌ **Primary schools** - Rarely have heritage significance
- ✅ **Universities with archives/museums** - Valid if documented
### Features (F)
- ❌ **Natural features** (mountains, rivers) - Not heritage custodians
- ❌ **Living people** - Not physical features
- ✅ **Monuments, memorials, sculptures, cemeteries** - Valid!
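Known false positives can be set aside automatically before manual review. The following is a sketch only: the Q-numbers and descriptions are copied from the lists in this section and should be verified on Wikidata before being relied on, and `split_candidates` is a hypothetical helper name.

```python
# Copied from the false-positive lists in this section; verify before use.
KNOWN_FALSE_POSITIVES = {
    "Q386724": "museum websites",
    "Q2668072": "museum collections",
    "Q41176": "museum buildings",
    "Q5994": "library catalogs",
    "Q7375": "library software",
}


def split_candidates(qids, denylist=KNOWN_FALSE_POSITIVES):
    """Split Q-numbers into (keep, flagged); flagged items carry the reason."""
    keep, flagged = [], []
    for qid in qids:
        if qid in denylist:
            flagged.append((qid, denylist[qid]))
        else:
            keep.append(qid)
    return keep, flagged
```

Flagged items are returned rather than dropped so that each exclusion can still be recorded in the curation log.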
---
## Query Performance Tips
### Timeout Issues
If query times out (>60 seconds):
1. **Split query by base class**: Run separate queries for each `UNION` clause
2. **Add temporal filter**: Limit to items created after a certain year
3. **Reduce language list**: Focus on 10-15 major languages
4. **Use LIMIT**: Add `LIMIT 1000` for initial exploration
### Large Result Sets
If query returns >10,000 results:
1. **Prioritize by usage**: Add `ORDER BY DESC(?usageCount)` (count statements)
2. **Filter by sitelinks**: `FILTER(?sitelinks > 5)` to focus on well-documented items (bind `?sitelinks` first, e.g. `?hyponym wikibase:sitelinks ?sitelinks`)
3. **Geographic focus**: Add country/region filters for phased curation
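If the query is left unchanged, the sitelink threshold can also be applied after download. This sketch assumes the query has been extended to select a `?sitelinks` variable (the shipped queries may not include one) and that rows use the `results.bindings` layout; `filter_by_sitelinks` is a hypothetical name.

```python
def filter_by_sitelinks(bindings: list[dict], minimum: int = 5) -> list[dict]:
    """Keep rows with more than `minimum` sitelinks, most-linked first."""

    def count(row: dict) -> int:
        # Values arrive as strings in SPARQL JSON results; a missing binding counts as 0.
        return int(row.get("sitelinks", {}).get("value", 0))

    kept = [row for row in bindings if count(row) > minimum]
    return sorted(kept, key=count, reverse=True)
```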
### Memory Issues
If browser/WDQS crashes:
1. **Use LIMIT**: Start with `LIMIT 100`, increase gradually
2. **Download in batches**: Run query multiple times with `OFFSET`, keeping a fixed `ORDER BY` so that pages do not overlap or skip rows
3. **Use API**: Query via https://query.wikidata.org/sparql (programmatic)
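A programmatic run with batching might look like the sketch below. It uses only the Python standard library and makes several assumptions: the query text does not already end in `LIMIT`/`OFFSET`, it contains an `ORDER BY` so pages are stable, and the caller supplies a descriptive `User-Agent` string. The function names are hypothetical and this is not the planned `execute_wikidata_queries.py` script.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(query: str, user_agent: str) -> urllib.request.Request:
    """Prepare a GET request that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    headers = {"Accept": "application/sparql-results+json", "User-Agent": user_agent}
    return urllib.request.Request(f"{WDQS_ENDPOINT}?{params}", headers=headers)


def paged(query: str, limit: int, offset: int) -> str:
    """Append LIMIT/OFFSET to a query that has neither."""
    return f"{query.rstrip()}\nLIMIT {limit}\nOFFSET {offset}"


def fetch_bindings(query: str, user_agent: str) -> list[dict]:
    """Run one query over the network and return its result rows."""
    with urllib.request.urlopen(build_request(query, user_agent), timeout=90) as resp:
        return json.load(resp)["results"]["bindings"]


def fetch_all(query: str, fetch_page: Callable[[str], list], page_size: int = 1000) -> list:
    """Collect all rows by requesting pages until a short page arrives."""
    rows, offset = [], 0
    while True:
        page = fetch_page(paged(query, page_size, offset))
        rows.extend(page)
        if len(page) < page_size:
            return rows
        offset += page_size


# Usage (performs network requests; the contact address is a placeholder):
# ua = "glam-curation-sketch/0.1 (contact: you@example.org)"
# rows = fetch_all(query_text, lambda q: fetch_bindings(q, ua), page_size=1000)
```

Passing the page fetcher as a parameter keeps the paging logic testable without network access.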
---
## Automation Scripts (Future)
### Batch Query Execution
```python
# Planned: scripts/execute_wikidata_queries.py
# - Read all *.sparql files
# - Execute via WDQS API
# - Save results to class-specific directories
# - Generate curation dashboard
```
### Result Analysis
```python
# Planned: scripts/analyze_query_results.py
# - Parse JSON results
# - Identify potential false positives
# - Suggest hypernym relationships
# - Generate curation candidates
```
### Iterative Curation
```python
# Planned: scripts/iterative_hyponym_discovery.py
# - Execute query
# - Present results for human review
# - Add validated Q-numbers to hyponyms_curated.yaml
# - Re-run query
# - Repeat until no new results
```
---
## Troubleshooting
### "Query timeout" error
**Cause**: Query takes >60 seconds to execute.
**Solution**: Simplify query (see "Query Performance Tips" above).
### "Too many results" warning
**Cause**: Result set >10,000 rows.
**Solution**: Add `LIMIT 1000` or use batch execution with `OFFSET`.
### "Malformed query" error
**Cause**: SPARQL syntax error (rare, all queries pre-validated).
**Solution**: Check FILTER clauses are correctly closed with parentheses.
### Query returns base classes
**Cause**: Using `wdt:P279*` (zero or more steps, so the base class matches itself) instead of `wdt:P279+` (one or more steps).
**Solution**: Already corrected in all queries (use `wdt:P279+`).
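The difference between the two path operators can be illustrated on a toy hierarchy. This is a plain-Python analogy, not how the query service evaluates paths, and the class labels in the usage comment are illustrative only:

```python
def reachable_subclasses(base: str, direct_subclasses: dict, include_base: bool) -> set:
    """Collect subclasses of `base` by following subclass links transitively.

    include_base=True mimics `wdt:P279*` (zero or more steps);
    include_base=False mimics `wdt:P279+` (one or more steps).
    """
    found = {base} if include_base else set()
    stack = list(direct_subclasses.get(base, []))
    while stack:
        node = stack.pop()
        if node in found:
            continue  # already visited; also guards against cycles
        found.add(node)
        stack.extend(direct_subclasses.get(node, []))
    return found


# Usage with an illustrative hierarchy:
# tree = {"museum": ["art museum", "maritime museum"], "art museum": ["sculpture museum"]}
# reachable_subclasses("museum", tree, include_base=True)   # includes "museum"
# reachable_subclasses("museum", tree, include_base=False)  # subtypes only
```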
---
## Progress Tracking
### Execution Log Template
Create `[class]/EXECUTION_LOG.md` for each class:
```markdown
# [Class] Query Execution Log
## Execution 1
- **Date**: 2025-11-13
- **Query**: [filename]
- **Results**: [count] hyponyms
- **Curated**: [count] added to hyponyms_curated.yaml
- **Rejected**: [count] false positives
- **Notes**: [observations]
## Execution 2
- **Date**: 2025-11-XX
- **Query**: [filename] (re-run)
- **Results**: [count] new hyponyms (after exclusion)
- **Status**: [Complete / Continue / Review]
```
### Completion Criteria
Mark class as "complete" when:
- [ ] Query returns <10 new relevant results
- [ ] All major institution subtypes are captured
- [ ] Geographic coverage is adequate
- [ ] Iterative discovery yields diminishing returns
---
## References
- **Query Files**: `/Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/[A-Z]/queries/`
- **Curated Vocabulary**: `/Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml`
- **Wikidata Query Service**: https://query.wikidata.org/
- **SPARQL Tutorial**: https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial
- **Session Summary**: `docs/sessions/SESSION_SUMMARY_20251113_SPARQL_GENERATION.md`
---
**Version**: 1.0
**Last Updated**: 2025-11-13
**Status**: Ready for execution