# GLAMORCUBEPSXHFN Query Execution Guide

**Date**: 2025-11-13
**Status**: Ready for execution

---

## Quick Start

### 1. Choose a Query

Navigate to the query directory for your target class:

```bash
cd /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN

# Example: Museum queries
cat M/queries/museum_query_complete_20251113T131027.sparql
```

### 2. Execute Query

1. Open https://query.wikidata.org/
2. Paste SPARQL query content
3. Click **"Run"** (or press Ctrl+Enter)
4. Wait for results (may take 10-60 seconds for large result sets)

### 3. Download Results

Click **"Download"** → Choose format:
- **JSON** (recommended for processing)
- **CSV** (for spreadsheet analysis)
- **TSV** (for data import)

### 4. Review Results

Check the downloaded file for:
- Valid heritage institution subtypes
- False positives (non-heritage classes)
- Semantic correctness of labels

---

## Query Inventory

### ✅ Complete Queries (13 classes)

| Class | File | Base Classes | Expected Results |
|-------|------|--------------|------------------|
| **A** (Archive) | `A/queries/archive_query_missing_complete_20251113T130052.sparql` | 1 | 0 (already captured) |
| **B** (Botanical/Zoo) | `B/queries/botanical_zoo_query_complete_20251113T130659.sparql` | 18 | High (many subtypes) |
| **G** (Gallery) | `G/queries/gallery_query_complete_20251113T130920.sparql` | 4 | Low-Medium |
| **L** (Library) | `L/queries/library_query_complete_20251113T131006.sparql` | 11 | Medium-High |
| **M** (Museum) | `M/queries/museum_query_complete_20251113T131027.sparql` | 14 | Very High |
| **O** (Official Inst.) | `O/queries/official_query_complete_20251113T131055.sparql` | 4 | Low-Medium |
| **R** (Research Ctr) | `R/queries/research_query_complete_20251113T131055.sparql` | 3 | Medium |
| **C** (Corporation) | `C/queries/corporation_query_complete_20251113T131055.sparql` | 3 | Low |
| **E** (Education) | `E/queries/education_query_complete_20251113T131055.sparql` | 6 | High |
| **P** (Personal Coll.) | `P/queries/personal_query_complete_20251113T131055.sparql` | 2 | Low |
| **S** (Coll. Society) | `S/queries/collecting_query_complete_20251113T131055.sparql` | 3 | Low-Medium |
| **H** (Holy Sites) | `H/queries/holy_query_complete_20251113T131055.sparql` | 6 | Medium |
| **F** (Features) | `F/queries/features_query_complete_20251113T131055.sparql` | 4 | Medium-High |

**Note**: U (Unknown) and X (Mixed) classes do not have queries (special classification states).

---

## Recommended Execution Order

### Priority 1: High-Value, Low-Noise Classes

Start with well-defined institutional classes:

1. **Museum (M)** - Expected to return many valid museum subtypes
2. **Library (L)** - Well-structured taxonomy in Wikidata
3. **Gallery (G)** - Focused domain, clear boundaries

**Why first?** These classes have:
- Clear semantic boundaries
- High-quality Wikidata curation
- Low false positive rate
- Immediate value for GLAM curation

### Priority 2: Specialized Heritage Classes

Continue with niche heritage types:

4. **Archive (A)** - Verify completeness (should return 0 results)
5. **Botanical/Zoo (B)** - Large taxonomy, needs careful review
6. **Features (F)** - Monuments, memorials, sculptures

**Curation note**: Features (F) may include non-heritage physical objects. Review each result carefully.

### Priority 3: Organizational Classes

Proceed to organizational entities:

7. **Education Provider (E)** - Universities, colleges, schools with collections
8. **Research Center (R)** - Scientific institutes, documentation centers
9. **Official Institution (O)** - Government heritage agencies

**Curation note**: Filter for institutions that actually maintain heritage collections (not all universities have museums/archives).

### Priority 4: Niche/Low-Volume Classes

Finish with specialized collection types:

10. **Holy Sites (H)** - Religious institutions with heritage collections
11. **Collecting Society (S)** - Historical societies, numismatic clubs
12. **Personal Collection (P)** - Private collections
13. **Corporation (C)** - Corporate archives/museums

**Curation note**: These classes often overlap with others (e.g., corporate museums are also museums). Document multi-type classifications.

---

## Query Execution Checklist

For each query execution:

- [ ] Copy SPARQL from `[class]/queries/*_complete_*.sparql`
- [ ] Execute at https://query.wikidata.org/
- [ ] Download results as JSON
- [ ] Save JSON to `[class]/sparql/results_[YYYYMMDD].json`
- [ ] Review results for:
  - [ ] Valid heritage institution subtypes
  - [ ] False positives (non-heritage)
  - [ ] Semantic correctness
  - [ ] Geographic diversity
- [ ] Document results in `[class]/CURATION_LOG.md`
- [ ] Add validated Q-numbers to `hyponyms_curated.yaml`
- [ ] Re-run query to discover next batch

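
The file-naming convention in the checklist (`[class]/sparql/results_[YYYYMMDD].json`) can be generated rather than typed by hand. The following is a minimal sketch, not part of the existing tooling: the helper name `results_path` and the `QUERY_CLASSES` constant are hypothetical, and the set of class codes is taken from the query inventory table.

```python
from datetime import date
from pathlib import Path

# Class codes that have a query file in the inventory table (assumption:
# the table is the authoritative list).
QUERY_CLASSES = set("ABCEFGHLMOPRS")


def results_path(class_code: str, run_date: date, root: Path = Path(".")) -> Path:
    """Build `[class]/sparql/results_[YYYYMMDD].json` below `root`."""
    if class_code not in QUERY_CLASSES:
        raise ValueError(f"no query is defined for class {class_code!r}")
    return root / class_code / "sparql" / f"results_{run_date:%Y%m%d}.json"


# Example: the museum results saved on 2025-11-13.
print(results_path("M", date(2025, 11, 13)).as_posix())
# M/sparql/results_20251113.json
```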
---

## Curation Workflow

### Step 1: Review Query Results

List the entity URI and label of each result in the downloaded JSON file:

```bash
jq '.results.bindings[] | {q: .hyponym.value, label: .hyponymLabel.value}' M/sparql/results_20251113.json
```

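
The same extraction can be done in Python when the results feed a script rather than a terminal. This is a sketch under two assumptions: the result variables are named `hyponym` and `hyponymLabel` (as in the `jq` filter above), and the file follows the standard SPARQL JSON results layout. The function name `extract_hyponyms` is hypothetical.

```python
import json


def extract_hyponyms(results: dict) -> list[tuple[str, str]]:
    """Return (Q-number, label) pairs from a SPARQL JSON result document."""
    rows = []
    for binding in results["results"]["bindings"]:
        uri = binding["hyponym"]["value"]
        # The entity URI ends with the Q-number, e.g. .../entity/Q207694
        qid = uri.rsplit("/", 1)[-1]
        # A label may be missing when no label exists in the requested languages.
        label = binding.get("hyponymLabel", {}).get("value", "")
        rows.append((qid, label))
    return rows


# Inline sample standing in for a downloaded results file.
sample = json.loads("""
{"results": {"bindings": [
  {"hyponym": {"type": "uri", "value": "http://www.wikidata.org/entity/Q207694"},
   "hyponymLabel": {"type": "literal", "value": "art museum"}}
]}}
""")
print(extract_hyponyms(sample))  # [('Q207694', 'art museum')]
```

For a real file, replace the inline sample with `json.load(open("M/sparql/results_20251113.json", encoding="utf-8"))`.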
### Step 2: Validate Each Q-number

For each result, check:

1. **Is it a heritage institution type?**
   - Museums, libraries, archives, galleries, etc.
   - Collections, societies, cultural organizations
   - NOT: administrative units, geographic features (unless F-class)

2. **What GLAMORCUBEPSXHFN class(es)?**
   - Single type: M (museum), L (library), A (archive), etc.
   - Multiple types: Use X (mixed) or list all applicable codes

3. **Geographic/cultural context?**
   - Country-specific types (note in `country:` field)
   - Regional variations (note in `subregion:` field)

4. **Historical context?**
   - Defunct institution types (note in `time:` field)
   - Historical periods (e.g., "Imperial Russia", "Medieval")

### Step 3: Add to Curated Vocabulary

Edit `hyponyms_curated.yaml`:

```yaml
hyponym:
  - label: Q[NUMBER]
    hypernym:
      - [descriptive term from Wikidata label]
    type:
      - [GLAMORCUBEPSXHFN code: A, B, C, E, F, G, H, L, M, O, P, R, S, or X]
    country:  # optional
      - [ISO 3166-1 alpha-2 country code]
    subregion:  # optional
      - [region name]
    time:  # optional
      - [temporal context, e.g., "1900-1950", "< 1948"]
    rico:  # optional (for archival record types)
      - label: recordSetTypes
    duplicate:  # optional (if merged with another Q-number)
      - Q[DUPLICATE_NUMBER]
```

**Example**:

```yaml
- label: Q123456
  hypernym:
    - maritime museum
  type:
    - M
  country:
    - NL
```

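
Before an entry is committed to `hyponyms_curated.yaml`, a quick structural check can catch the most common slips (a country name instead of an ISO code, an unknown class code). The sketch below works on a plain dictionary with the same keys as the template. The function name `validate_entry` and the exact rules are assumptions for illustration, not an existing validator.

```python
import re

# Class codes allowed by the template above.
VALID_TYPES = set("ABCEFGHLMOPRSX")


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one curated entry (empty if none)."""
    problems = []
    if not re.fullmatch(r"Q[1-9]\d*", str(entry.get("label", ""))):
        problems.append("label must be a Q-number such as Q123456")
    if not entry.get("hypernym"):
        problems.append("hypernym needs at least one term")
    types = entry.get("type") or []
    if not types:
        problems.append("type needs at least one class code")
    for code in types:
        if code not in VALID_TYPES:
            problems.append(f"unknown class code: {code}")
    for country in entry.get("country") or []:
        if not re.fullmatch(r"[A-Z]{2}", str(country)):
            problems.append(f"country is not an ISO 3166-1 alpha-2 code: {country}")
    return problems


entry = {"label": "Q123456", "hypernym": ["maritime museum"], "type": ["M"], "country": ["NL"]}
print(validate_entry(entry))  # []
```

The check only tests the shape of the country code (two uppercase letters), not whether the code is actually assigned.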
### Step 4: Re-run Query (Iterative Discovery)

After adding Q-numbers to `hyponyms_curated.yaml`:

1. On their next execution, the queries automatically exclude the newly curated Q-numbers
2. Run the query again to discover transitive subclasses
3. Continue until no new relevant results are found
4. Mark the class as "complete" in the tracking doc

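
This guide does not show how the exclusion is implemented inside the query files. One plausible mechanism, shown here purely as an assumption, is a SPARQL `FILTER ... NOT IN` clause generated from the curated Q-numbers. The function name `exclusion_filter` and the variable name `?hyponym` are illustrative.

```python
import re


def exclusion_filter(curated_qids, variable: str = "?hyponym") -> str:
    """Build a SPARQL FILTER that excludes already curated Q-numbers.

    Returns an empty string when there is nothing to exclude.
    """
    for qid in curated_qids:
        if not re.fullmatch(r"Q[1-9]\d*", qid):
            raise ValueError(f"not a Q-number: {qid!r}")
    # De-duplicate and sort numerically so the generated query is stable.
    qids = sorted(set(curated_qids), key=lambda q: int(q[1:]))
    if not qids:
        return ""
    values = ", ".join(f"wd:{qid}" for qid in qids)
    return f"FILTER({variable} NOT IN ({values}))"


print(exclusion_filter(["Q207694", "Q28564", "Q207694"]))
# FILTER(?hyponym NOT IN (wd:Q28564, wd:Q207694))
```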
---

## Common False Positives

### Museums (M)

- ❌ **Museum websites** (Q386724) - Digital platforms, not institution types
- ❌ **Museum collections** (Q2668072) - Collection types, not institutions
- ❌ **Museum buildings** (Q41176) - Architecture, not organizations
- ✅ **Museum subtypes** (e.g., Q207694 "art museum") - Valid!

### Libraries (L)

- ❌ **Library catalogs** (Q5994) - Systems, not institutions
- ❌ **Library software** (Q7375) - Technology, not organizations
- ✅ **Library types** (e.g., Q28564 "public library") - Valid!

### Education (E)

- ❌ **All universities** - Only include if they maintain heritage collections
- ❌ **Primary schools** - Rarely have heritage significance
- ✅ **Universities with archives/museums** - Valid if documented

### Features (F)

- ❌ **Natural features** (mountains, rivers) - Not heritage custodians
- ❌ **Living people** - Not physical features
- ✅ **Monuments, memorials, sculptures, cemeteries** - Valid!

---

## Query Performance Tips

### Timeout Issues

If a query times out (>60 seconds):

1. **Split query by base class**: Run a separate query for each `UNION` clause
2. **Add temporal filter**: Limit to items created after a certain year
3. **Reduce language list**: Focus on 10-15 major languages
4. **Use LIMIT**: Add `LIMIT 1000` for initial exploration

### Large Result Sets

If a query returns >10,000 results:

1. **Prioritize by usage**: Add `ORDER BY DESC(?usageCount)` (count statements)
2. **Filter by sitelinks**: `FILTER(?sitelinks > 5)` to focus on well-documented items (bind the variable first, e.g. `?hyponym wikibase:sitelinks ?sitelinks`)
3. **Geographic focus**: Add country/region filters for phased curation

### Memory Issues

If the browser or WDQS crashes:

1. **Use LIMIT**: Start with `LIMIT 100`, increase gradually
2. **Download in batches**: Run the query multiple times with `LIMIT` and an increasing `OFFSET`
3. **Use API**: Query via https://query.wikidata.org/sparql (programmatic)

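
Batch download and programmatic access can be combined. The sketch below pages through a query with `LIMIT` and `OFFSET` against the endpoint named above, using only the Python standard library. Treat it as an outline under stated assumptions: the query text must not already end in its own `LIMIT`/`OFFSET`, it should contain an `ORDER BY` so that pages do not overlap, and the `USER_AGENT` string, function names, and page sizes are placeholders.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
# Placeholder: identify your own tool and contact address here.
USER_AGENT = "glam-hyponym-curation/0.1 (contact: you@example.org)"


def paged_query(query: str, limit: int, offset: int) -> str:
    """Append LIMIT/OFFSET to a query that has none of its own."""
    return f"{query.rstrip()}\nLIMIT {limit}\nOFFSET {offset}"


def build_request(query: str) -> urllib.request.Request:
    """Prepare a POST request that asks for SPARQL JSON results."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    headers = {"Accept": "application/sparql-results+json", "User-Agent": USER_AGENT}
    return urllib.request.Request(WDQS_ENDPOINT, data=body, headers=headers)


def fetch_all(query: str, page_size: int = 1000, max_pages: int = 50) -> list:
    """Collect result bindings page by page until a short page is returned."""
    bindings = []
    for page in range(max_pages):
        request = build_request(paged_query(query, page_size, page * page_size))
        with urllib.request.urlopen(request, timeout=90) as response:
            rows = json.load(response)["results"]["bindings"]
        bindings.extend(rows)
        if len(rows) < page_size:
            break  # last page reached
    return bindings
```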
---

## Automation Scripts (Future)

### Batch Query Execution

```python
# Planned: scripts/execute_wikidata_queries.py
# - Read all *.sparql files
# - Execute via WDQS API
# - Save results to class-specific directories
# - Generate curation dashboard
```

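
The first planned step, reading all `*.sparql` files, could look roughly like the sketch below. It is not the planned script itself. It assumes the `[class]/queries/*_complete_*.sparql` layout from the checklist, and the function name `find_query_files` is hypothetical.

```python
from pathlib import Path


def find_query_files(root: Path) -> dict[str, list[Path]]:
    """Map each single-letter class directory to its complete query files."""
    found: dict[str, list[Path]] = {}
    for path in sorted(root.glob("[A-Z]/queries/*_complete_*.sparql")):
        class_code = path.parent.parent.name  # e.g. "M" for M/queries/...
        found.setdefault(class_code, []).append(path)
    return found
```

Calling `find_query_files(Path("."))` from the GLAMORCUBEPSXHFN directory would return one entry per class that has a query file.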
### Result Analysis

```python
# Planned: scripts/analyze_query_results.py
# - Parse JSON results
# - Identify potential false positives
# - Suggest hypernym relationships
# - Generate curation candidates
```

### Iterative Curation

```python
# Planned: scripts/iterative_hyponym_discovery.py
# - Execute query
# - Present results for human review
# - Add validated Q-numbers to hyponyms_curated.yaml
# - Re-run query
# - Repeat until no new results
```

---

## Troubleshooting

### "Query timeout" error

**Cause**: Query takes >60 seconds to execute.

**Solution**: Simplify query (see "Query Performance Tips" above).

### "Too many results" warning

**Cause**: Result set >10,000 rows.

**Solution**: Add `LIMIT 1000` or use batch execution with `OFFSET`.

### "Malformed query" error

**Cause**: SPARQL syntax error (rare, all queries pre-validated).

**Solution**: Check that FILTER clauses are correctly closed with parentheses.

### Query returns base classes

**Cause**: Using `wdt:P279*` instead of `wdt:P279+`. The `*` path matches zero or more "subclass of" steps, so each base class also matches itself; `+` requires at least one step.

**Solution**: Already corrected in all queries (use `wdt:P279+`).

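
The difference between the two property paths can be reproduced on a toy subclass graph. The data below is invented for illustration and is not real Wikidata content. The two functions mimic what `wdt:P279+` and `wdt:P279*` select when starting from a base class.

```python
# child -> direct superclasses (toy data)
SUBCLASS_OF = {
    "art museum": ["museum"],
    "maritime museum": ["museum"],
    "naval museum": ["maritime museum"],
}


def subclasses_plus(base: str) -> set[str]:
    """One or more subclass steps below `base`, like wdt:P279+."""
    found: set[str] = set()
    frontier = [base]
    while frontier:
        current = frontier.pop()
        for child, parents in SUBCLASS_OF.items():
            if current in parents and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def subclasses_star(base: str) -> set[str]:
    """Zero or more subclass steps, like wdt:P279*: includes `base` itself."""
    return {base} | subclasses_plus(base)


print(sorted(subclasses_plus("museum")))
# ['art museum', 'maritime museum', 'naval museum']
print("museum" in subclasses_star("museum"))  # True
print("museum" in subclasses_plus("museum"))  # False
```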
---

## Progress Tracking

### Execution Log Template

Create `[class]/EXECUTION_LOG.md` for each class:

```markdown
# [Class] Query Execution Log

## Execution 1
- **Date**: 2025-11-13
- **Query**: [filename]
- **Results**: [count] hyponyms
- **Curated**: [count] added to hyponyms_curated.yaml
- **Rejected**: [count] false positives
- **Notes**: [observations]

## Execution 2
- **Date**: 2025-11-XX
- **Query**: [filename] (re-run)
- **Results**: [count] new hyponyms (after exclusion)
- **Status**: [Complete / Continue / Review]
```

### Completion Criteria

Mark class as "complete" when:

- [ ] Query returns <10 new relevant results
- [ ] All major institution subtypes are captured
- [ ] Geographic coverage is adequate
- [ ] Iterative discovery yields diminishing returns

---

## References

- **Query Files**: `/Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/[A-Z]/queries/`
- **Curated Vocabulary**: `/Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml`
- **Wikidata Query Service**: https://query.wikidata.org/
- **SPARQL Tutorial**: https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial
- **Session Summary**: `docs/sessions/SESSION_SUMMARY_20251113_SPARQL_GENERATION.md`

---

**Version**: 1.0
**Last Updated**: 2025-11-13
**Status**: Ready for execution