glam/data/wikidata/GLAMORCUBEPSXHFN/G/queries/README.md
2025-11-19 23:25:22 +01:00


# G-class Query History
This directory contains all SPARQL queries used to discover G-class (Gallery) heritage custodians from Wikidata.
## Query Evolution
### Phase 1: Initial Discovery (Nov 13, 2025)
**`gallery_query_complete_20251113T130920.sparql`**
- First comprehensive G-class query
- 6 base classes
- Basic subclass discovery (P279+)
- No exclusions
### Phase 2: First Update (Nov 13, 2025)
**`gallery_query_updated_20251113T160823.sparql`**
- Refined base classes
- Added quality filters
- Improved pattern matching
### Phase 3: Verification Round 1 (Nov 16, 2025 AM)
**`gallery_query_updated_20251116T095704.sparql`**
- Q-number verification pass
- Updated base classes based on validation
- Better type distinctions (institution vs. building)
### Phase 4: Verification Round 2 (Nov 16, 2025 Midday)
**`gallery_query_updated_20251116T104506.sparql`**
- 14 verified base classes
- Comprehensive exclusions (1,819 Q-numbers)
- 37 filter chunks for optimization
- Key discoveries:
  - Institution vs. building distinction
  - Artist-run and alternative spaces
  - Online/digital galleries
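
The chunked exclusion filters mentioned above can be sketched as follows. This is a minimal illustration, not the code that produced the Phase 4 query: the helper name `build_exclusion_filters`, the `?item` variable, and the chunk size of 50 are all assumptions. With 1,819 IDs, chunks of 50 happen to yield 37 filters, which matches the figure above, but the real query may be chunked differently.

```python
from typing import Iterable, List


def build_exclusion_filters(
    qids: Iterable[str], chunk_size: int = 50, var: str = "?item"
) -> List[str]:
    """Split excluded Q-numbers into several smaller FILTER clauses.

    chunk_size=50 is an assumed value, not taken from the project.
    """
    # Deduplicate and sort numerically so the output is stable.
    ordered = sorted(set(qids), key=lambda q: int(q[1:]))
    filters = []
    for start in range(0, len(ordered), chunk_size):
        chunk = ordered[start:start + chunk_size]
        values = ", ".join(f"wd:{q}" for q in chunk)
        filters.append(f"FILTER({var} NOT IN ({values}))")
    return filters


# Placeholder IDs only; the real exclusion list lives in the project data.
excluded = [f"Q{n}" for n in range(1, 1820)]  # 1,819 IDs
filters = build_exclusion_filters(excluded)
print(len(filters))
```

Each returned string can be placed inside the query's `WHERE` block; keeping the clauses small makes the query text easier to inspect and edit.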
### Phase 5: Enhanced Recursive Discovery (Nov 16, 2025 12:54) ⚠️ **SUPERSEDED**
**`gallery_query_enhanced_20251116T125414.sparql`**
**Issue Discovered**: Returns individual gallery **instances** (P31), not gallery **classes** (P279)
**Why Superseded**:
- Mixed P31 (instance of) and P279 (subclass of) patterns
- Would return "Tate Modern", "Louvre", etc. (individual galleries)
- We need "contemporary art gallery", "sculpture gallery", etc. (gallery types)
### Phase 6: Classes-Only Query (Nov 16, 2025 13:45) ⭐ **CURRENT**
**`gallery_classes_query_20251116T134500.sparql`** ← **USE THIS ONE**
**Major Innovation**: **ONLY P279 (subclass) relationships** - NO P31 (instance) patterns
**Critical Distinction**:
- **P279 (subclass of)** → Gallery CLASSES we want (e.g., "photography gallery")
- **P31 (instance of)** → Individual galleries we exclude (e.g., "MoMA")
**8 Search Strategies** (all P279-based):
1. ✨ Q118554787 direct hyponyms (1 level)
2. ✨ Q118554787 transitive hyponyms (all levels)
3. ✨ 14 curated G-class hypernyms - direct hyponyms
4. ✨ 14 curated G-class hypernyms - transitive hyponyms
5. ✨ 8 mixed G+M hypernyms - direct hyponyms
6. ✨ 8 mixed G+M hypernyms - transitive hyponyms
7. Q207694 (art gallery) transitive hyponyms
8. Q18761864 (exhibition space) transitive hyponyms
**Key Features**:
- **`FILTER NOT EXISTS { ?item wdt:P31 ?anyInstance }`**: excludes every item that carries any P31 (instance of) statement
- Only returns classes/types (P279 relationships)
- Uses Q118554787 (broadest gallery hypernym)
- Recursive discovery via curated G-class entries
- Expected: 30-100 new gallery classes
**Improvements over Phase 5**:
- Focused on classes, not instances
- Added critical instance exclusion filter
- All strategies now use P279 (subclass) patterns only
- More accurate for taxonomy building
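
To make the difference between the "direct" and "transitive" strategies concrete, here is a hedged Python sketch that assembles a classes-only query string. It is not the contents of `gallery_classes_query_20251116T134500.sparql`; the function name `build_classes_query` and the `?root` variable are illustrative assumptions. The `wd:`, `wdt:`, `wikibase:`, and `bd:` prefixes are predefined on the Wikidata Query Service, so the sketch does not declare them.

```python
from typing import Sequence


def build_classes_query(roots: Sequence[str], transitive: bool = True) -> str:
    """Build a P279-only query for subclasses of the given root classes.

    transitive=True uses the property path wdt:P279+ (all levels);
    transitive=False uses plain wdt:P279 (one level).
    """
    path = "wdt:P279+" if transitive else "wdt:P279"
    values = " ".join(f"wd:{q}" for q in roots)
    return (
        "SELECT DISTINCT ?item ?itemLabel WHERE {\n"
        f"  VALUES ?root {{ {values} }}\n"
        f"  ?item {path} ?root .\n"
        # Instance exclusion filter described under Key Features.
        "  FILTER NOT EXISTS { ?item wdt:P31 ?anyInstance }\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


# Strategies 7 and 8: transitive hyponyms of art gallery and exhibition space.
print(build_classes_query(["Q207694", "Q18761864"]))

# Strategy 1: direct hyponyms of Q118554787 only.
print(build_classes_query(["Q118554787"], transitive=False))
```

Passing the curated hypernym Q-numbers as `roots` would cover the remaining strategies in the same way.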
## File Structure
Each query version includes:
- **`.sparql`** - SPARQL query code
- **`.yaml`** - Metadata with:
  - Base classes used
  - Search strategies
  - Expected results
  - Deduplication notes
  - Usage instructions
  - Related files
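
The timestamped naming used by the query files can be reproduced with a small helper. This is a sketch only: `versioned_paths` is a hypothetical name, and it assumes the `.yaml` metadata file shares the stem of its `.sparql` file, which this README does not state explicitly.

```python
from datetime import datetime
from pathlib import Path
from typing import Tuple


def versioned_paths(directory: Path, stem: str, when: datetime) -> Tuple[Path, Path]:
    """Return the (.sparql, .yaml) paths for one query version."""
    # Same compact timestamp layout as the existing files: YYYYMMDDTHHMMSS.
    stamp = when.strftime("%Y%m%dT%H%M%S")
    base = f"{stem}_{stamp}"
    return directory / f"{base}.sparql", directory / f"{base}.yaml"


sparql_path, yaml_path = versioned_paths(
    Path("."), "gallery_classes_query", datetime(2025, 11, 16, 13, 45, 0)
)
print(sparql_path.name)  # gallery_classes_query_20251116T134500.sparql
print(yaml_path.name)    # gallery_classes_query_20251116T134500.yaml
```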
## Current Statistics
**hyponyms_curated.yaml** (as of Nov 16, 2025):
- Total Q-numbers: 1,896
- G-class entries: 47 (25 pure G + 22 mixed)
- Q118554787: **NOT INCLUDED** ← Key opportunity!
## Usage Workflow
1. **Select Query**: Use latest classes-only query (`gallery_classes_query_20251116T134500.sparql`)
2. **Execute**: Paste the query into the [Wikidata Query Service](https://query.wikidata.org/) and run it
3. **Download**: Export results as JSON
4. **Deduplicate**: Filter results against the 1,896 Q-numbers already in `hyponyms_curated.yaml`
5. **Curate**: Assign type codes, country, hypernym metadata
6. **Add**: Append to `hyponyms_curated.yaml`
7. **Enrich**: Run `python scripts/enrich_hyponyms_with_wikidata.py`
8. **Validate**: Check for duplicates/conflicts
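
The deduplication step (step 4) can be sketched in Python as below. The sketch assumes the export uses the standard SPARQL JSON results layout (`results.bindings`, with entity URIs in `value`) and that the query variable is named `item`; the helper name `new_qids` is hypothetical. Because the internal structure of `hyponyms_curated.yaml` is not described here, the existing Q-numbers are passed in as a plain set, and the sample data is purely illustrative.

```python
import json
from typing import List, Set


def new_qids(results_json: str, existing: Set[str], var: str = "item") -> List[str]:
    """Return Q-numbers from a query export that are not yet curated."""
    data = json.loads(results_json)
    seen: Set[str] = set()
    fresh: List[str] = []
    for binding in data["results"]["bindings"]:
        # Entity URIs look like http://www.wikidata.org/entity/Q207694
        qid = binding[var]["value"].rsplit("/", 1)[-1]
        if qid not in existing and qid not in seen:
            seen.add(qid)
            fresh.append(qid)
    return fresh


# Illustrative export; the set below stands in for the curated list.
sample = json.dumps({
    "head": {"vars": ["item"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q207694"}},
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q118554787"}},
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q118554787"}},
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q18761864"}},
    ]},
})
print(new_qids(sample, existing={"Q207694"}))
```

The function also drops repeats within the export itself, which matters when several search strategies return the same class.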
## Expected Impact
**Before** (Phase 5):
- 47 G-class entries
- Limited coverage
**After** (Phase 6 - Projected):
- 77-147 G-class entries (47 + 30-100 new classes)
- Roughly 1.6-3.1x the current taxonomy coverage (77/47 to 147/47)
- Gallery classes like "contemporary art gallery", "sculpture gallery", etc.
- Better classification for future institution extraction
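
The projection is plain arithmetic on the figures above; a short check, using only the numbers stated in this README:

```python
current = 47                 # G-class entries before Phase 6
new_low, new_high = 30, 100  # expected new gallery classes

projected_low = current + new_low
projected_high = current + new_high
print(projected_low, projected_high)  # 77 147

# Growth factor relative to the current 47 entries.
factor_low = round(projected_low / current, 2)
factor_high = round(projected_high / current, 2)
print(factor_low, factor_high)  # 1.64 3.13
```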
## Documentation
See also:
- **Query documentation**: `../G_QUERY_UPDATE_2025-11-16.md`
- **Source data**: `../../hyponyms_curated.yaml`
- **Enriched data**: `../../hyponyms_curated_full.yaml`
- **SPARQL copies**: `../sparql/` (for easy access)
## Next Queries to Develop
After G-class completion, develop similar enhanced queries for:
- **L-class** (Libraries) - use curated library hypernyms
- **A-class** (Archives) - use curated archive hypernyms
- **M-class** (Museums) - largest class, needs careful strategy
- **Other classes** - R, C, O, B, E, S, F, I, X, P, H, D, N, T
## Query Best Practices
Based on lessons learned:
1. **Verify Q-numbers** before using as base classes
2. **Use curated entries** for recursive discovery
3. **Defer deduplication** to post-processing (performance)
4. **Multiple strategies** for comprehensive coverage
5. **Document metadata** in accompanying YAML files
6. **Version control** with timestamps
7. **Test with COUNT** queries first to estimate results
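
As an illustration of practice 7, the sketch below wraps a `WHERE` body in a `COUNT` query so the result size can be estimated before running the full version. The helper name `to_count_query` and the `?item` variable are assumptions, and the example body is a simplified stand-in rather than one of the project queries.

```python
def to_count_query(where_body: str, var: str = "?item") -> str:
    """Wrap a WHERE body so the query returns only a distinct count."""
    return (
        f"SELECT (COUNT(DISTINCT {var}) AS ?count) WHERE {{\n"
        f"{where_body}\n"
        "}"
    )


# Simplified body: transitive subclasses of art gallery, instances excluded.
body = (
    "  ?item wdt:P279+ wd:Q207694 .\n"
    "  FILTER NOT EXISTS { ?item wdt:P31 ?anyInstance }"
)
count_query = to_count_query(body)
print(count_query)
```

If the count is far outside the expected range, that is a cue to revisit the base classes before exporting full results.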
## Contact
For questions about query strategy or to report issues, see project documentation in `AGENTS.md`.