# Chilean GLAM Wikidata Enrichment - Next Session Guide

**Resume From**: Batch 14 (Rate Limited)
**Current Status**: 61/90 (67.8%) coverage
**Target**: 63/90 (70%) coverage
**Gap**: 2 more institutions needed

---

## What We Completed

✅ **Batch 13**: Added 1 validated match (CONADI → Q21002896)
✅ **Coverage Increase**: 66.7% → 67.8%
✅ **Files Created**: `chilean_institutions_batch13_enriched.yaml` (production dataset)
✅ **Scripts Ready**: Batch 14 search scripts prepared

---

## What Blocked Progress

❌ **Wikidata API Rate Limiting**: HTTP 403 errors during Batch 14 searches
⏳ **Solution**: Wait 24 hours for rate limit reset

---

## Next Actions (When Resuming)

### Step 1: Check Rate Limit Status

```bash
# Test if rate limit has reset
cd /Users/kempersc/apps/glam
python scripts/quick_wikidata_search_batch14.py
```

If still blocked → wait longer. If working → proceed to Step 2.
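
The blocked/working decision above can be sketched as a single-request probe. This is a minimal sketch, not the contents of `quick_wikidata_search_batch14.py`: the names `PROBE_URL`, `USER_AGENT`, `classify_status`, and `probe` are hypothetical, and treating HTTP 429 the same as the 403 seen in Batch 14 is an assumption.

```python
import urllib.error
import urllib.request

# Hypothetical probe URL: one small wbsearchentities request.
PROBE_URL = (
    "https://www.wikidata.org/w/api.php"
    "?action=wbsearchentities&search=Philippi&language=es&format=json"
)
# Hypothetical identifier; replace with real project contact details.
USER_AGENT = "glam-rate-limit-probe/0.1 (contact: you@example.org)"


def classify_status(status: int) -> str:
    """Map an HTTP status code to 'ok', 'blocked', or 'error'."""
    if status == 200:
        return "ok"
    if status in (403, 429):  # 403 was seen in Batch 14; 429 is an assumption
        return "blocked"
    return "error"


def probe(url: str = PROBE_URL, timeout: float = 10.0) -> str:
    """Send one request and report whether the API answers normally."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as exc:
        # HTTPError must be caught before its parent class URLError.
        return classify_status(exc.code)
    except urllib.error.URLError:
        return "error"  # network problem, not necessarily a rate limit
```

A result of `"ok"` from `probe()` would correspond to "working", and `"blocked"` to "wait longer".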

### Step 2: Execute Targeted Searches

Focus on these three high-priority candidates:

1. **Museo Rodulfo Philippi** (Chañaral) - Named after the German-Chilean naturalist Rodolfo Amando Philippi (1808-1904)
   - **Likelihood**: HIGH - well-documented scientist
   - **Search Terms**: "Philippi", "Museo Philippi", "Rodolfo Amando Philippi"

2. **Museo Rudolph Philippi** (Valdivia) - Same scientist, alternate spelling
   - **Likelihood**: HIGH - major city, better Wikidata coverage
   - **Search Terms**: "Rudolf Philippi Valdivia", "Museo Philippi"

3. **Instituto Alemán Puerto Montt** - German school with heritage collections
   - **Likelihood**: MEDIUM - German schools often documented
   - **Search Terms**: "Deutsche Schule Puerto Montt", "Instituto Alemán"
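
The search terms listed above could be turned into Wikidata `wbsearchentities` request URLs as sketched below. This is an illustration only, not the logic of the Batch 14 scripts: `CANDIDATES`, `build_search_url`, and `build_all_urls` are hypothetical names, and searching with `language="es"` and `limit=10` is an assumption.

```python
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"

# Candidates and search terms copied from the list above.
CANDIDATES = {
    "Museo Rodulfo Philippi (Chañaral)": [
        "Philippi", "Museo Philippi", "Rodolfo Amando Philippi",
    ],
    "Museo Rudolph Philippi (Valdivia)": [
        "Rudolf Philippi Valdivia", "Museo Philippi",
    ],
    "Instituto Alemán Puerto Montt": [
        "Deutsche Schule Puerto Montt", "Instituto Alemán",
    ],
}


def build_search_url(term: str, language: str = "es", limit: int = 10) -> str:
    """Build one wbsearchentities URL for a search term."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,  # assumed; "de" may suit the German school
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def build_all_urls(candidates=CANDIDATES):
    """Return {candidate name: [one URL per search term]}."""
    return {
        name: [build_search_url(term) for term in terms]
        for name, terms in candidates.items()
    }
```

Each URL returns candidate items only; any Q-number found this way still has to go through the manual verification in Step 3.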

### Step 3: Manual Verification

For any Q-numbers found:
1. Visit `https://www.wikidata.org/wiki/Q[number]`
2. Verify institution name matches
3. Check location matches (city/region)
4. Confirm institution type (museum/school)
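
The checklist above can be recorded in a small helper so that each check leaves an explicit result. This is a sketch under assumptions: the `Institution` fields and the function names are hypothetical, the "found" values are typed in by the person reading the Wikidata page, and the comparison ignores only case and surrounding whitespace.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Institution:
    name: str
    city: str
    kind: str  # e.g. "museum" or "school"


def wikidata_url(qid: str) -> str:
    """Page to open for step 1 of the checklist."""
    return f"https://www.wikidata.org/wiki/{qid}"


def _same(a: str, b: str) -> bool:
    # Case-insensitive comparison; whitespace at the ends is ignored.
    return a.strip().casefold() == b.strip().casefold()


def verification_failures(expected: Institution, found: Institution) -> List[str]:
    """Return the checklist items that do not match (empty list = all pass)."""
    failures = []
    if not _same(expected.name, found.name):
        failures.append("name")
    if not _same(expected.city, found.city):
        failures.append("city")
    if not _same(expected.kind, found.kind):
        failures.append("type")
    return failures
```

A candidate would be accepted only when `verification_failures(...)` returns an empty list; anything else is rejected instead of being recorded as a match.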

### Step 4: Apply Enrichment

Create and run an enrichment script similar to `enrich_chilean_batch13.py`:

```bash
# Template command (adjust for validated matches)
python scripts/enrich_chilean_batch14.py
```

This will generate `chilean_institutions_batch14_enriched.yaml`.

### Step 5: Verify Target Reached

```bash
# Check final coverage
python - <<'EOF'
import yaml

TARGET = 63  # 70% of 90 institutions

with open('data/instances/chile/chilean_institutions_batch14_enriched.yaml', 'r') as f:
    data = yaml.safe_load(f)

total = len(data)
# Count institutions that carry at least one Wikidata identifier
with_wd = sum(
    1 for inst in data
    if any(ident.get('identifier_scheme') == 'Wikidata'
           for ident in inst.get('identifiers') or [])
)
print(f'Coverage: {with_wd}/{total} ({(with_wd/total)*100:.1f}%)')
print('✓ TARGET REACHED!' if with_wd >= TARGET else f'Need {TARGET - with_wd} more')
EOF
```

---

## Alternative Strategies (If Searches Fail)

### Option A: Accept Current Coverage

67.8% is **strong coverage** given:
- Many small regional museums lack Wikidata entries
- This is a global pattern (not a Chile-specific issue)
- Museum coverage is excellent (87.2%)

### Option B: Create Wikidata Entries

For notable institutions lacking coverage (e.g., Museo Rodulfo Philippi):
1. Research institution history and significance
2. Create Wikidata entry following notability guidelines
3. Add to Chilean dataset with newly minted Q-number

**Time Investment**: ~2-4 hours per institution

### Option C: Focus on Other Datasets

Move to other Latin American countries with:
- Larger institution counts (Brazil, Mexico, Argentina)
- Better baseline Wikidata coverage
- More well-documented national museums

---

## Key Files Reference

### Primary Dataset (Use This)
`data/instances/chile/chilean_institutions_batch13_enriched.yaml`
- 90 institutions, 61 with Wikidata (67.8%)

### Search Scripts
- `scripts/manual_wikidata_search_batch14.py` - SPARQL-based (comprehensive)
- `scripts/quick_wikidata_search_batch14.py` - API-based (faster)

### Previous Results
- `scripts/batch13_manual_search_results.json` - Batch 13 search output
- `scripts/batch14_quick_search_results.json` - Empty (rate limited)

### Documentation
- `docs/chilean_enrichment_batch13_14_report.md` - Full session report

---

## Expected Outcomes

### Best Case (2 Matches Found)
- Both Philippi museums have Wikidata entries → 63/90 (70.0%) ✅ TARGET REACHED

### Likely Case (1 Match Found)
- One Philippi museum found → 62/90 (68.9%) - Close to target

### Worst Case (0 Matches Found)
- Stay at 61/90 (67.8%) - Accept as strong coverage
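
The percentages in the three cases follow directly from the counts. The arithmetic can be checked with the short sketch below; the function names are hypothetical, and the 70% target is the one stated in this guide.

```python
def coverage_percent(matched: int, total: int) -> float:
    """Share of institutions with a Wikidata match, in percent."""
    return matched / total * 100


def matches_needed(matched: int, total: int, target_percent: int) -> int:
    """Additional matches required to reach the target percentage."""
    # Integer ceiling of total * target_percent / 100 (no float rounding).
    required = -(-total * target_percent // 100)
    return max(0, required - matched)


for matched in (61, 62, 63):
    print(f"{matched}/90 -> {coverage_percent(matched, 90):.1f}%")
# 61/90 -> 67.8%
# 62/90 -> 68.9%
# 63/90 -> 70.0%

print("Still needed:", matches_needed(61, 90, 70))
# Still needed: 2
```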

---

## Technical Notes

### Rate Limit Recovery
- Wikidata rate limits typically reset within 24 hours
- Blocking is IP-based, not account-based
- No action needed; the reset is automatic
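
Once the limit has reset, pacing the Batch 14 requests may lower the chance of being blocked again. The sketch below shows one way to enforce a minimum gap between requests. The `Throttle` class is hypothetical, and any interval chosen (for example 2 seconds) is an assumption, not a documented Wikidata threshold.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable for testing
        self._sleep = sleep  # injectable for testing
        self._last = None    # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the last request was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        # Record when the upcoming request is actually sent.
        self._last = now + delay
        return delay
```

A search loop would create one `Throttle(min_interval=2.0)` and call `wait()` immediately before each request.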

### Search Strategy
- Use exact name matching only
- Focus on institutions named after notable people
- Major cities have better Wikidata coverage

### Data Quality Standards
- Manual verification required for all matches
- No synthetic Q-numbers (CRITICAL POLICY)
- Document rationale in provenance metadata
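
A validated match could be turned into an identifier entry that carries its own provenance, as sketched below. Only `identifier_scheme: Wikidata` is taken from the dataset check in Step 5; the other key names, the `provenance` structure, and the function name are assumptions for illustration. The format check only rejects malformed Q-numbers. It cannot tell a real Q-number from a synthetic one, so manual verification remains required.

```python
import re
from datetime import date

# Well-formed Q-number: "Q" followed by a positive integer without leading zeros.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def wikidata_identifier(qid: str, rationale: str, verified_on: date) -> dict:
    """Build an identifier entry with provenance for a manually verified match."""
    if not QID_PATTERN.fullmatch(qid):
        raise ValueError(f"Not a well-formed Q-number: {qid!r}")
    if not rationale.strip():
        raise ValueError("A verification rationale is required")
    return {
        "identifier_scheme": "Wikidata",
        "identifier_value": qid,  # assumed key name
        "identifier_url": f"https://www.wikidata.org/wiki/{qid}",  # assumed key name
        "provenance": {  # hypothetical structure
            "verification": "manual",
            "rationale": rationale,
            "verified_on": verified_on.isoformat(),
        },
    }
```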

---

## Quick Command Reference

```bash
# Navigate to project
cd /Users/kempersc/apps/glam

# Test rate limit status
python scripts/quick_wikidata_search_batch14.py

# If working, run comprehensive search
python scripts/manual_wikidata_search_batch14.py

# Review results
jq . scripts/batch14_manual_search_results.json

# Apply enrichment (after validation)
python scripts/enrich_chilean_batch14.py

# Check final coverage: run the snippet from Step 5 above
```

---

## Success Criteria

✅ Execute Batch 14 searches without rate limiting
✅ Find and validate 2 Q-numbers for remaining institutions
✅ Reach 70% coverage (63/90)
✅ Maintain zero false positives
✅ Document all matches with provenance

**Minimum Acceptable**: 1 additional match → 62/90 (68.9%)
**Target**: 2 additional matches → 63/90 (70.0%)
**Stretch Goal**: 3+ matches → 64+/90 (71%+)

---

**Ready to Resume**: ✅ All scripts prepared, waiting for rate limit reset
**Estimated Time**: 1-2 hours (once rate limits clear)
**Priority**: HIGH - Almost at target!