glam/RESUME_CHILEAN_ENRICHMENT.md
2025-11-19 23:25:22 +01:00

# Chilean GLAM Wikidata Enrichment - Next Session Guide
**Resume From**: Batch 14 (Rate Limited)
**Current Status**: 61/90 (67.8%) coverage
**Target**: 63/90 (70%) coverage
**Gap**: 2 more institutions needed
---
## What We Completed
**Batch 13**: Added 1 validated match (CONADI → Q21002896)
**Coverage Increase**: 66.7% → 67.8%
**Files Created**: `chilean_institutions_batch13_enriched.yaml` (production dataset)
**Scripts Ready**: Batch 14 search scripts prepared
---
## What Blocked Progress
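**Wikidata API Rate Limiting**: HTTP 403 errors during Batch 14 searches, interpreted as rate limiting. A 403 can also mean the client was rejected outright (for example, because of a missing or generic `User-Agent` header), while throttling is more commonly reported as HTTP 429.
**Solution**: Wait 24 hours for the presumed rate limit to reset. If the 403 errors persist after that, check the `User-Agent` header the scripts send.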
---
## Next Actions (When Resuming)
### Step 1: Check Rate Limit Status
```bash
# Test if rate limit has reset
cd /Users/kempersc/apps/glam
python scripts/quick_wikidata_search_batch14.py
```
If still blocked → wait longer. If working → proceed to Step 2.
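
If the quick search script does not report the HTTP status it receives, a small standalone probe can help tell "still blocked" from "working". The following is a minimal sketch, not the project's script: it assumes the public `wbsearchentities` action of the Wikidata API, and the `USER_AGENT` value and the function names `interpret_status` and `probe` are placeholders introduced here.

```python
import urllib.error
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"
# Placeholder: replace with a descriptive agent name and real contact details.
USER_AGENT = "glam-enrichment-probe/0.1 (contact: you@example.org)"


def interpret_status(status: int) -> str:
    """Map an HTTP status code to a suggested next action."""
    if status == 200:
        return "ok: proceed to Step 2"
    if status == 429:
        return "rate limited: wait, then retry"
    if status == 403:
        return "forbidden: check the User-Agent header before assuming a rate limit"
    return f"unexpected status {status}: inspect the response"


def probe(term: str = "Philippi", timeout: float = 10.0) -> str:
    """Send one small search request and describe the outcome."""
    query = urllib.parse.urlencode({
        "action": "wbsearchentities",
        "search": term,
        "language": "es",
        "format": "json",
        "limit": 1,
    })
    request = urllib.request.Request(
        f"{API_URL}?{query}", headers={"User-Agent": USER_AGENT}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return interpret_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the status code is on the exception.
        return interpret_status(err.code)
```

Running `print(probe())` from the project environment sends a single request, so it should not add meaningful load while the block is in place.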
### Step 2: Execute Targeted Searches
Focus on these three high-priority candidates:
1. **Museo Rodulfo Philippi** (Chañaral) - Named after the German-Chilean naturalist Rodolfo Amando Philippi (1808-1904)
   - **Likelihood**: HIGH - well-documented scientist
   - **Search Terms**: "Philippi", "Museo Philippi", "Rodolfo Amando Philippi"
2. **Museo Rudolph Philippi** (Valdivia) - Same scientist, alternate spelling
   - **Likelihood**: HIGH - major city, better Wikidata coverage
   - **Search Terms**: "Rudolf Philippi Valdivia", "Museo Philippi"
3. **Instituto Alemán Puerto Montt** - German school with heritage collections
   - **Likelihood**: MEDIUM - German schools often documented
   - **Search Terms**: "Deutsche Schule Puerto Montt", "Instituto Alemán"
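
The search terms above could be turned into `wbsearchentities` request URLs roughly as follows. This is a sketch under assumptions: the Batch 14 scripts may build their queries differently, the default language `es` is a guess, and `build_search_url` and `unique_terms` are hypothetical helper names. Deduplicating terms ("Museo Philippi" appears for two candidates) keeps the request count down while rate limits are a concern.

```python
import urllib.parse

API_URL = "https://www.wikidata.org/w/api.php"

# Candidate names and search terms copied from the list above.
CANDIDATES = {
    "Museo Rodulfo Philippi (Chañaral)": [
        "Philippi", "Museo Philippi", "Rodolfo Amando Philippi",
    ],
    "Museo Rudolph Philippi (Valdivia)": [
        "Rudolf Philippi Valdivia", "Museo Philippi",
    ],
    "Instituto Alemán Puerto Montt": [
        "Deutsche Schule Puerto Montt", "Instituto Alemán",
    ],
}


def build_search_url(term: str, language: str = "es", limit: int = 10) -> str:
    """Build a wbsearchentities URL for one search term."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urllib.parse.urlencode(params)}"


def unique_terms(candidates: dict) -> list:
    """Return each search term once, preserving first-seen order."""
    seen = {}
    for terms in candidates.values():
        for term in terms:
            seen.setdefault(term, None)
    return list(seen)
```

For example, `[build_search_url(t) for t in unique_terms(CANDIDATES)]` yields six URLs instead of seven.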
### Step 3: Manual Verification
For any Q-numbers found:
1. Visit `https://www.wikidata.org/wiki/Q[number]`
2. Verify institution name matches
3. Check location matches (city/region)
4. Confirm institution type (museum/school)
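
The checks above stay manual, but the fields to compare can be pulled out of the entity JSON first (for instance, the per-entity object returned by the `wbgetentities` API action). The sketch below assumes that standard JSON layout and reads three real Wikidata properties: P31 (instance of), P131 (located in the administrative territorial entity), and P17 (country). The helper names `claim_ids` and `summarize_entity` are hypothetical.

```python
def claim_ids(entity: dict, prop: str) -> list:
    """Collect the item IDs that an entity's statements for `prop` point to."""
    ids = []
    for statement in entity.get("claims", {}).get(prop, []):
        value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
        # Statements with "no value" or "unknown value" carry no datavalue.
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids


def summarize_entity(entity: dict, language: str = "es") -> dict:
    """Extract the fields a reviewer compares against the dataset record."""
    labels = entity.get("labels", {})
    label = labels.get(language, labels.get("en", {})).get("value")
    return {
        "id": entity.get("id"),
        "label": label,
        "instance_of": claim_ids(entity, "P31"),
        "located_in": claim_ids(entity, "P131"),
        "country": claim_ids(entity, "P17"),
    }
```

The summary only lists IDs for a human to inspect. It deliberately does not auto-accept a match, because a museum may be typed with a more specific class than the generic "museum" item.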
### Step 4: Apply Enrichment
Create and run an enrichment script modeled on `enrich_chilean_batch13.py`:
```bash
# Template command (adjust for validated matches)
python scripts/enrich_chilean_batch14.py
```
This should generate `data/instances/chile/chilean_institutions_batch14_enriched.yaml`, the file that Step 5 reads.
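
The core of such an enrichment script might look like the sketch below. Only the `identifiers` list and the `identifier_scheme` key are taken from the coverage check in this guide; `identifier_value`, `identifier_url`, the `provenance` layout, and the function name `add_wikidata_id` are assumptions and should be aligned with whatever `enrich_chilean_batch13.py` actually writes.

```python
import copy
import re

# Format check only: it cannot tell a real Q-number from an invented one.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def add_wikidata_id(institution: dict, qid: str, rationale: str) -> dict:
    """Return a copy of `institution` with a Wikidata identifier appended."""
    if not QID_PATTERN.fullmatch(qid):
        raise ValueError(f"Not a well-formed Q-number: {qid!r}")
    enriched = copy.deepcopy(institution)
    identifiers = enriched.get("identifiers") or []
    if any(i.get("identifier_scheme") == "Wikidata" for i in identifiers):
        return enriched  # already has a Wikidata ID; leave it untouched
    identifiers.append({
        "identifier_scheme": "Wikidata",
        "identifier_value": qid,  # assumed field name
        "identifier_url": f"https://www.wikidata.org/wiki/{qid}",  # assumed field name
    })
    enriched["identifiers"] = identifiers
    # Assumed layout for the rationale that the data quality standards require.
    provenance = enriched.get("provenance") or {}
    provenance["wikidata_match_rationale"] = rationale
    enriched["provenance"] = provenance
    return enriched
```

Using the Batch 13 match as an example, `add_wikidata_id({"name": "CONADI"}, "Q21002896", "validated in Batch 13")` returns a new record and leaves the input unchanged (the `name` key is also an assumption).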
### Step 5: Verify Target Reached
```bash
# Check final coverage
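python -c "
import yaml
# Avoid exclamation marks in this snippet: inside double quotes an
# interactive shell may try to apply history expansion to them.
with open('data/instances/chile/chilean_institutions_batch14_enriched.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)
total = len(data)
# Count institutions that carry at least one Wikidata identifier.
with_wd = sum(
    1 for inst in data
    if any(ident.get('identifier_scheme') == 'Wikidata'
           for ident in (inst.get('identifiers') or []))
)
print(f'Coverage: {with_wd}/{total} ({with_wd / total * 100:.1f}%)')
print('✓ TARGET REACHED' if with_wd >= 63 else f'Need {63 - with_wd} more')
"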
```
---
## Alternative Strategies (If Searches Fail)
### Option A: Accept Current Coverage
67.8% is **strong coverage** given:
- Many small regional museums lack Wikidata entries
- This is a global pattern (not Chile-specific issue)
- Museum coverage is excellent (87.2%)
### Option B: Create Wikidata Entries
For notable institutions lacking coverage (e.g., Museo Rodulfo Philippi):
1. Research institution history and significance
2. Create Wikidata entry following notability guidelines
3. Add to Chilean dataset with newly minted Q-number
**Time Investment**: ~2-4 hours per institution
### Option C: Focus on Other Datasets
Move to other Latin American countries with:
- Larger institution counts (Brazil, Mexico, Argentina)
- Better baseline Wikidata coverage
- More well-documented national museums
---
## Key Files Reference
### Primary Dataset (Use This)
`data/instances/chile/chilean_institutions_batch13_enriched.yaml`
- 90 institutions, 61 with Wikidata (67.8%)
### Search Scripts
- `scripts/manual_wikidata_search_batch14.py` - SPARQL-based (comprehensive)
- `scripts/quick_wikidata_search_batch14.py` - API-based (faster)
### Previous Results
- `scripts/batch13_manual_search_results.json` - Batch 13 search output
- `scripts/batch14_quick_search_results.json` - Empty (rate limited)
### Documentation
- `docs/chilean_enrichment_batch13_14_report.md` - Full session report
---
## Expected Outcomes
### Best Case (2 Matches Found)
- Philippi museums have Wikidata entries → 63/90 (70.0%) ✅ TARGET REACHED
### Likely Case (1 Match Found)
- One Philippi museum found → 62/90 (68.9%) - Close to target
### Worst Case (0 Matches Found)
- Stay at 61/90 (67.8%) - Accept as strong coverage
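
The percentages in these scenarios follow directly from the counts. As a quick check of the arithmetic, using only the numbers stated in this guide (90 institutions, 61 matched, 70% target):

```python
TOTAL = 90       # institutions in the dataset
CURRENT = 61     # institutions with a Wikidata ID after Batch 13
TARGET_PCT = 70  # coverage target in percent

# Smallest matched count that reaches the target (integer ceiling division).
target_count = (TARGET_PCT * TOTAL + 99) // 100


def coverage_pct(matched: int, total: int = TOTAL) -> float:
    """Coverage as a percentage, rounded to one decimal place."""
    return round(100 * matched / total, 1)


for new_matches in (0, 1, 2):
    matched = CURRENT + new_matches
    if matched >= target_count:
        status = "target reached"
    else:
        status = f"{target_count - matched} short"
    print(f"+{new_matches}: {matched}/{TOTAL} ({coverage_pct(matched):.1f}%) {status}")
```

This prints 67.8%, 68.9%, and 70.0% for zero, one, and two new matches, with the target count working out to 63.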
---
## Technical Notes
### Rate Limit Recovery
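- The 24-hour reset window is a working assumption, not a documented Wikidata guarantee; retry sooner if a response includes a `Retry-After` header
- Blocking appears to be IP-based, not account-based
- If the block is a rate limit, no action is needed; it resets automatically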
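
To avoid tripping the limit again once access returns, the search scripts could pause between retries. The helper below is a sketch of exponential backoff that prefers a server-supplied `Retry-After` value when one is present. The name `next_delay` and the default `base` and `cap` values are assumptions, not settings taken from the existing scripts.

```python
import random


def next_delay(attempt, retry_after=None, base=2.0, cap=3600.0):
    """Seconds to wait before retry number `attempt` (0-based).

    `retry_after` is the raw Retry-After header value, if the server sent one.
    """
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            # Retry-After can also be an HTTP date; fall back to backoff.
            pass
    delay = min(cap, base * (2 ** attempt))
    # Add up to 10% jitter so parallel scripts do not retry in lockstep.
    return min(cap, delay + random.uniform(0, delay * 0.1))
```

A caller would sleep for `next_delay(attempt, response_headers.get("Retry-After"))` seconds after each failed request and give up after a fixed number of attempts.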
### Search Strategy
- Use exact name matching only
- Focus on institutions named after notable people
- Major cities have better Wikidata coverage
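
One way to make "exact name matching" reproducible is to compare names after a fixed normalization. The sketch below treats case, accents, and extra whitespace as insignificant, which is an assumption about what "exact" should mean here. It does not equate spelling variants such as Rodulfo and Rudolph, so the two Philippi museums remain distinct. Both function names are hypothetical.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.casefold().split())


def is_exact_match(dataset_name: str, wikidata_label: str) -> bool:
    """True when both names are identical after normalization."""
    return normalize_name(dataset_name) == normalize_name(wikidata_label)
```

A normalized match is still only a candidate; the manual verification rule under Data Quality Standards applies to every hit.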
### Data Quality Standards
- Manual verification required for all matches
- No synthetic Q-numbers (CRITICAL POLICY)
- Document rationale in provenance metadata
---
## Quick Command Reference
```bash
# Navigate to project
cd /Users/kempersc/apps/glam
# Test rate limit status
python scripts/quick_wikidata_search_batch14.py
# If working, run comprehensive search
python scripts/manual_wikidata_search_batch14.py
# Review results
jq . scripts/batch14_manual_search_results.json
# Apply enrichment (after validation)
python scripts/enrich_chilean_batch14.py
# Check final coverage
python -c "import yaml; ..." # (see Step 5 above)
```
---
## Success Criteria
- Execute Batch 14 searches without rate limiting
- Find and validate 2 Q-numbers for remaining institutions
- Reach 70% coverage (63/90)
- Maintain zero false positives
- Document all matches with provenance
**Minimum Acceptable**: 1 additional match → 62/90 (68.9%)
**Target**: 2 additional matches → 63/90 (70.0%)
**Stretch Goal**: 3+ matches → 64+/90 (71%+)
---
**Ready to Resume**: ✅ All scripts prepared, waiting for rate limit reset
**Estimated Time**: 1-2 hours (once rate limits clear)
**Priority**: HIGH - Almost at target!