glam/RESUME_CHILEAN_ENRICHMENT.md
2025-11-19 23:25:22 +01:00

Chilean GLAM Wikidata Enrichment - Next Session Guide

Resume From: Batch 14 (Rate Limited)
Current Status: 61/90 (67.8%) coverage
Target: 63/90 (70%) coverage
Gap: 2 more institutions needed


What We Completed

Batch 13: Added 1 validated match (CONADI → Q21002896)
Coverage Increase: 66.7% → 67.8%
Files Created: chilean_institutions_batch13_enriched.yaml (production dataset)
Scripts Ready: Batch 14 search scripts prepared


What Blocked Progress

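Wikidata API Rate Limiting: HTTP 403 errors during Batch 14 searches. Throttling is usually signaled with HTTP 429, so a 403 may instead indicate an IP block or a request rejected for lacking a descriptive User-Agent header; check the scripts' request headers before assuming a pure rate limit.
Solution: Wait for the block to clear (24 hours assumed), then retest with Step 1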


Next Actions (When Resuming)

Step 1: Check Rate Limit Status

# Test if rate limit has reset
cd /Users/kempersc/apps/glam
python scripts/quick_wikidata_search_batch14.py

If still blocked → wait longer. If working → proceed to Step 2.
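
The blocked/working decision above can also be made from the HTTP status of the test request. The sketch below is a hypothetical helper, not part of the project's scripts: `classify_probe` is an invented name, and the status-code mapping follows general HTTP conventions (429 for throttling, `Retry-After` for the suggested wait) rather than anything confirmed about the Batch 14 failure.

```python
from typing import Mapping, Optional


def classify_probe(status: int, headers: Optional[Mapping[str, str]] = None) -> str:
    """Map the HTTP status of a test request to a next action (hypothetical helper)."""
    # Header names are case-insensitive, so normalize them first.
    headers = {k.lower(): v for k, v in (headers or {}).items()}
    if 200 <= status < 300:
        return "proceed to Step 2"
    if status == 429:
        # Throttled: the server may say how long to wait.
        wait = headers.get("retry-after")
        return f"wait {wait} seconds" if wait else "wait longer"
    if status == 403:
        # Rejected outright: possibly an IP block or a refused User-Agent.
        return "still blocked: wait longer and check the User-Agent header"
    return f"unexpected status {status}: inspect the response"


if __name__ == "__main__":
    for code in (200, 403, 429):
        print(code, "->", classify_probe(code, {"Retry-After": "60"}))
```

If `quick_wikidata_search_batch14.py` exposes the response status, a helper like this could replace reading the error output by eye.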

Step 2: Execute Targeted Searches

Focus on these three high-priority candidates:

  1. Museo Rodulfo Philippi (Chañaral) - Named after famous German-Chilean naturalist Rodolfo Amando Philippi (1808-1904)

    • Likelihood: HIGH - well-documented scientist
    • Search Terms: "Philippi", "Museo Philippi", "Rodolfo Amando Philippi"
  2. Museo Rudolph Philippi (Valdivia) - Same scientist, alternate spelling

    • Likelihood: HIGH - major city, better Wikidata coverage
    • Search Terms: "Rudolf Philippi Valdivia", "Museo Philippi"
  3. Instituto Alemán Puerto Montt - German school with heritage collections

    • Likelihood: MEDIUM - German schools often documented
    • Search Terms: "Deutsche Schule Puerto Montt", "Instituto Alemán"
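
The search terms above can be turned into requests against the Wikidata `wbsearchentities` API action. This is a minimal sketch under assumptions: whether the Batch 14 scripts use this endpoint is not stated here, `search_url` and `CANDIDATE_TERMS` are illustrative names, and Spanish (`es`) as the search language is a guess that may need to be repeated with `de` or `en`.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

# Search terms copied from the candidate list above.
CANDIDATE_TERMS = {
    "Museo Rodulfo Philippi (Chañaral)": [
        "Philippi", "Museo Philippi", "Rodolfo Amando Philippi"],
    "Museo Rudolph Philippi (Valdivia)": [
        "Rudolf Philippi Valdivia", "Museo Philippi"],
    "Instituto Alemán Puerto Montt": [
        "Deutsche Schule Puerto Montt", "Instituto Alemán"],
}


def search_url(term: str, language: str = "es", limit: int = 10) -> str:
    """Build a wbsearchentities request URL for one search term."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,  # assumption: Spanish labels first
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


if __name__ == "__main__":
    for institution, terms in CANDIDATE_TERMS.items():
        print(institution)
        for term in terms:
            print("  ", search_url(term))
```

The URLs can be opened in a browser for a quick check before running the scripted searches; any request sent from code should carry a descriptive User-Agent header.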

Step 3: Manual Verification

For any Q-numbers found:

  1. Visit https://www.wikidata.org/wiki/Q[number]
  2. Verify institution name matches
  3. Check location matches (city/region)
  4. Confirm institution type (museum/school)
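
The name, location, and type checks above can be recorded in a small comparison helper so that each decision is reproducible. The sketch below is illustrative only: `normalize` and `verification_report` are hypothetical names, the `name`/`city`/`type` keys are assumed, and the candidate values are expected to be transcribed by hand from the Wikidata page. It supplements manual verification and does not replace it.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def verification_report(local: dict, candidate: dict) -> dict:
    """Compare a local record with values transcribed from a Wikidata page."""
    checks = {
        field: normalize(local[field]) == normalize(candidate[field])
        for field in ("name", "city", "type")  # assumed field names
    }
    # Accept only when every individual check passed.
    checks["accept"] = all(checks.values())
    return checks


if __name__ == "__main__":
    print(normalize("Chañaral"))  # prints "chanaral"
```

A failed `name` check does not by itself rule out a match, since Wikidata labels are often longer than the local name; it only flags the pair for a closer look.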

Step 4: Apply Enrichment

Create and run an enrichment script modeled on enrich_chilean_batch13.py:

# Template command (adjust for validated matches)
python scripts/enrich_chilean_batch14.py

This should generate data/instances/chile/chilean_institutions_batch14_enriched.yaml, the file that Step 5 reads.
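
Since enrich_chilean_batch14.py does not exist yet, the core of such a script might look like the sketch below. It is not taken from enrich_chilean_batch13.py: only the `identifiers` list and the `identifier_scheme` key appear in this guide (in the Step 5 check), while `name`, `identifier_value`, and the `provenance` structure are assumed field names that should be adjusted to the real schema.

```python
from copy import deepcopy
from datetime import date


def add_wikidata_id(institutions: list, name: str, qid: str, rationale: str) -> list:
    """Return a copy of the dataset with a Wikidata identifier added to one record."""
    updated = deepcopy(institutions)  # never mutate the Batch 13 data in place
    matches = [inst for inst in updated if inst.get("name") == name]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one record named {name!r}, found {len(matches)}")
    record = matches[0]
    identifiers = record.get("identifiers") or []  # the key may be missing or None
    if any(i.get("identifier_scheme") == "Wikidata" for i in identifiers):
        raise ValueError(f"{name!r} already has a Wikidata identifier")
    identifiers.append({
        "identifier_scheme": "Wikidata",
        "identifier_value": qid,  # assumed field name
    })
    record["identifiers"] = identifiers
    record.setdefault("provenance", {})["wikidata_match"] = {  # assumed structure
        "rationale": rationale,
        "verified_on": date.today().isoformat(),
    }
    return updated
```

The real script would load the Batch 13 YAML, call a function like this once per validated match with the manually verified Q-number, and write the result to the Batch 14 file.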

Step 5: Verify Target Reached

# Check final coverage
python -c "
import yaml
with open('data/instances/chile/chilean_institutions_batch14_enriched.yaml', 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)
total = len(data)
# Count records that carry at least one Wikidata identifier
with_wd = sum(1 for inst in data if any(
    ident.get('identifier_scheme') == 'Wikidata' for ident in inst.get('identifiers') or []))
print(f'Coverage: {with_wd}/{total} ({(with_wd/total)*100:.1f}%)')
print('✓ TARGET REACHED' if with_wd >= 63 else f'Need {63-with_wd} more')
"

Alternative Strategies (If Searches Fail)

Option A: Accept Current Coverage

67.8% is strong coverage given:

  • Many small regional museums lack Wikidata entries
  • This is a global pattern (not a Chile-specific issue)
  • Museum coverage is excellent (87.2%)

Option B: Create Wikidata Entries

For notable institutions lacking coverage (e.g., Museo Rodulfo Philippi):

  1. Research institution history and significance
  2. Create Wikidata entry following notability guidelines
  3. Add to Chilean dataset with newly minted Q-number

Time Investment: ~2-4 hours per institution

Option C: Focus on Other Datasets

Move to other Latin American countries with:

  • Larger institution counts (Brazil, Mexico, Argentina)
  • Better baseline Wikidata coverage
  • More well-documented national museums

Key Files Reference

Primary Dataset (Use This)

data/instances/chile/chilean_institutions_batch13_enriched.yaml

  • 90 institutions, 61 with Wikidata (67.8%)

Search Scripts

  • scripts/manual_wikidata_search_batch14.py - SPARQL-based (comprehensive)
  • scripts/quick_wikidata_search_batch14.py - API-based (faster)

Previous Results

  • scripts/batch13_manual_search_results.json - Batch 13 search output
  • scripts/batch14_quick_search_results.json - Empty (rate limited)

Documentation

  • docs/chilean_enrichment_batch13_14_report.md - Full session report

Expected Outcomes

Best Case (2 Matches Found)

  • Philippi museums have Wikidata entries → 63/90 (70.0%) TARGET REACHED

Likely Case (1 Match Found)

  • One Philippi museum found → 62/90 (68.9%) - Close to target

Worst Case (0 Matches Found)

  • Stay at 61/90 (67.8%) - Accept as strong coverage
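
The percentages in these three scenarios follow from simple arithmetic on the 61/90 baseline. A short sketch for recomputing them if the dataset size or target changes (the function names are illustrative):

```python
def coverage_pct(matched: int, total: int) -> float:
    """Coverage as a percentage, rounded to one decimal place."""
    return round(100 * matched / total, 1)


def matches_needed(matched: int, total: int, target_pct: int) -> int:
    """Additional matches required to reach a whole-number percentage target."""
    required = -(-target_pct * total // 100)  # integer ceiling division
    return max(0, required - matched)


if __name__ == "__main__":
    for extra in (0, 1, 2):
        print(f"{61 + extra}/90 -> {coverage_pct(61 + extra, 90)}%")
    # prints 61/90 -> 67.8%, 62/90 -> 68.9%, 63/90 -> 70.0%
    print("needed for 70%:", matches_needed(61, 90, 70))  # prints 2
```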

Technical Notes

Rate Limit Recovery

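  • The block is assumed to clear within about 24 hours; this is a working estimate from this session, not a documented Wikidata reset interval
  • IP-based blocking, not account-based
  • No action needed beyond waiting; confirm with the Step 1 test before rerunning the searches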
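
To avoid triggering another block once access returns, the search scripts could space out retries instead of repeating failed requests immediately. The sketch below shows one common approach, exponential backoff that defers to a numeric `Retry-After` value when the server sends one. The function name and the default delays are assumptions, not values taken from the existing scripts or from Wikidata documentation.

```python
from typing import Optional


def wait_seconds(attempt: int, retry_after: Optional[str] = None,
                 base: float = 60.0, cap: float = 3600.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Prefer the server's own hint when it is a plain number of seconds.
    if retry_after and retry_after.strip().isdigit():
        return float(retry_after.strip())
    # Otherwise double the delay on each attempt, up to the cap.
    return min(cap, base * 2 ** attempt)


if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, wait_seconds(attempt))
```

A `Retry-After` header given as an HTTP date is ignored here and falls back to the exponential schedule.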

Search Strategy

  • Accept only exact name matches (the broader search terms in Step 2 are for finding candidates, not for accepting them)
  • Focus on institutions named after notable people
  • Major cities have better Wikidata coverage

Data Quality Standards

  • Manual verification required for all matches
  • No synthetic Q-numbers (CRITICAL POLICY)
  • Document rationale in provenance metadata
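
These standards can be partly enforced with a pre-flight check before a match is written to the dataset. The sketch below uses hypothetical names (`quality_problems` and the `qid`, `manually_verified`, and `rationale` keys). It can only catch malformed identifiers and missing paperwork; it cannot prove that a well-formed Q-number is real, so the manual check of the Wikidata page remains mandatory.

```python
import re

# Wikidata item IDs are "Q" followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def quality_problems(match: dict) -> list:
    """List the reasons a proposed match fails the data quality standards."""
    problems = []
    if not QID_PATTERN.fullmatch(str(match.get("qid", ""))):
        problems.append("qid is not of the form Q<digits>")
    if not match.get("manually_verified"):
        problems.append("match has not been manually verified")
    if not str(match.get("rationale", "")).strip():
        problems.append("no rationale recorded for provenance")
    return problems


if __name__ == "__main__":
    # The Batch 13 match recorded in this guide, with an illustrative rationale.
    print(quality_problems({
        "qid": "Q21002896",
        "manually_verified": True,
        "rationale": "name, location and type confirmed on the Wikidata page",
    }))  # prints []
```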

Quick Command Reference

# Navigate to project
cd /Users/kempersc/apps/glam

# Test rate limit status
python scripts/quick_wikidata_search_batch14.py

# If working, run comprehensive search
python scripts/manual_wikidata_search_batch14.py

# Review results
jq . scripts/batch14_manual_search_results.json

# Apply enrichment (after validation)
python scripts/enrich_chilean_batch14.py

# Check final coverage
python -c "import yaml; ..."  # (see Step 5 above)

Success Criteria

  • Execute Batch 14 searches without rate limiting
  • Find and validate 2 Q-numbers for remaining institutions
  • Reach 70% coverage (63/90)
  • Maintain zero false positives
  • Document all matches with provenance

Minimum Acceptable: 1 additional match → 62/90 (68.9%)
Target: 2 additional matches → 63/90 (70.0%)
Stretch Goal: 3+ matches → 64+/90 (71%+)


Ready to Resume: All scripts prepared, waiting for rate limit reset
Estimated Time: 1-2 hours (once rate limits clear)
Priority: HIGH - Almost at target!