CONABIP Scraper - Completion Instructions
Background Process Running
A background scraper is currently running to collect the FULL enhanced dataset with coordinates and services for all 288 Argentine popular libraries.
Started: November 17, 2025 at 18:00
Expected completion: ~10-12 minutes (around 18:10-18:12)
Output files:
- data/isil/AR/conabip_libraries_enhanced_FULL.csv
- data/isil/AR/conabip_libraries_enhanced_FULL.json
How to Check Status
cd /Users/kempersc/apps/glam
# Quick status check
./check_scraper_status.sh
# Live progress monitoring
tail -f /tmp/scraper_output.log
# Check if process is still running
pgrep -f "scrape_conabip_full.py"
When Complete - Verification Steps
Step 1: Verify Output Files Exist
ls -lh data/isil/AR/conabip_libraries_enhanced_FULL.*
Expected:
- CSV file: ~60-80 KB
- JSON file: ~180-220 KB
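The two checks above can be combined into a small script. This is a hypothetical helper, not part of the repo; the paths and byte ranges come from the expectations listed above.

```python
import os

# Expected output files and plausible size ranges in bytes (from the
# estimates in this document: CSV ~60-80 KB, JSON ~180-220 KB).
EXPECTED_SIZES = {
    "data/isil/AR/conabip_libraries_enhanced_FULL.csv": (60_000, 80_000),
    "data/isil/AR/conabip_libraries_enhanced_FULL.json": (180_000, 220_000),
}

def check_sizes(expected=EXPECTED_SIZES):
    """Return a status per path: MISSING, OK, or SUSPECT (with the size)."""
    results = {}
    for path, (lo, hi) in expected.items():
        if not os.path.exists(path):
            results[path] = "MISSING"
        else:
            size = os.path.getsize(path)
            results[path] = "OK" if lo <= size <= hi else f"SUSPECT ({size} bytes)"
    return results

if __name__ == "__main__":
    for path, status in check_sizes().items():
        print(f"{path}: {status}")
```

A size outside the range is not necessarily an error (the estimates are rough), but it is worth a manual look before moving on to Step 2.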
Step 2: Verify Record Count
python3 << 'PYEOF'
import json

with open('data/isil/AR/conabip_libraries_enhanced_FULL.json', 'r') as f:
    data = json.load(f)

print(f"Total institutions: {data['metadata']['total_institutions']}")
print(f"With coordinates: {data['metadata'].get('institutions_with_coordinates', 0)}")
print(f"With services: {data['metadata'].get('institutions_with_services', 0)}")

# Check that the first institution has the enhanced fields
inst = data['institutions'][0]
print(f"\nSample institution: {inst['name']}")
print(f"  Has latitude: {'latitude' in inst and inst['latitude'] is not None}")
print(f"  Has longitude: {'longitude' in inst and inst['longitude'] is not None}")
print(f"  Has maps_url: {'maps_url' in inst}")
print(f"  Has services: {'services' in inst}")
PYEOF
Expected Output:
Total institutions: 288
With coordinates: 280-288 (most should have coordinates)
With services: 180-220 (not all have services, ~60-75%)
Sample institution: Biblioteca Popular Helena Larroque de Roffo
Has latitude: True
Has longitude: True
Has maps_url: True
Has services: True
Step 3: Validate Data Quality
python3 << 'PYEOF'
import json
from collections import Counter

with open('data/isil/AR/conabip_libraries_enhanced_FULL.json', 'r') as f:
    data = json.load(f)
institutions = data['institutions']

# Count by province
provinces = Counter(i['province'] for i in institutions if i.get('province'))
print("=== TOP 5 PROVINCES ===")
for prov, count in provinces.most_common(5):
    print(f"{prov}: {count}")

# Check coordinate coverage
with_coords = sum(1 for i in institutions if i.get('latitude'))
print("\n=== COORDINATE COVERAGE ===")
print(f"Institutions with coordinates: {with_coords}/{len(institutions)} ({100*with_coords/len(institutions):.1f}%)")

# Check coordinate validity. Argentina spans roughly lat -55 to -22,
# lon -73 to -53; the bounds below add a small margin.
invalid_coords = []
for i in institutions:
    lat = i.get('latitude')
    lon = i.get('longitude')
    if lat and lon:
        if not (-55 <= lat <= -20 and -75 <= lon <= -50):
            invalid_coords.append(i['name'])

if invalid_coords:
    print(f"\n⚠️ WARNING: {len(invalid_coords)} institutions have suspicious coordinates:")
    for name in invalid_coords[:5]:
        print(f"  - {name}")
else:
    print("\n✅ All coordinates appear valid (within Argentina bounds)")
    print("\n=== DATA QUALITY: EXCELLENT ===")
PYEOF
Next Steps After Verification
Once you confirm the data is complete and valid:
1. Parse into LinkML HeritageCustodian Format
# Create parser script (to be implemented)
python3 scripts/parse_conabip_to_linkml.py
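Since the parser script does not exist yet, a minimal sketch of what it might do follows. The HeritageCustodian slot names below are assumptions, not the real LinkML schema; adjust them to match the actual model.

```python
import json

def to_heritage_custodian(inst):
    """Map one scraped CONABIP record to a HeritageCustodian-style dict.
    Field names here are placeholders for the real LinkML slots."""
    return {
        "name": inst.get("name"),
        "country": "AR",
        "province": inst.get("province"),
        "city": inst.get("city"),
        "address": inst.get("address"),
        "latitude": inst.get("latitude"),
        "longitude": inst.get("longitude"),
        "custodian_type": "library",
    }

def parse(in_path, out_path):
    """Read the enhanced FULL JSON and write HeritageCustodian records."""
    with open(in_path, "r") as f:
        data = json.load(f)
    records = [to_heritage_custodian(i) for i in data["institutions"]]
    with open(out_path, "w") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return len(records)
```

For a real implementation, validate the output against the schema with the LinkML tooling rather than trusting the mapping by hand.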
2. Generate GHCIDs
# Format: AR-{Province}-{City}-L-{Abbrev}
# Example: AR-BA-BAIRES-L-HLRF (Buenos Aires, Capital, Helena Larroque de Roffo)
3. Enrich with Wikidata
# Query Wikidata for Q-numbers using name + location matching
python3 scripts/enrich_argentina_with_wikidata.py
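One way to do the name lookup is Wikidata's wbsearchentities endpoint; a hedged sketch follows (the enrichment script itself may work differently). It takes the first hit, which is naive: a production version should confirm the match via location properties before accepting it.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def build_search_url(name, language="es"):
    """URL for a wbsearchentities label/alias search on Wikidata."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "format": "json",
        "type": "item",
        "limit": "5",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)

def search_qid(name):
    """Return the first matching Q-number, or None.
    Naive: does not verify the hit's location against the library's."""
    req = urllib.request.Request(
        build_search_url(name),
        headers={"User-Agent": "glam-ar-enrichment/0.1 (contact in repo)"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        results = json.load(resp).get("search", [])
    return results[0]["id"] if results else None
```

Wikidata asks API clients to send a descriptive User-Agent and to rate-limit; batch the 288 lookups with a short sleep between requests.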
4. Export to RDF/JSON-LD
# Generate semantic web exports
python3 scripts/export_heritage_custodians.py --country AR --format jsonld
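For the JSON-LD side, a minimal sketch of mapping one record to a schema.org Library node is shown below; the context and field mapping are assumptions, since the export script's actual vocabulary is not specified here.

```python
def to_jsonld(inst):
    """Map one library record to a schema.org Library node (assumed mapping)."""
    node = {
        "@context": "https://schema.org",
        "@type": "Library",
        "name": inst.get("name"),
        "address": {
            "@type": "PostalAddress",
            "addressCountry": "AR",
            "addressRegion": inst.get("province"),
            "addressLocality": inst.get("city"),
            "streetAddress": inst.get("address"),
        },
    }
    # Only emit geo when both coordinates are present.
    if inst.get("latitude") is not None and inst.get("longitude") is not None:
        node["geo"] = {
            "@type": "GeoCoordinates",
            "latitude": inst["latitude"],
            "longitude": inst["longitude"],
        }
    return node
```

Once GHCIDs and Q-numbers exist, they would naturally become the node's `@id` and a `sameAs` link respectively.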
Troubleshooting
Problem: Scraper Still Running After 20 Minutes
# Check if it's stuck
tail -f /tmp/scraper_output.log
# If stuck on one institution for >2 minutes, kill and retry
pkill -f scrape_conabip_full.py
# Modify scraper to skip problematic institutions
# (would need code changes)
Problem: Output Files Missing
# Check log for errors
grep ERROR /tmp/scraper_output.log
# Check if scraper completed
grep "COMPLETE" /tmp/scraper_output.log
# If incomplete, rerun
python3 /tmp/scrape_conabip_full.py
Problem: Incomplete Data (< 288 institutions)
- Check CONABIP website to see if data changed
- Review scraper logs for parsing errors
- May need to adjust scraper logic
Critical Data Requirements Met
For GLAM project integration, we need:
- ✅ Institution names
- ✅ Provinces and cities
- ✅ Street addresses
- ✅ Geographic coordinates (CRITICAL - now being collected)
- ✅ Services metadata (bonus)
- ⏳ Wikidata Q-numbers (next enrichment step)
- ⏳ GHCIDs (next generation step)
Status: Background scraper will fulfill ALL data requirements.
Created: 2025-11-17 18:01
Check back: After 18:12 (or run ./check_scraper_status.sh)