# CONABIP Scraper - Completion Instructions

## Background Process Running

A background scraper is currently running to collect the **FULL enhanced dataset** with coordinates and services for all 288 Argentine popular libraries.

**Started**: November 17, 2025 at 18:00
**Expected completion**: ~10-12 minutes (around 18:10-18:12)
**Output files**:

- `data/isil/AR/conabip_libraries_enhanced_FULL.csv`
- `data/isil/AR/conabip_libraries_enhanced_FULL.json`

## How to Check Status

```bash
cd /Users/kempersc/apps/glam

# Quick status check
./check_scraper_status.sh

# Live progress monitoring
tail -f /tmp/scraper_output.log

# Check if process is still running
pgrep -f "scrape_conabip_full.py"
```

## When Complete - Verification Steps

### Step 1: Verify Output Files Exist

```bash
ls -lh data/isil/AR/conabip_libraries_enhanced_FULL.*
```

**Expected**:

- CSV file: ~60-80 KB
- JSON file: ~180-220 KB

### Step 2: Verify Record Count

```bash
python3 << 'PYEOF'
import json

with open('data/isil/AR/conabip_libraries_enhanced_FULL.json', 'r') as f:
    data = json.load(f)

print(f"Total institutions: {data['metadata']['total_institutions']}")
print(f"With coordinates: {data['metadata'].get('institutions_with_coordinates', 0)}")
print(f"With services: {data['metadata'].get('institutions_with_services', 0)}")

# Check that the first institution has enhanced data
inst = data['institutions'][0]
print(f"\nSample institution: {inst['name']}")
print(f"  Has latitude: {'latitude' in inst and inst['latitude'] is not None}")
print(f"  Has longitude: {'longitude' in inst and inst['longitude'] is not None}")
print(f"  Has maps_url: {'maps_url' in inst}")
print(f"  Has services: {'services' in inst}")
PYEOF
```

**Expected Output**:

```
Total institutions: 288
With coordinates: 280-288 (most should have coordinates)
With services: 180-220 (not all have services, ~60-75%)

Sample institution: Biblioteca Popular Helena Larroque de Roffo
  Has latitude: True
  Has longitude: True
  Has maps_url: True
  Has services: True
```

### Step 3: Validate Data Quality

```bash
python3 << 'PYEOF'
import json
from collections import Counter

with open('data/isil/AR/conabip_libraries_enhanced_FULL.json', 'r') as f:
    data = json.load(f)

institutions = data['institutions']

# Count by province
provinces = Counter(i['province'] for i in institutions if i.get('province'))
print("=== TOP 5 PROVINCES ===")
for prov, count in provinces.most_common(5):
    print(f"{prov}: {count}")

# Check coordinate coverage
with_coords = sum(1 for i in institutions if i.get('latitude'))
print(f"\n=== COORDINATE COVERAGE ===")
print(f"Institutions with coordinates: {with_coords}/{len(institutions)} ({100*with_coords/len(institutions):.1f}%)")

# Check coordinate validity (should fall within Argentina's bounding box)
invalid_coords = []
for i in institutions:
    lat = i.get('latitude')
    lon = i.get('longitude')
    if lat and lon:
        # Argentina roughly: lat -55 to -20, lon -75 to -50
        if not (-55 <= lat <= -20 and -75 <= lon <= -50):
            invalid_coords.append(i['name'])

if invalid_coords:
    print(f"\n⚠️ WARNING: {len(invalid_coords)} institutions have suspicious coordinates:")
    for name in invalid_coords[:5]:
        print(f"  - {name}")
else:
    print("\n✅ All coordinates appear valid (within Argentina bounds)")
    print("\n=== DATA QUALITY: EXCELLENT ===")
PYEOF
```

## Next Steps After Verification

Once you confirm the data is complete and valid:

### 1. Parse into LinkML HeritageCustodian Format

```bash
# Create parser script (to be implemented)
python3 scripts/parse_conabip_to_linkml.py
```

### 2. Generate GHCIDs

```bash
# Format: AR-{Province}-{City}-L-{Abbrev}
# Example: AR-BA-BAIRES-L-HLRF (Buenos Aires, Capital, Helena Larroque de Roffo)
```

### 3. Enrich with Wikidata

```bash
# Query Wikidata for Q-numbers using name + location matching
python3 scripts/enrich_argentina_with_wikidata.py
```
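The GHCID generation in step 2 could be sketched as a small helper. This is an illustrative guess, not the project's actual rule: the abbreviation logic here (initials of the non-generic name words) produces `HLR` for the sample library, while the documented example shows `HLRF`, so the real scheme evidently differs in detail.

```python
# Hypothetical sketch of a GHCID generator for the
# AR-{Province}-{City}-L-{Abbrev} format described above.
# The abbreviation rule (initials of the name, skipping generic
# words like "Biblioteca Popular") is an assumption.
STOPWORDS = {"biblioteca", "popular", "de", "del", "la", "el", "y"}

def make_ghcid(province_code: str, city_code: str, name: str) -> str:
    """Build an identifier like AR-BA-BAIRES-L-HLR from institution fields."""
    initials = "".join(
        w[0].upper() for w in name.split() if w.lower() not in STOPWORDS
    )
    return f"AR-{province_code}-{city_code}-L-{initials}"

print(make_ghcid("BA", "BAIRES", "Biblioteca Popular Helena Larroque de Roffo"))
# → AR-BA-BAIRES-L-HLR
```

Whatever the final rule, keeping it deterministic (pure function of the record's fields) makes GHCIDs reproducible across reruns of the parser.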
### 4. Export to RDF/JSON-LD

```bash
# Generate semantic web exports
python3 scripts/export_heritage_custodians.py --country AR --format jsonld
```

## Troubleshooting

### Problem: Scraper Still Running After 20 Minutes

```bash
# Check if it's stuck
tail -f /tmp/scraper_output.log

# If stuck on one institution for >2 minutes, kill and retry
pkill -f scrape_conabip_full.py

# Modify scraper to skip problematic institutions
# (would need code changes)
```

### Problem: Output Files Missing

```bash
# Check log for errors
grep ERROR /tmp/scraper_output.log

# Check if scraper completed
grep "COMPLETE" /tmp/scraper_output.log

# If incomplete, rerun
python3 /tmp/scrape_conabip_full.py
```

### Problem: Incomplete Data (< 288 institutions)

- Check the CONABIP website to see if the data changed
- Review scraper logs for parsing errors
- May need to adjust scraper logic

## Critical Data Requirements Met

For GLAM project integration, we need:

- ✅ Institution names
- ✅ Provinces and cities
- ✅ Street addresses
- ✅ **Geographic coordinates** (CRITICAL - now being collected)
- ✅ Services metadata (bonus)
- ⏳ Wikidata Q-numbers (next enrichment step)
- ⏳ GHCIDs (next generation step)

**Status**: The background scraper will fulfill ALL data requirements.

---

**Created**: 2025-11-17 18:01
**Check back**: After 18:12 (or run `./check_scraper_status.sh`)
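The JSON-LD export in step 4 could map each scraped record onto a schema.org `Library` node. This is a minimal sketch, not the actual `export_heritage_custodians.py` implementation; the input field names (`name`, `province`, `city`, `latitude`, `longitude`) match the scraper output described above, and the sample values are hypothetical.

```python
import json

def institution_to_jsonld(inst: dict) -> dict:
    """Sketch: map one CONABIP record onto a schema.org Library node.

    Assumes the flat field names from the scraper's JSON output;
    GHCIDs and Wikidata Q-numbers would be added by the later
    enrichment steps and are omitted here.
    """
    node = {
        "@context": "https://schema.org",
        "@type": "Library",
        "name": inst["name"],
        "address": {
            "@type": "PostalAddress",
            "addressCountry": "AR",
            "addressRegion": inst.get("province"),
            "addressLocality": inst.get("city"),
        },
    }
    # Only emit geo when both coordinates are present
    if inst.get("latitude") is not None and inst.get("longitude") is not None:
        node["geo"] = {
            "@type": "GeoCoordinates",
            "latitude": inst["latitude"],
            "longitude": inst["longitude"],
        }
    return node

sample = {
    "name": "Biblioteca Popular Helena Larroque de Roffo",
    "province": "Buenos Aires",
    "city": "Castelar",   # hypothetical value
    "latitude": -34.65,   # hypothetical value
    "longitude": -58.66,  # hypothetical value
}
print(json.dumps(institution_to_jsonld(sample), ensure_ascii=False, indent=2))
```

Emitting `geo` only when both coordinates exist keeps partially scraped records valid JSON-LD rather than producing half-empty `GeoCoordinates` nodes.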