CONABIP Scraper - Completion Instructions

Background Process Running

A background scraper is currently running to collect the FULL enhanced dataset with coordinates and services for all 288 Argentine popular libraries.

Started: November 17, 2025 at 18:00
Expected completion: ~10-12 minutes (around 18:10-18:12)
Output files:

  • data/isil/AR/conabip_libraries_enhanced_FULL.csv
  • data/isil/AR/conabip_libraries_enhanced_FULL.json
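
The JSON output follows this general shape. The field names below are inferred from the verification scripts later in this document; the sample values are purely illustrative:

```json
{
  "metadata": {
    "total_institutions": 288,
    "institutions_with_coordinates": 285,
    "institutions_with_services": 200
  },
  "institutions": [
    {
      "name": "Biblioteca Popular Helena Larroque de Roffo",
      "province": "Buenos Aires",
      "latitude": -34.6,
      "longitude": -58.5,
      "maps_url": "https://...",
      "services": ["..."]
    }
  ]
}
```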

How to Check Status

cd /Users/kempersc/apps/glam

# Quick status check
./check_scraper_status.sh

# Live progress monitoring
tail -f /tmp/scraper_output.log

# Check if process is still running
pgrep -f "scrape_conabip_full.py"
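
The core of the status check can be sketched as a small shell helper. This is a hypothetical reconstruction; the real check_scraper_status.sh may work differently. The "COMPLETE" marker is the same one the troubleshooting section greps for:

```shell
# log_status: report DONE once the log's last line marks completion,
# RUNNING otherwise. Hypothetical helper, not the actual script.
log_status() {
  last=$(tail -n 1 "$1" 2>/dev/null)
  case "$last" in
    *COMPLETE*) echo "DONE" ;;
    *)          echo "RUNNING" ;;
  esac
}

# Example: log_status /tmp/scraper_output.log
```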

When Complete - Verification Steps

Step 1: Verify Output Files Exist

ls -lh data/isil/AR/conabip_libraries_enhanced_FULL.*

Expected:

  • CSV file: ~60-80 KB
  • JSON file: ~180-220 KB

Step 2: Verify Record Count

python3 << 'PYEOF'
import json
with open('data/isil/AR/conabip_libraries_enhanced_FULL.json', 'r') as f:
    data = json.load(f)
    
print(f"Total institutions: {data['metadata']['total_institutions']}")
print(f"With coordinates: {data['metadata'].get('institutions_with_coordinates', 0)}")
print(f"With services: {data['metadata'].get('institutions_with_services', 0)}")

# Check first institution has enhanced data
inst = data['institutions'][0]
print(f"\nSample institution: {inst['name']}")
print(f"  Has latitude: {'latitude' in inst and inst['latitude'] is not None}")
print(f"  Has longitude: {'longitude' in inst and inst['longitude'] is not None}")
print(f"  Has maps_url: {'maps_url' in inst}")
print(f"  Has services: {'services' in inst}")
PYEOF

Expected Output:

Total institutions: 288
With coordinates: 280-288  (most should have coordinates)
With services: 180-220     (not all have services, ~60-75%)

Sample institution: Biblioteca Popular Helena Larroque de Roffo
  Has latitude: True
  Has longitude: True
  Has maps_url: True
  Has services: True

Step 3: Validate Data Quality

python3 << 'PYEOF'
import json
with open('data/isil/AR/conabip_libraries_enhanced_FULL.json', 'r') as f:
    data = json.load(f)

institutions = data['institutions']

# Count by province
from collections import Counter
provinces = Counter(i['province'] for i in institutions if i.get('province'))

print("=== TOP 5 PROVINCES ===")
for prov, count in provinces.most_common(5):
    print(f"{prov}: {count}")

# Check coordinate coverage
with_coords = sum(1 for i in institutions if i.get('latitude'))
print(f"\n=== COORDINATE COVERAGE ===")
print(f"Institutions with coordinates: {with_coords}/{len(institutions)} ({100*with_coords/len(institutions):.1f}%)")

# Check coordinate validity (should be in Argentina range)
invalid_coords = []
for i in institutions:
    lat = i.get('latitude')
    lon = i.get('longitude')
    if lat and lon:
        # Argentina roughly: lat -55 to -22, lon -73 to -53
        # (bounds below include a small margin)
        if not (-55 <= lat <= -20 and -75 <= lon <= -50):
            invalid_coords.append(i['name'])

if invalid_coords:
    print(f"\n⚠  WARNING: {len(invalid_coords)} institutions have suspicious coordinates:")
    for name in invalid_coords[:5]:
        print(f"  - {name}")
else:
    print("\n✅ All coordinates appear valid (within Argentina bounds)")

print("\n=== DATA QUALITY CHECK COMPLETE ===")
PYEOF

Next Steps After Verification

Once you confirm the data is complete and valid:

1. Parse into LinkML HeritageCustodian Format

# Create parser script (to be implemented)
python3 scripts/parse_conabip_to_linkml.py
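
Since the parser is still to be implemented, here is a minimal sketch of what it might do. The HeritageCustodian slot names below are guesses; the actual names come from the project's LinkML schema:

```python
import json

# Hypothetical sketch of parse_conabip_to_linkml.py.
# Field names on the output side are assumptions, not the real schema.
def to_heritage_custodian(inst: dict) -> dict:
    return {
        "name": inst["name"],
        "country": "AR",
        "province": inst.get("province"),
        "city": inst.get("city"),
        "address": inst.get("address"),
        "latitude": inst.get("latitude"),
        "longitude": inst.get("longitude"),
        "services": inst.get("services", []),
    }

def parse_file(in_path: str, out_path: str) -> int:
    with open(in_path) as f:
        data = json.load(f)
    records = [to_heritage_custodian(i) for i in data["institutions"]]
    with open(out_path, "w") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return len(records)
```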

2. Generate GHCIDs

# Format: AR-{Province}-{City}-L-{Abbrev}
# Example: AR-BA-BAIRES-L-HLRF (Buenos Aires, Capital, Helena Larroque de Roffo)
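
One way to build IDs in this format is an initials heuristic like the sketch below. Note the worked example above ends in HLRF, so the project's actual abbreviation rule evidently differs from this simple heuristic; the province/city codes are also passed in rather than derived:

```python
import re

# Hypothetical GHCID builder for the AR-{Province}-{City}-L-{Abbrev} format.
# The real abbreviation and code tables are a project decision.
STOPWORDS = {"de", "del", "la", "las", "los", "y"}

def make_ghcid(province_code: str, city_code: str, name: str) -> str:
    # Drop a generic "Biblioteca Popular" prefix, then take initials
    # of the remaining significant words.
    stripped = re.sub(r"^Biblioteca Popular\s+", "", name, flags=re.I)
    initials = "".join(
        w[0].upper() for w in stripped.split() if w.lower() not in STOPWORDS
    )
    return f"AR-{province_code}-{city_code}-L-{initials}"

# make_ghcid("BA", "BAIRES", "Biblioteca Popular Helena Larroque de Roffo")
```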

3. Enrich with Wikidata

# Query Wikidata for Q-numbers using name + location matching
python3 scripts/enrich_argentina_with_wikidata.py
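
The matching query can be sketched as a SPARQL string builder (this is an assumption about how the enrichment script works, not its actual code; Q7075 is Wikidata's "library" class, P17 is country, Q414 is Argentina):

```python
# Hypothetical query builder for name-based Wikidata matching.
def build_sparql(name: str) -> str:
    safe = name.replace('"', '\\"')
    return f'''
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P31/wdt:P279* wd:Q7075 ;   # instance of (a subclass of) library
        wdt:P17 wd:Q414 ;              # country: Argentina
        rdfs:label ?itemLabel .
  FILTER(LANG(?itemLabel) = "es")
  FILTER(CONTAINS(LCASE(?itemLabel), LCASE("{safe}")))
}}
LIMIT 5
'''
```

The resulting query can be POSTed to https://query.wikidata.org/sparql; location-based filtering (e.g. comparing P625 coordinates) would further reduce false matches.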

4. Export to RDF/JSON-LD

# Generate semantic web exports
python3 scripts/export_heritage_custodians.py --country AR --format jsonld
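
A per-record JSON-LD mapping might look like the sketch below, using the schema.org Library type; the real export script's context and field choices may differ:

```python
# Hypothetical JSON-LD mapping for one institution record.
def to_jsonld(inst: dict) -> dict:
    doc = {
        "@context": "https://schema.org",
        "@type": "Library",
        "name": inst["name"],
        "address": {
            "@type": "PostalAddress",
            "addressCountry": "AR",
            "addressRegion": inst.get("province"),
            "addressLocality": inst.get("city"),
        },
    }
    # Only emit geo when both coordinates are present
    if inst.get("latitude") is not None and inst.get("longitude") is not None:
        doc["geo"] = {
            "@type": "GeoCoordinates",
            "latitude": inst["latitude"],
            "longitude": inst["longitude"],
        }
    return doc
```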

Troubleshooting

Problem: Scraper Still Running After 20 Minutes

# Check if it's stuck
tail -f /tmp/scraper_output.log

# If stuck on one institution for >2 minutes, kill and retry
pkill -f scrape_conabip_full.py

# Modify scraper to skip problematic institutions
# (would need code changes)

Problem: Output Files Missing

# Check log for errors
grep ERROR /tmp/scraper_output.log

# Check if scraper completed
grep "COMPLETE" /tmp/scraper_output.log

# If incomplete, rerun
python3 /tmp/scrape_conabip_full.py

Problem: Incomplete Data (< 288 institutions)

  • Check CONABIP website to see if data changed
  • Review scraper logs for parsing errors
  • May need to adjust scraper logic

Critical Data Requirements Met

For GLAM project integration, we need:

  • Institution names
  • Provinces and cities
  • Street addresses
  • Geographic coordinates (CRITICAL - now being collected)
  • Services metadata (bonus)
  • Wikidata Q-numbers (next enrichment step)
  • GHCIDs (next generation step)

Status: Background scraper will fulfill ALL data requirements.


Created: 2025-11-17 18:01
Check back: after 18:12 (or run ./check_scraper_status.sh)