# CONABIP Scraper - Completion Instructions
## Background Process Running
A background scraper is currently running to collect the **FULL enhanced dataset** with coordinates and services for all 288 Argentine popular libraries.
**Started**: November 17, 2025 at 18:00
**Expected completion**: ~10-12 minutes (around 18:10-18:12)
**Output files**:
- `data/isil/AR/conabip_libraries_enhanced_FULL.csv`
- `data/isil/AR/conabip_libraries_enhanced_FULL.json`
## How to Check Status
```bash
cd /Users/kempersc/apps/glam
# Quick status check
./check_scraper_status.sh
# Live progress monitoring
tail -f /tmp/scraper_output.log
# Check if process is still running
pgrep -f "scrape_conabip_full.py"
```
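The contents of `check_scraper_status.sh` are not shown here. If the scraper logs per-institution progress, a small Python helper can summarize it; this is only a sketch, and the `N/288` progress marker it looks for is an assumption about the log format, not a documented fact.

```python
import re

TOTAL = 288  # expected number of CONABIP popular libraries

def parse_progress(lines):
    """Return the highest 'N/288'-style count found in the log lines, or 0.

    Assumes the scraper prints lines like 'Processing 42/288 ...';
    adjust the pattern to match the real log output.
    """
    done = 0
    for line in lines:
        m = re.search(rf"(\d+)\s*/\s*{TOTAL}", line)
        if m:
            done = max(done, int(m.group(1)))
    return done

if __name__ == "__main__":
    try:
        with open("/tmp/scraper_output.log") as f:
            done = parse_progress(f)
        print(f"Progress: {done}/{TOTAL} ({100 * done / TOTAL:.1f}%)")
    except FileNotFoundError:
        print("Log not found; is the scraper running?")
```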
## When Complete - Verification Steps
### Step 1: Verify Output Files Exist
```bash
ls -lh data/isil/AR/conabip_libraries_enhanced_FULL.*
```
**Expected**:
- CSV file: ~60-80 KB
- JSON file: ~180-220 KB
### Step 2: Verify Record Count
```bash
python3 << 'PYEOF'
import json

with open('data/isil/AR/conabip_libraries_enhanced_FULL.json', 'r') as f:
    data = json.load(f)

print(f"Total institutions: {data['metadata']['total_institutions']}")
print(f"With coordinates: {data['metadata'].get('institutions_with_coordinates', 0)}")
print(f"With services: {data['metadata'].get('institutions_with_services', 0)}")

# Check that the first institution has the enhanced fields
inst = data['institutions'][0]
print(f"\nSample institution: {inst['name']}")
print(f"  Has latitude: {inst.get('latitude') is not None}")
print(f"  Has longitude: {inst.get('longitude') is not None}")
print(f"  Has maps_url: {'maps_url' in inst}")
print(f"  Has services: {'services' in inst}")
PYEOF
```
**Expected Output**:
```
Total institutions: 288
With coordinates: 280-288 (most should have coordinates)
With services: 180-220 (not all have services, ~60-75%)
Sample institution: Biblioteca Popular Helena Larroque de Roffo
Has latitude: True
Has longitude: True
Has maps_url: True
Has services: True
```
### Step 3: Validate Data Quality
```bash
python3 << 'PYEOF'
import json
from collections import Counter

with open('data/isil/AR/conabip_libraries_enhanced_FULL.json', 'r') as f:
    data = json.load(f)
institutions = data['institutions']

# Count by province
provinces = Counter(i['province'] for i in institutions if i.get('province'))
print("=== TOP 5 PROVINCES ===")
for prov, count in provinces.most_common(5):
    print(f"{prov}: {count}")

# Check coordinate coverage
with_coords = sum(1 for i in institutions if i.get('latitude'))
print("\n=== COORDINATE COVERAGE ===")
print(f"Institutions with coordinates: {with_coords}/{len(institutions)} "
      f"({100*with_coords/len(institutions):.1f}%)")

# Check coordinate validity. Argentina is roughly lat -55 to -22,
# lon -73 to -53; the bounds below add a small buffer.
invalid_coords = []
for i in institutions:
    lat = i.get('latitude')
    lon = i.get('longitude')
    if lat and lon:
        if not (-55 <= lat <= -20 and -75 <= lon <= -50):
            invalid_coords.append(i['name'])

if invalid_coords:
    print(f"\n⚠ WARNING: {len(invalid_coords)} institutions have suspicious coordinates:")
    for name in invalid_coords[:5]:
        print(f"  - {name}")
else:
    print("\n✅ All coordinates appear valid (within Argentina bounds)")
    print("\n=== DATA QUALITY: EXCELLENT ===")
PYEOF
```
## Next Steps After Verification
Once you confirm the data is complete and valid:
### 1. Parse into LinkML HeritageCustodian Format
```bash
# Create parser script (to be implemented)
python3 scripts/parse_conabip_to_linkml.py
```
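Since `parse_conabip_to_linkml.py` is still to be implemented, here is a minimal sketch of the field mapping it might perform. The output keys are illustrative placeholders, not the actual `HeritageCustodian` LinkML slots; align them with the real schema before use.

```python
import json

def to_custodian(inst):
    """Map one scraped CONABIP record to a flat custodian dict.

    Field names here are assumptions standing in for the real
    HeritageCustodian LinkML slots.
    """
    return {
        "name": inst.get("name"),
        "custodian_type": "library",
        "country": "AR",
        "province": inst.get("province"),
        "city": inst.get("city"),
        "address": inst.get("address"),
        "latitude": inst.get("latitude"),
        "longitude": inst.get("longitude"),
        "services": inst.get("services", []),
    }

if __name__ == "__main__":
    try:
        with open("data/isil/AR/conabip_libraries_enhanced_FULL.json") as f:
            data = json.load(f)
    except FileNotFoundError:
        print("Enhanced JSON not found; run the scraper first")
    else:
        custodians = [to_custodian(i) for i in data["institutions"]]
        with open("data/isil/AR/conabip_custodians.json", "w") as f:
            json.dump(custodians, f, ensure_ascii=False, indent=2)
        print(f"Wrote {len(custodians)} custodian records")
```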
### 2. Generate GHCIDs
```bash
# Format: AR-{Province}-{City}-L-{Abbrev}
# Example: AR-BA-BAIRES-L-HLRF (Buenos Aires, Capital, Helena Larroque de Roffo)
```
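A sketch of the GHCID generation, assuming the abbreviation is built from the initials of the significant words in the name; the doc's `HLRF` example suggests the real rule may differ slightly, so treat this as a starting point only.

```python
# Spanish connective words to skip when abbreviating (assumed rule)
STOPWORDS = {"de", "del", "la", "las", "los", "el", "y"}

def make_ghcid(province, city, name, kind="L"):
    """Build an ID in the AR-{Province}-{City}-{kind}-{Abbrev} format.

    Abbrev is the initials of the name's significant words; this is
    an assumed rule, to be confirmed against the project's spec.
    """
    initials = "".join(
        w[0].upper() for w in name.split() if w.lower() not in STOPWORDS
    )
    return f"AR-{province}-{city}-{kind}-{initials}"
```

With this rule, `make_ghcid("BA", "BAIRES", "Helena Larroque de Roffo")` yields `AR-BA-BAIRES-L-HLR`; a real implementation would also need a disambiguation step for two libraries that share initials in the same city.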
### 3. Enrich with Wikidata
```bash
# Query Wikidata for Q-numbers using name + location matching
python3 scripts/enrich_argentina_with_wikidata.py
```
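`enrich_argentina_with_wikidata.py` is also still to be written. One viable approach is the MediaWiki `wbsearchentities` API, matching results by label; the exact-label matching heuristic below is an assumption, and a production version would also verify location.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def search_url(name, language="es"):
    """Build a wbsearchentities query URL for an institution name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"

def pick_qid(response, name):
    """Pick a Q-number from a parsed wbsearchentities response.

    Naive heuristic (an assumption): accept the first hit whose label
    matches the library name case-insensitively; return None otherwise.
    """
    for hit in response.get("search", []):
        if hit.get("label", "").lower() == name.lower():
            return hit.get("id")
    return None
```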
### 4. Export to RDF/JSON-LD
```bash
# Generate semantic web exports
python3 scripts/export_heritage_custodians.py --country AR --format jsonld
```
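`export_heritage_custodians.py` could emit JSON-LD along these lines, using schema.org's `Library` type. The context and field choices are assumptions, not the project's settled export shape:

```python
def to_jsonld(custodian):
    """Render one custodian dict as a schema.org JSON-LD node (sketch)."""
    node = {
        "@context": "https://schema.org",
        "@type": "Library",
        "name": custodian.get("name"),
        "address": {
            "@type": "PostalAddress",
            "streetAddress": custodian.get("address"),
            "addressLocality": custodian.get("city"),
            "addressRegion": custodian.get("province"),
            "addressCountry": "AR",
        },
    }
    # Only emit geo/sameAs when the enrichment data is present
    if custodian.get("latitude") is not None:
        node["geo"] = {
            "@type": "GeoCoordinates",
            "latitude": custodian["latitude"],
            "longitude": custodian["longitude"],
        }
    if custodian.get("wikidata_qid"):
        node["sameAs"] = f"https://www.wikidata.org/wiki/{custodian['wikidata_qid']}"
    return node
```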
## Troubleshooting
### Problem: Scraper Still Running After 20 Minutes
```bash
# Check if it's stuck
tail -f /tmp/scraper_output.log
# If stuck on one institution for >2 minutes, kill and retry
pkill -f scrape_conabip_full.py
# Modify scraper to skip problematic institutions
# (would need code changes)
```
### Problem: Output Files Missing
```bash
# Check log for errors
grep ERROR /tmp/scraper_output.log
# Check if scraper completed
grep "COMPLETE" /tmp/scraper_output.log
# If incomplete, rerun
python3 /tmp/scrape_conabip_full.py
```
### Problem: Incomplete Data (< 288 institutions)
- Check CONABIP website to see if data changed
- Review scraper logs for parsing errors
- May need to adjust scraper logic
## Critical Data Requirements Met
For GLAM project integration, we need:
- ✅ Institution names
- ✅ Provinces and cities
- ✅ Street addresses
- ✅ **Geographic coordinates** (CRITICAL - now being collected)
- ✅ Services metadata (bonus)
- ⏳ Wikidata Q-numbers (next enrichment step)
- ⏳ GHCIDs (next generation step)
**Status**: Background scraper will fulfill ALL data requirements.
---
**Created**: 2025-11-17 18:01
**Check back**: After 18:12 (or run `./check_scraper_status.sh`)