# CONABIP Scraper - Completion Instructions
## Background Process Running
A background scraper is currently running to collect the **FULL enhanced dataset** with coordinates and services for all 288 Argentine popular libraries.
**Started**: November 17, 2025 at 18:00
**Expected completion**: ~10-12 minutes (around 18:10-18:12)
**Output files**:
- `data/isil/AR/conabip_libraries_enhanced_FULL.csv`
- `data/isil/AR/conabip_libraries_enhanced_FULL.json`
## How to Check Status
```bash
cd /Users/kempersc/apps/glam
# Quick status check
./check_scraper_status.sh
# Live progress monitoring
tail -f /tmp/scraper_output.log
# Check if process is still running
pgrep -f "scrape_conabip_full.py"
```
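The contents of `check_scraper_status.sh` are not shown here. If the scraper logs per-institution progress, a small Python helper can summarize it; this is only a sketch, and the `N/288` progress marker it looks for is an assumption about the log format, not a documented fact.

```python
import re

TOTAL = 288  # expected number of CONABIP popular libraries

def parse_progress(lines):
    """Return the highest 'N/288'-style count found in the log lines, or 0.

    Assumes the scraper prints lines like 'Processing 42/288 ...';
    adjust the pattern to match the real log output.
    """
    done = 0
    for line in lines:
        m = re.search(rf"(\d+)\s*/\s*{TOTAL}", line)
        if m:
            done = max(done, int(m.group(1)))
    return done

if __name__ == "__main__":
    try:
        with open("/tmp/scraper_output.log") as f:
            done = parse_progress(f)
        print(f"Progress: {done}/{TOTAL} ({100 * done / TOTAL:.1f}%)")
    except FileNotFoundError:
        print("Log not found; is the scraper running?")
```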
## When Complete - Verification Steps
### Step 1: Verify Output Files Exist
```bash
ls -lh data/isil/AR/conabip_libraries_enhanced_FULL.*
```
**Expected**:
- CSV file: ~60-80 KB
- JSON file: ~180-220 KB
### Step 2: Verify Record Count
```bash
python3 << 'PYEOF'
import json

with open('data/isil/AR/conabip_libraries_enhanced_FULL.json', 'r') as f:
    data = json.load(f)

print(f"Total institutions: {data['metadata']['total_institutions']}")
print(f"With coordinates: {data['metadata'].get('institutions_with_coordinates', 0)}")
print(f"With services: {data['metadata'].get('institutions_with_services', 0)}")

# Check that the first institution has the enhanced fields
inst = data['institutions'][0]
print(f"\nSample institution: {inst['name']}")
print(f"  Has latitude: {inst.get('latitude') is not None}")
print(f"  Has longitude: {inst.get('longitude') is not None}")
print(f"  Has maps_url: {'maps_url' in inst}")
print(f"  Has services: {'services' in inst}")
PYEOF
```
**Expected Output**:
```
Total institutions: 288
With coordinates: 280-288 (most should have coordinates)
With services: 180-220 (not all have services, ~60-75%)
Sample institution: Biblioteca Popular Helena Larroque de Roffo
Has latitude: True
Has longitude: True
Has maps_url: True
Has services: True
```
### Step 3: Validate Data Quality
```bash
python3 << 'PYEOF'
import json
from collections import Counter

with open('data/isil/AR/conabip_libraries_enhanced_FULL.json', 'r') as f:
    data = json.load(f)
institutions = data['institutions']

# Count by province
provinces = Counter(i['province'] for i in institutions if i.get('province'))
print("=== TOP 5 PROVINCES ===")
for prov, count in provinces.most_common(5):
    print(f"{prov}: {count}")

# Check coordinate coverage
with_coords = sum(1 for i in institutions if i.get('latitude'))
print("\n=== COORDINATE COVERAGE ===")
print(f"Institutions with coordinates: {with_coords}/{len(institutions)} "
      f"({100*with_coords/len(institutions):.1f}%)")

# Check coordinate validity. Argentina is roughly lat -55 to -22,
# lon -73 to -53; the bounds below add a small buffer.
invalid_coords = []
for i in institutions:
    lat = i.get('latitude')
    lon = i.get('longitude')
    if lat and lon:
        if not (-55 <= lat <= -20 and -75 <= lon <= -50):
            invalid_coords.append(i['name'])

if invalid_coords:
    print(f"\n⚠ WARNING: {len(invalid_coords)} institutions have suspicious coordinates:")
    for name in invalid_coords[:5]:
        print(f"  - {name}")
else:
    print("\n✅ All coordinates appear valid (within Argentina bounds)")
    print("\n=== DATA QUALITY: EXCELLENT ===")
PYEOF
```
## Next Steps After Verification
Once you confirm the data is complete and valid:
### 1. Parse into LinkML HeritageCustodian Format
```bash
# Create parser script (to be implemented)
python3 scripts/parse_conabip_to_linkml.py
```
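Since `parse_conabip_to_linkml.py` is still to be implemented, here is a minimal sketch of the field mapping it might perform. The output keys are illustrative placeholders, not the actual `HeritageCustodian` LinkML slots; align them with the real schema before use.

```python
import json

def to_custodian(inst):
    """Map one scraped CONABIP record to a flat custodian dict.

    Field names here are assumptions standing in for the real
    HeritageCustodian LinkML slots.
    """
    return {
        "name": inst.get("name"),
        "custodian_type": "library",
        "country": "AR",
        "province": inst.get("province"),
        "city": inst.get("city"),
        "address": inst.get("address"),
        "latitude": inst.get("latitude"),
        "longitude": inst.get("longitude"),
        "services": inst.get("services", []),
    }

if __name__ == "__main__":
    try:
        with open("data/isil/AR/conabip_libraries_enhanced_FULL.json") as f:
            data = json.load(f)
    except FileNotFoundError:
        print("Enhanced JSON not found; run the scraper first")
    else:
        custodians = [to_custodian(i) for i in data["institutions"]]
        with open("data/isil/AR/conabip_custodians.json", "w") as f:
            json.dump(custodians, f, ensure_ascii=False, indent=2)
        print(f"Wrote {len(custodians)} custodian records")
```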
### 2. Generate GHCIDs
```bash
# Format: AR-{Province}-{City}-L-{Abbrev}
# Example: AR-BA-BAIRES-L-HLRF (Buenos Aires, Capital, Helena Larroque de Roffo)
```
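A sketch of the GHCID generation, assuming the abbreviation is built from the initials of the significant words in the name; the doc's `HLRF` example suggests the real rule may differ slightly, so treat this as a starting point only.

```python
# Spanish connective words to skip when abbreviating (assumed rule)
STOPWORDS = {"de", "del", "la", "las", "los", "el", "y"}

def make_ghcid(province, city, name, kind="L"):
    """Build an ID in the AR-{Province}-{City}-{kind}-{Abbrev} format.

    Abbrev is the initials of the name's significant words; this is
    an assumed rule, to be confirmed against the project's spec.
    """
    initials = "".join(
        w[0].upper() for w in name.split() if w.lower() not in STOPWORDS
    )
    return f"AR-{province}-{city}-{kind}-{initials}"
```

With this rule, `make_ghcid("BA", "BAIRES", "Helena Larroque de Roffo")` yields `AR-BA-BAIRES-L-HLR`; a real implementation would also need a disambiguation step for two libraries that share initials in the same city.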
### 3. Enrich with Wikidata
```bash
# Query Wikidata for Q-numbers using name + location matching
python3 scripts/enrich_argentina_with_wikidata.py
```
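`enrich_argentina_with_wikidata.py` is also still to be written. One viable approach is the MediaWiki `wbsearchentities` API, matching results by label; the exact-label matching heuristic below is an assumption, and a production version would also verify location.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def search_url(name, language="es"):
    """Build a wbsearchentities query URL for an institution name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"

def pick_qid(response, name):
    """Pick a Q-number from a parsed wbsearchentities response.

    Naive heuristic (an assumption): accept the first hit whose label
    matches the library name case-insensitively; return None otherwise.
    """
    for hit in response.get("search", []):
        if hit.get("label", "").lower() == name.lower():
            return hit.get("id")
    return None
```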
### 4. Export to RDF/JSON-LD
```bash
# Generate semantic web exports
python3 scripts/export_heritage_custodians.py --country AR --format jsonld
```
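`export_heritage_custodians.py` could emit JSON-LD along these lines, using schema.org's `Library` type. The context and field choices are assumptions, not the project's settled export shape:

```python
def to_jsonld(custodian):
    """Render one custodian dict as a schema.org JSON-LD node (sketch)."""
    node = {
        "@context": "https://schema.org",
        "@type": "Library",
        "name": custodian.get("name"),
        "address": {
            "@type": "PostalAddress",
            "streetAddress": custodian.get("address"),
            "addressLocality": custodian.get("city"),
            "addressRegion": custodian.get("province"),
            "addressCountry": "AR",
        },
    }
    # Only emit geo/sameAs when the enrichment data is present
    if custodian.get("latitude") is not None:
        node["geo"] = {
            "@type": "GeoCoordinates",
            "latitude": custodian["latitude"],
            "longitude": custodian["longitude"],
        }
    if custodian.get("wikidata_qid"):
        node["sameAs"] = f"https://www.wikidata.org/wiki/{custodian['wikidata_qid']}"
    return node
```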
## Troubleshooting
### Problem: Scraper Still Running After 20 Minutes
```bash
# Check if it's stuck
tail -f /tmp/scraper_output.log
# If stuck on one institution for >2 minutes, kill and retry
pkill -f scrape_conabip_full.py
# Modify scraper to skip problematic institutions
# (would need code changes)
```
### Problem: Output Files Missing
```bash
# Check log for errors
grep ERROR /tmp/scraper_output.log
# Check if scraper completed
grep "COMPLETE" /tmp/scraper_output.log
# If incomplete, rerun
python3 /tmp/scrape_conabip_full.py
```
### Problem: Incomplete Data (< 288 institutions)
- Check CONABIP website to see if data changed
- Review scraper logs for parsing errors
- May need to adjust scraper logic
## Critical Data Requirements Met
For GLAM project integration, we need:
- ✅ Institution names
- ✅ Provinces and cities
- ✅ Street addresses
- ✅ **Geographic coordinates** (CRITICAL - now being collected)
- ✅ Services metadata (bonus)
- ⏳ Wikidata Q-numbers (next enrichment step)
- ⏳ GHCIDs (next generation step)
**Status**: Background scraper will fulfill ALL data requirements.
---
**Created**: 2025-11-17 18:01
**Check back**: After 18:12 (or run `./check_scraper_status.sh`)