# CONABIP Scraper - Completion Instructions

## Background Process Running

A background scraper is currently running to collect the **FULL enhanced dataset** with coordinates and services for all 288 Argentine popular libraries.

**Started**: November 17, 2025 at 18:00
**Expected completion**: ~10-12 minutes (around 18:10-18:12)

**Output files**:
- `data/isil/AR/conabip_libraries_enhanced_FULL.csv`
- `data/isil/AR/conabip_libraries_enhanced_FULL.json`

## How to Check Status

```bash
cd /Users/kempersc/apps/glam

# Quick status check
./check_scraper_status.sh

# Live progress monitoring
tail -f /tmp/scraper_output.log

# Check if process is still running
pgrep -f "scrape_conabip_full.py"
```

## When Complete - Verification Steps

### Step 1: Verify Output Files Exist
```bash
ls -lh data/isil/AR/conabip_libraries_enhanced_FULL.*
```

**Expected**:
- CSV file: ~60-80 KB
- JSON file: ~180-220 KB

### Step 2: Verify Record Count
```bash
python3 << 'PYEOF'
import json

with open('data/isil/AR/conabip_libraries_enhanced_FULL.json', 'r') as f:
    data = json.load(f)

print(f"Total institutions: {data['metadata']['total_institutions']}")
print(f"With coordinates: {data['metadata'].get('institutions_with_coordinates', 0)}")
print(f"With services: {data['metadata'].get('institutions_with_services', 0)}")

# Check first institution has enhanced data
inst = data['institutions'][0]
print(f"\nSample institution: {inst['name']}")
print(f" Has latitude: {'latitude' in inst and inst['latitude'] is not None}")
print(f" Has longitude: {'longitude' in inst and inst['longitude'] is not None}")
print(f" Has maps_url: {'maps_url' in inst}")
print(f" Has services: {'services' in inst}")
PYEOF
```

**Expected Output**:
```
Total institutions: 288
With coordinates: 280-288 (most should have coordinates)
With services: 180-220 (not all have services, ~60-75%)

Sample institution: Biblioteca Popular Helena Larroque de Roffo
 Has latitude: True
 Has longitude: True
 Has maps_url: True
 Has services: True
```

### Step 3: Validate Data Quality
```bash
python3 << 'PYEOF'
import json
from collections import Counter

with open('data/isil/AR/conabip_libraries_enhanced_FULL.json', 'r') as f:
    data = json.load(f)

institutions = data['institutions']

# Count by province
provinces = Counter(i['province'] for i in institutions if i.get('province'))

print("=== TOP 5 PROVINCES ===")
for prov, count in provinces.most_common(5):
    print(f"{prov}: {count}")

# Check coordinate coverage
with_coords = sum(1 for i in institutions if i.get('latitude'))
print("\n=== COORDINATE COVERAGE ===")
print(f"Institutions with coordinates: {with_coords}/{len(institutions)} ({100*with_coords/len(institutions):.1f}%)")

# Check coordinate validity (Argentina spans roughly lat -55 to -22,
# lon -73 to -53; the bounds below add a small margin)
invalid_coords = []
for i in institutions:
    lat = i.get('latitude')
    lon = i.get('longitude')
    if lat and lon:
        if not (-55 <= lat <= -20 and -75 <= lon <= -50):
            invalid_coords.append(i['name'])

if invalid_coords:
    print(f"\n⚠️ WARNING: {len(invalid_coords)} institutions have suspicious coordinates:")
    for name in invalid_coords[:5]:
        print(f"  - {name}")
else:
    print("\n✅ All coordinates appear valid (within Argentina bounds)")

print("\n=== DATA QUALITY CHECK COMPLETE ===")
PYEOF
```


## Next Steps After Verification

Once you confirm the data is complete and valid:

### 1. Parse into LinkML HeritageCustodian Format
```bash
# Create parser script (to be implemented)
python3 scripts/parse_conabip_to_linkml.py
```
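
Until that script exists, the mapping can be sketched as below. The HeritageCustodian slot names (`custodian_type`, `location`, etc.) are assumptions, since the LinkML schema is not shown here; only the scraped field names match the verification scripts above.

```python
# Hypothetical mapping from one scraped CONABIP record to a
# HeritageCustodian-style dict; all slot names here are assumptions.
def to_heritage_custodian(rec: dict) -> dict:
    custodian = {
        "name": rec["name"],
        "custodian_type": "library",
        "country_code": "AR",
        "province": rec.get("province"),
        "city": rec.get("city"),
        "address": rec.get("address"),
        "services": rec.get("services"),
    }
    # Only emit a location when both coordinates are present
    lat, lon = rec.get("latitude"), rec.get("longitude")
    if lat is not None and lon is not None:
        custodian["location"] = {"latitude": lat, "longitude": lon}
    # Drop empty slots so the serialized output stays clean
    return {k: v for k, v in custodian.items() if v is not None}
```

Dropping `None`-valued slots keeps the eventual YAML/JSON serialization free of empty fields for the ~25-40% of records without services.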
### 2. Generate GHCIDs
```bash
# Format: AR-{Province}-{City}-L-{Abbrev}
# Example: AR-BA-BAIRES-L-HLRF (Buenos Aires, Capital, Helena Larroque de Roffo)
```
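
A minimal generator for this format might look like the sketch below. The abbreviation rule (initials of the non-generic words in the name) is an assumption; the example above (`HLRF`) suggests the real rule may differ, so treat this as illustrative only.

```python
import re

# Generic words skipped when abbreviating a library name (assumed rule).
STOPWORDS = {"biblioteca", "popular", "de", "del", "la", "el", "los", "las", "y"}

def make_ghcid(province_code: str, city_code: str, name: str) -> str:
    """Build an ID in the form AR-{Province}-{City}-L-{Abbrev}."""
    words = re.findall(r"\w+", name)
    abbrev = "".join(w[0].upper() for w in words if w.lower() not in STOPWORDS)
    return f"AR-{province_code}-{city_code}-L-{abbrev}"

# Under this assumed rule, "Helena Larroque de Roffo" abbreviates to HLR,
# not HLRF as in the example above.
```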
### 3. Enrich with Wikidata
```bash
# Query Wikidata for Q-numbers using name + location matching
python3 scripts/enrich_argentina_with_wikidata.py
```
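
The matching query such a script might send to the Wikidata SPARQL endpoint can be sketched as follows. `enrich_argentina_with_wikidata.py` is not shown, so the exact query it uses is an assumption; Q7075 (library) and Q414 (Argentina) are real Wikidata items.

```python
def build_wikidata_query(name: str, limit: int = 5) -> str:
    """SPARQL to find libraries in Argentina whose label matches a name."""
    escaped = name.replace('\\', '\\\\').replace('"', '\\"')
    return f'''SELECT ?item ?itemLabel ?coord WHERE {{
  ?item wdt:P31/wdt:P279* wd:Q7075 ;   # instance of (a subclass of) library
        wdt:P17 wd:Q414 ;              # country: Argentina
        rdfs:label ?itemLabel .
  OPTIONAL {{ ?item wdt:P625 ?coord }}  # coordinate location, for cross-checking
  FILTER(CONTAINS(LCASE(STR(?itemLabel)), LCASE("{escaped}")))
}}
LIMIT {limit}'''
```

Returning `?coord` (P625) lets the script cross-check candidate matches against the scraped coordinates rather than relying on the label alone.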
### 4. Export to RDF/JSON-LD
```bash
# Generate semantic web exports
python3 scripts/export_heritage_custodians.py --country AR --format jsonld
```
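
As a sketch of what the JSON-LD output could look like, the snippet below maps one record onto schema.org terms. `export_heritage_custodians.py` is not shown, so its actual vocabulary and `@context` are assumptions.

```python
import json

def to_jsonld(rec: dict) -> str:
    """Serialize one library record as schema.org JSON-LD (assumed vocabulary)."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Library",
        "name": rec["name"],
        "address": {
            "@type": "PostalAddress",
            "streetAddress": rec.get("address"),
            "addressLocality": rec.get("city"),
            "addressRegion": rec.get("province"),
            "addressCountry": "AR",
        },
    }
    if rec.get("latitude") is not None and rec.get("longitude") is not None:
        doc["geo"] = {
            "@type": "GeoCoordinates",
            "latitude": rec["latitude"],
            "longitude": rec["longitude"],
        }
    return json.dumps(doc, ensure_ascii=False, indent=2)
```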
## Troubleshooting

### Problem: Scraper Still Running After 20 Minutes
```bash
# Check if it's stuck
tail -f /tmp/scraper_output.log

# If stuck on one institution for >2 minutes, kill and retry
pkill -f scrape_conabip_full.py

# Modify scraper to skip problematic institutions
# (would need code changes)
```

### Problem: Output Files Missing
```bash
# Check log for errors
grep ERROR /tmp/scraper_output.log

# Check if scraper completed
grep "COMPLETE" /tmp/scraper_output.log

# If incomplete, rerun
python3 /tmp/scrape_conabip_full.py
```

### Problem: Incomplete Data (< 288 institutions)
- Check CONABIP website to see if data changed
- Review scraper logs for parsing errors
- May need to adjust scraper logic
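
A quick structural check on the parsed records can help localize such problems. The helper below flags duplicate and blank names, which usually point at parsing errors; the field name `name` matches the verification scripts above.

```python
from collections import Counter

def find_suspect_records(institutions: list) -> dict:
    """Flag blank and duplicated names, a common symptom of parsing errors."""
    names = [i.get("name", "").strip() for i in institutions]
    counts = Counter(n for n in names if n)
    return {
        "total": len(names),
        "blank_names": names.count(""),
        "duplicate_names": sorted(n for n, c in counts.items() if c > 1),
    }
```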
## Critical Data Requirements Met

For GLAM project integration, we need:
- ✅ Institution names
- ✅ Provinces and cities
- ✅ Street addresses
- ✅ **Geographic coordinates** (CRITICAL - now being collected)
- ✅ Services metadata (bonus)
- ⏳ Wikidata Q-numbers (next enrichment step)
- ⏳ GHCIDs (next generation step)

**Status**: Background scraper will fulfill ALL data requirements.

---
**Created**: 2025-11-17 18:01
**Check back**: After 18:12 (or run `./check_scraper_status.sh`)