CONABIP Scraper - Completion Instructions

Background Process Running

A background scraper is currently running to collect the FULL enhanced dataset with coordinates and services for all 288 Argentine popular libraries.

Started: November 17, 2025 at 18:00
Expected completion: ~10-12 minutes (around 18:10-18:12)
Output files:

  • data/isil/AR/conabip_libraries_enhanced_FULL.csv
  • data/isil/AR/conabip_libraries_enhanced_FULL.json
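
The JSON output follows this general shape. The field names below are inferred from the verification scripts later in this document; the sample values are purely illustrative:

```json
{
  "metadata": {
    "total_institutions": 288,
    "institutions_with_coordinates": 285,
    "institutions_with_services": 200
  },
  "institutions": [
    {
      "name": "Biblioteca Popular Helena Larroque de Roffo",
      "province": "Buenos Aires",
      "latitude": -34.6,
      "longitude": -58.5,
      "maps_url": "https://...",
      "services": ["..."]
    }
  ]
}
```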

How to Check Status

cd /Users/kempersc/apps/glam

# Quick status check
./check_scraper_status.sh

# Live progress monitoring
tail -f /tmp/scraper_output.log

# Check if process is still running
pgrep -f "scrape_conabip_full.py"
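
The core of the status check can be sketched as a small shell helper. This is a hypothetical reconstruction; the real check_scraper_status.sh may work differently. The "COMPLETE" marker is the same one the troubleshooting section greps for:

```shell
# log_status: report DONE once the log's last line marks completion,
# RUNNING otherwise. Hypothetical helper, not the actual script.
log_status() {
  last=$(tail -n 1 "$1" 2>/dev/null)
  case "$last" in
    *COMPLETE*) echo "DONE" ;;
    *)          echo "RUNNING" ;;
  esac
}

# Example: log_status /tmp/scraper_output.log
```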

When Complete - Verification Steps

Step 1: Verify Output Files Exist

ls -lh data/isil/AR/conabip_libraries_enhanced_FULL.*

Expected:

  • CSV file: ~60-80 KB
  • JSON file: ~180-220 KB

Step 2: Verify Record Count

python3 << 'PYEOF'
import json
with open('data/isil/AR/conabip_libraries_enhanced_FULL.json', 'r') as f:
    data = json.load(f)
    
print(f"Total institutions: {data['metadata']['total_institutions']}")
print(f"With coordinates: {data['metadata'].get('institutions_with_coordinates', 0)}")
print(f"With services: {data['metadata'].get('institutions_with_services', 0)}")

# Check first institution has enhanced data
inst = data['institutions'][0]
print(f"\nSample institution: {inst['name']}")
print(f"  Has latitude: {'latitude' in inst and inst['latitude'] is not None}")
print(f"  Has longitude: {'longitude' in inst and inst['longitude'] is not None}")
print(f"  Has maps_url: {'maps_url' in inst}")
print(f"  Has services: {'services' in inst}")
PYEOF

Expected Output:

Total institutions: 288
With coordinates: 280-288  (most should have coordinates)
With services: 180-220     (not all have services, ~60-75%)

Sample institution: Biblioteca Popular Helena Larroque de Roffo
  Has latitude: True
  Has longitude: True
  Has maps_url: True
  Has services: True

Step 3: Validate Data Quality

python3 << 'PYEOF'
import json
with open('data/isil/AR/conabip_libraries_enhanced_FULL.json', 'r') as f:
    data = json.load(f)

institutions = data['institutions']

# Count by province
from collections import Counter
provinces = Counter(i['province'] for i in institutions if i.get('province'))

print("=== TOP 5 PROVINCES ===")
for prov, count in provinces.most_common(5):
    print(f"{prov}: {count}")

# Check coordinate coverage
with_coords = sum(1 for i in institutions if i.get('latitude'))
print(f"\n=== COORDINATE COVERAGE ===")
print(f"Institutions with coordinates: {with_coords}/{len(institutions)} ({100*with_coords/len(institutions):.1f}%)")

# Check coordinate validity (should be in Argentina range)
invalid_coords = []
for i in institutions:
    lat = i.get('latitude')
    lon = i.get('longitude')
    if lat and lon:
        # Argentina roughly: lat -55 to -22, lon -73 to -53
        # (bounds below include a small margin)
        if not (-55 <= lat <= -20 and -75 <= lon <= -50):
            invalid_coords.append(i['name'])

if invalid_coords:
    print(f"\n⚠  WARNING: {len(invalid_coords)} institutions have suspicious coordinates:")
    for name in invalid_coords[:5]:
        print(f"  - {name}")
else:
    print("\n✅ All coordinates appear valid (within Argentina bounds)")

print("\n=== DATA QUALITY CHECK COMPLETE ===")
PYEOF

Next Steps After Verification

Once you confirm the data is complete and valid:

1. Parse into LinkML HeritageCustodian Format

# Create parser script (to be implemented)
python3 scripts/parse_conabip_to_linkml.py
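
Since the parser is still to be implemented, here is a minimal sketch of what it might do. The HeritageCustodian slot names below are guesses; the actual names come from the project's LinkML schema:

```python
import json

# Hypothetical sketch of parse_conabip_to_linkml.py.
# Field names on the output side are assumptions, not the real schema.
def to_heritage_custodian(inst: dict) -> dict:
    return {
        "name": inst["name"],
        "country": "AR",
        "province": inst.get("province"),
        "city": inst.get("city"),
        "address": inst.get("address"),
        "latitude": inst.get("latitude"),
        "longitude": inst.get("longitude"),
        "services": inst.get("services", []),
    }

def parse_file(in_path: str, out_path: str) -> int:
    with open(in_path) as f:
        data = json.load(f)
    records = [to_heritage_custodian(i) for i in data["institutions"]]
    with open(out_path, "w") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return len(records)
```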

2. Generate GHCIDs

# Format: AR-{Province}-{City}-L-{Abbrev}
# Example: AR-BA-BAIRES-L-HLRF (Buenos Aires, Capital, Helena Larroque de Roffo)
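
One way to build IDs in this format is an initials heuristic like the sketch below. Note the worked example above ends in HLRF, so the project's actual abbreviation rule evidently differs from this simple heuristic; the province/city codes are also passed in rather than derived:

```python
import re

# Hypothetical GHCID builder for the AR-{Province}-{City}-L-{Abbrev} format.
# The real abbreviation and code tables are a project decision.
STOPWORDS = {"de", "del", "la", "las", "los", "y"}

def make_ghcid(province_code: str, city_code: str, name: str) -> str:
    # Drop a generic "Biblioteca Popular" prefix, then take initials
    # of the remaining significant words.
    stripped = re.sub(r"^Biblioteca Popular\s+", "", name, flags=re.I)
    initials = "".join(
        w[0].upper() for w in stripped.split() if w.lower() not in STOPWORDS
    )
    return f"AR-{province_code}-{city_code}-L-{initials}"

# make_ghcid("BA", "BAIRES", "Biblioteca Popular Helena Larroque de Roffo")
```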

3. Enrich with Wikidata

# Query Wikidata for Q-numbers using name + location matching
python3 scripts/enrich_argentina_with_wikidata.py
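
The matching query can be sketched as a SPARQL string builder (this is an assumption about how the enrichment script works, not its actual code; Q7075 is Wikidata's "library" class, P17 is country, Q414 is Argentina):

```python
# Hypothetical query builder for name-based Wikidata matching.
def build_sparql(name: str) -> str:
    safe = name.replace('"', '\\"')
    return f'''
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P31/wdt:P279* wd:Q7075 ;   # instance of (a subclass of) library
        wdt:P17 wd:Q414 ;              # country: Argentina
        rdfs:label ?itemLabel .
  FILTER(LANG(?itemLabel) = "es")
  FILTER(CONTAINS(LCASE(?itemLabel), LCASE("{safe}")))
}}
LIMIT 5
'''
```

The resulting query can be POSTed to https://query.wikidata.org/sparql; location-based filtering (e.g. comparing P625 coordinates) would further reduce false matches.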

4. Export to RDF/JSON-LD

# Generate semantic web exports
python3 scripts/export_heritage_custodians.py --country AR --format jsonld
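
A per-record JSON-LD mapping might look like the sketch below, using the schema.org Library type; the real export script's context and field choices may differ:

```python
# Hypothetical JSON-LD mapping for one institution record.
def to_jsonld(inst: dict) -> dict:
    doc = {
        "@context": "https://schema.org",
        "@type": "Library",
        "name": inst["name"],
        "address": {
            "@type": "PostalAddress",
            "addressCountry": "AR",
            "addressRegion": inst.get("province"),
            "addressLocality": inst.get("city"),
        },
    }
    # Only emit geo when both coordinates are present
    if inst.get("latitude") is not None and inst.get("longitude") is not None:
        doc["geo"] = {
            "@type": "GeoCoordinates",
            "latitude": inst["latitude"],
            "longitude": inst["longitude"],
        }
    return doc
```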

Troubleshooting

Problem: Scraper Still Running After 20 Minutes

# Check if it's stuck
tail -f /tmp/scraper_output.log

# If stuck on one institution for >2 minutes, kill and retry
pkill -f scrape_conabip_full.py

# Modify scraper to skip problematic institutions
# (would need code changes)

Problem: Output Files Missing

# Check log for errors
grep ERROR /tmp/scraper_output.log

# Check if scraper completed
grep "COMPLETE" /tmp/scraper_output.log

# If incomplete, rerun
python3 /tmp/scrape_conabip_full.py

Problem: Incomplete Data (< 288 institutions)

  • Check CONABIP website to see if data changed
  • Review scraper logs for parsing errors
  • May need to adjust scraper logic

Critical Data Requirements Met

For GLAM project integration, we need:

  • Institution names
  • Provinces and cities
  • Street addresses
  • Geographic coordinates (CRITICAL - now being collected)
  • Services metadata (bonus)
  • Wikidata Q-numbers (next enrichment step)
  • GHCIDs (next generation step)

Status: Background scraper will fulfill ALL data requirements.


Created: 2025-11-17 18:01
Check back: after 18:12 (or run ./check_scraper_status.sh)