glam/data/isil/bosnia/AUTOMATION_SCRIPT_CREATED.md
2025-11-19 23:25:22 +01:00

Bosnia ISIL Automation Script - Created

Date: November 18, 2025
Status: Script ready, awaiting execution


Script Created

Location: /Users/kempersc/apps/glam/scripts/bosnia_isil_scraper.py

What It Does

Automates the manual fallback process of checking all 80 COBISS.BH libraries for ISIL codes.

Strategy

For each of the 80 libraries:

  1. Check COBISS Library Pages

    • Try multiple COBISS URL patterns
    • Search page content for ISIL code patterns
  2. Check Institutional Websites

    • Navigate to library homepage (if available)
    • Check main page and "About"/"Contact" sections
    • Search for ISIL codes, ISO 15511 mentions
  3. Pattern Matching

    • BA-* codes (ISO 3166-1 alpha-2 format)
    • BO-* codes (legacy format from Danish registry)
    • "ISIL: XX-XXXXX" mentions
    • ISO 15511 standard references
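
The pattern-matching step above can be sketched roughly as follows. The function name `search_for_isil(text)` comes from the script overview in this document, but the regular expressions below are assumptions, not the script's actual patterns. They accept only upper-case letters, digits, and hyphens after the prefix and cap the total length at 16 characters; `mentions_iso_15511` is a hypothetical helper.

```python
import re

# Assumed pattern: BA- or BO- prefix, then hyphen-separated upper-case/digit groups.
ISIL_CODE = re.compile(r"\b(?:BA|BO)-[A-Z0-9]+(?:-[A-Z0-9]+)*\b")

# Assumed pattern for labelled mentions such as "ISIL: XX-XXXXX" (any country prefix).
LABELLED_ISIL = re.compile(
    r"(?i:ISIL)(?:\s+code)?\s*[:\-]?\s*([A-Z]{2}-[A-Z0-9]+(?:-[A-Z0-9]+)*)"
)

# References to the standard itself, e.g. "ISO 15511".
ISO_15511 = re.compile(r"ISO\s*15511")

MAX_ISIL_LENGTH = 16  # assumed upper bound on the full identifier length


def search_for_isil(text: str) -> list[str]:
    """Return the sorted, de-duplicated ISIL-like codes found in text."""
    found = set(ISIL_CODE.findall(text))
    found.update(LABELLED_ISIL.findall(text))
    return sorted(code for code in found if len(code) <= MAX_ISIL_LENGTH)


def mentions_iso_15511(text: str) -> bool:
    """Return True if the text refers to the ISO 15511 standard."""
    return ISO_15511.search(text) is not None
```

Restricting the pattern to upper-case characters trades some recall for fewer false positives on ordinary words that happen to begin with "BA-".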

Output

File: data/isil/bosnia/bosnia_isil_codes_found.json

Format:

[
  {
    "number": 1,
    "name": "Library Name",
    "city": "City",
    "acronym": "ACRONYM",
    "homepage": "www.example.ba",
    "isil_found": true,
    "isil_codes": ["BA-SA-CODE", "BO-CODE"],
    "sources_checked": ["COBISS library pages", "Website: www.example.ba"],
    "notes": ["Found in COBISS: https://...", "Found on website: https://..."]
  }
]
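
One way to keep every record in this shape is a small dataclass that serializes to the same keys. This is a sketch only: `LibraryResult`, `add_codes`, and `to_json` are hypothetical names and the script may build plain dictionaries instead.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LibraryResult:
    """One entry of the output file; field names mirror the format above."""
    number: int
    name: str
    city: str
    acronym: str
    homepage: str
    isil_found: bool = False
    isil_codes: list[str] = field(default_factory=list)
    sources_checked: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)

    def add_codes(self, codes: list[str], note: str) -> None:
        """Record newly found codes (without duplicates) and keep isil_found in sync."""
        for code in codes:
            if code not in self.isil_codes:
                self.isil_codes.append(code)
        if codes:
            self.notes.append(note)
        self.isil_found = bool(self.isil_codes)


def to_json(results: list[LibraryResult]) -> str:
    """Serialize results; ensure_ascii=False keeps characters such as 'š' readable."""
    return json.dumps([asdict(r) for r in results], ensure_ascii=False, indent=2)
```

Deriving `isil_found` from `isil_codes` in one place avoids records where the flag and the list disagree.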

Performance

  • Manual Estimate: 80 libraries × 5 min/library = 400 min (~6.7 hours)
  • Automated Estimate: ~10-20 minutes (including wait times, retries)
  • Intermediate Saves: Every 10 libraries (fault tolerance)
  • Logging: Real-time progress to scraper_log.txt

How to Run

Option 1: Run Directly

cd /Users/kempersc/apps/glam
python scripts/bosnia_isil_scraper.py

Option 2: Run in Background

cd /Users/kempersc/apps/glam
nohup python scripts/bosnia_isil_scraper.py > data/isil/bosnia/scraper_output.txt 2>&1 &

# Monitor progress
tail -f data/isil/bosnia/scraper_log.txt

Option 3: Test on First 5 Libraries

# Edit script to limit to 5 libraries for testing
python scripts/bosnia_isil_scraper.py
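
Instead of editing the script for each test run, the limit could be exposed as a command-line option. The sketch below assumes a hypothetical `--limit` flag; nothing in this document indicates the script already accepts one.

```python
import argparse


def positive_int(value: str) -> int:
    """argparse type that rejects zero and negative limits."""
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError("limit must be at least 1")
    return number


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Bosnia ISIL scraper (sketch)")
    parser.add_argument(
        "--limit",
        type=positive_int,
        default=None,
        help="only process the first N libraries (hypothetical flag for test runs)",
    )
    return parser.parse_args(argv)


def select_libraries(libraries: list, limit=None) -> list:
    """Return the first `limit` libraries, or all of them when no limit is given."""
    return libraries if limit is None else libraries[:limit]
```

With such a flag in place, a test run would look like `python scripts/bosnia_isil_scraper.py --limit 5`.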

Expected Outcomes

Scenario 1: ISIL Codes Found

If ISIL codes ARE included in COBISS records or institutional websites:

  • Script extracts 50-80 ISIL codes
  • Validates country code format (BA- vs. BO-)
  • Creates complete mapping: COBISS acronym → ISIL code
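
As an illustration of the last two points, the mapping and the prefix check could be derived from the result records roughly like this. The record keys follow the Output format in this document; `build_acronym_mapping` and `group_by_prefix` are hypothetical helper names.

```python
from collections import defaultdict


def build_acronym_mapping(results: list[dict]) -> dict[str, list[str]]:
    """Map each COBISS acronym to its ISIL codes, skipping libraries without any."""
    return {r["acronym"]: r["isil_codes"] for r in results if r.get("isil_codes")}


def group_by_prefix(codes: list[str]) -> dict[str, list[str]]:
    """Group codes by the part before the first hyphen (e.g. 'BA' or 'BO')."""
    groups = defaultdict(list)
    for code in codes:
        prefix = code.split("-", 1)[0]
        groups[prefix].append(code)
    return dict(groups)
```

A library whose codes fall under more than one prefix is a natural candidate for manual review.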

Scenario 2: No ISIL Codes Found

If ISIL codes are NOT publicly accessible:

  • Confirms exhaustive search (COBISS + websites)
  • Validates that ISIL codes require direct contact with NUBBiH
  • Provides evidence for email request to Registration Authority

Scenario 3: Partial Results ⚠️

If some libraries have ISIL codes but not all:

  • Identifies which libraries publish their ISIL codes
  • Reveals inconsistencies in COBISS data entry
  • Prioritizes which libraries to contact directly

Dependencies

Python Packages:

pip install playwright
playwright install chromium

Already Installed (based on project structure):

  • Python 3.11+
  • Playwright (used earlier in session)

Monitoring Progress

Real-Time Log

tail -f data/isil/bosnia/scraper_log.txt

Example Output:

2025-11-18 15:30:00 - Starting Bosnia ISIL scraper...
2025-11-18 15:30:01 - Loaded 80 libraries
2025-11-18 15:30:05 - [1/80] Checking: Agronomski i prehrambeno-tehnološki fakultet, Mostar (APFMO)
2025-11-18 15:30:15 -   ✓ Found ISIL codes in COBISS: ['BA-MO-APFMO']
2025-11-18 15:30:20 - [2/80] Checking: Akademija likovnih umjetnosti (ALU)
...
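
A log in this `timestamp - message` format can be produced with the standard `logging` module. The following is a minimal sketch, assuming a logger named `bosnia_isil_scraper` and a hypothetical `setup_logger` helper; the script's real logging setup may differ.

```python
import logging


def setup_logger(log_path: str) -> logging.Logger:
    """Create a file logger that writes 'YYYY-MM-DD HH:MM:SS - message' lines."""
    logger = logging.getLogger("bosnia_isil_scraper")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate lines if called more than once
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S")
    )
    logger.addHandler(handler)
    return logger
```

Because the file handler flushes after every record, `tail -f` on the log file shows progress as each library is processed.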

Intermediate Results

Check progress every 10 libraries:

jq 'length' data/isil/bosnia/bosnia_isil_codes_found.json

Risk Mitigation

Fault Tolerance

  1. Intermediate Saves: Results saved every 10 libraries
  2. Error Handling: Script continues if individual pages fail
  3. Logging: All errors logged to scraper_log.txt
  4. Timeout Protection: 10-15 second timeouts per page
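
The first three points can be combined in one loop, sketched below. This is an assumption about the structure, not the script's actual code: `check_library` stands in for whatever performs the per-library checks, and `save_results` and `scrape_with_checkpoints` are hypothetical names.

```python
import json
from pathlib import Path


def save_results(results: list[dict], path: Path) -> None:
    """Write to a temporary file first so an interrupted save cannot corrupt the output."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8")
    tmp.replace(path)


def scrape_with_checkpoints(libraries, check_library, output_path: Path,
                            save_every: int = 10, log=print) -> list[dict]:
    """Check every library, logging failures instead of aborting the run."""
    results = []
    total = len(libraries)
    for index, library in enumerate(libraries, start=1):
        try:
            result = check_library(library)
        except Exception as exc:  # one failing page must not stop the whole run
            log(f"[{index}/{total}] Error for {library.get('name')}: {exc}")
            result = {**library, "isil_found": False, "isil_codes": [],
                      "notes": [f"Error: {exc}"]}
        results.append(result)
        if index % save_every == 0:
            save_results(results, output_path)  # intermediate save
    save_results(results, output_path)  # final save covers the last partial batch
    return results
```

Recording the error in `notes` keeps failed libraries visible in the output, so they can be retried or checked by hand later.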

Rate Limiting

  • 2-second delay between libraries
  • Prevents overwhelming COBISS servers
  • Keeps the request rate on institutional websites low
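
A delay of this kind can be a plain `time.sleep(2)` at the end of each iteration. The sketch below is a slightly stricter variant that enforces a minimum interval between the start of consecutive checks; `Throttle` is a hypothetical helper, and its clock and sleep functions are injectable only so the behaviour can be tested without waiting.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between consecutive wait() calls."""

    def __init__(self, min_interval: float = 2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous wait() call, if any

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` at the top of the per-library loop means slow pages do not add extra delay on top of the time they already took.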

After Completion

Analyze Results

# Count how many ISIL codes were found
jq '[.[] | select(.isil_found == true)] | length' data/isil/bosnia/bosnia_isil_codes_found.json

# List all unique ISIL codes
jq '[.[].isil_codes[]] | unique' data/isil/bosnia/bosnia_isil_codes_found.json

# Find libraries without ISIL codes
jq '[.[] | select(.isil_found == false) | .name]' data/isil/bosnia/bosnia_isil_codes_found.json
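
If `jq` is not available, the same three summaries can be computed in Python. This sketch assumes the output file has the structure shown in the Output section; `summarize` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize(results: list[dict]) -> dict:
    """Compute the same figures as the three jq queries above."""
    found = [r for r in results if r.get("isil_found") is True]
    codes = sorted({code for r in results for code in r.get("isil_codes", [])})
    missing = [r["name"] for r in results if r.get("isil_found") is False]
    return {"found_count": len(found), "unique_codes": codes, "missing": missing}


def summarize_file(path: str) -> dict:
    """Load the results file and summarize it."""
    return summarize(json.loads(Path(path).read_text(encoding="utf-8")))
```

For example, `summarize_file("data/isil/bosnia/bosnia_isil_codes_found.json")` returns the count, the unique codes, and the names of libraries without a code in one dictionary.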

Next Steps Based on Results

If ISIL Codes Found:

  1. Validate code format (BA- vs. BO-)
  2. Create LinkML instance files
  3. Update investigation report with findings

If No ISIL Codes Found:

  1. Confirm exhaustive search completed
  2. Send email to NUBBiH (template in FINAL_REPORT.md)
  3. Document that ISIL codes require direct contact

Comparison: Manual vs. Automated

| Aspect | Manual | Automated Script |
|--------|--------|------------------|
| Time | ~6.7 hours | ~10-20 minutes |
| Coverage | 80 libraries | 80 libraries |
| Accuracy | Human error possible | Consistent pattern matching |
| Documentation | Manual notes | Structured JSON + logs |
| Reproducibility | Low (fatigue) | High (repeatable) |
| Intermediate Saves | Manual | Every 10 libraries |
| Error Recovery | Start over | Resume from last save |
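
Resuming from the last save could work along the following lines. This is a sketch under the assumption that each saved record carries the `number` field from the output format; `load_completed` and `remaining_libraries` are hypothetical names, and the script's actual recovery behaviour is not shown in this document.

```python
import json
from pathlib import Path


def load_completed(path: Path) -> list[dict]:
    """Return previously saved results, or an empty list if none are usable."""
    if not path.exists():
        return []
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return []  # treat a corrupt file as no progress
    return data if isinstance(data, list) else []


def remaining_libraries(libraries: list[dict], completed: list[dict]) -> list[dict]:
    """Drop libraries whose number already appears in the saved results."""
    done = {r["number"] for r in completed if "number" in r}
    return [lib for lib in libraries if lib["number"] not in done]
```

With saves every 10 libraries, an interrupted run would repeat at most the libraries processed since the last save.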

Script Code Overview

# Key functions:
- search_for_isil(text): Pattern matching for ISIL codes
- check_cobiss_library_page(page, acronym): Check COBISS pages
- check_institution_website(page, homepage): Check library websites
- scrape_all_libraries(): Main orchestration loop

# Output:
- bosnia_isil_codes_found.json: Structured results
- scraper_log.txt: Real-time progress log

Decision Point

You have three options:

  1. Run the script now → Complete automation (10-20 min)
  2. Test on 5 libraries → Validate approach before full run
  3. Skip automation → Proceed with email contact strategy

Recommendation: Run the script. Even if ISIL codes aren't found, it provides conclusive evidence for the email request to NUBBiH, demonstrating due diligence.


Status: AWAITING USER DECISION TO EXECUTE

Next Command (if executing):

cd /Users/kempersc/apps/glam && python scripts/bosnia_isil_scraper.py