glam/data/isil/bosnia/AUTOMATION_SCRIPT_CREATED.md
2025-11-19 23:25:22 +01:00

# Bosnia ISIL Automation Script - Created
**Date**: November 18, 2025
**Status**: Script ready, awaiting execution
---
## Script Created
**Location**: `/Users/kempersc/apps/glam/scripts/bosnia_isil_scraper.py`
### What It Does
Automates the manual fallback process of checking all 80 COBISS.BH libraries for ISIL codes.
### Strategy
For each of the 80 libraries:
1. **Check COBISS Library Pages**
- Try multiple COBISS URL patterns
- Search page content for ISIL code patterns
2. **Check Institutional Websites**
- Navigate to library homepage (if available)
- Check main page and "About"/"Contact" sections
- Search for ISIL codes, ISO 15511 mentions
3. **Pattern Matching**
- `BA-*` codes (ISO 3166-1 alpha-2 format)
- `BO-*` codes (legacy format from Danish registry)
- "ISIL: XX-XXXXX" mentions
- ISO 15511 standard references
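The pattern-matching step above could be sketched roughly as follows. This is a minimal illustration, not the script's actual `search_for_isil`: the function name `find_isil_candidates`, the regular expressions, and the returned dictionary shape are all assumptions made for this example.

```python
import re

# Assumption: a code is a BA or BO prefix followed by hyphen-separated
# alphanumeric groups, e.g. "BA-MO-APFMO". The real script may differ.
PREFIXED_CODE = re.compile(r"\b(?:BA|BO)-[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

# Assumption: a labelled mention looks like "ISIL: XX-XXXXX" (colon optional).
LABELLED_CODE = re.compile(
    r"ISIL\s*:?\s*([A-Z]{2}-[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*)"
)

# A bare reference to the standard, with or without a space.
STANDARD_MENTION = re.compile(r"ISO\s*15511")


def find_isil_candidates(text):
    """Return candidate codes and whether ISO 15511 is mentioned (hypothetical helper)."""
    codes = set(PREFIXED_CODE.findall(text))
    codes.update(LABELLED_CODE.findall(text))
    return {
        "isil_codes": sorted(codes),
        "mentions_iso_15511": bool(STANDARD_MENTION.search(text)),
    }
```

Patterns this loose will produce false positives (any `BA-` token matches), so candidates would still need a manual check before being recorded as ISIL codes.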
### Output
**File**: `data/isil/bosnia/bosnia_isil_codes_found.json`
**Format**:
```json
[
  {
    "number": 1,
    "name": "Library Name",
    "city": "City",
    "acronym": "ACRONYM",
    "homepage": "www.example.ba",
    "isil_found": true,
    "isil_codes": ["BA-SA-CODE", "BO-CODE"],
    "sources_checked": ["COBISS library pages", "Website: www.example.ba"],
    "notes": ["Found in COBISS: https://...", "Found on website: https://..."]
  }
]
```
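A quick way to sanity-check records against the format above might look like this. It is a sketch only: `record_problems` is a hypothetical helper, and allowing `homepage` to be `None` is an assumption based on the "if available" note in the strategy.

```python
# Field names come from the documented format; the type rules are assumptions.
EXPECTED_FIELDS = {
    "number": int,
    "name": str,
    "city": str,
    "acronym": str,
    "homepage": (str, type(None)),  # assumption: may be missing a homepage
    "isil_found": bool,
    "isil_codes": list,
    "sources_checked": list,
    "notes": list,
}


def record_problems(record):
    """Return a list of human-readable problems for one result record."""
    problems = []
    for field, expected in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            found = type(record[field]).__name__
            problems.append(f"unexpected type for {field}: {found}")
    # A record flagged as found should carry at least one code.
    if record.get("isil_found") is True and not record.get("isil_codes"):
        problems.append("isil_found is true but isil_codes is empty")
    return problems
```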
### Performance
- **Manual Estimate**: 80 libraries × 5 min/library = 400 min (**~6.7 hours**)
- **Automated Estimate**: **~10-20 minutes** (including wait times, retries)
- **Intermediate Saves**: Every 10 libraries (fault tolerance)
- **Logging**: Real-time progress to `scraper_log.txt`
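The "save every 10 libraries" behaviour could be implemented along these lines. This is a sketch under assumptions: `process_all`, `save_results`, and the `check_library` callback are hypothetical names, and writing through a temporary file is a design choice of this example rather than something the script is known to do.

```python
import json
import os
import tempfile
from pathlib import Path

SAVE_EVERY = 10  # matches the interval stated above


def save_results(results, path):
    """Write results as JSON via a temp file so a crash cannot truncate the output."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(results, handle, ensure_ascii=False, indent=2)
    os.replace(tmp_name, path)  # replaces any previous save in one step


def process_all(libraries, check_library, path):
    """Check each library, saving after every SAVE_EVERY items and once at the end."""
    results = []
    for index, library in enumerate(libraries, start=1):
        results.append(check_library(library))
        if index % SAVE_EVERY == 0:
            save_results(results, path)
    save_results(results, path)
    return results
```

Setting `ensure_ascii=False` keeps names such as "prehrambeno-tehnološki" readable in the saved file.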
---
## How to Run
### Option 1: Run Directly
```bash
cd /Users/kempersc/apps/glam
python scripts/bosnia_isil_scraper.py
```
### Option 2: Run in Background
```bash
cd /Users/kempersc/apps/glam
nohup python scripts/bosnia_isil_scraper.py > data/isil/bosnia/scraper_output.txt 2>&1 &
# Monitor progress
tail -f data/isil/bosnia/scraper_log.txt
```
### Option 3: Test on First 5 Libraries
```bash
# Edit script to limit to 5 libraries for testing
python scripts/bosnia_isil_scraper.py
```
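Option 3 relies on a manual edit. One possible alternative is a command-line limit, sketched below. The `--limit` flag and both helper functions are hypothetical; nothing in this document confirms that the script accepts such a flag.

```python
import argparse


def parse_args(argv=None):
    """Parse a hypothetical --limit option; the real script may not define it."""
    parser = argparse.ArgumentParser(description="Bosnia ISIL scraper (sketch)")
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="only process the first N libraries",
    )
    return parser.parse_args(argv)


def select_libraries(libraries, limit=None):
    """Return the full list, or only the first `limit` entries for a test run."""
    return libraries if limit is None else libraries[:limit]
```

With something like this in place, a trial run would pass `--limit 5` instead of requiring a change to the source.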
---
## Expected Outcomes
### Scenario 1: ISIL Codes Found ✅
If ISIL codes ARE included in COBISS records or institutional websites:
- Script extracts 50-80 ISIL codes
- Validates country code format (BA- vs. BO-)
- Creates complete mapping: COBISS acronym → ISIL code
### Scenario 2: No ISIL Codes Found ❌
If ISIL codes are NOT publicly accessible:
- Confirms exhaustive search (COBISS + websites)
- Validates that ISIL codes require direct contact with NUBBiH
- Provides evidence for email request to Registration Authority
### Scenario 3: Partial Results ⚠️
If some libraries have ISIL codes but not all:
- Identifies which libraries publish their ISIL codes
- Reveals inconsistencies in COBISS data entry
- Prioritizes which libraries to contact directly
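Once a run finishes, deciding which of the three scenarios applies could be automated roughly as follows. The function `summarize_run` and its thresholds are assumptions for illustration: here "all" means every library has a code, "none" means no library has one, and anything in between is "partial".

```python
def summarize_run(results):
    """Classify a finished run and list the libraries still needing direct contact."""
    found = [record for record in results if record["isil_found"]]
    missing = [record["name"] for record in results if not record["isil_found"]]
    if not found:
        scenario = "none"      # Scenario 2: nothing publicly accessible
    elif not missing:
        scenario = "all"       # Scenario 1: every library has a code
    else:
        scenario = "partial"   # Scenario 3: mixed results
    return {"scenario": scenario, "found": len(found), "to_contact": missing}
```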
---
## Dependencies
**Python Packages**:
```bash
pip install playwright
playwright install chromium
```
**Already Installed** (based on project structure):
- Python 3.11+
- Playwright (used earlier in session)
---
## Monitoring Progress
### Real-Time Log
```bash
tail -f data/isil/bosnia/scraper_log.txt
```
**Example Output**:
```
2025-11-18 15:30:00 - Starting Bosnia ISIL scraper...
2025-11-18 15:30:01 - Loaded 80 libraries
2025-11-18 15:30:05 - [1/80] Checking: Agronomski i prehrambeno-tehnološki fakultet, Mostar (APFMO)
2025-11-18 15:30:15 - ✓ Found ISIL codes in COBISS: ['BA-MO-APFMO']
2025-11-18 15:30:20 - [2/80] Checking: Akademija likovnih umjetnosti (ALU)
...
```
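A log in the format shown above can be produced with the standard `logging` module. This is a sketch, not the script's actual configuration; `build_logger` and the logger name are hypothetical.

```python
import logging


def build_logger(log_path, name="bosnia_isil_sketch"):
    """Return a logger writing 'YYYY-MM-DD HH:MM:SS - message' lines to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Drop handlers from earlier calls so lines are not written twice.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S")
    )
    logger.addHandler(handler)
    logger.propagate = False  # keep output in the file only
    return logger
```

A call such as `build_logger("data/isil/bosnia/scraper_log.txt").info("Loaded 80 libraries")` would then append one timestamped line that `tail -f` can follow.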
### Intermediate Results
Check progress every 10 libraries:
```bash
jq 'length' data/isil/bosnia/bosnia_isil_codes_found.json
```
---
## Risk Mitigation
### Fault Tolerance
1. **Intermediate Saves**: Results saved every 10 libraries
2. **Error Handling**: Script continues if individual pages fail
3. **Logging**: All errors logged to `scraper_log.txt`
4. **Timeout Protection**: 10-15 second timeouts per page
### Rate Limiting
- 2-second delay between libraries
- Prevents overwhelming COBISS servers
- Respects website terms of service
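The error-handling and rate-limiting behaviour described above could be combined in one loop, as in the sketch below. The names `check_all` and `check_library` are hypothetical, the shape of the error record is an assumption, and page timeouts are left to whatever `check_library` does internally.

```python
import logging
import time

logger = logging.getLogger("bosnia_isil_sketch")
DELAY_SECONDS = 2  # delay between libraries, as stated above


def check_all(libraries, check_library, delay=DELAY_SECONDS, sleep=time.sleep):
    """Check every library, logging failures and pausing between requests."""
    results = []
    total = len(libraries)
    for index, library in enumerate(libraries, start=1):
        try:
            results.append(check_library(library))
        except Exception as error:  # one bad page should not stop the run
            logger.error("[%d/%d] %s failed: %s", index, total, library.get("name"), error)
            results.append(
                {**library, "isil_found": False, "isil_codes": [], "notes": [f"Error: {error}"]}
            )
        if index < total:
            sleep(delay)  # no pause needed after the last library
    return results
```

Passing `sleep` as a parameter lets a test run swap in a no-op instead of waiting two seconds per library.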
---
## After Completion
### Analyze Results
```bash
# Count how many ISIL codes were found
jq '[.[] | select(.isil_found == true)] | length' data/isil/bosnia/bosnia_isil_codes_found.json
# List all unique ISIL codes
jq '[.[].isil_codes[]] | unique' data/isil/bosnia/bosnia_isil_codes_found.json
# Find libraries without ISIL codes
jq '[.[] | select(.isil_found == false) | .name]' data/isil/bosnia/bosnia_isil_codes_found.json
```
### Next Steps Based on Results
**If ISIL Codes Found**:
1. Validate code format (BA- vs. BO-)
2. Create LinkML instance files
3. Update investigation report with findings
**If No ISIL Codes Found**:
1. Confirm exhaustive search completed
2. Send email to NUBBiH (template in FINAL_REPORT.md)
3. Document that ISIL codes require direct contact
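For the "validate code format (BA- vs. BO-)" step, a first pass could simply group the collected codes by prefix. This is a sketch; `group_by_prefix` is a hypothetical helper, and it checks only the prefix, not whether a code is actually registered.

```python
from collections import defaultdict

KNOWN_PREFIXES = {"BA", "BO"}  # the two prefixes this document looks for


def group_by_prefix(codes):
    """Group codes under 'BA', 'BO', or 'other' based on the text before the first hyphen."""
    groups = defaultdict(list)
    for code in codes:
        prefix, separator, rest = code.partition("-")
        well_formed = bool(separator and rest)
        key = prefix.upper() if well_formed and prefix.upper() in KNOWN_PREFIXES else "other"
        groups[key].append(code)
    return dict(groups)
```

Anything landing in `other` would be worth a manual look before it goes into the LinkML instance files.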
---
## Comparison: Manual vs. Automated
| Aspect | Manual | Automated Script |
|--------|--------|------------------|
| **Time** | ~6.7 hours | ~10-20 minutes |
| **Coverage** | 80 libraries | 80 libraries |
| **Accuracy** | Human error possible | Consistent pattern matching |
| **Documentation** | Manual notes | Structured JSON + logs |
| **Reproducibility** | Low (fatigue) | High (repeatable) |
| **Intermediate Saves** | Manual | Every 10 libraries |
| **Error Recovery** | Start over | Resume from last save |
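The "Resume from last save" entry could work roughly as sketched here, assuming each saved record carries the `number` field from the output format. Both function names are hypothetical, and this document does not confirm that the script implements resuming this way.

```python
import json
from pathlib import Path


def load_previous(path):
    """Return records from an earlier run, or an empty list if no save exists."""
    path = Path(path)
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def remaining_libraries(libraries, previous):
    """Return only the libraries whose 'number' is not already in the saved results."""
    done = {record["number"] for record in previous}
    return [library for library in libraries if library["number"] not in done]
```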
---
## Script Code Overview
```python
# Key functions:
#   search_for_isil(text): pattern matching for ISIL codes
#   check_cobiss_library_page(page, acronym): check COBISS pages
#   check_institution_website(page, homepage): check library websites
#   scrape_all_libraries(): main orchestration loop
# Output:
#   bosnia_isil_codes_found.json: structured results
#   scraper_log.txt: real-time progress log
```
---
## Decision Point
**You have three options**:
1. **Run the script now** → Complete automation (10-20 min)
2. **Test on 5 libraries** → Validate approach before full run
3. **Skip automation** → Proceed with email contact strategy
**Recommendation**: Run the script. Even if ISIL codes aren't found, it provides conclusive evidence for the email request to NUBBiH, demonstrating due diligence.
---
**Status**: ⏳ AWAITING USER DECISION TO EXECUTE
**Next Command** (if executing):
```bash
cd /Users/kempersc/apps/glam && python scripts/bosnia_isil_scraper.py
```