# Bosnia ISIL Automation Script - Created

**Date**: November 18, 2025  
**Status**: Script ready, awaiting execution

---

## Script Created

**Location**: `/Users/kempersc/apps/glam/scripts/bosnia_isil_scraper.py`

### What It Does

Automates the manual fallback process of checking all 80 COBISS.BH libraries for ISIL codes.

### Strategy

For each of the 80 libraries:

1. **Check COBISS Library Pages**
   - Try multiple COBISS URL patterns
   - Search page content for ISIL code patterns

2. **Check Institutional Websites**
   - Navigate to library homepage (if available)
   - Check main page and "About"/"Contact" sections
   - Search for ISIL codes, ISO 15511 mentions

3. **Pattern Matching**
   - `BA-*` codes (ISO 3166-1 alpha-2 format)
   - `BO-*` codes (legacy format from Danish registry)
   - "ISIL: XX-XXXXX" mentions
   - ISO 15511 standard references
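
The pattern-matching step could look roughly like the sketch below. It is not the script's actual `search_for_isil`; the regular expressions, the constant names, and the 11-character limit on the identifier part (taken from the ISIL format in ISO 15511) are assumptions made for illustration.

```python
import re

# Assumption: a code is the prefix BA or BO, a hyphen, then up to 11
# letters, digits or hyphens. The real script may use other patterns.
PREFIXED = re.compile(r"\b(?:BA|BO)-[A-Za-z0-9][A-Za-z0-9-]{0,10}\b")

# Assumption: labelled mentions look like "ISIL: XX-XXXXX" (label in any case).
LABELLED = re.compile(r"(?i:ISIL)\s*:?\s*([A-Z]{2}-[A-Za-z0-9:/-]{1,11})")


def search_for_isil(text: str) -> list[str]:
    """Return the sorted, de-duplicated ISIL-like codes found in `text`."""
    found = {match.group(0) for match in PREFIXED.finditer(text)}
    for match in LABELLED.finditer(text):
        # Drop separator characters that were captured at the end of a code.
        found.add(match.group(1).rstrip("-:/"))
    return sorted(found)


sample = "Biblioteka ALU, ISIL: BA-SA-ALU. Legacy code BO-12345; see ISO 15511."
print(search_for_isil(sample))
```

Patterns this loose can produce false positives (any uppercase `BA-` or `BO-` token matches), so hits are best treated as candidates to review rather than confirmed codes.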

### Output

**File**: `data/isil/bosnia/bosnia_isil_codes_found.json`

**Format** (`isil_found` is a boolean, `true` or `false`):
```json
[
  {
    "number": 1,
    "name": "Library Name",
    "city": "City",
    "acronym": "ACRONYM",
    "homepage": "www.example.ba",
    "isil_found": true,
    "isil_codes": ["BA-SA-CODE", "BO-CODE"],
    "sources_checked": ["COBISS library pages", "Website: www.example.ba"],
    "notes": ["Found in COBISS: https://...", "Found on website: https://..."]
  }
]
```
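
One way to build records in this shape from Python is a small dataclass, sketched below. The class name `LibraryResult` and the `add_finding` helper are hypothetical and not taken from the script; only the field names come from the format above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LibraryResult:
    """One entry of the output file; field names mirror the JSON format."""

    number: int
    name: str
    city: str
    acronym: str
    homepage: str
    isil_found: bool = False
    isil_codes: list[str] = field(default_factory=list)
    sources_checked: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)

    def add_finding(self, source: str, codes: list[str], note: str = "") -> None:
        """Record that `source` was checked and merge any codes it yielded."""
        self.sources_checked.append(source)
        for code in codes:
            if code not in self.isil_codes:
                self.isil_codes.append(code)
        if codes and note:
            self.notes.append(note)
        self.isil_found = bool(self.isil_codes)


result = LibraryResult(1, "Library Name", "City", "ACRONYM", "www.example.ba")
result.add_finding("COBISS library pages", [])
result.add_finding("Website: www.example.ba", ["BA-SA-CODE"], "Found on website")
print(json.dumps([asdict(result)], ensure_ascii=False, indent=2))
```

Deriving `isil_found` from `isil_codes` in one place keeps the two fields from drifting apart.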

### Performance

- **Manual Estimate**: 80 libraries × 5 min/library = 400 min (**~6.7 hours**)
- **Automated Estimate**: **~10-20 minutes** (including wait times, retries)
- **Intermediate Saves**: Every 10 libraries (fault tolerance)
- **Logging**: Real-time progress to `scraper_log.txt`

---

## How to Run

### Option 1: Run Directly

```bash
cd /Users/kempersc/apps/glam
python scripts/bosnia_isil_scraper.py
```

### Option 2: Run in Background

```bash
cd /Users/kempersc/apps/glam
nohup python scripts/bosnia_isil_scraper.py > data/isil/bosnia/scraper_output.txt 2>&1 &

# Monitor progress
tail -f data/isil/bosnia/scraper_log.txt
```

### Option 3: Test on First 5 Libraries

```bash
# Edit script to limit to 5 libraries for testing
python scripts/bosnia_isil_scraper.py
```
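
Instead of editing the script for each test run, the limit could be exposed as a command-line option. The sketch below assumes a hypothetical `--limit` flag; the script as described does not necessarily have one, and `parse_args` and `select_libraries` are illustrative names.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; `--limit` is a hypothetical addition."""
    parser = argparse.ArgumentParser(description="Bosnia ISIL scraper (sketch)")
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="only process the first N libraries (default: all)",
    )
    return parser.parse_args(argv)


def select_libraries(libraries, limit):
    """Return the first `limit` libraries, or all of them when limit is None."""
    if limit is None:
        return list(libraries)
    if limit < 0:
        raise ValueError("limit must not be negative")
    return list(libraries[:limit])


# Hypothetical invocation: python scripts/bosnia_isil_scraper.py --limit 5
args = parse_args(["--limit", "5"])
print(select_libraries(list(range(1, 81)), args.limit))
```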

---

## Expected Outcomes

### Scenario 1: ISIL Codes Found ✅

If ISIL codes ARE included in COBISS records or institutional websites:
- Script extracts 50-80 ISIL codes
- Validates country code format (BA- vs. BO-)
- Creates complete mapping: COBISS acronym → ISIL code
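
The prefix check and the acronym-to-code mapping can be sketched as below. The function names and the labels `"iso"`, `"legacy"`, `"other"`, and `"invalid"` are assumptions for illustration; the records are assumed to follow the output format described earlier.

```python
def classify_prefix(code: str) -> str:
    """Label a code by its prefix: BA- is 'iso', BO- is 'legacy'."""
    prefix, separator, identifier = code.partition("-")
    if not separator or not identifier:
        return "invalid"
    if prefix == "BA":
        return "iso"
    if prefix == "BO":
        return "legacy"
    return "other"


def build_mapping(results: list[dict]) -> dict[str, dict[str, list[str]]]:
    """Map each COBISS acronym to its codes, grouped by prefix label."""
    mapping: dict[str, dict[str, list[str]]] = {}
    for record in results:
        by_kind: dict[str, list[str]] = {}
        for code in record.get("isil_codes", []):
            by_kind.setdefault(classify_prefix(code), []).append(code)
        if by_kind:  # libraries without codes are left out of the mapping
            mapping[record["acronym"]] = by_kind
    return mapping


records = [
    {"acronym": "ALU", "isil_codes": ["BA-SA-ALU", "BO-123"]},
    {"acronym": "APFMO", "isil_codes": []},
]
print(build_mapping(records))
```

Grouping by prefix makes it easy to see which libraries only have a legacy `BO-` code and may need follow-up.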

### Scenario 2: No ISIL Codes Found ❌

If ISIL codes are NOT publicly accessible:
- Confirms exhaustive search (COBISS + websites)
- Validates that ISIL codes require direct contact with NUBBiH
- Provides evidence for email request to Registration Authority

### Scenario 3: Partial Results ⚠️

If some libraries have ISIL codes but not all:
- Identifies which libraries publish their ISIL codes
- Reveals inconsistencies in COBISS data entry
- Prioritizes which libraries to contact directly

---

## Dependencies

**Python Packages**:
```bash
pip install playwright
playwright install chromium
```

**Already Installed** (based on project structure):
- Python 3.11+
- Playwright (used earlier in session)

---

## Monitoring Progress

### Real-Time Log

```bash
tail -f data/isil/bosnia/scraper_log.txt
```

**Example Output**:
```
2025-11-18 15:30:00 - Starting Bosnia ISIL scraper...
2025-11-18 15:30:01 - Loaded 80 libraries
2025-11-18 15:30:05 - [1/80] Checking: Agronomski i prehrambeno-tehnološki fakultet, Mostar (APFMO)
2025-11-18 15:30:15 - ✓ Found ISIL codes in COBISS: ['BA-MO-APFMO']
2025-11-18 15:30:20 - [2/80] Checking: Akademija likovnih umjetnosti (ALU)
...
```

### Intermediate Results

Check progress every 10 libraries:
```bash
jq 'length' data/isil/bosnia/bosnia_isil_codes_found.json
```

---

## Risk Mitigation

### Fault Tolerance

1. **Intermediate Saves**: Results saved every 10 libraries
2. **Error Handling**: Script continues if individual pages fail
3. **Logging**: All errors logged to `scraper_log.txt`
4. **Timeout Protection**: 10-15 second timeouts per page
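
The intermediate saves and continue-on-error behaviour listed above can be sketched as follows. This is an assumption about how such a loop might be written, not the script's actual code: `run_with_checkpoints`, `save_checkpoint`, `load_checkpoint`, and the `check_library` callback are hypothetical names, and writing through a temporary file is a design choice of this sketch.

```python
import json
import os
import tempfile
from pathlib import Path

SAVE_EVERY = 10  # matches the "every 10 libraries" interval above


def save_checkpoint(results, path):
    """Write results via a temporary file so a crash cannot leave a partial file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(results, handle, ensure_ascii=False, indent=2)
    os.replace(tmp_name, path)


def load_checkpoint(path):
    """Return previously saved results, or an empty list on a first run."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def run_with_checkpoints(libraries, check_library, path, save_every=SAVE_EVERY):
    """Check each library once, saving periodically and skipping finished ones."""
    results = load_checkpoint(path)
    done = {record["number"] for record in results}
    for library in libraries:
        if library["number"] in done:
            continue  # already handled in an earlier, interrupted run
        try:
            record = check_library(library)
        except Exception as exc:  # one failing page must not stop the run
            record = {**library, "isil_found": False, "notes": [f"Error: {exc}"]}
        results.append(record)
        if len(results) % save_every == 0:
            save_checkpoint(results, path)
    save_checkpoint(results, path)
    return results
```

Because finished library numbers are read back from the saved file, rerunning after an interruption repeats at most the libraries processed since the last save.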

### Rate Limiting

- 2-second delay between libraries
- Prevents overwhelming COBISS servers
- Respects website terms of service
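
A fixed pause between libraries can be kept out of the scraping logic with a small generator, sketched here. The helper name `paced` and its injectable `sleep` parameter are assumptions made for illustration and for easier testing; only the 2-second default comes from the list above.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def paced(
    items: Iterable[T],
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items one by one, pausing `delay` seconds between consecutive items."""
    for index, item in enumerate(items):
        if index:  # no pause before the very first item
            sleep(delay)
        yield item


# Demonstration with a recording stand-in for time.sleep, so nothing waits.
pauses = []
for acronym in paced(["APFMO", "ALU"], sleep=pauses.append):
    print("checking", acronym)
print(pauses)
```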

---

## After Completion

### Analyze Results

```bash
# Count how many ISIL codes were found
jq '[.[] | select(.isil_found == true)] | length' data/isil/bosnia/bosnia_isil_codes_found.json

# List all unique ISIL codes
jq '[.[].isil_codes[]] | unique' data/isil/bosnia/bosnia_isil_codes_found.json

# Find libraries without ISIL codes
jq '[.[] | select(.isil_found == false) | .name]' data/isil/bosnia/bosnia_isil_codes_found.json
```
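
Where `jq` is not available, roughly the same three summaries can be computed in Python. The sketch below assumes the output format described earlier; `summarize` and `summarize_file` are hypothetical helper names.

```python
import json
from pathlib import Path


def summarize(results: list[dict]) -> dict:
    """Compute roughly the same three summaries as the jq commands above."""
    found = [record for record in results if record.get("isil_found")]
    codes = {code for record in results for code in record.get("isil_codes", [])}
    missing = [record["name"] for record in results if not record.get("isil_found")]
    return {
        "found_count": len(found),
        "unique_codes": sorted(codes),
        "missing": missing,
    }


def summarize_file(path) -> dict:
    """Load the results file and summarize it."""
    return summarize(json.loads(Path(path).read_text(encoding="utf-8")))


example = [
    {"name": "Library A", "isil_found": True, "isil_codes": ["BA-SA-A"]},
    {"name": "Library B", "isil_found": False, "isil_codes": []},
]
print(summarize(example))
```

Unlike the `jq` filter, this version also counts a record with no `isil_found` key as "not found", which is a choice of the sketch.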

### Next Steps Based on Results

**If ISIL Codes Found**:
1. Validate code format (BA- vs. BO-)
2. Create LinkML instance files
3. Update investigation report with findings

**If No ISIL Codes Found**:
1. Confirm exhaustive search completed
2. Send email to NUBBiH (template in FINAL_REPORT.md)
3. Document that ISIL codes require direct contact

---

## Comparison: Manual vs. Automated

| Aspect | Manual | Automated Script |
|--------|--------|------------------|
| **Time** | ~6.7 hours | ~10-20 minutes |
| **Coverage** | 80 libraries | 80 libraries |
| **Accuracy** | Human error possible | Consistent pattern matching |
| **Documentation** | Manual notes | Structured JSON + logs |
| **Reproducibility** | Low (fatigue) | High (repeatable) |
| **Intermediate Saves** | Manual | Every 10 libraries |
| **Error Recovery** | Start over | Resume from last save |

---

## Script Code Overview

```python
# Key functions:
#   search_for_isil(text): pattern matching for ISIL codes
#   check_cobiss_library_page(page, acronym): check COBISS pages
#   check_institution_website(page, homepage): check library websites
#   scrape_all_libraries(): main orchestration loop

# Output:
#   bosnia_isil_codes_found.json: structured results
#   scraper_log.txt: real-time progress log
```

---

## Decision Point

**You have three options**:

1. **Run the script now** → Complete automation (10-20 min)
2. **Test on 5 libraries** → Validate approach before full run
3. **Skip automation** → Proceed with email contact strategy

**Recommendation**: Run the script. Even if ISIL codes aren't found, it provides conclusive evidence for the email request to NUBBiH, demonstrating due diligence.

---

**Status**: ⏳ AWAITING USER DECISION TO EXECUTE

**Next Command** (if executing):
```bash
cd /Users/kempersc/apps/glam && python scripts/bosnia_isil_scraper.py
```