# Austrian ISIL Dataset - Quick Start Guide
**Last Updated**: 2025-11-18

**Status**: ✅ COMPLETE AND VERIFIED

---
## Dataset Summary
| Metric | Value |
|--------|-------|
| **Total unique institutions** | **1,906** |
| **With ISIL codes** | 346 (18.1%) |
| **Without ISIL codes** | 1,560 (81.9%) |
| **Source** | Austrian ISIL Registry (official) |
| **Data tier** | TIER_1_AUTHORITATIVE |
| **Verification status** | ✅ Deduplication verified (no data loss) |
---
## File Locations
### Raw Data
- **Merged JSON**: `data/isil/austria/austrian_isil_merged.json`
- **Individual pages**: `data/isil/austria/page_001_data.json` through `page_194_data.json`
- **Scraper log**: `austrian_scrape_v2.log`
### Documentation
- **Session log**: `AUSTRIAN_ISIL_SESSION_CONTINUED_20251118.md`
- **Deduplication summary**: `AUSTRIAN_ISIL_DEDUPLICATION_SUMMARY.md`
- **Missing institutions analysis**: `docs/sessions/AUSTRIAN_ISIL_MISSING_INSTITUTIONS_ANALYSIS.md`
- **Verification report**: `docs/sessions/AUSTRIAN_ISIL_DEDUPLICATION_VERIFICATION.md`
---
## Quick Stats
```bash
# Run from the repository root (adjust the path to your checkout)
cd /Users/kempersc/apps/glam
# Count institutions
cat data/isil/austria/austrian_isil_merged.json | jq '.institutions | length'
# Output: 1906
# Count with ISIL codes
cat data/isil/austria/austrian_isil_merged.json | jq '[.institutions[] | select(.isil_code != null)] | length'
# Output: 346
# Count without ISIL codes
cat data/isil/austria/austrian_isil_merged.json | jq '[.institutions[] | select(.isil_code == null)] | length'
# Output: 1560
```
---
## Data Quality Notes
### ✅ Verified Correct
- All 194 pages scraped successfully
- 22 duplicate records removed after verification; each was identical to a record that was kept
- No metadata lost through deduplication
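
The deduplication rule described above (drop a record only when it is identical to one already kept) can be sketched as follows. This is a minimal illustration, not the project's actual deduplication code: it assumes "identical" means every field matches, and the function name `deduplicate` is hypothetical.

```python
import json


def deduplicate(records):
    """Collapse records that are identical in every field.

    Returns the unique records (first occurrence kept, order preserved)
    and the number of duplicates that were dropped.
    """
    seen = set()
    unique = []
    for record in records:
        # Canonical serialization: key order must not affect equality.
        key = json.dumps(record, sort_keys=True, ensure_ascii=False)
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique, len(records) - len(unique)
```

Because only fully identical records are merged, two entries that share a name but differ in any other field are both kept, which is what makes the "no metadata lost" claim checkable.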
### ⚠️ Known Limitations
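- 81.9% of institutions (1,560) lack ISIL codes; these are departments or branches
- 20 dissolved libraries carried the identical placeholder name "Bibliothek aufgelöst!" ("library dissolved"); deduplication collapsed them into a single record
- Hierarchical relationships (parent institutions) still need manual resolution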
---
## Next Steps
### 1. Parse to LinkML Format
```bash
python3 scripts/parse_austrian_isil.py
```
**Requirements**:
- Handle hierarchical names (pipe-delimited)
- Assign `parent_organization` for sub-units
- Generate GHCIDs for all institutions
- Classify institution types
### 2. Geocode Locations
```bash
python3 scripts/geocode_austria.py
```
**Sources**:
- Nominatim API for city → lat/lon
- GeoNames for Austrian place names
- Manual corrections for ambiguous locations
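
The Nominatim lookup listed above might look like the sketch below. It is an assumption about how `geocode_austria.py` could work, not its actual contents: the helper names, the `User-Agent` string, and the choice to restrict results with `countrycodes=at` are all illustrative. The public Nominatim service asks clients to identify themselves and to stay at or below one request per second, so a real run would also need throttling and caching.

```python
import json
import urllib.parse
import urllib.request

NOMINATIM_SEARCH_URL = "https://nominatim.openstreetmap.org/search"


def build_search_url(city):
    """Build a Nominatim search URL for a city name, limited to Austria."""
    params = {"q": city, "countrycodes": "at", "format": "jsonv2", "limit": 1}
    return NOMINATIM_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def geocode_city(city, user_agent="glam-austria-geocoder-example"):
    """Return (lat, lon) for the best match, or None if nothing was found.

    Performs a live HTTP request; the user_agent value is a placeholder.
    """
    request = urllib.request.Request(
        build_search_url(city), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        results = json.load(response)
    if not results:
        return None
    # Nominatim returns coordinates as strings.
    return float(results[0]["lat"]), float(results[0]["lon"])
```

Locations that return `None` or an implausible match are the candidates for the manual corrections mentioned above.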
### 3. Enrich with Wikidata
```bash
python3 scripts/enrich_austria_wikidata.py
```
**Strategy**:
- Query Wikidata for Austrian GLAM institutions
- Match by ISIL code (primary)
- Match by name + location (fuzzy, threshold > 0.85)
- Add Q-numbers to identifiers
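
The two-stage matching strategy above can be sketched as follows, assuming the Wikidata candidates have already been fetched into a list of dictionaries. The candidate keys (`qid`, `label`, `city`, `isil`) and the institution's `city` field are hypothetical, and `difflib.SequenceMatcher` stands in for whichever fuzzy matcher the real script uses. Comparing names only within the same city is one simple way to read "name + location".

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.85  # threshold taken from the strategy above


def similarity(a, b):
    """Case-insensitive similarity ratio between two names (0.0 to 1.0)."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def match_wikidata(institution, candidates):
    """Return (qid, method) for the best candidate, or None.

    Stage 1: exact ISIL match. Stage 2: fuzzy name match within the same city.
    """
    isil = institution.get("isil_code")
    if isil:
        for candidate in candidates:
            if candidate.get("isil") == isil:
                return candidate["qid"], "isil"

    best_qid, best_score = None, 0.0
    for candidate in candidates:
        if candidate.get("city") != institution.get("city"):
            continue
        score = similarity(institution["name"], candidate["label"])
        if score > best_score:
            best_qid, best_score = candidate["qid"], score

    if best_score > FUZZY_THRESHOLD:
        return best_qid, "fuzzy"
    return None
```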
---
## Common Commands
### View Institution Sample
```bash
cat data/isil/austria/austrian_isil_merged.json | jq '.institutions[0]'
```
### Find Institutions by Name
```bash
cat data/isil/austria/austrian_isil_merged.json | jq '.institutions[] | select(.name | contains("Universität Wien"))'
```
### Count Dissolved Libraries
```bash
cat data/isil/austria/austrian_isil_merged.json | jq '[.institutions[] | select(.name == "Bibliothek aufgelöst!")] | length'
# Output: 1 (deduplicated from 20)
```
### Export to CSV (Simple)
```bash
cat data/isil/austria/austrian_isil_merged.json | jq -r '.institutions[] | [.name, .isil_code // ""] | @csv' > austria_isil.csv
```
---
## Schema Mappings
### Institution Types (to be classified)
| Austrian Term | LinkML Type | GHCID Code |
|---------------|-------------|------------|
| Bibliothek | LIBRARY | L |
| Archiv | ARCHIVE | A |
| Museum | MUSEUM | M |
| Universitätsbibliothek | LIBRARY | L |
| Dokumentationszentrum | RESEARCH_CENTER | R |
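
Since classification is still open, the following is only one possible approach: a keyword lookup over the institution name using the terms from the table above. The function name and the first-match-wins rule are assumptions; names containing several keywords, or none, would need manual review.

```python
# (keyword, LinkML type, GHCID code), taken from the mapping table.
# More specific terms come first so they win over their substrings.
TYPE_KEYWORDS = [
    ("universitätsbibliothek", "LIBRARY", "L"),
    ("dokumentationszentrum", "RESEARCH_CENTER", "R"),
    ("bibliothek", "LIBRARY", "L"),
    ("archiv", "ARCHIVE", "A"),
    ("museum", "MUSEUM", "M"),
]


def classify_institution(name):
    """Return (linkml_type, ghcid_code) for the first keyword found, else None."""
    lowered = name.casefold()
    for keyword, linkml_type, ghcid_code in TYPE_KEYWORDS:
        if keyword in lowered:
            return linkml_type, ghcid_code
    return None
```

Substring matching also catches German compounds such as "Fachbereichsbibliothek", which a whole-word match would miss.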
### ISIL Code Format
```
AT-XXXXX
│  └─ Alphanumeric code with optional hyphens
└─ Austria (ISO 3166-1 alpha-2)

Examples:
  AT-STARG     → Simple code
  AT-UBW-097   → With numbers
  AT-40201-AR  → With embedded hyphen
```
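
A format check based only on the description above (the `AT-` prefix followed by alphanumeric groups separated by single hyphens) could be sketched like this. The pattern is an assumption derived from the three examples, not a full implementation of the ISIL standard, which permits additional characters and sets a length limit.

```python
import re

# "AT-" followed by one or more alphanumeric groups joined by single hyphens.
AUSTRIAN_ISIL_PATTERN = re.compile(r"AT-[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def is_austrian_isil(code):
    """Return True if the whole string matches the Austrian ISIL shape."""
    return AUSTRIAN_ISIL_PATTERN.fullmatch(code) is not None
```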
---
## Known Issues
### Issue 1: Hierarchical Names Need Parsing
**Example**: "Universität Wien | Bibliothek | Fachbereichsbibliothek Wirtschaftswissenschaften AT-UBW-097"

**Solution**: Split on the " | " delimiter and create parent-child relationships between the resulting levels
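
A minimal sketch of that split, using the example name above. The function name and the returned keys other than `parent_organization` are hypothetical, and stripping a trailing ISIL code from the last segment is an assumption based on the example string. Splitting on a bare `|` and trimming whitespace is slightly more tolerant than splitting on the exact " | " delimiter.

```python
import re

# A trailing ISIL code such as "AT-UBW-097" at the end of the raw name.
TRAILING_ISIL = re.compile(r"\s+(AT-[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*)$")


def parse_hierarchical_name(raw_name):
    """Split a pipe-delimited name into its own name, parent, and ancestors."""
    isil_code = None
    match = TRAILING_ISIL.search(raw_name)
    if match:
        isil_code = match.group(1)
        raw_name = raw_name[: match.start()]

    levels = [part.strip() for part in raw_name.split("|") if part.strip()]
    return {
        "name": levels[-1] if levels else "",
        "parent_organization": levels[-2] if len(levels) > 1 else None,
        "ancestors": levels[:-1],  # outermost institution first
        "isil_code": isil_code,
    }
```

For the example, the last segment becomes the unit's own name, "Bibliothek" its direct parent, and "Universität Wien" the top-level ancestor; a name without any delimiter yields no parent.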
### Issue 2: Departments Without ISIL Codes
**Problem**: 1,560 institutions lack ISIL codes

**Solution**: Generate GHCIDs based on geographic location + name abbreviation
### Issue 3: Dissolved Libraries Indistinguishable
**Problem**: The 20 "Bibliothek aufgelöst!" entries are identical

**Solution**: Keep 1 record and note the 19 removed entries in the documentation

---
## Validation Checklist
Before proceeding to LinkML conversion:
- [x] All 194 pages scraped
- [x] JSON files merged correctly
- [x] Duplicates identified and verified
- [x] Data quality confirmed (no metadata loss)
- [x] Documentation complete
- [ ] LinkML parser ready for hierarchical names
- [ ] Geocoding strategy defined
- [ ] Wikidata enrichment planned
---
## Contact & Support
- **Project**: Global GLAM Heritage Custodian Data Extraction
- **Schema**: LinkML v0.2.1 (modular)
- **Questions**: See `AGENTS.md` for AI extraction guidelines
---
**Status**: ✅ Ready for LinkML conversion

**Data Quality**: Deduplication verified with no metadata loss

**Next Action**: Implement hierarchical relationship parsing