# Austrian ISIL Dataset - Quick Start Guide

**Last Updated**: 2025-11-18
**Status**: ✅ COMPLETE AND VERIFIED

---

## Dataset Summary

| Metric | Value |
|--------|-------|
| **Total unique institutions** | **1,906** |
| **With ISIL codes** | 346 (18.2%) |
| **Without ISIL codes** | 1,560 (81.8%) |
| **Source** | Austrian ISIL Registry (official) |
| **Data tier** | TIER_1_AUTHORITATIVE |
| **Verification status** | ✅ Deduplication verified (no data loss) |

---

## File Locations

### Raw Data
- **Merged JSON**: `data/isil/austria/austrian_isil_merged.json`
- **Individual pages**: `data/isil/austria/page_001_data.json` through `page_194_data.json`
- **Scraper log**: `austrian_scrape_v2.log`

### Documentation
- **Session log**: `AUSTRIAN_ISIL_SESSION_CONTINUED_20251118.md`
- **Deduplication summary**: `AUSTRIAN_ISIL_DEDUPLICATION_SUMMARY.md`
- **Missing institutions analysis**: `docs/sessions/AUSTRIAN_ISIL_MISSING_INSTITUTIONS_ANALYSIS.md`
- **Verification report**: `docs/sessions/AUSTRIAN_ISIL_DEDUPLICATION_VERIFICATION.md`

---

## Quick Stats

```bash
# Count institutions
cd /Users/kempersc/apps/glam
jq '.institutions | length' data/isil/austria/austrian_isil_merged.json
# Output: 1906

# Count with ISIL codes
jq '[.institutions[] | select(.isil_code != null)] | length' data/isil/austria/austrian_isil_merged.json
# Output: 346

# Count without ISIL codes
jq '[.institutions[] | select(.isil_code == null)] | length' data/isil/austria/austrian_isil_merged.json
# Output: 1560
```

---

## Data Quality Notes

### ✅ Verified Correct
- All 194 pages scraped successfully
- 22 duplicates removed after verification (all were identical)
- Zero metadata loss confirmed
- 100% data integrity preserved
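
The duplicate removal described above can be sketched as follows. This is a minimal illustration, not the project's actual deduplication script: it assumes records are plain JSON objects and treats two records as duplicates only when every field is identical, which is why dropping them loses no metadata. The function name `deduplicate` and the sample records are hypothetical; the field names `name` and `isil_code` come from the `jq` commands in this guide.

```python
import json


def deduplicate(records):
    """Drop exact duplicates, keeping the first occurrence of each record."""
    seen = set()
    unique = []
    for record in records:
        # Canonical serialization, so key order does not affect equality.
        key = json.dumps(record, sort_keys=True, ensure_ascii=False)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical sample: two identical placeholder records plus one distinct record.
sample = [
    {"name": "Bibliothek aufgelöst!", "isil_code": None},
    {"isil_code": None, "name": "Bibliothek aufgelöst!"},
    {"name": "Fachbereichsbibliothek Wirtschaftswissenschaften", "isil_code": "AT-UBW-097"},
]
print(len(deduplicate(sample)))  # 2
```

Because only fully identical records collapse, comparing the record count before and after gives a direct check on how many duplicates were removed.
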

### ⚠️ Known Limitations
- 81.8% of institutions (1,560) lack ISIL codes (departments/branches)
- 20 dissolved libraries shared the placeholder name "Bibliothek aufgelöst!"; deduplication collapsed them into 1 record
- Hierarchical relationships (parent institutions) need manual resolution

---

## Next Steps

### 1. Parse to LinkML Format
```bash
python3 scripts/parse_austrian_isil.py
```

**Requirements**:
- Handle hierarchical names (pipe-delimited)
- Assign `parent_organization` for sub-units
- Generate GHCIDs for all institutions
- Classify institution types

### 2. Geocode Locations
```bash
python3 scripts/geocode_austria.py
```

**Sources**:
- Nominatim API for city → lat/lon
- GeoNames for Austrian place names
- Manual corrections for ambiguous locations
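
As a rough illustration of the Nominatim step, the sketch below builds a search request restricted to Austria and reads the first hit. It is not the contents of `scripts/geocode_austria.py`; the function names, the `user_agent` value, and the one-second pause are assumptions made for this example. The public Nominatim service expects clients to identify themselves and to throttle requests, so check its usage policy before running anything in bulk.

```python
import json
import time
import urllib.parse
import urllib.request

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def build_query_url(city):
    """Build a Nominatim search URL limited to Austria, returning one JSON result."""
    params = {"q": city, "format": "json", "countrycodes": "at", "limit": 1}
    return NOMINATIM_URL + "?" + urllib.parse.urlencode(params)


def geocode(city, user_agent="glam-austria-geocoder-example"):
    """Return (lat, lon) for a city, or None if nothing is found.

    The user_agent value is a placeholder; replace it with your own identifier.
    """
    request = urllib.request.Request(
        build_query_url(city), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        results = json.load(response)
    time.sleep(1)  # assumed throttle: pause between consecutive requests
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])


print(build_query_url("Wien"))
```

Results that come back empty or ambiguous are the cases that would fall through to GeoNames or manual correction.
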

### 3. Enrich with Wikidata
```bash
python3 scripts/enrich_austria_wikidata.py
```

**Strategy**:
- Query Wikidata for Austrian GLAM institutions
- Match by ISIL code (primary)
- Match by name + location (fuzzy, threshold > 0.85)
- Add Q-numbers to identifiers
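
The two-stage matching strategy above could look roughly like this. The guide only specifies the 0.85 threshold, so the similarity measure used here (`difflib.SequenceMatcher`) is an assumption, as are the `city` and `qid` field names and the function names. The Q-numbers in the example are placeholders, not real Wikidata identifiers.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.85  # from the strategy above; fuzzy matches must score above this


def normalize(text):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_institution(record, candidates):
    """Return (qid, method), or (None, None) when nothing matches.

    Stage 1: exact ISIL code match. Stage 2: fuzzy name match within the same city.
    """
    isil = record.get("isil_code")
    if isil:
        for candidate in candidates:
            if candidate.get("isil_code") == isil:
                return candidate["qid"], "isil"

    best_qid, best_score = None, 0.0
    for candidate in candidates:
        if normalize(candidate.get("city", "")) != normalize(record.get("city", "")):
            continue
        score = similarity(record["name"], candidate["name"])
        if score > best_score:
            best_qid, best_score = candidate["qid"], score
    if best_score > THRESHOLD:
        return best_qid, "fuzzy"
    return None, None


# Hypothetical candidates; "Q_EXAMPLE_*" are placeholders, not real Q-numbers.
candidates = [
    {"qid": "Q_EXAMPLE_1", "name": "Stadtarchiv Graz", "city": "Graz", "isil_code": "AT-STARG"},
    {"qid": "Q_EXAMPLE_2", "name": "Stadtmuseum Graz", "city": "Graz", "isil_code": None},
]
print(match_institution({"name": "Stadtarchiv", "city": "Graz", "isil_code": "AT-STARG"}, candidates))
# ('Q_EXAMPLE_1', 'isil')
```
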

---

## Common Commands

### View Institution Sample
```bash
jq '.institutions[0]' data/isil/austria/austrian_isil_merged.json
```

### Find Institutions by Name
```bash
jq '.institutions[] | select(.name | contains("Universität Wien"))' data/isil/austria/austrian_isil_merged.json
```

### Count Dissolved Libraries
```bash
jq '[.institutions[] | select(.name == "Bibliothek aufgelöst!")] | length' data/isil/austria/austrian_isil_merged.json
# Output: 1 (deduplicated from 20)
```

### Export to CSV (Simple)
```bash
jq -r '.institutions[] | [.name, .isil_code // ""] | @csv' data/isil/austria/austrian_isil_merged.json > austria_isil.csv
```

---

## Schema Mappings

### Institution Types (to be classified)

| Austrian Term | LinkML Type | GHCID Code |
|---------------|-------------|------------|
| Bibliothek | LIBRARY | L |
| Archiv | ARCHIVE | A |
| Museum | MUSEUM | M |
| Universitätsbibliothek | LIBRARY | L |
| Dokumentationszentrum | RESEARCH_CENTER | R |
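
One possible way to apply this mapping is a keyword lookup on the institution name. The sketch below is an assumption about how classification might work, not the project's classifier: it uses case-insensitive substring matching, applies rules in the listed order, and returns `None` when no term matches. The names `TYPE_RULES` and `classify` are hypothetical.

```python
# (keyword, LinkML type, GHCID code), taken from the mapping table above.
TYPE_RULES = [
    ("universitätsbibliothek", "LIBRARY", "L"),
    ("bibliothek", "LIBRARY", "L"),
    ("archiv", "ARCHIVE", "A"),
    ("museum", "MUSEUM", "M"),
    ("dokumentationszentrum", "RESEARCH_CENTER", "R"),
]


def classify(name):
    """Return (linkml_type, ghcid_code) for the first matching term, else None."""
    lowered = name.lower()
    for keyword, linkml_type, ghcid_code in TYPE_RULES:
        if keyword in lowered:
            return linkml_type, ghcid_code
    return None  # unclassified; left for manual review


print(classify("Universitätsbibliothek Graz"))  # ('LIBRARY', 'L')
print(classify("Stadtarchiv"))                  # ('ARCHIVE', 'A')
```

Names that contain several terms (for example an archive that sits under a library) resolve to whichever rule comes first, so hierarchical names are better classified on their last segment.
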

### ISIL Code Format

```
AT-XXXXX
│  └─ Alphanumeric code with optional hyphens
└─ Austria (ISO 3166-1 alpha-2)

Examples:
AT-STARG      → Simple code
AT-UBW-097    → With numbers
AT-40201-AR   → With embedded hyphen
```
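
A quick format check can catch malformed codes before conversion. The pattern below is derived only from the description and the three examples above: `AT-` followed by alphanumeric groups separated by single hyphens. It is an assumption, not the full ISIL grammar; the standard (ISO 15511) may permit more than this pattern does.

```python
import re

# "AT-" plus one or more alphanumeric groups joined by single hyphens.
ISIL_AT_PATTERN = re.compile(r"AT-[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def is_valid_austrian_isil(code):
    """Return True if the code matches the simplified Austrian ISIL pattern."""
    return ISIL_AT_PATTERN.fullmatch(code) is not None


for example in ["AT-STARG", "AT-UBW-097", "AT-40201-AR", "AT-", "DE-123"]:
    print(example, is_valid_austrian_isil(example))
# AT-STARG True
# AT-UBW-097 True
# AT-40201-AR True
# AT- False
# DE-123 False
```
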

---

## Known Issues

### Issue 1: Hierarchical Names Need Parsing
**Example**: "Universität Wien | Bibliothek | Fachbereichsbibliothek Wirtschaftswissenschaften AT-UBW-097"

**Solution**: Split on " | " delimiter and create parent-child relationships
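
The split described in the solution can be sketched as below. This is an illustration rather than the code in `scripts/parse_austrian_isil.py`. It assumes that a trailing ISIL code may be appended to the last segment, as in the example string, and strips it first. The function name and the returned keys are hypothetical, apart from `parent_organization` and `isil_code`, which appear elsewhere in this guide.

```python
import re

# Optional ISIL code at the very end of the raw name, as in the example above.
TRAILING_ISIL = re.compile(r"\s+(AT-[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*)$")


def parse_hierarchical_name(raw):
    """Split a pipe-delimited name into its own name and its ancestor path."""
    isil_code = None
    match = TRAILING_ISIL.search(raw)
    if match:
        isil_code = match.group(1)
        raw = raw[: match.start()]

    parts = [part.strip() for part in raw.split(" | ") if part.strip()]
    if not parts:
        raise ValueError("empty institution name")

    return {
        "name": parts[-1],
        # Full ancestor path, since a lone segment such as "Bibliothek" is ambiguous.
        "parent_organization": " | ".join(parts[:-1]) or None,
        "hierarchy": parts,
        "isil_code": isil_code,
    }


parsed = parse_hierarchical_name(
    "Universität Wien | Bibliothek | Fachbereichsbibliothek Wirtschaftswissenschaften AT-UBW-097"
)
print(parsed["name"])                 # Fachbereichsbibliothek Wirtschaftswissenschaften
print(parsed["parent_organization"])  # Universität Wien | Bibliothek
print(parsed["isil_code"])            # AT-UBW-097
```

A name without any delimiter yields `parent_organization` of `None`, which marks it as a top-level institution.
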

### Issue 2: Departments Without ISIL Codes
**Problem**: 1,560 institutions lack ISIL codes
**Solution**: Generate GHCIDs based on geographic location + name abbreviation

### Issue 3: Dissolved Libraries Indistinguishable
**Problem**: 20 "Bibliothek aufgelöst!" entries are identical
**Solution**: Keep 1 record, note 19 removed in documentation

---

## Validation Checklist

Before proceeding to LinkML conversion:

- [x] All 194 pages scraped
- [x] JSON files merged correctly
- [x] Duplicates identified and verified
- [x] Data quality confirmed (no metadata loss)
- [x] Documentation complete
- [ ] LinkML parser ready for hierarchical names
- [ ] Geocoding strategy defined
- [ ] Wikidata enrichment planned

---

## Contact & Support

- **Project**: Global GLAM Heritage Custodian Data Extraction
- **Schema**: LinkML v0.2.1 (modular)
- **Questions**: See `AGENTS.md` for AI extraction guidelines

---

**Status**: ✅ Ready for LinkML conversion
**Data Quality**: 100% verified
**Next Action**: Implement hierarchical relationship parsing