glam/docs/isil_enrichment_strategy.md
2025-11-19 23:25:22 +01:00


# ISIL Enrichment Strategy for Latin American Institutions
**Date**: 2025-11-06
**Dataset**: 304 Latin American institutions (Brazil: 97, Chile: 90, Mexico: 117)
**Current ISIL Coverage**: 0% (0/304 institutions)
## Research Findings
### ISIL System Architecture
**Decentralized Model**: ISIL operates through national registration agencies, not a global database.
**Key Authorities**:
- **Denmark**: Slots- og Kulturstyrelsen (international registration authority for ISO 15511)
- **USA**: Library of Congress (US national agency, uses MARC org codes)
- **Europe**: data.europa.eu, data.overheid.nl (Dutch ISIL registry - we have this!)
- **National**: Each country's national library or archive maintains its own registry
**Important Discovery**: No single global ISIL database exists. Each country maintains its own.
### Country-Specific Findings
#### Brazil (BR-)
- **Expected Agency**: Biblioteca Nacional do Brasil or IBICT
- **Status**: No public ISIL registry found via web search
- **Challenge**: Search results dominated by ISIL (terrorism) references, not library codes
- **Recommendation**: Direct contact with Biblioteca Nacional do Brasil required
#### Mexico (MX-)
- **Expected Agency**: Biblioteca Nacional de México (under UNAM)
- **Alternate Contact**: Asociación Mexicana de Bibliotecarios (AMBAC)
- **Status**: No public ISIL registry found
- **Search Strategy**: Try .gob.mx or .unam.mx domain-specific searches for "código ISIL"
- **Recommendation**: Contact BNM/UNAM directly
#### Chile (CL-)
- **Expected Agency**: Biblioteca Nacional de Chile or DIBAM (now Servicio Nacional del Patrimonio Cultural)
- **Status**: No public ISIL registry found
- **Recommendation**: Contact Biblioteca Nacional de Chile
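Any ISIL candidate that later turns up (for example via Wikidata) can be routed to the expected agency by its country prefix. The sketch below assumes only that a code has the shape `PREFIX-identifier`; the names `EXPECTED_AGENCIES`, `split_isil`, and `expected_agency` are hypothetical, and the sample codes are made up. An unknown prefix is not treated as an error, since it may simply belong to another country or agency.

```python
# Expected agencies as listed in the country findings above (unconfirmed)
EXPECTED_AGENCIES = {
    "BR": "Biblioteca Nacional do Brasil or IBICT",
    "MX": "Biblioteca Nacional de México (under UNAM)",
    "CL": "Biblioteca Nacional de Chile or Servicio Nacional del Patrimonio Cultural",
}


def split_isil(code):
    """Split a candidate code at its first hyphen into (prefix, identifier)."""
    prefix, sep, unit = code.strip().partition("-")
    if not sep or not prefix or not unit:
        raise ValueError(f"Not in PREFIX-identifier form: {code!r}")
    return prefix, unit


def expected_agency(code):
    """Return the agency expected to have issued the code, or None if unknown."""
    prefix, _ = split_isil(code)
    return EXPECTED_AGENCIES.get(prefix.upper())


# Made-up sample codes, for illustration only
print(expected_agency("MX-EXAMPLE01"))
print(expected_agency("NL-EXAMPLE01"))
```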
## Alternative Enrichment Strategy
### Phase 1: Leverage Existing Identifiers (IMMEDIATE - This Week)
Since ISIL codes are unavailable, we'll enrich using identifiers we CAN access:
#### 1.1 Wikidata Enrichment
**What We Have**: Some institutions already have Wikidata IDs in the dataset.
**What We Can Get**:
- ISIL codes (Wikidata Property P791) - if they exist
- Additional identifiers (GND, VIAF, Library of Congress)
- Geographic coordinates
- Parent organizations
- Website URLs
**Action**: Run Wikidata SPARQL queries
```python
# Script: scripts/enrich_from_wikidata.py
from SPARQLWrapper import SPARQLWrapper, JSON


def query_wikidata_isil(country_qid):
    """
    Query Wikidata for GLAM institutions in a country, with ISIL codes where present

    Args:
        country_qid: Wikidata QID for country (Q155=Brazil, Q96=Mexico, Q298=Chile)

    Returns:
        dict: SPARQL JSON results (rows are under results -> bindings)
    """
    # The Wikidata Query Service expects a descriptive User-Agent;
    # the string below is a placeholder to replace with project contact details
    sparql = SPARQLWrapper(
        "https://query.wikidata.org/sparql",
        agent="GlobalGLAMDataset (ISIL enrichment script)",
    )
    query = f"""
    SELECT DISTINCT ?org ?orgLabel ?isil ?viaf ?website ?coords WHERE {{
      ?org wdt:P17 wd:{country_qid} .              # Country
      OPTIONAL {{ ?org wdt:P791 ?isil . }}         # ISIL code
      OPTIONAL {{ ?org wdt:P214 ?viaf . }}         # VIAF ID
      OPTIONAL {{ ?org wdt:P856 ?website . }}      # Official website
      OPTIONAL {{ ?org wdt:P625 ?coords . }}       # Coordinates
      # Filter for GLAM institutions
      VALUES ?type {{
        wd:Q7075      # Library
        wd:Q166118    # Archive
        wd:Q33506     # Museum
        wd:Q1007870   # Art gallery
      }}
      ?org wdt:P31/wdt:P279* ?type .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,pt,es". }}
    }}
    LIMIT 1000
    """
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return results
```
**Expected Output**:
- Brazil: ~20-30 institutions with Wikidata records
- Mexico: ~15-25 institutions
- Chile: ~10-20 institutions
- ISIL codes: <10 total (realistic estimate)
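The query returns rows in the standard SPARQL JSON results layout (`results` -> `bindings`, each cell an object with a `value` key), and an institution with several identifier values appears in several rows. A minimal sketch of collapsing those rows into one record per institution follows. The function name `extract_wikidata_records`, the output keys, and the sample rows are illustrative assumptions, not part of the existing scripts.

```python
def extract_wikidata_records(results):
    """Collapse SPARQL JSON result rows into one dict per Wikidata entity.

    Hypothetical helper: the output keys are chosen for illustration only.
    """
    records = {}
    for row in results.get("results", {}).get("bindings", []):
        # ?org comes back as a full entity URI; the QID is its last path segment
        qid = row["org"]["value"].rsplit("/", 1)[-1]
        rec = records.setdefault(
            qid,
            {"qid": qid, "label": None, "isil": [], "viaf": [],
             "website": None, "coords": None},
        )
        if "orgLabel" in row:
            rec["label"] = row["orgLabel"]["value"]
        # OPTIONAL variables are absent from a row when unbound
        for key in ("isil", "viaf"):
            value = row.get(key, {}).get("value")
            if value and value not in rec[key]:
                rec[key].append(value)
        # Keep the first website / coordinate value seen
        for key in ("website", "coords"):
            if rec[key] is None and key in row:
                rec[key] = row[key]["value"]
    return list(records.values())


# Made-up rows that mimic the response shape (not real Wikidata content)
sample = {"results": {"bindings": [
    {"org": {"value": "http://www.wikidata.org/entity/QEXAMPLE1"},
     "orgLabel": {"value": "Example Library"},
     "isil": {"value": "XX-0001"}},
    {"org": {"value": "http://www.wikidata.org/entity/QEXAMPLE1"},
     "orgLabel": {"value": "Example Library"},
     "isil": {"value": "XX-0001"}, "viaf": {"value": "123"}},
    {"org": {"value": "http://www.wikidata.org/entity/QEXAMPLE2"},
     "orgLabel": {"value": "Example Museum"}},
]}}

records = extract_wikidata_records(sample)
with_isil = [r for r in records if r["isil"]]
print(f"{len(with_isil)} of {len(records)} records carry an ISIL code")
```

Grouping by QID keeps the ISIL count honest: counting raw rows would overstate coverage whenever an institution has more than one VIAF ID or website.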
#### 1.2 VIAF Enrichment
**What VIAF Provides**:
- Authority records for organizations
- ISIL codes (when available)
- Cross-references to national authority files
**Action**: For each institution with a VIAF ID, fetch the full VIAF record
```python
# Script: scripts/enrich_from_viaf.py
import requests
import xml.etree.ElementTree as ET


def fetch_viaf_record(viaf_id):
    """
    Fetch VIAF record and extract ISIL code if present

    Args:
        viaf_id: VIAF identifier (e.g., "123556639")

    Returns:
        dict with extracted data including ISIL if available,
        or None if the record could not be fetched
    """
    url = f"https://viaf.org/viaf/{viaf_id}/viaf.xml"
    # A timeout keeps a stalled request from hanging the whole batch
    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        return None

    root = ET.fromstring(response.content)

    # Look for ISIL in various VIAF fields
    # VIAF may include ISIL in organizational identifiers
    data = {
        'viaf_id': viaf_id,
        'isil': None,  # Extract if found
        'alt_names': [],
        'related_ids': {}
    }
    # TODO: parse `root` for ISIL references
    # Format varies - may be in <ns1:sources> or <ns1:mainHeadings>
    return data
```
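**Current Dataset Status**: Check how many institutions already have VIAF IDs. The command below counts lines that mention `VIAF`, which only approximates the number of institutions.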
```bash
grep -c "VIAF" data/instances/latin_american_institutions.yaml
```
#### 1.3 OpenStreetMap (OSM) Enrichment
**What OSM Provides**:
- Building coordinates (better than city-level)
- Opening hours, contact info
- Sometimes website URLs
**Action**: For institutions with addresses, query Nominatim/Overpass API
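```python
# Script: scripts/enrich_from_osm.py
import requests


def search_osm_by_name_and_location(institution_name, city, country):
    """
    Search OpenStreetMap for institution details

    Note: `city` is accepted but not yet used to narrow the search.
    `country` must match the `name` tag of the country's OSM area,
    which is usually the local-language name.

    Returns:
        Parsed Overpass JSON (None on HTTP error) containing:
        - Precise coordinates
        - OSM tags (amenity=library, tourism=museum, etc.)
        - Website URL (if tagged)
    """
    overpass_url = "https://overpass-api.de/api/interpreter"

    # Overpass QL query. institution_name is interpolated as a regex,
    # so names with quotes or regex metacharacters need escaping first.
    # `out center;` adds a center point for ways and relations;
    # `out body;` would return coordinates for nodes only.
    query = f"""
    [out:json][timeout:60];
    area["name"="{country}"]->.country;
    (
      node(area.country)["name"~"{institution_name}",i];
      way(area.country)["name"~"{institution_name}",i];
      relation(area.country)["name"~"{institution_name}",i];
    );
    out center;
    """
    response = requests.post(overpass_url, data={'data': query}, timeout=90)
    if response.status_code == 200:
        return response.json()
    return None
```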
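The Overpass response still has to be reduced to the fields the dataset needs. In Overpass JSON, nodes carry `lat`/`lon` directly, while ways and relations only carry a `center` object when the query ends in `out center;`. The sketch below handles both cases; the helper name `summarize_osm_elements`, the output keys, and the sample response are assumptions for illustration.

```python
def summarize_osm_elements(overpass_json):
    """Reduce an Overpass JSON response to name, coordinates, and a few tags.

    Hypothetical helper: output keys are chosen for illustration only.
    """
    summaries = []
    for el in overpass_json.get("elements", []):
        tags = el.get("tags", {})
        # Nodes have lat/lon on the element itself; ways and relations
        # have them under "center" (only if the query requested it)
        point = el if "lat" in el else el.get("center", {})
        summaries.append({
            "osm_id": f'{el["type"]}/{el["id"]}',
            "name": tags.get("name"),
            "lat": point.get("lat"),
            "lon": point.get("lon"),
            "website": tags.get("website") or tags.get("contact:website"),
            "opening_hours": tags.get("opening_hours"),
        })
    return summaries


# Made-up response that mimics the Overpass JSON shape (not real OSM data)
sample = {"elements": [
    {"type": "node", "id": 1, "lat": -22.9, "lon": -43.2,
     "tags": {"name": "Example Museum", "tourism": "museum",
              "website": "https://example.org"}},
    {"type": "way", "id": 2, "center": {"lat": -33.4, "lon": -70.6},
     "tags": {"name": "Example Library", "amenity": "library"}},
    {"type": "relation", "id": 3, "tags": {"name": "No Geometry"}},
]}

summaries = summarize_osm_elements(sample)
for s in summaries:
    print(s["osm_id"], s["name"], s["lat"], s["lon"])
```

Elements without usable geometry come back with `lat` and `lon` set to `None`, so they can be filtered out before overwriting any existing city-level coordinates.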
### Phase 2: Document the Gap (IMMEDIATE)
**Action**: Add provenance notes to all 304 institutions documenting ISIL absence
```yaml
# Example update to each institution record
provenance:
  data_source: CONVERSATION_NLP
  data_tier: TIER_4_INFERRED
  extraction_date: "2025-11-05T14:30:00Z"
  extraction_method: "AI agent comprehensive extraction"
  confidence_score: 0.85
  conversation_id: "..."
  notes: >-
    ISIL code not available. Brazil lacks publicly accessible ISIL registry
    as of 2025-11-06. National registration agency (Biblioteca Nacional do
    Brasil or IBICT) has not published ISIL directory. Alternative identifiers
    used: Wikidata QID, website URL. See docs/isil_enrichment_strategy.md
    for details.
```
**Script**: `scripts/add_isil_gap_notes.py`
```python
import yaml


def add_isil_gap_documentation(input_file, output_file):
    """
    Add standardized notes about ISIL unavailability to all institutions
    """
    with open(input_file, 'r', encoding='utf-8') as f:
        institutions = yaml.safe_load(f)

    gap_notes = {
        'BR': "ISIL code not available. Brazil lacks publicly accessible ISIL registry as of 2025-11-06.",
        'MX': "ISIL code not available. Mexico lacks publicly accessible ISIL registry as of 2025-11-06.",
        'CL': "ISIL code not available. Chile lacks publicly accessible ISIL registry as of 2025-11-06."
    }

    updated = 0
    for inst in institutions:
        # Guard against a missing, null, or empty `locations` list
        locations = inst.get('locations') or [{}]
        country = locations[0].get('country', 'Unknown')
        if country in gap_notes and inst.get('provenance'):
            # `notes` may be absent or null
            existing_notes = inst['provenance'].get('notes') or ''
            # Append ISIL gap note if not already present
            if 'ISIL code not available' not in existing_notes:
                inst['provenance']['notes'] = (
                    existing_notes + '\n' + gap_notes[country]
                ).strip()
                updated += 1

    with open(output_file, 'w', encoding='utf-8') as f:
        yaml.dump(institutions, f, allow_unicode=True, sort_keys=False)

    # Report records actually changed, not the total record count
    print(f"Updated {updated} of {len(institutions)} institutions with ISIL gap documentation")
```
### Phase 3: National Agency Outreach (2-4 Weeks)
**Action**: Contact national libraries directly
#### Outreach Email Template
```
Subject: ISIL Registry Access Request - Global GLAM Dataset Research Project
Dear [National Library/Archive],
I am writing from the Global GLAM Dataset project, an open research initiative
documenting heritage institutions worldwide using LinkML schema and Linked Open Data.
We have extracted and geocoded data for [NUMBER] [COUNTRY] institutions from
archival research, including:
- [Institution examples]
To enhance data quality, we seek access to official ISIL (ISO 15511) codes.
Questions:
1. Is [INSTITUTION NAME] the designated ISIL national registration agency for [COUNTRY]?
2. Does a public ISIL registry or directory exist for [COUNTRY] institutions?
3. If available, can we access it for academic research purposes?
4. What is the process for obtaining ISIL codes for institutions lacking them?
Our dataset is published under CC-BY 4.0 and will credit all data sources.
Project repository: https://github.com/[your-repo]
Schema documentation: [link to LinkML schema]
Thank you for your assistance in advancing global cultural heritage data.
Best regards,
[Your Name]
Global GLAM Dataset Project
```
**Targets**:
**Brazil**:
- Biblioteca Nacional do Brasil (https://www.bn.gov.br/)
  - Email: Contact form on website
- IBICT (https://www.ibict.br/)
  - Email: ibict@ibict.br
**Mexico**:
- Biblioteca Nacional de México (https://www.bnm.unam.mx/)
  - Email: Through UNAM contact
- AMBAC (Asociación Mexicana de Bibliotecarios)
**Chile**:
- Biblioteca Nacional de Chile (https://www.bibliotecanacional.gob.cl/)
  - Email: Contact form on website
- Servicio Nacional del Patrimonio Cultural (https://www.patrimoniocultural.gob.cl/)
**Timeline**: Send emails by 2025-11-13, follow up after 2 weeks
### Phase 4: Generate Unofficial ISIL-Like Codes (Optional)
**Only if**: National agencies confirm no registry exists OR no response after 4 weeks
**Purpose**: Internal linking and future-proofing
**Format**: `{COUNTRY}-Unofficial-{CityCode}{InstitutionType}{Sequence}`
**Example**:
```yaml
# Museu de Arte do Rio, Brazil
identifiers:
  - identifier_scheme: "ISIL-Unofficial"
    identifier_value: "BR-Unofficial-RioM001"
    identifier_url: null
    notes: >-
      Unofficial ISIL-like code generated for internal dataset use.
      Format: BR (country) + Rio (city code) + M (museum) + 001 (sequence).
      Not an official ISO 15511 ISIL code. Created 2025-11-06 due to
      absence of public Brazilian ISIL registry.
```
**Provenance Tracking**:
```yaml
provenance:
  data_tier: TIER_4_INFERRED
  notes: >-
    Unofficial ISIL-like identifier created for internal use only.
    Official ISIL code unavailable as Brazil lacks public registry.
    Code is modeled on the ISIL country-prefix pattern but is NOT a
    valid or officially registered ISO 15511 code.
```
**Warning**: Clearly document these are NOT official ISIL codes
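A minimal sketch of a generator for the format above, should the team decide to go ahead. Only the `M` (museum) letter appears in this document; the other type letters, the function names, and the choice to pass the city code in as a parameter are assumptions for illustration.

```python
# Only "M" (museum) is taken from this document; the rest are hypothetical
TYPE_LETTERS = {"MUSEUM": "M", "LIBRARY": "L", "ARCHIVE": "A", "GALLERY": "G"}


def make_unofficial_code(country, city_code, institution_type, sequence):
    """Build a {COUNTRY}-Unofficial-{CityCode}{InstitutionType}{Sequence} code."""
    letter = TYPE_LETTERS[institution_type.upper()]
    # Sequence is zero-padded to three digits, as in "001"
    return f"{country.upper()}-Unofficial-{city_code}{letter}{sequence:03d}"


def make_identifier(country, city_code, institution_type, sequence):
    """Wrap the code in the identifier structure used in the example above."""
    return {
        "identifier_scheme": "ISIL-Unofficial",
        "identifier_value": make_unofficial_code(
            country, city_code, institution_type, sequence
        ),
        "identifier_url": None,
    }


print(make_unofficial_code("BR", "Rio", "museum", 1))  # BR-Unofficial-RioM001
```

Keeping the literal `Unofficial` segment in every value and the separate `ISIL-Unofficial` scheme means the code cannot be mistaken for a registered ISIL in exports.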
### Phase 5: Publish Enriched Dataset (4-6 Weeks)
**Deliverables**:
1. Updated `latin_american_institutions.yaml` with:
   - Wikidata-sourced ISIL codes (if any found)
   - VIAF-sourced ISIL codes (if any found)
   - OSM-enriched coordinates and URLs
   - Documentation of ISIL gap in provenance notes
2. Report: `docs/latin_american_isil_findings.md`
   - Summary of enrichment results
   - National agency responses (if any)
   - Recommendations for future work
3. Statistics:
   - ISIL codes found: [number]/304
   - Wikidata matches: [number]/304
   - VIAF enrichments: [number]/304
   - OSM coordinate improvements: [number]/304
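The identifier statistics could be computed directly from the enriched records. The sketch below assumes each record has an `identifiers` list whose items carry an `identifier_scheme`, as in the Phase 4 example; the scheme labels `ISIL`, `Wikidata`, and `VIAF`, the function name `coverage_stats`, and the sample records are assumptions.

```python
def coverage_stats(institutions, schemes=("ISIL", "Wikidata", "VIAF")):
    """Count institutions having at least one identifier per scheme.

    Hypothetical helper: scheme labels are assumed, not taken from the schema.
    """
    total = len(institutions)
    stats = {}
    for scheme in schemes:
        found = sum(
            1
            for inst in institutions
            # Exact match, so "ISIL-Unofficial" is never counted as "ISIL"
            if any(
                ident.get("identifier_scheme") == scheme
                for ident in inst.get("identifiers") or []
            )
        )
        stats[scheme] = f"{found}/{total}"
    return stats


# Made-up records for illustration only
sample = [
    {"name": "A", "identifiers": [
        {"identifier_scheme": "Wikidata", "identifier_value": "QEXAMPLE1"}]},
    {"name": "B", "identifiers": [
        {"identifier_scheme": "ISIL-Unofficial",
         "identifier_value": "BR-Unofficial-RioM001"}]},
    {"name": "C"},
]

stats = coverage_stats(sample)
for scheme, ratio in stats.items():
    print(f"{scheme}: {ratio}")
```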
## Implementation Checklist
**Week 1 (Nov 6-12)**:
- [x] Research ISIL availability (COMPLETED)
- [ ] Run Wikidata SPARQL queries for BR, MX, CL
- [ ] Extract ISIL codes from Wikidata results
- [ ] Count existing VIAF IDs in dataset
- [ ] Fetch VIAF records for institutions with VIAF IDs
**Week 2 (Nov 13-19)**:
- [ ] Send outreach emails to national libraries
- [ ] Run OSM enrichment for institutions with addresses
- [ ] Add ISIL gap documentation to all 304 provenance records
- [ ] Update PROGRESS.md with findings
**Week 3-4 (Nov 20-Dec 3)**:
- [ ] Process responses from national libraries (if any)
- [ ] Compile enrichment results
- [ ] Generate updated exports (JSON-LD, CSV, GeoJSON)
- [ ] Write final report
**Optional (if no registries found)**:
- [ ] Decide: Generate unofficial ISIL-like codes? (Discuss with stakeholders)
- [ ] If yes: Implement unofficial code generator
- [ ] Document clearly in schema and exports
## Expected Outcomes
### Realistic Scenario
- **ISIL codes found**: 5-15 (via Wikidata/VIAF, ~2-5% coverage)
- **Wikidata matches**: 50-80 institutions (~16-26%)
- **VIAF enrichments**: 30-50 institutions (~10-16%)
- **OSM improvements**: 100-150 institutions (~33-49%)
- **National agency responses**: 0-2 (low response rate expected)
### Success Metrics
- Document ISIL gap comprehensively
- Enrich with alternative authoritative identifiers
- Contact national agencies (attempt made)
- Provide clear provenance for all data
- Obtain official ISIL codes (low probability but worth trying)
## Lessons for Future Datasets
**Key Insights**:
1. ISIL is decentralized - no global database exists
2. Public registries vary by country (Netherlands: excellent, Brazil/Mexico/Chile: none found)
3. Alternative identifiers (Wikidata, VIAF) are more globally accessible
4. Direct outreach to national agencies is required for ISIL access
**Recommendations**:
- Prioritize Wikidata/VIAF as primary identifiers for global datasets
- Use ISIL when available (e.g., Netherlands, USA, some EU countries)
- Document identifier gaps transparently
- Consider unofficial codes only as last resort with clear labeling
---
**Status**: Strategy defined, implementation starting
**Next Update**: After Week 1 enrichment scripts complete
**Owner**: Global GLAM Dataset Project Team