ISIL Enrichment Strategy for Latin American Institutions

Date: 2025-11-06
Dataset: 304 Latin American institutions (Brazil: 97, Chile: 90, Mexico: 117)
Current ISIL Coverage: 0% (0/304 institutions)

Research Findings

ISIL System Architecture

Decentralized Model: ISIL operates through national registration agencies, not a global database.

Key Authorities:

  • Denmark: Slots- og Kulturstyrelsen (international registration authority for ISO 15511)
  • USA: Library of Congress (US national agency, uses MARC org codes)
  • Europe: data.europa.eu, data.overheid.nl (Dutch ISIL registry - we have this!)
  • National: Each country's national library or archive typically maintains its own registry

Important Discovery: No single global ISIL database exists. Each country maintains its own.

Country-Specific Findings

Brazil (BR-)

  • Expected Agency: Biblioteca Nacional do Brasil or IBICT
  • Status: No public ISIL registry found via web search
  • Challenge: Search results dominated by ISIL (terrorism) references, not library codes
  • Recommendation: Direct contact with Biblioteca Nacional do Brasil required

Mexico (MX-)

  • Expected Agency: Biblioteca Nacional de México (under UNAM)
  • Alternate Contact: Asociación Mexicana de Bibliotecarios (AMBAC)
  • Status: No public ISIL registry found
  • Search Strategy: Try .gob.mx or .unam.mx domain-specific searches for "código ISIL"
  • Recommendation: Contact BNM/UNAM directly

Chile (CL-)

  • Expected Agency: Biblioteca Nacional de Chile or DIBAM (now Servicio Nacional del Patrimonio Cultural)
  • Status: No public ISIL registry found
  • Recommendation: Contact Biblioteca Nacional de Chile

Alternative Enrichment Strategy

Phase 1: Leverage Existing Identifiers (IMMEDIATE - This Week)

Since ISIL codes are unavailable, we'll enrich using identifiers we CAN access:

1.1 Wikidata Enrichment

What We Have: Some institutions already have Wikidata IDs in the dataset.

What We Can Get:

  • ISIL codes (Wikidata Property P791) - if they exist
  • Additional identifiers (GND, VIAF, Library of Congress)
  • Geographic coordinates
  • Parent organizations
  • Website URLs

Action: Run Wikidata SPARQL queries

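# Script: scripts/enrich_from_wikidata.py

from SPARQLWrapper import SPARQLWrapper, JSON

def query_wikidata_isil(country_qid):
    """
    Query Wikidata for GLAM institutions in a country, returning ISIL
    codes and other identifiers where they exist
    
    Args:
        country_qid: Wikidata QID for country (Q155=Brazil, Q96=Mexico, Q298=Chile)
    
    Returns:
        dict in SPARQL JSON results format (rows under results -> bindings)
    """
    
    # The Wikidata Query Service expects a descriptive User-Agent;
    # the value below is a placeholder
    sparql = SPARQLWrapper(
        "https://query.wikidata.org/sparql",
        agent="GlobalGLAMDataset/0.1 (ISIL enrichment script)",
    )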
    
    query = f"""
    SELECT DISTINCT ?org ?orgLabel ?isil ?viaf ?website ?coords WHERE {{
      ?org wdt:P17 wd:{country_qid} .           # Country
      OPTIONAL {{ ?org wdt:P791 ?isil . }}      # ISIL code
      OPTIONAL {{ ?org wdt:P214 ?viaf . }}      # VIAF ID
      OPTIONAL {{ ?org wdt:P856 ?website . }}   # Official website
      OPTIONAL {{ ?org wdt:P625 ?coords . }}    # Coordinates
      
      # Filter for GLAM institutions
      VALUES ?type {{ 
        wd:Q7075        # Library
        wd:Q166118      # Archive
        wd:Q33506       # Museum
        wd:Q1007870     # Art gallery
      }}
      ?org wdt:P31/wdt:P279* ?type .
      
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,pt,es". }}
    }}
    LIMIT 1000
    """
    
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    
    return results

Expected Output:

  • Brazil: ~20-30 institutions with Wikidata records
  • Mexico: ~15-25 institutions
  • Chile: ~10-20 institutions
  • ISIL codes: <10 total (realistic estimate)
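
Because the ISIL clause in the query is OPTIONAL, most returned rows will not carry a code. The sketch below shows one way to keep only the rows that do, assuming the standard SPARQL JSON results shape (rows under `results.bindings`, each variable mapped to an object with a `value` key). The helper name `rows_with_isil`, the output keys, and the sample values are illustrative and not part of the existing scripts.

```python
def rows_with_isil(results):
    """Return one dict per SPARQL JSON result row that carries an ISIL binding."""
    rows = []
    for binding in results.get("results", {}).get("bindings", []):
        if "isil" not in binding:
            continue  # the OPTIONAL ISIL clause did not match this row
        rows.append({
            # Entity URIs end in the QID, e.g. http://www.wikidata.org/entity/Q42
            "qid": binding["org"]["value"].rsplit("/", 1)[-1],
            "label": binding.get("orgLabel", {}).get("value"),
            "isil": binding["isil"]["value"],
        })
    return rows


# Hand-written sample in the SPARQL JSON shape (placeholder values, not real data)
sample = {
    "results": {"bindings": [
        {"org": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0000001"},
         "orgLabel": {"type": "literal", "value": "Example Library"},
         "isil": {"type": "literal", "value": "BR-EXAMPLE1"}},
        {"org": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0000002"},
         "orgLabel": {"type": "literal", "value": "Example Museum"}},
    ]}
}
print(rows_with_isil(sample))
```

The same function can be applied directly to the return value of `query_wikidata_isil`.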

1.2 VIAF Enrichment

What VIAF Provides:

  • Authority records for organizations
  • ISIL codes (when available)
  • Cross-references to national authority files

Action: For each institution with a VIAF ID, fetch the full VIAF record

# Script: scripts/enrich_from_viaf.py

import requests
import xml.etree.ElementTree as ET

def fetch_viaf_record(viaf_id):
    """
    Fetch VIAF record and extract ISIL code if present
    
    Args:
        viaf_id: VIAF identifier (e.g., "123556639")
    
    Returns:
        dict with extracted data including ISIL if available
    """
    
    url = f"https://viaf.org/viaf/{viaf_id}/viaf.xml"
    response = requests.get(url, timeout=30)  # avoid hanging on a stalled connection
    
    if response.status_code == 200:
        root = ET.fromstring(response.content)
        
        # VIAF may include ISIL among organizational identifiers;
        # the exact element varies between records
        
        data = {
            'viaf_id': viaf_id,
            'isil': None,  # TODO: populate once the ISIL location in `root` is confirmed
            'alt_names': [],
            'related_ids': {}
        }
        
        # TODO: parse `root` for ISIL references
        # Format varies - may be in <ns1:sources> or <ns1:mainHeadings>
        
        return data
    
    return None
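
Since the element that holds an ISIL is not yet confirmed, one stopgap is to scan every text node for strings that look like an ISIL. This is a rough sketch under stated assumptions: the pattern assumes the usual ISIL shape (country prefix, hyphen, up to 11 identifier characters drawn from letters, digits, `/`, `:` and `-`), it only looks for the three prefixes relevant here, and the helper name `find_isil_candidates` is hypothetical. Matches are candidates to review by hand, not confirmed codes.

```python
import re
import xml.etree.ElementTree as ET

# Assumed ISIL shape, restricted to the three country prefixes in this dataset
ISIL_SHAPE = re.compile(r"(BR|MX|CL)-[A-Za-z0-9/:-]{1,11}")


def find_isil_candidates(xml_bytes):
    """Return ISIL-shaped strings found in any element text, in document order."""
    root = ET.fromstring(xml_bytes)
    found = []
    for element in root.iter():
        text = (element.text or "").strip()
        # fullmatch: the whole text node must be ISIL-shaped, not just a substring
        if ISIL_SHAPE.fullmatch(text) and text not in found:
            found.append(text)
    return found


# Synthetic XML for illustration; real VIAF records are namespaced and far larger
demo = b"<r><a>BR-RJ001</a><b>not an isil</b><c> BR-RJ001 </c><d>NL-HaNA</d></r>"
print(find_isil_candidates(demo))
```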

Current Dataset Status: Check how many institutions already have VIAF IDs

grep -c "VIAF" data/instances/latin_american_institutions.yaml
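
Note that `grep -c` counts matching lines, so a free-text note that mentions VIAF is counted as well. A stricter count can be sketched as follows, assuming each record stores identifiers as a list of `identifier_scheme` / `identifier_value` pairs (as in the identifier example under Phase 4) and that the scheme label is exactly "VIAF"; both are assumptions about the dataset. The function works on already-loaded records, for example the list returned by `yaml.safe_load`.

```python
def count_with_scheme(institutions, scheme):
    """Count institutions that have at least one identifier of the given scheme."""
    return sum(
        1
        for inst in institutions
        if any(
            ident.get("identifier_scheme") == scheme
            for ident in inst.get("identifiers") or []
        )
    )


# Minimal hand-written records for illustration (placeholder values)
records = [
    {"name": "A", "identifiers": [{"identifier_scheme": "VIAF", "identifier_value": "0"}]},
    {"name": "B", "identifiers": [{"identifier_scheme": "Wikidata", "identifier_value": "Q0"}]},
    {"name": "C", "provenance": {"notes": "VIAF lookup pending"}},
]
print(count_with_scheme(records, "VIAF"))
```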

1.3 OpenStreetMap (OSM) Enrichment

What OSM Provides:

  • Building coordinates (better than city-level)
  • Opening hours, contact info
  • Sometimes website URLs

Action: For institutions with addresses, query Nominatim/Overpass API

# Script: scripts/enrich_from_osm.py

import requests

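def search_osm_by_name_and_location(institution_name, city, country):
    """
    Search OpenStreetMap for institution details
    
    Args:
        institution_name: Name to match (interpreted as a case-insensitive regex)
        city: Currently unused; kept for narrowing results later
        country: ISO 3166-1 alpha-2 code as stored in the dataset (e.g., "BR")
    
    Returns:
        Parsed Overpass JSON (or None on a non-200 response) containing:
        - Coordinates (lat/lon for nodes, center point for ways and relations)
        - OSM tags (amenity=library, tourism=museum, etc.)
        - Website URL (if tagged)
    """
    
    overpass_url = "https://overpass-api.de/api/interpreter"
    
    # Overpass QL query. The country area is selected by ISO code because OSM
    # "name" tags are usually in the local language (e.g., "Brasil", "México").
    # "out center" is needed so that ways and relations carry coordinates.
    query = f"""
    [out:json][timeout:60];
    area["ISO3166-1"="{country}"][admin_level=2]->.country;
    (
      node(area.country)["name"~"{institution_name}",i];
      way(area.country)["name"~"{institution_name}",i];
      relation(area.country)["name"~"{institution_name}",i];
    );
    out center;
    """
    
    response = requests.post(overpass_url, data={'data': query}, timeout=90)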
    
    if response.status_code == 200:
        return response.json()
    
    return None
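
The Overpass response lists matches under `elements`. Nodes carry `lat`/`lon` directly, while ways and relations carry coordinates only when the query requests a center point (`out center;`). The sketch below flattens that structure into the fields listed under "What OSM Provides". The helper name `summarize_osm_elements` and the output keys are illustrative, and the sample payload is hand-written.

```python
def summarize_osm_elements(payload):
    """Flatten Overpass JSON elements into coordinate/tag summaries."""
    summaries = []
    for element in payload.get("elements", []):
        tags = element.get("tags", {})
        center = element.get("center", {})  # present for ways/relations with "out center"
        summaries.append({
            "osm_id": f"{element['type']}/{element['id']}",
            "name": tags.get("name"),
            "lat": element.get("lat", center.get("lat")),
            "lon": element.get("lon", center.get("lon")),
            "website": tags.get("website") or tags.get("contact:website"),
            "opening_hours": tags.get("opening_hours"),
        })
    return summaries


# Hand-written payload in the Overpass JSON shape (placeholder values)
payload = {"elements": [
    {"type": "node", "id": 1, "lat": -22.9, "lon": -43.2,
     "tags": {"name": "Example Museum", "tourism": "museum"}},
    {"type": "way", "id": 2, "center": {"lat": 19.4, "lon": -99.1},
     "tags": {"name": "Example Library", "website": "https://example.org"}},
]}
for row in summarize_osm_elements(payload):
    print(row)
```

Elements without any coordinates come back with `lat` and `lon` set to `None`, which makes them easy to filter out before updating a record.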

Phase 2: Document the Gap (IMMEDIATE)

Action: Add provenance notes to all 304 institutions documenting ISIL absence

# Example update to each institution record

provenance:
  data_source: CONVERSATION_NLP
  data_tier: TIER_4_INFERRED
  extraction_date: "2025-11-05T14:30:00Z"
  extraction_method: "AI agent comprehensive extraction"
  confidence_score: 0.85
  conversation_id: "..."
  notes: >-
    ISIL code not available. Brazil lacks publicly accessible ISIL registry 
    as of 2025-11-06. National registration agency (Biblioteca Nacional do 
    Brasil or IBICT) has not published ISIL directory. Alternative identifiers 
    used: Wikidata QID, website URL. See docs/isil_enrichment_strategy.md 
    for details.    

Script: scripts/add_isil_gap_notes.py

import yaml

def add_isil_gap_documentation(input_file, output_file):
    """
    Add standardized notes about ISIL unavailability to all institutions
    """
    
    with open(input_file, 'r', encoding='utf-8') as f:
        institutions = yaml.safe_load(f)
    
    gap_notes = {
        'BR': "ISIL code not available. Brazil lacks publicly accessible ISIL registry as of 2025-11-06.",
        'MX': "ISIL code not available. Mexico lacks publicly accessible ISIL registry as of 2025-11-06.",
        'CL': "ISIL code not available. Chile lacks publicly accessible ISIL registry as of 2025-11-06."
    }
    
    updated = 0
    for inst in institutions:
        # Guard against a missing or empty locations list
        locations = inst.get('locations') or [{}]
        country = locations[0].get('country', 'Unknown')
        
        if country in gap_notes and inst.get('provenance'):
            # notes may be absent or explicitly null
            existing_notes = inst['provenance'].get('notes') or ''
            
            # Append ISIL gap note if not already present
            if 'ISIL code not available' not in existing_notes:
                inst['provenance']['notes'] = (
                    existing_notes + '\n' + gap_notes[country]
                ).strip()
                updated += 1
    
    with open(output_file, 'w', encoding='utf-8') as f:
        yaml.dump(institutions, f, allow_unicode=True, sort_keys=False)
    
    print(f"Updated {updated} of {len(institutions)} institutions with ISIL gap documentation")

Phase 3: National Agency Outreach (2-4 Weeks)

Action: Contact national libraries directly

Outreach Email Template

Subject: ISIL Registry Access Request - Global GLAM Dataset Research Project

Dear [National Library/Archive],

I am writing from the Global GLAM Dataset project, an open research initiative 
documenting heritage institutions worldwide using LinkML schema and Linked Open Data.

We have extracted and geocoded data for [NUMBER] [COUNTRY] institutions from 
archival research, including:
- [Institution examples]

To enhance data quality, we seek access to official ISIL (ISO 15511) codes.

Questions:
1. Is [INSTITUTION NAME] the designated ISIL national registration agency for [COUNTRY]?
2. Does a public ISIL registry or directory exist for [COUNTRY] institutions?
3. If available, can we access it for academic research purposes?
4. What is the process for obtaining ISIL codes for institutions lacking them?

Our dataset is published under CC-BY 4.0 and will credit all data sources.

Project repository: https://github.com/[your-repo]
Schema documentation: [link to LinkML schema]

Thank you for your assistance in advancing global cultural heritage data.

Best regards,
[Your Name]
Global GLAM Dataset Project

Targets:

Brazil:

  • Biblioteca Nacional do Brasil
  • IBICT (alternate)

Mexico:

  • Biblioteca Nacional de México (https://www.bnm.unam.mx/)
  • Email: Through UNAM contact
  • AMBAC (Asociación Mexicana de Bibliotecarios)

Chile:

  • Biblioteca Nacional de Chile
  • Servicio Nacional del Patrimonio Cultural (formerly DIBAM)

Timeline: Send emails by 2025-11-13, follow up after 2 weeks

Phase 4: Generate Unofficial ISIL-Like Codes (Optional)

Only if: National agencies confirm no registry exists OR no response after 4 weeks

Purpose: Internal linking and future-proofing

Format: {COUNTRY}-Unofficial-{CityCode}{InstitutionType}{Sequence}

Example:

# Museu de Arte do Rio, Brazil
identifiers:
  - identifier_scheme: "ISIL-Unofficial"
    identifier_value: "BR-Unofficial-RioM001"
    identifier_url: null
    notes: >-
      Unofficial ISIL-like code generated for internal dataset use. 
      Format: BR (country) + Unofficial (marker) + Rio (city code) + M (museum) + 001 (sequence).
      Not an official ISO 15511 ISIL code. Created 2025-11-06 due to 
      absence of public Brazilian ISIL registry.      

Provenance Tracking:

provenance:
  data_tier: TIER_4_INFERRED
  notes: >-
    Unofficial ISIL-like identifier created for internal use only.
    Official ISIL code unavailable as Brazil lacks public registry.
    Code imitates the ISIL country-prefix pattern but does not conform to
    ISO 15511 (it exceeds the 16-character limit) and is NOT officially registered.

Warning: Clearly document that these are NOT official ISIL codes
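
If the team does decide to generate these codes, the format above can be sketched as a small helper. This is only an illustration: the function names are hypothetical, and the three-digit zero-padded sequence (1 to 999) is inferred from the single `RioM001` example rather than stated anywhere as a rule.

```python
def make_unofficial_code(country, city_code, institution_type, sequence):
    """Build {COUNTRY}-Unofficial-{CityCode}{InstitutionType}{Sequence}."""
    if not 1 <= sequence <= 999:
        # Assumption: sequences are three digits, as in the RioM001 example
        raise ValueError("sequence must be between 1 and 999")
    return f"{country}-Unofficial-{city_code}{institution_type}{sequence:03d}"


def as_identifier(code):
    """Wrap a code in the identifier record shape used in the example above."""
    return {
        "identifier_scheme": "ISIL-Unofficial",  # never "ISIL"
        "identifier_value": code,
        "identifier_url": None,
    }


code = make_unofficial_code("BR", "Rio", "M", 1)
print(as_identifier(code))
```

Keeping the scheme fixed to `ISIL-Unofficial` inside the helper means a generated code cannot be stored under the official scheme by accident.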

Phase 5: Publish Enriched Dataset (4-6 Weeks)

Deliverables:

  1. Updated latin_american_institutions.yaml with:

    • Wikidata-sourced ISIL codes (if any found)
    • VIAF-sourced ISIL codes (if any found)
    • OSM-enriched coordinates and URLs
    • Documentation of ISIL gap in provenance notes
  2. Report: docs/latin_american_isil_findings.md

    • Summary of enrichment results
    • National agency responses (if any)
    • Recommendations for future work
  3. Statistics:

    • ISIL codes found: [number]/304
    • Wikidata matches: [number]/304
    • VIAF enrichments: [number]/304
    • OSM coordinate improvements: [number]/304
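
The statistics lines can be produced mechanically once the counts are known. The sketch below formats each count against the dataset total as `n/total` plus a percentage. The helper name `coverage_lines` is hypothetical, and the counts in the example are placeholders for illustration, not results (apart from the current ISIL coverage of 0/304).

```python
def coverage_lines(counts, total):
    """Format each count as 'label: n/total (p%)' for the findings report."""
    if total <= 0:
        raise ValueError("total must be positive")
    return [f"{label}: {n}/{total} ({n / total:.1%})" for label, n in counts.items()]


# Placeholder counts; real values come from the enrichment runs
example_counts = {"ISIL codes found": 0, "Wikidata matches": 76}
for line in coverage_lines(example_counts, 304):
    print(line)
```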

Implementation Checklist

Week 1 (Nov 6-12):

  • Research ISIL availability (COMPLETED)
  • Run Wikidata SPARQL queries for BR, MX, CL
  • Extract ISIL codes from Wikidata results
  • Count existing VIAF IDs in dataset
  • Fetch VIAF records for institutions with VIAF IDs

Week 2 (Nov 13-19):

  • Send outreach emails to national libraries
  • Run OSM enrichment for institutions with addresses
  • Add ISIL gap documentation to all 304 provenance records
  • Update PROGRESS.md with findings

Week 3-4 (Nov 20-Dec 3):

  • Process responses from national libraries (if any)
  • Compile enrichment results
  • Generate updated exports (JSON-LD, CSV, GeoJSON)
  • Write final report

Optional (if no registries found):

  • Decide: Generate unofficial ISIL-like codes? (Discuss with stakeholders)
  • If yes: Implement unofficial code generator
  • Document clearly in schema and exports

Expected Outcomes

Realistic Scenario

  • ISIL codes found: 5-15 (via Wikidata/VIAF, ~2-5% coverage)
  • Wikidata matches: 50-80 institutions (~16-26%)
  • VIAF enrichments: 30-50 institutions (~10-16%)
  • OSM improvements: 100-150 institutions (~33-49%)
  • National agency responses: 0-2 (low response rate expected)

Success Metrics

  • Document ISIL gap comprehensively
  • Enrich with alternative authoritative identifiers
  • Contact national agencies (attempt made)
  • Provide clear provenance for all data
  • Obtain official ISIL codes (low probability but worth trying)

Lessons for Future Datasets

Key Insights:

  1. ISIL is decentralized - no global database exists
  2. Public registries vary by country (Netherlands: excellent, Brazil/Mexico/Chile: none found)
  3. Alternative identifiers (Wikidata, VIAF) are more globally accessible
  4. Direct outreach to national agencies is required for ISIL access where no public registry exists

Recommendations:

  • Prioritize Wikidata/VIAF as primary identifiers for global datasets
  • Use ISIL when available (e.g., Netherlands, USA, some EU countries)
  • Document identifier gaps transparently
  • Consider unofficial codes only as last resort with clear labeling

Status: Strategy defined, implementation starting
Next Update: After Week 1 enrichment scripts complete
Owner: Global GLAM Dataset Project Team