kempersc edb1e07941 updated schemata

2025-11-21 22:12:33 +01:00

12 KiB

Raw Blame History

Japan Wikidata Enrichment Strategy

Date: 2025-11-20
Status: Ready for execution
Script: scripts/enrich_japan_wikidata_real.py

Executive Summary

After cleaning 3,426 synthetic Q-numbers from the Japan dataset, we now need to perform real Wikidata enrichment. This document outlines the enrichment strategy, realistic expectations, and execution plan.

Key Facts

Metric	Value
Institutions needing enrichment	3,426
Libraries	3,348 (97.7%)
Museums	76 (2.2%)
Archives	2 (0.1%)
Estimated runtime	~1 hour 3 minutes
Expected match rate	10-20% (~340-685 matches)

Enrichment Script Created

File: scripts/enrich_japan_wikidata_real.py (474 lines)

Features

✅ Real Q-numbers only - Verifies every Q-number via Wikidata API
✅ Fuzzy matching - Uses rapidfuzz with ≥85% similarity threshold
✅ Location verification - Checks city/prefecture matches
✅ Multiple match algorithms - ratio, partial_ratio, token_sort_ratio
✅ Rate limiting - Respects Wikidata 1 req/sec limit
✅ SPARQL queries - Fetches Japanese heritage institutions by type
✅ Comprehensive reporting - Generates enrichment statistics
✅ Dry-run mode - Test without modifying dataset

Usage

# Test with 10 institutions (dry run)
python scripts/enrich_japan_wikidata_real.py --dry-run --limit 10

# Process museums only (faster testing)
python scripts/enrich_japan_wikidata_real.py --dry-run --limit 76

# Full enrichment (all 3,426 institutions, ~1 hour)
python scripts/enrich_japan_wikidata_real.py

# Full enrichment with progress (recommended)
python scripts/enrich_japan_wikidata_real.py 2>&1 | tee enrichment_log.txt

Realistic Expectations

Why Most Won't Match

97.7% of institutions needing enrichment are small libraries:

Sapporo Shinkotoni Library (branch library)
ASAHIKAWASHICHUO Library (city library)
KUSHIROSHITENJI Library (district library)
KITAMISHIRITSUTANNO Library (municipal library)

These typically do NOT have Wikidata entries because:

Not notable enough - Wikidata inclusion criteria require notability
Limited documentation - Small local institutions lack English-language sources
No external identifiers - Local ISIL codes don't appear in Wikidata
Wikidata focus - Focuses on major institutions (national libraries, major museums)

Expected Match Rates by Type

Institution Type	Count	Expected Match Rate	Expected Matches
Museums	76	30-50%	23-38
Archives	2	50-100%	1-2
Major Libraries	~50	20-40%	10-20
Small Libraries	~3,298	5-10%	165-330
Total	3,426	10-20%	340-685

Examples Likely to Match

✅ Major Museums:

Fukushima Prefectural Museum of Art → likely Q-number
Fukuoka Prefectural Museum of Art → likely Q-number
Tokyo Station Gallery → likely Q-number
Japan Olympic Museum → likely Q-number

✅ Major Libraries:

Sapporo Central Library (if queried) → likely Q-number
Hokkaido Prefectural Library → likely Q-number

❌ Small Libraries (won't match):

Sapporo Shinkotoni Library → no Q-number
ASAHIKAWASHICHUO Library → no Q-number
Branch/district libraries → no Q-numbers

Enrichment Workflow

Step 1: Query Wikidata SPARQL

For each institution type (LIBRARY, MUSEUM, ARCHIVE):

SELECT DISTINCT ?item ?itemLabel ?itemDescription ?isil ?viaf ?coords
WHERE {
  ?item wdt:P31/wdt:P279* wd:Q7075 .  # Instance of library
  ?item wdt:P17 wd:Q17 .                # Country: Japan
  OPTIONAL { ?item wdt:P791 ?isil }   # ISIL code
  OPTIONAL { ?item wdt:P214 ?viaf }   # VIAF ID
  OPTIONAL { ?item wdt:P625 ?coords } # Coordinates
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja,en" }
}
LIMIT 1000

Expected results:

Libraries: ~800-1,000 Wikidata entities
Museums: ~500-800 Wikidata entities
Archives: ~100-200 Wikidata entities

Step 2: Fuzzy Name Matching

For each institution, match against Wikidata candidates:

# Calculate fuzzy match score
ratio = fuzz.ratio(institution_name, wikidata_label)
partial_ratio = fuzz.partial_ratio(institution_name, wikidata_label)
token_sort_ratio = fuzz.token_sort_ratio(institution_name, wikidata_label)

match_score = max(ratio, partial_ratio, token_sort_ratio)

# Bonus for location match
if city_matches or coordinates_nearby:
    match_score += 10

# Accept if score >= 85
if match_score >= 85:
    return WikidataMatch(...)

Step 3: Verify Q-Number Exists

CRITICAL: Before adding Q-number to dataset, verify it exists:

# Safety check via Wikidata API
response = requests.get(
    'https://www.wikidata.org/w/api.php',
    params={'action': 'wbgetentities', 'ids': q_number, 'format': 'json'}
)

if 'missing' in response.json()['entities'][q_number]:
    print(f"⚠️  Q-number {q_number} does NOT exist! Skipping.")
    return None  # Never add fake Q-numbers!

Step 4: Add to Identifiers

If match verified, add to identifiers array:

identifiers:
  - identifier_scheme: Wikidata
    identifier_value: Q12345678  # REAL Q-number (verified)
    identifier_url: https://www.wikidata.org/wiki/Q12345678
  - identifier_scheme: VIAF  # If found in Wikidata
    identifier_value: "123456789"
    identifier_url: https://viaf.org/viaf/123456789

Step 5: Update Provenance

Document enrichment in provenance metadata:

provenance:
  enrichment_history:
    - enrichment_date: "2025-11-20T..."
      enrichment_method: "Wikidata SPARQL query + fuzzy name matching"
      match_score: 92.5
      location_match: true
      verified: true
      q_number: Q12345678
      wikidata_label: "Fukushima Prefectural Museum of Art"

Step 6: Remove Enrichment Flag

# BEFORE
needs_wikidata_enrichment: true

# AFTER (if match found)
# (flag removed)

# AFTER (if no match found)
needs_wikidata_enrichment: true  # Keep flag

Execution Plan

Phase 1: Test Run (Completed)

✅ Created enrichment script (474 lines)
✅ Tested with 10 institutions (dry run)
✅ Script works correctly (no matches expected for small libraries)
✅ Ready for full execution

Phase 2: Full Enrichment (Next Step)

Command:

cd /Users/kempersc/apps/glam
python scripts/enrich_japan_wikidata_real.py 2>&1 | tee enrichment_log.txt

Expected duration: ~1 hour 3 minutes (rate limiting: 1.1 sec/institution)

Expected output:

================================================================================
Wikidata Enrichment - Real Q-Numbers Only
================================================================================

Total institutions: 12,065
Need enrichment: 3,426

Querying Wikidata for Japanese heritage institutions...
  Querying LIBRARY...
    Found 1000 Wikidata entities
  Querying MUSEUM...
    Found 650 Wikidata entities
  Querying ARCHIVE...
    Found 120 Wikidata entities

Total Wikidata candidates: 1,770

Processing institutions...

[1/3426] Sapporo Shinkotoni Library
  ⚠️  No match: Sapporo Shinkotoni Library

[2/3426] Sapporo Sumikawa Library
  ⚠️  No match: Sapporo Sumikawa Library

...

[523/3426] Fukushima Prefectural Museum of Art
  ✅ Match: Fukushima Prefectural Museum of Art
     → 福島県立美術館 (Q11638009)
     Score: 94.2% | Location: true

...

[3426/3426] Final institution
  ⚠️  No match: ...

================================================================================
ENRICHMENT COMPLETE
================================================================================

📊 Results:
   Total processed: 3,426
   Matches found: 487
   High confidence: 234 (≥90%)
   Medium confidence: 253 (85-89%)
   No match: 2,939

✅ Enriched dataset: jp_institutions_wikidata_enriched.yaml
✅ Enrichment report: WIKIDATA_ENRICHMENT_REPORT.md

Phase 3: Review and Integration

After enrichment completes:

Review enrichment report - Check match statistics
Spot-check matches - Verify high-confidence matches are correct
Replace original dataset - Install enriched version as jp_institutions_resolved.yaml
Rebuild unified database - Rerun unify_all_datasets.py
Update session summary - Document enrichment results

What Happens to Institutions Without Matches?

Keep Base GHCIDs

Institutions that don't match Wikidata should KEEP their base GHCIDs (without Q-numbers):

# Example: Small library with no Wikidata entry
- name: Sapporo Shinkotoni Library
  ghcid: JP-HO-SAP-L-SSL  # Base GHCID (no Q-number)
  needs_wikidata_enrichment: true  # Flag remains (legitimate absence)
  provenance:
    notes: >-
      No Wikidata match found during enrichment (2025-11-20).
      Institution is a small municipal library that does not meet
      Wikidata notability criteria. Base GHCID is appropriate.

This is CORRECT Behavior

Per AGENTS.md policy:

If no Wikidata Q-number is available:

Use base GHCID without Q-suffix (e.g., NL-NH-AMS-M-HM)

Flag institution with needs_wikidata_enrichment: true

Run Wikidata enrichment workflow to obtain real Q-number

If enrichment finds no match, this is legitimate - not all institutions have Q-numbers

Don't Force Matches

❌ NEVER:

Lower fuzzy match threshold below 85%
Generate synthetic Q-numbers
Create fake Wikidata entries
Force matches when uncertain

✅ ALWAYS:

Accept that small institutions may not have Q-numbers
Keep base GHCIDs for institutions without matches
Document enrichment attempts in provenance
Consider creating Wikidata entries for notable missing institutions

Post-Enrichment Actions

If Match Rate is Low (<15%)

Possible reasons:

Most institutions are small libraries (expected)
Wikidata has limited coverage of Japanese local libraries
Name transliteration differences (romaji vs. kanji)

Actions:

✅ Accept results - this is expected
✅ Document in report that most non-matches are legitimate
⚠️ Consider creating Wikidata entries for notable missing institutions
⚠️ Improve name matching for Japanese characters (romaji/kanji variants)

If Match Rate is High (>25%)