glam/docs/sessions/SESSION_SUMMARY_20251116_WIKIDATA_ENRICHMENT.md

Session Summary: Wikidata Enrichment Script Development and Execution

Date: 2025-11-16
Session: Creating LinkML schemas and running Wikidata enrichment


What We Accomplished

1. Created LinkML Schemas

File: /Users/kempersc/apps/glam/schemas/hyponyms_curated.yaml

  • Schema for curated Wikidata entity classifications
  • Documents GLAMORCUBESFIXPHDNT taxonomy (19 type codes)
  • Supports curation fields: country, subregion, settlement, time, duplicate, rico, notes
  • Defines sections: hypernym, entity, entity_list, standards, collection, exclude

File: /Users/kempersc/apps/glam/schemas/hyponyms_curated_full.yaml

  • Schema for Wikidata-enriched output
  • Separates curated (manual metadata) from wikidata (API data)
  • Captures ALL Wikidata fields:
    • Labels, descriptions, aliases (all languages)
    • Claims/statements (all properties with qualifiers and references)
    • Sitelinks (Wikipedia and Wikimedia project links)
    • Entity metadata (page ID, revision, modification date)

2. Fixed YAML Validation Issues

Problem: hyponyms_curated.yaml had syntax errors

  • Tab character on line 2197
  • Inconsistent indentation on line 11986

Solution: Automated fixing script that:

  • Replaced tabs with spaces
  • Standardized indentation (2 spaces for list items, 4 spaces for properties)
  • Validated YAML parses correctly
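The fixing script itself isn't reproduced in this summary. A minimal sketch of the core idea (tab replacement plus a parse check; `fix_yaml_text` is an illustrative name, and the real script additionally normalized list/property indentation, which is more involved):

```python
import yaml  # PyYAML, already used elsewhere in this workflow


def fix_yaml_text(text: str) -> str:
    """Replace tabs with spaces, then confirm the result parses as YAML."""
    fixed = text.replace("\t", "  ")  # tabs -> 2 spaces
    yaml.safe_load(fixed)             # raises yaml.YAMLError if still invalid
    return fixed
```

Running the validation step inside the fixer means a bad fix fails loudly instead of producing a silently broken file.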

3. Tested Enrichment Script

Test Run: Processed first 5 entities from hypernym section

  • Q26945172 (dance museum) - 8 labels, 8 properties
  • Q11341528 (comics museum) - 8 labels, 5 properties
  • Q80096168 (circus museum) - 4 labels, 3 properties
  • Q25964553 (ceramics museum) - 26 labels, 6 properties
  • Q132718497 (calligraphy museum) - 4 labels, 6 properties

Result: All 5 entities fetched successfully from Wikidata API
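For a single entity, a fetch against the Wikibase API looks roughly like the following (a sketch; `build_params`, `fetch_entity`, and the User-Agent string are illustrative, not the script's actual code):

```python
import requests

API = "https://www.wikidata.org/w/api.php"


def build_params(qid: str) -> dict:
    # Mirrors the API usage documented later in this summary.
    return {
        "action": "wbgetentities",
        "ids": qid,
        "props": "info|sitelinks|aliases|labels|descriptions|claims|datatype",
        "format": "json",
    }


def fetch_entity(qid: str) -> dict:
    # Wikimedia asks clients to send a descriptive User-Agent.
    resp = requests.get(API, params=build_params(qid), timeout=30,
                        headers={"User-Agent": "glam-enrichment-sketch/0.1"})
    resp.raise_for_status()
    return resp.json()["entities"][qid]
```

Omitting a language filter from the request is what yields labels in all languages, as seen in the per-entity label counts above.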

4. Started Full Enrichment 🔄

Status: Running in background (PID 28746)
Command: python scripts/enrich_hyponyms_with_wikidata.py
Log File: /tmp/enrichment.log

Data to Process:

hypernym:       1,662 entities
entity:           214 entities
entity_list:       35 entities
standards:          0 entities (empty)
collection:        59 entities
--------------------------
TOTAL:          1,970 entities

Estimated Time: ~3.3 minutes (10 requests/second with 100ms delay)
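The estimate is straightforward arithmetic: 1,970 entities at one request per 100 ms is about 197 s, i.e. ~3.3 minutes. A minimal rate-limited iterator (a sketch; `rate_limited` is an illustrative helper, not the script's):

```python
import time


def rate_limited(items, delay: float = 0.1):
    """Yield items no faster than one per `delay` seconds (10 req/s at 0.1)."""
    for item in items:
        start = time.monotonic()
        yield item
        # Sleep only for whatever part of the window the consumer didn't use.
        time.sleep(max(0.0, delay - (time.monotonic() - start)))
```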

Output Files:

  • data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml - Enriched data
  • data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json - Cache registry

Script Features

Enrichment Capabilities

  • Fetches complete Wikidata entity data via Wikibase API
  • Preserves all manual curation metadata (country, time, hypernym, type, etc.)
  • Extracts ALL fields: labels, descriptions, aliases (all languages), claims (all properties), sitelinks
  • Maintains fetch register to avoid re-fetching (cache with fetch_date and fetch_count)
  • Processes 5 main sections (skips 'exclude' section)
  • Supports force refresh (individual Q-IDs or all entities)
  • Rate limiting (10 req/sec)
  • Structured output separating curated vs. Wikidata data

CLI Options

# Basic run (incremental, uses cache)
python scripts/enrich_hyponyms_with_wikidata.py

# Force refresh specific entities
python scripts/enrich_hyponyms_with_wikidata.py --refresh Q12345,Q67890

# Refresh all entities (ignore cache)
python scripts/enrich_hyponyms_with_wikidata.py --refresh-all

# Dry run (enrichment without saving)
python scripts/enrich_hyponyms_with_wikidata.py --dry-run
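These flags could be declared with `argparse` roughly as follows (a sketch; the actual script's option handling may differ):

```python
import argparse


def parse_args(argv=None):
    p = argparse.ArgumentParser(
        description="Enrich curated hyponyms with Wikidata entity data")
    p.add_argument("--refresh",
                   help="comma-separated Q-IDs to force-refetch")
    p.add_argument("--refresh-all", action="store_true",
                   help="ignore the fetch register and refetch everything")
    p.add_argument("--dry-run", action="store_true",
                   help="run enrichment without writing output")
    return p.parse_args(argv)
```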

Monitoring the Enrichment

To check progress while script is running:

# Check if process is still running
ps aux | grep "[e]nrich_hyponyms_with_wikidata"

# Monitor log file
tail -f /tmp/enrichment.log

# Check register file for cached entities
python3 -m json.tool data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json

# Check output file size (when created)
ls -lh data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml

Expected output file size: ~600MB (complete Wikidata metadata for ~2,000 entities)


Next Steps (After Enrichment Completes)

Immediate

  1. Verify Output:

    # Check file was created
    ls -lh data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml
    
    # Validate YAML
    python3 -c "import yaml; data = yaml.safe_load(open('data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml')); print(f'✓ Loaded {len(data.get(\"hypernym\", []))} hypernym entries')"
    
  2. Inspect Results:

    • Check enrichment statistics (printed at end)
    • Review sample enriched entities
    • Verify curation data preserved correctly
  3. Validate with LinkML:

    linkml-validate -s schemas/hyponyms_curated_full.yaml data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml
    

Medium-term

  1. Add 13 Gallery Types to hyponyms_curated.yaml:

    • Ready-to-paste YAML in /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/G/NEXT_SESSION_HANDOFF.md
    • Q16020664, Q4801240, Q3325736 (artist-run spaces)
    • Q117072343, Q127346204, Q114023739 (specialized galleries)
    • Q111889841, Q107537774 (institutional variants)
    • Q3768550, Q29380643, Q17111940, Q125501487 (other gallery types)
  2. Review 10 Ambiguous Entities from G-class query:

    • Classify to appropriate GLAMORCUBESFIXPHDNT type
    • Update hyponyms_curated.yaml with resolved classifications
  3. Execute Next G-Class Query Round:

    • Regenerate query with 1,832 exclusions
    • Expect <50 results (diminishing returns)

Files Modified/Created This Session

Created

  1. /Users/kempersc/apps/glam/schemas/hyponyms_curated.yaml - Curated schema
  2. /Users/kempersc/apps/glam/schemas/hyponyms_curated_full.yaml - Enriched schema
  3. /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json - Cache registry

Modified

  1. /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml - Fixed YAML syntax

Will Be Created 🔄

  1. /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml - Enriched output (in progress)

Technical Details

Wikidata API Usage

  • Endpoint: https://www.wikidata.org/w/api.php
  • Action: wbgetentities
  • Props: info|sitelinks|aliases|labels|descriptions|claims|datatype
  • Rate Limit: 10 requests/second (conservative)
  • Languages: ALL (not limited to specific language codes)

Data Extraction

  • ALL fields extracted (complete entity snapshot)
  • Datavalue types handled: wikibase-entityid, string, time, quantity, monolingualtext, globecoordinate
  • Qualifiers and references preserved
  • Complete claim structure maintained
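For readers unfamiliar with the Wikibase JSON model, a simplified extractor for those datavalue types might look like this (a sketch for illustration; the actual script preserves the complete claim structure rather than collapsing values):

```python
def simplify_datavalue(dv: dict):
    """Collapse a Wikidata datavalue into a plain Python value (sketch)."""
    t, v = dv["type"], dv["value"]
    if t == "wikibase-entityid":
        return v["id"]                          # e.g. "Q5"
    if t == "time":
        return v["time"]                        # e.g. "+1969-01-01T00:00:00Z"
    if t == "quantity":
        return v["amount"]
    if t == "monolingualtext":
        return v["text"]
    if t == "globecoordinate":
        return (v["latitude"], v["longitude"])
    return v                                    # "string" and anything else
```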

Output Structure

hypernym:
  - curated:                  # Original manual curation
      label: Q3363945
      hypernym: [museum, park]
      type: [M, F]
      country: [France]
    wikidata:                 # Full Wikidata data
      id: Q3363945
      labels:
        en: "archaeological park"
        fr: "parc archéologique"
        # ... all languages
      descriptions: {...}
      aliases: {...}
      claims:                 # ALL properties
        P31: [...]            # instance of
        P279: [...]           # subclass of
        # ... all properties
      sitelinks: {...}
    enrichment_status: success
    enrichment_date: "2025-11-16T..."

Troubleshooting

If Enrichment Fails

  1. Check process status:

    ps aux | grep "[e]nrich_hyponyms_with_wikidata"
    
  2. Check log file:

    cat /tmp/enrichment.log
    
  3. Verify register file:

    python3 -m json.tool data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json
    
  4. Re-run with dry-run flag:

    python scripts/enrich_hyponyms_with_wikidata.py --dry-run
    

Common Issues

  • YAML parse errors: Run YAML validation before enrichment
  • Rate limiting: Script already includes 100ms delay (conservative)
  • Network errors: Script continues on fetch failures, marks as enrichment_status: fetch_failed
  • Memory issues: Processing ~2,000 entities should use <500MB RAM
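The continue-on-failure behavior for network errors could be structured like this (a sketch; `enrich_one` and the injected `fetch` callable are illustrative, and it assumes the curated `label` field holds the Q-ID as in the output structure above):

```python
def enrich_one(entity: dict, fetch) -> dict:
    """Attach Wikidata data; on fetch failure, mark the entity and move on."""
    qid = entity["curated"]["label"]
    try:
        entity["wikidata"] = fetch(qid)
        entity["enrichment_status"] = "success"
    except Exception:
        # Don't abort the whole run for one bad entity.
        entity["enrichment_status"] = "fetch_failed"
    return entity
```

Injecting `fetch` as a parameter keeps the retry/failure logic testable without hitting the live API.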

Session End: Script running in background
Next Action: Wait for enrichment to complete (~3 minutes), then verify output