glam/docs/sessions/SESSION_SUMMARY_20251116_WIKIDATA_ENRICHMENT_FINAL.md

Session Summary: Wikidata Enrichment Script - Final Update

Date: 2025-11-16
Session: Creating LinkML schemas, fixing YAML, and extending identifier support


Final Status: Enrichment Running

Process ID: 36747
Log File: /tmp/enrichment_final.log
Progress: 477/1,970 entities (24.2%)
Estimated Time Remaining: ~2.5 minutes


What We Accomplished

1. Created LinkML Schemas

File: /Users/kempersc/apps/glam/schemas/hyponyms_curated.yaml

  • Schema for curated Wikidata entity classifications
  • Documents GLAMORCUBESFIXPHDNT taxonomy (19 type codes)
  • Supports all curation fields: country, subregion, settlement, time, duplicate, rico, notes
  • Defines sections: hypernym, entity, entity_list, standards, collection, exclude

File: /Users/kempersc/apps/glam/schemas/hyponyms_curated_full.yaml

  • Schema for Wikidata-enriched output
  • Separates curated (manual metadata) from wikidata (API data)
  • Captures ALL Wikidata fields:
    • Labels, descriptions, aliases (all languages)
    • Claims/statements (all properties with qualifiers and references)
    • Sitelinks (Wikipedia and Wikimedia project links)
    • Entity metadata (page ID, revision, modification date)

2. Fixed YAML Validation Issues

Problems Found:

  • Tab character on line 2197
  • Inconsistent indentation on line 11986
  • Multiple integer labels (years) instead of Q-numbers

Solutions:

  • Automated fixing script that replaced tabs with spaces
  • Standardized indentation (2 spaces for list items, 4 spaces for properties)
  • Validated YAML parses correctly
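The fixing approach above can be sketched as follows. This is a minimal illustration, not the session's actual fixing script, which also normalized list indentation; it assumes PyYAML (already used elsewhere in this workflow) for the parse check.

```python
import yaml  # PyYAML, as used by the enrichment workflow


def fix_yaml_text(text: str) -> str:
    """Replace tab characters with two spaces and verify the result parses.

    Minimal sketch; the real fixing script also standardized list and
    property indentation, which is not reproduced here.
    """
    fixed = text.replace("\t", "  ")
    yaml.safe_load(fixed)  # raises yaml.YAMLError if the file is still invalid
    return fixed
```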

3. Extended Identifier Support (new this session)

Updated extract_identifier() method to handle:

  • Q-numbers (entities): Q12345, 'Q12345 - Museum'
  • P-numbers (properties): P31, P2671, P10421
  • Category identifiers: Category:Virtual_museums, Category:Contemporary art museums
  • List identifiers: List_of_museums (if any exist)
  • Invalid labels: Integers like 2018, 1956 are properly skipped

Example Test Results:

✓ Q26945172 → Fetches from Wikidata (dance museum)
✓ P31 → Fetches from Wikidata (instance of property)
✓ Category:Virtual_museums → Skips with note (no Wikidata entity)
✓ 2018 → Skips (invalid integer label)
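The dispatch logic behind these results can be sketched as below. This is a hypothetical reconstruction of `extract_identifier()` (the real method lives in `scripts/enrich_hyponyms_with_wikidata.py`); the label formats handled match the bullet list above.

```python
import re


def extract_identifier(label):
    """Classify a curated label into a fetchable or skippable identifier.

    Hypothetical sketch of extract_identifier(): Q/P-numbers are returned
    for fetching, Category:/List identifiers are returned but flagged
    elsewhere as non-fetchable, and integer year labels yield None.
    """
    if isinstance(label, int):  # bare years like 2018, 1956 → skip
        return None
    text = str(label).strip()
    if text.startswith(("Category:", "List_of", "List of")):
        return text  # kept in output, but never fetched from Wikidata
    match = re.match(r"([QP]\d+)", text)  # 'Q12345' or 'Q12345 - Museum'
    return match.group(1) if match else None
```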

4. Updated Enrichment Logic

Changes:

  • Renamed extract_qid() → extract_identifier() throughout the codebase
  • Updated get_entity_data() to accept any identifier type (Q, P, Category, List)
  • Added special handling for Category/List identifiers (skipped, no fetch)
  • Renamed all variables from qid → identifier for clarity
  • Added enrichment_note field for non-fetchable identifiers

Enrichment Statuses:

  • success - Entity/property fetched successfully from Wikidata
  • fetch_failed - Network error or API failure
  • no_identifier - Invalid label (like integer years)
  • category_or_list - Wikipedia category/list page (no Wikidata entity)
  • cached - Retrieved from local cache
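How each entry maps to one of these status codes can be sketched as a small helper. This function is hypothetical (the real script assigns statuses inline during processing), but the status values match the list above.

```python
def enrichment_status(identifier, fetched_ok=False, from_cache=False):
    """Map one processed entry to its status code (hypothetical sketch).

    identifier is the value produced by extract_identifier(): None for
    invalid labels, a Category:/List string for non-fetchable pages,
    or a Q/P-number otherwise.
    """
    if identifier is None:
        return "no_identifier"  # e.g. an integer year label
    if identifier.startswith(("Category:", "List_of", "List of")):
        return "category_or_list"  # no Wikidata entity to fetch
    if from_cache:
        return "cached"
    return "success" if fetched_ok else "fetch_failed"
```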

Data Statistics

Total Entities to Process: 1,970

| Section | Count | Notes |
| --- | --- | --- |
| hypernym | 1,662 | Includes Q-numbers, P-numbers, and integer years |
| entity | 214 | Mostly Q-numbers |
| entity_list | 35 | Mix of Q-numbers and Category: identifiers |
| standards | 0 | Empty section |
| collection | 59 | Q-numbers |
| TOTAL | 1,970 | ~3.3 minutes at 10 req/sec |

Identifier Type Breakdown:

  • Q-numbers: ~1,920+ (entities)
  • P-numbers: ~4 (properties - P10421, P2671, P214, P349)
  • Category identifiers: 6 (Category:Virtual_museums, etc.)
  • Invalid (years): ~40+ (2018, 1974, 1956, etc. - will be skipped)

Script Features

Enrichment Capabilities

  • Fetches complete Wikidata entity data via Wikibase API
  • Handles Q-numbers, P-numbers, and Category identifiers
  • Preserves all manual curation metadata (country, time, hypernym, type, subregion, settlement, etc.)
  • Extracts ALL fields: labels, descriptions, aliases (all languages), claims (all properties), sitelinks
  • Maintains fetch register to avoid re-fetching (cache with fetch_date and fetch_count)
  • Processes 5 main sections (skips 'exclude' section)
  • Supports force refresh (individual identifiers or all entities)
  • Rate limiting (10 req/sec)
  • Structured output separating curated vs. Wikidata data
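The fetch-register and rate-limiting behavior can be sketched together. This is an illustration under assumptions: `fetch_fn` stands in for the real API call, and the register layout (`fetch_date`, `fetch_count` per identifier) mirrors what `.fetch_register.json` is assumed to contain.

```python
import json
import time
from pathlib import Path


def fetch_with_register(identifier, fetch_fn, register_path, min_interval=0.1):
    """Cache-aware, rate-limited fetch (sketch).

    fetch_fn(identifier) performs the actual API call. Identifiers already
    present in the register file are not fetched again; new fetches are
    throttled by min_interval (0.1 s → ~10 requests/second).
    """
    path = Path(register_path)
    register = json.loads(path.read_text()) if path.exists() else {}
    if identifier in register:
        return None  # cached; the caller reuses previously stored data
    time.sleep(min_interval)  # conservative rate limit
    data = fetch_fn(identifier)
    register[identifier] = {
        "fetch_date": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "fetch_count": 1,
    }
    path.write_text(json.dumps(register, indent=2))
    return data
```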

CLI Options

# Basic run (incremental, uses cache)
python scripts/enrich_hyponyms_with_wikidata.py

# Force refresh specific entities
python scripts/enrich_hyponyms_with_wikidata.py --refresh Q12345,P2671

# Refresh all entities (ignore cache)
python scripts/enrich_hyponyms_with_wikidata.py --refresh-all

# Dry run (enrichment without saving)
python scripts/enrich_hyponyms_with_wikidata.py --dry-run
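The flags above might be wired up with argparse roughly as follows. This is a sketch of the interface, not the script's actual parser code.

```python
import argparse


def build_parser():
    """CLI flags matching the documented options (sketch)."""
    parser = argparse.ArgumentParser(
        description="Enrich curated hyponyms with Wikidata entity data")
    parser.add_argument(
        "--refresh", metavar="IDS",
        help="comma-separated identifiers to re-fetch, e.g. Q12345,P2671")
    parser.add_argument(
        "--refresh-all", action="store_true",
        help="ignore the fetch register and re-fetch all entities")
    parser.add_argument(
        "--dry-run", action="store_true",
        help="run enrichment without saving the output file")
    return parser
```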

Monitoring the Enrichment

To check progress while script is running:

# Check if process is still running
ps aux | grep "[e]nrich_hyponyms_with_wikidata"

# Monitor with custom script
python scripts/monitor_enrichment.py

# Monitor log file
tail -f /tmp/enrichment_final.log

# Check register file for cached entities
python3 -m json.tool data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json

# Check output file size (when created)
ls -lh data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml

Expected output file size: ~600MB (complete Wikidata metadata for ~1,900 Q/P entities + 6 Category entries)
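A monitoring helper in the spirit of `scripts/monitor_enrichment.py` could pull the latest progress figure from the log. The `processed/total` log format is an assumption; adapt the pattern to whatever the enrichment script actually prints.

```python
import re


def parse_progress(log_text):
    """Extract the last 'N/M' progress figure from the enrichment log.

    Returns (done, total, percent) or None if no progress line is found.
    The N/M log format is assumed, not confirmed.
    """
    matches = re.findall(r"(\d[\d,]*)/(\d[\d,]*)", log_text)
    if not matches:
        return None
    done, total = (int(x.replace(",", "")) for x in matches[-1])
    return done, total, 100.0 * done / total
```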


Next Steps (After Enrichment Completes)

Immediate

  1. Verify Output:

    # Check file was created
    ls -lh data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml
    
    # Validate YAML
    python3 -c "import yaml; data = yaml.safe_load(open('data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml')); print(f'✓ Loaded {len(data.get(\"hypernym\", []))} hypernym entries')"
    
  2. Inspect Results:

    • Check enrichment statistics (printed at end)
    • Review sample enriched entities
    • Verify curation data preserved correctly
    • Verify P-numbers fetched correctly
    • Verify Category identifiers properly skipped
  3. Validate with LinkML:

    linkml-validate -s schemas/hyponyms_curated_full.yaml data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml
    

Medium-term

  1. Add 13 Gallery Types to hyponyms_curated.yaml:

    • Ready-to-paste YAML in /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/G/NEXT_SESSION_HANDOFF.md
    • Q16020664, Q4801240, Q3325736 (artist-run spaces)
    • Q117072343, Q127346204, Q114023739 (specialized galleries)
    • Q111889841, Q107537774 (institutional variants)
    • Q3768550, Q29380643, Q17111940, Q125501487 (other gallery types)
  2. Review 10 Ambiguous Entities from G-class query:

    • Classify to appropriate GLAMORCUBESFIXPHDNT type
    • Update hyponyms_curated.yaml with resolved classifications
  3. Execute Next G-Class Query Round:

    • Regenerate query with 1,832 exclusions
    • Expect <50 results (diminishing returns)

Files Modified/Created This Session

Created

  1. /Users/kempersc/apps/glam/schemas/hyponyms_curated.yaml - Curated schema
  2. /Users/kempersc/apps/glam/schemas/hyponyms_curated_full.yaml - Enriched schema
  3. /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json - Cache registry
  4. /Users/kempersc/apps/glam/scripts/monitor_enrichment.py - Progress monitoring script
  5. /Users/kempersc/apps/glam/docs/sessions/SESSION_SUMMARY_20251116_WIKIDATA_ENRICHMENT.md - Session documentation

Modified

  1. /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml - Fixed YAML syntax
  2. /Users/kempersc/apps/glam/scripts/enrich_hyponyms_with_wikidata.py - Extended identifier support (Q, P, Category, List)

Will Be Created 🔄

  1. /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml - Enriched output (in progress)

Technical Details

Wikidata API Usage

  • Endpoint: https://www.wikidata.org/w/api.php
  • Action: wbgetentities
  • Props: info|sitelinks|aliases|labels|descriptions|claims|datatype
  • Rate Limit: 10 requests/second (conservative)
  • Languages: ALL (not limited to specific language codes)
  • Identifier Types: Q-numbers (entities) and P-numbers (properties)
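A request against this endpoint might be built as below. This is a sketch: the real script may batch differently and set its own headers, but the action and props match the values above, and omitting the `languages` parameter returns all languages.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_wbgetentities_url(identifiers):
    """Build a wbgetentities request URL for Q- and P-numbers (sketch).

    The API accepts up to 50 pipe-separated IDs per call; no 'languages'
    parameter is set, so all languages are returned.
    """
    params = {
        "action": "wbgetentities",
        "ids": "|".join(identifiers),
        "props": "info|sitelinks|aliases|labels|descriptions|claims|datatype",
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"
```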

Data Extraction

  • ALL fields extracted (complete entity snapshot)
  • Datavalue types handled: entityid, string, time, quantity, monolingualtext, globecoordinate
  • Qualifiers and references preserved
  • Complete claim structure maintained
  • P-number properties fetched with same API call structure
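Handling the listed datavalue types can be sketched as a small flattener. This illustration discards the structure (qualifiers, references, precision fields) that the real extraction preserves; the type names follow Wikidata's wire format (e.g. `wikibase-entityid` for entityid).

```python
def simplify_datavalue(datavalue):
    """Flatten one Wikidata datavalue into a plain Python value (sketch).

    Covers the types listed above; the real extraction keeps the full
    claim structure rather than flattening it like this.
    """
    dv_type, value = datavalue["type"], datavalue["value"]
    if dv_type == "wikibase-entityid":
        return value["id"]  # e.g. 'Q3363945'
    if dv_type == "string":
        return value
    if dv_type == "time":
        return value["time"]  # e.g. '+2018-00-00T00:00:00Z'
    if dv_type == "quantity":
        return value["amount"]
    if dv_type == "monolingualtext":
        return (value["language"], value["text"])
    if dv_type == "globecoordinate":
        return (value["latitude"], value["longitude"])
    return value  # pass through any type not listed above
```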

Output Structure

hypernym:
  # Q-number (entity)
  - curated:                  # Original manual curation
      label: Q3363945
      hypernym: [museum, park]
      type: [M, F]
      country: [France]
    wikidata:                 # Full Wikidata data
      id: Q3363945
      type: item
      labels:
        en: "archaeological park"
        fr: "parc archéologique"
        # ... all languages
      descriptions: {...}
      aliases: {...}
      claims:                 # ALL properties
        P31: [...]            # instance of
        P279: [...]           # subclass of
        # ... all properties
      sitelinks: {...}
    enrichment_status: success
    identifier: Q3363945
    enrichment_date: "2025-11-16T..."
  
  # P-number (property)
  - curated:
      label: P2671
      description: "Google Knowledge Graph ID"
    wikidata:
      id: P2671
      type: property
      labels:
        en: "Google Knowledge Graph ID"
      datatype: external-id
      # ... full property metadata
    enrichment_status: success
    identifier: P2671
    enrichment_date: "2025-11-16T..."

entity_list:
  # Category identifier (no Wikidata fetch)
  - curated:
      label: Category:Virtual_museums
    wikidata: null
    enrichment_status: category_or_list
    identifier: Category:Virtual_museums
    enrichment_note: "Wikipedia category or list page (no Wikidata entity)"

Troubleshooting

If Enrichment Fails

  1. Check process status:

    ps aux | grep "[e]nrich_hyponyms_with_wikidata"
    
  2. Check log file:

    tail -n 100 /tmp/enrichment_final.log
    
  3. Verify register file:

    python3 -m json.tool data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json
    
  4. Re-run with dry-run flag:

    python scripts/enrich_hyponyms_with_wikidata.py --dry-run
    

Common Issues

  • YAML parse errors: Run YAML validation before enrichment
  • Rate limiting: Script already includes 100ms delay (conservative)
  • Network errors: Script continues on fetch failures, marks as enrichment_status: fetch_failed
  • Memory issues: Processing ~2,000 entities should use <500MB RAM
  • Integer labels: Script properly skips invalid integer labels (years) without crashing

Key Improvements This Session

1. Comprehensive Identifier Support

  • Before: Only handled Q-numbers
  • After: Handles Q-numbers, P-numbers, Category identifiers, List identifiers
  • Impact: Can now enrich properties (P-numbers) and properly skip Wikipedia categories

2. Better Error Handling

  • Before: Crashed on integer labels (years)
  • After: Gracefully skips invalid labels with proper status codes
  • Impact: Enrichment completes without manual intervention

3. Clear Documentation

  • Before: No identifier type documentation
  • After: Comprehensive examples of all identifier types
  • Impact: Future maintainers understand the codebase

Session End: Script running in background
Next Action: Wait for enrichment to complete (~2-3 minutes), then verify output

Current Time: 2025-11-16T12:45 (approx)
Expected Completion: 2025-11-16T12:48 (approx)