glam/docs/sessions/SESSION_SUMMARY_20251116_WIKIDATA_ENRICHMENT.md

Session Summary: Wikidata Enrichment Script Development and Execution

Date: 2025-11-16
Session: Creating LinkML schemas and running Wikidata enrichment


What We Accomplished

1. Created LinkML Schemas

File: /Users/kempersc/apps/glam/schemas/hyponyms_curated.yaml

  • Schema for curated Wikidata entity classifications
  • Documents GLAMORCUBESFIXPHDNT taxonomy (19 type codes)
  • Supports curation fields: country, subregion, settlement, time, duplicate, rico, notes
  • Defines sections: hypernym, entity, entity_list, standards, collection, exclude

File: /Users/kempersc/apps/glam/schemas/hyponyms_curated_full.yaml

  • Schema for Wikidata-enriched output
  • Separates curated (manual metadata) from wikidata (API data)
  • Captures ALL Wikidata fields:
    • Labels, descriptions, aliases (all languages)
    • Claims/statements (all properties with qualifiers and references)
    • Sitelinks (Wikipedia and Wikimedia project links)
    • Entity metadata (page ID, revision, modification date)

2. Fixed YAML Validation Issues

Problem: hyponyms_curated.yaml had syntax errors

  • Tab character on line 2197
  • Inconsistent indentation on line 11986

Solution: Automated fixing script that:

  • Replaced tabs with spaces
  • Standardized indentation (2 spaces for list items, 4 spaces for properties)
  • Validated YAML parses correctly
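The fixing script itself isn't reproduced in this summary. A minimal sketch of the core idea (tab replacement plus a parse check; `fix_yaml_text` is an illustrative name, and the real script additionally normalized list/property indentation, which is more involved):

```python
import yaml  # PyYAML, already used elsewhere in this workflow


def fix_yaml_text(text: str) -> str:
    """Replace tabs with spaces, then confirm the result parses as YAML."""
    fixed = text.replace("\t", "  ")  # tabs -> 2 spaces
    yaml.safe_load(fixed)             # raises yaml.YAMLError if still invalid
    return fixed
```

Running the validation step inside the fixer means a bad fix fails loudly instead of producing a silently broken file.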

3. Tested Enrichment Script

Test Run: Processed first 5 entities from hypernym section

  • Q26945172 (dance museum) - 8 labels, 8 properties
  • Q11341528 (comics museum) - 8 labels, 5 properties
  • Q80096168 (circus museum) - 4 labels, 3 properties
  • Q25964553 (ceramics museum) - 26 labels, 6 properties
  • Q132718497 (calligraphy museum) - 4 labels, 6 properties

Result: All 5 entities fetched successfully from Wikidata API
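For a single entity, a fetch against the Wikibase API looks roughly like the following (a sketch; `build_params`, `fetch_entity`, and the User-Agent string are illustrative, not the script's actual code):

```python
import requests

API = "https://www.wikidata.org/w/api.php"


def build_params(qid: str) -> dict:
    # Mirrors the API usage documented later in this summary.
    return {
        "action": "wbgetentities",
        "ids": qid,
        "props": "info|sitelinks|aliases|labels|descriptions|claims|datatype",
        "format": "json",
    }


def fetch_entity(qid: str) -> dict:
    # Wikimedia asks clients to send a descriptive User-Agent.
    resp = requests.get(API, params=build_params(qid), timeout=30,
                        headers={"User-Agent": "glam-enrichment-sketch/0.1"})
    resp.raise_for_status()
    return resp.json()["entities"][qid]
```

Omitting a language filter from the request is what yields labels in all languages, as seen in the per-entity label counts above.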

4. Started Full Enrichment 🔄

Status: Running in background (PID 28746)
Command: python scripts/enrich_hyponyms_with_wikidata.py
Log File: /tmp/enrichment.log

Data to Process:

hypernym:       1,662 entities
entity:           214 entities
entity_list:       35 entities
standards:          0 entities (empty)
collection:        59 entities
--------------------------
TOTAL:          1,970 entities

Estimated Time: ~3.3 minutes (10 requests/second with 100ms delay)
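The estimate is straightforward arithmetic: 1,970 entities at one request per 100 ms is about 197 s, i.e. ~3.3 minutes. A minimal rate-limited iterator (a sketch; `rate_limited` is an illustrative helper, not the script's):

```python
import time


def rate_limited(items, delay: float = 0.1):
    """Yield items no faster than one per `delay` seconds (10 req/s at 0.1)."""
    for item in items:
        start = time.monotonic()
        yield item
        # Sleep only for whatever part of the window the consumer didn't use.
        time.sleep(max(0.0, delay - (time.monotonic() - start)))
```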

Output Files:

  • data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml - Enriched data
  • data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json - Cache registry

Script Features

Enrichment Capabilities

  • Fetches complete Wikidata entity data via Wikibase API
  • Preserves all manual curation metadata (country, time, hypernym, type, etc.)
  • Extracts ALL fields: labels, descriptions, aliases (all languages), claims (all properties), sitelinks
  • Maintains fetch register to avoid re-fetching (cache with fetch_date and fetch_count)
  • Processes 5 main sections (skips 'exclude' section)
  • Supports force refresh (individual Q-IDs or all entities)
  • Rate limiting (10 req/sec)
  • Structured output separating curated vs. Wikidata data

CLI Options

# Basic run (incremental, uses cache)
python scripts/enrich_hyponyms_with_wikidata.py

# Force refresh specific entities
python scripts/enrich_hyponyms_with_wikidata.py --refresh Q12345,Q67890

# Refresh all entities (ignore cache)
python scripts/enrich_hyponyms_with_wikidata.py --refresh-all

# Dry run (enrichment without saving)
python scripts/enrich_hyponyms_with_wikidata.py --dry-run
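These flags could be declared with `argparse` roughly as follows (a sketch; the actual script's option handling may differ):

```python
import argparse


def parse_args(argv=None):
    p = argparse.ArgumentParser(
        description="Enrich curated hyponyms with Wikidata entity data")
    p.add_argument("--refresh",
                   help="comma-separated Q-IDs to force-refetch")
    p.add_argument("--refresh-all", action="store_true",
                   help="ignore the fetch register and refetch everything")
    p.add_argument("--dry-run", action="store_true",
                   help="run enrichment without writing output")
    return p.parse_args(argv)
```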

Monitoring the Enrichment

To check progress while script is running:

# Check if process is still running
ps aux | grep "[e]nrich_hyponyms_with_wikidata"

# Monitor log file
tail -f /tmp/enrichment.log

# Check register file for cached entities
python3 -m json.tool data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json

# Check output file size (when created)
ls -lh data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml

Expected output file size: ~600MB (complete Wikidata metadata for ~2,000 entities)


Next Steps (After Enrichment Completes)

Immediate

  1. Verify Output:

    # Check file was created
    ls -lh data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml
    
    # Validate YAML
    python3 -c "import yaml; data = yaml.safe_load(open('data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml')); print(f'✓ Loaded {len(data.get(\"hypernym\", []))} hypernym entries')"
    
  2. Inspect Results:

    • Check enrichment statistics (printed at end)
    • Review sample enriched entities
    • Verify curation data preserved correctly
  3. Validate with LinkML:

    linkml-validate -s schemas/hyponyms_curated_full.yaml data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml
    

Medium-term

  1. Add 13 Gallery Types to hyponyms_curated.yaml:

    • Ready-to-paste YAML in /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/G/NEXT_SESSION_HANDOFF.md
    • Q16020664, Q4801240, Q3325736 (artist-run spaces)
    • Q117072343, Q127346204, Q114023739 (specialized galleries)
    • Q111889841, Q107537774 (institutional variants)
    • Q3768550, Q29380643, Q17111940, Q125501487 (other gallery types)
  2. Review 10 Ambiguous Entities from G-class query:

    • Classify to appropriate GLAMORCUBESFIXPHDNT type
    • Update hyponyms_curated.yaml with resolved classifications
  3. Execute Next G-Class Query Round:

    • Regenerate query with 1,832 exclusions
    • Expect <50 results (diminishing returns)

Files Modified/Created This Session

Created

  1. /Users/kempersc/apps/glam/schemas/hyponyms_curated.yaml - Curated schema
  2. /Users/kempersc/apps/glam/schemas/hyponyms_curated_full.yaml - Enriched schema
  3. /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json - Cache registry

Modified

  1. /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml - Fixed YAML syntax

Will Be Created 🔄

  1. /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml - Enriched output (in progress)

Technical Details

Wikidata API Usage

  • Endpoint: https://www.wikidata.org/w/api.php
  • Action: wbgetentities
  • Props: info|sitelinks|aliases|labels|descriptions|claims|datatype
  • Rate Limit: 10 requests/second (conservative)
  • Languages: ALL (not limited to specific language codes)

Data Extraction

  • ALL fields extracted (complete entity snapshot)
  • Datavalue types handled: wikibase-entityid, string, time, quantity, monolingualtext, globecoordinate
  • Qualifiers and references preserved
  • Complete claim structure maintained
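For readers unfamiliar with the Wikibase JSON model, a simplified extractor for those datavalue types might look like this (a sketch for illustration; the actual script preserves the complete claim structure rather than collapsing values):

```python
def simplify_datavalue(dv: dict):
    """Collapse a Wikidata datavalue into a plain Python value (sketch)."""
    t, v = dv["type"], dv["value"]
    if t == "wikibase-entityid":
        return v["id"]                          # e.g. "Q5"
    if t == "time":
        return v["time"]                        # e.g. "+1969-01-01T00:00:00Z"
    if t == "quantity":
        return v["amount"]
    if t == "monolingualtext":
        return v["text"]
    if t == "globecoordinate":
        return (v["latitude"], v["longitude"])
    return v                                    # "string" and anything else
```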

Output Structure

hypernym:
  - curated:                  # Original manual curation
      label: Q3363945
      hypernym: [museum, park]
      type: [M, F]
      country: [France]
    wikidata:                 # Full Wikidata data
      id: Q3363945
      labels:
        en: "archaeological park"
        fr: "parc archéologique"
        # ... all languages
      descriptions: {...}
      aliases: {...}
      claims:                 # ALL properties
        P31: [...]            # instance of
        P279: [...]           # subclass of
        # ... all properties
      sitelinks: {...}
    enrichment_status: success
    enrichment_date: "2025-11-16T..."

Troubleshooting

If Enrichment Fails

  1. Check process status:

    ps aux | grep "[e]nrich_hyponyms_with_wikidata"
    
  2. Check log file:

    cat /tmp/enrichment.log
    
  3. Verify register file:

    python3 -m json.tool data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json
    
  4. Re-run with dry-run flag:

    python scripts/enrich_hyponyms_with_wikidata.py --dry-run
    

Common Issues

  • YAML parse errors: Run YAML validation before enrichment
  • Rate limiting: Script already includes 100ms delay (conservative)
  • Network errors: Script continues on fetch failures, marks as enrichment_status: fetch_failed
  • Memory issues: Processing ~2,000 entities should use <500MB RAM
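The continue-on-failure behavior for network errors could be structured like this (a sketch; `enrich_one` and the injected `fetch` callable are illustrative, and it assumes the curated `label` field holds the Q-ID as in the output structure above):

```python
def enrich_one(entity: dict, fetch) -> dict:
    """Attach Wikidata data; on fetch failure, mark the entity and move on."""
    qid = entity["curated"]["label"]
    try:
        entity["wikidata"] = fetch(qid)
        entity["enrichment_status"] = "success"
    except Exception:
        # Don't abort the whole run for one bad entity.
        entity["enrichment_status"] = "fetch_failed"
    return entity
```

Injecting `fetch` as a parameter keeps the retry/failure logic testable without hitting the live API.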

Session End: Script running in background
Next Action: Wait for enrichment to complete (~3 minutes), then verify output