Session Summary: Wikidata Enrichment Script Development and Execution
Date: 2025-11-16
Session: Creating LinkML schemas and running Wikidata enrichment
What We Accomplished
1. Created LinkML Schemas ✅
File: /Users/kempersc/apps/glam/schemas/hyponyms_curated.yaml
- Schema for curated Wikidata entity classifications
- Documents GLAMORCUBESFIXPHDNT taxonomy (19 type codes)
- Supports curation fields: country, subregion, settlement, time, duplicate, rico, notes
- Defines sections: hypernym, entity, entity_list, standards, collection, exclude
File: /Users/kempersc/apps/glam/schemas/hyponyms_curated_full.yaml
- Schema for Wikidata-enriched output
- Separates curated (manual metadata) from wikidata (API data)
- Captures ALL Wikidata fields:
- Labels, descriptions, aliases (all languages)
- Claims/statements (all properties with qualifiers and references)
- Sitelinks (Wikipedia and Wikimedia project links)
- Entity metadata (page ID, revision, modification date)
2. Fixed YAML Validation Issues ✅
Problem: hyponyms_curated.yaml had syntax errors
- Tab character on line 2197
- Inconsistent indentation on line 11986
Solution: Automated fixing script that:
- Replaced tabs with spaces
- Standardized indentation (2 spaces for list items, 4 spaces for properties)
- Validated YAML parses correctly
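The tab-replacement and validation steps could be sketched as below. This is a minimal illustration, not the actual fixing script (whose code is not shown in this summary); it assumes PyYAML is installed and covers only the tab replacement and the final parse check, not the indentation standardization.

```python
import yaml  # PyYAML; raises yaml.YAMLError on invalid syntax


def fix_yaml_whitespace(path: str) -> None:
    """Replace tab characters with spaces and confirm the file still parses."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Tabs are not allowed for indentation in YAML; expand each to two spaces.
    fixed = text.replace("\t", "  ")
    yaml.safe_load(fixed)  # validation step: raises if the file is still broken
    with open(path, "w", encoding="utf-8") as f:
        f.write(fixed)
```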
3. Tested Enrichment Script ✅
Test Run: Processed first 5 entities from hypernym section
- Q26945172 (dance museum) - 8 labels, 8 properties
- Q11341528 (comics museum) - 8 labels, 5 properties
- Q80096168 (circus museum) - 4 labels, 3 properties
- Q25964553 (ceramics museum) - 26 labels, 6 properties
- Q132718497 (calligraphy museum) - 4 labels, 6 properties
Result: All 5 entities fetched successfully from Wikidata API
4. Started Full Enrichment 🔄
Status: Running in background (PID 28746)
Command: python scripts/enrich_hyponyms_with_wikidata.py
Log File: /tmp/enrichment.log
Data to Process:
hypernym: 1,662 entities
entity: 214 entities
entity_list: 35 entities
standards: 0 entities (empty)
collection: 59 entities
--------------------------
TOTAL: 1,970 entities
Estimated Time: ~3.3 minutes (10 requests/second with 100ms delay)
Output Files:
- data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml - Enriched data
- data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json - Cache registry
Script Features
Enrichment Capabilities
- ✅ Fetches complete Wikidata entity data via Wikibase API
- ✅ Preserves all manual curation metadata (country, time, hypernym, type, etc.)
- ✅ Extracts ALL fields: labels, descriptions, aliases (all languages), claims (all properties), sitelinks
- ✅ Maintains fetch register to avoid re-fetching (cache with fetch_date and fetch_count)
- ✅ Processes 5 main sections (skips 'exclude' section)
- ✅ Supports force refresh (individual Q-IDs or all entities)
- ✅ Rate limiting (10 req/sec)
- ✅ Structured output separating curated vs. Wikidata data
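The cached, rate-limited fetch described above could look roughly like this. The function name and register layout are illustrative, not the script's actual code, and the request is simplified (in practice Wikimedia asks API clients to send a descriptive User-Agent header, omitted here for brevity):

```python
import json
import time
import urllib.request

API = "https://www.wikidata.org/w/api.php"


def fetch_entity(qid: str, register: dict, delay: float = 0.1):
    """Fetch one entity via wbgetentities, skipping Q-IDs already in the register."""
    if qid in register:
        return None  # cache hit: incremental mode skips the API call
    params = (
        "action=wbgetentities&ids=" + qid +
        "&props=info|sitelinks|aliases|labels|descriptions|claims|datatype"
        "&format=json"
    )
    with urllib.request.urlopen(f"{API}?{params}") as resp:
        data = json.load(resp)
    # Record the fetch so re-runs reuse the cache instead of re-fetching.
    register[qid] = {
        "fetch_date": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "fetch_count": 1,
    }
    time.sleep(delay)  # 100 ms pause keeps the rate at ~10 requests/second
    return data["entities"][qid]
```

A `--refresh` flag would simply delete the given Q-IDs from the register before calling this.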
CLI Options
# Basic run (incremental, uses cache)
python scripts/enrich_hyponyms_with_wikidata.py
# Force refresh specific entities
python scripts/enrich_hyponyms_with_wikidata.py --refresh Q12345,Q67890
# Refresh all entities (ignore cache)
python scripts/enrich_hyponyms_with_wikidata.py --refresh-all
# Dry run (enrichment without saving)
python scripts/enrich_hyponyms_with_wikidata.py --dry-run
Monitoring the Enrichment
To check progress while script is running:
# Check if process is still running
ps aux | grep "[e]nrich_hyponyms_with_wikidata"
# Monitor log file
tail -f /tmp/enrichment.log
# Check register file for cached entities
cat data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json | python3 -m json.tool
# Check output file size (when created)
ls -lh data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml
Expected output file size: ~600MB (complete Wikidata metadata for ~2,000 entities)
Next Steps (After Enrichment Completes)
Immediate
- Verify Output:
  # Check file was created
  ls -lh data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml
  # Validate YAML
  python3 -c "import yaml; data = yaml.safe_load(open('data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml')); print(f'✓ Loaded {len(data.get(\"hypernym\", []))} hypernym entries')"
- Inspect Results:
- Check enrichment statistics (printed at end)
- Review sample enriched entities
- Verify curation data preserved correctly
- Validate with LinkML:
  linkml-validate -s schemas/hyponyms_curated_full.yaml data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml
Medium-term
- Add 13 Gallery Types to hyponyms_curated.yaml:
  - Ready-to-paste YAML in /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/G/NEXT_SESSION_HANDOFF.md
  - Q16020664, Q4801240, Q3325736 (artist-run spaces)
  - Q117072343, Q127346204, Q114023739 (specialized galleries)
  - Q111889841, Q107537774 (institutional variants)
  - Q3768550, Q29380643, Q17111940, Q125501487 (other gallery types)
- Review 10 Ambiguous Entities from G-class query:
  - Classify to appropriate GLAMORCUBESFIXPHDNT type
  - Update hyponyms_curated.yaml with resolved classifications
- Execute Next G-Class Query Round:
- Regenerate query with 1,832 exclusions
- Expect <50 results (diminishing returns)
Files Modified/Created This Session
Created ✅
- /Users/kempersc/apps/glam/schemas/hyponyms_curated.yaml - Curated schema
- /Users/kempersc/apps/glam/schemas/hyponyms_curated_full.yaml - Enriched schema
- /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json - Cache registry
Modified ✅
- /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml - Fixed YAML syntax
Will Be Created 🔄
- /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml - Enriched output (in progress)
Technical Details
Wikidata API Usage
- Endpoint: https://www.wikidata.org/w/api.php
- Action: wbgetentities
- Props: info|sitelinks|aliases|labels|descriptions|claims|datatype
- Rate Limit: 10 requests/second (conservative)
- Languages: ALL (not limited to specific language codes)
Data Extraction
- ALL fields extracted (complete entity snapshot)
- Datavalue types handled: wikibase-entityid, string, time, quantity, monolingualtext, globecoordinate
- Qualifiers and references preserved
- Complete claim structure maintained
Output Structure
hypernym:
  - curated:                  # Original manual curation
      label: Q3363945
      hypernym: [museum, park]
      type: [M, F]
      country: [France]
    wikidata:                 # Full Wikidata data
      id: Q3363945
      labels:
        en: "archaeological park"
        fr: "parc archéologique"
        # ... all languages
      descriptions: {...}
      aliases: {...}
      claims:                 # ALL properties
        P31: [...]            # instance of
        P279: [...]           # subclass of
        # ... all properties
      sitelinks: {...}
    enrichment_status: success
    enrichment_date: "2025-11-16T..."
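Once the output exists, a quick sanity check over this structure could look like the sketch below (the function name is hypothetical; it assumes PyYAML and the entry layout shown above, with enrichment_status at the entry level):

```python
import yaml  # PyYAML


def summarise(path: str) -> dict:
    """Count total vs successfully enriched entries per section of the output."""
    with open(path, encoding="utf-8") as f:
        data = yaml.safe_load(f)
    counts = {}
    for section in ("hypernym", "entity", "entity_list", "standards", "collection"):
        entries = data.get(section) or []  # empty sections may be null in YAML
        ok = sum(1 for e in entries if e.get("enrichment_status") == "success")
        counts[section] = {"total": len(entries), "success": ok}
    return counts
```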
Troubleshooting
If Enrichment Fails
- Check process status:
  ps aux | grep "[e]nrich_hyponyms_with_wikidata"
- Check log file:
  cat /tmp/enrichment.log
- Verify register file:
  cat data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json | python3 -m json.tool
- Re-run with dry-run flag:
  python scripts/enrich_hyponyms_with_wikidata.py --dry-run
Common Issues
- YAML parse errors: Run YAML validation before enrichment
- Rate limiting: Script already includes 100ms delay (conservative)
- Network errors: Script continues on fetch failures, marking entities as enrichment_status: fetch_failed
- Memory issues: Processing ~2,000 entities should use <500MB RAM
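The continue-on-failure behaviour could be sketched as a wrapper like this (the function names are illustrative, not the script's actual code):

```python
def enrich(qid, fetcher):
    """Run one fetch, recording failures instead of aborting the whole run."""
    try:
        data = fetcher(qid)
        return {"id": qid, "enrichment_status": "success", **data}
    except Exception as exc:  # network/API errors become a status, not a crash
        return {"id": qid, "enrichment_status": "fetch_failed", "error": str(exc)}
```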
Session End: Script running in background
Next Action: Wait for enrichment to complete (~3 minutes), then verify output