Session Summary: Wikidata Enrichment Script - Final Update
Date: 2025-11-16
Session: Creating LinkML schemas, fixing YAML, and extending identifier support
Final Status: Enrichment Running ✅
Process ID: 36747
Log File: /tmp/enrichment_final.log
Progress: 477/1,970 entities (24.2%)
Estimated Time Remaining: ~2.5 minutes
What We Accomplished
1. Created LinkML Schemas ✅
File: /Users/kempersc/apps/glam/schemas/hyponyms_curated.yaml
- Schema for curated Wikidata entity classifications
- Documents GLAMORCUBESFIXPHDNT taxonomy (19 type codes)
- Supports all curation fields: country, subregion, settlement, time, duplicate, rico, notes
- Defines sections: hypernym, entity, entity_list, standards, collection, exclude
File: /Users/kempersc/apps/glam/schemas/hyponyms_curated_full.yaml
- Schema for Wikidata-enriched output
- Separates curated (manual metadata) from wikidata (API data)
- Captures ALL Wikidata fields:
  - Labels, descriptions, aliases (all languages)
  - Claims/statements (all properties with qualifiers and references)
  - Sitelinks (Wikipedia and Wikimedia project links)
  - Entity metadata (page ID, revision, modification date)
2. Fixed YAML Validation Issues ✅
Problems Found:
- Tab character on line 2197
- Inconsistent indentation on line 11986
- Multiple integer labels (years) instead of Q-numbers
Solutions:
- Automated fixing script that replaced tabs with spaces
- Standardized indentation (2 spaces for list items, 4 spaces for properties)
- Validated YAML parses correctly
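The fixing script itself is not reproduced in these notes. A minimal sketch of the tab-replacement step, assuming the offending tabs sit in the leading indentation and that one tab maps to two spaces (the function name and the default width are hypothetical):

```python
def replace_leading_tabs(text: str, spaces_per_tab: int = 2) -> tuple[str, list[int]]:
    """Replace tabs in each line's indentation with spaces.

    Returns the fixed text and the 1-based numbers of the lines that changed.
    Tabs that appear after the first non-whitespace character are left alone.
    """
    fixed_lines = []
    changed = []
    for number, line in enumerate(text.split("\n"), start=1):
        stripped = line.lstrip(" \t")
        indent = line[: len(line) - len(stripped)]
        if "\t" in indent:
            indent = indent.replace("\t", " " * spaces_per_tab)
            changed.append(number)
        fixed_lines.append(indent + stripped)
    return "\n".join(fixed_lines), changed


sample = "hypernym:\n\t- label: Q1\n  - label: Q2\n"
fixed, changed = replace_leading_tabs(sample)
print(changed)  # line numbers whose indentation contained a tab
```

Reporting the changed line numbers makes it easy to cross-check against the lines the YAML parser complained about.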
3. Extended Identifier Support ✅ NEW
Updated extract_identifier() method to handle:
- ✅ Q-numbers (entities): Q12345, 'Q12345 - Museum'
- ✅ P-numbers (properties): P31, P2671, P10421
- ✅ Category identifiers: Category:Virtual_museums, Category:Contemporary art museums
- ✅ List identifiers: List_of_museums (if any exist)
- ✅ Invalid labels: Integers like 2018, 1956 are properly skipped
Example Test Results:
✓ Q26945172 → Fetches from Wikidata (dance museum)
✓ P31 → Fetches from Wikidata (instance of property)
✓ Category:Virtual_museums → Skips with note (no Wikidata entity)
✓ 2018 → Skips (invalid integer label)
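For reference, the classification behind these results could look roughly like the sketch below. It is a standalone function rather than the script's actual method, and the regular expression and the prefix list are assumptions, not the real implementation:

```python
import re
from typing import Optional

# Hypothetical patterns; the real extract_identifier() may use different rules.
_QP_PATTERN = re.compile(r"^([QP][1-9]\d*)\b")
_PAGE_PREFIXES = ("Category:", "List_of_")


def extract_identifier(label: object) -> Optional[str]:
    """Return a Q/P number or a Category/List page name, or None if invalid."""
    if not isinstance(label, str):
        # YAML parses a bare year such as 2018 as an integer, not a string.
        return None
    text = label.strip()
    match = _QP_PATTERN.match(text)
    if match:
        # Handles both 'Q12345' and 'Q12345 - Museum'.
        return match.group(1)
    if text.startswith(_PAGE_PREFIXES):
        return text
    return None


for label in ["Q12345 - Museum", "P31", "Category:Virtual_museums", 2018]:
    print(repr(label), "->", extract_identifier(label))
```

Returning None for anything unrecognised lets the caller map it to the no_identifier status instead of raising.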
4. Updated Enrichment Logic ✅
Changes:
- Renamed extract_qid() → extract_identifier() throughout codebase
- Updated get_entity_data() to accept any identifier type (Q, P, Category, List)
- Added special handling for Category/List identifiers (skipped, no fetch)
- Updated all variable names from qid → identifier for clarity
- Added enrichment_note field for non-fetchable identifiers
Enrichment Statuses:
- success: Entity/property fetched successfully from Wikidata
- fetch_failed: Network error or API failure
- no_identifier: Invalid label (like integer years)
- category_or_list: Wikipedia category/list page (no Wikidata entity)
- cached: Retrieved from local cache
Data Statistics
Total Entities to Process: 1,970
| Section | Count | Notes |
|---|---|---|
| hypernym | 1,662 | Includes Q-numbers, P-numbers, and integer years |
| entity | 214 | Mostly Q-numbers |
| entity_list | 35 | Mix of Q-numbers and Category: identifiers |
| standards | 0 | Empty section |
| collection | 59 | Q-numbers |
| TOTAL | 1,970 | ~3.3 minutes at 10 req/sec |
Identifier Type Breakdown:
- Q-numbers: ~1,920+ (entities)
- P-numbers: ~4 (properties - P10421, P2671, P214, P349)
- Category identifiers: 6 (Category:Virtual_museums, etc.)
- Invalid (years): ~40+ (2018, 1974, 1956, etc. - will be skipped)
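The ~3.3 minute figure in the table follows directly from the entity count and the rate limit. A quick check, assuming one request per entry and ignoring network latency, cache hits and skipped identifiers (the function name is illustrative):

```python
def estimated_minutes(entity_count: int, requests_per_second: float = 10.0) -> float:
    """Lower-bound run time in minutes at a fixed request rate."""
    return entity_count / requests_per_second / 60


total = estimated_minutes(1970)            # full run
remaining = estimated_minutes(1970 - 477)  # after 477 entities are done
print(f"total ~{total:.1f} min, remaining ~{remaining:.1f} min")
# prints: total ~3.3 min, remaining ~2.5 min
```

Actual run time will be somewhat longer because each request also waits on the API response.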
Script Features
Enrichment Capabilities
- ✅ Fetches complete Wikidata entity data via Wikibase API
- ✅ Handles Q-numbers, P-numbers, and Category identifiers
- ✅ Preserves all manual curation metadata (country, time, hypernym, type, subregion, settlement, etc.)
- ✅ Extracts ALL fields: labels, descriptions, aliases (all languages), claims (all properties), sitelinks
- ✅ Maintains fetch register to avoid re-fetching (cache with fetch_date and fetch_count)
- ✅ Processes 5 main sections (skips 'exclude' section)
- ✅ Supports force refresh (individual identifiers or all entities)
- ✅ Rate limiting (10 req/sec)
- ✅ Structured output separating curated vs. Wikidata data
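The fetch register is what makes incremental runs cheap. The sketch below shows one way such a register could work; the JSON layout (one entry per identifier holding fetch_date and fetch_count) and all function names are assumptions, since the real .fetch_register.json format is not shown in these notes:

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Collection, Optional


def load_register(path: Path) -> dict:
    """Load the register, or start empty if the file does not exist yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def needs_fetch(register: dict, identifier: str,
                refresh: Collection[str] = (), refresh_all: bool = False) -> bool:
    """Fetch when forced (--refresh / --refresh-all) or when never fetched."""
    return refresh_all or identifier in refresh or identifier not in register


def record_fetch(register: dict, identifier: str,
                 now: Optional[datetime] = None) -> None:
    """Update fetch_date and increment fetch_count for one identifier."""
    stamp = now or datetime.now(timezone.utc)
    entry = register.setdefault(identifier, {"fetch_count": 0})
    entry["fetch_date"] = stamp.isoformat()
    entry["fetch_count"] += 1


def save_register(path: Path, register: dict) -> None:
    """Write the register back to disk as readable JSON."""
    path.write_text(json.dumps(register, indent=2, sort_keys=True), encoding="utf-8")
```

Keeping the refresh decision in one small function means the --refresh and --refresh-all flags only need to pass their values through.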
CLI Options
# Basic run (incremental, uses cache)
python scripts/enrich_hyponyms_with_wikidata.py
# Force refresh specific entities
python scripts/enrich_hyponyms_with_wikidata.py --refresh Q12345,P2671
# Refresh all entities (ignore cache)
python scripts/enrich_hyponyms_with_wikidata.py --refresh-all
# Dry run (enrichment without saving)
python scripts/enrich_hyponyms_with_wikidata.py --dry-run
Monitoring the Enrichment
To check progress while script is running:
# Check if process is still running
ps aux | grep "[e]nrich_hyponyms_with_wikidata"
# Monitor with custom script
python scripts/monitor_enrichment.py
# Monitor log file
tail -f /tmp/enrichment_final.log
# Check register file for cached entities
cat data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json | python3 -m json.tool
# Check output file size (when created)
ls -lh data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml
Expected output file size: ~600MB (complete Wikidata metadata for ~1,900 Q/P entities + 6 Category entries)
Next Steps (After Enrichment Completes)
Immediate
- Verify Output:
  # Check file was created
  ls -lh data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml
  # Validate YAML
  python3 -c "import yaml; data = yaml.safe_load(open('data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml')); print(f'✓ Loaded {len(data.get(\"hypernym\", []))} hypernym entries')"
- Inspect Results:
  - Check enrichment statistics (printed at end)
  - Review sample enriched entities
  - Verify curation data preserved correctly
  - Verify P-numbers fetched correctly
  - Verify Category identifiers properly skipped
- Validate with LinkML:
  linkml-validate -s schemas/hyponyms_curated_full.yaml data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml
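To cross-check the statistics printed at the end of the run, the statuses can be tallied from the output file itself. A small sketch, assuming the data has already been loaded into a dict (for example with yaml.safe_load) and that each record carries an enrichment_status key as in the output structure described in these notes; the function name is hypothetical:

```python
from collections import Counter

# The five processed sections; 'exclude' is skipped by the script.
SECTIONS = ("hypernym", "entity", "entity_list", "standards", "collection")


def count_statuses(data: dict) -> Counter:
    """Tally enrichment_status values across all processed sections."""
    counts = Counter()
    for section in SECTIONS:
        # 'or []' covers sections that are present but empty (None in YAML).
        for record in data.get(section) or []:
            counts[record.get("enrichment_status", "missing")] += 1
    return counts


example = {
    "hypernym": [{"enrichment_status": "success"}, {"enrichment_status": "no_identifier"}],
    "entity_list": [{"enrichment_status": "category_or_list"}],
    "standards": None,
}
print(count_statuses(example))
```

A non-zero fetch_failed count is the signal to re-run with --refresh for the affected identifiers.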
Medium-term
- Add 13 Gallery Types to hyponyms_curated.yaml:
  - Ready-to-paste YAML in /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/G/NEXT_SESSION_HANDOFF.md
  - Q16020664, Q4801240, Q3325736 (artist-run spaces)
  - Q117072343, Q127346204, Q114023739 (specialized galleries)
  - Q111889841, Q107537774 (institutional variants)
  - Q3768550, Q29380643, Q17111940, Q125501487 (other gallery types)
- Review 10 Ambiguous Entities from G-class query:
  - Classify to appropriate GLAMORCUBESFIXPHDNT type
  - Update hyponyms_curated.yaml with resolved classifications
- Execute Next G-Class Query Round:
  - Regenerate query with 1,832 exclusions
  - Expect <50 results (diminishing returns)
Files Modified/Created This Session
Created ✅
- /Users/kempersc/apps/glam/schemas/hyponyms_curated.yaml - Curated schema
- /Users/kempersc/apps/glam/schemas/hyponyms_curated_full.yaml - Enriched schema
- /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json - Cache registry
- /Users/kempersc/apps/glam/scripts/monitor_enrichment.py - Progress monitoring script
- /Users/kempersc/apps/glam/docs/sessions/SESSION_SUMMARY_20251116_WIKIDATA_ENRICHMENT.md - Session documentation
Modified ✅
- /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated.yaml - Fixed YAML syntax
- /Users/kempersc/apps/glam/scripts/enrich_hyponyms_with_wikidata.py - Extended identifier support (Q, P, Category, List)
Will Be Created 🔄
- /Users/kempersc/apps/glam/data/wikidata/GLAMORCUBEPSXHFN/hyponyms_curated_full.yaml - Enriched output (in progress)
Technical Details
Wikidata API Usage
- Endpoint: https://www.wikidata.org/w/api.php
- Action: wbgetentities
- Props: info|sitelinks|aliases|labels|descriptions|claims|datatype
- Rate Limit: 10 requests/second (conservative)
- Languages: ALL (not limited to specific language codes)
- Identifier Types: Q-numbers (entities) and P-numbers (properties)
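Put together, a single request with these settings might be issued as in the sketch below. It uses only the standard library; the function names and the User-Agent string are placeholders, and the real script's get_entity_data() may be structured differently. Omitting the languages parameter is what returns every language:

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Optional

API_URL = "https://www.wikidata.org/w/api.php"
PROPS = "info|sitelinks|aliases|labels|descriptions|claims|datatype"
MIN_INTERVAL = 0.1  # seconds between requests, i.e. at most 10 requests/second


def build_request_url(identifier: str) -> str:
    """Build the wbgetentities URL for one Q- or P-number."""
    params = {
        "action": "wbgetentities",
        "ids": identifier,
        "props": PROPS,
        "format": "json",
        # No 'languages' parameter: the API then returns all languages.
    }
    return f"{API_URL}?{urllib.parse.urlencode(params)}"


def fetch_entity(identifier: str,
                 user_agent: str = "glam-enrichment-sketch/0.1") -> Optional[dict]:
    """Fetch one entity or property; returns None if the API omits it."""
    request = urllib.request.Request(
        build_request_url(identifier), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    time.sleep(MIN_INTERVAL)  # simple fixed delay as rate limiting
    return payload.get("entities", {}).get(identifier)
```

Calling fetch_entity("P2671") uses exactly the same request shape as for a Q-number; only the ids value changes.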
Data Extraction
- ALL fields extracted (complete entity snapshot)
- Datavalue types handled: wikibase-entityid, string, time, quantity, monolingualtext, globecoordinate
- Qualifiers and references preserved
- Complete claim structure maintained
- P-number properties fetched with same API call structure
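A consumer of the enriched file will usually want a compact view of each claim value. The sketch below dispatches on the datavalue type as it appears in the API's JSON (where the entity type is named wikibase-entityid). The fields kept per type are an illustrative choice, not what the script stores, since the script preserves the complete claim structure:

```python
from typing import Any


def simplify_datavalue(datavalue: dict) -> Any:
    """Reduce one Wikibase datavalue to a compact, readable form."""
    kind = datavalue.get("type")
    value = datavalue.get("value")
    if kind == "wikibase-entityid":
        return value["id"]  # e.g. 'Q33506'
    if kind == "string":
        return value
    if kind == "time":
        return {"time": value["time"], "precision": value["precision"]}
    if kind == "quantity":
        return {"amount": value["amount"], "unit": value["unit"]}
    if kind == "monolingualtext":
        return {"language": value["language"], "text": value["text"]}
    if kind == "globecoordinate":
        return {"latitude": value["latitude"], "longitude": value["longitude"]}
    return value  # unknown types are passed through unchanged


print(simplify_datavalue({
    "type": "wikibase-entityid",
    "value": {"entity-type": "item", "numeric-id": 33506, "id": "Q33506"},
}))
```

Passing unknown types through unchanged keeps the helper safe if a new datavalue type turns up.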
Output Structure
hypernym:
  # Q-number (entity)
  - curated:  # Original manual curation
      label: Q3363945
      hypernym: [museum, park]
      type: [M, F]
      country: [France]
    wikidata:  # Full Wikidata data
      id: Q3363945
      type: item
      labels:
        en: "archaeological park"
        fr: "parc archéologique"
        # ... all languages
      descriptions: {...}
      aliases: {...}
      claims:  # ALL properties
        P31: [...]  # instance of
        P279: [...]  # subclass of
        # ... all properties
      sitelinks: {...}
    enrichment_status: success
    identifier: Q3363945
    enrichment_date: "2025-11-16T..."
  # P-number (property)
  - curated:
      label: P2671
      description: "Google Knowledge Graph ID"
    wikidata:
      id: P2671
      type: property
      labels:
        en: "Google Knowledge Graph ID"
      datatype: external-id
      # ... full property metadata
    enrichment_status: success
    identifier: P2671
    enrichment_date: "2025-11-16T..."
entity_list:
  # Category identifier (no Wikidata fetch)
  - curated:
      label: Category:Virtual_museums
    wikidata: null
    enrichment_status: category_or_list
    identifier: Category:Virtual_museums
    enrichment_note: "Wikipedia category or list page (no Wikidata entity)"
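A record with this shape could be assembled by a small helper like the one sketched here. The function name is hypothetical, and the rule that enrichment_date is only written when Wikidata data is present is an assumption read off the examples above, not confirmed behaviour of the script:

```python
from datetime import datetime, timezone
from typing import Optional

CATEGORY_NOTE = "Wikipedia category or list page (no Wikidata entity)"


def build_record(curated: dict, identifier: Optional[str], wikidata: Optional[dict],
                 status: str, note: Optional[str] = None,
                 now: Optional[datetime] = None) -> dict:
    """Combine manual curation and Wikidata data into one output record."""
    record = {
        "curated": curated,    # manual metadata, passed through untouched
        "wikidata": wikidata,  # None when nothing was fetched
        "enrichment_status": status,
        "identifier": identifier,
    }
    if wikidata is not None:
        # Assumption: only fetched records carry a timestamp.
        stamp = now or datetime.now(timezone.utc)
        record["enrichment_date"] = stamp.isoformat()
    if note:
        record["enrichment_note"] = note
    return record


skipped = build_record(
    {"label": "Category:Virtual_museums"}, "Category:Virtual_museums",
    None, "category_or_list", CATEGORY_NOTE,
)
print(skipped["enrichment_status"], skipped["wikidata"])
```

Keeping curated as a nested, untouched dict is what guarantees the manual metadata survives a refresh.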
Troubleshooting
If Enrichment Fails
- Check process status:
  ps aux | grep "[e]nrich_hyponyms_with_wikidata"
- Check log file:
  cat /tmp/enrichment_final.log | tail -100
- Verify register file:
  cat data/wikidata/GLAMORCUBEPSXHFN/.fetch_register.json | python3 -m json.tool
- Re-run with dry-run flag:
  python scripts/enrich_hyponyms_with_wikidata.py --dry-run
Common Issues
- YAML parse errors: Run YAML validation before enrichment
- Rate limiting: Script already includes 100ms delay (conservative)
- Network errors: Script continues on fetch failures, marks as enrichment_status: fetch_failed
- Memory issues: Processing ~2,000 entities should use <500MB RAM
- Integer labels: Script properly skips invalid integer labels (years) without crashing
Key Improvements This Session
1. Comprehensive Identifier Support ✅
- Before: Only handled Q-numbers
- After: Handles Q-numbers, P-numbers, Category identifiers, List identifiers
- Impact: Can now enrich properties (P-numbers) and properly skip Wikipedia categories
2. Better Error Handling ✅
- Before: Crashed on integer labels (years)
- After: Gracefully skips invalid labels with proper status codes
- Impact: Enrichment completes without manual intervention
3. Clear Documentation ✅
- Before: No identifier type documentation
- After: Comprehensive examples of all identifier types
- Impact: Future maintainers understand the codebase
Session End: Script running in background
Next Action: Wait for enrichment to complete (~2-3 minutes), then verify output
Current Time: 2025-11-16T12:45 (approx)
Expected Completion: 2025-11-16T12:48 (approx)