O (OFFICIAL_INSTITUTION) - Initial Wikidata Extraction Analysis
Date: 2025-11-12
Root Classes: Q22698 (government agency), Q7210356 (cultural heritage organization)
Property: P279+ (subclass of, transitive)
Query Method: Multiple SPARQL queries by language groups
Executive Summary
The extraction produced 7,804 unique classes with 30,287 terms across 16 languages for the OFFICIAL_INSTITUTION type. The type covers government heritage agencies, cultural ministries, and official cultural organizations.
Key Findings
✅ Strong multilingual coverage - 16 languages with balanced distribution
✅ Diverse institutional landscape - 7,804 unique organizational types
✅ Higher complexity than expected - Third-largest dataset by class count, after M (MUSEUM) and F (FEATURES)
⚠️ Q7210356 dominates - Cultural heritage org has far more subclasses than government agency
Extraction Statistics
Overall Metrics
| Metric | Count | Notes |
|---|---|---|
| Unique Classes | 7,804 | Third highest, after M (22,949) and F (7,861) |
| Total Terms | 30,287 | Higher than expected |
| Languages | 16 | Phase 1 coverage |
| Root Classes | 2 | Q22698, Q7210356 |
Query Breakdown
| Query | Languages | Results | Classes Covered |
|---|---|---|---|
| Q22698 English | en | 287 | Government agencies |
| Q7210356 English | en | 5,000 (limit) | Cultural heritage orgs |
| Western European | nl, de, fr | 5,000 (limit) | Multi-root |
| Romance | es, pt, it | 5,000 (limit) | Multi-root |
| Slavic | ru, pl, cs | 5,000 (limit) | Multi-root |
| CJK | zh, ja, ko | 5,000 (limit) | Multi-root |
| Middle East | ar, tr, fa | 5,000 (limit) | Multi-root |
Total Raw Results: 30,287 (after merging duplicates: 7,804 unique classes)
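The report does not reproduce the queries themselves. As a rough sketch of what one language-group query could look like, the hypothetical helper `build_label_query` below assembles a SPARQL string for the transitive `P279+` traversal with a label-language filter. The variable names `?class` and `?label` are assumptions; the `wd:`, `wdt:`, and `rdfs:` prefixes are predefined on the Wikidata Query Service.

```python
def build_label_query(roots, languages, limit=5000):
    """Build a SPARQL query for labels of all transitive subclasses of `roots`.

    Illustrative only: the report does not show the queries it actually ran.
    """
    root_values = " ".join(f"wd:{qid}" for qid in roots)
    lang_list = ", ".join(f'"{code}"' for code in languages)
    return (
        "SELECT DISTINCT ?class ?label WHERE {\n"
        f"  VALUES ?root {{ {root_values} }}\n"
        "  ?class wdt:P279+ ?root .\n"      # subclass of, transitive
        "  ?class rdfs:label ?label .\n"
        f"  FILTER(LANG(?label) IN ({lang_list}))\n"
        "}\n"
        f"LIMIT {limit}"
    )


# Western European group, both root classes (IDs as given in this report)
query = build_label_query(["Q22698", "Q7210356"], ["nl", "de", "fr"])
print(query)
```

`DISTINCT` keeps a class that is reachable from both roots from being returned twice for the same label.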
Language Distribution
Top 16 Languages (Complete Coverage)
| Rank | Language | Code | Terms | % of Total | Coverage Quality |
|---|---|---|---|---|---|
| 1 | English | en | 5,287 | 17.5% | ⭐⭐⭐⭐⭐ Excellent |
| 2 | Turkish | tr | 2,471 | 8.2% | ⭐⭐⭐⭐ Good |
| 3 | Russian | ru | 2,406 | 7.9% | ⭐⭐⭐⭐ Good |
| 4 | French | fr | 2,295 | 7.6% | ⭐⭐⭐⭐ Good |
| 5 | Spanish | es | 2,210 | 7.3% | ⭐⭐⭐⭐ Good |
| 6 | Chinese | zh | 2,063 | 6.8% | ⭐⭐⭐⭐ Good |
| 7 | Japanese | ja | 1,954 | 6.5% | ⭐⭐⭐⭐ Good |
| 8 | Dutch | nl | 1,950 | 6.4% | ⭐⭐⭐⭐ Good |
| 9 | Arabic | ar | 1,700 | 5.6% | ⭐⭐⭐ Fair |
| 10 | Italian | it | 1,496 | 4.9% | ⭐⭐⭐ Fair |
| 11 | German | de | 1,470 | 4.9% | ⭐⭐⭐ Fair |
| 12 | Portuguese | pt | 1,455 | 4.8% | ⭐⭐⭐ Fair |
| 13 | Polish | pl | 1,371 | 4.5% | ⭐⭐⭐ Fair |
| 14 | Korean | ko | 1,268 | 4.2% | ⭐⭐⭐ Fair |
| 15 | Persian | fa | 1,225 | 4.0% | ⭐⭐⭐ Fair |
| 16 | Czech | cs | 666 | 2.2% | ⭐⭐ Limited |
Language Family Balance
| Family | Languages | Terms | % of Total |
|---|---|---|---|
| Indo-European | 11 | 21,831 | 72.1% |
| - Romance | es, pt, fr, it | 7,456 | 24.6% |
| - Germanic | en, nl, de | 8,707 | 28.7% |
| - Slavic | ru, pl, cs | 4,443 | 14.7% |
| - Iranian | fa | 1,225 | 4.0% |
| East Asian | 3 | 5,285 | 17.5% |
| - Sino-Tibetan | zh | 2,063 | 6.8% |
| - Japonic | ja | 1,954 | 6.5% |
| - Koreanic | ko | 1,268 | 4.2% |
| Afro-Asiatic | ar | 1,700 | 5.6% |
| Turkic | tr | 2,471 | 8.2% |
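The family figures can be re-derived from the per-language term counts. The sketch below rolls the counts up by branch and expresses each sum as a share of the reported total; the dictionary names are illustrative, and the groupings simply mirror the table.

```python
# Per-language term counts, copied from the Language Distribution table
TERMS = {
    "en": 5287, "tr": 2471, "ru": 2406, "fr": 2295, "es": 2210, "zh": 2063,
    "ja": 1954, "nl": 1950, "ar": 1700, "it": 1496, "de": 1470, "pt": 1455,
    "pl": 1371, "ko": 1268, "fa": 1225, "cs": 666,
}
INDO_EUROPEAN = {
    "Romance": ["es", "pt", "fr", "it"],
    "Germanic": ["en", "nl", "de"],
    "Slavic": ["ru", "pl", "cs"],
    "Iranian": ["fa"],
}
REPORTED_TOTAL = 30287


def rollup(codes):
    """Sum the term counts for a list of language codes."""
    return sum(TERMS[code] for code in codes)


def share(count):
    """Percentage of the reported total, rounded to one decimal."""
    return round(100 * count / REPORTED_TOTAL, 1)


branch_totals = {name: rollup(codes) for name, codes in INDO_EUROPEAN.items()}
indo_european_total = sum(branch_totals.values())
listed_total = sum(TERMS.values())

for name, count in branch_totals.items():
    print(f"{name}: {count} ({share(count)}%)")
print(f"Indo-European: {indo_european_total} ({share(indo_european_total)}%)")
print(f"Sum of per-language counts: {listed_total}")
```

Run against the figures listed above, the per-language counts sum to 31,287 rather than the reported 30,287, so shares computed against the reported total add up to slightly more than 100%. The report does not indicate which figure is off.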
Root Class Analysis
Q22698 (Government Agency)
English-only query results: 287 classes
Characteristics:
- Narrower focus on government structures
- Likely includes: ministries, departments, agencies, commissions
- Lower subclass proliferation than Q7210356
Sample expected terms:
- Cultural ministry
- Heritage commission
- National heritage agency
- Provincial heritage service
- Federal arts council
Q7210356 (Cultural Heritage Organization)
English-only query results: 5,000 classes (limit reached)
Characteristics:
- Extremely broad and diverse
- Includes NGOs, foundations, societies, trusts
- High subclass proliferation
- Likely overlaps with other GLAMORCUBEPSXHF types
Observations:
- This root class is MUCH larger than government agency
- May need supplemental queries to capture all subclasses
- Cross-contamination with M (MUSEUM), L (LIBRARY), A (ARCHIVE) likely
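One way to run the supplemental queries mentioned above is to page through the result set with `ORDER BY`, `LIMIT`, and `OFFSET` until a page comes back short. The sketch below assumes a caller-supplied `run_query` function that takes a SPARQL string and returns a list of rows; that function, `fetch_all`, and the fake data are all hypothetical. Ordering a large result set may itself time out on a public endpoint, so treat this as a starting point rather than a tested recipe.

```python
import re


def fetch_all(run_query, base_query, page_size=5000, max_pages=100):
    """Collect every row of `base_query` by paging with LIMIT/OFFSET.

    `base_query` must not already contain ORDER BY, LIMIT, or OFFSET.
    """
    rows = []
    for page in range(max_pages):
        paged = (
            f"{base_query}\n"
            "ORDER BY ?class ?label\n"   # stable order so pages do not overlap
            f"LIMIT {page_size} OFFSET {page * page_size}"
        )
        batch = run_query(paged)
        rows.extend(batch)
        if len(batch) < page_size:       # short page means nothing is left
            break
    return rows


# Stand-in for a real endpoint: serves slices of a fixed 12-row list.
FAKE_ROWS = [f"row-{i}" for i in range(12)]


def fake_run_query(query):
    limit = int(re.search(r"LIMIT (\d+)", query).group(1))
    offset = int(re.search(r"OFFSET (\d+)", query).group(1))
    return FAKE_ROWS[offset:offset + limit]


all_rows = fetch_all(fake_run_query, "SELECT ?class ?label WHERE { ... }", page_size=5)
print(len(all_rows))
```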
Data Quality Assessment
Strengths ✅
- Excellent multilingual balance - No single language dominates (English only 17.5%)
- Strong European coverage - Romance + Germanic + Slavic = 68.0%
- Good East Asian representation - CJK languages = 17.5%
- Turkish prominence - 8.2% coverage (second after English)
Weaknesses ⚠️
- Missing African languages - No Swahili, Hausa, Yoruba, Amharic
- Limited South/Southeast Asian - Hindi, Indonesian, Vietnamese absent
- No Central Asian coverage - Uzbek, Kazakh, Turkmen missing
- Limited Middle East - Only Arabic, Persian, Turkish
Root Class Imbalance 🚨
Major finding: Q7210356 (cultural heritage org) is MUCH larger than Q22698 (gov agency)
- Q22698: 287 English results
- Q7210356: 5,000+ English terms (hit query limit)
Implications:
- O (OFFICIAL_INSTITUTION) may be too broad
- Consider splitting into O-GOV (government) vs. O-NGO (non-governmental)?
- Or refine root class selection to better match intended scope
Cross-Contamination Risk
Overlap with Other Classes
HIGH RISK of semantic overlap with:
- M (MUSEUM) - "heritage museum", "government museum"
- A (ARCHIVE) - "national archive", "heritage archive"
- L (LIBRARY) - "national library", "government library"
- S (COLLECTING_SOCIETY) - "heritage society", "preservation society"
Root cause: Q7210356 (cultural heritage organization) is a SUPERCLASS of many GLAM types
Mitigation strategies:
- Apply exclusion filters: Remove classes that are subclasses of Q33506 (museum), Q7075 (library), etc.
- Manual curation: Review top 100 classes for boundary violations
- Accept overlap: O includes government-run GLAMs as aggregation platforms
Recommended approach: Accept overlap but document it. Official institutions often operate museums/libraries as subsidiaries.
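Documenting the overlap requires knowing which classes also descend from a GLAM root. A minimal sketch, assuming a hypothetical `parents` mapping of direct P279 edges has already been fetched: it flags overlapping classes rather than deleting them. The class keys in the toy data are made-up placeholders; only Q33506 and Q7075 are taken from the report.

```python
EXCLUDED_ROOTS = {"Q33506", "Q7075"}  # museum, library (IDs as given in the report)


def ancestors(qid, parents):
    """Return every transitive superclass of `qid` using direct P279 edges."""
    seen = set()
    stack = list(parents.get(qid, ()))
    while stack:
        current = stack.pop()
        if current not in seen:          # guards against cycles in the hierarchy
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def split_by_overlap(classes, parents, excluded=EXCLUDED_ROOTS):
    """Partition classes into (kept, flagged) by descent from an excluded root."""
    kept, flagged = [], []
    for qid in classes:
        target = flagged if ancestors(qid, parents) & excluded else kept
        target.append(qid)
    return kept, flagged


# Toy hierarchy with placeholder keys, not real Wikidata items
parents = {
    "heritage_museum": ["Q33506", "Q7210356"],
    "heritage_commission": ["Q7210356"],
    "national_library": ["Q7075"],
    "state_library_branch": ["national_library"],
}
kept, flagged = split_by_overlap(
    ["heritage_museum", "heritage_commission", "state_library_branch"], parents
)
print(kept, flagged)
```

Keeping the flagged list alongside the kept list supports either policy: drop the flagged classes, or retain them with an overlap note.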
Comparison with Other Classes
| Class | Unique Classes | Total Terms | Languages | Status |
|---|---|---|---|---|
| G (GALLERY) | 464 | 10,482 | 45 | ✅ Complete |
| L (LIBRARY) | 1,178 | 19,318 | 36 | ✅ Complete |
| A (ARCHIVE) | 1,143 | 15,793 | 38 | ✅ Complete |
| M (MUSEUM) | 22,949 | 133,918 | 45 | ✅ Complete |
| H (HOLY_SITES) | 741 | 13,894 | 42 | ✅ Complete |
| F (FEATURES) | 7,861 | 82,614 | 45 | ✅ Complete |
| O (OFFICIAL) | 7,804 | 30,287 | 16 | 🔄 In Progress |
Key Insights
- Third-largest by class count (7,804, behind M's 22,949 and F's 7,861)
- Third-largest by term count (30,287, behind M's 133,918 and F's 82,614)
- LOWEST language coverage (16 vs 36-45 for others)
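The rankings can be checked directly from the comparison table. The snippet below is a small sketch with the table values hard-coded; `DATASETS` and `rank` are illustrative names, and the ranking covers only the seven classes listed here.

```python
# (unique classes, total terms, languages), copied from the comparison table
DATASETS = {
    "G": (464, 10482, 45),
    "L": (1178, 19318, 36),
    "A": (1143, 15793, 38),
    "M": (22949, 133918, 45),
    "H": (741, 13894, 42),
    "F": (7861, 82614, 45),
    "O": (7804, 30287, 16),
}
CLASSES, TERMS, LANGUAGES = 0, 1, 2


def rank(code, field):
    """1-based rank of `code` when datasets are sorted descending by `field`."""
    ordered = sorted(DATASETS, key=lambda c: DATASETS[c][field], reverse=True)
    return ordered.index(code) + 1


print("O rank by classes:", rank("O", CLASSES))
print("O rank by terms:", rank("O", TERMS))
print("Fewest languages:", min(DATASETS, key=lambda c: DATASETS[c][LANGUAGES]))
```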
Action required: Continue with Phase 1 language queries to reach 40+ languages
Next Steps
Immediate Actions
- Complete Phase 1 language coverage (target: 40+ languages)
  - Add: hi, vi, id, th, uk, sv, no, da, fi, el, he, etc.
  - Estimated: +15,000 terms, +25 languages
- Generate HTML visualization
  - Create searchable term browser
  - Language distribution charts
  - Class hierarchy explorer
- Root class validation
  - Review top 100 classes manually
  - Identify cross-contamination cases
  - Document boundary decisions
Phase 2 Considerations
- Refine root classes - Consider narrower scope or exclusion filters
- Web research - Government heritage websites in underrepresented regions
- African language enrichment - Target Swahili, Hausa, Amharic
- South Asian expansion - Hindi, Bengali, Tamil, Urdu
Data Export Formats
- JSON (merged dataset)
- CSV (flat term list)
- SKOS/RDF (semantic web)
- HTML (interactive browser)
- SQLite (queryable database)
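As an illustration of the CSV and SQLite exports, the sketch below flattens a merged dataset into `(class_id, language, term)` rows. The merged layout (`{class_id: {language: [terms]}}`), the column names, and the sample entries are assumptions; the report does not describe the schema of `hyponyms_merged.json`.

```python
import csv
import io
import sqlite3


def flatten(merged):
    """Yield (class_id, language, term) rows in a deterministic order."""
    for class_id in sorted(merged):
        for language in sorted(merged[class_id]):
            for term in merged[class_id][language]:
                yield (class_id, language, term)


def to_csv(merged):
    """Return the flat term list as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["class_id", "language", "term"])
    writer.writerows(flatten(merged))
    return buffer.getvalue()


def to_sqlite(merged, path=":memory:"):
    """Load the flat term list into a single `terms` table."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE terms (class_id TEXT, language TEXT, term TEXT)")
    conn.executemany("INSERT INTO terms VALUES (?, ?, ?)", flatten(merged))
    conn.commit()
    return conn


# Placeholder IDs and terms, for illustration only
sample = {
    "QX1": {"en": ["heritage commission"], "nl": ["erfgoedcommissie"]},
    "QX2": {"en": ["arts council"]},
}
csv_text = to_csv(sample)
conn = to_sqlite(sample)
row_count = conn.execute("SELECT COUNT(*) FROM terms").fetchone()[0]
print(csv_text)
print(row_count)
```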
Sample Terms (Top 20 by Language Coverage)
To be extracted from merged dataset
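Once the merged dataset is available, the sample could be selected with something like the following. This is only a sketch: it assumes the merged data has the shape `{class_id: {language: [terms]}}`, which the report does not confirm, and the IDs and terms are placeholders.

```python
def top_by_language_coverage(merged, n=20):
    """Return the `n` classes with labels in the most languages.

    Ties are broken by class ID so the output is reproducible.
    """
    ranked = sorted(merged.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(class_id, len(labels)) for class_id, labels in ranked[:n]]


# Placeholder data, for illustration only
sample = {
    "QX1": {"en": ["a"], "nl": ["b"], "fr": ["c"]},
    "QX2": {"en": ["d"]},
    "QX3": {"en": ["e"], "de": ["f"]},
}
top_two = top_by_language_coverage(sample, n=2)
print(top_two)
```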
Files Generated
| File | Size | Description |
|---|---|---|
| `q22698_en_raw.json` | 18.5 KB | Government agency (English) |
| `q7210356_en_raw.json` | 308.2 KB | Cultural heritage org (English) |
| `western_european_raw.json` | 306.4 KB | Dutch, German, French |
| `romance_raw.json` | 289.1 KB | Spanish, Portuguese, Italian |
| `slavic_raw.json` | 275.3 KB | Russian, Polish, Czech |
| `cjk_raw.json` | 365.7 KB | Chinese, Japanese, Korean |
| `middle_east_raw.json` | 294.5 KB | Arabic, Turkish, Persian |
| `hyponyms_merged.json` | 1,857.7 KB | Merged dataset (all languages) |
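The merge step that produces the combined file could look roughly like this. The sketch assumes each raw file holds a standard SPARQL JSON result whose bindings use the variable names `class` and `label`; those names, the output layout, and the sample IDs are assumptions rather than details taken from the report.

```python
from collections import defaultdict

ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def merge_results(result_sets):
    """Merge SPARQL JSON result sets into {class_id: {language: [terms]}}."""
    merged = defaultdict(lambda: defaultdict(list))
    for results in result_sets:
        for row in results["results"]["bindings"]:
            class_id = row["class"]["value"].rsplit("/", 1)[-1]
            language = row["label"].get("xml:lang", "und")
            term = row["label"]["value"]
            if term not in merged[class_id][language]:   # drop exact duplicates
                merged[class_id][language].append(term)
    return {class_id: dict(langs) for class_id, langs in merged.items()}


def binding(class_id, language, term):
    """Build one binding in SPARQL JSON results shape (test helper)."""
    return {
        "class": {"type": "uri", "value": ENTITY_PREFIX + class_id},
        "label": {"type": "literal", "xml:lang": language, "value": term},
    }


# Two toy result sets with placeholder IDs; the second repeats one row.
english = {"results": {"bindings": [
    binding("QX1", "en", "heritage commission"),
    binding("QX2", "en", "arts council"),
]}}
dutch = {"results": {"bindings": [
    binding("QX1", "nl", "erfgoedcommissie"),
    binding("QX1", "en", "heritage commission"),
]}}

merged = merge_results([english, dutch])
term_count = sum(len(terms) for langs in merged.values() for terms in langs.values())
print(len(merged), term_count)
```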
Technical Notes
Query Optimization
- Batch size: 5,000 results per query (avoid 50k limit)
- Rate limiting: 2-second delay between queries
- Timeout: 300 seconds per query
- Format: JSON (SPARQL Results JSON format)
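The settings above could be applied with a small runner like the one below. It is a sketch under assumptions: the report does not name its HTTP client, so the standard library is used here, and the `User-Agent` string is a placeholder. The endpoint URL is the public Wikidata Query Service.

```python
import json
import time
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def run_query(query, timeout=300):
    """Send one SPARQL query and return the parsed SPARQL JSON results."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(url, headers={
        "Accept": "application/sparql-results+json",
        "User-Agent": "hyponym-extraction-example/0.1 (placeholder contact)",
    })
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def run_batch(queries, delay=2.0, run=run_query, sleep=time.sleep):
    """Run queries in order, pausing `delay` seconds between consecutive calls."""
    results = []
    for index, query in enumerate(queries):
        if index:                 # no pause before the first query
            sleep(delay)
        results.append(run(query))
    return results
```

`run` and `sleep` are injectable so the pacing logic can be exercised without network access, for example `run_batch(queries, run=fake_run, sleep=recorded.append)`.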
Known Issues
- Q7210356 result limit - English query hit 5,000 limit, likely incomplete
- Language coverage gaps - Need 25+ more languages for Phase 1 target
- Root class overlap - May need exclusion filters for GLAM types
Status: 🔄 Phase 1 in progress - Language expansion needed
Next Update: After reaching 40+ languages
Estimated Completion: 2025-11-13