glam/data/wikidata/GLAMORCUBEPSXHFN/O/analysis/initial_extraction_2025-11-12.md
2025-11-19 23:25:22 +01:00

O (OFFICIAL_INSTITUTION) - Initial Wikidata Extraction Analysis

Date: 2025-11-12
Root Classes: Q22698 (government agency), Q7210356 (cultural heritage organization)
Property: P279+ (subclass of, transitive)
Query Method: Multiple SPARQL queries by language groups
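
The queries themselves are not reproduced in this report. As a minimal sketch of what one language-group query could look like, the helper below builds a SPARQL string that walks `P279+` from both root classes and keeps labels in the requested languages. The function name, variable names, and the `LIMIT` default are assumptions for illustration, not the extraction script's actual code.

```python
def build_hyponym_query(root_qids, languages, limit=5000):
    """Build a SPARQL query for transitive subclasses of the given root classes.

    Hypothetical helper: the structure mirrors the method described above
    (P279+ from each root, one query per language group).
    """
    roots = " ".join(f"wd:{qid}" for qid in root_qids)
    langs = ", ".join(f'"{code}"' for code in languages)
    return f"""
SELECT DISTINCT ?class ?label (LANG(?label) AS ?lang) WHERE {{
  VALUES ?root {{ {roots} }}
  ?class wdt:P279+ ?root .
  ?class rdfs:label ?label .
  FILTER(LANG(?label) IN ({langs}))
}}
LIMIT {limit}
""".strip()


# Example: the Western European group queried against both roots.
print(build_hyponym_query(["Q22698", "Q7210356"], ["nl", "de", "fr"]))
```

The `wd:`, `wdt:`, and `rdfs:` prefixes are predefined on the Wikidata Query Service, so the sketch omits `PREFIX` declarations.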


Executive Summary

The extraction yielded 7,804 unique classes with 30,287 terms across 16 languages for the OFFICIAL_INSTITUTION type, which covers government heritage agencies, cultural ministries, and official cultural organizations.

Key Findings

Strong multilingual coverage - 16 languages with balanced distribution
Diverse institutional landscape - 7,804 unique organizational types
Higher complexity than expected - Third-largest dataset by class count, after M (MUSEUM) and F (FEATURES)
⚠️ Q7210356 dominates - Cultural heritage org has far more subclasses than government agency


Extraction Statistics

Overall Metrics

| Metric | Count | Notes |
|---|---|---|
| Unique Classes | 7,804 | Third highest, after M (22,949) and F (7,861) |
| Total Terms | 30,287 | Higher than expected |
| Languages | 16 | Phase 1 coverage |
| Root Classes | 2 | Q22698, Q7210356 |

Query Breakdown

| Query | Languages | Results | Classes Covered |
|---|---|---|---|
| Q22698 English | en | 287 | Government agencies |
| Q7210356 English | en | 5,000 (limit) | Cultural heritage orgs |
| Western European | nl, de, fr | 5,000 (limit) | Multi-root |
| Romance | es, pt, it | 5,000 (limit) | Multi-root |
| Slavic | ru, pl, cs | 5,000 (limit) | Multi-root |
| CJK | zh, ja, ko | 5,000 (limit) | Multi-root |
| Middle East | ar, tr, fa | 5,000 (limit) | Multi-root |

Total Raw Results: 30,287 (after merging duplicates: 7,804 unique classes)
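
The merge step that reduces the raw rows to unique classes is not shown in this report. A minimal sketch of one way to do it, assuming each raw row is a `(class_id, lang, label)` tuple; the row shape, the function name, and the sample IDs and labels are all hypothetical.

```python
from collections import defaultdict


def merge_terms(rows):
    """Group (class_id, lang, label) rows by class and drop exact duplicates."""
    merged = defaultdict(set)
    for class_id, lang, label in rows:
        merged[class_id].add((lang, label))
    # Sort the terms so the merged output is deterministic.
    return {class_id: sorted(terms) for class_id, terms in merged.items()}


# Placeholder rows; the IDs are not real Wikidata items.
rows = [
    ("Q-demo-1", "en", "heritage agency"),
    ("Q-demo-1", "nl", "erfgoeddienst"),
    ("Q-demo-2", "en", "arts council"),
    ("Q-demo-1", "en", "heritage agency"),  # same row returned by a second query
]
merged = merge_terms(rows)
print(len(rows), "raw rows ->", len(merged), "unique classes")
```

Keying on the class ID is what lets the same class returned by several language-group queries collapse into one entry.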


Language Distribution

Top 16 Languages (Complete Coverage)

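| Rank | Language | Code | Terms | % of Total | Coverage Quality |
|---|---|---|---|---|---|
| 1 | English | en | 5,287 | 17.5% | Excellent |
| 2 | Turkish | tr | 2,471 | 8.2% | Good |
| 3 | Russian | ru | 2,406 | 7.9% | Good |
| 4 | French | fr | 2,295 | 7.6% | Good |
| 5 | Spanish | es | 2,210 | 7.3% | Good |
| 6 | Chinese | zh | 2,063 | 6.8% | Good |
| 7 | Japanese | ja | 1,954 | 6.5% | Good |
| 8 | Dutch | nl | 1,950 | 6.4% | Good |
| 9 | Arabic | ar | 1,700 | 5.6% | Fair |
| 10 | Italian | it | 1,496 | 4.9% | Fair |
| 11 | German | de | 1,470 | 4.9% | Fair |
| 12 | Portuguese | pt | 1,455 | 4.8% | Fair |
| 13 | Polish | pl | 1,371 | 4.5% | Fair |
| 14 | Korean | ko | 1,268 | 4.2% | Fair |
| 15 | Persian | fa | 1,225 | 4.0% | Fair |
| 16 | Czech | cs | 666 | 2.2% | Limited |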

Language Family Balance

| Family | Languages | Terms | % of Total |
|---|---|---|---|
| Indo-European | 11 | 21,831 | 72.1% |
| - Romance | es, pt, fr, it | 7,456 | 24.6% |
| - Germanic | en, nl, de | 8,707 | 28.7% |
| - Slavic | ru, pl, cs | 4,443 | 14.7% |
| - Iranian | fa | 1,225 | 4.0% |
| East Asian (regional grouping) | 3 | 5,285 | 17.5% |
| - Sino-Tibetan | zh | 2,063 | 6.8% |
| - Japonic | ja | 1,954 | 6.5% |
| - Koreanic | ko | 1,268 | 4.2% |
| Afro-Asiatic | 1 (ar) | 1,700 | 5.6% |
| Turkic | 1 (tr) | 2,471 | 8.2% |
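
The subtotals and percentages in this section can be recomputed from the per-language counts. The sketch below does that, using the reported 30,287 total as the denominator. One point it surfaces: the sixteen per-language counts as listed sum to 31,287, which is 1,000 more than the reported total, and this report does not contain enough information to say which figure is off. The variable and function names are illustrative.

```python
# Per-language term counts copied from the language table in this section.
TERMS = {
    "en": 5287, "tr": 2471, "ru": 2406, "fr": 2295, "es": 2210, "zh": 2063,
    "ja": 1954, "nl": 1950, "ar": 1700, "it": 1496, "de": 1470, "pt": 1455,
    "pl": 1371, "ko": 1268, "fa": 1225, "cs": 666,
}
REPORTED_TOTAL = 30287  # total terms stated in the Executive Summary

GROUPS = {
    "Romance": ["es", "pt", "fr", "it"],
    "Germanic": ["en", "nl", "de"],
    "Slavic": ["ru", "pl", "cs"],
    "Iranian": ["fa"],
    "East Asian": ["zh", "ja", "ko"],
    "Afro-Asiatic": ["ar"],
    "Turkic": ["tr"],
}
INDO_EUROPEAN = ["Romance", "Germanic", "Slavic", "Iranian"]


def share(count, total=REPORTED_TOTAL):
    """Percentage of the reported total, rounded to one decimal place."""
    return round(100 * count / total, 1)


group_terms = {name: sum(TERMS[c] for c in codes) for name, codes in GROUPS.items()}
indo_european = sum(group_terms[name] for name in INDO_EUROPEAN)
listed_total = sum(TERMS.values())

for name, count in group_terms.items():
    print(f"{name:<13} {count:>6,} {share(count):>5}%")
print(f"Indo-European {indo_european:>6,} {share(indo_european):>5}%")
print(f"Sum of per-language counts: {listed_total:,} (reported: {REPORTED_TOTAL:,})")
```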

Root Class Analysis

Q22698 (Government Agency)

English-only query results: 287 classes

Characteristics:

  • Narrower focus on government structures
  • Likely includes: ministries, departments, agencies, commissions
  • Lower subclass proliferation than Q7210356

Sample expected terms:

  • Cultural ministry
  • Heritage commission
  • National heritage agency
  • Provincial heritage service
  • Federal arts council

Q7210356 (Cultural Heritage Organization)

English-only query results: 5,000 classes (limit reached)

Characteristics:

  • Extremely broad and diverse
  • Includes NGOs, foundations, societies, trusts
  • High subclass proliferation
  • Likely overlaps with other GLAMORCUBEPSXHFN types

Observations:

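  • This root class is far larger than Q22698: its English query returned at least 5,000 classes, against 287 for government agency
  • Because the query hit the 5,000-result limit, supplemental queries are needed to capture the remaining subclasses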
  • Cross-contamination with M (MUSEUM), L (LIBRARY), A (ARCHIVE) likely

Data Quality Assessment

Strengths

  1. Excellent multilingual balance - No single language dominates (English only 17.5%)
  2. Strong European coverage - Romance + Germanic + Slavic = 68.0% (24.6% + 28.7% + 14.7%)
  3. Good East Asian representation - CJK languages = 17.5%
  4. Turkish prominence - 8.2% coverage (second after English)

Weaknesses ⚠️

  1. Missing African languages - No Swahili, Hausa, Yoruba, Amharic
  2. Limited South/Southeast Asian - Hindi, Indonesian, Vietnamese absent
  3. No Central Asian coverage - Uzbek, Kazakh, Turkmen missing
  4. Limited Middle East - Only Arabic, Persian, Turkish

Root Class Imbalance 🚨

Major finding: Q7210356 (cultural heritage org) is far larger than Q22698 (gov agency)

  • Q22698: 287 classes (English query)
  • Q7210356: at least 5,000 classes (English query hit the limit)

Implications:

  • O (OFFICIAL_INSTITUTION) may be too broad
  • Consider splitting into O-GOV (government) vs. O-NGO (non-governmental)?
  • Or refine root class selection to better match intended scope

Cross-Contamination Risk

Overlap with Other Classes

HIGH RISK of semantic overlap with:

  1. M (MUSEUM) - "heritage museum", "government museum"
  2. A (ARCHIVE) - "national archive", "heritage archive"
  3. L (LIBRARY) - "national library", "government library"
  4. S (COLLECTING_SOCIETY) - "heritage society", "preservation society"

Root cause: Q7210356 (cultural heritage organization) is a SUPERCLASS of many GLAM types

Mitigation strategies:

  1. Apply exclusion filters: Remove classes that are subclasses of Q33506 (museum), Q7075 (library), etc.
  2. Manual curation: Review top 100 classes for boundary violations
  3. Accept overlap: O includes government-run GLAMs as aggregation platforms

Recommended approach: Accept overlap but document it. Official institutions often operate museums/libraries as subsidiaries.
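
One way to "accept overlap but document it" is to tag each O class with the other type codes that also contain it, rather than removing it. The sketch below assumes the class IDs for each type are available as Python sets; the function name and the sample IDs are hypothetical placeholders.

```python
def tag_overlaps(o_classes, other_types):
    """Map each O class ID to the sorted type codes that also contain it.

    `other_types` maps a type code such as "M" or "L" to a set of class IDs.
    """
    return {
        class_id: sorted(
            code for code, members in other_types.items() if class_id in members
        )
        for class_id in o_classes
    }


# Placeholder IDs, not real Wikidata items.
o_classes = {"Q-demo-1", "Q-demo-2", "Q-demo-3"}
other_types = {
    "M": {"Q-demo-2"},
    "L": {"Q-demo-2", "Q-demo-3"},
    "A": set(),
}
overlaps = tag_overlaps(o_classes, other_types)
shared = {cid: codes for cid, codes in overlaps.items() if codes}
print(f"{len(shared)} of {len(o_classes)} O classes also appear in another type")
```

Keeping the tags alongside the data leaves the exclusion-filter option open later, since filtering then only means dropping classes whose tag list is non-empty.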


Comparison with Other Classes

| Class | Unique Classes | Total Terms | Languages | Status |
|---|---|---|---|---|
| G (GALLERY) | 464 | 10,482 | 45 | Complete |
| L (LIBRARY) | 1,178 | 19,318 | 36 | Complete |
| A (ARCHIVE) | 1,143 | 15,793 | 38 | Complete |
| M (MUSEUM) | 22,949 | 133,918 | 45 | Complete |
| H (HOLY_SITES) | 741 | 13,894 | 42 | Complete |
| F (FEATURES) | 7,861 | 82,614 | 45 | Complete |
| O (OFFICIAL) | 7,804 | 30,287 | 16 | 🔄 In Progress |

Key Insights

  • Third-largest by class count (7,804, after M's 22,949 and F's 7,861)
  • Third-largest by term count (30,287, after M's 133,918 and F's 82,614)
  • Lowest language coverage (16 vs 36-45 for others)

Action required: Continue with Phase 1 language queries to reach 40+ languages
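
The rankings can be recomputed directly from the comparison table. A small sketch, with the figures copied from that table and illustrative names:

```python
# (unique classes, total terms, languages) per type, from the comparison table.
STATS = {
    "G": (464, 10482, 45),
    "L": (1178, 19318, 36),
    "A": (1143, 15793, 38),
    "M": (22949, 133918, 45),
    "H": (741, 13894, 42),
    "F": (7861, 82614, 45),
    "O": (7804, 30287, 16),
}
FIELDS = {"classes": 0, "terms": 1, "languages": 2}


def rank(code, field):
    """1-based rank of `code` when types are sorted descending by `field`."""
    index = FIELDS[field]
    ordered = sorted(STATS, key=lambda c: STATS[c][index], reverse=True)
    return ordered.index(code) + 1


for field in FIELDS:
    print(f"O ranks {rank('O', field)} of {len(STATS)} by {field}")
```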


Next Steps

Immediate Actions

  1. Complete Phase 1 language coverage (target: 40+ languages)

    • Add: hi, vi, id, th, uk, sv, no, da, fi, el, he, etc.
    • Estimated: +15,000 terms, +25 languages
  2. Generate HTML visualization

    • Create searchable term browser
    • Language distribution charts
    • Class hierarchy explorer
  3. Root class validation

    • Review top 100 classes manually
    • Identify cross-contamination cases
    • Document boundary decisions

Phase 2 Considerations

  1. Refine root classes - Consider narrower scope or exclusion filters
  2. Web research - Government heritage websites in underrepresented regions
  3. African language enrichment - Target Swahili, Hausa, Amharic
  4. South Asian expansion - Hindi, Bengali, Tamil, Urdu

Data Export Formats

  • JSON (merged dataset)
  • CSV (flat term list)
  • SKOS/RDF (semantic web)
  • HTML (interactive browser)
  • SQLite (queryable database)
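
As an illustration of the flat CSV export, the sketch below writes one row per term. The schema of the merged JSON is not described in this report, so the shape assumed here (`{class_id: {lang: [labels]}}`), the column names, and the sample data are all hypothetical.

```python
import csv
import io


def to_csv(merged):
    """Flatten {class_id: {lang: [labels]}} into CSV text, one row per term."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["class_id", "lang", "label"])
    for class_id in sorted(merged):
        for lang in sorted(merged[class_id]):
            for label in merged[class_id][lang]:
                writer.writerow([class_id, lang, label])
    return buffer.getvalue()


# Placeholder data; the ID is not a real Wikidata item.
sample = {"Q-demo-1": {"en": ["heritage agency"], "nl": ["erfgoeddienst"]}}
print(to_csv(sample))
```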

Sample Terms (Top 20 by Language Coverage)

To be extracted from merged dataset


Files Generated

| File | Size | Description |
|---|---|---|
| q22698_en_raw.json | 18.5 KB | Government agency (English) |
| q7210356_en_raw.json | 308.2 KB | Cultural heritage org (English) |
| western_european_raw.json | 306.4 KB | Dutch, German, French |
| romance_raw.json | 289.1 KB | Spanish, Portuguese, Italian |
| slavic_raw.json | 275.3 KB | Russian, Polish, Czech |
| cjk_raw.json | 365.7 KB | Chinese, Japanese, Korean |
| middle_east_raw.json | 294.5 KB | Arabic, Turkish, Persian |
| hyponyms_merged.json | 1,857.7 KB | Merged dataset (all languages) |

Technical Notes

Query Optimization

  • Batch size: 5,000 results per query (avoid 50k limit)
  • Rate limiting: 2-second delay between queries
  • Timeout: 300 seconds per query
  • Format: JSON (SPARQL Results JSON format)
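
The settings above could be combined into a paginated fetch loop along the following lines. This is a sketch under assumptions, not the script used for this extraction: the function names and the User-Agent string are made up, and `LIMIT`/`OFFSET` paging only gives stable pages if the base query carries an `ORDER BY`, which is assumed here.

```python
import json
import time
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def run_query(query, timeout=300):
    """Send one SPARQL query and return its result bindings."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "glam-hyponym-extractor/0.1 (example)",  # placeholder
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


def fetch_all(base_query, run=run_query, batch_size=5000, delay=2.0, sleep=time.sleep):
    """Page through a query with LIMIT/OFFSET until a short page is returned."""
    rows = []
    offset = 0
    while True:
        page = run(f"{base_query}\nLIMIT {batch_size} OFFSET {offset}")
        rows.extend(page)
        if len(page) < batch_size:
            return rows  # last page reached
        offset += batch_size
        sleep(delay)  # rate limiting between consecutive queries
```

Passing `run` and `sleep` as parameters keeps the loop testable without network access, and paging past the first 5,000 rows is also one way to address the result-limit issue listed under Known Issues.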

Known Issues

  1. Query result limits - The Q7210356 English query and all five language-group queries returned exactly 5,000 results (the limit), so each is likely incomplete
  2. Language coverage gaps - Need 25+ more languages for Phase 1 target
  3. Root class overlap - May need exclusion filters for GLAM types

Status: 🔄 Phase 1 in progress - Language expansion needed
Next Update: After reaching 40+ languages
Estimated Completion: 2025-11-13