glam/data/wikidata/GLAMORCUBEPSXHFN/F/analysis/initial_extraction_2025-11-11.md

FEATURES (F) - Initial Extraction Analysis

Date: 2025-11-11 21:17:54
Status: COMPLETE (6 of 6 root classes extracted)

Extraction Summary

| Metric | Value |
|---|---|
| Total bindings | 53,389 |
| Unique classes (Q-numbers) | 26,498 |
| Unique languages | 45 |
| Total unique terms | 38,524 |
| Root classes covered | 6/6 (monument, sculpture, statue, memorial, landmark, cemetery) |

Root Class Distribution

| Feature Type | Root Q-Number | Unique Classes | Total Terms | Avg Terms/Class |
|---|---|---|---|---|
| CEMETERY | Q39614 | 208 | 1,437 | 6.9 |
| LANDMARK | Q7075 | 393 | 2,654 | 6.8 |
| MEMORIAL | Q5003624 | 750 | 5,609 | 7.5 |
| MONUMENT | Q4989906 | 1,421 | 10,393 | 7.3 |
| SCULPTURE | Q860861 | 24,962 | 29,907 | 1.2 |
| STATUE | Q179700 | 425 | 3,389 | 8.0 |
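
The averages and totals in this table can be re-derived with a few lines of Python. This is a minimal sketch that only re-uses the figures above; the names `ROOT_CLASSES` and `avg_terms_per_class` are illustrative and not part of the extraction pipeline.

```python
# (unique classes, total terms) per root, copied from the Root Class Distribution table.
ROOT_CLASSES = {
    "CEMETERY": (208, 1_437),
    "LANDMARK": (393, 2_654),
    "MEMORIAL": (750, 5_609),
    "MONUMENT": (1_421, 10_393),
    "SCULPTURE": (24_962, 29_907),
    "STATUE": (425, 3_389),
}


def avg_terms_per_class(classes: int, terms: int) -> float:
    """Average number of terms per class, rounded to one decimal as in the table."""
    return round(terms / classes, 1)


averages = {name: avg_terms_per_class(c, t) for name, (c, t) in ROOT_CLASSES.items()}
total_terms = sum(t for _, t in ROOT_CLASSES.values())
summed_classes = sum(c for c, _ in ROOT_CLASSES.values())

print(averages)
print(total_terms)     # 53389, matching the total bindings in the summary
print(summed_classes)  # 28159, higher than the 26,498 unique classes
```

The per-root class counts sum to more than the 26,498 unique classes. A likely explanation is that classes reachable from several roots are counted once per root, as the multi-valued "Feature Types" column further down suggests.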

Language Distribution

Top 25 Languages

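| Language | Terms | Percentage |
|---|---|---|
| en | 28,918 | 54.16% |
| de | 2,275 | 4.26% |
| fr | 1,928 | 3.61% |
| es | 1,842 | 3.45% |
| ja | 1,758 | 3.29% |
| nl | 1,313 | 2.46% |
| zh | 1,202 | 2.25% |
| it | 1,154 | 2.16% |
| ru | 1,118 | 2.09% |
| tr | 1,012 | 1.90% |
| pt | 886 | 1.66% |
| cs | 878 | 1.64% |
| ca | 817 | 1.53% |
| uk | 808 | 1.51% |
| ar | 787 | 1.47% |
| hu | 671 | 1.26% |
| sv | 656 | 1.23% |
| pl | 627 | 1.17% |
| ko | 597 | 1.12% |
| he | 531 | 0.99% |
| fi | 474 | 0.89% |
| da | 404 | 0.76% |
| el | 346 | 0.65% |
| fa | 338 | 0.63% |
| id | 260 | 0.49% |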

Language Balance

  • English: 28,918 terms (54.2%)
  • Non-English: 24,471 terms (45.8%)

Coverage Assessment

Geographic Coverage:

  • European: de, fr, es, nl, it, ru, pt, cs, ca, uk, sv, pl, el, ro, hu, da, no, fi, sr, bg, hr
  • Asian: ja, zh, ko, ar, tr, fa, he, hi, th, vi, id, ms, bn, ur, ta, te, pa, mr
  • ⚠️ African: sw, am, ha, yo, zu, af, so, ti, lg, ny (low representation)
  • ⚠️ Indigenous: Limited coverage (target for Phase 2 - Exa research)

Most Multilingual Classes (Top 20)

| Q-Number | Languages | English Label | Feature Types | Example Translations |
|---|---|---|---|---|
| Q173387 | 40 | grave | memorial, monument, statue | de:Grab; fr:tombe; es:sepultura |
| Q4989906 | 40 | monument | statue | de:Denkmal; fr:monument; es:monumento |
| Q162875 | 38 | mausoleum | memorial, monument | de:Mausoleum; fr:mausolée; es:mausoleo |
| Q381885 | 38 | tomb | memorial, monument, statue | de:Grab; fr:tombeau; es:tumba |
| Q80793 | 38 | sundial | monument | de:Sonnenuhr; fr:cadran solaire; es:reloj de sol |
| Q212805 | 37 | digital library | landmark | de:digitale Bibliothek; fr:bibliothèque numérique; |
| Q28564 | 37 | public library | landmark | de:öffentliche Bibliothek; fr:bibliothèque publiqu |
| Q20350 | 37 | moai | monument | de:Moai; fr:moaï; es:moái |
| Q22806 | 36 | national library | landmark | de:Nationalbibliothek; fr:bibliothèque nationale; |
| Q143912 | 35 | triumphal arch | monument, statue | de:Triumphbogen; fr:arc de triomphe; es:arco de tr |
| Q170980 | 35 | obelisk | monument, statue | de:Obelisk; fr:obélisque; es:obelisco |
| Q178743 | 35 | stele | monument, statue | de:Stele; fr:stèle; es:estela |
| Q856234 | 34 | academic library | landmark | de:Universitätsbibliothek; fr:bibliothèque univers |
| Q180927 | 34 | mastaba | memorial, monument | de:Mastaba; fr:mastaba; es:mastaba |
| Q10145 | 34 | milestone | memorial, monument | de:Meilenstein; fr:borne routière; es:miliario |
| Q5003624 | 34 | memorial | monument, statue | de:Gedenkstätte; fr:mémorial; es:monumento conmemo |
| Q172896 | 33 | catacombs | cemetery | de:Katakombe; fr:catacombes; es:catacumbas |
| Q200141 | 33 | necropolis | cemetery | de:Nekropole; fr:nécropole; es:necrópolis |
| Q1081138 | 33 | historic site | memorial, monument, sculpture | de:historische Stätte; fr:site historique; es:siti |
| Q193457 | 33 | mud volcano | monument | de:Schlammvulkan; fr:volcan de boue; es:volcán de |
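
A ranking like this one can be produced by counting the distinct languages seen per class. The sketch below assumes the extraction yields `(q_number, language)` pairs. The function name `rank_by_language_count` and the tie-break on Q-number are illustrative choices, not the author's method.

```python
from collections import defaultdict


def rank_by_language_count(pairs, top_n=20):
    """Return the top_n (q_number, language_count) tuples, most multilingual first."""
    languages_by_class = defaultdict(set)
    for q_number, language in pairs:
        languages_by_class[q_number].add(language)  # a set ignores repeated languages
    ranked = sorted(
        languages_by_class.items(),
        key=lambda item: (-len(item[1]), item[0]),  # ties broken by Q-number for determinism
    )
    return [(q_number, len(langs)) for q_number, langs in ranked[:top_n]]


# Toy input: Q20350 has two German terms, which count as one language.
toy_pairs = [
    ("Q173387", "de"), ("Q173387", "fr"), ("Q173387", "es"),
    ("Q20350", "de"), ("Q20350", "fr"), ("Q20350", "de"),
]
print(rank_by_language_count(toy_pairs))
```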

Feature Type Language Distribution

CEMETERY

Top 10 Languages:

  • en: 277 terms (19.3%)
  • es: 117 terms (8.1%)
  • de: 106 terms (7.4%)
  • fr: 86 terms (6.0%)
  • tr: 81 terms (5.6%)
  • ja: 73 terms (5.1%)
  • nl: 69 terms (4.8%)
  • it: 58 terms (4.0%)
  • uk: 56 terms (3.9%)
  • ru: 51 terms (3.5%)

LANDMARK

Top 10 Languages:

  • en: 471 terms (17.7%)
  • es: 189 terms (7.1%)
  • zh: 162 terms (6.1%)
  • fr: 158 terms (6.0%)
  • de: 154 terms (5.8%)
  • hu: 123 terms (4.6%)
  • ja: 118 terms (4.4%)
  • nl: 105 terms (4.0%)
  • cs: 102 terms (3.8%)
  • it: 96 terms (3.6%)

MEMORIAL

Top 10 Languages:

  • en: 924 terms (16.5%)
  • de: 436 terms (7.8%)
  • fr: 382 terms (6.8%)
  • ja: 316 terms (5.6%)
  • es: 307 terms (5.5%)
  • nl: 276 terms (4.9%)
  • ru: 220 terms (3.9%)
  • zh: 211 terms (3.8%)
  • it: 209 terms (3.7%)
  • tr: 190 terms (3.4%)

MONUMENT

Top 10 Languages:

  • en: 1,677 terms (16.1%)
  • de: 900 terms (8.7%)
  • fr: 684 terms (6.6%)
  • ja: 600 terms (5.8%)
  • es: 583 terms (5.6%)
  • nl: 515 terms (5.0%)
  • it: 417 terms (4.0%)
  • tr: 401 terms (3.9%)
  • ru: 399 terms (3.8%)
  • zh: 398 terms (3.8%)

SCULPTURE

Top 10 Languages:

  • en: 25,048 terms (83.8%)
  • ja: 468 terms (1.6%)
  • de: 428 terms (1.4%)
  • es: 418 terms (1.4%)
  • fr: 398 terms (1.3%)
  • pt: 357 terms (1.2%)
  • ar: 326 terms (1.1%)
  • zh: 258 terms (0.9%)
  • it: 227 terms (0.8%)
  • ru: 223 terms (0.7%)

STATUE

Top 10 Languages:

  • en: 521 terms (15.4%)
  • de: 251 terms (7.4%)
  • es: 228 terms (6.7%)
  • fr: 220 terms (6.5%)
  • ja: 183 terms (5.4%)
  • it: 147 terms (4.3%)
  • ru: 144 terms (4.2%)
  • nl: 140 terms (4.1%)
  • zh: 132 terms (3.9%)
  • ca: 98 terms (2.9%)
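
To see how much of the overall English share comes from sculpture alone, the per-type English counts above can be combined with the per-type term totals. This is a small sketch using only figures from this report; `EN_AND_TOTAL` and `share` are illustrative names.

```python
# (English terms, total terms) per feature type, taken from the lists and tables in this report.
EN_AND_TOTAL = {
    "CEMETERY": (277, 1_437),
    "LANDMARK": (471, 2_654),
    "MEMORIAL": (924, 5_609),
    "MONUMENT": (1_677, 10_393),
    "SCULPTURE": (25_048, 29_907),
    "STATUE": (521, 3_389),
}


def share(part: int, whole: int) -> float:
    """Percentage of `whole` made up by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


en_all = sum(en for en, _ in EN_AND_TOTAL.values())
terms_all = sum(total for _, total in EN_AND_TOTAL.values())
en_sculpture, terms_sculpture = EN_AND_TOTAL["SCULPTURE"]

overall = share(en_all, terms_all)
without_sculpture = share(en_all - en_sculpture, terms_all - terms_sculpture)

print(overall)            # 54.2
print(without_sculpture)  # 16.5
```

With sculpture excluded, the English share across the remaining five feature types is close to the 15-19% seen in each of their individual lists.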

Data Quality Notes

Strengths

  1. Comprehensive Coverage: All 6 root classes extracted (monument, sculpture, statue, memorial, landmark, cemetery)
  2. Large Scale: 26,498 unique classes across 45 languages
  3. Multilingual Depth: The 20 most multilingual classes have labels in 33-40 languages
  4. European/Asian Balance: Strong coverage across European and major Asian languages

Challenges

  1. English Dominance: 54.2% of terms are English, driven largely by the sculpture class, where English accounts for 83.8% of terms
  2. African Language Gap: Very low representation (sw, am, ha, yo, zu combined < 1%)
  3. Query Limit: The initial query hit the 50,000-binding limit, so statue had to be extracted with a supplemental query
  4. Sculpture Bias: The sculpture class dominates with 24,962 classes (94.2% of the 26,498 unique classes)

  • Phase 1 Complete: Wikidata extraction successful
  • 🔄 Phase 2 Target: Use Exa web research to enrich African and Indigenous language terms
  • 📊 Quality Check: Validate top multilingual classes for translation accuracy
  • 🔍 Depth Analysis: Investigate sculpture class depth distribution (may have deep/narrow branches)
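
One way to keep the 50,000-binding limit noted under Challenges from truncating a result is to issue one query per root class and flag any result that reaches the cap. The sketch below is hedged: the query text is a hypothetical reconstruction (the actual `.sparql` files are not shown here), and `build_hyponym_query` and `looks_truncated` are illustrative names. Only the root Q-numbers and the limit come from this report. In the Root Class Distribution table, the five roots other than statue sum to exactly 50,000 terms, which suggests the main result itself may have been cut off at the limit.

```python
BINDING_LIMIT = 50_000  # cap reported in the Challenges list

# Root Q-numbers from the Root Class Distribution table.
ROOTS = {
    "cemetery": "Q39614",
    "landmark": "Q7075",
    "memorial": "Q5003624",
    "monument": "Q4989906",
    "sculpture": "Q860861",
    "statue": "Q179700",
}


def build_hyponym_query(root_qid: str) -> str:
    """Hypothetical query: labels of every class under one root via P279 (subclass of)."""
    return (
        "SELECT ?class ?label WHERE {\n"
        f"  ?class wdt:P279* wd:{root_qid} .\n"
        "  ?class rdfs:label ?label .\n"
        "}"
    )


def looks_truncated(binding_count: int, limit: int = BINDING_LIMIT) -> bool:
    """A result that reaches the limit was probably cut off and should be split further."""
    return binding_count >= limit


queries = {name: build_hyponym_query(qid) for name, qid in ROOTS.items()}
print(queries["statue"])
print(looks_truncated(50_000))  # True
print(looks_truncated(3_389))   # False
```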

Next Steps

  1. Immediate: Move to next GLAMORCUBEPSXHFN class (O - OFFICIAL_INSTITUTION)
  2. Post-extraction: Create combined vocabulary analysis across all 15 classes
  3. Phase 2 Planning: Identify specific African/Indigenous language targets for web research

Extraction Method: Wikidata SPARQL Query Service
Query Files:

  • F/sparql/features_hyponyms.sparql (main query, 5 root classes)
  • F/sparql/features_statue_hyponyms.sparql (supplemental query)

Raw Data: F/sparql/hyponyms_raw.json
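
The summary metrics at the top of this report could be recomputed from the raw data along the following lines. This is a sketch under stated assumptions: that `hyponyms_raw.json` uses the standard SPARQL JSON results layout (`results.bindings`), that the query variables are named `class` and `label`, and that a "unique term" is a distinct (language, text) pair. None of these is confirmed by the report.

```python
import json
from pathlib import Path


def load_results(path: str) -> dict:
    """Read a SPARQL JSON results file, e.g. F/sparql/hyponyms_raw.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize(results: dict, class_var: str = "class", label_var: str = "label") -> dict:
    """Count bindings, classes, languages and terms (variable names are assumptions)."""
    bindings = results["results"]["bindings"]
    classes, languages, terms = set(), set(), set()
    for row in bindings:
        label = row[label_var]
        language = label.get("xml:lang", "und")  # "und" marks labels without a language tag
        classes.add(row[class_var]["value"].rsplit("/", 1)[-1])  # entity URI -> Q-number
        languages.add(language)
        terms.add((language, label["value"]))
    return {
        "total_bindings": len(bindings),
        "unique_classes": len(classes),
        "unique_languages": len(languages),
        "unique_terms": len(terms),
    }


def _row(qid: str, lang: str, text: str) -> dict:
    """Build one toy binding in the SPARQL JSON results shape."""
    return {
        "class": {"type": "uri", "value": f"http://www.wikidata.org/entity/{qid}"},
        "label": {"type": "literal", "xml:lang": lang, "value": text},
    }


# Toy data: grave (Q173387) and tomb (Q381885) share the German term "Grab".
sample = {
    "head": {"vars": ["class", "label"]},
    "results": {"bindings": [
        _row("Q173387", "de", "Grab"),
        _row("Q173387", "fr", "tombe"),
        _row("Q381885", "de", "Grab"),
        _row("Q381885", "fr", "tombeau"),
    ]},
}
print(summarize(sample))
```

In the toy data the shared German term makes the unique-term count lower than the binding count, the same kind of gap as between 38,524 unique terms and 53,389 bindings in the summary.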