glam/data/wikidata/GLAMORCUBEPSXHFN/R/analysis/initial_extraction_2025-11-12.md
2025-11-19 23:25:22 +01:00

R-Class (RESEARCH_CENTER) Initial Extraction Analysis

Date: 2025-11-12

Overview

Extracted Wikidata subclass hierarchies for RESEARCH_CENTER (R-class) from base classes Q31855 (research institute) and Q7310853 (documentation center).

Extraction Statistics

Quantitative Results

  • Unique Classes: 592
  • Total Terms (labels + altLabels): 3,724
  • Languages Covered: 119
  • Base Classes Queried: 2
    • Q31855: research institute
    • Q7310853: documentation center

Language Distribution (Top 10)

  1. English (en): 1,055 terms
  2. Spanish (es): 251 terms
  3. Japanese (ja): 231 terms
  4. German (de): 220 terms
  5. French (fr): 199 terms
  6. Slovenian (sl): 132 terms
  7. Chinese (zh): 118 terms
  8. Russian (ru): 110 terms
  9. Italian (it): 101 terms
  10. Korean (ko): 79 terms
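
A tally like the one above could be produced from the deduplicated records with a short script. This is only a minimal sketch: the record layout (a `terms` list of `{"lang", "value"}` objects per class) and the function name `language_distribution` are assumptions for illustration, not the documented schema of hyponyms_clean.json.

```python
from collections import Counter


def language_distribution(records, top_n=10):
    """Count terms per language code and return the top_n most common."""
    counts = Counter(
        term["lang"]
        for record in records
        for term in record["terms"]
    )
    return counts.most_common(top_n)


# Hypothetical sample in the assumed layout (not real extraction output).
sample = [
    {"qid": "Q31855", "terms": [
        {"lang": "en", "value": "research institute"},
        {"lang": "de", "value": "Forschungsinstitut"},
        {"lang": "en", "value": "research institution"},
    ]},
    {"qid": "Q7310853", "terms": [
        {"lang": "en", "value": "documentation center"},
        {"lang": "fr", "value": "centre de documentation"},
    ]},
]

top = language_distribution(sample, top_n=2)
print(top)
```

Counting labels and altLabels together matches the "Total Terms (labels + altLabels)" definition used in this report.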

Key Research Center Types

Major Categories Identified

  • Research institutes (general)
  • Think tanks / policy research centers
  • Documentation centers
  • Laboratory types:
    • University laboratories
    • Research laboratories
    • Medical research labs
    • Biological stations
    • Field stations
  • Academic research units:
    • University institutes
    • Research departments
    • Centers of excellence
  • Specialized research facilities:
    • Agricultural research stations
    • Technical research institutes
    • Nuclear research facilities
    • Environmental research centers

Sample Terms by Major Languages

English (en)

  • research institute
  • think tank
  • laboratory
  • documentation center
  • research center
  • field station
  • biological station
  • research facility
  • institute of technology
  • academic department

Spanish (es)

  • instituto de investigación
  • centro de investigación
  • laboratorio
  • centro de documentación
  • estación biológica
  • centro de estudios

German (de)

  • Forschungsinstitut
  • Forschungseinrichtung
  • Biologische Station
  • Dokumentationszentrum
  • Forschungszentrum
  • Forschungslabor

French (fr)

  • institut de recherche
  • centre de recherche
  • laboratoire de recherche
  • centre de documentation
  • station biologique

Japanese (ja)

  • 研究所 (kenkyūjo - research institute)
  • シンクタンク (think tank)
  • 研究センター (research center)
  • 生物実験所 (biological laboratory)

Semantic Breadth Analysis

Hierarchical Structure

The R-class hierarchy spans three broad tiers of abstraction:

  • Top-level: Generic research institutes
  • Mid-level: Specialized research facility types (labs, stations, centers)
  • Leaf-level: Institution-specific subtypes (university labs, government research facilities)

Cross-Domain Coverage

Research center types span multiple domains:

  • Natural Sciences: Biological stations, environmental research centers
  • Social Sciences: Think tanks, policy research institutes
  • Technology: Technical research institutes, R&D facilities
  • Medicine: Medical research institutes, clinical research centers
  • Agriculture: Agricultural experiment stations
  • Information Science: Documentation centers, digital humanities centers

Institutional Contexts

  • University-affiliated research units (institutes, centers, labs)
  • Government research agencies
  • Independent research organizations
  • Corporate R&D facilities
  • International research collaborations

Data Quality Observations

Strengths

  • Comprehensive coverage of research center typology
  • Strong multilingual representation (119 languages)
  • Rich alternative labels (3,724 total terms for 592 classes)
  • Good representation of specialized research facility types

Challenges

  • Overlapping definitions between "research institute", "research center", "laboratory"
  • Ambiguous boundaries between academic departments and research centers
  • Some institution-specific subclasses (e.g., specific university institutes) that may be too granular
  • Mixed levels of abstraction (e.g., "research institute" vs. "Max Planck Institute")

Disambiguation Needs

  • Laboratory vs. research institute vs. research center
  • Academic department vs. research unit
  • Documentation center vs. archive vs. library (overlap with other GLAMORCUBEPSXH classes)
  • Think tank vs. research institute vs. policy center

Comparison with Other Classes

Relative Metrics

  • R-class: 592 classes, 3,724 terms, 119 languages
  • L-class (Library): 364 classes, 2,513 terms, 108 languages
  • A-class (Archive): 223 classes, 1,794 terms, 91 languages
  • M-class (Museum): 789 classes, 6,172 terms, 135 languages

Observations:

  • R-class is moderately sized (larger than A/L, smaller than M)
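  • Term density is 6.3 terms/class, the lowest of the four classes (L: 6.9, A: 8.0, M: 7.8), so R-class terminology is somewhat sparser per class than its size suggests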
  • Excellent language coverage (119 languages, second only to M-class at 135)
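
The per-class term density can be checked directly from the metrics listed above. A minimal sketch, with the figures copied from this section (the dictionary layout and the helper name `term_density` are illustrative only):

```python
# Figures copied from the "Relative Metrics" list in this report.
metrics = {
    "R": {"classes": 592, "terms": 3724, "languages": 119},
    "L": {"classes": 364, "terms": 2513, "languages": 108},
    "A": {"classes": 223, "terms": 1794, "languages": 91},
    "M": {"classes": 789, "terms": 6172, "languages": 135},
}


def term_density(entry):
    """Average number of terms (labels + altLabels) per class."""
    return entry["terms"] / entry["classes"]


densities = {name: round(term_density(m), 1) for name, m in metrics.items()}

# Print from sparsest to densest terminology.
for name, density in sorted(densities.items(), key=lambda kv: kv[1]):
    print(f"{name}-class: {density} terms/class")
```

On these figures the R-class works out to about 6.3 terms per class, compared with 6.9 (L), 8.0 (A), and 7.8 (M).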

Extraction Methodology

SPARQL Query Design

SELECT DISTINCT ?class ?classLabel ?altLabel WHERE {
  VALUES ?base_class { wd:Q31855 wd:Q7310853 }
  ?class wdt:P279* ?base_class .
  OPTIONAL { ?class skos:altLabel ?altLabel . }
  SERVICE wikibase:label { 
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,nl,de,fr,es,pt,it,ja,zh,ru,ar,hi,ko,vi,th,id,pl,tr,sv,da,no,fi,cs,hu,ro,uk,bg,el,he".
  }
}
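
For reproducibility, the query could be assembled and submitted from a script. The sketch below is an assumption about how that might look, not the tooling actually used for this extraction: `build_query` and `run_query` are hypothetical helper names, and the User-Agent string is left to the caller. Only the query construction is exercised here, since `run_query` needs network access.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(base_qids, languages):
    """Build the subclass query for the given base classes and label languages."""
    values = " ".join(f"wd:{qid}" for qid in base_qids)
    lang_list = ",".join(languages)
    return f"""SELECT DISTINCT ?class ?classLabel ?altLabel WHERE {{
  VALUES ?base_class {{ {values} }}
  ?class wdt:P279* ?base_class .
  OPTIONAL {{ ?class skos:altLabel ?altLabel . }}
  SERVICE wikibase:label {{
    bd:serviceParam wikibase:language "{lang_list}".
  }}
}}"""


def run_query(query, user_agent):
    """Send the query to the endpoint and return parsed JSON (requires network)."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(url, headers={
        "User-Agent": user_agent,  # caller supplies a descriptive agent string
        "Accept": "application/sparql-results+json",
    })
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Shortened language list for illustration; the report's query uses a longer one.
query = build_query(["Q31855", "Q7310853"], ["[AUTO_LANGUAGE]", "en", "de"])
print(query)
```

Keeping the base classes as a parameter would let the same helper serve other classes by swapping the QID list.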

Deduplication Process

  • Input: 3,317 raw SPARQL results
  • Output: 592 unique classes (an 82% reduction from the raw row count)
  • Method: Deduplicated by Wikidata QID, consolidated labels and altLabels per QID
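
The consolidation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: it expects rows in the standard SPARQL JSON results layout (`results.bindings`), the function names `consolidate` and `reduction_rate` are hypothetical, and the sample rows are made up rather than taken from hyponyms_raw.json.

```python
from collections import defaultdict


def consolidate(bindings):
    """Group SPARQL JSON bindings by QID, collecting distinct labels and altLabels."""
    classes = defaultdict(lambda: {"labels": set(), "alt_labels": set()})
    for row in bindings:
        # Entity URIs end in the QID, e.g. http://www.wikidata.org/entity/Q31855
        qid = row["class"]["value"].rsplit("/", 1)[-1]
        entry = classes[qid]
        if "classLabel" in row:
            entry["labels"].add(row["classLabel"]["value"])
        if "altLabel" in row:  # unbound OPTIONAL variables are simply absent
            alt = row["altLabel"]
            entry["alt_labels"].add((alt.get("xml:lang", ""), alt["value"]))
    return dict(classes)


def reduction_rate(raw_count, unique_count):
    """Fraction of raw rows removed by deduplication."""
    return 1 - unique_count / raw_count


# Illustrative rows only (not real extraction output).
uri = "http://www.wikidata.org/entity/"
sample = [
    {"class": {"type": "uri", "value": uri + "Q31855"},
     "classLabel": {"type": "literal", "xml:lang": "en", "value": "research institute"},
     "altLabel": {"type": "literal", "xml:lang": "de", "value": "Forschungseinrichtung"}},
    {"class": {"type": "uri", "value": uri + "Q31855"},
     "classLabel": {"type": "literal", "xml:lang": "en", "value": "research institute"},
     "altLabel": {"type": "literal", "xml:lang": "fr", "value": "centre de recherche"}},
    {"class": {"type": "uri", "value": uri + "Q7310853"},
     "classLabel": {"type": "literal", "xml:lang": "en", "value": "documentation center"}},
]

result = consolidate(sample)
print(len(result), "unique classes")
print(f"{reduction_rate(3317, 592):.0%} reduction")
```

With the counts reported here (3,317 raw rows, 592 unique classes), the reduction works out to roughly 82%.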

Files Generated

  1. sparql/hyponyms_raw.json - Raw SPARQL results (3,317 records)
  2. sparql/hyponyms_clean.json - Deduplicated results (592 unique classes)
  3. analysis/stats.json - Statistical summary

Next Steps

Immediate Actions

  1. Extract subclass hierarchies from Wikidata (done)
  2. Deduplicate results (done)
  3. Generate initial analysis (done, this document)
  4. Create controlled vocabulary (top-level terms)
  5. Build multilingual term mappings
  6. Document disambiguation rules

Future Enhancements

  • Map R-class terms to existing research organization taxonomies (e.g., GRID, ROR)
  • Cross-reference with academic institutional hierarchies
  • Develop heuristics for distinguishing research centers from academic departments
  • Create geographic distribution analysis of research center types

Validation Notes

Quality Assurance

  • Verified base classes exist in Wikidata (Q31855, Q7310853)
  • Confirmed that the property path P279* (zero or more subclass-of hops, so the base classes themselves are included) captures the full hierarchy
  • Checked sample terms for semantic coherence
  • Validated language codes against ISO 639-1

Known Issues

  • Some institution-specific classes may be too granular for vocabulary purposes
  • Overlap with other GLAMORCUBEPSXH classes (especially L for documentation centers)
  • Ambiguous boundaries between research center subtypes

Conclusion

The R-class extraction captured a multilingual vocabulary for research centers: 592 unique classes and 3,724 terms across 119 languages. The taxonomy covers research facility types across natural sciences, social sciences, technology, medicine, agriculture, and information science. The key challenges are overlapping definitions and mixed abstraction levels. The dataset provides a foundation for building a controlled RESEARCH_CENTER vocabulary.


Extraction Date: 2025-11-12
SPARQL Endpoint: https://query.wikidata.org/sparql
Status: Initial extraction complete ✓