glam/data/wikidata/GLAMORCUBEPSXHFN/H/analysis/initial_extraction_2025-11-11.md
2025-11-19 23:25:22 +01:00


H (HOLY_SITES) Wikidata Extraction Analysis

Date: 2025-11-11
Query File: data/wikidata/GLAMORCUBEPSXH/H/sparql/holy_sites_hyponyms.sparql
Raw Data: data/wikidata/GLAMORCUBEPSXH/H/sparql/hyponyms_raw.json (5.0 MB)
Status: SUCCESS - Phase 1.13 Complete


Executive Summary

Successfully extracted 1,139 unique Wikidata classes representing holy sites and religious heritage institutions across 29 languages, yielding 3,852 unique multilingual terms.

Critical Finding: Keramat Validated

The user's example "keramat" (Malay/Indonesian sacred shrine) was successfully found in Wikidata, confirming the value of multilingual vocabulary extraction for improving NLP-based institution detection.


Extraction Statistics

Volume Metrics

| Metric | Count |
|---|---|
| Total entries | 9,462 |
| Unique Q-numbers (Wikidata classes) | 1,139 |
| Languages covered | 29 |
| Unique terms extracted | 3,852 |
| File size | 5.0 MB (JSON) |
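As a rough sketch of how metrics like these could be recomputed from the raw rows: the schema of hyponyms_raw.json is not shown in this report, so the keys `qid`, `lang`, and `label` below are assumptions, and counting a "unique term" as a distinct (language, label) pair is also an assumption.

```python
from typing import Iterable


def volume_metrics(rows: Iterable[dict]) -> dict:
    """Summarise extraction rows; the keys 'qid', 'lang', 'label' are assumed."""
    rows = list(rows)
    return {
        "total_entries": len(rows),
        "unique_qids": len({r["qid"] for r in rows}),
        "languages": len({r["lang"] for r in rows}),
        # Assumption: a term is unique per (language, label) pair.
        "unique_terms": len({(r["lang"], r["label"]) for r in rows}),
    }


# Tiny hand-made sample, not the real extraction output.
sample = [
    {"qid": "Q697295", "lang": "ms", "label": "tempat keramat"},
    {"qid": "Q697295", "lang": "ms", "label": "tempat keramat"},  # repeated row
    {"qid": "Q32815", "lang": "id", "label": "masjid"},
    {"qid": "Q32815", "lang": "ms", "label": "masjid"},
]
metrics = volume_metrics(sample)
```

Running the same function over the full result set would be one way to cross-check the totals in the table, provided the real field names are substituted.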

Class Type Distribution

Distribution of holy site subtypes:

| Class Type | Entry Count | Percentage |
|---|---|---|
| place_of_worship (Q1370598) | 5,344 | 56.5% |
| temple (Q44539) | 2,307 | 24.4% |
| church (Q16970) | 1,124 | 11.9% |
| monastery (Q44613) | 463 | 4.9% |
| synagogue (Q34627) | 117 | 1.2% |
| mosque (Q32815) | 107 | 1.1% |
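The percentages follow directly from the entry counts. A minimal check, using only the counts reported in this section:

```python
# Entry counts per root class, copied from the distribution above.
counts = {
    "place_of_worship": 5344,
    "temple": 2307,
    "church": 1124,
    "monastery": 463,
    "synagogue": 117,
    "mosque": 107,
}

total = sum(counts.values())  # should equal the reported total of 9,462 entries
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
```

The six counts sum to the reported total, so every entry is attributed to exactly one root class in this tally.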

Language Coverage Analysis

Top 15 Languages by Term Count

| Rank | Language | ISO Code | Term Count | Notable Terms |
|---|---|---|---|---|
| 1 | English | en | 1,044 | church, temple, mosque, synagogue |
| 2 | Turkish | tr | 534 | cami, kilise, sinagog, manastır |
| 3 | Japanese | ja | 425 | 神社, 寺, 教会, モスク |
| 4 | Chinese | zh | 330 | 天滿宮, 教堂, 清真寺, 寺廟 |
| 5 | Russian | ru | 283 | церковь, мечеть, синагога, монастырь |
| 6 | Hebrew | he | 166 | בית כנסת, כנסייה, מסגד |
| 7 | Indonesian | id | 149 | biara, masjid, Utaki |
| 8 | Arabic | ar | 144 | كاتدرائية, كنيس يهودي, مسجد |
| 9 | Korean | ko | 119 | 교회, 사원, 모스크 |
| 10 | Persian | fa | 95 | مسجد, کلیسا, کنیسه |
| 11 | Thai | th | 77 | โบสถ์, วัด, มัสยิด |
| 12 | Vietnamese | vi | 74 | chùa, nhà thờ, đền |
| 13 | Urdu | ur | 53 | مسجد, گرجا, معبد |
| 14 | Malay | ms | 52 | biara, masjid, tempat keramat |
| 15 | Bengali | bn | 52 | মসজিদ, গির্জা, মন্দির |

Language Coverage Assessment

Strong Coverage: Arabic (144), Indonesian (149), Malay (52), Chinese (330), Thai (77), Vietnamese (74)
⚠️ Moderate Coverage: Hindi (not in top 15, needs verification), Urdu (53), Bengali (52)
Missing Languages: Burmese (my), Khmer (km), Lao (lo), Tibetan (bo) - present in query but low/zero results


Keramat/Kramat Validation Results

SUCCESS: Keramat Found in Wikidata

Total keramat-related entries: 6 (the table below lists 4 distinct Q-numbers)

| Q-Number | Language | Label | Alt Label | Class Type | Notes |
|---|---|---|---|---|---|
| Q123139674 | en | kramat | - | place_of_worship | English variant spelling |
| Q1595289 | id | Gunung keramat | - | place_of_worship | "Sacred mountain" (Indonesian) |
| Q697295 | ms | tempat keramat | - | place_of_worship | "Sacred place" (Malay) |
| Q845945 | ms | keramat Syinto | - | place_of_worship | "Shinto shrine" (Malay) |

Key Findings:

  1. Validation Confirmed: Keramat/kramat exists in Wikidata as a subclass of "place of worship"
  2. Multilingual Coverage: Found in English (en), Indonesian (id), and Malay (ms)
  3. Semantic Variants: "Gunung keramat" (sacred mountain), "tempat keramat" (sacred place)
  4. Cross-Cultural Usage: "keramat Syinto" shows term used for Shinto shrines in Malay context

Impact: This supports the expectation that multilingual vocabulary extraction will improve detection of Southeast Asian religious heritage sites that English-centric NLP models would miss.
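A validation check of this kind can be sketched as a case-insensitive search over the extracted labels. The row layout (`qid`, `lang`, `label`) is an assumption, and the pattern below is illustrative rather than the check actually used for this report.

```python
import re

# Matches "keramat" and the variant spelling "kramat" as whole words.
KERAMAT = re.compile(r"\bke?ramat\b", re.IGNORECASE)


def keramat_rows(rows):
    """Return rows whose label contains keramat/kramat (assumed 'label' key)."""
    return [r for r in rows if KERAMAT.search(r["label"])]


# Labels taken from the validation table, plus two non-matching terms.
rows = [
    {"qid": "Q123139674", "lang": "en", "label": "kramat"},
    {"qid": "Q1595289", "lang": "id", "label": "Gunung keramat"},
    {"qid": "Q697295", "lang": "ms", "label": "tempat keramat"},
    {"qid": "Q845945", "lang": "ms", "label": "keramat Syinto"},
    {"qid": "Q32815", "lang": "ms", "label": "masjid"},
    {"qid": "Q44613", "lang": "id", "label": "biara"},
]
hits = keramat_rows(rows)
```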


Sample Terms by Priority Languages

Indonesian (id) - 149 terms

  • biara Buddha Tibet (Tibetan Buddhist monastery)
  • biara Kartusia (Carthusian monastery)
  • masjid negara (state mosque)
  • Gunung keramat (sacred mountain)
  • Utaki (Okinawan sacred site)

Malay (ms) - 52 terms

  • tempat keramat (sacred place)
  • keramat Syinto (Shinto shrine)
  • biara (monastery)
  • masjid (mosque)
  • Katedral (cathedral)

Arabic (ar) - 144 terms

  • كاتدرائية (cathedral)
  • كنيس يهودي (Jewish synagogue)
  • مسجد سابق (former mosque)
  • تكية خيرية (charitable tekke/dervish lodge)

Chinese (zh) - 330 terms

  • 天滿宮 (Tenmangu shrine)
  • 大教堂 (cathedral)
  • 清真寺 (mosque)
  • 寺廟 (temple)

Hindi (hi) - (not in top 15, needs verification)

  • धर्मसंघ (sangha/monastic community)
  • यहूदी मंदिर (Jewish temple)
  • चण्डी (Chandi shrine)
  • शक्ति पीठ (Shakti Pitha)

Thai (th) - 77 terms

  • โบสถ์ (church)
  • วัด (temple/monastery)
  • มัสยิด (mosque)
  • คุรุทวารา (gurdwara)

Vietnamese (vi) - 74 terms

  • chùa (pagoda/temple)
  • nhà thờ (church)
  • đền (shrine/temple)
  • miếu (shrine)

Query Performance

Execution Metadata

| Parameter | Value |
|---|---|
| Query file | holy_sites_hyponyms.sparql |
| Endpoint | https://query.wikidata.org/sparql |
| Timeout | 300 seconds (5 minutes) |
| Execution time | ~15-20 seconds (estimated) |
| Rate limit | 1 query/second (compliant) |
| Result limit | 50,000 (SPARQL LIMIT) |
| Actual results | 9,462 (18.9% of limit) |

Query Optimization

No timeout issues - Query completed well under 5-minute limit
No result truncation - 9,462 results << 50,000 limit
Good performance - Transitive closure (P279+) with 5-level depth executed efficiently
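The contents of holy_sites_hyponyms.sparql are not reproduced in this report. For orientation only, the sketch below builds a query of the general shape described here: a `wdt:P279+` subclass path from a root class, a label-language filter, and a `LIMIT`. The function name, variable names, and the language list are hypothetical. Note also that `P279+` on its own follows subclass chains of any length, so the 5-level depth cap mentioned above would need extra logic that is not shown.

```python
def build_hyponym_query(root_qid: str, languages: list[str], limit: int = 50_000) -> str:
    """Return a SPARQL query for labelled subclasses of root_qid (illustrative only)."""
    lang_list = ", ".join(f'"{code}"' for code in languages)
    return (
        "SELECT DISTINCT ?class ?label ?lang WHERE {\n"
        f"  ?class wdt:P279+ wd:{root_qid} .\n"   # one or more subclass-of hops
        "  ?class rdfs:label ?label .\n"
        "  BIND(LANG(?label) AS ?lang)\n"
        f"  FILTER(?lang IN ({lang_list}))\n"
        "}\n"
        f"LIMIT {limit}"
    )


# Example: subclasses of "place of worship" with Indonesian and Malay labels.
query = build_hyponym_query("Q1370598", ["id", "ms"])
```

The `wd:` and `wdt:` prefixes are predefined on the Wikidata Query Service, so the string can be submitted there without a prefix header.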


Data Quality Assessment

Completeness

| Aspect | Status | Notes |
|---|---|---|
| Q-number coverage | Good | 1,139 unique classes extracted |
| Multilingual labels | Good | 29 languages (target: 30+) |
| Alternative labels | ⚠️ Partial | skos:altLabel sparsely populated in Wikidata |
| Geographic metadata | ⚠️ Low | P2341 (indigenous to) rarely populated |
| Class type tagging | Complete | All entries tagged with root class |

Known Gaps

  1. Missing Languages: Burmese (my), Khmer (km), Lao (lo), Tibetan (bo)
    • Hypothesis: These languages may have low Wikidata label coverage
    • Solution: Use Exa web research (Phase 2) to supplement
  2. Sparse Alternative Labels: Only ~10% of entries have skos:altLabel
    • Impact: May miss regional spelling variations
    • Solution: Exa extraction + manual curation for priority terms
  3. Geographic Region Data: Very few entries have P2341 (indigenous to)
    • Impact: Cannot filter by region reliably
    • Solution: Add geographic context in Phase 3 (manual curation)
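The missing-language gap can be detected mechanically by comparing the languages requested in the query with those that actually returned terms. A minimal sketch, assuming rows with `lang` and `label` keys and a hypothetical `min_terms` threshold:

```python
from collections import Counter


def coverage_gaps(rows, target_langs, min_terms=1):
    """List target languages with fewer than min_terms distinct labels."""
    terms = {(r["lang"], r["label"]) for r in rows}
    per_lang = Counter(lang for lang, _ in terms)
    # Counter returns 0 for languages that never appear in the rows.
    return sorted(code for code in target_langs if per_lang[code] < min_terms)


# Hand-made sample: only Malay and Thai rows are present.
rows = [
    {"lang": "ms", "label": "tempat keramat"},
    {"lang": "ms", "label": "masjid"},
    {"lang": "th", "label": "วัด"},
]
gaps = coverage_gaps(rows, ["ms", "th", "my", "km", "lo", "bo"])
```

Raising `min_terms` would also flag languages with low but non-zero coverage, which may be useful when deciding what to supplement in Phase 2.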

Validation Against User Requirements

Original User Example: Keramat

User statement: "English-centric NLP fails to detect 'keramat' (Malay/Indonesian sacred shrines)"

Validation Result: CONFIRMED

  • Wikidata contains keramat as Q123139674, Q1595289, Q697295, Q845945
  • Found in English, Indonesian, and Malay labels
  • Variants include "kramat", "Gunung keramat", "tempat keramat", "keramat Syinto"

Expected Impact:

  • Before: English-centric NLP would miss institutions named "Keramat Kusu" or "Kramat Panjang"
  • After: Multilingual thesaurus enables detection of Southeast Asian religious heritage sites
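To make the before/after contrast concrete, here is a deliberately simple token lookup against a thesaurus. The thesaurus contents, the function name, and the use of "H" as the returned class code are illustrative assumptions. This is not the project's NLP pipeline.

```python
# Hypothetical thesaurus slice: normalised term -> institution class code.
THESAURUS = {"keramat": "H", "kramat": "H", "masjid": "H"}


def detect_class(name, thesaurus=THESAURUS):
    """Return the class code of the first known term in name, else None."""
    tokens = name.casefold().split()
    return next((thesaurus[t] for t in tokens if t in thesaurus), None)


results = {
    name: detect_class(name)
    for name in ["Keramat Kusu", "Kramat Panjang", "City Library"]
}
```

Whitespace tokenisation ignores punctuation and multi-word terms such as "tempat keramat", which a fuller matcher would need to handle.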

Next Steps

Immediate (Phase 1 Continuation)

  1. H (HOLY_SITES) extraction complete - Ready for Phase 3 (deduplication)
  2. Execute M (MUSEUM) query - Parallel track for high-volume class validation
  3. Execute remaining 12 classes - G, L, A, O, R, C, U, B, E, P, S, X

Phase 2 (Exa Web Research)

  1. Supplement missing languages: Query Exa for Burmese/Khmer/Lao/Tibetan holy site terms
  2. Enrich keramat: Find regional variants (e.g., "kramat" in Singapore, "makam keramat" in Malaysia)
  3. Geographic context: Add Southeast Asian region tags for cultural heritage sites

Phase 3 (Deduplication & Normalization)

  1. Merge duplicate Q-numbers: Same class appearing under multiple root classes
  2. Normalize labels: Lowercase, remove diacritics (optional), standardize spellings
  3. Cross-reference: Link Q-numbers to existing GLAM institution dataset
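The first two steps could look roughly like the sketch below. The row keys (`qid`, `lang`, `label`, `root`) and both function names are assumptions, and the second root class in the sample is invented purely to show a merge.

```python
import unicodedata


def normalize_label(label, strip_diacritics=False):
    """Lowercase a label; optionally drop combining marks."""
    text = unicodedata.normalize("NFC", label).casefold().strip()
    if strip_diacritics:
        # Caution: this also removes Thai tone marks and Devanagari signs,
        # so it is only safe for Latin-script languages.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        text = unicodedata.normalize("NFC", text)
    return text


def merge_duplicates(rows):
    """Collapse rows sharing (qid, lang, normalised label); collect root classes."""
    merged = {}
    for r in rows:
        key = (r["qid"], r["lang"], normalize_label(r["label"]))
        merged.setdefault(key, set()).add(r["root"])
    return merged


rows = [
    {"qid": "Q697295", "lang": "ms", "label": "tempat keramat", "root": "place_of_worship"},
    {"qid": "Q697295", "lang": "ms", "label": "Tempat keramat", "root": "temple"},  # hypothetical
]
merged = merge_duplicates(rows)
```

Keeping diacritic removal behind a flag matches the "(optional)" note above: `normalize_label("Nhà thờ", strip_diacritics=True)` yields `"nha tho"`, but applying the same step to Thai or Hindi labels would corrupt them.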

Phase 5 (LinkML Integration)

  1. Extend InstitutionTypeEnum: Add holy site subtypes from Wikidata
  2. Populate DutchHeritageCustodian: Map Dutch religious heritage institutions to H class
  3. Update extraction pipeline: Integrate multilingual thesaurus into NLP entity recognition

Files Generated

  1. Query file: data/wikidata/GLAMORCUBEPSXH/H/sparql/holy_sites_hyponyms.sparql
  2. Raw results: data/wikidata/GLAMORCUBEPSXH/H/sparql/hyponyms_raw.json (5.0 MB)
  3. Analysis report: data/wikidata/GLAMORCUBEPSXH/H/analysis/initial_extraction_2025-11-11.md (this file)

References

  • Planning Document: docs/plan/glamorcubepsxh_vocab/04-sparql-templates.md
  • Architecture: docs/plan/glamorcubepsxh_vocab/01-architecture.md
  • Wikidata Classes:
    • Q1370598 (place of worship)
    • Q44539 (temple)
    • Q16970 (church)
    • Q32815 (mosque)
    • Q34627 (synagogue)
    • Q44613 (monastery)
    • Q123139674 (kramat)
    • Q1595289 (Gunung keramat - sacred mountain)
    • Q697295 (tempat keramat - sacred place)
    • Q845945 (keramat Syinto - Shinto shrine)

Status: Phase 1.13 (H - HOLY_SITES Extraction) COMPLETE
Next Phase: Phase 1.4 (M - MUSEUM Extraction) or Phase 3 (Deduplication)