<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch13" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch13"><span aria-label="337" id="pg_337" role="doc-pagebreak"/>13</h1>
<h1 class="chapter-title"><b>Question Answering</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b> Querying and information retrieval (IR) are both excellent ways of accessing a knowledge graph (KG), but human beings often desire a more intuitive means of access, without necessarily sacrificing expressiveness (e.g., by limiting oneself to just keyword queries). Motivated by this need, question answering (QA), where the questions are posed in English or some other natural language, has emerged as a popular way of accessing the KG. However, as an application, QA has an importance beyond KGs, as it is necessary for the proper functioning of chatbots, personal assistants, and other artificial intelligence (AI) applications inspired by advances in Natural Language Processing (NLP). Historically, KGs have been instrumental in helping QA systems achieve state-of-the-art performance, although the dependence of the best QA systems on KGs has diminished in the last three years due to the advent of sophisticated language models. Nevertheless, KGs continue to be influential in designing good solutions for general-purpose (or even domain-specific) QA. Thus, when discussing QA in the context of KGs, two agendas arise: using KGs to support good QA performance and using QA to support easy access to KGs. These two agendas are more complementary than initial appearances suggest, despite their very different goals. In this chapter, we provide an overview of QA first from the perspective of using (or not using) KGs to deliver good QA performance, and then from the perspective of using QA to support intuitive KG access.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec13-1"/><b>13.1 Introduction</b></h2>
<p class="noindent">QA is a specialized area in the field of IR, primarily concerned with providing relevant answers in response to questions posed in <i>natural language</i>. Questions posed by users can be factoid questions, such as “Who is the current president of the United States?” but more complex types of questions, requiring aggregation or even more sophisticated and spontaneous inference, have recently become important as well.</p>
<p>QA can rightfully be considered as an evolution in the IR landscape, which historically has just been about document retrieval (namely, as discussed before, a search engine in this vein takes some keywords as input and returns the relevant ranked documents that contain these keywords, whether explicitly or in some semantic sense determined by using techniques like word embeddings). Certainly, these traditional IR systems do not return answers, and accordingly users are left to extract answers from the documents themselves. <span aria-label="338" id="pg_338" role="doc-pagebreak"/>However, an answer to a question is precisely what users are often looking for. Hence, the main objective of all QA systems is to retrieve answers to questions rather than full documents or best matching passages, as most IR systems currently do.</p>
<p>QA has been considered as an important research problem for at least two decades, as is evidenced by the Text Retrieval Conference<sup><a href="chapter_13.xhtml#fn1x13" id="fn1x13-bk">1</a></sup> (TREC) initiating a QA track back in 1999, which tested systems’ ability to retrieve short text snippets in response to factoid questions. Following TREC’s success, the workshops of both Cross-Language Evaluation Forum (CLEF) and NII Test Collection for IR Systems (NTCIR) started multilingual and cross-lingual QA tracks focusing on European and Asian languages, respectively. Generally, QA systems are classified as open-domain or closed-domain. Open-domain QA is far more challenging, as questions may come from any category and the system can rely only on a universal ontology and public information on the web. Crowdsourced, encyclopedic knowledge sources like Wikipedia and DBpedia are thus central to the performance of such systems. On the other hand, closed-domain QA deals with questions from a specific domain (e.g., music, weather forecasting). Such a domain-specific QA system usually involves heavy use of NLP subsystems and also benefits from the modeling, construction, and use of a domain-specific ontology and knowledge base (KB).</p>
<p>In this chapter, we will be considering two agendas and sets of perspectives around the problem of QA. The first agenda is not primarily concerned with QA itself, but with providing a more intuitive (for humans) means of accessing the KG, which may itself be supporting other applications such as structured analytics, business intelligence, or recommendation. As observed in previous chapters, expressive querying is not an easy skill for nontechnical subject matter experts to pick up (and is perhaps not the best use of their time). Key value–based systems, as well as NoSQL (or even just vanilla keyword-based IR), may help but lose expressiveness in the process. What actual users would like, in the most ideal situation, is to simply tell the system in <i>natural language</i> what they would like to know. Viewed this way, QA is just another way of querying the KG.</p>
<p>The second agenda treats QA as the main application to be solved and is inspired by the web community and the historically challenging enterprise of open-domain QA, which has become much more applicable and relevant due to the release of open data sets, the power of search engines, including semantic search capabilities such as the Google Knowledge Graph (GKG), and the rising influence of digital assistants such as Siri and Alexa. The second agenda is not concerned with <i>how</i> questions are answered, so long as they can be answered correctly. Technically, KGs are not necessary for the second agenda, although they were considered necessary for the best solutions until quite recently. With the advent of more sophisticated language models and representation learning, the need for KGs to yield good performance on QA tasks has become less apparent, although the matter is <span aria-label="339" id="pg_339" role="doc-pagebreak"/>not considered closed. In this chapter, we focus much more on the first agenda than the second, because the goal of this entire section is to describe ways to access the KG, which has many documented applications and use-cases beyond QA. However, because the two agendas share many common strands, especially in the way that they have evolved in the research literature, we spend some time in the next section describing how KGs have been used in some influential QA systems, as well as how they have been superseded more recently by neurally trained language models. With this background in place, we then turn our attention to QA as a means of accessing the KG itself (a problem more broadly known as <i>semantic question answering</i>, or SQA), on which much research has been conducted in the Semantic Web (SW) community in particular.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec13-2"/><b>13.2 Question Answering as a Stand-Alone Application</b></h2>
<p class="noindent">We just noted that, in the context of KGs, the applicability of QA arises in two senses. The first is simply as a means of accessing the KG, while the second is concerned with QA as a <i>stand-alone application</i>, regardless of whether a KG is even used for it. This second agenda, which we briefly describe in this section through the lens of both an early (though still quite recent) system and a state-of-the-art one, was originally very important for KG research. The reason was that KGs were believed to be useful for achieving state-of-the-art results on open QA, and even for domain-specific QA. It was widely believed that the best performance could be achieved only if an appropriate KG of entities and relationships could first be constructed over the domain-specific corpus.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec13-2-1"/><b>13.2.1 Learning from Conversational Dialogue: KnowBot</b></h3>
<p class="noindent">Because the primary use-case of QA systems has been in natural-language dialogue settings in which the system is interacting with users (often in a domain-specific scenario such as a chatbot encountered on a specific company’s website), an interesting possibility is to build a QA system that can learn about its domain from conversational dialogue and leverage KGs to yield good performance. In 2015, a system called KnowBot was presented with precisely this goal in mind. KnowBot learned to relate concepts in science questions to propositions in a fact corpus, store new concepts and relations in a KG, and use the KG to solve questions. An impressive contribution of KnowBot was that it was the first system that acquired knowledge for QA from open, natural-language dialogue without a fixed ontology or domain model that predetermined what users could say.</p>
<p>More specifically, KnowBot grows a KG of commonsense semantic relations in open, conversational dialogue, and uses task progress to drive natural-language understanding. It assumes that the user intends to provide one or more novel relations and uses constraints to disambiguate noisy relations. Because KnowBot is an open-domain dialogue system, it is different from relation extraction (RE) systems that rely on predetermined ontologies to determine valid relation types and arguments, such as Never-Ending Language Learning <span aria-label="340" id="pg_340" role="doc-pagebreak"/>(NELL). KnowBot is able to quickly bootstrap domain knowledge from users via dialogue-driven extraction, thereby producing effective relations without annotation or significant engineering effort. It also improves with each interaction, acquiring relations that are especially useful in the context of a particular task, and is able to embed these relations in an enriched dialogue context.</p>
<p>For example, KnowBot was tested on a science data set (called SciText) that is a corpus of unlabeled, true-false, natural-language sentences derived from science textbooks, study guides, and even Wikipedia Science. The QA task itself consisted of 107 science questions from the fourth-grade New York Regents Exam. While each question has four possible answers, the authors converted each of the four QA pairs into a true-false QA statement using pattern-based rules. The degree to which a SciText sentence supports a QA pair is the sentence’s <i>alignment score</i>.</p>
<p>While the alignment score depends on keyword overlap, SciText needs domain knowledge to answer the questions. KnowBot conducts dialogue about science questions and learns how concepts in each question relate to propositions in SciText. KnowBot presents users with a question, prompts them to choose and explain their answer, and extracts relations (and puts extracted relations in the KG) in order to increase its confidence in the user’s answer.</p>
<p><i>Concepts</i> are the nodes in KnowBot’s KG, but KnowBot concepts are defined a little differently than the ontological concepts we saw earlier. In KnowBot, a <i>concept keyword</i> is any nonstopword of at least three characters, and a <i>concept</i> is a set of concept keywords with a common root (e.g., {presiding, presided, presides}). The Porter algorithm is used for stemming. If the concept is acquired from a QA statement, it is called a <i>question concept</i>, while <i>support concepts</i> are acquired from SciText support sentences. As with other KGs, relations connect pairs of concepts and represent semantic correspondence.</p>
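<p>To make this definition concrete, the following is a minimal Python sketch of grouping an utterance’s keywords into concepts. It substitutes a crude suffix-stripping rule for the Porter stemmer, and the stopword list is a tiny illustrative stand-in, so the details here are our own assumptions rather than KnowBot’s actual implementation:</p>

```python
# Illustrative sketch of KnowBot-style concept extraction.
# A real implementation would use the Porter stemmer; crude_stem is a
# much rougher stand-in that strips a few common suffixes.

STOPWORDS = {"the", "and", "for", "that", "with", "are", "was"}  # tiny stand-in list

def crude_stem(word):
    """Rough stand-in for Porter stemming: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_concepts(utterance):
    """Group concept keywords (nonstopwords of >= 3 characters) by root."""
    concepts = {}
    for token in utterance.lower().split():
        token = token.strip(".,?!")
        if len(token) < 3 or token in STOPWORDS:
            continue
        concepts.setdefault(crude_stem(token), set()).add(token)
    return concepts

# "presided" and "presides" share the root "presid" and form one concept.
concepts = extract_concepts("The president presided and presides over meetings")
```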
<p>KnowBot builds KGs at three levels: per utterance, per dialogue, and globally (i.e., over all dialogue). An utterance-level KG (uKG) is a fully connected graph with nodes comprising all the concepts in an utterance. Because this likely contains much irrelevant correspondence, aggressive pruning is used to remove many edges, with remaining edges updating a dialogue-level KG (dKG). Pruning obeys two simple constraints: an edge can only relate a question concept to a support concept (alignment constraint), and edges cannot relate concepts whose keywords are adjacent in the utterance (adjacency constraint). Upon dialogue termination, the dKG updates the global KG (gKG), which stores relations acquired from all dialogue.</p>
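<p>The pruning step can be sketched as follows, assuming concepts are represented as stemmed roots and the utterance is given as an ordered list of those roots (the function and variable names are our own):</p>

```python
# Sketch of utterance-level KG (uKG) construction with KnowBot's two
# pruning constraints. Illustrative only; the data layout is assumed.
from itertools import combinations

def build_ukg(tokens, question_concepts, support_concepts):
    """Fully connect the utterance's concepts, then prune the edges."""
    edges = set()
    for (i, a), (j, b) in combinations(enumerate(tokens), 2):
        # Alignment constraint: an edge may only relate a question
        # concept to a support concept.
        aligns = ((a in question_concepts and b in support_concepts) or
                  (a in support_concepts and b in question_concepts))
        # Adjacency constraint: edges cannot relate concepts whose
        # keywords are adjacent in the utterance.
        adjacent = abs(i - j) == 1
        if aligns and not adjacent:
            edges.add(frozenset((a, b)))
    return edges  # surviving edges would then update the dKG
```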
<p>There are many technical details behind KnowBot that we do not cover here. However, the important point to remember is that each dialogue focuses on a single question; after each user turn, the system updates the dKG and rescores each of the four candidate answers using an alignment score formula (based on overlapping concepts between the QA statement and the supporting SciText sentence, as <span aria-label="341" id="pg_341" role="doc-pagebreak"/>well as the number of relations between the two sentences). The dialogue terminates once the user’s answer has the highest alignment score, implying that the user has successfully provided the missing knowledge required for answering the question.</p>
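<p>The chapter does not reproduce KnowBot’s exact alignment formula, so the per-turn rescoring loop can only be sketched with a surrogate score: concept overlap between the QA statement and the support sentence, plus the number of dKG relations linking them (the equal weighting of the two terms is our own assumption):</p>

```python
# Surrogate alignment score and termination test (illustrative only;
# not the published formula).

def alignment_score(qa_concepts, support_concepts, dkg_edges):
    """Concept overlap plus the number of relations linking the sentences."""
    overlap = len(qa_concepts & support_concepts)
    relations = sum(1 for e in dkg_edges
                    if (e & qa_concepts) and (e & support_concepts))
    return overlap + relations

def dialogue_done(user_answer, scores):
    """Terminate once the user's answer has the highest alignment score."""
    return scores[user_answer] == max(scores.values())
```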
<p>At the same time, the gKG, which includes relations learned from every KnowBot dialogue, results in redundancy that is used by KnowBot (based on the intuition that relations that recur across dialogue are more likely to be relevant to the original problem) to improve performance. For example, it ignores singleton relations yielded by a lone user utterance.</p>
<p>Hixon et al. (2015) did extensive evaluation of dialogue strategies (e.g., are users able to successfully complete the complex dialogue task in the absence of trained semantic parsers for natural-language understanding?) and used baselines such as interactive query expansion (IQE), which was the most reasonable competitive system when KnowBot was proposed. Metrics such as task completion (the proportion of dialogue that ends in agreement) were used, as well as dialogue length and acquisition rate (of the number of edges in the dKG at the end of each dialogue). With its best strategies, KnowBot was able to achieve more than 50 percent task completion, compared to less than 6 percent for IQE. It marked an impressive shift for open-domain QA.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec13-2-2"/><b>13.2.2 Bidirectional Encoder Representations from Transformers</b></h3>
<p class="noindent">Bidirectional Encoder Representations from Transformers (BERT) is a recently proposed language representation model from researchers at Google AI Language. BERT, as well as its successor, RoBERTa, has proved particularly successful at QA (although BERT can also be used for other tasks, such as language inference, without substantial task-specific architecture modifications), although neither has a strong KG dependence. Because it is considered to be at the frontier of QA research, if not state-of-the-art (because its successor now does better on the benchmarks), we describe it in some detail next.</p>
<p>To motivate BERT, we already noted in previous chapters how learning representations of words has been an active area of research for decades. Many of the classic techniques were nonneural, but more recently, neural methods like GloVe, skip-gram, and continuous bag-of-words (CBOW) models have proved dominant. Regardless of which representation is the best one, the fact remains that pretrained word embeddings are an integral part of modern NLP architectures, and often offer substantial improvements over embeddings created from scratch. Furthermore, neural representation learning has since been extended to address data units of coarser granularity, including sentences, paragraphs, and (as we saw in chapter 10) nodes and edges in KGs. At the same time, word representation learning itself has undergone significant improvements (work on this did not stop with word2vec or GloVe, despite the immense popularity of these approaches). ELMo, for example, was presented in 2018 by researchers from the Allen Institute for Artificial Intelligence and the University of Washington, and it was a new type of deep contextualized word representation that models complex characteristics of word use, as well as how these uses vary across <span aria-label="342" id="pg_342" role="doc-pagebreak"/>linguistic contexts. When published, ELMo advanced the state-of-the-art for several NLP tasks including QA, but BERT was able to advance it even further.<sup><a href="chapter_13.xhtml#fn2x13" id="fn2x13-bk">2</a></sup></p>
<p>BERT primarily relies on two steps: pretraining and fine-tuning. Pretraining is important in BERT because it marked a shift from previous models, such as those that used left-to-right or right-to-left language models to pretrain their systems. BERT instead uses two unsupervised tasks for pretraining. First, it uses a task called <i>masked language model</i> (traditionally known as the <i>Cloze</i> task, dating to the 1950s), wherein some percentage (in the paper, 15 percent of WordPiece tokens) of the input tokens are selected at random and masked, followed by the training of a deep bidirectional representation by trying to predict the masked tokens. Specifically, the hidden vectors corresponding to the masked tokens are fed into the output softmax layer (over the vocabulary), as with traditional language models. More details on this task, including the steps taken to address noise or robustness issues, are described in the original BERT paper by Devlin et al. (2018), which is also discussed in the section entitled “Bibliographic Notes,” at the end of this chapter. The second unsupervised task used for pretraining is <i>next sentence prediction</i> (NSP). NSP is important because it helps BERT perform extremely well on downstream applications such as QA and natural-language inference, both of which are based, not just on the goodness of a language model, but on understanding the relationship between two sentences (which is not directly captured by a language model). Once again, to ensure that results are robust and the model can be trained to deliver meaningful results, the authors make some important design decisions. For example, when choosing the sentences A and B for each pretraining example, the authors use the actual next sentence as B 50 percent of the time, while a random sentence from the corpus is used the other 50 percent. Finally, note that the data used for pretraining was the BooksCorpus, which contains 800 million words, and the English Wikipedia (ignoring lists, tables, and headers). Because of the second NSP pretraining task, a corpus such as the Billion Word Benchmark could not be used, because it contains sentences in a shuffled order.</p>
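<p>The construction of pretraining examples described above can be sketched as follows. The 15 percent masking rate and the 50/50 next-sentence split come from the text; the whitespace tokenization and the helper names are our own simplifications (real BERT operates on WordPiece tokens and applies additional replacement rules to masked positions):</p>

```python
# Sketch of BERT-style pretraining example construction (simplified).
import random

def mask_tokens(tokens, rate=0.15, rng=random):
    """Select ~15% of tokens at random and replace them with [MASK].

    Returns the masked sequence plus the positions (and original
    tokens) the model must predict.
    """
    masked, targets = list(tokens), {}
    for i in range(len(tokens)):
        if rng.random() < rate:
            targets[i] = tokens[i]
            masked[i] = "[MASK]"
    return masked, targets

def make_nsp_pair(sent_a, true_next, corpus, rng=random):
    """NSP: B is the actual next sentence 50% of the time, else random."""
    if rng.random() < 0.5:
        return sent_a, true_next, True          # labeled IsNext
    return sent_a, rng.choice(corpus), False    # labeled NotNext
```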
<p>Fine-tuning (which is task-specific and depends on the application for which BERT is being used) is the next important component, but it is relatively straightforward, as a self-attention mechanism in the transformer allows BERT to model most downstream tasks of choice, whether the task involves a single piece of text or text pairs. The idea is to feed the task-specific inputs and outputs into the architecture and fine-tune all the parameters of the model in an end-to-end fashion. Examples of inputs (assuming two sentences <i>S</i><sub>1</sub> and <i>S</i><sub>2</sub>) are paraphrased sentence pairs, hypothesis-premise (entailment task), and question-passage (for QA). For token-level tasks (e.g., sequence tagging, QA), the token representations of the inputs are fed into the output layer, while for classification tasks such as entailment or sentiment analysis, the representation for the special [CLS] token <span aria-label="343" id="pg_343" role="doc-pagebreak"/>(i.e., a <i>classification</i> token that is always the first token of every sequence and is meant for precisely this purpose) is fed into the output layer. Fine-tuning is much less expensive than pretraining.</p>
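<p>For illustration, the input packing just described can be sketched as follows; the [CLS] and [SEP] conventions follow the BERT paper, while the whitespace “tokenizer” is a placeholder for WordPiece:</p>

```python
# Sketch of packing a single text or a text pair into a BERT input.

def pack_pair(text_a, text_b=None):
    """Return (tokens, segment_ids) for a single text or a pair.

    [CLS] is always the first token; its final hidden state is used
    for classification tasks. Segment ids distinguish A from B.
    """
    tokens = ["[CLS]"] + text_a.split() + ["[SEP]"]
    segments = [0] * len(tokens)
    if text_b is not None:
        part_b = text_b.split() + ["[SEP]"]
        tokens += part_b
        segments += [1] * len(part_b)
    return tokens, segments

# Question-passage pair for QA: question gets segment 0, passage 1.
tokens, segments = pack_pair("who wrote hamlet", "shakespeare wrote hamlet")
```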
<p>Concerning experimental success, on the Stanford Question Answering Dataset (SQuAD v1.1), which is a collection of 100,000 crowdsourced Q/A pairs, BERT was able to outperform the top leaderboard system by 1.5 points on the F1-score when used in an ensemble, and 1.3 points as a single system. With SQuAD 2.0, which extended the 1.1 version by allowing for the null possibility that no short answer exists in the provided paragraph, there was a 5.1 percent F1-score improvement over the next best system. It also brought BERT to within 6.4 percent of human-level performance on the task.</p>
<p>However, as we mentioned previously, the results were far more impressive than just outperforming the leaderboard winner (at the time) on this one task, as the authors of BERT showed that it could achieve state-of-the-art performance on 11 NLP tasks. Another example of this is when BERT was evaluated on the General Language Understanding Evaluation (GLUE) benchmark, which comprises diverse natural-language understanding tasks. BERT was fine-tuned on GLUE, and the best BERT model was found to outperform the next best non-BERT system (OpenAI GPT<sup><a href="chapter_13.xhtml#fn3x13" id="fn3x13-bk">3</a></sup>) on GLUE by more than 7 percent (on average, across all the diverse tasks included in GLUE). On the task that GLUE is best known for, Multi-Genre Natural Language Inference (MNLI), there is a 4.6 percent absolute accuracy improvement. On the official leaderboard (which does not make the test set available), there was a more than 7 percent difference between the two systems at this time, further validating BERT’s superior performance on this benchmark. On another sentence-pair completion benchmark called Situations with Adversarial Generations (SWAG; the task in SWAG is, given a sentence, to choose the most plausible <i>continuation</i> among four sentence choices provided to the system), BERT was able to outperform OpenAI GPT by 8.3 percent accuracy and even outperform an expert human (with human performance measured with 100 samples) by about 1.3 percent. This result, while impressive on the surface, exposes a worrisome factor about many of these evaluations that has also been raised with other such tests (including those that rely on some variant of the Turing test). The evaluation suggests either that machines have solved the natural-language understanding problem (which defies what we actually observe when these systems are deployed in real-world settings, or are asked questions that ordinary humans would find simple but that models like BERT can still be tricked by), or that the benchmark is missing some critical element (or is subject to some kind of bias); this raises the question of whether it is <i>truly</i> measuring open-domain QA in its full scope. Because of the recency of BERT, we have not seen much evidence of these benchmark (or evaluation) limitations by way of rigorous and replicable research, but a smattering of papers studying such issues has been slowly <span aria-label="344" id="pg_344" role="doc-pagebreak"/>emerging. We anticipate that “diagnosing” such language models and benchmarks (and discussing the full scope of their strengths and weaknesses) will itself become a hotbed of research by the time this book is published.</p>
<p class="TNI-H3"><b>13.2.2.1 Subsequent Advancements</b> Although BERT is quite recent, and its integration into the Google search engine was only announced in 2019, successors purporting to improve its performance even further have already started percolating into the research community. RoBERTa is a good example of a recently proposed successor. In studying the original BERT system, Liu et al. (2019) found that it was significantly undertrained and that, when trained properly, it could match or exceed the performance of every model published after it. They proposed an improved recipe for training BERT models, called RoBERTa. The modifications were reported to be simple, including longer training times with bigger batches and over more data, removing the next sentence prediction objective, training on longer sequences, and dynamically changing the masking pattern applied to training data. The authors also collected and presented a new (and larger) data set of comparable size to other privately used data sets in order to better control for the effects of training set size.</p>
<p>On both GLUE and SQuAD, the modifications led to significant performance improvements, including a score of 88.5 on GLUE (matching the 88.4 reported by another recent work in 2019) and matching state-of-the-art results on SQuAD and RACE. The best models in all of these cases were post-BERT and were somewhat different in their architectural choices. The implication for a while was that BERT’s design or foundational elements may not be the best after all in the race to achieve ever-increasing performance on QA tasks. RoBERTa’s results, however, seem to reinforce that with simple modifications in the training procedure, the fundamentals of BERT still make it reign supreme over (or at <span aria-label="345" id="pg_345" role="doc-pagebreak"/>least equal to) these other more complex post-BERT models. These results also highlight the importance of training and design choices that can get overlooked when designing or training a new architecture. For the sake of illustration, we provide an overview of RoBERTa and some other post-BERT models (in comparison to the original BERT-large model) in <a href="chapter_13.xhtml#tab13-1" id="rtab13-1">table 13.1</a> in terms of data and performance.</p>
<div class="table">
<p class="TT"><a id="tab13-1"/><span class="FIGN"><a href="#rtab13-1">Table 13.1</a>:</span> <span class="FIG">Comparative overview of post-BERT models, including RoBERTa.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>System</b></p></th>
<th class="TCH"><p class="TB"><b>Performance</b></p></th>
<th class="TCH"><p class="TB"><b>Data</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB">BERT (bidirectional transformer with MLM and NSP)</p></td>
<td class="TB"><p class="TB">Outperforms state-of-the-art in 2018</p></td>
<td class="TB"><p class="TB">16 GB of BERT data (BooksCorpus + Wikipedia); 3.3 billion words</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">RoBERTa (BERT without NSP)</p></td>
<td class="TB"><p class="TB">2–20% improvement over BERT</p></td>
<td class="TB"><p class="TB">160 GB (16 GB of BERT data + 144 GB of additional data)</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">DistilBERT</p></td>
<td class="TB"><p class="TB">3% degradation from BERT</p></td>
<td class="TB"><p class="TB">16 GB of BERT data</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">XLNet</p></td>
<td class="TB"><p class="TB">2–15% improvement over BERT</p></td>
<td class="TB"><p class="TB">Base model used 16 GB of BERT data, while the large model used 113 GB (16 GB of BERT data + 97 GB of additional data); about 33 billion words</p></td>
</tr>
</tbody>
</table>
</figure>
</div>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec13-2-3"/><b>13.2.3 Necessity of Knowledge Graphs</b></h3>
<p class="noindent">While systems like KnowBot depended on building and growing a KG to support open-domain QA, BERT showed that it may be possible to acquire knowledge and answer questions without relying on either growing or using an explicit KG. More recently, there has been some evidence that language models could be used as a substitute for a KG, as the model itself seems to be capturing some of the knowledge that was traditionally thought to be the necessary domain of KGs. In one recent paper out of Facebook AI Research and University College London, for example, Lewis et al. (2019) showed that in addition to learning linguistic knowledge, language models like BERT (that have been trained using recent advanced neural methods and practices) can answer queries that rely on relational knowledge. The authors argue that language models have many advantages over structured KBs in that they require no schema engineering, allow practitioners to query about open classes of relations (recall that Open IE had been a difficult problem to solve; if valid, the use of language models may preclude practitioners from having to build KGs using Open IE techniques for answering questions), and are easy to extend or retrain as more data becomes available because they are unsupervised, in that human labeling is not required. An impressive aspect of the authors’ work is that they do a rigorous comparison, using a language model analysis probe, to answer interesting comparative questions such as: Do language models store relational knowledge, and if so, how much? Without any kind of fine-tuning, how does the performance of such language models compare to methods that automatically extract symbolic KGs from text corpora?</p>
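<p>The probing idea can be illustrated with a small sketch: each KG triple is rendered as a cloze sentence whose masked slot the language model must fill. The relation templates below are hypothetical examples, not the ones used in the paper:</p>

```python
# Sketch of cloze-style probing of a language model for relational
# knowledge. Templates are invented for illustration.

TEMPLATES = {
    "born_in": "{subj} was born in [MASK].",
    "capital_of": "[MASK] is the capital of {subj}.",
}

def triple_to_probe(subj, relation):
    """Render a (subject, relation, ?) query as a cloze sentence."""
    return TEMPLATES[relation].format(subj=subj)

# The model's top-ranked fillers for [MASK] are then compared against
# the gold object of the triple.
probe = triple_to_probe("Dante", "born_in")
```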
<p>The authors of the work described here concluded that the largest BERT model (BERT-large) was able to capture accurate relational knowledge comparable to that of an off-the-shelf RE system, as well as an oracle-based entity linking system. Surprisingly, factual knowledge could be recovered quite well from the pretrained models, but one caveat was that the performance on many-to-many relations was poor. On open-domain QA, BERT-large was able to achieve 57.1 percent precision@10 compared to 63.5 percent for a KG that was constructed using task-specific RE.</p>
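<p>The precision@10 figures above can be read as the fraction of queries whose gold answer appears among a model’s ten highest-ranked predictions, which a short sketch makes precise (this is a simplification of the paper’s exact protocol):</p>

```python
# Sketch of precision@k over a batch of queries (simplified reading).

def precision_at_k(ranked_predictions, gold_answers, k=10):
    """Fraction of queries whose gold answer is in the top-k predictions."""
    hits = sum(1 for preds, gold in zip(ranked_predictions, gold_answers)
               if gold in preds[:k])
    return hits / len(gold_answers)

# One of the two queries below has its gold answer in the top k -> 0.5.
score = precision_at_k([["paris", "rome"], ["berlin"]], ["paris", "vienna"])
```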
<p>While this work has been among the more influential in studying the phenomenon noted here, several points should be borne in mind before being rid of KG construction altogether. First, the obvious caveat is that a large quantity of text data is required to train a language model with the kind of power that BERT-large possesses. While BERT-large and its successors may be enough for embedding general-text corpora like Wikipedia, it is not always the case (and may be rare, in fact) that such large quantities of text are available for <span aria-label="346" id="pg_346" role="doc-pagebreak"/>domain-specific applications. Second, there is still a performance difference, even in the results noted here, between BERT-large’s performance and the RE. We could argue that this is because the latter system was trained in a task-specific way, but even if we were willing to give BERT-large the same advantage, how should we incorporate training data (when available) into a language model? This is a harder question for which there is no good answer. Another caveat is that RE itself is also improving, so it is not remaining stagnant while BERT and other models like it continue to improve.</p>
<p>While these are some challenges that stand in the way of replacing KGs with language models altogether (even for the specific application of complex QA), the paper described here shows that the gap is closing and that, at the very least, any QA system that does rely on a KG should use such language models as baselines. There was a time, not very long ago, when simply entertaining the question of whether language models could rival KGs as a store of some <i>kinds</i> of relational information would not have been viewed as very promising (recall that the exercise conducted in this paper could not show convincing, or even good, results for many-many relations, which may suggest that it is not capable of answering questions that are complex or that require some kind of nontrivial reasoning and are far too subtle for a model like BERT to currently capture). But as the power of language models and their performance in NLP tasks continue to become more impressive, the question of whether KGs are necessary anymore for open-domain QA is looking more favorable for language models than for painstakingly constructed (and identified) KGs. More research on this matter is likely forthcoming.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec13-3"/><b>13.3 Question Answering as Knowledge Graph Querying</b></h2>
<p class="noindent">QA for querying a KG (also known as SQA) was motivated by the desire to make KGs easier to access, without requiring mastery of a formal language (e.g., SPARQL) or knowledge of the ontologies that were used for KG construction (KGC) in the first place. Over the years, because SQA has been a popular research area, many systems have been developed and proposed, many of which have considerable overlap with one another and with earlier research in this domain. Despite this influx of ideas, much work still remains to be done, and human-level performance is far from being achieved. Nevertheless, just like so much of the rest of this book, common trends and themes have become manifest, indicating the maturity of some of the core conceptual ideas. We present some of these key ideas in this section, with brief guidance on systems implementing them. A broader overview of related work is provided in the “Bibliographic Notes” section for the interested reader.</p>
<p>Most SQA systems follow a two-stage approach: in the first stage, a query analyzer attempts to break the original question into a structured format that is more amenable to retrieval from KGs, while in the second stage, actual retrieval takes place. Considerable research has gone into refining both stages. A cursory view of the problem may suggest that the second stage is less of a problem, as it amounts to query execution and <span aria-label="347" id="pg_347" role="doc-pagebreak"/>could draw on some of the techniques and infrastructures described in chapter 12. At this time, there are a number of commercial (and even freely available) products that can execute queries efficiently over most KGs under a reasonable set of constraints. However, a deeper analysis of SQA, as well as the study of actual QA systems and the challenges they face, makes even the retrieval step nontrivial. In some of the more advanced systems, the query analyzer and retrieval may even be interlinked or iterative (e.g., the analyzer may yield an initial query, but based on the outputs of retrieval, it may decide to refine the query even further or come up with a whole new set of queries). At some point, a set of answers is compiled and returned to the user, possibly after postprocessing steps such as aggregation, if necessary.</p>
<p>One other point to note is that, while the SW community favors SPARQL as the structured query language into which natural-language questions are reformulated, it is not a strictly necessary feature of SQA. Questions could be reformulated into an Elasticsearch boolean tree query, or even as a combination of keyword queries. The primary observation is that, regardless of the actual language or syntax employed, some kind of structure is necessary to answer the kinds of questions posed to SQA systems, and mere keyword-based retrieval is unlikely to be very useful.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec13-3-1"/><b>13.3.1 Challenges and Solutions</b></h3>
<p class="noindent">There are many challenges associated with building robust and expressive SQA systems, some of which include the <i>lexical gap</i> (an NLP problem that arises from the fact that the same intent or meaning can be expressed in a variety of ways); <i>ambiguity</i> (another dominant NLP problem that arises because the same phrase, word, or even sentence could have different meanings based on syntax, semantics, or overall context); <i>multilingualism</i> (which arises only in more international contexts); the ability to answer <i>complex queries</i>; the ability to answer queries when the <i>requisite knowledge</i> needed for answering those queries is <i>distributed</i>; the ability to answer questions of a difficult nature (e.g., <i>procedural, temporal</i>, or <i>spatial</i> questions); and the ability for certain classes of SQA solutions (such as template-based solutions) to generalize in a sufficiently robust way when the underlying data set, ontology, or even question type changes. Almost always, there is a trade-off between expressiveness and correctness; that is, systems that are able to handle very expressive or difficult queries will, on average, return more incorrect results (or, in the case of ranking-based approaches, rankings with fewer or lower-ranked relevant answers) than systems that are far less expressive but yield higher-quality answers for the questions that they have been designed to take as input.</p>
<p>Rather than provide a systems-level description of existing SQA solutions, we discuss here how the community has dealt with some of these challenges, using examples of representative systems where applicable.</p>
<p class="TNI-H3"><span aria-label="348" id="pg_348" role="doc-pagebreak"/><b>13.3.1.1 Addressing the Lexical Gap</b> The lexical gap in SQA arises primarily because the labels in the KG that is being queried differ from the ones used in the question. For example, the question might ask “Who was JFK’s dad?,” for which the KG may contain the answer, but under the more formal property <i>“fatherOf.”</i> In other words, the KG and the question refer to similar concepts but use different terminology. In a recently conducted survey studying how SQA systems handle the lexical gap, a variety of NLP techniques were found to have been employed by the systems, including incorporating similarity features like Levenshtein distance (accounting for possible misspellings), applying preprocessing tasks like stemming and lemmatization, using WordNet or other resources for supplementing words with their synonyms, and even using pattern libraries (e.g., the pattern “X sampled Y from the buffet” could be used as an equivalency pattern for “X tasted Y”).</p>
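<p>A minimal sketch of this kind of lexical-gap bridging: normalize camelCase KG property labels, expand the question phrase with a toy synonym table standing in for WordNet, and pick the property with the smallest Levenshtein distance. The property labels and synonym entries here are invented for illustration.</p>

```python
# Hypothetical sketch: bridging the lexical gap with label normalization,
# a toy synonym table (standing in for WordNet), and Levenshtein distance.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

SYNONYMS = {"dad": ["father"], "mom": ["mother"]}  # invented stand-in table

def match_property(phrase, kg_properties):
    """Map a question phrase to the closest KG property label."""
    candidates = [phrase] + SYNONYMS.get(phrase, [])
    def distance(prop):
        # split camelCase labels like "fatherOf" into "father of"
        norm = "".join(" " + c.lower() if c.isupper() else c for c in prop).strip()
        return min(levenshtein(c, norm) for c in candidates)
    return min(kg_properties, key=distance)

print(match_property("dad", ["fatherOf", "birthPlace", "spouse"]))  # fatherOf
```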
<p>Considering each of these in some more detail, recall that string normalization and other similarity features have already been encountered in a variety of KG settings, including instance matching (IM). Common examples are Jaro-Winkler, a similarity measure that accounts for transpositions; phonetic similarities like Soundex (particularly useful for strings representing names); and even n-grams. It must be noted, however, that such similarity functions are not cheap and can prove prohibitively expensive, even for a single question executed over a sufficiently large KG. One option is to use <i>fuzzy</i> implementations or approximations. For example, Apache Lucene (which we studied earlier in this book, in chapter 11) offers a Levenshtein automaton that is more efficient than exact Levenshtein computation. Other approximations rely on mathematical properties like the triangle inequality to prune large sets of candidates so that fewer strings in the KG have to be compared (using the expensive matching function) to the label in the question.</p>
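<p>The triangle-inequality pruning can be sketched as follows: if the distance from every KG label to a fixed pivot string is precomputed offline, then a label <i>x</i> can be skipped for query <i>q</i> whenever |d(q, pivot) − d(pivot, x)| already exceeds the match threshold, because any metric satisfies d(q, x) ≥ |d(q, pivot) − d(pivot, x)|. The labels and pivot below are invented for illustration.</p>

```python
# Sketch of metric-based pruning: precompute d(pivot, label) offline; at query
# time, skip any label x with |d(q, pivot) - d(pivot, x)| > threshold, since
# d(q, x) >= |d(q, pivot) - d(pivot, x)| for any metric d. Labels are invented.

from functools import lru_cache

def edit_distance(a, b):
    """The 'expensive' exact metric we are trying to call less often."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1,
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))
    return d(len(a), len(b))

def fuzzy_lookup(query, labels, pivot, threshold):
    dq = edit_distance(query, pivot)
    hits, comparisons = [], 0
    for label, dp in labels:             # (label, precomputed d(pivot, label))
        if abs(dq - dp) > threshold:     # pruned: no exact computation needed
            continue
        comparisons += 1
        if edit_distance(query, label) <= threshold:
            hits.append(label)
    return hits, comparisons

pivot = "father"
index = [(l, edit_distance(pivot, l))
         for l in ["fatherOf", "mother", "birthPlace", "almaMater", "fatherhood"]]
print(fuzzy_lookup("fatherr", index, pivot, threshold=2))
```

<p>Here only two of the five labels are ever compared exactly; the rest are ruled out by the precomputed pivot distances alone.</p>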
<p>Synonyms and other such features can be incorporated into an <i>automatic query expansion</i> (AQE) framework. AQE is much more common and useful in standard IR, but it is less commonly used in SQA because it tends to lead to noisy results. However, it can be used as a supplement, especially if the original query does not yield any results. At least one empirical work has shown that when different lexical (based on synonym, hypernym, and hyponym relationships) and semantic [based on the use of Resource Description Framework (RDF) and RDF Schema (RDFS) constraints such as subclass and superclass relationships] features are used in an AQE framework, with machine learning used to weight their influence appropriately, SQA performance on a benchmark improved compared to direct matching.</p>
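<p>A minimal sketch of weighted query expansion in this spirit, with an invented expansion table and hand-set weights standing in for the WordNet/RDFS features and machine-learned weights described above:</p>

```python
# Hypothetical AQE sketch: the expansion table and weights are invented; a
# real system would derive expansions from WordNet/RDFS and learn the weights.

EXPANSIONS = {
    "film": [("movie", "synonym"), ("work", "hypernym")],
    "actor": [("performer", "hypernym")],
}
WEIGHTS = {"original": 1.0, "synonym": 0.8, "hypernym": 0.4}

def expand(query_terms):
    """Return {term: weight} for the original terms plus their expansions."""
    weighted = {t: WEIGHTS["original"] for t in query_terms}
    for t in query_terms:
        for term, feature in EXPANSIONS.get(t, []):
            weighted[term] = max(weighted.get(term, 0.0), WEIGHTS[feature])
    return weighted

def score(document_terms, weighted_query):
    """Weighted overlap between a document and the expanded query."""
    return sum(w for t, w in weighted_query.items() if t in document_terms)

q = expand(["film", "actor"])
print(score({"movie", "performer"}, q))  # matched only through expansions
```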
<p>Speaking of standard IR, it is worthwhile asking whether we can model the problem of querying a KG as that of querying a document corpus, while still retrieving entities and KG resources as answers. One approach, which has been considered by at least one set of authors, is to convert each RDF resource in the KG into a key-value document, with keys being text attributes constructed over useful information sets such as title, property values, and <span aria-label="349" id="pg_349" role="doc-pagebreak"/>RDF inlinks. Once converted, ordinary document retrieval algorithms can be applied, with classic approaches being tf-idf and BM25. While it may seem simplistic to model the problem in this way, the approach has proved surprisingly successful empirically, and efficient as well. One reason for this empirical success could be the IR community’s progress in improving its algorithms by incorporating ever more sophisticated features (such as word embeddings) and learning methodologies (such as learning to rank). However, one severe limitation often arises in the types of questions that such systems can answer, as the questions are often treated as bags of words, just like the generated documents. And if the KG does not contain many text attributes, the approach faces challenges of its own.</p>
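<p>The document-conversion idea can be sketched in a few lines: each resource becomes a bag of words drawn from its literals, and ordinary tf-idf ranking is applied over the resulting pseudo-documents. The triples and URIs below are illustrative.</p>

```python
# Sketch: convert RDF resources to bag-of-words pseudo-documents over their
# literals, then rank with plain tf-idf. Triples and URIs are illustrative.

import math
from collections import Counter

triples = [
    ("ex:JFK", "rdfs:label", "John F. Kennedy"),
    ("ex:JFK", "ex:spouse", "Jacqueline Kennedy"),
    ("ex:Dallas", "rdfs:label", "Dallas Texas city"),
]

# 1. One pseudo-document per subject resource, built from its object literals.
docs = {}
for s, p, o in triples:
    docs.setdefault(s, []).extend(o.lower().split())

# 2. Standard tf-idf scoring over the pseudo-documents.
N = len(docs)
df = Counter(t for terms in docs.values() for t in set(terms))

def tfidf_score(query, terms):
    tf = Counter(terms)
    return sum(tf[t] * math.log(N / df[t])
               for t in query.lower().split() if t in tf)

ranked = sorted(docs, key=lambda r: tfidf_score("kennedy", docs[r]), reverse=True)
print(ranked[0])  # the resource whose literals best match "kennedy"
```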
<p>Pattern libraries are useful primarily because the feature classes considered previously apply mainly to individuals, not properties. Properties tend to suffer worse ambiguity than individuals because (in addition to label ambiguity) a property could syntactically be expressed as a noun or verb phrase that may not even be a continuous substring. For example, “Martha sang the national anthem as a duet with Christine” expresses a musical-collaboration relationship between Martha and Christine, but the specific words expressing the property are not consecutive. Positioning of arguments is yet another problem, as the first entity mentioned is not necessarily the subject in the extracted triple with that property. In particular, subject-object positions in an extracted triple get reversed depending on whether the property is expressed using an active or passive voice (e.g., “sang” versus “was sung by”). These problems are similar to the ones encountered in chapter 6 on RE.</p>
<p>Pattern libraries help alleviate some of these issues, assuming that the library can be built in the first place. The PATTY system, for example, detects entities in a corpus of sentences provided for mining such rules, determines the shortest path between the entities (in the accompanying KG), and expands the path with occurring modifiers to mine the pattern. Similarly, BOA generates linguistic patterns using a corpus and KG, while PARALEX automatically learns its templates from paraphrases obtained from the WikiAnswers site. Distant supervision and advanced RE techniques can be employed here as well. Once constructed, the pattern library can be used at test time to improve the process of formulating queries (that have a higher likelihood of succeeding when executed against the KG) from input questions.</p>
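<p>A toy pattern library in the spirit of PATTY or BOA might look as follows; the patterns, property names, and the passive-voice argument swap are invented for illustration.</p>

```python
# Toy pattern library in the spirit of PATTY/BOA; patterns, property names,
# and the passive-voice swap are invented for illustration.

import re

PATTERNS = [
    # (compiled pattern, KG property, swap subject/object?)
    (re.compile(r"(\w+) sang .* with (\w+)"), "musicalCollaboration", False),
    (re.compile(r"(\w+) was sung by (\w+)"), "performed", True),  # passive voice
]

def extract(sentence):
    """Return a (subject, property, object) triple, or None."""
    for pattern, prop, swap in PATTERNS:
        m = pattern.search(sentence)
        if m:
            s, o = m.group(1), m.group(2)
            return (o, prop, s) if swap else (s, prop, o)
    return None

print(extract("Martha sang the national anthem as a duet with Christine"))
```

<p>Note how the <i>swap</i> flag handles the subject-object reversal for passive-voice patterns discussed above.</p>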
<p>Entailments could also be used to narrow the lexical gap. They rely on a corpus of already-answered questions (or linguistic QA patterns) to infer answers for new questions posed at test time. A phrase or word entails another if the latter follows from the former; for example, “man” entails “person” but not the other way around. One way (reminiscent of case-based reasoning) to employ entailments in SQA is to pregenerate a large set of possible questions for an ontology or KG. When a question comes in at test time, a systematic approach (based, for example, on syntactic and similarity features) can be used to identify the most similar match (from the pregenerated questions) and produce <span aria-label="350" id="pg_350" role="doc-pagebreak"/>the answer to that matched question. Thus, entailment is being inferred at test time from the user’s question to one of the pregenerated questions, using a range of well-defined, empirically high-performing features. In practice, this method, by itself, turns out to be both nonrobust and limited in the types of questions that it can handle, owing to the computational observation that the number of possible questions tends to grow super-linearly with the size of an ontology. The approach may be better suited to domain-specific QA, where the ontology is too complex for ordinary QA or NLP tools to apply, and also where questions tend to be limited in structure and not very natural. Another approach that has recently been explored is to find several matches (or variants thereof) and combine these more basic questions into a complex question.</p>
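<p>The case-based strategy just described can be sketched as a nearest-question lookup; the pregenerated question-answer pairs, the Jaccard similarity, and the threshold are all illustrative choices rather than any specific system's design.</p>

```python
# Sketch of the case-based strategy: match an incoming question against
# pregenerated questions and reuse the stored answer. The question-answer
# pairs, Jaccard similarity, and threshold are illustrative choices.

def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

PREGENERATED = {
    "who is the father of JFK": "Joseph P. Kennedy Sr.",
    "where was JFK born": "Brookline, Massachusetts",
}

def answer(question, threshold=0.4):
    best = max(PREGENERATED, key=lambda q: jaccard(q, question))
    return PREGENERATED[best] if jaccard(best, question) >= threshold else None

print(answer("who is the dad of JFK"))  # reuses the "father of JFK" answer
```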
<p>We note that many of the approaches described here are complementary, and some of the more sophisticated approaches are built on such compositions. For example, the BELA system implemented a four-layered approach to addressing the lexical gap challenge. The first layer involves mapping the question to the concepts of the ontology using an index lookup. Next, Levenshtein distance is used to improve the mapping (e.g., if a word in the question and a property in the ontology exceed a threshold in terms of their Levenshtein similarity). Third, WordNet is used to find synonyms for given words. In the last layer, BELA uses sophisticated semantic analysis. However, empirical evaluations showed that the earlier, simpler layers had maximal impact on performance improvements, while later layers had only marginal influence.</p>
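<p>A layered matcher in this style can be sketched as a cascade that stops at the first layer producing a mapping. The ontology labels and synonym table are invented, and the standard-library <code>difflib</code> similarity stands in for BELA's Levenshtein-based layer.</p>

```python
# Sketch of a BELA-style layered matcher; ontology labels and the synonym
# table are invented, and difflib's ratio stands in for the Levenshtein layer.

from difflib import SequenceMatcher

ONTOLOGY = {"father": "dbo:father", "birth place": "dbo:birthPlace"}
SYNONYMS = {"dad": "father"}

def map_phrase(phrase):
    if phrase in ONTOLOGY:                                    # layer 1: lookup
        return ONTOLOGY[phrase]
    for label, uri in ONTOLOGY.items():                       # layer 2: fuzzy
        if SequenceMatcher(None, phrase, label).ratio() > 0.8:
            return uri
    if phrase in SYNONYMS:                                    # layer 3: synonyms
        return map_phrase(SYNONYMS[phrase])
    return None                    # layer 4 (deeper semantic analysis) omitted

print(map_phrase("fathr"), map_phrase("dad"))
```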
<p class="TNI-H3"><b>13.3.1.2 Addressing Ambiguity</b> Ambiguity principally arises when the same phrase has different meanings, which may happen for syntactic or structural reasons (e.g., “conduct” as a verb has a different meaning than “conduct” as a noun), or for lexical and semantic reasons (e.g., “line” could refer to a queue or to a line drawn on paper). In practice, ambiguity affects precision in SQA systems, while the lexical gap mainly affects recall. For this reason, perhaps, the solutions that were proposed for addressing the lexical gap can sometimes have a negative effect on ambiguity: the looser the matching criteria become, the more candidates are retrieved that are less likely to be correct. To address ambiguity, <i>disambiguation</i> solutions are required in order to select among multiple candidate concepts for a phrase whose meaning is uncertain. Two types of disambiguation are important in the context of SQA.</p>
<p>First, <i>corpus-based methods</i> can be used to resolve the meaning of a phrase by computing statistics such as counts of phrases in a text corpus and applying the distributional hypothesis. Recall that algorithms like word2vec rely on this hypothesis, which states that the context of a phrase determines its meaning. By using a variety of statistical approaches (which can also include word2vec and other embedding approaches), context features like word cooccurrences, left and right neighbors, synonyms, hyponyms, and even the parse tree structure could be used to resolve the meaning of the phrase. Even more sophisticated <span aria-label="351" id="pg_351" role="doc-pagebreak"/>approaches take advantage of user context, including the user’s past queries and their profile.</p>
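<p>A minimal sketch of corpus-based disambiguation: each candidate sense carries a set of context words (hand-crafted here; a real system would gather them from corpus cooccurrence counts or embeddings), and the sense with the largest overlap with the question's context wins.</p>

```python
# Sketch of corpus-based disambiguation via the distributional hypothesis.
# The context sets are hand-crafted; a real system would gather them from
# corpus cooccurrence statistics or embeddings such as word2vec.

SENSE_CONTEXTS = {
    "line_queue": {"wait", "people", "ticket", "stand"},
    "line_geometry": {"draw", "point", "paper", "straight"},
}

def disambiguate(context_words):
    """Pick the sense whose corpus context best overlaps the question's."""
    context = set(context_words)
    return max(SENSE_CONTEXTS, key=lambda s: len(SENSE_CONTEXTS[s] & context))

print(disambiguate(["stand", "in", "the", "ticket", "line"]))  # line_queue
```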
<p>Second, <i>resource-based methods</i> rely on the fact that candidate concepts are KG resources (usually in RDF), not just arbitrary phrases in text. Hence, resources can be compared using various scoring schemes based on structural cues such as the resource’s properties and the connections between different resources. An assumption is that a high score between all resources chosen in the mapping implies a greater likelihood of those resources being related. This, in turn, yields a greater likelihood of those resources being correctly chosen.</p>
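<p>Resource-based disambiguation can be sketched as scoring each combination of candidate resources by how interconnected its members are in the KG, keeping the best-connected combination. The toy edges and candidate sets below are illustrative.</p>

```python
# Sketch of resource-based disambiguation: score each combination of
# candidate resources by the number of KG connections among its members,
# and keep the best-connected one. Edges and candidates are illustrative.

from itertools import product

EDGES = {("dbr:Washington,_D.C.", "dbr:United_States"),
         ("dbr:George_Washington", "dbr:Mount_Vernon")}

def connected(a, b):
    return (a, b) in EDGES or (b, a) in EDGES

def disambiguate(candidate_sets):
    def score(combo):
        return sum(connected(a, b)
                   for i, a in enumerate(combo) for b in combo[i + 1:])
    return max(product(*candidate_sets), key=score)

phrase_candidates = [
    ["dbr:Washington,_D.C.", "dbr:George_Washington"],  # "Washington"
    ["dbr:United_States"],                              # "the US"
]
print(disambiguate(phrase_candidates))
```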
<p>Several methods have been employed for making these collective determinations, some of which we have seen already in previous chapters. For example, Giannone et al. (2013) uses hidden Markov models (HMMs) to select correct ontological triples based on DBpedia’s graph structure. Another approach uses Markov Logic Networks (MLNs), previously encountered in this book when discussing KG identification or completion. Yet another system, EasyESA, is based on the distributional hypothesis and represents an entity by a vector of target words. A system called gAnswer tackles ambiguity using RDF fragments, which are starlike RDF subgraphs. In essence, it uses the number of connections among the fragments of candidates to score and select them. Wikimantic uses Wikipedia article interlinks for a generative model and can be used to disambiguate short questions or sentences, so long as the language is reasonable. DEANNA formulates SQA as an integer linear programming (ILP) problem by employing semantic coherence (the cooccurrence of resources in the same context) as a measure. It constructs a disambiguation graph encoding candidate selection for resources and properties, and also uses an objective function to maximize the combined similarity while constraints guarantee the validity of the selections. While a full formulation of the problem in this way is NP-hard, existing ILP solvers can yield good approximations. More advanced versions of the system have used DBpedia and YAGO with a mapping of input queries to semantic relations based on text search. Empirically, on the QALD 2 benchmark, DEANNA outperformed almost every system on factoid questions, as well as on list questions. It is limited, however, in the types of graph patterns that can be used in the query (complex queries or graph patterns cannot be handled by the system).</p>
<p>The systems described here are only a small sample of the many approaches over the years that have specifically attempted to improve disambiguation in the SQA context to improve overall system performance. At this time, the problem is not completely solved and remains an active area of research in both the SW and NLP communities. However, the variety of techniques mentioned here (ranging from graphical models to the use of external resources like DBpedia, Wikipedia, and YAGO) suggests that a hybrid, well-engineered solution may be the current best hope.</p>
<p class="TNI-H3"><span aria-label="352" id="pg_352" role="doc-pagebreak"/><b>13.3.1.3 Addressing Multilingualism and Complex Queries</b> Unfortunately, the majority of QA research is still predicated on the assumption that questions and answers will be in English. The web, as well as human society as a whole, is far more diverse. RDF offers the convenient facility of allowing a single resource to be described in more than one language by using language tags (such as @en and @fr). Ideally, users want to express their questions (and receive answers) in their native language, as that is the most natural and intuitive way for them to communicate their intent. However, there is no denying that the resources (including open KGs) available in English, as well as other Western languages like German, are more complete than in other languages. Hence, one line of work allows users to pose a question in their native language, but then tries to answer it by mapping the question in some way to the KG (which may be encoded in a different language). Multilingual versions of WordNet (such as GermaNet, which is part of EuroWordNet) are especially appropriate for producing good mappings. Other authors have shown that a partial translation of the question may be enough to answer it, since the recognition of other entities can be accomplished using semantic similarity and other relatedness scores between resources connected to the resources initially mapped in the KG using the partial translation. An example of a system that does this, as well as making good use of open resources, is QAKiS, whose name stands for “Question Answering wiKiframework-based System.” This system automatically extends preexisting mappings between various Wikipedia versions (in different languages) to DBpedia.</p>
<p>A more fundamental problem that arises regardless of language is the issue of complex questions, or (somewhat equivalently) questions that are formulated into complex queries. In normal discourse, what counts as complex or simple can be subjective. In QA, simple questions are ones that can be answered by locating and translating a set of simple triple patterns. Complex questions, on the other hand, may require the simultaneous retrieval and combination (in some well-defined way dictated by the semantics of the question) of several facts, or may arise when the resulting query has to obey restrictions (or requires postprocessing) because the results must be ordered, aggregated, or filtered.</p>
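<p>The simple-question case can be sketched directly: one linked entity plus one mapped property yield a single SPARQL triple pattern. The tiny entity and property lexicons below are hypothetical stand-ins for real entity recognition and property linking components.</p>

```python
# Sketch of the simple-question case: one linked entity plus one mapped
# property yield a single SPARQL triple pattern. The lexicons below are
# hypothetical stand-ins for entity recognition and property linking.

ENTITIES = {"JFK": "dbr:John_F._Kennedy"}
PROPERTIES = {"dad": "dbo:father", "birth place": "dbo:birthPlace"}

def to_sparql(question):
    entity = next(uri for name, uri in ENTITIES.items() if name in question)
    prop = next(uri for name, uri in PROPERTIES.items() if name in question)
    return f"SELECT ?x WHERE {{ {entity} {prop} ?x }}"

print(to_sparql("Who was JFK's dad?"))
```

<p>Complex questions are precisely those for which no single lookup of this kind suffices.</p>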
<p>An excellent example of a QA system that is well known to the general public and can handle complex queries is IBM’s Watson. Watson handles such questions by first determining a focus element, which represents the searched entity. Information about this element is used to predict the answer <i>type</i>, which in turn restricts the range of possible answers. Indirect questions and multiple sentences can be handled by imposing such constraints on the answers. The full Watson system is much more complex than this one feature, but the use of the feature allows it to handle complex questions and avoid nonsensical false positives (such as returning the name of a building when a celebrity’s nickname is being asked for).</p>
<p>Other systems also exist for handling questions of a complex nature. One example is YAGO-QA, which allows nested queries when the subquery has already been answered. For example, if the question is, “What is the date of birth of the founder of SpaceX?” and it <span aria-label="353" id="pg_353" role="doc-pagebreak"/>is posed after the question “Who is the founder of SpaceX?” YAGO-QA would be able to answer the former question, assuming that it was able to answer the latter question to begin with. Another assumption is that the requisite information is contained in the KG, which YAGO-QA builds by extracting facts from sources such as WordNet, Wikipedia categories and infoboxes, and GeoNames. The system also contains various surface forms, such as abbreviations and paraphrases, for entities.</p>
<p>Another system is Intui2, which is an SQA system based on DBpedia. Intui2 relies on <i>synfragments</i>, which map to a subtree of the syntactic parse tree. A synfragment, therefore, is a minimal span of text that may be interpreted as an RDF triple or as a complex query. The interpretation of a parent synfragment is derived from the combined interpretations of its child synfragments, ordered by both semantic and syntactic properties. The underlying assumption behind Intui2 is that an RDF query could be interpreted correctly by recursively interpreting its synfragments. Intui3 was an evolution of Intui2 in that it replaced ad-hoc (or manually engineered) components with libraries such as the neural network–based toolkit SENNA (primarily useful for doing NLP), as well as the DBpedia Lookup service.</p>
<p>Examples of other systems include GETARUNS and PYTHIA. The former system creates a logical form out of a query consisting of a focus, predicate, and arguments. The focus element, which also arose in the context of IBM Watson, identifies the expected answer type (e.g., the focus of the SpaceX founder question would be the ontological concept <i>“Person”</i>). If no focus element is determined, GETARUNS assumes that it is a binary, yes/no-type question. In a second step, the logical form is converted to a SPARQL query by using label matching to map elements to resources in the KG. Among other ways to improve quality, filters are used by the system to handle additional restrictions that cannot be expressed in a SPARQL query. Such restrictions can arise when dealing with data naturally expressed as a list (e.g., “Who was the sixth employee of Facebook?”). In contrast to GETARUNS, PYTHIA is an ontology-based SQA system that can potentially handle queries that are more linguistically complex, such as those involving quantifiers and numerals, but it has its own set of limitations and assumptions.</p>
<p class="TNI-H3"><b>13.3.1.4 Other Challenges</b> The challenges noted earlier are some of the main challenges encountered by SQA systems in the modern setting. However, they are not the only problems that must be solved for SQA to be truly successful. One challenge that we did not discuss in detail, for example, occurs when dealing with large KGs, or even multiple KGs sitting in distributed infrastructure. For example, we may want to answer a question not only using DBpedia or some other KG, but a set of KGs whose resources are loosely (and, potentially, noisily) interlinked using <i>sameAs, equivalentClass</i>, and <i>equivalentProperty</i> links. If it is known in advance that several KGs need to be used, then such interlinking could be done in advance (or the KGs could be completed offline using instance matching and other techniques covered in part III), with links such as <i>sameAs</i> stored in a separate infrastructure like an Entity Name System (ENS) or a registrylike index. <span aria-label="354" id="pg_354" role="doc-pagebreak"/>It is much more challenging to create these links in an online fashion when the query itself is posed, and it remains an open question how best to do so without incurring a high time complexity.</p>
<p>Common approaches to this problem assume that such links already exist between a set of KGs, and that the SQA system needs to use these links to answer a question that cannot be answered using only one KG. As just one example, the ALOQUS system uses the PROTON upper-level ontology to phrase the queries, and then aligns this ontology to other KG ontologies using the BLOOMS system. Using the alignments, the original query can be executed on the target systems. A filtering step, using a threshold on the confidence measure, is used to improve both the speed and quality of the final results.</p>
<p class="TNI-H3"><b>13.3.1.5 Special Question Types</b> In recent years, there has also been a lot of interest in SQA where the questions may be of a special nature or involve information modalities such as procedures, time, or space. Note that SQA involving these question types does not have to be domain-specific (but it could be). In practice, such question types are better suited for some domains than others. For example, if the KG has lots of geopolitical information or events (such as the GDELT KG), spatiotemporal questions are relevant for accessing that information. We provide core details on how SQA research currently handles some of these special question types.</p>
<p><i>Procedural</i> questions ask about the “how” of a phenomenon, rather than mere facts. Similarly, <i>causal</i> questions, which are an advanced area of research and beyond the scope of this book, ask about the “why.” Current SQA systems cannot handle procedural QA very well, mainly because there are currently no KGs (that are openly available and constructed at scale) that contain such knowledge. One option for addressing this problem is to assume that the procedural knowledge is somewhere on the web, and if the original problem can be satisfactorily solved by following the procedure on a webpage, it may be enough to find that webpage in response to a procedural question. The KOMODO system follows precisely this approach; it is able to return webpages with step-by-step directions on how to reach a user-specified goal. KOMODO operates by submitting the question to an ordinary search engine and then cleaning the highest-ranked returned pages to identify and extract procedural text using statistical distributions of Part-of-Speech (POS) tags.</p>
<p><i>Spatial</i> and <i>temporal</i> questions have both been the focus of more SQA research, as there is more data available (in several KGs) on the spatiotemporal properties of incidents and events. In RDF, locations can be expressed as two-dimensional (2D) geocoordinates (latitude and longitude); however, support for three-dimensional (3D) location representations is considerably lower. Another alternative is to model spatial <i>relationships</i>, which is often more relevant to users because many users are not interested in geocoordinate-based retrieval.</p>
<p>Spatiotemporal QA can also be domain-specific. For example, the Clinical Narrative Temporal Relation Ontology (CNTRO), which is based on an Interval-based Temporal <span aria-label="355" id="pg_355" role="doc-pagebreak"/>Logic, was introduced and used by a set of authors to answer temporal questions on clinical narratives. The logic is convenient, in that it allows the usage of both temporal instances and intervals. Hence, temporal relations of events can be inferred from those of others, such as by using the transitivity relationship between temporal qualifiers like <i>before</i> or <i>after</i>.<sup><a href="chapter_13.xhtml#fn4x13" id="fn4x13-bk">4</a></sup> In addition to other domain-specific features, CNTRO includes a Semantic Web Rule Language (SWRL)-based reasoner that can deduce extra time information (based on given information or facts), and even other temporal artifacts such as possible causalities (e.g., the relationship between a therapy for a disease and its application to cure a patient of that disease).</p>
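<p>The transitivity-based inference mentioned above can be sketched as a closure computation over asserted <i>before</i> relations; the event names are illustrative, and the naive fixpoint loop stands in for a real SWRL-based reasoner.</p>

```python
# Sketch of transitivity-based temporal inference: "before" is transitive,
# so new orderings follow from asserted ones. Event names are illustrative,
# and this naive fixpoint loop stands in for a real rule-based reasoner.

def transitive_closure(before):
    closure = set(before)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

asserted = {("diagnosis", "therapy"), ("therapy", "recovery")}
inferred = transitive_closure(asserted)
print(("diagnosis", "recovery") in inferred)  # deduced, never asserted
```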
<p>Several systems are capable of combining spatial and temporal reasoning. For example, QALL-ME is a multilingual SQA system that is based on description logic (DL) and uses the spatiotemporal context of a question (if the context is not directly present, the location and time of the user asking the question are added to the query) to determine the language that should be used for the answer (which may differ from the question language). In a similar vein, the implicit temporal and spatial context of the user could also be used by a dialogue-based system to resolve the ambiguity challenge.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec13-3-2"/><b>13.3.2 Template-Based Solutions</b></h3>
<p class="noindent">It should be evident from much of the earlier discussion that reformulating a question into a query in SPARQL, or even in a logical form, is a difficult problem. For complex questions in particular, the resulting SPARQL query contains more than one basic graph pattern, and the reformulation process is susceptible to (sometimes subtle) noise.</p>
<p>One of the more dominant classes of techniques for yielding queries that are either correct reformulations or of sufficiently high quality is <i>template-based solutions</i>. These approaches map input questions to either manually or automatically created SPARQL query templates. While there has also been research in building SPARQL queries in a completely template-free manner (e.g., using only the given syntactic structure of the input question), we focus on template-based solutions in this section owing to their practical importance, as well as the significant research output produced over the years. Note that the two approaches are not necessarily mutually exclusive. It may be possible, for example, to use a template-based solution to <i>bootstrap</i> the training of a question-understanding system, which may eventually learn to bypass templates (and even query reformulation) altogether, as recent language models like BERT have sought to do.</p>
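<p>In its simplest form, a template-based reformulator pairs question patterns with parameterized SPARQL templates; the regular expressions, templates, and tiny entity lexicon below are invented for illustration.</p>

```python
# Minimal sketch of a template-based reformulator: regex question templates
# paired with parameterized SPARQL templates. The templates and the tiny
# entity lexicon are invented for illustration.

import re

ENTITIES = {"SpaceX": "dbr:SpaceX"}

TEMPLATES = [
    (re.compile(r"who (?:is|was) the founder of (.+?)\??$", re.I),
     "SELECT ?x WHERE {{ {e} dbo:founder ?x }}"),
    (re.compile(r"when was (.+?) founded\??$", re.I),
     "SELECT ?d WHERE {{ {e} dbo:foundingDate ?d }}"),
]

def reformulate(question):
    for pattern, sparql in TEMPLATES:
        m = pattern.match(question.strip())
        if m and m.group(1) in ENTITIES:
            return sparql.format(e=ENTITIES[m.group(1)])
    return None

print(reformulate("Who was the founder of SpaceX?"))
```

<p>The brittleness discussed in this section is visible even here: a question that fits no template, or an entity missing from the lexicon, yields no query at all.</p>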
<p>Importantly, many templates can be generated automatically. For example, the Casia system generates graph pattern templates by using the question type, POS tags, and even named entities. Generated patterns are mapped to resources using similarity measures and resources such as WordNet and PATTY. The possible combinations of graph patterns <span aria-label="356" id="pg_356" role="doc-pagebreak"/>are used to build SPARQL queries, with the system focusing on queries that do not need filters, aggregations, or superlatives. Other systems take a slightly different approach. The Xser system first assigns <i>semantic labels</i> (variables, entities, relations, and concepts) to phrases by formulating the problem as sequence labeling and solving it using a structured perceptron (trained on several NLP features, including <i>n</i>-grams of POS tags, named entity tags, and words). A major advantage of Xser over Casia is that it can handle complex graph patterns. Other examples of systems include TBSL and SINA.</p>
<p>Some such systems have also been domain-specific. For example, both manually created and machine learning–generated templates have been developed for the narrow medical patient–treatment domain. This domain is a natural use case for templates, because the precision and quality of results are so important.</p>
<p>Template-based solutions are not just restricted to SPARQL. As a case in point, the TPSM system maps natural-language questions to Web Ontology Language (OWL) queries by formulating the problem as a fuzzy constraint satisfaction problem (CSP). Constraints include surface-text matching, similarity of surface forms, and POS tag preferences. Correct mapping elements acquired by solving the fuzzy CSP are combined into a model using predefined templates. As NoSQL systems become more popular, more such non-SPARQL approaches (that attempt to automatically frame questions in such forms as Elasticsearch boolean tree queries) may correspondingly become more popular as well.</p>
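<p>To make the non-SPARQL direction concrete, the following is a hypothetical sketch of framing a question’s content words as an Elasticsearch-style boolean tree query. The <code>"text"</code> field name and the naive stopword-based keyword extraction are our own illustrative assumptions, not the design of any published system:</p>

```python
# Minimal stopword list used only for this illustration.
STOPWORDS = {"who", "what", "where", "when", "is", "was", "the", "of", "a", "in"}

def to_bool_query(question: str) -> dict:
    """Frame a question as an Elasticsearch-style bool tree, with one
    match clause per remaining content word (a sketch, not a system)."""
    tokens = [tok.strip("?.,!").lower() for tok in question.split()]
    keywords = [t for t in tokens if t and t not in STOPWORDS]
    return {
        "query": {
            "bool": {
                # every content word must match somewhere in the document
                "must": [{"match": {"text": k}} for k in keywords]
            }
        }
    }

q = to_bool_query("Who was the first president of Kenya?")
```

<p>The resulting dictionary has the shape of a boolean tree: the <code>bool</code> node conjoins its <code>must</code> children, and richer trees could nest <code>should</code> or <code>must_not</code> branches for disjunctive or negated question parts.</p>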
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec13-3-3"/><b>13.3.3 Evaluation of SQA</b></h3>
<p class="noindent">When discussing language model–based QA such as BERT, we noted their evaluations on community-developed QA benchmarks. Systematic evaluation of SQA has similarly met with vigorous support. One such evaluation is Question Answering on Linked Data (QALD), which is the best-known all-purpose evaluation of open-world SQA on DBpedia facts. DBpedia is a staple of the Linked Data ecosystem, as we cover in chapter 14. The QALD benchmark was instituted in 2011 and made progressively more difficult each year. The general task has been supplemented with additional challenges, including multilingual QA and hybrid QA that uses text and Linked Data jointly. Another addition is SQA on statistical data by way of RDF Data Cubes.</p>
<p>Another benchmark is BioASQ, which is domain-specific and actively ran only until 2015. The task consists of both semantic indexing and SQA on biomedical data. Systems were expected to be hybrid, returning answers comprising matching triples and text snippets; partial evaluation using either modality was also permitted. An introductory version separated the tasks of Named Entity Recognition (NER) and named entity disambiguation (on the question) from the answering of the question itself. The more advanced evaluation combined all these steps to evaluate a full system.</p>
<p><span aria-label="357" id="pg_357" role="doc-pagebreak"/>TREC LiveQA, which started in 2015, poses Yahoo Answers questions to the systems.<sup><a href="chapter_13.xhtml#fn5x13" id="fn5x13-bk">5</a></sup> These questions are unanswered and were originally intended for other humans. The idea was to pose realistic questions formulated in the wild (so to speak) rather than cleverly contrived ones. The benchmark also went beyond just factual questions, such as the ones posed in QALD, BioASQ, and the old QA track of TREC.</p>
<p>Despite such efforts, however, it became increasingly difficult to compare systems in a uniform setting and to maintain a shared infrastructure for benchmarks. In the last few years, the HOBBIT project was funded in the European Union to achieve this goal, among others. HOBBIT seeks to provide an integrated platform with standardized interfaces that allows practitioners to benchmark their algorithms without complex installations. It is also able to generate benchmark data from real-world sources and runs yearly evaluation campaigns. As such, it has provided valuable impetus for measuring and reporting the performance of SQA systems more uniformly than was previously possible. We provide more details on HOBBIT and some of the open challenges and benchmarks that it supports in the section entitled “Software and Resources,” at the end of this chapter.</p>
<p>Herein, we note that one of the more recent challenges, <i>Scalable Question Answering</i>, was successfully executed on HOBBIT at the Extended Semantic Web Conference (ESWC) in 2018. The effort was established with the goal of providing a timely benchmark for assessing and comparing recent systems that mediate between many users expressing their information needs in natural language and an RDF KG. Successful approaches to this challenge were able to scale up to Big Data volumes, handling many questions and accelerating the QA process, answering the highest possible number of questions with the greatest accuracy in the shortest possible time.</p>
<p>The data set was derived from the LC-QuAD data set, comprising 5,000 questions of variable complexity and their corresponding SPARQL queries over DBpedia. In contrast to the analogous challenge task run at ESWC in 2017, the adoption of this new data set ensured an increase in the complexity of the questions and the introduction of “noise” via spelling mistakes and anomalies as a way to simulate real-world scenarios in which questions may be served to the system imperfectly (due to speech recognition failures or typing errors).</p>
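<p>The noise injection described above can be simulated with a simple sketch. The specific edit operation (swapping two adjacent characters) is only one illustrative choice of noise model, not the exact procedure used to construct the challenge data set:</p>

```python
import random

def add_typo(question: str, rng: random.Random) -> str:
    """Simulate a typing error by swapping two adjacent characters
    at a random position (one illustrative noise model)."""
    if len(question) < 2:
        return question
    chars = list(question)
    i = rng.randrange(len(chars) - 1)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

rng = random.Random(0)  # seeded for reproducibility
noisy = add_typo("Who wrote Hamlet?", rng)
```

<p>A robust SQA system is expected to answer the perturbed question as if it were clean, which is what makes the noisy variant of the benchmark harder than its 2017 predecessor.</p>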
<p>The benchmark works by sending the QA system one question at the start, two more questions after 1 minute, and a stream of <i>k</i> + 1 new questions after <i>k</i> minutes. One minute after the last set of questions is dispatched, the benchmark closes and the evaluation begins. Precision, recall, and F-measure metrics are all reported, but an additional <span aria-label="358" id="pg_358" role="doc-pagebreak"/>measure (called the <i>Response Power</i>) was introduced as the main ranking criterion. Response Power was defined as the harmonic mean of three measures: precision, recall, and the ratio of processed questions (where an empty answer counts as processed and a missing answer as unprocessed) to the total number of questions sent to the system.</p>
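<p>Under the definition just given, Response Power is a straightforward harmonic mean; the following sketch follows that definition, with the convention (our own assumption) that a zero in any component yields a score of zero:</p>

```python
def response_power(precision: float, recall: float,
                   processed: int, total: int) -> float:
    """Harmonic mean of precision, recall, and the ratio of processed
    questions to the total number of questions sent to the system."""
    ratio = processed / total if total else 0.0
    parts = (precision, recall, ratio)
    if any(p == 0.0 for p in parts):
        return 0.0  # harmonic mean collapses to zero if any part is zero
    return 3.0 / sum(1.0 / p for p in parts)

# e.g., a system with P = R = 0.5 that processed 40 of 50 questions:
score = response_power(precision=0.5, recall=0.5, processed=40, total=50)
```

<p>Because the harmonic mean is dominated by its smallest component, a system cannot compensate for dropped (unprocessed) questions with high precision alone, which is exactly the behavior the challenge organizers wanted to reward.</p>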
<p>While more than 20 teams expressed an interest in the challenge, only teams from three countries (Canada, Finland, and France) were able to submit and present their systems at the conference (presentation being a requirement). The best system achieved a Response Power of 0.472, showing that considerable progress has been made, although much work remains to be done.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec13-4"/><b>13.4 Concluding Notes</b></h2>
<p class="noindent">In this chapter, we described QA, both as a problem in itself (where users want answers to natural-language questions posed to search engines like Google or digital voice assistants like Siri or Alexa), as well as a solution for allowing users not versed in languages like SPARQL to access the rich trove of information in KGs, particularly large or complex ones.</p>
<p>Many open research questions remain in the field of QA, and we are likely to see more published research and commercially available tools owing to the popularity of the area. There is plenty of scope to combine the advantages or lessons of SQA and language model–based QA. Currently, the latter technique is used more for general QA because it has been trained on a large text corpus of a generic nature, while the former is more useful if a KG exists and needs to be queried using more natural and intuitive mechanisms, such as questions posed in English or other natural languages. Given the success of BERT and language model–based QA, it is reasonable to ask if there is some way to transfer their success to structured KGs. Researchers are only starting to investigate such questions.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec13-5"/><b>13.5 Software and Resources</b></h2>
<p class="noindent">The main motivation behind the HOBBIT infrastructure and platform was that many of the SQA systems presented in the SW literature have not been very usable or available as packages that can be evaluated on standard benchmarks (the absence of such benchmarks was itself a problem in the community for a long while, and to some extent remains one today). Hence, we describe the HOBBIT project and its resources in some detail, although papers on individual systems mentioned in the chapter, as well as in the “Bibliographic Notes” section, can be consulted to check whether software links are provided and still available. In many cases, if a specific system needs to be implemented, the user may have no option but to reimplement it based on the guidelines in the paper. This is an unfortunate consequence of nonstandardization in the early days of SQA research.</p>
<p><span aria-label="359" id="pg_359" role="doc-pagebreak"/>We also describe resources for BERT and some of the other impressive language models discussed in the first part of this chapter. Note that BERT has also found usage in other modular architectures, attesting to its usability (e.g., BERTserini was proposed as an end-to-end QA system that integrates BERT with the open-source Anserini IR toolkit). BERTserini was shown to be capable of identifying answers from a corpus of Wikipedia articles, and only required fine-tuning the pretrained BERT (on SQuAD). It was deployed as a chatbot that users could interact with. Other such applications may also exist, some of which are likely proprietary and unpublished.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec13-5-1"/><b>13.5.1 BERT and Language Model–Based Question Answering</b></h3>
<p class="noindent">The code and pretrained models for BERT are available on a GitHub project page: https://<wbr/>github<wbr/>.com<wbr/>/google<wbr/>-research<wbr/>/bert. In general, the language model and language-embedding subcommunities in NLP have a strong and laudable history of releasing their models, both for training on domain-specific or arbitrary corpora and as pretrained models that can simply be downloaded, used, and evaluated in an off-the-shelf fashion. For general domain-independent QA, therefore, these pretrained models can be accessed and executed with relative ease. Resources for some previous word-embedding models (e.g., links to word2vec, GloVe, and fastText) have been noted in previous chapters. In chapter 10, other kinds of links for training embeddings were also provided. However, if QA is the specific goal, using tools like BERT and RoBERTa gives the best chance of achieving high performance. RoBERTa is accessible at <a href="https://github.com/pytorch/fairseq">https://<wbr/>github<wbr/>.com<wbr/>/pytorch<wbr/>/fairseq</a>. Among other models, OpenAI's GPT-2 source code is available at <a href="https://github.com/openai/gpt-2">https://<wbr/>github<wbr/>.com<wbr/>/openai<wbr/>/gpt<wbr/>-2</a>. Pretrained PyTorch models for many of these models are available at <a href="https://github.com/huggingface/pytorch-pretrained-BERT">https://<wbr/>github<wbr/>.com<wbr/>/huggingface<wbr/>/pytorch<wbr/>-pretrained<wbr/>-BERT</a>.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec13-5-2"/><b>13.5.2 HOBBIT</b></h3>
<p class="noindent">Earlier, we noted that HOBBIT is an important resource in the quest to standardize and evaluate QA systems in the context of KGs. The platform itself is a distributed Findable, Accessible, Interoperable, and Reusable (FAIR) benchmarking platform that is meant to support the entire Linked Data life cycle, not just QA. As we are primarily interested in QA, however, we focus on that aspect of HOBBIT herein. We mentioned the Scalable Question Answering benchmark earlier, which came to a conclusion at ESWC 2018. It remains a valid resource for evaluating and benchmarking QA efforts, and due to its recency, the risk of benchmark bias is minimal. The main project page of HOBBIT can be accessed at <a href="https://project-hobbit.eu/">https://<wbr/>project<wbr/>-hobbit<wbr/>.eu<wbr/>/</a>, and the scalable QA challenge details can be accessed at <a href="https://project-hobbit.eu/open-challenges/sqa-open-challenge/">https://<wbr/>project<wbr/>-hobbit<wbr/>.eu<wbr/>/open<wbr/>-challenges<wbr/>/sqa<wbr/>-open<wbr/>-challenge<wbr/>/</a>. As per the website, the online instance of the HOBBIT benchmarking platform is available at <a href="https://master.project-hobbit.eu">https://<wbr/>master<wbr/>.project<wbr/>-hobbit<wbr/>.eu</a>, which also includes usage details and the platform wiki. The code of the platform and the benchmarks is accessible at <a href="https://github.com/hobbit-project">https://<wbr/>github<wbr/>.com<wbr/>/hobbit<wbr/>-project</a>, and the benchmarks are also available in CKAN (<a href="https://hobbit.ilabt.imec.be/">https://<wbr/>hobbit<wbr/>.ilabt<wbr/>.imec<wbr/>.be<wbr/>/</a>), <span aria-label="360" id="pg_360" role="doc-pagebreak"/>along with related publications and source code. The project also maintains a YouTube channel (<a href="https://www.youtube.com/channel/UC3eWNVAKXLqAdOhuQ5kd57g/featured">https://<wbr/>www<wbr/>.youtube<wbr/>.com<wbr/>/channel<wbr/>/UC3eWNVAKXLqAdOhuQ5kd57g<wbr/>/featured</a>) where tutorial videos can be found.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec13-6"/><b>13.6 Bibliographic Notes</b></h2>
<p class="noindent">There has been enormous research in both QA and language models in the last decade, and the papers studying or extending models like BERT now number in the tens (if not hundreds). By necessity, our coverage here of related work was not comprehensive; it has focused on original, rather than secondary (or follow-up) papers. Furthermore, because the main focus in this chapter is on QA rather than on language models, we have focused more on QA. We also paid attention to the first agenda presented in the introduction (QA as a means of intuitively accessing the KG), in contrast to surveys and theses on QA that place more emphasis on the second agenda (QA, especially of the commonsense or open-domain variety, as the main application to be solved in and of itself, with or without a KG).</p>
<p>In the initial part of the chapter, we discussed the KnowBot system, which learns from conversational dialogue. Good references for this kind of learning include Hixon et al. (2015), Bordes et al. (2016), Weston (2016), and Wen et al. (2017). The first of these describes KnowBot directly; the other papers address the issue of learning from conversational dialogue.</p>
<p>Next, our focus in the chapter shifted to language models and BERT. There is a broad set of references that the interested reader should pursue for more information on this topic (including the use of language models for conversational QA). We recommend Devlin et al. (2018), Liu et al. (2019), Yang, Xie, et al. (2019), Reddy et al. (2019), Yang, Dai, et al. (2019), Sanh et al. (2019), Rajpurkar et al. (2016), Peters et al. (2018), Radford et al. (2018, 2019), Wang and Cho (2019), Wolf et al. (2019), and Rogers et al. (2020) as initial sources. The first of these covers BERT itself, but the other papers are also very instructive, and some cover language models that have become (or are becoming) at least as influential.</p>
<p>Before moving into the primary agenda described here (QA as a means of intuitively accessing the KG), we also asked whether the advent of these language models (and their excellent performance thus far) demonstrates that KGs may not really be necessary for open-domain QA. For those looking for more insight into this connection between KGs and (relatively) open-domain and open-world QA, we recommend Petroni et al. (2019), which was mentioned in this chapter as a recent example of language models being able to answer queries that rely on relational knowledge. Other important studies include Bosselut and Choi (2019), Roberts et al. (2020), and Bouraoui et al. (2019). All of these are extremely recent papers, and they will most likely be followed by others exploring similar questions.</p>
<p>Moving beyond language models, the vast majority of the chapter focused on QA as one means of KG querying. This mode of QA, called Semantic Question Answering in the SW <span aria-label="361" id="pg_361" role="doc-pagebreak"/>literature, has resulted in multiple papers over the years. The synthesis lectures on NLP for the Semantic Web by Maynard et al. (2016) may serve as a useful general reference. Another excellent survey, on which we relied for our own synthesis in this chapter, was provided by Höffner et al. (2017). In this paper, the authors defined Semantic Question Answering as having three important features: first, users ask questions in natural language; second, they use their own terminology, rather than being constrained by a specific ontology or schema; and third, they obtain the answers by posing the queries to a KG (what the authors referred to as an “RDF knowledge base”). To study Semantic Question Answering, the authors identified about 72 publications covering 62 distinct SQA systems.</p>
<p>The challenges that we identified in this chapter were largely inspired by the analysis of those 72 publications. Good papers on addressing (and characterizing) the lexical gap include Ngomo (2012), Schulz and Mihov (2002), Usbeck et al. (2015), Zhang et al. (2013), and Biemann et al. (2015). Addressing ambiguity was at the forefront of the work by Giannone et al. (2013), Shizhu et al. (2014), Unger and Cimiano (2011b), Cimiano (2009), Freitas, Oliveira, O’Riain, et al. (2011), Freitas, Oliveira, Curry, et al. (2011), Carvalho et al. (2014), Boston et al. (2012), Zou et al. (2014), Shekarpour et al. (2012), and Yahya et al. (2012). Multilingualism and complex query handling have been the subjects of much more recent research; good references include Aggarwal et al. (2013), Buitelaar et al. (2009), Cojan et al. (2013), Deines and Krechel (2012), Gliozzo and Kalyanpur (2012), Unger and Cimiano (2011a), and Delmonte (2008). One of these, namely IBM’s Watson, the DeepQA architecture of which is described by Gliozzo and Kalyanpur (2012), has garnered significant attention in the popular press, especially after beating human champions at the game show <i>Jeopardy!</i>.<sup><a href="chapter_13.xhtml#fn6x13" id="fn6x13-bk">6</a></sup> More recently, Chakraborty et al. (2019) presented an introduction to neural approaches for QA over KGs.</p>
<p>These are not the only challenges being addressed, however; we also recommend papers by Joshi et al. (2012) and Damova et al. (2010) to gain a sense of challenges and approaches that we could not detail much in this chapter, including, for example, the challenge of querying large (or even multiple) KGs that sit in distributed infrastructure: Herzig et al. (2013) and Kejriwal (2014) are particularly relevant in this context. Special question types, especially of the spatiotemporal variety, have also become important recently; see Horrocks et al. (2004), Melo et al. (2011), and Younis et al. (2012).</p>
<p>In part, the ability of a system to do good SQA is limited by the state of NLP technology, which is not a dominant line of research in the Semantic Web. Template-based solutions have been conventionally presented in the context of answering complex questions by <i>reformulating</i> them as SPARQL queries in a semiautomatic fashion. Good references for template-based solutions in the SQA literature include Xu et al. (2014), Unger et al. (2012), Shekarpour et al. (2013), Zou et al. (2014), and Ben Abacha and Zweigenbaum (2012). As <span aria-label="362" id="pg_362" role="doc-pagebreak"/>we suggested in the chapter, it may even be possible to use template-based solutions to bootstrap the training of a question-understanding system, which may eventually learn to bypass templates (and even query reformulation) altogether, as recent language models like BERT have sought to do. Furthermore, with the advent of language models, the jury is still out on whether the natural-language understanding component of an SQA system is truly the bottleneck, or whether the challenge has now shifted elsewhere (e.g., to dealing with information sources of differing veracity or completeness, or to reconciling answers that are semantically correct responses to a query but contain vague or contradictory components due to the quality of the KGs).</p>
<p>With so many systems, evaluation of SQA has been an important agenda in the community, and in the “Software and Resources” section, we introduced frameworks like HOBBIT that seek to provide standardized interfaces for practitioners to benchmark their algorithms. Other good references and resources for SQA evaluation (including for domain-specific tasks like biomedical SQA) that the interested reader can look to include papers by Bizer et al. (2009a), Höffner and Lehmann (2014), Tsatsaronis et al. (2015), Agichtein et al. (2015), Balikas et al. (2015), Tsatsaronis et al. (2012), and Dang et al. (2007).</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec13-7"/><b>13.7 Exercises</b></h2>
<ul class="numbered-ntb">
<li class="NL-N">1. Making appropriate assumptions, draw an ontology fragment capturing the structure of the KG described in the paragraph below:</li>
</ul>
<p>“MusicDB is a KG describing musical artists from the United States, including artists that are not officially represented by talent agencies or recording labels. It also contains information on concerts and performances, both past and upcoming, by those artists. For more famous artists, information on side projects such as acting roles and movie soundtracks is also maintained in the KG. Even when the artist has not officially published an album or single, the artist’s work on alternative platforms like YouTube is stored in the KG. There is a state-of-the-art IM system that is able to discover (and store in the KG) <i>:sameAs</i> links automatically discovered between various modalities of the artist’s work (e.g., the song on an album may have been performed in a concert, used as a soundtrack on a movie, and have a video officially posted on YouTube). The KG also has a special relation called <i>:cover</i>, which exists between two works A and B, when B is a ‘cover’ of A (i.e., B is the same ‘work’ as A but was performed by a different artist).”</p>
<ul class="numbered-ntb">
<li class="NL-N">2. Assuming that the KG corresponding to your ontology fragment exists and is of sufficiently high quality, what would be (using your own ontological classes and properties) the SPARQL translation of the following query in natural language to access the information requested from the KG: “Who are all of the musical artists that have performed in concerts in Germany, but not France, in the year 2017?”</li>
<li class="NL-N">3. Suppose that you ran the correct SPARQL query corresponding to exercise 2 but got back a sparse set of results, presumably because American artists who performed in Germany <span aria-label="363" id="pg_363" role="doc-pagebreak"/>also performed quite often in France. Hence, your boss asks you to instead return a list of artists who performed in Germany <i>relatively more often</i> than in France. As a first step, try to reframe this question in more precise terms, given your understanding of what is meant by the phrase “relatively more often.” Next, pose your reframed question as a SPARQL query.</li>
<li class="NL-N">4. Given a model like RoBERTa, which has essentially been trained on extremely large natural-language corpora containing all manner of information, are KGs even necessary? Assuming the language model is near-perfect, what arguments could you think of to make the case that KGs are required in certain domains? Put more generally, what properties about a domain make it more amenable to a KG-centric approach as opposed to a purely NLP approach?</li>
<li class="NL-N">5. Design a semisupervised machine learning architecture that uses a language model (pretrained on a large corpus) and a small set of training examples to automatically translate a natural-language question to a SPARQL query, <i>given</i> a KG. What would the training examples look like, and how might you constrain them? How might the language model help in reducing the need for too much training data? For purposes of illustration, you may use the same KG and ontology as exercise 1.</li>
</ul>
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_13.xhtml#fn1x13-bk" id="fn1x13">1</a></sup> <a href="https://trec.nist.gov/">https://<wbr/>trec<wbr/>.nist<wbr/>.gov<wbr/>/</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_13.xhtml#fn2x13-bk" id="fn2x13">2</a></sup> At the time of its release, BERT advanced the state-of-the-art for 11 NLP tasks.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_13.xhtml#fn3x13-bk" id="fn3x13">3</a></sup> Generative Pre-trained Transformer.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_13.xhtml#fn4x13-bk" id="fn4x13">4</a></sup> Transitivity arises from the fact that if A occurs <i>after</i> B, and B occurs <i>after</i> C, then A occurs <i>after</i> C.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_13.xhtml#fn5x13-bk" id="fn5x13">5</a></sup> Recall that the original TREC evaluations were designed for pure IR approaches, with document-level relevance annotations.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_13.xhtml#fn6x13-bk" id="fn6x13">6</a></sup> <a href="https://www.nytimes.com/2011/02/17/science/17jeopardy-watson.html">https://<wbr/>www<wbr/>.nytimes<wbr/>.com<wbr/>/2011<wbr/>/02<wbr/>/17<wbr/>/science<wbr/>/17jeopardy<wbr/>-watson<wbr/>.html</a>.</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html>