
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch11" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch11"><span aria-label="279" id="pg_279" role="doc-pagebreak"/>11</h1>
<h1 class="chapter-title"><b>Reasoning and Retrieval</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b> Thus far, the focus has been on the acquisition of data (domain discovery) to construct a knowledge graph (KG), and on KG construction (KGC) and KG completion. What happens once a KG is in place? At the bare minimum, such a KG needs to allow some set of users to “access” it. The notion of accessing a KG is complex because it potentially spans a broad continuum of possibilities, from simple keyword-based information retrieval (IR) to more complex reasoning tasks. Understanding the spectrum of possibilities is important because it influences the infrastructure required for supporting KG indexing and access. For example, reasoning is computationally demanding, but it can yield more insights and provide more guarantees than simple IR. Even without reasoning, querying a KG involves a diverse set of possibilities, from structured querying (not unlike SQL queries posed on relational databases) to answering questions posed in natural language. In this chapter, we cover reasoning and retrieval in some detail, especially in the context of constructed KGs. Although both of these areas are vast, and entire fields of research in their own right, we focus on the fundamentals in this chapter. In the next two chapters, we will specifically focus on querying and question answering.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec11-1"/><b>11.1 Introduction</b></h2>
<p class="noindent">In the last few parts of this book, we have described how to construct and improve (complete) a KG. But once we have our KG, what should we do with it? How do we store and access it? These are issues that any real-world application has to deal with, and in many cases, they can determine whether KGs are even considered for that application or stakeholder at all. For example, if the schema of the data set is well defined and populated, and low latency on queries posed to the system is of utmost importance, it may make more sense to opt for a traditional Relational Database Management System (RDBMS) than a KG that, even today, can be slower than normal databases due to the inherently semistructured nature of the data. In contrast, if the goal is to do machine learning, analytics, or both in the absence of rigid structure or very high quality, using a KG obviously makes more sense but still involves important design decisions in the choice of infrastructure, especially for accessing the KG.</p>
<p><span aria-label="280" id="pg_280" role="doc-pagebreak"/>There are several criteria that can, and should, be taken into account when making these decisions, but one of the most important involves the user who will be consuming the KG outputs. Will the user be doing a search on the KG in the same way as they query using Google? Do they expect a ranked list of outputs, some of which may not even contain the exact words they searched for but are still semantically relevant? (For example, a Google user who searches for “top places to visit in Los Angeles” would not necessarily be upset to be shown, near the top of the ranked list, a website that lists “popular tourist attractions in LA” but does not otherwise contain phrases like “top places to visit.”) The important thing to remember about such IR systems, including web search engines like Google and Bing but also domain-specific or genre-specific portals like YouTube and Yelp, is that their primary goal is to satisfy user intent. Furthermore, although web search engines were initially designed for documents, like web HTML pages and text, the general definition of satisfying user intent (usually, but not always, expressed through keyword queries like the one described previously) can apply to any corpus, including video, social media, biological data, and semistructured KGs. The Google Knowledge Graph, which has been largely responsible for the recent popularity of this field, especially in industry, is the best example of a proprietary KG that expects to be accessed using this paradigm. When entities, or ranked lists of pages describing entities, are displayed to users in the Google search engine in response to a search, the goal is to satisfy user intent rather than look for strict or exact matches to the user’s query.</p>
<p>However, we can imagine many use-cases where not only must the query be more sophisticated than a keyword query, but the user may also require that the responses to the query (if more than one) strictly meet the conditions specified in the query. This kind of scenario occurs most often in the context of ordinary databases and, for industrial behemoths like Walmart, data warehouses. For example, if a business analyst at Walmart wants to know the total revenue of all Walmart stores in California, they would not be pleased if the system displayed total net income, or included Arizona with California when making the calculations. In fact, it is perfectly reasonable (and desired) for the system to return no results at all if none of the conditions in the query are satisfied (e.g., maybe the system does not have data for California within a date range specified in the query), rather than return approximate results that are “close” to the final answer. In short, users desire accuracy when they pose such queries. Generally, though not always, such queries make sense when the data is of sufficient quality and when there is adequate structure in the data.</p>
<p>Although we have presented many flexible ways of constructing, completing, and even modeling KGs (chapter 2), the fact remains that a KG is a graph to begin with. Graphs have nodes and edges and are structural by definition. Most KGs are even more structured, because there is usually an underlying ontology in terms of which the KG was constructed to begin with. It is not infeasible to imagine that domain-specific, structural query languages <span aria-label="281" id="pg_281" role="doc-pagebreak"/>such as those found in the database world (the primary example being SQL) could be devised to query graphs instead of relational tables.</p>
<p>Continuing with the notion of structure, we note that ontologies are actually far more than schemas because they tend to also be enriched with axioms and constraints. We saw this firsthand in chapter 2, when we discussed how RDFS extends the basic Resource Description Framework (RDF) model with additional well-defined (in other words, constrained) terms. We have also seen special properties like <i>owl:sameAs</i>, which is used to assert that two nodes refer to the same underlying entity. Even in this simple example, wouldn’t it be desirable (for some applications) to have a framework where, if the result of a query is node A and we know from the KG that node A and node B are matching instances via <i>owl:sameAs</i>, then node B should also be included in the answer set as a consequence of <i>reasoning</i>?</p>
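<p>To make the <i>owl:sameAs</i> intuition concrete, the following sketch (in Python, with invented names; real triple stores implement this far more efficiently) groups nodes connected by <i>sameAs</i> assertions into equivalence classes using a union-find structure, and then expands a query’s answer set accordingly:</p>

```python
# Illustrative sketch only: a union-find index over owl:sameAs assertions,
# used to expand a query's answer set with equivalent nodes. All names here
# (SameAsIndex, expand, the :Germany nodes) are invented for this example.

class SameAsIndex:
    """Groups nodes connected by owl:sameAs into equivalence classes."""

    def __init__(self):
        self.parent = {}

    def find(self, node):
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            # Path halving keeps repeated lookups near-constant time.
            self.parent[node] = self.parent[self.parent[node]]
            node = self.parent[node]
        return node

    def add_same_as(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def expand(self, answers, universe):
        """Return every node in `universe` equivalent to some answer."""
        roots = {self.find(a) for a in answers}
        return {n for n in universe if self.find(n) in roots}

index = SameAsIndex()
index.add_same_as(":Germany", ":Deutschland")
index.add_same_as(":Deutschland", ":BRD")

universe = {":Germany", ":Deutschland", ":BRD", ":France"}
expanded = index.expand({":Germany"}, universe)
print(sorted(expanded))  # [':BRD', ':Deutschland', ':Germany']
```

<p>A query whose only direct answer is <i>:Germany</i> now also returns <i>:Deutschland</i> and <i>:BRD</i>, which is exactly the behavior described above.</p>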
<p>In the next few sections, we expand upon these intuitions of reasoning and retrieval. Although they seem very different at first sight and were researched and refined by completely disjoint research communities in the pre-KG era, they have always had similar practical underpinnings—namely, the need to access data and present them to users in response to some kind of query. In the KG era, these access modes have coincided because retrieval and reasoning both apply to KGs, in contrast to documents (where reasoning of the structured kind we have described here is hard to do), or to tables (where document-style IR can be less well defined or relevant). Which mode of access is more appropriate depends on the application and user, as well as on the quality of the KG itself. The choice can be important because it influences infrastructure, storage, and other practical and engineering requirements like latency. In the next chapter, we cover querying infrastructure in more detail; in this chapter, we provide background on both reasoning and retrieval and how they apply to ordinary KGs.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec11-2"/><b>11.2 Reasoning</b></h2>
<p class="noindent">As the name suggests, a <i>reasoner</i> is a piece of software that can infer logical consequences from a set of asserted facts or axioms. The notion of a semantic reasoner generalizes that of an inference engine, by providing a richer set of mechanisms to work with. The inference rules are commonly specified by means of an ontology language, and often a <i>description logic</i> language. Many reasoners, in fact, use <i>first-order predicate logic</i> to perform reasoning; inference commonly proceeds by forward and backward chaining. Forward chaining is described logically as repeated application of modus ponens<sup><a href="chapter_11.xhtml#fn1x11" id="fn1x11-bk">1</a></sup> and is a popular inference <span aria-label="282" id="pg_282" role="doc-pagebreak"/>strategy for expert systems and production rule systems. Intuitively, forward chaining starts with the available data and uses rules to extract more data until a goal or conclusion is reached. In contrast, backward chaining, as the name suggests, works backward from the goal and is typically used in automated theorem provers and inference engines. Backward chaining tends to be more popular in artificial intelligence (AI) applications as well. Finally, while there are also examples of probabilistic reasoners, we do not cover them in this chapter (we point interested readers to relevant references in the section entitled “Bibliographic Notes,” at the end of this chapter).</p>
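<p>The two chaining strategies can be sketched in a few lines of Python. This is a toy over ground (variable-free) rules, purely to illustrate the control flow; production reasoners index their rules and handle variables, negation, and cyclic rule sets:</p>

```python
# Toy inference over ground facts. A rule is a (body, head) pair: if every
# fact in `body` holds, `head` may be concluded (modus ponens).

def forward_chain(facts, rules):
    """Data-driven: apply rules until no new fact is derived (a fixpoint)."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and body <= known:
                known.add(head)
                changed = True
    return known

def backward_chain(goal, facts, rules):
    """Goal-driven: prove `goal` by recursively proving some rule's body.
    (Naive sketch: assumes the rule set is acyclic.)"""
    if goal in facts:
        return True
    return any(head == goal and all(backward_chain(b, facts, rules) for b in body)
               for body, head in rules)

facts = {("Fido", "isa", "Dog")}
rules = [
    (frozenset({("Fido", "isa", "Dog")}), ("Fido", "isa", "Animal")),
    (frozenset({("Fido", "isa", "Animal")}), ("Fido", "isa", "LivingThing")),
]

closure = forward_chain(facts, rules)
print(("Fido", "isa", "LivingThing") in closure)                # True
print(backward_chain(("Fido", "isa", "Animal"), facts, rules))  # True
```

<p>Note the asymmetry: forward chaining materializes every derivable fact, whereas backward chaining touches only the facts and rules relevant to the single goal being proved.</p>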
<p>In the context of KGs, reasoning can be said to be the problem of <i>deriving facts that are not expressed in either the ontology or the KG explicitly</i>. Reasoning has been considered central to AI problems historically because it seems to come naturally to most human beings, including notions of implication, symmetry, transitivity, and equivalence (e.g., we are able to understand that the fact that a dog is an animal, and that Fido is a dog, together <i>imply</i> that Fido is an animal; also, we are able to understand that this is an <i>asymmetric</i> conclusion—that is, not every animal is Fido). However, machines do not have such capabilities built in a priori, which makes it necessary to formally specify axioms and constructs, as well as to design and construct an engine to process these axioms when given a KG. The output of reasoning, as the definition makes clear, is a set of new facts and conclusions. There are several practical reasons why such reasoning services are important when modeling and building KGs. In the design phase of an ontology, for example, we may want to do the following:</p>
<ul class="numbered">
<li class="NL">1. Receive warnings when making meaningless statements, such as testing satisfiability of defined concepts, because unsatisfiable defined concepts are signs of faulty modeling.</li>
<li class="NL">2. See firsthand the consequences of statements made, such as testing defined concepts for subsumption, because unwanted, missing, or nonintuitive subsumptions may also be signs of imprecise or faulty modeling.</li>
<li class="NL">3. See redundancies, such as testing defined concepts for equivalence, because knowing about redundant classes helps avoid misconceptions.</li>
</ul>
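<p>When the hierarchy and disjointness axioms are explicit, these design-time services can be approximated very crudely, as in the sketch below (hypothetical names; a real reasoner <i>derives</i> subsumption from complex class expressions rather than reading it off a graph). Here, subsumption is the transitive closure of <i>subClassOf</i>, and a class is flagged unsatisfiable when it is subsumed by two classes declared disjoint:</p>

```python
# Illustrative (hypothetical) design-time check over an explicit hierarchy.
# superclasses() computes the reflexive-transitive closure of subClassOf;
# is_unsatisfiable() flags classes caught between two disjoint superclasses.

def superclasses(cls, sub_class_of):
    """All classes subsuming `cls` (including itself)."""
    seen, stack = {cls}, [cls]
    while stack:
        for sup in sub_class_of.get(stack.pop(), ()):
            if sup not in seen:
                seen.add(sup)
                stack.append(sup)
    return seen

def is_unsatisfiable(cls, sub_class_of, disjoint_pairs):
    sups = superclasses(cls, sub_class_of)
    return any(a in sups and b in sups for a, b in disjoint_pairs)

sub_class_of = {
    "Brother": ["Man", "Sibling"],
    "Sister": ["Woman", "Sibling"],
    "ConfusedChild": ["Brother", "Sister"],   # faulty modeling
}
disjoint_pairs = [("Brother", "Sister")]

print(is_unsatisfiable("ConfusedChild", sub_class_of, disjoint_pairs))  # True
print(is_unsatisfiable("Brother", sub_class_of, disjoint_pairs))        # False
```

<p>A warning that <i>ConfusedChild</i> can have no instances is exactly the kind of signal of faulty modeling described in point 1 above.</p>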
<p>Similarly, when modifying existing ontologies, we may want to avail ourselves of the services described here, in addition to automatically generating concept definitions from examples of instances, or generating concept definitions for more sibling classes than we could feasibly write manually.</p>
<p>Although reasoning is a vast area of research, we focus on the Web Ontology Language (OWL) as a particular modeling language in the next section. However, before delving into OWL, it is useful to understand the basic primitives in a semantic reasoning engine. To begin with, a good modeling language allows a modeler to construct some fairly complex, but useful relationships among classes, which form the core elements of any ontology. Several such common building blocks are noted in <a href="chapter_11.xhtml#tab11-1" id="rtab11-1">table 11.1</a>.</p>
<div class="table">
<p class="TT"><a id="tab11-1"/><span class="FIGN"><a href="#rtab11-1">Table 11.1</a>:</span> <span class="FIG">Relationships between classes supported by most reasoners, especially based on languages like OWL.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>Relationship</b></p></th>
<th class="TCH"><p class="TB"><b>Informal Description</b></p></th>
<th class="TCH"><p class="TB"><b>Example</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB"><i>intersectionOf</i></p></td>
<td class="TB"><p class="TB">Every instance of the first class is also an instance of all classes in the specified list.</p></td>
<td class="TB"><p class="TB"><i>:Brother :intersectionOf (:Man, :Sibling)</i></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>unionOf</i></p></td>
<td class="TB"><p class="TB">Every instance of the first class is an instance of at least one of the classes in the specified list.</p></td>
<td class="TB"><p class="TB"><i>:Sibling :unionOf (:Brother, :Sister)</i></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>complementOf</i></p></td>
<td class="TB"><p class="TB">The first class is equivalent to everything <i>not</i> in the second class.</p></td>
<td class="TB"><p class="TB"><i>:Sibling :complementOf :OnlyChild</i></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>equivalentClass</i></p></td>
<td class="TB"><p class="TB">The first class and the second class contain exactly the same instances.</p></td>
<td class="TB"><p class="TB"><i>:Brother :equivalentClass :MaleSibling</i></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>differentFrom</i></p></td>
<td class="TB"><p class="TB">The first resource (usually meant to indicate an “instance” in the context of KGs) and the second resource do not refer to the same thing.</p></td>
<td class="TB"><p class="TB"><i>:BobMarley :differentFrom :AlbertEinstein</i></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>sameAs</i></p></td>
<td class="TB"><p class="TB">The first resource and the second resource refer to the same thing.</p></td>
<td class="TB"><p class="TB"><i>:Germany :sameAs :Deutschland</i></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>disjointWith</i></p></td>
<td class="TB"><p class="TB">The first class and second class have no instances in common.</p></td>
<td class="TB"><p class="TB"><i>:Brother :disjointWith :Sister</i></p></td>
</tr>
</tbody>
</table>
</figure>
</div>
<section epub:type="division">
<h3 class="head b-head"><span aria-label="283" id="pg_283" role="doc-pagebreak"/><a id="sec11-2-1"/><b>11.2.1 Description Logics: A Brief Primer</b></h3>
<p class="noindent">Description logics (DLs) are logics designed primarily to serve formal descriptions of concepts and roles (i.e., relations). These logics were created from previous attempts to formalize semantic networks and frame-based systems. Semantically, they are inspired by predicate logic, but their language is usually designed for more practical modeling purposes, and with the goal of providing good computational guarantees such as decidability. Research in DLs can be applied (e.g., how do the various DL constructs apply to real-world applications?), or theoretical and comparative (e.g., what is the impact or complexity of two DLs when evaluated against various reasoning frameworks?).</p>
<p>The knowledge representation system based on DLs consists of two components: the TBox and the ABox. We have already encountered them before, although we did not refer to them as such. The TBox describes <i>terminology</i> [i.e., the ontology in the form of concepts (or “classes”) and roles (equivalently called relations, properties, and predicates in the KG context)], while the ABox contains assertions about individuals (“instances”) using the terms from the ontology. Concepts describe sets of individuals, while roles describe relations between individuals.</p>
<p>KG and ontology descriptions using DLs employ constructs with semantics given by predicate logic. However, for historical reasons, predicate logic and DL notations are different, because DL notation is closer to semantic networks and frame-based systems. <span aria-label="284" id="pg_284" role="doc-pagebreak"/>We do not cover these notations and formalisms in this chapter, but only illustrate a very basic description logic (called <i>attribute language</i>, or AL) using a simple example.</p>
<p>Let us assume two concepts, <i>Human</i> and <i>Male</i>, generally referred to as <i>atomic</i> concepts. We can use the two concepts to refer to human men via the expression <i>Human</i> <span class="font">⊓</span> <i>Male</i>; similarly, <i>Human</i> <span class="font">⊓</span> ¬<i>Male</i> refers to humans who are “not male.” Another twist on this expression is ¬<i>Human</i> <span class="font">⊓</span> <i>Male</i>, referring to males that are not humans (e.g., male lions).</p>
<p>A more complex example, using quantifiers like ∀ (“for all”) and ∃ (“there exists”), is an expression like <i>Human</i> <span class="font">⊓</span> ∃<i>hasChild.</i><span class="font">⊤</span>, which describes all humans who have at least one child. The symbol <span class="font">⊓</span> refers to an intersection or conjunction of concepts, while the expression ∃<i>hasChild.</i><span class="font">⊤</span> says that all <i>successors</i> of the role <i>hasChild</i> are in <span class="font">⊤</span> (i.e., in the set of everything). This is just a shorthand way of saying that there are no constraints on the successor so long as one exists. In the same vein, <i>Human</i> <span class="font">⊓</span> ∀<i>hasChild.Male</i> describes all humans who only have male children. In this case, we have imposed a constraint on the successors of the <i>hasChild</i> role because the successors must be instances of the concept <i>Male</i>. Note the subtlety here between how machines and humans might interpret a formal statement like the last expression. If there is no restriction stating that humans can <i>only</i> have children that are <i>also</i> human, it is technically allowable for the KG to make assertions where a human has a child that is a male lion; such a human would still be described by this expression. Even in this simple example, one can see why reasoning, axiomatization, and conceptual modeling all require sophisticated thinking, making them double-edged swords. In prior decades, so-called expert systems had to contain many rules to avoid absurd conclusions like the one noted here, but this made them less robust to irregular occurrences. The problem has not been fully solved; most in-use reasoners can still be brittle. Probabilistic reasoning has helped, but the extent to which it has helped (or can help) is hotly debated.</p>
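<p>These AL expressions can be evaluated set-theoretically, which is essentially how DL semantics are defined: concepts denote sets of individuals and roles denote sets of pairs. The toy domain below is invented for illustration; note how the ∀<i>hasChild.Male</i> restriction admits both bob (whose only child is leo, the male lion) and carol (vacuously, since she has no children):</p>

```python
# Toy interpretation of AL concepts (domain and data invented for this example).
domain = {"alice", "bob", "carol", "leo"}          # leo is a male lion
Human = {"alice", "bob", "carol"}
Male = {"bob", "leo"}
hasChild = {("alice", "bob"), ("alice", "carol"), ("bob", "leo")}

def exists(role, concept, domain):
    """Exists-restriction: individuals with at least one role successor in concept."""
    return {x for x in domain if any((x, y) in role for y in concept)}

def forall(role, concept, domain):
    """Forall-restriction: individuals all of whose role successors are in concept."""
    return {x for x in domain
            if all(y in concept for (z, y) in role if z == x)}

human_male = Human & Male                                # Human AND Male
non_male_humans = Human - Male                           # Human AND NOT Male
parents = Human & exists(hasChild, domain, domain)       # Human with some child
only_male_kids = Human & forall(hasChild, Male, domain)  # Human, all children Male

print(sorted(human_male))       # ['bob']
print(sorted(parents))          # ['alice', 'bob']
print(sorted(only_male_kids))   # ['bob', 'carol']
```

<p>The result mirrors the subtlety discussed above: bob satisfies the restriction even though his only child is a lion, and carol satisfies it vacuously.</p>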
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec11-2-2"/><b>11.2.2 Web Ontology Language</b></h3>
<p class="noindent">The W3C OWL is a Semantic Web (SW) language designed to represent rich and complex knowledge about things, groups of things, and relations between things. OWL is a computational logic-based language, such that knowledge expressed in OWL can be exploited by computer programs (e.g., to verify the consistency of that knowledge or to make implicit knowledge explicit).</p>
<p>Why should we care about OWL as a modeling language compared to others? There are several advantages in using reasoners based on OWL:</p>
<ul class="numbered">
<li class="NL">1. OWL is <i>expressive</i>, making it particularly amenable to KGs. While legacy languages such as XML Schema Definition (XSD), Unified Modeling Language (UML), and SQL are adequate for listing a number of classes and properties and building up some simple hierarchical relationships (e.g., SQL allows us to build a new table for each class, a new column for each property, and so on), they have some severe limitations. SQL <span aria-label="285" id="pg_285" role="doc-pagebreak"/>does not allow easy representation of subclass relationships, while more expressive languages like UML can express static subclasses (e.g., <i>horse</i> is a subclass of <i>animal</i>) that are unchanging over time but cannot feasibly express dynamic relationships (e.g., all profits below $10,000 are tax-exempt in Country X). A distinguishing feature of OWL is that it can be used to express complicated and subtle ideas about data, which can be especially critical in domains like finance and medicine.</li>
<li class="NL">2. OWL is <i>flexible</i>, a valuable feature that is illustrated using the following example (inspired by relational databases). Suppose that we want to change a property in a database (in KGs, this would be the equivalent of modifying a property in the ontology). Perhaps we had previously (erroneously) assumed that the property was single-valued, but real data shows that it is actually multivalued. For almost all modern relational databases, this change would require first deleting the entire column for that property and then creating an entirely new table that holds all of those property values (as well as a foreign key reference). The problem here, not taking into account the amount of work that would be required on the part of those maintaining the database, is that the change would induce <i>second-order effects</i> (e.g., it may end up invalidating any indices that deal with the original table, as well as related queries that users might have written). This is one reason (but not the only one) that legacy data models have rarely changed since being instituted, sometimes decades ago (usually, a new model is simply created if the change is unavoidable), due to the troublesome nature of such incremental modifications. In contrast, data-modeling statements in OWL are RDF triples and, by nature, incremental. Enhancing or modifying a data model after the fact can be easily accomplished by modifying the relevant triple. Most OWL-based tools take advantage of OWL’s flexibility by supporting straightforward and incremental changes. Incremental changes are also important when dealing with KGs constructed over web or <i>streaming</i> data, especially over a period of time, because new requirements and unforeseen challenges tend to emerge in an organic fashion.</li>
<li class="NL">3. Last but not least, OWL is also fairly <i>efficient</i> compared to rival tools and languages in the KG/ontology reasoning space. OWL allows data models to support many kinds of reasoning tasks; in fact, various flavors of OWL have been proposed based on the expressive reasoning capabilities required by the application. More important, several software packages are now available for creating ontologies and performing reasoning with OWL. We cover some of these tools, chief among them the Protégé tool, in the section entitled “Software and Resources” at the end of this chapter. Subsequently, we turn our attention to the various flavors of OWL.</li>
</ul>
<p class="TNI-H3"><b>11.2.2.1 Why Not Just RDFS?</b> Recall that in chapter 2, we introduced the RDFS language for defining ontologies. While RDFS bears some similarities to OWL (in the SW stack, OWL is built on top of RDFS, meaning that it has all the capabilities of RDFS plus some others), a principal difference is in vocabulary expressiveness. OWL includes <span aria-label="286" id="pg_286" role="doc-pagebreak"/>the full vocabulary of RDFS, including <i>rdf:type</i> and <i>rdfs:domain</i>, but also includes other elements that are not included in RDFS. <a href="chapter_11.xhtml#tab11-2" id="rtab11-2">Table 11.2</a> provides an overview of the full range of class constructors in OWL, along with their corresponding DL syntax. The names of the constructors are fairly self-evident. An example of the constructor <i>unionOf</i> would be a statement like “Senators <span class="font">⊔</span> Legislators.” We introduced some of the more important and common constructs in the previous section, on reasoning primitives.</p>
<div class="table">
<p class="TT"><a id="tab11-2"/><span class="FIGN"><a href="#rtab11-2">Table 11.2</a>:</span> <span class="FIG">A list of OWL class constructors, with corresponding DL syntax.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>Class Constructor</b></p></th>
<th class="TCH"><p class="TB"><b>DL Syntax</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB"><i>intersectionOf</i></p></td>
<td class="TB"><p class="TB"><i>C</i><sub>1</sub> <span class="font">⊓</span> <span class="ellipsis">…</span> <span class="font">⊓</span> <i>C</i><sub><i>n</i></sub></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>unionOf</i></p></td>
<td class="TB"><p class="TB"><i>C</i><sub>1</sub> <span class="font">⊔</span> <span class="ellipsis">…</span> <span class="font">⊔</span> <i>C</i><sub><i>n</i></sub></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>complementOf</i></p></td>
<td class="TB"><p class="TB">¬<i>C</i></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>oneOf</i></p></td>
<td class="TB"><p class="TB">{<i>x</i><sub>1</sub>} <span class="font">⊔</span> <span class="ellipsis">…</span> <span class="font">⊔</span> {<i>x</i><sub><i>n</i></sub>}</p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>allValuesFrom</i></p></td>
<td class="TB"><p class="TB">∀<i>P.C</i></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>someValuesFrom</i></p></td>
<td class="TB"><p class="TB">∃<i>P.C</i></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>maxCardinality</i></p></td>
<td class="TB"><p class="TB">≤ <i>nP</i></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>minCardinality</i></p></td>
<td class="TB"><p class="TB">≥ <i>nP</i></p></td>
</tr>
</tbody>
</table>
</figure>
</div>
<p>However, a more important difference is that, unlike RDFS, OWL not only tells us how we can use a certain vocabulary, but also how we <i>can’t</i> use it. RDFS is much more constraint-free. For example, in RDFS, it is technically possible for anything to be an instance of <i>rdfs:Class</i>. There is nothing stopping us from using a term both as a class and an instance. RDFS considers this to be legal because it does not constrain which statements can, or cannot, be inserted. However, in at least some flavors of OWL, such statements would not be legal (i.e., it would not be allowable to declare a term as being both a class and an instance). The full range of OWL axiom constructors is enumerated in <a href="chapter_11.xhtml#tab11-3" id="rtab11-3">table 11.3</a>. In practice, which constructors can be used or are applicable depends on the flavor of OWL used, as subsequently detailed.</p>
<div class="table">
<p class="TT"><a id="tab11-3"/><span class="FIGN"><a href="#rtab11-3">Table 11.3</a>:</span> <span class="FIG">A list of OWL axiom constructors, with examples and corresponding DL syntax.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>OWL Axiom</b></p></th>
<th class="TCH"><p class="TB"><b>DL Syntax</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB"><i>subClassOf</i></p></td>
<td class="TB"><p class="TB"><i>C</i><sub>1</sub> <span class="font">⊑</span> <i>C</i><sub>2</sub></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>equivalentClass</i></p></td>
<td class="TB"><p class="TB"><i>C</i><sub>1</sub> <i>≡ C</i><sub>2</sub></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>disjointWith</i></p></td>
<td class="TB"><p class="TB"><i>C</i><sub>1</sub> <span class="font">⊑</span> ¬<i>C</i><sub>2</sub></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>sameIndividualAs</i></p></td>
<td class="TB"><p class="TB">{<i>x</i><sub>1</sub>} ≡ {<i>x</i><sub><i>n</i></sub>}</p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>differentFrom</i></p></td>
<td class="TB"><p class="TB">{<i>x</i><sub>1</sub>} ⊑ ¬{<i>x</i><sub><i>n</i></sub>}</p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>subPropertyOf</i></p></td>
<td class="TB"><p class="TB"><i>P</i><sub>1</sub> <span class="font">⊑</span> <i>P</i><sub>2</sub></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>equivalentProperty</i></p></td>
<td class="TB"><p class="TB"><i>P</i><sub>1</sub> <i>≡ P</i><sub>2</sub></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>inverseOf</i></p></td>
<td class="TB"><p class="TB"><i>P</i><sub>1</sub> <i>≡ P</i><sub>2</sub><sup>−</sup></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>transitiveProperty</i></p></td>
<td class="TB"><p class="TB"><i>P</i><sup>+</sup> <span class="font">⊑</span> <i>P</i></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>functionalProperty</i></p></td>
<td class="TB"><p class="TB"><span class="font">⊤</span> <span class="font">⊑</span> ≤ 1<i>P</i></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>inverseFunctionalProperty</i></p></td>
<td class="TB"><p class="TB"><span class="font">⊤</span> <span class="font">⊑</span> ≤ 1<i>P</i><sup>−</sup></p></td>
</tr>
</tbody>
</table>
</figure>
</div>
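<p>The class-versus-instance restriction described before table 11.3 can be checked mechanically. The sketch below (all names invented for this example) scans a set of triples for terms that appear in both class position and instance position, which is legal in RDFS but disallowed in some OWL flavors:</p>

```python
# Hypothetical check for one OWL DL-style restriction: a term should not be
# used both as a class and as an instance. We collect terms appearing in
# class position and in instance position, and report the overlap.

def class_instance_violations(triples):
    classes, instances = set(), set()
    for s, p, o in triples:
        if p == "rdf:type":
            instances.add(s)   # subject is used as an instance
            classes.add(o)     # object is used as a class
        elif p == "rdfs:subClassOf":
            classes.update((s, o))
    return classes & instances

triples = [
    (":Fido", "rdf:type", ":Dog"),
    (":Dog", "rdf:type", ":Species"),       # :Dog used as class AND instance
    (":Dog", "rdfs:subClassOf", ":Animal"),
]
print(class_instance_violations(triples))   # {':Dog'}
```

<p>RDFS (and OWL Full) would accept all three triples; a stricter OWL flavor would flag <i>:Dog</i> for playing both roles.</p>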
<p>In summary, OWL imposes a much more rigid structure compared to the more free-for-all allowances made by RDFS. In turn, this permits more expressive and meaningful reasoning capabilities. Of course, some flavors of OWL choose to implement more constraints than others, primarily for computational reasons: some kinds of inferences can be run very quickly, while others are intractable.</p>
<p class="TNI-H3"><b>11.2.2.2 Flavors of OWL</b> In the previous subsection, we referred several times to flavors of OWL. Technically, these are known as <i>sublanguages</i>. While many different sublanguages of OWL are theoretically possible, there are three sublanguages that are studied in practice because they are designed for use by specific communities of implementers and users:</p>
<ul class="numbered">
<li class="NL">1. <span aria-label="287" id="pg_287" role="doc-pagebreak"/><i>OWL Lite</i> supports those users primarily needing a classification hierarchy and simple constraints. For example, while it supports cardinality constraints, it permits cardinality values of only 0 or 1. It should be simpler to provide tool support for OWL Lite than for its more expressive relatives, and OWL Lite provides a quick migration path for thesauri and other taxonomies. OWL Lite also has a lower formal complexity than OWL DL (described next). <a href="chapter_11.xhtml#tab11-4" id="rtab11-4">Table 11.4</a> lists the OWL Lite language constructs. Note that some of the constructs are derived directly from terms in RDF or RDFS (as evidenced by their prefix, such as <i>rdfs:label</i>). Many of the terms (such as <i>rdfs:domain, Class</i>, and <i>Individual</i>, which is an instance of a class) are self-explanatory, and we do not provide detailed descriptions of them here. Annotations within the ontology are generally supported by standard (and ubiquitous) properties like <i>rdfs:label</i> and <i>rdfs:comment</i>.</li>
</ul>
<div class="table">
<p class="TT"><a id="tab11-4"/><span class="FIGN"><a href="#rtab11-4">Table 11.4</a>:</span> <span class="FIG">A nonexhaustive list of OWL Lite language features.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>OWL Lite Feature</b></p></th>
<th class="TCH"><p class="TB"><b>Brief Description</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB"><i>Class</i></p></td>
<td class="TB"><p class="TB">Defines a group of individuals that belong together because they share some properties. Classes can also be organized in a specialization hierarchy using <i>subClassOf</i>. There is a built-in most general class named <i>Thing</i> that is the class of all individuals and a superclass of all OWL classes. There is also a built-in most specific class named <i>Nothing</i> that has no instances and is a subclass of all OWL classes.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>rdfs:subClassOf</i></p></td>
<td class="TB"><p class="TB">Class hierarchies may be created by making one or more statements that a class is a subclass of another class.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>rdf:Property</i></p></td>
<td class="TB"><p class="TB">Properties can be used to state relationships between individuals or from individuals to data values. Examples include <i>hasRelative</i> and <i>hasAge</i>. Both <i>owl:ObjectProperty</i> and <i>owl:DatatypeProperty</i> are subclasses of the RDF class <i>rdf:Property</i>.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>rdfs:subPropertyOf</i></p></td>
<td class="TB"><p class="TB">Property hierarchies may be created by making one or more statements that a property is a subproperty of one or more other properties (e.g., <i>hasBrother</i> may be stated to be a subproperty of <i>hasSibling</i>).</p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>rdfs:domain</i></p></td>
<td class="TB"><p class="TB">A domain of a property limits the individuals to which the property can be applied. If a property relates an individual to another individual, and the property has a class as one of its domains, then the individual must belong to the class.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>rdfs:range</i></p></td>
<td class="TB"><p class="TB">The range of a property limits the individuals that the property may have as its value. If a property relates an individual to another individual, and the property has a class as its range, then the other individual must belong to the range class.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>Individual</i></p></td>
<td class="TB"><p class="TB">Individuals are instances of classes, and properties may be used to relate one individual to another.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>inverseOf</i></p></td>
<td class="TB"><p class="TB">If the property <i>P</i>1 is stated to be the inverse of the property <i>P</i>2, then if X is related to Y by the <i>P</i>2 property, then Y is related to X by the <i>P</i>1 property.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>FunctionalProperty</i></p></td>
<td class="TB"><p class="TB">If a property is a <i>FunctionalProperty</i>, then it has no more than one value for each individual (it may have no values for an individual).</p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>equivalentClass</i></p></td>
<td class="TB"><p class="TB">Equivalent classes have the same instances, and can be used to create synonymous classes.</p></td>
</tr>
</tbody>
</table>
</figure>
</div>
<ul class="numbered">
<li class="NL">2.<i>OWL DL</i> supports those users who want maximum expressiveness while retaining computational completeness (i.e., all conclusions are guaranteed to be computable) and decidability (i.e., all computations will finish in a finite time). OWL DL includes all the OWL language constructs, but they can be used only under certain restrictions (for example, while a class may be a subclass of many classes, a class cannot be an instance of <i>another</i> class). OWL DL is so named due to its more direct correspondence with DLs compared to the other flavors. OWL DL contains all the constructs of OWL Lite, but it also contains constructs that OWL Lite does not contain. We provide some examples of these incremental additions in <a href="chapter_11.xhtml#tab11-5" id="rtab11-5">table 11.5</a>.</li>
</ul>
<div class="table">
<p class="TT"><a id="tab11-5"/><span class="FIGN"><a href="#rtab11-5">Table 11.5</a>:</span> <span class="FIG">Two examples of incremental language features supported by OWL DL, but not OWL Lite.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="width25 TCH"><p class="TB"><b>OWL DL Feature</b></p></th>
<th class="TCH"><p class="TB"><b>Brief Description</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB"><i>oneOf</i></p></td>
<td class="TB"><p class="TB">Classes can be described by <i>enumeration</i> of the individuals that make up the class, with members of the class being <i>exactly</i> the set of enumerated individuals (no more or less). A good example is the class <i>calendarMonths</i>, which would enumerate the 12 months. A reasoner could deduce the maximum cardinality of any property that has <i>calendarMonths</i> as its <i>allValuesFrom</i> restriction.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>disjointWith</i></p></td>
<td class="TB"><p class="TB">Classes may be stated to be disjoint from each other, which usually allows a reasoner to deduce an inconsistency (e.g., if an instance is declared to be an instance of two disjoint classes; also, a reasoner can deduce negative information such as, if X is an instance of A and A and B are disjoint, then X <i>must not</i> be an instance of B).</p></td>
</tr>
</tbody>
</table>
</figure>
</div>
<ul class="numbered">
<li class="NL">3.Finally, <i>OWL Full</i> is meant for users who want maximum expressiveness and the syntactic freedom of RDF with no computational guarantees. For example, in OWL <span aria-label="288" id="pg_288" role="doc-pagebreak"/>Full, a class can be treated simultaneously as a collection of individuals and as an individual in its own right. OWL Full allows an ontology to augment the meaning of the predefined (RDF or OWL) vocabulary. It is unlikely that any reasoning software will be able to support complete reasoning for every feature of OWL Full. OWL Full also provides maximum flexibility for ontologists to define all manner of properties (e.g., for specialized kinds of complex annotations), or declarations of complex classes (e.g., that are boolean combinations of other preexisting classes).</li>
</ul>
<p><span aria-label="289" id="pg_289" role="doc-pagebreak"/>There is an interesting relationship between these three sublanguages, as might be expected. Every <i>legal</i> OWL Lite ontology is a legal OWL DL ontology, and every legal OWL DL ontology is a legal OWL Full ontology. The converses do not hold, however. The same relation applies to conclusions (i.e., every <i>valid</i> OWL Lite conclusion is a valid OWL DL conclusion, and so on).</p>
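<p>To make the flavor of such constructs concrete, the following is a minimal, purely illustrative sketch of forward-chaining inference over three OWL Lite-style features from table 11.4: <i>subClassOf</i> transitivity, type propagation through the class hierarchy, and <i>inverseOf</i>. The triples and names are hypothetical; a production system would use a dedicated OWL reasoner rather than hand-coded rules.</p>

```python
# A minimal forward-chaining sketch of a few OWL Lite-style inference rules
# (subClassOf transitivity, type propagation, inverseOf). All triples and
# names below are hypothetical illustrations, not a standard vocabulary.

def saturate(triples):
    """Apply the rules to a set of (s, p, o) triples until a fixpoint."""
    triples = set(triples)
    while True:
        new = set()
        for (s, p, o) in triples:
            # rdfs:subClassOf is transitive.
            if p == "subClassOf":
                for (s2, p2, o2) in triples:
                    if p2 == "subClassOf" and s2 == o:
                        new.add((s, "subClassOf", o2))
            # An instance of a subclass is an instance of the superclass.
            if p == "type":
                for (s2, p2, o2) in triples:
                    if p2 == "subClassOf" and s2 == o:
                        new.add((s, "type", o2))
            # owl:inverseOf: if P1 inverseOf P2 and (x, P2, y), then (y, P1, x).
            if p == "inverseOf":
                for (x, p2, y) in triples:
                    if p2 == o:
                        new.add((y, s, x))
        if new <= triples:  # fixpoint: nothing new was derived
            return triples
        triples |= new

kg = {
    ("Dog", "subClassOf", "Mammal"),
    ("Mammal", "subClassOf", "Animal"),
    ("rex", "type", "Dog"),
    ("hasParent", "inverseOf", "hasChild"),
    ("alice", "hasChild", "bob"),
}
inferred = saturate(kg)
```

<p>After saturation, the graph entails facts that were never asserted, such as that <i>rex</i> is an <i>Animal</i> and that <i>bob</i> has <i>alice</i> as a parent. This is the kind of derivation an OWL Lite reasoner performs, albeit with far more machinery and guarantees.</p>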
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec11-2-3"/><b>11.2.3Sample Reasoning Framework: Protégé</b></h3>
<p class="noindent">An extremely important resource in the SW community for modeling ontologies using OWL is Protégé. Protégé was developed by the Stanford Center for Biomedical Informatics Research at the Stanford University School of Medicine. Protégé fully supports the latest OWL 2 and RDF specifications from the World Wide Web Consortium (W3C). It is highly extensible and is based on the Java programming language. In essence, it provides a plug-and-play environment that makes it a flexible base for rapid prototyping and application development.</p>
<p>Protégés plug-in architecture can be adapted to build both simple and complex ontology-based applications. Developers can integrate the output of Protégé with rule systems or other problem solvers to construct a wide range of intelligent systems based on SW and <span aria-label="290" id="pg_290" role="doc-pagebreak"/>KG technologies. A visualization of the interface for the biomedical domain is shown in <a href="chapter_11.xhtml#fig11-1" id="rfig11-1">figure 11.1</a>.</p>
<div class="figure">
<figure class="IMG"><a id="fig11-1"/><img alt="" src="../images/Figure11-1.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig11-1">Figure 11.1</a>:</span> <span class="FIG">The Protégé interface in a biomedical domain (taken from the Wiki available at <i>protegewiki.stanford.edu/wiki/Main_Page</i>).</span></p></figcaption>
</figure>
</div>
<p>Protégé is not only a desktop-based tool; it also offers an ontology development environment for the web via WebProtégé. WebProtégé makes it easy to create, upload, modify, and share ontologies for collaborative viewing and editing. WebProtégé, just like the desktop version, fully supports OWL 2 and has a highly configurable user interface that can be used by both beginners and experts. Collaboration features include sharing and permissions, threaded notes and discussions, watches, and email notifications. A variety of formats are supported for ontology upload and download, including RDF/XML, Turtle, OWL/XML, and OBO. It is cross-compatible with the desktop version, and introduces the concept of web forms for domain-specific editing.</p>
<p>Perhaps the most important aspect of Protégé beyond all its technical capabilities is that it is actively supported by a strong community of users and developers that “field questions, write documentation, and contribute plug-ins.” At the time of writing, Protégé 5.5.0 is the most recent version. It offers several new features compared to previous versions of Protégé, but like the other versions, is free to use.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="291" id="pg_291" role="doc-pagebreak"/><a id="sec11-3"/><b>11.3Retrieval</b></h2>
<p class="noindent">A classic text on the subject defines IR as <i>finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)</i>. As the definition indicates, IR has often been employed in the context of unstructured (a misnomer for natural-language) data, with the term and the field itself largely taking off because of the increased relevance of web search engines like Google and Bing. In fact, most IR systems tend to be distinguished by the scale at which they operate. Modern IR research in companies like Microsoft and Google has focused on web-scale systems where search capabilities have to be provided over billions of documents stored on entire clusters of computers. Good indexing techniques are vital, and even minor improvements can have a strong influence on the bottom line. Particular aspects of the web itself, including exploitation of hypertext, and robustness to site providers manipulating page content to boost search engine scores, have to be dealt with to satisfy the information needs of users.</p>
<p>Personal IR is at the other end of the spectrum and is not necessarily small-scale, though it is usually much smaller than most web search systems. Personal IR is primarily relevant in operating systems (e.g., Apples Mac OS Spotlight) and email search. Between personal IR and web IR lie <i>enterprise, institutional</i>, and <i>domain-specific</i> IR systems. Enterprise and institutional IR are named for the customer bases they tend to serve. Domain-specific IR is an intriguing area of research, in both KGs and search, that has been growing rapidly. An example of domain-specific IR is product search on Amazon or video search on YouTube. Beyond KGs, domain-specific IR involves building high-performance systems for (among others) searching legal documents like court filings, patents, medical documents, and research articles on topics ranging from sociology to chemistry.</p>
<p>Given that there is so much focus on documents, why discuss retrieval in the context of KGs? One reason is that KGs are rarely all symbolic, and often contain free-text fields like descriptions, phrases, and labels. For example, in encyclopedic KGs like DBpedia, which are derived from Wikipedia infoboxes, an entity like “Bob Marley” is described not only by an abstract and a label property, but also by a birth date, statistics like the number of awards won, and links to some of his works (which in turn are subject nodes in the KG and have their corresponding links and literals), to name only a few. Symbolic reasoning may work with artifacts like dates, or even fields like awards, but would not work for description-like fields without either string matching (which does not work well for long strings like in the abstract field) or some kind of IR-inspired technique like tf-idf, which is described next. Furthermore, there are methods to represent KGs as <i>sets of key-value documents</i>, and systems like Elasticsearch support these representations seamlessly, as we describe in the next chapter. For example, for a single KG node, a document can be created, with the datatype properties of the node expressed as keys (and their corresponding object values, which are literals, expressed as values). More complex <span aria-label="292" id="pg_292" role="doc-pagebreak"/>representations are also feasible, some of which are active areas of theoretical and practical research. Either way, by representing KGs as sets of key-value documents, IR toolkits and frameworks like Lucene become applicable, as do vector space methods.</p>
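<p>The key-value representation just described can be sketched very simply: the triples, prefixes, and property names below are hypothetical, and a real pipeline would hand the resulting documents to an engine such as Lucene or Elasticsearch for indexing.</p>

```python
# Sketch: flattening a single KG node into a key-value "document" suitable
# for an IR engine. The triples and property names are hypothetical.

triples = [
    ("dbr:Bob_Marley", "rdfs:label", "Bob Marley"),
    ("dbr:Bob_Marley", "dbo:abstract", "Jamaican singer and songwriter ..."),
    ("dbr:Bob_Marley", "dbo:birthDate", "1945-02-06"),
    ("dbr:Bob_Marley", "dbo:knownFor", "dbr:Reggae"),
]

def node_to_document(node, triples):
    """Collect the properties of a node into a dict; multi-valued
    properties become lists. A purer variant would keep only literal
    (datatype) values and drop object links."""
    doc = {"@id": node}
    for s, p, o in triples:
        if s != node:
            continue
        doc.setdefault(p, []).append(o)
    return doc

doc = node_to_document("dbr:Bob_Marley", triples)
```

<p>Each such document can then be indexed field by field, so that free-text matching applies to values like the abstract, while exact matching applies to values like the birth date.</p>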
<section epub:type="division">
<h3 class="head b-head"><a id="sec11-3-1"/><b>11.3.1Term Frequency and Weighting</b></h3>
<p class="noindent">By far, the dominant approach to ranking documents is the <i>vector space model</i> (VSM). The core intuition behind a VSM is to represent each document (and even free-text queries, which may be thought of as short documents) as a <i>vector</i>, not dissimilar to the word-embedding and KG-embedding (KGE) models we considered earlier. However, unlike those models, classic document VSMs yield document vectors where dimensions are interpretable, with each dimension usually encoding the <i>importance</i> of a word. For this reason, classic VSMs based on bag of words and its variants have the dimensionality of the languages common vocabulary (for English, a vocabulary of 50,000 is typically assumed, but it is not uncommon to consider the full dictionary as well), but are very sparse because most entries are zero (even in long documents, only a small portion of the languages vocabulary is usually observed).</p>
<p>The notion of importance is vital here, and the weighting scheme that has withstood the test of time and is still considered a highly competitive baseline or set of features in many settings is tf-idf. The tf-idf weight for term <i>t</i> in document <i>d</i> may simply be expressed by the following formula:</p>
<figure class="DIS-IMG"><a id="eq11-1"/><img alt="" class="width" src="../images/eq11-1.png"/>
</figure>
<p>Here, the subexpression <i>tf</i><sub><i>t,d</i></sub> is some measure of the frequency of term <i>t</i> in <i>d</i>, usually the number of occurrences of <i>t</i> in <i>d</i>. The <i>idf</i><sub><i>t</i></sub> term, which is dependent on the full corpus rather than any individual document, is the ratio <i>N/df</i><sub><i>t</i></sub>, where <i>N</i> is the total number of documents in the corpus and <i>df</i><sub><i>t</i></sub> is the number of documents in which <i>t</i> occurs at least once. The reason why the model is said to be a bag of words model is that, as the formula illustrates, the order of the words is irrelevant to the vector representation. The sentences “The cow jumped over the moon” and “The moon jumped over the cow” would have the same vectors.</p>
<p>Variants of this formula also exist (e.g., using the logarithms of these expressions is quite common as a type of smoothing). Almost always, the logarithm of the <i>idf</i> is taken rather than the ratio <i>N/df</i><sub><i>t</i></sub>. In the case of <i>tf</i>, one could also consider relative frequency instead of absolute. Note that the length of the document is not as relevant as one might imagine after looking at the formula. The reason is that “matching” between vectors is usually done using a measure like <i>cosine similarity</i>, which measures the cosine of the angle between two vectors and is agnostic to their lengths. In essence, this equates to normalizing each vector (specifically, by its 2-norm), and then applying dot product as the similarity measure between <span aria-label="293" id="pg_293" role="doc-pagebreak"/>vectors. To take an example, if there were two documents <i>d</i><sub>1</sub> and <i>d</i><sub>2</sub>, which contained only the term “knowledge” 10 and 20 times respectively, the vector representations for both documents would be identical after normalization (containing a 1 as the dimension value represented by “knowledge,” and 0 in every other dimension). However, if even one other word were contained, the number of occurrences would start making a difference. Without a logarithmic adjustment, the effect of such “rare” words tends to be dampened, because the large frequencies end up dominating (both before and after normalization).</p>
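<p>The tf-idf and cosine computations above can be sketched in a few lines. The toy corpus is hypothetical, and the <i>idf</i> uses the common logarithmic variant just mentioned.</p>

```python
import math
from collections import Counter

# Minimal bag-of-words tf-idf with cosine similarity, following the formulas
# in the text: tf is the raw count, idf = log(N / df), and vectors are
# compared by the cosine of their angle. The toy corpus is hypothetical.

corpus = [
    "the cow jumped over the moon",
    "the moon jumped over the cow",
    "knowledge graphs support reasoning and retrieval",
]

def tfidf_vectors(corpus):
    docs = [Counter(text.split()) for text in corpus]
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # each term counted once per document
    idf = {t: math.log(n / df[t]) for t in df}
    # Sparse vectors: only nonzero dimensions are stored.
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = tfidf_vectors(corpus)
# Word order is ignored, so the first two documents get identical vectors
# and a cosine similarity of 1.0, while the third shares no terms with them.
```

<p>Note that the sparse-dictionary representation mirrors the sparsity of classic VSMs: only observed terms get a dimension, and everything else is implicitly zero.</p>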
<p>How could we use tf-idf for retrieving entities in KGs? One idea is to think of each entity as a “document” and to accept queries that are either free-text (a list of words), or more specific and semantic in origin (e.g., return a ranked list of entities matching the criteria “title: knowledge*, author: kejriwal*”). In fact, many form-based access mechanisms on the web (such as might be found on a university website) are based on such IR techniques. Of course, it is usually not the case that the underlying data over which libraries and other organizations retrieve is actually in the form of a KG. However, many modern tech-focused organizations like Amazon have far more structured data sets that resemble KGs. In any case, the point remains that IR techniques used for data sets that have descriptive text elements (with natural-language documents being an extreme case that only contains such elements) could be applied to KGs that have fields amenable to IR, such as “description,” “label,” and “comment,” among others. We saw earlier that OWL and even RDFS allow for such annotation properties, and many well-established KGs like DBpedia make liberal use of these properties.</p>
<p>The corollary is that if such text (or <i>literal</i>) properties are not liberally used, or it is difficult to index the KG, then IR methods lose their advantage over reasoning methods. The next section details this tension further. Generally, the most difficult (and most real-world) KGs to access are both noisy and have a rich mix of structured and textual information. Even cutting-edge, hybrid access techniques that combine retrieval and reasoning fail to achieve excellent performance, and there is much room for improvement in this domain. We cover such a use-case in the context of human trafficking in chapter 17, where we discuss the construction and use of domain-specific KGs for social impact projects.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec11-4"/><b>11.4Retrieval versus Reasoning</b></h2>
<p class="noindent">The situation quickly becomes more complicated with increasing sophistication in either the KG (which may adhere to a complex ontology, such as in scientific domains, or contain much nontext and even nonliteral structural information that is critical for retrieving good answers to queries) or the query itself. In essence, an IR approach is inherently limited in its reasoning capabilities, relying mainly on surface semantic properties. Word and graph embeddings such as we have studied earlier can help resolve some of these issues, but not all. For example, KG embeddings and word embeddings, as we saw earlier, tend to embed <span aria-label="294" id="pg_294" role="doc-pagebreak"/>each word or entity into a vector. A query, in contrast, is a subgraph, with some slots in the graph that are not known and have to be filled in via query execution against the KG.</p>
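<p>To see why a query is more than a vector, consider the following minimal sketch, which treats a query as a set of triple patterns whose unknown slots are variables, and fills them by backtracking against a hypothetical KG. This is only an illustration of the idea; SPARQL engines implement the same basic-graph-pattern matching with far more optimization.</p>

```python
# Executing a query-as-subgraph: triple patterns with variables (strings
# starting with "?") are matched against a KG by backtracking. The KG and
# the query below are hypothetical.

kg = {
    ("LosAngeles", "locatedIn", "California"),
    ("SantaMonica", "locatedIn", "California"),
    ("California", "locatedIn", "USA"),
}

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match(patterns, triples, binding=None):
    """Yield every variable binding under which all patterns hold in the KG."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    pattern, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for p_term, t_term in zip(pattern, triple):
            p_term = new.get(p_term, p_term)  # substitute bound variables
            if is_var(p_term):
                new[p_term] = t_term          # bind a free variable
            elif p_term != t_term:
                ok = False                    # constant mismatch: backtrack
                break
        if ok:
            yield from match(rest, triples, new)

# Which entities are located in something that is itself located in the USA?
query = [("?x", "locatedIn", "?y"), ("?y", "locatedIn", "USA")]
results = sorted(b["?x"] for b in match(query, kg))
```

<p>The two shared-variable patterns make the query a small graph, not a bag of terms; no single vector lookup could enforce the join on <i>?y</i>.</p>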
<p>In recognition of the difficulty of so-called question-answering systems that are so crucial to the functioning of chatbots and intelligent agents like Alexa and Siri, the community has come up with a set of techniques specifically for question answering (see chapter 13). Some of the leading work in this space has been done in industry, and is not generally available to the public. Beyond natural-language question answering, a hybrid suite of approaches may be the best way to tackle querying in KGs.</p>
<p>One such approach is <i>fuzzy querying</i>. Work on fuzzy querying precedes the growth of KGs and was recognized mostly in the era of the early web, when noisy RDBMSs started becoming more common and data mining on such databases (which could sometimes be derived from web data, not unlike the web IE KG construction methods described in part II of this book) was desirable. Fuzzy querying meant retrieving answers to queries that corresponded more closely to imprecise natural language (e.g., “find all records such that almost all of the important fields are as specified,” where by specification we mean a “constraint” or pattern in the query, such as <i>age &gt;</i> 18). It was clear even back then that major extensions to languages like SQL would be necessary to accommodate such queries, including the extension of syntax, semantics, an elicitation and manipulation mechanism for specification of linguistic or fuzzy terms in the queries, and the embedding of fuzzy querying in native architectures (rather than designing completely new systems that would have likely had no impact on legacy architectures and primary users of database technology). Some of this work could potentially be extended to KGs.</p>
<p>More popular, however, is work on <i>query reformulation</i>. The idea there is to specify an ordinary query (rather than a fuzzy query that contains linguistic terms like “almost” or otherwise has an extended syntax) in a language like SQL or SPARQL, but not to assume strict semantics. Rather, the querying engine takes satisfying user intent to be the guiding semantic criterion for retrieval, and their evaluation is often done using IR metrics (a sample of which is described in the next section). As an example, suppose that we specify a SPARQL query that asks for the “population of the city of Los Angeles.” If the query were to be executed with strict semantics, the population of the city of Los Angeles, if it existed in the KG or database, would be retrieved; otherwise, nothing would be retrieved by the query engine. However, many people actually mean the city of Los Angeles to be the greater Los Angeles metropolitan area, which includes other “cities” like Torrance and Santa Monica. A query reformulation system would try to satisfy user intent by using some kind of (statistical or expert-derived) heuristic to automatically reformulate the original query, in this case by expanding it to include these other cities in the greater LA metropolitan area. Furthermore, because the system itself may not know which cities to include in this expanded query, it would return, not a single answer, but a ranked list of answers that can be evaluated using various IR metrics. Note that query expansion is not the <span aria-label="295" id="pg_295" role="doc-pagebreak"/>only form of query reformulation, which can also involve relaxations (in the extreme case, by deleting a specification, especially on a noisy, unreliable, or otherwise low-coverage predicate, to get higher recall), synonymy, soft string matching, and other operators to make the querying more robust. 
Whether fuzzy querying or query reformulation is appropriate depends on the preferred mode of user elicitation and expected output, as well as the actual quality of the KG. In more recent applications, like chatbots and question answering, both approaches may be apt.</p>
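<p>A simple relaxation-based reformulation heuristic can be sketched as follows. The entities, fields, and drop-one-constraint strategy are illustrative assumptions rather than a prescribed algorithm; real systems would use statistical or expert-derived heuristics to choose which constraints to relax.</p>

```python
from itertools import combinations

# Toy query relaxation: if a strict query over entity "documents" returns
# nothing, enforce progressively smaller subsets of the constraints and
# return the answers found at the strongest level that yields any. All
# entities, fields, and values below are hypothetical.

entities = [
    {"name": "Los Angeles", "type": "City", "population": 3_900_000},
    {"name": "Santa Monica", "type": "City", "population": 93_000},
    {"name": "Torrance", "type": "City", "population": 147_000},
]

def matches(entity, constraints):
    return all(f in entity and pred(entity[f]) for f, pred in constraints)

def relaxed_search(entities, constraints):
    # Strict semantics first (all constraints), then weaker subsets.
    for k in range(len(constraints), 0, -1):
        hits = []
        for subset in combinations(constraints, k):
            for e in entities:
                if matches(e, subset) and e not in hits:
                    hits.append(e)
        if hits:
            return k, hits  # k = number of constraints still enforced
    return 0, []

query = [
    ("type", lambda v: v == "City"),
    ("population", lambda v: v > 5_000_000),  # no single city satisfies this
]
k, answers = relaxed_search(entities, query)
```

<p>Under strict semantics the query returns nothing, so the sketch relaxes to one constraint and returns a ranked-by-strength answer set, mirroring the Los Angeles example above in miniature.</p>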
<p>In the introduction to this chapter, we stated that the reasoning and retrieval research communities had been fairly disjointed historically, with their own sets of researchers, data sets, and even publishing venues. KG research has brought both modes of data access into focus and placed it on common ground, primarily due to necessity. Most KGs have literals and free-text values, as we observed in the previous section, but they also have considerable structure that reasoning systems are better able to exploit. KGs not only span a range of genres, but also exist along a quality spectrum, with some KGs having high precision and others having good coverage of domain-specific instances, but also a lot of noise. KGs may be serving different sets of stakeholders, or they may be confined to one or more narrow use-cases. However, more often than not, trade-offs are involved, meaning that it is neither desirable nor wise to commit to a framework that is pure reasoning or retrieval. Approaches like fuzzy querying are useful for navigating such trade-offs. Because this is such an important problem, we expect research to flourish in this area for the foreseeable future.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec11-4-1"/><b>11.4.1Evaluation</b></h3>
<p class="noindent">In contrast with reasoning, IR practitioners always take a ranked list as the answer to be evaluated, given a corpus of documents and a free-text query. Several such corpora are available and have been constructed over many years in the community, including the Cranfield collection; the test bed evaluation series run in the Text Retrieval Conferences (TRECs) by the US National Institute of Standards and Technology (NIST); the NTCIR collection, which focuses mainly on East Asian language and cross-language IR; and CLEF (similarly focused on cross-language retrieval and European languages). For each ranked list, and a ground-truth of which entries are relevant (in its simplest and most popular form, relevance is just a binary measure—that is, with respect to a query, either a document is judged to be relevant and has a value of 1 in the corresponding ground-truth, or it has a value of 0 if it is considered irrelevant), an IR metric can be computed.</p>
<p>Several such candidate metrics have been, and continue to be, used in the literature and among practitioners, including Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), recall@k, precision@k, and Mean Average Precision (MAP).</p>
<p>Multiple systems can be evaluated fairly by having a large, broad corpus and many queries. For one or more of these metrics, a detailed set of performance data points can be obtained for each system. The usual tests of statistical significance (typically but not always <i>pairwise</i>, <span aria-label="296" id="pg_296" role="doc-pagebreak"/>because all performance results are collected over a common set of queries) can be used to determine whether one system is statistically significantly better than another.</p>
<p><b>Mean Reciprocal Rank (MRR).</b> Given a ranked list, the reciprocal rank for that list is the inverse of the rank of the single relevant document; the MRR is the mean of this quantity over a set of queries. MRR does not apply gracefully if there is more than one relevant document for a given query. As expected, the highest possible value is 1.0, and there is a rapid decline as the relevant document moves down the list (e.g., the reciprocal rank is 0.5 if the relevant document is ranked second, 0.33 if ranked third, and so on).</p>
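<p>The computation is straightforward to sketch; the ranked lists and relevance judgments below are hypothetical.</p>

```python
# Reciprocal rank and MRR: ranks are 1-indexed, the reciprocal rank of a
# list is 1 / (rank of the first relevant document), and MRR averages this
# over queries. The runs and judgments below are hypothetical.

def reciprocal_rank(ranked, relevant):
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0  # no relevant document retrieved

def mean_reciprocal_rank(runs):
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["d3", "d1", "d7"], {"d1"}),  # relevant doc at rank 2 -> RR = 0.5
    (["d2", "d9", "d4"], {"d2"}),  # relevant doc at rank 1 -> RR = 1.0
]
```
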
<p><b>recall@</b><b><i>k</i></b><b>.</b> The recall at rank <i>k</i> (termed <i>recall@k</i>) in a ranked list is the ratio of the number of relevant documents in the top <i>k</i> to the total number of relevant documents (for that query and corpus). For example, a recall@10 of 90 percent means that 90 percent of the relevant documents in the ground-truth for the query are in the top 10. Obviously, recall@<i>m</i> will be at least as high as recall@<i>n</i> if <i>m &gt; n</i>. The lowest <i>k</i> at which recall reaches 100 percent tells us how many irrelevant documents we had to see before we saw <i>all</i> relevant documents for that query. To understand this better, lets take an example of a 10,000-document corpus. Note first that “document” should be understood here (and in the rest of this discussion) in the looser sense of the word, because in the IR context, a document is any element that can be retrieved, and this could include a video or KG entity rather than a text document. Returning to the example, suppose that only 20 of the 10,000 documents are relevant for a given query. Furthermore, suppose that we observed that the lowest <i>k</i> at which recall@<i>k</i> of 100 percent is achieved is 50. Note that recall@<i>k</i> can never be 100 percent for any <i>k</i> that is strictly smaller than the size of the ground-truth (in this case, 20). In this example, the number of irrelevant documents we saw is 30, because of the 50 documents that we saw before we saw all the relevant documents (20 in total), 30 had to be irrelevant. The <i>k</i> also gives us the rank of the last relevant document (see the exercises at the end of this chapter), and for all these reasons, it is an important number.</p>
<p>By itself, recall@<i>k</i> is rank-agnostic in that it does not tell us <i>where</i> in the top <i>k</i> the relevant documents are located. For example, if there are 10 relevant documents in the ground-truth and recall@5 is 10 percent, then this means that exactly one relevant document is in the top 5. However, the relevant document may have been at rank 1, 4, or 5; this would not affect recall@5. The measure becomes more informative when it is plotted on a graph, with <i>k</i> on the <i>x</i>-axis and recall@<i>k</i> on the <i>y</i>-axis. When we plot recall in this way, we start to see the differences in ranking (e.g., in this example, if the relevant document was at rank 4, then recall@4 would also be 10 percent, while if the document was at rank 5, recall@4 would be 0).</p>
<p>For multiple queries, the different values of recall@<i>k</i> can be averaged for each value of <i>k</i>, yielding a single curve (with error bars, capturing the variance in recall values at each <i>k</i>). Note that, based on these observations, once recall@<i>m</i> has reached 100 percent, then recall@<i>n</i> where <i>n &gt; m</i> will also be 100 percent, yielding a flat line. So a full plot can be <span aria-label="297" id="pg_297" role="doc-pagebreak"/>constructed for all values on the <i>x</i>-axis from [1<i>, <span class="ellipsis"></span>, N</i>], where <i>N</i> is the total number of documents in the corpus.</p>
<p><b>precision@</b><b><i>k</i></b><b>.</b> The precision at rank <i>k</i> is analogous to recall@<i>k</i>, except that instead of computing the recall at rank <i>k</i>, it computes the precision [the number of relevant documents in the top <i>k</i> divided by <i>min</i>(<i>k, rel</i>(<i>q</i>)), where <i>rel</i>(<i>q</i>) is the smallest rank at which recall@<i>k</i> reaches 100 percent; i.e., all relevant documents have been observed by that rank]. For example, suppose there are 10 relevant documents, and the last relevant document is observed at rank 50. The precision@50 would then be 10 / 50 = 20 percent, which stays flat at all values of <i>k</i> from 50 to <i>N</i>. Furthermore, unlike recall, the precision does not have to increase monotonically but can experience dips and increases. For example, if the very first document is relevant, then precision@1 is 100 percent, but if the second document is irrelevant and the third document is again relevant, then precision@2 drops to 50 percent, while precision@3 increases to 66.67 percent (since at rank 3, two out of three observed documents are relevant). By convention, to avoid dividing by 0, precision@<i>k</i>, where <i>k</i> &lt; <i>k</i>′, <i>k</i>′ being the lowest rank at which the <i>first</i> relevant document is observed, is always 0. In the previous example, if the first document had not been relevant, then precision@1 and precision@2 would both have been 0 (while precision@3 would have been 33.33 percent).</p>
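<p>Both metrics can be sketched in a few lines. Note that the sketch uses the standard textbook definitions (precision@<i>k</i> simply divides by <i>k</i>), and the ranked list and judgments are hypothetical.</p>

```python
# recall@k and precision@k over a single ranked list, using the standard
# definitions: recall@k = (relevant in top k) / (total relevant), and
# precision@k = (relevant in top k) / k. Judgments are hypothetical.

def recall_at_k(ranked, relevant, k):
    hits = sum(1 for d in ranked[:k] if d in relevant)
    return hits / len(relevant)

def precision_at_k(ranked, relevant, k):
    hits = sum(1 for d in ranked[:k] if d in relevant)
    return hits / k

ranked = ["d1", "d5", "d2", "d8", "d3"]
relevant = {"d1", "d2", "d3"}
# Precision dips at rank 2 (the irrelevant d5) and partially recovers at
# rank 3, while recall only ever stays flat or increases.
```
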
<p>When plotting the graphs, two kinds of conventions are common, based on the application. First, we can eliminate <i>k</i> altogether by plotting a graph of precision versus recall, precision being on the <i>y</i>-axis. <i>k</i> becomes like a hidden variable that is used to pair precision and recall values (at the same <i>k</i>) and allows one to see how precision changes with recall (see the exercises). This is a handy measurement because recall has to reach 100 percent at some <i>k</i> (in the most extreme case, recall@<i>N</i> is guaranteed to be 100 percent). In contrast, precision may never reach 100 percent for some queries (e.g., it is easy to show that if the first document is not relevant, then precision will never reach 100 percent) and unless all the relevant documents are always ranked at the top, precision@<i>N</i> will never be 100 percent. These extreme cases notwithstanding, in practice, there is almost always a trade-off between recall and precision, and a precision-recall curve helps us to evaluate this trade-off. Previously, in the chapter on reconciling KGs (chapter 8), we noticed similar trade-offs [e.g., both in the entity matching phase, where a precision-recall trade-off arose, as well as between Reduction Ratio (RR) and Pairs Completeness (PC) metrics in the context of evaluating blocking].</p>
<p><b>Interpolated Precision.</b> The second convention is to use <i>interpolated precision</i> to remove “jiggles” in the precision-recall plot that do not allow us to see the general trends very easily. The interpolated precision at a given recall level <i>r</i> is simply defined as the highest precision found for any recall level <i>r</i>′ ≥ <i>r</i>. The reason for this adjustment is that, even though the recall stays constant between two ranks <i>k</i><sub>1</sub> and <i>k</i><sub>2</sub> <i>&gt; k</i><sub>1</sub> if all documents between those two ranks (exclusive) are irrelevant and documents at <i>k</i><sub>1</sub> and <i>k</i><sub>2</sub> are relevant [which means that recall@<i>k</i><sub>1</sub> stays constant until recall@(<i>k</i><sub>2</sub> − 1) and then increases at recall@<i>k</i><sub>2</sub>], <span aria-label="298" id="pg_298" role="doc-pagebreak"/>the precision will steadily decline between those levels, and then increase [i.e., precision@(<i>k</i><sub>2</sub> − 1) will be lower than precision@<i>k</i><sub>2</sub>] when a relevant document is encountered at <i>k</i><sub>2</sub>. These steady declines, followed by a steep and sudden rise at the rank where a relevant document is encountered, lead to irregularity in the plot. By using interpolated precision instead of raw precision, the precision stays flat (and equal to precision@<i>k</i><sub>1</sub>) all the way from <i>k</i><sub>1</sub> to <i>k</i><sub>2</sub> − 1, leading to a smoother characterization of the trade-off between recall and precision at various levels of recall.</p>
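<p>A short sketch of the interpolation, using toy values of our own devising: given (recall, precision) pairs produced by sweeping the rank <i>k</i>, the interpolated precision at recall level <i>r</i> is the maximum precision at any recall level at or above <i>r</i>.</p>

```python
def interpolated_precision(points, r):
    """Highest precision at any recall level r' >= r; `points` is a list
    of (recall, precision) pairs obtained by sweeping the rank k."""
    return max(p for rec, p in points if rec >= r)

# Toy curve: relevant documents at ranks 1 and 4 of a four-document list.
# Sweeping k = 1..4 yields these (recall, precision) pairs:
points = [(0.5, 1.0), (0.5, 0.5), (0.5, 1 / 3), (1.0, 0.5)]
# Raw precision dips from 1.0 down to 1/3 before jumping back up at k = 4;
# interpolation holds the value flat at 1.0 over that stretch of recall.
```
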
<p>While precision@<i>k</i> and recall@<i>k</i> (and their harmonic mean, the F1-measure@<i>k</i>, which tries to quantify their trade-off at every value of <i>k</i>) are useful if considered graphically, and MRR is a useful metric if there is great emphasis on getting a single existent right answer at the very top (e.g., ads and e-commerce, because the user will often have a very short attention span), they have some limitations. In general, there has been controversy over which IR metric is the best one, because they are not always correlated. MRR, for example, is applicable only to ground-truths where each query has exactly one relevant document. Also, we saw how quickly the MRR declines as the relevant document slides further down the ranked list. For this reason, it has been criticized especially when it is averaged over a set of queries. Imagine, for example, that there were two queries and a very large (roughly infinite) set of documents to rank. Suppose that system 1 got the right document in the top position for the first query, but ranked the right document last for the second query. Suppose also that system 2 got the right documents at the no. 2 and no. 3 positions for the two queries, respectively. Strangely, the average MRR for system 1 (0.5) would be higher than that for system 2 (0.42)!</p>
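<p>The two-system anomaly just described can be checked numerically with a small sketch; the near-infinite corpus is approximated here by a very large rank.</p>

```python
def mean_reciprocal_rank(ranks):
    """MRR over queries, where `ranks` holds the rank of the single
    relevant document for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

LAST = 10**9  # stands in for "ranked last" in a roughly infinite corpus

system1 = mean_reciprocal_rank([1, LAST])  # right answer first, then last
system2 = mean_reciprocal_rank([2, 3])     # consistently near the top
# system1 is about 0.5 and still beats system2 (about 0.42), despite the
# terrible second-query ranking, which is the anomaly discussed in the text.
```
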
<p><b>Mean Average Precision (MAP).</b> To address some of these issues, the TREC community in particular has preferred the MAP metric, defined using the formula below. Note that MAP assumes not only a given corpus <i>D</i>, but also a given set <i>Q</i> of queries. For each query <i>q</i> ∈ <i>Q</i>, the ground-truth <i>G</i><sub><i>q</i></sub> ⊆ <i>D</i> is the set of relevant documents:</p>
<figure class="DIS-IMG"><a id="eq11-2"/><img alt="" class="width" src="../images/eq11-2.png"/>
</figure>
<p>There are some subtle aspects to equation (<a href="chapter_11.xhtml#eq11-2">11.2</a>), which are best explained from the inside out (namely, after fixing a query <i>q</i> and a relevant document <i>g</i> that belongs in the ground-truth <i>G</i><sub><i>q</i></sub>). For each query, we have a ranking over the documents in <i>D</i>. Precision(<i>g</i>) is calculated in the usual way: we note the rank <i>k</i> at which <i>g</i> occurs in the ranked list, and then divide the number of relevant documents observed up to (and including) rank <i>k</i> by <i>k</i>. The first inner sum and division then give us the average precision in MAP for a single information need (i.e., query). For this query, the average precision is approximately equal to the area under the uninterpolated precision-recall curve (computed by using <i>k</i> as a hidden variable, as described earlier). When averaged over all queries, we get the single MAP score. Unlike MRR, MAP tends to be smoother and more robust in its distribution of scores. For most <span aria-label="299" id="pg_299" role="doc-pagebreak"/>normal IR systems, MAP tends to vary between 0.1 and 0.7 according to a popular book on the subject. Indeed, it has been found that there can be more agreement between MAP scores of different systems for a single information need (query) than for MAP scores for different queries within the same system.</p>
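<p>The inside-out computation can be sketched directly (a minimal Python sketch under the definitions above; rankings and relevant sets are toy values):</p>

```python
def average_precision(ranking, relevant):
    """Average of Precision(g) over the relevant documents g: at each rank
    k holding a relevant document, count the relevant documents up to and
    including k, divide by k, and average over the relevant set."""
    total, found = 0.0, 0
    for k, d in enumerate(ranking, start=1):
        if d in relevant:
            found += 1
            total += found / k
    return total / len(relevant)

def mean_average_precision(runs):
    """Mean of per-query average precision; `runs` pairs each query's
    full ranking with its ground-truth set."""
    return sum(average_precision(r, g) for r, g in runs) / len(runs)

# One query, relevant documents at ranks 1 and 3:
ap = average_precision(["d1", "d2", "d3"], {"d1", "d3"})
# ap == (1/1 + 2/3) / 2
```
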
<p>To summarize, the “average” in MAP is the average over precision values at different positions in the ranked list, while the “mean,” just as with MRR, is the mean over all queries, because there is one average precision per query. Just like MRR, MAP also lies between 0.0 and 1.0, with 1.0 implying that for all queries, all relevant documents were always ranked at the very top. MAP has some distinct advantages over the previous metrics. First, unlike the precision and recall (@<i>k</i>) metrics, there are no fixed levels and interpolation is unnecessary.</p>
<p><b>Normalized Discounted Cumulative Gain (NDCG).</b> Another metric that is highly important and that is most commonly used when relevance is not a binary measure but more nuanced (such as on a continuous scale from 0 to 1), is the NDCG (also written as nDCG). Intuitively, NDCG attempts to jointly quantify several aspects important to retrieval, namely (1) all else being equal, an item with higher relevance to the query should be ranked higher than an item with lower relevance; (2) the more relevant items there are for a query, the lower should be the contribution of any one relevant item to the evaluation of the ranking. In contrast, a metric like MRR is focused on optimizing the highest ranking of a single relevant item. Precision and recall are parameterized by <i>k</i> and form a continuum, rather than a single-point metric. The notion of binary relevance is another major limitation in all of the metrics considered thus far.</p>
<p>The NDCG is formally given by the following formula:</p>
<figure class="DIS-IMG"><a id="eq11-3"/><img alt="" class="width" src="../images/eq11-3.png"/>
</figure>
<p>We use three additional symbols in equation (<a href="chapter_11.xhtml#eq11-3">11.3</a>)—namely, <span class="font">&#119989;</span><sub><i>q</i></sub>, <i>n</i>, and <i>R</i><sub><i>q</i></sub>(<i>i</i>), of which <i>n</i> is a constant (i.e., query and corpus independent). Note that although NDCG does not necessarily require the (query-independent) parameter <i>n</i>, it is commonplace (in practice) to use it and set it to some reasonable number like 5 or 10. In essence, <i>n</i> means that, for any query, we only retrieve the top <i>n</i> results. <i>n</i> is generally assumed to be at least as large as the typical size of a ground-truth per query. Equation (<a href="chapter_11.xhtml#eq11-3">11.3</a>) also works if <i>n</i> is just set to |<i>D</i>| (the total size of the corpus). <i>R</i><sub><i>q</i></sub>(<i>i</i>) is the <i>relevance</i> score (constrained to typically lie between [0.0, 1.0]) of the <i>i</i>th document in the ranked list retrieved in response to query <i>q</i>.</p>
<p><span class="font">&#119989;</span><sub><i>q</i></sub> is a normalization factor that allows the NDCG to be constrained between [0.0, 1.0] just like MAP and the other metrics. <span class="font">&#119989;</span><sub><i>q</i></sub> is dependent on the query <i>q</i>, and can be computed by setting the NDCG for that query (in practice, by letting |<i>Q</i>| = 1 in the equation (<a href="chapter_11.xhtml#eq11-3">11.3</a>), and getting rid of the outer summation since the averaging is over a single number) to 1 under the assumption of a perfect ranking. In other words, the inner sum, evaluated for a <span aria-label="300" id="pg_300" role="doc-pagebreak"/><i>perfect ranking</i> for that query is the reciprocal of <span class="font">&#119989;</span><sub><i>q</i></sub>. Note that, unlike MAP and the other metrics seen thus far, the relevance score of a document (given a query) is directly taken into account in the NDCG. One reason why this is useful is interannotator disagreement over whether a document is relevant or not, given a query. For example, four out of five annotators may decide that <i>d</i> is relevant for query <i>q</i>. Rather than discard this disagreement, and round up or down, a more sophisticated approach is to designate the relevance of <i>d</i> given <i>q</i> as 4/5 = 0.8. Furthermore, note that when a document is irrelevant and has relevance 0.0 for query <i>q</i>, the expression 2<sup><i>R</i><sub><i>q</i></sub>(<i>i</i>)</sup> evaluates to 1, which leads to the numerator taking on value 0.</p>
<p>Given all of these IR metrics and the way that we described how reasoning was evaluated earlier, how should one compare the performance of reasoning and retrieval? Generally, it is like comparing apples and oranges, and before the advent of KGs, the two had never been compared within the auspices of a single community or application. However, as we have argued at various points in this chapter, both reasoning and retrieval have a role to play in accessing the knowledge in the KGs. Reasoning is preferable when the KG is relatively clean and contains useful information that is only implicit in the KG itself, but can be derived by combining the KG and ontological axioms. Large KGs and ontologies can present problems of scale for reasoners, however.</p>
<p>A few government-funded research programs have yielded interesting and comparative insights on what happens when a reasoner is used to process queries (such as written in SPARQL, as discussed in chapter 12) without modifying the queries in any way, as opposed to a more IR-based system that is allowed to use the initial queries as a way to understand the actual user intent (which is to say, query reformulations and modifications are allowed, among other functionalities). Of course, scale starts becoming an issue even for well-designed IR systems once such complex operations are considered over large enough corpora.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec11-4-2"/><b>11.4.2Sample Information Retrieval Framework: Lucene</b></h3>
<p class="noindent">In the IR community, several tools are now considered mainstream for building fast and robust retrieval systems. A well-known system is Lucene, a full-text search library in Java that makes it easy to add search functionality to an application or website.</p>
<p>At a high level, Lucene operates by adding content to a full-text index, and then permitting query execution against this index, returning results ranked by either the relevance to the query or sorted by an arbitrary field such as a creation date. In Lucene, a <i>document</i> is the unit of search and index, with an index consisting of one or more documents. However (and this is a point that we alluded to earlier in this chapter, and shall repeat again in the next), a document either in the Lucene or IR context doesnt <i>necessarily</i> have to be a document, in the common English usage of the word. For example, a Lucene index can be created over a database table of products, in which case each product (i.e., row) would be represented in the index as a Lucene document.</p>
<p><span aria-label="301" id="pg_301" role="doc-pagebreak"/>In Lucene, a document consists of one or more <i>fields</i>. A field is simply a name-value pair (e.g., for a document describing this book, such a name-value pair might be “Title: Knowledge Graphs”). In this case, the field name is “Title” and the value is the title of that content item (“Knowledge Graphs”). Even though weve represented values as strings in these examples, they could potentially be other literals like numbers or dates.</p>
<p>In summary, indexing in Lucene involves creating documents comprised of one or more fields, followed by adding the documents to an <i>IndexWriter</i>. Similarly, searching involves retrieving documents from an index via an <i>IndexSearcher</i>. Many modern search and retrieval features are supported by (or can be implemented in) the basic Lucene framework, which has its own query syntax for doing searches. Lucene allows the user to specify which fields to search on and which fields to give more weight to (a process called <i>boosting</i>), and it also gives the user the ability to perform boolean queries, among other functionalities. In the next chapter, we will go deeper into boolean queries in the context of a key-value store called Elasticsearch, which has found value in some KG applications, and which has Lucene at its back end.</p>
<p>One reason why Lucene has survived the test of time is the flexibility and robustness of its query syntax, which anticipates the nature of IR being inherently difficult for machines (due to ambiguity in documents, length variance, different field reliability, etc.), and thus not amenable to a one-size-fits-all solution. The Lucene query syntax, in addition to supporting normal keyword matching against fields, also supports the following functions:</p>
<ul class="numbered">
<li class="NL">1.<b>Wildcard matching</b> (e.g., the query “Title: Knowledge*” will match any document that starts with “Knowledge” in its title)</li>
<li class="NL">2.<b>Proximity matching</b> (e.g., finding words that are within a specific distance from a word)</li>
<li class="NL">3.<b>Range searches</b> (e.g., match documents where the value in a “Date” field is between August 2001 and September 2004)</li>
<li class="NL">4.<b>Boosts</b> (e.g., customize which terms and classes are more important and should contribute high scores in terms of determining document relevance)</li>
<li class="NL">5.<b>Query parsing</b>, which makes it possible to do advanced tasks like programmatic query construction (allowing construction and deployment of dynamic and intelligent applications that can be programmed to construct their own queries via template or slot filling)</li>
</ul>
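<p>By way of illustration, the features above might be exercised with query strings like the following (a hedged sketch: the field names, such as “Title” and “Date,” are hypothetical, and exact syntax can vary across Lucene versions):</p>

```python
# Illustrative Lucene query strings, one per feature listed above.
examples = [
    ("wildcard",  'Title:Knowledge*'),             # prefix match on a field
    ("proximity", '"knowledge graphs"~4'),         # terms within 4 positions
    ("range",     'Date:[20010801 TO 20040930]'),  # inclusive range search
    ("boost",     'Title:knowledge^2 OR Body:knowledge'),  # weight the title
    ("boolean",   '+Title:knowledge -Title:databases'),    # must / must-not
]
```

<p>In an application, such strings would typically be handed to Lucenes query parser (or built programmatically via templates, as in the last item of the list above).</p>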
<p>Another reason for Lucenes survival and popularity may very well be its open-source and community-driven status, as it is supported by the Apache Software Foundation. Bugfixes and releases are periodically issued, and the project has a very active community around it. For example, as of this writing, Lucene 7.7.2 had just been released, containing nine bugfixes over the previous release. The Apache Lucene project as a whole contains three subprojects:</p>
<ul class="numbered">
<li class="NL">1.<span aria-label="302" id="pg_302" role="doc-pagebreak"/><b>Lucene Core</b>, the flagship subproject, which provides Java-based indexing and search technology, spell-checking, hit highlighting, and advanced analysis/tokenization</li>
<li class="NL">2.<b>Solr</b>, a high-performance search server built using Lucene Core, with XML/HTTP and JSON/Python/Ruby application programming interfaces (API), hit highlighting, faceted search, caching, replication, and an admin interface</li>
<li class="NL">3.<b>PyLucene</b>, a Python port of Lucene Core motivated by the popularity of the Python programming language</li>
</ul>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec11-5"/><b>11.5Concluding Notes</b></h2>
<p class="noindent">Once a KG is constructed and completed, the data in it must be accessed and consumed by users and applications. Reasoning and retrieval are two dominant modes of accessing the KG. Reasoning is the more conservative, and formally well-defined, mode of access, but it expects a higher degree of quality and structure in the KG, and more conformance of KG data and assertions to the ontology. Many real-world KGs are not able to meet these strict requirements, and more often than not, with more extreme cases violating constraints in the ontology. As the KG grows in size, such violations become unavoidable, if not common. At the same time, there are domains where data from high-quality databases are being modeled as KGs. Reasoning and semantically well defined querying (where responses to the query strictly obey specifications in the query and do not interpret user intent liberally) work well in such cases, and are important in both scientific domains, as well as proprietary business analytics.</p>
<p>Contrasted with reasoning, IR is a more robust but approximate form of data access where, given a query from the user, the goal is to interpret the <i>user intent</i> and retrieve a set of results. The results are usually ranked, although there are exceptions (i.e., where results are sets rather than lists) that are beyond the scope of this chapter. Retrieval works well in the presence of free-text values and string literals, and when either the user query or corpus (or both) have noisy, missing, or implicit information. IR and reasoning have both been thoroughly researched in the AI and information sciences community, well before the advent of KGs. Whether reasoning or retrieval, or some combination or hybrid thereof, should be used for accessing and serving data in a KG depends both on the application and the KG itself. Applications with strict quality requirements, where the user is relied upon to produce queries that express their intent (usually through a domain-specific structured language like SPARQL or SQL), generally prefer reasoning, while retrieval is better suited for applications where the domain is too broad, the knowledge is too noisy or incomplete, or the user is not always aware of all the consequences of query specifications (but has some intent or goal in mind). In recent years, the choice has not been between “one or the other,” but in how to best build systems where the benefits of both sets of approaches may be reaped. Fuzzy querying and query reformulation are two important, and reasonably well-established, classes of approaches in this promising direction.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="303" id="pg_303" role="doc-pagebreak"/><a id="sec11-6"/><b>11.6Software and Resources</b></h2>
<p class="noindent">There are many good resources for both reasoning and retrieval, and two of the important ones, Protégé and Lucene, were given due attention in this chapter. These resources can be accessed at <a href="https://protege.stanford.edu/products.php">https://<wbr/>protege<wbr/>.stanford<wbr/>.edu<wbr/>/products<wbr/>.php</a> and <a href="https://lucene.apache.org/">https://<wbr/>lucene<wbr/>.apache<wbr/>.org<wbr/>/</a>, respectively. For those who do not want to install Protégé locally, an excellent alternative is WebProtégé, which is available at <a href="https://webprotege.stanford.edu/">https://<wbr/>webprotege<wbr/>.stanford<wbr/>.edu<wbr/>/</a> and hosted by Stanford. We briefly described Lucene Core, Solr, and PyLucene toward the end of the chapter: these are available at <a href="https://lucene.apache.org/core/">https://<wbr/>lucene<wbr/>.apache<wbr/>.org<wbr/>/core<wbr/>/</a>, lucene.apache.org/solr/, and <a href="https://lucene.apache.org/pylucene/">https://<wbr/>lucene<wbr/>.apache<wbr/>.org<wbr/>/pylucene<wbr/>/</a>, respectively. Many other packages use Lucene at their back end. Good packages for trying out structured IR-style querying are Elasticsearch and MongoDB, accessible at <a href="https://www.elastic.co/">https://<wbr/>www<wbr/>.elastic<wbr/>.co<wbr/>/</a> and <a href="https://www.mongodb.com/">https://<wbr/>www<wbr/>.mongodb<wbr/>.com<wbr/>/</a>, respectively.</p>
<p>There are several open-source software packages available that implement semantic reasoners. These resources include FaCT++ (available at <a href="http://owl.cs.manchester.ac.uk/tools/fact/">http://<wbr/>owl<wbr/>.cs<wbr/>.manchester<wbr/>.ac<wbr/>.uk<wbr/>/tools<wbr/>/fact<wbr/>/</a>), which is implemented in C++ and covers expressive DLs, Racer (<a href="https://github.com/ha-mo-we/Racer">https://<wbr/>github<wbr/>.com<wbr/>/ha<wbr/>-mo<wbr/>-we<wbr/>/Racer</a>), Apache Jena (<a href="http://jena.apache.org/">http://<wbr/>jena<wbr/>.apache<wbr/>.org<wbr/>/</a>), and CEL (<a href="https://tu-dresden.de/ing/informatik/thi/lat/forschung/software/cel">https://<wbr/>tu<wbr/>-dresden<wbr/>.de<wbr/>/ing<wbr/>/informatik<wbr/>/thi<wbr/>/lat<wbr/>/forschung<wbr/>/software<wbr/>/cel</a>), among others. Some of these (e.g., FaCT++) are compatible with Protégés DIG interface, which is a standard interface/protocol that was introduced to provide a common interface to DL reasoners.</p>
<p>Note that many programming languages today provide convenient packages for computing important IR metrics like NDCG and MAP. For example, the Scikit-learn package in Python provides functions for calculating NDCG, among many other metrics (<a href="https://scikit-learn.org/stable/modules/classes.html">https://<wbr/>scikit<wbr/>-learn<wbr/>.org<wbr/>/stable<wbr/>/modules<wbr/>/classes<wbr/>.html</a>). More specialized metrics may require ad-hoc implementations.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec11-7"/><b>11.7Bibliographic Notes</b></h2>
<p class="noindent">IR holds an important place both in the history and the current practice of computer science. Even before massive success stories such as the Google search engine, IR has played an important role in academia, with SIGIR, the premier conference on IR, dating back all the way to 1971 (with the conference becoming an annual event starting from the second instance, in 1978). As Smeaton et al. (2002) report in an analysis of 25 years of SIGIR proceedings, even as early as 1971, databases and natural-language interfaces played an influential role, with the 1980s being dedicated more to the development of conceptual IR and KBs. Just as in other communities, therefore, research in the 1980s and early 1990s laid the groundwork for IR applied to the web, as well as the advent of large-scale KGs in the late 2000s.</p>
<p>By the early 2000s, there had been considerable interest in higher-level tasks like document summarization, cross-lingual IR, and distributed IR; selected overviews or influential <span aria-label="304" id="pg_304" role="doc-pagebreak"/>papers include Nenkova and McKeown (2012), Callan (2002), Sharma and Mittal (2016), and Cahoon and McKinley (1996). Since that period, with the growth of companies like Netflix and Amazon, and due to public competitions like the Netflix challenge, as embodied in Feuerverger et al. (2012) and Ellenberg (2008), the IR community has started to see more work in recommender systems, which were already becoming popular due to the growth of the web; interested readers should consider Adomavicius and Tuzhilin (2005), Burke (2002), Bobadilla et al. (2013), and Yang et al. (2014) for a good start on exploring this vast body of literature. Machine learning has also become prominent in IR, just as in many other computing communities. Learning to rank was an important paradigm that led to a large body of output; see Liu et al. (2009), Chapelle and Chang (2011), and Cao et al. (2007). The LETOR benchmark is described in multiple papers; we recommend Qin et al. (2010) for interested readers. It was released by Microsoft Research Asia and is periodically updated, providing a huge boost to the research community in this area. Much more recently, neural networks and deep learning have become popular themes in IR, as in much of the AI literature; we cite selected, diverse papers by Li and Lu (2016), Severyn and Moschitti (2015), Zhang et al. (2016), and Pang et al. (2017) for interested readers to evaluate recent work in this area.</p>
<p>Several texts on IR have been published over the years; an excellent reference for studying classic textbook material is Manning et al. (2008). Additional references include Croft et al. (2010) and Tiwary and Siddiqui (2008); this list does not include the numerous task-specific overviews. For a primer on information organization, we recommend the short bulletin given by Glushko (2013).</p>
<p>Many of the metrics first proposed for evaluating IR systems have since percolated into other predictive applications and tasks, including NLP. Precision, recall, and F-measure are standard fare in that community, although certain metrics like NDCG and MRR remain specific to IR-centric tasks due to their special dependence on ranked outputs. Some good references for understanding and comparing these metrics include Radlinski and Craswell (2010), Sakai (2007), Sakai and Kando (2008), Hripcsak and Rothschild (2005), and Bellogín et al. (2017). The last considers statistical biases in IR metrics when they are applied to recommenders. These works, which form only a sample, show that while metrics like F-measure are standard in IR (and in IR-inspired communities that have to evaluate some measure of accuracy), they are interesting areas of study in their own right.</p>
<p>Much of the material on OWL and reasoning in this chapter has been derived from classic sources and official tutorials on the subject that we cite in chapter 12, when the matter is covered further in depth. We only cite some primers and introductory material herein for the sake of completeness, including Hitzler et al. (2009), Antoniou and Van Harmelen (2004a), and Krötzsch et al. (2012).</p>
<p>We briefly mentioned probabilistic reasoners, although a full description of these is beyond the scope of this introductory chapter. Good references for probabilistic reasoners <span aria-label="305" id="pg_305" role="doc-pagebreak"/>include Schum (2001), and concerning the SW, Klinov (2008) and Da Costa et al. (2006). For a synthesis, especially as it pertains to the broader goals of building intelligent systems, Pearl (2014) is an invaluable guide, as is Neapolitan (2012).</p>
<p>Considering the issue of retrieval versus reasoning, we note that a growing body of work has chosen to not view these as being mutually exclusive, but has instead tried to combine the two areas to yield a more powerful and robust system for accessing KGs in an intelligent way. We mentioned query reformulation as an important research area where we see this amalgam occurring in practice, as evidenced by relatively recent work in the last two decades (especially in the SW community) from Calvanese et al. (2004), Straccia and Troncy (2006), Huang and Efthimiadis (2009), Buron et al. (2019), and Viswanathan et al. (2017).</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec11-8"/><b>11.8Exercises</b></h2>
<p class="noindent">In the following questions, our primary focus will be on retrieval, as we cover reasoning and query execution in depth in the next chapter. Toward the end of the exercises, we consider questions of a comparative nature.</p>
<ul class="numbered">
<li class="NL">1.What is the rank of the last relevant document in the ground-truth, given the lowest <i>k</i> (call this <i>k</i>) such that recall@<i>k</i> reaches 100 percent? Why? <i>Hint: Argue that the document at the k</i> 1 <i>position</i> must <i>be irrelevant.</i></li>
<li class="NL">2.Show that, if the first document in a ranked list is not relevant, then there is no <i>k</i> for which precision@<i>k</i> reaches 100 percent.</li>
<li class="NL">3.Show that, for some <i>k &gt;</i> 1, if precision@<i>k</i> is 100 percent, then it is necessarily the case that all top <i>k</i> documents are relevant. <i>Hint: What happens if only one of the documents is irrelevant?</i></li>
<li class="NL">4.Is it ever possible for the precision@<i>k</i> versus recall@<i>k</i> curve to have slope 0 before recall@<i>k</i> reaches 100 percent?</li>
<li class="NL">5.We will be computing NDCG and MAP for two systems for a set <i>Q</i> of three queries <i>q</i><sub>1</sub>, <i>q</i><sub>2</sub> and <i>q</i><sub>3</sub>. Assume a document set <i>D</i> with 10 documents {<i>d</i><sub>1</sub><i>, <span class="ellipsis"></span>, d</i><sub>10</sub>}. The (ultimate) goal is to determine which system is better, and by how much, on both metrics. We will proceed in steps. For the NDCG, assume binary relevance [i.e., <i>R</i>(<i>j, m</i>) is either 1 or 0]. Consult the table here for specifics.</li>
</ul>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>Query</b></p></th>
<th class="TCH"><p class="TB"><b>System 1 (ranked list in order</b> [1<i>, <span class="ellipsis"></span></i>, 10]<b>)</b></p></th>
<th class="TCH"><p class="TB"><b>System 2 (ranked list in order</b> [1<i>, <span class="ellipsis"></span></i>, 10]<b>)</b></p></th>
<th class="TCH"><p class="TB"><b>Relevant items</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB"><i>q</i><sub>1</sub></p></td>
<td class="TB"><p class="TB">[<i>d</i><sub>1</sub><i>, d</i><sub>2</sub><i>, d</i><sub>3</sub><i>, d</i><sub>4</sub><i>, d</i><sub>5</sub><i>, d</i><sub>6</sub><i>, d</i><sub>7</sub><i>, d</i><sub>8</sub><i>, d</i><sub>9</sub><i>, d</i><sub>10</sub>]</p></td>
<td class="TB"><p class="TB">[<i>d</i><sub>10</sub><i>, d</i><sub>9</sub><i>, d</i><sub>8</sub><i>, d</i><sub>7</sub><i>, d</i><sub>6</sub><i>, d</i><sub>5</sub><i>, d</i><sub>4</sub><i>, d</i><sub>3</sub><i>, d</i><sub>2</sub><i>, d</i><sub>1</sub>]</p></td>
<td class="TB"><p class="TB"><i>d</i><sub>2</sub><i>, d</i><sub>5</sub></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>q</i><sub>2</sub></p></td>
<td class="TB"><p class="TB">[<i>d</i><sub>1</sub><i>, d</i><sub>3</sub><i>, d</i><sub>5</sub><i>, d</i><sub>6</sub><i>, d</i><sub>7</sub><i>, d</i><sub>2</sub><i>, d</i><sub>4</sub><i>, d</i><sub>9</sub><i>, d</i><sub>8</sub><i>, d</i><sub>10</sub>]</p></td>
<td class="TB"><p class="TB">[<i>d</i><sub>10</sub><i>, d</i><sub>9</sub><i>, d</i><sub>8</sub><i>, d</i><sub>7</sub><i>, d</i><sub>6</sub><i>, d</i><sub>5</sub><i>, d</i><sub>4</sub><i>, d</i><sub>3</sub><i>, d</i><sub>2</sub><i>, d</i><sub>1</sub>]</p></td>
<td class="TB"><p class="TB"><i>d</i><sub>3</sub><i>, d</i><sub>6</sub><i>, d</i><sub>7</sub></p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><i>q</i><sub>3</sub></p></td>
<td class="TB"><p class="TB">[<i>d</i><sub>8</sub><i>, d</i><sub>7</sub><i>, d</i><sub>2</sub><i>, d</i><sub>3</sub><i>, d</i><sub>1</sub><i>, d</i><sub>5</sub><i>, d</i><sub>4</sub><i>, d</i><sub>10</sub><i>, d</i><sub>9</sub><i>, d</i><sub>6</sub>]</p></td>
<td class="TB"><p class="TB">[<i>d</i><sub>10</sub><i>, d</i><sub>9</sub><i>, d</i><sub>8</sub><i>, d</i><sub>7</sub><i>, d</i><sub>6</sub><i>, d</i><sub>5</sub><i>, d</i><sub>4</sub><i>, d</i><sub>3</sub><i>, d</i><sub>2</sub><i>, d</i><sub>1</sub>]</p></td>
<td class="TB"><p class="TB"><i>d</i><sub>1</sub><i>, d</i><sub>2</sub><i>, d</i><sub>3</sub><i>, d</i><sub>4</sub></p></td>
</tr>
</tbody>
</table>
</figure>
<p class="AL">(a)<span aria-label="306" id="pg_306" role="doc-pagebreak"/>Suppose that we interpret the output of system 1 on <i>q</i><sub>3</sub> as a set. What is the recall of system 1 on <i>q</i><sub>3</sub>?</p>
<p class="AL">(b)As a first step (toward computing and comparing ranked output metrics), for all three queries, compute the NDCG normalization factor (i.e., <i>Z</i><sub><i>kj</i></sub>, for <i>j</i> = 1, 2, 3).</p>
<p class="AL">(c)Compute the NDCG for both systems for each query. Which system is better on which query? What is the average NDCG for each of the systems?</p>
<p class="AL">(d)Compute the average precision (AP) for all three queries for both systems. Which system is better on which query?</p>
<p class="AL">(e)Compute the MAP for both systems.</p>
<p class="AL">(f)What is the correlation between the AP and NDCG per query? Are there any queries where one metric leads you to an inconsistent result on which system is better?</p>
<ul class="numbered">
<li class="NL">6.Rather than assigning relevance scores of 1 or 0 to the 10 documents in the table from exercise 5, we assign a relevance score of 1.0 to the relevant items (for each query) listed therein, and a relevance score of 0.5 to all other items. Is it still possible to compute the MAP? What about the NDCG? Has the average NDCG (across queries) gone up, gone down, or stayed the same? By how much?</li>
<li class="NL">7.** In the vein of the table from exercise 5, construct outputs for two systems and a single ground truth (i.e., there is only one query and one ground truth, but two systems have produced outputs) such that the MAP says one system is better than the other, but the NDCG says otherwise. Assume a population of only five documents for simplicity.</li>
<li class="NL">8.Is it ever possible to derive such an inconsistency if the ground-truth has only one document? What if it has two documents?</li>
<li class="NL">9.Given inconsistencies, when would you justify the use of MAP versus NDCG?</li>
<li class="NL">10.Imagine a KG describing citations (we saw an example fragment of such a KG in chapter 8, on instance matching). The KG is vast and contains computer science papers. You have a user who issues queries (you may assume that the query is issued in a way that is unambiguously interpreted by the machine) such as “Find me papers that have been coauthored or authored by a scientist whose last name is Bloom and the phrase Bloom Filter occurs somewhere in the title.” If you had access to both reasoning and retrieval facilities, how would you use them to construct a system that is able to answer queries such as these?<sup><a href="chapter_11.xhtml#fn2x11" id="fn2x11-bk">2</a></sup> Would there be any advantage to your approach over using a pure retrieval- or reasoning-based system?</li>
</ul>
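<p class="noindent">As a starting point for exercises 5 through 9, the evaluation metrics involved can be sketched in a few lines of Python. This is a minimal sketch, not the chapter's reference implementation: note in particular that DCG is defined slightly differently across texts (this version uses a log<sub>2</sub>(<i>k</i> + 1) discount at every rank), so verify the discount against the definition given earlier in the chapter before using it to check your answers.</p>

```python
import math

def recall(retrieved, relevant):
    """Fraction of relevant items that appear in the retrieved set."""
    retrieved_set = set(retrieved)
    return sum(1 for d in relevant if d in retrieved_set) / len(relevant)

def average_precision(ranking, relevant):
    """AP: mean of precision@k over ranks k where a relevant item occurs,
    divided by the total number of relevant items."""
    relevant = set(relevant)
    hits, precision_sum = 0, 0.0
    for k, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)

def dcg(gains):
    """Discounted cumulative gain with a log2(k + 1) discount."""
    return sum(g / math.log2(k + 1) for k, g in enumerate(gains, start=1))

def ndcg(ranking, relevance):
    """DCG of the ranking divided by the DCG of the ideal ranking
    (the normalization factor Z_k). `relevance` maps items to graded
    relevance scores; unlisted items score 0."""
    gains = [relevance.get(d, 0.0) for d in ranking]
    ideal = sorted(relevance.values(), reverse=True)
    ideal += [0.0] * (len(gains) - len(ideal))  # pad to ranking length
    return dcg(gains) / dcg(ideal)

def mean_average_precision(rankings, relevant_sets):
    """MAP: the AP averaged over all queries."""
    aps = [average_precision(r, rel) for r, rel in zip(rankings, relevant_sets)]
    return sum(aps) / len(aps)
```
<p class="noindent">With binary relevance (1 for relevant, 0 otherwise), for example, a ranking [<i>d</i><sub>1</sub>, <i>d</i><sub>2</sub>, <i>d</i><sub>3</sub>] with <i>d</i><sub>2</sub> as the only relevant item yields an AP of 0.5 (one relevant item, found at rank 2) and an NDCG of 1/log<sub>2</sub> 3 ≈ 0.63.</p>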
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_11.xhtml#fn1x11-bk" id="fn1x11">1</a></sup>We do not provide a formal description of modus ponens herein. Stated in plain English, it is a rule in logic where the hypothesis is a premise <i>P</i>, the statement <i>P → Q</i>, and the conclusion is <i>Q</i>. For example, given the premise that today is Monday (<i>P</i>), and the truth of the statement “If today is Monday, then today is the first day of the week,” modus ponens allows us to deduce that today is the first day of the week.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_11.xhtml#fn2x11-bk" id="fn2x11">2</a></sup>The example using the Bloom filter (which actually exists) is only indicative, and meant to help you think about possibilities where keywords overlap. You should be thinking about the general case when answering this question.</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html>