<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch1" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch1"><span aria-label="3" id="pg_3" role="doc-pagebreak"/>1</h1>
<h1 class="chapter-title"><b>Introduction to Knowledge Graphs</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b>   Graph theory has long occupied an important place in computer science and mathematics, and the subject has become more relevant recently due to a resurgence of interest in the study of complex systems. In this decade, knowledge graphs (KGs) have emerged as the latest instance of using graphs for representing (and reasoning over) data that can be semistructured and web scale in origin and potentially marked by conflicts and inconsistencies. KGs as we understand them today became popular after the Google Knowledge Graph was introduced via an official blog post in 2012. In this chapter, we introduce KGs and provide some insight into why they have emerged as powerful and popular tools in multiple communities.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec1-1"/><b>1.1 Graphs</b></h2>
<p class="noindent">Graph theory has long had a close connection with computer science, though its origins are undoubtedly mathematical, going back to 1735, when Leonhard Euler first proposed a solution to the Königsberg Bridge Problem. As the story behind that problem goes, Königsberg, a quaint town in Prussia in the eighteenth century, was divided by the Pregel River into four divisions. Seven bridges spanned the river at various points, permitting crossing from one division into another (<a href="chapter_1.xhtml#fig1-1" id="rfig1-1">figure 1.1</a>). According to folklore, the citizens of the city wondered whether it was possible to walk around the city while crossing each of the seven bridges <i>only once</i>. The starting point of such a route, if it existed, would not matter because the route would be a circuit. At the time, the citizens could not show such a route, but they also could not prove that such a route did not exist.</p>
<div class="figure">
<figure class="IMG"><a id="fig1-1"/><img alt="" src="../images/Figure1-1.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig1-1">Figure 1.1</a>:</span> <span class="FIG">A version of the Königsberg Bridge Problem [as illustrated by Euler (1741) himself in his figure 1 from <i>Solutio problematis ad geometriam situs pertinentis</i>, Eneström 53]. On the right, the problem is modeled as a graph, with divisions as nodes, and bridges as edges. Figure obtained from MAA Euler Archive (copyright expired).</span></p></figcaption>
</figure>
</div>
<p>By modeling the problem as a graph, where divisions were nodes in the graph and the bridges were edges, Euler was able to prove the now-famous theorem that an <i>Eulerian cycle</i> can exist in a connected graph if and only if every node has an even degree, with the degree of a node being the number of edges incident upon that node. In the graph representation of the problem shown at the right of <a href="chapter_1.xhtml#fig1-1">figure 1.1</a>, none of the nodes has an even degree; hence, the graph cannot possess an Eulerian cycle. The proof of the theorem is simple and elegant, demonstrating the power of using graphs to model and solve such problems.</p>
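<p>The degree argument can be checked mechanically. The following is a minimal sketch in plain Python, not a reconstruction of Euler's proof. The labels A through D for the four divisions are our own assumption (A is the division touched by five bridges), and the function tests only the even-degree condition, taking connectivity for granted.</p>

```python
from collections import Counter

# Hypothetical labels A-D for the four divisions; each tuple is one bridge.
# Repeated pairs are parallel bridges, so this is a multigraph edge list.
BRIDGES = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]


def node_degrees(edges):
    """Count how many edge endpoints touch each node."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return dict(degrees)


def has_eulerian_cycle(edges):
    """Even-degree test only; assumes the graph is connected."""
    return all(d % 2 == 0 for d in node_degrees(edges).values())


degrees = node_degrees(BRIDGES)
print(degrees)                      # A has degree 5; B, C, and D have degree 3
print(has_eulerian_cycle(BRIDGES))  # False: every degree is odd
```

<p>Running the same test on a triangle, where every node has degree 2, returns <i>True</i>, as the theorem predicts.</p>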
<p><span aria-label="4" id="pg_4" role="doc-pagebreak"/><i>Graph theory</i>, as the field came to be called (see the “Bibliographic Notes” section at the end of the chapter), became a popular area of study with the passage of time, as many scientists and mathematicians found that graphs could not only be used to model interesting phenomena in the real world (as Euler had shown), but that they were also sophisticated mathematical objects with complex properties. In modern computer science, graphs took on a life of their own several decades ago, with algorithms like Edsger Dijkstra’s shortest-path algorithm and computationally intractable problems like the traveling salesman problem now considered standard textbook material in any treatment of basic computer science theory, algorithms, and data structures. The connection of graphs to the field of artificial intelligence (AI) has been equally interesting, regardless of which lens we survey it through, including expert systems (and rule-based) research; sequential models like hidden Markov models (HMMs) and conditional random fields (CRFs); planning research (where state and action spaces are often drawn as graphs); constraint satisfaction (evidenced by the famous graph-coloring problem); databases (evidenced by the rise and study of graph databases, some of which we survey in chapter 12); networks, recommendations, and e-commerce; and the Semantic Web. It would be difficult to find an area in computer science or AI that has not been touched by graphs to at least some degree.</p>
<p>Considering this brief background on the influence of graphs, it is perhaps unsurprising that knowledge representation, reasoning, and inference would also come to be influenced by graphs. Yet, for a long time, it was not the norm to represent large data sets, such as we now see on the web, as graphs. Databases of various types, including object-oriented databases (inspired by object-oriented programming) and relational models, were more popular, and in the machine learning community, there was far more interest in mathematical structures like matrices and tensors. Even in the Semantic Web (SW), a relatively recent offshoot of the broader web community, graphs had been predominantly used to model ontologies rather than enormous data sets until the advent of the Linked Data movement in the mid-2000s (chapter 14).</p>
<p><span aria-label="5" id="pg_5" role="doc-pagebreak"/>In industry, graphs as a means of representing Big Data had almost no discernible influence till the highly publicized announcement and exposure of the Google Knowledge Graph.<sup><a href="chapter_1.xhtml#fn1x1" id="fn1x1-bk">1</a></sup> Arguably, KGs had existed for quite a while, and they were [and in communities like Natural Language Processing (NLP), still often are] referred to as <i>knowledge bases (KBs)</i>, as it was the norm to think of facts (knowledge) as entities, relations, and sets of triples rather than as a collection of nodes and edges. With the advent of the Google Knowledge Graph and its enormous influence on how we consume search results (as described in the next section), the rich relational information that ties together entities into a cohesive repository of knowledge has become much more apparent. At this time, the influence of KGs is being felt throughout society (as discussed in part V of this book), and KGs are being used in enterprise and government (chapter 15), in science (chapter 16), and even for addressing social problems like human trafficking and natural disaster response (chapter 17).</p>
<p>We introduce KGs formally in the next section by using a simple example to illustrate both the potential of and the challenges encountered when working with knowledge modeled as a graph. We then take a few examples of KGs in real-world domains.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec1-2"/><b>1.2 Representing Knowledge as Graphs</b></h2>
<p class="noindent">Within computer science, as earlier indicated, graphs are commonplace and used to address a whole variety of problems. However, within AI, an interesting problem has always existed—namely, how should we “represent” knowledge within a machine so as to best facilitate “intelligent” agents and algorithms to reason over, and “work with,” that knowledge? What <i>is</i> knowledge? These are deeply philosophical questions, with roots in the schools of epistemology, but they are also immensely practical because today, we are definitively living in an era characterized by Big Data, whether it be social media, open government data, sensor data, or e-commerce data such as used to power companies like Amazon, papers and academic research (such as found on PubMed or Semantic Scholar), and so on. To a machine, the collection of data is merely streams and sets of syntactic units such as strings. To a human being, data carries meaning (i.e., it can be said to have <i>semantics</i>). It is because of semantics that we are able to use the data to reason about novel situations, as well as to apply our knowledge in challenging scenarios. If a sufficiently rich <i>model</i> of semantics could be expressed to a machine, it could (with enough context) be used to derive sophisticated insights, such as allowing the machine to deal with uncertainty, answer questions, and even perform tasks of a causal nature, such as abduction and counterfactual reasoning.</p>
<p><span aria-label="6" id="pg_6" role="doc-pagebreak"/>Although building such a “general AI” continues to be a far-reaching goal, an important milestone in achieving that goal is that a machine should be able to assign semantics, at a reasonable level, to “strings” of characters and numbers, if such semantics exist and can be perceived by humans in “natural” settings like plain English. Graphs provide one way to do this, as illustrated in <a href="chapter_1.xhtml#fig1-2" id="rfig1-2">figure 1.2</a>. In the graph, we express our knowledge of personnel from a <i>Star Trek</i> show (namely, <i>The Next Generation</i>) by representing our <i>entities</i> of interest as nodes in the graph, and with relationships (also called <i>properties</i> or <i>relations</i>) between entities (including the <i>types</i> of those entities) expressed via directed, labeled edges. What we are looking at in <a href="chapter_1.xhtml#fig1-2">figure 1.2</a> is the first example of a KG in this book. Because of the inherent structure present in the graph, we can pose well-defined queries to the system as (potentially sophisticated) <i>graph patterns</i> (e.g., the natural-language question “Who are all the personnel serving on the Enterprise?” can be expressed as a graph pattern “?x serves_on Enterprise,” where “?x” represents a variable that needs to be bound). We will go into much more detail on such queries and the language used for graph patterns in future chapters, but the point that we hope to illustrate here is that, by making productive use of structure, we can avoid some of the ambiguity incurred through natural language.</p>
<div class="figure">
<figure class="IMG"><a id="fig1-2"/><img alt="" src="../images/Figure1-2.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig1-2">Figure 1.2</a>:</span> <span class="FIG">Representing knowledge as a graph: A simple KG fragment.</span></p></figcaption>
</figure>
</div>
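<p>To make the idea of a graph pattern concrete, here is a minimal sketch of single-pattern matching over a list of triples in plain Python. The triples are our own approximation of the KG fragment, and edge labels such as <i>captained_by</i> and <i>counselor</i> are assumptions that may differ from the labels in the figure. The matcher is illustrative only and is not the query language introduced in later chapters.</p>

```python
# Hypothetical triples approximating the KG fragment; labels are assumptions.
TRIPLES = [
    ("Data", "is-a", "Android"),
    ("Data", "serves_on", "Enterprise"),
    ("Jean Luc Picard", "is-a", "Human"),
    ("Enterprise", "captained_by", "Jean Luc Picard"),
    ("Enterprise", "counselor", "Deanna Troi"),
]


def match(pattern, triples):
    """Return one binding dict per triple that matches the pattern.

    Terms starting with '?' are variables; all other terms must match
    the corresponding position of the triple exactly.
    """
    results = []
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(bindings)
    return results


answers = match(("?x", "serves_on", "Enterprise"), TRIPLES)
print(answers)  # [{'?x': 'Data'}]
```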
<p>The example question about <i>Enterprise</i> personnel is instructive because it also highlights a crucial thing missing in the KG fragment in <a href="chapter_1.xhtml#fig1-2">figure 1.2</a>. Here, our human knowledge tells us that Jean Luc Picard, Deanna Troi, and Data are all serving on the <i>Enterprise</i>, although only the roles of Jean Luc Picard and Deanna Troi are directly specified in the KG (as captain and counselor, respectively). However, if we were to execute the graph pattern query shown here <i>as is</i>, we would only get back “Data” as a result. Clearly, something beyond the explicit KG is required in order for all three answers to be returned. Specifically, the system needs to know that if a ship is captained by X, then X serves on the ship, and similarly for other “roles” such as counselor. We impart this kind of domain-specific (and <span aria-label="7" id="pg_7" role="doc-pagebreak"/>in more universal situations,<sup><a href="chapter_1.xhtml#fn2x1" id="fn2x1-bk">2</a></sup> “open-domain”) knowledge by modeling the domain as an <i>ontology</i>. We already see the vestiges of an ontology, even in <a href="chapter_1.xhtml#fig1-2">figure 1.2</a>. For example, we see the <i>concepts</i> (also called <i>classes</i> or <i>types</i>) “Android,” “Human,” and so on. Entities in the KG are typed according to these concepts using the special relation <i>is-a</i>.</p>
<p>Actual ontologies are much more powerful, with hierarchies of classes (e.g., that <i>Capital</i> is a subclass of <i>City</i>), properties (e.g., the <i>serves_on</i> property is a superproperty of another property, <i>captains</i>, not shown in <a href="chapter_1.xhtml#fig1-2">figure 1.2</a>), declarations that one property is related to another through a metarelation like <i>inverse</i> (e.g., <i>captains</i> is the inverse of <i>captained_by</i>), and other axiomatic declarations that help reasoners deduce additional knowledge that is not explicitly declared as facts in the KG.</p>
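<p>The effect of such axioms can be sketched with a tiny forward-chaining loop. This is a toy illustration under our own assumptions, not a real reasoner: the facts, the <i>counsels</i> property, and the two axiom tables are hypothetical, and the inverse axiom is applied in one direction only.</p>

```python
# Hypothetical explicit facts (property names are assumptions).
FACTS = {
    ("Data", "serves_on", "Enterprise"),
    ("Enterprise", "captained_by", "Jean Luc Picard"),
    ("Deanna Troi", "counsels", "Enterprise"),
}

# Hypothetical axioms: captained_by has inverse captains, and both
# captains and counsels are subproperties of serves_on.
INVERSE_OF = {"captained_by": "captains"}
SUBPROPERTY_OF = {"captains": "serves_on", "counsels": "serves_on"}


def infer(facts):
    """Apply the axioms repeatedly until no new triple is produced."""
    known = set(facts)
    while True:
        new = set()
        for h, r, t in known:
            if r in INVERSE_OF:
                new.add((t, INVERSE_OF[r], h))
            if r in SUBPROPERTY_OF:
                new.add((h, SUBPROPERTY_OF[r], t))
        if new.issubset(known):
            return known
        known |= new


closure = infer(FACTS)
crew = sorted(h for h, r, t in closure
              if r == "serves_on" and t == "Enterprise")
print(crew)  # ['Data', 'Deanna Troi', 'Jean Luc Picard']
```

<p>With the axioms applied, the pattern “?x serves_on Enterprise” is satisfied by all three crew members rather than by Data alone.</p>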
<p><span aria-label="8" id="pg_8" role="doc-pagebreak"/>The example in <a href="chapter_1.xhtml#fig1-2">figure 1.2</a> may not seem to be “doing much,” so to speak, but as a thought experiment, imagine that we are able to represent our domain knowledge in such a structured format, and also that we have a sufficiently rich ontology to model such knowledge. At web scale, the representation would allow us to pose increasingly complex queries and make sophisticated AI applications feasible. The most striking example of this is the web search, as evidenced by the Google Knowledge Graph mentioned earlier. Not very long ago, if a search were executed for a phrase such as “places to visit in Los Angeles,” the Google search engine (and other generic search engines as well) would simply return a ranked list of webpages. It then would be up to the user to browse through these pages and put together answers to their original question. Worse, good answers may not even be retrieved if the system did not somehow know (e.g., through search logs) that when people refer to “Los Angeles,” they are typically referring to Los Angeles <i>County</i> rather than the actual city of Los Angeles, which is only a small part of the greater metropolitan area.</p>
<p>By modeling entities and relations as shown in <a href="chapter_1.xhtml#fig1-2">figure 1.2</a> and using that information with other powerful correlational tools like mining of web search logs and personalization, it became possible to interpret the query more deeply by using the KG to recognize that Los Angeles is a city and a county, and that the latter intent is more common when users search for tourist venues in Los Angeles. However, in more recent times, the Google KG has managed to go even further (<a href="chapter_1.xhtml#fig1-3" id="rfig1-3">figure 1.3</a>) by directly providing a list of places to visit in LA (in response to the query), showing that it is able to grasp the semantics of the query better than string-matching approaches would have done. Searching has gotten more sophisticated and seamless as a result, and in many cases (as in this example), it may not even be necessary to click on a webpage link or go beyond the search engine for a response to the query. Tourist attractions are hardly the only examples of this; we see similar search results when searching for movies or concerts. When searching for a famous entity like a celebrity or geographical entity, Google also displays knowledge panels (<a href="chapter_1.xhtml#fig1-4" id="rfig1-4">figure 1.4</a>), some of the content of which is derived from sources like Wikipedia. All of these search facilities are being powered, at least in part, by a large-scale proprietary KG.</p>
<div class="figure">
<figure class="IMG"><a id="fig1-3"/><img alt="" src="../images/Figure1-3.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig1-3">Figure 1.3</a>:</span> <span class="FIG">Results displayed by Google in response to queries such as “places to visit in Los Angeles,” which include not just a ranked list of webpages, but actual entities corresponding to user intent.</span></p></figcaption>
</figure>
</div>
<div class="figure">
<figure class="IMG"><a id="fig1-4"/><img alt="" src="../images/Figure1-4.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig1-4">Figure 1.4</a>:</span> <span class="FIG">A knowledge panel describing the first movie in the <i>Lord of the Rings</i> series, in response to the keyword query “lord of the rings.”</span></p></figcaption>
</figure>
</div>
<p>Another important motivation for constructing and using KGs is that words and names can be deeply ambiguous and dependent on the <i>current</i> context of usage, which would include the user (and their past search history) initiating the search, but also their immediately preceding search. In the original blog post on the Google Knowledge Graph, “Taj Mahal” was used to illustrate this ambiguity because the name could be used to denote the famous monument in India, a local restaurant in any number of cities and areas, a casino in Atlantic City, New Jersey, and so on. If Google knew that the user happened to be in the Indian city of Agra at the moment when the search was executed, then the monument is the “relevant” (i.e., intended and desired) result. However, if the immediate search history involved restaurants in Los Angeles, then a nearby restaurant called Taj Mahal might be the relevant result. If no context is available, external information such as other people’s <span aria-label="9" id="pg_9" role="doc-pagebreak"/>search histories could serve as an important prior. In practice, a combination of these factors, honed over time and at large scale using pragmatic and proprietary techniques, gives modern search engines a powerful edge in fulfilling everyday informational needs.</p>
<p><span aria-label="10" id="pg_10" role="doc-pagebreak"/>What these cases are intended to illustrate is that “Taj Mahal” and “Los Angeles” are ultimately <i>things, not strings</i>, and need to be <i>interpreted</i> as such by a machine to enable truly powerful applications. The primary application invoked here is called <i>semantic search</i>, but others exist, as described in part V, where KG ecosystems with significant footprints are detailed. In the next few sections, we provide a brief glimpse into KGs being used in some of these other influential domains.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec1-3"/><b>1.3 Examples of Knowledge Graphs</b></h2>
<p class="noindent">This section provides a pragmatic view of how KGs are employed in the real world. Mostly, we rely on typical examples predominantly cited in the literature and used in industry, although some domains have achieved more mainstream success than others (e.g., e-commerce versus geopolitical events). We use these domains to illustrate the various aspects involved in constructing and applying KGs in the real world.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec1-3-1"/><b>1.3.1 Example 1: Scientific Publications and Academics</b></h3>
<p class="noindent">As a first example of a domain-specific KG, let us consider an academic domain (<a href="chapter_1.xhtml#fig1-5" id="rfig1-5">figure 1.5</a>). The two nodes in the center of the KG represent different publications. Some important <span aria-label="11" id="pg_11" role="doc-pagebreak"/>details concerning the publications are also shown, including their authors, dates of publication, and venues.</p>
<div class="figure">
<figure class="IMG"><a id="fig1-5"/><img alt="" src="../images/Figure1-5.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig1-5">Figure 1.5</a>:</span> <span class="FIG">An illustration of an academic KG, showing two publications with overlapping authors. While oval nodes represent resources or entities, rectangles are used to represent literals, such as strings and numbers.</span></p></figcaption>
</figure>
</div>
<p>Despite its simplicity, the KG in <a href="chapter_1.xhtml#fig1-5">figure 1.5</a> illustrates some of the expressiveness in <i>representation</i>, which is a much studied research area in communities such as the Semantic Web. The oval nodes in the KG represent <i>entities</i> or <i>resources</i> and are generally identified (in the SW community) using internationalized resource identifiers (IRIs), a generalized form of uniform resource identifiers (URIs). We define these concepts more precisely in the next chapter, so we are limiting our current focus to the overall distinction between entities and literals (also known as <i>attributes</i>). Entities can have relationships with other entities (such as between authors and their publications) or attributes (such as the year of a publication). The distinction can be expressed by the fact that in a triple (<i>h, r, t</i>), <i>t</i> is either a literal (for the latter) or an entity (for the former). In SW representations of KGs, <i>h</i> is always an entity, and can never be a literal.</p>
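<p>The constraint on triples can be written down as a small data model. The sketch below is one possible encoding in Python, with class names, property names, and <i>example.org</i> IRIs that are our own placeholders rather than anything defined in this book.</p>

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Entity:
    iri: str  # identifier of the resource


@dataclass(frozen=True)
class Literal:
    value: Union[str, int, float]  # a plain string or number


@dataclass(frozen=True)
class Triple:
    h: Entity
    r: str
    t: Union[Entity, Literal]

    def __post_init__(self):
        # The head must be an entity; only the tail may be a literal.
        if not isinstance(self.h, Entity):
            raise TypeError("the head of a triple must be an entity")


# Placeholder IRIs and property names, for illustration only.
paper = Entity("http://example.org/paper/1")
author = Entity("http://example.org/person/1")
kg = [
    Triple(paper, "has_author", author),   # entity-to-entity relationship
    Triple(paper, "year", Literal(2015)),  # entity-to-literal attribute
]

attribute_triples = [x for x in kg if isinstance(x.t, Literal)]
print(len(attribute_triples))  # 1
```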
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec1-3-2"/><b>1.3.2 Example 2: E-Commerce, Products, and Companies</b></h3>
<p class="noindent">In the second example, inspired by the <i>products and e-commerce</i> domain, we expand upon the notions presented earlier for the academic domain. Once again, we see the distinction between literals and entities, but as illustrated in <a href="chapter_1.xhtml#fig1-6" id="rfig1-6">figure 1.6</a>, there are numerous degrees of freedom, even when modeling the most basic structures in KGs. For example, the price of a product is modeled in a relatively simple way (as a relation incident upon a numeric literal object), but the rating is more complex. Potentially, the <i>Rating_1</i> resource shown in the KG could be used to define outgoing properties expressing not just the numeric <i>rating</i>, but also the review text, the person or website that provided the rating, and so on. We bring this up to illustrate that the choice of modeling can have implications for both upstream tasks <span aria-label="12" id="pg_12" role="doc-pagebreak"/>(such as information extraction, used for constructing the KG in the first place, given such a model or <i>ontology</i>) and downstream tasks, such as instance matching (IM) and question answering, that become relevant after the initial KG has been extracted and made available for access. Availability of information can also vary, usually depending on the raw data sources from which the KG was extracted.</p>
<div class="figure">
<figure class="IMG"><a id="fig1-6"/><img alt="" src="../images/Figure1-6.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig1-6">Figure 1.6</a>:</span> <span class="FIG">An illustration of a <i>Product</i> KG, showing two different products.</span></p></figcaption>
</figure>
</div>
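<p>The two modeling styles discussed above, and their consequences for downstream code, can be sketched as follows. All identifiers and values (<i>Product_1</i>, <i>has_rating</i>, the review text, and so on) are hypothetical placeholders and are not read off the figure.</p>

```python
# Style 1: the rating is a numeric literal attached directly to the product.
SIMPLE = [
    ("Product_1", "price", 499.0),
    ("Product_1", "rating", 4.5),
]

# Style 2: the rating is its own resource, which can carry more properties.
RICH = [
    ("Product_1", "price", 499.0),
    ("Product_1", "has_rating", "Rating_1"),
    ("Rating_1", "rating", 4.5),
    ("Rating_1", "review_text", "Works as advertised."),
    ("Rating_1", "rated_by", "Website_1"),
]


def objects(triples, subject, prop):
    """All tails t such that (subject, prop, t) is in the triples."""
    return [t for h, r, t in triples if h == subject and r == prop]


def rating_values(triples, product):
    """Collect numeric ratings under either modeling style."""
    direct = objects(triples, product, "rating")
    via_node = [
        value
        for node in objects(triples, product, "has_rating")
        for value in objects(triples, node, "rating")
    ]
    return direct + via_node


print(rating_values(SIMPLE, "Product_1"))  # [4.5]
print(rating_values(RICH, "Product_1"))    # [4.5]
```

<p>A downstream task that assumed only the first style would silently miss every rating modeled in the second, which is one reason the choice of ontology matters.</p>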
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec1-3-3"/><b>1.3.3 Example 3: Social Networks</b></h3>
<p class="noindent"><a href="chapter_1.xhtml#fig1-7" id="rfig1-7">Figure 1.7</a> shows yet another common domain (namely, social networks). Note that, in this example, “social network” does not mean a particular social network company or website like Facebook or Twitter (although an example could also be constructed for a website-specific KG), but an actual social network as it exists in the offline world between a group of fictional individuals. As the example shows, the social network illustrated therein can be used to express a range of complex relationships using the power of directed labeled edges, as well as shared relationships between entities of different types (people and organizations, but also organizations and locations such as <i>&lt;</i>Google, headquarters, Mountain View&gt;). This could be generalized even further to include fine-grained concepts, including schools, organizations, clubs, and alumni associations, as well as fine-grained relations such as membership, family relations, and professional hierarchical relations (e.g., supervised_of, project_lead). In fact, there is already a famous vocabulary called Friend of a Friend (FOAF) that contains many of these concepts and relations and can be used to express social domain KGs without inventing a new ontology.</p>
<div class="figure">
<figure class="IMG"><a id="fig1-7"/><img alt="" src="../images/Figure1-7.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig1-7">Figure 1.7</a>:</span> <span class="FIG">An illustration of a social network, illustrated as a KG.</span></p></figcaption>
</figure>
</div>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec1-3-4"/><b>1.3.4 Example 4: Geopolitical Events</b></h3>
<p class="noindent">Finally, we consider one of the more complex examples of a KG (specifically, in the geopolitical domain). Along with the constructs shown previously (such as the difference between entities and literals), the KG in <a href="chapter_1.xhtml#fig1-8" id="rfig1-8">figure 1.8</a> illustrates how “second-order” entities like events <span aria-label="13" id="pg_13" role="doc-pagebreak"/>can be modeled and represented. We refer to events as “second-order entities” because they have first-order entities like locations and times as their arguments. These first-order entities are themselves described further using attributes. Events may also be directly attributed and express relations between each other and to other entities. While precisely defining what separates an event from a (potentially <i>n</i>-ary, rather than a binary) relation in an ontology is a semantic rather than a syntactic issue, the practical differences are very real because extracting and resolving events with high quality (compared to extracting ordinary entities) constitutes an active area of research. One of the best-known examples of a large-scale event KG that is also publicly available is the Global Database of Events, Language, and Tone (GDELT).</p>
<div class="figure">
<figure class="IMG"><a id="fig1-8"/><img alt="" src="../images/Figure1-8.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig1-8">Figure 1.8</a>:</span> <span class="FIG">A fragment of an <i>Event</i> KG expressing distinct geopolitical phenomena.</span></p></figcaption>
</figure>
</div>
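<p>One common way to fit an <i>n</i>-ary event into a graph of binary edges is to give the event its own node and attach each argument to it. The sketch below illustrates that idea only. The helper <i>reify_event</i>, the role names, and the placeholder identifiers are our own assumptions and do not reflect the schema of GDELT or of the figure.</p>

```python
def reify_event(event_id, event_type, arguments):
    """Turn one n-ary event into binary triples centered on an event node.

    arguments maps a role name (e.g., 'location') to its filler.
    """
    triples = [(event_id, "is-a", event_type)]
    for role, filler in sorted(arguments.items()):
        triples.append((event_id, role, filler))
    return triples


# Placeholder identifiers, for illustration only.
event_triples = reify_event(
    "Event_1",
    "Protest",
    {"location": "City_1", "date": "2019-01-15", "actor": "Group_1"},
)
for triple in event_triples:
    print(triple)
```

<p>Because the event is now a node in its own right, it can carry attributes of its own and take part in relations with other events, which a single labeled edge cannot do.</p>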
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="14" id="pg_14" role="doc-pagebreak"/><a id="sec1-4"/><b>1.4 How to Read This Text</b></h2>
<p class="noindent">This text is broken into five parts following a “natural” order, but many of them can be read out of order depending on the reader’s familiarity with KGs. Part I serves as an introduction to the core elements that make a graph a <i>knowledge</i> graph. Chapter 2, the only other chapter in this part, is heavier on technical details, including on models like Resource Description Framework (RDF), and is recommended reading to place the rest of the work in context, though not all the sections in that chapter are necessary.</p>
<p>Part II is self-contained and covers much of KG construction (KGC), including <i>domain discovery</i>, an important and relatively understudied area that covers how to <i>find</i> relevant data for <i>domain-specific</i> KGC to begin with.</p>
<p>Rarely is KGC adequate in terms of data quality and coverage. Often, there are missing links in the KG, and also many links are noisy. Part III covers <i>KG completion</i> (also called <i>identification</i> and <i>refinement</i>), which covers both well-studied techniques such as instance matching and more recent (often, neural network–based) KG-embedding (KGE) methods. Although part III is independent of part II, some of the concepts in part III will be understandable only in the context of the kinds of noise (and other problems) that are described in part II.</p>
<p>Part IV assumes that the KG has been constructed and completed, and it deals with the problem of querying and performing analytics on the KG. Some chapters, especially the more classically oriented (e.g., logic-based) ones, will be more demanding than others, and they may require a review of the representational elements in chapter 2. Others will make fewer assumptions and tend to operate better in real-world systems and settings, but offer fewer theoretical guarantees. Personal preferences aside, each of these “schools” of querying comes with its own incumbent set of pros and cons, and which one to apply depends on the skills, design, and objectives of the system architect.</p>
<p>Finally, part V, which is largely nontechnical and more descriptive than the rest of the book, describes the fruits of KG research—namely, <i>KG ecosystems</i>. This part is designed to be read by any reasonably knowledgeable practitioner or researcher with some familiarity <span aria-label="15" id="pg_15" role="doc-pagebreak"/>with KGs (although chapter 2 is still recommended), but the technical depth of material covered in parts II–IV is certainly not required.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec1-5"/><b>1.5 Concluding Notes</b></h2>
<p class="noindent">A KG is a practical and machine-readable way of representing information about the world, including entities, relationships, attributes, facts, beliefs, and even provenance, including justifications and uncertainty. In part, the proliferation and use of KGs has been precipitated by the difficulty of getting machines to “directly” understand modalities such as natural language. Indeed, it is controversial how our brains process such input; it is certainly not outside the realm of possibility that we interpret the world in relatively structured ways, although the “natural” structure that we are comfortable with may differ significantly from the mathematical structure preferred by computer programs. By first “constructing” (and “completing”; see part III) KGs from raw data (part II), and then instituting systems that allow us to “use” the constructed KG, whether through structured querying or more natural question answering (part IV), the research community as a whole has been able to build powerful systems and applications with an underlying KG as the central representation (part V).</p>
<p>In practice, however, we must also deal with the uncomfortable notion that a KG is still not very well defined (which makes KG representation challenging because no one representation can be held to be the “correct” one), and also that it is rarely the case that the structure of a KG (the “ontology”), or the corpus over which the KG must be constructed, is either known or fixed. In addition to dealing with the dynamic nature of the world, we need to interpret the world not just as a collection of entities and relationships but also as a collection of entities, relationships, and complex, higher-order entities such as <i>events</i>. These are advanced topics of research, although some consensus is starting to emerge in topics like event extraction from text.</p>
<p>It is safe to say that KGs will continue to thrive as a research agenda in the foreseeable future. Entire KG ecosystems already exist, as described in part V, spanning the spectrum from academia to industry and e-commerce. As these ecosystems start to converge and collaborate, the potential for groundbreaking research continues to increase. This is an exciting era for KGs.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec1-6"/><b>1.6 Software and Resources</b></h2>
<p class="noindent">Because this is an introductory chapter, we are not providing pointers to KG-specific resources just yet. However, one of the open-source packages that a layperson in graph theory or KGs may find useful is <i>NetworkX</i>. NetworkX, the full project page of which can be accessed at <a href="https://networkx.github.io/">https://<wbr/>networkx<wbr/>.github<wbr/>.io<wbr/>/</a>, is a Python package for the “creation, manipulation, and study of the structure, dynamics, and functions of complex networks.” In the <span aria-label="16" id="pg_16" role="doc-pagebreak"/>context of this book, a KG may be seen as a special kind of complex network because it is expressing knowledge in all its interconnectedness. Some of the advantages of working with NetworkX are that it is easy to set up in a Python environment, has been released under an open-source, 3-clause<sup><a href="chapter_1.xhtml#fn3x1" id="fn3x1-bk">3</a></sup> Berkeley Software Distribution (BSD) license, and has been well tested with over 90 percent code coverage at this time. The package is also very flexible and versatile, allowing data structures for graphs, directed graphs, and multigraphs (and also <i>attributed graphs</i>, which are important for implementing KGs from a graph-theoretic perspective), many well-known graph algorithms (such as Dijkstra’s shortest-path algorithm), and many measurements and diagnostics for analyzing network structure. It also includes generators for synthetic networks, random graphs, and classic graphs.</p>
<p>NetworkX is a good package, therefore, for becoming familiar with graph theory, and even building some toy KGs. However, it should be noted that it is not a standard package for building, or querying, real-world KGs; its primary use-case is in the network science realm. For example, it is not designed for querying, which is extremely important for getting useful information out of the KG, as we will cover in depth in part IV. Most real-world KGs are modeled using languages like RDF. Such models will be the subject of the next chapter.</p>
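<p>To make the idea of a toy KG concrete, here is one possible sketch that stores triples in a NetworkX directed multigraph, using the edge key as the relation label and node attributes for entity types. The entities, the relation names, and the helper function match_triples are hypothetical and chosen only for illustration. The helper is a linear scan over all edges, not a query language, which is consistent with the caveat above that NetworkX is not designed for querying.</p>

```python
import networkx as nx

# A toy KG as an attributed, directed multigraph.
# All entity and relation names below are illustrative assumptions.
kg = nx.MultiDiGraph()
kg.add_node("Apple", type="Company")
kg.add_node("iPhone", type="Product")
kg.add_node("Cupertino", type="City")

# The edge key doubles as the relation label, so two entities can be linked
# by several different relations at once.
kg.add_edge("Apple", "iPhone", key="manufacturer_of")
kg.add_edge("Apple", "Cupertino", key="headquartered_in")


def match_triples(graph, subject=None, relation=None, obj=None):
    """Return (subject, relation, object) triples matching the given fields.

    A field left as None acts as a wildcard. This hypothetical helper scans
    every edge, so it only suits small, toy-sized graphs.
    """
    return [
        (s, r, o)
        for s, o, r in graph.edges(keys=True)
        if subject in (None, s) and relation in (None, r) and obj in (None, o)
    ]


print(match_triples(kg, relation="manufacturer_of"))
print(match_triples(kg, subject="Apple"))
print(kg.nodes["iPhone"]["type"])
```

<p>Each pattern costs time proportional to the number of edges, which is one practical reason real-world KGs rely on dedicated data models and query facilities instead.</p>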
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec1-7"/><b>1.7 Bibliographic Notes</b></h2>
<p class="noindent">We started this chapter by introducing the motivations behind graph theory, going all the way back to Euler and the Königsberg Bridge Problem. For a refresher on the problem, we encourage the interested reader to review Hopkins and Wilson (2004). Considering its long history, graph theory is a mature area of study in mathematics by now (although many interesting extensions have been proposed, and it continues to see research activity in mathematics under different guises), and several texts and handbooks are available for an interested and mathematically oriented reader. These include the introductory text by West (1996), the handbook by Gross and Yellen (2004), and going even further back, the work by Bondy et al. (1976) for a more application-oriented treatment.</p>
<p>The computer science and algorithms research community, as we noted in the introduction, has long had a close relationship with graphs. Now-classic work includes the search algorithms by Tarjan (1972) and Dijkstra (1959), to name just two. Because this is textbook material now, any preliminary treatment of graphs and algorithms such as shortest paths can be found in a good text in computer science, algorithms, computational complexity, or data structures. Some examples of such introductory works can be found in Frakes (1992) and Shaffer (1997). More recent analysis in computer science, though still fairly mature (going back to the 1990s and 2000s), has looked at interesting and fast ways of applying <span aria-label="17" id="pg_17" role="doc-pagebreak"/>some of these algorithms, especially Dijkstra’s algorithm, including partitioning graphs, parallelization, and complexity analysis for such variants as graphs with weighted vertices. A good review of these may be found in the respective works by Möhring et al. (2007), Crauser et al. (1998), and Barbehenn (1998), among many others. For the reader who is looking for a good introduction to graph algorithms specifically, we recommend Even (2011).</p>
<p>Well before KGs, and quite disjoint from the algorithms community, researchers in physics and the social sciences turned to graphs for modeling networks, which are useful in the study of complex systems, including protein-protein interactions, the growth of the web, and even ecological systems like the food chain in an ecological niche. <i>Network science</i>, as this field is called, is an actively researched, standard framework for studying complex systems that possess structure; see Barabási et al. (2016). Such systems, as evidenced in a range of works including Gavin et al. (2002), Hummon and Doreian (1989), and Borgatti et al. (2009), include networks of protein-protein interactions, citation networks, and social networks, to name just a few. Recent research has led to many exciting advances in the construction and study of complex networks, especially from Big Data. For example, Chen and Redner (2010) study the community structure of the <i>Physical Review</i> citation network from the mid-1890s to 2007. Other domain-specific examples include the study by Li et al. (2007) of patent citation networks in nanotechnology and the study by Greenberg (2009) of the creation and influence of citation distortions.</p>
<p>Another highly active subarea of research in network science, and (arguably) one of the original motivations for employing network science as a scientific methodology for studying structure, is social networks. Work in this area can be traced back to at least the 1940s (and possibly beyond), when Moreno (1946) first proposed the “sociogram” as a way of studying such systems at a structural level. Since then, there have been tens of thousands of papers and articles on the subject; a standard, highly comprehensive treatment of social network analysis was provided by Wasserman and Faust (1994), along with a more recent book by Knoke and Yang (2008). More recently, pioneering work in this area includes a study of networks, crowds, and markets by Easley et al. (2010); social tie inference in heterogeneous networks by Tang et al. (2012); prediction of positive and negative links in social networks by Leskovec et al. (2010); and even ethics- and privacy-related challenges in mining social network data by Kleinberg (2007). Other important applications of network science include bioinformatics, with research ranging from studies in systems pharmacology by Berger and Iyengar (2009) to tools designed for fast network motif detection, as evidenced by Wernicke and Rasche (2006) and Schreiber and Schwöbbermeyer (2005), and tens (if not many hundreds) of other papers, many quite recent.</p>
<p>We cite all the previous works to make clear that there is considerable precedent, well before KGs were proposed, for using graphs to solve important problems and to study myriad phenomena. Hence, it is not surprising that communities <span aria-label="18" id="pg_18" role="doc-pagebreak"/>like information retrieval and natural language processing would eventually be influenced by the promise of building, querying, and doing machine learning on large-scale graphs derived from sources like documents, tables, and webpages. Yet, perhaps because this spurt of research activity has emerged from different places and communities, there is no cohesive treatment of much of this research that explores KGs qua KGs. The original blog post touting the Google Knowledge Graph as treating queries as “things, not strings” is still available online and can be accessed at https://<wbr/>www<wbr/>.blog<wbr/>.google<wbr/>/products<wbr/>/search<wbr/>/introducing<wbr/>-knowledge<wbr/>-graph<wbr/>-things<wbr/>-not<wbr/>/ [at least as of late 2019; an official citation is Singhal (2012)]. Some other surveys and studies of KGs and their influence include a recent brief by Kejriwal (2019) that is an extended technical survey on a specific subarea (domain-specific KG construction), as well as Qi et al. (2020). In addition, a book on KG methodology, tools, and selected use-cases was published by Fensel et al. (2020).</p>
<p>There is no textbook specifically on KGs to the best of our knowledge, and in that sense, we hope that this book will serve as a standard for how to organize one. No doubt, as research on all the different areas we cover in this book continues at a breakneck pace, some of the material will have to be updated or supplemented with each independent attempt or edition. However, for those looking for perspectives on KGs, or a somewhat broad overview, we recommend Paulheim (2017) on KG refinement, as well as a paper by Ehrlinger and Wöß (2016) that seeks to define a KG more precisely. A much more recent review on KGs was provided by Hogan et al. (2020), and another handy resource is Yan et al. (2018). Although Nickel et al. (2015) delve much deeper into the relational machine learning side of KGs, their review is also a good place to get a gentle overview of KGs and their uses. For a survey and case study of enterprise KGs, we recommend Jetschni and Meister (2017). We cite more material on individual aspects of KG research in the upcoming chapters.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec1-8"/><b>1.8 Exercises</b></h2>
<ul class="numbered">
<li class="NL">1. To understand the general applicability of Euler’s theorem (recall that the theorem states that a connected graph is Eulerian if and only if all vertices have even degrees), consider the three graphs on the next page. Do any of the graphs have Eulerian circuits? If so, label the edges and state the sequence of edges in the circuit starting and ending at node A.</li>
<li class="NL">2. What is the minimum number of “edges” you have to remove from the underlying graph representation of the Königsberg Bridge Problem such that an Eulerian circuit exists? Can you do it without violating connectedness? Draw the resulting graph.</li>
<li class="NL">3. You’re trying to explain the Eulerian circuit problem to a friend. You show him the example of a connected graph where such a circuit does not exist (such as the Königsberg Bridge Problem) and tell him, “There is no way to start from node A and return to node A while traversing every edge in the graph exactly once.” The friend responds that he <span aria-label="19" id="pg_19" role="doc-pagebreak"/>agrees with you, but that does not mean that such a “circuit” cannot exist for another node B. Write a simple proof by contradiction showing why the validity of your statement, in fact, does imply that such a circuit cannot exist for any other node in the graph.</li>
</ul>
<figure class="IMG"><img alt="" src="../images/pg19-1.png" width="450"/>
</figure>
<ul class="numbered">
<li class="NL">4. Express as a KG the passage “Michael’s lawyer and friend, John Robbins, represented him on a felony charge in the District Court of Oz. Oz is the capital of Neverland. In attendance at the court were the Cowardly Lion and the Wicked Witch of the West.” How many relations and entities are in your KG? Are there events in your KG?</li>
<li class="NL">5. Consider again the product KG that was introduced earlier in the chapter. How would you modify the graph (i.e., by adding, changing, or removing nodes or edges) in order to introduce competitor relationships between products (e.g., a Samsung phone and the iPhone)? Also, is it true that Apple is the “manufacturer” of the iPhone? Is there any way in which you can make the relationship between Apple and iPhone more precise?</li>
</ul>
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_1.xhtml#fn1x1-bk" id="fn1x1">1</a></sup> https://<wbr/>www<wbr/>.blog<wbr/>.google<wbr/>/products<wbr/>/search<wbr/>/introducing<wbr/>-knowledge<wbr/>-graph<wbr/>-things<wbr/>-not<wbr/>/.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_1.xhtml#fn2x1-bk" id="fn2x1">2</a></sup> For example, if X is a <i>city</i>, then X is also a <i>location</i>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_1.xhtml#fn3x1-bk" id="fn3x1">3</a></sup> <a href="https://opensource.org/licenses/BSD-3-Clause">https://<wbr/>opensource<wbr/>.org<wbr/>/licenses<wbr/>/BSD<wbr/>-3<wbr/>-Clause</a>.</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html>