<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" xml:lang="en-US">
<head>
<title>Designing and Building Enterprise Knowledge Graphs</title>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<link href="../styles/page-template.xpgt" rel="stylesheet" type="application/vnd.adobe-page-template+xml"/>
<meta content="urn:uuid:81982e4f-53b2-476f-ab11-79954b0aab3c" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<section epub:type="chapter">
<h1 class="chno" epub:type="title"><span epub:type="pagebreak" id="page_1" title="1"/>CHAPTER 1</h1>
<h1 class="chtitle" epub:type="title">Introduction</h1>
<p class="noindent">Enterprise data management needs to evolve. Consider the following example:</p>
<div class="boxg">
<p class="noindent">“The Van Buren Bank has felt the effects of deregulation which made the once stable banking industry highly competitive. With the decreased spread between borrowing and lending rates, profits on loans have dwindled, making profit on services to customers critical. In the corporate banking group, account managers who in the past could concentrate only on loan volumes must now focus on customer and product profitability. This means they must make decisions differently and need different kinds of information. For example, if a long-time customer threatens to take his or her business elsewhere unless he or she is given an unusually low interest loan, the account manager must decide whether this is an otherwise profitable customer in terms of his or her total business with the bank.</p>
<p class="indent">In order to determine how profitable a customer is, the account manager must gather information about the various products and services the customer buys and the profitability of each. Conceptually this could be done by communicating with other account managers around the world who also do business with this customer (or communicating with their electronically readable records) and consolidating the information.</p>
<p class="indent">However, it may or may not be true that the customer is identified in the same way in each account manager’s records; or that the various products and services the bank sells are identified the same way or grouped into the same categories; or that the size of each account is recorded in the same currency; or that discounts, refunds, and so forth are handled in the same way. The account manager must translate all this information from many sources into a common form and determine how profitable this customer is.</p>
<p class="indent">Unfortunately, at the Van Buren Bank, all customer identifiers were typically assigned by the branches and were not standard across the bank. Therefore, there was no way to identify all the business of a given customer short of phoning up every branch and asking. It was clear that the Van Buren Bank required much more data integration than was currently built into its information systems.”</p>
</div>
<p class="indent">This example is an excerpt from a 1992 paper [<span class="blue">Goodhue et al., 1992</span>]!<sup><a epub:type="noteref" href="#pgfn1_1" id="rpgfn1_1">1</a></sup> It hits the nail on the head: enterprises still go through these very struggles today, 30 years later.</p>
<p class="indent"><span epub:type="pagebreak" id="page_2" title="2"/>It is clear that enterprise data management needs to evolve in a way that data and knowledge are connected, where real-world concepts and relationships are at the forefront instead of the complex and inscrutable application database schemas. This can now be accomplished with a technology called <i>knowledge graphs:</i> integrating knowledge and data at scale, where the real-world concepts and relationships are first-class citizens. The data happens to be represented in a graph. Why? Read on...</p>
<p class="indent">Graphs are a new way to look at data. Not truly “new,” because these technologies have been around for several decades, or centuries even if you think of mathematics and graph theory. They are new, however, in the sense that graph technologies are only now moving into the mainstream information systems practice, and thus many people are newly exposed to them. While graphs, in the abstract, are a very intuitive and natural way to represent information and to model the world, graph databases per se are different from traditional ways in which we model and manage data. The industry has almost half a century of experience with relational databases and the relational model, and thus the tooling, methods, educational curricula, etc., are all geared toward this. <i>“The limits of my language mean the limits of my world”</i><sup><a epub:type="noteref" href="#pgfn1_2" id="rpgfn1_2">2</a></sup> very much applies here: SQL is the language, and thus understanding that there are other ways to do things, ways not even possible with SQL, makes it hard for many people to get started with graphs or to understand how they could use graphs to their benefit and advantage. This can make it hard to adopt graphs.</p>
<p class="indent">We will revisit the question of “why are things hard?” in more detail later in the book, but before we start building actual knowledge graphs, we will introduce some background. While the term “knowledge graph” was really introduced into the mainstream vocabulary less than a decade ago (by Google<sup><a epub:type="noteref" href="#pgfn1_3" id="rpgfn1_3">3</a></sup>), the idea is much older, and there is easily over half a century of research and software work that preceded what now seems to be a “hot” new concept in the world of enterprise information systems [<span class="blue">Gutierrez and Sequeda, 2021</span>].</p>
<p class="indent">We can look at knowledge graphs from two, somewhat related, angles: First, there is the practical question of <i>how to manage and exploit all the information modern enterprises collect and store.</i> Second, there is the more theoretical question of <i>how to represent and structure information about the real world.</i> While the first question is something all CIOs today are pondering, and the second question is something that indeed predates computers altogether, the two are inextricably linked, and we will discuss how knowledge graphs are the embodiment of solutions and answers to both questions.</p>
<section>
<h2 class="head2" id="ch1_1">1.1<span class="space3"/><span epub:type="title">WHAT IS THE PROBLEM?</span></h2>
<p class="noindent">In a modern enterprise, the critical data is stored in a relational database.<sup><a epub:type="noteref" href="#pgfn1_4" id="rpgfn1_4">4</a></sup> There are several roles that could collectively be described as <i>data consumers.</i> These include data analysts, data scientists, as well as others who must find answers to critical business questions (say, to optimize <span epub:type="pagebreak" id="page_3" title="3"/>business decisions) and deliver these answers as accurately and as quickly as possible. To be able to deliver, the data consumers need access to data stored in relational databases. What we hope to demonstrate in this book is that <i>“accessible data</i>” does not only consist of the physical bits stored in an enterprise information system (and the associated credentials for one to get their hands on said bits). In order to truly access data, one also needs to understand how the data is logically structured and, most importantly, <b>what it means.</b> The main obstacle to delivering the business answers is specifically the lack of understanding of the meaning of the data. Throughout this book, we will thus employ the idea that</p>
<p class="ccenter"><span class="bb">accessible data = physical bits + semantics</span></p>
<p class="indent">And by “semantics” we refer to the <i>meaning</i> of the physical bits—this will be discussed in a later section.</p>
<p class="indent">In an organization, roles such as data engineers, data architects, and data stewards can be categorized as <i>data producers.</i> They are typically the “high clergy” entrusted with defining, structuring, and managing data. They are the ones who understand the complex database schemas that are the prerequisite for data access. Problems arise because of communication difficulties between the data consumers and the data producers; this may be due to a lack of agreed terminology or simply because of the limits of human communication. Data consumers must communicate to the data producers what it is that they want; instead, our ultimate goal should be to encourage and empower the consumers to access data, perform queries, and generate reports on their own, with minimal support from data producers, thus reducing effort and time, and minimizing the chance for errors. We think of this as “democratizing” enterprise data.</p>
<p class="indent">Unfortunately, the current ways of solving this general problem are painstaking and complex in their own right. Below, we will discuss these.</p>
<section>
<h3 class="head3" id="ch1_1_1">1.1.1<span class="space3"/><span epub:type="title">SPREADSHEET APPROACH</span></h3>
<p class="noindent">A data analyst needs to answer a business question and asks a data engineer for some data. Once the data engineer starts to gather the data, they realize that the request is a bit more complex and that they need to talk to the data architect, who is the expert in the system (<a href="#fig1_1">Figure <span class="blue">1.1</span></a>). However, the data architect is very busy and may take days to answer. Finally, the data engineer gets the needed clarifications and comes up with the SQL query that returns the data, which is then sent as a CSV or Excel file to the data analyst by email.</p>
<p class="indent">The data analyst takes the data and does some additional calculations in Excel and generates a report. This entire process can take days or even weeks.</p>
<p class="indent">The situation continues: the data analyst just got the data as a particular snapshot in time and needs to get an updated version of the data every week or every day. Additionally, the data analyst goes through this cycle with different data engineers, generating several spreadsheets of data. Technically savvy data analysts may have a (unofficial) database on their computer from <span epub:type="pagebreak" id="page_4" title="4"/>which they can export a spreadsheet (MS Access is common) and this makes it much easier to “munge” the data together.</p>
<figure>
<div class="image" id="fig1_1"><img alt="Image" src="../images/fig1_1.jpg"/></div>
<figcaption>
<p class="figcaption"><span class="blue">Figure 1.1:</span> Spreadsheet approach.</p>
</figcaption>
</figure>
<p class="indent">Finally, where is the data being integrated? On the data analyst’s laptop!</p>
<p class="indent">This leaves us with some open questions.</p>
<p class="bbull">• Did the data analyst communicate the correct request to the data engineer?</p>
<p class="bbull">• Did the data engineer correctly understand what the data analyst required?</p>
<p class="bbull">• Did the data engineer deliver the correct and precise results?</p>
<p class="bbull">• Even if the request was understood correctly, was the SQL query itself written correctly?</p>
</section>
<section>
<h3 class="head3" id="ch1_1_2">1.1.2<span class="space3"/><span epub:type="title">QUERY APPROACH</span></h3>
<p class="noindent">In this scenario, the data engineer is tired of running a query and simply sending the results to the data analyst. Therefore, the data engineer provides the data analyst with read-only access to the database and gives them the SQL query to execute. Given the complexity of the database, these are often complicated queries that the data analyst does not understand and thus are treated as “black boxes” (<a href="#fig1_2">Figure <span class="blue">1.2</span></a>).</p>
<p class="indent"><span epub:type="pagebreak" id="page_5" title="5"/>Similar to the Spreadsheet approach, the data analyst may receive a variety of SQL queries from different data engineers. A technically savvy data analyst is able to combine the different SQL queries into one large query by joining each query as a subquery, assuming the queries all target the same database:</p>
<div class="boxg">
<p class="noindentl"><code>SELECT *</code></p>
<p class="noindentl"><code>FROM (sql query 1) A</code></p>
<p class="noindentl"><code>JOIN (sql query 2) B ON A.ID = B.ID</code></p>
<p class="noindentl"><code>JOIN (sql query 3) C ON B.ID = C.ID</code></p>
<p class="noindentl"><code>. . .</code></p>
</div>
<p class="indent">Often, the data analyst will extract the calculations from the spreadsheet and push them into the SQL query. These calculations can contain important business definitions and are pushed into the black box. This entire process is complicated, can easily get out of hand, and that is why you can see queries that are pages long and intelligible only to a few experts in an organization.</p>
<p class="indent">Again, we have some open questions.</p>
<p class="bbull">• Who actually understands what these SQL queries are doing?</p>
<p class="bbull">• Were the joins performed on the correct keys?</p>
<p class="bbull">• Can we trust the results of these queries?</p>
</section>
<section>
<h3 class="head3" id="ch1_1_3">1.1.3<span class="space3"/><span epub:type="title">DATA WAREHOUSE APPROACH</span></h3>
<p class="noindent">Enterprise Data Warehouses (EDWs) are a general solution used to integrate data from various disparate sources, for the purposes of data analysis and business decision-making [<span class="blue">Inmon, 2005</span>] (<a href="#fig1_3">Figure <span class="blue">1.3</span></a>). Based on our experience and anecdotal evidence, projects to build an EDW are often quoted to take “6 months and $1 million USD” but can take 2–3 times longer, and even then they may not be successful.<sup><a epub:type="noteref" href="#pgfn1_5" id="rpgfn1_5">5</a></sup> A team (sometimes an IT consultancy company) will gather requirements from all the business stakeholders to design an enterprise data model that covers all the requirements, and to understand what data is needed. A team of data engineers will write ETL code to extract the data from the sources, translate it into the enterprise data model, and then load it into the data warehouse. This follows a “boil the ocean” approach. Once the data is centralized in the warehouse, data analysts can access a single source instead of having to go through the previous spreadsheet and query approaches.</p>
<p class="indent">A common follow-up scenario is that either a requirement was misunderstood—hence the data is wrong—or a requirement was missing—and hence data is missing. This means that the <span epub:type="pagebreak" id="page_6" title="6"/>enterprise data model may need to change, and Extract, Transform, Load (ETL) code needs to be re-generated or corrected. Sometimes the engineer who wrote the ETL code is not available anymore so another engineer needs to reverse engineer the code.</p>
<figure>
<div class="image" id="fig1_2"><img alt="Image" src="../images/fig1_2.jpg"/></div>
<figcaption>
<p class="figcaption"><span class="blue">Figure 1.2:</span> Query approach.</p>
</figcaption>
</figure>
<p class="indent">The overall issue is trust. During the 6 months (or 12, or 18, ...) which the data warehouse was being built, the data analyst continued to do their work through the previous <i>ad hoc</i> Spreadsheet and Query approaches. Even though those approaches are <i>ad hoc,</i> the data analyst trusts those answers because they themselves are in control. When the data analyst compares the answers to the same question between the <i>ad hoc</i> process they control and the data warehouse that they do not control, the answers are most probably going to be different. Guess which process is going to be trusted? This is why data warehouses fail: not for technical reasons but for <i>social reasons,</i> because they are not trusted.</p>
<p class="indent">Open questions include.</p>
<p class="bbull">• How do we know what the data model actually means?</p>
<p class="bbull">• Can we explain where each piece of data is coming from?</p>
<figure>
<div class="image" id="fig1_3"><img alt="Image" src="../images/fig1_3.jpg"/></div>
<figcaption>
<p class="figcaption"><span class="blue">Figure 1.3:</span> Data warehouse approach.</p>
</figcaption>
</figure>
</section>
<section>
<h3 class="head3" id="ch1_1_4">1.1.4<span class="space3"/><span epub:type="title">DATA LAKE APPROACH</span></h3>
<p class="noindent"><span epub:type="pagebreak" id="page_7" title="7"/>A Data Lake is basically a Data Warehouse that (1) allows you to dump any type of data into it (a data warehouse only consists of structured/relational data) and (2) transformations are done after the data is in the lake: extract, load, and <i>then</i> transform (ELT) (<a href="#fig1_4">Figure <span class="blue">1.4</span></a>).</p>
<p class="indent">It is faster to load and centralize the data in one place. It is paramount to understand, however, that even if the data is physically co-located, it doesn’t mean that the data has been integrated in any way. Every transformation is done independently. The open questions for the data warehouse scenario hold for data lakes too.</p>
</section>
<section>
<h3 class="head3" id="ch1_1_5">1.1.5<span class="space3"/><span epub:type="title">DATA WRANGLING APPROACH</span></h3>
<p class="noindent">A main drawback of the Data Warehouse and Data Lake approaches is that they depend too much on IT, who now becomes the bottleneck. Following the rise of self-service analytics tools, we are now encountering self-service <i>data wrangling tools</i> that enable data analysts to prepare the data with minimal assistance or intervention from IT (<a href="#fig1_5">Figure <span class="blue">1.5</span></a>).</p>
<p class="indent">The situation we encounter is that each data analyst can be wrangling the data in different ways, without communicating with other analysts. The process of wrangling the data is not just about cleaning data but also transforming the data to align with (some) business meaning. For example, each data analyst may be tasked to do different revenue projects, but they may each be <span epub:type="pagebreak" id="page_8" title="8"/>transforming the data according to different definitions of revenue. In other words, each data analyst may be providing a different meaning (i.e., semantics) for the data.</p>
<figure>
<div class="image" id="fig1_4"><img alt="Image" src="../images/fig1_4.jpg"/></div>
<figcaption>
<p class="figcaption"><span class="blue">Figure 1.4:</span> Data Lake approach.</p>
</figcaption>
</figure>
<p class="indent">This time, our open questions are</p>
<p class="bbull">• How do we know that data wrangling is consistent across different data analysts?</p>
<p class="bbull">• How do we know that data analysts are not “reinventing the wheel” and redoing work that should and could be reused?</p>
</section>
<section>
<h3 class="head3" id="ch1_1_6">1.1.6<span class="space3"/><span epub:type="title">SO WHAT?</span></h3>
<p class="noindent">We have observed a number of issues with existing approaches: First and foremost, there is a gap between the data and the meaning of the data, and our thesis is that if we do not bridge this gap we easily end up with systems that can be characterized as “garbage in, garbage out” (<a href="#fig1_6">Figure <span class="blue">1.6</span></a>).</p>
<p class="indent">A lot of work is being done over and over, there are tendencies to “boil the ocean,” and data and knowledge work can easily be lost.</p>
<p class="indent">Gartner states that “Self-service analytics is often characterized by [. . . ] an underlying data model that has been simplified or scaled down for ease of understanding and straightforward data access.”<sup><a epub:type="noteref" href="#pgfn1_6" id="rpgfn1_6">6</a></sup> This may be wishful thinking, but the fact remains that in a data-centric <span epub:type="pagebreak" id="page_9" title="9"/>architecture a single, simple, and extensible data model is exactly what we need [<span class="blue">McComb, 2019</span>].</p>
<figure>
<div class="image" id="fig1_5"><img alt="Image" src="../images/fig1_5.jpg"/></div>
<figcaption>
<p class="figcaption"><span class="blue">Figure 1.5:</span> Data Wrangling approach.</p>
</figcaption>
</figure>
<figure>
<div class="image" id="fig1_6"><img alt="Image" src="../images/fig1_6.jpg"/></div>
<figcaption>
<p class="figcaption"><span class="blue">Figure 1.6:</span> So What?</p>
</figcaption>
</figure>
<p class="indent">Luckily, this is precisely what can be achieved with knowledge graphs.</p>
</section>
</section>
<section>
<h2 class="head2" id="ch1_2"><span epub:type="pagebreak" id="page_10" title="10"/>1.2<span class="space3"/><span epub:type="title">KNOWLEDGE GRAPHS</span></h2>
<section>
<h3 class="head3" id="ch1_2_1">1.2.1<span class="space3"/><span epub:type="title">WHAT IS A KNOWLEDGE GRAPH?</span></h3>
<p class="noindent">A <i>knowledge graph</i> represents a collection of real-world concepts (i.e., nodes) and relationships (i.e., edges) in the form of a graph, used to link and integrate data coming from diverse sources. Knowledge graphs can be seen as realizing an early vision in computing: creating intelligent systems that integrate knowledge and data at a large scale. In its simplest form, a knowledge graph encodes meaning and data together in the form of a graph:</p>
<p class="bbull">• Knowledge (i.e., meaning): Concepts and the relationships between the concepts are first-class citizens, so the graph encodes how the domain users understand the world.</p>
<p class="bbull">• Graph (i.e., data): A data structure based on nodes and edges that enables integrating data coming from heterogeneous data sources, from unstructured to structured.</p>
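<p class="indent">To make these two ingredients concrete, here is a minimal sketch in plain Python (no graph database; all identifiers are invented for illustration). Both the data and the knowledge about the domain live in one set of (subject, predicate, object) triples:</p>

```python
# A knowledge graph in its simplest form: a set of
# (subject, predicate, object) triples. All identifiers below are
# hypothetical examples, not a real schema.
triples = {
    # data: facts about individual things
    ("Order:1001", "placedBy", "Customer:42"),
    ("Order:1001", "shippedTo", "Address:7"),
    ("Customer:42", "hasName", "Acme Corp"),
    # knowledge: the meaning of the terms, stored in the same graph
    ("placedBy", "domain", "Order"),
    ("placedBy", "range", "Customer"),
}

def objects(subject, predicate):
    """Follow an edge: every object reachable from subject via predicate."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}
```

<p class="indent">The same lookup works for data and for meaning: <code>objects("Order:1001", "placedBy")</code> returns the customer, while <code>objects("placedBy", "range")</code> tells us that the relationship points at customers.</p>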
<p class="indentt">The popularity of the term “knowledge graph” stems from the announcement of the Google Knowledge Graph in 2012.<sup><a epub:type="noteref" href="#pgfn1_7" id="rpgfn1_7">7</a></sup> Since then, a plethora of small-, medium-, and large-scale companies have deployed knowledge graphs to enhance search, do recommendations, perform analytics and machine learning, etc.</p>
</section>
<section>
<h3 class="head3" id="ch1_2_2">1.2.2<span class="space3"/><span epub:type="title">WHY KNOWLEDGE GRAPHS?</span></h3>
<p class="noindent">Knowledge graphs provide a meaningful and reliable view of your enterprise data. We focus on two primary reasons:</p>
<p class="indentt"><b>Knowledge graphs bridge the data-meaning gap:</b> This is accomplished by connecting the data consumers’ business terminology with data, and enabling access to the data through this terminology. This means that data consumers and business users can access the data in their own terms, which dramatically improves search, findability, clarity, and accuracy, thus reducing the time and cost of identifying accurate data to answer critical business questions. Furthermore, a knowledge graph can provide reasoning capabilities to infer new knowledge, thus allowing us to generate new insights and recommendations.</p>
<p class="indentt"><b>Knowledge graphs are a natural data model for data integration.</b> The graph data model is flexible and nimble by nature, ideal for data integration. This implies that developers can quickly add and integrate data coming from a variety of sources. Furthermore, graphs enable tracking the lineage from the business concepts to the data, thus providing governance and explainability for the answers to questions posed on the knowledge graph.</p>
<p class="indent">Let’s break this down a bit more:</p>
<p class="indent">Knowledge graphs bridge the data-meaning gap between how business users understand the world and how the data is physically represented and stored in a database. Consider the <span epub:type="pagebreak" id="page_11" title="11"/>domain of e-commerce. An order is placed by a customer. An order is shipped to an address. There may be multiple addresses: a shipping address, a billing address, etc. An order has different order lines, and an order line is connected to a product, and so forth. This is how business users see the world of e-commerce. Very simple. However, the Order Management System’s database does not store the data in that way. Ideally, the database schema would have a table called Order and a table called Customer, etc. Unfortunately, that is never the case in enterprise application databases, because the data is stored to benefit the application, not the data consumers. It is common for enterprise databases to consist of thousands of tables with tens of thousands of attributes, due to query workload requirements (vertical/horizontal partitioning) and custom fields (tables having hundreds of columns called <code>segment1, segment2</code>, etc.); furthermore, extending the application with new requirements often leads to the replication of data in a new table, in order to avoid altering a schema designed for a specific part of the application. This is known as the <i>application-centric quagmire</i> [<span class="blue">McComb, 2018</span>]. Thus, if a business user needs to access the data in the Order Management System, they would not understand it on their own; hence the data-meaning gap.</p>
<p class="indent">In a knowledge graph, the business terminology is represented as concepts and relationships. They can have different associated labels (synonyms), even in different languages. The concepts and relationships are connected with the underlying application databases through mappings. Business questions can be represented as queries in terms of the knowledge graph instead of the inscrutable application database schemas.</p>
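<p class="indent">The idea of mappings can be sketched as follows (a toy illustration in Python; the table and column names are invented, and real systems would use a mapping standard such as R2RML rather than string templates). Each business concept is mapped to the SQL that produces it, and relationships are expressed over the concepts, so a business-level question never has to mention the physical schema:</p>

```python
# Hypothetical concept-to-schema mappings: business names on the left,
# application-database SQL on the right. tbl_cust_x1 and tbl_ord_47 are
# invented stand-ins for inscrutable application tables.
CONCEPTS = {
    "Customer": "SELECT c_id AS id, c_nm AS name FROM tbl_cust_x1",
    "Order":    "SELECT o_id AS id, o_cust AS customer_id FROM tbl_ord_47",
}

# A relationship is defined over concepts, not tables:
# (source concept, source column, target concept, target column)
RELATIONSHIPS = {
    "placedBy": ("Order", "customer_id", "Customer", "id"),
}

def sql_for(relationship):
    """Expand a business-level relationship into SQL over the mapped tables."""
    src, src_col, dst, dst_col = RELATIONSHIPS[relationship]
    return (f"SELECT * FROM ({CONCEPTS[src]}) s "
            f"JOIN ({CONCEPTS[dst]}) d ON s.{src_col} = d.{dst_col}")
```

<p class="indent">Asking for <code>sql_for("placedBy")</code> yields a join over the underlying tables without the asker ever seeing their names; only the mapping, maintained in one place, knows about them.</p>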
<p class="indent">Knowledge graphs are a natural data model that makes data integration easier.</p>
<p class="indent">First, graphs are an ideal abstraction model. You can have different types of data sources, such as relational databases as well as files in formats such as CSV, XLS, XML, JSON, text, etc. You can represent all these different data models in a graph.<sup><a epub:type="noteref" href="#pgfn1_8" id="rpgfn1_8">8</a></sup> Those thousands of tables of an order management system database can be represented in a single, simple knowledge graph about e-commerce. The outputs of Natural Language Processing (NLP) tasks, such as entity extraction and relationship extraction, all feed the graph. Therefore, graphs are a common, unifying model for different models and formats of source data.</p>
<p class="indent">Second, graphs are flexible by nature. In a relational database you need to define the schema up front before you can load data. If you have to make changes to the schema, it can become a headache because you may need to restructure the way the data is modeled. However, when you work with a graph, all you add are more nodes and edges.</p>
<p class="indent">Finally, graphs enable integration. Consider two disparate graphs. How do you connect them? Just add edges between nodes. Therefore, integrating data in the form of graph boils down to creating relationships between the concepts.</p>
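<p class="indent">As a small illustration (plain Python with invented identifiers): co-locating two graphs is just a union of their edge sets, and integrating them is one additional edge.</p>

```python
# Two disparate graphs as sets of (subject, predicate, object) edges.
# The crm: and erp: identifiers are hypothetical.
crm_graph = {("crm:Cust42", "hasName", "Acme Corp")}
erp_graph = {("erp:Acct7", "hasRevenue", "1200000")}

# Co-location: simply take the union of the edge sets.
graph = crm_graph | erp_graph

# Integration: one new edge stating that the two nodes denote
# the same real-world entity.
graph.add(("crm:Cust42", "sameAs", "erp:Acct7"))
```

<p class="indent">With the <code>sameAs</code> edge in place, a query can traverse from the CRM customer to its revenue in the ERP data; no schema had to be redesigned.</p>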
<p class="indent">Knowledge graphs are also gaining traction in a variety of use cases such as “Customer 360,” identity graph, master data management, fraud detection, recommendation engines, social <span epub:type="pagebreak" id="page_12" title="12"/>networking, network operations, life science and drug discovery, among others. With a knowledge graph, organizations have the ability to execute graph analytics and algorithms (inferences, page rank, shortest path, etc.) that extend the capabilities of the state of the art.</p>
</section>
<section>
<h3 class="head3" id="ch1_2_3">1.2.3<span class="space3"/><span epub:type="title">WHY NOW?</span></h3>
<p class="noindent">Graph databases are hot. There are a plethora of companies whose products or services are either graph databases or functionality closely associated with graph databases.<sup><a epub:type="noteref" href="#pgfn1_9" id="rpgfn1_9">9</a></sup> The market is still nascent at the time of writing, but many mature products and services are already available and are used in production systems.</p>
<p class="indent">The W3C graph standards (RDF, SPARQL, etc.) have been around for two decades and at this point can be considered quite mature. The standards for property graphs are being developed, and the divide between the two families of graphs is getting smaller.</p>
<p class="indent">The tech giants have been adopting knowledge graphs [<span class="blue">Noy et al., 2019</span>]. In addition, there are numerous startups with products and services that were easier and faster to build and bring to market, thanks to graph databases and other graph technologies.</p>
</section>
</section>
<section>
<h2 class="head2" id="ch1_3">1.3<span class="space3"/><span epub:type="title">BACKGROUND</span></h2>
<p class="noindent">Before we get into the details of building knowledge graphs, let us discuss more generally the aforementioned problem of how to represent information about the real world.</p>
<section>
<h3 class="head3" id="ch1_3_1">1.3.1<span class="space3"/><span epub:type="title">HISTORY OF KNOWLEDGE GRAPHS</span></h3>
<p class="noindent">Graphs, as a branch of mathematics, date back at least to the 18th century and the German mathematician Leonhard Euler who posed a famous problem dubbed “Bridges of Königsberg” which now is in some ways considered the genesis of graph theory [<span class="blue">Euler, 1736</span>]. You could also argue that much of classical computer science is grounded in graphs and graph-based algorithms.</p>
<p class="indent">Theory aside, graphs—constructs consisting of nodes (also called “vertices”) and edges—are a very intuitive and natural way to model information and as such are a good choice for us as a mechanism to represent the world. Knowledge graphs are the modern embodiment of one of the oldest subfields of artificial intelligence called <i>knowledge representation</i> (KR). The goal of KR is to facilitate the representation of information in such a way that automated systems can better interpret that information. To quote <span class="blue">Brachman and Levesque [1985]</span>:</p>
<div class="boxgq">
<p class="noindent"><i>The notion of representation of knowledge is at heart an easy one to understand. It simply has to do with writing down, in some language or communicative medium, descriptions or pictures that correspond in some salient way to the world or the state of the world. In Artificial Intelligence (AI), <span epub:type="pagebreak" id="page_13" title="13"/>we are concerned with writing down descriptions of the world in such a way that an intelligent machine can come to new conclusions about its environment by formally manipulating these descriptions.</i></p>
</div>
<p class="indent">This characterization is relevant, since we will look at KR from the viewpoint of facilitating not only the management of representations, but also making such representations actionable. This should be seen in the context of what [<span class="blue">Fikes and Kehler, 1985</span>] define as the basic criteria for a knowledge representation language, namely <i>expressive power</i> (measuring the possibility and ease of expressing different pieces of knowledge), <i>understandability</i> (measuring whether knowledge can be understood by humans) and <i>accessibility</i> (measuring how effectively the knowledge can be accessed and used). Also, one of the key realizations about KR is that not all formalisms or structures qualify as representation; instead, a representation formalism needs to be associated with a <i>semantic theory</i> to provide the basis for reasoning [<span class="blue">Hayes, 1974</span>] and the means to define its meaning.</p>
<p class="indent">The early work on KR (in the 1960s) focused on human associative memory—mostly in a metaphoric sense—and introduced associative formalisms and structures for capturing information about the world. These structures, known as semantic networks [<span class="blue">Brachman, 1979, Quillian, 1967, Woods, 1975</span>], represent information as labeled graphs where nodes denote concepts and edges denote relationships between concepts. Modern knowledge graphs are not far at all from this original idea.</p>
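<p class="indent">To make the semantic-network idea concrete, the following sketch (our own illustration; all node and relationship names are hypothetical) represents labeled edges as Python tuples and transitively follows “is-a” edges to draw simple conclusions:</p>

```python
# A minimal semantic network: nodes denote concepts, labeled edges denote
# relationships between concepts. All names here are illustrative.
edges = {
    ("Canary", "is-a", "Bird"),
    ("Bird", "is-a", "Animal"),
    ("Bird", "has-part", "Wing"),
    ("Canary", "has-color", "Yellow"),
}

def related(node, label):
    """All nodes reachable from `node` via one edge with the given label."""
    return {o for (s, p, o) in edges if s == node and p == label}

def ancestors(node):
    """Transitively follow "is-a" edges -- a simple form of reasoning."""
    result, frontier = set(), related(node, "is-a")
    while frontier:
        nxt = frontier.pop()
        if nxt not in result:
            result.add(nxt)
            frontier |= related(nxt, "is-a")
    return result

print(sorted(ancestors("Canary")))  # ['Animal', 'Bird']
```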
<p class="indent">Semantic networks were criticized for their weak formal foundations, and attempts to mitigate this led to the introduction of frames, another approach to KR [<span class="blue">Fikes and Kehler, 1985, Karp, 1992, Minsky, 1975</span>]. The simplistic view of frame-based representation is that a frame represents an object or a concept. Attached to the frame is a collection of properties, which may initially be filled with default values. It is easy to think of frames as a collection of linked objects that form a (knowledge) graph, and indeed frames paved the way for logic- and graph-based KR and modern knowledge graphs.</p>
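<p class="indent">As a rough illustration of the frame idea (our own sketch, not drawn from the cited literature), the following shows frames as objects with named slots, default values, and inheritance from a parent frame:</p>

```python
# A simplistic frame system: each frame has named slots, possibly filled
# with defaults, and may inherit slot values from a parent frame.
# All frame and slot names are illustrative.
class Frame:
    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent
        self.slots = slots  # slot name -> value (defaults included)

    def get(self, slot):
        """Look up a slot locally, then climb the parent chain."""
        if slot in self.slots:
            return self.slots[slot]
        if self.parent is not None:
            return self.parent.get(slot)
        return None

bird = Frame("Bird", locomotion="flies", covering="feathers")
penguin = Frame("Penguin", parent=bird, locomotion="walks")  # override default

print(penguin.get("locomotion"))  # walks (local value overrides the default)
print(penguin.get("covering"))    # feathers (inherited from Bird)
```

Note how the linked frames already form a small graph: following parent links and slot values is exactly the kind of traversal a knowledge graph supports.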
<p class="indent">Subsequent work on logic, particularly the push to find tractable subsets of first-order predicate calculus, has given rise to description logics, logic-based formalizations of object-oriented modeling, and languages such as OWL.</p>
<p class="indent">For further details on the history of knowledge graphs, we refer the reader to <span class="blue">Gutierrez and Sequeda</span> [<span class="blue">2021</span>].<sup><a epub:type="noteref" href="#pgfn1_10" id="rpgfn1_10">10</a></sup></p>
</section>
<section>
<h3 class="head3" id="ch1_3_2">1.3.2<span class="space3"/><span epub:type="title">SEMANTICS</span></h3>
<p class="noindent">The term “semantics” is much overused and abused in today’s information systems vernacular. What, exactly, does this term mean? In the context of information systems, semantics defines how data “behaves” and how machines can interpret data, and when properly applied, can free <span epub:type="pagebreak" id="page_14" title="14"/>us from having to “hard-wire” logic into software systems and application code. In that sense, it gives rise to “data-driven” processing. But where does semantics come from? In very pragmatic terms, it is embodied in the following:</p>
<p class="numlist">1. relationships between data (e.g., USD) and definitions of data (e.g., currency);</p>
<p class="numlist">2. relationships within data (e.g., 10 and USD); and</p>
<p class="numlist">3. logic hard-wired in software (e.g., the calculation of net sales).</p>
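<p class="indent">To illustrate the first two categories (a sketch of our own; the property and definition names are made up), the following shows generic software interpreting data by looking up the definitions the data refers to, rather than hard-wiring the meaning of “USD” into application code:</p>

```python
# Categories 1 & 2: the data carries references to its own definitions,
# so a generic interpreter can make sense of it without hard-wired logic.
# All definition and property names below are illustrative.
definitions = {
    "USD": {"type": "currency", "label": "US Dollar"},
}

fact = {"amount": 10, "unit": "USD"}  # 10 is related to USD (category 2)

def describe(fact):
    """Interpret a fact by looking up the definition of its unit."""
    meaning = definitions[fact["unit"]]  # category 1: data -> definition
    return f'{fact["amount"]} {meaning["label"]} (a {meaning["type"]})'

print(describe(fact))  # 10 US Dollar (a currency)
```

Adding a new unit here means adding a definition, not changing the software: this is the essence of data-driven processing.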
<p class="indent">The first two categories allow us to move toward data-driven processing, where (most of) the logic of a system is embodied in the data and the definitions of data, and the software itself acts as an “interpreter” of the data and its definitions. This is drastically different from category #3 above. Unfortunately, most of today’s information systems still belong in this last category, forcing software rework, upgrades, new versions, etc., whenever we want to make changes to processing logic.</p>
<p class="indent">Separating domain-specific processing logic from software code and instead associating it with data, whether by reference or by carrying these definitions with the data itself, also improves interoperability in data interchange, especially when data is being exchanged between different organizations.</p>
</section>
<section>
<h3 class="head3" id="ch1_3_3">1.3.3<span class="space3"/><span epub:type="title">SEMANTIC WEB</span></h3>
<p class="noindent">The Semantic Web [<span class="blue">Berners-Lee et al., 2001</span>] is a vision that was introduced in the late 1990s and early 2000s, motivated by the need to enable better integration and “operationalization” of data on the Web. The idea is basically predicated on giving data accessible formal semantics, and allowing the definitions of semantics to be exchanged together with the actual data. The World Wide Web Consortium (W3C, an organization that manages many of the technical specifications of the Web) produced a set of standards for the Semantic Web, and by doing so laid the groundwork for modern knowledge graphs. The term “Web” in Semantic Web refers to the fact that, in effect, the whole vision is about building support for KR and graphs using web technologies; the Semantic Web does not imply that by using these technologies you have to put your data “out there” on the public Web (although that is also possible, as we will demonstrate later). Foundational Web technologies (such as HTTP or URIs) are well understood and widely supported, making them an ideal foundation for building distributed KR systems.</p>
<p class="indent">The Semantic Web is predicated on a few overarching principles: make data accessible, give it accessible semantics (i.e., definitions of data in the form of ontologies, something we will address in the next section), and—optionally—provide mechanisms for reasoning about data (again, with the help of ontologies).</p>
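<p class="indent">As a minimal illustration of the Semantic Web’s data model (RDF), facts can be written as subject–predicate–object triples whose identifiers are URIs. The sketch below uses plain Python tuples and made-up example URIs rather than a real RDF library:</p>

```python
# RDF-style triples: (subject, predicate, object), with URIs serving as
# globally unique identifiers. The example.org URIs are illustrative.
EX = "http://example.org/"

triples = [
    (EX + "VanBurenBank", EX + "type", EX + "Bank"),
    (EX + "VanBurenBank", EX + "hasCustomer", EX + "Customer42"),
    (EX + "Customer42", EX + "name", "Jane Doe"),  # literal-valued object
]

def objects(subject, predicate):
    """Pattern-match triples, in the spirit of a SPARQL lookup."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]

print(objects(EX + "Customer42", EX + "name"))  # ['Jane Doe']
```

Because the identifiers are URIs, triples produced by different organizations can be merged into one graph without name clashes.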
</section>
<section>
<h3 class="head3" id="ch1_3_4"><span epub:type="pagebreak" id="page_15" title="15"/>1.3.4<span class="space3"/><span epub:type="title">MODELS, ONTOLOGIES, AND SCHEMATA</span></h3>
<p class="noindent">At the heart of representing information is the notion of a model, a definition of how your data is structured and what it means—its semantics, that is. Typically, a model defines both the physical and the logical structure of information, but in this book we will show that <i>mappings</i> allow us to build knowledge graphs by focusing on logical modeling, as the physical aspects have been “taken care of.” A modeling language provides the means to define a model and the semantics of data that conforms to the model. In KR, this model is often called an <i>ontology,</i> as already mentioned; it is called that because of its relationship with metaphysics: we define what the “world,” with its various concepts and objects, looks like and how it behaves. In this book, rather than using the term ontology (and taking on all the nuances and implications of the term), we will call our models “knowledge graph schemas”<sup><a epub:type="noteref" href="#pgfn1_11" id="rpgfn1_11">11</a></sup> instead, and these will run the gamut from simple notions of logically structuring your data to potentially complex semantic definitions and constraints.</p>
</section>
</section>
<section>
<h2 class="head2" id="ch1_4">1.4<span class="space3"/><span epub:type="title">WHY THIS BOOK?</span></h2>
<p class="noindent">This book is about designing and building knowledge graphs from relational databases. To introduce the landscape, let us briefly look at the main practical challenges of knowledge graph design and construction.</p>
<p class="indent">First, we must engage in domain modeling or “ontology engineering”: the creation and definition of a sufficiently broad and shared data model, the knowledge graph schema [<span class="blue">Kendall and McGuinness, 2019, Tudorache, 2020</span>]. Second, we must do “mapping engineering,” namely, understanding how existing (non-graph) data sources are mapped or transformed for the purposes of creating the eventual knowledge graph.</p>
<p class="indent">Engineering a knowledge graph schema is difficult in and of itself. The field—earlier often referred to as “knowledge acquisition”—was prolific throughout the 1990s. Early seminal work includes <span class="blue">Fox and Grüninger</span> [<span class="blue">1997</span>] and <span class="blue">Uschold and King</span> [<span class="blue">1995</span>], followed by a multitude of methodologies, notably METHONTOLOGY [<span class="blue">Fernández-López et al., 1997</span>]. For an early survey, see <span class="blue">Corcho et al.</span> [<span class="blue">2003</span>]. Research in this field continues to progress by focusing on the sophisticated use of competency questions [<span class="blue">Azzaoui et al., 2013, Ren et al., 2014</span>], test-driven development [<span class="blue">Keet and Lawrynowicz, 2016</span>], ontology design patterns [<span class="blue">Hitzler et al., 2016</span>], and reuse [<span class="blue">Suárez-Figueroa et al., 2012</span>], to name a few. Furthermore, numerous knowledge graph schemas have been designed with reuse in mind, such as the Finance Industry <span epub:type="pagebreak" id="page_16" title="16"/>Business Ontology (FIBO),<sup><a epub:type="noteref" href="#pgfn1_12" id="rpgfn1_12">12</a></sup> Gist<sup><a epub:type="noteref" href="#pgfn1_13" id="rpgfn1_13">13</a></sup> for general business concepts, and <a href="http://Schema.org">Schema.org</a>.<sup><a epub:type="noteref" href="#pgfn1_14" id="rpgfn1_14">14</a></sup> We will dive into reusing external knowledge graph schemas in Section <span class="blue">2.2.3</span>.</p>
<p class="indent">Given this wealth of research, one might conjecture that knowledge graph schema engineering should not be much of a challenge. This would be the case if the ultimate deliverable were just a schema in isolation. However, populating a knowledge graph schema with data seems to be an afterthought and not a key component of existing schema engineering methodologies. In the context of designing a knowledge graph, both the schema and the mappings that generate the data sourced from relational databases must be first-class citizens.</p>
<p class="indent">Let’s assume that a knowledge graph schema has either been created via an established methodology or an existing schema is being reused. The next step is to map relational databases and other data sources to the schema in order to generate the knowledge graph. One common practice bootstraps the process with a direct mapping that generates a so-called putative knowledge graph schema from the database schema [<span class="blue">Sequeda et al., 2012</span>]. This practice suggests approaching the problem as a schema-matching problem between the source putative knowledge graph schema and the target knowledge graph schema. In theory, this can work [<span class="blue">Jiménez-Ruiz et al., 2015</span>], but in our experience it has not yet become (and may never become) practical in the real world, for the following reasons.</p>
<p class="bbull">• As previously discussed, enterprise database schemas are very large, consisting of thousands of tables and tens of thousands of attributes. Schema developers frequently and notoriously use acronyms and peculiar abbreviations for tables and columns (i.e., they use virtually meaningless names). Some commercial systems make frequent use of numbered columns for enterprise-specific data with no explicit semantic meaning whatsoever stored in the database (e.g., segment1, segment2). The data stored in these columns may also consist of codes that are meaningless by themselves.</p>
<p class="bbull">• Simple one-to-one schema correspondences are rare. We have found throughout our real-world experience that complex mappings dominate. That is, a mapping often integrates calculations and needs to incorporate business logic rules while simultaneously considering many database values. For example, the notion of “net sales of an order” is defined as “gross sales minus taxes and discounts given.” The tax rate can be different depending on location, and the discount can depend on the type of customer. A business user needs to provide these definitions before mappings can be created. Thus, without clairvoyance, automating the mappings is often simply not feasible.</p>
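<p class="indent">The “net sales” example above can be sketched as a complex mapping that combines several source values with business rules. All rates, codes, and field names below are hypothetical:</p>

```python
# A complex mapping: "net sales = gross sales minus taxes and discounts,"
# where the tax rate depends on location and the discount on the type of
# customer. All rates, codes, and field names are hypothetical.
TAX_RATES = {"TX": 0.0825, "CA": 0.0725}        # by location
DISCOUNTS = {"wholesale": 0.10, "retail": 0.0}  # by customer type

def net_sales(order):
    """Business-user-supplied definition, encoded as a mapping rule."""
    gross = order["gross"]
    tax = gross * TAX_RATES[order["location"]]
    discount = gross * DISCOUNTS[order["customer_type"]]
    return gross - tax - discount

order = {"gross": 100.0, "location": "TX", "customer_type": "wholesale"}
print(round(net_sales(order), 2))  # 81.75
```

No schema matcher could discover such a rule from column names alone; the definition has to come from a business user.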
<p class="indent">Early on in our practice we observed that schemas and mappings must be developed holistically. That is, there is a continual back-and-forth between schema and mapping engineering. Furthermore, this process is a team effort. For these reasons, we argue that schema engineering <span epub:type="pagebreak" id="page_17" title="17"/>methodologies must be extended to support how the schema should be populated via mappings in order to build a knowledge graph, in conjunction with a team consisting of data producers, data consumers, and knowledge scientists.</p>
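<p class="indent">The direct-mapping bootstrap mentioned earlier can also be sketched in a few lines: every table yields a class and every column a property of the putative schema. This is a simplification of our own (real direct mappings, such as the W3C Direct Mapping, also handle keys and row identifiers), with illustrative table and column names:</p>

```python
# Sketch of a direct mapping: table -> class, column -> property,
# yielding a "putative" knowledge graph schema. Real direct mappings
# (e.g., the W3C Direct Mapping) also handle keys and row identifiers.
def direct_mapping(db_schema, base="http://example.org/"):
    classes, properties = [], []
    for table, columns in db_schema.items():
        classes.append(base + table)
        for column in columns:
            properties.append(base + table + "#" + column)
    return {"classes": classes, "properties": properties}

# Illustrative enterprise-style schema with opaque, numbered columns --
# the putative schema inherits these meaningless names verbatim.
putative = direct_mapping({"CUST_MSTR": ["cust_id", "segment1", "segment2"]})
print(putative["classes"])  # ['http://example.org/CUST_MSTR']
```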
</section>
<section epub:type="footnotes">
<div epub:type="footnote" id="pgfn1_1"><p class="pgnote"><sup><a href="#rpgfn1_1">1</a></sup> The mention of deregulation should have been a hint that this was not a modern example.</p></div>
<div epub:type="footnote" id="pgfn1_2"><p class="pgnote"><sup><a href="#rpgfn1_2">2</a></sup> “<i>Die Grenzen meiner Sprache bedeuten die Grenzen meiner Welt</i>” (“The limits of my language mean the limits of my world”)—Ludwig Wittgenstein, 1922.</p></div>
<div epub:type="footnote" id="pgfn1_3"><p class="pgnote"><sup><a href="#rpgfn1_3">3</a></sup> <span class="blue"><a href="https://blog.google/products/search/introducing-knowledge-graph-things-not/">https://blog.google/products/search/introducing-knowledge-graph-things-not/</a></span></p></div>
<div epub:type="footnote" id="pgfn1_4"><p class="pgnote"><sup><a href="#rpgfn1_4">4</a></sup> Even though modern data lakes are not implemented using relational databases, they provide a SQL interface for access.</p></div>
<div epub:type="footnote" id="pgfn1_5"><p class="pgnote"><sup><a href="#rpgfn1_5">5</a></sup> This is based on anecdotes but is commonly agreed upon by industry professionals. We encourage the reader to ask around.</p></div>
<div epub:type="footnote" id="pgfn1_6"><p class="pgnote"><sup><a href="#rpgfn1_6">6</a></sup> <span class="blue"><a href="https://www.gartner.com/en/information-technology/glossary/self-service-analytics">https://www.gartner.com/en/information-technology/glossary/self-service-analytics</a></span></p></div>
<div epub:type="footnote" id="pgfn1_7"><p class="pgnote"><sup><a href="#rpgfn1_7">7</a></sup> <a href="https://blog.google/products/search/introducing-knowledge-graph-things-not">https://blog.google/products/search/introducing-knowledge-graph-things-not</a>/</p></div>
<div epub:type="footnote" id="pgfn1_8"><p class="pgnote"><sup><a href="#rpgfn1_8">8</a></sup> Recall that the focus of this book is on representing relational databases as knowledge graphs, but the fundamental ideas are more general.</p></div>
<div epub:type="footnote" id="pgfn1_9"><p class="pgnote"><sup><a href="#rpgfn1_9">9</a></sup> That includes the employers of both authors of this book.</p></div>
<div epub:type="footnote" id="pgfn1_10"><p class="pgnote"><sup><a href="#rpgfn1_10">10</a></sup> See also: <a href="http://knowledgegraph.today">http://knowledgegraph.today</a>.</p></div>
<div epub:type="footnote" id="pgfn1_11"><p class="pgnote"><sup><a href="#rpgfn1_11">11</a></sup> We are aware that the correct plural form of “schema” is, in fact, “schemata,” and yet we have decided to say “schemas” in this book because of the prevalence of the incorrect form in our industry. Our apologies for this. It seems that this is yet another English term molested by computer scientists (“indexes,” instead of the correct form “indices,” is another one that comes to mind).</p></div>
<div epub:type="footnote" id="pgfn1_12"><p class="pgnote"><sup><a href="#rpgfn1_12">12</a></sup> <a href="https://www.spec.edmcouncil.org/fibo/">https://spec.edmcouncil.org/fibo/</a></p></div>
<div epub:type="footnote" id="pgfn1_13"><p class="pgnote"><sup><a href="#rpgfn1_13">13</a></sup> <a href="http://www.semanticarts.com/gist">http://semanticarts.com/gist</a></p></div>
<div epub:type="footnote" id="pgfn1_14"><p class="pgnote"><sup><a href="#rpgfn1_14">14</a></sup> <a href="https://www.schema.org/">https://schema.org/</a></p></div>
</section>
</section>
</body>
</html> |