
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" xml:lang="en-US">
<head>
<title>Designing and Building Enterprise Knowledge Graphs</title>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<link href="../styles/page-template.xpgt" rel="stylesheet" type="application/vnd.adobe-page-template+xml"/>
<meta content="urn:uuid:81982e4f-53b2-476f-ab11-79954b0aab3c" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<section epub:type="chapter">
<h1 class="chno" epub:type="title"><span epub:type="pagebreak" id="page_19" title="19"/>CHAPTER 2</h1>
<h1 class="chtitle" epub:type="title">Designing Enterprise Knowledge Graphs</h1>
<p class="noindent">The architecture to build an enterprise knowledge graph from relational databases consists of three definitional entities.</p>
<p class="bbull">•  The source relational database: this is where the data is stored and is physically structured following a relational schema for which data producers are responsible.</p>
<p class="bbull">•  The target knowledge graph: this is a conceptual model of the domain, represented in a graph, which uses the <i>lingua franca</i> of the data consumers.</p>
<p class="bbull">•  The mappings: declarative associations between select contents of the source relational database schema and the target knowledge graph schema.</p>
<p class="indent">With these three components in place, business questions can be defined in terms of the knowledge graph schema’s conceptual abstraction instead of the individual heterogeneous source databases’ physical structures. Systems that implement knowledge graphs can be realized in a physical or virtual way. Finally, the resulting knowledge graph is used by business users and systems to access the knowledge and data.</p>
<section>
<h2 class="head2" id="ch2_1">2.1<span class="space3"/><span epub:type="title">SOURCE: RELATIONAL DATABASES</span></h2>
<p class="noindent">Relational databases have been at the forefront of enterprise data management since the 1980s, after the creation of the relational model by Codd in the 1970s [<span class="blue">Codd, 1970</span>].</p>
<p class="indent">These databases model data with <i>tables</i> and <i>columns.</i> A key characteristic of relational databases is that they maintain the consistency of the data using integrity constraints such as <i>primary keys</i> and <i>foreign keys.</i> The rows in a table hold the data values, and there is a standard query language: SQL.</p>
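<p class="indent">The following is a minimal sketch of these ideas using Python’s built-in <code>sqlite3</code> module. The table and column names (<code>customer</code>, <code>orders</code>, <code>cid</code>, <code>oid</code>) are assumptions chosen for illustration; the order table is called <code>orders</code> here because <code>order</code> is a reserved word in SQL.</p>

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("""
    CREATE TABLE customer (
        cid   INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT
    )
""")
conn.execute("""
    CREATE TABLE orders (
        oid       INTEGER PRIMARY KEY,
        netsales  REAL,
        currency  TEXT,
        orderdate TEXT,
        cid       INTEGER NOT NULL REFERENCES customer(cid)
    )
""")

conn.execute("INSERT INTO customer VALUES (100, 'Juan Sequeda', 'juan@data.world')")
conn.execute("INSERT INTO orders VALUES (1, 100, 'USD', '2021-01-01', 100)")


def insert_order_for_unknown_customer():
    """Try to violate the foreign key; return True if the database rejects the row."""
    try:
        conn.execute("INSERT INTO orders VALUES (2, 90, 'EUR', '2021-01-02', 999)")
    except sqlite3.IntegrityError:
        return True
    return False


rejected = insert_order_for_unknown_customer()

# A SQL query that joins the two tables through the foreign key.
rows = conn.execute("""
    SELECT o.oid, c.name
    FROM orders AS o JOIN customer AS c ON o.cid = c.cid
""").fetchall()
```

<p class="indent">The rejected insert shows the foreign key constraint at work: an order cannot reference a customer that does not exist.</p>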
</section>
<section>
<h2 class="head2" id="ch2_2">2.2<span class="space3"/><span epub:type="title">TARGET: KNOWLEDGE GRAPH</span></h2>
<p class="noindent">A <i>knowledge graph,</i> as the name implies, represents knowledge and data in the form of a graph. As discussed in the previous chapter, the main elements of a graph are <i>nodes</i> and <i>edges.</i> The nodes represent concepts such as Customer, Order, Product, Address, etc., and the things that <span epub:type="pagebreak" id="page_20" title="20"/>instantiate them (in our examples: customer-1, order-1, etc.). The edges represent relationships between nodes, such as “Order” “placed by” “Customer” and “order-1” “placed by” “customer-1.”</p>
<p class="indent">There are two prevalent graph models: RDF Graphs and Property Graphs.</p>
<figure>
<div class="image" id="fig2_1"><img alt="Image" src="../images/fig2_1.jpg"/></div>
<figcaption>
<p class="figcaption"><span class="blue">Figure 2.1:</span> Graph.</p>
</figcaption>
</figure>
<section>
<h3 class="head3" id="ch2_2_1">2.2.1<span class="space3"/><span epub:type="title">RDF GRAPH</span></h3>
<p class="noindent">The RDF (Resource Description Framework) graph model is a directed edge-labeled graph. It consists of a set of nodes and a set of directed labeled edges between these nodes. RDF is a standardized data model recommended by the W3C, and has been around since the late 1990s. It is also the basic building block of the Semantic Web [<span class="blue">Berners-Lee et al., 2001</span>].</p>
<p class="indent">The grouping <i>node-edge-node</i> is known as an RDF “triple.” The node at which the edge starts is called the <i>subject,</i> the edge is called the <i>predicate,</i> and the node at which the edge ends is called the <i>object.</i> A set of RDF triples is called an <i>RDF graph.</i> Given that RDF was designed within the context of the World Wide Web, it uses IRIs (Internationalized Resource Identifiers) as a means to identify nodes and edges. For a given triple, the subject must either be an IRI (an “IRI reference” is the term the RDF specifications use) or a “blank node” (this is essentially a node that only has an internal identity and cannot be addressed directly in a query, but we will ignore those in this book). The predicate must always be an IRI—there are no unidentified or “unlabeled” edges in RDF. The object can be an IRI, a blank node, or a “literal value” such as a string or a number.</p>
<p class="exe"><b>Example 2.1</b></p>
<p class="noindent">Consider the graph in <a href="#fig2_1">Figure <span class="blue">2.1</span></a>.</p>
<p class="indent">The graph in <a href="#fig2_1">Figure <span class="blue">2.1</span></a> can be represented as an RDF graph by separating each RDF triple, as shown in <a href="#tab2_1">Table <span class="blue">2.1</span></a> (for clarity, we have used simple identifiers instead of full IRIs).</p>
<table class="table1" id="tab2_1">
<caption class="tcaption"><span class="blue">Table 2.1:</span> The RDF triples of the graph in Figure 2.1</caption>
<thead>
<tr>
<th class="thead"><b>Subject</b></th>
<th class="thead"><b>Predicate</b></th>
<th class="thead"><b>Object</b></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tab1">customer-100</td>
<td class="tab1">type</td>
<td class="tab1">Customer</td>
</tr>
<tr>
<td class="tab1c">customer-100</td>
<td class="tab1c">name</td>
<td class="tab1c">“Juan Sequeda”</td>
</tr>
<tr>
<td class="tab1">customer-100</td>
<td class="tab1">email</td>
<td class="tab1">“juan@data.world”</td>
</tr>
<tr>
<td class="tab1c">customer-100</td>
<td class="tab1c">knows</td>
<td class="tab1c">customer-101</td>
</tr>
<tr>
<td class="tab1">customer-101</td>
<td class="tab1">type</td>
<td class="tab1">Customer</td>
</tr>
<tr>
<td class="tab1c">customer-101</td>
<td class="tab1c">name</td>
<td class="tab1c">“Ora Lassila”</td>
</tr>
<tr>
<td class="tab1">customer-101</td>
<td class="tab1">email</td>
<td class="tab1">“ora@amazon.com”</td>
</tr>
<tr>
<td class="tab1c">customer-101</td>
<td class="tab1c">knows</td>
<td class="tab1c">customer-100</td>
</tr>
<tr>
<td class="tab1">order-1</td>
<td class="tab1">type</td>
<td class="tab1">Order</td>
</tr>
<tr>
<td class="tab1c">order-1</td>
<td class="tab1c">netsales</td>
<td class="tab1c">“100”</td>
</tr>
<tr>
<td class="tab1">order-1</td>
<td class="tab1">currency</td>
<td class="tab1">“USD”</td>
</tr>
<tr>
<td class="tab1c">order-1</td>
<td class="tab1c">orderdate</td>
<td class="tab1c">“2021-01-01”</td>
</tr>
<tr>
<td class="tab1">order-1</td>
<td class="tab1">placedBy</td>
<td class="tab1">customer-100</td>
</tr>
<tr>
<td class="tab1c">order-2</td>
<td class="tab1c">type</td>
<td class="tab1c">Order</td>
</tr>
<tr>
<td class="tab1">order-2</td>
<td class="tab1">netsales</td>
<td class="tab1">“90”</td>
</tr>
<tr>
<td class="tab1c">order-2</td>
<td class="tab1c">currency</td>
<td class="tab1c">“EUR”</td>
</tr>
<tr>
<td class="tab1">order-2</td>
<td class="tab1">orderdate</td>
<td class="tab1">“2021-01-02”</td>
</tr>
<tr>
<td class="tab1c">order-2</td>
<td class="tab1c">placedBy</td>
<td class="tab1c">customer-101</td>
</tr>
</tbody>
</table>
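<p class="indent">As a rough sketch, a few of the triples from Example 2.1 can be held in a Python set of (subject, predicate, object) tuples. The <code>match</code> helper is a hypothetical name, not part of any RDF library, and plain strings stand in for both IRIs and literal values, which a real RDF store would keep distinct.</p>

```python
# Each triple is a (subject, predicate, object) tuple; simple identifiers stand in for IRIs.
triples = {
    ("customer-100", "type", "Customer"),
    ("customer-100", "name", "Juan Sequeda"),
    ("customer-100", "knows", "customer-101"),
    ("customer-101", "type", "Customer"),
    ("customer-101", "knows", "customer-100"),
    ("order-1", "type", "Order"),
    ("order-1", "currency", "USD"),
    ("order-1", "placedBy", "customer-100"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples that agree with every component that is not None."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# All nodes typed as Customer, and the customer who placed order-1.
customers = [s for (s, _, _) in match(triples, p="type", o="Customer")]
placed_by = match(triples, s="order-1", p="placedBy")
```

<p class="indent">Because the graph is just a set of triples, adding a new fact never requires changing a table layout; it is one more tuple in the set.</p>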
</section>
<section>
<h3 class="head3" id="ch2_2_2"><span epub:type="pagebreak" id="page_21" title="21"/>2.2.2<span class="space3"/><span epub:type="title">PROPERTY GRAPH</span></h3>
<p class="noindent">The Property Graph model provides additional flexibility (or complexity, depending on how you look at it) when compared to directed edge-labeled graphs, by allowing a set of property-value pairs and a label to be associated with the nodes and edges in a graph. Said another way, Property Graph nodes and edges are <i>structured objects,</i> as opposed to RDF, where they are just identifiers and have no structure. As of 2021, Property Graphs are in the process of standardization by ISO. They are used in graph databases such as Neo4j and TigerGraph. Some graph databases, such as Amazon Neptune, allow the use of either Property Graphs or RDF. Current, pre-standardization property graphs lack some of the “web-friendly” features of RDF, such as global identifiers (IRIs) for nodes and edge labels. This can complicate use cases where interoperability and information exchange are needed, as the implementation now requires one to “re-invent” those features.</p>
<figure>
<div class="image" id="fig2_2"><img alt="Image" src="../images/fig2_2.jpg"/></div>
<figcaption>
<p class="figcaption"><span class="blue">Figure 2.2:</span> Property Graph.</p>
</figcaption>
</figure>
<p class="exe"><b>Example 2.2</b></p>
<p class="noindent"><span epub:type="pagebreak" id="page_22" title="22"/>The RDF Graph shown in Example 2.1 can be visually represented as a Property Graph as in <a href="#fig2_2">Figure <span class="blue">2.2</span></a>.</p>
<p class="indent">This example is extended by adding the property-value pair <i>since=2014</i> to the edge knows. This really is the fundamental difference between Property Graphs and RDF. At the time of writing, W3C is working on a new RDF specification, called RDF-star (or RDF*), that allows similar modeling in RDF as well.<sup><a epub:type="noteref" href="#pgfn2_1" id="rpgfn2_1">1</a></sup></p>
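<p class="indent">To make the idea of nodes and edges as structured objects concrete, here is a small sketch using Python dataclasses. The class and field names (<code>Node</code>, <code>Edge</code>, <code>properties</code>) are assumptions for illustration and do not correspond to the API of any particular graph database.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str
    properties: dict = field(default_factory=dict)  # property-value pairs on the node


@dataclass
class Edge:
    source: str
    target: str
    label: str
    properties: dict = field(default_factory=dict)  # property-value pairs on the edge


juan = Node("customer-100", "Customer", {"name": "Juan Sequeda", "email": "juan@data.world"})
ora = Node("customer-101", "Customer", {"name": "Ora Lassila"})

# The edge itself carries data, which plain RDF triples cannot express directly.
knows = Edge(juan.node_id, ora.node_id, "knows", {"since": 2014})
```

<p class="indent">Note that <code>since</code> is stored on the edge object, not on either customer node.</p>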
</section>
<section>
<h3 class="head3" id="ch2_2_3">2.2.3<span class="space3"/><span epub:type="title">KNOWLEDGE GRAPH SCHEMA</span></h3>
<p class="noindent">A knowledge graph schema is a definition that provides organization to the knowledge graph data, represented in a formal language.</p>
<p class="indent">In the RDF Graph model, the Knowledge Graph Schema is known as an <i>ontology</i> and it is represented in the RDF Schema (RDFS) or OWL (Web Ontology Language) languages [<span class="blue">Uschold, 2018</span>].<sup><a epub:type="noteref" href="#pgfn2_2" id="rpgfn2_2">2</a></sup> There are three main components.</p>
<p class="bbull">•  <b>Class:</b> an abstraction mechanism for creating a collection of objects with similar characteristics. These objects are called <i>instances</i> of a class. For example, a class can be “Customer,” and “customer-100” and “customer-101” are instances of this class. Both classes and instances are nodes in the knowledge graph.</p>
<p class="bbull">•  <b>Datatype Property:</b> these are relationships between instances of classes (the domain) and literals (the range). For example, “name” is a datatype property that relates all the instances of the Customer class to a string datatype. Customer is the domain of the name datatype property while string is the range. A datatype property is an edge in the <span epub:type="pagebreak" id="page_23" title="23"/>knowledge graph from an instance node to a data value. For example, “customer-100” “name” “Juan Sequeda.”</p>
<p class="bbull">•  <b>Object Property:</b> these are relationships between instances of two classes. For example, “placedBy” is an object property that relates all the instances of the “Order” class to instances of the “Customer” class. “Order” is the domain and “Customer” is the range. An object property is an edge in the knowledge graph between two instance nodes. For example, “order-1” “placedBy” “customer-100.”</p>
<p class="indent">Finally, a class may be a <b>subclass</b> of another, thus inheriting the properties from its parent superclass. This corresponds to logical subsumption. In Section <span class="blue">4.3.2</span>, we briefly discuss how subclass can be used for reasoning.</p>
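<p class="indent">The three components and the subclass relation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: <code>PremiumCustomer</code> is an invented subclass, the function names are hypothetical, and domain and range are treated here as checks on an edge, which is a simplification (RDFS itself uses them to infer the types of instances).</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Property:
    name: str
    domain: str               # class whose instances may carry this property
    range_: str               # a class (object property) or a datatype (datatype property)
    is_object_property: bool


# Hypothetical schema; PremiumCustomer is an invented subclass used only for illustration.
subclass_of = {"PremiumCustomer": "Customer"}
properties = {
    "name": Property("name", "Customer", "string", False),
    "placedBy": Property("placedBy", "Order", "Customer", True),
}
instance_types = {"customer-100": "PremiumCustomer", "order-1": "Order"}


def is_a(cls, ancestor):
    """True if cls equals ancestor or reaches it by following subclass links."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = subclass_of.get(cls)
    return False


def edge_conforms(subject, prop_name, obj):
    """Check a single edge against the domain and range declared in the schema."""
    prop = properties[prop_name]
    if not is_a(instance_types.get(subject), prop.domain):
        return False
    if prop.is_object_property:
        return is_a(instance_types.get(obj), prop.range_)
    return isinstance(obj, str) if prop.range_ == "string" else True
```

<p class="indent">Because <code>PremiumCustomer</code> is declared a subclass of <code>Customer</code>, customer-100 is accepted wherever a Customer is expected, which is the inheritance behavior described above.</p>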
<p class="indent">The Property Graph model does not yet have an agreed-upon schema language. As of 2021, the LDBC Property Graph Schema Working Group is in the process of defining recommendations for Property Graph schema languages.</p>
<section id="ch2_sec1">
<h4 class="head4"><span epub:type="title">External Knowledge Graph Schema</span></h4>
<p class="noindent">Knowledge graph schemas, namely ontologies, are a means of not just representing knowledge but also sharing knowledge. The Semantic Web community, specifically the ontology engineering community, has developed methodologies to build knowledge graph schemas that can be reused (see Section <span class="blue">1.4</span>). Industry-specific communities have followed these best practices and built reusable knowledge graph schemas. Therefore, instead of creating a schema from scratch, you can reuse existing schemas. However, you don’t have to reuse them as-is. You can take parts of an existing schema—and possibly extend it—when designing a schema that satisfies your organization’s needs. RDFS and OWL were designed for extensibility. Finally, the source relational databases still need to be mapped to the knowledge graph schema, regardless of whether this schema was created or reused.</p>
<p class="indent">If different organizations reuse the same schema for their knowledge graphs, they can easily exchange data that has a shared meaning. This applies both between different parts of a single enterprise and across enterprises. A successful example is <span class="blue"><a href="https://schema.org/">https://schema.org/</a></span>, a shared vocabulary that makes it easier for webmasters to annotate their webpages in a way that can be processed by any application crawling the web.</p>
<p class="indent">Over the past decade, many reusable knowledge graph schemas in various industries have been created. A non-exhaustive list:</p>
<p class="bbull">•  Life Science: <span class="blue"><a href="https://bioportal.bioontology.org/">https://bioportal.bioontology.org/</a></span></p>
<p class="bbull">•  Finance: <span class="blue"><a href="https://spec.edmcouncil.org/fibo/">https://spec.edmcouncil.org/fibo/</a></span></p>
<p class="bbull">•  Oil and Gas: <span class="blue"><a href="https://www.iso.org/standard/70694.html">https://www.iso.org/standard/70694.html</a></span></p>
<p class="bbull">•  Healthcare: <span class="blue"><a href="https://www.hl7.org/fhir/">https://www.hl7.org/fhir/</a></span></p>
<p class="bbull">•  <span epub:type="pagebreak" id="page_24" title="24"/>Real Estate: <span class="blue"><a href="https://w3id.org/rec/full/">https://w3id.org/rec/full/</a></span></p>
<p class="bbull">•  Social Network: <span class="blue"><a href="http://xmlns.com/foaf/spec/">http://xmlns.com/foaf/spec/</a></span></p>
<p class="bbull">•  Cultural Heritage: <span class="blue"><a href="https://pro.europeana.eu/page/edm-documentation">https://pro.europeana.eu/page/edm-documentation</a></span></p>
<p class="bbull">•  General Business Concepts: <span class="blue"><a href="https://www.semanticarts.com/gist/">https://www.semanticarts.com/gist/</a></span></p>
<p class="bbull">•  Metadata: <span class="blue"><a href="https://www.dublincore.org/schemas/rdfs/">https://www.dublincore.org/schemas/rdfs/</a>, <a href="https://open-kos.org">https://open-kos.org</a>, <a href="https://www.w3.org/TR/vocab-dcat-2/">https://www.w3.org/TR/vocab-dcat-2/</a></span></p>
<p class="bbull">•  Provenance: <span class="blue"><a href="https://www.w3.org/TR/prov-o/">https://www.w3.org/TR/prov-o/</a></span></p>
</section>
</section>
<section>
<h3 class="head3" id="ch2_2_4">2.2.4<span class="space3"/><span epub:type="title">AN ABSTRACT GRAPH NOTATION USED IN THIS BOOK</span></h3>
<p class="noindent">The goal of this book is to provide the elements to design and build knowledge graphs, regardless of the underlying graph model. We therefore use an abstract notation and terminology that is a generalization over RDF graphs and Property Graphs. The main elements are: Concepts, Concept Attributes, Relationships, and Relationship Attributes.</p>
<p class="bbull">•  <b>Concept:</b> a node in the graph that represents a real-world entity. For example:</p>
<div class="top">
<p class="center">(Customer)</p>
<p class="center">(Order)</p>
</div>
<p class="tbull">An instance of a Concept is also a node in a graph. The instance is connected to the Concept through a special relationship called —type→. For example, order-1 and order-2 are instances of the Order concept, and customer-100 and customer-101 are instances of the Customer concept:</p>
<div class="top">
<p class="center">(order-1) —type→ (Order)</p>
<p class="center">(order-2) —type→ (Order)</p>
<p class="center">(customer-100) —type→ (Customer)</p>
<p class="center">(customer-101) —type→ (Customer)</p>
</div>
<p class="bbull">•  <b>Relationship:</b> the edge in a graph that represents a connection between two concepts. For example:</p>
<div class="top">
<p class="center">(Order) —placedBy→ (Customer)</p>
<p class="center">(Customer) —knows→ (Customer)</p>
</div>
<p class="tbull">The relationship in a graph represents connections between instances of concepts. For example, order-1 is placedBy customer-100, order-2 is placedBy customer-101, and customer-100 knows customer-101:</p>
<div class="top">
<p class="center"><span epub:type="pagebreak" id="page_25" title="25"/>(order-1) —placedBy→ (customer-100)</p>
<p class="center">(order-2) —placedBy→ (customer-101)</p>
<p class="center">(customer-100) —knows→ (customer-101)</p>
</div>
<p class="bbull">•  <b>Concept Attribute:</b> represents a means of associating data values to a concept. It is represented as an edge in a graph that represents a connection between a concept and a data type. For example:</p>
<div class="top">
<p class="center">(Customer) —name→ [string]</p>
</div>
<p class="tbull">The concept attribute connects an instance of a concept with a data value. For example, “name = Juan Sequeda” is an attribute associated with customer-100 and “name = Ora Lassila” is an attribute associated with customer-101:</p>
<div class="top">
<p class="center">(customer-100) —name→ [Juan Sequeda]</p>
<p class="center">(customer-101) —name→ [Ora Lassila]</p>
</div>
<p class="bbull">•  <b>Relationship Attribute:</b> represents a means of associating data values to a relationship. It is represented as a key-value pair associated with a relationship, where the key is an attribute and the value is a datatype. For example:</p>
<div class="top">
<p class="center">(Customer) —knows[since = date]→ (Customer)</p>
</div>
<p class="tbull">For example, “since = 2014” is an attribute associated with the relationship knows:</p>
<div class="top">
<p class="center">(customer-100) —knows[since = 2014]→ (customer-101)</p>
</div>
<p class="indent">Note that we are providing simple identifiers for the nodes and edges: (Customer), (Order), (order-1), (customer-1), —placedBy→, —name→, —knows[since =]→. See Section <span class="blue">2.2.6</span> for more information on identifiers.</p>
<p class="indent">This abstract graph model can be related to both RDF Graphs and Property Graphs. In RDF, a concept is called Class, a concept attribute is called Datatype Property, and a relationship is called Object Property. Relationship attributes in RDF are a bit trickier (this, in fact, is the primary difference between RDF graphs and Property Graphs). In “plain” RDF, edges cannot have attributes, but this can be addressed in a couple of different ways, either by changing how one models relationships, or via a mechanism called “reification” which the RDF standard supports.<sup><a epub:type="noteref" href="#pgfn2_3" id="rpgfn2_3">3</a></sup></p>
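<p class="indent">As an illustration of reification, the sketch below turns one edge into a node of its own so that an attribute can be attached to it. The terms <code>rdf:Statement</code>, <code>rdf:subject</code>, <code>rdf:predicate</code>, and <code>rdf:object</code> come from the RDF vocabulary; the function name <code>reify</code> and the identifier <code>statement-1</code> are hypothetical, and simple identifiers are used in place of full IRIs.</p>

```python
def reify(statement_id, subject, predicate, obj, attributes):
    """Describe one edge as a node of its own so that attributes can hang off it."""
    triples = [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]
    # Each relationship attribute becomes an ordinary triple about the statement node.
    triples += [(statement_id, key, value) for key, value in attributes.items()]
    return triples


# "statement-1" is a hypothetical identifier for the reified edge.
reified = reify("statement-1", "customer-100", "knows", "customer-101", {"since": 2014})
```

<p class="indent">The single Property Graph edge with one attribute becomes five triples here, which gives a sense of the trade-off between the two models.</p>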
<p class="indent">In Property Graphs [<span class="blue">Bonifati et al., 2018</span>], a concept is a node that has a label, which is a descriptive identifier of the real-world entity it represents. Concept attributes are key-value pairs, <span epub:type="pagebreak" id="page_26" title="26"/>or so-called properties, associated with a node; they provide the actual data an object represents. Relationships are edges between the nodes that also have labels. Relationship attributes are key-value pairs associated with the edges (<a href="#fig2_3">Figure <span class="blue">2.3</span></a>).</p>
<figure>
<div class="image" id="fig2_3"><img alt="Image" src="../images/fig2_3.jpg"/></div>
<figcaption>
<p class="figcaption"><span class="blue">Figure 2.3:</span> Comparing RDF and Property Graphs and the notation and terminology used in this book.</p>
</figcaption>
</figure>
</section>
<section>
<h3 class="head3" id="ch2_2_5">2.2.5<span class="space3"/><span epub:type="title">GRAPH QUERY LANGUAGES</span></h3>
<p class="noindent">The RDF and Property Graph models have their own types of query languages. SPARQL (aka “SPARQL Protocol and RDF Query Language”) is the standardized query language for RDF.</p>
<p class="indent">As of 2021, ISO is in the process of standardizing a query language for Property Graphs: GQL. There are a variety of open and proprietary property graph query languages: Neo4j’s Cypher, TigerGraph’s GSQL, Oracle’s PGQL, Apache TinkerPop’s Gremlin, and G-CORE.</p>
<p class="indent">We will not cover query languages, because the focus of this book is on designing and building knowledge graphs, not on querying them.</p>
</section>
<section>
<h3 class="head3" id="ch2_2_6">2.2.6<span class="space3"/><span epub:type="title">IDENTIFIERS</span></h3>
<p class="noindent">Metcalfe’s Law states that the value of a network is proportional to the square of the number of nodes in the network. This law can be applied to data (albeit not with the same proportions). <span epub:type="pagebreak" id="page_27" title="27"/>A knowledge graph’s value increases when it is the result of integrating multiple disparate data sources [<span class="blue">Hendler and Golbeck, 2008</span>].</p>
<p class="indent">To accelerate the integration of data, it is beneficial to have an agreed-upon <i>identifier</i> mechanism or convention. An identifier is a label that globally and uniquely identifies an element in a knowledge graph.<sup><a epub:type="noteref" href="#pgfn2_4" id="rpgfn2_4">4</a></sup> Consider building a knowledge graph that integrates data from an order management system (OMS) and a customer relationship management (CRM) system. The order management system will have data about customers and the orders they have placed, while the CRM system will have more detailed data about the customers. By agreeing on the identifier scheme for a customer, once the knowledge graphs for the OMS and the CRM are created, all the customer data is integrated because both refer to the same thing (e.g., the same customer customer-100). It is important to note that the identifier must uniquely identify an element not just within the database it comes from, but also globally with respect to all the databases that are being integrated, and even future databases. Namespaces can be used to create globally unique identifiers. Effectively, identifiers serve as global join keys that can overcome the physical barriers of database applications within your organization.</p>
<p class="indent">Furthermore, identifiers should also be applied to the knowledge graph schema: Concepts (Order, Customer), Attributes (currency, name), and Relationships (placedBy, knows). This provides a mechanism for reuse and interoperability of the schema with a clear meaning (i.e., if we are referring to the same thing, we know what we are talking about) that enables a shared understanding of the data, democratizes the access to the data, and thus increases the use of the data.</p>
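<p class="indent">The OMS and CRM integration described above can be sketched as follows. The two small graphs are hypothetical extracts, and the helper names are assumptions; the point is only that a shared identifier lets integration reduce to a set union.</p>

```python
# Hypothetical extracts from two systems, already expressed as triples.
oms_graph = {
    ("order-1", "placedBy", "customer-100"),
    ("customer-100", "type", "Customer"),
}
crm_graph = {
    ("customer-100", "type", "Customer"),
    ("customer-100", "name", "Juan Sequeda"),
    ("customer-100", "email", "juan@data.world"),
}

# Because both sources use the same identifier, integration is a set union.
integrated = oms_graph | crm_graph


def describe(graph, node):
    """Collect every predicate and object recorded for a node."""
    return {p: o for (s, p, o) in graph if s == node}


def customer_name_for_order(graph, order):
    """Follow placedBy from an order, then read the name of that customer."""
    customer = describe(graph, order)["placedBy"]
    return describe(graph, customer).get("name")
```

<p class="indent">Asking for the customer name of order-1 returns nothing on the OMS graph alone, but succeeds on the integrated graph, with no explicit join condition written anywhere.</p>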
<div class="boxg">
<p class="noindent">The best identifiers are SIMPLE.<sup>a</sup> The following is a guide to design global and universal identifiers:</p>
<p class="bbull">•  <b>S</b>torable: You should be able to store the identifier offline. For instance, an order ID may be stored in OMS and ERP systems.</p>
<p class="bbull">•  <b>I</b>mmutable: It should not change over time. An order ID usually stays the same within an organization.</p>
<p class="bbull">•  <b>M</b>eticulous: The same entity in two different systems should resolve to the same ID. It should be very difficult (or impossible) for two occurrences of the same order to have different order IDs.</p>
<p class="bbull">•  <b>P</b>ortable: An order ID can be moved from one system to another.</p>
<p class="bbull">•  <b>L</b>ow-cost: The ID needs to be cheap (or even free). If it is too expensive, the transaction costs will make it hard to use in many situations.</p>
<p class="bbull">•  <span epub:type="pagebreak" id="page_28" title="28"/><b>E</b>stablished: It needs to cover almost all of its subjects. An order ID covers all the orders.</p>
<p class="noindent"><sup>a</sup><span class="blue"><a href="https://www.safegraph.com/blog/data-standards-and-the-join-key">https://www.safegraph.com/blog/data-standards-and-the-join-key</a></span></p>
</div>
<p class="indent">Therefore, an important initial step in designing a knowledge graph is to decide on an appropriate identifier scheme for the knowledge graph schema and the data elements. Even though conventions may evolve over time, investing upfront thought on identifiers is crucial. Identifiers are the glue that connects data sources together and effectively establishes the graph.</p>
<p class="indent">An important topic related to identifiers is entity resolution, which is the task of finding records in different datasets that refer to the same entity. We briefly touch on this topic in Section <span class="blue">4.3.2</span>. However, entity resolution deserves its own deep dive, which is out of scope for this book.</p>
<section id="ch2_sec2">
<h4 class="head4"><span epub:type="title">Identifiers in RDF Graphs</span></h4>
<p class="noindent">The notion of identifiers is built into the RDF model through Internationalized Resource Identifiers (IRIs). The IRI is an internet protocol standard that builds on Uniform Resource Identifiers (URIs) by permitting not just the ASCII character set but also Universal Character Set characters, such as those used in Chinese and Japanese. IRIs are part of the architecture of the World Wide Web<sup><a epub:type="noteref" href="#pgfn2_5" id="rpgfn2_5">5</a></sup> and follow the principle: <i>Global naming leads to global network effects.</i> By design, IRIs have global scope. Thus, two different appearances of an IRI denote the same thing.</p>
<p class="indent">That is why the subjects and predicates of an RDF triple, as well as any objects that are not literal values, are IRIs (setting blank nodes aside). In an RDF knowledge graph, the schema elements are identified by IRIs. The identifier for the concept Order in an RDF knowledge graph schema could be:</p>
<p class="indenttb"><span class="blue"><a href="https://schema.org/Order">https://schema.org/Order</a></span></p>
<p class="noindentt">The identifier for the attribute name could be:</p>
<p class="indenttb"><span class="blue"><a href="https://schema.org/name">https://schema.org/name</a></span></p>
<p class="noindent">The identifier for the relationship paymentMethod could be:</p>
<p class="indenttb"><span class="blue"><a href="https://schema.org/paymentMethod">https://schema.org/paymentMethod</a></span></p>
<p class="indent">These IRIs can be made dereferenceable, meaning that you can even look them up in your browser (via HTTP GET).</p>
<p class="indent">As we mentioned earlier, the instances in an RDF knowledge graph are also identified using IRIs. The real identifier for the order-1—which is an instance of an Order—could be:</p>
<p class="indenttb"><span class="blue"><a href="https://mycompany.com/data/order-1">https://mycompany.com/data/order-1</a></span></p>
<p class="indent"><span epub:type="pagebreak" id="page_29" title="29"/>Instead of writing these (possibly) long IRIs, one can create shortcuts by defining a prefix. For example, the prefix <code>schema</code> can stand for <code><a href="https://schema.org/">https://schema.org/</a></code>, so that <code>schema:Order</code> is equivalent to <code><a href="https://schema.org/Order">https://schema.org/Order</a></code>.</p>
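<p class="indent">Prefix expansion is a simple string operation, as the following sketch shows. The function name <code>expand</code> and the prefix name <code>data</code> are assumptions for illustration; only the <code>schema</code> prefix and the two base IRIs appear in the text above.</p>

```python
PREFIXES = {
    "schema": "https://schema.org/",
    "data": "https://mycompany.com/data/",  # hypothetical prefix name for the instance IRIs
}


def expand(compact_name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'schema:Order' into a full IRI."""
    prefix, sep, local_name = compact_name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {compact_name!r}")
    return prefixes[prefix] + local_name
```

<p class="indent">For instance, <code>expand("schema:Order")</code> yields the full IRI for the Order concept, and <code>expand("data:order-1")</code> yields the IRI of the order instance.</p>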
</section>
<section id="ch2_sec3">
<h4 class="head4"><span epub:type="title">Establishing Identifiers from Relational Data</span></h4>
<p class="noindent">In order to dynamically establish identifiers from relational data, we need to define an identifier template. This is a format string that can be used to build identifier strings from multiple components. An identifier template references columns of a relational table. Our notation is to enclose column names in curly braces (“{” and “}”), for example: template-{<span class="underline"><code>ID</code></span>}.</p>
<p class="indent">Common practice is to use a column that is a key (primary or unique) of a table. However, a key is only locally unique to its table. Consider a table that has a column id which is the primary key. A row may have the value 1 as the id, which uniquely identifies that row in the table. However, the value 1 may also appear as a primary key value in another table. In order to make the identifier globally unique, we need to combine the key column with some identifying string.</p>
<p class="exe"><b>Example 2.3</b></p>
<p class="noindent">Consider the table <code>order</code> that has a primary key column <code>oid</code> and a table <code>customer</code> with a primary key column <code>cid</code>. Each of these tables has a row where <code>oid</code> and <code>cid</code> both have the value 1. To make the identifiers globally unique, one could define the identifier template for customer as customer-{<span class="underline"><code>cid</code></span>} and the identifier template for order as order-{<span class="underline"><code>oid</code></span>}. Applying the templates to the relational data yields (customer-1) and (order-1).</p>
<p class="indent">As mentioned earlier, in this book we use simple identifiers for the sake of readability. In other words, we do not pretend that (customer-1) would be globally unique. For RDF Knowledge Graphs, changing the prefix <code>customer-</code> to something that makes the identifier a valid IRI would make the resulting identifiers globally unique (e.g., the prefix could be <code><a href="https://mycompany.com/data/customer-">https://mycompany.com/data/customer-</a></code> instead).</p>
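<p class="indent">Since the curly-brace notation matches Python’s format strings, applying an identifier template can be sketched with <code>str.format</code>. The function name <code>apply_template</code> and the sample rows are hypothetical; a production mapping would also need to escape column values that are not valid inside an IRI, which this sketch does not attempt.</p>

```python
def apply_template(template, rows):
    """Build one identifier per row by substituting column values into the template."""
    # Columns not mentioned in the template are simply ignored by str.format.
    return [template.format(**row) for row in rows]


# Hypothetical rows, each a mapping from column name to value.
customer_rows = [{"cid": 1, "name": "Juan Sequeda"}]
order_rows = [{"oid": 1, "currency": "USD"}]

customer_ids = apply_template("customer-{cid}", customer_rows)
order_ids = apply_template("order-{oid}", order_rows)

# Swapping the prefix for an IRI base makes the identifier globally unique.
customer_iris = apply_template("https://mycompany.com/data/customer-{cid}", customer_rows)
```

<p class="indent">Both rows carry the key value 1, yet the resulting identifiers differ because each template contributes its own identifying string.</p>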
</section>
</section>
<section>
<h3 class="head3" id="ch2_2_7">2.2.7<span class="space3"/><span epub:type="title">MODELING</span></h3>
<p class="noindent">Modeling can be considered an art and a science. An art because beauty (or in this case, a form of correctness) is in the eye of the beholder (a data consumer). For one user, a data model can be too general, for someone else, the same model could be too complicated, and for another user it could be just right. It is also a science because the model can have implications on technical factors, such as the resulting size of data, size of query, and query performance.</p>
<p class="indent">Consider modeling the following scenario: an Order has a shipping address. An address has a street number, street name, city, state, postal code, and country.</p>
<p class="noindent"><b>Scenario 1:</b> Order and Address are Concepts</p>
<div class="top">
<p class="center">(Order) —hasShippingAddress→ (Address)</p>
<p class="center"><span epub:type="pagebreak" id="page_30" title="30"/>(Address) —streetNumber→ [string]</p>
<p class="center">(Address) —streetName→ [string]</p>
<p class="center">(Address) —city→ [string]</p>
<p class="center">(Address) —state→ [string]</p>
<p class="center">(Address) —postalCode→ [string]</p>
<p class="center">(Address) —country→ [string]</p>
</div>
<p class="noindent"><b>Scenario 2:</b> Order, Address, and Country are Concepts</p>
<div class="top">
<p class="center">(Order) —hasShippingAddress→ (Address)</p>
<p class="center">(Address) —locatedIn→ (Country)</p>
<p class="center">(Address) —streetNumber→ [string]</p>
<p class="center">(Address) —streetName→ [string]</p>
<p class="center">(Address) —city→ [string]</p>
<p class="center">(Address) —state→ [string]</p>
<p class="center">(Address) —postalCode→ [string]</p>
<p class="center">(Country) —countryName→ [string]</p>
</div>
<p class="noindent"><b>Scenario 3:</b> Order, Address, State, and Country are Concepts</p>
<div class="top">
<p class="center">(Order) —hasShippingAddress→ (Address)</p>
<p class="center">(Address) —locatedIn→ (State)</p>
<p class="center">(State) —locatedIn→ (Country)</p>
<p class="center">(Address) —streetNumber→ [string]</p>
<p class="center">(Address) —streetName→ [string]</p>
<p class="center">(Address) —city→ [string]</p>
<p class="center">(Address) —state→ [string]</p>
<p class="center">(Address) —postalCode→ [string]</p>
<p class="center">(State) —stateName→ [string]</p>
<p class="center">(Country) —countryName→ [string]</p>
</div>
<p class="indent">At one extreme, all address attributes (street number, street name, city, state, etc.) are associated with an Address concept, as shown in Scenario 1. Attributes can also be decomposed into their own concepts: Scenario 2 depicts Country as a standalone concept, and Scenario 3 depicts State and Country as standalone concepts. We could continue decomposing (or normalizing) until we reach the most granular level, where every attribute associated with the Address concept of Scenario 1 becomes its own standalone concept.</p>
<p class="indent">Which model is better? Which one is worse? This is a tricky question and the answer is: it depends. You have to decide how to balance between practicality and “purity” of the data model. A data consumer may think about the world in a very granular way (purity). The designer of the knowledge graph needs to think about the identifiers for each concept, the size of the graph, the complexity of the queries, and the performance implications (practicality).</p>
<p class="indent"><span epub:type="pagebreak" id="page_31" title="31"/>A rule of thumb: if you have a need to uniquely identify a thing in order to point and reference it because you are going to add information about that thing, then make that thing a concept. Otherwise, keep that thing as an attribute associated with an existing concept.</p>
<p class="indent">For example, address, state, and country should be their own concepts because there can be clear identifiers for those things and information associated with them (abbreviations, population, etc.). What about postal code? Maybe. What about street name? Probably not (unless you have a geographic use case).</p>
<p class="indent">Graph modeling is important. The choices made matter to the data consumers and designers of the knowledge graph. Note that these questions are not unique to graphs. Actually, graphs by default are in the sixth normal form (6NF). These same considerations apply to relational data modeling and conceptual models in general.</p>
<p class="indent">We do not try to educate the reader about relational and graph modeling, but we do want to make sure they understand that they need to learn about modeling. Therefore, we refer the reader to <span class="blue">Allemang et al.</span> [<span class="blue">2020</span>] and <span class="blue">Alexopoulos</span> [<span class="blue">2020</span>]. Our book is in between those two: how to map from a relational model to a graph model.</p>
</section>
</section>
<section>
<h2 class="head2" id="ch2_3">2.3<span class="space3"/><span epub:type="title">MAPPINGS: RELATIONAL DATABASE TO KNOWLEDGE GRAPH</span></h2>
<p class="noindent">A mapping is a function that represents the relationship from a source data model to a target data model. Mappings are used to represent how a relational database (the source) can be represented in a Knowledge Graph (the target). Mappings are commonly represented in a declarative formalism such as a rule language. A rule is an IF &lt;<code>condition</code>&gt; THEN &lt;<code>conclusion</code>&gt; construct where the &lt;<code>condition</code>&gt; is known as the body of the rule (or antecedent) and the &lt;<code>conclusion</code>&gt; is known as the head of the rule (or consequent). In a mapping, the body of the rule is a condition that corresponds to the source data model and the head of the rule is a conclusion corresponding to the target data model. A set of rules corresponds to a mapping. We use the following notation to represent a mapping rule:</p>
<figure>
<div class="image" id="fig_1"><img alt="Image" src="../images/pg31_1.jpg"/></div>
</figure>
<p class="indent">Once mappings have been defined, they can then be applied in a materialization or virtualization approach. In a materialization approach, also known as ETL (Extract, Transform, and Load), the mappings are used to physically transform the relational databases into a knowledge graph, which is then loaded into a graph database. In other words, the mappings represent the transforms: they correspond to the T (transformation) of ETL. In rule engine parlance, this approach is called <i>forward chaining.</i></p>
<p class="indent">In a virtualization approach, the mappings are used to rewrite queries in terms of the Knowledge Graph (e.g., SPARQL, Cypher) into SQL queries over the relational databases. In <span epub:type="pagebreak" id="page_32" title="32"/>rule engine parlance, this approach is called <i>backward chaining.</i> We will dive into these details in the tools section.</p>
<p class="indent">There are two types of mappings: Direct Mapping and Custom Mapping.</p>
<section>
<h3 class="head3" id="ch2_3_1">2.3.1<span class="space3"/><span epub:type="title">DIRECT MAPPING</span></h3>
<p class="noindent">A direct mapping is a default and automatic representation of a relational database as a knowledge graph, without any human intervention [<span class="blue">Sequeda et al., 2012</span>]. The resulting knowledge graph mirrors the relational database schema. A direct mapping consists of a set of fixed rules that are applied to all relational databases. If changes are to be made to the rules, then we would be defining a Custom Mapping (see the next section).</p>
<p class="indent">The definition of a Direct Mapping is the following.</p>
<p class="bbull">•  Table is a Concept: each table in the relational database represents a concept and each row in the table represents a node that is an instance of the concept.</p>
<p class="bbull">•  Column is a Concept Attribute: each column of a table is an attribute associated to the concept which has been directly mapped to the corresponding table.</p>
<p class="bbull">•  Foreign Key is a Relationship: each column that is a foreign key is a relationship going from the concept that has been directly mapped to the referencing table to the concept that has been directly mapped to the referenced table.</p>
<p class="bbull">•  Primary key columns are used in a template to create an identifier for each row in a table.</p>
<p class="exe"><b>Example 2.4</b></p>
<p class="noindent">Consider the following relational database. The <code>sales_flat_order</code> table stores all the order transactions and <code>customer_entity</code> stores all the data about customers. The column <code>entity_id</code> is the primary key of the <code>sales_flat_order</code> table. The column <code>entity_id</code> is the primary key of the <code>customer_entity</code> table. The column <code>customer_id</code> of the <code>sales_flat_order</code> table is a foreign key that references the column <code>entity_id</code> of the <code>customer_entity</code> table.</p>
<div class="lf3">
<table class="table2" id="tab_1">
<thead>
<tr>
<th class="tc" colspan="4"><code>sales_flat_order</code></th>
</tr>
<tr>
<th class="thead tc"><code><b>entity_id</b></code></th>
<th class="thead tc"><code><b>grand_total</b></code></th>
<th class="thead tc"><code><b>order_currency_code</b></code></th>
<th class="thead tc"><code><b>customer_id</b></code></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tab1 tc">1</td>
<td class="tab1 tc">110</td>
<td class="tab1 tc">USD</td>
<td class="tab1 tc">100</td>
</tr>
<tr>
<td class="tab1 tc">2</td>
<td class="tab1 tc">100</td>
<td class="tab1 tc">EUR</td>
<td class="tab1 tc">101</td>
</tr>
</tbody>
</table>
</div>
<div class="lf5">
<table class="table2" id="tab_2">
<thead>
<tr>
<th class="tc" colspan="3"><code>customer_entity</code></th>
</tr>
<tr>
<th class="thead tc"><code><b>entity_id</b></code></th>
<th class="thead tc"><code><b>email</b></code></th>
<th class="thead tc"><code><b>is_active</b></code></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tab1 tc">100</td>
<td class="tab1 tc">juan@data.world</td>
<td class="tab1 tc">1</td>
</tr>
<tr>
<td class="tab1 tc">101</td>
<td class="tab1 tc"><a href="mailto:ora@amazon.com">ora@amazon.com</a></td>
<td class="tab1 tc">1</td>
</tr>
<tr>
<td class="tab1 tc">102</td>
<td class="tab1 tc"><a href="mailto:alice@email.com">alice@email.com</a></td>
<td class="tab1 tc">0</td>
</tr>
</tbody>
</table>
</div>
<p class="indent"><span epub:type="pagebreak" id="page_33" title="33"/>The resulting knowledge graph schema based on a direct mapping is the following:</p>
<div class="box-top">
<p class="center">(sales_flat_order)</p>
<p class="center">(customer_entity)</p>
<p class="center">(sales_flat_order) —sales_flat_order-customer_id → (customer_entity)</p>
<p class="center">(sales_flat_order) —entity_id → <code>[int]</code></p>
<p class="center">(sales_flat_order) —grand_total → <code>[float]</code></p>
<p class="center">(sales_flat_order) —order_currency_code → <code>[string]</code></p>
<p class="center">(sales_flat_order) —customer_id → <code>[int]</code></p>
<p class="center">(customer_entity) —entity_id → <code>[int]</code></p>
<p class="center">(customer_entity) —email → <code>[string]</code></p>
<p class="center">(customer_entity) —is_active→ <code>[int]</code></p>
</div>
<p class="indent">The resulting knowledge graph based on the direct mapping is the following:</p>
<div class="box-top">
<p class="center">(sales_flat_order-1) —type→ (sales_flat_order)</p>
<p class="center">(sales_flat_order-1) —entity_id→ <code>[1]</code></p>
<p class="center">(sales_flat_order-1) —grand_total→ <code>[110]</code></p>
<p class="center">(sales_flat_order-1) —order_currency_code→ <code>[USD]</code></p>
<p class="center">(sales_flat_order-1) —customer_id→ <code>[100]</code></p>
<p class="center">(sales_flat_order-1) —sales_flat_order-customer_id→ (customer_entity-100)</p>
<p class="center">(sales_flat_order-2) —type→ (sales_flat_order)</p>
<p class="center">(sales_flat_order-2) —entity_id→ <code>[2]</code></p>
<p class="center">(sales_flat_order-2) —grand_total→ <code>[100]</code></p>
<p class="center">(sales_flat_order-2) —order_currency_code→ <code>[EUR]</code></p>
<p class="center">(sales_flat_order-2) —customer_id→ <code>[101]</code></p>
<p class="center">(sales_flat_order-2) —sales_flat_order-customer_id→ (customer_entity-101)</p>
<p class="center">(customer_entity-100) —type→ (customer_entity)</p>
<p class="center">(customer_entity-100) —entity_id→ <code>[100]</code></p>
<p class="center">(customer_entity-100) —email → <code>[juan@data.world]</code></p>
<p class="center">(customer_entity-100) —is_active→ <code>[1]</code></p>
<p class="center">(customer_entity-101) —type→ (customer_entity)</p>
<p class="center">(customer_entity-101) —entity_id→ <code>[101]</code></p>
<p class="center">(customer_entity-101) —email→ <code>[<a href="mailto:ora@amazon.com">ora@amazon.com</a>]</code></p>
<p class="center">(customer_entity-101) —is_active→ <code>[1]</code></p>
<p class="center">(customer_entity-102) —type→ (customer_entity)</p>
<p class="center">(customer_entity-102) —entity_id→ <code>[102]</code></p>
<p class="center">(customer_entity-102) —email→ <code>[<a href="mailto:alice@email.com">alice@email.com</a>]</code></p>
<p class="center">(customer_entity-102) —is_active→ <code>[0]</code></p>
</div>
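<p class="indent">As a rough sketch, the “Table is a Concept” and “Foreign Key is a Relationship” rules can be written as SQL queries over the <code>sales_flat_order</code> table of Example 2.4. The output column names <code>s</code>, <code>p</code>, and <code>o</code> (subject, predicate, object) follow the convention used in Section 2.3.3; the <code>WHERE</code> clause is an added assumption that skips orders without a customer.</p>
<div class="boxg">
<p class="noindent"><code>-- Table is a Concept: one type edge per row</code></p>
<p class="noindent"><code>SELECT concat('sales_flat_order-', entity_id) AS s,</code></p>
<p class="noindent"><code>       'type' AS p,</code></p>
<p class="noindent"><code>       'sales_flat_order' AS o</code></p>
<p class="noindent"><code>FROM sales_flat_order;</code></p>
<p class="noindent"><code>-- Foreign Key is a Relationship: one edge per non-NULL foreign key value</code></p>
<p class="noindent"><code>SELECT concat('sales_flat_order-', entity_id) AS s,</code></p>
<p class="noindent"><code>       'sales_flat_order-customer_id' AS p,</code></p>
<p class="noindent"><code>       concat('customer_entity-', customer_id) AS o</code></p>
<p class="noindent"><code>FROM sales_flat_order</code></p>
<p class="noindent"><code>WHERE customer_id IS NOT NULL;</code></p>
</div>
<p class="indent">Because the rules are fixed, the same two query shapes apply to any table; only the table and column names change.</p>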
<p class="indent"><span epub:type="pagebreak" id="page_34" title="34"/>If a relational database is modeled following ideal practices (e.g., 3NF), then the resulting knowledge graph from a direct mapping may be an adequate start. However, enterprise relational databases are complex and inscrutable, thus the resulting knowledge graph from a directly mapped enterprise relational database is also going to be inscrutable.</p>
<section id="ch2_sec4">
<h4 class="head4"><span epub:type="title">Direct Mapping to RDF Graphs</span></h4>
<p class="noindent">The W3C has standardized a direct mapping from relational databases to RDF Graphs [<span class="blue">M. Arenas, 2012</span>]. The W3C Direct Mapping consists of two parts: a specification for generating identifiers for the different components of the database schema, and a specification for using the identifiers, in order to generate a direct graph.</p>
<p class="noindentt"><b>Generating Identifiers:</b> The W3C Direct Mapping generates an identifier for rows, tables, columns, and foreign keys. If a table has a primary key, then the row identifier is an IRI, obtained by concatenating a base IRI, the percent-encoded form of the table name, the “/” character, and, for each column in the primary key, in order:</p>
<p class="bbull">•  the percent-encoded form of the column name,</p>
<p class="bbull">•  the “=” character,</p>
<p class="bbull">•  the percent-encoded representation of the column value, and</p>
<p class="bbull">•  if it is not the last column in the primary key, the “;” character.</p>
<p class="indent">If a table does not have a primary key, then the row identifier is a fresh blank node that is unique to each row.</p>
<p class="indent">The IRI for a table is obtained by concatenating the base IRI with the percent-encoded form of the table name. The IRI for an attribute is obtained by concatenating the base IRI with the percent-encoded form of the table name, the “#” character, and the percent-encoded form of the column name. Finally, the IRI for a foreign key is obtained by concatenating the base IRI with the percent-encoded form of the table name, the string “#ref-”, and, for each column in the foreign key, in order:</p>
<p class="bbull">•  the percent-encoded form of the column name and</p>
<p class="bbull">•  if it is not the last column in the foreign key, a “;” character.</p>
<p class="noindent"><b>Generating the Direct Graph:</b> A Direct Graph is the RDF graph resulting from directly mapping each of the rows of each table and each view in a database schema. Each row in a table generates a Row Graph. The row graph is an RDF graph consisting of the following triples: (1) a row type triple, (2) a literal triple for each column in a table where the column value is non-NULL, and (3) a reference triple for each foreign key in the table where none of the column values is NULL. A row type triple is an RDF triple with the subject as the row node for the <span epub:type="pagebreak" id="page_35" title="35"/>row, the predicate as the RDF IRI <code>rdf:type</code> and the object as the table IRI for the table name. A literal triple is an RDF triple with the subject as the row node for the row, the predicate as the literal property IRI for the column, and the object as the natural RDF literal representation of the column value. Finally, a reference triple is an RDF triple with the subject as the row node for the row, the predicate as the reference property IRI for the columns and the object as the row node for the referenced row.</p>
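<p class="indent">As an illustration, the three kinds of triples in a row graph can be sketched in SQL for the <code>sales_flat_order</code> table of Example 2.4. This is a simplified sketch rather than a conforming implementation: the base IRI <code>https://mycompany.com/data/</code> is an assumption, percent-encoding is skipped because these table and column names contain only unreserved characters, <code>rdf:type</code> is left as a prefixed name, and literal values are returned as plain column values without datatypes. Following the W3C recommendation, the row IRI joins the table name and the primary key with “/”, while the property IRIs use “#” and “#ref-”.</p>
<div class="boxg">
<p class="noindent"><code>-- (1) Row type triple</code></p>
<p class="noindent"><code>SELECT concat('https://mycompany.com/data/sales_flat_order/entity_id=', entity_id) AS s,</code></p>
<p class="noindent"><code>       'rdf:type' AS p,</code></p>
<p class="noindent"><code>       'https://mycompany.com/data/sales_flat_order' AS o</code></p>
<p class="noindent"><code>FROM sales_flat_order;</code></p>
<p class="noindent"><code>-- (2) Literal triple, emitted only when the column value is non-NULL</code></p>
<p class="noindent"><code>SELECT concat('https://mycompany.com/data/sales_flat_order/entity_id=', entity_id) AS s,</code></p>
<p class="noindent"><code>       'https://mycompany.com/data/sales_flat_order#order_currency_code' AS p,</code></p>
<p class="noindent"><code>       order_currency_code AS o</code></p>
<p class="noindent"><code>FROM sales_flat_order</code></p>
<p class="noindent"><code>WHERE order_currency_code IS NOT NULL;</code></p>
<p class="noindent"><code>-- (3) Reference triple, emitted only when the foreign key value is non-NULL</code></p>
<p class="noindent"><code>SELECT concat('https://mycompany.com/data/sales_flat_order/entity_id=', entity_id) AS s,</code></p>
<p class="noindent"><code>       'https://mycompany.com/data/sales_flat_order#ref-customer_id' AS p,</code></p>
<p class="noindent"><code>       concat('https://mycompany.com/data/customer_entity/entity_id=', customer_id) AS o</code></p>
<p class="noindent"><code>FROM sales_flat_order</code></p>
<p class="noindent"><code>WHERE customer_id IS NOT NULL;</code></p>
</div>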
</section>
</section>
<section>
<h3 class="head3" id="ch2_3_2">2.3.2<span class="space3"/><span epub:type="title">CUSTOM MAPPING</span></h3>
<p class="noindent">A Custom Mapping is a customizable representation of a relational database as a knowledge graph. The body of the mapping rule is a SQL query on the relational database and the head of the rule represents elements of the knowledge graph schema: Concepts, Concept Attributes, Relationships, and Relationship Attributes. The knowledge graph schema can be created or reused (as discussed in Section <span class="blue">2.2.3</span>).</p>
<section id="ch2_sec5">
<h4 class="head4"><span epub:type="title">Concept Mappings</span></h4>
<p class="noindent">A concept mapping is a representation of a concept in the knowledge graph from the relational database. It is represented as follows:</p>
<figure>
<div class="image" id="fig_2"><img alt="Image" src="../images/pg35_1.jpg"/></div>
</figure>
<p class="indent">Every row resulting from the SQL query and uniquely identified by the attribute ID represents an instance of the concept.</p>
<p class="exe"><b>Example 2.5</b></p>
<p class="noindent"><i>Source:</i> The table <code>customer_entity</code> stores data about all customers. The column <code>is_active</code> indicates if a customer is active or not.</p>
<div class="lf5">
<table class="table1" id="tab_3">
<thead>
<tr>
<th class="tc" colspan="3"><code>customer_entity</code></th>
</tr>
<tr>
<th class="thead tc"><code><b>entity_id</b></code></th>
<th class="thead tc"><code><b>email</b></code></th>
<th class="thead tc"><code><b>is_active</b></code></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tab1 tc">100</td>
<td class="tab1 tc">juan@</td>
<td class="tab1 tc">1</td>
</tr>
<tr>
<td class="tab1 tc">101</td>
<td class="tab1 tc">ora@</td>
<td class="tab1 tc">1</td>
</tr>
<tr>
<td class="tab1 tc">102</td>
<td class="tab1 tc">alice@</td>
<td class="tab1 tc">0</td>
</tr>
</tbody>
</table>
</div>
<p class="noindent"><i>Target:</i> In the Knowledge Graph, the schema consists of the concept (Customer).</p>
<div class="box-top">
<p class="center">(Customer)</p>
</div>
<p class="noindent">The business defines a customer as one who is active; therefore, the expected Knowledge Graph is the following:</p>
<div class="box-top">
<p class="center">(customer-100) —type→ (Customer)</p>
<p class="center">(customer-101) —type→ (Customer)</p>
</div>
<p class="noindent"><span epub:type="pagebreak" id="page_36" title="36"/><i>Mapping:</i> To generate the knowledge graph, the definition of <i>Customer</i> is represented in a SQL query that filters on <code>is_active</code> = 1:</p>
<figure>
<div class="image" id="fig_3"><img alt="Image" src="../images/pg36_1.jpg"/></div>
</figure>
</section>
<section id="ch2_sec6">
<h4 class="head4"><span epub:type="title">Concept Attribute Mappings</span></h4>
<p class="noindent">A concept attribute mapping is a representation of an attribute associated to a concept in the knowledge graph from the relational database. It is represented as follows:</p>
<figure>
<div class="image" id="fig_4"><img alt="Image" src="../images/pg36_2.jpg"/></div>
</figure>
<p class="indent">Every row resulting from the SQL query, uniquely identified by the attribute ID, has a corresponding attribute that represents a concept attribute in the knowledge graph.</p>
<p class="exe"><b>Example 2.6</b></p>
<p class="noindent"><i>Source:</i> The <code>sales_flat_order</code> table stores all the order transactions.</p>
<div class="lf5">
<table class="table1" id="tab_4">
<thead>
<tr>
<th class="tc" colspan="4"><code>sales_flat_order</code></th>
</tr>
<tr>
<th class="thead tc"><code><b>entity_id</b></code></th>
<th class="thead tc"><code><b>grand_total</b></code></th>
<th class="thead tc"><code><b>tax_amount</b></code></th>
<th class="thead tc"><code><b>discount_amount</b></code></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tab1 tc">1</td>
<td class="tab1 tc">110</td>
<td class="tab1 tc">8.8</td>
<td class="tab1 tc">1.2</td>
</tr>
<tr>
<td class="tab1 tc">2</td>
<td class="tab1 tc">100</td>
<td class="tab1 tc">10</td>
<td class="tab1 tc">0</td>
</tr>
</tbody>
</table>
</div>
<p class="noindent"><i>Target:</i> The knowledge graph schema consists of the concept <i>Order</i> and an associated attribute <i>netsales</i> whose datatype is a float.</p>
<div class="box-top">
<p class="center">(Order) —netsales→ <code>[float]</code></p>
</div>
<p class="noindent">The business defines <i>netsales</i> by taking the grand total and subtracting the tax and discount. Therefore, the expected knowledge graph is the following:</p>
<div class="box-top">
<p class="center">(order-1) —netsales→ <code>[100]</code></p>
<p class="center">(order-2) —netsales→ <code>[90]</code></p>
</div>
<p class="noindent"><i>Mapping:</i> To generate the knowledge graph, the definition of <i>netsales</i> is represented in the SQL query. The mapping is the following:</p>
<figure>
<div class="image" id="fig_5"><img alt="Image" src="../images/pg37_1.jpg"/></div>
</figure>
</section>
<section id="ch2_sec7">
<h4 class="head4"><span epub:type="pagebreak" id="page_37" title="37"/><span epub:type="title">Relationship Mappings</span></h4>
<p class="noindent">A relationship mapping is a representation of a relationship between two concepts in the knowledge graph from the relational database. It is represented as follows:</p>
<figure>
<div class="image" id="fig_6"><img alt="Image" src="../images/pg37_2.jpg"/></div>
</figure>
<p class="indent">Every row resulting from the SQL query represents a relationship between two concepts that are uniquely identified by ID1 and ID2, respectively.</p>
<p class="exe"><b>Example 2.7</b></p>
<p class="noindent"><i>Source:</i> The <code>sales_flat_order</code> table stores all the order transactions. The column <code>entity_id</code> is the primary key. The column <code>customer_id</code> is a foreign key that references the <code>customer_entity</code> table and associates the customers to the orders.</p>
<div class="lf5">
<table class="table1" id="tab_5">
<thead>
<tr>
<th class="tc" colspan="2"><code>sales_flat_order</code></th>
</tr>
<tr>
<th class="thead tc"><code><b>entity_id</b></code></th>
<th class="thead tc"><code><b>customer_id</b></code></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tab1 tc">1</td>
<td class="tab1 tc">100</td>
</tr>
<tr>
<td class="tab1 tc">2</td>
<td class="tab1 tc">100</td>
</tr>
</tbody>
</table>
</div>
<p class="noindent"><i>Target:</i> The knowledge graph schema consists of the concepts <i>Order</i> and <i>Customer.</i> The relationship <i>placedBy</i> connects the <i>Order</i> concept to the <i>Customer</i> concept.</p>
<div class="box-top">
<p class="center">(Order) —placedBy→ (Customer)</p>
</div>
<p class="noindent">The expected knowledge graph is the following:</p>
<div class="box-top">
<p class="center">(order-1) —placedBy→ (customer-100)</p>
<p class="center">(order-2) —placedBy→ (customer-100)</p>
</div>
<p class="noindent"><i>Mapping:</i> The relationship mapping is from the foreign key column <code>customer_id</code> from the table <code>sales_flat_order</code> to the relationship <i>placedBy.</i> The mapping is the following:</p>
<figure>
<div class="image" id="fig_7"><img alt="Image" src="../images/pg38_1.jpg"/></div>
</figure>
</section>
<section id="ch2_sec8">
<h4 class="head4"><span epub:type="pagebreak" id="page_38" title="38"/><span epub:type="title">Relationship Attribute Mappings</span></h4>
<p class="noindent">A relationship attribute mapping is a representation of an attribute associated with a relationship between two concepts in the knowledge graph from the relational database. It is represented as follows:</p>
<figure>
<div class="image" id="fig_8"><img alt="Image" src="../images/pg38_2.jpg"/></div>
</figure>
<p class="indent">Every row resulting from the SQL query represents a relationship between two concepts that are uniquely identified by ID1 and ID2, respectively, and has a corresponding attribute ATTR which represents the relationship attribute in the knowledge graph.</p>
<p class="exe"><b>Example 2.8</b></p>
<p class="noindent"><i>Source:</i> The <code>customer_rel</code> table is a many-to-many table that represents a relationship when two customers connect with each other. The table has columns <code>customer_id1</code> and <code>customer_id2</code>, which are foreign keys that reference the <code>customer_entity</code> table, and a column <code>created_at</code>, which represents when the two customers connected.</p>
<div class="lf5">
<table class="table1" id="tab_6">
<thead>
<tr>
<th class="tc" colspan="4"><code>customer_rel</code></th>
</tr>
<tr>
<th class="thead tc"><code><b>entity_id</b></code></th>
<th class="thead tc"><code><b>customer_id1</b></code></th>
<th class="thead tc"><code><b>customer_id2</b></code></th>
<th class="thead tc"><code><b>created_at</b></code></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tab1 tc">10</td>
<td class="tab1 tc">100</td>
<td class="tab1 tc">101</td>
<td class="tab1 tc">2009-01-01</td>
</tr>
</tbody>
</table>
</div>
<p class="noindent"><i>Target:</i> The knowledge graph schema consists of the concept <i>Customer</i> and the relationship <i>knows</i>, which connects the <i>Customer</i> concept with itself. The relationship attribute <i>since</i> is associated with the <i>knows</i> relationship.</p>
<div class="box-top">
<p class="center">(Customer) —knows{since = [date]}→ (Customer)</p>
</div>
<p class="noindent">The expected knowledge graph is the following:</p>
<div class="box-top">
<p class="center">(customer-100) —knows{since = [2009-01-01]}→ (customer-101)</p>
</div>
<p class="noindent"><i>Mapping:</i> To generate the knowledge graph, the <i>knows</i> relationship is represented in the SQL query that projects the <code>customer_id1</code> and <code>customer_id2</code> which are the identifiers for customers. The relationship attribute mapping is from the column <code>created_at</code> to the relationship attribute <i>since</i> associated with the relationship <i>knows.</i> The mapping is the following:</p>
<figure>
<div class="image" id="fig_9"><img alt="Image" src="../images/pg39_1.jpg"/></div>
</figure>
</section>
</section>
<section>
<h3 class="head3" id="ch2_3_3"><span epub:type="pagebreak" id="page_39" title="39"/>2.3.3<span class="space3"/><span epub:type="title">MAPPING LANGUAGES</span></h3>
<p class="noindent">We have presented the mappings in an abstract notation. In practice, these mappings need to be implemented in a language that can then be executed. The mappings can be implemented in a programming language (e.g., Java, Python). We present two declarative languages to implement the mappings. Given the declarative nature of SQL, it can also be used to implement the mappings to a knowledge graph. For RDF Knowledge Graphs, there exists a W3C recommendation: R2RML, the <i>Relational Database to RDF Mapping Language.</i></p>
<section id="ch2_sec9">
<h4 class="head4"><span epub:type="title">SQL as a Mapping Language</span></h4>
<p class="noindent">Mappings are rules. Queries are rules. Views are named queries. SQL is a query language. Therefore, mappings can be represented in SQL.<sup><a epub:type="noteref" href="#pgfn2_6" id="rpgfn2_6">6</a></sup> Recall that a mapping rule is represented as follows:</p>
<figure>
<div class="image" id="fig_10"><img alt="Image" src="../images/pg39_2.jpg"/></div>
</figure>
<p class="indent">The mapping rule would be represented in SQL as follows:</p>
<div class="boxg">
<p class="noindent"><code>SELECT &lt;conclusion&gt;</code></p>
<p class="noindent"><code>FROM ( &lt;condition&gt; )</code></p>
</div>
<p class="indent">Executing these SQL queries returns the corresponding elements of the knowledge graph; in other words, it is a materialization approach. This is a viable way to test mappings before implementing them in a mapping language such as R2RML (see next subsection).</p>
<p class="exe"><b>Example 2.9</b></p>
<p class="noindent">Concept Mapping in SQL from Example 2.5</p>
<div class="boxg">
<p class="noindent"><code>SELECT</code></p>
<p class="noindent"><code>  concat('customer-', entity_id) AS s,</code></p>
<p class="noindent"><code>  'type' AS p,</code></p>
<p class="noindent"><span epub:type="pagebreak" id="page_40" title="40"/><code>  'Customer' AS o</code></p>
<p class="noindent"><code>FROM (</code></p>
<p class="noindent"><code>  SELECT entity_id FROM customer_entity WHERE is_active = 1</code></p>
<p class="noindent"><code>) AS t</code></p>
</div>
<p class="exe"><b>Example 2.10</b></p>
<p class="noindent">Concept Attribute Mapping in SQL from Example 2.6</p>
<div class="boxg">
<p class="noindent"><code>SELECT</code></p>
<p class="noindent"><code>  concat('order-', entity_id) AS s,</code></p>
<p class="noindent"><code>  'netsales' AS p,</code></p>
<p class="noindent"><code>  net AS o</code></p>
<p class="noindent"><code>FROM (</code></p>
<p class="noindent"><code>  SELECT entity_id, grand_total - tax_amount - discount_amount AS net</code></p>
<p class="noindent"><code>  FROM sales_flat_order</code></p>
<p class="noindent"><code>) AS t</code></p>
</div>
<p class="exe"><b>Example 2.11</b></p>
<p class="noindent">Relationship Mapping in SQL from Example 2.7</p>
<div class="boxg">
<p class="noindent"><code>SELECT</code></p>
<p class="noindent"><code>  concat('order-', entity_id) AS s,</code></p>
<p class="noindent"><code>  'placedBy' AS p,</code></p>
<p class="noindent"><code>  concat('customer-', customer_id) AS o</code></p>
<p class="noindent"><code>FROM (</code></p>
<p class="noindent"><code>  SELECT entity_id, customer_id FROM sales_flat_order</code></p>
<p class="noindent"><code>) AS t</code></p>
</div>
<p class="exe"><b>Example 2.12</b></p>
<p class="noindent">Relationship Attribute Mapping in SQL from Example 2.8</p>
<div class="boxg">
<p class="noindent"><code>SELECT</code></p>
<p class="noindent"><code>  concat('customer-', customer_id1) AS s,</code></p>
<p class="noindent"><code>  concat('knows{since=', created_at, '}') AS p,</code></p>
<p class="noindent"><code>  concat('customer-', customer_id2) AS o</code></p>
<p class="noindent"><span epub:type="pagebreak" id="page_41" title="41"/><code>FROM (</code></p>
<p class="noindent"><code>  SELECT customer_id1, customer_id2, created_at FROM customer_rel</code></p>
<p class="noindent"><code>) AS t</code></p>
</div>
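<p class="indent">Since views are named queries, the same SQL can also hint at the difference between the materialization and virtualization approaches described at the start of this section. The following is only a loose analogy, not a description of how a virtualization engine works internally: the view and table names (<code>kg_placed_by</code>, <code>kg_placed_by_materialized</code>) are hypothetical, and <code>CREATE TABLE ... AS SELECT</code> is assumed to be supported by the database (SQL Server, for example, uses <code>SELECT ... INTO</code> instead).</p>
<div class="boxg">
<p class="noindent"><code>-- Virtual: the mapping is stored as a view and evaluated only when queried</code></p>
<p class="noindent"><code>CREATE VIEW kg_placed_by AS</code></p>
<p class="noindent"><code>SELECT concat('order-', entity_id) AS s,</code></p>
<p class="noindent"><code>       'placedBy' AS p,</code></p>
<p class="noindent"><code>       concat('customer-', customer_id) AS o</code></p>
<p class="noindent"><code>FROM sales_flat_order;</code></p>
<p class="noindent"><code>-- A question asked against the graph-shaped view; the database expands it</code></p>
<p class="noindent"><code>-- into a query over sales_flat_order</code></p>
<p class="noindent"><code>SELECT s FROM kg_placed_by WHERE o = 'customer-100';</code></p>
<p class="noindent"><code>-- Materialized: the mapping is executed once and its result is stored</code></p>
<p class="noindent"><code>CREATE TABLE kg_placed_by_materialized AS</code></p>
<p class="noindent"><code>SELECT s, p, o FROM kg_placed_by;</code></p>
</div>
<p class="indent">The view always reflects the current contents of <code>sales_flat_order</code>, whereas the materialized table must be rebuilt whenever the source changes.</p>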
</section>
<section id="ch2_sec10">
<h4 class="head4"><span epub:type="title">Relational Database to RDF Mapping Language (R2RML)</span></h4>
<p class="noindent">R2RML<sup><a epub:type="noteref" href="#pgfn2_7" id="rpgfn2_7">7</a></sup> [<span class="blue">Das, 2012</span>] is a language for expressing customized mappings from relational databases to RDF graphs expressed in a graph structure and schema of the user’s choice. R2RML is a W3C recommendation. An R2RML mapping is itself represented as an RDF graph. Turtle is the recommended syntax for writing R2RML mappings.</p>
<p class="indent">RML<sup><a epub:type="noteref" href="#pgfn2_8" id="rpgfn2_8">8</a></sup> is an unofficial extension to R2RML that supports mappings from CSV, JSON, and XML sources. Recall that the focus of this book is on relational databases as a source; however, the principles presented in this book also apply to mappings in RML.</p>
<p class="indent">The R2RML language features can be divided in two parts: features for generating RDF terms (IRIs, blank nodes, or literals) and features for generating RDF triples.</p>
<p class="noindentt"><b>Generating RDF Terms:</b> An RDF term is either an IRI, a blank node, or a literal. A term map generates an RDF term for the subjects, predicates, and objects of the RDF triples from either a constant, a template or a column value. A constant-valued term map ignores the row and always generates the same RDF term. A column-valued term map generates an RDF term from the value of a column. A template-valued term map generates an RDF term from a string template, which is a format string that can be used to build strings from multiple components, including the values of a column. Template-valued term maps are commonly used to specify how an IRI should be generated.</p>
<p class="indent">The R2RML language allows a user to explicitly state the type of RDF term that needs to be generated (IRI, blank node, or literal). If the RDF term is for a subject, then the term type must be either an IRI or a blank node. If the RDF term is for a predicate, then the term type must be an IRI. If the RDF term is for an object, then the term type can be an IRI, a blank node, or a literal. Additionally, a developer may assert that a literal has an assigned language tag or datatype.</p>
<p class="noindentt"><b>Generating RDF Triples:</b> RDF triples are derived from a logical table. A logical table can be either a base table or view in the relational schema, or an R2RML view. An R2RML view is a logical table whose contents are the result of executing a SQL SELECT query against the input database.</p>
<p class="indent"><span epub:type="pagebreak" id="page_42" title="42"/>A triples map is the heart of an R2RML mapping. It specifies a rule for translating each row of a logical table to RDF triples. A triples map is represented by a resource that references the following other resources:</p>
<p class="bbull">•  It must have exactly one logical table. Its value is a logical table that specifies a SQL query result to be mapped to triples.</p>
<p class="bbull">•  It must have exactly one subject map that specifies how to generate a subject for each row of the logical table.</p>
<p class="bbull">•  It may have zero or more predicate-object maps, which specify pairs of predicate maps and object maps that, together with the subject generated by the subject map, may form one or more RDF triples for each row.</p>
<p class="indent">Recall that there are three types of term maps that generate RDF terms: constant-valued, column-valued, and template-valued. Because the subject, predicate, and object of an RDF triple must each be an RDF term, each of them can be generated by any of these three types of term map. The term maps that generate them are called the subject map, predicate map, and object map, respectively. A predicate-object map groups predicate maps with object maps.</p>
<p class="indent">A subject map is a term map that specifies the subject of the RDF triple. The primary key of a table is usually the basis for creating an IRI. Therefore, it is normally the case that a subject map is a template-valued term map with an IRI template using the value of a column which is usually the primary key. Optionally, a subject map may have one or more class IRIs. For each RDF term generated by the subject map, RDF triples with predicate <code>rdf:type</code> and the class IRI as object will be generated.</p>
<p class="indent">A predicate-object map is a function that creates one or more predicate-object pairs for each row of a logical table. It is used in conjunction with a subject map to generate RDF triples in a triples map. A predicate-object map is represented by a resource that references the following other resources: one or more predicate maps and one or more object maps or referencing object maps.</p>
<p class="indent">A predicate map is a term map. It is common that the predicate of an RDF triple is a constant. Therefore, a predicate map is usually a constant-valued term map. An object map is also a term map. Depending on the use case, an object map may be a constant-valued, template-valued, or column-valued term map.</p>
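<p class="indent">Referencing object maps are mentioned above but not illustrated elsewhere in this section. As a hedged sketch, a referencing object map could reuse the subjects generated by another triples map instead of repeating its IRI template. Here the parent is <code>:AConceptMapping</code> from Example 2.13, joined on the <code>customer_id</code> and <code>entity_id</code> columns of the running example; the name <code>:AReferencingSketch</code> is hypothetical.</p>
<div class="boxg">
<p class="noindent"><code>:AReferencingSketch a rr:TriplesMap ;</code></p>
<p class="noindent"><code> rr:logicalTable [ rr:tableName "sales_flat_order" ] ;</code></p>
<p class="noindent"><code> rr:subjectMap [ rr:template "order-{entity_id}" ] ;</code></p>
<p class="noindent"><code> rr:predicateObjectMap [</code></p>
<p class="noindent"><code>  rr:predicate :placedBy ;</code></p>
<p class="noindent"><code>  rr:objectMap [</code></p>
<p class="noindent"><code>   # objects are the subjects generated by the parent triples map</code></p>
<p class="noindent"><code>   rr:parentTriplesMap :AConceptMapping ;</code></p>
<p class="noindent"><code>   # child column (this table) is matched against the parent column</code></p>
<p class="noindent"><code>   rr:joinCondition [ rr:child "customer_id" ; rr:parent "entity_id" ]</code></p>
<p class="noindent"><code>  ]</code></p>
<p class="noindent"><code> ] .</code></p>
</div>
<p class="indent">Because the join is evaluated against the logical table of the parent triples map, this sketch would only produce <code>:placedBy</code> triples for orders whose customer is returned by the parent's SQL query, which in Example 2.13 means active customers.</p>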
<p class="indent">For further details, we refer the reader to <span class="blue">Das</span> [<span class="blue">2012</span>].</p>
<p class="exe"><b>Example 2.13</b></p>
<p class="noindent"><span epub:type="pagebreak" id="page_43" title="43"/>Concept Mapping in R2RML from Example 2.5, expressed using Turtle syntax:</p>
<div class="boxg">
<p class="noindent"><code>:AConceptMapping a rr:TriplesMap ;</code></p>
<p class="noindent"><code> rr:subjectMap [</code></p>
<p class="noindent"><code> rr:template "customer-{entity_id}" ;</code></p>
<p class="noindent"><code> rr:class :Customer ;</code></p>
<p class="noindent"><code>] ;</code></p>
<p class="noindent"><code>rr:logicalTable [</code></p>
<p class="noindent"><code> rr:sqlQuery "SELECT entity_id FROM customer_entity</code></p>
<p class="noindent"><code>     WHERE is_active = 1"</code></p>
<p class="noindent"><code>].</code></p>
</div>
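<p class="indent">To make the effect of this mapping concrete, assume a hypothetical row in <code>customer_entity</code> with <code>entity_id</code> equal to 100 and <code>is_active</code> equal to 1. Under that assumption, the subject map and its class would yield a single triple, sketched below with the template result written as a relative IRI that an R2RML processor resolves against its base IRI.</p>
<div class="boxg">
<p class="noindent"><code># generated for the assumed row with entity_id = 100</code></p>
<p class="noindent"><code>&lt;customer-100&gt; rdf:type :Customer .</code></p>
</div>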
<p class="exe"><b>Example 2.14</b></p>
<p class="noindent">Concept Attribute Mapping in R2RML from Example 2.6</p>
<div class="boxg">
<p class="noindent"><code>:AConceptAttributeMapping a rr:TriplesMap ;</code></p>
<p class="noindent"><code> rr:subjectMap [</code></p>
<p class="noindent"><code> rr:template "order-{entity_id}"</code></p>
<p class="noindent"><code>] ;</code></p>
<p class="noindent"><code>rr:predicateObjectMap[</code></p>
<p class="noindent"><code> rr:predicate :netsales;</code></p>
<p class="noindent"><code> rr:objectMap [ rr:column "netsales" ] ;</code></p>
<p class="noindent"><code>];</code></p>
<p class="noindent"><code>rr:logicalTable [</code></p>
<p class="noindent"><code> rr:sqlQuery """</code></p>
<p class="noindent"><code>  SELECT entity_id, grand_total - tax_amount -</code></p>
<p class="noindent"><code>   discount_amount as netsales</code></p>
<p class="noindent"><code>  FROM sales_flat_order</code></p>
<p class="noindent"><code>"""</code></p>
<p class="noindent"><code>].</code></p>
</div>
<p class="exe"><b>Example 2.15</b></p>
<p class="noindent">Relationship Mapping in R2RML from Example 2.7</p>
<div class="boxg">
<p class="noindent"><code>:ARelationshipMapping a rr:TriplesMap ;</code></p>
<p class="noindent"><code> rr:subjectMap [</code></p>
<p class="noindent"><code> rr:template "order-{entity_id}";</code></p>
<p class="noindent"><code>] ;</code></p>
<p class="noindent"><code>rr:predicateObjectMap[</code></p>
<p class="noindent"><code> rr:predicate :placedBy;</code></p>
<p class="noindent"><span epub:type="pagebreak" id="page_44" title="44"/><code>rr:objectMap [ rr:template "customer-{customer_id}" ]</code></p>
<p class="noindent"><code>];</code></p>
<p class="noindent"><code>rr:logicalTable [</code></p>
<p class="noindent"><code> rr:tableName "sales_flat_order"</code></p>
<p class="noindent"><code>].</code></p>
</div>
</section>
</section>
</section>
<section epub:type="footnotes">
<div epub:type="footnote" id="pgfn2_1"><p class="pgnote"><sup><a href="#rpgfn2_1">1</a></sup> <span class="blue"><a href="https://w3c.github.io/rdf-star/">https://w3c.github.io/rdf-star/</a></span></p></div>
<div epub:type="footnote" id="pgfn2_2"><p class="pgnote"><sup><a href="#rpgfn2_2">2</a></sup> For all intents and purposes, you can think of RDFS as a simple schema language, and OWL as an extension of RDFS that provides more expressive power.</p></div>
<div epub:type="footnote" id="pgfn2_3"><p class="pgnote"><sup><a href="#rpgfn2_3">3</a></sup> See a note in the previous section about the emerging W3C RDF-star specification which addresses this issue.</p></div>
<div epub:type="footnote" id="pgfn2_4"><p class="pgnote"><sup><a href="#rpgfn2_4">4</a></sup> In this book we use examples with identifiers that obviously are not globally unique: For example, <code>order-1</code> is an identifier for the instance of <code>Order</code> while <code>customer-100</code> is an identifier for the instance of <code>Customer</code>. This is done so that we can keep our examples terse.</p></div>
<div epub:type="footnote" id="pgfn2_5"><p class="pgnote"><sup><a href="#rpgfn2_5">5</a></sup> <span class="blue"><a href="https://www.w3.org/TR/webarch/#identification">https://www.w3.org/TR/webarch/#identification</a></span></p></div>
<div epub:type="footnote" id="pgfn2_6"><p class="pgnote"><sup><a href="#rpgfn2_6">6</a></sup> We acknowledge the irony. However, representing mappings in SQL is used for quick-and-dirty implementations.</p></div>
<div epub:type="footnote" id="pgfn2_7"><p class="pgnote"><sup><a href="#rpgfn2_7">7</a></sup> <span class="blue"><a href="https://www.w3.org/TR/r2rml/">https://www.w3.org/TR/r2rml/</a></span></p></div>
<div epub:type="footnote" id="pgfn2_8"><p class="pgnote"><sup><a href="#rpgfn2_8">8</a></sup> <span class="blue"><a href="https://rml.io/specs/rml/">https://rml.io/specs/rml/</a></span></p></div>
</section>
</section>
</body>
</html>