<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" xml:lang="en-US">
<head>
<title>Designing and Building Enterprise Knowledge Graphs</title>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<link href="../styles/page-template.xpgt" rel="stylesheet" type="application/vnd.adobe-page-template+xml"/>
<meta content="urn:uuid:81982e4f-53b2-476f-ab11-79954b0aab3c" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<section epub:type="chapter">
<h1 class="chno" epub:type="title"><span epub:type="pagebreak" id="page_97" title="97"/>CHAPTER 4</h1>
<h1 class="chtitle" epub:type="title">Building Enterprise Knowledge Graphs</h1>
<p class="noindent">In the previous chapter, we defined the elements needed to design an Enterprise Knowledge Graph. We focused on mapping patterns that connect relational databases with knowledge graph schemas. A natural question is: how do we actually define the mappings in order to build the knowledge graph? A common expectation is that AI/ML technology exists that can generate these mappings.</p>
<p class="indent">However, the technology fallacy is <i>“the mistaken belief that just because business challenges are caused by digital technology, that they also need to be solved by digital technology”</i> [<span class="blue">Kane et al., 2019</span>]. If you put multiple people in a room, they will not agree on the meaning of a Customer. Furthermore, there will not be a consensus on which data should be used to identify a Customer. We are drowning in a sea of data, and it is clear that we need technology to help eliminate the noise, but if even humans do not agree on a meaning, why do we expect that an AI/ML system will come up with the correct answer?</p>
<p class="indent">In order to build an Enterprise Knowledge Graph, we need to combine People, Processes, and Tools.</p>
<section>
<h2 class="head2" id="ch4_1">4.1<span class="space3"/><span epub:type="title">PEOPLE</span></h2>
<p class="noindent">Designing and building a knowledge graph is not just a technical task. It is important to understand which personas are involved in the data ecosystem within an organization. These personas can be categorized as data producers and data consumers.</p>
<section>
<h3 class="head3" id="ch4_1_1">4.1.1<span class="space3"/><span epub:type="title">DATA PRODUCERS AND CONSUMERS</span></h3>
<p class="noindent">Data producers are responsible for managing the data infrastructure such as data warehouses, data lakes, data pipelines, etc. A data producer understands the database schemas and how the data are interconnected. For example, a data steward is responsible for a specific database, maintains its data dictionary, keeps track of PII, and provisions access to the data. A data engineer is responsible for the data infrastructure and builds the data pipelines that feed data into a lake. Other typical job titles of data producers are database administrator and ETL developer, among others.</p>
<p class="indent"><span epub:type="pagebreak" id="page_98" title="98"/>Data consumers are responsible for analyzing data in order to answer business questions. They understand how the business functions, know the important business questions that need to be answered, and work closely with subject matter experts. Traditionally, they consume data from a data warehouse or a data lake. For example, a data analyst is responsible for creating and maintaining business intelligence reports that answer business questions. A data scientist is responsible for finding trends in data by using statistical and machine learning methods. Other typical job titles of data consumers are BI developer and business analyst, among others.</p>
<p class="indent">Recall the problems described in Section <span class="blue">1.1</span>. There is back-and-forth communication between data consumers and data producers when a data consumer requests data from a data producer. Think about the following questions.</p>
<p class="bbull">•  Did the data consumer communicate the correct message to the data producer?</p>
<p class="bbull">•  Did the data producer understand what the data consumer was requesting?</p>
<p class="bbull">•  Did the data producer deliver the correct data?</p>
<p class="bbull">•  Did the data producer generate the data in a repeatable manner? Or was it one-off work?</p>
<p class="indent">Common practice is that the data consumer receives data that needs to be further processed and cleaned. The famous 80/20 rule states: <i>“Most data scientists spend only 20 percent of their time on actual data analysis and 80 percent of their time finding, cleaning, and reorganizing huge amounts of data.”</i><sup><a epub:type="noteref" href="#pgfn4_1" id="rpgfn4_1">1</a></sup> This inefficient, repetitive process actually happens after the data producer delivers data to the data consumer. Clearly, there is a gap between the data producers and data consumers because the data lacks ownership.</p>
</section>
<section>
<h3 class="head3" id="ch4_1_2">4.1.2<span class="space3"/><span epub:type="title">DATA PRODUCT MANAGER</span></h3>
<p class="noindent">Let’s step back for a moment and make an analogy with software. Software engineers and software users are bridged by product managers, who make sure that the software satisfies the requirements of the users. Now replace “software engineer” with “data producer,” “software user” with “data consumer,” and “software” with “data,” and you’ll see that the situation is analogous.</p>
<p class="indent">We need to treat data as a product. A <i>data product</i> is defined as a product that facilitates an end goal through the use of data [<span class="blue">Patil, 2012</span>]. Knowledge graphs enable data products. A data product can be the knowledge graph itself, or subsets of it. A data product can also be a tabular view over the knowledge graph where the definitions from the knowledge graph are associated with each column of the tabular view.<sup><a epub:type="noteref" href="#pgfn4_2" id="rpgfn4_2">2</a></sup></p>
<p class="indent">A product must have an owner who takes responsibility. A <i>data product manager</i> is responsible for, and takes ownership of, data products. They understand the ecosystem of people and data, <span epub:type="pagebreak" id="page_99" title="99"/>and the tasks in an organization that need to be addressed with the data. They manage a multidisciplinary team responsible for producing, managing, exposing, and giving access to the data products.</p>
<p class="indent">From a social perspective, the team manages the shared business meaning, understands data consumers’ requirements and use cases, and tracks the value that is being generated.</p>
<p class="indent">From a technical perspective, the team is responsible for maintaining the quality and reliability of the data through data wrangling, cleaning, and provenance. This is much more than just eliminating white spaces, replacing wrong characters, and normalizing dates. This is about designing and building the knowledge graph schema and mappings, which are the foundation of data products.</p>
<p class="indent">The work that a data product team does is what we call “knowledge science.” Example scenarios are:</p>
<p class="bbull">•  lead conversations with your organization’s stakeholders to understand their pain points,</p>
<p class="bbull">•  debate and document, together with stakeholders, the definition of a “customer” or “order net sales,”</p>
<p class="bbull">•  draw whiteboard sketches that define the schemas and models for data,</p>
<p class="bbull">•  maintain a data catalog,</p>
<p class="bbull">•  wrangle and clean data, and</p>
<p class="bbull">•  apply an agile methodology to generate data.</p>
<p class="indent">The team should constantly experiment and measure. Define KPIs to track how the data products are being adopted and used, and whether they are driving ROI.</p>
<p class="indent">Therefore, a data product manager serves as the technical and communication bridge between Data Producers and Data Consumers. Members of a data product team have job titles such as knowledge scientist, knowledge engineer, ontologist, and taxonomist, among others.</p>
</section>
</section>
<section>
<h2 class="head2" id="ch4_2">4.2<span class="space3"/><span epub:type="title">PROCESS</span></h2>
<p class="noindent">Now that we have gone over the building blocks to design a knowledge graph, we need to understand the process to put those building blocks together in order to build the knowledge graph. The process we present is an agile methodology to create the knowledge graph schema and mappings in order to build a knowledge graph and data products in an iterative manner. Effectively, the knowledge graph will always continue to evolve. This process is based on the pay-as-you-go methodology [<span class="blue">Sequeda et al., 2019</span>].</p>
<p class="indent">Before starting the process, we need to define the success criteria. At the center of the methodology is a set of prioritized business questions that need to be answered, which serve <span epub:type="pagebreak" id="page_100" title="100"/>as the success criteria for each iteration of the agile methodology. The business questions serve as competency questions that characterize the knowledge graph schema. The knowledge graph must be able to answer the business questions. If it does, then it was a successful iteration. If it doesn’t, the iteration was not successful.</p>
<p class="indent">Even though the methodology is described assuming the schema is built from scratch, the schema can be bootstrapped by reusing an existing industry-specific knowledge graph schema. Ideally, one would hope that an existing knowledge graph schema fulfills the success criteria (i.e., the prioritized business questions). If it does, reuse the relevant portion of the existing knowledge graph schema. If it doesn’t, you can still be inspired by the existing knowledge graph schema.</p>
<p class="indent">Traditionally, the business questions fall under two categories.</p>
<p class="bbull">•  Business questions that take too long to be answered today: the process of answering these business questions takes too long because it follows ad hoc approaches (see “What is the Problem” Section <span class="blue">1.1</span>).</p>
<p class="bbull">•  Business questions that have multiple and different answers today: depending on who you ask, the business question has different answers, hence there is a lack of trust.</p>
<p class="indent">The methodology is organized in three phases, with different expectations for each persona throughout the process.</p>
<p class="noindentt"><b>Phase 1: Knowledge Capture</b> The business question is analyzed and understood, resulting in a report that represents a minimal viable knowledge graph schema and mapping. A knowledge scientist works with data consumers to understand the business questions and to define an initial “whiteboard” version of the knowledge graph schema and the expected data product that the data consumers want to consume. The knowledge scientist works with the data producer to determine which data is needed and define the SQL queries to access the data. These queries will ultimately become the mappings. This is documented in what we call a <i>knowledge report.</i></p>
<p class="noindentt"><b>Phase 2: Knowledge Implementation</b> The knowledge scientist implements the knowledge graph schema and mappings based on the content of the knowledge report and generates the data product. The knowledge graph is validated to make sure it complies with the requirements established in the knowledge report.</p>
<p class="noindentt"><b>Phase 3: Knowledge Access</b> The data consumers are exposed to a data product. Traditional data tools (BI tools, R, etc.) can consume the tabular data product. Graph tools (graph analytics, graph visualizations, etc.) can consume the graph data product. The data consumer can now use their preferred tools and provide answers to new and existing business questions without having to further interface with data producers.</p>
<p class="indent">Once this initial iteration is completed, the next business question is analyzed. If the next question can be answered with the current knowledge graph, then we are done. Otherwise, we might need to extend the knowledge graph schema and add new mappings incrementally.</p>
<table class="tableb" id="tab4_1">
<caption class="tcaption"><span class="blue">Table 4.1:</span> Questions that need to be answered</caption>
<thead>
<tr>
<th class="thead"/>
<th class="thead"><b>Questions</b></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tab1">What?</td>
<td class="tab1">What is the business problem? What are the business questions?</td>
</tr>
<tr>
<td class="tab1c">Why?</td>
<td class="tab1c">Why do we need to answer these questions? What is the motivation?</td>
</tr>
<tr>
<td class="tab1">Who?</td>
<td class="tab1">Who produces the data? Who will consume the data? Who is involved?</td>
</tr>
<tr>
<td class="tab1c">How?</td>
<td class="tab1c">How is the business question answered today, if at all?</td>
</tr>
<tr>
<td class="tab1">Where?</td>
<td class="tab1">Where are the data sources required to answer the business questions? Are these databases, spreadsheets, or something else? How are data sources accessed?</td>
</tr>
<tr>
<td class="tab1c">When?</td>
<td class="tab1c">When will the data be consumed? Real-time? Daily? Update criteria?</td>
</tr>
</tbody>
</table>
<p class="indent"><span epub:type="pagebreak" id="page_101" title="101"/>With this approach, the knowledge graph is developed in an agile and iterative pay-as-you-go approach. The following three sections will present each phase in detail. Section <span class="blue">4.2.4</span> will run through an e-commerce use case showcasing each of these phases through two iterations.</p>
<div class="boxg">
<p class="noindent">This methodology was developed and refined in projects with a number of customers, over several years. It builds upon the extensive work from the fields of knowledge acquisition and ontology engineering. Common steps across all methodologies are to identify a purpose, define competency questions, and formalize the terminology in an ontology language [<span class="blue">Uschold and King, 1995</span>].</p>
</div>
<section>
<h3 class="head3" id="ch4_2_1">4.2.1<span class="space3"/><span epub:type="title">PHASE 1: KNOWLEDGE CAPTURE</span></h3>
<p class="noindent">The objectives of the knowledge capture phase are the following:</p>
<p class="bbull">•  Understand and clarify the business questions,</p>
<p class="bbull">•  Identify the necessary data and queries to answer the business questions,</p>
<p class="bbull">•  Define the requirements of the data product.</p>
<section id="ch4_sec1">
<h4 class="head4"><span epub:type="title">Step 1: Analyze as-is Processes</span></h4>
<p class="noindent">The goal is to analyze and document existing processes because many of these processes may have never been written down before. When a business question needs to be answered, we first need to understand the larger context: what is the business problem that needs to be addressed? Is it currently being addressed, and if so, how? Answering the questions shown in <a href="#tab4_1">Table <span class="blue">4.1</span></a> helps achieve this goal.</p>
<table class="tableb" id="tab4_2">
<caption class="tcaption"><span class="blue">Table 4.2:</span> Concepts in the knowledge report</caption>
<tbody>
<tr>
<td class="tab1">Concept Name</td>
<td class="tab1">The agreed label of a Concept</td>
</tr>
<tr>
<td class="tab1c">Concept Alternative Names</td>
<td class="tab1c">A list of alternative labels for the concept, including in different languages</td>
</tr>
<tr>
<td class="tab1">Concept Definition</td>
<td class="tab1">The agreed definition of a Concept</td>
</tr>
<tr>
<td class="tab1c">Concept Identifier</td>
<td class="tab1c">The identifier that will uniquely identify the Concept</td>
</tr>
<tr>
<td class="tab1">Concept Instance Identifier</td>
<td class="tab1">The attribute from the data that uniquely identifies each instance of the Concept and will form a global unique identifier</td>
</tr>
<tr>
<td class="tab1c">Table Name/SQL Query</td>
<td class="tab1c">The table name or SQL query that represents the Concept</td>
</tr>
</tbody>
</table>
</section>
<section id="ch4_sec2">
<h4 class="head4"><span epub:type="pagebreak" id="page_102" title="102"/><span epub:type="title">Step 2: Collect Documentation</span></h4>
<p class="noindent">In this step, the knowledge scientist focuses on the answers to the How and Where questions from the previous step. They identify documentation about the data sources and any SQL queries, spreadsheets, or scripts being used to answer the business questions today. They may also interview data consumers and producers to understand their current workflow.</p>
</section>
<section id="ch4_sec3">
<h4 class="head4"><span epub:type="title">Step 3: Develop Knowledge Report</span></h4>
<p class="noindent">The knowledge scientist analyzes what was learned in Steps 1 and 2 and starts working with the data consumers to understand the business questions, recognize the key concepts, attributes, and relationships in them, and identify the business terminology, such as preferred labels, alternative labels, and natural language definitions for the concepts and relationships. At this stage, it is common to encounter disagreements: different people use the same word to mean different concepts, or different words to mean the same concept. These discussions and definitions need to be documented. Keeping the conversation focused on the business questions helps drive a consensus. Subsequently, the knowledge scientist works with the data producers to identify which tables and attributes in the database contain data related to the concepts and relationships identified from the business questions. The conversation with the data producers is likewise focused on the business questions.</p>
<p class="indent">An outcome of this step is a high-level overview of the knowledge graph schema: a whiteboard illustration. The final deliverable is a knowledge report detailing the Concepts, Attributes, and Relationships (CAR) of the knowledge graph schema. Each CAR is associated with SQL logic which serves as the mapping to the relational database. The template for the knowledge report is shown in <a href="#tab4_2">Tables 4.2</a>, <a href="#tab4_3">4.3</a>, and <a href="#tab4_4">4.4</a>.</p>
<p class="indent">The knowledge report also documents the tabular data products. A tabular data product is a table of data with columns that a data consumer would like to access in order to answer the original business question that kicks off the iteration of the methodology. The knowledge report <span epub:type="pagebreak" id="page_103" title="103"/>for a tabular data product should list all the attributes of the knowledge graph that will appear in the table.</p>
<table class="tableb" id="tab4_3">
<caption class="tcaption"><span class="blue">Table 4.3:</span> Attributes in the knowledge report</caption>
<tbody>
<tr>
<td class="tab1">Attribute Name</td>
<td class="tab1">The agreed label of an Attribute</td>
</tr>
<tr>
<td class="tab1c">Attribute Alternative Names</td>
<td class="tab1c">A list of alternative labels for the Attribute, including in different languages</td>
</tr>
<tr>
<td class="tab1">Attribute Definition</td>
<td class="tab1">The agreed definition of an Attribute</td>
</tr>
<tr>
<td class="tab1c">Attribute Identifier</td>
<td class="tab1c">The identifier that will uniquely identify the Attribute</td>
</tr>
<tr>
<td class="tab1">Associated Concept</td>
<td class="tab1">The Concept with which this Attribute is associated</td>
</tr>
<tr>
<td class="tab1c">Table Name/SQL Query</td>
<td class="tab1c">The table name or SQL query that represents the Attribute</td>
</tr>
<tr>
<td class="tab1">Column</td>
<td class="tab1">The column from the table/query that is the mapping to the Attribute</td>
</tr>
<tr>
<td class="tab1c">Datatype</td>
<td class="tab1c">The expected datatype of the Attribute</td>
</tr>
<tr>
<td class="tab1">Attribute Cardinality</td>
<td class="tab1">The expected cardinality of the Attribute: 1:1 or 1:M</td>
</tr>
<tr>
<td class="tab1c">Nullable</td>
<td class="tab1c">If there are NULL values, what do they mean?</td>
</tr>
</tbody>
</table>
<table class="tableb" id="tab4_4">
<caption class="tcaption"><span class="blue">Table 4.4:</span> Relationships in the knowledge report</caption>
<tbody>
<tr>
<td class="tab1">Relationship Name</td>
<td class="tab1">The agreed label of a Relationship</td>
</tr>
<tr>
<td class="tab1c">Relationship Alternative Names</td>
<td class="tab1c">A list of alternative labels for the Relationship, including in different languages</td>
</tr>
<tr>
<td class="tab1">Relationship Definition</td>
<td class="tab1">The agreed definition of a Relationship</td>
</tr>
<tr>
<td class="tab1c">Relationship Identifier</td>
<td class="tab1c">The identifier that will uniquely identify the Relationship</td>
</tr>
<tr>
<td class="tab1">Associated From Concept</td>
<td class="tab1">What Concept does this relationship come from?</td>
</tr>
<tr>
<td class="tab1c">Associated To Concept</td>
<td class="tab1c">What Concept does this relationship go to?</td>
</tr>
<tr>
<td class="tab1">Table Name/SQL Query</td>
<td class="tab1">The table name or SQL query that represents the Relationship. This query should return a pair of attributes that represent the identifiers of the From and To Concepts</td>
</tr>
<tr>
<td class="tab1c">Relationship Cardinality</td>
<td class="tab1c">The expected cardinality: 1:1 or 1:M</td>
</tr>
</tbody>
</table>
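<p class="indent">The three templates can be sketched as plain record types. The following is a minimal illustration in Python, not a prescribed format: the class names, field names, and the <code>dangling_references</code> check are assumptions made for this example, as are the Order and Customer entries and the <code>ex:</code> identifiers.</p>
<pre>
```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Concept:
    """Fields of the Concept template (Table 4.2)."""
    name: str
    definition: str
    identifier: str
    instance_identifier: str   # column that uniquely identifies an instance
    sql: str                   # table name or SQL query
    alternative_names: List[str] = field(default_factory=list)


@dataclass
class Attribute:
    """Fields of the Attribute template (Table 4.3)."""
    name: str
    definition: str
    identifier: str
    associated_concept: str
    sql: str
    column: str
    datatype: str
    cardinality: str           # "1:1" or "1:M"
    null_meaning: str          # what a NULL value means, if NULLs occur
    alternative_names: List[str] = field(default_factory=list)


@dataclass
class Relationship:
    """Fields of the Relationship template (Table 4.4)."""
    name: str
    definition: str
    identifier: str
    from_concept: str
    to_concept: str
    sql: str                   # should return the From and To identifiers
    cardinality: str
    alternative_names: List[str] = field(default_factory=list)


@dataclass
class KnowledgeReport:
    concepts: List[Concept] = field(default_factory=list)
    attributes: List[Attribute] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)

    def dangling_references(self) -> List[str]:
        """Identifiers of entries that point at a Concept not in the report."""
        known = {c.identifier for c in self.concepts}
        problems = []
        for a in self.attributes:
            if a.associated_concept not in known:
                problems.append(a.identifier)
        for r in self.relationships:
            if r.from_concept not in known or r.to_concept not in known:
                problems.append(r.identifier)
        return problems


# Hypothetical entries for an e-commerce question.
report = KnowledgeReport(
    concepts=[Concept("Order", "A purchase placed by a customer.",
                      "ex:Order", "order_id", "orders")],
    attributes=[Attribute("Order Date", "The date the order was placed.",
                          "ex:orderDate", "ex:Order", "orders", "order_date",
                          "xsd:date", "1:1", "NULL is not expected.")],
    relationships=[Relationship("placed by", "The customer who placed the order.",
                                "ex:placedBy", "ex:Order", "ex:Customer",
                                "SELECT order_id, customer_id FROM orders", "1:1")],
)
print(report.dangling_references())  # ['ex:placedBy']
```
</pre>
<p class="indent">In this sketch the relationship is flagged because the Customer concept has not yet been documented, which is the kind of gap a peer review of the report is meant to surface.</p>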
<div class="boxg">
<p class="noindent">The notion of Knowledge Reports mimics the Intermediate Representations (IRs) from METHONTOLOGY [<span class="blue">Fernández-López et al., 1997</span>]. In the conceptualization phase of METHONTOLOGY, the informal view of a domain is represented in a semi-formal specification <span epub:type="pagebreak" id="page_104" title="104"/>using IRs, which can be represented in a tabular or graph representation. The IRs can be understood by both the domain experts and the knowledge scientist, therefore bridging the gap between the data consumers’ informal understanding of the domain and the formal ontology language used to represent the domain.</p>
</div>
<p class="indent">The deliverable of this phase is the knowledge report, which documents how the business concepts are related, how they are connected to the data and the data products that the data consumers expect. The knowledge report needs to be peer-reviewed by the data consumers and data producers.<sup><a epub:type="noteref" href="#pgfn4_3" id="rpgfn4_3">3</a></sup> If all parties are in agreement, then we can proceed to the next phase. Otherwise, the discrepancies must be resolved. The discrepancies can be identified quickly due to the granularity of the knowledge report.</p>
</section>
</section>
<section>
<h3 class="head3" id="ch4_2_2">4.2.2<span class="space3"/><span epub:type="title">PHASE 2: KNOWLEDGE IMPLEMENTATION</span></h3>
<p class="noindent">A key insight of METHONTOLOGY’s Intermediate Representations is that they ease the transformation into a formal ontology language. We build upon this insight. That is why the goal of the knowledge implementation phase is to formalize the content of the knowledge report into a knowledge graph schema, mappings, and queries.</p>
<section id="ch4_sec4">
<h4 class="head4"><span epub:type="title">Step 4: Create/Extend Knowledge Graph Schema</span></h4>
<p class="noindent">Based on the knowledge report, the knowledge scientist can create the knowledge graph schema or extend the existing schema. This is straightforward to do given that the knowledge report specifically documents the Concepts, Attributes, and Relationships of the knowledge graph.</p>
<p class="indent">For knowledge graphs implemented using RDF Graphs, <a href="#tab4_5">Table <span class="blue">4.5</span></a> details the correspondence between the elements of the knowledge report and OWL ontology constructs.</p>
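<p class="indent">As a rough illustration of this correspondence, the sketch below turns knowledge report entries into schema triples. It is written in plain Python and represents triples as tuples of strings; the function names and the <code>ex:</code> identifiers are hypothetical, and cardinality restrictions are left out for brevity.</p>
<pre>
```python
from typing import List, Tuple

# (subject, predicate, object). IRIs are written as prefixed names and
# literals as plain strings; a real RDF library would keep them distinct.
Triple = Tuple[str, str, str]


def concept_triples(identifier: str, name: str, definition: str,
                    alt_names: List[str]) -> List[Triple]:
    """A Concept entry expressed as an owl:Class."""
    triples = [
        (identifier, "rdf:type", "owl:Class"),
        (identifier, "rdfs:label", name),
        (identifier, "rdfs:comment", definition),
    ]
    triples += [(identifier, "skos:altLabel", alt) for alt in alt_names]
    return triples


def attribute_triples(identifier: str, name: str, definition: str,
                      concept: str, datatype: str) -> List[Triple]:
    """An Attribute entry expressed as an owl:DatatypeProperty."""
    return [
        (identifier, "rdf:type", "owl:DatatypeProperty"),
        (identifier, "rdfs:label", name),
        (identifier, "rdfs:comment", definition),
        (identifier, "rdfs:domain", concept),
        (identifier, "rdfs:range", datatype),
    ]


def relationship_triples(identifier: str, name: str, definition: str,
                         from_concept: str, to_concept: str) -> List[Triple]:
    """A Relationship entry expressed as an owl:ObjectProperty."""
    return [
        (identifier, "rdf:type", "owl:ObjectProperty"),
        (identifier, "rdfs:label", name),
        (identifier, "rdfs:comment", definition),
        (identifier, "rdfs:domain", from_concept),
        (identifier, "rdfs:range", to_concept),
    ]


# Hypothetical knowledge report content.
schema = (
    concept_triples("ex:Order", "Order",
                    "A purchase placed by a customer.", ["Purchase"])
    + concept_triples("ex:Customer", "Customer",
                      "A party that has placed an order.", [])
    + attribute_triples("ex:orderDate", "Order Date",
                        "The date the order was placed.", "ex:Order", "xsd:date")
    + relationship_triples("ex:placedBy", "placed by",
                           "The customer who placed the order.",
                           "ex:Order", "ex:Customer")
)
for triple in schema:
    print(triple)
```
</pre>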
</section>
<section id="ch4_sec5">
<h4 class="head4"><span epub:type="title">Step 5: Implement Mapping</span></h4>
<p class="noindent">Similar to the previous step, the knowledge scientist can create the mappings from the knowledge report. The mappings express a correspondence from tables or SQL queries of the relational database to concepts, attributes and relationships of the knowledge graph schema. This implies that the complexity of creating mappings is left in SQL. After all, SQL is the most common query language.</p>
<p class="noindent">For knowledge graphs implemented using RDF Graphs, <a href="#tab4_6">Table <span class="blue">4.6</span></a> details the correspondence between the elements of the knowledge report and R2RML mapping constructs.</p>
<p class="noindent">The mappings can be applied either in a virtualized (NoETL) or materialized (ETL) approach. In a virtualized approach, graph queries are translated to SQL queries using the mappings. In a materialized approach, the relational data is extracted, translated into a graph using the mappings and then loaded into a graph database.<span epub:type="pagebreak" id="page_105" title="105"/></p>
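<p class="indent">The materialized approach can be sketched in a few lines. The example below is a simplified stand-in for an R2RML processor, not an implementation of R2RML: the dictionary layout of the mapping, the <code>orders</code> table, and the <code>ex:</code> identifiers are assumptions made for illustration, and an in-memory SQLite database plays the role of the source.</p>
<pre>
```python
import sqlite3

# Hypothetical source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2021-03-01"), (2, None)])

# One mapping per Concept: a SQL query, a template for the instance
# identifier, the target class, and attribute-to-column pairs.
mapping = {
    "sql": "SELECT order_id, order_date FROM orders ORDER BY order_id",
    "subject_template": "ex:order-{order_id}",
    "class": "ex:Order",
    "attributes": {"ex:orderDate": "order_date"},
}


def materialize(connection, concept_mapping):
    """Extract rows with the mapping's SQL and translate them to triples."""
    cursor = connection.execute(concept_mapping["sql"])
    columns = [description[0] for description in cursor.description]
    triples = []
    for values in cursor:
        row = dict(zip(columns, values))
        subject = concept_mapping["subject_template"].format(**row)
        triples.append((subject, "rdf:type", concept_mapping["class"]))
        for predicate, column in concept_mapping["attributes"].items():
            # A NULL value produces no triple.
            if row[column] is not None:
                triples.append((subject, predicate, row[column]))
    return triples


graph = materialize(conn, mapping)
for triple in graph:
    print(triple)
```
</pre>
<p class="indent">A virtualized approach would keep the same mapping but use it in the other direction, rewriting each graph query into SQL at query time instead of producing triples up front.</p>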
<table class="tableb" id="tab4_5">
<caption class="tcaption"><span class="blue">Table 4.5:</span> Correspondence between knowledge report and OWL</caption>
<thead>
<tr>
<th class="thead"><b>Knowledge Report</b></th>
<th class="thead"><b>OWL</b></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tab1">Concept Name</td>
<td class="tab1">Label (rdfs:label, skos:prefLabel) of owl:Class</td>
</tr>
<tr>
<td class="tab1c">Concept Alternative Names</td>
<td class="tab1c">Alternative Label (skos:altLabel) of owl:Class</td>
</tr>
<tr>
<td class="tab1">Concept Definition</td>
<td class="tab1">Definition (rdfs:comment) of owl:Class</td>
</tr>
<tr>
<td class="tab1c">Concept Identifier</td>
<td class="tab1c">IRI of the owl:Class</td>
</tr>
<tr>
<td class="tab1">Attribute Name</td>
<td class="tab1">Label (rdfs:label, skos:prefLabel) of owl:DatatypeProperty</td>
</tr>
<tr>
<td class="tab1c">Attribute Alternative Names</td>
<td class="tab1c">Alternative Label (skos:altLabel) of owl:DatatypeProperty</td>
</tr>
<tr>
<td class="tab1">Attribute Definition</td>
<td class="tab1">Definition (rdfs:comment) of owl:DatatypeProperty</td>
</tr>
<tr>
<td class="tab1c">Attribute Identifier</td>
<td class="tab1c">IRI of the owl:DatatypeProperty</td>
</tr>
<tr>
<td class="tab1">Attribute Associated Concept</td>
<td class="tab1">Domain (rdfs:domain) of owl:DatatypeProperty</td>
</tr>
<tr>
<td class="tab1c">Attribute Datatype</td>
<td class="tab1c">Range (rdfs:range) of owl:DatatypeProperty</td>
</tr>
<tr>
<td class="tab1">Attribute Cardinality</td>
<td class="tab1">An owl:Restriction</td>
</tr>
<tr>
<td class="tab1c">Relationship Name</td>
<td class="tab1c">Label (rdfs:label, skos:prefLabel) of owl:ObjectProperty</td>
</tr>
<tr>
<td class="tab1">Relationship Alternative Names</td>
<td class="tab1">Alternative Label (skos:altLabel) of owl:ObjectProperty</td>
</tr>
<tr>
<td class="tab1c">Relationship Definition</td>
<td class="tab1c">Definition (rdfs:comment) of owl:ObjectProperty</td>
</tr>
<tr>
<td class="tab1">Relationship Associated From Concept</td>
<td class="tab1">Domain (rdfs:domain) of owl:ObjectProperty</td>
</tr>
<tr>
<td class="tab1c">Relationship Associated To Concept</td>
<td class="tab1c">Range (rdfs:range) of owl:ObjectProperty</td>
</tr>
<tr>
<td class="tab1">Relationship Cardinality</td>
<td class="tab1">An owl:Restriction</td>
</tr>
</tbody>
</table>
<table class="tableb" id="tab4_6">
<caption class="tcaption"><span class="blue">Table 4.6:</span> Correspondence between knowledge report and R2RML</caption>
<thead>
<tr>
<th class="thead"><b>Knowledge Report</b></th>
<th class="thead"><b>R2RML</b></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tab1">Concept Identifier</td>
<td class="tab1">rr:class</td>
</tr>
<tr>
<td class="tab1c">Concept Table Name/SQL Query</td>
<td class="tab1c">rr:logicalTable</td>
</tr>
<tr>
<td class="tab1">Attribute Identifier</td>
<td class="tab1">rr:predicate</td>
</tr>
<tr>
<td class="tab1c">Column</td>
<td class="tab1c">rr:column</td>
</tr>
<tr>
<td class="tab1">Attribute Table Name/SQL</td>
<td class="tab1">rr:logicalTable, rr:sqlQuery</td>
</tr>
<tr>
<td class="tab1c">Relationship Identifier</td>
<td class="tab1c">rr:predicate</td>
</tr>
<tr>
<td class="tab1">Relationship Table Name/SQL Query</td>
<td class="tab1">rr:logicalTable, rr:sqlQuery</td>
</tr>
</tbody>
</table>
</section>
<section id="ch4_sec6">
<h4 class="head4"><span epub:type="pagebreak" id="page_106" title="106"/><span epub:type="title">Step 6: Generate Data Products</span></h4>
<p class="noindent">Recall that a tabular data product is the definition of the tabular result that a data consumer would like to access in order to answer their original business question. Graph query constructs such as SPARQL’s SELECT and Cypher’s RETURN return tabular result sets. Therefore, each tabular data product can be implemented by a graph query. The graph queries will then be executed over the resulting knowledge graph to generate the tabular data product.</p>
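<p class="indent">To make the idea concrete without depending on a particular graph database, the sketch below projects a small set of triples into rows using plain Python. It only imitates what a graph query with optional patterns would return; the function name, the sample triples, and the assumption that every attribute is single-valued are all illustrative.</p>
<pre>
```python
def tabular_view(triples, concept, columns):
    """Build one row per instance of `concept`.

    `columns` maps a column header to the attribute identifier that
    defines it, so each column stays tied to a knowledge graph definition.
    """
    subjects = sorted(s for s, p, o in triples
                      if p == "rdf:type" and o == concept)
    rows = []
    for subject in subjects:
        row = {"id": subject}
        for header, predicate in columns.items():
            values = [o for s, p, o in triples
                      if s == subject and p == predicate]
            # Assumes 1:1 attributes; a missing value becomes None.
            row[header] = values[0] if values else None
        rows.append(row)
    return rows


# Hypothetical knowledge graph content.
graph = [
    ("ex:order-1", "rdf:type", "ex:Order"),
    ("ex:order-1", "ex:orderDate", "2021-03-01"),
    ("ex:order-1", "ex:netSales", 120.5),
    ("ex:order-2", "rdf:type", "ex:Order"),
    ("ex:order-2", "ex:orderDate", "2021-03-02"),
]

product = tabular_view(graph, "ex:Order",
                       {"Order Date": "ex:orderDate",
                        "Order Net Sales": "ex:netSales"})
for row in product:
    print(row)
```
</pre>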
</section>
<section id="ch4_sec7">
<h4 class="head4"><span epub:type="title">Step 7: Validate Data</span></h4>
<p class="noindent">The final step in this phase is to validate the knowledge graph and the data products resulting from the queries in the previous step. This data should also be validated by both data producers and consumers. The validation should at least consider cardinality, counts, data types, missing values, and tabular data products.</p>
<p class="bbull">•  <b>Cardinality:</b> The knowledge report defines the expected cardinality for attributes and relationships. These cardinalities should be validated in the knowledge graph.</p>
<p class="bbull">•  <b>Counts:</b> Compare the number of results for each concept, attribute, and relationship with the number of results from the Table or SQL query on the relational database defined as the mapping. The number of results should be the same.</p>
<p class="bbull">•  <b>Missing Values:</b> Check the validity of NULL values.</p>
<p class="bbull">•  <b>Tabular Data Product:</b> Share sample tabular data with the data consumers.</p>
<p class="indent">The cardinality, counts, and missing values validation can be implemented with graph queries and constraint/validation languages and thus automated. For RDF Knowledge Graphs, SHACL (Shapes Constraint Language) can be used to implement the constraints.<sup><a epub:type="noteref" href="#pgfn4_4" id="rpgfn4_4">4</a></sup> Traditional <span epub:type="pagebreak" id="page_107" title="107"/>data quality measures can also be applied (e.g., valid emails and valid dates), but that is out of scope for this book.</p>
<p class="indent">After successful data validation with the data producers and consumers, the data can begin to be used for Knowledge Access in the next phase. Otherwise, the root cause of the error must be found. The error commonly lies in either the graph query or a mapping.</p>
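<p class="indent">The counts and cardinality checks can be sketched as follows. This is an illustration in plain Python rather than SHACL; the helper names, the <code>orders</code> table, and the sample triples (which deliberately contain a duplicate order date) are assumptions made for the example.</p>
<pre>
```python
import sqlite3
from collections import Counter


def count_instances(triples, concept):
    """Number of instances of a Concept in the knowledge graph."""
    return sum(1 for s, p, o in triples
               if p == "rdf:type" and o == concept)


def cardinality_violations(triples, predicate):
    """Subjects that have more than one value for a 1:1 attribute."""
    counts = Counter(s for s, p, o in triples if p == predicate)
    return sorted(s for s, n in counts.items() if n > 1)


# Hypothetical source table used by the mapping.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2021-03-01"), (2, "2021-03-02")])

# Hypothetical knowledge graph; order 1 carries two dates on purpose.
graph = [
    ("ex:order-1", "rdf:type", "ex:Order"),
    ("ex:order-1", "ex:orderDate", "2021-03-01"),
    ("ex:order-1", "ex:orderDate", "2021-03-05"),
    ("ex:order-2", "rdf:type", "ex:Order"),
    ("ex:order-2", "ex:orderDate", "2021-03-02"),
]

# Counts: the graph should hold as many orders as the source query returns.
source_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
counts_match = count_instances(graph, "ex:Order") == source_count

# Cardinality: Order Date is documented as 1:1 in the knowledge report.
violations = cardinality_violations(graph, "ex:orderDate")

print(counts_match)  # True
print(violations)    # ['ex:order-1']
```
</pre>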
</section>
</section>
<section>
<h3 class="head3" id="ch4_2_3">4.2.3<span class="space3"/><span epub:type="title">PHASE 3: KNOWLEDGE ACCESS</span></h3>
<section id="ch4_sec8">
<h4 class="head4"><span epub:type="title">Step 8: Build Report</span></h4>
<p class="noindent">A goal is to enable data consumers to be self-service and answer their business questions. This is accomplished when data consumers can use analytics tools (BI tools, graph visualization, etc.) over a simple and understandable view of the data. The knowledge graph enables the simplified view. Many data consumers will want to interact with the data using traditional BI tools that consume tables. This is why tabular data products are defined. Other data consumers will want to use advanced graph analytics tools and can thus consume the graph data product.</p>
</section>
<section id="ch4_sec9">
<h4 class="head4"><span epub:type="title">Step 9: Answer Question</span></h4>
<p class="noindent">The business report should answer the original business question (the “What” in Step 1). This report is shared with the stakeholders who asked the original business question (the “Who” in Step 1). If they accept the business report as an answer to their question, then this is ready to move to production.</p>
</section>
<section id="ch4_sec10">
<h4 class="head4"><span epub:type="title">Step 10: Move to production</span></h4>
<p class="noindent">Once the decision has been made to move to production, we need to determine how the knowledge graph will be managed. The options fall into two categories.</p>
<p class="bbull">•  <b>Virtualization:</b> The source relational database is accessed through virtual graph queries. This means that the graph queries in terms of the target knowledge graph are translated to SQL queries over the source relational database using the mappings. This provides up-to-date data. The viability of virtualization mainly depends on the use case and query access patterns.</p>
<p class="bbull">•  <b>Materialization:</b> The source relational database is extracted, transformed to graph using the mappings, and then loaded into a graph database.</p>
<p class="indent">We will discuss this more in the Tools section.</p>
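<p class="indent">The materialization option can be sketched in a few lines of Python: extract rows with the mapping's SQL query, transform each row into a triple using an identifier template, and load the result into a store. This is an illustrative sketch only. The <code>customer</code> table, the <code>customer-{cid}</code> template, the <code>customerName</code> predicate, and the <code>materialize</code> function are assumptions made for the example, and a Python list stands in for the graph database. Under virtualization, no such extract would exist. Each graph query would instead be translated to SQL at query time.</p>

```python
import sqlite3


def materialize(conn, sql, subject_template, predicate, column):
    """Run a mapping's SQL and emit one (subject, predicate, object) triple per row."""
    cursor = conn.execute(sql)
    names = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(names, row))
        # Build the instance identifier from the row, e.g. "customer-1".
        subject = subject_template.format(**record)
        triples.append((subject, predicate, record[column]))
    return triples


# Hypothetical source table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cid INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
)

# Extract, transform, and load into a list that stands in for a graph database.
graph_store = materialize(
    conn,
    "SELECT cid, name FROM customer ORDER BY cid",
    "customer-{cid}",
    "customerName",
    "name",
)
print(graph_store)
```

<p class="indent">Because the extract is a snapshot, the materialized triples only reflect the source as of the last run, which is why a refresh schedule has to be agreed on.</p>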
<p class="indent">For each business question that is being answered, we need to determine how the data consumers and the reports will access the data. Can the reporting/analytics tool connect directly to a graph database? Or does it consume tabular data via SQL or imported tabular files (CSV/XLS)? For a materialization approach, a refresh schedule for the tabular views must be determined. Common refresh schedules are daily, weekly, monthly, or on demand. Additionally, the time window of the extract needs to be determined. Is the entire knowledge graph going <span epub:type="pagebreak" id="page_108" title="108"/>to be updated? Or is only yesterday’s data going to be updated? Or last week’s? These questions must be answered before a wide release.</p>
<table class="tableb" id="tab4_7">
<caption class="tcaption"><span class="blue">Table 4.7:</span> Questions in round 1</caption>
<tbody>
<tr>
<td class="tab1"><b>What</b></td>
<td class="tab1">How many orders are placed in a given time period per their status?</td>
</tr>
<tr>
<td class="tab1c"><b>Why</b></td>
<td class="tab1c">Depending on who is asked, different answers can be provided. The IT department managing the website records an order when a customer has checked out. The fulfillment department records an order when it has shipped. The accounting department records an order when the funds charged against the credit card are actually transferred to the company’s bank account, regardless of the shipping status. Unaware of the source of the problem, the executives are vexed by inconsistencies across established business reports.</td>
</tr>
<tr>
<td class="tab1"><b>Who</b></td>
<td class="tab1">The Finance department, specifically the CFO.</td>
</tr>
<tr>
<td class="tab1c"><b>How</b></td>
<td class="tab1c">A business analyst asks the data engineer for this information every morning.</td>
</tr>
<tr>
<td class="tab1"><b>Where</b></td>
<td class="tab1">There is a proprietary Order Management System and an ERP system from a large vendor.</td>
</tr>
<tr>
<td class="tab1c"><b>When</b></td>
<td class="tab1c">Every morning they want to know this number.</td>
</tr>
</tbody>
</table>
</section>
</section>
<section>
<h3 class="head3" id="ch4_2_4">4.2.4<span class="space3"/><span epub:type="title">AN E-COMMERCE USE CASE</span></h3>
<p class="noindent">We present an example that goes through the methodology. We split the example into two rounds in order to show the agile nature of the methodology.</p>
<section id="ch4_sec11">
<h4 class="head4"><span epub:type="title">Round 1: Orders</span></h4>
<p class="noindent"><b>Phase 1 (Knowledge Capture):</b> We start by asking the questions in Step 1 above, and fill in the appropriate answers, as shown in <a href="#tab4_7">Table <span class="blue">4.7</span></a>.</p>
<p class="indent">The knowledge scientist gathers access to the database systems for the Order Management System and the ERP System and learns that the Order Management System was built on an open-source shopping cart system and has been heavily customized. It has been extended repeatedly over the past years and the original data architect is no longer with the company. Documentation about the database schema does not correspond to the production database schema. Furthermore, the database schema of the Order Management System consists of thousands of tables. Ten tables have the string “order” in the name with different types of prefixes (e.g., masterorder) and suffixes (e.g., ordertax).</p>
<p class="indent">The knowledge scientist gets the SQL script that the data engineer runs every morning to generate the data that is then passed along to a business analyst. It is important to call out that the data engineer did not write this SQL script; it was passed to them from a previous employee who is not at the company anymore.</p>
<p class="indent"><span epub:type="pagebreak" id="page_109" title="109"/>The knowledge scientist works with data consumers to understand the meaning of the word “order.” Discussions reveal that an order counts as an order only if it has shipped or the accounts receivable has been received. Furthermore, with a view of the order number, order date, and order status, the data consumers can answer their question. Together with the data engineer, the knowledge scientist learns that the Order Management System is the authoritative source for all orders because the ERP system consumes the data from the Order Management System. Within that database, the data relating to orders is vertically partitioned across several tables. The SQL scripts collected in the previous step provide the focus needed to identify the candidate tables and attributes where the data is located. Only the following tables and attributes are needed from the thousands of tables and tens of thousands of attributes:</p>
<div class="top1">
<p class="noindent"><code>MasterOrder(moid, oid, master_date, order_type, osid, ...)</code></p>
<p class="noindent"><code>Order(oid, order_date, ...)</code></p>
<p class="noindent"><code>OrderStatus(osid, moid, order_status_date, ostid, ...)</code></p>
<p class="noindent"><code>OrderStatusType(ostid, status_type, ...)</code></p>
</div>
<p class="indent">The knowledge scientist, working alongside the data engineer, identifies the business requirement of an order as all rows in the <code>masterorder</code> table where the <code>order_type</code> column is equal to 2 or 3. Note that in some SQL scripts, this condition was not present. This is the reason why the Finance department was getting different answers for the same question.</p>
<p class="indent">Furthermore, it is revealed that the table <code>OrderStatus</code> holds all the different statuses that an order has had across different periods of time. In discussions with the data consumer, it is confirmed that they only want to consider the last order status (they do not care about the historic order statuses). This may have been another source of differing numbers because a single order can have multiple order statuses, even though only one applies for a given period of time. With this information, the knowledge scientist can whiteboard what the knowledge graph schema would look like.</p>
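<p class="indent">The “last order status” rule can be made concrete with a small, self-contained sketch. It loads a few sample rows into an in-memory SQLite table shaped like <code>OrderStatus</code> and keeps, for each order, only the row with the most recent status date. The sample rows are invented, the dates are stored as ISO strings so that <code>MAX</code> orders them chronologically, and the query assumes that an order never has two statuses with the same date.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orderstatus ("
    "osid INTEGER, moid INTEGER, order_status_date TEXT, ostid INTEGER)"
)
# Invented sample rows: order 1 has two statuses over time, order 2 has one.
conn.executemany(
    "INSERT INTO orderstatus VALUES (?, ?, ?, ?)",
    [
        (10, 1, "2021-01-01", 1),
        (11, 1, "2021-01-03", 2),
        (12, 2, "2021-01-02", 1),
    ],
)

# Find the latest status date per order, then join back to get that row.
LATEST_STATUS_SQL = """
SELECT os.moid, os.osid, os.ostid
FROM orderstatus os
JOIN (
    SELECT moid, MAX(order_status_date) AS last_status_date
    FROM orderstatus
    GROUP BY moid
) latest
  ON os.moid = latest.moid
 AND os.order_status_date = latest.last_status_date
ORDER BY os.moid
"""

history_count = conn.execute("SELECT COUNT(*) FROM orderstatus").fetchone()[0]
latest_rows = conn.execute(LATEST_STATUS_SQL).fetchall()
print(history_count, latest_rows)
```

<p class="indent">Counting rows in the full history gives three statuses for two orders, whereas the latest-status query returns exactly one row per order, which illustrates how reports built on the full history could disagree with reports built on the last status.</p>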
<p class="indent">Finally, an Order tabular data product is defined which consists of three attributes from the knowledge graph: Order Number, Order Date, and Order Status.</p>
<p class="indent">The Knowledge Report entries for Concepts are shown in <a href="#tab4_8">Tables <span class="blue">4.8</span></a> and <a href="#tab4_9"><span class="blue">4.9</span></a>, for Attributes in <a href="#tab4_10">Tables <span class="blue">4.10</span></a>, <a href="#tab4_11"><span class="blue">4.11</span></a>, and <a href="#tab4_12"><span class="blue">4.12</span></a>, for Relationships in <a href="#tab4_13">Table <span class="blue">4.13</span></a>, and for the Data Product in <a href="#tab4_14">Table <span class="blue">4.14</span></a>.</p>
<p class="noindentt"><b>Phase 2 (Knowledge Implementation):</b> The knowledge scientist can now implement the knowledge graph schema. In our abstract notation, the knowledge graph schema is represented as follows:<span epub:type="pagebreak" id="page_110" title="110"/><span epub:type="pagebreak" id="page_111" title="111"/></p>
<table class="tableb" id="tab4_8">
<caption class="tcaption"><span class="blue">Table 4.8:</span> Order concepts in the knowledge report for round 1</caption>
<tbody>
<tr>
<td class="tab1">Concept Name</td>
<td class="tab1">Order</td>
</tr>
<tr>
<td class="tab1c">Concept Alternative Names</td>
<td class="tab1c">N/A</td>
</tr>
<tr>
<td class="tab1">Concept Definition</td>
<td class="tab1">An order is if it had shipped or the accounts receivable had been received</td>
</tr>
<tr>
<td class="tab1c">Concept Identifier</td>
<td class="tab1c">(Order)</td>
</tr>
<tr>
<td class="tab1">Concept Instance Identifier</td>
<td class="tab1">order-{moid}</td>
</tr>
<tr>
<td class="tab1c">Table Name/SQL Query</td>
<td class="tab1c"><code>SELECT m.moid FROM masterorder m JOIN order o ON m.oid = o.oid WHERE m.order_type IN (2,3)</code></td>
</tr>
</tbody>
</table>
<table class="tableb" id="tab4_9">
<caption class="tcaption"><span class="blue">Table 4.9:</span> Order status concepts in the knowledge report for round 1</caption>
<tbody>
<tr>
<td class="tab1">Concept Name</td>
<td class="tab1">Order Status</td>
</tr>
<tr>
<td class="tab1c">Concept Alternative Names</td>
<td class="tab1c">N/A</td>
</tr>
<tr>
<td class="tab1">Concept Definition</td>
<td class="tab1">An order can have a status</td>
</tr>
<tr>
<td class="tab1c">Concept Identifier</td>
<td class="tab1c">(OrderStatus)</td>
</tr>
<tr>
<td class="tab1">Concept Instance Identifier</td>
<td class="tab1">orderstatus-{osid}</td>
</tr>
<tr>
<td class="tab1c">Table Name/SQL Query</td>
<td class="tab1c"><code>orderstatus</code></td>
</tr>
</tbody>
</table>
<table class="tableb" id="tab4_10">
<caption class="tcaption"><span class="blue">Table 4.10:</span> Order date attribute in the knowledge report for round 1</caption>
<tbody>
<tr>
<td class="tab1">Attribute Name</td>
<td class="tab1">Order Date</td>
</tr>
<tr>
<td class="tab1c">Attribute Alternative Names</td>
<td class="tab1c">N/A</td>
</tr>
<tr>
<td class="tab1">Attribute Definition</td>
<td class="tab1">The date the order was placed on</td>
</tr>
<tr>
<td class="tab1c">Attribute Identifier</td>
<td class="tab1c">()-orderDate-&gt;[]</td>
</tr>
<tr>
<td class="tab1">Associated Concept</td>
<td class="tab1">(Order)</td>
</tr>
<tr>
<td class="tab1c">Table Name/SQL Query</td>
<td class="tab1c"><code>SELECT m.moid, o.order_date FROM masterorder m JOIN order o ON m.oid = o.oid</code></td>
</tr>
<tr>
<td class="tab1">Column</td>
<td class="tab1">order_date</td>
</tr>
<tr>
<td class="tab1c">Datatype</td>
<td class="tab1c">xsd:dateTime</td>
</tr>
<tr>
<td class="tab1">Attribute Cardinality</td>
<td class="tab1">1:1 an order must have exactly one order date</td>
</tr>
<tr>
<td class="tab1c">Nullable</td>
<td class="tab1c">There can’t be NULL values. If there is a NULL value, then that is a data error</td>
</tr>
</tbody>
</table>
<table class="tableb" id="tab4_11">
<caption class="tcaption"><span class="blue">Table 4.11:</span> Order number attribute in the knowledge report for round 1</caption>
<tbody>
<tr>
<td class="tab1">Attribute Name</td>
<td class="tab1">Order Number</td>
</tr>
<tr>
<td class="tab1c">Attribute Alternative Names</td>
<td class="tab1c">N/A</td>
</tr>
<tr>
<td class="tab1">Attribute Definition</td>
<td class="tab1">The unique number that identifies an order</td>
</tr>
<tr>
<td class="tab1c">Attribute Identifier</td>
<td class="tab1c">()-orderNumber-&gt; []</td>
</tr>
<tr>
<td class="tab1">Associated Concept</td>
<td class="tab1">(Order)</td>
</tr>
<tr>
<td class="tab1c">Table Name/SQL Query</td>
<td class="tab1c">masterorder</td>
</tr>
<tr>
<td class="tab1">Column</td>
<td class="tab1">moid</td>
</tr>
<tr>
<td class="tab1c">Datatype</td>
<td class="tab1c">xsd:int</td>
</tr>
<tr>
<td class="tab1">Attribute Cardinality</td>
<td class="tab1">1:1 an order must have exactly one order number</td>
</tr>
<tr>
<td class="tab1c">Nullable</td>
<td class="tab1c">There can’t be NULL values. If there is a NULL value, then that is a data error</td>
</tr>
</tbody>
</table>
<table class="tableb" id="tab4_12">
<caption class="tcaption"><span class="blue">Table 4.12:</span> Order status label attribute in the knowledge report for round 1</caption>
<tbody>
<tr>
<td class="tab1">Attribute Name</td>
<td class="tab1">Order Status Label</td>
</tr>
<tr>
<td class="tab1c">Attribute Alternative Names</td>
<td class="tab1c">N/A</td>
</tr>
<tr>
<td class="tab1">Attribute Definition</td>
<td class="tab1">The label that an order status has</td>
</tr>
<tr>
<td class="tab1c">Attribute Identifier</td>
<td class="tab1c">()-orderStatusLabel-&gt; []</td>
</tr>
<tr>
<td class="tab1">Associated Concept</td>
<td class="tab1">(OrderStatus)</td>
</tr>
<tr>
<td class="tab1c">Table Name/SQL Query</td>
<td class="tab1c"><code>orderstatus</code></td>
</tr>
<tr>
<td class="tab1">Column</td>
<td class="tab1"><code>label</code></td>
</tr>
<tr>
<td class="tab1c">Datatype</td>
<td class="tab1c">xsd:string</td>
</tr>
<tr>
<td class="tab1">Attribute Cardinality</td>
<td class="tab1">1:1 an order status must have exactly one label</td>
</tr>
<tr>
<td class="tab1c">Nullable</td>
<td class="tab1c">There can’t be NULL values. If there is a NULL value, then that is a data error</td>
</tr>
</tbody>
</table>
<table class="tableb" id="tab4_13">
<caption class="tcaption"><span class="blue">Table 4.13:</span> Relationships in the knowledge report for round 1</caption>
<tbody>
<tr>
<td class="tab1">Relationship Name</td>
<td class="tab1">has order status</td>
</tr>
<tr>
<td class="tab1c">Relationship Alternative Names</td>
<td class="tab1c">N/A</td>
</tr>
<tr>
<td class="tab1">Relationship Definition</td>
<td class="tab1">All orders have a status such as delivered, etc.</td>
</tr>
<tr>
<td class="tab1c">Relationship Identifier</td>
<td class="tab1c">()-hasOrderStatus-&gt;()</td>
</tr>
<tr>
<td class="tab1">Associated From Concept</td>
<td class="tab1">Order</td>
</tr>
<tr>
<td class="tab1c">Associated To Concept</td>
<td class="tab1c">Order Status</td>
</tr>
<tr>
<td class="tab1">Table Name/SQL Query</td>
<td class="tab1"><code>SELECT moid, osid FROM orderstatus os WHERE order_status_date = (SELECT MAX(order_status_date) FROM orderstatus WHERE moid = os.moid)</code></td>
</tr>
<tr>
<td class="tab1c">Relationship Cardinality</td>
<td class="tab1c">1:1 an order must have exactly one order status</td>
</tr>
</tbody>
</table>
<table class="tableb" id="tab4_14">
<caption class="tcaption"><span class="blue">Table 4.14:</span> Order tabular data product in the knowledge report for round 1</caption>
<thead>
<tr>
<th class="thead"><b>Attribute</b></th>
<th class="thead"><b>Concept</b></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tab1">Order Number</td>
<td class="tab1">Order</td>
</tr>
<tr>
<td class="tab1c">Order Date</td>
<td class="tab1c">Order</td>
</tr>
<tr>
<td class="tab1">Order Status Label</td>
<td class="tab1">Order Status</td>
</tr>
</tbody>
</table>
<figure>
<div class="image" id="fig4_1"><img alt="Image" src="../images/fig4_1.jpg"/></div>
<figcaption>
<p class="figcaption"><span class="blue">Figure 4.1:</span> Knowledge graph schema.</p>
</figcaption>
</figure>
<p class="noindent"><span epub:type="pagebreak" id="page_112" title="112"/><code>(Order)</code></p>
<p class="noindent"><code>(OrderStatus)</code></p>
<p class="noindent"><code>(Order)-hasOrderStatus-&gt;(OrderStatus)</code></p>
<p class="noindent"><code>(Order)-orderNumber-&gt;[int]</code></p>
<p class="noindent"><code>(Order)-orderDate-&gt;[datetime]</code></p>
<p class="noindent"><code>(OrderStatus)-orderStatusLabel-&gt;[string]</code></p>
<p class="indentt">Visually, the knowledge graph schema is shown in <a href="#fig4_1">Figure <span class="blue">4.1</span></a>.</p>
<p class="indent">In OWL, the knowledge graph schema is implemented as follows:</p>
<div class="boxg">
<p class="noindent"><span epub:type="pagebreak" id="page_113" title="113"/><code> :Order rdf:type owl:Class ;</code></p>
<p class="noindenth"><code> rdfs:label "Order";</code></p>
<p class="noindenth"><code> rdfs:comment "An order is if it had shipped or the accounts receivable had been received";</code></p>
<p class="noindent"><code>.</code></p>
<p class="noindentt"><code> :OrderStatus rdf:type owl:Class ;</code></p>
<p class="noindenth"><code> rdfs:label "Order Status";</code></p>
<p class="noindenth"><code> rdfs:comment "An order can have a status.";</code></p>
<p class="noindent"><code>.</code></p>
<p class="noindentt"><code> :orderDate rdf:type owl:DatatypeProperty ;</code></p>
<p class="noindenth"><code> rdfs:label "Order Date" ;</code></p>
<p class="noindenth"><code> rdfs:comment "The date the order was placed on.";</code></p>
<p class="noindenth"><code> rdfs:domain ec:Order ;</code></p>
<p class="noindenth"><code> rdfs:range xsd:dateTime ;</code></p>
<p class="noindent"><code>.</code></p>
<p class="noindentt"><code> :orderNumber rdf:type owl:DatatypeProperty ;</code></p>
<p class="noindenth"><code> rdfs:label "Order Number" ;</code></p>
<p class="noindenth"><code> rdfs:comment "The unique number that identifies an order.";</code></p>
<p class="noindenth"><code> rdfs:domain ec:Order ;</code></p>
<p class="noindenth"><code> rdfs:range xsd:int ;</code></p>
<p class="noindent"><code>.</code></p>
<p class="noindentt"><code> :orderStatusLabel rdf:type owl:DatatypeProperty ;</code></p>
<p class="noindenth"><code> rdfs:label "Order Status Label" ;</code></p>
<p class="noindenth"><code> rdfs:comment "The label that an order status has.";</code></p>
<p class="noindenth"><code> rdfs:domain ec:OrderStatus ;</code></p>
<p class="noindenth"><code> rdfs:range xsd:string ;</code></p>
<p class="noindent"><code>.</code></p>
<p class="noindentt"><code> :hasOrderStatus rdf:type owl:ObjectProperty ;</code></p>
<p class="noindenth"><code> rdfs:label "has order status";</code></p>
<p class="noindenth"><code> rdfs:comment "All orders have a status such as delivered, etc.";</code></p>
<p class="noindenth"><code> rdfs:domain ec:Order ;</code></p>
<p class="noindenth"><code> rdfs:range ec:OrderStatus ;</code></p>
<p class="noindent"><code>.</code></p>
</div>
<p class="indent"><span epub:type="pagebreak" id="page_114" title="114"/>The following is the mapping in our abstract notation:</p>
<figure>
<div class="image" id="fig_1"><img alt="Image" src="../images/pg114_1.jpg"/></div>
</figure>
<figure>
<div class="image" id="fig_2"><img alt="Image" src="../images/pg114_2.jpg"/></div>
</figure>
<figure>
<div class="image" id="fig_3"><img alt="Image" src="../images/pg114_3.jpg"/></div>
</figure>
<figure>
<div class="image" id="fig_4"><img alt="Image" src="../images/pg114_4.jpg"/></div>
</figure>
<figure>
<div class="image" id="fig_5"><img alt="Image" src="../images/pg114_5.jpg"/></div>
</figure>
<figure>
<div class="image" id="fig_6"><img alt="Image" src="../images/pg114_6.jpg"/></div>
</figure>
<p class="indent">The following is the mapping in R2RML:</p>
<div class="boxg">
<p class="noindent"><code>map:A a rr:TriplesMap ;</code></p>
<p class="noindent"><code> rr:subjectMap [</code></p>
<p class="noindent"><code>  rr:class :Order ;</code></p>
<p class="noindent"><code>  rr:template "order-{moid}"</code></p>
<p class="noindent"><code> ] ;</code></p>
<p class="noindent"><code> rr:logicalTable [</code></p>
<p class="noindent"><code>  rr:sqlQuery """</code></p>
<p class="noindent"><span epub:type="pagebreak" id="page_115" title="115"/><code>   SELECT m.moid</code></p>
<p class="noindent"><code>   FROM masterorder m</code></p>
<p class="noindent"><code>   JOIN order o ON m.oid = o.oid</code></p>
<p class="noindent"><code>   WHERE m.order_type IN (2,3)</code></p>
<p class="noindent"><code>  """</code></p>
<p class="noindent"><code> ].</code></p>
<p class="noindentt"><code>map:B a rr:TriplesMap ;</code></p>
<p class="noindent"><code> rr:subjectMap [</code></p>
<p class="noindent"><code>  rr:template "order-{moid}"</code></p>
<p class="noindent"><code> ] ;</code></p>
<p class="noindent"><code> rr:predicateObjectMap [</code></p>
<p class="noindent"><code>  rr:predicate :orderNumber ;</code></p>
<p class="noindent"><code>  rr:objectMap [ rr:column "moid" ] ;</code></p>
<p class="noindent"><code> ] ;</code></p>
<p class="noindent"><code> rr:logicalTable [</code></p>
<p class="noindent"><code>  rr:tableName "masterorder"</code></p>
<p class="noindent"><code> ].</code></p>
<p class="noindentt"><code>map:C a rr:TriplesMap ;</code></p>
<p class="noindent"><code> rr:subjectMap [</code></p>
<p class="noindent"><code>  rr:template "order-{moid}"</code></p>
<p class="noindent"><code> ] ;</code></p>
<p class="noindent"><code> rr:predicateObjectMap [</code></p>
<p class="noindent"><code>  rr:predicate :orderDate ;</code></p>
<p class="noindent"><code>  rr:objectMap [ rr:column "order_date" ] ;</code></p>
<p class="noindent"><code> ] ;</code></p>
<p class="noindent"><code> rr:logicalTable [</code></p>
<p class="noindent"><code>  rr:sqlQuery """SELECT m.moid, o.order_date FROM masterorder m</code></p>
<p class="noindent"><code>   JOIN order o ON m.oid = o.oid"""</code></p>
<p class="noindent"><code> ].</code></p>
<p class="noindentt"><code>map:D a rr:TriplesMap ;</code></p>
<p class="noindent"><code> rr:subjectMap [</code></p>
<p class="noindent"><code>  rr:template "order-{moid}"</code></p>
<p class="noindent"><code> ] ;</code></p>
<p class="noindent"><span epub:type="pagebreak" id="page_116" title="116"/><code> rr:predicateObjectMap [</code></p>
<p class="noindent"><code>  rr:predicate :hasOrderStatus ;</code></p>
<p class="noindent"><code>  rr:objectMap [ rr:template "orderstatus-{osid}" ] ;</code></p>
<p class="noindent"><code> ] ;</code></p>
<p class="noindent"><code> rr:logicalTable [</code></p>
<p class="noindent"><code>  rr:sqlQuery """SELECT moid, osid FROM orderstatus os WHERE order_status_date =</code></p>
<p class="noindent"><code>   (SELECT MAX(order_status_date) FROM orderstatus WHERE moid = os.moid)"""</code></p>
<p class="noindent"><code> ].</code></p>
<p class="noindentt"><code>map:E a rr:TriplesMap ;</code></p>
<p class="noindent"><code> rr:subjectMap [</code></p>
<p class="noindent"><code>  rr:template "orderstatus-{osid}"</code></p>
<p class="noindent"><code> ] ;</code></p>
<p class="noindent"><code> rr:predicateObjectMap [</code></p>
<p class="noindent"><code>  rr:predicate :orderStatusLabel ;</code></p>
<p class="noindent"><code>  rr:objectMap [ rr:column "label" ] ;</code></p>
<p class="noindent"><code> ] ;</code></p>
<p class="noindent"><code> rr:logicalTable [</code></p>
<p class="noindent"><code>  rr:tableName "orderstatus"</code></p>
<p class="noindent"><code> ].</code></p>
</div>
<p class="indent">The following graph queries generate the tabular data product: <b>SPARQL</b></p>
<div class="boxg">
<p class="noindent"><code>SELECT ?Order_Number ?Order_Date ?Order_Status</code></p>
<p class="noindent"><code>WHERE {</code></p>
<p class="noindent"><code>?x a :Order ;</code></p>
<p class="noindent"><code> :orderNumber ?Order_Number;</code></p>
<p class="noindent"><code> :orderDate ?Order_Date ;</code></p>
<p class="noindent"><code> :hasOrderStatus [</code></p>
<p class="noindent"><code>  :orderStatusLabel ?Order_Status;</code></p>
<p class="noindent"><code> ]</code></p>
<p class="noindent"><code>}</code></p>
</div>
<p class="noindent"><span epub:type="pagebreak" id="page_117" title="117"/>Cypher</p>
<div class="boxg">
<p class="noindent"><code>MATCH (o:Order)-[:hasOrderStatus]-&gt;(os:OrderStatus)</code></p>
<p class="noindent"><code>RETURN o.orderNumber, o.orderDate, os.orderStatusLabel</code></p>
</div>
<p class="noindent">Gremlin</p>
<div class="boxg">
<p class="noindent"><code>g.V().hasLabel('Order').out('hasOrderStatus').path().by(valueMap())</code></p>
</div>
<p class="noindent">GSQL</p>
<div class="boxg">
<p class="noindentl"><code>SELECT Order.orderNumber, Order.orderDate, OrderStatus.orderStatusLabel</code></p>
<p class="noindentl"><code>FROM Order-(hasOrderStatus)-&gt;OrderStatus</code></p>
</div>
<p class="indent">Sample data is provided to the data consumers and data producers for further validation.</p>
<p class="noindentt"><b>Phase 3 (Knowledge Access):</b> The knowledge graph is now accessible to the data consumers. Data consumers who want to consume a simple table can make use of the tabular data product with the reliable data that is needed in order to answer the original query: “How many orders were placed in a given time period per their status?” The tabular data product is now accessible to a large number of data consumers.</p>
<p class="indent">To get the exact same data directly from the database of the Order Management System, the data consumer would have to spend time with data producers to determine the SQL query, which would have been:</p>
<div class="boxg">
<p class="noindent"><code>SELECT</code></p>
<p class="noindent"><code>  m.moid AS OrderNumber,</code></p>
<p class="noindent"><code>  o.order_date AS OrderDate,</code></p>
<p class="noindent"><code>  ost.status_type AS OrderStatusName</code></p>
<p class="noindent"><code>FROM masterorder m</code></p>
<p class="noindent"><code>JOIN order o ON m.oid = o.oid</code></p>
<p class="noindent"><code>JOIN (</code></p>
<p class="noindent"><code>  SELECT moid, MAX(order_status_date) AS last_status_date</code></p>
<p class="noindent"><code>  FROM OrderStatus</code></p>
<p class="noindent"><code>  GROUP BY moid</code></p>
<p class="noindent"><code>) latest ON m.moid = latest.moid</code></p>
<p class="noindent"><code>JOIN OrderStatus os ON os.moid = latest.moid AND os.order_status_date = latest.last_status_date</code></p>
<p class="noindent"><code>JOIN OrderStatusType ost ON os.ostid = ost.ostid</code></p>
<p class="noindent"><code>WHERE m.order_type IN (2,3)</code></p>
</div>
<p class="indent"><span epub:type="pagebreak" id="page_118" title="118"/>The need to write a query to access the data does not go away. The difference is that the graph queries are written in terms of how data consumers think about their domain, whereas the SQL query is written in terms of the application, which is completely separated from the data consumer’s mental model. Being able to write queries, albeit graph queries, in terms of the data consumers’ mental model is how the data-meaning gap is bridged, productivity is gained, and trust is earned.</p>
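<p class="indent">To illustrate how the Order tabular data product answers the original question from Round 1, the following sketch counts orders per status within a time period. It assumes the data product has already been exported as rows of (Order Number, Order Date, Order Status). The sample rows, the status labels, and the function name <code>orders_per_status</code> are invented for the example.</p>

```python
from collections import Counter
from datetime import date


def orders_per_status(rows, start, end):
    """Count orders per status for orders dated between start and end, inclusive."""
    counts = Counter()
    for _order_number, order_date, order_status in rows:
        if end >= order_date >= start:
            counts[order_status] += 1
    return dict(counts)


# Invented rows shaped like the Order tabular data product.
rows = [
    (1, date(2021, 1, 5), "delivered"),
    (2, date(2021, 1, 6), "delivered"),
    (3, date(2021, 1, 7), "shipped"),
    (4, date(2021, 2, 1), "delivered"),
]

january = orders_per_status(rows, date(2021, 1, 1), date(2021, 1, 31))
print(january)
```

<p class="indent">The data consumer works only with the three attributes defined in the knowledge report and never needs to know which source tables or join conditions produced them.</p>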
</section>
<section id="ch4_sec12">
<h4 class="head4"><span epub:type="title">Round 2: Order Net Sales</span></h4>
<p class="noindent">In order to demonstrate the agile nature of the methodology, consider the following new request: extend the knowledge graph with the net sales of an order.</p>
<p class="noindentt"><b>Phase 1 (Knowledge Capture):</b> In the first phase, the knowledge scientist needs to answer the key questions, as shown in <a href="#tab4_15">Table <span class="blue">4.15</span></a>.</p>
<p class="indent">In conversations with the data consumer, the knowledge scientist learns that the data consumer gets a CSV file from IT. The data consumer opens it in Excel and applies some calculations. The knowledge scientist works with the data consumer to understand the meaning of the word “order net sales.” It is then understood that the net sales of an order is calculated by subtracting the tax and the shipping cost from the final price and also adjusting based upon the discount given. However, if the currency of the order is not in USD or CAD, then the shipping tax must be subtracted.</p>
<p class="indent">Working with a data engineer, they identify another table that is needed: ordertax. It is noted that the knowledge graph schema only needs to be extended to support two new Attributes: “Order Net Sales” and “Order Currency” associated to the Order concept as shown in the following knowledge report in <a href="#tab4_16">Tables 4.16</a> and <a href="#tab4_17">4.17</a>, respectively. The Order tabular data product is extended with those two new attributes as shown in <a href="#tab4_18">Table <span class="blue">4.18</span></a>.</p>
<p class="noindentt"><b>Phase 2 (Knowledge Implementation):</b> In our abstract notation, the knowledge graph schema is extended with the following:</p>
<div class="boxg">
<p class="noindent"><code>(Order)-orderNetSales-&gt;[float]</code></p>
<p class="noindent"><code>(Order)-orderCurrency-&gt;[string]</code></p>
</div>
<p class="indent">In OWL, the knowledge graph schema is implemented as follows:</p>
<div class="boxg">
<p class="noindent"><code>:orderNetSales rdf:type owl:DatatypeProperty ;</code></p>
<p class="noindenth"><code>rdfs:label "Order Net Sales";</code></p>
<p class="noindenth"><code>rdfs:comment "Subtracting the tax and shipping cost...";</code></p>
<p class="noindenth"><code>rdfs:domain ec:Order ;</code></p>
<p class="noindenth"><code>rdfs:range xsd:float ;</code></p>
</div>
<table class="tableb" id="tab4_15">
<caption class="tcaption"><span class="blue">Table 4.15:</span> Questions in round 2</caption>
<tbody>
<tr>
<td class="tab1"><b>What</b></td>
<td class="tab1">What is the net sales of an order?</td>
</tr>
<tr>
<td class="tab1c"><b>Why</b></td>
<td class="tab1c">Depending on who is asked, different answers are provided. The net sales is dependent on at least 4 different aspects of each order and sometimes aspects of each individual line item. The departments and individuals reporting results are variously not applying all of the proper items, not applying them consistently, or not applying them correctly (per the business’ desired rules).</td>
</tr>
<tr>
<td class="tab1"><b>Who</b></td>
<td class="tab1">The Finance department, specifically the CFO.</td>
</tr>
<tr>
<td class="tab1c"><b>How</b></td>
<td class="tab1c">A business analyst asks the IT developer for this information every morning.</td>
</tr>
<tr>
<td class="tab1"><b>Where</b></td>
<td class="tab1">This is in the proprietary Order Management System.</td>
</tr>
<tr>
<td class="tab1c"><b>When</b></td>
<td class="tab1c">Every morning they want to know the net sales of every order and also various statistics and aggregations.</td>
</tr>
</tbody>
</table>
<table class="tableb" id="tab4_16">
<caption class="tcaption"><span class="blue">Table 4.16:</span> Knowledge report for attribute order net sales</caption>
<tbody>
<tr>
<td class="tab1">Attribute Name</td>
<td class="tab1">Order Net Sales</td>
</tr>
<tr>
<td class="tab1c">Attribute Alternative Names</td>
<td class="tab1c">Revenue</td>
</tr>
<tr>
<td class="tab1">Attribute Definition</td>
<td class="tab1">Subtracting the tax and the shipping cost from the final price and also adjusting based upon the discount given. However, if the currency of the order is not in USD or CAD, then the shipping tax must be subtracted.</td>
</tr>
<tr>
<td class="tab1c">Attribute Identifier</td>
<td class="tab1c">()-orderNetSales-&gt; []</td>
</tr>
<tr>
<td class="tab1">Associated Concept</td>
<td class="tab1">(Order)</td>
</tr>
<tr>
<td class="tab1c">Table Name/SQL Query</td>
<td class="tab1c"><code>SELECT m.moid, o.ordertotal - ot.finaltax - CASE WHEN o.currencyid IN ('USD', 'CAD') THEN o.shippingcost ELSE o.shippingcost - ot.shippingtax END AS ordernetsales FROM masterorder m JOIN order o ON m.oid = o.oid JOIN ordertax ot ON o.oid = ot.oids</code></td>
</tr>
<tr>
<td class="tab1">Column</td>
<td class="tab1">ordernetsales</td>
</tr>
<tr>
<td class="tab1c">Datatype</td>
<td class="tab1c">xsd:float</td>
</tr>
<tr>
<td class="tab1">Attribute Cardinality</td>
<td class="tab1">1:1 an order must have exactly one order net sales</td>
</tr>
<tr>
<td class="tab1c">Nullable</td>
<td class="tab1c">There can’t be NULL values. If there is a NULL value, then that is a data error.</td>
</tr>
</tbody>
</table>
<table class="tableb" id="tab4_17">
<caption class="tcaption"><span class="blue">Table 4.17:</span> Knowledge report for attribute order currency</caption>
<tbody>
<tr>
<td class="tab1">Attribute Name</td>
<td class="tab1">Order Currency</td>
</tr>
<tr>
<td class="tab1c">Attribute Alternative Names</td>
<td class="tab1c">N/A</td>
</tr>
<tr>
<td class="tab1">Attribute Definition</td>
<td class="tab1">The currency in which the order was transacted</td>
</tr>
<tr>
<td class="tab1c">Attribute Identifier</td>
<td class="tab1c">()-orderCurrency-&gt; []</td>
</tr>
<tr>
<td class="tab1">Associated Concept</td>
<td class="tab1">(Order)</td>
</tr>
<tr>
<td class="tab1c">Table Name/SQL Query</td>
<td class="tab1c">order</td>
</tr>
<tr>
<td class="tab1">Column</td>
<td class="tab1">currency</td>
</tr>
<tr>
<td class="tab1c">Datatype</td>
<td class="tab1c">xsd:string</td>
</tr>
<tr>
<td class="tab1">Attribute Cardinality</td>
<td class="tab1">1:1 an order must have exactly one order currency</td>
</tr>
<tr>
<td class="tab1c">Nullable</td>
<td class="tab1c">The currency can’t be empty. If there is a NULL, it defaults to USD.</td>
</tr>
</tbody>
</table>
<table class="tableb" id="tab4_18">
<caption class="tcaption"><span class="blue">Table 4.18:</span> Order tabular data product in the knowledge report for round 2</caption>
<thead>
<tr>
<th class="thead"><b>Attribute</b></th>
<th class="thead"><b>Concept</b></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tab1">Order Number</td>
<td class="tab1">Order</td>
</tr>
<tr>
<td class="tab1c">Order Date</td>
<td class="tab1c">Order</td>
</tr>
<tr>
<td class="tab1">Order Net Sales</td>
<td class="tab1">Order</td>
</tr>
<tr>
<td class="tab1c">Order Currency</td>
<td class="tab1c">Order</td>
</tr>
<tr>
<td class="tab1">Order Status Label</td>
<td class="tab1">Order Status</td>
</tr>
</tbody>
</table>
<div class="boxg">
<p class="noindent"><span epub:type="pagebreak" id="page_119" title="119"/><span epub:type="pagebreak" id="page_120" title="120"/><span epub:type="pagebreak" id="page_121" title="121"/><code>.</code></p>
<p class="noindenth"><code>:orderCurrency rdf:type owl:DatatypeProperty ;</code></p>
<p class="noindenth"><code> rdfs:label "Order Currency";</code></p>
<p class="noindenth"><code> rdfs:comment "The currency in which the order was transacted.";</code></p>
<p class="noindenth"><code> rdfs:domain ec:Order ;</code></p>
<p class="noindenth"><code> rdfs:range xsd:string ;</code></p>
<p class="noindenth"><code>.</code></p>
</div>
<p class="indent">The following is the mapping in our abstract notation:</p>
<figure>
<div class="image" id="fig_7"><img alt="Image" src="../images/pg121_1.jpg"/></div>
</figure>
<figure>
<div class="image" id="fig_8"><img alt="Image" src="../images/pg121_2.jpg"/></div>
</figure>
<p class="indent">The following is the mapping in R2RML:</p>
<div class="boxg">
<p class="noindentt"><code>map:A a rr:TriplesMap ;</code></p>
<p class="noindent"><code> rr:subjectMap [</code></p>
<p class="noindent"><code>  rr:template "order-{moid}"</code></p>
<p class="noindent"><code> ] ;</code></p>
<p class="noindent"><code> rr:predicateObjectMap [</code></p>
<p class="noindent"><code>   rr:predicate :orderNetSales ;</code></p>
<p class="noindent"><code>   rr:objectMap [ rr:column "ordernetsales" ] ;</code></p>
<p class="noindent"><code>  ] ;</code></p>
<p class="noindent"><code>  rr:logicalTable [</code></p>
<p class="noindent"><code>   rr:sqlQuery """</code></p>
<p class="noindent"><code>    SELECT moid, o.ordertotal - ot.finaltax -</code></p>
<p class="noindent5"><code>CASE WHEN o.currencyid in ('USD', 'CAD') THEN o.shippingcost</code></p>
<p class="noindent5"><code>ELSE o.shippingcost - ot.shippingtax END as ordernetsales</code></p>
<p class="noindent"><code>    FROM masterorder m</code></p>
<p class="noindent"><code>    JOIN order o on m.oid = o.id</code></p>
<p class="noindent"><span epub:type="pagebreak" id="page_122" title="122"/><code>    JOIN ordertax ot on o.oid = ot.oids</code></p>
<p class="noindent"><code> """</code></p>
<p class="noindent"><code> ] .</code></p>
<p class="noindentt"><code>map:B a rr:TriplesMap ;</code></p>
<p class="noindent"><code> rr:subjectMap [</code></p>
<p class="noindent"><code> rr:template "order-{moid}"</code></p>
<p class="noindent"><code> ] ;</code></p>
<p class="noindent"><code> rr:predicateObjectMap [</code></p>
<p class="noindent"><code> rr:predicate :orderCurrency ;</code></p>
<p class="noindent"><code> rr:objectMap [ rr:column "currency" ] ;</code></p>
<p class="noindent"><code> ] ;</code></p>
<p class="noindent"><code> rr:logicalTable [</code></p>
<p class="noindent"><code> rr:sqlQuery</code></p>
<p class="noindent3"><code>"SELECT moid, isnull(currency, 'USD') as currency FROM order"</code></p>
<p class="noindent"><code> ] .</code></p>
</div>
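<p class="indent">To make the effect of these two mappings concrete, the sketch below applies them by hand to a couple of rows. It is a simplified illustration in plain Python rather than an R2RML processor: the sample rows, the helper <code>apply_mapping</code>, and the bare <code>order-{moid}</code> subject strings (no base IRI) are assumptions made for illustration.</p>

```python
def apply_mapping(rows, template, predicate, column):
    """Emit one (subject, predicate, object) triple per row.

    Mirrors the shape of a triples map: the subject comes from a template,
    the object from a column. Rows whose column is None yield no triple,
    which is how R2RML treats NULL values.
    """
    triples = []
    for row in rows:
        value = row[column]
        if value is None:
            continue
        triples.append((template.format(**row), predicate, value))
    return triples


# Hypothetical result rows of the two logical tables.
net_sales_rows = [
    {"moid": 1, "ordernetsales": 82.0},
    {"moid": 2, "ordernetsales": 84.0},
]
raw_currency_rows = [
    {"moid": 1, "currency": "CAD"},
    {"moid": 2, "currency": None},
]
# The currency query replaces NULL with 'USD'; the same default is applied here.
currency_rows = [
    {"moid": r["moid"], "currency": r["currency"] if r["currency"] is not None else "USD"}
    for r in raw_currency_rows
]

triples = (
    apply_mapping(net_sales_rows, "order-{moid}", ":orderNetSales", "ordernetsales")
    + apply_mapping(currency_rows, "order-{moid}", ":orderCurrency", "currency")
)
for triple in triples:
    print(triple)
```

<p class="indent">Because both mappings use the same subject template, the net sales and the currency of an order end up attached to the same node in the graph.</p>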
<p class="indent">The existing graph queries are extended as follows:</p>
<p class="noindent">SPARQL</p>
<div class="boxg">
<p class="noindentl"><code>SELECT ?Order_Number ?Order_Date ?Order_Status ?Order_Net_Sales</code></p>
<p class="noindent"><code>    ?Order_Currency</code></p>
<p class="noindent"><code>WHERE {</code></p>
<p class="noindent"><code>?x a :Order;</code></p>
<p class="noindent"><code> :orderNumber ?Order_Number;</code></p>
<p class="noindent"><code> :orderDate ?Order_Date;</code></p>
<p class="noindent"><code> :orderNetSales ?Order_Net_Sales;</code></p>
<p class="noindent"><code> :orderCurrency ?Order_Currency;</code></p>
<p class="noindent"><code> :hasOrderStatus [</code></p>
<p class="noindent"><code>  :orderStatusLabel ?Order_Status;</code></p>
<p class="noindent"><code> ]</code></p>
<p class="noindent"><code>}</code></p>
</div>
<p class="noindent">Cypher</p>
<div class="boxg">
<p class="noindent"><code>MATCH (o:Order)-[:hasOrderStatus]-&gt;(os:OrderStatus)</code></p>
<p class="noindent"><code>RETURN o.orderNumber, o.orderDate, o.orderNetSales, o.orderCurrency,</code></p>
<p class="noindent"><span epub:type="pagebreak" id="page_123" title="123"/><code>os.orderStatusName</code></p>
</div>
<p class="noindent">Gremlin</p>
<div class="boxg">
<p class="noindent"><code>g.V().hasLabel('Order').outE('hasOrderStatus').</code></p>
</div>
<p class="noindent">GSQL</p>
<div class="boxg">
<p class="noindent"><code>SELECT Order.orderNumber, Order.orderDate, Order.orderNetSales,</code></p>
<p class="noindent"><code>    Order.orderCurrency, OrderStatus.orderStatusName</code></p>
<p class="noindent"><code>FROM Order-(hasOrderStatus)-&gt;OrderStatus</code></p>
</div>
<p class="noindent"><b>Phase 3: (Knowledge Access):</b> With the extended Order tabular data product, the data consumers can further enhance the report in order to answer the new question of this round.</p>
</section>
</section>
</section>
<section>
<h2 class="head2" id="ch4_3">4.3<span class="space3"/><span epub:type="title">TOOLS</span></h2>
<p class="noindent">Building a knowledge graph and developing associated software that can consume graph data (and manage the graph) requires many different kinds of tools and software libraries. In this chapter we will discuss some of them, but the discussion is not “complete” from the standpoint of what is currently available, or as a tutorial on how to get things done. Rather than understanding specific tools (which come and go as the market evolves), it is more important to understand the kinds of tools one needs, and more specifically, how to evaluate them to be able to choose the ones that best suit one’s needs.</p>
<section>
<h3 class="head3" id="ch4_3_1">4.3.1<span class="space3"/><span epub:type="title">METADATA MANAGEMENT</span></h3>
<p class="noindent">A first class of tools is for metadata management: creating an inventory of your organization’s data assets. The goals of these tools, also known as data catalogs, are to:</p>
<p class="bbull">•  catalog what data assets (databases, tables, business terminology, reports, etc.) exist and how they are related to one another;</p>
<p class="bbull">•  understand and discover data assets related to topics and business concepts;</p>
<p class="bbull">•  provide governance to manage the policies related to data assets;</p>
<p class="bbull">•  describe provenance on how various data assets are related to one another over time; and</p>
<p class="bbull">•  enable collaboration between users and data assets.</p>
<p class="indent"><span epub:type="pagebreak" id="page_124" title="124"/>A data catalog is a foundational tool to understand what data your organization has and how it could be effectively used to create a knowledge graph.</p>
</section>
<section>
<h3 class="head3" id="ch4_3_2">4.3.2<span class="space3"/><span epub:type="title">KNOWLEDGE MANAGEMENT</span></h3>
<p class="noindent">The data product team needs to be equipped with knowledge management tools to design the knowledge graph. The following are the types of tools that would be used in Phase 1 of the methodology (Section <span class="blue">4.2.1</span>).</p>
<section id="ch4_sec13">
<h4 class="head4"><span epub:type="title">Domain Modeling</span></h4>
<p class="noindent">Whatever data you manipulate or store, there is always a model (also referred to as a “schema” or “ontology”). Sometimes, this model is not explicit, but it exists nevertheless, at least in the developers’ heads and is consequently reflected in the code that is written to manipulate or consume graph data. Some databases offer explicit support for representing a schema, although it should be noted that—at the time of writing—there is no widely accepted schema language for property graphs.<sup><a epub:type="noteref" href="#pgfn4_5" id="rpgfn4_5">5</a></sup> For RDF, representing a schema is always possible, as an RDF schema (or beyond that, an OWL ontology) is built and represented using basic RDF graph primitives. Thus, an RDF schema is embedded in, and coexists with, the graph’s “instance” data, making RDF a particularly handy approach for “self-describing data.”</p>
<p class="indent">To create a model, one needs some kind of an editor. Given that there are multiple textual serialization syntaxes for RDF,<sup><a epub:type="noteref" href="#pgfn4_6" id="rpgfn4_6">6</a></sup> one minimally needs a text editor (e.g., <b>Emacs),</b> but defining a larger model typically requires a dedicated editor. Some of these editors allow you to define the model graphically (e.g., <b>Gra.fo</b>) whereas others give you some type of structured “outline” view of your model (e.g., <b>Protégé, Topbraid</b>) or focus on taxonomies (e.g., <b>PoolParty</b>). Regardless, a model editor should be able to guide you in the definition of your model, whether that be by enforcing a valid model structure or beyond that, identifying inconsistencies in your model.</p>
</section>
<section id="ch4_sec14">
<h4 class="head4"><span epub:type="title">Schema Mapping</span></h4>
<p class="noindent">Once you have a model defined, it serves as the reference when defining other parts of your knowledge graph application. This includes the mappings from relational databases to the knowledge graph.</p>
<p class="indent">As discussed in Section <span class="blue">2.3.3</span>, existing mapping languages target RDF knowledge graphs. There are no known mapping languages for property graphs. At the time of writing, commercial tools include <b>Gra.fo</b>, <b>data.world</b>, <b>Metaphactory</b>, and <b>Data Lens</b>, among others. Open source tools include <b>Karma</b><sup><a epub:type="noteref" href="#pgfn4_7" id="rpgfn4_7">7</a></sup> and <b>RML</b>,<sup><a epub:type="noteref" href="#pgfn4_8" id="rpgfn4_8">8</a></sup> among others.</p>
<p class="indent"><span epub:type="pagebreak" id="page_125" title="125"/>Typical functionality of schema mapping tools includes creating and editing mappings (either by writing “raw” R2RML or through a visual paradigm), semi-automating the mappings, and executing them in order to physically generate the knowledge graph.</p>
</section>
<section id="ch4_sec15">
<h4 class="head4"><span epub:type="title">Entity Resolution</span></h4>
<p class="noindent">Earlier, we spoke about the importance of good identifier conventions and schemes. <i>Entity resolution</i><sup><a epub:type="noteref" href="#pgfn4_9" id="rpgfn4_9">9</a></sup> is the process of taking some description or mention of an entity and <i>resolving</i> it to a unique (pre-existing) identifier. Unique identifiers for entities (concepts and individuals) are the cornerstone for building working knowledge graphs, and entity resolution, typically as part of the graph ETL process, is thus an essential part of the process of building a knowledge graph. The accuracy of the entity resolution process also directly contributes to the overall data quality of your knowledge graph.</p>
<p class="indent">Entity resolution can range from a simple process of recognizing synonyms and picking a canonical name or identifier, to sophisticated matching involving machine learning techniques and even the use of a knowledge graph (say, DBPedia). Also, a common use case for knowledge graphs is <i>identity resolution</i> where, say, a company wants to identify a specific customer even though they have multiple different references (phone numbers, email addresses, actual names, etc.). This is a form of entity resolution.</p>
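<p class="indent">The simple end of this range can be illustrated briefly. The sketch below first resolves name variants to a canonical identifier through a synonym table, and then performs a basic identity resolution by grouping customer records that share a phone number or an email address (using a union-find structure). All identifiers, records, and function names are hypothetical, and exact matching on shared values is an assumption that stands in for the far more careful matching that real tools perform.</p>

```python
# --- Simple entity resolution: synonyms mapped to a canonical identifier ---
CANONICAL_IDS = {
    "acme": "org-acme",
    "acme corp": "org-acme",
    "acme corporation": "org-acme",
}


def resolve(mention):
    """Return the canonical identifier for a mention, or None if unknown."""
    return CANONICAL_IDS.get(mention.strip().lower())


# --- Identity resolution: records sharing a contact value are one customer ---
def group_records(records):
    """Cluster record ids that are linked by any shared phone or email."""
    parent = {rec_id: rec_id for rec_id in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # contact value -> first record id that carried it
    for rec_id, contacts in records.items():
        for value in contacts:
            if value in first_seen:
                parent[find(rec_id)] = find(first_seen[value])
            else:
                first_seen[value] = rec_id

    clusters = {}
    for rec_id in records:
        clusters.setdefault(find(rec_id), set()).add(rec_id)
    return sorted(sorted(c) for c in clusters.values())


records = {
    "r1": {"555-0100", "ann@example.com"},
    "r2": {"ann@example.com", "555-0199"},
    "r3": {"555-0199"},
    "r4": {"bob@example.com"},
}
print(resolve(" ACME Corp "))
print(group_records(records))
```

<p class="indent">Note that <code>r1</code> and <code>r3</code> share no value directly; they end up in the same cluster only through <code>r2</code>, which is why a transitive grouping structure is used rather than pairwise comparison alone.</p>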
<p class="indent">Some sources of information about entity resolution are <span class="blue">Herzog et al</span>. [<span class="blue">2007</span>], <span class="blue">Köpcke et al</span>. [<span class="blue">2010</span>], and <span class="blue">Getoor and Machanavajjhala</span> [<span class="blue">2012</span>]. There are also a number of tools and libraries available for entity resolution, as well as cloud-based services, which fall under the traditional category of Master Data Management and Data Integration companies (e.g., Informatica, Talend, etc.).</p>
</section>
</section>
<section>
<h3 class="head3" id="ch4_3_3">4.3.3<span class="space3"/><span epub:type="title">DATA MANAGEMENT</span></h3>
<p class="noindent">The data product team needs to be equipped with data management tools to build the knowledge graph. The following are the types of tools that would be used in Phase 2 of the methodology (Section <span class="blue">4.2.2</span>).</p>
<section id="ch4_sec16">
<h4 class="head4"><span epub:type="title">Graph Databases</span></h4>
<p class="noindent">To build a knowledge graph one obviously needs some way of storing and managing the graph. At the time of writing, the graph database market is still very much nascent and in some ways the “Wild West,” as there are a number of different query languages and graph (meta)models. There are some well-known and established graph database products (such as <b>Amazon Neptune, Stardog, Ontotext GraphDB, OpenLink Virtuoso, Neo4J, TigerGraph,</b> and <b>MarkLogic)</b> as well as a constantly changing landscape of newcomers. As an alternative to commercial and proprietary products, there are also open-source offerings that one should consider as possibilities; for example, all RDF frameworks and libraries typically offer some solutions for persistence, <span epub:type="pagebreak" id="page_126" title="126"/>whether those be “native”—a database implementation as part of the library—or solutions where the graph is backed onto some other type of third-party persistence substrate (relational database, key/value store, etc.).</p>
<p class="indent">The type of graph (meta)model one chooses limits the choice of query and schema languages available down the road. Most graph databases support either RDF graphs (and thus the SPARQL query language) or “labeled property graphs,” or sometimes both. In the case of property graphs, two query languages have emerged that are not specifically tied to one single database implementation or product: Gremlin from the Apache Tinkerpop open source project and Cypher (specifically its open variant openCypher) from Neo4J. Besides those two, many database products have their own query languages or introduce extensions to existing query languages. Some database products also offer support beyond graphs, for document or relational storage.</p>
<p class="indent">Apart from the above, there is the choice between persistent databases and “in-memory” databases. In the case of the latter, one obviously needs some persistence solution from which the in-memory database can be (re-)populated when needed—this does not necessarily need to be a database, as the in-memory graph can simply be loaded from a static file, for example. In-memory databases are typically geared toward analytics and large-scale graph algorithm computation.</p>
</section>
<section id="ch4_sec17">
<h4 class="head4"><span epub:type="title">Graph Frameworks</span></h4>
<p class="noindent">Several open-source frameworks also offer solutions for persistence. For example, Java RDF frameworks <b>Eclipse RDF4J</b> (through its <i>RDF4J Server</i>) and <b>Apache Jena</b> (through its <i>Fuseki</i> server) both offer the option of persisting graphs on disk with access over HTTP. Similarly on the property graph side, <b>Apache Tinkerpop</b> offers the <i>Gremlin Server.</i> The RDF frameworks also offer “embedded” persistent graph database solutions (RDF4J through its NativeStore class, Jena through its TDBFactory factory class). For smaller applications, these approaches can be considered. Note, however, that most RDF frameworks typically rely on SPARQL and SPARQL Update and offer “wrappers” around graph databases that support these query languages.</p>
</section>
<section id="ch4_sec18">
<h4 class="head4"><span epub:type="title">Virtualization and Federation</span></h4>
<p class="noindent">When the relational data is highly dynamic, or too big to feasibly materialize into a graph, it makes sense to virtualize the knowledge graph and keep the data in its original form. Research on translating SPARQL queries to SQL using mappings [<span class="blue">Sequeda and Miranker, 2013</span>] has matured over the past decade. This has led to various commercial offerings (e.g., data.world, Stardog, GraphDB). Additionally, there are open source offerings such as <span epub:type="pagebreak" id="page_127" title="127"/>Morph<sup><a epub:type="noteref" href="#pgfn4_10" id="rpgfn4_10">10</a></sup> [<span class="blue">Priyatna et al., 2014</span>] and Ontop<sup><a epub:type="noteref" href="#pgfn4_11" id="rpgfn4_11">11</a></sup> [<span class="blue">Calvanese et al., 2017</span>]. At the time of writing, there are no known virtualization tools for property graphs.</p>
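<p class="indent">The core idea of virtualization can be sketched in a few lines: instead of materializing triples, a query over the graph is rewritten into SQL using the mappings. The example below handles only a single triple pattern of the form <code>?s :predicate ?o</code> and is in no way representative of what the systems cited above do. The <code>MAPPINGS</code> dictionary, the function <code>triple_pattern_to_sql</code>, and the table name <code>orders</code> (used instead of <code>order</code> to avoid the SQL reserved word) are hypothetical, and SQLite stands in for the relational source.</p>

```python
import sqlite3

# Hypothetical mapping: for each predicate, where its subjects and objects live.
MAPPINGS = {
    ":orderCurrency": {
        "source": "SELECT moid, currency FROM orders",
        "subject_column": "moid",
        "subject_prefix": "order-",
        "object_column": "currency",
    },
}


def triple_pattern_to_sql(predicate, subject_var="s", object_var="o"):
    """Rewrite the pattern `?s predicate ?o` into a SQL query string."""
    m = MAPPINGS[predicate]
    return (
        f"SELECT '{m['subject_prefix']}' || {m['subject_column']} AS {subject_var}, "
        f"{m['object_column']} AS {object_var} "
        f"FROM ({m['source']}) AS src "
        f"WHERE {m['object_column']} IS NOT NULL"
    )


# A tiny in-memory relational source with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (moid INTEGER, currency TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, "USD"), (2, "EUR"), (3, None)]
)

sql = triple_pattern_to_sql(":orderCurrency")
print(sql)

# The data never leaves the relational database; only the answers come back.
bindings = conn.execute(sql + " ORDER BY s").fetchall()
print(bindings)
```

<p class="indent">Real query rewriters must additionally handle joins across several triple patterns, filters, optional parts, and the optimization of the resulting SQL, which is where most of the research effort has gone.</p>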
<p class="indent">Federation provides the capability of aggregating knowledge graphs from distributed sources. The SPARQL query language for RDF knowledge graphs provides a federation extension to execute distributed queries over any number of SPARQL endpoints.<sup><a epub:type="noteref" href="#pgfn4_12" id="rpgfn4_12">12</a></sup> A SPARQL endpoint can front either an RDF graph database or a virtualized knowledge graph over a relational database. At the time of writing, there are no known federation tools for property graphs.</p>
<p class="indent">The combination of virtualization and federation enables accessing distributed knowledge graphs using a single query.</p>
</section>
<section id="ch4_sec19">
<h4 class="head4"><span epub:type="title">Other Databases</span></h4>
<p class="noindent">As we have learned, a graph is a mathematical construct—or, from the computer science standpoint, a data structure. Storing and managing a graph does not necessarily require a graph database per se, because graph structures can be stored using other database technologies as well. Depending on one’s use case, this may or may not be particularly efficient, though.</p>
<p class="indent">The reason we bring up “other” databases is that they best serve our “knowledge graph exercise” as sources from which to populate a graph. That’s really what this whole book is about, although we have mostly limited the discussion to relational (or tabular) sources. It should be noted, though, that document databases (e.g., <b>MongoDB, Amazon DocumentDB),</b> key/value stores (e.g., <b>Cassandra, Amazon DynamoDB),</b> and others, can also be used successfully. Finally, static data sources such as semi-structured data (e.g., XML documents) or structured data (e.g., CSV files) are also fair game when it comes to populating your graph.</p>
<p class="indent">Note, also, that populating your graph from external sources can take many forms: You can use the sources once, to initially construct your graph, or they can serve as the real “source of truth” and your graph is more like a cached version of that data; the latter case is obviously true for in-memory graph databases. If your source of truth is not the graph database, you may need to consider how changes to the graph—if allowed—can be propagated back to the source database.</p>
</section>
<section id="ch4_sec20">
<h4 class="head4"><span epub:type="title">Validation</span></h4>
<p class="noindent">W3C has produced a specification for validating RDF data, called SHACL (short for “Shapes Constraint Language”). In SHACL, a “shape” (rather than an RDF or OWL class) is a graph pattern against which graph data can be compared and validated. A shape is effectively a set of constraints that the graph data has to satisfy, and these shapes can be defined for nodes—and thus they roughly correspond to classes—or properties—in which case they correspond to property definitions or restrictions from RDF and OWL. In addition to being able to express and validate constraints that could be defined using OWL (e.g., cardinality constraints), SHACL also lets <span epub:type="pagebreak" id="page_128" title="128"/>one define constraints for syntactic validation of property values (e.g., to make sure that a string is formatted as a valid U.S. phone number).</p>
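<p class="indent">The two kinds of constraints mentioned here, cardinality and syntactic validation, can be mimicked in plain Python to show what a validator checks. This sketch is not SHACL and uses no SHACL library; the shape dictionary, the property names, and the phone-number pattern are hypothetical simplifications.</p>

```python
import re

# A hypothetical "shape": for each property, the allowed number of values
# and, optionally, a pattern that every value must match.
CUSTOMER_SHAPE = {
    "name": {"min_count": 1, "max_count": 1},
    "phone": {
        "min_count": 0,
        "max_count": 1,
        "pattern": re.compile(r"\(\d{3}\) \d{3}-\d{4}"),
    },
}


def validate(node, shape):
    """Return a list of human-readable violations for one node."""
    violations = []
    for prop, rule in shape.items():
        values = node.get(prop, [])
        allowed_counts = range(rule["min_count"], rule["max_count"] + 1)
        if len(values) not in allowed_counts:
            violations.append(
                f"{prop}: expected {rule['min_count']}..{rule['max_count']} "
                f"values, found {len(values)}"
            )
        pattern = rule.get("pattern")
        if pattern is not None:
            for value in values:
                if not pattern.fullmatch(value):
                    violations.append(f"{prop}: {value!r} does not match the pattern")
    return violations


good = {"name": ["Ann"], "phone": ["(512) 555-0100"]}
bad = {"name": [], "phone": ["5125550100"]}
print(validate(good, CUSTOMER_SHAPE))
print(validate(bad, CUSTOMER_SHAPE))
```

<p class="indent">As with a SHACL validation report, the result is a list of violations rather than a simple pass or fail, so each data error can be traced to a node and a property.</p>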
<p class="indent">There is also an alternative to SHACL, called ShEx (for “Shape Expressions”). Roughly similar to SHACL in functionality and capabilities, ShEx introduces a new syntax for expressing the shapes (SHACL shapes are expressed in RDF, and thus can co-exist in the broader graph they are meant to validate).</p>
<p class="indent">At the time of writing, there is no non-proprietary validation language for property graphs.</p>
</section>
</section>
<section>
<h3 class="head3" id="ch4_3_4">4.3.4<span class="space3"/><span epub:type="title">ADDITIONAL TOOLS</span></h3>
<section id="ch4_sec21">
<h4 class="head4"><span epub:type="title">Search and Discovery</span></h4>
<p class="noindent">You have successfully created a knowledge graph and it is stored in a graph database. The next step is to build data products using the knowledge graph. These data products, whether they are tables, graphs, or APIs, need to be searchable and discoverable by data consumers. Data catalogs can be repurposed for this need: in addition to inventorying the raw data assets, a data catalog can be used to inventory the data products. Data producers are the audience of a data catalog of the raw data assets. Data consumers, on the other hand, are the audience of a data catalog of the data products derived from the knowledge graph.</p>
</section>
<section id="ch4_sec22">
<h4 class="head4"><span epub:type="title">Public Sources</span></h4>
<p class="noindent">Your knowledge graph can be enhanced and “enriched” by using public data sources. These sources include actual data (e.g., public knowledge graphs such as DBPedia and Wikidata) as well as public ontologies (such as Gist, FIBO, and <a href="http://schema.org">schema.org</a>).</p>
<p class="indent">For public data, you can either ingest that data into your graph, or you can use federated queries for access. For the former, you need to consider how to “refresh” the data if the public source is updated; for example, if you are using RDF, you can place the public data in a separate named graph, and perform “bulk updates” by deleting all data from that graph and reloading the external source. For the latter, you need to be prepared for potentially unanticipated changes in the public knowledge graph, or possible outages. Naturally, in both cases, you need to consider questions of trustworthiness, accuracy, etc., as well as legal questions such as usage rights.</p>
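<p class="indent">The “bulk update” pattern for ingested public data can be sketched with a toy in-memory store that keeps triples per named graph. The class <code>QuadStore</code>, the graph names, and the sample triples are hypothetical; a real graph database would expose equivalent operations through SPARQL Update or its own API.</p>

```python
class QuadStore:
    """Toy store: a dictionary from named-graph identifier to a set of triples."""

    def __init__(self):
        self.graphs = {}

    def add(self, graph, triple):
        self.graphs.setdefault(graph, set()).add(triple)

    def refresh(self, graph, new_triples):
        """Drop everything in one named graph and reload it from a fresh copy."""
        self.graphs[graph] = set(new_triples)

    def all_triples(self):
        """Union of all named graphs, i.e., what a query over the whole store sees."""
        return set().union(*self.graphs.values())


store = QuadStore()
# Enterprise data and ingested public data live in separate named graphs.
store.add("graph:enterprise", ("order-1", ":shippedTo", "city-1"))
store.add("graph:public", ("city-1", ":label", "Old label"))

# The public source published an update: replace only the public graph.
store.refresh("graph:public", [("city-1", ":label", "New label")])
print(sorted(store.all_triples()))
```

<p class="indent">Keeping the public data in its own named graph is what makes the refresh safe: the enterprise triples are never touched, and stale public triples cannot linger after a reload.</p>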
</section>
</section>
</section>
<section epub:type="footnotes">
<div epub:type="footnote" id="pgfn4_1"><p class="pgnote"><sup><a href="#rpgfn4_1">1</a></sup> <span class="blue"><a href="https://www.infoworld.com/article/3228245/the-80-20-data-science-dilemma.html">https://www.infoworld.com/article/3228245/the-80-20-data-science-dilemma.html</a></span></p></div>
<div epub:type="footnote" id="pgfn4_2"><p class="pgnote"><sup><a href="#rpgfn4_2">2</a></sup> Remember, data and knowledge need to be connected, and this can also be done in the form of a table.</p></div>
<div epub:type="footnote" id="pgfn4_3"><p class="pgnote"><sup><a href="#rpgfn4_3">3</a></sup> We peer review scientific papers. We peer review software code. We must also peer review the way we manage our data, in this case, the knowledge reports.</p></div>
<div epub:type="footnote" id="pgfn4_4"><p class="pgnote"><sup><a href="#rpgfn4_4">4</a></sup> Recent tools such as Great Expectations could be adopted for this need.</p></div>
<div epub:type="footnote" id="pgfn4_5"><p class="pgnote"><sup><a href="#rpgfn4_5">5</a></sup> At the time of writing, the Property Graph Schema Working Group of the Linked Data Benchmark Council is a community effort providing recommendations for a property graph schema language to the ISO GQL standards body.</p></div>
<div epub:type="footnote" id="pgfn4_6"><p class="pgnote"><sup><a href="#rpgfn4_6">6</a></sup> Turtle is the most common RDF syntax.</p></div>
<div epub:type="footnote" id="pgfn4_7"><p class="pgnote"><sup><a href="#rpgfn4_7">7</a></sup> <span class="blue"><a href="https://github.com/usc-isi-i2/Web-Karma">https://github.com/usc-isi-i2/Web-Karma</a></span></p></div>
<div epub:type="footnote" id="pgfn4_8"><p class="pgnote"><sup><a href="#rpgfn4_8">8</a></sup> <span class="blue"><a href="https://github.com/RMLio/rmlmapper-java">https://github.com/RMLio/rmlmapper-java</a></span></p></div>
<div epub:type="footnote" id="pgfn4_9"><p class="pgnote"><sup><a href="#rpgfn4_9">9</a></sup> Sometimes also referred to as <i>record linkage, reference matching, entity linking,</i> etc.</p></div>
<div epub:type="footnote" id="pgfn4_10"><p class="pgnote"><sup><a href="#rpgfn4_10">10</a></sup> <span class="blue"><a href="https://github.com/oeg-upm/morph-rdb">https://github.com/oeg-upm/morph-rdb</a></span></p></div>
<div epub:type="footnote" id="pgfn4_11"><p class="pgnote"><sup><a href="#rpgfn4_11">11</a></sup> <span class="blue"><a href="https://github.com/ontop/ontop">https://github.com/ontop/ontop</a></span></p></div>
<div epub:type="footnote" id="pgfn4_12"><p class="pgnote"><sup><a href="#rpgfn4_12">12</a></sup> <span class="blue"><a href="https://www.w3.org/TR/sparql11-federated-query/">https://www.w3.org/TR/sparql11-federated-query/</a></span></p></div>
</section>
</section>
</body>
</html>