<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch12" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch12"><span aria-label="307" id="pg_307" role="doc-pagebreak"/>12</h1>
<h1 class="chapter-title"><b>Structured Querying</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b> Once a knowledge graph (KG) is in place, it has to be queried in order to retrieve the desired information. Structured querying is one such mechanism, wherein a formal, database-like language (based on strong logical foundations, and with clearly defined semantics and syntax) is used to retrieve subgraphs or graph patterns from the KG. An alternative mechanism, which has continued to become more popular with improvements in both Natural Language Processing (NLP) and deep learning, is posing natural-language questions with the expectation that the underlying retrieval system would be able to understand the question and retrieve high-quality answers. In this chapter, we focus on structured querying, which continues to be the predominant method for accessing structured databases that are either tabular or graphlike. In the next chapter, we study question answering.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec12-1"/><b>12.1 Introduction</b></h2>
<p class="noindent">In the data management community, querying has always been a fundamental component of research and practice, and KGs are no different. With the advent of large data sets, including terabyte- or even petabyte-sized web corpora; large troves of structured data such as tables, relational databases, and spreadsheets; and other such Big Data, the final KG that is constructed is itself enormous.</p>
<p>With the assumption that the KG is generally semistructured (making a tabular serialization inherently unsuitable), and may contain many missing (or even non-conforming) values and semantics, two broad kinds of structured querying are applicable. The first, which is more traditional and has been inspired by the massive literature in the database community on SQL and SQL-like domain-specific languages (DSLs), is based on a graph pattern–matching language called SPARQL. SPARQL is heavily favored by the Semantic Web (SW) community, and is designed to work with Resource Description Framework (RDF), a modeling language for KGs that we covered back in chapter 2. A second kind of structured querying is based on key-value stores, which is inspired by NoSQL efforts in the relational database literature as well as the information retrieval community. These <span aria-label="308" id="pg_308" role="doc-pagebreak"/>two kinds of structured querying offer their own sets of advantages and disadvantages. We return to this issue toward the end of this chapter, after describing both in detail.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec12-2"/><b>12.2 SPARQL</b></h2>
<p class="noindent">The SPARQL query language is the official World Wide Web Consortium (W3C) standard for querying and extracting information from RDF graphs. It represents the counterpart to select-project-join queries in the relational model. It is based on a powerful graph-matching facility, allows the binding of variables to components in the input RDF graph, and supports conjunctions and disjunctions of triple patterns. In addition, operators akin to relational joins, unions, left outer joins, selections, and projections can be combined to build more expressive queries.</p>
<p>A basic SPARQL query is a <i>graph pattern</i>, which is defined as a set of triple patterns. Triple patterns are like ordinary triples, in that each pattern consists of subject, predicate, and object, but the twist is that each of these elements may now be a variable, an internationalized resource identifier (IRI), or a literal. In other words, the query specifies the known literals and leaves the unknowns as variables that can occur in multiple patterns to constitute join operations. A question mark is placed at the front of the token to indicate that it is a variable. For example, the triple pattern <i>“?name foaf:name ‘John’”</i> has the variable <i>“?name”</i> as the subject element; the pattern is essentially a query asking for the uniform resource identifiers (URIs) that are linked to the literal <i>“John”</i> via the property <i>foaf:name</i> (specified in the predicate position using the IRI <a href="http://xmlns.com/foaf/0.1/name">http://<wbr/>xmlns<wbr/>.com<wbr/>/foaf<wbr/>/0<wbr/>.1<wbr/>/name</a>, for which it is a shorthand).</p>
<p>Given such a triple pattern, the query processor needs to find all possible <i>variable bindings</i> that satisfy the given patterns and return the bindings from the projection clause to the application. Join operations become necessary because, within the scope of a single graph pattern (recall that these are sets of triple patterns), triple patterns may share variables.</p>
<p>Generally, it can become unwieldy to specify SPARQL queries as sets of triple patterns; furthermore, triple patterns are not expressive enough by themselves to satisfy many real-world needs, including quantification, ordering, and grouping. In contrast, those familiar with SQL will recognize that these are core elements of the language and of extended relational algebra. SPARQL allows these facilities as well by providing a SQL-like way of specifying expressive queries to be executed against RDF data sets. A simple, yet powerful, query that illustrates this expressiveness is shown in <a href="chapter_12.xhtml#fig12-1" id="rfig12-1">figure 12.1</a>.</p>
<div class="figure">
<figure class="IMG"><a id="fig12-1"/><img alt="" src="../images/Figure12-1.png" width="300"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig12-1">Figure 12.1</a>:</span> <span class="FIG">A simple SPARQL query, the elements of which are described in the text.</span></p></figcaption>
</figure>
</div>
<p>We explain the various elements of this query here:</p>
<ul class="numbered">
<li class="NL">1. PREFIX: This is used to declare a shorthand (the “prefix”) for a Uniform Resource Locator (URL) namespace. When the prefix is used, it is as if the full URL was used in its place.</li>
<li class="NL">2. <span aria-label="309" id="pg_309" role="doc-pagebreak"/>SELECT: This is similar to the SELECT often found in SQL queries (i.e., in the example given here, we can imagine the query execution returning a “table” with two columns, one for the <i>Person</i> and the other for the <i>Homepage</i> binding). The variable names are mnemonic; if we wanted, we could have declared the variable <i>“?person”</i> to be <i>“?p”</i> (assuming we also changed it in the other lines of the query).</li>
<li class="NL">3. FROM: Again, just like SQL, it is used to specify the data set over which the query will be executed.</li>
<li class="NL">4. WHERE: Like SQL, this is the place where the conditions are imposed. In this case, the conditions are graph patterns. In the example here, the condition says that anything that binds to <i>“?person”</i> must have type <i>foaf:Person</i> (where <i>foaf:</i> has the shorthand as expressed in PREFIX), and also possess a homepage.</li>
<li class="NL">5. ORDER BY: If we imagine that the results are retrieved as a two-column table, this says that the rows of the table must be sorted in descending order of homepage (where “descending” would intuitively be interpreted by the system in the usual way that comparators are defined for that datatype).</li>
<li class="NL">6. LIMIT: The number of rows in the answers-table is limited (to 5, in this case).</li>
</ul>
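<p>Because figure 12.1 is reproduced as an image, a textual sketch of a query with the shape just described may be helpful; the <i>foaf</i> prefix is the one introduced earlier, while the graph IRI in the FROM clause and the variable names are illustrative placeholders:</p>
<p class="center">PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;<br/>SELECT ?person ?homepage<br/>FROM &lt;http://example.org/people.rdf&gt;<br/>WHERE {<br/>?person a foaf:Person .<br/>?person foaf:homepage ?homepage .<br/>}<br/>ORDER BY DESC(?homepage)<br/>LIMIT 5</p>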
<p>While all of the querying facilities expressed by this example query were supported by SPARQL 1.0, SPARQL 1.1, adopted as a W3C recommendation in March 2013, allowed for significantly more expressiveness. Specifically, it allowed aggregate functions inspired by the relational database community, such as COUNT, “grouping” functions (e.g., GROUP BY), and “having” functions (e.g., HAVING).</p>
<p>SPARQL 1.1 also contains support for subqueries (nesting queries within queries), negation and filtering, property paths, new variable introduction, basic federated querying, and support for graph patterns inside FILTERs. New built-in aggregate expressions include AVG, COUNT, GROUP_CONCAT, MAX, MIN, SAMPLE, and SUM with their usual meanings. All are allowed with or without DISTINCT. As noted earlier, the grouping of results can be optionally done as well, while HAVING executes a filter expression over an aggregation result.</p>
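<p>As a brief sketch of two of these additions, the following query combines a property path with negation: it retrieves the names of everyone reachable through one or more <i>foaf:knows</i> links from a starting resource who does not directly know that resource in return (the <i>foaf</i> vocabulary is as before, and the resource <i>:me</i> is an illustrative placeholder):</p>
<p class="center">PREFIX : &lt;http://example.org/&gt;<br/>PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;<br/>SELECT DISTINCT ?name<br/>WHERE {<br/>:me foaf:knows+ ?friend .<br/>?friend foaf:name ?name .<br/>FILTER NOT EXISTS { ?friend foaf:knows :me }<br/>}</p>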
<p>Of the aggregation expressions mentioned, AVG, MAX, MIN, and SUM are fairly self-explanatory and are typically applied to numeric sets of results. By way of example, consider the query fragment in <a href="chapter_12.xhtml#fig12-2" id="rfig12-2">figure 12.2</a> (not all the details needed to make this query <span aria-label="310" id="pg_310" role="doc-pagebreak"/>syntactically valid may be specified; see the exercises at the end of this chapter). What we are trying to discover is the lowest possible price (the minimum of low prices) offered by a certain dealer for products (that the dealer deals in, because we did not specify the product type in the query) that were manufactured after January 1, 2019. Similar queries can be composed with the other aggregation operators, such as MAX, AVG, and SUM. In a similar vein, COUNT counts the number of elements that bind to a given expression, whereas COUNT(*) can be used to count all results. SAMPLE returns an arbitrary element, which is useful if we have reason to believe that there is only one result, or if we only want one result regardless (and we don’t care which one it is). GROUP_CONCAT concatenates all elements and can be used in conjunction with separator expressions such as “,”.</p>
<div class="figure">
<figure class="IMG"><a id="fig12-2"/><img alt="" src="../images/Figure12-2.png" width="350"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig12-2">Figure 12.2</a>:</span> <span class="FIG">A partial SPARQL query illustrating an aggregation.</span></p></figcaption>
</figure>
</div>
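<p>For concreteness, one syntactically complete way of writing the aggregation just described is sketched here; the vocabulary (the <i>ex:</i> properties and the dealer resource) is assumed purely for illustration:</p>
<p class="center">PREFIX ex: &lt;http://example.org/schema#&gt;<br/>PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;<br/>SELECT (MIN(?lowPrice) AS ?minPrice)<br/>WHERE {<br/>?product ex:dealer ex:dealer1 ;<br/>ex:lowPrice ?lowPrice ;<br/>ex:manufactureDate ?date .<br/>FILTER (?date &gt; "2019-01-01"^^xsd:date)<br/>}</p>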
<section epub:type="division">
<h3 class="head b-head"><a id="sec12-2-1"/><b>12.2.1 Subqueries</b></h3>
<p class="noindent">Subqueries are the preferred way to embed SPARQL queries within other queries, which usually is done to achieve results that cannot otherwise be achieved, such as limiting the number of results from some subexpression within the query. Due to the bottom-up nature of SPARQL query evaluation, the subqueries are evaluated logically first, and then the results are projected to the outer query. Note, however, that only variables projected out of the subquery will be visible, or in scope, to the outer query. By way of example, consider the following (small) KG, expressed pithily (several triples are on one line):</p>
<p class="center">@prefix : &lt;<a href="http://people.example/">http://<wbr/>people<wbr/>.example<wbr/>/</a>&gt; .<br/>:martha :name “Martha”, “Martha Marshall”, “M. Marshall” .<br/>:martha :knows :bob, :alice .<br/>:bob :name “Bob”, “Bob Brioni”, “B. Brioni” .<br/>:alice :name “Alice”, “Alice Ace”, “A. Ace” .</p>
<p>Now consider the query in <a href="chapter_12.xhtml#fig12-3" id="rfig12-3">figure 12.3</a>, using the same prefix nomenclature as before. The overall query is evaluated by first evaluating the inner query, which yields a single minName for each <i>y</i>; i.e., we are essentially picking a canonical name for every person resource. We use MAX to pick the name with the greatest number of characters, as that would usually be the full name. The final result is a set of names (representing the people <span aria-label="311" id="pg_311" role="doc-pagebreak"/>Martha knows), but each name identifies a unique individual. Not doing the GROUP BY would have yielded a misleading result (see the exercises).</p>
<div class="figure">
<figure class="IMG"><a id="fig12-3"/><img alt="" src="../images/Figure12-3.png" width="400"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig12-3">Figure 12.3</a>:</span> <span class="FIG">An example of a SPARQL query with a subquery.</span></p></figcaption>
</figure>
</div>
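<p>Figure 12.3 is likewise shown as an image; a query matching its description might be sketched as follows (the variable names, including <i>?minName</i>, are assumptions based on the surrounding text):</p>
<p class="center">PREFIX : &lt;http://people.example/&gt;<br/>SELECT ?y ?minName<br/>WHERE {<br/>:martha :knows ?y .<br/>{<br/>SELECT ?y (MAX(?name) AS ?minName)<br/>WHERE { ?y :name ?name . }<br/>GROUP BY ?y<br/>}<br/>}</p>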
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec12-3"/><b>12.3 Relational Processing of Queries over Knowledge Graphs</b></h2>
<p class="noindent">Even though SPARQL looks very different from SQL, a broad body of work has drawn on decades of query-optimization research in the SQL and relational database communities to inform efficient query processing of SPARQL. Yet, as noted before, RDF and KGs cannot naturally be represented as “tables,” and it is also not clear how one could easily convert SPARQL to SQL. However, relational database management systems (RDBMSs) have repeatedly shown that they are very efficient, scalable, and successful in hosting types of data that had formerly not been anticipated to be stored inside relational databases. In addition, RDBMSs have shown their ability to handle vast amounts of data very efficiently using powerful indexing mechanisms. Over the years, <i>relational RDF stores</i> have been proposed and developed in the community so that practitioners may have the best of both worlds to the greatest extent possible (i.e., have the representational benefits of KGs while being able to leverage findings from the RDBMS community to execute fast queries over these KGs, in addition to drawing upon other RDBMS benefits). These stores tend to fall broadly into three categories:</p>
<ul class="numbered">
<li class="NL">1. <b>Triple (vertical) table stores:</b> In these stores, each RDF triple is stored directly in a three-column table (subject, predicate, object).</li>
<li class="NL">2. <b>Property (</b><b><i>n</i></b><b>-ary) table stores:</b> In these stores, multiple RDF properties are modeled as <i>n</i>-ary table columns for the same subject.</li>
<li class="NL">3. <b>Horizontal table stores:</b> In these stores, RDF triples are modeled as one horizontal table or as a set of vertically partitioned binary tables (one table for each RDF property).</li>
</ul>
<p>By way of example, alternative representations of the same RDF data set are shown in <a href="chapter_12.xhtml#fig12-4" id="rfig12-4">figure 12.4</a>. We briefly describe each of the stores next.</p>
<div class="figure">
<figure class="IMG"><a id="fig12-4"/><img alt="" src="../images/Figure12-4.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig12-4">Figure 12.4</a>:</span> <span class="FIG">Relational RDF store representations (for all three categories described in the text) for the same RDF data set.</span></p></figcaption>
</figure>
</div>
<section epub:type="division">
<h3 class="head b-head"><span aria-label="312" id="pg_312" role="doc-pagebreak"/><a id="sec12-3-1"/><b>12.3.1 Triple (Vertical) Stores</b></h3>
<p class="noindent">The triplestore (also called a <i>triple table</i>) approach is the most straightforward way in which RDF can be mapped into an RDBMS. Each RDF triple is basically stored in one large table with a three-column schema (for the subject, predicate, and object). Indexes are added to each of the columns to make joins less expensive.</p>
<p>Unfortunately, because the triples are stored in a <i>single</i> RDF table, queries can be slow to execute in the general case. Scalability can be a major issue as well, with the RDF table exceeding the main memory size as the data (i.e., the number of triples) starts increasing. While simple statement-based queries can be satisfactorily processed by such stores, such queries do not represent the most important mechanism for querying RDF data. Indeed, complex queries with multiple triple patterns (requiring many self-joins over this single large table) do not scale well.</p>
<p>A good example of a triplestore is 3store, which is based on a central triple table holding the hashes for the subject, predicate, object, and graph identifier (equal to zero if the triple resides in the anonymous background graph). A symbols table is used for enabling reverse lookups from the hash to the hashed value to return results. Furthermore, 3store allows SQL operations to be performed on precomputed values in the datatypes of the columns without the use of casts. For evaluating SPARQL queries, the triples table is joined once for each triple in the graph pattern where variables are bound to their values when the <span aria-label="313" id="pg_313" role="doc-pagebreak"/>slot in which the variable appears is encountered. Subsequent occurrences of variables in the graph pattern are used to constrain any appropriate joins with their initial binding. To produce the intermediate results table, the hashes of any SPARQL variables required to be returned in the results set are projected, and the hashes from the intermediate results table are joined to the symbols table to provide the textual representation of the results.</p>
<p>Other examples of triplestores (or stores resembling triplestores) include RDF Triple eXpress (RDF-3X) and Hexastore. RDF-3X is designed as an RDF query engine that tries to avoid the expensive self-joins mentioned here by creating an exhaustive set of indexes and relying on the fast processing of merge joins. Triples in RDF-3X are stored (and sorted lexicographically) in a compressed clustered B+ tree. An impressive feature is that the physical design of RDF-3X is workload-independent (i.e., the design eliminates the need for tuning by building indexes over all six permutations of the three dimensions constituting an RDF triple). RDF-3X supports both individual updates and entire batch updates, making it a flexible choice for many desiderata.</p>
<p>Similarly, Hexastore also focuses on scalability and generality in its data storage, processing, and representation, and it is based on the idea of indexing the RDF data in a multi-indexing scheme. Hexastore does not distinguish among the three elements in a triple and treats subjects, properties, and objects in an RDF triple equally, with each element type having its own special index structures built around it. By virtue of this design, it needs six distinct indexes for indexing the RDF data (hence the name), with the indexes materializing all possible orders of precedence of the three RDF elements. A clear disadvantage of this design principle, however, is a five-fold increase in storage compared to traditional triplestores.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec12-3-2"/><b>12.3.2 Property Table Stores</b></h3>
<p class="noindent">Due to the proliferation of self-joins involved with the triplestore, the property table approach has been proposed as an alternative relational RDF store. Property tables improve on triplestores by allowing multiple triple patterns that reference the <i>same</i> subject to be retrieved without an expensive join. In this model, RDF tables are physically stored in a representation closer to traditional relational schemas in order to speed up queries compared to triplestores. Intuitively, each named table includes a subject and several fixed properties (as columns). Improvements are also possible. For example, a variant of the property table, named the property-class table, uses the <i>rdf:type</i> of subject resources to group similar sets of resources together in the same table.</p>
<p>An excellent, well-known example of a property table store is Jena, an open-source toolkit for SW programmers. The schema of the first version of Jena consisted of a statement table, a literals table, and a resources table. The statement table <i>(Subject, Predicate, ObjectURI, ObjectLiteral)</i> contained all statements and referenced the resources and literals tables for subjects, predicates, and objects. To distinguish literal objects from resource URIs, two columns were used. The literals table contained all literal values, while the <span aria-label="314" id="pg_314" role="doc-pagebreak"/>resources table contained all resource URIs in the graph. Note, however, that this design requires each query operation to execute multiple joins between the statement table, the literals table, or the resources table (in the general case).</p>
<p>The <i>Jena2 schema</i> attempted to address this issue by trading off space for time and using a <i>denormalized schema</i>, in which resource URIs and simple literal values are stored directly in the statement table. To distinguish database references from literals and URIs, column values are encoded with a prefix that indicates the type of the value. A separate literals table is only used to store literal values of which the length exceeds a certain threshold (e.g., blobs). Other design considerations similarly applied and were implemented, with the increase in database space consumption addressed by using string compression schemes. Both versions of the Jena schema described here permit multiple graphs to be stored in a single database instance, with Jena2 additionally supporting multiple statement tables in a single database to accommodate applications that need to flexibly map graphs to different tables. In this way, graphs that are often accessed together may be stored together, while graphs that are never accessed together may be stored separately.</p>
<p>Other examples of property tables include RDFMATCH, Sesame, RDFSuite, and 4store. Technically, RDFMATCH is an Oracle-based SQL table <i>function</i> that can be used to query RDF data. An advantage of this approach is that its results can be further processed by SQL’s querying capabilities and easily combined with other queries on traditional RDBMSs. For efficient querying, the function uses B-tree indexes, in addition to creating materialized join views for specialized subject-property combinations. A special module is also provided to analyze the table of RDF triples and estimate the size of various materialized views, based on which users are able to define a subset of materialized views.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec12-3-3"/><b>12.3.3 Horizontal Stores</b></h3>
<p class="noindent">Horizontal stores (also called <i>vertical partitioning</i>) represent yet another mechanism for relational representation and storage of RDF data. In this approach, the triples table is rewritten into <i>n</i> two-column tables, with <i>n</i> being the number of unique properties in the RDF graph. In each of these tables, the first column contains the subjects that define that property, and the second column contains the object values for those subjects. The subjects that do not define a particular property are simply omitted from the table for that property. Each table is sorted by subject, so that particular subjects can be located quickly, thereby enabling fast merge joins for reconstructing information about multiple properties for subsets of subjects. For a multivalued attribute, each distinct value is listed in a successive row in the table for that property. An advantage of the horizontal store is that, while property tables need to be carefully constructed so that they are wide enough (but not too wide) to independently answer queries, the algorithm for creating tables in the vertically partitioned approach is straightforward and does not need to change over time. Another advantage, especially over the property-class schema approach (where queries that do not <span aria-label="315" id="pg_315" role="doc-pagebreak"/>restrict on class tend to have many union clauses), is that because all data for a particular property is located in the same table, union clauses in queries are far less common.</p>
<p>A good implementation of the horizontal store defined here is SW-Store (the system in which vertical partitioning was first proposed), which relies on a column-oriented DBMS, C-store, to store tables as collections of columns rather than as collections of rows (as in standard row-oriented databases such as Oracle, DB2, and Postgres, where entire tuples are stored consecutively). Column-oriented databases address a key problem encountered with row-oriented databases (namely, that if only a few attributes are accessed per query, entire rows need to be read into memory from disk before the projection can occur). By storing data in columns rather than rows, the projection occurs for free, and only those columns that are relevant to a query need to be read. However, some authors have argued that storing a sparse data set, as is often the case with RDF, in multiple tables can cause problems, and have instead suggested storing a sparse data set in a single table and leaving the complexities of sparse data management to an RDBMS (usually with the addition of an interpreted storage format). That said, this is by no means the only solution to the problem.</p>
<p>It is also important to note that, just as with many of the other advanced techniques in this chapter, the book is not closed on vertical partitioning, and indeed a number of papers have tried to extend it. For example, roStore proposes an ontology-guided approach to extend horizontal stores, and it outperforms SW-Store when it is necessary to reason over property hierarchies. The intuition is to maintain a single table for each property hierarchy and to use semantic query-rewriting rules, which reason over the ontology schema of the RDF triples, to improve performance. Such tables have three columns, one for each element of an RDF triple, with the remaining tables following the pattern design of ordinary horizontal stores. This design has the consequence of reducing the number of tables for ontologies containing several property hierarchies, a common occurrence in domains like biology and medicine. Because fewer relations need to be generated and maintained than in ordinary horizontal stores like SW-Store, reducing the number of property tables has a large positive impact on the performance of queries that require joins over properties within the same property hierarchy.</p>
<p>This brief discussion of roStore illustrates a subtle point—namely, that some systems explicitly choose to optimize for certain query classes. The success of those systems is both domain- and community-specific (i.e., those systems optimizing for a query class that is important and that occurs often enough in real workloads have a greater chance of being cited and used than a system that is designed for more obscure query classes and domains). To the best of our knowledge, there is no one system (in any of the three relational types of stores) that is guaranteed to trump every other system on <i>every</i> query class, metric, or eventuality. In querying, trade-offs are necessary.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="316" id="pg_316" role="doc-pagebreak"/><a id="sec12-4"/><b>12.4 NoSQL</b></h2>
<p class="noindent">One of the key limitations (though not necessarily a disadvantage) of using triplestores and the SPARQL query language is the need to represent data in RDF (or RDF-like) format. In many cases, especially where URIs are not involved or where the data is noisy and does not always adhere to a well-defined, or stable, schema, RDF is not the desired format. One option is to use a normal relational database, but relational databases are neither well suited nor efficient when data is missing or noisy. We could even argue that a primary motivation for building KGs in the first place was to overcome relational databases’ strong and conservative constraints on the kinds of data that an RDBMS can effectively operate over.</p>
<p>Over the last decade, the burgeoning NoSQL movement has become very popular. As the name suggests, NoSQL is best expressed as both infrastructure and querying that do not conform to traditional RDBMSs that use SQL as the primary querying interface. In fact, NoSQL as a term was originally coined (in 1998, by Carlo Strozzi) for an RDBMS that did not utilize SQL as the querying interface. Today, however, the term has been appropriated, especially by big companies offering cloud and data management services (such as Amazon and Google), to represent alternative data stores that store and process huge amounts of data as they appear in their applications. Although not evident in the terminology, NoSQL has come to be closely associated, just like KGs themselves, with the web and Big Data.</p>
<p>Even in the very early days, there were various reasons why people searched for alternative solutions beyond RDBMSs for storing, accessing, and querying data. The rich feature set and the Atomicity, Consistency, Isolation, and Durability (ACID) properties implemented by RDBMSs might be necessary for some applications and use-cases, but they make it harder to scale <i>horizontally</i> (e.g., using commodity hardware). For many applications, a one-size-fits-all scenario is also not always appropriate, as it is simply more expedient to build systems based on the nature of the application and its workload. RDBMSs can also be expensive and generally follow a license-driven revenue model. Thus, costs do not diminish over time.</p>
<p>NoSQL has emerged as a <i>customizable</i> solution built around shared-nothing horizontal scaling, which is usually considered a prerequisite for any infrastructure claiming to be truly NoSQL. By “shared-nothing,” we mean the ability to replicate and partition data over many servers. This allows the infrastructure to support large numbers of simple read/write operations in unit time and meet application needs with high (often irregular) throughput. Important features of typical NoSQL solutions are noted in <a href="chapter_12.xhtml#tab12-1" id="rtab12-1">table 12.1</a>.</p>
<div class="table">
<p class="TT"><a id="tab12-1"/><span class="FIGN"><a href="#rtab12-1">Table 12.1</a>:</span> <span class="FIG">Important features of typical NoSQL solutions.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"/>
<th class="TCH"><p class="TB"><b>Use-case</b></p></th>
<th class="TCH"><p class="TB"><b>Strengths</b></p></th>
<th class="TCH"><p class="TB"><b>Limitations</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB">Key-value</p></td>
<td class="TB"><p class="TB">Objects need to be accessed via a key or an attribute.</p></td>
<td class="TB"><p class="TB">Extremely scalable, easy to partition, and has fast random access if keys can be stored in memory.</p></td>
<td class="TB"><p class="TB">Limited expressiveness, as objects can be queried only using the key; furthermore, without knowing the key, a user cannot easily query the object.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Document</p></td>
<td class="TB"><p class="TB">Data is easily structured as documents, and the structure of the data is evolving.</p></td>
|
||
<td class="TB"><p class="TB">The data motel is rich enough to store complex, even irregular data, including arrays, nested structures, and dictionaries. Secondary indices allow fast access.</p></td>
|
||
<td class="TB"><p class="TB">Expressive languages where the structure provides the semantics cannot be used easily (if at all) with such models. There is also a lack of standard application programming interfaces (APIs) and query domain-specific languages for such models.</p></td>
|
||
</tr>
|
||
<tr>
|
||
<td class="TB"><p class="TB">Graph</p></td>
|
||
<td class="TB"><p class="TB">Excellent for data with relational characteristics.</p></td>
|
||
<td class="TB"><p class="TB">Linked data sets can be queried easily, and it is also easy to map entity-relationship abstract models to this data model.</p></td>
|
||
<td class="TB"><p class="TB">Efficiency and partitioning can be a problem, especially for big graphs, due to the high volume of internode message passing. Similar to document models, there is limited support for standard APIs and/or domain-specific query languages.</p></td>
|
||
</tr>
|
||
<tr>
|
||
<td class="TB"><p class="TB">Wide-column</p></td>
|
||
<td class="TB"><p class="TB">Parallel (and batch-oriented) processing of big, typically aggregated data (such as in data warehouses).</p></td>
|
||
<td class="TB"><p class="TB">Suitable for storing large quantities of data, as it can be efficiently partitioned by both rows and columns.</p></td>
|
||
<td class="TB"><p class="TB">Very difficult to use if the schema or ontology is evolving, and arbitrary querying is supported to a limited extent due to design constraints.</p></td>
|
||
</tr>
|
||
</tbody>
|
||
</table>
|
||
</figure>
|
||
</div>
|
||
<section epub:type="division">
<h3 class="head b-head"><a id="sec12-4-1"/><b>12.4.1 Key-Value Stores</b></h3>
<p class="noindent">Perhaps because of their simplicity, key-value stores continue to rise in popularity in the NoSQL world. Simply put, a <i>key-value store</i> (or key-value database) is a simple database <span aria-label="317" id="pg_317" role="doc-pagebreak"/>that uses an associative array (such as a map or dictionary) as the fundamental data model where each key is associated with one (and only one) value in a collection. This relationship is referred to as a <i>key-value pair</i>. In each key-value pair, the key is represented by an arbitrary string (or other primitive or hashable datatype, in theory) such as a file name, URI, or hash. The value can be any kind of data, such as an image, user preference file, or document. The value is stored as a blob, requiring no upfront data model or schema definition.</p>
<p><span aria-label="318" id="pg_318" role="doc-pagebreak"/>The storage of the value as a blob removes the need to index the data to improve performance. In a naive implementation, therefore, one cannot filter or control what is returned in response to a request <i>based</i> on the value, because it is opaque. In general, key-value stores have no query language. They provide a way to store, retrieve, and update data using simple GET, PUT, and DELETE commands. The path to retrieve data is a direct request to the object in memory or on disk. The simplicity of this model makes a key-value store fast, easy to use, scalable, portable, and flexible. However, the lack of expressive querying can be a disadvantage to some applications.</p>
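<p class="noindent">As a concrete illustration, the GET/PUT/DELETE interface just described can be sketched as a minimal in-memory store in Python (the class and key names here are hypothetical; a real store would add persistence, partitioning, and replication):</p>

```python
class KeyValueStore:
    """Minimal in-memory key-value store sketch.

    Values are opaque blobs: the store never inspects them, so
    lookups are possible only via the key, as discussed in the text.
    """

    def __init__(self):
        self._data = {}  # associative array: one value per key

    def put(self, key, value):
        self._data[key] = value  # overwrites any previous value for the key

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)  # silently ignore missing keys


store = KeyValueStore()
store.put("user:42", b"...opaque blob, e.g., a serialized profile...")
print(store.get("user:42") is not None)  # direct access via the key
store.delete("user:42")
print(store.get("user:42"))  # the pair is gone; None is returned
```

<p class="noindent">Because the value is never interpreted, no query language richer than key lookup is possible in this naive design, which is exactly the expressiveness limitation noted in the text.</p>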
<p>Scalability can also be a major advantage of key-value stores, because they tend to <i>scale out</i> by implementing partitioning (storing data on more than one node), replication, and auto recovery. It can also be easier for key-value databases to <i>scale up</i> by maintaining the database in memory and minimizing the effects of ACID guarantees (e.g., durability, the guarantee that committed transactions persist somewhere) by avoiding locks and latches and relying on low-overhead server calls. There are natural limitations to these strategies, but a full discussion is beyond the scope of this chapter. For extreme scalability, a different set of solutions, such as Apache Cassandra (which is essentially an ultrascalable key-value store) or HBase, both of which we describe subsequently, should be preferred. In practice, most KGs are amenable to key-value stores, especially with the advent of large memories on server-class machines and the support of horizontal scaling by major cloud providers like Amazon Web Services. Key-value stores are also well suited to tasks such as session management at high scale, providing product recommendations, storing user preferences and profiles, ad servicing, and working effectively as a cache for heavily accessed, but rarely updated, data.</p>
<p class="TNI-H3"><b>12.4.1.1 JSON</b> Many key-value stores principally rely on the JavaScript Object Notation (JSON) format for passing and retrieving data. JSON is a lightweight data-interchange format that is meant for human readability, while still being easy for machines to parse and generate. It is built on two structures:</p>
<ul class="numbered">
<li class="NL">1. <i>A collection of name/value pairs:</i> In various programming languages like C++, Java, and Python, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array. An example would be the data structure { “name”: “A. Einstein”, “age”: 29}. An important thing to note is <i>what</i> datatypes are allowed as keys and values. The full specification (and formal definition) can be found on the <span aria-label="319" id="pg_319" role="doc-pagebreak"/>official website,<sup><a href="chapter_12.xhtml#fn1x12" id="fn1x12-bk">1</a></sup> but keys are always strings, while values can be strings, numbers (integers or fractions), booleans, arrays, objects, or null. Most important, the “value” (e.g., 29 in the snippet given here) can itself be a JSON object, which makes this a <i>recursive</i> data-interchange format that can be used to express deep nested structures, if necessary.</li>
<li class="NL">2. <i>An ordered list of values:</i> In most languages, this is realized as an array, vector, list, or sequence. For example, [“A. J. Bose”, “M. Kejriwal”] would be a valid JSON. Again, the elements of the list can themselves be JSONs.</li>
</ul>
<p>A key advantage of JSON, even beyond the fact that it is preferred by key-value NoSQL systems, is that it is also highly amenable to storage on, and retrieval from, disk using programming languages like Python, which have become exceedingly popular in an era of machine learning applications. JSON objects on file can be read as Python dictionaries using only a single line of code.</p>
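<p class="noindent">For instance, a nested JSON object can be round-tripped through disk using Python's standard <i>json</i> module (the file name and record contents here are illustrative):</p>

```python
import json

# A nested structure mixing both JSON building blocks:
# name/value pairs (objects) and ordered lists (arrays).
record = {
    "name": "A. Einstein",
    "age": 29,
    "coauthors": ["A. J. Bose", "M. Kejriwal"],
}

# Write the object to disk as JSON text.
with open("record.json", "w") as f:
    json.dump(record, f)

# Reading it back as a Python dictionary takes a single line of code.
loaded = json.load(open("record.json"))

print(loaded == record)  # the round trip preserves the structure
```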
<p>Two excellent examples of key-value NoSQL databases are MongoDB and Elasticsearch. The former is arguably the most popular NoSQL database at the time of writing (for a justification of this claim, see the section entitled “Bibliographic Notes,” at the end of this chapter). Conceptually, MongoDB is much more than a key-value store, because it can support any schema-free collection of documents. It relies on Apache Lucene, a fairly sophisticated, Java-based library exposing the indexing and search technology required for accessing many of the information retrieval (IR) facilities that are central to the hybrid querying (i.e., using a combination of text and nontext attributes) of these NoSQL databases. Similarly, Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use-cases. It is often compared with MongoDB; however, unlike MongoDB, it advertises itself more as a search engine than a NoSQL database system because of the strong focus on exposing robust search capabilities. There are also documented cases where Elasticsearch has been used to store and query KGs. However, for practitioners familiar with key-value stores, it is not a burdensome matter to transition from one to the other, and both querying mechanisms rely on an IR technology like Lucene.</p>
<p>MongoDB and Elasticsearch can also serve as “semistructured” search engines, like other key-value NoSQL engines, due to their support for speedy querying (using Lucene indices). We say “semistructured” because the search is not purely text or keyword based, and hence a pure IR methodology like a single inverted index does not apply. In fact, multiple indices are required, depending on the schema of the underlying key-value store. There is no single correct way to combine scores across multiple indices into one composite score, and doing so is more art than science.</p>
<p><span aria-label="320" id="pg_320" role="doc-pagebreak"/>Because there has been much interest in the kind of semistructured or hybrid search that Elasticsearch exposes via its query DSL, we cover it in this section both as a model and an example (in use) of such kinds of search. The Elasticsearch query DSL exposes the power of Lucene through a JSON interface, employing a combination of text and structured attributes. Intuitively, the query DSL gets much of its power from using a recursive boolean-like representation that is both intuitive (it can be drawn as a tree) and expressive. An illustrative example is provided in <a href="chapter_12.xhtml#fig12-5" id="rfig12-5">figure 12.5</a>.</p>
<div class="figure">
<figure class="IMG"><a id="fig12-5"/><img alt="" src="../images/Figure12-5.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig12-5">Figure 12.5</a>:</span> <span class="FIG">Illustrative (i.e., conceptual) example of Elasticsearch boolean tree query. Here, <i>gte</i> and <i>lte</i> stand for the symbols ≥ and ≤, respectively. The other leaf nodes are key-value pairs.</span></p></figcaption>
</figure>
</div>
<p>At the most basic level, the Elasticsearch query DSL uses JSON to define queries via two types of clauses. First, a leaf query clause looks for a particular value in a particular field by using query terms like <i>match, term,</i> and <i>range</i>. Second, compound query clauses wrap other leaf or compound queries and are used to combine multiple queries in a logical fashion (examples being the <i>bool</i> and <i>dis_max</i> queries) or to alter their behavior. <a href="chapter_12.xhtml#fig12-5">Figure 12.5</a> illustrates such an example. Intuitively, the query first filters all documents such that there is a key called “tag” that has the value “science.” This select set of documents is then evaluated in other ways (i.e., if the “user” key has value “mkejriwal” and the age of the user is between 25 and 35).</p>
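<p class="noindent">The boolean tree of figure 12.5 can be written down directly in the query DSL. The sketch below, expressed as a Python dictionary that maps one-to-one onto the JSON body sent to Elasticsearch, is one plausible encoding; the field names <i>tag</i>, <i>user</i>, and <i>age</i> follow the figure:</p>

```python
import json

# One plausible encoding of the boolean tree in figure 12.5:
# a compound "bool" clause wrapping three leaf clauses.
query = {
    "query": {
        "bool": {
            "filter": {"term": {"tag": "science"}},  # hard constraint, evaluated first
            "must": [
                {"match": {"user": "mkejriwal"}},            # leaf clause: match
                {"range": {"age": {"gte": 25, "lte": 35}}},  # leaf clause: range
            ],
        }
    }
}

# The dictionary serializes directly into the JSON search request body.
print(json.dumps(query, indent=2))
```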
<p>Queries in Elasticsearch can be executed in either a query context or a filter context. A query clause used in the query context is interpreted in an IR sense; namely, the query clause calculates a score (for each document) representing how well the document matches the query, relative to other documents. By sorting in decreasing order of nonzero scores, a ranked list of candidate documents is returned for each query. The results should be evaluated using IR metrics, some of which (e.g., MRR and NDCG) were detailed in chapter 11. In contrast, the filter context is much more coarse grained (yes/no) compared to the <span aria-label="321" id="pg_321" role="doc-pagebreak"/>more nuanced query context. Formally, a filter context query returns an unranked set of documents, and it cannot be evaluated using metrics like NDCG. It is not uncommon to embed filters into query contexts to speed up processing and provide an obvious avenue for culling document candidates. Intuitively, a filter is like a constraint (e.g., imagine that we want to retrieve books that have “Information” in their title and “Metric” in their content, but we only want books that have been published after January 1, 2017). We could insert a filter clause in the query to that effect, similar to the “tag:science” filter in the query in <a href="chapter_12.xhtml#fig12-5">figure 12.5</a>.</p>
<p>The date constraint has been embedded in a filter context, because it is a hard constraint rather than a search criterion for influencing relevance scores, in contrast with the other two criteria. Note that it is important to interpret user intent correctly here in deciding what to place in a filter context. The filter context comes closest to behaving like a reasoner, even though it does not have the capabilities of Web Ontology Language (OWL) reasoners such as those available in Protégé and other such ontology management systems. The query context offers a mechanism for IR over semistructured key-value documents.</p>
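<p class="noindent">The book-retrieval example just described might be encoded as follows, again as a Python dictionary mirroring the JSON request body; the field names <i>title</i>, <i>content</i>, and <i>publish_date</i> are assumptions about the underlying document schema:</p>

```python
books_query = {
    "query": {
        "bool": {
            # Query context: these clauses contribute to the relevance score.
            "must": [
                {"match": {"title": "Information"}},
                {"match": {"content": "Metric"}},
            ],
            # Filter context: a hard yes/no constraint that is not scored,
            # analogous to the "tag: science" filter in figure 12.5.
            "filter": {"range": {"publish_date": {"gte": "2017-01-01"}}},
        }
    }
}
```

<p class="noindent">Placing the date constraint in the filter context lets the engine cull candidates cheaply (and cache the result) before scoring the remaining documents.</p>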
<p>For interested readers, we provide a few pointers to Elasticsearch tools and tutorials in the section entitled “Software and Resources,” at the end of this chapter. At this time, Elasticsearch has gained enormously in market share and has documented uses in a variety of companies and projects. For example, Wikipedia uses it to provide suggested text for full-text search, the <i>Guardian</i> uses it to give editors current feedback about public opinion on published media through social and visitor data, Stack Overflow uses it for full-text search, geolocation queries, and surfacing related questions and answers, and GitHub queries billions of lines of code using Elasticsearch. Similarly, MongoDB has its own large community of users, with some well-known organizations using it, including large companies such as Adobe, Cisco, eBay, and AstraZeneca, as well as nontraditional technology users like the City of Chicago, which has built its real-time geospatial analytics platform on top of MongoDB.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec12-4-2"/><b>12.4.2 Graph Databases</b></h3>
<p class="noindent">Since the emergence of database management systems, there has been an ongoing debate about what the database <i>model</i> for such systems should be. The discussions thus far make it clear that there is no single correct way of doing data modeling. A similar conclusion was reached in chapter 2, which presented multiple options for <i>representing</i> KGs and argued that there was no single correct way of achieving a representation that fulfills all desiderata.</p>
<p>The parameters influencing database model development are manifold, and among the most important influences are the characteristics or structure of the <i>domain</i> to be modeled, the capabilities of a user (both from the standpoint of querying and modeling), and their willingness to accept certain assumptions and constraints on software and infrastructure. In this context, <i>graph database models</i> can be loosely defined as models where data <i>structures</i> for the schema and instances are modeled as graphs (or generalizations thereof), and data <i>manipulation</i> <span aria-label="322" id="pg_322" role="doc-pagebreak"/>is expressed by graph-oriented operations and type constructors. Graph database models used to be very popular in the 1980s and 1990s, along with models like object-oriented models, but were superseded at the time by other exotic models, such as geographical, spatial, and XML models.</p>
<p>With the advent of KGs, however, and other graphlike data sets on the web (including web and transportation, social, and biological networks), graph database models have enjoyed a resurgence while XML-like models have declined in popularity. However, we note that although KGs are a major application thrust, graph databases are not just KG-driven, but amenable to a wide range of graph-theoretic applications, such as large social networks. Just like the other NoSQL models and notions like key-value stores or hybrid search, little effort has been made to formally and explicitly define a graph database model. A rough definition was provided earlier in this discussion by deferring to the graphlike orientation of data structures and manipulation, both of which are core components of any database model and management system. Applications of graph database models tend to arise where data interconnectivity or topology is at least as important as the data itself. Clearly, KGs fulfill this desideratum.</p>
<p>Introducing graphs as a modeling tool offers some clear advantages for structured (but inherently nontabular) data. The single biggest advantage is that modeling is more natural because graphs have the advantage of being able to retain information about an entity in a single node and expose relational information via edges. However, a second advantage arises during querying, because (as we have seen with SPARQL), the body of research on graph-querying languages (as well as graph algorithms like shortest path) can be utilized compared to more ad-hoc data models. Put another way, by explicitly allowing such operations in querying, a suitable algebra (and querying DSL) designed for graph models allows users to express a query at a high level of abstraction, making querying natural as well. Last but not least, several influential implementations, such as Neo4j, have made graph databases more mainstream. Today, generic graph databases are more popular than SPARQL- or RDF-specific graph databases, such as the relational RDF stores described previously. However, there is also a lot of flux in database rankings (typically based on adoption) every year due to rapid advancements in both graph databases and the Semantic Web. We also note that these systems are not mutually exclusive; for instance, the Neo4j graph database can support the reading and writing of RDF data, and it is not very difficult (in practice) to translate SPARQL queries to the Cypher DSL supported within Neo4j.</p>
<p>The representation of entities and relations is fundamental to graph databases. We have covered many of these representational issues in chapter 2, but we review some of the critical elements here for the sake of completeness. Recall that an entity or object represents something that exists as a single and complete unit. A relation is a property or predicate that establishes a connection between two or more entities. As with KGs, relations in graph databases embody connectivity among entities.</p>
<p><span aria-label="323" id="pg_323" role="doc-pagebreak"/>Besides entities and relations, most modern graph databases support a variety of interesting artifacts that we summarize next.</p>
<p class="TNI-H3"><b>12.4.2.1 Hypernodes and Hypergraphs</b> Most modern graph databases support a kind of “nesting” through the use of <i>hypernodes</i> and <i>hypergraphs</i>. These metastructures are motivated by the fact that representing the database as a simple, flat graph (with many interconnected nodes) has the drawback of not being intuitively presentable to a user (or modeler). To address this challenge, a hypernode database is used, which consists of a set of nested graphs with intrinsic support for data abstraction and the ability to represent each real-world object as a separate database entity. Hypergraphs further extend this concept and are mathematically defined as a generalization of a graph, where an edge can join any number of vertices, rather than just two. This simple extension shows why such graphs can be powerful for modeling relational data, because each record or entity is, in many cases, more naturally thought of as a collection of properties (hyperedge), rather than a set of binary relations between the entity and each of its properties. In the RDF formalism, a hypergraph is sometimes considered a conceptually more elegant way of representing <i>n</i>-ary data than mechanisms like reification.</p>
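<p class="noindent">To make the contrast concrete, the following sketch (all node and hyperedge identifiers are invented) represents each hyperedge as a set of participating nodes, so a ternary relation becomes a single edge rather than several binary edges plus reification:</p>

```python
# A hypergraph as a mapping from hyperedge identifiers to node sets.
# The ternary "purchase" relation joins three entities at once; in a
# binary graph, the same fact would require reification or an n-ary workaround.
hypergraph = {
    "purchase_001": {"buyer:alice", "item:laptop", "store:acme"},
    "purchase_002": {"buyer:bob", "item:phone", "store:acme"},
}

def edges_containing(hg, node):
    """Return the identifiers of all hyperedges a node participates in."""
    return {name for name, nodes in hg.items() if node in nodes}

# Both purchases involve the same store, so both hyperedges are returned.
print(edges_containing(hypergraph, "store:acme"))
```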
<p class="TNI-H3"><b>12.4.2.2 Schema and Instance Graphs</b> Unlike in RDBMSs, the separation of schema and instance data is not as easy in graphs. An advantage of KGs is that, for most domains, this decomposition is easier than for generic graphs, because KGs tend to be extracted according to (and modeled on the basis of) a specified ontology. It is more difficult to maintain this distinction in domains like e-commerce and bioinformatics, as we discuss in part V, where we describe various KG ecosystems. A well-defined separation of schema and instance is a key advantage of RDBMSs, where the schema is informally defined as a set of tables (and where each table itself is represented as a set of attributes with datatypes and other metadata). With the advent of object-oriented data models, more complexity is possible, but schemas are still distinct from instances. Just like with KGs, separation in graph databases can be maintained only if two types of graphs are defined: schema graphs and instance graphs. At minimum, a schema graph defines entity types represented as nodes labeled with a type name (alternatively referred to as “concept” or “class” throughout this book). However, similar to sophisticated domain ontologies, a schema graph may contain much more than just a flat set of concepts.</p>
<p>On the other hand, an instance graph contains concrete entities represented as nodes labeled by either an entity-type name or an object identifier; primitive values represented as nodes labeled with a value from the domain of a primitive entity; and relations represented as edges labeled with the corresponding relation-name according to the schema.</p>
<p>Just like with RDBMSs, this simple model has been expanded in multiple ways over the years, such as by including support for nodes for explicit representation of tuples and sets (<i>PaMaL, GDM</i>), <i>n</i>-ary relations (<i>GOAL, GDM</i>), hypernodes (<i>Hypernode model, Simatic-XT</i>, and <i>GGL</i>), <span aria-label="324" id="pg_324" role="doc-pagebreak"/>and hypergraphs (<i>GROOVY</i>)—extensions that provide support for nested structures, as discussed earlier. A novel use-case of hypergraphs in GROOVY is to use them for defining value-functional dependencies. Similarly, the hypernode model can use nested graphs at both the schema and instance level. A database consists of a set of hypernodes defining types and their respective instances. However, although hypernodes are used in several models, there are differences in usage. For example, while <i>Simatic-XT</i> and <i>GGL</i> use hypernodes as an abstraction mechanism consisting of packaging other graphs as an encapsulated vertex, the <i>Hypernode model</i> also uses hypernodes to represent other abstractions (e.g., complex objects and relations). In a similar vein, graph models based on simple and extended graph structures are also differentiated in their support for defining nontraditional datatypes. While <i>LDM, PaMaL, GOAL</i>, and <i>GDM</i> allow the representation of complex objects by defining special constructs (tuple, set, or association nodes), hypernodes and hypergraphs are arguably more powerful, because they are flexible data structures that support the representation of arbitrarily complex objects and present an inherent ability to encapsulate information.</p>
<p>Clearly, there is more research to be done in this area, though the proliferation of systems, tools, and capabilities noted here shows that graph databases are starting to converge in their approach to modeling and querying semistructured data. Next, we briefly note a few other features, followed by a popular graph database implementation (Neo4j) that has continued to gain in database adoption and market share and even has its own intuitive DSL for querying.</p>
<p class="TNI-H3"><b>12.4.2.3 Integrity Constraints</b> Another important feature, borrowed from RDBMSs, is integrity constraints. Integrity constraints are general statements and rules defining the set of consistent database states or changes of state (or both). Previously, such constraints were seen in the context of KG ontologies specified using languages like OWL. One such important constraint is schema-instance consistency. For example, entity types and type checking constitute a powerful data-modeling and integrity checking tool, as they allow database schemas to be represented and enforced. However, checking consistency is also related to restructuring, queries, and updates of the database. In some cases, the problem can be undecidable (e.g., statically checking consistency following an edge addition in an arbitrary graph object-oriented program). Another kind of integrity constraint is schema-instance separation (i.e., the degree to which schema and instance are different objects in the database). However, while in most models there is a separation between database <i>schema</i> and <i>instance</i>, many KGs may not necessarily have this separation. Another exception arises in some hypernode models, where the lack of typing constraints on hypernodes enables changes to the database to be dynamic (and makes the checking of this integrity constraint inapplicable). Yet another special kind of integrity constraint is data redundancy, but as we saw in chapter 8, on instance matching, this kind <span aria-label="325" id="pg_325" role="doc-pagebreak"/>of constraint is much harder to enforce syntactically and requires advanced artificial intelligence (AI) solutions.</p>
<p class="TNI-H3"><b>12.4.2.4 Example: Neo4j</b> Neo4j is an open-source, NoSQL, native graph database that has been in development since 2003, but was only made publicly available starting in 2007. Just like other similar software in this space (motivated by for-profit enterprise needs, but without sacrificing the benefits of open-source, community-driven development), Neo4j has both a Community Edition and Enterprise Edition.</p>
<p>Neo4j relies primarily on the <i>property graph</i> model, where data is organized as nodes, relationships, and properties (data stored on the nodes or relationships). Nodes are the entities in the graph and can hold any number of attributes (key-value pairs) called <i>properties</i>. Nodes can be tagged with labels, representing their different roles in the domain. Node labels may also serve to attach metadata (including index or constraint information) to certain nodes. On the other hand, <i>relationships</i> provide directed, named, semantically relevant connections between two node entities. Relationships always have a direction and type; importantly, like nodes, relationships can have properties. This makes artifacts like reification much easier and more seamless in the property graph model and Neo4j than in traditional RDF settings. In practice, relationships tend to have quantitative properties, such as weights, costs, ratings, and time intervals. Due to the efficient way that relationships are stored, two nodes can share any number or type of relationship without sacrificing performance. Although they are stored in a specific direction, relationships can always be navigated efficiently in either direction.</p>
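<p class="noindent">A minimal sketch of the property graph model follows (all labels, names, and property keys here are illustrative, not Neo4j's actual storage format); note that the relationship itself carries properties, and traversal works in both directions:</p>

```python
# Nodes hold labels (their roles in the domain) and key-value properties.
nodes = {
    1: {"labels": {"Person"}, "props": {"name": "Ada"}},
    2: {"labels": {"Company"}, "props": {"name": "Initech"}},
}

# Relationships are directed and typed and, unlike plain RDF triples,
# carry their own properties, so no reification is needed for edge metadata.
relationships = [
    (1, "WORKS_FOR", 2, {"since": 2017, "rating": 4.5}),
]

def neighbors(node_id):
    """Traverse relationships in either direction, as the text describes."""
    out = [t for (s, _, t, _) in relationships if s == node_id]
    inc = [s for (s, _, t, _) in relationships if t == node_id]
    return out + inc

print(neighbors(2))  # traversal works against the stored direction too
```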
<p>Neo4j is referred to as a <i>native</i> graph database because it efficiently implements the property graph model down to the storage level. Other important features include <i>Cypher</i>, which is a declarative query language similar to SQL but optimized for graphs. The query language started out as being specific to Neo4j but has since been adopted by other databases like SAP HANA Graph and Redis (via the openCypher project); <i>drivers</i> for popular programming languages like Python and Java; and <i>ACID transaction compliance</i>, which makes it amenable for use in production scenarios. Next, we briefly describe the key features and syntax of the Cypher query language due to its importance and continuing adoption in industrial graph databases like the examples mentioned previously.</p>
<p><b>Cypher DSL.</b> Although a formal query language, Cypher was designed to be human-readable as well, with constructs based on English prose. Nodes in Cypher take the form (<i>x</i>:<i>label</i>), where <i>x</i> is a variable and <i>label</i> typically represents the type of the variable. For example, if the variable was meant to denote a company, the label would be <i>Company</i>, which helps Neo4j determine how to group nodes together. Note, however, that labels do not necessarily have to be types; they can be another way of grouping nodes. It is highly recommended to declare labels for nodes, because Cypher uses such labels for filtering during query execution, similar to how SQL uses table names. Relationships in Cypher are represented by --&gt; or &lt;--, depending on the directionality. Additional properties <span aria-label="326" id="pg_326" role="doc-pagebreak"/>of the relationship, such as the relationship type, can be placed in square brackets inside the arrow. An example bringing these syntactic notions together is illustrated here:</p>
<p class="noindent"> </p>
<p class="noindent">CREATE (bob:Person {name: "Bob"})<br/>CREATE (james:Person {name: "James"})<br/>CREATE (stephen:Person {name: "Stephen"})<br/>CREATE (sarah:Person {name: "Sarah"})<br/>CREATE (google:Company {name: "Google"})<br/>CREATE (tech:Sector {name: "Technology"})<br/>CREATE (bob)-[:WORKSIN]-&gt;(tech)-[:COMPANY]-&gt;(google)<br/>CREATE (stephen)-[:EMPLOYEDBY]-&gt;(google)<br/>CREATE (james)-[:FRIEND]-&gt;(stephen)-[:SIBLING]-&gt;(sarah)</p>
<p class="noindent"> </p>
<p>A graph representation of the snippet here is illustrated in <a href="chapter_12.xhtml#fig12-6" id="rfig12-6">figure 12.6</a>. Note that, because of edge or relationship directionality, a query such as “MATCH (p:Person)&lt;-[:WORKSIN]-(tech:Sector)” will not yield results given the declarations here. However, a query that leaves the direction open can also be specified as “MATCH (p:Person)-[:WORKSIN]-(tech:Sector),” which would yield a positive match.</p>
<div class="figure">
<figure class="IMG"><a id="fig12-6"/><img alt="" src="../images/Figure12-6.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig12-6">Figure 12.6</a>:</span> <span class="FIG">Graph representation of the Cypher snippet described in the text.</span></p></figcaption>
</figure>
</div>
<p>Just as we did with nodes, if we want to refer to a relationship later in a query, we can give it a variable like <i>[r]</i> or <i>[rel]</i>. We can also use longer, more expressive variable names like <i>[likes]</i> or <i>[knows]</i>. If we do not need to reference the relationship later, we can specify <span aria-label="327" id="pg_327" role="doc-pagebreak"/>an anonymous relationship using two dashes: --, --&gt;, or &lt;--.<sup><a href="chapter_12.xhtml#fn2x12" id="fn2x12-bk">2</a></sup> As an example, we could use either -[rel]-&gt; or -[rel:LIKES]-&gt; and call the <i>[rel]</i> variable later in a query to reference the relationship and its details. Note the difference between -[LIKES]-&gt; and -[:LIKES]-&gt;; in the former case, LIKES represents a variable and not a relationship type, which means that Cypher will end up searching across all types of relationships when the query is issued.</p>
<p>Graphs declared using Cypher do not just support nodes, relationships, and labels. A last important feature to note here is <i>properties</i>. Note the important difference in nomenclature; while properties and relationships were considered synonymous in chapter 2 and other chapters of this book where we paid considerable attention to the RDF data model, they are different in Neo4j. In Neo4j, properties are name-value pairs that provide additional details to nodes and relationships. When applied to nodes, they are akin to datatype properties; however, when declaring properties for relationships in RDF, one has to rely on complex schemes like reification.</p>
<p>Properties in Cypher can be declared using curly braces. For example, a node property can be declared using a statement like (<i>p</i>: <i>Person</i>{<i>name</i>: “<i>Mary</i>”}), while a relationship property can be declared using a statement like − [<i>rel</i>: <i>WORKS</i>_<i>FOR</i>{<i>start</i>_<i>date</i>: 2017}] <i>→</i>. Properties can have values with multiple datatypes; Cypher currently supports numbers (including integers and floats), strings, booleans, spatial types like <i>Point</i>, and temporal points like <i>Date</i> and <i>Time</i>.</p>
<p>While nodes and relationships make up the building blocks for graph patterns, these building blocks can be composed together to express simple or complex patterns. Patterns are the most powerful capability of Cypher graphs and can be written as a continuous path or separated into smaller patterns, tied together with commas.</p>
<p>An example pattern in Cypher is the statement “(:Mary)-[:LIKES]->(:Blockchain).” This pattern expresses the natural-language statement “Mary likes Blockchain.” However, the statement, by itself, does not tell us whether we want to find this existing pattern in our current database or insert it as a new pattern. To decide either way, we need to use qualifying keywords like MATCH and RETURN, similar to equivalent keywords in languages like SQL. MATCH searches for an existing node, relationship, label, property, or pattern in the database, and it is like SELECT in SQL. We saw some examples with MATCH earlier. In contrast, RETURN specifies what values or results we want to return from a Cypher query (e.g., whether Cypher should return nodes, relationships, node and relationship properties, or patterns in the query results). In particular, the node and relationship variables we discussed earlier become important when using RETURN. To bring back nodes, relationships, properties, or patterns, variables need to have been specified in a MATCH clause for the data we want retrieved.</p>
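<p>Putting these pieces together, a complete query might look as follows. This is a sketch that assumes, for illustration, that Mary and Blockchain are stored as <i>name</i> property values on hypothetical <i>Person</i> and <i>Topic</i> nodes rather than as labels; it also shows two smaller patterns tied together with a comma:</p>

```cypher
// Find everything Mary likes and return the matched data.
MATCH (p:Person {name: "Mary"}),
      (p)-[rel:LIKES]->(t:Topic)
RETURN p.name, type(rel), t.name
```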
</section>
<section epub:type="division">
<h3 class="head b-head"><span aria-label="328" id="pg_328" role="doc-pagebreak"/><a id="sec12-4-3"/><b>12.4.3 NoSQL Databases with Extreme Scalability</b></h3>
<p class="noindent">The NoSQL options covered so far in this chapter have demonstrable uses in KG applications, with Neo4j and Elasticsearch already employed in several such projects. In this section, we complete our treatment of NoSQL by covering two other systems, Cassandra and HBase, which are generally used across enterprises for their ability to handle extreme quantities of data. These databases are more a product of (and response to) the needs of Big Data than of KGs per se. However, it is not unreasonable to suppose, as KGs become ever larger, especially in enterprise and medicine, that systems like these (or systems inspired by them) will become more popular options for processing KGs. As such, any discussion of NoSQL would not be complete without a brief treatment of them. We also note that these two examples are by no means the only emblematic Big Data NoSQL products, although they are two of the most popular released under nonproprietary (in both cases, Apache) licenses. Brief notes on rival systems are provided in the “Bibliographic Notes” section.</p>
<p class="TNI-H3"><b>12.4.3.1 Apache Cassandra</b> Apache Cassandra is a free, open-source, distributed, wide-column store DBMS designed to handle large amounts of data across many commodity servers. Cassandra supports computing clusters spanning multiple data centers, making it extremely scalable. Fundamentally, Cassandra was designed as a Big Data enterprise solution, with support for such features as <i>asynchronous masterless replication</i> (allowing low-latency operations for all clients); <i>elasticity</i> (read and write throughput increase linearly as new machines are added to the cluster, with no interruption or restart); and <i>fault tolerance</i> (with support for replication across data centers, and failed-node replacement with no cluster downtime).</p>
<p>Fittingly, the origins of Cassandra were in enterprise as well (namely, Facebook), in response to a set of enterprise-specific needs (the Facebook inbox search feature). It was released as open source on Google Code in 2008, entered the Apache Incubator in 2009, and graduated to a top-level Apache project by 2010. Over time, multiple releases were engineered, each adding more sophisticated features. For example, a 2010 release added support for integrated caching and MapReduce, and a 2011 release added the Cassandra Query Language (CQL).</p>
<p>An instance of Cassandra typically consists of only one table, which represents a distributed multidimensional map indexed by a key. The values in the table are addressed by a triplet <i>(row-key, column-key, time-stamp)</i>, where the <i>row-key</i> identifies rows by a string of arbitrary length, and a <i>column-key</i> identifies a column in a row and is further qualified (e.g., as part of a column family). A consistent hashing function, which preserves the order of row-keys, is used to partition and distribute data among the nodes. The order preservation property of the hash function is important to support range scans over the data of a table. Consistent hashing is also used by other similar Big Data NoSQL enterprise systems like <span aria-label="329" id="pg_329" role="doc-pagebreak"/>Amazon Dynamo; however, Cassandra handles it in a fundamentally different way from Dynamo.</p>
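<p>To make the order-preservation property concrete, the following toy Python sketch (emphatically not Cassandra's actual implementation; all names are illustrative) shows why an order-preserving partitioner lets a range scan over row-keys touch only a contiguous run of nodes on the ring:</p>

```python
import bisect

class OrderPreservingRing:
    """Toy order-preserving partitioner: each node owns a contiguous
    range of row-keys, bounded above by its token."""

    def __init__(self, node_tokens):
        # node_tokens: {node_name: token}, where a token is a row-key
        # string marking the upper bound of that node's key range.
        self.tokens = sorted(node_tokens.items(), key=lambda kv: kv[1])
        self.keys = [t for _, t in self.tokens]

    def node_for(self, row_key):
        # First node whose token is >= the row-key; wrap around to the
        # first node if the key exceeds every token.
        i = bisect.bisect_left(self.keys, row_key)
        return self.tokens[i % len(self.tokens)][0]

    def nodes_for_range(self, lo, hi):
        # Because key order is preserved, a range scan over [lo, hi]
        # maps to a contiguous run of nodes, not a scatter across the
        # whole cluster (as a random hash would produce).
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_left(self.keys, hi)
        return [self.tokens[k % len(self.tokens)][0] for k in range(i, j + 1)]
```

For example, with three nodes whose tokens are "g", "n", and "z", a scan from "apple" to "mango" only needs to contact the first two nodes.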
<p>The single biggest requirement that Cassandra fulfills, especially in enterprise applications, is that it is durable because the cluster as a whole keeps operating even in the face of multiple failures, including potentially catastrophic ones (such as when an entire data center loses power). Cassandra has no single points of failure and no network bottlenecks, as every node in a Cassandra cluster is identical. However, an arguably weaker feature of Cassandra is consistency.<sup><a href="chapter_12.xhtml#fn3x12" id="fn3x12-bk">3</a></sup> Finally, Cassandra also offers support for Apache Pig and Hive, making it suitable for deployment in a Big Data ecosystem.</p>
<p>While we are not aware of documented cases in enterprise that use Cassandra for managing KG data, its wide-column properties, as well as its support for timestamps and its representation of tabular data as a multidimensional map indexed by a key, make it particularly amenable to KGs with subjects that have many properties and that have associated timestamps or are otherwise streaming. For example, a KG capturing supply chain and logistics across a large company may need an architecture like Cassandra if availability and partition tolerance are important. As we start seeing KGs and Big Data intersect more in the future, such architectures are predicted to become more commonplace, and they may even inspire the birth of new architectures.</p>
<p class="TNI-H3"><b>12.4.3.2 Apache HBase</b> Apache HBase is a distributed, scalable, NoSQL, Big Data store that runs on a Hadoop cluster. At its core, HBase has the ability to host very large (web-scale) tables and provides real-time and random read/write access to Hadoop data by using a wide-column data store modeled after Google Bigtable, the database interface to the proprietary Google File System. In essence, HBase provides Bigtable-like capabilities on top of Hadoop-compatible file systems (e.g., MapR or HDFS) by borrowing Bigtable features like compression, in-memory operation, and bloom filters on a per-column basis.</p>
<p>As a NoSQL database, HBase is significantly differentiated from traditional RDBMSs by being designed to scale across a cluster. Basically, HBase groups rows into regions that define how table data is split over multiple nodes in a cluster. If a region gets too large, it is automatically split to share the load across more servers. However, even though HBase is NoSQL, it still stores data in a tablelike format, with the ability to group columns into column families. Such groupings allow physical distribution of row values on various cluster nodes. Because of this design element, tables with billions of rows and millions of columns can be processed and stored by HBase. Furthermore, data stored in HBase does not need to fit into a rigid schema as with an RDBMS, making it ideal for storing unstructured or semistructured data.</p>
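<p>The region-splitting idea can be sketched in a few lines of Python (a toy model with hypothetical names and a deliberately tiny row-count threshold; real HBase splits regions by byte size and assigns them to region servers):</p>

```python
MAX_REGION_ROWS = 4  # absurdly small threshold, for illustration only

class Region:
    """A contiguous row-key range, held as a row-key -> value map."""
    def __init__(self):
        self.rows = {}

class Table:
    """Rows are grouped into regions; a region that grows too large is
    split in half so its load can be spread across more servers."""

    def __init__(self):
        self.regions = [Region()]  # one region covers all keys at first

    def _region_for(self, row_key):
        # Regions are kept sorted by smallest row-key; pick the last
        # region whose first key is <= row_key (or the empty first one).
        candidates = [r for r in self.regions
                      if not r.rows or min(r.rows) <= row_key]
        return candidates[-1] if candidates else self.regions[0]

    def put(self, row_key, value):
        region = self._region_for(row_key)
        region.rows[row_key] = value
        if len(region.rows) > MAX_REGION_ROWS:
            self._split(region)

    def _split(self, region):
        # Move the upper half of the keys into a fresh region.
        keys = sorted(region.rows)
        new_region = Region()
        for k in keys[len(keys) // 2:]:
            new_region.rows[k] = region.rows.pop(k)
        self.regions.append(new_region)
        self.regions.sort(key=lambda r: min(r.rows))
```

After five inserts, the single initial region splits into two, each covering a contiguous slice of the row-key space.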
<p><span aria-label="330" id="pg_330" role="doc-pagebreak"/>HBase is architected to have <i>strongly</i> consistent reads and writes, which makes it different from many other NoSQL databases, including Cassandra, that are only <i>eventually</i> consistent. Once a write has been performed, all read requests for that data will return the same value. HBase tables are also replicated for failover. These features make HBase amenable to a number of applications. For example, HBase is used in the medical field for storing genome sequences and running MapReduce on them. Other applications include e-commerce, sports, and web analytics. At the present time, Facebook is known to use HBase to store real-time messages, and Mozilla is using HBase to store all crash data. HBase has also been used by both Yahoo! and Twitter, the former for storing document fingerprints to detect near-duplicates, and the latter for providing a distributed, read/write backup of transactional tables in its production backend. As another example, Infolinks uses HBase to process advertisement selection and user events, as well as to generate feedback for its production systems. It remains to be seen, however, whether any applications that strongly rely on KGs also use HBase. In general, the relational structure of KGs makes it more difficult to use a MapReduce-based system; however, as KGs in enterprise grow in size, and as quality and consistency requirements also become more demanding, HBase may be a preferred system for storing and processing such KGs, as it is one of the few popular NoSQL options offering strong consistency.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec12-5"/><b>12.5 Concluding Notes</b></h2>
<p class="noindent">In this chapter, we provided a detailed overview of how to query KGs. The focus was on structured queries (i.e., sets of triple patterns) rather than natural-language questions that human beings are typically interested in asking, but that are far less precise or reliable for critical applications like business intelligence, bioinformatics, and logistics. Furthermore, KGs, being a combination of natural-language and structured data, are amenable to graph-theoretic querying architectures, and with some adjustments, such as query reformulation, such systems can access the KG in very robust and expressive ways. NoSQL has become a popular mode of access and storage in recent years, owing to deep support from enterprise, release under open-source Apache licenses (which has yielded a flourishing community), and the inability of ordinary RDBMSs to handle many of the challenges of so-called new age data like massive tables and heterogeneous graphs. Research in this area continues to expand, although winners and losers are already starting to emerge. We expect the future to see a deeper convergence between KG and graph-theoretic architectures, ordinary RDBMSs, and NoSQL systems that are predicated on being extremely scalable.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec12-6"/><b>12.6 Software and Resources</b></h2>
<p class="noindent">We mentioned several systems in this chapter, many of which were inspired by, or are heavily used in, industry applications. Here, we provide a concise list of links to some of <span aria-label="331" id="pg_331" role="doc-pagebreak"/>the main resources discussed in this chapter. On occasion, it may be that a link is either broken or has been superseded by a newer version; hence, it is wise to do a fresh search if using any of these in a production environment.</p>
<ul class="numbered">
<li class="NL">1. <b>SPARQL:</b> There are numerous tutorials on SPARQL on the web, but a good official resource for SPARQL is the W3C page: <a href="https://www.w3.org/TR/rdf-sparql-query/">https://<wbr/>www<wbr/>.w3<wbr/>.org<wbr/>/TR<wbr/>/rdf<wbr/>-sparql<wbr/>-query<wbr/>/</a>.</li>
<li class="NL">2. <b>Amazon Neptune:</b> We mentioned a number of RDF triplestores in the first part of the chapter. For a particular system, such as RDF-3X, a reimplementation may be necessary and readers should refer to the original source (see the “Bibliographic Notes” section for a summary of the material that we relied on when writing this chapter). However, Amazon Neptune offers services in the cloud for processing and querying RDF data. More details can be found at <a href="https://aws.amazon.com/blogs/aws/amazon-neptune-a-fully-managed-graph-database-service/">https://<wbr/>aws<wbr/>.amazon<wbr/>.com<wbr/>/blogs<wbr/>/aws<wbr/>/amazon<wbr/>-neptune<wbr/>-a<wbr/>-fully<wbr/>-managed<wbr/>-graph<wbr/>-database<wbr/>-service<wbr/>/</a>.</li>
<li class="NL">3. <b>JSON:</b> The official website of JSON (<a href="https://www.json.org/json-en.html">https://<wbr/>www<wbr/>.json<wbr/>.org<wbr/>/json<wbr/>-en<wbr/>.html</a>) provides both an introduction and a formal standard of the JSON format.</li>
<li class="NL">4. <b>Apache Cassandra:</b> The official website of Apache Cassandra (<a href="http://cassandra.apache.org/">http://<wbr/>cassandra<wbr/>.apache<wbr/>.org<wbr/>/</a>) provides many resources, including download links, a blog, supported features, and extensive documentation.</li>
<li class="NL">5. <b>Apache HBase:</b> Similar to the Cassandra website, the official HBase website (<a href="https://hbase.apache.org/">https://<wbr/>hbase<wbr/>.apache<wbr/>.org<wbr/>/</a>) is also well documented and provides many resources, including download links, features, and news.</li>
<li class="NL">6. <b>Apache Hadoop:</b> Hadoop is one of the classic Big Data systems and has many resources devoted to it on the web. The official website is <a href="https://hadoop.apache.org/">https://<wbr/>hadoop<wbr/>.apache<wbr/>.org<wbr/>/</a>, but tutorials can be found in several other websites, including <a href="https://www.bmc.com/blogs/hadoop-introduction/">https://<wbr/>www<wbr/>.bmc<wbr/>.com<wbr/>/blogs<wbr/>/hadoop<wbr/>-introduction<wbr/>/</a> and <a href="https://www.edureka.co/blog/hadoop-tutorial/">https://<wbr/>www<wbr/>.edureka<wbr/>.co<wbr/>/blog<wbr/>/hadoop<wbr/>-tutorial<wbr/>/</a>.</li>
<li class="NL">7. <b>MongoDB:</b> The official MongoDB website (<a href="https://www.mongodb.com/">https://<wbr/>www<wbr/>.mongodb<wbr/>.com<wbr/>/</a>) provides many pointers, including documentation and download links, to setting up and using MongoDB.</li>
<li class="NL">8. <b>Elasticsearch:</b> The official website of Elasticsearch (<a href="https://www.elastic.co/">https://<wbr/>www<wbr/>.elastic<wbr/>.co<wbr/>/</a>) is similar to that for MongoDB, and contains all the resources and pointers that users are likely to need to download, and get started with, a free version. Elasticsearch events and conferences are also actively promoted on the website, some of which (e.g., ElasticON) are global affairs that attract heavy participation from the Elasticsearch developer community.</li>
<li class="NL">9. <b>Neo4j:</b> The official Neo4j website (<a href="https://neo4j.com/">https://<wbr/>neo4j<wbr/>.com<wbr/>/</a>) provides much of the information that is needed to download the system and get started, including free Ebooks, download links, and a “Neo4j Sandbox” that enables interested users to use Neo4j without downloading it first.</li>
</ul>
<p class="noindent"><span aria-label="332" id="pg_332" role="doc-pagebreak"/>Note that, while many of these services are available in the cloud, we only cited Amazon Neptune as an example of this. Furthermore, one is not restricted to Amazon for accessing these services on the cloud; other providers include Microsoft Azure (<a href="https://azure.microsoft.com/en-us/">https://<wbr/>azure<wbr/>.microsoft<wbr/>.com<wbr/>/en<wbr/>-us<wbr/>/</a>) and Google Cloud (<a href="https://cloud.google.com/">https://<wbr/>cloud<wbr/>.google<wbr/>.com<wbr/>/</a>). For learning more about SPARQL queries and trying some exercises to gain more proficiency in SPARQL, we recommend the website <a href="https://www.w3.org/2009/Talks/0615-qbe/">https://<wbr/>www<wbr/>.w3<wbr/>.org<wbr/>/2009<wbr/>/Talks<wbr/>/0615<wbr/>-qbe<wbr/>/</a>.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec12-7"/><b>12.7 Bibliographic Notes</b></h2>
<p class="noindent">A large part of this chapter was devoted to SPARQL. There are many resources and references for SPARQL, an important one of which is the W3C recommendation (<a href="https://www.w3.org/TR/rdf-sparql-query/">https://<wbr/>www<wbr/>.w3<wbr/>.org<wbr/>/TR<wbr/>/rdf<wbr/>-sparql<wbr/>-query<wbr/>/</a>), mentioned in the previous section. Because the document can be superseded by other recommendations, it is always a good idea, however, to do a fresh search, especially in the W3C technical reports index, available at <a href="http://www.w3.org/TR/">http://<wbr/>www<wbr/>.w3<wbr/>.org<wbr/>/TR<wbr/>/</a>.</p>
<p>Academic references to SPARQL are numerous. A relational algebra for SPARQL was provided by Cyganiak (2005), and the semantics and complexity were studied by Pérez et al. (2006). Other important references, including benchmarks, include Barbieri et al. (2009), Bizer and Schultz (2009), Sirin and Parsia (2007), Quilitz and Leser (2008), Schmidt et al. (2009), and Huang et al. (2011). Some of the studies, such as Huang et al. (2011), are specifically focused on scalability. The material on relational processing of RDF stores, especially vertical and horizontal stores, in this chapter is largely based on the synthesis by Sakr and Al-Naymat (2010). It has been cited by many newer papers since; for instance, see Sun et al. (2015), Abdelaziz et al. (2017), and Schätzle et al. (2016). It is worthwhile to note that, while XML is not as popular as it once was, some recent papers have been published on XML querying as well. A relatively recent survey on structural XML query processing is Bača et al. (2017).</p>
<p>The NoSQL movement in non-SW communities has been gaining steam since it achieved mainstream adoption about a decade ago. Some good syntheses of NoSQL may be found in Hecht and Jablonski (2011), Strauch and Kriha (2011), Leavitt (2010), Cattell (2011), and Han et al. (2011). These surveys are dated, however; for a much more recent treatment, see Davoudian et al. (2018). Graph databases have been surveyed by Angles and Gutierrez (2008), and much of the established material on graph databases in this chapter was based on that survey. More recent work by Reutter et al. (2017) covers executing regular queries on graph databases; Heidari et al. (2018), scalable graph processing frameworks; and Angles et al. (2016), foundations of modern graph query languages [an alternative citation is Angles et al. (2017)]. Modern work on graph querying and big graph data appears in Angles et al. (2018) and Junghanns et al. (2017), respectively. Good references on Cypher, which we used as an example in the chapter, and on Neo4j generally, are Francis et al. (2018) and Miller (2013).</p>
<p><span aria-label="333" id="pg_333" role="doc-pagebreak"/>Cassandra and HBase, described as NoSQL databases for extremely scalable situations, are further detailed by Cassandra (2014), Lakshman and Malik (2010), Vora (2011), George (2011), and Khetrapal and Ganesh (2006). A definitive guide on Cassandra was provided by Hewitt (2010). Some of these sources are more industry-oriented and applied than academic in discourse. To understand how RDF stores based on HBase and MapReduce could be realized, Sun and Jin (2010) is a good starting point.</p>
<p>We also commented in this chapter on the relative popularity of document-based NoSQL stores like MongoDB and Elasticsearch. Not many academic studies measuring popularity exist, but the latest rankings can be obtained from some reputable sources on the web that industry often looks to, including db-engines (<a href="https://db-engines.com/en/ranking">https://<wbr/>db<wbr/>-engines<wbr/>.com<wbr/>/en<wbr/>/ranking</a>) and ScaleGrid (<a href="https://scalegrid.io/">https://<wbr/>scalegrid<wbr/>.io<wbr/>/</a>). According to the most recent data, MongoDB ranks highly, right after traditional SQL RDBMSs like Oracle and MySQL. There have been some papers comparing databases like MongoDB to both traditional database systems (like Oracle) and NoSQL databases like Cassandra. We refer interested readers to Győrödi et al. (2015), Abramova and Bernardino (2013), Parker et al. (2013), and Boicea et al. (2012) on the subject. Generally good sources to learn about MongoDB and Elasticsearch, other than the official web documentation and tutorials, are Chodorow (2013), Banker (2011), Gormley and Tong (2015), Paro (2015), and Kuc and Rogozinski (2013).</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec12-8"/><b>12.8 Exercises</b></h2>
<ul class="numbered">
<li class="NL-AL">1. (a) Consider again the query and small KG in the “Subqueries” section. What would the set of results be if instead the executed query was the one shown here?</li>
</ul>
<figure class="IMG"><img alt="" src="../images/pg333-1.png" width="450"/>
</figure>
<p class="AL">(b) What’s misleading about the set of results that are obtained now?</p>
<ul class="numbered-ntb">
<li class="NL">2. Would the statements <i>:q rdfs:domain :d</i> and <i>:x :q :y</i> imply <i>:x rdf:type :d</i>?</li>
<li class="NL">3. Complete the SPARQL query shown on the next page (by filling in the blanks). The query retrieves the URIs of actors who have starred in more than 10 movies, and the number of movies they have starred in. Your query should also retrieve the actor’s name, if it exists in the graph.</li>
<li class="NL">4. Would the statements <i>:w rdfs:subPropertyOf :p</i> and <i>:a :p :b</i> imply <i>:a :w :b</i>?</li>
</ul>
<figure class="IMG"><img alt="" src="../images/pg334-1.png" width="450"/>
</figure>
<ul class="numbered-ntb">
<li class="NL">5. <span aria-label="334" id="pg_334" role="doc-pagebreak"/>Translate the two OWL axioms shown here to English.</li>
</ul>
<figure class="IMG"><img alt="" src="../images/pg334-2.png" width="450"/>
</figure>
<ul class="numbered-ntb">
<li class="NL">6. Is the query in <a href="chapter_12.xhtml#fig12-2">figure 12.2</a> a valid SPARQL query? If not, make a small set of changes (using assumptions where appropriate) to make it a potentially valid query.</li>
<li class="NL">7. Consider the data shown next for both (a) and (b).</li>
</ul>
<figure class="IMG"><img alt="" src="../images/pg335-1.png" width="450"/>
</figure>
<p class="AL">(a) <span aria-label="335" id="pg_335" role="doc-pagebreak"/>Assuming that <i>John</i> is a <i>LazyStudent</i>, what is the <i>minimal</i> set of assertions you need to remove to make the knowledge base (KB) consistent?</p>
<p class="AL">(b) Assuming that <i>Peter</i> is a <i>HardWorkingStudent</i>, what is the <i>minimal</i> set of assertions you need to remove to make the KB consistent?</p>
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_12.xhtml#fn1x12-bk" id="fn1x12">1</a></sup> <a href="https://www.json.org/json-en.html">https://<wbr/>www<wbr/>.json<wbr/>.org<wbr/>/json<wbr/>-en<wbr/>.html</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_12.xhtml#fn2x12-bk" id="fn2x12">2</a></sup> Of these, the first refers to an undirected relationship, while the last two are directed.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_12.xhtml#fn3x12-bk" id="fn3x12">3</a></sup> In fact, Cassandra is typically classified as an “availability and partition tolerance (AP)” system, because availability and partition tolerance are generally considered more important than consistency in Cassandra.</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html> |