<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch15" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch15"><span aria-label="391" id="pg_391" role="doc-pagebreak"/>15</h1>
<h1 class="chapter-title"><b>Enterprise and Government</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b> Enterprises, nonprofits, and governments are important drivers of the modern economy. Although graphs have been prevalent in artificial intelligence (AI) for a long time, one could largely attribute the modern renaissance of the term <i>knowledge graphs</i> (KGs) to the Google Knowledge Graph, which was first advertised through a blog post in the early 2010s and has since become a paradigmatic application of web-scale KG construction and use. Since then, KGs have been taken up by all manner of companies and organizations, including start-ups and nontech organizations that have been traditionally conservative about using such technology. Governments have also started publishing much of their data, a lot of which is structured, as open data. Although these data sets do not typically constitute, in their raw form, a KG, they are rife with entities, relations, and events, and with the tools described in part II of this book, can be used to construct domain-specific KGs for applications like city planning, public transportation, and health informatics. In this chapter, we provide a brief overview of KG adoption in both enterprise and government.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec15-1"/><b>15.1 Introduction</b></h2>
<p class="noindent">Because of the Big Data movement that started almost a decade ago, as well as the rapid rise of useful information on the web, KGs have had a big impact on industry. Recent pioneering work on KGs could arguably be traced to a blog post by a Google vice president at the start of the 2010s, which introduced the Google Knowledge Graph as revamping the search experience with “things” rather than “strings” as first-class citizens. Since that post, KGs have become increasingly popular in enterprise, and this has always served as a strong motivation for the academic KG research community. Informally, it is well known that many of the major companies have been building their own KGs, which are often domain-specific. Smaller companies, which do not have access to either the same kind of data or expertise as the bigger companies, have also made strides in applying KGs, sometimes by judiciously leveraging open-source packages and open-world KGs like DBpedia and Wikidata. Another manifestation of KGs in enterprise is the rapid onset of Schema.org publishing, especially when the entity (such as a movie or museum) needs to be prominently and accurately displayed in search engine results. As we described briefly in the previous chapter, Schema.org (or microdata) is an embedded knowledge fragment <span aria-label="392" id="pg_392" role="doc-pagebreak"/>within a webpage; however, these fragments describe entities and their properties using established ontologies. The full set of published microdata on the web resembles a giant KG distributed across websites and companies with many common entities (such as the listing of the same movie by two different theater companies) that have to be indirectly resolved by search engines or aggregators using techniques like instance matching (IM) in response to a search (“query”). 
For this reason, it is fruitful to think about Schema.org as a massive, decentralized application of KG technology and concepts by a multitude of companies and service providers. Another example of KGs in enterprise is attribute-rich social networks, such as those used by LinkedIn and Facebook, which tend to be hybrids between classic networks (where there is no ontology and relationships are untyped) and rich KGs with hundreds, even thousands, of relations and concepts. Yet another example, with increasing relevance, is the product graph used by e-commerce giants like Amazon. The fact that KGs are especially amenable to novel machine learning algorithms like tensor networks and representation learning, compared to standard two-dimensional (2D) tables, makes them even more attractive to companies looking to use and integrate advanced AI into their workflows.</p>
<p>Governments are one step removed from KGs, because they do not usually have the resources to construct full-fledged KGs, nor is there an urgent case for doing so. However, Open Government Data (OGD) has become much more commonplace, as has a focus on using more data-driven methodologies for driving efficiency and service quality improvements. Much of this data is open source, and some government data sets resemble KGs in form and content, if not in name or representation. In the hands of a moderately experienced practitioner, important subsets of OGD can be queried and analyzed just like the KGs we studied in chapter 14. In any case, we can always draw on the KG construction techniques described in part II to extract a KG from tables, documents, and government sources (like city webpages), ontologized according to our needs and preferences. For these reasons, we describe OGD and its possibilities in some detail in the second half of this chapter, even though it is not a true KG ecosystem.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec15-2"/><b>15.2 Enterprise</b></h2>
<p class="noindent">Graphs and KBs have both historically been associated with academia, as the bulk of early AI research in these areas was mostly academic. However, the rise of tech giants like Google, Microsoft, and Amazon, as well as the importance of unstructured and semistructured data, much of it on the web (but also proprietary data, such as user search logs), has led to considerable industrial innovation and adoption around KGs. While not all the details are known, enough public information is available to establish that many of these big companies are investing considerable effort into building proprietary KGs. Some of this effort is scientific, but much of it relies on careful engineering and infrastructure set up to ensure high efficiency and bottom-line impact. KGs have become popular enough in <span aria-label="393" id="pg_393" role="doc-pagebreak"/>enterprise that even small and medium-sized companies are following suit, and there is a range of companies focused on providing KG services and application programming interfaces (APIs). More commonly, however, the KG is used in a supportive role. Sometimes this support is for a customer-facing service (the best example being rich semantic search, as we describe later in the chapter), but it can also be for the business side of the company (for running queries and aggregations over heterogeneous data that cannot be encapsulated properly or efficiently in a data warehouse), or for other divisions like sales, marketing, and data analytics. In most enterprise settings, with search being an exception, KGs tend to be domain-specific, though the domain could be very broad. For example, the product graph built and used within Amazon is technically a domain-specific KG (from the e-commerce domain), but it contains many entities that cannot be found, even in encyclopedic KGs like DBpedia. 
Interestingly, these KGs also have unique properties not well studied in academia; for example, many entities in the Amazon product graph are not even named entities, but are instead best described as sets of key-value pairs that provide the features (prices, ingredients, and so on) of the products.</p>
<p>Google is one of the first modern adopters of large-scale KGs, especially KGs constructed over web sources. The Google Knowledge Graph, which is an internal project at Google, seeks to incorporate semantics into regular keyword-based retrieval. One way of incorporating semantics, illustrated in <a href="chapter_15.xhtml#fig15-1" id="rfig15-1">figure 15.1</a>, is through the <i>knowledge panel</i>, best illustrated for historic entities like Leonardo da Vinci. Before such a KG had been conceived, keyword search would usually just yield a list of webpages, and it was the user’s task to click around these pages to satisfy their intent. Today, clicks may not even be required for many queries. For example, a search like “Los Angeles places to visit” would yield a list of candidate places to visit in Los Angeles on the Google search page itself, satisfying the intent of most users without requiring additional clicks or navigation. As another example of search extending beyond just webpage listing, a query such as “UK pound to USD” directly yields a calculator on Google, which is updated with the latest currency exchange rate and can be used to satisfy user intent for arbitrary pound and dollar values. While it is not clear that a KG underlies <i>all</i> of these facilities, the Google Knowledge Graph does play an important role in providing such rich semantics for many classes of Google searches.</p>
<div class="figure">
<figure class="IMG"><a id="fig15-1"/><img alt="" class="width" src="../images/Figure15-1.png"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig15-1">Figure 15.1</a>:</span> <span class="FIG">An example of semantics being incorporated into modern search engines like Google to satisfy user intent without requiring even a click. In this case, a search for an entity like “Leonardo da Vinci” yields a knowledge panel powered by the Google Knowledge Graph, and it provides some of the core information about the entity most likely to be useful to a typical user conducting the search. We also illustrate a magnified version of the panel (right).</span></p></figcaption>
</figure>
</div>
<section epub:type="division">
<h3 class="head b-head"><a id="sec15-2-1"/><b>15.2.1 Knowledge Vault</b></h3>
<p class="noindent">The mechanics of the Google Knowledge Graph are largely proprietary or unknown, but an important contributing technology is the Knowledge Vault (KV) devised by Dong et al. (2014), which is best described as a web-scale probabilistic knowledge base (KB) that combined extractions from web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories.<span aria-label="394" id="pg_394" role="doc-pagebreak"/></p>
<p><span aria-label="395" id="pg_395" role="doc-pagebreak"/>KV used supervised machine learning models for fusing these information modalities. In the rest of this section, we describe the KV in more detail, in keeping with the goal of illustrating how KGs are deployed in industrial settings. Later in the chapter, we describe more recent refinements to the KV, although the extent to which these improvements have made it into actual enterprise KGs is not known. However, there is reason to believe that there is significant synergy between the publications and actual industrial systems, as the main authors of these papers tend to be from industry, usually large tech giants like Google or Amazon.</p>
<p>A primary motivation behind the KV stemmed from the fact that, although several large-scale KGs existed at the time, such as Never-Ending Language Learning (NELL), YAGO, and DBpedia (all described in the previous chapter), they were highly incomplete. Dong et al. (2014) pointed out, for example, that in Freebase, which was then the largest open-source KG, 71 percent of people entities had no known place of birth, and almost 75 percent had no known nationality. Coverage also tended to be lower for relations that were not common, which meant that there was selection bias when it came to what knowledge was actually included in these KGs. Selection bias of this kind was emerging as a serious (and still not fully understood or quantified) problem in KBs that relied on crowdsourced content, the best example being Wikipedia. This may be a contributing factor in the growth of Wikipedia essentially plateauing over the last few years.</p>
<p>The KV was designed to be a fused knowledge repository that had greater coverage than any individual KG, while still being fairly precise. The KV was defined as a weighted labeled graph, with the weight (<i>G</i>(<i>s, p, o</i>)) for a triple (<i>s, p, o</i>) being 1 if that triple corresponds to a fact in the real world (and 0 otherwise). Because the binary problem is difficult to solve while still achieving both high coverage and correctness, the authors framed the problem as computing <i>Pr</i>(<i>G</i>(<i>s, p, o</i>) = 1) for a <i>candidate</i> triple (<i>s, p, o</i>), where the probability is conditional on different sources of specified information.</p>
<section epub:type="division">
<h4 class="head c-head"><b>15.2.1.1 Main Technologies</b> The KV relies on three broad sets of techniques to achieve the probabilistic knowledge fusion goals stated previously, namely the following:</h4>
<p class="snoindent">• <b>Extractors</b>, similar to some of the modules covered in part II, especially web information extraction (IE) and relation extraction (RE), were designed to extract triples from a large number of web sources. Each extractor assigns a confidence score to an extracted triple, meant to represent uncertainty about the identity of the relation and its corresponding arguments. Examples of extractors used in Dong et al. (2014) are listed in <a href="chapter_15.xhtml#tab15-1" id="rtab15-1">table 15.1</a>. The KV extractors were designed specifically for fact extraction from the web, including extraction from text documents, HTML trees, HTML tables, and human-annotated pages. An important problem that arises when using these four fact extraction methods is <i>data fusion</i>. In the KV, this fusion is achieved by constructing a feature vector for each extracted triple and then applying a binary classifier to compute the probability of the triple being true, given the feature vector. To be fast and scalable, a separate classifier is used for each predicate.</p>
<div class="table">
<p class="TT"><a id="tab15-1"/><span class="FIGN"><a href="#rtab15-1">Table 15.1</a>:</span> <span class="FIG">Examples of extractors used in the KV.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="width25 TCH"><p class="TB"><b>Input Modality</b></p></th>
<th class="TCH"><p class="TB"><b>Extractor Techniques Used</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB">Text documents</p></td>
<td class="TB"><p class="TB">Relatively standard methods for RE from text, but adapted to be scalable. Training is done using distant supervision (first covered in the context of semisupervised information extraction in part II).</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">HTML trees</p></td>
<td class="TB"><p class="TB">As with text documents, classifiers are trained, but the features input to the classifier are derived by connecting two entities from the Document Object Model (DOM) trees representing the HTML, instead of from the text. Specifically, the lexicalized path along the tree (between the two entities) is used as the feature vector.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">HTML tables</p></td>
<td class="TB"><p class="TB">Heuristic techniques are employed, because traditional extraction techniques (such as those for HTML trees and text) do not work well for tables. One such heuristic is to match the entities in each column to Freebase, and then use the matching results to reason about which predicate the column corresponds to.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Human annotated pages</p></td>
<td class="TB"><p class="TB">Human annotations in this context refer to the microformat and microdata annotations described in the previous chapter. In the KV, only Schema.org annotations in the webpages are used (limited to a small subset of 14 predicates mostly related to people) by defining a mapping manually between these 14 Schema.org predicates and the Freebase schema.</p></td>
</tr>
</tbody>
</table>
</figure>
</div>
<p><span aria-label="396" id="pg_396" role="doc-pagebreak"/>How is the actual feature vector composed? The approach adopted by KV is to use two numbers for each extractor: the square root of the number of sources the extractor extracts the triple from, and the mean score of extractions from the extractor, averaging over sources. The classifier determines a different weight for each of these two components, allowing it to learn the relative reliability of each extraction system. Because a separate classifier is fit per predicate, each predicate’s individual reliability can be separately modeled.</p>
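<p>As a concrete illustration, the per-triple feature vector just described might be sketched as follows. The extractor names and confidence scores below are hypothetical; the real KV pipeline is proprietary.</p>

```python
import math

# Hypothetical sketch of the KV per-triple feature vector: two numbers per
# extractor -- sqrt(#sources the triple was extracted from) and the mean
# confidence over those sources. Extractor names here are illustrative.

EXTRACTORS = ["text", "dom_tree", "html_table", "annotation"]

def triple_features(extractions):
    """extractions maps an extractor name to its per-source confidence scores."""
    feats = []
    for name in EXTRACTORS:
        scores = extractions.get(name, [])
        n = len(scores)
        feats.append(math.sqrt(n))                    # sqrt of source count
        feats.append(sum(scores) / n if n else 0.0)   # mean confidence
    return feats

# A triple seen by the text extractor in four sources and in one table:
fv = triple_features({"text": [1.0, 0.5, 0.5, 1.0], "html_table": [0.5]})
assert fv == [2.0, 0.75, 0.0, 0.0, 1.0, 0.5, 0.0, 0.0]
```

<p>A separate binary classifier, fit per predicate, would then consume vectors of this form to output the probability that the triple is true.</p>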
<p>For training this fusion system, the labels are acquired by applying a triple labeling heuristic called the <i>local closed world assumption</i> to the training set, and boosted decision stumps were used as the actual classifiers (because they were found to significantly improve over a baseline logistic regression model). This heuristic is described in full detail in the original KV paper by Dong et al. (2014), but in essence, it expands in several ways upon the original notion that only the triples in the initial KG (used for training) are correct (and triples not in the initial KG are incorrect). Under this heuristic, some triples are considered indeterminate, for example, and are not labeled as either correct or incorrect (and are hence discarded from the training and test sets); in principle, this set is like the possible matches <span aria-label="397" id="pg_397" role="doc-pagebreak"/>set recovered from models like Fellegi-Sunter, proposed for the IM problem (as described in chapter 8).</p>
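<p>A minimal sketch of this labeling heuristic is given below, assuming a toy initial KG with invented entities and predicates; the actual KV rules are more elaborate.</p>

```python
# Hypothetical sketch of the local closed world assumption (LCWA) used to
# label candidate triples for training. The tiny KG and predicate names are
# invented for illustration, not taken from Freebase or the KV.

KG = {  # a toy "initial KG" of known-true triples
    ("Obama", "bornIn", "Honolulu"),
    ("Obama", "spouseOf", "Michelle"),
}

def lcwa_label(triple, kg=KG):
    """Return 1 (true), 0 (false), or None (indeterminate, discarded)."""
    s, p, o = triple
    if (s, p, o) in kg:
        return 1                      # triple already known to be true
    # Does the KG know *any* object for this subject-predicate pair?
    known_objects = {o2 for (s2, p2, o2) in kg if s2 == s and p2 == p}
    if known_objects:
        return 0                      # KG treated as locally complete: false
    return None                       # KG is silent on (s, p): leave unlabeled

assert lcwa_label(("Obama", "bornIn", "Honolulu")) == 1
assert lcwa_label(("Obama", "bornIn", "Kenya")) == 0        # conflicting object
assert lcwa_label(("Obama", "educatedAt", "Harvard")) is None
```

<p>The indeterminate (<code>None</code>) triples are exactly the ones discarded from the training and test sets, as described previously.</p>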
<p class="snoindent">• <b>Graph-based priors</b> encode the prior probability of each possible triple, with the values of the priors derived from triples stored in an existing KG. Prior knowledge is important because it allows the filtering of extracted facts that are too unreliable. It also allows link prediction, even when there is little or no evidence for the predicted fact that has been extracted from the web. Two algorithms were used: the path ranking algorithm (PRA) and a neural network model (MLP, or multilayer perceptron). PRA is similar to distant supervision, in that the algorithm starts with a set of entity pairs connected by a predicate <i>p</i>. The algorithm performs a random walk on the graph, starting at all subject nodes, with paths reaching the object nodes being considered successful. The quality of the paths is measured in terms of support and precision metrics, similar to association rule mining. For example, PRA can learn that two entities connected via a marriedTo predicate often (but not always) have a path via another entity that is their child (the edges in the path being parentOf predicates).</p>
<p>Intuitively, these paths can be interpreted as rules. The rule equivalent of the path mentioned previously is that if two people are married, they are <i>likely</i> to have a common child. Because multiple rules or paths can apply for a given entity pair, the rules are combined by fitting a binary classifier (in this case, logistic regression) with the features being the probabilities of reaching an object node <i>o</i> from a source node <i>s</i> following different types of paths, and with labels again derived by applying the local closed world assumption on a preexisting, or initial, KG like Freebase. A classifier is again fit for each predicate independently in parallel.</p>
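<p>The path-combination step just described can be sketched as follows; the path types, learned weights, and bias are fabricated for illustration, and the real PRA implementation fits these weights per predicate from LCWA-derived labels.</p>

```python
import math

# Hypothetical sketch of how PRA-style path features are combined: each
# feature is the probability of reaching the object node o from the subject
# node s along a particular path type, and a per-predicate logistic model
# fuses these probabilities into a single score.

def pra_score(path_probs, weights, bias):
    """Logistic combination of path-reachability probabilities."""
    z = bias + sum(w * p for w, p in zip(weights, path_probs))
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid -> probability triple holds

# Features for a candidate marriedTo(s, o) pair: probability of reaching o
# via a shared child (parentOf edges) and via a shared-address path.
weights = [2.5, 1.0]                    # fabricated "learned" weights
with_evidence = pra_score([0.8, 0.3], weights, bias=-1.5)
no_evidence = pra_score([0.0, 0.0], weights, bias=-1.5)
assert no_evidence < with_evidence < 1.0   # path evidence raises the score
```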
<p>The neural network model is different from the PRA and is more akin to the KG representation learning methods described in chapter 10, though the actual loss function and optimization are different from the Trans* and other algorithms described therein. KV favors tensor decomposition instead, in order to learn embeddings for entities and predicates (which in principle is not dissimilar from the neural tensor network model presented in that chapter). Although the neural model is more advanced than the PRA, the authors of KV found that, when evaluated on a real test set, the performance of the neural method was roughly equal to that of PRA, with the Area under the ROC Curve (AUC) for the neural and PRA models being 0.882 and 0.884, respectively. This result illustrates an important (and ubiquitous) finding in industry: namely, that on real, industry data sets, more classic and conventional methods can sometimes perform as well as the state of the art in the research community. Reasons for this phenomenon abound, with some of the popular ones being data set bias (and benchmark overfitting) and the limitations of research test sets.</p>
<p class="myitemizeitem">The priors output by both of these algorithms are fused using a similar fusion method as the one described earlier for extractors. The main difference is in the features, as this component does not yield extractions. Instead, KV uses the confidence values from each <span aria-label="398" id="pg_398" role="doc-pagebreak"/>prior system, as well as indicator values specifying if the prior was able to predict or not (similar to a dummy value distinguishing a missing prediction from a 0.0 prediction score). A boosted classifier is again used. The fused system achieved an AUC of more than 0.91, significantly improving over the individual AUCs.</p>
<p class="snoindent">• <b>Knowledge fusion</b> computes the probability of a triple being true, based on agreement between different extractors and priors. Once again, the same fusion method described previously (using boosted classifiers on feature vectors derived from component outputs and confidences) was used, and empirically (just as done earlier), the fusion was found to yield significantly improved quantitative performance. Furthermore, an interesting finding was that when priors and extractor outputs were fused, the number of triples about which KV was uncertain (for which the predicted probability was between 30 percent and 70 percent) declined. The moral of the story is that leveraging multiple sources and systems in large-scale, real-world systems, especially in industry, can be critical to obtain the best performance. Such fusion could, in practice, make all the difference between an architecture being viable in an actual deployment, and being relegated to the sidelines as a failed company research project.</p>
<p class="TNI-H3"><b>15.2.1.2 Refinements</b> Since the original publication, the KV has undergone several refinements, not all of which have been made public. One recently published and heavily improved version by Lockard et al. (2019) is <i>OpenCeres</i>. We have already covered aspects of this system in some of the chapters in part II, particularly RE and Open IE. OpenCeres builds on Ceres, which itself significantly improved extraction performance over the KV by being able to automatically extract from semistructured sites with a precision over 90 percent using techniques like distant supervision. Semistructured sites are important because they contributed approximately 75 percent of total extracted facts and about 94 percent of high-confidence extractions. OpenCeres extends Ceres by providing feasible solutions for Open IE on semistructured websites. Because it is based on Open IE, OpenCeres is able to identify predicate strings on a website that represent the relations, as well as to identify unseen predicates by applying semisupervised label propagation. Unlike the web IE techniques we covered in chapter 5, OpenCeres is also highly novel, in that it is able to use visual aspects of the webpage for distant supervision and label propagation. Returning to the issue of data set bias and limitations of research benchmarks mentioned earlier, OpenCeres was evaluated on a new benchmark data set and online websites. It obtained an F-measure of 68 percent, higher than baseline systems, while extracting seven times as many predicates as present in the original ontology. On a set of 31 movie websites, OpenCeres yielded 1.17 million extractions with almost 70 percent precision. Work on OpenCeres continues in industry<sup><a href="chapter_15.xhtml#fn1x15" id="fn1x15-bk">1</a></sup> at this time.</p>
</section>
</section>
<section epub:type="division">
<h3 class="head b-head"><span aria-label="399" id="pg_399" role="doc-pagebreak"/><a id="sec15-2-2"/><b>15.2.2 Social Media and Open Graph Protocol</b></h3>
<p class="noindent">Social network and social media companies like LinkedIn and Facebook have also become proponents of KGs. A major initiative in this direction is the Open Graph Protocol (OGP), which enables any webpage to become a rich object in a social graph. As one example, Facebook uses the OGP to allow any webpage to have the same functionality as any other object on Facebook.</p>
<p>The core idea behind the protocol is to add metadata to the webpage, with the four required properties being <i>og:title, og:type, og:image,</i> and <i>og:url</i>. The initial version of the protocol is based on Resource Description Framework in Attributes (RDFa). RDFa is itself an extension of HTML5, which was designed for helping web developers mark up entities like people and places on websites to generate better search listings and visibility on the web.</p>
<p>Good examples of OGP usage can be found on IMDB. For example, a partial OGP snippet on the IMDB page for the movie <i>The Lion King</i> (2019) is illustrated in <a href="chapter_15.xhtml#fig15-2" id="rfig15-2">figure 15.2</a>. The snippet is relatively easy and modular to insert into the webpage, and with some scripting, it can even be generated from an existing database of information. Other than the compulsory properties, recommended properties include <i>og:description, og:site_name,</i> and <i>og:video</i>, among others. Interestingly, some properties can even have extra metadata attached to them, specified in the same way as other metadata in terms of property and content, but with the property having an extra <i>“:”</i>. For example, the <i>og:image</i> property has optional structured properties like <i>og:image:type</i> (a MIME<sup><a href="chapter_15.xhtml#fn2x15" id="fn2x15-bk">2</a></sup> type for the image), and <i>og:image:width</i> (the number of pixels wide). While the OG is not quite as rich as an ontology, it serves that purpose when marking up a webpage with snippets.</p>
<div class="figure">
<figure class="IMG"><a id="fig15-2"/><img alt="" class="width" src="../images/Figure15-2.png"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig15-2">Figure 15.2</a>:</span> <span class="FIG">An example of an OGP snippet embedded in the HTML source of the IMDB page associated with the <i>Lion King</i> (2019). By embedding these snippets, developers can turn their web objects into graph objects. The protocol has been extremely popular with developers catering to social media companies like Facebook. Accessed on Nov. 17 at <i><a href="http://www.imdb.com/title/tt6105098/">www.imdb.com/title/tt6105098/</a></i>.</span></p></figcaption>
</figure>
</div>
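<p>To make the markup concrete, the following sketch generates the four required OGP properties for a movie page. The values echo the figure 15.2 example, but were typed by hand for illustration rather than scraped from the live IMDB page.</p>

```python
from html import escape

# A minimal sketch that emits the four required OGP properties
# (og:title, og:type, og:image, og:url) as <meta> tags. The image URL
# is a placeholder, not a real IMDB asset.

def ogp_tags(title, og_type, image, url):
    props = {"og:title": title, "og:type": og_type,
             "og:image": image, "og:url": url}
    return "\n".join(
        '<meta property="{}" content="{}" />'.format(p, escape(v, quote=True))
        for p, v in props.items()
    )

tags = ogp_tags("The Lion King", "video.movie",
                "https://example.org/lion-king.jpg",
                "https://www.imdb.com/title/tt6105098/")
assert '<meta property="og:title" content="The Lion King" />' in tags
assert tags.count("<meta") == 4
```

<p>In practice, such tags are placed in the page <code>&lt;head&gt;</code>, and recommended or structured properties (such as <i>og:image:width</i>) can be emitted the same way.</p>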
<p>While many different technologies and schemas exist and could be combined, there does not exist a single technology that provides enough information to richly represent any webpage within the social graph (a richer version of the classical social network). The OGP builds on these existing technologies and gives developers a single path for implementation. In that sense, it is simple and unified, and has a significant web presence. At this time, it is being published by IMDB, Microsoft, Rotten Tomatoes, <i>TIME</i>, and several others. Additionally, it is consumed by both Facebook and Google. Adoption is also helped by lightweight plug-ins (e.g., on WordPress) that can be used to easily add Open Graph metadata to WordPress-powered sites.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec15-2-3"/><b>15.2.3 Schema.org</b></h3>
<p class="noindent">In chapter 14 on Linked Data, we mentioned microdata as another means of adding more machine-amenable structure to the web. Schema.org is the best example of this effort, and it has witnessed massive uptake in the last decade, in part (if not mostly) due to the <span aria-label="400" id="pg_400" role="doc-pagebreak"/>official support and use of Schema.org markup by the major search engines in indexing and rendering search results.</p>
<p><span aria-label="401" id="pg_401" role="doc-pagebreak"/>On its homepage,<sup><a href="chapter_15.xhtml#fn3x15" id="fn3x15-bk">3</a></sup> Schema.org is described as a “collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond.” The vocabulary and initiative were founded by Google, Microsoft, Yahoo!, and Yandex (another search engine), with the vocabularies maintained and developed by an open community process. The vocabulary itself can be used with several encodings, including microdata and RDFa. Currently, at least 10 million websites (although the actual number is believed to be much higher) are known to mark up their pages and email messages with Schema.org snippets, and applications from Google, Microsoft, and Yandex, among others, use the Schema.org vocabularies as well.</p>
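<p>As a hedged illustration of one such encoding, the snippet below serializes a fabricated museum listing as JSON-LD, another encoding the Schema.org vocabulary supports alongside microdata and RDFa; the museum and all of its details are invented.</p>

```python
import json

# A sketch of a Schema.org snippet serialized as JSON-LD. Museum,
# PostalAddress, name, address, and openingHours are real Schema.org
# terms; the specific entity is fictional.

snippet = {
    "@context": "https://schema.org",
    "@type": "Museum",
    "name": "City Art Museum",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Los Angeles",
        "addressRegion": "CA",
    },
    "openingHours": "Tu-Su 10:00-17:00",
}

# A page would embed this inside <script type="application/ld+json"> ... </script>
markup = json.dumps(snippet, indent=2)
assert '"@type": "Museum"' in markup
```

<p>Snippets like this are what search engines consume when indexing and rendering rich results for localized, longtail entities.</p>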
<p>However, while it is not possible for most people to obtain all the Schema.org fragments at true web scale, a large corpus of Schema.org data has been made directly available to the public as a result of the Web Data Commons (WDC) project (see the section entitled “Software and Resources” at the end of this chapter), which extracts embedded structured data (including RDFa, microdata, and microformats) from several billion webpages. The project provides this extracted data for download and even publishes statistics about the deployment of the various formats.</p>
<p>As Schema.org has grown, mechanisms have been adopted to extend the Schema.org core to support more detailed and specialized vocabularies. Two categories of extension are (1) <i>hosted</i> extensions, managed and published as part of the Schema.org project, with the design led by dedicated community groups; and (2) <i>external</i> extensions, managed by other organizations with their own processes and collaboration mechanisms. As the name suggests, external extensions are more organic and driven by external needs, and external sources on the web must be referred to for documentation and development information. It is important to note that the steering group does not officially endorse these kinds of extensions. Hosted extensions, which are officially endorsed, include extensions on such diverse subjects as health and life sciences (Health-lifesci.schema.org), Internet of Things (Iot.schema.org), and bibliographies (Bib.schema.org).</p>
<p>The Schema.org movement continues to grow, and it is quite possible (although we are not aware of an official census) that the total number of Schema.org markups (expressed as RDF triples) has already exceeded the number of RDF triples published using Linked Data principles. However, it is important to note that Schema.org markups are complementary to Linked Data. The vast majority of Schema.org entities seem to consist of products, services, and local businesses and organizations that are vital to the proper functioning of search engines, especially when searching for localized and longtail entities (such as a restaurant in your city that does not have a Wikipedia page). In contrast, Linked <span aria-label="402" id="pg_402" role="doc-pagebreak"/>Data has seen uptake either from the scientific community (detailed further in chapter 16, on KGs for domain sciences), or from projects relying heavily on primary data sources like Wikipedia and GeoNames. Efforts to bridge the two movements (i.e., to crawl, study, and republish Schema.org annotations as Linked Data) are currently under way as a major research frontier in the Semantic Web. However, the fourth Linked Data principle is difficult to honor, as many longtail entities likely cannot link to anything currently existing as Linked Open Data, and as we covered in part II, crawling and domain discovery are not easy problems to address, especially at scale.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec15-3"/><b>15.3 Governments and Nonprofits</b></h2>
<p class="noindent">Governments and nonprofit organizations have also become significant adopters and contributors of structured knowledge. In many cases, this knowledge is ontologized and can be serialized as a KG, which can then be queried using many of the techniques covered earlier. While Schema.org is an excellent example of an ecosystem that organizations (both for-profits and nonprofits) have striven to be a part of, owing to its continued influence on search results, it is certainly not the only example of structured knowledge being published by governments and nonprofits. Next, we briefly cover some other influential examples.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec15-3-1"/><b>15.3.1 Open Government Data</b></h3>
<p class="noindent">According to the Organisation for Economic Co-operation and Development (OECD), OGD<sup><a href="chapter_15.xhtml#fn4x15" id="fn4x15-bk">4</a></sup> is a “philosophy, and increasingly a set of policies, that promotes transparency, accountability and value creation by making government data available to all.” In modern economies, both Western and emerging, public bodies produce and commission large quantities of data, usually with the goal of becoming more transparent and accountable to citizens. By encouraging the use, reuse, and free distribution of data sets, governments also attempt to promote business creation and innovative, citizen-centric services. The eight principles of OGD, according to 30 open government advocates who gathered in California to develop a more robust understanding of why OGD is essential to democracy, are briefly enumerated in <a href="chapter_15.xhtml#tab15-2" id="rtab15-2">table 15.2</a>.</p>
<div class="table">
<p class="TT"><a id="tab15-2"/><span class="FIGN"><a href="#rtab15-2">Table 15.2</a>:</span> <span class="FIG">An enumeration of the eight OGD principles.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>OGD Principle</b></p></th>
<th class="TCH"><p class="TB"><b>Description of Principle</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB">Complete</p></td>
<td class="TB"><p class="TB">All public data is made available. Public data is data that is not subject to valid privacy, security, or privilege limitations.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Primary</p></td>
<td class="TB"><p class="TB">Data is given as it is collected at the source, with the highest possible level of granularity, not in aggregate or modified forms.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Timely</p></td>
<td class="TB"><p class="TB">Data is made available as quickly as needed to preserve its value.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Accessible</p></td>
<td class="TB"><p class="TB">Data is available to the widest range of users, for the widest range of purposes.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Machine-processable</p></td>
<td class="TB"><p class="TB">Data is reasonably structured to allow automated processing.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Nondiscriminatory</p></td>
<td class="TB"><p class="TB">Data is available to anyone, with no registration required.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Nonproprietary</p></td>
<td class="TB"><p class="TB">Data is available in a format over which no entity has exclusive control.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">License-free</p></td>
<td class="TB"><p class="TB">Data is not subject to any copyright, patent, trademark, or trade secret regulation. Reasonable privacy, security, and privilege restrictions may be allowed.</p></td>
</tr>
<tr>
<td class="TSN" colspan="2"><p class="TSN">We credit the succinct descriptions provided at <a href="https://public.resource.org/8_principles.html">https://public.resource.org/8_principles.html</a>.</p></td>
</tr>
</tbody>
</table>
</figure>
</div>
<p>OGD can pose some tricky questions for governments, such as: Who will pay for the collection and processing of public data if it is made freely available? What are the incentives for government bodies to maintain and update their data? What data sets should be prioritized for release to maximize public value? These questions involve various trade-offs, and it is important to take steps toward developing a framework for such cost-benefit analyses, covering data collection and curation, as well as the preparation of case studies that demonstrate the concrete benefits (economic, social, and policy related) of OGD.</p>
<p><span aria-label="403" id="pg_403" role="doc-pagebreak"/>The OECD OGD project aims to promote international efforts on OGD impact assessment. Mapping practices across countries helps establish a knowledge repository on OGD policies, strategies, and initiatives, and supports the development of a methodology for assessing the impact of OGD initiatives and the governance value they create, especially along socioeconomic dimensions.</p>
<p>We reiterate that this initiative is not being enthusiastically adopted only in rich economies. In the last decade in particular, there have been many conferences and initiatives for (1) building OGD cultures, especially in the Middle East and North Africa, to combat endemic corruption; (2) developing indices that allow OGD success and adoption rates to be compared quantitatively across countries; and (3) promoting the movement through reviews, blogs, and meetings.</p>
<p>A concrete application of OGD is <i>Smart Cities</i>. The Smart Cities Council describes<sup><a href="chapter_15.xhtml#fn5x15" id="fn5x15-bk">5</a></sup> a smart city as one that “uses information and communications technology (ICT) to enhance its livability, workability, and sustainability.” Smart Cities have become an increasingly important concept in the face of global pressures like climate change and rural-to-urban migration in developing countries, which has led to the rise of giant metropolises. Pollution, poverty, income inequality, and food shortages are other severe problems, and such pressures have contributed to a surge in antiestablishment populism in recent years. Smart Cities, through efficient resource <span aria-label="404" id="pg_404" role="doc-pagebreak"/>mobilization and better governance and accountability, attempt to mitigate some of these issues to make cities more livable and sustainable.</p>
<p>The Smart Cities movement was not originally (and still is not) centered around KGs as the primary technology. However, due to their flexibility, KGs have been recognized as being important to the Smart City movement. Recently, for example, Santos et al. (2017) used KGs for supporting automatic generation of dashboards, metadata analysis, and data visualization. It is likely that this trend will continue, especially as cities release their data sets in highly heterogeneous formats, for which KGs are particularly apt as data representations.</p>
<p class="TNI-H3"><b>15.3.1.1 US Government Data (Data.gov)</b> A good example of an OGD ecosystem is Data.gov in the United States. Data.gov is managed and hosted by the US General Services Administration, Technology Transformation Service, is developed publicly on GitHub, and is powered by two open-source applications, CKAN and WordPress. Data.gov follows the Project Open Data (POD) schema, a set of required fields for every data set displayed on Data.gov. The POD schema addresses the challenge of defining and naming standard metadata fields so that a data consumer has sufficient information to process and understand the described data. This is especially important considering the vast number of data sets that have become available on Data.gov portals. Metadata can range from basic to advanced, from allowing discovery of the mere fact that a certain data set or artifact exists on a certain subject all the way to providing detailed information documenting the structure, quality, and other properties of a data set. Clearly, making metadata machine-readable increases its utility, but it also requires effective and consensus-driven standardization. By following the POD schema, Data.gov takes a step in this important direction.</p>
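<p>A minimal sketch of what checking a data set entry against the POD schema’s required metadata fields might look like is shown next. The field list below is a representative subset we assume for illustration; the authoritative list of required fields should be taken from the official POD documentation, and the sample record is hypothetical.</p>

```python
# Sketch: validate a dataset record against a representative subset of the
# POD schema's required metadata fields (consult the official POD
# documentation for the authoritative field list).
REQUIRED_FIELDS = {"title", "description", "keyword", "modified",
                   "publisher", "contactPoint", "identifier", "accessLevel"}

def missing_pod_fields(record: dict) -> set:
    """Return the required POD fields absent from a dataset record."""
    return REQUIRED_FIELDS - record.keys()

record = {  # hypothetical Data.gov-style metadata entry
    "title": "Air Quality Measurements",
    "description": "Hourly PM2.5 readings.",
    "keyword": ["air quality", "environment"],
    "modified": "2017-06-01",
    "publisher": {"name": "EPA"},
    "identifier": "epa-aq-001",
    "accessLevel": "public",
}
print(sorted(missing_pod_fields(record)))  # contactPoint is missing
```

<p>A harvester for a Data.gov-style portal could use such a check to flag records whose metadata is too incomplete for downstream processing.</p>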
<p>Although the total number of data sets, which is available on the Data.gov Metrics page, can fluctuate, the range and growth have been impressive in recent years. According to official statistics, as of June 2017, there were approximately 200,000 data sets reported on Data.gov, representing about 10 million data resources.</p>
<p>Importantly, we note that releasing data is, in many cases, no longer voluntary. Under the terms of the 2013 Federal Open Data Policy, newly generated government data is required to be made available in open, machine-readable formats, while continuing to ensure privacy and security. For instance, federal Chief Financial Officers Act agencies are required to create a single data inventory, publish public data listings, and develop new public feedback mechanisms. Agencies are also required to identify public points of contact for agency data sets.</p>
<p>As a distributed collection of JSON and semistructured files, it is not immediately apparent that Data.gov is a “knowledge graph.” Certainly, integrating so many of the data sets (especially at the level of records, as discussed in chapter 8, on instance matching) is a challenging problem that is only partially helped by the use of a uniform schema. However, the use of the schema also motivates thinking of the collection as a KG. Research initiatives <span aria-label="405" id="pg_405" role="doc-pagebreak"/>are currently underway to realize this vision, especially using free, open-source tools and NoSQL environments like Elasticsearch. Relevant research is noted in the section entitled “Bibliographic Notes,” at the end of this chapter.</p>
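<p>One simple way to see the collection as a KG is to flatten each metadata record into (subject, predicate, object) triples keyed on its identifier, as the following sketch does; the record shown is hypothetical, and the flattening scheme is ours, not an official Data.gov convention.</p>

```python
# Sketch: flatten Data.gov-style metadata records into (s, p, o) triples,
# treating the record identifier as the subject node of a KG.
def record_to_triples(record: dict) -> list:
    subject = record["identifier"]
    triples = []
    for predicate, obj in record.items():
        if predicate == "identifier":
            continue  # the identifier names the node; it is not an edge
        values = obj if isinstance(obj, list) else [obj]
        for value in values:
            triples.append((subject, predicate, value))
    return triples

record = {"identifier": "epa-aq-001",          # hypothetical entry
          "title": "Air Quality Measurements",
          "keyword": ["air quality", "environment"]}
triples = record_to_triples(record)
for t in triples:
    print(t)
```

<p>Once records are in triple form, entities shared across data sets (agencies, locations, themes) can be linked, which is precisely where the instance-matching techniques of chapter 8 become relevant.</p>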
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec15-3-2"/><b>15.3.2 BBC</b></h3>
<p class="noindent">The BBC is a British public service broadcaster that is the world’s oldest national broadcasting organization (and the largest broadcaster in the world by number of employees). The BBC was also one of the first organizations to use linked data, with a highly successful public use-case being its Olympics coverage. Even before that, however, the BBC was obtaining existing uniform resource identifiers (URIs) for musical artists from MusicBrainz, a freely available linked data set that currently contains tens of millions of artists, albums, and songs. As noted in the previous chapter, most linked data sets on Linked Open Data (LOD) also connect to DBpedia as a hub source for fulfilling the fourth Linked Data principle, and this is true of MusicBrainz as well. This gave the BBC a large body of externally developed (and freely available) knowledge about the music industry. However, the path forward was not seamless, since (as also described earlier) Linked Open Data can suffer from quality problems. When an organization using these data sets discovers errors, it faces a choice of whether to report them. The BBC made the strategic decision to help improve the MusicBrainz database when it found errors, in keeping with its charter to provide benefit to the public.</p>
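<p>The mechanics of reusing externally developed URIs can be sketched as follows: an organization asserts that its own artist page denotes the same entity as a MusicBrainz artist URI, using the standard <code>owl:sameAs</code> property. The local URI and the MusicBrainz identifier below are hypothetical (though the URI follows MusicBrainz’s real /artist/&lt;id&gt; pattern).</p>

```python
# Sketch: link a local artist URI to an externally developed MusicBrainz
# URI via owl:sameAs; the identifiers below are hypothetical.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def same_as_triple(local_uri: str, musicbrainz_id: str) -> tuple:
    """Build an (s, p, o) triple linking a local URI to MusicBrainz."""
    return (local_uri,
            OWL_SAME_AS,
            f"https://musicbrainz.org/artist/{musicbrainz_id}")

triple = same_as_triple(
    "https://example.org/music/artists/0001",  # hypothetical local URI
    "123e4567-e89b-12d3-a456-426614174000")    # hypothetical artist id
print(triple)
```

<p>With such links in place, the organization inherits everything published about the artist in the external data set without re-curating it, which is the core economic appeal of Linked Data reuse.</p>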
<p>We used this example not only to illustrate an early instance of a nonacademic organization using LOD (and, by extension, KGs), but also as an example of open data innovation. At the time, the open data movement was not as popular, but today, more companies and start-ups are starting to publish (and use) open data to create real business and social value. This movement started picking up steam in the late 2000s; for instance, Barack Obama issued a <i>Memorandum on Transparency and Open Government</i> on January 21, 2009, the very next day after he was sworn in as US president, endorsing the opening of government data and committing to accountability. Companies were slower to adapt to this trend, but today, it is not unusual for companies to give back to the community via open-source code and limited data set releases. In the media landscape, the BBC’s adoption of linked data and KGs likely played an important role in other organizations (like the <i>New York Times</i>) taking up the challenge and discovering applied uses of KG technologies and data for solving some of their own problems.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec15-3-3"/><b>15.3.3 OpenStreetMap</b></h3>
<p class="noindent">OpenStreetMap (OSM) is a <i>collaborative mapping</i> project with the overarching goal of creating a free and editable map of the world. It is an instructive example of massive-scale, crowdsourced geographic information, with a direct analog to crowdsourced encyclopedic information sources like Wikipedia. OSM was created by Steve Coast in the United Kingdom in 2004, motivated both by the success of crowdsourced efforts like <span aria-label="406" id="pg_406" role="doc-pagebreak"/>Wikipedia and by the proprietary nature of existing map data in the UK. OSM currently has more than two million registered users, each of whom can contribute to the project by collecting data using techniques such as manual surveys, aerial photography, and Global Positioning System (GPS) devices.</p>
<p>The data generated by OSM is used by websites such as Craigslist, OsmAnd, Geocaching, and Foursquare (to name just a few), and is an alternative to services like Google Maps. Map data collection is both a grassroots and a crowdsourced effort, with data being collected from scratch by volunteers performing systematic ground surveys (using tools such as a handheld GPS unit, a notebook, or even a voice recorder). The data is then entered into the OSM database. Mapathon competition events are also held by the OSM team, nonprofit organizations, and local governments to map particular areas.</p>
<p>Some government agencies have released official data under appropriate licenses. This includes the US, where works of the federal government are placed in the public domain. In the US, OSM uses Landsat 7 satellite imagery, Prototype Global Shorelines from the National Oceanic and Atmospheric Administration (NOAA), and TIGER data from the Census Bureau. In the UK, some Ordnance Survey OpenData is imported, while Natural Resources Canada’s CanVec vector data and GeoBase provide land cover and streets. An important source of information that does not lead to copyright or licensing problems is out-of-copyright maps, which serve as good sources of information about features that do not change frequently. Copyright periods vary, but in the UK, Crown copyright expires after 50 years; hence Ordnance Survey maps published up to the 1960s can legally be used. A complete set of UK 1 inch/mile maps from the late 1940s and early 1950s has been collected, scanned, and made available online as a resource for contributors.</p>
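<p>Raw OSM data is distributed in an XML format built from nodes (points with coordinates) and ways (ordered lists of node references with tags). The following sketch parses a tiny fragment in that format; the element names mirror real OSM XML, but the ids, coordinates, and tag values are made up.</p>

```python
import xml.etree.ElementTree as ET

# Sketch: read a minimal OSM-style XML fragment (nodes and ways); the
# structure mirrors OSM's format, but all ids and coordinates are made up.
OSM_XML = """
<osm version="0.6">
  <node id="1" lat="51.5074" lon="-0.1278"/>
  <node id="2" lat="51.5080" lon="-0.1270"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>
"""

root = ET.fromstring(OSM_XML)
# Map node id -> (lat, lon).
nodes = {n.get("id"): (float(n.get("lat")), float(n.get("lon")))
         for n in root.iter("node")}
# Map way id -> ordered list of node references.
ways = {w.get("id"): [nd.get("ref") for nd in w.iter("nd")]
        for w in root.iter("way")}
print(nodes)
print(ways)
```

<p>Consumers of OSM extracts typically resolve each way’s node references against the node table, exactly as the two dictionaries above allow, before rendering or routing over the geometry.</p>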
<p>In February 2015, OSM added route-planning functionality to the map on its official website. The routing uses external services (namely, OSRM, GraphHopper, and MapQuest). Examples of software available for working with OSM include OpenStreetBrowser, which displays finer map and category options; OsmAnd, a free application for Android and iOS mobile devices that can use offline (vector) data from OSM and supports layering OSM vector data over prerendered raster map tiles from OSM; and Maps.me, another free mobile application that provides offline maps based on OSM data.</p>
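<p>To illustrate how such an external routing service is typically consumed, the sketch below builds a request URL in the style of OSRM’s HTTP API, where coordinates are given as semicolon-separated lon,lat pairs under a route/v1/&lt;profile&gt; path. The exact URL format and query options should be confirmed against the OSRM documentation; the coordinates are arbitrary London points, and no network request is made here.</p>

```python
# Sketch: build a route-request URL in the style of OSRM's HTTP API
# (coordinates are lon,lat pairs joined by ';'); consult the OSRM
# documentation for the authoritative URL format and options.
def osrm_route_url(coords, profile="driving",
                   base="https://router.project-osrm.org"):
    """coords: iterable of (lon, lat) pairs, in visiting order."""
    path = ";".join(f"{lon},{lat}" for lon, lat in coords)
    return f"{base}/route/v1/{profile}/{path}?overview=false"

# Two arbitrary points in central London (lon, lat).
url = osrm_route_url([(-0.1278, 51.5074), (-0.1426, 51.5014)])
print(url)
```

<p>An application would issue an HTTP GET against such a URL and read the route geometry and duration from the JSON response.</p>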
<p>A notable application of OSM has been in the area of humanitarian aid. For example, during the 2010 Haiti earthquake, OSM and Crisis Commons volunteers used available satellite imagery to map the roads, buildings, and refugee camps of Port-au-Prince in just two days, building one of the most complete digital maps of Haiti’s roads ever to exist. Other organizations that have used these data and maps include the World Bank, the European Commission Joint Research Centre, and the United Nations Office for the Coordination of Humanitarian Affairs. Other crises in which OSM data played an important role <span aria-label="407" id="pg_407" role="doc-pagebreak"/>include the northern Mali conflict, Typhoon Haiyan in the Philippines in 2013, and the Ebola epidemic in West Africa in 2014.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec15-4"/><b>15.4 Where Is the Future Headed?</b></h2>
<p class="noindent">KGs in governments, nonprofits, and for-profit organizations have continued to grow, a trend that will most likely continue in the foreseeable future. In addition to growth in existing ecosystems, new initiatives are in the works, some with potentially transformative impact. Because any discussion of where the future will go is necessarily speculative and beyond the scope of this book, we focus on two well-realized trends. Interestingly, both are grassroots efforts, although one is primarily industrial and the other is primarily academic.</p>
<p>First, an encouraging sign of progress in the adoption of KGs in enterprise is the growing number of start-ups and nontech companies that have recently been using, building, or otherwise working with KGs. This trend has not gone unnoticed in business and consulting. For example, Gartner placed Graph Analytics on its “Top 10 Data and Analytics Technology Trends for 2019” report in the area of solving critical business priorities. While graphs are considered a general data model (or structure) in the academic computer science and math communities, it is fairly evident in the actual report that KGs and graph databases constituted the primary technologies in Gartner’s scope, as the following quote evinces:<sup><a href="chapter_15.xhtml#fn6x15" id="fn6x15-bk">6</a></sup> “The application of graph processing and graph DBMSs will grow at 100 percent annually through 2022 to continuously accelerate data preparation and enable more complex and adaptive data science.” Examples of start-ups that offer KGs (whether by way of supporting technology, data, or both) as primary business offerings include Stardog, which has partnered with industry leaders like Morgan Stanley and the National Aeronautics and Space Administration (NASA), and Diffbot, which is attempting to assemble a web-scale KG of billions of facts and offers a subscription-like model with starter rates as low as a cable TV bill. Other examples also exist, some more specific to communities like the SW than others. While most start-ups have focused on English-language data and cater to customers in Western economies, the rising spending power, modernization, and economic importance of countries like India and China have led to a renewal of interest in building KGs from multilingual data sets covering more diverse domains and data genres, such as Short Message Service (SMS) messages.</p>
<p>Second, a very recent initiative, motivated by the concern that enterprises like Google, Amazon, and Microsoft each have their own, currently proprietary KG ecosystem, is the Open Knowledge Network (OKN). Proprietary KG ecosystems have two consequences: (1) repetition and redundancy, with similar technical problems being solved <span aria-label="408" id="pg_408" role="doc-pagebreak"/>across similar talent pools (and for virtually identical markets); and (2) a lack of openness, uniform standards, or public availability. It is notable that the challenges in the second point were addressed by movements such as Linked Data, and even Schema.org, but these initiatives do not have the centralized resources or personnel of the major technology firms. Alarmingly, it seems that large, interlinked KGs on the web (which would not include Schema.org, as its entities are not naturally linked in the way that Wikipedia, DBpedia, and other data sets on LOD are) fall into two camps: high-quality, closely guarded KGs being developed in industry, of which even basic details can sometimes be hard to acquire publicly (Microsoft Satori is one example), and open, crowdsourced KGs that have to be developed, hosted, and maintained in increasingly resource-light settings, or through small individual efforts by a massive collective (crowdsourcing).</p>
<p>OKN offers itself as an alternative by pushing for a small set of core protocols and vocabulary, as well as a web-style architecture, that encourage diversity (i.e., publishers and consumers of all sizes and stripes) and easy proliferation and publishing of services (e.g., search) that would lead to rapid application offerings. Invoking the analogy to the pre-web era, OKN would play the same role in the development, publishing, storage, and access of web-scale knowledge graphs that the HTTP protocol played, and continues to play, in the publishing and consumption of websites. It is debatable to what extent the initiative differs from either the SW or the Linked Data movement (which is broadly supported by the SW community, as noted in chapter 14). Interestingly, while the Linked Data movement is global in scope, with contributions from both US and European institutions, the scope of OKN has been limited to US agencies, institutions, and individuals thus far.</p>
<p>At this time, the initiative involves a number of important figures from academia and industry, with regular workshops organized and held by the US National Science Foundation. Important domains that key stakeholders believe will be affected by the effort include finance, geosciences, commercial applications like personal assistants, and biomedicine. Like any ambitious effort, the success of the movement is not guaranteed, but it is a principled step toward achieving an open, and truly web-scale, KG ecosystem that offers benefits to public, private, and individual players alike.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec15-5"/><b>15.5 Concluding Notes</b></h2>
<p class="noindent">In this chapter, we covered the adoption of KGs in enterprise and government. Although the term “knowledge graph” became popular in modern times because of the Google Knowledge Graph, which is truly web scale in its scope and largely proprietary, it has since been widely adopted (as a technology) or constructed (as valuable data) by all manner of organizations. Most encouragingly, start-ups and nontech companies have taken up the mantle in discovering novel use-cases of KGs and employing them as a service. Governments are not far behind, although data sets on portals like Data.gov are still too raw compared to industry standards. Also important is the successful application of scientific <span aria-label="409" id="pg_409" role="doc-pagebreak"/>principles like Linked Data by organizations like the BBC. More recently, initiatives like the OKN are seeking to democratize large-scale, all-encompassing KGs by incentivizing easy proliferation and publishing of KGs and KG-centric services like search. In a narrower domain, OSM has successfully democratized the publishing and use of map and geospatial data.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec15-6"/><b>15.6 Software and Resources</b></h2>
<p class="noindent">The Schema.org resources are available at <a href="https://schema.org/">https://<wbr/>schema<wbr/>.org<wbr/>/</a>, with the schemas available at <a href="https://schema.org/docs/schemas.html">https://<wbr/>schema<wbr/>.org<wbr/>/docs<wbr/>/schemas<wbr/>.html</a>. For developers, there is a separate set of resources available at <a href="https://schema.org/docs/developers.html">https://<wbr/>schema<wbr/>.org<wbr/>/docs<wbr/>/developers<wbr/>.html</a>. The Knowledge Vault (KV), which was mentioned in the early part of the chapter, received a lot of press<sup><a href="chapter_15.xhtml#fn7x15" id="fn7x15-bk">7</a></sup> (and was written about in several publications, as cited in the next section), but does not seem to have been made openly available. An adequate substitute for research purposes might be Wikidata, which is available at <a href="https://www.wikidata.org/wiki/Wikidata:Main_Page">https://<wbr/>www<wbr/>.wikidata<wbr/>.org<wbr/>/wiki<wbr/>/Wikidata:Main<wbr/>_Page</a>. However, more recently, researchers who previously worked on the KV research released a publication on OpenCeres (accessible at <a href="http://lunadong.com/publication/openCeres_naacl.pdf">http://<wbr/>lunadong<wbr/>.com<wbr/>/publication<wbr/>/openCeres<wbr/>_naacl<wbr/>.pdf</a>), which contains details that are replicable on large, openly available corpora. The CommonCrawl is an important resource in this regard, accessible at <a href="http://commoncrawl.org/">http://<wbr/>commoncrawl<wbr/>.org<wbr/>/</a>. We recommend that interested readers also take a closer look at the WDC, which extracts structured data from the CommonCrawl and provides extracted data for public download to support academia and industry alike. WDC has its homepage at <a href="http://webdatacommons.org/">http://<wbr/>webdatacommons<wbr/>.org<wbr/>/</a>.</p>
<p>Concerning social media efforts, the primary webpage for learning about the OGP is <a href="https://ogp.me/">https://<wbr/>ogp<wbr/>.me<wbr/>/</a>. Toward the end of that webpage, several implementations are mentioned, including libraries written in Python, Ruby, and Java, accessible at <a href="http://pypi.python.org/pypi/PyOpenGraph">http://<wbr/>pypi<wbr/>.python<wbr/>.org<wbr/>/pypi<wbr/>/PyOpenGraph</a>, <a href="http://github.com/intridea/opengraph">http://<wbr/>github<wbr/>.com<wbr/>/intridea<wbr/>/opengraph</a>, and <a href="http://github.com/callumj/opengraph-java">http://<wbr/>github<wbr/>.com<wbr/>/callumj<wbr/>/opengraph<wbr/>-java</a>, respectively.</p>
<p>We have mentioned that OGD is becoming more common throughout the world. In the US, a good resource is <a href="https://www.data.gov/open-gov/">https://<wbr/>www<wbr/>.data<wbr/>.gov<wbr/>/open<wbr/>-gov<wbr/>/</a>. Another good resource is the OECD website on the subject: <a href="https://www.oecd.org/gov/digital-government/open-government-data.htm">https://<wbr/>www<wbr/>.oecd<wbr/>.org<wbr/>/gov<wbr/>/digital<wbr/>-government<wbr/>/open<wbr/>-government<wbr/>-data<wbr/>.htm</a>. Among other important resources, it publishes the OURdata Index, which assesses governments’ efforts to implement open data in three critical areas: the openness, usefulness, and reusability of government data. The most recent index showed that South Korea ranked highest, while the US fell outside the top 10 but was slightly above the OECD average. Some sample websites of countries that have instituted open government efforts or portals include Singapore (<a href="https://data.gov.sg/">https://<wbr/>data<wbr/>.gov<wbr/>.sg<wbr/>/</a>), the <span aria-label="410" id="pg_410" role="doc-pagebreak"/>United Kingdom (<a href="https://data.gov.uk/">https://<wbr/>data<wbr/>.gov<wbr/>.uk<wbr/>/</a>), and India (<a href="https://data.gov.in/">https://<wbr/>data<wbr/>.gov<wbr/>.in<wbr/>/</a>). Most democratic nations today offer such platforms, although the ease of use and completeness of data vary significantly.</p>
<p>We mentioned the BBC as an important and early adopter of KG and SW technologies. A good webpage for accessing the ontologies and other resources is <a href="http://www.bbc.co.uk/ontologies">www<wbr/>.bbc<wbr/>.co<wbr/>.uk<wbr/>/ontologies</a>.</p>
<p>The OSM resource is accessible at openstreetmap.org/#map=5/38.007/-95.844. It is free to use under an open license. For users interested in exploring download options, we recommend <a href="https://www.openstreetmap.org/export#map=5/38.007/-95.844">https://<wbr/>www<wbr/>.openstreetmap<wbr/>.org<wbr/>/export#map<wbr/>=5<wbr/>/38<wbr/>.007<wbr/>/<wbr/>-95<wbr/>.844</a>, which provides export options licensed under the Open Data Commons Open Database License.</p>
<p>At this time, projects exploring the OKN have already received funding from the National Science Foundation. A “Dear Colleague” letter is accessible at <a href="https://www.nsf.gov/pubs/2019/nsf19050/nsf19050.jsp">https://<wbr/>www<wbr/>.nsf<wbr/>.gov<wbr/>/pubs<wbr/>/2019<wbr/>/nsf19050<wbr/>/nsf19050<wbr/>.jsp</a>. An OKN report by the Networking and Information Technology Research and Development (NITRD) program is directly downloadable from <a href="https://www.nitrd.gov/news/Open-Knowledge-Network-Workshop-Report-2018.aspx">https://<wbr/>www<wbr/>.nitrd<wbr/>.gov<wbr/>/news<wbr/>/Open<wbr/>-Knowledge<wbr/>-Network<wbr/>-Workshop<wbr/>-Report<wbr/>-2018<wbr/>.aspx</a>.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec15-7"/><b>15.7 Bibliographic Notes</b></h2>
<p class="noindent">We started this chapter by describing the effects of the Big Data movement on industry and government. While listing a full bibliography of Big Data would be a near-impossible task, we note some influential papers that have especially looked at industry, including Chen et al. (2012), Labrinidis and Jagadish (2012), Khan et al. (2017), Yin and Kaynak (2015), Xu and Duan (2019), Lee et al. (2014), and Bilal et al. (2016). There are fewer papers describing the influence of Big Data on government, but the output is still extensive, important examples including Kim et al. (2014), Bertot et al. (2014), Janssen and van den Hoven (2015), and Archenaa and Anita (2015). Janssen and van den Hoven (2015), in particular, describe an intersection of open Big Data with Linked Data, which has been discussed in previous chapters, but note that such a combination could potentially present challenges to transparency and privacy. When applying Big Data technology, including KG research, to government-scale problems, ethics and privacy are important issues to consider. Note that in some cases, industry and government interests can intersect in the Big Data research. Pan et al. (2017) is a useful example of how linked data and KGs can be exploited in large organizations.</p>
<p>Significant attention was given in this chapter to the Google Knowledge Graph. At this time, the blog post introducing the Google Knowledge Graph is still available online and can be accessed by consulting Singhal (2012). Since that time, many other papers have been written that have either described or improved on the Google Knowledge Graph in some aspects. Many modern papers on KGs directly cite the Google Knowledge Graph and its development as a motivating factor, sometimes for domain-specific cases. A partial list of works includes Hoffart et al. (2014), Ehrlinger and Wöß (2016), Steiner and Mirea <span aria-label="411" id="pg_411" role="doc-pagebreak"/>(2012), Steiner et al. (2012), Rotmensch et al. (2017), Vang (2013), Speer et al. (2017), and Paulheim (2017). Many patents have also been filed, a particularly important consideration when discussing innovation in industry. Two examples stemming from early work right after (or around the time of) the Google Knowledge Graph’s blog announcement include Eder (2012) and Ryu et al. (2013).</p>
<p>The chapter also described the KV by Dong et al. (2014), which is believed to be contributing valuable research to the Google Knowledge Graph. The KV itself draws on a long line of research in IE and graph priors, with some important works noted in earlier chapters on IE. Other papers by Carlson et al. (2010), Lao et al. (2011), Li and Grishman (2013), Nakashole et al. (2011), Niu et al. (2012b), and Wick et al. (2013) must also be noted in having influenced the KV. OpenCeres, which does Open IE on semistructured web data, can be understood to be a more modern and updated version of KV; it was described by Lockard et al. (2019).</p>
<p>Social media has intersected significantly with KGs in recent years. We described the OGP as one contributing technology; good references for OGP include Haugen (2010) and Open Graph Protocol (2016). Beyond OGP, KGs have been used to inform a number of important social media applications. Pan et al. (2018) describe how to detect fake news in a content-based manner using KGs; similarly, Shiralkar et al. (2017) have used KGs to support fact-checking. Several other researchers have used KGs and social media for personalized recommendation and ranking; see Karidi et al. (2018) and Zhang et al. (2016), among others. Another related work is Choudhury et al. (2017), on the construction and querying of dynamic KGs, especially the kind constructed over social media. In chapter 7 on nontraditional IE, we covered a broader spectrum of work on extracting information from tweets and constructing a KG thereof.</p>
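To make OGP concrete, the short sketch below renders the kind of <code>og:</code>-prefixed <code>&lt;meta&gt;</code> tags that the protocol uses to turn a webpage into an object in a social graph. The restaurant name and URLs are hypothetical, and the helper function is our own illustration rather than part of any OGP tooling:

```python
# Sketch: render Open Graph protocol (OGP) <meta> tags for a webpage.
# The restaurant details and URLs below are hypothetical examples.
def ogp_meta_tags(properties):
    """Render a dict of OGP properties as HTML <meta> tags."""
    return "\n".join(
        f'<meta property="og:{name}" content="{value}" />'
        for name, value in properties.items()
    )

tags = ogp_meta_tags({
    "title": "La Cucina di Nonna",
    "type": "restaurant.restaurant",
    "url": "https://example.com/la-cucina",
    "image": "https://example.com/la-cucina/storefront.jpg",
})
print(tags)
```

Placing such tags in a page’s <code>&lt;head&gt;</code> is, in essence, all that is needed for social media platforms to treat the page as a typed node with attributes.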
<p>Schema.org has rapidly expanded its presence on the web as a dominant form of structured data that can be processed and used by search engines. Good references include Guha et al. (2016) and Ronallo (2012), the latter of which also covers HTML5 microdata. An analysis of Schema.org is provided by Patel-Schneider (2014), and Meusel et al. (2015) does a web-scale study on its adoption and evolution over time. Mika (2015) provides an argument for why Schema.org is important for the web, and Nam and Kejriwal (2018) provide case studies on how organizations publish Schema.org markup. Hepp (2015) describes Schema.org for researchers and practitioners, especially with a view toward e-commerce. It is quite likely that we will continue to see increasing research on Schema.org during this decade as well.</p>
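As an illustration of the kind of structured markup organizations publish, the sketch below builds a minimal Schema.org <code>Restaurant</code> description in JSON-LD, one of the serializations that search engines process. All of the restaurant details are hypothetical, and the helper function is our own illustration:

```python
import json

# Sketch: build a Schema.org "Restaurant" description as JSON-LD.
# All details below are hypothetical.
def schema_org_restaurant(name, street, locality, opening_hours):
    return {
        "@context": "https://schema.org",
        "@type": "Restaurant",
        "name": name,
        "address": {
            "@type": "PostalAddress",
            "streetAddress": street,
            "addressLocality": locality,
        },
        "openingHours": opening_hours,
    }

snippet = schema_org_restaurant(
    "La Cucina di Nonna", "123 Main St", "Springfield", "Mo-Su 11:00-22:00"
)
# In a webpage, this JSON would be embedded inside
# <script type="application/ld+json"> ... </script>
print(json.dumps(snippet, indent=2))
```

Embedded in a page in this way, the markup can be picked up by crawlers and used, for example, to populate rich results in search engines.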
<p>OGD became a well-researched topic over the previous decade, owing not only to the rise of the Big Data movement, but also to governments releasing much of their data on web portals, often in structured form. An excellent introduction to the OGD movement is Ubaldi (2013). In other relevant works, Ding et al. (2011) describe a portal for linked OGD ecosystems, while Vetrò et al. (2016) describe a framework for open data quality <span aria-label="412" id="pg_412" role="doc-pagebreak"/>measurement (with an application toward OGD). Janssen et al. (2012) discuss the benefits and barriers of open data and open government, while Jetzek et al. (2014) more optimistically discuss innovations that could become possible through OGD. Another similar work along these lines is Chan (2013). Other good references, especially on specific initiatives or directives, include Janssen (2011) and Attard et al. (2015), the latter of which is a systematic review.</p>
<p>In the US, Data.gov has been the most prominent example of OGD. References for Data.gov (or on using it) include Ding et al. (2010a,b), Hendler et al. (2012), and Lakhani et al. (2002), among many others. More recent works include Wang et al. (2019) and Mahmud et al. (2020). In the UK, Shadbolt et al. (2012) discuss lessons derived from Data.gov.uk. For more details on the BBC, SW, and Linked Data, we recommend Raimond et al. (2014), which describes the BBC World Service radio archive, although several earlier papers on the BBC’s use of semantic technology can also be consulted.</p>
<p>OSM is another valuable resource described briefly in this chapter, one that has also received interest from the SW and Linked Data communities. Haklay and Weber (2008) and Bennett (2010) are good references for learning about the OSM project, while Anelli et al. (2016) and Auer et al. (2009) describe how SW and Linked Data can be brought into the picture. Other related work, especially pertaining to data quality and other issues (such as crowdsourcing) that are important for these kinds of databases, includes Haklay (2010) and Budhathoki and Haythornthwaite (2013).</p>
<p>We concluded this chapter with the OKN, which is too recent to have received much research coverage thus far. Some references that mention it, however, are Sheth et al. (2019), Dietz et al. (2018), Xiaogang (2019), Kejriwal (2019), and Alarabiat et al. (2018). White papers and more information can also be downloaded from an NITRD website dedicated to the topic (which can be accessed at <a href="https://www.nitrd.gov/nitrdgroups/index.php?title=Open_Knowledge_Network">https://<wbr/>www<wbr/>.nitrd<wbr/>.gov<wbr/>/nitrdgroups<wbr/>/index<wbr/>.php<wbr/>?title<wbr/>=Open<wbr/>_Knowledge<wbr/>_Network</a>). We note that the National Science Foundation has also allocated resources toward funding promising OKN initiatives.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec15-8"/><b>15.8 Exercises</b></h2>
<ul class="numbered">
<li class="NL">1. Try to look up a (subjectively determined) well-known, not as well-known, and completely unknown (e.g., your neighbor) entity on Google from the concept set {Person, Organization, Restaurant, Tourist Location}. For these 4 × 3 = 12 entities that you picked, how many were found to have a knowledge panel? Do all of your well-known entities have a knowledge panel, and do all of the completely unknown ones not have one? If there are unexpected cases, list a hypothetical explanation for the case. What would be one way for you to validate your hypothesis to determine if you’re right?</li>
<li class="NL">2. Now let’s consider the entities that are not as well known, but are also not completely unknown, however you might have interpreted that description. For such entities, if you found any that do not have a knowledge panel, what would be one way (short of <span aria-label="413" id="pg_413" role="doc-pagebreak"/>having direct edit access to the Google Knowledge Graph) to increase the chances of finding a knowledge panel on that entity a short time (say a week, or even a month) from today? <i>Hint: In reading our description of the Google Knowledge Graph, as well as other resources you can find about it online, what can you determine about the raw sources of information that ultimately get processed and eventually become entities, relations, and attributes in the Google Knowledge Graph?</i></li>
<li class="NL">3. Suppose that you are opening a new restaurant and trying to set up a website for it. Other than the aesthetics of having a nice-looking website with no downtime, you are looking to leverage a Schema.org-based search engine optimization (SEO) strategy for appearing high in Google search rankings, and which may help attract more patrons. Describe what such a strategy would look like. What kinds of Schema.org snippets would you look into integrating into your website, and how often would you change them? Is timeliness very important? Why or why not?</li>
<li class="NL">4. By drawing on the examples and information listed in the OGP resource (<a href="https://ogp.me/">http<wbr/>-s:<wbr/>/<wbr/>/<wbr/>ogp<wbr/>.me<wbr/>/</a>), how might you integrate your restaurant’s profile and information into a social graph (i.e., try to describe, being as specific as possible and using snippets of code, how you would turn your restaurant’s webpage into a graph object using OGP).</li>
<li class="NL">5. List some commercial domains where you believe there is no use for Schema.org and/or where it is overkill to try and include it in a website. If you have found such a domain, could you generalize and state what properties of the domain (criteria) make it amenable or not amenable to the kinds of Schema.org SEO strategies that you considered for the restaurant domain in the previous exercise? Try to provide a list of five diverse domains that you believe would <i>not</i> be amenable to a Schema.org-based SEO strategy, and state the expected reasons (using your criteria).</li>
<li class="NL">6. Recall the OGD principles listed in <a href="chapter_15.xhtml#tab15-2">table 15.2</a>. Could you think of data sets that have the first four properties (i.e., they are complete, primary, timely, and accessible), but not the last four?</li>
<li class="NL">7. A think tank that studies the intersection of government and geospatial data hires you as a consultant expressing a desire to develop an app that their employees and social scientists can use to (i) visualize major cities and regions of the world using maps; and (ii) display demographic, economic, and other relevant socioeconomic and political indicators at appropriate granularities (e.g., county-level income data). You have been hired because of your expertise in KGs and the think tank’s belief that a properly designed KG can help them develop a scalable and sustainable answer to their needs. You will be working with a user experience designer and developers, and your main task is to help them build such a KG. Making appropriate assumptions where needed, draft an architecture document for constructing, completing, and accessing the KG to serve the needs of the think tank as described. In particular, drawing on the material in both this chapter and chapter 14, be precise about the raw and auxiliary data sources <span aria-label="414" id="pg_414" role="doc-pagebreak"/>that you will use or need to realize this vision. Try to limit yourself to open data sets as much as possible.</li>
<li class="NL">8. You have been successful in the endeavor from exercise 7, and the think tank has received new funding due to the impressive capabilities of the KG. It now wants to develop a powerful new version of the app that is able to integrate news data on a daily basis. Specifically, they have hired a machine learning expert who will develop algorithms to derive real-valued signals (e.g., indicators of local violence, political unrest, strikes, and other significant events) from news data and send it to your KG. You may further assume that you have access to a service (e.g., LexisNexis: <a href="https://www.lexisnexis.com/en-us/gateway.page">https://<wbr/>www<wbr/>.lexisnexis<wbr/>.com<wbr/>/en<wbr/>-us<wbr/>/gateway<wbr/>.page</a>) that delivers a daily dump of news articles to the think tank’s server, and you have a full suite of tools at your disposal, including Named Entity Recognition (NER), KG embeddings (KGEs), and others. How might you extend the architecture you drafted earlier? Would changes be needed at the level of the KG’s ontology to accommodate these indicator signals? What additional open data sources could you bring to bear to help the machine learning expert achieve better accuracy?</li>
</ul>
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_15.xhtml#fn1x15-bk" id="fn1x15">1</a></sup> The authors of Lockard et al. (2019) describing OpenCeres are currently employed at Amazon.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_15.xhtml#fn2x15-bk" id="fn2x15">2</a></sup> Multipurpose Internet Mail Extensions.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_15.xhtml#fn3x15-bk" id="fn3x15">3</a></sup> <a href="https://schema.org/">https://<wbr/>schema<wbr/>.org<wbr/>/</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_15.xhtml#fn4x15-bk" id="fn4x15">4</a></sup> <a href="http://www.oecd.org/gov/digital-government/open-government-data.htm">http://<wbr/>www<wbr/>.oecd<wbr/>.org<wbr/>/gov<wbr/>/digital<wbr/>-government<wbr/>/open<wbr/>-government<wbr/>-data<wbr/>.htm</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_15.xhtml#fn5x15-bk" id="fn5x15">5</a></sup> <a href="https://smartcitiescouncil.com/article/hill-smart-cities-week-tackling-opportunities-and-challenges">https://<wbr/>smartcitiescouncil<wbr/>.com<wbr/>/article<wbr/>/hill<wbr/>-smart<wbr/>-cities<wbr/>-week<wbr/>-tackling<wbr/>-opportunities<wbr/>-and<wbr/>-challenges</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_15.xhtml#fn6x15-bk" id="fn6x15">6</a></sup> <a href="https://www.gartner.com/en/newsroom/press-releases/2019-02-18-gartner-identifies-top-10-data-and-analytics-technolo">https://<wbr/>www<wbr/>.gartner<wbr/>.com<wbr/>/en<wbr/>/newsroom<wbr/>/press<wbr/>-releases<wbr/>/2019<wbr/>-02<wbr/>-18<wbr/>-gartner<wbr/>-identifies<wbr/>-top<wbr/>-10<wbr/>-data<wbr/>-and<wbr/>-analytics<wbr/>-technolo</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_15.xhtml#fn7x15-bk" id="fn7x15">7</a></sup> Examples of such coverage include https://<wbr/>www<wbr/>.engadget<wbr/>.com<wbr/>/2014<wbr/>-08<wbr/>-21<wbr/>-google<wbr/>-knowledge<wbr/>-vault<wbr/>.html, https://<wbr/>zebratechies<wbr/>.com<wbr/>/google<wbr/>-knowledge<wbr/>-graph<wbr/>-knowledge<wbr/>-vault<wbr/>-and<wbr/>-how<wbr/>-its<wbr/>-impact<wbr/>-on<wbr/>-serp, and https://<wbr/>www<wbr/>.searchenginenews<wbr/>.com<wbr/>/sample<wbr/>/update<wbr/>/entry<wbr/>/understanding<wbr/>-googles<wbr/>-knowledge<wbr/>-vault.</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html>