<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch14" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch14"><span aria-label="367" id="pg_367" role="doc-pagebreak"/>14</h1>
<h1 class="chapter-title"><b>Linked Data</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b> Knowledge graphs (KGs) have by now been adopted in several communities and ecosystems. In many of these communities, adoption continues to grow, in some cases superlinearly. One community that was an early adopter of publishing KGs on the web using a core set of principles is the Semantic Web (SW) community. These four principles, termed the <i>Linked Data principles</i>, have together yielded a web-based ecosystem that contains many interconnected KGs and currently spans many billions of triples. Many data sets in this ecosystem are open, which has led to the moniker <i>Linked Open Data</i> (LOD). In this chapter, we describe the principles and their collective impact. We also review important facts about the most central and influential KGs in the LOD ecosystem.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec14-1"/><b>14.1 Introduction</b></h2>
<p class="noindent">At the core of the web’s success is the ability for anyone, anywhere, to publish, link, and consume information using simple protocols like HTTP that can be “layered” over underlying Internet protocols like TCP and IP. Yet it is important to remember that much of this information is designed to be consumed by <i>humans</i>. Most webpages that you have likely browsed to date have consisted mostly of text and images, which, for context, are linked to other similar webpages with text and images. One of the motivations that we offered for KGs in the introduction is that they have a structure that machines can consume and reason over more easily, accurately, and scalably than they can consume text and images (at least at the present time). In this respect, KGs were offered as a convenient interface that can be produced, either directly or indirectly [i.e., via information extraction (IE)] by humans or from human content, and are more meaningful to machines than natural text representations. Unfortunately, and largely due to reasons both historical and rooted in convenience, the HTML pages that we often access and consume in our browsers do not have such machine-amenable structure.</p>
<p>Equally important is the growing need to make data, not just documents, a first-class citizen of the web in support of an emerging data economy. What this really means is that a <i>systematic</i> framework is desired for publishing, representing, and providing <i>direct</i> <span aria-label="368" id="pg_368" role="doc-pagebreak"/>access to raw data that currently needs to be wrapped in an HTML document before being publicly exposed on the web. Yet, as noted at the start of this chapter, the web is designed to render documents in a browser for human consumption. How can we publish raw data using such a systematic framework without redesigning the web itself?</p>
<p>The Linked Data movement, a direct product of a grassroots effort called the World Wide Web Consortium (W3C) Linked Open Data (LOD) project<sup><a href="chapter_14.xhtml#fn1x14" id="fn1x14-bk">1</a></sup> that was founded in January 2007, emerged as a potential (albeit not unique) solution to this problem. In the years since, the movement has grown, and many data sets have been published using the Linked Data principles (described in the subsection entitled “Overall Impact”). Wikipedia is an important source of raw data that has often been used to populate linked data sets. <a href="chapter_14.xhtml#fig14-1" id="rfig14-1">Figure 14.1</a> illustrates how Wikipedia infoboxes (described on the next page) can be a valuable source of structured data for populating knowledge bases (KBs) such as DBpedia. However, even before (and alongside) the Linked Data movement, a number of need-driven efforts have attempted to impose some degree of structure on the web to make its content more machine-readable. Important efforts are briefly described next.</p>
<div class="figure">
<figure class="IMG"><a id="fig14-1"/><img alt="" src="../images/Figure14-1.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig14-1">Figure 14.1</a>:</span> <span class="FIG">Wikipedia infoboxes have been used to automatically populate KBs such as DBpedia on a large scale.</span></p></figcaption>
</figure>
</div>
<ul class="numbered">
<li class="NL">1. <b>Microformats and Microdata:</b> Microformats and microdata are snippets of data embedded in HTML pages like regular markup. They are meant to describe structured data, are usually restricted to specific entity categories, and are constrained by choice of vocabulary. Although this limits their applicability, an advantage is that they can be seamlessly integrated into HTML and can be identified and extracted programmatically using syntax alone (<a href="chapter_14.xhtml#fig14-2" id="rfig14-2">figure 14.2</a>). A major disadvantage, however, is that it is often impossible to use either microformats or microdata to express complex relationships between entities. While we do not go into the differences here, microformats are typically limited to the vocabulary officially maintained on the Microformats.org page, while microdata can use arbitrary vocabularies. In practice, however, the Schema.org vocabulary has rapidly emerged as the de facto microdata vocabulary due to widespread support and consumption by search engines like Google and Bing. At this time, the adoption of Schema.org has reached web scale, especially for popular domains like movies, tourist attractions, and products.</li>
</ul>
<div class="figure">
<figure class="IMG"><a id="fig14-2"/><img alt="" src="../images/Figure14-2.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig14-2">Figure 14.2</a>:</span> <span class="FIG">An illustration of microdata (Schema.org) for a movie website (the <i>Rotten Tomatoes</i> page for the 2019 <i>Lion King</i> remake). Elements such as the rating are embedded in the source HTML as Schema.org (shown using the dark, solid oval) and could be extracted into a KG based on syntax. Search engines find it easier to work with such semantically rich data for precisely this reason, leading to more informed search results for a querying user. The manifestations of Schema.org snippets (including the ratings both on the <i>Rotten Tomatoes</i> page and in a Google search) are shown using the two dashed ovals.</span></p></figcaption>
</figure>
</div>
<ul class="numbered">
<li class="NL">2. <b>Web APIs and Mashups:</b> Companies such as Amazon expose their product data via web application programming interfaces (APIs) that are consumed by third parties (many of which are small, specialized businesses) that attempt to generate or boost product sales by using Amazon as the centralized marketplace for conducting transactions. However, web APIs have a significant presence even beyond online marketplaces. Many key players in the emerging data economy mentioned in this discussion are data providers (e.g., Knoema.com) that provide programmatic access <span aria-label="369" id="pg_369" role="doc-pagebreak"/>through natively developed APIs. Social media platforms like Facebook and Twitter are also major adopters of APIs. Due to the proliferation of APIs, websites like ProgrammableWeb now maintain directories listing web APIs. One can creatively combine and use these APIs in tandem to create novel applications, called <i>mashups</i>.</li>
<li class="NL">3. <b>Templated HTML:</b> Many websites, such as Amazon and IMDB, are populated from a set of underlying databases (whether NoSQL or relational). Although the database itself tends to be proprietary, the pages can be crawled and, if there is sufficient regularity in the schema, the information can be extracted using wrappers and other IE tools. Unlike the other options, however, this one involves considerable noise and is only a partial solution to the original problem of exposing data wrapped in HTML documents. Even if it were possible to build perfect wrappers, a wrapper would have to be customized for each schema type in each website. This is clearly not a scalable solution to the problem of seamlessly accessing and using data published on the web.</li>
<li class="NL">4. <b>Infoboxes:</b> Infoboxes are structured pieces of information (usually key-value pairs) commonly employed by websites like Wikipedia to summarize and convey important information about an entity (<a href="chapter_14.xhtml#fig14-1">figure 14.1</a>). Despite their simplicity, their importance should not be underestimated, as it is far easier to extract infoboxes automatically than to wrap arbitrary website templates.</li>
</ul>
<p>Interestingly, the efforts themselves illustrate a smoother spectrum (in realizing a machine-amenable web, while still keeping the web friendly for human and browser consumption) than one might have anticipated. For example, infoboxes on websites like Wikipedia were <span aria-label="370" id="pg_370" role="doc-pagebreak"/>arguably designed to summarize the data for a human being, but are now used by search engines like Google and are also crucial to some specific Linked Data KGs like DBpedia (discussed in the subsection entitled “DBpedia”). On the other hand, APIs were almost exclusively designed to provide <i>programmatic access</i> to resources. It is still rather unusual to encounter non–computer scientists who use APIs directly in their work, as APIs are too low-level and are primarily preferred by application designers.</p>
<p><span aria-label="371" id="pg_371" role="doc-pagebreak"/>Furthermore, although the options given here are powerful in their own right, they are also <i>piecemeal</i>, each option being designed organically to serve a particular niche or purpose. For example, it is difficult to define, extend, and consume web APIs in the same way as one would infoboxes. For any given option, there are severe disadvantages in adopting it for publishing arbitrary data over the web. Web APIs are customized to each data set and provider, and microformats express only a narrow range of entities and attributes (in addition to the interentity relationship specification problem mentioned earlier).</p>
<p>More fundamentally, the leading figures in the development of the web (including the inventor of the web, Tim Berners-Lee) have increasingly come to recognize that the next step in the web’s evolution will be to move from an interlinked repository of documents, which is still how much of it is structured today, to an interlinked repository of <i>things</i>. Recall the advanced querying strategies covered in part IV of this book, and imagine that we could apply such techniques, or pose such queries, against the <i>entire web</i>, not just a stand-alone triplestore or a NoSQL server. This is the broader vision that Linked Data, as well as the larger SW ecosystem in which it is embedded, seeks to fulfill. The vision seeks to make things (usually named <i>entities</i>) rather than documents first-class citizens of the web, but there must be a way to express rich (i.e., typed) relationships between things. In contrast, current hyperlinks on the web are untyped, and their semantics are unspecified. In other words, not only do we not know <i>why</i> a page is linking to another page, but we do not even know what the link means. For any application that needs to rely on a rich set of semantics, such as fine-grained search, untyped links are inadequate.</p>
<p>Although it is easy to confuse so-called Linked Data with the actual data being published (a tendency not helped either by the name or the occasional misuse of the term in the literature), Linked Data refers to a set of four principles that specify how data (especially structured data in the form of entities, attributes, and relationships) should be published on the web. The principles have a direct analogy to simple standards that have made the document web so popular (<a href="chapter_14.xhtml#tab14-1" id="rtab14-1">table 14.1</a>). Furthermore, many Linked Data data sets have been published under an open license and are collectively referred to as <i>Linked Open Data</i> (LOD). More broadly, the <i>Web of Linked Data</i> comprises both open and nonopen data sets (and in some cases, even hybrid models like the Linking Open Drug Data initiative, wherein commercial entities share some of their data publicly) published using Linked Data principles.</p>
<div class="table">
<p class="TT"><a id="tab14-1"/><span class="FIGN"><a href="#rtab14-1">Table 14.1</a>:</span> <span class="FIG">The Linked Data principles, enumerated.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>Principle 1:</b> Use URIs as names for things.</p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB"><b>Principle 2:</b> Use HTTP URIs so that people can look up those names.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><b>Principle 3:</b> When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).</p></td>
</tr>
<tr>
<td class="TB"><p class="TB"><b>Principle 4:</b> Include links to other URIs, so that they can discover more things.</p></td>
</tr>
</tbody>
</table>
</figure>
</div>
<section epub:type="division">
<h3 class="head b-head"><span aria-label="372" id="pg_372" role="doc-pagebreak"/><a id="sec14-1-1"/><b>14.1.1 Principle 1: Use Uniform Resource Identifiers for Naming Things</b></h3>
<p class="noindent">The W3C,<sup><a href="chapter_14.xhtml#fn2x14" id="fn2x14-bk">2</a></sup> which is the main international standards organization for the World Wide Web [and by extension, the Semantic Web (SW)] has informally described<sup><a href="chapter_14.xhtml#fn3x14" id="fn3x14-bk">3</a></sup> Uniform Resource Identifiers (URIs) as “short strings that identify resources in the web: documents, images, downloadable files, services, electronic mailboxes, and other resources.” Uniform Resource Locators (URLs) are a special case of URIs, which are themselves special cases of Internationalized Resource Identifiers (IRIs). We do not delve deeply into all the differences between these terms; instead, the important thing to remember is that URIs codify a <i>naming standard</i> (i.e., not every short string is a URI).</p>
<p>What things can have a URI? The rule of thumb is: anything that can be given a name. This covers a surprisingly broader range than the document web, including (1) “real” (including flesh-and-blood) entities, such as celebrities and food items; (2) geographical entities, such as countries and cities; (3) mathematical and abstract entities like the Pythagorean theorem; (4) documents, such as the ones dereferenced on the document web; and (5) digital content, such as videos and web APIs. Referring again to <a href="chapter_14.xhtml#tab14-1">table 14.1</a>, because Linked Data builds atop the existing web architecture, the formal term “resource” is used to refer to all of these objects, and as per the first principle, a resource must be named and identified using a URI.</p>
<p>We also note that blank nodes are generally not permitted [contrast this with the definition of Resource Description Framework (RDF) in earlier chapters, which allowed subjects and objects in triples to be blank nodes]. One reason is that blank node identifiers tend to be local to the data set and therefore violate the second Linked Data principle, which calls for URIs to be dereferencable. To publish a blank node using Linked Data principles, we would have to give it a dereferencable name. For example, if a particular blank node is describing all the marriages of celebrity X, then the node would have to be given a name like “Celebrity-X-Marriages” and specified using a <i>globally unique</i> URI.</p>
<p><span aria-label="373" id="pg_373" role="doc-pagebreak"/><a href="chapter_14.xhtml#tab14-2" id="rtab14-2">Table 14.2</a> lists examples of entities, along with their real-world URIs in the LOD data set DBpedia. Note that it is no coincidence that the URIs strikingly resemble URLs (principle 2) that can be accessed using protocols like HTTP. While it may seem obvious when stated explicitly, it is important to keep in mind that URIs are not the only established (or standardized) naming mechanism on the web. In particular, the first Linked Data principle does not allow naming standards like digital object identifiers (DOIs) that are predominant in the academic publishing community.</p>
<div class="table">
<p class="TT"><a id="tab14-2"/><span class="FIGN"><a href="#rtab14-2">Table 14.2</a>:</span> <span class="FIG">Examples of DBpedia entities with URIs.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>Entity</b></p></th>
<th class="TCH"><p class="TB"><b>URI</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB">John Lennon</p></td>
<td class="TB"><p class="TB"><a href="http://dbpedia.org/resource/John_Lennon">http://<wbr/>dbpedia<wbr/>.org<wbr/>/resource<wbr/>/John<wbr/>_Lennon</a></p></td>
</tr>
<tr>
<td class="TB"><p class="TB">The Beatles</p></td>
<td class="TB"><p class="TB"><a href="http://dbpedia.org/resource/The_Beatles">http://<wbr/>dbpedia<wbr/>.org<wbr/>/resource<wbr/>/The<wbr/>_Beatles</a></p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Billy Preston</p></td>
<td class="TB"><p class="TB"><a href="http://dbpedia.org/resource/Billy_Preston">http://<wbr/>dbpedia<wbr/>.org<wbr/>/resource<wbr/>/Billy<wbr/>_Preston</a></p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Pythagorean theorem</p></td>
<td class="TB"><p class="TB"><a href="http://dbpedia.org/resource/Pythagorean_theorem">http://<wbr/>dbpedia<wbr/>.org<wbr/>/resource<wbr/>/Pythagorean<wbr/>_theorem</a></p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Judy Garland</p></td>
<td class="TB"><p class="TB"><a href="http://dbpedia.org/resource/Judy_Garland">http://<wbr/>dbpedia<wbr/>.org<wbr/>/resource<wbr/>/Judy<wbr/>_Garland</a></p></td>
</tr>
</tbody>
</table>
</figure>
</div>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec14-1-2"/><b>14.1.2 Principle 2: Use HTTP Uniform Resource Identifiers</b></h3>
<p class="noindent">Principle 2 states that it should be possible to look up URIs using the widely established HTTP. On the surface, there is not much to say about this rule, but its importance should not be neglected. Principle 2 is what brings Linked Data data sets into the web ecosystem, because a web protocol (and a browser) can be used for looking up things. Note that this is <i>not</i> true for naming schemes like Uniform Resource Names (URNs) and DOIs, which are not designed to be dereferenced using standard web protocols. Similar to DOIs, the protocol also establishes a uniqueness constraint on web access and visibility because two different resources cannot have the same URI and still be HTTP-accessible over the web. However, this does <i>not</i> mean that (1) the same resource cannot be referred to by different URIs, a rampant problem that can only be solved automatically, by instance matching (IM); or (2) every URI is HTTP-dereferencable. Technically, this also explains why principle 2 is necessary to begin with, as a URI can also be a URN (a globally unique name, but without an access mechanism). Because of principle 2, URN-URIs are ruled out, and even URLs are constrained, because the access mechanism must be HTTP and not some other protocol like the File Transfer Protocol (FTP).</p>
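<p>The constraint that principle 2 places on URIs can be illustrated with a minimal Python sketch. The helper name <i>is_http_uri</i> and the sample URN, DOI-style, and FTP identifiers are hypothetical, and accepting the <i>https</i> scheme alongside <i>http</i> is an assumption of the sketch. The check inspects only the scheme and host; it says nothing about whether a server would actually answer.</p>

```python
from urllib.parse import urlsplit


def is_http_uri(uri: str) -> bool:
    """Return True if the URI uses the http/https scheme and names a host."""
    parts = urlsplit(uri)
    # Assumption: https is treated as acceptable alongside http.
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Sample identifiers; all but the first are made up for illustration.
candidates = [
    "http://dbpedia.org/resource/John_Lennon",  # HTTP URI
    "urn:isbn:0000000000",                      # URN: a name without an access mechanism
    "doi:10.0000/example",                      # DOI-style name
    "ftp://example.org/data.nt",                # a URL, but not an HTTP one
]

for uri in candidates:
    print(uri, is_http_uri(uri))
```

<p>Only the first identifier passes the check, which mirrors the point made above: URN-style names are ruled out entirely, and URLs qualify only when the access mechanism is HTTP.</p>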
<p class="TNI-H3"><b>14.1.2.1 The Problem of Dereferencing</b> There is yet another subtlety that comes into play when we consider what makes the URI representing a resource dereferencable. Taken literally, “dereferencable” means that the resource must be fetched over the wire. In programming environments, as well as the document web, the resource is a data item (such as the element of an array) or a document, and dereferencing the item does not involve <span aria-label="374" id="pg_374" role="doc-pagebreak"/>any controversy. However, recall that, as per the first Linked Data principle, URIs can be used to name <i>actual</i> things, not just the documents describing those things. Ludicrously, dereferencing the “thing URI” would then mean sending the actual thing over the web.</p>
<p>By consensus, two workarounds have been devised to address this issue, without abandoning the conceptual elegance of principle 1. Both strategies are designed to ensure that there is no confusion between the objects themselves and the documents that describe them, and that both humans and machines can retrieve the representations best suited for them. The two solutions are called <i>303 URIs</i> and <i>hash URIs</i>. In the first strategy, the server responds to the client (when dereferencing the URI of an actual object) with a <i>303 redirect</i> by sending the HTTP response code <i>303 See Other</i>, along with the URI of a web document that describes the real-world object. Although it sounds simple, the 303 URI strategy involves four steps. First, the client performs an HTTP GET request on the URI. Second, the server receives the request, recognizes that the URI identifies a real-world object or abstract concept, and responds with the 303 redirect, as described previously. Third, the client performs another HTTP GET request using the new URI provided by the server. Finally, the server replies with the HTTP response code <i>200 OK</i> and sends the client the requested document.</p>
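<p>The four steps of the 303 URI strategy can be traced with a small, self-contained Python simulation. It is only a sketch: no network traffic is involved, the “server” is an ordinary function, and the two example.org URIs and the document text are hypothetical.</p>

```python
# Hypothetical URIs: the "thing" URI names a real-world object, and the
# "document" URI names a web document that describes that object.
THING_URI = "http://example.org/resource/Judy_Garland"
DOC_URI = "http://example.org/data/Judy_Garland"

THINGS = {THING_URI: DOC_URI}
DOCUMENTS = {DOC_URI: "RDF description of Judy Garland"}


def serve(uri):
    """Toy server: return (status, headers, body) for a GET request."""
    if uri in THINGS:
        # The URI identifies a real-world object: redirect to its description.
        return 303, {"Location": THINGS[uri]}, ""
    if uri in DOCUMENTS:
        return 200, {}, DOCUMENTS[uri]
    return 404, {}, ""


def dereference(uri):
    """Toy client: GET the URI and follow at most one 303 redirect."""
    requests_made = 1
    status, headers, body = serve(uri)                      # step 1: first GET
    if status == 303:                                       # step 2: 303 See Other
        requests_made += 1
        status, headers, body = serve(headers["Location"])  # step 3: second GET
    return status, body, requests_made                      # step 4: 200 OK and document


print(dereference(THING_URI))
# prints (200, 'RDF description of Judy Garland', 2)
```

<p>The third element of the result makes the cost of the strategy explicit: retrieving the description of a real-world object takes two GET requests.</p>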
<p>The <i>hash URI</i> strategy was designed to address one of the main criticisms of the 303 URI strategy, namely, that it requires two HTTP requests to retrieve the description of the real-world object. The hash URI strategy avoids this by exploiting the property that a URI may contain a special part (the <i>fragment identifier</i>) separated from the base part by a hash symbol. For example, consider the URIs <a href="http://companyY.com/schema/division#Marketing">http://<wbr/>companyY<wbr/>.com<wbr/>/schema<wbr/>/division#Marketing</a> and <a href="http://companyY.com/schema/division#Operations">http://<wbr/>companyY<wbr/>.com<wbr/>/schema<wbr/>/division#Operations</a>. When a client wants to retrieve such a URI, HTTP requires the fragment part to be stripped off before the request is made to the server. This implies that a URI containing a hash cannot be retrieved <i>directly</i>, and thus does not identify a web document. Therefore, such URIs can be used to identify real-world objects and abstract concepts unambiguously.</p>
<p>How would the hash strategy work in practice? First, the client would truncate the URI, removing the fragment, and then connect to the server using a GET request (for either sample URI, it would be <a href="http://companyY.com/schema/division">http://<wbr/>companyY<wbr/>.com<wbr/>/schema<wbr/>/division</a> after stripping the fragment). The server answers by sending the requested document (typically RDF/XML). At this point, the client would need to be Linked Data aware, as it will inspect the response and find triples that tell it more about the resource that was originally requested (with the fragment). For example, if <a href="http://companyY.com/schema/division#Marketing">http://<wbr/>companyY<wbr/>.com<wbr/>/schema<wbr/>/division#Marketing</a> was originally requested, the client would likely discard all triples that do not describe the marketing division, even though a single file (retrieved using GET on <a href="http://companyY.com/schema/division">http://<wbr/>companyY<wbr/>.com<wbr/>/schema<wbr/>/division</a>) describes all divisions.</p>
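<p>The client-side behavior just described can be sketched in a few lines of Python. The sketch assumes the server’s response has already been parsed into (subject, property, value) tuples; the property name <i>rdfs:label</i> and the literal values are hypothetical stand-ins for whatever the retrieved file actually contains.</p>

```python
from urllib.parse import urldefrag

requested = "http://companyY.com/schema/division#Marketing"

# The client strips the fragment before issuing the GET request.
document_uri, fragment = urldefrag(requested)

# Stand-in for the parsed response: one file that describes all divisions.
# The property name and the literal values are made up for illustration.
BASE = "http://companyY.com/schema/division"
triples = [
    (BASE + "#Marketing", "rdfs:label", "Marketing"),
    (BASE + "#Operations", "rdfs:label", "Operations"),
]

# A Linked Data-aware client keeps only the triples whose subject is the
# URI that was originally requested (fragment included).
relevant = [t for t in triples if t[0] == requested]

print(document_uri)
print(relevant)
```

<p>A single GET on the stripped URI suffices, at the price of some filtering work on the client.</p>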
<p>Note that both the 303 URI and hash URI strategies have their pros and cons. The 303 URIs tend to serve resource descriptions that are part of large data sets, such as descriptions <span aria-label="375" id="pg_375" role="doc-pagebreak"/>of individual concepts from open-world KGs like DBpedia. Meanwhile, hash URIs tend to be used for identifying terms within RDF vocabularies, and because the Linked Data–aware client usually has to do additional work on the fetched file, the file tends to be smaller. Hash URIs are also useful when RDF is embedded into HTML pages using RDF in attributes<sup><a href="chapter_14.xhtml#fn4x14" id="fn4x14-bk">4</a></sup> (RDFa). However, we note that the two strategies are not mutually exclusive, and in fact, it is possible to combine their benefits. We could use the hash (following it with an indicative word like “this”) to distinguish between the URI for the actual object and the URI for the document (the stripped URI). No additional processing is required by the client because the retrieved file is the document describing that resource. Only one GET request is needed because the protocol itself mandates the stripping of the fragment identifier, so a 303 redirect is not required to begin with (because after stripping the fragment, the object URI turns into the document URI).</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec14-1-3"/><b>14.1.3 Principle 3: Provide Useful Information on Lookup Using Standards</b></h3>
<p class="noindent">Although principle 2 stipulates the lookup mechanism, it does not say anything about the content that should be retrieved upon lookup, or the representation of that content. Principle 3 bridges this gap. It states that when a resource is looked up using HTTP, useful information should be retrieved using standards like RDF and SPARQL. Generally, the latter implies that the information should be exposed using a SPARQL end point, while the former places restrictions on serialization and representation. However, it is not enough for the data merely to be accessible using SPARQL, or to be serialized in an official RDF format. The data must also be sufficiently useful to whoever dereferences the URI associated with the resource, whether a human or a computer program.</p>
<p>For this reason, unlike the first two principles, principle 3 is more subjective. It is easy to verify when a resource is fulfilling the first two principles, as there is nothing subjective or uncertain about mechanisms like HTTP or standards like URI. However, it is quite possible for a resource to display information that (arguably) is not useful, even if it obeys standards like SPARQL and RDF. Most people, however, have an intuitive sense of what makes a useful lookup (e.g., upon lookup, a basic description of the resource should be provided). An important aspect of usefulness is <i>vocabulary reuse</i>. For example, if we are trying to publish Linked Data about someone’s dog, common properties (such as the name of the dog) can be borrowed from vocabularies like Friend of a Friend (FOAF; for more details, see the subsection entitled “Friend of a Friend”); a new property should not be invented for this purpose. This also illustrates the subjectivity of the rule, as we could potentially end up creating a property if an equivalent property from an established vocabulary cannot be found, or if the semantics of the property are different from what may have been intended. To follow this rule, therefore, some judgment (and good faith) <span aria-label="376" id="pg_376" role="doc-pagebreak"/>is inevitably required, along with knowledge that is common to the Linked Data and SW communities, but not necessarily other communities.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec14-1-4"/><b>14.1.4 Principle 4: Link New Data to Existing Data</b></h3>
<p class="noindent">The first three principles provide guidance on how to publish, represent, and access data, but they say nothing about how different data items should link to each other in the same way that documents on the regular web contain hyperlinks to one another. Connectivity among constituent elements is an important, if not the central, factor in what makes the web special and powerful. Because Linked Data is designed to ultimately serve the web, the fourth Linked Data principle states that new data should not be published in silos, but rather should be connected to existing data sets. Just as hyperlinks are crucial for connecting pages from different servers into a single global information space, principle 4 is crucial for maintaining the web’s follow-your-nose architecture by mandating the publishing of typed links among URIs (“things”) in different data sets. Several types of connecting links are possible, in addition to internal or local links that connect two or more URIs within the data set itself.</p>
<p>Note that, in addition to its other benefits, principle 3 facilitates easy syntactic adoption of principle 4 because RDF does not have to be limited to a single namespace. So long as the nonliteral elements in an RDF triple obey the definition of URIs, the namespace of the URI makes no difference to syntactic validity. Thus, the entire “data set,” which is not just the data set itself but also its links to external data sets, can be released as a single RDF dump, or it may be queryable from a single SPARQL end point, if desired. In practice, it is not uncommon for LOD providers (such as DBpedia, discussed in the subsection “DBpedia”) to publish external and internal links in separate files when exposing the data sets as N-Triples dumps.</p>
<p>Suppose that the data set we want to publish using Linked Data principles declares its URIs in a single namespace. Assuming that the RDF triple is an object triple (i.e., the object is a URI, not a literal, typed or primitive), we say that the relationship represented by the triple is <i>external</i> if the namespaces of the subject and object are different, and <i>internal</i> otherwise. For example, the relationship in the triple (<i>dbr:Judy_Garland, owl:sameAs, yago:Judy_Garland</i>) is an external relation, where <i>dbr:</i> and <i>yago:</i> are shorthand prefixes for <a href="http://dbpedia.org/resource/">http://<wbr/>dbpedia<wbr/>.org<wbr/>/resource<wbr/>/</a> and <a href="http://yago-knowledge.org/resource/">http://<wbr/>yago<wbr/>-knowledge<wbr/>.org<wbr/>/resource<wbr/>/</a>, respectively. Conversely, in the triple (<i>dbr:Judy_Garland, dbo:spouse, dbr:Mickey_Deans</i>), the relationship is internal because the namespaces of the object and subject are identical.</p>
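<p>The internal/external distinction reduces to a namespace comparison, as the following minimal Python sketch suggests. The function names are hypothetical, triples are represented as plain tuples of prefixed names, and only the subject and object prefixes need to be expanded.</p>

```python
# Prefix table for the subjects and objects used in the examples above.
PREFIXES = {
    "dbr": "http://dbpedia.org/resource/",
    "yago": "http://yago-knowledge.org/resource/",
}


def namespace(term: str) -> str:
    """Return the namespace URI of a prefixed name such as 'dbr:Judy_Garland'."""
    prefix, _, _local_name = term.partition(":")
    return PREFIXES[prefix]


def relationship_kind(triple) -> str:
    """Classify an object triple by comparing subject and object namespaces."""
    subject, _predicate, obj = triple
    return "internal" if namespace(subject) == namespace(obj) else "external"


print(relationship_kind(("dbr:Judy_Garland", "owl:sameAs", "yago:Judy_Garland")))
print(relationship_kind(("dbr:Judy_Garland", "dbo:spouse", "dbr:Mickey_Deans")))
```

<p>The first call reports an external relationship and the second an internal one. Note that the predicate plays no role in the classification; only the namespaces of the subject and object matter.</p>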
<p>Principle 4 encourages (or, stated more strongly, <i>requires</i>) the publication of external-relationship triples connecting entities among different RDF data sets (or sets of RDF triples that have different namespaces), and the properties that are most favored for doing so are <i>owl:sameAs</i> and <i>skos:related</i>. We briefly alluded to <i>owl:sameAs</i> in chapter 2, and again when describing querying in part IV. The most important aspect to remember about such properties is that they are predefined with established semantics; that is, all <span aria-label="377" id="pg_377" role="doc-pagebreak"/>practitioners are supposed to use these properties with roughly the same semantic intent (in the case of <i>owl:sameAs</i>). By comparison, <i>skos:related</i> is more controversial; it clearly depends on the subjective frame of reference of the data set publisher. Some publishers may be too aggressive in declaring two entities to be related, while others may see a weak or nonexistent relationship. We see such issues arise in the broader web as well. Many see this diversity as an advantage and a core feature of the web (and by extension, Linked Data), but on occasion, depending on the task at hand, it can lead to noise and quality problems, and it often requires a degree of technical sophistication on the part of the application consuming the data.</p>
</section>
|
||
</section>
|
||
<section epub:type="division">
|
||
<h2 class="head a-head"><a id="sec14-2"/><b>14.2 Impact and Adoption of Linked Data Principles</b></h2>
|
||
<p class="noindent">Because of the increasing adoption of Linked Data, there has been a concerted effort both to quantify the quality of Linked Data data sets and to assess the importance and rates of adoption of various principles. In the “Bibliographic Notes” section, we note some specific research papers that did thorough assessments of the quality and growth of LOD. These assessments started becoming commonplace once the size and number of LOD data sets reached a certain threshold [e.g., the State of the LOD Cloud report by Bizer et al. (2011) analyzed the adoption of linked data best practices by LOD data sets within various topical domains]. Some of the more recent key findings of the overall ecosystem are noted briefly next, followed (in the next section) by more details on impact and applications of specific highly influential LOD data sets.</p>
|
||
<p>Before describing these findings, we provide a methodological note on how these assessments are done to begin with. One method, which has become standard, is to crawl a snapshot of the Linked Data web, typically by using a specialized crawler (e.g., the LDSpider framework). For example, in Schmachtenberg et al. (2014), a group of SW researchers seeded LDSpider with 560,000 seed URIs originating from three sources:</p>
|
||
<ul class="numbered">
|
||
<li class="NL">1. All URIs of the example resources from data sets contained in the <i>lod-cloud</i> group in the datahub.io data set catalog, as well as example URIs from other data sets in the catalog marked with Linked Data related tags.</li>
|
||
<li class="NL">2. A sample of the URIs contained in the Billion Triple Challenge 2012 data set.</li>
|
||
<li class="NL">3. URIs from data sets advertised on the public-lod@w3.org mailing list since 2011.</li>
|
||
</ul>
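<p>A seeded, breadth-first crawl of this kind can be sketched as follows. This is not LDSpider's implementation. The function <i>crawl_breadth_first</i>, the <i>fetch_links</i> callback, and the toy link graph are all assumptions made for illustration, and a real crawler would dereference each URI over HTTP and parse the returned RDF to discover outgoing links.</p>

```python
from collections import deque


def crawl_breadth_first(seeds, fetch_links, max_documents):
    """Visit URIs level by level, starting from the seeds, up to a document budget."""
    queue = deque(dict.fromkeys(seeds))  # de-duplicate seeds, keep their order
    seen = set(queue)
    visited = []
    while queue and max_documents > len(visited):
        uri = queue.popleft()
        visited.append(uri)
        for target in fetch_links(uri):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Toy stand-in for the Linked Data web: URI -> URIs it links to (hypothetical).
TOY_WEB = {
    "ex:a": ["ex:b", "ex:c"],
    "ex:b": ["ex:d"],
    "ex:c": ["ex:d"],
    "ex:d": [],
}

order = crawl_breadth_first(["ex:a"], lambda uri: TOY_WEB.get(uri, []), max_documents=10)
print(order)  # ['ex:a', 'ex:b', 'ex:c', 'ex:d']
```

<p>Breadth-first order tends to sample many data sets shallowly before going deep into any one of them, which suits a survey whose goal is coverage across data sets.</p>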
|
||
<p>Next, the researchers used the seeds to perform crawls during a specific month (in their case, April 2014) to retrieve entities from every data set using a breadth-first crawling strategy. Altogether, they crawled 900,129 documents describing 8,038,396 resources and made them available to the public for future replication. The crawled data belonged to 1,014 data sets, providing a representative sample for studying how well the Linked Data principles are being adopted across the spectrum of actual LOD data sets. Some of the main findings were as follows:</p>
|
||
<ul class="numbered">
|
||
<li class="NL">1. <span aria-label="378" id="pg_378" role="doc-pagebreak"/>In total, 56 percent of all data sets in the crawl had links pointing to at least one other data set (external links). The remaining 44 percent were either only the target of RDF links from other data sets or were isolated. Thus, there is still a lot of work to be done in fulfilling the fourth Linked Data principle. One reason why this has not been straightforward is the challenge of implementing an efficient, high-quality IM system for a topical domain.</li>
|
||
<li class="NL">2. In further support of the first point, the in- and out-degrees varied widely, with a small number of data sets in each category being highly linked, while most data sets were sparsely linked. Overall, social networking data sets showed the highest degree values, while data sets with geographic and user-generated content showed an imbalance between in- and out-degrees (e.g., geographic data sets had much larger in- than out-degrees, measured by area under the respective degree distribution curves).</li>
|
||
<li class="NL">3. An analysis of the overall graph structure of the full crawl mostly yielded one large, weakly connected component consisting of almost 72 percent of all data sets in the crawl. There were three small components, one consisting of three data sets, and two consisting of two. Furthermore, within the large weakly connected component, there was one large strongly connected component consisting of approximately 36 percent of all data sets. What this shows is the unevenness of connectivity in LOD, as strong connectivity implies that every data set within the strongly connected component can be reached from every other one by following directed links, whereas weak connectivity guarantees this only if link direction is ignored. How this compares to the document web is not well understood at present; however, the fact that almost 30 percent of data sets can’t be accessed from the other 70 percent is a cause for concern considering the motivations for proposing the Linked Data principles to begin with.</li>
|
||
<li class="NL">4. An analysis of <i>vocabulary usage</i> on LOD showed that the top vocabularies besides RDF, RDFS, and OWL are FOAF and DCTerms, with Simple Knowledge Organization System (SKOS) also in the top 10. Some vocabularies found much more growth on LOD than others (e.g., while FOAF was used by about 27 percent of all data sets in 2011, it was used by more than 69 percent of data sets in the crawl in 2014). In contrast, the Dublin Core vocabulary had less growth (though still a significant amount), going from about 31 percent of data sets in 2011 to 56 percent in 2014.</li>
|
||
<li class="NL">5. A negative finding of the report concerned the provision of alternative access methods for the data sets, such as SPARQL end points and dumps. In 2011, the numbers were encouraging, as more than 68 percent of data sets at the time provided end points, with almost 40 percent providing dumps. In 2014, these numbers were found to be much lower, but this was not necessarily due to lack of provision. Instead, one possibility indicated by the evidence was that the dumps and end points, if provided, were not easy to find using automated <span aria-label="379" id="pg_379" role="doc-pagebreak"/>methods such as the Vocabulary of Interlinked Datasets (VoID) descriptions<sup><a href="chapter_14.xhtml#fn5x14" id="fn5x14-bk">5</a></sup> linked from the data sets. In reality, the number of end points and dumps may very well have increased over time. This illustrates the importance, not just of ensuring that the data set itself obeys Linked Data principles, but (to deal with the growth of the ecosystem) also of providing adequate data set–level metadata to make it easier for automated agents and applications to find and consume these data sets.</li>
|
||
</ul>
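<p>The link-based statistics in findings 1 through 3 can be reproduced on any data set–level link graph. The sketch below is a simplified illustration, not the procedure used in the cited study. The function names and the five-data-set toy graph are hypothetical, and the component computation uses plain reachability, which is adequate for small graphs but not tuned for a crawl with a thousand data sets.</p>

```python
from collections import defaultdict


def reachable(start, adjacency):
    """Return all nodes reachable from start by following adjacency."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def link_statistics(datasets, links):
    """Summarize external linking over (source data set, target data set) pairs."""
    out_edges, in_edges, undirected = defaultdict(set), defaultdict(set), defaultdict(set)
    for src, dst in links:
        out_edges[src].add(dst)
        in_edges[dst].add(src)
        undirected[src].add(dst)
        undirected[dst].add(src)
    return {
        # Share of data sets with at least one outgoing external link (finding 1).
        "with_out_links": sum(1 for d in datasets if out_edges[d]) / len(datasets),
        # Largest weakly connected component: link direction ignored (finding 3).
        "largest_weak": max(len(reachable(d, undirected)) for d in datasets),
        # Largest strongly connected component: mutual reachability (finding 3).
        "largest_strong": max(
            len(reachable(d, out_edges).intersection(reachable(d, in_edges)))
            for d in datasets
        ),
    }


TOY_DATASETS = ["A", "B", "C", "D", "E"]
TOY_LINKS = [("A", "B"), ("B", "A"), ("B", "C"), ("D", "C")]  # E is isolated

stats = link_statistics(TOY_DATASETS, TOY_LINKS)
print(stats)  # {'with_out_links': 0.6, 'largest_weak': 4, 'largest_strong': 2}
```

<p>Even in this toy graph the strongly connected core (A and B) is much smaller than the weakly connected component, mirroring the gap between 36 and 72 percent reported above.</p>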
|
||
<section epub:type="division">
|
||
<h3 class="head b-head"><a id="sec14-2-1"/><b>14.2.1 Overall Impact</b></h3>
|
||
<p class="noindent">The growth in the number of LOD data sets since its founding in 2007 is emblematic of the impact that the Linked Data principles have had on publishing and exposing structured data. In May 2007, LOD comprised only 12 data sets, including some of the ones that we cover next, such as FOAF and DBpedia. The growth was exponential for several years, though in recent years, the growth has slowed. Topical domains represented in the LOD ecosystem include publications and bibliographic data, bioinformatics and domain sciences, Open Government, social media, and most importantly, cross-domain data sets that are encyclopedic and open-world and play an important role in enabling the fulfillment of principle 4.</p>
|
||
<p>In enterprise and governmental organizations, we find that there is a healthy adoption of Linked Data by various organizations, especially those for which providing <i>background context</i> (e.g., by linking to DBpedia, as is commonly done) and <i>data integration</i> are both important. In addition to adoption, many important KGs in LOD are now used widely for a range of tasks and applications, from geospatial data integration to Natural Language Processing (NLP). Thus, when quantifying the impact of Linked Data, it is important to separate the impact of the <i>principles</i> themselves from the impact of the <i>data sets</i> that have been published using Linked Data principles (a task that may not be feasible or clear-cut in some cases). One could very well argue, in the case of DBpedia, that while Linked Data did not result in the publishing of new knowledge because much of the raw information is already contained in Wikipedia, it unlocked new applications by republishing structured Wikipedia knowledge as RDF.</p>
|
||
</section>
|
||
</section>
|
||
<section epub:type="division">
|
||
<h2 class="head a-head"><a id="sec14-3"/><b>14.3 Important Knowledge Graphs in Linked Open Data</b></h2>
|
||
<p class="noindent">Here, we provide details on some significant KGs in the LOD ecosystem. These KGs are either heavily influential or widely used, and many have been part of LOD since the early days, continuing to grow with it.</p>
|
||
<section epub:type="division">
|
||
<h3 class="head b-head"><span aria-label="380" id="pg_380" role="doc-pagebreak"/><a id="sec14-3-1"/><b>14.3.1 DBpedia</b></h3>
|
||
<p class="noindent">DBpedia is a crowdsourced community effort to extract structured information from Wikipedia and make this information available on the web. DBpedia makes this information accessible under the terms of the Creative Commons Attribution-ShareAlike 3.0 license and the GNU Free Documentation license. These are liberal licenses that make it easy to use, extend, and share the KB.</p>
|
||
<p>Although DBpedia is multilingual, like Wikipedia, the English-language version continues to be the largest, best-known, and best-maintained. According to the most recent posting on the DBpedia website, this version currently describes 4.58 million things, out of which 4.22 million are classified in a consistent (i.e., DBpedia) ontology, including 1,445,000 persons; 735,000 places (including 478,000 populated places); 411,000 creative works (including 123,000 music albums, 87,000 films, and 19,000 video games); 241,000 organizations (including 58,000 companies and 49,000 educational institutions); 251,000 species; and 6,000 diseases. Localized versions of DBpedia are available in 125 languages, which, in total, describe over 38 million things. Although there is significant overlap (almost 24 million of the 38 million) between the objects represented in the localized versions and in the English version, there are also almost 14 million entities that are specific to the locale.</p>
|
||
<p>In all, the full DBpedia data set contains over 38 million labels and abstracts, at least 25.2 million links to images, almost 30 million links to external web pages, almost 81 million links to Wikipedia categories, and numerous links to other linked data sets (an estimated 50 million). In practice, DBpedia has truly emerged as a Linked Data hub and is a primary source of linking for new data sets that want to be published using the Linked Data principles, especially principle 4, which mandates interlinking.</p>
|
||
<p>Concerning infrastructure and architecture, the DBpedia RDF data set is hosted and published using OpenLink Virtuoso. The Virtuoso infrastructure provides access to DBpedia’s RDF data via a SPARQL end point, alongside HTTP support for any web client’s standard GETs for HTML or RDF representations of DBpedia resources.</p>
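<p>The two access paths just described can be sketched with the Python standard library alone. The snippet only constructs the requests and does not send them. Several details are assumptions: the end point address <i>https://dbpedia.org/sparql</i>, the Virtuoso-style <i>format</i> parameter, and the premise that the end point predefines the <i>dbr:</i> and <i>dbo:</i> prefixes (otherwise the query would need explicit PREFIX declarations).</p>

```python
from urllib.parse import urlencode
from urllib.request import Request

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"  # assumed public end point address

# Assumes the end point predefines the dbr: and dbo: prefixes.
QUERY = "SELECT ?spouse WHERE { dbr:Judy_Garland dbo:spouse ?spouse }"


def sparql_get_url(endpoint, query):
    """Build a SPARQL-protocol GET URL; 'format' is a Virtuoso-specific assumption."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


def rdf_request(resource_uri, media_type="text/turtle"):
    """Prepare a GET that asks for an RDF representation via content negotiation."""
    return Request(resource_uri, headers={"Accept": media_type})


url = sparql_get_url(SPARQL_ENDPOINT, QUERY)
request = rdf_request("http://dbpedia.org/resource/Judy_Garland")
print(url)
print(request.full_url, request.get_header("Accept"))
```

<p>Passing either object to <i>urllib.request.urlopen</i> would perform the actual call. A browser sending <i>Accept: text/html</i> to the same resource URI would typically receive the human-readable page instead.</p>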
|
||
</section>
|
||
<section epub:type="division">
|
||
<h3 class="head b-head"><a id="sec14-3-2"/><b>14.3.2 GeoNames</b></h3>
|
||
<p class="noindent">GeoNames is a geographical KB that covers all countries and contains over 11 million names of places. It is available for download free of charge under a Creative Commons attribution license. It contains over 9 million unique features, including 2.8 million populated places and 5.5 million alternative names. GeoNames is also linked to DBpedia, which allows contextual knowledge to be shared across the two sources, an important goal of the Linked Data principles. The most important data sources used by GeoNames in the <i>GeoNames Gazetteer</i> are the National Geospatial-Intelligence Agency (NGA) and the US Board on Geographic Names, the US Geological Survey Geographic Names Information System (for names in the US), Ordnance Survey OpenData, and GeoBase for Canadian geographical <span aria-label="381" id="pg_381" role="doc-pagebreak"/>names, among others. Some other data sources, particularly for smaller countries, are listed in <a href="chapter_14.xhtml#tab14-3" id="rtab14-3">table 14.3</a>. In the section entitled “Software and Resources,” later in this chapter, we provide links to the full set of sources. An important point illustrated even by the small subset of sources in <a href="chapter_14.xhtml#tab14-3">table 14.3</a> is that <i>without</i> GeoNames, a practitioner or user in this space would have to download, integrate (in terms of schemas), and otherwise invest considerable effort when working with multiple countries, or even one country when more than one data source is available for that country.</p>
|
||
<div class="table">
|
||
<p class="TT"><a id="tab14-3"/><span class="FIGN"><a href="#rtab14-3">Table 14.3</a>:</span> <span class="FIG">A small subset of sources used for compiling the GeoNames KG.</span></p>
|
||
<figure class="table">
|
||
<table class="table">
|
||
<thead>
|
||
<tr>
|
||
<th class="TCH"><p class="TB"><b>Country</b></p></th>
|
||
<th class="TCH"><p class="TB"><b>Name</b></p></th>
|
||
<th class="TCH"><p class="TB"><b>Description</b></p></th>
|
||
<th class="TCH"><p class="TB"><b>Website</b></p></th>
|
||
</tr>
|
||
</thead>
|
||
<tbody>
|
||
<tr>
|
||
<td class="TB"><p class="TB">Argentina</p></td>
|
||
<td class="TB"><p class="TB">indec</p></td>
|
||
<td class="TB"><p class="TB">National Institute of Statistics and Census of Argentina</p></td>
|
||
<td class="TB"><p class="TB"><a href="http://www.indec.gov.ar/">http://<wbr/>www<wbr/>.indec<wbr/>.gov<wbr/>.ar<wbr/>/</a></p></td>
|
||
</tr>
|
||
<tr>
|
||
<td class="TB"><p class="TB">Spain</p></td>
|
||
<td class="TB"><p class="TB">cartociudad</p></td>
|
||
<td class="TB"><p class="TB">National Address Database</p></td>
|
||
<td class="TB"><p class="TB"><a href="http://www.cartociudad.es/portal/">http://<wbr/>www<wbr/>.cartociudad<wbr/>.es<wbr/>/portal<wbr/>/</a></p></td>
|
||
</tr>
|
||
<tr>
|
||
<td class="TB"><p class="TB">Faroe Islands</p></td>
|
||
<td class="TB"><p class="TB">um_stovnin_fo</p></td>
|
||
<td class="TB"><p class="TB">Environment Agency</p></td>
|
||
<td class="TB"><p class="TB"><a href="http://www.us.fo/">http://<wbr/>www<wbr/>.us<wbr/>.fo<wbr/>/</a></p></td>
|
||
</tr>
|
||
<tr>
|
||
<td class="TB"><p class="TB">Ireland</p></td>
|
||
<td class="TB"><p class="TB">osi</p></td>
|
||
<td class="TB"><p class="TB">Ordnance Survey Ireland (OSi)—Open Data</p></td>
|
||
<td class="TB"><p class="TB"><a href="https://www.osi.ie">https://<wbr/>www<wbr/>.osi<wbr/>.ie</a></p></td>
|
||
</tr>
|
||
<tr>
|
||
<td class="TSN" colspan="4"><p class="TSN">A more complete list is available at geonames.org/datasources/.</p></td>
|
||
</tr>
|
||
</tbody>
|
||
</table>
|
||
</figure>
|
||
</div>
|
||
<p>There is also a range of impressive features available for online users, including the ability to (1) search for names with full-text search (which itself could output results as a table or display outputs on a map); (2) browse capitals, the highest mountains, and the largest cities on a map; (3) browse names on a map and show (or hide) feature classes and codes; (4) export names in a variety of formats, including as a comma-separated values (CSV) file or as a Portable Network Graphics (PNG) image; (5) geotag names (for registered users); (6) edit names in a wiki-style collaboration; and (7) send maps via email, among others. A number of companies, organizations, and products are known to use GeoNames, including Apple Snow Leopard, Ubuntu, Bing Maps, DigitalGlobe, the <i>New York Times</i> (NYT), US Geological Survey, and Nokia, attesting to the usefulness, quality, and ease of use of this resource. In the academic community, especially Semantic Web and NLP (for such tasks as toponym resolution and location extraction or linking), the resource also has a loyal following.</p>
|
||
</section>
|
||
<section epub:type="division">
|
||
<h3 class="head b-head"><a id="sec14-3-3"/><b>14.3.3 YAGO</b></h3>
|
||
<p class="noindent">YAGO (whose name stands for “Yet Another Great Ontology”) is a semantic KB derived using automatic extraction methods from three sources: Wikipedia (e.g., categories and infoboxes), WordNet (e.g., synsets and hyponymy), and GeoNames. YAGO is a joint project of the Max Planck Institute for Informatics and the Telecom ParisTech University. <span aria-label="382" id="pg_382" role="doc-pagebreak"/>Currently, YAGO has knowledge of more than 10 million entities (like persons, organizations, and cities) and contains more than 120 million facts about these entities. The accuracy of YAGO was manually evaluated to be above 95 percent on a sample of facts. In keeping with the fourth Linked Data principle, YAGO has been linked to the DBpedia ontology and to SUMO (Suggested Upper Merged Ontology).</p>
|
||
<p>YAGO has continued to be maintained and updated over the years since its initial release in 2008, with YAGO3 being the latest version. The YAGO code recently went open source; the source code is now available on GitHub for anyone to work with. YAGO3 is provided in Turtle and tab-separated values (TSV) formats. To facilitate efficiency and exploration, thematic and specialized dumps have also been made freely available, along with dumps of the whole database. The KB can also be queried through online browsers and a SPARQL end point hosted by OpenLink Software.<sup><a href="chapter_14.xhtml#fn6x14" id="fn6x14-bk">6</a></sup> A major application of YAGO is the Watson system, which is a question-answering AI developed by IBM that won the first prize in a 2011 competition of the <i>Jeopardy!</i> quiz show (beating two top human champions). Another important feature of YAGO is that it is anchored in time and space, with a temporal and spatial dimension attached to many of its facts and entities.</p>
|
||
</section>
|
||
<section epub:type="division">
|
||
<h3 class="head b-head"><a id="sec14-3-4"/><b>14.3.4 Wikidata</b></h3>
|
||
<p class="noindent">Wikidata describes itself<sup><a href="chapter_14.xhtml#fn7x14" id="fn7x14-bk">7</a></sup> as a “free, collaborative, multilingual, secondary database, collecting structured data to provide support for Wikipedia, Wikimedia Commons, the other wikis of the Wikimedia movement, and to anyone in the world.” Wikimedia launched Wikidata in October 2012; initially, it had limited features because editors were only able to create items and connect them to Wikipedia articles. In January of the next year, three Wikipedias (the Hungarian edition, followed by Hebrew and Italian) began to connect to Wikidata. In the meantime, the Wikidata community created more than three million items independently. The English Wikipedia followed in February 2013; all the Wikipedia editions subsequently connected to Wikidata in March.</p>
|
||
<p>The representation model in Wikidata is somewhat different from RDF, with the main principles behind the representation described earlier in chapter 2. However, note that an RDF version of Wikidata is currently available. In its native form, Wikidata uses a simpler data model to store structured data, much more like the key-value models used by NoSQL databases like Elasticsearch. In essence, data is described through property-value pairs: properties are objects and have their <i>own</i> Wikidata pages with labels, aliases, and descriptions. Unlike items, these pages are not linked to Wikipedia articles.</p>
|
||
<p>However, property pages always specify a datatype that defines the type of values the property can have. For example, “GDP” has datatype <i>Quantity</i>, “has mother” has datatype <span aria-label="383" id="pg_383" role="doc-pagebreak"/><i>Item</i> because its value is another item (the <i>mother</i>), and “postal code” has datatype <i>String</i> (by default, as the datatype is not explicitly mentioned or declared on the property page for “postal code”). This information is important for providing adequate user interfaces and ensuring the validity of inputs. There are only a small number of datatypes (mainly quantity, item, string, date and time, geographic coordinates, and URL). Data is international, although its display may be language-dependent (e.g., the number 1,007.8 is written as “1.007,8” in German but as “1 007,8” in French).</p>
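<p>The role of datatypes in validating input, and the separation between a stored value and its language-dependent display, can be sketched as follows. This is an illustrative model only and not Wikidata's software: the property table, the item-identifier check, the separator table, and the fixed single decimal place are simplifying assumptions.</p>

```python
# Hypothetical excerpt of a property-to-datatype table, following the examples in the text.
PROPERTY_DATATYPES = {"GDP": "Quantity", "has mother": "Item", "postal code": "String"}

# (thousands separator, decimal separator) per display language; a plain space is
# used for French here, although typeset French normally uses a non-breaking space.
SEPARATORS = {"en": (",", "."), "de": (".", ","), "fr": (" ", ",")}


def is_valid(prop, value):
    """Check a candidate value against the datatype declared for the property."""
    datatype = PROPERTY_DATATYPES[prop]
    if datatype == "Quantity":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if datatype == "Item":
        # Assumed convention: item identifiers look like 'Q' followed by digits.
        return isinstance(value, str) and value[:1] == "Q" and value[1:].isdigit()
    return isinstance(value, str)


def format_quantity(value, language):
    """Render one stored number according to the conventions of a display language."""
    thousands, decimal = SEPARATORS[language]
    integer_part, fraction = f"{value:,.1f}".split(".")
    return integer_part.replace(",", thousands) + decimal + fraction


print(format_quantity(1007.8, "de"))  # 1.007,8
print(format_quantity(1007.8, "fr"))  # 1 007,8
print(is_valid("has mother", "Q12345"), is_valid("GDP", "a lot"))  # True False
```

<p>The stored value stays the same number in every case; only its rendering changes with the language.</p>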
|
||
<p>What about more complex data that property-value pairs cannot adequately represent? Although we do not revisit the Wikidata data model here (interested readers may want to review chapter 2), we do note that Wikidata provides intuitive support for citation-based provenance, which can prove to be a useful feature because there are so many contributors, and not all are equally careful or discriminating in adding or verifying knowledge that gets added to the KB. Two further mechanisms extend the basic property-value model:</p>
|
||
<ul class="numbered">
|
||
<li class="NL">1. <i>Qualifiers</i> are used to state contextual information such as the validity time for an assertion, and they can also be used to encode ternary relations that fall outside the normal scope of the property-value model. As one example, we can use qualifiers to say that Audrey Hepburn played Holly Golightly in the movie <i>Breakfast at Tiffany’s</i> by adding to the item of the movie a property <i>“cast member”</i> with value “Audrey Hepburn” and an additional qualifier <i>“role = Holly Golightly.”</i> Qualifiers are extensible; that is, they are not “ontologized” in the sense of the set of qualifiers being fixed in advance. While qualifiers closely resemble the data found in Wikipedia infoboxes, they should not be misunderstood (though they often are) as a workaround to represent higher-arity relations in data models like RDF.</li>
|
||
<li class="NL">2. <i>Special statements</i> in Wikidata allow a publisher to specify that (1) a property value is unknown, permitting conceptual assertion of nonknowledge; and (2) a property has no value. In chapter 8, on instance matching, we saw these issues arise when extracting features from pairs of instances in the similarity step, and therein proposed dummy values to provide signals to a machine learning classifier about missing property values and other such negative information. Special statements in Wikidata serve the same role as these predefined “dummy” values.</li>
|
||
</ul>
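<p>Both mechanisms can be modeled with a very small data structure. The classes below are a hypothetical sketch and not Wikidata's actual storage format: the names <i>Statement</i>, <i>Item</i>, and <i>Special</i>, the example person, and the helper functions are invented for illustration.</p>

```python
from dataclasses import dataclass, field
from enum import Enum


class Special(Enum):
    """Markers for the two kinds of special statements described in the text."""
    UNKNOWN_VALUE = "unknown value"  # a value exists but is not known
    NO_VALUE = "no value"            # the property has no value at all


@dataclass
class Statement:
    prop: str
    value: object
    qualifiers: dict = field(default_factory=dict)  # open-ended, not fixed in advance


@dataclass
class Item:
    label: str
    statements: list = field(default_factory=list)


def roles_played(item, actor):
    """Read the 'role' qualifier off every 'cast member' statement for an actor."""
    return [
        s.qualifiers.get("role")
        for s in item.statements
        if s.prop == "cast member" and s.value == actor
    ]


def has_known_value(statement):
    """True unless the statement carries one of the special markers."""
    return not isinstance(statement.value, Special)


movie = Item("Breakfast at Tiffany's")
movie.statements.append(
    Statement("cast member", "Audrey Hepburn", {"role": "Holly Golightly"})
)

# Hypothetical person used only to show the two special statements.
person = Item("Example Person")
person.statements.append(Statement("date of birth", Special.UNKNOWN_VALUE))
person.statements.append(Statement("spouse", Special.NO_VALUE))

print(roles_played(movie, "Audrey Hepburn"))            # ['Holly Golightly']
print([has_known_value(s) for s in person.statements])  # [False, False]
```

<p>Keeping the two markers distinct from an absent statement lets a consumer tell “known to be missing” apart from “not yet recorded,” which is the distinction the dummy values of chapter 8 were meant to capture.</p>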
|
||
</section>
|
||
<section epub:type="division">
|
||
<h3 class="head b-head"><a id="sec14-3-5"/><b>14.3.5 Upper Mapping and Binding Exchange Layer</b></h3>
|
||
<p class="noindent">Upper Mapping and Binding Exchange Layer (UMBEL) is a logically organized KG of 34,000 concepts and entity types that can be used in information science for relating information from disparate sources to one another. UMBEL<sup><a href="chapter_14.xhtml#fn8x14" id="fn8x14-bk">8</a></sup> is an open-source extraction <span aria-label="384" id="pg_384" role="doc-pagebreak"/>of the OpenCyc KB and is able to take advantage of Cyc’s reasoning capabilities. This is an important capability, as Cyc itself was created over many years and contains important knowledge about the world and common objects not necessarily contained in encyclopedic KGs like DBpedia. In recent years, this type of knowledge has gained increased prominence for moonshot problem areas in AI like common-sense reasoning.</p>
|
||
<p>UMBEL promotes semantic interoperability of information via two means. First, it uses an ontology of about 35,000 reference concepts, designed to provide common mapping points for relating ontologies or schema to one another; second, it uses a vocabulary for aiding that ontology mapping, including expressions of likelihood relationships distinct from exact identity or equivalence (we saw a similar case earlier with <i>skos:related</i> when describing the fourth Linked Data principle). Because the vocabulary is designed for interoperable domain ontologies, it is useful for fulfilling principle 3.</p>
|
||
<p>UMBEL is written in the SW languages of SKOS and OWL 2. It is a class structure used in Linked Data, along with OpenCyc, YAGO, and the DBpedia ontology. Besides data integration, use-cases of UMBEL include concept search, concept definitions, and ontology consistency checking. It has also been used to build large ontologies and for online question-answering systems, among other applications.</p>
|
||
<p>Including OpenCyc, UMBEL has about 65,000 formal mappings to DBpedia, PROTON, GeoNames, and Schema.org, and provides linkages to at least two million English Wikipedia pages. Reference concepts and mappings are organized under a hierarchy of 31 different (and mostly disjoint) <i>supertypes</i>. Each of these supertypes has its own typology of entity classes to provide flexible tie-ins for external content. A total of 90 percent of UMBEL is contained in these entity classes. It was first released in July 2008 and has been updated periodically since.</p>
|
||
</section>
|
||
<section epub:type="division">
|
||
<h3 class="head b-head"><a id="sec14-3-6"/><b>14.3.6 Friend of a Friend</b></h3>
|
||
<p class="noindent">According to the published vocabulary specification, FOAF is devoted to linking people and information using the web. Regardless of whether information is in people’s heads, in physical or digital documents, or in the form of factual data, it can be linked. Although it sounds like an ontology for publishing social network-like information (such as user profiles), FOAF is much broader and actually integrates three kinds of networks (namely, <i>social networks</i> of human collaboration, friendship, and association; <i>representational networks</i>, describing a condensed view of a toy universe (i.e., a deliberately simplified, hypothetical example) in factual terms; and <i>information networks</i>, using web-based linking to share independently published descriptions of this interconnected world). FOAF does not compete with socially oriented websites; rather, it provides an approach in which different sites can tell different parts of the larger story, and by which users can retain some control over their information in a nonproprietary format. Technically, the ontology is quite compact (19 classes, 44 object properties, and 27 datatype properties), and it is compatible with OWL 2 <span aria-label="385" id="pg_385" role="doc-pagebreak"/>RL (meaning that it is convenient to materialize derived FOAF knowledge by performing reasoning using the ontology, in a triplestore).</p>
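<p>To make the idea of materialization concrete, the following sketch applies three simple forward-chaining rules (subclass, domain, and range inference) until no new triples appear. It is a toy stand-in for what a triplestore with OWL 2 RL support would do, and covers only a small fragment of that rule set. The schema triples follow the FOAF vocabulary (<i>foaf:Person</i> is a subclass of <i>foaf:Agent</i>, and <i>foaf:knows</i> relates persons), while <i>ex:alice</i> and <i>ex:bob</i> are hypothetical.</p>

```python
def materialize(triples):
    """Forward-chain subclass, domain, and range rules to a fixed point."""
    closure = set(triples)
    while True:
        new = set()
        for s, p, o in closure:
            if p == "rdf:type":
                # Instances of a class are also instances of its superclasses.
                new.update(
                    (s, "rdf:type", sup)
                    for sub, q, sup in closure
                    if q == "rdfs:subClassOf" and sub == o
                )
            # The subject of a property is typed by the property's domain.
            new.update(
                (s, "rdf:type", cls)
                for prop, q, cls in closure
                if q == "rdfs:domain" and prop == p
            )
            # The object is typed by the property's range (object properties only here).
            new.update(
                (o, "rdf:type", cls)
                for prop, q, cls in closure
                if q == "rdfs:range" and prop == p
            )
        if new.issubset(closure):
            return closure
        closure |= new


FOAF_SCHEMA = [
    ("foaf:Person", "rdfs:subClassOf", "foaf:Agent"),
    ("foaf:knows", "rdfs:domain", "foaf:Person"),
    ("foaf:knows", "rdfs:range", "foaf:Person"),
]
DATA = [("ex:alice", "foaf:knows", "ex:bob")]  # hypothetical profile data

inferred = materialize(FOAF_SCHEMA + DATA)
print(("ex:bob", "rdf:type", "foaf:Agent") in inferred)  # True
```

<p>A single <i>foaf:knows</i> assertion is enough to derive that both individuals are persons and agents. Because the rules only ever add triples, the loop is guaranteed to terminate.</p>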
|
||
<p>Historically, FOAF is an important effort that was conceived many years before the Linked Data principles were codified or developed, and in fact even before the era of social media websites like Facebook or Twitter. Although the project is still maintained, it has not witnessed the rapid adoption that social media has. Nonetheless, its vocabulary continues to be reused for publishing many Linked Data data sets, as well as for fulfilling core tenets such as the third Linked Data principle. In particular, many computer scientists, especially in the SW community, have published their profiles using FOAF, yielding a large, decentralized open social graph on the internet.</p>
|
||
</section>
|
||
<section epub:type="division">
|
||
<h3 class="head b-head"><a id="sec14-3-7"/><b>14.3.7 Other Examples</b></h3>
|
||
<p class="noindent">A number of other organizations have adopted the principles of the Linked Data movement and have even exposed SPARQL end points. We list some of these data sets in <a href="chapter_14.xhtml#tab14-4" id="rtab14-4">table 14.4</a>, along with the web page through which each can be accessed. Some of these data sets are discussed more fully in the next few chapters, depending on whether the application is enterprise specific (e.g., the BBC) or involves a specific community (e.g., biomedicine).</p>
|
||
<div class="table">
|
||
<p class="TT"><a id="tab14-4"/><span class="FIGN"><a href="#rtab14-4">Table 14.4</a>:</span> <span class="FIG">Other Linked Data data sets.</span></p>
|
||
<figure class="table">
|
||
<table class="table">
|
||
<thead>
|
||
<tr>
|
||
<th class="TCH"><p class="TB"><b>Project</b></p></th>
|
||
<th class="TCH"><p class="TB"><b>Webpage</b></p></th>
|
||
</tr>
|
||
</thead>
|
||
<tbody>
|
||
<tr>
|
||
<td class="TB"><p class="TB">Bio2RDF</p></td>
|
||
<td class="TB"><p class="TB"><a href="https://bio2rdf.org/">https://<wbr/>bio2rdf<wbr/>.org<wbr/>/</a></p></td>
|
||
</tr>
|
||
<tr>
|
||
<td class="TB"><p class="TB">BGS OpenGeoscience</p></td>
|
||
<td class="TB"><p class="TB"><a href="http://data.bgs.ac.uk/">http://<wbr/>data<wbr/>.bgs<wbr/>.ac<wbr/>.uk<wbr/>/</a></p></td>
|
||
</tr>
|
||
<tr>
|
||
<td class="TB"><p class="TB">MusicBrainz</p></td>
|
||
<td class="TB"><p class="TB"><a href="https://musicbrainz.org/">https://<wbr/>musicbrainz<wbr/>.org<wbr/>/</a></p></td>
|
||
</tr>
|
||
<tr>
|
||
<td class="TB"><p class="TB">Science Commons</p></td>
|
||
<td class="TB"><p class="TB"><a href="https://creativecommons.org/about/program-areas/open-science">https://<wbr/>creativecommons<wbr/>.org<wbr/>/about<wbr/>/program<wbr/>-areas<wbr/>/open<wbr/>-science</a></p></td>
|
||
</tr>
|
||
<tr>
|
||
<td class="TB"><p class="TB">Linked Sensor Data</p></td>
|
||
<td class="TB"><p class="TB"><a href="http://wiki.knoesis.org/index.php/LinkedSensorData">http://<wbr/>wiki<wbr/>.knoesis<wbr/>.org<wbr/>/index<wbr/>.php<wbr/>/LinkedSensorData</a></p></td>
|
||
</tr>
|
||
<tr>
|
||
<td class="TSN" colspan="2"><p class="TSN">A more complete list is available at w3.org/wiki/DataSetRDFDumps.</p></td>
|
||
</tr>
|
||
</tbody>
|
||
</table>
|
||
</figure>
|
||
</div>
|
||
<p>In some cases, we note that, even when organizations have chosen not to publish their entire data sets (what has been described as the KG in this book) as Linked Data, they have released their metadata, taxonomies, or even entire ontologies in that way. It is sometimes not completely clear in this context whether the data being published is part of the KG or the ontology. With this caveat in mind, we cite NYT as a well-known example. In early 2010, NYT added almost 5,000 new subject headings (or tags) to a set of 5,000 person-name subject headings that had been released in October 2009. The subjects include organizations, publicly traded companies, and geographic identifiers, ranging from Apple to Kansas to Williams College. We note that these tags are manually mapped to DBpedia, Freebase (which has since been acquired by Google, and of which the public <span aria-label="386" id="pg_386" role="doc-pagebreak"/>equivalent is Wikidata), and in the case of geographic entities, GeoNames. This is obviously in fulfillment of the fourth Linked Data principle described earlier. Additionally, NYT incorporated DBpedia identifiers into their Article Search API for even more seamless integration.</p>
|
||
<p>Because the data was released to data.nytimes.com, it can be used for building complex, data-driven applications (e.g., designing and building a web app for finding alumni in the news, a feature now offered prominently on social media websites like LinkedIn as well). The important thing to remember is that, by publishing such data as LOD, the development of such applications is democratized and made significantly easier. If the data were not accessible in a structured, interlinked format, customized code would have to be written and developed for each application. Although significant coding is still involved (an issue we take up at length in chapter 17), data curation has now been largely automated.</p>
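<p>The “alumni in the news” idea reduces to a join across interlinked sources: article tags map to DBpedia identifiers, and DBpedia supplies the background fact (here, an alma mater). The sketch below uses entirely hypothetical tags, people, and headlines held in memory. A real application would instead query the published NYT mappings and DBpedia, for example through the <i>dbo:almaMater</i> property.</p>

```python
# Hypothetical tag-to-DBpedia mappings, standing in for published identity links.
SAME_AS = {
    "nyt:tag-person-1": "dbr:Example_Person_1",
    "nyt:tag-person-2": "dbr:Example_Person_2",
}

# Hypothetical background knowledge, standing in for dbo:almaMater triples.
ALMA_MATER = {
    "dbr:Example_Person_1": "dbr:Williams_College",
    "dbr:Example_Person_2": "dbr:Example_University",
}

# Hypothetical articles annotated with subject-heading tags.
ARTICLES = [
    {"headline": "Hypothetical headline A", "tags": ["nyt:tag-person-1"]},
    {"headline": "Hypothetical headline B", "tags": ["nyt:tag-person-2"]},
    {"headline": "Hypothetical headline C", "tags": ["nyt:tag-person-1", "nyt:tag-person-2"]},
]


def alumni_in_the_news(college):
    """Return (headline, person) pairs for articles mentioning alumni of a college."""
    hits = []
    for article in ARTICLES:
        for tag in article["tags"]:
            entity = SAME_AS.get(tag)
            if entity is not None and ALMA_MATER.get(entity) == college:
                hits.append((article["headline"], entity))
    return hits


hits = alumni_in_the_news("dbr:Williams_College")
print(hits)
```

<p>Neither source alone can answer the question: the news archive does not record where people studied, and DBpedia does not record who appears in which article. The external links between the two make the join possible.</p>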
|
||
</section>
|
||
</section>
|
||
<section epub:type="division">
|
||
<h2 class="head a-head"><a id="sec14-4"/><b>14.4 Concluding Notes</b></h2>
|
||
<p class="noindent">In this chapter, we provided a first, well-established example of a <i>KG ecosystem</i> (namely, linked data). At its core, linked data is not data; rather, it is a set of principles for publishing data on the web. Unlike regular documents published as HTML, and primarily meant for human perusal and consumption, data sets published using the Linked Data principles are structured, almost always adhering to the RDF data model that we introduced in depth in chapter 2 and have alluded to frequently during the course of this book. The success and impact of these principles are best illustrated through the large and growing collection of data sets on LOD, which are open to the public and freely available to use, and also span domains that range from narrow (such as geographical and geopolitical entities in GeoNames) to open-world and encyclopedic (such as the Wikipedia-derived DBpedia KG). Enthusiastic adopters of these principles include the medical and scientific communities, as well as enterprise and government. In the next two chapters, we cover the adoption of KGs by these individual communities in more detail.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec14-5"/><b>14.5 Software and Resources</b></h2>
<p class="noindent">Many of the data sets and LOD examples mentioned in this chapter can be found easily on the web. A full description of the Linked Data movement itself, including principles and the diagram showcasing the breadth of LOD, may be accessed at <a href="http://linkeddata.org/">http://<wbr/>linkeddata<wbr/>.org<wbr/>/</a>. Data sets that are available as RDF dumps are listed on <a href="https://www.w3.org/wiki/DataSetRDFDumps">https://<wbr/>www<wbr/>.w3<wbr/>.org<wbr/>/wiki<wbr/>/DataSetRDFDumps</a>. The website also contains links to other resources, including tools. Not all of the links may be active, however. For the sake of completeness, we provide links to the important Linked Data resources here:</p>
<ul class="numbered">
<li class="NL">1. The GeoNames resource is available at <a href="https://www.geonames.org/">https://<wbr/>www<wbr/>.geonames<wbr/>.org<wbr/>/</a>, while a full list of GeoNames data sources is available at <a href="https://www.geonames.org/datasources/">https://<wbr/>www<wbr/>.geonames<wbr/>.org<wbr/>/datasources<wbr/>/</a>.</li>
<li class="NL">2. <span aria-label="387" id="pg_387" role="doc-pagebreak"/>The UMBEL resource was previously accessible at <a href="http://umbel.org/">http://<wbr/>umbel<wbr/>.org<wbr/>/</a>, but as mentioned earlier, support has migrated since early 2019 to KGpedia. However, a website for KGpedia is currently not available. Interested users still have the option of going to the UMBEL site and contacting the editors for a historical version of the resource.</li>
<li class="NL">3. DBpedia’s main page can be accessed at <a href="https://wiki.dbpedia.org/">https://<wbr/>wiki<wbr/>.dbpedia<wbr/>.org<wbr/>/</a>. Previous versions are also available, and other resources are available for the interested practitioner, including a support forum, news on events and hackathons, and links to the latest research.</li>
<li class="NL">4. The YAGO resource is available at <a href="https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/">https://<wbr/>www<wbr/>.mpi<wbr/>-inf<wbr/>.mpg<wbr/>.de<wbr/>/departments<wbr/>/databases<wbr/>-and<wbr/>-information<wbr/>-systems<wbr/>/research<wbr/>/yago<wbr/>-naga<wbr/>/yago<wbr/>/</a>.</li>
<li class="NL">5. FOAF, and several other vocabularies like it that matter both for the Linked Data movement and for the Semantic Web more broadly, were previously described with links in chapter 2. FOAF itself may be accessed at <a href="http://xmlns.com/foaf/spec/">http://<wbr/>xmlns<wbr/>.com<wbr/>/foaf<wbr/>/spec<wbr/>/</a>.</li>
<li class="NL">6. Wikidata is accessible at <a href="https://www.wikidata.org/wiki/Wikidata:Main_Page">https://<wbr/>www<wbr/>.wikidata<wbr/>.org<wbr/>/wiki<wbr/>/Wikidata:Main<wbr/>_Page</a>. The default representation is not RDF or Linked Data (recall that we covered Wikidata as a separate model for KGs in chapter 2), but Linked Data versions are available (<a href="https://www.wikidata.org/wiki/Wikidata:RDF">https://<wbr/>www<wbr/>.wikidata<wbr/>.org<wbr/>/wiki<wbr/>/Wikidata:RDF</a>). The Wikidata query service offers a SPARQL end point as well (accessed at the URL <a href="https://query.wikidata.org/">https://<wbr/>query<wbr/>.wikidata<wbr/>.org<wbr/>/</a>). A similar facility is available for some of the other major KGs like DBpedia as well.</li>
</ul>
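<p>As a minimal sketch of how the Wikidata SPARQL end point listed above might be queried programmatically, the following Python fragment builds an HTTP GET request and, optionally, runs it. The query itself (five instances of "house cat," item Q146), the helper names <code>build_request</code> and <code>run_query</code>, and the User-Agent string are illustrative assumptions rather than anything prescribed by Wikidata or by this chapter. The sketch uses only the Python standard library.</p>

```python
import json
import urllib.parse
import urllib.request

# SPARQL endpoint behind the query service UI at https://query.wikidata.org/
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query (an assumption): five items that are instances (P31)
# of "house cat" (Q146), with English labels. The wd:, wdt:, wikibase: and
# bd: prefixes are predefined by the Wikidata query service.
EXAMPLE_QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 5
"""


def build_request(query, endpoint=WDQS_ENDPOINT):
    """Build (but do not send) a SPARQL-protocol GET request."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    headers = {
        # Ask for the standard SPARQL JSON results format.
        "Accept": "application/sparql-results+json",
        # Hypothetical identifier; public endpoints generally expect one.
        "User-Agent": "linked-data-chapter-example/0.1",
    }
    return urllib.request.Request(endpoint + "?" + params, headers=headers)


def run_query(query, endpoint=WDQS_ENDPOINT):
    """Send the request and return the list of result bindings."""
    request = build_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]


# Usage (requires network access, so it is left commented out):
# for row in run_query(EXAMPLE_QUERY):
#     print(row["item"]["value"], row["itemLabel"]["value"])
```

<p>Because the request relies on the standard <code>query</code> parameter and <code>Accept</code> header, the same request shape should carry over to other SPARQL end points, although the query text would have to change because the prefixes used here are specific to Wikidata.</p>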
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec14-6"/><b>14.6 Bibliographic Notes</b></h2>
<p class="noindent">The start of the SW movement can be traced back to a seminal article by Berners-Lee et al. (2001a), which describes the Semantic Web as the web of tomorrow that “will bring structure to the meaningful content of web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users.” It is a follow-up to an earlier Semantic Web road map published by Berners-Lee et al. (1998). Other work that was influential around that time includes an article describing SW services by McIlraith et al. (2001); Hendler (2001), on agents and the SW; and an article on ontology learning by Maedche and Staab (2001), among many others in the spurt of research that followed soon after. However, the focus at the time was largely on the vision of the Semantic Web, as well as the sorts of reasoning and architectural capabilities that would be necessary for the agents in Berners-Lee’s article to become commonplace.</p>
<p>The Linked Data movement and principles were still a few years away, but some of the work that continued to be published in the mid-2000s bore hints of the movement to come. Studies that we cite from this period include Antoniou and Van Harmelen (2004a) and Shadbolt et al. (2006). One of the earlier, highly cited articles on Linked Data (where it <span aria-label="388" id="pg_388" role="doc-pagebreak"/>was referred to as an “emerging web”) was Bizer (2009), although it cites an earlier document on design issues pertaining to Linked Data by Berners-Lee.<sup><a href="chapter_14.xhtml#fn9x14" id="fn9x14-bk">9</a></sup> Other relevant work includes Berners-Lee (2011). Influential articles, including several on DBpedia, that served as primary material for sections of this chapter include Bizer et al. (2007, 2008), Auer et al. (2007), and Kobilarov et al. (2009), the last of which discusses how the BBC used DBpedia and Linked Data to make connections.</p>
<p>For readers looking to go into much more depth on Linked Data than we were able to provide in this chapter, we recommend the synthesis lectures by Heath and Bizer (2011). For practitioners, especially web developers, Wood et al. (2014) is also recommended because it has step-by-step examples of increasing complexity, as well as practical techniques using popular tools like Python and JavaScript. Interesting (though by now, dated) perspectives are also found in Jain et al. (2010) and Bizer et al. (2011). More recent work related to this subject includes Kendall and McGuinness (2019), Sakr et al. (2019), and Pan et al. (2017). Other books, which focus on more specific aspects of linked data (such as for libraries and museums), include Hart and Dolbear (2016) and Van Hooland and Verborgh (2014). A good comparison of some influential linked data KGs, including DBpedia and YAGO, is provided by Färber et al. (2015). A different line of study examines the quality of linked data, including adoption of best practices. We cite Schmachtenberg et al. (2014) as a particularly thorough example. A more recent example is Zaveri et al. (2016).</p>
<p>We mentioned earlier in the chapter that URIs codify a naming standard, which the W3C helped to define. A good overview of definitions and syntax can be found in RFC 3986, a standard published through the Internet Engineering Task Force (IETF), with more details provided in RFCs 6874 and 7320, both of which update it. Pointers to these and related specifications may be found on the W3C’s website.<sup><a href="chapter_14.xhtml#fn10x14" id="fn10x14-bk">10</a></sup></p>
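<p>To make the generic URI syntax concrete, the following minimal sketch splits a URI into the five components that RFC 3986 names (scheme, authority, path, query, and fragment). It relies on <code>urlsplit</code> from the Python standard library, which follows the RFC only loosely, so it should be read as an illustration rather than as a conformance checker. The helper name <code>uri_components</code> is our own.</p>

```python
from urllib.parse import urlsplit


def uri_components(uri):
    """Return the five generic URI components as a dictionary."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,  # urlsplit calls the authority "netloc"
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


# A slash-style URI of the kind used by DBpedia.
slash_uri = uri_components("http://dbpedia.org/resource/Trojan")

# A hash-style URI, where the term name lives in the fragment.
hash_uri = uri_components("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")
```

<p>Here, <code>slash_uri["path"]</code> is <code>/resource/Trojan</code>, whereas in <code>hash_uri</code> the term name <code>type</code> appears in the fragment component instead.</p>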
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec14-7"/><b>14.7 Exercises</b></h2>
<ul class="numbered">
<li class="NL">1. Considering the data shown here, is either the first or second Linked Data principle violated? If either (or both) is violated, mention the line number where each violation occurs, and explain why it is considered a violation.</li>
</ul>
<figure class="IMG"><img alt="" class="width" src="../images/pg388-1.png"/>
</figure>
<ul class="numbered">
<li class="NL">2. What about the third and fourth Linked Data principles?</li>
<li class="NL">3. <span aria-label="389" id="pg_389" role="doc-pagebreak"/>For this exercise, go to the following DBpedia Downloads page: <a href="https://wiki.dbpedia.org/Downloads2015-04">https://<wbr/>wiki<wbr/>.dbpedia<wbr/>.org<wbr/>/Downloads2015<wbr/>-04</a>. Although this is not the latest version, the exact version does not matter for this and the next exercise, so long as it is a relatively recent version (i.e., 2015 or later).</li>
</ul>
<p class="myenumitem">Is there evidence on the page that lends credence to the hypothesis that the fourth Linked Data principle is being honored? Why or why not?</p>
<ul class="numbered">
<li class="NL">4. When navigating to <a href="http://dbpedia.org/resource/Trojan">http://dbpedia.org/resource/Trojan</a>, you will notice that the address in the browser changes to <a href="http://dbpedia.org/page/Trojan">http://dbpedia.org/page/Trojan</a>. Explain why this happens, citing the relevant Linked Data principles.</li>
<li class="NL">5. Try to find examples of at least three LOD data sets that we have not mentioned in this chapter and that are online and accessible. Draw a table and list their names, URLs where we can access a SPARQL end point or dump, a subject domain, and one use-case where the data set could prove useful.</li>
</ul>
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_14.xhtml#fn1x14-bk" id="fn1x14">1</a></sup> On occasion, also called the <i>Linking Open Data</i> project.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_14.xhtml#fn2x14-bk" id="fn2x14">2</a></sup> <a href="https://www.w3.org/">https://<wbr/>www<wbr/>.w3<wbr/>.org<wbr/>/</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_14.xhtml#fn3x14-bk" id="fn3x14">3</a></sup> Source: <a href="https://www.w3.org/Addressing/">https://<wbr/>www<wbr/>.w3<wbr/>.org<wbr/>/Addressing<wbr/>/</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_14.xhtml#fn4x14-bk" id="fn4x14">4</a></sup> <a href="https://www.w3.org/TR/rdfa-primer/">https://<wbr/>www<wbr/>.w3<wbr/>.org<wbr/>/TR<wbr/>/rdfa<wbr/>-primer<wbr/>/</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_14.xhtml#fn5x14-bk" id="fn5x14">5</a></sup> The VoID vocabulary is used for providing data set-level metadata, published as triples either in the data set itself, or back-linked in a separate VoID file.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_14.xhtml#fn6x14-bk" id="fn6x14">6</a></sup> <a href="https://www.openlinksw.com/">https://<wbr/>www<wbr/>.openlinksw<wbr/>.com<wbr/>/</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_14.xhtml#fn7x14-bk" id="fn7x14">7</a></sup> Source: <a href="https://www.wikidata.org/wiki/Wikidata:Introduction">https://<wbr/>www<wbr/>.wikidata<wbr/>.org<wbr/>/wiki<wbr/>/Wikidata:Introduction</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_14.xhtml#fn8x14-bk" id="fn8x14">8</a></sup> Note, however, that UMBEL is no longer supported by its editors (as of January 1, 2019) and support has instead migrated to a separate effort called KGpedia. Besides this change of terminology, many aspects still remain the same, insofar as we can determine; hence, we describe UMBEL as it was premigration.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_14.xhtml#fn9x14-bk" id="fn9x14">9</a></sup> <a href="https://www.w3.org/DesignIssues/LinkedData.html">https://<wbr/>www<wbr/>.w3<wbr/>.org<wbr/>/DesignIssues<wbr/>/LinkedData<wbr/>.html</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_14.xhtml#fn10x14-bk" id="fn10x14">10</a></sup> <a href="https://www.w3.org/TR/#Specifications">https://<wbr/>www<wbr/>.w3<wbr/>.org<wbr/>/TR<wbr/>/#Specifications</a>.</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html>