<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch2" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch2"><span aria-label="21" id="pg_21" role="doc-pagebreak"/>2</h1>
<h1 class="chapter-title"><b>Modeling and Representing Knowledge Graphs</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b> Before studying how to construct and use knowledge graphs (KGs), it is important to know how to <i>model</i> them. As indicated in the previous chapter, KGs have been heavily influenced by graph theory, and therefore it is unsurprising that many KG representations have an inherently graph-theoretic flavor. The diversity and subtleties of these different flavors are surprising and offer an interesting study in representational trade-offs. In this chapter, we study some KG representational models that have become dominant in KG-centric communities such as the Semantic Web. Chief among these models is the Resource Description Framework (RDF) model, as well as models that have been built on top of it, such as RDF Schema (RDFS). We also detail the Wikidata model and provide a brief primer on property-centric models such as the property graph and the property table. The chapter concludes with advanced research topics such as the Semantic Web Layer Cake, schema heterogeneity, and semantic labeling.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec2-1"/><b>2.1 Introduction</b></h2>
<p class="noindent">Research in data-driven communities such as Relational Database Management Systems (RDBMSs) has a rich history in modeling the data that is being stored, processed, and queried. These models seem intuitive on the surface; to an unpracticed eye, an RDBMS “looks” like a set of tables, but there is much more subtlety in the formal machinery that goes into modeling such systems. Some of these may be familiar already to the reader who has dabbled in RDBMSs, including decompositions, foreign keys, and functional dependencies. A critical feature is the separation between the “schema” of the system, which lays out a template for how the actual data can be modeled or represented, and the data itself.</p>
<p>KGs are less rigid than RDBMSs, but this makes a study of KG formalism all the more relevant. Without clear representational underpinnings, it is easy to think of any data item that has structure as a KG. What, for example, prevents us from calling an ordinary social network (such as an undirected friendship network) a KG? Why is Wikipedia not a KG? And most important, if we want to undertake the process of modeling and constructing KGs in our domain of interest, be it fighting fraud or designing a better e-commerce search engine, what formalism can we rely on to define and <i>constrain</i> the KG in appropriate ways?</p>
<p><span aria-label="22" id="pg_22" role="doc-pagebreak"/>In the previous chapter, we suggested (mainly through examples) an initial definition of a KG as a “labeled, multirelational, directed graph” (i.e., a graph where both nodes and edges have labels, and edges have directions). Even in this definition, a difference between formal and “normative” behavior quickly starts to emerge. In most typical KGs, the labels (or some significant fraction of them) tend to be human-readable or understandable (e.g., it is dubious whether a labeled graph where the labels are all random strings would be <i>perceived</i> as a KG). On the other hand, the definition is not always as rigid as it seems. In rare cases, KGs can be uni-relational—that is, in a simple product-customer KG, there may be a broad diversity in node labels, but only a single edge label (“purchase”). Some edges could even be undirected, though this is an easy formal fix (see the exercises at the end of this chapter). Our point here is that any definition or formalism involving KGs must be taken as a <i>guide</i>, rather than a strict, prescriptive statement resembling a mathematical definition (such as the definition of an irrational number). The KG community overwhelmingly relies on practice (rather than theory), as well as actual application of KGs in various domains, when deciding on such formal matters.</p>
<p>To drive this point home even further, let us take a second (arguably, even simpler) definition of a KG, namely, as a set of triples. A triple is like a “directed edge,” where the first and third elements of the triple are the two nodes and the middle element is the labeled edge “going from” the first node to the third node. We will see in a subsequent section that this is an important definition on which models like Resource Description Framework (RDF) are grounded. It is also, however, the definition of a <i>knowledge base (KB)</i>, leading to the question of whether a KB is also a KG. Normatively, we argue that a KB cannot be a KG unless it has some kind of <i>structure</i>. In the extreme case, imagine that no two triples in the KB share any labels. Represented as a graph, such a set of triples starts to look highly disconnected, where the degree of every node is exactly 1. Put another way, a graph is clearly not the right data model for such a data set, any more than a table with lots of missing values is the right data set for an RDBMS. But when exactly does a KB morph into a KG? In borderline or ambiguous cases, the difference in terminology may be mere semantics. But more often than not, practice and application play a deciding role in whether a set of triples is just that, or can be visualized and described as something more (i.e., a KG). Furthermore, as we shall see when discussing the Wikidata data model, it is not always a good idea to describe every KG using the set of triples because it can lead to loss of representational power.</p>
<p>Graphs have always been an exciting area for algorithms research, but the rapid advent of KGs has brought graphs to the forefront of data-modeling research in this last decade. Graph algorithms still apply to KGs, and they play an important role when we discuss how to query KGs in part IV of this book. In this chapter, we describe some important languages that can be used to both model the KG, and define the schema or <i>ontology</i> for the KG. These languages have emerged as dominant forces in KG-centric communities <span aria-label="23" id="pg_23" role="doc-pagebreak"/>like the Semantic Web (SW) and even Natural Language Processing (NLP), and most of them receive support from the likes of the World Wide Web Consortium (W3C). Research on them continues, but the primary aspects are well established by this time.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec2-1-1"/><b>2.1.1 Resource Description Framework</b></h3>
<p class="noindent">Technically, RDF is a framework for representing information on the web, and not necessarily a data model for representing KGs. However, because of the rise of KGs in the SW community, where the RDF model has been most developed and used, KGs in the SW have been represented either in RDF or in a “higher-level” representation based on RDF, such as RDFS. Because almost all models in the SW community ultimately derive from RDF, and some of these models, such as the Web Ontology Language (OWL), play a key role in part IV, when we describe reasoning and retrieval over KGs, we pay close attention to RDF and its design in this chapter.</p>
<p>At its core, RDF has an abstract syntax that reflects a simple, graph-based data model and has been motivated by several design factors, the most important of which is a way to represent information in a minimally constraining, flexible way. While RDF can be used in isolated applications, where individually designed formats might be more direct and easily understood, RDF’s generality offers greater value from sharing. For this reason, it is a good fit for KGs that either derive from, or need to be published to, the web (whether publicly accessible or not). We note that many modern KGs (e.g., Wikidata) are connected to the web in some way, which may also explain why this model has emerged as a good fit.</p>
<p>The underlying structure of any expression in RDF is a collection of triples, each consisting of a subject, a predicate, and an object. A set of such triples is called an <i>RDF graph</i>. This can be illustrated by a node and directed-arc diagram, in which each triple is represented as a node-arc-node link. We illustrate such a triple in <a href="chapter_2.xhtml#fig2-1" id="rfig2-1">figure 2.1</a>. The direction of the arc is clearly important because it always emerges from the subject and points to the object. Informally, the assertion of an RDF triple (<i>s, p, o</i>) says that the object <i>o</i> is related to subject <i>s</i> via a <i>p</i> relationship. The assertion of an RDF graph, a set of triples, amounts to the assertion of <span aria-label="24" id="pg_24" role="doc-pagebreak"/>all the triples in it; in other words, the <i>conjunction</i> of the statements corresponding to all the triples it contains.</p>
<div class="figure">
<figure class="IMG"><a id="fig2-1"/><img alt="" src="../images/Figure2-1.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig2-1">Figure 2.1</a>:</span> <span class="FIG">An example of a KG triple (:mayank_kejriwal, foaf:name, “Mayank Kejriwal”) represented in RDF. The prefix <i>foaf:</i> is shorthand for <a href="http://xmlns.com/foaf/0.1/">http://xmlns.com/foaf/0.1/</a> (i.e., the predicate in the triple is actually “<a href="http://xmlns.com/foaf/0.1/name">http://xmlns.com/foaf/0.1/name</a>,” and similarly, the prefix <i>:</i> before mayank_kejriwal is meant to express that mayank_kejriwal lies in the <i>default</i> namespace). In this case, the “object” is a literal; by convention, literals are represented as rectangles, while URI nodes and blank nodes are elliptical.</span></p></figcaption>
</figure>
</div>
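<p>As a minimal sketch of the idea that an RDF graph is nothing more than a set of triples, the fragment below models the triple of figure 2.1 in Python. The <i>Triple</i> type and the variable names are hypothetical, and the second triple (typing the subject as a <i>foaf:Person</i>) is an assumption added only to give the graph more than one edge; prefixed names are kept as plain strings rather than expanded to full URIs.</p>
<pre>
```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: an arc from subject to object, labeled by predicate."""
    subject: str
    predicate: str
    object: str


# A graph is a *set* of triples, so asserting the same triple twice changes nothing.
graph = {
    Triple(":mayank_kejriwal", "foaf:name", '"Mayank Kejriwal"'),
    Triple(":mayank_kejriwal", "rdf:type", "foaf:Person"),  # illustrative assumption
    Triple(":mayank_kejriwal", "foaf:name", '"Mayank Kejriwal"'),  # duplicate
}

# The nodes of the graph are the subjects and objects of its triples.
nodes = {t.subject for t in graph} | {t.object for t in graph}
```
</pre>
<p>Because the graph is a set, the order of the triples carries no meaning, which matches the reading of an RDF graph as a conjunction of statements.</p>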
<p>The nodes in an RDF graph expose the relationship between RDF and the web. An RDF node may be one of the following:</p>
<ul class="numbered">
<li class="NL">1. Uniform resource identifier (URI) with an optional <i>fragment</i> identifier (URI reference, or <i>URIref</i>). For instance, “https://example.com/path/resource.txt#fragment” is a URIref.</li>
<li class="NL">2. Literal; for instance, “Mayank Kejriwal” but also numbers like 23 or dates.</li>
<li class="NL">3. Blank node, which is an abstract identifier that is locally unique and has no separate form of identification. Because they are not globally de-referenceable, blank nodes are not web artifacts but are used for establishing identities of resources (which may, for example, be unnamed<sup><a href="chapter_2.xhtml#fn1x2" id="fn1x2-bk">1</a></sup>) within a local data set or <i>namespace</i>.</li>
</ul>
<p class="noindent">RDF properties are always URI references. A URI reference or literal used as a node identifies what that node represents. As mentioned previously, a URI reference used as a predicate identifies a relationship between the things represented by the incident nodes. Without additional constraints, a predicate URI reference may also be a node in the graph. This, along with other, similar observations, has motivated additional (subsequently described) models and vocabularies that “build upon” RDF by constraining it in the appropriate ways.</p>
<p>The most nonintuitive form that an RDF node can take is that of a blank node, which is neither a URI reference (in the proper sense) nor a literal. In the RDF abstract syntax, a blank node is just a unique node that can be used in one or more RDF statements but has no intrinsic name. There are many practical reasons for using such blank nodes. In many cases, it is because we want to use a local identifier for an unidentified resource in order to make assertions about it. The marriage example in the footnote here illustrates this use. However, note that the local identification semantics of the blank node has some important consequences when merging two or more RDF graphs because the blank nodes must be kept distinct if meaning is to be preserved; on occasion, this calls for reallocation of blank node identifiers. Formally, blank node identifiers are not part of the RDF abstract syntax, and the representation of triples containing blank nodes is entirely dependent on the particular concrete syntax used. In other words, the scheme for coming up with, and assigning, blank node identifiers is something that is entirely dependent on the person modeling and publishing the RDF data.</p>
<p>Concerning literals, datatypes in RDF allow the representation of values such as integers, floating point numbers, and even dates. There is a formal way to define datatypes in RDF, but many of the common types are already defined. A typed literal is a string combined <span aria-label="25" id="pg_25" role="doc-pagebreak"/>with a datatype URI (e.g., <i>&lt;</i>xsd:boolean, “true” &gt;, where <i>xsd</i> is a shorthand prefix for XML Schema), while a plain literal is simply a string combined with an optional language tag.</p>
<p class="TNI-H3"><b>2.1.1.1 Equivalences in RDF</b> From the previous discussion, an RDF graph is a set of RDF triples, and the set of nodes in this graph is the set of subjects and objects of triples in the graph. What does it mean for two RDF graphs, <i>G</i> and <i>G</i>′, to be equivalent? Formally, an equivalence holds if there is a bijection <i>M</i> between the sets of nodes of the two graphs, such that the following conditions are fulfilled:</p>
<ul class="numbered">
<li class="NL">1. <i>M</i> maps blank nodes in <i>G</i> to blank nodes in <i>G</i>′.</li>
<li class="NL">2. <i>M</i>(<i>lit</i>) = <i>lit</i> for all RDF literals <i>lit</i>, which are nodes of <i>G</i>.</li>
<li class="NL">3. <i>M</i>(<i>uri</i>) = <i>uri</i> for all RDF URI references <i>uri</i>, which are nodes of <i>G</i>.</li>
<li class="NL">4. The triple (<i>s, p, o</i>) is in <i>G</i> if and only if the triple (<i>M</i>(<i>s</i>)<i>, p, M</i>(<i>o</i>)) is in <i>G</i>′.</li>
</ul>
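<p>The four conditions above can be sketched as a check on a candidate mapping between two graphs. This is only an illustration under simplifying assumptions: nodes are plain strings, a blank node is assumed to be any string starting with <i>_:</i>, and the function name <i>is_equivalence</i> is hypothetical. The sketch verifies a mapping that is handed to it; it does not search for one.</p>
<pre>
```python
def is_blank(node):
    """Assumed convention for this sketch: blank node labels start with '_:'."""
    return node.startswith("_:")


def is_equivalence(g1, g2, m):
    """Return True if the dict m (nodes of g1 to nodes of g2) satisfies all four conditions."""
    nodes1 = {s for s, _, _ in g1} | {o for _, _, o in g1}
    nodes2 = {s for s, _, _ in g2} | {o for _, _, o in g2}
    # m must be a bijection between the two node sets.
    images = set(m.values())
    if set(m) != nodes1 or images != nodes2 or len(images) != len(m):
        return False
    for node, image in m.items():
        if is_blank(node):
            if not is_blank(image):  # condition 1: blank nodes map to blank nodes
                return False
        elif image != node:  # conditions 2 and 3: literals and URIs map to themselves
            return False
    # Condition 4: mapping every triple of g1 must yield exactly the triples of g2.
    return {(m[s], p, m[o]) for s, p, o in g1} == set(g2)


g1 = {("_:a", "foaf:knows", ":bob"), (":bob", "foaf:name", '"Bob"')}
g2 = {("_:x", "foaf:knows", ":bob"), (":bob", "foaf:name", '"Bob"')}
g3 = {("_:x", "foaf:knows", ":carol"), (":carol", "foaf:name", '"Bob"')}

same = is_equivalence(g1, g2, {"_:a": "_:x", ":bob": ":bob", '"Bob"': '"Bob"'})
different = is_equivalence(g1, g3, {"_:a": "_:x", ":bob": ":carol", '"Bob"': '"Bob"'})
```
</pre>
<p>Here the first two graphs differ only in a blank node label and are equivalent, while the third has the same shape but a different URI, so the mapping violates the third condition.</p>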
<p>We bring up the notion of equivalence to show that, unlike with ordinary graph theory, which primarily concerns unlabeled nodes and edges and where equivalences are usually established through graph isomorphisms, RDF (and by close extension, <i>knowledge</i>) graphs require a more stringent set of conditions. It is possible for <i>G</i> and <i>G</i>′ to be isomorphic without being equivalent in the RDF universe; intuitively, this happens when two graphs are structurally identical but contain different <i>content</i>, which would lead to an incomplete or conflicting mapping between their respective nodes or properties. Note that two RDF URI references are equal if and only if they compare as equal, character by character, as Unicode strings. For literals, equality is slightly more complicated; the following conditions have to be met:</p>
<ul class="numbered">
<li class="NL">1. The strings of the two lexical forms compare as equal, character by character.</li>
<li class="NL">2. Either both or neither have language tags (and if the language tags exist, they must be equal).</li>
<li class="NL">3. Either both or neither have datatype URIs (and if the datatype URIs exist, they must be equal, character by character).</li>
</ul>
<p>The last condition only applies to typed literals, while the first two conditions apply to both plain and typed literals. Sometimes, this means that adding more information to a literal can make the graph nonequivalent to another graph. For example, imagine the literal “Mont Blanc,” which exists in two separate KGs. Now, however, the author of the second RDF KG decides to add language tags to all literals and appends the tag @fr to “Mont Blanc.” The two literals, per the conditions given here, would now become unequal, and strictly speaking, the two graphs would become nonequivalent. Similarly, removing information can have the same effect (e.g., if both literals initially had the same language tag @fr, but the author of one graph decides to remove the language tag).</p>
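<p>The literal equality conditions can be illustrated with a small sketch. The <i>Literal</i> class and the <i>literals_equal</i> function are hypothetical names, and absent language tags or datatype URIs are modeled as <i>None</i>; language tags are compared exactly as written, which is a simplification.</p>
<pre>
```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Literal:
    lexical_form: str
    language: Optional[str] = None  # e.g. "fr"; None means no language tag
    datatype: Optional[str] = None  # datatype URI; None means no datatype


def literals_equal(a: Literal, b: Literal) -> bool:
    """Apply the three conditions: lexical form, language tag, datatype URI."""
    return (
        a.lexical_form == b.lexical_form
        and a.language == b.language  # both absent, or both present and equal
        and a.datatype == b.datatype  # both absent, or both present and equal
    )


plain = Literal("Mont Blanc")
tagged = Literal("Mont Blanc", language="fr")
```
</pre>
<p>With these definitions, <i>plain</i> and <i>tagged</i> compare as unequal even though their lexical forms match, which mirrors the Mont Blanc example.</p>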
</section>
<section epub:type="division">
<h3 class="head b-head"><span aria-label="26" id="pg_26" role="doc-pagebreak"/><a id="sec2-1-2"/><b>2.1.2 RDF Serializations</b></h3>
<p class="noindent">The discussion so far illustrates that, conceptually, RDF can be represented just as a directed, labeled graph, or as shown in the case of the simplest KG representation possible, as a set of triples. RDF files that are meant to be downloaded as dumps are frequently formatted in just this way, and they are known as <i>N-triples</i> files. An N-triples file has a triple on each line, with each triple representing an edge in the RDF KG. Recall that properties in RDF graphs are necessarily URIs, while subjects may be either URIs or blank nodes, which are not globally de-referenceable in the way that ordinary URIs are. N-triples lines are “absolute” in that each line can be processed independently because full URIs have to be used in each line (prefixes and shorthands are not allowed). This also makes them amenable to a streaming setting where lines have to be read in (and parsed) from the file and then discarded.</p>
<p>However, it is not desirable to encode all RDF KGs this way. One reason is that N-triples files can grow to be very large because each URI tends to be a long string, including the namespace of the graph itself. For example, consider the simple KG fragment in <a href="chapter_2.xhtml#fig2-2" id="rfig2-2">figure 2.2</a>. In N-triples format, the URI for a node would be repeated in as many lines as there are edges incident on that node.<sup><a href="chapter_2.xhtml#fn2x2" id="fn2x2-bk">2</a></sup> This URI is long, and intuitively, it seems unnecessary to be repeating the URI so many times. Furthermore, because of their use of absolute URIs, <span aria-label="27" id="pg_27" role="doc-pagebreak"/>N-triples files are not designed to be human-readable. The reason is that, while machines can ingest the file line by line and construct an RDF graph from it, humans typically find it inconvenient to parse all the long URIs. Thus, there are two needs here that have to be addressed.</p>
<div class="figure">
<figure class="IMG"><a id="fig2-2"/><img alt="" src="../images/Figure2-2.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig2-2">Figure 2.2</a>:</span> <span class="FIG">A KG fragment in its conceptual graph representation (above), and N-triples representation (below). The <i>rdf</i> and <i>foaf</i> prefixes represent the namespaces indicated in the N-triples fragment. The author’s URIs are for indicative purposes only. Prefixes are not permitted in the N-triples representation.</span></p></figcaption>
</figure>
</div>
<p>First, is it possible to write out the RDF KG in a format that makes it readable by humans and machines alike, with humans being able to parse the KG and its contents rather intuitively?</p>
<p>Second, is it possible to shorten the URIs so that the RDF file is not quite as big as in its N-triples version? The two needs are interrelated in an intuitive sense: we can imagine that, if there were a way to avoid the frequent use of such long URIs, the RDF file would become more compact and take up less bandwidth and storage. The compactness would also make the file more readable.</p>
<p>The Terse RDF Triple Language, known as Turtle, is a concrete syntax for RDF that generalizes the N-triples format and makes it possible to not repeat full URIs throughout the document, among other syntactic facilities. As an example, let us consider the code snippet in <a href="chapter_2.xhtml#fig2-3" id="rfig2-3">figure 2.3</a>, expressed in Turtle, of a hypothetical KG fragment describing Batman and the Joker in the DC comic book universe.</p>
<div class="figure">
<figure class="IMG"><a id="fig2-3"/><img alt="" src="../images/Figure2-3.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig2-3">Figure 2.3</a>:</span> <span class="FIG">A KG fragment expressed in Turtle as a predicate list.</span></p></figcaption>
</figure>
</div>
<p>This snippet is written as a <i>predicate list</i>, which is convenient when the same subject is referenced by a number of predicates. Additionally, prefixes can be used to truncate URIs further. In this case, the subject <i>&lt;</i>http://example.org/#joker&gt; has the <i>base</i> prefix <a href="http://example.org/">http://example.org/</a>, which once declared, does not need to be repeated. New prefixes can be defined using variables; for instance, <i>foaf</i> (Friend of a Friend) is a well-known namespace often used to model classes like <i>Person</i> and properties like <i>name</i> that are applicable to social entities like people, and has a full URI <i>&lt;</i>http://xmlns.com/foaf/0.1/&gt;. Just the use of prefixes alone makes the file a lot more compact and readable than the N-triples file, which <span aria-label="28" id="pg_28" role="doc-pagebreak"/>may be considered to be a special case (known as Simple Triples) of Turtle. However, beyond prefixes, there are further savings and greater readability because the subject is not repeated.</p>
<p>An even more compact representation afforded by Turtle is the <i>object list</i>. For example, consider the statement <i>&lt;http://example.org/#joker&gt; &lt;http://xmlns.com/foaf/0.1/name&gt; “The Clown Prince of Crime”, “The Ace of Knaves”</i>, which actually expresses <i>two</i> triples, each with the same subject and predicate, but with different objects (in this case, both objects are literals). For large KGs, these compact representations can be attractive; however, one deficiency that should not be forgotten is that Turtle files, as expressed using object and predicate lists, cannot be parsed line by line, which makes them more difficult to stream. A historical (though not current) limitation is that the W3C did not originally consider the Turtle language to be a standard (normative), although this changed in 2014, when the Turtle specification was published as a W3C recommendation. There are also multiple packages available for converting a valid RDF file in one serialization format to another.</p>
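<p>To make the relationship between the compact forms and the underlying triples concrete, the sketch below flattens a nested description (one subject, several predicates, several objects per predicate) into individual triples. It is not a Turtle parser; the nested dictionary merely mimics predicate lists and object lists, the <i>expand</i> function is a hypothetical helper, and the <i>rdf:type</i> entry is an assumption added for illustration.</p>
<pre>
```python
def expand(description):
    """Flatten {subject: {predicate: [objects]}} into a set of (s, p, o) triples."""
    return {
        (s, p, o)
        for s, predicate_list in description.items()
        for p, object_list in predicate_list.items()
        for o in object_list
    }


joker = {
    ":joker": {
        # An object list: one subject and predicate, two objects.
        "foaf:name": ['"The Clown Prince of Crime"', '"The Ace of Knaves"'],
        # A second entry in the predicate list (illustrative assumption).
        "rdf:type": ["foaf:Person"],
    }
}
triples = expand(joker)
```
</pre>
<p>The subject is written once and the <i>foaf:name</i> predicate once, yet the result contains three separate triples, each of which would occupy its own line in an N-triples file.</p>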
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec2-2"/><b>2.2 RDF Schema</b></h2>
<p class="noindent">RDF Schema provides a data-modeling vocabulary for RDF data. It can be best understood as a <i>semantic extension</i> of RDF. As the name itself indicates, it is like a vocabulary (the language remains RDF; i.e., RDFS is an extension of, not a replacement for, RDF) for building schemas that provides semantics to RDF graphs. Nevertheless, one can refer to RDFS itself as a language due to its incorporated extensions. A common use-case is to specify a functional domain ontology using RDFS, with the actual KG being in RDF, but obeying the vocabulary and other constraints specified by the RDFS ontology. RDFS is a popular language for building such ontologies, though in the truest sense, it has limited functionality compared to more powerful Semantic Web languages like OWL, which permit broader reasoning capabilities as covered in part IV. This is one reason why RDFS is thought of as a <i>schema</i> rather than an <i>ontology</i> language.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec2-2-1"/><b>2.2.1 RDFS Classes</b></h3>
<p class="noindent">The most important, and common, vocabulary unit in RDFS is <i>rdfs:Class</i>,<sup><a href="chapter_2.xhtml#fn3x2" id="fn3x2-bk">3</a></sup> which is used to declare classes (equivalently called <i>concepts</i>). Fundamentally, a class is a <i>group</i> (e.g., Country) into which <i>resources</i> (e.g., United States, Germany) may be divided. The members of a class are known as <i>instances</i> of the class. In the RDF context, classes are not special entities, but are themselves resources because they are also identified by URIs (or more generally, as permitted by the RDF standards, by <i>internationalized</i> resource identifiers, <span aria-label="29" id="pg_29" role="doc-pagebreak"/>or IRIs) and may be described using RDF properties. The <i>rdf:type</i> property is typically used to state that a resource is an instance of a class. Notice how the prefix allows us to distinguish whether a vocabulary element belongs to core RDF (such as <i>rdf:type</i>) or RDFS (such as <i>rdfs:Class</i>).</p>
<p>An important aspect of classes to note here is that RDF distinguishes between a class and the set of instances that make up that class (also called the <i>class extension</i>). In other words, two classes may have the same set of instances but still be different classes (e.g., the “World Economies” and “Countries” would likely have the same set of instances, but are technically different classes). This distinction is important because it shows that a class is not exhaustively defined by the set of its instances. A name is not just a name; it clearly expresses semantics in this worldview. We can see that this often makes intuitive sense because “World Economies” and “Countries” arguably do not have the same semantics, even though as concepts, they apply to an identical set of instances. In logic, we would say that the <i>extensional equivalence</i> of two classes in RDFS does not imply their <i>equality</i>. Another reason why this is useful, beyond intuitive semantic differences, is that it allows different classes like “World Economies” and “Countries” to have exactly the same instances, and yet have different properties. A world economy may have a property like “GDP” defined for it, while the same property may not be defined for a country because it was not considered important or general enough for the class of countries. If we had chosen to identify classes with only their extension, this advantage would be lost, and modeling real-world KGs would have become near-impossible.</p>
<p>Technically, a class may be a member of its own class extension (in other words, an instance of itself). All RDF Schema classes are themselves grouped into a class called <i>rdfs:Class</i>. For example, if we had a class <i>:Dog</i>, then we would declare it as a class through the triple <i>(:Dog, rdf:type, rdfs:Class)</i>. The intuition is to declare the resource (an ordinary URI at the end of the day) as a class. For this reason, <i>rdfs:Class</i> should be thought of as a reserved resource in the RDFS world. Being declared an instance of <i>rdfs:Class</i> has specific semantics, as the example triple with <i>:Dog</i> illustrates. Other such reserved URIs include <i>rdfs:subClassOf</i> and <i>rdfs:Literal</i>. While their names make them relatively self-explanatory, it is important to understand all the semantic implications when using these terms. For example, if a class <i>C</i> is declared to be a subclass of a class <i>C</i>′, then all instances of <i>C</i> must necessarily be instances<sup><a href="chapter_2.xhtml#fn4x2" id="fn4x2-bk">4</a></sup> of <i>C</i>′. This affects the capability of reasoners, as we will see in part IV. For now, the important thing to remember is that these reserved terms are designed precisely to allow RDFS to incorporate standard semantics into some of its vocabulary elements.</p>
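<p>The subclass semantics can be sketched as a simple rule applied until nothing new is derived: whenever a resource is an instance of a class, it is also an instance of every superclass of that class. The function name <i>infer_types</i> and the classes <i>:Mammal</i> and <i>:Animal</i> are hypothetical; this is an illustration of the rule rather than a full RDFS reasoner.</p>
<pre>
```python
def infer_types(triples):
    """Add (x, rdf:type, D) whenever (x, rdf:type, C) and (C, rdfs:subClassOf, D) hold."""
    inferred = set(triples)
    changed = True
    while changed:  # repeat until a pass adds no new triple
        changed = False
        for s, p, o in list(inferred):
            if p != "rdf:type":
                continue
            for c, q, d in list(inferred):
                if q == "rdfs:subClassOf" and c == o:
                    new_triple = (s, "rdf:type", d)
                    if new_triple not in inferred:
                        inferred.add(new_triple)
                        changed = True
    return inferred


kg = {
    (":Dog", "rdf:type", "rdfs:Class"),
    (":Dog", "rdfs:subClassOf", ":Mammal"),
    (":Mammal", "rdfs:subClassOf", ":Animal"),
    (":rex", "rdf:type", ":Dog"),
}
closed = infer_types(kg)
```
</pre>
<p>Repeating the rule until no change occurs is what lets the instance <i>:rex</i> reach <i>:Animal</i> through the intermediate class.</p>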
<p><span aria-label="30" id="pg_30" role="doc-pagebreak"/>It is important to note that, in RDFS, instances and subclasses are different from one another. Specifically, recall that all things described by RDF are called resources, and are considered to be instances of the class <i>rdfs:Resource</i>. This is the putative “class of everything” (i.e., all other classes are subclasses of this class). Here, <i>rdfs:Resource</i> itself is an instance of <i>rdfs:Class</i>. Interestingly, <i>rdfs:Class</i> is also defined as an instance of itself, although this is not common among classes in general. Other important general classes that are critical for defining good schemas include (note again that some have the prefix <i>rdf:</i> and are technically defined in RDF, not RDFS):</p>
<ul class="numbered">
<li class="NL">1. <b>rdfs:Literal:</b> The class of literal values such as strings and integers. Property values such as textual strings are examples of RDF literals. Here, <i>rdfs:Literal</i> is an instance of <i>rdfs:Class</i>, and a subclass of <i>rdfs:Resource</i>.</li>
<li class="NL">2. <b>rdfs:Datatype:</b> The class of datatypes. All instances of <i>rdfs:Datatype</i> correspond to the RDF model of a datatype (we provided details on the different elements of this model earlier). Here, <i>rdfs:Datatype</i> is both an instance of and a subclass of <i>rdfs:Class</i>; each instance of <i>rdfs:Datatype</i> is a subclass of <i>rdfs:Literal</i>.</li>
<li class="NL">3. <b>rdf:LangString:</b> The class of language-tagged string values. Here, <i>rdf:LangString</i> is an instance of <i>rdfs:Datatype</i> and a subclass of <i>rdfs:Literal</i>.</li>
<li class="NL">4. <b>rdf:HTML:</b> The class of Hypertext Markup Language (HTML) literal values. Here, <i>rdf:HTML</i> is an instance of <i>rdfs:Datatype</i> and a subclass of <i>rdfs:Literal</i>.</li>
<li class="NL">5. <b>rdf:XMLLiteral:</b> The class of Extensible Markup Language (XML) literal values. Here, <i>rdf:XMLLiteral</i> is an instance of <i>rdfs:Datatype</i> and a subclass of <i>rdfs:Literal</i>.</li>
<li class="NL">6. <b>rdf:Property:</b> The class of RDF properties. Here, <i>rdf:Property</i> is an instance of <i>rdfs:Class</i>.</li>
</ul>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec2-2-2"/><b>2.2.2 RDFS Properties</b></h3>
<p class="noindent">Recall that we earlier defined a property as a relation between subject resources and object resources. RDFS, however, goes further, introducing the concept of <i>subproperty</i>. The <i>rdfs:subPropertyOf</i> property may be used to state that one property is a subproperty of another. The semantics are as follows, and analogous to the subclass semantics. Specifically, if a property <i>P</i> is a subproperty of property <i>P</i>′, then all pairs of resources that are related by <i>P</i> are also related by <i>P</i>′. The term <i>superproperty</i> is often used as the inverse of subproperty. Conversely, if a property <i>P</i>′ is a superproperty of a property <i>P</i>, then all pairs of resources that are related by <i>P</i> are also related by <i>P</i>′. RDFS does not define a top property that is the superproperty of all properties.</p>
<p>Another important element introduced by RDFS is the notion of a property’s constraints on what it can take on as domain and range. The domain of a property is the set of values that can serve as its subject, while the range is the set of values that can serve as its object. RDFS introduces several role constraints by way of properties like <i>rdfs:range</i>, <i>rdfs:domain</i>, and <i>rdfs:subClassOf</i> (the last of which was described earlier). Here, <i>rdfs:domain</i> is an <span aria-label="31" id="pg_31" role="doc-pagebreak"/>instance of <i>rdf:Property</i> that is used to state that any resource that has a given property is an instance of one or more classes. A triple <i>(P, rdfs:domain, C)</i> states that <i>P</i> is an instance of the class <i>rdf:Property</i>, that <i>C</i> is an instance of the class <i>rdfs:Class</i>, and that the resources denoted by the subjects of triples whose predicate is <i>P</i> are instances of the class <i>C</i>. Where a property <i>P</i> has more than one <i>rdfs:domain</i> property, then the resources denoted by subjects of triples with predicate <i>P</i> are instances of <i>all</i> the classes stated by the <i>rdfs:domain</i> properties (this is a subtle point that can lead to erroneous semantics and modeling by an inexperienced practitioner).</p>
<p>The <i>rdfs:domain</i> property may even be applied to itself. The <i>rdfs:domain</i> of <i>rdfs:domain</i> is the class <i>rdf:Property</i>, which states that any resource with an <i>rdfs:domain</i> property is an instance of <i>rdf:Property</i>. This seems intuitive enough, but without declaring it in the way we just described, it is not formalized.</p>
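<p>The subproperty semantics can be sketched in the same spirit as the subclass rule: every pair of resources related by a property is also related by each of its superproperties. The function name <i>apply_subproperties</i> and the properties <i>:hasMother</i>, <i>:hasParent</i>, and <i>:hasAncestor</i> are hypothetical and serve only as an illustration.</p>
<pre>
```python
def apply_subproperties(triples):
    """Add (s, Q, o) whenever (s, P, o) and (P, rdfs:subPropertyOf, Q) hold."""
    inferred = set(triples)
    changed = True
    while changed:  # repeat so that chains of subproperties are followed
        changed = False
        declarations = [(s, o) for s, p, o in inferred if p == "rdfs:subPropertyOf"]
        for sub, sup in declarations:
            for s, p, o in list(inferred):
                if p == sub and (s, sup, o) not in inferred:
                    inferred.add((s, sup, o))
                    changed = True
    return inferred


family = {
    (":hasMother", "rdfs:subPropertyOf", ":hasParent"),
    (":hasParent", "rdfs:subPropertyOf", ":hasAncestor"),
    (":ann", ":hasMother", ":beth"),
}
family_closed = apply_subproperties(family)
```
</pre>
<p>Only the single <i>:hasMother</i> edge was asserted, yet the closed graph also relates the same pair by both superproperties.</p>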
<p>In a similar vein, <i>rdfs:range</i> is an instance of <i>rdf:Property</i> that is used to state that the values of a property are instances of one or more classes. The triple <i>(P, rdfs:range, C)</i> states that <i>P</i> is an instance of the class <i>rdf:Property</i>, that <i>C</i> is an instance of the class <i>rdfs:Class</i>, and that the resources denoted by the objects of triples whose predicate is <i>P</i> are instances of the class <i>C</i>. Where <i>P</i> has more than one <i>rdfs:range</i> property, the resources denoted by the objects of triples with predicate <i>P</i> are instances of all the classes stated by the <i>rdfs:range</i> properties.</p>
<p>Just like <i>rdfs:domain</i>, the <i>rdfs:range</i> property can be applied to itself. The <i>rdfs:range</i> of <i>rdfs:range</i> is the class <i>rdfs:Class</i>. This states that any resource that is the value of an <i>rdfs:range</i> property is an instance of <i>rdfs:Class</i>. The <i>rdfs:range</i> property is applied to properties. This can be represented in RDF using the <i>rdfs:domain</i> property. The <i>rdfs:domain</i> of <i>rdfs:range</i> is the class <i>rdf:Property</i>. This states that any resource with an <i>rdfs:range</i> property is an instance of <i>rdf:Property</i>. Similarly, <i>rdfs:range</i> can also be applied to <i>rdfs:domain</i> with similar intuitions—namely the <i>rdfs:range</i> of <i>rdfs:domain</i> is the class <i>rdfs:Class</i>, which states that any resource that is the <i>value</i> of an <i>rdfs:domain</i> property is an instance of <i>rdfs:Class</i>.</p>
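<p>The domain and range semantics described above can be sketched as a single inference pass over a set of triples. This is an illustrative sketch only: the <i>ex:</i> names are hypothetical, the four axiomatic triples encode the self-applications of <i>rdfs:domain</i> and <i>rdfs:range</i> discussed in the text, and a real reasoner would repeat the pass until no new triples appear.</p>

```python
# Self-applications of rdfs:domain and rdfs:range, as described in the text.
AXIOMS = {
    ("rdfs:domain", "rdfs:domain", "rdf:Property"),
    ("rdfs:domain", "rdfs:range", "rdfs:Class"),
    ("rdfs:range", "rdfs:domain", "rdf:Property"),
    ("rdfs:range", "rdfs:range", "rdfs:Class"),
}

# Hypothetical example data; the ex: names are illustrative only.
triples = {
    ("ex:teaches", "rdfs:domain", "ex:Person"),
    ("ex:teaches", "rdfs:domain", "ex:Employee"),
    ("ex:teaches", "rdfs:range", "ex:Course"),
    ("ex:bob", "ex:teaches", "ex:kg101"),
}


def infer_types(triples):
    """Derive rdf:type triples entailed by rdfs:domain and rdfs:range (one pass)."""
    domains, ranges = {}, {}
    for s, p, o in triples:
        if p == "rdfs:domain":
            domains.setdefault(s, set()).add(o)
        elif p == "rdfs:range":
            ranges.setdefault(s, set()).add(o)
    inferred = set()
    for s, p, o in triples:
        # Subjects are instances of ALL declared domain classes.
        for cls in domains.get(p, ()):
            inferred.add((s, "rdf:type", cls))
        # Objects are instances of ALL declared range classes.
        for cls in ranges.get(p, ()):
            inferred.add((o, "rdf:type", cls))
    return inferred


types = infer_types(triples | AXIOMS)
```

<p>Note that <i>ex:bob</i> is inferred to be both an <i>ex:Person</i> and an <i>ex:Employee</i>: multiple domains are conjunctive, which is exactly the subtle point raised above. The axioms additionally yield that <i>ex:teaches</i> is an <i>rdf:Property</i> and that <i>ex:Person</i> is an <i>rdfs:Class</i>.</p>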
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec2-3"/><b>2.3 Property-Centric Models</b></h2>
<p class="noindent">The RDF and RDFS models provide some sophisticated capabilities, but we saw (at the very beginning of this chapter) that simpler models, such as the set of triples, can often suffice for certain use-cases and communities. Such simple models have an advantage in that they are easy to express, use, and transfer between programs, but do not come with the sophisticated knowledge representation of RDF and other languages that build on it. Usually, the choice is not so extreme. In the next section, we present the Wikidata model, which offers a different trade-off between simplicity, expressivity, and formal or representational sophistication. In this section, we provide yet another perspective by describing a <span aria-label="32" id="pg_32" role="doc-pagebreak"/>simpler class of models that intuitively treat <i>properties</i> (or predicates) as first-class citizens.</p>
<p>The property graph is perhaps the simplest (and quite widely used) illustration of such a model. The Neo4j graph database, which is covered in some depth in chapter 12, uses this model as its primary representation of KGs. The most important difference between RDF and property graphs is that, while RDF uses Uniform Resource Identifiers (URIs) as identifiers (for relationships and entities), property graphs use purely local identifiers such as strings (recall that these were <i>literals</i> in the RDF context). At its core, a property graph is just a set of nodes and edges, where each node (and edge) may be thought of as a <i>data structure</i> of keys and values. An example is shown in <a href="chapter_2.xhtml#fig2-4" id="rfig2-4">figure 2.4</a>. Because nodes and edges are both structures, complex information can be expressed quite succinctly, though without the same kinds of semantic and logical safeguards that RDF provides. In recent years, these models have become quite popular, and the Neo4j implementation also provides a querying language specifically designed for manipulating such graphs. The main reason the property graph is considered a <i>property-centric</i> model (other than the similarity in names) is its support for key-value representations at both the node and edge levels. The philosophy is that it is natural for entities to have an associated bag of properties, and these properties are expressed using keys and values, exactly as shown in <a href="chapter_2.xhtml#fig2-4">figure 2.4</a>. In contrast, RDF is able to provide such support only by introducing new classes at a more abstract level or through reification (for example). For many ordinary practitioners, developers, and subject matter experts who have not necessarily been trained to understand such concepts, these mechanisms can be difficult to convey. We provide more illustrations and formalism, including the querying language itself, in chapter 12.
At present, it suffices to know that such models exist as pragmatic alternatives to RDF-like models. In actual practice, therefore, one must carefully consider the pros and cons of all these models (including the subsequently presented Wikidata model) before making a representational choice. We emphasize that, at the present time, there are enough good models, tool support, and representational choices available that inventing a new model and language for a task is usually considered wasteful.</p>
<div class="figure">
<figure class="IMG"><a id="fig2-4"/><img alt="" src="../images/Figure2-4.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig2-4">Figure 2.4</a>:</span> <span class="FIG">A triple as it would be represented in a property graph model. Note how both the nodes and the edges are key-value data structures rather than URIs or literals.</span></p></figcaption>
</figure>
</div>
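<p>To make the key-value view concrete, the following minimal sketch stores a tiny property graph in plain Python dictionaries. It is not how Neo4j or any other system stores its data; the identifiers, keys, and edge type are made up for illustration.</p>

```python
# A tiny in-memory property graph. Identifiers are purely local strings,
# and every key and value below is made up for illustration.
nodes = {
    "n1": {"label": "Person", "name": "Jack Nicholson"},
    "n2": {"label": "Character", "name": "The Joker"},
}
edges = [
    # The edge is itself a key-value structure, so it can carry context
    # (movie, year) that a plain triple could not.
    {"source": "n1", "target": "n2", "type": "PLAYED",
     "movie": "Batman", "year": 1989},
]


def neighbors(node_id, edge_type):
    """Yield (edge, target node) pairs for outgoing edges of one type."""
    for edge in edges:
        if edge["source"] == node_id and edge["type"] == edge_type:
            yield edge, nodes[edge["target"]]


played = [(e["year"], n["name"]) for e, n in neighbors("n1", "PLAYED")]
```

<p>Here both the nodes and the edge carry their own bag of properties, which is the feature that makes the model property-centric.</p>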
<p>Another kind of property-centric model is the <i>property table</i>. Property tables attempt to improve standard triplestore infrastructures, where each triple statement is stored in a three-column <span aria-label="33" id="pg_33" role="doc-pagebreak"/>table in a relational database (RDB) or other table-oriented infrastructure, by taking advantage of regularity in RDF data sets. Hence, unlike the property <i>graphs</i> described previously, the underlying model is usually still RDF, and the property table provides a more efficient mechanism for representing an RDF graph (given certain assumptions about the regularity). However, they are worth studying because one also could consider non-RDF variants of the concept.</p>
<p>The key idea (though variants also exist) is illustrated in <a href="chapter_2.xhtml#fig2-5" id="rfig2-5">figure 2.5</a>. Let us assume that each subject in an RDF graph is associated with (at most) <i>n</i> single-valued properties. These can be stored in a single table with one row per subject and one column per property. The multivalued properties (where, for a given subject <i>s</i> and multivalued property <i>p</i>, there are multiple triples of the form <i>&lt; s, p,</i> ?<i>x &gt;</i>, with ?<i>x</i> being a placeholder for an object) are treated separately; in essence, a separate two-column table with columns <i>subj</i> and <i>p</i><sub><i>i</i></sub> is created for <i>each</i> multivalued property <i>p</i><sub><i>i</i></sub>. As is evident even from this brief description, the property table makes more sense when a table such as the one in <a href="chapter_2.xhtml#fig2-5">figure 2.5</a> is relatively <i>nonsparse</i>. This tends to be the case for data that is more like an RDB than an irregular graph. Rather than choose between an RDB, with its more efficient and intuitive structure for such data, and RDF, which enables one to more naturally apply KG analytics, a practitioner could just use the property table. Semantic Web tools such as Apache Jena offer direct support for working with, and storing, such tables without sacrificing RDF as the conceptual mode of representation.</p>
<div class="figure">
<figure class="IMG"><a id="fig2-5"/><img alt="" src="../images/Figure2-5.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig2-5">Figure 2.5</a>:</span> <span class="FIG">One approach to represent a set of triples or a triplestore as a property table. In a property-centric representation such as this, properties are elevated to the status of metadata as column headers (or even tables in themselves for multivalued properties) rather than values (the values in the cells of the second column in the original triplestore representation). In the derived tables, note that objects (or type) will always be the cell values, not including the first column, which is always the subject.</span></p></figcaption>
</figure>
</div>
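<p>The construction described above can be sketched in a few lines. The code below is one possible reading of the idea, not the layout used by Apache Jena or any specific triplestore; the <i>ex:</i> data and the function name are hypothetical. Single-valued properties become columns of one wide table, and each multivalued property gets its own two-column table.</p>

```python
from collections import defaultdict

# Hypothetical triples; ex:author is multivalued for ex:book1.
triples = [
    ("ex:book1", "ex:title", "Knowledge Graphs"),
    ("ex:book1", "ex:year", "2021"),
    ("ex:book1", "ex:author", "ex:a1"),
    ("ex:book1", "ex:author", "ex:a2"),
    ("ex:book2", "ex:title", "Graph Theory"),
]


def build_property_tables(triples):
    """Split triples into one wide table plus one table per multivalued property."""
    values = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        values[p][s].append(o)
    # A property is multivalued if any subject has more than one object for it.
    multi = {p for p, by_subj in values.items()
             if any(len(objs) > 1 for objs in by_subj.values())}
    single = sorted(set(values) - multi)
    subjects = sorted({s for s, _, _ in triples})
    # Wide table: one row per subject, one column per single-valued property.
    # Missing cells are None, which is where sparsity shows up.
    wide = {s: {p: (values[p][s][0] if s in values[p] else None) for p in single}
            for s in subjects}
    # Side tables: (subject, object) rows for each multivalued property.
    side = {p: sorted((s, o) for s, objs in values[p].items() for o in objs)
            for p in multi}
    return wide, side


wide_table, side_tables = build_property_tables(triples)
```

<p>In this example the wide table has one empty cell (<i>ex:book2</i> has no <i>ex:year</i>); the more such cells there are, the less attractive the property table becomes.</p>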
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="34" id="pg_34" role="doc-pagebreak"/><a id="sec2-4"/><b>2.4 Wikidata Model</b></h2>
<p class="noindent">To understand Wikidata, we must first understand its relationship with the Wikipedia project, created with the vision of “a world in which every single human being can freely share in the sum of all knowledge.” Wikipedia is a multilingual online encyclopedia created and maintained by volunteers who edit articles online using a simple markup language specifically designed for it. Wikipedia is available in over 300 languages, contains over 45 million pages, and has billions of visitors every month. While Wikipedia consists mostly of text documents, the pages in Wikipedia contain an enormous amount of structured data, including numbers, dates, coordinates, and relationships. Most structured data in Wikipedia is recorded in the <i>infoboxes</i> present in most pages. <a href="chapter_2.xhtml#fig2-6" id="rfig2-6">Figure 2.6</a> shows the info boxes for “The Joker” in the English and French Wikipedias. We see that the French Wikipedia has more data about the Joker, including aliases, gender, enemies, actors, and voices.</p>
<div class="figure">
<figure class="IMG"><a id="fig2-6"/><img alt="" src="../images/Figure2-6.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig2-6">Figure 2.6</a>:</span> <span class="FIG">Info boxes for “The Joker” in the English and French Wikipedias, retrieved from <i>en.wikipedia.org/wiki/Joker_(character)</i> and <i>fr.wikipedia.org/wiki/Joker_(comics)</i>, respectively.</span></p></figcaption>
</figure>
</div>
<p>The goal of the Wikidata project is to build a central, multilingual KG to record the structured data for the subjects of Wikipedia articles, as well as to use Wikidata to populate the info boxes for all Wikipedia pages, in all languages. Doing so will remove the need for Wikipedia volunteers around the world to enter the same information manually in different languages. At this time, Wikidata contains data for more than 60 million items, and the migration of Wikipedia info boxes to Wikidata is well underway, with millions of Wikipedia pages already converted to query the data directly from Wikidata.</p>
<p>When we inspect the info boxes in <a href="chapter_2.xhtml#fig2-6">figure 2.6</a>, we see that most of the information can be represented in RDF. The subject is <span class="obeylines-h"><span class="verb"><samp class="cmtt-10">:the_joker</samp></span></span>, the property is derived from the attribute name, and the object is the attribute value. For example:</p>
<p class="verbatim"><span aria-label="35" id="pg_35" role="doc-pagebreak"/>:the_joker :publisher :dc_comics .</p>
<p>In the French Wikipedia, we see the attribute “Interprété par” to record the actors who played the Joker in a movie. For example, Cesar Romero played the Joker in the <i>Batman</i> movie of 1966, and Jack Nicholson in the <i>Batman</i> movie of 1989. The following is an incorrect attempt to record this information in triples:</p>
<p class="verbatim">:the_joker :performer :cesar_romero .<br/>:the_joker :performer :jack_nicholson .<br/>:the_joker :movie :batman_1966 .<br/>:the_joker :movie :batman_1989 .</p>
<p>In this representation, we lost the information that Cesar Romero was in the 1966 movie and Jack Nicholson was in the 1989 movie. Triples allow us to represent the relationship between two items (character and actor), but in this example, the relationship is between three items (character, actor, and movie). Other examples include “spouse,” where we need to record the person as well as the start and end dates of the relationship; and “population” for countries where we need to record the population and the year when the population was recorded. It also makes sense for other properties of countries to be recorded as <i>n</i>-ary, rather than as binary, relations (see the exercises).</p>
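<p>The loss of information can be made explicit with a short sketch. Assuming the four flat triples above, any attempt to recover which actor appeared in which movie yields four candidate pairings, only two of which are true. One remedy, in the spirit of the qualifiers introduced later in this section, is to attach the movie to each performer claim; the dictionary layout used here is an illustrative assumption, not Wikidata's actual serialization.</p>

```python
from itertools import product

flat_triples = [
    (":the_joker", ":performer", ":cesar_romero"),
    (":the_joker", ":performer", ":jack_nicholson"),
    (":the_joker", ":movie", ":batman_1966"),
    (":the_joker", ":movie", ":batman_1989"),
]

performers = [o for s, p, o in flat_triples if p == ":performer"]
movies = [o for s, p, o in flat_triples if p == ":movie"]
# Nothing in the flat triples ties an actor to a movie, so every
# combination is equally plausible.
candidate_pairings = list(product(performers, movies))

# Illustrative alternative: keep the movie as context on each claim.
qualified = [
    {"subject": ":the_joker", "property": ":performer",
     "value": ":cesar_romero", "qualifiers": {":movie": ":batman_1966"}},
    {"subject": ":the_joker", "property": ":performer",
     "value": ":jack_nicholson", "qualifiers": {":movie": ":batman_1989"}},
]


def performer_in(movie):
    """Return the performers whose claim is qualified with the given movie."""
    return [st["value"] for st in qualified
            if st["qualifiers"].get(":movie") == movie]
```

<p>With the qualified form, <i>performer_in(":batman_1989")</i> identifies Jack Nicholson unambiguously, whereas the flat triples admit all four pairings.</p>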
<p>Another important requirement follows from the nature of Wikipedia as a secondary source of information. Rather than publishing its own research, Wikipedia publishes findings recorded in primary sources, which include scholarly publications, news articles, and web sites, and records references to primary sources so that readers can judge the trustworthiness <span aria-label="36" id="pg_36" role="doc-pagebreak"/>of the information. Wikidata is also a secondary source of information, so it also must record links to one or several primary sources. For every triple, we need the ability to represent links to primary sources where the information was published.</p>
<p>To faithfully represent the information in Wikipedia, we also need to represent uncertainty and units of measure. For example, the year when the famous <i>Mona Lisa</i> was painted is uncertain. Many art scholars believe that it was painted between 1503 and 1506, but recent research suggests that it was not started before 1513. In RDF, dates are represented as literals, but in this example, we need the ability to represent intervals. Units of measure are also important. For example, when we record the nominal gross domestic product (GDP) of the United States as 18,120,714,000,000, we must state whether this number is in US dollars, euros, or another currency.</p>
<p>In summary, the data representation requirements for Wikidata are:</p>
<ul class="bullet">
<li class="BL">Representation of <i>classes, instances, properties, and literals</i></li>
<li class="BL">Representation of <i>n-ary relations</i> to record contextual or qualifying information about triples</li>
<li class="BL">Representation of <i>references</i> to record links to the sources where the information was published</li>
<li class="BL">Representation of the <i>units of measure</i> for quantities</li>
<li class="BL">Representation of <i>uncertainty</i> for dates and quantities</li>
</ul>
<p class="noindent">The Wikidata KG, like all KGs, consists of nodes and edges. Wikidata uses two types of nodes, for <i>items</i> and for <i>properties</i>, and the edges are defined using <i>statements</i>. The main difference between Wikidata and KGs represented using RDF is that statements contain much more information than RDF triples; in addition to claims, which correspond to RDF triples, statements record references and qualifiers.</p>
<ul class="bullet">
<li class="BL"><b>Item:</b> A thing being described in Wikidata. It can be a concrete thing, such as <i>Jack Nicholson</i>, the actor, a creation of the mind, such as <i>The Joker,</i> an event, such as the <i>Normandy landings</i> in World War II, or anything for which primary documentation exists.</li>
<li class="BL"><b>Property:</b> Describes attributes that items can have, such as the <i>birth date</i> of a person, or relationships between two items, such as <i>place of birth</i>, which can be used to relate a person to a location.</li>
<li class="BL"><b>DataValue:</b> Represents the value of an attribute of an item, such as dates, quantities, coordinates, and so on.</li>
<li class="BL"><b>Statement:</b> Represents a specific piece of knowledge about an item, such as <i>Jack Nicholson's date of birth is April 22, 1937, according to the Integrated Authority File</i>, or <i>Jack Nicholson received the Academy Award for Best Actor in 1998 for his work in As Good as It Gets</i>. Statements include claims, references, and qualifiers.</li>
<li class="BL"><span aria-label="37" id="pg_37" role="doc-pagebreak"/><b>Claim:</b> The part of a statement that records the factual knowledge in the statement, such as <i>Jack Nicholson's date of birth is April 22, 1937</i>, or <i>Jack Nicholson received the Academy Award for Best Actor</i>. A claim is specified using a property, such as <i>date of birth</i>, together with either a value, such as <i>April 22, 1937</i>, or an item, such as <i>As Good as It Gets</i>.</li>
<li class="BL"><b>Reference:</b> The part of a statement that records links to external resources or publications where the claim is documented, such as <i><span class="ellipsis">…</span>according to the Integrated Authority File</i>.</li>
<li class="BL"><b>Qualifier:</b> A part of a claim that records contextual information about the claim, such as <i><span class="ellipsis">…</span>in 1998</i> or <i><span class="ellipsis">…</span>for his work in As Good as It Gets</i>. A qualifier, like a claim, is specified using a property and a value, but unlike claims, qualifiers may not have other qualifiers as parts.</li>
</ul>
<section epub:type="division">
<h3 class="head b-head"><a id="sec2-4-1"/><b>2.4.1 Wikidata Items</b></h3>
<p class="noindent">Wikidata items are specified as tuples <i>(identifier, label, aliases, description, statements)</i>, where:</p>
<ul class="bullet">
<li class="BL"><b>Identifier:</b> Every Wikidata item has a unique identifier, often referred to as a <i>q-node</i> because it consists of the letter Q followed by an integer. In the example in <a href="chapter_2.xhtml#fig2-7" id="rfig2-7">figure 2.7</a>, we see that the item corresponding to the Joker has the identifier Q217533 in Wikidata.</li>
</ul>
<div class="figure">
<figure class="IMG"><a id="fig2-7"/><img alt="" src="../images/Figure2-7.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig2-7">Figure 2.7</a>:</span> <span class="FIG">A KG fragment from Wikidata for “The Joker.”</span></p></figcaption>
</figure>
</div>
<ul class="bullet">
<li class="BL"><b>Label:</b> Items have labels to denote the names of the concept that they represent. Wikidata encourages items to have labels in different languages, as it is intended to be a multilingual resource. An item can have at most one label in each language, and unlike identifiers, labels are not required to be unique, so multiple items may have the same label. Wikidata encourages capitalization to follow natural-language rules, where proper names are capitalized but other labels are in lowercase.</li>
<li class="BL"><b>Aliases:</b> Because items can only have one label in each language, additional names for an item can be recorded as aliases. Wikidata encourages recording multiple aliases including colloquial and other commonly used names for items. Aliases are intended to improve recall of search engines.</li>
<li class="BL"><b>Description:</b> Items have concise descriptions, one per language. Descriptions often provide information about the type of the item and are useful for distinguishing items that share labels.</li>
<li class="BL"><b>Statements:</b> Items often have multiple statements to record attributes and relationships to other items. Each statement has a property, a value, zero or more references, and zero or more qualifiers. Items may have multiple statements for the same property. For example, in Wikidata's entry for the Joker, there are two statements using the property “member of,” recording the fact that the Joker is a member of both the Injustice Gang and the Injustice League. When items have multiple statements using the same property, the user interface groups them in a statement group, but in the KG, each statement is a separate edge.</li>
</ul>
<p><span aria-label="38" id="pg_38" role="doc-pagebreak"/>Wikidata statements are specified as tuples <i>(property, qualifiers, references, rank)</i>, with the following qualities:</p>
<ul class="bullet">
<li class="BL"><b>Property:</b> A Wikidata property (defined next) that unambiguously specifies an attribute or relation for an item.</li>
<li class="BL"><b>Qualifiers:</b> Each item/property/value triple may have multiple qualifiers that provide additional information about the triple. Qualifiers are defined using property/value pairs. While any property can be used in a qualifier, it is possible to define constraints to encourage or discourage use of properties as qualifiers (we describe constraints next). Qualifiers are often used to describe an interval of time when the triple is valid, but as the example illustrated, qualifiers can be used to record arbitrary details about a triple.</li>
<li class="BL"><b>References:</b> Wikidata encourages recording of references for every item/property/value triple. References enable users of Wikidata knowledge to find the primary sources or other secondary sources for the information encoded in triples. References, like qualifiers, are described using property/value pairs. As <a href="chapter_2.xhtml#fig2-7">figure 2.7</a> illustrates, not every statement in Wikidata has references.</li>
<li class="BL"><b>Rank:</b> Wikidata records information about the world, and it is often the case that knowledge gets revised (e.g., Pluto is no longer categorized as a planet), primary sources provide conflicting information, or values differ depending on measurement methods or approaches. Wikidata does not attempt to represent the truth; instead, it wants to record different values and points of view, using qualifiers and references to enable users of the knowledge to select among opposing views. Wikidata defines three ranks: preferred, normal, and deprecated. When a statement has a single value, the rank should be set to normal. When a statement has multiple values, the preferred rank should be used for statements with references and qualifiers that provide further details in support of the validity of their values; for example, through the use of qualifiers with the properties point in time (P585), determination method (P459), and so on. The deprecated rank should be used for statements that contain errors (e.g., values that resulted from an incorrect measurement method, or that were believed to be correct but have since been disproven).</li>
</ul>
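<p>The item and statement tuples described above map naturally onto simple record types. The sketch below is a simplified illustration, not Wikidata's actual data structures or API: the class and field names are assumptions, properties are referred to by label rather than by p-node, the Pluto identifier is a placeholder, and the rank-selection rule (prefer preferred statements, fall back to normal ones, and always ignore deprecated ones) is just one plausible way for a consumer to use ranks.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    prop: str                     # property, referred to by label in this sketch
    value: str
    qualifiers: dict = field(default_factory=dict)
    references: list = field(default_factory=list)
    rank: str = "normal"          # "preferred", "normal", or "deprecated"


@dataclass
class Item:
    identifier: str
    labels: dict                  # at most one label per language code
    aliases: dict = field(default_factory=dict)
    descriptions: dict = field(default_factory=dict)
    statements: list = field(default_factory=list)


def best_statements(item, prop):
    """Return preferred statements if any exist, else normal ones; never deprecated."""
    candidates = [s for s in item.statements if s.prop == prop]
    preferred = [s for s in candidates if s.rank == "preferred"]
    if preferred:
        return preferred
    return [s for s in candidates if s.rank == "normal"]


joker = Item(
    identifier="Q217533",
    labels={"en": "The Joker"},
    statements=[
        Statement("member of", "Injustice Gang"),
        Statement("member of", "Injustice League"),
    ],
)

# Placeholder identifier; the statements mirror the Pluto example in the text.
pluto = Item(
    identifier="Q-placeholder",
    labels={"en": "Pluto"},
    statements=[
        Statement("instance of", "planet", rank="deprecated"),
        Statement("instance of", "dwarf planet"),
    ],
)

joker_groups = [s.value for s in best_statements(joker, "member of")]
pluto_types = [s.value for s in best_statements(pluto, "instance of")]
```

<p>Both “member of” statements survive as separate edges, while the deprecated statement about Pluto remains recorded on the item but is not returned to consumers.</p>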
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec2-4-2"/><b>2.4.2 Wikidata Properties</b></h3>
<p class="noindent">Wikidata properties, like items, are also nodes in the KG. Property specifications include all the elements of items plus two additional elements, <i>datatype</i> and <i>constraints</i>. Properties are often referred to as <i>p-nodes</i>, as their identifiers consist of the letter P followed by an integer.</p>
<p class="snoindent"><span class="paragraphHead"><b>Datatype.</b></span> The datatype specifies the type of value that can be used in statements using the property. For example, the <i>educated at</i> property has type <i>Item</i>, requiring that the value be another item in Wikidata; the <i>start time</i> and <i>end time</i> properties have type <i>Time</i>, requiring <span aria-label="39" id="pg_39" role="doc-pagebreak"/>that the values represent a point in time or a time interval. The set of datatypes in Wikidata at the time of writing is:</p>
<ul class="bullet">
<li class="BL"><b>Item:</b> Reference to an item in Wikidata.</li>
<li class="BL"><b>Property:</b> Reference to a property in Wikidata.</li>
<li class="BL"><b>Time:</b> A specification of a possibly uncertain time, specified as a tuple (time-point, timezone, before, after, precision, calendar), where:
<ul class="bullet1">
<li class="BL-1"><b></b> Time-point denotes a point in time, represented as a timestamp resembling ISO 8601 (e.g., +2013-01-01T00:00:00Z; the year is always signed and padded to have between 4 and 16 digits; more extensive details about ISO 8601 may be found on the website: <a href="https://www.iso.org/iso-8601-date-and-time-format.html">https://<wbr/>www<wbr/>.iso<wbr/>.org<wbr/>/iso<wbr/>-8601<wbr/>-date<wbr/>-and<wbr/>-time<wbr/>-format<wbr/>.html</a>).</li>
<li class="BL-1"><b></b> Timezone is a signed integer specifying an offset from UTC in minutes.</li>
<li class="BL-1"><b></b> Before and after support specification of uncertain times as intervals [time-point − before, time-point + after]; before and after are nonnegative integers representing units of time given by precision.</li>
<li class="BL-1"><b></b> Precision is an integer with meaning: 0—billion years, 1—hundred million years, 2—ten million years, 3—million years, 4—one hundred thousand years, 5—ten thousand years, 6—millennium, 7—century, 8—decade, 9—year, 10—month, 11—day, 12—hour, 13—minute, 14—second.</li>
<li class="BL-1"><b></b> Calendarmodel is a reference to an item that represents a calendar model; for example, the Gregorian calendar (Q12138), Chinese calendar (Q134032).</li>
</ul>
</li>
<li class="BL"><b>Quantity:</b> A specification of a possibly uncertain quantity expressed with units of measure, specified as a tuple (amount, lower-bound, upper-bound, unit), where:
<ul class="bullet1">
<li class="BL-1"><b></b> Amount represents the value of the quantity as a decimal number.</li>
<li class="BL-1"><b></b> Lower-bound and upper-bound support the specification of uncertain quantities as intervals (amount − lower-bound, amount + upper-bound).</li>
<li class="BL-1"><b></b> Unit is a reference to an item that represents a unit of measure; for example, centimeters (Q174728), euro (Q4916).</li>
</ul>
</li>
<li class="BL"><b>Monolingual text:</b> Specified as a tuple (string, language), where:
<ul class="bullet1">
<li class="BL-1"><b></b> String is a Unicode string.</li>
<li class="BL-1"><b></b> Language is a language code.<sup><a href="chapter_2.xhtml#fn5x2" id="fn5x2-bk">5</a></sup></li>
</ul>
</li>
<li class="BL"><b>String:</b> A Unicode string.</li>
<li class="BL"><b>URL:</b> A generalized “URL” that identifies some kind of external resource, perhaps a link to an external site of some kind, or an identifier used for lookup in some kind of specialized resource.</li>
<li class="BL"><span aria-label="40" id="pg_40" role="doc-pagebreak"/><b>External identifier:</b> A string representing an identifier used in an external system.</li>
<li class="BL"><b>Globe coordinate:</b> A specification of a geographical position specified as a tuple (latitude, longitude, precision, globe), where:
<ul class="bullet1">
<li class="BL-1"><b></b> Latitude and longitude, specified in degrees, minutes, and seconds (DMS) or in decimal degrees.</li>
<li class="BL-1"><b></b> Precision specifies the resolution of the source of the coordinates, given as a decimal number.</li>
<li class="BL-1"><b></b> Globe identifies a celestial body, such as Earth (Q2), which is the default, or Mars (Q111).</li>
</ul>
</li>
<li class="BL"><b>Geographic shape:</b> A reference to a map data file on Wikimedia Commons.</li>
<li class="BL"><b>Commons media:</b> A reference to a media file in Wikimedia Commons.</li>
<li class="BL"><b>Tabular data:</b> A reference to a tabular data file on Wikimedia Commons.</li>
<li class="BL"><b>Musical notation:</b> A string describing music following the LilyPond (Q195946) syntax.</li>
<li class="BL"><b>Mathematical expression:</b> A string expressed in a variant of LaTeX (Q5310).</li>
<li class="BL"><b>Lexeme, form, and sense:</b> Datatypes to specify lexicographical data.<sup><a href="chapter_2.xhtml#fn6x2" id="fn6x2-bk">6</a></sup></li>
</ul>
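<p>As an illustration of how the Time datatype encodes uncertainty, the following sketch computes the interval of years covered by a time value. It makes several simplifying assumptions that are not part of the datatype itself: the time-point is reduced to an integer year, only precision codes 0 through 9 (year or coarser) are handled, and before is taken to widen the interval toward earlier times while after widens it toward later times.</p>

```python
from dataclasses import dataclass

# Years spanned by one unit at each precision code from 0 to 9.
YEARS_PER_UNIT = {0: 10**9, 1: 10**8, 2: 10**7, 3: 10**6, 4: 10**5,
                  5: 10**4, 6: 1000, 7: 100, 8: 10, 9: 1}


@dataclass
class TimeValue:
    year: int            # simplified stand-in for the full timestamp
    before: int = 0      # units of uncertainty before the time-point
    after: int = 0       # units of uncertainty after the time-point
    precision: int = 9   # 9 = year, 8 = decade, 7 = century, ...


def year_interval(tv):
    """Return the (earliest, latest) year covered by an uncertain time value."""
    if tv.precision not in YEARS_PER_UNIT:
        raise ValueError("this sketch only handles precision codes 0 to 9")
    unit = YEARS_PER_UNIT[tv.precision]
    return tv.year - tv.before * unit, tv.year + tv.after * unit


# The traditional dating of the Mona Lisa: between 1503 and 1506.
mona_lisa = TimeValue(year=1503, after=3)
# A made-up value with decade precision: 1500, give or take one decade.
around_1500 = TimeValue(year=1500, before=1, after=1, precision=8)
```

<p>Under these assumptions, <i>year_interval(mona_lisa)</i> yields the interval from 1503 to 1506, and the decade-precision value covers 1490 to 1510.</p>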
<p class="noindent"><span class="paragraphHead"><b>Property Constraints.</b></span> Wikidata acknowledges that the world is full of exceptions, and consequently it does not enforce any constraints on the values of statements other than the datatype of properties described here. For example, if the datatype of a property is Item, any item can be used as a value. Property constraints are rules that specify how properties should be used, give guidance to human editors, and enable the construction of automated validators that flag statements with constraint violations, bringing them to the attention of human editors. It makes sense to define a constraint on the <i>head of government</i> (P6) property to state that the value should be a human (Q5), but exceptions are allowed. For example, the town of Talkeetna (Q668224) elected the cat Stubbs (Q7627362) as mayor, and Wikidata allowed this information to be recorded, tagging it with a constraint violation (Stubbs died at the age of 20 in 2017 and has since been replaced by a human mayor).</p>
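<p>The advisory nature of property constraints can be sketched as a validator that flags statements without rejecting them. Everything below is a simplified assumption rather than Wikidata's actual constraint machinery: the lookup tables, the <i>ex:</i> items, and the function name are hypothetical, and only a single kind of constraint (the expected class of the value) is modeled.</p>

```python
# Hypothetical, heavily simplified stand-ins for Wikidata content.
instance_of = {
    "Q7627362": {"cat"},          # Stubbs (class given by label for simplicity)
    "ex:some_mayor": {"Q5"},      # made-up item that is a human (Q5)
}

# Constraint: values of head of government (P6) should be humans (Q5).
value_type_constraints = {"P6": "Q5"}

statements = [
    ("Q668224", "P6", "Q7627362"),            # Talkeetna, head of government, Stubbs
    ("ex:some_town", "P6", "ex:some_mayor"),  # made-up statement that satisfies P6
]


def flag_violations(statements):
    """Return statements whose value lacks the expected class; nothing is rejected."""
    flagged = []
    for subj, prop, value in statements:
        expected = value_type_constraints.get(prop)
        if expected is not None and expected not in instance_of.get(value, set()):
            flagged.append((subj, prop, value))
    return flagged


violations = flag_violations(statements)
```

<p>The statement about Stubbs is flagged for the attention of human editors, but it stays in the list of statements, mirroring how Wikidata records the exception rather than forbidding it.</p>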
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec2-5"/><b>2.5 The Semantic Web Layer Cake</b></h2>
<p class="noindent">While RDF forms the basis of much of the Semantic Web, we have already seen that it is possible to extend its capabilities via modeling extensions like RDFS. Later in this book, in part IV, we will explore another set of extensions via OWL. However, even without knowing all the details of RDFS or OWL, we can see the broader trend here of a “modeling stack” or “layer cake” that consists of modeling facilities (often, but not always, languages and language extensions) for building and working with ever more sophisticated KG and ontological representations. The famous Semantic Web Layer Cake is illustrated in <a href="chapter_2.xhtml#fig2-8" id="rfig2-8">figure 2.8</a>. <span aria-label="41" id="pg_41" role="doc-pagebreak"/>We have already introduced many of the layers in this cake, starting from URIs and Unicode at the bottom and moving up through RDF and RDFS. At the very top are applications and user interfaces, which are necessary for actual users to be able to interact with KG representations. Proof and trust are important issues and remain areas of advanced research; the core idea is to be able to verify, in some way, the accuracy and provenance of the data that is being ingested and modeled in the system. As one can imagine, not all knowledge is equally trustworthy. For the interested reader, we provide research pointers on these topics in the section entitled “Bibliographic Notes” at the end of this chapter. In the middle of the cake is a crucial layer (querying, OWL, and rules) that will form the subject of later chapters on reasoning.</p>
<div class="figure">
<figure class="IMG"><a id="fig2-8"/><img alt="" src="../images/Figure2-8.png" width="250"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig2-8">Figure 2.8</a>:</span> <span class="FIG">A simplified illustration of the Semantic Web Layer Cake.</span></p></figcaption>
</figure>
</div>
<p>One lesson to draw from a cake such as this is that representing and working with KGs is itself an interesting and important problem. Because of the visual ease of drawing KGs, it is easy to be misled into believing that representation itself is not a major issue. This is analogous to how we might be (naïvely) led to believe that an RDB is just a “set of tables.” However, without proper models and representations, including powerful and consistent languages to express the knowledge in a KG in ways that machines would be able to efficiently query and ingest, a KG ecosystem would quickly fall apart. Instead, through the establishment of communitywide standards, whether in the SW, NLP, or Wikidata communities, it has become feasible, even commonplace, to publish KGs on the web that the community can use and that are periodically updated. The original papers describing these KGs have garnered many thousands of citations and even won awards for their large-scale impact. Other KGs, such as the YAGO project, have been similarly successful.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="42" id="pg_42" role="doc-pagebreak"/><a id="sec2-6"/><b>2.6Schema Heterogeneity and Semantic Labeling</b></h2>
<p class="noindent">As the Semantic Web and other related communities have expanded in scope, so have the schemas and ontologies that are currently in use. As we cover in detail in part V, there are entire KG ecosystems in place in areas as diverse as scientific inquiry, industry, and even artificial intelligence (AI) for social good. Even for generic domains such as encyclopedias and world knowledge (with the actual KG often derived from sources such as Wikipedia, Wikidata, and WordNet), there are multiple ontologies. It is not controversial to conclude that <i>schema heterogeneity</i><sup><a href="chapter_2.xhtml#fn7x2" id="fn7x2-bk">7</a></sup> is an important problem that needs to be addressed before different groups are able to work together on problems, and so that the potential of the web itself can be realized. While there are social aspects and solutions to this problem (an extreme being a mandate of sorts to use only one standard ontology per domain<sup><a href="chapter_2.xhtml#fn8x2" id="fn8x2-bk">8</a></sup>), or one group using an ontology <i>O</i> for a domain could persuade another group to also use <i>O</i> rather than a different (possibly derived or hybrid) ontology <i>O</i>, the truth is that different groups and individuals have different use-cases, and ontologies and schemas are expected to serve those use-cases rather than the other way around. The problem then becomes: how do we address schema heterogeneity without having to pick one ontology over the other?</p>
<p>One important class of solutions that has been well developed both in the database and Semantic Web communities is to achieve (sometimes manually, but usually semiautomatically) a mapping between the two ontologies. In the database community, this has been referred to as <i>schema mapping</i> or <i>matching</i>. Rahm and Bernstein (2001) described the problem as finding a good fit for the “Match” operator, which takes two schemas as inputs and outputs a mapping between elements of both schemas that <i>correspond semantically</i> to each other. It is in the idea of semantic correspondence that the problem starts bearing a connection to AI because there is usually no clear mathematical formula for capturing the phenomenon of semantic correspondence. In contrast, for classic computational problems such as the traveling salesman problem, or even Euler’s bridge problem (with which we began the previous chapter), the condition for success is usually well defined mathematically (though a solution is not necessarily easy to find computationally).</p>
<p>Even as early as two decades ago, when Rahm and Bernstein (2001) wrote on the matter, machine learning methods had started gaining in prominence. In fact, Rahm and Bernstein (2001) provide a full taxonomy of schema-matching approaches, some of which are based on constraints (using graph-matching techniques, for example) and others of which use linguistic cues, including word frequencies and information retrieval techniques, which <span aria-label="43" id="pg_43" role="doc-pagebreak"/>are detailed in subsequent chapters. Our point here is that many approaches have been taken to the problem, and while good solutions have been developed and are in use, it is not a resolved matter, especially for difficult ontologies and new domains. A general machine learning approach to the problem is to designate one schema as a <i>source</i> schema, and the other as a <i>target</i> schema. The universe of elements (e.g., concepts) in the target schema then becomes akin to the set of <i>labels</i> in the machine learning context, while elements in the source schema become the objects that need to be classified. Multiclass classification is applicable here, although other techniques can also be used.</p>
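<p>The framing of source elements as objects and target elements as labels can be sketched in a few lines of Python. What follows is a minimal illustration under stated assumptions, not a reconstruction of any published matcher: the target schema, its three element names, and the example names attached to each label are all invented for this sketch, and character-trigram cosine similarity stands in for a trained multiclass classifier.</p>

```python
from collections import Counter


def trigrams(text):
    """Return a bag of character trigrams for a lowercased, padded string."""
    padded = "  " + text.lower() + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram bags."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0


def train(labeled_examples):
    """Build one trigram profile per target-schema element (the class label)."""
    profiles = {}
    for label, names in labeled_examples.items():
        profile = Counter()
        for name in names:
            profile.update(trigrams(name))
        profiles[label] = profile
    return profiles


def classify(source_element, profiles):
    """Assign a source-schema element to its most similar target label."""
    features = trigrams(source_element)
    return max(profiles, key=lambda label: cosine(features, profiles[label]))


# Hypothetical target schema: each label is paired with element names that
# we assume were observed in earlier, manually verified mappings.
TARGET_EXAMPLES = {
    "phone_number": ["phone", "telephone", "phone no", "tel number"],
    "postal_code": ["zip", "zipcode", "postcode", "postal code"],
    "street_address": ["street", "address line", "street addr"],
}

profiles = train(TARGET_EXAMPLES)
for element in ["Telephone_Num", "ZIP_CODE", "StreetAddress"]:
    print(element, "->", classify(element, profiles))
```

<p>Because every source element is forced onto exactly one label, a sketch of this kind can only recover 1:1 and many-to-one correspondences; a practical system would also need a similarity threshold for elements that match nothing in the target schema.</p>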
<p>Sometimes the matter is not as simple as automatically discovering a 1:1 correspondence, which the earlier machine learning-based approaches were more suitable for. There may be many-to-many correspondences, though correspondences usually tend to fall in the many-to-one (e.g., “phone number” and “extension code” in the source schema may correspond with a single, complete “phone number” field in the target) and one-to-many categories (e.g., a full “address” in the source schema maps to “zipcode” and “street address” in the target schema). Much subsequent research attempted to discover these correspondences with higher accuracy. Furthermore, even in the case of the 1:1 correspondence, simply discovering the match between concepts in the source and target schemas is not the end of the story, because the specific nature of the correspondence may also have to be discovered. Sometimes the mapping can be expressed as a syntactic transformation [e.g., if a phone number is represented as xxx-xx-xxxx in the source and as (xxx) xx-xxxx in the target, a relatively simple program can be written, or given enough observations, <i>learned</i> by a pattern-mining or data transformation library, for mapping values between the two classes]. Sophisticated tools have been developed for learning such mappings and expressing such transformation programs in an adaptive fashion. The Karma semantic mapping and labeling system is a good example of an adaptive tool (see the section entitled “Software and Resources” for details) that also provides visual support to its users and supports multiple source and output formats (including spreadsheets and RDF).</p>
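<p>The “relatively simple program” for the phone example, together with the many-to-one and one-to-many cases, might look like the following hand-written Python sketch. The function names are hypothetical, and two details are assumptions of ours rather than anything stated in the text: the “ext.” separator used when merging fields, and the convention that the zipcode follows the last comma of a full address. A tool such as Karma would learn transformations like these from examples instead of having them coded by hand.</p>

```python
import re

# Source format from the text: xxx-xx-xxxx; target format: (xxx) xx-xxxx.
SOURCE_PHONE = re.compile(r"(\d{3})-(\d{2})-(\d{4})")


def convert_phone(value):
    """1:1 case: rewrite xxx-xx-xxxx as (xxx) xx-xxxx."""
    match = SOURCE_PHONE.fullmatch(value.strip())
    if match is None:
        raise ValueError("unexpected phone format: " + repr(value))
    area, middle, last = match.groups()
    return "({}) {}-{}".format(area, middle, last)


def merge_phone(phone_number, extension_code):
    """Many-to-one case: two source fields feed one target field.

    The 'ext.' separator is an assumption made for this sketch.
    """
    return "{} ext. {}".format(convert_phone(phone_number), extension_code)


def split_address(address):
    """One-to-many case: one source field feeds two target fields.

    Assumes the zipcode is whatever follows the last comma.
    """
    street, _, zipcode = address.rpartition(",")
    return {"street address": street.strip(), "zipcode": zipcode.strip()}


print(convert_phone("555-12-3456"))
print(merge_phone("555-12-3456", "42"))
print(split_address("12 Main Street, 90001"))
```

<p>Raising an error on unexpected input, rather than passing the value through unchanged, keeps malformed source values from silently entering the target schema.</p>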
<p>Beyond direct schema and ontology matching, there are other use-cases of semantic mapping that we do not consider in this chapter, but provide pointers to in the section entitled “Bibliographic Notes.” For example, SQL databases on the web or in industry may have to be transformed into Semantic Web standards such as OWL and RDF if the goal is to convert the underlying data to KGs and query them using graph pattern-matching languages such as SPARQL (chapter 12). Contrary to how they are first introduced, SQL databases are complex models in themselves (obeying, with some qualifications, relational algebra) and come with their own constraints such as foreign keys, column types, and primary keys. This mapping problem, therefore, is nontrivial. In fact, the R2RML mapping language was specifically adopted as a W3C recommendation<sup><a href="chapter_2.xhtml#fn9x2" id="fn9x2-bk">9</a></sup> to formalize the mappings <span aria-label="44" id="pg_44" role="doc-pagebreak"/>from RDBs to RDF. Other research in the previous decade has given some valuable insight into the nature of this problem and potential solutions.</p>
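<p>To see why primary and foreign keys matter in this mapping problem, consider the following Python sketch, which turns rows of a small SQLite database into subject-predicate-object triples. It is a hand-rolled illustration and does not use or implement R2RML; the <code>http://example.org/</code> namespace, the table and column names, and the URI patterns are all hypothetical choices made for the example.</p>

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace, not from the chapter


def table_to_triples(conn, table, key, foreign_keys=None):
    """Yield (subject, predicate, object) tuples for every row of a table.

    key is the primary-key column, used to mint the subject URI.
    foreign_keys maps a column name to the table it references, so that
    its values become URIs of other rows instead of plain literals.
    The table name is concatenated into the SQL, so pass trusted names only.
    """
    foreign_keys = foreign_keys or {}
    cursor = conn.execute("SELECT * FROM " + table)
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        subject = "{}{}/{}".format(BASE, table, record[key])
        for column, value in record.items():
            if column == key or value is None:
                continue  # the key is already in the subject; NULLs yield no triple
            predicate = "{}{}#{}".format(BASE, table, column)
            if column in foreign_keys:
                value = "{}{}/{}".format(BASE, foreign_keys[column], value)
            yield (subject, predicate, value)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT,
                      dept_id INTEGER REFERENCES dept(id));
    INSERT INTO dept VALUES (10, 'Research');
    INSERT INTO emp VALUES (1, 'Alice', 10);
""")
triples = list(table_to_triples(conn, "emp", "id", {"dept_id": "dept"}))
for triple in triples:
    print(triple)
```

<p>In this sketch the primary key becomes part of the subject URI, ordinary columns become literal-valued properties, and the foreign key becomes an edge to another resource. Column types and other constraints are ignored entirely, which is one reason a dedicated mapping language is needed in practice.</p>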
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec2-7"/><b>2.7 Concluding Notes</b></h2>
<p class="noindent">Unlike ordinary databases, KGs are particularly amenable to a visual graph-theoretic representation, which makes them easy for humans to understand and conceive. However, the modeling and representation of KGs are no less complex than for a sophisticated database. Representational decisions thus made can affect the entire application life cycle of the KG, including querying and other downstream applications that may need to access the KG. In this chapter, we started the topic of KG representation by briefly describing the simple triples model, still widely used in subareas of machine learning and NLP, followed by more sophisticated representational machinery like RDF and RDFS. We also described the Wikidata model, which, though not necessarily as rigorous as RDF, is extremely popular (and continuing to increase in influence) at this time, and sets itself apart due to the simplicity and ease of use without sacrificing core expressivity. We concluded the chapter with a discussion on how all of these models, with different degrees of expressivity and modeling capability, fit together within the context of a larger community. Understanding the various options available when modeling and representing KGs is an important skill to have when constructing or working with real-world KGs that are meant to support research and industrial applications for years to come.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec2-8"/><b>2.8 Software and Resources</b></h2>
<p class="noindent">There are a number of resources available for working with, and validating, RDF. We do not focus here on the triplestores (packages that are useful for setting up RDF databases that can be queried and accessed), which will be the focus of chapter 12. One example of a simple command-line tool that can be used to convert different RDF serializations from one to the other is RDF2RDF, a Java tool that is wrapped into one single jar file for easy usage. It is licensed under GPL v2.0 and is publicly available at this link: <a href="http://www.l3s.de/~minack/rdf2rdf/">http://www.l3s.de/~minack/rdf2rdf/</a>. It can be run on the command line and is particularly convenient when the amount of memory available on the machine is proportionate to the RDF graph that needs to be converted.</p>
<p>A package that proves important for Python developers is RDFLib, which may be thought of as an RDF library for Python. The library contains parsers and serializers for RDF/XML, N3, N-Triples, N-Quads, Turtle, TriX, RDFa, and Microdata. It also presents a <i>Graph</i> interface which can be backed by any of a number of <i>Store</i> implementations. The package is relatively complete in that a SPARQL 1.1 implementation is also included, supporting both queries and update statements. RDFLib is available on PyPI and the open-source version is maintained using GitHub. Because of its plug-in-based architecture, several tools <span aria-label="45" id="pg_45" role="doc-pagebreak"/>and projects have been built on top of RDFLib. Similar tools are also available in other languages, including Java. Several support the property-centric models that were discussed, including property tables and graphs. Neo4j, accessed at <a href="https://neo4j.com/download/">https://<wbr/>neo4j<wbr/>.com<wbr/>/download<wbr/>/</a>, is an excellent example of the latter, and we describe it in detail in chapter 12. Apache Jena, mentioned as software offering support for the former, maintains a homepage at <a href="https://jena.apache.org/">https://<wbr/>jena<wbr/>.apache<wbr/>.org<wbr/>/</a>.</p>
<p>The Wikidata project is completely available online at <a href="https://www.wikidata.org/wiki/Wikidata:Main_Page">https://<wbr/>www<wbr/>.wikidata<wbr/>.org<wbr/>/wiki<wbr/>/Wikidata:Main<wbr/>_Page</a>. The main page contains links to a whole set of resources, including an introduction to Wikidata, tutorials on editing and contributing to Wikidata, and additional information on how to retrieve and use data from Wikidata. For the more advanced reader who has an interest in using Wikidata, this last set of resources is a useful place to begin.</p>
<p>Some of the vocabularies we mentioned in this chapter, including FOAF, also constitute an important set of resources for modeling RDF data. FOAF itself can be accessed at <a href="http://xmlns.com/foaf/0.1/">http://<wbr/>xmlns<wbr/>.com<wbr/>/foaf<wbr/>/0<wbr/>.1<wbr/>/</a>, but other good examples include the Simple Knowledge Organization System (SKOS), which is accessed at <a href="http://www.w3.org/standards/techs/skos#w3c_all">http://<wbr/>www<wbr/>.w3<wbr/>.org<wbr/>/standards<wbr/>/techs<wbr/>/skos#w3c<wbr/>_all</a>, and the Dublin Core (<a href="https://www.dublincore.org/">https://<wbr/>www<wbr/>.dublincore<wbr/>.org<wbr/>/</a>). More advanced vocabularies, some of which may be considered languages for formally specifying detailed ontologies of concepts, properties, and constraints, will be discussed in part IV, when we cover the core of the OWL. The W3C has formal pages on RDF and RDFS, accessed at <a href="https://www.w3.org/RDF/">https://<wbr/>www<wbr/>.w3<wbr/>.org<wbr/>/RDF<wbr/>/</a> and <a href="https://www.w3.org/TR/rdf-schema/">https://<wbr/>www<wbr/>.w3<wbr/>.org<wbr/>/TR<wbr/>/rdf<wbr/>-schema<wbr/>/</a>, respectively. More novel ways of expressing RDF and linked data, which are beyond the scope of this chapter, include JSON-LD, a full description of which can be found at <a href="https://json-ld.org/">https://<wbr/>json<wbr/>-ld<wbr/>.org<wbr/>/</a>.</p>
<p>In the last section of the chapter, we mentioned schema heterogeneity and semantic labeling. A predominant tool for semantic labeling that comes with adaptive functionality and a fairly advanced user interface is Karma. It may be accessed at <a href="https://usc-isi-i2.github.io/karma/">https://<wbr/>usc<wbr/>-isi<wbr/>-i2<wbr/>.github<wbr/>.io<wbr/>/karma<wbr/>/</a>. It is described as an information integration tool that enables users to quickly and easily integrate data from a variety of data sources, including databases, spreadsheets, delimited text files, XML, JSON, KML, and web APIs. Users integrate information by modeling it according to an ontology of their choice using a graphical user interface that automates much of the process. Karma learns to recognize the mapping of data to ontology classes (hence, it is adaptive) and then uses the ontology to propose a model that ties together these classes. Users then interact with the system to adjust the automatically generated model. During this process, users can transform the data as needed to normalize data expressed in different formats and to restructure it. Once the model is complete, users can publish the integrated data as RDF or store it in a database.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="46" id="pg_46" role="doc-pagebreak"/><a id="sec2-9"/><b>2.9 Bibliographic Notes</b></h2>
<p class="noindent">Knowledge representation (KR) has always been an important area in AI and computer science. We cite Sowa (2014), Markman (2013), Brachman et al. (1983), Davis et al. (1993), Gero (1990), Levesque (1986), Brachman and Levesque (1985), Sowa (2000), and Zadeh (1996), though there are other works that are helpful. It suffices to say that, directly or indirectly, many branches of AI depend on KR, and KG research is only one such area. The references noted here are good places to begin, and have themselves cited, and been cited by, other influential studies for readers who wish to deeply explore KR as an area in itself.</p>
<p>The RDF model also has a long history (though not nearly as long as overall KR research), going back all the way to 1998, with the foundations of the model arising from research conducted even earlier. An excellent resource going back to the early days is Lassila et al. (1998), as well as Decker et al. (1998), the latter especially helpful for querying and doing inference on RDF data. Lassila et al. (1998) describe RDF as a “foundation for processing metadata” by virtue of the model providing interoperability between applications that need to exchange <i>machine-understandable</i> information on the web. Unlike other later publications that did not pay careful attention to defining metadata, the authors describe it clearly as “data describing web resources,” and even note the caveat that it is not always clear what the distinction between data and metadata is, especially due to the evolving and heterogeneous nature of information published on the web. Sometimes it is inevitable that a particular resource will be interpreted in both ways simultaneously. The broad goal of RDF, according to the authors, was to define a mechanism for describing resources that made no assumptions about a particular application domain in that it did not define (a priori) the semantics of any application domain. RDF in this sense was domain-neutral; however, it was a sufficiently broad language that it could express information about any domain using a systematic representation.</p>
<p>Inspiration for RDF was attributed to various communities and sources, including the web standardization community [in the form of HTML metadata and the Platform for Internet Content Selection (PICS)], the library community, the structured document community [in the form of Standard Generalized Markup Language (SGML), and more importantly XML, which formed the basis for the first standard RDF serialization], and also the KR community. The seminal article by Berners-Lee et al. (2001) in <i>Scientific American</i> is particularly relevant here. Other relevant works are Lassila (1998), Manola (1998, 1999), Connolly et al. (1997), and Berners-Lee et al. (1999). Akerkar (2009) may be a useful guide for unifying the various strands mentioned in this chapter, including RDF, XML, and ontologies. Other secondary areas that contributed to the RDF design include object-oriented programming and modeling languages; see the relevant works by Cox (1986), Calvanese et al. (1998), and Garcia-Molina et al. (2000) to gain an overview of some of this work. The overall database community also had a strong influence on RDF.</p>
<p><span aria-label="47" id="pg_47" role="doc-pagebreak"/>It is important however to note the limits of some of these influences. For example, while RDF drew from the KR community, it does not specify a mechanism for reasoning. This mechanism would later be added (as evident in the Semantic Web Layer Cake). A good description of the layer cake, but also the foundations of the modern Semantic Web more generally, is provided in several works, including by Hendler (2009), Passin (2004), and Bénel et al. (2010). In summary, by characterizing RDF as a <i>simple frame</i> system, the idea was to build more sophisticated capabilities on top of it, including support for reasoning. This aim has largely been fulfilled in the last two decades with the development of OWL; see Antoniou and Van Harmelen (2004b) and McGuinness et al. (2004).</p>
<p>In Lassila et al. (1998), the phrase “knowledge graph” never appears because the term had gained neither the popularity nor standardized connotation that it has today. Yet the document makes it clear that what we know today as KGs were targets for RDF representation. The graph-theoretic interpretation of RDF was made clear in the document, with visual fragments of what we refer to as KGs rendered throughout the document.</p>
<p>Other representations covered in this chapter, including RDFS, can also trace their origins to work from the early 2000s; good references include McBride (2004), and Allemang and Hendler (2011). For actual standards descriptions, the best resource is the W3C’s pages on these subjects. Beyond RDF, RDFS, and the Semantic Web, we also noted the rise of Wikidata and its data model as a simpler, though also less mature (in terms of reasoning and complex analyses), alternative to representing KGs. Some good readings on Wikidata, as well as its relationship to RDF and the Semantic Web, may be found in Vrandečić and Krötzsch (2014), Erxleben et al. (2014), and Hernández et al. (2015). Finally, good references on semantic labeling and the Karma system include Szekely et al. (2011, 2013), Knoblock et al. (2012), and Knoblock and Szekely (2015). We also mentioned that schema heterogeneity and mapping are problems that are not exclusive to KGs and the Semantic Web, but have also been prominent in the database community. A good survey of (now-classic) schema-matching approaches is Rahm and Bernstein (2001). There has also been work on mapping RDBs to RDF, and on ontological mapping. As a start, we recommend Choi et al. (2006), Sequeda et al. (2011), Sahoo et al. (2009), and Doan and Halevy (2005) for readers interested in these areas. For a broader overview of ontology matching, we recommend Euzenat et al. (2007).</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec2-10"/><b>2.10 Exercises</b></h2>
<p class="noindent"><b>We will revisit the fragment from earlier in the chapter (see the image on the next page) for questions 1–3.</b></p>
<ul class="numbered">
<li class="NL">1. We want to give Batman a sidekick, namely Robin. How or what would you add to the KG fragment to express this additional information (if you need to use relations <span aria-label="48" id="pg_48" role="doc-pagebreak"/>that are not expressed in the fragment given here, give them mnemonic names, and assume they are defined in the rel: vocabulary)?</li>
</ul>
<figure class="IMG"><img alt="" src="../images/pg48-1.png" width="350"/>
</figure>
<ul class="numbered">
<li class="NL">2. The FOAF vocabulary can be accessed at the link that is provided in the prefix. Go to the link and study some of the classes and properties listed in FOAF Core. Can you use FOAF to add a single statement to the Turtle fragment that expresses the information that the Joker is 32 years old?</li>
<li class="NL">3. Write out the entire portion of the fragment from &lt;#batman&gt; downward in N-triples.</li>
<li class="NL">4. Let us try to compare the reduction in complexity that can be achieved by using Turtle instead of N-triples. Suppose we measure complexity in terms of both the number of terms and the number of statements. For example, the fragment that you rewrote above in N-triples has four statements in Turtle, and seven terms (the single subject &lt;#batman&gt;, three predicates, and three objects, one of which is a literal). Suppose that you were told that your KG had only URIs and no literals. Furthermore, your KG has 1,000,000 nodes, each of which has four predicates (you may assume that these are all unique) linking it to four other nodes on average. You may also assume that everything has a single common prefix, so Turtle only involves a single additional statement declaring the prefix. Ignoring this prefix statement, how many statements and terms would be in your KG representation if expressed in Turtle? How about N-triples? What are the percentage reductions in statements and terms if Turtle is used over N-triples?</li>
<li class="NL">5. You are told that “John Green, Michael Brown, and Jerry Red are members of the Mensa organization,” and that “Michael Brown is additionally a member of IEEE. Michael and John are friends, while Jerry’s current project involves using AI for social good.” Using FOAF and/or some of the vocabularies covered in the “Software and Resources” section, could you write out the “knowledge” expressed in those sentences in the Turtle format? For this question, you should not be using made-up terms (i.e., <span aria-label="49" id="pg_49" role="doc-pagebreak"/>any terms that you use must be defined already in an established vocabulary). Values such as names and literals can be mnemonically proposed if necessary.</li>
<li class="NL">6. Returning to the motivation proposed behind the Wikidata data model, can you list properties of countries that must be recorded as <i>n</i>-ary (<i>n &gt;</i> 2) relations?</li>
<li class="NL">7. State whether the statements below are True or False. If false, state the reason (simply) and correct the statement by adding, removing, or modifying elements.</li>
</ul>
<p class="AL">(a) RDF extends the linking structure of the web to use URIs to name the relationship between things and the two ends of the link (Subject and Object).</p>
<p class="AL">(b) In Turtle (textual syntax for RDF) it is not allowed to use untyped (plain) literals.</p>
<p class="AL">(c) RDF can be used to represent information only about things that can be directly retrieved on the web.</p>
<p class="AL">(d) A resource can be represented by a blank node.</p>
<p class="AL">(e) Copyright or licensing information of some resource cannot be represented with RDF.</p>
<p class="AL">(f) The XML RDF syntax can describe some resources that cannot be described using the Turtle RDF syntax.</p>
<ul class="numbered">
<li class="NL">8. A friend remarks to you, “RDFS is a language intended to represent the structure of RDF resources.” What does the word “structure” mean in this context?</li>
<li class="NL">9. Consider the academic KG example in chapter 1 (<a href="chapter_1.xhtml#fig1-5">figure 1.5</a>) and show what its representation would look like as (a) a property graph, and as (b) a property table. Make assumptions as appropriate. What is one good example of a multivalued property in <a href="chapter_1.xhtml#fig1-5">figure 1.5</a>, of which object values should belong in a separate table?</li>
<li class="NL1">10. What is an example of a Wikidata entity that is an instance of <i>Item</i> and is linked to an instance of <i>Geographic Shape</i>? What property links the item to the geographic shape? <i>Note: You may have to look around on Wikidata to find such an entity.</i></li>
<li class="NL1">11. Look up the entry for COVID-19 on Wikidata<sup><a href="chapter_2.xhtml#fn10x2" id="fn10x2-bk">10</a></sup> and answer the following questions:</li>
</ul>
<p class="AL">(a) What is COVID-19 an instance of?</p>
<p class="AL">(b) What is the type of resource that COVID-19 is linked to via the property <i>number of deaths</i>?</p>
<p class="AL">(c) What is the type of the resource linked via the property <i>significant event</i> (if there is more than one, pick the first one)? What is the Wikidata ID of this resource? Name the property that links this resource <i>back</i> to the COVID-19 resource you started from.</p>
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_2.xhtml#fn1x2-bk" id="fn1x2">1</a></sup>As such an entity, consider the “second marriage of John Doe.” The marriage has attributes that can be used to describe it, including the two participants in the marriage, the date of marriage, and the venue, but it may not make sense to give it a name and globally de-referenceable URI.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_2.xhtml#fn2x2-bk" id="fn2x2">2</a></sup>For example, the URI for Mayank Kejriwal is repeated thrice as a subject in the N-triples representation provided below the KG fragment in <a href="chapter_2.xhtml#fig2-2">figure 2.2</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_2.xhtml#fn3x2-bk" id="fn3x2">3</a></sup>The convention is to capitalize the word after the prefix for vocabulary units that are <i>nodes</i> in an RDF or RDFS graph, while <i>properties</i> (or edges) do not capitalize this word. We obey this convention in both this chapter and the remainder of this book.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_2.xhtml#fn4x2-bk" id="fn4x2">4</a></sup>An alternative term that is sometimes used is <i>superclass</i>, which is the inverse of subclass. In other words, if a class <i>C</i> is a subclass of a class <i>C</i>′, then <i>C</i>′ is a superclass of class <i>C</i>. If a class <i>C</i>′ is a superclass of a class <i>C</i>, then all instances of <i>C</i> are also instances of <i>C</i>′.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_2.xhtml#fn5x2-bk" id="fn5x2">5</a></sup><a href="https://www.wikidata.org/wiki/Help:Wikimedia_language_codes/lists/all">https://www.wikidata.org/wiki/Help:Wikimedia_language_codes/lists/all</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_2.xhtml#fn6x2-bk" id="fn6x2">6</a></sup><a href="https://www.wikidata.org/wiki/Wikidata:Lexicographical_data">https://www.wikidata.org/wiki/Wikidata:Lexicographical_data</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_2.xhtml#fn7x2-bk" id="fn7x2">7</a></sup>One may wonder why we do not refer to this empirical phenomenon as <i>ontological</i> heterogeneity, but the phrase simply has not caught on. It may be because schema heterogeneity was already a major problem in the broader database community that has exerted a prominent influence on research and application.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_2.xhtml#fn8x2-bk" id="fn8x2">8</a></sup>Arguably, the Gene Ontology (covered in chapter 16) is one of the rare success stories to have achieved such a standing, though by decentralized and gradual community consensus rather than as a mandate.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_2.xhtml#fn9x2-bk" id="fn9x2">9</a></sup><a href="https://www.w3.org/TR/r2rml/">https://<wbr/>www<wbr/>.w3<wbr/>.org<wbr/>/TR<wbr/>/r2rml<wbr/>/</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_2.xhtml#fn10x2-bk" id="fn10x2">10</a></sup><a href="https://www.wikidata.org/wiki/Q84263196">https://<wbr/>www<wbr/>.wikidata<wbr/>.org<wbr/>/wiki<wbr/>/Q84263196</a>.</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html>