
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch10" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch10"><span aria-label="241" id="pg_241" role="doc-pagebreak"/>10</h1>
<h1 class="chapter-title"><b>Representation Learning for Knowledge Graphs</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b> Neural networks have achieved powerful advances in recent years. An area of research that has been particularly affected is representation learning. In typical applications, representation learning involves embedding a data element, be it a span of text, an entity (or relation) in a knowledge graph (KG), or even an entire document, into a vector space. Usually, the vector space is real-valued and low-dimensional, which creates a robust representation. In the vector space, mathematical operations can be used in a variety of interesting ways (e.g., in the case of text, it has been found that one can derive relations such as King − Man + Woman ≈ Queen using completely unsupervised embedding techniques). In this chapter, we consider embedding KGs in vector spaces. Although KG embeddings (KGEs) constitute a relatively recent research area, the field has already witnessed a surge of research output, and relatively powerful techniques have emerged as a result. We describe some currently influential techniques that have been adopted and have yielded extremely competitive results.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec10-1"/><b>10.1 Introduction</b></h2>
<p class="noindent">An old English saying that is often quoted is, “Birds of a feather flock together.” J. R. Firth’s hypothesis, also known as <i>Firth’s axiom</i>, reinterprets this saying in the computational linguistics community as, “You shall know a word’s meaning by its context.” Though simple, the principle marks a dramatic shift in how semantics are realized, essentially spawning a successful <i>statistical</i> model of semantics that was far more robust and amenable to modern machine learning pipelines than ever before.</p>
<p>Even though the principle behind Firth’s axiom is abstract, and open to multiple paths of realization, the dominant approach has been a vector space model (VSM). The idea is to convert (or somehow derive) a vector for the data unit under consideration. In the case of KGs, the data unit would be nodes and edges, but in the case of normal networks (such as a <i>friend-friend</i> undirected social network), the nodes constitute the data units, and in the case of natural language, the units are typically words. We saw in chapter 4, for example, how deep learning systems for Named Entity Recognition (NER) rely on good representation learning over words, and even characters. However, it is not enough to just obtain a vector for each unit, because we could randomly assign a vector to a word or node if this was <span aria-label="242" id="pg_242" role="doc-pagebreak"/>all that was required. The vectors clearly have to fulfill some criteria, and one criterion is deciding the interpretation of context. In natural-language documents, the context of a word is interpreted to be the neighborhoods (surrounding set of words) that the word occurs in. Ignoring the issue of <i>word sense</i>, for example, a two-dimensional VSM that is faithful to Firth’s axiom in a reasonable corpus, as opposed to one that is not, is shown in <a href="chapter_10.xhtml#fig10-1" id="rfig10-1">figure 10.1</a>. We immediately see that words that generally seem to belong to the same semantic class tend to occur closer together in the vector space, although there is also a nontrivial relationship between the semantic classes themselves (one would expect US government agencies like the National Science Foundation and National Institutes of Health to be closer to US politicians in the VSM than to Canadian politicians).
In some sense, it is not incorrect to assume that, at least in natural language, word vector clusters that obey Firth’s axiom tend to capture a continuous version of an ontology, or some latent space model of meaning that we have in our heads and capture through observable artifacts such as written documents.</p>
<div class="figure">
<figure class="IMG"><a id="fig10-1"/><img alt="" src="../images/Figure10-1.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig10-1">Figure 10.1</a>:</span> <span class="FIG">An illustration of Firth’s hypothesis over a common corpus like Wikipedia pages. Words that are in the same semantic class (such as “cat” and “dog”) tend to share similar contexts and are clustered close together in the vector space. Because the projection of the vectors (which are usually in the tens, if not hundreds, of real-valued dimensions) is in 2D, the “clusters” appear closer together.</span></p></figcaption>
</figure>
</div>
<p>VSMs existed long before neural networks: for example, the famous <i>term frequency-inverse document frequency</i> (or tf-idf) representation of documents was also a VSM, albeit with very different features than the representations that we are going to study in this chapter. The vectors obtained using tf-idf are usually very sparse (with only a few nonzero entries), high-dimensional (proportional to the vocabulary of the language), and not altogether <span aria-label="243" id="pg_243" role="doc-pagebreak"/>apt for short texts. It is also not clear how we can use the model to get vectors for words. Another similar example, though lower-dimensional and more robust, is Latent Dirichlet Allocation (LDA), colloquially known as a <i>topic model</i>.</p>
<p>Firth’s hypothesis has enjoyed a startlingly successful renaissance in the age of neural networks. While the best-known application of Firth’s hypothesis is fittingly in Natural Language Processing (NLP), as embodied in now very well known algorithms like word2vec, fastText, and GloVe, many other applications, datatypes, and use-cases have been targeted by researchers in recent years, including KGs and networks. In the modern context, vectors derived in this way (whether using neural networks or another loss-minimizing optimization like matrix factorization) are generally referred to as <i>embeddings</i>. On the surface, it seems infeasible that such embeddings could be good. After all, natural language, and other real-world manifestations of naturalistic data (like social networks), have lots of irregularities and corner cases and can even contain outright contradictions and noise, as many symbolic and expert systems have discovered to their dismay. Is it not reasonable to assume a neural network might trip up in the same way? Other issues relate to logistics: How much data is required to train a good model? How should we set the dimensions of the learned vectors and optimize other hyperparameters? Some of these questions are better understood than others, but what is not disputed is that, given a reasonably sized corpus, the embeddings are relatively robust to noise and the occasional odd case. In part, this is because the embeddings are both lower-dimensional and continuous, allowing the learned model to be compact and robust. But the actual optimization, and the loss function that needs to be minimized, also greatly matter for the embedding to be useful.</p>
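The famous analogy mentioned above can be made concrete with a toy example. The following sketch uses small, hand-built vectors (entirely hypothetical and for illustration only; real embeddings are learned from corpora and have hundreds of dimensions) to show how vector arithmetic plus cosine similarity recovers King − Man + Woman ≈ Queen:

```python
import numpy as np

# Hypothetical, hand-built 3-dimensional "embeddings" purely for illustration:
# dimension 0 ~ royalty, dimension 1 ~ maleness, dimension 2 ~ person-ness.
vectors = {
    "king":  np.array([0.9, 0.9, 1.0]),
    "queen": np.array([0.9, 0.1, 1.0]),
    "man":   np.array([0.1, 0.9, 1.0]),
    "woman": np.array([0.1, 0.1, 1.0]),
    "apple": np.array([0.0, 0.5, 0.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - man + woman should land nearest to queen among the remaining words.
target = vectors["king"] - vectors["man"] + vectors["woman"]
nearest = max((w for w in vectors if w not in {"king", "man", "woman"}),
              key=lambda w: cosine(target, vectors[w]))
print(nearest)  # queen
```

In a learned embedding, the same arithmetic works because the gender and royalty "directions" emerge from the contexts the words share.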
<p>Although Firth’s axiom seems to call out words explicitly, there is no reason why it can’t apply to other data units; we already cited document VSMs as an example and mentioned earlier how nodes and edges in a KG could potentially be embedded. But how do we define context in a KG? This is the question that will concern us through much of this chapter. We start with a description of the skip-gram and continuous bag of words (CBOW) models that have become classic studies in good, fast embeddings for not just natural language (their original purpose), but a whole host of inspired data formats, including networks and graphs. As such, understanding these models is important for a true appreciation of other embedding models proposed specifically for KGs. Next, we provide a general overview of KGEs and the schools of thought that are currently prevalent. Because a general theory of what constitutes a good KGE is not fully formed, it is informative to study specific, highly influential KGEs that continue to be used and repurposed in real-world KG applications.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec10-2"/><b>10.2 Embedding Architectures: A Primer</b></h2>
<p class="noindent">Many types of models have been proposed for estimating continuous representations of words, including the well-known LDA and Latent Semantic Analysis (LSA) models. One <span aria-label="244" id="pg_244" role="doc-pagebreak"/>motivation for using neural networks, however, is that they can outperform LSA by better preserving linear regularities among words; furthermore, topic models like LDA become more expensive on larger data sets due to their generative nature.</p>
<p>A basic probabilistic feedforward neural network language model (NNLM) consists of input, projection, hidden, and output layers. The assumption is that the NNLM is applied to a corpus, which may be assumed to be a set of word sequences (we do not distinguish between sequences that belong to the same document in a corpus, and those that do not). To visualize how the network works, imagine that a window of size <i>N</i> is slid over each sequence; it is these windows that serve as piecewise inputs to the neural network itself. Namely, at the input layer, <i>N previous</i> words are encoded using 1-of-<i>V</i> coding, where <i>V</i> is the size of the vocabulary. The input layer is then projected to a projection layer <i>P</i> that has dimensionality <i>N</i> × <i>D</i>, using a shared projection matrix. As only <i>N</i> inputs are active at any given time, composition of the projection layer is a relatively cheap operation; however, the architecture becomes complex for computation between the projection and the hidden layer, as values in the projection layer are dense. Common choices are <i>N</i> = 10, with the projection layer having size 500–2,000 and the hidden layer, 500–1,000. Because the hidden layer is used to compute a probability distribution over all the words in the vocabulary, the output layer ends up with dimensionality <i>V</i>, leading to a per-training-example computational complexity of (<i>N</i> × <i>D</i>) + (<i>N</i> × <i>D</i> × <i>H</i>) + (<i>H</i> × <i>V</i>). The last expression (<i>H</i> × <i>V</i>) clearly dominates, but several practical steps can be taken to avoid it, including using hierarchical versions of the softmax as discussed next, or avoiding normalization during training. With these steps in place,<sup><a href="chapter_10.xhtml#fn1x10" id="fn1x10-bk">1</a></sup> most of the complexity is caused by the term <i>N</i> × <i>D</i> × <i>H</i>.</p>
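The per-training-example complexity just derived is easy to check numerically. A minimal sketch, plugging in the common choices quoted above (N = 10, D = 1,000, H = 500, and a one-million-word vocabulary):

```python
# Per-training-example complexity of the feedforward NNLM described above:
# Q = N*D + N*D*H + H*V.
def nnlm_complexity(N, D, H, V):
    projection = N * D    # input -> projection layer (shared matrix)
    hidden = N * D * H    # projection -> hidden layer (dense)
    output = H * V        # hidden -> softmax over the full vocabulary
    return projection, hidden, output

proj, hid, out = nnlm_complexity(N=10, D=1000, H=500, V=1_000_000)
# Without tricks such as hierarchical softmax, the output term H*V dominates.
print(proj, hid, out)  # 10000 5000000 500000000
```

Once the H × V term is removed (e.g., via hierarchical softmax), the N × D × H term is indeed what remains dominant.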
<p><i>Hierarchical softmax</i> is a popular method for addressing the complexity challenge of fully computing and materializing a probability distribution over all the words in the vocabulary by representing the vocabulary as a <i>Huffman binary tree</i>. Huffman trees assign short binary codes to frequent words, and this further reduces the number of output units that need to be evaluated. This is more efficient than a balanced binary tree, which requires <i>log</i><sub>2</sub>(<i>V</i>) outputs to be evaluated. The Huffman tree, in essence, leverages the effect of data skew by requiring only about <i>log</i><sub>2</sub><i>(UnigramPerplexity(V))</i> outputs<sup><a href="chapter_10.xhtml#fn2x10" id="fn2x10-bk">2</a></sup> to be evaluated, leading to nontrivial speedups (e.g., 2x) for realistic vocabulary sizes (more than 1 million words). Note that this step, by itself, cannot address the high complexity caused by the term <i>N</i> × <i>D</i> × <i>H</i>, but other architectures that have since become popular in the community (including skip-grams, as discussed in the next section) do address this problem. In those architectures, <span aria-label="245" id="pg_245" role="doc-pagebreak"/>efficiency of the softmax normalization becomes important, and the benefits of using a hierarchical softmax and a Huffman binary tree become more readily apparent. Clearly, the larger the vocabulary, the higher the benefit of using this methodology in the output layer.</p>
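The effect of Huffman coding on the number of output evaluations can be illustrated directly. The sketch below is a standard Huffman construction over a hypothetical Zipf-skewed vocabulary (not code from any particular word2vec implementation); it shows the frequency-weighted average code length falling well below the log2(V) cost of a balanced tree:

```python
import heapq
import itertools
import math

def huffman_code_lengths(freqs):
    """Return a dict word -> code length for a Huffman tree over the frequencies."""
    counter = itertools.count()  # tie-breaker so heapq never compares dicts
    heap = [(f, next(counter), {w: 0}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)  # merge the two least frequent subtrees;
        f2, _, d2 = heapq.heappop(heap)  # every word inside them gets one bit longer
        merged = {w: l + 1 for w, l in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

# Zipf-like skew: word i has frequency proportional to 1/i, as in natural corpora.
V = 1024
freqs = {f"w{i}": 1.0 / i for i in range(1, V + 1)}
total = sum(freqs.values())
lengths = huffman_code_lengths(freqs)
expected_len = sum(freqs[w] / total * lengths[w] for w in freqs)
# Frequent words get short codes, so the average beats the balanced tree's log2(V) = 10.
print(expected_len < math.log2(V))  # True
```

The average code length here tracks the entropy of the unigram distribution, which is exactly the log2(UnigramPerplexity(V)) figure quoted above.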
<section epub:type="division">
<h3 class="head b-head"><a id="sec10-2-1"/><b>10.2.1 Continuous Bag of Words Model</b></h3>
<p class="noindent">The CBOW model is similar to the feedforward NNLM, but the nonlinear hidden layer is removed and the projection layer is shared for all words, not just the projection matrix; hence, all words get projected into the same position because their vectors are averaged. The reason why this architecture is called a “bag of words” model is because the order of words in the history does not influence projection. By sliding a window of roughly size 8 over the corpus, the training criterion is to correctly classify the current (middle) word by leveraging words from both the history (four previous words) and the future (four following words). The training complexity becomes (<i>N</i> × <i>D</i>) + (<i>D</i> × <i>log</i><sub>2</sub>(<i>V</i>)), which is substantially less than the original complexity of (approximately) <i>N</i> × <i>D</i> × <i>H</i>.</p>
<p>Why is the model called the <i>continuous</i> bag of words model? The main reason is that unlike the classic bag of words model, the representation of the context is both continuous and distributed. <a href="chapter_10.xhtml#fig10-2" id="rfig10-2">Figure 10.2</a> expresses the CBOW architecture, along with the similar skip-gram architecture described next. The weight matrix between the input and projection layer is shared for all word positions, just like in the NNLM.</p>
<div class="figure">
<figure class="IMG"><a id="fig10-2"/><img alt="" src="../images/Figure10-2.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig10-2">Figure 10.2</a>:</span> <span class="FIG">Comparative illustrations of the skip-gram and CBOW neural models, which have become extremely popular in word representation learning (“embedding”) due to their speed and good performance on word analogy tasks. Here, <i>x</i><sub><i>t</i></sub> is the target word, and <i>x</i><sub>1</sub><i>, <span class="ellipsis"></span>, x</i><sub><i>k</i></sub> are the “context” words for some predetermined window size <i>k</i>. While CBOW takes the context words as input and predicts the target word, skip-gram operates in the opposite way.</span></p></figcaption>
</figure>
</div>
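A minimal forward pass makes the CBOW computation concrete. The sketch below is illustrative only (toy sizes, random weights, and a full softmax rather than the hierarchical softmax an efficient implementation would use):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 50, 8  # toy vocabulary size and embedding dimensionality

# Shared input->projection matrix (one row per vocabulary word) and output weights.
W_in = rng.normal(scale=0.1, size=(V, D))
W_out = rng.normal(scale=0.1, size=(D, V))

def cbow_forward(context_ids):
    """Average the context word vectors (order is ignored -- a 'bag' of words),
    then score every vocabulary word as a candidate for the middle position."""
    h = W_in[context_ids].mean(axis=0)     # all contexts share one projection
    scores = h @ W_out                     # this full softmax costs O(D*V);
    probs = np.exp(scores - scores.max())  # hierarchical softmax reduces it to
    return probs / probs.sum()             # roughly O(D*log2(V)), as noted above

probs = cbow_forward([3, 14, 15, 9])  # e.g., two history + two future word ids
print(probs.shape)  # (50,)
```

Training would then nudge the weights so that the probability of the true middle word rises; the averaging step is exactly why word order within the window is lost.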
</section>
<section epub:type="division">
<h3 class="head b-head"><span aria-label="246" id="pg_246" role="doc-pagebreak"/><a id="sec10-2-2"/><b>10.2.2 Skip-Gram Model</b></h3>
<p class="noindent">Skip-gram is similar to CBOW, but instead of predicting the current word based on the context, it tries to maximize classification of a word based on another word in the same sentence. <a href="chapter_10.xhtml#fig10-2">Figure 10.2</a> expresses the intuition behind the difference between CBOW and skip-gram. Each current word is used as an input to a log-linear classifier with a continuous projection layer, and words are predicted within a certain range both before and after the current word. Mikolov et al. (2013) found that, while increasing this range could improve the quality of the resulting word vectors, it came at the cost of increasing the computational complexity. Because the more distant words are usually less related to the current word (compared to closer words), less weight is given to the distant words by sampling less from those words in the training examples. Consequently, the training complexity of the architecture is proportional to <i>C</i> × (<i>D</i> + (<i>D</i> × <i>log</i><sub>2</sub>(<i>V</i>))), <i>C</i> being the maximum distance of the words. For example, if <i>C</i> = 4, then for each training word, a number <i>R</i> is randomly selected in the range &lt;1, 4&gt;, following which <i>R</i> words from both the history and the future of the current word are used as correct labels, requiring <i>R</i> × 2 word classifications (with the current word as input, and each of the <i>R</i> + <i>R</i> words as output). In the original paper on skip-grams, a value of <i>C</i> = 10 was suggested as appropriate.</p>
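The dynamic-window sampling scheme can be sketched in a few lines. The function and toy sentence below are our own illustrative assumptions; the function emits (input, label) pairs exactly as described, drawing a radius R uniformly from 1..C at each position so that distant words are sampled less often:

```python
import random

def skipgram_pairs(sentence, C=4, seed=0):
    """Generate (input, label) skip-gram training pairs: for each position,
    draw R uniformly from 1..C and emit the up-to-R words before and after
    as labels, so nearer words appear in more pairs than distant ones."""
    rnd = random.Random(seed)
    pairs = []
    for i, word in enumerate(sentence):
        R = rnd.randint(1, C)
        for j in range(max(0, i - R), min(len(sentence), i + R + 1)):
            if j != i:
                pairs.append((word, sentence[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"], C=2)
print(pairs[:4])
```

Each pair is then a single classification problem (current word in, context word out), which is where the C × (D + D × log2(V)) complexity comes from.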
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec10-3"/><b>10.3 Embeddings beyond Words</b></h2>
<p class="noindent">We suggested earlier that, generally, any data unit can be embedded in a vector space, so long as an appropriate context is available, and Firth’s axiom is applicable. However, the devil is often in the details, as we saw even with word embeddings. What, for example, are we trying to optimize, and what should serve as a context? How do we model graphs, networks, or other structured data sets so that they are amenable to similar procedures?</p>
<p>Over the last decade, the machine learning community has vigorously taken up the challenge of embedding myriad datatypes and modalities, including KGs. As we describe in the section entitled “Bibliographic Notes” at the end of the chapter, this field is popular enough that it now goes under the general name of <i>representation learning</i>. In addition to being a staple at classic machine learning conferences, representation learning now even has its own conference series. Starting with the next section, we will be exclusively focusing on KGEs; however, before moving on to KGEs, we provide a brief primer on how the same intuitions behind models like skip-gram can be applied to network embeddings.</p>
<p>A network, such as a social or telecommunications network, is like a simpler version of a KG, in that (in the simplest and best-known cases) the nodes are of a homogeneous type, and there is only a single kind of edge. If an ontology were to be formalized for such a network, it would have only one concept (e.g., Person) and one relationship (e.g., friend). Even this simple network is very important, and many real-world social phenomena can be modeled this way (e.g., the Facebook friendship network, the Twitter hashtag cooccurrence network, and so on). There are important problems entailed by such networks, such <span aria-label="247" id="pg_247" role="doc-pagebreak"/>as <i>link prediction</i>, predicting when two people are friends or otherwise socially or professionally connected (in which case, a friend invitation can be recommended by a social network company like Facebook); but also <i>node classification</i> (e.g., if a particular person is wealthy and hence, amenable to targeted advertising of expensive watches). In the social sciences, it is well understood that the structure of the network alone can provide valuable clues to solving such problems. Concerning friendship prediction, for example, it is well known that a friend of a friend is more likely to be a friend than not; furthermore, the more friends A and B have in common, the more likely it is that they themselves are friends. Similarly, phenomena like the <i>rich club effect</i> suggest that wealthy people are connected to other wealthy people.</p>
<p><span aria-label="248" id="pg_248" role="doc-pagebreak"/>How do we extract good features using only the structure of the network? Classic (and even some modern) techniques relied on methods like matrix factorization (on an appropriate graph representation, like the adjacency matrix), but in the last few years, node embedding algorithms, based on neural networks, have become popular for featurizing nodes using structural context. An early such model was DeepWalk, which has a very intuitive algorithm behind it. The core philosophy is illustrated in <a href="chapter_10.xhtml#fig10-3" id="rfig10-3">figure 10.3</a>. As a first step, the algorithm initiates <i>random walks</i> starting from each node. Each random walk can be thought of as a sentence, because it is just a sequence of nodes. The full set of random walks is exactly like a corpus, with nodes as words and can be embedded using a standard word-embedding algorithm such as word2vec. Although it sounds simple, DeepWalk achieved impressive results on a number of standard social network problems, and because of its reliance on word2vec, has fast execution times. Other node embedding algorithms have since superseded DeepWalk as state-of-the-art, including LINE and node2vec, but the basic philosophy behind these algorithms is similar to that of DeepWalk.</p>
<div class="figure">
<figure class="IMG"><a id="fig10-3"/><img alt="" src="../images/Figure10-3.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig10-3">Figure 10.3</a>:</span> <span class="FIG">An illustration of DeepWalk, a network-embedding approach that relies on word embeddings as an underlying mechanism, on the Zachary Karate Club network. The concept is relatively simple: first, a “corpus” of random walks is constructed, with each random walk interpreted by the word-embedding algorithm (e.g., CBOW or skip-gram, but more modern embedding implementations could also be applied) as a sentence, and with nodes as words. The result is an embedding for each word (in this case, node). The algorithm can be extended in multiple ways (e.g., for directed graphs) and at this time, more sophisticated embeddings of this type have been proposed in the network community as well (LINE, node2vec, and several others). However, the algorithm continues to be reasonably popular, probably owing to its simplicity.</span></p></figcaption>
</figure>
</div>
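The first step of DeepWalk, building a corpus of random walks, is simple enough to sketch in full. The graph and parameter values below are illustrative; a real pipeline would feed the resulting walks to a word2vec-style trainer:

```python
import random

# A small undirected graph as an adjacency dict (hypothetical toy data).
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"], "e": ["d", "f"], "f": ["e"],
}

def random_walk_corpus(graph, walks_per_node=10, walk_length=5, seed=0):
    """DeepWalk's first step: build a 'corpus' of truncated random walks.
    Each walk is a sequence of node ids, treated downstream exactly like a
    sentence of words by a word-embedding algorithm such as skip-gram."""
    rnd = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        for start in graph:          # every node gets to be a walk's starting word
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rnd.choice(graph[walk[-1]]))
            corpus.append(walk)
    return corpus

corpus = random_walk_corpus(graph)
print(len(corpus), corpus[0])
```

Nodes that co-occur in many walks share "contexts" in the Firthian sense, which is why a word-embedding objective places them near each other in the vector space.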
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec10-4"/><b>10.4 Knowledge Graph Embeddings</b></h2>
<p class="noindent">The previous section discussed some exciting possibilities for embedding ordinary networks using techniques inspired by word embeddings, as well as other representation learning methods first defined in related communities like NLP. There are limitations to some of these techniques, however, especially when applied to KGs. Take, for example, the random walk methodology adopted by DeepWalk for learning node representations. How can we extend it to learning about <i>edge</i> representations? One option is to incorporate the edge label in the random walk itself—that is, inserting the edge label between every two (consecutively traversed) nodes. It is debatable whether this will work, because the number of unique relationships (specified in an ontology) may be small compared to the number of unique nodes (instances in the KG). Furthermore, the directionality of edges can cause problems for algorithms inspired by word embeddings. Yet another problem is connectivity; there is no guarantee that KGs are connected graphs, and even when there is connectivity, it tends to be a result of hub nodes (e.g., “United States” may be a hub node <span aria-label="249" id="pg_249" role="doc-pagebreak"/>connecting many, though not all, nodes in a KG of pop artists, who may not share much in common but may mostly be based in the US or born in the US). Another reason for favoring special embeddings for KGs is that most KG nodes and edges have <i>meaningful</i> labels, and using only structural information is likely not going to yield the best performance, as some valuable text information is being thrown away in the process. Beyond these reasons, experimentally, we are not aware of any methods directly inspired by word embeddings that have yielded successful state-of-the-art results on KGs.</p>
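The edge-label-in-the-walk option mentioned above can be sketched as follows. The tiny KG is hypothetical; note how a walk becomes a node-relation-node sequence, and how edge directionality means a walk can dead-end at a sink node:

```python
import random

# A tiny labeled KG as (head, relation, tail) triples -- hypothetical data.
triples = [
    ("elvis", "bornIn", "tupelo"), ("tupelo", "locatedIn", "usa"),
    ("madonna", "bornIn", "bay_city"), ("bay_city", "locatedIn", "usa"),
]
out_edges = {}
for h, r, t in triples:
    out_edges.setdefault(h, []).append((r, t))

def labeled_walk(start, hops=2, seed=0):
    """Interleave edge labels between consecutively traversed nodes, so that
    relations would be embedded alongside entities by a word-embedding model.
    Walks stop early at sink nodes, reflecting that KGs are directed and
    often only weakly connected."""
    rnd = random.Random(seed)
    walk = [start]
    for _ in range(hops):
        if walk[-1] not in out_edges:
            break  # dead end: no outgoing edges from this node
        r, t = rnd.choice(out_edges[walk[-1]])
        walk += [r, t]
    return walk

print(labeled_walk("elvis"))  # ['elvis', 'bornIn', 'tupelo', 'locatedIn', 'usa']
```

The sketch also makes the objections above visible: the few relation labels (here, two) would be vastly outnumbered by entity tokens in any realistic walk corpus, and hub nodes like "usa" would dominate many walks.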
<p>Instead, separate embedding algorithms have been designed, refined, and evaluated for KGs since they first started to become popular (in the early 2010s). In just this last decade, tens of algorithms have been proposed, some of which are variations on a common theme (but remain very valuable due to the improved experimental results on challenging benchmark data sets). In this section, we cover some of the important ones that have withstood the test of time thus far and inspired others like themselves. First, however, we provide a general overview of <i>energy functions</i>, which dictate how KGEs are optimized. In practice, the difference between KGE algorithms often boils down to a difference in their respective energy function formulations.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec10-4-1"/><b>10.4.1 Energy Functions</b></h3>
<p class="noindent"><span aria-label="250" id="pg_250" role="doc-pagebreak"/>Recall that, in an ordinary machine learning context, the primary goal of optimization is to infer model parameters that lead to a minimization of a loss function. For example, when training a linear regression model, the loss function most commonly used is based on <i>least squares</i> (i.e., minimizing the sum of squares of errors, or residuals, where the error is defined as the difference between the predicted value (by the regression line) and the actual value).</p>
<p>In the context of KGEs, an <i>energy function</i> is a function (sometimes also called a <i>score function</i>) that is usually input into a margin-based objective function and that must be minimized (just like a loss function). A good example of a margin-based objective function, given an energy function <i>f</i>(<i>h, r, t</i>), is</p>
<figure class="DIS-IMG"><a id="eq10-1"/><img alt="" class="width" src="../images/eq10-1.png"/>
</figure>
<p>Notice how the energy function is typically just a function of the <i>triple</i>, which is the most intuitive context for a relationship and entities (that constitute that triple). Recall that Firth’s axiom was an abstract principle, and it is up to the designer of an embedding algorithm to determine the appropriate context for each data unit. A very simple energy function (used in one of the earliest KGE algorithms) is ||<i>v</i><sub><i>h</i></sub> − <i>v</i><sub><i>t</i></sub>||<sub><i>l</i><sub>1/2</sub></sub> (i.e., the <i>l</i><sub>1</sub> or <i>l</i><sub>2</sub> norm of the difference between the vectors representing the head and tail entities). The energy should be low for <i>correct</i> triples, assumed to be in the graph <span class="font">&#119970;</span> in equation (<a href="chapter_10.xhtml#eq10-1">10.1</a>), and high for <i>incorrect</i> triples (assumed to be in the graph <span class="font">&#119970;</span>′). Generally, <span class="font">&#119970;</span> is assumed to be an initial KG (albeit incomplete) input into the algorithm, while <span class="font">&#119970;</span>′ is not explicitly provided, but is constructed from <span class="font">&#119970;</span> using various sample-and-replace permutations. While there are sometimes differences in how this training data is constructed, the basic principle behind constructing the negative graph <span class="font">&#119970;</span>′ is that a random replacement of a head entity <i>h</i> or tail entity <i>t</i> with some other entity (denoted as <i>h</i>′ or <i>t</i>′, depending on whether the head or the tail entity is getting replaced) in the graph will almost always yield an incorrect triple. This principle is important precisely because the original graph is incomplete. Without this assumption (or even with it, frankly), we cannot always guarantee that a triple in <span class="font">&#119970;</span>′ is truly incorrect.
When evaluating KGEs, it is common to check that triples in <span class="font">&#119970;</span>′ are truly incorrect by ensuring that the triple does not occur in either <span class="font">&#119970;</span> or in the withheld test set of correct triples that the KGE does not access when learning the embedding. Notice also that the simple energy function we used does not depend on <i>r</i>. In more practical and realistic KGEs, however, vector (or matrix) representations of both entities and the relationship in the triple are involved in computing the score.</p>
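A margin-based objective of the kind in equation (10.1) can be sketched in a few lines of numpy. The sketch below uses the well-known TransE energy ||v_h + v_r − v_t|| rather than the simpler r-free function just discussed, together with the sample-and-replace negative sampling described above; the entities, dimensionality, and margin value are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
entities = ["paris", "france", "tokyo", "japan"]
relations = ["capitalOf"]
d = 16  # embedding dimensionality (toy value; ~100 is typical)
E = {e: rng.normal(size=d) for e in entities}
R = {r: rng.normal(size=d) for r in relations}

def energy(h, r, t):
    # TransE-style energy: low when v_h + v_r lands close to v_t.
    return np.linalg.norm(E[h] + R[r] - E[t])

def corrupt(triple):
    """Negative sampling: replace the head or the tail with a random entity,
    which almost always yields an incorrect triple."""
    h, r, t = triple
    if rng.random() < 0.5:
        return rng.choice([e for e in entities if e != h]), r, t
    return h, r, rng.choice([e for e in entities if e != t])

def margin_loss(pos, neg, gamma=1.0):
    # Hinge loss: push correct triples to lower energy than corrupted
    # ones by at least the margin gamma; zero once the margin is met.
    return max(0.0, gamma + energy(*pos) - energy(*neg))

pos = ("paris", "capitalOf", "france")
loss = margin_loss(pos, corrupt(pos))
print(loss >= 0.0)  # True
```

Training would repeatedly sample such (positive, corrupted) pairs and update the embeddings by gradient descent on this loss, which is the SGD setup mentioned later in the chapter.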
<p>The energy function is directly related to the number of model parameters that the KGE algorithm is trying to infer. The more parameters there are, the more data is usually required to learn them well. <a href="chapter_10.xhtml#tab10-1" id="rtab10-1">Table 10.1</a> succinctly expresses the energy functions of many <span aria-label="251" id="pg_251" role="doc-pagebreak"/>established KGE architectures, with <a href="chapter_10.xhtml#tab10-2" id="rtab10-2">table 10.2</a> expressing the time and space complexity of these architectures in terms of the relevant parameters. The classic TransE algorithm, which has been hugely influential in the KGE community, requires 2<i>d</i> parameters to be inferred <i>per entity</i>,<sup><a href="chapter_10.xhtml#fn3x10" id="fn3x10-bk">3</a></sup> and <i>d</i> parameters <i>per relationship</i> (unique edge label). <i>d</i>, the embedding dimensionality, is usually some small fixed constant, such as 100. Even so, the number of parameters is certainly not trivial, even for this relatively early and pithy model (see the exercises at the end of this chapter). In general, both time and space complexity are affected, as <a href="chapter_10.xhtml#tab10-2">table 10.2</a> demonstrates.</p>
<div class="table">
<p class="TT"><a id="tab10-1"/><span class="FIGN"><a href="#rtab10-1">Table 10.1</a>:</span> <span class="FIG">Parameter descriptions of some well-known KGE architectures.</span></p>
<figure class="DIS-IMG"><img alt="" src="../images/Table10-1.png" width="450"/>
</figure>
</div>
<div class="table">
<p class="TT"><a id="tab10-2"/><span class="FIGN"><a href="#rtab10-2">Table 10.2</a>:</span> <span class="FIG">Time and space complexities of the KGE architectures in <a href="chapter_10.xhtml#tab10-1">table 10.1</a>.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>Algorithm</b></p></th>
<th class="TCH"><p class="TB"><b>Space Complexity</b></p></th>
<th class="TCH"><p class="TB"><b>Time Complexity</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB">TransE</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>nd</i> + <i>md</i>)</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>d</i>)</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">TransH</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>nd</i> + <i>md</i>)</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>d</i>)</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">TransR</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>nd</i> + <i>mdk</i>)</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>dk</i>)</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">TransD</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>nd</i> + <i>md</i>)</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>max</i>(<i>d, k</i>))</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">TransSparse</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>nd</i> + (1 − <i><span lang="el" xml:lang="el">θ</span></i>)<i>mdk</i>)</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>dk</i>)</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">TransM</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>nd</i> + <i>md</i>)</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>d</i>)</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">TransF</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>nd</i> + <i>md</i>)</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>d</i>)</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">TransA</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>nd</i> + <i>md</i><sup>2</sup>)</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>d</i><sup>2</sup>)</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">TransG</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>nd</i> + <i>mdc</i>)</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>dc</i>)</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">SE</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>nd</i> + <i>md</i><sup>2</sup>)</p></td>
<td class="TB"><p class="TB"><i>O</i>(<i>d</i><sup>2</sup>)</p></td>
</tr>
</tbody>
</table>
</figure>
</div>
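<p>To make the parameter accounting in table 10.2 concrete, the following toy calculation follows the <i>O</i>(<i>nd</i> + <i>md</i>) space complexity listed for TransE (one <i>d</i>-dimensional vector per entity and per relation). The specific entity and relation counts below are illustrative, roughly the size of a Freebase benchmark, and are not taken from the table:</p>

```python
# Back-of-the-envelope parameter count for TransE, following the
# O(nd + md) accounting in table 10.2: one d-dimensional vector per
# entity and one per relation.
def transe_param_count(n_entities, n_relations, d):
    return n_entities * d + n_relations * d

# Toy numbers roughly the size of a Freebase benchmark (illustrative only):
# 15,000 entities, 1,345 relations, d = 100.
total = transe_param_count(15_000, 1_345, 100)  # 1,634,500 parameters
```

<p>Even at this modest dimensionality, the entity table dominates: over a million parameters must be fit from the available triples.</p>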
<p>Unlike least-squares linear regression, which has some impressive theoretical properties (on the condition, of course, that the underlying assumptions are valid), modern deep learning and machine learning optimizations tend to be based more on the desiderata of the task, and can even be a matter of experimentation and trial-and-error. For example, it is still not completely clear, despite significant theoretical progress, how many layers should be included in a deep neural network, or what the effect of other model mechanisms (like dropout) will be on the final optimization outcome. The actual optimization procedure used in KGE implementations is fairly standard, with Stochastic Gradient Descent (SGD) being a common choice. Other alternatives include SGD+AdaGrad, SGD+AdaDelta, and L-BFGS.</p>
<p>Similarly, in the KGE community, it is not always clear whether (for example) relationship embeddings should be represented as vectors or matrices, or if enforcing sparsity has some value. The situation becomes even murkier when we deal with sparse KGs. Some recent influential work, for example, has shown that KGEs may not be appropriate if the underlying KG is too sparse and narrow, and that statistical relational algorithms do a better job (see the “Bibliographic Notes” section). Currently, there is a resurgence of interest in the broader artificial intelligence (AI) community to develop algorithms that are either inherently more explainable or methods that can extract some kind of interpretation from complex, black-box models (almost always based on optimizing a loss or energy function in a nonconvex space, usually using some kind of neural network or deep learning) trained on a good deal of data. It is likely that we will see some of this research play out in the context of KGEs as well.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec10-5"/><b>10.5 Influential KGE Systems</b></h2>
<p class="noindent">In this section, we describe some important KGE systems that are now considered established baselines. However, it is important to note that the research on KGEs is continuing <span aria-label="252" id="pg_252" role="doc-pagebreak"/>to evolve, and it is likely that several other good systems will have been published by the time this book appears in print. Nevertheless, it is unlikely that any of these systems will mark a radical departure from the core concepts embodied in the systems described next.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec10-5-1"/><b>10.5.1 Structured Embeddings</b></h3>
<p class="noindent">The technique of structured embeddings was among the first proposed for embedding KGs. The initial formalism, data sets, means of evaluation, and other aspects of the work have all continued to influence subsequent lines of work. Hence, it is useful to introduce the model from the beginning, starting with its core ideas.</p>
<p>First, as weve already seen, entities in KGs (just like words in documents) can be modeled in a <i>d</i>-dimensional continuous vector space, referred to as an <i>embedding space</i>. For the sake of notation,<sup><a href="chapter_10.xhtml#fn4x10" id="fn4x10-bk">4</a></sup> assume that the <i>ith</i> entity (in a list of entities in the KG) is assigned a vector <i>E</i><sub><i>i</i></sub><span class="font"></span><sup><i>d</i></sup>.</p>
<p>Second, within the embedding space, there is a specific similarity measure that captures (not necessarily symmetric) relationships between entities. For the <i>kth</i> relation, let us assume a pair of <i>d</i> × <i>d</i> matrices <img alt="" class="inline" height="20" src="../images/pg252-in-1.png" width="68"/>; the similarity function <i>S</i><sub><i>k</i></sub> for two entities related through <i>R</i><sub><i>k</i></sub> is then defined as <img alt="" class="inline" height="20" src="../images/pg252-in-2.png" width="215"/>, using the p-norm (with <i>p</i> = 1 in the original paper).</p>
<p>The intuition is that the matrices and vectors should be learned so as to maximize, to the greatest extent possible, similarities between entities that are truly related through a relation, and to minimize similarities between entities not thus related. This intuition needs to be modeled as a neural network, which can be seen as a generalization of a Siamese <span aria-label="253" id="pg_253" role="doc-pagebreak"/>network that generally takes a pair of inputs and tries to learn a similarity measure (<a href="chapter_10.xhtml#fig10-4" id="rfig10-4">figure 10.4</a> illustrates the general principle behind a Siamese network). Specifically, the energy function (note the change in notation) <img alt="" class="inline" height="20" src="../images/pg253-in-1.png" width="263"/> parameterizes a neural network and is trained to rank the training samples below all other triplets using 1-norm distance. Here <i>R</i><sup><i>lhs</i></sup> and <i>R</i><sup><i>rhs</i></sup> are <i>d</i> × <i>d</i> × <i>D</i><sub><i>r</i></sub> tensors, where <img alt="" class="inline" height="20" src="../images/pg253-in-2.png" width="24"/> is designed to select the <i>ith</i> component along the third dimension of <i>R</i><sup><i>lhs</i></sup>, yielding a <i>d</i> × <i>d</i> matrix slice. <i>d</i> is usually small for most KGs, with a recommended value of 50 in the original paper. <i>E</i> is a <i>d</i>×<i>D</i><sub><i>e</i></sub> matrix containing the embeddings of <i>D</i><sub><i>e</i></sub> entities, while the vectorization function <i>v</i>(<i>n</i>): {1<i>, <span class="ellipsis"></span>, D</i><sub><i>e</i></sub>} <i></i> <span class="font"></span><sup><i>D</i><sub><i>e</i></sub></sup> maps the entity dictionary index (n) into a sparse vector of dimension <i>D</i><sub><i>e</i></sub> consisting of all zeros except for a one in the <i>nth</i> dimension (1-hot encoding).</p>
<div class="figure">
<figure class="IMG"><a id="fig10-4"/><img alt="" src="../images/Figure10-4.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig10-4">Figure 10.4</a>:</span> <span class="FIG">Illustration of the architecture of a Siamese neural network.</span></p></figcaption>
</figure>
</div>
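<p>The structured-embeddings similarity can be sketched in a few lines of NumPy. This is a minimal illustration under the conventions described above (one <i>d</i>-dimensional vector per entity, two <i>d</i> × <i>d</i> matrices per relation, <i>p</i> = 1); the variable names are ours, and the convention that a lower distance means a more plausible triple is assumed:</p>

```python
import numpy as np

def se_distance(R_lhs, R_rhs, e_head, e_tail, p=1):
    """Structured-embeddings energy: p-norm distance between the two
    relation-transformed entity vectors (lower = more plausible)."""
    return np.linalg.norm(R_lhs @ e_head - R_rhs @ e_tail, ord=p)

# Toy setup: d = 4, random entity vectors and relation matrices.
rng = np.random.default_rng(0)
d = 4
E_i, E_j = rng.standard_normal(d), rng.standard_normal(d)
R_lhs, R_rhs = rng.standard_normal((d, d)), rng.standard_normal((d, d))
score = se_distance(R_lhs, R_rhs, E_i, E_j)
```

<p>During training, the matrices and vectors are adjusted so that this distance is small for true triples and large for corrupted ones.</p>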
<p class="noindent">The learning of the matrix <i>E</i> is an example of <i>multitasking</i> because a single embedding matrix is used for all relations, with the embedding of an entity containing factorized information contributed to by all relations in which the entity is involved. For each entity, the model learns how it interacts with other entities with respect to all the relation types. A major advantage of this formulation is that it is memory-efficient and scalable.</p>
<p>Training is done using SGD, and the negative training set is constructed in a way that is very similar to what we described in the previous section—namely, a positive training triple (i.e., a triple that exists in the initial graph) is randomly selected, and either its head or tail entity is randomly replaced, such that the new triple is not a positive training triple. Furthermore, during training, normalization is enforced (i.e., each column ||<i>E</i><sub><i>i</i></sub>|| equals 1 for all values of <i>i</i>). Other details are provided in the original paper (see the “Bibliographic Notes” section).</p>
<p><span aria-label="254" id="pg_254" role="doc-pagebreak"/>An interesting aspect of SE is the method chosen to estimate the probability of an arbitrary triple being correct (after the model has been trained). The authors introduced a <i>kernel density estimation</i> (KDE) function that can estimate the density for any triple. This function can be used for ranking [e.g., given the relationship <i>r</i> and head entity <i>h</i>, candidate tail entities can be ranked by creating a triple (<i>h, r, t</i>) for every candidate tail entity <i>t</i> and computing the density for that triple using the KDE proposed in the paper]. The higher the density of the triple, the more probable that it is correct, and hence the better the rank of the candidate tail entity. In this way, a full ranking can be produced over all such candidates. A similar procedure can be adopted regardless of whether it is the tail entity, the head entity, or the relationship that is missing.</p>
<p>Empirically, Structured Embeddings, being one of the first embeddings proposed, was evaluated against a simple, nonembedding baseline, as well as variants of the proposed algorithm. Two tasks were considered: a ranking over missing tail entities in triples, and over missing head entities in triples. The authors reported the mean ranks (lower is better) and Hits@10 (between 0 and 1; higher is better), that is, the rate of correct predictions ranked in the top 10 elements per triple query. Two benchmark data sets, which were also adopted by other authors in the community, were used for the evaluations (namely, WordNet and Freebase). WordNet is a general resource valuable across NLP and information retrieval, and it has been brought up in several chapters of this book. For our current purposes, it suffices to think of it as a KG of words, with relationships expressing word semantics such as synonymy. Freebase, which was acquired by Google in 2010,<sup><a href="chapter_10.xhtml#fn5x10" id="fn5x10-bk">5</a></sup> was originally a crowdsourced KG (not unlike Wikipedia, which is a crowdsourced encyclopedia) where users could contribute facts.<sup><a href="chapter_10.xhtml#fn6x10" id="fn6x10-bk">6</a></sup> The version of Freebase used for the evaluations is still publicly available and has allowed many other researchers to replicate the results (and compare it to their own embeddings).</p>
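<p>Both reported metrics are straightforward to compute once every candidate entity has been scored for a triple query. A minimal sketch (the function names are ours; ranks are 1-based, and ties are broken by candidate order):</p>

```python
import numpy as np

def rank_of_correct(scores, correct_idx):
    """1-based rank of the correct candidate when candidates are
    sorted by descending score (ties broken by candidate order)."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    return int(np.where(order == correct_idx)[0][0]) + 1

def mean_rank_and_hits_at_10(ranks):
    """Mean rank (lower is better) and Hits@10 (higher is better)."""
    ranks = np.asarray(ranks)
    return float(ranks.mean()), float((ranks <= 10).mean())
```
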
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec10-5-2"/><b>10.5.2 Neural Tensor Networks</b></h3>
<p class="noindent">Following on the heels of structured embeddings, neural tensor networks (NTNs) were proposed as an alternative way to achieve good KGEs.<sup><a href="chapter_10.xhtml#fn7x10" id="fn7x10-bk">7</a></sup> The main difference was in the optimization function because the NTN replaced the standard linear neural network layer with a bilinear tensor layer that directly related two entity vectors across multiple dimensions. Specifically, the model computes a score of how likely it is that two entities are in a <span aria-label="255" id="pg_255" role="doc-pagebreak"/>certain relationship by the following function [using the notation<sup><a href="chapter_10.xhtml#fn8x10" id="fn8x10-bk">8</a></sup> in Socher et al. (2013), where NTNs were first proposed], which is based on an NTN:</p>
<figure class="DIS-IMG"><a id="eq10-2"/><img alt="" class="width" src="../images/eq10-2.png"/>
</figure>
<p>In equation (<a href="chapter_10.xhtml#eq10-2">10.2</a>), <i>f</i> is the <i>tanh</i> function, which is nonlinear and is applied element-wise, <img alt="" class="inline" height="22" src="../images/pg255-in-1.png" width="100"/> is a tensor, and the bilinear tensor product <img alt="" class="inline" height="22" src="../images/pg255-in-2.png" width="66"/> yields a vector <i>h</i><span class="font">∈ ℝ</span><sup><i>k</i></sup> with each entry computed by one slice <i>i</i> = 1<i>, <span class="ellipsis">…</span>, k</i> of the tensor <img alt="" class="inline" height="21" src="../images/pg255-in-3.png" width="111"/>. The other parameters for relation <i>R</i> are just like in a standard neural network (<i>V</i><sub><i>R</i></sub><span class="font">∈ ℝ</span><sup><i>k</i>×2<i>d</i></sup>, <i>U</i><span class="font">∈ ℝ</span><sup><i>k</i></sup>, <i>b</i><sub><i>R</i></sub><span class="font">∈ ℝ</span><sup><i>k</i></sup>). A key advantage of the formulation of KGEs as an NTN is that the two inputs (entities) are related <i>multiplicatively</i> instead of (implicitly) through the nonlinearity that is standard in other neural network models where the entity vectors simply get concatenated. Each slice of the tensor may be intuitively seen as being responsible for one type of entity pair or relationship instantiation. The model may, using this formulation, be able to learn that both a scientific paper and a piece of equipment have components [in the KG formulation, <i>(scientific_paper, has_component, x)</i>, where <i>x</i> might be abstract or experimental results from various parts of the word vector space]. Experimentally, the authors showed that this can lead to performance improvements.</p>
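<p>A minimal NumPy sketch of the NTN scoring function may help fix the shapes. All dimensions and parameter values below are toy choices, and the variable names are ours:</p>

```python
import numpy as np

def ntn_score(e1, e2, W, V, b, u):
    """NTN score u^T tanh(e1^T W^[1:k] e2 + V [e1; e2] + b).
    W has shape (k, d, d): one d x d slice per output dimension."""
    bilinear = np.einsum("i,kij,j->k", e1, W, e2)  # one entry per tensor slice
    return u @ np.tanh(bilinear + V @ np.concatenate([e1, e2]) + b)

# Toy dimensions: d = 4 entity dimension, k = 3 tensor slices.
rng = np.random.default_rng(1)
d, k = 4, 3
e1, e2 = rng.standard_normal(d), rng.standard_normal(d)
W = rng.standard_normal((k, d, d))
V = rng.standard_normal((k, 2 * d))
b, u = rng.standard_normal(k), rng.standard_normal(k)
score = ntn_score(e1, e2, W, V, b, u)
```

<p>Note how the bilinear term relates the two entity vectors multiplicatively, while the <i>V</i> term treats their concatenation exactly as a standard layer would.</p>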
<p>The NTN is trained with contrastive max-margin objective functions, just like the structured embeddings. Recall that the principal idea behind this optimization was that each triple in the training set should receive a higher score than triples where one of the entities is randomly replaced. In the NTN, each relation has its associated tensor net parameters. Letting <span lang="el" xml:lang="el">Ω</span> represent the set of all relationship parameters, the following objective should be minimized; <i>g</i> having been defined in equation (<a href="chapter_10.xhtml#eq10-2">10.2</a>) already:</p>
<figure class="DIS-IMG"><a id="eq10-3"/><img alt="" class="width" src="../images/eq10-3.png"/>
</figure>
<p>As in other similar loss formulations, the hyperparameter <i><span lang="el" xml:lang="el">λ</span></i> serves as a regularization mechanism for ensuring that the parameters are sparse in their values to avoid overfitting. Another unique aspect of NTN was that, unlike earlier KGE models, which randomly initialize entity vectors before optimization begins, it represents entities by their word vectors, initializing those word vectors with pretrained embeddings. This allowed sharing of statistical strength between words describing each entity. Although this can cause some noise (e.g., “New York” and “York” would share some overlap in their initial vectors, even though York is no more similar to New York than, say, Paris), it can also have benefits (e.g., “Los Angeles” and “Los Angeles International Airport” would have similar initial vectors, as they should) and may potentially lead to faster convergence in some KGs. The authors <span aria-label="256" id="pg_256" role="doc-pagebreak"/>of NTN represent each entity vector by averaging its word vectors, and use unsupervised word vectors that are pretrained over common sources. The authors did experiment with Recurrent Neural Networks as a more sophisticated alternative to simple averaging, but without experimental benefits. Just like structured embeddings, NTN was evaluated over WordNet and Freebase. Compared to structured embeddings, it achieves an average improvement in accuracy of over 20 percent.</p>
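<p>The word-vector initialization strategy can be sketched as follows. The tiny word-vector table is fabricated purely for illustration; in the actual experiments, the word vectors come from unsupervised pretraining over large corpora:</p>

```python
import numpy as np

# Fabricated two-dimensional "word vectors" purely for illustration.
word_vecs = {
    "new":   np.array([0.1, 0.3]),
    "york":  np.array([0.2, 0.1]),
    "paris": np.array([0.9, 0.8]),
}

def init_entity_vector(entity_name, word_vecs):
    """Initialize an entity embedding as the average of its word vectors."""
    words = entity_name.lower().split("_")
    return np.mean([word_vecs[w] for w in words], axis=0)

v_ny = init_entity_vector("New_York", word_vecs)  # average of "new" and "york"
```
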
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec10-5-3"/><b>10.5.3 Translational Embedding Models</b></h3>
<p class="noindent">Translational embedding models exploit distance-based scoring functions by measuring the plausibility of a fact as the distance between the two entities, usually after a translation carried out by the relation. The intuition behind translation is shown in <a href="chapter_10.xhtml#fig10-5" id="rfig10-5">figure 10.5</a>.</p>
<div class="figure">
<figure class="IMG"><a id="fig10-5"/><img alt="" src="../images/Figure10-5.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig10-5">Figure 10.5</a>:</span> <span class="FIG">An illustration of basic translation (in the context of KGEs) that is exploited by all of the Trans* algorithms in increasingly sophisticated ways. In this example, an entity such as “London” can be translated into “United Kingdom” using the relation “capital_of:,” which is a single vector that allows entities from one class (in this example) to be translated to entities from another class. In practice, translation tends to work well when entities can be (at least implicitly) clustered in such semantically meaningful ways, although more sophisticated variants are able to learn very general translations for relations that may not be between entities belonging to such well-defined classes.</span></p></figcaption>
</figure>
</div>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec10-5-4"/><b>10.5.4 TransE</b></h3>
<p class="noindent">TransE is an energy-based model for learning low-dimensional embeddings of entities, such that relations are represented as translations in the embedding space: given a true triple or fact (<i>h, r, t</i>), the embedding of the tail entity <i>t</i> should be close to the embedding of the head entity <i>h</i> plus a vector that depends on the relationship <i>r</i>. TransE relies on a reduced set of parameters as it learns only one low-dimensional vector for each entity and each relationship.</p>
<p>The main motivation behind the translation-based parameterization is that hierarchical relationships are extremely common in KBs, and translations are the natural transformations for representing them. For example, consider tree representations wherein the siblings are close to each other and nodes at a given height are organized on the <i>x</i>-axis, with the parent-child relationship corresponding to a translation on the <i>y</i>-axis. In this context, a null translation vector corresponds to an equivalence relationship between entities, and the model can also represent the sibling relationship. This directly motivated the authors to use only one low-dimensional vector to represent the key relationships in KBs. A secondary motivation was guided by the analogical findings in the word-embedding and natural-language communities, where 1-to-1 relationships between various-type entities (like actors and movies) such as <i>starred-as</i> could be represented by the model as translations in the embedding space. Therefore, there may be empirical reason to suppose that such a thing might be achievable for entity embeddings derived from KGs rather than natural language.</p>
<p>More specifically, TransE represents both entities and relations as <i>d</i>-dimensional real-valued vectors in the same latent space (a subset of <span class="font"></span><sup><i>d</i></sup>). Given a fact (<i>h, r, t</i>), the relation is interpreted as a translation vector <img alt="" class="inline" height="12" src="../images/r-vector.png" width="8"/> such that the embedded entities <img alt="" class="inline" height="17" src="../images/h-vector.png" width="9"/> and <img alt="" class="inline" height="15" src="../images/t-vector.png" width="8"/> can be connected by <img alt="" class="inline" height="12" src="../images/r-vector.png" width="8"/> with low error [i.e., <img alt="" class="inline" height="17" src="../images/h-vector.png" width="9"/> + <img alt="" class="inline" height="12" src="../images/r-vector.png" width="8"/> = <img alt="" class="inline" height="15" src="../images/t-vector.png" width="8"/> when (<i>h, r, t</i>) is a true fact]. The (relatively straightforward) intuition here arguably originates from the analog-style reasoning first presented and demonstrated convincingly in word-embedding papers from the NLP literature. In multi-relational data, such an analogy is expected to hold by adopting generic approaches that <span aria-label="257" id="pg_257" role="doc-pagebreak"/>can choose the appropriate patterns considering all heterogeneous relationships at the same time. The actual scoring function is given here (technically, either <i>L</i><sub>1</sub> or <i>L</i><sub>2</sub> norm could be used):</p>
<span aria-label="258" id="pg_258" role="doc-pagebreak"/>
<figure class="DIS-IMG"><a id="eq10-4"/><img alt="" class="width" src="../images/eq10-4.png"/>
</figure>
<p>The score is expected to be large if (<i>h, r, t</i>) holds; the corollary is that a large translational mismatch between the head and tail entities, with the relationship vector mediating the translation, leads to a more negative score.</p>
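<p>A minimal sketch of the TransE score in NumPy, using the <i>L</i><sub>2</sub> norm and the convention just described (scores are negated distances, so larger means more plausible):</p>

```python
import numpy as np

def transe_score(h, r, t, p=2):
    """TransE plausibility: the negated p-norm of (h + r - t), so a
    perfect translation scores 0 and mismatches score below 0."""
    return -np.linalg.norm(h + r - t, ord=p)

h = np.array([0.0, 1.0])
r = np.array([1.0, 0.0])
t_good = np.array([1.0, 1.0])   # exactly h + r
t_bad = np.array([3.0, -2.0])   # a poor translation target
```
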
<p>For training the model, consider a training set <i>T</i> of triplets (<i>h, r, t</i>) known to be true. To learn good embeddings, TransE minimizes a margin-based ranking criterion over the training set:</p>
<figure class="DIS-IMG"><a id="eq10-5"/><img alt="" class="width" src="../images/eq10-5.png"/>
</figure>
<p>where [<i>x</i>]<sub>+</sub> denotes the positive part of <i>x</i>, <i><span lang="el" xml:lang="el">γ</span> &gt;</i> 0 is a margin hyperparameter, and</p>
<figure class="DIS-IMG"><a id="eq10-6"/><img alt="" class="width" src="../images/eq10-6.png"/>
</figure>
<p>The set of corrupted triplets, constructed according to equation (<a href="chapter_10.xhtml#eq10-6">10.6</a>), is composed of training triplets with either the head or tail (but not both at the same time) replaced by a random entity. The loss function favors lower values of the energy for training triplets than for corrupted triplets, and also is a natural implementation of the intended criterion. For a given entity, the embedding vector is the same regardless of whether the entity occurs in the head or tail position of the triplet. In this sense, TransE is different (sparser in its parameterization) than some other embedding algorithms that choose to learn more than one embedding for an entity, based on whether it occurs in the head position or the tail position.</p>
<p>Bordes et al. (2013) perform the optimization using SGD in minibatch mode over the possible head and tail entities, as well as relations. Embeddings for entities and relationships are first initialized following an established random procedure. The additional constraint in the optimization is that (when using <i>L</i><sub>2</sub>) the <i>L</i><sub>2</sub> norm of the entity embeddings is 1 (no regularization or norm constraints are given to the relationship embeddings). This constraint is important compared to previous embedding-based methods because it prevents the training process from trivially minimizing <i>L</i> by artificially increasing entity embedding norms. At each main iteration of the algorithm, the embedding vectors of the entities are first normalized. The algorithm is stopped based on its performance on a validation set.</p>
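<p>The training procedure just described can be sketched end to end on a toy graph. This is an illustrative single-example SGD loop with squared <i>L</i><sub>2</sub> distance and fabricated hyperparameters, not the authors' minibatch implementation:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy KG: 4 entities, 2 relations, three true triples (h, r, t).
triples = [(0, 0, 1), (1, 1, 2), (2, 0, 3)]
n_ent, n_rel, d, gamma, lr = 4, 2, 8, 1.0, 0.01

E = rng.standard_normal((n_ent, d))  # entity embeddings
R = rng.standard_normal((n_rel, d))  # relation embeddings

def dist_sq(h, r, t):
    diff = E[h] + R[r] - E[t]
    return diff @ diff

for epoch in range(50):
    # Normalize entity embeddings at the start of each pass, as in the paper.
    E /= np.linalg.norm(E, axis=1, keepdims=True)
    for (h, r, t) in triples:
        # Corrupt either the head or the tail (never both).
        if rng.random() < 0.5:
            h2, t2 = int(rng.integers(n_ent)), t
        else:
            h2, t2 = h, int(rng.integers(n_ent))
        if (h2, r, t2) in triples:
            continue  # avoid corrupting into an actual positive triple
        if gamma + dist_sq(h, r, t) - dist_sq(h2, r, t2) > 0:
            # Margin violated: take one SGD step on the hinge loss.
            g_pos = 2 * (E[h] + R[r] - E[t])
            g_neg = 2 * (E[h2] + R[r] - E[t2])
            E[h] -= lr * g_pos
            E[t] += lr * g_pos
            R[r] -= lr * (g_pos - g_neg)
            E[h2] += lr * g_neg
            E[t2] -= lr * g_neg
```

<p>The per-epoch normalization step plays the role of the norm constraint discussed above: without it, the loss could be driven down trivially by inflating entity norms.</p>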
<p>TransE is one of the simpler and more efficient embedding algorithms, but it has flaws in dealing with 1-to-<i>N</i>, <i>N</i>-to-1, and <i>N</i>-to-<i>N</i> relations. Taking 1-to-<i>N</i> relations as an example, given such a relation <i>r</i> [i.e., ∃<i>i</i> = 1<i>,<span class="ellipsis">…</span>, p</i>, such that (<i>h, r, t</i><sub><i>i</i></sub>) are all in the positive training <span aria-label="259" id="pg_259" role="doc-pagebreak"/>KG], TransE enforces <img alt="" class="inline" height="19" src="../images/pg259-in-1.png" width="63"/> for all <i>i</i> = 1<i>,<span class="ellipsis">…</span>, p</i>, and then <img alt="" class="inline" height="19" src="../images/pg259-in-2.png" width="81"/>. The implication is that, given a 1-to-<i>N</i> relation (e.g., <i>AuthorOf</i>), TransE might learn very similar vector representations for <i>The Lord of the Rings</i>, <i>The Hobbit</i>, and <i>The Silmarillion</i>, which are all books written by <i>J. R. R. Tolkien</i>, even though they are different entities. The disadvantages for <i>N</i>-to-1 and <i>N</i>-to-<i>N</i> relations are potentially more severe.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec10-5-5"/><b>10.5.5 Other Trans* Algorithms</b></h3>
<p class="noindent">To overcome the disadvantages of TransE in dealing with 1-to-N, N-to-1, and N-to-N relations, an effective strategy is to allow an entity to have distinct representations when involved in different relations. In this way, even if the embeddings of <i>Lord of the Rings</i>, <i>Hobbit</i>, and <i>Silmarillion</i> turn out to be similar given the relation <i>AuthorOf</i>, they could still be far away, given <i>other</i> relations.</p>
<p><b>TransH</b> implements this intuition using <i>relation-specific hyperplanes</i>. As shown in <a href="chapter_10.xhtml#fig10-5">figure 10.5</a>, TransH models entities again as vectors, but each relation r as a vector r on a hyperplane with <img alt="" class="inline" height="15" src="../images/pg259-in-3.png" width="16"/> as the normal vector. Given a true triple (<i>h, r, t</i>), the entity representations <i>h</i> and <i>t</i> are first projected onto the hyperplane:</p>
<figure class="DIS-IMG"><a id="eq10-7"/><img alt="" class="width" src="../images/eq10-7.png"/>
</figure>
<p class="noindent">and similarly,</p>
<figure class="DIS-IMG"><a id="eq10-8"/><img alt="" class="width" src="../images/eq10-8.png"/>
</figure>
<p class="noindent">The projections are then assumed to be connected by <i>r</i> on the hyperplane with low error if (<i>h, r, t</i>) holds (i.e., <img alt="" class="inline" height="19" src="../images/pg259-in-4.png" width="74"/>). The scoring function is accordingly defined as</p>
<figure class="DIS-IMG"><a id="eq10-9"/><img alt="" class="width" src="../images/eq10-9.png"/>
</figure>
<p class="noindent">similar to the one used in TransE. By introducing the mechanism of projecting to relation-specific hyperplanes, TransH enables different roles of an entity in different relations.</p>
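<p>The hyperplane projection at the heart of TransH is a one-liner in NumPy. A minimal sketch (the names are ours; the normal vector is renormalized defensively, matching the unit-norm constraint introduced below):</p>

```python
import numpy as np

def transh_score(h, t, r_vec, w_r, p=2):
    """Project h and t onto the hyperplane with normal w_r, then score
    the translation by r_vec within that hyperplane (negated distance)."""
    w = w_r / np.linalg.norm(w_r)  # enforce the unit-normal constraint
    h_perp = h - (w @ h) * w
    t_perp = t - (w @ t) * w
    return -np.linalg.norm(h_perp + r_vec - t_perp, ord=p)
```

<p>For instance, with the normal vector pointing along the third axis, the projection simply discards the third component of both entity vectors before translation, so two entities may differ arbitrarily along that axis without affecting this relation's score.</p>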
<p>Training in TransH specifically proceeds as follows. First, the following loss function (similar to the margin-based function used by algorithms like TransE) is used to encourage discrimination between positive and negative (incorrect/corrupted) triples:</p>
<figure class="DIS-IMG"><a id="eq10-10"/><img alt="" class="width" src="../images/eq10-10.png"/>
</figure>
<p class="noindent">Notice the similarities between this loss equation and the one defined earlier for TransE. Once again, [<i>x</i>]<sub>+</sub> = <i>max</i>(0<i>, x</i>), <i>P</i> is the set of positive triples, <i>N</i> is the set of negative triples constructed by corrupting (<i>h, r, t</i>), and <i><span lang="el" xml:lang="el">γ</span></i> is the margin separating positive from negative triples.</p>
<p>Concerning triple corruption, note that TransH takes a more sophisticated approach compared to previous methods like TransE. Recall that in TransE, negative triples were <span aria-label="260" id="pg_260" role="doc-pagebreak"/>constructed by randomly replacing either <i>h</i> or <i>t</i> in a positive triple (but not both), according to a previously established procedure. However, as the authors of TransH note, real KGs are more complicated and incomplete, and there is always a chance that a true positive triple may accidentally get introduced this way (even though it is not in the training set). To mitigate this problem, the authors of TransH set different probabilities for replacing the head or tail entity when corrupting the triplet, which depends on the mapping property of the relation (i.e., one-to-many, many-to-one, or many-to-many). The authors give more chance to replacing the head entity if the relation is one-to-many, and more chance to replacing the tail entity if the relation is many-to-one. In this way, the chance of generating false negative labels is reduced. Specifically, among all the triplets of a relation <i>r</i>, the following two statistics are generated: the average number of tail entities per head entity (<i>t</i><sub><i>h</i></sub>), and the average number of head entities per tail entity (<i>h</i><sub><i>t</i></sub>). A Bernoulli distribution is then defined with sampling parameter <img alt="" class="inline" height="22" src="../images/pg260-in-1.png" width="25"/>; namely, given a positive triple (<i>h, r, t</i>) of relation <i>r</i>, with probability <img alt="" class="inline" height="22" src="../images/pg260-in-2.png" width="25"/>, the triple would be corrupted by replacing the head, while with probability <img alt="" class="inline" height="23" src="../images/pg260-in-3.png" width="25"/>, the triple would be corrupted by replacing the tail.</p>
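<p>This "bern" sampling scheme can be sketched directly from the two statistics just defined. The helper names below are ours:</p>

```python
import random
from collections import defaultdict

def head_replacement_probs(triples):
    """Per relation: t_h (avg tails per head), h_t (avg heads per tail),
    and the resulting probability t_h / (t_h + h_t) of replacing the head."""
    tails, heads = defaultdict(set), defaultdict(set)
    for h, r, t in triples:
        tails[(r, h)].add(t)
        heads[(r, t)].add(h)
    probs = {}
    for r in {r for _, r, _ in triples}:
        t_h = sum(len(v) for (rr, _), v in tails.items() if rr == r) / \
              sum(1 for (rr, _) in tails if rr == r)
        h_t = sum(len(v) for (rr, _), v in heads.items() if rr == r) / \
              sum(1 for (rr, _) in heads if rr == r)
        probs[r] = t_h / (t_h + h_t)
    return probs

def corrupt(triple, p_head, entities, rng=random):
    """Bernoulli ('bern') corruption: replace the head with probability
    p_head, otherwise replace the tail."""
    h, r, t = triple
    if rng.random() < p_head:
        return (rng.choice(entities), r, t)
    return (h, r, rng.choice(entities))
```

<p>For a one-to-many relation, <i>t</i><sub><i>h</i></sub> exceeds <i>h</i><sub><i>t</i></sub>, so the head is replaced more often, which is exactly the behavior described above.</p>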
<p>Just like TransE, TransH also incorporates a number of constraints when minimizing the loss function. The first constraint is a <i>scale constraint</i>:</p>
<figure class="DIS-IMG"><a id="eq10-11"/><img alt="" class="width" src="../images/eq10-11.png"/>
</figure>
<p>The second constraint is an <i>orthogonality constraint</i>:</p>
<figure class="DIS-IMG"><a id="eq10-12"/><img alt="" class="width" src="../images/eq10-12.png"/>
</figure>
<p>Finally, there is a <i>unit normality</i> constraint:</p>
<figure class="DIS-IMG"><a id="eq10-13"/><img alt="" class="width" src="../images/eq10-13.png"/>
</figure>
<p>The unit normality and orthogonality constraints were clearly not applicable to TransE, as there was no concept of a hyperplane. The orthogonality constraint guarantees that the translation vector is actually in the hyperplane. Instead of directly optimizing the loss function with constraints, TransH instead uses the following <i>unconstrained loss</i>, with <i>soft constraints</i>:</p>
<figure class="IMG"><img alt="" src="../images/pg260-1.png" width="450"/>
</figure>
<p><span aria-label="261" id="pg_261" role="doc-pagebreak"/>Here, <i>C</i> is a hyperparameter weighting the importance of the soft constraints. As in TransE and other such algorithms, SGD is used to minimize the loss function. The set of positive triples is randomly traversed multiple times. When a positive triple is visited, a negative triple is randomly constructed as described previously. After a minibatch, the gradient is computed and the model parameters are updated. Furthermore, note that the third constraint is missing from the unconstrained loss; instead, <i>w</i><sub><i>r</i></sub> is projected to a ball with unit radius before visiting each minibatch.</p>
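<p>The soft-constraint penalty can be sketched as follows. The structure follows the scale and orthogonality constraints described above; the tolerance value <code>eps</code> and the exact variable names are our assumptions:</p>

```python
import numpy as np

def transh_soft_constraints(E, W, D, eps=1e-3):
    """Soft-constraint penalty from TransH's unconstrained loss:
    a scale term  sum_e max(0, ||e||^2 - 1)  plus an orthogonality
    term  sum_r max(0, (w_r . d_r)^2 / ||d_r||^2 - eps^2),
    where rows of E are entity embeddings and rows of W (hyperplane
    normals) pair with rows of D (relation translation vectors)."""
    scale = np.sum(np.maximum(0.0, np.sum(E ** 2, axis=1) - 1.0))
    proj = np.sum(W * D, axis=1) ** 2 / np.sum(D ** 2, axis=1)
    ortho = np.sum(np.maximum(0.0, proj - eps ** 2))
    return scale + ortho
```

<p>Unit-norm entities and a translation vector orthogonal to its normal incur zero penalty; violations are penalized in proportion to their magnitude, which is what makes the constraints "soft."</p>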
<p><b>TransR</b> shares the intuition of TransH, but it introduces <i>relation-specific spaces</i> rather than hyperplanes. As we saw earlier, both TransE and TransH assume that the embeddings of entities and relations lie in the same space <span class="font"></span><sup><i>k</i></sup>. However, an entity may have multiple aspects, and various relations focus on different aspects of entities. Hence, it is intuitive that some entities are similar, and thus close to each other in the entity space, yet nonetheless different in certain specific aspects, and thus far from each other in the corresponding relation spaces. The way TransR addresses this issue is to model entities in a single entity space and relations in multiple relation-specific spaces, and to perform translation in the corresponding relation space.</p>
<p>In TransR, entities are represented as vectors in an entity space <span class="font"></span><sup><i>d</i></sup>, and each relation is associated with a specific space <span class="font"></span><sup><i>k</i></sup> and modeled as a translation vector in that space. Given a fact (<i>h, r, t</i>), TransR first projects the entity representations <img alt="" class="inline" height="17" src="../images/h-vector.png" width="9"/> and <img alt="" class="inline" height="15" src="../images/t-vector.png" width="8"/> into the space specific to relation <i>r</i>; that is,</p>
<figure class="DIS-IMG"><a id="eq10-14"/><a id="eq10-15"/><img alt="" class="width" src="../images/eq10-14-15.png"/>
</figure>
<p>Here <span class="font">&#120132;</span><sub><i>r</i></sub> <span class="font">∈ ℝ</span><sup><i>k</i>×<i>d</i></sup> is a projection matrix from the entity space to the relation space of <i>r</i>. Then, the scoring function is again defined as</p>
<figure class="DIS-IMG"><a id="eq10-16"/><img alt="" class="width" src="../images/eq10-16.png"/>
</figure>
<p>Earlier, we provided a simple illustration of TransR in <a href="chapter_10.xhtml#fig10-5">figure 10.5</a>. Although powerful in modeling complex relations, TransR introduces a projection matrix for each relation, which requires <i>O</i>(<i>dk</i>) parameters per relation. Thus, it loses some of the simplicity and efficiency that made TransE and TransH, which model relations as vectors and require only <i>O</i>(<i>d</i>) parameters per relation, so attractive. An even more complicated version of the same approach was later proposed, wherein each relation is associated with <i>two</i> matrices, one to project head entities and the other to project tail entities.</p>
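<p>The project-then-translate step of TransR can be made concrete with a minimal sketch. The dimensions and random parameters are toy values, and the usual negative-distance convention for the score (higher score for more plausible triples) is assumed:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 4, 3                       # entity space R^d, relation space R^k
h = rng.normal(size=d)            # head entity embedding
t = rng.normal(size=d)            # tail entity embedding
r = rng.normal(size=k)            # relation vector, lives in the relation space
M_r = rng.normal(size=(k, d))     # relation-specific projection: O(dk) params

# Project both entities into the relation-specific space, then translate.
h_r = M_r @ h
t_r = M_r @ t
score = -np.linalg.norm(h_r + r - t_r)   # less negative = more plausible
```

<p>The projection matrix <code>M_r</code> is exactly where the extra <i>O</i>(<i>dk</i>) parameters per relation come from, compared to the single <i>O</i>(<i>d</i>) vector per relation of TransE and TransH.</p>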
<p>The training of TransR uses the same margin-based scoring function objective as some of the earlier methods; for reference, see equation (<a href="chapter_10.xhtml#eq10-10">10.10</a>). The difference, of course, emerges in the choice of the score function <i>f</i><sub><i>r</i></sub>(<i>h, t</i>). Another more minor difference (between TransR <span aria-label="262" id="pg_262" role="doc-pagebreak"/>and TransH) arises with respect to triples corruption for the generation of negative triples, because Lin et al. (2015) consider both the previous method used in TransE (“unif”) and the Bernoulli-based method introduced first in the context of TransH (“bern”).</p>
<p>A more complicated version of TransR, called cluster-based TransR or <i>CTransR</i>, was also proposed by Lin et al. (2015), and was motivated by the fact that models like TransE, TransH, and the original TransR all learn a <i>unique</i> vector for each relation, which may be underrepresentative for fitting <i>all</i> entity pairs under this relation, given that relations are often quite diverse. To better model these relations, the authors of TransR incorporate the idea of piecewise linear regression to extend the original model. The core idea is to first segment input instances into several groups. Formally, for a specific relation <i>r</i>, all entity pairs (<i>h, t</i>) in the training data are clustered into multiple groups, and entity pairs in each group are expected to exhibit the relation <i>r</i> in a similar way. All such entity pairs are represented with their vector offsets (<img alt="" class="inline" height="17" src="../images/h-vector.png" width="9"/> − <img alt="" class="inline" height="15" src="../images/t-vector.png" width="8"/>) for clustering, with <img alt="" class="inline" height="17" src="../images/h-vector.png" width="9"/> and <img alt="" class="inline" height="15" src="../images/t-vector.png" width="8"/> obtained using TransE. Afterward, a separate relation vector <i>r</i><sub><i>c</i></sub> (and similarly, matrix <span class="font">&#120132;</span><sub><i>r</i></sub>) is learned for each relation <i>r</i> and cluster <i>c</i>. Projected vectors of entities are defined as <i>h</i><sub><i>r, c</i></sub> = <i>h</i><span class="font">&#120132;</span><sub><i>r</i></sub> and <i>t</i><sub><i>r, c</i></sub> = <i>t</i><span class="font">&#120132;</span><sub><i>r</i></sub>, and the score function is defined as</p>
<figure class="DIS-IMG"><a id="eq10-17"/><img alt="" class="width" src="../images/eq10-17.png"/>
</figure>
<p>Here, <img alt="" class="inline" height="20" src="../images/pg262-in-1.png" width="57"/> aims to ensure that a cluster-specific relation vector <i>r</i><sub><i>c</i></sub> is not too far from the original relation vector <i>r</i>, and <i><span lang="el" xml:lang="el">α</span></i> controls the effect of this constraint. Similar to TransR, CTransR enforces constraints on the embedding norms of h, r, and t and on the mapping matrices. The learning processes of both TransR and CTransR are carried out using SGD. To avoid overfitting, Lin et al. (2015) initialize entity and relation embeddings with the results of TransE rather than as random vectors. Relation matrices are initialized as identity matrices.</p>
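<p>The clustering step of CTransR can be sketched as follows. The offsets, the tiny k-means routine, and its deterministic initialization are illustrative stand-ins; a real implementation would cluster offsets taken from a pretrained TransE run:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy offsets h - t for entity pairs under one relation, drawn as two
# well-separated groups (in CTransR these come from pretrained TransE).
offsets = np.vstack([rng.normal(0.0, 0.1, size=(10, 3)),
                     rng.normal(3.0, 0.1, size=(10, 3))])

def kmeans(X, k=2, iters=20):
    """Tiny k-means, standing in for the clustering step of CTransR."""
    centers = X[[0, len(X) - 1]]     # deterministic init, fine for this sketch
    for _ in range(iters):
        dists = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    return labels

labels = kmeans(offsets)
# Each cluster c would then receive its own relation vector r_c, trained
# with the cluster-specific score function from the text.
```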
<p>All of these examples (TransE, TransH, and TransR) show that translation is a powerful idea that has taken hold in the KGE literature. Many other algorithms have been proposed along these lines that are beyond the scope of this chapter. For example, TransD simplifies TransR by further decomposing the projection matrix into a product of two vectors; specifically, by introducing additional mapping vectors <img alt="" class="inline" height="17" src="../images/pg262-in-2.png" width="75"/>, and <img alt="" class="inline" height="17" src="../images/pg262-in-3.png" width="51"/>, along with the entity/relation representations <img alt="" class="inline" height="17" src="../images/h-vector.png" width="9"/>, <img alt="" class="inline" height="15" src="../images/t-vector.png" width="8"/><span class="font"></span><sup><i>d</i></sup> and <img alt="" class="inline" height="12" src="../images/r-vector.png" width="8"/><span class="font"></span><sup><i>k</i></sup>. It is likely that this trend will continue, and that translation will remain a part of the more powerful KGE algorithms proposed each year in academic venues.</p>
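<p>TransD's decomposition of the projection matrix can be sketched as follows, using the identity-padded form of the construction (the matrix is built from a relation mapping vector and an entity mapping vector instead of being learned as a dense <i>k</i> × <i>d</i> block); the dimensions are toy values:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 4, 3

h, h_p = rng.normal(size=d), rng.normal(size=d)   # entity vector + mapping vector
r, r_p = rng.normal(size=k), rng.normal(size=k)   # relation vector + mapping vector

# The dynamic projection matrix is an outer product of the two mapping
# vectors plus an identity-like k x d matrix, so it costs O(d + k) extra
# parameters per relation instead of O(dk).
I = np.eye(k, d)
M_rh = np.outer(r_p, h_p) + I
h_perp = M_rh @ h                                  # projected head entity
```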
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec10-6"/><b>10.6Extrafactual Contexts</b></h2>
<p class="noindent">Many, if not all, of the algorithms described earlier fundamentally relied on observed facts in the KG for optimization. However, as described earlier in this book, especially in chapter 2, KGs in the real world contain much more than facts. In this section, we briefly discuss KGE techniques that further incorporate additional information besides facts. For example, <span aria-label="263" id="pg_263" role="doc-pagebreak"/>there is now a growing volume of research that embeds KGs by using entity types, relation paths, textual descriptions, and logical rules, in addition to the observed facts.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec10-6-1"/><b>10.6.1Entity Types</b></h3>
<p class="noindent">The first kind of additional information that could be considered in a KGE is <i>entity types</i>, which are concepts or semantic categories to which entities belong. More broadly, in our framework, this would be akin to considering the ontology and KG in tandem. One easy way to realize this intuition is to supplement the KG with <i>entity-type triples</i> by interpreting <i>:is_a</i> as an ordinary relation and the corresponding triples as ordinary KG facts.</p>
<p>A more sophisticated approach, called <i>semantically smooth embedding (SSE)</i>, requires entities of the same type to stay close to each other in the embedding space (e.g., the movie <i>Psycho</i> should be closer in the KGE space to another movie, <i>Avatar</i>, than to a song like “We are the World”). To accomplish this goal, SSE employs two manifold learning algorithms (Laplacian eigenmaps and locally linear embedding) to model this smoothness assumption. For example, Laplacian eigenmaps require an entity to lie close to every other entity of the same type, yielding a smoothness measure given by</p>
<figure class="DIS-IMG"><a id="eq10-18"/><img alt="" class="width" src="../images/eq10-18.png"/>
</figure>
<p>Here, <img alt="" class="inline" height="22" src="../images/pg263-in-1.png" width="20"/> is an indicator variable that is 1 if the entities represented by vectors <img alt="" class="inline" height="15" src="../images/pg263-in-2.png" width="10"/> and <img alt="" class="inline" height="18" src="../images/pg263-in-3.png" width="12"/> have the same concept type (and 0 otherwise). Both summations occur over all entities in the data set, indexed by <i>i</i> and <i>j</i>. A similar smoothness measure (say, <span class="font"></span><sub></sub>) can be devised for locally linear embedding. Together, these two terms are incorporated as regularization terms by SSE to constrain the KGE. Empirically, while SSE was found to perform better than straightforward methods, a major limitation is that the concepts are assumed to lie in a non-hierarchical ontology (essentially, as a set of tags) and an entity cannot have more than one concept type. In most KGs, including the ones that we have seen so far, as well as the KG ecosystems that we will be covering in part V, this assumption does not hold.</p>
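<p>A plausible reading of the Laplacian-eigenmaps smoothness term is a sum, over all entity pairs, of squared embedding distances weighted by the same-type indicator. The embeddings and indicator matrix below are toy values invented for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy entity embeddings; w[i, j] = 1 iff entities i and j share a type.
E = rng.normal(size=(4, 2))
w = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])

# Penalize same-type entities that sit far apart in the embedding space:
# sum_ij w_ij * ||e_i - e_j||^2.
diffs = E[:, None, :] - E[None, :, :]
smoothness = (w * (diffs ** 2).sum(-1)).sum()
```

<p>SSE adds this quantity (and an analogous one for locally linear embedding) as a regularization term, so minimizing the total loss pulls same-type entities together.</p>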
<p>Other sophisticated approaches that have been proposed since SSE include <i>type-embodied knowledge representation learning</i> (TKRL), which handles hierarchical entity categories and multiple concept labels, which is especially useful for ontologies common in the Semantic Web (SW) community (albeit less common in computer vision, or even NLP). TKRL is a translational distance model with type-specific entity projections. Given a fact (<i>h, r, t</i>), it first projects <i>h</i> and <i>t</i> with type-specific projection matrices, and then models <i>r</i> as a translation between the two projected entities. The intuition behind such translations was presented earlier in this chapter, in the context of translational embedding models. Further details can be found in Xie et al. (2016a), also cited in the “Bibliographic Notes” section.</p>
<p>Beyond incorporating type information in the embedding itself, the ontology can be used during prediction (i.e., entity types can also be used as <i>constraints</i> on head and tail positions <span aria-label="264" id="pg_264" role="doc-pagebreak"/>for different relations). To look at one example, head entities of relation <i>:movie_director_of</i> should have concept type <i>Person</i>, and tail entities should have a concept type that is a subclass of <i>CreativeWork</i>. Some systems attempt to impose such constraints in the training process, particularly during the generation of negative training examples. Negative examples that violate entity-type constraints are excluded from training, or generated with substantially lower probabilities. Similar constraints were implemented, for example, in RESCAL, a tensor factorization model, whose key idea was to discard invalid facts (with wrong entity types) and factorize only a subtensor composed of the remaining facts. Finally, note that the sample of techniques that we have briefly described here on incorporating concepts into the KGE process is not necessarily composed of mutually exclusive algorithms. There is nothing preventing us from using the ontology before the embedding (for generating nontrivial training samples as described here), during the embedding, and after the embedding (during prediction).</p>
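<p>Type-constrained negative sampling along these lines might look like the following sketch. The toy KG, the type table, and the helper names are invented for illustration:</p>

```python
import random

random.seed(0)

# Invented toy entities and their (single) types.
types = {"paul": "Person", "ridley": "Person",
         "basic_instinct": "CreativeWork", "alien": "CreativeWork"}
head_type = {"movie_director_of": "Person"}
tail_type = {"movie_director_of": "CreativeWork"}

def corrupt_with_types(triple):
    """Corrupt head or tail, drawing replacements only from the right type."""
    h, r, t = triple
    if random.random() < 0.5:
        candidates = [e for e in types if types[e] == head_type[r] and e != h]
        return (random.choice(candidates), r, t)
    candidates = [e for e in types if types[e] == tail_type[r] and e != t]
    return (h, r, random.choice(candidates))

neg = corrupt_with_types(("paul", "movie_director_of", "basic_instinct"))
```

<p>Negatives that would violate the type constraints are simply never generated, which matches the stricter of the two strategies described in the text.</p>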
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec10-6-2"/><b>10.6.2Textual Data</b></h3>
<p class="noindent">Using textual descriptions to augment KGEs is intuitive because in most KGs, there are concise descriptions for entities that contain rich semantic information about them. For example, in the DBpedia KG, description-like properties for the entity “The Terminator” include not only <i>rdfs:comment</i> and <i>dbo:abstract</i> (<i>The Terminator is a 1984 American science fiction film written and directed by James Cameron, produced by<span class="ellipsis"></span></i>), but also <i>dbp:quote</i> (<i>Casting Arnold Schwarzenegger as our Terminator, on the other hand, shouldnt have worked. The guy is supposed to be<span class="ellipsis"></span></i>) and even phraselike properties like <i>dbp:footer</i> (<i>Arnold Schwarzenegger, Linda Hamilton, and Michael Biehn played the films leads.</i>). In the opening sections of this chapter, we described how representation learning really took off in the modern era because of algorithms like word2vec and GloVe that have been applied, with excellent results on tasks like word analogy, over large quantities of text. Furthermore, for common entities like movies and singers, textual information from external text sources like news releases and Wikipedia articles could also be leveraged to learn better embeddings.</p>
<p>Given these intuitions, it is not surprising that embedding KGs with textual information dates back to at least the NTN model, where textual information is simply used to <i>initialize</i> entity representations. We noted this earlier in our description of NTN, where we saw that entity vectors were initialized by considering simple averaging of pretrained word vectors (with the words as the constitutive words in the entity). Furthermore, the word vectors were acquired from a text-embedding model pretrained over an external news corpus. We also briefly noted the limitations of this method, particularly for phrasal entities where word composition does not apply. A later, more robust method attempted to initialize entities as average word vectors of their <i>descriptions</i> rather than just their names. A key limitation, however, of all of these methods is that they model textual information independent of KG facts, and hence fail to leverage interactions between them.</p>
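<p>The NTN-style initialization described here can be sketched as follows. The toy word vectors are invented; a real system would load vectors pretrained on an external corpus (e.g., with word2vec or GloVe):</p>

```python
import numpy as np

# Pretrained word vectors (toy values for illustration).
word_vecs = {"basic": np.array([1.0, 0.0]),
             "instinct": np.array([0.0, 1.0])}

def init_entity(name, dim=2):
    """Initialize an entity vector as the average of its words' vectors."""
    vecs = [word_vecs[w] for w in name.lower().split("_") if w in word_vecs]
    if not vecs:                       # OOV entity: fall back to random init
        return np.random.default_rng(0).normal(size=dim)
    return np.mean(vecs, axis=0)

e = init_entity("Basic_Instinct")
```

<p>The fallback branch illustrates exactly the limitation noted in the text: for phrasal or out-of-vocabulary entity names, word composition gives no signal and the initialization degenerates to random.</p>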
<p><span aria-label="265" id="pg_265" role="doc-pagebreak"/>A first joint model that made better use of textual information during KGE optimization tried to align the given KG with an auxiliary text corpus, and then simultaneously conducted KGE and word embedding. Entities, relations, and words are represented in the same vector space, and operations such as inner product (similarity) between them are well defined and meaningful. Consequently, this joint model had three components or models: <i>knowledge, text</i>, and <i>alignment</i>. The knowledge model embedded entities and relations in the KG and was a variant of TransE. The text model embedded words in the text corpus and was a variant of skip-gram that was described at the beginning of this chapter. Finally, the alignment model guaranteed that the embeddings of entities, relations, and words shared the same space, using a variety of techniques, including (but not limited to) Wikipedia anchors and entity descriptions.</p>
<p>How does the joint model incorporate all of these possibly conflicting information sets, given that they each have a separate loss function? One approach is to minimize the sum of the loss functions (as a single global loss function), but other approaches, including weighted averages, are also theoretically possible. The main feature of the joint embedding models, however, is to ensure that information is utilized at the same time from both structured KGs and unstructured text. KGE and word embedding are thus enhanced and supported by each other, especially due to the forced joint optimization of their respective loss functions. Moreover, by aligning these two types of information, joint embedding enables the prediction of <i>out-of-KG</i> entities (i.e., phrases appearing in web text but not yet included in the KG). Of course, this is predicated on the textual source itself being rich enough to encompass the out-of-KG entities and vocabulary. More recent representation learning methods have even gotten around this problem, especially if the out-of-vocabulary (OOV) issue is arising due to misspellings and morphological variations, among other issues.<sup><a href="chapter_10.xhtml#fn9x10" id="fn9x10-bk">9</a></sup> Recent conceptual approaches that have been proposed for enabling these kinds of joint embeddings include the description-embodied knowledge representation learning (DKRL), as well as the text-enhanced KGE (TEKE). In the “Bibliographic Notes” section, we refer interested readers to papers on these approaches.</p>
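<p>The combination of component losses can be illustrated with a trivial sketch. The loss values and weights are placeholders; in a real system, the three component losses are functions of shared embedding parameters and are minimized jointly rather than combined as fixed numbers:</p>

```python
# Three component losses of the joint model: knowledge, text, alignment.
losses = {"knowledge": 2.0, "text": 1.5, "alignment": 0.5}

# Plain sum, as a single global objective.
global_loss = sum(losses.values())

# Weighted average, an alternative combination mentioned in the text.
weights = {"knowledge": 0.5, "text": 0.3, "alignment": 0.2}
weighted_loss = sum(weights[k] * v for k, v in losses.items())
```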
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec10-6-3"/><b>10.6.3Beyond Text and Concepts: Other Information Sets</b></h3>
<p class="noindent">While entity types (ontological or concept information), as well as external or supporting text corpora, have received maximum attention in the research community as ways of supplementing or improving the different kinds of KGE models, there are several other classes of interesting information sets that have also been found to be useful. Work on many of these is still in its relative infancy, but experimental results are very promising.</p>
<p><span aria-label="266" id="pg_266" role="doc-pagebreak"/>By way of example, one such class of information is <i>temporal information</i>. Some authors have observed that KG facts are usually time-sensitive (e.g., it may be that A and B were married in 1980, but not 1985). In communities such as the Semantic Web, such higher-order facts are expressed using a general technique called <i>reification</i>. However, if the higher-order information is temporal, we can utilize a <i>time-aware embedding model</i> to further improve the original KGE. The main idea is to impose <i>temporal order constraints</i> on time-sensitive relation pairs, classic examples being <i>:born_in</i> and <i>:died_in</i> (in contrast, the similarly named relations <i>:born_at</i> and <i>:died_at</i> are not time-sensitive). Given such a relation pair (<i>r</i><sub><i>i</i></sub><i>, r</i><sub><i>j</i></sub>), the <i>prior</i> relation should intuitively lie closer to the <i>subsequent</i> relation after a temporal transition (i.e., <img alt="" class="inline" height="18" src="../images/pg266-in-1.png" width="64"/> where <span class="font">&#120132;</span> is a transition matrix capturing the temporal order information between relations). At a high level, imposing such temporal order constraints is not very different from the entity-type constraints that we explored earlier in the context of several algorithms. Some researchers have been able to use these constraints to learn temporally consistent relation embeddings. Similarly, other researchers have tried to model the temporal evolution of KGs by modeling changes via labeled quadruples, where (<i>h, r, t, s</i>) is labeled <i>True</i> or <i>False</i>, with <i>s</i> being a timestamp; such quadruples indicate that (<i>h, r, t</i>) appears or vanishes at time <i>s</i>, respectively. Such models perform especially well in dynamic domains (e.g., KG representations of medical and sensor data). More details on this KG evolution model can be looked up in Esteban et al. (2016). Research in this area continues to flourish.</p>
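<p>The temporal order constraint can be illustrated as follows. The vectors and the transition matrix are toy values, and the exact penalty form (a norm on the gap between the transported prior relation and the subsequent relation) is an assumption consistent with the constraint described above:</p>

```python
import numpy as np

rng = np.random.default_rng(4)
k = 3

r_born = rng.normal(size=k)      # prior relation, e.g., :born_in
M = rng.normal(size=(k, k))      # temporal transition matrix

# Construct a "subsequent" relation (e.g., :died_in) that nearly satisfies
# the constraint r_i M ~ r_j, up to small noise.
r_died = r_born @ M + rng.normal(scale=0.01, size=k)

# Temporal-order penalty: small when the transported prior relation lies
# close to the subsequent relation, large when the order is violated.
penalty = np.linalg.norm(r_born @ M - r_died)
```

<p>During training, such penalties would be added to the loss for every time-sensitive relation pair, pushing the model toward temporally consistent relation embeddings.</p>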
<p>Besides temporal information, other information sets that are rapidly becoming popular for supplementing KGs and improving KGEs include relation paths (multihop, rather than single-hop, relationships between entities as a way of incorporating richer context), entity attributes, graph structural information, and even logical rules. The last is particularly exciting because it may represent a reconciliation between two paradigms (statistical versus symbolic) that had been considered incompatible (at least in practice) by many for a long time. Logical rules, which were once the staple of expert systems, and are still important in the design of precise ontologies in the Semantic Web, contain rich background information. As we saw in chapter 9, on statistical relational learning (SRL), they can also be used in probabilistic frameworks and generally yield more interpretable results than pure neural networks. There are a wide range of systems (e.g., WARMR and AMIE), furthermore, that can semiautomatically extract such rules from KGs. A central question that arises, however, is: can such rules be utilized to refine the embedding itself? A number of approaches have tried to affirm this utility. For example, in one work (called KALE), a joint model was proposed that embeds KG facts and logical rules simultaneously, not unlike the textual joint models seen earlier. There are some unique challenges that arise, unfortunately, when using logical rules, and research continues on addressing the issues. One particular problem that is common to many of the methods is that they have to instantiate universally quantified rules into ground rules before learning their models. The grounding is expensive, <span aria-label="267" id="pg_267" role="doc-pagebreak"/>both in terms of time and space, especially with a large number of entities in the KG and a complex set of rules. 
Recent research has attempted to deal with this complexity issue, but the book on the subject is far from closed.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec10-7"/><b>10.7Applications</b></h2>
<p class="noindent">KGEs can benefit a range of downstream tasks such as KG completion, relation extraction (RE), and question answering. We describe these applications in this section (with the exception of question answering, to which an entire chapter is dedicated in part IV of this book). We also note that the evaluation of a particular KGE algorithm is intimately connected with the application that the KGE algorithm will be used for. In other words, embeddings do not necessarily have any intrinsic value, and it is far from clear that one embedding is universally superior to another; in theory, it is possible for an algorithm to yield embeddings that work well for one application (compared to another algorithm), but that do not work as well for a different task. Given that, we argue that one must always keep both the application and the data set (and its incumbent assumptions) in the forefront when making claims about one KGE algorithm outperforming another.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec10-7-1"/><b>10.7.1Link Prediction</b></h3>
<p class="noindent"><i>Link prediction</i> is typically referred to as the task of predicting an entity that has a specific relation with another given entity [i.e., predicting <i>h</i> given (<i>r, t</i>) or <i>t</i> given (<i>h, r</i>), with the former notationally denoted as (?<i>, r, t</i>) and the latter as (<i>h, r,</i> ?)]. For example, a link prediction query (?<i>, DirectorOf, Basic</i>_<i>Instinct</i>) aims to predict the director of the film Basic Instinct, under the assumption that this triple is not explicitly declared in the original (training) KG, while (<i>Paul</i>_<i>Verhoeven, DirectorOf,</i> ?) tries to predict films directed by the specific person in the head entity position. For obvious reasons, this task is also sometimes called <i>entity prediction</i> or (less commonly) <i>entity ranking</i>. A similar idea can also be used to predict relations between two given entities, such as (<i>h,</i> ?<i>, t</i>), a task referred to as <i>relation prediction</i>.</p>
<p>With entity and relation embeddings learned beforehand, link prediction can be carried out using a ranking procedure. To predict the head entity in a query triple (?<i>, r, t</i>), for example, we can take every entity <i>h</i> in the KG as a candidate answer and calculate a score <i>f</i><sub><i>r</i></sub>(<i>h</i><i>, t</i>) for each (<i>h</i><i>, r, t</i>). The structured embeddings method that preceded many of the Trans* algorithms provided a kernel density function for evaluating such functions. However, if TransE is used, we can evaluate the function <img alt="" class="inline" height="21" src="../images/pg267-in-1.png" width="161"/> and rank the candidate head entities in descending order of score. Tail entity and relation prediction would work in the same way.</p>
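<p>The ranking procedure for a head query (?<i>, r, t</i>) under a TransE-style score can be sketched as follows. The embeddings are toy values, and the correct answer is planted so the demonstration has a known outcome:</p>

```python
import numpy as np

rng = np.random.default_rng(5)
n_ent, dim = 6, 4
E = rng.normal(size=(n_ent, dim))   # entity embeddings
r = rng.normal(size=dim)            # relation embedding
true_head, tail = 2, 5
E[true_head] = E[tail] - r          # plant a perfect answer for the demo

# Score every candidate head with f_r(h, t) = -||h + r - t|| and rank
# the candidates in descending order of score.
scores = -np.linalg.norm(E + r - E[tail], axis=1)
ranking = np.argsort(-scores)       # ranking[0] is the top-ranked entity
```

<p>The rank at which the true head appears in <code>ranking</code> is exactly the quantity aggregated by the evaluation metrics discussed next.</p>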
<p>How do we measure which algorithm is better? Because the problem has been framed as one of ranking, several evaluation metrics (mostly developed in the information retrieval community) are applicable, including <i>mean rank</i> (the average of predicted ranks), <i>mean <span aria-label="268" id="pg_268" role="doc-pagebreak"/>reciprocal rank</i> (the average of reciprocal ranks), <i>Hits@n</i> (the proportion of ranks no larger than <i>n</i>; <i>n</i> = 1, 5, 10 are all common choices), and AUC-PR or the area under the precision-recall curve. We detail information retrieval metrics subsequently in a chapter dedicated to reasoning and retrieval.</p>
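<p>The ranking metrics listed above are straightforward to compute from the predicted ranks of the true entities; the ranks below are hypothetical:</p>

```python
def mean_rank(ranks):
    """Average of the predicted ranks (lower is better)."""
    return sum(ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of reciprocal ranks (higher is better)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_n(ranks, n):
    """Proportion of queries whose true answer ranks no larger than n."""
    return sum(r <= n for r in ranks) / len(ranks)

# Ranks of the true entity in three hypothetical link prediction queries.
ranks = [1, 3, 10]
mr = mean_rank(ranks)
mrr = mean_reciprocal_rank(ranks)
h10 = hits_at_n(ranks, 10)
```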
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec10-7-2"/><b>10.7.2Triple Classification</b></h3>
<p class="noindent">Triple classification is the task of determining whether an unseen (in the original KG) triple fact (<i>h, r, t</i>) is true or not [e.g., a good triple classification method would assign a high score to the triple <i>(George_Lucas, DirectorOf, Star_Wars)</i> being <i>True</i>, while a triple like <i>(James_Cameron, ProducerOf, Star_Wars)</i> would get a low score]. Similar to link prediction, we can view this task as one of KG completion.</p>
<p>Once again, we can use either the kernel density function or a translational function (e.g., <img alt="" class="inline" height="21" src="../images/pg268-in-1.png" width="94"/> if using TransE) to score a triple (<i>h, r, t</i>), under the assumption that <i>h, t</i>, and <i>r</i> have all been observed in the training data set that was used to derive the KGEs, even though it is not necessary for them to have cooccurred in any triple in the training set. This is an important assumption, because in the general case, we cannot make claims about entities or relations that we have not seen at all in the training KG (in the context of other triples). Note, however, that some of the methods that used extrafactual contexts (e.g., joint embeddings that used both text corpora and KG), as well as straightforward extensions of the NTN algorithm (which used pretrained word embeddings for embedding initialization), may even be able to handle this eventuality. With these caveats in place, triple classification can be framed as binary classification by determining thresholding scores and predicting those triples to be true that have a score above the threshold. A slight modification of thresholding that works well in practice is to not have one threshold, but to introduce a threshold <i>t</i><sub><i>r</i></sub> for every relation <i>r</i>. The thresholds can be determined using a variety of well-established statistical methods. In the machine learning world, using a held-out validation set is a common mechanism.</p>
<p>To summarize, using relation thresholds, any unseen fact (<i>h, r, t</i>) containing relation <i>r</i> will be predicted as true if its score <i>f</i><sub><i>r</i></sub>(<i>h, t</i>) is higher than <i>t</i><sub><i>r</i></sub>, and false otherwise. In this way, relation-specific triple classifiers are obtained. Such classifiers are amenable to metrics such as micro- and macro-averaged accuracy from the machine learning and NLP communities, which can be calculated and used to evaluate this application. Ranking metrics such as mean average precision can also be used.</p>
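<p>Per-relation threshold selection on a held-out validation set might look like the following sketch. The scores, labels, and accuracy-sweep selection rule are illustrative assumptions:</p>

```python
def best_threshold(scored):
    """Pick t_r for one relation: sweep candidate thresholds over the
    validation scores and keep the one with the highest accuracy.
    scored: list of (score, is_true) validation triples."""
    candidates = sorted({s for s, _ in scored})
    def accuracy(th):
        return sum((s >= th) == label for s, label in scored) / len(scored)
    return max(candidates, key=accuracy)

def classify(score, t_r):
    """Predict a triple as true iff its score clears the relation threshold."""
    return score >= t_r

# Invented validation scores (negative distances) for one relation.
validation = [(-0.2, True), (-0.5, True), (-1.4, False), (-2.0, False)]
t_r = best_threshold(validation)
```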
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec10-7-3"/><b>10.7.3Entity Classification</b></h3>
<p class="noindent">Entity classification, which is a specific instance of the more general “node classification” problem that shows up in graphs and networks, is defined as the problem of classifying entities into semantic categories (e.g., <i>Paul_Verhoeven</i> is a Person, and <i>Basic_Instinct</i> a CreativeWork). While it is not strictly necessary, entity classification is usually taken to mean (as the two previous examples show) the link prediction task (<i>h, IsA,</i> ?) for an entity <i>h</i>. <span aria-label="269" id="pg_269" role="doc-pagebreak"/>Hence, similar prediction and evaluation procedures can be applied as for link prediction.</p>
<p>More complex versions of entity classification in the KG context have not been as well explored as node classification in the graph community. Considering the example given here, is it more appropriate to classify <i>Paul_Verhoeven</i> as a Person, Director, or Artist? In the general case, we want to try to predict the (possibly more than one) finest-grained semantic categories, since the other categories can be inferred from those (i.e., Person can be predicted, given a reasonable ontology, from either Artist or Director). There are not a lot of methods that have aimed to solve multiclass entity classification, or to separately evaluate long- and short-tailed semantic categories. There is still much research to be done in this space.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec10-7-4"/><b>10.7.4Revisiting Instance Matching</b></h3>
<p class="noindent">As noted earlier in this book, instance matching (IM) is a complex problem, with over 50 years of research behind it. KGEs present yet another opportunity to improve performance on IM. Once again, just like entity classification, we can frame the problem as similar to those we have seen before. Suppose the ontology contains a relation stating whether two entities are equivalent (denoted as sameAs), and that an embedding has been learned for that relation. In this case, IM degenerates to a triple classification problem [i.e., the problem now is to judge whether the triple (<i>x, sameAs, y</i>) holds (or how likely it is to hold)]. Triple scores output by an embedding model can be directly used for such prediction (see the “Triple Classification” section for details). This intuitive strategy, however, does not always work, because not all KGs encode the sameAs relation. For this reason, some authors have proposed to perform IM solely on the basis of learned entity embeddings. For example, in Nickel et al. (2011), given two entities <i>h</i><sub>1</sub> and <i>h</i><sub>2</sub>, and their vector representations <img alt="" class="inline" height="19" src="../images/pg269-in-1.png" width="13"/> and <img alt="" class="inline" height="19" src="../images/pg269-in-2.png" width="14"/>, the similarity between <i>h</i><sub>1</sub> and <i>h</i><sub>2</sub> is computed as <img alt="" class="inline" height="21" src="../images/pg269-in-3.png" width="151"/>, with the score used as the likelihood that <i>h</i><sub>1</sub> and <i>h</i><sub>2</sub> refer to the same entity. An advantage of strategies such as these is that they work even if the <i>sameAs</i> relation is not present in the input KG (this is exactly the unsupervised version of the IM problem). However, it is also important to remember that the utility of such methods is predicated on their modeling of the problem being correct.
There is no evidence that the similarity function here, for example, would extend to arbitrary domains. In some domains, the function may not be the best guide to determining whether the argument entities match. Thus, it must be used with caution, and with a validation procedure in place, just like other unsupervised (e.g., clustering-based) algorithms.</p>
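<p>An embedding-based similarity check for IM might be sketched as follows. Cosine similarity is used here as a stand-in for the specific function given in the formula above, and the entity vectors are toy values:</p>

```python
import numpy as np

def cosine_sim(e1, e2):
    """Similarity between entity embeddings; cosine is one standard choice
    (the exact function used by a given IM method may differ)."""
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

# Two near-duplicate entities versus an unrelated one (invented vectors).
h1 = np.array([1.0, 0.9, 0.1])
h2 = np.array([1.0, 1.0, 0.0])
h3 = np.array([-1.0, 0.2, 0.8])

match_score = cosine_sim(h1, h2)      # high: likely the same entity
nonmatch_score = cosine_sim(h1, h3)   # low: likely different entities
```

<p>As the text cautions, such a score should be treated as an unsupervised signal and validated per domain, not as a universally reliable match decision.</p>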
<p>Concerning evaluation, AUC-PR is the most widely adopted evaluation metric for this task, but standard precision, recall, and F-measure may also be used. The KGE community has not taken a close look at how to combine blocking-based methods with such embedding methods. Hence, several open research questions remain, and more developments in this area are likely forthcoming.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><span aria-label="270" id="pg_270" role="doc-pagebreak"/><a id="sec10-7-5"/><b>10.7.5Other Applications</b></h3>
<p class="noindent">We have noted a subset of important applications in the previous sections of this chapter, but there are several more that we did not mention. For example, one exciting application is RE, detailed in chapter 6. As we argued there, RE is an important problem in NLP and KG construction (KGC). However, an alternative view is to see RE not as a means for KGC but as an application of KGs. For example, we could use preexisting KGs in a distant supervision framework to automatically generate labeled data and help improve an RE process. But such approaches are still text-based extractors, ignoring the capability of a KG to infer new facts by itself. Weston et al. (2013), for example, proposed to combine TransE with a text-based extractor to enhance RE performance. Yet other systems have drawn inspiration from recommender systems (which have themselves leveraged KGs to improve recommender performance) in using collaborative filtering techniques to improve performance on RE. These techniques factorize an input matrix to learn vector embeddings for entity pairs, textual mentions, and KG relations. The framework has been shown to be an improvement over traditional text-based extractors. Yet other authors have used matrix completion techniques instead of matrix factorization. In recent work, tensor-based variants have also been used. Suffice it to say, there is still a lot of active research happening in this area, though advanced systems already exist.</p>
<p>One important point to note in the context of RE is that, as a KG application, it is different from the applications we considered previously in being out-of-KG. Out-of-KG applications are those that break through the boundary of the input KG and scale to broader domains. Many of the other applications we looked at, such as link prediction and entity resolution, are trying to improve the KG itself rather than scale to a broader domain. Another example of an out-of-KG application that was briefly mentioned earlier is a recommender system. Because these are out-of-KG applications, they rely on a number of areas in machine learning and NLP, and they are not exclusively married to KG technology or improvements.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec10-8"/><b>10.8Concluding Notes</b></h2>
<p class="noindent">As a research area, representation learning has come a long way, propelled by recent successes with neural networks. While word embeddings have been especially influenced by the advent of models like skip-gram and CBOW, a similar philosophy (Firths axiom) has led to improved representation learning for other data units such as documents and (nodes and edges in) KGs. This chapter mainly concerned KGEs, for which many representation learning algorithms have been developed over the last decade. Among these algorithms, the translational algorithms have become especially popular, with numerous variants and extensions proposed in the research community following the initial success of TransE. KGEs have numerous applications, including link prediction, entity classification, relation <span aria-label="271" id="pg_271" role="doc-pagebreak"/>prediction, and triple classification. More recently, KGEs have also been applied to out-of-KG applications such as recommender systems and RE.</p>
<p>While KGEs are continuing to proliferate into the mainstream research community, questions still remain about their effectiveness on graphs that exhibit both noise and sparsity. Recent work, for example, has shown that on real-world graphs, algorithms from the Trans* family may be significantly outperformed by SRL techniques like Probabilistic Soft Logic (PSL), described in chapter 9. That being said, these embeddings continue to improve each year, posting significant performance gains in standard KG application areas such as link prediction and triple classification. Because of their high utility, they have become a standard resource when addressing the basic issue of incomplete or noisy KGs.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec10-9"/><b>10.9Software and Resources</b></h2>
<p class="noindent">KGEs are a relatively novel and rapidly evolving area of research, and open-source software development has been scattered, sometimes provided only for the purposes of replicating experiments in support of a paper. Nevertheless, some valuable tools have emerged, especially in the last three or four years.</p>
<p>The OpenKE package includes some classic and effective models to support KGEs, including Trans{E,H,R,D}, RESCAL, DistMult, HolE, and ComplEx. The project also provides other resources, such as pretrained embeddings, on its website. The project page is accessible at <a href="http://139.129.163.161//index/toolkits#toolkits">http://<wbr/>139<wbr/>.129<wbr/>.163<wbr/>.161<wbr/>/<wbr/>/index<wbr/>/toolkits#toolkits</a>. A lesser-known project is <a href="https://github.com/BookmanHan/Embedding">https://<wbr/>github<wbr/>.com<wbr/>/BookmanHan<wbr/>/Embedding</a>, which supports models such as TransA, TransG, and Semantic Space Projection (SSP).</p>
<p>Another excellent and widely used resource, released by Facebook Research, is StarSpace (<a href="https://github.com/facebookresearch/StarSpace">https://<wbr/>github<wbr/>.com<wbr/>/facebookresearch<wbr/>/StarSpace</a>). StarSpace is described as a “general-purpose neural model for efficient learning of entity embeddings for solving a wide variety of problems.” These problems include KGEs, though the package can also be used for text classification, word embeddings, and information retrieval.</p>
<p>Recently, PyTorch-BigGraph (<a href="https://github.com/facebookresearch/PyTorch-BigGraph">https://<wbr/>github<wbr/>.com<wbr/>/facebookresearch<wbr/>/PyTorch<wbr/>-BigGraph</a>) was released to provide support for embedding larger-scale, graph-structured data. It also allows training of a number of models from the KGE literature, including TransE, RESCAL, DistMult, and ComplEx. Scalability is a key feature, which has been one of the concerns with OpenKE.</p>
<p>In earlier chapters, we provided resources on word embeddings. Resources for more advanced models such as BERT and RoBERTa will be provided in chapter 13, on question answering. An excellent general word-embedding package (but not the only one, by any means) that is robust to misspellings is fastText (<a href="https://github.com/facebookresearch/fastText">https://<wbr/>github<wbr/>.com<wbr/>/facebookresearch<wbr/>/fastText</a>), also released by Facebook Research. Finally, we mentioned network embeddings as an interesting line of related research that has a conceptual (if not algorithmic) connection <span aria-label="272" id="pg_272" role="doc-pagebreak"/>to KGEs. Several good packages exist; we mention DeepWalk (<a href="https://github.com/phanein/deepwalk">https://<wbr/>github<wbr/>.com<wbr/>/phanein<wbr/>/deepwalk</a>), LINE (<a href="https://github.com/tangjianpku/LINE">https://<wbr/>github<wbr/>.com<wbr/>/tangjianpku<wbr/>/LINE</a>), and node2vec (<a href="https://github.com/aditya-grover/node2vec">https://<wbr/>github<wbr/>.com<wbr/>/aditya<wbr/>-grover<wbr/>/node2vec</a>) as three of the most widely used.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec10-10"/><b>10.10Bibliographic Notes</b></h2>
<p class="noindent">KGEs are a very recent phenomenon, with the earliest study on the class of techniques covered in this chapter not appearing before the early 2010s, unlike many of the other areas we have covered thus far in this book, which have decades (and, in the case of entity resolution, more than a half-century) of research behind them, albeit not specifically attuned to KGs and their use-cases. As such, much of the material synthesized in this chapter was presented in their original form relatively recently, but in the last few years, some surveys and comprehensive overviews have also appeared. We highlight Wang et al. (2017) and Nguyen (2017) as influential in helping us organize the material in this chapter, as well as discussing the various KGEs using more uniform terminology and notation. Beyond KGEs specifically, a general survey on graph embeddings by Goyal and Ferrara (2018), as well as on KG refinement by Paulheim (2017), are also instructive, with the latter having more of an SW focus.</p>
<p>We mentioned both word and network embeddings as precursors to KGEs, in that they employ relatively similar techniques (a context-defining, objective functionbased neural model). Among many others, good sources for the former include Mikolov et al. (2013), because it provides an early coverage of the skip-gram and CBOW models described in the earlier part of the chapter; and for the latter include Grover and Leskovec (2016), Perozzi et al. (2014), and Tang, Qu, et al. (2015). New representation learning algorithms are being proposed in these fields with great frequency, and we cite these earlier works as good avenues for researchers in these areas to begin exploring, as well as to acquire deeper context for the material in this chapter.</p>
<p>It is worthwhile for interested readers to study the original papers proposing many of the KGE techniques that we succinctly described in this chapter. The structured embeddings approach was first presented by Bordes et al. (2011). Many translational models have appeared over the years, with TransE and TransH remaining some of the more popularly used (perhaps owing to their open-source availability, robustness, and ease and speed of use), and it is not possible to provide comprehensive references; good sources for TransE, TransH, TransR, TransD, TransSparse, TransM, TransF, TransA, and TransG are Bordes et al. (2013), Wang, Zhang, et al. (2014a), Lin et al. (2015), Ji et al. (2015, 2016), Fan et al. (2014), Feng et al. (2016), Xiao et al. (2015), and Xiao et al. (2016), respectively.</p>
<p>Beyond translational models, the NTN by Socher et al. (2013), the holographic embedding approach by Nickel et al. (2016), and the compositional vector space model by Neelakantan et al. (2015) are important and influential, and they have spawned variants with similar philosophies. Nontranslational models continue to be proposed, and many <span aria-label="273" id="pg_273" role="doc-pagebreak"/>new models (as we discuss toward the later sections of the chapter) have tried to embed KGs while using additional information sets or context, including externally available text corpus or pretrained word embeddings, logic, temporal features, and other exotic artifacts that could potentially lead to better embeddings. While many good references have been cited in surveys, such as Wang et al. (2017), papers that we specially cite here include Zhong et al. (2015), Wang et al. (2014b, 2016), Xie, Liu, Jia, et al. (2016), Xie, Liu, and Sun (2016), Guo et al. (2015, 2016), Wei et al. (2015), and Trivedi et al. (2017).</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec10-11"/><b>10.11Exercises</b></h2>
<ul class="numbered">
<li class="NL">1.You have an ontology with 1,000 concepts and 2,000 relationships. You are trying to embed a KG (modeled on your ontology) with 100,000 unique instances. You may assume that an instance occurs both as a head and a tail entity (in separate triples). How many parameters must TransE infer? <i>Hint: Use <a href="chapter_10.xhtml#tab10-1">table 10.1</a> for reference.</i></li>
<li class="NL">2.For this question, we will consider the data in the table on the following page, which comprises 2D embeddings of entities. As a first step, plot this data on a graph. Do you see two obvious clusters? What does each cluster represent?</li>
</ul>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB">Bill Gates</p></th>
<th class="TCH"><p class="TB">[0.1, 0.3]</p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB">Sergei Brin</p></td>
<td class="TB"><p class="TB">[0.5, 0.25]</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Larry Page</p></td>
<td class="TB"><p class="TB">[0.0, 0.0]</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Mark Zuckerberg</p></td>
<td class="TB"><p class="TB">[0.25, 0.125]</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Jack Ma</p></td>
<td class="TB"><p class="TB">[0.0, 0.5]</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Facebook</p></td>
<td class="TB"><p class="TB">[1.5, 2.5]</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Google</p></td>
<td class="TB"><p class="TB">[1.5, 1.75]</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Microsoft</p></td>
<td class="TB"><p class="TB">[1.25, 1.25]</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Alibaba</p></td>
<td class="TB"><p class="TB">[0.0, 0.5]</p></td>
</tr>
</tbody>
</table>
</figure>
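<p>For readers who want to check their plot for exercise 2, the grouping can also be computed directly. The following is a minimal two-centroid (2-means) sketch over the table above, seeded with the two mutually most distant points; no plotting library is assumed:</p>

```python
import math

# 2D entity embeddings from the table above.
emb = {
    "Bill Gates": (0.1, 0.3), "Sergei Brin": (0.5, 0.25),
    "Larry Page": (0.0, 0.0), "Mark Zuckerberg": (0.25, 0.125),
    "Jack Ma": (0.0, 0.5), "Facebook": (1.5, 2.5),
    "Google": (1.5, 1.75), "Microsoft": (1.25, 1.25),
    "Alibaba": (0.0, 0.5),
}

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Seed with the two mutually most distant points, then iterate 2-means.
names = list(emb)
seed = max(((a, b) for a in names for b in names),
           key=lambda p: dist(emb[p[0]], emb[p[1]]))
centroids = [emb[seed[0]], emb[seed[1]]]
for _ in range(10):
    clusters = [[], []]
    for name, v in emb.items():
        clusters[dist(v, centroids[0]) > dist(v, centroids[1])].append(name)
    centroids = [(sum(emb[n][0] for n in c) / len(c),
                  sum(emb[n][1] for n in c) / len(c)) for c in clusters]
print(clusters)
```

<p>One quirk worth noticing: Alibaba's vector as printed coincides with Jack Ma's, so it falls into the cluster of people rather than with the other companies.</p>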
<ul class="numbered">
<li class="NL">3.Recall the TransE objective (minimizing <img alt="" class="inline" height="19" src="../images/pg273-in-1.png" width="76"/>), and assume that the vectors in the table are fixed. Given that Gates, Zuckerberg, and Ma founded Microsoft, Facebook, and Alibaba, respectively, and that Brin and Page founded Google, what would be the optimal embedding for a founder relation (i.e., given training triples such as (<i>Google, founder, LarryPage</i>))?</li>
<li class="NL">4.Given the founder relation embedding in exercise 3, what would be the TransE loss (taken to be the sum of the expression given there, as applied to each triple in the KG.</li>
<li class="NL">5.Considering that there is only one pair of “cofounders” in this data set (Page and Brin), suppose you were asked to model a “cofounder” relationship based on this minimal amount of training data. Per the TransE objective, what would be the optimal value for this relation, assuming all other vectors in the table have to stay fixed? (Hint: <i>cofounder</i> is a symmetric relation.)</li>
<li class="NL">6.<span aria-label="274" id="pg_274" role="doc-pagebreak"/>Suppose that we decided not to model cofounders separately, but to instead derive whether A and B are cofounders in the following way. Given that A founded a company X (but <i>without</i> the knowledge that B founded X), we determine a ranked list of As possible cofounders by framing it as a tail-entity prediction problem [i.e., (<i>X, founder,</i> ?), with ? constrained via an ontology to only return <i>Person</i>-type instances]. We use translation to determine the scores of all entities (because we know that A is already founder X, we do not compute the score for A); namely, we compute the cosine similarity<sup><a href="chapter_10.xhtml#fn10x10" id="fn10x10-bk">10</a></sup> between each Person-type instance vector and <img alt="" class="inline" height="20" src="../images/pg274-in-1.png" width="89"/>, and rank the instances in descending order of scores. What would the ranked list be if X is Google and A is Sergei Brin? What if A is Larry Page? In either case, do you get the true corresponding cofounder in the top one?</li>
<li class="NL">7.List one limitation of the TransE method compared to TransH and TransR. Give one example to demonstrate your point.</li>
<li class="NL">8.Considering the previous exercise, what would be one good reason to still use TransE? Could you think of KGs or use-cases where sticking with TransE, rather than advanced versions like TransH or TransR, might prove to be prudent?</li>
<li class="NL">9.This, as well as the next exercise, are both based on running an implementation of TransE. There are several such implementations available, as we highlight in the “Software and Resources” section. They usually come packaged with the Freebase and WordNet data sets (including train/test splits) that are often used as benchmarks in this space. We will only consider WordNet for these exercises.<br/>First, run a standard implementation of TransE on WordNet and compare some of the metrics to those reported by Bordes et al. (2013). Are there any differences that look significant? What might explain them?</li>
<li class="NL1">10.Now, we will introduce noise and sparsity into the training set by randomly picking two triples and exchanging their relations. Given <i>N</i>-triples in the WordNet training data set, we consider the following two experiments:</li>
</ul>
<p class="AL">(a)<b>Noise injection:</b> For each triple <i>t</i>, we sample a constant number of <i>p</i> triples from the remainder set, where that set is the subset of the <i>N</i> 1 triples that do not have the same head and tail entities as <i>t</i>. For each of the <i>p</i> triples, we create two new (noisy) triples by exchanging the relations between <i>t</i> and that triple, and then inject the two noisy triples into the training set. Note that the new training set thus created is a superset of the original training set, as we are not removing any of the original triples.</p>
<p class="AL">(b)<span aria-label="275" id="pg_275" role="doc-pagebreak"/><b>Deletion:</b> Randomly delete a fraction <i>q</i> of triples from the training set. If the deletion results in the complete elimination of a head/tail entity or relation from the training set, add one triple back to the training set (from the discarded set), such that every entity and relation in the test set is represented at least once in the training set.<sup><a href="chapter_10.xhtml#fn11x10" id="fn11x10-bk">11</a></sup></p>
<p class="myenumitem">For different values of <i>p</i> = {1, 10, 50, 100} and <i>q</i> = {0.01, 0.1, 0.4} how does the performance of the algorithm on the test set change?<sup><a href="chapter_10.xhtml#fn12x10" id="fn12x10-bk">12</a></sup> What general statements can you make about the robustness of TransE? What happens if we consider the most extreme change, where we first perform deletion with <i>q</i> = 0.4, and (on the remaining training set) noise injection with <i>p</i> = 100. Are the results even better than random?</p>
<ul class="numbered">
<li class="NL1">11.If you used TransH rather than TransE, would your conclusions change? Is TransH necessarily better than TransE in dealing with noise?</li>
</ul>
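<p>Several of the exercises above (3 through 6) reduce to small vector computations, which can be sketched as follows. We assume the squared variant of the objective, so that the optimal relation vector is the arithmetic mean of the differences <i>t</i> <i>h</i> over the training triples (under the unsquared norm, the minimizer would instead be a geometric median); also note that Larry Page's embedding is the zero vector, for which cosine similarity is undefined, so the sketch scores it as 0:</p>

```python
import math

# Entity embeddings from the table in exercise 2.
emb = {
    "Bill Gates": (0.1, 0.3), "Sergei Brin": (0.5, 0.25),
    "Larry Page": (0.0, 0.0), "Mark Zuckerberg": (0.25, 0.125),
    "Jack Ma": (0.0, 0.5), "Facebook": (1.5, 2.5),
    "Google": (1.5, 1.75), "Microsoft": (1.25, 1.25),
    "Alibaba": (0.0, 0.5),
}
people = ["Bill Gates", "Sergei Brin", "Larry Page", "Mark Zuckerberg", "Jack Ma"]
founder_triples = [("Microsoft", "Bill Gates"), ("Facebook", "Mark Zuckerberg"),
                   ("Alibaba", "Jack Ma"), ("Google", "Sergei Brin"),
                   ("Google", "Larry Page")]  # (head, tail) pairs for "founder"

# The minimizer of sum ||h + r - t||^2 is the mean of (t - h).
diffs = [(emb[t][0] - emb[h][0], emb[t][1] - emb[h][1])
         for h, t in founder_triples]
r = (sum(d[0] for d in diffs) / len(diffs),
     sum(d[1] for d in diffs) / len(diffs))

def cosine(a, b):
    na, nb = math.hypot(*a), math.hypot(*b)
    # Score an undefined cosine (zero vector) as 0.
    return 0.0 if na == 0.0 or nb == 0.0 else (a[0] * b[0] + a[1] * b[1]) / (na * nb)

def rank_cofounders(company, known_founder):
    """Exercise 6: rank the remaining Person entities by cosine(person, company + r)."""
    target = (emb[company][0] + r[0], emb[company][1] + r[1])
    candidates = [p for p in people if p != known_founder]
    return sorted(candidates, key=lambda p: -cosine(emb[p], target))

print(r)                                        # ≈ (-0.98, -1.315)
print(rank_cofounders("Google", "Sergei Brin"))
print(rank_cofounders("Google", "Larry Page"))
```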
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_10.xhtml#fn1x10-bk" id="fn1x10">1</a></sup>For the lay reader who is not very familiar with neural networks, some of this material specifying how the layers of the network work, or are structured, can be skimmed or skipped. A full discussion of neural networks is not within the scope of this work; there are various surveys and texts that can instead be perused by the interested reader or engineer looking to build or modify these networks.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_10.xhtml#fn2x10-bk" id="fn2x10">2</a></sup>In NLP, perplexity is a way of evaluating language models, and is dependent on the probability distribution of the words.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_10.xhtml#fn3x10-bk" id="fn3x10">3</a></sup>More precisely, this means <i>d</i> parameters for every entity that occurs at least once in a head position [in the triple (<i>h, r, t</i>), the entity <i>h</i> is said to occur in the head position, and <i>t</i> in the tail position], and <i>d</i> for every entity that occurs at least once in a tail position. Generally, an entity occurs in both positions; hence, 2<i>d</i> is an appropriate bound for the number of parameters learned per entity in the KG.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_10.xhtml#fn4x10-bk" id="fn4x10">4</a></sup>We use the notation employed by Bordes et al. (2011) for this discussion.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_10.xhtml#fn5x10-bk" id="fn5x10">5</a></sup>Source: https://<wbr/>www<wbr/>.cloudave<wbr/>.com<wbr/>/140<wbr/>/google<wbr/>-buys<wbr/>-freebase<wbr/>-this<wbr/>-is<wbr/>-huge<wbr/>/.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_10.xhtml#fn6x10-bk" id="fn6x10">6</a></sup>Actually, Freebase was not purely crowdsourced, but rather composed in a hybrid way. Much of the information in it was crowdsourced, however, either directly or indirectly.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_10.xhtml#fn7x10-bk" id="fn7x10">7</a></sup>This section is advanced and may be skipped by those unfamiliar with tensors.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_10.xhtml#fn8x10-bk" id="fn8x10">8</a></sup>A triple in this notation is represented as (<i>e</i><sub>1</sub><i>, R, e</i><sub>2</sub>).</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_10.xhtml#fn9x10-bk" id="fn9x10">9</a></sup>We detail some popular packages in the section entitled “Software and Resources,” later in this chapter, which are particularly adept at dealing with the OOV problem.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_10.xhtml#fn10x10-bk" id="fn10x10">10</a></sup>This is not the only way to compute similarity scores for the purposes of ranking or prediction, but in this exercise, we assume that it suffices.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_10.xhtml#fn11x10-bk" id="fn11x10">11</a></sup>This is an important step, since otherwise, we will end up with entities and/or relations in the test set that we have never observed in the test set. In many implementations, this throws an error during test time.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_10.xhtml#fn12x10-bk" id="fn12x10">12</a></sup>This would require a total of seven experiments, because you will be introducing only one change in each experiment compared to the original training set.</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html>