<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch9" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch9"><span aria-label="221" id="pg_221" role="doc-pagebreak"/>9</h1>
<h1 class="chapter-title"><b>Statistical Relational Learning</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b> In addition to the explicitly modeled relationships, knowledge graphs (KGs) also contain many relational dependencies that violate the independent and identically distributed (i.i.d.) assumption common in machine learning. Statistical relational learning (SRL) frameworks like Probabilistic Soft Logic (PSL) and Markov Logic Networks (MLNs) can help model (and infer over) such dependencies by combining powerful elements from both logic and probability theory. Although SRL is an entire field of research in its own right, some elements are particularly relevant to the study of KGs. We introduce SRL in this chapter, with a strong focus on frameworks that have been successfully applied to KGs and the KG identification problem.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec9-1"/><b>9.1 Introduction</b></h2>
<p class="noindent">Working with KGs requires a systematic treatment of both uncertainty and relational information. Uncertainty can arise from several sources, some of which we have already explored in this book. For example, when extracting named entities, relationships, or events from raw data during the KG construction (KGC) phase, uncertainty arises due to the probabilistic nature of these algorithms. In general, most machine learning algorithms that we have seen, both during KGC and in steps like instance matching, output scores or probabilities rather than hard labels. To convert these probabilities to hard labels, we have to use techniques like thresholding that either require clever guesswork (or experience) or, if we are lucky, a validation set that helps us systematically optimize the threshold to yield good results on an accuracy metric like the F-measure.</p>
<p>One of the intriguing questions that immediately arises in the context of this discussion is: what if we don’t discard the uncertainty output by these algorithms, but somehow use it to build a more refined KG (i.e., as a <i>feature</i> or source of information in itself)? Key to the success of any approach that manages this difficult task is to use <i>collective</i> information being output by such algorithms. For example, an extraction algorithm might recognize an instance of a named entity with probability 0.6, but other instances of the same named entity (say, from other documents or sentences) may be output with higher probability. In the instance matching (IM) step, we might find that the controversial instance is being <span aria-label="222" id="pg_222" role="doc-pagebreak"/>matched to other equivalent instances with similarly ambiguous probability. Looking at the two pieces of evidence together (ambiguity in extraction, and ambiguity in resolution), we may then conclude that the instance should not have been extracted in the first place or, at the very least, that it deserves more manual scrutiny than other instances. If we err on the side of precision, as many practitioners do, we would want to discard the instance rather than integrate it into the complete KG.</p>
<p>This example illustrates a relatively clear-cut case (namely, a sequence of ambiguous probabilities, which ultimately leads us to reject or manually label the wayward instance). Less interesting is the scenario where we have a sequence of low (or high) probabilities because these are clear rejects or accepts, respectively. However, the situation becomes much more interesting when we consider that some algorithms are more uncertain than others; furthermore, what one algorithm finds difficult may not yield uncertainty for another algorithm. This can happen within tasks [like Named Entity Recognition (NER)] or across tasks. The former problem is well studied in machine learning and is often a central focus of ensemble approaches. The latter problem is a more novel issue that has received attention relatively late in the history of machine learning, and it will be the main focus of this chapter. To generalize the problem further, imagine a range of algorithms that operate on a data set, with each algorithm outputting its own probabilities. Every entity and relation in our KG has an associated set of algorithmic traces and probabilities (with the origins of the trace going all the way back to the raw data, and possibly domain discovery itself). There is, thus, a lot of collective information that could potentially be used to “derive” a better KG than the original KG that has been handed down to us as a roughly sequential series of thresholded algorithmic outputs.</p>
<p>In the general case, we cannot do much about a mixed bag of probabilities about which we know nothing or can make no assumptions. But almost always, this is not the case. We know in what order the algorithms are run and what they are doing, in addition to a whole host of relevant details; for example, we know that information extraction (IE) precedes instance matching; we may even have a suspicion that the quality of instance matching is higher than the quality of our extractions. In fact, in many domains, we know a whole lot more, embodied by a body of useful knowledge that goes under the broad term of “domain expertise.” The question then becomes: How do we use domain expertise and algorithmic uncertainties to yield a more complete and less noisy KG than would be yielded by executing a series of individual KG construction and instance matching algorithms? As a framework, SRL is one of the best-known paradigms that can help us formally conceive, and come up with viable solutions to, this problem. Modern state-of-the-art SRL frameworks like PSL allow modelers to robustly incorporate broad sets of domain rules, including domain-specific similarity functions (at the levels of individuals and sets), different types of relations (which is important for KGs), and differing levels of importance (using real numbers) for different domain rules. We provide some background on how <span aria-label="223" id="pg_223" role="doc-pagebreak"/>these models are realized in the next section. In the second half of the chapter, after providing essential background on one of the earliest SRL frameworks (MLNs), we introduce PSL and its specific application to the KG completion problem.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec9-2"/><b>9.2 Modeling Dependencies</b></h2>
<p class="noindent">One of the most important sources of domain expertise arises in modeling relational dependencies, and the strengths of such dependencies, in the selected domain. For example, let us consider the standard example of a social network, which typically has a simpler ontological structure than a full-fledged KG. Let us limit our ontology to the <i>Person</i> concept, and to a simple set of mnemonically specified relations such as <i>friendOf</i>, <i>spouseOf</i>, and <i>votesFor</i>. Note that only the first of these three relations is symmetric. Even in this domain, which is conceptually simple, a whole host of relational dependencies could be specified by largely building on knowledge acquired through social science research. For example, we may want to encode the rule, or <i>dependency</i>, that spouses largely vote for the same candidate. A weaker version of the rule is that friends often vote for the same candidate. Our research (or intuition) suggests that the spouse rule should have more importance than the friends rule, because the former holds more often in the real world. However, in both cases, we do not want to treat the rules as deterministic; there is always <i>some</i> chance that the rule is violated.<sup><a href="chapter_9.xhtml#fn1x9" id="fn1x9-bk">1</a></sup></p>
<p>Informally, this information was easy to specify in plain English, but the beauty of SRLs is that they give us the mathematical machinery to formalize these notions. For most rules (and the only kind considered in this chapter), a simple form of first-order logic, namely with conjunctive bodies and single literal (nonnegative atoms) heads, is enough. For example, some rules for the social network domain described informally here are expressed next using this kind of conjunctive structure:</p>
<ul class="numbered">
<li class="NL">1. 0.8: <i>Spouse(A, B)</i> ∧ <i>VotesFor(A, P) → VotesFor(B, P)</i></li>
<li class="NL">2. 0.1: <i>Neighbors(A, B)</i> ∧ <i>VotesFor(A, P) → VotesFor(B, P)</i></li>
<li class="NL">3. 0.4: <i>Friends(A, B)</i> ∧ <i>VotesFor(A, P) → VotesFor(B, P)</i></li>
</ul>
<p>The uppercase arguments to the predicates (e.g., <i>A, B</i>) are logical variables that must be <i>grounded</i> by substituting concrete persons for them. The predicates can be constrained in practice in terms of which grounded objects they accept. For example, the <i>VotesFor</i> predicate can be constrained to only allow groundings (for <i>A</i>) from a set of persons, and groundings (for <i>P</i>) from a set of political parties. The weights indicate the importance of the rules, with higher weight indicating more importance. For example, the first rule, which has a weight of 0.8, is more important in this context than the third rule, which has <span aria-label="224" id="pg_224" role="doc-pagebreak"/>a weight of 0.4. In other words, it is much more likely for spouses to align on political voting preferences than friends. There is an even lower likelihood (though still nonzero) for neighbors to align on these preferences.</p>
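<p>To make the notion of grounding concrete, the following sketch (a minimal illustration in Python; the persons, parties, and facts are all hypothetical) enumerates the groundings of the first rule above and reports which ones are satisfied:</p>

```python
from itertools import product

# Hypothetical evidence; truth is binary here purely for illustration.
persons = ["alice", "bob", "carol"]
parties = ["reform", "labor"]
spouse = {("alice", "bob"), ("bob", "alice")}
votes_for = {("alice", "reform"), ("bob", "labor"), ("carol", "labor")}

# Rule 1 (weight 0.8): Spouse(A, B) AND VotesFor(A, P) -> VotesFor(B, P)
weight = 0.8
groundings = []
for a, b, p in product(persons, persons, parties):
    body = (a, b) in spouse and (a, p) in votes_for
    if body:  # the rule only constrains groundings whose body holds
        head = (b, p) in votes_for
        groundings.append(((a, b, p), head))

for args, satisfied in groundings:
    print(args, "satisfied" if satisfied else "violated (weight %.1f)" % weight)
```

<p>Each violated grounding lowers the score of the corresponding possible world in proportion to the rule's weight, a notion that is formalized later in this chapter.</p>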
<p>Note that, although the mathematics may not always be easy to explain to a domain expert who is not versed in these frameworks, the rules themselves are fairly easy to elicit. There has even been work to extract such rules automatically (from data, documents, or even scientific literature) or to learn the structure of the rules from data, given a fixed set of predicates like <i>Friends</i> and <i>Neighbors</i>. Of course, we must always bear in mind that the situation quickly becomes more complicated as the number of concepts and relations increases, as is the case with real-world KGs.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec9-3"/><b>9.3 Statistical Relational Learning Frameworks</b></h2>
<p class="noindent">As we saw earlier, modeling relational dependencies is an important aspect of many modern application domains. At the same time, such domains are also characterized by uncertainty (e.g., considering the example in the previous section, it is clear that we are not, or should not be, equally confident about every rule). Statistical learning is mainly focused on uncertainty and probability, while relational learning is focused on modeling relational dependencies. SRL is a powerful, relatively novel framework that attempts to combine the benefits of both lines of work. Key SRL tasks include collective classification, link prediction, and link-based clustering, all of which are described later in this chapter. Because of these applications, as well as increasing maturity in the individual research areas of relational learning and statistical learning, SRL has emerged as a feasible research agenda.</p>
<p>In early work on the subject, several critical requirements were identified for a unifying framework (one that unifies statistical and relational models). Such a framework must do all of the following:</p>
<ul class="numbered">
<li class="NL">1. Incorporate both first-order logic and probabilistic graphical models.</li>
<li class="NL">2. Permit simple representation of canonical SRL problems such as link prediction and collective classification.</li>
<li class="NL">3. Permit viable mechanisms for using and representing <i>domain knowledge</i> in the SRL spirit.</li>
<li class="NL">4. Allow extension and adaptation of techniques from statistical learning, probabilistic (and also logical) inference, and inductive logic programming.</li>
</ul>
<p>Note that some of these requirements are functional, while others were deemed to be more pragmatic. For example, the third desideratum is necessary because the search space for SRL algorithms is very large even by artificial intelligence (AI) standards, and domain knowledge can be argued to be critical to success. Additionally, the ability to incorporate rich domain knowledge is an attractive feature of SRL, as we shall see when studying both MLNs and PSL.</p>
<section epub:type="division">
<h3 class="head b-head"><span aria-label="225" id="pg_225" role="doc-pagebreak"/><a id="sec9-3-1"/><b>9.3.1 Markov Logic Networks</b></h3>
<p class="noindent">Markov Logic was proposed as a framework that was capable of meeting all desiderata mentioned in the previous section. However, before delving into Markov Logic, we start with a simpler concept (namely, a <i>Markov Network</i>, also called <i>Markov Random Field</i>). Markov Networks were proposed as a model for the joint distribution of a set of variables <i>X</i> = (<i>X</i><sub>1</sub><i>, <span class="ellipsis">…</span>, X</i><sub><i>n</i></sub>) ∈<span class="font">𝒳</span>. Visually (see <a href="chapter_9.xhtml#fig9-1" id="rfig9-1">figure 9.1</a>), we may think of this network as an undirected graph <i>G</i> and a set of potential functions <i><span lang="el" xml:lang="el">ϕ</span></i><sub><i>k</i></sub>. For each variable, there is a corresponding node in <i>G</i>, and for each <i>clique</i> in the graph, there is a potential function. A potential function is a nonnegative real-valued function that characterizes the state of its corresponding clique. The joint distribution represented by a Markov Network is given by the following equation:</p>
<figure class="DIS-IMG"><a id="eq9-1"/><img alt="" class="width" src="../images/eq9-1.png"/>
</figure>
<div class="figure">
<figure class="IMG"><a id="fig9-1"/><img alt="" src="../images/Figure9-1.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig9-1">Figure 9.1</a>:</span> <span class="FIG">A visual representation of a Markov Network as an undirected graph (above) and a factor graph (below).</span></p></figcaption>
</figure>
</div>
<p>Here, <i>x</i><sub>{<i>k</i>}</sub> is the state of the <i>k</i>th clique (or equivalently, the state of the variables that are in the clique). <i>Z</i> is known as the <i>partition function</i>, and it is given by the expression <span lang="el" xml:lang="el">Σ</span><sub><i>x</i>∈<i>X</i></sub><span lang="el" xml:lang="el">Π</span><sub><i>k</i></sub><i><span lang="el" xml:lang="el">ϕ</span></i><sub><i>k</i></sub>(<i>x</i><sub>{<i>k</i>}</sub>). A Markov Network can also be conveniently represented as a log-linear model, where each clique potential is replaced by an exponentiated weighted sum of features of the state:</p>
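<p>To make the equation concrete, the following sketch (Python, with an arbitrary illustrative potential table) computes the joint distribution of two binary variables forming a single clique, normalizing by the partition function <i>Z</i>:</p>

```python
from itertools import product

# One clique {X1, X2}; the potential values are arbitrary nonnegative
# numbers chosen for illustration (higher value = more compatible state).
def phi(x1, x2):
    table = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}
    return table[(x1, x2)]

states = list(product([0, 1], repeat=2))
Z = sum(phi(*s) for s in states)          # partition function
P = {s: phi(*s) / Z for s in states}      # normalized joint distribution

print(P)  # agreeing states get probability 0.4, disagreeing ones 0.1
```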
<span aria-label="226" id="pg_226" role="doc-pagebreak"/>
<figure class="DIS-IMG"><a id="eq9-2"/><img alt="" class="width" src="../images/eq9-2.png"/>
</figure>
<p>Here, a feature may be any real-valued function of the state. It is common to focus on binary features <i>f</i><sub><i>j</i></sub>(<i>x</i>) ∈{0, 1}, with the most direct translation from the potential-form function being that there is one feature corresponding to each possible state <i>x</i><sub>{<i>k</i>}</sub> of the clique, with the weight being <i>log<span lang="el" xml:lang="el">ϕ</span></i><sub><i>k</i></sub>(<i>x</i><sub>{<i>k</i>}</sub>). Unfortunately, this full representation is exponential in the size of the cliques; in practice, a much smaller number of features (e.g., logical functions of the state of the clique) can be used, allowing for more compact representations (particularly in the presence of large cliques in the graphical model) than the potential-function form. MLNs take advantage of this, as we describe later in this chapter.</p>
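<p>The equivalence between the two parameterizations can be checked directly: with one binary indicator feature per clique state and weight log <i><span lang="el" xml:lang="el">ϕ</span></i><sub><i>k</i></sub>(<i>x</i><sub>{<i>k</i>}</sub>), the log-linear model reproduces the potential-form distribution exactly. A sketch with arbitrary illustrative numbers:</p>

```python
import math
from itertools import product

# Arbitrary illustrative clique potential over two binary variables.
table = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}
states = list(product([0, 1], repeat=2))

# Potential form: P(x) = phi(x) / Z
Z_pot = sum(table[s] for s in states)
p_pot = {s: table[s] / Z_pot for s in states}

# Log-linear form: one binary indicator feature per clique state,
# with weight log phi; exactly one indicator fires for any state.
weights = {s: math.log(table[s]) for s in states}
def score(x):
    return sum(w for s, w in weights.items() if s == x)

Z_log = sum(math.exp(score(s)) for s in states)
p_log = {s: math.exp(score(s)) / Z_log for s in states}

assert all(abs(p_pot[s] - p_log[s]) < 1e-12 for s in states)
```

<p>In practice, far fewer features than clique states are used; the exhaustive indicator construction above is only the worst-case translation mentioned in the text.</p>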
<p>How can inference be done in Markov Networks? Unfortunately, it has been shown that exact inference in such networks is #P-complete, and one has to resort to approximate inference techniques, such as Markov Chain Monte Carlo (MCMC) and, in particular, Gibbs sampling, which samples each variable in turn, given its Markov blanket.<sup><a href="chapter_9.xhtml#fn2x9" id="fn2x9-bk">2</a></sup></p>
<p>To compute marginal probabilities, a Gibbs sampler is run with the conditioning variables clamped to their given values. A second popular method for inference in Markov Networks is belief propagation. Maximum Likelihood and Maximum a Posteriori estimates of a Markov Network’s weights cannot be computed in closed form, but the log-likelihood is a concave function of the weights, and its maximum can be found efficiently using several established optimization methods, including quasi-Newton algorithms. Other alternatives include iterative scaling. If features are not specified, but data is available, features can be learned from the data by greedily constructing conjunctions of atomic features.</p>
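<p>The clamping of conditioning variables can be illustrated on a deliberately tiny network (hypothetical numbers; with a single free variable the Gibbs conditional is exact, but the clamping and per-variable resampling steps are the same in larger networks):</p>

```python
import random

random.seed(0)

# Pairwise potential favoring agreement between two binary variables.
def phi(x1, x2):
    return 4.0 if x1 == x2 else 1.0

# Estimate P(X1 = 1 | X2 = 1): clamp X2, then repeatedly resample X1
# from its conditional given its Markov blanket ({X2} here).
x2 = 1
counts = [0, 0]
for _ in range(20000):
    p1 = phi(1, x2) / (phi(0, x2) + phi(1, x2))
    x1 = 1 if random.random() < p1 else 0
    counts[x1] += 1

estimate = counts[1] / sum(counts)
print(estimate)  # close to the exact value 4 / (4 + 1) = 0.8
```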
<p>Syntactically, rules in a Markov Logic framework are indistinguishable from <i>first-order logic</i> formulas, except that each formula has a weight attached. Semantically, a set of Markov Logic formulas represents a probability distribution over possible worlds, where each world’s probability takes a specific form, as described later in this chapter. First, however, we introduce some key elements of first-order logic that are important for putting the rest of this section in its proper context.</p>
<p class="TNI-H3"><b>9.3.1.1 First-Order Logic</b> A <i>formula</i> in first-order logic is constructed using four types of <i>symbols</i> (namely, constants, variables, functions, and predicates). Constant symbols represent objects in the domain of interest (e.g., if the domain of interest is <i>cities</i>, constants could be <i>Los Angeles, Tokyo</i>, and <i>Shanghai</i>). Variable symbols range over the objects in the domain. Function symbols (e.g., <i>locatedInCountry</i>) represent mappings from tuples <span aria-label="227" id="pg_227" role="doc-pagebreak"/>of objects to objects. Predicate symbols represent relations among objects in the domain (e.g., <i>SisterCity</i>) or attributes of objects (e.g., <i>HighlyPolluted</i>).</p>
<p>An <i>interpretation</i> indicates which symbols represent what objects, functions, and relations in the domain. Note that variables and constants may be <i>typed</i>, in which case variables range only over objects of the corresponding type and constants can represent only objects of the corresponding type. For example, a typed variable <i>x</i> (with type <i>City</i>) is constrained to range only over cities.</p>
<p>A <i>term</i> is any expression representing an object in the domain, and it can be a constant, variable, or function applied to a tuple of terms (in this sense, the definition of a term is recursive). For example, <i>Tokyo</i>, <i>y</i>, and LeastCommonMultiple(<i>x</i>, <i>y</i>) are all terms. An atomic formula or atom is a predicate symbol applied to a tuple of terms, an example of which is <i>SisterCity(x, CapitalOf(Japan))</i>. Formulas are recursively constructed from atomic formulas using logical connectives and quantifiers. The familiar logical rules of composition apply. That is, if <i>F</i> and <i>G</i> are formulas, so are the following:</p>
<ul class="numbered">
<li class="NL">1. Negation: ¬<i>F</i></li>
<li class="NL">2. Conjunction: <i>F</i> ∧ <i>G</i></li>
<li class="NL">3. Disjunction: <i>F</i> ∨ <i>G</i></li>
<li class="NL">4. Implication: <i>F</i> ⇒ <i>G</i></li>
<li class="NL">5. Equivalence: <i>F</i> ⇔ <i>G</i></li>
<li class="NL">6. Quantification (universal): ∀<i>xF</i><sub>1</sub> (true if <i>F</i><sub>1</sub> is true for every object <i>x</i> in the domain)</li>
<li class="NL">7. Quantification (existential): ∃<i>xF</i><sub>1</sub> (true if at least one object <i>x</i> exists such that <i>F</i><sub>1</sub> is true)</li>
</ul>
<p class="noindent">As with ordinary mathematical expressions, parentheses are typically used to signal precedence. A positive literal is defined as an atomic formula, while a negative literal is a negated atomic formula. Interestingly, we can define a KG (or any set of triples) as a single large formula that is a conjunction of several formulas (with a triple intuitively representing a single formula). This is also true for a <i>knowledge base</i>, or KB (a set of sentences or formulas in first-order logic), a more general model of data than a KG (which deliberately imposes a graph-theoretic interpretation on its triples-set). Although much of what follows applies to KBs in general, we assume in the rest of this section that we are dealing with KGs rather than first-order KBs.</p>
<p>A final piece of machinery that is necessary is that of a <i>ground term</i>, which is a term containing no variables. A ground atom or ground predicate is an atomic formula whose arguments are all ground terms. A possible world (also called a <i>Herbrand interpretation</i>) assigns a truth value to each possible ground predicate. These possible worlds express the semantics (what it <i>means</i> to be true) of predicates in first-order logic, whereas the construction rules noted here (such as conjunction and implication) express the syntax. We <span aria-label="228" id="pg_228" role="doc-pagebreak"/>say that a formula <i>F</i> is satisfiable if and only if there exists at least one world in which it is true.</p>
<p>We now have the machinery to define the problem of inference in first-order logic in the context of a KG—namely, the basic inference problem in first-order logic is to determine whether a KG <i>entails</i> a formula <i>F</i>; that is, whether <i>F</i> is true in <i>all</i> worlds where the KG is true (represented symbolically by the expression <i>KG</i> <span class="font">⊧</span> <i>F</i>). Another way of putting it (called <i>refutation</i>) is that <i>KG</i> <span class="font">⊧</span> <i>F</i> if and only if <i>KG</i> ∪ {¬<i>F</i>} is unsatisfiable. Unfortunately, this makes a strict interpretation of the KG nonrobust because if a KG contains even a single contradiction, all formulas trivially follow from it. For automated inferential tasks, it is convenient to convert formulas to conjunctive normal form (CNF), or clausal form. As the name suggests, a CNF is nothing but a conjunction of clauses, where a clause is a disjunction of literals. A KG can be represented by a single CNF. Furthermore, every KG in first-order logic can be mechanically converted to clausal form. The advantage of clausal form is that it can be used in <i>resolution</i>, which is both a sound and refutation-complete inference procedure for first-order logic.</p>
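<p>For a small propositional fragment, this definition of entailment can be checked by brute force, enumerating every possible world (a toy sketch with hypothetical atoms; real systems use procedures such as resolution instead):</p>

```python
from itertools import product

atoms = ["Spouse_ab", "Smokes_a", "Smokes_b"]

def kb(w):
    # Hard knowledge: Spouse(a,b) holds, and spouses agree on smoking.
    return w["Spouse_ab"] and (w["Smokes_a"] == w["Smokes_b"])

def entails(kb, f):
    # KB |= F iff F is true in every world where the KB is true.
    worlds = [dict(zip(atoms, vals))
              for vals in product([False, True], repeat=len(atoms))]
    return all(f(w) for w in worlds if kb(w))

print(entails(kb, lambda w: w["Smokes_a"] == w["Smokes_b"]))  # True
print(entails(kb, lambda w: w["Smokes_a"]))                   # False
```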
<p>Theoretically, inference in first-order logic is semidecidable, and for this reason, KGs are constructed using restricted subsets of first-order logic that have desirable properties (such as decidability), the most commonly used of which are Horn clauses. A Horn clause is a clause that contains at most one positive literal. The famous Prolog programming language is, in fact, based on Horn clause logic; Prolog programs can be mined from data using inductive logic programming techniques, by searching for Horn clauses that (possibly approximately) hold in the data.</p>
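<p>Checking whether a clause is Horn is mechanical: count its positive literals. A minimal sketch (the clause encoding below is our own, chosen purely for illustration):</p>

```python
def is_horn_clause(clause):
    """A clause is Horn iff it contains at most one positive literal.
    Literals are modeled as (atom, is_positive) pairs."""
    return sum(1 for _, is_positive in clause if is_positive) <= 1

# Parent(x,y) AND Parent(y,z) -> Grandparent(x,z), in clausal form:
# ~Parent(x,y) OR ~Parent(y,z) OR Grandparent(x,z)
definite = [("Parent(x,y)", False), ("Parent(y,z)", False),
            ("Grandparent(x,z)", True)]
# Two positive literals, so not Horn:
not_horn = [("Smokes(x)", True), ("Cancer(x)", True)]

print(is_horn_clause(definite), is_horn_clause(not_horn))  # True False
```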
<p>Already, the machinery described here starts to illustrate why we cannot limit ourselves to purely symbolic approaches when representing and working with (i.e., doing reasoning and inference on) KGs. The single biggest limitation is that many formulas in KGs are typically true in the real world, but it is rarely the case that they are <i>always</i> true. It is simply not feasible, in most domains, to come up with nontrivial formulas that are always true, and where it is feasible, such formulas capture a small portion of the relevant knowledge in that domain. While there are a number of solutions (many ad hoc) that have been proposed as practical extensions of first-order logic to deal with these challenges, a more systematic approach is Markov Logic, which we already introduced as fulfilling some of the desiderata that have been described. Armed with the tools and terminology of first-order logic, we take a deeper look at Markov Logic next.</p>
<p class="TNI-H3"><b>9.3.1.2 Markov Logic</b> The key motivation for Markov Logic was to address the limitation we noted previously; namely, that if we represent a KG using only first-order logic predicates and constraints, we would have a set of “hard” constraints on the set of possible worlds; hence, if even one world violates one formula, that world has zero probability. To counter this brittleness, Markov Logic was proposed to soften these constraints so that when a world does violate a formula in the KG, it only makes it <i>less probable</i>, not <i>impossible</i>. <span aria-label="229" id="pg_229" role="doc-pagebreak"/>Extending this notion, the framework was designed so that the fewer the formulas that a world violates, the higher its probability. Just like in the motivating example at the beginning of this chapter, Markov Logic associates each formula with a weight reflecting the strength, or importance, of that constraint. The higher the weight, the greater the difference that formula’s violation will make to the probability of the world (in log space).</p>
<p>With these ideas in place, a Markov Logic Network (MLN) is defined as a set <i>L</i> of pairs (<i>F</i><sub><i>i</i></sub><i>, w</i><sub><i>i</i></sub>) such that <i>F</i><sub><i>i</i></sub> is a formula in first-order logic and <i>w</i><sub><i>i</i></sub>, its associated weight, is a real number. Additionally, a finite set of constants <i>C</i> = {<i>c</i><sub>1</sub><i>, <span class="ellipsis">…</span>c</i><sub>|<i>C</i>|</sub>} is also associated with the MLN, denoted using the symbol <i>M</i><sub><i>L, C</i></sub>. As an example, we provide instances of MLN formulas, with weights, given a set of English sentences in <a href="chapter_9.xhtml#tab9-1" id="rtab9-1">table 9.1</a>. As we noted before, the formula itself may be difficult to elicit from a domain expert without knowledge of first-order logic, but the English sentences are usually fairly easy to elicit from experts or from known scientific fact. Typically, some manual work is necessary in making the leap from the English sentences to the clausal forms noted in <a href="chapter_9.xhtml#tab9-1">table 9.1</a>. Note that weights do not have to be between 0 and 1 and should not be interpreted as probabilities.</p>
<div class="table">
<p class="TT"><a id="tab9-1"/><span class="FIGN"><a href="#rtab9-1">Table 9.1</a>:</span> <span class="FIG">Examples of MLN rules (in clausal form) with weights, encoding domain knowledge expressed as natural-language sentences.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>Weight</b></p></th>
<th class="TCH"><p class="TB"><b>MLN rule</b></p></th>
<th class="TCH"><p class="TB"><b>Natural-language description</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB">0.6</p></td>
<td class="TB"><p class="TB">¬Friends(<i>x, y</i>) ∨ ¬Spouse(<i>y, z</i>) ∨ Friends(<i>x, z</i>)</p></td>
<td class="TB"><p class="TB">If Mary and Bob are married, and Mary and Joan are friends, then Bob and Joan are friends.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">2.0</p></td>
<td class="TB"><p class="TB">Friends(<i>x, y</i>) ∨ ¬Spouse(<i>x, y</i>)</p></td>
<td class="TB"><p class="TB">If two people are not friends, then they are not spouses.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">1.6</p></td>
<td class="TB"><p class="TB">¬Spouse(<i>x, y</i>) ∨ ¬Smokes(<i>x</i>) ∨ Smokes(<i>y</i>), ¬Spouse(<i>x, y</i>) ∨ Smokes(<i>x</i>) ∨ ¬Smokes(<i>y</i>)</p></td>
<td class="TB"><p class="TB">If Mary and Bob are married, then either both smoke or neither does.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">1.2</p></td>
<td class="TB"><p class="TB">¬Stress(<i>x</i>) ∨ BackPain(<i>x</i>)</p></td>
<td class="TB"><p class="TB">Stress causes back pain.</p></td>
</tr>
</tbody>
</table>
</figure>
</div>
<p>Beyond examples, it is best to think of an MLN as a <i>template</i> for constructing Markov Networks, because different sets of constants will produce different networks. These networks may be of varying size, but they nevertheless share regularities in structure and parameters, specified by the MLN. At minimum, the formula weights are shared across the networks, meaning that all groundings of the same formula will share the same weight. In the literature, these networks are designated as <i>ground Markov Networks</i> to distinguish them from the (template) MLNs specified in first-order logic. From formulas previously introduced, the probability distribution over possible worlds <i>x</i> specified by the ground Markov Network <i>M</i><sub><i>L,C</i></sub> is <img alt="" class="inline" height="20" src="../images/pg229-in-1.png" width="150"/>, which in turn equates to <img alt="" class="inline" height="20" src="../images/pg229-in-2.png" width="91"/>, with <i>n</i><sub><i>i</i></sub>(<i>x</i>) being the number of true groundings of <i>F</i><sub><i>i</i></sub> in <i>x</i>, and with <i>x</i><sub>{<i>i</i>}</sub> being the state (or <span aria-label="230" id="pg_230" role="doc-pagebreak"/>truth values) of the atoms appearing in <i>F</i><sub><i>i</i></sub> (where <i><span lang="el" xml:lang="el">ϕ</span></i><sub><i>i</i></sub>(<i>x</i><sub>{<i>i</i>}</sub>) = <i>e</i><sup><i>w</i><sub><i>i</i></sub></sup>). Other formulations of this are also possible (e.g., using products of potential functions rather than log-linear models), but this is the most convenient approach in domains where hard and soft constraints are both present.</p>
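<p>For very small sets of constants, this distribution can be computed exhaustively. The sketch below (one illustrative formula and a weight of our own choosing) enumerates all possible worlds over the ground atoms, counts the true groundings <i>n</i><sub><i>i</i></sub>(<i>x</i>), and normalizes:</p>

```python
import math
from itertools import product

constants = ["Mary", "Bob"]
w = 1.5  # illustrative weight for the formula: Smokes(x) -> Cancer(x)

atoms = ([f"Smokes({c})" for c in constants] +
         [f"Cancer({c})" for c in constants])

def n_true_groundings(world):
    # Number of constants c for which Smokes(c) -> Cancer(c) holds.
    return sum(1 for c in constants
               if (not world[f"Smokes({c})"]) or world[f"Cancer({c})"])

worlds = [dict(zip(atoms, vals))
          for vals in product([False, True], repeat=len(atoms))]
scores = [math.exp(w * n_true_groundings(wld)) for wld in worlds]
Z = sum(scores)
probs = [s / Z for s in scores]

# Worlds that violate fewer groundings are exponentially more probable,
# but no world has zero probability.
assert abs(sum(probs) - 1.0) < 1e-9
assert min(probs) > 0
```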
<p>Because the ground network can be difficult to work with as a pure mathematical structure, it is helpful to visualize it. Assuming constants {Mary, Bob}, we apply them to the first two formulas in <a href="chapter_9.xhtml#tab9-1">table 9.1</a> to obtain the graph shown in <a href="chapter_9.xhtml#fig9-2" id="rfig9-2">figure 9.2</a>. The construction is fairly simple and mechanistic: each node in the graph is a ground atom, and an edge is declared between two nodes if their corresponding ground atoms appear together in some grounding of a formula. Using this network, there are well-defined inference procedures that can help us determine the probability of Friends(Bob, Mary), the probability of Friends(Bob, Mary) ∧ Spouse(Mary, Bob), and so on.</p>
<div class="figure">
|
||
<figure class="IMG"><a id="fig9-2"/><img alt="" src="../images/Figure9-2.png" width="450"/>
|
||
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig9-2">Figure 9.2</a>:</span> <span class="FIG">A ground Markov Network, assuming that the constants {Mary, Bob} are applied to the first two formulas in <a href="chapter_9.xhtml#fig9-1">figure 9.1</a>.</span></p></figcaption>
</figure>
</div>
<p>A full discussion of either the formalism of, or assumptions behind, MLNs is beyond the scope of this chapter (we provide pointers to primary material on the subject in the section entitled “Bibliographic Notes,” at the end of this chapter). There are, however, some important propositions that have been proven about them. For example, it was shown that every probability distribution over discrete (or finite-precision numeric) variables can be represented as an MLN. These propositions are important because they show that MLNs are a rigorous mechanism for knowledge representation. More subjectively, their support for combining logic rules, weights, and probabilities gives them many of the unifying features noted in the list in the previous section. Yet another important aspect that led to the uptake and popularity of the MLN framework when it was first proposed is that, as its authors showed in an influential work, SRL approaches like probabilistic relational <span aria-label="231" id="pg_231" role="doc-pagebreak"/>models, knowledge-based model construction, and stochastic logic programs can all be mapped into an MLN framework.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec9-3-2"/><b>9.3.2 Probabilistic Soft Logic</b></h3>
<p class="noindent">Similar to MLNs, PSL is a framework for collective, probabilistic reasoning in relational domains. However, there are some key differences, the most important of which is that, unlike MLNs, PSL uses <i>soft</i>-truth values in the interval [0,1], rather than the binary values {0,1}. This one feature yields several convenient outcomes, including allowing the incorporation of similarity functions at the levels of individuals and sets. Another advantage of employing continuous-valued random variables rather than binary variables is that the most probable explanation (MPE) inference problem can be cast as a convex optimization problem that is significantly more efficient to solve than its combinatorial counterpart in MLNs (polynomial versus exponential).</p>
<p>Simply put, a PSL model is composed of a set of weighted, first-order logic rules (just as with MLNs), where each rule defines a set of features of a Markov Network sharing the same weight. To recap, the formula <i>w</i>: <i>P</i>(<i>A, B</i>) ∧<i>Q</i>(<i>B, C</i>) <i>→ R</i>(<i>A, B, C</i>) is a valid PSL rule, with <i>w</i> being the weight of the rule, A, B, and C being universally quantified variables, and P, Q, and R being predicates. A grounding of a rule comes from substituting constants for the universally quantified variables in the rule’s atoms. In this example, assigning the constant values a, b, and c to the respective variables in the example rule would produce the ground atoms P(a,b), Q(b,c), and R(a,b,c). Each ground atom takes a <i>soft-truth</i> value in the range [0,1], as opposed to MLNs, where the ground atom would have taken one of the two values in {0,1}.</p>
<p>PSL associates a <i>numeric distance to satisfaction</i> with each ground rule that determines the value of the corresponding feature in the Markov Network. The distance to satisfaction is defined by treating the ground rule as a formula over the ground atoms in the rule. PSL uses the Łukasiewicz t-norm and co-norm to provide a relaxation of the logical connectives AND (∧), OR (∨), and NOT (¬), as follows (where relaxations are denoted using the symbol over the connective):</p>
<figure class="IMG"><img alt="" class="width" src="../images/pg231-1.png"/>
</figure>
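These relaxations are straightforward to sketch in code; the function names `l_and`, `l_or`, and `l_not` here are illustrative and not part of any PSL implementation:

```python
def l_and(a, b):
    # Lukasiewicz t-norm: relaxation of logical AND over soft-truth values in [0,1]
    return max(0.0, a + b - 1.0)

def l_or(a, b):
    # Lukasiewicz t-co-norm: relaxation of logical OR
    return min(1.0, a + b)

def l_not(a):
    # Relaxation of logical NOT
    return 1.0 - a

# On crisp values {0,1}, the relaxations agree with ordinary Boolean logic.
assert l_and(1, 1) == 1 and l_and(1, 0) == 0
assert l_or(0, 0) == 0 and l_or(1, 0) == 1
assert l_not(0) == 1

print(l_and(0.7, 0.8), l_or(0.7, 0.8), l_not(0.7))
```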
<p>As with many SRL frameworks, an important feature of PSL is that it restricts the syntax of first-order formulas. Instead of allowing arbitrary first-order formulas, PSL formulas must have conjunctive bodies. For example, the soft transitive rule:<sup><a href="chapter_9.xhtml#fn3x9" id="fn3x9-bk">3</a></sup></p>
<p class="noindent">0.8: <i>friend</i>(<i>Joe, Mark</i>) ∧ <i>friend</i>(<i>George, Mark</i>) <i>→ friend</i>(<i>Joe, George</i>) is acceptable <span aria-label="232" id="pg_232" role="doc-pagebreak"/>in PSL, whereas a rule such as 0.95: (<i>friend</i>(<i>Joe, Mark</i>) ∧ <i>friend</i>(<i>George, Mark</i>)) ∨ <i>roommate</i>(<i>Joe, George</i>) <i>→ friend</i>(<i>Joe, George</i>) is not, but it could potentially be reformulated into several valid PSL rules (see the exercises at the end of the chapter).</p>
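To see how a ground rule contributes a distance to satisfaction, here is a hedged sketch using the standard Łukasiewicz semantics for implication (distance max(0, t(body) − t(head))) and hypothetical soft-truth values for the transitive friend rule; the linear (p = 1) penalty used here is one of the options PSL supports:

```python
def distance_to_satisfaction(body, head):
    # Under the Lukasiewicz relaxation, a ground rule body -> head is
    # satisfied when t(head) >= t(body); otherwise the distance grows
    # linearly with the gap.
    return max(0.0, body - head)

# Hypothetical soft-truth values for the ground atoms of the rule
#   0.8 : friend(Joe, Mark) ^ friend(George, Mark) -> friend(Joe, George)
t_joe_mark = 0.9
t_george_mark = 0.8
t_joe_george = 0.4

body = max(0.0, t_joe_mark + t_george_mark - 1.0)  # Lukasiewicz AND of the body
d = distance_to_satisfaction(body, t_joe_george)
penalty = 0.8 * d  # weighted (linear, p = 1) penalty contributed by this grounding
print(body, d, penalty)
```

MPE inference in PSL then amounts to choosing soft-truth values for the unobserved atoms that minimize the total weighted penalty over all groundings.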
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec9-4"/><b>9.4 Knowledge Graph Identification</b></h2>
<p class="noindent">We provided some intuition earlier in this chapter on the importance of modeling relational dependencies, especially in KG ecosystems that are rich in such complexities. In the next two sections, we provide concrete examples of SRL frameworks that have been used by researchers for the important application of <i>KG identification (KGI)</i>, alternatively known as <i>KG completion</i>. The core motivation behind KGI is based on the real-world observation that KGs constructed using Natural Language Processing or other techniques are inevitably noisy (i.e., have incorrect links) and sparse (i.e., have missing links). KGI refers to the problem of identifying the true KG from an input set of messy triples. Arguably, a good solution to the problem should incorporate domain knowledge, knowledge of the various kinds of noise and inconsistencies that could be present in the input, and knowledge of the confidences of systems that constructed the KG in the first place. Because these are different sources of information that rely on both relational dependencies and statistical regularities, SRL is an apt framework for modeling the KGI problem. We present one such model that has been recently published in the literature and that incorporates multiple algorithms and sources of information, especially IE and IM. The framework is presented abstractly in <a href="chapter_9.xhtml#fig9-3" id="rfig9-3">figure 9.3</a>.</p>
<div class="figure">
<figure class="IMG"><a id="fig9-3"/><img alt="" src="../images/Figure9-3.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig9-3">Figure 9.3</a>:</span> <span class="FIG">An illustration of KG identification.</span></p></figcaption>
</figure>
</div>
<p>KGI assumes that, at its core, a KG contains three types of facts: about entities, entity labels, and relations. KGI represents entities with the logical predicate <i>ENT</i>(<i>E</i>) and labels with the logical predicate <i>Lbl</i>(<i>E, L</i>), where entity <i>E</i> has label <i>L</i>. Relations are represented with the logical predicate <i>Rel</i>(<i>E</i><sub>1</sub><i>, E</i><sub>2</sub><i>, R</i>), where the relation <i>R</i> holds between the entities <i>E</i><sub>1</sub> and <i>E</i><sub>2</sub> [e.g., <i>R</i>(<i>E</i><sub>1</sub><i>, E</i><sub>2</sub>)]. <span aria-label="233" id="pg_233" role="doc-pagebreak"/>Note that while this notation is a little different from what we have seen thus far, there is a straightforward mapping between such notations, all of which ultimately rely on the notion of a KG as a directed, labeled graph. The KGI problem is then defined as identifying a true set of atoms from a set of noisy extractions. To do so, one of the earliest approaches to KGI incorporated three components into an SRL framework: capturing uncertain extractions, performing entity resolution, and enforcing ontological constraints. By creating a PSL program that included these three components, KGI was able to relate the program to a distribution over <i>possible</i> KGs. Efficient inference was used to identify the most probable KG. Experimentally, this system was found to work well for KGI, even when the KG was built with extractions from web data, an important source for many real-world KGs (as discussed in chapter 5).</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec9-4-1"/><b>9.4.1 Representing Uncertain Extractions</b></h3>
<p class="noindent">Outputs from an IE system can be noisy; they are related to the logical predicates [e.g., ENT(E)] via the introduction of <i>candidate</i> predicates. Specifically, for each candidate entity, a corresponding predicate <i>CAND</i>_<i>ENT</i>(<i>E</i>) is introduced, and similarly for labels and relations generated by the IE system. Uncertainty in the extractions is expressed by assigning the candidate predicates a soft-truth value equal to the confidence value from the extractor. For example, an IE system might generate an entity extraction <i>Charlotte</i> with a confidence of 0.7, which the KGI system would represent as <i>CAND</i>_<i>ENT</i>(<i>Charlotte</i>) with a soft-truth value of 0.7.</p>
<p>Note that, in general, KGC systems tend to rely on several different IEs (see part II of this book, for example). The KGI system represented metadata about the specific technique or algorithm used to extract a candidate by using separate predicates for each such algorithm or technique. Notationally, if algorithm A was used to extract candidate entity E, the candidate predicate is denoted as <i>CAND</i>_<i>ENT</i><sub><i>A</i></sub>(<i>E</i>). Relations and labels are treated analogously. The predicates are related to true values of attributes and relations using weighted rules [e.g., <img alt="" class="inline" height="20" src="../images/pg233-in-1.png" width="314"/> and <img alt="" class="inline" height="21" src="../images/pg233-in-2.png" width="247"/>]. The set of candidates generated from grounding these rules, using the IE outputs, is denoted using the symbol <span class="font">𝒞</span>.</p>
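As a rough sketch (the extractor names, tuple encoding of atoms, and confidence values are all hypothetical), candidate atoms keyed by the extractor that produced them might be represented as:

```python
# Hypothetical IE outputs: each extractor contributes candidate atoms whose
# soft-truth value equals that extractor's confidence.
candidates = {
    "ner_tagger": {("CAND_ENT", ("Charlotte",)): 0.7,
                   ("CAND_LBL", ("Charlotte", "City")): 0.6},
    "pattern_ie": {("CAND_ENT", ("Charlotte",)): 0.9,
                   ("CAND_REL", ("Charlotte", "NC", "locatedIn")): 0.5},
}

def groundings_of(predicate):
    # Collect every grounding of a candidate predicate, tagged by extractor,
    # mirroring the per-algorithm predicates CAND_ENT_A(E), etc.
    return [(extractor, args, conf)
            for extractor, atoms in candidates.items()
            for (pred, args), conf in atoms.items()
            if pred == predicate]

for extractor, args, conf in groundings_of("CAND_ENT"):
    print(f"CAND_ENT_{extractor}{args} has soft-truth value {conf}")
```

Weighted rules of the form CAND_ENT_A(E) → ENT(E) would then let inference trade off the (possibly conflicting) evidence from each extractor.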
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec9-4-2"/><b>9.4.2 Representing Instance Matching Outputs</b></h3>
<p class="noindent">The IM algorithms may output confidences along with their predicted matches, which also should be looked upon as candidates. In the KGI system, a predicate called <i>SAME</i>_<i>ENT</i> is used to capture the similarity of two instances [e.g., <i>SAME</i>_<i>ENT(Delhi, NewDelhi)</i>]. To incorporate IM outputs in the PSL framework, three rules were proposed:</p>
<figure class="DIS-IMG"><img alt="" class="width" src="../images/pg233-1.png"/>
</figure>
<p><span aria-label="234" id="pg_234" role="doc-pagebreak"/>The intuition behind these rules should be fairly clear—namely, that when two instances are very similar, they will have a high truth value for <i>SAME</i>_<i>ENT</i>.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec9-4-3"/><b>9.4.3 Enforcement of Ontological Constraints</b></h3>
<p class="noindent">The outputs of IE and instance matching are only one kind of knowledge that can assist in identifying a good KG. Another source of knowledge that brings out the power of SRL frameworks like PSL is the set of rules corresponding to an ontology. Each type of ontological relation can be represented as a predicate, with the predicates representing ontological knowledge of relationships between labels and relations. Predicates such as <i>DOM</i> and <i>RNG</i> can be used to limit the domains and ranges of certain relations (e.g., we can specify that the domain of a relation <i>Friend</i> is <i>Person</i>). Similarly, a predicate such as <i>MUT</i> can be used to specify that the labels corresponding to concepts such as <i>Country</i> and <i>Person</i> are mutually exclusive (i.e., an entity cannot simultaneously have both labels). Other predicates for enforcing the subsumption of labels (SUB), subsumption of relations (RSUB), and inverse relations (INV) can be similarly devised. In Pujara et al. (2013), in fact, seven ontological constraints were specified (technically, these should be thought of as <i>types</i> of ontological constraints), reproduced here for the sake of completeness:</p>
<figure class="DIS-IMG"><img alt="" class="width" src="../images/pg234-1.png"/>
</figure>
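To make the role of such constraints concrete, the following toy checker flags hard violations of MUT and DOM constraints over a small, hypothetical KG; in PSL, these rules would instead contribute weighted distances to satisfaction rather than hard failures:

```python
# Toy ontology: domains of relations and mutually exclusive label pairs.
DOM = {"friend": "Person"}                   # DOM(friend, Person)
MUT = {frozenset({"Country", "Person"})}     # MUT(Country, Person)

labels = {("Delhi", "Country"), ("Delhi", "Person"), ("Bob", "Person")}
relations = {("Bob", "Mary", "friend")}

def violations(labels, relations):
    found = []
    # MUT(L1, L2): an entity cannot simultaneously carry both labels.
    for e in {e for e, _ in labels}:
        e_labels = {l for ent, l in labels if ent == e}
        for pair in MUT:
            if pair <= e_labels:
                found.append(("MUT", e, tuple(sorted(pair))))
    # DOM(R, L): the subject of relation R must carry label L.
    for e1, e2, r in relations:
        if r in DOM and (e1, DOM[r]) not in labels:
            found.append(("DOM", e1, r))
    return found

print(violations(labels, relations))
```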
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec9-4-4"/><b>9.4.4 Putting It Together: Probabilistic Distributions over Uncertain Knowledge Graphs</b></h3>
<p class="noindent">All three classes of representations defined thus far can be combined into a single PSL program <span lang="el" xml:lang="el">Π</span>. A set of ground rules <i>R</i> can be instantiated using the inputs of the KGC and IM process, including the union of groundings from the set <span class="font">𝒞</span> of uncertain candidates, matching entities, and ontological relations. The distribution over the set <i>I</i> of interpretations also corresponds to a distribution over the set <i>G</i> of KGs, and is given using equation (<a href="chapter_9.xhtml#eq9-3">9.3</a>), as follows:</p>
<figure class="DIS-IMG"><a id="eq9-3"/><img alt="" class="width" src="../images/eq9-3.png"/>
</figure>
<p>By conducting inference, the most likely interpretation can be identified, which (in the case of PSL) is an assignment of soft-truth values to the entities, relations, and labels comprising the KG. In Pujara et al. (2013), the approach to converting these soft values into <span aria-label="235" id="pg_235" role="doc-pagebreak"/>discrete elements that could be used and reasoned with was to choose a threshold and select the corresponding set of facts of appropriate quality to include in an “identified” KG.</p>
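The final thresholding step can be sketched simply; the facts, soft-truth values, and threshold below are hypothetical, and in practice the threshold would be tuned on validation data:

```python
# Soft-truth values for candidate facts produced by inference (hypothetical).
soft_facts = {
    ("Lbl", ("Delhi", "City")): 0.92,
    ("Rel", ("Delhi", "India", "capitalOf")): 0.78,
    ("Lbl", ("Delhi", "Person")): 0.12,
}

THRESHOLD = 0.5  # assumed value; tune on held-out data

# Keep only facts whose soft-truth value clears the threshold.
identified_kg = {fact for fact, v in soft_facts.items() if v >= THRESHOLD}
print(identified_kg)
```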
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec9-4-5"/><b>9.4.5 A Note on Experimental Performance</b></h3>
<p class="noindent">Pujara et al. (2013) evaluated the method on both a synthetic KG derived from the LinkedBrainz project, which maps data from the MusicBrainz community using ontological information from the Music Ontology project, and also real data extracted using web IE and NLP techniques from the Never-Ending Language Learning (NELL) project. They compared KGI to simple baseline methods, as well as a simpler version of KGI that used the output of instance matching (or ontological constraints) only, rather than the complete KGI pipeline. In all cases, the complete KGI method was found to lead to the highest F1 score (over 91.9 percent) and an AUC metric of more than 90 percent. In fact, compared to previous work on the NELL data set, including MLNs, KGI was found to demonstrate substantial improvement.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec9-5"/><b>9.5 Other Applications</b></h2>
<p class="noindent">Because of the generality of statistical relational frameworks like MLN and PSL, as well as their expressiveness in terms of representing and reasoning over relational models and dependencies, many applications have been proposed besides KGI, and many don’t assume a KG at all, but are designed to operate on networks or even relational databases. We briefly describe some of these applications next. Further guidance is provided in the section entitled “Software and Resources,” later in this chapter.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec9-5-1"/><b>9.5.1 Collective Classification</b></h3>
<p class="noindent">While ordinary classification tries to predict the class of an object given its attributes, collective classification also takes into account the classes of related objects. Attributes can be represented either in Markov Logic or PSL as predicates of the form <i>A</i>(<i>x, v</i>), with <i>A</i> being an attribute, <i>x</i> an object, and <i>v</i>, the value of <i>A</i> in <i>x</i>. The class, designated as an attribute <i>C</i>, is represented by the symbol <i>C</i>(<i>x, v</i>), with <i>v</i> being the class of <i>x</i>. Classification is now modeled as the problem of inferring the truth value of <i>C</i>(<i>x, v</i>) [for all <i>x</i> and <i>v</i> of interest, given all known <i>A</i>(<i>x, v</i>)]. An interesting aspect of this formulation is that it allows uniform modeling of ordinary and collective classification, because ordinary classification is merely the special case that guarantees independence of <i>C</i>(<i>x</i><sub><i>i</i></sub><i>, v</i>) and <i>C</i>(<i>x</i><sub><i>j</i></sub><i>, v</i>) for all <i>x</i><sub><i>i</i></sub> and <i>x</i><sub><i>j</i></sub> [and given the known <i>A</i>(<i>x, v</i>)]. However, collective classification includes other <i>C</i>(<i>x</i><sub><i>j</i></sub><i>, v</i>) in the Markov blanket of <i>C</i>(<i>x</i><sub><i>i</i></sub><i>, v</i>), even after conditioning on the known <i>A</i>(<i>x, v</i>). Relations between objects are represented by predicates of the form <i>R</i>(<i>x</i><sub><i>i</i></sub><i>, x</i><sub><i>j</i></sub>). The framework exposes some generalizations as well; for instance, <i>C</i>(<i>x</i><sub><i>i</i></sub><i>, v</i>) and <i>C</i>(<i>x</i><sub><i>j</i></sub><i>, v</i>) could potentially be indirectly dependent via <i>unknown</i> predicates [and possibly even including <i>R</i>(<i>x</i><sub><i>i</i></sub><i>, x</i><sub><i>j</i></sub>)].</p>
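As a rough illustration of the collective idea (the data and the propagation rule here are hypothetical), the sketch below seeds classes from attributes alone and then propagates them across links, mimicking a rule of the form R(x, y) ∧ C(x, v) → C(y, v):

```python
# Known attributes A(x, v) and relations R(x_i, x_j) (hypothetical data).
attrs = {"doc1": {"topic_word": "goal"}, "doc2": {}, "doc3": {}}
links = [("doc1", "doc2"), ("doc2", "doc3")]

# Seed classes from attributes alone (the ordinary-classification step).
classes = {x: ("sports" if a.get("topic_word") == "goal" else None)
           for x, a in attrs.items()}

# Collective step: a linked neighbor's class influences unlabeled objects,
# repeated until nothing changes (a crude stand-in for joint inference).
changed = True
while changed:
    changed = False
    for x, y in links:
        for a, b in ((x, y), (y, x)):
            if classes[a] is not None and classes[b] is None:
                classes[b] = classes[a]
                changed = True

print(classes)
```

In a real MLN or PSL program, this influence would be soft and weighted rather than a hard copy of the neighbor's label.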
</section>
<section epub:type="division">
<h3 class="head b-head"><span aria-label="236" id="pg_236" role="doc-pagebreak"/><a id="sec9-5-2"/><b>9.5.2 Link Prediction</b></h3>
<p class="noindent">A link prediction system aims to determine whether a relation exists between two objects of interest (e.g., whether Craig is Jay’s supervisor) given object properties (and possibly other known relations). In Markov Logic, link prediction can be formulated in a way that is near-identical to the MLN formulation of collective classification. The one key difference is that the goal now is to determine <i>R</i>(<i>x</i><sub><i>i</i></sub><i>, x</i><sub><i>j</i></sub>) for all object pairs of interest, rather than <i>C</i>(<i>x, v</i>), as was the goal in collective classification.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec9-5-3"/><b>9.5.3 Social Network Modeling</b></h3>
<p class="noindent">Social networks are like simpler versions of KGs, with nodes representing social actors (e.g., people) and edges representing relationships (e.g., friendship) between the actors on which they are incident. Social network analysis involves building models relating actors’ properties and their links. For example, the probability of two actors forming a link may depend on the similarity of their attributes (a phenomenon called <i>homophily</i>); conversely, two linked actors may be more likely to have certain properties in common. These models can be expressed as Markov Networks, with succinct formulaic representations [e.g., <i>0.8: A(x,p)</i> ∧ <i>A(y,p) → R(x,y)</i>, where x and y are actors, <i>R</i>(<i>x, y</i>) is a relationship between them, <i>A</i>(<i>x, p</i>) represents an attribute of <i>x</i> (with <i>p</i> being the value taken by <i>x</i> for that attribute), and the weight of the formula captures the correlation strength between relation and the attribute similarity]. For example, we can design a model stating that friends tend to have similar movie preferences. In fact, not only can MLNs and PSL encode observations and rules in ordinary social networks, they also allow richer dependencies to be intuitively stated (e.g., by writing formulas combining and interleaving multiple types of relations and attributes, and their potentially complex interdependencies).</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec9-6"/><b>9.6 Advanced Research: Data Programming</b></h2>
<p class="noindent">While graphical models have generally led to noted advances in the machine learning community<sup><a href="chapter_9.xhtml#fn4x9" id="fn4x9-bk">4</a></sup> and are certainly relevant to the problem of KG completion in a specific “flavor” (MLNs or PSL), a very recently published and relevant advance has been the so-called paradigm of data programming by Ratner et al. (2020). As defined by those authors, data programming consists of modeling multiple label sources without access to ground truth, and generating probabilistic training labels<sup><a href="chapter_9.xhtml#fn5x9" id="fn5x9-bk">5</a></sup> representing the lineage of the individual labels. The motivation is similar to that of weak supervision and active learning; namely, the scarcity of the large quantities of labels necessary for modern machine learning <span aria-label="237" id="pg_237" role="doc-pagebreak"/>architectures such as deep neural networks. Such labels are difficult to acquire due to the expense and manual labor required, especially for domain-specific cases like data analysis in the intelligence and defense communities, or medical image interpretation.</p>
<p>To illustrate the merits of data programming, Ratner et al. (2020) propose a system called <i>Snorkel</i> that enables users to write labeling functions instead of laboriously labeling data directly. Snorkel is an end-to-end system for combining weak supervision sources using this novel methodology. Because the labeling functions have different and unknown accuracies and correlations, Snorkel has to automatically model and combine their myriad outputs into a generative model. The resulting probabilistic labels output by the generative model can then be used to train a discriminative machine learning model.</p>
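The idea of labeling functions can be illustrated without the Snorkel library itself. In this hypothetical sketch, each function votes SPAM, NOT_SPAM, or abstains, and votes are combined by simple majority; Snorkel's actual generative model additionally learns the accuracies and correlations of the sources:

```python
ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

# Hypothetical labeling functions: weak, heuristic sources of labels.
def lf_contains_link(text):
    return SPAM if "http://" in text else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_message(text):
    return NOT_SPAM if len(text.split()) < 4 else ABSTAIN

LFS = [lf_contains_link, lf_mentions_prize, lf_short_message]

def majority_label(text):
    # Combine non-abstaining votes by majority (a crude stand-in for
    # Snorkel's learned generative model).
    votes = [lf(text) for lf in LFS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

print(majority_label("You won a PRIZE, claim at http://example.com now"))
print(majority_label("See you soon"))
```

The probabilistic labels such a model produces can then train any downstream discriminative classifier.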
<p>We do not explore the full details of Snorkel in this chapter (we provide a link to the resource in the “Software and Resources” section), but we do use it as a means to illustrate the broad scope and use-cases of probabilistic graphical models. Systems like Snorkel show that graphical models can be used as a language for weak supervision. Other techniques covered in this chapter also show that they can be used as a metalanguage for combining the outputs of IE, IM, and ontological constraints to yield a KG that is of higher quality, at least probabilistically. They are also a language for expressing the knowledge of domain experts in a tangible format. Most likely, other applications (or combinations and variants of existing lines of research and application) of graphical models and PSL will continue to emerge, especially as we start to witness more mainstream uptake of collective and joint reasoning solutions.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec9-7"/><b>9.7 Concluding Notes</b></h2>
<p class="noindent">In this chapter, we described the importance of modeling relational dependencies and uncertainty in a joint framework that also makes it easy to elicit and represent domain knowledge. We described two unifying frameworks that meet many of the desiderata identified for SRL: MLNs and PSL. MLN, being an earlier framework, assumes that random variables are binary, while PSL, in modeling random variables as continuous, is able to turn the optimization problem into a convex form. We also described how PSL can be used for KGI by jointly representing and modeling IE, IM, and ontological constraints. Through inference, the most probable KG is thus identified. Finally, we briefly covered various other applications of SRL, including link prediction and collective classification. Although these are popular applications, they are by no means the only ones. Other sample applications that we did not describe, but that have also found SRL as a useful framework for formalizing the problem and devising robust solutions, include link-based clustering (i.e., cluster objects in a specific way—namely, that if they are more closely related via their links, they are more likely to belong to the same cluster), and instance matching.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="238" id="pg_238" role="doc-pagebreak"/><a id="sec9-8"/><b>9.8 Software and Resources</b></h2>
<p class="noindent">Numerous slides and classroom materials have been released by instructors, owing to the popularity of SRL (and of related areas, such as probabilistic graphical models) in graduate curricula in computer science and AI. Because entire books have been written on this subject, an excellent resource is the webpage of a relatively recent book by Getoor and Taskar (2007), accessible at cs.umd.edu/srl-book/. It lists courses at the University of Maryland (cs.umd.edu/class/spring2005/cmsc828g), Purdue (cs.purdue.edu/homes/neville/courses/CS590N.html), the University of Washington (cs.washington.edu/education/courses/574/05sp), and the University of Wisconsin (biostat.wisc.edu/~page/838.html).</p>
<p>The webpage also lists software, data, and links to workshops and meetings. We recommend thoroughly reviewing these. Webpages for specific groups that do research on this subject also list important resources and publications. We especially recommend the Lise’s INQuisitive Students (LINQS) group (<a href="https://linqs.soe.ucsc.edu/home">https://<wbr/>linqs<wbr/>.soe<wbr/>.ucsc<wbr/>.edu<wbr/>/home</a>), as it provides fairly comprehensive information on data sets, publications, and resources. We also mentioned Snorkel as a software embodying the principles of data programming. Project details and tutorials on Snorkel are available at <a href="https://www.snorkel.org/">https://<wbr/>www<wbr/>.snorkel<wbr/>.org<wbr/>/</a>.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec9-9"/><b>9.9 Bibliographic Notes</b></h2>
<p class="noindent">SRL has a long history in recent AI literature, and it has found many applications beyond KGs. As with many areas of study in this book, providing a complete and exhaustive treatment of related work and bibliographic notes is infeasible, and we focus on key references and surveys instead, including the main papers that have guided the flow of material in this chapter. The application to KGs, and in particular, KGI, is a relatively recent phenomenon; we cite Pujara et al. (2013) as one of the important papers that took a detailed and systematic approach to KGI using PSL and showed that it could be successful with real-world data sets compared to rival approaches. However, a number of other important papers should be perused by interested readers, including Khosravi and Bina (2010), Getoor and Mihalkova (2011), Neville, Rattigan, et al. (2003), Rossi et al. (2012), and Kimmig et al. (2015). Some treatments are more domain-specific, but still extensive; for example, see Esposito et al. (2012) for a survey of SRL and social networks. Other applications for such probabilistic frameworks include trust analysis, drug-target prediction, and recommender systems, just to name a few [Kouki et al. (2015), Fakhraei et al. (2014), and Rettinger et al. (2011)].</p>
<p>Key references for learning about relational dependencies, and MLNs in particular, include Richardson and Domingos (2006), Wang and Domingos (2008), Kok and Domingos (2009), Domingos and Lowd (2009), Kok and Domingos (2005), and Singla and Domingos (2005). In one of the earliest papers by Singla and Domingos (2006) that applies Markov <span aria-label="239" id="pg_239" role="doc-pagebreak"/>Logic to a task (entity resolution) that is clearly relevant to KGs, we see some hint of techniques to come in the later part of the decade and the early 2010s. Some references go into much more detail than others. Koller and Friedman (2009), on probabilistic graphical models, is a key reference for learning about various kinds of probabilistic formalism that allow modeling joint distributions over interdependent unknowns using graph-based structures. For collective methods and their applications to link mining (closely related to some of the application areas covered toward the end of the chapter), we recommend Getoor and Diehl (2005) and the papers cited therein, in addition to some of the general resources covered in the “Software and Resources” section. For those looking for a more cursory treatment, a short and accessible introduction was also provided by Kimmig et al. (2012). Bach et al. (2017) provides a more complete and theoretical treatment of hinge-loss Markov Random Fields, which are a new type of probabilistic graphical model that generalizes different approaches to convex inference. PSL makes these fields easy to define and work with by using a syntax based on first-order logic.</p>
<p>The range and diversity of SRL approaches continue to grow, with some proposals including relational dependency networks, stochastic logic programs, and structural logistic regression, among many others. Important references include Wellman et al. (1992), Ngo and Haddawy (1997), Kersting and De Raedt (2001), Muggleton et al. (1996), Cussens and Pulman (1999), Sato and Kameya (1997), Dehaspe (1997), Friedman et al. (1999), Taskar et al. (2004), Neville et al. (2003b), Popescul and Ungar (2003), Cumby and Roth (2003), and Costa et al. (2003), to list only a few that were relatively early and influential. Toward the end of the chapter, we also mentioned data programming as an advanced area of research that draws on graphical models. The best reference for it to date is the open-access article by Ratner et al. (2020).</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec9-10"/><b>9.10 Exercises</b></h2>
<ul class="numbered">
<li class="NL">1. Consider again the unacceptable PSL rule: 0.95: <i>(friend(Joe, Mark)</i> ∧ <i>friend(George, Mark))</i> ∨ <i>roommate(Joe, George) → friend(Joe, George)</i>. How would you break it up into a set of acceptable PSL rules?</li>
<li class="NL">2. Consider the rules and data shown here. Compute the penalty of the three ground PSL rules, where <i>X</i> = <i>Bob</i>. Show your steps.</li>
</ul>
<figure class="IMG"><img alt="" src="../images/pg239-1.png" width="450"/>
</figure>
<ul class="numbered">
<li class="NL">3. <span aria-label="240" id="pg_240" role="doc-pagebreak"/>When would it be advantageous to use MLNs over PSL? When would it <i>not</i> be advantageous? Think of one good use-case for each.</li>
<li class="NL">4. You told a friend that “Michael is a stock market contrarian. When his friends, John and Bob, place a bet on a stock, he always bets against it.” Your friend claims that statements like that cannot be expressed in PSL, and you are determined to prove him wrong. Making appropriate assumptions, and assuming that we have data on stocks, as well as “bets” that Michael, John, and Bob have placed on the stocks, provide PSL rules and observations (not unlike the short program and observations from exercise 2 for testing your hypothesis). What would be one way for you to know whether you’re right or wrong about Michael?</li>
<li class="NL">5. You are trying to express your domain knowledge that A is more likely to follow the same celebrity on Twitter if (a) A’s friend B follows that celebrity, and if (b) A has the same musical tastes as B. Express the two rules as PSL. Assume that you are equally sure about both rules.</li>
<li class="NL">6. Later, it occurs to you that you’re not very sure about (b). Given a training data set of Twitter users, their musical tastes, celebrities, and their followers, what would be a way for you to test whether the second rule should be included? Do you have to make an include/not include decision? If not, what would be a methodology to maximize performance on a test data set without discarding the rule altogether?</li>
</ul>
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_9.xhtml#fn1x9-bk" id="fn1x9">1</a></sup> As it happens, the deterministic case does not require special treatment and can be treated as a single instance of the formal machinery to follow.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_9.xhtml#fn2x9-bk" id="fn2x9">2</a></sup> In a graphical model, the Markov blanket of a node is the minimal set of nodes that renders it independent of the remaining network. Conveniently, in a Markov Network, this is simply the node’s neighbors in the graph.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_9.xhtml#fn3x9-bk" id="fn3x9">3</a></sup> Stated intuitively, this rule says that a friend of a friend is (with high probability) a friend.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_9.xhtml#fn4x9-bk" id="fn4x9">4</a></sup> For a complete overview of this topic, we recommend Koller and Friedman (2009).</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_9.xhtml#fn5x9-bk" id="fn5x9">5</a></sup> Note that this means that the labels cannot be guaranteed to be correct. In this sense, the labels are not unlike those acquired through other pseudolabeling techniques such as weak supervision.</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html> |