
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch8" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch8"><span aria-label="175" id="pg_175" role="doc-pagebreak"/>8</h1>
<h1 class="chapter-title"><b>Instance Matching</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b>Once constructed, knowledge graphs (KGs) may contain sets of nodes that refer to the same underlying entity. Instance matching (IM) is the problem of semiautomatically clustering <i>instances</i> in the KG, such that each cluster resolves to a unique <i>entity</i>. Such entities are ordinarily named entities, although IM can apply even if the entity is unnamed. This chapter introduces the instance matching problem in detail and summarizes the main set of solutions that have been proposed over several decades. While the problem remains an active area of research, much progress has been made, and several techniques have become standard.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec8-1"/><b>8.1Introduction</b></h2>
<p class="noindent">We begin this chapter with an example of a real-world KG fragment describing four citations extracted from the web (<a href="chapter_8.xhtml#fig8-1" id="rfig8-1">figure 8.1</a>). The KG has already been constructed by this time using techniques such as Named Entity Recognition (NER) or table extraction that were covered earlier in part II of this book. Each citation (which we can assume is an instance of the class <i>Citation</i> that exists in an ontology describing the KG and its extractions) has its own <i>syntactic identifier</i> (1, 2, 3, and 4) in the KG. In other words, from the machines perspective, each citation is distinct.</p>
<div class="figure">
<figure class="IMG"><a id="fig8-1"/><img alt="" src="../images/Figure8-1.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig8-1">Figure 8.1</a>:</span> <span class="FIG">A fragment of a bibliographic KG illustrating the IM problem.</span></p></figcaption>
</figure>
</div>
<p>This artifact can result in an erroneous answer to the following question: How many unique papers are in the KG? A reasonable human being would say that there are only two. Citations 1 and 4, as well as citations 2 and 3, form two clusters that each refer to the same underlying entity. The reason why these citations were not automatically resolved is that there are fuzzy differences between them. Citation 1, for example, contains additional information that citation 4 does not, such as the publisher (Springer) and the editor (Stinson). Instance matching, which (ironically) also goes by alternative monikers in the literature (with some terms favored much more heavily by some communities) such as entity resolution, deduplication, and record linkage, is the algorithmic problem of resolving such instances or syntactic mentions (which may be extractions, coreferenced groups of extractions, or even entities from semistructured data sets like web tables, wrapper outputs, and XML files) into clusters, such that each cluster refers to the same underlying entity.</p>
<p><span aria-label="176" id="pg_176" role="doc-pagebreak"/>Interestingly, the human-provided answer would become more controversial if we replace <i>papers</i> with <i>citations</i> in the question that we just asked. Some academics, particularly librarians, may insist that the machine is right—there are four unique citations. Others, like us, may (if only unconsciously) interpret the assumed intent of the modified question to be the same as that of the original question.</p>
<p>The purpose of elaborating upon this possible scenario is to illustrate that it is difficult, if not impossible, to make a formal claim by way of a logical statement or an analytical formula that can be parsed by a machine. If this were possible, then a machine would be able to use such a claim to unequivocally decide the semantic equivalence of two syntactically distinct entities that a human being with reasonable contextual knowledge (or, in the case of domain-specific KGs, an expert with domain knowledge) would say are equivalent. Broadly speaking, this kind of ambiguity is ubiquitous in artificial intelligence (AI), and ordinary human or common-sense reasoning does not typically permit a robust analytical characterization. Hence, IM, despite seeming like a simple problem that most ordinary humans would be capable of solving effortlessly in nonspecialized domains (or that domain experts could solve in specialized domains), is a tough problem for AI to crack.</p>
<p>While it may seem intuitively obvious that detecting, and consequently resolving, such semantically equivalent mentions is good for the KG, we offer several concrete motivations for IM. One reason was mentioned earlier: without IM, one cannot do simple things like counting. Looking again at <a href="chapter_8.xhtml#fig8-1">figure 8.1</a>, we note that IM is typically not limited to one <i>type</i> of instance. In the bibliographic KG, there are <i>Author</i> nodes that refer to the same authors. Without additional information, a machine would also give us an incorrect count of the number of unique authors in the KG. More specifically, without IM, we cannot rely on algorithms to accurately perform aggregation operations, which include not only counting, but also more complex summarizations that are important for analysis.</p>
<p>Another reason relates to the construction of the KG itself: in at least one case, namely the <i>year</i> of publication of citations 1 and 4, we derive an inconsistency, under the reasonable assumption that citations 1 and 4 refer to the same publication. However, conflicting information sets of this nature do not always indicate a genuine inconsistency. For example, one citation may have been for an arXiv publication while another was for exactly the same paper, but formally published in a peer-reviewed conference. This is clearly not the case for citations 1 and 4 in <a href="chapter_8.xhtml#fig8-1">figure 8.1</a>. This example also illustrates why context, and in many cases domain-specific knowledge, can be important for making an informed judgment.</p>
<p>Finally, IM also plays a proactive role by helping us obtain a richer information set about a given entity. Returning to the example, one citation mention may include details such as the proceedings and venue, while a different mention of the same citation gives the year and publisher. Usually, the information sets are not disjoint; rather, they overlap, which may or may not lead to inconsistencies, as earlier described.<span aria-label="177" id="pg_177" role="doc-pagebreak"/></p>
<p><span aria-label="178" id="pg_178" role="doc-pagebreak"/>With these motivations in mind, we turn to the problem of how to solve IM. Like many difficult AI problems, at this time, the IM problem has not been completely solved, in that no adaptive IM system is able to achieve human-level performance across different domains, even with extensive systems tuning and training. However, much progress has been made on the problem over 50 years of research, and agreement has been reached on a number of issues.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec8-2"/><b>8.2Formalism</b></h2>
<p class="noindent">To lay out the formalism for IM, we assume the simplest graph-theoretic definition of a given KG <i>G</i> = (<i>M, R</i>)—namely, one that comprises a set <i>M</i> of mention nodes and a set <i>R</i> of directed, labeled edges. In the framework of KGs, the mention nodes are the instances we are trying to match. The term “instance,” without qualification, is generally abstract because an instance could also be a record in a table, as is often the case in a database or data warehouse framework. Henceforth, except where occasionally indicated, we use the terms “instance,” “entity,” and “mention” equivalently. With this context in place, we can define a pairwise linking function <span class="font"></span><sub><i>p</i></sub> as follows.</p>
<div class="newtheorem">
<p class="newtheorem"><span class="head"><b><i>Definition 8.2.1 (Pairwise Linking Function)</i></b></span> <i>Given a set M of of mention nodes, a pairwise linking function</i> <span class="font"></span><sub><i>p</i></sub> <i>is a boolean function</i> <span class="font"></span><sub><i>p</i></sub>: <i>M</i> × <i>M →</i> {<i>True, False</i>}<i>, which returns True for a pair of mention nodes iff they refer to the same underlying entity and returns False otherwise.</i></p>
</div>
<div class="newtheorem">
<p class="newtheorem"><span class="head"><b><i>Example 8.2.1</i></b></span> Considering <a href="chapter_8.xhtml#fig8-1">figure 8.1</a> again, the putative pairwise linking function would return <i>True</i> for the (order-independent) pairs of mentions <i>Citation 1</i> and <i>Citation 4</i>, and also <i>Citation 2</i> and <i>Citation 3</i>, and <i>False</i> for any other pair (e.g., <i>Citation 1</i> and <i>Citation 3</i>).</p>
</div>
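<p>To make definition 8.2.1 concrete, the following is a minimal Python sketch of a boolean pairwise linking function. The title-based Jaccard heuristic, the threshold of 0.6, and the sample titles are illustrative assumptions of ours, not a method prescribed by this chapter.</p>

```python
def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def linkage_p(m1, m2, threshold=0.6):
    """Boolean pairwise linking function: True iff the two citation
    mentions (dicts with a hypothetical 'title' field) are judged to
    refer to the same underlying paper."""
    return jaccard(m1["title"], m2["title"]) >= threshold

# Hypothetical mentions standing in for two near-duplicate citations:
c1 = {"title": "On the Security of Block Ciphers"}
c4 = {"title": "Security of Block Ciphers"}
print(linkage_p(c1, c4))  # True: the titles overlap heavily
```

<p>By construction, this particular function is reflexive and symmetric, though nothing in its definition forces it to be transitive.</p>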
<p>Note that a mention node is always linked to itself by definition 8.2.1 (i.e., <span class="font">&#8466;</span><sub><i>p</i></sub> obeys the property of reflexivity). In fact, of the three important properties of reflexivity, symmetry, and transitivity (RST), reflexivity and symmetry are generally obeyed by a real-world <span class="font">&#8466;</span><sub><i>p</i></sub>, and in many (but not all) cases, so is transitivity. In practice, the definition of <span class="font">&#8466;</span><sub><i>p</i></sub> is applied to syntactically distinct mentions. By syntactically distinct nodes <i>m</i><sub>1</sub> and <i>m</i><sub>2</sub>, we mean that <i>m</i><sub>1</sub> <i>≠ m</i><sub>2</sub> (i.e., the two nodes have distinct syntactic identifiers in the underlying KG <i>G</i>).</p>
<p>Furthermore, we refer to the function <span class="font">&#8466;</span><sub><i>p</i></sub> as <i>fuzzy</i> (in lieu of boolean) when the range of the function is in [0,1] instead of {<i>True, False</i>}. It is easy to see that the boolean definition can be framed as a special case of the fuzzy definition. Many practical functions in the literature, which tend to rely on machine learning algorithms like neural networks and SVMs, are fuzzy; hence, by default, this term is usually omitted. We make this distinction, as it will prove important for the formalism. We also note that, although it is unsafe to assign strict probabilistic semantics to the output of <span class="font">&#8466;</span><sub><i>p</i></sub>, one can certainly assume probability-<i>like</i> <span aria-label="179" id="pg_179" role="doc-pagebreak"/>semantics. In practice, the outputs are used in conjunction with a thresholding system and a manually annotated gold standard to plot either precision-recall or Receiver Operating Characteristic (ROC) curves.</p>
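<p>The thresholding setup just described can be sketched as follows; the fuzzy scores and the tiny gold standard below are invented purely for illustration.</p>

```python
def precision_recall(scores, gold, threshold):
    """Threshold a fuzzy linking function's scores into boolean
    decisions and compare against a gold standard.
    scores: dict mapping mention pairs to values in [0, 1];
    gold: set of truly matching pairs."""
    predicted = {pair for pair, s in scores.items() if s >= threshold}
    tp = len(predicted & gold)  # true positives
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(gold) if gold else 1.0
    return precision, recall

# Made-up fuzzy scores and gold standard for the four-citation example:
scores = {("c1", "c4"): 0.92, ("c2", "c3"): 0.81, ("c1", "c3"): 0.40}
gold = {("c1", "c4"), ("c2", "c3")}
# Sweeping the threshold traces out a precision-recall curve:
print(precision_recall(scores, gold, 0.5))  # (1.0, 1.0)
print(precision_recall(scores, gold, 0.9))  # (1.0, 0.5)
```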
<p>One limitation of <span class="font">&#8466;</span><sub><i>p</i></sub>, as defined, is that it does not easily extend to the case when more than two mentions refer to the same underlying entity. There are several ways, both theoretical and practical, that a pairwise linking function <span class="font">&#8466;</span><sub><i>p</i></sub> can be extended to a clustering linking function <span class="font">&#8466;</span><sub><i>c</i></sub>. The semantics of such a function are similar to those of the pairwise linking function, in that all mentions in a cluster should refer to the same underlying entity, and clusters should be maximal in this respect.</p>
<div class="newtheorem">
<p class="newtheorem"><span class="head"><b><i>Definition 8.2.2 (Clustering Linking Function)</i></b></span> <i>Given a set M of of mention nodes, a clustering linking function</i> <span class="font"></span><sub><i>c</i></sub> <i>is a function that partitions M into a set C of clusters, such that it is always the case that for each cluster c</i><i>C, such that</i> |<i>c</i>|≥ 2<i>, all mention nodes in c refer to the same underlying entity e, and no other mention in G (i.e.,</i><span class="font"></span> <i>C</i> <i>c) refers to e. If</i> |<i>c</i>| = 1<i>, the single mention in c exclusively refers to its own underlying entity.</i></p>
</div>
<div class="newtheorem">
<p class="newtheorem"><span class="head"><b><i>Example 8.2.2</i></b></span> Given the four citation nodes in <a href="chapter_8.xhtml#fig8-1">figure 8.1</a>, the putative clustering linking function would return the partition {{Citation 1, Citation 4}, {Citation 3, Citation 2}}.</p>
</div>
<p>Unlike with the pairwise function <span class="font">&#8466;</span><sub><i>p</i></sub>, it is not completely obvious how definition 8.2.2 can be extended to make <span class="font">&#8466;</span><sub><i>c</i></sub> fuzzy. In general, the pairwise function is much easier to formulate, and reason about, than the clustering function, and for that reason, it has been much better studied in the IM literature. In some cases, one can do without the clustering function altogether, while in other cases, the outputs of a pairwise function have to be somehow resolved into higher-level clusters. Reasonable clustering functions can be derived by combining the pairwise function with additional assumptions such as transitivity. Another popular technique is to materialize the outputs of an <span class="font">&#8466;</span><sub><i>p</i></sub> function as a network-like graph (different from the original <i>knowledge</i> graph) and execute an off-the-shelf clustering algorithm on this graph. In the section entitled “Postsimilarity Steps,” later in this chapter, we detail some of the possibilities.</p>
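<p>One way to derive a clustering linking function from a pairwise one, under the transitivity assumption mentioned above, is to take connected components of the graph of accepted pairs. The sketch below uses union-find; the toy gold map standing in for a real pairwise function is our own illustration.</p>

```python
from itertools import combinations

def cluster_from_pairwise(mentions, linkage_p):
    """Union-find over all mention pairs accepted by linkage_p;
    returns a partition of `mentions` as a list of sets (i.e., the
    connected components of the pairwise-match graph)."""
    parent = {m: m for m in mentions}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(mentions, 2):
        if linkage_p(a, b):
            parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), set()).add(m)
    return list(clusters.values())

# Toy stand-in for a pairwise function: a gold map from citation to paper.
same = {"c1": "p1", "c2": "p2", "c3": "p2", "c4": "p1"}
partition = cluster_from_pairwise(list(same), lambda a, b: same[a] == same[b])
print(sorted(map(sorted, partition)))  # [['c1', 'c4'], ['c2', 'c3']]
```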
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec8-3"/><b>8.3Why Is Instance Matching Challenging?</b></h2>
<p class="noindent">Before proceeding to a description of standard IM solutions, it is useful to revisit the running example to achieve a better understanding of why the problem is so difficult for machines to begin with (but not the average human). We provided some intuitions in the introduction to this chapter, and to summarize those intuitions, IM is difficult to automate because (1) pairs of duplicate mentions are duplicates for a variety of reasons that are difficult to codify using a consistent and complete rule base, and (2) humans seem to draw on background (often intuitive) knowledge in several common IM domains that is hard to pin down in code. A third challenge that will become clear in the subsequent discussion is scale, because naive solutions to IM grow quadratically with the number of nodes in the <span aria-label="180" id="pg_180" role="doc-pagebreak"/>KG. In the rest of this chapter, we illustrate how current IM methods, which usually rely on machine learning techniques, can partially deal with these challenges. As noted before, human-level performance is yet to be achieved.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec8-4"/><b>8.4Two-Step Pipeline</b></h2>
<p class="noindent">As a thought experiment, it is useful to assume that we already know the <i>true pairwise</i> linking function <span class="font"></span>. In the worst case, therefore, discovering all possible pairs of linked mentions in a KG is tantamount to solving a quadratic problem. Even for data sets with a few thousand nodes, the complexity quickly explodes. Data sets with more than 100,000 nodes, a not-unusual occurrence in the age of “Big Data,” cannot be resolved with brute-force use of quadratic algorithms.</p>
<p>Recognizing this early on, the IM community has converged on a set of preprocessing techniques called blocking to mitigate this quadratic complexity. <i>Blocking</i> refers to the process of inexpensively clustering approximately similar mentions into (possibly overlapping) blocks. Only mentions that share a block are paired and evaluated by <span class="font">&#8466;</span>. A more formal interpretation of “approximately similar” is that, with high probability, a blocking algorithm should cluster mentions that are more likely to be duplicates (according to <span class="font">&#8466;</span>) than nonduplicates. Established blocking algorithms tend to be linear or superlinear but still subquadratic. In some cases, a blocking algorithm may theoretically still be quadratic, but it is still significantly faster in practice (sometimes terminating in less than 1 percent of the time that a full application of <span class="font">&#8466;</span> would have taken).</p>
<p>Of course, the function <span class="font">&#8466;</span> is usually not known in practice (some exceptions are described later) and must instead be learned from data or approximated using a heuristic. The output of blocking, usually a candidate set of mention pairs, is piped to the similarity step, where the (known or learned) function <span class="font">&#8466;</span> is applied. This leads to a <i>two-step</i> pipeline, as illustrated in <a href="chapter_8.xhtml#fig8-2" id="rfig8-2">figure 8.2</a>. Here, <span class="font">&#8466;</span> may be applied to each mention pair independently, which makes a strong independent and identically distributed (i.i.d.) assumption, or may leverage relational dependencies between mention pairs. Such collective methods have been proposed fairly recently, but they tend to be domain-specific and are continuing to be widely researched. While the majority of this chapter is focused on noncollective IM, which is established and relatively domain-independent, toward the end of the chapter, we briefly cover extensions to collective IM.</p>
<div class="figure">
<figure class="IMG"><a id="fig8-2"/><img alt="" src="../images/Figure8-2.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig8-2">Figure 8.2</a>:</span> <span class="FIG">The two-step template that is often used for efficiently tackling real-world IM problems. The workflow can be customized in numerous ways, depending on both data modality and assumption about the underlying IM methods. For example, instead of linking entities between two KGs (say, Freebase and DBpedia), instances within a single KG may have to be resolved. For unsupervised methods, training sets may not be required (or the training set may be used only in blocking or similarity, not in both). If the KGs are modeled according to different ontologies and the IM or blocking is sensitive to this, then ontology matching may be necessary as a “step 0.”</span></p></figcaption>
</figure>
</div>
<section epub:type="division">
<h3 class="head b-head"><a id="sec8-4-1"/><b>8.4.1Blocking</b></h3>
<p class="noindent">Blocking is a preprocessing step that is used to mitigate the quadratic complexity of applying <span class="font"></span> to all (unordered) pairs of mention nodes from the set <i>M</i>. In the most general case, blocking methods use a many-many function called a <i>blocking scheme</i> to cluster approximately similar entities into overlapping blocks. We build up to the definition of a blocking scheme by first defining a simpler function called a <i>blocking key</i>.</p>
<div class="newtheorem">
<p class="newtheorem"><span aria-label="181" id="pg_181" role="doc-pagebreak"/><span class="head"><b><i>Definition 8.4.1 (Blocking Key)</i></b></span> <i>Given a set M of mention nodes, a blocking key K is a many-many function that takes a mention m</i><i>M as input and returns a nonempty set of literals, referred to as the blocking key values (BKVs) of m</i>.</p>
</div>
<p>Let <i>K</i>(<i>m</i>) denote the set of BKVs assigned to the mention <i>m</i> &#8712; <i>M</i> by the blocking key <i>K</i>. Furthermore, without loss of generality, the literals in definition 8.4.1 are all assumed to be strings.</p>
<div class="newtheorem">
<p class="newtheorem"><span class="head"><b><i>Example 8.4.1 (Blocking Key)</i></b></span> <i>A simple example of a blocking key is</i> Tokens(:label)<i>. Given a mention node with a</i>:label <i>property (e.g., a</i> Person <i>instance with label “John Smith”), the blocking key would ostensibly yield the BKVs</i> {<i>John, Smith</i>}<i>. A more complex example is</i> Tokens(:label) Tri-Grams(:label)<i>, which would yield a larger set of BKVs. Although we have used:label property in this example, any property could be used, but if an instance is not guaranteed to have a value for that property, then the blocking key has to be designed to deal with this behavior robustly (e.g., by returning the empty set).</i></p>
</div>
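<p>A blocking key like Tokens(:label) from example 8.4.1 can be sketched in a few lines. Property access is simplified here to a dictionary lookup, and we lowercase the tokens, which the example does not require; both are our own assumptions.</p>

```python
def tokens_blocking_key(mention, prop="label"):
    """Return the set of BKVs for a mention: the lowercased tokens of
    its label. If the property is missing, return the empty set (the
    robust fallback behavior example 8.4.1 calls for)."""
    value = mention.get(prop)
    return set(value.lower().split()) if value else set()

print(tokens_blocking_key({"label": "John Smith"}))  # {'john', 'smith'}
print(tokens_blocking_key({}))                       # set(): missing property
```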
<p>Given a blocking key <i>K</i>, a candidate set <i>C</i> &#8838; <i>M</i> × <i>M</i> of mention pairs can be generated by a blocking method using the BKVs of the mentions. Before describing some viable blocking methods, we introduce the notion of a blocking scheme, which generalizes blocking keys to yield blocks instead of just sets of literals.</p>
<div class="newtheorem">
<p class="newtheorem"><span class="head"><b><i>Definition 8.4.2 (Blocking Scheme)</i></b></span> <i>Given a set M of mention nodes, a blocking scheme</i> <span class="font">&#119974;</span> <i>is a boolean function that takes an unordered, distinct pair of mentions</i> (<i>m</i><sub><i>i</i></sub><i>, m</i><sub><i>j</i></sub>) ∈ <i>M</i> ×<i>M (such that m</i><sub><i>i</i></sub> <i>≠ m</i><sub><i>j</i></sub><i>) <span aria-label="182" id="pg_182" role="doc-pagebreak"/>as input and returns True if</i> (<i>m</i><sub><i>i</i></sub><i>, m</i><sub><i>j</i></sub>) <i>belong to a common block, and False otherwise.</i></p>
</div>
<p>It is important to note that it is <i>not</i> necessary for two instances that share a block to be paired together in the candidate set. In fact, it is often undesirable to do so, as example 8.4.2 illustrates. Furthermore, as shown in the example, while the definition of a blocking scheme makes no appeal to a blocking key, practical blocking schemes are built up by combining blocking keys using various set-theoretic operators. Such constraints are especially important when we consider that, theoretically, executing an arbitrary blocking scheme on a set of mentions (to obtain the set of blocks) is itself quadratic in the general setting and provides none of the benefits of blocking. Thus, a blocking scheme is purely conceptual, and actual blocking systems must apply efficient blocking methods (as subsequently described) that are approximately linear (or slightly superlinear) in the number of mentions, which almost always requires the provision of robust blocking keys.</p>
<div class="newtheorem">
<p class="newtheorem"><span class="head"><b><i>Example 8.4.2 (Blocking Scheme)</i></b></span> <i>An example of a blocking scheme, building on the previous example of blocking keys, is the function Tokens(:label)[m</i><sub><i>i</i></sub><i>]</i><i>Tokens(:label)[m</i><sub><i>j</i></sub><i>]</i> ≠ {} <i>(we avoid subscripts by using square brackets for the argument). Even on instances of a simple type such as</i> Person<i>, a naive application of this blocking scheme as a candidate set generator would not work well in practice on a sufficiently large data set due to the presence of highly common names like “John” or “Tom,” or surnames like “Smith.” Considering that names in countries like the United States (and many other Western countries) follow a Zipf-like distribution, a candidate set generated using this blocking scheme, without any additional filters or refinements, would be roughly quadratic (in time and space complexity) in the number of mentions (problem 1). More generally, this problem is referred to as</i> data skew <i>and is a common problem in real data sets.</i></p>
</div>
<p>In example 8.4.2, the naive application of the blocking scheme assumes a blocking method that was quite popular at one time (and that we will describe shortly), where two instances are paired and added to the candidate set if (and only if) they share a block. It is also easy to study the properties of this scheme—that is, it is reflexive and symmetric, but not guaranteed to be transitive (problem 1). By definition 8.4.2, this implies that blocks can be overlapping. None of these properties is required of a blocking scheme in general. It is perfectly possible to devise blocking schemes where transitivity is guaranteed, and where blocking yields a partition (hence, if <i>m</i><sub><i>i</i></sub> and <i>m</i><sub><i>j</i></sub> belong to a common block, then it is guaranteed to be the only common block, which is sufficient, though not necessary, for transitivity).</p>
<p>We describe three influential blocking methods next [Traditional Blocking, Sorted Neighborhood (SN), and Canopies], all of which assume that a blocking key <i>K</i> is already specified by a user. Depending on the method, <i>K</i> must also obey some constraints. Subsequently, <span aria-label="183" id="pg_183" role="doc-pagebreak"/>we also describe the automatic <i>learning</i> of a particularly robust class of blocking schemes that has been found to be extremely useful in practice for a variety of reasons.</p>
<p class="TNI-H3"><b>8.4.1.1Traditional Blocking</b>We can generalize the way in which the blocking scheme in example 8.4.2 was generated from a blocking key. Specifically, given a blocking key <i>K</i>, an obvious solution is to generate the candidate set <i>C</i> as the set {(<i>m</i><sub><i>i</i></sub><i>, m</i><sub><i>j</i></sub>)|<i>m</i><sub><i>i</i></sub><i>, m</i><sub><i>j</i></sub><i>M</i><i>m</i><sub><i>i</i></sub> <i>≠ m</i><sub><i>j</i></sub><i>K</i>(<i>m</i><sub><i>i</i></sub>) ∩<i>K</i>(<i>m</i><sub><i>j</i></sub>) ≠ {}}. Note that the definition of <i>C</i> as a set further implies that <i>m</i><sub><i>i</i></sub> and <i>m</i><sub><i>j</i></sub> may share multiple BKVs. It is only necessary for two mentions to share <i>at least</i> one BKV for them to be paired and added to the candidate set. As previously described, <i>C</i> is not guaranteed to be transitive. This makes the method nonrobust, especially to the aforementioned problem of data skew.</p>
<p>Despite this problem, Traditional Blocking is often the first line of attack in practical systems. In recent years, researchers have modified Traditional Blocking to handle the large blocks that result from skew. A simple method that is easy to implement and difficult to outperform is <i>block purging</i>. The premise of the method is that, with a sufficiently expressive blocking key, blocks that are too large can be safely ignored. With this adjustment in place, the blocking scheme is still the same as before, but the blocking method is now nontrivial and requires the specification of a parameter (the size threshold above which large blocks are discarded) that could, in principle, be learned from training data or experiments. Consequently, the candidate set complexity (in terms of both size and time taken to generate) is much more robust to the actual sizes of blocks generated by the original blocking scheme. The reason why the method performs so well experimentally is that blocks indexed by BKVs that are equivalent to stop words like “the” or “an” tend to be large, and because stop words don’t really contribute useful information about the instance, it is safe to discard such blocks without losing performance. The purging threshold could be specified in terms of pairs (i.e., discard all blocks that generate more pairs than this threshold) or in terms of block size (number of mentions in the block). While the threshold as an extra parameter may seem to involve extra effort, it has been found to be empirically robust to good default values (e.g., 100), so long as the default value is not too low. Because the method has time guarantees and runs quite fast, several iterations are possible to further tune the value. Another option, less used in the literature, is to start from a high threshold and set a timeout value based on a manually specified budget; namely, if the method has not concluded by the timeout value, then the threshold is reduced by a certain percentage and the blocking algorithm is reexecuted. Even more complex variants have been proposed, but there is no clear evidence that they provide much added value over simple block purging with a heuristically set threshold.</p>
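<p>Traditional Blocking with block purging can be sketched as follows: build an inverted index from each BKV to its block, drop blocks above a size threshold, and pair the mentions in the surviving blocks. The key function, labels, and threshold value are illustrative assumptions; the tiny threshold is chosen only so the purge is visible in a three-mention example.</p>

```python
from itertools import combinations

def traditional_blocking(mentions, key, max_block_size=100):
    """mentions: dict id -> record; key: blocking key returning a set
    of BKVs. Returns the candidate set of unordered mention-id pairs."""
    blocks = {}  # inverted index: BKV -> set of mention ids
    for mid, mention in mentions.items():
        for bkv in key(mention):
            blocks.setdefault(bkv, set()).add(mid)

    candidates = set()
    for bkv, block in blocks.items():
        if len(block) > max_block_size:
            continue  # block purging: drop oversized (stop-word-like) blocks
        for a, b in combinations(sorted(block), 2):
            candidates.add((a, b))
    return candidates

mentions = {
    "m1": {"label": "John the Smith"},
    "m2": {"label": "J the Smith"},
    "m3": {"label": "Jane the Doe"},
}
key = lambda m: set(m["label"].lower().split())
# The "the" block covers everything and is purged; only "smith" pairs remain.
print(traditional_blocking(mentions, key, max_block_size=2))  # {('m1', 'm2')}
```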
<p class="TNI-H3"><b>8.4.1.2Sorted Neighborhood</b>Another influential blocking method that was fundamentally designed to <i>guarantee</i> a bound on the size of the candidate set is the SN method, also known as <i>merge-purge</i>. The algorithm works as follows. First, a single BKV is <span aria-label="184" id="pg_184" role="doc-pagebreak"/>generated for each mention using a many-one blocking key. Next, the BKVs are used as sorting keys to impose an ordering on the mentions. Finally, a window of constant size <i>w</i> is slid over the sorted list. All mentions that share a window are paired and added to the candidate set. <a href="chapter_8.xhtml#fig8-3" id="rfig8-3">Figure 8.3</a> illustrates a workflow.</p>
<div class="figure">
<figure class="IMG"><a id="fig8-3"/><img alt="" src="../images/Figure8-3.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig8-3">Figure 8.3</a>:</span> <span class="FIG">An illustration of SN blocking. On the left is a table with instance IDs and the instances BKVs. The table is sorted according to the BKVs. A window of size 3 is slid over the table, and instances within the window are paired and added to the candidate set (initialized as empty). The final candidate set (sent on to the similarity step in the two-step IM workflow) is shown at the bottom right.</span></p></figcaption>
</figure>
</div>
<div class="newtheorem">
<p class="newtheorem"><span class="head"><b><i>Example 8.4.3 (SN)</i></b></span> <i>Figure</i> <a href="chapter_8.xhtml#fig8-3"><i>8.3</i></a> <i>illustrates an example of SN from a US-based customer domain. We assume that the BKV is the concatenation of the first three letters of the last name of the individual, and three digits based on a hash of the social security number. Each instance (a customer) in the KG yields a single BKV and is represented using the customer ID in the sorted table shown. A window of size 3 is slid over the table, and the candidate set is generated. Because the table has eight instances, the total size of the candidate set is 13.</i></p>
</div>
<p class="noindent">The sliding window has two implications for candidate set generation. First, mentions that do not have the same BKV may still get paired. An example of such a pair in <a href="chapter_8.xhtml#fig8-3">figure 8.3</a> is {2,3}. Second, some mentions with the same BKV may <i>not</i> get paired. For example, if the sliding window parameter <i>w</i> had been 2 instead of 3 in <a href="chapter_8.xhtml#fig8-3">figure 8.3</a>, the pair {3,5} would not have been added to the candidate set, despite the two instances having the same BKV.</p>
<p>In terms of the blocking scheme formalism, SN is interesting because the concept of a block is bypassed altogether. Instead, each window defines a block, which is of fixed size. The blocking scheme cannot now be defined analytically because it depends on the actual distribution of BKVs. Not only that, but there is a global dependence; that is, in order to <span aria-label="185" id="pg_185" role="doc-pagebreak"/>determine whether two instances fall within a common block (in this case, window), we cannot <i>locally</i> compare their respective BKVs, since, as example 8.4.3 showed, having the same BKV is no guarantee that the two instances will be paired in the candidate set. In addition to the window size parameter, the pairing behavior of two instances depends on the BKVs of other instances (nonlocality). All things being equal, two instances are more likely to get paired the more similar their BKVs are, and the fewer the instances that share a BKV with either of them (if distinct).</p>
<p>One source of nondeterminacy to bear in mind when using SN is that the order of instances that have the same BKV needs to be determined using some other information, or randomly. A variant of SN that has tried to deal with this problem instead slides a window over the sorted BKVs rather than the instances. This version loses size guarantees but tends to have higher recall and more predictable behavior because all instances with the same BKVs are always paired, by default, and the nondeterminacy problem noted here does not occur.</p>
<p>Assuming that the window size <i>w</i> is much smaller than the total number of mentions, SN has time and space complexity that is linear in the size of the data once the mentions have been sorted (the initial sort itself is log-linear). For this reason, it has endured as a popular blocking technique in the IM community. Numerous variations now exist besides the ones alluded to here, including implementations in Big Data architectures like Hadoop and MapReduce. In general, the primary differences between the variants and the original version are input datatypes (e.g., XML Sorted Neighborhood versus Relational), specification of blocking keys, and various ways of tuning the sliding window parameter (e.g., adaptive versus constant) for maximal performance.</p>
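<p>The sliding-window procedure can be made concrete with a short sketch. The function name and the single-valued <code>bkv</code> argument below are illustrative assumptions, not part of any particular library; ties among equal BKVs are broken by the sort's arbitrary order, which is exactly the nondeterminacy discussed above.</p>

```python
def sorted_neighborhood(mentions, bkv, w=3):
    """Sorted Neighborhood: sort mentions by blocking key value (BKV),
    slide a window of size w over the sorted list, and pair every two
    mentions that co-occur in some window."""
    # Ties among equal BKVs are broken by the (stable) sort's arbitrary
    # input order -- the nondeterminacy noted in the text.
    ordered = sorted(mentions, key=bkv)
    candidates = set()
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + w, len(ordered))):
            a, b = ordered[i], ordered[j]
            candidates.add((a, b) if a <= b else (b, a))
    return candidates
```

<p>On eight instances with <i>w</i> = 3, each instance pairs with at most the next two in sorted order, yielding 7 + 6 = 13 pairs, matching the count in example 8.4.3.</p>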
<p>A key disadvantage of SN algorithms is that they rely on a single-valued blocking key. Hernández and Stolfo (1995) recognized this as a serious limitation and proposed <i>multipass SN</i>, whereby multiple blocking keys (each of which would still have to be single-valued) could be used to improve coverage. For a constant number of passes, the run time of the original method is not affected asymptotically. Practical scaling is achieved by limiting the number of passes to the number of cores in the processor.</p>
<p>However, because even in multipass SN, each blocking key remains single-valued, the use of expressive blocking keys (or even simple, token-based set-similarity measures that have high redundancy) is precluded. Extending SN to account for heterogeneous data sources is also nontrivial. For this reason, the application of SN to KGs and other heterogeneous, semistructured data sources has been limited. The use of a simple blocking method such as Traditional Blocking (combined with skew-compensating measures like block purging) has been more popular.</p>
<p class="TNI-H3"><b>8.4.1.3Canopies</b>Clustering methods such as Canopies have also been successfully applied to blocking, especially in the context of relational databases (RDBs), although the application of Canopies to heterogeneous data like KGs is more straightforward than it is for SN. The basic Canopies algorithm takes a <i>distance function</i> and two threshold <span aria-label="186" id="pg_186" role="doc-pagebreak"/>parameters <i>tight</i> ≥ 0 and <i>loose</i><i>tight</i> and operates in the following way. First, a seed mention <i>m</i> is randomly chosen from <i>M</i>. All mentions that have a distance less than <i>loose</i> are assigned to the <i>canopy</i> represented by <i>m</i>. Among these mentions, the mentions with distance less than <i>tight</i> (from the seed mention) are removed from <i>M</i> and not considered further. Another seed mention is now chosen from all mentions still in <i>M</i>, and the process continues until all points have been assigned to at least one canopy. Unlike SN, and many other more traditional key-based blocking methods, Canopies does not rely on a blocking key; instead, it takes a distance function as input. For this reason, at least one work has referred to it as an instance-based blocking method and distinguished it from feature-based blocking methods such as SN and Traditional Blocking.</p>
<p>Similar to other popular blocking methods like Traditional Blocking and SN, several variants of Canopies have been proposed over the years, but the basic framework continues to be popular. For example, a nearest-neighbors method could be used for clustering mentions rather than a threshold-based method. In yet another variant, a blocking key can be used to <i>first</i> generate a set of BKVs for each mention, and Canopies can then be executed by performing distance computations on the <i>BKV sets</i> of mentions, rather than directly on the mentions themselves. Because this variant relies on a blocking key, it can no longer be considered an instance-based blocking method.</p>
<p>In the Canopies framework, each canopy represents a block. Concerning the choice of the distance function, the method has been found to work well with (the distance version of) a number of token-based set similarity measures, including Jaccard and cosine similarity (on tf-idf vectors). Such measures are quite robust to a number of issues (e.g., tf-idf based cosine similarity is insensitive to stop words and Jaccard is more sensitive to the number of unique tokens in a text fragment rather than the overall number of tokens, which allows it to discount frequently repeated words). However, token-based measures also have their blind spots, and not every information set or attribute associated with an entity can be decomposed into token sets to begin with. In practice, multiple distance functions and measures may make sense in order to correctly cluster entities with both high precision and recall. It is not completely clear whether one can extend Canopies in a way that seamlessly accommodates multiple functions. A systematic method might be multiview clustering, but at the risk of sacrificing the efficiency and simplicity of the original Canopies algorithm.</p>
<p class="TNI-H3"><b>8.4.1.4Learning Blocking Keys</b>Thus far, we have assumed that a blocking key is specified a priori. This used to be a safe assumption when IM was principally applied to tables (wherein it is referred to as <i>record linkage</i> or <i>deduplication</i>) with relatively constrained schemas. In such a case, a domain expert or practitioner looking at the table would have a “feel” for what a good blocking key could be. With some trial and error, an initially posited blocking key could be refined until satisfactory performance was achieved. With KGs, the situation is somewhat different. First, KGs tend to be large and heterogeneous, even schema-free, in that many mentions may not have object values specified for all properties. <span aria-label="187" id="pg_187" role="doc-pagebreak"/>For example, in <a href="chapter_8.xhtml#fig8-1">figure 8.1</a>, citation 2 does not have a specified publisher. Another problem that is common in KGs, but not RDBs, is that some properties have multiple object values specified. All the citations in <a href="chapter_8.xhtml#fig8-1">figure 8.1</a> have multiple authors specified. If the same data were laid out in a table, there would be a single column for author name, and the authors would be separated using some agreed-upon delimiter (like a comma). The disadvantage of such compact representations is that one needs extra knowledge in order to parse the contents of the column.</p>
<p>Regardless of the trade-offs between tabular and graph-theoretic representations of data, the merits of automatically learning blocking keys should be evident. At the very least, an automatic, good blocking key learner or BKL would yield a key that is data-driven and relieve the domain expert of the effort of having to specify or discover one using costly trial and error. Beyond scale and convenience, such learners are important for systems that cannot expose their data for reasons of privacy.</p>
<p>The best-known procedure for learning blocking schemes from labeled training data was first published only a decade ago. The key idea is to model the blocking scheme as a Disjunctive Normal Form (DNF) rule. A DNF blocking key can be constructed by starting with a set of <i>indexing functions</i> that take a primitive datatype as input and return a set of primitives as output. Without loss of generality, <i>String</i> is assumed as the only available primitive datatype, although the formalism given in example 8.4.4 also applies with a mixture of primitive types.</p>
<div class="newtheorem">
<p class="newtheorem"><span class="head"><b><i>Example 8.4.4 (Indexing Function)</i></b></span> <i>An example of an indexing function introduced earlier in the context of blocking keys is</i> Tokens<i>, which relies on predetermined delimiters to tokenize a string (e.g., “John Smith”) into a set of strings (e.g.,</i> {<i>“John”, “Smith”</i>}<i>).</i></p>
</div>
<p>A <i>blocking predicate b</i><sub><i>h</i></sub>(<i>prop</i>) on mentions is now defined by pairing an indexing function <i>h</i> with a property <i>prop</i>, and adopting the following semantics: Let two mentions in the KG be denoted by the symbols <i>m</i><sub><i>i</i></sub> and <i>m</i><sub><i>j</i></sub>. The logical predicate <i>b</i><sub><i>h</i></sub>(<i>prop</i>)[<i>m</i><sub><i>i</i></sub><i>, m</i><sub><i>j</i></sub>] is satisfied iff the intersection <i>h</i>(<i>prop</i>)[<i>m</i><sub><i>i</i></sub>] ∩ <i>h</i>(<i>prop</i>)[<i>m</i><sub><i>j</i></sub>] is nonempty, where <i>h</i>(<i>prop</i>)[<i>m</i>] is defined as the set obtained by applying <i>h</i>(<i>prop</i>) on the object value of <i>m</i> for property <i>prop</i>. Typically, the predicate mnemonically indicates the underlying indexing function. Similar to notation previously introduced, the property is included in parentheses and the argument in square brackets. Example 8.4.5 implements these ideas in practice.</p>
<div class="newtheorem">
<p class="newtheorem"><span class="head"><b><i>Example 8.4.5 (Blocking Predicate)</i></b></span> <i>An example of a blocking predicate is Common-</i><br/><i>Tokens(:label). CommonTokens indicates that Tokens is the underlying indexing function, while</i> : <i>label is the underlying property used by the predicate. If two mentions have so much as a common token in their labels, the blocking predicate would return True for the mention pair.</i></p>
</div>
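<p>Example 8.4.5 can be rendered directly in code: pairing the <i>Tokens</i> indexing function with a property yields a <i>CommonTokens</i>-style predicate. The dictionary representation of mentions and the treatment of a missing object value as a nonmatch are assumptions of this sketch.</p>

```python
def tokens(s, delimiters=",;"):
    """Indexing function Tokens: map a string to its set of tokens,
    treating the given delimiters as whitespace."""
    for d in delimiters:
        s = s.replace(d, " ")
    return set(s.split())

def blocking_predicate(indexing_fn, prop):
    """b_h(prop): True for a mention pair iff the sets produced by the
    indexing function on the two object values intersect."""
    def predicate(m_i, m_j):
        v_i, v_j = m_i.get(prop), m_j.get(prop)
        if v_i is None or v_j is None:
            return False  # assumption: a missing object value never matches
        return bool(indexing_fn(v_i) & indexing_fn(v_j))
    return predicate

# CommonTokens(:label), as in example 8.4.5
common_tokens_label = blocking_predicate(tokens, ":label")
```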
<p>What is the difference between the blocking key defined earlier and the indexing function? The most important difference is that the indexing function is a very general kind <span aria-label="188" id="pg_188" role="doc-pagebreak"/>of function that takes the primitive datatype (rather than a KG instance) as input. On the other hand, a blocking key applies to instances of KGs, which means that it has to extract the relevant data (using properties like <i>:label</i>, for example) before applying an indexing-function-like operation such as tokenization. The blocking predicate, being boolean, is like a simplified version of the blocking scheme introduced earlier, but it is much more specific in structure because it is not only always associated with an underlying indexing function, but it also applies intersection to the sets to determine the truth value. Recall that blocking schemes, in the general sense, were not bound by any such constraints; in fact, the underlying blocking scheme described by methods like SN (even given an analytic representation of the many-one blocking key itself) has no formulaic representation.</p>
<p>Before proceeding further, we must note the various ways in which predicates like CommonTokens can fail, the most common of which is the data skew problem described earlier. To avoid such problems, one could envision ways to make the blocking predicate more complex (e.g., by considering an indexing function that is more sophisticated or discerning than tokenizing). However, it is unclear if even that would do the trick. What if the blocking key were too discerning? If a true positive pair does not make it through blocking, it has no chance of being flagged by the similarity step as being a matched pair of instances, which would hurt coverage metrics like recall. Clearly, we want a solution where a trade-off can be achieved between coverage and efficiency.</p>
<p>Intuitively, a combination of heuristics or predicates is usually required to achieve good performance on any linking task, including approximate tasks like blocking. The trick is to combine several blocking predicates into a single boolean expression, called a <i>DNF blocking scheme</i>, by using blocking predicates as atoms. For well-defined semantics, negated atoms are disallowed. Similar to the blocking predicates, a DNF blocking scheme takes a pair of mentions as input. The mnemonic considerations earlier stated also apply. Following example 8.4.6, we explain why such an expression is referred to as a scheme rather than a key, as well as the difference between the two.</p>
<div class="newtheorem">
<p class="newtheorem"><span class="head"><b><i>Example 8.4.6 (DNF Blocking Scheme)</i></b></span> <i>For a peoples data set, consider the DNF blocking scheme CommonTokens</i>(: <i>label</i>) <i>HasExactMatch</i>(: <i>age</i>) ∧ <i>CommonSoundex</i>(: <i>label</i>)<i>. Two new blocking predicates, named self-explanatorily, are introduced in the DNF expression: HasExactMatch uses the identity function as its underlying indexing function and returns True if two strings exactly match, while CommonSoundex uses a modified version of Tokens as its indexing function and returns True if two strings share at least one token that</i> sounds <i>the same (i.e., have the same</i> Soundex <i>encoding). The mention pair</i> (<i>m</i><sub><i>i</i></sub><i>, m</i><sub><i>j</i></sub>) <i>is added to the candidate set C if it satisfies the scheme.</i></p>
</div>
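<p>Under the same illustrative assumptions (mentions as dictionaries, predicates as boolean functions on pairs), a DNF blocking scheme is simply a disjunction of conjuncts of predicates. The sketch below uses a 1-DNF scheme with two simple predicates; the Soundex-based predicate of example 8.4.6 is omitted to keep the code self-contained.</p>

```python
def dnf_scheme(conjuncts):
    """A DNF blocking scheme: a disjunction of conjuncts, each conjunct
    a list of (non-negated) blocking predicates over a mention pair."""
    def scheme(m_i, m_j):
        return any(all(pred(m_i, m_j) for pred in conj) for conj in conjuncts)
    return scheme

def common_tokens_label(m_i, m_j):
    """CommonTokens(:label): True iff the labels share a token."""
    return bool(set(m_i[":label"].split()) & set(m_j[":label"].split()))

def exact_match_age(m_i, m_j):
    """HasExactMatch(:age): True iff both ages exist and are equal."""
    return m_i.get(":age") is not None and m_i.get(":age") == m_j.get(":age")

# CommonTokens(:label) v HasExactMatch(:age)
scheme = dnf_scheme([[common_tokens_label], [exact_match_age]])
```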
<p>It is fruitful to revisit the difference between a scheme and a key. Although the literature is not consistent on this issue (nor is it considered a touchy matter), we offer the following answer: A scheme takes a pair of mentions as input and returns either <i>True</i> or <i>False</i>; in other words, it is used to test for candidate set membership, but otherwise is not useful in <span aria-label="189" id="pg_189" role="doc-pagebreak"/>practice because it is still pairwise. On the other hand, a DNF blocking key is a variation of the scheme that (1) operates on a single mention to extract a set of BKVs (based on the underlying indexing functions and properties), and (2) can be used to construct an inverted index of BKVs that can then be efficiently processed by a blocking method like the three described earlier.</p>
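<p>The scheme-versus-key distinction is visible in code: a DNF blocking key maps a <i>single</i> mention to a set of BKVs, and an inverted index over those BKVs (here driving Traditional Blocking) then yields the candidate set without any pairwise tests. The function names and the (property, value) encoding of BKVs are assumptions of this sketch.</p>

```python
from itertools import combinations

def dnf_blocking_key(mention, key_parts):
    """Apply a DNF blocking key to one mention: each part pairs an
    indexing function with a property; the union of their outputs
    (tagged by property) is the mention's BKV set."""
    bkvs = set()
    for indexing_fn, prop in key_parts:
        value = mention.get(prop)
        if value is not None:
            bkvs |= {(prop, v) for v in indexing_fn(value)}
    return bkvs

def traditional_blocking(mentions, key_parts):
    """Build an inverted index from BKV to mention ids, then pair every
    two mentions sharing at least one BKV (one block per BKV)."""
    index = {}
    for mid, mention in mentions.items():
        for bkv in dnf_blocking_key(mention, key_parts):
            index.setdefault(bkv, set()).add(mid)
    candidates = set()
    for block in index.values():
        candidates |= {tuple(sorted(pair)) for pair in combinations(sorted(block), 2)}
    return candidates
```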
<p>Just as with arbitrary blocking schemes, it is possible for a pair (<i>m</i><sub><i>i</i></sub><i>, m</i><sub><i>j</i></sub>) to be evaluated as <i>True</i> by a DNF blocking scheme, but not be included in the candidate set <i>C</i>. The reason, of course, is that the candidate set depends both on the BKVs and the blocking method used. Traditional Blocking can guarantee that a pair such as the one given here is always included in <i>C</i>, whereas a method like SN, as noted in an earlier caveat, would not. The converse is also true. Because of its sliding window methodology, a method like SN may include pairs in <i>C</i> that would be evaluated as <i>False</i> by the corresponding blocking scheme. For this reason, we recommend always thinking about blocking in terms of a blocking key and a blocking method rather than an idealized blocking scheme, although if available (especially in an analytical form) and paired with a reasonable or well-studied blocking method like Traditional Blocking, it can be instrumental for proving theoretical claims (such as membership or existence) about the candidate set.</p>
<p>As it turns out, one reason why DNF blocking schemes are attractive is that they can be learned using a training set of positive (matching) and negative (nonmatching) instance pairs. This is preferable not only because it means that adaptive methods can be used for automatically discovering the schemes, rather than constructing them using trial-and-error, intuition-based exploration, but also because labeled pairs are usually available anyway for the state-of-the-art similarity steps based on machine learning. Thus, the labeled pairs can be used twice, maximizing the benefit of labeling them in the first place. DNF schemes are also not training hungry, so to speak; experiments have shown that with a few good training pairs, a robust scheme can be learned with good regularization mechanisms. Methods have been proposed to avoid labeling altogether by using heuristics like tf-idf to obtain an approximately correct training data set. Such methods are useful if the similarity step is unsupervised, which makes labeling infeasible if the only goal is to use the labeled pairs for learning a blocking scheme.</p>
<p>What is interesting about DNF blocking scheme learners is that they do not rely on ordinary machine learning algorithms; instead, the problem is framed as an instance of <i>red-blue set covering</i>. This is a well-studied (and understood) problem that is known to be NP-complete, and greedy algorithms providing approximate solutions with guarantees have existed for at least a few decades. The basic version of the problem can be stated as follows. Given a finite set of red elements <i>R</i>, a finite set of blue elements <i>B</i>, and a family <i>S</i> ⊆ 2<sup><i>R</i><i>B</i></sup>, the problem is to find a subfamily <i>S</i> ⊆ <i>S</i> that covers all blue elements, but which covers the minimum possible number of red elements. In the blocking context, the blue elements are the positive training pairs, and the red elements are the negative training pairs. <span aria-label="190" id="pg_190" role="doc-pagebreak"/>Note that, because blocking is only a precursor to similarity, false positives are expected, and the similarity step is expected to weed out those false positives. A candidate DNF scheme covers a pair if it returns <i>True</i> for that pair. The goal is to discover a candidate that covers all of the positive pairs, while minimizing coverage of the negative pairs. The actual algorithms used for discovering these schemes are not quite as simple, because the number of possible DNF schemes is at least exponential in the number of predicates. The usual mechanism for controlling the complexity is to set a parameter <i>k</i> (which is usually 2 or 3) and only consider k-DNF expressions as candidate schemes. The k-DNF schemes are a strict subset of all DNF schemes, allowing only conjuncts that have at most <i>k</i> blocking predicates. As we noted earlier, predicates are not allowed to be negated, which leads to an even smaller space. Even so, the full implementation of the learning algorithm can be tricky. 
For more details on semisupervised and unsupervised variants, including a variant proposed specifically for Resource Description Framework (RDF) KGs, we refer readers to resources listed in the section entitled “Bibliographic Notes,” at the end of this chapter.</p>
<p>Work on red-blue set covering has continued in the algorithms community independent of blocking, instance matching, or KG. In principle, these developments can be used for speeding up learning, although we are not aware of any practical cases that have done so. Usually, the greedy algorithm suffices for learning a good DNF blocking scheme.</p>
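<p>A hedged sketch of the greedy approach follows: candidate schemes are scored by how many still-uncovered positive (blue) pairs they cover relative to the negative (red) pairs they admit. The exact scoring ratio varies across the literature; the one below is a common choice made here for illustration, not the definitive algorithm.</p>

```python
def greedy_red_blue_cover(candidates, blue, red):
    """Greedy red-blue set cover for DNF blocking scheme learning.
    `candidates` maps a candidate scheme name to the set of training
    pairs it covers (blue = positive pairs, red = negative pairs).
    Repeatedly pick the candidate with the best ratio of newly covered
    blue pairs to (1 + covered red pairs), until all blue are covered."""
    chosen, covered = [], set()
    candidates = dict(candidates)
    while covered < blue and candidates:
        def score(name):
            cov = candidates[name]
            return len((cov & blue) - covered) / (1 + len(cov & red))
        best = max(candidates, key=score)
        if score(best) == 0:
            break  # no remaining candidate covers a new blue pair
        chosen.append(best)
        covered |= candidates.pop(best) & blue
    return chosen
```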
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec8-4-2"/><b>8.4.2Similarity</b></h3>
<p class="noindent">Once blocking has been executed, the resultant candidate set <span class="font">&#119966;</span> of pairs must undergo <i>similarity</i> computations to filter out (whether in a probabilistic or deterministic manner) the subset of <span class="font">&#119966;</span> that comprises nonduplicate mention pairs. If the independence and identically distributed properties are assumed, each such pair is independently assigned a score, with higher scores indicating greater likelihood of the pair representing matching instances. Scores tend to be normalized to [0, 1], but this does not mean that they should be strictly interpreted as probabilities. The usual methodology is to either assume a threshold (all pairs with scores above this threshold are considered matching instance pairs) or to rank the pairs in descending order of score and consider only the top <i>N</i>. Either way, the threshold or <i>N</i> is an extra parameter that has to be determined using either a held-out validation set or some other kind of validation scheme, such as <i>k</i>-fold cross-validation.</p>
<p>The problem of choosing the pairs does not arise if the scores are deterministic (either 1.0 or 0.0), but this is rare in machine learning. In earlier research, when rules were used for determining if instances matched or not, this was much more common. However, to get good rules (or other heuristics), significant effort from domain engineers or experts had to be solicited and hence, in recent years, such methods have become increasingly obsolete. Furthermore, when using fuzzy scores, it turns out that there is a well-known theoretical model in early IM literature known as the <i>Fellegi-Sunter model</i>, named after Fellegi and Sunter (1969), who formulated it. The model makes the claim that <i>two</i> (not necessarily <span aria-label="191" id="pg_191" role="doc-pagebreak"/>distinct) thresholds are actually necessary to achieve a desired optimal trade-off between minimizing false positives and minimizing false negatives, two goals that often conflict (and that influence <i>precision</i> and <i>recall</i>, respectively; see the section entitled “Evaluating the Two-Step Pipeline,” later in this chapter, for more information). The goal behind using two thresholds is to partition the candidate set into three sets (matches, possible matches, and nonmatches). Possible matches are in the set that would benefit most from manual review. The ratio of conditional probabilities (the condition being based on whether the instance pair is assumed as matching or nonmatching) is used to compute the score that is compared to these thresholds to determine what set the pair should fall into. We provide original references for the Fellegi-Sunter model for interested readers in the “Bibliographic Notes” section.</p>
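<p>The two-threshold decision rule of the Fellegi-Sunter model reduces to a simple partition. The sketch below uses a generic per-pair score in place of the explicit ratio of conditional probabilities, and the names are illustrative.</p>

```python
def fellegi_sunter_partition(scored_pairs, lower, upper):
    """Partition candidate pairs into matches, possible matches (for
    manual review), and nonmatches using two thresholds on the
    pair's score, in the spirit of the Fellegi-Sunter model."""
    matches, possible, nonmatches = [], [], []
    for pair, score in scored_pairs:
        if score >= upper:
            matches.append(pair)
        elif score >= lower:
            possible.append(pair)   # benefits most from manual review
        else:
            nonmatches.append(pair)
    return matches, possible, nonmatches
```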
<p>This discussion assumes that we have a similarity function and have executed it on the candidate set to get scores or labels. But how do we learn such a similarity function to begin with? We principally assume a machine learning framework because machine learning models have emerged as state-of-the-art similarity functions. Notationally, let us denote the machine learning classifier as <span class="font"></span><sup>*</sup>, or an approximation of the (unknown, true) pairwise linking function <span class="font"></span><sub><i>p</i></sub> defined earlier.</p>
<p>First, each mention pair in <i>C</i> is converted to a numeric (typically, but not necessarily, real-valued) <i>feature vector</i>. <a href="chapter_8.xhtml#fig8-4" id="rfig8-4">Figure 8.4</a> illustrates the procedure for two such mentions. Because instances in KGs do not have to contain values for every single property, some practical assumptions must be made if a given property does not exist for an instance or if (for whatever reason) the feature function returns an exception. A common practice is to use special (i.e., dummy) values in the event that (1) values for a given property are missing from both mentions, even though values for that property were observed for at least one other mention in the data set; and (2) the value for a given property is missing from one (but not both) mentions. By using not one but several dummy values, the missing information itself can be used to provide a (potentially useful) signal to the model. However, using too many “dummy” values for narrow classes of exceptions can impede generalization.</p>
<div class="figure">
<figure class="IMG"><a id="fig8-4"/><img alt="" src="../images/Figure8-4.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig8-4">Figure 8.4</a>:</span> <span class="FIG">An illustration of feature extraction from a mention-pair of instances (X,Y) in the candidate set, given a library of <i>m</i> feature extraction functions. We have not drawn the property edges for clarity. Different concepts in the ontology are intuitively expressed by the text (e.g., the node “American” is of concept type “Cuisine” in the ontology). Not every feature function in the library may be applicable to every pair of attributes. Usually, a dummy value of 1 is used in the feature vector to express when one of the <i>m</i> functions is inapplicable to a concept. Given <i>k</i> attributes (in this case, <i>k</i> = 4) for each instance (of a given type), this ensures that each feature vector is of a dimensionality of exactly <i>km</i> unless a conscious decision is made to compare attributes across types. We use<span class="ellipsis"></span> to indicate the presence of multiple unknown features and feature values.</span></p></figcaption>
</figure>
</div>
<p>More specifically, given <i>n</i> properties, and <i>m</i> functions in a <i>feature library</i> (the set of feature functions), the feature vector would have at most <i>mn</i> elements. We say “at most” here because some features may be designed for specialized values (e.g., a feature that computes the number of milliseconds between two date values), and not be applicable to two arbitrary values or properties. In this case, the feature would be applicable only to those properties describing dates, and not <i>all n</i> properties. We also note that while more features are generally preferable, especially if they add value, care must be exercised because many feature functions may be correlated and could harm machine learning generalization, especially when the training set is small and highly heterogeneous, as is the case with real-world IM tasks. If it is not evident which feature functions should be retained and which should be discarded, a possible solution is to start by computing all possible <span aria-label="192" id="pg_192" role="doc-pagebreak"/>(i.e., ≤ <i>mn</i>) features and then apply a feature selection method like Lasso. A second, more traditional remedy is to invest domain engineering effort in assigning only a few features to each property. If no more than <i>c</i> feature functions are assigned to a property (<i>c &lt;&lt; m</i>), the feature vector cardinality will end up being much smaller than <i>mn</i>, potentially yielding more robust performance during actual deployment.</p>
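<p>The missing-value conventions discussed above can be sketched as follows. The specific dummy encodings, and the catch-all for feature functions that raise an exception on inapplicable values, are assumptions made for illustration rather than a standard.</p>

```python
MISSING_BOTH, MISSING_ONE, NOT_APPLICABLE = -1.0, -2.0, -3.0  # assumed encodings

def feature_vector(m_i, m_j, properties, feature_fns):
    """Build an (at most) k*m feature vector for a mention pair: apply
    each of the m pairwise feature functions to each of the k
    properties, substituting dummy values when a property is missing
    from both mentions, from exactly one, or when a function raises."""
    vec = []
    for prop in properties:
        v_i, v_j = m_i.get(prop), m_j.get(prop)
        for fn in feature_fns:
            if v_i is None and v_j is None:
                vec.append(MISSING_BOTH)
            elif v_i is None or v_j is None:
                vec.append(MISSING_ONE)
            else:
                try:
                    vec.append(float(fn(v_i, v_j)))
                except Exception:
                    vec.append(NOT_APPLICABLE)  # fn not applicable here
    return vec
```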
<p><span aria-label="193" id="pg_193" role="doc-pagebreak"/>Thus far, we have considered feature functions in the abstract, but what do <i>actual</i> feature functions look like? Luckily, there is an enormous body of work on several classes of useful feature functions—most notably, string similarity, but also phonetic similarity functions (although research on feature functions for numeric or date types is surprisingly sparse). We provide some guidance on available software packages in the section entitled “Software and Resources,” at the end of this chapter. For the sake of completeness, we provide a list of functions that have been popularly used in <a href="chapter_8.xhtml#tab8-1" id="rtab8-1">table 8.1</a>.</p>
<div class="table">
<p class="TT"><a id="tab8-1"/><span class="FIGN"><a href="#rtab8-1">Table 8.1</a>:</span> <span class="FIG">A non-comprehensive list of pairwise feature functions, classified by type, that are popular in IM and other related fields.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>Type</b></p></th>
<th class="TCH"><p class="TB"><b>Function</b></p></th>
<th class="TCH"><p class="TB"><b>Type</b></p></th>
<th class="TCH"><p class="TB"><b>Function</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB">Character</p></td>
<td class="TB"><p class="TB">Edit/Levenstein</p></td>
<td class="TB"><p class="TB">Token</p></td>
<td class="TB"><p class="TB">Monge-Elkan</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Affine Gap</p></td>
<td class="TB"/>
<td class="TB"><p class="TB">TF-IDF</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Smith-Waterman</p></td>
<td class="TB"/>
<td class="TB"><p class="TB">Soft TF-IDF</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Jaro</p></td>
<td class="TB"/>
<td class="TB"><p class="TB">Q-gram TF-IDF</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Q-gram</p></td>
<td class="TB"/>
<td class="TB"><p class="TB">Jaccard</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Phonetic</p></td>
<td class="TB"><p class="TB">Soundex</p></td>
<td class="TB"><p class="TB">Numeric</p></td>
<td class="TB"><p class="TB">Difference</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">NYSIIS</p></td>
<td class="TB"/>
<td class="TB"><p class="TB">Absolute Difference</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Metaphone</p></td>
<td class="TB"/>
<td class="TB"><p class="TB">Magnitude/Factor</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">ONCA</p></td>
<td class="TB"/>
<td class="TB"/>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Double Metaphone</p></td>
<td class="TB"/>
<td class="TB"/>
</tr>
<tr>
<td class="TSN" colspan="4"><p class="TSN">NYSIIS stands for <i>New York State Immunization Information System</i> and ONCA stands for <i>Oxford Name Compression Algorithm</i>.</p></td>
</tr>
</tbody>
</table>
</figure>
</div>
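<p>To give a flavor of the character- and token-based functions in table 8.1, here are minimal implementations of edit (Levenshtein) distance and token-level Jaccard similarity; a production system would typically use an established string-matching library instead.</p>

```python
def levenshtein(a, b):
    """Edit (Levenshtein) distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[len(b)]

def jaccard(a, b):
    """Token-level Jaccard similarity over whitespace tokens."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
```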
<p>Another problem that also arises in the context of KGs when extracting features from pairs of instances is similar to the problem of exceptions noted here; namely, how should the features be extracted if there are multiple values for a property for an instance? In general, how do we deal with properties where the complete spectrum (no values, multiple values, or a single value) is observed over the KG? As one example, imagine a property called <i>:marriageDate</i>. Instances of people who were never married will have no value for this property, while those married more than once will have multiple dates. One way to deal with these cases is to use the dummy-assignment methodology if no values exist, and to use a <i>two-layer</i> similarity function if multiple values exist for the property for one or both instances in the pair. The first layer in such functions consists of an atomic similarity function (e.g., if the set consists of string values, this could be the normalized similarity version <span aria-label="194" id="pg_194" role="doc-pagebreak"/>of Levenshtein distance), and the second layer consists of an aggregation. For example, we could consider a symmetric aggregation function such as <img alt="" class="inline" height="20" src="../images/pg194-in-1.png" width="212"/> where <i>s</i><sub><i>i</i></sub> and <img alt="" class="inline" height="20" src="../images/pg194-in-2.png" width="12"/> range over all strings in the two sets <i>S</i> and <i>S</i>. The idea is that, for each string in <i>S</i> (and similarly for <i>S</i>), we find the (not necessarily unique) string in the other set that has maximum normalized Levenshtein similarity to that string. We collect all such similarities and then average them (the purpose of the denominator in the expression). This function is symmetric, but it may not be as robust to outliers. Other similar functions, not necessarily symmetric, could be devised as well. 
A more sophisticated variant of the second layer is not to use a function directly, but instead to model the problem as a weighted bipartite graph, and then calculate some kind of score (e.g., average weight over the edges in the shortest path spanning the maximum number of nodes) over this graph. The more sophisticated the aggregation, or the more expensive the atomic similarity in the first layer, the more expensive the feature extraction.</p>
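<p>The symmetric two-layer aggregation just described can be sketched as follows; the standard library's <code>difflib</code> ratio is used here as a stand-in for normalized Levenshtein similarity in the first layer, purely to keep the example self-contained.</p>

```python
from difflib import SequenceMatcher

def atomic_sim(a, b):
    """First layer: an atomic string similarity in [0, 1]. difflib's
    ratio is a stand-in for normalized Levenshtein similarity here."""
    return SequenceMatcher(None, a, b).ratio()

def two_layer_similarity(S, T, sim=atomic_sim):
    """Second layer: for each value in each set, take its best-match
    similarity against the other set, then average all such scores
    over |S| + |T| (the symmetric aggregation described in the text)."""
    if not S or not T:
        return 0.0
    best_s = [max(sim(s, t) for t in T) for s in S]
    best_t = [max(sim(s, t) for s in S) for t in T]
    return (sum(best_s) + sum(best_t)) / (len(S) + len(T))
```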
<p>With feature extraction in place, each pair in <i>C</i> is converted to a feature vector. A machine learning classifier is trained on positively and negatively labeled training samples and used to assign scores to the vectors in the candidate set. Classifiers explored in the literature (see the “Bibliographic Notes” section) include classic models like random forest, multilayer perceptron, and support vector machine (SVM), all of which have been found to perform reasonably well. More recently, Siamese neural networks have been proposed for the IM task, although their effectiveness (specifically on IM) is still an open question, especially when training data is limited. A Siamese neural network uses (as the name suggests) the same weights while working simultaneously on two different input vectors to compute comparable output vectors. An intuitive analogy is fingerprint comparison or locality sensitive hashing (LSH), an advanced similarity technique that works well on large candidate sets when certain assumptions apply. In the “Bibliographic Notes” section, we provide some pointers to established work in these areas. One point to note is that, although these classifiers all tend to make the i.i.d. assumption, transitivity (which is a relational property that violates independence) plays a strong role in real-world IM determinations; if (<i>m</i><sub><i>i</i></sub><i>, m</i><sub><i>j</i></sub>) and (<i>m</i><sub><i>i</i></sub><i>, m</i><sub><i>k</i></sub>) are classified as matches with high probabilities, it is more likely than not that (<i>m</i><sub><i>j</i></sub><i>, m</i><sub><i>k</i></sub>) also represents a matching pair, thus establishing a dependence between the similarity score of (<i>m</i><sub><i>j</i></sub><i>, m</i><sub><i>k</i></sub>) and other pairs in which at least one of the two instances participated. 
While this observation is not typically employed at this stage, it motivates postprocessing steps such as clustering and soft transitive closure, as subsequently described.</p>
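<p>As an illustration of the two-layer feature extraction just described, the following minimal Python sketch (with hypothetical attribute names and a simple Jaccard measure as the atomic similarity) converts a candidate pair into a fixed-length feature vector; a classifier such as a random forest would then be trained on such vectors:</p>

```python
# A minimal sketch (not from the text) of two-layer feature extraction for a
# candidate pair. The attribute names "name" and "city" are hypothetical.

def jaccard(a: str, b: str) -> float:
    """Layer 1: an atomic, token-level similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def features(m1: dict, m2: dict, attrs=("name", "city")) -> list:
    """Layer 2: aggregate atomic similarities into a fixed-length vector."""
    sims = [jaccard(m1.get(a, ""), m2.get(a, "")) for a in attrs]
    return sims + [min(sims), max(sims), sum(sims) / len(sims)]

pair = ({"name": "John A. Doe", "city": "Springfield"},
        {"name": "John Doe", "city": "Springfield"})
vec = features(*pair)  # one feature vector per candidate pair
```

<p>Each candidate pair in <i>C</i> would be converted this way; the resulting labeled vectors form the training and scoring input for the classifier.</p>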
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec8-5"/><b>8.5Evaluating the Two-Step Pipeline</b></h2>
<p class="noindent">Because of the independence of blocking and similarity within the two-step formulation, the performance of each step can be controlled for the other when running experiments. This is convenient because of the increasing complexity of published blocking and similarity <span aria-label="195" id="pg_195" role="doc-pagebreak"/>algorithms in recent years. Despite the potential disadvantages of this independence (in practice, there can be unintended interdependencies between blocking and similarity, because feature functions and biases could often be traded between the two, especially if training sets are shared), this methodology has resulted in the adoption of well-defined evaluation metrics for both blocking and similarity. In recent years, while the independence assumption has been challenged in a small number of applications and algorithms (as just one example, a blocking technique called <i>comparisons propagation</i> proposes using the outcomes in the similarity step to estimate the usefulness of a block in real time, the premise being that if a block has produced too many nonduplicates by the similarity algorithm, it is best to discard it rather than finish processing it), as we briefly detail in the “Bibliographic Notes” section, implementation and adoption of these algorithms have mostly been limited to serial architectures owing to the need for continuous data sharing between the similarity and block-generating components. Experimentally, the benefits of such techniques over independent techniques like SN or Traditional Blocking (with skew-eliminating measures such as block purging) have not been established extensively enough, or in enough domains, to warrant widespread adoption. The two-step workflow, with both steps relatively independent, continues to be predominant in the vast majority of IM research. With this small caveat, we describe these metrics next.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec8-5-1"/><b>8.5.1Evaluating Blocking</b></h3>
<p class="noindent">The primary goal of blocking is to scale the naive one-step IM that pairs all mentions (order-independently) with each other. A blocking system accomplishes this goal by generating a smaller candidate set. If complexity reduction were the <i>only</i> goal, the blocking system could simply generate the empty set and obtain optimal performance. Such a system would be useless because it would generate a candidate set with zero duplicates coverage.</p>
<p>Thus, duplicates coverage and candidate set reduction are the two goals that every blocking system seeks to optimize. To formalize these measures, let <i>O</i> denote the <i>exhaustive set</i> of all <sup>|<i>M</i>|</sup><i>C</i><sub>2</sub> pairs; in other words, the candidate set that would be obtained in the <i>absence</i> of blocking. Let <i>O</i><sub><i>D</i></sub> denote the subset of <i>O</i> that contains all (and only) matching mention pairs (i.e., semantic duplicates). <i>O</i><sub><i>D</i></sub> is designated as the ground-truth or gold standard set. As in previous sections, let <i>C</i> denote the candidate set generated by blocking. Using this notation, <i>reduction ratio (RR)</i> is defined by equation (<a href="chapter_8.xhtml#eq8-1">8.1</a>):</p>
<figure class="DIS-IMG"><a id="eq8-1"/><img alt="" class="width" src="../images/eq8-1.png"/>
</figure>
<p>The higher the reduction ratio, the higher the complexity reduction achieved by blocking, relative to the exhaustive set. Less commonly, RR can also be evaluated relative to the candidate set <i>C</i><sub><i>b</i></sub> of a <i>baseline</i> blocking method by replacing <i>O</i> in equation (<a href="chapter_8.xhtml#eq8-1">8.1</a>) with <i>C</i><sub><i>b</i></sub>. Note that, because the exhaustive set grows quadratically with the number of mentions, even small differences in RR can <span aria-label="196" id="pg_196" role="doc-pagebreak"/>have an enormous impact in terms of run time. For example, if <i>O</i> contains 200 million pairs (which would only take a mentions-set <i>M</i> with about 20,000 mentions—that is, a relatively common-sized data set that would not even remotely qualify as Big Data), and a hypothetical Blocking System 1 achieves an RR of 99.7 percent, while Blocking System 2 achieves 99.5 percent, their candidate sets would differ by 400,000 pairs. In short, small differences in the RR matter a lot, and the larger the mentions-set, the larger the impact of even a 0.1 percent improvement in RR.</p>
<p>We can also define a <i>coverage</i> metric called pairs completeness (PC):</p>
<figure class="DIS-IMG"><a id="eq8-2"/><img alt="" class="width" src="../images/eq8-2.png"/>
</figure>
<p>One interpretation of PC is that it is nothing but a measure of recall (used for evaluating overall duplicates coverage in the similarity step, as described in the subsequent section) that <i>controls</i> for the errors in further learning or approximating <span class="font"></span>, which is <i>not</i> known. In other words, PC gives an <i>upper bound</i> on the recall metric. More simply, it is the answer to the following question: If we know <span class="font"></span> and apply it to the candidate set <i>C</i> output by blocking, what would be the recall? For example, if PC is only 70 percent, meaning that 30 percent of the matching instance pairs did not get included in <i>C</i>, then coverage on the full IM task can <i>never</i> exceed 70 percent, and in most cases, will be below 70 percent (i.e., if the similarity step has nonzero false negatives). PC clearly represents an upper bound to overall IM recall.</p>
<p>There is usually a trade-off between achieving optimal PC and RR values in real-world blocking systems. The trade-off is achieved by tuning a relevant parameter. There are two ways to represent this trade-off. The first is a single-point estimate of the F-measure (FM), or harmonic mean, between a given PC and RR:</p>
<figure class="DIS-IMG"><a id="eq8-3"/><img alt="" class="width" src="../images/eq8-3.png"/>
</figure>
<p>A single-point estimate is useful only when it is not feasible to run the blocking algorithm for multiple parameter values. Otherwise, a more visual representation of the trade-off can be achieved by plotting a curve of PC versus RR for different values of the parameters.</p>
<p>Another trade-off metric, Pairs Quality (PQ), is less commonly used than the FM of PC and RR:</p>
<figure class="DIS-IMG"><a id="eq8-4"/><img alt="" class="width" src="../images/eq8-4.png"/>
</figure>
<p>Superficially, PQ seems to be a better measure of the trade-off between PC and RR than the FM estimate, which weighs RR and PC equally, despite the quadratic dependence of RR. PQ has often been described as a precision metric for blocking, intuitively because a high PQ indicates that the generated blocks (and by virtue, the candidate set) are dense in duplicate pairs. Unfortunately, in practice, PQ gives estimates that can be non-interpretable, if not outright misleading. For example, suppose there were 5,000 duplicates <span aria-label="197" id="pg_197" role="doc-pagebreak"/>in the ground-truth, and <i>C</i> only contained 20 pairs, of which 12 are matching instance pairs. PQ, in this case, would be 12/20, or 60 percent. Assuming that <i>O</i> is large enough that RR is close to 100 percent, the FM (as defined here) would still be much less than 1 percent (as PC is only 12/5,000, or 0.24 percent). In other words, the negligible FM result would be correctly interpreted as an indication that, for practical purposes, the blocking process has failed. PQ alone, however, suggests otherwise, because 60 percent is not a bad performance to obtain on a difficult data set. An alternative, proposed by at least one author but (to the best of our knowledge) not used widely, is to compute and report the FM of PQ and PC.</p>
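<p>Assuming the standard metric definitions consistent with the surrounding discussion (RR as one minus the candidate-set fraction, PC as recovered duplicates over ground-truth duplicates, FM as their harmonic mean, and PQ as recovered duplicates over candidate-set size), the following Python sketch reproduces the worked example, showing how a healthy-looking PQ can coexist with a failed blocking run:</p>

```python
# Blocking evaluation metrics, sketched in plain Python under the standard
# definitions corresponding to equations (8.1)-(8.4).

def reduction_ratio(C: int, O: int) -> float:
    """RR: fraction of the exhaustive set pruned away by blocking."""
    return 1.0 - C / O

def pairs_completeness(true_in_C: int, O_D: int) -> float:
    """PC: fraction of ground-truth duplicates surviving into C."""
    return true_in_C / O_D

def f_measure(pc: float, rr: float) -> float:
    """FM: harmonic mean of PC and RR (single-point trade-off estimate)."""
    return 2 * pc * rr / (pc + rr)

def pairs_quality(true_in_C: int, C: int) -> float:
    """PQ: density of true duplicates in the candidate set."""
    return true_in_C / C

# Worked example from the text: 5,000 ground-truth duplicates, |C| = 20,
# of which 12 pairs are true matches, with |O| = 200 million.
pq = pairs_quality(12, 20)          # 0.6  -- looks healthy in isolation
pc = pairs_completeness(12, 5000)   # 0.0024 -- blocking actually failed
rr = reduction_ratio(20, 200_000_000)
fm = f_measure(pc, rr)              # well under 1 percent
```

<p>The point of the sketch is the contrast: PQ is 60 percent while FM is negligible, which is why PQ should not be reported in isolation.</p>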
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec8-5-2"/><b>8.5.2Evaluating Similarity</b></h3>
<p class="noindent">Recall that the ultimate goal of the similarity step is to partition the provided candidate set <i>C</i> into sets <i>C</i><sub><i>D</i></sub> and <i>C</i><sub><i>ND</i></sub> of matching and nonmatching instance pairs respectively. Thus, two obvious (though not exclusive) metrics that apply to similarity are <i>precision</i> and <i>recall</i>:</p>
<figure class="DIS-IMG"><a id="eq8-5"/><a id="eq8-6"/><img alt="" class="width" src="../images/eq8-5-6.png"/>
</figure>
<p>In words, precision is the ratio of true positives to the sum of true positives and false positives, while recall is the ratio of true positives to all positives in the ground-truth. Similar to PC and RR defined earlier, there is a trade-off between achieving high values for precision and recall. An FM estimate can again be defined for a single-point estimate, but a better, more visual, interpretation is achieved by plotting a curve of precision versus recall for multiple parameter values.</p>
<p>Note that, because similarity is defined as a binary classification problem in the machine learning interpretation of instance matching, other measures such as accuracy can also be defined. One reason why they are not considered in the IM literature is that they also evaluate performance on the negative (i.e., nonduplicates) class, which is not of interest in IM. An alternative to a precision-recall curve is Receiver Operating Characteristic (ROC), which plots true positives against false positives. Precision-recall curves have historically dominated ROC curves in the IM community, but nowadays, important machine learning packages (e.g., sklearn in Python) allow a user to print out various metrics and curves without any programming. In real life, we recommend printing out both the precision-recall and ROC curves to evaluate both (1) how well the IM system is doing in an “absolute” sense; and (2) how well the IM system is doing above <i>random</i>.</p>
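<p>A minimal sketch of computing a single-point precision and recall over a scored candidate set at a chosen threshold follows (the scores and labels are hypothetical); in practice, packages like sklearn can produce the full precision-recall and ROC curves directly:</p>

```python
# Single-point precision and recall over scored pairs at a given threshold.
# Scores come from the similarity step; labels are the ground truth
# (1 = matching pair, 0 = nonmatching pair).

def precision_recall(scores, labels, threshold):
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical similarity scores and labels for five candidate pairs.
scores = [0.95, 0.80, 0.65, 0.40, 0.10]
labels = [1,    1,    0,    1,    0]
p, r = precision_recall(scores, labels, threshold=0.6)  # p = 2/3, r = 2/3
```

<p>Sweeping the threshold and replotting yields the precision-recall curve described in the text.</p>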
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="198" id="pg_198" role="doc-pagebreak"/><a id="sec8-6"/><b>8.6Postsimilarity Steps</b></h2>
<p class="noindent">Although IM performance is strongly affected by various factors extrinsic to the actual classifier developed in the “Similarity” section, particularly the quality of the data being linked, much can be done to improve the IM process itself. Empirically, the performance of blocking affects the performance of overall IM, both by directly impacting the final recall score (and by extension, “trade-off” metrics like the FM), but also precision.</p>
<p>The features used in a classifier are also extremely important, although with the recent advent of KG embeddings (KGEs), feature crafting may soon belong to the past. However, there are limits to how much time one can spend refining features. Instead, a less labor-prone task may be to treat the scores as noisy inputs and use the dependencies between scores and mentions to obtain the final clusters of mentions (each cluster referring to the same entity), and to fuse the mentions in each cluster into a single composite entity. A problem that seems similar on the surface, but needs to be addressed at an earlier (i.e., similarity) stage, is that of modeling dependencies between mentions to obtain scores that reflect such dependencies. If domain knowledge is available, collective similarity methods, briefly described later, may be applicable, but if not, statistical methods like clustering and transitive closure can be used as a first line of attack. Another solution, which has been in vogue in the Semantic Web community, is to not “collapse” matching instances into clusters to begin with, but to instead embed them within an <i>Entity Name System (ENS)</i>. Next, we describe these two approaches.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec8-6-1"/><b>8.6.1Clustering and Transitive Closure</b></h3>
<p class="noindent">To recover clusters of mentions that refer to the same underlying entity, a naive method would be to first choose a duplicates threshold and then recover the connected components from the unweighted graph (<a href="chapter_8.xhtml#fig8-5" id="rfig8-5">figure 8.5</a>). This method tends to have a high empirical failure rate, for the reason that a few noisy scores can lead to (1) disjoint clusters that, in fact, refer to the same underlying entity; and more commonly, (2) large, skewed clusters that got connected because of a few rogue links. Earlier, we saw that the data skew problem is quite common in the real world (due to nonnormal probability distributions often showing up in the context of common names and other such entities). While the first problem arises because low scores got assigned to matching (in the ground-truth) pairs, the second problem arises because high scores got assigned to nonmatching instance pairs. The latter problem tends to be more serious than the first precisely because of data skew. Namely, given a large connected component, how should we break it up into smaller components? We could ostensibly use network theory to try and identify the set of miscreant nodes and edges that caused the problem to begin with. But which network-theoretic measure should we use? Even for established concepts like centrality, several ways of defining and measuring it exist. Most nontechnical (and in many cases, technical) users grappling with <span aria-label="199" id="pg_199" role="doc-pagebreak"/>the tuning and outputs of such algorithms do not have a deep enough knowledge of network theory to decide on the correct course of action.</p>
<div class="figure">
<figure class="IMG"><a id="fig8-5"/><img alt="" src="../images/Figure8-5.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig8-5">Figure 8.5</a>:</span> <span class="FIG">A naive hard-clustering method, based on connected components, for recovering instance clusters from thresholded, scored pairwise outputs obtained from a two-step IM pipeline. The nodes represent instances, the circles represent the clusters (in this case, equivalent to connected components), and an edge exists only between two instances if the similarity between them was above the threshold (and assuming also that they were compared in the first place—that is, were placed in the candidate set). A single edge between the two components (say, D and F) would “collapse” both components into one cluster. Soft clustering approaches take a more sophisticated approach. Other approaches do not apply thresholding but rather work directly with weighted edges in the graph.</span></p></figcaption>
</figure>
</div>
<p>Compared to algorithms that assume hard or unweighted edges, such as the connected components algorithm, we could go one step further and employ a class of <i>clustering</i> algorithms that generically accept a weighted graph as input, along with hyperparameters such as the number of clusters desired, and return the set of cluster assignments. Clustering is an important topic in the machine learning and data-mining literature, with applications well beyond IM. Examples of fairly established clustering algorithms include K-Means and spectral clustering. The important takeaway from this chapter is that, once we model the output of an IM similarity step as a weighted graph, any clustering algorithm can potentially be used, though some will achieve much higher performance than others. Unfortunately, it is not always clear which clustering algorithm is best for a given task or data set. Other considerations, such as the complexity of the algorithm, as well as its robustness to noisy scores, must also be taken into account. As mentioned earlier, in a typical IM pipeline, some high scores will inevitably get assigned to nonduplicate mentions and vice versa.</p>
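<p>The naive hard-clustering baseline of figure 8.5 can be sketched in a few lines: threshold the scored pairs, then recover connected components (here via a simple union-find structure, with made-up instance identifiers):</p>

```python
# Naive hard clustering: threshold scored pairs, then take connected
# components using union-find. Instance identifiers are illustrative.

def connected_components(pairs, threshold):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for (a, b), score in pairs.items():
        if score >= threshold:
            union(a, b)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

scored = {("A", "B"): 0.9, ("B", "C"): 0.85,
          ("D", "E"): 0.8, ("C", "D"): 0.2}
comps = connected_components(scored, threshold=0.5)  # {A,B,C} and {D,E}
```

<p>Note the brittleness discussed in the text: raising the C-D score above the threshold would collapse both components into one cluster.</p>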
<p>A potentially more serious problem is the specification of hyperparameters, which is by no means trivial in IM problems, even for simple algorithms like K-Means. The reason for this is that IM is (to use terminology proposed quite recently) a <i>microclustering</i> problem. In IM, the number of clusters grows much quicker with the size of the data than in traditional clustering algorithms. In many clustering evaluations applying the K-Means algorithm, <i>k</i> almost never exceeds 100, even when the data set to be clustered contains many instances. <span aria-label="200" id="pg_200" role="doc-pagebreak"/>In contrast, even simple IM benchmarks with a few thousand records can contain hundreds of duplicates.</p>
<p>What if we want to limit ourselves to the simple, but brittle, connected components scheme described here, but take the weights of the edges (the similarity score between instances paired in the blocking and evaluated in the similarity steps) into account to make the algorithm more robust? One way to think about this problem is as a <i>soft transitive closure</i>. A transitive closure occurs when we introduce a direct edge between A and B, if we observe a path between them. The idea is to recursively apply the transitive property: if A-B and B-C, then A-C. As expressed here, the transitive property is a hard one (i.e., there is no fuzziness or uncertainty in the truth of the property). Instead, suppose that we assume that the property only holds with probability 0.8 in the general case. As another added layer of complexity, we also assume that the “edge” A-B itself has an associated strength, which is the score assigned to the pair by the similarity step. Per this representation, the resulting outputs of the similarity step constitute a body of rules embodying the transitive property (in practice, the rule base is represented simply as a set of weighted statements, with each statement encoding a scored pairwise match as output by the similarity step).</p>
<p>Given this rule base, there are a number of <i>soft transitive closure</i> procedures in the literature (just like normal clustering) that take this rule base as input and output truth values for all edges. A popular method is Probabilistic Soft Logic (PSL), which has other use-cases beyond IM and can, in fact, be applied to any similarly structured problem with dependencies and probabilities. In chapter 9, on statistical relational learning (SRL), we provide more details on the technical machinery that underlies such frameworks. Generally, however, true values are assigned only to edges that have already been assigned a nonzero score; all things being equal, the higher the score, the higher the probability that the true value will be assigned. Furthermore, all things being equal, the pair A-C will have a higher probability of being assigned a true value the shorter the weighted path between A and C. The specific way in which the algorithm takes the weights and links into account before making a true/false determination for a pair depends on the actual algorithm. Once the edges have been output, we could always rerun the connected component algorithm, this time with the hope that weaker links that lead to the collapse of small components into large ones have been eliminated due to the probabilistic application of the transitive property (as opposed to the application of a single hard threshold, as in the naive case).</p>
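<p>As a toy illustration only (not PSL itself, which rests on far more general machinery), the following sketch applies the transitive property “softly”: an inferred edge receives the score of its best supporting path, discounted by an assumed rule weight of 0.8 per application of transitivity, and a hard threshold is applied only at the end:</p>

```python
# Toy soft transitive closure. The 0.8 rule weight and min-based path
# strength are illustrative assumptions, not PSL's actual semantics.

TRANSITIVITY_WEIGHT = 0.8  # assumed prior that the transitive rule holds

def soft_closure(edges, threshold):
    """edges: dict mapping frozenset({a, b}) -> similarity score."""
    scores = dict(edges)
    changed = True
    while changed:  # iterate to a fixpoint
        changed = False
        nodes = {n for e in scores for n in e}
        for a in nodes:
            for b in nodes:
                for c in nodes:
                    if len({a, b, c}) < 3:
                        continue
                    ab = scores.get(frozenset({a, b}), 0.0)
                    bc = scores.get(frozenset({b, c}), 0.0)
                    inferred = TRANSITIVITY_WEIGHT * min(ab, bc)
                    key = frozenset({a, c})
                    if inferred > scores.get(key, 0.0):
                        scores[key] = inferred
                        changed = True
    return {e for e, s in scores.items() if s >= threshold}

edges = {frozenset({"A", "B"}): 0.9, frozenset({"B", "C"}): 0.9,
         frozenset({"C", "D"}): 0.3}
kept = soft_closure(edges, threshold=0.5)
```

<p>Here A-C survives (two strong edges support it), while the weak C-D edge neither survives nor drags D into the A-B-C cluster, which is precisely the robustness the naive hard closure lacks.</p>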
<p>Another option that avoids the complexity of probabilistic frameworks while still maintaining a reasonable degree of robustness is to use either top-down or bottom-up (i.e., agglomerative) <i>hierarchical clustering</i>, which does not require the number of clusters to be specified. Hierarchical in this context means that the clustering algorithm creates the clusters by continually nesting data points. Initially, each data point is its own cluster (the “leaves” of the tree hierarchy), which in subsequent steps are combined into larger clusters. The final result, assuming the algorithm is allowed to fully terminate, is a treelike structure <span aria-label="201" id="pg_201" role="doc-pagebreak"/>with a single large cluster (containing all data points or instances) as the root. A conceptually elegant way to think about this tree is as a set of levels, where each level represents a partition of the instance-set. For example, at the lowermost level, each instance is its own cluster (singleton), while at the topmost (root) level, there is only one set of instances. In the intermediary levels, we have different partitions of instance sets, although the sets get more coarse-grained as we move from the leaves to the root. In principle, some decision criterion (not dissimilar to a threshold) can now be used to determine the level of the tree that should be used to recover a single set of clusters forming a partition of the instances. More sophisticated variants delay this choice until an actual query or application requires results. For streaming data, dynamic clustering and other such methods can also be used. In fact, because this part of the IM pipeline intersects with clustering, any clustering algorithm can be used, although there is little evidence to suggest that relatively classic algorithms like spectral and hierarchical clustering (appropriately set up and tuned) can be outperformed by more learning-heavy algorithms after controlling for the quality of previous steps like blocking and similarity.</p>
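<p>A minimal bottom-up (agglomerative) sketch with complete linkage follows: clusters are repeatedly merged until the best available merge exceeds a distance cutoff, yielding a partition without fixing the number of clusters in advance. The one-dimensional “mentions” here are purely illustrative:</p>

```python
# Minimal agglomerative hierarchical clustering with complete linkage and a
# distance cutoff as the (threshold-like) decision criterion.

def agglomerative(points, dist, cutoff):
    clusters = [{p} for p in points]

    def linkage(c1, c2):  # complete linkage: farthest pair across clusters
        return max(dist(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) > cutoff:
            break  # the level of the tree at which we stop merging
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 1-D "mentions" where distance is absolute difference.
points = [1.0, 1.1, 1.2, 5.0, 5.05]
result = agglomerative(points, lambda a, b: abs(a - b), cutoff=0.5)
```

<p>With the cutoff at 0.5, the run recovers two clusters; letting it run to termination instead would build the full tree described in the text.</p>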
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec8-6-2"/><b>8.6.2Entity Name System</b></h3>
<p class="noindent">Even if clusters of duplicate mentions have been identified, it still begs the question of what to do with them. For example, if there are five John Doe nodes in the KG, and we have managed to cluster them together, how do we collapse the five nodes into a single node, especially if there are inconsistencies between them? Different communities have converged on surprisingly different solutions to this problem. We list some of the important cases here. As it turns out, one of the solutions (ENS), proposed and adopted to varying extents in the Semantic Web (SW) community, has the interesting property that it can be implemented directly with pairs (the original output of the two-step pipeline), rather than clusters, though clustering and postprocessing can significantly improve the result.</p>
<p>In the Natural Language Processing (NLP) community, the issue of entity linking tends to arise with respect to an external knowledge base (KB) like Wikipedia or WordNet. Weve already alluded to the importance of entity linking in some of the chapters on KG construction. For example, imagine an extraction called “Katmandu,” which is a slightly unusual spelling of the Nepalese capital. If were able to link this extraction to the canonical entity in Wikipedia, we not only preclude the issue of resolving mentions like “Katmandu” to other mentions in the text that refer to the same entity (e.g., “Kathmandu”), but we also reduce the error of typing the entity incorrectly (i.e., with successful entity linking, we would correctly deduce that this extraction has type “City” rather than “Person”). These KBs are already resolved, in that there is a single mention for each unique entity. Furthermore, once an entity extracted from text has been linked to an entity in the external KB, it can be represented canonically by using the representation or the label of the entity in the external KB. This option is also very popular in non-NLP communities when it is assumed that the external KB is complete, or stated more weakly, that it is not complete but contains all <span aria-label="202" id="pg_202" role="doc-pagebreak"/>“entities of interest” (in other words, it is complete with respect to the domain). This also shows why this method of representation is not always appropriate (e.g., if there are many long-tail entities in the text or the input KG, we will obtain clusters of entities that do not resolve to an external KB node). Such clusters are known in the NLP community as NIL clusters. One way to resolve such NIL clusters is to choose a <i>preferred name</i> or a <i>canonical string</i> from the labels of the mentions in the cluster. Ideally, the label should have high information value. For example, “United Nations” would be preferred over “UN.”</p>
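<p>A simple sketch of the preferred-name idea for a NIL cluster follows, approximating “information value” by token count and then character length (a hypothetical heuristic, not a standard one):</p>

```python
# Hypothetical preferred-name heuristic for a NIL cluster: favor the label
# with the most tokens, breaking ties by character length.

def preferred_name(labels):
    return max(labels, key=lambda s: (len(s.split()), len(s)))

cluster = ["UN", "United Nations", "U.N."]
canonical = preferred_name(cluster)  # "United Nations"
```

<p>Real systems would fold in frequency, provenance, and type information rather than rely on surface length alone.</p>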
<p>Even if we could somehow discover a good heuristic for selecting a preferred name, actual KGs have nodes that have richer information sets than just string labels. In the <span aria-label="203" id="pg_203" role="doc-pagebreak"/>SW community, the assumption is that structured data (including KGs) are published in a decentralized manner by parties that dont necessarily have equivalent interests or commitments to quality. At the same time, because of its flexible representational philosophy, the SW makes it easy to define the output of IM as a property (a labeled edge), just like any other relationship. One such property is <i>owl:sameAs</i>, which is near-universally employed by SW practitioners to represent the semantic equivalence of two RDF identifiers. Also, <i>owl:sameAs</i> is imbued with special semantics that are taken into account by SW reasoners that have to construct answers to aggregation queries and other complex SPARQL graph pattern queries. Given <i>owl:sameAs</i> links, we can construct a (usually distributed) ENS. The ENS is defined simply as a <i>thesaurus of mentions</i>, where each <i>owl:sameAs</i> link is treated as a declaration of synonymy. In practice, there are several ways to populate an ENS (<a href="chapter_8.xhtml#fig8-6" id="rfig8-6">figure 8.6</a>). While the problem of reconciling inconsistencies does not go away simply by declaring <i>owl:sameAs</i> links or populating an ENS, it defers decisions to applications and querying infrastructure further down the pipeline. Provenance, reification, and other sophisticated facilities available in RDF and SW vocabularies and software can be used to resolve entities compactly without collapsing them into clusters and losing information. The reason that this works well is that applications have different needs and quality requirements, and it is not always clear what the optimal level of granularity is for entity clusters.</p>
<div class="figure">
<figure class="IMG"><a id="fig8-6"/><img alt="" src="../images/Figure8-6.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig8-6">Figure 8.6</a>:</span> <span class="FIG">An ENS as a representation of IM outputs. The top image illustrates an ENS population given both a clustering scheme and entity linking, while the bottom image only assumes basic pairwise outputs from the similarity step. A population given either clustering or entity linking (but not both) is similarly feasible. Applications could directly query the ENS, sometimes in a pay-as-you-go fashion. Using additional RDF machinery, the <i>owl:sameAs</i> (or other ontologically equivalent) links could be further annotated with provenance, confidence, and other meta-attributes.</span></p></figcaption>
</figure>
</div>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec8-7"/><b>8.7Formalizing Instance Matching: Swoosh</b></h2>
<p class="noindent">Swoosh<sup><a href="chapter_8.xhtml#fn1x8" id="fn1x8-bk">1</a></sup> is a family of generic algorithms for solving IM. It assumes black-box functions for merging and pairwise comparison that are invoked by an IM engine. Given such black boxes, the idea behind Swoosh is to use an algorithm for <i>efficiently</i> performing the entire IM workflow itself (i.e., use strategies to minimize the number of invocations to the expensive black boxes). Although Swoosh sounds similar to blocking, its optimizations are based on answering a different question: given some specific properties obeyed by the black-box comparison and merge functions, what is an algorithm that optimally resolves a KB, where optimality is measured in terms of the number of instance comparisons?</p>
<p>While the algorithmic contributions of the proposed Swoosh variants (such as R-Swoosh, G-Swoosh, and F-Swoosh) are impressive in themselves, an important formal contribution is the formalization of the IM problem itself. Previously, the only other significant formal work on IM was the Fellegi-Sunter model that we described earlier in the context of picking two thresholds. Swoosh presents a fundamental framework for generic IM by first defining the concepts of <i>merge</i> and <i>match</i> functions. Intuitively, as the name suggests, a merge function is a partial function from <span class="font"></span>×<span class="font"></span> (<span class="font"></span> being the set <span aria-label="204" id="pg_204" role="doc-pagebreak"/>of mentions), and it captures the computation of merged instances (hence, the function is applicable only for pairs of matching instances); in contrast, the match function is like the pairwise linking function <span class="font"></span><sub><i>p</i></sub>, defined over <span class="font"></span>×<span class="font"></span> and used to determine if the mention pair (<i>m</i><sub><i>i</i></sub><i>, m</i><sub><i>j</i></sub>) ∈<span class="font"></span>×<span class="font"></span> represents the same underlying entity. By way of terminology, if the match function returns true for (<i>m</i><sub><i>i</i></sub><i>, m</i><sub><i>j</i></sub>), then we say that <i>m</i><sub><i>i</i></sub> ≈ <i>m</i><sub><i>j</i></sub>; similarly, the instance obtained by merging <i>m</i><sub><i>i</i></sub> and <i>m</i><sub><i>j</i></sub> is denoted by the symbol <i>&lt; m</i><sub><i>i</i></sub><i>, m</i><sub><i>j</i></sub> &gt;.</p>
<p>Following these definitions, a <i>merge closure</i> of a set of instances <span class="font"></span> is defined as the smallest set of instances <i>S</i> such that <i>S</i> is a superset of <span class="font"></span> and, for any two instances <i>m</i><sub><i>i</i></sub> and <i>m</i><sub><i>j</i></sub>, it is always the case that if <i>m</i><sub><i>i</i></sub> ≈ <i>m</i><sub><i>j</i></sub>, then <i>&lt; m</i><sub><i>i</i></sub><i>, m</i><sub><i>j</i></sub> &gt; is in <i>S</i>.</p>
<p>Finally, the concept of <i>domination</i> is defined; intuitively, an instance <i>m</i><sub><i>i</i></sub> is dominated by <i>m</i><sub><i>j</i></sub> if <i>m</i><sub><i>j</i></sub> contains more information about the underlying instance than <i>m</i><sub><i>i</i></sub>. For example, imagine a simple instance represented only by a label “John White,” and another instance with label “J. White” about the same individual. We could argue that the first instance dominates the second instance, because it contains information about the first name as well. For the sake of theory, let us assume that there is an oracle that returns true when presented with two instances, if the first instance dominates the second. We can now define a KG, <i>G</i>, to be dominated by another KG, <i>G</i>′, if for every instance in <i>G</i>, there is at least one instance in <i>G</i>′ that dominates that instance. Mathematically, KG domination is a <i>partial pre-order</i> (i.e., it is a reflexive and transitive relation). It is not a partial order because it is not antisymmetric (and neither is it symmetric): it is possible for <i>G</i> and <i>G</i>′ to dominate each other. On the other hand, instance domination is antisymmetric: if <i>m</i><sub><i>i</i></sub> dominates <i>m</i><sub><i>j</i></sub> and <i>m</i><sub><i>j</i></sub> dominates <i>m</i><sub><i>i</i></sub>, then it is necessarily the case that <i>m</i><sub><i>i</i></sub> = <i>m</i><sub><i>j</i></sub> (note the set-theoretic notion of <i>equivalence</i> has been used here, not the shorthand symbol for the match function).</p>
<p>The reason why these terms are important is that they can be used to provide a precise definition of IM: the IM of a mentions-set <i>M</i> is defined as a minimal set <i>I</i> that meets two conditions: it is a subset of the merge closure <i>M</i><sup>*</sup> of <i>M</i>, and it dominates <i>M</i><sup>*</sup>. By minimal, we mean that no strict subset of <i>I</i> satisfies both of these conditions. Although the definition does not by itself guarantee that <i>I</i> is unique, the authors of Swoosh prove that it is, in fact, unique, given the definitions of merge closure and domination. In a slight abuse of terminology, we refer to <i>I</i> as <i>the</i> IM of a given mentions-set, <i>M</i>.</p>
<p>Without assigning additional properties to the match and merge functions, it is difficult to develop practical algorithms. The authors do propose an algorithm, G-Swoosh, as a baseline that does not require the match and merge functions to obey any specific properties. It improves over a brute-force algorithm that is iterative and pairwise (and therefore highly suboptimal in terms of the number of required comparisons) by intelligently ordering the match and merge calls. For the original pseudocode and proofs, we refer the reader to Benjelloun et al. (2009), also referenced in the “Bibliographic Notes” section. G-Swoosh also includes the guarantee that it will terminate and compute the IM for an instance-set if the IM is finite, and the authors present proofs that it is in fact optimal, in the sense that no <span aria-label="205" id="pg_205" role="doc-pagebreak"/>algorithm that computes the IM of a mentions-set correctly will make fewer comparisons in the worst case.</p>
<p>To design algorithms beyond Swoosh, it becomes necessary to assign some properties to the match and merge functions. Interestingly, Swoosh defines four properties, labeled using the acronym ICAR, that are relevant for real-world IM scenarios:</p>
<ul class="numbered">
<li class="NL">1.Idempotence: An instance always matches itself, and merging an instance with itself yields the same instance.</li>
<li class="NL">2.Commutativity: For all pairs of instances, <i>m</i><sub><i>i</i></sub><i>m</i><sub><i>j</i></sub> if and only if <i>m</i><sub><i>j</i></sub><i>m</i><sub><i>i</i></sub>, and if <i>m</i><sub><i>i</i></sub><i>m</i><sub><i>j</i></sub>, then <i>&lt; m</i><sub><i>i</i></sub><i>, m</i><sub><i>j</i></sub> &gt; = <i>&lt; m</i><sub><i>j</i></sub><i>, m</i><sub><i>i</i></sub> &gt; (again, note the equivalence and not the match symbol in this last equation).</li>
<li class="NL">3.Associativity: Associativity is defined on the merge operation; namely, if it is the case that <i>&lt; m</i><sub><i>i</i></sub><i>, &lt; m</i><sub><i>j</i></sub><i>, m</i><sub><i>k</i></sub> <i>&gt;&gt;</i> and <i>&lt;&lt; m</i><sub><i>i</i></sub><i>, m</i><sub><i>j</i></sub> <i>&gt;, m</i><sub><i>k</i></sub> &gt; both exist, then the two are equivalent.</li>
<li class="NL">4.Representativity: If <i>m</i><sub><i>k</i></sub> = <i>&lt; m</i><sub><i>i</i></sub><i>, m</i><sub><i>j</i></sub> &gt;, then for any <i>m</i><sub><i>l</i></sub> such that <i>m</i><sub><i>i</i></sub><i>m</i><sub><i>l</i></sub>, we have <i>m</i><sub><i>k</i></sub><i>m</i><sub><i>l</i></sub>.</li>
</ul>
<p>Note that the last of the properties (representativity) is the least intuitive of the four. One way of thinking about it is that the instance <i>m</i><sub><i>k</i></sub> obtained from merging <i>m</i><sub><i>i</i></sub> and <i>m</i><sub><i>j</i></sub> represents the original instances, in the sense that any instance that would have matched <i>m</i><sub><i>i</i></sub> (or <i>m</i><sub><i>j</i></sub> by commutativity) will also match <i>m</i><sub><i>k</i></sub> (i.e., there is no negative evidence that can be created through the merge of <i>m</i><sub><i>i</i></sub> and <i>m</i><sub><i>j</i></sub> that would actually prevent <i>m</i><sub><i>k</i></sub> from matching any other instance that would have matched <i>m</i><sub><i>i</i></sub> or <i>m</i><sub><i>j</i></sub>). Another important point to note about these four properties is that they do not imply the transitivity of the match function.</p>
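<p>As a quick illustration, consider a toy match/merge pair (our own construction, not from Swoosh): instances are frozensets of (attribute, value) pairs, match tests for a shared pair, and merge takes the union. The four ICAR properties can then be spot-checked on randomly generated instances:</p>

```python
import random
random.seed(0)

def toy_match(mi, mj):
    # Toy match: instances share at least one (attribute, value) pair.
    return bool(mi & mj)

def toy_merge(mi, mj):
    # Toy merge: the union of the two instances' pairs.
    return mi | mj

pool = [("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 3)]
samples = [frozenset(random.sample(pool, random.randint(1, 3)))
           for _ in range(30)]

# Idempotence: m matches itself, and merging m with itself yields m.
idempotent = all(toy_match(m, m) and toy_merge(m, m) == m for m in samples)

# Commutativity of both match and merge.
commutative = all(toy_match(x, y) == toy_match(y, x)
                  and toy_merge(x, y) == toy_merge(y, x)
                  for x in samples for y in samples)

# Associativity of merge.
associative = all(toy_merge(x, toy_merge(y, z)) == toy_merge(toy_merge(x, y), z)
                  for x in samples for y in samples for z in samples)

# Representativity: if l matches x, then l matches the merge of x and y.
representative = all(not toy_match(x, l) or toy_match(l, toy_merge(x, y))
                     for x in samples for y in samples for l in samples)
```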
<p>Recall that the G-Swoosh algorithm did not assume the ICAR properties. In contrast, if we do assume the ICAR properties, it is possible to significantly improve on G-Swoosh (the authors term the new algorithm “R-Swoosh”) in the worst case. Just like G-Swoosh, R-Swoosh maintains a certain optimality (i.e., it is possible to show that there exists at least one mentions-set where any algorithm has to perform at least as many comparisons as R-Swoosh to obtain the correct IM of the set). A third algorithm, F-Swoosh, further improves on R-Swoosh by eliminating redundant feature-based comparisons between values in the two mentions or instances being evaluated as a pair.</p>
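<p>The control flow of R-Swoosh can be sketched in a few lines of Python. This is a paraphrase of the description above, not the original pseudocode, and the toy match/merge functions (shared-pair matching, union merging) are our own illustrative stand-ins, which do satisfy ICAR.</p>

```python
def r_swoosh(instances, match, merge):
    # Sketch of R-Swoosh: valid only when match/merge satisfy ICAR.
    todo = list(instances)   # instances not yet compared
    done = []                # instances known to be mutually non-matching
    while todo:
        r = todo.pop()
        partner = next((r2 for r2 in done if match(r, r2)), None)
        if partner is None:
            done.append(r)
        else:
            # Replace the matched pair with its merge and re-process it.
            done.remove(partner)
            todo.append(merge(r, partner))
    return done

def toy_match(a, b):
    return bool(a & b)       # share an (attribute, value) pair

def toy_merge(a, b):
    return a | b             # union of pairs

m1 = frozenset({("name", "John White"), ("city", "Oslo")})
m2 = frozenset({("name", "J. White"), ("city", "Oslo")})
m3 = frozenset({("name", "Ada Lovelace")})
resolved = r_swoosh([m1, m2, m3], toy_match, toy_merge)
# m1 and m2 collapse into one merged instance; m3 stays separate
```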
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec8-8"/><b>8.8A Note on Research Frontiers</b></h2>
<p class="noindent">Research on IM, spanning the three classic frameworks discussed earlier and the range of blocking, similarity, and postprocessing techniques described in this chapter, has shown no signs of abating; in fact, we are just starting to see a renewed interest in the research community in addressing some long-standing challenges (and assumptions) in IM. Next, we very briefly describe some trends that seem to have taken root in various communities and are likely to persist in the foreseeable future.</p>
<p><span aria-label="206" id="pg_206" role="doc-pagebreak"/>The first is the advent of collective methods for IM that we alluded to previously when describing soft transitive closure. Collective methods, including SRL, are not just limited to probabilistic application of the transitive property. In many domains in the real world, domain-specific relational dependencies are ubiquitous, if not always explicit. Here, we consider the classic example of resolving author names in the publication domain. The input is a large KG consisting of publications and their incumbent details (including authors), similar to the motivating example described in the introduction. One domain-specific relational dependency that is often found to be true in the publication domain is the tendency of coauthors to repeat being coauthors (i.e., if author <i>A</i> and author <i>B</i> have coauthored a publication <i>P</i>, it is quite likely that they have coauthored other publications as well). How can we use this for IM? Suppose that we were trying to resolve authors and publications in a domain-specific KG initially acquired through the methods described in part III of this book. Intuitively, we want the system to use the rule: if <i>A</i> and <i>B</i> have coauthored <i>P</i>, and if <i>A</i>′ and <i>B</i>′ have coauthored <i>P</i>′, then the fact that <i>A</i> = <i>A</i>′ should positively influence the likelihood of <i>B</i> = <i>B</i>′. In other words, all things being equal, we are using coauthorship repetition (prior domain knowledge) to increase the score of the pair (<i>B</i>, <i>B</i>′), given evidence that <i>A</i> = <i>A</i>′ and the coauthorship of <i>A</i>, <i>B</i> and of <i>A</i>′, <i>B</i>′. Note that this is very different from transitivity; we are not claiming that <i>A</i> and <i>B</i> are the same author. 
Collective methods like Markov Logic Networks (MLNs) and PSL allow domain experts to not only specify these rules (in many cases, probabilistically) using elegant formal frameworks, but also come with inference algorithms that can process these rules, along with actual IM outputs, to yield the underlying true KG with resolved instances. In chapter 9, we cover this class of techniques in much more detail.</p>
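<p>A crude, purely illustrative stand-in for such collective reasoning can be written directly in Python (real systems like MLNs and PSL perform joint probabilistic inference instead): one round of the coauthorship rule takes candidate pair scores and boosts the score of (B, B′) whenever a confident match (A, A′) connects their coauthor sets. All names, scores, and parameters below are invented.</p>

```python
# Hypothetical similarity scores for candidate author pairs.
scores = {
    ("A", "A'"): 0.95,   # high-confidence match
    ("B", "B'"): 0.60,   # ambiguous on name similarity alone
    ("C", "C'"): 0.20,
}
coauthors = {            # author -> set of coauthors
    "A": {"B"}, "B": {"A"},
    "A'": {"B'"}, "B'": {"A'"},
    "C": set(), "C'": set(),
}

def collective_boost(scores, coauthors, threshold=0.9, boost=0.2):
    # One round of the coauthorship rule: if (x, y) is a confident
    # match and x, y have coauthors b, b2 forming a candidate pair,
    # raise that pair's score (capped at 1.0).
    updated = dict(scores)
    for (x, y), s in scores.items():
        if s < threshold:
            continue
        for b in coauthors.get(x, ()):
            for b2 in coauthors.get(y, ()):
                if (b, b2) in updated:
                    updated[(b, b2)] = min(1.0, updated[(b, b2)] + boost)
    return updated

new_scores = collective_boost(scores, coauthors)
# (B, B') is boosted by 0.2 via the confident (A, A') match;
# (C, C') has no connecting coauthorship evidence and is unchanged
```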
<p>Second, despite the advent of these sophisticated techniques, the fact remains that (especially in difficult domains) it is very hard to achieve reasonable IM performance, particularly if labeled data is not available. Labeling is a nontrivial problem because, in IM, pairs have to be labeled rather than instances. Random sampling is precluded because of the quadratic growth in the size of the set of pairs (recall the examples provided in the section “Evaluating Blocking,” earlier in this chapter, in the context of the RR metric), as well as the fact that almost all pairs within this exhaustive set are nonmatches. One way to select good samples for labeling, whether up front or within frameworks like active learning, is to employ efficient crowdsourcing, leveraging microworker platforms like Amazon Mechanical Turk or, more recently, services such as Amazon SageMaker Ground Truth. Several such frameworks for IM have been presented in the literature; we note a few papers in the “Bibliographic Notes” section.</p>
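<p>Back-of-the-envelope arithmetic shows why uniform random sampling of pairs fails. With the assumed numbers below (100,000 instances, roughly two mentions per real entity), a random sample of 10,000 pairs is expected to contain well under one true match:</p>

```python
def pair_space(n):
    # Number of distinct unordered pairs over n instances.
    return n * (n - 1) // 2

n_instances = 100_000
n_true_matches = 50_000          # assumed: ~2 mentions per real entity
total_pairs = pair_space(n_instances)

# Fraction of the exhaustive pair space that is a true match.
match_rate = n_true_matches / total_pairs

# Expected number of matches in a uniform random sample of 10,000 pairs.
expected_matches_in_sample = 10_000 * match_rate
```

On these assumptions the pair space holds nearly five billion pairs, and a 10,000-pair random sample is expected to contain only about 0.1 true matches, which is why targeted sampling or active learning is needed.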
<p>Third is the advent of schema-free methods, which are especially apt when the KG has been constructed using Open IE (without the benefit of a domain ontology), or has an otherwise extremely broad domain, such as with DBpedia, or even the Gene Ontology. Schema-free methods are also relevant when two KGs have to be linked to one another <span aria-label="207" id="pg_207" role="doc-pagebreak"/>and are individually typed using different ontologies (e.g., linking DBpedia instances to Freebase instances).</p>
<p>To understand the notion of schema-free, consider again the methods proposed earlier in this chapter when describing the similarity step. Therein, we constructed the feature vector (between a pair of instances) of maximum cardinality <i>mn</i>, given <i>m</i> feature functions in the library and <i>n</i> attributes. The assumption, however, was that the two instances would share the same schema (i.e., have the same <i>n</i> attributes even though one or both of the instances might not have <i>values</i> defined for the attribute, in which case dummy values, or some other similar scheme, would have to be employed). This becomes problematic in several real-world scenarios. One scenario occurs when the set of attributes (roughly, the schema or ontology) is too diffuse, which happens when multiple approximately equivalent properties (e.g., <i>:birthdate</i> and <i>:bornOn</i>) are present in the ontology, likely due to historical reasons, as a consequence of which the two instances may define their values using these different properties. One instance may define the birth date value as the object of the <i>:birthdate</i> property, while another may use <i>:bornOn</i>. Short of cleaning up the ontology and the KG, which is almost never feasible in practice, or doing ontology matching (which is sometimes feasible but can lead to its own set of problems) by determining which concepts and attributes are equivalent to one another prior to executing an IM workflow, it is not clear how this problem can be resolved. A second scenario in which mismatched concepts or attributes present problems is when there are two differently ontologized or otherwise independently modeled KGs that need to be linked to one another, as noted previously with DBpedia, a KG that external data sets frequently have to link to in order to publish in the Linked Open Data ecosystem (chapter 14).</p>
<p>The earliest schema-free algorithms sought to address these problems by initially treating instances as bags of values, and then applying document-centric algorithms to obtain feature vectors (e.g., by using tf-idf or, more recently, techniques like LSH). However, the limitations of this solution are fairly obvious. Consider, for example, date properties like <i>:bornOn</i> and <i>:diedOn</i>. Without the context of a property, it is difficult to distinguish between two instances where the birth date of one coincides with the death date of another. With strings, the problem tends to be less severe, but with numbers, it can be even more severe than with dates. Another issue is that of blocking, as state-of-the-art blocking scheme learners and blocking methods require a consistent and single set of attributes to be defined.</p>
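<p>A minimal sketch of the bag-of-values approach, with invented instances and an assumed document frequency table, makes the date problem concrete: once properties are discarded, a birth date on one instance and an identical death date on another contribute exactly like a genuinely shared value.</p>

```python
import math
from collections import Counter

def bag_of_values(instance):
    # Schema-free view: discard the properties, keep only the values.
    return Counter(str(v) for _, v in instance)

def tfidf_cosine(bag_a, bag_b, df, n_docs):
    # Cosine similarity of tf-idf vectors over the two value bags.
    def weight(bag):
        return {t: c * math.log(n_docs / (1 + df.get(t, 0)))
                for t, c in bag.items()}
    wa, wb = weight(bag_a), weight(bag_b)
    dot = sum(wa[t] * wb.get(t, 0.0) for t in wa)
    na = math.sqrt(sum(v * v for v in wa.values()))
    nb = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Two different people: one *born* 1920-01-01, the other *died* 1920-01-01.
p1 = [(":name", "John White"), (":bornOn", "1920-01-01")]
p2 = [(":name", "Jane Black"), (":diedOn", "1920-01-01")]

# Assumed document frequencies over a hypothetical 1,000-instance corpus.
df = {"1920-01-01": 2, "John White": 1, "Jane Black": 1}
sim = tfidf_cosine(bag_of_values(p1), bag_of_values(p2), df, n_docs=1000)
# sim is substantial: the shared date inflates similarity once the
# distinction between :bornOn and :diedOn has been thrown away
```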
<p>More recently, therefore, a new set of algorithms for these so-called highly heterogeneous information spaces has been proposed that is schema-free but does not ignore the structure of the data by collapsing all values into bags of strings, numbers, and dates. Instead, the usual approach is to treat the instance as a set of key-value pairs and to consider similarities not just between pairs of values, but also between pairs of keys. Work on developing such schema-free algorithms has continued to flourish, particularly in the Semantic Web.</p>
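<p>A simple key-value variant can be sketched as follows (an illustration of the general idea, not any published algorithm): a value comparison contributes only when the keys themselves are sufficiently similar, so a date stored under <i>:birthdate</i> can be compared with one under <i>:birth_date</i> but not with one under <i>:death_date</i>. The property names, person names, and threshold are all invented.</p>

```python
from difflib import SequenceMatcher

def str_sim(a, b):
    # Edit-based string similarity in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def kv_similarity(inst_a, inst_b, key_threshold=0.8):
    # Schema-free similarity over key-value pairs: a value comparison
    # only counts when the keys themselves are sufficiently similar.
    scores = []
    for ka, va in inst_a.items():
        best = 0.0
        for kb, vb in inst_b.items():
            if str_sim(ka, kb) >= key_threshold:
                best = max(best, str_sim(va, vb))
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0

a = {":birthdate": "1920-01-01", ":name": "John White"}
b = {":birth_date": "1920-01-01", ":name": "J. White"}
c = {":death_date": "1920-01-01", ":name": "Jane Black"}

sim_ab = kv_similarity(a, b)   # high: near-identical keys and values
sim_ac = kv_similarity(a, c)   # low: the shared date is never compared
```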
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="208" id="pg_208" role="doc-pagebreak"/><a id="sec8-9"/><b>8.9Data Cleaning beyond Instance Matching</b></h2>
<p class="noindent">In this chapter, our main focus has been on IM as one of the most fundamental KG completion problems that has to be solved in the real world, based on both empirical arguments that the problem is ubiquitous when entities, relations, and triples in the KG have been extracted from multiple data sources, and on historical arguments that IM has proved to be a difficult AI problem in the database (and NLP) communities. However, IM is by no means the only difficult data-cleaning problem that needs to be solved to achieve a truly robust and complete KG. In the world of databases, data cleaning has traditionally been defined as the general problem of detecting and removing errors and inconsistencies from data in order to improve data quality. Data quality problems are present in single data collections, such as files and databases (e.g., due to misspellings during data entry, missing information, or other invalid data), but also multiple data sources that need to be integrated (e.g., in data warehouses, federated database systems, or global web-based information systems).</p>
<p>Traditionally, data cleaning has tended to be associated more with structured than natural-language data. Arguably, natural-language tasks like coreference resolution can be interpreted as data cleaning, but the terminology hasnt caught on yet in the NLP community. One reason might be that in the database and structured data communities, the input is raw data in the form of tables and files. In NLP, the raw data is almost always text, and the mentions that have to be resolved are extracted to begin with. Any operations performed on the text, such as punctuation removal, tokenization, and lemmatization, are usually referred to as <i>data preprocessing</i>.</p>
<p>While such distinctions have worked well in the past, when all of these communities were relatively well separated, when it comes to KGs, the distinctions become blurred (or start overlapping completely), and we cannot pragmatically afford such distinctions if we want to build a good KG.</p>
<p>Historically, data warehouses have required and provided extensive support for data cleaning. Such warehouses load and continuously refresh large data sets from a variety of sources (e.g., a Walmart data warehouse may be pulling in data from many of its locations around the country on a periodic basis), and because of heterogeneity in the real world, not every source that is integrated is of equal quality. Data warehouses are used for decision-making, so good data quality is essential for avoiding wrong conclusions. Just as KG completion follows KG construction, data warehouses face the data-cleaning challenge during the so-called extraction, transformation, and loading (ETL) process, illustrated in <a href="chapter_8.xhtml#fig8-7" id="rfig8-7">figure 8.7</a>, with the cleaning modules typically executed in a separate data-staging area before loading transformed data into the warehouse. A large number of tools of varying functionality are available to support these tasks, but a significant portion of cleaning often has to be done manually, or by low-level rule- or heuristics-based programs that are accompanied by their own maintenance and robustness challenges.</p>
<div class="figure">
<figure class="IMG"><a id="fig8-7"/><img alt="" src="../images/Figure8-7.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig8-7">Figure 8.7</a>:</span> <span class="FIG">An illustration of ETL for database-centric workflows. When applied to KGs, the “extract” phase is equivalent to the methods described in part II of this book, while the “transform” phase would include steps like IM, SRL (described in chapter 9), and other approaches that lead to a single, clean KG. The “load” phase would upload the KG to an infrastructure where it can be queried, accessed, and used to facilitate analytics, a subject that will be covered in part IV.</span></p></figcaption>
</figure>
</div>
<p><span aria-label="209" id="pg_209" role="doc-pagebreak"/><a href="chapter_8.xhtml#fig8-8" id="rfig8-8">Figure 8.8</a> provides a general taxonomy of data-cleaning problems seen through the lens of the database and data warehouse communities. In many cases, similar problems can arise with KGs (e.g., just like ordinary databases, KGs in particular domains can contain name and structural conflicts, as well as outdated spatiotemporal data). While in some of these cases, techniques can be framed or shared between the database and KG communities, in other cases, the problem has to be framed differently in a graph setting than in a tabular setting. For example, the missing value imputation problem, for which solutions are traditionally statistics-based, targets numerical columns (such as income or age) in databases, but it is better formulated as the link or relation prediction problem in KGs. In later chapters in this part of the book, we cover novel machine learningbased paradigms like KGEs and SRL for solving such problems in the KG setting more effectively than the numerical solutions traditionally presented for the missing value imputation problem.</p>
<div class="figure">
<figure class="IMG"><a id="fig8-8"/><img alt="" src="../images/Figure8-8.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig8-8">Figure 8.8</a>:</span> <span class="FIG">A taxonomy of data-cleaning problems, as originally conceived of in the database community.</span></p></figcaption>
</figure>
</div>
<p>Unsurprisingly, just as with IM, there is no one-size-fits-all solution for addressing data-cleaning. In fact, it is not even completely clear initially what data-cleaning steps need to be performed on a data set to begin with. Data profiling, therefore, is an important step prior to any composition or execution of data-cleaning workflows. Wikipedia defines data profiling as the “process of examining the data available in an existing data source” or as “collecting statistics and information about that data.” In practice, data profiling is shallow and mechanistic rather than adaptive (such as clustering or other descriptive <span aria-label="210" id="pg_210" role="doc-pagebreak"/>data-mining procedures), but that does not preclude the value of good engineering. This is why, in good systems and applications, data profiling is systematic. Rather than issuing exploratory queries or randomly browsing the data rendered in a spreadsheet, the use of dedicated tools (of which the main ones are proprietary industrial solutions), such as Microsofts SQL Server Integration Services and Informaticas suite of products, is common. Despite their differences, such tools have much in common: they often use high-level inputs (often mediated via graphical interfaces) to issue requisite queries at the back end, and then use efficient (in many cases, highly specialized) aggregation algorithms to compute the profiles, statistics, and metadata desired by the user. The volume of the data dictates how long the process as a whole takes. For example, if an analysis of variance (ANOVA) is specified with large numbers of levels for some of the variables, the analysis could end up taking hours.</p>
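<p>A toy column profiler conveys the flavor of what such tools compute (completeness, distinctness, and outlier flags); production profilers do this with specialized aggregation algorithms at scale. The age column below, including its data-entry error, is invented.</p>

```python
import statistics

def profile_column(values):
    # Minimal single-column profile: completeness, distinctness, and
    # simple z-score outlier flags for fully numeric columns.
    present = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    numeric = [v for v in present if isinstance(v, (int, float))]
    if len(numeric) >= 2 and len(numeric) == len(present):
        mean = statistics.mean(numeric)
        stdev = statistics.stdev(numeric)
        profile["outliers"] = [v for v in numeric
                               if stdev and abs(v - mean) / stdev > 2]
    return profile

ages = [34, 36, 33, None, 35, 34, 240]   # 240 is a data-entry error
report = profile_column(ages)
```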
<p><span aria-label="211" id="pg_211" role="doc-pagebreak"/>The actual output does not have to be highly formalized, though statisticians and experts may prefer it in sensitive or difficult cases. For ordinary practitioners, good tools employ a liberal use of tabs, charts, tables, dashboards, and a full suite of other visualization aids, many of which are interactive. The use-cases of such profiles are numerous, with data cleaning being an important one. A specific purpose of basic profiling is to reveal data errors, including inconsistent formatting within columns, missing values, and outliers. For example, in the database world, if a column is numerical and is missing values, it suggests the application of missing value imputation to that column. Outliers may be flagged for further study or removed altogether. In recent years, such interactive data cleaning has become very popular, with systems including early efforts, such as Potters Wheel, which relies on a graphical user interface (GUI) to support a range of transforms along both the horizontal and vertical dimensions of a database; and more recent machine learningbased interactive systems such as ActiveClean, which relies on active learning to guide data cleaning toward instances where the analysts attention is likely to result in high expected utility to the current model. Other systems, like Wrangler, try to find a balance between machine learning and programming-by-declaration techniques. Another interesting recent innovation is crowd-powered data cleaning, where the work starts intersecting more closely with KGs. Some of the crowd-powered techniques rely on external KGs like YAGO and DBpedia, but also special KGs like GeoNames, to develop well-curated data sets. Others use crowdsourcing, which helps alleviate the cost of using experts or large quantities of training data. 
These methods are promising, but outputs do not carry the same kinds of guarantees as declarative (but manual and expert-intensive) tools.</p>
<p>To the best of our knowledge, the development of such tools (especially at the scale and quality of some of the production-level industrial tools mentioned here) for KGs is a largely open issue and subject to disruption. Of course, there is always the option of trying to represent the KG as an RDB model and then profiling the transformed model. We cover some options for relational modeling of RDF KGs in part IV, when we discuss querying and <span aria-label="212" id="pg_212" role="doc-pagebreak"/>infrastructure. However, while such modeling can work well for certain kinds of querying, where RDB infrastructure, query reformulation, and indexing mechanisms can be utilized to achieve speed at scale, it is more controversial whether the results of a profiler are even meaningful for a KG that is represented as an RDB. For example, a KG may be highly heterogeneous and not contain birth dates for many individuals (but in the cases that it does, the birth date may be precise and directly extracted from high-quality text). Applying missing value imputation or some other procedure to such a column is not wise, nor would it be appropriate to remove the column, despite its sparsity.</p>
<p>In general, such controversies arise due to KG models and RDB models relying on fundamentally different assumptions about the origins and quality of data. KG construction assumes that many instances will be incomplete in their attributes and relationships to other instances to begin with, because the algorithms are AI or NLP modules run over messy data. In contrast, data in RDBs tends to come from good, relatively higher-quality sources (if not outright manual inputs), rather than the interpreted outputs of AI algorithms. The schemas tend to be painstakingly designed, keeping in mind a narrower and better-defined set of business use-cases. Real-world KGs often have broad ontologies and are not strongly compliant with the ontologies. Hence, there is good reason to treat the data profiling and cleaning problem for KGs as being related to, but not the same as, data cleaning for RDBs and more structured data models.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec8-10"/><b>8.10Concluding Notes</b></h2>
<p class="noindent">In this chapter, we provided a definition and an overview of the IM problem. The IM problem arises frequently in several communities and domains and is an example of an AI problem that is far easier for humans than for machines. To make matters worse, naive solutions to IM are quadratic in the number of nodes in the KG. Techniques like blocking have to be used in practice to achieve acceptable reductions in the quadratic complexity, but at the potential cost of losing some recall in the final outputs.</p>
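<p>The quadratic blow-up and the blocking trade-off can both be seen in a few lines of Python. A simple first-character blocking key (an illustrative choice, not a recommended one) cuts the 28 exhaustive pairs for eight invented names down to 9 candidate pairs, at the risk of discarding true matches whose mentions start with different characters:</p>

```python
from collections import defaultdict
from itertools import combinations

names = ["john white", "j. white", "jane black", "john black",
         "ada lovelace", "a. lovelace", "alan turing", "grace hopper"]

# Exhaustive pairwise comparison: quadratic in the number of instances.
all_pairs = list(combinations(names, 2))

# A crude blocking key: the first character of the (lowercased) name.
blocks = defaultdict(list)
for name in names:
    blocks[name[0]].append(name)

# Candidate pairs: only pairs sharing a block are compared.
candidate_pairs = [p for block in blocks.values()
                   for p in combinations(block, 2)]

reduction_ratio = 1 - len(candidate_pairs) / len(all_pairs)
# 28 exhaustive pairs shrink to 9 candidates (reduction ratio ~0.68);
# a mention like "lovelace, ada" would have fallen outside the 'a' block
```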
<p>IM has been researched for over 50 years, and a consensus has been reached on a wide range of issues. The problem has continued to attract new research ideas, some recent examples being the use of collective IM methods and novel frameworks like Swoosh, which, given some reasonable modeling assumptions, provide good theoretical guarantees. The microclustering nature of IM continues to be studied in the machine learning community, and more recently, the field has started to intersect with other research areas in computer science, including crowdsourcing and latent spaces. We provide some guidance for further reading on the former topic in the “Bibliographic Notes” section; the latter topic is covered more broadly in chapter 9.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="213" id="pg_213" role="doc-pagebreak"/><a id="sec8-11"/><b>8.11Software and Resources</b></h2>
<p class="noindent">Because a practical IM solution is so dependent on input data formats (e.g., table versus graph), structure (the schema of the data), and domain (e-commerce versus biomedical IM), there are few standard implementations that could just be used without modification. A good example of such a system, proposed more than a decade ago, is Freely Extensible Biomedical Record Linkage (Febrl), a system introduced under an open-source license in 2008. Unlike Swoosh, which offers a theoretical framework, Febrl is an implemented system that was originally meant for a highly applied use-case (i.e., biomedical record linkage) and comes with no theoretical guarantees or formulation as such. It contains several reasonably advanced techniques for data cleaning and standardization, indexing and blocking, attribute value comparison, and record (in this case, instance) pair classification, all embodied within a GUI. Febrl can be seen as a training tool suitable for users to learn about and experiment with both traditional and new record linkage techniques, as well as for practitioners to conduct linkages with data sets containing up to several hundred thousand records.</p>
<p>The creation of Febrl was motivated by the fact that it is important to have tools available that allow IM practitioners to experiment with traditional as well as advanced techniques, in order to understand their advantages and their limitations. Such tools should be flexible and contain many linkage methods, and also allow a multitude of configuration options for users to conduct a variety of experimental linkages. Additionally, as most IM users in the health sector do not have extensive experience in programming, an intuitive GUI should provide a well-structured and logical way to set up and run record linkage projects.</p>
<p>Presently, Febrl can be downloaded from the Source Forge portal (<a href="https://sourceforge.net/p/febrl/wiki/Home/">https://<wbr/>sourceforge<wbr/>.net<wbr/>/p<wbr/>/febrl<wbr/>/wiki<wbr/>/Home<wbr/>/</a>). The homepage is at <a href="http://users.cecs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/febrldoc-0.3/manual.html">http://users.cecs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/febrldoc-0.3/manual.html</a>, and contains extensive details on such aspects as indexing and field comparison functions. While the discussion given here may seem to suggest that Febrl is primarily geared toward the health sector, it can be used for other domains because it is customizable.</p>
<p>Another option, designed for Python users, is the Python Record Linkage Toolkit (<a href="https://recordlinkage.readthedocs.io/en/latest/ref-datasets.html">https://<wbr/>recordlinkage<wbr/>.readthedocs<wbr/>.io<wbr/>/en<wbr/>/latest<wbr/>/ref<wbr/>-datasets<wbr/>.html</a>). It also allows the loading of Febrl data sets into the programmatic environment. A similar package is the Record Linkage Toolkit (RLTK), which can be installed in a Python environment using pip. The package is accessed at <a href="https://github.com/usc-isi-i2/rltk">https://<wbr/>github<wbr/>.com<wbr/>/usc<wbr/>-isi<wbr/>-i2<wbr/>/rltk</a> and is a general-purpose, open-source platform that allows users to build powerful Python programs for instance matching. One of the major innovations in RLTK compared to some of the other tools is its ability to handle large-scale data sets. RLTK supports multicore algorithms for blocking, profiling data, computing a wide variety of features, and training and applying machine learning classifiers based on Pythons sklearn library. An end-to-end RLTK pipeline can be jump-started with only a few lines of code. RLTK continues to be under active maintenance and has been <span aria-label="214" id="pg_214" role="doc-pagebreak"/>funded under multiple projects, including DARPA LORELEI (Low-Resource Languages for Emergent Incidents), Memex, and IARPA CAUSE.</p>
<p>There are some important packages for domain-specific instance matching. One example is DeepMatcher (<a href="https://github.com/anhaidgroup/deepmatcher">https://<wbr/>github<wbr/>.com<wbr/>/anhaidgroup<wbr/>/deepmatcher</a>), which is gaining popularity in e-commerce. It is designed mainly for the kind of product data scraped from web sources (and where details are embedded in Schema.org snippets). A good benchmark is available at <a href="http://webdatacommons.org/largescaleproductcorpus/">http://<wbr/>webdatacommons<wbr/>.org<wbr/>/largescaleproductcorpus<wbr/>/</a>. Just like biomedical and patient record linkage, e-commerce instance matching is such an important problem that it has drawn a small community around it (including many major industrial players). Recent state-of-the-art results in e-commerce IM, especially achieved by researchers working on the Product Graph at Amazon, have been impressive.</p>
<p>In the Semantic Web, some tools for instance matching, especially in the RDF and Linked Data settings, have been made available due to recent efforts. One early and very influential approach is the Silk framework (<a href="http://silkframework.org/">http://<wbr/>silkframework<wbr/>.org<wbr/>/</a>), which is an open-source framework for integrating heterogeneous data sources, and thus is better suited to KGs than some of the other (more database-centric) tools mentioned thus far. In addition to generating links between related data items within different Linked Data sources, Silk helps data publishers to get RDF links from their data sources to other data sources on the web. It also offers facilities for applying data transformations to structured data sources. Silk offers a declarative link specification language, which has some attractive properties, including control and explainability of results. There is also an accompanying workbench GUI, in which link specifications can be intuitively declared.</p>
<p>Another resource in the SW ecosystem is LIMES (<a href="http://aksw.org/Projects/LIMES.html">http://<wbr/>aksw<wbr/>.org<wbr/>/Projects<wbr/>/LIMES<wbr/>.html</a>), a link discovery framework that implements time-efficient approaches for large-scale link discovery based on the characteristics of metric spaces (e.g., the triangle inequality). It can be configured using either a GUI or a configuration file. It is also offered as a Java library, and it is even available as a stand-alone tool.</p>
<p>There are also other resources for benchmarking instance matching, as collected in a tutorial accessed at <a href="https://www.ics.forth.gr/isl/BenchmarksTutorial/">https://<wbr/>www<wbr/>.ics<wbr/>.forth<wbr/>.gr<wbr/>/isl<wbr/>/BenchmarksTutorial<wbr/>/</a>. One of the important resources mentioned in the tutorial is the Ontology Alignment Evaluation Initiative (OAEI), accessed at <a href="http://oaei.ontologymatching.org/">http://<wbr/>oaei<wbr/>.ontologymatching<wbr/>.org<wbr/>/</a>, which offers IM benchmarking as a problem area, and even KG-centric matching tasks such as jointly matching instances and schemas. There is a workshop that includes several competitions organized each year, usually held in conjunction with the International Semantic Web Conference (ISWC) series. For those interested in benchmarking instance matching for traditional record linkage-like solutions, some resources are available at <a href="https://dbs.uni-leipzig.de/en">https://<wbr/>dbs<wbr/>.uni<wbr/>-leipzig<wbr/>.de<wbr/>/en</a>, with standard benchmark data sets for IM accessed at <a href="https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution">https://<wbr/>dbs<wbr/>.uni<wbr/>-leipzig<wbr/>.de<wbr/>/research<wbr/>/projects<wbr/>/object<wbr/>_matching<wbr/>/benchmark<wbr/>_datasets<wbr/>_for<wbr/>_entity<wbr/>_resolution</a>. Another good resource is the Stanford Entity Resolution Framework (SERF), which is maintained by Benjelloun et al. (2009), <span aria-label="215" id="pg_215" role="doc-pagebreak"/>who also authored the set of papers on Swoosh. It may be accessed at <a href="http://infolab.stanford.edu/serf/">http://<wbr/>infolab<wbr/>.stanford<wbr/>.edu<wbr/>/serf<wbr/>/</a>.</p>
<p>In many cases, the tools and packages mentioned here offer all the facilities that a user is likely to need to do feature engineering, train an IM classifier, and even evaluate the system on withheld test data. In some cases, however, it may be necessary to develop more advanced feature functions, including learnable string similarity metrics. As far back as the early 2000s, some systems relied on such learnable metrics, including Multiply Adaptive Record Linkage with INduction (MARLIN). MARLIN is a two-level learning approach that continues to be used and evaluated widely for its effectiveness on different classes of IM problems. First, string similarity measures are trained for every database field so that they can provide accurate estimates of string distance between values for that field. Next, a final predicate for detecting duplicate records is learned from similarity metrics applied to each of the individual fields. SVMs are employed for both tasks, and the authors show that they outperform decision trees, the favored classifier in prior work.</p>
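The two-level structure described above can be sketched in a few lines of code. This is a deliberately simplified illustration, not MARLIN itself: MARLIN learns the per-field string metrics and uses SVMs at both levels, whereas here the field similarities are fixed Jaccard scores and the combiner is a tiny perceptron; the records, fields, and training pairs are hypothetical.

```python
# Two-level matcher sketch: level 1 computes a per-field similarity
# (fixed Jaccard here; MARLIN would learn these), and level 2 learns a
# final match/no-match predicate over the field similarities (a simple
# perceptron here; MARLIN uses SVMs).

def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

FIELDS = ["name", "city"]  # hypothetical record schema

def features(r1, r2):
    return [jaccard(r1[f], r2[f]) for f in FIELDS]

def train_perceptron(X, y, epochs=50, lr=0.1):
    w, b = [0.0] * len(FIELDS), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = label - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, r1, r2):
    x = features(r1, r2)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy training pairs: duplicates (label 1) share most field tokens.
train = [
    ({"name": "Tom M. Mitchell", "city": "Pittsburgh"},
     {"name": "Tom Mitchell", "city": "Pittsburgh"}, 1),
    ({"name": "Jackie Chan", "city": "Hong Kong"},
     {"name": "Jackie Chan", "city": "Hong Kong"}, 1),
    ({"name": "Tom M. Mitchell", "city": "Pittsburgh"},
     {"name": "Jackie Chan", "city": "Hong Kong"}, 0),
    ({"name": "Jackie Chan", "city": "Hong Kong"},
     {"name": "Alan Turing", "city": "London"}, 0),
]
X = [features(a, b) for a, b, _ in train]
y = [lbl for _, _, lbl in train]
w, b = train_perceptron(X, y)
```

The key point mirrored from the text is the separation of concerns: field-level similarity estimation first, then a learned record-level predicate over those similarities.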
<p>We are not aware of any obvious implementations for learnable string similarity, but using standard graphical models and dynamic programming, such similarity measures can be implemented by interested users in their favored programming languages and environments. Another approach, which is continuing to gain favor due to the rapid rise of neural networks and representation learning (including for KGs, as we cover in chapter 10), is avoiding feature engineering altogether and instead relying on vector space models and embeddings as the features. Even though such methods have many advantages, there are open questions that preclude getting rid of feature engineering altogether. For example, it is not completely clear how to embed or represent mixtures of text, numbers, and dates, as is often the case in KGs. Certainly, some modeling effort is required before an off-the-shelf embedding can be applied. Whether modeling is preferable to feature engineering remains to be seen. More details on KG-embedding software and resources will be provided in chapter 10.</p>
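To make the representation-based alternative concrete, the sketch below compares values as vectors under cosine similarity instead of via engineered features. Real KG embeddings (chapter 10) are learned from data; to keep this sketch self-contained, we substitute simple character-trigram count vectors, which is an assumption of this example rather than a method from the text.

```python
# Vector-space comparison sketch: represent each string as a character
# trigram count vector and score pairs by cosine similarity, standing in
# for learned embeddings purely for illustration.

from collections import Counter
from math import sqrt

def trigram_vector(s):
    s = f"##{s.lower()}##"  # pad so short strings still yield trigrams
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

sim = cosine(trigram_vector("Apple Corporation"), trigram_vector("Apple Corp"))
```

Note that this handles text only; as the paragraph above observes, mixtures of text, numbers, and dates need additional modeling before any such off-the-shelf representation applies.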
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec8-12"/><b>8.12Bibliographic Notes</b></h2>
<p class="noindent">There has been extensive research on several variants of IM over the years. Ironically, this has led to the IM problem itself going by various names in the academic literature, including deduplication, duplicate detection, entity resolution, merge-purge, record linkage (which is very common in the database community), entity reconciliation, instance matching, and coreference resolution (or equivalently, anaphora resolution). The last version of the problem is largely limited to the NLP community, where the data is still in the form of natural-language extractions and there is sufficient context for applying NLP techniques. The other terms are far more common when dealing with data in some kind of structured form, as we assumed in this chapter. An excellent, though by no means exhaustive, set of references that survey the problem in its various guises includes Getoor and <span aria-label="216" id="pg_216" role="doc-pagebreak"/>Machanavajjhala (2012), Elmagarmid et al. (2006), Köpcke et al. (2010), Hernández and Stolfo (1995), and Elango (2005). Books and synthesis lectures by Christen (2012) and Christophides et al. (2015) offer thorough treatments, far beyond the scope of this lone chapter, including introductory and advanced material. Doan et al. (2012) offer a useful review of the broader area of data integration, within which research areas such as instance matching and schema matching have historically been embedded.</p>
<p>Much of the early (and a large portion of the later) research focused on the similarity step, as well as advanced methods for clustering entities directly. In the early days, rule-based approaches were popular, but as noted in the “Similarity” section earlier in this chapter, in the last decade, machine learning has emerged as the dominant paradigm for learning an approximate pairwise linking function from a training set of duplicates (positive class) and nonduplicates (negative class). An excellent review by Elmagarmid et al. (2006) provides more context for this evolution. Original references for Swoosh, Febrl, MARLIN, and the Fellegi-Sunter model include Benjelloun et al. (2009), Christen (2008), Bilenko and Mooney (2003), and Fellegi and Sunter (1969), respectively. (Although there are many original references for Swoosh, we only cite Benjelloun et al., 2009, because it offers much of the content underlying the relevant section in this chapter.) Note, however, that more recently, blocking has become a popular subject of study due to the focus on Big Data. An excellent overview was provided by Christen (2011), where many of the original papers on individual algorithms such as SN were provided. Good references on learning blocking keys or schemes include Michelson and Knoblock (2006), Bilenko et al. (2006), Cao et al. (2011), Kejriwal and Miranker (2013), Ramadan and Christen (2015), and Shao and Wang (2018). McCallum et al. (2000) proposed the Canopy Clustering algorithm, which has also been enormously influential in both research and practice.</p>
<p>A similar evolution is already taking place in the Linked Data community (discussed in detail in chapter 14), where rule-based approaches, such as the Silk system proposed by Bizer, Volz, et al. (2009), still enjoy support but are being gradually supplanted by adaptive algorithms that rely on machine learning techniques such as active learning. The systems themselves have frequently undergone updates to reflect such changes. For other examples of IM systems prevalent in the SW community, we recommend Nentwig et al. (2017), Klímek et al. (2019), and Ferrara et al. (2011). Kejriwal (2016, chapter 3) compares many existing systems on some Big Data dimensions, but we also recommend Getoor and Machanavajjhala (2013), which gives slightly different perspectives. Kejriwal (2016) shows that, despite significant advancements, many of the systems were designed for the relatively homogeneous record linkage problem, and not for the heterogeneous KGs often published on the web. Whether they can be successfully extended is more of an empirical than a theoretical issue. Recent SW IM systems, especially leaning more on semisupervised, unsupervised, or self-supervised machine learning, are described by Kejriwal and Miranker (2015a,b), Nikolov et al. (2012), and Araujo et al. (2012).</p>
<p><span aria-label="217" id="pg_217" role="doc-pagebreak"/>We also mentioned some novel research frontiers. In particular, collective and statistical relational approaches to instance matching (which we revisit in the next chapter) include Bhattacharya and Getoor (2007) and Singla and Domingos (2006). Examples of work that involve instance matching and crowdsourcing include Wang et al. (2012), but a more recent overview can be found in a survey of data cleaning by Chu et al. (2016). Whang et al. (2009) and Papadakis et al. (2011) provide guidance on IM systems that violate the independence between the blocking and similarity steps in the two-step workflow, and instead choose to interleave the two. Finally, deep learning approaches for instance matching are still in their relative infancy, although recent research by Mudgal et al. (2018), Ebraheem et al. (2017), and Kooli et al. (2018) is illustrative and relevant.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec8-13"/><b>8.13Exercises</b></h2>
<ul class="numbered">
<li class="NL">1.Instead of resolving instances within a KG, suppose that we wanted to resolve instances between two KGs, <i>G</i><sub>1</sub> and <i>G</i><sub>2</sub>, with <i>m</i> and <i>n</i> instances, respectively. We will study the effects of blocking.</li>
</ul>
<p class="AL">(a)What is the cardinality of the exhaustive set <i>O</i> of instance pairs?</p>
<p class="AL">(b)What would be the cardinality of <i>O</i> if both KGs were individually noisy (i.e., if we wanted to do instance matching both within and across the two KGs)?</p>
<p class="AL">(c)Using the results of the previous parts of this exercise, prove (using arguments, without algebraically expanding the terms) that <img alt="" class="inline" height="22" src="../images/pg217-in-1.png" width="219"/>.</p>
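A quick numeric sanity check can accompany (but not replace) the argument-based proof that part (c) asks for. We assume the identity in question relates the across-KG pairs from part (a) and the within-plus-across pairs from part (b) to the set of all pairs over the combined <i>m</i> + <i>n</i> instances.

```python
# Sanity check: pairs between two KGs (m * n), plus pairs within each
# noisy KG (C(m,2) and C(n,2)), together make up all pairs over the
# m + n pooled instances, C(m + n, 2).

from math import comb

def across(m, n):
    """Part (a): exhaustive set of pairs between two clean KGs."""
    return m * n

def within_and_across(m, n):
    """Part (b): both KGs individually noisy, so within-KG pairs count too."""
    return comb(m, 2) + comb(n, 2) + m * n

for m in range(20):
    for n in range(20):
        assert within_and_across(m, n) == comb(m + n, 2)
```

The check passes for all small <i>m</i> and <i>n</i>; the exercise's point is to see <i>why</i> it must hold, by viewing the pooled instances as a single noisy KG.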
<ul class="numbered">
<li class="NL">2.** This question concerns the terminology used for Swoosh. Given two instances <i>m</i> and <i>n</i>, such that <i>m</i> dominates <i>n</i>, construct the KGs <i>G</i><sub>1</sub> and <i>G</i><sub>2</sub> in terms of these instances so that they dominate one another.</li>
<li class="NL">3.This exercise concerns the KG fragment in <a href="chapter_8.xhtml#fig8-1">figure 8.1</a>.</li>
</ul>
<p class="AL">(a)What would be a good blocking key for this KG if the goal were to link citations?</p>
<p class="AL">(b)Suppose that the blocking key was <i>CommonToken(:author)</i>. Assume that author names are perfectly tokenized by the blocking algorithm (i.e., “Rudich, S.” is properly tokenized to yield the bag {“Rudich”, “S.”}). What would be the PC if this blocking key were to be executed? Does this make it a good blocking key? Why or why not?</p>
<p class="AL">(c)Suppose that we decided that <i>CommonToken(:author)</i> is the matching function (i.e., if two citations have a common token in their author node, then they are matching). What would be the precision, recall, and FM of executing such a matching function on the exhaustive set of pairs?</p>
<ul class="numbered">
<li class="NL">4.In this exercise, we look at a slightly different version of entity resolution—namely, entity linking, which is the problem of linking a word or phrase (usually an extracted named entity) in natural-language text to no more than one entry in a preexisting KG <span aria-label="218" id="pg_218" role="doc-pagebreak"/>such as Wikidata. Although many methods for entity linking exist, we consider two intuitive ones here:</li>
</ul>
<p class="AL">(a)<b>The popularity method:</b> Selects the entity that is the most popular among the list of candidates in the KG. Although different notions of popularity exist, consider the most intuitive definition that comes to mind. For example, given “Paris,” Paris, France, would be more popular as an entity than Paris, Texas (which also exists).</p>
<p class="AL">(b)<b>The joint assignment method:</b> Selects the entity in the KG that is consistent (using some measure of consistency, such as coreference, or membership in the same semantic class) with the other mentions of entities. For example, if the extraction “Charlotte” were linked to a city, rather than the name of an individual, then a cooccurring extraction (e.g., in the same sentence) of “Paris” would be more likely to be linked to the city of Paris rather than the name of an individual (e.g., Paris Hilton).</p>
<p class="myenumitem">For each of the following sentences, we only want to disambiguate the extraction “Arizona.” We also italicize the <i>detected</i> entities. For each sentence, explain which of the two methods given here (or neither, or both) would be appropriate for getting to the right answer, and why. If neither method is appropriate, try to think of a better method and explain your intuitions in a few sentences. You may assume that there are only three candidates in your KG: (i) Arizona (state); (ii) Arizona (restaurant in New York City); (iii) Arizona (snake).</p>
<p class="AL">(a)<i>Arizona</i> is my all-time favorite <i>restaurant</i> in <i>New York City</i>.</p>
<p class="AL">(b)The best <i>BBQ</i> Ive tasted is in <i>Arizona</i>.</p>
<p class="AL">(c)Very few <i>people</i> have been attacked by an <i>Arizona</i>.</p>
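A minimal sketch of the popularity method from part (a) may help in reasoning about these sentences. The candidate list matches the three candidates given in the exercise, but the popularity scores are hypothetical illustrations.

```python
# Popularity method sketch: link a mention to the candidate with the
# highest prior popularity, ignoring sentence context entirely.
# Scores below are invented for illustration.

CANDIDATES = {
    "Arizona": [
        ("Arizona (state)", 0.90),
        ("Arizona (restaurant in New York City)", 0.07),
        ("Arizona (snake)", 0.03),
    ],
}

def link_by_popularity(mention):
    cands = CANDIDATES.get(mention, [])
    return max(cands, key=lambda c: c[1])[0] if cands else None
```

Because the method never looks at the surrounding sentence, it returns the same answer for all three sentences, which is exactly the failure mode the exercise asks you to reason about.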
<ul class="numbered">
<li class="NL">5.Recall that earlier in the chapter, we presented string similarity features as one possible methodology for getting sets of features between pairs of entities that have been approved by blocking, and that are now being input into some similarity module. This exercise tests you on some common similarity measures (you should look up the details and formulas online because many sources describing the measures given here are available). For each string similarity measure, look up a definition online and provide the formula. Then provide both a positive and a negative use case for the measure (i.e., one good use case where you would use this string similarity measure as a strong feature for entity resolution, and one where you would not).</li>
</ul>
<p class="AL">(a)Needleman-Wunsch</p>
<p class="AL">(b)Soundex</p>
<p class="AL">(c)Monge-Elkan</p>
<p class="AL">(d)Levenshtein</p>
<p class="AL">(e)<span aria-label="219" id="pg_219" role="doc-pagebreak"/>Jaro-Winkler</p>
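To make at least one of these formulas concrete, here is the standard dynamic-programming computation of Levenshtein (edit) distance; the other measures in the list can be implemented analogously once their formulas are looked up.

```python
# Levenshtein distance via the Wagner-Fischer dynamic program, keeping
# only one previous row of the DP table at a time.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))  # distance from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                   # distance from prefix of a to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```

For example, turning "kitten" into "sitting" takes three edits, which is the textbook value for this measure.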
<ul class="numbered">
<li class="NL">6.Suppose that you could use only one string-matching function (appropriately thresholded) to determine whether entities in two KGs are linked. Would you want to use the Jaro-Winkler or Smith-Waterman similarity for linking the entities in the KGs shown in this table? Why?</li>
</ul>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>Knowledge Graph A</b></p></th>
<th class="TCH"><p class="TB"><b>Knowledge Graph B</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB">Actor Jackie Chan</p></td>
<td class="TB"><p class="TB">Tom Michael Mitchell, E. Fredkin University Professor</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Professor Tom M. Mitchell</p></td>
<td class="TB"><p class="TB">Jackie Chan Kong-san</p></td>
</tr>
</tbody>
</table>
</figure>
<ul class="numbered">
<li class="NL">7.In a similar vein, considering the two KGs in this table, would you use tf-idf or Levenshtein distance?</li>
</ul>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>Knowledge Graph A</b></p></th>
<th class="TCH"><p class="TB"><b>Knowledge Graph B</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB">Apple Corporation</p></td>
<td class="TB"><p class="TB">Apple Corp</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">IBM Corporation</p></td>
<td class="TB"><p class="TB">IBM Corp</p></td>
</tr>
</tbody>
</table>
</figure>
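To make the tf-idf option concrete on the table's values (the exercise itself asks which measure is the better fit), the following sketch computes token-level tf-idf vectors over the four names and compares them with cosine similarity.

```python
# tf-idf over whitespace tokens of the four table values, with cosine
# similarity between the resulting weight vectors.

from collections import Counter
from math import log, sqrt

DOCS = ["Apple Corporation", "IBM Corporation", "Apple Corp", "IBM Corp"]
N = len(DOCS)
df = Counter(t for d in DOCS for t in set(d.lower().split()))  # document freq.

def tfidf(doc):
    tf = Counter(doc.lower().split())
    return {t: c * log(N / df[t]) for t, c in tf.items()}

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

sim_match = cosine(tfidf("Apple Corporation"), tfidf("Apple Corp"))
sim_nonmatch = cosine(tfidf("Apple Corporation"), tfidf("IBM Corp"))
```

Note that at the token level, "Corporation" and "Corp" never match, so the true pairs are held together here only by their shared distinctive token; thinking about that limitation is a good way into the exercise.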
<ul class="numbered">
<li class="NL">8.Consider the data sets shown here, represented as tables, between which we want to match entities. The “Match” relation indicates the ground-truth, which would not be available to an entity resolution algorithm. We will use the following blocking key: the first letter of the last name concatenated with the last four digits of the phone number. Assume that Traditional Blocking is the blocking method. As a first step, list the blocking keys and the IDs of entities that fall into the blocks.</li>
</ul>
<figure class="IMG"><img alt="" src="../images/pg219-1.png" width="450"/>
</figure>
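The mechanics of exercise 8 can be sketched as follows. Because the exercise's tables appear only in the figure, the records below are hypothetical stand-ins; the blocking key (first letter of the last name concatenated with the last four phone digits), and the RR and PC computations, follow the chapter's definitions.

```python
# Traditional Blocking sketch: group records by a key, take candidate
# pairs only within blocks, then compute reduction ratio (RR) and pairs
# completeness (PC). Records and ground truth are illustrative.

from collections import defaultdict

A = [  # (id, last name, phone)
    ("a1", "Smith", "555-1234"),
    ("a2", "Jones", "555-9876"),
]
B = [
    ("b1", "Smyth", "555-1234"),  # true match for a1; key agrees
    ("b2", "Jones", "555-0000"),  # true match for a2; phone differs
]
TRUE_MATCHES = {("a1", "b1"), ("a2", "b2")}

def key(rec):
    _, last, phone = rec
    return last[0] + phone.replace("-", "")[-4:]

blocks = defaultdict(lambda: ([], []))
for r in A:
    blocks[key(r)][0].append(r[0])
for r in B:
    blocks[key(r)][1].append(r[0])

candidates = {(i, j) for ids_a, ids_b in blocks.values()
              for i in ids_a for j in ids_b}

rr = 1 - len(candidates) / (len(A) * len(B))             # reduction ratio
pc = len(candidates & TRUE_MATCHES) / len(TRUE_MATCHES)  # pairs completeness
```

In this toy setup the key recovers one of the two true matches (PC of 0.5) while pruning three of the four exhaustive pairs (RR of 0.75), illustrating the trade-off the exercise explores.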
<ul class="numbered">
<li class="NL">9.What are the RR and PC of the blocking?</li>
<li class="NL1">10.Suppose that we decided to go one step further and declare that entities <i>e</i><sub>1</sub> and <i>e</i><sub>2</sub> (from the two data sets, respectively) match if they have the same blocking key. What would <span aria-label="220" id="pg_220" role="doc-pagebreak"/>be the precision and recall? <i>Hint: Do we even need to compute the recall if we have the answer to this question?</i></li>
<li class="NL1">11.Now, consider a blocking algorithm that is somewhat more advanced and robust than Traditional Blocking: namely, Canopies (also called Canopy Clustering). Although we provided a description in the chapter, it would be instructive to read the original paper (McCallum et al., 2000), because it has been very influential in the entity resolution and blocking literature. Consider both this question and the next three questions in the context of the image shown here. Assume that <i>d</i> (at the center) was the query entity that was used to generate the canopies. The inner, dotted circle represents the inner, tight canopy, while the outer, solid circle represents the outer, loose canopy. Suppose that these entities were restaurants (with attributes such as name, address, and structured representations of their menus). What would be good examples of similarity measures and thresholds to try in this context for getting reasonable blocking outputs?</li>
</ul>
<figure class="IMG"><img alt="" src="../images/pg220-1.png" width="150"/>
</figure>
<ul class="numbered">
<li class="NL1">12.What is the valid set of entities from which we randomly pick our next query (i.e., for iteration 2)? <i>Hint: Would c be in this valid set? Why or why not? What about g?</i></li>
<li class="NL1">13.Assuming that the chosen element (iteration 2) is the first alphabetic element from the set you picked in exercise 12, what is the valid group of elements for iteration 3? Try to approximate the similarity functions and thresholds by drawing circles of similar radii as you see in the image.</li>
<li class="NL1">14.Has the algorithm converged to a set of blocks? What is the minimum number of iterations that are still needed?</li>
<li class="NL1">15.In general, can you think of cases (when applying Canopies) where, given <i>n</i> data points and two thresholded similarity functions, you need <i>n</i> iterations to converge? If you were told that the algorithm took <i>n</i> iterations before converging, what could you conclude about the PC and RR of the algorithm?</li>
<li class="NL1">16.Can you think of ways to avoid the problem in exercise 15, short of replacing or redesigning the similarity measure?</li>
</ul>
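When working through the Canopies questions above, it may help to have a runnable version of the procedure in hand. The sketch below follows the loose/tight two-threshold scheme from the chapter and from McCallum et al. (2000), but uses points on a number line with absolute difference as the cheap similarity; the data and thresholds are illustrative assumptions.

```python
# Canopy Clustering sketch: repeatedly pick a remaining point as a
# center, put every point within the loose threshold into its canopy,
# and remove points within the tight threshold from the pool of
# potential centers. Canopies may overlap.

import random

def canopies(points, t_tight, t_loose, seed=0):
    assert t_tight <= t_loose
    rng = random.Random(seed)
    remaining = set(points)
    result = []
    while remaining:
        center = rng.choice(sorted(remaining))
        canopy = {p for p in points if abs(p - center) <= t_loose}
        result.append((center, canopy))
        remaining -= {p for p in remaining if abs(p - center) <= t_tight}
    return result

blocks = canopies([1, 2, 3, 10, 11, 20], t_tight=2, t_loose=5)
```

Every point lands in at least one canopy (a removed point was within the tight, hence also the loose, threshold of some center), which is the coverage property exercises 1216 reason about.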
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_8.xhtml#fn1x8-bk" id="fn1x8">1</a></sup>This section is advanced and may be skipped by a reader just looking to gain familiarity with the topic.</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html>