<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch6" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch6"><span aria-label="125" id="pg_125" role="doc-pagebreak"/>6</h1>
<h1 class="chapter-title"><b>Relation Extraction</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b> Thus far in this book, the focus has been on extracting entities, whether from text or from semistructured documents like webpages. However, the edges in a knowledge graph (KG) are relationships that exist between pairs of entities. In addition, higher-order entities, such as events, serve an important purpose in modern KGs that explore complex domains such as geopolitics. This chapter introduces techniques for relation and event extraction that can be used to construct such KGs. Although research on relation and event information extraction (IE) has been ongoing for quite some time, empirical performance is still low compared to that of other IE subareas like Named Entity Recognition (NER), and there is still a long way to go. Nevertheless, recent progress has been encouraging, and some systems have been deployed in the real world.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec6-1"/><b>6.1 Introduction</b></h2>
<p class="noindent">When constructing KGs, NER can be used to get the nodes in the KG. However, a KG also contains <i>relations</i>, such as <i>spouse_of</i> and <i>capital_of</i>. To get these relations, a different brand of IE has to be executed on the corpus. This kind of IE, relation extraction (RE), is the problem of detecting and classifying <i>relationships</i> between entities extracted from the text. It involves its own set of challenges and is a significantly more difficult problem than NER. We illustrate RE via the following examples:</p>
<ul class="numbered">
<li class="NL">1. <b>spouse_of:</b> [John] was married to [Mary] in 1929.</li>
<li class="NL">2. <b>employed_by:</b> [Hardy] has been working for [Calmet Corporation] since he was a teenager.</li>
<li class="NL">3. <b>member:</b> The [Pepper Jack club] recently convinced [Mary Poppins] to join.</li>
</ul>
<p>In these examples, mentions of named entities are in brackets, and named entity types are omitted for clarity. The directionality of the relation should be evident from the context (although it is harder for machines to determine it automatically); some relations, such as spouse_of, are symmetric and bidirectional. An interesting point that is conveyed even by the simple illustrations given here is that, unlike named entities, where it is possible to distinguish the “mentions” in the text and then type them with a concept in the ontology, the relation “instances” cannot be described as cleanly at the mention level. Instead, a pair <span aria-label="126" id="pg_126" role="doc-pagebreak"/>of entity mentions itself serves as a proxy for a relation instance, and the primary task is to determine if a relation of that type indeed exists between the two entities. In practice, it does not make sense to actually distinguish between relation “instances” and “types.” The problem is instead formulated as one of <i>existence</i>—that is, does the relation employed_by exist between a given pair of entity mentions <i>(Hardy, Calmet Corporation)</i>?</p>
<p>RE may be classified as being <i>global</i>- or <i>mention</i>-level. Global RE is expected to produce a list of entity pairs between which a semantic relation exists. It takes as input a large text corpus and produces a global list of semantically related entity pairs as output (along with the relation itself). In contrast, mention-level RE takes as input both an entity pair and the sentence containing it, and also has to identify whether a certain relation exists for that entity pair within the context of the sentence. Both global- and mention-level RE systems have been actively researched in the literature. To understand the difference, consider again the example sentence for employed_by. Mention-level RE can determine (for an ideal system) that there is an employed_by relationship between Hardy and Calmet Corporation, but it cannot determine the nature of employment. However, another sentence (or context) may indicate that Hardy is now the chief executive officer (CEO) of this company, which would allow global RE to produce the fact that a more specific relation CEO_of (which is a <i>subtype</i> of the inverse relation <i>employs</i>) exists between Hardy and Calmet Corporation.</p>
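<p>To make the two formulations concrete, the following sketch contrasts a mention-level extractor with a global aggregation over a corpus. It is illustrative only: the cue lexicon and function names are hypothetical, and a real system would use a trained classifier rather than keyword cues.</p>

```python
# Toy mention-level RE: classify one (sentence, entity pair) instance.
# The keyword-cue lexicon is a stand-in for a trained classifier.
CUES = {
    "married to": "spouse_of",
    "working for": "employed_by",
}

def mention_level_re(sentence, e1, e2):
    """Return the relation type holding between e1 and e2 in this sentence."""
    for cue, relation in CUES.items():
        if cue in sentence:
            return relation
    return "NA"  # no predefined relation detected in this context

def global_re(corpus):
    """Aggregate mention-level decisions into a corpus-wide list of facts."""
    facts = set()
    for sentence, e1, e2 in corpus:
        relation = mention_level_re(sentence, e1, e2)
        if relation != "NA":
            facts.add((e1, relation, e2))
    return facts
```

<p>Note that the global extractor simply pools mention-level decisions here; as discussed previously, a real global RE system can also combine evidence across sentences to produce more specific relations.</p>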
<p>As with NER, we note that the set of relationship types (and subtypes) that are within scope for the RE system is specified by a predefined ontology. In many cases, the ontology constrains the named entity types between which a relation can be defined. For example, for the employed_by relation, an acceptable domain would be instances of the concept PERSON, and an acceptable range would be instances of the concept ORGANIZATION.</p>
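<p>Such domain and range constraints can be checked mechanically before (or after) classification, to prune entity pairs that cannot participate in a relation. A minimal sketch follows; the schema dictionary and function name are illustrative, not part of any particular ontology.</p>

```python
# Hypothetical ontology fragment: relation -> (domain concept, range concept).
SCHEMA = {
    "employed_by": ("PERSON", "ORGANIZATION"),
    "capital_of": ("CITY", "COUNTRY"),
}

def satisfies_constraints(relation, head_type, tail_type):
    """True if the typed entity pair is admissible for this relation."""
    domain, range_ = SCHEMA.get(relation, (None, None))
    return (head_type, tail_type) == (domain, range_)
```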
<p>Much of the work on RE is based on the task definition from the Automatic Content Extraction (ACE) program and ontology. ACE focuses on binary relations (i.e., relations between two entities), with the two entities involved referred to as arguments. For example, the employed_by relation in the previous example could be described by one of several relation subtypes in ACE, including EMPLOY-EXEC, EMPLOY-STAFF, or even EMPLOY-undetermined (the last being true in this case), depending on the type of employment relation described in the sentence. We describe ACE in more detail in the next section.</p>
<p>We note that although RE may seem independent from NER, the quality of the two can be intertwined. Complex entity types, as well as errors in either entity extraction or typing, lead to problems for RE, which in typical formulations is a downstream task relative to NER. RE and NER can also interact in the context of another important problem (namely, <i>event IE</i>). Giving a good definition of this task is hard to do because of the difficulty in defining an event. Intuitively, however, we may think of an event as a <i>higher-order spatiotemporal entity</i> that may comprise a related set of entities (called <i>arguments</i>), and with some kind of temporal or spatial span (whether or not it is explicitly extracted or mentioned in the <span aria-label="127" id="pg_127" role="doc-pagebreak"/>text). For example, in the sentence “The G20 summit in 2019 took place in Osaka,” <i>G20 summit</i> may qualify as an “event” extraction, but by itself, the extraction does not do justice to the event. The event clearly has <i>arguments</i> that include both the location (<i>Osaka</i>) and the time (<i>2019</i>), which is the year in this case. These arguments could be finer-grained and may even include ranges (e.g., a war that occurred over a certain time span could also be an event extraction). If the participants had been mentioned, they may have been extracted as additional arguments of the event. Ultimately, the underlying ontology is used for deciding what constitutes a correct, or even well-defined, event extraction. In practice, identifying events in free text can often equate to identifying who did what to whom, the time and place of activity (“when” and “where”), for what reason (“why”), and through what methods or instruments (“how”). Effective event IE entails simultaneous extraction of several entities and the relationships between them. To formalize what is meant by an event, an ontological definition of event IE is assumed, just as with named entities, concepts, and relations. In the next section, we will cover some important ontologies and vocabularies that are used for defining applicable relation and event types.</p>
<p>The distinction between relation and event IE is not always clear; in some rare cases, even the distinction between NER and event extraction can become blurred. We could think of event IE, for example, as a “downstream” inference that takes into account the sets of extracted entities and relations, as well as their contexts. However, the methods that have achieved state-of-the-art performance in event IE take a more sophisticated view of the problem and attempt to <i>jointly</i> extract entities, relations, and events in the hopes of improving performance on all three. Toward the end of the chapter, we return to event IE and joint IE. The vast majority of this chapter, however, is scoped to RE because it is the next most important (and well-studied) kind of IE for free text after NER. Furthermore, despite the growing importance of events, most modern KGs still do not contain complex event definitions in their underlying ontologies (i.e., entities and relations continue to reign as first-class citizens in the vast majority of published KGs).</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec6-2"/><b>6.2 Ontologies and Programs</b></h2>
<p class="noindent">Before we delve into techniques for event extraction and RE, it is important to scope out the problem. At this time, there is a fair amount of consensus on what these tasks entail, even though the actual techniques continue to be actively researched. Unlike NER, the concept of an ontology becomes important when dealing with event extraction or RE (especially the former). The reason is that it is not always clear what an event really is; without some kind of constraint, the problem would quickly become ill defined. The choice of ontology is important enough that it usually ends up influencing the actual techniques used to extract events. We begin by exploring a popular event ontology called ACE, which has been used as the definitional backbone for many relational and event IE tasks. In fact, many of the techniques we cover later in this chapter were a direct (or inspired) output of some of the <span aria-label="128" id="pg_128" role="doc-pagebreak"/>government-funded programs that used ACE (or some minor modification thereof) as the underlying ontology in support of the IE (from text) problem.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec6-2-1"/><b>6.2.1 Automatic Content Extraction</b></h3>
<p class="noindent">The ACE standard was developed by the National Institute of Standards and Technology (NIST) in 1999 and has evolved over time to support different evaluation cycles, the last evaluation having occurred in 2008. The objective of the ACE program was to develop IE technology to support automatic processing of source language data (in the form of natural as well as derived text, such as from optical character recognition). Automatic processing, as defined at the time of program institution, included classification, filtering, and selection based on the language content and semantics of the source data. The ACE program required development and refinement of technologies to automatically detect and characterize such semantics. Research objectives included detection and characterization of entities, relations, and events.</p>
<p>The Linguistic Data Consortium (LDC) developed annotation guidelines, corpora, and other linguistic resources to support the program. Annotation was an important component of the program; for instance, ACE annotators tagged broadcast transcripts and newswire and newspaper data in three languages (English, Chinese, and Arabic), producing both training and test data for evaluating systems on three research objectives: Entity Detection and Tracking (EDT), Relation Detection and Characterization (RDC), and Event Detection and Characterization (EDC). A fourth annotation task, Entity Linking (LNK), grouped all references to a single entity, together with all its properties, into a Composite Entity. In the KG community, a similar problem arises that we have alluded to earlier (instance matching) and that will be detailed in chapter 8.</p>
<p>The type inventory of ACE includes Person, Organization, Geo-Political Entity (GPE), Location, Vehicle, Facility, and Weapon. ACE also further classifies entity mentions by including subtypes for each determined type (e.g., Organization subtypes include Government, Commercial, Educational, Nonprofit, and Other); if the entity does not fit into any subtype, it is not annotated. Furthermore, each entity was tagged according to its class (specific, generic, attributive, negatively quantified, or underspecified).</p>
<p>To support the RDC task (or what we designate as RE in this chapter), ACE relations included physical relations such as Located, Near, and Part-Whole; social/personal relations such as Business, Family, and Other; a range of employment or membership relations; relations between artifacts and agents (including ownership); affiliation-type relations like ethnicity; relationships between persons and GPEs, like citizenship; and finally, discourse relations. For every relation, annotators identified two primary arguments (namely, the two ACE entities that are linked), as well as the relation’s temporal attributes. Relations that were supported by explicit textual evidence were distinguished from those that depended on contextual inference on the part of the reader.</p>
<p><span aria-label="129" id="pg_129" role="doc-pagebreak"/>Finally, to support the EDC task, annotators identified five event types in which entities participate, with targeted types including Interaction, Movement, Transfer, Creation, and Destruction events. Provenance information tagged by annotators included not only the textual mention or anchor for each event (categorized by type and subtype), but also event arguments (agent, object, source, and target) and attributes (temporal and locative, as well as others like instrument or purpose) according to a type-specific template. Later phases of ACE involved the addition of event types and relations between events.</p>
<p>In summary, ACE can now be thought of as both a standard for defining relations (and is an ontology in that traditional sense) but also for evaluations conducted by NIST for EDT and RDC. One of the most widely known data sets used to report RE systems’ performance in the literature is the ACE 2004 data set, although more recent ones (such as ACE 2005 and 2007) have also been adopted in evaluations. Some statistics on occurrence counts of relations and subrelations in ACE 2004 are compiled in <a href="chapter_6.xhtml#tab6-1" id="rtab6-1">table 6.1</a>.</p>
<div class="table">
<p class="TT"><a id="tab6-1"/><span class="FIGN"><a href="#rtab6-1">Table 6.1</a>:</span> <span class="FIG">Occurrence counts of relations and subrelations in the ACE 2004 data set.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>Type</b></p></th>
<th class="TCH"><p class="TB"><b>Subtype</b></p></th>
<th class="TCH"><p class="TB"><b>Count</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB">PHYS</p></td>
<td class="TB"><p class="TB">Located</p></td>
<td class="TB"><p class="TB">745</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Near</p></td>
<td class="TB"><p class="TB">87</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Part-Whole</p></td>
<td class="TB"><p class="TB">384</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">PER-SOC</p></td>
<td class="TB"><p class="TB">Business</p></td>
<td class="TB"><p class="TB">179</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Family</p></td>
<td class="TB"><p class="TB">130</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Other</p></td>
<td class="TB"><p class="TB">56</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">EMP-ORG</p></td>
<td class="TB"><p class="TB">Employ-Exec</p></td>
<td class="TB"><p class="TB">503</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Employ-Staff</p></td>
<td class="TB"><p class="TB">554</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Employ-Undetermined</p></td>
<td class="TB"><p class="TB">79</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Member-of-Group</p></td>
<td class="TB"><p class="TB">192</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Subsidiary</p></td>
<td class="TB"><p class="TB">209</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Partner</p></td>
<td class="TB"><p class="TB">12</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Other</p></td>
<td class="TB"><p class="TB">82</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">ART</p></td>
<td class="TB"><p class="TB">User/Owner</p></td>
<td class="TB"><p class="TB">200</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Inventor/Manufacturer</p></td>
<td class="TB"><p class="TB">9</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Other</p></td>
<td class="TB"><p class="TB">3</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">OTHER-AFF</p></td>
<td class="TB"><p class="TB">Ethnic</p></td>
<td class="TB"><p class="TB">39</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Ideology</p></td>
<td class="TB"><p class="TB">49</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Other</p></td>
<td class="TB"><p class="TB">54</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">GPE-AFF</p></td>
<td class="TB"><p class="TB">Citizen/Resident</p></td>
<td class="TB"><p class="TB">273</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Based-in</p></td>
<td class="TB"><p class="TB">216</p></td>
</tr>
<tr>
<td class="TB"/>
<td class="TB"><p class="TB">Other</p></td>
<td class="TB"><p class="TB">40</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">DISC</p></td>
<td class="TB"><p class="TB">Disc</p></td>
<td class="TB"><p class="TB">279</p></td>
</tr>
</tbody>
</table>
</figure>
</div>
<p><span aria-label="130" id="pg_130" role="doc-pagebreak"/>A similar standard to ACE is Entities, Relations, Events (ERE), which was created under the DARPA DEFT program as a more streamlined version of ACE, with the goal of easing annotation and ensuring more consistency across annotators. ERE attempts to achieve these goals by consolidating some of the annotation type distinctions that were found to be problematic in ACE, along with removing some of the more complex annotation features.</p>
<p>There are many interesting aspects of ACE that are beyond the scope of this chapter but are nevertheless worth exploring for readers looking to develop RE or event IE systems for an application. We provide pointers both to ACE and to some of the other ontologies covered subsequently in the section entitled “Bibliographic Notes,” at the end of this chapter.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec6-2-2"/><b>6.2.2 Other Ontologies: A Brief Primer</b></h3>
<p class="noindent">ACE is far from the only ontology available for relations and events. The Rich Event Ontology (REO) provides an independent conceptual backbone to unify existing semantic role labeling schemas and augment them with event-to-event causal and temporal relations. It does this by unifying various other influential ontologies and vocabularies, including FrameNet, VerbNet, ACE, and Rich ERE resources.</p>
<p>Besides ACE and REO, the Conflict and Mediation Event Observations Event and Actor Codebook (CAMEO) event coding ontology (funded under the US National Science Foundation) is another well-known instance of a vocabulary that was developed over a decades-long period and is licensed under Creative Commons. The project’s origins were far more modest than what it eventually became: it was initially intended to be finished in six months of part-time work. Instead, it developed into a next-generation coding scheme designed to correct some of the long-recognized conceptual and practical shortcomings of vocabularies like the World Event/Interaction Survey (WEIS), as well as to include elements that were important for geopolitical domains (e.g., it included support for detailed coding of substate actors). Eventually, CAMEO was used extensively in the Integrated Conflict Early Warning System (ICEWS) project, funded by the Defense Advanced Research Projects Agency (DARPA), where it was found to serve a robust set of needs.</p>
<p>The CAMEO formal codebook includes descriptions and extensive examples for each category and is available in both print and web-based formats. Despite CAMEO originally being intended specifically to code events dealing with international mediation, it has worked well as a general coding scheme for studying political conflict. For example, it includes four-digit tertiary subcategories that focus on very specific types of behavior, differentiating, for instance, between agreement to, or rejection of, cease-fire, peacekeeping, and conflict settlement.</p>
<p>Detailed support for actors is an important component of CAMEO. As the authors describe in the CAMEO specification, the concept of “actor” is now diffuse in the post–Cold War global political environment due to the proliferation of substate, nonstate, multistate, and trans-state actors, some of whom exert greater force or influence than the official states <span aria-label="131" id="pg_131" role="doc-pagebreak"/>themselves. Furthermore, because of this focus on detailed support for actors, as well as many of the event and relation types mentioned here, CAMEO has become a domain-specific ontology for event IE, with geopolitical KG construction and understanding being the obvious use case. In principle, the ontology can support events beyond geopolitics, but ACE still tends to be the preferred choice for generic event extraction.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec6-3"/><b>6.3 Techniques for Relation Extraction</b></h2>
<p class="noindent">Effective techniques for RE face many challenges. First, there are many possible relations, which vary between domains. Given a pair of entities, the chances of error are high when RE is posed as a classification problem because there are so many label choices. Nonbinary RE, which can start resembling event extraction, is more challenging and relatively less well studied than binary RE. We note that RE can be difficult even for humans, as evidenced by high interannotator disagreement on some corpora. This can make both training and testing nontrivial. Finally, extending English RE systems to non-English languages is more challenging than one might expect from similar multilingual transfer in tasks like NER, because RE systems have heavier language dependence.</p>
<p>As with NER, various flavors of machine learning, including supervised and semisupervised techniques, have been used to attack RE. Next, we detail some important findings.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec6-3-1"/><b>6.3.1 Supervised Relation Extraction</b></h3>
<p class="noindent">Supervised RE generally applies to mention-level (rather than global-level) RE. As evident from the name, supervised methods require labeled data where each pair of entity mentions is tagged with one of the predefined relation types (obtained from the ontology used for the task, the common ontologies having been described earlier). A special relation type <i>None</i> or <i>NA</i> is usually used to label the entity pairs where no predefined relation type holds. As the field has evolved, two distinct kinds of supervised RE have emerged into the mainstream: feature-based supervised RE and kernel-based supervised RE.</p>
<p class="TNI-H3"><b>6.3.1.1 Feature-Based Supervised RE</b> Feature-based methods define the RE problem as a classification problem, a formulation we briefly suggested earlier. Namely, for each pair of entity mentions, a set of features is generated, and a classifier is trained on labeled instances (i.e., pairs of mentions tagged with a relation label). Here, “classifier” is defined broadly: it could be a single model, but it could also be an ensemble of models working together in a simple (e.g., averaging-based) or more complicated manner. The features are extracted again at test time for entity mention pairs whose labels are unknown or withheld, and the classifier is used to predict the relation, often probabilistically. Once a pair of entity mentions is converted to a feature vector, we are in familiar supervised machine learning territory. However, converting pairs of entity mentions to feature vectors is a nontrivial process.</p>
<p><span aria-label="132" id="pg_132" role="doc-pagebreak"/>In the research literature, several kinds of features have been explored, including lexical, syntactic, and semantic features. Specific feature types could include word features (which could be the words representing the mentions, as well as all words between the mentions in the sentence), entity types of the mentions (e.g., Person, Location), mention types (e.g., name or pronoun), dependency features [e.g., Part-of-Speech (POS) and chunk labels of words on which the mentions are dependent in the dependency tree], and even parse tree features (e.g., path of nonterminals connecting the mentions in the parse tree). This is not an exhaustive list, but only some examples. Note that feature types do not have to be generic, but they could be domain-specific to yield higher performance in certain domains, such as the biomedical or geopolitical domain.</p>
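<p>As a concrete illustration of lexical and entity-type features, the sketch below turns a tagged mention pair into a sparse feature dictionary. The feature names and function signature are hypothetical; real systems would add syntactic features obtained from dependency and constituency parsers.</p>

```python
def extract_features(tokens, e1_span, e2_span, e1_type, e2_type):
    """Map a mention pair to a sparse feature dict.

    tokens: the sentence as a token list; e*_span: (start, end) token
    offsets (end exclusive); e*_type: entity types of the two mentions.
    """
    feats = {}
    # Word features: the head words of the two mentions.
    feats["e1_head=" + tokens[e1_span[1] - 1].lower()] = 1.0
    feats["e2_head=" + tokens[e2_span[1] - 1].lower()] = 1.0
    # Lexical context: all words between the two mentions.
    for word in tokens[e1_span[1]:e2_span[0]]:
        feats["between_word=" + word.lower()] = 1.0
    # Entity-type pair of the two mentions.
    feats["type_pair=" + e1_type + "-" + e2_type] = 1.0
    # Distance (in tokens) between the mentions.
    feats["num_between"] = float(e2_span[0] - e1_span[1])
    return feats
```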
<p>An interesting category of features is <i>semantic</i> features, which were historically based on the use of semantic resources such as WordNet. WordNet senses and semantic classes can be especially useful when dealing with finer-grained relation subtypes, such as different kinds of social or interpersonal relations. Because WordNet is a general resource, the exact way in which it is used by a system to extract features can vary. Another interesting class of features is based on the observation that all of the ACE 2004 relation types are based on one of several constrained structures that are syntactic-semantic in nature; for instance, <i>premodifier</i> structure, where an adjective or proper noun modifies another noun (<i>American diplomat</i>); <i>possessive</i>, where the first mention is in possessive case (<i>California’s legislature</i>); <i>preposition</i>, where the two mention entities are related via a preposition (<i>mayor of Torrance</i>); and finally, <i>formulaic</i>, wherein the two mentions are written in some specific form (<i>Austin, Texas</i>). The reason why this observation is important is because we can use some rules and patterns to identify the specific structure in play, and this identification helps RE performance because models specialized for one or more of these structures can be used instead of a one-size-fits-all model.</p>
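<p>The four constrained structures can often be detected with shallow patterns over the text span covering both mentions. The regular expressions below are a rough, illustrative sketch with nowhere near full coverage; production rules would also consult POS tags.</p>

```python
import re

# Ordered patterns; the first match wins. Crude illustrations only.
STRUCTURES = [
    ("possessive", re.compile(r"^\w+[’']s\s+\w+$")),             # California's legislature
    ("formulaic", re.compile(r"^\w+,\s*\w+$")),                  # Austin, Texas
    ("preposition", re.compile(r"^\w+\s+(?:of|in|at)\s+\w+$")),  # mayor of Torrance
    ("premodifier", re.compile(r"^\w+\s+\w+$")),                 # American diplomat
]

def identify_structure(span_text):
    """Return the syntactic-semantic structure a mention-pair span matches."""
    for name, pattern in STRUCTURES:
        if pattern.match(span_text):
            return name
    return None
```

<p>Once the structure in play is identified, a model specialized for that structure can be dispatched in place of a one-size-fits-all model.</p>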
<p>Concerning the classification step, there are several ways to frame the problem because it is nonbinary (there is usually more than one relation in any nontrivial ontology). For example, because a model such as SVM is a binary classifier, one way to generalize it to do multiclass classification is by employing a strategy such as <i>one versus others</i> or (more rarely) one versus one.</p>
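<p>The <i>one versus others</i> strategy is easy to sketch. The toy implementation below substitutes a simple perceptron for the SVM (purely to keep the example self-contained); it trains one binary model per relation type and predicts by taking the highest-scoring model.</p>

```python
class Perceptron:
    """Minimal binary linear classifier over sparse feature dicts."""

    def __init__(self):
        self.w = {}
        self.b = 0.0

    def score(self, feats):
        return sum(self.w.get(f, 0.0) * v for f, v in feats.items()) + self.b

    def fit(self, X, y, epochs=10):
        # y contains +1 / -1 labels.
        for _ in range(epochs):
            for feats, target in zip(X, y):
                pred = 1 if self.score(feats) > 0 else -1
                if pred != target:
                    for f, v in feats.items():
                        self.w[f] = self.w.get(f, 0.0) + target * v
                    self.b += target

class OneVsRest:
    """One binary model per relation type; predict the best-scoring type."""

    def fit(self, X, labels):
        self.models = {}
        for rel in set(labels):
            y = [1 if lab == rel else -1 for lab in labels]
            model = Perceptron()
            model.fit(X, y)
            self.models[rel] = model
        return self

    def predict(self, feats):
        return max(self.models, key=lambda rel: self.models[rel].score(feats))
```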
<p>Experimentally, feature-based methods have been superseded by both kernel-based methods and more recent methods based on representation learning (via deep neural nets). In practice, even when feature-based methods were very popular, much of the effort was spent in tuning and devising the right set of features for a domain and data set. While this kind of manual feature engineering has slowly but surely fallen out of favor, it is still often a first line of attack because it is easy to implement and quite intuitive. In domains where a reasonable amount of training data is available and some noise is tolerable, such methods may be sufficient owing to low implementation overhead (and low training and model selection complexity compared to deep nets).</p>
<section epub:type="division">
<h4 class="head c-head"><span aria-label="133" id="pg_133" role="doc-pagebreak"/><b>6.3.1.2 Kernel-Based Supervised RE</b> Feature-based RE is heavily dependent on the features extracted from the mention pairs and the sentence. Word embeddings could be used to add more global context, but without manual feature engineering, feature-based RE methods have not traditionally yielded maximal performance. The problem of feature engineering has been long recognized in the RE community, and kernel-based methods were proposed to avoid such explicit feature engineering. Inspired by SVMs, kernel methods compute similarities between representations of two <i>relation</i> instances, with the SVM used for the actual classification.</h4>
<p class="noindent">Before proceeding further, we provide a brief primer on kernels. Kernel methods are based on the idea of kernel functions, which operate in a high-dimensional <i>implicit</i> feature space, without needing to compute the coordinates in that space because they only need to compute the inner products between the images of all data pairs in that space. This computation, which is much less expensive than explicit computation of coordinates, is often referred to in the machine learning and SVM literature as the “kernel trick.” In essence, to employ the trick, we only need a similarity function over data point pairs in the raw representation. The task of engineering and extracting features is now replaced with the task of specifying the kernel.</p>
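<p>A small numeric check makes the trick tangible. For the quadratic kernel <i>K</i>(<i>x</i>, <i>y</i>) = (<i>x</i> · <i>y</i>)<sup>2</sup>, the implicit feature map consists of all pairwise products of coordinates, and the kernel value equals the inner product in that expanded space without ever materializing it:</p>

```python
def quadratic_kernel(x, y):
    """K(x, y) = (x . y)^2, computed in the original low-dimensional space."""
    return sum(a * b for a, b in zip(x, y)) ** 2

def phi(x):
    """Explicit degree-2 feature map: all pairwise coordinate products."""
    return [xi * xj for xi in x for xj in x]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = [1, 2], [3, 4]
assert quadratic_kernel(x, y) == inner(phi(x), phi(y)) == 121
```

<p>For <i>n</i>-dimensional inputs, the explicit map has <i>n</i><sup>2</sup> coordinates, whereas the kernel computation stays in <i>n</i> dimensions; this gap is the whole point of the trick.</p>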
<p>Different kernels have been defined and used over the years, with some of the common ones including the <i>sequence</i> kernel, the <i>syntactic</i> kernel, the <i>dependency tree</i> kernel, the <i>dependency graph path</i> kernel, and even <i>composite</i> kernels. The <i>sequence kernel</i>, for example, is motivated by the (now-famous) <i>string subsequence kernel</i> and computes the number of shared subsequences between any two sequences. Relation instances, therefore, must be represented as sequences for the proper functioning of this kernel. One way to do so is to consider the sequence of words from the first mention to the second mention in the sentence. Furthermore, even this method can be made more robust by not considering just the word (as a singleton in the sequence), but also features extracted from the word. In this representation, each word itself is turned into a feature vector [with the domain of the features being not just the set of all words (i.e., the vocabulary), but also the set of POS tags, generalized POS tags, entity types, and so on, all of which can be leveraged to extract a specific set of features for a single word] and the sequence is a list of feature vectors. Although not considered in classic work, even word embeddings could be used as word feature vectors, though the continuous nature of the embeddings may be inappropriate for the sequence kernel.</p>
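<p>A minimal version of the subsequence idea can be implemented with dynamic programming. The sketch below is a simplification: the string subsequence kernels used in practice weight matches by a gap-penalty factor and bound the subsequence length, both of which are omitted here. It counts the pairs of identical, possibly noncontiguous subsequences shared by two sequences (including the empty one), and works equally on strings or token lists.</p>

```python
def subsequence_kernel(s, t):
    """Count matching subsequence occurrence pairs between sequences s and t.

    K[i][j] holds the kernel value of the prefixes s[:i] and t[:j]; each
    element of s is either skipped, or matched against every equal element
    in the current prefix of t.
    """
    n, m = len(s), len(t)
    K = [[1] * (m + 1) for _ in range(n + 1)]  # empty-prefix row/column = 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            K[i][j] = K[i - 1][j] + sum(
                K[i - 1][k - 1] for k in range(1, j + 1) if t[k - 1] == s[i - 1]
            )
    return K[n][m]
```

<p>On top of such a similarity function, an SVM can be trained by supplying the pairwise kernel (Gram) matrix instead of explicit feature vectors.</p>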
<p>To extend the sequence kernel to work well on multidimensional lists of features, a generalized subsequence kernel was proposed early in the literature on kernel-based RE. This kernel is able to efficiently compute the similarity for two sequences <i>s</i> and <i>t</i> using a recursive formulation. More details on the recursive formulation and its implementation can be found in the original paper (see the “Bibliographic Notes” section). Another kernel, called the <i>relation kernel</i>, is based on the generalized subsequence kernel and is a sum of four <span aria-label="134" id="pg_134" role="doc-pagebreak"/>subkernels. The relation kernel can either be used in a multiclass SVM that is trained such that each relation type corresponds with its own class (with an extra class for the NA or NONE relation), or in a binary SVM that first decides whether <i>any</i> relation exists between the two entity mentions, followed by a multiclass SVM (if a relation is found to exist) to decide the appropriate relation type. This two-level approach was found experimentally to yield better results than the former approach.</p>
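<p>The flavor of the recursion can be seen in the following sketch of the λ = 1 special case of the string subsequence kernel, which counts shared subsequences of a fixed length over all pairs of occurrence positions (the full generalized kernel adds a gap-decay factor and operates over feature vectors rather than characters):</p>

```python
from functools import lru_cache

def subsequence_kernel(s, t, n):
    """Number of shared subsequences of length n, counted over all pairs of
    occurrence positions (the gap-decay factor of the full kernel is
    dropped here, i.e., lambda = 1)."""
    @lru_cache(maxsize=None)
    def f(i, j, k):
        # f(i, j, k): matched subsequence pairs of length k in s[:i], t[:j].
        if k == 0:
            return 1
        if i < k or j < k:
            return 0
        total = f(i - 1, j, k) + f(i, j - 1, k) - f(i - 1, j - 1, k)
        if s[i - 1] == t[j - 1]:
            total += f(i - 1, j - 1, k - 1)
        return total
    return f(len(s), len(t), n)

shared_bigrams = subsequence_kernel("cat", "cart", 2)  # "ca", "ct", "at"
```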
<p>A full description of the other kernels is beyond the scope of this chapter; however, the diversity of kernels mentioned earlier (based on graphs, trees, including dependency and syntactic trees, and sequences, as just described) illustrates the fruitful history of progress that kernel-based RE has enjoyed over the years, especially prior to the advent of deep learning. Syntactic tree kernels, for example, use the structural properties of a sentence (including its constituent parse tree) as primary features, but other works have augmented these features with information about entities and relations. Because the natural representation here is a tree, one issue that arises is the efficient computation of the kernel, because the number of possible subtrees is very large and it is not viable to explicitly construct each instance’s image vector. In the kernel literature, this problem has been addressed by devising a polynomial-time function that is based on a recursive definition and counts the number of common subtrees rooted at two nodes <i>v</i><sub>1</sub> and <i>v</i><sub>2</sub>.</p>
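<p>The polynomial-time recursion mentioned here can be sketched as follows (a simplified, Collins-and-Duffy-style count of common subtrees over toy tuple-encoded parse trees; the published kernels differ in details such as decay factors):</p>

```python
def production(node):
    # A node is (label, child, ...); leaves are plain strings.
    return (node[0],) + tuple(c if isinstance(c, str) else c[0] for c in node[1:])

def common_subtrees(a, b):
    # Recursive count of common subtrees rooted at nodes a and b.
    if production(a) != production(b):
        return 0
    count = 1
    for ca, cb in zip(a[1:], b[1:]):
        if not isinstance(ca, str):
            count *= 1 + common_subtrees(ca, cb)
    return count

def nodes(tree):
    out = [tree]
    for child in tree[1:]:
        if not isinstance(child, str):
            out.extend(nodes(child))
    return out

def tree_kernel(t1, t2):
    # Sum the recursive counts over all node pairs, avoiding explicit
    # enumeration of the (very large) space of subtrees.
    return sum(common_subtrees(a, b) for a in nodes(t1) for b in nodes(t2))

t1 = ("NP", ("D", "the"), ("N", "dog"))
t2 = ("NP", ("D", "the"), ("N", "cat"))
k = tree_kernel(t1, t2)
```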
<p>Similar to the sequence kernel, a good representation for relation instances is necessary for the proper functioning of this kernel. Several possibilities exist, most of which construct a subtree from the complete syntactic tree characterizing a particular relation instance (pair of entities). These possibilities include, for example, a <i>path-enclosed tree</i> (the <i>smallest</i> subtree that includes both entities; i.e., the subtree enclosed by the shortest path connecting the two entities in the sentence’s parse tree) and a <i>minimum complete tree</i> (the complete subtree formed by the lowest common ancestor of the two entities). Experimentally, the former has been found to work quite well, though it can perform even better when contextual information is also included in the tree. For example, a <i>context-sensitive path tree</i> is an extension of the path tree, where one word to the left (right) of the first (second) entity is also included. Yet other work has tried to augment the tree by annotating the tree nodes with discriminant features such as WordNet senses and properties of entity mentions, and by designing a special kernel called the Feature-Enriched Tree Kernel to compute the similarity between such “enriched” trees.</p>
<p><i>Composite kernels</i> were designed to combine information captured by all of these different kernels (e.g., by combining syntactic tree and sequence kernels). Such combination is nontrivial because it has to be a valid kernel function for the kernel trick to apply. Functions that can be used to ensure valid composition between two kernels include sum, product, and linear combination. Several researchers have leveraged this property to design sophisticated composite kernels capturing multiple information sets to improve performance. For example, an influential approach designed individual kernels for three levels of NLP <span aria-label="135" id="pg_135" role="doc-pagebreak"/>processing (tokenization, sentence parsing, and deep dependency analysis), and then combined them so that processing errors in one of the kernels can be compensated using the information in the other kernels. Later state-of-the-art work went even further by trying to apply composite kernels in a distant supervision framework, leveraging Wikipedia infoboxes, and reporting significantly improved performance.</p>
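<p>The closure properties that make composition valid are easy to state in code. In this illustrative sketch (the component kernels and inputs are toy choices), any sum, product, or nonnegatively weighted linear combination of valid kernels is again a valid kernel:</p>

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def quadratic_kernel(x, y):
    return dot(x, y) ** 2

def kernel_sum(k1, k2):
    # The sum of two valid kernels is a valid kernel.
    return lambda x, y: k1(x, y) + k2(x, y)

def kernel_product(k1, k2):
    # So is their product.
    return lambda x, y: k1(x, y) * k2(x, y)

def kernel_combination(kernels, weights):
    # And any linear combination with nonnegative weights.
    return lambda x, y: sum(w * k(x, y) for k, w in zip(kernels, weights))

x, y = [1.0, 2.0], [3.0, 0.5]
composite = kernel_combination([linear_kernel, quadratic_kernel], [0.5, 0.5])
```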
</section>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec6-3-2"/><b>6.3.2 Evaluating Supervised Relation Extraction</b></h3>
<p class="noindent">As mentioned earlier, some of the popular data sets on which different systems have reported RE performance include ACE 2003 and 2004, though a few papers also report performance on the occasional nonstandard data set. Because RE is essentially a multi-class classification problem, performance can be evaluated in terms of precision, recall, and F-measure for the non-NA classes.</p>
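<p>A minimal sketch of this evaluation protocol, with micro-averaged precision, recall, and F1 computed over the non-NA predictions (the relation labels here are illustrative):</p>

```python
def prf_non_na(gold, pred, na="NA"):
    """Micro-averaged precision/recall/F1 over the non-NA relation classes:
    NA predictions and NA gold labels do not count as positives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != na)
    predicted = sum(1 for p in pred if p != na)
    actual = sum(1 for g in gold if g != na)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["spouse_of", "NA", "capital_of", "NA"]
pred = ["spouse_of", "capital_of", "NA", "NA"]
p, r, f1 = prf_non_na(gold, pred)
```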
<p>Recent surveys have attempted to compare reported performance across the techniques of RE described previously. Kernel-based RE systems were generally found to outperform feature-based RE. In one study across seven major relation types in the ACE 2004 data set, for example, a syntactic tree kernel with dynamically determined tree span was found to achieve the highest F-score (of 77.1 percent), outperforming feature-based RE systems (that used lexical, syntactic, and dependency tree features, as well as a classifier based on the syntactic-semantic structures previously described) by more than 5.5 percent, and other competitive kernel-based REs by margins almost as high. The fact that a syntactic tree kernel was able to achieve the best performance demonstrates the importance of incorporating the structural features of a sentence into any pipeline.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec6-3-3"/><b>6.3.3 Semisupervised Relation Extraction</b></h3>
<p class="noindent">Acquiring labeled data at scale is always a challenge for any task, and RE is no different, motivating development of semisupervised techniques. A second motivation, however, is to also leverage the large amounts of unlabeled data currently available on the web, which could be used to significantly boost performance of existing RE architectures without necessarily requiring labeling effort.</p>
<p>The most notable semisupervised (alternatively called <i>weakly supervised</i>) method for RE is <i>bootstrapping</i>, which starts from a small set of seed relation instances and iteratively learns more relation instances and extraction patterns. While bootstrapping has been widely explored (as discussed in the next section), another learning paradigm called <i>distant supervision</i> has become a popular alternative because it uses a large number of known relation instances in existing large knowledge bases (KBs) to create a proxy for actual training data. A commonality between bootstrapping and distant supervision is that, for both paradigms, <i>noisy</i> training data is <i>automatically</i> generated. To achieve good performance, careful feature selection and pattern filtering need to be carried out. Besides bootstrapping and distant supervision, more traditional machine learning paradigms such as <i>active learning</i>, as well as recent paradigms such as <i>label propagation</i> and <i>multitask transfer learning</i>, <span aria-label="136" id="pg_136" role="doc-pagebreak"/>have been explored. The key idea behind the former method is that the learning algorithm is allowed to ask for true labels of some selected unlabeled instances. The criterion for choosing these instances varies, but all such criteria have the common goal of recovering the underlying hypothesis as quickly as possible (i.e., with few labeled instances).</p>
<p>In contrast, label propagation is a graph-based, semisupervised method where labeled and unlabeled instances<sup><a href="chapter_6.xhtml#fn1x6" id="fn1x6-bk">1</a></sup> in the data are represented as vertices in a graph with edges reflecting similarities between vertices. Using an iterative algorithm, label information for a vertex is propagated in a systematic way to nearby unlabeled vertices through these weighted edges. This process continues (i.e., the labels are spread out over the graph over several time steps). Ultimately, the labels of (previously unlabeled) vertices are considered to have been inferred when the propagation process meets a particular convergence criterion. An important advantage of label propagation is that the labels of (previously unlabeled) vertices are determined not only by nearby labeled instances, but also by nearby unlabeled instances.</p>
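<p>The iterative algorithm can be sketched as follows (a simplified propagation on a toy graph; published variants differ in normalization and convergence testing). Label scores flow along weighted edges, and the labeled seed vertices are clamped back to their true labels after every step:</p>

```python
def label_propagation(adj, seeds, n_iter=50):
    """adj: symmetric weight matrix (list of lists); seeds: dict mapping a
    labeled vertex index to its class label."""
    classes = sorted(set(seeds.values()))
    n = len(adj)
    # One score per class per vertex; labeled seeds start one-hot.
    F = [[1.0 if seeds.get(i) == c else 0.0 for c in classes] for i in range(n)]
    for _ in range(n_iter):
        new_F = []
        for i in range(n):
            deg = sum(adj[i]) or 1.0  # row-normalize the adjacency weights
            new_F.append([sum(adj[i][j] * F[j][k] for j in range(n)) / deg
                          for k in range(len(classes))])
        F = new_F
        for i, c in seeds.items():  # clamp the labeled seeds
            F[i] = [1.0 if c2 == c else 0.0 for c2 in classes]
    return {i: classes[max(range(len(classes)), key=lambda k: F[i][k])]
            for i in range(n)}

# A four-vertex path A - ? - ? - B: each middle vertex should inherit the
# label of the nearer seed.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
inferred = label_propagation(adj, {0: "A", 3: "B"})
```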
<p>Multitask transfer learning models the semisupervised learning problem yet another way, by attempting to start from a few seed instances of the relation type of interest, as well as a large number of labeled instances of other relation types. By using common structural properties (usually syntactic) that many different relation types seem to share, the framework uses a transfer learning method in addition to human inputs in the form of entity-type constraints. By using a shared weight vector, knowledge learned from other relation types can be transferred to the target relation type (hence, it is a “multitask” learning paradigm).</p>
<p class="TNI-H3"><b>6.3.3.1 Bootstrapping</b> We mentioned earlier that bootstrapping requires a large, unlabeled corpus and a few seed instances of a relation type of interest. Given the seed examples, bootstrapping is expected to extract other, similar entity pairs having the same relation type. An important early algorithm for bootstrapping was called Dual Iterative Pattern Relation Expansion (DIPRE) and relied on an idea called <i>Pattern Relation Duality</i>, which states that (1) given a good set of patterns, a good set of entity pairs (related according to a prespecified relation type) can be found; and (2) given a good set of such entity pairs, a good set of patterns can be learned. DIPRE puts this idea into practice through an iterative process. Patterns are represented as a tuple with five elements: <i>order, urlprefix, prefix, middle</i>, and <i>suffix</i>, where order is boolean, and the others are strings. The algorithm was designed mainly for web data, as evidenced by the urlprefix pattern element. Experimentally, it was shown that using just three seed examples of author and book pairs, and using a corpus of about 24 million webpages, DIPRE was able to generate a list of 15,000+ author-book pairs.</p>
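<p>The Pattern Relation Duality loop can be sketched as follows (a heavily simplified, illustrative version: patterns are reduced to the literal middle string between the entities, ignoring DIPRE's order, urlprefix, prefix, and suffix elements and all pattern filtering):</p>

```python
import re

def bootstrap(corpus, seeds, rounds=2):
    """Alternate between learning middle-string patterns from known pairs
    and matching those patterns to harvest new pairs."""
    pairs = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # (1) Given good pairs, learn patterns: the text between the entities.
        for sent in corpus:
            for a, b in pairs:
                m = re.search(re.escape(a) + r"(.+?)" + re.escape(b), sent)
                if m:
                    patterns.add(m.group(1))
        # (2) Given good patterns, find new pairs matching them.
        for sent in corpus:
            for middle in patterns:
                m = re.search(r"(\w[\w ]*?)" + re.escape(middle) + r"(\w[\w ]*)",
                              sent)
                if m:
                    pairs.add((m.group(1).strip(), m.group(2).strip()))
    return pairs

corpus = ["Herman Melville wrote Moby Dick", "Jane Austen wrote Emma"]
found = bootstrap(corpus, {("Herman Melville", "Moby Dick")})
```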
<p><span aria-label="137" id="pg_137" role="doc-pagebreak"/>The Snowball system improved over DIPRE by incorporating named entity tags in the patterns, meaning that patterns such as “ACTOR acted in MOVIE” and “ACTOR-featuring MOVIE” could be learned from text. With these patterns, the system can further search the corpus and find more such pairs with the target relation. These entity pairs are then added to the set of seed relation instances, and the entire process is repeated until a certain condition is satisfied.</p>
<p>An important point to note with these bootstrapping methods is that the quality of extraction patterns has to be evaluated such that not too many noisy patterns are included during the extraction process. Heuristic methods have been proposed to this effect, with two factors (coverage and precision) typically considered. While coverage is related to the percentage of true relation instances that can be discovered by the pattern, precision is related to the percentage of correct relation instances among all the relation instances discovered by the pattern. The metrics of coverage and precision are recurring themes in evaluations of many KG-centric subcomponents, including (as we shall see later) instance matching.</p>
<p class="TNI-H3"><b>6.3.3.2 Distant Supervision</b> Bootstrapping draws upon only a small set of seed entity pairs for its initialization. However, with the growth of the web and the publishing of large-scale repositories covering all manner of subject areas (a good example being Wikipedia), much human knowledge, contributed by crowds of users, has been captured and stored in KBs. With such openly available knowledge, it has become possible to use a large set of entity pairs known to have a target relation to <i>generate</i> training data. This idea has become known in the community as <i>distant supervision</i>. In the earliest approaches leveraging this philosophy, the assumption was that if two entities participate in a relation, any sentence that contains the two entities expresses that relation. Because this assumption does not always hold, some of the approaches use features extracted from different sentences containing the entity pair to create a richer feature vector that is supposed to be more reliable. Representation learning has made this process significantly more robust, and new research continues to be published that lies at the intersection of embeddings and distant supervision. Many of these approaches enrich the feature space by defining lexical, syntactic, and named entity tag features. Standard multiclass logistic regression is often used as the classification algorithm. By using distant supervision, even the early approaches were able to empirically show that they could achieve almost 70 percent precision based on human judgment. By using several sources (e.g., in one approach, both YAGO and Wikipedia documents were used for distant supervision), more improvements could be achieved, including F-measures well into the mid-70s.
Being a general technique, distant supervision is a recurring theme in KG construction (KGC), and it has also emerged as a viable IE technique with respect to information types other than relations and genres beyond free text.</p>
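<p>The classic assumption can be sketched in a few lines (illustrative entity and relation names; real systems also extract lexical and syntactic features from each matched sentence and filter the noisy labels):</p>

```python
def distant_supervision_labels(kb, sentences):
    """kb maps (head, tail) -> relation. Any sentence containing both
    entities of a fact is labeled with that fact's relation; co-occurring
    entities with no known fact are labeled NA."""
    examples = []
    entities = sorted({e for pair in kb for e in pair})
    for sent in sentences:
        present = [e for e in entities if e in sent]
        for i, head in enumerate(present):
            for tail in present[i + 1:]:
                rel = kb.get((head, tail)) or kb.get((tail, head)) or "NA"
                examples.append((head, tail, rel, sent))
    return examples

kb = {("Sergey Brin", "Google"): "Founder"}
sentences = ["Sergey Brin addressed Google employees yesterday."]
examples = distant_supervision_labels(kb, sentences)
# Note: this example sentence does not actually express the Founder
# relation, illustrating exactly the noise discussed in the text.
```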
</section>
<section epub:type="division">
<h3 class="head b-head"><span aria-label="138" id="pg_138" role="doc-pagebreak"/><a id="sec6-3-4"/><b>6.3.4 Unsupervised Relation Extraction</b></h3>
<p class="noindent">Earlier, we discussed RE when the types of relations to be extracted are known in advance. There are also cases where we do not have any specific relation types in mind but would like to discover salient relation types in a given corpus. For example, given a set of articles reporting hurricane and typhoon events, it would be useful if we could automatically discover that one of the most important relations for this domain is the <i>originate</i> relation between a typhoon and the place where it originated.</p>
<p>When first studied, this important problem was referred to as <i>unrestricted relation discovery</i> but it is now referred to as “unsupervised” RE. An early approach tackled this problem by first collecting a large number of news articles from different news sources on the web, and then using simple clustering (based on lexical similarity) to pinpoint articles discussing the same event. In this way, the feature representation of an entity could be enriched by using its many occurrences in various articles. The next step was to perform syntactic parsing and extract named entities from the articles. Each named entity could then be represented by a set of syntactic patterns as its features. For example, a pattern may indicate that the entity is the subject of the verb <i>originate</i>. As a final step, pairs of entities cooccurring in the same article were clustered using their feature representations. The results were tables in which rows corresponded to different articles and columns corresponded to different roles in a relation. The discovered tables were found to have an impressive accuracy of 75 percent.</p>
<p>Later authors tried to generalize unsupervised relation discovery in their formulation of the problem, mainly by assuming that the input of the problem consists of entity pairs and their contexts. An unsupervised relation discovery algorithm would cluster these entity pairs into disjoint groups, with each group representing a single semantic relation. An “other” or garbage cluster was also used to capture unrelated entity pairs or unimportant relations. The contexts for each entity pair consist of both the context of each entity and the cooccurrence context. An entity pair can be represented by a set of features derived from the contexts, although in the early papers, only surface pattern features (e.g., “arg-1” spoke in support of “arg-2”) were considered for modeling cooccurrence contexts. Clustering was fairly standard as well, drawing on established techniques like hierarchical agglomerative and K-Means clustering. Impressively, however, these methods were able to discover relations such as CityOfState and EmployedIn despite being given no a priori knowledge of these relations.</p>
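<p>The clustering step can be illustrated with a deliberately simple single-pass procedure (a stand-in for the hierarchical agglomerative or K-Means clustering used in the literature; the entity pairs and contexts are toy data):</p>

```python
def cluster_entity_pairs(contexts, threshold=0.5):
    """Single-pass greedy clustering: each entity pair is represented by the
    words of its co-occurrence contexts and joins the first cluster whose
    accumulated vocabulary is similar enough (Jaccard overlap)."""
    clusters = []  # each cluster: (set of entity pairs, set of context words)
    for pair, words in contexts.items():
        words = set(words)
        for members, vocab in clusters:
            overlap = len(words & vocab) / len(words | vocab)
            if overlap >= threshold:
                members.add(pair)
                vocab |= words
                break
        else:
            clusters.append(({pair}, set(words)))
    return [members for members, _ in clusters]

contexts = {
    ("Seattle", "Washington"): ["is", "a", "city", "in"],
    ("Austin", "Texas"): ["is", "a", "city", "in"],
    ("Alice", "Acme"): ["works", "for"],
}
groups = cluster_entity_pairs(contexts)
# The two CityOfState-like pairs fall into one cluster, the EmployedIn-like
# pair into another, with no a priori knowledge of either relation.
```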
<p>As the field of relation discovery has matured (though it is still far from human-level performance), there have been increasing efforts to generalize it even further, with the most complex version of the problem seeking to automatically induce an IE template, with a template containing multiple slots playing different semantic roles. A straightforward solution is to identify role filler candidates first, followed by clustering of these candidates. However, this simplified approach neglects the important observation that a single document <span aria-label="139" id="pg_139" role="doc-pagebreak"/>tends to cover different slots. Some solutions have been proposed, but the problem is far from resolved.</p>
<p>More recent work has tried to go beyond a single static template to discover multiple templates from a corpus and automatically give meaningful labels to discovered slots by performing two-step clustering. The first clustering step groups lexical patterns that are likely to describe the same type of events, while the second step groups candidate role fillers into slots for each type of event. A slot can be labeled using the syntactic patterns of the corresponding slot fillers.</p>
<p>In conclusion, while there is a lot of exciting research being conducted in the area of unsupervised relation discovery and template induction, the difficulty of the problem has prevented general-purpose solutions that are able to achieve roughly the same kind of performance as NERs. However, performance has steadily improved since the early days, even as the scope of the original problem has expanded to become ever-more complex and real-world. In the next chapter, for example, we study Open Information Extraction (Open IE), which attempts to discover entities and relations without any kind of ontological input. Open IE is a natural generalization of the problem of unsupervised relation discovery that we briefly covered in this section. In the literature, Open IE and unsupervised RE are sometimes considered as akin; for instance, Unsupervised RE System (URES) was a direct successor to the KnowItAll Open IE system, which we describe at length in the next chapter.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec6-4"/><b>6.4 Recent Research: Deep Learning for Relation Extraction</b></h2>
<p class="noindent">As with NER, RE has also been considerably influenced recently by deep learning research. As discussed earlier, supervised deep learning techniques for any task, including RE, require large quantities of training data for learning. Manually annotated data sets for RE take enormous time and effort to compile and are clearly not scalable. Hence, techniques like distant supervision, which produces training data by aligning KB facts with texts, have become popular. Data sets acquired in this way allow the learning of more complex models for RE, including architectures based on convolutional neural networks (CNNs).<sup><a href="chapter_6.xhtml#fn2x6" id="fn2x6-bk">2</a></sup> The noise in such data sets generated through distant supervision, however, must be dealt with using special modeling techniques such as multi-instance learning. In this section, we provide some details on these efforts. Note that much research is still being conducted on this problem even today. However, interesting trends have started to emerge.</p>
<p><span aria-label="140" id="pg_140" role="doc-pagebreak"/>Early work on using deep learning for RE started by employing the same supervised data sets used by non–deep learning supervised machine learning models described earlier in this chapter. These data sets include the ACE 2005 data set, but also SemEval-2010 Task 8, which is a freely available data set that contains 10,700+ samples, with a roughly 80/20 percent split for training and testing. There are nine ordered relation types. Because the relations are ordered, their directionality effectively doubles the number of relations, as a pair of entities is believed to be correctly labeled with a relation only if the order is also correct. The final data set, augmented using this principle, contains 19 (rather than 9) relation classes as a result, with a single extra class for <i>Other</i>.</p>
<p>Distant supervision-based approaches started to be used for the task following an influential paper in 2009, wherein documents were aligned with known KBs, under the assumption that if an entity pair in the KB is associated via a relation, then any document containing the mentions of the two entities would also manifest that relation. This is obviously a strong assumption, and it can frequently be violated. There are many documents that contain both <i>Sergey Brin</i> and <i>Google</i>, for example, without explicitly expressing the <i>Founder</i> relationship between Brin and Google. In 2010, a solution to this problem was proposed, wherein the assumption was relaxed by modeling RE as a multi-instance learning problem. <i>Multi-instance learning</i> is a kind of supervised learning wherein a label is assigned to a “bag of instances” rather than a single instance. In the specific case of RE, an entity pair defines a bag, with the bag consisting of all the sentences containing the mention of the entity pair. Instead of assigning a relation label to each sentence, a label is instead assigned to each bag of the relation entity. In effect, the original strong distant supervision assumption is relaxed to a much weaker one—namely, given that a relation exists between an entity pair, at least one document in the bag for the entity pair must express that relation.</p>
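<p>Constructing the bags is straightforward; in this sketch (illustrative data), the relation label would attach to each bag as a whole rather than to individual sentences:</p>

```python
from collections import defaultdict

def build_bags(mentions):
    """mentions: iterable of (head, tail, sentence) triples. Returns one bag
    of sentences per entity pair; under multi-instance learning, at least
    one sentence in the bag is assumed to express the pair's relation."""
    bags = defaultdict(list)
    for head, tail, sentence in mentions:
        bags[(head, tail)].append(sentence)
    return dict(bags)

mentions = [
    ("Sergey Brin", "Google", "Sergey Brin co-founded Google in 1998."),
    ("Sergey Brin", "Google", "Sergey Brin spoke at a Google event."),
    ("Paris", "France", "Paris is the capital of France."),
]
bags = build_bags(mentions)
```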
<p>The data set for distant supervision was created by aligning the Freebase KG with the <i>New York Times</i> corpus, both of which are well known in the NLP community. In the 2010 work, entity mentions were located using the Stanford NER tagger and were matched to the names of Freebase entities. Further, 52 possible relation types, including a special class NA (indicating that there is no relation between the two entities), were defined in the relation ontology. The data set thus compiled is quite large, with more than half a million sentences, 18,000+ relational facts, and 280,000+ entity pairs in the training data. The testing portion contains 172,000+ sentences, 96,000+ entity pairs, and almost 2,000 relational facts. Evaluation is typically done by comparing extracted facts against Freebase entries. While this evaluation is good for comparing different RE systems, it is important not to interpret any results in an “absolute” sense, since Freebase (or any KG, including Wikidata and DBpedia) was not complete, and the resulting false negatives would understate measured performance.</p>
<p>Beyond distant supervision, we noted in chapter 5 how representation learning has yielded great impetus to extraction problems in the NLP community. Word embeddings are the <span aria-label="141" id="pg_141" role="doc-pagebreak"/>most common kind of representation learning (serving as inputs to higher-order deep learning models such as CNNs, as subsequently described), but in RE, <i>positional embeddings</i> were an innovation that helped drive higher RE performance as well. By positional embedding, we mean that the input to the model is not just the word embedding, but also the relative distance of each word from the entities in the sentence is encoded and sent into the higher-order model. This ensures that the deep network can keep track of how close each word is to an entity. Words closer to target entities are expected to contain more useful information involving the relation type. For example, in the sentence “Los Angeles is an economic mainstay in California,” “economic” has relative distance 3 to the head entity “Los Angeles” and relative distance of −3 to the tail entity “California.” These positions can be encoded in a vector of appropriate dimensionality (usually two positional vectors are used, for the head and tail entity, respectively). Together, these vectors, along with the word-embedding vector, would be concatenated and serve as the full-feature vector of the word to a higher-order deep learning model. A sentence would be similarly encoded as a bag of its constituent (full-feature) word vectors.</p>
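<p>Computing the relative-distance inputs for the example sentence above is simple (a sketch; in a real model the integer distances would index two learned positional embedding tables and be concatenated with the word embeddings):</p>

```python
def positional_features(tokens, head_idx, tail_idx):
    """Pair each token with its relative distance to the head and tail
    entity positions in the sentence."""
    return [(tok, i - head_idx, i - tail_idx) for i, tok in enumerate(tokens)]

tokens = ["Los Angeles", "is", "an", "economic", "mainstay", "in", "California"]
feats = positional_features(tokens, head_idx=0, tail_idx=6)
# "economic" sits at distance 3 from the head entity "Los Angeles" and
# distance -3 from the tail entity "California", matching the text.
```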
<p>CNNs have become quite popular in the RE community, starting as early as 2011. Originally, CNNs were proposed primarily for computer vision problems for reasons that are beyond the scope of this chapter. The early deep learning approaches for RE (that did not use distant supervision) applied supervised learning with CNNs to the problem by modeling it as multiclass classification. The earliest work that used CNNs to automatically learn features instead of handcrafting features was published in 2013, and built an end-to-end network that encoded the input sentence using both word vectors and lexical features. This was followed by a convolutional kernel layer, a single-layer neural layer, and the usual softmax output layer, which yields a probability distribution over the relation classes. The architecture also relied on synonym vectors instead of word vectors, wherein a single vector is assigned to each synonym class rather than assigning every unique word its own vector. Unfortunately, this fails to leverage the representational power of word-embedding models. Furthermore, the embeddings (rather than being trained in an unsupervised way on the corpus) are randomly assigned to each synonym class. However, the model does try to incorporate a few lexical features using artifacts such as word, POS, and entity-type lists. Despite the criticisms of the model in hindsight (because some improvements are clearly possible), at the time, it led to an improvement of over 9 points (in F-measure) on the ACE 2005 data set, illustrating the clear promise of this kind of CNN architecture for RE and laying the groundwork for a significant spurt in more deep learning RE research.</p>
<p>Other work using CNNs for RE soon followed, with a CNN model using max-pooling proposed in 2014, and a CNN with multisized window kernels in 2015. The 2015 work proposed to dispense with external lexical features altogether, and instead allow the CNN to learn the features itself. Word and positional embeddings are used, followed by convolution and max-pooling (hence, it is similar in this regard to the work in 2014), but a <span aria-label="142" id="pg_142" role="doc-pagebreak"/>novelty in this architecture was the use of convolutional kernels of varying window sizes to capture wider ranges of <i>n</i>-gram features. Experimentally, using kernels with 2-3-4-5 window lengths was found to deliver the best performance. Furthermore, the word embeddings were initialized using pretrained word embeddings from the word2vec model, which yields further improvements compared to random initializations of word vectors or static word2vec vectors.</p>
<p>Earlier, we mentioned multi-instance learning as a possible paradigm for using distant supervision. Zeng et al. (2015) proposed piecewise CNNs (PCNNs) for using this paradigm to build a relation extractor. While the model was similar to previously proposed models, it made the important additional contribution of piecewise max-pooling across the sentence. The method worked by max-pooling in different segments of the sentence instead of the entire sentence. Zeng et al. (2015) used three segments, based on the positions of the two entities in question. Experimentally, PCNNs were found to outperform previous CNN models on precision at higher levels of recall (25 percent or more). Later models improved this even further (see the “Bibliographic Notes” section). Ablation studies have demonstrated the advantages of preferring PCNNs over CNNs, as well as multi-instance learning over ordinary learning.</p>
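<p>Piecewise max-pooling can be sketched directly (the feature-map values are illustrative; a real PCNN pools many convolutional filters and applies a nonlinearity afterward):</p>

```python
def piecewise_max_pool(feature_maps, head_idx, tail_idx):
    """Each feature map (one activation per token position) is max-pooled
    separately over the three segments delimited by the head and tail
    entity positions, rather than over the whole sentence; the three
    maxima are concatenated."""
    pooled = []
    for fmap in feature_maps:
        segments = (fmap[:head_idx + 1],
                    fmap[head_idx + 1:tail_idx + 1],
                    fmap[tail_idx + 1:])
        pooled.extend(max(seg) if seg else 0.0 for seg in segments)
    return pooled

# One feature map over a six-token sentence, entities at positions 1 and 3.
pooled = piecewise_max_pool([[1, 5, 2, 7, 3, 4]], head_idx=1, tail_idx=3)
```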
<p>The discussion of the approaches described here (which have continued to be superseded by even more advanced models that employ mechanisms like selective attention and exploit information across multiple documents in a bag) and their demonstrable empirical impact shows that deep learning has ushered in an exciting period for RE. While much work still needs to be done, progress is being reported almost every year, and often the state-of-the-art technology is superseded within the year. Particularly beneficial has been the use of distant supervision and the adoption of the multi-instance paradigm, later augmented with training mechanisms like selective attention over documents and cross-document max-pooling. Other works have even incorporated structured information into the pipeline in an effort to improve feature representations even further (e.g., by exploiting relation paths and relation class ties). Intuitively, the approach leverages relations such as sister_of and parent_of to extract instances for wife_of.</p>
<p>An interesting avenue of future work in this area that is already underway is to use recurrent neural networks (RNNs) instead of CNNs for encoding the sentences, since it seems that RNNs and LSTM networks naturally fit NLP tasks more than CNNs. There hasn’t been much conclusive evidence on the empirical benefits of using RNNs over CNNs in NLP tasks (e.g., while there is some evidence<sup><a href="chapter_6.xhtml#fn3x6" id="fn3x6-bk">3</a></sup> that RNNs can perform well on sentiment classification at the document level, some other papers have shown that CNNs could potentially outperform LSTMs on language modeling). However, the state of the art and the consensus on this issue keep shifting. In chapter 13, on <i>question answering</i>, we cover a <span aria-label="143" id="pg_143" role="doc-pagebreak"/>recent model called BERT, which is not based on CNNs and achieves state-of-the-art (or near state-of-the-art) on several language-modeling tasks.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec6-5"/><b>6.5 Beyond Relation Extraction: Event Extraction and Joint Information Extraction</b></h2>
<p class="noindent">We provided a brief introduction to event extraction at the beginning of this chapter. Event extraction has also been studied mainly using ACE data, but domain-specific (in particular, biomedical) event IE has been studied as well (e.g., for BioNLP shared tasks). To reduce task complexity, early work employed a series of classifiers (called a “pipeline”) that first extracts event triggers, followed by argument extraction. CNNs (including some of the models we briefly discussed in the previous section) have been applied successfully to event extraction as well. However, because pipeline approaches suffer from error propagation and cascading, joint extraction of event triggers and arguments has become very popular in the community.</p>
<p>The assumption guiding such joint IE models is that events and entities are closely related: entities are often actors or participants in events, and events without entities are rare. Another motivation is that the interpretation of events and entities can have high contextual interdependence. While early work in event IE modeled and extracted events separately from entities, performing inference at the sentence level and ignoring the rest of the document, joint IE explicitly models the dependencies among variables of events, entities, and their relations, the goal being to perform joint inference over these variables across a document.</p>
<p>In essence, joint IE decomposes the learning problem into three tractable subproblems: learning for within-event structures, learning for event-event relations (such as causality and inhibition), and learning for NER. The typical approach is to learn a probabilistic model for each of these subproblems, with a joint inference framework integrating the learned models into a single model that can jointly extract events and entities across the entire document. Depending on the paper, a variety of models have been put forth for realizing these intuitions, including Markov Logic (which we study in detail in the context of KG completion in part III of this book), structured perceptron, and dependency parsing. Early work on joint inference largely relied on heuristic search to aggressively shrink the search space, because joint inference of the kind described here can quickly lead to combinatorial explosion if dealt with naively. Later work attempted more sophisticated techniques for reducing the search space (e.g., in 2011, dual decomposition was used to solve joint inference with runtime guarantees).</p>
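A toy version of such joint inference can be written in a few lines (all scores are invented for illustration, and real systems score full document-level structures, not a single pair): rather than committing to the locally best entity label and then classifying the trigger, as a pipeline would, we enumerate joint assignments and add a compatibility term, so a globally better assignment can override the locally preferred one.

```python
from itertools import product

entity_scores = {"PER": 0.6, "ORG": 0.4}        # local NER scores (invented)
trigger_scores = {"ATTACK": 0.5, "ELECT": 0.5}  # local trigger scores (invented)
compat = {("ORG", "ATTACK"): 0.5}               # joint compatibility term

def joint_decode():
    """Exhaustively score every (entity label, trigger label) pair with
    local scores plus the compatibility term; return the global best."""
    best, best_score = None, float("-inf")
    for ent, trig in product(entity_scores, trigger_scores):
        score = (entity_scores[ent] + trigger_scores[trig]
                 + compat.get((ent, trig), 0.0))
        if score > best_score:
            best, best_score = (ent, trig), score
    return best, best_score

assignment, score = joint_decode()
# Locally, PER beats ORG; jointly, (ORG, ATTACK) wins via compatibility.
```

The search space here has only four assignments; with many mentions per document it grows combinatorially, which is why the heuristic-search and dual-decomposition techniques mentioned above matter.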
<p>As noted previously, improvements in event extraction would not have been possible without exploiting document-level context. For example, in the ACE domain, there is work on utilizing and propagating event type cooccurrence information to influence event classification decisions (e.g., in a domain-specific corpus describing guerrilla warfare in a country, an ATTACK event might often cooccur with a TRANSPORT event). By designing appropriate features, causal and temporal (among other) relations can be handled as well.</p>
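The cooccurrence idea can be sketched as follows, with an invented corpus and invented classifier scores: cooccurrence counts between event types are gathered at the document level and then used to rescore an ambiguous trigger, given the event types already detected in the same document.

```python
from collections import Counter
from itertools import combinations

# Each "document" is just its set of gold event types (toy corpus).
corpus = [["ATTACK", "TRANSPORT"], ["ATTACK", "TRANSPORT"], ["ELECT"]]

cooc = Counter()
for doc in corpus:
    for a, b in combinations(sorted(set(doc)), 2):
        cooc[(a, b)] += 1                 # symmetric cooccurrence counts

def rescore(base_scores, types_in_doc, weight=0.1):
    """Boost each candidate event type by how often it cooccurs (in the
    corpus) with types already detected in the current document."""
    adjusted = {}
    for etype, s in base_scores.items():
        boost = sum(cooc.get(tuple(sorted((etype, t))), 0)
                    for t in types_in_doc)
        adjusted[etype] = s + weight * boost
    return adjusted

# The local classifier slightly prefers ELECT for an ambiguous trigger,
# but the document already contains an ATTACK event, which in the corpus
# cooccurs with TRANSPORT.
adjusted = rescore({"TRANSPORT": 0.45, "ELECT": 0.5}, {"ATTACK"})
```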
<p><span aria-label="144" id="pg_144" role="doc-pagebreak"/>Experimentally, approaches that have jointly tried to extract entities and relations have reported considerable improvements over the pipeline approach. In fact, joint IE not only improves relation and event extraction, but also leads to improvements in entity extraction. One reason is that RE information can be leveraged for entity extraction in a joint IE model, unlike in the pipeline approach, where NER precedes RE and information is not allowed to propagate backward. Unfortunately, comparing the models has proven to be difficult because they do not use a standard data set for their evaluations. It is also important to note that the “jointness” in joint modeling approaches occurs in one of two scenarios: either joint inference is conducted on local, independently trained classifiers for entities and relations, or actual joint learning is conducted, wherein a single model is learned for extracting entities, relations, and events. Despite the improvements, there is considerable room for developing even more sophisticated approaches for handling the joint modeling problem, while still being efficient.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec6-6"/><b>6.6 Concluding Notes</b></h2>
<p class="noindent">RE is an important type of IE that is essential for building KGs from raw data. Although all kinds of IE involve challenges, RE has proved to be significantly more challenging than NER. A range of techniques has been proposed over the years, refined and evaluated via programs like ACE. Supervised techniques include both feature-based and kernel-based methods, and they have historically been the best performing. However, semisupervised and unsupervised techniques have recently benefited a great deal from paradigms like distant supervision and representation learning. Deep learning methods continue to be applied to RE to further advance the state-of-the-art. Finally, extending RE to event extraction, and building joint models that extract entities, relations, and events all at once in the hopes of doing better on each individual extraction problem, are being explored in the community to expand the scope of the problem and build richer KGs. It is likely that we will continue to see new research in this area for the foreseeable future.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec6-7"/><b>6.7 Software and Resources</b></h2>
<p class="noindent">We mentioned multiple ontologies at the beginning of the chapter and provide links and resources for them here:</p>
<ul class="numbered">
<li class="NL">1. Resources pertinent to the ACE program are available at <a href="https://www.ldc.upenn.edu/collaborations/past-projects/ace">https://<wbr/>www<wbr/>.ldc<wbr/>.upenn<wbr/>.edu<wbr/>/collaborations<wbr/>/past<wbr/>-projects<wbr/>/ace</a>. In particular, we encourage the interested reader to consider the annotation tasks and specifications (ACE 2008 is the latest version), as well as many of the other language resources available in the broader pages of the LDC.</li>
<li class="NL">2. REO was presented relatively recently by Brown et al. (2017), who state that it is “temporarily available by request” but is planned to be migrated to an “in-house <span aria-label="145" id="pg_145" role="doc-pagebreak"/>server in the near future, where it will be freely available.” However, we have not been able to locate a download link online, and prospective users may have to contact the authors of the original paper directly.</li>
<li class="NL">3. CAMEO, which is also very popular for events, is described extensively in a manual available at <a href="http://data.gdeltproject.org/documentation/CAMEO.Manual.1.1b3.pdf">http://<wbr/>data<wbr/>.gdeltproject<wbr/>.org<wbr/>/documentation<wbr/>/CAMEO<wbr/>.Manual<wbr/>.1<wbr/>.1b3<wbr/>.pdf</a>. The GDELT (Global Database of Events, Language, and Tone) project, described at <a href="https://www.gdeltproject.org/">https://<wbr/>www<wbr/>.gdeltproject<wbr/>.org<wbr/>/</a>, has many associated resources, including data, documentation, and solutions. GDELT uses CAMEO, along with ICEWS (Integrated Conflict Early Warning System), a DARPA-funded initiative. More details on ICEWS may be found on Harvard Dataverse: <a href="https://dataverse.harvard.edu/dataverse/icews">https://<wbr/>dataverse<wbr/>.harvard<wbr/>.edu<wbr/>/dataverse<wbr/>/icews</a>.</li>
</ul>
<p>To <i>do</i> RE, one set of resources that can be used is the NLP packages mentioned in the previous chapter. Those packages provide excellent facilities for training a custom RE pipeline, but pretrained modules compatible with some of the packages are also available. That said, RE still hasn’t delivered the same level of performance as NER, and consequently it is not as widely used (at least compared to NER) in industry and other sectors, where accuracy requirements are higher. Other useful resources for the interested practitioner of RE and event extraction include FrameNet and VerbNet, available at <a href="https://framenet.icsi.berkeley.edu/fndrupal/">https://<wbr/>framenet<wbr/>.icsi<wbr/>.berkeley<wbr/>.edu<wbr/>/fndrupal<wbr/>/</a> and <a href="https://verbs.colorado.edu/verbnet/">https://<wbr/>verbs<wbr/>.colorado<wbr/>.edu<wbr/>/verbnet<wbr/>/</a>, respectively. A VerbNet Java application programming interface (API) can be downloaded at <a href="http://verbs.colorado.edu/verb-index/vn/verbnet-3.2.tar.gz">http://<wbr/>verbs<wbr/>.colorado<wbr/>.edu<wbr/>/verb<wbr/>-index<wbr/>/vn<wbr/>/verbnet<wbr/>-3<wbr/>.2<wbr/>.tar<wbr/>.gz</a>. In the last ten years, broader initiatives (such as data sets and competitions) have done their part in advancing the state-of-the-art and leading to more research. One example is SemEval 2010 Task 8 (<a href="http://semeval2.fbk.eu/semeval2.php?location=tasks">http://<wbr/>semeval2<wbr/>.fbk<wbr/>.eu<wbr/>/semeval2<wbr/>.php<wbr/>?location<wbr/>=tasks</a>), a multiway classification task with nine general relations, along with an “OTHER” relation. Other interesting repositories and links are tracked on the page <a href="http://nlpprogress.com/english/relationship_extraction.html">http://<wbr/>nlpprogress<wbr/>.com<wbr/>/english<wbr/>/relationship<wbr/>_extraction<wbr/>.html</a>.</p>
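As a concrete starting point with SemEval 2010 Task 8, a minimal reader only needs to recover the two marked nominals and the directed relation label. The sketch below mimics the task's inline-markup format; the helper function is ours, not part of any released tooling, and a real loader would also handle tokenization and the full data files.

```python
import re

def parse_example(sentence, label):
    """Recover the two marked nominals and the directed relation label
    from a SemEval 2010 Task 8 style example."""
    e1 = re.search(r"<e1>(.*?)</e1>", sentence).group(1)
    e2 = re.search(r"<e2>(.*?)</e2>", sentence).group(1)
    m = re.match(r"([\w-]+)\((e\d),(e\d)\)", label)
    if m:                                  # e.g. "Cause-Effect(e2,e1)"
        return e1, e2, m.group(1), (m.group(2), m.group(3))
    return e1, e2, label, None             # "Other" carries no direction

e1, e2, rel, direction = parse_example(
    "The <e1>pollution</e1> was caused by a <e2>shipwreck</e2>.",
    "Cause-Effect(e2,e1)")
```

Note that the direction matters for evaluation: Cause-Effect(e1,e2) and Cause-Effect(e2,e1) count as different answers in the official multiway scoring.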
<p>More recently, deep learning systems have been used for relation and event extraction, as noted earlier. Frameworks such as TensorFlow and PyTorch are useful tools in this regard, but so are advanced language models (based on transformer architectures) such as BERT, which will be covered in more depth in chapter 13. Due to the interest in NLP and tasks such as RE that are important for question answering, industry has also been playing an active role in advancing the state-of-the-art. For example, a recent method published by Wang et al. (2019) at IBM was able to achieve state-of-the-art (or near state-of-the-art) performance with RE (<a href="https://www.ibm.com/blogs/research/2019/07/relation-extraction-method/">https://<wbr/>www<wbr/>.ibm<wbr/>.com<wbr/>/blogs<wbr/>/research<wbr/>/2019<wbr/>/07<wbr/>/relation<wbr/>-extraction<wbr/>-method<wbr/>/</a>). In a separate line of work, the DeepDive system is another example of a supporting infrastructure for extracting structured information from unstructured data (<a href="http://deepdive.stanford.edu/relation_extraction">http://<wbr/>deepdive<wbr/>.stanford<wbr/>.edu<wbr/>/relation<wbr/>_extraction</a>).</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="146" id="pg_146" role="doc-pagebreak"/><a id="sec6-8"/><b>6.8 Bibliographic Notes</b></h2>
<p class="noindent">RE had been an important problem in the IE community long before the advent of KGs, which have only made it more relevant. The problem has also long been considered more difficult, in terms of achieving satisfactory empirical performance, than tasks like NER. The structure of our coverage is largely inspired by the recent survey of Pawar et al. (2017). They describe a clear distinction between global- and mention-level RE and also provide a good overview of resources like ACE. As they note, they are hardly the first to study RE; many papers on IE give RE at least some coverage. For a fuller scope, the interested reader should consider work by Sarawagi et al. (2008), Bach and Badaskar (2007), and de Abreu et al. (2013), the last of which covers techniques used for Portuguese and can provide a good sense of some of the challenges posed by RE when dealing with non-English text.</p>
<p>Good references for some of the ontologies, vocabularies and resources (such as ACE, ERE, REO, CAMEO, and WEIS) we mentioned include Doddington et al. (2004), Gerner et al. (2002a,b), Song et al. (2015), Aguilar et al. (2014), and Brown et al. (2017).</p>
<p>Individual references for many of the RE techniques (supervised and unsupervised) covered in this chapter can be found in Pawar et al. (2017), but more recently, surveys have been published on entire subareas of RE. For example, a lot of work has also been done on using distant supervision for RE; Smirnova and Cudré-Mauroux (2018) provide a survey of methods and research on this very specific topic. Similarly, Jung et al. (2012) summarize a specific area of research (kernel-based RE) that saw an explosion of research over the last two decades. For a good overview of domain-specific RE, we recommend Zhou et al. (2014) and Cohen and Hersh (2005), which cover the biomedical domain. In a similar vein, causal IE was surveyed by Asghar (2016).</p>
<p>Toward the end of the chapter, our focus shifted to deep learning methods for RE. This is a relatively novel area of research, but several good references should be perused by the interested reader. Important references include Zeng et al. (2014), Nguyen and Grishman (2015), Riedel et al. (2010), Zeng et al. (2015), Mintz et al. (2009), Hoffmann et al. (2011), Surdeanu et al. (2012), Lin et al. (2016), Jiang et al. (2016), Ye et al. (2016), Dauphin et al. (2017), Phi et al. (2019), and Yin et al. (2017). Finally, Kumar (2017) provides a synthesis of these and other methods that have used deep learning for RE.</p>
<p>Excellent references for joint IE and event IE (including event coreference) include Liao and Grishman (2010), Ahn (2006), Ji and Grishman (2008), Madhyastha et al. (2003), Humphreys et al. (1997), Yubo et al. (2015), McClosky et al. (2011), and Li et al. (2013). We note also that RE research has been intersecting with other kinds of IE, including Open IE (which we cover in chapter 7); see Zhang et al. (2019) for a very recent example of how the two can be fused in a large-scale setting.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="147" id="pg_147" role="doc-pagebreak"/><a id="sec6-9"/><b>6.9 Exercises</b></h2>
<ul class="numbered">
<li class="NL-N">1. Consider the following paragraph (sourced from Wikipedia, accessed at <a href="https://en.wikipedia.org/wiki/Albert_Camus">https://<wbr/>en<wbr/>.wikipedia<wbr/>.org<wbr/>/wiki<wbr/>/Albert<wbr/>_Camus</a>):<br/> “Soon after Camus moved to Paris, the outbreak of World War II began to affect France. Camus volunteered to join the army but was not accepted because he had suffered from tuberculosis. As the Germans were marching towards Paris, Camus fled. He was laid off from Paris-Soir and ended up in Lyon, where he married pianist and mathematician Francine Faure on 3 December 1940. Camus and Faure moved back to Algeria (Oran) where he taught in primary schools. Because of his tuberculosis, he was forced to move to the French Alps. There he began writing his second cycle of works, this time dealing with revolt – a novel La Peste (The Plague) and a play Le Malentendu (The Misunderstanding). By 1943 he was known because of his earlier work. He returned to Paris where he met and became friends with Jean-Paul Sartre. He also became part of a circle of intellectuals including Simone de Beauvoir, André Breton, and others.”<br/> As a first step, design a small ontology that would allow you to express the kinds of entities and relations in this paragraph. Use reasonable assumptions, but make the ontology as nontrivial as possible. Express the ontology using RDFS (chapter 2).</li>
<li class="NL-N">2. Annotate the surface forms of the entities and relations in the paragraph (corresponding to instances of concepts and properties in your ontology), and express the annotations as an RDF KG.</li>
<li class="NL-N">3. Is there any real difference between a relation “instance” and a relation in the ontology? <i>Hint: Using the provided paragraph, or one inspired by it, can you think of cases where the “surface” form of the RE is different from its label in the ontology?</i></li>
<li class="NL-N">4. Are there multi-arity (taking more than two entities as arguments) relations in the paragraph? If yes, what? If not, could you add a sentence to the paragraph to introduce a multi-arity relation? Is this multi-arity relation <i>pairwise decomposable</i>; that is, if the relation takes four arguments, such as <i>R</i>(<i>n</i><sub>1</sub><i>, n</i><sub>2</sub><i>, n</i><sub>3</sub><i>, n</i><sub>4</sub>), where <i>n</i><sub><i>i</i></sub> are entities, then is it possible to express the relation equivalently as a set (with cardinality <sup>4</sup><i>C</i><sub>2</sub> = 6) of relations: {<i>R</i>′(<i>n</i><sub>1</sub><i>, n</i><sub>2</sub>)<i>, R</i>′(<i>n</i><sub>1</sub><i>, n</i><sub>3</sub>)<span class="ellipsis">…</span><i>, R</i>′(<i>n</i><sub>3</sub><i>, n</i><sub>4</sub>)}? Note that <i>R</i>′ does not necessarily have to be the same relation as <i>R</i>.</li>
<li class="NL-N">5. Give at least three examples of real-world multi-arity relations that would not be pairwise decomposable. One useful way to think about decomposability in this context is to ask yourself whether you would be able to soundly and completely recompose the original <i>n</i>-ary relations given a KB containing only the pairwise equivalents.</li>
<li class="NL-N">6. Informally describe a domain or scenario where joint IE, as we described it in the later part of this chapter, could prove to be important in reducing noise and getting relevant extractions. Try to find a paragraph, similar to the example from Wikipedia or a news source, that would allow you to argue your case concretely.</li>
</ul>
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_6.xhtml#fn1x6-bk" id="fn1x6">1</a></sup> Put more accurately, labeled instances are instances with known labels, while the labels of “unlabeled” instances have to be inferred.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_6.xhtml#fn2x6-bk" id="fn2x6">2</a></sup> This section assumes some basic knowledge of CNNs. If readers are completely unfamiliar, they should skip this section at a first reading. For the interested reader, an accessible introduction to CNNs is recommended before perusing this section.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_6.xhtml#fn3x6-bk" id="fn3x6">3</a></sup> A good reference for this claim is Tang et al. (2015a).</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html> |