<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch4" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch4"><span aria-label="77" id="pg_77" role="doc-pagebreak"/>4</h1>
<h1 class="chapter-title"><b>Named Entity Recognition</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b> Textual information has continued to proliferate rapidly in the digital realm, with many text repositories now available on the web. A significant part of this information, including online news, government material, legal documents (e.g., contracts and court rulings), corporate reports, medical notes, and even social media, tends to be transmitted as free-text documents that are difficult for machines to search or make sense of. This has led to the term <i>unstructured</i> being associated with free-text documents because the content is very different from “structured” data found in knowledge bases (KBs) and databases. To enable machines to do analytics over, or to build knowledge graphs (KGs) from, such data, there is a growing need to extract and interlink key pieces of information from the text. Typically, the first line of attack in building such information extraction (IE) systems is Named Entity Recognition (NER). In this chapter, we introduce IE and then delve deeper into NER. Compared to other IE techniques, NER has evolved over multiple decades of research, and has achieved impressive peak empirical performance in various domains across common entity types.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec4-1"/><b>4.1 Introduction</b></h2>
<p class="noindent">IE is an unavoidable task in a broader KG construction (KGC) pipeline. Conceptually, IE can be defined simply as a class of algorithmic techniques to extract relevant information from raw data. As is so often the case in applied artificial intelligence (AI) and KG research, the devil is in the details. What do we mean by “relevant” and “information,” for example? What is raw data? To understand the nuance behind this definition, we provide two applications of IE in <a href="chapter_4.xhtml#fig4-1" id="rfig4-1">figure 4.1</a>. In the first application, the goal is to extract entities and relationships from web data. In chapter 3, for example, we covered the important problem of domain discovery and data acquisition, with the principal focus being on web data. Because almost every domain we can think of now has a significant web presence, the importance of minimally supervised web IE continues to grow. Fortunately, even though the problem is still not solved, in that no system is close to achieving human-level performance, the work is in a sufficiently mature stage that a range of options are available to the typical practitioner. We cover web IE in more detail in chapter 5.</p>
<div class="figure">
<figure class="IMG"><a id="fig4-1"/><img alt="" src="../images/Figure4-1.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig4-1">Figure 4.1</a>:</span> <span class="FIG">Two contrasting versions and applications of IE for constructing KGs over raw sources. Web IE operates over webpages, and attempts to extract a KG with entities, relations, or even events, while NER extracts instances of concepts such as PERSON or LOCATION. Concepts come from an ontology that can be domain-specific. To supplement the instances, and interconnect them with relations, relation extraction has to be executed as another step.</span></p></figcaption>
</figure>
</div>
<p><span aria-label="78" id="pg_78" role="doc-pagebreak"/>In the second application, the raw data is free text, or <i>natural language</i>. Some authors have alluded to such data as being unstructured, but we discourage the use of this term because there is considerable structure (both syntactic and semantic) in natural-language data, and the study of such structure is an important goal of linguistics. However, it is also important to distinguish the various kinds of natural language, which can be quite diverse. The most obvious kind of variance arises from different languages (i.e., free text that is “natural” to a French speaker will not seem natural to a native English speaker who has no knowledge of French). Beyond multilingual data, messages exchanged on social media platforms like Twitter have arguably yielded a whole other “dialect” that looks very different from the kinds of articles we read on Wikipedia or on newswire. Tabloids and more sensationalistic media fall somewhere in between, as do text messages, which have more regular spellings but are still short and can employ nonstandard characters like emojis.</p>
<p>Furthermore, as <a href="chapter_4.xhtml#fig4-1">figure 4.1</a> suggests, the practical problem statement can have a strong dependence on the application. For example, is it relevant for an IE to be extracting e-commerce products from text when the downstream application is geopolitical forecasting? Generally, the scope of an IE task, like so many others in KG-centric pipelines, is defined through an ontology, which contains classes and properties of interest. If the IE is meant to be in support of the aforesaid geopolitical forecasting application, for example, then classes like <i>Country, Geopolitical Location, State Actor</i>, and <i>Politician</i> would obviously be of interest, while e-commerce classes like <i>Product, Color,</i> and <i>Price</i> would not. Properties tend to be defined by the classes in the ontology [e.g., <i>Geopolitical Location</i> would have (literal-valued) properties such as <i>latitude</i> and <i>longitude</i>, while in the e-commerce domain, <i>Product</i> and <i>Price</i> might be linked by properties such as <i>suggested_retail_price</i> and <i>factory_outlet_price</i>].</p>
<p>In short, without an ontology, IE tends to be an underdetermined term, both because it does not specify <i>what</i> needs to be extracted (e.g., are extractions limited to named entities like people and organizations, or should they be extended to events and relations like “President of”?), and because it does not specify the <i>semantics</i> of such extractions (e.g., whether <i>Barack Obama</i> should have the semantics of a <i>Person</i> or the finer-grained <i>Politician</i> class). While ontology-guided IE is still the predominant form of IE, Open Information Extraction (Open IE), which does not rely on an ontology being provided, has also been studied. While Open IE has seen much improvement since it was first introduced, it remains considerably noisier and less feasible in practice than traditional IE. In this chapter, we focus primarily on extracting class instances, or <i>entities</i> (rather than property instances, or <i>relations</i>), from ordinary free text such as newswire, with Open IE and social media IE left for a subsequent chapter. Relation extraction, as well as higher-order extraction problems such as event extraction, will be covered in chapter 6.<span aria-label="79" id="pg_79" role="doc-pagebreak"/></p>
<p><span aria-label="80" id="pg_80" role="doc-pagebreak"/>Specifically, we will focus on an extremely important (and best-studied) IE problem in the NLP community called NER. Note that entities in text can be named or unnamed. For example, in the sentence “Currently, it is unknown who killed Martha Jones, but the criminal is at large and a manhunt is underway,” the person who killed Martha Jones is clearly an entity that exists in the real world (assuming someone <i>did</i> kill Martha Jones), but we do not have a name for that entity, even though we have the <i>type</i> (corresponding to a class like <i>Person</i> or <i>Criminal</i> in our ontology). In fact, we may never have a name for that entity, and we may (as a consequence) end up giving it an alternative name by way of reference (e.g., the Zodiac Killer). Extracting unnamed entities is also of interest, though we will focus mostly on extracting named entities in this chapter, which is now a mature research area with successful use-cases and implementation in several real-world applications.</p>
<p>We can define NER more precisely as the task of identifying the <i>instances</i> of a predefined set of concepts in a specific domain, ignoring other irrelevant information, where the input consists of a corpus of texts together with a clearly specified information need. As previously mentioned, concepts generally correspond to classes in an ontology, such as the simple model shown in <a href="chapter_4.xhtml#fig4-2" id="rfig4-2">figure 4.2</a>. Identifying a concept’s instances implies that an IE system must be able to locate specific <i>mentions</i> in the text, such that the mention <i>refers</i> to an instance of the concept. Furthermore, because different instances of the concept may actually be referring to the same underlying entity, these instances must be resolved. For example, consider the sentences “The United Nations sent food aid to Chad in 1980” and “The U.N. authorized a food aid mission to Chad in 1980.” Correctly extracting the mentions “U.N.” and “United Nations” as instances of the predefined concept “Organization” is the task of an NER system. However, because both mentions refer to the same underlying entity, a further round of inference (known as <i>instance matching</i>) is required before the extractions can be queried, aggregated, or analyzed in complex ways.</p>
<div class="figure">
<figure class="IMG"><a id="fig4-2"/><img alt="" src="../images/Figure4-2.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig4-2">Figure 4.2</a>:</span> <span class="FIG">A simple concept ontology, named instances of which need to be extracted by an NER system. The ontology fragment is based on the real-world DBpedia ontology. RDFS and DBO, which stand for “Resource Description Framework Schema” and “DBpedia Ontology,” respectively, indicate that the “vocabulary” terms (e.g., dbo:writer) lie in these namespaces.</span></p></figcaption>
</figure>
</div>
<p>We turn to instance matching in detail in chapter 8. In KG formalism, there is an easy separation between the two tasks of NER (and more generally, IE) and instance matching. One way to understand the separation is by thinking of NER as a KG <i>construction</i> problem (KGC), whereas instance matching is a KG <i>completion</i> problem. Solutions to both types of problems are necessary to achieve a high-quality KG.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec4-2"/><b>4.2 Why Is Information Extraction Hard?</b></h2>
<p class="noindent">Even though this chapter is predominantly about NER, many of the techniques that we cover herein have directly inspired mainstream IE research for complex information types like relations and events. Thus, it is worthwhile taking a moment to understand what makes IE challenging to begin with. First, even in limited settings, IE is a nontrivial task due to the complexity and ambiguity of natural language. There are many ways of expressing the same fact, which can be distributed across multiple sentences, documents, or even <span aria-label="81" id="pg_81" role="doc-pagebreak"/>knowledge repositories. Significant amounts of relevant information could be implicit and difficult to discern without a requisite (and enormous) amount of background knowledge that humans have managed to acquire despite a “poverty of the stimulus.” In other words, we are able to use, learn, compose, and understand sentences in very creative ways, despite not having heard the vast majority of sentences we end up using and understanding. The general problem of natural-language understanding is far from being solved, though much progress has been made. Certainly, relation extraction and NER are important components in any computational system looking to understand language better. Furthermore, recent advances in NLP in developing robust, efficient, modular, high-coverage, and shallow text-processing techniques (predominantly based on statistical methods, especially deep neural networks), as opposed to deep linguistic analysis, have contributed to the customized deployment of IE techniques in real-world applications for processing large quantities of textual data.</p>
<p>Along with ambiguity, the other major challenge that IE tends to face is that models trained on concepts in a predefined ontology cannot be easily transferred when something significant changes, whether it is the genre of data (newswire versus social media) or the language. Not all text is equal, even for humans (e.g., even trained scientists have to expend considerably more cognitive energy to understand a dense scientific report than to read and understand a simple news article). The impact of the genre and the domain has generally been neglected in the NER literature, but some studies have shown that some leading NER systems, when tested on a corpus that comprised transcripts of phone conversations and technical emails (rather than a standard, widely used newswire collection called MUC-6), experienced degradation on standard metrics like precision and recall by as much as 20 <span aria-label="82" id="pg_82" role="doc-pagebreak"/>to 40 percent. In recent years, some authors have made progress on this issue in difficult domains, but common-domain, newswire-oriented NER systems continue to achieve the best performance. Another problem is when the ontology changes via the introduction of novel concepts and properties. Sometimes the ontology becomes finer-grained (e.g., an application may decide that a class like <i>Person</i> is too coarse-grained, and it may be better to subdivide that class into subclasses like <i>Politician</i>, <i>Actor</i>, and <i>Military Personnel</i>). Even simple changes like that can affect a trained IE model profoundly because the entity typing mechanism (which is responsible for assigning extractions to concepts in the ontology) may have to be retrained, and if the model jointly considers extraction and typing (as many modern systems do), the whole system may have to be retrained. A related problem is that the definition of relevance itself changes (once the class is subdivided, not every person is relevant anymore), which can lead to a loss of precision. Some extractions may now also have multiple types (e.g., when an actor is also a politician, like Arnold Schwarzenegger), making both evaluation and extraction difficult.</p>
<p>Finally, Open IE and unsupervised NER, which require no ontology and/or training instances, still lag significantly in performance compared to supervised, ontologically mediated NER. This is not surprising, and it should also be noted that the performance of unsupervised NER has improved steadily over the past decade, especially with improvements in language models and self-supervised learning.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec4-3"/><b>4.3 Approaches for Named Entity Recognition</b></h2>
<p class="noindent">Early approaches to NER tended to make heavy use of handcrafted rule-based algorithms, although this is an oversimplification because the actual techniques were quite diverse from one system to another. An early research paper in the 1990s was one of the first to formally recognize the problem as a <i>computational</i> task by proposing a system that aims to “extract and recognize names” (in this specific case, company names). This system relied on heuristics and rules. Research in the area expanded soon afterward, with the first major event aimed at properly defining and evaluating NER (among other NLP problems) being MUC-6. Since then, the popularity of NER as a field of study in both academic and industrial contexts has continued to rise. Other than Message Understanding Conferences (MUCs), other scientific avenues where the community came together to discuss techniques and evaluation standards for NER include HUB-4, MET-2, and HAREM. A more recent initiative is the Language Resources and Evaluation Conference (LREC), which has staged multiple workshops and conference tracks on IE since the early 2000s.</p>
<p>One bias that must be noted here is the almost singular focus of early research on English IE and NER. Some exceptions like German notwithstanding, multilingual NER only started to become well studied in the early 2000s through conferences like the SIGNLL Conference on Computational Natural Language Learning (CoNLL). It was only in the mid-2000s that Arabic and less well studied languages like Bulgarian and Russian started <span aria-label="83" id="pg_83" role="doc-pagebreak"/>becoming popular. Large-scale projects like Global Autonomous Language Exploitation (GALE) helped push for progress on multilingual NER. Much more recently, there has also been focus on low-resource languages like Uyghur. By “low-resource,” we mean a language that has not been well addressed by the NLP community, but that is spoken by a large population. Government-funded programs like DARPA LORELEI have been instrumental in renewing the focus on multilingual and low-resource NER.</p>
<p>One last point that we note before diving headlong into the details of NER systems and approaches is that the performance of NER can strongly depend on other elements of a complete pipeline, including preprocessing and tokenization. <i>Tokenization</i> is the segmentation of the text data into “wordlike” units, which are called <i>tokens</i>. Often, their type also needs to be determined, which involves identification of capitalized words, hyphenated words, punctuation signs, and numbers, to name a few. In fact, a general architecture for NER can look like <a href="chapter_4.xhtml#fig4-3" id="rfig4-3">figure 4.3</a>, with the NER step itself ensconced between other modules. Other NLP inference tasks have similar dependence on preprocessing; hence, the architecture can be built out even further with higher-level inference modules like relation and event extraction. Steps such as linguistic tagging (e.g., part-of-speech, or POS, tagging) and intermediate clusters can have a major impact on the overall performance of such modules. Linguistic tagging is one example of <i>syntactic analysis</i>, which includes computation of a dependency structure (most often, a parse tree) over a sentence. Other example tasks include phrase recognition (in particular, recognition of verb groups, noun phrases, and acronyms and abbreviations), sentence boundary detection, and morphological analysis. In short, to truly achieve impressive industry-scale performance on NER, much engineering <span aria-label="84" id="pg_84" role="doc-pagebreak"/>effort is required, often extending well beyond training and tuning the NER model itself.</p>
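To make the tokenization step concrete, the following is a minimal illustrative sketch (our own, not the pipeline of figure 4.3): it segments text into wordlike units with a regular expression and assigns each token a coarse type, such as capitalized word, number, hyphenated word, or punctuation sign.

```python
import re

# Word-like units: plain or hyphenated words, numbers (with optional decimal
# part), and single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*|\d+(?:\.\d+)?|[^\w\s]")

def tokenize(text):
    """Return a list of (token, type) pairs for the input string."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        if tok[0].isdigit():
            kind = "NUMBER"
        elif "-" in tok:
            kind = "HYPHENATED"
        elif tok[0].isupper():
            kind = "CAPITALIZED"
        elif tok.isalpha():
            kind = "WORD"
        else:
            kind = "PUNCT"
        tokens.append((tok, kind))
    return tokens
```

A real preprocessing module would go further (sentence boundary detection, abbreviation handling, and so on), but even this sketch shows why tokenization decisions propagate into everything downstream.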
<div class="figure">
<figure class="IMG"><a id="fig4-3"/><img alt="" src="../images/Figure4-3.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig4-3">Figure 4.3</a>:</span> <span class="FIG">A practical architecture for NER can depend heavily on other elements of the pipeline, such as preprocessing and tokenization, as well as the availability of external resources such as lexicons and pretrained word embeddings (described in subsequent sections of this chapter). A complete description of these modules is beyond the scope of this book, but it may be found in any standard text or survey on Natural Language Processing.</span></p></figcaption>
</figure>
</div>
<section epub:type="division">
<h3 class="head b-head"><a id="sec4-3-1"/><b>4.3.1 Supervised Approaches</b></h3>
<p class="noindent">The current popular paradigm for addressing NER is supervised learning, including such well-established statistical models as decision trees, maximum entropy models, hidden Markov models (HMMs), Support Vector Machines (SVMs), and conditional random fields (CRFs). Except for decision trees and SVMs, all of these are sequence-labeling techniques that do not make the famous independent and identically distributed (i.i.d.) assumption prevalent in many machine learning approaches and classifiers. To understand the difference, consider the sentence in <a href="chapter_4.xhtml#fig4-4" id="rfig4-4">figure 4.4</a>. Suppose that, to keep the analysis simple, we were trying to classify each word in the sentence as a named entity “LOC” (for location), “PER” (for person), or “O” (for <i>other</i>, i.e., a token that is not part of a named entity of interest). Traditional classification would try to classify each word in the sentence <i>independently</i> (i.e., whether “the” gets classified as “O” has no impact on whether the next word is classified as “O,” “LOC,” or “PER”). Intuitively, this seems faulty. It seems more likely that “the” is going to be followed by a location like “United States” than not. Thus, we should not be classifying every token independently, but if possible, classifying the sequence as a whole. In the general case, the problem of taking all elements in a sequence and classifying them jointly is intractable for reasonable sequence sizes, but with some model assumptions, we can try to take some dependencies into account when classifying each token. Sequence labelers, including HMMs and CRFs, are examples of models that assign output states to input terms without making a strong independence assumption.</p>
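To illustrate the difference numerically, consider the following toy sketch (the scores are invented for illustration and are not from any trained system). An i.i.d. classifier would score each token/tag pair in isolation using only the emission scores; a sequence model adds transition scores, so the decision at one position influences its neighbors.

```python
# Hypothetical per-token ("emission") scores, as an i.i.d. classifier
# might produce for each token/tag pair in isolation.
emission = {
    ("the", "O"): 2.0, ("the", "LOC"): 0.1, ("the", "PER"): 0.1,
    ("united", "O"): 0.5, ("united", "LOC"): 1.5, ("united", "PER"): 0.2,
    ("states", "O"): 0.5, ("states", "LOC"): 1.5, ("states", "PER"): 0.2,
}

# Hypothetical transition scores: tags that plausibly follow each other
# (e.g., LOC tends to continue a LOC span) get a bonus.
transition = {("O", "LOC"): 0.5, ("LOC", "LOC"): 1.0}

def sequence_score(tokens, tags):
    """Score a whole tagging jointly: emissions plus tag-to-tag transitions."""
    score = sum(emission.get((tok, tag), 0.0) for tok, tag in zip(tokens, tags))
    score += sum(transition.get(pair, 0.0) for pair in zip(tags, tags[1:]))
    return score
```

Under these scores, the coherent tagging ["O", "LOC", "LOC"] for "the united states" outranks ["O", "LOC", "O"] precisely because of the transition terms, which an i.i.d. classifier cannot express.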
<div class="figure">
<figure class="IMG"><a id="fig4-4"/><img alt="" src="../images/Figure4-4.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig4-4">Figure 4.4</a>:</span> <span class="FIG">A CRF as applied to the task of NER. Unlike ordinary supervised classification, which would make an i.i.d. assumption and try to classify each term in the input sequence independently, CRFs (and other models like it) model dependencies to get better accuracy without necessarily becoming intractable. The dark nodes are output nodes that technically produce a probability over the full set of labels [which includes all concepts, and also a <i>Not Applicable</i> (NA)-type concept indicating that no named entity is present]. There are standard mechanisms for handling multiword extractions like “Bay Area.”</span></p></figcaption>
</figure>
</div>
<p>CRFs constitute an important enough class of models for NER (and other sequence-labeling problems) that we cover them separately in the next section. CRFs were first proposed and developed in the early 2000s, but they have continued to be an attractive solution even in the present day, though they are not state-of-the-art anymore due to the advent of deep learning methods like recurrent neural networks (RNNs). However, regardless of the technique, because supervised approaches involve a separation of labeled data into training and test sets, with the latter withheld until the model (and a suitable baseline) has been developed and tuned, proper evaluation is an important concern. One baseline that is considered almost as a “litmus” test consists of tagging words in the test corpus when they are annotated as entities in the training corpus. In other words, the performance of this system is a measure of the <i>vocabulary transfer</i>, or the proportion of words, without repetitions, that appear in both the training and test corpus. In a study conducted over 20 years ago by Palmer and Day (1997), vocabulary transfer on the MUC-6 training data was found to be uneven across entity types (21 percent in the aggregate, but with 42 percent across locations and only 13 percent across person names). Actual recall of this simple baseline system is even higher because vocabulary transfer does not include repetitions, whereas many entities can be frequently repeated in the test corpus. A <span aria-label="85" id="pg_85" role="doc-pagebreak"/>second study by Mikheev et al. (1999) that followed the original vocabulary transfer study on MUC-6 found, for example, that on MUC-7, this simple baseline tagger can achieve a recall of 76 percent for locations, and 26 percent for persons, with precision well over 70 percent. Other such consistent results were achieved thereafter by other authors (see the “Bibliographic Notes” section at the end of this chapter).
We point this out to show that a high recall on some categories may not actually be as impressive as it would be in some other fields of AI (such as instance matching, covered in a later chapter) or in other categories. For example, if a system managed to achieve 78 percent recall on locations on MUC-7, this is not as impressive as achieving 78 percent recall on person names. Considering this litmus baseline tagger as a minimum viable NER is a good way of ensuring that performance numbers are not ratcheted above their true significance. It also helps to ensure that the test corpus is not too easy (or hard); while it’s good to have some vocabulary transfer, the generalization of any method being evaluated becomes suspect if the test data set is too amenable to vocabulary transfer.</p>
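A sketch of this litmus baseline (our own illustrative code, not the systems from the cited studies) makes clear how little machinery is involved: memorize every token annotated as an entity in the training corpus, and tag any matching token in the test corpus with the memorized type.

```python
def build_lexicon(training_data):
    """training_data: iterable of (token, tag) pairs; remember every
    token that was annotated with a non-"O" entity tag."""
    lexicon = {}
    for token, tag in training_data:
        if tag != "O":
            lexicon[token.lower()] = tag
    return lexicon

def baseline_tag(tokens, lexicon):
    """Tag a test sentence purely by lookup; unseen tokens get "O"."""
    return [lexicon.get(tok.lower(), "O") for tok in tokens]
```

The recall of this tagger on a given test corpus is (up to repetitions) exactly the vocabulary transfer discussed above, which is why it is a useful floor for judging any learned model.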
<p class="TNI-H3"><b>4.3.1.1 Conditional Random Fields</b> To set the stage for a CRF, let us formulate our input <i>x</i> as a <i>sequence</i> (<i>x</i><sub>1</sub><i>, <span class="ellipsis">…</span>, x</i><sub><i>m</i></sub>) of <i>m</i> tokens. Currently, it is best to think of the tokens as words, though in the most general case, they can be higher-order terms (such as phrases and clauses). The CRF is supposed to accept this sequence as input, and then output another sequence of <i>output states</i>, which we denote as <i>s</i> = (<i>s</i><sub>1</sub><i>, <span class="ellipsis">…</span>, s</i><sub><i>m</i></sub>). The output states are the named entity tags. Note that the number of output states equals the number of input states.</p>
<p>Formally, we can define a CRF on observations <span class="font">𝕏</span> and random variables <span class="font">𝕐</span> by first defining a graph <i>G</i> = (<i>V, E</i>) and letting <span class="font">𝕐</span> = (<span class="font">𝕐</span><sub><i>v</i></sub>)<sub><i>v</i>∈<i>V</i></sub>, such that <span class="font">𝕐</span> is indexed by the vertices of G. (<span class="font">𝕏</span>, <span class="font">𝕐</span>) is a CRF when the random variables <span class="font">𝕐</span><sub><i>v</i></sub>, conditioned on <span class="font">𝕏</span>, obey the <i>Markov property</i> with respect to the graph—that is, <i>p</i>(<span class="font">𝕐</span><sub><i>v</i></sub>|<span class="font">𝕏</span>, <span class="font">𝕐</span><sub><i>w</i></sub><i>, w ≠ v</i>) = <i>p</i>(<span class="font">𝕐</span><sub><i>v</i></sub>|<span class="font">𝕏</span>, <span class="font">𝕐</span><sub><i>w</i></sub><i>, neigh</i>(<i>w, v</i>)), where <i>neigh</i>(<i>a, b</i>) means that <i>a</i> and <i>b</i> are neighbors in <i>G</i>. Put even more simply, a CRF is an undirected graphical model with nodes that are partitioned into sets <span class="font">𝕏</span> and <span class="font">𝕐</span>, and with the conditional distribution <i>p</i>(<span class="font">𝕐</span>|<span class="font">𝕏</span>) explicitly modeled.</p>
<p>CRFs attempt to output a good output sequence by expressively modeling the conditional probability <i>p</i>(<i>s</i><sub>1</sub><i>, <span class="ellipsis">…</span>, s</i><sub><i>m</i></sub>|<i>x</i><sub>1</sub><i>, <span class="ellipsis">…</span>, x</i><sub><i>m</i></sub>). First, a <i>feature map</i> <span lang="el" xml:lang="el">Φ</span>(<i>x</i><sub>1</sub><i>, <span class="ellipsis">…</span>, x</i><sub><i>m</i></sub><i>, s</i><sub>1</sub><i>, <span class="ellipsis">…</span>, s</i><sub><i>m</i></sub>) ∈ <span class="font">ℝ</span><sup><i>d</i></sup> is defined that maps an entire input sequence <i>x</i> paired with an entire state sequence <i>s</i> to a <i>d</i>-dimensional feature vector. The probability can then be modeled as a log-linear model with the parameter vector <i>w</i> ∈ <span class="font">ℝ</span><sup><i>d</i></sup>:</p>
<figure class="DIS-IMG"><a id="eq4-1"/><img alt="" class="width" src="../images/eq4-1.png"/>
</figure>
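For readability, the log-linear model of equation (4.1) can be spelled out using the definitions above; this is the standard conditional random field form implied by the surrounding text (the book's typeset equation should be taken as authoritative):

```latex
p(s_1, \ldots, s_m \mid x_1, \ldots, x_m; w) =
  \frac{\exp\big(w \cdot \Phi(x_1, \ldots, x_m, s_1, \ldots, s_m)\big)}
       {\sum_{s'} \exp\big(w \cdot \Phi(x_1, \ldots, x_m, s'_1, \ldots, s'_m)\big)}
```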
<p>Here, <i>s</i>′ ranges over all possible output sequences. For the estimation of <i>w</i>, the assumption is that there is a set of <i>n</i> labeled examples <img alt="" class="inline" height="17" src="../images/pg85-in-1.png" width="67"/>. The regularized log-likelihood function <i>L</i> can now be defined as</p>
<figure class="DIS-IMG"><a id="eq4-2"/><img alt="" class="width" src="../images/eq4-2.png"/>
</figure>
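Consistent with the surrounding description, where the nonlog terms penalize the norms of the parameter vector and λ<sub>1</sub> and λ<sub>2</sub> control the regularization strength, equation (4.2) has the following form (the exact constants in the book's typeset equation should be taken as authoritative):

```latex
L(w) = \sum_{i=1}^{n} \log p\big(s^{(i)} \mid x^{(i)}; w\big)
  \;-\; \lambda_1 \lVert w \rVert_1 \;-\; \frac{\lambda_2}{2} \lVert w \rVert_2^2
```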
<p class="noindent"><span aria-label="86" id="pg_86" role="doc-pagebreak"/>The nonlog terms in equation (<a href="chapter_4.xhtml#eq4-2">4.2</a>) force the parameter vector to be parsimonious in its respective norm, penalizing model complexity. This is the phenomenon of regularization witnessed in regular supervised models like SVMs. The parameters <i><span lang="el" xml:lang="el">λ</span></i><sub>1</sub> and <i><span lang="el" xml:lang="el">λ</span></i><sub>2</sub> allow the system designer to control the level of regularization. Finally, the parameter vector <i>w</i><sup>*</sup> is estimated as <i>w</i><sup>*</sup> = <i>argmax</i><sub><i>w</i>∈<span class="font">ℝ</span><sup><i>d</i></sup></sub><i>L</i>(<i>w</i>). Once estimated, <i>w</i><sup>*</sup> can be used to tag a sentence by outputting state <i>s</i><sup>*</sup> using the equation <i>s</i><sup>*</sup> = <i>argmax</i><sub><i>s</i></sub><i>p</i>(<i>s</i>|<i>x</i>; <i>w</i><sup>*</sup>).</p>
<p>An important point to note in the mathematical treatment of a CRF given here is that it is dependent on the kinds of features included in the feature map. If the features are not discriminative, then there is little that a CRF can do. Generally, a standard set of features that have been found to work well in the NLP community for this problem includes testing the word for such aspects as: Is the word uppercase? Is the word a digit? Other features include the POS tag of the word. A key element of the feature map where the power of CRFs over nonsequential models like SVMs emerges is that features can be crafted over words that occur before or after the target word. For example, we may want to include the word that appeared three steps earlier as a feature in itself, or we may want the POS tag of the words that come after within a window of size 2. More recent work relies on features that are not directly handcrafted but are instead derived from neural models, namely word embeddings (described later in this chapter). The key concept to bear in mind is that features should be diverse and discriminative, and they also should (in the CRF formulation) depend on at least some of the tokens before or after them.</p>
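A feature map of the kind just described might be sketched as follows (an illustration with invented feature names, not the exact feature set of any published system): orthographic tests on the target word, plus context features drawn from a window around it.

```python
def word_features(tokens, i):
    """Features for the token at position i: orthographic tests on the word
    itself, plus the neighboring words within a window of size 2."""
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
    }
    # Context features: words before and after the target, where they exist.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(tokens):
            feats["word@%+d" % offset] = tokens[j].lower()
    return feats
```

Feature dictionaries of exactly this shape are what CRF toolkits typically consume; adding POS tags or embedding-derived features is a matter of extending the dictionary.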
<p><span aria-label="87" id="pg_87" role="doc-pagebreak"/>Note that exact inference in CRFs is intractable except in some special cases (e.g., if the graph is a chain or tree, message passing does yield exact solutions, analogous to the Viterbi algorithm for the case of HMMs), or if the CRF contains only pairwise potentials with submodular energy. Generally, approximate solutions are employed, including loopy belief propagation and mean field inference. Learning the parameters themselves (the training phase) is usually done using maximum likelihood; a convenient property of CRFs is that, if all nodes have exponential family distributions and all nodes are observed, the optimization is convex and can be solved using ordinary gradient descent. However, approximations have to be used when some of the variables are unobserved.</p>
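For the chain case, the exact inference mentioned above is the Viterbi recursion. The following is a compact sketch (our own, over illustrative score dictionaries rather than a full CRF implementation): emissions[t][tag] holds the local score of a tag at position t, and trans[(prev, tag)] the transition score between adjacent tags.

```python
def viterbi(emissions, tags, trans):
    """Exact best-path decoding for a chain model via dynamic programming.
    best[t][tag] is the score of the best tag sequence for positions 0..t
    that ends in tag; back-pointers recover the argmax path."""
    best = [{tag: emissions[0].get(tag, 0.0) for tag in tags}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for tag in tags:
            # Best predecessor for this tag at position t.
            prev = max(tags, key=lambda p: best[-1][p] + trans.get((p, tag), 0.0))
            scores[tag] = (best[-1][prev] + trans.get((prev, tag), 0.0)
                           + emissions[t].get(tag, 0.0))
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final state.
    last = max(tags, key=lambda tag: best[-1][tag])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))
```

The same recursion underlies HMM decoding; for general graph structures, as the text notes, one must fall back on approximate methods such as loopy belief propagation.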
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec4-3-2"/><b>4.3.2 Semisupervised and Unsupervised Approaches</b></h3>
<p class="noindent">Modern supervised approaches (and in particular, deep learning approaches that we describe in the next section) for solving NER can require a large amount of labeled training data in order to learn a system that achieves good performance. The approaches covered thus far assumed that labeled training data is available and can be provided. In fact, the more expressive the model is (and the more parameters it has), the more training data is usually required. However, acquiring such annotations at scale not only involves high degrees of manual labor, but is expensive, time-consuming, and (hence) impractical. In fact, training data is not only difficult to acquire, but it is not always clear that it can even be made available to system developers, especially if the data sets are sensitive or protected under law (such as with health data). Ironically, some of the earliest systems did not require much training data (e.g., in the late 1990s, Collins and Singer only used a few labeled seeds and seven features such as entity context and orthography for NER). More recently, there has been renewed interest in semisupervised information extraction from large corpora. A particularly influential approach has been <i>weak supervision</i>, which requires some amount of human supervision, usually in the very beginning when the system designer provides a starting set of seeds, which are then used to bootstrap the model so that it can proceed without further supervision (until some convergence condition is met). Another set of approaches, building on more classic machine learning theory, is based on active learning, which requires small but periodic interventions from human annotators. The idea behind active learning is that the annotator initially provides a small set of examples, based on which the learner actively decides which other examples to present to the human teacher next for maximal gain. 
Usually, these samples are just the ones on which the learner’s prediction has the greatest uncertainty; by resolving these samples, the human annotator is giving the system the data that benefits it most. Another way to look at this is that it makes annotation more efficient by preempting the labeling of redundant samples that the system is able to automatically label with high certainty. Active learning is one example of “human-in-the-loop” learning that has recently witnessed resurgence, because it represents a hybrid situation where an appropriate balance of system design, data labeling, knowledge engineering, and human intervention together lead to effective performance.</p>
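<p>The uncertainty-sampling strategy described here can be sketched in a few lines. This is a minimal, hypothetical illustration (entropy over predicted label probabilities is one common uncertainty measure; a real active-learning loop would retrain the model between annotation rounds):</p>

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(unlabeled, predict_proba, k=2):
    """Uncertainty sampling: pick the k samples on which the learner's
    prediction is most uncertain, and hand those to the human annotator."""
    ranked = sorted(unlabeled, key=lambda x: entropy(predict_proba(x)),
                    reverse=True)
    return ranked[:k]
```

With a toy model that is confident on one sample (label probabilities [0.99, 0.01]) and near-uniform on another ([0.5, 0.5]), the latter is selected first, while the confident, redundant sample is preempted.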
<p><span aria-label="88" id="pg_88" role="doc-pagebreak"/>However, by and large, the reliance on labeled training data has not disappeared (in the interest of having higher accuracy), but what has changed is the heavy reliance on manual feature engineering. Representation learning has emerged as a key trend in this direction. The basic problem of representation learning is to design an architecture (usually some kind of neural network) that takes as input raw data, such as a sequence of words (or even characters) and outputs a vector (representation) for the unit of interest (typically words, but also sentences and paragraphs). We illustrate classic representation learning over words (given a sufficiently large text corpus) in <a href="chapter_4.xhtml#fig4-5" id="rfig4-5">figure 4.5</a>. As it turns out, representation learning is an important factor attributed to the success of deep neural networks because the layers in such networks learn increasingly sophisticated representations (with increasing depth) of the kinds of objects fed to them. For convolutional neural networks (CNNs), these input objects are usually images, while for RNNs, they are sequences, not unlike the inputs fed to CRFs. It would not be incorrect to think of RNNs as advancing the state-of-the-art over CRFs. Because sequence labeling is the main problem that needs to be solved here, we focus on RNNs in the next section. However, this will not be the only coverage of representation learning in this book. In chapter 10, we describe how the notion of representation learning as applied to words here can be extended to embed nodes, relations, and even complex structures in KGs.</p>
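<p>Once words have been embedded as vectors, neighborhoods like the one in figure 4.5 can be read off by cosine similarity. The sketch below uses tiny, made-up two-dimensional vectors purely for illustration; real embeddings are learned from a large corpus and typically have hundreds of dimensions (see chapter 10):</p>

```python
import numpy as np

# Toy 2-d "embeddings" (made up for illustration; real embeddings
# are learned from a large corpus, not assigned by hand).
vectors = {
    "politics":   np.array([0.9, 0.1]),
    "government": np.array([0.8, 0.2]),
    "election":   np.array([0.85, 0.15]),
    "banana":     np.array([0.1, 0.9]),
}

def nearest(word, k=2):
    """Rank the other words by cosine similarity to `word`."""
    v = vectors[word]
    def cos(u):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    ranked = sorted((w for w in vectors if w != word),
                    key=lambda w: cos(vectors[w]), reverse=True)
    return ranked[:k]
```

In this toy space, the nearest neighbors of “politics” are “election” and “government,” while “banana” ranks last, mirroring the kind of neighborhood structure shown in the figure.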
<div class="figure">
<figure class="IMG"><a id="fig4-5"/><img alt="" src="../images/Figure4-5.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig4-5">Figure 4.5</a>:</span> <span class="FIG">Representation learning (“embedding”) over words, given a sufficiently large corpus of documents (sets of sequences of words). We show words in the neighborhood of “politics.” For visualization purposes, the vectors have been projected to two dimensions. The mechanics behind representation learning are detailed in chapter 10, which describes how to embed KGs, including the actual neural network architectures that tend to be used for these embeddings.</span></p></figcaption>
</figure>
</div>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="89" id="pg_89" role="doc-pagebreak"/><a id="sec4-4"/><b>4.4 Deep Learning for Named Entity Recognition</b></h2>
<p class="noindent">Starting with seminal work in 2011 that relied on representation learning for solving several NLP problems like POS tagging and NER in a unified feature framework, neural NER systems with minimal feature engineering have gained in popularity. Such models are appealing because they do not normally require domain-specific resources like lexicons or ontologies and can scale more easily without significant manual tuning. Several neural architectures have been proposed, mostly based on some form of RNN over word, subword, and character embeddings.</p>
<p>The simplest possible definition of an RNN is that it is a network with loops in it, which allows information to persist. Because they have loops, RNNs can be unrolled (in time or over a sequence), meaning that the network can also be thought of as multiple copies of the same network, each passing a message to a successor. RNNs’ chainlike structure has made them well suited to problems involving sequences and lists, including speech recognition, language modeling, translation, image captioning, and NER. Just like CRFs, RNNs generally accept an input vector and output another vector. However, here the output vector’s contents are influenced not only by the current input, but also by the entire history of inputs fed into the network.</p>
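<p>The unrolled view can be made concrete with a minimal vanilla-RNN forward pass. This is a sketch in NumPy with our own variable names and untrained weights, not a production implementation (practical systems use LSTM variants, described next):</p>

```python
import numpy as np

def rnn_forward(xs, Wxh, Whh, Why, bh, by):
    """Unrolled vanilla RNN: each step's output depends on the current
    input and, through the hidden state h, on the entire input history."""
    h = np.zeros(Whh.shape[0])
    outputs = []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h + bh)  # hidden state carries the history
        outputs.append(Why @ h + by)         # one output vector per step
    return outputs, h
```

Feeding the same final input vector with and without the preceding sequence generally yields different outputs, which is exactly the history dependence that distinguishes RNNs from feedforward classifiers.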
<p>In practice, a particular kind of RNN—namely, Long Short-Term Memory (LSTM)—is used widely in the NLP community owing to its more powerful update equation and some appealing backpropagation dynamics. Despite becoming popular only recently, LSTMs were actually proposed well back in the 1990s, initially as a solution to the vanishing gradient problem in neural networks. LSTMs help preserve the error that can be backpropagated through time and layers. By maintaining a more constant error, these RNNs can continue to learn during thousands of time steps, allowing potentially remote linkages between causes and effects to be modeled and discovered. LSTMs contain information outside the normal flow of the recurrent network in a gated cell. Information can be stored in, written to, or read from a cell, much like data in a computer’s memory. The cell makes decisions about what to store and when to allow reads, writes, and erasures via gates that open and close. Unlike the digital storage on computers, however, these gates are analog, <span aria-label="90" id="pg_90" role="doc-pagebreak"/>implemented with elementwise multiplication by sigmoids, which are all in the range of 0–1. Being analog, they are differentiable, which makes them amenable to backpropagation.</p>
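<p>The gated cell update can be written out explicitly. The following schematic single LSTM step is a sketch under our own weight layout and naming (library implementations typically fuse the four gate matrices into one):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    """One LSTM step. The analog gates (sigmoids in [0, 1]) decide what the
    cell erases (forget), writes (input), and reads (output)."""
    z = np.concatenate([x, h])          # current input plus previous hidden state
    f = sigmoid(W["f"] @ z + W["bf"])   # forget gate
    i = sigmoid(W["i"] @ z + W["bi"])   # input gate
    o = sigmoid(W["o"] @ z + W["bo"])   # output gate
    g = np.tanh(W["g"] @ z + W["bg"])   # candidate cell contents
    c_new = f * c + i * g               # elementwise, differentiable update
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

Because every operation here is elementwise multiplication by differentiable gate activations, the cell contents can be preserved almost unchanged across many steps (forget gate near 1, input gate near 0), which is what allows error to be backpropagated over long spans.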
<p>Returning to the application of LSTMs and RNNs to NER, the most recent models have significantly outperformed feature-engineered systems, despite the latter’s access to domain-specific rules, knowledge, features, and lexicons. An example of an RNN-based model is a <i>character-based architecture</i> where a sentence is modeled as a sequence of characters, with the sequence passed through an RNN that predicts labels for each character (<a href="chapter_4.xhtml#fig4-6" id="rfig4-6">figure 4.6</a>). These labels are then transformed into word labels via postprocessing mechanisms. The potential of character NER neural models was first established empirically in the mid-2010s (e.g., a system in 2016 used highway networks over CNNs on character sequences of words, followed by an LSTM layer and softmax for the final predictions). Character-level architectures may be thought of as a finer-grained generalization of word-level architectures, in which the RNN is fed a sequence of words, not dissimilar to the CRFs discussed earlier. LSTMs and CRFs have also been combined in this context (e.g., in 2015, a proposed system showed that by adding a CRF layer to a word LSTM model, performance could be improved to more than 84 percent F1-score on the CoNLL 2003 data set). Similar improvements were illustrated in domain-specific NER systems such as DrugNER, as well as the medical NER system proposed by Xu et al. (2017). Even more recently, models have been proposed that have not only combined word and character architectures, but also have further supplemented them with one of the most successful features from feature-engineering approaches (namely, <i>affixes</i>). While affix features have been used in NER systems since the early 2000s, as well as in biomedical NER, they had not been used in neural NER systems until quite recently. A number of experiments in 2018 and later showed that affix embeddings capture information complementary to that of RNNs over the characters of a word. In fact, embedding affixes was found to be better than simply expanding the other embeddings to reach a similar number of parameters. Clearly, the affixes are adding more to the model than just plain expressiveness.</p>
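<p>One simple postprocessing scheme for turning character labels into word labels is a majority vote over each word’s characters. This is an illustrative sketch, not the specific mechanism of any published system:</p>

```python
from collections import Counter

def char_to_word_labels(sentence, char_labels):
    """Collapse per-character labels into per-word labels by majority vote
    over each word's characters (one simple postprocessing scheme).
    `char_labels` must have one label per character of `sentence`."""
    word_labels, start = [], 0
    for word in sentence.split(" "):
        end = start + len(word)
        votes = Counter(char_labels[start:end])
        word_labels.append(votes.most_common(1)[0][0])
        start = end + 1  # skip the separating space
    return word_labels
```

For “Michael is here” with PER labels on the first seven characters and O elsewhere, this yields one PER label for “Michael” and O for the remaining words.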
<div class="figure">
<figure class="IMG"><a id="fig4-6"/><img alt="" src="../images/Figure4-6.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig4-6">Figure 4.6</a>:</span> <span class="FIG">A character-based RNN architecture as applied to NER for an input sentence such as “Michael’s birthday is coming.”</span></p></figcaption>
</figure>
</div>
<p>To summarize these threads of research, the general finding has been that word and character hybrid models tend to outperform individual word- and character-based models (sometimes by more than 5 percent on the relevant metric). However, research in this area is still very much underway, and some findings show that progress is still to be made by incorporating key features of past feature-engineered models into modern Neural Network (NN) architectures. For example, a recent system managed to achieve a state-of-the-art result for Spanish, Dutch, and German NER (while performing within 1 percent of the best model for English NER) by incorporating affix features into an earlier system. In keeping with this trend, it is likely that (rather than a complete replacement of feature-engineered systems with representation learning) innovative hybrid models will continue to make further improvements beyond the state-of-the-art NER approaches.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="91" id="pg_91" role="doc-pagebreak"/><a id="sec4-5"/><b>4.5 Domain-Specific Named Entity Recognition</b></h2>
<p class="noindent">An excellent, and high-impact, example of domain-specific NER is in the patient and biomedical domain. There has been so much interest in this field that open-source packages implementing executable surveys of advances in the field (such as BANNER) have even been published. There is much motivation behind biomedical NER, owing mostly to molecular biology rapidly becoming an information-dense field. As such, building automated extraction tools to handle the large volumes of published literature has become a pressing focus. An accurate NER system can go a long way in making sense of these large quantities of text. Significant progress has been achieved over the last decade, as demonstrated in challenge evaluations such as BioCreative (where teams around the world implemented creative solutions to challenges such as the out-of-vocabulary<sup><a href="chapter_4.xhtml#fn1x4" id="fn1x4-bk">1</a></sup> problem). As a result, several systems exist to address this (e.g., ABNER and LingPipe). We note that deep learning has also had a strong influence on biomedical NER (e.g., a system called DrugNER, which is based on a word and character neural network model, has outperformed the best feature-engineered system by almost 9 percent on MedLine test data and 3.5 percent on the overall data set).</p>
<p>In work that we cover in part V of this book, NER has also been applied to more unusual domains, such as illicit advertisements (e.g., containing solicitations for drugs or sexual services) crawled over the web. Extracting key pieces of information from these ads, including phone numbers, physical attributes, and locations, is an important problem because it can help law enforcement find victims of human trafficking or crack down on rings and perpetrators engaging in such activities, sometimes under a legitimate front (e.g., a massage parlor). NER is significantly more difficult in such domains due to different language models and obfuscation of identifying attributes, the heavy presence of long-tail entities (meaning that it is very difficult to get any representative training data), and the noisy nature of web data. Domain-specific techniques are required to achieve the level of accuracy necessary for useful predictive analytics.</p>
<p>Other kinds of special NER also exist. For instance, NER from social media usually involves a different (or at the very least, differently configured) set of techniques compared to newswire; similarly, NER in the absence of an ontology (Open IE) can be much more challenging than ordinary NER. Because these special kinds of NER have become all too frequent due to the advent of big, irregular data sets and growth of social media on the web, we return to them in chapter 7. It is quite likely that as more data sets continue to be released, and as more applications start to emerge, new kinds of NER will be proposed and investigated. At the same time, it is unlikely that the wheel will have to be completely reinvented; most flavors of NER are largely based on a common set of techniques and principles, <span aria-label="92" id="pg_92" role="doc-pagebreak"/>with a continued heavy dependence on advances in sequence-labeling architectures like RNNs.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec4-6"/><b>4.6 Evaluating Information Extraction Quality</b></h2>
<p class="noindent">As the previous sections illustrate, there are many approaches for tackling IE. Evaluating these systems is an equally important problem. The most important metrics used for evaluating IE performance are <i>precision</i> and <i>recall</i>, which were adopted from the information retrieval (IR) research community and may be seen as measures of <i>correctness</i> and <i>completeness</i>, respectively. To define such metrics for IE, let #total denote the total number of slots that should be filled according to an annotated reference corpus (the ground-truth). Let us further denote #correct and #incorrect as the number of slots correctly and incorrectly filled by the system, respectively. Incorrect responses arise for two reasons. Either the slot does not align with a slot in the gold standard (spurious slot) or the slot does align but has been assigned an invalid value. With these notions in place, precision and recall may be defined as follows:</p>
<figure class="DIS-IMG"><a id="eq4-3"/><a id="eq4-4"/><img alt="" class="width" src="../images/eq4-3-4.png"/>
</figure>
<p>Another way to think about precision is that it is the ratio of true positives to the sum of true positives and false positives, where a positive (whether true or false) is defined as a slot that has been produced by the system. In contrast, <i>recall</i> is the ratio of true positives to the sum of true positives and false negatives (slots that exist in the ground-truth but were never output by the system).</p>
<p>There is a clear trade-off between precision and recall, in that improving one usually leads to a loss in the other (although it is not theoretically necessary). Generally, to obtain a finer-grained picture of IE performance, precision and recall may even be measured for each slot type separately. Nevertheless, to navigate the trade-off between the two, a single number is required. An alternative is a plot, but in practice, it is fairly common to report the F1-measure, the (evenly weighted) harmonic mean of precision and recall, defined as follows:</p>
<figure class="DIS-IMG"><a id="eq4-5"/><img alt="" class="width" src="../images/eq4-5.png"/>
</figure>
<p class="noindent">A metric that is specific to IE is the <i>slot error rate</i> (SER), defined as follows:</p>
<figure class="DIS-IMG"><a id="eq4-6"/><img alt="" class="width" src="../images/eq4-6.png"/>
</figure>
<p><span aria-label="93" id="pg_93" role="doc-pagebreak"/>Here, <i>#incorrect</i> was defined earlier, while <i>#missing</i> denotes the number of slots in the reference that do not align with any slots in the system’s outputs. Put differently, it is the ratio between the total number of slot errors and the total number of slots in the ground-truth.</p>
<p>A final note with regard to evaluation is that many of the metrics given here can be further customized depending on what is required by the IE system. For example, we could modify <i>F1</i> to place higher weight on precision over recall (by adjusting a parameter <i><span lang="el" xml:lang="el">β</span></i>, which is set to 1.0 in the definition of <i>F1</i> given here and thus weights precision and recall equally), and similarly, we may modify some of the other metrics to put more weight on spurious slots. In NER systems that rely heavily on some form of machine learning, a <i>validation</i> set is often necessary to ensure that the system’s hyperparameters are tuned such that its outputs are optimized for the metrics of interest.</p>
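<p>These metrics can be computed directly from the slot counts defined in this section. The sketch below restates them as code under one consistent reading of the definitions (function and key names are our own), including the <i>β</i> parameter discussed above (<i>β</i> = 1 recovers F1):</p>

```python
def ie_metrics(correct, incorrect, missing, total, beta=1.0):
    """Slot-level IE metrics. `total` = slots in the ground-truth reference;
    `incorrect` = spurious or wrongly filled slots produced by the system;
    `missing` = reference slots the system never output."""
    precision = correct / (correct + incorrect)   # correctness
    recall = correct / total                      # completeness
    b2 = beta * beta                              # beta > 1 favors recall
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    ser = (incorrect + missing) / total           # slot error rate
    return {"precision": precision, "recall": recall,
            "fbeta": fbeta, "ser": ser}
```

For instance, a system that fills 8 of 10 reference slots correctly, produces 2 incorrect slots, and misses 2 slots scores 0.8 on precision, recall, and F1, with an SER of 0.4.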
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec4-7"/><b>4.7 Concluding Notes</b></h2>
<p class="noindent">NER is often the first line of attack when constructing a KG over raw sources, and in particular, free-text or natural-language documents. NER has been well studied since at least the 1990s, aided in great part by the MUCs. Today, there are a variety of data sets and systems available. More recently, deep learning approaches have led to breakthrough performance increases in several NER tasks. Yet the performance starts declining with lower quantities of training data, and when applied to irregular (or not as well studied) domains, languages, or styles of writing. In later chapters, we return to this issue, especially with respect to extracting entities from messages posted on social media platforms such as Twitter.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec4-8"/><b>4.8 Software and Resources</b></h2>
<p class="noindent">At this time, there are several excellent openly available packages for NER. We enumerate some popular ones here, though this is not meant to be an exhaustive list:</p>
<ul class="numbered">
<li class="NL">1. The Stanford NER, accessible at <a href="https://nlp.stanford.edu/software/CRF-NER.html">https://<wbr/>nlp<wbr/>.stanford<wbr/>.edu<wbr/>/software<wbr/>/CRF<wbr/>-NER<wbr/>.html</a>, is a Java-based NER published by Finkel et al. (2005), and is also known as CRFClassifier because it provides a general implementation of linear-chain CRF sequence models. It is available for download, licensed under the GNU General Public License. The source is also included. The general software for doing not just NER, but other NLP tasks as well is called Stanford CoreNLP (<a href="https://stanfordnlp.github.io/CoreNLP/">https://<wbr/>stanfordnlp<wbr/>.github<wbr/>.io<wbr/>/CoreNLP<wbr/>/</a>), and it includes support for several languages, can be run as a simple web service, and has an integrated NLP toolkit with a broad range of grammatical analysis tools, among other facilities. We also recommend Manning et al. (2014) for more details.</li>
<li class="NL">2. The Natural Language Toolkit, or NLTK (<a href="https://www.nltk.org/">https://<wbr/>www<wbr/>.nltk<wbr/>.org<wbr/>/</a>), is a Python-based platform for NLP and provides easy-to-use interfaces to many standard corpora and <span aria-label="94" id="pg_94" role="doc-pagebreak"/>lexical resources, including WordNet. It also includes text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries. In addition, there is an active discussion forum.</li>
<li class="NL">3. SpaCy (<a href="https://spacy.io/">https://<wbr/>spacy<wbr/>.io<wbr/>/</a>) is another industrial-strength NLP toolkit that is designed for large-scale IE tasks. It uses memory-managed Cython, and independent research conducted a few years ago found it to be among the fastest extraction systems available, making it particularly amenable to web data. SpaCy can also be seamlessly operated with TensorFlow, PyTorch, and other neural network packages.</li>
</ul>
<p>The more recent line of NER systems based on deep learning–based models can be implemented (or modified) using libraries such as TensorFlow, Keras, and PyTorch, accessible at <a href="https://www.tensorflow.org/">https://<wbr/>www<wbr/>.tensorflow<wbr/>.org<wbr/>/</a>, <a href="https://github.com/keras-team/keras">https://<wbr/>github<wbr/>.com<wbr/>/keras<wbr/>-team<wbr/>/keras</a>, and <a href="https://pytorch.org/">https://<wbr/>pytorch<wbr/>.org<wbr/>/</a>, respectively. Another excellent resource for knowledge extraction tools, including for multilingual data and more advanced kinds of IE covered in the next three chapters, is the project page maintained by the BLENDER group at the University of Illinois, Urbana-Champaign: <a href="https://blender.cs.illinois.edu/software/">https://<wbr/>blender<wbr/>.cs<wbr/>.illinois<wbr/>.edu<wbr/>/software<wbr/>/</a>.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec4-9"/><b>4.9 Bibliographic Notes</b></h2>
<p class="noindent">Much of the primary material on NER covered in this chapter has been inspired or derived from some classic surveys on the subject. A particularly useful survey was provided by Nadeau and Sekine (2007), but there are at least a few others that have heavily influenced how we synthesized the material in this chapter, including Piskorski and Yangarber (2013), Appelt (1999), and Kaiser and Miksch (2005). In particular, Piskorski and Yangarber (2013) provide a broad overview of IE as a whole and break it down by task type, including NER, event extraction, and relation extraction. As expected, web IE is not given coverage because the NLP community has not overlapped much with the web community on the subject. [We state this as a historical note because it explains (in part) why we chose to write the chapter on web IE separate from this one.] Additionally, Jiang (2012) can be read as an accessible introduction to information extraction from text, but just like many of the mentioned surveys, it takes a broader overview of NER and relation extraction within a single chapter.</p>
<p>The study of NER itself dates back to the early 1990s in the version that we see today. Domain-specific NER started becoming more popular in the 2000s, although some of the earlier MUCs (discussed next) also had distinctly domain-specific flavors. Work that specifically covers domain-specific name recognition includes Collins and Singer (1999), Phillips and Riloff (2002), and Yangarber et al. (2002). For a reference to why IE is difficult and challenging, we recommend Huttunen et al. (2002).</p>
<p><span aria-label="95" id="pg_95" role="doc-pagebreak"/>The MUC series is largely responsible for directing the attention of the research community toward NER. Because these conferences hold a primary place in the history of modern NLP research, particularly for multisite evaluation of text understanding systems [see Chinchor and Sundheim (1995)], we briefly map out their evolution over time. The first five MUCs were held in the decade following 1987, but it was MUC-6 that truly led to systematic evaluation of NER. As noted by Grishman and Sundheim (1996), MUC-1 (held in 1987) was <i>exploratory</i> because each group designed its own format for recording the information in the document. There was no formal evaluation. By MUC-2 (held in 1989), the task had evolved to what is recognized today as template filling. MUC-2 worked out many of the details of primary evaluation measures, which were held to be precision and recall. By MUC-3, a domain-specific flavor was starting to emerge as the domain shifted to terrorist event reports from Central and South America, broadcast in articles by the Foreign Broadcast Information Service. The template became more complex, going from 10 slots in MUC-2 to 18 slots in MUC-3. MUC-4 remained largely the same, except that the number of slots increased to 24. MUC-5 represented a major shift compared to MUC-4, but it was with MUC-6 (held in 1995) that NER started to attract significant attention from NLP researchers. Several evaluation programs on this task, some of which will be covered later in the context of relation and event IE, include the Automatic Content Extraction (ACE) program, the shared task of the Conference on Natural Language Learning (CoNLL) in 2002 and 2003, and the BioCreative (Critical Assessment of Information Extraction Systems in Biology) challenge evaluation. Important references include Doddington et al. (2004), Hirschman et al. (2005), and Sang and De Meulder (2003).</p>
<p>While today, the phrase “named entity” is common, it was coined at MUC-6. The domain for IE was text (such as newspaper articles) describing company activities and defense-related activities, wherein it became clear that it is essential to recognize informational units such as names of people, organizations, and locations, as well as numeric expressions such as time, date, money, and percent. Identifying references to such entities in text was recognized as an extremely important subtask of the broader IE goal and was referred to as Named Entity Recognition and Classification (NERC).</p>
<p>IE as a sequence-labeling application goes back at least two decades, and possibly more. One of the earlier, more influential tools based on HMM was Nymble; see Bikel et al. (1998). CRFs became a standard method soon after; see Peng and McCallum (2006), Sutton et al. (2012), McCallum and Li (2003), and Sarawagi and Cohen (2005) for guidance. For more on the general application of supervised learning algorithms to NER and other similar NLP problems, excellent references are syntheses by Witten et al. (2016) and Manning and Schütze (1999), particularly the latter for NLP-focused work (although the content is now dated).</p>
<p>Semisupervised and unsupervised techniques have risen in popularity for at least a couple of decades. Early and heavily influential work includes the multilevel bootstrapping <span aria-label="96" id="pg_96" role="doc-pagebreak"/>approach by Riloff et al. (1999), but also Cucchiarelli and Velardi (2001), Pasca et al. (2006), and Lin (1998). Another interesting study, by Jones et al. (2003), was based on active learning for IE. Performance has steadily improved as well; in fact, Nadeau et al. (2006) reported that semisupervised NER was starting to rival supervised NER. Because some of the semisupervised approaches tended to rely on “seeds,” Ji and Grishman (2006) explicitly drew attention to the problem of unlabeled data selection. Other important references include Alfonseca and Manandhar (2002), Evans and Street (2003), Shinyama and Sekine (2004), and Hearst (1992). Note that some of these cover, or intersect with, research on Open IE, which is the topic of chapter 7. The work by Hearst (1992) is important because it led to the adoption of so-called Hearst patterns for unsupervised labeling, which were still in use more than a decade later, as evidenced by Cimiano and Völker (2005).</p>
<p>Finally, we note that many of the surveys cited here were prior to the advent of deep learning. A very recent survey on NER using deep learning models was provided by Yadav and Bethard (2019), and much of our analysis in this chapter on deep learning for NER relied on their work as a primary source. Rather than cite individual papers, we refer the interested reader to that paper for a comprehensive list of citations. Such a list, even in their case, would be incomplete, though, because new work continues to be proposed for improving NER performance even more using deep learning.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec4-10"/><b>4.10 Exercises</b></h2>
<p class="noindent">Given the following set of words {<i>high, low, verb, semisupervised, automatic, noun</i>}, fill in the blanks for these statements. Note that some words in the set may be used more than once (and every word is used at least once):</p>
<ul class="numbered">
<li class="NL">1. Rule-based IE techniques are expected to have _________________ precision and _________________ recall.</li>
<li class="NL">2. Bootstrapping is an example of a(n) _________________ technique for IE.</li>
<li class="NL">3. _________________ extractors have _________________ cost but usually lead to semantic drift.</li>
<li class="NL">4. Supervised extractors have _________________ cost, but have _________________ precision.</li>
<li class="NL">5. Semisupervised extractors require a _________________ amount of training data.</li>
<li class="NL">6. One of the concrete subproblems in IE from natural-language text is defining the domain of interest. One simple approach to do it automatically is by setting any _________________ phrase as a candidate entity and any _________________ phrase as a candidate relation.</li>
</ul>
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_4.xhtml#fn1x4-bk" id="fn1x4">1</a></sup> This problem occurs when the test corpus contains words that were never encountered during training.</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html>