<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch7" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch7"><span aria-label="149" id="pg_149" role="doc-pagebreak"/>7</h1>
<h1 class="chapter-title"><b>Nontraditional Information Extraction</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b> In previous chapters, we covered various task areas in information extraction (IE) including web IE, Named Entity Recognition (NER), and relation and event extraction. However, we also hinted at other kinds of IE that have recently gained traction and exhibit steadily improving performance. One such nontraditional IE is Open IE, which does not depend on the provision of an ontology, although it is not accurate to say that Open IE can extract just anything. Other kinds of nontraditional IE tasks that have continued to become popular include IE from short texts and messages like Short Message Service (SMS) and social media, domain-specific IE in areas like crisis response, and multilingual IE. We provide a flavor of all of these areas in this chapter.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec7-1"/><b>7.1 Introduction</b></h2>
<p class="noindent">In chapter 4 on Named Entity Recognition (NER), we found that some categories, such as <i>Person</i>, <i>Location</i>, and <i>Organization</i>, tend to be frequent enough that generic NER approaches, as well as open tools, are expected to be able to extract instances of these concepts from text. However, in some domains, what constitutes a named entity can be unusual, and best understood by studying the needs of that domain. Intuitive examples of domain-specific IE, some of which we have already covered, or alluded to, in previous chapters include bioinformatics and geopolitics. In the former domain, for example, systems have addressed the extraction of medications and disease names (e.g., <i>Alzheimer’s</i>). In the latter domain, named <i>entities</i>, as classically understood, are simply not enough, and we have to build systems for extracting relations and higher-order entities such as events.</p>
<p>Many other examples have been proposed, including “film,” “scientist,” “project name,” “email address” and so on, depending on the domain under study. In conferences such as the Conference on Natural Language Learning (CONLL), the type “miscellaneous” has often been used to include proper names that are not instances of the classic concepts. Sometimes the class has been augmented with types such as “product,” and in the Message Understanding Conferences (MUCs), “timex” types such as “date” and “time,” and “numex” types such as “money” and “percent,” have become quite common. In fact, since <span aria-label="150" id="pg_150" role="doc-pagebreak"/>2003, a dedicated community has proposed and maintained TIMEX2, a standard for the annotation and normalization of temporal expressions.</p>
<p>Given classic concepts like <i>Person</i> and <i>Location</i>, as well as the existence of a catchall concept such as <i>Miscellaneous</i>, we can frame domain-specific IE as a type of <i>fine-grained IE</i>. Fine-grained IE is typically dependent on a more detailed ontology than we would normally expect with ordinary NER. Not all fine-grained IE has to be domain-specific. For example, some of the earliest work in fine-grained IE addressed the challenge of extracting not just a location, but multiple location subtypes, including city, state, country, and ZIP code. Even when considering an innocuous attribute like <i>Person</i>, fine-grained IE can become important in domains such as politics (where we want to construct a KG that has instances typed according to fine-grained concepts such as <i>Politician</i>, <i>Candidate</i>, and <i>Donor</i>). The distinction between fine-grained IE and domain-specific IE can become blurred; the two can often cooccur. The problem is generally harder than ordinary IE because (for supervised systems), domain-specific annotations and/or more finely annotated corpora are necessary. Unsupervised systems are even harder to build and have to rely considerably on model assumptions. The hardest version of the problem, where any solution is far from achieving human performance, is when domain-specific IE is required for a genre of data (such as social media) that is already difficult for ordinary IE to work with. <a href="chapter_7.xhtml#fig7-1" id="rfig7-1">Figure 7.1</a> illustrates a representative instance of this problem.</p>
<div class="figure">
<figure class="IMG"><a id="fig7-1"/><img alt="" src="../images/Figure7-1.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig7-1">Figure 7.1</a>:</span> <span class="FIG">An illustration of domain-specific IE (<i>natural disaster</i> domain) over social media data. Instances that might be extracted in such domains are underlined and include the natural disaster itself (<i>#CaliforniaFires</i>), support events (<i>Bake sale</i>) that themselves have arguments (<i>Student Leadership Council</i>), span of the disaster (<i>1.8 million acres</i>), and even causal information (<i>prohibiting controlled burns</i>).</span></p></figcaption>
</figure>
</div>
<p><span aria-label="151" id="pg_151" role="doc-pagebreak"/>When domain-specific corpora are developed, they can prove to be important assets to the community. For example, due to the recent interest in bioinformatics, the GENIA corpus was created and led to many studies involving concepts such as <i>Protein</i>, <i>DNA</i>, and <i>Cell Line</i>. Other researchers have used the corpus to identify drug and chemical names. There has also been considerable research on extending domain-specific IE to be open-domain (as discussed in the next section)—that is, building IE systems that are not limited to a set of possible types to extract, but instead discover types by themselves.</p>
<p>Many of the IE challenges and approaches discussed in this chapter are more advanced than those in the preceding chapters, and remain the subject of evolving research. By necessity, there will be less discussion on some of the approaches relative to others, because the jury is still out on some subfields of IE. Where possible, we focus on general or established trends, and point the interested reader to other promising avenues in the section entitled “Bibliographic Notes,” at the end of this chapter.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec7-2"/><b>7.2 Open Information Extraction</b></h2>
<p class="noindent">Unlike traditional IE methods, Open Information Extraction (Open IE) is not limited to a small set of target entity types or relations that are given in advance (via an ontology). Rather, it extracts all types of relations and entities found in a text corpus. In total contrast with domain-specific IE, Open IE facilitates <i>domain-independent</i> discovery of relations and entities extracted from text and is consequently able to scale to large, heterogeneous corpora such as the web. An Open IE system takes as input only a corpus, without any prior specification of the relations and entities of interest. The output is a set of all extracted relations and entities.</p>
<p>We note here that extracting relations is a more prominent concern in Open IE than extracting entities. One reason for this is that comprehensive named entity hierarchies already exist, which contain a relatively robust specification of domain-independent concepts for a system to extract. This goes back to the mid-2000s, when Sekine and Nobata (2004) first defined a named entity hierarchy with many fine-grained subconcepts, such as <i>museum</i> and <i>river</i>, and added a wide range of concepts (e.g., <i>product, event, substance, animal, color,</i> and even <i>religion</i>). In essence, their hierarchy covered the most frequent name types and rigid designators appearing in standard news data. The total number of categories was initially 200, but was being continuously refined (e.g., popular attributes for each concept were being defined even back then, to try and turn the hierarchy into an open-world ontology).</p>
<p>Today, several open-world ontologies (e.g., DBpedia, Wikipedia classes, and YAGO) exist that can be useful for extracting named entities (without an underlying domain ontology), and can even be used to bootstrap NER systems using techniques like distant and weak supervision. Consequently, Open IE research at the frontier has primarily focused on extracting relations between entities without knowing relation types in advance. There <span aria-label="152" id="pg_152" role="doc-pagebreak"/>are many clear benefits of Open IE. First, relations of interest do not have to be specified a priori before posing a query to the system, unlike most structured (or graph-based) querying systems, which only allow valid queries to be posed in terms of concepts and relations from a predefined ontology. Thus, the query can contain terms that correspond to “open relations” and “concepts,” a much more natural and powerful way of accessing data. Second, if Open IE can be satisfactorily solved, especially with open web data and documents, it would provide a big boost to specialized unsupervised and semisupervised systems, because they would be able to benefit from distant supervision. Ideally, an Open IE solution would execute in time <i>O</i>(<i>n</i>), where <i>n</i> is the number of documents in the corpus. An important vision that was presented as a motivation for Open IE originally was that of building (with low supervision) factual web-based question-answering systems capable of handling complex relational queries. Other applications include intelligent indexing for search engines and much lower accuracy degradation for domain-specific IE due to the provision of additional training data acquired through executing Open IE algorithms on web data.</p>
<p>Given these ambitious goals, it is natural to suppose that Open IE is not going to be realized without an equally formidable set of challenges. Some challenges that were identified as early as 2007 (in the context of an Open IE system called <i>TextRunner</i>) are automation, heterogeneity, and efficiency. These challenges are not unique to Open IE, of course; they will be a recurring theme for other subproblems, such as instance matching (IM), in building, cleaning, and querying knowledge graphs (KGs). But despite the common label, for each subproblem (including Open IE) these challenges take on a specific form that is important to understand in the context of the task.</p>
<p>First, automation is a challenge for Open IE because such systems must generally rely on unsupervised extraction strategies (i.e., instead of specifying target relations in advance, possible relations of interest must be automatically detected, usually in a single pass over the corpus). Moreover, the manual labor of creating suitable training data or extraction patterns must be reduced to a minimum by requiring only a small set of hand-tagged seed instances or a few manually defined extraction patterns. Second, heterogeneity presents a problem because it can trip up deep linguistic analysis tools like syntactic or dependency parsers designed for a certain kind of data, genre, or domain. Thus, core reliance has to be on shallow parsing techniques like Part-of-Speech (POS) tagging, or the downstream methods have to be capable of dealing with high quantities of noise and degradation (if deep linguistic parsing tools are used on data that significantly departs from the training data in its characteristics). Finally, as we noted several times in this chapter, the true power of Open IE is realized only when it is considered at web scale. In turn, this mandates that Open IE should be efficient, which is challenging given the nature of the problem.</p>
<p>Because Open IE, as a research area, is much more recent than standard IE subareas like NER, the kind of research convergence and development witnessed in <span aria-label="153" id="pg_153" role="doc-pagebreak"/>NER has yet to take place in Open IE. However, some grouping of systems is possible due to the popularity of the problem among IE researchers. A recent survey on Open IE identified four such groups:</p>
<ul class="numbered-nb">
<li class="NL-N">1. <b>Learning-based.</b> Learning-based Open IE was one of the earliest kinds proposed, and much of the state-of-the-art continues to fall within this paradigm. An early such system was TextRunner, which is described in more detail later in this chapter. The idea behind learning-based Open IE is to use three modules and largely rely on <i>self-supervised machine learning</i>. Self-supervision here is similar to both distant and weak supervision, in that a heuristic means (and typically an external data set) is employed to approximately label a sample with positive or negative labels, and that heuristically labeled sample is then used as a training set. With this paradigm, the first module (called the <i>extractor</i>) tries to generate candidate extractions from the raw data using a range of features. The second module, which is a <i>classifier</i>, further refines the outputs of the extractor, so that only good or trustworthy extractions are retained. The third module, which is an <i>assessor</i>, is generally designed to assign confidences or probabilities to each extraction output by the classifier, using global features, an example of which might be the frequency of occurrence of the extracted information on the web.</li>
</ul>
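The extractor/classifier/assessor decomposition just described can be sketched compactly. The sketch below is a minimal illustration, not code from any actual system: the capitalization-based candidate heuristic, the length-based filter, and the frequency-based confidence are all invented stand-ins for the richer feature-based components used in practice.

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    arg1: str
    relation: str
    arg2: str
    confidence: float = 0.0

def extractor(tokens):
    """Module 1: propose candidate (arg1, relation, arg2) tuples.
    Stand-in heuristic: treat the span between two capitalized tokens
    as the relation phrase."""
    caps = [i for i, t in enumerate(tokens) if t[0].isupper()]
    candidates = []
    for a, b in zip(caps, caps[1:]):
        if b - a > 1:  # there must be a relation phrase in between
            candidates.append(Extraction(
                arg1=tokens[a],
                relation=" ".join(tokens[a + 1:b]),
                arg2=tokens[b]))
    return candidates

def classifier(candidates):
    """Module 2: retain only trustworthy extractions (stand-in rule:
    relation phrases of at most four tokens)."""
    return [c for c in candidates if len(c.relation.split()) <= 4]

def assessor(extractions, corpus_frequency):
    """Module 3: attach a confidence from a global statistic, e.g. how
    often the tuple recurs across the corpus (here: frequency / 10,
    capped at 1.0)."""
    for e in extractions:
        freq = corpus_frequency.get((e.arg1, e.relation, e.arg2), 1)
        e.confidence = min(1.0, freq / 10.0)
    return extractions
```

Running the three modules in sequence on the sentence "Einstein was born in Ulm" yields a single tuple whose confidence reflects how often it recurs in the (hypothetical) corpus statistics.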
<p>A learning-based system proposed after TextRunner, which also learns an extractor without direct supervision, is Wikipedia-based Open IE (WOE). It primarily uses Wikipedia as a source of distant supervision. WOE anticipated projects like DBpedia (discussed in part V of this book) by using entries in Wikipedia infoboxes as a bootstrapping source. The data thus extracted is used to learn extraction patterns on both POS tags and dependency parses. The idea behind using the latter is to try and discover long-range dependencies. In experimental results, dependency features were found to improve both precision and recall over shallow linguistic features, though at the cost of speed and scalability.</p>
<p>A more recent and well-known Open IE system is Open Language Learning for Information Extraction (OLLIE), which bootstraps the learning of patterns based on dependency parse paths, just like WOE. However, rather than rely on Wikipedia bootstrapping, OLLIE obtains more precise weak supervision from the outputs of the rule-based REVERB extractor (described later in this chapter). Furthermore, rather than ignore the <i>context</i> of a tuple and, as a consequence, extract propositions that may only be hypothetical or conditionally true (as opposed to an asserted fact), OLLIE also includes a context-analysis step wherein contextual information from an input sentence around an extraction is analyzed. By doing so, the precision of extractions is significantly improved. OLLIE also expanded the scope of Open IE beyond previous approaches by identifying both verb-based relations, as well as relations mediated by nouns and adjectives. At comparable precision, its recall (often called “yield” in the Open IE literature) also significantly improved compared to baselines. This expansion of scope has been a longer-term impact of OLLIE on the <span aria-label="154" id="pg_154" role="doc-pagebreak"/>Open IE community at large. For example, some other entirely learning-based approaches proposed after OLLIE, such as ReNoun, focused on noun-mediated relations.</li>
<ul class="numbered-ntb">
<li class="NL-N">2. <b>Rule-based.</b> These systems make use of handcrafted extraction rules, a primary example being REVERB, which is a shallow extractor that introduced a syntactic constraint expressed using a simple POS-based regular expression. The extractor was able to cover 85 percent of verb-based relational phrases in English. Hence, REVERB significantly reduces the number of incoherent and uninformative extractions that most previous Open IE systems had produced, and which had always been cited as a practical challenge in truly deploying Open IE in the wild. REVERB also uses mechanisms to deal with other important challenges in practical Open IE, including avoiding the production of relations that are too specific to be useful in downstream higher-order problems like question answering.</li>
</ul>
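A POS-based constraint of the kind REVERB introduced can be approximated with a short regular expression over coarse tag classes. This is a simplified sketch under stated assumptions: the tag-to-class mapping below is partial, and we allow a run of verb tags where the published pattern uses a single verb group with optional particle and adverb; REVERB itself also applies additional lexical constraints.

```python
import re

# Map (a subset of) Penn Treebank POS tags onto coarse classes:
# V (verb), W (noun/adjective/adverb/pronoun/determiner),
# P (preposition/particle/infinitive marker).
TAG_CLASS = {
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
    "NN": "W", "NNS": "W", "JJ": "W", "RB": "W", "PRP": "W", "DT": "W",
    "IN": "P", "RP": "P", "TO": "P",
}

# The relation phrase must look like: V | V P | V W* P
RELATION_PATTERN = re.compile(r"^V+(P|W*P)?$")

def is_valid_relation_phrase(pos_tags):
    """Check whether a candidate relation phrase satisfies the
    POS-based syntactic constraint."""
    encoded = "".join(TAG_CLASS.get(tag, "?") for tag in pos_tags)
    return bool(RELATION_PATTERN.match(encoded))
```

For instance, the tag sequence for "was born in" (VBD VBN IN) satisfies the constraint, while a non-verbal fragment like "of the" (IN DT) does not; it is exactly such fragments that give rise to incoherent extractions.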
<p>Another rule-based system that is specifically able to deal with the extraction of relations that are not just binary, but may have arbitrary arity, is KRAKEN, proposed in 2012. KRAKEN captures complete facts from sentences by gathering the complete set of arguments for each relational phrase (within a sentence); hence producing arbitrary-arity tuples. Furthermore, identification of relational phrases, with corresponding arguments, is designed to rely on handwritten extraction rules over typed dependency parses. A similar system is EXEMPLAR, proposed in 2013, which is also able to extract <i>n</i>-ary relations using handcrafted patterns based on dependency parse trees and uses semantic role labeling to assign each extracted argument its corresponding role (such as <i>agent</i> or <i>patient</i>).</p>
<p>Other approaches, proposed later, were more abstract and semantics-oriented, attempting to mitigate the challenge posed by dependency parses (that the previous two approaches relied on) because it is hard to ascertain the complete structure of a sentence’s propositions using just the dependency parse tree. PROPS introduced a sentence representation, which was specifically designed to represent the proposition structure of an input sentence, and was generated by transforming a dependency parse tree into a directed graph. Propositional extraction becomes much more straightforward in this representation. A rule-based converter is used for the actual conversion. Another example of this kind of approach (where a more semantic structure is proposed or extracted, rather than directly extracting a sentence’s propositions from the dependency parse tree) is PredPatt, which has the added benefit of being one of only a few Open IE systems that are multilingual.</p>
<ul class="numbered-ntb">
<li class="NL-N">3. <b>Clause-based.</b> Clause-based systems are based on the idea that the performance of an Open IE system could improve significantly if, rather than execute the system on a complex sentence directly, we first restructure the sentence into a set of syntactically simplified, independent clauses that are easy to segment into Open IE tuples. ClausIE, first proposed in 2013, is one example of such a <i>paraphrase-based</i> system, and it exploits knowledge of English grammar to map dependency relations of the original sentence into clause constituents, yielding a set of coherent clauses that have a far simpler linguistic structure than <span aria-label="155" id="pg_155" role="doc-pagebreak"/>the original input. Following this step, the clause type can be determined by combining linguistic knowledge of properties of verbs (often helped by the availability of domain-independent lexicons) with knowledge about input clause structure. Based on the clause type, one or more propositions can be generated from each clause, each representing a different information set. For example, consider the sentence “The chauffeur showed Jim to his car.” The clause type is determined to be SVOA (where S, V, O, and A stand for Subject, Verb, Direct Object, and Adverbial, respectively), and a pattern for it could be <i>SV</i>′<i>OA</i>, where <i>V</i>′ is now a complex-transitive verb. The derived proposition would be “&lt;<i>The chauffeur, showed, Jim, to his car</i>&gt;.”</li>
</ul>
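The clause-typing and proposition-generation steps can be illustrated with a toy function. Note the assumption: the constituent labels (S, V, O, A) are supplied by hand here, whereas a real clause-based system derives them from a dependency parse and a verb lexicon.

```python
# The seven basic English clause types used by clause-based systems,
# expressed over S(ubject), V(erb), O(bject), C(omplement), A(dverbial).
CLAUSE_TYPES = {"SV", "SVA", "SVC", "SVO", "SVOO", "SVOA", "SVOC"}

def clause_type(constituents):
    """Determine the clause type from an ordered list of
    (label, text) constituent pairs."""
    signature = "".join(label for label, _ in constituents)
    if signature not in CLAUSE_TYPES:
        raise ValueError(f"unrecognized clause type: {signature}")
    return signature

def to_proposition(constituents):
    """Emit one proposition (an n-ary tuple) from a simple clause."""
    return tuple(text for _, text in constituents)
```

Applied to the hand-labeled constituents of "The chauffeur showed Jim to his car," `clause_type` reports SVOA and `to_proposition` yields the four-place tuple shown in the running example.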
<p>The Stanford Open IE system takes this approach further by using a classifier for splitting a sentence into logically entailed shorter utterances. It does so by recursively traversing the dependency tree and predicting (at each step) whether an edge should yield an independent clause. To further increase the utility of extracted propositions, each clause (which is self-contained) can be maximally shortened by using natural logic inference techniques. Ultimately, a small set of 14 handcrafted patterns is used to extract a predicate-argument tuple from each utterance.</p>
<p>While a few other clause-based Open IE systems besides the ones mentioned here exist, this approach to Open IE is not as popular as the learning-based or even the rule-based approaches. However, clause-based IE has had some influence on the development of interproposition relationship modeling, which is where many of the more recent systems are best classified.</p>
<ul class="numbered-ntb">
<li class="NL-N">4. <b>Interproposition relationship modeling.</b> Many of the previously defined Open IE systems lack expressiveness, in that they are not able to <i>interpret</i> the context in which complex assertions are often made. For example, they cannot determine if the assertion is stated factually, conditionally, or even hypothetically. There is clear value in separating these categories. Earlier, we saw an exception, OLLIE, which first introduced a context-analysis step for precisely making these determinations. OLLIE did so by extracting an <i>attribution context</i> that denoted a proposition as being reported by some entity. For example, OLLIE could extract something along the lines of: (&lt;<i>the universe, originated at, the Big Bang</i>&gt;, AttributedTo <i>modern physicists</i>).</li>
</ul>
<p>By expanding the default Open IE representation with an extra attribute in this way, OLLIE is able to also determine if clausal modifiers such as “if” are used. Both attribution and clausal modifiers are identified by matching patterns against the dependency parse of the sentence, although the specific technique used for each differs. Other systems have followed OLLIE in improving this aspect of Open IE further. For example, OpenIE4, which combines two Open IE systems, SRLIE and RELNOUN, is able to produce similar outputs while also being able to mark additional temporal and spatial arguments. Successors to OpenIE4 have since been released and similarly take a hybrid approach to the problem, leading to improved performance and expansion of problem scope. Yet another approach is <span aria-label="156" id="pg_156" role="doc-pagebreak"/>CSD-IE, which uses a technique called <i>contextual sentence decomposition</i> to further specify propositions with information on which they depend. In essence, it uses handcrafted rules over a parser’s output to split a sentence into subsequences that semantically belong together. Each such subsequence constitutes a context, but while each context now contains a separate fact, it is often dependent on surrounding contexts. In OLLIE, this is represented by extracting additional contextual modifiers, but later systems took a more sophisticated approach by allowing tuples to contain references to other propositions via separate, linked propositions. In order to link propositions, each extraction is assigned a unique identifier, and this identifier is allowed to be used in the argument position of an extraction for a later substitution (with the corresponding fact that the identifier alludes to). This is reminiscent of event extraction except that the extractions are not events. Other approaches have since followed up on the idea of linked propositions.</p>
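A minimal sketch of such an extended representation, with OLLIE-style context attributes and identifiers usable in argument positions, might look as follows. The field names are illustrative assumptions, not those of any particular system.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

_ids = count(1)

@dataclass
class Proposition:
    arg1: str
    relation: str
    arg2: str
    # context annotations in the style of OLLIE's extended tuples
    attributed_to: Optional[str] = None
    clausal_modifier: Optional[str] = None
    # unique identifier, so other tuples can reference this proposition
    id: str = field(default_factory=lambda: f"#{next(_ids)}")

def resolve(prop, store):
    """Substitute any argument that is a known identifier with the
    proposition it refers to (the linked-propositions mechanism)."""
    def sub(arg):
        return store.get(arg, arg)
    return (sub(prop.arg1), prop.relation, sub(prop.arg2))
```

Here a tuple for the Big Bang example would carry `attributed_to="modern physicists"`, and a second tuple could place the first tuple's identifier in its own argument slot, to be substituted at query time.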
<p>Much more recently, several Open IE systems have been proposed (relying largely on interproposition relationship modeling) and have achieved, or come significantly close to, state-of-the-art performance. MinIE is one such approach and aims to decrease the production of overly specific extractions by employing four “minimization modes,” each varying in their aggressiveness, and thereby allowing the user or system designer to toggle between different precision/recall trade-offs. MinIE also semantically annotates extractions with information about polarity, modality, attribution, and quantities instead of directly representing the information in the actual extractions. Another example is Graphene, an Open IE framework inspired by decades-old work in rhetorical structure theory that transforms complex sentences into compact structures by first applying a set of handcrafted simplification rules and then removing clauses and phrases that do not contribute core information. Graphene is also able to identify rhetorical relations that connect core sentences with their associated contexts, and is thereby able to output semantically typed and interconnected relational tuples.</p>
<p class="STX">An important point that should be noted in the context of the taxonomy given here is that such a categorization is only a rough (and largely academic) guide for understanding the various classes of Open IE architecture. In practice, many of the architectures that have continued to prevail, such as OLLIE, Graphene, and OpenIE4 (and its successors) are often hybrid and rely on techniques that arguably fall within several categories. For example, we found that even in its original state, OLLIE relied on the outputs of a rule-based system to improve the precision of weak supervision. The OpenIE series of systems has largely relied on hybrid combinations of approaches. In general, hybrid systems tend to work well in complex machine learning problems where it is difficult to identify one kind of approach that definitively yields good performance across all domains where it is being tested.</p>
<p>While detailing any individual system referenced here is beyond the scope of this chapter, we present two early and successful Open IE systems whose designs have had significant influence in the community. While these systems <span aria-label="157" id="pg_157" role="doc-pagebreak"/>are no longer state-of-the-art, the philosophies underlying their design have continued to withstand the test of time, especially with respect to learning-based architectures.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec7-2-1"/><b>7.2.1 KnowItAll</b></h3>
<p class="noindent">KnowItAll is a self-supervised, web Open IE system that has spawned multiple variants. It relies on a small set of domain-independent extraction patterns (e.g., “&lt;A&gt; <i>is a</i> &lt;B&gt;”) to automatically instantiate relation-specific extraction rules (training data) and thereby learn domain-specific extraction rules iteratively in a bootstrapping process. KnowItAll depends solely on POS tagging rather than NER or deep linguistic parsing, which helps it better tackle the problems of heterogeneity (e.g., portability to other languages) and scale (because NER and deep linguistic analysis tools are expensive).</p>
<p>KnowItAll can autonomously extract facts, concepts, and relationships from the web. By seeding the system with an extensible ontology (e.g., the YAGO ontology) and a small number of generic rule templates, KnowItAll creates text extraction rules for each class and relation in its ontology. Because of this seeding, KnowItAll should not be thought of as a completely unsupervised, open-ended IE system, though in practice, such seeding is not difficult to accomplish and leads to higher quality (in terms of relevance) than relying on approaches that do not require any kind of seeding. KnowItAll relies on a domain-independent and language-independent architecture for populating the ontology with specific facts and relations, and it is capable of supporting scalability and high throughput. Early research used threading and asynchronous message passing to facilitate communication between individual KnowItAll modules. The important modules, some of which include the modules common to other learning-based systems (such as the extractor and the assessor), are described here:</p>
<ul class="numbered">
<li class="NL">1. <b>Extractor:</b> The extractor’s goal is to instantiate a set of extraction rules for each class and relation from a set of generic, domain-independent templates. Using an example from Etzioni et al. (2004), the generic template “NP1, such as NPList1” indicates that the head of each simple noun phrase (NP) in NPList1 is an instance of the class named in NP1. This template can be instantiated to find, for example, politicians from such sentences as “In the conference, she got to mingle with famous politicians, such as Hillary Clinton, Bill Clinton, and Nancy Pelosi.” KnowItAll would be able to extract three instances of the class <i>Politician</i> from this sentence, using the template as defined here. Other such templates can be defined, and if KnowItAll is combined with more recent work on inferring such templates automatically from large text corpora, the template specification process itself could be automated.</li>
<li class="NL">2. <b>Search engine interface:</b> This module is designed to automatically formulate queries based on extraction rules. Each rule has an associated search query composed of the keywords in the rule. For example, the previous rule would lead KnowItAll to issue the query “politicians such as” to a search engine like Google or Bing, download each <span aria-label="158" id="pg_158" role="doc-pagebreak"/>of the pages named in the engine’s results in parallel, and apply the extractor module to the appropriate sentences on each downloaded page. In 2004, KnowItAll used up to 12 search engines, including Google, AltaVista, and Fast. Because many of these are now defunct, a modern version would have to modify, implement, and deploy its search engine interface accordingly.</li>
<li class="NL">3. <b>Assessor:</b> This module assesses the likelihood that the extractor’s conjectures are correct by using statistics computed by querying search engines. The assessor uses Pointwise Mutual Information (PMI) between words and phrases estimated from web search engine hit counts to do so, similar to Peter Turney’s PMI-IR algorithm. For example, suppose that the extractor has proposed “George Bush” as the name of a politician. If the PMI between “George Bush” and a phrase like “politician named George Bush” is high, this gives evidence that “George Bush” is indeed a valid instance of the class <i>Politician</i>. The assessor computes the PMI between each extracted instance and multiple phrases associated with politicians. These mutual information statistics are ultimately combined via a standard machine learning classifier (in the original KnowItAll system, a Naive Bayes classifier is used, although, in principle, more modern and powerful classifiers could also be used).</li>
<li class="NL">4. <b>Database:</b> The database stores its information (including metadata such as the rationale for, and the confidence in, individual assertions) in a Relational Database Management System (RDBMS) such as MySQL or Postgres. Using such databases in the design has some clear advantages over ad-hoc schemes because such databases are persistent, scalable, and support rapid-fire queries and updates. Again, a more modern implementation of KnowItAll could potentially take advantage of databases hosted in the cloud, or even a NoSQL store like MongoDB or Elasticsearch that will be covered in more detail in part IV of this book.</li>
|
||
</ul>
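<p>The assessor’s use of PMI can be sketched as follows. This is a minimal illustration, not KnowItAll’s actual implementation: the hit counts are hard-coded stand-ins for search engine queries, and the discriminator phrases are only examples.</p>

```python
# Sketch of a KnowItAll-style PMI assessor. The hit counts below are
# hard-coded stand-ins; a real assessor would obtain them by querying
# a web search engine for each phrase.
HITS = {
    'George Bush': 2_000_000,
    'politician named George Bush': 15_000,
    'politicians such as George Bush': 12_000,
}

def hit_count(phrase: str) -> int:
    """Stand-in for a search-engine hit-count query."""
    return HITS.get(phrase, 0)

def pmi_score(instance: str, discriminator: str) -> float:
    """PMI-IR-style score: co-occurrence hits normalized by instance hits."""
    joint = hit_count(discriminator.format(instance))
    marginal = hit_count(instance)
    return joint / marginal if marginal else 0.0

# Multiple discriminator phrases for the class Politician; the resulting
# scores would be combined by a classifier such as Naive Bayes.
discriminators = ['politician named {}', 'politicians such as {}']
scores = [pmi_score('George Bush', d) for d in discriminators]
```

<p>Each extracted instance thus receives one feature per discriminator phrase, and the classifier combines these features into a final validity judgment.</p>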
<p>However, KnowItAll has some important limitations as well. For example, the system can require many search engine queries for assessing the reliability of the eligible relation-specific extraction rules, and it also requires as input the name of the relation of interest to the system. In the case of adding or updating a relation, a new learning cycle is further needed. This makes the overall approach rigid and not easily amenable to the scalable addition of new relations.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec7-2-2"/><b>7.2.2 TextRunner</b></h3>
<p class="noindent">There has been considerable research (and corresponding progress) since KnowItAll was first proposed to address some of the aforementioned problems. We provide many pointers in the “Bibliographic Notes” section, but by way of example, we discuss one such influential system called <i>TextRunner</i>. TextRunner is an Open IE system that is <i>learning-based</i> in the taxonomy of the previously described approaches. It addresses the challenges of <span aria-label="159" id="pg_159" role="doc-pagebreak"/>naming relations in advance and scalability. Specifically, TextRunner needs just one pass through the corpus without the need to name any relations of interest in advance.</p>
<p>In brief, TextRunner operates as follows. First, given sample sentences from the Penn Treebank,<sup><a href="chapter_7.xhtml#fn1x7" id="fn1x7-bk">1</a></sup> its learner applies a dependency parser to heuristically identify and label promising extractions as positive and negative training examples. In this sense, it is very similar to weak supervision. One argument made by the system authors for using the Penn Treebank in this manner is that most relations in English can be characterized by a set of several lexico-syntactic patterns (e.g., <i>Entity-1 Verb Prep Entity-2</i>). The weak supervision data can thus be input into a classifier (such as Naive Bayes, though others can be used as well) that learns a model of reliable relations by leveraging unlexicalized POS and noun phrase chunk features. The self-supervised nature of TextRunner mitigates the requirement for providing training data, while the use of unlexicalized features helps the system scale to web-sized corpora.</p>
<p>Following learning, the TextRunner extractor generates candidate tuples by identifying pairs of noun phrase arguments and heuristically classifying each word between the arguments as being part of a relation phrase (or not). Candidate extractions are presented to a classifier, and only reliable extractions are retained. An important advantage of TextRunner in this step is its efficiency and scalability, due to the approach being restricted to shallow features. In the final step, an assessor, which is redundancy-based, assigns a probability to each relation extraction (RE), or retained tuple, based on the number of sentences from which the extraction was found. In doing so, it exploits the redundancy of information on the web, assigning higher confidence to extractions that occur multiple times.</p>
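<p>A highly simplified sketch of this candidate-generation step follows. The chunking and filtering heuristics here are illustrative stand-ins for TextRunner’s actual shallow features, and only proper-noun chunks are treated as arguments.</p>

```python
# Simplified TextRunner-style candidate generation over a POS-tagged
# sentence: find pairs of noun phrase arguments and heuristically keep
# the words between them that look like part of a relation phrase.
# The tag set and heuristics are illustrative, not TextRunner's own.
def np_chunks(tagged):
    """Group runs of consecutive proper-noun tokens into argument spans."""
    chunks, current = [], []
    for i, (_, pos) in enumerate(tagged):
        if pos == 'NNP':
            current.append(i)
        elif current:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

def extract_candidates(tagged):
    """Return (arg1, relation phrase, arg2) candidate tuples."""
    chunks = np_chunks(tagged)
    triples = []
    for a in range(len(chunks)):
        for b in range(a + 1, len(chunks)):
            between = tagged[chunks[a][-1] + 1 : chunks[b][0]]
            # Keep verbs, nouns, prepositions, and particles as the
            # relation phrase; require at least one verb in between.
            rel = [tok for tok, pos in between if pos[:2] in ('VB', 'NN', 'IN', 'RP')]
            if any(pos.startswith('VB') for _, pos in between):
                arg1 = ' '.join(tagged[i][0] for i in chunks[a])
                arg2 = ' '.join(tagged[i][0] for i in chunks[b])
                triples.append((arg1, ' '.join(rel), arg2))
    return triples

sentence = [('London', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('capital', 'NN'),
            ('of', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('Kingdom', 'NNP')]
```

<p>In TextRunner itself, the learned classifier (rather than hand-written filters like these) decides which candidates are reliable, and the redundancy-based assessor then attaches a probability to each retained tuple.</p>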
<p>For each sentence, TextRunner ultimately returns one or more triples, each representing a binary relation between two entities [e.g., (London, capital-of, United Kingdom)], along with the probability of the triple being extracted correctly. In evaluations, TextRunner was found to achieve (on average) 75 percent on the precision metric, but one criticism was that the output of the system was nonnormalized to a large extent (e.g., problems of matching names referring to the same real-world entity or identifying synonymous relations were not handled). When the relation synonymy problem was handled in later work, for example, system performance (in terms of the recall metric) was found to improve. Recall that a key motivation for avoiding relation-specificity in the RE task in the first place is that relation-specific extraction does not scale well to the web.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec7-2-3"/><b>7.2.3 Evaluating and Comparing Open Information Extraction Systems</b></h3>
<p class="noindent">Open IE systems can be compared using both intrinsic and extrinsic evaluation approaches. An intrinsic approach in this context is one in which an Open IE system is compared to <span aria-label="160" id="pg_160" role="doc-pagebreak"/>another Open IE system as a baseline, typically KnowItAll and TextRunner in earlier years, and more recent systems like OLLIE thereafter. For example, MinIE was compared to OLLIE, ClausIE, and Stanford Open IE as baselines. KRAKEN was compared to REVERB as a baseline, while ClausIE was compared to several systems, including REVERB, KRAKEN, OLLIE, TextRunner, and a version of WOE. In contrast with intrinsic approaches, an extrinsic approach uses an external competition, such as the TAC KBP Slot-Filling Challenge and the MCTest comprehension task, to compare and contrast different systems on the same challenge. Although such a challenge allows us to evaluate competing systems using a single test set, it does not control for crucial factors, such as the training paradigm and resources available for training and development. The use of a single competition or data set itself raises issues of broader validity.</p>
<p>Regardless of which approach is used, evaluation of Open IE is not as clear-cut as that of more traditional IE systems due to the lack (even today) of a formal specification that clearly states what constitutes a valid relational tuple. Unfortunately, the lack of a specification does not just pose theoretical problems; it also holds back the potential of the community because it has long prevented the establishment of a common large-scale annotated corpus serving as a gold standard for reproducible, cross-system evaluations. Hence, many Open IE systems have had to be evaluated on small-scale corpora, which is problematic because one of the essential motivations for developing Open IE in the first place was that it could be applied on the scale of the web.</p>
<p>More recently, some researchers have attempted to address the problem, mainly by introducing large-scale gold standard benchmarks. For example, in 2016, a corpus was published that embodied three principles underlying many existing Open IE solutions. These principles are <i>assertedness</i>, which states that extracted propositions should be asserted by the original sentence rather than merely implied by it; <i>minimal propositions</i>, which states that Open IE systems should extract compact, self-contained propositions that do not combine several unrelated facts; and finally, <i>completeness and open lexicon</i>, which implies that all relations asserted in the input text should be extracted. The last of these makes Open IE a truly domain-independent task because all relations must be extracted from heterogeneous corpora, and no assumption about a set of classes of relations is made or specified a priori. Many Open IE systems have traditionally incorporated this principle by considering all possible verbs as potential relations; however, by limiting themselves to verbal predicates, these systems have ignored complex relations mediated by syntactic constructs such as nouns or adjectives. A handful of systems, as we noted earlier, have been trying to broaden the scope of Open IE, however, with increasing success.</p>
<p><span aria-label="161" id="pg_161" role="doc-pagebreak"/>How was the corpus constructed in the first place? In the case mentioned here, the annotations in an existing <i>Question Answering–driven Semantic Role Labeling</i> benchmark<sup><a href="chapter_7.xhtml#fn2x7" id="fn2x7-bk">2</a></sup> were converted to an Open IE corpus, leading to 10,000+ extractions over 3,000+ sentences originally from Wikipedia and the <i>Wall Street Journal</i> (WSJ). However, this effort was not the only one. In 2017, in an attempt to improve reproducibility, a set of researchers proposed RelVis, which is an Open IE benchmark framework that allows comparative analyses of systems at scale, supporting quantitative evaluation using standard measures like precision, recall, and F-score. Impressively, they also define a way to manually analyze qualitative errors to support empirical studies (e.g., studying and detecting errors such as redundant or uninformative extractions is possible using their framework).</p>
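<p>Whatever the benchmark, quantitative comparison ultimately reduces to matching predicted tuples against gold tuples. The following is a minimal scorer sketch; it uses exact matching for simplicity, whereas benchmark frameworks such as RelVis support softer matching criteria.</p>

```python
# Minimal Open IE benchmark scorer: precision, recall, and F1 over
# (arg1, relation, arg2) tuples. Exact matching is used here for
# simplicity; real benchmarks use softer matching criteria.
def score(predicted, gold):
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)  # true positives
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [('London', 'capital-of', 'United Kingdom'),
        ('London', 'located-in', 'England')]
predicted = [('London', 'capital-of', 'United Kingdom'),
             ('London', 'is', 'city')]
p, r, f = score(predicted, gold)
```

<p>Here one of the two predictions matches one of the two gold tuples, so precision, recall, and F1 all come out to 0.5.</p>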
<p>We conclude the discussion on Open IE by noting that many important research issues remain. Perhaps the most urgent issue is for the community to agree upon, and start adopting, the benchmarks noted here for evaluation, since (at this time) not many researchers have actually used the two benchmarks we have described here. Another issue is that many Open IE systems still tend to focus on the extraction of binary relations within the scope of sentence boundaries. On a related note, using Open IE for events that have a time and location, and that potentially involve extracting more than one relation across sentence boundaries, constitutes a challenging task that must be given more attention in the future. Multilingual and social media Open IE involve yet other challenges, for which robust solutions do not yet exist. We describe some of the challenges in working with social media, in particular, in the next section. Finally, integrating good reasoners with Open IE systems is another challenging problem, one that could transform important applications, such as answering general and powerful questions over the web without limiting the ontology or domain of discourse.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec7-3"/><b>7.3 Social Media Information Extraction</b></h2>
<p class="noindent">Thus far, the implicit assumption has been that the IE is being executed on text that is fairly “regular.” This includes not just domain-specific corpora like bioinformatics papers, but more generally, newswire and newswire-like data such as Wikipedia pages. In contrast, social media data tends to follow its own jargon, or <i>language model</i>, especially on length-constrained message platforms like Twitter. While this is also true for domain-specific corpora to some extent, the subject matter on Twitter and social media is not constrained by domain. People just communicate differently on such platforms, even when relaying mundane information like the outcome of a sports game.</p>
<p>Of course, there is no denying the prevalence (and more arguably, utility) of micro-blogging and social media services like Twitter that allow users to write large numbers of short <span aria-label="162" id="pg_162" role="doc-pagebreak"/>text messages, and communicate, comment, and chat about products and events in real time. Social media usage trends have been climbing steadily, boosted in no small part by low-cost, low-barrier, and cross-platform access to such media, as well as the global popularity of smartphones and mobile devices. In many cases, social media provides information that is more updated than conventional information sources, including breaking news. They also allow people to communicate without a mediator, avoiding the suppression of certain voices (which may be too controversial for the mainstream), and to an extent, preventing or mitigating against biased coverage in news media by providing an alternative information source. All these features can be important not just in everyday life, but also in the context of natural disasters (as real-time information and microneeds on the ground can be streamed on social media), or mass political movements like the uprisings during the Arab Spring or other protests (where news coverage may be curbed or controlled by powerful players).</p>
<p>There is a clear benefit from performing automated analysis of social media content, and any comprehensive analysis that attempts to understand what is being said has to involve information extraction (IE). While the discussion thus far has provided some context about what might make IE on social media more challenging than in regular domains, we enumerate a few concrete reasons here:</p>
<ul class="numbered">
<li class="NL">1. First, texts on social media platforms like Twitter tend to be very short (e.g., Twitter limits tweets to 280 characters, and a single SMS message is limited to 160 characters).</li>
<li class="NL">2. Second, texts are noisy, exhibit much more language variation, and are written in a highly informal setting, with attendant features including misspellings, lack of punctuation, unorthodox capitalization, nontraditional word use, heavy use of grammatically incorrect phrases and spellings, and a high incidence of out-of-vocabulary tokens.</li>
<li class="NL">3. Third, meaning in such texts can be conveyed by nontext cues that require some background knowledge, including use of emoticons, nonstandard abbreviations, and hashtags (in Twitter, but with similar equivalents in other social media sources).</li>
<li class="NL">4. Fourth, each microblog document carries little discourse information, and the threaded structure of a conversation is fragmented across multiple documents, flowing in several (sometimes meandering) directions.</li>
<li class="NL">5. Fifth, it is highly uncertain if the information conveyed in the text messages is reliable (e.g., compared to the news media) and should be extracted and incorporated into a central knowledge repository like a KG that will eventually be used for querying and analytics.</li>
</ul>
<p>These challenges suggest that a straightforward application of standard NLP tools on social media content typically results in significantly degraded performance. This has also <span aria-label="163" id="pg_163" role="doc-pagebreak"/>been an opportunity for researchers to develop new, powerful tools for tackling IE, especially methods that can extract information from short, noisy text messages. In fact, to combat the problems noted here, research has focused on microblog-specific IE algorithms, with particular attention dedicated to the normalization of microblog text, because such normalization can help remove linguistic noise from the text before sending the text through to POS tagging and NER. Note that higher-level IE tasks such as event extraction from Twitter or Facebook are more difficult than NER because not all information on an event may be expressed within a single message. Most of the reported work in the area of event extraction from short messages in social media has tended to focus on event detection rather than outright event extraction in the fine-grained way discussed in chapter 6.</p>
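<p>As a flavor of what such normalization involves, here is a minimal rule-based sketch. The rules and placeholder tokens are our own illustrations, not those of any particular published system.</p>

```python
import re

# Minimal rule-based microblog normalization: replace platform-specific
# tokens with placeholders and collapse elongated words before handing
# the text to POS tagging and NER. Rules and placeholders are illustrative.
def normalize_tweet(text: str) -> str:
    text = re.sub(r'https?://\S+', '<URL>', text)   # links
    text = re.sub(r'@\w+', '<USER>', text)          # user mentions
    text = re.sub(r'#(\w+)', r'\1', text)           # keep hashtag word, drop '#'
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)      # 'sooooo' -> 'soo'
    return text

normalized = normalize_tweet('@bob that game was sooooo good!!! #win http://t.co/x1')
```

<p>The order of the rules matters: URLs are replaced before the character-collapsing step so that placeholders themselves are never rewritten.</p>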
<p>Even at the present time, research on IE from social media is still in its early stages, and largely limited to English. Next, we describe the anatomy of some influential systems, notably TWICAL and TwitIE, to illustrate the key principles behind a typical social media IE architecture. Future research will likely focus on further adaptation of classic IE techniques to extract information from short messages used in microblogging services, on tackling non-English social media, and on techniques for aggregating and fusing information extracted from conventional text documents, such as news articles, and short messages posted through social media. One example is using Twitter to enhance event descriptions extracted from classic online news sources with updates that are both timely and situation-specific.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec7-3-1"/><b>7.3.1 TWICAL</b></h3>
<p class="noindent">TWICAL, first presented in 2012, was motivated by the observation that, despite the continuing rise of Facebook and Twitter, previous work in event extraction had largely focused on news articles. As we have noted, there are considerable challenges to processing such messages; however, the particular challenge that was identified in Ritter et al. (2012) was that, while a corpus of social media messages tended to be disorganized and noisy, individual messages were short and self-contained and did not feature the kind of complex discourse structure typical of texts containing narratives. Other challenges are that, due to the diversity of tweets, it is unclear in advance which event types are relevant (because TWICAL focused primarily on open-domain event extraction), and the language model in tweets (which is very informal) makes it difficult to design or adapt traditional NLP tools.</p>
<p>Given a raw stream of tweets, TWICAL extracts named entities in association with event phrases and unambiguous dates involved in significant events. Specifically, it extracts a four-tuple representation of events that includes named entity, event phrase, calendar date, and event type. The authors chose this representation to closely match the way that important events are typically mentioned in Twitter.</p>
<p>The overall procedure can be succinctly described as follows. First, POS tags are obtained for tweets. Second, named entities and event phrases are extracted, temporal expressions are resolved, and the extracted events are categorized into types. Finally, TWICAL <span aria-label="164" id="pg_164" role="doc-pagebreak"/>measures the strength of association between each named entity and date based on the number of tweets they cooccur in, to determine whether an event is significant. NLP tools, such as named entity segmentation systems and POS taggers that were designed to process edited texts (e.g., news articles), perform very poorly when applied to Twitter text due to noise. TWICAL, in contrast, utilizes a preexisting, openly available NER system and POS tagger trained on in-domain Twitter data that had been previously described in Ritter et al. (2011). Thus, it is a good example of an early system that, rather than borrow systems trained on news data, used Twitter data directly to do its own training.</p>
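<p>The final significance-ranking step can be sketched as follows. This is a simplified stand-in that ranks (entity, date) pairs by raw cooccurrence counts; the actual system uses a statistical measure of association strength rather than raw counts, and the example data is fabricated for illustration.</p>

```python
from collections import Counter
from itertools import product

# Simplified stand-in for TWICAL's significance ranking: count the number
# of tweets in which each (named entity, calendar date) pair cooccurs and
# rank pairs by that count. (The real system uses a statistical measure
# of association strength, not raw counts.)
def rank_entity_date_pairs(tweets):
    """tweets: one (entities, dates) pair per tweet, already extracted."""
    counts = Counter()
    for entities, dates in tweets:
        counts.update(product(entities, dates))
    return counts.most_common()

tweets = [
    (['iPhone'], ['2011-10-04']),
    (['iPhone', 'Apple'], ['2011-10-04']),
    (['Steve Jobs'], ['2011-10-05']),
]
ranking = rank_entity_date_pairs(tweets)
```

<p>Pairs at the top of the ranking would then populate the high-confidence calendar entries described below.</p>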
<p>Experimentally, TWICAL was evaluated on about 100 million tweets collected on November 3, 2011, using the Twitter streaming application programming interface (API) by tracking a broad set of temporal keywords, such as “today,” “tomorrow,” names of weekdays, and other such words. Along with temporal expressions and event phrases, named entities were also extracted from the text of each of the 100 million tweets. The authors then added the extracted triples to the data set used for inferring event types, and performed 50 iterations of Gibbs sampling for predicting event types on the new data, holding the hidden variables in the original data constant. A rigorous annotation procedure was followed, with events annotated according to four separate criteria described in Ritter et al. (2012).</p>
<p>A supervised baseline was used to evaluate the success of the approach by measuring precision at different recall levels by varying a threshold parameter. The authors found that because their approach could leverage large quantities of unlabeled data, it was able to outperform the supervised baseline by more than 14 percent. The authors also found that TWICAL’s high-confidence calendar entries were of surprisingly high quality. For example, if the data were limited to the 100 highest-ranked calendar entries over a two-week date range in the future, the precision of extracted (entity, date) pairs was found to be over 90 percent, an 80 percent increase over the baseline. This result has high practical significance, especially in interactive or IR-style approaches, because users’ attention tends to drop rapidly after perusing the first few entries in a ranked list. More broadly, it illustrated that the premises underlying the system were promising for developing good social media IE tools.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec7-3-2"/><b>7.3.2 TwitIE</b></h3>
<p class="noindent">TwitIE is an open-source NLP pipeline first presented in 2013 at a premier NLP conference by researchers from the University of Sheffield. It is an end-to-end system that can be customized to microblog text at every phase. It also includes Twitter-specific data import and metadata handling. TwitIE is a significant modification of a previous system called GATE ANNIE,<sup><a href="chapter_7.xhtml#fn3x7" id="fn3x7-bk">3</a></sup> which was designed for news text. Just like TWICAL, the premise behind <span aria-label="165" id="pg_165" role="doc-pagebreak"/>its development is that social media presents challenges over news text that cannot be handled by only using minor modifications to an existing system.</p>
<p>There are five critical modules in TwitIE, namely: <i>Normalization, Stanford POS Tagging, NER, Language Identification,</i> and <i>TwitIE Tokenization</i>. Because of the language identification module, TwitIE can be used for non-English data. TwitIE uses the <i>TextCat</i> algorithm for language identification, which relies on models based on <i>n</i>-gram frequencies to discriminate among various languages. The authors integrated an adaptation of TextCat to Twitter, making it work for five languages. Accuracy was measured to be at 97.4 percent, with minor variation in per-language accuracy (99.4 percent for English versus 95.2 percent for French). Considering that language identification is a fairly coarse-grained task compared to some others (such as NER and tokenization), the results show that language identification is hard on tweets, but it still achieves fairly good accuracy.</p>
<p>One of the assumptions made by the system is that each tweet is written in only one language, an assumption that is reasonable mainly because of the shortness of tweets, but also by empirical observation. Given a collection of tweets in a new language, it is possible to train TwitIE’s TextCat to support that new language as well by using an <i>n</i>-gram-based fingerprint generator (from a corpus of documents in the new language) included as a plug-in in the TwitIE package.</p>
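<p>The fingerprinting idea can be sketched as follows. This follows the Cavnar–Trenkle “out-of-place” scheme on which TextCat is based; the profile size, n-gram range, and toy training sentences are illustrative, not real language corpora.</p>

```python
from collections import Counter

# Sketch of TextCat-style language identification: build ranked character
# n-gram profiles ("fingerprints") and compare them using the
# Cavnar-Trenkle out-of-place distance. All parameters and training
# text here are illustrative only.
def profile(text, n_max=3, size=300):
    """Rank the most frequent character n-grams of a text."""
    grams = Counter()
    padded = f' {text.lower()} '
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return [g for g, _ in grams.most_common(size)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; missing n-grams get the maximum penalty."""
    rank = {g: r for r, g in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(r - rank.get(g, max_penalty)) for r, g in enumerate(doc_profile))

def identify(text, lang_profiles):
    doc = profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

lang_profiles = {
    'en': profile('the quick brown fox jumps over the lazy dog and the cat'),
    'fr': profile('le renard brun saute par-dessus le chien paresseux et le chat'),
}
```

<p>Training TextCat for a new language, as described above, corresponds to running the fingerprint generator (our <code>profile</code> function here) over a corpus of documents in that language.</p>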
<p>Reliable tweet language identification is also important experimentally because it allows only those tweets written in English to be processed with the TwitIE English POS tagger and NER (as discussed next), by making execution of these modules conditional on the tweet being identified as English. Although Bontcheva et al. (2013) (the original paper) does not demonstrate multilingual capabilities, the authors note that the GATE components that they adapt for TwitIE provide POS tagging and NER in French and German as well; in principle, therefore, much of what is noted next is applicable to these languages with some effort.</p>
<p>While the other modules mentioned earlier are also customized, the case of POS tagging is particularly instructive. Although general-purpose English POS taggers achieve more than 97 percent accuracy on edited text, they are not suitable for microblogs, where their accuracy has been measured to decline to below 75 percent in studies. For this reason, TwitIE contains an adapted Stanford tagger, trained on tweets tagged with the Penn TreeBank tagset. Extra tag labels were added for retweets, Uniform Resource Locators (URLs), hashtags, and user mentions. This adapted tagger was trained using hand-annotated tweets, the NPS Chat corpus,<sup><a href="chapter_7.xhtml#fn4x7" id="fn4x7-bk">4</a></sup> and news text from the Penn TreeBank. While these adaptations yielded a model that was found to achieve 83.14 percent token accuracy, its performance still lags that achieved on news content by about 10 percent (though it was still clearly higher than a naive application of an ordinary POS tagger on social media text).</p>
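<p>The extra tag labels for Twitter-specific tokens can be assigned by simple pattern matching before the statistical tagger handles the remaining tokens. A sketch follows; the tag names (RT, USR, HT, URL) are in the spirit of the adapted tagset but are our own assumptions.</p>

```python
import re

# Pattern-based pre-tagging of Twitter-specific tokens; tokens tagged
# None would be deferred to the statistical POS tagger. The tag names
# are assumptions in the spirit of TwitIE's extended tagset.
PATTERNS = [
    (re.compile(r'^RT$'), 'RT'),             # retweet marker
    (re.compile(r'^@\w+$'), 'USR'),          # user mention
    (re.compile(r'^#\w+$'), 'HT'),           # hashtag
    (re.compile(r'^https?://\S+$'), 'URL'),  # link
]

def pre_tag(tokens):
    """Return (token, tag) pairs; a tag of None means 'use the POS tagger'."""
    tagged = []
    for tok in tokens:
        tag = next((t for pattern, t in PATTERNS if pattern.match(tok)), None)
        tagged.append((tok, tag))
    return tagged

tokens = 'RT @alice loving #nlproc http://example.org today'.split()
tagged = pre_tag(tokens)
```

<p>Handling these closed-class tokens deterministically leaves the statistical tagger free to concentrate on the genuinely ambiguous word tokens.</p>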
<p><span aria-label="166" id="pg_166" role="doc-pagebreak"/>Experimentally, compared to the previous system (ANNIE) from which TwitIE was adapted, as well as the Stanford NER system (often used as the default pretrained NER when performing IE for novel data sets), TwitIE was found to achieve superior experimental results, particularly on the precision metric (which in turn, influenced the final F-measure), although recall was comparable to that of ANNIE (83 percent).</p>
<p>While continuing improvements in social media IE performance are promising, as evidenced by systems like TWICAL, TwitIE, and several others more recently proposed, it should be noted that there is still a significant gap in NER performance on microblogs, as compared to news content. As mentioned earlier, some of this gap could be attributed to insufficient linguistic context (compared to longer, more self-contained news articles) and the inherent difficulty of extracting information from tweets, for all the challenging reasons covered earlier in this section. Another limitation, which the authors of TwitIE acknowledge, is a severe lack of labeled training data, which hinders the adaptation of state-of-the-art NER algorithms, such as the Stanford conditional random field (CRF) tagger. As more training data sets and services are released for nontraditional sources, not only for social media platforms, but also for videos and multilingual corpora, more progress is expected on this issue.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec7-4"/><b>7.4 Other Kinds of Nontraditional Information Extraction</b></h2>
<p class="noindent">Open IE and social media IE are not the only kinds of nontraditional IE possible. We saw at the very beginning of this chapter, albeit briefly, how fine-grained IE with a domain-specific flavor can itself require nontraditional, minimally supervised techniques, not to mention development of novel corpora and annotation schemes. Other kinds of IE are nontraditional for more historical reasons. For example, multilingual IE was introduced into the MUC series of conferences beginning with MUC-5. This followed more than a decade of IE research that was purely focused on English. One of the impetuses for the shift was a growing amount of textual data available in other languages. Even so, much of the focus on non-English IE tended to be limited to the NER problem, with relatively little work reported on higher-level IE tasks, including RE and event extraction. In recent years, this has started to change, especially with the establishment of ambitious government-funded projects like DARPA LORELEI, DEFT, and AIDA, all of which have components designed to work with non-English data; in many cases, with languages that are “low-resource” and have almost no computational resources like English-parallel texts or training data to bootstrap them.</p>
<p>Perhaps because it has been studied for so long exclusively on English corpora, IE in languages other than English is typically harder, and the performance of non-English IE systems is usually worse. One reason is the lack of core NLP components and underlying linguistic resources for many languages, but most of all it is due to the various linguistic phenomena that are nonexistent in English, a subset of which includes:</p>
<ul class="numbered">
<li class="NL">1. <span aria-label="167" id="pg_167" role="doc-pagebreak"/>Lack of white space, such as in Chinese, which can make the problem of word boundary disambiguation more difficult.</li>
<li class="NL">2. Complex declension of proper names, such as in Slavic languages (e.g., Polish), which can make it more difficult to normalize named entities. For example, entire papers have been dedicated to the topic of lemmatizing and matching people’s names in such languages. In contrast, declension in English is so simple compared to other languages that the term is rarely even applied to English.</li>
<li class="NL">3. Free word order and rich morphology, which is common for Germanic, Slavic, and Romance languages and can complicate tasks like RE.</li>
</ul>
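<p>To make the first challenge concrete, the classic baseline for segmenting whitespace-free text is greedy maximum matching, sketched here with a toy dictionary of our own choosing.</p>

```python
# Greedy left-to-right maximum matching, a classic baseline for word
# segmentation in whitespace-free languages such as Chinese. The toy
# dictionary is illustrative only.
def max_match(text, dictionary, max_len=4):
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            # Take the longest dictionary match; fall back to one character.
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words

dictionary = {'北京', '大学', '北京大学', '学生'}
segmented = max_match('北京大学学生', dictionary)
```

<p>Greedy matching fails on genuinely ambiguous character sequences, which is why modern segmenters are statistical; but the sketch shows why, without whitespace, even tokenization becomes a nontrivial modeling problem.</p>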
<p class="noindent">Note that this is not an exhaustive list (e.g., the problem of zero anaphora is common in Japanese and Slavic languages). A <i>zero anaphora</i> in natural-language conversations is the practice of omitting an overt reference term. To contrast zero anaphora with the nonzero case, such as in English, consider the sentence “John says he is coming,” where the word “he” is linked by anaphora to the preceding noun “John.” However, a similar construction in Spanish has no corresponding element: “Dice Juan que viene” (a literal translation is “says John that comes”). This kind of zero anaphora can present significant problems for an IE system that has been tuned to English, where this problem does not occur in everyday parlance.</p>
<p>Another kind of nontraditional IE that is beyond the scope of this work, but that is rapidly gaining in popularity, is the extraction of instances, attributes, and relations from media sources, especially images and video. When dealing with such sources, techniques from both NLP and the computer vision community are necessary. The most successful models tend to be “joint,” not unlike the event-entity joint IE model considered in chapter 6. Furthermore, because of the success of deep learning in the computer vision community in particular, state-of-the-art systems that do IE on media sources are also predominantly based on deep learning and representation learning. Because of the success of these models, which improve with availability of more data and experimentation, the ‘blurring’ of boundaries between text and nontext modalities will continue, and it may not be long before KG construction over such kinds of media will become mainstream. However, we do note that it is (relatively) more expensive to run and train such neural networks, and using pretrained modules is not as straightforward for these modalities as for text. Hence, there remain infrastructure and computation challenges that still need to be addressed before such neural models truly become staples in a KG construction pipeline.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec7-5"/><b>7.5 Concluding Notes</b></h2>
<p class="noindent">In the previous chapters, we introduced some “traditional” IE problems such as web IE, NER, and relation (and event) extraction. However, the heterogeneous nature of the web and subject matter experts, as well as the limitations posed by the traditional solutions, have <span aria-label="168" id="pg_168" role="doc-pagebreak"/>led to a robust set of research results addressing IE problems such as Open IE and social media IE. We describe these IE problems as nontraditional in this chapter, but we maintain that this is only a simplification. Social media, and IE on social media, have now been around for well over a decade, as has Open IE. Thus, whether we should continue to think of them as nontraditional is a matter of subjective opinion. Regardless, it is undeniable that these kinds of IE involve a different set of challenges and questions than traditional IE, which is largely concerned today with improving performance and with closing the gap between supervised and unsupervised systems. Social media IE is especially challenging because social media itself is continuously morphing, and there is considerable heterogeneity between various social media platforms, not to mention disparities that arise due to geographies, topics, and cultural norms.</p>
<p>Despite the considerable difference of opinion in the community on how best to tackle many of these challenges, certain trends have been emerging, as we have attempted to describe in this chapter. Open IE systems, for example, can be taxonomized in four categories in the way they approach the problem, although many of the best systems are hybrids that combine rule- and learning-based techniques. Yet it is unlikely that any of these approaches will be superseded anytime in the near future, and it is our prediction that novel state-of-the-art systems will draw on their predecessors in improving upon some aspect of these techniques, while adapting the best of other techniques to maximize performance. We expect to find something similar in the evolution of social media IE research.</p>
<p>Nevertheless, many interesting and open areas of research remain in all the kinds of IE covered in this chapter, some more theoretical and others firmly empirical. While the most obvious agenda is to improve the performance of nontraditional IE systems (which understandably still lags that of more traditional IE systems that have received significantly more research and community resources), it is equally important to make these systems more robust and more capable of capturing the richness of human language.</p>
<p>We also discussed how, in the last two decades, IE has moved from monolingual, task-tailored, knowledge-based systems to multilingual, adaptive architectures that deploy minimally supervised machine learning to automate many elements of an IE pipeline. Some such pipelines, as in Open IE, are so minimal in their required inputs that not even an ontology is needed. However, these systems are not completely unsupervised, because they do rely on other kinds of user input, as we noted when describing some of the representative approaches. In particular, because many Open IE systems rely on techniques such as distant or weak supervision, it is not clear whether they can be successfully extended to domain-specific corpora. Some work in this area, especially in bioinformatics, has been promising, however, and we will likely see more examples in the years to come.</p>
<p><span aria-label="169" id="pg_169" role="doc-pagebreak"/>Finally, considering all the kinds of IE that we have encountered in this chapter, is it still fair to use the same performance metrics and evaluation protocols? Intuitively, in evaluating Open IE as opposed to ontologically grounded IE (such as the NER and RE cases considered in previous chapters), we must account for subjectivity in deciding whether an extraction contributes positively to recall. An extraction that seems erroneous or trivial to us may matter to another user looking at the corpus through a different lens. Similarly, when considering social media IE, one must ask whether the independent and identically distributed (i.i.d.) assumption is a good one to make. Many people can make sense of social media only when presented with broader context (e.g., a certain tweet may not make sense to someone who is not well versed in US politics or Hollywood celebrity culture and scandal). Therefore, when extracting from tweets, it may be fairer not to use a truly random sample, but instead a background corpus that is more topically cohesive and uniform. Arguably, this cohesiveness would give an IE system (or accompanying tools such as word embeddings) a better chance of understanding and extracting from tweets and other sparse social media. Much remains to be decided, therefore, when evaluating what we have referred to in this chapter as nontraditional IE.</p>
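<p>As a toy illustration of this subjectivity (the triples and reference sets below are hypothetical, chosen only for the example), note how the recall of the very same system output shifts depending on which reference set an evaluator accepts:</p>

```python
def triple_recall(extractions, reference):
    """Fraction of reference (subject, relation, object) triples recovered."""
    if not reference:
        return 0.0
    return len(set(extractions) & set(reference)) / len(reference)

system_output = {
    ("Obama", "was born in", "Hawaii"),
    ("Obama", "served as", "president"),
}

# A strict evaluator counts only one triple as a meaningful extraction ...
strict_reference = {("Obama", "was born in", "Hawaii")}
# ... while a lenient one also credits the second triple and expects a third.
lenient_reference = strict_reference | {
    ("Obama", "served as", "president"),
    ("Hawaii", "is part of", "the US"),
}

r_strict = triple_recall(system_output, strict_reference)    # 1.0
r_lenient = triple_recall(system_output, lenient_reference)  # 2/3
```

<p>Neither score is “wrong”; the two evaluators simply disagree on what the corpus contains, which is precisely the difficulty absent from ontologically grounded evaluation.</p>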
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec7-6"/><b>7.6 Software and Resources</b></h2>
<p class="noindent">Open IE has been experiencing a renaissance of late, and several good software implementations exist. Because Open IE is still experimental compared to standard NER, the interested practitioner is advised to test these on samples of their own data sets before settling on any particular option. In some cases, no existing implementation may suffice, and it may be necessary to reimplement or customize an existing algorithm, with the systems discussed in this chapter serving only as guidance. Here, we list some promising existing implementations:</p>
<ul class="numbered">
<li class="NL">1. The Stanford NLP group, also mentioned in earlier chapters, offers an Open IE facility as well, published by Angeli et al. (2015): <a href="https://nlp.stanford.edu/software/openie.html">https://<wbr/>nlp<wbr/>.stanford<wbr/>.edu<wbr/>/software<wbr/>/openie<wbr/>.html</a>.</li>
<li class="NL">2. The Allen Institute offers another excellent resource: <a href="https://demo.allennlp.org/open-information-extraction">https://<wbr/>demo<wbr/>.allennlp<wbr/>.org<wbr/>/open<wbr/>-information<wbr/>-extraction</a>. The page hosts a demo of a reimplementation of a recent bidirectional long short-term memory (BiLSTM) model by Stanovsky et al. (2018).</li>
<li class="NL">3. The resource <a href="https://github.com/NPCai/Open-IE-Papers">https://<wbr/>github<wbr/>.com<wbr/>/NPCai<wbr/>/Open<wbr/>-IE<wbr/>-Papers</a> maintained on GitHub is an impressive repository and taxonomic classification of various papers, including recent ones using neural networks. It also offers guidance on available training and testing data.</li>
<li class="NL">4. Another excellent aggregation resource for Open IE is available at the webpage: <br/><a href="https://paperswithcode.com/task/open-information-extraction/latest">https://<wbr/>paperswithcode<wbr/>.com<wbr/>/task<wbr/>/open<wbr/>-information<wbr/>-extraction<wbr/>/latest</a>. It provides a relatively well-maintained and up-to-date list <span aria-label="170" id="pg_170" role="doc-pagebreak"/>of Open IE papers with code. At the time of writing, 18 papers were listed.</li>
</ul>
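<p>As a minimal sketch of how such implementations are typically invoked, consider option 1: the Stanford CoreNLP distribution exposes its Open IE annotator through an HTTP server. The snippet below only constructs the request; it assumes a CoreNLP server has already been started separately and is listening locally on port 9000 (the annotator list follows the CoreNLP documentation for <code>openie</code>).</p>

```python
import json
import urllib.parse

def build_openie_request(text, host="http://localhost:9000"):
    """Build the URL and POST body for a CoreNLP-server Open IE call.

    The server itself is assumed to be launched separately, e.g.:
    java -mx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
    """
    properties = {
        # openie depends on these upstream annotators
        "annotators": "tokenize,ssplit,pos,depparse,natlog,openie",
        "outputFormat": "json",
    }
    query = urllib.parse.urlencode({"properties": json.dumps(properties)})
    return host + "/?" + query, text.encode("utf-8")

url, body = build_openie_request("Barack Obama was born in Hawaii.")
# POSTing `body` to `url` (e.g., with urllib.request) returns JSON in which
# each sentence's "openie" field lists (subject, relation, object) triples.
```

<p>The other options in the list above have their own invocation styles (library calls, demo pages, or released research code), so the same advice applies: test on a sample of your own data first.</p>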
<p>When it comes to social media IE, one option is to gather a training corpus of tweets or other social media messages (such as posts from Facebook or WhatsApp) and train one of the systems listed in the NER and RE chapters. Many of those systems are based on CRFs or on neural networks optimized for sequence labeling, and with enough data, nothing prevents them from serving as good IE systems for such data. Because social media is domain-specific, and also culture-specific (as we pointed out in the section entitled “Concluding Notes”), we always recommend such retraining. Even the word embeddings should ideally be retrained on a background corpus that more closely resembles the social media data the user is looking to process. Most pretrained word embeddings are trained on “normal” text, such as that found in Wikipedia and news articles, which does not closely resemble the informal (or even irreverent) nature of social media.</p>
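<p>One reason retraining matters is that a sequence labeler for tweets benefits from social-media-specific token features that models trained on news text never encounter. A minimal, hypothetical feature extractor (the feature names and choices are our own, for illustration) might look like this:</p>

```python
def tweet_token_features(tokens, i):
    """CRF-style feature dictionary for the i-th token of a tokenized tweet."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_hashtag": tok.startswith("#"),
        "is_mention": tok.startswith("@"),
        "is_url": tok.startswith(("http://", "https://")),
        "is_upper": tok.isupper(),    # all-caps "shouting" is common in tweets
        "is_title": tok.istitle(),    # weaker capitalization cue than in news text
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

tweet = "@nasa Launch day!! #SpaceX https://t.co/abc".split()
features = [tweet_token_features(tweet, i) for i in range(len(tweet))]
```

<p>Lists of such per-token dictionaries can be fed to a CRF toolkit (e.g., sklearn-crfsuite) or converted to indices for a neural sequence labeler; the embeddings, as noted above, should likewise be retrained or fine-tuned on a tweet corpus.</p>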
<p>However, there are a few preimplemented packages available for those looking to specifically experiment with Twitter data. One option is the TwitIE system by Bontcheva et al. (2013), which maintains a homepage accessible at <a href="https://gate.ac.uk/wiki/twitie.html">https://<wbr/>gate<wbr/>.ac<wbr/>.uk<wbr/>/wiki<wbr/>/twitie<wbr/>.html</a>. The system described by Ritter et al. (2011, 2012) is also available at <a href="https://github.com/aritter/twitter_nlp">https://<wbr/>github<wbr/>.com<wbr/>/aritter<wbr/>/twitter<wbr/>_nlp</a>.</p>
<p>The BLENDER group at UIUC also has some tools that could prove useful for multilingual, joint, and other kinds of IE covered in this and previous chapters: <a href="https://blender.cs.illinois.edu/software/">https://<wbr/>blender<wbr/>.cs<wbr/>.illinois<wbr/>.edu<wbr/>/software<wbr/>/</a>. Another group that maintains an excellent set of NLP resources is ARK, at the University of Washington: <a href="http://www.ark.cs.washington.edu/">http://<wbr/>www<wbr/>.ark<wbr/>.cs<wbr/>.washington<wbr/>.edu<wbr/>/</a>. Several of the resources available for download are designed for Twitter. Some will not directly yield IE output, but their output could serve as features to robustly bootstrap downstream IE training. For example, the source code available at <a href="https://github.com/brendano/ark-tweet-nlp/">https://<wbr/>github<wbr/>.com<wbr/>/brendano<wbr/>/ark<wbr/>-tweet<wbr/>-nlp<wbr/>/</a> could be used for robust POS tagging of tweets.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec7-7"/><b>7.7 Bibliographic Notes</b></h2>
<p class="noindent">Open IE has been an active and important area of research for about 15 years. As currently defined, the term seems to have been popularized around the mid-2000s, with a seminal reference being Etzioni et al. (2008), although ideas sharing a (somewhat) similar philosophy had occasionally been proposed earlier. In their survey, Niklaus et al. (2018) draw on that work to identify three major challenges for Open IE systems, which we also alluded to earlier in the chapter (automation, corpus heterogeneity, and efficiency). The TextRunner system was the solution proposed by Etzioni et al. (2008) to these problems. As mentioned in the chapter, TextRunner used the KnowItAll system devised by Etzioni et al. (2004) as a baseline. Piskorski and Yangarber (2013) also provide an important perspective on Open IE and its relationship to other kinds of IE.</p>
<p><span aria-label="171" id="pg_171" role="doc-pagebreak"/>Since then, a large body of work has emerged on the subject, with a handful of influential references including Wu and Weld (2010), Fader et al. (2011), Yahya et al. (2014), Akbik and Löser (2012), Mesquita et al. (2013), Stanovsky et al. (2016), White et al. (2016), Christensen et al. (2010), Del Corro and Gemulla (2013), Schmidek and Barbosa (2014), and Gashteovski et al. (2017), many of which covered important systems like OLLIE, REVERB, and ClausIE that we briefly described in this chapter. Recent work in Open IE has also been quite prolific; for instance, Cetto et al. (2018) proposed the Graphene OpenIE framework.</p>
<p>Social Media IE, though a more novel research enterprise than Open IE, has become increasingly popular over time, especially as social media itself has grown in volume and popularity. Good starting references for the interested reader include Morgan and Van Keulen (2014), Hua et al. (2012), and Atefeh and Khreich (2015). Among specific systems covered in this chapter, TWICAL and TwitIE were proposed by Ritter et al. (2012) and Bontcheva et al. (2013), respectively. Other selected references for the social media context (though some are indirect) include Sankaranarayanan et al. (2009), Popescu et al. (2011), Li et al. (2012), and Zhou et al. (2015).</p>
<p>Toward the end of the chapter, we described cutting-edge research areas in nontraditional IE, including multilingual, cross-lingual, and multimodal IE. While much work remains to be done in these areas, we recommend Poibeau et al. (2012), Mei et al. (2018), Zhang, Duh, et al. (2017), Zhang, Whitehead, et al. (2017), Gong et al. (2017), and Mouzannar et al. (2018) as essential readings for the interested researcher to get started in these areas, with the understanding that in these new papers, the boundaries between different research areas are getting increasingly blurred. For example, Zhang, Whitehead, et al. (2017) not only covers the multimodal setting, but deals with the more complicated challenges posed by event extraction. Similarly, Mouzannar et al. (2018) describes a social media–related IE task and uses multimodal deep learning to accomplish the task.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec7-8"/><b>7.8 Exercises</b></h2>
<ul class="numbered">
<li class="NL">1. We will try to do extraction on tweets using standard NER packages such as NLTK and SpaCy. You could use either one or both (or even other well-known packages that come with pretrained versions), but we recommend being consistent in usage in this, and the next several, questions. We assume NLTK is being used in these exercises. An interactive demo version for trying out individual texts is available at <a href="https://text-processing.com/demo/">https://<wbr/>text<wbr/>-processing<wbr/>.com<wbr/>/demo<wbr/>/</a>, but to do all the exercises, you will need to set up a local version of NLTK.<br/><span aria-label="172" id="pg_172" role="doc-pagebreak"/>As a first step, let us try to play with Twitter data and observe what happens. Go to twitter.com and copy-paste five tweets (one at a time) into the “Tag and Chunk Text” portal in the demo.<sup><a href="chapter_7.xhtml#fn5x7" id="fn5x7-bk">5</a></sup> Try to be diverse in your selection.</li>
<li class="NL">2. For each of the tweets, how well does NLTK identify named entities? What are the issues you observe? As a control, try to paste in five sentences from Wikipedia. What are the estimated differences in precision and recall between the control setting and the tweets?</li>
<li class="NL">3. Think of how you could improve such a model. List five features that you could extract from tweets that would help you do better. Use your own examples (and the NLTK errors in them) to argue for the utility of your features.</li>
<li class="NL">4. For this exercise, we will use a publicly available Twitter data set. There are several such data sets available, including on competition websites like Kaggle. One example is <a href="https://www.kaggle.com/crowdflower/twitter-airline-sentiment">https://<wbr/>www<wbr/>.kaggle<wbr/>.com<wbr/>/crowdflower<wbr/>/twitter<wbr/>-airline<wbr/>-sentiment</a>. Download this data set, and prepare the data so that it can be run through the NLTK extraction module. You will know that you have succeeded when the module runs seamlessly on a sample of this data and outputs named entities (though in all likelihood, many will be incorrect).</li>
<li class="NL">5. Randomly sample 100 tweets, and build a “ground-truth” set of entities. Restrict your ontology to persons, organizations, and locations, and be liberal in your definitions. Feel free to add or remove things from your (initially random) sample until you get to a set of tweets that has enough ground-truth extractions from all three classes. Looking at your output from the previous question, what are the precision, recall, and F-measure of NLTK outputs on each of the three classes on your ground-truth?</li>
<li class="NL">6. To understand the value of preprocessing, run your data set through NLTK with two kinds of preprocessing, and also without preprocessing. Once again, use the ground-truth set that you carefully constructed in exercise 5 to evaluate your results. Based on your results, what kind of preprocessing seems to work well? Could you “add” more preprocessing steps to make it even better? What is the best performance you are able to get, and what preprocessing steps did you have to take to get to that performance?</li>
<li class="NL">7. * The Stanford CoreNLP package supports Open IE: <a href="https://stanfordnlp.github.io/CoreNLP/openie.html">https://<wbr/>stanfordnlp<wbr/>.github<wbr/>.io<wbr/>/CoreNLP<wbr/>/openie<wbr/>.html</a>. Use the module on your Twitter data set. Do a careful, comparative analysis between the Open IE outputs and your previous NLTK outputs. Despite the noise, does the Open IE discover new or interesting things that you didn’t catch earlier (and if so, what)?</li>
</ul>
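<p>For exercises 2, 5, and 6, the scoring itself can be kept simple. The following helper (our own sketch, not part of NLTK) treats entities as exact (mention, type) pairs and computes precision, recall, and F-measure against a ground-truth set; the example entities are hypothetical:</p>

```python
def entity_prf(predicted, gold):
    """Precision, recall, and F-measure over sets of (mention, type) pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # exact matches on both mention text and type
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {("Obama", "PERSON"), ("Hawaii", "LOCATION"), ("NASA", "ORGANIZATION")}
pred = {("Obama", "PERSON"), ("NASA", "LOCATION")}  # NASA mistyped by the tagger
p, r, f = entity_prf(pred, gold)  # p = 0.5, r = 1/3, f ≈ 0.4
```

<p>Exact matching is deliberately strict: a correctly located mention with the wrong type counts as both a false positive and a false negative, which is worth keeping in mind when interpreting the per-class scores in exercise 5.</p>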
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_7.xhtml#fn1x7-bk" id="fn1x7">1</a></sup> The Penn Treebank project is a corpus, available to members of the Linguistic Data Consortium (LDC), used widely in the Natural Language Processing (NLP) community for its POS annotations.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_7.xhtml#fn2x7-bk" id="fn2x7">2</a></sup> These are downstream tasks that may rely on Open IE as one solution. Question answering will be covered in more depth in part IV.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_7.xhtml#fn3x7-bk" id="fn3x7">3</a></sup> GATE is an open-source NLP framework while ANNIE is a general-purpose IE pipeline. GATE comes prepackaged with ANNIE.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_7.xhtml#fn4x7-bk" id="fn4x7">4</a></sup> Described at <a href="http://faculty.nps.edu/cmartell/NPSChat.htm">http://<wbr/>faculty<wbr/>.nps<wbr/>.edu<wbr/>/cmartell<wbr/>/NPSChat<wbr/>.htm</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_7.xhtml#fn5x7-bk" id="fn5x7">5</a></sup> If the demo is offline or not available (or not working), then you should switch to your local version (with your favorite pretrained model) for this exercise.</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html> |