
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch17" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch17"><span aria-label="439" id="pg_439" role="doc-pagebreak"/>17</h1>
<h1 class="chapter-title"><b>Knowledge Graphs for Domain-Specific Social Impact</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b>Constructing knowledge graphs (KGs) over arbitrary domains is a difficult problem that has recently emerged as both important and feasible. In previous chapters, we have already seen some use-cases, such as for understanding events and their dynamics. In some sense, every real-world KG exists in society and has social impact. However, many such KGs, including DBpedia and Wikidata, are designed more as jack-of-all-trades entities that can support a broad set of applications. In contrast, research on domain-specific KGs for social impact have tried to have a major influence on one major field or application. Particularly influential examples include building KGs for fighting human trafficking (HT), forecasting geopolitical events (and hence, better preparing for them), and effectively mobilizing resources after natural disasters. Despite their potential impact, such application domains (in the computational and KG communities) have not been as well covered historically, and are quite challenging to work with in practical settings. Fortunately, as we discuss in this chapter, there has been a recent surge of exciting research in this area, with demonstrable success.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec17-1"/><b>17.1Introduction</b></h2>
<p class="noindent">We are used to thinking of the web as a one-size-fits-all ecosystem. Everyone seems to use the same (or similar variants of) web technologies such as email clients, social media, news and Really Simple Syndication (RSS) feeds, Google, YouTube, and even knowledge sources like Wikipedia. We could even argue that this is why the web has been so successful because it is largely predicated on a set of uniform standards and protocols like HTTP.</p>
<p>It is easy to forget that the web also has a significant long tail, and there is much evidence, both empirical and anecdotal, showing the importance and ubiquity of the long tail. Think of all the restaurants that do not have a Wikipedia page, or websites that you visit for myriad purposes but that are not household names like Amazon. Moving beyond the long tail, it does not take much imagination to see that there are entire subsets of the web that are <i>domain-specific</i>. Every academic in computer science, for example, is familiar with bibliographic portals like Google Scholar, DBLP, PubMed, and several others. Artists, photographers, and numerous other domain experts have their own catalogs of go-to pages. <span aria-label="440" id="pg_440" role="doc-pagebreak"/>Sometimes there is a real need for us to limit our search and attention to pages belonging to a specific domain, not just the entire web.</p>
<p>A real-world example of such a domain expert, in the securities fraud domain, is an employee of the Securities and Exchange Commission (SEC) who is attempting to identify actionable cases of penny stock fraud. Penny stock offerings in over-the-counter (OTC) markets are frequently suspected of being fraudulent, but without specific evidence, usually in the form of a false factual claim (that is admissible as evidence), their trading cannot be halted. With thousands of penny stock offerings, investigators do not have the resources or time to Google and investigate all of them.</p>
<p>A technical workflow that addresses this problem would be to crawl a corpus of relevant pages from the web describing the domain, a process that was described in detail in chapter 3, using methods like focused crawling or the (much more recent) domain discovery tool (DDT). Once such a corpus is obtained, an expert in information extraction (IE) and machine learning would elicit opinions from the users on what fields (e.g., location, company, or stock ticker symbol) are important for answering domain-specific questions, along with example extractions per field. This sequence of KG construction (KGC) steps results in the familiar KG, which is now amenable to aggregations and to both keyword and structured querying, using many of the querying and analytics techniques studied in part IV of this book. With a good interface, for example, the domain expert can identify all persons and organizations (usually shell companies) associated with a stock ticker symbol, aggregate prices, or suspicious activity by searching for hyped-up phrases that indicate fraud. Furthermore, even if the initial KGC procedures do not yield ideal quality, we demonstrated in part III some mechanisms (including representation learning, entity resolution, and data cleaning) that could be used to complete or clean the (initially constructed) noisy KG.</p>
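To make this workflow concrete, here is a minimal, self-contained sketch of the pipeline just described: field extraction over a toy crawled corpus, assembly of a small KG, and a structured query over it. All page contents, field names, glossaries, and hype phrases are invented for illustration; this is not the code of any actual system.

```python
# A toy end-to-end sketch: extract user-defined fields from crawled pages,
# assemble a small KG, and answer a structured query. All data is invented.
import re

pages = [
    {"url": "http://example.com/a",
     "text": "ACME Corp (ticker ACME) to soar! Buy now in Chicago."},
    {"url": "http://example.com/b",
     "text": "Quiet filing by Beta Ltd (ticker BETA) in Denver."},
]

# Glossaries elicited from the domain expert, one per field.
glossaries = {
    "company": {"ACME Corp", "Beta Ltd"},
    "location": {"Chicago", "Denver"},
}
hype_phrases = {"to soar", "buy now"}  # phrases suggesting promotional hype

def extract(page):
    """Run all extractors over one page, producing a KG node."""
    node = {"url": page["url"]}
    for field, terms in glossaries.items():
        node[field] = [t for t in terms if t in page["text"]]
    m = re.search(r"ticker (\w+)", page["text"])
    node["ticker"] = m.group(1) if m else None
    node["hype"] = any(p in page["text"].lower() for p in hype_phrases)
    return node

kg = [extract(p) for p in pages]

# Structured query: which tickers are associated with hyped-up language?
suspicious = [n["ticker"] for n in kg if n["hype"]]
```

Running this tiny pipeline flags only the hyped-up ACME page, mimicking how an investigator would zero in on suspicious offerings.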
<p>As described here, the full workflow of acquiring the data, constructing and completing the KG, and then querying it seems to be a straightforward series of steps. In practice, these procedures are expensive and technically difficult (as evidenced by the fact that they have taken up a large chunk of this very book). Certainly, the domain expert from the SEC does not have the time or inclination to build such an integrated system by herself, and she would likely not have the resources to construct a team of engineers and data scientists to do it for her. In summary, there are two problems, both difficult. First, how can we combine the sets of technologies described here, and in the rest of this book, into an integrated, end-to-end system? What should such a system look like? Second, how can we make the system accessible to domain experts who do not necessarily have the technical ability to write complex scripts, or who do not even have a background in machine learning?</p>
<p>To answer these questions and facilitate research into domain-specific search, the Memex program was established by the US Defense Advanced Research Projects Agency (DARPA) in the mid-2010s. The goal of Memex was to develop software that advanced online search <span aria-label="441" id="pg_441" role="doc-pagebreak"/>by allowing nontechnical users to create their own domain-specific search engines for discovering relevant content and organizing it in ways that were more useful to their specific problems. Next, we describe the Domain-Specific Insight Graphs (DIG) architecture that was a direct output of Memex and was developed by a multiorganizational team. However, we note that this system (or the Memex program) is not the only approach for using KGs to address social problems. We cover at least one other application, crisis informatics and disaster response, which also draws on KG research, but in an alternative way.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec17-2"/><b>17.2Domain-Specific Insight Graphs</b></h2>
<p class="noindent">Funded under the DARPA Memex program, the DIG architecture was designed to allow domain experts and users the ability to construct and search KGs without any programming. The domain-specific aspect of the system arises from its ability to permit the setup, tuning, and integration of KGC for a fixed but arbitrary domain. This is in complete contrast to Google, as described later in this chapter. The insights that can be derived from the system arise from its search facilities and graphical user interface (GUI), which offers a range of customizable options.</p>
<p>A typical workflow in DIG is conveyed in <a href="chapter_17.xhtml#fig17-1" id="rfig17-1">figure 17.1</a>. First, in the <i>domain setup</i> phase, the system ingests the output of domain discovery (chapter 3), that is, a raw corpus of webpages, and presents an intuitive, multistep KGC process to the user, who usually requires an hour or less of example-based training to learn how to navigate the steps and refine the <span aria-label="442" id="pg_442" role="doc-pagebreak"/>outputs iteratively. Second, in the <i>domain exploration</i> phase, the system offers a search interface that can be used to navigate the KGs using a combination of search techniques, some of which were described in depth in part IV. Next, we briefly describe the main components of both phases, followed by a description of the applications with which it has been deployed.</p>
<div class="figure">
<figure class="IMG"><a id="fig17-1"/><img alt="" src="../images/Figure17-1.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig17-1">Figure 17.1</a>:</span> <span class="FIG">A typical workflow showing the working of the DIG system.</span></p></figcaption>
</figure>
</div>
<section epub:type="division">
<h3 class="head b-head"><a id="sec17-2-1"/><b>17.2.1Domain Setup</b></h3>
<p class="noindent">The first phase, <i>domain setup</i>, involves setting up the domain with the goal of answering a certain set of (possibly open-ended) questions that a user is interested in further exploring. This phase does not comprise a strictly linear set of steps but rather involves several interleaved steps (<a href="chapter_17.xhtml#fig17-2" id="rfig17-2">figure 17.2</a>). At a high level, the user loads a sample of the corpus to explore, followed by defining wrappers using the Inferlink tool (briefly described subsequently), customizing domain-specific fields, and adding field-specific glossaries (if desired). Many of these extraction techniques were earlier covered in part II. Periodically, the user can <i>crystallize</i> a sequence of steps by running extractions to construct the KG and uploading the KG to an index. Once the upload is complete, users can demo their efforts by clicking on the Sample DigApp button in the upper-left corner of the screen shown in <a href="chapter_17.xhtml#fig17-2">figure 17.2</a> to explore the KG using a search interface. The process is iterative: the user can always return to the dashboard to define or refine more fields, define more wrappers with Inferlink (a wrapper induction system offering a GUI), or input more glossaries.</p>
<div class="figure">
<figure class="IMG"><a id="fig17-2"/><img alt="" class="width" src="../images/Figure17-2.png"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig17-2">Figure 17.2</a>:</span> <span class="FIG">An illustration of the DIG application, involving several interleaved steps that a nontechnical subject matter (or domain) expert could take to set up their own domain-specific KG and search engine. The precise details and text on the dashboard are less important than the steps shown here and described in the text (and are being updated periodically).</span></p></figcaption>
</figure>
</div>
<p>Specifically, the different steps in setting up a domain are described next.</p>
<p><b>Loading Sample Pages.</b> The original DIG system ingested webpages that have been crawled and placed in a distributed file system using the Common Data Repository (CDR) format with an ingestion schema that was collectively decided upon by the Memex program and its participants. The schema records not just the HTML content of pages, but also metadata, such as when the page was crawled and which crawler was used. Images are separately processed and stored. CDR was defined to be a special instance of the Javascript Object Notation (JSON) format, and more recent versions of DIG can directly process JSON. At this time, direct ingestion of formats other than CDR or JSON is being actively explored, including PDFs.</p>
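As an illustration, a CDR-style record might look like the following. The field names here are invented to mirror the description above (raw HTML plus crawl metadata, with images handled separately); the actual CDR schema used in Memex differs in its details.

```python
# A hypothetical CDR-style record. Since CDR is a special instance of JSON,
# a record serializes and restores directly with the standard json module.
import json

cdr_record = {
    "doc_id": "0f3a9c",                        # unique document identifier
    "url": "http://example.com/listing/1234",
    "raw_content": "<html><body>listing ...</body></html>",
    "content_type": "text/html",
    "crawler": "focused-crawler-x",            # which crawler fetched the page
    "timestamp_crawl": "2019-03-01T12:00:00Z", # when the page was crawled
    "objects": ["img_001.jpg"],                # images processed separately
}

serialized = json.dumps(cdr_record)
restored = json.loads(serialized)
```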
<p><b>Defining Fields.</b> DIG allows users to define their own fields and to customize the fields in several ways that directly influence their use during the domain exploration phase. To define a field, users click on the Fields tab (step 2 in <a href="chapter_17.xhtml#fig17-2">figure 17.2</a>), which provides them with an overview of fields that are already defined (including predefined fields like <i>location</i>), and also allows them to add and update fields. In addition to customizing the appearance of a field by assigning it a color and icon, users can set the importance of the field for search (on a scale of 1 to 10), declare the field to represent an entity by selecting the entity rather than the text option in Show as Links (thereby supporting entity-centric search, described in the section entitled “Domain Exploration,” later in this chapter), and assign it a predefined <span aria-label="443" id="pg_443" role="doc-pagebreak"/>extractor like a glossary.</p>
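The options just described can be pictured as a small field-configuration record. The keys and validation below are illustrative assumptions, not DIG's actual internal representation.

```python
# A hypothetical field definition mirroring the customization options
# described in the text. All key names are invented for illustration.
def define_field(name, color, icon, importance, show_as_links, extractor=None):
    # Search importance is on a scale of 1 to 10.
    assert 1 <= importance <= 10, "importance must be between 1 and 10"
    # "entity" enables entity-centric search; "text" renders plain text.
    assert show_as_links in ("entity", "text")
    return {
        "name": name,
        "color": color,
        "icon": icon,
        "search_importance": importance,
        "show_as_links": show_as_links,
        "predefined_extractor": extractor,  # e.g., a glossary extractor
    }

ticker = define_field("stock_ticker", "blue", "tag", 9, "entity",
                      extractor="glossary")
```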
<span aria-label="444" id="pg_444" role="doc-pagebreak"/>
<p><span aria-label="445" id="pg_445" role="doc-pagebreak"/><b>Defining Wrappers Using the Inferlink Tool.</b> Earlier, in chapter 5, we described wrappers as a means of extracting structured elements from webpages. The Inferlink tool, developed by a private company of the same name, offers an intuitive, graphical way of doing this. For convenient formalism, let us define a top-level domain (TLD) (such as backpage.com) <i>T</i> as a set {<i>w</i><sub>1</sub><i>, <span class="ellipsis"></span>, w</i><sub><i>n</i></sub>} of webpages (e.g., backpage.com/chicago/1234). As a first step, Inferlink uses an <i>unsupervised template clustering algorithm</i> to partition <i>T</i> into clusters, such that webpages in each cluster are structurally and contextually similar to each other. <a href="chapter_17.xhtml#fig17-3" id="rfig17-3">Figure 17.3</a> provides some real-world intuition for the TLD backpage.com (e.g., Cluster 1 contains city-specific webpages containing categories of advertisements such as appliances and roommates; and Cluster 2 contains webpages describing city-specific appliance listings). The examples in <a href="chapter_17.xhtml#fig17-3">figure 17.3</a> show that both structure and context are important. For example, while Clusters 2 and 3 are structurally similar, they are contextually different and thereby get separated into two clusters (rather than one).</p>
<div class="figure">
<figure class="IMG"><a id="fig17-3"/><img alt="" src="../images/Figure17-3.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig17-3">Figure 17.3</a>:</span> <span class="FIG">Examples of three clusters, each containing structurally and contextually similar webpages.</span></p></figcaption>
</figure>
</div>
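The clustering idea can be sketched as follows: a page joins a cluster only if it matches both structurally (here, an identical tag sequence) and contextually (here, sufficient word overlap). This toy version uses regex-based parsing and Jaccard similarity purely for illustration; Inferlink's actual algorithm is more sophisticated, and a real implementation would use a proper HTML parser.

```python
# Toy template clustering: group pages that are both structurally and
# contextually similar, as figure 17.3 illustrates. All data is invented.
import re

def tag_signature(html):
    """Crude structural fingerprint: the sequence of opening tag names."""
    return tuple(re.findall(r"<(\w+)", html))

def word_set(html):
    """Crude contextual fingerprint: the set of words outside tags."""
    return set(re.sub(r"<[^>]+>", " ", html).lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_pages(pages, sim_threshold=0.3):
    clusters = []  # each: {"sig": ..., "words": ..., "members": [...]}
    for url, html in pages:
        sig, words = tag_signature(html), word_set(html)
        for c in clusters:
            # Require identical structure AND enough contextual overlap.
            if c["sig"] == sig and jaccard(c["words"], words) >= sim_threshold:
                c["members"].append(url)
                c["words"] |= words
                break
        else:
            clusters.append({"sig": sig, "words": words, "members": [url]})
    return [c["members"] for c in clusters]

pages = [
    ("p1", "<div><h1>appliances chicago</h1></div>"),
    ("p2", "<div><h1>appliances denver</h1></div>"),
    ("p3", "<div><h1>roommates listings housing</h1></div>"),
    ("p4", "<span>different page template</span>"),
]
clusters = cluster_pages(pages)
```

Note how p3 is structurally identical to p1 and p2 yet lands in its own cluster because its vocabulary does not overlap, mirroring the separation of Clusters 2 and 3 in figure 17.3.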
<p>Extraction setup proceeds as follows. Once they have chosen a relevant cluster, users see an illustration like the one in <a href="chapter_17.xhtml#fig17-4" id="rfig17-4">figure 17.4</a>, wherein Inferlink extracts common structural elements from the webpages in the cluster and presents them to the users in a column layout, with one row per webpage and one column per structured element that is common to <span aria-label="446" id="pg_446" role="doc-pagebreak"/>the webpages. Users can open a webpage by clicking on a link (the leftmost column in the screen shown in <a href="chapter_17.xhtml#fig17-4">figure 17.4</a>), delete a column, or assign it to a field that has already been defined. This is precisely the semantic typing step that was also discussed earlier in this book (chapter 4) in the context of Named Entity Recognition (NER); that is, it is not enough to extract just “Obama” in the sentence “Obama championed the Affordable Care Act”; a good NER system must extract “Obama” as “Politician” or “Person,” depending on what kinds of concepts have been defined in the underlying ontology. Users must separately type columns for each TLD cluster because different TLDs (and more generally, different clusters in the same TLDs) share different structures in the webpages they contain. However, users are guaranteed high precision in their results, and after some initial training, they do not have to know anything about wrappers or how they work in order to operate the Inferlink tool.</p>
<div class="figure">
<figure class="IMG"><a id="fig17-4"/><img alt="" src="../images/Figure17-4.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig17-4">Figure 17.4</a>:</span> <span class="FIG">An illustration (from the <i>Securities Fraud</i> domain) of the semantic typing facility in Inferlink. To create this screen, semantic typing, in terms of the defined set of fields, has already been done. For example, the second column has been typed with “age” semantics and other elements.</span></p></figcaption>
</figure>
</div>
<p><b>Defining Other Information Extraction Methods.</b> In addition to the Inferlink tool, DIG offers implementations of methods suitable for IE from blocks of text or other content that are not delimited as structured HTML elements (i.e., delimited using HTML tags). To ease user effort, some generic extractors are pretrained and cannot be customized, but they can be disabled. A good example is the location extractor, which extracts the names of cities, states, and countries using a machine learning model that was trained offline. DIG also offers users the option to input a glossary (with accompanying options, such as whether the glossary terms should be interpreted case-sensitively) for a given field. This option was popular with domain experts who evaluated the system. The Glossaries tab is accessed in a similar way as Fields. The latest version of DIG also offers users an intuitive rule editor for expressing and testing simple natural-language rules or templates with extraction placeholders. For example, a name extractor can be set up with a pattern recognition rule like “Hi, my name is [NAME].”</p>
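A rule like “Hi, my name is [NAME]” can be understood as a template that compiles into a regular expression with a named capture group. The sketch below assumes a simple placeholder syntax of our own invention; DIG's actual rule editor may differ.

```python
# Compile a natural-language template with [FIELD] placeholders into a
# regex with named groups. The placeholder syntax is an invented example.
import re

def compile_rule(rule):
    # Escape the literal text, then turn each [FIELD] placeholder
    # (now escaped as \[FIELD\]) into a named capture group.
    pattern = re.escape(rule)
    pattern = re.sub(r"\\\[(\w+)\\\]", r"(?P<\1>\\w+)", pattern)
    return re.compile(pattern, re.IGNORECASE)

def apply_rule(rule, text):
    m = rule.search(text)
    return m.groupdict() if m else {}

name_rule = compile_rule("Hi, my name is [NAME]")
extracted = apply_rule(name_rule, "hi, my name is Mary and I am visiting")
```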
<p><b>Indexing the Constructed Knowledge Graph.</b> In step 3 in <a href="chapter_17.xhtml#fig17-2">figure 17.2</a>, the user executes the defined IEs, including glossaries, Inferlink, and predefined extractors that have not been disabled, in order to construct a KG over a subset of TLDs and a sample of webpages from each TLD. While users are allowed to specify the sample size, the smaller the sample, the less representative the constructed KG, but the faster its execution. Hence, there is a trade-off. A status bar shows the user the progress of KGC. The constructed KG is indexed and stored in a NoSQL document store like Elasticsearch. In the domain exploration phase, this stored and indexed KG can be navigated in a variety of ways.</p>
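To illustrate the indexing step without assuming a live server, the sketch below builds the newline-delimited payload that Elasticsearch's bulk API expects, from a handful of toy KG nodes. The index name and node fields are invented for illustration.

```python
# Prepare constructed KG nodes for bulk indexing in a document store.
# Elasticsearch's bulk API takes alternating action and document lines,
# each a JSON object, separated by newlines.
import json

kg_nodes = [
    {"doc_id": "a1", "title": "ACME stock to soar", "location": ["Chicago"]},
    {"doc_id": "b2", "title": "Beta Ltd quarterly filing", "location": ["Denver"]},
]

def bulk_payload(nodes, index_name="dig-kg"):
    lines = []
    for node in nodes:
        # Action line: index this document under its own ID.
        lines.append(json.dumps({"index": {"_index": index_name,
                                           "_id": node["doc_id"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(node))
    return "\n".join(lines) + "\n"

payload = bulk_payload(kg_nodes)
```

In a deployment, this payload would be POSTed to the store's bulk endpoint; here it simply shows the structure of what gets indexed.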
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec17-2-2"/><b>17.2.2Domain Exploration</b></h3>
<p class="noindent">The domain exploration phase is the in-use phase, when the system is actually being used to satisfy information retrieval (IR) needs. Once the user has iterated enough times to be satisfied with the results on the sampled pages, she would initiate domain exploration by executing the KGC pipeline on the full corpus of webpages. The time until completion depends on the size of the corpus, as well as the computing power available.</p>
<p><span aria-label="447" id="pg_447" role="doc-pagebreak"/>DIG offers users a range of search capabilities, including basic keyword search, structured search, and a novel search capability called <i>dossier generation</i>. We enumerate some of the options here, using the screenshot in <a href="chapter_17.xhtml#fig17-5" id="rfig17-5">figure 17.5</a> for illustration.</p>
<div class="figure">
<figure class="IMG"><a id="fig17-5"/><img alt="" src="../images/Figure17-5.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig17-5">Figure 17.5</a>:</span> <span class="FIG">An illustration of the search capabilities offered by the DIG system, with HT investigations as the use-case. Critical details have been blurred or obfuscated due to the illicit nature of some of the material.</span></p></figcaption>
</figure>
</div>
<ul class="numbered">
<li class="NL">1.<i>Search using structure:</i> The user begins the search by filling out values in a form containing fields that she has declared (during domain setup) as being searchable (<a href="chapter_17.xhtml#fig17-6" id="rfig17-6">figure 17.6</a>). The search engine in DIG, which is based on Elasticsearch in the back end, uses many of the techniques covered earlier, in part IV, to ensure that user intent is captured in a robust manner with good performance on the recall metric. In fields that have text semantics, like Description, keywords can be entered; these fields are like the search bar in engines like Google. While keyword search is designed to be primarily exploratory, structured search allows a user to quickly hone in on pages containing certain key details that the user has specified in the form. DIG uses ranking and relevance scoring techniques first developed in the IR community—namely, satisfying more criteria on a form will lead to a page having a higher ranking than another page that satisfies fewer criteria. Users can also make a search criterion strict by clicking on the star next to the field. This implies that the search is now a constraint: for a KG node, if the value is not found for that field, the node will be assigned a zero score and will never get retrieved or ranked in the interface.<span aria-label="448" id="pg_448" role="doc-pagebreak"/></li>
</ul>
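The ranking behavior described in this item can be captured in a toy scoring function: every satisfied criterion increases a node's score, while an unmet strict (starred) criterion zeroes it so the node is never retrieved. The field names and data below are illustrative only, not DIG's actual scoring logic.

```python
# Toy structured-search scoring: more satisfied criteria means a higher
# rank; a missing value for a strict field zeroes the score entirely.
def score_node(node, criteria, strict_fields=()):
    score = 0
    for field, wanted in criteria.items():
        values = node.get(field) or []
        if wanted in values:
            score += 1
        elif field in strict_fields:
            return 0  # strict criterion unmet: never retrieved or ranked
    return score

nodes = [
    {"id": 1, "location": ["Chicago"], "company": ["ACME Corp"]},
    {"id": 2, "location": ["Chicago"]},
]
criteria = {"location": "Chicago", "company": "ACME Corp"}
```

With these criteria, node 1 (two matches) outranks node 2 (one match), and starring the company field drops node 2 from the results altogether.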
<div class="figure">
<figure class="IMG"><a id="fig17-6"/><img alt="" src="../images/Figure17-6.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig17-6">Figure 17.6</a>:</span> <span class="FIG">A structured search form in the DIG system. This form can be used for an ordinary <i>product</i> domain, but in this case, it was built for a prototype used to investigate the <i>counterfeit electronics</i> domain.</span></p></figcaption>
</figure>
</div>
<ul class="numbered">
<li class="NL">2.<span aria-label="449" id="pg_449" role="doc-pagebreak"/><i>Facets and filtering:</i> As shown in <a href="chapter_17.xhtml#fig17-7" id="rfig17-7">figure 17.7</a>, DIG supports faceted search and filtering on select fields (e.g., Model) that the user can specify during domain setup. In all of the investigate case studies in which DIG was evaluated, users made fairly intuitive choices: except for free-form fields like text, descriptions, or comments, they favored faceting over nonfaceting. In addition to allowing more informed search and filtering, facets help the user to see an overview of the search results. For example, in the screen, one can deduce that (among the <i>Model</i> extractions) <i>glock 19</i> and <i>glock 17</i> occurs far more often in the data than other models when searching for weapon models.</li>
</ul>
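Computing such facets amounts to counting extracted values per field across the result set, as in this small sketch (with invented data):

```python
# Minimal facet computation over extraction results, of the kind used to
# render the sidebar in figure 17.7. The result records are invented.
from collections import Counter

def facet_counts(results, field):
    counts = Counter()
    for r in results:
        counts.update(r.get(field, []))
    return counts.most_common()  # (value, count) pairs, most frequent first

results = [
    {"model": ["glock 19"]},
    {"model": ["glock 19", "glock 17"]},
    {"model": ["glock 17"]},
    {"model": ["glock 19"]},
]
top_models = facet_counts(results, "model")
```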
<div class="figure">
<figure class="IMG"><a id="fig17-7"/><img alt="" src="../images/Figure17-7.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig17-7">Figure 17.7</a>:</span> <span class="FIG">Facets and filtering in the DIG system (see the gray sidebar). By “checking” off certain boxes or adding more search terms (top of the sidebar), a user can try to make the search more precise.</span></p></figcaption>
</figure>
</div>
<ul class="numbered">
<li class="NL">3.<i>Dossier generation (entity-centric search):</i> Dossier generation and summarization are important distinctions separating the domain exploration facilities of KG-centric systems like DIG from more generic Google search. An example is illustrated in <a href="chapter_17.xhtml#fig17-8" id="rfig17-8">figure 17.8</a> for the entity <i>glock 26</i>, which is presumably a model that an investigator in the Illegal Weapons Sales domain is interested in investigating further for suspicious activity. The entity dashboard summarizes the information about this particular entity by providing a (1) timeline of occurrences, (2) locations extracted from webpages from which the entity was extracted, (3) other entities cooccurring with the entity (along with <i>non-cooccurrence</i> information to enable intuitive significance comparisons), and (4) relevant pages related to that entity.</li>
</ul>
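Dossier generation can be sketched as a per-entity aggregation over extraction records: collect the pages mentioning the entity, its timeline and locations, and its co-occurring entities. The record layout below is an invented simplification of the dashboard shown in figure 17.8.

```python
# Toy dossier generation: aggregate everything known about one entity
# from per-page extraction records. All records are invented.
from collections import Counter

extractions = [
    {"page": "p1", "date": "2019-01-03",
     "entities": ["glock 26", "glock 17"], "locations": ["Houston"]},
    {"page": "p2", "date": "2019-01-09",
     "entities": ["glock 26"], "locations": ["Dallas"]},
    {"page": "p3", "date": "2019-02-01",
     "entities": ["glock 17"], "locations": ["Austin"]},
]

def dossier(entity, records):
    hits = [r for r in records if entity in r["entities"]]
    cooccur = Counter(e for r in hits for e in r["entities"] if e != entity)
    return {
        "entity": entity,
        "pages": [r["page"] for r in hits],                      # relevant pages
        "timeline": sorted(r["date"] for r in hits),             # occurrences
        "locations": sorted({l for r in hits for l in r["locations"]}),
        "cooccurring": cooccur.most_common(),                    # other entities
    }

d = dossier("glock 26", extractions)
```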
<div class="figure">
<figure class="IMG"><a id="fig17-8"/><img alt="" src="../images/Figure17-8.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig17-8">Figure 17.8</a>:</span> <span class="FIG">Entity-centric search (also called <i>dossier generation</i>) in the DIG system from the <i>illegal firearms sales</i> domain. In this case, an investigator could look up all information extracted for the entity “glock 26.”</span></p></figcaption>
</figure>
</div>
<ul class="numbered">
<li class="NL">4.<span aria-label="450" id="pg_450" role="doc-pagebreak"/><i>Provenance:</i> DIG supports provenance both at the coarse-grained level of webpages and the fine-grained level of extractions. Concerning the latter, provenance information is obtained by clicking on the square next to an extraction, which brings up the specific extraction method, such as Inferlink, SpaCy, or glossary (<a href="chapter_17.xhtml#fig17-9" id="rfig17-9">figure 17.9</a>), and in the case of context-based extractors that use natural-language techniques, the text surrounding the extraction. Multiple provenances are illustrated if applicable, as shown in the figure (e.g., glossary extractions from text that originated in different structures in a webpage). We also support webpage-level provenance by allowing the user to open the cached webpage in a new tab. It is important to show the cached, rather than the live, webpage (which can also be shown by clicking on the predefined URL extraction that exists for every webpage in the corpus) because the webpage may have changed, or even been removed, since domain discovery. Cached pages in the HT version of the system were recently admitted as evidence in court (see the section entitled “Bibliographic Notes,” at the end of the chapter).</li>
</ul>
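Fine-grained provenance can be pictured as a list of records attached to each extraction, one per method and source combination, echoing the <i>content_strict</i> and <i>content_relaxed</i> sources discussed in the caption of figure 17.9. The record layout itself is an invented simplification.

```python
# A sketch of fine-grained provenance: each extraction carries the methods
# that produced it, the sources they ran on, and surrounding context.
# The layout is invented; only the source names echo the text above.
extraction = {
    "value": "square enix",
    "field": "game",
    "doc_id": "0f3a9c",
    "provenances": [
        {"method": "glossary", "source": "content_strict",
         "context": "...new title from square enix announced..."},
        {"method": "glossary", "source": "content_relaxed", "context": None},
        {"method": "glossary", "source": "raw_html", "context": None},
    ],
}

def methods_used(extr):
    """Distinct extraction algorithms that produced this value."""
    return sorted({p["method"] for p in extr["provenances"]})

def sources_used(extr):
    """Distinct sources (raw HTML vs. scraped text) the methods ran on."""
    return sorted({p["source"] for p in extr["provenances"]})
```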
<div class="figure">
<figure class="IMG"><a id="fig17-9"/><img alt="" src="../images/Figure17-9.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig17-9">Figure 17.9</a>:</span> <span class="FIG">Provenance in the DIG system for an extracted entity “square enix,” which is a game. In addition to the document ID, the provenance shows that a single extraction algorithm, based on dictionaries (or glossaries), was used. For the third method, the source (on which the algorithm was executed) is the raw HTML, while for the other two, it is the scraped text. Here, <i>content_relaxed</i> means that the text was extracted in a recall-friendly way from the HTML (meaning that some irrelevant elements, such as ad and code snippets embedded in the HTML, were extracted in addition to the relevant content), while <i>content_strict</i> implies a more precision-friendly approach.</span></p></figcaption>
</figure>
</div>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec17-3"/><b>17.3Alternative System: DeepDive</b></h2>
<p class="noindent">DeepDive is an alternative system for domain-specific KGC that was first presented in the early 2010s and has since expanded. The original DeepDive system was based on the classic entity-relationship model and employed techniques like distant supervision and Statistical Relational Learning (SRL) to combine various signals. In this sense, it was <span aria-label="451" id="pg_451" role="doc-pagebreak"/>similar to other KGC architectures like YAGO and EntityCube. However, DeepDive went further in also offering deep Natural Language Processing (NLP) techniques to extract linguistic features, including dependency paths, from large quantities (up to terabytes) of text, as well as to perform web-scale statistical learning and inference. DeepDive operates by first converting diverse input data, such as ontologies and raw corpora, to relational features using both standard NLP tools and custom code. Next, these features are used to train statistical models capturing the relationships between linguistic patterns and target relations. By combining these models with domain knowledge (using statistical relational models like Markov Logic, as covered in chapter 9), DeepDive yields a domain-specific KG as its final output.</p>
<p>In terms of implementation, DeepDive used a declarative language that was similar to SQL, while inheriting Markov Logic Networks (MLNs) formal semantics. We covered MLNs in the context of SRL (chapter 9). As with the Knowledge Vault that was covered in chapter 15, DeepDive produces marginal probabilities that are calibrated—that is, if we examined all facts output with a probability of 0.8 by DeepDive, it means that (in expectation) 80 percent of these facts would be correct. For more details on DeepDive, we point interested readers to references in the “Bibliographic Notes” section.</p>
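The calibration property can be checked empirically by binning a system's output probabilities and comparing each bin's mean confidence with the fraction of facts in that bin that are actually correct. The (probability, correctness) pairs below are invented purely to illustrate the check.

```python
# Empirical calibration check: for a calibrated system, facts emitted
# with probability p should be correct about a fraction p of the time.
def calibration_bins(predictions, n_bins=5):
    """predictions: (probability, correct) pairs, correct in {0, 1}."""
    bins = [[] for _ in range(n_bins)]
    for prob, correct in predictions:
        idx = min(int(prob * n_bins), n_bins - 1)  # which bin this prob falls in
        bins[idx].append((prob, correct))
    report = []
    for b in bins:
        if b:
            mean_conf = sum(p for p, _ in b) / len(b)
            accuracy = sum(c for _, c in b) / len(b)
            report.append((round(mean_conf, 2), round(accuracy, 2)))
    return report

# Invented outputs: the 0.8 bin is calibrated (4 of 5 correct), the 0.2
# bin is not (1 of 3 correct, well above its stated confidence).
preds = [(0.8, 1), (0.8, 1), (0.8, 1), (0.8, 1), (0.8, 0),
         (0.2, 0), (0.2, 0), (0.2, 1)]
report = calibration_bins(preds)
```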
<p>In comparing DIG and DeepDive, we note that while DeepDive also offers an impressive and unified array of KGC tools, it was not designed for the nonprogrammer, or for <span aria-label="452" id="pg_452" role="doc-pagebreak"/>facilitating complex, entity-centric search of the kind that DIG has made possible, in both theory and practice. Furthermore, DeepDive is much more computationally intensive than DIG due to its heavy reliance on MLNs. According to Niu et al. (2012a), inference and learning in DeepDive can take hours, even on 1 TB RAM/48-core machines. While memory and computation requirements of both systems clearly depend on the size of the data, DIG does offer some simple facilities that can make it amenable to relatively low-power or low-memory settings (such as a laptop or desktop machine). Whether DIG or DeepDive is right for an application depends on the user, her technical prowess, and the desired methodology<sup><a href="chapter_17.xhtml#fn1x17" id="fn1x17-bk">1</a></sup> for getting to the final KG.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec17-4"/><b>17.4Applications and Use-Cases</b></h2>
<p class="noindent">To draw the connection between KG capabilities and applications with social impact, we detail two broad application domains (namely, investigative domains, which include complex subdomains that are fields of study and application in their own right, including HT and securities fraud, and crisis informatics). Each of these illustrates, in its own specific way, the challenges involved in working with complex, real-world domains with nontechnical stakeholders and unusual desiderata compared to relatively normal (and significant) domains like e-commerce, as well as the creative ways in which KGs can help address these challenges.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec17-4-1"/><b>17.4.1Investigative Domains</b></h3>
<p class="noindent">KGC over investigative domains is part of a broader movement that popularly goes by the phrase “AI for social good.” As artificial intelligence (AI) systems have proliferated in recent years, there has been increasing concern over several social and technological issues. Entire conferences and journals are now dedicated to this important issue. In part, the problem is social, because many of the most sophisticated AI tools either are developed in academic labs and not transitioned at all (beyond the lab) due to lack of software maturity or testing, or they are developed in industry and are largely proprietary. On the other hand, many of the agencies that can benefit most from AI do not have the resources to do research on this area, or to acquire the technology via licenses or other expensive means. Most do not have the resources to even facilitate transition (e.g., by retraining their employees or dedicating infrastructure to hosting the systems). Thus, it is all the more important to develop tools that nontechnical users can apply to their use-cases, so that the barrier for entry is not so high.</p>
<p><span aria-label="453" id="pg_453" role="doc-pagebreak"/>We now describe a sample of investigative domains that have been investigated, at least in a prototypical state, using the DIG system, as well as other technologies that have emerged over the course of the Memex program, at institutions and companies across the United States, including the Jet Propulsion Laboratory at the National Aeronautics and Space Administration (NASA) and New York University<sup><a href="chapter_17.xhtml#fn2x17" id="fn2x17-bk">2</a></sup>). Of these, the domain on which DIG was evaluated most extensively is HT. Thus, we pay special attention to this domain, which we cover last in this discussion.</p>
<p class="TNI-H3"><b>17.4.1.1Securities Fraud</b>Securities fraud, particularly penny stock fraud is a complex domain that falls under the direct authority of the SEC in the United States. Penny stock fraud is unusual because much of the activity that accompanies fraudulent behavior, including hype and promotional activity, is legally permitted. Many of the actual actors involved may not be physically present in the US, but for regulatory reasons, shell companies fronting such activity for promotional and legal purposes have to be registered in the US to trade stocks legitimately in OTC exchanges. In addition to the longer-term goal of investigating and gathering information on such shell companies and the people involved in them, investigators are interested in taking preventive activity. This can happen when a penny stock company is caught actively engaging in factually fraudulent hype (e.g., a false claim that a contract was just signed with a well-known customer firm), in which case trading can be halted, or even shut down. This is also why the step is <i>preventive</i>; trading is shut down before unwary investors buy in and subsequently end up losing their savings. The DIG system supports these goals by allowing users to aggregate information (in the crawled corpus, which contains many web domains) about suspicious penny stocks using the entity-centric search facilities, and to zero in on burgeoning promotional activity.</p>
<p class="TNI-H3"><b>17.4.1.2Illegal Firearms Sales</b>In the US, firearms sales are regulated in the sense that transactions cannot be conducted with arbitrary persons, or over arbitrary channels like the Internet. Investigators in the illegal firearms sales domain are interested in pinpointing activity that, either directly or indirectly, provides evidence for illicit sales that leave some digital trace. The domain is similar to the securities fraud domain (and dissimilar to the counterfeit electronics domain, described next), for the important reason that investigators limit their focus to domestic activities.</p>
<p class="TNI-H3"><b>17.4.1.3Counterfeit Electronics</b>Despite what the name suggests, investigators in the counterfeit electronics domain are interested, not in consumer electronics, but in microchips and Field-Programmable Gate Arrays (FPGAs), which form the computational backbones of more complex application devices. The latter may resemble FPGAs from <span aria-label="454" id="pg_454" role="doc-pagebreak"/>genuine contractors, but they are fakes and may have malicious modifications at the hardware level. Certain countries, companies, and devices are more relevant to this kind of activity than others. We note that there is an obvious national security component to these investigations, and just as with the other domains described here, domain expertise plays a crucial role both in setting up the domain and in the knowledge discovery itself.</p>
<p class="TNI-H3"><b>17.4.1.4Human Trafficking</b>Data from various authoritative sources, including the National Human Trafficking Resource Center, show that HT is not only on the rise in the United States, but is a problem of international proportions. The advent of the web has made the problem worse. HT victims are advertised both on the open and the Dark Web, with estimates of the number of (not necessarily unique) published advertisements numbering in the tens, if not hundreds, of millions. In recent years, various agencies in the US have turned to technology to assist them in combating this problem through the suggestion of leads, evidence, and HT indicators. Entities are typically sex advertisers, such as <i>escorts</i>, but could also be latent entities such as <i>vendors</i> (sex rings, sometimes posing under the guise of spas and massage parlors), who organize the activity.</p>
<p>DIG, along with another search system called Tellfinder, has been in active use by law enforcement agencies to track potential HT activity. It was a direct output of the Memex program and was eventually transitioned to the office of the District Attorney for New York, along with other offices. At this time, it has contributed to the convictions of at least three traffickers, including a 97-years-to-life sentence for a trafficker in the San Francisco area. Tellfinder has had similar success. These cases indicate that the use of KGs and entity-centric technology can have real and lasting benefits for society, especially in terms of fighting difficult problems like HT.</p>
<p class="TNI-H3"><b>17.4.1.5Why Not Google?</b>As an aside, an important question that can, and does arise, when demonstrating such KG systems to nontechnical investigative users is “Why not just use Google for satisfying user intent?” In other words, why are investigative domains difficult for search engines like Google to handle? This is an important question because some of the challenges that emerge are, by no means, unique to investigative domains. We list some of these challenges next with the caveat that not every challenge applies to every investigative domain, and there are many noninvestigative domains to which these challenges also apply. We use the HT domain to illustrate the challenges more effectively, as all of these challenges have been observed in the HT domain. On occasion, we also invoke one of the other investigative domains, or even noninvestigative ones, where appropriate, as examples.</p>
<p><b>Nontraditional Domain.</b> HT, and several domains like it, are largely characterized by illicit, organized activity, and they have not been as extensively researched as traditional domains (e.g., enterprise) are. Directly adapting existing techniques from these domains is problematic, along with using external knowledge bases (KBs) like Wikipedia, because the <span aria-label="455" id="pg_455" role="doc-pagebreak"/>entities of interest (escorts and HT victims) are not described in such KBs. For example, we could not directly use tools like stemmers and tokenizers from standard NLP packages like Natural Language Toolkit (NLTK) because the corpus contains many nondictionary words and employs advanced obfuscation techniques. The problem is made much worse by the long-tail nature of both the HT domain and other domains, as one cannot tune an algorithm for webpages from a small number of root Uniform Resource Locators (URLs) such as backpage.com.</p>
<p><b>Scale and Irrelevance.</b> The scale of the task (millions of webpages) and the size of the corpus preclude many KG systems from using serial algorithms that have a high memory imprint and long running times. Many of the most expensive tasks have to be run on scalable infrastructures like Apache Spark to be viable; furthermore, because an annotated ground-truth is often not available, and the corpora crawled by domain discovery tools tend to have many irrelevant webpages, core algorithms often have to be executed several times. Scale and irrelevance both proved to be key engineering challenges to be overcome when building scalable KG-powered systems that offer domain-specific benefits beyond Google.</p>
<p><b>Missing Values and Noise.</b> In many cases, each page is typically missing information (e.g., hair color) that investigators would like to extract and use in their queries. However, it is typically unknown a priori which pages and web domains were missing values for which attributes. In many of these systems, it was often the case that extractors would get confused and extract noisy values for attributes that were either missing or well obfuscated. These observations strongly motivated the design of both extraction and query execution technology specifically for investigative search engines. Some of these techniques, including query reformulation, were covered earlier in this book in part IV; we also provide other pointers in the “Bibliographic Notes” section.</p>
<p><b>Information Obfuscation.</b> A recurring challenge in domains like HT is <i>information obfuscation</i>, which includes obscure language models, excessive use of punctuation marks and special characters, presence of extraneous, hard-to-filter data (e.g., advertisements) in web pages, irrelevant pages, lack of representative examples for supervised extractors, data skew, and heterogeneity. Many pages exhibit more than one problem; for instance, a sampling of some pages in the Memex corpus used to evaluate systems on HT queries revealed that obfuscation was the norm rather than the exception. A concrete obfuscation example is an individual stating that her phone number<sup><a href="chapter_17.xhtml#fn3x17" id="fn3x17-bk">3</a></sup> (+1-111-453-0004) was “1**11**4-5-3*-**-*oh-oh-oh***4.” A successful system would not only recover the original number from this text, but also infer (based on other information on the page) that it is a number from the United States (+1). While some limited work has been done in dealing with obfuscation, the techniques that are required to address human-centric <i>semantic</i> obfuscation of the kind <span aria-label="456" id="pg_456" role="doc-pagebreak"/>that predominates in HT are domain-specific, and they keep evolving as traffickers adjust their methods. Generic search engines are not optimized for such domain-specific shifts.</p>
<p><b>Complex Query Types.</b> In the HT domain, investigators are interested in several kinds of queries. The simplest queries, in principle, are factoid queries (called <i>point fact</i> queries) that can be handled by key-value data stores like Elasticsearch assuming robust extractions, indexing, and similarity computations. More complex aggregation and cluster queries (e.g., “find and list all escorts by age and ethnicity in Seattle on Christmas Eve”) over noisy data are far less straightforward. Even point fact queries turn out to be difficult when considering both the noise and the variability in the KGC.</p>
<p><b>Preclusion of Live Web Search.</b> Despite all these problems, we could still very well question if there is not <i>some</i> way to pose the question as a Google query, at least to solve the initial problem of locating relevant web pages, followed by online (i.e., real time) execution of extraction technology. There are two problems with such a thesis, even assuming the efficiency of online extractions. First, investigators are often interested, not just in what escort ads are <i>presently</i> on the web, but also in escort ads published in the past (and which may have been taken down subsequently). Many ads are published only for a few days. Building cases against trafficking requires establishing a pattern of behavior over time, so it is important to retain pages that may not be available at the moment of search.</p>
<p>In summary, such searches are especially vital for the purposes of evidence gathering, which is directly relevant to the motivation of using the system for evidence-based social good. The second problem is that it is not obvious how keyword-based search engines can solve complex query types such as clustering and aggregation in a purely online fashion, using only a keyword index. Finally, traditional web search principles, which rely on hyperlinks for the robust functioning of ranking algorithms like PageRank, do not hold in the HT domain, where relevant hyperlinks in the HTML of an escort ad tend to be sparse. Most links are inserted by publishers to promote other content and cause traditional search crawlers to behave undesirably.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec17-4-2"/><b>17.4.2Crisis Informatics</b></h3>
<p class="noindent">The UN Office for the Coordination of Human Affairs (OCHA) reported that in 2016, more than a 100 million people were affected by natural disasters alone, while over 60 million people were forcibly displaced by violence and conflict. Since that time, the numbers have only gotten worse. Certain programs, again funded by agencies like DARPA, have attempted to use technology to address these problems where their impact is most calamitous (e.g., developing or remote regions of the planet). For example, the DARPA LORELEI (Low-Resource Languages for Emergent Incidents) program was established with the explicit agenda of providing situational awareness for emergent incidents, under the assumption that the emergent incident occurs in a region of the world where the predominant language is <i>computationally</i> low resource, by which we mean languages for which few automated or NLP tools actually exist. An example is Uyghur, a Turkic <span aria-label="457" id="pg_457" role="doc-pagebreak"/>language spoken by about 1025 million people in western China, but which has far fewer open-source NLP capabilities or research tools available in the public domain compared to major Western languages like English. The technologies resulting from LORELEI research, some of which have been demonstrated and are already undergoing transitioning, will be capable of supporting situational awareness based on low-resource foreign language sources within an extremely short time frame (about 24 hours after a new language requirement emerges).</p>
<p>What kinds of situational awareness would apply? The very basics, especially for text data, would be NLP tasks like machine translation (if the language is not in English), as well as inference and KGC on the translated text, including tasks such as NER and sentiment analysis. Additionally, once the KG is constructed, viable solutions to problems such as entity resolution and provenance reasoning are required to support detailed analytics that are ultimately visualized on a GUI. A major differentiator between crisis informatics systems and some of the other tools discussed earlier in this chapter is that the former must often work with sparse or otherwise context-poor data, of which short text social media (e.g., Twitter) is a good example. This entails the adoption of different kinds of IE (recall the different kinds of IE discussed in part II, including chapter 7, an entire chapter on social media and Open IE). A more advanced kind of situational awareness would rely on clustering and other kinds of semantic aggregation on a collection of (possibly streaming) messages, as many people will be discussing the same situation (e.g., an urgent need that has arisen on a particular street affected by a flood). Even the simply stated problems of binary classification and inference in the crisis informatics domain can get complicated. For example, one kind of inference that is useful to first responders is the level of <i>urgency</i> at both the message and the situation (cluster) levels. How to design good urgency detection algorithms for arbitrary disasters, or with only a few labels available, continues to be a dilemma for standard machine learning, which requires a reasonable number (or in the case of deep learning, a large number) of labels before the algorithms deliver a reasonable performance on test data.</p>
<p>A KG-centric system that is capable of providing some of the situational awareness mentioned above is THOR, whose name stands for “Text-enabled Humanitarian Operations in Real-time,” which was developed under the aforementioned LORELEI program. The architecture of THOR is illustrated in <a href="chapter_17.xhtml#fig17-10" id="rfig17-10">figure 17.10</a>, and a dashboard showing what it looks like in practice when executed in a real scenario or for a real data set shown in <a href="chapter_17.xhtml#fig17-11" id="rfig17-11">figure 17.11</a>. The input to THOR is a streaming corpus of raw documents and a set of NLP modules collectively denoted as the Language Technology Development Environment (LTDE), which includes state-of-the-art implementations for NLP services such as machine translation and NER. The LTDE outputs are used to structure each raw document in the stream into a situation frame, serialized as a semistructured JSON document that contains a combination of <span aria-label="458" id="pg_458" role="doc-pagebreak"/>unstructured and structured data and metadata. These JSON documents are then visualized in an interactive GUI to support both search and analytics.</p>
<div class="figure">
<figure class="IMG"><a id="fig17-10"/><img alt="" src="../images/Figure17-10.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig17-10">Figure 17.10</a>:</span> <span class="FIG">The THOR system developed for providing situational awareness to responders in the aftermath of natural disasters and other crises.</span></p></figcaption>
</figure>
</div>
<div class="figure">
<figure class="IMG"><a id="fig17-11"/><img alt="" src="../images/Figure17-11.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig17-11">Figure 17.11</a>:</span> <span class="FIG">The THOR dashboard over a real data set collected from Twitter in the aftermath of a Nepal earthquake, which devastated the region in 2015. In general, THOR is capable of working over myriad kinds of data and disasters, especially in those regions of the world where the native language is not English.</span></p></figcaption>
</figure>
</div>
<p>Although THOR is very different from KG construction systems like DIG and DeepDive, there are some similarities. Just like the other systems described earlier, for example, THOR is highly modular, allowing plug-and-play architecture that allows it to be customized for arbitrary user bases and disasters, as well as a combination of algorithmic and visualization utilities. For example, while the current version of THOR does not support route mapping or hotspot detection, these could potentially be added as layers on top of the default implementation if so desired by the organization using THOR. Even the visualization facilities in THOR are modular and tile-based. Among the analytics that are rendered in the GUI, the most important tile is on situation frames, which classify each JSON document as expressing one or more needs in preset categories such as food or water. By using methods like spatiotemporal rendering, these situations are ultimately designed to help field analysts decide where, when, and how to allocate resources to meet current situational needs.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec17-4-3"/><b>17.4.3COVID-19 and Medical Informatics</b></h3>
<p class="noindent">KGs and, more broadly, AI and informatics have been playing a prominent technological role in addressing the ongoing COVID-19 pandemic. A wide variety of data sets and <span aria-label="459" id="pg_459" role="doc-pagebreak"/>resources that have been made available, often for free, in response to the crisis have facilitated the rapid application of these advanced technologies. A particularly valuable resource has been the so-called CORD-19 corpus, which is a research data set comprising of over 57,000 scholarly articles (including over 45,000 with full text) about COVID-19 and other related coronaviruses. It was prepared and released by the White House and a coalition of leading research groups. Already some researchers have been applying (and releasing the output of) IE and other KG construction techniques on this data set. For example, Wang et al. (2020) created the CORD-NER data set, which is the output of comprehensive NER application on the CORD-19 corpus (using distant or weak supervision). Their data set covers 75 fine-grained entity types, including common biomedical entity types such as genes and chemicals. It also covers new types related to COVID-19 studies, including viral proteins, substrates, and immune response. The quality of NER was found to surpass SciSpaCy (a version of SpaCy that is optimized for scientific text) based on a sample set of documents. The authors have stated that they will continue to update CORD-NER based on incremental updates and improvements to the underlying system. Although the outputs of NER do not (by themselves) constitute a KG as we understand it, it is possible to express the outputs as a set of triples and apply other techniques such as instance matching (IM) and Probabilistic Soft Logic (PSL) to them. Several groups are already undertaking these refined efforts to complete the KG (as discussed in part III), including our own.</p>
<p><span aria-label="460" id="pg_460" role="doc-pagebreak"/>Recently, Neo4j (chapter 12) built and released a proper KG on COVID-19 that integrates publications, case statistics, genes, functions, molecular data, and other relevant information. According to the website,<sup><a href="chapter_17.xhtml#fn4x17" id="fn4x17-bk">4</a></sup> the project is a “voluntary initiative of graph enthusiasts and companies” and is therefore truly representative of a community effort. The website also includes details on the schema used to represent the KG and the specific data sets and infrastructure used. We also provide more guidance in the section entitled “Software and Resources,” at the end of this chapter. The fact that a collaboratively built KG, and KG-supporting resources such as CORD-NER, could be set up mere weeks after public data sets and resources were released is a testament to the maturity of the field of KGs, the applicability of research that has been the sum total of decades-long contribution of many practitioners and scientists, and the social recognition by many current scientists and companies that KGs could indeed be used as a technology to help in the fight against the disease.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec17-5"/><b>17.5Concluding Notes</b></h2>
<p class="noindent">In this chapter, we described some selected use-cases and architectures for constructing domain-specific KGs (and KG-powered systems such as domain-specific search engines <span aria-label="461" id="pg_461" role="doc-pagebreak"/>and intelligence tools for informatics and situational awareness), especially in areas of social impact such as HT and other crises. However, crises and illicit domains are by no means the only social domains that KG research can influence, and research on the application of KGs for social good continues to be very active. For example, very recently, there have been attempts to use KGs for doing better geopolitical forecasting. This is a difficult problem that, especially in todays political climate, can yield insights into, and benefits for, a range of stakeholders, including policy think tanks and intelligence agencies like Intelligence Advanced Research Projects Activity (IARPA), if performed with reasonable success. Successful geopolitical forecasting requires one to navigate a landscape of complex variables and to expend considerable time and resources on doing research. In recent work, the DIG system was used to process various important data sources, such as the Armed Conflict Location and Event Data (ACLED) project, the Political Instability Task Force (PITF), and the Famine Early Warning Systems Network (FEWS NET), in order to equip geopolitical forecasters with sophisticated facilities like structured search, maps, and dossier generation. While the jury is still out on whether providing such aids can yield a consistent improvement in overall forecasting accuracy, there is no question that KG-powered systems are providing more sophisticated decision aids that are bringing more tools to domain experts than was thought possible or feasible before. In domains like HT and other crisis informatics, the evidence (in favor of better response or more social impact) for providing such aids is fairly conclusive. 
We are confident that more such cases will crystallize and be publicized over time.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec17-6"/><b>17.6Software and Resources</b></h2>
<p class="noindent">The Inferlink tool, which played an important role in DIG and can be used in other systems as well, is described in detail at <a href="http://www.inferlink.com/our-work#research-capabilities-section">http://<wbr/>www<wbr/>.inferlink<wbr/>.com<wbr/>/our<wbr/>-work#research<wbr/>-capabilities<wbr/>-section</a>. ETK, which provides the extraction capabilities in DIG but can similarly be extended or even used in a stand-alone fashion, is available at <a href="https://github.com/usc-isi-i2/etk">https://<wbr/>github<wbr/>.com<wbr/>/usc<wbr/>-isi<wbr/>-i2<wbr/>/etk</a>. Details on DIG are available at <a href="http://usc-isi-i2.github.io/dig/">http://<wbr/>usc<wbr/>-isi<wbr/>-i2<wbr/>.github<wbr/>.io<wbr/>/dig<wbr/>/</a>. An alternative system, DeepDive, that can also be used to do relation extraction (RE) and build KGs using a relatively customized framework, may be accessed at <a href="http://deepdive.stanford.edu/re-lation_extraction">http://<wbr/>deepdive<wbr/>.stanford<wbr/>.edu<wbr/>/re<wbr/>-lation<wbr/>_extraction</a>. In chapter 3, we also mentioned the DDT, which provides an advanced interface for domain discovery and has also been used extensively in the Memex program, which funded much of the HT work discussed in this chapter. The DDT maintains a GitHub repository at <a href="https://domain-discovery-tool.readthedocs.io/en/latest/">https://<wbr/>domain<wbr/>-discovery<wbr/>-tool<wbr/>.readthedocs<wbr/>.io<wbr/>/en<wbr/>/latest<wbr/>/</a>. Other useful resources available under Memex may be found at <a href="https://www.darpa.mil/opencatalog?ocSearch=m-emex&amp;sort=program&amp;ocFilter=all">https://<wbr/>www<wbr/>.darpa<wbr/>.mil<wbr/>/opencatalog<wbr/>?ocSearch<wbr/>=m<wbr/>-emex&amp;sort<wbr/>=program&amp;ocFilter<wbr/>=all</a>.</p>
<p>The THOR system, as a complete package, is not a fully open source package, although many individual modules are downloadable in some modality (e.g., as a Docker container or as source code). The ELISA system was one of the LTDEs (including machine translation <span aria-label="462" id="pg_462" role="doc-pagebreak"/>and IE) implemented for THOR. IE modules used in ELISA expose some download links and modalities at <a href="https://blender.cs.illinois.edu/software/">https://<wbr/>blender<wbr/>.cs<wbr/>.illinois<wbr/>.edu<wbr/>/software<wbr/>/</a>. Other research packages include tools for situational clustering of tweets (e.g., <a href="https://hub.docker.com/r/akarshdang/streaming-clustersv4">https://<wbr/>hub<wbr/>.docker<wbr/>.com<wbr/>/r<wbr/>/akarshdang<wbr/>/streaming<wbr/>-clustersv4</a>), visualization of crisis-relevant classifications on social media data (e.g., <a href="https://github.com/ppplin-day/Situati-on-Awareness-Visualization">https://<wbr/>github<wbr/>.com<wbr/>/ppplin<wbr/>-day<wbr/>/Situati<wbr/>-on<wbr/>-Awareness<wbr/>-Visualization</a>), and urgency detection (e.g., <a href="https://hub.docker.com/r/ppplinday/emergence-detection">https://<wbr/>hub<wbr/>.docker<wbr/>.com<wbr/>/r<wbr/>/ppplinday<wbr/>/emergence<wbr/>-detection</a>). Sentiment analysis also plays a role in such systems, and many useful sentiment analysis tools are available, including free, relatively simple, but widely used (and arguably robust) packages, such as the lexicon- and rule-based tool VADER (<a href="https://github.com/cjhutto/vaderSentiment">https://<wbr/>github<wbr/>.com<wbr/>/cjhutto<wbr/>/vaderSentiment</a>), as well as paid tools, especially for social media marketing. Deep learning has also been applied to sentiment analysis in recent years; see Zhang et al. (2018) for a survey and pointers.</p>
<p>There are several other projects that are using AI and knowledge management tools for crisis informatics. Good resources and pointers can be found on the pages for Project EPIC (<a href="https://epic.cs.colorado.edu/our_work/">https://<wbr/>epic<wbr/>.cs<wbr/>.colorado<wbr/>.edu<wbr/>/our<wbr/>_work<wbr/>/</a>) and more recently, Co-Inform (<a href="http://people.kmi.open.ac.uk/harith/">http://<wbr/>people<wbr/>.kmi<wbr/>.open<wbr/>.ac<wbr/>.uk<wbr/>/harith<wbr/>/</a>). Another valuable resource, especially for data sets and lexicons, is CrisisLex (<a href="https://crisislex.org/">https://<wbr/>crisislex<wbr/>.org<wbr/>/</a>), which also contains tools that helps users and researchers create these collections and lexicons. In the “Bibliographic Notes” section, we also provide citations to other important works, many of which maintain project pages as well.</p>
<p>The Linguistic Data Consortium (LDC) has done much resource curation and annotation to support the larger goals of the LORELEI program under which THOR was funded. The LDC webpage describing current and recent projects may be accessed at <a href="https://www.ldc.upenn.edu/collabor-ations/current-projects">https://<wbr/>www<wbr/>.ldc<wbr/>.upenn<wbr/>.edu<wbr/>/collabor<wbr/>-ations<wbr/>/current<wbr/>-projects</a>. Note that a subscription is typically required to access and download many of these resources.</p>
<p>Toward the end of the chapter, we mentioned COVID-19, which is an ongoing crisis at this time, and has already had severe social, economic, and medical impacts. A silver lining has been the rallying of scientific and other communities of people, globally, for the purpose of finding ways of tackling the crisis together. On the technology front, there has been ample provision of software, data, and resources for fighting the crisis. We can only list a few here, although we also include websites that are attempting to list known resources on a single page:</p>
<ul class="numbered">
<li class="NL">1.The CORD-19 corpus is available on Kaggle: <a href="http://www.kaggle.com/allen-insti-tute-for-ai/CORD-19-research-challenge">www<wbr/>.kaggle<wbr/>.com<wbr/>/allen<wbr/>-insti<wbr/>-tute<wbr/>-for<wbr/>-ai<wbr/>/CORD<wbr/>-19<wbr/>-research<wbr/>-challenge</a>.</li>
<li class="NL">2.The Neo4j KG on COVID-19 is described at <a href="http://www.odbms.org/2020/03/we-build-a-knowledge-graph-on-covid-19/">http://<wbr/>www<wbr/>.odbms<wbr/>.org<wbr/>/2020<wbr/>/03<wbr/>/we<wbr/>-build<wbr/>-a<wbr/>-knowledge<wbr/>-graph<wbr/>-on<wbr/>-covid<wbr/>-19<wbr/>/</a>. Public access credentials are also available on that page.</li>
<li class="NL">3.Another example of a COVID-19 KG project released by a commercial effort (Yahoo) is described at <a href="https://yahoodevelopers.tumblr.com/post/616566-076523839488/yahoo-knowledge-graph-announces-covid-19-dataset">https://<wbr/>yahoodevelopers<wbr/>.tumblr<wbr/>.com<wbr/>/post<wbr/>/616566<wbr/>-076523839488<wbr/>/yahoo<wbr/>-knowledge<wbr/>-graph<wbr/>-announces<wbr/>-covid<wbr/>-19<wbr/>-dataset</a>.</li>
<li class="NL">4.<span aria-label="463" id="pg_463" role="doc-pagebreak"/>CORD-NER is available at <a href="https://xuanwang91.github.io/2020-03-20-cord19-ner/">https://<wbr/>xuanwang91<wbr/>.github<wbr/>.io<wbr/>/2020<wbr/>-03<wbr/>-20<wbr/>-cord19<wbr/>-ner<wbr/>/</a>.</li>
<li class="NL">5.In general, a detailed summary of the literature review, tools, data sets, and other contributions from the Kaggle communitys COVID-19 work is maintained at <a href="https://www.kaggle.com/covid-19-contributions">https://<wbr/>www<wbr/>.kaggle<wbr/>.com<wbr/>/covid<wbr/>-19<wbr/>-contributions</a>.</li>
</ul>
<p>These sites present only one example of an aggregation of COVID-19 data sets and resources. A web search reveals many more, some of which are more official than others. Examples of aggregation pages include <a href="https://dataverse.harvard.edu/dataverse/2019ncov">https://<wbr/>dataverse<wbr/>.harvard<wbr/>.edu<wbr/>/dataverse<wbr/>/2019ncov</a>, <a href="https://browse.welch.jhmi.edu/covid-19/databases">https://<wbr/>browse<wbr/>.welch<wbr/>.jhmi<wbr/>.edu<wbr/>/covid<wbr/>-19<wbr/>/databases</a> and <a href="https://guides.ucsf.edu/COVID19/data">https://<wbr/>guides<wbr/>.ucsf<wbr/>.edu<wbr/>/COVID19<wbr/>/data</a>. Note that usually there is significant overlap, but on occasion, a good resource may be found or linked on selective aggregation websites. Furthermore, because the crisis continues to evolve very fast, we always recommend that the interested reader do a fresh web search, both to obtain new resources and to find the latest addresses of these resources.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec17-7"/><b>17.7Bibliographic Notes</b></h2>
<p class="noindent">Because KGs are a relatively novel research area, using them for domain-specific social impact has only recently become a serious topic of research, although the use-cases covered in this chapter (such as HT and natural disaster response) are important enough to constitute research domains in their own right, in which KGs have played some role. By now, multiple papers have been published covering both direct and indirect applications of KG and KG-enabled search technology to HT, many of which cover research that was funded or conducted under the Memex program. There are also general studies about using technology to fight HT, as well as studies aimed at better understanding the extent of this tragic crime. We cite a broad range of work here, including Szekely et al. (2015), Hultgren et al. (2016), Alvari et al. (2016), Kejriwal and Kapoor (2019), QC et al. (2016), Chen (2011), Harrendorf et al. (2010), Greiman and Bain (2013), Savona and Stefanizzi (2007), and Kejriwal and Szekely (2017a,c), for the interested reader to get started on this difficult subject. Work specifically related to DIG includes Szekely et al. (2015), Kejriwal et al. (2017a,b), Kejriwal and Szekely (2017c), and Kapoor et al. (2017). In the other direction, there has also been work on how the Internet, as well as technologies like mobile, have contributed to trafficking; we cite Musto and Boyd (2014), Dixon (2013), Sarkar (2015), Latonero et al. (2012), and Latonero (2012) as examples of research investigating this phenomenon.</p>
<p>A good reference for the DeepDive system, which has also been used for constructing KBs in difficult domains, is Niu et al. (2012a). Yang et al. (2016), though not specific to either HT or KGs, is also a worthwhile read for some of the conceptual underpinnings of the work cited earlier, as well as continuing work in this space.</p>
<p>Other related work, which is broader and more conceptual but critical to gaining a deeper understanding of these other domain-specific systems, includes Hogan et al. (2007), Lin et al. (2012), Saleiro et al. (2016), Tonon et al. (2012), Dalvi et al. (2009), Freitas et al. <span aria-label="464" id="pg_464" role="doc-pagebreak"/>(2012), Marchionini (2006), Jain et al. (2007), Anantharangachar et al. (2013), Nikolov et al. (2013), and Gupta et al. (2012). They cover a diverse set of topics, such as entity-centric search, ad-hoc object retrieval, exploratory search, structured querying over web ads and natural-language text, efficient combination of structured and full-text queries, ontology-guided IE, and scalable web search and query systems, all of which played some role in the construction and design of a system like DIG that was transitioned to law enforcement for fighting HT. Some of these topics (such as the combination of structured and full-text queries) fall into areas that have received a fuller treatment in part IV of this book.</p>
<p>Concerning natural disaster response, we note that the broader field of <i>crisis informatics</i> has become important, especially given the effects of climate change, record-setting summers and droughts, and the prevalence of more extreme weather events. For a background on crisis informatics, we highly recommend Hagar (2015), Reuter et al. (2018), Anderson and Schram (2011), Palen et al. (2007), and Palen and Anderson (2016) (with the last of these being a particularly instructive introduction); for a sense of how technology (especially social media) can have an impact in this area, we recommend Ukkusuri et al. (2014), Palen (2008), Heverin and Zach (2010), and Anderson et al. (2013). It is also instructive to consider the impact of funded projects in this space; for example, Project EPIC, an important effort launched in September 2009 and supported by a $2.8 million grant from the National Science Foundation, takes a multilingual, multidisciplinary perspective and has led to a large number of publications, with the last few coming in 2015. The most recent examples include Kogan et al. (2015), Aydin and Anderson (2015), and Barrenechea et al. (2015).</p>
<p>KGs play a niche role in this space, though the LORELEI program has introduced advanced language technologies to address the problem in parts of the world where English is not the dominant language; see Christianson et al. (2018) for an excellent overview. Throughout this book (but especially in part II), we have made note of the difficulties posed for IE (and, by extension, KGC and KG-enabled) systems in the cross-lingual and multilingual setting. When dealing with urgent crisis situations, especially in developing nations, addressing these difficulties while still managing inevitable performance degradation (compared to English-only systems and baselines) can make all the difference between a timely intervention and great loss of life. LORELEI provided the basis for an impressive research agenda that brought together natural-language and KG researchers; a good set of program-level references includes Tong et al. (2018), Papadopoulos et al. (2017), Strassel et al. (2017), and Strassel and Tracey (2016). One notable output of the program was THOR; good references, including a demonstration article on THOR that inspired much of the material covered herein, are Kejriwal et al. (2018a,b), Malandrakis et al. (2018), Gu and Kejriwal (2018), Zhang et al. (2018), Kejriwal and Zhou (2019), Martinez et al. (2019), and Kejriwal and Gu (2019).</p>
<p><span aria-label="465" id="pg_465" role="doc-pagebreak"/>Note that, because LORELEI and Memex ended recently, the jury is still out on the full impact (and sustainability) of this research on society. Some of the systems that have been transitioned to real-world users or are still being transitioned may or may not survive, but we believe that there is enough momentum behind the broader agenda of AI for social good (of which KGs for social good is a part) that more such systems designed to have direct social impact will continue to emerge.</p>
<p>Of course, beyond the two domains discussed here, there are other social domains where KGs are starting to gain traction, although the literature is piecemeal and not easy to track or collect into a comprehensive list of references. An important domain is healthcare; see Shi et al. (2017), Gatta et al. (2017), and Huang et al. (2017) for some direct applications of KGs in healthcare-related domains. Recently, there have also been patents [specifically, we note one by Sahu et al. (2018)] from industry, showing that this is not just an academic pursuit. Another important domain is government; chapter 15 provided more details on how KGs can play an important role there.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec17-8"/><b>17.8Exercises</b></h2>
<ul class="numbered">
<li class="NL">1.Imagine that you are a doctor trying to find either a cure or vaccine for COVID-19. We described some KG efforts in the subsection entitled “COVID-19 and Medical Informatics” that were under way as early as March 2020. For the KG to help you, as a doctor, in your job, what kinds of features should it support? Looking back at some of the other chapters in this section, can you think of essential data sets that should be incorporated into, or linked to, a COVID-19 KG?</li>
<li class="NL">2.Obfuscation of text is an important challenge for KGC systems when dealing with illicit domains like HT. Imagine an ordinary piece of text such as “Let us meet in the park. Call me at the number 123-456-7890 when you get here.” Suppose that you were trying to obfuscate this message so that it is clear to a (reasonable, but arbitrary) human being what you are trying to express, but you want to deliberately make it harder for search engines and machine learning to extract critical information such as phone number. Think of three such ways to obfuscate the message and write your obfuscated messages.</li>
<li class="NL">3.Thinking back to the KGC systems in part II, what kinds of approaches could you use to extract a useful KG from millions of messages like these?</li>
<li class="NL">4.Considering the crisis informatics problem described earlier, list five open data sets that you could use to deliver more value to stakeholders in this space. Why would you advocate for those data sources, and what are some key things to be aware of when using these data sets or integrating them into a KG?</li>
<li class="NL">5.<span aria-label="466" id="pg_466" role="doc-pagebreak"/>How might a resource such as the Gene Ontology be of help in fighting COVID-19? Try to draw out a very specific scenario, and articulate your assumptions as clearly as possible.</li>
<li class="NL">6.Could you think of illicit domains that have significant presence on the web, but where KG technology, as described in this book, might not necessarily be applicable? Why or why not?</li>
<li class="NL">7.Suppose an illicit portal on the Dark Web wants to build a chatbot for its highest-paying customers to ensure that they do not have to wade through lots of obfuscation or irrelevant results, but can instead find what theyre looking for as efficiently as possible. Why would the portal not run into the same problems that you would face when trying to build such a chatbot? Does obfuscation really matter for the portal? Why or why not? <i>Hint: Think carefully about the raw data that your KGC systems have to work with when dealing with a domain from the outside, as well as the data that the portal would have access to. How does the portal ensure that obfuscation does not matter to, or get in the way of, the chatbot, even if third-party providers are involved in listing on the portal?</i></li>
</ul>
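<p>The obfuscation challenge posed in exercise 2 can be made concrete with a short sketch (our own illustration, not from the text): a naive regular-expression extractor easily pulls the phone number from the plain message, but fails on human-readable obfuscations. The specific obfuscated strings below are hypothetical examples of the kind a reader might produce.</p>

```python
import re

# Naive phone-number extractor: ten digits in the common XXX-XXX-XXXX shape,
# with optional dashes, dots, or spaces as separators.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

plain = ("Let us meet in the park. Call me at the number 123-456-7890 "
         "when you get here.")

# Hypothetical obfuscations that a reasonable human can still decode:
obfuscated = [
    "Call me at one two three, four five six, seven eight nine zero.",
    "Call me at 123 four56 78niner0 when you get here.",
    "Ring me: 1*2*3*4*5*6*7*8*9*0.",
]

print(PHONE_RE.findall(plain))       # extracts the number from the plain text
for msg in obfuscated:
    print(PHONE_RE.findall(msg))     # extracts nothing from each variant
```

<p>A KGC system facing millions of such messages (exercise 3) therefore needs more than surface patterns; for example, learned sequence taggers or normalization dictionaries that map spelled-out digits back to numerals.</p>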
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_17.xhtml#fn1x17-bk" id="fn1x17">1</a></sup>One such aspect of this methodology would be the specific mechanism used for eliciting domain knowledge. DIG allows simpler means like glossaries and rules, while DeepDive offers SRL. Some kinds of domain knowledge are more apt for the latter, while others are better suited for the former.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_17.xhtml#fn2x17-bk" id="fn2x17">2</a></sup>In chapter 3, we described the DDT that was developed by a group at New York University, primarily in response to the Memex goals and challenges, although the scope of the system has since been extended.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_17.xhtml#fn3x17-bk" id="fn3x17">3</a></sup>We have taken the liberty of modifying some of the digits in this example so that it is not a real phone number.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_17.xhtml#fn4x17-bk" id="fn4x17">4</a></sup><a href="http://www.odbms.org/2020/03/we-build-a-knowledge-graph-on-covid-19/">http://<wbr/>www<wbr/>.odbms<wbr/>.org<wbr/>/2020<wbr/>/03<wbr/>/we<wbr/>-build<wbr/>-a<wbr/>-knowledge<wbr/>-graph<wbr/>-on<wbr/>-covid<wbr/>-19<wbr/>/</a>.</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html>