<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch3" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch3"><span aria-label="53" id="pg_53" role="doc-pagebreak"/>3</h1>
<h1 class="chapter-title"><b>Domain Discovery</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b> The problem of acquiring a relevant corpus of data from the web was recognized even in the early days, when the web had fewer than a billion documents. Since that time, the web has grown exponentially, leading to renewed interest in the problem. In the broadest sense, domain discovery is less about discovering the domain than about discovering relevant data describing the domain. However, in a real sense, discovering relevant data is akin to a <i>data-driven</i> discovery of the domain itself, because what is actually on the web will determine the content of the domain more than a normative definition. In this chapter, we cover the problem of domain discovery in detail. Our principal focus will be on intelligent and focused crawling, both of which continue to be the primary mechanisms through which a specific domain can be discovered; however, we will also pay some attention to a number of advanced topics that have been explored by the research community over the last decade.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec3-1"/><b>3.1 Introduction</b></h2>
<p class="noindent">It has already become a proverbial cliché that we are awash in data, and the web has continued its relentless growth. While this growth is not disputed, it is always useful to put a number on the growth and statistics of the web, in order to provide some context and motivation for why raw data acquisition on a specific topic can be a problem. According to Internet Live Stats, a popular website tracking the growth of the web and cited by the official World Wide Web's 25th anniversary site, the total number of <i>websites</i> (different from <i>webpages</i>, the count of which is much higher) on the web was almost 2 billion at the start of 2019, and there are more than 4 billion internet users. In contrast, in 1999, when motivations for focused data acquisition were already being touted by the academic community, there were only about 350 million <i>webpages</i>, with only 600 GB of text being changed every month. Today, those numbers are so enormous that even major search engines like Google may only be able to estimate them. What is abundantly clear, however, is that the web has truly become its own vast universe, and finding <i>relevant</i> data to answer a specific set of questions (or fulfill even a loosely bounded set of requirements) is like finding a needle in an ever-growing haystack.</p>
<p><span aria-label="54" id="pg_54" role="doc-pagebreak"/>Large data sets (Big Data) also exist in fields like genetics and space. However, those data sets tend to be well ordered and structured, and if available on the right infrastructure (whether a private cluster or the cloud), they can be indexed and searched with the right algorithms in place. The web is a different beast altogether. Even if it were possible to temporarily store large parts of the web on a massive cluster, which is currently only within the scope of major search engine providers, crawling the data has become problematic due to the rise of automated spiders and spam. Many websites now implement captchas, require logins or other credentials to access core data, or otherwise block computers and Internet Protocol (IP) addresses that make too many requests. In short, brute-force approaches are becoming harder to implement, requiring a healthy dose of automation and manual labor.</p>
<p>The web is also considerably more heterogeneous than domain-specific data such as satellite images, or even social media. Here, heterogeneity can be thought of as variation along multiple dimensions, including syntactic structure of webpages, quality, textual and image content (which is how ordinary humans tend to think about heterogeneity), and method of access. For example, social media posts on Twitter can be crawled (within limits) by making calls to an application programming interface (API), but “normal” HTML pages have to be downloaded. Scraping useful content from such pages is even more challenging because there is so much on the pages that is not relevant (and also not visible to the human eye), but that is important for computer programs. For example, dynamic content on webpages is now mediated through embedded scripts, and because of modern web issues like search engine optimization (SEO), there may be hidden features on the webpages. To a computer program trying to “see” the page, all of these things are visible and must be handled gracefully. Because every website is different, and even webpages within a single website can be very different from one another, developing automated tools to construct KGs from such webpages is a difficult problem, and one that we pay attention to in the following chapters. The important thing to remember is that computers do not see the webpages in the same way that humans do, and this limits the expressiveness of any system that attempts to automatically characterize the relevance of a page to an input domain specification, usually using some form of machine learning.</p>
<p>By far, the predominant set of techniques that has been used in various guises to perform effective domain discovery is <i>focused crawling</i>. Intuitively, a crawler is a dynamic program that is supposed to go out on the web and acquire (via download, storage, and possibly indexing) a corpus that obeys certain criteria. Generic search engine crawlers are broad and not focused on a particular topic or user; they are the web-equivalent of a one-size-fits-all model. Focused crawling, as the name suggests, is either user- or domain-specific, though the latter is far more appropriate. User-focused search is akin to <i>personalized</i> information retrieval, for which a user model is required. Search engines like Google have invested significant effort in building such models, and can use them to provide relevant outputs to recurring users based on past search history and other artifacts, like the user's location and <span aria-label="55" id="pg_55" role="doc-pagebreak"/>other information that becomes available when a user is signed into their account during search. In contrast, focused crawling is not about understanding user intent during search, but locating and downloading a “local universe” of documents that is relevant to some domain of interest. Specifying the domain is not always straightforward, however. In fact, we argue that there are four issues that need to be considered when designing (or choosing, customizing, and deploying) a focused crawler or other domain discovery tool.</p>
<p>First, how do we specify the domain to begin with? One easy, but naive, approach is to specify some keywords, but these could potentially be overloaded, ambiguous, or underspecified. It is also difficult to justify the difference between focused crawling and generic search (possibly with personalization) if inputs are always keywords, because the search engine learns over time how keywords map to user intent. Instead, in the vast majority of focused crawling literature, the crawler is given exemplar documents to begin with, but these could bias the system into discovering more of what is <i>already known</i>. This fundamentally defeats the purpose of domain discovery, especially because (in practice) many people are willing to commit to domain discovery to learn things about the domain that they might not have known, or even suspected. More recently, a functional view of domain discovery has also started to take hold, especially for unusual domains like human trafficking and fraud. In such investigative domains, domain experts, such as law enforcement and workers in federal agencies like the Securities and Exchange Commission (SEC) and the Federal Bureau of Investigation (FBI), ideally want to construct a search system (whether based on KGs or other technology) to help them investigate leads about bad actors and make predictions about illicit activity before it actually happens and ends up causing damage. These kinds of investigative <i>queries</i> can also serve as inputs to the domain discovery process.</p>
<p>Second, how do we explore the <i>link structure</i> of the web to discover relevant webpages? Even when the web was “small,” randomly following links from relevant root pages was found to yield suboptimal results compared to more intelligent approaches. Since those early findings, mostly emerging in the 1990s, many complex approaches have been proposed. A full exploration is not within the scope of this chapter; instead, we focus on important high-level discoveries that clearly illustrate how the link structure can be exploited to make good relevance determinations.</p>
<p>Third, how do we use the <i>content</i> of the webpage to determine relevance? In the simplest case, one could imagine an approach that is used by the search engines (namely, keyword matching and bag of words models like tf-idf). In other cases, we may want to extract other features to assist a machine learning classifier in making good relevance determinations.</p>
<p>Fourth, and related to the previous issue, is the question of what kinds of <i>user interactions</i> are permitted when doing domain discovery. A simple, relatively lightweight model would take the input domain specification from the user, attempt domain discovery over a (possibly self-determined) specified period of time, and not allow further inputs from the <span aria-label="56" id="pg_56" role="doc-pagebreak"/>user. More practical systems, predicated on literature from both active learning and reinforcement learning, may want to solicit occasional feedback from the user in the hope of achieving higher quality and coverage. Either approach has its pros and cons, but the latter is more favored in modern times, especially for complex domains.</p>
<p>There are also some practical and ethical issues related to domain discovery that we do not cover in this chapter (or indeed the entire book), mainly because they are tangential to the core concept of domain discovery itself, but also because they have not been resolved in a stable or established way. For example, how does a domain discovery system deal with captchas and login issues? While deep learning has shown some promise in dealing with automatic transcription of captchas, the problem is far from solved. There is also the question of infrastructure and best practices: how often should a webpage be recrawled, and what kind of infrastructure should be used to host the crawler? There is no real right answer to these questions, other than “It depends.” For example, in our experience, when dealing with nefarious domains like human trafficking, some websites may have to be crawled more than four times a day to ensure data completeness. Furthermore, an old webpage should never be replaced by a recrawled version because the page could be useful for tracking activity and gathering evidence over time. However, if crawling news articles, it makes more sense to recrawl and replace, since the newer version of the article may contain updated figures and redactions, rendering the previous version obsolete or even wrong.</p>
<p>Although we do not cover these issues in this chapter, we provide some guidance for the interested reader in the section entitled “Bibliographic Notes,” at the end of this chapter. Our primary focus here will be focused crawling, though we take a broader view of domain discovery toward the end of this chapter.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec3-2"/><b>3.2 Focused Crawling</b></h2>
<p class="noindent">Crawlers (equivalently known as <i>robots</i> or <i>spiders</i>) were designed for downloading and assembling web content locally. <i>Focused</i> crawlers were introduced for satisfying the need of domain experts or particular organizations for creating and maintaining subject-specific web document collections locally, usually for addressing complex needs that could not be satisfactorily handled by generic search engines. Focused crawling is especially important when results have to be high quality, relevant to the domain, and up-to-date, all without investing in wholesale resources (e.g., time, space, and network bandwidth) for acquiring a one-size-fits-all data set. In short, focused crawlers try to download as many relevant, domain-specific pages as they can, while keeping the number of irrelevant pages to a minimum.</p>
<p>Recall that one of the issues alluded to in the introductory section was the level of user input. In early, and even recent, literature on focused crawling, the general assumption has been that the crawler is given a <i>seed set</i> of webpages as input, with the immediate goal of extracting outgoing links in the seed pages and determining what links to visit next. <span aria-label="57" id="pg_57" role="doc-pagebreak"/>The primary difference between the various types and schools of focused crawling arises in how <i>priorities</i> are assigned to links, though other differences are also relevant. Regardless of the criteria, webpages pointed to by these links are downloaded, and those deemed to be relevant (usually using some form of machine learning) are stored or indexed. Crawlers continue doing this until manually stopped or until a desired number of pages have been downloaded. Sometimes they stop by necessity because local resources get exhausted quickly as more pages are downloaded and stored. In yet other cases, crawling continues for a preset period of time. Decision criteria that depend on a crawler running until a resource (which could include time) runs out have an advantage in cloud infrastructure, because costs can be reasonably controlled or anticipated.</p>
<p>Crawlers used by generic search engines like Google retrieve massive web corpora regardless of topic or user-focused relevance. Focused crawlers, in contrast, achieve such relevance by combining both the content of the retrieved web pages and the link structure of the web. Based on how this is done, three classes of focused crawlers seem to have emerged in the literature, though some are much more popular than others. A classic focused crawler takes as input a user query that describes the topic, as well as a set of starting seed webpage Uniform Resource Locators (URLs) that can be used to guide the search toward other webpages of interest. The crawler incorporates criteria for assigning higher download priorities to links based on their likelihood to lead to relevant pages. Higher-priority pages are downloaded first, followed by recursively navigating to the links contained in the downloaded pages. Typically, download priorities are computed based on the similarity between the topic and the anchor text of a page link or between the topic and text of the page containing the link.</p>
<p>Text similarity is computed using an information similarity model, which in the modern era is almost always a <i>vector space model</i> (VSM), although boolean models have been proposed in earlier eras. Put simply, a VSM attempts to represent a unit of data (in this context, a document) as a vector in some numeric space. Some VSMs that will recur throughout this book are the tf-idf VSM and embedding-based VSMs, the latter based on much more recent work in neural representation learning.<sup><a href="chapter_3.xhtml#fn1x3" id="fn1x3-bk">1</a></sup> These VSMs have become extremely important in communities as varied as Natural Language Processing (NLP), information retrieval, and knowledge discovery. The reader without any knowledge of VSMs is encouraged to briefly review sections 10.1 and 10.2 in chapter 10.</p>
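To make the tf-idf VSM concrete, the following sketch builds sparse tf-idf vectors for a tiny corpus and compares them with cosine similarity. The toy documents are hypothetical, and real systems would use larger vocabularies and sparse-matrix libraries, but the weighting and comparison logic is the same.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) for a corpus of tokenized documents."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # weight = term frequency * log of inverse document frequency
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A document pair sharing topical terms (e.g., “web” and “crawler”) will score higher under `cosine` than a pair with disjoint vocabularies, which scores zero.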
<section epub:type="division">
<h3 class="head b-head"><a id="sec3-2-1"/><b>3.2.1 Main Design Elements of a Focused Crawler</b></h3>
<p class="noindent">The principal goal of a focused crawler is to keep the overall number of downloaded web pages for processing to a minimum, while maximizing the percentage of relevant pages <span aria-label="58" id="pg_58" role="doc-pagebreak"/>retrieved. It was well recognized, starting from early work, that the performance of a focused crawler could be highly dependent on the selection of good seed inputs. One way to infuse more robustness into this process is to accept only a query from the user as input and to use a generic web search engine like Google to obtain the first set of seed pages from this query. Regardless of how seed webpages are obtained, it is equally important to understand what makes a seed page good. Generally, it has been found that such pages are relevant to the topic (an obvious criterion), but less obviously, they can also be pages <i>from</i> which relevant pages can be accessed within a <i>small</i> number of link traversals. For example, if the topic is e-commerce products, a good seed page may describe reviews of several such products, along with links. Note that the review page itself is not all that relevant, but it serves as a broker and can be instrumental in helping a focused crawler access many more relevant pages than it would from a page describing a single e-commerce product.</p>
<p>The discussion here shows that there are many degrees of freedom in determining a precise crawler design, from input to content processing. Yet the architecture itself (of a focused crawler) is fairly standard, containing a uniform set of generic components. The actual instantiation of the components depends on the crawler (e.g., Best-First Crawler versus Learning Crawler) and is amenable to varying degrees of innovation. Generic elements that must always be taken into account when designing a crawler are discussed next.</p>
<p><b>Input Mechanisms.</b> Crawlers take as input a number of starting (seed) URLs and (in the case of focused crawlers) the topic description, which can be a list of keywords for classic and semantics-focused crawlers or a training set for learning-based crawlers. More recently, other innovative mechanisms for accepting the user's intent as input have also been proposed.</p>
<p><b>Page Retrieval.</b> The links in downloaded pages are extracted and placed in a queue. A nonfocused crawler uses these links and proceeds with downloading new pages using a first-in-first-out protocol. A focused crawler reorders queue entries by applying content relevance or importance criteria, or it may decide to exclude a link from further expansion (generic crawlers may also apply importance criteria to determine pages that are worth crawling and indexing).</p>
<p><b>Content Processing and Representation.</b> Downloaded pages are lexically analyzed (e.g., using tokenization preprocessing modules) and transformed into a vector in some VSM. For example, in the tf-idf VSM, each term in a vector is represented by computing a formula that trades off the term's frequency (in the document) and its inverse frequency (in the full corpus), with the former contributing positively to the term's importance and the latter contributing negatively. However, because computing inverse document frequency (idf) weights during crawling can be problematic (there is no full corpus because the corpus itself is being discovered and crawled), most Best-First Crawler implementations use only term frequency (tf) weights. In general, using idf in a cold-start setting is problematic, but because idf can be a useful feature, one other option is not to discard it completely, but <span aria-label="59" id="pg_59" role="doc-pagebreak"/>rather to rely on open knowledge bases (KBs; a background corpus), such as Wikipedia or the Google News Corpus.</p>
<p><b>Priority Assignment.</b> Extracted URLs from downloaded pages are placed in a priority queue, with priorities determined based on the crawler type, as well as user preferences. These range from simple criteria, such as page importance or relevance to query topic (computed by matching the query with page or anchor text), to more involved criteria, such as that determined by a learning process.</p>
<p><b>Expansion.</b> URLs are selected for further expansion, and all previous steps (page retrieval and download and priority assignment, among others) are repeated until some criterion is satisfied or system resources are exhausted. For example, the process may be stopped if the desired number of pages have been downloaded.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec3-2-2"/><b>3.2.2 Best-First Crawlers</b></h3>
<p class="noindent">As the name suggests, Best-First Crawlers are the crawling equivalent of <i>best-first search</i>. Namely, every URL in the <i>crawl frontier</i> is assigned a “priority” score, which could be simple, using a tf-idf VSM, or a complicated combination of signals (as we describe later in this chapter with intelligent crawling). The URL with the highest priority score in the crawl frontier is selected for expansion. The links present on this downloaded page are then added to the frontier, and their priority scores are computed. Following this step, the next page to be crawled and downloaded is determined (based on the priority scores), and the entire process is repeated until the frontier is empty (i.e., there are no more pages to download).</p>
<p>Typically, Best-First Crawlers use only term frequency (tf) vectors for computing topic relevance. The use of inverse document frequency (idf) values (as suggested by the tf-idf VSM) is problematic, not only because it requires recalculation of all term vectors at every crawling step, but also because, at the early stages of crawling, inverse document frequency values are highly inaccurate, as the number of documents is too small. Generally, this is a problem with using inverse document frequencies when the document collection is compiled in a “cold-start” setting, whereby only a very small set of seed documents is initially available. Empirically, Best-First Crawlers have been shown to outperform several rival approaches (such as InfoSpiders and Shark-Search) as well as nonfocused Breadth-First crawling approaches. Historically, Best-First Crawling is considered to be the most successful approach to focused crawling due to its simplicity and efficiency.</p>
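The crawl loop described above can be sketched as follows. Here `fetch` is a hypothetical stand-in for page download and link extraction (it returns a page's tokens and outgoing links), and, following the discussion above, priorities use term-frequency vectors only, without idf. Python's `heapq` is a min-heap, so scores are negated to pop the highest-priority URL first.

```python
import heapq
import math
from collections import Counter

def tf_cosine(a, b):
    """Cosine similarity over raw term-frequency vectors (no idf),
    as is typical for Best-First Crawlers during crawling."""
    u, v = Counter(a), Counter(b)
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def best_first_crawl(seeds, topic_terms, fetch, max_pages=100, threshold=0.0):
    """Best-First Crawling sketch: repeatedly expand the highest-priority
    URL on the frontier until the frontier is empty or a page budget is hit."""
    frontier = [(-1.0, url) for url in seeds]   # seeds get maximal priority
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant, downloaded = [], 0
    while frontier and downloaded < max_pages:
        _, url = heapq.heappop(frontier)
        page_terms, out_links = fetch(url)      # hypothetical download step
        downloaded += 1
        score = tf_cosine(topic_terms, page_terms)
        if score > threshold:                   # store only relevant pages
            relevant.append((url, score))
        for link in out_links:                  # expand with the page's score
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return relevant
```

Run against a small in-memory “web” (a dict mapping URLs to token lists and out-links), the crawler stores topical pages and skips off-topic ones, mirroring the frontier-expansion loop described above.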
<p>Generalized versions of the Best-First Crawler also exist. For example, the <i>N-Best First Crawler</i> is a popular variant that, at each step, chooses <i>N</i> pages (instead of just one) with the highest priorities for expansion. In a similar vein, so-called intelligent crawling tries to combine various cues, including page content, URL string information, sibling pages, and statistics about relevant or irrelevant pages for assigning priorities to candidate pages. In some cases, this yields a more effective crawling algorithm that is able to learn to crawl without direct user supervision or training samples.</p>
<p><span aria-label="60" id="pg_60" role="doc-pagebreak"/>Variants of the classic Best-First Crawling strategy also exist, depending on how links in the same page are prioritized. We briefly describe some of these approaches below. Note that all variants draw on the formula for cosine similarity between the two vectors (represented in the same VSM space) <img alt="" class="inline" height="13" src="../images/a-vector.png" width="9"/> and <img alt="" class="inline" height="17" src="../images/b-vector.png" width="9"/>:</p>
<figure class="DIS-IMG"><a id="eq3-1"/><img alt="" class="width" src="../images/eq3-1.png"/>
</figure>
<p><b>Variant 1.</b> All links on the page receive the same download priority by applying equation (<a href="chapter_3.xhtml#eq3-1">3.1</a>) on the topic and page content representations.</p>
<p><b>Variant 2.</b> Priorities are assigned to pages by computing the similarity between the anchor text of the link pointing to the page and the query by applying equation (<a href="chapter_3.xhtml#eq3-1">3.1</a>). Unlike Variant 1, the links from the same page may be assigned different priority values.</p>
<p><b>Variant 3.</b> This variant combines variants 1 and 2 by computing the priority of link <i>l</i> in page <i>p</i> as the average of the two similarities (query to page and query to anchor text). The rationale behind this approach is that a page relevant to the topic is more likely to point to a relevant page than to an irrelevant one. On the other hand, anchor text may also be regarded as a reasonably reliable summary of the content of the page that the link points to. However, anchor text is not always descriptive (or representative) of the content of the page that the link points to. By combining the two complementary signals, variant 3 is thus able to achieve a certain degree of robustness.</p>
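The three variants can be sketched as small priority functions over term lists. The 0.5/0.5 averaging in Variant 3 follows the description above, though real implementations may weight the two signals differently; the cosine helper is a plain term-frequency version for illustration.

```python
import math
from collections import Counter

def cos(a, b):
    """Cosine similarity of two token lists via term-frequency vectors."""
    u, v = Counter(a), Counter(b)
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def priority_variant1(topic, page):
    # every link on the page inherits the page's topic similarity
    return cos(topic, page)

def priority_variant2(topic, anchor):
    # per-link priority from the anchor text of the link
    return cos(topic, anchor)

def priority_variant3(topic, page, anchor):
    # average of the two complementary signals
    return 0.5 * (cos(topic, page) + cos(topic, anchor))
```

Because Variant 3 averages two bounded similarities, its priorities also stay in [0, 1], and a link with descriptive anchor text on a relevant page scores highest.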
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec3-2-3"/><b>3.2.3 Semantic Crawlers</b></h3>
<p class="noindent">While classical Best-First Crawlers compute similarity via standard lexical term-matching information retrieval (IR) (i.e., roughly, two documents are similar if they share common terms, and the more common terms they share, the more similar they are), this approach is too simplistic in settings where two documents are related even when they do not share many (or even any) common terms. Imagine, for example, a page describing “tourist must-visits” in Los Angeles, while another page describes “places to check out.” The two pages differ considerably in their vocabulary, but overlap heavily in terms of semantic content. Semantic crawlers address this problem by using term taxonomies or ontologies rather than relying solely on lexical matching or models like tf-idf. As we've seen earlier in this book when discussing KG ontologies and representational mechanisms, term taxonomies and ontologies conceptualize similar terms via typed links like <i>is-a</i>.</p>
<p>The basic approach of such crawlers is to first retrieve all terms conceptually similar to the topic terms from the ontology, and use these additional terms to supplement the topic description (e.g., by adding synonyms and other topically similar terms).</p>
<p>Where do general term taxonomies come from? One such famous resource is WordNet, a controlled vocabulary and thesaurus offering a taxonomic hierarchy of natural-language terms. It contains around 100,000 terms, organized into taxonomic hierarchies, and provides <span aria-label="61" id="pg_61" role="doc-pagebreak"/>broad coverage of the English vocabulary. Consequently, it can be used for focused crawling on almost every general-interest topic. <a href="chapter_3.xhtml#fig3-1" id="rfig3-1">Figure 3.1</a> illustrates an example for the word “politician.” Intuitively, we can see that, if properly leveraged, WordNet can be used for expanding the lexical term “politician” to include terms such as “mayor” and “legislator.” Pages containing these terms, compared to pages containing more random words (such as “garden” or “sky”), would be assigned higher scores by a semantic crawler than by crawlers relying solely on term-matching VSMs.</p>
<div class="figure">
<figure class="IMG"><a id="fig3-1"/><img alt="" src="../images/Figure3-1.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig3-1">Figure 3.1</a>:</span> <span class="FIG">An illustration of the semantic information provided by the WordNet lexical resource for a common noun such as “politician.”</span></p></figcaption>
</figure>
</div>
<p>The similarity between the topic and a candidate page is now computed as a function of semantic (conceptual) similarities between the terms they contain. But how do we compute the similarity between related concepts that are not lexicographically similar? The definition of document similarity will clearly depend on the choice of semantic similarity function. A good choice (in the context of semantic crawlers) for computing the priority of page <i>p</i> is</p>
<figure class="DIS-IMG"><a id="eq3-2"/><img alt="" class="width" src="../images/eq3-2.png"/>
</figure>
<p>Here, <i>i</i> and <i>j</i> are terms on the topic and candidate pages (or anchor text of the link), respectively, while <i>w</i><sub><i>i</i></sub> and <i>w</i><sub><i>j</i></sub> are their respective term weights. Further, <i>sim</i>(<i>i, j</i>) is the semantic similarity between these two terms. Assuming a resource like WordNet, several semantic similarity measures are applicable, such as the following:</p>
<p><b>Synonym Set Similarity.</b> Here, <i>sim</i>(<i>i, j</i>) is 1 if <i>i</i> belongs to the synonym set of term <i>j</i> in the WordNet taxonomy, and 0 otherwise.</p>
<p><b>Synonym, Hypernym/Hyponym Similarity.</b> Here, <i>sim</i>(<i>i, j</i>) is 1 if <i>i</i> belongs to the synonym set of term <i>j</i>, 0.5 if <i>i</i> is a hypernym or hyponym of <i>j</i>, and 0 otherwise.</p>
<p><span aria-label="62" id="pg_62" role="doc-pagebreak"/>The download priority of a link can be defined as the average of the similarity of the topic description with the anchor text and the page content, both computed using equation (<a href="chapter_3.xhtml#eq3-2">3.2</a>).</p>
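A sketch of the synonym/hypernym similarity measure and a page-scoring function in its spirit is shown below. The taxonomy tables are hypothetical stand-ins for WordNet (a real implementation would query WordNet's synonym sets and hypernym/hyponym links), and the unnormalized weighted double sum over term pairs is one plausible reading of equation (3.2).

```python
# Toy taxonomy tables standing in for WordNet (hypothetical data).
SYNONYMS = {"politician": {"politico"}}
HYPERNYMS = {"mayor": {"politician"}, "legislator": {"politician"}}

def sem_sim(i, j):
    """Synonym/hypernym-hyponym similarity: 1.0 for identity or synonyms,
    0.5 for a hypernym/hyponym relation, and 0 otherwise."""
    if i == j or j in SYNONYMS.get(i, set()) or i in SYNONYMS.get(j, set()):
        return 1.0
    if j in HYPERNYMS.get(i, set()) or i in HYPERNYMS.get(j, set()):
        return 0.5
    return 0.0

def semantic_score(topic_weights, page_weights):
    """Weighted sum of pairwise term similarities between topic terms i
    (weights w_i) and page terms j (weights w_j)."""
    return sum(wi * wj * sem_sim(i, j)
               for i, wi in topic_weights.items()
               for j, wj in page_weights.items())
```

Under this sketch, a page mentioning “mayor” scores higher against the topic term “politician” than a page about gardens, even though “politician” appears on neither page.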
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec3-2-4"/><b>3.2.4 Learning Crawlers</b></h3>
<p class="noindent">Learning crawlers learn user preferences on a topic from a set of example pages known as the <i>training set</i>. The method relies on a user providing a labeled set of pages, where the label expresses whether the page is relevant to the topic of interest. Training may also involve learning the link-path leading to relevant pages. Some methods only assume a positive training set (where we are given annotations expressing which pages are relevant, but no explicit annotations expressing which pages are irrelevant), while others require annotations expressing both relevance and irrelevance. The core idea, as with many supervised machine learning-based approaches, is to first train a classifier using a training set, and during crawling, use the classifier to classify each downloaded page as relevant or irrelevant (and also assign it a priority). Early approaches to learning crawlers used classifiers like Naive Bayes (trained on web taxonomies such as Yahoo!) for distinguishing between relevant and irrelevant pages; this was later expanded to more advanced classifiers such as decision trees, neural networks, and support vector machines (SVMs). For example, in one publication on learning crawlers, an SVM was applied to both page content and link context, with the combination shown to outperform methods using page content or link context alone.</p>
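As an illustration of the classifier component, here is a minimal multinomial Naive Bayes page classifier built on the standard library only. It is a sketch in the spirit of early learning crawlers, not a reproduction of any specific published system, and the training pages in the usage example are hypothetical.

```python
import math
from collections import Counter

class NaiveBayesPageClassifier:
    """Multinomial Naive Bayes over page tokens with Laplace smoothing."""

    def fit(self, pages, labels):
        self.class_counts = Counter(labels)
        self.term_counts = {c: Counter() for c in self.class_counts}
        for tokens, label in zip(pages, labels):
            self.term_counts[label].update(tokens)
        self.vocab = set().union(*(tc.keys() for tc in self.term_counts.values()))
        return self

    def predict(self, tokens):
        n = sum(self.class_counts.values())
        best, best_lp = None, -math.inf
        for c, cc in self.class_counts.items():
            lp = math.log(cc / n)                 # log class prior
            total = sum(self.term_counts[c].values())
            for t in tokens:                      # smoothed log-likelihoods
                lp += math.log((self.term_counts[c][t] + 1) /
                               (total + len(self.vocab) + 1))
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

During crawling, the predicted label (or the underlying log-probability) can double as a priority signal for the frontier.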
<p>The structure of paths leading to relevant pages can be an important factor in focused crawling, as first shown with <i>context graphs</i> by Diligenti et al. (2000). The idea is to work backward by following back links to relevant pages to recover pages leading to relevant pages. These pages, along with their path information, form the context graph. The original context graph method builds classifiers for sets of pages mainly at distance 1 or 2 from relevant pages in the context graph. The focused crawler uses these classifiers to establish priorities of visited pages. In a subsequent section on the <i>Context-Focused Crawler</i> (CFC) system, this methodology is detailed further.</p>
<p>An extension to the context graph method is the hidden Markov model (HMM) crawler, wherein the user browses the web looking for relevant pages and indicates if a downloaded page is relevant to the topic or not. The visited sequence is recorded and is used to train the crawler to identify paths leading to relevant pages. The significant aspect to note about these crawlers is that they were among the first to successfully model the crawling process as a <i>sequence-labeling</i> problem. As subsequent chapters of this book will illustrate, sequence labeling has emerged as an important subarea in machine learning and NLP, and today, problems like named entity recognition (a critical component in a KG construction pipeline) rely heavily on sequence labeling.</p>
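The sequence-labeling view can be made concrete with a tiny Viterbi decoder over hidden "distance-to-target" states, where observations are coarse page categories emitted by a content classifier. All state names and probabilities below are invented for illustration; an actual HMM crawler would estimate them from recorded user browsing sessions.

```python
# Hidden states: how far along a path to a target page we are.
states = ["far", "near", "target"]
start = {"far": 0.6, "near": 0.3, "target": 0.1}
trans = {
    "far":    {"far": 0.5, "near": 0.4, "target": 0.1},
    "near":   {"far": 0.2, "near": 0.4, "target": 0.4},
    "target": {"far": 0.3, "near": 0.4, "target": 0.3},
}
# Observations: coarse page categories produced by a content classifier.
emit = {
    "far":    {"general": 0.7, "related": 0.2, "topical": 0.1},
    "near":   {"general": 0.2, "related": 0.6, "topical": 0.2},
    "target": {"general": 0.1, "related": 0.2, "topical": 0.7},
}

def viterbi(obs):
    """Most likely hidden state sequence for an observed browsing path."""
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] * trans[p][s])
            row[s] = V[-1][prev] * trans[prev][s] * emit[s][o]
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # Trace back the best path from the most likely final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# A browsing path whose pages look progressively more on-topic.
decoded = viterbi(["general", "related", "topical"])
```

With the illustrative parameters above, the decoder labels the path `["far", "near", "target"]`, which is exactly the kind of path structure the HMM crawler exploits to decide which links are worth following.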
<p>Yet another variant of a learning-based crawler uses two classifiers instead of one. This was best illustrated in an early paper that first used the open directory (Directory Mozilla, <span aria-label="63" id="pg_63" role="doc-pagebreak"/>or DMOZ) web taxonomy to classify downloaded pages as relevant (or not), with a second classifier evaluating the probability that the given page will <i>lead</i> to a target page. While the two classification tasks should not be thought of as completely independent in the context of the modern web, they are nevertheless distinct enough that using two classifiers has some benefits in this scenario.</p>
<p>We emphasize that learning crawlers are not necessarily exclusive from the crawlers described previously. Over the years, hybrid crawlers that combine ideas from learning and classic focused crawling have been proposed in the literature. We provide pointers to relevant reading in the “Bibliographic Notes” section at the end of this chapter.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec3-2-5"/><b>3.2.5Evaluation of Focused Crawling</b></h3>
<p class="noindent">Crawler performance is typically measured by the percentage of downloaded pages that are relevant to the topic (i.e., pages with similarity greater than a predefined threshold), a measure that is known in the literature as the <i>harvest rate</i>. Harvest rate can be adjusted (by using a higher threshold) to measure the ability of the crawler to download pages highly relevant to the topic.</p>
<p>How should we build a ground-truth for evaluating such measures? One suggested, and popular, approach is to issue each input topic as a query to Google and have the results be inspected by a user. Pages considered relevant by the user would constitute the ground-truth for the topic. Note that the size of a ground-truth set can be anywhere from a few tens to thousands of pages per topic. During evaluation, the results of a crawler per topic are compared with the ground-truth: for each page returned by the crawler, its document similarity (using VSM) with all pages in the ground-truth set is computed. If the maximum of these similarity values is greater than a user-defined threshold, the page is marked as a positive result (according to the method). The more positive the results of a crawler are, the more successful the crawler is (i.e., the higher the probability that the crawler retrieves results similar to the topic). The performance of a crawler is generally computed as the average number of positive results over all topics.</p>
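The evaluation procedure above can be sketched as follows, with raw term-frequency vectors and cosine similarity standing in for the VSM. The threshold, the toy ground-truth set, and the crawled pages are all illustrative.

```python
import math

def vec(text):
    """Raw term-frequency vector (a stand-in for a full VSM representation)."""
    v = {}
    for term in text.lower().split():
        v[term] = v.get(term, 0) + 1
    return v

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def harvest_rate(crawled, ground_truth, threshold=0.5):
    """Fraction of crawled pages whose *best* similarity to any
    ground-truth page exceeds the threshold (a 'positive' result)."""
    positives = 0
    for page in crawled:
        best = max(cosine(vec(page), vec(g)) for g in ground_truth)
        if best > threshold:
            positives += 1
    return positives / len(crawled)

ground_truth = ["ebola outbreak response in west africa"]
crawled = ["ebola outbreak response", "cake recipes and baking"]
rate = harvest_rate(crawled, ground_truth)  # one of two pages is positive
```

Averaging this rate over all topics, as described above, yields the overall crawler score; raising the threshold measures the crawler's ability to retrieve only highly relevant pages.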
<p>When evaluating multiple crawlers, it is also important to have the necessary controls in place. For example, all crawlers should be initialized using the same set of seed pages, and it is equally important for all experiments to be conducted between a similar time span and using a similar set of resources. A more nebulous problem that is hard to both diagnose and detect is <i>data set bias</i> (i.e., when the ground-truth itself is constructed in such a way or methodology as to bias the performance of one or more crawlers). The best way to account for such biases is to have the ground-truth be constructed by a team that is not involved in one or more system designs. By increasing the size of the ground-truth per topic, as well as increasing the number of topics, the probability of data set bias or overfitting is further reduced.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="64" id="pg_64" role="doc-pagebreak"/><a id="sec3-3"/><b>3.3Influential Systems and Methodologies</b></h2>
<p class="noindent">As noted before, the design of a crawler is somewhat standard by now, although instantiations can widely vary. To gain a practical understanding of the trade-offs that are involved in engineering crawlers, in this section we consider some influential systems. Our goal is to be descriptive rather than prescriptive. The systems are described in the temporal order in which they have been developed within the community, in order to give a sense of how crawler design has advanced throughout the years.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec3-3-1"/><b>3.3.1Context-Focused Crawler</b></h3>
<p class="noindent">The CFC was one of the earliest crawlers to be predicated on the capability of preexisting search engines like Google. The basic intuition behind CFC is to take the seed documents, and to use a search engine to find and construct a representation of pages that occur within a certain link distance (defined as the minimum number of link traversals necessary to move from one page to another) of the documents. This representation is used to train a set of classifiers, which are optimized to detect and assign documents to different categories based on the expected link distance from the document to the target document. During the crawling stage, the classifiers are used to predict how many steps away from a target document the current retrieved document is likely to be. This information is then used to optimize the search.</p>
<p>Generally, there are two distinct stages to using the CFC when performing a focused crawl session, each of which is subsequently described in greater detail:</p>
<ul class="numbered">
<li class="NL">1.An initialization phase, when a set of context graphs and associated classifiers are constructed for each of the seed documents</li>
<li class="NL">2.A crawling phase, which uses the classifiers to guide the search and performs online updating of the context graph</li>
</ul>
<p class="TNI-H3"><b>3.3.1.1Generating the Context Graphs</b>The first stage of a crawling session aims to extract the context within which target pages are typically found, and then to encode this information in a <i>context graph</i>. A separate context graph is built for every seed element provided by the user. Every seed document forms the first node of its associated context graph. Using an engine such as Google, a number of pages linking to the target are first retrieved (known as the <i>parents</i> of the seed page). Each parent page is itself modeled in the graph as a node, with an edge declared between the target document node and the parent node. The new nodes compose “layer 1” of the context graph. The back-crawling procedure is repeated to search all the documents linking to documents of layer 1. These pages are incorporated as nodes in the graph and compose “layer 2.” As <a href="chapter_3.xhtml#fig3-2" id="rfig3-2">figure 3.2</a> shows, we may intuitively think of these layers as ever-growing concentric circles. To simplify the link structure, visualization, and formalism, the convention is that if two documents in layer <i>i</i> can be accessed from a common parent, the parent document appears <i>twice</i> in the <span aria-label="65" id="pg_65" role="doc-pagebreak"/>next layer (i.e., layer <i>i</i> + 1). This results in an “induced” graph, where each document in the layer <i>i</i> + 1 is linked to one (and only one) document in the layer <i>i</i>.</p>
<div class="figure">
<figure class="IMG"><a id="fig3-2"/><img alt="" src="../images/Figure3-2.png" width="250"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig3-2">Figure 3.2</a>:</span> <span class="FIG">A context graph (with two layers) of a target document.</span></p></figcaption>
</figure>
</div>
<p>The back-linking process is iterated until a user-specified number of layers have been filled or some other convergence criterion is met (e.g., an upper bound on the total number of backlink traversals or the size of the context graph is breached). In practice, the number of elements in a given layer can increase suddenly when the number of layers grows beyond some limit. In these cases, one systematic approach is to statistically sample the parent nodes, up to some system-dependent limit.</p>
<p>The depth of a context graph is defined to be the number of layers in the graph excluding the level 0 (the node storing the seed document). When <i>N</i> levels are in the context graph, path strategies of up to <i>N</i> steps can be modeled. For example, a context graph of depth 2 is shown in <a href="chapter_3.xhtml#fig3-2">figure 3.2</a>.</p>
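The layer-by-layer back-crawl can be sketched as follows. The backlink lookup is stubbed out with a dictionary (in the CFC it was served by search-engine backlink queries, and the page names here are invented); duplicating a shared parent once per child produces the induced tree in which every layer-(<i>i</i> + 1) node links to exactly one layer-<i>i</i> node.

```python
# Stub: maps a page to the pages that link to it (its parents).
backlinks = {
    "seed": ["p1", "p2"],
    "p1": ["q1"],
    "p2": ["q1", "q2"],  # q1 is a common parent of p1 and p2
}

def build_context_graph(seed, depth, max_per_layer=1000):
    layers = [[seed]]   # layer 0 holds the seed document
    edges = []          # (parent, child) pairs of the induced tree
    for _ in range(depth):
        next_layer = []
        for child in layers[-1]:
            for parent in backlinks.get(child, []):
                # A parent shared by two children appears twice, once per child.
                next_layer.append(parent)
                edges.append((parent, child))
        layers.append(next_layer[:max_per_layer])  # cap/sample large layers
    return layers, edges

layers, edges = build_context_graph("seed", depth=2)
```

With the stubbed backlinks, layer 2 contains two copies of `q1`, one per child it leads to, which is precisely the duplication convention described above.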
<p>By constructing a context graph, the crawler gains knowledge about topics that are directly or indirectly related to the target topic, as well as a very simple model of the paths that relate these pages to the target documents. As expected, in practice, we find that the arrangement of the nodes in the layers reflects any hierarchical content structure. Highly related content typically appears near the center of the graph, while the outer layers contain more general pages. For example, if we want pages on poker players and start with a seed webpage describing a particular player (e.g., Phil Hellmuth), the first layer may contain pages listing World Series of Poker champions, the next layer may contain pages describing a broader set of poker players, and the next layer after that may just describe the game of poker. Similarly, if we are given a professor's page as a seed page, the first layer would likely contain pages where the professor is mentioned (perhaps a class directory, as well as the group page of the professor's research group); the second layer would contain department-level pages; and the third layer may contain university-level pages. As a result, when the crawler discovers a page with content that occurs higher up in the hierarchy, it can use its knowledge of the graph structure to guide the search toward the target pages.</p>
<p><span aria-label="66" id="pg_66" role="doc-pagebreak"/>Once context graphs for all seed documents have been built, the corresponding layers from the various context graphs are combined, yielding a layered structure known in the original paper as a <i>Merged Context Graph</i> (MCG). This graph does not have to be connected, although in practice it is rare for any two seed nodes not to share an undirected path.</p>
<p class="TNI-H3"><b>3.3.1.2Classifier Training</b>The next stage builds a set of classifiers for assigning any document retrieved from the web to one of the layers in the MCG, as well as for quantifying the classifiers' belief in the assignment. The classifiers require a feature representation of the documents on which to operate. The original CFC implementation uses keyword indexing of each document using a (now not uncommon) modification of tf-idf called <i>reduced tf-idf</i>, which only uses the 40 highest scoring components in the tf-idf vector representation of the document. This process ensures numerical stability, reduces the amount of training data required, and can also make classification faster by keeping the representation low-dimensional. There is no reason why a modern VSM based on neural embeddings like word2vec or paragraph2vec cannot be used as a replacement.</p>
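A minimal sketch of the reduced tf-idf representation follows, assuming a plain tf × log(N/df) weighting; the original kept the 40 highest-scoring components per document, shrunk here to a small <i>k</i> so the toy corpus is legible.

```python
import math
from collections import Counter

def reduced_tfidf(docs, k=40):
    """Keep only the k highest-scoring tf-idf components per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency of each term
    reduced = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
        top = sorted(scores, key=scores.get, reverse=True)[:k]
        reduced.append({t: scores[t] for t in top})
    return reduced

# Toy corpus: in the first document, the repeated rare term "a" dominates.
vectors = reduced_tfidf(["a a b", "b c", "c d"], k=1)
```

Truncating to the top components is what keeps the vectors sparse and low-dimensional, which in turn is what makes the per-layer classifiers cheap to train and apply.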
<p>Given the representation, the classifier is constructed to assign any web document to a particular layer of the MCG. However, if the document is a poor fit for a given layer, the document should be discarded and labeled as <i>Other</i>. A major difficulty in implementing such a strategy (using a single classifier mapping a document to a set of <i>N</i> + 2 classes corresponding to the layers 0, 1<i>, <span class="ellipsis"></span>, N</i>, as well as <i>Other</i>) is the absence of a good model or training set for <i>Other</i>. To solve this problem, the authors of the CFC work proposed a modification of the Naive Bayes classifier for each layer; further details can be found in Diligenti et al. (2000). Ultimately, the authors chose to use the classifier of layer 0 as the ultimate <i>arbiter</i> of topical relevance for a given document.</p>
<p class="TNI-H3"><b>3.3.1.3Crawling Phase</b>Once the classifiers are trained, the crawler can utilize them by first organizing pages into a sequence of <i>N</i> + 2 queues, <i>N</i> being the maximum depth of the context graphs. The <i>ith</i> class (layer) is associated to the <i>ith</i> queue, with <i>i</i> ranging over 0, 1<i>, <span class="ellipsis"></span>, N</i>. Note that queue <i>N</i> + 1 is not associated with any class, but rather reflects assignments to <i>Other</i>. Furthermore, the 0<i>th</i> queue will ultimately store all the retrieved topically relevant documents.</p>
<p>Initially, all the queues are empty except for the “dummy” queue <i>N</i> + 1, which is initialized with the starting URL of the crawl. The crawler retrieves the page pointed to by the URL, computes the reduced vector representation, and extracts all the hyperlinks. It then downloads all the children of the current page. All downloaded pages are classified individually and assigned to the queue corresponding to the winning layer (or the class <i>Other</i>). Each queue is maintained in a sorted state using the likelihood score associated with its documents. When the crawler needs the next document to move to, it pops from the first nonempty queue. The documents that are expected to rapidly lead to targets are <span aria-label="67" id="pg_67" role="doc-pagebreak"/>therefore followed before documents that will (with high probability) require more steps to yield relevant pages. However, depending on the relative queue thresholds, high-confidence pages from queues representing longer download paths are also frequently retrieved.</p>
<p>The setting of the classifier thresholds that determine whether a document gets assigned to the class denoted <i>Other</i> determines the retrieval strategy. In the original paper's default implementation, the likelihood function for each layer is applied to all the patterns in the training set for that layer. The confidence threshold is then set to be equal to the minimum likelihood obtained on the training set for the corresponding layer.</p>
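The queue discipline of the crawling phase can be sketched as follows, with the per-layer classifiers stubbed out by a function that simply reads a precomputed layer and likelihood off each document (both invented for illustration). Python's `heapq` is a min-heap, so likelihoods are negated to pop the highest-confidence page first.

```python
import heapq

N = 2  # maximum context-graph depth in this toy setup
queues = [[] for _ in range(N + 2)]  # layers 0..N, plus Other at index N + 1

def classify(doc):
    """Stub for the per-layer classifiers: returns (layer, likelihood),
    with layer N + 1 standing in for the class Other."""
    return doc["layer"], doc["likelihood"]

def enqueue(doc):
    layer, likelihood = classify(doc)
    heapq.heappush(queues[layer], (-likelihood, doc["url"]))

def next_url():
    """Pop from the first nonempty queue: closest-to-target layers first,
    and within a queue, highest likelihood first."""
    for q in queues:
        if q:
            return heapq.heappop(q)[1]
    return None

enqueue({"url": "a", "layer": 1, "likelihood": 0.2})
enqueue({"url": "b", "layer": 1, "likelihood": 0.9})
enqueue({"url": "c", "layer": 0, "likelihood": 0.5})
# Pops "c" (layer 0) first, then "b" before "a" within layer 1.
```

The threshold logic described above would sit inside `classify`: a document whose best likelihood falls below its layer's training-set minimum is routed to the Other queue instead.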
<p>During the crawling phase, new context graphs can periodically be built for every topically relevant element found in queue 0. An alternative is to configure the focused crawler to ask for the immediate parents of every document as it appears in queue 0, and simply insert these into the appropriate queue <i>without</i> recomputing the MCG and classifiers. In this way, it is possible to continually exploit back-crawling at a reasonable computational cost.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec3-3-2"/><b>3.3.2Domain Discovery Tool</b></h3>
<p class="noindent">The simplicity of keyword queries is a strength and also a limitation. In theory, an analyst could improve the relevance of the results by issuing more specific queries. For example, if an e-commerce analyst were interested in collecting all reviews about a product over the web and the profiles of users using the product, she could potentially search all forums and user IDs associated with the sale of the product. To query for influencers, she could also search for whether a user is showing up in multiple forums and discussion boards. However, such queries cannot be expressed using ordinary search engines like Google. Even focused crawling tools like the CFC, or Semantic Web (SW) search engines like Swoogle (see the section entitled “Software and Resources” at the end of the chapter), do not allow users to express such queries unless the data has already been gathered and parsed using an ontology.</p>
<p>To allow domain discovery over web resources, Krishnamurthy et al. (2016) recently proposed a Domain Discovery Tool (DDT) that serves as a visual analytics framework for interactive domain discovery. The framework augmented ordinary search engine functionality by directly supporting analysts in exploratory search. Specifically, DDT supports exploratory data analysis of webpages and translates analyst interactions with web data into a <i>computational model</i> of the domain of interest. One example of such a model that we saw in the context of systems like the enhanced HMM Crawler is a trained machine learning classifier. However, the model constructed by DDT is richer, leveraging more advanced machine learning. DDT is an open-source system and has been released on GitHub.</p>
<p>Just like the other focused crawling tools, DDT is a heavily engineered system that has several components working in tandem to achieve its goals. DDT is designed as a client-server model that has a web-based JavaScript interface. This ensures that there is no client-side setup. Unlike most of the other systems, therefore, DDT has a strong <span aria-label="68" id="pg_68" role="doc-pagebreak"/>advantage: the analysts using the system could have a <i>nontechnical</i> background. <a href="chapter_3.xhtml#fig3-3" id="rfig3-3">Figure 3.3</a> illustrates the intuitiveness of the interface when someone is trying to crawl and discover relevant webpages around an “Ebola” domain that starts with just one or more keyword specifications. The other important components involved in DDT are described next.</p>
<div class="figure">
<figure class="IMG"><a id="fig3-3"/><img alt="" src="../images/Figure3-3.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig3-3">Figure 3.3</a>:</span> <span class="FIG">The interface of the NYU DDT, used for discovering relevant webpages over the web for an Ebola-related domain.</span></p></figcaption>
</figure>
</div>
<p class="TNI-H3"><b><span aria-label="69" id="pg_69" role="doc-pagebreak"/>3.3.2.1Data Gathering and Persistence</b>Domain experts can use a variety of methods to gather pages of interest for analysis. First, DDT allows users to query the web using a search engine like Google. The users can leverage the large collections that were already crawled by these search engines to discover interesting pages across the web using simple queries. However, because search engines return only the URLs and associated snippets, DDT downloads the HTML content given by the URLs and stores it in the selected domain's index (using an Elasticsearch infrastructure). This content can be used later for analysis of the domain, and also as seeds for focused crawlers. Note also that because downloading a large number of pages (including raw HTML content) can be costly in terms of time, this operation is performed by DDT in the background.</p>
<p>Second, DDT also provides a mechanism for domain experts to directly incorporate their domain knowledge by allowing them to provide URLs either through the input box provided or by uploading a file containing a list of URLs. DDT then downloads the pages corresponding to these URLs and makes them available through its interface.</p>
<p>Finally, DDT automates the tedious and manual process that users often undertake when following links forward and backward from the pages they explore. Given a page, crawling backward retrieves the backlinks (the resources that contain a link to the selected page) of that page and then downloads the corresponding pages. Forward-crawling from a selected page retrieves all the pages whose links are contained on that page. Intuitively, these operations are effective and valuable because there is a significant probability that the backlink of a page, as well as the page itself, will contain links to other relevant pages. This same intuition was also relied upon earlier by the CFC.</p>
<p class="TNI-H3"><b>3.3.2.2Visual Summarization of Search Results</b>Similar to data gathering, DDT provides several mechanisms to give an analyst an overview of the pages they have explored. An important mechanism is Multidimensional Scaling (MDS); instead of displaying a list of snippets, DDT applies MDS to create a visualization of the retrieved pages (that maintains the relative similarity and dissimilarity of the pages). This allows the user to more easily select, study, and annotate a set of pages.</p>
<p>Because initially all pages are unlabeled, DDT needs an unsupervised learning algorithm to group pages by similarity. While a variety of clustering methods are applicable here, including K-Means and agglomerative clustering, the MDS implementation in DDT is currently achieved by principal component analysis (PCA) of the documents. Furthermore, to improve scalability, DDT uses Google's word2vec 300-dimensional pretrained vectors that were trained on part of a Google News data set comprising about 100 billion words, <span aria-label="70" id="pg_70" role="doc-pagebreak"/>instead of using a vanilla tf-idf approach. A simple averaging-like formula is used to derive a document embedding (also with 300 dimensions) by combining the embeddings of all the words in the document. In turn, this yields a document matrix (over a corpus of <i>n</i> documents) of size<sup><a href="chapter_3.xhtml#fn2x3" id="fn2x3-bk">2</a></sup> <i>n</i> × 300, which is much smaller than the traditional <i>document</i> × <i>term</i> matrix. This smaller matrix is sent to the MDS algorithm as input.</p>
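This pipeline can be sketched in a few lines: average word embeddings into a document vector, then project the <i>n</i> × 300 document matrix down to two dimensions via PCA (computed here with an SVD). The random embedding table below is an illustrative stand-in for the pretrained word2vec vectors.

```python
import numpy as np

# Stand-in for pretrained 300-dimensional word2vec vectors.
rng = np.random.default_rng(0)
embedding = {w: rng.standard_normal(300) for w in
             ["ebola", "outbreak", "virus", "cake", "recipe", "baking"]}

def doc_vector(text):
    """Average the embeddings of known words (zeros if none are known)."""
    vecs = [embedding[w] for w in text.lower().split() if w in embedding]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

def pca_2d(doc_matrix):
    """Project the n x 300 document matrix onto its top two principal axes."""
    centered = doc_matrix - doc_matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

docs = ["ebola outbreak", "ebola virus outbreak", "cake recipe", "recipe baking"]
coords = pca_2d(np.array([doc_vector(d) for d in docs]))  # shape (4, 2)
```

Even with random stand-in embeddings, documents that share words end up near each other in the 2-D projection, which is the property the MDS-style scatterplot exploits to let an analyst select and annotate groups of similar pages.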
<p>Along with visualizations, DDT also dynamically updates and shows real-time page statistics, such as the total number of pages in the domain, number of pages marked as relevant, irrelevant, or “neutral” (pages that have yet to be annotated), and number of pages downloaded in the background since the last update. Other summarization facilities include dashboards for these page statistics, as well as richer descriptive statistics over the entire content of the domain, such as the distribution summary of sites, the distributions and intersections of the search queries issued, summary of page tags and their intersections, and number of pages added to the domain over time. Similarly, the topic distribution dashboard visualizes the various topics contained in the domain, with the topics generated using the Topik toolkit. Using LDAviz, the DDT shows the topics, the overlap of topics, and the most frequent words contained in each topic.</p>
<p class="TNI-H3"><b>3.3.2.3User Annotations</b>DDT allows users to provide feedback for documents and extracted terms. Along with marking individual pages, users can select a group of documents for analysis and mark these documents as relevant or irrelevant. Users may also annotate pages with user-defined tags, which are especially useful for defining subdomains (e.g., <i>Information Extraction</i> could be a subdomain under the domain <i>Natural Language Processing</i>).</p>
<p>As with documents, keywords and phrases extracted by DDT can also be annotated by a user as relevant or irrelevant. Based on the relevant terms, the system reranks the untagged keywords and phrases by relatedness to the relevant terms (specifically, by using Bayesian sets). This allows capture of more related terms and phrases to help the user understand the domain further, as well as formulate new search queries. For more details on how the Bayesian set algorithm was adapted for this purpose, we refer the interested reader to the original paper. Users may also incorporate background knowledge by adding particular keywords and phrases customized for the domain. To guide users and provide them with a better understanding of the importance and discriminative power of extracted terms, DDT shows the percentage of relevant and irrelevant pages on which the term appears.</p>
<p class="TNI-H3"><b>3.3.2.4Domain Model and Focused Crawling</b>DDT uses the pages marked “relevant” and “irrelevant” (as positive and negative examples, respectively) to support the construction of a page classifier that serves as a model for the domain. This classifier, together with a <span aria-label="71" id="pg_71" role="doc-pagebreak"/>set of seeds (relevant pages), can be used to configure a focused crawler. DDT supports the ACHE Crawler, developed within the same group.</p>
<p class="TNI-H3"><b>3.3.2.5Summary</b>Despite its relatively simple mechanisms, early experiments and user feedback show that the DDT framework can considerably improve the quality and speed of domain discovery. DDT is a new exploratory data analysis framework that combines techniques from information retrieval, data mining, and interactive visualization to guide users in exploratory search. Many future avenues for research remain, not least of which is a comprehensive user study with a larger number of participants of diverse backgrounds. Krishnamurthy et al. (2016) are also planning to conduct evaluations of DDT's effectiveness on <i>nonweb</i> corpora.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec3-4"/><b>3.4Concluding Notes</b></h2>
<p class="noindent">Domain discovery is a fundamental step for any organization or individual that is looking to do end-to-end KG construction, querying, and analytics. The old saying “garbage in, garbage out” is relevant for domain discovery. A raw data set that does not contain useful information (i.e., that can help the user gain functional insights) to begin with will not become useful once it has been structured into a KG using the techniques that we will cover in subsequent chapters. The more difficult a domain is, the more important it is to do good domain discovery and find all relevant data on the web. Because the web is more difficult to crawl than ever, due to both size and the advent of captcha technology, a good domain discovery tool must have strong engineering and conceptual foundations. Focused crawling continues to be the favored approach to acquiring a domain-specific corpus, but the emergence of ecosystems like the Semantic Web, Linked Open Data, the Twitter stream, and Schema.org, as well as public resources like the Web Data Commons project and Wikipedia, indicate that it may not always be necessary to scrape raw webpages to acquire relevant data. While focused crawling design has remained relatively stable since the late 1990s, new techniques for capturing user interactions and allowing users to express their intent in finer-grained modalities mean that the book is not closed on this subject. Particularly exciting is the application of some of this work to illicit domains like securities fraud and human trafficking that have traditionally proved hard to tackle using mainstream artificial intelligence (AI) technology of any kind.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec3-5"/><b>3.5Software and Resources</b></h2>
<p class="noindent">Much of the classic work on focused crawling that was covered in this chapter does not have modern, open-source implementations that we are aware of; given how much the web has changed since even just 10 years ago, it is fruitful to reimplement those systems using recent versions of programming languages like Python and Java. However, some of the <span aria-label="72" id="pg_72" role="doc-pagebreak"/>more recent systems are publicly available. A good example is the DDT tool covered in the section entitled “Domain Discovery Tool.” At this time, it is available on GitHub,<sup><a href="chapter_3.xhtml#fn3x3" id="fn3x3-bk">3</a></sup> and several demonstration videos<sup><a href="chapter_3.xhtml#fn4x3" id="fn4x3-bk">4</a></sup> have been published on YouTube as well.</p>
<p>In some cases, however, we believe that implementing the classic systems in a modern setting has merit. Systems that the interested reader may want to look at include Ariadne (proposed in 2001 as a generic framework for focused crawling, and implemented in Java), CATYRPEL (one of the first proposed systems, also modular, that extended existing work in focused document crawling by not only using keywords for the crawl, but also leveraging high-level background knowledge with concepts and relations, which are compared with the text of the searched page), LSCrawler (a framework for a focused web crawler that is based on <i>link semantics</i>, and that has achieved better recall in practice than systems like Ariadne that do not incorporate any notion of link semantics or semantic similarity), and for the Semantic Web-focused user, Swoogle. Swoogle is not a focused crawling tool; instead, it was proposed as an SW search and metadata engine. We believe that it qualifies as a domain discovery tool because it is a crawler-based indexing and retrieval system for the Semantic Web [i.e., for web documents in the Resource Description Framework (RDF) or Web Ontology Language (OWL)]. It extracts metadata for each discovered document and computes relations between documents. When the original Swoogle paper was published, it was a prototype SW search engine that could facilitate several tasks, including finding appropriate ontologies, finding instances, and characterizing or profiling the Semantic Web (e.g., by collecting metadata, especially interdocument relations that provided a wealth of knowledge on how different ontologies were referenced and how connected the Semantic Web was from an empirical viewpoint).</p>
<p>For those looking to do lightweight scraping or crawling, a number of convenient tools are available in languages like Python. An excellent example is Beautiful Soup, accessed at <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">https://<wbr/>www<wbr/>.crummy<wbr/>.com<wbr/>/software<wbr/>/BeautifulSoup<wbr/>/bs4<wbr/>/doc<wbr/>/</a>, which is popular enough that its documentation has been translated into multiple languages by its users. Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with most web parsers and allows intuitive programmatic ways for navigating, searching, and modifying the parse tree. Numerous tutorials are available. Another good resource is Scrapy (Scrapy.org), which is described as an open-source and collaborative framework for extracting data from websites. Scrapy can be installed in a Python environment using pip, and it allows users to do a number of interesting things, including building and running their own web spiders. Documentation is fairly complete, and good tutorials are available throughout <span aria-label="73" id="pg_73" role="doc-pagebreak"/>the web. Another good example is Import.io, accessed at <a href="https://import.io/">https://<wbr/>import<wbr/>.io<wbr/>/</a>, which is an interactive crawling platform.<sup><a href="chapter_3.xhtml#fn5x3" id="fn5x3-bk">5</a></sup></p>
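For instance, extracting all hyperlinks from a page, the basic operation underlying every crawler in this chapter, takes only a few lines with Beautiful Soup. The HTML string below stands in for a downloaded page; in practice, the page body would come from an HTTP client.

```python
from bs4 import BeautifulSoup

# A literal page standing in for the body of an HTTP response.
html = """
<html><body>
  <a href="https://example.org/a">Page A</a>
  <a href="https://example.org/b">Page B</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect the href attribute of every anchor tag on the page.
links = [a["href"] for a in soup.find_all("a")]
```

A focused crawler would feed these extracted links, after classification and prioritization, back into its frontier.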
<p>Although the examples given here are provided for Python, similar such packages are available in other languages as well, though they may not always be as easy to use as the abovementioned tools. We also note that there are command-line tools available in Unix-like systems for fetching webpages, an example being Scrape (<a href="https://github.com/huntrar/scrape">https://<wbr/>github<wbr/>.com<wbr/>/huntrar<wbr/>/scrape</a>), as well as some of the tools discussed on webpages such as <a href="https://linuxhint.com/top_20_webscraping_tools/">https://<wbr/>linuxhint<wbr/>.com<wbr/>/top<wbr/>_20<wbr/>_webscraping<wbr/>_tools<wbr/>/</a>. However, users should be cautious of deploying any of these tools at a truly large scale or for commercial purposes without first becoming familiar with their license or copyright status, as well as testing them in production scenarios. We also briefly note that, for companies or organizations that have a high stake in collecting, storing, or indexing web data (such as Google), the web crawling mechanisms are likely customized, internally developed, and refined.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec3-6"/><b>3.6 Bibliographic Notes</b></h2>
<p class="noindent">What has been referred to (and described) as domain discovery in this chapter has a long and interesting history going back to the founding days of the web, and the concept was not necessarily always denoted as <i>domain discovery</i>. <i>Focused crawling</i> was a much more common term owing to the popularity of focused crawling techniques for building a domain-specific corpus from the web. More recently, as we described in the latter part of the chapter, the techniques have become significantly more diverse.</p>
<p>At the beginning of the chapter, we devoted some space to statistics about the web, as well as to why domain discovery on the web is both an important and difficult problem. Many of the statistics discussed can be found in several resources on the web; examples include Internet Live Stats (<a href="https://www.internetlivestats.com">https://<wbr/>www<wbr/>.internetlivestats<wbr/>.com</a>), Statista (<a href="https://www.statista.com/topics/1145/internet-usage-worldwide/">https://<wbr/>www<wbr/>.statista<wbr/>.com<wbr/>/topics<wbr/>/1145<wbr/>/internet<wbr/>-usage<wbr/>-worldwide<wbr/>/</a>), and references such as Lyman and Varian (2003). A web search could reveal more recent statistics than were cited in this chapter.</p>
<p>Early papers on crawling were already surprisingly sophisticated, even in the 1990s. Some of the papers that we reference next, including those based on link analysis and machine learning paradigms like reinforcement learning, already existed by the early 2000s. Some good examples of early crawlers that used a variety of interesting techniques, algorithms, and methodologies include the collaborative web crawling approach described by Teng et al. (1999) as the IBM Grand Central Station project, a “hidden web” crawler proposed by Raghavan and Garcia-Molina (2001), ontology-based web “agents” <span aria-label="74" id="pg_74" role="doc-pagebreak"/>proposed by Luke et al. (1997) and Spector (1997), image search engines proposed by Sclaroff (1995), and even entire frameworks like SPHINX, proposed by Miller and Bharat (1998) to satisfy the need for a development environment that could enable practitioners to develop and test their own crawlers.</p>
<p>Much of the material in this chapter relied on the analysis by Batsakis et al. (2009), who describe the various kinds of crawlers (e.g., semantic, learning-based) in detail, and also mainly discuss ways to improve the performance of focused web crawlers. One of the primary approaches described in this chapter (focused crawling using context graphs) was originally described by Diligenti et al. (2000). Other relevant systems include Ariadne, CATYRPEL, and LSCrawler, described by Ester et al. (2001), Ehrig and Maedche (2003), and Yuvarani et al. (2006), respectively. A good survey of focused crawling was provided by Novak (2004).</p>
<p>In addition, we mentioned several resources from the NLP community that have been increasingly used in some crawlers. Good references for some of these, including sequence models and WordNet, include Miller (1998), Rabiner and Juang (1986), and Sutton et al. (2012). An influential work on using WordNet to measure the relatedness of concepts is by Pedersen et al. (2004).</p>
<p>Chau and Chen (2003) provide a review of personalized and focused web spiders, defining a web spider as a software program that traverses the web “information space” by following hypertext links and retrieving documents using standard HTTP. This definition itself has a precedent in Cheong (1996). As Chau and Chen (2003) describe, web spider research directions have tended to fall along three dimensions (speed/efficiency, spidering policy, and IR), with the last attracting particularly prolific research.</p>
<p>Many crawlers and spiders have used link analysis (or more correctly, have been influenced by link analysis techniques). Relevant historical works that laid the groundwork for these techniques for many years to follow include Spertus (1997), Pirolli et al. (1996), Cho et al. (1998), Chakrabarti et al. (1999), and Weiss et al. (1996). Other creative crawlers used anchor text (clickable, emphasized text of an outgoing link in a webpage) or other cues, such as the text appearing near a hyperlink; see Amitay (1998), Armstrong et al. (1995), and Rennie et al. (1999). The last of these proposed reinforcement learning to efficiently crawl the web, which is worth noting considering the renaissance that reinforcement learning has been enjoying in game-playing. In other work, algorithms such as PageRank and HITS [Brin and Page (1998); Kleinberg (1999)], which give more weight to links from authoritative sources, have also led to revolutionary advances in modern web search. They are also relevant for domain discovery, especially with respect to the crawlers described in the “Focused Crawling” section in this chapter.</p>
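<p>To make the intuition behind link analysis concrete, the following is a minimal sketch of PageRank-style power iteration on a toy link graph; the graph, damping factor, and iteration count are invented for illustration and are not drawn from any of the works cited above.</p>

```python
# Toy PageRank by power iteration. The four-page link graph and the
# damping factor below are illustrative inventions, not real data.
def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # uniform starting scores
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:                            # split p's rank over its out-links
                share = rank[p] / len(outs)
                for q in outs:
                    new[q] += damping * share
            else:                               # dangling page: spread rank uniformly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# B and C both link to A, so A accumulates the most "authority".
graph = {"A": ["B"], "B": ["A"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # A
```

<p>A focused crawler can use such scores to prioritize its frontier, fetching pages pointed to by high-ranking sources first.</p>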
<p>Since the early 2000s, the Semantic Web has also had an impact on crawling research. Earlier, we mentioned CATYRPEL, which is an ontology-focused crawler. Another example of an SW-inspired system is Swoogle, described by Ding et al. (2004), which <span aria-label="75" id="pg_75" role="doc-pagebreak"/>was specifically designed to be a crawler-based indexing and IR system for the Semantic Web. Other examples include Multicrawler and BioCrawler [Harth et al. (2006); Batzios et al. (2008)], as well as other research such as by Hogan et al. (2011) and Rana and Tyagi (2012).</p>
<p>The DDT, an advanced system for domain discovery, was developed and described by Krishnamurthy et al. (2016). Other related studies by Felix (2019) and Moraes et al. (2017) would also be relevant for the interested reader.</p>
<p>The field of IR, detailed further in chapter 11, is intimately related to domain discovery, in that good search engines have to determine user intent and retrieve (and rank) a set of pages or documents that are <i>relevant</i>. Relevance was also a consideration for focused crawling and other domain discovery tools, and we noted the importance attached to it by both semantic and learning crawlers, which try to estimate it from various extracted features. Evaluation is also an important concern; an excellent reference is Fox et al. (2005).</p>
<p>More recently, crawling data (especially images) by third parties has been subject to ethical concerns. For the interested reader, we particularly recommend the work by Thelwall and Stuart (2006), Krotov and Silva (2018), and Sun et al. (2010).</p>
<p>Crawling has also become harder to do in an off-the-shelf fashion due to many websites now actively preventing “robots” from accessing some of their pages by imposing captchas and other “human-proving” tasks; see, for example, Von Ahn et al. (2003), Gossweiler et al. (2009), Egele et al. (2010), and Singh and Pal (2014). Whether such methods are truly effective remains to be seen, as computer vision, NLP, and other machine learning research (and tooling) continue to undergo rapid advancement and are being applied to “solve” captchas, as evidenced by Ye et al. (2018) and Csuka et al. (2018).</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec3-7"/><b>3.7 Exercises</b></h2>
<ul class="numbered">
<li class="NL">1. In this and the next several questions, you will utilize web crawlers to collect webpages and extract data from the Internet Movie Database (IMDb) (<a href="https://www.imdb.com/">https://<wbr/>www<wbr/>.imdb<wbr/>.com<wbr/>/</a>). As we studied in this chapter, a web crawler is a program or <i>bot</i> that systematically browses the web, typically for the purpose of web indexing (web spidering). It starts with a list of seed URLs to visit, and as it visits each webpage, it finds the links in that webpage, then visits those links and repeats the entire process. We recommend using Scrapy (<a href="https://scrapy.org">https://<wbr/>scrapy<wbr/>.org</a>), a Python-based crawling framework, for these exercises. As a first step, download (and test) Scrapy and do a brief tutorial to ensure that you are familiar with the basics.<br/><i>Hint: To locate and extract the attributes, you may have to see the source HTML of some example pages. While there are some off-the-shelf tools (including regular expressions) that you can use for the extraction, if you are not confident about extracting data from HTML, you should return to this set of exercises after reading chapter 5.</i></li>
<li class="NL"><span aria-label="76" id="pg_76" role="doc-pagebreak"/>2. Crawl at least 5,000 webpages of science fiction movies/shows in IMDb using Scrapy. Extract and generate, for each webpage, the attributes shown in the left panel of the image. Make sure to store your crawled data into a JSON-Lines (.jl) file. In this file format, each line is a valid JSON object (dictionary) that holds the attributes listed in the left panel for a single crawled webpage. More details on JSON can be found at <a href="https://www.json.org/json-en.html">https://<wbr/>www<wbr/>.json<wbr/>.org<wbr/>/json<wbr/>-en<wbr/>.html</a>. While crawling and storing the webpages, did you encounter unexpected problems? If you did, how did you get around them? <i>Hint: While crawling, please make sure you obey the websites politeness rules (e.g., sleep time between requests) in order to avoid getting banned.</i></li>
</ul>
<figure class="IMG"><img alt="" src="../images/pg76-1.png" width="450"/>
</figure>
<ul class="numbered">
<li class="NL">3. Similar to the previous task, crawl at least 5,000 webpages of cast members (i.e., actors and actresses) in IMDb using Scrapy. Extract and generate the attributes in the right panel for each cast webpage. Store the crawled data into a JSON-Lines file.</li>
<li class="NL">4. In the context of the two crawling tasks above, answer the following questions using no more than two sentences for each question:</li>
</ul>
<p class="AL">(a) What seed URL(s) did you use for each task?</p>
<p class="AL">(b) How did you manage to collect only movie/show or cast pages?</p>
<p class="AL">(c) Did you need to discard irrelevant pages? If so, how did you do it?</p>
<p class="AL">(d) Did you collect the required number of pages? If not, please describe and explain the issues you encountered.</p>
<ul class="numbered">
<li class="NL">5. Ordinarily, focused crawlers (and many other types of crawlers as well) take traditional inputs such as some starting (or <i>seed</i>) URLs and, possibly, a topic description (e.g., a list of keywords). We suggested, however, that other innovative mechanisms can be used for representing a users intent. Think of at least two such novel input types. What domains would they be particularly useful for?</li>
</ul>
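<p>As a standard-library-only sketch of the JSON-Lines storage format required in exercises 2 and 3 (the record fields below are invented placeholders, not IMDbs actual attribute set), each crawled page becomes one JSON object written on its own line:</p>

```python
# JSON-Lines (.jl) round trip using only the standard library.
# The record fields are invented placeholders, not IMDb's real schema.
import io
import json

records = [
    {"title": "Example Film", "year": 1999, "genres": ["Sci-Fi"]},
    {"title": "Another Film", "year": 2021, "genres": ["Sci-Fi", "Drama"]},
]

buf = io.StringIO()                      # stands in for open("movies.jl", "w")
for rec in records:
    buf.write(json.dumps(rec) + "\n")    # one JSON object per line

# Reading back: each line parses independently, so large files can be streamed.
loaded = [json.loads(line) for line in buf.getvalue().splitlines()]
print(loaded == records)  # True
```

<p>Because each line is independent, a crawler can append records as pages arrive, and downstream tools can process the file without loading it all into memory.</p>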
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_3.xhtml#fn1x3-bk" id="fn1x3">1</a></sup>Another well-known VSM designed specifically for documents (and which we do not focus on as much in this book) is the topic model, a prominent example being Latent Dirichlet Allocation (LDA).</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_3.xhtml#fn2x3-bk" id="fn2x3">2</a></sup>A matrix of size <i>u</i> × <i>v</i> has <i>u</i> rows and <i>v</i> columns.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_3.xhtml#fn3x3-bk" id="fn3x3">3</a></sup><a href="https://github.com/ViDA-NYU/domain_discovery_tool">https://<wbr/>github<wbr/>.com<wbr/>/ViDA<wbr/>-NYU<wbr/>/domain<wbr/>_discovery<wbr/>_tool</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_3.xhtml#fn4x3-bk" id="fn4x3">4</a></sup><a href="https://youtu.be/XmZUnMwI10M">https://<wbr/>youtu<wbr/>.be<wbr/>/XmZUnMwI10M</a>, <a href="https://youtu.be/YKAI9HPg4FM">https://<wbr/>youtu<wbr/>.be<wbr/>/YKAI9HPg4FM</a>, <a href="https://youtu.be/HPX8lR8QS4">https://<wbr/>youtu<wbr/>.be<wbr/>/HPX8lR8QS4</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_3.xhtml#fn5x3-bk" id="fn5x3">5</a></sup>We provide a set of suggested exercises at the end of this chapter for the reader curious about Scrapy in the context of a real-world data set and task.</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html>