<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch5" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch5"><span aria-label="97" id="pg_97" role="doc-pagebreak"/>5</h1>
<h1 class="chapter-title"><b>Web Information Extraction</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b> Although information extraction (IE) has been explored most extensively in natural-language domains [particularly in the context of Named Entity Recognition (NER), as we described in chapter 4], the spectacular advent of the web from the early 1990s onward led to a large body of research on extracting key information from web sources. Today, no other repository of public information is as vast or diverse as the corpus of searchable and indexable pages on the World Wide Web. This repository is now so vast that sophisticated methods are required simply to find and scrape domain-specific information (chapter 3) that could be of help in building a knowledge graph (KG) that has both high coverage and accuracy. Once we find these pages, we must develop IE modules, ultimately yielding a satisfactory KG that could be deployed in downstream applications like question answering or querying. In this chapter, we cover influential web IE approaches and systems that were first proposed in the 1990s and have since been refined and combined with other methods.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec5-1"/><b>5.1 Introduction</b></h2>
<p class="noindent">In chapter 4, we described IE methods for extracting key pieces of data from natural-language sources. Natural-language documents are often described as unstructured because each document is akin to a long sequence (whether it is interpreted as a sequence of characters, or of words and punctuation) and has no syntactic structure beyond the language model itself. In contrast, a table has a very well defined structure, with columns that have headers and rows generally representing individual entities and their attributes. In proper relational databases (RDBs), there are constraints on what kinds of values are allowed to be entered into a table, along with multitable constraints such as foreign keys and functional dependencies.</p>
<p>Web sources, which arguably constitute the most voluminous category of information over which KGs (whether domain-specific or open-domain) can be constructed today, tend to fall somewhere in between the two extremes described here. The input to a web IE task tends to be either structured or semistructured rather than purely natural-language documents as in the previous chapter. In some respects, this makes the extraction problem easier, but in many other respects, it makes it much harder. Web IE for a real-world webpage is illustrated in <a href="chapter_5.xhtml#fig5-1" id="rfig5-1">figure 5.1</a>. We use bounding boxes to illustrate the <span aria-label="98" id="pg_98" role="doc-pagebreak"/>key pieces of information that should be extracted into a domain-specific KG (describing lawyers, their contact information, and their practices in California; see <a href="chapter_5.xhtml#fig5-2" id="rfig5-2">figure 5.2</a> for a sample KG fragment from this page).</p>
<div class="figure">
<figure class="IMG"><a id="fig5-1"/><img alt="" src="../images/Figure5-1.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig5-1">Figure 5.1</a>:</span> <span class="FIG">A practical illustration of the web IE problem. We use bounding boxes to illustrate the elements that would need to be extracted. In some cases, the elements (like the “Contact us” link) may not be visible on the page itself, but they are obtained from the HTML via an <i>&lt;a href&gt;</i> tag or property. The KG fragment is illustrated in <a href="chapter_5.xhtml#fig5-2">figure 5.2</a>. The original webpage was taken from <i>lawyers.findlaw.com/lawyer/firm/auto-dealer-fraud/los-angeles/california</i>.</span></p></figcaption>
</figure>
</div>
<div class="figure">
<figure class="IMG"><a id="fig5-2"/><img alt="" src="../images/Figure5-2.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig5-2">Figure 5.2</a>:</span> <span class="FIG">A KG fragment containing the extracted information from the webpage illustrated in <a href="chapter_5.xhtml#fig5-1">figure 5.1</a>.</span></p></figcaption>
</figure>
</div>
<p>One of the difficulties illustrated in <a href="chapter_5.xhtml#fig5-1">figure 5.1</a> is that structure can be as important as content (i.e., on some webpages, the only clue that a particular phrase is a phone number is its position on the page, together with surrounding contextual elements). We provide details on some of the challenges later in this section, but one of them, which is mostly a product of how modern webpages are designed and rendered, is the presence of snippets of code and irrelevant content like advertisements (often dynamically rendered) on the page. Thus, an important task when doing web IE is <i>parsing the structure</i> of the webpage. Within a parsed substructure, such as a block of text, other IE techniques such as NER may have to be applied. For attributes like phone numbers and dates, regular expressions may be enough to normalize the value.</p>
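<p>To make the last point concrete, the following is a minimal sketch (in Python, with a hypothetical <code>normalize_phone</code> helper; the layouts handled here are our own assumptions, not an exhaustive treatment) of how a regular expression can normalize phone numbers found in scraped text:</p>

```python
import re

# Hypothetical normalizer: collapses several common ten-digit US phone
# layouts, e.g., "(310) 555-0199" or "310.555.0199", to one canonical form.
PHONE_RE = re.compile(r"\(?(\d{3})\)?[\s.-]*(\d{3})[\s.-]*(\d{4})")

def normalize_phone(text):
    m = PHONE_RE.search(text)
    if m is None:
        return None
    return "-".join(m.groups())

print(normalize_phone("Call (310) 555-0199 today"))  # 310-555-0199
```

<p>Attributes like dates can be handled analogously, with one pattern per accepted layout.</p>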
<p>Unfortunately, a web source is rarely as structured as a pure RDB, though some webpages are highly uniform and templatized, making them more similar to structured tables than to natural-language documents. By and large, it is popular to refer to web IE inputs as <i>semistructured</i> precisely because their structure and content are so heterogeneous, both in terms of substance (the actual material on the webpage) and because of the way the material is laid out. There is also disagreement in the research community about how to formulate definitions of phrases like “semistructured” or “free-text.” For example, some researchers treat postings (e.g., Airbnb rentals) on newsgroup websites or Facebook listings, medical records, and equipment and maintenance logs as semistructured, while HTML pages are treated as structured. Database researchers, as noted earlier, have a stronger view of what it means for data to be structured, and they largely treat only information stored in databases (with many limiting themselves to RDBs) as structured data. For such researchers, even XML documents with well-defined schemas are semistructured, while HTML pages are unstructured.</p>
<p>Rather than argue about the merits or demerits of these terminological differences, we adopt a pragmatic viewpoint in this chapter. We consider HTML data as semistructured owing to the fair amount of regularity in how the content is presented on HTML pages. In several cases, the regularity is even programmatic, the best example being pages from the <i>Deep Web</i>, which is a term used often for dynamic webpages that are generated from structured databases with some templates or layouts. For example, the set of book pages from Amazon has the same layout for the author, title, price, comments, and other details, and a given such page is dynamically generated based on a query made by a user on the main Amazon search interface.<sup><a href="chapter_5.xhtml#fn1x5" id="fn1x5-bk">1</a></sup> There are many other such situations where webpages <span aria-label="99" id="pg_99" role="doc-pagebreak"/>are generated from the same database with the same template (program) and form a page class. Less extreme but still fairly regular, many page classes are not dynamic, but they still draw upon a common template. This is the case when we look at faculty pages (or course pages) on a given university’s website. However, even in these situations, considerable heterogeneity exists. Even in the case of the Deep Web, page classes can be significantly different from one another and the original template-generating page may not be available to anyone outside the company or organization (and is also subject to change at any time). In the case of manually developed, nondynamic templates, differences can start emerging quickly even at the top domain level. For example, faculty pages from the engineering school and the business school may exhibit considerable difference in structure, even under the umbrella of a single university. Manually building web IEs for <i>every</i> page class is neither scalable nor sustainable in the long run. 
Building adaptive and robust web IEs that work for pages of the same class is a difficult enough task as is, and it is the primary focus of most web IEs proposed in the literature. However, some IE systems over the years have taken on the ambitious challenge of extracting key information from pages and page classes across various websites.</p>
<p><span aria-label="100" id="pg_100" role="doc-pagebreak"/>Web IE is such an interesting problem that even its categorization attracts considerable diversity among researchers. One categorization that has become relatively popular is based on the <i>extraction target granularity</i> (e.g., are we building a system for record-level, page-level, or site-level IE tasks?). Record-level IE tasks involve the discovery of “record” (or <i>entity</i>) boundaries and their segmentation into separate attributes; page-level web IEs extract all data that are embedded in one webpage; site-level web IEs populate <span aria-label="101" id="pg_101" role="doc-pagebreak"/>a database from pages of a web domain,<sup><a href="chapter_5.xhtml#fn2x5" id="fn2x5-bk">2</a></sup> wherein the attributes of an extraction target are assumed to be spread across pages of a web domain. From the beginning of web IE development, academic researchers have tended to devote more effort to developing record- and page-level data extraction, whereas industrial researchers have expressed more interest in building complete pipelines supporting site-level data extraction.</p>
<p>Web IE involves many challenges, some of which overlap with the challenges discussed in the previous chapter. For example, achieving a high degree of extraction accuracy may require much human feedback and many manual annotations, which limits both generalization and adaptation. There is a clear trade-off that needs to be quantified before resources are allocated, because fewer annotations generally correspond to lower accuracy.</p>
<p><i>Volume</i> is yet another challenge for web IE, especially in domains such as finance. Web IEs built over high-volume domains need to be scalable. In domains that are dynamic (i.e., the data gets stale quickly), streaming systems are obviously preferable. In the Big Data universe, this challenge is known as <i>velocity</i>. Streaming web IE can be highly challenging, especially considering that many machine learning techniques, including unsupervised algorithms, often need to see many documents (often over several epochs) before accurate inference is achieved or performance converges.</p>
<p>A more sociotechnical challenge that was not as relevant in the early days of the web, but is very relevant today, is that of privacy and ethics surrounding data collection and use. This challenge also applies to the problem of domain discovery described earlier. Some recent high-profile instances have shown that scraping can be unethical even though the data is public. In particular, Social Web domains require extra caution to be exercised because pages in these domains often contain personal information. Even if it is legal to scrape such data, it may not be ethical to do so. Beyond data collection, bias in training and testing is another concern that must be dealt with in this context. The web may be diverse, but it also contains its fair share of racist, incorrect, or otherwise problematic information. If the extraction system is biased toward some kinds of attributes or text over others, this will inevitably lead to a system that does not work equally well for everyone.</p>
<p>A final challenge that we mention here is <i>maintenance</i>. A web IE that has been built and tuned, even for a specific website like Backpage.com, may end up breaking because the website undergoes changes in format or content. This problem is serious enough, and was recognized early enough in the development of web IE systems, that it has already spawned a considerable amount of research dedicated both to <i>detecting</i> when the extraction system has stopped working (i.e., producing meaningful results) and automatically repairing the system to get it functioning again.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="102" id="pg_102" role="doc-pagebreak"/><a id="sec5-2"/><b>5.2 Wrapper Generation</b></h2>
<p class="noindent">While traditional IE tools such as NER populate KGs from textual documents using sequence labels and other methods, web IE primarily involves the development and maintenance of wrapper generation systems, known for short as <i>wrappers</i>. Wrappers are designed to deal with webpages containing a diverse mix of information and structure, where the information contained in a web document is assumed to be defined by means of formatting markup tags. Wrapper generation systems can exploit these formatting instructions to define suitable wrappers for a set of similarly structured webpages.</p>
<p>However, wrappers based on very simple extraction rules are not able to effectively exploit these formatting instructions when dealing with pages with a complex structure. This has necessitated the development of different kinds of wrappers that are able to deal with specific kinds of pages. We enumerate three such page kinds here:</p>
<ul class="numbered">
<li class="NL">1. Pages that have a reasonable amount of text, and that require some amount of Natural Language Processing (NLP), most commonly, NER;</li>
<li class="NL">2. Pages that do not conform to a fixed schema, in that there is no separate description for the content categories on the page;</li>
<li class="NL">3. Pages with structural and syntactic constraints, which allow the development of wrappers that are hybrid in nature (i.e., use both structural and natural-language features to extract information).</li>
</ul>
<p class="noindent">We note that modern wrappers exploit not only the structure within a web domain, but (just like domain discovery tools) can incorporate other contextual information such as hyperlinks, which are particularly relevant when the information needed to discover the attributes of an entity is nested under (or scattered across) many different webpages.</p>
<p>There has been a long tradition of work in wrapper generation, just like the NLP-based techniques covered in chapter 4. Because of this large body of work, it has become possible to synthesize many of the systems into one or more of several categories, including supervised, semisupervised, and unsupervised systems. Next, we describe work in each category in more detail.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec5-2-1"/><b>5.2.1 Manually Constructed and Supervised Wrappers</b></h3>
<p class="noindent">In the wrapper generation literature, a manually constructed wrapper is generally taken to mean any wrapper building approach that requires the wrapper to be explicitly programmed using either high- or low-level tools. In other words, wrappers are not automatically constructed based on data and constraints. One of the earliest approaches for manual construction of web wrappers (called TSIMMIS) viewed the web IE task as one of programmatically building sets of pattern-matching expressions and assigning their outputs to variables. It did this by taking as input a specification file that declaratively stated (by a sequence of commands given by programmers) where the data of interest is located <span aria-label="103" id="pg_103" role="doc-pagebreak"/>on the pages and how the data should be bundled into objects. Each command takes the form [variables, source, pattern], where <i>source</i> specifies the input text to be considered, <i>pattern</i> specifies how to find the text of interest within the source, and <i>variables</i> are a list of variables that hold the extracted results.</p>
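<p>The command format can be sketched as follows. This is a toy interpreter written for illustration only (the “#” capture marker, the page content, and the function name are our own assumptions, not actual TSIMMIS syntax):</p>

```python
import re

# Toy interpreter for [variables, source, pattern] commands: '#' in the
# pattern marks the text to capture; everything else is matched literally.
def run_command(variables, source, pattern):
    regex = re.escape(pattern).replace(r"\#", "(.*?)")
    m = re.search(regex, source, re.DOTALL)
    if m is None:
        return {}
    return {v: g.strip() for v, g in zip(variables, m.groups())}

page = "<b>Name:</b> Jane Roe <b>Phone:</b> 555-0101"
print(run_command(["name"], page, "<b>Name:</b> #<b>"))  # {'name': 'Jane Roe'}
```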
<p>Other systems, such as Minerva, attempted to move beyond purely declarative approaches by combining their benefits with the flexibility of procedural programming in handling heterogeneity, irregularity, and exceptions. Minerva accomplished this combination by incorporating an explicit exception-handling mechanism inside a regular grammar. Exception-handling procedures are written in a special language called <i>Editor</i>. In the grammar used by Minerva, a set of productions is first defined, with each production rule defining the structure of a nonterminal symbol (preceded by “$”) of the grammar. For example, given a book review page, we may want to extract the book name, the reviewer name, the rating given by the reviewer to the book, and the review text. The nonterminal productions $bName, $rName, $rate, and $text would be used to represent these attributes. Furthermore, suppose that the page were such that a book name was preceded by the HTML snippet “<i>&lt;b&gt;</i>Book Name<i>&lt;/b&gt;</i>” and was followed by<sup><a href="chapter_5.xhtml#fn3x5" id="fn3x5-bk">3</a></sup> “<i>&lt;b&gt;</i>.” In the Minerva grammar, we could use a pattern such as “*(?<i>&lt;b&gt;</i>),” which matches everything before the tag <i>&lt;b&gt;</i>, and place this pattern within a production rule such as <i>&lt;b&gt;</i>Book Name<i>&lt;/b&gt;</i> $bName <i>&lt;b&gt;</i> to facilitate correct book name extraction (and similarly for the other attributes). A special nonterminal, $TP (Tuple Production), can also be used to insert a tuple in the database after each macro object (in this case, a “book,” with all its attributes) has been parsed in terms of its attributes. For each production rule, it is possible to add an exception handler containing a piece of <i>Editor</i> code that handles the irregularities found in the web data. Whenever the parsing of that production rule fails, an exception is raised and the corresponding exception handler is executed. This makes Minerva more robust than earlier systems like TSIMMIS; however, the rules and exceptions still have to be manually defined by a user.</p>
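<p>The flavor of a production rule with an attached exception handler can be sketched in Python as follows (the handler logic and page strings are hypothetical; real Minerva productions and <i>Editor</i> code look quite different):</p>

```python
import re

# Sketch of a Minerva-style production with an exception handler: the main
# rule expects "<b>Book Name</b> ... <b>"; if parsing fails, the handler
# tolerates an irregular page in which the closing </b> tag is missing.
def extract_bname(page):
    m = re.search(r"<b>Book Name</b>(.*?)<b>", page, re.DOTALL)
    if m:
        return m.group(1).strip()
    # "exception handler": relax the rule for the unclosed-tag irregularity
    m = re.search(r"<b>Book Name(.*?)<b>", page, re.DOTALL)
    return m.group(1).strip() if m else None

print(extract_bname("<b>Book Name</b> Database Primer <b>Author</b>"))
```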
<p>Other influential examples of manually constructed wrappers include WebOQL (a functional language that relies on a data structure called a <i>hypertree</i> for querying and extracting information from the web and over semistructured data sets); W4F, which stands for WysiWyg Web Wrapper Factory and is designed with WYSIWYG (what you see is what you get) support using smart wizards; and XWrap, which exploits formatting information in webpages to hypothesize about underlying semantic structures in the page. While all of these met with some success in their time, they were not able to scale and adapt with the explosive growth of the web.</p>
<p>As the brittleness of manually constructed web IEs became more apparent, and machine learning continued to evolve as a community, supervised wrapper induction (WI) started becoming more popular. <span aria-label="104" id="pg_104" role="doc-pagebreak"/>Supervised wrappers take as input a set of webpages labeled with examples (of the data to be extracted) and output a wrapper. General users, rather than technical programmers, provide an initial set of labeled examples, and the system (possibly via an interface) may suggest additional pages for the user to label, akin to the active learning paradigm in general machine learning. The user base of such systems can be broad, and they are cheaper to set up for a website, precisely because programming expertise is not required to the same extent as for manually constructed wrappers.</p>
<p>A good example is <i>Rapier</i>, which attempts field-level extraction using bottom-up, or <i>compression-based</i>, relational learning (i.e., it begins with the most specific rules and then replaces them with more general rules). Rapier learns single-slot extraction patterns that use syntactic and semantic information commonly employed in NLP, including a part-of-speech (POS) tagger and lexicons such as WordNet. Extraction rules consist of three distinct patterns: first is a prefiller pattern that matches text immediately preceding the filler, second is a pattern that matches the actual slot filler, and last is a postfiller pattern that matches the text immediately following the filler. Returning to the running example of books, an extraction rule for the book title might consist of a prefiller pattern that says that an extraction should be preceded by the three “words” <i>Book</i>, <i>Name</i>, and <i>&lt;/b&gt;</i>; a filler pattern that says that the name is a list of length at most 2, with the elements in the list labeled as “NN” or “NNS” by a POS tagger;<sup><a href="chapter_5.xhtml#fn4x5" id="fn4x5-bk">4</a></sup> and a postfiller pattern that is the word <i>&lt;b&gt;</i> (see footnote 3 describing Minerva). The precise syntax in which these rules are specified is not very important because it would depend on system implementation. The important point to note, though, is that the rules can employ complex cues, such as NLP outputs.</p>
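<p>The shape of such a rule can be sketched as follows (the patterns are hypothetical; in particular, we approximate the NN/NNS POS constraint with a crude capitalized-word check rather than a real tagger):</p>

```python
import re

# Sketch of a Rapier-style single-slot rule with three parts: a prefiller
# pattern, a filler pattern (one or two capitalized words, standing in for
# the NN/NNS POS-tag constraint), and a postfiller pattern.
PREFILLER = r"Book\s+Name\s*</b>\s*"
FILLER = r"([A-Z]\w*(?:\s+[A-Z]\w*)?)"
POSTFILLER = r"\s*<b>"

def apply_rule(text):
    m = re.search(PREFILLER + FILLER + POSTFILLER, text)
    return m.group(1) if m else None

print(apply_rule("<b>Book Name</b> Database Primer <b>Reviewer</b>"))
```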
<p>Another influential example is <i>Stalker</i>, which is a wrapper generation system that performs hierarchical data extraction by using an <i>embedded catalog (EC)</i> formalism to describe the structure of many types of “semistructured” documents. In the EC formalism, a page is described abstractly as a treelike structure with leaves as attributes to be extracted, and internal nodes are lists of tuples. For each node in the tree, the wrapper requires a rule for extracting the node from its parent. Additionally, for each list node, the wrapper needs a <i>list iteration</i> rule for breaking down the list into individual tuples. Stalker ultimately turns the complex problem of extracting data from an arbitrary document into several sequentially ordered (and easier) extraction tasks. The extractor is able to use multipass scans to handle missing attributes and multiple permutations. The actual extraction rules are generated via a <i>sequential covering</i> algorithm, which starts from <i>linear landmark automata</i> to cover the maximal number of positive examples, after which it attempts to generate new automata for the remaining examples. We illustrate an example of both a Stalker EC tree and rules in <a href="chapter_5.xhtml#fig5-3" id="rfig5-3">figure 5.3</a>. The tree is simple to understand; it just says that a web document is a <span aria-label="105" id="pg_105" role="doc-pagebreak"/>paper with a title, and a list of academic reviewers (who have reviewed that paper); each academic reviewer is described by their name, their acceptance decision (of the paper), and their review text. The Stalker rules, which mainly comprise tokens like <i>SkipTo</i>, are also intuitive to understand. For example, the list of academic reviewers is contained within the <i>&lt;ol&gt;</i> and <i>&lt;/ol&gt;</i> tags. 
The content within the tags can be further parsed to iterate over the attributes of an academic reviewer (the iteration rule in <a href="chapter_5.xhtml#fig5-3">figure 5.3</a>). The important aspect to remember about Stalker here is its EC representation of a webpage, as well as the simplicity of the rules, especially for dealing with fairly regular webpages containing nested and arraylike elements.</p>
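<p>The intuition behind <i>SkipTo</i> and list iteration can be sketched as follows (a toy re-implementation on a made-up page fragment, not Stalker's actual engine):</p>

```python
# Toy version of two Stalker ideas: SkipTo(landmark) consumes input up to
# and past a landmark, and a list-iteration rule splits the region between
# <ol> and </ol> into individual tuples (tags are assumed to be present).
def skip_to(text, landmark):
    i = text.find(landmark)
    return text[i + len(landmark):] if i >= 0 else None

def iterate_list(text):
    body = skip_to(text, "<ol>")
    body = body[: body.find("</ol>")]
    return [item.strip() for item in body.split("<li>") if item.strip()]

page = "Reviewers: <ol><li>Ana: accept<li>Bo: reject</ol>"
print(iterate_list(page))  # ['Ana: accept', 'Bo: reject']
```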
<div class="figure">
<figure class="IMG"><a id="fig5-3"/><img alt="" src="../images/Figure5-3.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig5-3">Figure 5.3</a>:</span> <span class="FIG">An EC model (above) of a hypothetical webpage describing reviews for academic papers, and some Stalker rules (below) for extracting reviewers and their accept/reject decisions.</span></p></figcaption>
</figure>
</div>
<p>Supervised approaches have long been a popular way of tackling the wrapper generation problem. Besides Rapier and Stalker, other popular examples include SRV, WHISK, NoDoSE, WIEN, and DEByE. Similar to Rapier, SRV uses relational learning, but its algorithm is top-down rather than bottom-up. It frames web IE as a classification problem, wherein input documents are tokenized and all substrings of tokens (namely, fragments of text) are labeled as either positive (meaning the fragment is an extraction target) or negative. The SRV rules that are generated are logic-based and rely on token-based features and predicates that may be simple or relational. Simple features map a token into a discrete value (such as its length in number of characters), while relational features map one token to another. Intuitively, the combination of relational and simple features, combined with a learning algorithm, can yield plausible rules. For example, SRV may be able to discover <span aria-label="106" id="pg_106" role="doc-pagebreak"/>that the price of a product is a numeric word (possibly followed by a period and another numeric word) and occurs within an HTML tag. In contrast, WHISK uses a covering learning algorithm to generate multislot extraction rules. WHISK can be applied not just to HTML documents but also to free-text and structured documents. WHISK rules are based on regular expression patterns that identify the context and exact delimiters of relevant phrases. To create the rules, it needs annotated training instances. WHISK learns rules top-down, starting from a general rule that covers all instances and then extending the rule by adding one term at a time.</p>
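<p>A WHISK-style multislot rule can be approximated with a single regular expression, as in this sketch (the rental-ad string and slot names are invented for illustration):</p>

```python
import re

# Sketch of a WHISK-style multislot rule: one pattern extracts two slots
# (bedrooms and price) from a listing, with ".*?" playing the role of
# WHISK's "skip" wildcard between delimiters.
RULE = re.compile(r"(\d+)\s*br.*?\$(\d+)", re.IGNORECASE)

def extract_slots(listing):
    m = RULE.search(listing)
    if m is None:
        return None
    return {"bedrooms": int(m.group(1)), "price": int(m.group(2))}

print(extract_slots("Sunny 2 BR apartment near campus, $950/month"))
```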
<p>NoDoSE is even more interesting in that, rather than assuming that training examples can just be perfectly and completely obtained, it provides an interactive tool to its users to hierarchically decompose documents. It also differentiates between text and HTML code by having a separate, heuristic-based mining module for each. The goal of these modules is to infer a tree that describes the structure of the document. Given such a decomposition of a document, NoDoSE is able to automatically parse it to generate extraction rules.</p>
<p>Details of all these systems aside, we note that supervision in web IE arises in several forms, unlike with NER systems, where the primary source of supervision is annotation of named (and typed) entities within sentences and documents. NoDoSE, for example, uses supervision to infer an accurate treelike structure encoding the document, following which extraction rules are generated. On the other hand, Rapier is more traditional in the annotation-based supervision that it expects from users.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec5-2-2"/><b>5.2.2 Semisupervised Approaches</b></h3>
<p class="noindent">Similar to the study of semisupervised NER approaches, semisupervised WI approaches use a variety of means to ensure that a wrapper can be learned without necessarily requiring large amounts of manual supervision. However, these systems differ in how they reduce manual supervision. Some systems, like OLERA, accept incomplete, approximately correct examples from users for rule generation and are the equivalent of weakly supervised bootstrapping-based approaches in other areas of machine learning. Other systems, like IEPAD, require no labeled training pages at all, but do demand postprocessing effort from the user in order to select the correct target pattern and choose the data to be extracted. The majority of semisupervised approaches in web IE are built for record-level extraction tasks. Because extraction targets are unspecified for such systems, an interface is usually required for users to specify the extraction targets after the learning phase. The design of the graphical user interface (GUI) can play a role in the level of technical sophistication required from potential users. Brief descriptions of some of these systems are provided next, with citations of relevant papers in the section entitled “Bibliographic Notes,” at the end of this chapter.</p>
<p>IEPAD was one of the earliest IE systems to generalize extraction patterns from unlabeled webpages by exploiting the fact that if a webpage contains multiple (“homogeneous”) data records to be extracted, they are often rendered regularly using the same template <span aria-label="107" id="pg_107" role="doc-pagebreak"/>(mainly for good visualization). Repetitive patterns can thus be discovered, assuming that the webpage is well encoded. By discovering the repetitive patterns, wrappers can be induced. To do so, IEPAD uses a data structure based on a binary suffix tree designed to discover repetitive patterns in a webpage. Because the original data structure only records the exact match for suffixes, IEPAD applies a center star algorithm to align multiple strings, starting from each occurrence of a repeat and ending before the start of the next occurrence. A signature representation is used to denote the template underlying all data records.</p>
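<p>The underlying intuition, stripped of the suffix tree and alignment machinery, can be sketched as follows (a simplified stand-in: we look for the most frequent repeated tag <i>n</i>-gram as a hint of the record template):</p>

```python
import re
from collections import Counter

# Simplified stand-in for IEPAD's core idea: encode the page as its tag
# sequence and find the most frequent repeated tag n-gram, which suggests
# the template used to render homogeneous records. (Real IEPAD uses a
# binary suffix tree plus center-star multiple string alignment.)
def repeated_pattern(html, n=3):
    tags = re.findall(r"</?\w+>", html)
    grams = Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
    return grams.most_common(1)[0]  # (pattern, count)

page = ("<table><tr><td>A</td></tr><tr><td>B</td></tr>"
        "<tr><td>C</td></tr></table>")
print(repeated_pattern(page))
```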
<p>In contrast, OLERA is a semisupervised IE system that acquires a <i>rough</i> example from the user for extraction rule generation. OLERA can learn extraction rules for pages containing single data records, which is not something that IEPAD can handle. OLERA executes three main operations. First, it encloses an information block of interest by leveraging approximate matching against a block that the user has marked in the rough example, and then generalizing an extraction pattern using a multiple string alignment technique. Second, it drills down or rolls up an information slot: <i>drilling down</i> allows the user to navigate from a text fragment to more detailed components, whereas <i>rolling up</i> combines several slots to form a meaningful information unit. Finally, just as in IEPAD, OLERA designates relevant information slots for schema specification.</p>
<p>Another approach that is similar to OLERA is Thresher, the GUI for which was designed to allow users to specify examples of content with semantic significance, by highlighting the content and describing their meaning via labeling. Thresher uses tree edit distance between the DOM subtrees of these specified examples to induce a wrapper, after which the user is allowed to associate RDF classes and predicates in an ontology with the nodes in the induced wrapper.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec5-2-3"/><b>5.2.3 Unsupervised Approaches</b></h3>
<p class="noindent">Unsupervised IE systems do not use any labeled training examples and have no user interactions to generate a wrapper. Systems such as RoadRunner and EXALG are designed to solve page-level web IE, while others like DeLa and DEPTA (whose name stands for Data Extraction based on Partial Tree Alignment) are designed for record-level web IE. Compared to supervised systems, where extraction targets are specified by the users, the extraction target for unsupervised web IEs is defined as the data that is used to generate the page, or nontag textual content in data-rich regions of an input webpage. Because several schemas could potentially align with the training pages (due to the presence of nullable attributes), ambiguity is inevitable. Thus, rather than be completely unsupervised, practical unsupervised IEs often leave the choice of determining the right schema to users. Similarly, if not all data is needed, postprocessing may be required to select relevant data and assign it to the proper class in the ontology.</p>
<p>RoadRunner has been particularly influential as an unsupervised web IE; it frames the site-generation process as an encoding of the original database content into strings of HTML code. Consequently, the extraction itself becomes akin to data <i>decoding</i>, with <span aria-label="108" id="pg_108" role="doc-pagebreak"/>wrapper generation for a set of HTML pages corresponding to the inference of an HTML code grammar. RoadRunner uses the ACME matching technique to compare HTML pages of the same class. ACME, which stands for Align, Collapse under Mismatch, and Extract, was originally proposed by Crescenzi et al. (2001) as a technique to generate a wrapper by analyzing similarities and differences among some sample HTML pages of the class.</p>
<p>Using ACME, RoadRunner compares two webpages at a time by first aligning the matched tokens and collapsing the mismatched tokens. There are two kinds of mismatches: <i>string mismatches</i>, which are used to discover attributes (#PCDATA); and <i>tag mismatches</i>, which are used to discover iterators (+) and optionals (?). We illustrate these concepts using <a href="chapter_5.xhtml#fig5-4" id="rfig5-4">figure 5.4</a>, which uses an example inspired by the original paper. The example showcases some of the more complex matching facilities that RoadRunner is capable of. For instance, in <a href="chapter_5.xhtml#fig5-4">figure 5.4</a>, the pages have a nested structure, with a list of books and, for each book, its editions. The algorithm starts by matching the sample (the second webpage) against the wrapper, which is initialized to equal webpage 1. The parsing first stops when a tag mismatch is encountered at the token <i>&lt;LI&gt;</i> (right before <i>Database Primer</i>). When trying to solve the mismatch by looking for a possible iterator, RoadRunner does the following: first, based on the possible terminal tag (the <i>&lt;/LI&gt;</i> in the earlier line), the algorithm locates one candidate square occurrence on the wrapper right after it (from <i>&lt;LI&gt;</i> all the way to the last occurrence of <i>&lt;/LI&gt;</i>, as bounded by the dashed purple box); and second, it tries to match this candidate square against the upward portion of the wrapper. The square is matched backward, for example, by comparing the two occurrences of the terminal tag (the <i>&lt;/LI&gt;</i> right at the end against the <i>&lt;/LI&gt;</i> tag where the tag mismatch was first encountered), then moving to the two occurrences of the token <i>&lt;/PI&gt;</i> before the occurrences of <i>&lt;LI&gt;</i>, and so on. This comparison is emphasized at the side of <a href="chapter_5.xhtml#fig5-4">figure 5.4</a>.</p>
<div class="figure">
<figure class="IMG"><a id="fig5-4"/><img alt="" src="../images/Figure5-4.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig5-4">Figure 5.4</a>:</span> <span class="FIG">An overview of how RoadRunner performs unsupervised web IE.</span></p></figcaption>
</figure>
</div>
<p>When trying to match the two fragments, <i>internal mismatches</i> can also be detected, such as those involving the tokens <i>&lt;/B&gt;</i> and <i>&lt;P&gt;</i> in the graphic (bounded by a box). Internal mismatches can be dealt with in the same way as external mismatches, meaning that the matching algorithm needs to be recursive: every time there is a mismatch, a new matching procedure is started based on the ideas expressed here. The only difference is that this kind of recursive matching does not compare one wrapper with one sample, but two different portions of the same object. In the example, the single external mismatch triggers two internal mismatches, one of which leads to the discovery of the book editions, and the second of which leads to the identification of the optional pattern <i>&lt;I&gt;Special!&lt;/I&gt;</i>. Without going into all the syntactic details of the final wrapper, we express it succinctly at the bottom of <a href="chapter_5.xhtml#fig5-4">figure 5.4</a>.</p>
<p>Because there can be several alignments, RoadRunner adopts union-free regular expression (UFRE) to reduce the complexity. The alignment result of the first two pages is then compared to the third page in the <i>page</i> class. The final wrapper is generated if RoadRunner succeeds in generalizing the wrapper each time a “mismatch” is found (i.e., when some <span aria-label="109" id="pg_109" role="doc-pagebreak"/>token in the input sample does not comply with the grammar specified by the current wrapper). The final wrapper, therefore, is a common wrapper that has solved all mismatches encountered during parsing.</p>
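<p>The align-and-collapse idea can be sketched in a few lines of code. The toy below aligns two equal-length token streams and collapses string mismatches into #PCDATA slots; real RoadRunner additionally resolves tag mismatches into iterators and optionals, which this sketch deliberately omits. All names here are illustrative assumptions, not RoadRunner's implementation.</p>

```python
# Drastically simplified sketch of RoadRunner-style align-and-collapse:
# two pages of the same class are aligned token by token, and string
# mismatches collapse into #PCDATA data slots. Real RoadRunner also
# resolves tag mismatches into iterators (+) and optionals (?), which
# this toy deliberately does not attempt.
import re

def tokenize(html):
    """Split HTML into a stream of tag tokens and text tokens."""
    return [t for t in re.split(r"(<[^>]+>)", html) if t.strip()]

def infer_wrapper(page_a, page_b):
    a, b = tokenize(page_a), tokenize(page_b)
    if len(a) != len(b):
        raise ValueError("toy version assumes equal-length token streams")
    wrapper = []
    for ta, tb in zip(a, b):
        if ta == tb:
            wrapper.append(ta)            # shared template token
        elif not ta.startswith("<") and not tb.startswith("<"):
            wrapper.append("#PCDATA")     # string mismatch -> data field
        else:
            raise ValueError("tag mismatch: needs iterator/optional handling")
    return wrapper

p1 = "<html><b>Title:</b>Database Primer</html>"
p2 = "<html><b>Title:</b>Design in Practice</html>"
print(infer_wrapper(p1, p2))
# → ['<html>', '<b>', 'Title:', '</b>', '#PCDATA', '</html>']
```

<p>The resulting wrapper can then be applied to any further page of the same class: template tokens are matched, and whatever falls in a #PCDATA slot is extracted as data.</p>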
<p><span aria-label="110" id="pg_110" role="doc-pagebreak"/>Along with the module for template deduction, RoadRunner also provides modules for classification and labeling to facilitate wrapper construction. The first module, <i>Classifier</i>, analyzes pages and collects them into clusters with similar structural properties (the goal is to cluster together pages with the same template). The second module, <i>Labeler</i>, discovers <i>attribute names</i> for each page class. If RoadRunner is adapted to work with an ontology, the task of Labeler would be to assign concepts from the ontology to extractions from a set of pages belonging to the same class. Note that the RoadRunner system, as originally proposed, did not need any ontology or prior assumptions about the underlying schema or page contents; the schema was inferred along with the wrapper, and the system was fully capable of handling arbitrarily nested structures. Some preprocessing is required (e.g., an HTML page should be converted to an XHTML specification, which means that tags should be properly closed and nested), but these are not difficult to accomplish with mechanical tools. Experimentally, the algorithm was evaluated on some well-known websites at the time, including Buy.com and Amazon.com, and the authors showed that RoadRunner was able to successfully generate wrappers that matched many of the webpage samples and were consequently able to extract the requisite data on those pages.</p>
<p>Besides RoadRunner, there are a number of other unsupervised web IE systems that have gained some influence. One such system is DEPTA, which is applicable to webpages that contain two or more data records in a data region. It relies on the insight that records of the same data region are reflected in the tag tree of a webpage under the same parent node. Hence, irrelevant substrings do not need to be compared as done in suffix-based approaches. The overall algorithm works in three steps: First, it builds the HTML tag tree for the webpage, and disregards text strings. Second, it compares substrings for all children under the same parent. This results in significant efficiencies. The final step handles situations when a data record is not rendered contiguously (an assumption made frequently in prior work). Recognition of data items or attributes in a record is accomplished using partial tree alignment, which is considered better than string alignment, as it takes structural cues into account. Because DEPTA was limited in handling nested records, a new algorithm called NET was later proposed, which used advanced techniques relying on visual cues to expand the scope of the system.</p>
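<p>DEPTA's key insight, that record boundaries can be found by comparing only the sibling subtrees under a common parent, can be sketched as follows. For brevity, we use a flat tag-sequence similarity in place of DEPTA's partial tree alignment; the threshold and names are our assumptions.</p>

```python
# Toy sketch of DEPTA's key idea: compare only sibling subtrees under a
# common parent to locate a data region of repeated records. We use a
# flat tag-sequence similarity (difflib) in place of DEPTA's partial
# tree alignment; the 0.8 threshold is an illustrative assumption.
from difflib import SequenceMatcher

def tag_sequence(node):
    """Flatten a (tag, [children]) tree into a preorder list of tags."""
    tag, children = node
    out = [tag]
    for child in children:
        out.extend(tag_sequence(child))
    return out

def find_data_region(siblings, threshold=0.8):
    """Indices of adjacent siblings that look like repeated records."""
    region = set()
    for i in range(len(siblings) - 1):
        a, b = tag_sequence(siblings[i]), tag_sequence(siblings[i + 1])
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            region.update((i, i + 1))
    return sorted(region)

siblings = [
    ("h2", []),                          # a section heading: not a record
    ("tr", [("td", []), ("td", [])]),    # record
    ("tr", [("td", []), ("td", [])]),    # record
    ("tr", [("td", []), ("td", [])]),    # record
]
print(find_data_region(siblings))  # → [1, 2, 3]
```

<p>Because only siblings are compared, the quadratic blowup of comparing arbitrary substrings across the whole page is avoided, which is exactly the efficiency gain noted above.</p>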
<p>Another unsupervised web IE system is DeLa, which is an extension of IEPAD and removes user interaction in extraction rule generalization while dealing with nested object extraction. It uses a novel WI algorithm called DSE (which stands for Data-rich Section Extraction), which extracts data-rich sections from webpages by comparing the DOM trees for two webpages from the same web domain. Data objects ultimately obtained by the system are transformed into a relational table, with multiple values of an attribute distributed <span aria-label="111" id="pg_111" role="doc-pagebreak"/>across multiple table rows. Labels are assigned to the data table columns using heuristics (e.g., using maximal-prefix and maximal-suffix shared by all cells in the column). Potentially, this part of the system could be further enhanced using distant supervision techniques (e.g., Wikipedia or DBpedia concepts could be used to automatically label columns for “common” ontological types such as Country or Politician) if it were to be reimplemented in a more modern setting.</p>
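<p>The maximal-prefix labeling heuristic mentioned above is simple to sketch (the function name and the delimiter characters stripped at the boundary are our assumptions):</p>

```python
# DeLa-style column labeling heuristic: use the maximal prefix shared by
# all cells in a column as the column's label. Function name and the
# delimiter characters stripped at the boundary are our own assumptions.
import os

def shared_prefix_label(column):
    prefix = os.path.commonprefix(column)
    label = prefix.rstrip(" :$")  # strip delimiter residue at the boundary
    return label or None

cells = ["Price: $9.99", "Price: $14.50", "Price: $3.25"]
print(shared_prefix_label(cells))  # → Price
```

<p>An analogous maximal-suffix check handles columns whose cells share a trailing unit (e.g., "kg" or "%").</p>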
<p>Finally, the EXALG system formulates the unsupervised WI problem in a novel way: by ingesting a set of pages assumed to be created from an unknown template T, as well as the values to be extracted. EXALG deduces the template T and uses it to extract the set of values from the encoded pages as an output. EXALG detects the unknown template by using techniques differentiating roles and equivalence classes. Per the former technique, occurrences with two varying paths of a particular token are understood to have different roles (e.g., it would assume that “Name” occurring in “Book Name” has a different role than when it occurs in “Reviewer Name”). The latter technique defines an equivalence class as a maximal set of tokens having the same occurrence frequencies over the training pages (called the occurrence-vector). The key insight here is that “template tokens” encompassing a data record have the same occurrence vector and form an equivalence class. To mitigate the very real problem of false positives, the technique prevents data tokens from accidentally forming an equivalence class by filtering out equivalence classes with insufficient support (the number of pages containing the tokens) and size (the number of tokens in an equivalence class). Equivalence classes must also be mutually nested, and tokens within an equivalence class must be ordered, to comply with the hierarchical structure of the data schema. Equivalence classes that survive these checks can be used to construct the original template and thereby induce a wrapper.</p>
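<p>The occurrence-vector idea at the heart of EXALG can be sketched directly. In the toy below (thresholds and names are ours, and the nesting and ordering checks are omitted), template tokens that appear once on every page share an occurrence vector and survive filtering, while page-specific data tokens do not:</p>

```python
# Toy sketch of EXALG-style equivalence classes: tokens are grouped by
# their occurrence vector (per-page counts), and classes with too little
# support or size are filtered out as likely data rather than template.
# Thresholds and names are illustrative assumptions.
from collections import Counter, defaultdict

def equivalence_classes(pages, min_support=2, min_size=2):
    counts = [Counter(p) for p in pages]
    by_vector = defaultdict(list)
    for tok in set().union(*pages):
        vec = tuple(c[tok] for c in counts)  # occurrence vector
        by_vector[vec].append(tok)
    result = {}
    for vec, toks in by_vector.items():
        support = sum(1 for n in vec if n > 0)  # pages containing the tokens
        if support >= min_support and len(toks) >= min_size:
            result[vec] = sorted(toks)
    return result

pages = [
    ["<html>", "<b>", "Name", "</b>", "Ann", "</html>"],
    ["<html>", "<b>", "Name", "</b>", "Bob", "</html>"],
]
print(equivalence_classes(pages))
# → {(1, 1): ['</b>', '</html>', '<b>', '<html>', 'Name']}
# The five template tokens form one class; the data tokens 'Ann' and
# 'Bob' each occur on only one page and are filtered out.
```
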
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec5-2-4"/><b>5.2.4 Empirical Comparative Analyses</b></h3>
<p class="noindent">Given the wide variety of WI tools presented here, direct comparison of many of the tools can be a difficult proposition. Yet, without systematic comparison, it is difficult to ascertain progress, especially considering the nature of the problem. An influential survey has proposed that such a comparison be conducted along three dimensions: namely, the <i>task domain</i> or the difficulty of the web IE task (i.e., why do certain web IE systems fail to handle websites with some particular structure, while others succeed?), the techniques used in different systems, and finally, the user effort involved in the training process, as well as the system portability across different domains. From the user’s point of view, the second dimension has been traditionally less important, but with the growth of GUIs and other efficient means of eliciting training data, it has become more important. Users may be willing to opt for a supervised system if acquiring labels and training data is easy but opt for a semisupervised or unsupervised system if it is laborious. If there are huge performance differences between using one technique or another, then the users may even be willing to put in some effort.</p>
<p><span aria-label="112" id="pg_112" role="doc-pagebreak"/>Many of the systems that we described in this chapter have indeed been compared across each of these three dimensions, mainly in terms of their capabilities and features used, rather than empirical performance on a common corpus (which unfortunately does not exist for web IE, unlike the data sets developed under MUC and other similar venues for NLP extraction tasks). Thus, we cannot claim that one system is better than another; however, some trends and limitations in the overall research area are revealed, as briefly documented here:</p>
<ul class="numbered">
<li class="NL">1. <b>Task domain dimension.</b> Analysis has shown that while manual and supervised web IE systems are primarily designed to extract information from cross-website pages, semisupervised and unsupervised web IE systems are mainly capable of extracting data from template pages, such as those found on the Deep Web. For better or for worse, there is a bias in the way unsupervised systems are designed: without the assumption of a common (or roughly common) template, they would not work well. This seems like a reasonable assumption, however, considering that they are not given labeled data. To take another example of analysis along this dimension, there is also considerable heterogeneity among wrappers in terms of their extraction levels (field-level, record-level, page-level, and site-level). While many wrappers are now able (or can be adapted) to extract at the record or even page level (e.g., RoadRunner and STALKER), almost no WI systems exist, to our knowledge, that can operate at the site level. Thus, this is clearly an open area of research. Other kinds of analyses along the task domain dimension are also possible, such as with respect to extraction target variation (e.g., is the IE robust to missing attributes and multivalued attributes?), template variation (e.g., can the IE support disjunctive rules and sequential patterns for rule generalization, to handle format variations in data instances?) and even non-HTML support.</li>
<li class="NL">2. <b>Technique-based dimension.</b> As observed in this chapter, many techniques have been proposed for WI over the decades. An analysis of these systems along the technique-based dimension could involve aspects such as the <i>number of passes</i> required over an input document to facilitate extraction (some wrappers, such as DEByE, require multiple passes, although most wrappers need only one); the <i>types of extraction rules</i> (e.g., is the wrapper relying on regular grammar or more powerful first-order logic?) induced by the wrapper; the <i>features</i> being supported by the wrapper (e.g., wrappers such as W4F use DOM tree paths, rather than just tag-, literal-, or delimiter-based content); the <i>tokenization schemes</i> supported by the wrappers (most of which support tag-level tokenization but some of which can support word-level tokenization, as well as tag-level encoding schemes to translate input training pages into tokens); and perhaps most important, the type of <i>learning algorithm</i> employed by the WI. Manual wrappers do not have any embedded learning algorithm, while wrappers such as DEPTA, DeLa, and IEPAD rely heavily on pattern mining. Even among supervised algorithms <span aria-label="113" id="pg_113" role="doc-pagebreak"/>that use some kind of linear programming or set covering, a distinction can be made between whether the learning occurs top-down (SRV, WHISK) or bottom-up (Rapier, WIEN, and STALKER).</li>
<li class="NL">3. <b>Automation-based dimension.</b> Analyses of web IE systems can be conducted with respect to such aspects as <i>user expertise</i> (e.g., many of the manual systems require users to have programming background in order to write good or even syntactically correct extraction rules); <i>fetching support</i> (e.g., does the WI system allow retrieval of pages dynamically using the URL?); <i>applicability</i> (how easily the approach can be extended to different domains due to architectural design decisions such as modularity); and even <i>output</i> or <i>application programming interface (API)</i> support, since some systems make it easy to retrieve the extracted data, but others can be accessed only programmatically or using specific data formats such as XML. Just as with previous dimensions, different wrappers offer different trade-offs among these various aspects. For example, concerning applicability, while manual and supervised systems often are modular and can be flexibly extended to other domains (assuming, of course, that a user is available to input the requisite amount of supervision), unsupervised systems are designed only for certain websites or domains. With respect to user expertise, while IEPAD (and OLERA) do not require labeling prior to pattern discovery, they do require postlabeling. Only DeLa seems to have successfully addressed the problem of assigning appropriate labels to extracted data in an unsupervised manner, though recently developed zero-shot learning techniques in the NLP and computer vision communities may also be applicable (for the unsupervised schema and label assignment problem) in novel reimplementations of some of these systems, considering the broad availability of external concept data sets like DBpedia and WordNet.</li>
</ul>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec5-3"/><b>5.3 Beyond Wrappers: Information Extraction over Structured Data</b></h2>
<p class="noindent">Thus far, we have mainly considered IE for webpages. Although webpages are not natural-language sources (despite the high prevalence of natural-language content on them), the assumption still is that they are relatively free-form compared to tables or graphs. Yet tables and databases have become very common on the web, with some referring to the vast set of databases powering dynamic webpages as the Deep Web. We already discussed the Deep Web earlier in this chapter, in the context of wrapper generation. To recap, when we search for a product on an e-commerce website like Amazon, the webpage for the product is usually not precreated (or even stored) HTML; rather, it is created on the fly using entries on that product in internal Amazon databases. This allows the webpage to contain up-to-date information, to be highly standardized (since only the metastructure of the webpage, and how it links to and incorporates database entries on the fly, needs to be designed), and not to be overly sensitive to a database that is always highly in flux due to the frequency of product <span aria-label="114" id="pg_114" role="doc-pagebreak"/>sales and introductions. According to most estimates, the Deep Web far trumps the normal web in size (by factors as large as 90 to 100 times).</p>
<p>The reason for bringing this up here is to illustrate that, when it comes to web sources, the information-generating sources may predominantly be structured rather than natural language. While much of the Deep Web is inaccessible directly (one cannot obtain the Amazon or Walmart e-commerce databases through Google queries, or even through scraping, although parts of the database may be obtained through a carefully managed crawl), there is a growing movement on the internet to publish structured data sets directly and make them queryable through open APIs. One good example, which we detail in part V, is government data [e.g., the US government has released a large number of raw data sets through portals like data.gov; besides federal efforts, some states and counties (and even cities) have been instrumental in making their data sets public]. Some agencies and federal bureaus, such as the Bureau of Labor Statistics (BLS), have had a long history of making their data (largely structured) public. These tabular, tablelike, or otherwise highly structured data sets can serve as rich sets of information with which to populate KGs.</p>
<p>Why would extracting KG nodes, relations, and literal attributes from tables be so hard? Intuitively, it would seem that a table is a simple data structure where the type of an entity, such as a country, is a column header, and the rows contain and describe the actual entities (e.g., US, Switzerland). But in reality, this is not the case with many tables: empirical observation has shown that there is no single type of table on the web. There is considerable heterogeneity, as expressed in the examples illustrated in <a href="chapter_5.xhtml#fig5-5" id="rfig5-5">figures 5.5</a> and <a href="chapter_5.xhtml#fig5-6" id="rfig5-6">5.6</a>. Crestan and Pantel (2011) were among the earliest authors who attempted to provide a classification of the different types of HTML tables actually observed on the web. Their taxonomy includes two broad categories:</p>
<div class="figure">
<figure class="IMG"><a id="fig5-5"/><img alt="" src="../images/Figure5-5.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig5-5">Figure 5.5</a>:</span> <span class="FIG">Illustrative examples of the <i>horizontal listing</i>, <i>attribute/value</i>, and <i>navigational</i> web table types.</span></p></figcaption>
</figure>
</div>
<div class="figure">
<figure class="IMG"><a id="fig5-6"/><img alt="" src="../images/Figure5-6.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig5-6">Figure 5.6</a>:</span> <span class="FIG">Illustrative examples of the <i>vertical listing</i>, <i>matrix</i>, and <i>matrix calendar</i> web table types.</span></p></figcaption>
</figure>
</div>
<ul class="numbered">
<li class="NL">1. <b>Relational knowledge tables.</b> These are subclassified into <i>listings</i>,<sup><a href="chapter_5.xhtml#fn5x5" id="fn5x5-bk">5</a></sup> <i>attribute/value</i>, <i>matrix</i> (possibly subclassified further as <i>matrix calendar</i>), <i>enumeration</i>, and <i>form</i>.</li>
<li class="NL">2. <b>Layout tables.</b> These are subclassified further into <i>navigational</i> and <i>formatting</i>.</li>
</ul>
<p>Semiautomatically detecting the type of the table is itself a challenging problem with imperfect accuracy. Once detected, IE modules designed to handle that table type could be used, but these do not have perfect accuracy either. One may also encounter a long tail of tables (that could only be defined as type “other” since they may not fit well, or only fit loosely, in the taxonomy described previously); see <a href="chapter_5.xhtml#fig5-7" id="rfig5-7">figure 5.7</a> for an example. Hence, this continues to be an active and important area of research; we provide important references in the “Bibliographic Notes” section.</p>
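<p>As a flavor of what such table-type detection involves, the sketch below separates layout/navigational tables from candidate relational tables using two crude signals: link density and ragged rows. These features and thresholds are illustrative assumptions rather than a published classifier; production systems use many more features and learned models.</p>

```python
# Crude heuristic baseline for table-type detection: layout/navigational
# tables tend to have a high proportion of link cells and ragged rows,
# while relational-knowledge tables tend to be regular grids of data.
# Features and the 0.5 threshold are illustrative assumptions.
from html.parser import HTMLParser

class TableStats(HTMLParser):
    def __init__(self):
        super().__init__()
        self.cells = 0
        self.links = 0
        self.row_lengths = []
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_lengths.append(0)
        elif tag in ("td", "th"):
            self.cells += 1
            if self.row_lengths:
                self.row_lengths[-1] += 1
        elif tag == "a":
            self.links += 1

def looks_like_layout(table_html, link_ratio=0.5):
    s = TableStats()
    s.feed(table_html)
    ragged = len(set(s.row_lengths)) > 1  # rows of unequal width
    return s.cells > 0 and (s.links / s.cells > link_ratio or ragged)

nav = "<table><tr><td><a href='/a'>A</a></td><td><a href='/b'>B</a></td></tr></table>"
data = "<table><tr><td>US</td><td>331</td></tr><tr><td>CH</td><td>8.6</td></tr></table>"
print(looks_like_layout(nav), looks_like_layout(data))  # → True False
```
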
<div class="figure">
<figure class="IMG"><a id="fig5-7"/><img alt="" src="../images/Figure5-7.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig5-7">Figure 5.7</a>:</span> <span class="FIG">An “other” table type illustration that does not fit neatly into the taxonomy described in the text.</span></p></figcaption>
</figure>
</div>
<p>Even when fixing the table type, some tables can be complicated. Consider, for example, the table in <a href="chapter_5.xhtml#fig5-8" id="rfig5-8">figure 5.8</a>. This table was taken from Wikipedia<span aria-label="115" id="pg_115" role="doc-pagebreak"/> (<a href="https://en.wikipedia.org/wiki/Opinion_polling_for_the_next_United_Kingdom_general_election">https://<wbr/>en<wbr/>.wikipedia<wbr/>.org<wbr/>/wiki<wbr/>/Opinion<wbr/>_polling<wbr/>_for<wbr/>_the<wbr/>_next<wbr/>_United<wbr/>_Kingdom<wbr/>_general<wbr/>_election</a>) on October 18, 2019, and contains interesting data on opinion polling that, if extracted into a KG, could serve a useful purpose in forecasting and other such problem areas. However, extracting data from this table is no easy feat. While many of the columns contain numerical attributes, the numbers are not interpretable without both context and column headers. In this case, the column headers (not shown) are either abbreviations for UK political parties (such as Con or Lib Dem), which would be difficult to link to canonical entities without explicit domain knowledge, or for specific agendas such as Brexit. Other columns represent the area, which itself often contains an acronym value (e.g., GB stands for Great Britain), and in the case of the second column, a date range. Ontologizing such a data set is itself a challenging problem,<sup><a href="chapter_5.xhtml#fn6x5" id="fn6x5-bk">6</a></sup> while extraction would be even more so, especially considering that the table would not normally come in a spreadsheet, but rather would be embedded within an HTML page.</p>
<div class="figure">
<figure class="IMG"><a id="fig5-8"/><img alt="" src="../images/Figure5-8.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig5-8">Figure 5.8</a>:</span> <span class="FIG">A tabular data source (containing information on polling from an instance of the UK general election, accessed on Wikipedia on October 18, 2019) for possible KG construction.</span></p></figcaption>
</figure>
</div>
<span aria-label="116" id="pg_116" role="doc-pagebreak"/>
<span aria-label="117" id="pg_117" role="doc-pagebreak"/>
<span aria-label="118" id="pg_118" role="doc-pagebreak"/>
<p><span aria-label="119" id="pg_119" role="doc-pagebreak"/>Currently, state-of-the-art tools cannot easily handle such data sets without significant upfront modeling and annotation effort. However, the broader problem of table understanding has been picking up steam in the web IE research community, building on important work that has been published over the last decade. One compromise often made in many of these papers is to assume that the input, when it comes to the extraction system, takes the form of a well-formed spreadsheet. This may not be completely unreasonable because a web table understanding pipeline could be decomposed into two steps: extract a table from HTML and write it out as a CSV (or other similarly formatted) spreadsheet, and then use a more advanced IE system specifically designed to operate on spreadsheets. Some noise in the first process is inevitable, but it can be adequately addressed with state-of-the-art machine learning, especially with the availability of massive open corpora obtained and released by projects such as Web Data Commons.</p>
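<p>The first step of the two-step pipeline described above, lifting a table out of HTML and serializing it as CSV rows, can be sketched with only the standard library (the class and names are illustrative):</p>

```python
# Minimal stdlib-only extraction of an HTML table into rows, ready to be
# written out as CSV for a downstream spreadsheet-understanding system.
import csv, io
from html.parser import HTMLParser

class TableToRows(HTMLParser):
    """Collects <tr>/<td|th> contents into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")
    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False
    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()

html_table = "<table><tr><th>Party</th><th>Poll</th></tr><tr><td>Con</td><td>35</td></tr></table>"
parser = TableToRows()
parser.feed(html_table)

buf = io.StringIO()
csv.writer(buf).writerows(parser.rows)  # serialize the rows as CSV text
print(parser.rows)  # → [['Party', 'Poll'], ['Con', '35']]
```

<p>Real web tables bring merged cells, nested tables, and presentation markup, which is exactly the noise that a learned downstream component must absorb.</p>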
<p>An example of the second technique is the system published by Chen and Zipf (2017), which specifically addressed the problem of identifying spreadsheet properties, in the hope that these would lead to transformation programs enabling the automated conversion of spreadsheets into relational databases (RDBs). Spreadsheet properties here are defined to be aggregation rows, aggregation columns, hierarchical data, and other elements. Identifying spreadsheet properties is akin to parsing the structure of the spreadsheet (but in a semantic sense), which is not dissimilar from WI on a set of webpages. In their work, they used <i>rule-assisted active learning</i>, wherein crude, easy-to-write rules provided by users are integrated into an active learning framework and are used to save labeling effort via the generation of additional high-quality, labeled data in the initial training phase. Experimentally, the approach <span aria-label="120" id="pg_120" role="doc-pagebreak"/>was able to perform impressively even with minimal training data and was robust to low quality in the initially provided rules. Whether the approach would be able to perform well on “wild” data, such as the formatted table in <a href="chapter_5.xhtml#fig5-8">figure 5.8</a>, is open to debate.</p>
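<p>The flavor of rule-assisted active learning can be conveyed with a deliberately tiny sketch: a crude user rule seeds labels cheaply, a trivial scorer generalizes from those seeds, and the learner surfaces the unlabeled row it most wants confirmed. The rule, features, and data below are illustrative assumptions, not the published system.</p>

```python
# Toy sketch of rule-assisted active learning over spreadsheet rows.
# Task: flag "aggregation" rows. The rule, scorer, and data are invented
# for illustration only.

rows = [
    ("Total", [900.0]),
    ("Subtotal", [120.0]),
    ("Widget A", [55.0]),
    ("Widget B", [65.0]),
    ("Sum of parts", [120.0]),
]

def crude_rule(header):
    """Easy-to-write user rule: a header mentioning 'total' => aggregation."""
    return "aggregation" if "total" in header.lower() else None  # else abstain

# 1. Seed labels cheaply from the rule (rows where it does not abstain).
seeds = {i: y for i, (hdr, _) in enumerate(rows) if (y := crude_rule(hdr))}

# 2. A trivial scorer generalizes from the seeds: rows whose value
#    coincides with a known aggregation value look suspicious too.
agg_values = {rows[i][1][0] for i in seeds}
def score(i):
    return 1.0 if rows[i][1][0] in agg_values else 0.0

# 3. Active-learning query: ask the user about the unlabeled row the
#    scorer is most suspicious of, rather than labeling everything.
unlabeled = [i for i in range(len(rows)) if i not in seeds]
query = max(unlabeled, key=score)
print(rows[query][0])  # → Sum of parts
```

<p>In a realistic system, step 2 would be a trained classifier over rich cell features and step 3 would use proper uncertainty sampling, but the division of labor between rules, model, and human queries is the same.</p>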
<p>Another approach, proposed at roughly the same time, profiled the potential of web tables for augmenting cross-domain KGs like DBpedia and YAGO (detailed further in part V). The motivation for using tables is that the original cross-domain KGs have a low degree of completeness and are also far from fully correct (e.g., the population of the US changes every year, depending on data availability; however, the updated number may not appear in DBpedia for some time). In contrast, tables on the web, when properly extracted and analyzed, contain up-to-date information for many categories that frequently become outdated in web KGs. The cross-domain KG augmentation approach mentioned earlier compares several data fusion strategies for incorporating table knowledge into KGs, and determines that some strategies work better than others. This work, as an empirical study, is important because it shows that even when we don’t use web tables to construct KGs from scratch, we can still use them to improve existing KGs, or KGs that have been extracted over ordinary webpages and natural-language documents. In keeping with this motivation, several other approaches have been, and continue to be, proposed for extracting data from tables and mapping it to KGs (including deep learning approaches). We do not believe there is consensus yet on this problem, though progress is being made. We provide instructive pointers for the interested researcher in the “Bibliographic Notes” section.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec5-4"/><b>5.4 Concluding Notes</b></h2>
<p class="noindent">The web is the richest and most voluminous source of data in existence today, and large swathes of it are accessible publicly. However, before we can do complex analytics or answer questions on such webpages, we have to find the relevant data (domain discovery, as covered in chapter 3) and then develop web IEs for constructing a KG over the data. This chapter covered web IE techniques and challenges, particularly wrapper generation. Wrappers were first proposed in the 1990s for parsing the structure and content in webpages to extract entities and attributes, and they have come a long way since then, with the development of numerous supervised, semisupervised, and unsupervised variants.</p>
<p>It is important to note that very frequently, web sources cannot be classed along neat lines as structured or unstructured, and neither can a real-world web IE system. Frequently, a combination of assumptions and heuristics yields the best performance; in a real sense, many web IE approaches (if not all IE approaches) are hybrid due to the restrictive quality requirements of downstream applications that must consume IE outputs and the constructed KG. An interesting use-case of a real-world, effective hybrid web IE system executed over semistructured data is the DBpedia project, which has been hugely influential in the Semantic Web community. We provide details on DBpedia in part V; in this chapter, we note that DBpedia is a KG that has been constructed over structured data <span aria-label="121" id="pg_121" role="doc-pagebreak"/>(namely, infobox attributes) on Wikipedia webpages. Because Wikipedia webpages are not free-form webpages, but rather have a somewhat predictable metastructure, including the infoboxes, links to other Wikipedia pages, a chapterlike structure, and reasonably high quality due to crowdsourced vetting, IE techniques such as those used by DBpedia can account for such regularities, delivering a higher-quality KG than might be output by a program that makes no such assumptions or is not developed to explicitly look for, and extract from, Wikipedia infoboxes.</p>
<p>Finally, we concluded the chapter with a note on extracting KG entities and attributes from web tables. Contrary to perception, web tables can be as messy as web documents, and extracting KG entities from them is problematic, as the entity could be a row or a value in a cell. Columns may represent attribute values, but they could also contain identifiers for other entities. Representation learning and information extraction over tables therefore continue to be a new and exciting area, one that is poised to become still more important as more raw data is released in the form of tables due to efforts such as open data and open government.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec5-5"/><b>5.5 Software and Resources</b></h2>
<p class="noindent">Unlike NER and other kinds of IE on natural language, open-source and widely used toolkits for web IE are rare. This may be because many of the classic algorithms mentioned in this chapter were developed and first published more than two decades ago (in most cases); even if implementations were released at the time, they are likely no longer relevant and would have to be reimplemented for modern HTML pages and standards, which differ considerably from those prevalent in the early days of the web. For many of the algorithms, we do recommend a reimplementation. For those looking for off-the-shelf tools that work reasonably well, we recommend w3lib (<a href="https://github.com/scrapy/w3lib">https://<wbr/>github<wbr/>.com<wbr/>/scrapy<wbr/>/w3lib</a>) and scrapely (<a href="https://github.com/scrapy/scrapely">https://<wbr/>github<wbr/>.com<wbr/>/scrapy<wbr/>/scrapely</a>). The former is useful for working with various kinds of web data, including HTML, forms, and URLs. The latter is more closely related to the content of this chapter, and it is also useful for extracting structured data from HTML pages. Even if <i>Scrapely</i> is not directly used or preferred, it can still be used to preprocess the HTML page in a way that then makes it more amenable to the application of advanced algorithms, including recurrent neural networks and sequence-labeling architectures. Another useful package that is implemented in Python and that is invaluable for working with, and preprocessing, web data is <i>Beautiful Soup</i> (<a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">https://<wbr/>www<wbr/>.crummy<wbr/>.com<wbr/>/software<wbr/>/BeautifulSoup<wbr/>/bs4<wbr/>/doc<wbr/>/</a>).</p>
<p>In some cases, it is possible to locate a version of the source code (of the more advanced algorithms). For example, RoadRunner maintains a homepage at <a href="http://www.dia.uniroma3.it/db/roadRunner/">http://<wbr/>www<wbr/>.dia<wbr/>.uniroma3<wbr/>.it<wbr/>/db<wbr/>/roadRunner<wbr/>/</a>, with resources for publications, experimental results, and downloads. It is also possible to find implementations of web extractors hosted on GitHub, but these should be treated with caution, as they may not have undergone rigorous testing or quality control.</p>
<p><span aria-label="122" id="pg_122" role="doc-pagebreak"/>However, there are some excellent commercial tools available for web IE and crawling. One example is Inferlink, which offers domain discovery and web IE services (<a href="http://www.inferlink.com/">http://<wbr/>www<wbr/>.inferlink<wbr/>.com<wbr/>/</a>). Others include Web Scraper (<a href="https://webscraper.io/">https://<wbr/>webscraper<wbr/>.io<wbr/>/</a>), Scrapinghub (<a href="https://scrapinghub.com/data-services">https://<wbr/>scrapinghub<wbr/>.com<wbr/>/data<wbr/>-services</a>), PromptCloud (<a href="https://www.promptcloud.com/">https://<wbr/>www<wbr/>.promptcloud<wbr/>.com<wbr/>/</a>), and Web Data Extraction Services (<a href="https://webdataextractionservices.com/">https://<wbr/>webdataextractionservices<wbr/>.com<wbr/>/</a>), to list only a few options. Some are much more specialized (e.g., for e-commerce crawling and product matching) than others. An advantage of having so many service providers in this space is that one could obtain a reasonable amount of crawled and extracted data (via a customized process) relatively inexpensively.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec5-6"/><b>5.6 Bibliographic Notes</b></h2>
<p class="noindent">Many approaches to web IE and WI systems, including machine learning and pattern-mining techniques, have been proposed, with various degrees of automation. The survey by Chang et al. (2006) condenses and synthesizes the previously proposed taxonomies for IE tools developed by leading researchers in this area. This chapter is largely based on the core material in that survey, as it was written shortly after the main WI techniques had become established as mainstream.</p>
<p>As noted in the previous chapter, the Message Understanding Conferences (MUCs) inspired the early work in web IE. Five main tasks were defined for text IE: named entity recognition (NER), coreference resolution, template element construction, template relation construction, and scenario template production. The significance of the MUCs in the field of IE motivated some researchers to classify IE approaches into different classes, as Chang et al. (2006) also point out: <i>MUC approaches</i>, evidenced by work from Riloff et al. (1993), Huffman (1996), Kim and Moldovan (1995), Krupka (1995), Soderland et al. (1995), among others published during that era; and <i>post-MUC approaches</i>, evidenced by work from Soderland (1999), Mooney (1999), Freitag (1998), Kushmerick et al. (1997), Hsu and Dung (1998), and Muslea et al. (1999), among others.</p>
<p>Note that even as early as the late 1990s, enough wrapper systems had emerged that authors such as Hsu and Dung (1998) had begun taxonomizing them into categories (e.g., handcrafted wrappers using general programming languages, heuristic-based wrappers, wrappers using specially designed programming languages or tools, and WI approaches). Chang et al. (2006) respected this taxonomy and compared WI systems from the user’s point of view by distinguishing IE tools based on the degree of automation. Consequently, their findings showed that IE tools could be categorized into (1) systems that need programmers, (2) systems that need annotation examples, (3) annotation-free systems, and (4) semisupervised systems.</p>
<p>These are not the only categorizations that exist: similar kinds of groupings were also proposed over the years by Muslea et al. (1999), Kushmerick and Thomas (2003), Sarawagi (2002), Kuhlins and Tredwell (2002), and Laender, Ribeiro-Neto, Da Silva, et al. (2002). <span aria-label="123" id="pg_123" role="doc-pagebreak"/>These authors looked at the web IE problem in different ways; for instance, Muslea et al. (1999) chose to categorize based on input document type and the structure and constraints assumed for the extraction patterns, Kushmerick and Thomas (2003) chose to categorize IE into only two distinct categories (finite-state and relational), and Laender, Ribeiro-Neto, Da Silva, et al. (2002) proposed a taxonomy based on the primary <i>technique</i> used by each tool to generate a wrapper. Sarawagi (2002) classified wrappers into three categories depending on the extraction task (i.e., is the wrapper a record-level wrapper, a page-level wrapper, or a site-level wrapper?). The classification by Kuhlins and Tredwell (2002), based on commercial versus noncommercial availability, may be the simplest (though perhaps not as useful for researchers).</p>
<p>This discussion shows the importance of web IE (and WI, in particular) to the AI and World Wide Web communities. We are now in the phase, in fact, where metasurveys can be written about web IE tools rather than just surveys. The approach that we took for this chapter has elements of both a survey and a metasurvey, though we use specific system examples to convey the concepts where applicable. In particular, details on manually constructed and supervised wrapper systems, such as TSIMMIS, Minerva, WebOQL, W4F, XWrap, Rapier, Stalker, SRV, WHISK, NoDoSE, WIEN, and DEByE may be found in Hammer et al. (1997), Crescenzi and Mecca (1998), Arocena and Mendelzon (1998), Sahuguet and Azavant (2001), Liu et al. (2000), Mooney (1999), Muslea et al. (1999), Freitag (1998), Soderland (1999), Adelberg (1998), Kushmerick et al. (1997), and Laender, Ribeiro-Neto, Da Silva, et al. (2002), respectively. Details on semisupervised approaches mentioned in the chapter, such as OLERA, IEPAD, and Thresher, may be found in Chang and Kuo (2004), Chang and Lui (2001), and Hogue and Karger (2005). Details on unsupervised approaches, such as RoadRunner, EXALG, DEPTA, and DeLA, may be found in the papers by Crescenzi et al. (2001), Arasu and Garcia-Molina (2003), Liu et al. (2003), and Wang and Lochovsky (2002).</p>
<p>Toward the end of the chapter, we covered more recent work, including web IE for tables and dynamic webpages, such as those found on the Deep Web. For the former topic, we recommend Cafarella et al. (2008), Etzioni et al. (2005), Cafarella et al. (2009), and Gatterbauer et al. (2007). A related reference, for learning more about web-scale table census and classification (including the table types introduced earlier), is Crestan and Pantel (2011). For the latter, a good reading list should include Chen and Zipf (2017), Liu, Meng, and Meng (2009), Lehmann et al. (2012), Furche et al. (2013), An et al. (2007), and Hong (2010).</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec5-7"/><b>5.7 Exercises</b></h2>
<ul class="numbered">
<li class="NL">1. Suppose that you were crawling the full website lawyers.findlaw.com (we showed an example webpage in <a href="chapter_5.xhtml#fig5-1">figure 5.1</a>, along with the KG fragment that would be extracted by a good web IE system in <a href="chapter_5.xhtml#fig5-3">figure 5.3</a>). How might you use the techniques described <span aria-label="124" id="pg_124" role="doc-pagebreak"/>in this chapter to convert this website to a rich KG representation? Would a system like RoadRunner be useful to apply in such a use case?</li>
<li class="NL">2. Again using <a href="chapter_5.xhtml#fig5-3">figure 5.3</a> as an example, can you design an EC model and some Stalker rules for the webpage shown in <a href="chapter_5.xhtml#fig5-1">figure 5.1</a>?</li>
<li class="NL">3. Can you think of example domains where both web IE and NER techniques (from the previous chapter, for instance) would have to be considered in tandem to give you a KG such as in <a href="chapter_5.xhtml#fig5-2">figure 5.2</a>? Give a rudimentary schematic of what a webpage from the domain you have in mind might look like.</li>
<li class="NL">4. With reference to the previous discussion in the chapter on web tables and information extraction, write down the table types of the three tables shown here.</li>
</ul>
<figure class="IMG"><img alt="" src="../images/pg124.png" width="450"/>
</figure>
<ul class="numbered">
<li class="NL">5. Mark a single triple on the table image (circle the subject, predicate, and object, and mark them with “s,” “p,” and “o,” respectively).</li>
</ul>
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_5.xhtml#fn1x5-bk" id="fn1x5">1</a></sup> Another way to think about dynamic webpages is that, caching and other such mechanisms aside, the HTML page <i>does not exist</i> (although the data for it does, in a database) until someone asks (“queries”) for it.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_5.xhtml#fn2x5-bk" id="fn2x5">2</a></sup> In this chapter, <i>web domain, top-level domain</i>, and <i>website</i> are all used interchangeably, with examples including craigslist.com and ebay.com. In contrast, a <i>webpage</i> is a specific HTML page. A web domain is also a webpage (called its <i>home page</i>).</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_5.xhtml#fn3x5-bk" id="fn3x5">3</a></sup> This would happen if the very next line described another attribute of the book, such as “Reviewer,” which would also be in bold (just like the attribute name “Book Name”); hence, the line itself would begin with <i>&lt;b&gt;</i>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_5.xhtml#fn4x5-bk" id="fn4x5">4</a></sup> In other words, the list contains one or two, singular or plural, common nouns.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_5.xhtml#fn5x5-bk" id="fn5x5">5</a></sup> This is itself subclassified further into <i>vertical</i> and <i>horizontal</i> listings.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_5.xhtml#fn6x5-bk" id="fn6x5">6</a></sup> One reason for this is that, while some cell values are clearly “values” or KG literals (“attributes”), others are instances (“entities”), and many are even concepts. Furthermore, subtables and rows are akin to <i>events</i> (opinion polls) due to a clear spatiotemporal dependence and the presence of actors, sources, and other event-centric arguments. Compiling all of these elements into a single ontology is not straightforward.</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html>