<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
<head>
<title>Knowledge Graphs</title>
<meta content="text/html; charset=utf-8" http-equiv="default-style"/>
<link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<meta content="urn:uuid:531250ec-8629-4bbe-b4be-eb1eb0e84538" name="Adept.expected.resource"/>
</head>
<body epub:type="bodymatter">
<div class="body">
<p class="SP"> </p>
<section aria-labelledby="ch16" epub:type="chapter" role="doc-chapter">
<header>
<h1 class="chapter-number" id="ch16"><span aria-label="415" id="pg_415" role="doc-pagebreak"/>16</h1>
<h1 class="chapter-title"><b>Knowledge Graphs and Ontologies in Science</b></h1>
</header>
<div class="ABS">
<p class="ABS"><b>Overview.</b> Scientific knowledge is one of the most important repositories of knowledge available to humanity. As scientific domains have proliferated and broadened in scope, particularly in rapidly advancing fields like biology and medicine, it has become all the more important to organize and represent that knowledge in a coherent manner. Ontologies and knowledge graphs (KGs) have emerged as fundamental technologies for accomplishing that goal. In this chapter, we provide an overview of KGs in selected scientific communities where KGs have had a substantial impact in recent years (and thus provide important best-practice lessons), such as the life science, chemistry, and geoscience communities.</p>
</div>
<section epub:type="division">
<h2 class="head a-head"><a id="sec16-1"/><b>16.1 Introduction</b></h2>
<p class="noindent">Scientific knowledge is arguably one of the most important and trusted bodies of knowledge available to humanity, and it has been painstakingly developed and tested over many centuries of theorizing and hypothesis formulation, experimental design and execution, and refinement. Just like ordinary social communities, each scientific community has its own norms and views, but common among them all is the scientific method and a focus on experimental validation of hypotheses. In each community, the scientific method has led to vast repositories of <i>scientific facts</i>, many of which have been replicated in multiple experiments or otherwise have some provenance or citation (usually, but not always, through a peer-reviewed scientific publication or experiment of record) that can be used to determine the verity of the claim. Because the scientific method is inductive, a given hypothesis can be overturned in the face of new knowledge. Facts can also lose their status as facts if new experiments are unable to replicate the finding or if unintentional (and in some unfortunate cases, intentional) biases or mistakes are discovered to lie behind the data.</p>
<p>We provide this background to illustrate two critical aspects of a scientific domain: first, despite all the differences between individual fields of study, the scientific method, as well as a focus on high-quality, experimentally validated, and peer-reviewed knowledge, provides a strong guarantee of the quality of modern scientific knowledge generated; and second, science is both a <i>dynamic</i> and a <i>social</i> endeavor. Science is dynamic because the <span aria-label="416" id="pg_416" role="doc-pagebreak"/>generation of new knowledge often leads to overturning of old hypotheses or proposals of new hypotheses (and sometimes the creation of entirely new areas of research), but it is also social because scientists rarely work in isolation—they must almost always draw on the findings of others in proposing their contributions. In other words, scientific knowledge is meant to be <i>shared</i>, and facts within an area of study have structure to them, as scientists within the field tend to use similar terminology and obey similar norms (notwithstanding the complexities that arise due to factions between communities and geopolitical differences). Intuitively, it is not hard to make the argument that scientific subfields have their own ontology, though it may not always be easy to encode an ontology (to satisfy everyone in the field) in a formal language like Web Ontology Language (OWL).</p>
<p>Going one step beyond ontologies, we argue that KGs are equally well suited to encoding, publishing, and sharing scientific knowledge for several reasons. The main reason is the structure of the data itself, because scientific knowledge as a high-quality repository is predominantly factual, though many facts can be superseded or refined in the face of new experiments and findings. On an auxiliary note, uncertainty and provenance are clearly important aspects that must be systematically accounted for, and although knowledge provenance is beyond the scope of this book (we provide guidance for the interested reader in the section entitled “Bibliographic Notes,” at the end of this chapter), ontologies for codifying and capturing provenance currently exist in the Semantic Web (SW) community. Next, because of the increasingly social nature of science, as evidenced by the success of large consortia and ambitious scientific projects like the discovery of gravitational waves or experiments in the Large Hadron Collider (which required the participation of numerous scientists and disciplines), properly represented and published KGs can be immensely useful to all the different stakeholders for sharing and querying a body of scientific knowledge. Finally, because of the domain-specific nature of scientific knowledge (a different way to say this is that an ontology in biology is very different from an ontology in materials science), KGs are apt, as they are well suited for capturing domain knowledge. In fact, as we’ve seen in the vast majority of this text, the structure and schema of most KGs (the one exception being KGs constructed using Open IE techniques) are defined using a domain ontology.</p>
<p>One point of confusion that can arise in natural disciplines like biology or chemistry is between a KG and an ontology. Because such disciplines tend to make claims about concepts (e.g., water molecule) rather than a particular instance of the concept (a particular water molecule in the Pacific Ocean), it is common to define the knowledge in an ontology. This leads to an enormous ontology, with no corresponding KG. This makes the scientific domain unusual, in that ontologies, not KGs, are the first-class citizens; in fact, KGs rarely exist in a structured format. However, because of its size and complexity, the ontology serves the same purpose as a KG does in more common applications like e-commerce <span aria-label="417" id="pg_417" role="doc-pagebreak"/>or search. With this in mind, in the rest of this chapter, we treat such ontologies just as we would the KGs that we have encountered thus far.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec16-2"/><b>16.2 Biology</b></h2>
<p class="noindent">As biological information has accumulated, it has become increasingly important to describe and classify biological objects in meaningful ways. Many species- and domain-specific databases have strategies to organize and integrate such data, allowing users to sift through expanding volumes of information. However, biologists want to be able to use the information stored in disparate databases to ask interesting and relevant, field-specific questions. For example, a biologist might want to ask which genes or gene products contribute to the formation and development of an epithelial sheet. Researchers may not want to stop there, and they may want to be able to expand such queries to find gene products in different organisms that share characteristics. To support this kind of research, an ordinary database is not enough (i.e., searching for such information in the context of complex scientific tasks like examining microarray expression data or sequencing genotypes in a population is simply impossible without an ecosystem of computational tools, querying systems, and even annotation schemes).</p>
<p>Although the biological field has many good resources to draw upon, the dominant resource that is pertinent to KGs as we have presented them in this book is the <i>Gene Ontology</i> (GO). Despite what its name suggests, the Gene Ontology has emerged as something far more than an ontology (as would be understood in everyday practice), and it is now an ecosystem in itself, having inspired a range of downstream systems and applications. In the rest of this section, we describe the various aspects of the Gene Ontology in detail.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec16-2-1"/><b>16.2.1 Gene Ontology</b></h3>
<p class="noindent">The need for a flexible, extensible resource in the life sciences was recognized well before the era of Big Data had begun in other areas. The motivating factors included the many disparate data sources, varying information needs, and the fact that biological knowledge expands in irregular, often unpredictable ways, with some subfields growing faster than others. In response to this need, the Gene Ontology Consortium<sup><a href="chapter_16.xhtml#fn1x16" id="fn1x16-bk">1</a></sup> was formed to “develop a comprehensive, computational model of biological systems, ranging from the molecular to the organism level, across the multiplicity of species in the tree of life.” The original intent of the group was to construct a set of vocabularies whose terms could be shared with a common understanding of their meaning and that could support cross-database queries. In this sense, the purpose was no different <span aria-label="418" id="pg_418" role="doc-pagebreak"/>from that of any group looking to define a class ontology for formalizing the terminology and semantics of their domain.</p>
<p>However, even very early on, novel extensions to the intent started to emerge. For example, it became clear that the combined set of annotations from the model organism groups would provide a useful resource for the community. This is a good example of a use-case that emerges organically. As a result of these annotations, in addition to developing the shared structured vocabularies, the GO project intended to develop an extensible database resource providing access not just to vocabularies, but also to annotation and query applications. The consortium also intended to release specialized data sets resulting from the use of the vocabularies in the annotation of genes and gene products. The goals of the GO Consortium are listed in <a href="chapter_16.xhtml#tab16-1" id="rtab16-1">table 16.1</a>.</p>
<div class="table">
<p class="TT"><a id="tab16-1"/><span class="FIGN"><a href="#rtab16-1">Table 16.1</a>:</span> <span class="FIG">Specific goals of the GO Consortium.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>Goals of the GO Consortium</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB">Compiling a comprehensive structured vocabulary of terms describing various elements of molecular biology shared among life forms. Furthermore, terms are defined, may have synonyms, and may be refined further into broader and narrower organizations. Also, separate vocabularies will be used to define the different dimensions of biology.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Describing the biological objects (in the model organism database of each contributing member) using the terms compiled in the first point.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Providing querying tools and mechanisms for manipulating the vocabularies, including (1) adding new vocabularies for additional aspects of biology, (2) permitting researchers to locate terms and biological objects via the web (or even more complex ways), and (3) allowing the setup of satellite databases where necessary.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Providing the tools to enable curators to assign GO terms to biological objects (including sequence-based methods, editorial annotations, protein binding experiments and microarrays).</p></td>
</tr>
</tbody>
</table>
</figure>
</div>
<p>It should be noted that the consortium is also clear about what its goals are <i>not</i>. For example, the consortium has said clearly that the ontology is not a way to unify biological data sets, even though (at least in practice) the sharing of nomenclature has had that added welcome effect. The ontology is also not a dictated standard, in the sense of mandating nomenclature across databases. Rather, the goal is for groups to come together and arrive at a mutually acceptable consensus. Finally, the GO does not define homologies between gene products from different organisms. While usage of the ontology has resulted in shared annotations for gene products from different organisms, the annotation is not (in itself) sufficient for determining an evolutionary relationship.</p>
<p class="TNI-H3"><b>16.2.1.1 Structure of the Ontology</b> To fulfill its goals, the GO Consortium developed three ontologies: <i>molecular function, biological process</i>, and <i>cellular component</i>, <span aria-label="419" id="pg_419" role="doc-pagebreak"/>to describe attributes of gene products or gene product groups. The three ontologies are each represented by a separate root ontology term. All terms in a domain can trace their parentage to a root term, although there may be numerous paths via varying numbers of intermediary terms to an ontology root. The three root nodes are unrelated and do not have a common parent node; for this reason, it is more appropriate to think of the GO as a project rather than as a single ontology. One complication that can arise due to the presence of three ontologies is the use of graph-based (or other) software that can work with only one ontology at a time (i.e., requires a single root node). A workaround suggested by the GO itself for dealing with this issue is to introduce a fake term that serves as a single root and is the parent of the three existing root nodes.</p>
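<p>The single-root workaround can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GO's own tooling: the toy terms, the parent links between them, the helper names, and the artificial root name <i>all</i> are hypothetical, and only the three root names come from the text.</p>

```python
# Toy ontology stored as child -> list of parents.
# Only the three root names come from the text; the other terms and
# their parent links are illustrative assumptions.
PARENTS = {
    "molecular_function": [],
    "biological_process": [],
    "cellular_component": [],
    "catalytic activity": ["molecular_function"],
    "metabolic process": ["biological_process"],
    "organelle": ["cellular_component"],
    "intracellular part": ["cellular_component"],
    "mitochondrion": ["organelle", "intracellular part"],
}


def find_roots(parents):
    """Return, in sorted order, the terms that have no parent."""
    return sorted(term for term, ps in parents.items() if not ps)


def add_synthetic_root(parents, root_name="all"):
    """Return a copy in which every existing root points to one artificial root."""
    unified = {term: list(ps) for term, ps in parents.items()}
    for root in find_roots(parents):
        unified[root] = [root_name]
    unified[root_name] = []
    return unified


def paths_to_root(parents, term):
    """Enumerate every path from a term up to a root (a DAG may have several)."""
    if not parents[term]:
        return [[term]]
    paths = []
    for parent in parents[term]:
        for tail in paths_to_root(parents, parent):
            paths.append([term] + tail)
    return paths


unified = add_synthetic_root(PARENTS)
print(find_roots(PARENTS))   # three unrelated roots
print(find_roots(unified))   # one artificial root
for path in paths_to_root(unified, "mitochondrion"):
    print(" / ".join(path))
```

<p>Because the toy term <i>mitochondrion</i> has two parents, the last loop prints two distinct paths, both ending at the artificial root. Single-root graph software could then treat the three ontologies as one graph.</p>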
<p>While a molecular function describes what a gene product does at the biochemical level, a biological process describes a broad biological objective and a cellular component describes the location of a gene product (e.g., within a cellular structure). The three ontologies are directed acyclic graphs, and predominantly include both <i>“is a”</i> and <i>“part of”</i> relationship types. Each term in the ontology is an accessible object in the GO resource and has a unique identifier that can be used as a database cross-reference. Elements of GO terms, as per the latest edition, are described in <a href="chapter_16.xhtml#tab16-2" id="rtab16-2">table 16.2</a>.</p>
<div class="table">
<p class="TT"><a id="tab16-2"/><span class="FIGN"><a href="#rtab16-2">Table 16.2</a>:</span> <span class="FIG">Essential and optional elements (the latter indicated by *) in GO terms.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>Term</b></p></th>
<th class="TCH"><p class="TB"><b>Description</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB">Unique identifier and term name</p></td>
<td class="TB"><p class="TB">Every term has a human-readable term name (e.g., mitochondrion) and a GO ID, which is a unique seven-digit identifier prefixed by <i>GO:</i> (e.g., GO:0005739).</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Aspect</p></td>
<td class="TB"><p class="TB">Denotes which of the three subontologies (cellular component, biological process, or molecular function) the term belongs to.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Definition</p></td>
<td class="TB"><p class="TB">A textual description of what the term represents, plus references to the source of the information.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Relationships to other terms</p></td>
<td class="TB"><p class="TB">How the term relates to other terms in the ontology. All terms (other than the root terms representing each aspect) have an “is a” subclass relationship to another term.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Secondary IDs (also known as alternate IDs)*</p></td>
<td class="TB"><p class="TB">Secondary IDs come about when two or more terms are identical in meaning and merged into a single term. All term IDs are preserved so that no information (e.g., annotations to the merged IDs) is lost.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Database cross-references (dbxrefs)*</p></td>
<td class="TB"><p class="TB">Database cross-references refer to identical or very similar objects in other databases.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Synonyms*</p></td>
<td class="TB"><p class="TB">Alternative words or phrases closely related in meaning to the term name, with indication of the relationship between the name and synonym given by the synonym scope (<i>exact, broad, narrow</i>, and <i>related</i>). Custom synonym types are also used in the ontology. For example, a number of synonyms are designated as systematic synonyms; synonyms of this type are exact synonyms of the term name.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Comment*</p></td>
<td class="TB"><p class="TB">Any extra information about the term and its usage.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Subset*</p></td>
<td class="TB"><p class="TB">Indicates that the term belongs to a designated subset of terms, e.g., one of the GO subsets.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Obsolete tag*</p></td>
<td class="TB"><p class="TB">Indicates that the term has been deprecated and should not be used. A GO term is obsoleted when it is out of scope, misleadingly named or defined, or describes a concept that would be better represented in another way and needs to be removed from the published ontology. In these cases, the term and ID persist in the ontology, but the term is tagged as obsolete and all relationships to other terms are removed. A comment is added to the term which details the reason for the obsoletion and replacement terms are suggested, if possible.</p></td>
</tr>
</tbody>
</table>
</figure>
</div>
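<p>To make the elements in the table concrete, the following Python sketch models a GO term as a small record. It is an illustration under stated assumptions rather than the consortium's data model: the class, field, and function names are hypothetical, as is the second term and its ID. The ID check (the prefix <i>GO:</i> followed by seven digits), the preservation of merged IDs as secondary IDs, and the handling of obsolete terms follow the descriptions in the table.</p>

```python
import re
from dataclasses import dataclass, field

GO_ID_PATTERN = re.compile(r"GO:\d{7}")
ASPECTS = ("molecular_function", "biological_process", "cellular_component")


@dataclass
class GOTerm:
    """Hypothetical record mirroring the essential and optional term elements."""
    go_id: str
    name: str
    aspect: str
    definition: str
    is_a: list = field(default_factory=list)       # parent term IDs
    alt_ids: list = field(default_factory=list)    # secondary IDs
    synonyms: dict = field(default_factory=dict)   # scope -> list of names
    obsolete: bool = False
    comment: str = ""

    def __post_init__(self):
        if not GO_ID_PATTERN.fullmatch(self.go_id):
            raise ValueError("GO IDs are 'GO:' followed by seven digits")
        if self.aspect not in ASPECTS:
            raise ValueError("unknown aspect: " + self.aspect)


def merge_terms(kept, merged):
    """Fold `merged` into `kept`, preserving every merged ID as a secondary ID."""
    kept.alt_ids.append(merged.go_id)
    kept.alt_ids.extend(merged.alt_ids)
    return kept


def mark_obsolete(term, reason):
    """Tag a term as obsolete: the ID persists, relationships are removed."""
    term.obsolete = True
    term.is_a.clear()
    term.comment = reason
    return term


mito = GOTerm("GO:0005739", "mitochondrion", "cellular_component",
              "Toy definition text.", is_a=["GO:0005575"])
# Hypothetical duplicate term with a made-up ID, used only for illustration.
dup = GOTerm("GO:9999999", "mitochondria", "cellular_component",
             "Toy duplicate.", is_a=["GO:0005575"])
merge_terms(mito, dup)
mark_obsolete(dup, "Merged into GO:0005739 (illustrative reason).")
print(mito.alt_ids, dup.obsolete, dup.is_a)
```

<p>Keeping the merged ID as a secondary ID means that annotations made against the old identifier can still be resolved to the surviving term.</p>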
<p><b>Molecular Function.</b> Within the molecular function ontology are terms describing activities that gene products perform at the molecular level (e.g., catalysis or transport). In GO, molecular function terms represent activities rather than the entities (molecules or complexes) that perform these actions, and they do not specify where, when, or in what context the actions take place. Molecular functions generally correspond to activities that can be performed by individual gene products (a protein or ribonucleic acid), but some activities are performed by molecular complexes composed of multiple gene products. Examples of broad functional terms in the molecular function ontology include “enzyme,” “transporter,” and “ligand,” while an example of a more specific functional term is “adenylate cyclase.” Note that there is the potential for semantic confusion between a gene product and its molecular function because very often, a gene product (enzymes being a particularly notorious example) is named by at least one of its molecular functions (or by its single function, if there is only one).</p>
<p><b>Biological Process.</b> These terms describe the larger processes or biological programs that are accomplished by several molecular activities. Examples of broad biological process terms that occur in the biological process ontology include “cell growth and maintenance,” with more specific examples being “pyrimidine metabolism” and “cAMP biosynthesis.” However, a biological process in this context is not considered equivalent to a pathway, and the consortium has not (thus far) attempted to represent any of the dynamics or dependencies required for describing a pathway.<span aria-label="420" id="pg_420" role="doc-pagebreak"/></p>
<p><span aria-label="421" id="pg_421" role="doc-pagebreak"/><b>Cellular Component.</b> Terms in the cellular component ontology describe locations, relative to cellular structures, in which a gene product performs a function: either cellular compartments (e.g., mitochondrion) or stable macromolecular complexes of which they are parts, such as the ribosome. Examples of terms in the cellular component ontology include complexes where multiple gene products can be found, such as ribosome and proteasome, in addition to location terms like nuclear membrane, used to indicate places in a cell where a gene product is active. Unlike the other aspects of GO, cellular component classes refer not to processes, but to cellular anatomy.</p>
<p>We note that each term in these three ontologies is defined, with a citation to the source from which the definition was obtained. Query and implementation tools were also developed to exploit the detailed relationships captured in the ontologies. Each term in the ontology has a relationship with at least one other term, but the consortium made a conscious decision not to incorporate these relationships in the term identifiers themselves due to the expected dynamic nature of the ontology (i.e., it was expected that over time, the location of the term within an ontology—namely, its parents and children—would likely change, sometimes in completely unexpected ways).</p>
<p class="TNI-H3"><b>16.2.1.2 Features and Improvements.</b> Since the early 2000s, the GO has come a long way and, because of its uptake, has continued to be enhanced in its tools, resources, and policies, particularly with a view to improving annotation consistency and ensuring that annotations reflect the state of current biological knowledge. A specific concern that the consortium addressed through these enhancements was the use of inconsistent data representations. Recent improvements to the original GO are described next.</p>
<p><b>Ontology Development.</b> The total number of GO terms has been steadily increasing, with over a 100 percent increase in just a decade (from around 18,000 to more than 40,000 between 2004 and 2014). Compared to the number of GO terms added to describe molecular functions and cellular components, the number of terms describing biological processes has increased at a higher rate, averaging 4,000 new ones every two years since 2011. There has also been a consistent increase in the number of manual annotations made by curators (e.g., the number of manually annotated gene products has grown to more than 400,000).</p>
<p>Development hasn’t been equally divided among the various ontology classes, with the cellular component branch seeing more robust enhancements than others and facilitating the needs of multiple communities. For example, the Subcellular Anatomy Ontology was merged into the GO cellular component representation, leading to a single, unified ontology designed to serve the needs of both the neuroscience community and the wider biomedical research community already being served by the GO. In a similar vein, ontology editors have carried out an effort to update and refine other areas of the ontology (e.g., making enhancements in the OWL version of GO to better support quality control and classification as integral parts of the overall ontology development cycle). For other details on ontology development and enhancement, we refer the interested reader to the “Bibliographic Notes” <span aria-label="422" id="pg_422" role="doc-pagebreak"/>section. The GO webpage itself has also undergone design changes and improvements; <a href="chapter_16.xhtml#fig16-1" id="rfig16-1">figure 16.1</a> illustrates how the homepage has changed in the last five years.</p>
<div class="figure">
<figure class="IMG"><a id="fig16-1"/><img alt="" src="../images/Figure16-1.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig16-1">Figure 16.1</a>:</span> <span class="FIG">A snapshot of changes to the GO resource webpage from five years ago (top) to April 2019 (bottom).</span></p>
</figcaption>
</figure>
</div>
<p><b>Annotation.</b> Annotation is an important part of a shared, growing resource like the GO. Over the last several years, the GO Consortium has introduced metadata to better describe annotation contexts, including relationships such as localization dependencies and transcription factors. Metadata to better describe the spatiotemporal aspects of processes such as cell type or developmental stages has also been introduced. The information expressed <span aria-label="423" id="pg_423" role="doc-pagebreak"/>by these extensions refines functional annotations by representing relationships between a basic annotation and contextual information from within the GO (or even external ontologies). Extended annotations can enable complex queries and reasoning, which is an important goal in building such ontologies to begin with. The consortium has also encouraged experts to provide input in various biological areas. For example, a collaboration with the Transcription Factor Checkpoint database expanded annotations to human, mouse, and rat transcription factors, and the Developmental Functional Annotation at Tufts (DFLAT<sup><a href="chapter_16.xhtml#fn2x16" id="fn2x16-bk">2</a></sup>) project improved the annotation quality of the genes involved in fetal development. A joint collaboration between Gramene and Ensembl Plants yielded initial GO annotations for tens of sequenced plant genomes with new releases. Ultimately, such annotations have significantly expanded the scope, quality, and use-cases made possible by the GO resource.</p>
<p><b>Public Access and Browsing.</b> We earlier illustrated how the GO website has changed to a simpler, more streamlined look compared to just five years ago. New tools for browsing GO annotations were also released. The GO Consortium also provides platforms for interaction and welcomes participation from the community, both to address general inquiries and to address specific requests for the ontology. There are plug-ins for social media and GitHub on the webpage. These social and public-facing aspects of the GO are clearly considered to be important assets by the consortium, and are expected to be maintained and enhanced over time.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec16-3"/><b>16.3 Chemistry</b></h2>
<p class="noindent">Chemistry is another important natural science discipline in which knowledge representation has found important use-cases and applications. We describe two important efforts, ChEBI and PubChem, in this domain. Note that the GO itself is liberally used in chemistry as well.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec16-3-1"/><b>16.3.1 Chemical Entities of Biological Interest</b></h3>
<p class="noindent">Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on small chemical compounds (i.e., genome-encoded macromolecules, such as nucleic acids, proteins, and peptides derived from proteins by cleavage, are not generally included in ChEBI). The molecular entities in question are either natural or synthetic products that are used to intervene in the processes of living organisms; in essence, they encompass any constitutionally or isotopically distinct atom, molecule, ion, ion pair, or radical (among others) that is identifiable as a separately distinguishable entity.</p>
<p>In addition to molecular entities, ChEBI contains groups (parts of molecular entities) and classes of entities. It includes an ontological classification, whereby the relationships <span aria-label="424" id="pg_424" role="doc-pagebreak"/>between molecular entities or classes of entities and their parents, children, or both are specified.</p>
<p>All the data in ChEBI is nonproprietary or derived from a nonproprietary source, and is available under the Creative Commons License. Furthermore, ChEBI data items have detailed provenance, in that they are traceable and explicitly referenced to the original source. A visualization of the ChEBI entity-centric dashboard for the molecular entity ecgonine benzoate is shown in <a href="chapter_16.xhtml#fig16-2" id="rfig16-2">figure 16.2</a>. In addition to the visual structure of the molecule, the database contains an identifier for the entity (CHEBI:41001); a definition; a star-based annotation scheme that (in this case) confirms that the entity has been manually annotated by the ChEBI team; secondary IDs; supplier information; and chemical properties like Formula, Net Charge, and Average Mass. In this case, there is also a Wikipedia link. Although not shown here, other details that can be obtained by scrolling down the page include the IUPAC Name, role classifications, registry numbers, and synonyms.</p>
<div class="figure">
<figure class="IMG"><a id="fig16-2"/><img alt="" class="width" src="../images/Figure16-2.png"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig16-2">Figure 16.2</a>:</span> <span class="FIG">An illustration of ChEBI search results for the molecular entity “ecgonine benzoate.”</span></p></figcaption>
</figure>
</div>
<p>We illustrate the search dashboard here to highlight a critical point first alluded to in the introduction: scientific knowledge repositories like ChEBI meet the definition of a KG owing to their focus on entities, relationships, and connections between entities, even though they are not designated as KGs. Furthermore, in scientific databases such as these, it can be hard to distinguish where the ontology leaves off and the KG begins (if one even exists; some would argue that the ontology is all there is in a scientific repository). In practice, such a distinction is neither necessary nor wise, which makes these data sets different from traditional KGs constructed by natural-language systems over news or open-domain corpora. Unlike ordinary domains, where the ontology tends to be compact and the actual KG is much bigger, scientific KGs can be equivalently characterized as large ontologies.</p>
<p>Although ChEBI is reasonably compact (compared to other KGs), containing just over 18,000 entities in the early 2000s (and expanding to over 46,000 entries by 2016), it is differentiated by a strong focus on quality, with exceptional effort devoted to International Union of Pure and Applied Chemistry (IUPAC) nomenclature rules, classification within the ontology, and best IUPAC practices when drawing chemical structures. The nomenclature and terminology employed in ChEBI are also recommended by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB).</p>
<p>In recent years, programmatic access to ChEBI has significantly improved, as detailed in the section entitled “Software and Resources,” at the end of this chapter. The documented use-cases of ChEBI include (1) being used as a source of stable unique identifiers for chemicals in annotations in a wide range of bioinformatics databases, including UniProt and systems biology models; (2) being used in text- and data-mining programs; (3) being linked to the GO, as well as several other ontologies, as the chemistry component; and (4) being used for SW applications [e.g., in the recent representation of PubChem content in the Resource Description Framework (RDF), where it provides <i>rdf:type</i> tags for PubChem <span aria-label="425" id="pg_425" role="doc-pagebreak"/>chemicals, as discussed next]. In the decade since its introduction, ChEBI has gained widespread adoption and become an essential repository for chemistry and bioinformatics, supporting a robust set of applications and user types in multiple scientific contexts.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><span aria-label="426" id="pg_426" role="doc-pagebreak"/><a id="sec16-3-2"/><b>16.3.2 PubChem</b></h3>
<p class="noindent">PubChem is a public repository of information on chemical substances and their biological activities. It was launched in 2004 as a component of the Molecular Libraries Roadmap Initiatives of the US National Institutes of Health (NIH). PubChem rapidly grew into a key chemical information resource for serving scientific communities in areas such as cheminformatics, chemical biology, and drug discovery. It contains one of the largest corpora of publicly available chemical information, with more than 157 million depositor-provided chemical substance descriptions, 60 million unique chemical structures, and 1 million biological assay descriptions, covering about 10,000 unique protein target sequences. This large repository of data is organized into three interlinked databases: <i>Substance, Compound</i>, and <i>BioAssay</i>. <i>Substance</i> stores depositor-contributed information, while the unique chemical structures extracted from the <i>Substance</i> database are stored in the <i>Compound</i> database. In contrast, the <i>BioAssay</i> database stores descriptions of biological assays on chemical substances.</p>
<p>PubChem has a strong community presence, with data provided by more than 350 contributors, including university labs, government agencies, pharmaceutical companies, and chemical vendors. Data provided by these contributors involves not just small molecules, but chemically modified macromolecules, lipids, and peptides, among other substances.</p>
<p>Originally, PubChem was not a KG; however, PubChemRDF changed this by encoding PubChem’s data using RDF and harnessing ontological frameworks to facilitate PubChem data sharing, analysis, and integration with resources external to the National Center for Biotechnology Information (NCBI) and across scientific domains. Chemical and drug ontologies such as NDF-RT, NCI Thesaurus,<sup><a href="chapter_16.xhtml#fn3x16" id="fn3x16-bk">3</a></sup> and ChEBI are used to annotate PubChem compounds and substances, while the GO and the Protein Ontology are used to annotate bioassay molecular targets. Furthermore, PubChemRDF exposes a number of semantic relationships among compounds, substances, bioassays, genes, and other elements.</p>
<p>Ultimately, PubChemRDF allows researchers to work with PubChem data locally using SW tools and systems. The selected data files in any subdomain can be downloaded from the PubChem File Transfer Protocol (FTP) site and imported into an RDF triple/quad-store (such as Apache Jena) that provides a SPARQL interface. The data can also be loaded into graph databases like Neo4j, and graph traversal and processing algorithms could be used to query the KG in advanced ways. Additionally, PubChemRDF provides programmatic data <span aria-label="427" id="pg_427" role="doc-pagebreak"/>access through a representational state transfer (REST)-ful interface and simple SPARQL-like query capabilities for grouping and filtering relevant results.</p>
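<p>To make the idea of querying triples with graph patterns concrete, the following is a minimal sketch in plain Python rather than an actual triple store or SPARQL engine. All identifiers (such as <i>compound:C1</i>, <i>chebi:ClassA</i>, and the <i>hasCompound</i> predicate) are hypothetical placeholders and are not real PubChemRDF or ChEBI terms; the helper functions are likewise illustrative assumptions.</p>

```python
from typing import Iterator, Optional

Triple = tuple[str, str, str]

# Placeholder triples; none of these identifiers are real PubChem or ChEBI records.
TRIPLES: list[Triple] = [
    ("compound:C1", "rdf:type", "chebi:ClassA"),
    ("compound:C2", "rdf:type", "chebi:ClassB"),
    ("substance:S1", "hasCompound", "compound:C1"),
    ("substance:S2", "hasCompound", "compound:C1"),
    ("substance:S3", "hasCompound", "compound:C2"),
]


def match(s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None plays the role of a query variable."""
    for triple in TRIPLES:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


def substances_of_class(chebi_class: str) -> list[str]:
    """Join two patterns: (?c rdf:type chebi_class) and (?s hasCompound ?c)."""
    found = []
    for compound, _, _ in match(p="rdf:type", o=chebi_class):
        for substance, _, _ in match(p="hasCompound", o=compound):
            found.append(substance)
    return sorted(found)


found = substances_of_class("chebi:ClassA")
print(found)
```

<p>A real deployment would delegate this join to the SPARQL interface of the triple store; the sketch only shows why typing compounds with ontology classes makes such joins possible.</p>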
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec16-4"/><b>16.4 Earth, Environment, and Geosciences</b></h2>
<p class="noindent">Geoscience studies produce data from various observations, experiments, and simulations at a very high rate. With the proliferation of applications and data formats, the geoscience research community faces many challenges in effectively managing and sharing resources, as well as efficiently integrating and analyzing the data.</p>
<section epub:type="division">
<h3 class="head b-head"><a id="sec16-4-1"/><b>16.4.1 Semantic Web for Earth and Environmental Terminology</b></h3>
<p class="noindent">Ontologies within the Semantic Web for Earth and Environmental Terminology (SWEET) together constitute an upper-level ontology for Earth system science. The SWEET ontologies include several thousand terms, spanning a broad extent of Earth system science and related concepts (such as data characteristics) using OWL.</p>
<p>SWEET consists of two types of ontologies: <i>faceted</i> and <i>integrative</i>. <a href="chapter_16.xhtml#fig16-3" id="rfig16-3">Figure 16.3</a>, derived from the SWEET guide, shows their interrelationships. <a href="chapter_16.xhtml#tab16-3" id="rtab16-3">Tables 16.3</a> and <a href="chapter_16.xhtml#tab16-4" id="rtab16-4">16.4</a> further describe these ontologies. As a set, the ontologies should be thought of as constituting a <i>concept space</i> for Earth system science. SWEET enables the same concept to be represented using various phrases to satisfy the needs of multiple users. Rather than define a compound concept such as air temperature, the SWEET ontologists decided to separate the physical property (e.g., temperature) from the element that the property applies to (e.g., air). This provides a more scalable solution to a growing knowledge base (KB). In this case, <i>compositional</i> knowledge of the independent concepts of the substance “air” and the property temperature provides a complete understanding of “air temperature,” without a need to create an explicit definition of the compound concept. While such a decomposition does not preclude term recompositions, the compound terms are designated as synonymous with their integral parts. In other instances, the compound concepts contain more meaning than their component parts (e.g., static pressure) and are explicitly included in the ontology.</p>
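<p>The compositional design described above can be sketched in a few lines of Python. This is an illustration under our own assumptions, not SWEET's actual OWL encoding: the <i>Concept</i> class, the <i>resolve</i> function, and the small vocabulary are hypothetical names introduced for the example.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    name: str
    facet: str  # for this sketch, either "property" or "substance"


TEMPERATURE = Concept("temperature", "property")
PRESSURE = Concept("pressure", "property")
AIR = Concept("air", "substance")
WATER = Concept("water", "substance")

# Independent concepts: two substances and two properties.
VOCAB = {c.name: c for c in (AIR, WATER, TEMPERATURE, PRESSURE)}

# Compounds that mean more than the sum of their parts are defined explicitly.
EXPLICIT = {"static pressure": Concept("static pressure", "property")}


def resolve(term: str) -> tuple:
    """Return the explicit concept if one exists, else the term's independent parts."""
    if term in EXPLICIT:
        return (EXPLICIT[term],)
    return tuple(VOCAB[word] for word in term.split())


air_temperature = resolve("air temperature")
static_pressure = resolve("static pressure")
print(air_temperature)
print(static_pressure)
```

<p>With two substances and two properties, all four compound phrases (such as “water pressure”) resolve without any additional definitions, which is the scalability argument made above.</p>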
<div class="figure">
<figure class="IMG"><a id="fig16-3"/><img alt="" src="../images/Figure16-3.png" width="450"/>
<figcaption><p class="CAP"><span class="FIGN"><a href="#rfig16-3">Figure 16.3</a>:</span> <span class="FIG">Relationships between the SWEET ontologies (integrative and faceted).</span></p></figcaption>
</figure>
</div>
<div class="table">
<p class="TT"><a id="tab16-3"/><span class="FIGN"><a href="#rtab16-3">Table 16.3</a>:</span> <span class="FIG">Brief descriptions of faceted ontologies in SWEET.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>Ontology</b></p></th>
<th class="TCH"><p class="TB"><b>Description</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB">Earth Realm</p></td>
<td class="TB"><p class="TB">The spheres (e.g., atmosphere, ocean, and solid earth) of the Earth constitute the EarthRealm ontology, based upon the physical properties of the planet, including subrealms such as the ocean floor and atmospheric boundary layer. This ontology can be considered a state of the planet that is extendable to past or future time periods (as well as to other planets).</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Nonliving Substances</p></td>
<td class="TB"><p class="TB">The nonliving building blocks of nature include particles, electromagnetic radiation, and chemical compounds. These substances constitute an ontology of physics and chemistry.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Living Substances</p></td>
<td class="TB"><p class="TB">The living substances include plant and animal species. This ontology was imported from the “biosphere” taxonomy of the Global Change Master Directory.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Physical Processes</p></td>
<td class="TB"><p class="TB">Physical processes include processes that affect living and nonliving substances, such as diffusion and evaporation.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Physical Properties</p></td>
<td class="TB"><p class="TB">A separate ontology was developed for physical properties, including those observable or associated with other components. Examples of physical properties include <i>temperature</i>, <i>pressure</i>, and <i>height</i>, and could apply to Nonliving Substances, Living Substances, and Physical Processes, among others. These properties typically are measured physical quantities (or qualities) with units.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Units</p></td>
<td class="TB"><p class="TB">Units are defined using Unidata’s UDUnits, a package that contains an extensive unit database and is available at <i><a href="http://www.unidata.ucar.edu/software/udunits/">www.unidata.ucar.edu/software/udunits/</a></i>. The resulting ontology includes conversion factors among various units. Prefixed units such as <i>km</i> are defined as a special case of <i>m</i> with an appropriate conversion factor.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Time</p></td>
<td class="TB"><p class="TB">Time is essentially a numerical scale with terminology specific to the temporal domain. In the Time Ontology, the temporal extents and relations are special cases of numeric extents and relations, respectively. Temporal extents include <i>duration</i>, <i>season</i>, <i>century</i>, and <i>1996</i>, while examples of temporal relations include <i>after</i> and <i>before</i>.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Space</p></td>
<td class="TB"><p class="TB">Similar to Time, Space is a multidimensional numerical scale with terminology specific to the spatial domain. A Space Ontology was developed, in which the spatial extents and relations are special cases of numeric extents and relations, respectively. Spatial extent examples include <i>country</i>, <i>Antarctica</i>, and <i>equator</i>, and spatial relations include <i>above</i> and <i>northOf</i>.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Numerics</p></td>
<td class="TB"><p class="TB">Numerical extents include <i>interval, point, 0</i>, and <span class="font">ℝ</span><sup>2</sup>. Numerical relations include <i>greaterThan</i> and <i>max</i>. Multidimensional concepts were defined because they are not native to OWL and XML.</p></td>
</tr>
</tbody>
</table>
</figure>
</div>
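<p>The treatment of prefixed units in the Units ontology (table 16.3), in which <i>km</i> is a special case of <i>m</i> with a conversion factor, can be illustrated with a small sketch. The <i>Unit</i> class and <i>convert</i> function are hypothetical names chosen for this example; they do not reproduce the UDUnits API or the SWEET encoding.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Unit:
    symbol: str
    base: Optional["Unit"] = None  # None means this unit is itself a base unit
    factor: float = 1.0            # number of base units in one of this unit


METRE = Unit("m")
KILOMETRE = Unit("km", base=METRE, factor=1000.0)   # km as a special case of m
SECOND = Unit("s")


def convert(value: float, source: Unit, target: Unit) -> float:
    """Convert between two units that share the same base unit."""
    source_base = source.base or source
    target_base = target.base or target
    if source_base != target_base:
        raise ValueError(f"cannot convert {source.symbol} to {target.symbol}")
    return value * source.factor / target.factor


print(convert(2.5, KILOMETRE, METRE))
print(convert(3000.0, METRE, KILOMETRE))
```

<p>Storing only a base unit and a factor per prefixed unit means that a new prefix requires one new entry rather than a conversion rule for every pair of units.</p>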
<div class="table">
<p class="TT"><a id="tab16-4"/><span class="FIGN"><a href="#rtab16-4">Table 16.4</a>:</span> <span class="FIG">Brief descriptions of integrative ontologies in SWEET.</span></p>
<figure class="table">
<table class="table">
<thead>
<tr>
<th class="TCH"><p class="TB"><b>Ontology</b></p></th>
<th class="TCH"><p class="TB"><b>Description</b></p></th>
</tr>
</thead>
<tbody>
<tr>
<td class="TB"><p class="TB">Physical Phenomena</p></td>
<td class="TB"><p class="TB">A Phenomena Ontology is used to define transient events. A phenomenon crosses bounds of other ontology elements. Examples include <i>hurricane, earthquake</i>, and <i>El Niño</i>, and each has associated Time, Space, Earth Realms, Nonliving Elements, and Living Elements. Specific instances of phenomena, spanning approximately 50 events over the past two decades, are also included.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Human Activities</p></td>
<td class="TB"><p class="TB">This ontology is included for representing activities that humans engage in, such as commerce and fisheries. It is included because scientific processes and phenomena have human impacts, and there is a need for representing such activities.</p></td>
</tr>
<tr>
<td class="TB"><p class="TB">Data</p></td>
<td class="TB"><p class="TB">The Data Ontology provides support for data set concepts, including representation, storage, modeling, format, resources, services, and distribution.</p></td>
</tr>
</tbody>
</table>
</figure>
</div>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec16-4-2"/><b>16.4.2 The GEON Portal and OpenTopography</b></h3>
<p class="noindent">One other solution that was proposed in the mid-2000s was the GEON (the GEOscience Network) Portal, which was focused on the same problem as the GO—namely, that information sources often fail to share a common terminology, have a variety of data representation formats and management architectures, and exhibit complex relationships between data and tools used to analyze the data. Creating an infrastructure to integrate, analyze, and model geoscience data poses many challenges due to the extreme heterogeneity of geoscience data formats, storage and computing systems and, most important, the ubiquity of differing conventions, terminologies, and ontological frameworks across disciplines. Dealing with this heterogeneity is important to facilitate interdisciplinary research, especially in the face of pressing crises like climate change. Studying climate change using holistic and unbiased scientific principles requires an integrated understanding of stratigraphy, sea-level changes, the fossil record, isotopes, and tectonics. The proverbial scientist working on climate change must have access to data from a number of scientific processes. However, the expenses involved in collecting necessary information for a single scientist, or even a single group of scientists, can prove to be a formidable barrier impeding new and exciting directions of research.</p>
<span aria-label="428" id="pg_428" role="doc-pagebreak"/>
<span aria-label="429" id="pg_429" role="doc-pagebreak"/>
<p><span aria-label="430" id="pg_430" role="doc-pagebreak"/>The goal of GEON (which was funded by the US National Science Foundation) was to respond to the pressing need in the geosciences to interlink and share multidisciplinary data sets to understand the complex dynamics of Earth systems. The portal was already becoming popular in the mid-2000s, well before the current popularity of KGs. For instance, Nambiar et al. (2006) claimed that the publicly accessible portal contained more than 400 registered data sources, 600 services, more than 750 registered users, and 20 ontologies. Today, GEON has been largely superseded by a larger project called OpenTopography, for which it originally served as a proof-of-concept cyberinfrastructure. The term “cyberinfrastructure” was coined by the National Science Foundation in 2003 to describe the computer networks and application-specific software, tools, and data repositories that support research in a given discipline. OpenTopography facilitates community access to high-resolution, earth science-oriented topography data and to related tools and resources through cyberinfrastructure developed at the San Diego Supercomputer Center at the University of California, San Diego. Its goals are to democratize online access to high-resolution (meter to submeter scale), earth science-oriented topography data acquired with lidar and other technologies; harness cutting-edge cyberinfrastructure to provide web service-based data access, processing, and analysis capabilities that are scalable, extensible, <span aria-label="431" id="pg_431" role="doc-pagebreak"/>and innovative; and foster interaction and knowledge exchange in the earth science lidar user community.</p>
<p>OpenTopography data access levels include the following:</p>
<ul class="numbered">
<li class="NL">1. <b>Google Earth.</b> Provides an excellent platform to deliver lidar-derived visualizations for research and outreach. These files display full-resolution images derived from lidar in the Google Earth virtual globe. The virtual globe environment provides a freely available and easily navigated viewer and enables quick integration of the lidar visualizations with imagery, geographic layers, and other relevant data available in Keyhole Markup Language (KML) format.</li>
<li class="NL">2. <b>Raster.</b> Precomputed raster data include digital elevation model layers computed from aerial lidar surveys and raster data from the Shuttle Radar Topography Mission global data set. The digital elevation models from aerial lidar surveys are available as bare earth (ground), highest hit, or intensity (strength of laser pulse) tiles. Some data sets also have orthophotographs available. The digital elevation models are in common Geographic Information System (GIS) formats and are compressed to reduce their size.</li>
<li class="NL">3. <b>Lidar point cloud data and on-demand processing.</b> Users are allowed to define an area of interest, as well as a subset of the data (e.g., “ground returns only”), and then to download the results of this query in ASCII or LAS binary point cloud formats. Also available is the option to generate custom derivative products such as digital elevation models produced with user-defined resolution and algorithm parameters and downloaded in a number of different file formats. The system will also generate geomorphic metrics (e.g., slope maps) and dynamically generate data product visualizations for display in the web browser or Google Earth.</li>
</ul>
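<p>The on-demand processing in the third access level (selecting an area of interest, keeping only ground returns, and gridding the result at a user-defined resolution) can be sketched as follows. This is a toy illustration and not OpenTopography's implementation: the point list, the function names, and the simple mean-per-cell gridding are assumptions made for the example, and the value 2 for ground returns follows the usual LAS classification convention.</p>

```python
from collections import defaultdict

GROUND = 2  # assumed classification code for ground returns (LAS convention)

# Toy points as (x, y, z, classification); the values are invented for illustration.
POINTS = [
    (0.5, 0.5, 10.0, 2),
    (0.7, 0.2, 12.0, 2),
    (1.5, 0.5, 20.0, 2),
    (1.6, 0.4, 35.0, 5),  # not a ground return, so it is filtered out
    (9.0, 9.0, 50.0, 2),  # outside the area of interest
]


def subset(points, min_x, min_y, max_x, max_y, classification=None):
    """Keep points inside the bounding box, optionally restricted to one class."""
    kept = []
    for x, y, z, cls in points:
        inside = x >= min_x and max_x >= x and y >= min_y and max_y >= y
        if inside and (classification is None or cls == classification):
            kept.append((x, y, z, cls))
    return kept


def grid_mean(points, resolution):
    """Build a crude elevation model: mean z of the points falling in each cell."""
    cells = defaultdict(list)
    for x, y, z, _ in points:
        cells[(int(x // resolution), int(y // resolution))].append(z)
    return {cell: sum(zs) / len(zs) for cell, zs in cells.items()}


ground = subset(POINTS, 0.0, 0.0, 2.0, 2.0, classification=GROUND)
dem = grid_mean(ground, resolution=1.0)
print(dem)
```

<p>Production systems use more sophisticated interpolation than a per-cell mean; the sketch only shows how the area of interest, the return filter, and the resolution act as independent user-defined parameters.</p>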
<p>OpenTopography cyberinfrastructure is based on a multitiered service-oriented architecture for efficient web browser-based access to data and processing. It includes an infrastructure tier, an application tier, and a services tier. The infrastructure tier contains dedicated storage and compute resources, the application tier is where most users access data and processing, and the services tier includes algorithms (e.g., for visualization) and other domain-specific services such as orthoimagery, raster processing, and gridding.</p>
</section>
<section epub:type="division">
<h3 class="head b-head"><a id="sec16-4-3"/><b>16.4.3 Environment Ontology</b></h3>
<p class="noindent">The Environment Ontology (ENVO) is a community-led, open project that seeks to provide an ontology for specifying a wide range of environments relevant to multiple life science disciplines and, through an open participation model, to accommodate the terminological requirements of all those needing to annotate data using ontology classes. In short, it is a succinct, controlled description of environments. A broad definition of an environment includes the natural or anthropogenic systems that can surround an entity (living or nonliving). ENVO is motivated by the finding that, like so many of the sciences, while all <span aria-label="432" id="pg_432" role="doc-pagebreak"/>biologists have an intuitive understanding of what is meant by “environment,” a rigorous definition of this class is nontrivial. For example, confusion often arises when attempting to distinguish an environment from a habitat or niche: as some studies have shown, the environment that an organism was observed in or isolated from may have little to do with its habitat or its niche.</p>
<p>ENVO is comprised of classes (terms) referring to key environment types that may be used to facilitate the retrieval and integration of a broad range of biological data. The ENVO Consortium did not develop ENVO in a vacuum; rather, it took into account the many existing resources addressing, among other entities, environment types. Like other domain scientists and scientific consortia, they were motivated by the value of unifying preexisting resources in a foundational (or building-block) ontology developed within a <i>federated</i> framework (thereby facilitating sharing) and exclusively concerned with the specification of environment types (the selected domain of interest), independent of any particular application (facilitating reuse).</p>
<p>ENVO’s most developed branches (which are of primary interest to annotators) are the <i>biome, environmental feature</i>, and <i>environmental material</i> hierarchies. The biome hierarchy recognizes two important subclasses: terrestrial biome and aquatic biome. Most subclasses in the terrestrial biome have been adapted from major (terrestrial) habitat types defined by the World Wide Fund for Nature (WWF). The aquatic biome class has two subclasses: the marine biome and freshwater biome classes. The former hierarchy has been enriched with detailed input from marine scientists and includes classes representing depth-dependent layers of the oceans and seas, along with biomes associated with geographic entities, while the latter is in a considerably less developed state and includes subclasses adapted from the WWF’s freshwater ecosystem classification.</p>
<p>The environmental feature hierarchy comprises subbranches addressing a number of spatial scales (e.g., the geographic feature subclass contains subclasses adapted from geographic surveys like those of the US Geological Survey), as well as features that are of smaller spatial scale (e.g., carcasses and fomites that are included as subclasses of mesoscopic physical object) and finally, subclasses of marine feature and organic feature that are presently used to temporarily accommodate user requests.</p>
<p>The environmental material hierarchy is less deep compared to the biome and environmental feature hierarchies. Broad subclasses such as soil, water, and sediment are subdivided either by using well-known schemes (e.g., the UN Food and Agriculture Organization soil classification), or by referring to commonly used terms in the relevant domain by engaging with experts.</p>
<p>Since its introduction, and just like the other examples described here, ENVO has been adopted by or used in several projects, and its initial scope has significantly expanded. The <i>-omics</i> community was an early adopter of ENVO, which is a recommended ontology in the core component of the Minimum Information about any (x) Sequence (MIxS) specification. <span aria-label="433" id="pg_433" role="doc-pagebreak"/>Outside the <i>-omics</i> community, StrainInfo, a service that indexes and allows searching over numerous microbial culture collections, has used ENVO in its semantic representation of isolation environment. Recent interaction with the Environments-EOL initiative, which is utilizing text-mining approaches to annotate Encyclopedia of Life pages with ENVO classes, is providing valuable guidance in ENVO’s development. The developers of ENVO are also working with the eco-informatics community to map the environmental descriptors in ENVO to the SPIRE vocabulary. This allows ecological interaction data mapped to SPIRE to be remapped to ENVO. Finally, as ENVO annotations become more widely available, databases and data retrieval tools (such as the Genomic Metadata for Infectious Agents Database) are supporting queries over ENVO classes.</p>
<p>More recently, much of ENVO’s existing content has been revised for improved semantic representation, with the ontology now containing representations for habitats, environmental processes, anthropogenic environments, and entities relevant to environmental health initiatives and the global Sustainable Development Agenda for 2030. Several branches of ENVO have been used to incubate and seed new ontologies in previously unrepresented domains such as food and agronomy. Through this expansion, ENVO has subsequently been shaped into a multidomain ontology that bridges domains such as biomedicine, natural and anthropogenic ecology, -omics, and even socioeconomic development.</p>
</section>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec16-5"/><b>16.5 Concluding Notes</b></h2>
<p class="noindent">In this chapter, we covered the use of KG technology within scientific domains such as life sciences, geosciences, and chemistry. While these are arguably the most influential applications of KGs in science, especially considering the conservative tendency of these communities with respect to quality, storage, protection, and use of data, there are several others that were not covered in this chapter. For example, within the social sciences (and especially the side of the community that intersects with network sciences), there has been a proliferation of rich, attributed graphs that look a lot less like traditional social networks and more like KGs. Within the computer sciences, there has always been a willingness to experiment with more cutting-edge technology, and KGs and ontologies exist for a wide variety of purposes, including for recording experimental computational results. We believe that the cases we have covered in this chapter have offered a good representation of the tendencies and use-cases of scientific communities in adopting KG technology. An important caveat that must be borne in mind when dealing with scientific KGs is that the distinction between an ontology and KG becomes blurred (purely as a pragmatic matter).</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><span aria-label="434" id="pg_434" role="doc-pagebreak"/><a id="sec16-6"/><b>16.6 Software and Resources</b></h2>
<p class="noindent">Many of the resources covered in this chapter are publicly available; we provide links to the important ones. In the event that a link does not resolve, we recommend using a search engine to get an updated link.</p>
<p>Specifically, the primary landing page for the GO is geneontology.org/. From this page, the interested user can access many of the facilities discussed in the chapter, including the ontology itself (<a href="http://geneontology.org/docs/ontology-documentation/">http://<wbr/>geneontology<wbr/>.org<wbr/>/docs<wbr/>/ontology<wbr/>-documentation<wbr/>/</a>), the annotation (<a href="http://geneontology.org/docs/go-annotations/">http://<wbr/>geneontology<wbr/>.org<wbr/>/docs<wbr/>/go<wbr/>-annotations<wbr/>/</a>), and importantly, the causal activity model (<a href="http://geneontology.org/docs/gocam-overview/">http://<wbr/>geneontology<wbr/>.org<wbr/>/docs<wbr/>/gocam<wbr/>-overview<wbr/>/</a>). There are also many tools available to browse, search, visualize, and curate the GO, available at <a href="http://geneontology.org/docs/tools-overview/">http://<wbr/>geneontology<wbr/>.org<wbr/>/docs<wbr/>/tools<wbr/>-overview<wbr/>/</a>. The Noctua Curation Platform for curators to create GO annotations is accessible at <a href="http://noctua.geneontology.org/">http://<wbr/>noctua<wbr/>.geneontology<wbr/>.org<wbr/>/</a>. The GO wiki (<a href="http://wiki.geneontology.org/index.php/Main_Page">http://<wbr/>wiki<wbr/>.geneontology<wbr/>.org<wbr/>/index<wbr/>.php<wbr/>/Main<wbr/>_Page</a>) is also a great resource for potential users to learn more about the ecosystem and resources. It also contains a collected list of publications, talks and posters, by year and time periods.</p>
<p>Programmatic access to ChEBI has significantly improved lately with the introduction of a library, libChEBI, available in Java, Python, and Matlab. The GitHub page for this resource is accessible at <a href="https://github.com/libChEBI">https://<wbr/>github<wbr/>.com<wbr/>/libChEBI</a>. The addition of new tools, such as an analysis tool, BiNChE, and a query tool for the ontology, OntoQuery, has also significantly aided in making ChEBI accessible and useful for sophisticated analyses in the chemical sciences. These resources are accessible at <a href="https://www.ebi.ac.uk/chebi/tools/binche/">https://<wbr/>www<wbr/>.ebi<wbr/>.ac<wbr/>.uk<wbr/>/chebi<wbr/>/tools<wbr/>/binche<wbr/>/</a> and <a href="https://www.ncbi.nlm.nih.gov/pubmed/24008420">https://<wbr/>www<wbr/>.ncbi<wbr/>.nlm<wbr/>.nih<wbr/>.gov<wbr/>/pubmed<wbr/>/24008420</a>, respectively. BiNChE is a web-based enrichment analysis tool (although it is also available as a software library), and offers plain or weighted analysis options against the ChEBI role, structure, or combined ontology. It was inspired by similar tools in the GO ecosystem. In contrast, OntoQuery was designed with a larger audience (including the Semantic Web community) in mind, since it allows for the easy formulation and execution of complex logical queries against the ontology. The formulation is easy because Description Logic (DL) queries can be posed in the relatively easy Manchester syntax and can be executed against the preloaded (and prereasoned) ontology. Additionally, OntoQuery offers syntax suggestions and corrections as a query is being typed, and supports composite queries that combine classes or relationships in the ontology using logical connectives like “and” and “or.” For example, OntoQuery would be able to retrieve results over ChEBI given a query such as “steroid and has_role some (human_metabolite or nematode_metabolite).”</p>
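<p>To show what a composite query of this kind asks for, the sketch below evaluates the example query with ordinary set operations over a toy data set. It is not how OntoQuery or a DL reasoner is implemented, and the entities (<i>entity_1</i> to <i>entity_4</i>) and their assignments are invented placeholders rather than ChEBI content; only the class and role names are taken from the example query.</p>

```python
# Toy assertions; the entity names and their assignments are invented placeholders.
IS_A = {
    "entity_1": {"steroid"},
    "entity_2": {"steroid"},
    "entity_3": {"lipid"},
    "entity_4": {"steroid"},
}
HAS_ROLE = {
    "entity_1": {"human_metabolite"},
    "entity_2": {"drug"},
    "entity_3": {"nematode_metabolite"},
    "entity_4": {"nematode_metabolite", "drug"},
}


def members_of(cls: str) -> set:
    """Entities asserted to belong to the given class."""
    return {entity for entity, classes in IS_A.items() if cls in classes}


def has_role_some(roles: set) -> set:
    """Entities with at least one role among the given ones ('some' plus 'or')."""
    return {entity for entity, held in HAS_ROLE.items() if held.intersection(roles)}


# steroid and has_role some (human_metabolite or nematode_metabolite)
result = members_of("steroid").intersection(
    has_role_some({"human_metabolite", "nematode_metabolite"})
)
print(sorted(result))
```

<p>Here “or” becomes a union of the admissible roles, “some” requires at least one matching role, and “and” becomes an intersection; a reasoner additionally takes the subclass hierarchy into account, which this sketch omits.</p>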
<p>PubChem data are available for bulk download on an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pubchem). The PubChem Structure Download service, accessible at <a href="https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi">https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi</a>, can also be used to download a subset of substance or compound records in PubChem, rather than all PubChem records. The records can be exported in several formats, including plain text, Extensible Markup Language (XML), and <span aria-label="435" id="pg_435" role="doc-pagebreak"/>various other convenient modalities. Optionally, the files can be compressed in standard gzip or bzip2 formats. For more details, we recommend that the reader peruse the primary PubChem documentation at <a href="https://pubchemdocs.ncbi.nlm.nih.gov/downloads">https://<wbr/>pubchemdocs<wbr/>.ncbi<wbr/>.nlm<wbr/>.nih<wbr/>.gov<wbr/>/downloads</a>.</p>
<p>Concerning the geosciences resources, the OBO Foundry resources can be accessed at the following GitHub page: <a href="https://github.com/OBOFoundry/OBOFoundry.github.io/blob/master/resources.md">https://<wbr/>github<wbr/>.com<wbr/>/OBOFoundry<wbr/>/OBOFoundry<wbr/>.github<wbr/>.io<wbr/>/blob<wbr/>/master<wbr/>/resources<wbr/>.md</a>. ENVO is one example of an OBO Foundry ontology that we detailed in this chapter. The GitHub page linked here is useful because it contains many resources for getting started with ontologies (particularly scientific ontologies), and also contains links to tools and browsers. ENVO is published under a CC-BY license and can be accessed at <a href="http://www.obofoundry.org/ontology/envo.html">http://<wbr/>www<wbr/>.obofoundry<wbr/>.org<wbr/>/ontology<wbr/>/envo<wbr/>.html</a>. The SWEET ontologies were previously downloadable from <a href="http://sweet.jpl.nasa.gov/sweet">http://<wbr/>sweet<wbr/>.jpl<wbr/>.nasa<wbr/>.gov<wbr/>/sweet</a>, but lately they have been inaccessible. Another link from which the main ontology is directly downloadable is <a href="http://iridl.ldeo.columbia.edu/ontologies/SWEET.owl">http://<wbr/>iridl<wbr/>.ldeo<wbr/>.columbia<wbr/>.edu<wbr/>/ontologies<wbr/>/SWEET<wbr/>.owl</a>. The main page for accessing OpenTopography resources is <a href="https://opentopography.org/">https://<wbr/>opentopography<wbr/>.org<wbr/>/</a>.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec16-7"/><b>16.7 Bibliographic Notes</b></h2>
<p class="noindent">Unlike many of the other chapters, for which we could draw on multiple surveys (and even books and metasurveys) as a bedrock of foundational material and recent advances, there is no one work (to our knowledge) that comprehensively describes KGs (or KBs) in even just the major natural sciences such as biology and chemistry. In this chapter, we took a per-field approach by describing developments as they have unfolded in individual areas of scientific knowledge. While there has been some work on curating multiple scientific KBs or even in producing metadata, a comprehensive description of the KBs themselves has been lacking in one single work.</p>
<p>Much effort was spent in the early part of this chapter on the GO, which remains, to our knowledge, the best-known and most widely deployed effort of applying KG technology in a scientific area. There are many good references for the GO, and we drew on a few of those in describing it in this chapter. A set of articles from the GO Consortium constitutes an excellent first read; we recommend Gene Ontology Consortium (2004, 2008, 2012, 2015, 2017) and Gene Ontology Consortium et al. (2001). We also recommend foundational work by Ashburner et al. (2000) for interested readers. There has also been much secondary work to make sense of the GO itself. For example, Supek et al. (2011) propose a method to summarize and visualize GO terms to make a list of such terms easier to interpret. Other such studies include Boyle et al. (2004) and Martin et al. (2004).</p>
<p>Beyond the GO, ChEBI and PubChem are important resources for the chemistry domain. Good references include de Matos et al. (2010), Degtyarenko et al. (2007), Hastings et al. (2016), Kim et al. (2016), Wang et al. (2014), and Bolton et al. (2008). The paper describing PubChemRDF by Fu et al. (2015) is especially instructive because it directly involves <span aria-label="436" id="pg_436" role="doc-pagebreak"/>KG technology and has a close connection to SW tools, models, and languages such as RDF and SPARQL.</p>
<p>For the other resources mentioned here, good references describing SWEET include Raskin and Pan (2003), Raskin et al. (2004), and Raskin and Pan (2005). Good references for GEON and OpenTopography include Nambiar et al. (2006), Gahegan et al. (2009), Lin and Ludäscher (2004), Krishnan et al. (2011), and Crosby et al. (2013). Several good references exist for ENVO; we encourage the interested reader to consider Buttigieg et al. (2013, 2016). Note that there are other good resources for the geosciences beyond what was covered in this chapter; the abstract by Zaslavsky et al. (2016) provides pointers to a few of them.</p>
</section>
<section epub:type="division">
<h2 class="head a-head"><a id="sec16-8"/><b>16.8 Exercises</b></h2>
<ul class="numbered">
<li class="NL">1. In the chapter introduction, we wrote that the scientific community tends to mostly publish knowledge as statements in an ontology, as opposed to a KG. Looking at the list below, can you determine which items should be concepts in an ontology as opposed to instances in a KG? If you choose the latter for an item, explain why you would not put it in the ontology.</li>
</ul>
<p class="AL">(a) An element in the periodic table</p>
<p class="AL">(b) A fungal specimen that was collected from a specific rainforest</p>
<p class="AL">(c) The parameters in a differential equation describing a physical phenomenon</p>
<p class="AL">(d) The scientist who invented a cure for tuberculosis</p>
<p class="AL">(e) The structure of the COVID-19 virus</p>
<ul class="numbered">
<li class="NL">2. We listed several optional elements for GO terms in <a href="chapter_16.xhtml#tab16-2">table 16.2</a>, such as Subset, Comment, and the Obsolete tag. We would like to do a small sampling-based study to determine how many GO terms actually use these optional elements. To begin the study, pick 10 GO terms. You could do this by searching online or by browsing the GO website. Try to be as random as you can. What are these terms? Draw a table and list their unique identifiers and term names.</li>
<li class="NL">3. For each of the six optional elements listed in <a href="chapter_16.xhtml#tab16-2">table 16.2</a>, how many GO terms in your sample have at least one value for that element? Provide an individual percentage for each of these elements (e.g., you could state that 6 of the 10 terms in your sample have a value for Subset, while 8 have an associated comment).</li>
<li class="NL">4. When describing chemical entities and KGs for chemical entities, we considered both ChEBI and PubChem. We also used “ecgonine benzoate” to illustrate ChEBI. What is the identifier for this entity on PubChem? In comparing the entries for this entity on ChEBI and PubChem, what differences do you observe? Is there an equal amount of information about the entity on both portals?</li>
<li class="NL">5. <span aria-label="437" id="pg_437" role="doc-pagebreak"/>List three differences between PubChem and ChEBI.</li>
<li class="NL">6. Consider the SWEET ontologies in <a href="chapter_16.xhtml#fig16-3">figure 16.3</a>. What distinguishes faceted from integrative ontologies? What do the edges mean?</li>
<li class="NL">7. Try to look up the details of the Data integrative ontology online. List some classes and properties in this ontology. What is a good use case where use of this ontology is essential? Considering the ubiquity of data, why is it not linked to every single integrative and faceted ontology in <a href="chapter_16.xhtml#fig16-3">figure 16.3</a>?</li>
</ul>
<div class="footnotes">
<ol class="footnotes">
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_16.xhtml#fn1x16-bk" id="fn1x16">1</a></sup><a href="http://geneontology.org/">http://<wbr/>geneontology<wbr/>.org<wbr/>/</a>. This website is also the source of the quote.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_16.xhtml#fn2x16-bk" id="fn2x16">2</a></sup><a href="http://dflat.cs.tufts.edu/">http://<wbr/>dflat<wbr/>.cs<wbr/>.tufts<wbr/>.edu<wbr/>/</a>.</p></li>
<li><p class="FN" role="doc-footnote"><sup><a href="chapter_16.xhtml#fn3x16-bk" id="fn3x16">3</a></sup><a href="https://ncithesaurus.nci.nih.gov/ncitbrowser/">https://<wbr/>ncithesaurus<wbr/>.nci<wbr/>.nih<wbr/>.gov<wbr/>/ncitbrowser<wbr/>/</a>.</p></li>
</ol>
</div>
</section>
</section>
</div>
</body>
</html>