Genome Analysis: Current Procedures and Applications
Caister Academic Press
Maria S. Poptsova
Weill Cornell Medical College, New York, USA and Moscow State University, Russia
xiv + 374
GB £219 or US $250
Paperback: US $250
Ebook: US $319
In recent years there have been tremendous achievements made in DNA sequencing technologies and corresponding innovations in data analysis and bioinformatics that have revolutionized the field of genome analysis.
In this book, an impressive array of expert authors highlight and review current advances in genome analysis. This volume provides an invaluable, up-to-date and comprehensive overview of the methods currently employed for next-generation sequencing (NGS) data analysis, highlights their problems and limitations, demonstrates the applications and indicates the developing trends in various fields of genome research. The first part of the book is devoted to the methods and applications that arose from, or were significantly advanced by, NGS technologies: the identification of structural variation from DNA-seq data; whole-transcriptome analysis and discovery of small interfering RNAs (siRNAs) from RNA-seq data; motif finding in promoter regions, enhancer prediction and nucleosome sequence code discovery from ChIP-Seq data; identification of methylation patterns in cancer from MeDIP-seq data; transposon identification in NGS data; metagenomics and metatranscriptomics; NGS of viral communities; and causes and consequences of genome instabilities. The second part is devoted to the field of RNA biology with the last three chapters devoted to computational methods of RNA structure prediction including context-free grammar applications.
An essential book for everyone involved in sequence data analysis, next-generation sequencing, high-throughput sequencing, RNA structure prediction, bioinformatics and genome analysis.
Table of contents
1. Identification of Structural Variation
Suzanne S. Sindi and Benjamin J. Raphael
Structural variation, rearrangements of DNA sequences, has long been observed by chromosomal assays. With recent advances in high-throughput DNA sequencing, the ability to identify structural variation (SV) in genome sequences has improved considerably. Sequence-based methods for identifying SVs have greatly improved our knowledge of genomic variation in humans and other species. DNA sequencing data from an individual genome contains multiple possible signals that indicate SVs in this genome, and these signals must be analyzed and integrated using various computational techniques. Here we give an overview of methods for SV discovery from sequencing data and remark on the remaining challenges.
2. Methods for RNA Isolation, Characterization, and Sequencing
Paul Zumbo and Christopher E. Mason
Ribonucleic Acid (RNA) is a key substrate for storing and transmitting biological information in cells, along with deoxyribonucleic acid (DNA), proteins, and other small molecules and metabolites. Since the discovery of nucleic acids by Friedrich Miescher in 1869, RNA has been observed in an expanded range of functions within and between cells, tissues, and even between generations. In 1958, Francis Crick proposed the Central Dogma of Molecular Biology, and he placed RNA as a simple intermediary of unidirectional information transfer between DNA and proteins. Yet, today, we know that a wide range of activities surround and violate this dogma, and recent work has shown that RNAs come in many varieties, serve essential regulatory and catalytic roles, and that RNA bases can harbor many small, chemical modifications that can also change its function. This chapter will review the history of RNA's expansion as a mediator and as a catalytic molecule in cells, the new methods developed to characterize and sequence RNA, and the means for contextualizing the roles of RNA.
3. Transcriptome Reconstruction and Quantification from RNA Sequencing Data
Sahar Al Seesi, Serghei Mangul, Adrian Caciula, Alex Zelikovsky and Ion Măndoiu
Massively parallel whole transcriptome sequencing has become the technology of choice for transcriptome analysis since it supports a wider range of problems than the previously popular microarray technology. In this chapter we focus on two of these applications, namely transcriptome reconstruction and quantification. We discuss the key computational problems related to these applications and describe some of the best-performing algorithms available for each. For transcriptome reconstruction, we present in detail a statistical genome-guided method called "Transcriptome Reconstruction using Integer Programming" (TRIP) that incorporates fragment length distribution into novel transcript reconstruction from paired-end RNA-Seq reads. Experimental results on both real and synthetic datasets show that TRIP is more accurate than methods ignoring fragment length distribution information. For transcriptome quantification, we focus on two Expectation-Maximization (EM) algorithms for both RNA-Seq and Digital Gene Expression (DGE) sequencing protocols. Both algorithms take into account alternative splicing and mapping ambiguities. We present experimental results on real datasets comparing the two protocols as well as methods for each protocol. Results show that the EM algorithms outperform other available methods for both RNA-Seq and DGE, and that they yield comparable quantification accuracy on real data generated using the RNA-Seq and DGE protocols.
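The idea of using EM to resolve mapping ambiguities can be illustrated with a minimal sketch. This is not the chapter's algorithm: the function name `em_abundances` and its input format are hypothetical, and the model is deliberately simplified (it ignores transcript lengths, fragment length distributions and alignment quality). Each read is given as the list of transcript indices it maps to, and reads that map to several transcripts are split in proportion to the current abundance estimates.

```python
from typing import List, Sequence


def em_abundances(read_compat: Sequence[Sequence[int]],
                  n_transcripts: int,
                  n_iter: int = 100) -> List[float]:
    """Estimate relative transcript abundances from ambiguously mapped reads.

    read_compat[r] lists the transcripts that read r is compatible with.
    Simplifying assumption: all transcripts are treated as equally long.
    """
    # Start from a uniform guess.
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        # E-step: split each read among its compatible transcripts
        # in proportion to the current abundance estimates.
        counts = [0.0] * n_transcripts
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            if total == 0.0:
                continue  # read cannot be assigned under current estimates
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: renormalise the expected counts into abundances.
        assigned = sum(counts)
        if assigned == 0.0:
            break  # nothing to update from
        theta = [c / assigned for c in counts]
    return theta


# Two reads unique to transcript 0, one unique to transcript 1,
# and one read that maps to both.
reads = [[0], [0], [0, 1], [1]]
theta = em_abundances(reads, n_transcripts=2)
```

In this toy input the ambiguous read ends up shared roughly two to one in favour of transcript 0, mirroring the ratio of uniquely mapped reads. A realistic implementation would at least normalise by effective transcript length.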
4. Identification of Small Interfering RNA from Next-generation Sequencing Data
Thomas J. Hardcastle
Small interfering RNAs (siRNAs) play a crucial role in the regulation of transcriptomic and epigenetic factors. Next generation sequencing technologies allow the identification and quantification of siRNAs on a genome-wide scale. Given the proper tools for analysis, biologically meaningful inferences can be made about the biogenesis and patterns of expression of these key components of biological systems. This review discusses the application of sequencing technologies to siRNAs and the currently available tools for the analysis of the data thus generated.
5. Motif Discovery and Motif Finding in ChIP-Seq Data
Ivan V. Kulakovskiy and Vsevolod J. Makeev
Modern bioinformatics and molecular biology research are impossible to imagine without the application of high-throughput DNA sequencing technologies, also called next-generation sequencing technologies. In particular, transcriptional regulation studies, which determine how different genes are switched "on" and "off" in different tissues under different conditions, rely heavily on next-generation sequencing. ChIP-Seq technology, in which chromatin immunoprecipitation is followed by deep sequencing, allows genome-wide in vivo studies of binding sites for different transcription factors, the proteins that can specifically facilitate or prevent proper construction of the transcription initiation complex necessary to activate transcription of a specific gene. Transcription initiation control in higher eukaryotes is extremely complex, and its analysis is especially difficult because of the genome size and comparatively short transcription factor binding sites. The availability of ChIP-Seq data provided new insights into the genome-wide distribution of transcription factor binding sites. It was a new challenge for computational biology to handle enormous amounts of data and detect actual binding sites within DNA segments identified by ChIP-Seq. Here we focus on the application of, and advances in, motif discovery and motif finding, a very well-established field in the bioinformatics of sequence analysis, which has been given a second birth by the ChIP-Seq technology.
6. Mammalian Enhancer Prediction
Dongwon Lee and Michael A. Beer
We are still far from a complete understanding of regulatory elements in mammalian genomes, even though their central role in most biological processes is widely appreciated. Development of sequence-based models to predict the function and activity of regulatory elements is a fundamental step in being able to address many unsolved questions. Here we describe the current state of the art in computational methods to predict enhancers, especially recent developments using a support vector machine (SVM) framework, which can accurately identify tissue specific enhancers using only genomic sequence and an unbiased set of general sequence features. These models reveal both enriched and depleted predictive sequence features that are critical for specifying these enhancer activities, and can also be used to identify novel enhancers. Some of these predictions have been validated by several independent experiments both in vitro and in vivo. These methods can be applied to computationally predict the functional consequences of common sequence variants in regulatory regions. We believe that these efforts will significantly contribute to our understanding of mammalian regulatory systems and their role in common disease.
7. DNA Patterns for Nucleosome Positioning
There is abundant experimental evidence for the role of specific nucleosome positioning in gene regulation. Nucleosome positioning is determined by DNA sequence and non-sequence factors such as ATP-dependent remodeling factors. Nucleosome positions differ between various cell types of the same species as well as between similar genes of different species. A nucleosome shift by just a few base pairs can alter the entire regulation of a gene. Knowing the precise nucleosome location is critical for understanding how cis-regulatory elements control genetic information. Among the different factors affecting nucleosome positioning on the DNA, the DNA sequence itself is the most important, and various sequence motifs have been described as guiding nucleosome positioning in a sequence-specific manner. These motifs or patterns eventually were termed nucleosome positioning sequence (NPS) patterns, although this term is not necessarily universal. In this chapter, we describe various classes of such NPS patterns known from the literature and their possible biological implications. This chapter does not attempt to provide an exhaustive review of all related points of view, nor is its emphasis on new results presented here for the first time. Rather, it reflects the author's point of view on general tendencies in this area of science and tries to provide a possible answer to the most difficult question in the area: why does nucleosome positioning differ between different tissues while the respective DNA sequence of the genes involved is essentially identical?
8. Hypermethylation in Cancer
A compelling body of evidence supports the importance of epigenetic mechanisms in the development and progression of cancer. Assessing the epigenetic component of tumour samples is strongly improving our understanding of their biology and clinical behavior. In terms of DNA methylation, cancer cells show genome-wide hypomethylation and site-specific CpG island promoter hypermethylation. In the context of other epigenetic alterations, this chapter will focus on the hypermethylation of CpG islands in promoter regions, as the most widely described epigenetic modification in cancer. CpG island hypermethylation is believed to be critical in the transcriptional silencing and regulation of tumor suppressor and crucial cancer genes involved in the major molecular pathways controlling cancer development and progression. In particular, the biological pathways of frequently methylated genes include cell cycle, DNA repair, apoptosis and invasion, among others. Furthermore, translational aspects of the tumor methylomes described to date will be discussed with a view to their potential application as cancer biomarkers. Several tissue methylation signatures and individual candidates have been identified that could potentially stratify tumors histopathologically and discriminate patients in terms of their clinical outcome. Tumor methylation profiles could also be detected in body fluid specimens, showing a promising role as non-invasive markers for early cancer diagnosis and potentially for the surveillance of cancer patients in the near future. However, the epigenomic exploration of cancer has only just begun. Genome-scale DNA methylation profiling studies will further highlight the relevance of the epigenetic component to gain knowledge of cancer biology, and identify those profiles and candidates that best correlate with clinical behavior.
9. Identification and Analysis of Transposable Elements in Genomic Sequences
Laurent Modolo and Emmanuelle Lerat
Genome sequences are composed of different compartments, among which transposable elements (TEs) represent one of the most important. Not only do these elements correspond to a particularly large proportion of genomes, they are also involved in different mechanisms implicated in the evolution of genomes, such as chromosome rearrangement and gene innovation. Thus, the precise determination of TEs in genomes is of significant importance. This step is becoming more and more complex with the emergence of new types of sequence data coming from next-generation sequencing (NGS) technologies. In this chapter, we present the current status of bioinformatic developments made in the detection and analysis of TEs in genomic sequences. We first present the classic tools dedicated to the identification of TEs in classic genomic data, which originate from whole genome sequences. Because these sequences are significantly different from the new types of sequences generated by NGS and because the problem of repeats in these data is not trivial, we then present how it is possible to handle TEs in NGS data. We also provide some examples of tools designed to answer particular questions about TEs using NGS data and how these types of data are particularly valuable for deepening our knowledge of the dynamics of TEs. Although this is still a fast-growing field for which new developments are made every day, we hope to provide a broader view of what currently exists in this field and what allows for TE analyses in genomic sequences.
10. The Current State of Metagenomic Analysis
Pieter De Maayer, Angel Valverde and Don A. Cowan
The Earth harbours an enormous diversity of organisms, small and large, which produce a wealth of proteins and enzymes that we can potentially utilize to address many of our current challenges, including food safety, sustainable energy sources and human health. However, the majority of these organisms have been of limited value because they cannot be grown in culture, which has long been a prerequisite in order to study or exploit them. A new field of research that can provide access to the "unculturable majority" has recently emerged. The field is termed 'metagenomics'. With the advent of genome sequencing and the development of next-generation sequencing technologies, metagenomics has provided access to the genomes of unculturable organisms, allowing us to study their diversity and to access the potential in these organisms and their cellular constituents. Other technologies, such as metatranscriptomics, metaproteomics and metabolomics, have been incorporated into the metagenomics toolkit, all enhancing the power of this field of research. Here, we address the state of metagenomics, its current methodologies and pitfalls, recent developments and future perspectives, and show how this technology is transforming modern biological science.
Metagenomic studies, accelerated by the evolution of sequencing technologies and the rapid development of genomic analysis methods, can reveal genetic diversity and biodiversity in various samples including those of uncultured or unknown species. This approach, however, cannot be used to identify active functional genes under actual environmental conditions. Metatranscriptomics, which is similar in approach to metagenomics except that it utilizes RNA samples, is a powerful tool for the transcriptomic study of environmental samples. Unlike metagenomic studies, metatranscriptomic studies have not been popular to date due to problems with reliability, repeatability, redundancy and cost performance.
12. Inferring Viral Quasispecies Spectra from Shotgun and Amplicon Next-generation Sequencing Reads
Irina Astrovskaya, Nicholas Mancuso, Bassam Tork, Serghei Mangul, Alex Artyomenko, Pavel Skums, Lilia Ganova-Raeva, Ion Măndoiu and Alex Zelikovsky
Many clinically relevant viruses, including hepatitis C virus (HCV) and human immunodeficiency virus (HIV), exhibit high genomic diversity within infected hosts, which may explain the failure of vaccines and resistance to existing antiviral therapies. Characterizing the viral population infecting a host requires reconstructing all co-existing (related, but non-identical) viral variants, referred to as quasispecies, and inferring their relative abundances. Next-generation sequencing is a promising approach for characterizing viral diversity due to its ability to generate large numbers of reads at a low cost. However, standard assembly software was originally designed for a single genome assembly and cannot be used to assemble multiple closely related quasispecies sequences and estimate their abundances. In this chapter, we focus on the problem of reconstructing viral quasispecies populations from next-generation sequencing reads produced by the two most commonly used strategies: shotgun sequencing and the sequencing of partially overlapping PCR amplicons. We discuss computational challenges associated with each strategy and review existing approaches to quasispecies reconstruction with a focus on two state-of-the-art software tools: Viral Spectrum Assembler (ViSpA), designed for shotgun reads, and Viral Assembler (VirA), which handles amplicon reads. Both tools have been tested on simulated and real read data from HCV, HIV (ViSpA) and HBV (VirA) quasispecies, and shown to compare favorably with other existing methods.
13. DNA Instability in Bacterial Genomes: Causes and Consequences
Pedro H. Oliveira, Duarte M. F. Prazeres and Gabriel A. Monteiro
DNA is a structurally dynamic molecule that is central to cellular processes such as replication, transcription and recombination. In order to maintain genomic integrity, bacteria have developed a finely tuned and interwoven network of mechanisms that operate at multiple levels, and include damage recognition, signaling pathways, and DNA repair. On the other hand, without the capacity to accommodate genotypic variation up to a certain extent, bacteria would not be able to modify their fitness when faced with constantly changing environments. Herein we review our current knowledge on bacterial genome instability, with particular emphasis on findings gained from the often-studied gram-negative model organism Escherichia coli. We will address topics such as spontaneous and stress-induced mutagenesis, major DNA repair pathways, and the design of more stable genomes. Major questions and future challenges will also be discussed.
14. Comparative Methods For RNA Structure Prediction
Eckart Bindewald and Bruce A. Shapiro
The appreciation of the pervasiveness of RNA biology continues to increase. The vastness of available sequence information calls for computational tools that aid in a variety of prediction problems, such as RNA structure prediction, RNA-RNA interaction prediction and genomic scans for conserved RNA structural elements. Comparative methods for RNA structure prediction employ a set of homologous RNA sequences; this additional information can then be used not only to estimate energy contributions, but also to exploit information in the form of compensatory base changes. Different software tools make use of this information in a fascinating variety of ways. This paper surveys current comparative approaches for RNA structure prediction and discusses a variety of future trends.
15. Context-free Grammars and RNA Secondary Structure Prediction
Markus E. Nebel and Anika Schulz
For a long time, computational methods for RNA secondary structure prediction were typically based on more or less complex models of the free energy, defined by experimentally derived thermodynamic parameters and incomplete free energy rules. However, because even comprehensive state-of-the-art thermodynamic models fail to capture some important, non-energetic influences on sequence folding, an attractive alternative is to use stochastic approaches with parameters estimated from growing databases of structural RNAs. This motivated the development of a competing methodology for computational RNA structure prediction that builds on principles of probabilistic modeling of the class of possible foldings rather than on incomplete free energy models. Such probabilistic prediction approaches are generally based on more or less powerful extensions of the concept of traditional context-free grammars that are indeed able to capture the specific structural information collected in an arbitrary database of reliable RNAs. This chapter deals with such probabilistic RNA folding approaches based on context-free modeling. Note that we here call an approach probabilistic if and only if it abstracts from general thermodynamic models and instead tries to learn about the structural behavior of the molecules by training (a manageable number of) probabilistic parameters from trusted RNA structure databases. In that sense, partition function approaches, even if providing pairing probabilities, are not assumed to be probabilistic.
16. Stochastic Context-free Grammars and RNA Secondary Structure Prediction
James W. J. Anderson
Prediction of RNA secondary structure from a single sequence, or an alignment of sequences, is a core problem in bioinformatics. Many approaches to RNA secondary structure prediction have been attempted, and probabilistic methods using stochastic context-free grammars (SCFGs) have been among the more successful. In particular, SCFGs can be combined with a molecular evolution model to produce consensus structure predictions which predict RNA secondary structure more accurately than single-sequence prediction. The use of SCFGs in RNA secondary structure prediction, and the potential for further developments, make for a truly interesting topic. In this chapter we discuss the application of SCFGs to RNA secondary structure prediction, from a single sequence, or a single fixed alignment. An introduction to RNA secondary structure prediction is given; some technical issues for SCFGs, such as normal forms and grammar design, are discussed; and methods are shown for estimating SCFG parameters. Methods are shown for predicting RNA secondary structures, and some measures are given for analysing SCFG variability. Finally, their predictive quality is briefly discussed, with some suggestions for further work and web resources given.
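To make single-sequence SCFG prediction concrete, here is a minimal sketch of a CYK-style search for the most probable parse under a toy grammar. Everything in it is an assumption made for illustration and is not taken from the chapter: the grammar S -> x S y S | x S | (empty), the rule probabilities, the uniform emission probabilities and the function name `cyk_fold`. Real SCFGs for RNA use richer grammars with trained parameters and usually enforce a minimum hairpin loop length, which this sketch does not.

```python
import math

# Canonical and wobble pairs allowed by the toy grammar (an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def cyk_fold(seq, p_pair=0.25, p_single=0.65, p_end=0.10):
    """Most probable parse of seq under the toy SCFG
         S -> x S y S   (x pairs with y)   with probability p_pair
         S -> x S       (x unpaired)       with probability p_single
         S -> empty                        with probability p_end
    Emissions are uniform over 6 pairs / 4 bases (illustrative values).
    Returns (dot-bracket structure, log probability of the best parse).
    """
    log_pair = math.log(p_pair / 6.0)
    log_single = math.log(p_single / 4.0)
    log_end = math.log(p_end)
    n = len(seq)
    # best[i][j]: log probability of the best parse of seq[i:j].
    best = [[-math.inf] * (n + 1) for _ in range(n + 1)]
    # choice[i][j]: partner index of seq[i], or None if seq[i] is unpaired.
    choice = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        best[i][i] = log_end
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            score, pick = log_single + best[i + 1][j], None
            for k in range(i + 1, j):
                if (seq[i], seq[k]) in PAIRS:
                    cand = log_pair + best[i + 1][k] + best[k + 1][j]
                    if cand > score:
                        score, pick = cand, k
            best[i][j], choice[i][j] = score, pick
    # Trace back the rule choices to recover the structure.
    struct = ["."] * n
    stack = [(0, n)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        k = choice[i][j]
        if k is None:
            stack.append((i + 1, j))
        else:
            struct[i], struct[k] = "(", ")"
            stack.append((i + 1, k))
            stack.append((k + 1, j))
    return "".join(struct), best[0][n]


structure, log_prob = cyk_fold("GGGAAACCC")
```

The predicted structure depends entirely on the chosen rule probabilities, which is why parameter estimation from trusted structures matters in practice.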
(EAN: 9781908230294 9781908230683 Subjects: [molecular microbiology] [genomics] [bioinformatics] [molecular biology] )