Keynote and invited abstracts
James Taylor, Computing Chromosome Conformation
Gene regulation - control of when, where, and at what level genes are expressed - is a fundamental part of cell development and identity. Gene regulation involves complex coordination of DNA architecture at multiple scales, from individual DNA bases to the organization of whole chromosomes. Chromosomes in the eukaryotic nucleus are organized in a coordinated, non-random configuration that has a substantial influence on the regulation of gene expression, and thus cell state and identity. Achieving a full understanding of gene regulation requires a multi-scale understanding of the function of the genome in its developmental and structural context. In recent years, our ability to understand this organization has substantially increased due to a variety of high-throughput assays. Chromatin interactions can be interrogated globally using high-throughput sequencing-based approaches, including Hi-C, in both populations of cells and, more recently, single cells. Localization in single cells can also be interrogated using fluorescence imaging approaches that are increasingly high-resolution and high-throughput. Here I will discuss the computational challenges in analyzing and integrating these data types, and the resulting insights into our current understanding of how chromatin is organized. In addition, I will describe recent advances in software tools and infrastructure that help to facilitate the analyses of large-scale biological datasets.
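As an editorial illustration of the kind of computation the abstract alludes to: the first step of most Hi-C analyses is aggregating read-pair contacts into a binned contact matrix. The sketch below is a minimal, hypothetical version of that binning step (the function name and bin size are illustrative, not from any specific tool).

```python
from collections import defaultdict

def bin_contacts(pairs, bin_size=1_000_000):
    """Aggregate Hi-C contact pairs (pos1, pos2) on one chromosome
    into a symmetric, binned contact-count matrix (stored sparsely).
    Toy sketch only; real pipelines also filter and normalize."""
    counts = defaultdict(int)
    for p1, p2 in pairs:
        i, j = p1 // bin_size, p2 // bin_size
        if i > j:
            i, j = j, i          # store the upper triangle only
        counts[(i, j)] += 1
    return counts

# Toy example: three contact pairs, 1 Mb bins
pairs = [(500_000, 1_500_000), (600_000, 1_400_000), (2_100_000, 2_200_000)]
m = bin_contacts(pairs)
print(m[(0, 1)], m[(2, 2)])  # 2 contacts between bins 0-1, 1 within bin 2
```

Real Hi-C pipelines follow this with matrix balancing (e.g. ICE normalization) before calling structures such as compartments and TADs.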
Genome analysis is the foundation of many scientific and medical
discoveries as well as a key pillar of personalized medicine. Any
analysis of a genome fundamentally starts with the reconstruction of
the genome from its sequenced fragments. This process is called read
mapping. One key goal of read mapping is to find the variations that
are present between the sequenced genome and the reference genome(s), and
to tolerate the errors introduced by the genome sequencing
process. Read mapping is currently a major bottleneck in the entire
genome analysis pipeline because state-of-the-art genome sequencing
technologies are able to sequence a genome much faster than the
computational techniques that are employed to reconstruct the
genome. New sequencing technologies, like nanopore sequencing, greatly
exacerbate this problem while at the same time making genome
sequencing much less costly.
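To make the error-tolerance requirement concrete, the core of many read mappers' verification step is edit distance: the minimum number of substitutions, insertions, and deletions separating a read from a reference window. The classic dynamic-programming formulation is sketched below; this is a generic textbook version, not the speaker's accelerated implementation.

```python
def edit_distance(read, ref):
    """Levenshtein edit distance via dynamic programming,
    keeping only two rows of the DP matrix (O(n) space)."""
    m, n = len(read), len(ref)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if read[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion from read
                         cur[j - 1] + 1,       # insertion into read
                         prev[j - 1] + cost)   # match / substitution
        prev = cur
    return prev[n]

print(edit_distance("ACGT", "ACGT"))   # 0
print(edit_distance("ACGT", "AGGT"))   # 1 (substitution)
print(edit_distance("ACGT", "ACGGT"))  # 1 (insertion)
```

This quadratic-time kernel is exactly what the algorithmic and hardware acceleration techniques discussed in the talk target: pre-filtering candidate locations so it runs rarely, and computing it faster when it must run.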
This talk describes our ongoing journey in greatly improving the performance of genome read mapping. We first provide a brief background on read mappers that can comprehensively find variations and tolerate sequencing errors. Then, we describe both algorithmic and hardware-based acceleration approaches. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or new execution paradigms, like processing in memory. We show that significant improvements are possible with both algorithmic and hardware-based approaches and their combination. We conclude with a foreshadowing of future challenges brought about by very low-cost yet highly error-prone new sequencing technologies.
General-purpose processors can now contain many dozens of processor cores and support hundreds of simultaneous threads of execution. To make best use of these threads, genomics software must contend with new and subtle computer architecture issues. I will discuss tradeoffs related to lock types, input parsing strategies, batching, output striping and multiprocessing versus multithreading. I will also explore how the FASTQ file format -- its unpredictable record boundaries in particular -- can impede thread scaling. I'll suggest simple ways to change FASTQ files and similar formats that enable further improvements in thread scaling while maintaining essentially the same compressed file size. Finally, I will show how these improvements affect performance of the popular Bowtie, Bowtie 2 and HISAT alignment tools across various general-purpose architectures including Intel Skylake and Knights Landing.
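Why are FASTQ record boundaries "unpredictable"? A record is four lines, but the '@' that opens a header can also appear in quality strings (it encodes quality 31 in Phred+33), so a thread that seeks to an arbitrary offset cannot simply scan for '@'. The sketch below illustrates one common heuristic for resynchronizing at a record boundary; it is an editorial illustration with hypothetical names, not the talk's proposed format change, and the heuristic can still fail on pathological inputs.

```python
def next_record_start(lines, start):
    """Given FASTQ lines and an arbitrary starting line index, return
    the index of the next record header. A line starting with '@' is
    accepted as a header only if the line two below starts with '+',
    since '@' may also open a quality string."""
    for k in range(start, len(lines) - 2):
        if lines[k].startswith("@") and lines[k + 2].startswith("+"):
            return k
    return None

fastq = [
    "@read1", "ACGT", "+", "@@FF",   # quality line starts with '@'!
    "@read2", "TTGA", "+", "IIII",
]
# Starting mid-record at line 3 (the '@@FF' quality line), the scanner
# skips the decoy '@' and finds the real header at line 4.
print(next_record_start(fastq, 3))  # 4
```

The cost and fragility of this resynchronization is precisely what motivates formats with predictable record boundaries.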
Bioinformatics technologies have always been in a race with genomics technologies, especially in the era of high-throughput sequencing, to deliver timely results. For example, as the sequencing throughput on the Illumina platforms increased from millions to tens and hundreds of millions of reads per lane and beyond, sequence analysis methods shifted through several paradigms to offer analytical capability to projects that use these platforms.
The value of innovative data types in bioinformatics applications has been demonstrated several times. The most prominent examples are the use of FM-indexing for rapid read alignment, and k-mer mapping for expression quantification. Here, we will introduce a novel data type we call the multi-index Bloom filter (miBF), and present sequence mapping as a potential use case. Another feature of miBF is the use of spaced seeds for error-tolerant mapping. In our benchmarking experiments, we note that sequence mapping based on miBF performs an order of magnitude faster than popular sequence alignment methods, while reporting similar sensitivity and specificity measures.
The human reference genome is part of the foundation of modern human biology, and a monumental scientific achievement. However, because it excludes a great deal of common human variation, it introduces a pervasive reference bias into the field of human genomics. To reduce this bias, it makes sense to draw on representative collections of human genomes, brought together into reference cohorts. There are a number of techniques to represent and organize data gleaned from these cohorts, many using ideas implicitly or explicitly borrowed from graph based models. Here I survey our progress in this domain, and show how genome graphs, associated data structures and population genomics models can be used to efficiently map sequencing reads not to one genome, but simultaneously to the haplotypes of thousands of genomes.
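To make the genome-graph idea concrete, here is a deliberately tiny sketch: a variation graph with one SNP "bubble", where every source-to-sink path spells one haplotype. A read carrying the alternate allele matches a graph path even though it mismatches the single linear reference. The data layout is hypothetical and uses exact substring matching; real graph mappers align approximately against haplotype-constrained paths.

```python
# A tiny variation graph: shared flank, a SNP bubble (A vs G), shared flank.
nodes = {1: "ACGT", 2: "A", 3: "G", 4: "TTC"}   # node id -> sequence
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}       # adjacency list

def spell_paths(node, prefix=""):
    """Enumerate the sequence spelled by every source-to-sink path."""
    seq = prefix + nodes[node]
    if not edges[node]:
        yield seq
    for nxt in edges[node]:
        yield from spell_paths(nxt, seq)

haplotypes = list(spell_paths(1))
print(haplotypes)  # ['ACGTATTC', 'ACGTGTTC']

# A read carrying the ALT allele maps to the graph even though it
# mismatches the linear reference haplotype 'ACGTATTC':
read = "GTGTT"
print(any(read in h for h in haplotypes))  # True
```

The path count grows exponentially with the number of bubbles, which is why practical tools index haplotypes compactly rather than enumerating paths as done here.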
HiCOMB 2018 Call For Papers
The size and complexity of genome- and proteome-scale data sets in bioinformatics continue to grow at a furious pace, and the analysis of these complex, noisy data sets demands efficient algorithms and high-performance computer architectures. Hence, high-performance computing has become an integral part of research and development in bioinformatics, computational biology, and medical and health informatics. The goal of this workshop is to provide a forum for discussion of the latest research in developing high-performance computing solutions to data- and compute-intensive problems arising from all areas of computational life sciences. We are especially interested in parallel and distributed algorithms, memory-efficient algorithms, large-scale data mining techniques including approaches for big data and cloud computing, algorithms on multicores, many-cores and GPUs, and design of high-performance software and hardware for biological applications.
The workshop will feature contributed papers as well as invited talks from reputed researchers in the field.
Topics of interest include but are not limited to:
- Bioinformatics data analytics
- Biological network analysis
- Cloud-enabled solutions for computational biology
- Computational genomics and metagenomics
- Computational proteomics and metaproteomics
- DNA assembly, clustering, and mapping
- Energy-aware high performance biological applications
- Gene identification and annotation
- High performance algorithms for computational systems biology
- High throughput, high dimensional data analysis: flow cytometry and related proteomic data
- Parallel algorithms for biological sequence analysis
- Molecular evolution and phylogenetic reconstruction algorithms
- Protein structure prediction and modeling
- Parallel algorithms in chemical genetics and chemical informatics
- Transcriptome analysis with RNASeq
To submit a paper, please upload a PDF file through EasyChair at the HiCOMB 2018 Submission Site. Submitted manuscripts may not exceed ten (10) single-spaced double-column pages using a 10-point font on 8.5x11-inch pages (IEEE conference style), including figures, tables, and references (see the IPDPS Call for Papers for more details). All papers will be reviewed. Proceedings of the workshops will be distributed at the conference and are submitted for inclusion in the IEEE Xplore Digital Library after the conference.
Workshop submissions due: February 26, 2018
Final camera-ready papers due: March 15, 2018
Workshop: May 21, 2018
- James Taylor, Ralph S. O'Connor Associate Professor of Biology and Associate Professor of Computer Science, Johns Hopkins University
- Onur Mutlu, Professor of Computer Science, ETH Zurich
- Ariful Azad, Lawrence Berkeley Lab
- Rayan Chikhi, CNRS, University of Lille 1
- Faraz Hach, Simon Fraser University
- Niina S. Haiminen, IBM
- Fereydoun Hormozdiari, UC Davis
- Ananth Kalyanaraman, Washington State University
- Daisuke Kihara, Purdue University
- Mehmet Koyuturk, Case Western Reserve University
- Benjamin Langmead, Johns Hopkins University
- Kamesh Madduri, Penn State
- Paul Medvedev (Chair), Penn State
- Alba Cristina Magalhaes Alves de Melo, University of Brasilia
- Folker Meyer, Argonne National Lab
- Rob Patro, Stony Brook University
- Knut Reinert, Freie Universität Berlin
- Jan Schroeder, The Walter and Eliza Hall Institute of Medical Research
- Alexandros Stamatakis, Heidelberg Institute for Theoretical Studies
- Sharma Thankachan, Georgia Tech
- Jaroslaw Zola, University at Buffalo, SUNY
- Paul Medvedev
Penn State University
Email: firstname.lastname@example.org
Steering Committee Members
- David A. Bader
College of Computing
Georgia Institute of Technology
- Srinivas Aluru
College of Computing
Georgia Institute of Technology