Ordered sequences of molecules are the central concept of biological information. DNA is composed of four nucleotides (A, C, G, T) and stores information across generations. Our understanding of these DNA sequences is still very limited. We cannot read them like a book, because we have not yet learned their grammar and vocabulary. Consequently, there is currently no reliable way to predict the biological purpose of a stretch of DNA from its sequence alone.
Deep learning methods offer new ways to shed light on our genome and to elucidate the structure of genes and their regulation. However, the limited interpretability of deep models and the difficulty of modeling long, variable-length sequences hinder the use of deep learning in biology.
We are working on new approaches to address these issues. Examples include sequence classification using convolutional neural networks and generative models for variable-length sequences using recurrent variational autoencoders. In this talk, we will give an overview of biological sequences, their fascinating properties, and their relevance for disease biology. We will demonstrate some of our methods and their applications. Finally, we will present some general ideas drawn from our research that are relevant to other fields.
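
As a rough illustration of how a DNA sequence can be turned into numerical input for a convolutional network, the sketch below one-hot encodes each nucleotide and zero-pads shorter sequences to a common length. This is a common preprocessing step in general, not necessarily the one used in our methods; the function name `one_hot_encode`, the fixed `length` parameter, and the treatment of unknown bases such as N as all-zero rows are assumptions made for this example.

```python
import numpy as np

# Column order of the encoding (an assumption of this sketch).
NUCLEOTIDES = "ACGT"
INDEX = {base: i for i, base in enumerate(NUCLEOTIDES)}


def one_hot_encode(seq, length=None):
    """Encode a DNA string as a (length, 4) array of 0s and 1s.

    Sequences shorter than `length` are zero-padded at the end, longer
    ones are truncated. Bases outside A, C, G, T (for example N) are
    left as all-zero rows.
    """
    seq = seq.upper()
    if length is None:
        length = len(seq)
    encoded = np.zeros((length, len(NUCLEOTIDES)), dtype=np.float32)
    for pos, base in enumerate(seq[:length]):
        idx = INDEX.get(base)
        if idx is not None:
            encoded[pos, idx] = 1.0
    return encoded


# Variable-length sequences become one fixed-size batch for a CNN.
batch = np.stack([one_hot_encode(s, length=8) for s in ["ACGT", "GGNTTACA"]])
print(batch.shape)
```

Padding to a fixed length is the simplest way to batch sequences for a convolutional model, but it does not by itself address the variable-length problem mentioned above.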