Deciphering transcriptional regulation using deep neural networks
Abstract
DNA holds the recipe for all life functions. To decipher its instructions, one has to learn and understand its complex syntax. Non-coding DNA contains regulatory elements that are essential for activating and controlling gene expression in the right place at the right time. Previous studies have successfully applied deep learning to predict gene expression directly from non-coding sequences. Almeida et al. [1] showed that a Convolutional Neural Network could learn regulatory syntax from long, equal-length fragments from the fruit fly. In this thesis, we tested how well deep neural networks could predict gene expression from short DNA fragments of varying lengths from the Atlantic salmon. Furthermore, we extracted what the models had learned and tested whether the learned sequence features corresponded to known regulatory sequence patterns (motifs).
We built two deep neural network architectures: a Convolutional Neural Network (CNN) and a hybrid Convolutional and Long Short-Term Memory Neural Network (CNN-LSTM). We trained the models to predict the gene expression effect of DNA fragments from open chromatin of liver cells. The two architectures performed equally well, and their performance depended on the amount of noise in the validation data, reaching a correlation of 0.68 on the sequences in the top 10% by base mean.
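The evaluation on the top 10% of sequences by base mean can be illustrated with a minimal sketch. It assumes that the reported correlation is a Pearson correlation and that base mean is a per-fragment coverage value where higher values indicate less noisy measurements; neither assumption is stated in this abstract. The function name and its parameters are hypothetical and chosen for illustration only.

```python
import numpy as np

def correlation_on_top_base_mean(y_true, y_pred, base_mean, top_fraction=0.10):
    """Pearson correlation restricted to the fragments with the highest base mean.

    Assumption: Pearson correlation is used here for illustration; the
    abstract only reports "a correlation".
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    base_mean = np.asarray(base_mean, dtype=float)

    # Cutoff separating the top `top_fraction` of fragments by base mean.
    cutoff = np.quantile(base_mean, 1.0 - top_fraction)
    keep = base_mean >= cutoff

    # Off-diagonal entry of the 2x2 correlation matrix is Pearson's r.
    r = np.corrcoef(y_true[keep], y_pred[keep])[0, 1]
    return r, int(keep.sum())
```

Repeating the call with different values of `top_fraction` would show how the measured performance changes as noisier fragments are included in the validation set.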
We extracted motifs both from the first-layer convolutional filters and from sequence importance scores, and we compared these motifs to the JASPAR database of known vertebrate transcription factor binding site motifs. Among the significant matches to JASPAR, we found some general transcription factors such as TFCP2, HSF and AP-1, as well as some liver-specific transcription factors such as KLF15 and HNF6. Most motifs did not match any JASPAR motif. We discuss the tendency of CNNs to distribute partial motifs across several filters, and the possibility that other sequence features are also important for prediction. Our results suggest that the models learned regulatory DNA syntax equally well, despite their different architectures, and we compared the motif findings in light of these differences.
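One common way to turn a first-layer convolutional filter into a motif is to collect the subsequences that activate the filter strongly and count the bases at each position. The sketch below illustrates that idea only; it is not necessarily the procedure used in this thesis. The function names, the threshold of half the maximum activation, and the restriction to sequences containing only A, C, G and T are all assumptions made for illustration.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """One-hot encode a DNA string (A, C, G, T only) as a (length, 4) array."""
    idx = np.array([ALPHABET.index(base) for base in seq])
    return np.eye(4)[idx]

def filter_to_pfm(filter_weights, sequences, threshold_fraction=0.5):
    """Build a position frequency matrix from strongly activating windows.

    filter_weights: array of shape (width, 4), one first-layer filter.
    threshold_fraction: assumed cutoff relative to the maximum activation.
    Returns a (width, 4) matrix whose rows sum to 1, or None if the filter
    never activates positively.
    """
    width = filter_weights.shape[0]
    windows, scores = [], []
    for seq in sequences:
        encoded = one_hot(seq)
        for start in range(len(seq) - width + 1):
            window = encoded[start:start + width]
            windows.append(window)
            # Activation of the filter at this position (no bias term).
            scores.append(float(np.sum(window * filter_weights)))

    if not scores:
        return None
    scores = np.array(scores)
    max_score = scores.max()
    if max_score <= 0:
        return None

    # Keep windows whose activation is close enough to the maximum.
    keep = scores >= threshold_fraction * max_score
    counts = np.array(windows)[keep].sum(axis=0)
    return counts / counts.sum(axis=1, keepdims=True)
```

A matrix obtained this way could then be compared against a motif database with a motif comparison tool. Because a CNN may split one motif across several filters, a single filter's matrix can represent only part of a binding site.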
This thesis demonstrates the potential of deep neural networks for the analysis of ATAC-STARR-seq data and suggests improvements worth exploring to increase performance. We also stress the need for more robust model interpretation techniques, which could unlock valuable knowledge in the future of genomics.