Evaluation of machine learning methods to decode transcriptional regulation
Abstract
High-throughput technologies have made large-scale biological measurements possible, which in turn allows genomic data to be studied in detail. Through transcriptional regulation, the cell controls the transcription of DNA to RNA, and thereby which genes are expressed. Exploring the regulatory mechanisms underlying the genes that are controlled in a cell is part of epigenetic profiling, an approach that can be used to find accessible DNA regions that control transcriptional regulation. To our knowledge, machine learning models have not previously been tested on ATAC-STARR-seq data with varying fragment lengths from the salmon genome. Using the results of an ATAC-STARR-seq experiment on a salmon genome, we explore the extent to which transcriptional activity can be predicted from sequence fragments.
In this thesis, we attempt to predict the degree to which sequence fragments can drive transcription, using the ATAC-STARR-seq results for over five million sequence fragments. To improve the performance of the machine learning models, only fragments within the top 10 % by basemean were selected for training, validation, and testing. Feature engineering and feature importance analysis were crucial for training the models and for extracting relevant features for motif search. We evaluated three classical machine learning methods for regression, each predicting log2FoldChange values: the ensemble methods XGBoost Regressor and Random Forest Regressor, and Linear Support Vector Regression. XGBoost Regressor and Random Forest Regressor are both powerful algorithms that have been used successfully on sequential data. Linear Support Vector Regression was chosen in part because it can handle a larger number of samples than Support Vector Regression.
XGBoost Regressor and Random Forest Regressor performed remarkably similarly, with potential for improvement, while the Linear Support Vector Regression model had more trouble capturing the complexity of the data. Despite these results, it was possible to extract sequence features from the trained machine learning models and associate them with known transcription factors. The findings of this study indicate that the performance of the models still needs to improve. Limited time and time-consuming algorithms were a challenge; tuning with not yet tested parameters and other methods for improving the models remain for further research.
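The fragment selection and feature-engineering steps summarised above can be illustrated with a minimal sketch. It is not the pipeline used in the thesis: the record fields (`seq`, `baseMean`, `log2FoldChange`), the helper names, the toy data, and the choice of overlapping k-mer counts as sequence features are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def top_fraction_by_basemean(fragments, fraction=0.10):
    """Keep the fragments with the highest baseMean (top `fraction` of records)."""
    ranked = sorted(fragments, key=lambda f: f["baseMean"], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def kmer_counts(sequence, k=2):
    """Count overlapping k-mers in a fixed A/C/G/T order.

    Assumed feature set: the abstract does not say which features were engineered.
    The output length is 4**k regardless of fragment length.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts.get("".join(p), 0) for p in product("ACGT", repeat=k)]


# Hypothetical toy records with varying fragment lengths.
fragments = [
    {"seq": "ACGT" * (i + 1), "baseMean": float(i), "log2FoldChange": 0.1 * i}
    for i in range(10)
]

selected = top_fraction_by_basemean(fragments)   # top 10 % by baseMean
X = [kmer_counts(f["seq"]) for f in selected]    # feature matrix
y = [f["log2FoldChange"] for f in selected]      # regression target
```

A feature matrix `X` and target vector `y` of this shape could then be passed to any regressor with a `fit(X, y)` interface, such as the three model families named in the abstract.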