Evaluation of machine learning methods to decode transcriptional regulation

Jeyakumar, Harini

dc.contributor.advisor	Hvidsten, Torgeir Rhoden
dc.contributor.advisor	Liland, Kristian Hovde
dc.contributor.advisor	Sandve, Simen Rød
dc.contributor.advisor	Grønvold, Lars
dc.contributor.author	Jeyakumar, Harini
dc.date.accessioned	2023-07-18T16:27:12Z
dc.date.available	2023-07-18T16:27:12Z
dc.date.issued	2023
dc.identifier	no.nmbu:wiseflow:6839497:54728236
dc.identifier.uri	https://hdl.handle.net/11250/3079858
dc.description.abstract	The development of high-throughput technology has made large-scale biological measurements possible, which in turn allows for the study of genomic data. In transcriptional regulation, the cell controls the transcription of DNA to RNA, and thereby controls which genes are expressed. Epigenetic profiling explores the regulatory mechanisms underlying the genes that are controlled in a cell, and can be used to find accessible DNA regions that control transcriptional regulation. To our knowledge, machine learning models have not previously been tested on ATAC-STARR-seq data with varying fragment length from the salmon genome. In this thesis, we use the results of an ATAC-STARR-seq experiment on a salmon genome, covering over five million sequence fragments, to explore to what degree it is possible to predict from the sequence how strongly a fragment can drive transcription. To improve the performance of the machine learning models, only fragments within the top 10 % basemean were selected for training, validation, and testing. Techniques such as feature engineering and feature importance were crucial for training the models and extracting relevant features for motif search. We evaluated three classical machine learning methods for regression analysis predicting log2FoldChange values: the ensemble methods XGBoost Regressor and Random Forest Regressor, and Linear Support Vector Regression. XGBoost Regressor and Random Forest Regressor are both powerful algorithms known to have been used successfully on sequential data. Linear Support Vector Regression was chosen in part because it handles a large number of samples better than Support Vector Regression. XGBoost Regressor and Random Forest Regressor performed remarkably similarly, with potential for improvement.
The Linear Support Vector Regression model had more trouble capturing the complexity of the data. Despite the limited predictive performance, it was possible to extract sequence features from the trained machine learning models and associate them with known transcription factors. This study's findings indicate that there is still a need for improvements in the performance of the models. While time constraints and time-consuming algorithms have been a challenge, tuning with not yet tested parameters and other methods for improving the models remain possibilities for further research.
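The fragment selection and feature engineering steps described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say which sequence features were used, so k-mer counts are a stand-in, and the function names, record layout, and toy values are hypothetical rather than taken from the thesis.

```python
from collections import Counter
from itertools import product


def top_fraction_by_basemean(fragments, fraction=0.10):
    """Keep the fragments whose baseMean lies in the top `fraction` (hypothetical helper)."""
    ranked = sorted(fragments, key=lambda f: f["baseMean"], reverse=True)
    n_keep = max(1, int(len(ranked) * fraction))
    return ranked[:n_keep]


def kmer_features(sequence, k=3):
    """Count each DNA k-mer in `sequence` and return a fixed-length vector.

    k-mer counts are an assumed feature representation, not the thesis's own.
    """
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[kmer] for kmer in vocabulary]


# Toy records; the values are invented purely to exercise the functions.
toy = [
    {"sequence": "ACGTACGT", "baseMean": 12.0, "log2FoldChange": 0.4},
    {"sequence": "TTTTACGA", "baseMean": 250.0, "log2FoldChange": 1.3},
    {"sequence": "GGGCCCAA", "baseMean": 3.5, "log2FoldChange": -0.2},
    {"sequence": "ACACACAC", "baseMean": 90.0, "log2FoldChange": 0.9},
]

# A fraction of 0.5 is used only because the toy set is tiny.
selected = top_fraction_by_basemean(toy, fraction=0.5)
X = [kmer_features(f["sequence"], k=2) for f in selected]  # feature matrix
y = [f["log2FoldChange"] for f in selected]                # regression target
```

A feature matrix `X` and target vector `y` of this shape could then be passed to a regressor such as those named in the abstract. Because the fragments vary in length, raw counts might also be normalized by fragment length before training.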
dc.language	eng
dc.publisher	Norwegian University of Life Sciences, Ås
dc.title	Evaluation of machine learning methods to decode transcriptional regulation
dc.type	Master thesis
dc.description.localcode	M-BIAS

Associated file(s)

File name: no.nmbu:wiseflow:6839497:54728 ...
Size: 1.741Mb
Format: PDF

This item appears in the following collection(s)

Master's theses (KBM) [890]
