Comparison of Pre-processing Methods and Various Machine Learning Models for Survival Analysis on Cancer Data

Karovic, Haris

dc.contributor.advisor	Oliver Tomic
dc.contributor.advisor	Cecilia Marie Futsæther
dc.contributor.author	Karovic, Haris
dc.date.accessioned	2023-07-25T16:27:15Z
dc.date.available	2023-07-25T16:27:15Z
dc.date.issued	2023
dc.identifier	no.nmbu:wiseflow:6866313:55029707
dc.identifier.uri	https://hdl.handle.net/11250/3081311
dc.description.abstract	Colorectal cancer and cancers of the head and neck region remain a major problem in medicine and the healthcare sector. In 2021 alone, 11 121 deaths were attributed to various cancers, with colorectal and head and neck cancer among the more common types. In today's digital age, hospitals and researchers are collecting more data than ever before. Many studies include patients whose follow-up, or the study itself, ended before the event of interest occurred. Discarding those patients before applying machine learning methods loses valuable information; survival analysis can be applied instead. Survival analysis uses the censoring variable, which indicates whether or not the event of interest took place before the study ended. In this thesis several pre-processing techniques were used: removal of outliers, feature distribution transformations and feature selection. These techniques were applied together with multiple machine learning algorithms from the scikit-learn and scikit-survival libraries. The survival algorithms used were the regularized Cox model with elastic net (Coxnet), random survival forest, tree-based gradient boosting and gradient boosting with partial least squares as base learner. These algorithms take the censoring variable into account in addition to the survival time. The other machine learning algorithms used were linear regression, ridge regression and partial least squares regression (PLSR); these three algorithms use only the survival time as the target and do not account for the censoring variable. Two datasets were used in this thesis, one with patients diagnosed with colorectal cancer and one with patients diagnosed with various head and neck cancers. Two experiments were carried out separately and validated using repeated stratified k-fold cross-validation.
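The validation scheme named above, repeated stratified k-fold cross-validation, can be sketched in plain Python. This is a minimal illustration, not the thesis code: it assumes that folds are stratified on the event indicator (event observed versus censored), which the abstract does not state, and the function name `repeated_stratified_kfold` and the toy data are hypothetical. scikit-learn provides `RepeatedStratifiedKFold` for the same purpose.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(events, n_splits=5, n_repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs, stratified on the event indicator.

    Assumption: the stratification label is the censoring/event flag.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        # Group sample indices by stratum (event observed vs. censored).
        strata = defaultdict(list)
        for idx, event in enumerate(events):
            strata[bool(event)].append(idx)

        folds = [[] for _ in range(n_splits)]
        for members in strata.values():
            rng.shuffle(members)
            # Deal each stratum round-robin so every fold gets a similar share.
            for pos, idx in enumerate(members):
                folds[pos % n_splits].append(idx)

        for k in range(n_splits):
            test_idx = sorted(folds[k])
            train_idx = sorted(
                i for j, fold in enumerate(folds) if j != k for i in fold
            )
            yield train_idx, test_idx


# Toy data (hypothetical): 10 patients with an observed event, 20 censored.
events = [True] * 10 + [False] * 20
splits = list(repeated_stratified_kfold(events, n_splits=5, n_repeats=2, seed=42))
```

Each test fold keeps the same share of censored patients as the full dataset, and every patient is used for testing once per repeat, so no data is discarded.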
In the first experiment the models were fitted to different feature transformations of the datasets in combination with feature selection techniques. The second experiment involved hyperparameter tuning for the survival models. There was little difference in performance between the transformations, with no improvement on the head and neck dataset; for the high-dimensional colorectal cancer dataset, however, the power transformation led to a very small increase of 0.02 in the concordance index. The feature selection techniques improved the performance of four of the models: linear regression, ridge regression, PLSR and Coxnet. For the more advanced survival models, gradient boosting and random survival forest, feature selection did not in general improve the metrics, possibly because these models already select features greedily and update feature weights on their own. The best model in the first experiment for the colorectal cancer dataset (OxyTarget) was Random Forest with a power transformation applied beforehand and all features available, which resulted in a concordance index of 0.83. For the head and neck dataset, component-wise gradient boosting, Coxnet and PLSR all achieved the highest concordance index of 0.77, with Coxnet achieving that score across all three transformations. In the second experiment, all the survival models were tuned over different hyperparameters to see whether the various metrics would improve. A small performance increase could be seen for several models. For the colorectal cancer dataset, however, a Coxnet model tuned with a low regularization strength and a low l1_ratio yielded a large increase in the concordance index and resulted in the best model, with a score of 0.827. For the head and neck dataset, tuning min_weight_fraction_leaf and max_depth for the random survival forest algorithm resulted in the best model, with a concordance index of 0.787.
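The concordance index reported above measures how well a model ranks patients by risk while respecting censoring. The following is a minimal pure-Python sketch of Harrell's concordance index, not the implementation used in the thesis. It makes two simplifying assumptions: pairs with tied times are ignored, and a pair counts as comparable only when the patient with the shorter time had an observed event. The function name and the toy values are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when patient i has an observed event and a
    strictly shorter time than patient j. Ties in risk count as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): patients 1 and 3 are censored.
times = [2, 4, 6, 8]
events = [True, False, True, False]

perfect = concordance_index(times, events, [0.9, 0.4, 0.7, 0.1])
reversed_ranking = concordance_index(times, events, [0.1, 0.4, 0.7, 0.9])
uninformative = concordance_index(times, events, [0.5, 0.5, 0.5, 0.5])
```

A value of 1.0 means every comparable pair is ranked correctly and 0.5 corresponds to random ranking. For models that predict a survival time rather than a risk, such as linear regression, the negated prediction can serve as the risk score.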
The research, and the framework created to conduct these experiments, show that better ranking performance can be achieved, while keeping the models robust, through pre-processing techniques and through using all available data via repeated stratified k-fold cross-validation. However, the research also shows that there is no universally best algorithm or method for survival analysis on cancer data; the best choice depends on the data.
dc.language	eng
dc.publisher	Norwegian University of Life Sciences
dc.title	Comparison of Pre-processing Methods and Various Machine Learning Models for Survival Analysis on Cancer Data
dc.type	Master thesis

This item appears in the following collection(s)

Master's theses (RealTek) [1722]