
dc.contributor.advisor: Kristian Hovde Liland
dc.contributor.advisor: Lars-Gustav Snipen
dc.contributor.author: Steinset, August Noer
dc.date.accessioned: 2024-08-23T16:28:49Z
dc.date.available: 2024-08-23T16:28:49Z
dc.date.issued: 2024
dc.identifier: no.nmbu:wiseflow:7110333:59110586
dc.identifier.uri: https://hdl.handle.net/11250/3147983
dc.description.abstract: As more and more genomes are sequenced, fast methods for analyzing them become ever more necessary. One such analysis is finding regions of the genome that contain descriptions of proteins we are familiar with. Pfam is a database containing these descriptions, and tools like HMMER can match known proteins and protein substructures onto sequences. This process is slow, and this thesis explores how to speed it up by creating machine-learning models that filter out the sequences less likely to contain known proteins. This is done by analyzing large datasets containing hundreds of species sampled by different means and by employing different target values. This analysis yields valuable insights: high GC content negatively impacts model performance, and machine-learning models such as convolutional neural networks can describe specific Pfam patterns close to perfectly, with accuracies of 99.7% on balanced datasets. The end result of the thesis is that a filtering mechanism should be possible to create, but it would require significantly more work to get it working to a degree where it both included the necessary. Ensuring that enough Pfam entries are represented would be a good starting step. One of the models this thesis employs is the Tsetlin machine, in the hope that its unique structure would make it well suited to such data. The results, however, did not show this to be the case. While the Tsetlin machine might still be well suited to similar data, some changes would have to be made to how it was employed in this thesis.
dc.language: eng
dc.publisher: Norwegian University of Life Sciences
dc.title: Deep learning for Direct DNA Domain Detection
dc.type: Master thesis

