dc.contributor.advisor | Snipen, Lars-Gustav | |
dc.contributor.advisor | Bohlin, Jon | |
dc.contributor.advisor | Brynildsrud, Ola | |
dc.contributor.advisor | Knudsen, Per Kristian | |
dc.contributor.author | Liland, Jens Rasmus | |
dc.date.accessioned | 2019-12-18T09:17:23Z | |
dc.date.available | 2019-12-18T09:17:23Z | |
dc.date.issued | 2019 | |
dc.identifier.uri | http://hdl.handle.net/11250/2633830 | |
dc.description | Code appendix available in an open GitHub repository: https://github.com/jenslila/liland2019master | nb_NO |
dc.description.abstract | The aim of this project was to investigate to what extent some machine learning methods are able to distinguish chromosome reads from plasmid reads based on K-mer statistics. Both short Illumina HiSeq 2500 reads and medium-length Nanopore reads were simulated in silico from fully assembled E. coli chromosome and plasmid sequences. Both canonical and non-canonical K-mers were counted for all categories of sequence length. Working with in silico simulated data like this differs from a real-world experiment in that sequencing simulators like ART have arbitrary categorical simulation settings, e.g. boolean presence of sequencing error, which were adjusted to find optimal combinations. In terms of binary classification accuracy, K-mer methods worked very well for fully assembled genome sequences; accuracy decreased substantially to 61 % for the Illumina sequences, while remaining fairly high at 87 % for the Nanopore sequences. Wrongly classified reads are mainly classified as plasmids. A 37-fold increase in sequence length leads to a 42 % increase in accuracy. | nb_NO |
dc.description.abstract | The goal of the project was to examine the degree to which certain machine learning methods can separate chromosome reads from plasmid reads on the basis of K-mer statistics. Short Illumina HiSeq 2500 reads as well as medium-length Nanopore reads were simulated in silico from completely assembled E. coli chromosome and plasmid sequences. Canonical and non-canonical K-mers were both counted for every category of sequence length. Working with in silico simulated data such as these is unlike working with non-simulated data, because sequencing simulators such as ART have arbitrary, categorical simulation settings, e.g. boolean presence of sequencing errors, which were tuned to find optimal combinations. With respect to binary classification accuracy, K-mer methods performed very well on fully assembled genome sequences, dropping substantially to 61 % for the Illumina sequences but retaining a fairly high accuracy of 87 % for the Nanopore sequences. Misclassified reads are mostly classified as plasmids. A 37-fold increase in sequence length leads to a 42 % increase in accuracy. | nb_NO |
dc.language.iso | eng | nb_NO |
dc.publisher | Norwegian University of Life Sciences, Ås | nb_NO |
dc.rights | Attribution-NonCommercial-NoDerivatives 4.0 International | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/deed.no | * |
dc.subject | Escherichia coli | nb_NO |
dc.subject | NCBI | nb_NO |
dc.subject | Illumina | nb_NO |
dc.subject | Oxford Nanopore | nb_NO |
dc.subject | ART | nb_NO |
dc.subject | DeepSimulator | nb_NO |
dc.subject | ANOVA | nb_NO |
dc.subject | K-nearest neighbour classifier | nb_NO |
dc.subject | Random Forest classifier | nb_NO |
dc.subject | R programming language | nb_NO |
dc.subject | Python programming language | nb_NO |
dc.subject | Principal Component Analysis | nb_NO |
dc.subject | K-mer frequencies | nb_NO |
dc.subject | Classification | nb_NO |
dc.subject | Plasmids | nb_NO |
dc.subject | Chromosomes | nb_NO |
dc.subject | Canonical K-mers | nb_NO |
dc.subject | Pandas | nb_NO |
dc.title | Recognizing plasmid-reads by machine learning and K-mer statistics | nb_NO |
dc.type | Master thesis | nb_NO |
dc.description.version | submittedVersion | nb_NO |
dc.subject.nsi | VDP::Mathematics and natural science: 400::Basic biosciences: 470::Bioinformatics: 475 | nb_NO |
dc.source.pagenumber | 52 | nb_NO |
dc.description.localcode | M-KB | nb_NO |
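
The abstract refers to counting both canonical and non-canonical K-mers. As an illustration only, the following is a minimal Python sketch of what such counting might look like. It is not taken from the thesis or its repository; the function names `reverse_complement` and `count_kmers`, and the choice to skip K-mers containing non-ACGT characters, are assumptions made for this example. A canonical K-mer is taken here to be the lexicographically smaller of a K-mer and its reverse complement, which is a common convention.

```python
from collections import Counter

# Translation table for complementing DNA bases (hypothetical helper).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(seq: str, k: int, canonical: bool = False) -> Counter:
    """Count overlapping K-mers in seq.

    With canonical=True, each K-mer is merged with its reverse
    complement and counted under the lexicographically smaller one.
    K-mers containing characters other than A, C, G, T are skipped
    (an assumption made for this sketch).
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # skip ambiguous bases such as N
        if canonical:
            kmer = min(kmer, reverse_complement(kmer))
        counts[kmer] += 1
    return counts


if __name__ == "__main__":
    read = "ACGTT"
    print(count_kmers(read, 2))                  # strand-specific counts
    print(count_kmers(read, 2, canonical=True))  # strand-independent counts
```

Canonical counting makes the feature vector independent of which strand a read came from, at the cost of roughly halving the number of distinct features; the resulting counts could then be normalised to frequencies before being passed to a classifier.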