Prokaryote classification : method development and novel insight in 16S ribosomal RNA-based classification

Vinje, Hilde

dc.contributor.advisor	Snipen, Lars
dc.contributor.advisor	Liland, Kristian Hovde
dc.contributor.advisor	Almøy, Trygve
dc.contributor.author	Vinje, Hilde
dc.date.accessioned	2018-05-07T11:47:47Z
dc.date.available	2018-05-07T11:47:47Z
dc.date.issued	2016
dc.identifier.isbn	978-82-575-1409-9
dc.identifier.issn	1894-6402
dc.identifier.uri	http://hdl.handle.net/11250/2497351
dc.description.abstract	The main objective of this thesis is to improve prokaryotic classification based on 16S ribosomal RNA. The shift in sequencing technology, which generates enormous amounts of sequencing data, and the rise of cultivation-independent methods have revealed the need for reliable, fast and memory-efficient methods. The 16S rRNA is used for building the existing taxonomy of prokaryotes and mapping it into the phylogenetic tree of life, as well as for exploring microbial communities, which has become a major focus in microbiology. It is a common belief that the discriminant power of the 16S marker lies within nine variable regions located along the gene. We began our work by challenging this assumption, searching systematically for discriminating sites that contribute to correct classification. Fifty discriminating sites were found when classifying down to phylum level, and for genus identification over 80% of all sites were important; the discriminating sites were scattered throughout the gene. We further present a systematic comparison of five K-mer based classification methods for the 16S rRNA gene. Classification methods based on counting K-mers are popular because they are fast, consider the whole sequence and do not suffer from the uncertainties of evolutionary models and alignments. The five methods differ in both data usage and modelling strategy. Preprocessed nearest-neighbour (PLSNN) performed best on full-length sequences, but overall, for both full and fragmented sequences, the multinomial method outperformed the others. It was significantly better than the RDP classifier, which today serves as a gold-standard classification method. There is no official taxonomy of prokaryotes, and any classification method will suffer from the lack of consensus in training data. 
The ConTax database, presented in this thesis, is a seed set of the most accurately classified sequences, from which we can continue to explore the prokaryotic taxonomy and train new classification methods. A major feature of the new dataset is that a sequence is included only if three primary 16S databases agree on its assigned taxonomy down to genus level. The results are combined and presented in an R package, microclass, which provides classification tools down to genus level. Efforts have been made to make the tools both fast and memory-efficient. All methods can be trained on new data, but a ready-to-use tool, the taxMachine, is also presented. The taxMachine has been trained with the multinomial method on full-length 16S sequences to recognize full or fragmented sequences, using the designed and optimized trimmed ConTax dataset for training.	nb_NO
dc.description.abstract	The main goal of the thesis is to improve the classification of prokaryotes based on 16S ribosomal RNA. As a result of the shift in sequencing technology, which now generates enormous amounts of sequence data, and the emergence of cultivation-independent methods, a need for stable, fast and memory-efficient methods has been revealed. 16S rRNA has been used to build the existing taxonomy of prokaryotes and to map them into the phylogenetic tree of life, as well as to explore microbial communities, which has become a main focus in microbiology. It is a common view that the discriminating ability of the 16S marker lies within nine variable regions located along the gene. We began our work by challenging this assumption, searching systematically for positions with discriminating ability that contributed to correct classification. Fifty discriminating positions were found when classifying down to phylum level, and for genus identification over 80% of all positions were important; they were all spread across the whole gene. Further, we present a systematic comparison of five K-mer based classification methods for the 16S rRNA gene. Classification methods based on counting K-mers are popular because they are fast, consider the whole sequence and do not suffer from the uncertainties that evolutionary models and alignments do. The five methods differ in both data usage and modelling strategy. The preprocessed nearest-neighbour method (PLSNN) did best for full-length sequences, but in general, for both full-length and fragmented sequences, the multinomial method did better than the others. It was significantly better than the RDP classifier, which today serves as a gold standard among classification methods. There is no official taxonomy of prokaryotes, and any classification method will suffer from the lack of consensus in training data. 
The ConTax datasets, presented in this thesis, are a collection of the most accurately classified sequences, from which we can continue to explore the prokaryotic taxonomy and with which we can train new methods. The most important property of the dataset is that a sequence is included only if three main 16S databases agree on the assigned taxonomy down to genus level. The results are combined and presented in an R package, microclass, which contains classification tools down to genus level. An effort has been made to make the tools both fast and memory-efficient. All methods can be trained with new data, but a ready-to-use method, taxMachine, is also presented. taxMachine has been trained with the multinomial method on full-length 16S sequences to recognize full or fragmented sequences, using the constructed and optimized trimmed ConTax dataset for training.	nb_NO
dc.language.iso	eng	nb_NO
dc.publisher	Norwegian University of Life Sciences, Ås	nb_NO
dc.relation.ispartofseries	PhD Thesis;2016:95
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/deed.no	*
dc.title	Prokaryote classification : method development and novel insight in 16S ribosomal RNA-based classification	nb_NO
dc.title.alternative	Prokaryote klassifisering : metodeutvikling og ny innsikt i 16S ribosomal RNA-basert klassifisering	nb_NO
dc.type	Doctoral thesis	nb_NO
dc.source.pagenumber	1 vol. (various pagings)	nb_NO

Files in this item

Name: 2016-95_Hilde Vinje_(IKBM).pdf
Size: 9.615Mb
Format: PDF

This item appears in the following Collection(s)

Doctoral theses (KBM) [121]

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International