Natural Language Processing and Topic Modeling for Exploring the Vegetarian and Vegan Trends

Olavsrud, Marius Aleksander

dc.contributor.advisor	Tomic, Oliver
dc.contributor.advisor	Liland, Kristian Hovde
dc.contributor.advisor	Berget, Ingunn
dc.contributor.author	Olavsrud, Marius Aleksander
dc.date.accessioned	2021-01-06T09:12:36Z
dc.date.available	2021-01-06T09:12:36Z
dc.date.issued	2020
dc.identifier.uri	https://hdl.handle.net/11250/2721646
dc.description.abstract	The purpose of this thesis is to examine how topic modeling can be used as a tool to explore large sets of text data. The thesis was written on assignment from Nofima Food Research Institute. A set of about 52 000 unknown texts of various lengths was downloaded through an external web-harvesting company (Webhose.io). The texts were collected with a specific search query consisting of food-related vegetarian and vegan keywords, as this is a field of interest for Nofima. Latent Dirichlet Allocation (LDA) is used to create and model the topics. LDA is a method that allows unobserved groups of similar data to be explained by a group of words known as a topic. The collected texts are split into smaller subsets based on type and length before being preprocessed to remove non-relevant information. A subset of medium-length texts is used for the modeling. The data is then analysed with LDA, using the coherence score as a metric to determine the optimal number of topics. The results are visualised using pyLDAvis. Lastly, a small subset of the same texts is read manually by a group of employees at Nofima to validate the quality of the results and to get a better understanding of the type of data that is analysed. The study found that topic modeling can be used to explore a large set of data and gain some meaningful insight into parts of the content. Several topics were found to include vegetarian- and vegan-related words, and some of these words were found to have a high probability within the topic in question. The process revealed numerous concerns that needed to be addressed, for example many non-related documents, large numbers of words not related to a given topic, the choice of the optimal number of topics, and the visualisation of the topics.	en_US
dc.language.iso	eng	en_US
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/deed.no	*
dc.subject	Topic modeling	en_US
dc.subject	LDA	en_US
dc.title	Natural Language Processing and Topic Modeling for Exploring the Vegetarian and Vegan Trends	en_US
dc.type	Master thesis	en_US
dc.description.version	submittedVersion	en_US
dc.source.pagenumber	90	en_US
dc.description.localcode	M-DV	en_US
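
The abstract mentions using a coherence score to decide on the number of topics. The record does not say which coherence measure or library the thesis used, so the following is only a minimal sketch under stated assumptions: it uses the UMass coherence variant, a hypothetical helper name (`umass_coherence`), and a made-up toy corpus. It illustrates the general idea that a topic whose top words co-occur in the same documents scores higher than a topic mixing unrelated words.

```python
import math

def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic (illustrative helper, not from the thesis).

    top_words: the topic's top words, ordered from most to least probable.
    documents: list of tokenised documents (lists of words).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # word never occurs in the corpus; skip the pair
            co_occurrences = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / denom)
    return score

# Hypothetical toy corpus, already tokenised (not data from the thesis).
docs = [
    ["vegan", "tofu", "recipe"],
    ["vegan", "tofu", "protein"],
    ["vegetarian", "recipe", "tofu"],
    ["football", "match", "goal"],
    ["football", "goal", "league"],
]

score_a = umass_coherence(["tofu", "vegan", "recipe"], docs)     # words that co-occur
score_b = umass_coherence(["tofu", "football", "recipe"], docs)  # mixed, unrelated words
print(score_a > score_b)
```

In practice one would fit LDA models for several candidate topic counts, average such a score over each model's topics, and compare the averages. Which coherence variant suits a given corpus is a judgment call.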

Associated file(s)

Filename: Olavsrud2020.pdf
Size: 3.574Mb
Format: PDF
Description: Master's thesis

This item appears in the following collection(s)

Master's theses (RealTek) [1723]

Unless otherwise stated, this item is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International