Natural Language Processing and Topic Modeling for Exploring the Vegetarian and Vegan Trends

Olavsrud, Marius Aleksander

Master thesis

Submitted version

Permanent link

https://hdl.handle.net/11250/2721646

Publication date

2020

Collections

Master's theses (RealTek) [1722]

Abstract

The purpose of this thesis is to examine how topic modeling can be used as a tool for exploring large sets of text data. The thesis was written on assignment from Nofima Food Research Institute. A set of about 52 000 texts of various lengths and unknown content was downloaded through an external web-harvesting company (Webhose.io). The texts were collected with a specific search query consisting of food-related vegetarian and vegan keywords, as this is a field of interest to Nofima. Latent Dirichlet Allocation (LDA) is used to create and model the topics. LDA is a method that explains a collection of texts through unobserved groups of similar data, where each group is described by a set of words known as a topic.
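To make the idea of a topic as a weighted set of words concrete, the following is a minimal sketch of LDA fitted with collapsed Gibbs sampling. It is an illustration only: the abstract does not say which inference method or library the thesis used, and the function name `fit_lda`, the hyperparameter values and the toy documents are all assumptions made for this example.

```python
import random


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, not the thesis code).

    docs is a list of token lists. Returns (topic_word, doc_topic, vocab),
    where topic_word[k][v] estimates p(word v | topic k) and
    doc_topic[d][k] estimates p(topic k | document d).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [[0] * vocab_size for _ in range(num_topics)]
    topic_totals = [0] * num_topics
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][word_id[w]] += 1
            topic_totals[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][v] -= 1
                topic_totals[k] -= 1
                # Unnormalised probability of each topic for this token,
                # conditioned on all other assignments.
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][v] + beta)
                    / (topic_totals[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][v] += 1
                topic_totals[k] += 1

    # Smoothed estimates of the word distribution of each topic ...
    topic_word = [
        [
            (topic_word_counts[k][v] + beta) / (topic_totals[k] + vocab_size * beta)
            for v in range(vocab_size)
        ]
        for k in range(num_topics)
    ]
    # ... and of the topic distribution of each document.
    doc_topic = [
        [
            (doc_topic_counts[d][k] + alpha) / (len(doc) + num_topics * alpha)
            for k in range(num_topics)
        ]
        for d, doc in enumerate(docs)
    ]
    return topic_word, doc_topic, vocab


# Hypothetical toy corpus, already tokenised.
toy_docs = [
    ["vegan", "tofu", "recipe", "vegan"],
    ["tofu", "burger", "vegan", "recipe"],
    ["football", "match", "score", "goal"],
    ["match", "goal", "score", "football"],
]
topic_word, doc_topic, vocab = fit_lda(toy_docs, num_topics=2)

# Print the three most probable words of each topic.
for k, row in enumerate(topic_word):
    ranked = sorted(range(len(vocab)), key=lambda v: row[v], reverse=True)
    print(k, [vocab[v] for v in ranked[:3]])
```

Each row of `topic_word` is a probability distribution over the vocabulary, which is what makes it possible to speak of a word having a high probability within a topic.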

The collected texts are split into smaller subsets based on type and length before being preprocessed to remove non-relevant information. A subset of medium-length texts is used for the modeling. The data is then analysed with LDA, using the coherence score as a metric to determine the optimal number of topics. The results are visualised using pyLDAvis. Lastly, a small subset of the same texts is read manually by a group of employees at Nofima to validate the quality of the results and to gain a better understanding of the type of data being analysed.
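Choosing the number of topics by coherence can be sketched as follows. The abstract does not state which coherence measure was used, so this example assumes the UMass variant, averaged over word pairs. The helper names `umass_coherence` and `best_num_topics`, the toy documents and the candidate topics are hypothetical.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """Average UMass coherence of one topic.

    top_words is the topic's word list ordered from most to least probable;
    docs is a list of token lists. Higher (closer to zero) is more coherent.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for j, i in combinations(range(len(top_words)), 2):  # j < i
        w_j, w_i = top_words[j], top_words[i]
        d_j = doc_freq(w_j)
        if d_j == 0:
            continue  # skip words that never occur in the reference corpus
        scores.append(math.log((doc_freq(w_i, w_j) + 1) / d_j))
    return sum(scores) / len(scores) if scores else 0.0


def best_num_topics(candidates, docs):
    """Return the topic count whose topics have the highest mean coherence.

    candidates maps a number of topics to the top-word lists of a model
    fitted with that many topics.
    """
    def mean_coherence(topics):
        return sum(umass_coherence(t, docs) for t in topics) / len(topics)

    return max(candidates, key=lambda k: mean_coherence(candidates[k]))


# Hypothetical tokenised documents and top words from two fitted models.
docs = [
    ["tofu", "vegan", "recipe"],
    ["vegan", "tofu", "burger"],
    ["football", "match", "score"],
    ["match", "score", "goal"],
]
candidates = {
    2: [["vegan", "tofu"], ["match", "score"]],
    3: [["vegan", "match"], ["tofu", "score"], ["recipe", "goal"]],
}
print(best_num_topics(candidates, docs))
```

In practice the top words would come from LDA models fitted with different topic counts, and a library implementation of coherence would likely replace the hand-written function.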

The study found that topic modeling can be used to explore a large set of data and to gain meaningful insight into parts of its content. Several topics were found to include vegetarian and vegan related words, and some of these words had a high probability within the topic in question.

The process revealed numerous concerns that needed to be addressed. Examples include the large number of unrelated documents, the large number of words unrelated to a given topic, the choice of the optimal number of topics, and the visualisation of the topics.

Unless otherwise stated, this item is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International