dc.contributor.advisor: Stefan Schrunner
dc.contributor.advisor: Pål Halvorsen
dc.contributor.advisor: Steven Hicks
dc.contributor.author: Helland, Eirik Duesund
dc.date.accessioned: 2023-07-26T16:27:15Z
dc.date.available: 2023-07-26T16:27:15Z
dc.date.issued: 2023
dc.identifier: no.nmbu:wiseflow:6839521:54591693
dc.identifier.uri: https://hdl.handle.net/11250/3081492
dc.description.abstract: In lower-resource language settings, domain-specific tasks such as paragraph classification of football articles present significant challenges. Traditional machine learning models struggle to capture the linguistic complexity of such paragraphs, emphasizing the need for more advanced approaches. This thesis investigates the potential of Norwegian pre-trained BERT (Bidirectional Encoder Representations from Transformers) models for paragraph classification in Norwegian football articles, a domain requiring a nuanced understanding of the Norwegian language. BERT is a Transformer-based architecture for language processing tasks that learns word representations from the context on both sides of each word in a sentence. Specifically, this thesis compares the performance of Transformer-based BERT models with traditional machine learning models in multi-class and multi-label classification tasks. An existing dataset of about 5,500 football article paragraphs is used to evaluate multi-class classification, and a newly annotated multi-label dataset of just over 2,000 samples is introduced for the multi-label assessment. The results reveal promising performance for the Norwegian pre-trained BERT models in both tasks: an accuracy of ∼0.88 and a weighted-average F1-score of ∼0.87 in the multi-class task, and an accuracy of ∼0.40 and a weighted-average F1-score of ∼0.58 in the multi-label task, significantly outperforming the traditional machine learning models. This study highlights the effectiveness of Transformer-based models in lower-resource language settings and emphasizes the need for continued research and development in Natural Language Processing for underrepresented languages.
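
The multi-label setup the abstract describes can be sketched briefly. The snippet below is a minimal illustration rather than the thesis's actual pipeline: it assumes the Hugging Face transformers and scikit-learn libraries, the publicly available Norwegian checkpoint NbAiLab/nb-bert-base (the record does not name the exact models used), a placeholder five-label scheme, and one made-up paragraph. It also illustrates one plausible reading of the gap between the multi-label accuracy (∼0.40) and the weighted-average F1-score (∼0.58): if the reported accuracy is subset accuracy, every label on a paragraph must match for a sample to count as correct.

# Minimal sketch: multi-label paragraph classification with a Norwegian BERT.
# Assumptions (not from the record): the Hugging Face "transformers" library,
# the public checkpoint "NbAiLab/nb-bert-base", a placeholder 5-label scheme,
# and a single made-up training paragraph.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import accuracy_score, f1_score

MODEL_NAME = "NbAiLab/nb-bert-base"  # assumed checkpoint; any Norwegian BERT fits
NUM_LABELS = 5                       # placeholder label count

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # selects BCE-with-logits loss
)

# One made-up paragraph ("The home side took the lead after a corner in the
# first half.") with multi-hot targets: a paragraph may carry several labels.
texts = ["Hjemmelaget tok ledelsen etter en corner i første omgang."]
labels = torch.tensor([[1.0, 0.0, 1.0, 0.0, 0.0]])

batch = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss  # fine-tuning would minimize this
loss.backward()

# Inference: independent sigmoid per label, thresholded at 0.5.
with torch.no_grad():
    preds = (torch.sigmoid(model(**batch).logits) > 0.5).int().numpy()

y_true = labels.int().numpy()
# accuracy_score on multi-hot matrices is subset accuracy: every label on a
# paragraph must match, so it is typically much lower than weighted F1.
print("subset accuracy:", accuracy_score(y_true, preds))
print("weighted F1:", f1_score(y_true, preds, average="weighted", zero_division=0))

For the multi-class task, the same model class would be loaded without problem_type (the default single-label objective applies a cross-entropy loss) and predictions would be an argmax over the logits.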
dc.language: eng
dc.publisher: Norwegian University of Life Sciences
dc.title: Tackling Lower-Resource Language Challenges: A Comparative Study of Norwegian Pre-Trained BERT Models and Traditional Approaches for Football Article Paragraph Classification
dc.type: Master thesis

