Beyond extractive : advancing abstractive automatic text summarization in Norwegian with transformers

Navjord, Jørgen Johnsen; Korsvik, Jon-Mikkel Ryen

dc.contributor.advisor	Liland, Kristian Hovde
dc.contributor.advisor	Aeinehchi, Nader
dc.contributor.author	Navjord, Jørgen Johnsen
dc.contributor.author	Korsvik, Jon-Mikkel Ryen
dc.date.accessioned	2023-07-18T16:27:26Z
dc.date.available	2023-07-18T16:27:26Z
dc.date.issued	2023
dc.identifier	no.nmbu:wiseflow:6839553:54763253
dc.identifier.uri	https://hdl.handle.net/11250/3079868
dc.description.abstract	Automatic summarization is a key area of natural language processing (NLP) and machine learning that aims to generate informative summaries of articles and documents. Although the field has evolved since the 1950s, research on automatically summarizing Norwegian text has remained relatively underdeveloped. Some progress has been made on extractive systems, which generate summaries by selecting and condensing key phrases directly from the source material, but abstractive summarization remains unexplored for the Norwegian language. Abstractive summarization is distinct in that it generates summaries containing new words and phrases not present in the original text. This Master's thesis revolves around one key question: Is it possible to create a machine learning system capable of performing abstractive summarization in Norwegian? To answer this question, we create and release the first two datasets for creating and evaluating Norwegian summarization models. One of these datasets is a web scrape of Store Norske Leksikon (SNL), and the other is a machine-translated version of CNN/Daily Mail. Using these datasets, we fine-tune two Norwegian T5 language models, with 580M and 1.2B parameters, to create summaries. To assess the quality of the models, we use both automatic ROUGE scores and human evaluations of the generated summaries. To better understand the models' behaviour, we measure how a model generates summaries using various metrics, including our own novel contribution, "Match Ratio", which measures sentence similarity between summaries and articles based on Levenshtein distance. The top-performing models achieved ROUGE-1 scores of 35.07 and 34.02 on SNL and CNN/Daily Mail, respectively. In the human evaluation, the best model received an average score of 3.96/5.00 for SNL and 4.64/5.00 for CNN/Daily Mail across various criteria.
Based on these results, we conclude that it is possible to produce high-quality abstractive summaries of Norwegian text. With this research, we have laid a foundation that we hope will facilitate future research, enabling others to build upon our findings and contribute further to the development of Norwegian summarization models.
dc.language	eng
dc.publisher	Norwegian University of Life Sciences, Ås
dc.title	Beyond extractive : advancing abstractive automatic text summarization in Norwegian with transformers
dc.type	Master thesis
dc.description.localcode	M-TDV

Associated file(s)

Filename: no.nmbu:wiseflow:6839553:54763 ...
Size: 2.991 MB
Format: PDF

This item appears in the following collection(s)

Master's theses (RealTek) [1724]
