Comparative study of NER using Bi-LSTM-CRF with different word vectorisation techniques on DNB documents

Joseph, Meera

dc.contributor.advisor	Tomic, Oliver
dc.contributor.advisor	Liland, Kristian Hovde
dc.contributor.author	Joseph, Meera
dc.date.accessioned	2021-10-05T11:46:10Z
dc.date.available	2021-10-05T11:46:10Z
dc.date.issued	2021
dc.identifier.uri	https://hdl.handle.net/11250/2787714
dc.description.abstract	The presence of huge volumes of unstructured data in the form of PDF documents poses a challenge for organisations trying to extract valuable information from them. In this thesis, we address this problem according to the requirements of DNB by building an automatic information extraction system that retrieves only the key information the company is interested in from the PDF documents. To this end, we compare the performance of named entity recognition (NER) models for automatic text extraction, built using a Bi-directional Long Short-Term Memory (Bi-LSTM) network with a Conditional Random Field (CRF) in combination with three word vectorisation techniques. The techniques compared in this thesis are word embeddings randomly generated by the Keras embedding layer, pre-trained static word embeddings (focusing on 100-dimensional GloVe embeddings) and, finally, deep contextual ELMo word embeddings. Comparing these models helps us identify the advantages and disadvantages of the different word embeddings by analysing their effect on NER performance. The study was performed on a data set provided by DNB. It showed that the NER system built using Bi-LSTM-CRF with GloVe embeddings gave the best results, with a micro F1 score of 0.868 and a macro F1 score of 0.872 on unseen data, compared with the Bi-LSTM-CRF-based NER systems using the Keras embedding layer and ELMo embeddings, which gave micro F1 scores of 0.858 and 0.796 and macro F1 scores of 0.848 and 0.776, respectively. This result is contrary to our assumption that NER using deep contextualised word embeddings performs better than NER using other word embeddings. We hypothesised that this unexpected performance is due to high dimensionality, and we investigated this by using a lower-dimensional word embedding.
We found that using 50-dimensional GloVe embeddings instead of 100-dimensional GloVe embeddings improved the overall micro and macro F1 scores from 0.87 to 0.88. Additionally, optimising the best model, the Bi-LSTM-CRF using 100-dimensional GloVe embeddings, by tuning over a small hyperparameter search space did not improve on its micro F1 score of 0.87 and macro F1 score of 0.87.	en_US
dc.language.iso	eng	en_US
dc.publisher	Norwegian University of Life Sciences, Ås	en_US
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/deed.no	*
dc.subject	NLP	en_US
dc.subject	NER	en_US
dc.subject	Deep learning	en_US
dc.title	Comparative study of NER using Bi-LSTM-CRF with different word vectorisation techniques on DNB documents	en_US
dc.type	Master thesis	en_US
dc.subject.nsi	VDP::Mathematics and natural science: 400	en_US
dc.description.localcode	M30-DV Master's Thesis	en_US
dc.description.localcode	M-DV	en_US

Associated file(s)

File name: Master_thesis_meera.pdf
Size: 3.490Mb
Format: PDF
Description: Master thesis

This item appears in the following collection(s)

Master's theses (RealTek) [1724]

Unless otherwise stated, this item is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International