dc.contributor.advisor: Tomic, Oliver
dc.contributor.advisor: Liland, Kristian Hovde
dc.contributor.author: Hetland, Petter Kolstad
dc.contributor.author: Sigerstad, Tomas
dc.date.accessioned: 2021-11-01T14:28:09Z
dc.date.available: 2021-11-01T14:28:09Z
dc.date.issued: 2021
dc.identifier.uri: https://hdl.handle.net/11250/2826999
dc.description.abstract: This thesis aims to assist Arkivverket, The National Archival Services of Norway, in automating the redaction of national identity numbers (NIDs) in historical documents. As historical documents are released to the public on request, it is necessary to prevent the unintended disclosure of personal data. Today this is handled by manual redaction of NIDs performed by employees at Arkivverket. Implementing a workflow where a machine learning model suggests possible NIDs to the employee for redaction may save time and increase the overall number of NIDs identified. Arkivverket has developed a machine learning prototype for automatic document redaction using Optical Character Recognition and other tools. However, the current solution is not sufficiently accurate to be put into production in a suggestion workflow, as approximately 11% of the identity numbers are left unredacted (based on the recall score). With a recall score of 89.0%, a precision score of 88.3%, and an F1 score of 88.6%, this prototype is used as a baseline for the performance of the machine learning models developed and trained in this thesis. The thesis had two main goals. The first was to test whether object detection is a viable choice for automatically identifying NIDs. The documents contain many similar words and numbers, and many comprise a combination of handwritten and machine-written text. The second goal, assuming that object detection is indeed a viable choice, was to check whether our detection models can reach a performance level that meets the demands of a suggestion workflow where each document is checked for NIDs by the model before being quality-assured by an employee and submitted. This would save time for the employees while preventing the unnecessary release of NIDs due to human error. In the long term, fully automated document redaction is the goal.
Results show that using object detection models based on the Detectron2 framework is a highly viable approach for this problem, perhaps in large part due to the models' ability to recognize difficult, handwritten national identity numbers. The fine-tuned models are capable of reaching scores beyond those of the current prototype developed at Arkivverket. The most accurate model achieved a recall score of 97.9%, a precision score of 94.9%, and an F1 score of 96.4%. Based on our estimates, this model correctly identified more NIDs in the dataset than its human counterparts at Arkivverket. A proposal for a deployment architecture is presented to illustrate the potential for combining our model and the existing redaction software to have a lasting economic and ethical impact on the daily practices of Arkivverket. It is estimated that Arkivverket can initially save 65,381 NOK yearly after maintenance costs by implementing the proposed algorithm. With time and further research, however, the process of redacting national identity numbers may become fully autonomous and the savings potential greater. [en_US]
dc.language.iso: eng [en_US]
dc.publisher: Norwegian University of Life Sciences, Ås [en_US]
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International [*]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/deed.no [*]
dc.subject: Machine learning [en_US]
dc.title: Automated redaction of historical documents using machine learning [en_US]
dc.type: Master thesis [en_US]
dc.description.localcode: M-ØA [en_US]
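
The F1 scores quoted in the abstract can be checked against the reported precision and recall. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the function name `f1_score` is illustrative only, and this is not the thesis authors' evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Scores quoted in the abstract, converted from percentages to fractions.
baseline_f1 = f1_score(precision=0.883, recall=0.890)  # Arkivverket prototype
best_f1 = f1_score(precision=0.949, recall=0.979)      # most accurate model

print(f"baseline F1: {baseline_f1:.1%}")        # baseline F1: 88.6%
print(f"best model F1: {best_f1:.1%}")          # best model F1: 96.4%
print(f"baseline miss rate: {1 - 0.890:.1%}")   # baseline miss rate: 11.0%
```

The last line shows where the "approximately 11%" of unredacted identity numbers comes from: it is one minus the baseline recall.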


Unless otherwise stated, this item is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International