Classification of consumer goods into 5-digit COICOP 2018 codes

Müller, Daniel Milliam

dc.contributor.advisor	Toth, Boriska
dc.contributor.advisor	Frøslie, Kathrine Frey
dc.contributor.author	Müller, Daniel Milliam
dc.date.accessioned	2022-02-25T14:06:31Z
dc.date.available	2022-02-25T14:06:31Z
dc.date.issued	2021
dc.identifier.uri	https://hdl.handle.net/11250/2981525
dc.description.abstract	The survey of consumer expenditure is a national survey conducted by Statistics Norway (SSB) to collect detailed data on Norwegian households' annual consumption of goods and services. Up to its most recent round in 2012, the survey relied on SSB employees to manually categorise all registered expenditures into COICOP (Classification of Individual Consumption by Purpose) item codes to produce consumption statistics. This has involved large workloads and high implementation costs, so SSB wants to modernise the survey and make it more efficient for its next planned round in 2022. This study is the result of a 3-month collaboration with SSB to explore the application of supervised machine learning to the classification of consumer goods into 5-digit COICOP codes, with the aim of assessing how far machine learning can automate parts of the survey. The thesis demonstrates how data sets from separate sources can be combined into a COICOP training data set that can be used to develop and evaluate COICOP classification models. It also explores how these models can be incorporated into a “human-in-the-loop”-based classification system that classifies consumer goods automatically while maintaining sufficient data quality. The findings indicate that supervised machine learning is a suitable method for classifying consumer goods into 5-digit COICOP codes. The results also show that the models' prediction probabilities are good indicators of where misclassifications occur. Together, these findings point to promising potential for a “human-in-the-loop”-based classification system for reliable classification of consumer goods.
At the same time, the findings uncover important limitations of the data used in this thesis, as the models were trained on data that the survey of consumer expenditure will not be based on. The thesis used the data sets that were available, and these were not necessarily the most relevant. The developed models are therefore not expected to provide immediate value for SSB's objectives without first being trained on more relevant data.	en_US
dc.description.abstract	The survey of consumer expenditure (Forbruksundersøkelsen) is a national survey conducted by Statistics Norway (Statistisk sentralbyrå, SSB) to collect detailed consumption statistics on Norwegian households. Up to its most recent round in 2012, SSB employees had to manually code all registered purchases into COICOP (Classification of Individual Consumption by Purpose) item codes to produce consumption statistics from the survey. This has entailed large workloads and high costs, and SSB therefore now wants to modernise the survey and make it more efficient ahead of its next planned round in 2022. This thesis is the result of a 3-month collaboration with SSB to explore the use of supervised machine learning to classify consumer goods into 5-digit COICOP groups. The aim has been to map the efficiency gains that could be achieved by using machine learning to automate parts of the survey. The thesis demonstrates how data sets from different sources can be combined into a COICOP training data set that can be used to develop and evaluate COICOP classification models. It further explores how these models can be used in combination with a “human-in-the-loop”-based classification system to enable automatic classification of goods while maintaining adequate data quality. The findings suggest that supervised machine learning is a suitable method for classifying goods into 5-digit COICOP item codes, and the results also show that the models' prediction probabilities give a good indication of where errors occur. This provides a sound basis for using a “human-in-the-loop”-based classification system for reliable classification of consumer goods. At the same time, the findings reveal key limitations of the data used in this thesis, as the models were trained on data that the survey of consumer expenditure will not be based on.
The reason is that the thesis used the data that were available, and these were not necessarily the most relevant. The developed models therefore cannot be expected to provide immediate value for SSB's purposes without first being trained on more relevant data.	en_US
dc.language.iso	eng	en_US
dc.publisher	Norwegian University of Life Sciences, Ås	en_US
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/deed.no	*
dc.title	Classification of consumer goods into 5-digit COICOP 2018 codes	en_US
dc.type	Master thesis	en_US
dc.description.localcode	M-IØ	en_US

Associated file(s)

Filename: muller2021.pdf
Size: 2.897 MB
Format: PDF

This item appears in the following collection(s)

Master's theses (KBM) [890]

Unless otherwise stated, this item is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International