Autocoding based Multi-Class Support Vector Machine by Fuzzy c-Means

Yukako Toko (ytoko@nstac.go.jp)
National Statistics Center Japan
Mika Sato-Ilic (mika@risk.tsukuba.ac.jp)
University of Tsukuba, Japan

Abstract

This paper proposes a new autocoding method for the coding task of the Family Income and Expenditure Survey. The survey data include text descriptions extracted from digital receipts, which have become larger and more complex in recent years. The method aims to obtain stable discrimination (coding) results with high generalization performance while dealing with the cognitive uncertainty of text description data. It combines a multi-class Support Vector Machine (SVM) by fuzzy c-means with the previously developed classification method based on reliability scores. It thus draws on both SVM, a machine learning method known for its high generalization performance, and fuzzy c-means, a computational intelligence method known for handling cognitive uncertainty well. A numerical example using Family Income and Expenditure Survey data shows that the proposed method performs better than the previously proposed classification method. The proposed method is implemented in Python using Python libraries, and it can also be easily run in R, a popular language in the field of official statistics.
Keywords: Coding, Machine Learning, Word2Vec, Support Vector Machine, Fuzzy c-Means, Reliability Score
JEL Classification: C38


Romanian Statistical Review 1/2022