Yukako Toko (ytoko@nstac.go.jp)
National Statistics Center, Japan
Mika Sato-Ilic (msato@nstac.go.jp, mika@risk.tsukuba.ac.jp)
National Statistics Center, Japan / University of Tsukuba, Japan
Abstract
In recent years, the data handled in official statistics have become larger and more complex. This paper proposes a new autocoding method that uses a metric in a high-dimensional space to classify such data efficiently. The proposed method is a hybrid of a Support Vector Machine (SVM) applied to Word2Vec representations and a previously developed autocoding method based on reliability scores. Word2Vec builds on the idea of a neural probabilistic language model, in which words are embedded in a continuous space using distributed representations. SVM is a supervised machine learning algorithm for classification that uses a metric in a high-dimensional space, and it is known for its high discrimination ability and generalization performance. In this paper, Word2Vec is used to convert each word into a numerical vector, and SVM is used to classify those vectors. To improve both classification accuracy and generalization performance, we combine SVM classification, which is known to classify numerical vectors with high generalization performance, with the autocoding method based on reliability scores. Numerical examples show that the proposed hybrid method achieves higher classification accuracy than either method applied alone. The proposed method is implemented in R, using existing R packages for efficient development.
Keywords: Coding, Machine learning, Word2Vec, Support Vector Machine, Reliability score
JEL Classification: C38