Yukako Toko (ytoko@nstac.go.jp)
National Statistics Center, Japan
Shinya Iijima (siijima@nstac.go.jp)
National Statistics Center, Japan
Mika Sato-Ilic (msato@nstac.go.jp)
National Statistics Center, Japan / University of Tsukuba
Abstract
Coding, the assignment of objects (or features) to given classification codes, is frequently required in official statistics. This paper proposes a supervised overlapping multiclass classifier for autocoding. The classifier is implemented in R.
The purpose of this study is to apply this classifier efficiently to the coding task of the Family Income and Expenditure Survey in Japan. We previously developed a non-overlapping multiclass classifier that assigns each object to a single, “exclusive” class. Although that classifier achieves high accuracy on the autocoding task, some objects with ambiguous input information are still assigned incorrect codes, which shows that exclusive classification is limited in dealing with uncertainty. To solve this problem, we propose a new classifier that outputs multiple candidate codes in descending order of reliability and assists experts in selecting the correct code from the listed candidates. We refer to this proposed classifier as the overlapping multiclass classifier. It employs a new reliability score based on entropy weights. With this score, the proposed classifier improves cumulative accuracy and practicability while retaining the structural simplicity of the algorithm and a practical calculation time. The proposed algorithm is implemented in R to improve its versatility.
Keywords: Coding, Machine learning, Overlapping classification
JEL Classification: C38
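The idea summarized in the abstract, ranking candidate codes by an entropy-weighted reliability score, can be illustrated with a minimal sketch. The abstract does not state the formula, so everything below is an assumption made for illustration: the relative frequency of each code given a feature is multiplied by one minus the normalized entropy of that feature's code distribution, and scores are combined across features by taking the maximum. The function names (`train`, `reliability`, `rank_candidates`) and the toy codes `C1` and `C2` are hypothetical, and the sketch is written in Python rather than reproducing the authors' R implementation.

```python
import math
from collections import Counter, defaultdict


def train(records):
    """Count how often each feature co-occurs with each code.

    records: iterable of (features, code) pairs, where features is a list of strings.
    Returns a dict mapping feature -> Counter of codes.
    """
    table = defaultdict(Counter)
    for features, code in records:
        for feature in features:
            table[feature][code] += 1
    return table


def reliability(counts, n_codes):
    """Entropy-weighted relative frequencies for one feature (assumed form).

    The weight is 1 minus the entropy of the feature's code distribution,
    taken with logarithm base n_codes so that it lies between 0 and 1.
    """
    total = sum(counts.values())
    probs = {code: n / total for code, n in counts.items()}
    if n_codes < 2:
        weight = 1.0  # a single possible code carries no uncertainty
    else:
        entropy = -sum(p * math.log(p, n_codes) for p in probs.values())
        weight = 1.0 - entropy
    return {code: p * weight for code, p in probs.items()}


def rank_candidates(table, features, n_codes, top_n=3):
    """Return up to top_n (code, score) pairs in descending order of score.

    Combining features by the maximum score is an assumption of this sketch.
    """
    best = {}
    for feature in features:
        if feature not in table:
            continue  # unseen features contribute no candidates
        for code, score in reliability(table[feature], n_codes).items():
            best[code] = max(best.get(code, 0.0), score)
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy training data with hypothetical codes C1 and C2.
records = [
    (["bread"], "C1"),
    (["bread"], "C1"),
    (["bread", "crumbs"], "C2"),
    (["crumbs"], "C2"),
]
table = train(records)
candidates = rank_candidates(table, ["bread", "crumbs"], n_codes=2)
```

In this toy example the ambiguous feature `bread` receives a small weight because it co-occurs with both codes, whereas `crumbs` points to a single code and keeps its full weight. Returning a ranked list rather than a single code is what allows an expert to choose among the candidates.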