A Supervised Multiclass Classifier for an Autocoding System

Yukako Toko (ytoko@nstac.go.jp)
National Statistics Center, Research and Development Division, Japan
Kazumi Wada (kwada@nstac.go.jp)
National Statistics Center, Research and Development Division, Japan
Mariko Kawano (mkawano@nstac.go.jp)
National Statistics Center, Research and Development Division, Japan

Abstract

Classification is required in many contexts, including official statistics. In a previous study, we developed a multiclass classifier that classifies short text descriptions with high accuracy. The algorithm borrows the concept of the naïve Bayes classifier and is simple enough that its structure is easy to understand. The proposed classifier has two advantages. First, the processing times for both learning and classifying are short enough to be practical. Second, the classifier yields high-accuracy results for a large portion of a dataset. We previously developed an autocoding system for the Family Income and Expenditure Survey in Japan that uses a better-performing classifier. The original system was developed in Perl to improve the efficiency of coding short Japanese texts. The proposed system is implemented in the R programming language to explore its versatility, in consideration of the increasing number of R users in official statistics, and is modified so that it can easily be applied to English text descriptions. We plan to publish the proposed classifier as an R package. The proposed classifier should be generally applicable to other classification tasks, including coding activities in official statistics, and could contribute greatly to improving their efficiency.
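
To illustrate the general idea the abstract refers to, the following is a minimal sketch of a textbook multinomial naïve Bayes classifier for short text descriptions. It is not the authors' algorithm or their R implementation: the Python language, the function names (tokenize, train, classify), the whitespace tokenizer, the add-one smoothing, and the example category labels are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for short English text.
    return text.lower().split()


def train(examples):
    """Count class and word frequencies from (text, code) pairs."""
    class_counts = Counter()            # number of training texts per code
    word_counts = defaultdict(Counter)  # word frequencies per code
    vocab = set()                       # all words seen in training
    for text, code in examples:
        class_counts[code] += 1
        for word in tokenize(text):
            word_counts[code][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the code with the highest smoothed log-posterior score."""
    class_counts, word_counts, vocab = model
    total_texts = sum(class_counts.values())
    vocab_size = len(vocab)
    best_code, best_score = None, -math.inf
    for code in sorted(class_counts):
        # Log prior of the code.
        score = math.log(class_counts[code] / total_texts)
        words_in_code = sum(word_counts[code].values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
            count = word_counts[code][word]
            score += math.log((count + 1) / (words_in_code + vocab_size))
        if score > best_score:
            best_code, best_score = code, score
    return best_code


# Hypothetical training data; the labels are illustrative, not real survey codes.
examples = [
    ("fresh milk", "dairy"),
    ("cheddar cheese", "dairy"),
    ("white bread", "bakery"),
    ("rye bread loaf", "bakery"),
]
model = train(examples)
print(classify(model, "milk cheese"))
print(classify(model, "bread"))
```

Training here is a single counting pass over the labeled descriptions, and classification is one score per candidate code, which is one reason classifiers of this family tend to be fast to train and apply.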

Keywords: Coding, Text classification, Naïve Bayes, Machine learning
JEL Classification: C38

Romanian Statistical Review 4/2017