Yukako Toko (ytoko@nstac.go.jp)
National Statistics Center, Tokyo, Japan
Mika Sato-Ilic (mika@risk.tsukuba.ac.jp)
University of Tsukuba, Japan
Takayuki Sasajima (tsasajima@nstac.go.jp)
National Statistics Center, Tokyo, Japan
Abstract
This paper proposes a new method for autocoding that includes a filtering task for constructing the training data of a machine-learning language model. Misspecified training data lead to biased outputs and undesirable model behavior. Improving the training data for supervised Natural Language Processing (NLP) tasks is therefore essential for obtaining more accurate results and avoiding harmful outputs. This paper improves the training data in our proposed supervised machine-learning method for autocoding based on reliability scores over text descriptions. In the improvement task, which is a filtering task, we exploit data classified with high reliability scores, based on the idea that such data are clearer; the information from those data is added to the training dataset to obtain better classification accuracy. Numerical examples for the coding tasks of the National Survey of Family Income, Consumption and Wealth and the Family Income and Expenditure Survey show that the proposed method performs better.
Keywords: Coding, Reliability scores, Fuzzy logic, Text classification
JEL Classification: C38