Marcello D’Orazio (madorazi@istat.it)
Italian National Institute of Statistics – Istat
Abstract
Outlier detection is part of the data editing phase for numerical variables. This work investigates outlier detection in the R environment by comparing “traditional” methods, popular in official statistics, with techniques developed in the fields of data mining and statistical learning. The comparison considers longitudinal data in which a set of non-negative quantitative variables is observed two or more times on the same sample of units. The work attempts to identify “recent” outlier detection methods, already available in the R environment, that seem suitable for application in official statistics. The study builds on the findings of a previous work on outlier detection in the univariate case, which showed that some “recent” approaches perform well; in this article we go a step further and also investigate the behavior of “traditional” and “recent” methods in the multivariate case. The preliminary results offer useful guidance for applying the chosen methods in the production of official statistics using the R facilities.
Keywords: binary recursive partitioning, clustering, nearest neighbor distance, panel data.
JEL classification: C – Mathematical and Quantitative Methods; C14 Semiparametric and Nonparametric Methods: General; C33 Panel Data Models; C83 Survey Methods.