A Comparative Study of Missing-Value Imputation Using Machine Learning
Date
2022
Abstract
This study compares several machine-learning methods for imputing missing values. Imputation is an important step in data analysis: deleting incomplete records arbitrarily or substituting simple values can introduce substantial bias into subsequent statistical analysis, so choosing well among the available imputation methods is crucial. We consider three recently popular machine-learning imputation methods: K-Nearest Neighbor (KNN), Multivariate Imputation by Chained Equations (MICE), and MissForest. We conduct simulation studies on all-continuous, all-categorical, and mixed-type data under various random-missingness settings and evaluate the results of each method. MissForest performs best in terms of both the normalized root mean squared error (NRMSE) and the proportion of falsely classified entries (PFC). We also apply the three methods to several real data sets, where MissForest again outperforms the other two methods.
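The two evaluation criteria named in the abstract can be illustrated with a short sketch. It assumes the definitions commonly used alongside MissForest (NRMSE as the root of the mean squared imputation error divided by the variance of the true values, and PFC as the share of imputed categorical cells that differ from the truth); the thesis itself may define them slightly differently. The function names and the random-masking helper are hypothetical and are not taken from the thesis.

```python
import math
import random


def nrmse(true_vals, imputed_vals):
    """Normalized RMSE over the cells that were imputed (continuous data).

    Assumed definition: sqrt(mean((true - imputed)^2) / var(true)).
    """
    n = len(true_vals)
    mean_true = sum(true_vals) / n
    mse = sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals)) / n
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n
    if var_true == 0:
        raise ValueError("true values have zero variance; NRMSE is undefined")
    return math.sqrt(mse / var_true)


def pfc(true_labels, imputed_labels):
    """Proportion of falsely classified cells among imputed categorical cells."""
    wrong = sum(t != p for t, p in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)


def mask_completely_at_random(values, rate, seed=0):
    """Hide round(rate * n) randomly chosen entries (hypothetical helper).

    Returns the masked list (None marks a missing cell) and the hidden indices.
    """
    rng = random.Random(seed)
    k = round(rate * len(values))
    hidden = set(rng.sample(range(len(values)), k))
    masked = [None if i in hidden else v for i, v in enumerate(values)]
    return masked, sorted(hidden)


# Usage: hide 30% of a column, fill the gaps with the observed mean
# (a simple baseline, not one of the three methods studied), then score it.
column = [2.0, 4.0, 4.5, 5.0, 7.0, 8.5, 9.0, 10.0, 12.0, 15.0]
masked, hidden = mask_completely_at_random(column, rate=0.3, seed=1)
observed = [v for v in masked if v is not None]
fill = sum(observed) / len(observed)
score = nrmse([column[i] for i in hidden], [fill] * len(hidden))
print(f"NRMSE of mean imputation on {len(hidden)} hidden cells: {score:.3f}")
```

Lower values are better for both measures. Any of the three imputation methods could be scored the same way by replacing the mean-fill step with that method's output for the hidden cells.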
Keywords
Missing values, Machine learning, Imputation of missing values, K-Nearest Neighbor, Multivariate Imputation by Chained Equations, MissForest