Class Prediction with Feature Selection and Data Balancing on High Dimensional and Imbalanced Binary Data

蘇立鴻; Su, Li-Hung

dc.contributor	呂翠珊	zh_TW
dc.contributor	Lu, Tsui-Shan	en_US
dc.contributor.author	蘇立鴻	zh_TW
dc.contributor.author	Su, Li-Hung	en_US
dc.date.accessioned	2023-12-08T07:55:59Z
dc.date.available	2027-08-15
dc.date.available	2023-12-08T07:55:59Z
dc.date.issued	2022
dc.description.abstract	近年來，機器學習 (ML) 在資料探勘和預測方面逐漸流行；與傳統的統計訓練相比，ML 有名的是在預測或分類數據方面的高準確度，但仍然存在一些限制。首先是如果資料的分布高度不平均，ML 算法會遇到準確度悖論，意思是說它只會對多數類別進行預測，我們使用採樣方法來解決這個問題。其次是面對高維資料時的計算時間，我們使用特徵選擇方法來解決這個問題。在前面的資料預處理之後，我們考慮四種 ML 算法：邏輯迴歸、K-近鄰 (KNN) 、隨機森林 (RF) 和極限梯度提升 (XGBoost) 來比較模型的性能。我們通過具有 687 個變數和 40041 個觀察值的醫療數據集急性腎損傷 (AKI) 演示了上述過程。主要結果是他們是否在 AKI 上復發。結果表明，XGBoost 在接受者操作特徵曲線下的面積 (AUC-ROC) 方面具有最佳性能。對於醫療數據集，鈉、速尿、芬太尼、布美他尼、多巴胺、胰島素、白蛋白、甘油和腎上腺素是最具影響力的藥物，CCS1581 是影響最大的疾病。	zh_TW
dc.description.abstract	In recent years, machine learning (ML) has become popular for data mining and prediction. Compared with traditional statistical training, ML is known for its high accuracy in predicting or classifying data. However, several limitations remain. First, if the class distribution of the data is highly imbalanced, ML algorithms tend to run into the accuracy paradox: they predict only the majority class. We use sampling methods to address this problem. Second, handling high-dimensional data is computationally demanding. We use feature selection methods to overcome this problem. After this data preprocessing, we consider four ML algorithms, logistic regression, K-Nearest Neighbor (KNN), Random Forest (RF), and Extreme Gradient Boosting (XGBoost), and compare their performance. We demonstrate the procedure on a medical dataset on Acute Kidney Injury (AKI) with 687 variables and 40041 observations. The main outcome is whether a patient has a recurrence of AKI. The results show that XGBoost has the best performance in terms of the area under the receiver operating characteristic curve (AUC-ROC). For the medical dataset, Sodium, Furosemide, Fentanyl, Bumetanide, Dopamine, Insulin, Albumin, Glycerin, and Epinephrine are the most influential medications, and CCS1581 is the most influential disease.	en_US
dc.description.sponsorship	數學系	zh_TW
dc.identifier	60940024S-41942
dc.identifier.uri	https://etds.lib.ntnu.edu.tw/thesis/detail/9be045c76d7ea756a663b94292667ad6/
dc.identifier.uri	http://rportal.lib.ntnu.edu.tw/handle/20.500.12235/121103
dc.language	English
dc.subject	機器學習	zh_TW
dc.subject	邏輯迴歸	zh_TW
dc.subject	K-近鄰	zh_TW
dc.subject	隨機森林	zh_TW
dc.subject	極限梯度提升	zh_TW
dc.subject	不平衡	zh_TW
dc.subject	準確率悖論	zh_TW
dc.subject	採樣	zh_TW
dc.subject	高維	zh_TW
dc.subject	特徵選擇	zh_TW
dc.subject	machine learning	en_US
dc.subject	logistic regression	en_US
dc.subject	K-Nearest Neighbor	en_US
dc.subject	Random Forest	en_US
dc.subject	Extreme Gradient Boosting	en_US
dc.subject	imbalanced	en_US
dc.subject	accuracy paradox	en_US
dc.subject	sampling	en_US
dc.subject	high dimensional	en_US
dc.subject	feature selection	en_US
dc.title	用特徵選擇和數據平衡對高維且分佈不均的二元資料做類別預測	zh_TW
dc.title	Class Prediction with Feature Selection and Data Balancing on High Dimensional and Imbalanced Binary Data	en_US
dc.type	etd
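
The abstract mentions the accuracy paradox and the use of sampling and AUC-ROC without giving details. The following is a minimal sketch of those ideas using only the Python standard library. It is not the thesis's implementation: the toy labels, the function names, and the choice of random oversampling as the sampling method are all assumptions made for illustration.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc_roc(y_true, scores):
    """AUC-ROC as the probability that a random positive outscores a
    random negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def random_oversample(rows, labels, seed=0):
    """Balance a binary dataset by duplicating randomly chosen minority rows.
    (Assumed sampling method; the abstract does not say which one was used.)"""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    minority = min(by_class, key=lambda c: len(by_class[c]))
    majority = 1 - minority
    deficit = len(by_class[majority]) - len(by_class[minority])
    extra = [rng.choice(by_class[minority]) for _ in range(deficit)]
    out_rows = by_class[majority] + by_class[minority] + extra
    out_labels = ([majority] * len(by_class[majority])
                  + [minority] * (len(by_class[minority]) + deficit))
    return out_rows, out_labels


# Hypothetical toy data: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
rows = list(range(100))

# A model that always predicts the majority class looks accurate...
majority_acc = accuracy(y_true, [0] * 100)
# ...but its constant score carries no ranking information.
majority_auc = auc_roc(y_true, [0.0] * 100)

# After oversampling, both classes have the same number of rows.
balanced_rows, balanced_labels = random_oversample(rows, y_true)
```

In this toy setting the always-negative predictor reaches high accuracy while its AUC-ROC stays at chance level, which is why a ranking-based metric is a natural choice for imbalanced outcomes. In practice, oversampling would be applied to the training split only so that duplicated rows do not leak into evaluation.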

Collections

學位論文
