Integration of Decision Tree and Logistic Regression in Data Mining: Examples of Analysis of Personnel Data and the Influence Factors on People's Health

Date
2015
Authors
鄧莉雅
Teng, Li-Ya
Abstract
Data are among an organization's most important assets, and analyzing and mining them effectively is a key issue for improving operational performance. Classification is a central task when data mining is used to extract useful information, and decision tree analysis is the most commonly used classification technique in data mining. As more input variables are entered, however, the performance of decision tree analysis suffers. To address this weakness, this study integrates logistic regression into the analysis: its significance tests are used to select important variables with strong explanatory power, and only those variables are entered into the decision tree model, with the aim of improving analytic performance and producing rules of greater practical value. Specifically, variables with larger and statistically significant Wald statistics in the logistic regression are retained for the decision tree, and the resulting model is compared with a model built without variable selection to see whether the new rule set is smaller or more effective.
In the empirical analysis, a personnel database and the Panel Study of Family Dynamics (PSFD) are used for the integrated analysis of decision trees and logistic regression. The personnel database contains dominant ("strong") salary variables, so the two-stage analysis is run both with and without them and the outcomes are compared. The PSFD is a multi-year panel survey that follows a fixed sample, which makes it possible to analyze and compare the factors influencing people's health across several years of data.
The results show that, in the personnel database, the important input variables for the three-level job category variable are starting salary, current salary, education level, and previous experience. When the inputs include the dominant variables, the decision tree result is the same before and after the logistic regression variable-selection step, although classification accuracy rises once the nonsignificant variables are removed. When the inputs exclude the dominant variables, the decision tree result changes markedly. In the PSFD analysis, which classifies people's health status into three levels, the main input variables are the health of the spouse, the father, and the mother. The two-stage procedure greatly reduces the number of decision rules in the subsequent C5.0 decision tree analysis and makes the rules easier to interpret, but because many input variables are dropped, classification accuracy and the other model evaluation indices do not improve significantly.
In addition to explaining the principles and applications of logistic regression and decision tree analysis, this study proposes a two-stage integrated analysis strategy and applies it to two empirical databases, showing concretely that data mining techniques can be combined with the variable-importance tests of multivariate statistics to improve analytic performance. Finally, the limitations of the study and suggestions for future research and application are discussed.
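The first stage of the two-stage strategy described in the abstract, screening input variables by the Wald test of a logistic regression, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure used in the thesis: the data are synthetic, the column names (`salary`, `education`, `noise`) are hypothetical, the target is binary rather than three-level, and the cutoff |z| > 1.96 (a two-sided 5% test) is an assumed choice. The helper functions `invert` and `fit_logistic` are written here for the example and come from no library.

```python
import math
import random


def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]


def fit_logistic(X, y, iters=25):
    """Fit a binary logistic regression by Newton-Raphson.

    Returns the coefficients and their standard errors, taken from the
    inverse of the information matrix evaluated in the final iteration.
    """
    p = len(X[0])
    beta = [0.0] * p
    cov = None
    for _ in range(iters):
        grad = [0.0] * p
        info = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            eta = max(-30.0, min(30.0, eta))  # guard against overflow in exp
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    info[j][k] += w * xi[j] * xi[k]
        cov = invert(info)
        beta = [b + sum(cov[j][k] * grad[k] for k in range(p)) for j, b in enumerate(beta)]
    se = [math.sqrt(cov[j][j]) for j in range(p)]
    return beta, se


# Synthetic data (assumption): two informative inputs and one pure-noise input.
random.seed(0)
names = ["salary", "education", "noise"]
X, y = [], []
for _ in range(400):
    salary, education, noise = (random.gauss(0, 1) for _ in range(3))
    eta = 1.5 * salary + 1.0 * education
    X.append([1.0, salary, education, noise])  # leading 1.0 is the intercept
    y.append(1 if random.random() < 1.0 / (1.0 + math.exp(-eta)) else 0)

beta, se = fit_logistic(X, y)

# Wald z statistic per input (skipping the intercept); keep the significant ones.
wald_z = {name: beta[i + 1] / se[i + 1] for i, name in enumerate(names)}
selected = [name for name in names if abs(wald_z[name]) > 1.96]
print(selected)
```

In the second stage, only the columns named in `selected` would be passed to a decision tree learner (the thesis uses C5.0), and the resulting rule set and accuracy would be compared with those of a tree grown on all inputs.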
Keywords
Data mining, Decision tree, Logistic regression, Variable selection