A Study of Data Cleaning for Bibliomining: Using Questionnaire Data as an Example
Date
2012
Abstract
Data cleaning is the first step of bibliomining and strongly influences its results. Raw data, however, usually contain noise, so the mining process may spend a great deal of time on noise removal, and excessive noise further degrades the mining results. Past research on data cleaning for bibliomining has dealt mostly with internal data; few studies have used external data as the data source. Yet the abundant external data in librarianship can be combined with the data from each module of a library's integrated library system to give library managers a better understanding of their patrons' usage behavior.
This study takes external data as its data source and cleans it through the steps of noise removal, data integration, data transformation, data reduction, and the application of concept hierarchies. Regression analysis and cluster analysis, two bibliomining techniques, are then used to compare the mining results before and after cleaning. The results show that after data cleaning, both the R² of the regression analysis and the probability values of the explanatory variables in the cluster analysis improve over the uncleaned data.
The findings indicate that the data cleaning methods and steps used in this study help improve the accuracy of bibliomining. In addition, the noise removal step effectively improves the mining results, and applying further clustering afterwards, such as bivariate and multivariate clustering, improves the results as well.
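The thesis itself gives no code, but the five cleaning steps named in the abstract above (noise removal, data integration, data transformation, data reduction, and concept hierarchies) map naturally onto a small tabular pipeline. The following is a minimal sketch in Python with pandas under assumed data: the column names (patron_id, age, dept, loans, visits) and the pairing of survey responses with circulation records are illustrative inventions, not the study's actual questionnaire schema.

```python
# Minimal sketch of the five cleaning steps named in the abstract, using
# pandas. All column names are illustrative assumptions, not the thesis's
# actual questionnaire or circulation schema.
import pandas as pd

def clean(survey: pd.DataFrame, circulation: pd.DataFrame) -> pd.DataFrame:
    # 1. Noise removal: drop duplicated and incomplete survey responses.
    survey = survey.drop_duplicates(subset="patron_id")
    survey = survey.dropna(subset=["age", "dept"])

    # 2. Data integration: join the external survey data with circulation
    #    records from the integrated library system on a shared patron key.
    data = survey.merge(circulation, on="patron_id", how="inner")

    # 3. Data transformation: rescale numeric usage counts to [0, 1].
    for col in ("loans", "visits"):
        span = data[col].max() - data[col].min()
        data[col] = (data[col] - data[col].min()) / span if span else 0.0

    # 4. Data reduction: keep only the variables used in the later analyses.
    data = data[["patron_id", "age", "dept", "loans", "visits"]].copy()

    # 5. Concept hierarchy: roll fine-grained ages up into coarse age groups.
    data["age_group"] = pd.cut(
        data["age"],
        bins=[0, 18, 25, 40, 65, 120],
        labels=["minor", "student", "young_adult", "adult", "senior"],
    )
    return data
```

The ordering mirrors the abstract: noise is stripped before integration so the join keys are reliable, and the concept hierarchy is applied last because it operates on the already-reduced variables.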
Data cleaning is the first step of bibliomining, and the quality of bibliomining results depends heavily on it. Because the underlying data typically contain noise, a bibliomining project can spend considerable effort on noise removal, and excessive noise degrades the mining results. Most previous research on data cleaning for bibliomining has focused on internal data; only a few studies have taken external data as their source material. Yet the vast external data available in librarianship can be combined with the modules of an integrated library system to give librarians a better understanding of the usage behavior of library users. In this study we take external data as our source material and apply the successive stages of data cleaning, namely noise removal, data integration, data transformation, data reduction, and concept hierarchies. We then run regression analysis and cluster analysis on both the raw and the cleaned data and inspect the results to show that our data cleaning concepts and procedure improve the accuracy of bibliomining. The results indicate that with data cleaning, both the R² of the regression analysis and the probability values of the explanatory variables in the cluster analysis improve. Beyond noise elimination, further clustering steps such as bivariate and multivariate clustering can raise the efficacy of bibliomining as well.
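To make the before-and-after comparison concrete, the sketch below scores a dataset with the two analyses the abstract names, regression (summarized by R²) and clustering. It is a hypothetical illustration rather than the thesis's actual procedure: the feature and target columns are invented, and k-means inertia merely stands in for the explanatory-variable probability measure the study reports for its cluster analysis.

```python
# Hypothetical before/after evaluation using the two analyses named in the
# abstract: linear regression (scored by R²) and k-means clustering. The
# column names and the inertia-based cluster score are assumptions made
# purely for illustration.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

def evaluate(df: pd.DataFrame,
             features=("loans", "visits"),
             target="renewals"):
    X = df[list(features)].to_numpy()
    y = df[target].to_numpy()

    # Regression: R² indicates how well the usage variables explain the target.
    r2 = LinearRegression().fit(X, y).score(X, y)

    # Clustering: bivariate (two features) or multivariate clustering of
    # patrons; inertia serves here as a rough proxy for cluster quality.
    inertia = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).inertia_
    return r2, inertia

# Compare evaluate(raw_data) with evaluate(cleaned_data); the thesis reports
# that the cleaned data yield the higher R².
```

Running the function once on the raw table and once on the cleaned table gives the kind of paired comparison the abstract describes, with the cleaned data expected to produce the better scores.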
Keywords
Data Cleaning, Bibliomining