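Classifying the Readability of Chinese Texts with Support Vector Machines: The Case of Elementary School Mandarin Textbook Lessons (使用支援向量機進行中文文本可讀性分類-以國小國語課文為例)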

Date

2011

Abstract

Language ability plays an important role in every field, and reading is one of the most important and direct ways to acquire it. Readability measures whether a text suits the reading ability of its intended reader. Earlier studies describe readability formulas as a tool for adapting texts to readers of different educational levels. Research on the readability of English texts began long ago, but comparable work on Chinese is scarce, even though Chinese proficiency is increasingly important in today's society. A suitable method for classifying text readability is therefore needed. Constrained by the technology of their time, earlier Western researchers mostly used linear readability formulas to classify texts, and linear formulas are limiting for the data in this study. The purpose of this study is therefore to build a prediction model trained with a support vector machine (SVM) that classifies the readability of elementary school Mandarin textbook lessons. The study then checks whether the predicted level of each lesson matches its actual grade level and analyzes the misclassified lessons in order to improve classification accuracy.

The experimental data come from three commercial textbook editions (editions H, K, and N) written by curriculum experts and approved by the national textbook review body. After removing modern poems, quatrains (jueju), classical prose, and regulated verse (lüshi), the Mandarin lessons for grades one through six total 386 texts. Part of the texts serve as training data and the rest as test data. The texts go through Chinese word segmentation and data format conversion, and an SVM then classifies their readability. Using LIBSVM to predict the textbook volume of each lesson yields an accuracy of 47.92% and a fit rate of 80.31%. Finally, an error analysis of the misclassified lessons examines which factors caused the wrong predictions.
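The "data format conversion" step mentioned in the abstract can be pictured with a small sketch. The abstract does not say which features the study extracted, so the example below assumes simple word counts over already-segmented tokens and writes them in LIBSVM's sparse text format (`label index:value ...`, with feature indices ascending and starting at 1). The function names `build_vocabulary` and `to_libsvm_line`, the toy lessons, and the labels are hypothetical and serve only as illustration.

```python
from collections import Counter


def build_vocabulary(segmented_docs):
    """Assign each distinct token a 1-based feature index (LIBSVM indices start at 1)."""
    vocab = {}
    for tokens in segmented_docs:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def to_libsvm_line(label, tokens, vocab):
    """Render one document as a LIBSVM line: '<label> <index>:<count> ...'."""
    # Assumption: raw word counts as feature values; the study's features may differ.
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    features = " ".join(f"{idx}:{cnt}" for idx, cnt in sorted(counts.items()))
    return f"{label} {features}".rstrip()


# Hypothetical toy data: (level label, tokens produced by a word segmenter).
docs = [
    (1, ["我", "愛", "媽媽", "媽媽", "愛", "我"]),
    (6, ["科學家", "觀察", "自然", "現象"]),
]

vocab = build_vocabulary(tokens for _, tokens in docs)
lines = [to_libsvm_line(label, tokens, vocab) for label, tokens in docs]

for line in lines:
    print(line)
```

Lines in this format can be saved to a file and passed to LIBSVM's training tools. Token indices are sorted before printing because LIBSVM expects ascending feature indices within each line.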

Keywords

Readability, Text Classification, Support Vector Machine, Chinese Word Segmentation
