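Automatic New Word Extraction from Google's Categorized Online News Corpora in Support of Lexicon-Based Chinese Word Segmentation Systems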
Date
2006
Abstract
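Chinese word segmentation has long been an active research topic, and many segmentation methods have been proposed. Lexicon-based segmentation was the earliest approach and remains the most widely used, but it cannot perform well without a large and diverse lexicon. The problem is most visible with contemporary Chinese text, which contains many new words, known as unknown words, that traditional lexicons do not include. When a traditional lexicon-based segmentation system processes such text, it often makes errors because it cannot recognize these new words, and its accuracy drops. An efficient system for extracting new Chinese words is therefore necessary. This study proposes a method for generating a lexicon automatically: using the news service provided by Google and its characteristics, it builds a specialized news lexicon and updates the lexicon's content daily. Besides the extracted new words, the lexicon records information such as each word's category and the time at which it appeared. This information can be used to improve the lexicon's accuracy and is available to researchers for further study. Because news covers a broad and diverse range of content, the large volume of daily news yields Chinese words from many domains, which addresses the difficulty of expanding existing lexicons. Also owing to the nature of news, the newest vocabulary in Chinese-speaking society can be discovered and added to the lexicon in the shortest possible time.
The experimental results confirm that the proposed method is feasible. Words from many domains were extracted from different news events. Inspection by Chinese language experts showed that these words include new words not covered by traditional lexicons and that the extraction is reliably accurate, demonstrating that the method is capable of automatic new word extraction.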
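To make the unknown-word problem concrete, the following is a minimal sketch of lexicon-based segmentation using forward maximum matching, a common greedy strategy. The abstract does not say which matching strategy the thesis uses, so the algorithm choice, the `LexiconEntry` fields, the category label, and the date are all illustrative assumptions, not the author's implementation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LexiconEntry:
    """Hypothetical lexicon record: a word plus its category and first-seen date."""
    word: str
    category: str     # news category the word came from (label is an assumption)
    first_seen: date  # date the word first appeared in the corpus


def forward_max_match(text, lexicon, max_len=4):
    """Greedily segment text, preferring the longest lexicon word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# A tiny base lexicon that lacks the newer word "部落格" (blog).
base = {w: LexiconEntry(w, "general", date(2006, 1, 1)) for w in ("我們", "喜歡", "使用")}
print(forward_max_match("我們喜歡使用部落格", base))

# After a new-word extraction step adds the word, segmentation recovers it whole.
updated = dict(base)
updated["部落格"] = LexiconEntry("部落格", "technology", date(2006, 6, 1))
print(forward_max_match("我們喜歡使用部落格", updated))
```

With the base lexicon, the unknown word falls apart into single characters; once the entry is added, it is segmented as one word. This is the failure mode that motivates keeping the lexicon updated from daily news.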
Word segmentation is an essential step in Chinese natural language processing because it benefits Chinese text mining and information retrieval. Currently, the lexicon-based scheme is the most widely used method; it can correctly segment Chinese sentences into distinct words in real-world applications. However, its word identification ability depends heavily on a well-prepared lexicon with enough entries to cover all Chinese words. In particular, this scheme performs poorly on texts that change rapidly over time, such as newspaper articles and web documents, because such documents often contain many new words that a lexicon-based segmentation system with a fixed lexicon cannot identify. Moreover, maintaining the lexicon manually is inefficient and time-consuming. To address these problems, this study proposes a novel statistics-based scheme for new word extraction, based on categorized Google News corpora retrieved automatically from the Google News site, to improve the word identification ability of lexicon-based Chinese word segmentation systems. Compared with another proposed method, the experimental results indicate that the proposed scheme not only retrieves new words from the categorized Google News corpora more accurately but also obtains a larger number of new words.
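As a rough illustration of what a statistics-based extraction step can look like, the sketch below counts character n-grams across a set of documents and keeps those that are absent from the lexicon and occur often enough. The abstract does not state which statistic the thesis uses, so this plain frequency threshold is a stand-in; the function name, parameters, and sample sentences are hypothetical.

```python
from collections import Counter


def candidate_new_words(documents, lexicon, min_len=2, max_len=4, min_count=3):
    """Return n-grams missing from the lexicon that occur at least min_count times."""
    counts = Counter()
    for doc in documents:
        for n in range(min_len, max_len + 1):
            # Slide a window of n characters across the document.
            for i in range(len(doc) - n + 1):
                gram = doc[i:i + n]
                if gram not in lexicon:
                    counts[gram] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Three short made-up sentences that all mention "部落格" (blog).
docs = ["部落格很紅", "寫部落格", "看部落格"]
print(candidate_new_words(docs, lexicon=set()))
```

Note that fragments of a frequent word, such as "部落" and "落格" here, pass the same threshold as the word itself. A practical system would need further filtering to separate true words from such fragments; the thesis reports that its candidates were checked by Chinese language experts.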
Keywords
Chinese word segmentation, New word extraction, Google News service, Lexicon, Natural language processing, Information retrieval