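A Study on Using Multiple Discriminative Models and Feature Information for Spoken Document Summarization (使用多種鑑別式模型以及特徵資訊於語音文件摘要之研究)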

Date

2010

Abstract

Many machine-learning approaches to speech summarization cast the task as a two-class classification problem, selecting important sentences from a spoken document to form the summary. However, imbalanced training data can degrade the performance of the resulting summarizer. Moreover, a summarizer trained to maximize classification accuracy does not necessarily achieve better summarization evaluation scores. In view of these two problems, this thesis investigates two alternative training criteria intended to alleviate their negative effects and to improve summarization performance. The first trains the summarizer on the pair-wise importance ordering of sentences within each training document. The second trains the summarizer by directly maximizing the associated summarization evaluation score. In addition, several methods for selecting training sentences and features are extensively studied and compared. Experiments on a Chinese broadcast news summarization task show that both training criteria outperform the baseline summarization system, whereas training sentence selection and feature selection show mixed effectiveness and do not appear to improve summarization performance significantly.
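The pair-wise criterion described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the linear scorer, the perceptron-style margin update, the two-dimensional toy features, and the function names `make_pairs`, `score`, and `train_pairwise` are all assumptions chosen for illustration. The point it shows is that training on (summary sentence, non-summary sentence) pairs optimizes relative ordering within a document rather than per-sentence classification accuracy.

```python
from itertools import product


def make_pairs(features, labels):
    """Pair every summary sentence (label 1) with every non-summary sentence (label 0)."""
    pos = [f for f, y in zip(features, labels) if y == 1]
    neg = [f for f, y in zip(features, labels) if y == 0]
    return list(product(pos, neg))


def score(w, x):
    """Linear importance score of one sentence (assumed model form)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise(features, labels, epochs=50, lr=0.1, margin=1.0):
    """Perceptron-style updates on pairs whose score gap is below the margin."""
    w = [0.0] * len(features[0])
    pairs = make_pairs(features, labels)
    for _ in range(epochs):
        for hi, lo in pairs:
            if score(w, hi) - score(w, lo) < margin:
                # Move w so the summary sentence scores higher than the other one.
                w = [wi + lr * (h - l) for wi, h, l in zip(w, hi, lo)]
    return w


# Hypothetical toy document: 5 sentences, 2 made-up features each,
# with sentences 0 and 4 labeled as summary sentences.
features = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.1, 0.9], [0.8, 0.3]]
labels = [1, 0, 0, 0, 1]

w = train_pairwise(features, labels)
ranked = sorted(range(len(features)), key=lambda i: score(w, features[i]), reverse=True)
summary = ranked[:2]  # pick the two top-ranked sentences as the extract
```

In this toy setting the two labeled summary sentences end up ranked first. A real system would use acoustic, lexical, and structural features and a held-out evaluation, none of which are modeled here.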

Keywords

Spoken Document, Extractive Summarization, Point-wise Approach, Pair-wise Approach, List-wise Approach, Imbalanced Training Data, Greedy Algorithm
