使用機器學習方法於語音文件檢索之研究
No Thumbnail Available
Date
2009
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
本論文初步地討論機器學習之方法在資訊檢索上的應用,即所謂排序學習(Learning to Rank);並針對近年被使用在資訊檢索上的各種機器學習模型及概念,以及所使用的各種特徵,包含詞彙本身之特徵、相近度特徵、及機率特徵等進行分析與實驗。除此之外,本論文亦將之延伸至語音文件檢索的應用上。本論文初步地使用TDT(Topic Detection and Tracking)中文語料部份作為實驗題材,此語料為過去TREC(文件檢索暨評測會議)上公開評估語音文件檢索系統的標準語料(Benchmark)之一,此語料包含TDT-2及TDT-3兩套語料,提供了大量的新聞語料,及豐富的主題、轉寫等標註,以作為語音文件檢索相關研究使用。為了更有效地開發富含資訊的語音文件特徵,本論文亦使用臺師大大陸口音中文大詞彙連續語音辨識器(Large Vocabulary Speech Recognition, LVCSR)作為語音文件轉寫平台,產生的詞圖(Word Graph),作為擷取語音文件獨特特徵的主要依據。此外,我們並考慮到資訊檢索中之訓練語料不平衡問題,並提出解決此問題之對策。最後,初步的實驗結果顯示,成對式訓練方法RankNet之訓練模型檢索成效較逐點式訓練方法SVM之訓練模型檢索成效為佳。
This thesis investigates the use of machine-learning approaches, namely learning-to-rank algorithms, for information retrieval (IR), with special emphasis on their theoretical foundations and the associated features that are used by them, such as the lexical features, proximity features, and probabilistic features. Meanwhile, we also consider the application of these approaches for spoken document retrieval (SDR). All experiments were conducted on the Topic Detection and Tracking corpora (especially, TDT-2 and TDT-3), which are the benchmark collections widely adopted for various SDR evaluations since they contain tens of hours of mainland-accented Chinese broadcast news documents equipped with topic labels and orthographic transcripts. In the hope of discovering more useful speech-related features for SDR as well as analyzing the problems caused by speech recognition errors, a large vocabulary speech recognition (LVCSR) system that can output a word lattice consisting of multiple recognition hypotheses for each broadcast news document is established. Moreover, we also deal with the problem of training the machine-learning retrieval models with unbalanced training data, and propose a remedy for it. Finally, the preliminary experimental results seem to show that the RankNet based retrieval model outperforms the support vector machine (SVM) based retrieval model for the SDR task studied in this thesis.
This thesis investigates the use of machine-learning approaches, namely learning-to-rank algorithms, for information retrieval (IR), with special emphasis on their theoretical foundations and the associated features that are used by them, such as the lexical features, proximity features, and probabilistic features. Meanwhile, we also consider the application of these approaches for spoken document retrieval (SDR). All experiments were conducted on the Topic Detection and Tracking corpora (especially, TDT-2 and TDT-3), which are the benchmark collections widely adopted for various SDR evaluations since they contain tens of hours of mainland-accented Chinese broadcast news documents equipped with topic labels and orthographic transcripts. In the hope of discovering more useful speech-related features for SDR as well as analyzing the problems caused by speech recognition errors, a large vocabulary speech recognition (LVCSR) system that can output a word lattice consisting of multiple recognition hypotheses for each broadcast news document is established. Moreover, we also deal with the problem of training the machine-learning retrieval models with unbalanced training data, and propose a remedy for it. Finally, the preliminary experimental results seem to show that the RankNet based retrieval model outperforms the support vector machine (SVM) based retrieval model for the SDR task studied in this thesis.
Description
Keywords
資訊檢索, 排序學習, 語音辨識, Information Retrieval, Learning to Rank, Speech Recognition