A Study of Confidence Measures for Mandarin Large Vocabulary Continuous Speech Recognition (信心度評估於中文大詞彙連續語音辨識之研究)

Abstract

This thesis presents a preliminary investigation of confidence measures for Mandarin large vocabulary continuous speech recognition (LVCSR). The confidence measures were used not only as a post-processor to verify the correctness of recognition hypotheses (for example, candidate words), but were also integrated directly into word graph rescoring and N-best list reranking to generate better recognition hypotheses. All experiments were carried out on the Mandarin broadcast news corpus (MATBN), using utterances from field reporters and interviewees, which are closer to read speech and to spontaneous speech, respectively, so that the behavior of confidence measures could be compared across the two speaking styles.

First, entropy information was combined with the posterior probability based confidence measure. In the best experimental results, this combination gave relative confidence error rate reductions of 16.37% for the field reporters' speech and 12.00% for the interviewees' speech, compared with using the posterior probability based confidence measure alone.

Second, we jointly considered the information in two word graphs: one constructed from Mel-frequency cepstral coefficients (MFCC), and one constructed from the discriminative acoustic features obtained with heteroscedastic linear discriminant analysis and maximum likelihood linear transformation (HLDA+MLLT). Minimum time frame error decoding was conducted on the two word graphs simultaneously to find the best word sequence. This approach achieved relative character error rate reductions of 4.6% for the field reporters' speech and 4.8% for the interviewees' speech, which were better than the results obtained by conducting minimum time frame error decoding on the HLDA+MLLT word graph alone.

Finally, we incorporated a feature-based confidence measure into minimum Bayes risk decoding, which conventionally uses the Levenshtein distance as its cost function. Although the experiments showed no marked change in recognition error rate on either the field reporters' or the interviewees' speech, the proposed approach gave slight but consistent gains over conventional minimum Bayes risk decoding.
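For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of conventional minimum Bayes risk reranking over an N-best list with the Levenshtein distance as the cost function. It is not the thesis's implementation: the function names, the toy hypotheses, and the toy scores are all illustrative assumptions, and the feature-based confidence measure proposed in the thesis is not modeled here because the abstract does not specify how it enters the cost.

```python
import math


def levenshtein(a, b):
    """Edit distance between two token sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def nbest_posteriors(log_scores):
    """Normalize log scores into posteriors over the N-best list."""
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    z = sum(exps)
    return [e / z for e in exps]


def mbr_rerank(nbest, log_scores):
    """Return the hypothesis with minimum expected Levenshtein cost.

    The expectation is taken over the N-best list itself, which is an
    approximation to the full hypothesis space.
    """
    post = nbest_posteriors(log_scores)
    risks = [sum(p * levenshtein(h, other) for other, p in zip(nbest, post))
             for h in nbest]
    best = min(range(len(nbest)), key=risks.__getitem__)
    return nbest[best], risks


# Toy example with made-up hypotheses and scores (assumptions, not thesis data).
nbest = [["a", "b", "c"], ["a", "b", "d"], ["a", "x", "d"]]
log_scores = [math.log(0.4), math.log(0.3), math.log(0.3)]
best, risks = mbr_rerank(nbest, log_scores)
print(best, risks)
```

In this toy case the highest-scoring hypothesis is the first one, but the second has the lowest expected edit cost because it lies closest to the other two. That difference between picking the most probable hypothesis and picking the one with least expected error is what minimum Bayes risk decoding exploits.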

Keywords

Confidence Measures, Entropy, Minimum Bayes Risk, Mandarin Large Vocabulary Continuous Speech Recognition
