An Initial Study on English Continuous Speech Recognition (英文連續語音辨識之初步研究)

陳柏琳 (Berlin Chen)
許庭瑋 (TingWei Hsu)
2019-09-05
2007-8-10
2019-09-05
2007
http://etds.lib.ntnu.edu.tw/cgi-bin/gs32/gsweb.cgi?o=dstdcdr&s=id=%22GN0694470027%22.&%22.id.&
http://rportal.lib.ntnu.edu.tw:80/handle/20.500.12235/106657

This thesis presents a preliminary study on English continuous speech recognition. An English continuous speech recognizer was implemented, and its major components, namely speech feature extraction, acoustic modeling and language modeling, were investigated.

First, for speech feature extraction, we compared the performance of linear discriminant analysis (LDA) and heteroscedastic linear discriminant analysis (HLDA) with that of the conventional Mel-frequency cepstral coefficients (MFCC). Second, for acoustic modeling, we explored the use of intra-word triphone models, state tying, the phone confusion matrix, and unsupervised acoustic model training to improve recognition accuracy. Finally, for language modeling, the count-merging and model-interpolation approaches were each used to combine the background and in-domain language model training corpora, so as to better predict word occurrences during recognition.

The experiments were conducted on the Voice of America (VOA) and the English Across Taiwan (EAT) corpora, the latter consisting of Taiwanese-accented English, and yielded some preliminary observations and findings.

Keywords: Continuous Speech Recognition; Intra-word Triphone Models; State Tying; Phone Confusion Matrix
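
The two language model combination approaches named in the abstract, count merging and model interpolation, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not state the n-gram order, smoothing method, or weights, so the sketch assumes unsmoothed unigram models, and the function names, the toy corpora, and the weights `beta` and `lam` are all hypothetical.

```python
from collections import Counter


def mle(counts):
    """Maximum-likelihood unigram probabilities from (possibly weighted) counts."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def count_merging(bg_counts, in_counts, beta=1.0):
    """Add in-domain counts, scaled by beta, to background counts, then normalize."""
    merged = Counter(bg_counts)
    for w, c in in_counts.items():
        merged[w] += beta * c
    return mle(merged)


def model_interpolation(bg_counts, in_counts, lam=0.5):
    """Estimate each model separately, then mix: lam * P_in + (1 - lam) * P_bg."""
    p_bg, p_in = mle(bg_counts), mle(in_counts)
    vocab = set(p_bg) | set(p_in)
    return {w: lam * p_in.get(w, 0.0) + (1 - lam) * p_bg.get(w, 0.0) for w in vocab}


# Toy corpora (assumed for illustration only).
background = Counter("the news the world the report".split())
in_domain = Counter("the speech recognizer".split())

p_merge = count_merging(background, in_domain, beta=1.0)
p_interp = model_interpolation(background, in_domain, lam=0.5)
```

In this unigram setting, count merging behaves like interpolation with a weight fixed by the relative corpus sizes and `beta`, whereas model interpolation sets the mixing weight directly. With higher-order n-grams the effective weight under count merging varies with how often each history occurs in the two corpora.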