Title: 實證探究多種鑑別式語言模型於語音辨識之研究
Title (English): Empirical Comparisons of Various Discriminative Language Models for Speech Recognition
Advisor: 陳柏琳 (Berlin Chen)
Author: 賴敏軒
Dates: 2019-09-05; 2011-08-22; 2019-09-05; 2011
URI: http://etds.lib.ntnu.edu.tw/cgi-bin/gs32/gsweb.cgi?o=dstdcdr&s=id=%22GN0698470623%22.&%22.id.&
URI: http://rportal.lib.ntnu.edu.tw:80/handle/20.500.12235/106867
Keywords: speech recognition; discriminative language model; margin-based training criterion; empirical investigation

Abstract (Chinese, translated):
Language models play an important role in automatic speech recognition (ASR) systems: their parameters are estimated from large amounts of training text so as to describe the regularities of natural language. N-gram language models (especially bigram and trigram models) are commonly used to estimate the conditional probability of each word given its preceding N-1 words of history. However, n-gram models are mostly trained by maximizing likelihood, which is often of limited use for lowering the recognition error rate and does not achieve minimum recognition error. To address this problem, discriminative language models (DLMs) have been proposed in recent years; their goal is to correctly discriminate the best sentence among the candidate recognition hypotheses as the recognition result, rather than merely fit the training data, and this idea has been demonstrated with a certain degree of success. This thesis first presents an empirical investigation of several discriminative language models aimed at improving speech recognition performance. We then propose margin-based DLM training methods that penalize incorrectly recognized hypotheses to different degrees, weighted by the difference between their word error rate (WER) and that of the reference word sequence (the one with the lowest WER). Compared with other existing discriminative language models, the proposed methods provide considerable benefits on a large vocabulary continuous speech recognition (LVCSR) task.

Abstract (English):
Language modeling (LM), at the heart of most automatic speech recognition (ASR) systems, aims to render the regularity of a given natural language, with the corresponding model parameters estimated from a large amount of training text. N-gram language models (especially bigram and trigram models), which determine the probability of a word given the preceding n-1 words of history, are the most prominently used. N-gram models, normally trained with the maximum likelihood (ML) criterion, are not always capable of achieving minimum recognition error rates, which are in fact closely connected to the final evaluation metric. To address this problem, a range of discriminative language modeling (DLM) methods, aiming to correctly discriminate among the recognition hypotheses for the best recognition result rather than merely fit the distribution of the training data, have been proposed in the recent past and demonstrated with varying degrees of success. In this thesis, we first present an empirical investigation of a few leading DLM methods designed to boost speech recognition performance. Then, we propose a novel use of various margin-based DLM training methods that penalize incorrect recognition hypotheses in proportion to their word error rate (WER) distance from the desired hypothesis (the oracle) that has the minimum WER. Experiments conducted on a large vocabulary continuous speech recognition (LVCSR) task illustrate the performance merits of the methods instantiated from our DLM framework when compared to other existing methods.
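
For reference, the n-gram formulation described in the abstracts can be made explicit. The equations below are the standard textbook definitions, not notation taken from the thesis itself:

    P(w_1^T) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-n+1}^{i-1}),
    \qquad
    P_{\mathrm{ML}}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i})}{c(w_{i-n+1}^{i-1})}

where w_{i-n+1}^{i-1} denotes the n-1 preceding words and c(·) counts occurrences in the training text; a bigram corresponds to n = 2 and a trigram to n = 3.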
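
The WER-proportional penalty can likewise be sketched with a common large-margin formulation; the exact objective optimized in the thesis may differ, so the following is only illustrative. Let s_Λ(X, W) be the model's discriminant score for hypothesis W of utterance X, let W_o be the oracle hypothesis with minimum WER, and let δ(W, W_o) be the WER gap between W and W_o. A hinge-style loss that scales the required margin by that gap is

    \mathcal{L}(\Lambda) = \sum_{X} \sum_{W \neq W_o}
    \max\!\Big(0,\; \delta(W, W_o) - \big[ s_\Lambda(X, W_o) - s_\Lambda(X, W) \big] \Big)

so a hypothesis with a larger WER gap must be separated from the oracle by a proportionally larger score margin, matching the abstracts' description of penalizing hypotheses in proportion to their WER distance.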
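
As a concrete illustration of how such a criterion could be optimized for N-best reranking with a linear feature model, here is a minimal sketch. The function name, data layout, and hyperparameters are assumptions made for exposition; this is not the thesis's actual implementation.

    # Minimal sketch of margin-based training of a linear discriminative LM
    # for N-best reranking. Illustrative only: data layout, learning rate,
    # and update rule are assumptions, not the thesis's implementation.
    from collections import defaultdict

    def train_margin_dlm(nbest_lists, epochs=10, lr=0.1):
        """nbest_lists: iterable of N-best lists, each a list of
        (features, wer) pairs, where features maps feature name -> value.
        Returns a weight vector (dict) for reranking hypotheses."""
        w = defaultdict(float)

        def score(feats):
            return sum(w[f] * v for f, v in feats.items())

        for _ in range(epochs):
            for hyps in nbest_lists:
                # Oracle = hypothesis with the lowest WER in this list.
                o_feats, o_wer = min(hyps, key=lambda h: h[1])
                for feats, wer in hyps:
                    if feats is o_feats:
                        continue
                    margin = wer - o_wer  # WER-proportional target margin
                    if score(o_feats) - score(feats) < margin:
                        # Hinge-style update toward the oracle hypothesis.
                        for f, v in o_feats.items():
                            w[f] += lr * v
                        for f, v in feats.items():
                            w[f] -= lr * v
        return dict(w)

    # At decoding time, rerank an N-best list with the learned weights:
    #   best = max(hyps, key=lambda h: sum(w.get(f, 0.0) * v
    #                                      for f, v in h[0].items()))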