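Recently, a good deal of work has studied and improved discriminative training of acoustic models. This thesis extends Minimum Phone Error (MPE) acoustic model training and adaptation and applies them to Mandarin large vocabulary continuous speech recognition. The experiments use field-reporter speech from Public Television Service (公視) news as the test bed. The acoustic models were first trained with Maximum Likelihood (ML), and two discriminative training methods, MPE and Maximum Mutual Information (MMI), were then compared. Relative to ML training, MPE training gave relative reductions of 15.52% in syllable error rate, 12.33% in character error rate, and 10.02% in word error rate, clearly outperforming MMI training. For unsupervised acoustic model adaptation, the thesis investigates techniques that adapt indirectly through transformation matrices in the acoustic model space and in the feature space. However, because no correct transcripts are available from which MPE can estimate the raw accuracy, the transcripts produced by recognition must be used in their place. As a result, unsupervised MPE adaptation cannot estimate the acoustic model parameters well, and recognition performance drops significantly. To remedy this, the thesis proposes the Raw Accuracy Prediction Model (RAPM) to improve unsupervised MPE adaptation, which yields a slight improvement in recognition performance.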
Discriminative training of acoustic models has been an active focus of research in automatic speech recognition (ASR) over the past few years. This thesis extensively investigates the Minimum Phone Error (MPE) approach to discriminative training and adaptation of acoustic models for Mandarin large vocabulary continuous speech recognition (LVCSR). All experiments were carried out on the Mandarin broadcast news corpus (MATBN). The experimental results show that MPE training gives significant improvements over baseline systems whose acoustic models were trained with the Maximum Likelihood (ML) or Maximum Mutual Information (MMI) criterion. Compared with the ML-trained acoustic models, the MPE-trained models obtained relative reductions of 15.52% in syllable error rate (SER), 12.33% in character error rate (CER), and 10.02% in word error rate (WER). Unsupervised adaptation of acoustic models via MPE-trained linear transformations, in either the model space or the feature space, was also studied, with promising results. However, because no correct reference transcript is available for the accuracy calculation, only the top-one automatic transcript can be used in its place. The unsupervised MPE-based adaptation techniques therefore may not accumulate good estimates of the acoustic model parameters, and their performance degrades substantially. To tackle this problem, the thesis proposes a novel Raw Accuracy Prediction Model (RAPM) to improve the MPE-based adaptation techniques, and initial experiments show slight performance gains.
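The SER, CER, and WER figures in the abstract are relative reductions, that is, the drop in error rate expressed as a fraction of the baseline error rate rather than as a difference in percentage points. The following minimal Python sketch illustrates that arithmetic. The function name `relative_reduction` and both input error rates are hypothetical, chosen only so that the result matches the 15.52% figure; the abstract does not report absolute error rates.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Return the relative error-rate reduction, in percent.

    Both arguments are error rates in the same unit (for example, percent).
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error rates, assumed purely for illustration.
hypothetical_ml_ser = 20.0     # ML-trained baseline SER, in percent (assumed)
hypothetical_mpe_ser = 16.896  # MPE-trained SER, in percent (assumed)

reduction = relative_reduction(hypothetical_ml_ser, hypothetical_mpe_ser)
print(f"Relative SER reduction: {reduction:.2f}%")
```

With these assumed inputs, an absolute drop of 3.104 percentage points corresponds to a relative reduction of 15.52%.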
Minimum Phone Error (MPE), Large Vocabulary Continuous Speech Recognition (LVCSR), Acoustic Model Training, Acoustic Model Adaptation, Maximum Mutual Information (MMI)