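A Study of Energy-Based Speech Feature Normalization Methods and Their Application to Voice Activity Detection (original title: 以能量為基礎之語音正規化方法研究及其於語音端點偵測之應用)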

Abstract

This thesis studies robust speech recognition in various noise environments, focusing on how to reconstruct the time-domain log-energy features of clean speech from those of noise-contaminated speech. Based on the distribution of log-energy features within each utterance, we aim to develop an efficient method that rescales the log-energy features of a noisy utterance, reducing the mismatch caused by environmental noise and improving recognition accuracy.

Time-domain observation of speech signals shows that lower-energy frames are more vulnerable to additive noise than higher-energy ones, and that under severe additive noise the log-energy values are lifted across almost the entire utterance. We therefore propose a simple but effective approach, log-energy rescaling normalization (LERN), which rescales the log-energy features of noisy speech toward those of clean speech.

Speech recognition experiments were conducted under various noise conditions using the Aurora-2.0 database published by the European Telecommunications Standards Institute (ETSI). The database is a small-vocabulary corpus of connected digit strings spoken in English and offers eight noise sources and seven signal-to-noise ratio (SNR) conditions. The results show that LERN performs considerably better than other conventional energy or log-energy feature normalization methods. A second set of experiments on large vocabulary continuous speech recognition (LVCSR), using the Mandarin broadcast news corpus (MATBN), also confirms that LERN improves recognition accuracy.
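The observation that motivates LERN (low-energy frames are disturbed more by additive noise than high-energy frames) can be illustrated with a minimal sketch. This is not the LERN algorithm, whose formula is not given in the abstract; it only computes frame log energies for a synthetic signal. The function name `frame_log_energy`, the 200-sample frame length, the 80-sample hop, and the synthetic Gaussian "speech" and noise are all assumptions made for illustration.

```python
import numpy as np


def frame_log_energy(signal, frame_len=200, hop=80, floor=1e-10):
    """Return the natural-log energy of each frame of a 1-D signal.

    frame_len and hop are illustrative values (25 ms and 10 ms at 8 kHz),
    not parameters taken from the thesis.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        energies[i] = np.sum(frame ** 2)
    # Floor the energy so that silent frames do not produce log(0).
    return np.log(np.maximum(energies, floor))


rng = np.random.default_rng(0)

# Synthetic stand-in for an utterance: a quiet half followed by a loud half.
quiet = 0.01 * rng.standard_normal(4000)
loud = 1.0 * rng.standard_normal(4000)
clean = np.concatenate([quiet, loud])

# Additive noise at a level between the quiet and loud segments.
noisy = clean + 0.1 * rng.standard_normal(clean.size)

e_clean = frame_log_energy(clean)
e_noisy = frame_log_energy(noisy)
shift = e_noisy - e_clean

# The first 40 frames lie fully in the quiet half, the last 40 in the loud half.
low_shift = shift[:40].mean()
high_shift = shift[-40:].mean()

print(f"mean log-energy shift, low-energy frames:  {low_shift:.2f}")
print(f"mean log-energy shift, high-energy frames: {high_shift:.2f}")
print(f"log-energy range clean: {np.ptp(e_clean):.2f}, noisy: {np.ptp(e_noisy):.2f}")
```

In this toy setting the quiet frames are lifted by several log units while the loud frames barely move, so the range of log-energy values within the utterance shrinks. A rescaling method such as the one proposed in the thesis would aim to undo that compression.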

Keywords

Speech Feature Normalization, Voice Activity Detection
