A Study of Tonal Feature Extraction Techniques and Their Application to Mandarin Tone Recognition
Date
2014
Abstract
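This thesis investigates how extracting tonal features at different levels affects applications related to Mandarin tone recognition. Tonal features can be roughly divided into those at the frame level and those at the level of pronunciation units; frame-level tone information is mostly represented by fundamental frequency values, and statistics computed over intervals such as phones or syllables then serve as tonal features.
To use pitch information more robustly, this thesis explores several pitch representations and normalization methods. The pitch representations include the fundamental frequency variation spectrum (FFV Spectrum), probability of voicing (POV), and high-order Mel-frequency cepstral coefficients (HMFCC); the normalization methods include mean and variance normalization (MVN) and histogram equalization (HEQ). This thesis also proposes using linear predictive coefficients (LPC) to approximate the curve of the normalized cross correlation function (NCCF), so that frame-level pitch information is expressed more completely. In addition, it compares several pitch statistics computed within and across sub-intervals, including the sub-interval pitch skewness and kurtosis features proposed here. Finally, different machine-learning classifiers, such as the support vector machine (SVM) and the deep neural network (DNN), are combined with the above tonal features for tone recognition.
Experiments were conducted on the Public Television Service broadcast news corpus (MATBN Corpus) and the National Taiwan Normal University corpus of Mandarin learners' speech (NTNU-MAS Corpus); the results show that the proposed methods perform well in tone recognition.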
This thesis delves into the extraction of tonal features with different levels of granularity, as well as their applications to Mandarin tone recognition. In the most general sense, tonal features can be extracted at either the frame level or the pronunciation-interval level. For the former, tonal features are usually embodied by the instantaneous pitch information of each frame, while for the latter, tonal features are typically represented as an ensemble of different pitch-related statistical features calculated from the pronunciation interval of interest (such as a phone, a syllable, or sub-intervals of them). In order to robustly derive the pitch information of each frame for use in Mandarin tone recognition, we investigate not only various pitch representations (such as the fundamental frequency variation spectrum (FFV Spectrum), probability of voicing (POV), and high-order Mel-frequency cepstral coefficients (HMFCC)) but also various pitch normalization mechanisms (such as mean and variance normalization (MVN) and histogram equalization (HEQ)). In particular, we present a novel use of linear predictive coefficients (LPC) to approximate the curve of the normalized cross correlation function (NCCF) so that the frame-level pitch information can be more faithfully rendered. In addition, we compare the utility of several pitch-related statistical features calculated within or among sub-intervals of a syllable, including our proposed features derived from the skewness and kurtosis of pitch values. Furthermore, we leverage different machine-learning techniques, such as the support vector machine (SVM) and the deep neural network (DNN), to work in concert with the aforementioned tonal features for Mandarin tone recognition.
Empirical evaluations performed on the MATBN corpus and the NTNU-MAS corpus suggest that our presented tonal feature extraction methods hold good promise for Mandarin tone recognition and are very competitive with existing methods.
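As a rough illustration of the interval-level features described in the abstract, the following sketch applies mean and variance normalization to a pitch contour and then computes the mean, skewness, and kurtosis in each sub-interval of a syllable. It is a minimal sketch and not the thesis implementation: the function names (`mvn`, `skewness`, `excess_kurtosis`, `subinterval_features`), the choice of three sub-intervals, the use of population moments, and the synthetic pitch values are all assumptions made for the example.

```python
import numpy as np


def mvn(f0):
    """Mean and variance normalization of a pitch contour (assumed voiced frames only)."""
    f0 = np.asarray(f0, dtype=float)
    std = f0.std()
    if std == 0.0:
        # A flat contour carries no variation; map it to all zeros.
        return np.zeros_like(f0)
    return (f0 - f0.mean()) / std


def skewness(x):
    """Population skewness: third central moment over the 1.5 power of the variance."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    if m2 == 0.0:
        return 0.0
    return float(np.mean(d ** 3) / m2 ** 1.5)


def excess_kurtosis(x):
    """Population excess kurtosis: fourth central moment over squared variance, minus 3."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    if m2 == 0.0:
        return 0.0
    return float(np.mean(d ** 4) / m2 ** 2 - 3.0)


def subinterval_features(f0, n_sub=3):
    """Split a syllable's pitch contour into n_sub parts and return
    [mean, skewness, excess kurtosis] for each part, concatenated."""
    f0 = np.asarray(f0, dtype=float)
    if len(f0) < n_sub:
        raise ValueError("need at least one frame per sub-interval")
    feats = []
    for part in np.array_split(f0, n_sub):
        feats.extend([part.mean(), skewness(part), excess_kurtosis(part)])
    return np.array(feats)


# Synthetic utterance of two syllables (values in Hz, invented for illustration):
# a falling contour followed by a rising one.
utterance_f0 = np.array([220, 215, 205, 190, 170, 150, 135, 125, 120,
                         130, 135, 145, 160, 175, 190, 200, 205, 210], dtype=float)
normalized = mvn(utterance_f0)             # statistics taken over the whole utterance
first_syllable = normalized[:9]            # assumed syllable boundary at frame 9
features = subinterval_features(first_syllable, n_sub=3)
```

Here the normalization statistics are taken over the whole utterance before the syllable is cut out, so the syllable keeps its position relative to the speaker's range; whether the thesis normalizes per utterance or per speaker is not stated in the abstract. The resulting nine-dimensional vector is the kind of input that could be passed to a classifier such as an SVM or a DNN.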
Keywords
Tone recognition, Tonal feature extraction, Linear predictive coefficients