A Study on Improving Rich Context Models for Mandarin Chinese Speech Synthesis
Date
2014
Abstract
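In this thesis, we first review three speech synthesis techniques: concatenative speech synthesis, statistical model-based speech synthesis, and hybrid-based speech synthesis. We take statistical model-based speech synthesis as the main research direction and introduce two techniques: hidden Markov model-based (HMM-based) speech synthesis, and HMM speech synthesis using rich context models. We apply both techniques to Mandarin Chinese speech synthesis and propose an improvement to rich context model-based synthesis: latent semantic analysis (LSA) is used to uncover the latent prosody of each context, in the hope that this latent prosody can guide the selection of prosodically similar models from the training corpus. The selected models are intended to yield a better initial speech parameter vector sequence, from which the speech parameter generation algorithm produces the speech parameter vector sequence of the target utterance used for actual synthesis. The experiments use the newly released NTUT-AB01-CH corpus (the National Taipei University of Technology Mandarin Chinese audiobook speech corpus) as training data, and a series of subjective and objective tests is used to assess the strengths of the proposed method and of existing methods within the statistical speech synthesis framework.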
In this thesis, we first provide a brief review of three mainstream frameworks for speech synthesis, namely, concatenative speech synthesis, statistical model-based speech synthesis and hybrid-based speech synthesis. Then, we focus our attention exclusively on comparing two important instantiations of the statistical model-based framework and their applications to Mandarin Chinese speech synthesis, which are the hidden Markov model-based method and the rich context model-based method respectively. In addition, we also explore the use of latent semantic analysis (LSA) to discover both lexical and prosodic cues inherent in the contextual descriptions of training speech utterances, with the hope that they can subsequently be used to obtain a good initialization for estimating the observation vector sequence of an utterance to be synthesized. A series of subjective and objective evaluations are conducted, using the newly released NTUT-AB01-CH corpus, to validate the performance merits of the aforementioned various methods stemming from the statistical model-based framework.
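As a rough illustration of the LSA idea mentioned in the abstract, the sketch below builds a feature-by-context count matrix, reduces it with a truncated SVD, and ranks training contexts by cosine similarity to a query context. It is only a sketch under stated assumptions, not the thesis's implementation: the feature labels (`tone=3`, `pos=initial`, and so on), the function names, and the choice of raw counts without weighting are all hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def build_lsa(contexts, k):
    """Build a rank-k latent space from contexts (lists of feature labels)."""
    vocab = sorted({f for ctx in contexts for f in ctx})
    index = {f: i for i, f in enumerate(vocab)}
    # Rows are context features, columns are training contexts (raw counts).
    A = np.zeros((len(vocab), len(contexts)))
    for j, ctx in enumerate(contexts):
        for f in ctx:
            A[index[f], j] += 1.0
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]
    # Each row is one training context in the latent space (V_k scaled by S_k).
    doc_vecs = Vt[:k, :].T * s[:k]
    return index, U_k, doc_vecs

def fold_in(ctx, index, U_k):
    """Project a context into the latent space; unseen features are ignored."""
    q = np.zeros(U_k.shape[0])
    for f in ctx:
        if f in index:
            q[index[f]] += 1.0
    return q @ U_k

def most_similar(ctx, index, U_k, doc_vecs):
    """Return the index of the closest training context and all cosine scores."""
    q = fold_in(ctx, index, U_k)
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q)
    sims = doc_vecs @ q / np.where(norms == 0, 1.0, norms)
    return int(np.argmax(sims)), sims

# Hypothetical contextual descriptions of four training units.
contexts = [
    ["tone=1", "pos=initial", "prev_tone=none"],
    ["tone=3", "pos=medial", "prev_tone=1"],
    ["tone=3", "pos=final", "prev_tone=4"],
    ["tone=4", "pos=medial", "prev_tone=3"],
]
index, U_k, doc_vecs = build_lsa(contexts, k=2)
best, sims = most_similar(["tone=3", "pos=final", "prev_tone=1"],
                          index, U_k, doc_vecs)
print("closest training context:", best)
```

In a rich context model setting, the top-ranked training contexts would be the candidates whose models supply the initial parameter sequence; how the thesis actually weights features and chooses the latent dimension is not stated in the abstract.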
Description
Keywords
Hidden Markov Model Based Speech Synthesis, Rich Context Models Based Speech Synthesis, Initial Speech Parameter Sequence, Latent Semantic Analysis, Vector Space Model