陳柏琳博士劉成韋Liu Cheng-Wei2019-09-052005-08-012019-09-052005http://etds.lib.ntnu.edu.tw/cgi-bin/gs32/gsweb.cgi?o=dstdcdr&s=id=%22G0069247006%22.&%22.id.&http://rportal.lib.ntnu.edu.tw:80/handle/20.500.12235/106321人類在幾千年的演化過程中,生活上的智慧不斷的累積傳承,因此過去文明變遷和人類演化的步伐是一致的。而如今科技進化的速度,卻早已大大的超越了人類演化的速度,並且日常生活中可以使用的多媒體影音資訊也越來越多,例如廣播電視節目、語音信件、演講錄影和數位典藏等,基於這個因素,可以隨時隨地的存取上述多媒體資訊的手持式行動裝置,也越來越受到重視。很明顯地,在上述的絕大部份多媒體中,語音可以說是最具語意的主要內涵之一。除此之外,語音自古以來一直都是人類最自然也最直接的溝通方式,若能利用語音來做為人類和科技產品之間的溝通橋樑,除了具備友善且有效的優點之外,更能省去繁雜的操作手續。現今市面上所見的科技產品,普遍的來說體積已越來越小,因此觸控的方式已漸漸地不再便利。此外傳統的人機介面如滑鼠和鍵盤,並非在所有的環境下都能適當的被使用,例如在行動的汽車環境下就顯得不夠方便。所以若能利用語音來做為人機介面,將會大大的提升便利性,使得科技和生活能夠更緊密的融合。然而語音辨識通常會遭受到一些複雜的因素干擾,諸如背景噪音,通道效應,以及語者和語言上的差異等諸多因素,使得辨識系統始終無法發揮最佳的效用,而辨識率往往也差強人意。 而本篇論文的主旨,在於針對目前許多語音強健技術進行研究比較並加以改良,最後整合出一套新的技術。而本論文主要的研究方法,是以查表式統計圖等化法為主,並和其它相關的技術結合來提升語音的強健性,最後將查表式統計圖等化法加以改良為改良式統計圖等化法,也就是將參考分佈依據音框的種類,分為靜音和語音。甚至根據中文特性,再將語音細分為聲母和韻母。而吾人所提出的改良式統計圖等化法,辨識率比傳統的查表示統計圖等化法相對提升了4.04% ; 對於原始辨識率也相對提升了至少5.75%。此外吾人也嘗試對語音訊號所擷取出的頻譜熵特徵與線性鑑別分析的技術結合,再與傳統的語音特徵參數合併來作為新的語音特徵參數,而辨識率也相對提升了近1.00%。若將新的特徵參數和本論文另一個研究主題(THEQ)作結合,更可以達到加成性的效果,平均相對辨識率提升至5.19%。In the course of evolution for thousands of years, human beings have continuously acquired as well as accumulated their knowledge from their daily life. Therefore, the civilization and evolution of human beings were almost on a par with each other in the past several thousand years. However, the quick development of technology nowadays has surmounted the evolution of human beings further. For example, huge quantities of multimedia information, such as broadcast radio and television programs, voice mails, digital archives and so on, are continuously growing and filling our computers, networks and lives. Therefore, accessing multimedia information at anytime, anywhere by small handheld mobile devices is now becoming more and more emphasized. It is well known that speech is the primary and the most convenient means of communication between people, and it will play a more active role and serve as the major human-machine interface for the interaction between people and different kinds of smart devices in the near future. Hence, it would be much more comfortable if we could use speech as the human-machine interface, and automatically transcribe, retrieve and summarize multimedia using the speech information inherent in it. However, speech recognition is usually interfered with some complicated factors, such as the background and channel noises, speaker and linguistic variations, etc., which make the current state-of-the-art recognition systems still far from perfect. With these observations in mind, in this thesis, several attempts were made to improve the current speech robustness techniques, as well as to find a way to integrate them together. The experiments were carried out on the Aurora 2.0 database and the Mandarin broadcast news speech collected in Taiwan. Considering the phonetic characteristics of the Chinese language, a modified histogram equalization (MHEQ) approach was first proposed. Separated reference histograms for the silence and speech segments (MHEQ-2), or more precisely, the silence, INITIAL and FINAL segments (MHEQ-3) in Chinese, were established. The proposed approach can yield above 5.75% and 4.04% relative improvements over the baseline system and the conventional table-based histogram equalization (THEQ) approach, respectively, in the clean environments. Furthermore, the spectral entropy features obtained after Linear Discriminant Analysis (LDA) were used to augment the Mel-frequency cepsctral features, and considerable improvements were initially indicated. Finally, fusion of the above proposed approaches was also investigated with very promising results demonstrated.特徵抽取特徵正規化統計圖等化法頻譜熵語音辨識強健性feature extractionfeature normalizationhistogram equalizationspectral entropyspeech recognitionaurora 2.0robust強健性語音辨識上關於特徵正規化與其它改良技術的研究A Study on Feature Normalization and Other Improved Techniques for Robust Speech Recognition