A Time-reversal Enhancement Network with Cross-domain Information for Noise-robust Automatic Speech Recognition

趙福安; Chao, Fu-An

dc.contributor	陳柏琳	zh_TW
dc.contributor	洪志偉	zh_TW
dc.contributor	Chen, Berlin	en_US
dc.contributor	Hung, Jeih-Weih	en_US
dc.contributor.author	趙福安	zh_TW
dc.contributor.author	Chao, Fu-An	en_US
dc.date.accessioned	2022-06-08T02:43:24Z
dc.date.available	2026-09-08
dc.date.available	2022-06-08T02:43:24Z
dc.date.issued	2021
dc.description.abstract	由於在現實生活中的噪音環境不可控制且干擾語音辨識的效能，加上前端發展已相當健全的語音增強(Speech Enhancement)技術，許多學者運用語音增強技術於語音辨識中獲得不錯的成果。近年來因為計算能力的發展，在眾多語音增強技術當中，許多研究開始發現相位(Phase)資訊對語音增強至關重要。在這些使用到相位資訊的語音增強方法，皆比原始單純使用幅度(Magnitude)頻譜的方法有更優越的效果。綜觀現階段最優異的語音增強技術，有學者使用對抗式訓練(Adversarial Training)將客觀度量指標與鑑別器(Discriminator)連結，最大化語音的感知質量(Perceptual Quality)達到了最好的效果，但最大化語音感知質量並不能保證在後端可以獲得更佳的語音辨識(Speech Recognition)結果。基於上述觀點，本論文提出了兩種新穎的語音增強方法：第一種為時序反轉增強網路(Time-reversal Enhancement NETwork, TENET)，它是由時序反轉(Time-reversal)與孿生網路(Siamese Network)技術所構成，可以與任何語音增強模型結合，以增加其語音增強的效果。第二種為跨域雙路徑注意力網路(Cross-domain Dual-path Transformer, CD-DPTNet)，在考慮到相位資訊的前提下，提出一個雙映射投影(Bi-projection Fusion, BPF)機制，融合頻域以及時域之特徵應用於語音增強。實驗於Voice Bank-DEMAND語音增強實驗之標準語料庫，並額外設置了未知環境噪音的測試集作為測試。本論文提出的方法與現階段最好的語音增強方法相比，在客觀評估指標PESQ、SI-SDR皆可以得到現階段最好的語音增強效果；進一步測試在語音辨識，也較其它方法能更有效的提升語音辨識之準確性。而結合TENET與CD-DPTNet兩種方法，在未知環境噪音的測試集可以使經多情境訓練之聲學模型降低約相對43 % 詞錯誤率(Word Error Rate, WER)。	zh_TW
dc.description.abstract	Environmental noise in real-world conditions varies widely and degrades the performance of automatic speech recognition (ASR). Speech enhancement (SE) techniques have therefore developed rapidly and play an important role as a front end to acoustic modeling, mitigating the effect of noise on speech. The current state of the art in SE adopts adversarial training, connecting an objective metric to the discriminator. However, there is no guarantee that optimizing the perceptual quality of speech will lead to improved ASR performance. In addition, many recent studies have suggested that phase information is crucial to SE, and time-domain SE techniques have shown promise in noise suppression and robust ASR. Based on these observations, we present a Time-reversal Enhancement NETwork (TENET) and a Cross-domain Dual-path Transformer (CD-DPTNet) for SE. TENET is a novel SE framework that can be combined with any SE model to improve its denoising performance; it leverages the time-reversed version of the noisy input signal in conjunction with a siamese network. CD-DPTNet, in turn, explores two phase-aware features, one in the time domain and one in the frequency domain, and integrates a bi-projection fusion (BPF) module to extract cross-domain cues for better SE performance. Extensive experiments on the Voice Bank-DEMAND dataset show that TENET achieves state-of-the-art results compared to a few top-of-the-line methods in terms of SE and ASR metrics. Finally, combining the two methods yields a relative word error rate (WER) reduction of 43% on a test set contaminated with unseen noise, compared to a strong baseline with multi-condition training.	en_US
dc.description.sponsorship	資訊工程學系	zh_TW
dc.identifier	60747002S-40178
dc.identifier.uri	https://etds.lib.ntnu.edu.tw/thesis/detail/85b560c42b5399bf3551e1a66a6b1cf3/
dc.identifier.uri	http://rportal.lib.ntnu.edu.tw/handle/20.500.12235/117290
dc.language	Chinese
dc.subject	語音增強	zh_TW
dc.subject	對抗式訓練	zh_TW
dc.subject	語音辨識	zh_TW
dc.subject	時序反轉	zh_TW
dc.subject	孿生網路	zh_TW
dc.subject	雙路徑注意力網路	zh_TW
dc.subject	雙映射投影	zh_TW
dc.subject	Speech Enhancement	en_US
dc.subject	Automatic Speech Recognition	en_US
dc.subject	Time Reversal	en_US
dc.subject	Siamese Network	en_US
dc.title	結合跨域資訊與時序反轉增強網路於強健性語音辨識	zh_TW
dc.title	A Time-reversal Enhancement Network with Cross-domain Information for Noise-robust Automatic Speech Recognition	en_US
dc.type	Academic thesis

Collections

Theses and Dissertations