A Time-reversal Enhancement Network with Cross-domain Information for Noise-robust Automatic Speech Recognition

dc.contributor: 陳柏琳 [zh_TW]
dc.contributor: 洪志偉 [zh_TW]
dc.contributor: Chen, Berlin [en_US]
dc.contributor: Hung, Jeih-Weih [en_US]
dc.contributor.author: 趙福安 [zh_TW]
dc.contributor.author: Chao, Fu-An [en_US]
dc.date.accessioned: 2022-06-08T02:43:24Z
dc.date.available: 2026-09-08
dc.date.available: 2022-06-08T02:43:24Z
dc.date.issued: 2021
dc.description.abstract: 由於在現實生活中的噪音環境不可控制且干擾語音辨識的效能,加上前端發展已相當健全的語音增強(Speech Enhancement)技術,許多學者運用語音增強技術於語音辨識中獲得不錯的成果。近年來因為計算能力的發展,在眾多語音增強技術當中,許多研究開始發現相位(Phase)資訊對語音增強至關重要。在這些使用到相位資訊的語音增強方法,皆比原始單純使用幅度(Magnitude)頻譜的方法有更優越的效果。綜觀現階段最優異的語音增強技術,有學者使用對抗式訓練(Adversarial Training)將客觀度量指標與鑑別器(Discriminator)連結,最大化語音的感知質量(Perceptual Quality)達到了最好的效果,但最大化語音感知質量並不能保證在後端可以獲得更佳的語音辨識(Speech Recognition)結果。基於上述觀點,本論文提出了兩種新穎的語音增強方法:第一種為時序反轉增強網路(Time-reversal Enhancement NETwork, TENET),它是由時序反轉(Time-reversal)與孿生網路(Siamese Network)技術所構成,可以與任何語音增強模型結合,以增加其語音增強的效果。第二種為跨域雙路徑注意力網路(Cross-domain Dual-path Transformer, CD-DPTNet),在考慮到相位資訊的前提下,提出一個雙映射投影(Bi-projection Fusion, BPF)機制,融合頻域以及時域之特徵應用於語音增強。實驗於Voice Bank-DEMAND語音增強實驗之標準語料庫,並額外設置了未知環境噪音的測試集作為測試。本論文提出的方法與現階段最好的語音增強方法相比,在客觀評估指標PESQ、SI-SDR皆可以得到現階段最好的語音增強效果;進一步測試在語音辨識,也較其它方法能更有效的提升語音辨識之準確性。而結合TENET與CD-DPTNet兩種方法,在未知環境噪音的測試集可以使經多情境訓練之聲學模型降低約相對43 % 詞錯誤率(Word Error Rate, WER)。 [zh_TW]
dc.description.abstract: Environmental noise in real-world conditions varies widely and degrades the performance of automatic speech recognition (ASR). Speech enhancement (SE) techniques have therefore developed rapidly and play an important role as a front end to acoustic modeling, mitigating the effect of noise on speech. The current state of the art in SE adopts adversarial training, connecting an objective metric to the discriminator. However, there is no guarantee that optimizing the perceptual quality of speech will lead to improved ASR performance. In addition, many recent studies have suggested that phase information is crucial in SE, and time-domain SE techniques have shown promise in noise suppression and robust ASR. Based on these observations, we present a Time-reversal Enhancement NETwork (TENET) and a Cross-domain Dual-path Transformer (CD-DPTNet) for SE. TENET is a novel SE framework that can be combined with any SE model to improve its denoising performance: it leverages the time-reversed version of the noisy input signal in conjunction with a siamese network. CD-DPTNet, in turn, uses time-domain and frequency-domain features that both take phase information into account, and integrates a bi-projection fusion (BPF) module to extract cross-domain cues for better SE performance. Extensive experiments on the Voicebank-DEMAND dataset show that TENET achieves state-of-the-art results compared with several top-performing methods on both SE and ASR metrics. Finally, combining the two methods yields a relative word error rate (WER) reduction of about 43% on a test set contaminated with unseen noise, compared with a strong baseline trained with multi-condition training. [en_US]
dc.description.sponsorship: 資訊工程學系 [zh_TW]
dc.identifier: 60747002S-40178
dc.identifier.uri: https://etds.lib.ntnu.edu.tw/thesis/detail/85b560c42b5399bf3551e1a66a6b1cf3/
dc.identifier.uri: http://rportal.lib.ntnu.edu.tw/handle/20.500.12235/117290
dc.language: Chinese
dc.subject: 語音增強 [zh_TW]
dc.subject: 對抗式訓練 [zh_TW]
dc.subject: 語音辨識 [zh_TW]
dc.subject: 時序反轉 [zh_TW]
dc.subject: 孿生網路 [zh_TW]
dc.subject: 雙路徑注意力網路 [zh_TW]
dc.subject: 雙映射投影 [zh_TW]
dc.subject: Speech Enhancement [en_US]
dc.subject: Automatic Speech Recognition [en_US]
dc.subject: Time Reversal [en_US]
dc.subject: Siamese Network [en_US]
dc.title: 結合跨域資訊與時序反轉增強網路於強健性語音辨識 [zh_TW]
dc.title: A Time-reversal Enhancement Network with Cross-domain Information for Noise-robust Automatic Speech Recognition [en_US]
dc.type: Academic thesis
