Combining Cross-domain Information and a Time-reversal Enhancement Network for Robust Speech Recognition
Date
2021
Abstract
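Because noise in real-world environments is uncontrollable and degrades speech recognition performance, and because front-end speech enhancement (SE) techniques are already well developed, many researchers have applied SE to speech recognition with good results. In recent years, thanks to advances in computing power, many SE studies have found that phase information is crucial to speech enhancement. SE methods that use phase information all outperform the original methods that rely on the magnitude spectrum alone. Among current state-of-the-art SE techniques, some researchers use adversarial training to connect an objective metric to a discriminator, maximizing the perceptual quality of speech and achieving the best results; however, maximizing perceptual quality does not guarantee better speech recognition results in the back end. Based on these observations, this thesis proposes two novel SE methods. The first is the Time-reversal Enhancement NETwork (TENET), which is built from time reversal and a Siamese network and can be combined with any SE model to improve its enhancement performance. The second is the Cross-domain Dual-path Transformer (CD-DPTNet), which, taking phase information into account, introduces a bi-projection fusion (BPF) mechanism that fuses frequency-domain and time-domain features for SE. Experiments are conducted on Voice Bank-DEMAND, the standard corpus for SE experiments, with an additional test set containing unseen environmental noise. Compared with the current best SE methods, the proposed methods achieve state-of-the-art results on the objective metrics PESQ and SI-SDR; when further evaluated on speech recognition, they also improve recognition accuracy more effectively than other methods. Combining TENET and CD-DPTNet reduces the word error rate (WER) of an acoustic model trained with multi-condition training by about 43% relative on the test set with unseen environmental noise.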
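As a point of reference for the SI-SDR metric named above, the following is a minimal sketch of the commonly used scale-invariant signal-to-distortion ratio. The function name `si_sdr`, the `eps` stabilizer, and the synthetic signals are assumptions made for illustration; the abstract does not describe the evaluation code actually used in the thesis.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB (illustrative helper, not the thesis code)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Whatever is left after the projection is counted as distortion.
    distortion = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)


# Synthetic stand-ins for clean and enhanced speech (assumption: random signals).
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
lightly_corrupted = clean + 0.1 * noise
heavily_corrupted = clean + 1.0 * noise
```

A higher value means the estimate is closer to the reference, and rescaling the estimate leaves the score essentially unchanged, which is what "scale-invariant" refers to.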
Because environmental noise in real-world conditions varies and degrades the performance of automatic speech recognition (ASR), speech enhancement (SE) techniques have developed rapidly and play an important role ahead of acoustic modeling in mitigating the effects of noise on speech. The current state of the art in SE adopts adversarial training that connects an objective metric to the discriminator. However, there is no guarantee that optimizing the perceptual quality of speech will lead to improved ASR performance. In addition, many studies in recent years have suggested that phase information is crucial in SE, and time-domain SE techniques have shown promise in noise suppression and robust ASR. Based on these observations, we present a Time-reversal Enhancement NETwork (TENET) and a Cross-domain Dual-path Transformer (CD-DPTNet) for SE. TENET is a novel SE framework that can be combined with any SE model to improve its denoising performance; it leverages the time-reversed version of the noisy input signal in conjunction with a Siamese network. CD-DPTNet, in turn, explores two features that take phase information into account, one in the time domain and one in the frequency domain, and integrates a bi-projection fusion (BPF) module to extract cross-domain cues for better SE performance. Extensive experiments on the Voicebank-DEMAND dataset show that TENET achieves state-of-the-art results compared with several top-of-the-line methods in terms of both SE and ASR metrics. Finally, by combining the two methods, we obtain a relative WER reduction of 43% on a test set contaminated with unseen noise, compared with a strong baseline trained with multi-condition training.
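The time-reversal idea can be sketched roughly as follows: the same model (shared weights, as in a Siamese setup) processes the noisy signal and its time-reversed copy, and the second output is flipped back so the two can be compared. This is only an illustrative sketch. The abstract does not give the loss functions or the way the two branches are used, so `toy_enhancer`, `siamese_time_reversal`, the mean-squared-error terms, and the consistency term are all assumptions rather than the thesis's actual formulation.

```python
import numpy as np


def toy_enhancer(x, alpha=0.9):
    """Hypothetical stand-in for an SE model: a causal one-pole smoother."""
    y = np.empty_like(x, dtype=np.float64)
    state = 0.0
    for n, sample in enumerate(x):
        state = alpha * state + (1.0 - alpha) * sample
        y[n] = state
    return y


def siamese_time_reversal(noisy, clean, model):
    """Run one shared model on the signal and on its time-reversed copy."""
    forward_out = model(noisy)
    # Reverse the input, enhance it, then flip the output back to normal order.
    backward_out = model(noisy[::-1])[::-1]
    # Assumed training signals: each branch should match the clean target,
    # and the two branches should agree with each other.
    recon_loss = 0.5 * (np.mean((forward_out - clean) ** 2)
                        + np.mean((backward_out - clean) ** 2))
    consistency_loss = np.mean((forward_out - backward_out) ** 2)
    return forward_out, backward_out, recon_loss, consistency_loss


# Synthetic example data (assumption: a sine wave plus Gaussian noise).
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 5 * np.linspace(0.0, 1.0, 800))
noisy = clean + 0.3 * rng.standard_normal(800)
fwd, bwd, recon, consistency = siamese_time_reversal(noisy, clean, toy_enhancer)
```

Because the toy model is causal, the two branches generally disagree, which is what makes the reversed view informative in this sketch; any callable that maps a waveform to a waveform of the same length could be passed as `model`.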
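To make the "relative" WER figure concrete, the short sketch below computes a relative reduction from a baseline WER and an improved WER. The numbers used here (20.0% and 11.4%) are hypothetical values chosen only so the arithmetic lands on 43%; they are not results reported in the thesis, and `relative_wer_reduction` is an illustrative helper name.

```python
def relative_wer_reduction(baseline_wer, improved_wer):
    """Fraction by which the WER drops relative to the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


# Hypothetical WERs in percent (not taken from the thesis).
baseline = 20.0
improved = 11.4
reduction = relative_wer_reduction(baseline, improved)
print(f"{reduction:.0%}")  # prints 43%
```

A relative reduction of 43% is therefore different from an absolute drop of 43 percentage points: in this hypothetical case the absolute drop is only 8.6 points.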
Keywords
Speech Enhancement, Adversarial Training, Automatic Speech Recognition, Time Reversal, Siamese Network, Dual-path Transformer, Bi-projection Fusion