A Study of Speech Enhancement: Adaptability and Interpretability
Date
2024
Abstract
This thesis examines speech enhancement (SE), the process of refining speech signals by reducing noise and distortion. Using deep neural networks (DNNs), it addresses two fundamental challenges: 1) the compatibility between SE and automatic speech recognition (ASR) systems, and 2) the interpretability of DNN-based SE models.
The first challenge is motivated by the artifacts that SE models can introduce during processing, which can degrade ASR performance and call for a re-evaluation of the learning objective. To address this, we propose a novel noise- and artifact-aware loss function (NAaLoss), which significantly improves ASR performance while preserving SE quality.
For the second challenge, we explore Sinc-based convolution (Sinc-conv) within DNN-based SE as a way to balance the interpretability of spectral approaches with the learning freedom of time-domain methods. Building on this, we devise the reformed Sinc-conv (rSinc-conv), which not only advances the state of the art in SE but also reveals the specific frequency components that neural networks prioritize during SE.
The contributions are threefold: 1) defining processing artifacts in SE, demonstrating the effectiveness of NAaLoss, visualizing artifacts to gain insight, and bridging the gap between SE and ASR objectives; 2) developing rSinc-conv, tailored for SE, which offers advantages in training efficiency, filter diversity, and interpretability; and 3) analyzing which frequencies the networks prioritize, exploring filters of different shapes, and evaluating various SE models, which together advance the understanding and improvement of SE networks. Overall, this work aims to contribute to the discourse on SE and to pave the way for more robust and efficient SE techniques in real-world scenarios.
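The abstract does not define Sinc-conv or rSinc-conv in detail. As background only, the sketch below shows the generic windowed-sinc band-pass filter that Sinc-based convolution layers are commonly built on, where each filter is determined by just two cutoff frequencies. It is not the thesis's rSinc-conv; the function names `sinc_bandpass` and `gain_at`, the Hamming window, and the default kernel size and sample rate are all assumptions chosen for illustration.

```python
import math

def sinc_bandpass(f_low, f_high, kernel_size=101, sample_rate=16000):
    """Return the taps of a windowed-sinc band-pass filter (cutoffs in Hz).

    Hypothetical helper for illustration; not taken from the thesis.
    """
    if kernel_size < 3 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be an odd integer >= 3")
    if not 0 <= f_low < f_high <= sample_rate / 2:
        raise ValueError("need 0 <= f_low < f_high <= sample_rate / 2")

    # Normalized cutoffs in cycles per sample.
    f1 = f_low / sample_rate
    f2 = f_high / sample_rate
    half = kernel_size // 2

    taps = []
    for i in range(kernel_size):
        n = i - half  # time index centered on zero
        # Band-pass = difference of two ideal low-pass filters.
        if n == 0:
            ideal = 2.0 * (f2 - f1)
        else:
            ideal = (math.sin(2 * math.pi * f2 * n)
                     - math.sin(2 * math.pi * f1 * n)) / (math.pi * n)
        # Hamming window (an assumption) to limit truncation ripple.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        taps.append(ideal * window)
    return taps

def gain_at(taps, freq_hz, sample_rate=16000):
    """Magnitude of the filter's frequency response at freq_hz."""
    w = 2 * math.pi * freq_hz / sample_rate
    re = sum(h * math.cos(w * k) for k, h in enumerate(taps))
    im = sum(h * math.sin(w * k) for k, h in enumerate(taps))
    return math.hypot(re, im)

if __name__ == "__main__":
    taps = sinc_bandpass(1000, 2000)
    for f in (500, 1500, 5000):
        print(f, "Hz ->", round(gain_at(taps, f), 3))
```

Because the only free parameters are the two cutoffs, a trained layer of such filters can be read directly as a set of frequency bands, which is the kind of interpretability the abstract refers to. In a learned layer the cutoffs would be trainable parameters rather than fixed arguments.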
Keywords
Speech Enhancement, Compatibility, Noise-robust Speech Recognition, Processing Artifacts, Interpretability, Sinc-convolution, Crucial Bands