A Study on Robust Speech Recognition and Enhancement via Unsupervised Domain Adaptation

Date

2025

Abstract

Pre-trained models for automatic speech recognition (ASR) and speech enhancement (SE) perform well under matched noise and channel conditions, but their performance degrades significantly under domain shift, particularly in the presence of unseen noise and channel distortions. In this study, we present URSA-GAN, a unified, domain-aware generative framework designed to mitigate mismatches in both noise and channel conditions. URSA-GAN uses a dual-embedding architecture consisting of a noise encoder and a channel encoder, each pre-trained on limited in-domain data to capture domain-relevant representations. These embeddings condition a GAN-based speech generator, which synthesizes speech that is acoustically aligned with the target domain while preserving phonetic content. To further improve generalization, we propose dynamic stochastic perturbation, a novel regularization technique that introduces controlled variability into the embeddings during generation and thereby promotes robustness to unseen domains. Empirical results show that, under channel mismatch, URSA-GAN reduces the character error rate (CER) by 20.02% and 9.64% on the Hakka Across Taiwan (HAT) and Taiwanese Across Taiwan (TAT) corpora, respectively. On the VoiceBank-DEMAND dataset, the framework yields a 2.95% improvement in SE performance under noise mismatch. Further evaluation on a compound test set with both channel and noise degradations confirms that URSA-GAN generalizes to complex real-world acoustic conditions, achieving relative improvements of 16.16% in ASR performance and 15.58% in SE metrics.
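
The abstract does not specify how the embeddings are perturbed or how they condition the generator. As a rough illustration only, the sketch below assumes zero-mean Gaussian noise whose scale is redrawn on every call, and assumes conditioning by concatenating the embeddings onto each frame's features. The function names `perturb_embedding` and `condition_input` and the parameter `max_scale` are hypothetical and do not come from the thesis.

```python
import random


def perturb_embedding(embedding, max_scale=0.1, rng=None):
    """Add zero-mean Gaussian noise to an embedding (assumed form, not the thesis's).

    The noise scale is drawn uniformly from [0, max_scale] on each call, so the
    amount of variability changes from one generation step to the next.
    """
    if rng is None:
        rng = random.Random()
    scale = rng.uniform(0.0, max_scale)
    return [x + rng.gauss(0.0, scale) for x in embedding]


def condition_input(frame_features, noise_emb, channel_emb, max_scale=0.1, rng=None):
    """Append perturbed noise and channel embeddings to every frame.

    Concatenation is an assumption made for illustration; the abstract only
    states that the embeddings condition the generator.
    """
    if rng is None:
        rng = random.Random()
    noise_p = perturb_embedding(noise_emb, max_scale, rng)
    channel_p = perturb_embedding(channel_emb, max_scale, rng)
    # The same perturbed embeddings are shared by all frames of one utterance.
    return [list(frame) + noise_p + channel_p for frame in frame_features]


if __name__ == "__main__":
    frames = [[0.1, 0.2], [0.3, 0.4]]          # two frames, two features each
    conditioned = condition_input(
        frames, noise_emb=[1.0, 2.0, 3.0], channel_emb=[4.0, 5.0],
        rng=random.Random(0),
    )
    print(len(conditioned), len(conditioned[0]))
```

Setting `max_scale=0.0` disables the perturbation, which makes it easy to compare a generator trained with and without this form of regularization.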

Keywords

Automatic speech recognition, speech enhancement, domain mismatch, unsupervised domain adaptation, generative adversarial networks
