Improving the Robustness, Compactness, and Flexibility of Personalized Voice Activity Detection
Date
2025
Abstract
With the growing adoption of personal voice interfaces in wearable and in-vehicle systems, Personalized Voice Activity Detection (PVAD) has become a key research topic in speech front-end technology. Conventional Voice Activity Detection (VAD) systems struggle to distinguish speaker identity, which leads to false detections in multi-speaker environments. PVAD addresses this by combining speaker embeddings with acoustic features to determine whether speech originates from the target user. However, existing PVAD approaches still face three main challenges: limited robustness under noise or scarce training data, suboptimal cross-modal fusion strategies that hinder model compactness, and a lack of task flexibility, which makes it hard to switch between VAD and PVAD on demand. To address these challenges, this thesis proposes three methods: 1) Speaker-Conditioned Sinc-Extractor PVAD (SCSE-PVAD), which uses speaker embeddings to modulate the parameters of a learnable sinc filterbank so that the model extracts speaker-discriminative spectral features, remaining robust under extreme noise and varying amounts of training data while also converging faster; 2) Conditional Intermediate Attention PVAD (COIN-AT-PVAD), which inserts a FiLM-based attention fusion strategy between the feature extraction and classification modules to integrate speaker and acoustic information, significantly reducing the parameter count while maintaining performance comparable to advanced baselines; and 3) Flexible Dynamic Encoding RNN (FDE-RNN), which adopts a two-stage architecture with dynamic layer skipping and a detachable personalization module, switching automatically between VAD and PVAD tasks according to speech activity and further reducing computational cost. Experimental results show that SCSE-PVAD excels in robustness, COIN-AT-PVAD achieves high model compactness without sacrificing accuracy, and FDE-RNN surpasses most state-of-the-art models in both accuracy and deployment practicality.
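The idea of a sinc filterbank whose parameters depend on the speaker can be illustrated with a minimal pure-Python sketch. It is not the thesis's implementation: the function names, the `tanh`-bounded shift rule in `conditioned_cutoffs`, the weight vectors, and the default kernel size and sample rate are all assumptions made for illustration. The only standard part is the band-pass kernel itself, built as the difference of two windowed low-pass sinc filters.

```python
import math


def sinc(x):
    """Unnormalized sinc, sin(x)/x, with the removable singularity at 0."""
    return 1.0 if x == 0 else math.sin(x) / x


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def conditioned_cutoffs(base_low, base_band, embedding, w_low, w_band, max_shift=50.0):
    """Hypothetical conditioning rule (an assumption, not the thesis's method):
    shift the band edges in Hz by a bounded function of the speaker embedding."""
    f_low = max(0.0, base_low + max_shift * math.tanh(dot(w_low, embedding)))
    band = max(1.0, base_band + max_shift * math.tanh(dot(w_band, embedding)))
    return f_low, f_low + band


def bandpass_kernel(f_low, f_high, kernel_size=101, sample_rate=16000):
    """Hamming-windowed band-pass kernel: the difference of two ideal
    low-pass sinc filters. kernel_size must be odd."""
    half = kernel_size // 2
    f1 = f_low / sample_rate   # normalized cutoffs in cycles per sample
    f2 = f_high / sample_rate
    kernel = []
    for n in range(-half, half + 1):
        ideal = 2 * f2 * sinc(2 * math.pi * f2 * n) - 2 * f1 * sinc(2 * math.pi * f1 * n)
        window = 0.54 + 0.46 * math.cos(math.pi * n / half)  # Hamming, centered at n = 0
        kernel.append(ideal * window)
    return kernel


def gain(kernel, freq_hz, sample_rate=16000):
    """Magnitude of the kernel's frequency response at one frequency."""
    half = len(kernel) // 2
    re = sum(h * math.cos(2 * math.pi * freq_hz * (i - half) / sample_rate)
             for i, h in enumerate(kernel))
    im = sum(h * math.sin(2 * math.pi * freq_hz * (i - half) / sample_rate)
             for i, h in enumerate(kernel))
    return math.hypot(re, im)


# Usage: a made-up 3-dimensional speaker embedding nudges one filter's band edges.
speaker = [0.2, -0.1, 0.4]
f_low, f_high = conditioned_cutoffs(300.0, 3100.0, speaker, [0.5, 0.5, 0.5], [0.1, 0.0, -0.1])
kernel = bandpass_kernel(f_low, f_high)
print(round(f_low, 1), round(f_high, 1), round(gain(kernel, 1000.0), 3))
```

In a trainable model the cutoffs and the conditioning weights would be learned jointly with the rest of the network; here they are fixed numbers so that the effect of the embedding on the filter can be inspected directly.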
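The FiLM operation that the abstract names as the basis of the COIN-AT-PVAD fusion can be sketched in a few lines. FiLM (feature-wise linear modulation) scales and shifts each feature channel using values predicted from a conditioning vector, here the speaker embedding. The sketch below shows only this plain FiLM step, not the attention component or the actual layer layout of the thesis; the function names, dimensions, and weights are illustrative assumptions.

```python
def linear(weights, bias, vec):
    """Dense layer without activation: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def film(frames, embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM to a sequence of acoustic feature frames.

    gamma and beta are predicted once from the speaker embedding and then
    applied to every frame as gamma * x + beta, channel by channel.
    """
    gamma = linear(w_gamma, b_gamma, embedding)
    beta = linear(w_beta, b_beta, embedding)
    return [[g * x + b for g, x, b in zip(gamma, frame, beta)] for frame in frames]


# Usage: two frames with three feature channels, a 2-dimensional embedding.
frames = [[1.0, 1.0, 1.0], [2.0, 0.0, -1.0]]
embedding = [1.0, 2.0]
w_gamma = [[1, 0], [0, 1], [1, 1]]   # hypothetical weights
b_gamma = [0, 0, 0]
w_beta = [[0, 0], [0, 0], [0, 0]]
b_beta = [0.5, 0.5, 0.5]
print(film(frames, embedding, w_gamma, b_gamma, w_beta, b_beta))
```

Because the conditioning adds only two small projection layers, this kind of fusion is cheap in parameters, which is consistent with the compactness goal stated in the abstract. With zero weights, `b_gamma` set to ones, and `b_beta` set to zeros, the layer reduces to the identity and passes the frames through unchanged.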
Keywords
Personalized Voice Activity Detection, Sinc-convolution, Speaker-conditioned, Dynamic Neural Networks