Zero-Shot Keyword Spotting with Multi-Granularity Learning and Effective False-Alarm Suppression
Date
2025
Abstract
With the advancement of technology, not only does nearly everyone carry a smartphone, but home appliances are also increasingly moving toward voice-controlled smart home systems, making keyword spotting (KWS) a core technology in smart devices and voice assistants. Traditional fixed-vocabulary keyword spotting requires collecting speech samples for each new keyword in advance and retraining the model before that keyword can be recognized, which limits flexibility, raises costs, and complicates deployment. To overcome these limitations, open-vocabulary KWS techniques have gained traction in recent years. User-defined zero-shot keyword spotting (ZSKWS), which does not rely on domain-specific pre-labeled training data, is essential for building adaptable and personalized voice interfaces. However, such systems still face significant challenges, including constrained computational resources and limited annotated training data. Existing methods also struggle to distinguish acoustically similar keywords, which often leads to an elevated false alarm rate (FAR) in real-world deployments. To address these limitations, we propose a lightweight, real-time zero-shot KWS framework that jointly learns utterance- and phoneme-level alignments through a cross-attention mechanism. The framework employs a multi-granularity contrastive learning objective and strengthens training by generating phonetically confusable keyword pairs via text-to-speech (TTS) data augmentation. Evaluations on four public benchmark datasets show that the model achieves state-of-the-art performance. On the Google Speech Commands v2 and Qualcomm datasets, the equal error rate (EER) is reduced to 3%, the area under the curve (AUC) exceeds 99%, and accuracy is above 90%. Furthermore, the false alarm rate on the AMI Meeting Corpus is as low as 0.007%, while the model remains lightweight at 655K parameters. These results demonstrate that the proposed model is computationally efficient and supports real-time deployment on resource-constrained devices.
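To make the abstract's two central ideas concrete, the following PyTorch code is a minimal sketch of how a cross-attention alignment module combined with a multi-granularity (utterance-level plus phoneme-level) contrastive objective could be realized. It is a hypothetical reconstruction, not the thesis's actual implementation: the class and function names, dimensions, mean pooling, and the temperature value are all assumptions made for illustration.

```python
# A minimal sketch (not the authors' released code) of cross-attention
# alignment with a two-granularity contrastive objective. All names,
# dimensions, and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiGranularityMatcher(nn.Module):
    """Cross-attention alignment plus utterance-level pooling (a sketch)."""

    def __init__(self, dim=128, n_heads=4):
        super().__init__()
        # Phoneme embeddings query the audio frames, producing one audio
        # summary per phoneme: the phoneme-level alignment.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, audio_frames, phoneme_embs):
        # audio_frames: (B, T, dim) frame-level acoustic features
        # phoneme_embs: (B, P, dim) phoneme embeddings of the keyword text
        aligned, _ = self.cross_attn(phoneme_embs, audio_frames, audio_frames)
        # Utterance-level vectors by mean pooling each stream.
        audio_utt = F.normalize(audio_frames.mean(dim=1), dim=-1)  # (B, dim)
        text_utt = F.normalize(phoneme_embs.mean(dim=1), dim=-1)   # (B, dim)
        return aligned, audio_utt, text_utt


def utterance_loss(audio_utt, text_utt, temperature=0.07):
    # Symmetric InfoNCE over the batch: matched audio/text pairs are
    # positives; every other pairing in the batch (including any
    # TTS-synthesized confusable keywords mixed in) acts as a negative.
    logits = audio_utt @ text_utt.t() / temperature  # (B, B)
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def phoneme_loss(aligned, phoneme_embs, temperature=0.07):
    # Finer-grained InfoNCE within each utterance: the cross-attended audio
    # summary at position p should match phoneme p; the other phonemes of
    # the same keyword serve as negatives.
    B, P, _ = aligned.shape
    a = F.normalize(aligned, dim=-1)
    p = F.normalize(phoneme_embs, dim=-1)
    logits = torch.bmm(a, p.transpose(1, 2)).reshape(-1, P) / temperature
    targets = torch.arange(P).repeat(B)
    return F.cross_entropy(logits, targets)


# Toy usage with random tensors standing in for encoder outputs.
matcher = MultiGranularityMatcher()
audio = torch.randn(8, 100, 128)   # batch of 8 clips, 100 frames each
phones = torch.randn(8, 12, 128)   # keywords padded to 12 phonemes
aligned, a_utt, t_utt = matcher(audio, phones)
loss = utterance_loss(a_utt, t_utt) + phoneme_loss(aligned, phones)
print(float(loss))
```

Summing the two terms gives one multi-granularity objective: the utterance-level term separates whole keywords, while the phoneme-level term forces the alignment to be locally discriminative, which is what helps tell acoustically similar keywords apart.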
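The TTS augmentation step first needs phonetically confusable keyword pairs to synthesize. As a hedged illustration of how such pairs might be mined, the sketch below ranks candidate words by phoneme-level Levenshtein distance against a target keyword. The tiny pronunciation lexicon, the plain edit-distance criterion, and the distance threshold are all assumptions; the thesis's actual pipeline, including the TTS synthesis itself, may select pairs differently.

```python
# A hypothetical sketch of mining phonetically confusable keyword pairs to
# feed a TTS engine as hard negatives. The toy lexicon and the plain phoneme
# edit distance are illustrative assumptions, not the thesis's method.

def edit_distance(a, b):
    # Standard Levenshtein distance over phoneme sequences.
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (pa != pb))  # substitution
    return dp[-1]


# Assumed ARPAbet-style pronunciations; a real pipeline would use a full
# dictionary or a grapheme-to-phoneme model instead.
LEXICON = {
    "alexa":  ["AH", "L", "EH", "K", "S", "AH"],
    "alexis": ["AH", "L", "EH", "K", "S", "IH", "S"],
    "elect":  ["IH", "L", "EH", "K", "T"],
    "hello":  ["HH", "AH", "L", "OW"],
}


def confusable_pairs(keyword, max_dist=2):
    # Words within a small phoneme edit distance of the keyword are treated
    # as confusable; their TTS renderings would serve as hard negatives.
    ref = LEXICON[keyword]
    scored = sorted((edit_distance(ref, phones), word)
                    for word, phones in LEXICON.items() if word != keyword)
    return [(d, w) for d, w in scored if d <= max_dist]


print(confusable_pairs("alexa"))  # e.g. [(2, 'alexis')]
```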
Keywords
zero-shot keyword spotting, contrastive learning, false alarm