Lightweight Open-Vocabulary Keyword Spotting Based on Contrastive Training (基於對比式訓練之輕量化開放詞彙的關鍵詞辨識)

Date

2024

Abstract

As smart devices become more widespread, keyword spotting is becoming increasingly important. Its goal is to determine whether specific keywords are present in continuous speech. The task is challenging because a system must both detect the target keywords accurately and reject other, non-target keywords. With the rapid development of deep neural networks, keyword spotting based on them has achieved significant gains in accuracy. Conventional systems of this kind, however, require a large amount of speech containing the target keywords as training data. As a result, they recognize only a fixed set of keywords, and replacing a keyword after training means collecting new speech data for it and retraining the model. This paper focuses on implementing an open-vocabulary keyword spotting system. The system uses a self-attention mechanism to combine speech features and text embedding vectors into joint embeddings, and a discriminator computes a confidence score for each joint embedding. The system decides whether to activate based on these confidence scores. Contrastive learning is used to address the false alarms that occur, when multiple keywords are configured, because incorrect keywords receive high confidence scores. For audio encoder pre-training, in addition to an audio encoder pre-trained on a classification task over a corpus with 5000 keyword classes, we adopt a more parameter-efficient audio encoder architecture that reduces the parameter count by 100K and is pre-trained on a classification task over 500 keyword classes.
On 10 new keywords that did not appear during training, the approach achieves 94.08% accuracy, a 12% improvement over the baseline method.
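
The pipeline described in the abstract (joint embedding via self-attention, a discriminator that outputs a confidence score, a contrastive objective over several candidate keywords, and a threshold-based activation decision) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the identity attention projections, the single linear layer used as the discriminator, the softmax cross-entropy form of the contrastive loss, the 0.5 threshold, and all function names and toy vectors.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(seq):
    """Single-head scaled dot-product self-attention (identity projections, assumed)."""
    d = len(seq[0])
    out = []
    for q in seq:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in seq])
        out.append([sum(w * v[i] for w, v in zip(weights, seq)) for i in range(d)])
    return out

def joint_embedding(audio_frames, text_tokens):
    """Attend over the concatenated audio and text sequence, then mean-pool."""
    attended = self_attention(audio_frames + text_tokens)
    d = len(attended[0])
    return [sum(vec[i] for vec in attended) / len(attended) for i in range(d)]

def confidence(joint, weights, bias):
    """Hypothetical discriminator: one linear layer followed by a sigmoid."""
    return 1.0 / (1.0 + math.exp(-(dot(joint, weights) + bias)))

def contrastive_loss(scores, target_index):
    """Softmax cross-entropy over candidate keyword scores (one assumed loss form)."""
    return -math.log(softmax(scores)[target_index])

def detect(scores_by_keyword, threshold=0.5):
    """Return the best-scoring keyword if it clears the threshold, else None."""
    keyword, score = max(scores_by_keyword.items(), key=lambda kv: kv[1])
    return keyword if score >= threshold else None

# Toy usage: two audio frames and two enrolled keywords (all values made up).
audio = [[0.2, 0.1], [0.4, 0.3]]
text = {"lights on": [[1.0, 0.0]], "play music": [[-1.0, 0.0]]}
w, b = [2.0, 0.0], 0.0
scores = {kw: confidence(joint_embedding(audio, emb), w, b) for kw, emb in text.items()}
result = detect(scores)
```

In this sketch the contrastive term is smallest when the correct keyword outscores the other enrolled keywords, which is one way to discourage the high-confidence incorrect keywords that the abstract identifies as the source of false alarms.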

Keywords

keyword spotting, user-defined, zero-shot, contrastive learning, open-vocabulary
