A Study on End-to-End Contextualized Speech Recognition Techniques

Date

2024

Abstract

With the widespread adoption of smart home devices and smartphone voice assistants, voice interaction technology has become an indispensable part of daily life. Advances in end-to-end (E2E) neural network models have significantly improved the performance of automatic speech recognition (ASR), surpassing conventional hybrid models on many benchmarks. However, E2E ASR models still struggle to recognize domain-specific vocabulary such as contact names and place names, a weakness that matters especially for downstream applications such as natural language understanding. This study aims to counter the performance degradation of these models in real-world scenarios by enhancing contextualized ASR models. We first analyze in depth the recognition errors of current state-of-the-art E2E ASR models and identify the main problems: insufficient prior knowledge and an inadequate ability to capture contextual information. To address them, we propose the XPhoneAdapter model, which integrates a novel self-supervised phoneme encoder, XPhoneBERT, to provide richer phoneme-aware features. We further propose remedies for the context/no-context imbalance and long-tailed distribution problems, and introduce the Q-HNW method for hard-negative training to improve model robustness. The results show that combining fine-grained phoneme-aware self-supervised features with enhanced hard-negative training achieves up to an 18% relative word error rate (WER) reduction and a 35% relative improvement in rare-word error rate (C-WER) on the Librispeech dataset. Experiments on the AISHELL-1 benchmark further confirm the effectiveness of the proposed methods, showing clear performance gains. The main contributions of this thesis are as follows:
1) A detailed analysis of the recognition errors of state-of-the-art E2E ASR models, identifying the mismatch between the vocabulary distributions of the training and test environments as a key factor.
2) Identification of the two main factors that hinder ASR generalization: insufficient prior knowledge and an inadequate ability to capture contextual information.
3) The XPhoneAdapter model, which introduces the novel self-supervised phoneme encoder XPhoneBERT to provide richer phoneme-aware features.
4) A context-balanced adaptation method for the context/no-context imbalance and long-tailed distribution problems, improving performance on low-frequency context words.
5) The Q-HNW method for hard-negative training, which strengthens model robustness in challenging recognition scenarios (see the hard-negative sketch after this list).
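The Q-HNW procedure itself is not detailed in this abstract; the snippet below is only a minimal, hypothetical sketch of the general idea behind hard-negative training for contextual biasing: building a per-utterance bias list whose distractors are phonetically close to the target phrase. The names phoneme_similarity and build_bias_list, the toy lexicon, and the use of difflib.SequenceMatcher as a stand-in similarity measure are all assumptions for illustration, not the thesis' method.

```python
# Sketch: hard-negative selection for contextual-biasing training (hypothetical).
# Distractor phrases that sound like the target force the biasing module to
# learn fine-grained, phoneme-level discrimination.
from difflib import SequenceMatcher
import random


def phoneme_similarity(a: list[str], b: list[str]) -> float:
    """Similarity of two phoneme sequences in [0, 1] (stand-in for a learned metric)."""
    return SequenceMatcher(None, a, b).ratio()


def build_bias_list(target: str,
                    phoneme_lexicon: dict[str, list[str]],
                    pool: list[str],
                    n_hard: int = 5,
                    n_random: int = 5,
                    seed: int = 0) -> list[str]:
    """Return a training bias list: the target, its hardest distractors, and random fillers."""
    rng = random.Random(seed)
    tgt_ph = phoneme_lexicon[target]
    candidates = [p for p in pool if p != target]
    # Rank candidate phrases by phonetic closeness to the target phrase.
    ranked = sorted(candidates,
                    key=lambda p: phoneme_similarity(tgt_ph, phoneme_lexicon[p]),
                    reverse=True)
    hard = ranked[:n_hard]
    rest = [p for p in candidates if p not in hard]
    fillers = rng.sample(rest, min(n_random, len(rest)))
    return [target] + hard + fillers


if __name__ == "__main__":
    lexicon = {
        "marion": ["M", "EH", "R", "IY", "AH", "N"],
        "marian": ["M", "EH", "R", "IY", "AH", "N"],
        "miriam": ["M", "IH", "R", "IY", "AH", "M"],
        "boston": ["B", "AO", "S", "T", "AH", "N"],
        "austin": ["AO", "S", "T", "AH", "N"],
    }
    print(build_bias_list("marion", lexicon, list(lexicon), n_hard=2, n_random=1))
```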
In the era of smart home devices and phone-based smart assistants, voice interaction technology has become increasingly prevalent. The advancements in end-to-end (E2E) neural models have significantly improved the performance of automatic speech recognition (ASR) models, surpassing conventional hybrid models on various benchmark tasks. Despite these advancements, E2E ASR models face challenges in accurately recognizing domain-specific phrases, particularly named entities like contact names and geo-locations, which are critical for downstream tasks such as natural language understanding. This study aims to address the performance decline of ASR models in real-world scenarios by enhancing contextualized ASR (CASR) models. Our research investigates the limitations of current CASR models, identifying key issues such as insufficient prior knowledge and inadequate model capability to capture contextual information. We propose the XPhoneAdapter, which integrates a novel self-supervised phoneme encoder, XPhoneBERT, to provide richer phoneme-aware representations. Additionally, we address the context/no-context imbalance and long-tailed distribution problems through a context-balanced adaptation approach and introduce the Q-HNW method for hard negative training to enhance model robustness. Our findings demonstrate that the synergy of fine-grained phoneme-aware SSL features and enhanced hard negative training can achieve up to an 18% relative word error rate (WER) reduction and a 35% relative improvement in context word error rate (C-WER) on the Librispeech dataset. Experiments on the AISHELL-1 benchmark dataset further validate the effectiveness of our proposed methods, showing significant performance improvements. This thesis contributes to the field of ASR by providing an in-depth analysis of recognition errors, proposing novel methods to enhance CASR models, and demonstrating significant performance gains, thereby paving the way for more reliable and accurate speech recognition in real-world applications.
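As a rough illustration of how a contextualized ASR model can inject phrase-level context into the acoustic encoder, the sketch below shows a generic cross-attention biasing adapter in PyTorch: encoder frames attend over precomputed, phoneme-aware embeddings of the bias phrases (for example, pooled XPhoneBERT outputs), plus a learnable "no bias" entry. This is the common contextual-adapter pattern, not the thesis' exact XPhoneAdapter; the class name, dimensions, and pooling choice are assumptions.

```python
# Sketch: cross-attention biasing adapter over phoneme-aware phrase embeddings.
import torch
import torch.nn as nn


class BiasingAdapter(nn.Module):
    def __init__(self, enc_dim: int = 256, ctx_dim: int = 768, n_heads: int = 4):
        super().__init__()
        self.ctx_proj = nn.Linear(ctx_dim, enc_dim)            # map phrase vectors into encoder space
        self.no_bias = nn.Parameter(torch.zeros(1, 1, enc_dim))  # learnable "no context" option
        self.attn = nn.MultiheadAttention(enc_dim, n_heads, batch_first=True)
        self.out_proj = nn.Linear(enc_dim, enc_dim)

    def forward(self, enc_frames: torch.Tensor, ctx_embs: torch.Tensor) -> torch.Tensor:
        """enc_frames: (B, T, enc_dim); ctx_embs: (B, N, ctx_dim) phoneme-aware phrase vectors."""
        keys = self.ctx_proj(ctx_embs)                                       # (B, N, enc_dim)
        keys = torch.cat([self.no_bias.expand(keys.size(0), -1, -1), keys], dim=1)
        biased, _ = self.attn(enc_frames, keys, keys)                        # frames attend over phrases
        return enc_frames + self.out_proj(biased)                            # residual biasing


if __name__ == "__main__":
    adapter = BiasingAdapter()
    frames = torch.randn(2, 120, 256)      # a batch of acoustic encoder outputs
    phrases = torch.randn(2, 10, 768)      # 10 precomputed bias-phrase embeddings per utterance
    print(adapter(frames, phrases).shape)  # torch.Size([2, 120, 256])
```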

Keywords

Automatic speech recognition, Contextualized automatic speech recognition, Long-tailed distribution, Self-supervised models, Hard negative training
