A Study on the Effectiveness of Enhancing Encoder Language Acuity in Code-Switching Speech Recognition
Date
2024
Abstract
With the advent of End-to-End (E2E) neural networks, the field of Automatic Speech Recognition (ASR) has entered a new era. E2E ASR consolidates the modules of the traditional speech recognition framework into a single, unified neural network that transcribes input speech signals directly into the corresponding text. This not only simplifies modeling but also greatly reduces the inconsistencies that can arise when each module is trained independently. In monolingual recognition, E2E ASR models have reached near-human accuracy, an important milestone in the evolution of speech recognition technology.

According to statistics, more than 60% of the world's population today is multilingual. In spoken communication, multilingual speakers often switch between languages unconsciously, driven by factors such as their learning environment and changes in mood. This phenomenon, known as Code-Switching (CS), is especially prevalent in highly internationalized countries such as Taiwan, Singapore, and Malaysia. For CS speech, a model must not only account for acoustic features but also learn to identify precisely the moments at which the language switches. The complexity of this task often degrades the performance of E2E ASR systems, which makes CS one of the most pressing challenges in speech recognition.

To address this challenge, we propose the D-MoE architecture, an encoder designed to exploit the underlying information shared between languages while effectively reducing language confusion in the acoustic embeddings. We then introduce a technique that establishes language boundaries inside the encoder, implicitly enriching the language knowledge carried by the acoustic embeddings and further sharpening the model's acuity to different languages.
Keywords
automatic speech recognition, code-switching, mixture of experts, disentangle loss, intermediate loss, non-peaky CTC loss