提升編碼器語言敏銳度在語碼轉換語音辨識中的有效性之研究

楊子霆; Tzu-Ting, Yang

dc.contributor	陳柏琳	zh_TW
dc.contributor	Chen, Berlin	en_US
dc.contributor.author	楊子霆	zh_TW
dc.contributor.author	Tzu-Ting, Yang	en_US
dc.date.accessioned	2024-12-17T03:37:23Z
dc.date.available	2024-08-14
dc.date.issued	2024
dc.description.abstract	隨著端到端 (End-to-End, E2E) 神經網路的出現，語音辨識 (Automatic Speech Recognition, ASR) 領域進入了一個革命性的全新時代。E2E ASR 將傳統語音辨識框架中的模組整合為一個單一、統一的神經網路，能夠直接將輸入的語音信號轉錄為相應的文本。這一創新不僅簡化了神經網路的建模過程，還大大減少了各個模組獨立訓練時可能產生的不一致性。在單語辨識效能方面，E2E ASR 模型已經達到了接近人類水準的準確性，這標誌著語音辨識技術演進中的一個重要里程碑。根據統計，現今全球超過60%的人口是多語言使用者。在口頭交流中，多語者經常因為學習環境和情緒變化等因素無意識地在不同語言之間切換。這種現象被稱為語碼轉換（Code-Switching, CS），在台灣、新加坡和馬來西亞等高度國際化的國家中特別普遍。在語碼轉換中，模型不僅需要考慮聲學特徵，還需要學會精確識別語言切換的時刻。這一任務的複雜性經常導致端到端語音識別系統（E2E ASR）性能下降。因此，解決語碼轉換問題是語音識別領域中最緊迫的挑戰之一。為了解決這一挑戰，我們提出了 D-MoE 架構，這是一種設計用於同時利用語言間共享的底層資訊並有效減少聲音嵌入中語言混淆的編碼器。隨後，我們實施了一項創新技術，透過在編碼器內部建立語言邊界，潛移默化地豐富聲音嵌入中的語言知識，進一步增強了模型對不同語言的敏銳度。	zh_TW
dc.description.abstract	With the advent of End-to-End (E2E) neural networks, the field of Automatic Speech Recognition (ASR) has entered a new era. E2E ASR consolidates the modules of a traditional speech recognition pipeline into a single, unified neural network that transcribes input speech signals directly into text. This not only simplifies modeling but also greatly reduces the inconsistencies that can arise when each module is trained independently. In monolingual recognition, E2E ASR models have reached near-human accuracy, marking a significant milestone in the evolution of speech recognition technology. According to statistics, more than 60% of the global population today is multilingual. In spoken communication, multilingual speakers often switch between languages unconsciously, owing to factors such as their learning environment and changes in mood. This phenomenon, known as Code-Switching (CS), is especially prevalent in highly internationalized countries such as Taiwan, Singapore, and Malaysia. In CS speech, a model must not only account for acoustic features but also learn to pinpoint the exact moments at which the language switches. The complexity of this task often degrades the performance of E2E ASR systems, making CS one of the most pressing challenges in speech recognition. To tackle this challenge, we propose the D-MoE architecture, an encoder designed to leverage the underlying information shared between languages while effectively reducing language confusion in the acoustic embeddings. We then design and implement a technique that establishes language boundaries within the encoder, implicitly enriching the language knowledge in the acoustic embeddings and further enhancing the model's acuity to different languages.	en_US
dc.description.sponsorship	資訊工程學系	zh_TW
dc.identifier	61047031S-46143
dc.identifier.uri	https://etds.lib.ntnu.edu.tw/thesis/detail/d7b096d8e4cbe5953dd856693b544d3c/
dc.identifier.uri	http://rportal.lib.ntnu.edu.tw/handle/20.500.12235/123703
dc.language	English
dc.subject	自動語音辨識	zh_TW
dc.subject	語碼轉換	zh_TW
dc.subject	混合專家模型	zh_TW
dc.subject	解絞損失	zh_TW
dc.subject	中間層損失	zh_TW
dc.subject	非尖峰CTC損失	zh_TW
dc.subject	automatic speech recognition	en_US
dc.subject	code-switching	en_US
dc.subject	mixture of experts	en_US
dc.subject	disentangle loss	en_US
dc.subject	intermediate loss	en_US
dc.subject	non-peaky CTC loss	en_US
dc.title	提升編碼器語言敏銳度在語碼轉換語音辨識中的有效性之研究	zh_TW
dc.title	A Study on the Effectiveness of Language Acuity-Enhanced Encoders in Code-Switching Speech Recognition	en_US
dc.type	Academic thesis

Files

Original bundle

Name: 202400046143-108448.pdf
Size: 2.29 MB
Format: Adobe Portable Document Format
Description: Academic thesis

Collections

Theses and dissertations