A Study on the Effectiveness of Language Acuity-Enhanced Encoders in Code-Switching Speech Recognition

dc.contributor | 陳柏琳 | zh_TW
dc.contributor | Chen, Berlin | en_US
dc.contributor.author | 楊子霆 | zh_TW
dc.contributor.author | Yang, Tzu-Ting | en_US
dc.date.accessioned | 2024-12-17T03:37:23Z
dc.date.available | 2024-08-14
dc.date.issued | 2024
dc.description.abstract | 隨著端到端 (End-to-End, E2E) 神經網路的出現,語音辨識 (Automatic Speech Recognition, ASR) 領域進入了一個革命性的全新時代。E2E ASR 將傳統語音辨識框架中的模組整合為一個單一、統一的神經網路,能夠直接將輸入的語音信號轉錄為相應的文本。這一創新不僅簡化了神經網路的建模過程,還大大減少了各個模組獨立訓練時可能產生的不一致性。在單語辨識效能方面,E2E ASR 模型已經達到了接近人類水準的準確性,這標誌著語音辨識技術演進中的一個重要里程碑。根據統計,現今全球超過60%的人口是多語言使用者。在口頭交流中,多語者經常因為學習環境和情緒變化等因素無意識地在不同語言之間切換。這種現象被稱為語碼轉換(Code-Switching, CS),在台灣、新加坡和馬來西亞等高度國際化的國家中特別普遍。在語碼轉換中,模型不僅需要考慮聲學特徵,還需要學會精確識別語言切換的時刻。這一任務的複雜性經常導致端到端語音識別系統(E2E ASR)性能下降。因此,解決語碼轉換問題是語音識別領域中最緊迫的挑戰之一。為了解決這一挑戰,我們提出了 D-MoE 架構,這是一種設計用於同時利用語言間共享的底層資訊並有效減少聲音嵌入中語言混淆的編碼器。隨後,我們實施了一項創新技術,透過在編碼器內部建立語言邊界,潛移默化地豐富聲音嵌入中的語言知識,進一步增強了模型對不同語言的敏銳度。 | zh_TW
dc.description.abstract | With the advent of End-to-End (E2E) neural networks, the field of Automatic Speech Recognition (ASR) has entered a new era. E2E ASR consolidates the modules of traditional speech recognition frameworks into a single, unified neural network that transcribes input speech signals directly into the corresponding text. This not only streamlines neural network modeling but also significantly reduces the inconsistencies that can arise when each module is trained independently. In monolingual recognition, E2E ASR models have achieved near-human accuracy, marking a significant milestone in the evolution of speech recognition technology. According to statistics, over 60% of the global population today is multilingual. In spoken communication, multilingual speakers often switch between languages unconsciously, due to factors such as their learning environment and changes in mood. This phenomenon, known as Code-Switching (CS), is especially prevalent in highly internationalized countries such as Taiwan, Singapore, and Malaysia. In CS, a model must not only account for acoustic features but also learn to pinpoint the exact moments at which the language switches. The complexity of this task often degrades the performance of E2E ASR systems, which makes addressing CS one of the most pressing challenges in speech recognition. To tackle this challenge, we propose the D-MoE architecture, an encoder designed to leverage the underlying information shared between languages while effectively reducing language confusion in the acoustic embeddings. We then design and implement a technique that establishes language boundaries within the encoder, implicitly enriching the language knowledge in the acoustic embeddings and further enhancing the model's acuity to different languages. | en_US
dc.description.sponsorship | 資訊工程學系 (Department of Computer Science and Information Engineering) | zh_TW
dc.identifier | 61047031S-46143
dc.identifier.uri | https://etds.lib.ntnu.edu.tw/thesis/detail/d7b096d8e4cbe5953dd856693b544d3c/
dc.identifier.uri | http://rportal.lib.ntnu.edu.tw/handle/20.500.12235/123703
dc.language | English
dc.subject | 自動語音辨識 | zh_TW
dc.subject | 語碼轉換 | zh_TW
dc.subject | 混合專家模型 | zh_TW
dc.subject | 解絞損失 | zh_TW
dc.subject | 中間層損失 | zh_TW
dc.subject | 非尖峰CTC損失 | zh_TW
dc.subject | automatic speech recognition | en_US
dc.subject | code-switching | en_US
dc.subject | mixture of experts | en_US
dc.subject | disentangle loss | en_US
dc.subject | intermediate loss | en_US
dc.subject | non-peaky CTC loss | en_US
dc.title | 提升編碼器語言敏銳度在語碼轉換語音辨識中的有效性之研究 | zh_TW
dc.title | A Study on the Effectiveness of Language Acuity-Enhanced Encoders in Code-Switching Speech Recognition | en_US
dc.type | Academic thesis

Files

Original bundle

Name: 202400046143-108448.pdf
Size: 2.29 MB
Format: Adobe Portable Document Format
Description: Academic thesis
