A Study on Novel Speaker Diarization Techniques

Date

2024

Abstract

Speaker diarization has broad application potential in broadcast programs, meetings, online media, and other domains. It can be combined with automatic speech recognition (ASR) or speech emotion recognition (SER) to extract meaningful information from conversational content. However, the error rate of ASR increases significantly when more than two speakers are present, a phenomenon known as the cocktail party problem. To handle an unknown number of speakers and improve overall performance, the End-to-End Neural Diarization with Encoder-Decoder Attractors (EEND-EDA) model was developed, and numerous studies have since explored this problem in depth. Although some studies have combined speaker diarization with ASR or large language models (LLMs) to increase practicality, these methods do not improve the encoder's hidden states. This study therefore focuses on improving the processing of speech feature signals to raise model performance. First, we change the model framework from Transformer to Branchformer to strengthen the model's speaker recognition performance. Second, to guide the attention mechanism to focus more on voice activity, we add an auxiliary loss function. Finally, we attempt to modify the log-Mel features to improve the model's generalization ability. We examine whether these changes improve speaker diarization performance with both a fixed and an unknown number of speakers, and offer new options for the model.

Keywords

speaker diarization, end-to-end neural diarization, multi-head attention, auxiliary loss
