A Comparative Study of Different Utterance Embedding Functions for Speaker Verification
Date
2021
Abstract
A speaker embedding model uses a neural network to map utterances into a space in which distances reflect the similarity between speakers. This form of metric learning was first proposed for face recognition and has in recent years been applied to speaker verification, driving much of the recent progress on the task. However, there is still a significant accuracy gap between speakers seen in the training set and unseen speakers. For unseen speakers, few-shot learning is well suited. In real-world settings, a speaker verification system must identify the speaker of short utterances, whereas the speakers' utterances available during training are relatively long, and recent speaker verification models perform poorly on short utterances. Here we use prototypical network loss, triplet loss, and state-of-the-art few-shot learning methods to optimize the speaker embedding model. We use the VoxCeleb1 and VoxCeleb2 datasets, which contain 1,221 and 5,994 speakers, respectively. The experimental results show that the speaker embedding model performs better with our proposed loss functions.
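The two metric-learning objectives named in the abstract can be illustrated concretely. The sketch below is a minimal NumPy version (the function names, array shapes, and margin value are illustrative assumptions, not taken from the thesis): the prototypical loss classifies each query embedding by its distance to per-speaker prototypes (means of support embeddings), while the triplet loss pulls same-speaker pairs together and pushes different-speaker pairs at least a margin apart.

```python
import numpy as np

def prototypical_loss(support, query, query_labels):
    """Prototypical-network loss on speaker embeddings.

    support:      (n_speakers, n_support, dim) support-set embeddings
    query:        (n_query, dim) query embeddings
    query_labels: (n_query,) speaker indices for each query

    Each speaker's prototype is the mean of its support embeddings;
    queries are scored by negative squared Euclidean distance to each
    prototype, and a cross-entropy loss is taken over those scores.
    """
    prototypes = support.mean(axis=1)                            # (n_speakers, dim)
    d2 = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2                                                 # closer -> higher score
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(query_labels)), query_labels].mean()

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: anchor/positive are the same speaker,
    anchor/negative are different speakers. All arrays are (n, dim)."""
    d_ap = ((anchor - positive) ** 2).sum(axis=-1)
    d_an = ((anchor - negative) ** 2).sum(axis=-1)
    return np.maximum(d_ap - d_an + margin, 0.0).mean()
```

With two well-separated toy speakers, queries near their own prototype yield a near-zero prototypical loss, and a triplet whose negative is already farther than the margin yields zero triplet loss.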
Keywords
Speaker verification, Speech recognition, Few-shot learning