A Study on Language Model Adaptation Using Speaker Word-Usage Characteristics for Meeting Speech Recognition
Date
2018
Abstract
In a meeting environment, faithfully producing the meeting minutes is an important task: by reading the minutes, people who did not attend can understand what was discussed, and because the spoken content has been transcribed into text, retrieval of relevant meetings from a database becomes more accurate. However, manually transcribing a meeting is labor-intensive and time-consuming, so automatic speech recognition (ASR) technologies can save considerable time and effort. Meeting corpora, though, differ greatly from the corpora most often dealt with, such as news datasets: a meeting corpus usually contains uncommon words, short sentences, code-mixing phenomena, and diverse personal speaking characteristics.
In view of the above, this thesis sets out to alleviate the problems caused by the differing word-usage characteristics of the speakers in a meeting. The presence of multiple speakers implies a variety of language patterns; that is, people do not strictly follow grammar when speaking, and they often hesitate, pause, or fall back on personal idioms and other idiosyncratic ways of speaking. Nevertheless, the language models previously employed in meeting ASR rarely account for these facts; instead they assume that all speakers share the same language patterns, pooling the manual transcripts of all speakers into a single training set from which a single language model is trained. To relax this assumption, we endeavor to provide additional speaker-dependent information to the training and prediction phases of language modeling, that is, to conduct speaker adaptation of language models. Two test-time scenarios are considered, "known speakers" and "unknown speakers"; for each scenario we propose corresponding methods for extracting speaker features and investigate how these features can aid the training of language models.
A series of speaker-adaptation experiments on Mandarin and English meeting speech recognition tasks shows that the proposed language models perform well under both the known-speaker and unknown-speaker scenarios, and achieve performance gains over several state-of-the-art methods compared in the thesis.
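As a concrete illustration of the speaker-adaptation idea sketched above, the following minimal PyTorch example conditions a recurrent language model on a learned speaker embedding that is concatenated with each word embedding. This is only a hedged sketch of the general technique, not the thesis's actual architecture; the class name, dimensions, and one-embedding-per-speaker design are hypothetical choices made for this example, and the unknown-speaker scenario (where a speaker representation must be inferred at test time) is deliberately omitted.

# Illustrative sketch of speaker-adapted recurrent language modeling:
# a learned speaker vector is concatenated with each word embedding so
# the RNN can condition its next-word predictions on who is speaking.
# All names and sizes are hypothetical choices for the example.
import torch
import torch.nn as nn

class SpeakerAdaptedRNNLM(nn.Module):
    def __init__(self, vocab_size, num_speakers, word_dim=256,
                 spk_dim=32, hidden_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        # One learned vector per known speaker; in the unknown-speaker
        # scenario this vector would instead be estimated at test time.
        self.spk_emb = nn.Embedding(num_speakers, spk_dim)
        self.rnn = nn.LSTM(word_dim + spk_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, speaker_id):
        # word_ids: (batch, seq_len); speaker_id: (batch,)
        w = self.word_emb(word_ids)                   # (B, T, word_dim)
        s = self.spk_emb(speaker_id)                  # (B, spk_dim)
        s = s.unsqueeze(1).expand(-1, w.size(1), -1)  # (B, T, spk_dim)
        h, _ = self.rnn(torch.cat([w, s], dim=-1))
        return self.out(h)                            # next-word logits

# Toy usage: score word sequences for two different speakers.
model = SpeakerAdaptedRNNLM(vocab_size=10000, num_speakers=8)
words = torch.randint(0, 10000, (2, 12))
speakers = torch.tensor([0, 3])
logits = model(words, speakers)
print(logits.shape)  # torch.Size([2, 12, 10000])

In this design the speaker vector biases every time step of the recurrent state, so the same word history can yield different next-word distributions for different speakers.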
Keywords
meeting speech recognition, speech recognition, language modeling, speaker adaptation, recurrent neural networks