A Study on Reranking for Conversational Speech Recognition Using Cross-Utterance Contextualized Language Models and Graph Neural Networks
Date
2021
Abstract
Language models play an extremely important role in a speech recognition system, quantifying the semantic and syntactic acceptability of a recognized candidate hypothesis (word sequence) in natural language. In recent years, language models based on neural network architectures have clearly outperformed traditional n-gram language models, mainly because the former have a superior ability to capture longer-range context. However, given the high computational complexity of neural language models, they are usually applied in a second-pass N-best hypothesis reranking stage to rescore each candidate. This alternative, lightweight approach, which can employ more refined neural language models that integrate task-related cues or adaptation mechanisms to better rerank the candidates, has attracted wide interest and become an important research direction in speech recognition. On the other hand, effectively recognizing conversational speech plays a key role on the way toward intelligent conversational AI. Related applications, including virtual assistants, smart speakers, and interactive voice response systems, are ubiquitous in our daily lives. In these real-world applications, the system typically (or ideally) interacts with users over multiple turns of speech, and such conversational speech exhibits common linguistic phenomena such as topical coherence and word recurrence; these phenomena and how to address them, however, remain underexplored. Based on the above observations, we first use contextualized language models (e.g., BERT) to reformulate the N-best reranking task as a prediction problem. Furthermore, to enhance our models for conversational speech, we explore a series of topic and history adaptation techniques, which can be roughly divided into three parts: (1) an effective method for integrating cross-utterance information into the model; (2) an effective method for extracting task-related global information with unsupervised topic modeling; and (3) a novel method for extracting global structural dependencies among words with graph neural networks (e.g., GCN). We conduct a series of experiments on the AMI benchmark meeting corpus to evaluate the proposed methods. Experimental results show the effectiveness and feasibility of the proposed methods, in terms of word error rate reduction, compared with several state-of-the-art and mainstream methods.
Language models (LMs) play a significant role in an automatic speech recognition (ASR) system, providing a likelihood for any word-sequence hypothesis. Over recent years, neural network (NN)-based LMs have been shown to consistently outperform classical n-gram LMs, due mainly to their superior ability to model longer-range contextual dependencies. Nevertheless, because of their high computational complexity, neural LMs are usually applied at the second-pass N-best hypothesis reranking stage to rescore the hypotheses produced by the ASR system. This alternative, lightweight approach, which reranks N-best hypotheses with more sophisticated neural LMs, has attracted considerable interest and become an important research direction in ASR. Meanwhile, effective recognition of conversational speech plays a crucial role on the path toward conversational AI. Applications ranging from virtual assistants and smart speakers to interactive voice response (IVR) systems, among others, have become ubiquitous in our daily lives. These real-world applications typically interact with users over multiple turns of speech, which exhibit global, conversation-level phenomena such as topical coherence and word recurrence; these phenomena, however, remain underexplored. In view of the above, we frame ASR N-best reranking with contextualized language models (such as BERT) as a prediction problem. To further enhance our models for conversational speech, we explore a set of topic and history modeling techniques that fall broadly into three parts: 1) an effective way to incorporate cross-utterance information into the model; 2) an efficient way to leverage task-specific global information with unsupervised topic modeling; and 3) a novel approach to distilling global structural dependencies among words with a graph neural network (such as a GCN). We carry out a series of empirical experiments with the proposed methods on the AMI benchmark meeting corpus.
Experimental results demonstrate the effectiveness and feasibility of our methods, in terms of word error rate (WER) reduction, in comparison with several current top-of-the-line methods.
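The second-pass rescoring scheme described above can be sketched as follows. This is a minimal illustration, not the thesis's actual configuration: the interpolation weight, the `Hypothesis` container, and the stub LM scorer (standing in for, e.g., a BERT-based scorer) are all assumptions for exposition.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    asr_score: float  # first-pass log-score from the ASR decoder

def rerank(nbest, lm_score_fn, weight=0.5):
    """Rerank N-best hypotheses by linearly interpolating the first-pass
    ASR score with a second-pass LM score (higher is better for both).
    `lm_score_fn` stands in for a neural LM scorer such as BERT."""
    combined = lambda h: (1 - weight) * h.asr_score + weight * lm_score_fn(h.text)
    return sorted(nbest, key=combined, reverse=True)

# Toy LM scores (hypothetical lookup table); a real system would compute
# something like a pseudo-log-likelihood from a contextualized LM.
toy_lm = {"recognize speech": -0.5, "wreck a nice beach": -5.0}.get
nbest = [Hypothesis("wreck a nice beach", -0.8),   # preferred by the first pass
         Hypothesis("recognize speech", -1.5)]
best = rerank(nbest, toy_lm)[0]   # LM evidence flips the ranking
```

With `weight=0.5`, the combined score of "recognize speech" (-1.0) overtakes that of "wreck a nice beach" (-2.9), so the second pass corrects the first-pass error; setting `weight=0` recovers the original ASR ranking.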
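As a rough illustration of the first technique (cross-utterance information), recent conversation history can be prepended to each hypothesis in the LM input so that a contextualized LM can exploit topical coherence and word recurrence across turns. The BERT-style marker tokens and the history window size below are assumptions for illustration, not the exact scheme used in the thesis.

```python
def build_cross_utterance_input(history, hypothesis, n_history=2):
    """Concatenate the last `n_history` recognized utterances with the
    current N-best hypothesis in a BERT-style segmented input string."""
    context = " [SEP] ".join(history[-n_history:])
    if context:
        return f"[CLS] {context} [SEP] {hypothesis} [SEP]"
    return f"[CLS] {hypothesis} [SEP]"

history = ["let's review the project status", "the deadline moved to friday"]
inp = build_cross_utterance_input(history, "we should update the slides")
```

The resulting string would then be tokenized and scored by the contextualized LM in place of the isolated hypothesis.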
Keywords
automatic speech recognition, language modeling, conversational speech, cross-utterance information, N-best hypothesis reranking, contextualized language models, graph neural networks, BERT, GCN