Advisor: Chen, Berlin (陳柏琳)
Author: Wang, Hsin-Wei (王馨偉)
Dates: 2023-12-08; 2023-08-14; 2023-12-08; 2023
https://etds.lib.ntnu.edu.tw/thesis/detail/77c25d6cac1160290903dc87d95ced9d/
http://rportal.lib.ntnu.edu.tw/handle/20.500.12235/121565

Abstract (translated from Chinese):
Thanks to synergistic breakthroughs in neural model architectures and training algorithms, automatic speech recognition (ASR) has recently achieved great success and reached human parity. However, ASR performance in many real-world use cases remains far from perfect. There has been a surge of research interest in designing and developing feasible post-processing modules that improve recognition performance by refining ASR output sentences; these modules fall roughly into two categories. The first is ASR N-best hypothesis reranking, which aims to find the hypothesis with the lowest word error rate in a given list of N hypotheses. The other, inspired by Chinese spelling correction (CSC) and English spelling correction (ESC), aims to detect and correct text-level errors in ASR output sentences. In this thesis, we attempt to integrate the two approaches into an ASR error correction (AEC) module and investigate the effects of different types of features on AEC. Our proposed method, named REDECORATE, is suitable for correcting plain-text transcriptions obtained from off-the-shelf speech services. In most cases, relevant plain-text data for a target domain is comparatively easy to obtain, so knowledge gleaned from such data can more effectively bias a general-domain ASR model toward the target domain. In view of this, we propose another novel error correction method based on word co-occurrence graphs built from domain-adaptive data. A given neural ASR model can readily access knowledge about the semantic context of a spoken utterance in a plug-and-play manner, without introducing additional parameters. This method, named GRACE, can be applied plug-and-play either to adapt a custom-trained ASR model or to directly correct ASR transcriptions. A series of experiments on the AISHELL-1 benchmark dataset shows that the proposed methods significantly reduce the character error rate (CER) over strong ASR baselines.

Abstract (English):
Automatic speech recognition (ASR) has achieved remarkable success and reached human parity because of synergistic breakthroughs in neural network architectures and training algorithms. However, the performance of ASR in many real-world applications is still insufficiently high. An increasing number of researchers have attempted to design and develop feasible post-processing modules for improving ASR performance by refining ASR output sentences. These modules can be divided into two categories: those based on the reranking of ASR N-best hypotheses, and those focusing on spelling correction. Reranking aims to find the oracle hypothesis, i.e., the one with the lowest word error rate, in a given list of N-best hypotheses. Spelling correction aims to detect and correct errors in ASR output transcriptions.
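The oracle-hypothesis objective described above can be sketched in a few lines: compute the word error rate (WER) of each hypothesis against the reference via Levenshtein distance and keep the minimum. This is a minimal illustrative sketch of the reranking target, not the thesis's actual reranking model; all function names are assumptions.

```python
# Illustrative sketch: selecting the oracle (lowest-WER) hypothesis
# from an ASR N-best list. Names are hypothetical, not from the thesis.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def wer(ref, hyp):
    """Word error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def oracle_hypothesis(reference, nbest):
    """Return the hypothesis with the lowest WER (the reranking target)."""
    return min(nbest, key=lambda h: wer(reference, h))

ref = "the cat sat on the mat".split()
nbest = ["the cat sad on the mat".split(),
         "the cat sat on the mat".split(),
         "a cat sat on a mat".split()]
best = oracle_hypothesis(ref, nbest)
```

A trained reranker, of course, must pick this hypothesis without access to the reference; the oracle merely defines the upper bound the reranking methods aim for.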
In this study, we attempted to integrate the reranking of N-best ASR hypotheses with correction methods into an ASR error correction (AEC) module and examined the effects of various types of features on AEC. In most cases, obtaining text data for a target domain is easier than obtaining speech data for that domain, and the knowledge acquired from text data can be used to efficiently bias a general-domain ASR model toward the target domain. Therefore, we propose a novel error correction method that leverages word co-occurrence graphs constructed from domain-adaptive data. Knowledge regarding the semantic context of a speech utterance can be readily accessed by a given neural ASR model in a plug-and-play manner, without the need to introduce additional parameters. This knowledge is accessed through word co-occurrence graphs, allowing the ASR model to tap into the rich contextual relationships between words and enhance its understanding of the spoken language. A series of experiments conducted on the AISHELL-1 benchmark dataset indicated that the proposed method achieves a markedly lower character error rate than strong baseline ASR approaches.

Keywords: post-processing of speech recognition; N-best hypotheses reranking; non-autoregressive; word co-occurrence graphs; automatic speech recognition

Title: Novel Adaptation and Post-processing Approaches for Improving Speech Recognition (適用於改善語音辨識的新穎調適方法與後處理模型)

Type: etd
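The word co-occurrence graph idea described in the abstract can be sketched as follows: build a graph whose edge weights count how often two words co-occur within a window in domain text, then score candidate words by how strongly they co-occur with the utterance context. This is a hedged sketch of the general technique only; the function names, window size, and scoring rule are illustrative assumptions, not the thesis's exact GRACE method.

```python
# Hypothetical sketch: a word co-occurrence graph built from domain text,
# used to score how well a candidate word fits an utterance's context.
# All names and the scoring rule are illustrative, not from the thesis.
from collections import defaultdict

def build_cooccurrence_graph(sentences, window=3):
    """Adjacency dict: graph[u][v] = co-occurrence count within the window."""
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, u in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if u != v:
                    graph[u][v] += 1
                    graph[v][u] += 1
    return graph

def context_score(graph, candidate, context):
    """Sum of co-occurrence counts between a candidate and its context words."""
    return sum(graph[candidate][w] for w in context)

# Toy domain-adaptive text (hypothetical):
domain_text = [
    "stock prices rose sharply today".split(),
    "stock market prices fell today".split(),
    "the market opened higher today".split(),
]
g = build_cooccurrence_graph(domain_text)

# Disambiguate acoustically similar candidates using domain context:
context = ["market", "today"]
scores = {c: context_score(g, c, context) for c in ["prices", "prizes"]}
```

Here the in-domain candidate "prices" outscores the acoustically confusable "prizes" because it co-occurs with the surrounding context words in the domain text, which is the intuition behind biasing a general-domain ASR model toward a target domain without adding parameters.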