A Study on the Adaptability of Contextual Biasing in End-to-End Speech Recognition

Date

2025

Abstract

In the post-pandemic era, online meetings have become the norm, and demand for speech transcription has grown accordingly. In meeting scenarios, however, speech recognition systems often misrecognize specialized terminology, personal names, and keywords, which reduces the completeness and precision of transcripts. These errors are especially common in meetings that involve domain-specific terminology, such as those in healthcare, law, and finance. Accurate transcription of keywords and proper nouns improves the readability of meeting records and also makes later information retrieval and analysis more effective. To meet this need, speech recognition systems have increasingly adopted contextual biasing and text prompting: by integrating context-specific word lists and terminology databases, a system can recognize important meeting content more accurately, improving the quality and usefulness of meeting data. This study focuses on strengthening the contextual sensitivity of speech recognition models by introducing different types of semantic features and specific prompts, with the aim of improving recognition of domain-specific vocabulary. Experiments show that prompt-based training achieves a 13.8% relative word error rate reduction and a 7.5% relative entity error rate reduction on the AISHELL-1 dataset. These results indicate that the approach heightens the model's sensitivity to specialized terms and other important vocabulary, lowers the error rate on biasing words, and improves transcription accuracy. Supplying contextual cues for the vocabulary helps the model recognize and transcribe the relevant content correctly in professional settings, reducing errors caused by missing context.
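The reported figures are relative reductions, meaning the drop in error rate is expressed as a fraction of the baseline error rate rather than as an absolute difference. A minimal sketch of that calculation follows, assuming the conventional definition; the function name and the sample numbers are hypothetical and are not taken from the thesis.

```python
def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Return the relative error rate reduction, in percent.

    Assumes the conventional definition:
    100 * (baseline - new) / baseline.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return 100.0 * (baseline_rate - new_rate) / baseline_rate


# Hypothetical error rates for illustration only (not results from the thesis):
# a baseline of 5.0% and a prompted model at 4.0%.
example = relative_reduction(5.0, 4.0)
print(f"relative reduction: {example:.1f}%")
```

Under this definition, a relative reduction can look large even when the absolute change is small, so it is best read alongside the baseline error rate.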

Keywords

Speech Recognition, Contextual Biasing, Keyword Recognition, Prompt-Tuning