The Potential of Prosodic Features in Multimodal Classification of YouTube Speech Genres (韻律特徵於YouTube言語體裁多模態分類中之潛力)

Date

2023

Abstract

This study investigates the prosodic properties of two speech genres, entertaining and informative, in Taiwan Mandarin YouTube content, and the effectiveness of different feature modes for automatic speech genre classification. We built a corpus of 5049 utterances, where an utterance is defined as an inter-pausal unit, i.e., the stretch of speech between two pauses. Each utterance was recorded with its transcript, its speech genre, four durational features (utterance duration, pause duration, speech rate, and duration-based PVI, the pairwise variability index), and three pitch-related features (f0 mean, f0 range, and f0-based PVI). We also converted the transcript of each utterance into textual features using TF-IDF. The utterance was the unit of analysis throughout. First, we fitted a binary logistic regression model to the seven durational and pitch-related features to identify which prosodic properties characterize entertaining and informative speech. Second, we built three types of automatic speech genre classification models, namely prosody-mode, text-mode, and multimodal (prosody + text) models, to assess the potential of prosodic features in speech genre classification and whether multimodal features further improve classification performance. In the logistic regression model, six of the seven prosodic features (utterance duration, speech rate, duration-based PVI, f0 mean, f0 range, and f0-based PVI; all but pause duration) were statistically significant, indicating that the two genres differ in their prosodic properties. Compared with entertaining speech, informative speech tended to have longer utterances, a slower speech rate, a more isochronous rhythm, a lower pitch, and more pronounced intonational variation. We then used the seven prosodic features to train the prosody-mode and multimodal classification models.
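As an illustration of the duration-based and f0-based PVI features named above, the following is a minimal sketch of the pairwise variability index in its two commonly used forms, raw and normalized. The abstract does not state which variant was used or over which units (e.g., syllables) it was computed, so the function names and the choice of input sequence here are assumptions rather than the study's implementation.

```python
from typing import Sequence


def raw_pvi(values: Sequence[float]) -> float:
    """Raw PVI: mean absolute difference between successive values.

    `values` could be successive interval durations (duration-based PVI)
    or successive f0 measurements (f0-based PVI); which units the study
    used is not stated in the abstract, so this is an assumption.
    """
    if len(values) < 2:
        raise ValueError("PVI needs at least two successive values")
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)


def normalized_pvi(values: Sequence[float]) -> float:
    """Normalized PVI: each pairwise difference is divided by the mean
    of the pair, and the average is scaled by 100.

    Assumes all values are positive (true of durations and f0 in Hz),
    so the pair mean is never zero.
    """
    if len(values) < 2:
        raise ValueError("PVI needs at least two successive values")
    terms = [abs(a - b) / ((a + b) / 2) for a, b in zip(values, values[1:])]
    return 100 * sum(terms) / len(terms)
```

Under either variant, a lower value means that successive units are more alike, which for durations corresponds to a more isochronous rhythm.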
Our results showed that (1) the prosody-mode model reached an accuracy of 0.733, demonstrating the potential of prosodic features in speech genre classification, and (2) the multimodal (prosody + text) model outperformed every single-mode model, reaching an accuracy of 0.846. We conclude that prosodic features can supply information that textual features lack or underrepresent, and can further improve the already strong text-mode results. Because speech is inherently multimodal, speech genre classification should take both prosodic and textual features into account.
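One way to picture the multimodal (prosody + text) feature set is as a simple concatenation of each utterance's prosodic features with its TF-IDF vector. The sketch below shows that idea only. The abstract does not say how the two modes were combined, which TF-IDF weighting was used, or which classifier was trained, so the concatenation, the z-score scaling, the unsmoothed IDF, and every function name here are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def tfidf_vectors(docs: Sequence[Sequence[str]]) -> Tuple[List[str], List[List[float]]]:
    """Turn tokenized transcripts into TF-IDF vectors.

    Uses term frequency (count / document length) times log(N / df),
    one of several common TF-IDF variants (assumed, not from the source).
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of transcripts containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero on an empty transcript
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


def zscore_columns(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Standardize each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid division by zero.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def fuse(prosody_rows: Sequence[Sequence[float]],
         text_vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate scaled prosodic features with text features per utterance."""
    if len(prosody_rows) != len(text_vectors):
        raise ValueError("each utterance needs both prosodic and text features")
    scaled = zscore_columns(prosody_rows)
    return [list(p) + list(t) for p, t in zip(scaled, text_vectors)]
```

Each fused row could then be passed to any standard classifier. Scaling the prosodic columns first keeps features measured in seconds or hertz from dominating the much smaller TF-IDF weights.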

Keywords

prosody, speech genre, multimodal classification, genre classification, Taiwan Mandarin YouTube content
