Multi-Strategy Incremental Training for Document Classification under Class Imbalance

Date

2024

Abstract

The widespread use of social media has generated a vast amount of user-generated content, a valuable source for analyzing public attitudes toward specific topics. However, posts in online forums cover a very wide range of subjects, and the great majority are unrelated to the target topic; such posts are assigned to category 0. Compared with the fine-grained topic-related categories, category 0 is far larger, so the data distribution across categories is extremely imbalanced. Building a text classifier that identifies both the irrelevant category and each of the topic-related categories is therefore a significant challenge.

To address this problem, this study proposes a semi-supervised learning method named Multi-Strategy Incremental Training (MSIT), which incorporates class prototype learning into the training strategy. MSIT first builds a base model in two stages, an initial training stage and an enhanced training stage, applying the R-Drop training strategy and a prototype separation training strategy to improve representation learning. MSIT then uses the base model to pseudo-label unlabeled data and selects a subset of the pseudo-labeled data to add to the training set according to the similarity between each data representation and the class prototypes. During self-training, MSIT also takes into account the consistency between the model's predictions and the prototype-based predictions to further refine the classifier's parameters and improve generalization.

Experiments were conducted on a mental health literacy attitude classification task using documents collected from social media platforms. The results show that MSIT's two-stage base training significantly improves the base model, which achieves higher F1 scores on the topic-related categories than baseline models built with methods from related work. Incorporating unlabeled data through semi-supervised learning improves performance further, reaching a best macro-F1 of 0.663. MSIT also performs consistently across multiple datasets and outperforms the related-work models overall.
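The pseudo-label selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how prototypes are computed, which similarity measure is used, or how the selection cutoff is chosen, so everything here is an assumption made for illustration: prototypes are taken as per-class mean embeddings, similarity is cosine similarity, the `threshold` value is arbitrary, and the function names are hypothetical rather than taken from MSIT.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def class_prototypes(embeddings, labels):
    """Assumed prototype definition: the mean embedding of each class."""
    sums, counts = {}, {}
    for vec, y in zip(embeddings, labels):
        acc = sums.setdefault(y, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: [x / counts[y] for x in acc] for y, acc in sums.items()}

def select_pseudo_labeled(embeddings, model_preds, prototypes, threshold=0.9):
    """Keep an unlabeled item only if the nearest prototype agrees with the
    model's pseudo-label and the similarity to that prototype is high enough.
    Both criteria and the threshold are illustrative assumptions."""
    selected = []
    for idx, (vec, pred) in enumerate(zip(embeddings, model_preds)):
        sims = {y: cosine(vec, p) for y, p in prototypes.items()}
        proto_pred = max(sims, key=sims.get)
        if proto_pred == pred and sims[pred] >= threshold:
            selected.append((idx, pred))
    return selected

# Toy 2-D embeddings: class 0 lies near the x-axis, class 1 near the y-axis.
labeled = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
prototypes = class_prototypes(labeled, labels)

unlabeled = [[1.0, 0.05], [0.5, 0.5], [0.05, 1.0], [0.9, 0.0]]
model_preds = [0, 1, 1, 1]  # pseudo-labels from a hypothetical base model

selected = select_pseudo_labeled(unlabeled, model_preds, prototypes)
print(selected)  # [(0, 0), (2, 1)]
```

In this toy run, item 1 is dropped because it sits between the two prototypes, and item 3 is dropped because its pseudo-label disagrees with its nearest prototype. Only the two items where the model and the prototypes agree with high similarity would be added to the training set.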

Keywords

Prototype Learning, Mental Health Literacy
