A Study on Identifying Opinion Holders in News Articles by Combining Supervised and Unsupervised Methods
Date
2016
Abstract
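Opinion mining helps us automatically extract subjective information that is of interest and of use from large volumes of text from reliable sources. An opinion sentence can be divided into four parts: the opinion topic, the opinion holder, the opinion claim, and the opinion sentiment. This study aims to identify the opinion holder. It proposes a method that combines supervised and unsupervised learning to identify the article author or the word representing the holder in an opinion sentence. The main workflow is divided into two tasks: article-author opinion identification and opinion holder identification.
Opinion holder identification aims to extract, from an opinion sentence, the name of the person or organization expressing the opinion. The method is based on supervised learning: in documents containing subjective opinion sentences, the words representing opinion holders are manually annotated as gold answers, and the text is then preprocessed with natural language processing methods (including word segmentation, part-of-speech tagging, and named entity recognition). Each of the two main tasks is then handled by its own set of support vector machine models, which identify the article author and the opinion holder in opinion-expressing sentences. Article-author opinion identification uses features including lexical, part-of-speech, punctuation, named entity, syntactic, and opinion-word information; opinion holder identification uses features including part-of-speech, lexical, named entity, sentence composition, and punctuation information. Finally, the results of the two parts are merged to produce the opinion holder reported by the system.
When an opinion sentence contains multiple opinion holder candidates, we use a formula to compute the word that represents the opinion holder, and apply rules defined in this study to correct holder words that are incomplete. In addition, when the opinion holder involves anaphora resolution, this study resolves it with the Hobbs algorithm, which operates on syntactic parses. Experimental results show that, on an English news corpus, the system achieves an F-1 score of 91.58% for article-author identification and 71.83% for opinion holder identification. On this basis, cross-validation and a feature-deletion analysis of feature importance were carried out, and good identification performance was obtained.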
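The feature-deletion analysis mentioned above can be pictured roughly as follows. This is a minimal sketch, not the thesis's implementation: `evaluate` is a hypothetical callback that would train and score a model on a given set of feature groups, and the toy weights in the usage are invented purely to make the example runnable; they are not results from the study.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def feature_deletion_analysis(
    feature_groups: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """For each feature group, return the score drop when that group is removed."""
    all_groups = frozenset(feature_groups)
    baseline = evaluate(all_groups)  # score with every feature group enabled
    return {
        group: baseline - evaluate(all_groups - {group})
        for group in sorted(all_groups)
    }


# Hypothetical stand-in for "train an SVM and return its cross-validated F-1".
# The weights are made up for illustration only.
def toy_evaluate(groups: FrozenSet[str]) -> float:
    weights = {"lexical": 0.30, "part_of_speech": 0.20, "named_entity": 0.25}
    return sum(weights[g] for g in groups)


drops = feature_deletion_analysis(
    ["lexical", "part_of_speech", "named_entity"], toy_evaluate
)
most_important = max(drops, key=drops.get)  # group whose removal hurts most
```

A larger drop suggests that the removed feature group contributed more to performance; in practice `evaluate` would wrap the actual training and cross-validation procedure.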
Opinion mining helps us automatically extract useful subjective information from a large number of reliable texts. Opinion sentences can be decomposed into four parts: opinion topic, opinion holder, opinion claim, and opinion sentiment. Our goal is to identify the opinion holders. This study proposes a combination of supervised and unsupervised learning approaches to extract the article author and the holders. The main flow of our work is divided into two phases: identifying the article author and identifying the holders of opinion sentences in the labeled corpus. The purpose of opinion holder identification is to capture the person or organization expressing the opinion in a subjective opinion sentence. The approach is based on supervised learning, using manually annotated corpora drawn from online news articles. Preprocessing uses natural language processing techniques such as segmentation, part-of-speech tagging, and named entity recognition. Our feature analysis is based on both machine learning (a support vector machine, SVM) and unsupervised pattern recognition techniques. Different SVM models are evaluated via cross-validation experiments. The proposed features consist of lexical, part-of-speech, punctuation mark, named entity, syntactic, position, phrase composition, and opinion word features. The study also addresses the problem of multiple opinion holder candidates appearing in a single sentence. The proposed approach includes some unsupervised extraction methods to detect opinion holders without labeled training data. Some manual rules are employed to revise incomplete holder representations. Furthermore, the Hobbs algorithm is applied to resolve anaphora.
Our approach is tested on an annotated news corpus with 10-fold cross-validation and with feature deletion analysis, obtaining F-1 scores of 91.58% and 71.83% for the task of extracting the author's opinion and the task of opinion holder identification, respectively. The experimental results show good performance.
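For reference, the F-1 scores reported above are the harmonic mean of precision and recall. The following is a minimal sketch, assuming holders are compared by exact match on hypothetical `(sentence_id, holder)` pairs; the abstract does not state the matching criterion used in the thesis, and the example data are invented for illustration.

```python
from typing import Set, Tuple

Holder = Tuple[int, str]  # (sentence_id, holder text); an assumed representation


def precision_recall_f1(
    gold: Set[Holder], predicted: Set[Holder]
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F-1 under exact-match comparison."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: the second prediction is incomplete, so it does not match.
gold = {(1, "Obama"), (2, "the ministry"), (3, "author"), (4, "WHO")}
predicted = {(1, "Obama"), (2, "ministry"), (3, "author")}
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

Under exact matching, an incomplete holder such as "ministry" counts as an error, which is one reason rules for repairing incomplete holder representations can matter for the final score.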
Keywords
opinion mining, opinion holder identification, support vector machine, supervised learning, unsupervised learning approach