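A Study of Topic-Relevant Document Retrieval and Opinion Polarity Analysis for Chinese Blog Articles (中文部落格文章之相關性擷取與意見傾向分析之研究)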

Date

2015

Abstract

With the development of Internet technology, more and more people share their opinions and reviews online. Automatically classifying the opinion polarity of documents in this large body of text has become an important research direction in sentiment analysis. This thesis focuses on political commentary articles and proposes methods that extract the documents relevant to a specific topic and then analyze their opinion polarity, which is classified as positive, neutral, or negative.

To classify documents accurately, both unsupervised and supervised learning methods are proposed, and the experiments are divided into two parts: extracting topic-relevant documents and analyzing their opinion polarity. In the unsupervised method, the Pointwise Mutual Information (PMI) between each noun in the text and the topic is computed, and nouns with high relevance are used as query expansion terms; a document that contains a topic word or a query expansion term is considered relevant to the topic. The sentence structure of the topic-relevant documents is then analyzed, a lexicon-based method assigns a polarity to each sentence, and the effect on polarity of negation words, transitional expressions, and a sentence-final question mark is investigated.

In the supervised method, a support vector machine (SVM) is used to classify documents. In the topic-relevant document extraction experiment, the chi-square test (CHI) is used to score how strongly each term in the training data is associated with a class, and the 20 top-ranked terms are grouped into pairs or triples; some of these term combinations, when they appear together in a document, indicate that the document is relevant to the topic. In the opinion polarity experiments, selecting training terms by their frequency of occurrence in documents of different polarities performs better than feature selection based on the chi-square test, and using each term's polarity in the training data as a feature performs better than using term polarities from a sentiment lexicon.

Finally, a comparison of the two approaches on opinion polarity analysis of topic-relevant documents shows that the supervised method outperforms the unsupervised one. Depending on the experimental topic, the highest precision is 70.84% and the lowest is 65.49%.
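
The PMI-based query expansion step in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes document-level co-occurrence counts, documents that are already segmented into noun tokens, and a fixed cutoff. The function names `pmi_expansion_terms` and `is_topic_relevant`, the parameters `top_k` and `min_df`, and the toy English tokens are all hypothetical.

```python
import math
from collections import Counter


def pmi_expansion_terms(docs, topic_word, top_k=3, min_df=2):
    """Rank candidate nouns by document-level PMI with the topic word.

    docs is a list of token lists (assumed: pre-segmented nouns).
    top_k and min_df are illustrative cutoffs, not values from the thesis.
    """
    n_docs = len(docs)
    df = Counter()     # number of documents containing each term
    co_df = Counter()  # number of documents containing both term and topic word
    for tokens in docs:
        terms = set(tokens)
        df.update(terms)
        if topic_word in terms:
            co_df.update(terms - {topic_word})

    scores = {}
    for term, joint in co_df.items():
        if df[term] < min_df:
            continue  # skip rare terms, whose PMI estimates are unstable
        # PMI = log2( P(term, topic) / (P(term) * P(topic)) ), probabilities as df / n_docs
        scores[term] = math.log2(joint * n_docs / (df[term] * df[topic_word]))

    # Highest PMI first; ties broken alphabetically for a stable order
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


def is_topic_relevant(tokens, topic_word, expansion_terms):
    """A document is relevant if it contains the topic word or any expansion term."""
    vocabulary = {topic_word, *expansion_terms}
    return any(token in vocabulary for token in tokens)


# Toy corpus (hypothetical data, for illustration only)
toy_docs = [
    ["election", "candidate", "vote"],
    ["election", "candidate", "weather"],
    ["weather", "rain"],
    ["weather", "rain", "vote"],
]
expansion = pmi_expansion_terms(toy_docs, "election", top_k=1)
```

On the toy corpus, the only term with positive PMI against "election" is "candidate", so it is the single expansion term selected; a document mentioning "candidate" would then count as topic-relevant even if it never contains "election" itself.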

Keywords

Sentiment Analysis, Query Expansion, Topic-relevant Document Retrieval, Opinion Polarity Classification
