A Study of Neural Relational Word Vector Representations for Text Classification Tasks (基於類神經之關聯詞向量表示於文本分類任務之研究)
Date
2017
Abstract
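With the rapid growth of information networks, the demand for accessing text data over the Internet of Things increases by the day, making text classification a highly active research topic among applications of natural language processing. At present, the core problem in text classification is the choice of feature representation. Most studies use the bag-of-words model as the feature representation of text, but the bag-of-words model cannot effectively express the relationships between words and therefore loses the semantics of the text.
In this thesis, we use two novel neural network architectures, Siamese nets and generative adversarial nets, so that during training the model learns more robust and semantically rich feature representations. Our experiments use two well-known classification datasets, IMDB movie review classification and 20Newsgroups newsgroup classification. Results from a series of sentiment analysis and topic classification experiments show that the feature representations learned by these neural networks can effectively improve text classification performance.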
With rapid global access to tremendous amounts of text data on the Internet, text categorization (or classification) has emerged as an important and active research topic in the natural language processing (NLP) community, with many applications. Currently, the foremost problem in text categorization is feature representation, which is commonly based on the bag-of-words (BoW) model, where word unigrams, bigrams (more generally, n-grams), or some specifically designed patterns are typically extracted as the component features. It has been noted that the loss of word order caused by BoW representations is particularly problematic for document categorization. In order to exploit word order and proximity information in text categorization tasks, we explore a novel use of Siamese nets and generative adversarial nets for document representation and text categorization. In experiments conducted on two benchmark text categorization tasks, viz. IMDB and 20Newsgroups, we take advantage of these architectures to learn distributed vector representations of documents that reflect semantic relatedness.
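The word-order limitation of BoW noted above can be illustrated with a minimal sketch. The two review sentences and the helper names `bag_of_words` and `bigrams` are hypothetical and chosen only for illustration; they are not taken from the thesis or its datasets. The sketch shows that two sentences with opposite sentiment can share an identical unigram BoW representation, while a representation that keeps local order (here, bigram counts) distinguishes them.

```python
from collections import Counter


def bag_of_words(text):
    """Unigram bag-of-words: lowercase, split on whitespace, count tokens."""
    return Counter(text.lower().split())


def bigrams(text):
    """Counts of adjacent token pairs, which retain local word order."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical toy reviews with opposite sentiment but the same words.
review_a = "the plot was good not bad"
review_b = "the plot was bad not good"

# Unigram counts ignore order, so the two reviews look identical.
same_unigrams = bag_of_words(review_a) == bag_of_words(review_b)

# Bigram counts keep adjacent-word order, so the reviews differ.
same_bigrams = bigrams(review_a) == bigrams(review_b)

print("identical unigram BoW:", same_unigrams)
print("identical bigram counts:", same_bigrams)
```

Bigram counts are only one simple way to recover some ordering information; the learned distributed representations described in the abstract are a different approach to the same problem.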
Keywords
Text Categorization, Representation Learning, Deep Learning, Siamese Networks, Generative Adversarial Networks