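A Study on Sentence-Extraction-Based Automatic Summarization and Its Application to Document Classification (以字句擷取為基礎並應用於文件分類之自動摘要之研究)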

Date

2005

Abstract

Extractive summarization aims to automatically select a number of salient sentences, passages, or paragraphs from the original document according to a target summarization ratio and to sequence them into a concise summary. Most conventional summarization models can, in principle, be characterized by one of two matching strategies. The first measures the relevance between a sentence and the document by literal term matching, as exemplified by the Vector Space Model (VSM); the second measures this relevance by concept matching, as exemplified by Latent Semantic Analysis (LSA). Based on these observations, this study proposes several improved approaches to automatic document summarization. For literal term matching, the Hidden Markov Model (HMM) is investigated, and two variants of it (HMM-Type1 and HMM-Type2) are studied extensively. In HMM-Type1, the document is viewed as a generative model with a probability distribution over the indexing terms. The relevance between the document and each of its sentences is measured by the product of the likelihoods of all indexing terms of the sentence being generated by the document model; in other words, sentences with higher likelihood scores are considered more relevant to the document. In HMM-Type2, each sentence in the document is instead viewed as a probabilistic generative model. The relevance between the document and a given sentence is measured by the likelihood of the document being generated by the sentence, and the sentences can then be ranked by their likelihood scores. For concept matching, two summarization models are proposed: embedded Latent Semantic Analysis (embedded LSA) and the Topical Mixture Model (TMM).
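The two HMM variants described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`unigram`, `log_likelihood`, `rank_sentences`), the linear interpolation with a background model, and the weight `lam` are all assumptions introduced here, since the abstract does not say how the models are estimated or smoothed. Scores are computed as log-likelihoods, which preserves the ranking given by the product of likelihoods.

```python
import math
from collections import Counter


def unigram(tokens):
    """Maximum-likelihood unigram distribution over a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def log_likelihood(observed, model, background, lam=0.7):
    """Log of the product of smoothed term likelihoods for `observed`.

    Assumption: each term probability is lam * P(w|model) plus
    (1 - lam) * P(w|background). The background must cover every
    observed term, otherwise math.log(0) raises an error.
    """
    return sum(
        math.log(lam * model.get(w, 0.0) + (1 - lam) * background.get(w, 0.0))
        for w in observed
    )


def rank_sentences(sentences, background, variant="type2", lam=0.7):
    """Rank tokenized sentences of one document; returns (order, scores)."""
    doc_tokens = [w for sentence in sentences for w in sentence]
    doc_model = unigram(doc_tokens)
    scores = []
    for sentence in sentences:
        if variant == "type1":
            # Type 1: the document model generates the sentence.
            scores.append(log_likelihood(sentence, doc_model, background, lam))
        else:
            # Type 2: the sentence model generates the whole document.
            scores.append(
                log_likelihood(doc_tokens, unigram(sentence), background, lam)
            )
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy document (made-up tokens, for illustration only).
sentences = [
    ["stocks", "fell", "today"],
    ["stocks", "stocks", "market", "fell"],
    ["weather", "sunny"],
]
background = unigram([w for s in sentences for w in s])
order_type2, scores_type2 = rank_sentences(sentences, background, "type2")
order_type1, scores_type1 = rank_sentences(sentences, background, "type1")
```

In this toy example the off-topic sentence ranks last under the Type 2 scoring. Under the Type 1 scoring as literally described, a product over the sentence's own terms, shorter sentences tend to receive higher scores, so a practical system would presumably need some form of length handling; the abstract does not say which.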
In embedded LSA, the document to be summarized also takes part in the construction of the latent semantic space, so that the importance of each sentence can be measured by the proximity of its vector representation to that of the document in that space. In the TMM, each sentence in the document is represented as a mixture model consisting of K latent topical distributions and their corresponding document-specific posterior probabilities. The relevance between each sentence and the document is then measured from the likelihoods of the document's indexing terms under the latent topics and the probabilities that the sentence generates the respective topics. A series of experiments was conducted on a Mandarin broadcast news speech corpus. The results show that both the HMM and the TMM perform significantly better than the conventional approaches, and that the TMM outperforms the HMM in almost all conditions. Finally, we study the possibility of adapting the TMM summarization model to document classification, where the documents may also be preprocessed beforehand by the summarization models mentioned above. Initial experimental results show that the TMM classifier performs slightly better than the conventional K-Nearest-Neighbor (KNN) classifier.
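The TMM relevance score described above can be sketched as follows, assuming the topic distributions and the per-sentence topic weights have already been estimated (the abstract does not describe the training procedure, and none is shown here). The function name `tmm_log_likelihood` and all numbers are hypothetical; the score is the log of the product, over the document's indexing terms, of the topic-weighted term probabilities.

```python
import math


def tmm_log_likelihood(doc_counts, topic_word, sentence_topic_weights):
    """Log-likelihood of a document under one sentence's topic mixture.

    doc_counts: {term: count in the document}
    topic_word: list of K dicts, each mapping term -> P(term | topic k)
    sentence_topic_weights: list of K mixture weights for the sentence

    Assumption: every topic assigns a nonzero probability to every
    document term (i.e. the topic distributions are smoothed).
    """
    total = 0.0
    for term, count in doc_counts.items():
        # Mixture probability of the term: sum over the K latent topics.
        p_term = sum(
            weight * topic[term]
            for topic, weight in zip(topic_word, sentence_topic_weights)
        )
        total += count * math.log(p_term)
    return total


# Made-up example with K = 2 topics over a three-term vocabulary.
topic_word = [
    {"stocks": 0.6, "market": 0.3, "rain": 0.1},  # a finance-like topic
    {"stocks": 0.1, "market": 0.1, "rain": 0.8},  # a weather-like topic
]
doc_counts = {"stocks": 3, "market": 2, "rain": 1}

score_a = tmm_log_likelihood(doc_counts, topic_word, [0.9, 0.1])
score_b = tmm_log_likelihood(doc_counts, topic_word, [0.2, 0.8])
```

Here the sentence whose weights favor the finance-like topic scores higher against the mostly financial document, so it would be ranked first for inclusion in the summary.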

Keywords

Summarization, Latent Semantic Analysis, Hidden Markov Model, Topical Mixture Model, K-Nearest-Neighbor Classifier
