A Study on Automatic Extraction of Important Facet Factual Content from Web Search Results
Abstract
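The main goal of this thesis is to automatically summarize important facet information from a large number of search results, given a user's query term and specified facet keywords, so that users can quickly obtain the facet information they need. Because downloading and processing every result document would take considerable time, this thesis uses the snippets returned with the search results as the data source for mining information related to the query. The study proposes a method called SR-Summarization. Based on how terms are distributed across the search results returned for each facet, it defines formulas for the general facet score and the facet representativeness score of a term with respect to the query keywords, and from these it evaluates the general facet score and facet representativeness score of a sentence. The method also defines a formula for the factual informativeness of a sentence and uses machine learning to assess sentence quality. Finally, sentences are selected by a mechanism that combines the information content of the summary with the diversity of its content, producing a "general facet information of the query" summary and a "facet factual information" summary for each specified facet. Experimental results show that the proposed method effectively extracts important facet factual content from web search results, and a user questionnaire shows that users are more satisfied with the summaries produced by this method than with those of a related method.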
The purpose of this thesis is to automatically summarize important query-focused facet information from a large number of search results, given a query and multiple facet terms supplied by the user, so that users can quickly obtain the facet information they need. To avoid the time required to download every search result for the query, this study uses the snippets of the search results as its data source. A method called SR-Summarization is proposed. First, the general and facet scores of a segment are estimated according to the distribution of the terms it contains among the search results of multiple aspects. Second, a machine learning method is used to estimate the completeness quality of the segment. A weighted sum of the distribution-based score and the quality score gives the informative score of a segment. Finally, both the informative score and a diversity score are considered when selecting segments to generate a general summary and multiple facet summaries for the search results. Experimental results show that the SR-Summarization method effectively extracts important facet information from search results. A user survey shows that the approach outperforms a related method in generating informative facet summaries.
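The final two steps described in the abstract (a weighted sum of two scores, then selection that considers both informativeness and diversity) can be illustrated with a minimal sketch. It is not the SR-Summarization implementation: the abstract gives no formulas, so the function names, the weights `alpha` and `trade_off`, the whitespace tokenizer, and the use of Jaccard overlap as a redundancy measure are all assumptions made for illustration.

```python
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercased whitespace tokens; a stand-in for real term extraction."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token overlap between two segments, in [0, 1]."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def informative_score(facet_score: float, quality_score: float,
                      alpha: float = 0.5) -> float:
    """Weighted sum of two scores; alpha is a hypothetical weight."""
    return alpha * facet_score + (1.0 - alpha) * quality_score


def select_segments(segments: List[str], scores: List[float], k: int,
                    trade_off: float = 0.7) -> List[str]:
    """Greedily pick k segments, rewarding high informative scores and
    penalising overlap with segments already chosen (assumed scheme)."""
    tokens = [tokenize(s) for s in segments]
    chosen: List[int] = []
    remaining = list(range(len(segments)))
    while remaining and len(chosen) < k:
        def gain(i: int) -> float:
            # Redundancy is the highest overlap with any chosen segment.
            redundancy = max(
                (jaccard(tokens[i], tokens[j]) for j in chosen), default=0.0
            )
            return trade_off * scores[i] - (1.0 - trade_off) * redundancy
        best = max(remaining, key=gain)
        chosen.append(best)
        remaining.remove(best)
    return [segments[i] for i in chosen]
```

With two near-identical snippets and one unrelated snippet, a call such as `select_segments(snippets, scores, k=2, trade_off=0.5)` would be expected to keep the top-scoring snippet and skip its near-duplicate in favour of the unrelated one, which is the behaviour a diversity term is meant to produce.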
Keywords
text mining, document summarization, facet retrieval