流行疾病中文新聞面向事實自動擷取之研究 (A Study on the Automatic Extraction of Facet Facts from Chinese News on Epidemic Diseases)
Date
2017
Abstract
當流行疾病發生時,使用者通常希望獲得更多有關於流行疾病的面向事實。本論文以中文流行疾病網路新聞為資料來源,研究如何從流行疾病新聞中自動擷取出疫情、症狀面向事實句,並從面向事實句中擷取出語意三元詞組進行結構化表示,以幫助有效率地查詢流行疾病的疫情發展狀況及症狀演變,並可作為建立知識庫的基礎。本論文提出的方法,對疫情及症狀面向事實句各建立一個分類模型,用來預測擷取新聞中對應的面向事實句。為了達到有效分類,本論文從已標示的面向事實句及非面向事實句中,以統計分析擷取出對分類較有效果的面向關鍵字,以這些關鍵字為基礎來建立每個句子的面向句分類特徵值。此外,由於不同流行病皆需給定訓練資料,本論文提出一個面向事實句自動標示的方法,可減少人工標示訓練資料的成本。此外,根據句子中詞彙的語法出現相依性分析,本論文方法可取出面向事實句的語意三元詞組及時間地點等屬性,建立面向事實的結構化表示。實驗結果顯示本論文提供的方法在面向事實句的選取、語意三元詞組的擷取都達到良好的效果。
When an epidemic occurs, users usually want more facet facts about the disease. This thesis takes Chinese online news about epidemic diseases as its data source and studies how to automatically extract the sentences that describe the outbreak-situation and symptom facets of a disease. Semantic triples are then extracted from these facet sentences as a structured representation, which supports efficient querying of how an outbreak develops and how symptoms evolve, and provides a basis for constructing a knowledge base. The proposed method builds one classification model for each of the two facets and uses it to predict the corresponding facet sentences in the news. To classify effectively, we apply statistical analysis to labeled facet and non-facet sentences to select the keywords that best distinguish the two, and build the classification feature values of each sentence on these keywords. Because training data must be provided for each epidemic disease, we also propose a method that automatically labels facet and non-facet training sentences, reducing the cost of manual labeling. Finally, based on a grammatical dependency analysis of the words in each facet sentence, the semantic triples and attributes such as time and place are extracted to establish a structured representation of the facet facts. Experimental results show that the proposed methods perform well both in selecting facet sentences and in extracting semantic triples.
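The keyword-selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the statistical analysis it uses, so the chi-square score below is an assumption, as are the function names (`chi_square`, `select_keywords`, `keyword_features`), the binary feature encoding, and the toy pre-tokenized sentences. It is not the thesis's actual implementation.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square score of a 2x2 table.

    a: facet sentences containing the term, b: non-facet sentences containing it,
    c: facet sentences without it,          d: non-facet sentences without it.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_keywords(facet_sents, other_sents, top_k=2):
    """Rank terms by chi-square and keep the top_k that favor the facet class."""
    facet_df = Counter(t for s in facet_sents for t in set(s))
    other_df = Counter(t for s in other_sents for t in set(s))
    nf, no = len(facet_sents), len(other_sents)
    scores = {}
    for term in set(facet_df) | set(other_df):
        a, b = facet_df[term], other_df[term]
        if a * no > b * nf:  # term is relatively more frequent in facet sentences
            scores[term] = chi_square(a, b, nf - a, no - b)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


def keyword_features(sentence, keywords):
    """One binary feature per keyword: 1 if the keyword occurs in the sentence."""
    return [1 if k in sentence else 0 for k in keywords]


# Hypothetical, pre-tokenized toy data (not taken from the thesis).
facet = [["新增", "病例", "確診"], ["確診", "病例", "死亡"], ["累計", "病例", "通報"]]
other = [["民眾", "勤", "洗手"], ["專家", "呼籲", "民眾"], ["天氣", "轉涼"]]

keywords = select_keywords(facet, other, top_k=2)
features = keyword_features(["今日", "新增", "病例"], keywords)
```

In a fuller pipeline, feature vectors like `features` would be passed to whichever classifier is trained for each facet. The abstract does not specify the classifier.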
Keywords
關鍵字選取, 面向事實擷取, 資訊結構化, keyword extraction, facet retrieval, structural form