A Study on Applying Summarization Systems and Information Distance Methods to Biomedical Question Answering Systems
Date
2014
Abstract
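This thesis takes Alzheimer’s disease as its subject and investigates question answering systems for the biomedical domain. The aim is to apply the properties of a summarization system and the information distance method to question answering research, in the hope that machine learning, supported by existing related literature and a background knowledge collection, can find the correct answers to questions of this kind.
The test data consist of four test sets related to Alzheimer’s disease. Each test set contains one test document and 10 test questions about that document; every question has five options and exactly one correct answer. A background knowledge collection is also used. Its sources are articles on Alzheimer’s disease from the Medical Literature Analysis and Retrieval System Online (Medline), obtained from PubMed Central, and biological articles and abstracts on Alzheimer’s disease provided by the Massachusetts Alzheimer’s Disease Research Center in the United States.
Different studies were carried out under different architectures. Approach 1 builds on the biomedical question answering system proposed by Tsai (蔡秉翰) in 2013 and combines it with a summarization system that summarizes either the test document or the background knowledge collection, in the hope that summarization will extract the important information from the articles. Approach 2 rests on the idea that the information distance between a question and its correct answer should be smaller than the information distance between the question and any other candidate answer. The information distance method is therefore adapted to the characteristics of the QA4MRE data, and TFIDF computation and term expansion are added.
Finally, experiments were run on both approaches. The experiments on Approach 1 show that the documents in the background knowledge collection are only weakly related to the question topics of the corresponding test sets, which means that most of the information in those documents is unimportant; summarizing the background knowledge collection therefore extracts the important information effectively. The experiments on Approach 2 show that, for the information distance method, increasing the number of Question Focus items effectively raises accuracy.
Through these experiments, the study finds that, when summarization and information distance are applied to a biomedical question answering system, both summarizing the documents in the background knowledge collection and applying the weighting computation of the information distance method yield fairly good results.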
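The abstract does not give the thesis’s TFIDF formula or its distance function, so the following is only a minimal sketch of how TFIDF weighting can feed a "smallest distance wins" rule for a five-option question. Everything here is an assumption made for illustration: the function names, the smoothed IDF formula, the use of one minus cosine similarity as a stand-in distance, and the toy sentences.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(documents):
    """Smoothed inverse document frequency for every term in the collection."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; terms unseen in the collection are ignored."""
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def closest_option(question, options, sentences):
    """Return (index, distances): the option nearest to any supporting sentence.

    Assumption: distance = 1 - cosine similarity, not the thesis's own measure.
    """
    idf = idf_table(sentences)
    sentence_vectors = [tfidf_vector(s, idf) for s in sentences]
    distances = []
    for option in options:
        query = tfidf_vector(question + " " + option, idf)
        distances.append(min(1.0 - cosine(query, sv) for sv in sentence_vectors))
    best = min(range(len(options)), key=distances.__getitem__)
    return best, distances


if __name__ == "__main__":
    sentences = [
        "Amyloid beta plaques accumulate in the brain of patients.",
        "Insulin regulates glucose uptake in muscle cells.",
        "Exercise improves cardiovascular fitness.",
    ]
    question = "What accumulates in the brain of patients?"
    options = ["insulin", "amyloid beta plaques", "exercise"]
    print(closest_option(question, options, sentences))
```

In this sketch the sentences stand in for a summarized document or background collection; summarizing first simply shrinks the list of sentences each option is compared against.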
This study takes Alzheimer’s disease as its subject and implements a biomedical question answering system. The purpose of the thesis is to apply both the properties of a summarization system and an information distance method to question answering. Machine learning techniques are also applied in an attempt to find the correct answer from the related literature and background knowledge.
The test data are composed of four test sets. Each set includes one document, ten questions, and five answer options per question, and each question has exactly one correct answer among its options. The study also uses background collections drawn from the articles of the Medical Literature Analysis and Retrieval System Online (Medline) and from the Massachusetts Alzheimer’s Disease Research Center.
Several approaches are adopted towards developing an effective question answering system. The first approach builds on the methods used in the study of Hou and Tsai in 2014 and extends them with a summarization technique to obtain the important information. The second approach is based on the concept of information distance. The thesis proposes that the information distance between a question and its correct answer must be smaller than the distances between the question and the incorrect answers. The information distance method is further adapted to fit the characteristics of QA4MRE, and two other techniques, TFIDF computation and query expansion, are also used in the second approach.
The experiments on the first approach show that the relevance between the literature in the background knowledge and the questions in the test sets is not high enough. Because that literature may contain too much noise, summarizing it effectively captures the important information needed. The experiments on the second approach show that increasing the number of “Question Focus” items effectively improves the accuracy of the system.
In summary, both summarization and information distance methods are applied to a biomedical question answering system in this study. The experiments show that summarizing the literature in the background knowledge and applying the information distance method can yield good results.
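The selection rule described above (the correct answer should lie at the smallest information distance from the question) can be sketched as follows. The abstract does not state how the thesis computes information distance, so this sketch substitutes the normalized compression distance, a common practical approximation, using zlib as the compressor. The function names are hypothetical, and on very short strings the compressor’s fixed overhead makes the distances noisy.

```python
import zlib


def compressed_size(text):
    """Number of bytes zlib needs for the UTF-8 encoding of the text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x, y):
    """Normalized compression distance, an approximation of information distance.

    Assumption: this stands in for the thesis's measure, which the abstract
    does not specify.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def pick_answer(question, options):
    """Return (index, distances), choosing the option closest to the question."""
    distances = [ncd(question, option) for option in options]
    best = min(range(len(options)), key=distances.__getitem__)
    return best, distances


if __name__ == "__main__":
    question = "Which peptide forms the plaques found in Alzheimer's disease?"
    options = [
        "amyloid beta, the peptide that forms plaques in Alzheimer's disease",
        "insulin, a hormone that regulates glucose uptake",
        "hemoglobin, the protein that carries oxygen in blood",
    ]
    index, distances = pick_answer(question, options)
    print(index, [round(d, 3) for d in distances])
```

In practice each option would be compared against question text enriched with supporting sentences from the test document or background collection, since a bare question and a short option share too little material for a compressor to exploit.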
Keywords
Information distance, Summarization, Answer validation, QA4MRE, CLEF, Query expansion