Utilizing BLAST to Extract Citation Metadata from Online Publication Lists

Abstract
科學家相互引用文獻和研究結果,是科學得以迅速發展的重要因素。因此,書目表單(citation list)或文獻目錄(bibliography)無疑是學者的重要工具。一般常見的書目(citation)資料,通常記載著作者(author)、標題(title)、出版資訊(publication information)等訊息。出版資訊隨著出版形式不同(例如書本、期刊、研討會論文集、叢書、研究報告、技術報告等),而有種種變化,其內容則包括期刊或研討會名稱、冊別、編號、頁數、出版年月、出版商、出版地點等。這些扼要描述文獻背景訊息的後設資料(metadata),通常有結構化(structured)和半結構化(semi-structured)等兩種呈現形式。結構化的書目,可以資料庫或欄位式的表單作為代表;半結構化的文獻目錄,則以連續字串的形式呈現,其形式比較自由。因此,不同的學者在描述同一筆文獻的時候,可能會寫出兩筆外觀看來很不一致的書目資料。不止後設資料屬性的前後次序會有變化,連使用到的屬性也可能有所不同。 然而出現在網路上的文獻目錄,絕大多數卻都屬於半結構化的形式。若要加值運用,就得先將半結構化的文獻目錄,剖析和轉換成為一致的結構化形式,並分析彼此參照的關係和建立索引,以提供文獻搜尋和引用統計等資訊服務。本論文擬探討如何將半結構化文獻目錄,轉換成為一致的結構化資料。這是書目資料處理的核心問題。 由於書目資料型態眾多,想要自動將半結構化的書目轉換成結構化的資料實為不易。為了辨識書目後設資料,我們的基本構想是運用基因比對技術來解決這個書目資料辨識的問題。也就是將半結構化書目轉成蛋白質序列(protein sequence)。將已知的書目資料的樣板,則轉換成蛋白質序列,儲存於樣板資料庫中(template database)。當必須解析新的半結構化的書目時,則可將新的書目轉換成蛋白質序列。再以BLAST這項序列比對工具,從事先建立好的樣板資料庫中,找出與該蛋白質序列最相近的樣板。最後根據此樣板作後設資料的解析。 這樣的處理方式讓系統更有彈性,不僅可以輕易加入新的書目樣板,也可以快速找到最相近的樣板作為解析後設資料的依據。解析結果的準確率會因樣本資料庫的完整度而有所不同,也會因為計分表的設計而有所偏差,更會因測試資料的型態不同(例如含中文姓氏的著作表列與不含中文姓氏的著作表列)而形成不一樣的結果。本論文在這些議題上作了一些測試,在最理想的狀況下本系統可以達到91.2%的準確率,而OpCit的系統準確率在理想狀況下卻僅能達到75%。相反的在樣板資料庫完整度低的情況下(樣板完整度百分之五十),而且使用不利的測試資料,本系統的準確率降到38.2%,而OpCit系統為6%。
Scientists citing one another's documents and research results is an important factor in the rapid development of science. Citation lists and bibliographies are therefore undoubtedly important tools for scholars. A typical citation records the author, title, and publication information. The publication information varies with the type of publication (e.g. book, journal paper, conference paper, series, research report, technical report) and may include the journal or conference name, volume, number, pages, year and month of publication, publisher, and place of publication. This metadata, which briefly describes the background of a document, is usually presented in either structured or semi-structured form. Structured citations are typified by databases or field-based tables; semi-structured citations are presented as continuous strings and are freer in form. As a result, two scholars describing the same document may write citations that look quite inconsistent: not only may the order of the metadata attributes differ, but so may the attributes used. Most bibliographies found on the Internet, however, are in semi-structured form. To make further use of them, the semi-structured bibliographies must first be parsed and transformed into a consistent structured form, after which the relationships among citations can be analyzed and an index built to support services such as literature search and citation statistics. This paper discusses how to transform semi-structured bibliographies into uniform structured data, which is the core problem of citation data processing. Because citation formats are so numerous, performing this transformation automatically is difficult.
To recognize citation metadata, our basic idea is to apply gene sequence alignment techniques to the citation recognition problem. Templates of known citation data are encoded as protein sequences and stored in a template database. When a new semi-structured citation is to be parsed, it is likewise translated into a protein sequence, and BLAST, a sequence alignment tool, is used to find the most similar template in the previously built template database. The metadata are then parsed according to that template. This approach makes the system more flexible: new citation templates can be added easily, and the most similar template can be found quickly to guide parsing. The precision of the parsing results depends on the completeness of the template database, the design of the scoring table, and the type of test data (e.g. publication lists with or without Chinese surnames). We ran experiments on these issues. Under ideal conditions our system reaches a precision of 91.2%, whereas the OpCit system reaches only 75%. When template completeness is low (about 50%) and an unfavorable test set is used, the precision of our system drops to 38.2% and that of the OpCit system to 6%.
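The encode-then-align idea can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the abstract does not give the token-to-amino-acid mapping, so the letter assignments in `encode` are hypothetical, as are the two template names and the example citations. The sketch also does not call BLAST; a plain Smith-Waterman local alignment score with an arbitrary scoring scheme stands in for it.

```python
import re

# Hypothetical mapping from punctuation to amino-acid letters (not from the thesis).
PUNCT = {",": "G", ".": "P", ":": "D", "(": "L", ")": "R", "-": "H"}


def encode(citation: str) -> str:
    """Encode a citation as a pseudo-protein sequence, one letter per token."""
    letters = []
    for tok in re.findall(r"\d+|[^\W\d_]+|\S", citation):
        if tok.isdigit():
            letters.append("Y" if len(tok) == 4 else "N")  # 4 digits: likely a year
        elif tok.isalpha():
            if len(tok) == 1:
                letters.append("I")  # single letter: likely an initial
            elif tok[0].isupper():
                letters.append("C")  # capitalized word
            else:
                letters.append("W")  # any other word
        else:
            letters.append(PUNCT.get(tok, "S"))  # punctuation or other symbol
    return "".join(letters)


def local_alignment_score(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    """Smith-Waterman local alignment score (a simple stand-in for BLAST)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def best_template(citation: str, templates: dict) -> tuple:
    """Return (name, score) of the template sequence most similar to the citation."""
    seq = encode(citation)
    scored = ((name, local_alignment_score(seq, t)) for name, t in templates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical template database, built by encoding one example citation per style.
TEMPLATES = {
    "author_year_title_journal": encode(
        "Lee, K. (1999). Some title words. Venue, 1(2), 3-4."
    ),
    "author_title_venue_year": encode(
        "K. Lee, Some title words, Venue, vol. 1, pp. 3-4, 1999."
    ),
}

if __name__ == "__main__":
    query = "Smith, J. (2001). Parsing citations. Journal, 12(3), 45-67."
    print(encode(query))
    print(best_template(query, TEMPLATES))
```

In this sketch, adding a new citation style only means adding one more entry to `TEMPLATES`, which mirrors the flexibility described above. A fuller version would also record which positions in each template correspond to which metadata fields, so that the matched template can drive the actual parsing.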
Keywords
基因, BLAST, 著作表列, 目錄學, 後設資料, gene, BLAST, publication list, bibliography, metadata