A Study on Information Extraction from Chinese Journal Articles: The Field of Library and Information Science as an Example
Date
2023
Authors
Abstract
The volume of scientific literature is growing at an astonishing rate, and extracting the knowledge-rich content of these publications from PDF files has become an important problem. Research on this topic remains scarce in Taiwan. This study proposes a solution for extracting information from Chinese academic journal articles published in Taiwan, using the field of library and information science as an example.

By retraining the open-source scientific-literature parsing tool GROBID, this study extracts Chinese journal article information, including titles, authors, abstracts, keywords, and full text structured into logical sections. Training effectiveness is evaluated with ten-fold cross-validation. The retrained models are then used to parse 725 Taiwanese library and information science journal articles in order to observe and analyze factors that may affect the parsing success rate.

The study finds that the F1 scores of the three models (Segmentation, Header, Fulltext) show no marked improvement when the training set grows from n = 100 to n = 250. It also finds that articles from the same journal can use different layouts depending on the year of publication, and this variation affects the parsing success rate.

Finally, the parsed full text is imported into a question-answering (QA) system so that the system can answer more specialized questions, serving as an example of the added value of parsed scientific literature.
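The record does not include the extraction pipeline itself, but the workflow described above maps onto GROBID's standard REST interface. Below is a minimal sketch, assuming a locally running GROBID service on its default port (8070) with the retrained models installed in its grobid-home; the function name parse_pdf and the input file paper.pdf are illustrative, not taken from the thesis.

```python
# Minimal sketch: send one PDF to a locally running GROBID service and pull the
# title, abstract, and section headings out of the returned TEI XML.
# Assumes the GROBID service is reachable at localhost:8070 (its default) and
# that the retrained Chinese models have already been installed before startup.
import requests
import xml.etree.ElementTree as ET

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}  # namespace used by GROBID's TEI output


def parse_pdf(pdf_path: str) -> dict:
    """Call GROBID's full-text endpoint and extract a few header and body fields."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(GROBID_URL, files={"input": f}, timeout=120)
    resp.raise_for_status()

    # Use the raw bytes: the TEI response carries an XML encoding declaration.
    root = ET.fromstring(resp.content)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI)
    abstract = " ".join(
        p.text or "" for p in root.findall(".//tei:abstract//tei:p", TEI)
    )
    # Section headings of the structured body, e.g. 緒論, 文獻探討, 研究方法 ...
    sections = [h.text for h in root.findall(".//tei:body//tei:head", TEI) if h.text]
    return {"title": title, "abstract": abstract, "sections": sections}


if __name__ == "__main__":
    result = parse_pdf("paper.pdf")  # hypothetical input file
    print(result["title"])
    print(result["sections"])
```

For a batch run such as the 725-article collection, the same call would simply be repeated over the PDF files, and the returned TEI sections could then be chunked into passages for the QA system.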
Keywords
Information Extraction, Open Source, GROBID, Full Text Dataset