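Design and Implementation of an Automated System for Extracting Open Access Academic Resources: A Case Study of Taiwanese Journals in the Humanities and Social Sciences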

Field | Value | Language
dc.contributor | 曾元顯 | zh_TW
dc.contributor | Tseng, Yuen-Hsien | en_US
dc.contributor.author | 余宗翰 | zh_TW
dc.contributor.author | Yu, Zong-Han | en_US
dc.date.accessioned | 2025-12-09T07:37:20Z
dc.date.available | 2025-06-25
dc.date.issued | 2025
dc.description.abstract | 本研究旨在建置一套可穩定運作且具高度擴展性的自動化學術資源擷取系統,針對臺灣人文社會領域開放取用期刊進行抓取。現行國內引文索引資料庫多仰賴人工建檔與維護,導致資料更新與整合流程費時費力;而開放取用平台則受限於期刊端主動上架與維運意願,造成資料時效與涵蓋範圍不足,進而影響學術資源的可用性與知識庫建構的穩定性。為此,本研究設計並實作「Social and Theoretical Academic Repository(STAR)」系統,結合 Scrapy 爬蟲框架與 Docker 容器化部署技術,整合 MySQL、Redis、Django、Playwright、FTP 等模組,建立排程式爬取、結構化檔案儲存與網頁式管理操作的自動化平台。系統具備網頁式管理介面,支援管理者透過 Django 後台調整排程或即時執行爬蟲任務,亦提供 FTP 批次下載功能供使用者取得期刊全文檔案。系統完成部署後共建置 46 支期刊爬蟲模組,成功擷取 17,865 篇 PDF 文章檔案,總容量達 79 GB。比較首次與後續爬取平均耗時,整體處理效率提升 73.4 %,顯示系統具備長期穩定運作與低維運負擔的特性。本研究驗證了以模組化容器架構整合開源爬蟲技術,能有效支援多網站資料擷取與期刊資料彙整之需求,並為後續文本生成、語意比對與知識問答等應用場景,提供可重複使用之期刊資料擷取基礎。未來可進一步結合語意嵌入與文本分析工具,拓展資料加值應用場景。 | zh_TW
dc.description.abstract | This study aims to develop a stable and highly scalable automated system for extracting open-access academic resources, specifically targeting journals in the humanities and social sciences in Taiwan. Existing domestic citation index databases rely heavily on manual curation and maintenance, which makes updating and integrating data time-consuming and labor-intensive. In addition, open-access platforms depend on the willingness of journal publishers to actively upload and maintain content, which limits timeliness and coverage and ultimately affects the usability of academic resources and the stability of knowledge base construction. To address these challenges, this study designs and implements the "Social and Theoretical Academic Repository" (STAR) system. The system integrates the Scrapy web crawling framework with Docker-based container deployment, combining MySQL, Redis, Django, Playwright, and FTP modules to create an automated platform featuring scheduled crawling, structured file storage, and web-based administrative control. The system provides a web interface that allows administrators to adjust schedules or execute crawlers in real time via the Django backend. It also offers FTP-based batch downloading for users to access full-text journal PDFs. Upon deployment, 46 journal crawler modules were implemented, successfully harvesting 17,865 PDF articles with a total data volume of 79 GB. A comparison of the average time taken by the initial and subsequent crawls showed a 73.4% improvement in overall processing efficiency, indicating that the system can run stably over the long term with a low maintenance burden. This study confirms that integrating open-source crawling technologies within a modular container architecture effectively supports multi-site data extraction and journal data aggregation. The system also provides a reusable journal data extraction foundation for future applications in text generation, semantic matching, and knowledge-based question answering. Further development may incorporate semantic embedding and text analysis tools to extend value-added applications of the data. | en_US
dc.description.sponsorship | 圖書資訊學研究所圖書資訊學數位學習碩士在職專班 | zh_TW
dc.identifier | 012153203-47248
dc.identifier.uri | https://etds.lib.ntnu.edu.tw/thesis/detail/71588f271e3f9d6f9be2a38298691a90/
dc.identifier.uri | http://rportal.lib.ntnu.edu.tw/handle/20.500.12235/124504
dc.language | Chinese
dc.subject | 資料擷取 | zh_TW
dc.subject | 開放取用 | zh_TW
dc.subject | 網頁爬蟲 | zh_TW
dc.subject | Scrapy | zh_TW
dc.subject | Docker | zh_TW
dc.subject | 容器化 | zh_TW
dc.subject | Data Extraction | en_US
dc.subject | Open Access | en_US
dc.subject | Web Crawling | en_US
dc.subject | Scrapy | en_US
dc.subject | Docker | en_US
dc.subject | Containerization | en_US
dc.title | 開放取用學術資源的自動化擷取系統實作:以臺灣人文社會領域期刊資料為例 | zh_TW
dc.title | Design and Implementation of an Automated System for Extracting Open Access Academic Resources: A Case Study of Taiwanese Journals in the Humanities and Social Sciences | en_US
dc.type | Professional practice report (professional practice category)

Files

Original bundle

Name: 202500047248-109609.pdf
Size: 3.13 MB
Format: Adobe Portable Document Format
Description: Professional practice report (professional practice category)
