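Semantic Similarity Sentence Search for Science Fair Projects: Application and Performance Evaluation of Sentence Embedding Models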

古蕬瑋; Ku, Szu-Wei

dc.contributor	曾元顯	zh_TW
dc.contributor	李龍豪	zh_TW
dc.contributor	Tseng, Yuen-Hsien	en_US
dc.contributor	Lee, Lung-Hao	en_US
dc.contributor.author	古蕬瑋	zh_TW
dc.contributor.author	Ku, Szu-Wei	en_US
dc.date.accessioned	2025-12-09T07:37:21Z
dc.date.available	2025-06-30
dc.date.issued	2025
dc.description.abstract	本研究旨在探討句向量表示模型於繁體中文語意相似句辨識任務中的應用效果，研究以歷年科展得獎作品為語料來源，共12,194篇 PDF 以文字檔輸出。結合「科學展覽作品相似句搜尋系統」進行相似句配對搜尋，系統包含五種適用繁體中文的句向量表示模型。為評估模型效能與人工標記的一致性與準確性，設計不同抽樣策略與設定相似度門檻值，並針對約20,000多組句對進行人工標記，進而計算 F1分數作為評估指標。研究結果顯示，sBERT 為最適用於繁體中文語意比對的句向量模型。未經微調時，在最佳門檻值0.93下可達 F1分數0.8578。經比對學習技術微調後，sBERT 在門檻值0.91時 F1分數提升至0.9714，展現出極高的精確性。此門檻值亦可作為未來相關研究與應用的重要參考基準。抽樣結果亦顯示，sBERT 相較其他模型，具備兼顧字詞層面與語意理解的特性，因此較能辨識出改寫句型。此外，為彌補繁體中文語意相似句研究文獻的稀缺，本研究於人工標記階段制定五等級相似句分級標準，並將相似句來源歸納為六大類，提供語意比對相關研究具體分類依據。研究亦指出，句型結構與用詞大致相同但語意迥異的「爭議樣本」易導致模型誤判，為後續模型優化提供反思方向。	zh_TW
dc.description.abstract	This study investigates the effectiveness of sentence embedding models in identifying semantically similar sentences in Traditional Chinese. The corpus comprises 12,194 award-winning science fair project reports, converted from PDF files to plain text. A "Science Fair Project Similar Sentence Retrieval System" was developed, integrating five sentence embedding models suited to Traditional Chinese. To evaluate model performance and its alignment with human judgment, the study employed various sampling strategies and similarity thresholds. Over 20,000 sentence pairs were manually annotated, and the F1-score was used as the primary evaluation metric. Experimental results demonstrate that sBERT is the most effective model for semantic similarity tasks in Traditional Chinese. Without fine-tuning, sBERT achieved an F1-score of 0.8578 at the optimal threshold of 0.93. After contrastive learning-based fine-tuning, its F1-score improved to 0.9714 at a threshold of 0.91, indicating high precision. This threshold also offers a useful reference point for future research and applications. Sampling results further suggest that sBERT captures both lexical and semantic information, making it better than the other models at detecting paraphrased sentence pairs. To address the limited literature on semantic similarity in Traditional Chinese, this study also proposes a five-level similarity annotation standard and categorizes the sources of similar sentences into six types, providing a structured framework for future research. The study further highlights that “controversial samples” (pairs with similar structure and wording but divergent meanings) readily lead to model misclassification, offering a direction for future model improvement.	en_US
dc.description.sponsorship	圖書資訊學研究所圖書資訊學數位學習碩士在職專班	zh_TW
dc.identifier	012153208-47291
dc.identifier.uri	https://etds.lib.ntnu.edu.tw/thesis/detail/40d3f0bd1415bd286b0cd0f524a5e5ea/
dc.identifier.uri	http://rportal.lib.ntnu.edu.tw/handle/20.500.12235/124507
dc.language	Chinese
dc.subject	語意相似句辨識	zh_TW
dc.subject	句向量表示模型	zh_TW
dc.subject	繁體中文句向量	zh_TW
dc.subject	語意比對	zh_TW
dc.subject	Semantic Similar Sentence Identification	en_US
dc.subject	Sentence Embedding Models	en_US
dc.subject	Traditional Chinese Sentence Embeddings	en_US
dc.subject	Semantic Comparison	en_US
dc.title	科展作品語意相似句搜尋：句向量表示模型之應用與成效評估	zh_TW
dc.title	Semantic Similarity Sentence Search for Science Fair Projects: Application and Performance Evaluation of Sentence Embedding Models	en_US
dc.type	Academic thesis
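
The abstract reports F1-scores at an "optimal threshold" on the similarity score. As a rough illustration of how such a threshold could be selected, the sketch below sweeps candidate thresholds over scored sentence pairs and keeps the one with the highest F1. It is not the thesis's code: the function names, the toy scores, the labels, and the candidate thresholds are all hypothetical, and the scores are assumed to be similarity values where higher means more similar.

```python
from typing import List, Sequence, Tuple


def f1_at_threshold(scores: Sequence[float], labels: Sequence[bool], threshold: float) -> float:
    """F1 when a pair is predicted 'similar' if its score >= threshold."""
    tp = fp = fn = 0
    for score, is_similar in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_similar:
            tp += 1
        elif predicted and not is_similar:
            fp += 1
        elif not predicted and is_similar:
            fn += 1
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores: Sequence[float], labels: Sequence[bool],
                   candidates: Sequence[float]) -> Tuple[float, float]:
    """Return (threshold, f1) for the candidate with the highest F1."""
    results: List[Tuple[float, float]] = [
        (t, f1_at_threshold(scores, labels, t)) for t in candidates
    ]
    return max(results, key=lambda pair: pair[1])


# Hypothetical toy data: similarity scores with human labels (True = similar).
toy_scores = [0.98, 0.95, 0.92, 0.90, 0.85, 0.80]
toy_labels = [True, True, False, True, False, False]
threshold, f1 = best_threshold(toy_scores, toy_labels, [0.80, 0.85, 0.90, 0.95])
print(f"best threshold = {threshold}, F1 = {f1:.4f}")
```

On real data the candidate list would typically be a fine grid (for example, steps of 0.01), and the labels would come from the manual annotation described in the abstract.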

Files

Original bundle

Name: 202500047291-109641.pdf
Size: 1.93 MB
Format: Adobe Portable Document Format
Description: Academic thesis

Collections

Theses and Dissertations