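Semantic Similarity Sentence Search for Science Fair Projects: Application and Performance Evaluation of Sentence Embedding Models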

dc.contributor: 曾元顯 (zh_TW)
dc.contributor: 李龍豪 (zh_TW)
dc.contributor: Tseng, Yuen-Hsien (en_US)
dc.contributor: Lee, Lung-Hao (en_US)
dc.contributor.author: 古蕬瑋 (zh_TW)
dc.contributor.author: Ku, Szu-Wei (en_US)
dc.date.accessioned: 2025-12-09T07:37:21Z
dc.date.available: 2025-06-30
dc.date.issued: 2025
dc.description.abstract: 本研究旨在探討句向量表示模型於繁體中文語意相似句辨識任務中的應用效果,研究以歷年科展得獎作品為語料來源,共12,194篇 PDF 以文字檔輸出。結合「科學展覽作品相似句搜尋系統」進行相似句配對搜尋,系統包含五種適用繁體中文的句向量表示模型。為評估模型效能與人工標記的一致性與準確性,設計不同抽樣策略與設定相似度門檻值,並針對約20,000多組句對進行人工標記,進而計算 F1分數作為評估指標。研究結果顯示,sBERT 為最適用於繁體中文語意比對的句向量模型。未經微調時,在最佳門檻值0.93下可達 F1分數0.8578。經比對學習技術微調後,sBERT 在門檻值0.91時 F1分數提升至0.9714,展現出極高的精確性。此門檻值亦可作為未來相關研究與應用的重要參考基準。抽樣結果亦顯示,sBERT 相較其他模型,具備兼顧字詞層面與語意理解的特性,因此較能辨識出改寫句型。此外,為彌補繁體中文語意相似句研究文獻的稀缺,本研究於人工標記階段制定五等級相似句分級標準,並將相似句來源歸納為六大類,提供語意比對相關研究具體分類依據。研究亦指出,句型結構與用詞大致相同但語意迥異的「爭議樣本」易導致模型誤判,為後續模型優化提供反思方向。 (zh_TW)
dc.description.abstract: This study investigates the effectiveness of sentence embedding models in identifying semantically similar sentences in Traditional Chinese. The corpus comprises 12,194 award-winning science fair project reports, converted from PDF files to plain text. Similar-sentence pairs were retrieved with the "Science Fair Project Similar Sentence Retrieval System", which integrates five sentence embedding models suitable for Traditional Chinese. To evaluate model performance and its agreement with human judgment, the study employed several sampling strategies and similarity thresholds. Over 20,000 sentence pairs were manually annotated, and the F1-score was used as the evaluation metric. Experimental results show that sBERT is the most effective model for semantic similarity tasks in Traditional Chinese. Without fine-tuning, sBERT achieved an F1-score of 0.8578 at the optimal threshold of 0.93. After contrastive-learning-based fine-tuning, its F1-score improved to 0.9714 at a threshold of 0.91, indicating high precision. This threshold also offers a reference point for future research and applications. Sampling results further suggest that sBERT captures both lexical and semantic information, making it better than the other models at detecting paraphrased sentence pairs. To address the limited literature on semantic similarity in Traditional Chinese, this study also proposes a five-level similarity annotation standard and groups the sources of similar sentences into six categories, providing a structured framework for future research. The study further highlights that "controversial samples" (pairs with largely the same structure and wording but divergent meanings) readily cause model misclassification, offering a direction for future model improvement. (en_US)
dc.description.sponsorship: 圖書資訊學研究所圖書資訊學數位學習碩士在職專班 (zh_TW)
dc.identifier: 012153208-47291
dc.identifier.uri: https://etds.lib.ntnu.edu.tw/thesis/detail/40d3f0bd1415bd286b0cd0f524a5e5ea/
dc.identifier.uri: http://rportal.lib.ntnu.edu.tw/handle/20.500.12235/124507
dc.language: Chinese
dc.subject: 語意相似句辨識 (zh_TW)
dc.subject: 句向量表示模型 (zh_TW)
dc.subject: 繁體中文句向量 (zh_TW)
dc.subject: 語意比對 (zh_TW)
dc.subject: Semantic Similar Sentence Identification (en_US)
dc.subject: Sentence Embedding Models (en_US)
dc.subject: Traditional Chinese Sentence Embeddings (en_US)
dc.subject: Semantic Comparison (en_US)
dc.title: 科展作品語意相似句搜尋:句向量表示模型之應用與成效評估 (zh_TW)
dc.title: Semantic Similarity Sentence Search for Science Fair Projects: Application and Performance Evaluation of Sentence Embedding Models (en_US)
dc.type: Academic thesis
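
The evaluation described in the abstract (picking a similarity threshold and scoring the resulting predictions against human labels with F1) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names `f1_at_threshold` and `best_threshold`, the candidate thresholds, and the six toy scores and labels are invented and are not data or code from the thesis.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when a pair is predicted 'similar' if its score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0  # no true positives: treat F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Toy data (hypothetical): model similarity scores and human judgments.
scores = [0.97, 0.95, 0.92, 0.90, 0.88, 0.85]
labels = [True, True, True, False, True, False]
candidates = [0.80, 0.85, 0.90, 0.92, 0.95]

best = best_threshold(scores, labels, candidates)
print(best, round(f1_at_threshold(scores, labels, best), 4))
```

On real data, the scores would come from comparing sentence embeddings and the labels from the manual annotation; the sweep itself stays the same.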

Files

Original bundle

Name: 202500047291-109641.pdf
Size: 1.93 MB
Format: Adobe Portable Document Format
Description: Academic thesis
