Semantically Similar Sentence Retrieval in Science Fair Projects: Application and Effectiveness Evaluation of Sentence Embedding Models

Date

2025

Abstract

This study investigates the effectiveness of sentence embedding models in identifying semantically similar sentences in Traditional Chinese. The corpus comprises 12,194 award-winning science fair project reports, converted from PDF files to plain text. A "Science Fair Project Similar Sentence Retrieval System" was developed, integrating five sentence embedding models suited to Traditional Chinese. To evaluate model performance and its agreement with human judgment, the study applied several sampling strategies and similarity thresholds; more than 20,000 sentence pairs were manually annotated, and the F1-score served as the primary evaluation metric. The results show that sBERT is the most effective model for semantic similarity matching in Traditional Chinese. Without fine-tuning, sBERT achieved an F1-score of 0.8578 at the optimal threshold of 0.93; after contrastive-learning fine-tuning, its F1-score rose to 0.9714 at a threshold of 0.91, indicating very high precision. This threshold also offers a useful reference point for future research and applications. Sampling results further suggest that sBERT captures both lexical and semantic information, making it better at detecting paraphrased sentence pairs than the other models. To address the scarcity of literature on semantic similarity in Traditional Chinese, the study also defines a five-level similarity annotation scale and classifies the sources of similar sentences into six categories, providing a concrete framework for related research. Finally, the study notes that "controversial samples" (pairs with nearly identical structure and wording but divergent meanings) remain a major source of misclassification, pointing to a direction for future model improvement.
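The evaluation procedure described above, scoring sentence pairs by embedding similarity, applying a threshold, and computing an F1-score against human labels, can be sketched as follows. This is a minimal illustration, not the study's actual code: the two-dimensional vectors are toy stand-ins for real sBERT embeddings, while the 0.91 threshold is the tuned value reported in the abstract.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

def f1_at_threshold(pairs, labels, threshold):
    # pairs: list of (vec_a, vec_b); labels: 1 = similar, 0 = dissimilar.
    # A pair is predicted "similar" when cosine similarity >= threshold.
    preds = [1 if cosine(a, b) >= threshold else 0 for a, b in pairs]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy embeddings standing in for real sBERT sentence vectors.
pairs = [
    ([1.0, 0.0], [0.98, 0.20]),   # near-duplicate pair
    ([1.0, 0.0], [0.00, 1.00]),   # unrelated pair
    ([0.6, 0.8], [0.59, 0.81]),   # paraphrase pair
]
labels = [1, 0, 1]
print(round(f1_at_threshold(pairs, labels, 0.91), 2))  # prints 1.0 on this toy set
```

In the study itself, the threshold would be swept over candidate values and the one maximizing F1 on the annotated pairs retained (0.93 before fine-tuning, 0.91 after).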

Keywords

Semantically Similar Sentence Identification, Sentence Embedding Models, Traditional Chinese Sentence Embeddings, Semantic Comparison
