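Latent Conceptual Analysis: Representing Conceptual Knowledge of Semantic Relations in a Vector Space Model Using Chinese Web Data (潛在概念分析-利用中文網路資料在向量空間模型中呈現語意關係概念知識)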
Date
2012
Abstract
In Natural Language Processing, lexical patterns are widely used in experiments that measure similarity among word relations. Despite their growing importance, these patterns are rarely examined to determine which aspects of the word relations they are claimed to represent they actually reflect. This thesis argues that lexical patterns and the word relations they represent share the same conceptual nature in language use.
The thesis also proposes a computational model, latent conceptual analysis (LCA), that captures this conceptual nature and uses it to calculate similarity among the patterns. LCA is an automatic algorithm that relies on singular value decomposition (SVD) to reduce the high dimensionality that results from a large-scale corpus. First, 35 lexical patterns are generated semi-automatically and used as input to LCA. For each pattern, LCA then produces a list of the other 34 patterns ordered from nearest to farthest by similarity distance. To validate LCA, its output is compared with a manually annotated clustering whose criteria follow the classification principles of the lexical resource FrameNet. The comparison shows that the distances computed by LCA are similar to the manual clustering.
The approach is similar to that of Turney (2006) and Bollegala et al. (2009). However, rather than relying solely on frequency distribution, LCA also takes into account language users' conceptual knowledge about lexical patterns. Because LCA uses Web content as its corpus, the unstable and constantly changing nature of Web data can sometimes affect its performance. Future studies applying LCA could mitigate this problem by collecting data over a longer period.
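The SVD-and-ranking step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `rank_patterns`, the use of cosine similarity, the number of retained dimensions `k`, and the toy counts are all assumptions made for illustration. The abstract states only that SVD reduces dimensionality and that each pattern is ranked against the others by distance.

```python
import numpy as np

def rank_patterns(matrix, k):
    """Rank each row's neighbours by cosine similarity in a rank-k SVD space.

    matrix: (n_patterns, n_features) array of counts (hypothetical layout).
    k: number of singular dimensions to keep (an assumed parameter).
    Returns (order, sim): order[i] lists the other patterns, nearest first.
    """
    # Thin SVD; singular values come back in descending order.
    u, s, _ = np.linalg.svd(np.asarray(matrix, dtype=float), full_matrices=False)
    reduced = u[:, :k] * s[:k]              # pattern coordinates in k dimensions
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    unit = reduced / np.where(norms == 0, 1.0, norms)
    sim = unit @ unit.T                     # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)          # push each pattern to the end of its own list
    order = np.argsort(-sim, axis=1)[:, :-1]  # drop the pattern itself
    return order, sim

# Toy data: 4 hypothetical patterns x 5 hypothetical context features.
toy = np.array([
    [3, 2, 0, 0, 0],
    [6, 4, 0, 0, 0],
    [0, 0, 1, 4, 2],
    [0, 0, 2, 3, 1],
])
order, sim = rank_patterns(toy, k=2)
print(order[:, 0])  # nearest neighbour of each pattern
```

With 35 input patterns, `order` would have 34 entries per row, matching the per-pattern lists described above. How the thesis builds the matrix and incorporates conceptual knowledge is not specified in the abstract, so the sketch leaves that part out.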
Keywords
relation similarity, lexical relations, Vector Space Model