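Automatic Extraction of Bilingual Translation-Equivalent N-grams Using Specialized Comparable Corpora and Machine Translation: The Case of the Contract Genre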
Date
2012
Abstract
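This study starts from the demand for contract translation in the professional translation market. Contracts are a highly specialized genre whose style differs sharply from that of general documents, yet they are also formulaic and repetitive, which makes them well suited to translation memory systems. Over the past decade translation memory systems have become increasingly common in the translation market, but such systems rely on a parallel database of bilingual translations as the basis for retrieval, so that existing translations can be reused; without a sufficiently large translation database, the tool itself delivers no benefit. This is precisely what limits the use of translation memory systems for contract translation: contract documents involve sensitive, confidential information of the contracting parties, so bilingual parallel texts are hard to obtain, and accumulating a translation database through human translation is slow and laborious, which makes it impossible to quickly build a contract translation corpus that can be applied directly in a translation memory system.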
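This study therefore starts from monolingual specialized corpora in different languages, known in the literature as comparable corpora, to overcome the shortage of translated text. Although comparable corpora in a specialized domain are not translations of each other, the domain terms, concepts and common expressions they cover necessarily overlap to a large extent and are mutual translations. The purpose of this study is to explore a feasible method that uses statistical machine translation and string similarity matching to automatically extract bilingual translation-equivalent continuous word strings, i.e. N-grams, from comparable corpora of Chinese and English contracts.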
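The method first builds comparable corpora of Chinese and English contracts using the Internet as the text source. Next, corpus query tools are used to extract the keywords and key keywords of the contracts, and N-grams are built from these core keywords. The Google Translator Toolkit automatic translation service is then applied to produce machine translations of the Chinese and the English N-grams respectively. Finally, the similarity matching function of a translation memory system is borrowed to compare the English contract N-grams against the machine English translations of the Chinese N-grams; if the two are identical or highly similar, the English N-gram and the corresponding Chinese N-gram are very likely translations of each other. Pairs are extracted in the Chinese-to-English direction in the same way, again using English as the pivot language for similarity matching. After expert evaluation of the resulting Chinese-English N-gram pairs, 3-grams to 6-grams with a high match (95% or above) were found to have a mapping accuracy of 83%.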
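The N-gram building step described above can be illustrated with a minimal Python sketch. It assumes the text is already tokenized into words (Chinese text would first need word segmentation, which is not shown) and that a keyword list is already available; the function name `keyword_ngrams`, the sample sentence and the keyword are hypothetical and are not taken from the study, which used corpus query tools for this step.

```python
from collections import Counter


def keyword_ngrams(tokens, keywords, min_n=3, max_n=6):
    """Count every n-gram of min_n..max_n tokens that contains a keyword.

    `tokens` is a list of word tokens and `keywords` a set of keyword
    tokens; both are assumed inputs for this illustration.
    """
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Keep only windows anchored on at least one keyword.
            if any(tok in keywords for tok in gram):
                counts[" ".join(gram)] += 1
    return counts


# Hypothetical sample sentence and keyword, for illustration only.
tokens = "this agreement shall be governed by the laws of the state".split()
counts = keyword_ngrams(tokens, keywords={"governed"})

for gram, freq in sorted(counts.items()):
    print(freq, gram)
```

Restricting the windows to those containing a keyword keeps the candidate list focused on domain vocabulary; in practice a frequency cut-off would probably also be applied, but the abstract does not state one.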
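The results show that the proposed method is technically relatively simple and feasible, and that it can extract Chinese and English N-grams that are translations of each other. In translation practice, these paired N-grams can be imported directly into a translation memory system as a translation resource, or used as search keywords to retrieve related terms, concepts, collocations, sentence patterns and contexts from the contract corpora; in particular, they can locate Chinese and English clauses that are related in content though not direct translations, helping translators improve efficiency and quality. The same resources can also be applied to the teaching of contract translation: using the specialized Chinese and English contract corpora as parallel texts, together with the Chinese-English N-gram pairs, students can quickly and effectively retrieve the contract terms, concepts, sentence patterns and translation equivalents they need, greatly reducing search time and shortening the learning curve. In computational linguistics, the proposed method may also serve as a reference for information engineering, machine translation and translation memory system development, and could be extended to extract translation-equivalent contract terms or even translation-equivalent contract sentences.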
This study is motivated by the need for contract translation. Business contracts belong to a highly specialized genre, characterized by specific vocabulary, domain terms, formulaic expressions and repetitive standard clauses. These features make contract texts an ideal candidate for a Translation Memory (TM) system, which searches and retrieves past translations from a database of source texts and their target-language equivalents in aligned segments. A TM system thus requires a large database of past translations (known as a parallel corpus) to give the best results. Therein lies the difficulty in using a TM system for English-Chinese contract translation, as parallel corpora of English and Chinese contracts are scarcely available. To overcome the limitations of parallel corpora, this study turns to comparable corpora, i.e. monolingual corpora of similar design in two or more languages. Comparable corpora of a specialized domain, though not direct translations of each other, contain domain terms, concepts and fixed expressions that are mutual translations. This study aims to explore a simple yet effective method for extracting such translation equivalents from a comparable corpus of Chinese and English contracts by employing statistical machine translation and string similarity comparison. First, a comparable corpus of Chinese and English contracts is built from texts mined from the Internet. Second, keyword and key keyword lists are built with concordancer tools, and Chinese and English N-grams are then built from these lists. Third, the N-grams are translated into English and Chinese respectively with Google Translator Toolkit. Finally, the English N-grams are compared with the Google-translated English, using the built-in similarity comparison function of a TM system. English N-grams that meet or exceed a pre-defined match value are automatically mapped to the corresponding Chinese N-grams to establish a list of English-Chinese N-gram pairs.
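The final matching step can be sketched as follows. This is only an illustration under stated assumptions: the study used the built-in similarity function of a TM system, whose scoring is not described here, so Python's `difflib.SequenceMatcher` is swapped in as a stand-in, and its ratio is not the same measure as a TM match value. The function names, the sample N-grams and the machine translations shown are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for a TM match value."""
    return SequenceMatcher(None, a.lower(), b.lower(), autojunk=False).ratio()


def match_ngrams(english_ngrams, zh_to_mt_english, threshold=0.95):
    """Pair English N-grams with Chinese N-grams through English as the pivot.

    `zh_to_mt_english` maps each Chinese N-gram to its machine English
    translation (assumed to have been produced beforehand).
    """
    pairs = []
    for zh, mt_en in zh_to_mt_english.items():
        for en in english_ngrams:
            score = similarity(en, mt_en)
            # Keep the pair only if it meets the pre-defined match value.
            if score >= threshold:
                pairs.append((en, zh, score))
    # Best matches first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical data: English corpus N-grams and MT output for Chinese N-grams.
english_ngrams = ["governed by the laws of", "terminate this agreement"]
zh_to_mt_english = {
    "終止本合約": "terminate this agreement",
    "不可抗力事件": "force majeure event",
}

pairs = match_ngrams(english_ngrams, zh_to_mt_english)
for en, zh, score in pairs:
    print(f"{score:.2f}  {en}  <->  {zh}")
```

Using English as the pivot means both sides of the comparison are in the same language, so an ordinary monolingual string comparison is enough; the 0.95 default mirrors the 95% match value mentioned in the abstract.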
Chinese N-grams are also mapped to possible English N-gram translation equivalents following the same procedure. These N-gram pairs are evaluated by experienced contract translators, and the results show that 3-word to 6-word N-grams with a match value of 95% and above have a mapping accuracy of 82%. The results show that the method employed is technically simple yet effective. For contract translators, the correctly mapped N-gram pairs can be imported into a TM system as a translation resource, or they can be used as concordance search keywords to retrieve needed terms, concepts, collocations, appropriate sentence patterns and contexts from the comparable corpus. The same resources can be applied to translator training. Students will benefit from authentic parallel texts, and using the Chinese-English N-gram pairs will improve search results and shorten the learning curve. For computational linguistics, the findings in this paper may suggest further study into the extraction of contract terms or even sentence-level translation equivalents from comparable corpora.
Keywords
comparable corpora, contract translation, machine translation, translation memory, N-grams, similarity comparison