Shu-Kai Hsieh (謝舒凱)
Mei-Yu Chen (陳美瑜)
2019-09-03; 2014-06-14; 2019-09-03; 2013
http://etds.lib.ntnu.edu.tw/cgi-bin/gs32/gsweb.cgi?o=dstdcdr&s=id=%22GN0698210322%22.&%22.id.&
http://rportal.lib.ntnu.edu.tw:80/handle/20.500.12235/97814

Abstract (translated from the Chinese):
Individual differences in writing style (stylometry) have long been a popular research topic. From a linguistic perspective, researchers have tried various quantitative methods and constructed various indexes in the hope of quantifying "individual difference" (Tweedie & Baayen, 1998; Mosteller & Wallace, 1964; Burrows, 2002, 2003, 2007; Hoover, 2004). From the perspective of information science, there is a growing demand for language forensics and authorship classification of documents: in the digital age, such techniques are needed to help detect the increasing number of anonymous online crimes and to classify the authors of digitized documents.

This thesis first introduces how these two disciplines study individual differences in writing style, and then carries out two experiments. The experiments use personal corpora collected from the popular social networking site Facebook to explore how much explanatory power Chinese characters and words provide for individual writing differences, and to mine other stylistic attributes of the texts, such as structural, subjectivity, and emotion features, to see how much they help in identifying the authors of short social media posts. The study also examines which of the common feature-weighting schemes (tf-idf, term frequency, ratio) yields better accuracy. A recent support-vector-machine package, LibLinear, is adopted as the authorship classifier; its design makes it well suited to training on high-dimensional feature sets, such as document classification tasks in which a large number of words serve as features. Unlike most classifiers, LibLinear also provides, for each feature, a contribution score with respect to each class, which helps the researcher inspect which features best represent a given author.

The results of Experiment 1 show that tf-idf weighting performs slightly better than the ratio measure, but not better than term frequency. This suggests that in short social media posts, keywords seldom recur, whether within a single post or across the whole corpus. Possible reasons are that posts on social networking sites are short and thus contain few words, and that people on such platforms tend to change topics constantly. Hence tf-idf, which down-weights function words and up-weights a document's keywords, cannot play to its strengths in this genre, and simple term frequency performs better. This result may also indicate that, in the comparison between function-word and content-word features, tf-idf's built-in assumption that function-word features are unimportant for authorship identification is inappropriate.

Experiment 2 demonstrates the discriminative power of Chinese lexical units at different levels (e.g. characters, words, bigrams, and a mixture of characters and words) for authorship identification. Another issue common in Chinese authorship identification is word segmentation. Unlike alphabetic languages, Chinese has no delimiters between words in its surface structure, so much previous work on Chinese authorship identification chose unsegmented approaches. The second experiment segments the texts with CKIP and uses both unsegmented and segmented results as features, to explore the discriminative power of different lexical units in Chinese (character-based and word-based unigrams, word-based bigrams, and a mixture of characters and words). The results show that word-based features outperform character-based features. Experiment 2 also adds feature sets beyond characters and words (structural, subjectivity, and emotion features); the results show the importance of subjectivity and emotion features in the social media genre.

Abstract (English):
Individual writing differences (stylometry) have been a popular research interest. In linguistics, researchers want to know whether individual differences can be quantified and measured by certain indexes or statistics (Tweedie & Baayen, 1998; Mosteller & Wallace, 1964; Burrows, 2002, 2003, 2007; Hoover, 2004). From an information-technology perspective, there is an increasing need for document forensics to detect the authorship of anonymous documents, either to help investigate internet crimes or to serve various document classification purposes.
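The tf-idf weighting examined in Experiment 1 can be sketched in a few lines of Python. This is a minimal, stdlib-only illustration of the general tf * log(N/df) idea, not the thesis's actual implementation; the toy segmented posts are invented for the example.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Weight each term in each document by tf * log(N / df).

    Terms that appear in many documents (typically function words)
    get a small idf, so their weight is pushed toward zero.
    """
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

# Toy segmented posts: the particle "的" occurs in every post,
# so its idf is log(3/3) = 0 and its tf-idf weight vanishes.
posts = [["我", "的", "貓"], ["你", "的", "狗"], ["今天", "的", "天氣"]]
weights = tfidf_weights(posts)
```

This zeroing-out of ubiquitous function words is exactly the presupposition the first experiment calls into question: if function words do carry authorial signal, raw term frequency keeps information that tf-idf discards.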
This paper introduces different ways of measuring individual writing differences in both the linguistic and the information-technology disciplines. Two experiments are carried out on individuals' texts collected from the prevalent social media platform, Facebook, to investigate to what extent Chinese characters and words can capture individual writing differences, and to what extent other textual attributes, such as structural, subjectivity, and emotion cues, contribute to classifying this kind of short social text. This study also examines three feature-weighting methods (i.e. tf-idf, frequency, and ratio) and compares their effectiveness in short-text classification. A recently released SVM classifier, LibLinear, is adopted. The design of this software package not only makes it well adapted to document classification tasks, where the dimension of the feature space is extremely high, but also provides a ranking score for each feature, telling the researcher which features in the feature set best discriminate and represent a specific category. In the first experiment, tf-idf weighting outperforms the ratio measure but not the frequency measure. The result shows that in this kind of short social text, keywords seldom repeat themselves, whether locally or across the corpus. This might be attributed to the relatively short length of a single post, which limits how many words it can contain, and to the characteristic of social platforms that people change topics frequently. Therefore, tf-idf, whose benefit is to down-weight function words while promoting locally frequent content words, shows no extra discriminative power over the simpler frequency measure. Moreover, the presupposition of tf-idf that function words provide no information about an author's preferences might not be adequate. Another common issue when carrying out Chinese authorship identification is the segmentation problem.
Unlike alphabetic languages, Chinese has no word boundaries in its surface structure. Thus, much previous research chose to tackle the language with non-segmented approaches. The second experiment demonstrates the discriminative power of different lexical levels (i.e. character-based and word-based unigrams, word-based bigrams, and a mixture of characters and words) in Chinese authorship identification. The result shows that word-based features perform much better than character-based features. The second experiment also takes further feature levels into account (i.e. the structural, subjectivity, and emotion levels). The result shows the important role of subjectivity and emotion cues in the genre of social media texts.

Keywords: authorship identification; stylometry; SVM; text mining; individual difference; emotion; naturalistic data; social media short text

Title: Chinese Authorship Identification: A Case Study Based on Social Corpus -- Facebook