Construction of a Summarization Ability Scale, and Construction and Performance Evaluation of an Automatic Summary Scoring System
Date
2021
Abstract
In recent years, with the rollout of Taiwan's Curriculum Guidelines for 12-Year Basic Education, competency cultivation has received growing emphasis. The competency attracting the most attention is reading comprehension, a cross-domain literacy, and with it has come discussion of reading instruction and reading strategies. Many teachers weave reading-comprehension concepts into their teaching and frequently assign various reading tasks; among these, writing a summary is regarded as the task that best reveals whether a reader has grasped the content of a text, and it is often used as a check on reading comprehension. In practice, however, tools for scoring summaries are scarce, standards are inconsistent, and results from different tests cannot be compared. This research therefore set out to construct a summary scoring rubric applicable to a broad population, to survey the development of students' summarization ability, and to build a summarization ability scale through item response theory (IRT) that gives teachers a reference standard for gauging students' levels. More importantly, in response to the needs of reading instruction, it examined the feasibility of applying automatic summary scoring to post-reading assessment.

The work is divided into two studies. Study 1 collected empirical data to chart the development of students' summarization ability and developed a scoring rubric to guide teachers' assessments of that ability; the expert scoring it produced also served as the validation criterion for the automatic scoring in Study 2. Four texts of different difficulty were selected as test materials, and participants restated each text's main ideas in a written summary after reading it. The participants were 2,003 students in grades 2 through 9. Texts were assigned by the researchers according to difficulty, matched to the students' grades; each student wrote one or two summaries, for 2,591 summaries in total. All summaries were scored under the rubric built in this research on four dimensions (completeness, key information, condensation and integration, and wording) to give an overall assessment of summarization ability. The raters were experienced teachers recruited for the research (referred to here as expert raters). Spearman's rank correlations between the two first-pass scores for each text showed high inter-rater agreement, at least .85, indicating stable scoring quality. Moreover, because some students wrote two summaries on different test texts, the scores for all texts could be equated through a common-person design; IRT analysis then linked performance across grades and scaled students' summarization development. The scaled results showed the same trend as the raw summary scores, with mean ability rising grade by grade. This not only supports the validity of the teachers' scoring but also lets the grade means anchor a summarization ability scale that serves as a reference standard for locating summarization ability.

Study 2 focused on building automatic summary scoring models and examining their performance. Using machine learning, three techniques (paragraph vectors, latent semantic analysis (LSA), and Bidirectional Encoder Representations from Transformers (BERT)) were each combined with density peaks clustering to generate computer summaries. An automatic scoring module, built to mirror the rubric from Study 1 so as to stay close to teaching practice, then assessed the quality of student summaries by comparing them with the computer summaries. The three rubric dimensions that belong to reading comprehension (completeness, key information, condensation and integration) were represented, respectively, by the proportion of the text's topics included in the student summary, the proportion of key terms the student summary contained, and the semantic similarity between the student summary and the computer summary (see the sketch following this abstract).

Performance was checked on two levels. The first was the quality of the automatically generated summaries: Recall-Oriented Understudy for Gisting Evaluation (ROUGE), the concept-word repetition rate, and topic coverage were used to test whether the summaries extracted by the three techniques adequately represented the source texts. Paragraph vectors and LSA produced summaries of good and comparable quality, while BERT fared relatively poorly. The second level, the other focus of this work, was the performance of automatic scoring itself: the expert ratings were compared with the quality judgments of the three scoring models through correlation analysis and accuracy statistics, the model agreeing most closely with the experts counting as the best performer. Spearman's rank correlations between model and expert total scores ranged from .61 to .68, approaching a high correlation; correlations on individual dimensions were at least .46; and all coefficients were statistically significant, indicating that every model's automatic scores tracked the experts' scores and represented them well. Accuracy was likewise strong, with adjacent accuracy of at least 80% for all three models and little difference among them; LSA was the most stable. In addition, extractive summaries compiled by the expert raters were fed through the same three-dimension scoring module to score the student summaries and compute accuracy. This shows not only which model performs best but also how good the three automatic scoring models are in absolute terms. Even with the computer summaries replaced by expert summaries as the comparison baseline, scoring accuracy showed no clear difference, indicating that the automatic summarization techniques adopted here perform on a par with expert summaries.

Compared with existing assessments of summarization ability, the chief strength of this work is that Study 1 collected student summaries across learning stages, established the validity of the scoring rubric, and placed students' summarization performance on a single scale that supports long-term tracking of its development; going beyond conventional practice, it also factored in text difficulty to assess summarization ability accurately. As for Study 2, past technical development has mostly focused on generating computer summaries effectively, and research on automatic scoring of Chinese summaries is rare; the few systems billed as automatic summary scorers mostly judge quality by semantic similarity alone, neglecting the other components of summarization ability. By integrating automatic summarization with a scoring module, this work addresses the fine-grained summarization skills emphasized in teaching practice (completeness, key information, condensation and integration); and by linking and comparing the results with expert manual scoring, it probes how well different models serve automatic summary scoring, providing valuable empirical evidence for research and development in this area.
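A minimal sketch of how such a three-dimension scoring module could be assembled, assuming pre-segmented (space-joined) Chinese text and illustrative topic and key-term lists; the function and variable names are hypothetical, not the thesis's actual implementation:

```python
# A minimal sketch of the three-dimension scoring module; names such as
# score_summary, topic_keywords, and key_terms are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def score_summary(student_tokens, topic_keywords, key_terms,
                  student_text, computer_text):
    """Return (completeness, key_information, integration), each in [0, 1].

    student_tokens: word-segmented student summary (list of tokens)
    topic_keywords: one keyword list per topic of the source text
    key_terms:      key vocabulary of the source text
    student_text / computer_text: space-joined, pre-segmented strings
    """
    token_set = set(student_tokens)
    # Completeness: share of source-text topics the summary touches.
    completeness = sum(
        any(word in token_set for word in topic) for topic in topic_keywords
    ) / len(topic_keywords)
    # Key information: share of key terms the summary contains.
    key_information = sum(term in token_set for term in key_terms) / len(key_terms)
    # Integration: semantic similarity to the computer summary
    # (TF-IDF cosine as a stand-in for paragraph-vector/LSA/BERT similarity).
    vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None)
    matrix = vectorizer.fit_transform([student_text, computer_text])
    integration = float(cosine_similarity(matrix[0], matrix[1])[0, 0])
    return completeness, key_information, integration
```

In this sketch the integration dimension uses TF-IDF cosine similarity purely as a placeholder for the paragraph-vector, LSA, or BERT similarity the thesis describes.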
Summarization is a key component of reading literacy. Recognizing the importance of assessing students' summarization ability, this research project aimed to improve current summary assessment tools. The work comprises two studies. The goal of Study 1 was to construct summary scoring rubrics usable across a wide range of students and text types and to investigate students' summarizing ability. Four reading texts at different difficulty levels were selected as materials. A total of 2,003 students from second through ninth grade participated; each was assigned one or two texts, matched in difficulty to the student's grade, and asked to restate the texts' main ideas. In all, 2,591 summaries were collected and then graded by teachers recruited for the study.

The scoring rubric developed here comprises four dimensions: completeness, key information, integration, and wording and phrasing. Each summary was graded on these dimensions by two teachers. Spearman's rank correlations between the two scores were .85 or higher, showing high inter-rater reliability and stable rating quality. Because some participants wrote two summaries, the ratings could be linked through common persons. Analysis with item response theory scaled the summarization ability of students from second through ninth grade: a multidimensional random coefficients multinomial logit model (MRCMLM) showed that students' abilities (theta) increased with grade, mirroring the trend in the raw scores. The resulting summarization ability scale and the mean theta of each grade provide standards for judging how well a student's summarization skills have developed relative to peers.

Study 2 focused on designing automatic summary scoring models and evaluating their effectiveness. It combined machine-learning techniques, namely paragraph embedding, latent semantic analysis (LSA), and Bidirectional Encoder Representations from Transformers (BERT), with density peaks clustering to generate automatic summaries (computer summaries; a sketch of this step follows below). It also designed automatic scoring models corresponding to the dimensions of completeness, key information, and integration; each written summary was scored by comparing it with the computer summaries, and the automatic scores were then checked against the teachers' scores from Study 1.

To evaluate the automatic scoring, the quality of the computer summaries was examined first, using three kinds of indices: Recall-Oriented Understudy for Gisting Evaluation (ROUGE), the ratio of concept words used, and the coverage of themes. Summaries generated by paragraph embedding and LSA were of similar, good quality, whereas BERT performed relatively poorly.
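As an illustration of the generation step, here is a compact sketch of density-peaks-based sentence selection (after Rodriguez and Laio's 2014 clustering method), assuming a hypothetical encoder that turns each sentence into a vector; the thesis's actual pipeline may differ in its density kernel and selection rule:

```python
# Density-peaks sentence selection (after Rodriguez & Laio, 2014).
# `sent_vecs` is an (n_sentences, dim) array from a hypothetical encoder.
import numpy as np

def density_peaks_summary(sent_vecs, n_pick=3, d_c=0.5):
    """Return indices (in original order) of the density-peak sentences."""
    # Pairwise Euclidean distances between sentence vectors.
    d = np.linalg.norm(sent_vecs[:, None, :] - sent_vecs[None, :, :], axis=-1)
    # Local density: Gaussian kernel with cutoff distance d_c.
    rho = np.exp(-(d / d_c) ** 2).sum(axis=1)
    # Delta: distance to the nearest sentence of higher density.
    delta = np.empty_like(rho)
    for i in range(len(rho)):
        higher = rho > rho[i]
        delta[i] = d[i, higher].min() if higher.any() else d[i].max()
    # Sentences with both high density and high delta are cluster peaks.
    peaks = np.argsort(rho * delta)[::-1][:n_pick]
    return sorted(peaks.tolist())

# Usage (hypothetical): summary = [sentences[i] for i in
#     density_peaks_summary(embed(sentences))]
```

Selecting the peak sentences themselves makes the summary extractive: each peak stands in for one cluster of semantically similar sentences in the source text.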
The other focus of Study 2 was the performance of the automatic scoring models themselves. Each written summary had four scores: one graded by the experts (the teachers in Study 1) and three rated by the different automatic scoring models. Comparing the experts' ratings with each automatic score, Spearman's rank correlation coefficients for the total scores ranged from .61 to .68, approaching a high correlation, and all coefficients were statistically significant. The rating results of all three automatic scoring models thus followed the same tendency as the experts' ratings and represented them well. In terms of accuracy, all models performed well, reaching adjacent accuracy above 80% (computed as sketched below); among them, the LSA model was the most stable. Unlike previous assessments of summarization, this project not only constructed valid rubrics for scoring multiple dimensions of summarization ability but also established a summary scale for students across a wide range of grades, and it verified the effectiveness of the proposed automatic scoring models. Together, the findings of the two studies provide solid evidence and a novel solution for assessing and tracking students' summarization abilities in various contexts over the long term.
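The two agreement checks reported above are standard computations; a self-contained sketch using SciPy, where the score arrays and the one-point tolerance are illustrative assumptions rather than thesis data:

```python
# Agreement between expert and automatic scores: Spearman rank
# correlation plus adjacent accuracy (machine score within `tolerance`
# points of the expert score). The example arrays are illustrative.
import numpy as np
from scipy.stats import spearmanr

def agreement(expert, machine, tolerance=1):
    expert, machine = np.asarray(expert), np.asarray(machine)
    rho, p_value = spearmanr(expert, machine)            # rank correlation
    adjacent = np.mean(np.abs(expert - machine) <= tolerance)
    return rho, p_value, adjacent

rho, p, acc = agreement([3, 5, 2, 4, 1, 4], [3, 4, 2, 5, 2, 4])
print(f"Spearman rho={rho:.2f}, p={p:.3f}, adjacent accuracy={acc:.0%}")
```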
Keywords
summarization, automatic summarization, automatic summary scoring, item response theory, paragraph embedding, Latent Semantic Analysis, Bidirectional Encoder Representations from Transformers