Testing Local Independence in the Vocabulary Levels Test (單字階層測驗之局部獨立性檢測)
Date
2019
Abstract
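The Vocabulary Levels Test (VLT) is commonly used as a placement test, a diagnostic test, and a benchmark for learning in studies with a pre-test/post-test design. Compared with other vocabulary size tests, such as the VST or the Yes/No test, the VLT has received the most attention over the past 35 years, even though its item cluster format has drawn some criticism. Each VLT cluster contains three items (definitions) and six options (words). Because the three items in a cluster share the same set of options, it has been argued that answering one item can unfairly influence (or determine) the answers to the other items in the same cluster. This kind of local dependence is called item chaining, and it plainly violates the basic assumption of item independence in Classical Test Theory and Item Response Theory. If item chaining is pervasive in the test, it also challenges another basic assumption of test theory: unidimensionality, that is, that the test measures only the ability it was designed to assess, here vocabulary knowledge. If item dependence violates both of these basic assumptions, the reliability and validity of the test are open to doubt.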
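The aim of this dissertation is to examine item independence in a shortened version of the VLT, which contains three levels rather than five. A wider range of Rasch modelling approaches is used to examine the existence and extent of local item dependence in the VLT. Test data were collected from 302 university and graduate students, and Winsteps was used to run two types of dimensionality tests (1. Principal Components Analysis of Residuals (PCAR) and 2. Yen’s Q3 statistic, which identifies locally dependent items) on 20 different data levels:
1. The three VLT levels 2, 3, and 5 combined (one data level)
2. Each VLT level on its own (three data levels)
3. Four ability groups on the combined VLT levels (four data levels)
4. Four ability groups on the three separate VLT levels (twelve data levels)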
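Two further analyses were carried out: one compared the empirical data with simulated data in which non-random residuals had been factored out, and the other applied a Rasch analysis to the three-item clusters treated as testlets. In total, this study synthesized the quantitative results of 42 different analyses and qualitatively analysed the problematic testlets using two approaches: 1. response patterns, covering answer keys, distractors, and unanswered items; and 2. word frequency and dispersion data from COCA, currently the largest corpus of English (Davies, 2008-).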
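As in some earlier studies, the unidimensional Rasch analyses showed acceptable fit and acceptable person and item reliability. There was also very little unexplained variance when compared with the simulated data. On these figures, the testlet analysis of the VLT found no obviously problematic testlets. However, the combined analysis of the 20 data levels showed that more than a third of the testlets had one or both of the following tendencies: 1. two items were locally dependent, with weak to moderate dependence (correlations of 0.3-0.7); and/or 2. items had substantive PCAR loadings (beyond +/- 0.3) on a dimension other than the Rasch dimension of vocabulary knowledge. Qualitative analyses were carried out to better understand the Rasch statistical results. A small set of seven testlets that emerged from at least two of the above analyses was identified as likely to contain problematic locally dependent items, and these testlets were examined for item wording and word frequency.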
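Although the statistical and qualitative results cannot attribute the local dependence to item chaining, these seven testlets do share some properties that cause problems and weaken the functioning of the test. These properties include a pair of items that differ considerably in difficulty from the third item, called a “2-vs-1 difficulty bundle” in this dissertation; in fact, 19 of the 30 testlets show this tendency. When the two bundled items in a testlet lie close to each other in difficulty but far from the outlying third item, the Q3 analysis shows weak or moderate local dependence. This occurred in six testlets (20% of the total). When the locally dependent pair is the first two items of the testlet, and the first item is harder than the second and much harder than the third (with about a quarter to a third of test-takers leaving the pair unanswered), the PCAR results show the first item to be negatively correlated with the Rasch dimension of vocabulary knowledge. This occurred in four testlets (13% of the total) in VLT levels 3 and 5.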
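A key issue raised by this study is word difficulty, which in vocabulary diagnostic tests such as the VLT has long been ignored or treated by researchers as a nuisance variable (Culligan, 2015). To the best of my knowledge, word difficulty in this type of test has never been explicitly theorized, but has been tacitly taken to be a function of word frequency in a corpus. Despite some claims to the contrary (Schmitt et al. [2001] for the VLT and Beglar [2007] for the VST), the underlying assumption is that the lower a word’s frequency (i.e., the less common it is), the more difficult the item is on the VLT. The results of this study show that this assumption is problematic for two reasons. First, the Schmitt et al. (2001) versions of the VLT are based on outdated, small corpora, so the word frequencies in the VLT are inaccurate, especially for the lower VLT levels 3 and 5. The main reason is that such corpora contain relatively few texts and do not take dispersion into account (i.e., how many texts in the corpus contain the word), so the frequencies for these words are inconsistent and skewed. Second, and most importantly, difficulty measures are often uncorrelated with frequency data, even when dispersion is taken into account. This observation also indicates that the second-language learner’s lexicon does not mirror English-language corpora, especially beyond the first 2,000 words. Further study of the relationship between word frequency and difficulty is recommended.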
The Vocabulary Levels Test (VLT) has been used as a placement test, a diagnostic test, and a benchmark for learning in pre- and post-test studies. Compared to other vocabulary size tests like the VST and the Yes/No test, the VLT has received the most attention in research publications in the last 35 years, despite widespread suspicion of its item cluster format. Since each item cluster is composed of three items (definitions) and six answer options (words), and the three items draw from the same set of answer options, it is suspected that the answering of one item can unfairly influence, or depend on, the answering of another item in the cluster. This type of Local Item Dependence (LID) is called item chaining and appears to be a flagrant violation of the basic assumption of Local Item Independence (LII) in Classical Test Theory as well as Item Response Theory. If item chaining is pervasive throughout the test, it also challenges another fundamental assumption in test theory: unidimensionality, or the test’s capacity to measure only one trait, such as vocabulary knowledge. If both of these assumptions are substantially violated by LID, the test’s reliability and validity are necessarily called into question.

The purpose of this dissertation is to investigate LID in a shortened version of the VLT (three levels instead of five) using a wider variety of Rasch modelling approaches, triangulated so as to identify the existence and extent of LID in the VLT. Specifically, data were collected from 302 Taiwanese university students or university graduates, and Winsteps was used to run two types of dimensionality tests (1. Principal Components Analysis of Residuals [PCAR] and 2. Yen’s Q3 statistic, which identifies pairs of locally dependent items) on 20 different data levels:
1. the three VLT levels 2, 3, and 5 combined (1 data level)
2. each independent VLT level (3 data levels)
3. four ability groups versus the combined VLT levels (4 data levels)
4. four ability groups versus the three independent VLT levels (12 data levels)

Two more analyses were also conducted: simulated data with non-random residuals factored out were compared to the empirical data, and items were grouped into three-item clusters to perform a Rasch analysis of testlets. In total, this study synthesized the results of 42 different analyses and qualitatively investigated the resulting problematic testlets using 1. response patterns of answer keys, distractors, and items left unanswered, and 2. word frequency and dispersion information from COCA, the largest and most up-to-date English-language corpus currently available (Davies, 2008-).

Similar to previous research findings, the unidimensional Rasch analyses showed acceptable fit statistics, acceptable person and item reliability, and very little unexplained variance, especially when compared with the simulated data. The testlet analysis also did not uncover any obviously problematic testlets. However, from a combination of the above 20 levels of analysis, more than a third of the testlets appeared to 1. have a pair of locally dependent (LD) items that were weakly to moderately dependent on each other (correlation of 0.3-0.7), and/or 2. have items with substantive PCAR loadings (beyond +/- 0.3) on a dimension that was not the Rasch dimension of vocabulary knowledge. Additional qualitative investigations were conducted in an effort to better understand and explain the Rasch statistical results.
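The Q3 screening described above can be illustrated with a minimal sketch. It assumes that person abilities and item difficulties (in logits) have already been estimated elsewhere, and it is not a reproduction of the Winsteps procedure used in the dissertation. The function names `rasch_prob`, `q3_matrix`, `flag_pairs`, and `simulate` are hypothetical, and the simulated responses (in which item 1 simply copies item 0 to force a dependence) are invented for illustration only. With estimated rather than true parameters, the Q3 values would shift somewhat.

```python
import numpy as np

def rasch_prob(theta, b):
    """P(correct) under the dichotomous Rasch model, persons x items."""
    theta = np.asarray(theta, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

def q3_matrix(responses, theta, b):
    """Yen's Q3: correlations between item residuals (observed - expected)."""
    resid = np.asarray(responses, dtype=float) - rasch_prob(theta, b)
    return np.corrcoef(resid, rowvar=False)

def flag_pairs(q3, threshold=0.3):
    """List (i, j, Q3) for item pairs i < j whose Q3 exceeds the threshold."""
    n = q3.shape[0]
    return [(i, j, float(q3[i, j]))
            for i in range(n) for j in range(i + 1, n)
            if q3[i, j] > threshold]

def simulate(n_persons=302, n_items=6, seed=0):
    """Invented Rasch data; item 1 copies item 0 to create local dependence."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 1.0, n_persons)      # hypothetical abilities
    b = np.linspace(-1.5, 1.5, n_items)          # hypothetical difficulties
    x = (rng.random((n_persons, n_items)) < rasch_prob(theta, b)).astype(float)
    x[:, 1] = x[:, 0]                            # forced dependence, illustration only
    return x, theta, b

if __name__ == "__main__":
    x, theta, b = simulate()
    for i, j, value in flag_pairs(q3_matrix(x, theta, b)):
        print(f"items {i} and {j}: Q3 = {value:.2f}")
```

The 0.3 threshold mirrors the lower bound of the weak-to-moderate range quoted in the abstract; other cut-offs are used in the literature, so it is left as a parameter.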
A subset of seven testlets that emerged from at least two of the above analyses was assumed to contain the most likely candidates for problematic LID, and these were more closely scrutinized using qualitative procedures of checking item wording and word frequency. Although the statistical and qualitative procedures cannot conclusively show that the cause of LID is item chaining, the seven testlets share a number of characteristics that clearly create a problematic dynamic that undermines the proper functioning of testlets. These characteristics include a pair of items that differ considerably in difficulty measures from the third item in the cluster, which I have called a “2-vs-1 difficulty bundle”; in fact, 19 out of 30 testlets shared this configuration. When the two bundled items in a testlet are fairly close together in difficulty but far apart from the outlying third item, the Q3 LID analysis identified them as either weakly or moderately locally dependent; this was the case for six testlets (20% of the total). And when this LID pair was the first two items, with the first item more difficult than the second and much more difficult than the third outlying item (with a quarter to one third of test-takers leaving the pair unanswered), the first item in the testlet was identified by the PCAR as negatively correlating with the Rasch dimension of vocabulary knowledge; this was the case for four testlets (13% of the total) in VLT3 and 5.

A key issue that emerged from this investigation is item difficulty in a vocabulary diagnostic test like the VLT, which has been variously ignored or treated as a “nuisance variable” by researchers (Culligan, 2015). Difficulty in this type of test has never, to the best of my knowledge, been overtly theorized, but has been tacitly operationalized as a function of word frequency from a corpus. Despite some unargued claims to the contrary (Schmitt et al. [2001] for the VLT, and Beglar [2007] for the VST), the assumption is that the less frequent (i.e., less common) the word, the more difficult the word-item on the VLT. This study shows that this is problematic for at least two reasons. First, the Schmitt et al. (2001) VLT versions are based on outdated and small corpora that have inaccurate word frequency information for all the VLT levels, but especially for the lower VLT3 and 5 levels; this is primarily because word frequency information will necessarily be inconsistent and skewed for less common words when using smaller corpora that contain a relatively small number of randomly sampled texts and do not account for dispersion (i.e., how many texts in the corpus contain the word). Secondly, and most importantly, difficulty measures, even when accounting for dispersion information, are often uncorrelated with frequency information, which shows that the learner’s second language (L2) lexicon does not mirror authentic English corpora, especially beyond the first 2000 words. Suggestions are given to help bridge the gap between frequency and difficulty.
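One simple way to check the assumed link between corpus frequency and item difficulty is a rank correlation between the two. The sketch below is only an illustration of that idea, not the procedure used in the dissertation: the helper names `average_ranks` and `spearman` are hypothetical, and the five frequency and difficulty values are invented, not taken from COCA or from the study’s data. If lower frequency really implied higher difficulty, the coefficient would be strongly negative.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tie = values == v
        ranks[tie] = ranks[tie].mean()
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

if __name__ == "__main__":
    # Hypothetical values for five items, invented purely for illustration.
    freq_per_million = [120.0, 45.0, 45.0, 8.0, 3.0]   # corpus frequency
    difficulty_logits = [-1.2, 0.4, -0.3, 0.1, 1.5]    # Rasch item difficulty
    rho = spearman(freq_per_million, difficulty_logits)
    print(f"Spearman rho between frequency and difficulty: {rho:.2f}")
```

A rank-based coefficient is used here because word frequencies are heavily skewed; a dispersion measure could be substituted for raw frequency in the same function.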
Keywords
vocabulary testing, local item dependence, unidimensionality, Rasch model, latent trait, Vocabulary Levels Test