A Study of Extended Angoff Standard Setting for Self-Report Measure Constructs Assisted by Supplementary Performance-Level Descriptors
Date
2021-12-??
Publisher
國立臺灣師範大學教育心理學系
Department of Educational Psychology, NTNU
Abstract
With the implementation of the 12-Year Basic Education Curriculum, whose core emphasizes the holistic development of students' knowledge, attitudes, skills, and strategies, the Taiwan Assessment of Student Achievement: Longitudinal Study (TASAL) was established to gauge the impact of this curriculum reform on student performance, to describe the performance of Taiwanese students, and to investigate its influencing factors; it is, in essence, a standards-based assessment. However, a review of past standard-setting research shows that most studies have focused on standard setting in cognitive subject domains and have rarely addressed affective or strategic dimensions. Accordingly, this study proposes the concept of supplementary performance-level descriptors (S-PLDs) to assist expert teachers in applying the extended Angoff standard-setting method to set passing scores for self-report measure constructs, and it examines procedural, internal, and external validity evidence to support the reasonableness of the results. The results show that the standard-setting panelists largely endorsed the appropriateness of the process. Through discussion and reflection, the panel converged on acceptable and appropriate results, and an examination of the stability and consistency of panelists' ratings across rounds showed that the errors mostly fell within a reasonable range. Finally, the two cut points set in this study also distinguished, to a meaningful degree, the performance of strategy users at different levels on an external criterion (English comprehension). Overall, the results are supported by procedural, internal, and external evidence, and several suggestions are offered at the end for future researchers.
One vision of the 12-Year Basic Education Curriculum in Taiwan is to promote the comprehensive learning and development of all students. To ensure the quality of this curriculum reform, the Ministry of Education funded a long-term project, the Taiwan Assessment of Student Achievement: Longitudinal Study (TASAL), to evaluate the impact of the curriculum on student performance. The TASAL is a large-scale standards-based assessment, and standard setting is one of its main tasks. A comprehensive literature review indicated that most empirical studies of standard setting have focused on cognitive domains; because of practical limitations, few studies have undertaken expert-oriented standard-setting processes in affective domains. The present study proposes a new approach that employs supplementary performance-level descriptors (S-PLDs) within an extended Angoff method to set standards for self-report measures. The purpose of this study was to gather evidence of the procedural, internal, and external validity of implementing an extended Angoff procedure with S-PLDs in standard setting for English comprehension strategy use among seventh-grade students in Taiwan.

PLDs are designed to outline the knowledge, skills, and practices that indicate the level of student performance in a target domain; in the present study, the use of comprehension strategies for learning English as a foreign language was examined. S-PLDs serve comparable but distinct functions within the standard-setting process: they offer supplementary material to subject matter experts to facilitate the formation of profiles of student performance in target domains, especially when ambiguities in conventional PLDs may prevent expert consensus.

In this study, stratified two-stage cluster sampling was adopted to select representative seventh graders in Taiwan during the 2018–2019 academic year. Of the 7,246 students sampled, 2,732 (1,417 boys and 1,315 girls) received both an English comprehension strategy use questionnaire and an English proficiency test. Student performance on both instruments served as the basis for writing the PLDs and S-PLDs. The scale measuring English comprehension strategy use was a 4-point discrete visual analogue self-report measure developed through standardized procedures; it comprises four dimensions: memorization (6 items), cognition (6 items), inference (8 items), and comprehension monitoring (10 items) strategies. A four-dimensional confirmatory factor analysis indicated a favorable model–data fit, except for the chi-square value, which was inflated by the large sample size. The English proficiency test was a cognitive measure assessing students' listening and reading comprehension through multiple-choice and constructed-response items. A total of 182 items were developed through a standardized procedure and divided into 13 blocks to assemble 26 test booklets. Each booklet, containing 28 items, was randomly assigned to a participating student; each student completed only one booklet.
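As a concrete illustration of the extended Angoff judgments introduced above: for polytomous items such as the 4-point strategy items, each expert estimates the score a borderline (minimally qualified) student would earn on each item, and a dimension's cutoff is obtained by aggregating those estimates. The sketch below is a minimal, hypothetical rendering of that aggregation, assuming the common sum-of-mean-ratings rule; the panel size and item count echo the abstract (8 experts, 6 memorization items), but the ratings themselves are invented and do not reproduce the study's data.

```python
import numpy as np

def extended_angoff_cutoff(ratings: np.ndarray) -> float:
    """Aggregate extended Angoff judgments into a total-score cutoff.

    ratings[j, i] is expert j's estimate of the score a borderline
    student would earn on item i (here on a 1-4 response scale).
    The cutoff is the sum over items of the mean expert estimate,
    which equals the mean of each expert's implied total cutscore.
    """
    per_item_consensus = ratings.mean(axis=0)   # mean rating per item
    return float(per_item_consensus.sum())      # cutoff on the total-score scale

# Hypothetical round: 8 experts rating the 6 memorization items.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 5, size=(8, 6)).astype(float)
print(extended_angoff_cutoff(ratings))          # a value on the 6-24 scale
```

Under this aggregation rule, the cutoff equals the mean of each expert's implied total cutscore, so between-round feedback can be framed either item by item or on the total-score scale.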
After data cleansing and item calibration with a multidimensional random coefficient multinomial logit model in the Test Analysis Modules (TAM; Robitzsch et al., 2020), the information-weighted (infit) mean-square indices for all test items ranged from 0.79 to 1.37, meeting the criterion proposed by Linacre (2005).

An expert-oriented standard-setting meeting was held on May 20, 2020, after advance materials, such as the agenda and instructions for the standard-setting method, had been sent to all experts. Eight experts from across Taiwan were invited, all of whom had participated in standard-setting meetings for student performance on English proficiency tests. The experts had an average of 18.75 years of teaching experience, and seven had experience teaching low achievers. Overall, the experts had sufficient prerequisite knowledge of and experience with standard-setting processes. On the day of the meeting, a series of events was conducted, including orientation, training and practice, and three rounds of the extended Angoff standard-setting method, with different types of feedback provided between rounds. Feedback questionnaires were administered, and discussions among the experts between rounds were recorded and analyzed as evidence of procedural and internal validity.

Most of the subject matter experts were satisfied with the standard-setting process and agreed that they could set satisfactory cutoff scores for future use. According to the feedback questionnaires completed between rounds, the experts nearly unanimously agreed that the materials received in advance; the introductions to the PLDs, S-PLDs, and the extended Angoff method; and their previous experience in setting standards for English proficiency were beneficial in judging items. Additionally, the experts agreed that the S-PLDs played a key role in helping them form outlines of student performance in comprehension strategy use across levels. All of these results indicate procedural validity.

For evidence of internal validity, classification error (the ratio of the standard error of the passing score to the measurement error) was computed to indicate the consistency of item ratings between and within experts across the three rounds. Between experts and across rounds, the classification error ranged from 0.08 to 0.36 for memorization strategies, 0.14 to 0.49 for cognition strategies, 0.19 to 0.61 for inference strategies, and 0.24 to 0.72 for comprehension monitoring strategies. These results indicate that the cognitive levels of the four dimensions affected rating consistency: strategies with more abstract item content tended to have higher classification error. Furthermore, the lowest classification error values occurred in the second round for memorization and inference strategies and in the third round for cognition and comprehension monitoring strategies. All of these lowest values fell beneath the cutoff of 0.33 proposed by Kaftandjieva (2010), except for the value of 0.37 for comprehension monitoring strategies. Regarding rating consistency within experts between rounds, no extreme classification error was observed, and most values were beneath 0.33, with the exceptions of 0.35 for cognition strategies and 0.37, 0.42, and 0.61 for comprehension monitoring strategies. Therefore, most experts exhibited consistent ratings between rounds.
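To pin down the internal-validity index used above: the abstract defines classification error as the ratio of the standard error of the passing score to the measurement error. A hedged reconstruction, assuming the panel standard error is estimated from the spread of the J experts' cutscores, is:

```latex
\mathrm{CE} = \frac{SE(\hat{x}_c)}{SEM},
\qquad
SE(\hat{x}_c) = \frac{SD\!\left(\hat{x}_{c,1},\dots,\hat{x}_{c,J}\right)}{\sqrt{J}},
\qquad
\mathrm{CE} < 0.33 \;\;\text{(Kaftandjieva, 2010)}
```

where \(\hat{x}_{c,j}\) is expert j's recommended cutscore and SEM is the scale's standard error of measurement; the analogous within-expert version would replace the spread across experts with the spread of one expert's cutscores across the three rounds.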
Additionally, a content analysis of the item rating discussions indicated three reference sources that might affect the experts' judgments: (1) students' actual performance, (2) the PLDs and S-PLDs, and (3) the experts' personal expectations. For example, one expert might give an item a lower rating because, in their teaching experience, their students tend to perform poorly on it, whereas another expert might give a higher rating based on personal expectations.

To examine external validity, student performance on the English proficiency test was adopted as an external criterion. With two cutoff scores used to divide students into basic, proficient, and advanced strategy users in each dimension, a medium effect size was obtained for memorization strategies, and large effect sizes were obtained for cognition, inference, and comprehension monitoring strategies. Furthermore, to compare the final cutoff scores obtained through the study method with those from an existing method, the study adopted the approach used in TIMSS and PIRLS for setting standards in affective domains (Martin et al., 2014, p. 308). The classification accuracy indices, which indicate the proportions of students classified identically by the two methods, were 90.25%, 81.20%, 82.52%, and 87.56% for the four dimensions. In sum, the present study obtained satisfactory evidence of the procedural, internal, and external validity of using an extended Angoff procedure with S-PLDs to set standards for self-report measures; additional suggestions are presented herein.
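The classification accuracy index reported above reduces to an agreement rate: the proportion of students assigned the same level (basic, proficient, or advanced) by the study's cutoffs and by the TIMSS/PIRLS-style cutoffs. A minimal sketch follows, with hypothetical scores and cut points (the study's actual cutoffs are not reproduced in this abstract):

```python
import numpy as np

def classify(scores: np.ndarray, cuts: tuple) -> np.ndarray:
    """Map scale scores to levels 0/1/2 (basic/proficient/advanced)
    using two cutoff scores (low, high)."""
    low, high = cuts
    return np.digitize(scores, bins=[low, high])  # <low -> 0, [low, high) -> 1, >=high -> 2

def classification_accuracy(scores, cuts_a, cuts_b) -> float:
    """Proportion of students classified identically under two sets of cutoffs."""
    return float(np.mean(classify(scores, cuts_a) == classify(scores, cuts_b)))

# Hypothetical comparison: 2,732 total scores on a 6-24 dimension, with
# study-derived vs. TIMSS/PIRLS-style cut points (both invented here).
rng = np.random.default_rng(1)
scores = rng.integers(6, 25, size=2732).astype(float)
print(classification_accuracy(scores, (12.0, 18.0), (13.0, 18.0)))
```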