「混合專家與學生實徵表現導向」大型教育評量標準設定之效度評估研究  Assessing the Validity of Standard-Setting for an English Language Assessment With a Hybrid Expert and Empirical Performance Model

謝進昌 Jin-Chang Hsieh

doi:10.6209/JORIES.202306_68(2).0001

Journal of Research in Education Sciences (Jun 2023)

Affiliations

謝進昌 Jin-Chang Hsieh: Research Center for Testing and Assessment, National Academy for Educational Research

DOI: https://doi.org/10.6209/JORIES.202306_68(2).0001
Journal volume & issue: Vol. 68, no. 2
pp. 1 – 35

Abstract

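To evaluate the effect of implementing the Curriculum Guidelines of 12-Year Basic Education on student performance, the Taiwan Assessment of Student Achievement: Longitudinal Study (TASAL) was launched to track the growth of Taiwanese students' literacy, explore influencing factors, and provide feedback for the national curriculum guidelines. TASAL is in essence a standards-based, large-scale educational assessment. Taking the perspective of the whole standard-development process, this study proposes a "hybrid expert and student empirical performance model" that accumulates procedural, internal, external, and consequential (indirectly estimated) evidence across multiple dimensions and multiple approaches (sources) to support the validity of the English standard-setting results for the fourth learning stage. While developing the assessment instrument through a standardized procedure, the researcher progressively incorporated the task elements of standard setting in order to construct theory, develop performance level descriptors, prepare standard-setting materials and test items, and develop related large-scale assessment techniques. Evaluation questionnaires completed by 15 expert panelists were collected to check the reasonableness of the standard-setting process and results, and most panelists endorsed their appropriateness. In addition, with feedback information, discussion among panelists, and reflection, the panelists' item judgments became increasingly consistent across rounds, and classification errors mostly fell within a reasonable range. Furthermore, with the English comprehension performance of seventh-grade students after they advanced to the eighth grade as the external criterion, the established cutoff scores discriminated well among students at different levels in their performance on that criterion. Overall, at the standard-formation (or standard-setting) stage, this study obtained largely sound procedural, internal, external, and consequential (indirectly estimated) evidence. Finally, suggestions are offered for future reference.

Background and Purpose

The Taiwan Assessment of Student Achievement: Longitudinal Study (TASAL) was implemented to evaluate the effect of the new 12-year basic education curriculum on student performance in Taiwan. TASAL is a standards-based, large-scale assessment that aims to track the literacy growth of Taiwanese students, explore relevant factors, and collect empirical evidence to assist in the development of future curriculum guidelines. This study assessed the validity of standard-setting with a hybrid model combining expert judgment and students' empirical performance. The hybrid model has multidimensional, multisource, and long-term cumulative features. The multidimensional feature provides evidence for procedural, internal, and external validity and for setting appropriate standards (Kane, 1994, 2001; Pant et al., 2009). The multisource feature indicates that the validity evidence is derived from various sources, such as expert opinions and students' empirical performance. Finally, the long-term cumulative feature represents the process of accumulating evidence over a long period. Presenting every type of evidence in a single study is challenging because of time and resource constraints, and the burden placed on researchers and students should also be considered.

Method

1. Sampling

TASAL formally began assessing seventh-grade students in 2019. In 2020, the same group of students, now in the eighth grade, was assessed again.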
The sampling method was stratified two-stage cluster sampling. Of the 256 junior high schools initially selected, 246 schools with a total of 2,793 students took part in the project. In 2019, 2,793 seventh-grade students took the TASAL English test. In 2020, 2,893 eighth-grade students took the test. Among the eighth-grade students, 2,554 took the English test in both years.

2. Materials

The TASAL English core competence assessment was developed through a standardized procedure comprising purpose clarification, theory construction, assessment guidelines, performance level descriptor development, test item design, test assembly, and data analysis. The assessment examines English reading comprehension according to the corresponding content in the 12-year basic education curriculum. Based on the concept of transforming verb-noun usage into cognitive processes and content knowledge, as proposed by Anderson et al. (2001), a separate set of assessment criteria and test items was developed to evaluate reading comprehension. Six levels of performance descriptors were initially proposed (Hsieh, 2023). However, no corresponding test items were available for the sixth (highest) level, because the standard-setting process still focused on the seventh-grade test items. Therefore, this study focused on the first five levels: acquiring linguistic fluency, locating explicitly stated information, literal comprehension, implicit comprehension, and evaluation and reflection beyond text comprehension.
Drawing on a review of the literature, the TASAL English core competence assessment uses text types adapted from the OECD text types (2019), modified to include descriptive, introductive, transactional, expository, commentary, persuasive, narrative, and literary texts. The assessment for seventh-grade students contained 182 test items, and the assessment for eighth-grade students contained 196 test items; 84 common items were included in both assessments. The response consistency was good: the expected a posteriori (EAP) reliability estimates were 0.85 and 0.91 for the seventh-grade and eighth-grade assessments, respectively.

3. Standard-setting

This study employed the extended Angoff method (Hambleton & Plake, 1995) to establish assessment standards. A total of 15 experts from various regions in Taiwan were trained and participated in the standard-setting meeting. Among these experts, 10 were women and 5 were men, with an average teaching experience of 18.25 years. The standard-setting meeting was conducted in three rounds, and student ability and cutoff scores were estimated by weighted likelihood estimation (Warm, 1989). Statistical analyses were performed in R (R Core Team, 2022) with the TAM package (Robitzsch et al., 2020).

Results and Conclusion

Feedback was collected using a questionnaire on standard-setting. Most of the experts rated the process and outcome of the standard-setting meeting as above average or well above average. The experts agreed or strongly agreed that the feedback provided and the performance level descriptor (PLD) procedures were helpful in establishing standards. In summary, this study provides satisfactory evidence for the procedural validity of standard-setting. This study also provides evidence for the internal validity of standard-setting. During the initial round, the standard error of the cutoff scores ranged from 2.03 to 11.58 across all experts and all levels.
However, the standard errors decreased in subsequent rounds. In general, most standard errors, expressed as a ratio to the measurement error of 34.64, were within the acceptable level of 0.33, which is consistent with the results of Kaftandjieva (2010, p. 104). With the English comprehension performance of eighth-grade students as the external criterion, the cutoff scores set on the seventh-grade assessment distinguished significantly between different levels of achievement. A partial η² of .506 was obtained, indicating a large effect size according to Cohen (1988). In conclusion, this study provides evidence for the external validity of standard-setting. Several suggestions are offered based on the study results. For example, when changes in student performance are evaluated, regression toward the mean may be a crucial factor affecting the result of standard-setting during the vertical articulation of cutoff scores across grades. Additionally, continuously collecting evidence to support the validity of standard-setting is crucial in responding to educational policies and curriculum guidelines. The results therefore underscore the importance of accumulating validity evidence on an ongoing basis in future research.
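
The internal-validity check reported above compares the standard error of each cutoff score with the measurement error and asks whether the ratio stays at or below 0.33. The following Python sketch illustrates that arithmetic only; it is not the study's analysis code, which used R and TAM. The function names are hypothetical, and computing the standard error as the panelists' standard deviation divided by the square root of the panel size is an assumption made for illustration, since the abstract does not state the formula used.

```python
from math import sqrt
from statistics import stdev


def cut_score_standard_error(panelist_cuts):
    """Standard error of the panel's mean cutoff score.

    Assumption (not stated in the abstract): SE = SD of the panelists'
    individual cutoff scores divided by sqrt(number of panelists).
    """
    return stdev(panelist_cuts) / sqrt(len(panelist_cuts))


def relative_standard_error(se_cut, measurement_error):
    """Ratio of the cutoff-score standard error to the measurement error."""
    return se_cut / measurement_error


def within_criterion(se_cut, measurement_error, threshold=0.33):
    """True when the ratio does not exceed the acceptable level."""
    return relative_standard_error(se_cut, measurement_error) <= threshold


if __name__ == "__main__":
    measurement_error = 34.64  # value reported in the abstract
    # Smallest and largest first-round standard errors reported in the abstract.
    for se in (2.03, 11.58):
        ratio = relative_standard_error(se, measurement_error)
        print(f"SE={se:5.2f}  ratio={ratio:.4f}  "
              f"acceptable={within_criterion(se, measurement_error)}")
```

With the reported figures, the smallest first-round standard error (2.03) gives a ratio of about 0.06, while the largest (11.58) gives a ratio just above 0.33, which fits the abstract's statement that most, rather than all, standard errors met the criterion.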

Published in Journal of Research in Education Sciences

ISSN: 2073-753X (Print)
Publisher: National Taiwan Normal University
Country of publisher: Taiwan, Province of China
LCC subjects: Education: Theory and practice of education
Website: http://jories.ntnu.edu.tw/jres/Default.aspx?loc=en
