├── 2021_10_4_IES_AS_Challenge_RFI_Presentation.pdf ├── application_documents.zip ├── automated_scoring_press_release.pdf ├── images ├── greatest_hits_gender_smd_I.png ├── greatest_hits_qwk_G.png ├── greatest_hits_qwk_I.png └── greatest_hits_race_smd_I.png ├── readme.md └── results.md /2021_10_4_IES_AS_Challenge_RFI_Presentation.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NAEP-AS-Challenge/reading-prediction/91dec58b111429470a38b7ef198f7044a2463f7d/2021_10_4_IES_AS_Challenge_RFI_Presentation.pdf -------------------------------------------------------------------------------- /application_documents.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NAEP-AS-Challenge/reading-prediction/91dec58b111429470a38b7ef198f7044a2463f7d/application_documents.zip -------------------------------------------------------------------------------- /automated_scoring_press_release.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NAEP-AS-Challenge/reading-prediction/91dec58b111429470a38b7ef198f7044a2463f7d/automated_scoring_press_release.pdf -------------------------------------------------------------------------------- /images/greatest_hits_gender_smd_I.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NAEP-AS-Challenge/reading-prediction/91dec58b111429470a38b7ef198f7044a2463f7d/images/greatest_hits_gender_smd_I.png -------------------------------------------------------------------------------- /images/greatest_hits_qwk_G.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NAEP-AS-Challenge/reading-prediction/91dec58b111429470a38b7ef198f7044a2463f7d/images/greatest_hits_qwk_G.png -------------------------------------------------------------------------------- /images/greatest_hits_qwk_I.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NAEP-AS-Challenge/reading-prediction/91dec58b111429470a38b7ef198f7044a2463f7d/images/greatest_hits_qwk_I.png -------------------------------------------------------------------------------- /images/greatest_hits_race_smd_I.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NAEP-AS-Challenge/reading-prediction/91dec58b111429470a38b7ef198f7044a2463f7d/images/greatest_hits_race_smd_I.png -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | # ED.gov National Assessment of Educational Progress (NAEP) Automated Scoring Challenge 2 | 3 | ### Winners announced! 4 | Press release available [here](https://github.com/NAEP-AS-Challenge/info/blob/edd6f9565b5bfdeb303ad32c86e53f7e32339358/automated_scoring_press_release.pdf) and more detail about awarded submissions posted [here](https://github.com/NAEP-AS-Challenge/info/blob/952ee82cc52163b17bccc1f12174ce246b8dff9b/results.md). 5 | 6 | A pre-print of a published paper from the UMass Amherst winning submission is available [here](https://arxiv.org/abs/2205.09864) and code used for their winning entry is available [here](https://github.com/ni9elf/automated-scoring). 
## Overview
We are seeking submissions of automated scoring models to score
constructed response items for the National Assessment of Educational
Progress's reading assessment. The purpose of the challenge is to help
NAEP determine the existing capabilities, accuracy metrics, the
underlying validity evidence of assigned scores, and the costs and
efficiencies of using automated scoring with the NAEP reading assessment
items. The Challenge requires that submissions demonstrate
interpretability of models, provide score predictions using these
models, analyze models for potential bias based on student demographic
characteristics, and provide cost information for putting an automated
scoring system into operational use.

**CHALLENGE DETAILS**

TOTAL CASH PRIZES OFFERED: \$30,000 (maximum of \$20,000 for first-place entries)\
TYPE OF CHALLENGE: Automated Scoring of Open-Ended Reading Test Items

**SUBMISSION START: 9/16/2021 8:00 AM ET\
SUBMISSION END: 11/28/2021 5:00 PM ET**

A Request for Information Webinar was held 10/4/2021 @ 12:00 ET. The [slides are posted here](https://github.com/NAEP-AS-Challenge/info/blob/c782c585636cdb32bf7373291e93c7f34256ebba/2021_10_4_IES_AS_Challenge_RFI_Presentation.pdf).

## Description

Automated scoring using natural language processing is a well-developed
application of artificial intelligence in education. Results are
consistently demonstrated to be on par with the inter-rater reliability
of human scorers for well-developed items (Shermis, 2014). Currently,
the National Assessment of Educational Progress (NAEP) makes extensive
use of constructed response items. Annually, contractors assemble teams
of human scorers who score millions of student responses to NAEP's
assessments. Previous internal research on the application of automated
scoring to NAEP items indicates that NAEP's items can be scored
successfully with automated scoring using natural language processing.
Prior special studies have found that automated scoring can perform as
well as human raters both in assigning scores and in assigning a
confidence level associated with each predicted score. Compared to human
raters, no evidence of biased student scoring based on demographic
characteristics was observed.

This challenge seeks to expand on this earlier work to ascertain whether
a wider array of automated scoring models can perform well with a
representative subset of NAEP reading constructed response items
administered in 2017 to students in grades 4 and 8. The ultimate goal is
to produce reliable and valid score assignments, provide additional
information about responses (e.g., length, cohesion, linguistic
complexity), and generate scores more quickly while saving money on
scoring costs.

There are two components to this challenge; entrants may submit
responses to one or both of these components:

1) Component A - Item-Specific Models: Successful respondents will
   build a predictive model for each item that can be scored, using
   current state-of-the-art practices in operational automated scoring
   deployments. Extensive training data from prior human scoring
   administrations will be provided. The first-place prize for this
   component is \$15,000, with up to 4 runner-up prizes of \$1,250 each.
2) Component B - Generic Models: Successful respondents will build a
   generic scoring model that will score items that were not included in
   the training dataset, but are from the same administration, subject,
   and grade level. The first-place prize for this component is \$5,000,
   with up to 4 runner-up prizes of \$1,250 each.

Participants will be provided access to digital files that contain
information related to the results of human-scored constructed responses
to reading assessment items that were administered in 2017, such as item
text, passage, scoring rubric, student responses, and human assigned
scores (both single and double scored). The responses correspond to
items that accompany two genres of 4th and 8th grade reading passages,
literary and informational. Items for this challenge are of two writing
formats, short and extended constructed response.

The data set includes 20 items for the item-specific models and 2 items
for the generic models. There is an average of 1,181 double-scored
responses per dataset, which are divided into a training dataset (50%),
a validation dataset (10%), and a test dataset (40%). The validation
dataset is augmented with a much larger number of single-scored
responses (average 23,000 per item). Detailed information about each
item is provided under "Detailed Item Information" below.

In addition to model accuracy compared to human scorers, successful
respondents to this Challenge will provide documentation of model
interpretability through a technical report that will be evaluated for
transparency, explainability, and fairness as further explained in
"Evaluation Criteria". The Federal Government is particularly interested
in submissions that provide accurate results and meet these objectives,
as they have been absent from a good deal of recent research in
automated scoring, particularly for solutions using artificial
intelligence (e.g., neural networks, transformer networks) and other
complex algorithmic approaches (Kumar & Boulanger, 2020). This
documentation and the predicted scores will be submitted simultaneously.
The documentation will be evaluated before respondents' scored
submissions are evaluated. Only submissions whose documentation meets
the acceptance criteria (as specified below) will be considered valid
and evaluated for accuracy of the predicted scores against the hold-out
test dataset.

This process is consistent with the operational processes that the
Department intends to use as part of the approval process for scoring
and reporting; only models that can provide substantive validity
evidence would be approved for production use. This aspect of the
Challenge is of critical importance in educational contexts.

## Eligibility Information

Institutions and individuals that have the ability and capacity to
conduct research are eligible to apply. Eligible applicants include, but
are not limited to, non-profit and for-profit organizations and public
and private agencies and institutions, such as colleges and
universities.

## Requirements for Participation & Confidential Data Security

- The datasets used for this challenge contain student responses from
  previous NAEP assessments and are therefore considered NCES
  confidential materials.
  All participants must confirm that they are able to meet NCES
  Confidential Data security requirements. These requirements include
  restrictions on the use of data, security of data, and destruction of
  data when the analysis is completed. These requirements are specified
  in the security application documentation (available in this
  repository as ["application_documents.zip"](application_documents.zip)).
  Security documentation must be completed and submitted before an
  applicant will be provided access to the response data. Data must also
  be destroyed/deleted within 30 days of completing the Challenge, and
  all participants must submit a signed and witnessed form confirming
  that action. This form is also included within the security
  application.

- It is possible, although unlikely, that responses may contain
  individually identifiable information about students, their families,
  and their schools. Participants must agree that any such information
  will not be revealed.

- No person may:

  - use any information for any purpose other than for the purposes of
    this activity

  - make any publication whereby the data furnished by any particular
    person can be identified

- The Education Sciences Reform Act of 2002 requires IES to develop and
  enforce standards to protect the confidentiality of students, their
  families, and their schools in the collection, reporting, and
  publication of data. The IES confidentiality statute is found in
  Public Law 107-279, section 183 (or as codified in 20 U.S.C. 9573).

- Anyone who violates the confidentiality provisions of this Act when
  using the data shall be found guilty of a class E felony and can be
  imprisoned up to five years, and/or fined up to \$250,000.

No future NAEP contract work is guaranteed on the basis of performance
in this competition. Contracts are let through separate RFPs, where
performance may be one of many criteria used for evaluation.

## Challenge Timeline

| |**Item**|**Duration (days)**|**Start**|**Finish**|
| :-: | :- | :-: | :-: | :-: |
|2.1|Challenge posting period|30|16-Sep|20-Oct|
|2.2|Request for information webinar||4-Oct|4-Oct|
|**2.3**|**Application deadline \***|||~~**20-Oct**~~ 1-Nov @ 11:59 PM ET|
|2.4|Provide dataset|||28-Oct|
|2.5|Competitors prepare responses\*|30|29-Oct|28-Nov|
|**2.6**|**Response deadline**|||**28 Nov**|
|2.7|Select winner|||Mid-January|

\*Note: applications will be taken on a rolling basis and dataset access
will be provided as soon as possible (typically 48 hours after receipt
of required documentation).

## Request for Information Webinar

A Request for Information Webinar was held 10/4/2021 @ 12:00 ET. The
[slides are posted here](https://github.com/NAEP-AS-Challenge/info/blob/c782c585636cdb32bf7373291e93c7f34256ebba/2021_10_4_IES_AS_Challenge_RFI_Presentation.pdf).
Questions may also be sent via Github "issues" or via email to
[automated-scoring-challenge@ed.gov](mailto:automated-scoring-challenge@ed.gov).
173 | 174 | ## Frequently Asked Questions 175 | 176 | *About the National Center for Education Statistics (NCES)* 177 | 178 | The National Center for Education Statistics (NCES), one of the 179 | principal federal statistical agencies, is the primary federal entity 180 | for collecting and analyzing data related to education in the United 181 | States and other nations. It provides statistical services for educators 182 | and education officials at the federal, State, and local levels; 183 | Congress; researchers; students; parents; the media and the general 184 | public. NCES is located within the Institute of Education Sciences 185 | (IES), the research arm of the U.S. Department of Education. 186 | 187 | The National Assessment of Educational Progress (NAEP) is a 188 | congressionally mandated project administered by NCES. NAEP is given to 189 | a representative sample of students across the country. Results are 190 | reported for groups of students with similar characteristics (e.g., 191 | gender, race and ethnicity, school location), and are not reported for individual students. 192 | National results are available for all subjects assessed by NAEP. State 193 | and selected urban district results are available for mathematics, 194 | reading, and (in some assessment years) science and writing. 195 | 196 | In 2009 NAEP began its transition to computer-based administration with 197 | the hope of creating innovative performance assessments that more 198 | closely reflected ways students were working and learning in classrooms, 199 | increasing test security, enhancing cost effectiveness, reporting with a 200 | faster turn-around time, and collecting performance/process data 201 | with the goal of more comprehensively studying assessment results. The 202 | potential of automated scoring is consonant with these goals, as long as 203 | the procedure produces test scores with reliability and validity 204 | indices that are comparable to current levels for the program. This challenge will help NAEP explore whether moving in this 205 | direction is appropriate at this time. 206 | 207 | ## The Challenge 208 | 209 | Participants may submit a response to either or both of the components 210 | below. 211 | 212 | ## Component A: Item-Specific Models 213 | 214 | For **item-specific models**, the participant shall create a model for each of twenty items. 215 | Respondents will be provided with training data from prior human scoring which will include the item text, 216 | passage, scoring rubric, student responses, and human assigned scores 217 | (both single and double scored). Respondents will use this information, and may supplement this information with external training data, features, or models 218 | to create a predictive model of human scores that is applied to a set of 219 | "test" responses for which scores are not provided. The predicted scores 220 | will be submitted as part of participants' responses. 221 | 222 | ## Component B: Generic Models 223 | 224 | For **generic models**, the participant shall create a model for two 225 | items similar in genre and format to those in Component A, but for which 226 | only responses and scores will be provided. The model may be trained on 227 | any or all items from Component A and may be supplemented with external 228 | training data, features, or models. Respondents will use this 229 | information to create a predictive model of human scores that is applied 230 | to a set of "test" responses for which scores are not provided. 
The 231 | predicted scores will be submitted as part of the participants' responses. 232 | 233 | For both components, if your scoring engine links to an external routine or program that maps 234 | semantic space or determines other linguistic features, you are 235 | permitted to use these aids to make your predictions. However, you are 236 | responsible for obtaining any licenses or permissions to do so. If your 237 | scoring engine has these capabilities built in, those capabilities 238 | should be documented in the technical report. 239 | 240 | ## Training Datasets 241 | 242 | |**Component**|**Items Included** | 243 | | :- | :- | 244 | |Component A: Item-Specific Models|

- training data set (used for model building)<br>- cross-validation set (used for internal model evaluation)<br>- test data set (used for making score predictions)<br><br>The training and cross-validation sets have responses scored by two raters. The cross-validation set may be supplemented with single-scored responses. The test data set will have text only, no scores. In addition to response and score data, item text, passage, scoring rubric, and additional relevant information will be provided.|
|Component B: Generic Models|- all items from Component A for different items (similar in genre and grade)<br>- test data set for new items (used for making score predictions)<br>- optional: additional training data, features or models

| 246 | 247 | ## Detailed Item Information 248 | 249 | **Item-Specific Model Datasets** 250 | |Item ID|Gr. | For.|Avg. No. Words|Total N|Miss.|DS Training N|DS Validation N|DS + SS Validation N|DS Test N| 251 | | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | 252 | |2017\_DBA\_DR08\_1715RE2T13\_05|8|SCR|28.11|21166|100|533|107|20207|426| 253 | |2017\_DBA\_DR08\_1715RE2T13\_07|8|ECR|47.14|19934|385|495|99|19043|396| 254 | |2017\_DBA\_DR08\_1715RE2T13\_08|8|SCR|28.28|19669|224|502|101|18989|402| 255 | |2017\_DBA\_DR08\_1715RE4T05G08\_03|8|SCR|34.59|21197|81|539|108|20307|432| 256 | |2017\_DBA\_DR08\_1715RE4T05G08\_06|8|ECR|51.29|21170|83|531|106|20214|425| 257 | |2017\_DBA\_DR08\_1715RE4T05G08\_07|8|SCR|28.80|21088|62|529|106|20135|424| 258 | |2017\_DBA\_DR08\_1715RE4T05G08\_09|8|ECR|34.87|20869|93|527|105|19920|422| 259 | |2017\_DBA\_DR08\_1715RE4T08G08\_03|8|SCR|44.20|21380|92|543|109|20403|434| 260 | |2017\_DBA\_DR08\_1715RE4T08G08\_06|8|SCR|32.82|21309|107|538|108|20340|431| 261 | |2017\_DBA\_DR08\_1715RE4T08G08\_07|8|ECR|44.74|20990|185|529|106|20038|423| 262 | |2017\_DBA\_DR08\_1715RE4T08G08\_09|8|SCR|32.95|20589|190|513|103|19665|411| 263 | |2017\_DBA\_DR04\_1715RE1T10\_05|4|SCR|18.52|27806|355|690|138|26564|552| 264 | |2017\_DBA\_DR04\_1715RE4T05G04\_03|4|SCR|18.63|28264|323|715|143|26977|572| 265 | |2017\_DBA\_DR04\_1715RE4T05G04\_06|4|ECR|24.63|27462|327|682|137|26234|546| 266 | |2017\_DBA\_DR04\_1715RE4T05G04\_07|4|SCR|14.55|26588|222|666|133|25389|533| 267 | |2017\_DBA\_DR04\_1715RE4T05G04\_09|4|ECR|16.67|25651|426|634|127|24509|508| 268 | |2017\_DBA\_DR04\_1715RE4T08G04\_03|4|SCR|23.90|28307|292|707|141|27034|566| 269 | |2017\_DBA\_DR04\_1715RE4T08G04\_06|4|SCR|18.53|27533|329|694|139|26283|556| 270 | |2017\_DBA\_DR04\_1715RE4T08G04\_07|4|ECR|20.55|25960|667|641|128|24806|513| 271 | |2017\_DBA\_DR04\_1715RE4T08G04\_09|4|SCR|14.56|23720|597|583|117|22670|467| 272 | 273 | **Generic Model Datasets** 274 | |Item ID|Gr. | For.|DS Test N| 275 | | :- | :- | :- | :- | 276 | |2017\_DBA\_DR08\_1715RE2T13\_06|8|SCR|420| 277 | |2017\_DBA\_DR04\_1715RE1T10\_07|4|ECR|539| 278 | 279 | Item ID = Item identifier; 280 | Gr. = Grade; 281 | For. = Format; 282 | SCR = Short Constructed Response; 283 | ECR = Extended Constructed Response; 284 | Avg. No. Words = Average Number of Words; 285 | Miss. = Missing Data; 286 | DS Training N = Double-Scored Training N; 287 | DS Validation N = Double-Scored Validation N; 288 | DS Test N = Double-Scored Test N; 289 | DS+SS Validation N = Double-Scored + Single Scored Validation N 290 | 291 | ## Evaluation Criteria 292 | 293 | **Part 1: Model Interpretability**. Submissions must provide a 294 | technical report that explains the model development process and results 295 | appropriate to a technical audience with educational measurement 296 | expertise. It is not expected that competitors will reveal confidential 297 | information, but will provide evidence that enables an external reviewer 298 | to assess the validity and fairness of the automated scoring process and 299 | models. These reports will be submitted with submissions of predicted 300 | scores, but must be approved as providing a sufficient degree of 301 | interpretability per the criteria below before the response predictions 302 | will be evaluated. 303 | 304 | **Part 2: Model Accuracy**. Scoring performance will be evaluated 305 | on the average quadratic weighted kappa (rounded to the third 306 | decimal place) across all items in the competition. 
Competitors must score at least 99.5% of all scorable
responses (for which there is a human rating). During model
construction, competitors will have access to online resources to
request further information or to make suggestions about how to improve
performance. See Appendix A for more information on quadratic weighted
kappa and how the winner of the competition will be determined.

Model interpretability will be evaluated according to three criteria,
which will be equally weighted in the review:

a) **Transparency** -- explanation of the ***process for model training
   and testing***, the features extracted from the text, and the
   algorithms used in model building. While these may describe a general
   workflow, they should also include the specific text features and
   algorithmic choices used to create the final models that score items
   in this Challenge.

b) **Explainability** -- explanation of the ***resulting item model
   and/or individual scores*** that includes the input features
   considered, the modeling results, and algorithm choices.

c) **Fairness** -- analysis of any **differences based on student
   demographic background** in automated scoring compared to those found
   in human-scored results.

Although not an evaluated criterion for winning the competition,
technical reports should include estimates of the minimal training
sample sizes that would place the scoring engine's estimates within two
percent of the final predicted values.

|**Criteria**|**Responsive submissions will adequately address:**|
| :- | :- |
|1. Transparency|- Explain the model building process<br>- Include descriptions of features used for model building<br>- Include description of algorithm used for model building|
|2. Explainability|- Provide feature values and model statistics as appropriate to methods used<br>- Provide results from model training and cross validation<br>- Provide validity explanations that consider items and scoring rubrics|
|3. Fairness|- Conduct analysis to ensure that models perform the same for different sub-populations, especially those from historically underserved communities.|
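The fairness criterion is commonly operationalized with subgroup statistics such as the standardized mean difference (SMD) between automated and human scores, which is how the posted Challenge results report differences by gender and race/ethnicity. The sketch below is illustrative only: the column names (`human_score`, `predicted_score`, `group`) are hypothetical, and it uses one common SMD formulation (mean difference divided by the pooled standard deviation), which may differ from the exact formulation used by NCES.

```python
import numpy as np
import pandas as pd

def smd(machine: pd.Series, human: pd.Series) -> float:
    """Standardized mean difference between machine and human scores
    (mean difference divided by the pooled standard deviation)."""
    pooled_sd = np.sqrt((machine.var(ddof=1) + human.var(ddof=1)) / 2)
    return (machine.mean() - human.mean()) / pooled_sd

# Hypothetical scored responses with a demographic grouping column.
df = pd.DataFrame({
    "group":           ["A", "A", "A", "B", "B", "B"],
    "human_score":     [2, 1, 0, 1, 2, 2],
    "predicted_score": [2, 1, 1, 1, 2, 1],
})

# SMD per subgroup; values near zero indicate that the automated scores
# track human scores similarly for that subgroup.
by_group = {name: smd(g["predicted_score"], g["human_score"])
            for name, g in df.groupby("group")}
print(by_group)
```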
## Key Parameters

In conducting the work for the Challenge, there are several parameters
to consider:

- Both team and individual competitors are eligible to participate.

- Data sets will be provided in CSV format.

- All items in the training and test sets have been deemed "scorable" by
  human raters. There are responses in the validation set that were
  evaluated as "unscorable" and have no rating associated with them.
  They should be treated as missing data.

- Teams must complete the required security documentation before
  datasets will be released. This documentation is available at:
  ["application_documents.zip"](application_documents.zip).

## Deliverables

Valid submissions will include the following items:

- A technical report that provides model interpretability as previously
  described.

- Predicted scores (CSV format) from the test data responses (see below
  for data format).

- A pricing sheet that includes all costs related to the production
  implementation of automated scoring: model training, item scoring,
  and any other infrastructure or organizational costs that would be
  required to integrate machine scoring into a live scoring system.
  While some costs (e.g., project management) may be variable, we
  expect fixed costs for well-known items such as model training, item
  scoring, system administration, and others.

## Predicted Score Data Format and Upload Process

To submit your predicted scores, please use the following format to modify the test dataset provided for each item (an illustrative script appears below).
1. Delete the column "ReadingTextResponse" that contains the student response text (for data security reasons). **Please do not submit any files that contain the text of student responses**.
2. Add a column "predicted_score" and enter your predicted score in that column.
3. Add a column "participant" and enter the email address of the person who requested the dataset (you only need to enter it in one row).
4. Save the file using the same original filename in .CSV format.
5. Repeat for all items predicted and save into a single folder/directory.
6. Zip that folder/directory. Add your technical report and pricing sheet (if appropriate), and upload to the transfer.ies.gov folder that you have been provided via email. **Entries must be uploaded by the Challenge deadline (11/28/2021 11:59 PM ET)**.

## Challenge Administration Platform

Most aspects of the Challenge will be administered via Github
(https://github.com/NAEP-AS-Challenge/info).

Specifically, this platform will be used for the following purposes:

1) Information -- information about the challenge will be posted under
   "info" and available to the public.

2) Questions -- all questions about the Challenge, datasets, or items
   should be posted as an "issue" and will be publicly available.
   Responses will typically be made within 24 business hours.

Please note that response data will be provided outside of Github to the contact specified on the application. Submissions will also be uploaded to a separate secure server.
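As a companion to the upload steps above, here is a minimal sketch in Python (pandas) of preparing the per-item submission files. The directory names, the participant email, and the `predict_scores` helper are placeholders to be replaced with your own paths, address, and model; only the column names `ReadingTextResponse`, `predicted_score`, and `participant` come from the instructions.

```python
from pathlib import Path
import pandas as pd

TEST_DIR = Path("test_data")      # folder with the provided test CSVs (placeholder)
OUT_DIR = Path("submission")      # folder to zip and upload (placeholder)
PARTICIPANT = "you@example.org"   # email of the person who requested the dataset (placeholder)

def predict_scores(df: pd.DataFrame) -> list:
    # Placeholder: replace with your model; return one integer score per response.
    return [0] * len(df)

OUT_DIR.mkdir(exist_ok=True)
for csv_path in sorted(TEST_DIR.glob("*.csv")):
    df = pd.read_csv(csv_path)
    # Score the responses first (placeholder model), since the text column
    # is removed from the submitted file below.
    scores = predict_scores(df)
    # 1. Remove the student response text for data security.
    df = df.drop(columns=["ReadingTextResponse"])
    # 2. Add the predicted scores.
    df["predicted_score"] = scores
    # 3. Add the participant column; the email is only needed in one row.
    df["participant"] = ""
    df.loc[df.index[0], "participant"] = PARTICIPANT
    # 4./5. Save under the same original filename into a single folder.
    df.to_csv(OUT_DIR / csv_path.name, index=False)
# 6. Zip OUT_DIR, add the technical report and pricing sheet, and upload
#    the archive to the transfer.ies.gov folder provided via email.
```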
399 | 400 | ## Prizes 401 | 402 | The Department of Education is offering up to 10 prizes for a total 403 | potential award of up to \$30,000 (\$20,000 for item-specific models, 404 | \$10,000 for a generic model). The first-place prize for the item-specific 405 | challenge is \$15,000, and the first-place prize for the generic model 406 | is \$5,000. Up to 4 runner-up prizes in each category may be awarded 407 | with cash prizes of \$1,250 each. 408 | 409 | The winning results of the competition will be published in a technical 410 | report summarizing the results of the competition. At the Department's 411 | discretion, to assist with selecting winners, one or more of the most 412 | highly rated challenge participants may be invited to present a virtual 413 | presentation that reflects the basic elements of their technical report. 414 | 415 | Any potential prizes awarded under this Challenge will be paid by 416 | electronic funds transfer. Winners will be required to complete and 417 | return an Automated Clearing House (ACH) Vendor/Miscellaneous Payment 418 | Enrollment Form to ED within a given timeframe. The form collects 419 | banking information needed to make an electronic payment (direct 420 | deposit) to the winner. Award recipients will be responsible for any 421 | applicable local, state, and federal taxes and reporting that may be 422 | required under applicable tax laws. 423 | 424 | ## Rules 425 | 426 | ## Terms and Conditions 427 | 428 | All entry information submitted 429 | to [automated-scoring-challenge@ed.gov](mailto:automated-scoring-challenge@ed.gov) and 430 | all materials, including any copy of the submission, become property of 431 | the Department and will not be returned (See "Ownership and Licensing" 432 | for information about use of these items). Furthermore, the Department 433 | shall have no liability for any submission that is lost, intercepted, or 434 | not received by the Department. The Department assumes no liability or 435 | responsibility for any error, omission, interruption, deletion, theft, 436 | destruction, unauthorized access to, or alteration of, submissions. 437 | 438 | ## Representations and Warranties/Indemnification 439 | 440 | By participating in the Challenge, each entrant represents, warrants, 441 | and covenants as follows: 442 | 443 | 1. The entrants are the sole authors, creators, and owners of the 444 | submission; 445 | 446 | 2. The entrant's submission: 447 | 448 | a. Is not the subject of any actual or threatened litigation or 449 | claim; 450 | 451 | b. Does not, and will not, violate or infringe upon the privacy 452 | rights, publicity rights, or other legal rights of any third 453 | party; and 454 | 455 | c. Does not contain any harmful computer code (sometimes referred 456 | to as "malware," "viruses," or "worms"). 457 | 458 | 3. The submission, and entrants' implementation of the submission, does 459 | not, and will not, violate any applicable laws or regulations of the 460 | United States. 461 | 462 | 4. Entrants will indemnify, defend, and hold harmless the Department 463 | from and against all third party claims, actions, or proceedings of 464 | any kind and from any and all damages, liabilities, costs, and 465 | expenses relating to, or arising from, entrant's submission or any 466 | breach or alleged breach of any of the representations, warranties, 467 | and covenants of entrant hereunder. 468 | 469 | 5. 
The Department reserves the right to disqualify any submission that 470 | the Department, in its discretion, deems to violate these Official 471 | Rules, Terms, and Conditions in this notice. 472 | 473 | ## Ownership and Licensing 474 | 475 | Each entrant retains full ownership of the algorithmic approaches to 476 | their submission, including all intellectual property rights therein. By 477 | participating in the Challenge, each entrant hereby grants to the 478 | Department a royalty-free, nonexclusive, irrevocable, and worldwide 479 | license to reproduce, publish, produce derivative works, distribute 480 | copies to the public, perform publicly and display publicly, and/or 481 | otherwise use the technical report and assessment of submitted scores 482 | from each participant in the competition. 483 | 484 | ## Publicity Release 485 | 486 | By participating in the Challenge, each entrant hereby irrevocably 487 | grants to the Department the right to use the entrant's name, likeness, 488 | image, and biographical information in any and all media for advertising 489 | and promotional purposes relating to the Challenge. 490 | 491 | ## Disqualification 492 | 493 | The Department reserves the right, in its sole discretion, to disqualify 494 | any entrant who is found to be tampering with the entry process or the 495 | operation of the Challenge, Challenge webpage, or other 496 | Challenge-related webpages; to be acting in violation of these Official 497 | Rules, Terms, and Conditions; to be acting in an unsportsmanlike or 498 | disruptive manner, or with the intent to disrupt or undermine the 499 | legitimate operation of the Challenge; or to annoy, abuse, threaten, or 500 | harass any other person; and, the Department reserves the right to seek 501 | damages and other remedies from any such person to the fullest extent 502 | permitted by law. 503 | 504 | ## Disclaimer 505 | 506 | The Challenge webpage contains information and resources from public and 507 | private organizations that may be useful to the reader. Inclusion of 508 | this information does not constitute an endorsement by the Department of 509 | any products or services offered or views expressed. 510 | 511 | The Challenge webpage also contains hyperlinks and URLs created and 512 | maintained by outside organizations, which are provided for the reader's 513 | convenience. The Department is not responsible for the accuracy of the 514 | information contained therein. 515 | 516 | ## Notice to Challenge Entrants and Award Recipients 517 | 518 | Attempts to notify entrants and award recipients will be made using the 519 | email address associated with the entrants' submissions. The Department 520 | is not responsible for email or other communication problems of any 521 | kind. 522 | 523 | If, despite reasonable efforts, an entrant does not respond within three 524 | days of the first notification attempt regarding selection as an award 525 | recipient (or a shorter time as exigencies may require) or if the 526 | notification is returned as undeliverable to such entrant, that entrant 527 | may forfeit the entrant's award and associated prizes, and an alternate 528 | award recipient may be selected. 529 | 530 | If any potential award recipient is found to be ineligible, has not 531 | complied with these Official Rules, Terms, and Conditions, or declines 532 | the applicable prize for any reason prior to award, such potential award 533 | recipient will be disqualified. 
An alternate award recipient may be 534 | selected, or the applicable award may go unawarded. 535 | 536 | ## Dates/Deadlines 537 | 538 | The Department reserves the right to modify any dates or deadlines set 539 | forth in these Official Rules, Terms, and Conditions or otherwise 540 | governing the Challenge. 541 | 542 | ## Challenge Termination 543 | 544 | The Department reserves the right to suspend, postpone, cease, 545 | terminate, or otherwise modify this Challenge, or any entrant's 546 | participation in the Challenge, at any time at the Department's 547 | discretion. 548 | 549 | ## General Liability Release 550 | 551 | By participating in the Challenge, each entrant hereby agrees that --- 552 | (a) The Department shall not be responsible or liable for any losses, 553 | damages, or injuries of any kind (including death) resulting from 554 | participation in the Challenge or any Challenge-related activity, or 555 | from entrants' acceptance, receipt, possession, use, or misuse of any 556 | prize; and (b) The entrant will indemnify, defend, and hold harmless the 557 | Department from and against all third party claims, actions, or 558 | proceedings of any kind and from any and all damages, liabilities, 559 | costs, and expenses relating to, or arising from, the entrant's 560 | participation in the Challenge. 561 | 562 | Without limiting the generality of the foregoing, the Department is not 563 | responsible for incomplete, illegible, misdirected, misprinted, late, 564 | lost, postage-due, damaged, or stolen entries or prize notifications; or 565 | for lost, interrupted, inaccessible, or unavailable networks, servers, 566 | satellites, Internet Service Providers, webpages, or other connections; 567 | or for miscommunications, failed, jumbled, scrambled, delayed, or 568 | misdirected computer, telephone, cable transmissions or other 569 | communications; or for any technical malfunctions, failures, 570 | difficulties, or other errors of any kind or nature; or for the 571 | incorrect or inaccurate capture of information, or the failure to 572 | capture any information. 573 | 574 | These Official Rules, Terms, and Conditions cannot be modified except by 575 | the Department in its sole and absolute discretion. The invalidity or 576 | unenforceability of any provision of these Official Rules, Terms, and 577 | Conditions shall not affect the validity or enforceability of any other 578 | provision. In the event that any provision is determined to be invalid 579 | or otherwise unenforceable or illegal, these Official Rules, Terms, and 580 | Conditions shall otherwise remain in effect and shall be construed in 581 | accordance with their terms as if the invalid or illegal provision were 582 | not contained herein. 583 | 584 | ## Exercise 585 | 586 | The failure of the Department to exercise or enforce any right or 587 | provision of these Official Rules, Terms, and Conditions shall not 588 | constitute a waiver of such right or provision. 589 | 590 | ## Governing Law 591 | 592 | All issues and questions concerning the construction, validity, 593 | interpretation, and enforceability of these Official Rules, Terms, and 594 | Conditions shall be governed by and construed in accordance with U.S. 595 | Federal law as applied in the Federal courts of the District of Columbia 596 | if a complaint is filed by any party against the Department. 
## Privacy Policy

By participating in the Challenge, each entrant hereby agrees that
occasionally, the Department may also use the entrant's information to
contact the entrant about Federal Challenge and innovation-related
activities.

Please contact [automated-scoring-challenge@ed.gov](mailto:automated-scoring-challenge@ed.gov) should
you have any comments or questions about these Official Rules, Terms,
and Conditions.

## Other Information

*Accessible Format:* Individuals with disabilities can obtain this
document and a copy of the submission package in an accessible format
(e.g., braille, large print, audiotape, or compact disc) on request
to [automated-scoring-challenge@ed.gov](mailto:automated-scoring-challenge@ed.gov).

## Review

ED will screen all completed submissions to determine compliance with
submission criteria and determine eligible winner(s) following the
process described in Appendix A.

## How To Enter

1. Entrants must submit an application to participate by first completing the required security authorization forms to access NCES Confidential materials. These are provided at: https://github.com/NAEP-AS-Challenge/info/application-documents.zip. Completed applications should be sent via email to:
[automated-scoring-challenge@ed.gov](mailto:automated-scoring-challenge@ed.gov) by the ~~**20-Oct**~~ 11/1/2021 @ 11:59 PM ET deadline. NOTE: Applications will be reviewed on a rolling basis starting 9/21/2021.

2. Once approved, participants will be provided with secure access to
   the dataset and materials for the challenge.

3. Submissions must be uploaded to a secure server, with access provided
   via email to the lead applicant. All submissions will be kept
   confidential. Submissions should contain both the technical report
   and predicted scores. Submissions must be uploaded by 11/28/2021 at
   11:59 PM ET.

4. Within 30 days of final submissions, ALL participants are required to submit the signed and witnessed form confirming their destruction/deletion of all data that was provided for their use in this challenge. This form is available, with instructions for submission, at: https://github.com/NAEP-AS-Challenge/info/application-documents.zip.

All entrants consent to the Official Rules, Terms, and Conditions upon
submitting an entry. Once submitted, a submission may not be altered.
The Department reserves the right to disqualify any submission that the
Department deems inappropriate. The Department encourages entrants to
submit entries, in the form of a final technical report that contains
both a narrative and predicted scores, as far in advance of the deadline
as possible.

Individuals with disabilities who need an accommodation or auxiliary aid
in connection with the submission process should
contact [automated-scoring-challenge@ed.gov](mailto:automated-scoring-challenge@ed.gov).
If the Department provides an accommodation or auxiliary aid to an
individual with a disability in connection with the submission process,
the entry remains subject to all other requirements and limitations in
this notice.

## Timeline and Notification

All submissions will be acknowledged as they are received.
Winning entrants will be notified via email in mid-January 2022.

## Appendix A

Determining the Winner (Item-Specific and Generic Prompt Competition)

The review process will consist of the following steps:

1) Submissions will be reviewed to ensure that they are complete,
   including:

   A) The technical report for all entrants will be reviewed by a
      committee of panelists assembled by NCES at their discretion. If
      the panel determines that a report meets the criteria of
      transparency, explainability, and fairness as explained previously
      in the Challenge document, the submission will be evaluated. If it
      does not meet these criteria, NCES will provide feedback to the
      participant and provide up to 14 calendar days for a new
      submission of the report, which will be reviewed again. If the
      resubmission meets the standard, the response will be reviewed
      further. If it does not, the response will be rejected. The rubric
      to be used in evaluating reports for interpretability is provided
      below.

   B) The number of items and responses scored will be counted.
      Competitors must score all items and 99.5% of the responses which
      have a legitimate human score. Entries that do not provide
      sufficient scores to meet these criteria will be rejected.

   C) The pricing sheet will be reviewed to ensure that it includes both
      fixed and variable costs for an operational deployment of the
      scoring model.

|**Criteria**|**Responsive submissions will adequately:**|
| :- | :- |

|1. Transparency|- Explain the model building process<br>- Include descriptions of features used for model building<br>- Include description of algorithm used for model building|
|2. Explainability|- Provide feature values and model statistics as appropriate to methods used<br>- Provide results from model training and cross validation<br>- Provide validity explanations that consider items and scoring rubrics|
|3. Fairness|- Conduct analysis to ensure that models perform the same for different sub-populations, especially those from historically underserved communities.|
2) All items will be scored using quadratic weighted kappa, a metric
   which measures the agreement between two ratings. The winner will be
   determined by the highest average quadratic weighted kappa across the
   items used in the competition, rounded to the third decimal place
   (for more information about quadratic weighted kappa, see below). For
   the purposes of the competition, each item is weighted equally. This
   analysis will be performed separately for the item-specific models
   and for the generic model.

   a. Results shall be numerically ranked from most accurate to least
      accurate, and the top responses will be chosen for prize awards
      and public recognition.

## Quadratic weighted kappa

Quadratic weighted kappa has been used to determine the winning
predictions in a number of high-stakes competitions where the goal is to
match human ratings (Shermis & Hamner, 2012, 2013; Shermis, 2014, 2015).

Quadratic weighted kappa allows disagreements to be weighted differently
and is especially useful when codes or ratings are ordered. Three
matrices are involved: the matrix of observed scores, the matrix of
expected scores based on chance agreement, and the weight matrix. Weight
matrix cells located on the diagonal (upper-left to bottom-right)
represent agreement and thus contain zeros. Off-diagonal cells contain
weights indicating the seriousness of that disagreement. Often, cells
one off the diagonal are weighted 1, those two off 2, and so on.

Quadratic weighted kappa typically varies from 0 (random agreement
between raters) to 1 (complete agreement between raters). In the event
that there is less agreement between the raters than expected by chance,
the metric may go below 0. The quadratic weighted kappa is calculated
between the scores which are expected/known and the predicted scores.

Items in the Challenge data sets have score ranges from 0-2 up to 0-4.
Any assigned score outside the range of the item is considered an
unscorable response (e.g., condition codes assigned by human raters), is
treated as missing data, and no prediction should be made for that
response. The quadratic weighted kappa is calculated as follows. First,
an N x N histogram matrix *x* is constructed, such that cell (*i*, *j*)
of *x* counts the number of cases with an actual rating of *i* and a
predicted rating of *j*. An N x N matrix of weights, *w*, is calculated
based on the squared difference between actual and predicted rating
scores.

An N x N histogram matrix of expected ratings, *m*, is calculated,
assuming that there is no correlation between rating scores. This is
calculated as the outer product of the actual ratings' histogram vector
and the predicted ratings' histogram vector, normalized such that *m*
and *x* have the same sum.

From these three matrices, the quadratic weighted kappa is calculated.
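To make the computation above concrete, here is a minimal sketch of quadratic weighted kappa in Python that follows the three-matrix construction described in this appendix, together with the per-item averaging (rounded to the third decimal place) used to rank entries. The function and variable names and the toy data are illustrative, not official Challenge tooling; scikit-learn's `cohen_kappa_score(..., weights="quadratic")` computes the same statistic and can serve as a cross-check.

```python
import numpy as np

def quadratic_weighted_kappa(human, predicted, min_score, max_score):
    """Quadratic weighted kappa between human and predicted integer scores."""
    human = np.asarray(human, dtype=int)
    predicted = np.asarray(predicted, dtype=int)
    n = max_score - min_score + 1

    # Observed matrix x: counts of (actual rating i, predicted rating j).
    x = np.zeros((n, n))
    for h, p in zip(human, predicted):
        x[h - min_score, p - min_score] += 1

    # Weight matrix w: zeros on the diagonal, squared distance off it.
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    w = (i - j) ** 2 / (n - 1) ** 2

    # Expected matrix m: outer product of the two marginal histograms,
    # normalized so that m and x have the same sum.
    m = np.outer(x.sum(axis=1), x.sum(axis=0)) / x.sum()

    return 1.0 - (w * x).sum() / (w * m).sum()

# Toy example: average QWK across items, rounded to the third decimal place.
items = {
    "item_01": ([0, 1, 2, 2, 1], [0, 1, 2, 1, 1]),  # (human, predicted) scores, range 0-2
    "item_02": ([3, 0, 2, 1, 4], [3, 0, 2, 2, 4]),  # range 0-4
}
kappas = [quadratic_weighted_kappa(h, p, 0, max(max(h), max(p)))
          for h, p in items.values()]
print(round(sum(kappas) / len(kappas), 3))
```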
## Works Cited

Kumar, V., & Boulanger, D. (2020). Explainable Automated Essay Scoring:
Deep Learning Really Has Pedagogical Value. *Frontiers in Education*,
*5*, 572367.

Shermis, M. D., & Hamner, B. (2012). *Contrasting state-of-the-art in
the machine scoring: Analysis*. National Council on Measurement in
Education, Vancouver, BC.

Shermis, M. D., & Hamner, B. (2013). Contrasting state-of-the-art
automated scoring of essays. In M. D. Shermis & J. C. Burstein (Eds.),
*Handbook of Automated Essay Evaluation: Current Applications and New
Directions* (pp. 298-312). Routledge.

Shermis, M. D. (2014). State-of-the-art automated essay scoring: A
United States demonstration and competition, results, and future
directions. *Assessing Writing*, *20*, 53-76.

Shermis, M. D. (2015). Contrasting state-of-the-art in the machine
scoring of short-form constructed responses. *Educational Assessment*,
*20*(1), 46-65. https://doi.org/10.1080/10627197.2015.997617
--------------------------------------------------------------------------------
/results.md:
--------------------------------------------------------------------------------
# Results from the NAEP Reading Automated Scoring Challenge
Over two dozen teams participated in this Challenge, and seven awards were made from these submissions. The winners were identified based on the accuracy of their automated scores relative to human agreement and the lack of bias observed in their predictions. All awarded entries also provided technical reports that met the Challenge requirements for transparency, explainability, and fairness.

**Grand Prizes**
- Arianto Wibowo, Measurement Incorporated (Item-Specific Model)
- Andrew Lan, UMass-Amherst (Item-Specific Model)
- Susan Lottridge, Cambium Assessment (Item-Specific Model)
- Torsten Zesch, University of Duisburg-Essen (Generic Model)

**Runners-up**
- Fabian Zehner, DIPF | Leibniz Institute for Research and Information in Education, Centre for Technology-Based Assessment (Item-Specific Model)
- Scott Crossley, Georgia State University (Item-Specific Model)
- Prathic Sundararajan, Georgia Institute of Technology and Suraj Rajendran, Weill Cornell Medical College (Item-Specific Model)
- Susan Lottridge, Cambium Assessment (Generic Model)

### Item-Specific Challenge Results
For the item-specific challenge, three grand prize winners and three runner-up teams were selected. All of these submissions met the requirements for using automated models in an actual test, including accuracy comparable to human scoring and no substantive change in score differences based on the gender or race/ethnicity of the respondents. The accuracy analysis of the winning submissions follows below.

![QWK Differences for Automated Models in Item-Specific Challenge](images/greatest_hits_qwk_I.png)

As the figure illustrates, the top three submissions had only a .011 difference in QWK values for accuracy, comparing human inter-rater reliability to the agreement of automated scores with human scores. Given how close these results are and the relatively small number of predictions in the challenge, all of these submissions were deemed to deserve a grand prize; a different set of responses could easily result in a different order of top entries by statistical chance. All of the awarded submissions here are within the .05 QWK difference that is generally accepted for operational use of automated scoring models [(Williamson, D. M., Xi, X., & Breyer, F. J., 2012)](https://doi.org/10.1111/j.1745-3992.2011.00223.x).

![SMD Differences for Automated Models by Race/Ethnicity](images/greatest_hits_race_smd_I.png)


![SMD Differences for Automated Models by Gender](images/greatest_hits_gender_smd_I.png)

The figures above illustrate the differences in results by race/ethnicity and gender compared to the differences found in human scoring. Only small differences are observed on either the race or gender criteria relative to the human results; these accurate models do not exacerbate (or reduce) the differences observed in human-scored NAEP responses. The larger values observed for students in the "other" subgroup in the race/ethnicity analysis are likely the result of a small sample size. Nonetheless, all results are well within the 0.15 difference that is generally accepted for operational use. Given the importance of these differences to understanding NAEP results, this is an important outcome of the challenge.

### Generic Challenge Results
For the generic challenge, one grand prize and one runner-up were selected. The results in this challenge were much less accurate than those of the item-specific models and indicate an area for further development. The accuracy analysis of the winning submissions follows below. Given the lack of accuracy, a fairness analysis is not provided.

![QWK Differences for Automated Models in Generic Challenge](images/greatest_hits_qwk_G.png)

The accuracy of these results is much better than chance (0.00), but the results could not be relied upon to make inferences about scoring. Whereas the top results in the item-specific challenge had a degradation of 0.018, the degradation for the top generic submission was 0.314, over 15 times larger. These results indicate how important the contextual information within a NAEP reading item is. While it is possible to imagine ways that generic models could be used (e.g., to score pilot-tested items for preliminary results, or to provide immediate scores on new items), they are far from suitable for operational use in their current form.

A hearty congratulations to all the participants in the challenge, and appreciation from NCES for providing important insights into automated scoring of NAEP items.
--------------------------------------------------------------------------------