├── 2021_10_4_IES_AS_Challenge_RFI_Presentation.pdf ├── application_documents.zip ├── automated_scoring_press_release.pdf ├── images ├── greatest_hits_gender_smd_I.png ├── greatest_hits_qwk_G.png ├── greatest_hits_qwk_I.png └── greatest_hits_race_smd_I.png ├── readme.md └── results.md /2021_10_4_IES_AS_Challenge_RFI_Presentation.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NAEP-AS-Challenge/reading-prediction/91dec58b111429470a38b7ef198f7044a2463f7d/2021_10_4_IES_AS_Challenge_RFI_Presentation.pdf -------------------------------------------------------------------------------- /application_documents.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NAEP-AS-Challenge/reading-prediction/91dec58b111429470a38b7ef198f7044a2463f7d/application_documents.zip -------------------------------------------------------------------------------- /automated_scoring_press_release.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NAEP-AS-Challenge/reading-prediction/91dec58b111429470a38b7ef198f7044a2463f7d/automated_scoring_press_release.pdf -------------------------------------------------------------------------------- /images/greatest_hits_gender_smd_I.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NAEP-AS-Challenge/reading-prediction/91dec58b111429470a38b7ef198f7044a2463f7d/images/greatest_hits_gender_smd_I.png -------------------------------------------------------------------------------- /images/greatest_hits_qwk_G.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NAEP-AS-Challenge/reading-prediction/91dec58b111429470a38b7ef198f7044a2463f7d/images/greatest_hits_qwk_G.png -------------------------------------------------------------------------------- /images/greatest_hits_qwk_I.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NAEP-AS-Challenge/reading-prediction/91dec58b111429470a38b7ef198f7044a2463f7d/images/greatest_hits_qwk_I.png -------------------------------------------------------------------------------- /images/greatest_hits_race_smd_I.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NAEP-AS-Challenge/reading-prediction/91dec58b111429470a38b7ef198f7044a2463f7d/images/greatest_hits_race_smd_I.png -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | # ED.gov National Assessment of Educational Progress (NAEP) Automated Scoring Challenge 2 | 3 | ### Winners announced! 4 | Press release available [here](https://github.com/NAEP-AS-Challenge/info/blob/edd6f9565b5bfdeb303ad32c86e53f7e32339358/automated_scoring_press_release.pdf) and more detail about awarded submissions posted [here](https://github.com/NAEP-AS-Challenge/info/blob/952ee82cc52163b17bccc1f12174ce246b8dff9b/results.md). 5 | 6 | A pre-print of a published paper from the UMass Amherst winning submission is available [here](https://arxiv.org/abs/2205.09864) and code used for their winning entry is available [here](https://github.com/ni9elf/automated-scoring). 
## Overview
We are seeking submissions of automated scoring models to score
constructed response items for the National Assessment of Educational
Progress's reading assessment. The purpose of the challenge is to help
NAEP determine the existing capabilities, accuracy metrics, the
underlying validity evidence of assigned scores, and the costs and
efficiencies of using automated scoring with the NAEP reading assessment
items. The Challenge requires that submissions demonstrate
interpretability of models, provide score predictions using these
models, analyze models for potential bias based on student demographic
characteristics, and provide cost information for putting an automated
scoring system into operational use.

**CHALLENGE DETAILS**

TOTAL CASH PRIZES OFFERED: \$30,000 (maximum of \$20,000 for first-place entries)\
TYPE OF CHALLENGE: Automated Scoring of Open-Ended Reading Test Items

**SUBMISSION START: 9/16/2021 8:00 AM ET\
SUBMISSION END: 11/28/2021 5:00 PM ET**

A Request for Information Webinar was held 10/4/2021 @ 12:00 ET. The [slides are posted here](https://github.com/NAEP-AS-Challenge/info/blob/c782c585636cdb32bf7373291e93c7f34256ebba/2021_10_4_IES_AS_Challenge_RFI_Presentation.pdf).

## Description

Automated scoring using natural language processing is a well-developed
application of artificial intelligence in education. Results are
consistently demonstrated to be on par with the inter-rater reliability
of human scorers for well-developed items (Shermis, 2014). Currently,
the National Assessment of Educational Progress (NAEP) makes extensive
use of constructed response items. Annually, contractors assemble teams
of human scorers who score millions of student responses to NAEP's
assessments. Previous internal research on the application of automated
scoring to NAEP items indicates that NAEP's items can be scored
successfully with automated scoring using natural language processing.
Prior special studies have found that automated scoring can perform as
well as human raters both in assigning scores and in assigning a
confidence level associated with each predicted score. Compared to human
raters, no evidence of biased student scoring based on demographic
characteristics was observed.

This challenge seeks to expand on this earlier work to ascertain whether
a wider array of automated scoring models can perform well with a
representative subset of NAEP reading constructed response items
administered in 2017 to students in grades 4 and 8. The ultimate goal is
to produce reliable and valid score assignments, provide additional
information about responses (e.g., length, cohesion, linguistic
complexity), and generate scores more quickly while saving money on
scoring costs.

There are two components to this challenge; entrants may submit
responses to one or both of these components:

1) Component A - Item-Specific Models: Successful respondents will
   build a predictive model for each item that can be scored, using
   current state-of-the-art practices in operational automated scoring
   deployments. Extensive training data from prior human scoring
   administrations will be provided. The first-place prize for this
   component is \$15,000, with up to 4 runner-up prizes of \$1,250 each.
2) Component B - Generic Models: Successful respondents will build a
   generic scoring model that will score items that were not included in
   the training dataset, but are from the same administration, subject,
   and grade level. The first-place prize for this component is \$5,000,
   with up to 4 runner-up prizes of \$1,250 each.

Participants will be provided access to digital files that contain
information related to the results of human-scored constructed responses
to reading assessment items that were administered in 2017, such as item
text, passage, scoring rubric, student responses, and human assigned
scores (both single and double scored). The responses correspond to
items that accompany two genres of 4th and 8th grade reading passages,
literary and informational. Items for this challenge are of two writing
formats, short and extended constructed response.

The data set includes 20 items for the item-specific models and 2 items
for the generic models. There is an average of 1,181 double-scored
responses per dataset, which are divided into a training dataset (50%),
a validation dataset (10%), and a test dataset (40%). The validation
dataset is augmented with a much larger number of single-scored
responses (average 23,000 per item). Detailed information about each
item is provided under "Detailed Item Information" below.

In addition to model accuracy compared to human scorers, successful
respondents to this Challenge will provide documentation of model
interpretability through a technical report that will be evaluated for
transparency, explainability, and fairness as further explained in
"Evaluation Criteria". The Federal Government is particularly interested
in submissions that provide accurate results and meet these objectives,
as they have been absent from a good deal of recent research in
automated scoring, particularly for solutions using artificial
intelligence (e.g., neural networks, transformer networks) and other
complex algorithmic approaches (Kumar & Boulanger, 2020). This
documentation and the predicted scores will be submitted simultaneously.
The documentation will be evaluated before respondents' scored
submissions are evaluated. Only submissions whose documentation meets
the acceptance criteria (as specified below) will be considered valid
and evaluated for accuracy of the predicted scores against the hold-out
test dataset.

This process is consistent with the operational processes that the
Department intends to use as part of the approval process for scoring
and reporting; only models that can provide substantive validity
evidence would be approved for production use. This aspect of the
Challenge is of critical importance in educational contexts.

## Eligibility Information

Institutions and individuals that have the ability and capacity to
conduct research are eligible to apply. Eligible applicants include, but
are not limited to, non-profit and for-profit organizations and public
and private agencies and institutions, such as colleges and
universities.

## Requirements for Participation & Confidential Data Security

- The datasets used for this challenge contain student responses from
  previous NAEP assessments and are therefore considered NCES
  confidential materials.
  All participants must confirm that they are able to meet NCES
  Confidential Data security requirements. These requirements include
  restrictions on the use of data, security of data, and destruction of
  data when the analysis is completed. These requirements are specified
  in the security application documentation (available in this
  repository as ["application_documents.zip"](application_documents.zip)).
  Security documentation must be completed and submitted before an
  applicant will be provided access to the response data. Data must also
  be destroyed/deleted within 30 days of completing the Challenge, and
  all participants must submit a signed and witnessed form confirming
  that action. This form is also included within the security
  application.

- It is possible, although unlikely, that responses may contain
  individually identifiable information about students, their families,
  and their schools. Participants must agree that any such information
  will not be revealed.

- No person may:

  - use any information for any purpose other than for the purposes of
    this activity

  - make any publication whereby the data furnished by any particular
    person can be identified

- The Education Sciences Reform Act of 2002 requires IES to develop and
  enforce standards to protect the confidentiality of students, their
  families, and their schools in the collection, reporting, and
  publication of data. The IES confidentiality statute is found in
  Public Law 107-279, section 183 (or as codified in 20 U.S.C. 9573).

- Anyone who violates the confidentiality provisions of this Act when
  using the data shall be found guilty of a class E felony and can be
  imprisoned up to five years, and/or fined up to \$250,000.

No future NAEP contract work is guaranteed on the basis of performance
in this competition. Contracts are let through separate RFPs, where
performance may be one of many criteria used for evaluation.

## Challenge Timeline

| |**Item**|**Duration (days)**|**Start**|**Finish**|
| :-: | :- | :-: | :-: | :-: |
|2.1|Challenge posting period|30|16-Sep|20-Oct|
|2.2|Request for information webinar||4-Oct|4-Oct|
|**2.3**|**Application deadline \***|||~~**20-Oct**~~ 1-Nov @ 11:59 PM ET|
|2.4|Provide dataset|||28-Oct|
|2.5|Competitors prepare responses\*|30|29-Oct|28-Nov|
|**2.6**|**Response deadline**|||**28 Nov**|
|2.7|Select winner|||Mid-January|

\*Note: applications will be taken on a rolling basis and dataset access
will be provided as soon as possible (typically 48 hours after receipt
of required documentation).

## Request for Information Webinar

A Request for Information Webinar was held 10/4/2021 @ 12:00 ET. The
[slides are posted here](https://github.com/NAEP-AS-Challenge/info/blob/c782c585636cdb32bf7373291e93c7f34256ebba/2021_10_4_IES_AS_Challenge_RFI_Presentation.pdf).
Questions may also be sent via Github "issues" or via email to
[automated-scoring-challenge@ed.gov](mailto:automated-scoring-challenge@ed.gov).
173 | 174 | ## Frequently Asked Questions 175 | 176 | *About the National Center for Education Statistics (NCES)* 177 | 178 | The National Center for Education Statistics (NCES), one of the 179 | principal federal statistical agencies, is the primary federal entity 180 | for collecting and analyzing data related to education in the United 181 | States and other nations. It provides statistical services for educators 182 | and education officials at the federal, State, and local levels; 183 | Congress; researchers; students; parents; the media and the general 184 | public. NCES is located within the Institute of Education Sciences 185 | (IES), the research arm of the U.S. Department of Education. 186 | 187 | The National Assessment of Educational Progress (NAEP) is a 188 | congressionally mandated project administered by NCES. NAEP is given to 189 | a representative sample of students across the country. Results are 190 | reported for groups of students with similar characteristics (e.g., 191 | gender, race and ethnicity, school location), and are not reported for individual students. 192 | National results are available for all subjects assessed by NAEP. State 193 | and selected urban district results are available for mathematics, 194 | reading, and (in some assessment years) science and writing. 195 | 196 | In 2009 NAEP began its transition to computer-based administration with 197 | the hope of creating innovative performance assessments that more 198 | closely reflected ways students were working and learning in classrooms, 199 | increasing test security, enhancing cost effectiveness, reporting with a 200 | faster turn-around time, and collecting performance/process data 201 | with the goal of more comprehensively studying assessment results. The 202 | potential of automated scoring is consonant with these goals, as long as 203 | the procedure produces test scores with reliability and validity 204 | indices that are comparable to current levels for the program. This challenge will help NAEP explore whether moving in this 205 | direction is appropriate at this time. 206 | 207 | ## The Challenge 208 | 209 | Participants may submit a response to either or both of the components 210 | below. 211 | 212 | ## Component A: Item-Specific Models 213 | 214 | For **item-specific models**, the participant shall create a model for each of twenty items. 215 | Respondents will be provided with training data from prior human scoring which will include the item text, 216 | passage, scoring rubric, student responses, and human assigned scores 217 | (both single and double scored). Respondents will use this information, and may supplement this information with external training data, features, or models 218 | to create a predictive model of human scores that is applied to a set of 219 | "test" responses for which scores are not provided. The predicted scores 220 | will be submitted as part of participants' responses. 221 | 222 | ## Component B: Generic Models 223 | 224 | For **generic models**, the participant shall create a model for two 225 | items similar in genre and format to those in Component A, but for which 226 | only responses and scores will be provided. The model may be trained on 227 | any or all items from Component A and may be supplemented with external 228 | training data, features, or models. Respondents will use this 229 | information to create a predictive model of human scores that is applied 230 | to a set of "test" responses for which scores are not provided. 
The 231 | predicted scores will be submitted as part of the participants' responses. 232 | 233 | For both components, if your scoring engine links to an external routine or program that maps 234 | semantic space or determines other linguistic features, you are 235 | permitted to use these aids to make your predictions. However, you are 236 | responsible for obtaining any licenses or permissions to do so. If your 237 | scoring engine has these capabilities built in, those capabilities 238 | should be documented in the technical report. 239 | 240 | ## Training Datasets 241 | 242 | |**Component**|**Items Included** | 243 | | :- | :- | 244 | |Component A: Item-Specific Models|

- training data set (used for model building)<br>- cross-validation set (used for internal model evaluation)<br>- test data set (used for making score predictions)<br><br>The training and cross-validation sets have responses scored by two raters. The cross-validation set may be supplemented with single-scored responses. The test data set will have text only, no scores. In addition to response and score data, item text, passage, scoring rubric, and additional relevant information will be provided.|
|Component B: Generic Models|- all items from Component A for different items (similar in genre and grade)<br>- test data set for new items (used for making score predictions)<br>- optional: additional training data, features or models

| 246 | 247 | ## Detailed Item Information 248 | 249 | **Item-Specific Model Datasets** 250 | |Item ID|Gr. | For.|Avg. No. Words|Total N|Miss.|DS Training N|DS Validation N|DS + SS Validation N|DS Test N| 251 | | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | 252 | |2017\_DBA\_DR08\_1715RE2T13\_05|8|SCR|28.11|21166|100|533|107|20207|426| 253 | |2017\_DBA\_DR08\_1715RE2T13\_07|8|ECR|47.14|19934|385|495|99|19043|396| 254 | |2017\_DBA\_DR08\_1715RE2T13\_08|8|SCR|28.28|19669|224|502|101|18989|402| 255 | |2017\_DBA\_DR08\_1715RE4T05G08\_03|8|SCR|34.59|21197|81|539|108|20307|432| 256 | |2017\_DBA\_DR08\_1715RE4T05G08\_06|8|ECR|51.29|21170|83|531|106|20214|425| 257 | |2017\_DBA\_DR08\_1715RE4T05G08\_07|8|SCR|28.80|21088|62|529|106|20135|424| 258 | |2017\_DBA\_DR08\_1715RE4T05G08\_09|8|ECR|34.87|20869|93|527|105|19920|422| 259 | |2017\_DBA\_DR08\_1715RE4T08G08\_03|8|SCR|44.20|21380|92|543|109|20403|434| 260 | |2017\_DBA\_DR08\_1715RE4T08G08\_06|8|SCR|32.82|21309|107|538|108|20340|431| 261 | |2017\_DBA\_DR08\_1715RE4T08G08\_07|8|ECR|44.74|20990|185|529|106|20038|423| 262 | |2017\_DBA\_DR08\_1715RE4T08G08\_09|8|SCR|32.95|20589|190|513|103|19665|411| 263 | |2017\_DBA\_DR04\_1715RE1T10\_05|4|SCR|18.52|27806|355|690|138|26564|552| 264 | |2017\_DBA\_DR04\_1715RE4T05G04\_03|4|SCR|18.63|28264|323|715|143|26977|572| 265 | |2017\_DBA\_DR04\_1715RE4T05G04\_06|4|ECR|24.63|27462|327|682|137|26234|546| 266 | |2017\_DBA\_DR04\_1715RE4T05G04\_07|4|SCR|14.55|26588|222|666|133|25389|533| 267 | |2017\_DBA\_DR04\_1715RE4T05G04\_09|4|ECR|16.67|25651|426|634|127|24509|508| 268 | |2017\_DBA\_DR04\_1715RE4T08G04\_03|4|SCR|23.90|28307|292|707|141|27034|566| 269 | |2017\_DBA\_DR04\_1715RE4T08G04\_06|4|SCR|18.53|27533|329|694|139|26283|556| 270 | |2017\_DBA\_DR04\_1715RE4T08G04\_07|4|ECR|20.55|25960|667|641|128|24806|513| 271 | |2017\_DBA\_DR04\_1715RE4T08G04\_09|4|SCR|14.56|23720|597|583|117|22670|467| 272 | 273 | **Generic Model Datasets** 274 | |Item ID|Gr. | For.|DS Test N| 275 | | :- | :- | :- | :- | 276 | |2017\_DBA\_DR08\_1715RE2T13\_06|8|SCR|420| 277 | |2017\_DBA\_DR04\_1715RE1T10\_07|4|ECR|539| 278 | 279 | Item ID = Item identifier; 280 | Gr. = Grade; 281 | For. = Format; 282 | SCR = Short Constructed Response; 283 | ECR = Extended Constructed Response; 284 | Avg. No. Words = Average Number of Words; 285 | Miss. = Missing Data; 286 | DS Training N = Double-Scored Training N; 287 | DS Validation N = Double-Scored Validation N; 288 | DS Test N = Double-Scored Test N; 289 | DS+SS Validation N = Double-Scored + Single Scored Validation N 290 | 291 | ## Evaluation Criteria 292 | 293 | **Part 1: Model Interpretability**. Submissions must provide a 294 | technical report that explains the model development process and results 295 | appropriate to a technical audience with educational measurement 296 | expertise. It is not expected that competitors will reveal confidential 297 | information, but will provide evidence that enables an external reviewer 298 | to assess the validity and fairness of the automated scoring process and 299 | models. These reports will be submitted with submissions of predicted 300 | scores, but must be approved as providing a sufficient degree of 301 | interpretability per the criteria below before the response predictions 302 | will be evaluated. 303 | 304 | **Part 2: Model Accuracy**. Scoring performance will be evaluated 305 | on the average quadratic weighted kappa (rounded to the third 306 | decimal place) across all items in the competition. 
Competitors must score at least 99.5% of all scorable
responses (for which there is a human rating). During model
construction, competitors will have access to online resources to
request further information or to make suggestions about how to improve
performance. See Appendix A for more information on quadratic weighted
kappa and how the winner of the competition will be determined.

Model interpretability will be evaluated according to three criteria,
which will be equally weighted in the review:

a) **Transparency** -- explanation of the ***process for model training
   and testing***, the features extracted from the text, and the
   algorithms used in model building. While these may describe a general
   workflow, they should also include the specific text features and
   algorithmic choices used to create the final models that score items
   in this Challenge.

b) **Explainability** -- explanation of the ***resulting item model
   and/or individual scores*** that includes the input features
   considered, the modeling results, and algorithm choices.

c) **Fairness** -- analysis of any **differences based on student
   demographic background** in automated scoring compared to those found
   in human-scored results.

Although not an evaluated criterion for winning the competition,
technical reports should include estimates of the minimal training
sample sizes that would place the scoring engine's estimates within two
percent of the final predicted values.

|**Criteria**|**Responsive submissions will adequately address:**|
| :- | :- |
|1. Transparency|- Explain the model building process<br>- Include descriptions of features used for model building<br>- Include description of algorithm used for model building|
|2. Explainability|- Provide feature values and model statistics as appropriate to methods used<br>- Provide results from model training and cross validation<br>- Provide validity explanations that consider items and scoring rubrics|
|3. Fairness|- Conduct analysis to ensure that models perform the same for different sub-populations, especially those from historically underserved communities.|
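The fairness criterion is commonly operationalized with subgroup statistics such as the standardized mean difference (SMD) between automated and human scores, which is how the posted Challenge results report differences by gender and race/ethnicity. The sketch below is illustrative only: the column names (`human_score`, `predicted_score`, `group`) are hypothetical, and it uses one common SMD formulation (mean difference divided by the pooled standard deviation), which may differ from the exact formulation used by NCES.

```python
import numpy as np
import pandas as pd

def smd(machine: pd.Series, human: pd.Series) -> float:
    """Standardized mean difference between machine and human scores
    (mean difference divided by the pooled standard deviation)."""
    pooled_sd = np.sqrt((machine.var(ddof=1) + human.var(ddof=1)) / 2)
    return (machine.mean() - human.mean()) / pooled_sd

# Hypothetical scored responses with a demographic grouping column.
df = pd.DataFrame({
    "group":           ["A", "A", "A", "B", "B", "B"],
    "human_score":     [2, 1, 0, 1, 2, 2],
    "predicted_score": [2, 1, 1, 1, 2, 1],
})

# SMD per subgroup; values near zero indicate that the automated scores
# track human scores similarly for that subgroup.
by_group = {name: smd(g["predicted_score"], g["human_score"])
            for name, g in df.groupby("group")}
print(by_group)
```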
## Key Parameters

In conducting the work for the Challenge, there are several parameters
to consider:

- Both team and individual competitors are eligible to participate.

- Data sets will be provided in CSV format.

- All items in the training and test sets have been deemed "scorable" by
  human raters. There are responses in the validation set that were
  evaluated as "unscorable" and have no rating associated with them.
  They should be treated as missing data.

- Teams must complete the required security documentation before
  datasets will be released. This documentation is available at:
  ["application_documents.zip"](application_documents.zip).

## Deliverables

Valid submissions will include the following items:

- A technical report that provides model interpretability as previously
  described.

- Predicted scores (CSV format) from the test data responses (see below
  for data format).

- A pricing sheet that includes all costs related to the production
  implementation of automated scoring: model training, item scoring,
  and any other infrastructure or organizational costs that would be
  required to integrate machine scoring into a live scoring system.
  While some costs (e.g., project management) may be variable, we
  expect fixed costs for well-known items such as model training, item
  scoring, system administration, and others.

## Predicted Score Data Format and Upload Process

To submit your predicted scores, please use the following format to modify the test dataset provided for each item (an illustrative script appears below).
1. Delete the column "ReadingTextResponse" that contains the student response text (for data security reasons). **Please do not submit any files that contain the text of student responses**.
2. Add a column "predicted_score" and enter your predicted score in that column.
3. Add a column "participant" and enter the email address of the person who requested the dataset (you only need to enter it in one row).
4. Save the file using the same original filename in .CSV format.
5. Repeat for all items predicted and save into a single folder/directory.
6. Zip that folder/directory. Add your technical report and pricing sheet (if appropriate), and upload to the transfer.ies.gov folder that you have been provided via email. **Entries must be uploaded by the Challenge deadline (11/28/2021 11:59 PM ET)**.

## Challenge Administration Platform

Most aspects of the Challenge will be administered via Github
(https://github.com/NAEP-AS-Challenge/info).

Specifically, this platform will be used for the following purposes:

1) Information -- information about the challenge will be posted under
   "info" and available to the public.

2) Questions -- all questions about the Challenge, datasets, or items
   should be posted as an "issue" and will be publicly available.
   Responses will typically be made within 24 business hours.

Please note that response data will be provided outside of Github to the contact specified on the application. Submissions will also be uploaded to a separate secure server.
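As a companion to the upload steps above, here is a minimal sketch in Python (pandas) of preparing the per-item submission files. The directory names, the participant email, and the `predict_scores` helper are placeholders to be replaced with your own paths, address, and model; only the column names `ReadingTextResponse`, `predicted_score`, and `participant` come from the instructions.

```python
from pathlib import Path
import pandas as pd

TEST_DIR = Path("test_data")      # folder with the provided test CSVs (placeholder)
OUT_DIR = Path("submission")      # folder to zip and upload (placeholder)
PARTICIPANT = "you@example.org"   # email of the person who requested the dataset (placeholder)

def predict_scores(df: pd.DataFrame) -> list:
    # Placeholder: replace with your model; return one integer score per response.
    return [0] * len(df)

OUT_DIR.mkdir(exist_ok=True)
for csv_path in sorted(TEST_DIR.glob("*.csv")):
    df = pd.read_csv(csv_path)
    # Score the responses first (placeholder model), since the text column
    # is removed from the submitted file below.
    scores = predict_scores(df)
    # 1. Remove the student response text for data security.
    df = df.drop(columns=["ReadingTextResponse"])
    # 2. Add the predicted scores.
    df["predicted_score"] = scores
    # 3. Add the participant column; the email is only needed in one row.
    df["participant"] = ""
    df.loc[df.index[0], "participant"] = PARTICIPANT
    # 4./5. Save under the same original filename into a single folder.
    df.to_csv(OUT_DIR / csv_path.name, index=False)
# 6. Zip OUT_DIR, add the technical report and pricing sheet, and upload
#    the archive to the transfer.ies.gov folder provided via email.
```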
399 | 400 | ## Prizes 401 | 402 | The Department of Education is offering up to 10 prizes for a total 403 | potential award of up to \$30,000 (\$20,000 for item-specific models, 404 | \$10,000 for a generic model). The first-place prize for the item-specific 405 | challenge is \$15,000, and the first-place prize for the generic model 406 | is \$5,000. Up to 4 runner-up prizes in each category may be awarded 407 | with cash prizes of \$1,250 each. 408 | 409 | The winning results of the competition will be published in a technical 410 | report summarizing the results of the competition. At the Department's 411 | discretion, to assist with selecting winners, one or more of the most 412 | highly rated challenge participants may be invited to present a virtual 413 | presentation that reflects the basic elements of their technical report. 414 | 415 | Any potential prizes awarded under this Challenge will be paid by 416 | electronic funds transfer. Winners will be required to complete and 417 | return an Automated Clearing House (ACH) Vendor/Miscellaneous Payment 418 | Enrollment Form to ED within a given timeframe. The form collects 419 | banking information needed to make an electronic payment (direct 420 | deposit) to the winner. Award recipients will be responsible for any 421 | applicable local, state, and federal taxes and reporting that may be 422 | required under applicable tax laws. 423 | 424 | ## Rules 425 | 426 | ## Terms and Conditions 427 | 428 | All entry information submitted 429 | to [automated-scoring-challenge@ed.gov](mailto:automated-scoring-challenge@ed.gov) and 430 | all materials, including any copy of the submission, become property of 431 | the Department and will not be returned (See "Ownership and Licensing" 432 | for information about use of these items). Furthermore, the Department 433 | shall have no liability for any submission that is lost, intercepted, or 434 | not received by the Department. The Department assumes no liability or 435 | responsibility for any error, omission, interruption, deletion, theft, 436 | destruction, unauthorized access to, or alteration of, submissions. 437 | 438 | ## Representations and Warranties/Indemnification 439 | 440 | By participating in the Challenge, each entrant represents, warrants, 441 | and covenants as follows: 442 | 443 | 1. The entrants are the sole authors, creators, and owners of the 444 | submission; 445 | 446 | 2. The entrant's submission: 447 | 448 | a. Is not the subject of any actual or threatened litigation or 449 | claim; 450 | 451 | b. Does not, and will not, violate or infringe upon the privacy 452 | rights, publicity rights, or other legal rights of any third 453 | party; and 454 | 455 | c. Does not contain any harmful computer code (sometimes referred 456 | to as "malware," "viruses," or "worms"). 457 | 458 | 3. The submission, and entrants' implementation of the submission, does 459 | not, and will not, violate any applicable laws or regulations of the 460 | United States. 461 | 462 | 4. Entrants will indemnify, defend, and hold harmless the Department 463 | from and against all third party claims, actions, or proceedings of 464 | any kind and from any and all damages, liabilities, costs, and 465 | expenses relating to, or arising from, entrant's submission or any 466 | breach or alleged breach of any of the representations, warranties, 467 | and covenants of entrant hereunder. 468 | 469 | 5. 
The Department reserves the right to disqualify any submission that 470 | the Department, in its discretion, deems to violate these Official 471 | Rules, Terms, and Conditions in this notice. 472 | 473 | ## Ownership and Licensing 474 | 475 | Each entrant retains full ownership of the algorithmic approaches to 476 | their submission, including all intellectual property rights therein. By 477 | participating in the Challenge, each entrant hereby grants to the 478 | Department a royalty-free, nonexclusive, irrevocable, and worldwide 479 | license to reproduce, publish, produce derivative works, distribute 480 | copies to the public, perform publicly and display publicly, and/or 481 | otherwise use the technical report and assessment of submitted scores 482 | from each participant in the competition. 483 | 484 | ## Publicity Release 485 | 486 | By participating in the Challenge, each entrant hereby irrevocably 487 | grants to the Department the right to use the entrant's name, likeness, 488 | image, and biographical information in any and all media for advertising 489 | and promotional purposes relating to the Challenge. 490 | 491 | ## Disqualification 492 | 493 | The Department reserves the right, in its sole discretion, to disqualify 494 | any entrant who is found to be tampering with the entry process or the 495 | operation of the Challenge, Challenge webpage, or other 496 | Challenge-related webpages; to be acting in violation of these Official 497 | Rules, Terms, and Conditions; to be acting in an unsportsmanlike or 498 | disruptive manner, or with the intent to disrupt or undermine the 499 | legitimate operation of the Challenge; or to annoy, abuse, threaten, or 500 | harass any other person; and, the Department reserves the right to seek 501 | damages and other remedies from any such person to the fullest extent 502 | permitted by law. 503 | 504 | ## Disclaimer 505 | 506 | The Challenge webpage contains information and resources from public and 507 | private organizations that may be useful to the reader. Inclusion of 508 | this information does not constitute an endorsement by the Department of 509 | any products or services offered or views expressed. 510 | 511 | The Challenge webpage also contains hyperlinks and URLs created and 512 | maintained by outside organizations, which are provided for the reader's 513 | convenience. The Department is not responsible for the accuracy of the 514 | information contained therein. 515 | 516 | ## Notice to Challenge Entrants and Award Recipients 517 | 518 | Attempts to notify entrants and award recipients will be made using the 519 | email address associated with the entrants' submissions. The Department 520 | is not responsible for email or other communication problems of any 521 | kind. 522 | 523 | If, despite reasonable efforts, an entrant does not respond within three 524 | days of the first notification attempt regarding selection as an award 525 | recipient (or a shorter time as exigencies may require) or if the 526 | notification is returned as undeliverable to such entrant, that entrant 527 | may forfeit the entrant's award and associated prizes, and an alternate 528 | award recipient may be selected. 529 | 530 | If any potential award recipient is found to be ineligible, has not 531 | complied with these Official Rules, Terms, and Conditions, or declines 532 | the applicable prize for any reason prior to award, such potential award 533 | recipient will be disqualified. 
An alternate award recipient may be 534 | selected, or the applicable award may go unawarded. 535 | 536 | ## Dates/Deadlines 537 | 538 | The Department reserves the right to modify any dates or deadlines set 539 | forth in these Official Rules, Terms, and Conditions or otherwise 540 | governing the Challenge. 541 | 542 | ## Challenge Termination 543 | 544 | The Department reserves the right to suspend, postpone, cease, 545 | terminate, or otherwise modify this Challenge, or any entrant's 546 | participation in the Challenge, at any time at the Department's 547 | discretion. 548 | 549 | ## General Liability Release 550 | 551 | By participating in the Challenge, each entrant hereby agrees that --- 552 | (a) The Department shall not be responsible or liable for any losses, 553 | damages, or injuries of any kind (including death) resulting from 554 | participation in the Challenge or any Challenge-related activity, or 555 | from entrants' acceptance, receipt, possession, use, or misuse of any 556 | prize; and (b) The entrant will indemnify, defend, and hold harmless the 557 | Department from and against all third party claims, actions, or 558 | proceedings of any kind and from any and all damages, liabilities, 559 | costs, and expenses relating to, or arising from, the entrant's 560 | participation in the Challenge. 561 | 562 | Without limiting the generality of the foregoing, the Department is not 563 | responsible for incomplete, illegible, misdirected, misprinted, late, 564 | lost, postage-due, damaged, or stolen entries or prize notifications; or 565 | for lost, interrupted, inaccessible, or unavailable networks, servers, 566 | satellites, Internet Service Providers, webpages, or other connections; 567 | or for miscommunications, failed, jumbled, scrambled, delayed, or 568 | misdirected computer, telephone, cable transmissions or other 569 | communications; or for any technical malfunctions, failures, 570 | difficulties, or other errors of any kind or nature; or for the 571 | incorrect or inaccurate capture of information, or the failure to 572 | capture any information. 573 | 574 | These Official Rules, Terms, and Conditions cannot be modified except by 575 | the Department in its sole and absolute discretion. The invalidity or 576 | unenforceability of any provision of these Official Rules, Terms, and 577 | Conditions shall not affect the validity or enforceability of any other 578 | provision. In the event that any provision is determined to be invalid 579 | or otherwise unenforceable or illegal, these Official Rules, Terms, and 580 | Conditions shall otherwise remain in effect and shall be construed in 581 | accordance with their terms as if the invalid or illegal provision were 582 | not contained herein. 583 | 584 | ## Exercise 585 | 586 | The failure of the Department to exercise or enforce any right or 587 | provision of these Official Rules, Terms, and Conditions shall not 588 | constitute a waiver of such right or provision. 589 | 590 | ## Governing Law 591 | 592 | All issues and questions concerning the construction, validity, 593 | interpretation, and enforceability of these Official Rules, Terms, and 594 | Conditions shall be governed by and construed in accordance with U.S. 595 | Federal law as applied in the Federal courts of the District of Columbia 596 | if a complaint is filed by any party against the Department. 
## Privacy Policy

By participating in the Challenge, each entrant hereby agrees that
occasionally, the Department may also use the entrant's information to
contact the entrant about Federal Challenge and innovation-related
activities.

Please contact [automated-scoring-challenge@ed.gov](mailto:automated-scoring-challenge@ed.gov) should
you have any comments or questions about these Official Rules, Terms,
and Conditions.

## Other Information

*Accessible Format:* Individuals with disabilities can obtain this
document and a copy of the submission package in an accessible format
(e.g., braille, large print, audiotape, or compact disc) on request
to [automated-scoring-challenge@ed.gov](mailto:automated-scoring-challenge@ed.gov).

## Review

ED will screen all completed submissions to determine compliance with
submission criteria and determine eligible winner(s) following the
process described in Appendix A.

## How To Enter

1. Entrants must submit an application to participate by first completing the required security authorization forms to access NCES Confidential materials. These are provided at: https://github.com/NAEP-AS-Challenge/info/application-documents.zip. Completed applications should be sent via email to:
[automated-scoring-challenge@ed.gov](mailto:automated-scoring-challenge@ed.gov) by the ~~**20-Oct**~~ 11/1/2021 @ 11:59 PM ET deadline. NOTE: Applications will be reviewed on a rolling basis starting 9/21/2021.

2. Once approved, participants will be provided with secure access to
   the dataset and materials for the challenge.

3. Submissions must be uploaded to a secure server, with access provided
   via email to the lead applicant. All submissions will be kept
   confidential. Submissions should contain both the technical report
   and predicted scores. Submissions must be uploaded by 11/28/2021 at
   11:59 PM ET.

4. Within 30 days of final submissions, ALL participants are required to submit the signed and witnessed form confirming their destruction/deletion of all data that was provided for their use in this challenge. This form is available, with instructions for submission, at: https://github.com/NAEP-AS-Challenge/info/application-documents.zip.

All entrants consent to the Official Rules, Terms, and Conditions upon
submitting an entry. Once submitted, a submission may not be altered.
The Department reserves the right to disqualify any submission that the
Department deems inappropriate. The Department encourages entrants to
submit entries, in the form of a final technical report that contains
both a narrative and predicted scores, as far in advance of the deadline
as possible.

Individuals with disabilities who need an accommodation or auxiliary aid
in connection with the submission process should
contact [automated-scoring-challenge@ed.gov](mailto:automated-scoring-challenge@ed.gov).
If the Department provides an accommodation or auxiliary aid to an
individual with a disability in connection with the submission process,
the entry remains subject to all other requirements and limitations in
this notice.

## Timeline and Notification

All submissions will be acknowledged as they are received.
Winning entrants will be notified via email in mid-January 2022.

## Appendix A

Determining the Winner (Item-Specific and Generic Prompt Competition)

The review process will consist of the following steps:

1) Submissions will be reviewed to ensure that they are complete,
   including:

   A) The technical report for all entrants will be reviewed by a
      committee of panelists assembled by NCES at their discretion. If
      the panel determines that a report meets the criteria of
      transparency, explainability, and fairness as explained previously
      in the Challenge document, the submission will be evaluated. If it
      does not meet these criteria, NCES will provide feedback to the
      participant and provide up to 14 calendar days for a new
      submission of the report, which will be reviewed again. If the
      resubmission meets the standard, the response will be reviewed
      further. If it does not, the response will be rejected. The rubric
      to be used in evaluating reports for interpretability is provided
      below.

   B) The number of items and responses scored will be counted.
      Competitors must score all items and 99.5% of the responses which
      have a legitimate human score. Entries that do not provide
      sufficient scores to meet these criteria will be rejected.

   C) The pricing sheet will be reviewed to ensure that it includes both
      fixed and variable costs for an operational deployment of the
      scoring model.

|**Criteria**|**Responsive submissions will adequately:**|
| :- | :- |

|1. Transparency|- Explain the model building process<br>- Include descriptions of features used for model building<br>- Include description of algorithm used for model building|
|2. Explainability|- Provide feature values and model statistics as appropriate to methods used<br>- Provide results from model training and cross validation<br>- Provide validity explanations that consider items and scoring rubrics|
|3. Fairness|- Conduct analysis to ensure that models perform the same for different sub-populations, especially those from historically underserved communities.|
2) All items will be scored using quadratic weighted kappa, a metric
   which measures the agreement between two ratings. The winner will be
   determined by the highest average quadratic weighted kappa across the
   items used in the competition, rounded to the third decimal place
   (for more information about quadratic weighted kappa, see below). For
   the purposes of the competition, each item is weighted equally. This
   analysis will be performed separately for the item-specific models
   and for the generic model.

   a. Results shall be numerically ranked from most accurate to least
      accurate, and the top responses will be chosen for prize awards
      and public recognition.

## Quadratic weighted kappa

Quadratic weighted kappa has been used to determine the winning
predictions in a number of high-stakes competitions where the goal is to
match human ratings (Shermis & Hamner, 2012, 2013; Shermis, 2014, 2015).

Quadratic weighted kappa allows disagreements to be weighted differently
and is especially useful when codes or ratings are ordered. Three
matrices are involved: the matrix of observed scores, the matrix of
expected scores based on chance agreement, and the weight matrix. Weight
matrix cells located on the diagonal (upper-left to bottom-right)
represent agreement and thus contain zeros. Off-diagonal cells contain
weights indicating the seriousness of that disagreement. Often, cells
one off the diagonal are weighted 1, those two off 2, and so on.

Quadratic weighted kappa typically varies from 0 (random agreement
between raters) to 1 (complete agreement between raters). In the event
that there is less agreement between the raters than expected by chance,
the metric may go below 0. The quadratic weighted kappa is calculated
between the scores which are expected/known and the predicted scores.

Items in the Challenge data sets have score ranges from 0-2 up to 0-4.
Any assigned score outside the range of the item is considered an
unscorable response (e.g., condition codes assigned by human raters), is
treated as missing data, and no prediction should be made for that
response. The quadratic weighted kappa is calculated as follows. First,
an N x N histogram matrix *x* is constructed, such that cell (*i*, *j*)
of *x* counts the number of cases with an actual rating of *i* and a
predicted rating of *j*. An N x N matrix of weights, *w*, is calculated
based on the squared difference between actual and predicted rating
scores.

An N x N histogram matrix of expected ratings, *m*, is calculated,
assuming that there is no correlation between rating scores. This is
calculated as the outer product of the actual ratings' histogram vector
and the predicted ratings' histogram vector, normalized such that *m*
and *x* have the same sum.

From these three matrices, the quadratic weighted kappa is calculated.
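To make the computation above concrete, here is a minimal sketch of quadratic weighted kappa in Python that follows the three-matrix construction described in this appendix, together with the per-item averaging (rounded to the third decimal place) used to rank entries. The function and variable names and the toy data are illustrative, not official Challenge tooling; scikit-learn's `cohen_kappa_score(..., weights="quadratic")` computes the same statistic and can serve as a cross-check.

```python
import numpy as np

def quadratic_weighted_kappa(human, predicted, min_score, max_score):
    """Quadratic weighted kappa between human and predicted integer scores."""
    human = np.asarray(human, dtype=int)
    predicted = np.asarray(predicted, dtype=int)
    n = max_score - min_score + 1

    # Observed matrix x: counts of (actual rating i, predicted rating j).
    x = np.zeros((n, n))
    for h, p in zip(human, predicted):
        x[h - min_score, p - min_score] += 1

    # Weight matrix w: zeros on the diagonal, squared distance off it.
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    w = (i - j) ** 2 / (n - 1) ** 2

    # Expected matrix m: outer product of the two marginal histograms,
    # normalized so that m and x have the same sum.
    m = np.outer(x.sum(axis=1), x.sum(axis=0)) / x.sum()

    return 1.0 - (w * x).sum() / (w * m).sum()

# Toy example: average QWK across items, rounded to the third decimal place.
items = {
    "item_01": ([0, 1, 2, 2, 1], [0, 1, 2, 1, 1]),  # (human, predicted) scores, range 0-2
    "item_02": ([3, 0, 2, 1, 4], [3, 0, 2, 2, 4]),  # range 0-4
}
kappas = [quadratic_weighted_kappa(h, p, 0, max(max(h), max(p)))
          for h, p in items.values()]
print(round(sum(kappas) / len(kappas), 3))
```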
## Works Cited

Kumar, V., & Boulanger, D. (2020). Explainable Automated Essay Scoring:
Deep Learning Really Has Pedagogical Value. *Frontiers in Education*,
*5*, 572367.

Shermis, M. D., & Hamner, B. (2012). *Contrasting state-of-the-art in
the machine scoring: Analysis*. National Council on Measurement in
Education, Vancouver, BC.

Shermis, M. D., & Hamner, B. (2013). Contrasting state-of-the-art
automated scoring of essays. In M. D. Shermis & J. C. Burstein (Eds.),
*Handbook of Automated Essay Evaluation: Current Applications and New
Directions* (pp. 298-312). Routledge.

Shermis, M. D. (2014). State-of-the-art automated essay scoring: A
United States demonstration and competition, results, and future
directions. *Assessing Writing*, *20*, 53-76.

Shermis, M. D. (2015). Contrasting state-of-the-art in the machine
scoring of short-form constructed responses. *Educational Assessment*,
*20*(1), 46-65. https://doi.org/10.1080/10627197.2015.997617
--------------------------------------------------------------------------------
/results.md:
--------------------------------------------------------------------------------
# Results from the NAEP Reading Automated Scoring Challenge
Over two dozen teams participated in this Challenge, and seven awards were made from these submissions. The winners were identified based on the accuracy of their automated scores relative to human agreement and the lack of bias observed in their predictions. All awarded entries also provided technical reports that met the Challenge requirements for transparency, explainability, and fairness.

**Grand Prizes**
- Arianto Wibowo, Measurement Incorporated (Item-Specific Model)
- Andrew Lan, UMass-Amherst (Item-Specific Model)
- Susan Lottridge, Cambium Assessment (Item-Specific Model)
- Torsten Zesch, University of Duisburg-Essen (Generic Model)

**Runners-up**
- Fabian Zehner, DIPF | Leibniz Institute for Research and Information in Education, Centre for Technology-Based Assessment (Item-Specific Model)
- Scott Crossley, Georgia State University (Item-Specific Model)
- Prathic Sundararajan, Georgia Institute of Technology and Suraj Rajendran, Weill Cornell Medical College (Item-Specific Model)
- Susan Lottridge, Cambium Assessment (Generic Model)

### Item-Specific Challenge Results
For the item-specific challenge, three grand prize winners and three runner-up teams were selected. All of these submissions met the requirements for using automated models in an actual test, including accuracy comparable to human scoring and no substantive change in score differences based on the gender or race/ethnicity of the respondents. The accuracy analysis of the winning submissions follows below.

![QWK Differences for Automated Models in Item-Specific Challenge](images/greatest_hits_qwk_I.png)

As the figure illustrates, the top three submissions had only a .011 difference in QWK values for accuracy, comparing human inter-rater reliability to the agreement of automated scores with human scores. Given how close these results are and the relatively small number of predictions in the challenge, all of these submissions were deemed to deserve a grand prize; a different set of responses could easily result in a different order of top entries by statistical chance. All of the awarded submissions here are within the .05 QWK difference that is generally accepted for operational use of automated scoring models [(Williamson, D. M., Xi, X., & Breyer, F. J., 2012)](https://doi.org/10.1111/j.1745-3992.2011.00223.x).

![SMD Differences for Automated Models by Race/Ethnicity](images/greatest_hits_race_smd_I.png)


![SMD Differences for Automated Models by Gender](images/greatest_hits_gender_smd_I.png)

The figures above illustrate the differences in results by race/ethnicity and gender compared to the differences found in human scoring. Only small differences are observed on either the race or gender criteria relative to the human results; these accurate models do not exacerbate (or reduce) the differences observed in human-scored NAEP responses. The larger values observed for students in the "other" subgroup in the race/ethnicity analysis are likely the result of a small sample size. Nonetheless, all results are well within the 0.15 difference that is generally accepted for operational use. Given the importance of these differences to understanding NAEP results, this is an important outcome of the challenge.

### Generic Challenge Results
For the generic challenge, one grand prize and one runner-up were selected. The results in this challenge were much less accurate than those of the item-specific models and indicate an area for further development. The accuracy analysis of the winning submissions follows below. Given the lack of accuracy, a fairness analysis is not provided.

![QWK Differences for Automated Models in Generic Challenge](images/greatest_hits_qwk_G.png)

The accuracy of these results is much better than chance (0.00), but the results could not be relied upon to make inferences about scoring. Whereas the top results in the item-specific challenge had a degradation of 0.018, the degradation for the top generic submission was 0.314, over 15 times larger. These results indicate how important the contextual information within a NAEP reading item is. While it is possible to imagine ways that generic models could be used (e.g., to score pilot-tested items for preliminary results, or to provide immediate scores on new items), they are far from suitable for operational use in their current form.

A hearty congratulations to all the participants in the challenge, and appreciation from NCES for providing important insights into automated scoring of NAEP items.
--------------------------------------------------------------------------------