A perennial question as technology improves is the extent to which it will change, or replace, the work traditionally done by humans. From self-checkout at the grocery store to the ability of AI to detect serious diseases on medical scans, workers in all areas find themselves working alongside tools that can do parts of their jobs. With the availability of AI tools in classrooms accelerated by the pandemic and showing no signs of slowing, teaching has become yet another field in which professional work is shared with tools like AI.
We wondered about the role of AI in one specific teaching practice: assessing student learning. The time it takes to score and give feedback on student work deters many writing teachers from assigning lengthier writing tasks, and most students wait a long time to receive grades and feedback. An AI that helps grade student work could therefore save significant time and open up more opportunities to learn. Then again, we wondered, could an AI scoring and feedback system really help students as much as teachers could?
"Teachers have the ability to say, 'What were you trying to tell me? Because I don't understand.' The AI is trying to fix the writing process and the format—fix what is already there, not trying to understand what they intended to say."
We recently completed an evaluation of an AI-equipped platform through which middle school students could draft, submit and revise argumentative essays in response to pre-curated writing prompts. Every time students clicked ‘submit,’ they received mastery-based scores (1–4) in four writing dimensions (Claim & Focus, Support & Evidence, Organization, Language & Style), along with dimension-aligned comments offering observations and suggestions for improvement, all generated by the AI instantly upon submission.
To compare AI scores and feedback with those given by actual teachers, we hosted an in-person convening of 16 middle school writing teachers who had used the platform with their students during the 2021–22 school year. After calibrating together on the project rubric to ensure reliable understanding and application of the scores and suggestions, we assigned each teacher 10 random essays (not from their own students) to score and provide feedback on. This yielded a total of 160 teacher-assessed essays, which we could compare directly to the AI-given scores and feedback on those same essays.
How were teachers’ scores similar to or different from scores given by the AI?
On average, we found that teachers scored essays lower than the AI, with significant differences in every dimension except for Claim & Focus. On the overall score across all four dimensions (minimum 4, maximum 16), teachers’ average score on these 160 essays was 7.6, while the AI’s average score on the same set of papers was 8.8. Looking at particular dimensions, Figure 1 shows that in Claim & Focus and Support & Evidence, teachers and the AI tended to agree on the high (4) and low (1) scoring essays but disagreed in the middle, with teachers more likely to score an essay a 2 and the AI more likely to score it a 3. In Organization and Language & Style, on the other hand, teachers were far more likely to score essays at a 1 or 2, while AI scores were spread across 1 through 4, with many more essays at 3 or even 4.
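For readers who want to see the arithmetic behind the overall score, here is a minimal sketch in Python. It assumes the overall score is simply the sum of the four 1–4 dimension scores, as the 4-to-16 range suggests. The function names and the two sample essays are hypothetical illustrations, not the study's data or the platform's code.

```python
from statistics import mean

# The four rubric dimensions named in the article.
DIMENSIONS = ("Claim & Focus", "Support & Evidence", "Organization", "Language & Style")


def overall_score(scores):
    """Sum the four 1-4 dimension scores into an overall score (range 4-16)."""
    for dim in DIMENSIONS:
        if not 1 <= scores[dim] <= 4:
            raise ValueError(f"{dim} score must be between 1 and 4")
    return sum(scores[dim] for dim in DIMENSIONS)


def mean_gap(teacher_essays, ai_essays):
    """Average AI overall score minus average teacher overall score."""
    ai_mean = mean(overall_score(s) for s in ai_essays)
    teacher_mean = mean(overall_score(s) for s in teacher_essays)
    return ai_mean - teacher_mean


# Hypothetical scores for two essays, made up for illustration only.
teacher = [dict(zip(DIMENSIONS, (2, 2, 1, 2))), dict(zip(DIMENSIONS, (3, 2, 2, 1)))]
ai = [dict(zip(DIMENSIONS, (2, 3, 3, 2))), dict(zip(DIMENSIONS, (3, 3, 2, 3)))]

print(mean_gap(teacher, ai))
```

Applied to the averages reported above, the same subtraction puts the AI about 1.2 points higher than teachers on the 4-to-16 scale (8.8 versus 7.6).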
How were teachers’ written comments similar to or different from those given by the AI?
During our convening with the 16 teachers, we gave them opportunities to discuss the scores and feedback they had given on their 10 essays. Even before they reflected on their specific essays, a common observation we heard was that when they had used the program in their own classrooms the previous year, they needed to help the majority of their students read and interpret the comments the AI had given. In many cases, for example, they reported that students would read a comment but be unsure what it was asking them to do to improve their writing. One immediate difference that emerged, according to teachers, was therefore their ability to put their comments into developmentally appropriate language that matched their students’ needs and capacities.
"In reflection, we discussed how nice AI was, even in the comments/feedback. The kids that are coming up now are used to more direct, honest feedback. It's not always about stroking the ego but about fixing a problem. So we don't always need two stars for one wish. Sometimes we need to be straight to the point."
Another difference that emerged was teachers’ focus on the essay as a whole: the flow, the voice, whether it was just a summary or built an argument, whether the evidence suited the argument, and whether it all made sense together. Teachers reasoned that their tendency to score a 2 in the argument-focused domains of Claim & Focus and Support & Evidence was due to their ability to see the whole essay, which this AI is unable to do, since many AIs are trained on sentence-level rather than whole-essay guidance.
Teachers’ harsher assessment of Organization similarly stems from their ability, unlike the AI, to grasp the whole essay’s sequence and flow. Teachers shared, for instance, that the AI could spot transition words or guide students to use more of them, and would treat their use as evidence of good organization. As teachers, by contrast, they could see whether the transitions actually flowed or were just plugged into an incoherent set of sentences. In the domain of Language & Style, teachers again pointed out ways the AI was easier to fool, such as with a string of seemingly sophisticated vocabulary that would impress the AI but that a teacher would see as a series of words that did not add up to a sentence or idea.
Can AI help teachers with grading?
Assessing student work well is a time-consuming and hugely important component of teaching, especially when students are learning to write. Students need steady practice with rapid feedback in order to become confident, solid writers, but most teachers have too little planning and grading time, and too many students, to assign routine or lengthy writing while maintaining any semblance of work-life balance or sustainability in their careers.
The promise of AI to alleviate some of this burden is potentially quite significant. While our initial findings in this study show that teachers and AI approach assessment in slightly different ways, we believe there is real potential for AI to help teachers with grading if AI systems can be trained to see essays more holistically, the way teachers do, and to craft feedback in developmentally and contextually appropriate language that students can process independently. We believe improving AI in these areas is a worthwhile pursuit, both to reduce teachers’ grading burdens and, as a result, to ensure students get more frequent opportunities to write, paired with immediate and helpful feedback, to grow as writers.