Case Study: Grading Assistant

As the LLM revolution started reverberating across the world, NoRedInk was in an interesting position. It had no prior experience with AI or machine learning, but its whole reason for being, teaching English Language Arts, was exactly what the early models seemed best able to support. A huge opportunity to advance our mission had arrived out of the blue, but it wasn't yet clear what that opportunity would look like.

What was clear: companies were employing "spray-on AI" to ride the early wave, bolting on LLM-powered features that were more about checking a box than meaningfully improving their products. NoRedInk wanted to take a more thoughtful, long-term approach. After early ideation, we arrived at automatic grading: certainly not a new concept in the industry, but one we felt LLMs were uniquely suited to, and a feature that would serve NoRedInk's teachers and students in a fundamental way.

Finding the Opportunity

Our early research showed that grading writing is one of the highest-friction tasks for teachers. Even committed teachers reported struggling to find the time and to provide individualized feedback consistently. Teachers who were less confident teaching writing often avoided assigning it at all, prioritizing easier-to-measure skill practice. One teacher interview emphasized that he “would not typically stray into writing…unless AI can grade it and ELA teachers trust it over 80% of the time.”

We began internal experimentation to see whether an LLM could evaluate student work reliably, how rubric-aligned feedback might be generated, and what level of accuracy was achievable with prompt engineering and curriculum input. Product, curriculum, and engineering collaborated to define the “surface area” that an MVP could reasonably support. These early constraints refined our ideation into a manageable product scope suitable for beta testing. We determined that we should focus first on a subset of writing (argumentative paragraphs) where grading criteria are explicit and structured, and prioritize a lightweight experience that reduces friction for teachers rather than adding steps.
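To gauge accuracy during these experiments, we compared the model's rubric scores against teacher-assigned scores on the same submissions. The sketch below is a minimal, hypothetical version of that comparison; the criterion names, data shapes, and agreement threshold are illustrative, not our production pipeline.

```python
# Hypothetical sketch: how often do LLM rubric scores agree with teacher scores?
# Criterion names and the 80% target are illustrative, echoing the trust bar
# teachers described in interviews.

CRITERIA = ["claim", "evidence", "reasoning", "conventions"]

def agreement_rate(teacher_scores, llm_scores):
    """Fraction of (submission, criterion) pairs where both graders gave the same score."""
    matches = 0
    total = 0
    for submission_id, teacher in teacher_scores.items():
        llm = llm_scores.get(submission_id, {})
        for criterion in CRITERIA:
            if criterion in teacher and criterion in llm:
                total += 1
                matches += int(teacher[criterion] == llm[criterion])
    return matches / total if total else 0.0

# Example: two submissions scored 0-4 on each criterion by a teacher and by the model.
teacher = {"s1": {"claim": 3, "evidence": 2, "reasoning": 3, "conventions": 4},
           "s2": {"claim": 4, "evidence": 4, "reasoning": 3, "conventions": 3}}
model = {"s1": {"claim": 3, "evidence": 2, "reasoning": 2, "conventions": 4},
         "s2": {"claim": 4, "evidence": 4, "reasoning": 3, "conventions": 3}}

if agreement_rate(teacher, model) >= 0.8:   # the "trust it over 80% of the time" bar
    print("Meets the teacher-trust threshold")
```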

Designing the Experience

Based on what we were hearing from teachers, we established design principles to guide the work:

  • Teacher-first: Teachers must always remain in control of grading decisions.
  • Transparent: Show how the AI reached its judgment.
  • Lightweight: Reduce grading effort rather than introducing new workflows.
  • Aligned: Follow familiar rubric structures grounded in curriculum accuracy.

These principles led me to a design that provided AI feedback as comments for the student, which doubled as explanations for the teacher. I started with an expansive prototype that included in-line commenting and general comments. This helped sell the concept to our leadership team and gave us a canvas to test ideas on. Once our experiments revealed that the LLM functioned best with a narrow, highly structured rubric, in-line and general commenting were dropped in favor of grading and commenting based on rubric criteria.
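As an illustration of what "grading and commenting based on rubric criteria" looked like under the hood, here is a minimal, hypothetical sketch of a rubric-constrained prompt and structured response. The `call_llm` helper, rubric wording, and JSON shape are assumptions for the example, not the production prompt.

```python
import json

# Hypothetical helper standing in for whichever LLM client is used; assumed for this sketch.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

RUBRIC = {
    "claim": "States a clear, arguable claim.",
    "evidence": "Supports the claim with relevant evidence.",
    "reasoning": "Explains how the evidence supports the claim.",
}

def grade_paragraph(paragraph: str) -> dict:
    """Ask for one score (0-4) and one student-facing comment per rubric criterion."""
    criteria_text = "\n".join(f"- {name}: {desc}" for name, desc in RUBRIC.items())
    prompt = (
        "You are grading an argumentative paragraph against this rubric:\n"
        f"{criteria_text}\n\n"
        "Return JSON mapping each criterion to {\"score\": 0-4, \"comment\": \"...\"}.\n"
        "Comments should be specific, encouraging, and addressed to the student.\n\n"
        f"Paragraph:\n{paragraph}"
    )
    return json.loads(call_llm(prompt))
```

Keeping the model inside a fixed set of criteria like this is what made its output double as both a score explanation for the teacher and a comment for the student.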

One particular focus of my design process was how explicitly teachers should need to review and approve the AI feedback. Initially, my design included both conveniences and safeguards: teachers were asked to approve comments, but they could also mass-approve them; teachers could skip approval, but a nag modal encouraged them to give things another look. Ultimately, out of a desire to ensure that students received only feedback that had been vetted by their teachers and not an unaccountable AI, I removed the ability for teachers to mass-approve all feedback. It was early days in teachers' and students' relationships with AI, and maintaining trust was paramount. Conversely, to keep the experience from becoming too onerous, I removed the nag modal.

Launching the Beta

After I had validated the design with internal users and stakeholders, and after the feedback produced by Grading Assistant met our quality threshold, we launched a beta to a subset of users. The purpose of the beta was to validate whether AI-assisted grading could measurably reduce teacher workload, accelerate feedback to students, and maintain a level of scoring reliability that preserved teacher trust. The beta also served internal goals, such as exploring the cost implications of high-frequency AI use and building organizational experience in developing production-ready AI features. Prior to the beta launch, I created a microsite to announce the coming feature and provide a way for teachers to sign up for it. This built hype and established a means of communicating directly with our beta testers.

The logistics of the beta launch reflected the cross-functional complexity of releasing an AI-powered feature. The team set a firm launch date and prepared an extensive checklist to ensure reliability across engineering, design, QA, and curriculum. I finalized the user experience, usability testing the final prototype to ensure that the interface made sense to teachers and that they understood how to approve and edit comments. I also worked with engineering on last-minute polish, such as small animations that played when teachers approved feedback, and we readied surveys and interview guides for post-launch evaluation.

Getting Back Results

The results from the beta were extremely encouraging:

  • Teachers spent 82 seconds manually grading comparable assignments, compared to 37 seconds with Grading Assistant.
  • Twice as many students received feedback within a day when the assignment was graded with Grading Assistant.
  • Students were 3× more likely to receive written feedback on their assignment when Grading Assistant was used (45% vs. 16%).

Usage and adoption were also strong. 15% of teachers who assigned any writing chose a Grading Assistant assignment, and 55% of teachers who viewed a Grading Assistant assignment ultimately assigned it. Qualitatively, teachers showed high enthusiasm. Many expressed interest in using Grading Assistant “every day” for writing practice. Teachers appreciated the reduction in grading load, allowing them to focus more on instruction and less on mechanical scoring. They thought the AI-generated feedback was clearer, more specific, and more encouraging than what they could produce within typical time constraints. Teachers also remarked that students found the feedback easy to act on and were energized by how quickly it arrived.

Of course, these were beta users, so it was a self-selecting group. Nevertheless, both quantitatively and qualitatively, we felt affirmed that we were on the right track and should proceed immediately to a full release.

Improvements Based on User Feedback

Teachers in the beta edited scores or feedback only 5% of the time, suggesting strong perceived accuracy. And since many noted that manually clicking “accept” for each comment could be tedious, I iterated on the interface to drop the need to approve AI comments, resulting in a simpler, more efficient experience. However, this caused internal users to wonder whether, in the absence of explicit confirm/edit buttons, their edits were being saved. This led me to iterate through several ways of conveying autosaving or adding back save buttons. Eventually, I found that a simple message did the trick without needing to introduce new functionality.

Much of the post-beta iteration focused on comment tone and structure, including more positive language, higher specificity, and better personalization. We also found that feedback containing paraphrased student writing could, in rare cases, unintentionally repeat harmful content. In response, we introduced better content detection, flagging potentially harmful submissions instead of grading them. Concurrently, I designed the experience whereby teachers would be alerted to these flags and could choose to grade manually or "force" Grading Assistant to evaluate a submission.
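The flow, roughly, was a pre-grading gate: check the submission first, and only grade it automatically if nothing is flagged. The sketch below is a hypothetical illustration of that routing; `detect_harmful_content` and the return shapes are placeholder names, and the real detection logic lived in the product's backend.

```python
# Hypothetical sketch of the flag-or-grade gate; function names are placeholders.

def detect_harmful_content(text: str) -> list[str]:
    """Return reasons the submission should not be auto-graded (empty list if none)."""
    raise NotImplementedError("stand-in for the real content-detection service")

def route_submission(text: str, teacher_forced: bool = False) -> dict:
    """Decide whether a submission is auto-graded or surfaced to the teacher as flagged."""
    flags = [] if teacher_forced else detect_harmful_content(text)
    if flags:
        # The teacher is alerted and can grade manually
        # or "force" Grading Assistant to run anyway.
        return {"status": "flagged", "reasons": flags}
    # Otherwise grading proceeds as in the rubric sketch above.
    return {"status": "ready_to_grade"}
```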

Perhaps the biggest barrier to satisfaction was not the user experience or the comment quality. Teachers simply wanted more. They wanted it for more genres, for longer essays, and to work with source texts. This was a great problem to have, and I proposed early examples of how more options and different types of grading could be integrated into the experience, which the team picked up down the line.

Productization & Integration

Moving from a limited beta to a full-fledged platform feature meant integrating Grading Assistant seamlessly into the application while also promoting its use to our teachers. To that end, I reviewed our teacher experience to ensure it supported and acknowledged Grading Assistant. I added iconographic notations to indicate when a given assignment used the feature, both to differentiate it from "normal" assignments and to remind teachers of their use of it. I added calls to action in appropriate locations, such as recommended assignments on the dashboard and the assignment library, ensuring that teachers would be encouraged to use it and could reach it through familiar means. I also created upsells for our free teacher users to get them to try Grading Assistant as part of a free trial. Very encouragingly, we found that teachers who used Grading Assistant during the free trial were 70% more likely to apply for Premium.

Beyond the core teacher experience, I worked with our Enterprise team to add these new Grading Assistant assignments to our Benchmark experience, designing the flow whereby district admins could choose from and modify the assignments supported by the Benchmark feature. I also made information architecture recommendations about what should and should not be included and how the assignments should be organized.

Conclusion

Grading Assistant meaningfully reduced friction in writing instruction by halving teacher grading time, tripling student access to written feedback, and increasing teacher willingness to assign writing tasks. Our collaborative, cross-disciplinary approach ensured that the feature not only met the technical and pedagogical challenge of automated grading but also built genuine trust among teachers.

To me, the design process demonstrated the importance of constraining ambitious technology within clear pedagogical boundaries. Accuracy alone was not enough; teachers needed transparency, alignment to familiar rubrics, and confidence that they remained in control. Small interaction decisions, such as how feedback was structured, when approval was required, and how edits were saved, had an outsized impact on the experience. By pairing technical experimentation with careful design, we were able to introduce AI in a way that felt supportive rather than disruptive to classroom practice.

Going forward, my focus shifted to supporting longer, more structured essays and refining the interface to allow teachers to view more content on smaller screens, a recurring issue I noticed while watching user sessions. Now that the concept was proven, we were full steam ahead on expanding the feature to support as wide a variety of writing types as possible and making Grading Assistant more discoverable and usable.