Can AI tools assess coding assignments?


CAREER COLUMN | 13 May 2026

Yulu Hou and her partner experimented with using ChatGPT to automate marking of undergraduate assignments. Here's what they learnt.

By Yulu Hou

Yulu Hou is a PhD student in the Department of Educational Administration at Michigan State University in East Lansing.

Credit: Creative Images Lab/Getty

One evening, my partner Boyan Li sat at the kitchen table marking student submissions for a coding course he was teaching as part of his PhD at Harvard Medical School in Boston, Massachusetts. The assignment required students to implement a computational-biology algorithm on a given data set. Each submission demanded more than a quick check. He ran the code, examined the output and traced the logic line by line. Some submissions were clearly correct; others were clearly wrong. But many fell into a grey zone: they were partly right, but uneven in their execution or reasoning. These were the hardest to assess, and the most time-consuming.

As a higher-education researcher, I watched this process with professional interest. What seemed to be a purely technical task — running code and checking outputs — was revealed to be deeply interpretative. Assessing coding assignments involves deciding what counts as understanding, what counts as error and how much variation is acceptable. This resonated with my own research on student learning and development, which views educational activities as inherently relational: even something as seemingly mechanical as marking becomes a dialogue between the examiner and the learner.

Seeing this interplay of technical skill and human judgement led me to ask: can generative artificial intelligence (genAI) assist in assessing without erasing the interpretative work that makes it meaningful?

Experimenting with AI

Coding assignments seem to be especially well-suited to AI tools.
Unlike essays, computer code follows clear structures and strict rules, making it easier to evaluate. My partner tested this idea using OpenAI's ChatGPT 5.4. He gave it the assignment prompt alongside the reference solution and asked it to assess a student's code for accuracy. In practice, ChatGPT mainly compared the student's code with the reference solution and struggled to recognize valid alternative approaches. It often focused on minor issues — such as lower computational efficiency — rather than evaluating whether the student understood the underlying algorithm, which was the main learning objective.

Observing my partner's frustration, I realized that ChatGPT was missing important context. I suggested that he provide information about common student mistakes and clarify which minor issues could be ignored.

His existing workflow proved especially helpful here: before marking, he writes his own code and then looks at the instructor's reference solution. This helps him to anticipate what students might struggle with, which is often the same material that he initially made mistakes on himself. Patterns also emerged during meetings with students, who often came to him with similar questions; some brought AI-generated answers that they did not fully understand. These recurring points of confusion revealed key bottlenecks in the process of correctly implementing the whole algorithm — insights that would have been difficult to identify from the reference solution alone.

Integrating these insights improved the AI tool's usefulness. It could suggest further test cases, probing whether a student's solution passed the marking-rubric checkpoints but failed on 'edge cases' — in which, for instance, an algorithm might be given extreme (but valid) input values. For one assignment, students implemented an algorithm to align a genome sequence.
One student submitted lengthy, hard-to-read code that passed all three rubric checkpoints. ChatGPT, however, identified a flaw in the program's logic and, after extended reasoning, proposed an edge case in which it would yield incorrect results. Without AI, this mistake might have gone unnoticed or required hours of manual inspection.

At the same time, ChatGPT had clear limitations. It sometimes treated any deviation from the reference solution as an error, even when the student's approach was valid. It produced confident explanations that did not hold up under closer inspection. And, unless explicitly instructed, it did not reliably check whether the code actually ran. Fully automated assessing — supplying student code and receiving a final mark — remained impractical.

Drawing on her experience as a higher-education researcher, Yulu Hou helped her partner to experiment with automated marking of undergraduate coding assignments. Credit: Hima Rawal

What we learnt

These early experiments showed that using AI effectively is less about creating a fully automated marking system and more about how it is integrated into the existing process. ChatGPT works best as a teaching assistant, not as the final grader. Here's how to make the most of it.

Provide context. When structuring prompts for marking, I found it effective to proceed in stages: first, introducing the problem set and asking the model to work through it by itself; then providing one or more reference solutions; and finally, highlighting key steps, common errors and minor issues that should not be penalized.

Generate test cases. AI is particularly effective at identifying edge cases that existing checks might miss. These edge cases can then be incorporated into the marking rubric to guide more-thorough evaluation.

doi: https://doi.org/10.1038/d41586-026-01139-x

This is an article from the Nature Careers Community, a place for Nature readers to share their professional experiences and advice.
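The staged-prompting advice above can be sketched in code. This is a minimal, hypothetical illustration, assuming a chat interface that accepts a list of role-tagged messages; the function name and stage wording are my own, not the authors' actual prompts:

```python
# Sketch: assemble staged prompts for AI-assisted marking.
# The helper name and stage texts are illustrative assumptions,
# not the prompts described in the column.

def build_marking_messages(problem_set, reference_solutions,
                           common_errors, ignorable_issues, student_code):
    """Order the context in stages: problem first, then reference
    solutions, then grading guidance, and only then the submission."""
    messages = [
        # Stage 1: ask the model to work through the problem by itself.
        {"role": "user",
         "content": "Here is a coding assignment. Work through it yourself "
                    "before seeing any solutions:\n" + problem_set},
    ]
    # Stage 2: provide one or more reference solutions.
    for ref in reference_solutions:
        messages.append({"role": "user",
                         "content": "Reference solution:\n" + ref})
    # Stage 3: highlight common errors and issues that should not be penalized.
    messages.append({"role": "user",
                     "content": "Common student mistakes:\n"
                                + "\n".join(common_errors)
                                + "\nMinor issues that should NOT be penalized:\n"
                                + "\n".join(ignorable_issues)})
    # Finally, the student's code to assess.
    messages.append({"role": "user",
                     "content": "Assess whether this student's code shows "
                                "understanding of the algorithm:\n" + student_code})
    return messages
```

The resulting list could be passed to any chat-completion endpoint; the point is the ordering of context, which mirrors the stages the column describes.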
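To illustrate the kind of edge case such checks can surface, here is a toy sketch. It is not the course's actual assignment or rubric: a simple Needleman-Wunsch scorer stands in for the reference solution, and a hypothetical student version passes typical checkpoints but mishandles an empty sequence, an extreme-but-valid input of the sort described above:

```python
# Toy illustration (not the authors' actual assignment): compare a
# hypothetical student submission against a reference global-alignment
# scorer on rubric checkpoints versus suggested edge cases.

def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global-alignment score via dynamic programming."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]

def student_score(a, b, match=1, mismatch=-1, gap=-1):
    """Hypothetical submission: correct in general, but wrongly treats
    an empty sequence as a zero-cost alignment."""
    if not a or not b:
        return 0  # bug: should pay a gap penalty per unmatched base
    return nw_score(a, b, match, mismatch, gap)

def run_checks(cases):
    """Return the cases on which student and reference disagree."""
    return [(a, b) for a, b in cases if student_score(a, b) != nw_score(a, b)]

checkpoints = [("ACGT", "ACGT"), ("ACGT", "AGT"), ("GATTACA", "GCATGCU")]
edge_cases = [("", "ACGT"), ("A", ""), ("ACGT", "A")]

assert run_checks(checkpoints) == []   # passes every rubric checkpoint...
assert run_checks(edge_cases) != []    # ...but extreme valid inputs expose the flaw
```

Disagreements found this way can be folded back into the rubric as new checkpoints, which is the workflow the column recommends.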
Guest posts are encouraged.