Software developers using AI assistance scored 17 percentage points lower on a coding comprehension quiz despite completing tasks slightly faster, according to research published last month by Anthropic that challenges the assumption that productivity gains from artificial intelligence come without cost.

The randomized controlled trial involved 52 software engineers learning a new Python library with and without AI assistance. Developers using AI averaged 50% on knowledge quizzes covering the concepts they had just used, compared to 67% for those who coded manually - a gap equivalent to nearly two letter grades. The difference proved statistically significant even though AI users finished only marginally faster.

"The findings highlight that not all AI-reliance is the same," the researchers wrote in the study published January 29. "The way we interact with AI while trying to be efficient affects how much we learn."

The research arrives amid intense debate within the programming community about AI's impact on fundamental skills. Ruby on Rails creator David Heinemeier Hansson warned in July that AI coding tools may be eroding programming fundamentals, arguing that typing code plays an essential role in knowledge acquisition that cannot be replicated by passively consuming AI-generated solutions.

The productivity paradox

Participants assigned to the treatment group could access an AI assistant while completing two coding tasks using Trio, a Python library for asynchronous programming that handles concurrent execution and input-output processing. The library is less well-known than asyncio and involves new concepts like structured concurrency that extend beyond basic Python fluency. The control group worked without AI assistance. All participants then took a 14-question quiz covering debugging, code reading, and conceptual understanding.

The first task required writing a timer that prints every passing second while other functions run, introducing core concepts of nurseries, task management, and concurrent execution in Trio. The second task involved implementing a record retrieval function capable of handling missing record errors, introducing error handling and memory channels for storing results. These standalone tasks provided sufficient instructions and usage examples for completion without prior Trio knowledge.
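For readers unfamiliar with Trio, the sketch below shows roughly what a solution to the first task could look like. It is a minimal illustration of nurseries and structured concurrency, not code from the study; the helper names and durations are assumptions.

```python
import trio

async def timer(total_seconds):
    # Print each elapsed second without blocking the other task.
    for elapsed in range(1, total_seconds + 1):
        await trio.sleep(1)
        print(f"{elapsed} second(s) elapsed")

async def other_work():
    # Stand-in for the "other functions" that must keep running concurrently.
    await trio.sleep(3)
    print("other work finished")

async def main():
    # A nursery embodies structured concurrency: the `async with` block
    # does not exit until every child task started inside it has finished.
    async with trio.open_nursery() as nursery:
        nursery.start_soon(timer, 5)
        nursery.start_soon(other_work)

trio.run(main)
```

The key property is that the nursery waits for both children, so the timer's output interleaves with the other task's work rather than blocking it.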

The study found no statistically significant productivity improvement on average, despite the AI assistant being capable of generating complete solutions when prompted. The treatment group finished approximately two minutes faster than the control group - 23 minutes versus 25 minutes - but this difference did not reach statistical significance. The muted gain stemmed from how developers used the technology: some participants spent up to 11 minutes composing AI queries, roughly 30% of their total task time.

The AI assistant, built on GPT-4o, had access to participants' current code and could produce complete, correct solutions for both tasks when asked. Yet only about 20% of participants in the AI group fully delegated the coding work - the approach that captured the largest speed gains. The majority invested substantial time in query composition, conceptual questions, or explanation requests that reduced the speed advantage.

Manual analysis of screen recordings revealed six distinct interaction patterns that led to dramatically different outcomes. The researchers watched recordings of every participant in the AI condition, manually annotating timestamps for AI interactions, queries, errors encountered, and task completion events.

Three patterns - labeled "AI delegation," "progressive AI reliance," and "iterative AI debugging" - averaged quiz scores below 40%. These developers relied heavily on AI for code generation or debugging without investing cognitive effort in understanding the solutions.

The AI delegation group (four participants) wholly relied on AI to write code and complete tasks, finishing fastest at 19.5 minutes average but scoring just 39% on quizzes. They asked AI to generate code, pasted the results, and moved forward without seeking to understand the underlying concepts.

The progressive AI reliance group (four participants) started by asking one or two clarifying questions but eventually delegated all code writing to the AI assistant for the second task. They averaged 22 minutes completion time but scored only 35% on quizzes, largely due to not mastering concepts in the second task.

The iterative AI debugging group (four participants) relied on AI to debug or verify their code, asking more questions but depending on the assistant to solve problems rather than clarifying their own understanding. They were slowest among low-scoring groups at 31 minutes while scoring just 24% on average.

Three other patterns preserved learning outcomes despite AI usage. Developers who asked only conceptual questions scored 68% on average, while those who requested explanations alongside code generation averaged 65%. The highest-scoring pattern involved generating code first, then asking follow-up questions to build comprehension, which yielded 86% quiz performance.

The conceptual inquiry group (seven participants) asked only conceptual questions and relied on improved understanding to complete tasks. They encountered many errors but independently resolved them, averaging 22 minutes completion time - second fastest overall after AI delegation - while scoring 68% on quizzes.

The hybrid code-explanation group (three participants) composed queries requesting both code generation and explanations of the generated code. Reading and understanding these explanations consumed more time, averaging 24 minutes, but resulted in 65% quiz scores.

The generation-then-comprehension group (two participants) first generated code through AI, then asked follow-up questions to improve understanding after implementation. They averaged 24 minutes and scored highest at 86%, demonstrating that checking one's understanding after using AI for generation preserves learning outcomes.

"Importantly, using AI assistance didn't guarantee a lower score," the researchers noted. "The participants who showed stronger mastery used AI assistance not just to produce code but to build comprehension while doing so."

The debugging deficit

The largest score gap between groups appeared on debugging questions, revealing what may be the most significant long-term concern about AI-assisted development. Participants without AI assistance encountered three times as many errors during the coding task, including errors specific to the Trio library that mapped directly to quiz topics. This pattern suggests that encountering and independently resolving errors strengthened debugging skills - precisely the capability needed for supervising AI-generated code.

The control group averaged three errors per participant during the task, while the AI group averaged just one. The errors broke down into two categories: general Python syntax issues like NameError and AttributeError, which typically resulted from typos on variable or function names, and Trio-specific errors that required deeper conceptual understanding.

Among developers using AI, 12 completed both tasks without encountering any errors. Only two participants in the control group finished error-free, with the remaining 24 working through multiple error cycles that forced engagement with core concepts.

Developers in the control group encountered "RuntimeWarning" errors when coroutines weren't properly awaited and "TypeError" errors when Trio functions received coroutine objects instead of async functions. These library-specific errors forced understanding of key concepts about how the Trio library handles coroutines and the proper usage of await keywords - topics that appeared directly on the evaluation quiz.
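To make those failure modes concrete, the hedged snippet below reproduces both errors; fetch_record is a hypothetical stand-in rather than the study's actual starter code.

```python
import trio

async def fetch_record(record_id):
    # Hypothetical stand-in for the study's record-retrieval helper.
    await trio.sleep(0.1)
    return {"id": record_id}

async def main():
    # Mistake 1: calling an async function without `await` only creates a
    # coroutine object. Python later emits
    # "RuntimeWarning: coroutine 'fetch_record' was never awaited".
    fetch_record(1)

    async with trio.open_nursery() as nursery:
        # Mistake 2: passing a coroutine object where Trio expects the async
        # function itself. Trio raises a TypeError explaining that it wanted
        # an async function, not a coroutine.
        nursery.start_soon(fetch_record(2))      # broken
        # Correct form: pass the function and its arguments separately.
        # nursery.start_soon(fetch_record, 2)

trio.run(main)
```

Resolving either error requires understanding the distinction between an async function and the coroutine object it returns - exactly the conceptual ground the quiz tested.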

The most common errors overall were NameError and AttributeError from typos, which both groups encountered. However, the control group faced significantly more RuntimeWarning and TypeError errors requiring Trio-specific knowledge to resolve. The AI group rarely encountered these conceptual errors because the assistant generated syntactically correct code that properly implemented Trio patterns.

The pattern reveals a fundamental tension in AI-assisted learning. Errors that feel frustrating during development serve an important pedagogical function by forcing developers to understand why code fails and how to fix it. The AI group's smoother experience - fewer errors, faster completion - came at the cost of the deep learning that occurs when wrestling with conceptual mistakes.

This finding carries particular weight for software engineering organizations. As companies deploy more AI-generated code, human developers need strong debugging skills to catch errors and verify correctness. If junior engineers' debugging abilities are impaired by relying on AI during skill formation, they may lack the competency to provide effective oversight later.

The findings parallel concerns raised by experienced developers who maintain strict control over AI agents rather than delegating complete implementation. Research published January 5 found that 77 of 99 professional developers described themselves as "pleased" or "extremely pleased" when working with coding agents, yet they consistently maintained oversight of software design rather than adopting "vibe coding" approaches in which developers delegate complete authority to AI systems.

Implications for workplace skill development

The research focused on junior developers - the group most likely to rely on AI for productivity gains while still developing fundamental skills. The study design mimicked how engineers acquire new capabilities on the job by learning unfamiliar libraries through self-guided tutorials, a common pattern in professional software development where new tools and frameworks emerge continuously.

Participants represented typical early-career to mid-career software engineers. Most held bachelor's degrees and worked either as freelance or professional developers. They had between one and seven-plus years of coding experience, used Python at least weekly, and had tried AI coding assistance at least a few times previously. None had used the Trio library before, ensuring the task genuinely represented new skill acquisition.

The demographic composition matters because junior developers face unique pressures in modern software organizations. They must demonstrate productivity to justify their positions while simultaneously building the expertise needed for career advancement. AI tools offer an appealing solution to this tension by enabling task completion without requiring full understanding - but the research suggests this shortcut carries hidden costs.

"Given time constraints and organizational pressures, junior developers or other professionals may rely on AI to complete tasks as fast as possible at the cost of skill development," the researchers warned. "Notably the ability to debug issues when something goes wrong."

The implications extend beyond individual programmers to broader questions about professional development in AI-augmented workplaces. Companies transitioning to higher ratios of AI-written code face a potential competency gap if junior engineers' skill formation gets impaired by the very tools designed to enhance productivity. The question becomes particularly acute in safety-critical domains where human oversight of automated systems remains essential.

Organizations measure developer productivity through metrics like pull requests completed, code commits, and features shipped. These quantitative measures capture output but not the underlying skill development that enables long-term value creation. A junior developer who ships code quickly using AI assistance may appear productive in the short term while actually falling behind peers who invested more time building fundamental competencies.

The research suggests this creates a principal-agent problem where individual incentives diverge from organizational interests. Developers face pressure to complete tasks quickly, encouraging AI delegation that maximizes immediate output. Organizations need developers who can supervise AI systems, debug complex issues, and make sound architectural decisions - capabilities that require the deep knowledge threatened by excessive AI reliance.

Anthropic researcher Alex Tamkin emphasized that productivity benefits may come at the cost of skills necessary to validate AI-written code. "Managers should think intentionally about how to deploy AI tools at scale, and consider systems or intentional design choices that ensure engineers continue to learn as they work," the researchers advised.

The findings also raise questions about how organizations should structure mentorship and code review processes. Traditional approaches assume junior developers write code that senior engineers review, providing feedback that builds skills over time. AI assistance disrupts this model by enabling juniors to produce code without understanding it, potentially reducing the learning opportunities that code review historically provided.

Some organizations have begun implementing policies around AI tool usage. These range from complete prohibition during onboarding periods to requirements that developers manually type AI-generated code rather than copying and pasting it. The research provides empirical support for such interventions while highlighting that blanket policies may be unnecessary - the key distinction lies in how developers use AI rather than whether they use it at all.

The study suggests several paths forward. Major AI services have introduced learning modes - ChatGPT offers Study Mode while Claude Code provides Learning and Explanatory modes - designed to foster understanding rather than pure productivity. These features reflect recognition that cognitive effort remains important for mastery.

Methodological approach

The randomized experiment split 52 professional developers into treatment and control groups, balancing participants across coding experience, Python proficiency, and familiarity with asynchronous programming. Participants had used Python at least weekly for over a year and had tried AI coding assistance previously, but none had experience with the Trio library.

The coding tasks required implementing concurrent execution and error handling - skills often learned in professional settings rather than academic environments. Participants worked through problem descriptions, starter code, and brief concept explanations, mimicking real-world learning scenarios.

Anthropic conducted four pilot studies before the main experiment to address compliance issues and refine the evaluation design. Early pilots revealed that 35% of participants in the no-AI condition used AI assistance despite instructions prohibiting it. Subsequent pilots employed screen recording and stricter protocols to ensure compliance.

The evaluation covered seven core Trio concepts through 14 questions worth 27 points total. Question types included debugging (identifying errors), code reading (comprehending functionality), and conceptual understanding (grasping core principles). The researchers deliberately excluded code writing questions to reduce the impact of syntax errors.

Broader context

The findings arrive during a period of rapid AI adoption in software development. A Google engineer revealed in January that Claude Code replicated in one hour a distributed systems project that took Google's internal teams a full year to build. Such dramatic productivity claims have accelerated interest in AI coding tools.

However, the Anthropic research suggests that speed advantages may not translate to skill development, particularly for unfamiliar tasks. Previous studies found mixed results on AI coding productivity. Some research showed developers completing tasks 55% faster with AI assistance, while other work found no significant speedup on certain task types.

The distinction appears to hinge on whether developers already possess relevant skills. Anthropic's earlier observational research found AI can reduce task completion time by 80% for work where participants already had expertise. The new controlled trial examines what happens when people learn something new - a different question with potentially different answers.

"It is possible that AI both accelerates productivity on well-developed skills and hinders the acquisition of new ones," the researchers wrote, noting that more research is needed to understand this relationship.

For marketing professionals managing technical teams, the research offers practical guidance. Organizations should consider how AI tool deployment affects junior developer training, whether productivity metrics account for long-term skill development, and what systems might preserve learning while capturing automation benefits.

The study acknowledges limitations including a relatively small sample size, assessment immediately after the coding task rather than longitudinal measurement, and focus on a single programming library. The researchers call for future work examining AI's effects on non-coding tasks, whether impacts dissipate as engineers develop greater fluency, and how AI assistance differs from human mentorship during learning.

Industry response

The software development community has responded with intense debate about balancing short-term productivity against long-term competency. Some developers argue that AI represents an inevitable shift requiring different skill sets focused on architecture and oversight rather than line-by-line coding.

Others share concerns similar to those documented in the study. One developer commented: "I let the LLM suggest code, but then I retype it & then actually learn something" - a middle-ground approach attempting to preserve active learning while leveraging AI capabilities.

The research comes as Anthropic expands Claude Code's automation capabilities to non-programmers through Cowork, released January 12. The company has positioned itself as particularly focused on AI safety, with features like Learning mode specifically designed to support skill development rather than pure productivity.

The question of how workplaces integrate AI tools while preserving human expertise remains unresolved. The Anthropic research provides evidence that choices matter - not just whether to use AI, but how to use it, with what oversight, and for what purposes.

"Cognitive effort - and even getting painfully stuck - is likely important for fostering mastery," the researchers concluded. "This is also a lesson that applies to how individuals choose to work with AI, and which tools they use."

Summary

Who: Anthropic researchers Judy Hanwen Shen and Alex Tamkin conducted the study involving 52 professional software developers with varying experience levels.

What: A randomized controlled trial measuring how AI coding assistance affects skill development when learning a new Python library, finding quiz scores 17 percentage points lower among AI users despite marginally faster task completion.

When: The research was published January 29, 2026, following four pilot studies and pre-registration of the experimental protocol.

Where: The study examined developers learning the Trio asynchronous programming library through an online coding platform with optional AI assistant access.

Why: The research addresses fundamental questions about whether productivity gains from AI assistance come at the cost of skill development, particularly for junior developers still acquiring core competencies needed to supervise AI-generated code in high-stakes environments.
