Ten Months with Copilot Coding Agent in dotnet/runtime

When GitHub’s Copilot Coding Agent (CCA) first became available in May 2025, the promise of cloud-based AI coding agents was generating enormous excitement, and equal skepticism. So we began an experiment, and like many experiments in software engineering, it started with a simple question: could a cloud-based AI coding agent meaningfully contribute to one of the most complex, most scrutinized, and most critical open source codebases in the world?

The dotnet/runtime repository is not a typical codebase. It’s the beating heart of .NET, containing the .NET runtime (including the garbage collector, the JIT compiler, the AOT toolchain) and the hundreds of libraries that form the core of .NET. It spans millions of lines of code across C#, C++, assembly, and a smattering of other languages. It runs on Windows, Linux, macOS, iOS, Android, and WebAssembly. It powers everything from trillion-dollar financial systems to games on your phone, and it’s the lifeblood of the vast majority of Microsoft’s own services. .NET has 7+ million monthly active developers, has had hundreds of thousands of pull requests merged into it from thousands of contributors (across all its constituent repos, with dotnet/runtime at the base), and is consistently in the top 5 of the CNCF highest velocity open source projects on GitHub. If something breaks in dotnet/runtime, millions of developers and customers feel it, viscerally. At the same time, we’re constantly trying to evolve, optimize, and expand the capabilities of .NET, with ever increasing pressures on our core maintainers and ever increasing responsibilities competing for priority. .NET ships annually, with servicing monthly, and developers expect every release to be high quality, chock full of performance improvements and new features to power their businesses.

So with dotnet/runtime, the question wasn’t if we should try CCA, but how we could use it responsibly in a codebase where mistakes can have outsized consequences.
“Responsibly” is critical. The .NET team has full ownership of everything we ship. We’re not “handing over” development of .NET to AI; rather, we’re experienced engineers adding a new tool to our workflow. Our standards have not and will not change. Rigor, correctness, and fundamentals are at the heart of everything we do. When we use AI to write code, it’s in service of that goal.

Ten months later, with 878 CCA pull requests in dotnet/runtime alone (535 merged, representing over 95,000 lines of code added and 31,000 lines removed), we have enough data to share what we’ve learned. This is not a cautionary tale of AI gone wrong, nor is it a triumphant declaration of AI supremacy. It’s a practical account of human-AI collaboration, of learning what works and what doesn’t, and of the iterative process of teaching AI to contribute to a codebase that took decades of human expertise to build. Let’s start with the overall data.

The Numbers at a Glance

This post focuses primarily on dotnet/runtime, which has been a main proving ground for CCA experimentation within .NET. As arguably the most complex and demanding codebase in the .NET ecosystem, it provides a rigorous test for AI-assisted development. The numbers below reflect our experience there, from May 19, 2025 (when CCA launched) through March 22, 2026 (when I gathered the list of PRs for this post):

| Category | PRs | Merged | Closed | Open | Success Rate |
|---|---|---|---|---|---|
| Human (Microsoft) | 3,082 (50%) | 2,556 | 377 | 149 | 87.1% |
| Human (Community) | 1,411 (23%) | 1,029 | 262 | 120 | 79.7% |
| CCA | 878 (14%) | 535 | 253 | 90 | 67.9% |
| Bot (e.g. dependabot) | 810 (13%) | 666 | 109 | 35 | 85.9% |
| Total | 6,181 | 4,786 | 1,001 | 394 | 82.7% |

Every CCA PR was created at the explicit request of a human with maintainer rights to the repository; CCA cannot open PRs on its own. dotnet/runtime is an open-source project with significant community contributions, and the table above reflects the full PR population.
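The “Success Rate” column is merged / (merged + closed), with open PRs excluded from the denominator. A minimal sketch that reproduces the column, with the counts hard-coded from the table for illustration:

```python
# Success rate as used in this post: merged / (merged + closed).
# Open PRs are excluded from the denominator.
def success_rate(merged: int, closed: int) -> float:
    return merged / (merged + closed)

# Counts from the dotnet/runtime table.
categories = {
    "Human (Microsoft)": (2556, 377),
    "Human (Community)": (1029, 262),
    "CCA": (535, 253),
    "Bot (e.g. dependabot)": (666, 109),
}

for name, (merged, closed) in categories.items():
    print(f"{name}: {success_rate(merged, closed):.1%}")
# e.g. CCA: 535 / (535 + 253) = 67.9%
```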
“Microsoft” vs. “Community” is determined by users listing Microsoft/MSFT in their public GitHub profile, which likely undercounts Microsoft contributors slightly. “Bot” includes automated PRs from dependency update bots like dotnet-maestro and dependabot. “Success rate” is calculated as merged / (merged + closed), excluding PRs that are still open.

CCA PRs represented 22.2% of all Microsoft-originated PRs by volume (878 out of 3,960 Microsoft + CCA PRs). This measures PR count, not engineering effort; CCA PRs tend to be more bounded in scope than the average human PR, so this overstates CCA’s share of total work. CCA’s 67.9% success rate is also lower than the 87.1% for Microsoft human PRs. However, those comparisons miss crucial context we’ll explore throughout this post: CCA PRs and human PRs are fundamentally different populations facing different selection pressures.

NOTE: This post is grounded in concrete data drawn directly from GitHub (collected with the help of Copilot CLI): PR counts, merge rates, commit histories, review comments, timing, etc. However, these metrics are subject to inherent biases and limitations. CCA PRs are not randomly sampled; they reflect deliberate choices about which tasks to assign to an AI agent, and those choices evolved over the ten months covered here. Success rates depend on what we chose to attempt, not just on how well CCA performed. Comparisons between CCA and human PRs are between fundamentally different populations: humans self-select complex, judgment-heavy work, while CCA is assigned more bounded tasks. We also note that 90 CCA PRs remain open and are excluded from success rate calculations. The data should be taken directionally, as strong evidence of trends and patterns, rather than as precise benchmarks of AI capability. Where sample sizes are small or methodological caveats apply, we note them.

One quality signal we can measure directly is the revert rate: how often a merged PR is subsequently reverted.
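Reverts were identified by searching merged PR titles for “revert” with a reference to another PR. A minimal sketch of that title-based heuristic (the PR records here are hypothetical; the real analysis ran over GitHub’s PR data):

```python
import re

# Heuristic: a merged PR whose title starts with "Revert" and references
# another PR number (e.g. 'Revert "..." (#123)') counts as a revert.
# Reverts that reference only a title, or that were done through other
# means, are missed, so this can undercount.
REVERT_RE = re.compile(r"^\s*revert\b.*#(\d+)", re.IGNORECASE)

def find_reverts(merged_prs):
    """Yield (reverting_pr_number, reverted_pr_number) pairs."""
    for pr in merged_prs:
        match = REVERT_RE.search(pr["title"])
        if match:
            yield pr["number"], int(match.group(1))

# Hypothetical sample data for illustration.
prs = [
    {"number": 101, "title": "Optimize HashSet.UnionWith for empty sets"},
    {"number": 102, "title": 'Revert "Optimize HashSet.UnionWith for empty sets" (#101)'},
]
print(list(find_reverts(prs)))  # [(102, 101)]
```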
Of the 535 merged CCA PRs, 3 were reverted (0.6%). For comparison, 33 of 4,251 non-CCA merged PRs were reverted (0.8%) during the same period. The sample sizes are small enough that the difference is not statistically meaningful, but at minimum, the data shows no red flags. (This analysis identifies reverts by searching for merged PRs whose title contains “revert” and references another PR number or title; it may slightly undercount if reverts were accomplished through other means.)

This post focuses primarily on PR throughput metrics (volume, merge rate, time-to-merge, and review/iteration patterns). It does not attempt to comprehensively quantify all downstream quality outcomes, though the revert data above provides one concrete quality signal. It also does not analyze the compute cost of CCA usage or the CI resources consumed by CCA PRs (including failed runs that required iteration). These are real costs, and organizations adopting cloud AI agents like CCA should factor them into their evaluation alongside the productivity benefits described here.

Beyond CCA, we’ve also enabled Copilot Code Review (CCR). Every pull request, whether authored by a member of the .NET team, by CCA, or by an external contributor, automatically receives AI-powered code review feedback. This means every PR that flows through the repository involves AI, not just the ones AI authored. The combination of AI authoring and AI reviewing has become a default workflow. This use of CCA is also separate from any AI employed locally by developers, on the .NET team or otherwise, as part of creating PRs; anecdotally, from conversations with my teammates, it’s now a minority of PRs that don’t involve significant AI as part of design, investigation, and/or coding. As an example, as I write this paragraph, this is my terminal:

(The first tab is Copilot reviewing this post, looking for various things I asked it to update.
The second is rewriting a math component in the .NET core libraries to optimize performance. The third is investigating and fixing a failure in a PR to the modelcontextprotocol repo (one that CCA contributed). The fourth is tweaking a terminal UI. And the fifth is updating an IChatClient implementation with new features available in the latest Microsoft.Extensions.AI.Abstractions build.)

Comparison Repositories

To provide context and contrast, I’ve also gathered data from several of our other .NET repositories (ones maintained by the .NET team) where CCA has been used. These aren’t the only places we’ve deployed CCA (it’s been used in another 60 dotnet/* repos), but they offer useful comparisons with dotnet/runtime due to their varying characteristics. These numbers are for the exact same time period, from May 19, 2025 through March 22, 2026:

| Repository | Total PRs | CCA PRs | Merged CCA | Closed CCA | Open CCA | CCA Success Rate | Why It’s Interesting for Comparison |
|---|---|---|---|---|---|---|---|
| microsoft/aspire | 3,527 | 1,130 | 717 | 390 | 23 | 64.8% | Greenfield cloud-native stack created in fall 2023; the team has been an early and aggressive adopter of CCA. |
| dotnet/roslyn | 2,784 | 263 | 165 | 56 | 42 | 74.7% | The C# and Visual Basic compilers and language services; deep domain complexity with decades of history, comparable in maturity to runtime. |
| dotnet/aspnetcore | 1,804 | 254 | 151 | 61 | 42 | 71.2% | Large, mature codebase with web-focused domain; tests whether CCA handles framework-level complexity. |
| dotnet/efcore | 1,136 | 129 | 83 | 37 | 9 | 69.2% | Highly specialized query translation domain; tests CCA’s ability to handle deep domain expertise requirements. |
| dotnet/extensions | 569 | 127 | 98 | 25 | 4 | 79.7% | Contains our AI-focused libraries (Microsoft.Extensions.AI); newer codebase with modern patterns. |
| modelcontextprotocol/csharp-sdk | 528 | 182 | 136 | 40 | 6 | 77.3% | Greenfield project started in spring 2025; shows CCA performance without legacy burden or historical conventions. |

These comparison repositories help us understand which factors influence CCA success: codebase age, domain complexity, architectural patterns, and the presence
of legacy constraints.

Across all seven repositories, we’ve seen 2,963 CCA pull requests with 1,885 merged (68.6% success rate), representing ~392,000 lines added and ~121,000 lines deleted, for a net contribution of roughly 271,000 lines of code.

These repos vary in size and complexity. Enumerating tracked source files (using git ls-files), classifying text vs. binary, and counting lines and tokens (with the tiktoken o200k_base tokenizer) yields these numbers to help calibrate:

| Repository | Files | Text Files | Total Text Size | Total Lines | Total Tokens |
|---|---|---|---|---|---|
| dotnet/runtime | 57,194 | 57,005 | ~620.0 MB | 14,159,378 | 187,393,758 |
| dotnet/roslyn | 20,620 | 20,317 | ~375.1 MB | 9,508,662 | 85,964,140 |
| dotnet/aspnetcore | 16,661 | 16,530 | ~162.3 MB | 2,664,610 | 49,002,522 |
| microsoft/aspire | 8,802 | 8,706 | ~108.2 MB | 2,331,915 | 25,958,731 |
| dotnet/efcore | 6,114 | 6,108 | ~81.8 MB | 1,814,334 | 18,978,960 |
| dotnet/extensions | 3,565 | 3,548 | ~23.0 MB | 543,383 | 5,879,188 |
| modelcontextprotocol/csharp-sdk | 735 | 732 | ~4.1 MB | 105,587 | 887,356 |

Change Size Distribution

How big are these CCA PRs? The distribution in dotnet/runtime shows CCA handling changes across a wide range of sizes:

| Lines Changed | PRs | % |
|---|---|---|
| 0 | 59 | 6.7% |
| 1-10 | 170 | 19.4% |
| 11-50 | 222 | 25.3% |
| 51-100 | 120 | 13.7% |
| 101-500 | 215 | 24.5% |
| 501-1,000 | 51 | 5.8% |
| 1,001-5,000 | 32 | 3.6% |
| 5,000+ | 9 | 1.0% |

The 59 zero-line PRs (6.7%) are cases where CCA was invoked but didn’t produce a change: typically either tasks abandoned before the agent began coding (where the developer changed their mind for some reason) or tasks where CCA ended up not making any changes because the issue was already fully addressed. Of the remaining PRs, nearly half (48%) are under 50 lines, such as targeted bug fixes. But over a quarter (26%) are in the 101-500 line range, representing substantial features or refactorings.
The handful of 5,000+ line PRs include major refactorings or upgrades, like updating support from Unicode 16.0 to Unicode 17.0 (which entailed CCA writing and merging tools to automate the update, programmatically updating the relevant data files, discovering and running code generators, and making more tactical updates such as updating regex’s recognized named character ranges) and comprehensive test coverage additions.

Interestingly, size correlates with success in a non-obvious way:

| Lines Changed | Decided PRs | Success Rate |
|---|---|---|
| 0 | 56 | 0.0% |
| 1-10 | 160 | 80.0% |
| 11-50 | 195 | 76.9% |
| 51-100 | 104 | 75.0% |
| 101-500 | 197 | 64.0% |
| 501-1,000 | 44 | 68.2% |
| 1,001+ | 32 | 71.9% |

The sweet spot is 1-50 lines at 76-80%. Success drops in the 101-500 range (64%), where tasks are large enough to involve multiple interacting components but not so large that they tend to be well-scoped refactorings. The largest PRs (1,001+ lines) rebound to 72%, likely because these tend to be carefully scoped mechanical tasks (like code generation updates or comprehensive test additions) that play to CCA’s strengths. The lesson is that task scope matters more than size: a well-scoped task that produces 50 lines succeeds more reliably than a vague one that produces 200.

Where do all those lines go? Analyzing the file-level data for all 535 merged CCA PRs, 65.7% of lines added were test code (files in test directories), 29.6% were production code, and 4.7% were other files (documentation, project files, etc.). For comparison, a random sample of 500 merged human PRs shows a similar but less pronounced pattern: 49.9% test code, 38.5% production, 11.6% other. The high test percentage isn’t unique to CCA; it reflects the reality that in today’s production codebases, changes generally require as much or more test code than production code. That said, CCA’s 66% test ratio is notably higher than humans’ 50%, consistent with the fact that pure testing tasks are among CCA’s strongest categories and that CCA is disproportionately assigned test-writing work.
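The test/production/other split comes from classifying each changed file by its path. A simplified sketch of that kind of classification (the rules here are illustrative approximations, not the exact ones used for the analysis):

```python
def classify(path: str) -> str:
    """Bucket a changed file as test, production, or other based on its path."""
    lowered = path.lower()
    if "/tests/" in lowered or "/test/" in lowered:
        return "test"
    # Docs and project files count as "other" (illustrative extension list).
    if lowered.endswith((".md", ".csproj", ".props", ".targets")):
        return "other"
    return "production"

# Hypothetical per-file added-line counts from a merged PR, for illustration.
changes = [
    ("src/libraries/System.Text.Json/src/JsonObject.cs", 40),
    ("src/libraries/System.Text.Json/tests/JsonObjectTests.cs", 90),
    ("docs/design/notes.md", 10),
]

totals = {}
for path, added in changes:
    bucket = classify(path)
    totals[bucket] = totals.get(bucket, 0) + added
print(totals)  # {'production': 40, 'test': 90, 'other': 10}
```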
56% of all merged CCA PRs touched at least one test file, compared to 38% of human PRs.

The Trajectory

What’s most encouraging is how our success rate in dotnet/runtime has climbed over time:

| Month | PRs | Success Rate | Cumulative Success |
|---|---|---|---|
| May-25 | 24 | 41.7% | 41.7% |
| Jun-25 | 13 | 69.2% | 51.4% |
| Jul-25 | 23 | 69.6% | 58.3% |
| Aug-25 | 36 | 60.0% | 58.9% |
| Sep-25 | 16 | 62.5% | 59.5% |
| Oct-25 | 86 | 58.8% | 59.2% |
| Nov-25 | 75 | 69.0% | 61.8% |
| Dec-25 | 87 | 72.4% | 64.4% |
| Jan-26 | 182 | 71.2% | 66.6% |
| Feb-26 | 196 | 69.7% | 67.4% |
| Mar-26* | 140 | 72.1% | 67.9% |

From 41.7% in our first month to holding steady at ~71% across the most recent quarter. (Early months have small sample sizes, e.g. June 2025 had only 13 PRs, so individual monthly rates should be taken directionally; the cumulative column provides a more stable signal. March 2026 only covers through the 22nd, and 54 of those PRs are still open, so that rate is preliminary.) That’s a story of learning: learning what tasks to assign, learning how to write instructions, learning how to iterate with an AI pair programmer. To understand what that learning curve actually looks like in practice, it helps to zoom in on a few specific experiments along the way.

Tackling the Backlog (and Beyond)

One of CCA’s most tangible impacts is accelerating work that was waiting for someone with time to address it. Of the 464 CCA PRs in dotnet/runtime that link to a source issue, the age of those issues at the time CCA was assigned tells a striking story:

| Issue Age | PRs | % |
|---|---|---|
| Same day | 180 | 38.8% |
| 1-7 days | 42 | 9.1% |
| 1-4 weeks | 31 | 6.7% |
| 1-3 months | 35 | 7.5% |
| 3-12 months | 53 | 11.4% |
| 1-2 years | 30 | 6.5% |
| 2+ years | 93 | 20.0% |

The median issue age is just 11 days, but the average is 382 days (12.6 months), pulled up by a long tail of old issues. 20% of the issues CCA tackled were over two years old, with some dating back as far as 9 years, predating the creation of the dotnet/runtime repository itself (they were migrated from the old coreclr/corefx repos). These are issues that area owners agreed should be fixed, but that never rose to the top of anyone’s priority list.
CCA doesn’t have competing priorities; it just needs to be pointed at a problem.

The 39% same-day bucket has two sources: sometimes we create well-scoped issues specifically designed for CCA, filing and assigning within minutes; other times, we triage incoming community-filed issues directly to CCA on the day they arrive. 31% of same-day issues were filed by external contributors, with a median triage-to-CCA time of just 2.7 hours. In the fastest case, a community-reported source generator bug was assigned to CCA less than two minutes after being filed. CCA is becoming part of how we respond to incoming reports, not just a tool for pre-planned work. The issues in the 2+ year bucket, by contrast, represent genuine backlog acceleration: work that likely would not have been done for months or years otherwise.

The Birthday Party Experiment

An inflection point in our CCA journey came on a Saturday in October. I was at a birthday party with one of my kids, and while the youngins were off playing, I found myself scrolling through our backlog of dotnet/runtime issues on my phone. Many were tagged “help wanted”: issues that were well understood but waiting for someone with time to tackle them.

I started assigning issues to Copilot. Not randomly, but thoughtfully: skimming each issue, assessing whether the problem was clear enough, whether the fix was likely within what I understood CCA’s capabilities to be (with Claude Sonnet 4.0 or 4.5 at the time), whether the scope was manageable. Over the course of an hour, I assigned more than 20 issues covering a range of areas.

By evening, I had reviewed 22 pull requests, most of which were reasonable attempts at solving real problems. Some were excellent. Some needed iteration.
A few revealed that we shouldn’t make the change at all, which is itself a valuable outcome.

Let me walk through a few representative examples from that day:

The Thread Safety Fix (PR #120619)

The issue described a thread safety problem in System.Text.Json’s JsonObject.InitializeDictionary that was causing intermittent GetPath failures. I had a hypothesis about the problem and roughly what the fix should look like, so I included that in my prompt when assigning the issue.

Copilot validated my hypothesis, implemented the fix, and added a regression test. The entire review took about 10 minutes, mostly spent on two minor cleanup suggestions. The fix was correct, the test was appropriate, and it merged cleanly.

This is CCA at its best: a well-defined bug with a clear fix that just needs someone (or something) to do the mechanical work of implementing and testing it.

The Intentionally Closed PR (PR #120638)

Sometimes the most valuable outcome of a PR is deciding not to make the change. I assigned an issue about regex quantifiers on anchors (like ^* or $+), which proposed that this syntax, currently parsed successfully, should instead be rejected as invalid.

Copilot dutifully implemented the change, modifying the parser and, importantly, updating tests. It did a good job, finding exactly the right places to make the changes and making clean, minimal edits to the product source. But reviewing those test changes revealed something important: our existing behavior was intentional. The tests Copilot had to modify to make its change pass weren’t documenting bugs; they were documenting deliberate design decisions.

I closed the PR and the corresponding issue. Was this a “failure”? No. Copilot had essentially done the investigative work of determining that the issue shouldn’t be fixed. That’s worth something: several hours of my time had I done it manually.

The Debugging Win (PR #120622)

This one surprised me.
The issue involved our NonBacktracking regex engine and empty capture groups with newlines: a subtle bug in a part of the codebase I’m less familiar with. Debugging this manually would have meant stepping through unfamiliar code, understanding the state machine, identifying where the logic diverged. Hours, maybe a day.

Instead, Copilot found the problem and submitted a one-line fix plus tests. Five minutes to review. The fix was trivial, and obvious in hindsight once the source of the problem was found and documented, but finding it was the hard part, and CCA handled that entirely.

The Struggle with BCrypt (PR #120633)

Not everything went smoothly. An issue about using Windows BCrypt for our internal Sha1ForNonSecretPurposes function turned into a 20+ commit odyssey. Copilot’s initial approach produced a mess of #if conditionals. It took multiple rounds of feedback to get it to split the code into platform-specific files, exacerbated by its inability at the time to test the changes on Windows. And it resulted in a native binary size increase for NativeAOT that required more detailed investigation.

This PR was eventually closed. But it taught us something important: CCA struggles with problems that require architectural judgment. Choosing the right API shape based on real-world usage patterns, anticipating ripple effects across platforms and build configurations, understanding the downstream implications of design decisions like interop patterns on binary size: these are skills that come from deep familiarity with a codebase’s conventions and history.

By the end of that Saturday, I had a rough mental model: CCA is excellent at implementing well-specified changes, very good at investigating issues, and relatively poor at architecting solutions, especially in large codebases that require broad understanding.
That model has held up well over the following months.

The Redmond Flight Experiment

If the Saturday birthday party demonstrated CCA’s potential for tackling a wide variety of problems, the Redmond flight experiment a few months later demonstrated something different: the sheer throughput that becomes possible when you can work from anywhere with nothing but a phone, and the ramifications of that.

On January 6th, 2026, I boarded a cross-country flight to Redmond, WA. No laptop (or, rather, no ability to charge my power-hungry laptop), just my phone and a movie to watch. But between scenes (and perhaps during a few slow stretches of plot), I found myself scrolling through our issue backlog, assigning issues to Copilot, and kicking off PRs, as well as thinking through some desired performance optimizations and refactorings and submitting tasks via the agent pane.

By the time I landed, I’d opened nine pull requests spanning bug fixes, test coverage increases, performance optimizations, and even experimental data structure work. From a phone. At 35,000 feet. Before CCA, opening nine meaningful PRs during travel wouldn’t have been realistic. Even with a laptop, implementing code changes, writing tests, and running local verification, all from an airplane seat, would have been a stretch. From a phone? Not possible.

But with CCA, my job was different. I wasn’t writing code. I was identifying work, scoping problems, and giving direction. The agent did the implementation.
Here’s what emerged:

| PR | Title | Status | Scope |
|---|---|---|---|
| #122944 | Fix integer overflow in BufferedStream for large buffers | Merged | Bug fix, +25/-1 |
| #122945 | Add regression tests for TarReader after DataStream disposal | Merged | +226 lines of tests |
| #122947 | Optimize Directory.GetFiles by passing safe patterns to NtQueryDirectoryFile | Merged | Performance optimization, +109/-20 |
| #122950 | Support constructors with byref parameters (in/ref/out) in System.Text.Json | Merged | Feature / bug fix, +8,359/-39 |
| #122951 | Fix baggage encoding by using EscapeDataString instead of WebUtility.UrlEncode | Closed | Encoding fix, +66/-8 |
| #122952 | Optimize HashSet.UnionWith to copy data from another HashSet when empty | Merged | Performance, +52 |
| #122953 | Remove NET9_0_OR_GREATER and NET10_0_OR_GREATER preprocessor constants | Merged | 112 files, +316/-2,404 |
| #122956 | Partial Chase-Lev work-stealing deque implementation for ConcurrentBag | Closed | Experimental, +155/-267 |
| #122959 | Port alternation switch optimization from source generator to RegexCompiler | Merged | Performance optimization, complex IL emit, +306/-39 |

Consider PR #122953: removing obsolete preprocessor constants. The repo no longer builds for .NET Core versions older than .NET 10 (but still builds some packages that multitarget to .NET Framework and .NET Standard), so constants like NET9_0_OR_GREATER and NET10_0_OR_GREATER are now always true and, thus, unnecessary clutter. This is a straightforward cleanup, but it touched 112 files across System.Private.CoreLib, Microsoft.Extensions.*, System.Collections.Immutable, System.Text.Json, cryptography libraries, and dozens of test files. It’s also more than just a search-and-replace, as now-unreachable code should be deleted. From my phone, I typed a prompt explaining the goal and the strategy (e.g. replace with #if NET for multi-targeted files, and remove conditionals entirely for .NET Core-only files). CCA analyzed the codebase, applied the correct transformation to each file based on its context, and produced a PR. Could I have done this manually?
Sure, in a couple of hours of tedious searching and manual deletion with a laptop. From my phone? Not a chance.

Or consider PR #122959: porting an optimization from the regex source generator to the regex compiler. This involved understanding how the C# compiler lowers switch statements to IL, applying the same heuristic with System.Reflection.Emit, and handling edge cases around atomic alternations. The PR adds 306 lines of complicated IL opcode emission. CCA wrote it; I reviewed it from the ground after landing.

PR #122947 didn’t even start with an issue assignment. I had a conversation with Copilot in “ask mode” on GitHub, sharing the problem statement and going back and forth brainstorming possible optimization approaches. When we converged on a viable approach, I asked it to implement. The PR emerged from that collaborative design session, conducted entirely via chat on my phone while waiting to take off.

Seven of the nine PRs merged and two were closed: one because review determined the change wasn’t the right approach and that the underlying issue should be closed as well, and one because the data structure being experimented with was incompatible with the scenario (an issue CCA itself highlighted in the PR description, warning against merging).

The practical upshot of this story? CCA changes where and when serious software engineering can happen. The constraint isn’t typing speed or screen real estate: it’s knowledge, judgment, and the ability to articulate what needs to be done. Waiting in an airport? Provide feedback on changes that should be made. Commuting on a train? Trigger a PR. The marginal cost of starting work drops significantly when “starting work” means typing or speaking a direction rather than switching contexts and setting up a development environment.

That highlights a dark side to this superpower, however. I opened nine PRs, some quite complicated, in the span of a few hours. Those PRs need review.
Detailed, careful review: the kind that takes at least 30 to 60 minutes per PR for changes of this complexity. That means I quite quickly created 5 to 9 hours of review work, spread across team members who have their own responsibilities and demands on their time. A week later, three of those PRs were still open. Not because they were bad, but in part because reviewers hadn’t gotten to them yet. And that was with me actively pinging people, nudging the PRs forward.

The bottleneck has moved. AI changes the economics of code production. One person with good judgment and a phone can generate PRs faster than a team can review them. This creates asymmetric pressure: the person triggering CCA work feels productive (“nine PRs!!”), while reviewers feel overwhelmed (“nine PRs??”).

For a repository like dotnet/runtime, where reliability is paramount, where changes affect millions of developers and consumers, where our ship cycle demands confidence, we cannot compromise on review quality, on expert and experienced eyes validating the changes (at least the spirit of them, if not the dotting of every i). But we also can’t ignore the bottleneck. If PR generation outpaces review capacity, we either:

1. Slow down PR generation (wasting AI’s potential)
2. Speed up review (somehow)
3. Reduce .NET runtime quality (unacceptable)

Option 2 is the only sustainable path. And it means finding meaningful ways to use AI to assist with code review: not to replace human judgment, but to accelerate the mechanics. Focus the reviewer’s attention, summarize changes, flag patterns, highlight what’s different from standard approaches, identify areas that need closer scrutiny and those that are fine and uninteresting. This is the next frontier: if AI can help write code, it can help validate it, too. CCR is already useful and improving quickly; it catches real issues and helps us spot things we might otherwise miss in a first pass.
We’ve also built a custom code-review skill that we can invoke on demand to get a deeper, repo-aware analysis tailored to dotnet/runtime’s conventions. Where we need the most continued investment is in the bigger picture: helping reviewers focus on what matters most (architectural concerns, subtle cross-cutting consequences, spooky action at a distance, etc.), so that human attention is spent where it has the highest impact.

The Power of Instructions

If there’s one insight from this experience worth emphasizing, it’s this: instructions matter enormously. We learned this lesson the hard way, in a very public forum.

The Rocky Start

dotnet/runtime was one of the first major public repositories, if not the first, to adopt CCA. We started using it on the very first day CCA was announced publicly in May 2025. We were excited, eager to experiment, and ready to push the boundaries of what cloud AI-assisted development could accomplish in a production codebase of this scale and complexity.

What we weren’t ready for was the reality of deploying CCA into a codebase like ours without any preparation.

When we first enabled CCA in dotnet/runtime, we had no .github/copilot-instructions.md file. We also weren’t aware of the firewall rules that CCA operates under (by default, the agent runs in a sandboxed environment that blocks access to most external resources). This meant CCA couldn’t download the NuGet packages our build requires. It couldn’t access the feeds where some of our dependencies live. Even if it had known how to build dotnet/runtime (which it didn’t, because we hadn’t told it), it couldn’t actually have done so, because everything it needed was blocked.

The results were… not great. CCA would submit PRs with code changes, but those changes couldn’t be validated by the agent itself. It was essentially writing code it couldn’t compile, proposing fixes it couldn’t test.
Our success rate in May 2025 was 41.7%: more failure than success.

dotnet/runtime is a public open source repo, which meant we were experimenting openly and our stumbles were visible to everyone. And as it so frequently does, the internet had opinions. Viral threads appeared on Hacker News and Reddit where observers mocked what they saw. Comments compared CCA to incompetent contractors. People made up stories about AI being “forced onto .NET” by mandate. Critics claimed we were pushing AI-generated code directly into the codebase without human oversight. AI skeptics declared this proof that AI coding was fundamentally broken and should be abandoned.

The criticism reached individual PRs, too. PR #115762, which involved implementing Unicode version retrieval on iOS, became a lightning rod. The PR accumulated over a hundred comments, many hostile, as CCA struggled to get the build working and external observers piled on in the discussion. The conversation eventually had to be locked.

But here’s what the critics missed or ignored, and what I tried to explain in my comments on that PR: