MCP Beyond the Chat Window: Build Diagnostics in CI

In a previous post we introduced the Microsoft Binlog MCP Server and showed how an AI assistant can investigate MSBuild binary logs through natural language. Picture the payoff in CI: a pull request build fails, and instead of a human downloading the binlog and scrolling the Structured Log Viewer, an agent opens it, pinpoints the failing target and task, and posts the root cause straight back to the PR.

That first post was a great start, but it told only part of the story. It highlighted 15 tools – the server’s surface has since grown well beyond that – and it focused on the interactive, sit-at-your-keyboard experience. In practice, the same Model Context Protocol (MCP) tools are doing real work unattended, inside a continuous integration pipeline on GitHub Actions.

This post fills in the rest. We’ll:

- Watch those tools run unattended inside a GitHub Actions workflow on the public microsoft/testfx repository – a real PR build failure, analyzed and explained automatically
- Walk through the 23 Binlog MCP tools the first post never mentioned
- Back the efficiency claims with evaluation data instead of vibes

If you lead a team, here is the outcome that matters: red builds get a plain-language root cause posted to the PR automatically, so engineers stop losing time downloading logs and decoding MSBuild output. That means faster PR turnaround, fewer interruptions for your build experts, and junior developers who can unblock themselves instead of waiting for someone who “knows the build.” And because it runs as advisory automation on infrastructure you already have, your team can adopt it without changing how anyone works.

MCP in a GitHub Actions Workflow

Start with the payoff. The microsoft/testfx repository runs MCP-powered agents directly in CI using GitHub Agentic Workflows (gh aw), where each workflow is authored in Markdown and compiled to a .lock.yml file.
The workflows are public, so you can read the exact source: build-failure-analysis.md, its on-demand companion build-failure-analysis-command.md, and the build-failure-analyst agent they delegate to. GitHub Agentic Workflows are still evolving, so treat these as a reference implementation to copy from rather than a turnkey feature every repository has today.

Build failure analysis, on every PR

The build-failure-analysis workflow runs the repository build on every pull request and, only when the build fails, wakes up an agent that queries the binlog live through the Binlog MCP Server. The MCP server runs as a container, with the binlog mounted read-only:

    mcp-servers:
      binlog-mcp:
        container: "mcr.microsoft.com/dotnet-buildtools/prereqs:azurelinux-3.0-binlog-mcp-amd64"
        mounts:
          - "/tmp/build.binlog:/data/build.binlog:ro"
        allowed: ["*"]

The workflow spells out its own flow: it runs ./build.sh --binaryLog, and on failure “delegates to the build-failure-analyst agent (which queries the binlog live via the containerized binlog-mcp MCP server) to identify root causes, post a PR comment summarizing them, and attach inline suggestion blocks tied to the diff.” It is explicitly advisory, not gating: the agent’s comment never decides whether the PR passes. The repository’s normal required build workflow stays the merge gate; this MCP-powered workflow only analyzes failures and comments on the PR.

A companion build-failure-analysis-command workflow lets a maintainer rerun the same analysis on demand by commenting /analyze-build-failure.

Concretely, a single failed-build run might have the agent call binlog_overview to see that the build failed, binlog_errors to get the failing error with its target and task context, and binlog_target_reasons or binlog_task_details to explain why that step ran the way it did – then summarize the root cause in a PR comment with an inline suggestion. That whole chain happens without anyone opening a log viewer.

Here is an actual comment the workflow left on a microsoft/testfx pull request.
This is a real CI failure – not a hand-picked toy example, but a formatting (IDE0055) failure in a full multi-project build – showing the same tools at work in a real repository, posting the root cause, the exact file and line, a ready-to-apply fix, and a build overview pulled straight from the binlog:

Look at what is in that comment: the MSBuild version, the 46 projects that built, the five errors, the four failing projects, and the precise IDE0055 location down to the column. All of it came from the agent querying the binlog live through the containerized Binlog MCP Server – nobody downloaded a log or opened a viewer. When the binlog can’t be parsed, the same comment degrades gracefully to a short status note with a link back to the run, so you always know what happened.

See the Tools in Action

You just saw those tools deliver a verdict inside CI. Now let’s slow down and look at what a few of them actually hand back, up close. Everything below is unedited output, captured against a tiny console app built with dotnet build /bl.

Diagnosing a failing build

The app has a one-character typo – Consolee.WriteLine instead of Console.WriteLine – so dotnet build fails with CS0103:

Because we built with /bl, that same failure also produced a binlog. Point an assistant at it and it walks the path a human would, only faster: binlog_overview to confirm the build failed and where, then binlog_errors for the exact file, line, target, and task. Here is a real session against the live server:

That structured payload – code, file, line, targetName, taskName – is exactly what lets an assistant explain the failure and propose a fix without ever scrolling a raw log.

Letting the server take the first pass

You don’t have to drive the tools one at a time, either. binlog_diagnose does the first pass for you – it reads the binlog, groups the errors, picks out the root cause, and even suggests a fix:

For a one-character typo that is overkill.
But on a real CI failure with a wall of cascading errors, having the server name the root cause up front – and point you at the next tool to run – is the difference between a two-minute fix and a half-hour log dive.

Comparing two builds

Now take two successful builds of the same project that differ only in a project setting and an added package, and ask binlog_compare to diff them. This is the unedited output:

The two changes that matter jump right out: LangVersion went from 14.0 to preview, and a Newtonsoft.Json 13.0.3 reference appeared. The rest – the telemetry session IDs and the restore session GUID – is expected per-run noise. Surfacing everything in one structured call is exactly what lets an assistant separate the signal from it, instead of opening two logs and comparing them by hand. (binlog_compare was one of the original 15 tools from the first post, so you won’t see it in the catalog tables below – those list only the 23 the first post didn’t cover.)

The Tools the First Post Didn’t Cover

You have now seen a handful of these tools at work, both interactively and in CI. Here is the full set the first post skipped. The first post highlighted 15 tools for interactive investigation. As of this writing the server source tree exposes 38 binlog_* tools in total – so the tables below list the 23 tools the first post didn’t mention, grouped by what you use them for.
(The exact set evolves, and the published container image can lag the source tree, so treat this as the current snapshot rather than a frozen contract – binlog_capabilities always reports what your installed server actually supports.)

Targets and tasks

When a build does something you didn’t expect – a target fires that shouldn’t, or one you need gets skipped – these walk the execution tree for you, no /v:diag log-diving required.

| Tool | What it does |
| --- | --- |
| binlog_project_targets | List the targets executed in a specific project |
| binlog_search_targets | Search targets by name across all projects |
| binlog_target_reasons | Explain why a target ran or was skipped – the usual answer to “why does this rebuild every time?” |
| binlog_tasks_in_target | List the tasks within a target |
| binlog_task_details | Details for a specific task execution |
| binlog_explore_node | Explore an arbitrary node in the build tree |
| binlog_diagnose | Automated, high-level build diagnosis – a good first stop that points at the likely culprit |

Properties and evaluation

MSBuild evaluation is where most “works on my machine” mysteries hide. These expose the exact properties – and global properties – each project was evaluated with, so you can compare a CI agent against your laptop.

| Tool | What it does |
| --- | --- |
| binlog_compare_property | Compare a single property across two binlogs – pinpoint the one setting that drifted |
| binlog_preprocess | Preprocessed project view (the msbuild /pp equivalent) – the fully expanded project after every import |
| binlog_evaluations | List project evaluations |
| binlog_evaluation_properties | Properties for a specific evaluation |
| binlog_evaluation_global_properties | Global properties for a specific evaluation |

Performance analysis

Reach for these when the build works but is slow.
They turn a binlog into a ranked list of where the time actually went.

| Tool | What it does |
| --- | --- |
| binlog_expensive_analyzers | Slowest Roslyn analyzers and source generators – the usual suspects behind slow compiles |
| binlog_analyzer_summary | Analyzer execution summary |
| binlog_project_target_times | Target-level timing breakdown for a specific project |
| binlog_incremental_analysis | Which targets were skipped vs rebuilt – how you catch a broken incremental build |

Graph, dependencies, and toolchain

These answer the “how is everything wired together?” questions – build order, restore, and the assemblies that actually made it onto the compiler command line.

| Tool | What it does |
| --- | --- |
| binlog_build_graph | Project dependency graph and critical path – what’s really gating your build time |
| binlog_target_graph | Executed-target timeline for one evaluation |
| binlog_nuget | Restore info: versions, sources, duration |
| binlog_assembly_conflicts | Assembly version conflict / RAR analysis |
| binlog_compiler | Compiler command-line invocations |
| binlog_double_writes | Files written by more than one task or target – a classic source of flaky, nondeterministic builds |

Contract

One housekeeping tool that keeps the others honest across server versions.

| Tool | What it does |
| --- | --- |
| binlog_capabilities | Report the server’s contract version and tool envelope |

That is a lot of heavy diagnostic lifting the first post never touched – whole new capabilities like automated diagnosis (binlog_diagnose), per-evaluation inspection for multi-targeted builds, dependency and critical-path graphs, NuGet restore analysis, assembly-conflict detection, and incremental-build introspection.

Tip: To generate a binary log, add /bl to any dotnet build, dotnet test, or dotnet pack command – for example dotnet build /bl.

Does It Actually Help? The Evaluation Data

Adding tools to an AI assistant is only worthwhile if it makes the assistant better and cheaper, not just busier.
To measure that, the team runs a public evaluation harness that scores different configurations on the same set of real-world MSBuild diagnosis scenarios – identifying a build failure, tracing a property, doing a full autonomous root-cause investigation – on a 0-5 quality scale, while recording wall-clock time, tool calls, and token usage. The results are published at the binlog evals dashboard.

Across the 102 runs available at the time of writing, the picture is consistent. The dashboard includes several experimental and alternative configurations; the table below selects the no-tools baseline plus the two configurations this post is about. Input tokens roughly track compute cost, so on that column lower is cheaper. Results vary by run and scenario, so check the dashboard for the full comparison.

| Configuration | Avg score (0-5) | Avg wall time | Avg input tokens |
| --- | --- | --- | --- |
| plain (no tools) | 3.25 | 349.8 s | 1,268,501 |
| binlog-mcp (Binlog MCP Server) | 3.68 | 196.1 s | 1,141,426 |
| skill-mcp (skills + MCP) | 3.60 | 166.7 s | 879,205 |

In other words, against the no-tools baseline the Binlog MCP Server raised the average score from 3.25 to 3.68 while finishing roughly 44% faster (196 s vs 350 s). The skills-plus-MCP configuration was the fastest and cheapest of the three – about 52% faster and roughly 30% fewer input tokens than baseline – while scoring 3.60, just below Binlog MCP alone but still well above the no-tools baseline. One likely reason both tool-based configurations come out ahead is that purpose-built tools let the model query structured binlog data directly instead of reconstructing it from raw text logs, so it spends fewer turns and tokens to reach the same answer.

About the numbers: These figures come from a preview evaluation harness over a small, evolving set of scenarios; the dashboard flags runs with incomplete data. Treat them as directional evidence of the efficiency trend, not as a benchmark guarantee.
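If you want to sanity-check the percentages quoted above, you can recompute them from the table averages. The sketch below is illustrative only: the dictionary layout and the helper names (reduction_pct, compare_to_baseline) are our own and are not part of the evaluation harness, and the inputs are simply the three table rows copied by hand.

```python
# Averages copied by hand from the table above:
# configuration -> (avg score, avg wall time in seconds, avg input tokens)
results = {
    "plain": (3.25, 349.8, 1_268_501),
    "binlog-mcp": (3.68, 196.1, 1_141_426),
    "skill-mcp": (3.60, 166.7, 879_205),
}


def reduction_pct(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100


def compare_to_baseline(rows: dict, baseline: str = "plain") -> dict:
    """Score gain and time/token savings of each configuration vs. the baseline."""
    base_score, base_time, base_tokens = rows[baseline]
    summary = {}
    for name, (score, time_s, tokens) in rows.items():
        if name == baseline:
            continue  # the baseline is the reference point, not a result
        summary[name] = {
            "score_gain": round(score - base_score, 2),
            "time_saved_pct": round(reduction_pct(base_time, time_s), 1),
            "tokens_saved_pct": round(reduction_pct(base_tokens, tokens), 1),
        }
    return summary


summary = compare_to_baseline(results)
for name, row in summary.items():
    print(name, row)
```

Run against the table as published, this lands on about 44% and 52% less wall time for binlog-mcp and skill-mcp respectively, and a bit under 31% fewer input tokens for skill-mcp, which matches the rounded figures in the prose. Because the inputs are averages, the output inherits every caveat that applies to the table itself.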
Re-check the live dashboard for the latest results.

Try It Yourself

Everything above is built on the same foundation you can adopt today:

- Install the dotnet-msbuild plugin from the dotnet/skills marketplace in Visual Studio, VS Code, or the Copilot CLI. (That is the build-diagnostics plugin this post uses; the marketplace has others for different jobs.)
- Build with /bl to capture a binlog, then ask your assistant to investigate it – now with the full toolset, not just the original 15.
- To take it into CI, follow the microsoft/testfx pattern: wire the containerized Binlog MCP Server into a GitHub Agentic Workflow so build failures get analyzed automatically. (These agentic workflows are still evolving; the testfx workflows are the public reference to copy from.)

The Model Context Protocol turns out to be far more than a chat convenience. It is a portable contract that lets the same diagnostic tools run wherever your code does – in your editor, in your terminal, and in your pipelines. We’d love your feedback; file issues in the dotnet/skills repository.

The post MCP Beyond the Chat Window: Build Diagnostics in CI appeared first on .NET Blog.