Organizations need to squeeze as much value out of their telemetry as they possibly can, for a number of reasons. Yet gathering telemetry data for observability is a tricky proposition, to say the least.

On one hand, turning on the spigot and collecting every metric an environment generates quickly becomes unwieldy and unmanageable, not to mention unaffordable for most, if not all, organizations. On the other hand, sampling too little means the data is likely missing key elements needed for debugging, interpretation or monitoring for potential outages and other problems, and efforts to optimize operations and development become inaccurate or unreliable. Sampling the wrong data is of little to no help either.

This dilemma is compounded for very large enterprises such as, in this case, Capital One.

During the Observability Day event ahead of KubeCon + CloudNativeCon North America, Capital One engineers Joseph Knight and Sateesh Mamidala showed how they relied on OpenTelemetry to solve the trace-sampling problem and implement the solution across Capital One's entire operations worldwide.

Their efforts paid off: They reported a 70% reduction in tracing data volumes.

It wasn't an easy task, but OpenTelemetry served as the backbone for the gargantuan project, which they detailed in their Observability Day talk, "From Data Overload To Optimized Insights: Implementing OTel Sampling for Smarter Observability."

As Knight said during the talk, Capital One was dealing with "more than a petabyte per day without any sampling."

The solution required deploying dedicated infrastructure. Tail-based sampling turns tracing into a horizontal-scaling problem, because you must "bring all the spans together for a trace before you can make a sampling decision," Knight said.

This, he added, resulted in layering collectors: a load-balancing exporter, a collector layer and then a sampling processor layer, all entirely dedicated to tracing.

Why Capital One Chose OpenTelemetry Over Vendor Tools

Before adopting OpenTelemetry, Capital One's engineers relied on vendor tools that implemented their own, often disparate, sampling strategies, typically offering only head-based sampling, in which the decision to keep or drop a trace is made at the beginning of a request.

OpenTelemetry "gave us the new perspective that head-based sampling is not very effective," Knight said.

The current approach with OTel offers two key benefits, Knight said. The first is that the centralized team now has control over the cost of distributed tracing, which ensures that widespread adoption is possible with the available resources.

Second, the team can provide guarantees to application teams that "they will be able to see certain behavior in their tool," such as specific errors, which builds "a lot more comfort in how sampling affects the traces coming from their application," Knight said. That level of assurance, he added, is not something purely probabilistic, head-based sampling can provide.
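To make the distinction concrete, here is a minimal Python sketch, not Capital One's implementation, contrasting the two approaches: a head-based probabilistic sampler configured through the OpenTelemetry Python SDK, which decides at the start of a request, and a hypothetical tail-based policy function that decides only after the whole trace has been assembled and can therefore guarantee that error traces are kept. The decide_tail_sampling helper and its span structure are illustrative assumptions.

```python
# A minimal sketch contrasting head-based and tail-based sampling.
# The head-based part uses the OpenTelemetry Python SDK samplers;
# the tail-based part is a purely illustrative policy function.
import random

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based sampling: the keep/drop decision is made when the root span starts,
# before anything is known about downstream errors or latency.
head_sampler = ParentBased(root=TraceIdRatioBased(0.10))  # keep ~10% of traces by trace ID
trace.set_tracer_provider(TracerProvider(sampler=head_sampler))


def decide_tail_sampling(spans: list[dict], probability: float = 0.10) -> bool:
    """Tail-based decision (illustrative): runs only after every span of the
    trace has been collected, so the policy can inspect the finished trace.
    Keep all traces containing an error span; otherwise sample probabilistically."""
    if any(span.get("status") == "ERROR" for span in spans):
        return True  # guarantee: error traces remain visible to application teams
    return random.random() < probability


# Example: a completed trace with one failing span is always kept.
completed_trace = [
    {"name": "GET /accounts", "status": "OK"},
    {"name": "charge-card", "status": "ERROR"},
]
print(decide_tail_sampling(completed_trace))  # True
```

In a Collector deployment like the one Knight described, that tail decision lives in a sampling processor layer sitting behind a load-balancing exporter that routes spans by trace ID, so all the spans of a trace reach the same sampling collector.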
Best Practices for Making Sampled Tracing Data Useful

The key to making sampled data useful is the addition of tags. Capital One's team adds tags to sampled traces to indicate how they were selected and at what probabilistic ratio they were sampled. This is useful in two ways, Knight said.

Estimation: Teams can estimate the amount of trace data originally generated by scaling the observed traces by the recorded probabilistic ratio, which gives an estimate of how many traces or requests were generated prior to sampling.

Historical accuracy: Because the data is tagged directly, if the sampling ratios change over time, the original ratios are "baked in with the source data," Knight said, allowing teams to look backward without seeing jumps over time.

Furthermore, instead of relying on every span for rate information, teams should be taught to use metrics along with spans to get a more accurate picture of system behavior.

"We export the semantic convention metrics, histograms for every single span that we generate, both from the server and the client side," Knight said.

Using these metrics for accurate counts means "you don't need every span to understand the rate of your system," he said. "Building rules and guides for translating tools, alerts and dashboards to use metrics can make this transition easier."

The Strategic Shift From Head- To Tail-Based Sampling

The shift from head-based to tail-based sampling, in which the sampling decision is made at the end of the trace, has been a success, Knight said. The teams are now "very happy that they are getting a much better picture now from the traces than before," he said. This is because tail sampling allows the decision to be made after all the spans have been received and the entire trace can be examined.

Despite the challenges of finding the right balance between high-rate and low-rate applications, the continued focus on dynamically adapting the tail sampling processor is key, and the Capital One team aims to publish this work as an open source contribution.

Ongoing Challenges and Future Goals in Data Sampling

That 70% reduction in trace volume may be impressive, but the team is looking at the remaining 30% and asking, "How can we do better?" Knight said.

The central challenge is a "tug of war" between high-frequency (high-rate) and low-frequency (low-rate) events in the probabilistic ratios, he said. High-rate applications can handle a much lower probabilistic rate, whereas low-rate applications get starved at a low ratio. At scale, tailoring the rule set to every specific application is not feasible.

The current focus is on building enhancements to the tail-sampling processor that will give the system the ability to, as Knight said, "adapt to the frequency of events we see dynamically, right, without config changes on our side."
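Capital One has not published that enhancement yet, so the following Python sketch is only an illustration of the general idea of frequency-adaptive sampling, not their design: each service's sampling probability is recomputed from its recent trace rate against a fixed budget, so high-rate services get aggressive ratios while low-rate services are never starved. All names, window sizes and thresholds below are assumptions.

```python
# Illustrative sketch of frequency-adaptive probabilistic sampling: the sampling
# probability for each service is recomputed from its observed trace rate, so
# high-rate services get low ratios and low-rate services are not starved.
from collections import defaultdict

TARGET_KEPT_PER_MINUTE = 60   # budget of kept traces per service per minute (assumed)
MIN_PROBABILITY = 0.01        # floor so busy services still send some traces
MAX_PROBABILITY = 1.0         # low-rate services keep everything


class AdaptiveRatios:
    """Tracks per-service trace counts over a window and derives sampling ratios."""

    def __init__(self) -> None:
        self.window_counts: dict[str, int] = defaultdict(int)

    def record(self, service: str) -> None:
        # Called once per completed trace observed in the current window.
        self.window_counts[service] += 1

    def ratios(self) -> dict[str, float]:
        # Recomputed at the end of every window (e.g., once a minute),
        # with no manual configuration change per application.
        result = {}
        for service, count in self.window_counts.items():
            raw = TARGET_KEPT_PER_MINUTE / count
            result[service] = max(MIN_PROBABILITY, min(MAX_PROBABILITY, raw))
        return result


# Example: a high-rate and a low-rate service in the same one-minute window.
adaptive = AdaptiveRatios()
for _ in range(60_000):
    adaptive.record("payments-api")    # raw ratio 0.001, clamped up to the 0.01 floor
for _ in range(30):
    adaptive.record("batch-reporter")  # raw ratio 2.0, clamped down to 1.0 (keep all)
print(adaptive.ratios())  # {'payments-api': 0.01, 'batch-reporter': 1.0}
```

A production version would live inside the collector's sampling pipeline and handle window resets and bursts, but the core idea, deriving the probabilistic ratio from observed frequency rather than fixing it per application, is the trade-off Knight described.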