Fixes Required for Prometheus’ OpenTelemetry Integration

We have covered how there have been conflicts between OpenTelemetry and Prometheus compatibility for a number of reasons. Much of this has to do with how Prometheus has been, and remains, a very tried and trusted open source metrics solution that preceded what OpenTelemetry has offered as an alternate approach to standardization, along with other attributes that definitely shine for observability.

During PromCon, Prometheus’s annual user conference held recently in Munich, Julius Volz, a co-founder of Prometheus, used the opening keynote to offer specific insights into how using OpenTelemetry with Prometheus still poses problems and what needs to be done to address those issues.

First, the bad: there is a fundamental loss of service discovery and active pull when using OpenTelemetry, as well as the complexity of OpenTelemetry’s SDKs. Performance issues abound as well. In one benchmark test, which RevCom has not verified, Volz noted up to a 22-times speed difference in Go benchmarks when OpenTelemetry instrumentation is used compared to native Prometheus instrumentation.

Obviously, improving performance is key, in particular for observability code, which runs very often and is meant to help improve performance. But semantic conventions also pose problems, as we discussed in our previous articles. There is also a continued need for collaboration between the Prometheus and OpenTelemetry teams.

“I did want to point out the drawbacks of [integrating OpenTelemetry with Prometheus], and maybe make you think about, ‘OK, do we really want to go down this route if you mostly care about metrics and using them with Prometheus?’” Volz said. “Also, in terms of the service discovery and SDK slowness drawbacks, I would really like other people to think about that more and maybe improve things, to not just throw away the baby with the bath water and lose all these benefits that we built very carefully in Prometheus over a long time.”

There have been many improvements since Richard “RichiH” Hartmann, director of community at Grafana Labs, started a formal effort to improve interoperability between the two projects in 2020.

While OpenTelemetry’s lack of service discovery will always create extra work for operators, most of the fundamental problems have been resolved, Hartmann told me after the conference. OpenTelemetry changed, and arguably fixed, its histogram bucket definitions in favor of Prometheus, Hartmann said. “Prometheus then expanded support for data labels to support OpenTelemetry,” Hartmann said. “And both projects collaborated early on native histograms based on the work of Björn ‘Beorn’ Rabenstein, leading to fully compatible releases on both sides.”

Meanwhile, OpenTelemetry is “here in full force and is not going to go away,” and organizations seek to use it with Prometheus, instrumenting their services with OpenTelemetry and then sending the metrics part to the Prometheus system, Volz said. “However, there are plenty of downsides in comparison to using Prometheus’ own native instrumentation client libraries,” Volz said. “It is important to be aware of these before choosing the OpenTelemetry route if one mostly cares about metrics and Prometheus.”

A quick contrast between the two systems shows that Prometheus is an entire monitoring system, focusing only on the metrics signal type. In contrast, OpenTelemetry “only cares about” generating the signals, including logs, metrics, traces and profiles, and then passing them on to some kind of third-party backend system, Volz said. This corresponds to the goal of OpenTelemetry’s creators to standardize the emission side, and it reflects the many different storage vendors represented in OpenTelemetry.

How the metrics are transferred is a key difference: OpenTelemetry uses OTLP to push those metrics, while Prometheus uses a text-based format and actively pulls them, Volz said. Sending metrics via an OpenTelemetry Collector to an OTLP receiver endpoint in a Prometheus server introduces several drawbacks, Volz said.

Losing Active Pull and Health Monitoring

The first and most unfortunate downside is throwing away a lot of what makes Prometheus good and capable: the integration of service discovery with pull-based active target monitoring, Volz said. Prometheus handles this by talking to systems like the Kubernetes API server to get an always up-to-date view of its targets. It then actively pulls, or scrapes, metrics, recording an up metric with a value of zero or one, which is essential for target health alerts, Volz said.

Since OpenTelemetry doesn’t have a built-in facility for this functionality, it becomes harder to tell whether a target is running but not sending metrics, or whether it is down, Volz said. At the same time, security controls are bypassed when unexpected metrics are pushed. “A lot of people ignore this completely, treating their Prometheus server as a random receptacle of metrics, and then they do not know if a process that should be running is not,” Volz said. “The idea of a synthetic up metric for OTLP ingestion has been heard, but it does not exist yet.”
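To see why that matters in practice, here is a minimal sketch of the kind of target-health alerting the up metric enables. The expressions are standard PromQL; the job name is a hypothetical example.

    # up is recorded automatically for every scraped target:
    # 1 when the last scrape succeeded, 0 when it failed.
    up == 0

    # The same check scoped to a single job (the job name is illustrative):
    up{job="api-server"} == 0

With push-only OTLP ingestion there is no scrape, so nothing sets up to zero when a service silently stops reporting, which is exactly the gap Volz described.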
The second downside is the resulting changed metric names and somewhat “ugly PromQL selectors,” Volz said. OpenTelemetry introduces character set differences, allowing characters like dots and slashes that Prometheus did not support before Prometheus 3. “This suggests that the people standardizing OpenTelemetry did not highly prioritize how a metric would be used in a query language like PromQL,” Volz said.

Prometheus conventions add suffixes for both units and types of metrics to make their meaning immediately clear. OpenTelemetry, however, says “don’t put unit and type into the metric name.” As a result, the Prometheus ingestion layer adds those suffixes back during translation. With the extended character set, PromQL selectors become more complex and harder to write and read than native selectors, Volz said; a selector example below, after the SDK discussion, shows the difference.

Indeed, OpenTelemetry and its SDKs are quite complex and can be quite slow. Benchmarking in Go showed that native Prometheus client libraries are up to 22 times faster than the OpenTelemetry SDK for counter increments, as mentioned above. “Even adding two labels makes the OpenTelemetry SDK 90% slower,” Volz said. “OpenTelemetry’s complexity is baked in and is hard to remove, making it the XML or CORBA of telemetry, because it attempts to solve all problems at once.”
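To make the earlier point about selector readability concrete, here is a hedged illustration: both metric names are hypothetical, and the second uses the quoted-name form that Prometheus 3 introduced for names containing characters such as dots.

    # Conventional Prometheus name: unit and type suffixes, no special characters.
    http_request_duration_seconds_count{job="api"}

    # OpenTelemetry-style dotted name: with the extended character set, the name
    # has to be quoted and moved inside the braces, next to the label matchers.
    {"http.server.request.duration_seconds_count", job="api"}

Both select the same kind of series; the difference Volz is pointing at is purely in how the query reads and how much of it the author has to type.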
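As for the instrumentation comparison, the following Go sketch shows the two styles Volz benchmarked: a labeled counter increment through the native Prometheus client library and through the OpenTelemetry SDK. This is not Volz’s benchmark code; the metric names are illustrative, and the import paths reflect current releases of each library and may differ across versions.

    // Minimal sketch: the same logical counter increment, instrumented two ways.
    package main

    import (
        "context"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/metric"
    )

    // Native Prometheus client: a counter vector with two labels.
    var httpRequests = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests handled.",
        },
        []string{"code", "method"},
    )

    // OpenTelemetry SDK: measurements go through a Meter and per-call attributes.
    var (
        meter           = otel.Meter("example.org/demo")
        otelRequests, _ = meter.Int64Counter("http.server.requests")
    )

    func recordWithPrometheus() {
        httpRequests.WithLabelValues("200", "GET").Inc()
    }

    func recordWithOpenTelemetry(ctx context.Context) {
        otelRequests.Add(ctx, 1, metric.WithAttributes(
            attribute.String("code", "200"),
            attribute.String("method", "GET"),
        ))
    }

    func main() {
        recordWithPrometheus()
        recordWithOpenTelemetry(context.Background())
    }

Both functions do the same work, incrementing a counter with two labels, which is the hot path the benchmark numbers above refer to.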
Roll up Sleeves

To address health checks, future work may involve a synthetic up metric for OTLP ingestion, Volz said. This feature would use service discovery and correlate expected data with incoming data to generate an up metric when data is missing.

“On the metrics side, the Prometheus team is actively trying to improve things, including creating an experimental delta-to-cumulative processor to support OpenTelemetry’s delta temporality,” Volz said. “There is also a recognized regret for not having semantic conventions in Prometheus land from the beginning, suggesting future collaboration with OpenTelemetry people could make sense to introduce a similar standardized naming structure.”