Your CI Pipeline Should Reject Any Service Without an Observability Contract


2:11 AM, Tuesday. Our payment notification service, which fires transaction confirmations to roughly 40,000 users, had stopped processing. Completely. No alert fired. Nothing paged the on-call. Slack was dark. The engineer on rotation found out at 2:58 AM because a user tweeted at our support account saying they'd made a payment and hadn't gotten a confirmation email in 45 minutes. (The support team DM'd the on-call at 2:58 exactly. I know because I have the screenshot in our post-mortem doc and I've looked at it more times than I'd like to admit.)

47 minutes of a revenue-adjacent, customer-facing service failing. Zero observable signal from any of the monitoring systems we were paying good money for.

The root cause was almost embarrassing to write into the post-mortem. Six weeks before the incident, a junior engineer extracted the notification logic from a monolith into a new Node.js microservice as part of a planned decomposition. The service worked. Tests passed. The /healthz endpoint returned 200. CI was green. But in the extraction, the structured logging configuration got left behind in the old service. The new one defaulted to plain console.log() instead of the structured JSON format our Loki pipeline expected.

So:

No structured JSON → Loki couldn't parse the output → none of our log-based alert rules matched anything → PagerDuty never received a signal → 47 minutes of silence.

The service wasn't broken in any technical sense. It was observably dead. And those are not the same thing, which is the uncomfortable lesson I had to explain to our CTO at 9 AM that Tuesday.

What the Audit Turned Up

After the incident I spent a week going through every microservice on the cluster. Fourteen services total, built by three teams over about two years. The teams were backend (seven services), frontend (four), and a platform team I was part of (three).
I went through every Helm chart and manifest by hand.

The results:

| Telemetry requirement | Compliant | Non-compliant |
|----|----|----|
| Structured JSON logs (level, service, timestamp, traceId) | 9 | 5 |
| Prometheus /metrics port exposed | 11 | 3 |
| W3C traceparent header propagation | 6 | 8 |
| ≥ 1 SLO defined in Alertmanager | 4 | 10 |
| Liveness + readiness probes configured | 13 | 1 |

Ten out of fourteen services had no SLO. Eight weren't propagating trace context, meaning distributed traces in Grafana Tempo were broken at almost every service boundary. I'd been wondering for months why traces were always incomplete. Now I knew.

Here's the thing: we had the full stack. Prometheus, Grafana, Loki, Tempo, Alertmanager. The tooling budget was fine. The problem was that nothing enforced wiring into it. Observability was "required" in a Confluence page that nobody opened after their first week. Engineers added new services and, honestly, why would they spend two hours hooking up logging and SLOs when the tests passed and the PR was waiting?

I probably would have done the same thing in their position.

Before: What a Non-Compliant Service Looked Like

This is the payment-notification Deployment that shipped before the incident.
Everything about it screams "works on my machine":

```yaml
# deploy/payment-notification/deployment.yaml  ← BEFORE (non-compliant)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-notification
  labels:
    app: payment-notification
    # No alerting/rule-group label: no SLO reference
  # No observability annotations: nobody enforced them
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payment-notification
  template:
    metadata:
      labels:
        app: payment-notification
    spec:
      containers:
        - name: payment-notification
          image: myregistry/payment-notification:2.3.9
          ports:
            - name: http
              containerPort: 3000
            # No metrics port: Prometheus can't scrape anything
          # No livenessProbe: Kubernetes has no way to restart a hung pod
          # No readinessProbe: pods receive traffic before they're ready
```

That's it. Five lines of missing config between "fully observable" and "47 minutes of silence."

The Observability Contract

I called it an observability contract: a minimum set of telemetry requirements every service must satisfy before it's allowed to reach the cluster. Not a wiki entry. A hard CI gate: the PR can't merge, the deploy never happens.

Five requirements:

1. Structured JSON logging. The Deployment must carry an annotation declaring it emits logs with at minimum level, service, timestamp, and traceId fields. This is an intent annotation; checking that the code actually does it happens in a nightly audit job (more on that later).
2. Prometheus metrics port. The app container must expose a port named metrics. No named port, no scraping, no dashboards, so we block the PR.
3. Distributed trace propagation. The service must annotate that it reads and forwards W3C traceparent headers. Again, intent-based at CI time.
4. An alert rule reference. The Deployment must carry an alerting/rule-group label pointing to a named Alertmanager rule group. If there's no label, there's probably no alert, and then we're back to Twitter finding our incidents before we do.
5. Liveness and readiness probes. Both, on the app container.
Not optional. Kubernetes can't restart a hung process without a liveness probe and shouldn't route traffic to an unready pod without a readiness probe. This one I'm genuinely surprised we had to mandate, but the audit found one service out of fourteen without them, so here we are.

We enforce all five with Open Policy Agent and Conftest. Conftest lets you write Rego policies that validate Kubernetes manifests against your rules before anything touches the cluster.

The OPA Policy

```rego
# policies/observability_contract.rego
#
# Run with: conftest test --policy policies/ --namespace observability rendered-manifests/
# (Without --namespace, Conftest only evaluates the "main" package.)
# Fails (exit 1) if any deny rule fires.
package observability

# Skip sidecars: only evaluate app containers.
# Defined as a set so a Deployment with more than one app container
# doesn't raise a "complete rules must not produce multiple outputs" error.
app_containers[container] {
    container := input.spec.template.spec.containers[_]
    not startswith(container.name, "istio-")
    not startswith(container.name, "fluentbit-")
}

# 1. Structured logging annotation
deny[msg] {
    input.kind == "Deployment"
    not input.metadata.annotations["observability/structured-logging"]
    msg := sprintf(
        "Deployment '%s': missing 'observability/structured-logging: enabled'. Emit structured JSON with level, service, timestamp, traceId.",
        [input.metadata.name]
    )
}

# 2. Prometheus metrics port
deny[msg] {
    input.kind == "Deployment"
    container := app_containers[_]
    not has_metrics_port(container)
    msg := sprintf(
        "Deployment '%s' container '%s': no port named 'metrics'. Prometheus can't scrape it.",
        [input.metadata.name, container.name]
    )
}

has_metrics_port(container) {
    port := container.ports[_]
    port.name == "metrics"
}

# 3. Trace context annotation
deny[msg] {
    input.kind == "Deployment"
    not input.metadata.annotations["observability/trace-propagation"]
    msg := sprintf(
        "Deployment '%s': missing 'observability/trace-propagation: w3c'. Propagate W3C traceparent on outbound requests.",
        [input.metadata.name]
    )
}

# 4. Alert rule group label
deny[msg] {
    input.kind == "Deployment"
    not input.metadata.labels["alerting/rule-group"]
    msg := sprintf(
        "Deployment '%s': no 'alerting/rule-group' label. Reference an Alertmanager rule group or add an AlertmanagerConfig.",
        [input.metadata.name]
    )
}

# 5a. Liveness probe
deny[msg] {
    input.kind == "Deployment"
    container := app_containers[_]
    not container.livenessProbe
    msg := sprintf(
        "Deployment '%s' container '%s': no livenessProbe. Kubernetes can't restart a hung process without one.",
        [input.metadata.name, container.name]
    )
}

# 5b. Readiness probe
deny[msg] {
    input.kind == "Deployment"
    container := app_containers[_]
    not container.readinessProbe
    msg := sprintf(
        "Deployment '%s' container '%s': no readinessProbe. Pods will take traffic before they're ready.",
        [input.metadata.name, container.name]
    )
}
```

After: What a Compliant Service Looks Like

Same service, fixed manifest:

```yaml
# deploy/payment-notification/deployment.yaml  ← AFTER (compliant)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-notification
  labels:
    app: payment-notification
    alerting/rule-group: payment-services        # ← Rule 4
  annotations:
    observability/structured-logging: "enabled"  # ← Rule 1
    observability/trace-propagation: "w3c"       # ← Rule 3
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payment-notification
  template:
    metadata:
      labels:
        app: payment-notification
    spec:
      containers:
        - name: payment-notification
          image: myregistry/payment-notification:2.4.1
          ports:
            - name: http
              containerPort: 3000
            - name: metrics                      # ← Rule 2
              containerPort: 9090
          livenessProbe:                         # ← Rule 5a
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:                        # ← Rule 5b
            httpGet:
              path: /readyz
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 2
```

Five additions. Maybe 90 minutes of engineering work per service to actually wire up the logging and SLO.
That's it.

The CI Gate

The Rego policy runs as a required check on every PR touching a Helm chart or manifest:

```yaml
# .github/workflows/observability-gate.yml
name: Observability Contract Gate

on:
  pull_request:
    paths:
      - "charts/**"
      - "k8s/**"
      - "deploy/**"

jobs:
  validate-observability-contract:
    name: Validate Observability Contract
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Conftest
        run: |
          wget -q https://github.com/open-policy-agent/conftest/releases/download/v0.50.0/conftest_0.50.0_Linux_x86_64.tar.gz
          tar xzf conftest_0.50.0_Linux_x86_64.tar.gz
          # /usr/local/bin is root-owned on hosted runners
          sudo mv conftest /usr/local/bin/

      - name: Install Helm
        uses: azure/setup-helm@v4
        with:
          version: "3.14.0"

      - name: Render Helm charts
        run: |
          mkdir -p rendered/
          for chart_dir in charts/*/; do
            chart_name=$(basename "$chart_dir")
            helm template "$chart_name" "$chart_dir" \
              --values "$chart_dir/values.yaml" \
              --output-dir rendered/
          done

      - name: Run observability contract policy
        run: |
          # --namespace selects the "observability" package from the policy file
          conftest test rendered/ \
            --policy policies/ \
            --namespace observability \
            --output table
          # Exit code 1 on any deny rule, so the PR cannot merge.
```

When a manifest fails, the CI output looks like this:

```
FAIL - rendered/payment-notification/templates/deployment.yaml - observability Deployment 'payment-notification-v2': missing 'observability/structured-logging: enabled'. Emit structured JSON with level, service, timestamp, traceId.
FAIL - rendered/payment-notification/templates/deployment.yaml - observability Deployment 'payment-notification-v2': no 'alerting/rule-group' label. Reference an Alertmanager rule group or add an AlertmanagerConfig.

3 tests, 1 passed, 0 warnings, 2 failures, 0 exceptions
```

Specific. Actionable. The engineer doesn't need to read a wiki page; the failure message tells them exactly what to add.

What Three Months Looked Like

We shipped the gate in January and gave existing services a six-week window to reach compliance.
New services got zero grace: the gate applied from their first PR.

By April: all fourteen services compliant, up from four at the time of the incident. Two new services, a webhook processor and an analytics aggregator, both got caught by the gate before merging. The webhook service was missing the metrics port and the trace annotation. The engineer added both in the same PR, probably in 20 minutes. No incident. No post-mortem.

Zero silent failures since January.

The unexpected side effect was documentation. Because engineers were hitting policy failures on new services and asking "how do I fix this?", we had to actually write a clear "how to make your service observable" guide. That guide is in onboarding now. The gate created documentation pressure that no number of "please update the wiki" Slack messages ever did. I'd tried for about a year to get that guide written. The CI gate got it done in two weeks.

Where This Breaks Down

The annotations for logging (Rule 1) and trace propagation (Rule 3) are trust-based. We verify that the annotation exists. We don't verify at CI time that the code actually emits structured JSON or actually propagates the header. An engineer can write observability/structured-logging: enabled and ship a service that still calls console.log().

We caught this once, about six weeks after we launched the gate. Found it in a log search, not from an incident. But still.

So we added a nightly runtime audit: a Python script that hits each service's live Loki log stream, pulls the last 200 lines, and validates the JSON schema. If a service fails (wrong format, missing fields, plain text), it opens a GitHub issue and pings the owning team in Slack. It runs at 3:15 AM UTC, after the quarantine test job finishes.

The CI gate is intent. The runtime audit is verification. Neither is airtight alone.
Together, they've kept the dashboards from going silent since January, and that is genuinely all I needed.

The 47 minutes are done.

If you're not using Helm (Kustomize, CDK8s, raw manifests), Conftest works identically. Point it at whatever directory your rendered YAML lives in. The Rego doesn't care how the manifests got there.
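For anyone who wants to build the runtime half, here is a minimal sketch of the per-line check a nightly audit like the one described under "Where This Breaks Down" might run. It is not our actual script. The function names (`check_log_line`, `audit_service`) and the sample log lines are hypothetical, and it assumes the log lines have already been pulled from Loki. Fetching them, opening the GitHub issue, and pinging Slack are left out.

```python
import json

# Fields the contract requires on every log line (Rule 1).
REQUIRED_FIELDS = ("level", "service", "timestamp", "traceId")


def check_log_line(line: str) -> list[str]:
    """Return the problems found in one log line; an empty list means compliant."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        # Plain console.log() output lands here.
        return ["not valid JSON (plain text?)"]
    if not isinstance(record, dict):
        return ["JSON value is not an object"]
    return [f"missing field '{name}'" for name in REQUIRED_FIELDS if name not in record]


def audit_service(lines: list[str]) -> dict[int, list[str]]:
    """Map 0-based line index to its problems, for non-compliant lines only."""
    failures: dict[int, list[str]] = {}
    for index, line in enumerate(lines):
        problems = check_log_line(line)
        if problems:
            failures[index] = problems
    return failures


if __name__ == "__main__":
    # Hypothetical sample lines, standing in for the last N lines of a log stream.
    sample = [
        '{"level": "info", "service": "payment-notification", '
        '"timestamp": "2024-01-01T00:00:00Z", "traceId": "abc123"}',
        "Sent confirmation email",  # what an unstructured logger would emit
        '{"level": "info", "service": "payment-notification"}',
    ]
    for index, problems in audit_service(sample).items():
        print(f"line {index}: {', '.join(problems)}")
```

The check only looks at whether the fields are present, not what they contain. Anything stricter, such as rejecting an empty traceId or validating the timestamp format, would be a separate rule added on top.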