Why SRE SLOs for Azure EventHub and Apache Kafka Must Be Different

A Real-World Microservices Architecture Case Study with DataDog and a Custom EventBridge


Background: A Real-World Enterprise Scenario

In a recent architecture working session, our engineering team was tasked with defining SLOs (Service Level Objectives) for two microservices running in a large enterprise environment. The platform involves:

  • 500+ microservices

  • A 100-node Apache Kafka on-premises cluster; most business applications (yet to be migrated to Azure) run on-premises as microservices (Java, .NET, Python, Node.js, etc.)

  • Several workloads and microservices deployed on Azure Kubernetes Service (AKS) and integrated with Azure EventHub

Two services in particular came up for SLO instrumentation:

  • Microservice A (deployed on on-premises Kubernetes): Produces and consumes messages directly from the Kafka cluster.

  • Microservice B (deployed on Azure Kubernetes): Consumes messages from both Kafka and Azure EventHub. The connection between the two is handled by a custom-built microservice called HomeGrown EventBridge, which reads messages from Kafka and writes them to EventHub, and vice versa.

We were defining SLOs using DataDog when a colleague suggested both microservices should have the same SLOs - since both "just process and publish messages."

This post explains why SLOs must reflect the underlying system architecture, not just the business function.

I Disagreed

What Are SRE SLOs?

In Site Reliability Engineering (SRE), a Service Level Objective (SLO) is a clearly defined performance target that reflects how reliable a service must be. SLOs are usually defined in terms of:

  • Latency (e.g., "95% of messages processed within 200ms")

  • Availability (e.g., "Service is available 99.9% of the time")

  • Error Rate (e.g., "Less than 0.1% publish failures over 1 hour")

SLOs help teams align on what "good enough" means and form the basis for error budgets, incident management thresholds, and service capacity planning.
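To make the error-budget relationship concrete, here is a minimal sketch (the numbers are illustrative placeholders, not targets from this post) of how an availability SLO translates into allowed downtime over a window:

```python
from datetime import timedelta

def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime allowed within `window` for a given availability SLO.

    slo_target: a fraction, e.g. 0.999 for "99.9% available".
    """
    return window * (1.0 - slo_target)

# A 99.9% availability SLO over a 30-day window leaves
# roughly 43 minutes of error budget.
budget = error_budget(0.999, timedelta(days=30))
print(budget)  # 0:43:12
```

Once the budget is spent (by outages or degraded performance), the error-budget policy typically pauses risky releases until reliability recovers.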

Tools Commonly Used to Define and Monitor SLOs

Depending on your stack, there are several tools that help teams define, visualize, and track SLOs:

| Tool | Key Capabilities |
| --- | --- |
| DataDog | Easy setup, built-in SLO widgets, multi-source metrics |
| Nobl9 | Dedicated SLO platform, integrates with cloud and on-prem |
| Prometheus + Grafana | Open-source, highly customizable for SRE SLOs |
| Google Cloud Monitoring | Native SLO definition for GCP workloads |
| New Relic / Dynatrace | Built-in SLO tracking for full-stack monitoring |

In our case, we used DataDog to define custom latency and error rate SLOs for each microservice.
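Whatever the backend, the underlying SLIs are the same: per-request latency and an error count. As a hedged, stdlib-only sketch (the `SliRecorder` class is hypothetical, not a DataDog API; in practice these measurements would be forwarded to a monitoring agent), latency and error-rate SLIs can be captured like this:

```python
import functools
import time
from statistics import quantiles

class SliRecorder:
    """Hypothetical in-process recorder for latency and error-rate SLIs."""

    def __init__(self):
        self.latencies_ms: list[float] = []
        self.errors = 0
        self.calls = 0

    def track(self, fn):
        """Decorator that records latency and errors for each call."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            self.calls += 1
            try:
                return fn(*args, **kwargs)
            except Exception:
                self.errors += 1
                raise
            finally:
                # Record latency for successes and failures alike.
                self.latencies_ms.append((time.perf_counter() - start) * 1000)
        return wrapper

    def p95_ms(self) -> float:
        # 95th percentile of observed latencies (needs >= 2 samples).
        return quantiles(self.latencies_ms, n=100)[94]

    def error_rate(self) -> float:
        return self.errors / self.calls if self.calls else 0.0
```

In production, a recorder like this would flush measurements as custom metrics (e.g., histograms via a local agent) rather than hold them in memory; the SLO widget then evaluates the p95 and error-rate series against the targets.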

Why Azure EventHub and Apache Kafka Deserve Different SLOs

At a glance, both services appear to be doing the same thing: publishing and consuming messages. But architecturally, they’re built on very different foundations with vastly different runtime characteristics.

Let’s break it down.

Architectural Differences

| Characteristic | Apache Kafka (On-Premises) | Azure EventHub |
| --- | --- | --- |
| Deployment Control | Full control: tuning, partitioning, retention | Managed by Azure |
| Latency Profile | Tuned for low latency | Moderate latency with potential throttling |
| Backpressure Handling | Controlled by consumer config | Azure decides retry/backoff |
| Failure Isolation | Local, more predictable | Shared Azure infrastructure |
| Metrics Resolution | Fine-grained, near real-time | 1-minute resolution for most metrics |

Additional Complexity: Bridging the Two with EventBridge

For Microservice B, the architecture includes:

Kafka → EventBridge → Azure EventHub → EventHub Consumer

This introduces:

  • Extra network hops

  • Serialization/deserialization overhead

  • Protocol translation risks (e.g., Avro-to-JSON)

  • Multiple failure points (Kafka, EventBridge, EventHub)

Each of these layers adds latency and operational uncertainty—both critical to consider in your SLO targets.
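The multiple failure points also compound. Assuming independent failures and purely illustrative per-hop availabilities (these are placeholder numbers, not measurements from this system), the availability of the serial chain is the product of its hops:

```python
from math import prod

# Illustrative per-hop availabilities (placeholders, not measured values).
hops = {
    "kafka": 0.9995,
    "eventbridge": 0.999,   # the custom HomeGrown EventBridge
    "eventhub": 0.9995,
}

# A serial chain is only as available as the product of its hops
# (assuming independent failures).
chain_availability = prod(hops.values())
print(f"{chain_availability:.4%}")  # ~99.80%, worse than any single hop
```

This is why the bridged path cannot credibly carry the same availability SLO as the Kafka-only path: every added hop lowers the ceiling.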

Kafka-Only Service (Microservice A)

  • Message Publish Latency (p95): < 100ms

  • Message Consume Latency (p95): < 200ms

  • Error Rate: < 0.01%

  • Alerts: Partition lag, offset commit failures

Kafka to EventHub (Microservice B)

  • Kafka to EventBridge Latency (p95): < 150ms

  • EventBridge to EventHub Publish Latency (p95): < 300ms

  • EventHub Consumer Processing (p95): < 500ms

  • End-to-End Latency SLO (p95): < 1s

  • Error Rate: < 0.1%

These are just reference targets. Actual SLOs should be derived from historical performance, business criticality, and user-facing impact.
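One quick sanity check on such per-hop targets: summing per-hop p95s is only a rough, conservative heuristic (percentiles do not add exactly), but it confirms the hop budgets leave headroom under the end-to-end budget:

```python
# Per-hop p95 targets from the reference SLOs above, in milliseconds.
hop_budgets_ms = {
    "kafka_to_eventbridge": 150,
    "eventbridge_to_eventhub": 300,
    "eventhub_consumer": 500,
}

end_to_end_budget_ms = 1000  # the 1s end-to-end p95 SLO

# Rough check: the sum of hop p95s should stay under the
# end-to-end budget, leaving slack for queueing and retries.
total = sum(hop_budgets_ms.values())
headroom_ms = end_to_end_budget_ms - total
print(total, headroom_ms)  # 950 50
```

Only 50ms of slack means any new hop or retry policy should trigger a review of the end-to-end target.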

Why Uniform SLOs Are Risky

Even though both services "handle messages," their operational environments are fundamentally different.

  • You can’t control Azure EventHub internals or metrics resolution.

  • Throttling behavior in Azure may degrade latency under high load.

  • Failure recovery and observability are more predictable in Kafka.

  • The EventBridge microservice itself becomes a point of failure or latency amplification.

Uniform SLOs overlook these architectural nuances and can produce misleading reliability metrics, poor alerting, and SLIs that miss real degradations.

Final Takeaway

SLOs are only valuable when they reflect the actual behavior and dependencies of your architecture. Applying the same SLOs to fundamentally different systems like Kafka and EventHub ignores the realities of hybrid environments.

When you bridge systems across on-premises and cloud, treat them as separate SLO domains—each with its own latency budget, failure profile, and observability constraints.
