Why SRE SLOs for Azure EventHub and Apache Kafka Must Be Different
A Real-World Microservices Architecture Case Study with DataDog and a Custom EventBridge
Background: A Real-World Enterprise Scenario
In a recent architecture working session, our engineering team was tasked with defining SLOs (Service Level Objectives) for two microservices running in a large enterprise environment. The platform involves:
500+ microservices
A 100-node on-premises Apache Kafka cluster, with most business applications (yet to be migrated to Azure) running on-premises as microservices (Java, .NET, Python, Node.js, etc.)
Several workloads and microservices deployed on Azure Kubernetes and integrated with Azure EventHub
Two services in particular came up for SLO instrumentation:
Microservice A (Deployed on On-Premises Kubernetes): Produces and consumes messages directly from the Kafka cluster.
Microservice B (Deployed on Azure Kubernetes): Consumes messages from both Kafka and Azure EventHub. The connection between the two is handled by a custom-built microservice called HomeGrown EventBridge, which reads messages from Kafka and writes them to EventHub, and vice versa.
We were defining SLOs using DataDog when a colleague suggested both microservices should have the same SLOs, since both "just process and publish messages."
I Disagreed
This post explains why SLOs must reflect the underlying system architecture, not just the business function.
What Are SRE SLOs?
In Site Reliability Engineering (SRE), a Service Level Objective (SLO) is a clearly defined performance target that reflects how reliable a service must be. SLOs are usually defined in terms of:
Latency (e.g., "95% of messages processed within 200ms")
Availability (e.g., "Service is available 99.9% of the time")
Error Rate (e.g., "Less than 0.1% publish failures over 1 hour")
SLOs help teams align on what "good enough" means and form the basis for error budgets, incident management thresholds, and service capacity planning.
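To make the error-budget link concrete: a 99.9% availability target over a 30-day window leaves roughly 43 minutes of allowable downtime. A quick sketch (illustrative numbers only, not production targets):

```python
# Sketch: turn an availability SLO into a monthly error budget.
# Numbers are illustrative, not our production targets.
SLO_TARGET = 0.999              # 99.9% availability
WINDOW_MINUTES = 30 * 24 * 60   # 30-day rolling window

budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES
print(f"Error budget: {budget_minutes:.1f} minutes per 30 days")  # -> 43.2
```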
Tools Commonly Used to Define and Monitor SLOs
Depending on your stack, there are several tools that help teams define, visualize, and track SLOs:
| Tool | Key Capabilities |
|---|---|
| DataDog | Easy setup, built-in SLO widgets, multi-source metrics |
| Nobl9 | Dedicated SLO platform, integrates with cloud and on-prem |
| Prometheus + Grafana | Open-source, highly customizable for SRE SLOs |
| Google Cloud Monitoring | Native SLO definition for GCP workloads |
| New Relic / Dynatrace | Built-in SLO tracking for full-stack monitoring |
In our case, we used DataDog to define custom latency and error rate SLOs for each microservice.
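As a rough illustration, a metric-based error-rate SLO can be created through DataDog's v1 SLO API. The sketch below is illustrative only; the custom metric names are hypothetical placeholders, not the SLIs we actually shipped:

```python
# Sketch: creating a metric-based SLO via DataDog's v1 SLO API.
# The custom metric names below are hypothetical placeholders;
# substitute the SLI metrics your services actually emit.
import os
import requests

payload = {
    "name": "Microservice A - Kafka publish error rate",
    "type": "metric",  # metric-based SLO: good events / total events
    "query": {
        "numerator": "sum:custom.kafka.publish.success{service:microservice-a}.as_count()",
        "denominator": "sum:custom.kafka.publish.total{service:microservice-a}.as_count()",
    },
    "thresholds": [{"timeframe": "30d", "target": 99.99}],
}

resp = requests.post(
    "https://api.datadoghq.com/api/v1/slo",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    json=payload,
)
resp.raise_for_status()
print(resp.json())
```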
Why Azure EventHub and Apache Kafka Deserve Different SLOs
At a glance, both services appear to be doing the same thing: publishing and consuming messages. But architecturally, they’re built on very different foundations with vastly different runtime characteristics.
Let’s break it down.
Architectural Differences
| Characteristic | Apache Kafka (On-Premises) | Azure EventHub |
|---|---|---|
| Deployment Control | Full control: tuning, partitioning, retention | Managed by Azure |
| Latency Profile | Tuned for low latency | Moderate latency with potential throttling |
| Backpressure Handling | Controlled by consumer config | Azure decides retry/backoff |
| Failure Isolation | Local, more predictable | Shared Azure infrastructure |
| Metrics Resolution | Fine-grained, near real-time | 1-minute resolution for most metrics |
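The backpressure row deserves emphasis: with self-hosted Kafka you decide how aggressively to pull and when to commit, whereas EventHub's throttling and retry behavior is largely set by the managed service. A minimal consumer sketch with the confluent-kafka Python client (broker address, group id, and topic are placeholders):

```python
# Sketch: with on-prem Kafka, backpressure is controlled by *your* consumer
# config; with EventHub, the managed service decides retry/backoff.
from confluent_kafka import Consumer

def process(payload: bytes) -> None:
    ...  # business logic goes here

consumer = Consumer({
    "bootstrap.servers": "kafka-onprem:9092",  # placeholder
    "group.id": "microservice-a",
    "enable.auto.commit": False,               # commit only after processing
    "max.poll.interval.ms": 300000,            # bound on per-batch processing time
    "fetch.min.bytes": 1,                      # latency vs. throughput trade-off
})
consumer.subscribe(["orders"])                 # placeholder topic

while True:
    msg = consumer.poll(timeout=1.0)           # pull at a pace you control
    if msg is None or msg.error():
        continue
    process(msg.value())
    consumer.commit(message=msg)               # explicit offset commit
```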
Additional Complexity: Bridging the Two with EventBridge
For Microservice B, the architecture includes:
Kafka → EventBridge → Azure EventHub → EventHub Consumer
This introduces:
Extra network hops
Serialization/deserialization overhead
Protocol translation risks (e.g., Avro-to-JSON)
Multiple failure points (Kafka, EventBridge, EventHub)
Each of these layers adds latency and operational uncertainty—both critical to consider in your SLO targets.
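A stripped-down sketch of one bridge hop makes the cost visible. This uses the confluent-kafka and azure-eventhub Python SDKs; connection details are placeholders, and the real HomeGrown EventBridge additionally handles batching, retries, dead-lettering, and the Avro-to-JSON translation:

```python
# Sketch: one Kafka -> EventHub bridge hop, timed per message.
# Connection details are placeholders; a production bridge also needs
# batching, retries, dead-lettering, and schema translation.
import time
from confluent_kafka import Consumer
from azure.eventhub import EventHubProducerClient, EventData

consumer = Consumer({
    "bootstrap.servers": "kafka-onprem:9092",  # placeholder
    "group.id": "homegrown-eventbridge",
})
consumer.subscribe(["orders"])                 # placeholder topic

producer = EventHubProducerClient.from_connection_string(
    conn_str="<EVENTHUB_CONNECTION_STRING>",   # placeholder
    eventhub_name="orders",
)

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    start = time.monotonic()
    batch = producer.create_batch()
    batch.add(EventData(msg.value()))          # serialization hop
    producer.send_batch(batch)                 # network hop into Azure
    hop_ms = (time.monotonic() - start) * 1000
    print(f"bridge hop latency: {hop_ms:.0f} ms")  # SLI behind the publish-latency target
```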
Recommended SLO Profiles
Kafka-Only Service (Microservice A)
Message Publish Latency (p95): < 100ms
Message Consume Latency (p95): < 200ms
Error Rate: < 0.01%
Alerts: Partition lag, offset commit failures
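Of those alerts, partition lag is the one worth automating first. A rough lag computation with confluent-kafka might look like this (topic, group id, and partition count are placeholders):

```python
# Sketch: per-partition consumer lag = high watermark - committed offset.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "kafka-onprem:9092",  # placeholder
    "group.id": "microservice-a",
})

# Check three partitions of a placeholder topic.
partitions = [TopicPartition("orders", p) for p in range(3)]

for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low  # no commit yet -> assume low
    print(f"partition {tp.partition}: lag={high - committed}")
```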
Kafka to EventHub (Microservice B)
Kafka to EventBridge Latency (p95): < 150ms
EventBridge to EventHub Publish Latency (p95): < 300ms
EventHub Consumer Processing (p95): < 500ms
End-to-End Latency SLO (p95): < 1s
Error Rate: < 0.1%
These are just reference targets. Actual SLOs should be derived from historical performance, business criticality, and user-facing impact.
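For the end-to-end target, one workable SLI (an assumption on our part, not the only option) is a produce-time timestamp stamped onto the Kafka message and carried through the bridge as an EventHub application property, then read back at the final consumer:

```python
# Sketch: derive end-to-end latency from a produce-time timestamp.
# Assumes the Kafka producer stamped a "produced_at_ms" property on every
# message and the bridge copied it into EventHub application properties.
import time
from azure.eventhub import EventHubConsumerClient

def on_event(partition_context, event):
    stamp = event.properties.get(b"produced_at_ms")  # hypothetical property
    if stamp is not None:
        e2e_ms = time.time() * 1000 - int(stamp)
        print(f"end-to-end latency: {e2e_ms:.0f} ms")  # SLI for the p95 < 1s SLO

client = EventHubConsumerClient.from_connection_string(
    conn_str="<EVENTHUB_CONNECTION_STRING>",  # placeholder
    consumer_group="$Default",
    eventhub_name="orders",                   # placeholder
)
with client:
    client.receive(on_event=on_event, starting_position="-1")
```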
Why Uniform SLOs Are Risky
Even though both services "handle messages," their operational environments are fundamentally different.
You can’t control Azure EventHub internals or metrics resolution.
Throttling behavior in Azure may degrade latency under high load.
Failure recovery and observability are more predictable in Kafka.
The EventBridge microservice itself becomes a point of failure or latency amplification.
Uniform SLOs overlook these architectural nuances and may result in misleading reliability metrics, poor alerting, or missed SLIs.
Final Takeaway
SLOs are only valuable when they reflect the actual behavior and dependencies of your architecture. Applying the same SLOs to fundamentally different systems like Kafka and EventHub ignores the realities of hybrid environments.
When you bridge systems across on-premises and cloud, treat them as separate SLO domains—each with its own latency budget, failure profile, and observability constraints.