Why SRE SLOs for Azure EventHub and Apache Kafka Must Be Different

A Real-World Microservices Architecture Case Study with DataDog and a Custom EventBridge


Background: A Real-World Enterprise Scenario

In a recent architecture working session, our engineering team was tasked with defining SLOs (Service Level Objectives) for two microservices running in a large enterprise environment. The platform involves:

  • 500+ microservices

  • A 100-node Apache Kafka on-premises cluster; most business applications (yet to be migrated to Azure) run on-premises as microservices (Java, .NET, Python, Node.js, etc.)

  • Several workloads and microservices deployed on Azure Kubernetes Service (AKS) and integrated with Azure EventHub

Two services in particular came up for SLO instrumentation:

  • Microservice A (deployed on on-premises Kubernetes): Produces and consumes messages directly from the Kafka cluster.

  • Microservice B (deployed on Azure Kubernetes): Consumes messages from both Kafka and Azure EventHub. The connection between the two is handled by a custom-built microservice called HomeGrown EventBridge, which reads messages from Kafka and writes them to EventHub, and vice versa.

We were defining SLOs using DataDog when a colleague suggested both microservices should have the same SLOs - since both "just process and publish messages."

This post explains why SLOs must reflect the underlying system architecture, not just the business function.

I Disagreed

What Are SRE SLOs?

In Site Reliability Engineering (SRE), a Service Level Objective (SLO) is a clearly defined performance target that reflects how reliable a service must be. SLOs are usually defined in terms of:

  • Latency (e.g., "95% of messages processed within 200ms")

  • Availability (e.g., "Service is available 99.9% of the time")

  • Error Rate (e.g., "Less than 0.1% publish failures over 1 hour")

SLOs help teams align on what "good enough" means and form the basis for error budgets, incident management thresholds, and service capacity planning.
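To make the error-budget relationship concrete, here is a minimal sketch (the numbers are illustrative placeholders, not targets from this post) of how an availability SLO translates into allowed downtime over a window:

```python
from datetime import timedelta

def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime allowed within `window` for a given availability SLO.

    slo_target: a fraction, e.g. 0.999 for "99.9% available".
    """
    return window * (1.0 - slo_target)

# A 99.9% availability SLO over a 30-day window leaves
# roughly 43 minutes of error budget.
budget = error_budget(0.999, timedelta(days=30))
print(budget)  # 0:43:12
```

Once the budget is spent (by outages or degraded performance), the error-budget policy typically pauses risky releases until reliability recovers.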

Tools Commonly Used to Define and Monitor SLOs

Depending on your stack, there are several tools that help teams define, visualize, and track SLOs:

| Tool | Key Capabilities |
| --- | --- |
| DataDog | Easy setup, built-in SLO widgets, multi-source metrics |
| Nobl9 | Dedicated SLO platform, integrates with cloud and on-prem |
| Prometheus + Grafana | Open-source, highly customizable for SRE SLOs |
| Google Cloud Monitoring | Native SLO definition for GCP workloads |
| New Relic / Dynatrace | Built-in SLO tracking for full-stack monitoring |

In our case, we used DataDog to define custom latency and error rate SLOs for each microservice.
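Whatever the backend, the underlying SLIs are the same: per-request latency and an error count. As a hedged, stdlib-only sketch (the `SliRecorder` class is hypothetical, not a DataDog API; in practice these measurements would be forwarded to a monitoring agent), latency and error-rate SLIs can be captured like this:

```python
import functools
import time
from statistics import quantiles

class SliRecorder:
    """Hypothetical in-process recorder for latency and error-rate SLIs."""

    def __init__(self):
        self.latencies_ms: list[float] = []
        self.errors = 0
        self.calls = 0

    def track(self, fn):
        """Decorator that records latency and errors for each call."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            self.calls += 1
            try:
                return fn(*args, **kwargs)
            except Exception:
                self.errors += 1
                raise
            finally:
                # Record latency for successes and failures alike.
                self.latencies_ms.append((time.perf_counter() - start) * 1000)
        return wrapper

    def p95_ms(self) -> float:
        # 95th percentile of observed latencies (needs >= 2 samples).
        return quantiles(self.latencies_ms, n=100)[94]

    def error_rate(self) -> float:
        return self.errors / self.calls if self.calls else 0.0
```

In production, a recorder like this would flush measurements as custom metrics (e.g., histograms via a local agent) rather than hold them in memory; the SLO widget then evaluates the p95 and error-rate series against the targets.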

Why Azure EventHub and Apache Kafka Deserve Different SLOs

At a glance, both services appear to be doing the same thing: publishing and consuming messages. But architecturally, they’re built on very different foundations with vastly different runtime characteristics.

Let’s break it down.

Architectural Differences

| Characteristic | Apache Kafka (On-Premises) | Azure EventHub |
| --- | --- | --- |
| Deployment Control | Full control: tuning, partitioning, retention | Managed by Azure |
| Latency Profile | Tuned for low latency | Moderate latency with potential throttling |
| Backpressure Handling | Controlled by consumer config | Azure decides retry/backoff |
| Failure Isolation | Local, more predictable | Shared Azure infrastructure |
| Metrics Resolution | Fine-grained, near real-time | 1-minute resolution for most metrics |

Additional Complexity: Bridging the Two with EventBridge

For Microservice B, the architecture includes:

Kafka → EventBridge → Azure EventHub → EventHub Consumer

This introduces:

  • Extra network hops

  • Serialization/deserialization overhead

  • Protocol translation risks (e.g., Avro-to-JSON)

  • Multiple failure points (Kafka, EventBridge, EventHub)

Each of these layers adds latency and operational uncertainty—both critical to consider in your SLO targets.
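The multiple failure points also compound. Assuming independent failures and purely illustrative per-hop availabilities (these are placeholder numbers, not measurements from this system), the availability of the serial chain is the product of its hops:

```python
from math import prod

# Illustrative per-hop availabilities (placeholders, not measured values).
hops = {
    "kafka": 0.9995,
    "eventbridge": 0.999,   # the custom HomeGrown EventBridge
    "eventhub": 0.9995,
}

# A serial chain is only as available as the product of its hops
# (assuming independent failures).
chain_availability = prod(hops.values())
print(f"{chain_availability:.4%}")  # ~99.80%, worse than any single hop
```

This is why the bridged path cannot credibly carry the same availability SLO as the Kafka-only path: every added hop lowers the ceiling.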

Kafka-Only Service (Microservice A)

  • Message Publish Latency (p95): < 100ms

  • Message Consume Latency (p95): < 200ms

  • Error Rate: < 0.01%

  • Alerts: Partition lag, offset commit failures

Kafka to EventHub (Microservice B)

  • Kafka to EventBridge Latency (p95): < 150ms

  • EventBridge to EventHub Publish Latency (p95): < 300ms

  • EventHub Consumer Processing (p95): < 500ms

  • End-to-End Latency SLO (p95): < 1s

  • Error Rate: < 0.1%

These are just reference targets. Actual SLOs should be derived from historical performance, business criticality, and user-facing impact.
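One quick sanity check on such per-hop targets: summing per-hop p95s is only a rough, conservative heuristic (percentiles do not add exactly), but it confirms the hop budgets leave headroom under the end-to-end budget:

```python
# Per-hop p95 targets from the reference SLOs above, in milliseconds.
hop_budgets_ms = {
    "kafka_to_eventbridge": 150,
    "eventbridge_to_eventhub": 300,
    "eventhub_consumer": 500,
}

end_to_end_budget_ms = 1000  # the 1s end-to-end p95 SLO

# Rough check: the sum of hop p95s should stay under the
# end-to-end budget, leaving slack for queueing and retries.
total = sum(hop_budgets_ms.values())
headroom_ms = end_to_end_budget_ms - total
print(total, headroom_ms)  # 950 50
```

Only 50ms of slack means any new hop or retry policy should trigger a review of the end-to-end target.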

Why Uniform SLOs Are Risky

Even though both services "handle messages," their operational environments are fundamentally different.

  • You can’t control Azure EventHub internals or metrics resolution.

  • Throttling behavior in Azure may degrade latency under high load.

  • Failure recovery and observability are more predictable in Kafka.

  • The EventBridge microservice itself becomes a point of failure or latency amplification.

Uniform SLOs overlook these architectural nuances and can produce misleading reliability metrics, poor alerting, and SLIs that miss real degradations.

Final Takeaway

SLOs are only valuable when they reflect the actual behavior and dependencies of your architecture. Applying the same SLOs to fundamentally different systems like Kafka and EventHub ignores the realities of hybrid environments.

When you bridge systems across on-premises and cloud, treat them as separate SLO domains—each with its own latency budget, failure profile, and observability constraints.
