Notification Service — Fault-tolerant service on microservices


Thumbnail for Notification Service, a Framer Marketplace tutorial by Lina Sizov.

How I designed and built a fault-tolerant notification service that gracefully handles failures, ensures delivery, and integrates seamlessly in a microservice environment.

Goals of the Notification Service

Asynchronous message processing
Retry mechanism in case of failures
Deduplication to avoid re-notifying
Monitoring & Alerts on delivery failure
Support multiple channels: Email, SMS, In-App
Decouple producer from consumer

Architecture at a Glance

User Service --> Kafka Topic --> Notification Service --> Email/SMS Providers | Dead Letter Queue
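The last hop in that flow, from the Notification Service to a channel-specific provider, can be sketched as a small router. This is only an illustration under assumptions: `ChannelRouter`, `register`, and `route` are hypothetical names, and the senders are plain callbacks standing in for real Email/SMS clients.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical router: maps a channel name from the event to a sender callback. */
final class ChannelRouter {
    // channel name -> sender that accepts (userId, message)
    private final Map<String, BiConsumer<String, String>> senders = new HashMap<>();

    /** Registers a sender for one channel and returns this router for chaining. */
    ChannelRouter register(String channel, BiConsumer<String, String> sender) {
        senders.put(channel, sender);
        return this;
    }

    /** Hands the message to the sender registered for the channel. */
    void route(String channel, String userId, String message) {
        BiConsumer<String, String> sender = senders.get(channel);
        if (sender == null) {
            // Surfacing unknown channels as errors lets the caller retry or dead-letter them.
            throw new IllegalArgumentException("No sender registered for channel " + channel);
        }
        sender.accept(userId, message);
    }
}
```

Keeping the channel lookup in one place means a new channel only needs one more `register` call, and an unknown channel fails loudly instead of being dropped.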

Tech Stack Used in this application

Java + Spring Boot for the microservice
Kafka for event streaming
Redis for caching and deduplication
MySQL for audit logs
Prometheus + Grafana for observability

Step 1: Event-Driven Notification

Instead of direct API calls, we used Kafka to send notification events:

    {
      "userId": "u123",
      "channel": "EMAIL",
      "message": "Welcome to our platform!",
      "retryCount": 0,
      "notificationId": "uuid-456"
    }

This message lands in the notification-events topic and is consumed asynchronously.
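On the consumer side, the payload could map to a small immutable class like the one below. This is a sketch, not the service's actual model: the field names follow the JSON event, while the `Channel` enum values and the `nextAttempt()` helper are assumptions added for illustration.

```java
import java.util.Objects;

/** Channels listed in the goals; the exact enum is an assumption. */
enum Channel { EMAIL, SMS, IN_APP }

/** Immutable representation of one notification event. */
final class NotificationEvent {
    private final String notificationId;
    private final String userId;
    private final Channel channel;
    private final String message;
    private final int retryCount;

    NotificationEvent(String notificationId, String userId, Channel channel,
                      String message, int retryCount) {
        this.notificationId = Objects.requireNonNull(notificationId, "notificationId");
        this.userId = Objects.requireNonNull(userId, "userId");
        this.channel = Objects.requireNonNull(channel, "channel");
        this.message = Objects.requireNonNull(message, "message");
        if (retryCount < 0) {
            throw new IllegalArgumentException("retryCount must be >= 0");
        }
        this.retryCount = retryCount;
    }

    String getNotificationId() { return notificationId; }
    String getUserId() { return userId; }
    Channel getChannel() { return channel; }
    String getMessage() { return message; }
    int getRetryCount() { return retryCount; }

    /** Hypothetical helper: a copy of this event for the next delivery attempt. */
    NotificationEvent nextAttempt() {
        return new NotificationEvent(notificationId, userId, channel, message, retryCount + 1);
    }
}
```

Because the event is immutable, a retry re-publishes a new copy with an incremented `retryCount` rather than mutating a message that may still be in flight.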

Step 2: Retry with Backoff Strategy

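Delivery can fail because of provider issues (e.g., SendGrid being down), so we implemented retry logic with exponential backoff. A failed event is re-published with a delay, and an event that has exhausted its retries goes to a Dead Letter Queue.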

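    import org.springframework.kafka.annotation.KafkaListener;

    @KafkaListener(topics = "notification-events")
    public void consume(NotificationEvent event) {
        try {
            sendNotification(event);
        } catch (Exception ex) {
            if (event.getRetryCount() < MAX_RETRY) {
                // Re-publish with a delay. scheduleRetry is expected to increment
                // retryCount so that the retries eventually stop.
                scheduleRetry(event);
            } else {
                // Retries exhausted: park the event in the Dead Letter Queue.
                sendToDLQ(event);
            }
        }
    }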
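The listener above does not show how the delay grows between attempts. A minimal sketch of an exponential backoff calculation is given here; `BackoffPolicy`, its base delay, cap, and retry limit are all assumed names and values, not the ones used in the service.

```java
import java.time.Duration;

/** Hypothetical policy: the delay doubles per retry, up to a fixed cap. */
final class BackoffPolicy {
    private final Duration baseDelay;
    private final Duration maxDelay;
    private final int maxRetries;

    BackoffPolicy(Duration baseDelay, Duration maxDelay, int maxRetries) {
        if (baseDelay.isNegative() || baseDelay.isZero()) {
            throw new IllegalArgumentException("baseDelay must be positive");
        }
        if (maxDelay.compareTo(baseDelay) < 0) {
            throw new IllegalArgumentException("maxDelay must be >= baseDelay");
        }
        this.baseDelay = baseDelay;
        this.maxDelay = maxDelay;
        this.maxRetries = maxRetries;
    }

    /** Mirrors the retryCount < MAX_RETRY check in the listener. */
    boolean shouldRetry(int retryCount) {
        return retryCount < maxRetries;
    }

    /** Delay before the retry that follows `retryCount` earlier retries. */
    Duration delayFor(int retryCount) {
        if (retryCount < 0) {
            throw new IllegalArgumentException("retryCount must be >= 0");
        }
        long delay = baseDelay.toMillis();
        long cap = maxDelay.toMillis();
        for (int i = 0; i < retryCount && delay < cap; i++) {
            // Doubling would exceed the cap (or overflow), so clamp instead.
            delay = delay > cap / 2 ? cap : delay * 2;
        }
        return Duration.ofMillis(delay);
    }
}
```

With a base of 1 second and a cap of 60 seconds, the delays run 1s, 2s, 4s, 8s and so on until they reach the cap. Adding random jitter to each delay is a common refinement when many events fail at the same moment; it is left out here to keep the sketch deterministic.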

Step 3: Avoid Duplicate Notifications

We store a hash of each notification in Redis with a TTL, so the same notification is not sent again while the key is alive.

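    import java.util.concurrent.TimeUnit;

    String hashKey = generateHash(notification);
    // hasKey returns a nullable Boolean, so compare explicitly instead of unboxing.
    if (Boolean.TRUE.equals(redisTemplate.hasKey(hashKey))) {
        return; // already sent within the TTL window
    }
    sendNotification(notification);
    // Mark as sent only after a successful send; the key expires after one hour.
    redisTemplate.opsForValue().set(hashKey, "sent", 1, TimeUnit.HOURS);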
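The snippet above leaves `generateHash` undefined, and checking the key and setting it are two separate steps, so two consumers could both pass the check. The sketch below shows one possible hash (SHA-256 over user, channel, and message, which is an assumption) and an atomic claim with expiry. It uses an in-memory map as a stand-in for Redis so that it runs on its own; `DedupGuard` and `tryClaim` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** In-memory stand-in for the Redis dedup keys; not a replacement for Redis. */
final class DedupGuard {
    private final ConcurrentHashMap<String, Long> expiryByKey = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so tests can control time

    DedupGuard(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** One possible generateHash: SHA-256 over the identifying fields. */
    static String hashKey(String userId, String channel, String message) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            // The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
            String joined = userId + '\0' + channel + '\0' + message;
            byte[] bytes = digest.digest(joined.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder("notif:");
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    /** Returns true only for the first caller per key within the TTL window. */
    boolean tryClaim(String key) {
        long now = clock.getAsLong();
        boolean[] claimed = {false};
        // compute() runs atomically per key, so check and set cannot interleave.
        expiryByKey.compute(key, (k, expiry) -> {
            if (expiry == null || expiry <= now) {
                claimed[0] = true;
                return now + ttlMillis;
            }
            return expiry;
        });
        return claimed[0];
    }
}
```

In Redis, the equivalent of `tryClaim` would be a single set-if-absent with expiry, for example `redisTemplate.opsForValue().setIfAbsent(hashKey, "sent", 1, TimeUnit.HOURS)`. Note the trade-off: claiming before sending prevents duplicates under concurrency, but the key should be released if the send fails so that the retry is not suppressed.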

Step 4: Observability

We integrated Prometheus + Grafana to track metrics:

Number of notifications sent
Retry count
Failure rate per provider
DLQ size

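    import io.prometheus.client.Counter;

    // Register once (for example as a static field); registering the same name twice fails.
    Counter notificationSent = Counter.build()
        .name("notifications_sent_total")
        .help("Total number of notifications sent")
        .register();

    // Increment after each successful send.
    notificationSent.inc();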
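To make "failure rate per provider" concrete, here is a dependency-free sketch of the bookkeeping behind that metric. `DeliveryStats` and its methods are hypothetical names; in a Prometheus setup the usual approach is to export labelled counters and compute the ratio in the query layer rather than in the service.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical per-provider success and failure counters. */
final class DeliveryStats {
    private final Map<String, LongAdder> succeeded = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> failed = new ConcurrentHashMap<>();

    void recordSuccess(String provider) {
        succeeded.computeIfAbsent(provider, p -> new LongAdder()).increment();
    }

    void recordFailure(String provider) {
        failed.computeIfAbsent(provider, p -> new LongAdder()).increment();
    }

    /** Failed attempts divided by all attempts; 0.0 when nothing was recorded. */
    double failureRate(String provider) {
        long ok = count(succeeded, provider);
        long bad = count(failed, provider);
        long total = ok + bad;
        return total == 0 ? 0.0 : (double) bad / total;
    }

    private static long count(Map<String, LongAdder> counters, String provider) {
        LongAdder adder = counters.get(provider);
        return adder == null ? 0L : adder.sum();
    }
}
```

Tracking the rate per provider, rather than one global number, is what makes it possible to alert on a single failing provider while the others stay healthy.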

Step 5: Testing Fault Tolerance

We simulated failure conditions by:

Shutting down the email provider
Delaying Kafka messages
Generating high-throughput spikes

The system gracefully recovered using retry + DLQ monitoring.
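A provider outage like the first scenario can also be reproduced in a unit test without real infrastructure. The sketch below is an assumption about how such a test might look, not the setup used here: `FlakyProvider` fails a fixed number of calls before recovering, and `RetryHarness` is a simplified, synchronous stand-in for the retry and DLQ flow with no delays.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Test double: fails a fixed number of calls, then starts delivering. */
final class FlakyProvider implements Consumer<String> {
    private int failuresLeft;
    final List<String> delivered = new ArrayList<>();

    FlakyProvider(int failuresBeforeRecovery) {
        this.failuresLeft = failuresBeforeRecovery;
    }

    @Override
    public void accept(String message) {
        if (failuresLeft > 0) {
            failuresLeft--;
            throw new IllegalStateException("simulated provider outage");
        }
        delivered.add(message);
    }
}

/** Simplified retry loop: one initial attempt plus up to maxRetry retries. */
final class RetryHarness {
    final List<String> deadLetters = new ArrayList<>();
    private final int maxRetry;

    RetryHarness(int maxRetry) {
        this.maxRetry = maxRetry;
    }

    /** Returns the number of attempts made; dead-letters the message if all fail. */
    int deliver(String message, Consumer<String> provider) {
        for (int attempt = 0; attempt <= maxRetry; attempt++) {
            try {
                provider.accept(message);
                return attempt + 1;
            } catch (RuntimeException ex) {
                // Swallow and try again; a real consumer would back off here.
            }
        }
        deadLetters.add(message);
        return maxRetry + 1;
    }
}
```

A short outage (fewer failures than `maxRetry`) should end with the message delivered and an empty dead-letter list, while a long outage should end with the message in `deadLetters` and nothing silently dropped.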

What I Learned

Don't treat notifications as fire-and-forget
Retry logic is not enough — you need observability
Use idempotency to avoid chaos during retries
DLQs are your friend — never silently drop failed messages

A fault-tolerant notification service is not about over-engineering; it's about anticipating failures. With Kafka, retry mechanisms, Redis deduplication, and monitoring in place, your system can keep users in the loop — even when things go wrong behind the scenes.

Note

Always think like a failure will happen. Then build like it won't hurt.
