Asynchronous message processing
Retry mechanism in case of failures
Deduplication to avoid re-notifying
Monitoring & Alerts on delivery failure
Support multiple channels: Email, SMS, In-App
Decouple producer from consumer
Architecture at a Glance
User Service --> Kafka Topic --> Notification Service --> Email/SMS Providers | Dead Letter Queue
Tech Stack Used in this application
Java + Spring Boot for the microservice
Kafka for event streaming
Redis for caching and deduplication
MySQL for audit logs
Prometheus + Grafana for observability
Instead of direct API calls, we used Kafka to send notification events:
{
  "userId": "u123",
  "channel": "EMAIL",
  "message": "Welcome to our platform!",
  "retryCount": 0,
  "notificationId": "uuid-456"
}
This message lands in the notification-events topic and is consumed asynchronously.
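The consumer code later in this post calls event.getRetryCount(), so it helps to see what such an event type might look like. The class below is a minimal sketch, not the actual class from the project: the field names follow the JSON payload, while the Channel enum (including IN_APP) and the nextAttempt() helper are assumptions added for illustration.

```java
import java.util.Objects;

// Hypothetical event type; field names mirror the JSON payload shown earlier.
final class NotificationEvent {
    // EMAIL appears in the sample payload; SMS and IN_APP are assumed spellings.
    enum Channel { EMAIL, SMS, IN_APP }

    private final String notificationId;
    private final String userId;
    private final Channel channel;
    private final String message;
    private final int retryCount;

    NotificationEvent(String notificationId, String userId, Channel channel,
                      String message, int retryCount) {
        if (retryCount < 0) {
            throw new IllegalArgumentException("retryCount must be >= 0");
        }
        this.notificationId = Objects.requireNonNull(notificationId, "notificationId");
        this.userId = Objects.requireNonNull(userId, "userId");
        this.channel = Objects.requireNonNull(channel, "channel");
        this.message = Objects.requireNonNull(message, "message");
        this.retryCount = retryCount;
    }

    String getNotificationId() { return notificationId; }
    String getUserId() { return userId; }
    Channel getChannel() { return channel; }
    String getMessage() { return message; }
    int getRetryCount() { return retryCount; }

    /** Returns a copy with retryCount + 1, suitable for re-publishing after a failure. */
    NotificationEvent nextAttempt() {
        return new NotificationEvent(notificationId, userId, channel, message, retryCount + 1);
    }
}
```

Keeping the event immutable means a retry is always a new message with the same notificationId, which makes it easier to trace one notification across attempts.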
Failures can happen due to provider issues (e.g., SendGrid being down), so we implemented retry logic with exponential backoff.
@KafkaListener(topics = "notification-events")
public void consume(NotificationEvent event) {
    try {
        sendNotification(event);
    } catch (Exception ex) {
        if (event.getRetryCount() < MAX_RETRY) {
            scheduleRetry(event); // re-publish with delay
        } else {
            sendToDLQ(event); // Dead Letter Queue
        }
    }
}
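The listener calls scheduleRetry without showing how the delay grows between attempts. One common way to compute an exponential backoff is sketched below. The RetryBackoff class, the 1-second base, and the 5-minute cap are all assumptions for illustration, not values from the original service.

```java
import java.time.Duration;

// Hypothetical helper; not part of the original service code.
final class RetryBackoff {
    // Assumed values, chosen only for illustration.
    static final Duration BASE_DELAY = Duration.ofSeconds(1);
    static final Duration MAX_DELAY = Duration.ofMinutes(5);

    /** Delay before a retry: BASE_DELAY * 2^retryCount, capped at MAX_DELAY. */
    static Duration delayFor(int retryCount) {
        if (retryCount < 0) {
            throw new IllegalArgumentException("retryCount must be >= 0");
        }
        // Bound the shift so the multiplication cannot overflow a long.
        int shift = Math.min(retryCount, 20);
        long millis = BASE_DELAY.toMillis() * (1L << shift);
        return millis >= MAX_DELAY.toMillis() ? MAX_DELAY : Duration.ofMillis(millis);
    }
}
```

With these assumed values the delays run 1s, 2s, 4s, 8s and so on until they reach the cap. The cap keeps a long outage from pushing a retry hours into the future.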
We store a hash of each notification in Redis with a TTL to avoid re-sending the same notification.
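String hashKey = generateHash(notification);

// hasKey returns a boxed Boolean, so compare null-safely
if (Boolean.TRUE.equals(redisTemplate.hasKey(hashKey))) {
    return; // already sent within the TTL window
}

sendNotification(notification);
redisTemplate.opsForValue().set(hashKey, "sent", 1, TimeUnit.HOURS);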
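The snippet leaves two things open: how generateHash builds its key, and what happens when two consumers check the same key at the same moment. Checking and then setting are two separate steps, so both consumers could pass the check. The sketch below is one possible answer to each, under stated assumptions. It hashes userId, channel, and message with SHA-256, which is an assumed choice of fields, and it uses an in-memory map as a stand-in for Redis to show an atomic "claim" operation. DedupSketch and claim are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; the original generateHash and Redis wiring are not shown in the post.
final class DedupSketch {
    /** SHA-256 over the fields assumed to identify a duplicate, as "notif:" + lowercase hex. */
    static String generateHash(String userId, String channel, String message) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            // Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
            for (String field : new String[] {userId, channel, message}) {
                byte[] bytes = field.getBytes(StandardCharsets.UTF_8);
                digest.update(Integer.toString(bytes.length).getBytes(StandardCharsets.UTF_8));
                digest.update((byte) ':');
                digest.update(bytes);
            }
            StringBuilder key = new StringBuilder("notif:");
            for (byte b : digest.digest()) {
                key.append(String.format("%02x", b));
            }
            return key.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    // In-memory stand-in for Redis: key -> expiry time in epoch milliseconds.
    private final ConcurrentHashMap<String, Long> expiryByKey = new ConcurrentHashMap<>();

    /** Atomically claims the key; returns true only for the first caller within the TTL. */
    boolean claim(String key, long ttlMillis, long nowMillis) {
        boolean[] claimed = {false};
        expiryByKey.compute(key, (k, expiry) -> {
            if (expiry == null || expiry <= nowMillis) {
                claimed[0] = true;
                return nowMillis + ttlMillis;
            }
            return expiry; // still held by an earlier claim
        });
        return claimed[0];
    }
}
```

Against a real Redis, the same single-step claim is usually done with SET NX plus an expiry. In Spring Data Redis I believe that is redisTemplate.opsForValue().setIfAbsent(key, value, timeout, unit), but treat that as something to confirm against the version you use. Note that claiming before sending changes the failure path: if the send fails, the key has to be released (or the retry has to skip the check), otherwise the retry would be dropped as a duplicate.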
We integrated Prometheus + Grafana to track metrics:
Number of notifications sent
Retry count
Failure rate per provider
DLQ size
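import io.prometheus.client.Counter;

// Register once (for example as a static field), then reuse
Counter notificationSent = Counter.build()
    .name("notifications_sent_total")
    .help("Total number of notifications sent")
    .register();

// Call after each successful send
notificationSent.inc();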
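The counter covers "notifications sent", but the list also mentions failure rate per provider. With Prometheus you would normally keep a labeled counter per outcome and compute the ratio at query time. To make the arithmetic concrete, here is a plain-Java stand-in that tracks successes and failures per provider and derives the rate. ProviderStats and its methods are hypothetical names, and this is not how the original service computes the metric.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-memory stand-in for per-provider success/failure counters.
final class ProviderStats {
    private final ConcurrentHashMap<String, LongAdder> succeeded = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, LongAdder> failed = new ConcurrentHashMap<>();

    void recordSuccess(String provider) {
        succeeded.computeIfAbsent(provider, p -> new LongAdder()).increment();
    }

    void recordFailure(String provider) {
        failed.computeIfAbsent(provider, p -> new LongAdder()).increment();
    }

    /** failures / (successes + failures), or 0.0 when nothing has been recorded. */
    double failureRate(String provider) {
        long ok = count(succeeded, provider);
        long bad = count(failed, provider);
        long total = ok + bad;
        return total == 0 ? 0.0 : (double) bad / total;
    }

    private static long count(ConcurrentHashMap<String, LongAdder> counters, String provider) {
        LongAdder adder = counters.get(provider);
        return adder == null ? 0L : adder.sum();
    }
}
```

In the Prometheus Java simpleclient, the equivalent would likely be a counter built with labelNames("provider") and incremented via labels("sendgrid").inc(), with the rate left to the dashboard query.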
We simulated outages using:
Shutting down email provider
Delaying Kafka messages
High throughput spikes
The system gracefully recovered using retry + DLQ monitoring.
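To give a feel for the provider-outage scenario, here is a small self-contained simulation of the retry-then-DLQ flow. It is a sketch, not the test harness we actually ran: OutageSimulation, FlakyProvider, and the MAX_RETRY value of 3 are assumptions, and the retry loop runs inline instead of re-publishing to Kafka with a delay.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation; none of these names come from the original service.
final class OutageSimulation {
    static final int MAX_RETRY = 3; // assumed limit

    /** A provider that fails for the first `failures` calls, then succeeds. */
    static final class FlakyProvider {
        private int remainingFailures;

        FlakyProvider(int failures) {
            this.remainingFailures = failures;
        }

        void send(String notificationId) {
            if (remainingFailures > 0) {
                remainingFailures--;
                throw new IllegalStateException("provider unavailable");
            }
        }
    }

    final List<String> delivered = new ArrayList<>();
    final List<String> deadLetters = new ArrayList<>();

    /** Mirrors the consume() flow: try, retry while retryCount < MAX_RETRY, else DLQ. */
    void process(String notificationId, FlakyProvider provider) {
        int retryCount = 0;
        while (true) {
            try {
                provider.send(notificationId);
                delivered.add(notificationId);
                return;
            } catch (IllegalStateException ex) {
                if (retryCount < MAX_RETRY) {
                    retryCount++; // a real service would re-publish with a delay here
                } else {
                    deadLetters.add(notificationId); // never drop silently
                    return;
                }
            }
        }
    }
}
```

A provider that fails three times is still delivered on the fourth attempt, while one that fails four times ends up in deadLetters. That is the behavior the DLQ size metric is meant to surface.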
What I Learned
Don't treat notifications as fire-and-forget
Retry logic is not enough — you need observability
Use idempotency to avoid chaos during retries
DLQs are your friend — never silently drop failed messages
A fault-tolerant notification service is not about over-engineering; it's about anticipating failures. With Kafka, retry mechanisms, Redis deduplication, and monitoring in place, your system can keep users in the loop — even when things go wrong behind the scenes.
Note
Always think like a failure will happen. Then build like it won't hurt.