You Wrote a Retry Loop. Now You Maintain a Mail Server.
You added email to your app. It was supposed to be simple. Send a request, get a 200, move on. Then an email bounced. So you added a retry. Then retries started piling up, so you added a queue. Then the queue needed backoff logic, deduplication, and failure routing. Now you are three abstractions deep into a problem that was never yours to solve.
It Starts with One Retry
Every email retry story begins the same way. A transactional email fails. The fix seems obvious: try again.
So you wrap your send call in a retry block. Three attempts, a small delay between each. It works. The email goes through on the second try. You commit the code and move on.
But that retry has no awareness of what went wrong. It does not know if the failure was a temporary DNS issue, a full recipient mailbox, a blocklisted IP, or a permanent address rejection. It treats every failure the same way: try again and hope for the best.
This is not retry logic. This is a coin flip with a loop around it.
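A minimal sketch of that first retry block, to make the problem concrete. The `send` callable stands in for whatever your email API client exposes; note that the `except` clause cannot distinguish a DNS hiccup from a permanently rejected address.

```python
import time

def send_with_retry(send, attempts=3, delay=1.0):
    """Naive retry: every failure looks the same, so every failure
    gets the same treatment -- wait a bit and try again."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except Exception as exc:  # DNS timeout? Blocklist? Bad address? No idea.
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)
    raise last_error
```

Three attempts, a fixed delay, and no knowledge of why anything failed. This is the commit that starts the story.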
The Queue You Did Not Plan For
Retries work fine when failure is rare. But at volume, even a 2% failure rate produces a steady stream of retries. Your retry loop blocks other sends. Response times degrade.
So you move the retry into a background job. You pick a job queue library. You configure workers. You add a dead letter queue for emails that fail all retries. You write monitoring for queue depth and worker health.
None of this has anything to do with your product. You are building infrastructure for a problem that should not exist.
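The shape of that infrastructure, sketched with in-memory queues standing in for a real job backend (the worker loop, attempt counting, and dead letter routing are the parts you end up owning regardless of which library you pick):

```python
from queue import Queue, Empty

MAX_ATTEMPTS = 3

def retry_worker(jobs: Queue, dead_letter: Queue, send) -> None:
    """Drain the job queue: re-enqueue failures until they exhaust
    their attempts, then route them to the dead letter queue."""
    while True:
        try:
            email, attempts = jobs.get_nowait()
        except Empty:
            return
        try:
            send(email)
        except Exception:
            if attempts + 1 >= MAX_ATTEMPTS:
                dead_letter.put(email)  # somebody has to watch this queue
            else:
                jobs.put((email, attempts + 1))
```

A production version adds persistence, worker supervision, and queue-depth monitoring on top of this loop, which is exactly how the surface area keeps growing.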
Backoff Is Harder Than It Looks
Simple retry loops hammer the receiving server. If the server is overloaded, retrying immediately makes things worse. So you implement exponential backoff. First retry after 30 seconds, then 2 minutes, then 10, then an hour.
But now you have timing state. Each email tracks attempt count, last attempt time, and next scheduled attempt. You need persistence and a scheduler.
You also need jitter, because if a hundred emails all fail at the same time, you do not want them all retrying at the same time.
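Backoff with jitter can be sketched in a few lines. The base delay and cap below are illustrative values, not a recommendation; "full jitter" (pick a random point anywhere inside the growing window) is one common strategy among several.

```python
import random

BASE_DELAY = 30    # seconds before the first retry (illustrative)
MAX_DELAY = 3600   # cap the window at one hour

def next_retry_delay(attempt: int) -> float:
    """Exponential backoff with full jitter: double the window on each
    attempt, then pick a random point inside it so a burst of
    simultaneous failures does not retry in lockstep."""
    window = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
    return random.uniform(0, window)
```

The function is the easy part. The hard part is the state around it: persisting the attempt count and next-attempt time per email, and running a scheduler that honors them.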
Then you discover that different failure types need different strategies. A 421 (service not available, a transient server-side condition) should retry sooner than a 450 (mailbox temporarily unavailable). A 550 (permanent rejection) should not retry at all. But your retry loop does not parse SMTP response codes. It just sees "failed" and tries again.
So you add response code parsing. Then you realize different mail servers return different formats for the same error. You start building a classification layer.
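The first cut of that classification layer might look like this. It only handles the numeric reply code; a real implementation also has to parse enhanced status codes and free-text reasons, which is where the per-server format differences bite.

```python
from enum import Enum

class Disposition(Enum):
    RETRY_SOON = "retry_soon"    # transient server-side issue
    RETRY_LATER = "retry_later"  # mailbox busy, greylisted, etc.
    GIVE_UP = "give_up"          # permanent rejection

def classify(smtp_code: int) -> Disposition:
    """Map an SMTP reply code to a retry strategy."""
    if smtp_code == 421:            # service not available: transient
        return Disposition.RETRY_SOON
    if 400 <= smtp_code < 500:      # other 4xx: temporary, back off longer
        return Disposition.RETRY_LATER
    if 500 <= smtp_code < 600:      # 5xx: permanent, retrying is pointless
        return Disposition.GIVE_UP
    return Disposition.RETRY_LATER  # unknown: err on the side of retrying
```

Every branch you add here is another piece of mail-server behavior you now own.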
You are now building a mail transfer agent. This was supposed to be a send call.
Deduplication Is the Silent Problem
Here is something most retry implementations miss entirely: duplicate sends.
Your background job picks up a failed email and retries it. But between the first attempt and the retry, the original send actually went through. The receiving server accepted it, but your system never got the acknowledgment because the connection dropped.
Now your user gets the same email twice. Or three times, if the retry runs again before you detect the duplicate.
Deduplication in email is not trivial. You cannot just check a database flag, because the send and the status update are not atomic. You need idempotency keys, message fingerprinting, or some form of distributed state that tracks what has actually been delivered versus what your system thinks has been delivered.
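An idempotency-key sketch shows both the idea and its limits. The fingerprint fields are illustrative; and note that `send_once` still has a window between the check and the send, which is precisely the non-atomicity the paragraph above describes.

```python
import hashlib

def idempotency_key(recipient: str, template: str, payload_id: str) -> str:
    """Derive a stable fingerprint for a logical send, so retries of
    the same message can be recognized across attempts."""
    raw = f"{recipient}|{template}|{payload_id}".encode()
    return hashlib.sha256(raw).hexdigest()

def send_once(delivered: set, key: str, send) -> bool:
    """Skip the send if this key is already marked delivered.
    In production `delivered` must be shared, durable state (e.g. a
    database column with a unique constraint), not an in-process set --
    and the check-then-send here is still not atomic."""
    if key in delivered:
        return False
    send()
    delivered.add(key)
    return True
```

The key derivation is trivial; making the delivered-state check correct under concurrent retries and dropped acknowledgments is the actual problem.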
Most teams do not implement this. Most teams do not even know it is a problem until a customer complains about getting five copies of the same invoice.
Failure Routing Is a Product Decision Pretending to Be an Engineering One
After all retries are exhausted, what happens to the email? This is where retry logic stops being an engineering problem and becomes a product problem that nobody planned for.
Does the email go to a dead letter queue? Who monitors that queue? When a critical password reset email lands there, how long before someone notices? Does the user get notified that their email was not delivered? Does the system fall back to a different delivery method?
Most implementations do none of this. The email quietly disappears. The dead letter queue grows. Nobody looks at it because nobody owns it.
Meanwhile, the user who never received their receipt, their booking confirmation, or their account verification just leaves. They do not file a bug report. They do not contact support. They just stop trusting your product.
The Monitoring Gap
You built retry logic, a queue, backoff, deduplication, and failure routing. But can you answer this question: right now, how many emails are stuck in retry?
Most teams cannot. The retry system is a black box: no dashboard, no alerting, no real-time visibility.
When delivery rates drop, you find out from customer support tickets, not from your monitoring. By then, hundreds of emails have already been lost.
Building observability into a custom retry system means instrumenting every stage: initial send, each retry attempt, backoff state, final disposition. It means dashboards, alerts, and audit trails for individual emails.
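Even the simplest version of that instrumentation, enough to answer "how many emails are stuck in retry right now?", means tracking a lifecycle state per message. A minimal sketch, with illustrative state names:

```python
from collections import Counter

class DeliveryTracker:
    """Record each message's latest lifecycle state so basic delivery
    questions have answers without grepping queue logs."""

    STATES = ("queued", "sending", "retrying", "delivered", "dead")

    def __init__(self):
        self._state = {}

    def transition(self, message_id: str, state: str) -> None:
        assert state in self.STATES, f"unknown state: {state}"
        self._state[message_id] = state

    def snapshot(self) -> Counter:
        """Count messages per state: the raw material for a dashboard."""
        return Counter(self._state.values())
```

This only covers current state. Alerts, per-email audit trails, and historical delivery rates each add another layer on top.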
This is what mail servers do. And you just rebuilt one from scratch, one patch at a time.
What You Actually Needed
The entire retry problem exists because of a gap between what your email API gives you and what email actually requires.
Your API gives you: send a message, get a status code. Email requires: send a message, track its delivery across multiple hops, handle temporary and permanent failures differently, retry with awareness of server responses, deduplicate across attempts, route failures to the right handler, and provide visibility into every stage of the process.
That gap is not something you should be filling with application code. It is infrastructure. It belongs in the layer that handles email, not in the layer that handles your business logic.
mailbot treats email delivery as a managed pipeline. When you send through mailbot, retry logic, backoff, failure classification, and deduplication are handled at the infrastructure level. Every email has a timeline of delivery events, from the initial send through each delivery attempt to the final disposition. You do not build the retry. You do not maintain the queue. You read the event stream and respond to outcomes.
Your application code stays clean. Your on-call engineers stay sane. And when an email fails, you know exactly why, without reading through queue logs at 2 AM.
Stop Building Mail Servers by Accident
Nobody sets out to build a mail server. It happens incrementally. A retry here, a queue there, a backoff strategy, a deduplication check. Each piece seems small and reasonable in isolation.
But step back and look at what you have built. You have a message queue with scheduling, a failure classification system, a deduplication layer, and (hopefully) monitoring. That is not application code. That is email infrastructure.
The question is not whether you can build it. You obviously can. The question is whether you should keep maintaining it, or whether you should let that complexity live where it belongs: in the email layer, managed by a system that was designed for it from the start.
Every hour your team spends debugging a retry loop is an hour they are not spending on your product. And your product is the thing your users are actually paying for.