
Why Your AI Agent Loses Context After Three Email Replies

· 8 min read
Founder, mailbot

Your AI email agent handles the first two replies flawlessly. It reads the original message, drafts a thoughtful response, and sends it on time. Then reply three arrives and something quietly goes wrong. The agent responds to the wrong topic. It asks a question the customer already answered. It loses the thread entirely.

This is not a reasoning failure. The underlying language model did not forget anything. The problem is infrastructure: the email headers that stitch a conversation together are either missing, truncated, or misread by the receiving client. By the time a thread hits three or more messages deep, the structural glue that email relies on has often already snapped.

How Email Threads Are Actually Built

Every email message carries a unique identifier called a Message-ID. When you reply to that message, your client adds two headers: In-Reply-To, which holds the parent message's Message-ID, and References, which holds the entire ancestry chain.

As defined by RFC 2822 (IETF), the References header of a reply consists of the parent message's own References field plus the parent's Message-ID appended at the end. Each new reply extends this chain. So by message four, the References header contains three Message-ID values, each pointing one step further back.
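As a concrete illustration, here is a minimal sketch of how a compliant client derives the two headers for a new reply. The function and field names are illustrative, not from any particular library:

```typescript
// Hypothetical sketch: derive threading headers for a reply, per RFC 2822.
// `parent` is the message being replied to; field names are illustrative.
function buildReplyHeaders(parent: { messageId: string; references: string[] }) {
  return {
    // In-Reply-To points at the parent message itself
    'In-Reply-To': parent.messageId,
    // References is the parent's own chain plus the parent's Message-ID
    'References': [...parent.references, parent.messageId].join(' '),
  };
}

// By reply three, the chain already carries two ancestors:
const headers = buildReplyHeaders({
  messageId: '<msg3@mail.example.com>',
  references: ['<msg1@mail.example.com>', '<msg2@mail.example.com>'],
});
```

Every reply repeats this step, so a single truncated or dropped References value propagates into every descendant message.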

This chain is how email clients group messages into a visible thread. Without it, each reply appears as a disconnected new conversation. Different email clients handle threading differently: Gmail uses References together with In-Reply-To and subject matching, Outlook leans primarily on References, and Apple Mail follows RFC 2822 most closely. Omitting the References header outright breaks threads after three or more messages in every major client.

Why the Break Happens at Reply Three (Not Reply One or Two)

The first reply only needs In-Reply-To. Many email clients and APIs get that right automatically. The second reply needs both In-Reply-To and a short References chain. Most still manage this correctly. But the third reply requires your sending code to correctly read and forward the entire References chain from the previous message, append the previous message's Message-ID, and then set the new In-Reply-To to point at that same previous message.

If any step in that chain is wrong, not just the most recent one, clients diverge. According to Alibaba LifeTips research on Outlook threading, Outlook splits 31.4% of threads that Gmail preserves intact. Outlook's threading is folder-bound and more sensitive to subject-line changes, while Gmail treats subject-matching as a fallback rather than a primary signal. A thread that looks continuous in Gmail may already be fragmented in Outlook before your agent sees it.

This is the compound problem. Your agent does not receive a clean, continuous conversation object. It receives fragments: some messages grouped, some orphaned, some duplicated across quoted footers. As one developer put it in a discussion on r/AI_Agents, roughly 80% of the tokens in a real email thread are duplicate quotes and footers, not new information. The same discussion notes that email context handling "evolves into a substantial infrastructure project" once you try to use it reliably in production.

The Token Waste and Context Collapse Problem

Before the headers even matter, there is a tokenization problem. When your agent fetches a thread naively, it is likely reading the same content four or five times because each reply quotes the previous one in full. The new information in message five may be two sentences. The payload you are feeding your agent may be four thousand tokens of redundant history.

This is a practical, measurable problem: when the bulk of a thread's tokens are duplicate quotes, signatures, and footers, your agent spends most of its context window processing information it has already seen.
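There is no universally reliable way to strip quoted history, because clients vary in their quoting conventions, but even a naive filter recovers most of the waste. A rough sketch, assuming conventional ">" quoting and "On ... wrote:" attribution lines:

```typescript
// Naive quote-stripping sketch; real clients vary, so treat this as a
// starting point rather than a complete parser.
function stripQuotedHistory(body: string): string {
  const kept: string[] = [];
  for (const line of body.split('\n')) {
    // An "On <date>, <sender> wrote:" line usually introduces the quoted parent
    if (/^On .+ wrote:$/.test(line.trim())) break;
    // Drop conventionally quoted lines
    if (line.trimStart().startsWith('>')) continue;
    kept.push(line);
  }
  return kept.join('\n').trim();
}
```

Running this over each message before assembling the agent's context removes most of the repeated history while keeping the new content intact.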

Then, when the References chain is broken, your agent cannot reliably determine which message is the true root of the conversation, which messages have already been handled, or what the correct In-Reply-To value for its outgoing reply should be. At that point, even a perfectly capable reasoning engine will produce incoherent replies because the input it received was incoherent. The agent did not fail. The input pipeline failed.

According to the Composio 2025 AI Agent Report, integration failure rather than language model failure is the primary cause of AI agent production failures. Email threading is a textbook example of this pattern. The model is fine. The plumbing is broken.

The Wrong Way: Manual Header Construction

Many developers, when building email reply logic, reach for client.messages.send() with manually constructed headers. It looks like this:

import { MailbotClient } from '@yopiesuryadi/mailbot-sdk';
const client = new MailbotClient({ apiKey: 'mb_test_xxx' });

// WRONG: manually constructing threading headers
const response = await client.messages.send({
  inboxId: 'inbox_abc123',
  to: [{ email: 'customer@example.com' }],
  subject: 'Re: Your support request',
  bodyText: 'Thank you for your message...',
  headers: {
    'In-Reply-To': '<original-message-id@mail.example.com>',
    'References': '<original-message-id@mail.example.com>',
    // You are now responsible for reading and appending the full References chain.
    // Get this wrong once and Outlook splits the thread. Forever.
  },
});

The problem with this approach is not that it is hard to write. It is that it requires you to correctly read the full References chain from the previous message, append the new Message-ID, and keep this logic accurate across every environment that may format or truncate headers differently. Miss it once on reply three, and your agent is now operating on a broken thread for every reply that follows.

The Right Way: client.messages.reply()

The client.messages.reply() method exists specifically to handle this. It reads the correct Message-ID from the parent message, builds the full References chain by reading the parent's own References header, and sets both In-Reply-To and References correctly before sending. You do not touch headers at all.

import { MailbotClient } from '@yopiesuryadi/mailbot-sdk';
const client = new MailbotClient({ apiKey: 'mb_test_xxx' });

// RIGHT: let the SDK handle threading headers automatically
const reply = await client.messages.reply({
  inboxId: 'inbox_abc123',
  messageId: 'msg_xyz789', // the specific message you are replying to
  bodyText: 'Thank you for following up. Here is what we found...',
});

That single call handles RFC 2822 compliance, correct References chain construction, and In-Reply-To assignment. No manual header management, no risk of chain truncation.

Fetching Full Thread Context Before You Reply

The other half of the problem is making sure your agent actually reads the full thread before composing a reply. The correct pattern is to call client.threads.get() first, which returns the complete message history for the thread as a structured object, then pass that context to your agent, and only then call client.messages.reply().

import { MailbotClient } from '@yopiesuryadi/mailbot-sdk';
const client = new MailbotClient({ apiKey: 'mb_test_xxx' });

// Step 1: Get the full thread so your agent has complete context
const thread = await client.threads.get(inboxId, threadId);

// Step 2: Extract message bodies in order (skip duplicate quoted sections)
const messages = thread.messages.map((msg) => ({
  from: msg.from,
  date: msg.date,
  body: msg.bodyText,
}));

// Step 3: Feed structured thread to your agent, get a reply draft
const replyDraft = await yourAgentLogic(messages);

// Step 4: Reply using the SDK so headers are handled correctly
const sent = await client.messages.reply({
  inboxId,
  messageId: thread.messages.at(-1)?.id, // reply to the most recent message
  bodyText: replyDraft,
});

This pattern gives your agent structured, deduplicated context rather than a raw chain of quoted bodies. It also ensures the reply is attached to the correct message in the thread, which is what determines whether Gmail, Outlook, and Apple Mail all show it in the right place. The deduplication step in particular matters: by passing only the unique message bodies in chronological order instead of four thousand tokens of repeated quoted text, your reasoning engine operates on clean input. Your agent processes less and produces more accurate output.

What Consistent Threading Actually Unlocks

When your agent maintains thread continuity reliably, several things improve at once. The conversation history your agent reads is accurate instead of fragmented. Reply rates from customers tend to rise because responses feel contextually aware rather than generic. Thread-level event history through client.events.list(threadId) becomes useful for auditing what the agent did and when.

More importantly, you stop debugging ghost threads in Outlook and wondering why a customer says "I already answered that." The pattern is consistent: the clients that handle threading well construct headers to the spec, and the ones that break threads take shortcuts. Your agent should not take shortcuts either.

Email is still the dominant communication channel for business workflows. An AI agent that loses thread context after three replies is not a production agent. It is a prototype that creates cleanup work for your team. Getting the infrastructure right is not optional, and as the r/AI_Agents community noted, most teams learn this the hard way after shipping.

You do not have to.

Start building thread-aware email agents at getmail.bot/docs/getting-started.


Sources

  1. RFC 2822: Internet Message Format (IETF)
  2. Reddit r/AI_Agents: Email Context for AI Agents Is Way Harder Than It Looks
  3. Composio: Why AI Agent Pilots Fail, 2026 Integration Roadmap
  4. Alibaba LifeTips: Make Outlook Thread Conversations Like Gmail

When Your AI Agent Should Stop Sending Email and Ask a Human

· 9 min read
Founder, mailbot

The Agent That Kept Apologizing

Imagine an AI agent handling your customer support inbox. A customer writes in, frustrated, mentioning a potential refund dispute. The agent replies with a calm, professional response. The customer replies again, angrier. The agent replies again, still composed. By the fourth exchange, the agent has sent four apology emails to a customer who needed a human to make a judgment call two emails ago.

This is not a hallucination problem. The agent understood the situation. It just had no mechanism to know when it was no longer the right tool for the job.

The Problem With Autonomous Email Agents

Autonomous agents handle routine tasks well. They can parse inbound emails, look up order status, send confirmations, and follow up on open threads. But real inboxes are not neat. They contain sensitive topics, emotionally loaded language, ambiguous requests, and situations where the wrong reply carries legal or reputational risk.

The established playbook for handling this is called human-in-the-loop (HITL), and most of the literature around it focuses on chat. Chat handoff is well-understood: a bot loses confidence, a session is live, a human joins the conversation. The handoff is synchronous. Both parties are present.

Email handoff is a different problem. There is no live session. The customer sent their message and walked away. The agent's reply may sit in their inbox for hours. If the agent escalates incorrectly and a human also replies, you now have two conflicting responses in the same thread. And if the escalation is not properly tracked, the human operator may not even know they need to act.

Little of the HITL literature addresses this asynchronous case. That gap is exactly what this post addresses.

The Insight: Email Handoff Requires Async-Safe Traceability

In chat, a handoff is an event: a session is transferred, a new agent joins, the conversation continues. In email, a handoff is a state change on a thread. The thread must be marked. The human operator must be notified through a separate channel. The agent must stop sending until the human resolves or re-delegates.

This requires three things to work correctly:

  1. Trigger logic that recognizes when escalation is warranted
  2. Notification routing that alerts a human without polluting the customer thread
  3. Thread state management that prevents the agent from continuing to reply

Get any one of these wrong and you get either missed escalations, duplicate replies, or a human who does not realize they are on the hook.

When to Escalate: The Triggers That Matter

Not every uncertain situation warrants a handoff. According to Elementum AI, a reasonable target is a 10 to 15 percent escalation rate. Too low, and your agent is overconfident. Too high, and human operators are overwhelmed and the system defeats itself.

The triggers worth implementing fall into three categories.

Confidence threshold breach. When the agent's confidence score for its intended reply drops below a defined threshold, it should not send. Anyreach sets this threshold at 85 percent. Below that, human intervention is triggered. Their reported result is 99.8 percent accuracy with HITL active, compared to lower accuracy without it.

Keyword and topic detection. Certain words in an inbound message should immediately flag for review regardless of confidence score. Eesel AI identifies the most common triggers in support contexts: refund, cancel, legal, complaint, and explicit requests to speak with a human. In email, this detection runs on the inbound message body before the agent drafts a reply.
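A minimal version of that keyword gate is a few lines. The term list below is illustrative, not exhaustive, and should be tuned to your own support domain:

```typescript
// Illustrative keyword gate, run on the inbound body before drafting a reply.
const ESCALATION_TERMS = ['refund', 'cancel', 'legal', 'complaint', 'speak to a human'];

function needsHumanReview(inboundBody: string): boolean {
  const text = inboundBody.toLowerCase();
  return ESCALATION_TERMS.some((term) => text.includes(term));
}
```

In practice you would layer this under the confidence check: a keyword hit escalates unconditionally, while a clean body still passes through the confidence threshold.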

Loop and failure detection. When the same thread has cycled through multiple agent replies without resolution, the agent is probably stuck. Replicant identifies conversation loops, repeated fallback responses, and backend failures as AI-initiated escalation triggers. In email, a loop looks like an increasing reply count on a thread with no status change. Practitioners building agent systems also tie escalation to tool failure events and low evaluation scores, not just confidence on the reply itself.

How Event Notifications Become Escalation Triggers

Every email thread carries an event timeline: message received, agent replied, customer opened, customer replied again, bounce detected. These events are the raw material for escalation logic.

The right architecture treats event notifications as the nervous system of the escalation pipeline. Instead of polling for thread state on a schedule, the agent registers a listener for specific events and acts when those events arrive. A bounce on a reply, a sentiment shift in a new inbound message, or a third reply from the same sender within 24 hours can each serve as a trigger signal.

Here is how to wire that up with the mailbot SDK:

import { MailbotClient } from '@yopiesuryadi/mailbot-sdk';
const client = new MailbotClient({ apiKey: 'mb_test_xxx' });

// Register an event notification listener for inbound messages
// Note: Webhooks fire for all inboxes. Filter by inbox in your handler if needed.
await client.webhooks.create({
  url: 'https://your-agent.example.com/hooks/inbound',
  events: ['message.received', 'message.bounced'],
});

When the event arrives at your handler, you check the thread timeline to assess the escalation signal:

// In your event handler
async function handleInbound(payload: { threadId: string; messageId: string }) {
  const events = await client.events.list(payload.threadId);
  const replyCount = events.filter((e) => e.type === 'message.sent').length;
  const hasBounce = events.some((e) => e.type === 'message.bounced');

  if (replyCount >= 3 || hasBounce) {
    await escalateToHuman(payload.threadId, payload.messageId);
  }
}

Building the Async-Safe Handoff

Once the escalation decision is made, you need to do three things in sequence. Mark the thread, notify the human, and stop the agent.

Step 1: Mark the thread as escalated.

async function escalateToHuman(threadId: string, messageId: string) {
  // Mark the message so the agent pipeline knows to skip this thread
  await client.messages.updateLabels(messageId, {
    add: ['escalated', 'awaiting-human'],
    remove: ['agent-active'],
  });

Step 2: Notify the human operator through a separate inbox.

The escalation notice goes to your internal operator inbox, not the customer thread. This is critical. A reply to the customer thread at this point would be a second response the customer was not expecting, and could conflict with the human's eventual reply.

  // Notify the human operator via a separate internal inbox
  const notice = await client.messages.send({
    inboxId: 'inbox_operator_alerts',
    to: [{ email: 'support-lead@yourcompany.com' }],
    subject: `[Escalation Required] Thread ${threadId}`,
    bodyText: `A customer thread requires human review.\n\nThread ID: ${threadId}\nMessage ID: ${messageId}\n\nReason: Reply loop detected or bounce received.\n\nReview and reply directly to the customer thread.`,
  });

Step 3: Confirm delivery of the escalation notice.

Before the function exits, confirm the escalation message actually reached the operator. A failed escalation notification is as bad as no escalation at all.

  // Verify the escalation notice itself was delivered to the operator
  const timeline = await client.engagement.messageTimeline(notice.id);
  const delivered = timeline.events.some((e) => e.type === 'delivered');

  if (!delivered) {
    // Log for retry or fallback alerting
    console.error(`Escalation notification not delivered for thread ${threadId}`);
  }
}

Your agent's main reply loop must check for the escalated label before drafting any response. If the label is present, the agent skips that thread entirely until a human resolves and removes the label.
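That gate is cheap to implement. A sketch, assuming labels arrive on the message as a plain string array:

```typescript
// Hypothetical gate: the reply pipeline skips any thread a human currently owns.
const HUMAN_OWNED_LABELS = ['escalated', 'awaiting-human'];

function agentMayReply(labels: string[]): boolean {
  return !labels.some((label) => HUMAN_OWNED_LABELS.includes(label));
}
```

The key design point is that the check runs at the start of every reply turn, not once per thread, so a mid-conversation escalation takes effect on the very next inbound message.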

Why This Architecture Matters

The difference between a good HITL system and a bad one in email contexts is not the trigger logic. Teams spend most of their time on that. The real failure mode is what happens after the decision is made.

In chat, the session transfer is enforced by the platform. The agent is literally removed from the conversation. In email, you must enforce that boundary yourself. The agent will keep replying if you let it. The escalated label combined with a label check at the start of the reply pipeline creates the boundary. Without it, the escalation remains a notification rather than a state change, and the agent keeps going.

Elementum AI frames HITL as a continuous feedback loop rather than a one-time gate. That framing applies here: after the human resolves the thread, removing the escalated label re-enables the agent on future inbound messages. The thread history becomes part of the agent's training signal. Each escalation is also a data point on where your confidence thresholds need adjustment.

The Broader Pattern

Email handoff is harder than chat handoff because it forces you to treat escalation as a durable state, not a transient event. The thread exists in perpetuity. The customer will reply again. The agent will see that reply. If your system treats escalation as a notification and not a state change, the agent will respond to that next reply as if the escalation never happened.

The architecture described here: event-triggered listeners, timeline-based loop detection, label-enforced agent gating, and human notification through a separate channel, is the pattern that makes email HITL actually work. Not the detection logic. The state management.

If you are building an email agent and your HITL plan is to log escalations to a Slack channel and hope someone notices, you are one busy support queue away from a problem.


Build your first escalation pipeline on mailbot.


Sources

  1. Elementum AI, "Human-in-the-Loop Agentic AI" (2026-03-12): https://www.elementum.ai/blog/human-in-the-loop-agentic-ai
  2. Eesel AI, "Best Practices for Human Handoff in Chat Support" (2025-10-22): https://www.eesel.ai/blog/best-practices-for-human-handoff-in-chat-support
  3. Replicant, "When to Hand Off to a Human: How to Set Effective AI Escalation Rules" (2025-06-23): https://www.replicant.com/blog/when-to-hand-off-to-a-human-how-to-set-effective-ai-escalation-rules
  4. Reddit r/AI_Agents, "Anyone building agent systems with human-in-the-loop?": https://www.reddit.com/r/AI_Agents/comments/1m5q6h1/anyone_building_agent_systems_with_humanintheloop/
  5. Anyreach, "What Is Human-in-the-Loop in Agentic AI: Building Trust Through Intelligent Fallback" (2025-08-04): https://blog.anyreach.ai/what-is-human-in-the-loop-in-agentic-ai-building-trust-through-intelligent-fallback/

How to Give Your AI Agent a Real Email Inbox with MCP

· 7 min read
Founder, mailbot

Most email MCP servers let your AI client send email. That is the easy half. The harder half is letting it receive replies, track delivery events, and maintain conversation context across a thread. This tutorial shows you how to wire both halves together using the mailbot MCP server.

If you have searched for "MCP email server" or "email MCP server" and landed on tutorials that only cover outbound, you already know the gap. MailerCheck's roundup of 6 email MCP servers confirms that the only two-way option in the list is a Gmail relay through Zapier. For developers who want a purpose-built inbox that their AI agent can own end to end, that is a meaningful gap.

This tutorial fills it. By the end, your MCP-compatible AI client will be able to create an inbox, send email from it, read replies, and check delivery events.


What You Will Build

An AI agent workflow backed by a real mailbot inbox. The mailbot MCP server exposes 13 tools that map directly to mailbot's API surface: inbox management, message sending, reply handling, thread reading, and delivery event inspection. You type a natural language instruction, and the client calls the right tool.

This is useful for agentic tasks like: send a follow-up to anyone who replied to yesterday's campaign, check whether my outbound message was delivered, or create a throwaway inbox for this test scenario and clean it up when done.


Prerequisites

Before you start:

  • Node.js 18 or later installed on your machine (the MCP server runs via npx)
  • An MCP-compatible AI desktop client that supports external MCP servers via a JSON config file
  • A mailbot account and API key from getmail.bot

No local build step required. The package ships prebuilt to npm.


Step 1: Understand How MCP Servers Work

MCP (Model Context Protocol) lets an AI client call external tools in the same way a developer calls an API. According to the official MCP documentation, servers expose tools as typed functions. When you send a message to your AI client, it inspects the available tools, decides which one matches your intent, and executes it. The result comes back as context for the next response.

For email, this means your AI client becomes a first-class email actor rather than a text generator that happens to mention email addresses. It can actually create inboxes, send messages, and read what comes back.


Step 2: Install the mailbot MCP Server

No manual install is required. The package runs on demand via npx, so your AI client fetches and executes it automatically on first launch.

The package is published at @yopiesuryadi/mailbot-mcp on npm. If you want to inspect the package before running it, you can pull it manually:

npx @yopiesuryadi/mailbot-mcp --help

This confirms the package resolves and prints the available tool list.


Step 3: Configure the MCP Server in Your AI Client

Your MCP-compatible AI client reads a JSON config file to discover external servers. The exact file location varies by client. Common locations:

OS        Typical config path
macOS     ~/Library/Application Support/<ClientName>/config.json
Windows   %APPDATA%\<ClientName>\config.json

Add the following block to your client's MCP servers config:

{
  "mcpServers": {
    "mailbot": {
      "command": "npx",
      "args": ["-y", "@yopiesuryadi/mailbot-mcp"],
      "env": {
        "MAILBOT_API_KEY": "mb_test_xxx"
      }
    }
  }
}

Replace mb_test_xxx with your actual mailbot API key from your account dashboard.

Save the file and restart your AI client. If the client has a tools or connectors panel, you should see "mailbot" listed with its 13 available tools. That confirms the server is running and connected.

Note: the MCP server is at v1 and has not been tested across every AI client configuration. If your client does not surface the tools after restart, check that the config JSON is valid and that Node.js is accessible on your system PATH.


Step 4: Create an Inbox via MCP

Once the server is connected, you can talk to your AI client in plain language. To create a new inbox, try a prompt like:

Create a new mailbot inbox named "support-test"

Your AI client will call the create_inbox tool, which maps to client.inboxes.create in the mailbot SDK. The tool returns the inbox details including its assigned email address.

You can list existing inboxes with:

List my mailbot inboxes

And retrieve details for a specific one with:

Get the inbox with ID inbox_abc123


Step 5: Send Email via MCP

With an inbox created, sending is one instruction away:

Send an email from my support-test inbox to recipient@example.com with the subject "Hello from mailbot MCP" and a plain text body saying "This was sent by my AI agent."

The client calls the send_message tool under the hood. This is meaningfully different from send-only email MCP servers like Mailtrap's MCP integration, which only expose a single outbound send tool. With mailbot, the same session that sends can also receive and inspect.

You can also send HTML:

Send an HTML email from support-test to recipient@example.com. Subject: "Welcome". Body: a simple HTML welcome message with a bold heading.


Step 6: Receive and Read Email via MCP

When a reply arrives at your mailbot inbox, your AI client can read it:

List the latest messages in my support-test inbox

This calls list_messages and returns subject, sender, snippet, and thread ID for each message. To read a full message:

Get the full content of message msg_xyz789

To search across messages:

Search my support-test inbox for messages from sender@example.com

The search_messages tool accepts sender, subject keywords, date ranges, and label filters, so your agent can do targeted retrieval without reading the entire inbox.

If you are building an automated flow and need to wait for a reply before proceeding, the wait_for_message tool (backed by client.messages.waitFor) polls until a matching message arrives or a timeout is reached. This is useful for test flows where you send a message and need to assert on the reply.
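Under the hood, a wait-for-message tool is essentially a poll loop with a deadline. A generic sketch of that pattern, not the actual mailbot implementation:

```typescript
// Generic poll-until-match helper; the real wait_for_message tool may differ.
// `poll` returns the matching value, or null if nothing matches yet.
async function waitForMatch<T>(
  poll: () => Promise<T | null>,
  timeoutMs: number,
  intervalMs = 1000,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const result = await poll();
    if (result !== null) return result;
    // Back off before polling again
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('Timed out waiting for a matching message');
}
```

In a test flow, `poll` would call the message-listing endpoint with your match criteria, and the timeout bounds how long the test waits before failing.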


Step 7: Check Delivery Events via MCP

Sending a message is the start, not the end. Your AI client can also inspect what happened to each message after delivery.

Check the delivery events for thread thread_abc123

This calls list_events for the thread, returning a timeline of events (queued, delivered, opened, bounced, and so on). You can also retrieve a single event:

Get event details for event evt_123

This is useful for agentic tasks like: "Send a follow-up only if the first message was delivered but not opened." Your agent can check the event timeline, make a conditional decision, and act without you writing any conditional logic manually.
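Expressed as code, that condition reduces to a check over the event timeline. The event type strings ('delivered', 'opened') follow the list above and are assumed here:

```typescript
// Decide on a follow-up from a thread's event timeline.
// Event type names are assumed from the list earlier in this post.
type DeliveryEvent = { type: string };

function shouldFollowUp(events: DeliveryEvent[]): boolean {
  const delivered = events.some((e) => e.type === 'delivered');
  const opened = events.some((e) => e.type === 'opened');
  // Follow up only when the message arrived but was never opened
  return delivered && !opened;
}
```

The point of running this through MCP is that the agent can evaluate the same condition from a natural language instruction, without you shipping this logic yourself.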


Step 8: Organize with Labels and Threads via MCP

The 13 mailbot MCP tools also cover thread reading and label management. To view a full conversation thread:

Show me the full thread for thread_abc123

To label a message for downstream filtering:

Add the label "needs-followup" to message msg_xyz789

Labels work as lightweight state markers that persist on the message, so other tools or agents in your workflow can filter by them later.


What Is Next

This tutorial covered the core loop: create inbox, send, receive, inspect events. The mailbot MCP server exposes the same API surface as the SDK, so everything in the mailbot documentation applies to what your AI client can do.

A few directions to explore from here:

  • Event notifications: Set up a webhook to push inbound messages to your own endpoint, so your agent reacts in real time rather than polling.
  • Domain verification: Verify a custom sending domain so outbound messages use your own address.
  • Compliance checks: Use the compliance tools to run readiness checks before sending to a new list.

The MCP integration is v1. Feedback from real usage is how it improves. If you run into edge cases with your specific AI client configuration, the documentation is the right place to start: getmail.bot/docs/getting-started.



Email API for AI Agents: What to Evaluate Before You Pick One

· 10 min read
Founder, mailbot


Most developers discover the limits of their email API only after something breaks in production. An agent sends a follow-up reply that lands outside the original thread. An inbound message arrives at a webhook endpoint and disappears with no way to replay it. A compliance audit asks for an audit log that was never generated. By then, migrating to a different provider is painful.

Choosing an email API for an AI agent is a different problem from choosing one for transactional email. A welcome email does not need to receive replies. An AI agent does. The decision criteria are not the same, and most comparison guides available today were written with transactional use cases in mind.

The Problem

The standard comparison framework for email APIs focuses on deliverability, latency, and price per message. Those things matter, but they answer the wrong question for agents. When an agent manages an ongoing support conversation, a sales sequence, or an approval workflow, the relevant questions are: can the agent receive the reply, does the reply arrive in the right thread context, and what happens if the event notification fails?

A billion-request benchmark by Knock found that SendGrid's median API response time is 22ms, Postmark's is 33ms, and Resend's is 79ms. Those numbers matter for transactional throughput. But an agent waiting on a human reply is not measuring latency in milliseconds. It is measuring reliability over minutes and hours.

The industry has also converged on a comparison model that treats inbound email as a secondary feature, something bolted on via webhook rather than designed as a core primitive. AgentMail's 2026 comparison of the top providers found that most handle inbound email through stateless webhook routing with no persistent storage or threading. For transactional email, that is fine. For an agent that needs to read a reply, correlate it with a prior message, and continue a conversation, it is a significant gap.

The Insight

The right question to ask before picking an email API for an agent is not "how fast is the send?" It is "can this API handle the full conversation loop?" That loop has seven distinct requirements, and providers differ on almost every one of them.

The Evaluation Framework

1. Two-Way vs. Send-Only

The most fundamental distinction. Send-only providers (AWS SES at its core, older SendGrid configurations) give you an endpoint for outbound email and nothing more. Two-way providers give you both a send path and a receive path.

The difference in architecture is significant. AgentMail's comparison found that SendGrid's inbound parse is stateless (no persistent storage, no threading) and Mailgun routes inbound email via webhook with no persistent inbox. Resend added inbound webhook support in 2025. AWS SES requires additional AWS infrastructure (S3, Lambda, SNS) to do anything useful with a received message.

For agents, the question is whether you want to build and maintain that additional layer yourself or use an API that treats two-way communication as a first-class primitive.

An API designed for the full conversation loop looks like this:

import { MailbotClient } from '@yopiesuryadi/mailbot-sdk';
const client = new MailbotClient({ apiKey: 'mb_test_xxx' });

// An email API for agents should handle the full cycle: send, receive, reply in thread
const inbox = await client.inboxes.create({ name: 'support-agent' });
const inbound = await client.messages.waitFor({
  inboxId: inbox.id,
  direction: 'inbound',
  timeoutMs: 30000,
});
await client.messages.reply({
  inboxId: inbox.id,
  messageId: inbound.id,
  bodyText: 'Thanks for reaching out.',
});

The waitFor method blocks until a reply arrives, which is exactly the pattern an agent running a turn needs.

2. MTA Ownership vs. Rented Infrastructure

Who controls the mail transfer agent matters for deliverability configuration and high-volume cost. AWS SES runs its own MTA and prices per message ($0.10 per 1,000 emails sent and $0.10 per 1,000 received, per AgentMail's pricing table). SendGrid and Postmark operate their own infrastructure. Resend routes through established MTAs.

Providers that own their MTA can offer dedicated IPs, custom warm-up, and more direct control over reputation. Providers that abstract the MTA away trade that control for easier onboarding. For agents sending at modest volume (under 50,000 messages per month), MTA ownership is less important than the API surface above it. In regulated industries, dedicated IP configuration and reputation isolation become meaningful.

3. Thread Handling

Email threading is governed by three headers: Message-ID, In-Reply-To, and References. RFC 2822 specifies that a reply's References field should include the parent's References plus the parent's Message-ID. When this chain is managed correctly, every major email client preserves the thread. When it breaks, replies land as new conversations.

Managing this manually in application code is straightforward for the first reply. It becomes error-prone after three or four turns when the References header needs to carry the full ancestry. An API that handles threading automatically removes this class of bug entirely.
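The bookkeeping is mechanical but easy to get wrong by turn three. A minimal sketch of the RFC 2822 rule, assuming a simplified `ThreadHeaders` shape rather than any particular SDK's types:

```typescript
// Illustrative sketch of the RFC 2822 reply rule: the new References list
// is the parent's References plus the parent's own Message-ID.
interface ThreadHeaders {
  messageId: string;
  inReplyTo?: string;
  references: string[];
}

function buildReplyHeaders(parent: ThreadHeaders, newMessageId: string): ThreadHeaders {
  return {
    messageId: newMessageId,
    inReplyTo: parent.messageId,                           // always the direct parent
    references: [...parent.references, parent.messageId],  // full ancestry chain
  };
}

// Three replies deep, the chain carries every ancestor's Message-ID.
const m1: ThreadHeaders = { messageId: '<a@example.com>', references: [] };
const m2 = buildReplyHeaders(m1, '<b@example.com>');
const m3 = buildReplyHeaders(m2, '<c@example.com>');
const m4 = buildReplyHeaders(m3, '<d@example.com>');
// m4.references is ['<a@example.com>', '<b@example.com>', '<c@example.com>']
```

Dropping the spread of `parent.references` (a common bug) produces a one-element chain on every reply, which is exactly the failure mode that surfaces at message three.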

AgentMail handles threading via built-in API support with automatic header management. SendGrid and SES do not manage thread state; the application is responsible for passing the correct headers on every reply. Resend's threading behavior is not documented as an automatic feature.

4. Event Notification Reliability

When an inbound message triggers an event, what happens if your endpoint is down? This is where providers diverge significantly.

Mailtrap's flexibility comparison found that retry windows vary considerably across providers:

| Provider | Retry window |
| --- | --- |
| SendGrid | 72 hours |
| Mailtrap | 24 hours |
| Postmark | 12 hours |
| Mailgun | 8 hours |
| Resend | User-managed |

Beyond retry duration, the question is whether you can replay a specific event after the window expires. For agents that need to recover from a missed notification without re-triggering the full workflow, event replay is a meaningful capability.
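One practical consequence of retries and replay: the handler behind your endpoint should be idempotent, because the same event can legitimately arrive more than once. A minimal sketch, using an in-memory set where a real system would use a durable store:

```typescript
// Sketch of idempotent event handling: a redelivered event (provider retry
// or manual replay) must not be processed twice. The in-memory Set is a
// stand-in for a durable store keyed by event ID.
const processed = new Set<string>();

function handleEventOnce(eventId: string, handler: () => void): boolean {
  if (processed.has(eventId)) return false; // duplicate delivery: skip
  processed.add(eventId);
  handler();
  return true;
}

let replies = 0;
handleEventOnce('evt_1', () => { replies += 1; });
handleEventOnce('evt_1', () => { replies += 1; }); // provider retry, ignored
// replies === 1
```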

5. Custom Domains with Full DNS Setup

An AI agent sending from agent@support.yourcompany.com requires a custom domain with SPF, DKIM, and ideally DMARC records configured. The question is how much of that setup the provider automates.

All major providers support custom domains. The differences are in the verification flow, the time required, and whether the provider guides you through the full DNS record set or leaves gaps. AgentMail notes that SendGrid's time to first email is 10 to 15 minutes, Mailgun's is similar, and AWS SES requires sandbox approval that takes 24 to 48 hours for new users.

For agents deployed in enterprise environments, the ability to verify multiple domains and issue separate credentials per domain matters for tenant isolation.

6. Compliance Readiness

Enterprise buyers commonly ask for SOC 2 Type II, ISO 27001, or equivalent certifications. For agents handling customer communication, audit logs (who sent what, when, and to whom) are also relevant.

This criterion is often invisible until a procurement process or security review surfaces it. Checking compliance posture before you build is faster than retrofitting.

7. Pricing Model

Three pricing models exist: per-message, flat-rate tiers with a message ceiling, and per-inbox.

Per-message pricing (AWS SES at $0.10/1,000) is economical at high outbound volume but can become expensive if agents are also receiving large volumes. Per-inbox pricing (AgentMail's model, starting at $20/month for 10 inboxes and 10,000 messages) is predictable for deployments with a fixed number of agent inboxes. Flat-rate models (Postmark, Resend's Pro tier) are predictable up to a message ceiling, then require a tier upgrade.

For agents, the relevant calculation is the ratio of inbound to outbound messages. A support agent that receives 1,000 messages and sends 1,000 replies is doing 2,000 message operations. A per-message model bills both directions if the provider supports inbound; a per-inbox model does not change with message volume within the tier.
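A rough sketch of that calculation, using the per-message rate quoted above ($0.10 per 1,000 in each direction); the numbers are illustrative, not a pricing calculator:

```typescript
// Back-of-envelope comparison of per-message billing against a flat
// per-inbox tier. Rates are the ones quoted in this post; adjust for
// your provider's actual pricing.
const PER_MESSAGE_RATE = 0.10 / 1000; // $0.10 per 1,000, billed each direction
const PER_INBOX_FLAT = 20;            // $/month, valid within the tier's message cap

function perMessageCost(inbound: number, outbound: number): number {
  return (inbound + outbound) * PER_MESSAGE_RATE;
}

// A support agent at 1,000 in / 1,000 out does 2,000 billable operations.
console.log(perMessageCost(1000, 1000));     // ~$0.20: per-message wins at low volume
console.log(perMessageCost(500000, 500000)); // ~$100: a $20 per-inbox tier wins, if the cap fits
```

The crossover point depends entirely on the inbound-to-outbound ratio and the tier's message ceiling, which is why the ratio is the number to estimate first.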

Comparison Table

The table below summarizes each provider across the seven criteria. "Native" means the feature is a first-class API primitive. "Webhook" means the feature requires your application to handle state and persistence.

| Criterion | SendGrid | AWS SES | Resend | Postmark | AgentMail | mailbot |
| --- | --- | --- | --- | --- | --- | --- |
| Two-way (inbound) | Webhook, stateless | Via S3/Lambda/SNS | Webhook (since 2025) | Limited | Native inbox | Native inbox |
| MTA ownership | Yes (Twilio) | Yes (AWS) | Abstracted | Yes | Abstracted | Yes |
| Auto thread handling | No | No | Not documented | No | Yes | Yes |
| Event retry window | 72 hours | Not specified | User-managed | 12 hours | Configurable | Configurable |
| Custom domains | Yes | Yes | Yes | Yes | Yes | Yes |
| Compliance docs | SOC 2 | SOC 2, ISO 27001 | SOC 2 | SOC 2 | Not published | In progress |
| Pricing model | Per-message | Per-message | Per-message tiers | Per-message tiers | Per-inbox tiers | Per-inbox tiers |

A few notes on reading this table honestly. SendGrid's 22ms p50 latency is the best measured across any provider in the Knock benchmark, and its 72-hour retry window is the longest available for event notifications. AWS SES has the most consistent error rates of any provider measured, with most days below 0.01%. These are real advantages for high-volume transactional use cases.

The providers that score highest on the agent-specific criteria (two-way, auto threading, event replay) are the newer ones: AgentMail and mailbot. Both are earlier-stage than SendGrid or SES, which means a tradeoff: more agent-native features, less operational history.

Where mailbot Stands

mailbot is designed around the agent use case. Inboxes are programmable resources. Threads are tracked automatically with correct In-Reply-To and References headers on every reply. Event notifications include replay via client.events.replay(eventId). Compliance tooling is available via client.compliance.readiness() and client.auditLog.list(). Pricing is per-inbox, not per-message.

The honest caveat: mailbot is younger than SendGrid or Postmark, which means less operational track record at the top end of volume. If you are migrating an existing high-volume transactional email pipeline, that history matters. If you are building a new agent workflow from scratch, the agent-native API surface is a meaningful starting point advantage.

Close

Not every agent needs all seven criteria. A simple outbound notification agent only needs criteria 2 and 5. An agent managing multi-turn customer conversations needs all seven, and a gap in any one of them will surface as a bug in production.

The providers that dominated email in 2015 were built for a world where email was a one-way notification channel. Agent workflows are a different problem, and the evaluation should reflect that. The mailbot comparison page maps each criterion to a working implementation.


Sources

  1. AgentMail, "5 Best Email API For Developers Compared [2026]" (2026-01-27): https://www.agentmail.to/blog/5-best-email-api-for-developers-compared-2026
  2. Jeff Everhart / Knock via Dev.to, "We analyzed a billion email API requests: here's what we learned" (2026-03-12): https://dev.to/knocklabs/we-analyzed-a-billion-email-api-requests-heres-what-we-learned-j39
  3. Ivan Djuric / Mailtrap, "5 Best Email APIs: Flexibility Comparison [2026]" (2026-03-13): https://mailtrap.io/blog/email-api-flexibility/
  4. Postmark, "Best Email API" (2026-01-12): https://postmarkapp.com/blog/best-email-api
  5. Reddit r/webdev, "Email API benchmarks for SendGrid, Amazon SES...": https://www.reddit.com/r/webdev/comments/1rrxxs5/email_api_benchmarks_for_sendgrid_amazon_ses/
  6. IETF RFC 2822, "Internet Message Format": https://datatracker.ietf.org/doc/html/rfc2822

Building an AI Support Agent That Sends Real Email (Not Just Chat)

· 9 min read
Founder, mailbot

The Problem Is Not the AI

Most teams building AI support agents hit the same wall. The AI classification works fine in testing. The prompt responses look reasonable. But when they try to connect it to actual email, things fall apart fast. The inbox is shared with marketing sends. There is no way to listen for inbound messages without polling. Replies break the thread. Nobody knows whether the automated response was actually delivered.

As one developer put it in a thread on r/AI_Agents: "What begins as simple email context evolves into a substantial infrastructure project." That quote describes the experience of most teams within the first week of building a real support agent, not a demo.

The Composio AI Agent Report is direct about the root cause: integration failure, not model failure, is the number one reason AI agent pilots fail in production. The report identifies "brittle connectors" as a specific trap, where one-off integrations work in isolation but break the moment real email volume hits, or when email clients format messages differently than expected.

This post is a comprehensive walkthrough for building a support agent that avoids those failure modes. It covers everything from creating a dedicated inbox, to listening for inbound messages, to classifying intent, to confirming delivery, to escalating uncertain cases to a human reviewer. If you want the 30-minute quickstart version, the existing Build an Email AI Agent in 30 Minutes post covers the basics. This post is for teams who want something production-ready.

Why Dedicated Infrastructure Matters

A support agent needs its own inbox, its own event notification listener, and a reliable threading model. Sharing an inbox with other email processes introduces noise that defeats classification before the AI ever sees a message.

Instantly's email triage research found that 70 to 80 percent of routine support emails can be classified and responded to automatically, but only when the classification system has clean, well-scoped input. Routing all company email through one inbox and asking an agent to sort it out is not a clean input.

It is worth noting that we run mailbot's own support inbox this way. The architecture described in this post is not hypothetical. You can read about it in the mailbot dogfooding post, which covers how we use our own API to handle support at the company level.

Step 1: Create a Dedicated Inbox

Start by initializing the SDK and creating an inbox specifically for support:

import { MailbotClient } from '@yopiesuryadi/mailbot-sdk';

const client = new MailbotClient({ apiKey: 'mb_test_xxx' });

const inbox = await client.inboxes.create({ name: 'support-agent' });
console.log('Inbox created:', inbox.id, inbox.address);

This gives you an isolated address (something like support-agent@yourdomain.getmail.bot) that receives only inbound support email. No newsletter noise, no transactional sends from other systems. Your classifier gets a clean channel.

Step 2: Register an Event Notification Listener

Polling an inbox on an interval is the third failure trap identified in the Composio report, labeled the "Polling Tax." It wastes resources, introduces latency, and adds another surface where things can fail silently.

Register an event notification endpoint instead. The SDK makes this a single call:

const hook = await client.webhooks.create({
url: 'https://your-agent.example.com/inbound',
events: ['message.inbound'],
});
// Note: Webhooks fire for all inboxes. Filter by inboxId in your /inbound handler.
console.log('Listener registered:', hook.id);

Your endpoint at /inbound will now receive a POST payload every time a new message arrives in the support inbox. No polling required.

Step 3: Receive and Read the Inbound Message

When your endpoint receives a notification, it includes the inboxId and messageId. Use those to fetch the full message and the thread context:

app.post('/inbound', async (req, res) => {
const { inboxId, messageId, threadId } = req.body;

// Fetch the individual message
const message = await client.messages.get(inboxId, messageId);

// Fetch the full thread for context
const thread = await client.threads.get(inboxId, threadId);

// Pass to your classifier
const intent = await classifyIntent(message.subject, message.bodyText, thread);

await handleIntent(intent, inboxId, messageId);

res.sendStatus(200);
});

Fetching the full thread via client.threads.get() is important for repeat customers or ongoing issues. A support ticket about a billing error in the third reply looks very different without the first two messages. Thread context prevents your classifier from treating it as a fresh, unrelated inquiry.
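One way to make that context usable is to flatten the thread into a compact transcript before classification. A sketch, with an assumed message shape rather than the SDK's exact types:

```typescript
// Sketch: flatten a thread into a transcript the classifier can read.
// The ThreadMessage shape here is an assumption for illustration.
type ThreadMessage = { direction: 'inbound' | 'outbound'; bodyText: string };

function threadTranscript(messages: ThreadMessage[], maxChars = 2000): string {
  const lines = messages.map(m =>
    `${m.direction === 'inbound' ? 'Customer' : 'Agent'}: ${m.bodyText}`);
  // Keep the most recent context when the thread grows long
  return lines.join('\n').slice(-maxChars);
}

const transcript = threadTranscript([
  { direction: 'inbound', bodyText: 'My invoice is wrong.' },
  { direction: 'outbound', bodyText: 'Can you share the invoice number?' },
  { direction: 'inbound', bodyText: 'It is INV-1042.' },
]);
// With the transcript, "It is INV-1042." reads as the third turn of a
// billing issue, not a fresh, unrelated inquiry.
```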

Step 4: Classify Intent and Reply

Your AI classifier receives the message text and thread context and returns an intent label plus a confidence score. The exact implementation of your classifier is up to you. The important part is that this function returns something structured:

async function classifyIntent(subject: string, body: string, thread: any) {
// Call your AI classification layer here
// Return: { intent: string, confidence: number, suggestedReply: string }
}

Instantly's research shows that 70 to 80 percent of routine support emails fall into a small set of intent categories: order status, refund request, account access, and general inquiry. A well-tuned classifier handles the bulk of volume without human review.
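To make the structured return concrete, here is a deliberately naive keyword baseline. A production agent would use an LLM or a trained model here; the categories, scores, and canned replies are placeholders, not recommendations:

```typescript
// Naive keyword classifier that returns the { intent, confidence,
// suggestedReply } shape the handler expects. Illustrative only.
type Classification = { intent: string; confidence: number; suggestedReply: string };

const RULES: Array<{ intent: string; keywords: string[]; reply: string }> = [
  { intent: 'order_status', keywords: ['order', 'shipping', 'tracking'], reply: 'Here is your order status...' },
  { intent: 'refund_request', keywords: ['refund', 'money back'], reply: 'We have opened a refund review...' },
  { intent: 'account_access', keywords: ['password', 'login', 'locked out'], reply: 'Here is how to regain access...' },
];

function classifyIntentNaive(subject: string, body: string): Classification {
  const text = `${subject} ${body}`.toLowerCase();
  for (const rule of RULES) {
    const hits = rule.keywords.filter(k => text.includes(k)).length;
    if (hits > 0) {
      // More keyword hits -> higher confidence, capped below certainty
      return { intent: rule.intent, confidence: Math.min(0.5 + 0.2 * hits, 0.9), suggestedReply: rule.reply };
    }
  }
  return { intent: 'general_inquiry', confidence: 0.3, suggestedReply: '' };
}

console.log(classifyIntentNaive('Where is my order?', 'The tracking link is broken.'));
```

The point of the sketch is the contract, not the logic: whatever sits inside `classifyIntent`, the rest of the pipeline only depends on that structured return.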

When confidence is above your threshold, reply in the same thread:

async function handleIntent(intent: any, inboxId: string, messageId: string) {
if (intent.confidence >= 0.80) {
await client.messages.reply({
inboxId,
messageId,
bodyText: intent.suggestedReply,
});
} else {
await escalateToHuman(inboxId, messageId, intent);
}
}

Using client.messages.reply() keeps the response inside the original thread. The customer's email client shows it as a continuation of the same conversation, not a new message. This matters both for the customer experience and for the threading chain that future AI classification will need.

Step 5: Verify Delivery with the Event Timeline

Sending a reply is not the same as delivering it. Network issues, misconfigured DNS, and provider-side throttling can all cause a message to leave your system without reaching the recipient.

Use client.engagement.messageTimeline() to confirm the delivery path after sending:

const timeline = await client.engagement.messageTimeline(messageId);

const delivered = timeline.events.some(e => e.type === 'delivered');
const opened = timeline.events.some(e => e.type === 'opened');

if (!delivered) {
console.warn('Reply not confirmed delivered. Flagging for review.');
// Trigger retry or alert here
}

This is the kind of operational check that separates a demo agent from a production one. If a customer never receives the reply, the next message they send will be a frustrated escalation. Catching delivery failures early gives you time to intervene before that happens.

Step 6: Escalate to a Human When Confidence Is Low

When the classifier's confidence falls below your threshold, the message should go to a human reviewer instead of triggering an automated reply that may be wrong or tone-deaf.

The pattern has two parts: label the message so it appears in the escalation queue, then notify a human agent via a separate inbox.

async function escalateToHuman(inboxId: string, messageId: string, intent: any) {
// Label the message in the support inbox
await client.messages.updateLabels({
inboxId,
messageId,
labels: ['escalated'],
});

// Send notification to human agent inbox
await client.messages.send({
inboxId: HUMAN_AGENT_INBOX_ID,
to: 'support-team@yourcompany.com',
subject: 'Escalation Required: Low Confidence Classification',
bodyText: `Message ID ${messageId} was classified as "${intent.intent}" with confidence ${intent.confidence}. Please review and respond manually.`,
});
}

This pattern is consistent with findings from Eesel AI's analysis of human handoff best practices, which identifies confidence thresholds and intent-specific triggers as the most reliable escalation signals. Messages containing keywords like "refund," "cancel," or "legal" should be escalated even when the classifier's overall confidence is high.
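A sketch of that combined rule: risky keywords force a human review regardless of score, and everything else is measured against the confidence bar. The threshold value and keyword list are illustrative:

```typescript
// Keyword-sensitive escalation: risky topics always go to a human;
// other messages use the normal confidence threshold. Values are
// illustrative, not recommendations.
const DEFAULT_THRESHOLD = 0.80;
const RISKY_KEYWORDS = ['refund', 'cancel', 'legal'];

function shouldEscalate(bodyText: string, confidence: number): boolean {
  const risky = RISKY_KEYWORDS.some(k => bodyText.toLowerCase().includes(k));
  if (risky) return true;                 // route risky topics to a human every time
  return confidence < DEFAULT_THRESHOLD;  // otherwise apply the normal bar
}

console.log(shouldEscalate('I want a refund now', 0.95)); // true
console.log(shouldEscalate('Where is my order?', 0.91));  // false
```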

The label approach keeps your support inbox organized. Messages labeled escalated appear separately from those the agent handled autonomously. You get a natural audit trail without building a separate database.

Step 7: Check Compliance Readiness Before Going Live

Before routing real customer email through the agent, run a compliance readiness check on the inbox:

const readiness = await client.compliance.readiness(inbox.id);
console.log('Compliance status:', readiness);

This checks that the inbox has proper configuration for unsubscribe handling, opt-out tracking, and other requirements that apply to automated email senders. Running this before go-live avoids situations where a compliance gap surfaces only after you have been sending at volume.

Putting It Together

The full architecture looks like this:

  1. A dedicated support inbox receives inbound email cleanly.
  2. An event notification listener fires your handler on each new message.
  3. Your handler fetches the message and full thread context.
  4. Your AI classifier returns an intent and confidence score.
  5. High-confidence intents trigger an automated reply via client.messages.reply().
  6. The event timeline confirms delivery after each send.
  7. Low-confidence intents are labeled escalated and routed to a human agent via a second inbox.
  8. Compliance readiness is verified before production launch.

We built and run this exact pattern for mailbot's own support. The dogfooding post goes into detail on how the live system handles real volume and where we had to adjust our confidence thresholds over time.

The Infrastructure Is the Product

The AI classifier is the part that gets the most attention in conversations about AI support agents. But as the r/AI_Agents community has found firsthand, the classifier is rarely where things break. The email infrastructure underneath it is where fragility lives: brittle polling loops, lost thread context, unconfirmed delivery, no human fallback.

The steps in this guide address each of those failure points specifically. A dedicated inbox eliminates noise. Event notifications replace polling. client.threads.get() preserves context. client.engagement.messageTimeline() confirms delivery. Labels and a second inbox create a human escalation path. Compliance readiness checks prevent surprises at go-live.

Ready to start building? The full SDK reference is at getmail.bot/docs/getting-started.


Sources

  1. r/AI_Agents: Email context for AI agents is way harder than it looks
  2. Composio: Why AI Agent Pilots Fail in 2026 (Integration Roadmap)
  3. Instantly: Automate Email Triage Classification with AI
  4. Eesel AI: Best Practices for Human Handoff in Chat Support
  5. mailbot: We Run Our Own Support on Our Own API

Why Your Agent Needs Its Own Email Address, Not a Shared Mailbox

· 7 min read
Founder, mailbot

You have three agents running different workflows. Support, onboarding, and billing. All three send email from the same address: team@yourcompany.com. A customer replies to a billing notification. The support agent picks it up. The onboarding agent ignores it. Nobody knows which agent should own the conversation.

This is what happens when agents share a mailbox. It works in demos. It breaks in production.

Your Compliance Checkbox Is Not a Compliance Strategy

· 6 min read
Founder, mailbot

Look at the footer of almost any email API provider. You will find a badge. SOC 2. ISO 27001. Maybe both. There is rarely anything else. No explanation of where your email content is stored. No mention of which jurisdiction governs that data. No answer to the question: if a regulator asks who accessed this data and when, can you actually answer that?

A badge in the footer is not a compliance posture. It is marketing. And for teams building AI agents that handle email, the difference between the two is the difference between passing an audit and failing one.

Email Infrastructure Is the Unsexy Moat

· 5 min read
Founder, mailbot

The demo always shows the API. Clean endpoint, one function call, email sent. The crowd nods. The founder moves on.

Nobody demos IP warming. Nobody presents a slide about bounce classification logic, retry queues, or what happens when a major inbox provider changes its filtering rules overnight. That stuff is not on the conference schedule. It never is.

From Inbox Zero to Inbox API

· 5 min read
Founder, mailbot

Inbox Zero was never about email. Merlin Mann, the productivity writer who coined it in the mid-2000s, eventually said so himself. The concept was about minimizing the mental attention you give to inboxes, not about achieving a literal empty count. But the productivity internet took it literally, and an entire industry of workflows, browser extensions, and time-blocking rituals was born. Millions of people spent hours every week trying to reach zero.

That is not a productivity system. That is a coping mechanism.