The Transactional Outbox Pattern: Solving the Dual-Write Problem

You cannot atomically write to your database and publish an event. The transactional outbox pattern fixes this dual-write problem by writing the event into the same database transaction as the data change and relaying it to the broker afterward. Delivery is at-least-once, so consumers must be idempotent.

Introduction

Here's a piece of code that looks completely reasonable and is quietly broken:

await db.order.create({ data: order }) // 1. save to database
await eventBus.publish('order.created', order) // 2. publish an event

Save the order, then tell the rest of the system about it. What could go wrong? Everything in the gap between those two lines. If the process crashes after line 1 but before line 2, the order exists but no event is published — downstream services never hear about it, the confirmation email never sends, the ledger never updates. Flip the order of the lines and you get the opposite failure: an event for an order that was never saved.

This is the dual-write problem, and it's one of the most common sources of silent data inconsistency in distributed systems. You're writing to two systems, your database and your message broker, and there's no way to make those two writes atomic. The transactional outbox pattern is the standard, battle-tested solution. This post covers how it works, why the obvious alternatives fall short, and how to build it.

The Dual-Write Problem

The root issue is that a database write and a message publish happen in two separate systems with two separate commit points. You can make the database write atomic with other database writes. You can make the publish atomic with other publishes. But you cannot wrap a Postgres COMMIT and a Kafka/RabbitMQ/SQS publish in a single atomic unit. There's always a moment where one has happened and the other hasn't, and a crash in that moment leaves the two systems disagreeing forever.

It's tempting to think this is rare. It isn't. Processes crash, deploys restart pods mid-request, the network to the broker blips, the broker is briefly down. At any real volume, the gap will be hit, and each hit is a permanent inconsistency that no retry can fix — because by the time you retry, you've lost the knowledge that you were halfway through.
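To make the failure window concrete, here is a minimal in-memory sketch. The arrays standing in for the database and the broker, the `dualWrite` function, and the `crashAfterSave` flag are all hypothetical; the crash is simulated by throwing between the two writes.

```typescript
// Hypothetical in-memory stand-ins for a database table and a broker topic.
type Order = { id: string; total: number };

function dualWrite(
  savedOrders: Order[],
  publishedEvents: string[],
  order: Order,
  crashAfterSave: boolean,
): void {
  savedOrders.push(order); // 1. the database commit is durable from here on
  if (crashAfterSave) {
    // Simulates the process dying in the gap between the two writes.
    throw new Error("process died before publish");
  }
  publishedEvents.push(`order.created:${order.id}`); // 2. broker publish
}

const savedOrders: Order[] = [];
const publishedEvents: string[] = [];

dualWrite(savedOrders, publishedEvents, { id: "o-1", total: 40 }, false);
try {
  dualWrite(savedOrders, publishedEvents, { id: "o-2", total: 15 }, true);
} catch (_err) {
  // The process restarts with no memory of having been halfway through.
}

// Orders that exist in the "database" but were never announced to anyone.
const orphanedOrderIds = savedOrders
  .filter((o) => publishedEvents.indexOf(`order.created:${o.id}`) === -1)
  .map((o) => o.id);
```

After the simulated crash, `orphanedOrderIds` contains only `o-2`: the order is saved, but nothing downstream will ever learn about it, and nothing in either system records that a publish is still owed.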

Why Not Just Use a Distributed Transaction?

The textbook answer is "two-phase commit" (2PC) — a coordinator that asks both systems to prepare, then commit. In practice almost nobody uses it for app-to-broker writes, for good reasons:

Most modern brokers don't support it (or support it poorly). Kafka, SQS, and most HTTP APIs have no XA transaction to enlist.
It's a synchronous, blocking protocol. Every participant holds locks until the coordinator decides, which destroys throughput and latency.
The coordinator is a failure point. If it dies after prepare but before commit, participants are stuck holding locks, "in doubt," waiting.

2PC trades a small inconsistency window for a large availability and performance cost. For high-throughput systems, that trade is backwards. The outbox pattern closes the same window without a coordinator: the event is committed atomically with the data, and delivery to the broker follows eventually.

The Outbox Pattern

The insight is elegant: if you can't atomically write to the database and the broker, then only write to the database, and include the event itself in that write. Add an outbox table to the same database, and insert the event row in the same transaction as your domain change:

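CREATE TABLE outbox (
  id           BIGSERIAL PRIMARY KEY,
  aggregate_id TEXT NOT NULL,
  event_type   TEXT NOT NULL,
  payload      JSONB NOT NULL,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
  published_at TIMESTAMPTZ              -- NULL until a relay ships it
);

-- Partial index so the relay's scan for unpublished rows stays cheap as the table grows
CREATE INDEX outbox_unpublished_idx ON outbox (id) WHERE published_at IS NULL;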

await db.$transaction(async tx => {
  await tx.order.create({ data: order })
  await tx.outbox.create({
    data: {
      aggregateId: order.id,
      eventType: 'order.created',
      payload: order,
    },
  })
})

Now the domain write and the event are a single atomic commit. Either both happen or neither does — the dual-write problem is gone, because there's only one write, to one system. The event sits durably in the outbox table, waiting to be delivered. A separate process — the relay — reads unpublished rows and ships them to the broker.

The Relay: Polling vs Change Data Capture

The relay's job is to move rows from the outbox to the broker and mark them published. There are two ways to build it.

Polling. The simplest: a worker loops, selects unpublished rows, publishes each, marks it done.

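// One polling pass: ship the oldest unpublished rows, in id order
async function relayTick() {
  const events = await db.outbox.findMany({
    where: { publishedAt: null },
    orderBy: { id: 'asc' },
    take: 100,
  })

  for (const event of events) {
    // Publish first, then mark. If publish throws, the loop stops here and
    // the remaining rows stay unpublished for the next tick.
    await broker.publish(event.eventType, event.payload)
    await db.outbox.update({
      where: { id: event.id },
      data: { publishedAt: new Date() },
    })
  }
}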

Easy to build, easy to reason about, works everywhere. The cost is polling latency and load — you're querying constantly. Fine for most systems; tune the interval and batch size.
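One way to tune the interval is to let it adapt to load. The sketch below is an illustration under assumptions, not part of the relay above: `PollConfig`, `nextDelayMs`, and `runRelay` are hypothetical names, and it assumes the tick function is adapted to resolve with the number of rows it shipped, which `relayTick` as written does not do.

```typescript
// All names here are hypothetical; this is one possible polling policy, not the only one.
interface PollConfig {
  batchSize: number;   // rows fetched per tick
  baseDelayMs: number; // pause after a partially filled batch
  maxDelayMs: number;  // upper bound on the pause while the outbox is idle
}

// Decide how long to wait before the next tick, given how many rows the last tick shipped.
function nextDelayMs(shipped: number, previousDelayMs: number, cfg: PollConfig): number {
  if (shipped >= cfg.batchSize) return 0;  // full batch: more rows are probably waiting
  if (shipped > 0) return cfg.baseDelayMs; // some work: normal cadence
  // Idle: double the previous delay, capped at maxDelayMs.
  return Math.min(Math.max(previousDelayMs, cfg.baseDelayMs) * 2, cfg.maxDelayMs);
}

// Drives the relay. tick() is assumed to ship up to cfg.batchSize rows and
// resolve with the count; sleep and shouldStop are injected so the loop is testable.
async function runRelay(
  tick: () => Promise<number>,
  sleep: (ms: number) => Promise<void>,
  shouldStop: () => boolean,
  cfg: PollConfig,
): Promise<void> {
  let delay = cfg.baseDelayMs;
  while (!shouldStop()) {
    const shipped = await tick();
    delay = nextDelayMs(shipped, delay, cfg);
    if (delay > 0) await sleep(delay);
  }
}
```

With this policy a busy outbox is drained back to back, while an idle one is polled less and less often, which keeps the query load low at the price of a worst-case latency of `maxDelayMs` for the first event after a quiet period.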

Change Data Capture (CDC). Instead of polling, tail the database's replication log (e.g. Postgres WAL via Debezium). The moment a row is committed to the outbox, the log streams it out and the relay publishes it. Lower latency, no polling load, but more operational machinery to run. Reach for CDC when latency or scale makes polling hurt; start with polling otherwise.
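On the consuming side of a CDC pipeline, the relay's job shrinks to mapping change events onto broker messages. The sketch below assumes a Debezium-style envelope with schemas disabled, where each event carries an `op` code (`c` create, `u` update, `d` delete, `r` snapshot read) and `before`/`after` row images. The `toBrokerMessage` function and the `BrokerMessage` shape are hypothetical, and the handling of `payload` as a JSON string is an assumption about how the JSONB column is serialized in your setup. Debezium also ships an outbox event router transform that can do this mapping inside the connector.

```typescript
// Simplified, assumed shape of a change event for the outbox table.
interface OutboxRowImage {
  id: number;
  aggregate_id: string;
  event_type: string;
  payload: string | Record<string, unknown>; // assumed: JSONB may arrive as a JSON string
}

interface ChangeEvent {
  op: "c" | "u" | "d" | "r";
  before: OutboxRowImage | null;
  after: OutboxRowImage | null;
}

// Hypothetical broker message shape.
interface BrokerMessage {
  topic: string;
  key: string;
  value: unknown;
}

function toBrokerMessage(change: ChangeEvent): BrokerMessage | null {
  // Only inserts are events. Updates and deletes are housekeeping on the table,
  // and snapshot reads ("r") are skipped here; replaying them is a deployment choice.
  if (change.op !== "c" || change.after === null) return null;

  const row = change.after;
  const value: unknown =
    typeof row.payload === "string" ? JSON.parse(row.payload) : row.payload;

  // Keying by aggregate_id keeps all events for one aggregate together.
  return { topic: row.event_type, key: row.aggregate_id, value };
}
```

Because the log itself is the source of truth for what has been shipped, a CDC relay has no need to update `published_at`; the connector's own offset tracking plays that role.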

At-Least-Once — So Consumers Must Be Idempotent

Look closely at the polling relay and you'll spot the same gap as the original problem, just moved: it publishes to the broker, then marks the row published. If it crashes in between, the event was published but the row still shows unpublished — so the next tick publishes it again.

This is not a bug to be fixed; it's a deliberate trade. The outbox guarantees at-least-once delivery: every event is delivered, possibly more than once, never zero times. The alternative (mark-then-publish) would risk losing events, which is far worse. Duplicates are the acceptable failure.
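The duplicate is easy to reproduce in miniature. This is an in-memory sketch only: `PendingEvent`, `relayPass`, and the `crashAfterPublishOf` parameter are hypothetical, and the crash is simulated by throwing after the publish but before the mark.

```typescript
interface PendingEvent {
  id: number;
  type: string;
  publishedAt: number | null;
}

// One relay pass over an in-memory outbox. crashAfterPublishOf simulates the
// process dying after the broker accepted an event but before the row was marked.
function relayPass(
  outboxRows: PendingEvent[],
  delivered: number[],
  crashAfterPublishOf?: number,
): void {
  for (const row of outboxRows) {
    if (row.publishedAt !== null) continue;
    delivered.push(row.id); // publish first ...
    if (row.id === crashAfterPublishOf) {
      throw new Error("crashed between publish and mark");
    }
    row.publishedAt = Date.now(); // ... then mark
  }
}

const outboxRows: PendingEvent[] = [
  { id: 1, type: "order.created", publishedAt: null },
  { id: 2, type: "order.paid", publishedAt: null },
];
const delivered: number[] = [];

try {
  relayPass(outboxRows, delivered, 2); // dies right after publishing event 2
} catch (_err) {
  // The relay restarts; event 2 still looks unpublished.
}
relayPass(outboxRows, delivered); // the next tick sends event 2 again
```

At the end `delivered` holds `[1, 2, 2]`: nothing was lost, and event 2 went out twice. Swapping the two steps inside the loop would instead produce `[1]` after a crash at the same point, with event 2 marked as published and never sent.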

Which means the outbox pattern is only half a solution on its own. The other half is that every consumer must be idempotent: able to receive the same event twice and apply its effect once. This is the exactly-once effect through idempotency: at-least-once delivery from the outbox, plus idempotent consumers, equals exactly-once effect across your whole system. The two patterns are designed to be used together.
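A common way to get idempotency is to remember which event ids have already been applied. The sketch below keeps that record in memory purely for illustration; `LedgerConsumer` and `OrderEvent` are hypothetical names, and using the outbox row id as the event id is an assumption. In a real consumer the processed-id record would need to be stored durably and committed in the same transaction as the side effect, for example as a table with a unique constraint on the event id.

```typescript
interface OrderEvent {
  eventId: number; // assumed to be the outbox row id, carried along with the message
  orderId: string;
  amount: number;
}

class LedgerConsumer {
  private processed = new Set<number>();
  balance = 0;

  // Returns true if the event was applied, false if it was a duplicate.
  handle(event: OrderEvent): boolean {
    if (this.processed.has(event.eventId)) return false;
    this.balance += event.amount;
    this.processed.add(event.eventId);
    return true;
  }
}

const ledger = new LedgerConsumer();
const paid: OrderEvent = { eventId: 2, orderId: "o-1", amount: 40 };

const firstDelivery = ledger.handle(paid);  // applied
const secondDelivery = ledger.handle(paid); // duplicate, ignored
```

Both deliveries arrive, but `ledger.balance` ends at 40 rather than 80, which is the exactly-once effect the pairing is meant to produce.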

Ordering

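A detail that bites: if downstream cares about order (event B must not be processed before event A), the relay must preserve it. The BIGSERIAL id is assigned when a row is inserted, not when its transaction commits, so under concurrent writers it only approximates commit order: a row with a lower id can become visible after a row with a higher id has already been shipped. Events for the same aggregate are still ordered by id as long as writes to that aggregate are serialized (for example, each transaction takes a row lock on the aggregate before inserting its outbox row). Publish in id order and don't parallelise across events that share an aggregate. If you need strict per-entity ordering, key the broker partition by aggregate_id so all events for one order/account land on the same partition and stay ordered. Global ordering across all events is usually neither needed nor worth the throughput cost.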
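Keying by aggregate_id works because a deterministic hash always sends the same key to the same partition. Many broker clients do this for you once a message key is set, so the sketch below only illustrates the idea. `partitionFor` is a hypothetical helper, and FNV-1a is a stand-in for whatever partitioner your client actually uses.

```typescript
// FNV-1a, 32-bit. This version hashes UTF-16 code units, which matches the
// byte-wise algorithm only for ASCII ids; good enough for an illustration.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned
}

// Same aggregate id in, same partition out.
function partitionFor(aggregateId: string, partitionCount: number): number {
  if (!Number.isInteger(partitionCount) || partitionCount <= 0) {
    throw new RangeError("partitionCount must be a positive integer");
  }
  return fnv1a(aggregateId) % partitionCount;
}

const firstEventPartition = partitionFor("order-42", 12);
const secondEventPartition = partitionFor("order-42", 12);
```

Both calls return the same partition, so `order.created` and `order.paid` for `order-42` are consumed in the order they were published. Note that changing the partition count remaps keys, which can briefly break per-aggregate ordering for events in flight.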

Pitfalls

Marking published before the publish succeeds. Always publish first, then mark — at-least-once beats at-most-once for events.
An unbounded outbox table. Published rows accumulate; archive or delete them on a schedule, or the table (and its indexes) bloat.
Forgetting idempotent consumers. The outbox will deliver duplicates. A non-idempotent consumer turns that into double-processing.
Putting the outbox in a different database than the domain data. Then you're back to a dual-write. The whole trick depends on one transaction over one database.
Huge payloads in the outbox. Store what consumers need; for large blobs, store a reference and let consumers fetch.
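For the unbounded-table pitfall, the pruning rule can be stated precisely: a row may go only if it has been published, and only once it has been published for longer than some retention window. The sketch below expresses that rule over in-memory rows; `StoredEvent`, `prunableIds`, and the seven-day retention are illustrative assumptions. Against a real table the same rule is a scheduled DELETE filtered on published_at.

```typescript
interface StoredEvent {
  id: number;
  publishedAt: Date | null;
}

// Returns the ids that are safe to delete: already published, and published
// longer ago than retentionMs. Unpublished rows are never returned, however old.
function prunableIds(rows: StoredEvent[], now: Date, retentionMs: number): number[] {
  const cutoff = now.getTime() - retentionMs;
  return rows
    .filter((r) => r.publishedAt !== null && r.publishedAt.getTime() < cutoff)
    .map((r) => r.id);
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;
const storedEvents: StoredEvent[] = [
  { id: 1, publishedAt: new Date("2024-01-01T00:00:00Z") }, // old and published
  { id: 2, publishedAt: new Date("2024-01-09T00:00:00Z") }, // published recently
  { id: 3, publishedAt: null },                             // still waiting for the relay
];

const toDelete = prunableIds(storedEvents, new Date("2024-01-10T00:00:00Z"), SEVEN_DAYS_MS);
```

Here only row 1 qualifies. Keeping a retention window rather than deleting on publish leaves recent events available for debugging and for replay if a consumer needs them again.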

Conclusion

The dual-write problem is sneaky because the broken code looks correct and works fine in testing — it only fails under the crashes and blips that real production guarantees. The transactional outbox fixes it with a humble move: stop writing to two systems, write the event into the same database transaction as the data, and let a relay deliver it afterward. One atomic commit, zero lost events.

The pattern's honesty is what makes it robust: it doesn't pretend to give exactly-once delivery, it gives at-least-once and tells you plainly that your consumers must be idempotent to close the loop. Pair the outbox with idempotent processing and you have a system where events are never lost, duplicates never matter, and your database and the rest of the world never drift apart.