Distributed Workflows: Designing Without One Big Transaction

The workflow didn’t get harder. Your assumptions did.

In one database, you can pretend the world is atomic. Do the work. Commit. Done.

In a distributed system, the business workflow still exists. But “done” no longer lives in one place.

So the design must change.

The goal is not perfect control. The goal is a workflow that finishes, even when parts fail.

A workflow is state, not calls

Many teams implement workflows as call chains. A calls B. B calls C. Everyone hopes.

That’s not a workflow model. That’s a dependency chain.

A workflow model is simple:

what states exist
what transitions are valid
what must be true before and after each transition

If you can’t describe the states, you can’t recover from failure cleanly. You can only retry and pray.
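The three items in that list can be written down directly. The sketch below is one minimal way to do it, assuming a hypothetical order workflow; the state names, the `order` dictionary fields, and the precondition rule are illustrative assumptions, not part of any particular system.

```python
from enum import Enum


class OrderState(Enum):
    # Hypothetical states for an order workflow.
    RECEIVED = "received"
    FUNDS_RESERVED = "funds_reserved"
    SHIPPED = "shipped"
    COMPLETED = "completed"
    CANCELLED = "cancelled"


# Which transitions are valid. Anything not listed here is rejected.
TRANSITIONS = {
    OrderState.RECEIVED: {OrderState.FUNDS_RESERVED, OrderState.CANCELLED},
    OrderState.FUNDS_RESERVED: {OrderState.SHIPPED, OrderState.CANCELLED},
    OrderState.SHIPPED: {OrderState.COMPLETED},
    OrderState.COMPLETED: set(),
    OrderState.CANCELLED: set(),
}

# What must be true before a transition (an illustrative rule only).
PRECONDITIONS = {
    (OrderState.FUNDS_RESERVED, OrderState.SHIPPED):
        lambda order: order["reserved_cents"] >= order["total_cents"],
}


class InvalidTransition(Exception):
    """Raised when a move is not allowed from the current state."""


def transition(order: dict, target: OrderState) -> None:
    """Move the order to `target`, or raise without changing anything."""
    current = order["state"]
    if target not in TRANSITIONS[current]:
        raise InvalidTransition(f"{current.value} -> {target.value} is not valid")
    check = PRECONDITIONS.get((current, target))
    if check is not None and not check(order):
        raise InvalidTransition(f"precondition failed: {current.value} -> {target.value}")
    order["state"] = target  # after the call, the order is in the target state
```

With a table like this, recovery code can ask "which state is this order in, and what is allowed from here?" instead of guessing which call in a chain was the last to succeed.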

There is no global rollback

In distributed systems, failures are partial and messy.

A step may succeed and the confirmation may be lost. A timeout may hide a success. A retry may run the same action twice.

No single transaction spans all the services involved, so “rollback” is usually not a technical feature. It is a business decision.

That is why workflow design is not just engineering. It is product behavior expressed as state transitions.

Choose how the workflow is coordinated

You can coordinate a workflow in two broad ways.

With a coordinator, one place tracks state and decides what happens next. This makes the workflow visible and easier to reason about. The risk is letting the coordinator become a dumping ground for domain logic.

Without a coordinator, the workflow is driven by reactions to events. This can be flexible, but the workflow becomes harder to “see” as a whole. Debugging becomes reconstruction: you piece the story together from events scattered across services.

Both can work.

What matters is this: when something goes wrong, can you explain what is happening?

If you can’t, the workflow isn’t designed. It’s implicit.
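For the coordinator style, "can you explain what is happening?" can be a method call. The following is a minimal sketch under stated assumptions: the `Coordinator` and `WorkflowRecord` names are hypothetical, steps are plain functions, and the in-memory dictionary stands in for a durable store that a real coordinator would need.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class WorkflowRecord:
    workflow_id: str
    next_step: int = 0                          # index of the next step to run
    history: list = field(default_factory=list)
    failed: bool = False


class Coordinator:
    """Tracks workflow state in one place and decides what runs next."""

    def __init__(self, steps: list):
        # steps: list of (name, action) pairs; each action takes a workflow id.
        self.steps = steps
        self.records: dict = {}  # assumption: stand-in for durable storage

    def advance(self, workflow_id: str) -> WorkflowRecord:
        record = self.records.setdefault(workflow_id, WorkflowRecord(workflow_id))
        record.failed = False
        while record.next_step < len(self.steps):
            name, action = self.steps[record.next_step]
            try:
                action(workflow_id)
            except Exception as exc:
                record.history.append(f"{name}: failed ({exc})")
                record.failed = True
                return record  # a later advance() resumes at this same step
            record.history.append(f"{name}: ok")
            record.next_step += 1
        return record

    def explain(self, workflow_id: str) -> str:
        record = self.records[workflow_id]
        if record.next_step == len(self.steps):
            return "completed"
        name = self.steps[record.next_step][0]
        return f"failed at {name}" if record.failed else f"waiting on {name}"
```

Note that the coordinator here only sequences steps and records outcomes. The domain rules stay inside the step functions, which is one way to keep it from becoming a dumping ground. Resuming re-runs the failed step, so that step has to tolerate being called twice.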

Compensation is the real “undo”

When a later step fails, you often can’t undo the world. Time has passed. Other systems reacted. People were notified.

So the system needs compensations: business actions that correct course. Refund. Release a reservation. Cancel a request. Issue a reversal.

Compensation must be explicit:

when it triggers
what it changes
what happens if it fails

If compensation is improvised, production becomes the design phase.
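One way to make those three points explicit is to attach a compensation to each step and to record, rather than swallow, a compensation that fails. This is a sketch only: `Step`, `run_with_compensation`, and the result fields are hypothetical names chosen for illustration.

```python
from typing import Callable, Optional


class Step:
    def __init__(self, name: str, do: Callable[[], None],
                 compensate: Optional[Callable[[], None]] = None):
        self.name = name
        self.do = do
        self.compensate = compensate  # None means "nothing to correct"


def run_with_compensation(steps: list) -> dict:
    """Run steps in order; if one fails, compensate finished steps in reverse."""
    done = []
    for step in steps:
        try:
            step.do()
            done.append(step)
        except Exception as exc:
            # When it triggers: a step raised.
            result = {"status": "compensated", "failed_step": step.name,
                      "error": str(exc), "needs_attention": []}
            # What it changes: each finished step gets its business correction.
            for finished in reversed(done):
                if finished.compensate is None:
                    continue
                try:
                    finished.compensate()
                except Exception:
                    # What happens if it fails: it is recorded for follow-up.
                    result["needs_attention"].append(finished.name)
            if result["needs_attention"]:
                result["status"] = "compensation_incomplete"
            return result
    return {"status": "completed"}
```

The sketch treats a raised exception as "the step did not happen". In practice a timeout can hide a success, so the failed step itself may also need a compensation or a status check before the workflow is declared clean.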

The three foundations that make workflows survivable

First: idempotency. Retries and duplicates will happen, so running a step twice must have the same effect as running it once.
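A common way to get there is an idempotency key: the caller sends the same key on every retry, and the receiver returns the stored result instead of repeating the work. The toy service below is an assumption-laden sketch (the `PaymentService` class, its fields, and the key format are all hypothetical), and the in-memory dictionary stands in for a durable table.

```python
class PaymentService:
    """Toy service that remembers the result for each idempotency key."""

    def __init__(self):
        self._results = {}     # assumption: stand-in for a durable table
        self.charges_made = 0  # counts real side effects, for illustration

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": f"ch_{self.charges_made}", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("order-42-charge", 1500)
retry = service.charge("order-42-charge", 1500)  # e.g. after a lost confirmation
```

Here `retry` is the same result as `first`, and only one charge was made. In a real service the check and the write would have to be atomic, for example through a unique constraint on the key, or two concurrent retries could both pass the check.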

Second: time rules. How long do you wait? How many retries? When does a workflow count as “stuck”? Who gets notified?
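Those four questions can be answered once, in one place, instead of being scattered through retry loops. The sketch below assumes exponential backoff as the retry strategy, which is one common choice rather than the only one; every number and the `workflow-oncall` alias are placeholder assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeRules:
    step_timeout_s: float = 30.0     # how long to wait for one attempt
    max_attempts: int = 5            # how many tries before giving up
    base_delay_s: float = 1.0        # delay before the first retry; doubles after
    max_delay_s: float = 60.0        # cap on the delay between retries
    stuck_after_s: float = 900.0     # no progress for this long means "stuck"
    notify: str = "workflow-oncall"  # hypothetical alias for who gets notified


def retry_delay(rules: TimeRules, attempts_made: int) -> float:
    """Delay before the next retry, given how many attempts have been made."""
    exponent = max(attempts_made - 1, 0)
    return min(rules.base_delay_s * 2 ** exponent, rules.max_delay_s)


def next_action(rules: TimeRules, attempts_made: int,
                seconds_since_progress: float) -> tuple:
    """Decide what to do with a step that has not succeeded yet."""
    out_of_tries = attempts_made >= rules.max_attempts
    stuck = seconds_since_progress >= rules.stuck_after_s
    if out_of_tries or stuck:
        return ("escalate", rules.notify)
    return ("retry", retry_delay(rules, attempts_made))
```

For example, with the defaults above, `next_action(TimeRules(), 2, 10.0)` asks for a retry after 2 seconds, while a fifth failed attempt or 15 minutes without progress escalates instead of retrying forever.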

Third: visibility. Support and engineers need to answer “where is it?” quickly. Users need a clear difference between “accepted” and “completed.”
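A small status model is often enough to keep "accepted" and "completed" apart and to answer "where is it?" in one line. The names below (`PublicStatus`, `WorkflowView`, `where_is_it`) and the step names in the usage are hypothetical, and the sketch assumes the workflow already records how many steps have finished.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PublicStatus(Enum):
    ACCEPTED = "accepted"                # request recorded, no step finished yet
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"              # every step finished
    NEEDS_ATTENTION = "needs_attention"  # stuck or escalated


@dataclass
class WorkflowView:
    workflow_id: str
    steps_total: int
    steps_done: int
    current_step: Optional[str]
    stuck: bool = False


def public_status(view: WorkflowView) -> PublicStatus:
    if view.stuck:
        return PublicStatus.NEEDS_ATTENTION
    if view.steps_done == view.steps_total:
        return PublicStatus.COMPLETED
    if view.steps_done == 0:
        return PublicStatus.ACCEPTED
    return PublicStatus.IN_PROGRESS


def where_is_it(view: WorkflowView) -> str:
    """One-line answer for support and engineers."""
    status = public_status(view)
    if status is PublicStatus.COMPLETED:
        return f"{view.workflow_id}: completed"
    step_number = view.steps_done + 1
    return (f"{view.workflow_id}: {status.value}, "
            f"step {step_number} of {view.steps_total} ({view.current_step})")
```

A user-facing response would typically return only the status value, while the longer line is meant for support tooling.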

Without these three, workflows don’t fail loudly. They fail quietly. And quietly is worse.

Closing

A distributed workflow is not a transaction. It is a sequence of state changes over time.

Model the states. Make coordination intentional. Treat compensation as a business action. Design for duplicates. Make progress visible.

That is how workflows finish.

Key takeaways

Distributed workflows break transaction assumptions, not business intent.
Model workflows as states and transitions, not call chains.
There is no global rollback; failures are partial and uncertain.
Coordination must be chosen intentionally; visibility matters more than purity.
Compensation is a business action, not a technical undo.
Idempotency is required because retries and duplicates happen.
Timeouts, retry limits, and “stuck” handling must be designed.
Workflow state should be visible; “accepted” and “completed” are different.