On paper, the system was clean.
Clear boundaries. Well-defined services. Events flowing neatly through queues. Every component had a responsibility, every responsibility had an owner. It looked exactly like the kind of system you'd draw on a whiteboard.
In production, it slowly fell apart.
Not catastrophically. Not in a way that triggered alarms or postmortems right away. It failed quietly, in the gaps between assumptions.
The Design
The system followed a fairly standard pattern:
- A core service handled requests
- Supporting services reacted via events
- State was distributed, but “eventually consistent”
- Retries were handled automatically
- Failures were assumed to be transient
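In code, each supporting consumer looked roughly like the sketch below (Go for illustration; all names are hypothetical). It captures the two assumptions doing the load-bearing work: failures are transient, and state converges on its own.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Event is a minimal stand-in for a message pulled off a queue.
type Event struct {
	ID      string
	Payload string
}

var attempts int

// handle updates this service's local state. Like every handler in the
// system, it only knows about its own data, not the overall operation.
func handle(e Event) error {
	attempts++
	if attempts < 3 {
		return errors.New("downstream unavailable") // looks transient
	}
	fmt.Printf("processed %s on attempt %d\n", e.ID, attempts)
	return nil
}

// consume encodes both assumptions: failures are transient, so blind
// retries are safe, and state converges eventually, so nobody checks
// whether the operation as a whole ever finished.
func consume(e Event) {
	for attempt := 1; ; attempt++ {
		if err := handle(e); err == nil {
			return // locally done; globally, who knows
		}
		time.Sleep(time.Duration(attempt) * 100 * time.Millisecond)
	}
}

func main() {
	consume(Event{ID: "order-123", Payload: `{"sku":"A1"}`})
}
```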
Nothing exotic. Nothing obviously wrong.
Each service could be reasoned about independently. Each deployment was small. Each change felt safe.
That was the problem.
What Worked (Initially)
For a while, everything behaved exactly as designed.
- Latency was low
- Throughput was high
- Services scaled independently
- Teams moved fast without stepping on each other
From a distance, it looked like a success.
From inside the system, something else was forming.
Where It Broke
The first cracks showed up as data that was technically correct but operationally useless.
Orders existed but couldn’t be fulfilled.
Payments succeeded but weren’t visible to downstream systems.
Retries “worked” but duplicated side effects.
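The duplication, in particular, had a simple shape. Here's a minimal sketch (a hypothetical payment handler, not our actual code): the side effect and the record of it are two steps, and the retry layer can't tell which one failed.

```go
package main

import (
	"errors"
	"fmt"
)

// charges counts how many times each order was charged.
var charges = map[string]int{}

// chargeCard performs the irreversible side effect.
func chargeCard(orderID string) error {
	charges[orderID]++
	return nil
}

// recordCharge fails *after* the money has moved.
func recordCharge(orderID string) error {
	return errors.New("db write timed out")
}

// handlePayment is not idempotent: a failure in the second step makes
// the retry layer replay the first step too.
func handlePayment(orderID string) error {
	if err := chargeCard(orderID); err != nil {
		return err
	}
	return recordCharge(orderID)
}

func main() {
	for attempt := 0; attempt < 3; attempt++ {
		if handlePayment("order-123") == nil {
			break
		}
	}
	// Every retry "worked" from the queue's point of view.
	fmt.Println("times charged:", charges["order-123"]) // prints 3
}
```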
Nothing was broken enough to fail loudly.
Every issue could be explained away:
- “The event will catch up”
- “That’s eventual consistency”
- “The retry will fix it”
Individually, each explanation made sense.
Collectively, they hid the real issue.
The Real Failure: Ownership of State
No one actually owned the system’s truth.
Each service owned its data.
No service owned the outcome.
When something went wrong, every component could say:
“I did what I was supposed to do.”
And they were right.
The system didn’t fail because of bugs.
It failed because responsibility was fragmented.
Observability Didn’t Save Us
We had metrics. Logs. Traces.
What we didn’t have was a way to answer simple questions:
- Is this request actually done?
- Who decides that it failed?
- What does “success” mean across services?
Observability showed us activity, not resolution.
The system was busy. The system was alive.
The system was not reliable.
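The gap fits in two type definitions. This is a sketch with invented names, not our real schema:

```go
package main

import "fmt"

// Activity is what our telemetry could answer: how busy things were.
type Activity struct {
	EventsConsumed int
	RetriesFired   int
	SpansRecorded  int
}

// Resolution is what nothing in the system produced: a single,
// authoritative answer per request.
type Resolution struct {
	RequestID string
	Outcome   string // "fulfilled" | "failed" | "needs-human"
	DecidedBy string // the one service allowed to say so
}

func main() {
	// Activity says the system is alive; it cannot say order-123 is done.
	fmt.Printf("%+v\n", Activity{EventsConsumed: 14, RetriesFired: 3, SpansRecorded: 41})
	// A Resolution answers all three questions from the list above.
	fmt.Printf("%+v\n", Resolution{RequestID: "order-123", Outcome: "fulfilled", DecidedBy: "orders"})
}
```

We had plenty of the first shape. Nothing anywhere produced the second.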
Why This Didn’t Show Up in Design Reviews
Because design reviews focus on structure, not behavior.
We reviewed:
- service boundaries
- schemas
- scaling characteristics
We didn’t review:
- how partial failure feels
- who cleans up ambiguity
- where humans intervene when automation stalls
Those questions don’t diagram well.
The Fix Wasn’t Architectural
We didn’t rewrite the system.
We didn’t remove microservices.
We didn’t switch databases.
We introduced explicit ownership.
One service became responsible for declaring outcomes — not just emitting events.
Retries stopped being automatic in places where side effects mattered.
Some “eventual” paths became synchronous again.
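Sketched in code (hypothetical names again, not our actual implementation), the change looks like this: one function owns the whole lifecycle, runs the side-effecting steps in order, and is the only place allowed to declare a result.

```go
package main

import (
	"errors"
	"fmt"
)

// Outcome is a terminal answer, declared in exactly one place.
type Outcome string

const (
	Fulfilled  Outcome = "fulfilled"
	Failed     Outcome = "failed"
	NeedsHuman Outcome = "needs-human" // automation stops; a person decides
)

// reserve, charge, and notify stand in for downstream calls.
func reserve(orderID string) error { return nil }
func charge(orderID string) error  { return nil }
func notify(orderID string) error  { return errors.New("notifier down") }

// fulfill is the outcome owner. The steps run synchronously, and nothing
// after the charge is blindly retried, because replaying it isn't safe.
func fulfill(orderID string) Outcome {
	if err := reserve(orderID); err != nil {
		return Failed // nothing irreversible has happened; safe to fail
	}
	if err := charge(orderID); err != nil {
		return Failed
	}
	if err := notify(orderID); err != nil {
		// The charge already happened. Rather than retry and risk a
		// duplicate, the owner parks the order for a human.
		return NeedsHuman
	}
	return Fulfilled
}

func main() {
	fmt.Println("order-123:", fulfill("order-123")) // order-123: needs-human
}
```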
The system became slightly slower.
It became much easier to reason about.
What I Took Away
Clean abstractions are seductive.
They reduce local complexity while increasing global ambiguity.
A system that looks elegant in isolation can be fragile in reality if no one owns the full lifecycle of an operation.
In practice, the most important question isn’t:
“Is this service correct?”
It’s:
“Who is accountable when this doesn’t finish?”
If the answer is unclear, the design is already failing — it just hasn’t done so yet.