Concurrency · 4 min read

Atomics vs Mutex in Rust: Why Mutex Won Under Heavy Contention

Why a mutex beat atomics under heavy contention, with flamegraphs and a counterintuitive takeaway.

What if I told you a mutex beat atomics in a tight shared-counter benchmark?

That happened in my Rust experiment when I compared an atomic counter (fetch_add) against a counter protected by a Mutex.

At low thread counts (2 and 4), atomics were clearly faster. At 8 threads, the mutex pulled ahead.

The Benchmark Setup

I benchmarked a single shared counter with Criterion, at 2, 4, and 8 threads, with 200,000 and 1,000,000 increments per thread.

The code is intentionally simple and contention-heavy.
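As a rough sketch of the two workloads being compared (not the benchmark code from the repository), the functions below increment one shared counter from several threads. The names `run_atomic` and `run_mutex`, the `AtomicU64` type, and the `Relaxed` ordering are assumptions made for illustration, and the Criterion harness is left out so the block uses only the standard library.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Every thread performs `iters` atomic increments on one shared counter.
pub fn run_atomic(threads: usize, iters: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    // Relaxed is an assumption: the counter guards no other data.
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

/// Every thread performs `iters` lock / increment / unlock cycles on one shared counter.
pub fn run_mutex(threads: usize, iters: u64) -> u64 {
    let counter = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    // The guard is dropped at the end of the statement, releasing the lock.
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}
```

Both functions return the final count, so `run_atomic(8, 1_000_000)` and `run_mutex(8, 1_000_000)` should each yield 8,000,000; only the time they take differs.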

The Performance Data

Here are the main results (Criterion midpoint):

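| Increments/thread | Threads | Atomic | Mutex | Winner |
| --- | --- | --- | --- | --- |
| 200,000 | 2 | 1.44 ms | 7.15 ms | Atomic (~4.98x) |
| 200,000 | 4 | 6.40 ms | 11.65 ms | Atomic (~1.82x) |
| 200,000 | 8 | 25.75 ms | 19.18 ms | Mutex (~1.34x) |
| 1,000,000 | 2 | 7.63 ms | 45.83 ms | Atomic (~6.00x) |
| 1,000,000 | 4 | 35.21 ms | 60.59 ms | Atomic (~1.72x) |
| 1,000,000 | 8 | 133.23 ms | 97.88 ms | Mutex (~1.36x) |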

I also reran only the 2-thread cases with larger input sizes.

So lower contention strongly favors atomics.

The Counterintuitive Part

If atomics are “lighter” than mutexes, why does mutex win at 8 threads here?

Because this benchmark is not measuring lock overhead in isolation.

It is measuring contention on one shared cache line.

What Actually Happens in the CPU

Atomic path (fetch_add)

Even if your code only updates and never explicitly reads, fetch_add is still a hardware read-modify-write operation.

That means the core needs exclusive ownership of the cache line before updating.
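A small sketch makes the read-modify-write nature visible: `fetch_add` hands back the value it read before adding, even when the caller discards it. The function name `fetch_add_demo` and the starting value are made up for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Returns (value read by fetch_add, value stored afterwards).
pub fn fetch_add_demo() -> (u64, u64) {
    let counter = AtomicU64::new(41);
    // One indivisible step: read the old value, add 1, write the result back.
    let previous = counter.fetch_add(1, Ordering::Relaxed);
    (previous, counter.load(Ordering::Relaxed))
}
```

Calling `fetch_add_demo()` returns `(41, 42)`: the read half of the operation always happens, which is why the core must own the line exclusively for every increment.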

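With many threads on one shared counter, each increment has to pull the cache line away from whichever core updated it last, while every other core waits its turn for the same line.

This is cache-line ping-pong (line handoff).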

Mutex path

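Mutex still contends, but lock arbitration changes behavior: waiting threads can spin briefly and then park until woken, so fewer cores fight over the hot line at the same moment.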

So under very high contention, mutex can be less bad than a single shared atomic increment loop.

Profiling Evidence

I profiled the 1M x 8 threads case with Linux perf and flamegraphs.

Atomic flamegraph

The hotspot was dominated by:

That is the atomic RMW instruction itself, matching the “contended atomic” hypothesis.

[Figure: Atomic flamegraph (1M x 8 threads)]

Mutex flamegraph

Time was spread across lock paths:

So atomic cost was concentrated in one heavily contended instruction; mutex cost was distributed across lock arbitration.

[Figure: Mutex flamegraph (1M x 8 threads)]

Difference: Atomics vs Mutex Lock

Both are synchronization tools, but they solve different problems.

| Aspect | Atomics | Mutex Lock |
| --- | --- | --- |
| Scope | Single value / primitive operations | Critical section over one or more shared values |
| Blocking behavior | No lock acquisition API; operation executes atomically | Can block/wait when lock is contended (spin/park/wake) |
| Low-contention cost | Usually lower overhead | Usually higher overhead per operation |
| High-contention behavior | Can bottleneck badly on one hot location (cache-line ping-pong) | Can sometimes outperform naive atomics by arbitration/serialization |
| Correctness model | Harder for multi-step state transitions | Easier for compound shared-state updates |
| Typical use | Counters, flags, state bits, lock-free primitives | Multi-step shared data mutations requiring mutual exclusion |
| Rule of thumb | Use for simple independent state changes | Use when correctness needs compound updates |
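To make the "compound updates" row concrete, here is a minimal sketch in which two fields must always change together. The `Stats` type and the `record_all` function are hypothetical and do not come from the benchmark; with two separate atomics, a reader could observe `count` updated before `sum`, whereas one lock covers both writes.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Two fields that must stay consistent with each other (hypothetical example type).
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct Stats {
    pub count: u64,
    pub sum: u64,
}

/// Each thread records the values 1..=per_thread; both fields change under one lock.
pub fn record_all(threads: usize, per_thread: u64) -> Stats {
    let stats = Arc::new(Mutex::new(Stats::default()));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let stats = Arc::clone(&stats);
            thread::spawn(move || {
                for value in 1..=per_thread {
                    let mut guard = stats.lock().unwrap();
                    // No other thread can see `count` updated without `sum`.
                    guard.count += 1;
                    guard.sum += value;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let result = *stats.lock().unwrap();
    result
}
```

With 4 threads each recording 1 through 100, the result holds a count of 400 and a sum of 20,200, and no intermediate state with mismatched fields is ever visible outside the lock.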

Practical Takeaways

  1. Atomics are not automatically faster under all contention levels.
  2. A single global atomic counter can scale poorly with many writers.
  3. Mutex can outperform naive atomics in highly contended single-counter patterns.
  4. The best optimization is often reducing contention, not changing one primitive to another.
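One way to act on the fourth takeaway, sketched here under the assumption that only the final total matters: let each thread count in a local variable and touch the shared atomic once at the end. The function name `run_local_then_merge` is hypothetical, and this is one illustrative pattern, not necessarily one the author measured.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Each thread counts privately, then publishes its subtotal with a single fetch_add.
pub fn run_local_then_merge(threads: usize, iters: u64) -> u64 {
    let total = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let total = Arc::clone(&total);
            thread::spawn(move || {
                // Thread-private state: no cache line is shared inside the loop.
                let mut local = 0u64;
                for _ in 0..iters {
                    local += 1;
                }
                // One contended operation per thread instead of one per increment.
                total.fetch_add(local, Ordering::Relaxed);
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    total.load(Ordering::Relaxed)
}
```

The shared line is now written `threads` times in total instead of `threads * iters` times. The trade-off is that the running total is not visible to other threads until each worker merges its subtotal.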

Better Designs for Production

If your workload looks like this benchmark, these patterns usually help more:

Run It Yourself

Code is available here:

🔗 github.com/RatulDawar/rust-experiments

The benchmark crate used here is atomics-vs-mutex.

The Bottom Line

The real question is not just:

atomics vs mutex?

The real question is:

how much shared cache-line contention does this design create?

Under low contention, atomics shine.

Under heavy contention on one shared location, mutex can win.