What is CAS (Compare-And-Swap)?

Junior Level

Basic Understanding

CAS (Compare-And-Swap) is an atomic CPU instruction that allows updating a value only if it hasn’t changed since it was read. This is the foundation of lock-free algorithms.

Simple Analogy

Imagine an ATM:

You check the balance: 1000 rub
You want to withdraw 200 rub (expecting 800 to remain)
The ATM checks: “is the balance still 1000?”
If yes — withdraws 200, new balance 800
If no (someone already withdrew) — cancels the operation, need to retry

Key difference from synchronized: CAS is an optimistic approach (try, if it fails — retry). synchronized is a pessimistic approach (lock and do, no one will interfere).

How CAS Works

// CAS takes 3 parameters:
// 1. Memory address
// 2. Expected value
// 3. New value

boolean CAS(address, expectedValue, newValue) {
    if (memory[address] == expectedValue) {
        memory[address] = newValue;
        return true;  // Success!
    }
    return false; // Failed — value changed
}

Example with AtomicInteger

AtomicInteger counter = new AtomicInteger(0);

counter.incrementAndGet(); // Internally:
// 1. Read current value (current = 0)
// 2. Compute new value (next = 1)
// 3. CAS: if value == 0, set to 1
//    If CAS failed — repeat from step 1

Spin-loop

public final int incrementAndGet() {
    for (;;) { // Infinite loop
        int current = get();        // Read
        int next = current + 1;     // Compute
        if (compareAndSet(current, next)) { // CAS
            return next;            // Success!
        }
        // CAS failed — someone else updated the value
        // Go to next iteration
    }
}

When is CAS faster than locks?

Scenario	CAS	synchronized
Low contention	Faster (no locking)	Slower (acquire/release)
Medium contention	Faster	Medium
High contention	Slower (infinite loops)	Faster (threads are parked)

Middle Level

Hardware-Level Algorithm

On x86 architecture, CAS is implemented via the LOCK CMPXCHG instruction:

; rax = expected value
; rcx = new value
; [memory] = actual value in memory

lock cmpxchg [memory], rcx
; If [memory] == rax:
;   [memory] = rcx      ← Write new value
;   ZF = 1              ← Success flag
; Else:
;   rax = [memory]      ← Load actual value
;   ZF = 0              ← Failure flag

The lock prefix locks the memory bus for the duration of the operation — other cores cannot access this memory cell.

CAS in Java: Unsafe and VarHandle

Before Java 9: sun.misc.Unsafe

Before Java 9: sun.misc.Unsafe (direct memory access, dangerous) Java 9+: VarHandle (safe replacement for Unsafe, JEP 193) Note: Unsafe is still accessible in Java 9+ via --add-opens, but not recommended.

// Direct memory access (not recommended)
Unsafe unsafe = ...;
long offset = ...; // Field offset in memory

int current;
do {
    current = unsafe.getIntVolatile(obj, offset);
} while (!unsafe.compareAndSwapInt(obj, offset, current, current + 1));

Java 9+: VarHandle

// Typed and safe API
private static final VarHandle VALUE;
static {
    MethodHandles.Lookup l = MethodHandles.lookup();
    VALUE = l.findVarHandle(MyClass.class, "value", int.class);
}

int current;
do {
    current = (int) VALUE.getVolatile(this);
} while (!VALUE.compareAndSet(this, current, current + 1));

Lock-free vs Wait-free

In an interview, it’s important to distinguish:

This is a progress guarantee hierarchy:

Wait-free (strongest): EVERY thread completes the operation in a finite number of steps
Lock-free (middle): AT LEAST ONE thread completes the operation in a finite number of steps
Obstruction-free (weakest): thread completes if all others are suspended

Wait-free → Lock-free → Obstruction-free: each next one is weaker than the previous.

Term	Guarantee	Example
Lock-free	At least one thread completes the operation in finite time	AtomicInteger
Wait-free	Every thread completes in a finite number of steps	LongAdder.add()
Obstruction-free	Thread completes if others don’t interfere	Some STM implementations

STM (Software Transactional Memory) — software transactional memory: operations execute in a transaction, then atomically commit. An alternative approach to lock-free programming.

Exponential Back-off

Under high contention, a simple spin cycle wastes CPU. Solution — pauses:

public int incrementWithBackoff() {
    int backoff = 1;
    for (;;) {
        int current = get();
        int next = current + 1;
        if (compareAndSet(current, next)) {
            return next;
        }
        // CAS failed — pause
        Thread.onSpinWait(); // Hint to the processor

        // `Thread.onSpinWait()` — a hint to the JVM/CPU: "I'm in a spin cycle, you can optimize". On x86 this is the PAUSE instruction, which reduces power consumption and improves spin cycle performance.
        Thread.sleep(ThreadLocalRandom.current().nextInt(0, backoff));
        backoff = Math.min(backoff * 2, MAX_BACKOFF); // Exponential growth
    }
}

ABA Problem

Time  Thread 1        Thread 2        Stack (head)
──────────────────────────────────────────────────
t1    reads A                         A → B → C
t2                  CAS(A, B)         B → C
t3                  CAS(B, A)         A → C (new value!)
t4    CAS(A, D) → SUCCESS!
      But the stack changed! B and C are lost!

Solution: AtomicStampedReference

// Stores value + version number
AtomicStampedReference<String> ref = new AtomicStampedReference<>("A", 0);

int[] stampHolder = new int[1];
String value = ref.get(stampHolder);
int stamp = stampHolder[0];

// CAS only if both value AND version match
ref.compareAndSet("A", "D", stamp, stamp + 1);

When CAS is NOT suitable

Compound operations: need to atomically update two fields — CAS works on one field
High contention (100+ threads): CAS cycle wastes CPU, better to use LongAdder or Lock
Long computations: if milliseconds pass between read and CAS — CAS will constantly fail
ABA problem: value changed A→B→A, CAS “won’t notice” — need AtomicStampedReference

Senior Level

Under the Hood: Memory Ordering

CAS provides a full memory barrier (StoreLoad):

Before CAS:  All writes are guaranteed to be completed
CAS:         Atomic update
After CAS:   All reads see the latest data

On different architectures:

Architecture	CAS Implementation	Barriers
x86	`lock cmpxchg`	Full barrier (x86 is strongly ordered)
ARM	`ldxr/stxr` loop	Requires `dmb` instruction
RISC-V	`lr.w/sc.w` loop	Requires `fence`

Spin Cycle and CPU Saturation

Under very high contention, threads may spin forever:

// Problem: 100 threads, 1 CAS variable
for (;;) {
    int current = get();
    int next = current + 1;
    if (compareAndSet(current, next)) // All 100 threads constantly fail
        return next;
    // CPU at 100%, but no progress!
}

Solutions:

Exponential Back-off — pause between attempts
LongAdder — distribution across cells
Fallback to locking — if contention is too high

Performance: CAS vs Mutex

Metric	CAS	Mutex (synchronized)
Latency (uncontended)	~5-10 ns	~10-20 ns
Latency (contended)	~100+ ns (spin)	~1000+ ns (park)
CPU usage (contended)	100% (spin-loop)	~0% (parked)
Throughput (low contention)	Higher	Lower
Throughput (high contention)	Lower	Higher

Critical point: Under very high load, a mutex may be better because it “parks” threads rather than burning CPU.

AtomicLong Implementation on 32-bit Systems

// On a 32-bit JVM, writing a long is not atomic!
// long = 64 bits, written in two 32-bit steps

// AtomicLong guarantees atomicity through:
// 1. CMPXCHG8B instruction (x86)
// 2. Or internal locking (on older CPUs)

Bus Traffic and Cache Coherence

On multiprocessor systems, frequent CAS causes:

Cache Line Invalidation — each CAS sends Invalidate to all cores
Bus Traffic — intensive inter-core communication
Memory Bandwidth Saturation — the bus is clogged, regular operations slow down

Solution: Segmented structures (like ConcurrentHashMap) or LongAdder.

Diagnostics

JMC (Java Mission Control)

java -XX:StartFlightRecording=filename=rec.jfr MyApp

Events:

jdk.ExecutionSample — shows where threads spend time
If lots of time in compareAndSet → contention

`-XX:+PrintAssembly`

java -XX:+PrintAssembly -XX:+UnlockDiagnosticVMOptions

Look for:

lock cmpxchg dword ptr [rax], rcx  # ← This is CAS

Measuring Contention

// Programmatic contention estimate
AtomicLong counter = new AtomicLong(0);
long start = System.nanoTime();
for (int i = 0; i < 1_000_000; i++) {
    counter.incrementAndGet();
}
long elapsed = System.nanoTime() - start;
System.out.println("Average time: " + (elapsed / 1_000_000) + " ns");
// Much > 10 ns → contention

Best Practices

CAS for simple operations — increment, set, get
Back-off under contention — Thread.onSpinWait() as a CPU hint
LongAdder for counters — under high contention
AtomicStampedReference — when the ABA problem is possible
Don’t use CAS for long operations — switch to locks
Monitor CAS failures — via JFR or profiling
Avoid CAS for multiple fields — use synchronized

Interview Cheat Sheet

Must know:

CAS = Compare-And-Swap: atomically compares value with expected and writes new if matched
On x86: lock cmpxchg (lock prefix locks memory bus); on ARM: ldxr/stxr; on RISC-V: lr.w/sc.w
CAS provides a full memory barrier (StoreLoad) — all writes before CAS completed, all reads after CAS are current
CAS is an optimistic approach (try/retry), synchronized is pessimistic (lock and do)
Lock-free (AtomicInteger) vs Wait-free (LongAdder): lock-free = at least one thread completes; wait-free = every thread completes
ABA problem: value returned A→B→A, CAS won’t notice; solution — AtomicStampedReference (value + version)
Under high contention, CAS cycle wastes CPU (100%); solution — exponential back-off or switch to LongAdder/Lock

Frequent follow-up questions:

When is CAS faster than synchronized? — Under low/medium contention (no locking/parking overhead)
When is synchronized faster than CAS? — Under high contention (100+ threads): CAS spins and burns CPU, mutex parks threads
What does Thread.onSpinWait() do? — Hint to the JVM/CPU “I’m in a spin cycle”; on x86 = PAUSE instruction, reduces power consumption
How does CAS differ from volatile? — volatile = visibility only; CAS = visibility + atomic write (compare-and-swap)

Red flags (do NOT say):

“CAS is always faster than synchronized” — no, under high contention a mutex is more effective
“CAS has no memory issues” — no, CAS sets a full memory barrier, which is expensive
“ABA problem is a Java bug” — no, it’s a fundamental problem of CAS algorithms in any language
“Wait-free and lock-free are the same” — no, wait-free is stronger: EVERY thread completes, lock-free — at least ONE

Related topics:

[[8. What are Atomic classes]] — all Atomic classes are built on CAS
[[1. What is the difference between synchronized and volatile]] — CAS vs synchronized approaches
[[3. What is visibility problem]] — CAS provides a full memory barrier
[[7. What is reentrant lock]] — AQS uses CAS to acquire state