Question 9 · Section 9

What is CAS (Compare-And-Swap)?

Imagine an ATM: 4. If yes — withdraws 200, new balance 800 5. If no (someone already withdrew) — cancels the operation, need to retry

Language versions: English Russian Ukrainian

Junior Level

Basic Understanding

CAS (Compare-And-Swap) is an atomic CPU instruction that allows updating a value only if it hasn’t changed since it was read. This is the foundation of lock-free algorithms.

Simple Analogy

Imagine an ATM:

  1. You check the balance: 1000 rub
  2. You want to withdraw 200 rub (expecting 800 to remain)
  3. The ATM checks: “is the balance still 1000?”
  4. If yes — withdraws 200, new balance 800
  5. If no (someone already withdrew) — cancels the operation, need to retry

Key difference from synchronized: CAS is an optimistic approach (try, if it fails — retry). synchronized is a pessimistic approach (lock and do, no one will interfere).

How CAS Works

// CAS takes 3 parameters:
// 1. Memory address
// 2. Expected value
// 3. New value

boolean CAS(address, expectedValue, newValue) {
    if (memory[address] == expectedValue) {
        memory[address] = newValue;
        return true;  // Success!
    }
    return false; // Failed — value changed
}

Example with AtomicInteger

AtomicInteger counter = new AtomicInteger(0);

counter.incrementAndGet(); // Internally:
// 1. Read current value (current = 0)
// 2. Compute new value (next = 1)
// 3. CAS: if value == 0, set to 1
//    If CAS failed — repeat from step 1

Spin-loop

public final int incrementAndGet() {
    for (;;) { // Infinite loop
        int current = get();        // Read
        int next = current + 1;     // Compute
        if (compareAndSet(current, next)) { // CAS
            return next;            // Success!
        }
        // CAS failed — someone else updated the value
        // Go to next iteration
    }
}

When is CAS faster than locks?

Scenario CAS synchronized
Low contention Faster (no locking) Slower (acquire/release)
Medium contention Faster Medium
High contention Slower (infinite loops) Faster (threads are parked)

Middle Level

Hardware-Level Algorithm

On x86 architecture, CAS is implemented via the LOCK CMPXCHG instruction:

; rax = expected value
; rcx = new value
; [memory] = actual value in memory

lock cmpxchg [memory], rcx
; If [memory] == rax:
;   [memory] = rcx      ← Write new value
;   ZF = 1              ← Success flag
; Else:
;   rax = [memory]      ← Load actual value
;   ZF = 0              ← Failure flag

The lock prefix locks the memory bus for the duration of the operation — other cores cannot access this memory cell.

CAS in Java: Unsafe and VarHandle

Before Java 9: sun.misc.Unsafe

Before Java 9: sun.misc.Unsafe (direct memory access, dangerous) Java 9+: VarHandle (safe replacement for Unsafe, JEP 193) Note: Unsafe is still accessible in Java 9+ via --add-opens, but not recommended.

// Direct memory access (not recommended)
Unsafe unsafe = ...;
long offset = ...; // Field offset in memory

int current;
do {
    current = unsafe.getIntVolatile(obj, offset);
} while (!unsafe.compareAndSwapInt(obj, offset, current, current + 1));

Java 9+: VarHandle

// Typed and safe API
private static final VarHandle VALUE;
static {
    MethodHandles.Lookup l = MethodHandles.lookup();
    VALUE = l.findVarHandle(MyClass.class, "value", int.class);
}

int current;
do {
    current = (int) VALUE.getVolatile(this);
} while (!VALUE.compareAndSet(this, current, current + 1));

Lock-free vs Wait-free

In an interview, it’s important to distinguish:

This is a progress guarantee hierarchy:

  • Wait-free (strongest): EVERY thread completes the operation in a finite number of steps
  • Lock-free (middle): AT LEAST ONE thread completes the operation in a finite number of steps
  • Obstruction-free (weakest): thread completes if all others are suspended

Wait-free → Lock-free → Obstruction-free: each next one is weaker than the previous.

Term Guarantee Example
Lock-free At least one thread completes the operation in finite time AtomicInteger
Wait-free Every thread completes in a finite number of steps LongAdder.add()
Obstruction-free Thread completes if others don’t interfere Some STM implementations

STM (Software Transactional Memory) — software transactional memory: operations execute in a transaction, then atomically commit. An alternative approach to lock-free programming.

Exponential Back-off

Under high contention, a simple spin cycle wastes CPU. Solution — pauses:

public int incrementWithBackoff() {
    int backoff = 1;
    for (;;) {
        int current = get();
        int next = current + 1;
        if (compareAndSet(current, next)) {
            return next;
        }
        // CAS failed — pause
        Thread.onSpinWait(); // Hint to the processor

        // `Thread.onSpinWait()` — a hint to the JVM/CPU: "I'm in a spin cycle, you can optimize". On x86 this is the PAUSE instruction, which reduces power consumption and improves spin cycle performance.
        Thread.sleep(ThreadLocalRandom.current().nextInt(0, backoff));
        backoff = Math.min(backoff * 2, MAX_BACKOFF); // Exponential growth
    }
}

ABA Problem

Time  Thread 1        Thread 2        Stack (head)
──────────────────────────────────────────────────
t1    reads A                         A → B → C
t2                  CAS(A, B)         B → C
t3                  CAS(B, A)         A → C (new value!)
t4    CAS(A, D) → SUCCESS!
      But the stack changed! B and C are lost!

Solution: AtomicStampedReference

// Stores value + version number
AtomicStampedReference<String> ref = new AtomicStampedReference<>("A", 0);

int[] stampHolder = new int[1];
String value = ref.get(stampHolder);
int stamp = stampHolder[0];

// CAS only if both value AND version match
ref.compareAndSet("A", "D", stamp, stamp + 1);

When CAS is NOT suitable

  1. Compound operations: need to atomically update two fields — CAS works on one field
  2. High contention (100+ threads): CAS cycle wastes CPU, better to use LongAdder or Lock
  3. Long computations: if milliseconds pass between read and CAS — CAS will constantly fail
  4. ABA problem: value changed A→B→A, CAS “won’t notice” — need AtomicStampedReference

Senior Level

Under the Hood: Memory Ordering

CAS provides a full memory barrier (StoreLoad):

Before CAS:  All writes are guaranteed to be completed
CAS:         Atomic update
After CAS:   All reads see the latest data

On different architectures:

Architecture CAS Implementation Barriers
x86 lock cmpxchg Full barrier (x86 is strongly ordered)
ARM ldxr/stxr loop Requires dmb instruction
RISC-V lr.w/sc.w loop Requires fence

Spin Cycle and CPU Saturation

Under very high contention, threads may spin forever:

// Problem: 100 threads, 1 CAS variable
for (;;) {
    int current = get();
    int next = current + 1;
    if (compareAndSet(current, next)) // All 100 threads constantly fail
        return next;
    // CPU at 100%, but no progress!
}

Solutions:

  1. Exponential Back-off — pause between attempts
  2. LongAdder — distribution across cells
  3. Fallback to locking — if contention is too high

Performance: CAS vs Mutex

Metric CAS Mutex (synchronized)
Latency (uncontended) ~5-10 ns ~10-20 ns
Latency (contended) ~100+ ns (spin) ~1000+ ns (park)
CPU usage (contended) 100% (spin-loop) ~0% (parked)
Throughput (low contention) Higher Lower
Throughput (high contention) Lower Higher

Critical point: Under very high load, a mutex may be better because it “parks” threads rather than burning CPU.

AtomicLong Implementation on 32-bit Systems

// On a 32-bit JVM, writing a long is not atomic!
// long = 64 bits, written in two 32-bit steps

// AtomicLong guarantees atomicity through:
// 1. CMPXCHG8B instruction (x86)
// 2. Or internal locking (on older CPUs)

Bus Traffic and Cache Coherence

On multiprocessor systems, frequent CAS causes:

  1. Cache Line Invalidation — each CAS sends Invalidate to all cores
  2. Bus Traffic — intensive inter-core communication
  3. Memory Bandwidth Saturation — the bus is clogged, regular operations slow down

Solution: Segmented structures (like ConcurrentHashMap) or LongAdder.

Diagnostics

JMC (Java Mission Control)

java -XX:StartFlightRecording=filename=rec.jfr MyApp

Events:

  • jdk.ExecutionSample — shows where threads spend time
  • If lots of time in compareAndSet → contention

-XX:+PrintAssembly

java -XX:+PrintAssembly -XX:+UnlockDiagnosticVMOptions

Look for:

lock cmpxchg dword ptr [rax], rcx  # ← This is CAS

Measuring Contention

// Programmatic contention estimate
AtomicLong counter = new AtomicLong(0);
long start = System.nanoTime();
for (int i = 0; i < 1_000_000; i++) {
    counter.incrementAndGet();
}
long elapsed = System.nanoTime() - start;
System.out.println("Average time: " + (elapsed / 1_000_000) + " ns");
// Much > 10 ns → contention

Best Practices

  1. CAS for simple operations — increment, set, get
  2. Back-off under contentionThread.onSpinWait() as a CPU hint
  3. LongAdder for counters — under high contention
  4. AtomicStampedReference — when the ABA problem is possible
  5. Don’t use CAS for long operations — switch to locks
  6. Monitor CAS failures — via JFR or profiling
  7. Avoid CAS for multiple fields — use synchronized

Interview Cheat Sheet

Must know:

  • CAS = Compare-And-Swap: atomically compares value with expected and writes new if matched
  • On x86: lock cmpxchg (lock prefix locks memory bus); on ARM: ldxr/stxr; on RISC-V: lr.w/sc.w
  • CAS provides a full memory barrier (StoreLoad) — all writes before CAS completed, all reads after CAS are current
  • CAS is an optimistic approach (try/retry), synchronized is pessimistic (lock and do)
  • Lock-free (AtomicInteger) vs Wait-free (LongAdder): lock-free = at least one thread completes; wait-free = every thread completes
  • ABA problem: value returned A→B→A, CAS won’t notice; solution — AtomicStampedReference (value + version)
  • Under high contention, CAS cycle wastes CPU (100%); solution — exponential back-off or switch to LongAdder/Lock

Frequent follow-up questions:

  • When is CAS faster than synchronized? — Under low/medium contention (no locking/parking overhead)
  • When is synchronized faster than CAS? — Under high contention (100+ threads): CAS spins and burns CPU, mutex parks threads
  • What does Thread.onSpinWait() do? — Hint to the JVM/CPU “I’m in a spin cycle”; on x86 = PAUSE instruction, reduces power consumption
  • How does CAS differ from volatile? — volatile = visibility only; CAS = visibility + atomic write (compare-and-swap)

Red flags (do NOT say):

  • “CAS is always faster than synchronized” — no, under high contention a mutex is more effective
  • “CAS has no memory issues” — no, CAS sets a full memory barrier, which is expensive
  • “ABA problem is a Java bug” — no, it’s a fundamental problem of CAS algorithms in any language
  • “Wait-free and lock-free are the same” — no, wait-free is stronger: EVERY thread completes, lock-free — at least ONE

Related topics:

  • [[8. What are Atomic classes]] — all Atomic classes are built on CAS
  • [[1. What is the difference between synchronized and volatile]] — CAS vs synchronized approaches
  • [[3. What is visibility problem]] — CAS provides a full memory barrier
  • [[7. What is reentrant lock]] — AQS uses CAS to acquire state