Question 3 · Section 10

How HashMap Determines Which Bucket to Put an Element In?

HashMap determines the bucket number in two steps:

Language versions: English Russian Ukrainian

🟢 Junior Level

HashMap determines the bucket number in two steps:

  1. Gets the key’s hash code — calls key.hashCode(), which returns a 32-bit number — the object’s “digital fingerprint”. A good hashCode distributes objects evenly across the range; a bad one clusters them.
  2. Mixes the bits — XORs the hash code with itself, shifted right by 16 positions
  3. Applies a mask — uses the formula (array_size - 1) & hash

Simple example:

String key = "hello";
int hash = key.hashCode();           // e.g., 99162322
int index = (16 - 1) & hash;         // With size 16: index = 2

Why it’s so fast: Instead of division (which is slow), bitwise AND (&) is used, which executes in 1 CPU cycle.

🟡 Middle Level

Step 1: Bit Perturbation Function

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

Why XOR and shift by 16?

  • The array size is usually small (16, 32, 64…), so only low bits are used for the index
  • If hashCode() differs only in high bits — collisions occur
  • >>> 16 shift moves high bits to low positions
  • XOR mixes high-bit information into low bits
  • Result: even with a poor hashCode, elements distribute evenly

Why specifically XOR and shift 16? The perturbation function (from Latin ‘perturbare’, ‘to mix’) ensures that even if hashCode() differs only in high bits, after mixing the difference also appears in low bits.

Step 2: Power-of-Two Magic

int index = (n - 1) & hash;

If n = 16 (2^4):

n-1 = 15 => 0000 1111 (mask)
hash   => 1101 0110
&      => 0000 0110 (result: 6)

This is mathematically equivalent to hash % 16. & executes in 1 cycle (bitwise operation), while % requires division — 10-50 cycles depending on the CPU.

hashCode Caching

String caches its hashCode in a private hash field. Frequent HashMap operations benefit enormously from this.

Common Mistakes

  1. Poor hashCode — always returning 1 puts all elements in one bucket
  2. Mutable key — modifying key fields after put changes the hash, making the element unreachable

Don’t Try to “Improve” Hashing

Don’t create exotic hashing formulas without profiling. The OpenJDK spread function is thoroughly tested. A bad “optimization” is often worse than the default.

🔴 Senior Level

Internal Implementation

In the OpenJDK source (HashMap.java):

// putVal method:
if ((p = tab[i = (n - 1) & hash]) == null)
    tab[i] = newNode(hash, key, value, null);

Index determination happens on every operation — get, put, remove. That’s why hashing optimization is critical.

Why Power of Two is Mandatory

If the size were 15 (not a power of two):

n-1 = 14 => 0000 1110 (mask)

The lowest bit is always zeroed — all odd buckets (1, 3, 5…) are never used. This causes 50% capacity loss and doubled collisions.

Performance

  • Hash computation: O(1), 2 operations (XOR + shift)
  • Index determination: O(1), 2 operations (& and subtraction)
  • Overall overhead: minimal, but called billions of times in large applications

Edge Cases

  1. key == null: hash(null) returns 0, null always in table[0]
  2. hashCode() = constant: all elements in one bucket, degradation to O(n)/O(log n)
  3. Strings with same hashCode: "Aa" and "BB" both give 2112 — a known Java collision

Security: Hash Flooding

An attacker can craft keys with the same hash. In Java 8+, treeification protects against DoS, but CPU overhead still increases due to tree rebalancing.

Optimization for Records

In Java 14+, record types generate hashCode() automatically based on all components. This guarantees a good contract, but for complex records with collection fields, hashCode can be expensive.


🎯 Interview Cheat Sheet

Must know:

  • Step 1: hash(key) = key.hashCode() ^ (key.hashCode() >>> 16) — spread function
  • Step 2: index = (n-1) & hash — bitwise mask instead of %
  • XOR mixes high bits into low — protects against poor hashCodes
  • & executes in 1 cycle, % requires 10-50 cycles
  • String caches hashCode — repeated calls are O(1)
  • null gives hash = 0, always in table[0]
  • With non-2^n sizes, the formula breaks — odd buckets are never used

Common follow-up questions:

  • Why specifically XOR and shift 16? — 32-bit int, half high bits, half low bits; XOR mixes high into low
  • What if hashCode = constant? — all elements in one bucket, degradation to O(n)
  • Why not use %? — division is slow (10-50 cycles), & = 1 cycle
  • Do “Aa” and “BB” give the same hashCode? — yes, 2112 — a known collision

Red flags (DO NOT say):

  • “HashMap uses hashCode directly” — no, there’s a spread function
  • “Index = hashCode % size” — no, (n-1) & hash
  • “You can improve the spread function” — the standard OpenJDK one is thoroughly tested

Related topics:

  • [[01. How HashMap Works Internally]]
  • [[04. What is a Collision in HashMap]]
  • [[07. What is the equals() and hashCode() Contract]]
  • [[15. Why String Is Often Used as a Key in HashMap]]