Question 12 · Section 4

How does HashSet work internally?

🟢 Junior Level

HashSet uses a HashMap internally!

// Inside HashSet:
private transient HashMap<E, Object> map;
private static final Object PRESENT = new Object();

public boolean add(E e) {
    return map.put(e, PRESENT) == null;
}

How it works:

  • Your element → key in HashMap
  • Value → the same PRESENT object
  • Uniqueness → HashMap does not allow duplicate keys

Example:

HashSet<String> set = new HashSet<>();
set.add("A");
// Inside: map.put("A", PRESENT)

set.add("A");
// Inside: map.put("A", PRESENT) → already exists! → false

🟡 Middle Level

Why via HashMap?

Advantages:
  → Shared code for HashMap and HashSet
  → All HashMap improvements automatically in HashSet
  → Java 8: Treeification — converts a bucket's linked list into a red-black tree when the bucket holds more than 8 entries (and the table capacity is at least 64). Protects against O(1) → O(n) degradation on hashCode collisions.

Memory

Each element in HashSet:
  HashMap$Node:
    Header: 12 bytes
    int hash: 4 bytes
    K key: 4-8 bytes (reference)
    V value: 4 bytes (reference to PRESENT)
    Node next: 4-8 bytes
    Padding: up to 8 bytes

Total: ~32 bytes overhead per element!

→ HashSet<Long> is 8x larger than long[]

Collisions and Java 8+

Few collisions (< 8 in a bucket):
  → Linked list
  → Search: O(n)

Many collisions (> 8 in a bucket, table capacity ≥ 64):
  → Red-black tree
  → Search: O(log n)

→ Protection against Hash DoS attacks!
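A minimal sketch of this degradation case (the `BadKey` class is hypothetical, not from the JDK): every key collides into one bucket, yet correctness is unaffected — only lookup speed depends on treeification.

```java
import java.util.HashSet;
import java.util.Set;

public class CollisionDemo {
    // Hypothetical key whose hashCode always collides.
    // Once the bucket grows past 8 entries (table capacity >= 64),
    // HashMap treeifies it, keeping contains() at O(log n).
    record BadKey(int id) {
        @Override
        public int hashCode() { return 0; } // every instance lands in one bucket
    }

    public static void main(String[] args) {
        Set<BadKey> set = new HashSet<>();
        for (int i = 0; i < 100_000; i++) {
            set.add(new BadKey(i));
        }
        // Correctness is unaffected; only performance depends on treeification.
        System.out.println(set.size());                   // 100000
        System.out.println(set.contains(new BadKey(42))); // true
    }
}
```

Note that `BadKey` does not implement `Comparable`; in that case the tree falls back to class-name and identity-hash tie-breaks, which still works but is slightly slower than a comparable key.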

Serialization

The map field is transient (not serialized directly). Reason: bucket layout depends on the runtime — identity hash codes and the table capacity can differ between JVM runs, so dumping the raw table would be meaningless on another JVM. Instead, HashSet's writeObject stores the capacity, load factor, size and the elements themselves, and readObject rebuilds the map, re-hashing every element.
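A quick round-trip sketch showing that this custom serialization is transparent to the caller — the deserialized set contains the same elements, freshly hashed on the reading JVM:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashSet;
import java.util.Set;

public class SerializationDemo {
    public static void main(String[] args) throws Exception {
        Set<String> original = new HashSet<>(Set.of("A", "B", "C"));

        // writeObject stores capacity, load factor, size and the elements
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }

        // readObject rebuilds the internal HashMap, re-hashing each element
        Set<String> copy;
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            copy = (Set<String>) in.readObject();
        }

        System.out.println(copy.equals(original)); // true: same elements
    }
}
```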

Trap: mutable elements

// ❌ Element lost
Person p = new Person("Ivan");
set.add(p);
p.setName("Petr");  // hashCode() changed!
set.remove(p);      // → FALSE! Searching in the wrong bucket

// ✅ Immutable elements
record Person(String name) {}
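A runnable version of the trap above (the mutable `Person` class here is an illustrative sketch): the element ends up "invisible" because lookups probe the bucket for the new hash, while the element still sits in the bucket chosen by the old hash.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class MutableKeyTrap {
    static class Person {
        String name;
        Person(String name) { this.name = name; }

        @Override public boolean equals(Object o) {
            return o instanceof Person p && Objects.equals(name, p.name);
        }
        @Override public int hashCode() { return Objects.hash(name); }
    }

    public static void main(String[] args) {
        Set<Person> set = new HashSet<>();
        Person p = new Person("Ivan");
        set.add(p);

        p.name = "Petr";                     // hashCode() changes!
        System.out.println(set.contains(p)); // false: probes the wrong bucket
        System.out.println(set.remove(p));   // false: element is "lost"
        System.out.println(set.size());      // 1: still occupies memory
    }
}
```

Note the subtle leak: the unreachable element still counts toward `size()` and holds memory until the set itself is cleared.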

When NOT to use HashSet

  1. Need iteration order — use LinkedHashSet
  2. Elements with bad hashCode — you get O(n) instead of O(1) (many collisions)
  3. Storing primitives — use fastutil/eclipse-collections to save 4x memory
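For case 1 above, a short sketch of the difference: LinkedHashSet threads a doubly-linked list through its entries, so iteration follows insertion order, whereas HashSet iteration order depends on hash values and table capacity.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class OrderDemo {
    public static void main(String[] args) {
        Set<String> linked = new LinkedHashSet<>();
        linked.add("banana");
        linked.add("apple");
        linked.add("cherry");

        // Iteration order == insertion order, guaranteed
        System.out.println(List.copyOf(linked)); // [banana, apple, cherry]
    }
}
```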

🔴 Senior Level

Hash DoS Attack protection

Attack: malicious actor crafts keys with identical hashCode()
  → All go into the same bucket
  → Search O(n) instead of O(1)
  → Server "hangs"

Java 8+: Treeification at > 8 elements in a bucket
  → Search O(log n) even under attack
  → (Java 7u6's randomized "alternative hashing" for Strings was removed in Java 8 — treeification replaced it as the mitigation)
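How an attacker crafts such keys: `"Aa"` and `"BB"` are the classic colliding pair, since `'A'*31 + 'a' == 'B'*31 + 'B' == 2112`, and String.hashCode() is public and stable. Chaining the pair ("AaAa", "AaBB", "BBAa", "BBBB", …) yields exponentially many keys with identical hashes:

```java
public class HashDosDemo {
    public static void main(String[] args) {
        // String.hashCode() is s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112

        // Any concatenation of colliding blocks also collides:
        System.out.println("AaAa".hashCode() == "BBBB".hashCode()); // true
        System.out.println("AaBB".hashCode() == "BBAa".hashCode()); // true
    }
}
```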

Memory Deep Dive

HashSet<String> for 1 million short strings (estimates: 64-bit JVM with compressed oops, Java 8 char[]-backed Strings):

1 million HashMap$Node: 32 MB
1 million String objects: ~40 MB
1 million char[] arrays: ~24 MB
HashMap table array: ~4 MB

Total: ~100 MB

→ ArrayList<String>: ~68 MB
→ HashSet is 50% larger!

Load Factor trade-off

Default: 0.75

Increase (0.9):
  → Less memory
  → More collisions
  → Slower search

Decrease (0.5):
  → More memory
  → Fewer collisions
  → Faster search
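Both knobs are passed straight through to the backing HashMap's constructor. A sketch of the common sizing idiom: for n expected elements, a capacity of about n / loadFactor avoids any rehashing during population.

```java
import java.util.HashSet;
import java.util.Set;

public class SizingDemo {
    public static void main(String[] args) {
        int expected = 10_000;

        // Default load factor 0.75: pre-size so no resize ever happens
        Set<String> preSized = new HashSet<>((int) (expected / 0.75f) + 1);

        // Lower load factor: fewer collisions, but a sparser (bigger) table
        Set<String> sparse = new HashSet<>(expected, 0.5f);

        for (int i = 0; i < expected; i++) {
            preSized.add("key-" + i);
            sparse.add("key-" + i);
        }
        // Sizing hints change memory/speed trade-offs, never the contents
        System.out.println(preSized.equals(sparse)); // true
    }
}
```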

Alternatives for primitives

// ❌ HashSet<Long> for 100 million elements
// → 100 million Node objects → 3.2 GB!

// ✅ fastutil (open addressing)
LongOpenHashSet set = new LongOpenHashSet(); // it.unimi.dsi.fastutil.longs
// → No Node objects → 800 MB
// → 4x less memory!

Internal Structure

HashSet → AbstractSet → AbstractCollection

Delegation:
  add(e)         → map.put(e, PRESENT)
  remove(e)      → map.remove(e) == PRESENT
  contains(e)    → map.containsKey(e)
  size()         → map.size()
  iterator()     → map.keySet().iterator()
  clear()        → map.clear()

→ HashSet = thin wrapper over keySet!

Production Experience

Real scenario: Bad hashCode

  • Objects with hashCode() = 0 for all
  • All in one bucket → O(n) search
  • 100,000 elements → 5 seconds for contains!
  • Solution: rewrite hashCode()
  • Result: 0.1 ms
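A sketch of what such a fix looks like (the `Order` class and its fields are hypothetical, for illustration): derive hashCode() from the same fields equals() uses, so elements spread across buckets again.

```java
import java.util.Objects;

public class HashCodeFix {
    static final class Order {
        final long id;
        final String customer;
        Order(long id, String customer) { this.id = id; this.customer = customer; }

        @Override public boolean equals(Object o) {
            return o instanceof Order other
                    && id == other.id && Objects.equals(customer, other.customer);
        }

        // ❌ Before: return 0; → all elements in one bucket, O(n) contains
        // ✅ After: combine exactly the fields that equals() compares
        @Override public int hashCode() { return Objects.hash(id, customer); }
    }

    public static void main(String[] args) {
        Order a = new Order(1, "Ivan");
        Order b = new Order(1, "Ivan");
        // The contract: equal objects MUST have equal hash codes
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode()); // true
    }
}
```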

Best Practices

  1. Elements must be effectively immutable while inside the Set. Mutating an element after addition breaks the hashCode/equals contract — the element becomes “invisible” to search.
  2. hashCode/equals contract — critical
  3. initialCapacity — avoid resizes
  4. fastutil for primitives
  5. Treeification → indicator of bad hashCode
  6. Serialization — custom, not standard

Summary for Senior

  • HashSet = wrapper over HashMap.keySet()
  • PRESENT = static final Object placeholder
  • Memory = ~32 bytes overhead per element
  • Treeification = Hash DoS protection (> 8)
  • Transient map = custom serialization
  • Mutable elements = data loss!
  • Open addressing = less memory (fastutil)

🎯 Interview Cheat Sheet

Must know:

  • HashSet internally is a wrapper over HashMap.keySet(), each element = key, value = static final Object PRESENT
  • Complexity: O(1) for add/remove/contains on average
  • Java 8+: treeification at > 8 elements in a bucket — linked list turns into a red-black tree (O(log n))
  • Memory overhead: ~32 bytes per element (HashMap$Node: hash, key, value, next)
  • Default load factor 0.75 — balance between memory and collisions
  • Custom serialization: map field = transient, elements serialized separately
  • Mutable elements break search — hashCode() changes, element is “lost”
  • For primitives, fastutil/Trove is better — open addressing, 4x less memory

Frequent follow-up questions:

  • Why does HashSet use HashMap instead of its own structure? — Code reuse, all HashMap optimizations are automatically available.
  • What is treeification and why is it needed? — At > 8 collisions in a bucket, the list turns into a tree — protection against Hash DoS attacks (O(n) → O(log n)).
  • Why is serialization custom? — The hash table depends on JVM-specific hashCode() and randomization — upon deserialization, elements may end up in different buckets.
  • How to reduce HashSet memory footprint for primitives? — Use fastutil LongOpenHashSet and similar — no Node objects, open addressing.

Red flags (DO NOT say):

  • ❌ “HashSet stores elements directly” — it stores them as HashMap keys with a PRESENT placeholder
  • ❌ “HashSet is fully thread-safe” — no, you need ConcurrentHashMap or Collections.synchronizedSet
  • ❌ “HashSet guarantees iteration order” — order is not guaranteed, depends on hashCode()
  • ❌ “Treeification always happens on collisions” — only at > 8 elements in a bucket AND table capacity ≥ 64

Related topics:

  • [[11. What is the difference between HashSet, LinkedHashSet and TreeSet]]
  • [[14. What is Map and what implementations exist]]
  • [[15. What is the difference between HashMap, LinkedHashMap and TreeMap]]
  • [[19. How ConcurrentHashMap ensures thread-safety]]