How does HashSet work internally?
4. fastutil for primitives 5. Treeification → indicator of bad hashCode 6. Serialization — custom, not standard
🟢 Junior Level
HashSet uses a HashMap internally!
// Inside HashSet:
private transient HashMap<E, Object> map;
private static final Object PRESENT = new Object();
public boolean add(E e) {
return map.put(e, PRESENT) == null;
}
How it works:
- Your element → key in HashMap
- Value → the same
PRESENTobject - Uniqueness → HashMap does not allow duplicate keys
Example:
HashSet<String> set = new HashSet<>();
set.add("A");
// Inside: map.put("A", PRESENT)
set.add("A");
// Inside: map.put("A", PRESENT) → already exists! → false
🟡 Middle Level
Why via HashMap?
Advantages:
→ Shared code for HashMap and HashSet
→ All HashMap improvements automatically in HashSet
→ Java 8: Treeification — converts linked list into a red-black tree when > 8 elements in a bucket. Protects against O(1) → O(n) degradation on hashCode collisions.
Memory
Each element in HashSet:
HashMap$Node:
Header: 12 bytes
int hash: 4 bytes
K key: 4-8 bytes (reference)
V value: 4 bytes (reference to PRESENT)
Node next: 4-8 bytes
Padding: up to 8 bytes
Total: ~32 bytes overhead per element!
→ HashSet<Long> is 8x larger than long[]
Collisions and Java 8+
Few collisions (< 8 in a bucket):
→ Linked list
→ Search: O(n)
Many collisions (> 8, size > 64):
→ Red-black tree
→ Search: O(log n)
→ Protection against Hash DoS attacks!
Serialization
// The map field is transient (not serialized directly). Reason: the hash table depends on JDK-specific hashCode() and randomization — upon deserialization on a different JVM, an element could end up in a different bucket. Instead, HashSet serializes elements and rebuilds the map.
Trap: mutable elements
// ❌ Element lost
Person p = new Person("Ivan");
set.add(p);
p.setName("Petr"); // hashCode() changed!
set.remove(p); // → FALSE! Searching in the wrong bucket
// ✅ Immutable elements
record Person(String name) {}
When NOT to use HashSet
- Need iteration order — use LinkedHashSet
- Elements with bad hashCode — you get O(n) instead of O(1) (many collisions)
- Storing primitives — use fastutil/eclipse-collections to save 4x memory
🔴 Senior Level
Hash DoS Attack protection
Attack: malicious actor crafts keys with identical hashCode()
→ All go into the same bucket
→ Search O(n) instead of O(1)
→ Server "hangs"
Java 8+: Treeification at > 8 elements
→ Search O(log n) even under attack
→ Hash function with randomization
Memory Deep Dive
HashSet<String> for 1 million strings:
1 million HashMap$Node: 32 MB
1 million String objects: ~40 MB
1 million char[] arrays: ~24 MB
HashMap table array: ~4 MB
Total: ~100 MB
→ ArrayList<String>: ~68 MB
→ HashSet is 50% larger!
Load Factor trade-off
Default: 0.75
Increase (0.9):
→ Less memory
→ More collisions
→ Slower search
Decrease (0.5):
→ More memory
→ Fewer collisions
→ Faster search
Alternatives for primitives
// ❌ HashSet<Long> for 100 million elements
// → 100 million Node objects → 3.2 GB!
// ✅ fastutil (open addressing)
LongOpenHashSet set = new LongOpenHashSet();
// → No Node objects → 800 MB
// → 4x less memory!
Internal Structure
HashSet → AbstractSet → AbstractCollection
Delegation:
add(e) → map.put(e, PRESENT)
remove(e) → map.remove(e) == PRESENT
contains(e) → map.containsKey(e)
size() → map.size()
iterator() → map.keySet().iterator()
clear() → map.clear()
→ HashSet = thin wrapper over keySet!
Production Experience
Real scenario: Bad hashCode
- Objects with hashCode() = 0 for all
- All in one bucket → O(n) search
- 100,000 elements → 5 seconds for contains!
- Solution: rewrite hashCode()
- Result: 0.1 ms
Best Practices
- Elements must be effectively immutable while inside the Set. Mutating an element after addition breaks the hashCode/equals contract — the element becomes “invisible” to search.
- hashCode/equals contract — critical
- initialCapacity — avoid resizes
- fastutil for primitives
- Treeification → indicator of bad hashCode
- Serialization — custom, not standard
Summary for Senior
- HashSet = wrapper over HashMap.keySet()
- PRESENT = static final Object placeholder
- Memory = ~32 bytes overhead per element
- Treeification = Hash DoS protection (> 8)
- Transient map = custom serialization
- Mutable elements = data loss!
- Open addressing = less memory (fastutil)
🎯 Interview Cheat Sheet
Must know:
- HashSet internally is a wrapper over HashMap.keySet(), each element = key, value = static final Object PRESENT
- Complexity: O(1) for add/remove/contains on average
- Java 8+: treeification at > 8 elements in a bucket — linked list turns into a red-black tree (O(log n))
- Memory overhead: ~32 bytes per element (HashMap$Node: hash, key, value, next)
- Default load factor 0.75 — balance between memory and collisions
- Custom serialization: map field = transient, elements serialized separately
- Mutable elements break search — hashCode() changes, element is “lost”
- For primitives, fastutil/Trove is better — open addressing, 4x less memory
Frequent follow-up questions:
- Why does HashSet use HashMap instead of its own structure? — Code reuse, all HashMap optimizations are automatically available.
- What is treeification and why is it needed? — At > 8 collisions in a bucket, the list turns into a tree — protection against Hash DoS attacks (O(n) → O(log n)).
- Why is serialization custom? — The hash table depends on JVM-specific hashCode() and randomization — upon deserialization, elements may end up in different buckets.
- How to reduce HashSet memory footprint for primitives? — Use fastutil LongOpenHashSet and similar — no Node objects, open addressing.
Red flags (DO NOT say):
- ❌ “HashSet stores elements directly” — it stores them as HashMap keys with a PRESENT placeholder
- ❌ “HashSet is fully thread-safe” — no, you need ConcurrentHashMap or Collections.synchronizedSet
- ❌ “HashSet guarantees iteration order” — order is not guaranteed, depends on hashCode()
- ❌ “Treeification always happens on collisions” — only at > 8 elements AND size > 64
Related topics:
- [[11. What is the difference between HashSet, LinkedHashSet and TreeSet]]
- [[14. What is Map and what implementations exist]]
- [[15. What is the difference between HashMap, LinkedHashMap and TreeMap]]
- [[19. How ConcurrentHashMap ensures thread-safety]]