What do distinct(), sorted(), limit(), skip() operations do?
🟢 Junior Level
All four operations are intermediate, but they differ from filter and map:
distinct() — removes duplicates:
List.of(1, 2, 2, 3, 3, 3).stream().distinct().collect(toList());
// [1, 2, 3]
sorted() — sorts elements:
List.of(3, 1, 2).stream().sorted().collect(toList());
// [1, 2, 3]
// Note: in parallelStream, sorted() first collects all elements, sorts them,
// then redistributes them to workers — merge overhead may exceed the benefit.
limit(n) — takes the first n elements:
List.of(1, 2, 3, 4, 5).stream().limit(3).collect(toList());
// [1, 2, 3]
skip(n) — skips the first n elements:
List.of(1, 2, 3, 4, 5).stream().skip(2).collect(toList());
// [3, 4, 5]
🟡 Middle Level
Stateful operations
Stateful operations require knowledge of ALL elements in the pipeline, not just the current one. distinct() must see all elements to find unique ones. sorted() must sort the entire set.
distinct() and sorted() are stateful. They require buffering:
- distinct() — maintains an internal HashSet to track unique elements
- sorted() — is a barrier in the pipeline: the stream collects ALL elements, sorts them, then passes them on
Memory: Both consume O(N) memory — can lead to heap growth and GC pauses on large data.
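A small trace makes the barrier visible: with peek() placed before and after sorted(), every "before" entry appears ahead of any "after" entry, because sorted() buffers the whole source before emitting anything downstream (a minimal sketch; class and method names are ours):

```java
import java.util.List;
import java.util.stream.Collectors;

public class SortedBarrierDemo {
    // Records which pipeline stage each element passes through, in order.
    public static List<String> trace() {
        StringBuilder log = new StringBuilder();
        List.of(3, 1, 2).stream()
            .peek(e -> log.append("before:").append(e).append(" "))
            .sorted()  // barrier: buffers all three elements before emitting any
            .peek(e -> log.append("after:").append(e).append(" "))
            .collect(Collectors.toList());
        return List.of(log.toString().trim().split(" "));
    }

    public static void main(String[] args) {
        // All "before" entries come first: sorted() consumed the whole
        // source before anything reached the downstream peek.
        System.out.println(trace());
        // [before:3, before:1, before:2, after:1, after:2, after:3]
    }
}
```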
Short-circuit operations
Of the two, only limit() is a short-circuit operation:
- limit(n) — signals upstream via a cancellation flag to stop producing elements once n have passed
- skip(n) — is NOT short-circuiting: it keeps an internal counter and “swallows” the first n elements, which the source still has to produce
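A sketch of the short-circuit in action: counting with peek() shows how many elements limit(3) actually pulls from a million-element source (class and method names are ours):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class LimitShortCircuit {
    // Counts how many source elements a limit(3) pipeline actually traverses.
    public static int pulledByLimit() {
        AtomicInteger pulled = new AtomicInteger();
        IntStream.range(0, 1_000_000)
            .peek(i -> pulled.incrementAndGet())
            .limit(3)   // short-circuits: upstream stops after 3 elements
            .sum();
        return pulled.get();
    }

    public static void main(String[] args) {
        System.out.println(pulledByLimit()); // 3, not 1_000_000
    }
}
```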
Order optimization
// BAD — sort a million, then take 10
stream.sorted().limit(10)...
// GOOD — filter first, reducing N
stream.filter(relevant).sorted().limit(10)...
🔴 Senior Level
When NOT to use
- distinct() — if data is already unique (HashSet at input) — unnecessary checks
- sorted() — if order does not matter (use findAny instead of findFirst)
- limit(n) after sorted() — sorting the entire set just for the first n (use a PriorityQueue)
- skip(n) for pagination on large data — O(n) skip; keyset pagination is better
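The PriorityQueue alternative to sorted().limit(n) can be sketched like this: a bounded min-heap keeps only the n largest elements seen so far, giving O(N log n) instead of sorting everything at O(N log N) (helper names are ours):

```java
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Collectors;

public class TopN {
    // Returns the n largest values in descending order without a full sort.
    public static List<Integer> topN(List<Integer> data, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap, size <= n
        for (int x : data) {
            heap.offer(x);
            if (heap.size() > n) {
                heap.poll(); // evict the smallest: heap holds the n largest so far
            }
        }
        return heap.stream()
                   .sorted(Comparator.reverseOrder()) // sorts only n elements
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(topN(List.of(5, 1, 9, 3, 7, 2), 3)); // [9, 7, 5]
    }
}
```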
Concurrency problems
In parallelStream, these operations become bottlenecks:
- limit() and skip(): require strict synchronization between threads
- distinct(): merging hash sets from different threads is expensive
- sorted(): parallel sorting is effective only on very large arrays
Edge Cases
- distinct() on mutable objects: if you modify a field of an object after it passes distinct — the uniqueness contract is violated
- sorted() stability: streams guarantee stable sorting (preserving the order of equal elements) if the source is ordered
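The mutable-object pitfall, sketched with a hypothetical Point class: both objects pass distinct() while still distinct, then a later mutation makes them equal, so the "unique" result actually contains duplicates (all names are ours):

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class MutableDistinct {
    // Hypothetical mutable class with value-based equals/hashCode.
    static final class Point {
        int x;
        Point(int x) { this.x = x; }
        @Override public boolean equals(Object o) {
            return o instanceof Point && ((Point) o).x == x;
        }
        @Override public int hashCode() { return Objects.hash(x); }
    }

    // Returns how many distinct elements the "unique" list holds after mutation.
    public static long distinctAfterMutation() {
        Point a = new Point(1), b = new Point(2);
        List<Point> unique = List.of(a, b).stream()
            .distinct()            // both pass: 1 != 2 at check time
            .peek(p -> p.x = 42)   // mutation AFTER the distinct check
            .collect(Collectors.toList());
        // Both elements are now equal — the "unique" list contains duplicates.
        return unique.stream().distinct().count();
    }

    public static void main(String[] args) {
        System.out.println(distinctAfterMutation()); // 1, not 2
    }
}
```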
Diagnostics
- jmap -histo: will show an inflated HashSet or Object[] from distinct/sorted on large data
- Infinite stream check: if an infinite stream (Stream.iterate) hangs — check whether limit() is placed before or after heavy filters
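The limit() placement check can be sketched against Stream.iterate: the same filter and limit, applied in different order, pull very different numbers of elements from the infinite source (class and helper names are ours):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class InfiniteLimit {
    // Counts how many elements the pipeline pulls from the infinite source.
    public static int pulled(boolean limitFirst) {
        AtomicInteger count = new AtomicInteger();
        Stream<Integer> src = Stream.iterate(0, i -> i + 1)
                                    .peek(i -> count.incrementAndGet());
        Stream<Integer> s = limitFirst
            ? src.limit(3).filter(i -> i % 100 == 0)   // bounded: pulls 3 elements
            : src.filter(i -> i % 100 == 0).limit(3);  // pulls 0..200 to find 3 matches
        s.collect(Collectors.toList());
        return count.get();
    }

    public static void main(String[] args) {
        System.out.println(pulled(true));  // 3
        System.out.println(pulled(false)); // 201
        // With a filter that never matches, the second form would never terminate.
    }
}
```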
🎯 Interview Cheat Sheet
Must know:
- distinct() and sorted() are stateful operations, requiring buffering of all elements in memory (O(N))
- distinct() uses an internal HashSet to track uniqueness
- sorted() is a barrier in the pipeline: collects ALL elements, sorts, then passes them on
- limit() is a short-circuit operation; neither limit() nor skip() requires buffering
- Order matters: filter().sorted().limit() is more efficient than sorted().limit()
- limit() after sorted() — sorting the entire set for the first n (use a PriorityQueue)
- skip() for pagination on large data — O(n); keyset pagination is better
- In parallelStream, stateful operations become bottlenecks due to synchronization
Frequent follow-up questions:
- How do stateful operations differ from stateless? — Stateless (filter, map) process each element independently; stateful (distinct, sorted) must see the entire element set.
- Why is sorted() a barrier? — The stream cannot sort elements until it has collected them all — this blocks the pipeline until full data collection.
- What is worse for performance — distinct or sorted? — sorted() is heavier, as it requires full buffering and the sorting itself; distinct() uses HashSet with O(1) checks.
- Why is limit() after sorted() an anti-pattern? — You sort the entire dataset then take the first n — better to filter and limit before sorting.
Red flags (DO NOT say):
- “sorted() processes elements one at a time” — incorrect, it is a barrier, all elements are needed
- “distinct() does not consume additional memory” — incorrect, it creates an internal HashSet
- “limit() after sorted() is optimal” — incorrect, sorting the entire set for n elements is excessive
- “parallelStream always speeds up sorted()” — incorrect, merge overhead may exceed the benefit
Related topics:
- [[21. What is lazy evaluation in Stream]]
- [[24. How does short-circuiting work in Stream]]
- [[26. What do findFirst() and findAny() operations do]]
- [[27. How to collect Stream into Map]]