What do distinct(), sorted(), limit(), skip() operations do?

🟢 Junior Level

All four are intermediate operations, but unlike filter and map they are stateful: how an element is handled depends on the elements seen before it.

distinct() — removes duplicates:

List.of(1, 2, 2, 3, 3, 3).stream().distinct().collect(toList());
// [1, 2, 3]

sorted() — sorts elements:

List.of(3, 1, 2).stream().sorted().collect(toList());
// [1, 2, 3]

// sorted() on a parallelStream: all elements must still be gathered and sorted before
// anything moves downstream, so the split/merge overhead may exceed the benefit on small inputs.

limit(n) — takes the first n elements:

List.of(1, 2, 3, 4, 5).stream().limit(3).collect(toList());
// [1, 2, 3]

skip(n) — skips the first n elements:

List.of(1, 2, 3, 4, 5).stream().skip(2).collect(toList());
// [3, 4, 5]
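skip() and limit() are often combined to cut a "page" out of a list. The sketch below assumes a hypothetical helper named page with a 0-based page index; it is an illustration, not a library method.

```java
import java.util.List;
import java.util.stream.Collectors;

class PageDemo {
    // Hypothetical helper: returns page number `page` (0-based) holding up to `size` items.
    static <T> List<T> page(List<T> items, int page, int size) {
        return items.stream()
                .skip((long) page * size)  // drop the elements of earlier pages
                .limit(size)               // keep at most one page
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> data = List.of(1, 2, 3, 4, 5, 6, 7);
        System.out.println(page(data, 0, 3)); // [1, 2, 3]
        System.out.println(page(data, 2, 3)); // [7]
        System.out.println(page(data, 5, 3)); // [] (skipping past the end is not an error)
    }
}
```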

🟡 Middle Level

Stateful operations

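Stateful operations keep state from elements they have already seen, so an element cannot be processed in isolation. distinct() must remember every element it has already let through. sorted() must see the entire input before it can emit anything.

distinct() and sorted() are both stateful, but they hold data differently:

distinct() — in the sequential OpenJDK implementation, tracks already-seen elements in an internal HashSet (compared via equals()/hashCode()) and passes each new element downstream immediately
sorted() — this is a barrier in the pipeline. The stream collects ALL elements, sorts them, then passes them further

Memory: sorted() buffers all N elements (O(N)); distinct() holds one entry per unique element (O(N) in the worst case). Both can lead to heap growth and GC pauses on large data.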
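The difference between "remembers what it has seen" and "barrier" can be made visible by logging when elements enter and leave each stage. This is a small sketch for a sequential stream; the class and method names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

class StatefulDemo {
    // Logs each element as it enters distinct() and as it leaves it.
    static List<String> traceDistinct() {
        List<String> log = new ArrayList<>();
        Stream.of(3, 1, 3, 2)
                .peek(e -> log.add("in " + e))
                .distinct()
                .forEach(e -> log.add("out " + e));
        return log;
    }

    // Same trace for sorted(): nothing leaves until everything has entered.
    static List<String> traceSorted() {
        List<String> log = new ArrayList<>();
        Stream.of(3, 1, 3, 2)
                .peek(e -> log.add("in " + e))
                .sorted()
                .forEach(e -> log.add("out " + e));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(traceDistinct());
        // [in 3, out 3, in 1, out 1, in 3, in 2, out 2]
        System.out.println(traceSorted());
        // [in 3, in 1, in 3, in 2, out 1, out 2, out 3, out 3]
    }
}
```

In the first trace the duplicate 3 enters but never leaves; in the second, every "in" precedes the first "out".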

Short-circuit operations

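Only limit() is a short-circuit operation. skip() is not, although both are stateful because they count elements:

limit(n) — after n elements, signals upstream to stop via a cancellation flag, which is why it can terminate an infinite stream
skip(n) — has an internal counter and “swallows” the first n elements; everything after them still flows through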
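A counter placed in front of each operation shows how many source elements are actually pulled. The following is a minimal sketch for sequential streams; the class name and the pulled counter are assumptions made for the demo.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ShortCircuitDemo {
    // limit(3) on an infinite source: the pipeline stops once three elements passed.
    static List<Integer> firstThree(AtomicInteger pulled) {
        return Stream.iterate(1, i -> i + 1)
                .peek(i -> pulled.incrementAndGet())
                .limit(3)
                .collect(Collectors.toList());
    }

    // skip(2) on a finite source: the skipped elements are still produced upstream.
    static List<Integer> afterSkip(AtomicInteger pulled) {
        return Stream.of(1, 2, 3, 4, 5)
                .peek(i -> pulled.incrementAndGet())
                .skip(2)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        AtomicInteger pulledByLimit = new AtomicInteger();
        System.out.println(firstThree(pulledByLimit)); // [1, 2, 3]
        System.out.println("pulled before limit: " + pulledByLimit.get());

        AtomicInteger pulledBySkip = new AtomicInteger();
        System.out.println(afterSkip(pulledBySkip));   // [3, 4, 5]
        System.out.println("pulled before skip: " + pulledBySkip.get());
    }
}
```

The first pipeline terminates even though the source is infinite; the second shows that skip() saves no upstream work.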

Order optimization

// BAD — sort a million, then take 10
stream.sorted().limit(10)...

// GOOD — filter first, reducing N
stream.filter(relevant).sorted().limit(10)...
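To see the effect of operation order, one can count comparator calls. The sketch below assumes the task "ten smallest even numbers" so that both orderings return the same result; the counting comparator and the seeded random data are illustration-only helpers.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;

class OrderDemo {
    // Comparator that counts how many comparisons the sort performs.
    static Comparator<Integer> counting(AtomicLong counter) {
        return (a, b) -> {
            counter.incrementAndGet();
            return Integer.compare(a, b);
        };
    }

    // Sorts everything, then discards the irrelevant elements.
    static List<Integer> sortThenFilter(List<Integer> data, AtomicLong counter) {
        return data.stream()
                .sorted(counting(counter))
                .filter(n -> n % 2 == 0)
                .limit(10)
                .collect(Collectors.toList());
    }

    // Discards irrelevant elements first, so the sort sees a smaller input.
    static List<Integer> filterThenSort(List<Integer> data, AtomicLong counter) {
        return data.stream()
                .filter(n -> n % 2 == 0)
                .sorted(counting(counter))
                .limit(10)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> data = new Random(42).ints(100_000, 0, 1_000_000)
                .boxed().collect(Collectors.toList());
        AtomicLong slow = new AtomicLong();
        AtomicLong fast = new AtomicLong();
        System.out.println(sortThenFilter(data, slow).equals(filterThenSort(data, fast))); // true
        System.out.println("comparisons: " + slow.get() + " vs " + fast.get());
    }
}
```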

🔴 Senior Level

When NOT to use

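distinct() — if the data is already unique (for example, the source is a Set), the check is redundant
sorted() — if the result order does not matter, drop the sort entirely (and prefer findAny() over findFirst() when any match will do)
limit(n) after sorted() — sorts the entire set just to keep the first n; for small n on large data a bounded PriorityQueue is cheaper
skip(n) for pagination on large data — skipping is O(n) because every skipped element is still produced; keyset pagination is better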
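The PriorityQueue alternative to sorted().limit(n) can be sketched as a bounded max-heap: it keeps only the n smallest elements seen so far, so memory is O(n) instead of O(N). The class and method names here are hypothetical, and the sketch is specialised to Integer for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopN {
    // Returns the n smallest values in ascending order.
    static List<Integer> smallest(Iterable<Integer> source, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Max-heap: the largest of the current candidates sits at the head.
        PriorityQueue<Integer> heap = new PriorityQueue<>(n, Comparator.reverseOrder());
        for (Integer value : source) {
            if (heap.size() < n) {
                heap.add(value);
            } else if (value < heap.peek()) {
                heap.poll();      // evict the largest candidate
                heap.add(value);  // keep the smaller newcomer
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        Collections.sort(result); // only n elements are sorted here
        return result;
    }

    public static void main(String[] args) {
        System.out.println(smallest(List.of(5, 1, 4, 2, 3), 2)); // [1, 2]
    }
}
```

Whether this beats sorted().limit(n) in practice depends on N and n, so it is worth measuring before replacing a readable pipeline.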

Concurrency problems

In parallelStream, these operations become bottlenecks:

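limit() and skip(): On ordered parallel streams they must respect encounter order, which requires coordination between threads and extra buffering; calling unordered() first can help when order does not matter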
distinct(): Merging hash sets from different threads is expensive
sorted(): Parallel sorting is effective only on very large arrays
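As a rough illustration of the ordering constraint, the sketch below runs limit() on a parallel stream with and without unordered(). The class and method names are assumptions; no timing claim is made, only the difference in what is guaranteed.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class ParallelLimitDemo {
    // Ordered parallel stream: limit(n) must return exactly the first n elements.
    static List<Integer> orderedFirst(int n) {
        return IntStream.range(0, 1_000_000)
                .parallel()
                .boxed()
                .limit(n)
                .collect(Collectors.toList());
    }

    // Unordered parallel stream: limit(n) may return any n elements.
    static List<Integer> anyN(int n) {
        return IntStream.range(0, 1_000_000)
                .parallel()
                .unordered()
                .boxed()
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(orderedFirst(5)); // [0, 1, 2, 3, 4]
        System.out.println(anyN(5).size());  // 5, but the elements themselves may vary
    }
}
```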

Edge Cases

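distinct() on mutable objects: Uniqueness is checked with equals()/hashCode() at the moment an element passes through distinct(). If you later modify a field that takes part in equals(), the result may contain duplicates again
sorted() stability: The sort is stable (equal elements keep their relative order) for ordered streams; for unordered streams no stability guarantee is made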
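Both edge cases can be reproduced in a few lines. The Tag class below is a hypothetical mutable type whose equals() depends on a mutable field; it exists only for this sketch.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class EdgeCaseDemo {
    // Hypothetical mutable value type: equality depends on the mutable field.
    static final class Tag {
        String name;
        Tag(String name) { this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof Tag && Objects.equals(name, ((Tag) o).name);
        }
        @Override public int hashCode() { return Objects.hashCode(name); }
        @Override public String toString() { return name; }
    }

    // distinct() yields [a, b]; mutating an element afterwards recreates a duplicate.
    static List<Tag> distinctThenMutate() {
        List<Tag> unique = Stream.of(new Tag("a"), new Tag("b"), new Tag("a"))
                .distinct()
                .collect(Collectors.toList());
        unique.get(1).name = "a";
        return unique;
    }

    // Strings of equal length keep their original relative order.
    static List<String> stableByLength() {
        return Stream.of("bb", "a", "cc", "d")
                .sorted(Comparator.comparingInt(String::length))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(distinctThenMutate()); // [a, a]
        System.out.println(stableByLength());     // [a, d, bb, cc]
    }
}
```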

Diagnostics

jmap -histo: Will show inflated HashSet or Object[] from distinct/sorted on large data
Infinite stream check: If an infinite stream (Stream.iterate) hangs, check whether limit() is placed before or after heavy filters, and whether sorted() sits before limit() (sorted() on an infinite stream never completes)
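The placement of limit() relative to filter() decides what is bounded: the source or the result. A hedged sketch with made-up method names; the hanging variant is left as a comment so the example stays runnable.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class InfiniteStreamDemo {
    // limit() before filter(): at most 20 source elements are ever generated.
    static List<Integer> boundedSource() {
        return Stream.iterate(1, i -> i + 1)
                .limit(20)
                .filter(i -> i % 5 == 0)
                .collect(Collectors.toList());
    }

    // filter() before limit(): terminates only because three matches exist.
    static List<Integer> boundedResult() {
        return Stream.iterate(1, i -> i + 1)
                .filter(i -> i % 5 == 0)
                .limit(3)
                .collect(Collectors.toList());
    }

    // Never completes: the source yields odd numbers only, so no element
    // passes the filter and limit(1) is never satisfied.
    // Stream.iterate(1, i -> i + 2).filter(i -> i % 2 == 0).limit(1)
    //         .collect(Collectors.toList());

    public static void main(String[] args) {
        System.out.println(boundedSource()); // [5, 10, 15, 20]
        System.out.println(boundedResult()); // [5, 10, 15]
    }
}
```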

🎯 Interview Cheat Sheet

Must know:

distinct() and sorted() are stateful operations: sorted() buffers all elements in memory (O(N)), distinct() remembers every unique element seen so far (O(N) in the worst case)
distinct() uses an internal HashSet to track uniqueness
sorted() is a barrier in the pipeline: collects ALL elements, sorts, then passes further
limit() is a short-circuit operation; skip() is not. In sequential streams both only keep a counter and do not require buffering
Order matters: filter().sorted().limit() is more efficient than sorted().limit()
limit() after sorted() — sorting the entire set for the first n (use PriorityQueue)
skip() for pagination on large data — O(n), keyset pagination is better
In parallelStream, stateful operations become bottlenecks due to synchronization

Frequent follow-up questions:

How do stateful operations differ from stateless? — Stateless (filter, map) process each element independently; stateful (distinct, sorted) depend on previously seen elements, and sorted() must see the entire input before emitting anything.
Why is sorted() a barrier? — The stream cannot sort elements until it has collected them all — this blocks the pipeline until full data collection.
What is worse for performance — distinct or sorted? — sorted() is heavier, as it requires full buffering and the sorting itself; distinct() uses HashSet with O(1) checks.
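Why is limit() after sorted() an anti-pattern? — You sort the entire dataset just to keep the first n. Filter before sorting to shrink the input, or use a bounded PriorityQueue; moving limit() in front of sorted() is not a fix, because it changes the result.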

Red flags (DO NOT say):

“sorted() processes elements one at a time” — incorrect, it is a barrier, all elements are needed
“distinct() does not consume additional memory” — incorrect, it creates an internal HashSet
“limit() after sorted() is optimal” — incorrect, sorting the entire set for n elements is excessive
“parallelStream always speeds up sorted()” — incorrect, merge overhead may exceed the benefit

Related topics:

[[21. What is lazy evaluation in Stream]]
[[24. How does short-circuiting work in Stream]]
[[26. What do findFirst() and findAny() operations do]]
[[27. How to collect Stream into Map]]