What is a Correlated Subquery?
A correlated subquery is a subquery that depends on values from the outer query. It executes for each row of the outer query.
π’ Junior Level
A correlated subquery is a subquery that depends on values from the outer query. It executes for each row of the outer query.
Why understand this: such subqueries look elegant and are intuitively understandable, but on large tables they can paralyze the database β the subquery executes for each row of the outer query, giving O(N Γ M) complexity.
Simple analogy: Imagine checking each document in a folder. For each document, you open a separate folder and search for related files there. Itβs like a loop in code: for each element β perform a search.
Example:
-- For each user, find their last order
SELECT u.name,
(SELECT o.amount
FROM orders o
WHERE o.user_id = u.id -- β depends on the outer row!
ORDER BY o.created_at DESC
LIMIT 1) as last_order
FROM users u;
How it works:
- Take the first row from
users - Execute the subquery for this
user_id - Take the second row from
users - Execute the subquery for this
user_id - β¦ and so on for each row
When to use:
- When you need to compute something for each row individually
- For simple cases with small tables
- Often better replaced with JOIN or window functions
π‘ Middle Level
Logical Execution Model
Classic model: O(N Γ M) β the subquery executes N times (by the number of rows in the outer query).
-- Correlated subquery in SELECT
SELECT u.name,
(SELECT SUM(amount) FROM orders o WHERE o.user_id = u.id) as total
FROM users u;
-- Logically this is:
for user in users: -- 1,000,000 rows
total = SELECT SUM(amount) -- Executes 1,000,000 times!
FROM orders
WHERE user_id = user.id
print(user.name, total)
Physical Reality: Decorrelation
PostgreSQL tries to βunfoldβ correlated subqueries:
-- Original query
SELECT * FROM users u
WHERE EXISTS (
SELECT 1 FROM orders o WHERE o.user_id = u.id AND amount > 100
);
-- After decorrelation:
-- Plan: Hash Semi-Join (executed ONCE for all rows!)
-- Complexity: O(N + M) instead of O(N Γ M)
SubPlan vs InitPlan in EXPLAIN
SubPlan β executes for each row (bad):
EXPLAIN SELECT u.name,
(SELECT SUM(amount) FROM orders o WHERE o.user_id = u.id)
FROM users u;
-- Plan:
-- Seq Scan on users
-- SubPlan 1
-- -> Aggregate
-- -> Index Scan on orders
-- β οΈ SubPlan 1 executes for each row of users!
InitPlan β executes once (good):
EXPLAIN SELECT * FROM users
WHERE amount > (SELECT AVG(amount) FROM orders);
-- Plan:
-- InitPlan 1
-- -> Aggregate (AVG)
-- -> Seq Scan on orders
-- β Seq Scan on users
-- β
InitPlan executes 1 time!
Correlated Subqueries in UPDATE and DELETE
This is a common cause of slow operations:
-- β BAD: correlated subquery in UPDATE
UPDATE orders o SET status = 'OLD'
WHERE created_at < (
SELECT MIN(created_at)
FROM users u
WHERE u.id = o.user_id
);
-- For each order β subquery!
-- β
GOOD: UPDATE with FROM (JOIN-like syntax)
UPDATE orders o SET status = 'OLD'
FROM users u
WHERE o.user_id = u.id
AND o.created_at < u.min_created_at;
-- Plan: Hash Join (much faster!)
EXISTS: Important Nuances
-- These two queries are EQUIVALENT in PostgreSQL:
EXISTS (SELECT 1 FROM orders WHERE user_id = u.id)
EXISTS (SELECT * FROM orders WHERE user_id = u.id)
EXISTS (SELECT amount FROM orders WHERE user_id = u.id)
-- The optimizer ignores the SELECT list inside EXISTS
-- It only cares about the fact that at least one row exists
-- Write as your team prefers (usually SELECT 1)
Common Mistakes
- Scalar subquery in SELECT on large data
-- β 1M users β 1M subqueries SELECT u.name, (SELECT COUNT(*) FROM orders WHERE user_id = u.id) as orders_count FROM users u; -- β Replace with JOIN + GROUP BY SELECT u.name, COUNT(o.id) as orders_count FROM users u LEFT JOIN orders o ON u.id = o.user_id GROUP BY u.id, u.name; - Correlated UPDATE without index
-- β If orders.user_id is not indexed β Seq Scan in a loop! UPDATE users u SET total_orders = (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id); -- β First create an index CREATE INDEX idx_orders_user_id ON orders(user_id); - Ignoring Memoize (PG 14+)
-- On PG 14+ with Memoize it can be fast even with LATERAL -- Check EXPLAIN, don't prematurely optimize
Alternatives
| Situation | Correlated Subquery | Alternative |
|---|---|---|
| Aggregation per row | Subquery in SELECT | JOIN + GROUP BY |
| Finding one value | Subquery with LIMIT 1 | LATERAL JOIN |
| Existence check | Subquery in WHERE | EXISTS (Semi-Join) |
| Complex per-row logic | Correlated subquery | Window Functions |
Practical Example
-- β Correlated subquery (slow)
SELECT u.name,
(SELECT AVG(amount) FROM orders WHERE user_id = u.id) as avg_amount
FROM users u
WHERE (SELECT AVG(amount) FROM orders WHERE user_id = u.id) > 100;
-- β
Window function (fast)
-- A window function makes one pass over the data with sorting: O(N log N).
-- A correlated subquery executes a separate query for each row:
-- O(N Γ M). On a table of 1M rows, this is 1M separate queries
-- vs one pass β a difference of hundreds or thousands of times.
SELECT name, avg_amount
FROM (
SELECT u.name,
AVG(o.amount) OVER(PARTITION BY u.id) as avg_amount
FROM users u
INNER JOIN orders o ON u.id = o.user_id
) sub
WHERE avg_amount > 100;
π΄ Senior Level
Physical Execution Mechanisms
1. SubPlan (naive)
For each row of the outer query:
β Execute subquery
β Return result
Complexity: O(N Γ M)
2. InitPlan (independent)
Once before the main query:
β Compute subquery
β Save result
Main query uses the result
Complexity: O(M) + O(N)
3. Param (correlated, but optimized)
Subquery accepts a parameter from the outer row
β Can use an index on the parameter
β Can be cached (Memoize)
Decorrelation (Unnesting)
The optimizer tries to turn a correlated subquery into a JOIN:
-- Original: correlated EXISTS
SELECT * FROM users u
WHERE EXISTS (
SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.amount > 100
);
-- After decorrelation:
-- Hash Semi-Join
-- Hash Cond: (o.user_id = u.id)
-- Hash Filter: (o.amount > 100)
-- Complexity: O(N + M) instead of O(N Γ M)!
When decorrelation is NOT possible:
- Subquery contains
LIMIT/OFFSET - Subquery contains aggregates depending on the outer row
- Subquery contains
ORDER BYwithout a decorrelatable key
Memoize Node (PostgreSQL 14+)
Revolutionary optimization for correlated queries:
-- Query with LATERAL (forced Nested Loop)
SELECT u.name, o.amount
FROM users u
LEFT JOIN LATERAL (
SELECT amount FROM orders o
WHERE o.user_id = u.id
ORDER BY created_at DESC
LIMIT 1
) o ON true;
-- Plan with Memoize (PG 14+):
-- Nested Loop
-- β Seq Scan on users
-- β Memoize (cache_size: 10MB)
-- β Limit
-- β Index Scan Backward on orders
-- Memoize caches:
-- user_id = 1 β amount = 500
-- user_id = 2 β amount = 300
-- user_id = 1 β HIT! (taken from cache)
Memoize Metrics:
Memoize (cost=... rows=...)
Hits: 950,000 β taken from cache
Misses: 50,000 β recomputed
Evictions: 5,000 β evicted from cache (out of space)
Hit rate = 950,000 / 1,000,000 = 95% β great!
Hit rate < 50% β Memoize doesn't help (too many unique keys)
Memoize Limitations:
- Cache is memory-limited (
memoize_cache_sizeparameter, default 10MB) - Works only within a single Nested Loop node
- Doesnβt help if all keys are unique (0% repeats)
Correlated UPDATE/DELETE: Deep Analysis
Problem:
UPDATE orders o SET status = 'OLD'
WHERE created_at < (
SELECT MIN(created_at)
FROM users u
WHERE u.id = o.user_id
);
What happens:
for order in orders: -- 100,000,000 rows
min_date = SELECT MIN(created_at) -- 100,000,000 subqueries!
FROM users
WHERE id = order.user_id
if order.created_at < min_date:
UPDATE order SET status = 'OLD'
Solution via UPDATE FROM:
-- Step 1: Precompute minimum once
WITH user_min_dates AS (
SELECT user_id, MIN(created_at) as min_date
FROM users
GROUP BY user_id
)
-- Step 2: JOIN instead of subquery
UPDATE orders o SET status = 'OLD'
FROM user_min_dates umd
WHERE o.user_id = umd.user_id
AND o.created_at < umd.min_date;
-- Plan:
-- CTE Scan on user_min_dates
-- β HashAggregate (1 time)
-- Hash Join
-- β Seq Scan on orders
-- β Hash on user_min_dates
-- Total: 2 passes instead of 100,000,000!
EXISTS and NULL: Myths
-- Myth: SELECT 1 is faster than SELECT *
-- Reality: THE OPTIMIZER IGNORES THE SELECT LIST in EXISTS
EXPLAIN EXISTS (SELECT 1 FROM orders WHERE user_id = 1);
EXPLAIN EXISTS (SELECT * FROM orders WHERE user_id = 1);
EXPLAIN EXISTS (SELECT amount, created_at FROM orders WHERE user_id = 1);
-- All three give IDENTICAL plans!
-- Write SELECT 1 for readability (semantically clear we're checking existence)
Window Functions as an Alternative
-- β Correlated subquery for each row
SELECT o.*,
(SELECT AVG(amount)
FROM orders o2
WHERE o2.user_id = o.user_id) as user_avg
FROM orders o;
-- β
Window function (one pass!)
SELECT o.*,
AVG(amount) OVER(PARTITION BY user_id) as user_avg
FROM orders o;
-- Difference:
-- Subquery: O(N Γ M) β for each row, find all user's orders
-- Window: O(N log N) β one pass with sort/hash
Edge Cases
- Scalar subquery returns > 1 row
-- β Error: more than one row returned by a subquery used as an expression SELECT u.name, (SELECT amount FROM orders WHERE user_id = u.id) -- 10 orders! FROM users u; -- β LIMIT 1 or aggregation SELECT u.name, (SELECT amount FROM orders WHERE user_id = u.id ORDER BY created_at DESC LIMIT 1) FROM users u; - Scalar subquery returns NULL
-- If subquery found no rows β NULL SELECT u.name, (SELECT amount FROM orders WHERE user_id = u.id LIMIT 1) FROM users u; -- Users without orders β NULL -- β οΈ NULL in arithmetic β NULL SELECT u.name, (SELECT SUM(amount) FROM orders WHERE user_id = u.id) + 10 FROM users u; -- Without orders: NULL + 10 = NULL! -- Solution: COALESCE - Multiple correlated subqueries
-- β 3 subqueries per row β 3N executions SELECT u.name, (SELECT COUNT(*) FROM orders WHERE user_id = u.id) as orders_count, (SELECT SUM(amount) FROM orders WHERE user_id = u.id) as total_amount, (SELECT MAX(created_at) FROM orders WHERE user_id = u.id) as last_order FROM users u; -- β Single JOIN + GROUP BY SELECT u.name, COUNT(o.id) as orders_count, COALESCE(SUM(o.amount), 0) as total_amount, MAX(o.created_at) as last_order FROM users u LEFT JOIN orders o ON u.id = o.user_id GROUP BY u.id, u.name;
Performance Comparison
| Approach | Complexity | When to use |
|---|---|---|
| SubPlan | O(N Γ M) | Avoid on large data |
| InitPlan | O(M) + O(N) | Subquery doesnβt depend on outer row |
| Decorrelated (Semi-Join) | O(N + M) | EXISTS / IN with decorrelation |
| Memoize (PG 14+) | O(N Γ log M) with cache | LATERAL with key repeats |
| Window Function | O(N log N) | Aggregation per row |
| UPDATE FROM | O(N + M) | Correlated UPDATE |
Production Experience
Real scenario #1: SubPlan paralyzes the DB
- CRM system: compute deal sum for each contact
- Query: scalar subquery in SELECT (50,000 contacts)
- Time: 12 minutes (50,000 subqueries to orders)
- EXPLAIN:
SubPlan 1 β Index Scan (executed 50,000 times) - Solution:
SELECT c.*, COALESCE(SUM(o.amount), 0) as total_deals FROM contacts c LEFT JOIN orders o ON c.id = o.contact_id GROUP BY c.id; - Result: 800ms (900x speedup)
Real scenario #2: Memoize saves LATERAL
- API: last 3 orders per client (100,000 clients)
- LATERAL without Memoize (PG 13): > 3 minutes
- LATERAL with Memoize (PG 14): 2 seconds
- Memoize: Hits = 98%, Misses = 2% (many clients repeat in cache)
Real scenario #3: Correlated UPDATE
- E-commerce: mark old orders (100M rows)
- UPDATE with correlated subquery: > 8 hours
- Solution: UPDATE FROM + CTE
WITH cutoff_dates AS ( SELECT user_id, MIN(created_at) + INTERVAL '1 year' as cutoff FROM users GROUP BY user_id ) UPDATE orders o SET status = 'ARCHIVED' FROM cutoff_dates cd WHERE o.user_id = cd.user_id AND o.created_at < cd.cutoff; - Result: 15 minutes (32x speedup)
Monitoring
-- 1. Find SubPlan in plans
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.name, (SELECT COUNT(*) FROM orders WHERE user_id = u.id)
FROM users u;
-- Look for:
-- "SubPlan" β executes for each row
-- "InitPlan" β executes once
-- 2. Check Memoize efficiency
EXPLAIN (ANALYZE)
SELECT ... LATERAL ...;
-- Memoize (actual rows=...)
-- Hits: 500000
-- Misses: 10000
-- Evictions: 500
-- Hit rate = Hits / (Hits + Misses)
-- 3. Find correlated subqueries in pg_stat_statements
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE query ~ '\(SELECT.*WHERE.*=.*\)'
AND query LIKE '%SELECT%'
ORDER BY mean_exec_time DESC
LIMIT 10;
-- 4. Check UPDATE with subqueries
SELECT query FROM pg_stat_activity
WHERE query LIKE 'UPDATE%'
AND query LIKE '%SELECT%';
Best Practices
- Avoid scalar subqueries in SELECT on large tables
- EXISTS > IN for existence checks (Semi-Join)
- UPDATE FROM / DELETE USING instead of correlated subqueries
- Window functions for per-row aggregation
- Memoize (PG 14+) can save LATERAL β check Hit rate
- Indexes on correlation keys are critical for SubPlan
- COALESCE for subqueries that may return NULL
- Always check the plan: look for SubPlan vs InitPlan
Summary for Senior
- SubPlan = executes per row β O(N Γ M) β avoid
- InitPlan = executes once β O(M) + O(N) β great
- Decorrelation turns SubPlan into Semi-Join β O(N + M)
- Memoize (PG 14+) caches LATERAL results β check Hits/Misses
- Window functions β best alternative for per-row aggregation
- UPDATE FROM / DELETE USING β always faster than correlated subqueries
- EXISTS ignores SELECT list β write
SELECT 1for readability - Monitor:
EXPLAIN (ANALYZE), look for SubPlan, check Memoize Hit rate
π― Interview Cheat Sheet
Must know:
- Correlated subquery depends on the outer row β executes per row
- SubPlan = O(N Γ M) β disaster on large data
- InitPlan = O(M) + O(N) β computed once
- Decorrelation: optimizer turns subquery into Hash Semi-Join β O(N + M)
- Decorrelation NOT possible with LIMIT, OFFSET, aggregates with outer dependency
- Memoize (PG 14+): caches results β Hit rate > 80% = great
- EXISTS = Semi-Join: stops after first match, no duplicates
- Scalar subquery in SELECT β JOIN + GROUP BY is better
- UPDATE FROM / DELETE USING β Hash Join instead of N subqueries
- Window functions: O(N log N) instead of O(N Γ M) for per-row aggregation
Frequent follow-up questions:
- βHow to find SubPlan in the plan?β β Look for
SubPlanin EXPLAIN (executes per row) - βWhen is Memoize useless?β β All keys are unique β 0% Hits
- βWhy is EXISTS better than JOIN for existence checks?β β No Join Amplification
- βHow to speed up correlated UPDATE?β β UPDATE FROM + CTE
Red flags (DO NOT say):
- β βCorrelated subquery is fine for large tablesβ (SubPlan = disaster!)
- β βSELECT * in EXISTS is slower than SELECT 1β (optimizer ignores SELECT list)
- β βMemoize will always save itβ (needs repeating keys)
- β βLATERAL JOIN is always goodβ (Nested Loop β only for small outer)
Related topics:
- [[What is better JOIN or subquery]] β Unnesting, Decorrelation
- [[What types of JOIN exist]] β Semi-Join, Hash Join
- [[What do window functions do]] β alternative for per-row aggregation
- [[What is an explain plan]] β finding SubPlan in the plan