Question 10 · Section 1

What is a Correlated Subquery?

A correlated subquery is a subquery that depends on values from the outer query. It executes for each row of the outer query.

🟢 Junior Level

A correlated subquery is a subquery that depends on values from the outer query. It executes for each row of the outer query.

Why understand this: such subqueries look elegant and are intuitively understandable, but on large tables they can paralyze the database — the subquery executes for each row of the outer query, giving O(N × M) complexity.

Simple analogy: imagine checking each document in a folder. For each document, you open a separate folder and search for related files there. It's like a loop in code: for each element → perform a search.

Example:

-- For each user, find their last order
SELECT u.name,
       (SELECT o.amount
        FROM orders o
        WHERE o.user_id = u.id  -- ← depends on the outer row!
        ORDER BY o.created_at DESC
        LIMIT 1) as last_order
FROM users u;

How it works:

  1. Take the first row from users
  2. Execute the subquery for this user_id
  3. Take the second row from users
  4. Execute the subquery for this user_id
  5. … and so on for each row
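
The steps above can be reproduced literally and compared with the correlated query. A minimal Python/sqlite3 sketch with an invented two-user dataset (sqlite stands in here for any SQL engine; PostgreSQL behaves the same logically):

```python
import sqlite3

# Toy schema mirroring the article's users/orders tables (data invented)
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount INTEGER, created_at TEXT);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 100, '2024-01-01'), (1, 250, '2024-02-01'),
                              (2, 80,  '2024-03-01');
""")

# The correlated subquery: conceptually re-executed for every outer row
correlated = db.execute("""
    SELECT u.name,
           (SELECT o.amount FROM orders o
            WHERE o.user_id = u.id
            ORDER BY o.created_at DESC LIMIT 1) AS last_order
    FROM users u ORDER BY u.id
""").fetchall()

# The same logic written as an explicit per-row loop (steps 1-5 above)
manual = []
for uid, name in db.execute("SELECT id, name FROM users ORDER BY id"):
    row = db.execute(
        "SELECT amount FROM orders WHERE user_id = ? "
        "ORDER BY created_at DESC LIMIT 1", (uid,)).fetchone()
    manual.append((name, row[0] if row else None))

assert correlated == manual == [('alice', 250), ('bob', 80)]
```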

When to use:

  • When you need to compute something for each row individually
  • For simple cases with small tables
  • Often better replaced with JOIN or window functions

🟡 Middle Level

Logical Execution Model

Classic model: O(N × M) — the subquery executes once per row of the outer query, N times in total.

-- Correlated subquery in SELECT
SELECT u.name,
       (SELECT SUM(amount) FROM orders o WHERE o.user_id = u.id) as total
FROM users u;

-- Logically this is:
for user in users:              -- 1,000,000 rows
    total = SELECT SUM(amount)  -- Executes 1,000,000 times!
        FROM orders
        WHERE user_id = user.id
    print(user.name, total)

Physical Reality: Decorrelation

PostgreSQL tries to "unfold" correlated subqueries:

-- Original query
SELECT * FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.user_id = u.id AND amount > 100
);

-- After decorrelation:
-- Plan: Hash Semi-Join (executed ONCE for all rows!)
-- Complexity: O(N + M) instead of O(N × M)
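
The rewrite the optimizer performs can be imitated by hand and checked for result equivalence. A Python/sqlite3 sketch with invented data (sqlite's planner differs from PostgreSQL's, so only the results are compared, not the plans):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount INTEGER);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO orders VALUES (1, 50), (1, 150), (2, 90), (3, 200);
""")

# Correlated EXISTS: "users having at least one order over 100"
with_exists = db.execute("""
    SELECT u.id, u.name FROM users u
    WHERE EXISTS (SELECT 1 FROM orders o
                  WHERE o.user_id = u.id AND o.amount > 100)
    ORDER BY u.id
""").fetchall()

# Decorrelated semi-join, written by hand as JOIN + DISTINCT
with_join = db.execute("""
    SELECT DISTINCT u.id, u.name
    FROM users u JOIN orders o ON o.user_id = u.id
    WHERE o.amount > 100
    ORDER BY u.id
""").fetchall()

# Same rows either way; the decorrelated form just scans each table once
assert with_exists == with_join == [(1, 'alice'), (3, 'carol')]
```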

SubPlan vs InitPlan in EXPLAIN

SubPlan — executes for each row (bad):

EXPLAIN SELECT u.name,
       (SELECT SUM(amount) FROM orders o WHERE o.user_id = u.id)
FROM users u;

-- Plan:
-- Seq Scan on users
--   SubPlan 1
--     -> Aggregate
--        -> Index Scan on orders
-- ⚠️ SubPlan 1 executes for each row of users!

InitPlan — executes once (good):

EXPLAIN SELECT * FROM orders
WHERE amount > (SELECT AVG(amount) FROM orders);

-- Plan:
-- InitPlan 1
--   -> Aggregate (AVG)
--      -> Seq Scan on orders
-- -> Seq Scan on orders
-- ✅ InitPlan executes 1 time!

Correlated Subqueries in UPDATE and DELETE

This is a common cause of slow operations:

-- ❌ BAD: correlated subquery in UPDATE
UPDATE orders o SET status = 'OLD'
WHERE created_at < (
    SELECT MIN(created_at)
    FROM users u
    WHERE u.id = o.user_id
);
-- For each order → subquery!

-- ✅ GOOD: UPDATE with FROM (JOIN-like syntax)
-- (u.id is unique, so the MIN above is just the user's created_at)
UPDATE orders o SET status = 'OLD'
FROM users u
WHERE o.user_id = u.id
  AND o.created_at < u.created_at;
-- Plan: Hash Join (much faster!)

EXISTS: Important Nuances

-- These three queries are EQUIVALENT in PostgreSQL:
EXISTS (SELECT 1 FROM orders WHERE user_id = u.id)
EXISTS (SELECT * FROM orders WHERE user_id = u.id)
EXISTS (SELECT amount FROM orders WHERE user_id = u.id)

-- The optimizer ignores the SELECT list inside EXISTS
-- It only cares about the fact that at least one row exists
-- Write as your team prefers (usually SELECT 1)
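
The result equivalence (though not the PostgreSQL plans) is quick to verify. A Python/sqlite3 sketch with invented data:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount INTEGER);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 50);
""")

template = """SELECT u.name FROM users u
              WHERE EXISTS (SELECT {} FROM orders WHERE user_id = u.id)"""

# Run the same query with SELECT 1, SELECT *, and SELECT amount
results = [db.execute(template.format(sel)).fetchall()
           for sel in ("1", "*", "amount")]

# The SELECT list inside EXISTS never changes the answer
assert results[0] == results[1] == results[2] == [('alice',)]
```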

Common Mistakes

  1. Scalar subquery in SELECT on large data
    -- ❌ 1M users → 1M subqueries
    SELECT u.name,
           (SELECT COUNT(*) FROM orders WHERE user_id = u.id) as orders_count
    FROM users u;
    
    -- ✅ Replace with JOIN + GROUP BY
    SELECT u.name, COUNT(o.id) as orders_count
    FROM users u
    LEFT JOIN orders o ON u.id = o.user_id
    GROUP BY u.id, u.name;
    
  2. Correlated UPDATE without index
    -- ❌ If orders.user_id is not indexed → Seq Scan in a loop!
    UPDATE users u SET total_orders =
        (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id);
    
    -- ✅ First create an index
    CREATE INDEX idx_orders_user_id ON orders(user_id);
    
  3. Ignoring Memoize (PG 14+)
    -- On PG 14+ with Memoize it can be fast even with LATERAL
    -- Check EXPLAIN, don't prematurely optimize
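
The mistake-and-fix pair from item 1 can be checked for equivalence on toy data. A Python/sqlite3 sketch with invented rows:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1), (11, 1);
""")

# ❌ Scalar subquery: one lookup per user row
subquery = db.execute("""
    SELECT u.name,
           (SELECT COUNT(*) FROM orders WHERE user_id = u.id) AS orders_count
    FROM users u ORDER BY u.id
""").fetchall()

# ✅ Single LEFT JOIN + GROUP BY
joined = db.execute("""
    SELECT u.name, COUNT(o.id) AS orders_count
    FROM users u LEFT JOIN orders o ON u.id = o.user_id
    GROUP BY u.id, u.name ORDER BY u.id
""").fetchall()

# LEFT JOIN keeps bob with COUNT(o.id) = 0, matching the subquery's 0
assert subquery == joined == [('alice', 2), ('bob', 0)]
```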
    

Alternatives

Situation             | Correlated Subquery   | Alternative
Aggregation per row   | Subquery in SELECT    | JOIN + GROUP BY
Finding one value     | Subquery with LIMIT 1 | LATERAL JOIN
Existence check       | Subquery in WHERE     | EXISTS (Semi-Join)
Complex per-row logic | Correlated subquery   | Window Functions

Practical Example

-- ❌ Correlated subquery (slow)
SELECT u.name,
       (SELECT AVG(amount) FROM orders WHERE user_id = u.id) as avg_amount
FROM users u
WHERE (SELECT AVG(amount) FROM orders WHERE user_id = u.id) > 100;

-- ✅ Window function (fast)
-- A window function makes one pass over the data with sorting: O(N log N).
-- A correlated subquery executes a separate query for each row:
-- O(N × M). On a table of 1M rows, this is 1M separate queries
-- vs one pass — a difference of hundreds or thousands of times.
SELECT DISTINCT name, avg_amount
FROM (
    SELECT u.name,
           AVG(o.amount) OVER(PARTITION BY u.id) as avg_amount
    FROM users u
    INNER JOIN orders o ON u.id = o.user_id
) sub
WHERE avg_amount > 100;
-- DISTINCT collapses the duplicates: the window form emits one row
-- per order, while the correlated form emits one row per user.
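
The equivalence of the two forms can be checked on a toy dataset. A Python/sqlite3 sketch with invented data (sqlite supports window functions since 3.25; DISTINCT is needed because the window form produces one row per order):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 120), (1, 180), (2, 40);
""")

# Correlated form: per-user average computed twice, then filtered
correlated = db.execute("""
    SELECT u.name,
           (SELECT AVG(amount) FROM orders WHERE user_id = u.id) AS avg_amount
    FROM users u
    WHERE (SELECT AVG(amount) FROM orders WHERE user_id = u.id) > 100
""").fetchall()

# Window form: one pass, then filter; DISTINCT collapses per-order rows
windowed = db.execute("""
    SELECT DISTINCT name, avg_amount FROM (
        SELECT u.name,
               AVG(o.amount) OVER (PARTITION BY u.id) AS avg_amount
        FROM users u JOIN orders o ON u.id = o.user_id
    ) WHERE avg_amount > 100
""").fetchall()

# alice's average is 150, bob's 40: only alice passes either way
assert correlated == windowed == [('alice', 150.0)]
```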

🔴 Senior Level

Physical Execution Mechanisms

1. SubPlan (naive)

For each row of the outer query:
  → Execute subquery
  → Return result

Complexity: O(N × M)

2. InitPlan (independent)

Once before the main query:
  → Compute subquery
  → Save result

Main query uses the result
Complexity: O(M) + O(N)

3. Param (correlated, but optimized)

Subquery accepts a parameter from the outer row
→ Can use an index on the parameter
→ Can be cached (Memoize)

Decorrelation (Unnesting)

The optimizer tries to turn a correlated subquery into a JOIN:

-- Original: correlated EXISTS
SELECT * FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.amount > 100
);

-- After decorrelation:
-- Hash Semi-Join
--   Hash Cond: (o.user_id = u.id)
--   Hash Filter: (o.amount > 100)

-- Complexity: O(N + M) instead of O(N × M)!

When decorrelation is NOT possible:

  • Subquery contains LIMIT / OFFSET
  • Subquery contains aggregates depending on the outer row
  • Subquery contains ORDER BY without a decorrelatable key

Memoize Node (PostgreSQL 14+)

Revolutionary optimization for correlated queries:

-- Query with LATERAL (forced Nested Loop)
SELECT u.name, o.amount
FROM users u
LEFT JOIN LATERAL (
    SELECT amount FROM orders o
    WHERE o.user_id = u.id
    ORDER BY created_at DESC
    LIMIT 1
) o ON true;

-- Plan with Memoize (PG 14+):
-- Nested Loop
--   → Seq Scan on users
--   → Memoize (Cache Key: u.id)
--       → Limit
--          → Index Scan Backward on orders

-- Memoize caches:
--   user_id = 1 → amount = 500
--   user_id = 2 → amount = 300
--   user_id = 1 → HIT! (taken from cache)

Memoize Metrics:

Memoize (cost=... rows=...)
  Hits: 950,000     ← taken from cache
  Misses: 50,000    ← recomputed
  Evictions: 5,000  ← evicted from cache (out of space)

Hit rate = 950,000 / 1,000,000 = 95% → great!
Hit rate < 50% → Memoize doesn't help (too many unique keys)

Memoize Limitations:

  • Cache is memory-limited (sized from work_mem × hash_mem_multiplier; there is no dedicated Memoize size setting)
  • Works only within a single Nested Loop node
  • Doesn’t help if all keys are unique (0% repeats)
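
The caching behaviour can be modelled as a plain dictionary in front of the inner lookup. A hypothetical Python sketch of the idea, not PostgreSQL's implementation; `run_nested_loop` and the toy data are invented:

```python
# Toy model of the Memoize node: cache inner-side results per join key.
# Hypothetical sketch, not PostgreSQL's actual implementation.
def run_nested_loop(outer_keys, inner_lookup):
    cache, hits, misses = {}, 0, 0
    results = []
    for key in outer_keys:
        if key in cache:
            hits += 1                    # repeated key: served from cache
        else:
            misses += 1
            cache[key] = inner_lookup(key)  # the "expensive" inner subplan
        results.append((key, cache[key]))
    return results, hits, misses

orders = {1: 500, 2: 300}  # user_id -> latest order amount (invented)
outer = [1, 2, 1, 1, 2]    # outer-row key stream with repeats

rows, hits, misses = run_nested_loop(outer, orders.get)
# Keys 1 and 2 each miss once, every repeat hits: 3 hits, 2 misses (60%)
assert (hits, misses) == (3, 2)
```

If every key in `outer` were unique, hits would be 0 and the cache would only cost memory, which is exactly the "all keys unique" limitation above.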

Correlated UPDATE/DELETE: Deep Analysis

Problem:

UPDATE orders o SET status = 'OLD'
WHERE created_at < (
    SELECT MIN(created_at)
    FROM users u
    WHERE u.id = o.user_id
);

What happens:

for order in orders:                      -- 100,000,000 rows
    min_date = SELECT MIN(created_at)     -- 100,000,000 subqueries!
        FROM users
        WHERE id = order.user_id
    if order.created_at < min_date:
        UPDATE order SET status = 'OLD'

Solution via UPDATE FROM:

-- Step 1: Precompute the per-user date once
-- (users.id is unique, so MIN(created_at) per user is just created_at)
WITH user_min_dates AS (
    SELECT id AS user_id, created_at AS min_date
    FROM users
)
-- Step 2: JOIN instead of subquery
UPDATE orders o SET status = 'OLD'
FROM user_min_dates umd
WHERE o.user_id = umd.user_id
  AND o.created_at < umd.min_date;

-- Plan:
-- Hash Join
--   Hash Cond: (o.user_id = umd.user_id)
--   -> Seq Scan on orders
--   -> Hash
--      -> Seq Scan on users
-- Total: 2 passes instead of 100,000,000!
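
The precompute-then-join idea can be sanity-checked on a toy dataset. A Python/sqlite3 sketch with an invented schema (older sqlite lacks PostgreSQL-style UPDATE ... FROM, so variant B models only the "compute once" half via a temp table); both variants must leave the table in the same state:

```python
import sqlite3

def setup():
    """Fresh in-memory DB with invented users/orders rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE users  (id INTEGER PRIMARY KEY, created_at TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER,
                             created_at TEXT, status TEXT);
        INSERT INTO users  VALUES (1, '2024-01-01'), (2, '2024-06-01');
        INSERT INTO orders VALUES (10, 1, '2023-12-01', 'NEW'),
                                  (11, 1, '2024-02-01', 'NEW'),
                                  (12, 2, '2024-05-01', 'NEW');
    """)
    return db

# Variant A: correlated subquery evaluated per order row
a = setup()
a.execute("""
    UPDATE orders SET status = 'OLD'
    WHERE created_at < (SELECT u.created_at FROM users u
                        WHERE u.id = orders.user_id)
""")

# Variant B: precompute the per-user date once, then a single keyed lookup
b = setup()
b.executescript("""
    CREATE TEMP TABLE cutoff AS SELECT id AS user_id, created_at FROM users;
    UPDATE orders SET status = 'OLD'
    WHERE created_at < (SELECT c.created_at FROM cutoff c
                        WHERE c.user_id = orders.user_id);
""")

q = "SELECT id, status FROM orders ORDER BY id"
assert a.execute(q).fetchall() == b.execute(q).fetchall() == \
       [(10, 'OLD'), (11, 'NEW'), (12, 'OLD')]
```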

EXISTS and NULL: Myths

-- Myth: SELECT 1 is faster than SELECT *
-- Reality: THE OPTIMIZER IGNORES THE SELECT LIST in EXISTS

EXPLAIN SELECT EXISTS (SELECT 1 FROM orders WHERE user_id = 1);
EXPLAIN SELECT EXISTS (SELECT * FROM orders WHERE user_id = 1);
EXPLAIN SELECT EXISTS (SELECT amount, created_at FROM orders WHERE user_id = 1);

-- All three give IDENTICAL plans!
-- Write SELECT 1 for readability (semantically clear we're checking existence)

Window Functions as an Alternative

-- ❌ Correlated subquery for each row
SELECT o.*,
       (SELECT AVG(amount)
        FROM orders o2
        WHERE o2.user_id = o.user_id) as user_avg
FROM orders o;

-- ✅ Window function (one pass!)
SELECT o.*,
       AVG(amount) OVER(PARTITION BY user_id) as user_avg
FROM orders o;

-- Difference:
-- Subquery: O(N × M) — for each row, find all user's orders
-- Window: O(N log N) — one pass with sort/hash

Edge Cases

  1. Scalar subquery returns > 1 row
    -- ❌ Error: more than one row returned by a subquery used as an expression
    SELECT u.name,
           (SELECT amount FROM orders WHERE user_id = u.id) -- 10 orders!
    FROM users u;
    
    -- ✅ LIMIT 1 or aggregation
    SELECT u.name,
           (SELECT amount FROM orders WHERE user_id = u.id ORDER BY created_at DESC LIMIT 1)
    FROM users u;
    
  2. Scalar subquery returns NULL
    -- If subquery found no rows → NULL
    SELECT u.name,
           (SELECT amount FROM orders WHERE user_id = u.id LIMIT 1)
    FROM users u;
    -- Users without orders → NULL

    -- ⚠️ NULL in arithmetic → NULL
    SELECT u.name,
           (SELECT SUM(amount) FROM orders WHERE user_id = u.id) + 10
    FROM users u;
    -- Without orders: NULL + 10 = NULL!
    -- Solution: COALESCE
    
  3. Multiple correlated subqueries
    -- ❌ 3 subqueries per row → 3N executions
    SELECT u.name,
           (SELECT COUNT(*) FROM orders WHERE user_id = u.id) as orders_count,
           (SELECT SUM(amount) FROM orders WHERE user_id = u.id) as total_amount,
           (SELECT MAX(created_at) FROM orders WHERE user_id = u.id) as last_order
    FROM users u;
    
    -- ✅ Single JOIN + GROUP BY
    SELECT u.name,
           COUNT(o.id) as orders_count,
           COALESCE(SUM(o.amount), 0) as total_amount,
           MAX(o.created_at) as last_order
    FROM users u
    LEFT JOIN orders o ON u.id = o.user_id
    GROUP BY u.id, u.name;
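
Edge case 2 is easy to reproduce on toy data (Python/sqlite3 sketch, invented rows): the NULL propagates through the arithmetic until COALESCE intercepts it:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount INTEGER);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 90);
""")

# bob has no orders: SUM(...) is NULL, and NULL + 10 stays NULL
raw = db.execute("""
    SELECT u.name,
           (SELECT SUM(amount) FROM orders WHERE user_id = u.id) + 10
    FROM users u ORDER BY u.id
""").fetchall()
assert raw == [('alice', 100), ('bob', None)]

# COALESCE turns the missing aggregate into 0 before the arithmetic
fixed = db.execute("""
    SELECT u.name,
           COALESCE((SELECT SUM(amount) FROM orders
                     WHERE user_id = u.id), 0) + 10
    FROM users u ORDER BY u.id
""").fetchall()
assert fixed == [('alice', 100), ('bob', 10)]
```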
    

Performance Comparison

Approach                 | Complexity              | When to use
SubPlan                  | O(N × M)                | Avoid on large data
InitPlan                 | O(M) + O(N)             | Subquery doesn't depend on outer row
Decorrelated (Semi-Join) | O(N + M)                | EXISTS / IN with decorrelation
Memoize (PG 14+)         | O(N × log M) with cache | LATERAL with key repeats
Window Function          | O(N log N)              | Aggregation per row
UPDATE FROM              | O(N + M)                | Correlated UPDATE

Production Experience

Real scenario #1: SubPlan paralyzes the DB

  • CRM system: compute deal sum for each contact
  • Query: scalar subquery in SELECT (50,000 contacts)
  • Time: 12 minutes (50,000 subqueries to orders)
  • EXPLAIN: SubPlan 1 β†’ Index Scan (executed 50,000 times)
  • Solution:
    SELECT c.*, COALESCE(SUM(o.amount), 0) as total_deals
    FROM contacts c
    LEFT JOIN orders o ON c.id = o.contact_id
    GROUP BY c.id;
    
  • Result: 800ms (900x speedup)

Real scenario #2: Memoize saves LATERAL

  • API: last 3 orders per client (100,000 clients)
  • LATERAL without Memoize (PG 13): > 3 minutes
  • LATERAL with Memoize (PG 14): 2 seconds
  • Memoize: Hits = 98%, Misses = 2% (many clients repeat in cache)

Real scenario #3: Correlated UPDATE

  • E-commerce: mark old orders (100M rows)
  • UPDATE with correlated subquery: > 8 hours
  • Solution: UPDATE FROM + CTE
    WITH cutoff_dates AS (
        -- users.id is unique, so no GROUP BY is needed
        SELECT id AS user_id, created_at + INTERVAL '1 year' as cutoff
        FROM users
    )
    )
    UPDATE orders o SET status = 'ARCHIVED'
    FROM cutoff_dates cd
    WHERE o.user_id = cd.user_id
      AND o.created_at < cd.cutoff;
    
  • Result: 15 minutes (32x speedup)

Monitoring

-- 1. Find SubPlan in plans
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.name, (SELECT COUNT(*) FROM orders WHERE user_id = u.id)
FROM users u;

-- Look for:
-- "SubPlan" → executes for each row
-- "InitPlan" → executes once

-- 2. Check Memoize efficiency
EXPLAIN (ANALYZE)
SELECT ... LATERAL ...;

-- Memoize (actual rows=...)
--   Hits: 500000
--   Misses: 10000
--   Evictions: 500
-- Hit rate = Hits / (Hits + Misses)

-- 3. Find correlated subqueries in pg_stat_statements
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE query ~ '\(SELECT.*WHERE.*=.*\)'
  AND query LIKE '%SELECT%'
ORDER BY mean_exec_time DESC
LIMIT 10;

-- 4. Check UPDATE with subqueries
SELECT query FROM pg_stat_activity
WHERE query LIKE 'UPDATE%'
  AND query LIKE '%SELECT%';

Best Practices

  1. Avoid scalar subqueries in SELECT on large tables
  2. EXISTS > IN for existence checks (Semi-Join)
  3. UPDATE FROM / DELETE USING instead of correlated subqueries
  4. Window functions for per-row aggregation
  5. Memoize (PG 14+) can save LATERAL — check Hit rate
  6. Indexes on correlation keys are critical for SubPlan
  7. COALESCE for subqueries that may return NULL
  8. Always check the plan: look for SubPlan vs InitPlan

Summary for Senior

  • SubPlan = executes per row → O(N × M) → avoid
  • InitPlan = executes once → O(M) + O(N) → great
  • Decorrelation turns SubPlan into Semi-Join → O(N + M)
  • Memoize (PG 14+) caches LATERAL results → check Hits/Misses
  • Window functions — best alternative for per-row aggregation
  • UPDATE FROM / DELETE USING — almost always faster than correlated subqueries
  • EXISTS ignores SELECT list — write SELECT 1 for readability
  • Monitor: EXPLAIN (ANALYZE), look for SubPlan, check Memoize Hit rate

🎯 Interview Cheat Sheet

Must know:

  • Correlated subquery depends on the outer row → executes per row
  • SubPlan = O(N × M) → disaster on large data
  • InitPlan = O(M) + O(N) → computed once
  • Decorrelation: optimizer turns subquery into Hash Semi-Join → O(N + M)
  • Decorrelation NOT possible with LIMIT, OFFSET, aggregates with outer dependency
  • Memoize (PG 14+): caches results → Hit rate > 80% = great
  • EXISTS = Semi-Join: stops after first match, no duplicates
  • Scalar subquery in SELECT → JOIN + GROUP BY is better
  • UPDATE FROM / DELETE USING → Hash Join instead of N subqueries
  • Window functions: O(N log N) instead of O(N × M) for per-row aggregation

Frequent follow-up questions:

  • "How to find SubPlan in the plan?" → Look for SubPlan in EXPLAIN (executes per row)
  • "When is Memoize useless?" → All keys are unique → 0% Hits
  • "Why is EXISTS better than JOIN for existence checks?" → No Join Amplification
  • "How to speed up correlated UPDATE?" → UPDATE FROM + CTE

Red flags (DO NOT say):

  • ❌ "Correlated subquery is fine for large tables" (SubPlan = disaster!)
  • ❌ "SELECT * in EXISTS is slower than SELECT 1" (optimizer ignores SELECT list)
  • ❌ "Memoize will always save it" (needs repeating keys)
  • ❌ "LATERAL JOIN is always good" (Nested Loop → only for small outer)

Related topics:

  • [[What is better JOIN or subquery]] → Unnesting, Decorrelation
  • [[What types of JOIN exist]] → Semi-Join, Hash Join
  • [[What do window functions do]] → alternative for per-row aggregation
  • [[What is an explain plan]] → finding SubPlan in the plan