Question 10 · Section 1

What is a Correlated Subquery?

A correlated subquery is a subquery that depends on values from the outer query. It executes for each row of the outer query.

🟢 Junior Level

A correlated subquery is a subquery that depends on values from the outer query. It executes for each row of the outer query.

Why understand this: such subqueries look elegant and are intuitively understandable, but on large tables they can paralyze the database — the subquery executes for each row of the outer query, giving O(N × M) complexity.

Simple analogy: imagine checking each document in a folder. For each document, you open a separate folder and search for related files there. It's like a loop in code: for each element → perform a search.

Example:

-- For each user, find their last order
SELECT u.name,
       (SELECT o.amount
        FROM orders o
        WHERE o.user_id = u.id  -- ← depends on the outer row!
        ORDER BY o.created_at DESC
        LIMIT 1) as last_order
FROM users u;

How it works:

  1. Take the first row from users
  2. Execute the subquery for this user_id
  3. Take the second row from users
  4. Execute the subquery for this user_id
  5. … and so on for each row
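
The steps above can be reproduced literally and compared with the correlated query. A minimal Python/sqlite3 sketch with an invented two-user dataset (sqlite stands in here for any SQL engine; PostgreSQL behaves the same logically):

```python
import sqlite3

# Toy schema mirroring the article's users/orders tables (data invented)
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount INTEGER, created_at TEXT);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 100, '2024-01-01'), (1, 250, '2024-02-01'),
                              (2, 80,  '2024-03-01');
""")

# The correlated subquery: conceptually re-executed for every outer row
correlated = db.execute("""
    SELECT u.name,
           (SELECT o.amount FROM orders o
            WHERE o.user_id = u.id
            ORDER BY o.created_at DESC LIMIT 1) AS last_order
    FROM users u ORDER BY u.id
""").fetchall()

# The same logic written as an explicit per-row loop (steps 1-5 above)
manual = []
for uid, name in db.execute("SELECT id, name FROM users ORDER BY id"):
    row = db.execute(
        "SELECT amount FROM orders WHERE user_id = ? "
        "ORDER BY created_at DESC LIMIT 1", (uid,)).fetchone()
    manual.append((name, row[0] if row else None))

assert correlated == manual == [('alice', 250), ('bob', 80)]
```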

When to use:

  • When you need to compute something for each row individually
  • For simple cases with small tables
  • Often better replaced with JOIN or window functions

🟡 Middle Level

Logical Execution Model

Classic model: O(N × M) — the subquery executes once per row of the outer query, N times in total.

-- Correlated subquery in SELECT
SELECT u.name,
       (SELECT SUM(amount) FROM orders o WHERE o.user_id = u.id) as total
FROM users u;

-- Logically this is:
for user in users:              -- 1,000,000 rows
    total = SELECT SUM(amount)  -- Executes 1,000,000 times!
        FROM orders
        WHERE user_id = user.id
    print(user.name, total)

Physical Reality: Decorrelation

PostgreSQL tries to "unfold" correlated subqueries:

-- Original query
SELECT * FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.user_id = u.id AND amount > 100
);

-- After decorrelation:
-- Plan: Hash Semi-Join (executed ONCE for all rows!)
-- Complexity: O(N + M) instead of O(N × M)
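
The rewrite the optimizer performs can be imitated by hand and checked for result equivalence. A Python/sqlite3 sketch with invented data (sqlite's planner differs from PostgreSQL's, so only the results are compared, not the plans):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount INTEGER);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO orders VALUES (1, 50), (1, 150), (2, 90), (3, 200);
""")

# Correlated EXISTS: "users having at least one order over 100"
with_exists = db.execute("""
    SELECT u.id, u.name FROM users u
    WHERE EXISTS (SELECT 1 FROM orders o
                  WHERE o.user_id = u.id AND o.amount > 100)
    ORDER BY u.id
""").fetchall()

# Decorrelated semi-join, written by hand as JOIN + DISTINCT
with_join = db.execute("""
    SELECT DISTINCT u.id, u.name
    FROM users u JOIN orders o ON o.user_id = u.id
    WHERE o.amount > 100
    ORDER BY u.id
""").fetchall()

# Same rows either way; the decorrelated form just scans each table once
assert with_exists == with_join == [(1, 'alice'), (3, 'carol')]
```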

SubPlan vs InitPlan in EXPLAIN

SubPlan — executes for each row (bad):

EXPLAIN SELECT u.name,
       (SELECT SUM(amount) FROM orders o WHERE o.user_id = u.id)
FROM users u;

-- Plan:
-- Seq Scan on users
--   SubPlan 1
--     -> Aggregate
--        -> Index Scan on orders
-- ⚠️ SubPlan 1 executes for each row of users!

InitPlan — executes once (good):

EXPLAIN SELECT * FROM orders
WHERE amount > (SELECT AVG(amount) FROM orders);

-- Plan:
-- InitPlan 1
--   -> Aggregate (AVG)
--      -> Seq Scan on orders
-- -> Seq Scan on orders
-- ✅ InitPlan executes 1 time!

Correlated Subqueries in UPDATE and DELETE

This is a common cause of slow operations:

-- ❌ BAD: correlated subquery in UPDATE
UPDATE orders o SET status = 'OLD'
WHERE created_at < (
    SELECT MIN(created_at)
    FROM users u
    WHERE u.id = o.user_id
);
-- For each order → subquery!

-- ✅ GOOD: UPDATE with FROM (JOIN-like syntax)
-- (u.id is unique, so the MIN above is just the user's created_at)
UPDATE orders o SET status = 'OLD'
FROM users u
WHERE o.user_id = u.id
  AND o.created_at < u.created_at;
-- Plan: Hash Join (much faster!)

EXISTS: Important Nuances

-- These three queries are EQUIVALENT in PostgreSQL:
EXISTS (SELECT 1 FROM orders WHERE user_id = u.id)
EXISTS (SELECT * FROM orders WHERE user_id = u.id)
EXISTS (SELECT amount FROM orders WHERE user_id = u.id)

-- The optimizer ignores the SELECT list inside EXISTS
-- It only cares about the fact that at least one row exists
-- Write as your team prefers (usually SELECT 1)
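
The result equivalence (though not the PostgreSQL plans) is quick to verify. A Python/sqlite3 sketch with invented data:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount INTEGER);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 50);
""")

template = """SELECT u.name FROM users u
              WHERE EXISTS (SELECT {} FROM orders WHERE user_id = u.id)"""

# Run the same query with SELECT 1, SELECT *, and SELECT amount
results = [db.execute(template.format(sel)).fetchall()
           for sel in ("1", "*", "amount")]

# The SELECT list inside EXISTS never changes the answer
assert results[0] == results[1] == results[2] == [('alice',)]
```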

Common Mistakes

  1. Scalar subquery in SELECT on large data
    -- ❌ 1M users → 1M subqueries
    SELECT u.name,
           (SELECT COUNT(*) FROM orders WHERE user_id = u.id) as orders_count
    FROM users u;
    
    -- ✅ Replace with JOIN + GROUP BY
    SELECT u.name, COUNT(o.id) as orders_count
    FROM users u
    LEFT JOIN orders o ON u.id = o.user_id
    GROUP BY u.id, u.name;
    
  2. Correlated UPDATE without index
    -- ❌ If orders.user_id is not indexed → Seq Scan in a loop!
    UPDATE users u SET total_orders =
        (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id);
    
    -- ✅ First create an index
    CREATE INDEX idx_orders_user_id ON orders(user_id);
    
  3. Ignoring Memoize (PG 14+)
    -- On PG 14+ with Memoize it can be fast even with LATERAL
    -- Check EXPLAIN, don't prematurely optimize
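
The mistake-and-fix pair from item 1 can be checked for equivalence on toy data. A Python/sqlite3 sketch with invented rows:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1), (11, 1);
""")

# ❌ Scalar subquery: one lookup per user row
subquery = db.execute("""
    SELECT u.name,
           (SELECT COUNT(*) FROM orders WHERE user_id = u.id) AS orders_count
    FROM users u ORDER BY u.id
""").fetchall()

# ✅ Single LEFT JOIN + GROUP BY
joined = db.execute("""
    SELECT u.name, COUNT(o.id) AS orders_count
    FROM users u LEFT JOIN orders o ON u.id = o.user_id
    GROUP BY u.id, u.name ORDER BY u.id
""").fetchall()

# LEFT JOIN keeps bob with COUNT(o.id) = 0, matching the subquery's 0
assert subquery == joined == [('alice', 2), ('bob', 0)]
```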
    

Alternatives

Situation             | Correlated Subquery   | Alternative
Aggregation per row   | Subquery in SELECT    | JOIN + GROUP BY
Finding one value     | Subquery with LIMIT 1 | LATERAL JOIN
Existence check       | Subquery in WHERE     | EXISTS (Semi-Join)
Complex per-row logic | Correlated subquery   | Window Functions

Practical Example

-- ❌ Correlated subquery (slow)
SELECT u.name,
       (SELECT AVG(amount) FROM orders WHERE user_id = u.id) as avg_amount
FROM users u
WHERE (SELECT AVG(amount) FROM orders WHERE user_id = u.id) > 100;

-- ✅ Window function (fast)
-- A window function makes one pass over the data with sorting: O(N log N).
-- A correlated subquery executes a separate query for each row:
-- O(N × M). On a table of 1M rows, this is 1M separate queries
-- vs one pass — a difference of hundreds or thousands of times.
SELECT DISTINCT name, avg_amount
FROM (
    SELECT u.name,
           AVG(o.amount) OVER(PARTITION BY u.id) as avg_amount
    FROM users u
    INNER JOIN orders o ON u.id = o.user_id
) sub
WHERE avg_amount > 100;
-- DISTINCT collapses the duplicates: the window form emits one row
-- per order, while the correlated form emits one row per user.
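
The equivalence of the two forms can be checked on a toy dataset. A Python/sqlite3 sketch with invented data (sqlite supports window functions since 3.25; DISTINCT is needed because the window form produces one row per order):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 120), (1, 180), (2, 40);
""")

# Correlated form: per-user average computed twice, then filtered
correlated = db.execute("""
    SELECT u.name,
           (SELECT AVG(amount) FROM orders WHERE user_id = u.id) AS avg_amount
    FROM users u
    WHERE (SELECT AVG(amount) FROM orders WHERE user_id = u.id) > 100
""").fetchall()

# Window form: one pass, then filter; DISTINCT collapses per-order rows
windowed = db.execute("""
    SELECT DISTINCT name, avg_amount FROM (
        SELECT u.name,
               AVG(o.amount) OVER (PARTITION BY u.id) AS avg_amount
        FROM users u JOIN orders o ON u.id = o.user_id
    ) WHERE avg_amount > 100
""").fetchall()

# alice's average is 150, bob's 40: only alice passes either way
assert correlated == windowed == [('alice', 150.0)]
```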

🔴 Senior Level

Physical Execution Mechanisms

1. SubPlan (naive)

For each row of the outer query:
  → Execute subquery
  → Return result

Complexity: O(N × M)

2. InitPlan (independent)

Once before the main query:
  → Compute subquery
  → Save result

Main query uses the result
Complexity: O(M) + O(N)

3. Param (correlated, but optimized)

Subquery accepts a parameter from the outer row
→ Can use an index on the parameter
→ Can be cached (Memoize)

Decorrelation (Unnesting)

The optimizer tries to turn a correlated subquery into a JOIN:

-- Original: correlated EXISTS
SELECT * FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.amount > 100
);

-- After decorrelation:
-- Hash Semi-Join
--   Hash Cond: (o.user_id = u.id)
--   Hash Filter: (o.amount > 100)

-- Complexity: O(N + M) instead of O(N × M)!

When decorrelation is NOT possible:

  • Subquery contains LIMIT / OFFSET
  • Subquery contains aggregates depending on the outer row
  • Subquery contains ORDER BY without a decorrelatable key

Memoize Node (PostgreSQL 14+)

Revolutionary optimization for correlated queries:

-- Query with LATERAL (forced Nested Loop)
SELECT u.name, o.amount
FROM users u
LEFT JOIN LATERAL (
    SELECT amount FROM orders o
    WHERE o.user_id = u.id
    ORDER BY created_at DESC
    LIMIT 1
) o ON true;

-- Plan with Memoize (PG 14+):
-- Nested Loop
--   → Seq Scan on users
--   → Memoize (Cache Key: u.id)
--       → Limit
--          → Index Scan Backward on orders

-- Memoize caches:
--   user_id = 1 → amount = 500
--   user_id = 2 → amount = 300
--   user_id = 1 → HIT! (taken from cache)

Memoize Metrics:

Memoize (cost=... rows=...)
  Hits: 950,000     ← taken from cache
  Misses: 50,000    ← recomputed
  Evictions: 5,000  ← evicted from cache (out of space)

Hit rate = 950,000 / 1,000,000 = 95% → great!
Hit rate < 50% → Memoize doesn't help (too many unique keys)

Memoize Limitations:

  • Cache is memory-limited (sized from work_mem × hash_mem_multiplier; there is no dedicated Memoize size setting)
  • Works only within a single Nested Loop node
  • Doesn’t help if all keys are unique (0% repeats)
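
The caching behaviour can be modelled as a plain dictionary in front of the inner lookup. A hypothetical Python sketch of the idea, not PostgreSQL's implementation; `run_nested_loop` and the toy data are invented:

```python
# Toy model of the Memoize node: cache inner-side results per join key.
# Hypothetical sketch, not PostgreSQL's actual implementation.
def run_nested_loop(outer_keys, inner_lookup):
    cache, hits, misses = {}, 0, 0
    results = []
    for key in outer_keys:
        if key in cache:
            hits += 1                    # repeated key: served from cache
        else:
            misses += 1
            cache[key] = inner_lookup(key)  # the "expensive" inner subplan
        results.append((key, cache[key]))
    return results, hits, misses

orders = {1: 500, 2: 300}  # user_id -> latest order amount (invented)
outer = [1, 2, 1, 1, 2]    # outer-row key stream with repeats

rows, hits, misses = run_nested_loop(outer, orders.get)
# Keys 1 and 2 each miss once, every repeat hits: 3 hits, 2 misses (60%)
assert (hits, misses) == (3, 2)
```

If every key in `outer` were unique, hits would be 0 and the cache would only cost memory, which is exactly the "all keys unique" limitation above.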

Correlated UPDATE/DELETE: Deep Analysis

Problem:

UPDATE orders o SET status = 'OLD'
WHERE created_at < (
    SELECT MIN(created_at)
    FROM users u
    WHERE u.id = o.user_id
);

What happens:

for order in orders:                      -- 100,000,000 rows
    min_date = SELECT MIN(created_at)     -- 100,000,000 subqueries!
        FROM users
        WHERE id = order.user_id
    if order.created_at < min_date:
        UPDATE order SET status = 'OLD'

Solution via UPDATE FROM:

-- Step 1: Precompute the per-user date once
-- (users.id is unique, so MIN(created_at) per user is just created_at)
WITH user_min_dates AS (
    SELECT id AS user_id, created_at AS min_date
    FROM users
)
-- Step 2: JOIN instead of subquery
UPDATE orders o SET status = 'OLD'
FROM user_min_dates umd
WHERE o.user_id = umd.user_id
  AND o.created_at < umd.min_date;

-- Plan:
-- Hash Join
--   Hash Cond: (o.user_id = umd.user_id)
--   -> Seq Scan on orders
--   -> Hash
--      -> Seq Scan on users
-- Total: 2 passes instead of 100,000,000!
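
The precompute-then-join idea can be sanity-checked on a toy dataset. A Python/sqlite3 sketch with an invented schema (older sqlite lacks PostgreSQL-style UPDATE ... FROM, so variant B models only the "compute once" half via a temp table); both variants must leave the table in the same state:

```python
import sqlite3

def setup():
    """Fresh in-memory DB with invented users/orders rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE users  (id INTEGER PRIMARY KEY, created_at TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER,
                             created_at TEXT, status TEXT);
        INSERT INTO users  VALUES (1, '2024-01-01'), (2, '2024-06-01');
        INSERT INTO orders VALUES (10, 1, '2023-12-01', 'NEW'),
                                  (11, 1, '2024-02-01', 'NEW'),
                                  (12, 2, '2024-05-01', 'NEW');
    """)
    return db

# Variant A: correlated subquery evaluated per order row
a = setup()
a.execute("""
    UPDATE orders SET status = 'OLD'
    WHERE created_at < (SELECT u.created_at FROM users u
                        WHERE u.id = orders.user_id)
""")

# Variant B: precompute the per-user date once, then a single keyed lookup
b = setup()
b.executescript("""
    CREATE TEMP TABLE cutoff AS SELECT id AS user_id, created_at FROM users;
    UPDATE orders SET status = 'OLD'
    WHERE created_at < (SELECT c.created_at FROM cutoff c
                        WHERE c.user_id = orders.user_id);
""")

q = "SELECT id, status FROM orders ORDER BY id"
assert a.execute(q).fetchall() == b.execute(q).fetchall() == \
       [(10, 'OLD'), (11, 'NEW'), (12, 'OLD')]
```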

EXISTS and NULL: Myths

-- Myth: SELECT 1 is faster than SELECT *
-- Reality: THE OPTIMIZER IGNORES THE SELECT LIST in EXISTS

EXPLAIN SELECT EXISTS (SELECT 1 FROM orders WHERE user_id = 1);
EXPLAIN SELECT EXISTS (SELECT * FROM orders WHERE user_id = 1);
EXPLAIN SELECT EXISTS (SELECT amount, created_at FROM orders WHERE user_id = 1);

-- All three give IDENTICAL plans!
-- Write SELECT 1 for readability (semantically clear we're checking existence)

Window Functions as an Alternative

-- ❌ Correlated subquery for each row
SELECT o.*,
       (SELECT AVG(amount)
        FROM orders o2
        WHERE o2.user_id = o.user_id) as user_avg
FROM orders o;

-- ✅ Window function (one pass!)
SELECT o.*,
       AVG(amount) OVER(PARTITION BY user_id) as user_avg
FROM orders o;

-- Difference:
-- Subquery: O(N × M) — for each row, find all user's orders
-- Window: O(N log N) — one pass with sort/hash

Edge Cases

  1. Scalar subquery returns > 1 row
    -- ❌ Error: more than one row returned by a subquery used as an expression
    SELECT u.name,
           (SELECT amount FROM orders WHERE user_id = u.id) -- 10 orders!
    FROM users u;
    
    -- ✅ LIMIT 1 or aggregation
    SELECT u.name,
           (SELECT amount FROM orders WHERE user_id = u.id ORDER BY created_at DESC LIMIT 1)
    FROM users u;
    
  2. Scalar subquery returns NULL
    -- If subquery found no rows → NULL
    SELECT u.name,
           (SELECT amount FROM orders WHERE user_id = u.id LIMIT 1)
    FROM users u;
    -- Users without orders → NULL

    -- ⚠️ NULL in arithmetic → NULL
    SELECT u.name,
           (SELECT SUM(amount) FROM orders WHERE user_id = u.id) + 10
    FROM users u;
    -- Without orders: NULL + 10 = NULL!
    -- Solution: COALESCE
    
  3. Multiple correlated subqueries
    -- ❌ 3 subqueries per row → 3N executions
    SELECT u.name,
           (SELECT COUNT(*) FROM orders WHERE user_id = u.id) as orders_count,
           (SELECT SUM(amount) FROM orders WHERE user_id = u.id) as total_amount,
           (SELECT MAX(created_at) FROM orders WHERE user_id = u.id) as last_order
    FROM users u;
    
    -- ✅ Single JOIN + GROUP BY
    SELECT u.name,
           COUNT(o.id) as orders_count,
           COALESCE(SUM(o.amount), 0) as total_amount,
           MAX(o.created_at) as last_order
    FROM users u
    LEFT JOIN orders o ON u.id = o.user_id
    GROUP BY u.id, u.name;
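
Edge case 2 is easy to reproduce on toy data (Python/sqlite3 sketch, invented rows): the NULL propagates through the arithmetic until COALESCE intercepts it:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount INTEGER);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 90);
""")

# bob has no orders: SUM(...) is NULL, and NULL + 10 stays NULL
raw = db.execute("""
    SELECT u.name,
           (SELECT SUM(amount) FROM orders WHERE user_id = u.id) + 10
    FROM users u ORDER BY u.id
""").fetchall()
assert raw == [('alice', 100), ('bob', None)]

# COALESCE turns the missing aggregate into 0 before the arithmetic
fixed = db.execute("""
    SELECT u.name,
           COALESCE((SELECT SUM(amount) FROM orders
                     WHERE user_id = u.id), 0) + 10
    FROM users u ORDER BY u.id
""").fetchall()
assert fixed == [('alice', 100), ('bob', 10)]
```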
    

Performance Comparison

Approach                 | Complexity              | When to use
SubPlan                  | O(N × M)                | Avoid on large data
InitPlan                 | O(M) + O(N)             | Subquery doesn't depend on outer row
Decorrelated (Semi-Join) | O(N + M)                | EXISTS / IN with decorrelation
Memoize (PG 14+)         | O(N × log M) with cache | LATERAL with key repeats
Window Function          | O(N log N)              | Aggregation per row
UPDATE FROM              | O(N + M)                | Correlated UPDATE

Production Experience

Real scenario #1: SubPlan paralyzes the DB

  • CRM system: compute deal sum for each contact
  • Query: scalar subquery in SELECT (50,000 contacts)
  • Time: 12 minutes (50,000 subqueries to orders)
  • EXPLAIN: SubPlan 1 β†’ Index Scan (executed 50,000 times)
  • Solution:
    SELECT c.*, COALESCE(SUM(o.amount), 0) as total_deals
    FROM contacts c
    LEFT JOIN orders o ON c.id = o.contact_id
    GROUP BY c.id;
    
  • Result: 800ms (900x speedup)

Real scenario #2: Memoize saves LATERAL

  • API: last 3 orders per client (100,000 clients)
  • LATERAL without Memoize (PG 13): > 3 minutes
  • LATERAL with Memoize (PG 14): 2 seconds
  • Memoize: Hits = 98%, Misses = 2% (many clients repeat in cache)

Real scenario #3: Correlated UPDATE

  • E-commerce: mark old orders (100M rows)
  • UPDATE with correlated subquery: > 8 hours
  • Solution: UPDATE FROM + CTE
    WITH cutoff_dates AS (
        -- users.id is unique, so no GROUP BY is needed
        SELECT id AS user_id, created_at + INTERVAL '1 year' as cutoff
        FROM users
    )
    )
    UPDATE orders o SET status = 'ARCHIVED'
    FROM cutoff_dates cd
    WHERE o.user_id = cd.user_id
      AND o.created_at < cd.cutoff;
    
  • Result: 15 minutes (32x speedup)

Monitoring

-- 1. Find SubPlan in plans
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.name, (SELECT COUNT(*) FROM orders WHERE user_id = u.id)
FROM users u;

-- Look for:
-- "SubPlan" → executes for each row
-- "InitPlan" → executes once

-- 2. Check Memoize efficiency
EXPLAIN (ANALYZE)
SELECT ... LATERAL ...;

-- Memoize (actual rows=...)
--   Hits: 500000
--   Misses: 10000
--   Evictions: 500
-- Hit rate = Hits / (Hits + Misses)

-- 3. Find correlated subqueries in pg_stat_statements
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE query ~ '\(SELECT.*WHERE.*=.*\)'
  AND query LIKE '%SELECT%'
ORDER BY mean_exec_time DESC
LIMIT 10;

-- 4. Check UPDATE with subqueries
SELECT query FROM pg_stat_activity
WHERE query LIKE 'UPDATE%'
  AND query LIKE '%SELECT%';

Best Practices

  1. Avoid scalar subqueries in SELECT on large tables
  2. EXISTS > IN for existence checks (Semi-Join)
  3. UPDATE FROM / DELETE USING instead of correlated subqueries
  4. Window functions for per-row aggregation
  5. Memoize (PG 14+) can save LATERAL — check Hit rate
  6. Indexes on correlation keys are critical for SubPlan
  7. COALESCE for subqueries that may return NULL
  8. Always check the plan: look for SubPlan vs InitPlan

Summary for Senior

  • SubPlan = executes per row → O(N × M) → avoid
  • InitPlan = executes once → O(M) + O(N) → great
  • Decorrelation turns SubPlan into Semi-Join → O(N + M)
  • Memoize (PG 14+) caches LATERAL results → check Hits/Misses
  • Window functions — best alternative for per-row aggregation
  • UPDATE FROM / DELETE USING — almost always faster than correlated subqueries
  • EXISTS ignores SELECT list — write SELECT 1 for readability
  • Monitor: EXPLAIN (ANALYZE), look for SubPlan, check Memoize Hit rate

🎯 Interview Cheat Sheet

Must know:

  • Correlated subquery depends on the outer row → executes per row
  • SubPlan = O(N × M) → disaster on large data
  • InitPlan = O(M) + O(N) → computed once
  • Decorrelation: optimizer turns subquery into Hash Semi-Join → O(N + M)
  • Decorrelation NOT possible with LIMIT, OFFSET, aggregates with outer dependency
  • Memoize (PG 14+): caches results → Hit rate > 80% = great
  • EXISTS = Semi-Join: stops after first match, no duplicates
  • Scalar subquery in SELECT → JOIN + GROUP BY is better
  • UPDATE FROM / DELETE USING → Hash Join instead of N subqueries
  • Window functions: O(N log N) instead of O(N × M) for per-row aggregation

Frequent follow-up questions:

  • "How to find SubPlan in the plan?" → Look for SubPlan in EXPLAIN (executes per row)
  • "When is Memoize useless?" → All keys are unique → 0% Hits
  • "Why is EXISTS better than JOIN for existence checks?" → No Join Amplification
  • "How to speed up correlated UPDATE?" → UPDATE FROM + CTE

Red flags (DO NOT say):

  • ❌ "Correlated subquery is fine for large tables" (SubPlan = disaster!)
  • ❌ "SELECT * in EXISTS is slower than SELECT 1" (optimizer ignores SELECT list)
  • ❌ "Memoize will always save it" (needs repeating keys)
  • ❌ "LATERAL JOIN is always good" (Nested Loop → only for small outer)

Related topics:

  • [[What is better JOIN or subquery]] → Unnesting, Decorrelation
  • [[What types of JOIN exist]] → Semi-Join, Hash Join
  • [[What do window functions do]] → alternative for per-row aggregation
  • [[What is an explain plan]] → finding SubPlan in the plan