Question 9 · Section 1

What is Better: JOIN or Subquery?

PostgreSQL can Unnest (flatten) — transforms subqueries into JOINs at the execution plan level. This means the subquery you wrote is physically executed as a table join, which i...

Language versions: English Russian Ukrainian

🟢 Junior Level

There is no definitive answer — it depends on the situation. In most modern databases (including PostgreSQL), the optimizer often transforms subqueries into JOINs, so performance can be identical.

Simple rule:

  • JOIN — when you need data from both tables
  • Subquery (EXISTS/IN) — when you only need to check existence

Example:

-- JOIN: get users AND their orders
SELECT u.name, o.amount
FROM users u
JOIN orders o ON u.id = o.user_id;

-- EXISTS: check if user has orders
SELECT * FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.user_id = u.id
);

When to use what:

  • Need columns from both tables → JOIN
  • Only need filtering → EXISTS/IN
  • Complex logic with aggregation → subquery

🟡 Middle Level

How the Optimizer Handles Subqueries

PostgreSQL can Unnest (flatten) — transforms subqueries into JOINs at the execution plan level. This means the subquery you wrote is physically executed as a table join, which is usually faster.

Often, Semi-Join is used — a special type of join: for each row in the left table, it checks if there is at least one match in the right table, and returns the left table row ONCE (without duplicates). There is no direct SQL syntax for Semi-Join — it occurs inside the planner.

-- Subquery in WHERE
SELECT * FROM users
WHERE id IN (SELECT user_id FROM orders WHERE amount > 100);

-- Optimizer transforms into:
SELECT * FROM users
INNER JOIN (SELECT user_id FROM orders WHERE amount > 100) o
ON users.id = o.user_id;
-- Plan: Hash Semi-Join (very efficient)

Semi-Join vs JOIN

Problem with JOIN: Join Amplification

-- ❌ If a user has 10 orders, they appear 10 times!
SELECT DISTINCT u.* FROM users u
JOIN orders o ON u.id = o.user_id;
-- Then DISTINCT removes duplicates (extra sort/hash)

-- ✅ EXISTS uses Semi-Join (no duplicates!)
SELECT * FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.user_id = u.id
);
-- Stops after first match → faster
-- EXISTS uses Semi-Join: as soon as the FIRST match for the left table row
-- is found, scanning the right table for that row stops
-- (short-circuit evaluation). No need to build the full join result
-- and then do DISTINCT to remove duplicates.

Optimization Fences

Some constructs prevent the optimizer from flattening the subquery:

Fence What happens
LIMIT / OFFSET Subquery executes separately
DISTINCT Cannot be pulled into JOIN
GROUP BY / Aggregates Subquery is materialized
UNION / INTERSECT Separate node in plan
-- ❌ LIMIT creates a fence
SELECT * FROM users u
WHERE id IN (SELECT user_id FROM orders ORDER BY amount DESC LIMIT 10);
-- Plan: SubPlan (executed for each row!)

-- ✅ Replace with JOIN LATERAL
SELECT u.*, o.amount
FROM users u
LEFT JOIN LATERAL (
    SELECT amount FROM orders
    WHERE o.user_id = u.id
    ORDER BY amount DESC
    LIMIT 10
) o ON true;

Scalar Subqueries in SELECT

-- ❌ BAD: subquery executes for each row
SELECT u.name,
       (SELECT SUM(amount) FROM orders o WHERE o.user_id = u.id) as total
FROM users u;
-- Plan: SubPlan (1,000,000 subqueries for 1,000,000 users!)

-- ✅ GOOD: JOIN + GROUP BY
SELECT u.name, SUM(o.amount) as total
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.name;
-- Plan: Hash Join + HashAggregate (much faster!)

CTE (WITH) and Materialization

PostgreSQL 12+: CTEs are inlined by default:

Before PG 12 CTEs were ALWAYS materialized — this was an optimization fence. The subquery inside the CTE was computed once, and the result was saved to a temporary structure. This could be a plus (avoiding repeated computation) and a minus (the optimizer couldn’t “look inside” the CTE and apply filters from the outer query).

-- PG 12+: this CTE will be inlined into the main query
WITH active_users AS (
    SELECT * FROM users WHERE status = 'ACTIVE'
)
SELECT * FROM active_users u
JOIN orders o ON u.id = o.user_id;
-- Plan: as if you wrote JOIN directly

Controlling materialization:

-- Force materialization (compute CTE once)
WITH c AS MATERIALIZED (
    SELECT user_id, SUM(amount) as total FROM orders GROUP BY user_id
)
SELECT * FROM users u JOIN c ON u.id = c.user_id;

-- Force inlining
WITH c AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE amount > 100
)
SELECT * FROM c JOIN users u ON c.user_id = u.id;

Common Mistakes

  1. Scalar subquery in SELECT on large data
    -- ❌ Performance disaster
    SELECT u.name, (SELECT COUNT(*) FROM orders WHERE user_id = u.id)
    FROM users u;  -- 1M rows → 1M subqueries!
    
  2. NOT IN with NULL
    -- ❌ Returns empty result if there is a NULL!
    SELECT * FROM users
    WHERE id NOT IN (SELECT user_id FROM orders);
    
    -- ✅ Use NOT EXISTS
    SELECT * FROM users u
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);
    
  3. Redundant DISTINCT with JOIN
    -- ❌ DISTINCT after JOIN — sign of Join Amplification
    SELECT DISTINCT u.* FROM users u
    JOIN orders o ON u.id = o.user_id;
    
    -- ✅ EXISTS is more efficient
    SELECT * FROM users u
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);
    

Practical Comparison

Situation Best Tool Why
Data from both tables JOIN Maximum execution strategies
Existence check EXISTS Hash Semi-Join, no duplicates
Dependent selection (top-N) LATERAL Uses indexes inside
Aggregation per row JOIN + GROUP BY One pass instead of N subqueries
Complex logic with LIMIT LATERAL Planner sees indexes

🔴 Senior Level

Unnesting and Decorrelation

How the optimizer “flattens” subqueries:

-- Original query
SELECT * FROM users u
WHERE u.id IN (
    SELECT o.user_id FROM orders o
    WHERE o.amount > (SELECT AVG(amount) FROM orders)
);

-- Transformation steps:
-- 1. Scalar subquery AVG → InitPlan (computed once)
-- 2. IN subquery → Hash Semi-Join
-- 3. Final plan:
--    InitPlan: AVG(amount)
--    → Hash Semi-Join (users ↔ orders)

Decorrelation:

-- Correlated subquery (depends on outer row)
SELECT * FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.amount > 100
);

-- After decorrelation:
-- Hash Semi-Join:
--   1. Build hash from orders WHERE amount > 100
--   2. Scan users, check presence in hash
-- Complexity: O(N + M) instead of O(N × M)!

SubPlan vs InitPlan vs Param

SubPlan: Executed for each row of the outer query

EXPLAIN SELECT u.name,
       (SELECT SUM(amount) FROM orders o WHERE o.user_id = u.id)
FROM users u;

-- Plan:
-- Seq Scan on users
--   SubPlan 1
--     -> Aggregate
--        -> Index Scan on orders (user_id = u.id)
-- ⚠️ Executes 1,000,000 times!

InitPlan: Executed once before the main query

EXPLAIN SELECT * FROM users
WHERE amount > (SELECT AVG(amount) FROM orders);

-- Plan:
-- InitPlan 1
--   -> Aggregate (AVG)
--      -> Seq Scan on orders
-- → Seq Scan on users (with filter by InitPlan result)
-- ✅ Executes 1 time!

Memoize Node (PostgreSQL 14+)

Memoize — a node in the execution plan that caches subquery results. If the subquery is called with the same parameters again, the result is taken from the cache instead of re-executing. The cache is memory-limited (parameter memoize_cache_size, default 10MB per Memoize node).

Revolutionary optimization for correlated subqueries:

-- Query with LATERAL (forced Nested Loop)
SELECT u.name, o.amount
FROM users u
LEFT JOIN LATERAL (
    SELECT amount FROM orders o
    WHERE o.user_id = u.id
    ORDER BY created_at DESC
    LIMIT 1
) o ON true;

-- PostgreSQL 14+: may insert Memoize
-- Plan:
-- Nested Loop
--   → Seq Scan on users
--   → Memoize
--       → Limit
--          → Index Scan Backward on orders

-- Memoize caches results for each user_id
-- If user_id repeats → taken from cache!
-- 🚀 10-100x speedup on data with repeats

Monitoring Memoize:

EXPLAIN (ANALYZE, BUFFERS)
SELECT ... LATERAL ...;

-- Look for in plan:
-- Memoize  (cost=... rows=... actual rows=...)
--   Hits: 500000  ← taken from cache
--   Misses: 10000 ← recomputed
--   Evictions: 500  ← evicted from cache

LATERAL JOIN: When and Why

Lateral chooses Nested Loop:

LATERAL almost always → Nested Loop
→ Great if outer table is small (< 10,000 rows)
→ Disaster if outer table is large (> 1,000,000 rows)

Alternatives for large data:

-- ❌ LATERAL on 1M users → slow
SELECT u.name, last_order.amount
FROM users u
LEFT JOIN LATERAL (
    SELECT amount FROM orders o
    WHERE o.user_id = u.id
    ORDER BY created_at DESC
    LIMIT 1
) last_order ON true;

-- ✅ Window Function → Hash Join
SELECT name, amount
FROM (
    SELECT u.name, o.amount,
           ROW_NUMBER() OVER(PARTITION BY u.id ORDER BY o.created_at DESC) as rn
    FROM users u
    INNER JOIN orders o ON u.id = o.user_id
) sub
WHERE rn = 1;

-- ✅ DISTINCT ON → Sort + Unique
SELECT DISTINCT ON (u.id) u.name, o.amount
FROM users u
INNER JOIN orders o ON u.id = o.user_id
ORDER BY u.id, o.created_at DESC;

CTE: MATERIALIZED vs NOT MATERIALIZED

When to use MATERIALIZED:

-- CTE is used multiple times → materialization is beneficial
WITH expensive_calc AS MATERIALIZED (
    SELECT user_id, SUM(amount) * 1.2 as total
    FROM orders
    GROUP BY user_id
)
SELECT * FROM users u JOIN expensive_calc ec ON u.id = ec.user_id
UNION ALL
SELECT * FROM archived_users au JOIN expensive_calc ec ON au.id = ec.user_id;
-- CTE computed 1 time, used 2 times

When to use NOT MATERIALIZED:

-- CTE is used 1 time + has filtering → inlining is better
WITH filtered_orders AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE amount > 1000 AND status = 'COMPLETED'
)
SELECT * FROM users u
JOIN filtered_orders fo ON u.id = fo.user_id
WHERE u.created_at > '2024-01-01';
-- Allows optimizer to "push" filter u.created_at into CTE

Edge Cases

  1. Volatile functions in subqueries
    -- NOW() is stable within a query
    SELECT * FROM users WHERE created_at < NOW();
    
    -- RANDOM() is computed every time!
    SELECT * FROM users WHERE id = (SELECT FLOOR(RANDOM() * 1000000));
    -- Returns DIFFERENT values for each row!
    
  2. Subquery in UPDATE/DELETE
    -- ❌ Correlated subquery in UPDATE → slow
    UPDATE orders o SET status = 'VIP'
    WHERE user_id IN (SELECT id FROM users WHERE vip = true);
    
    -- ✅ FROM syntax → JOIN
    UPDATE orders o SET status = 'VIP'
    FROM users u
    WHERE o.user_id = u.id AND u.vip = true;
    -- Plan: Hash Join (much faster!)
    
  3. Empty subquery
    -- If subquery returned 0 rows:
    -- IN () → FALSE
    -- NOT IN () → TRUE (careful with NULL!)
    -- EXISTS () → FALSE
    -- NOT EXISTS () → TRUE
    

Performance Comparison

Approach Complexity When to use
JOIN O(N + M) with Hash Join Data from both tables
EXISTS/IN O(N + M) with Semi-Join Existence check
Scalar Subquery O(N × M) Avoid on large data
LATERAL O(N × log M) Dependent selections, small outer
CTE MATERIALIZED O(M) + O(N) CTE used > 1 time
CTE NOT MATERIALIZED Like JOIN CTE used 1 time

Production Experience

Real scenario #1: SubPlan kills API

  • REST API: computing balance for each of 10,000 users
  • Query: scalar subquery in SELECT
  • Time: 45 seconds (10,000 subqueries!)
  • Solution:
    -- Replaced with JOIN + GROUP BY
    SELECT u.*, COALESCE(SUM(o.amount), 0) as balance
    FROM users u
    LEFT JOIN orders o ON u.id = o.user_id
    GROUP BY u.id;
    
  • Result: 200ms (225x speedup)

Real scenario #2: Memoize saves from LATERAL

  • Report: top-5 orders per client (100,000 clients)
  • LATERAL without Memoize: > 5 minutes
  • After PG 14 with Memoize: 3 seconds
  • Hits/Misses ratio: 95%/5% (many clients have same orders in cache)

Real scenario #3: CTE materialization

  • Analytics: CTE with aggregation used in 4 UNION ALL
  • Without MATERIALIZED: CTE computed 4 times → 2 minutes
  • With MATERIALIZED: computed 1 time → 30 seconds
  • But: if CTE is small, MATERIALIZED may slow down (overhead)

Monitoring

-- 1. Check SubPlan in plans
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.name, (SELECT COUNT(*) FROM orders WHERE user_id = u.id)
FROM users u;
-- Look for "SubPlan" → potential issue

-- 2. Check Memoize efficiency
EXPLAIN (ANALYZE)
SELECT ... LATERAL ...;
-- Memoize (Hits: X, Misses: Y)
-- Hit rate = Hits / (Hits + Misses)
-- > 80% → great, < 50% → Memoize doesn't help

-- 3. CTE materialization
EXPLAIN (ANALYZE)
WITH c AS MATERIALIZED (...)
SELECT ...;
-- CTE Scan on c → materialized
-- Look for "CTE c" in subplans → inlined

-- 4. Find scalar subqueries in pg_stat_statements
SELECT query, mean_exec_time
FROM pg_stat_statements
WHERE query ~ 'SELECT.*\(SELECT.*FROM.*\).*FROM'
ORDER BY mean_exec_time DESC;

Best Practices

  1. EXISTS > DISTINCT JOIN for existence checks
  2. NOT EXISTS > NOT IN (avoid NULL trap)
  3. Avoid scalar subqueries in SELECT on large data
  4. LATERAL only for small outer selections (< 10,000 rows)
  5. Manage CTEs through MATERIALIZED / NOT MATERIALIZED consciously
  6. UPDATE/DELETE use FROM/USING instead of subqueries
  7. Memoize (PG 14+) can save LATERAL — check Hit rate
  8. Always check the plan with EXPLAIN (ANALYZE)

Summary for Senior

  • Optimizer supports Unnesting and Decorrelation of subqueries
  • SubPlan per row → disaster; replace with JOIN
  • Memoize (PG 14+) caches LATERAL results → check Hits/Misses
  • CTEs in PG 12+ are inlined by default; manage via MATERIALIZED
  • EXISTS = Semi-Join (no Join Amplification)
  • LATERAL ≈ Nested Loop → only for small outer selections
  • UPDATE/DELETE via FROM/USING instead of correlated subqueries
  • Always check the plan: look for SubPlan, Batches, CTE Scan

🎯 Interview Cheat Sheet

Must know:

  • PostgreSQL supports Unnesting — transforms subqueries into JOINs (Hash Semi-Join)
  • SubPlan = executed for each row → O(N × M) → DISASTER
  • InitPlan = executed once → O(M) + O(N) → great
  • EXISTS = Semi-Join: no Join Amplification (duplicates from JOIN), stops after 1st match
  • NOT EXISTS > NOT IN: NOT IN with NULL returns empty result
  • Memoize (PG 14+): caches LATERAL results → Hit rate > 80% = great
  • CTEs in PG 12+: inlined by default (NOT MATERIALIZED), manage explicitly
  • Scalar subquery in SELECT = N subqueries → JOIN + GROUP BY is better
  • UPDATE FROM / DELETE USING faster than correlated subqueries
  • Always check EXPLAIN (ANALYZE)

Frequent follow-up questions:

  • “Why is EXISTS better than DISTINCT JOIN?” → Semi-Join doesn’t create duplicates, no sort for DISTINCT
  • “When doesn’t Memoize help?” → All keys are unique → 0% Hits
  • “CTE MATERIALIZED or NOT MATERIALIZED?” → MATERIALIZED if used > 1 time
  • “Why is scalar subquery in SELECT bad?” → N subqueries for N rows

Red flags (DO NOT say):

  • ❌ “Subquery is always slower than JOIN” (decorrelated EXISTS can be faster)
  • ❌ “NOT IN is safe” (NULL returns empty result!)
  • ❌ “CTE is always materialized” (PG 12+ inlines by default)
  • ❌ “Memoize always helps” (needs repeating keys)

Related topics:

  • [[What types of JOIN exist]] → Semi-Join, Anti-Join
  • [[What is a correlated subquery]] → SubPlan vs InitPlan, Memoize
  • [[What is the difference between INNER JOIN and LEFT JOIN]] → JOIN Amplification