What JOIN types exist?
🟢 Junior Level
JOIN is an operation that combines rows from two or more tables based on a join condition.
Why: databases are normalized — data is stored in separate tables (users separately, orders separately) to avoid duplication and write anomalies. JOIN allows you to “glue” them back together when reading, when you need to see related information.
Simple analogy: Imagine you have two lists:
- A list of students with their IDs
- A list of grades with student IDs
JOIN lets you connect these lists and see: “Ivanov (ID=1) — grade 5, Petrov (ID=2) — grade 4”.
Main JOIN types:
| Type | Description | Example |
|---|---|---|
| INNER JOIN | Only rows with matches in both tables | “Show orders and the customers who placed them” |
| LEFT JOIN | All rows from the left table + matches from the right | “Show all customers, even those without orders” |
| RIGHT JOIN | All rows from the right + matches from the left | Rarely used |
| FULL JOIN | All rows from both tables | “Show all customers and all orders, even without matches” |
| CROSS JOIN | All with all (Cartesian product) | “Generate all combinations” |
Example:
-- INNER JOIN: only orders with customers
SELECT u.name, o.amount
FROM users u
INNER JOIN orders o ON u.id = o.user_id;
-- LEFT JOIN: all users, even without orders
SELECT u.name, o.amount
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;
-- If no orders → o.amount = NULL
When to use:
- INNER JOIN — when you only need related data
- LEFT JOIN — when you need all records from the main table
- CROSS JOIN — rarely, for generating combinations
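The table above also mentions FULL and CROSS JOIN; a minimal sketch on the same assumed users/orders schema:

```sql
-- FULL JOIN: all users and all orders; NULLs where there is no match
SELECT u.name, o.amount
FROM users u
FULL JOIN orders o ON u.id = o.user_id;

-- CROSS JOIN: every user paired with every order (Cartesian product)
SELECT u.name, o.amount
FROM users u
CROSS JOIN orders o;
```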
🟡 Middle Level
Physical execution strategies (Execution Plan)
PostgreSQL uses three algorithms to perform JOINs:
1. Nested Loop Join
For each row from table A:
→ Look for matches in table B (via index)
- When used: One table is small, the other has an index
- Complexity: O(N × log M) with an index
- Example: 100 users → for each, find orders by index
2. Hash Join
1. Build a hash table from the smaller table
2. Scan the larger table, look for matches in the hash
- When used: Large tables, no suitable index
- Requires: work_mem to store the hash table (work_mem is a PostgreSQL parameter that determines how much memory a single operation can allocate before spilling to disk; default 4MB)
- Complexity: O(N + M)
3. Merge Join
1. Sort both tables by the JOIN key
2. Merge them like two sorted arrays
- When used: Data is already sorted (by index)
- Complexity: O(N log N + M log M) for sorting, then O(N + M)
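A sketch of how a Merge Join typically surfaces in a plan, assuming btree indexes exist on both join keys (the index names below are illustrative, and the exact plan depends on statistics):

```sql
-- Both sides can be read in key order via their btree indexes,
-- so the planner may choose a Merge Join without an explicit Sort step
EXPLAIN
SELECT u.name, o.amount
FROM users u
JOIN orders o ON u.id = o.user_id
ORDER BY u.id;

-- Possible plan fragment:
--   Merge Join
--     -> Index Scan using users_pkey on users u
--     -> Index Scan using orders_user_id_idx on orders o
```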
Advanced logical types
Semi-Join (EXISTS):
-- Show users who have orders
SELECT * FROM users u
WHERE EXISTS (
SELECT 1 FROM orders o WHERE o.user_id = u.id
);
-- Plan: Hash Semi-Join (stops after the first match)
-- Semi-Join is more efficient than a regular JOIN: as soon as the FIRST match
-- is found for a left table row, scanning the right table stops.
-- A regular JOIN returns ALL matches (if a user has 100 orders — 100 rows),
-- and then DISTINCT collapses them. Semi-Join returns the user once immediately.
Anti-Join (NOT EXISTS):
-- Show users WITHOUT orders
SELECT * FROM users u
WHERE NOT EXISTS (
SELECT 1 FROM orders o WHERE o.user_id = u.id
);
-- Plan: Hash Anti-Join
Lateral Join:
-- For each user, find the 3 most recent orders
SELECT u.name, o.amount
FROM users u
LEFT JOIN LATERAL (
SELECT amount FROM orders o
WHERE o.user_id = u.id
ORDER BY created_at DESC
LIMIT 3
) o ON true;
-- Like "flatmap" in programming: for each row from the left table
-- a subquery is executed, and its results are "flattened" into the final table.
-- If the subquery returns 3 orders — 3 rows appear in the result for that user.
Common mistakes
- Using JOIN instead of EXISTS
-- ❌ May duplicate rows if a user has many orders
SELECT DISTINCT u.* FROM users u
JOIN orders o ON u.id = o.user_id;

-- ✅ EXISTS is more efficient (stops after the first match)
SELECT * FROM users u
WHERE EXISTS (
  SELECT 1 FROM orders o WHERE o.user_id = u.id
);
- NOT IN with NULL
-- ❌ If orders has NULL → returns empty result!
SELECT * FROM users
WHERE id NOT IN (SELECT user_id FROM orders);

-- ✅ Always use NOT EXISTS
SELECT * FROM users u
WHERE NOT EXISTS (
  SELECT 1 FROM orders o WHERE o.user_id = u.id
);
- Forgot the join condition → Cartesian product
-- ❌ Forgot the WHERE in a comma join → Cartesian product!
-- (In PostgreSQL, JOIN without ON is a syntax error;
--  the old comma-separated FROM list is the real trap)
SELECT * FROM users u, orders o;
-- 1000 users × 10000 orders = 10,000,000 rows!
Hash Join and work_mem
-- If the hash table doesn't fit in work_mem → spill to disk
EXPLAIN ANALYZE SELECT * FROM orders o
JOIN users u ON o.user_id = u.id;
-- If you see: "Batches: 4" → hash didn't fit in memory
-- Solution: increase work_mem for the session
SET work_mem = '256MB';
Practical comparison
| JOIN | When to use | Strategy |
|---|---|---|
| INNER + indexes | Small + large with index | Nested Loop |
| INNER without indexes | Two large tables | Hash Join |
| INNER + sorted | Data is sorted | Merge Join |
| LEFT JOIN | Need to preserve all from left | Depends on data |
| EXISTS | Existence check | Semi-Join |
| LATERAL | Dependent subqueries | Nested Loop |
🔴 Senior Level
Logical vs Physical JOINs
Logical JOINs — what you write in SQL:
- INNER, LEFT, RIGHT, FULL, CROSS
Physical JOINs — how PostgreSQL actually executes them:
- Nested Loop, Hash, Merge, Semi, Anti
Important: The optimizer can transform logical JOINs into physical ones in different ways!
Nested Loop Join
Mechanism:
for row_a in table_a: -- Outer (Driving)
for row_b in table_b: -- Inner (Probed)
if match(row_a, row_b):
emit(row_a, row_b)
Optimal conditions:
- Outer table: < 10,000 rows
- Inner table: index on the JOIN key
- Selectivity: high
Complexity:
- Without index: O(N × M) — catastrophe on large data
- With index: O(N × log M) — excellent
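To see which regime you are in, check the inner side of the Nested Loop in the plan (a sketch; the index name is an assumption):

```sql
-- With an index on orders.user_id the inner side is an Index Scan
-- (O(N × log M)); without it, a Seq Scan per outer row (O(N × M))
CREATE INDEX IF NOT EXISTS orders_user_id_idx ON orders (user_id);

EXPLAIN
SELECT u.name, o.amount
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.id < 100;

-- Possible plan fragment:
--   Nested Loop
--     -> Index Scan using users_pkey on users u
--     -> Index Scan using orders_user_id_idx on orders o
```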
Hash Join
Mechanism:
-- Phase 1: Build
for row in smaller_table:
hash_table[hash(join_key)] = row
-- Phase 2: Probe
for row in larger_table:
if hash(join_key) in hash_table:
emit(row, hash_table[hash(join_key)])
Optimal conditions:
- Both tables are large
- No suitable index
- Equality in the JOIN condition (=)
Batches (when it doesn’t fit in work_mem):
If hash table > work_mem:
1. Split data into batches
2. Each batch is processed separately
3. Batches are spilled to disk (temp_files)
→ Severe slowdown!
Monitoring:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM a JOIN b ON a.id = b.a_id;
-- Hash Join (cost=... rows=...)
-- Batches: 8 ← bad!
-- Memory Usage: 256MB
-- Disk Spill: 2GB ← very bad!
Merge Join
Mechanism:
sort(table_a by join_key)
sort(table_b by join_key)
ptr_a = 0, ptr_b = 0
while ptr_a < len(a) and ptr_b < len(b):
if a[ptr_a].key == b[ptr_b].key:
emit(a[ptr_a], b[ptr_b])
ptr_a++, ptr_b++
elif a[ptr_a].key < b[ptr_b].key:
ptr_a++
else:
ptr_b++
Optimal conditions:
- Data is already sorted (by index)
- Large volumes of data
- JOIN condition: equality (=); PostgreSQL chooses a Merge Join only for mergejoinable (btree equality) operators
GEQO (Genetic Query Optimizer)
When a query has geqo_threshold or more FROM items (default 12):
- Full enumeration of Join Reordering options becomes too expensive
- PostgreSQL switches to a genetic algorithm
- May choose a suboptimal plan!
Solution:
-- Increase the threshold (considers more options, but planning takes longer)
SET geqo_threshold = 20;
-- Or disable GEQO (careful!)
SET geqo = off;
Join Reordering and OUTER JOIN
INNER JOIN: Associative and commutative
(A JOIN B) JOIN C = A JOIN (B JOIN C) = (A JOIN C) JOIN B
The planner can reorder tables to find the optimal plan.
LEFT JOIN: NOT associative
(A LEFT JOIN B) LEFT JOIN C ≠ A LEFT JOIN (B LEFT JOIN C)
The planner is restricted in reordering → may lead to a suboptimal plan.
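A minimal self-contained demonstration that grouping matters for LEFT JOIN (no tables needed; the effect shows up when the second join's condition is not null-rejecting, here ON true):

```sql
-- Grouping 1: (a LEFT JOIN b) LEFT JOIN c
WITH a(x) AS (VALUES (1)),
     b(x) AS (SELECT 1 WHERE false),   -- empty table
     c(x) AS (VALUES (10))
SELECT * FROM (a LEFT JOIN b ON a.x = b.x) LEFT JOIN c ON true;
-- → (1, NULL, 10)

-- Grouping 2: a LEFT JOIN (b LEFT JOIN c)
WITH a(x) AS (VALUES (1)),
     b(x) AS (SELECT 1 WHERE false),
     c(x) AS (VALUES (10))
SELECT * FROM a LEFT JOIN (b LEFT JOIN c ON true) ON a.x = b.x;
-- → (1, NULL, NULL)
```

Because b is empty, the inner grouping loses the row from c, so the two orders produce different results; this is exactly why the planner cannot freely reorder OUTER JOINs.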
Outer Join Removal (PG 10+)
Optimization: PostgreSQL can remove unnecessary LEFT JOINs:
-- LEFT JOIN will be removed if:
-- 1. No columns from the right table are selected
-- 2. The key in the right table is UNIQUE/PRIMARY KEY
SELECT u.name FROM users u
LEFT JOIN orders o ON u.id = o.user_id;
-- orders removed from the plan! (o is not used in SELECT)
Monitoring via EXPLAIN:
EXPLAIN SELECT u.name FROM users u
LEFT JOIN orders o ON u.id = o.user_id;
-- Seq Scan on users ← orders is NOT in the plan!
-- Outer JOIN was removed by the optimizer (Outer Join Removal)
Edge Cases
- Join Amplification
-- ❌ If a user has 10 orders, they appear 10 times!
SELECT u.name, o.amount
FROM users u
JOIN orders o ON u.id = o.user_id;

-- ✅ Group or use EXISTS
SELECT u.name, SUM(o.amount)
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.name;
- Join Collapse Limits
-- Parameters that affect JOIN optimization
SHOW from_collapse_limit;  -- Default: 8
SHOW join_collapse_limit;  -- Default: 8
-- If > 8 subqueries/JOINs → planner won't "flatten" them
-- May lead to a suboptimal plan
- Lateral Join and Nested Loop
-- Lateral almost always chooses Nested Loop:
-- great for small outer sets, catastrophic for large ones!

-- ✅ users = 100 → fast
SELECT * FROM users u
LEFT JOIN LATERAL (
  SELECT * FROM orders o
  WHERE o.user_id = u.id
  LIMIT 3
) o ON true;
-- ❌ users = 1,000,000 → slow
Performance Tuning
work_mem for Hash Join:
-- Check current value
SHOW work_mem; -- Default: 4MB
-- For specific queries (careful!)
SET work_mem = '256MB';
SELECT * FROM large_table a JOIN large_table b ON a.id = b.a_id;
RESET work_mem;
-- ⚠️ work_mem is allocated PER OPERATION, not per session!
-- 10 parallel Hash Joins × 256MB = 2.5GB RAM!
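A way to scope the override more tightly than a session-wide SET is SET LOCAL inside a transaction, which reverts automatically (a sketch):

```sql
BEGIN;
SET LOCAL work_mem = '256MB';  -- applies only until COMMIT/ROLLBACK
SELECT * FROM large_table a JOIN large_table b ON a.id = b.a_id;
COMMIT;                         -- work_mem is back to its previous value
```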
enable_* flags (for debugging):
-- Temporarily disable a strategy for testing
SET enable_hashjoin = off; -- Test Nested Loop
SET enable_mergejoin = off; -- Test Hash Join
SET enable_nestloop = off; -- Test Hash/Merge
-- ⚠️ DEBUGGING ONLY! Not in production!
EXPLAIN ANALYZE SELECT ...;
-- Reset
RESET enable_hashjoin;
Production Experience
Real scenario #1: Hash Join Disk Spill
- Analytics platform: JOIN tables with 50M and 100M rows
- Problem: Query > 5 minutes
- EXPLAIN ANALYZE:
Hash Join, Batches: 32, Disk Spill: 8GB
- Cause: work_mem = 4MB (default)
- Solution: SET work_mem = '512MB' for the session
- Result: Query < 30 seconds (10x speedup)
Real scenario #2: GEQO chooses a bad plan
- Report: JOIN of 15 tables
- Problem: Query 2 minutes instead of expected 5 seconds
- Cause: GEQO (genetic algorithm) chose a bad JOIN order
- Solution:
- Split the query into 2 CTEs
- Reduced the number of JOINs in a single query to 8
- Result: Query < 3 seconds
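A sketch of the kind of split described above (table names are made up). AS MATERIALIZED (PG 12+) keeps the CTE as an optimization fence, so each part is planned with few enough joins to avoid GEQO:

```sql
-- Step 1: pre-join a cohesive subset of the tables and materialize it
WITH order_facts AS MATERIALIZED (
  SELECT o.id, o.user_id, o.amount, p.category
  FROM orders o
  JOIN order_items oi ON oi.order_id = o.id
  JOIN products p     ON p.id = oi.product_id
)
-- Step 2: join the materialized CTE with the remaining tables
SELECT u.name, f.category, SUM(f.amount)
FROM order_facts f
JOIN users u ON u.id = f.user_id
GROUP BY u.name, f.category;
```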
Real scenario #3: Lateral Join on large data
- API: For each of 1M users, find the last order
- LATERAL query > 10 minutes
- Solution: Replaced with Window Function
-- ✅ Window Function instead of LATERAL
SELECT * FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
  FROM orders
) sub
WHERE rn = 1;
- Result: Query < 5 seconds
Monitoring
-- 1. Check JOIN strategies used
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
WHERE query LIKE '%JOIN%'
ORDER BY mean_exec_time DESC
LIMIT 10;
-- 2. Analyze a specific query
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.created_at > '2024-01-01';
-- 3. Check work_mem usage
-- In PostgreSQL logs when insufficient:
-- "temporary file: path ..., size ..."
-- 4. Check GEQO
SELECT query FROM pg_stat_activity
WHERE query LIKE '%JOIN%JOIN%JOIN%JOIN%JOIN%JOIN%JOIN%JOIN%JOIN%JOIN%JOIN%JOIN%JOIN%';
-- > 12 JOINs → possibly GEQO
Best Practices
- EXISTS instead of DISTINCT JOIN for existence checks
- NOT EXISTS instead of NOT IN (avoid the NULL trap)
- LATERAL only for small outer sets (< 10,000 rows)
- Increase work_mem for specific queries with Hash Join
- Watch for Batches in EXPLAIN — if > 1, increase work_mem
- Avoid > 12 JOINs in a single query (GEQO)
- Indexes on JOIN fields are critical for Nested Loop
- Split complex queries via CTEs for better control
Summary for Senior
- 3 physical strategies: Nested Loop (small + index), Hash (large without index), Merge (sorted)
- GEQO at > 12 tables → may choose a bad plan
- Batches > 1 in Hash Join → increase work_mem
- Outer Join Removal (PG 10+) can remove unnecessary JOINs
- LATERAL ≈ Nested Loop → only for small outer sets
- NOT EXISTS > NOT IN (NULL safety)
- EXISTS > DISTINCT JOIN (Semi-Join is more efficient)
- Always check the plan via EXPLAIN (ANALYZE, BUFFERS)
🎯 Interview Cheat Sheet
Must know:
- 5 logical JOINs: INNER, LEFT, RIGHT, FULL, CROSS
- 3 physical strategies: Nested Loop, Hash, Merge
- Nested Loop: O(N × log M) with index, great for small + index
- Hash Join: O(N + M), requires work_mem, Batches > 1 → disk spill
- Merge Join: requires sorted data, O(N + M) after sorting
- GEQO at > 12 tables → genetic algorithm → may choose a bad plan
- LATERAL ≈ Nested Loop → only for < 10,000 outer rows
- Semi-Join (EXISTS) is more efficient than DISTINCT JOIN (no duplicates)
Common follow-up questions:
- “Why is Hash Join slow?” → Batches > 1 → disk spill → increase work_mem
- “When is Nested Loop better than Hash Join?” → Outer table is small + index on inner
- “What is GEQO and why is it bad?” → Genetic algorithm at > 12 tables → suboptimal plan
- “When is LATERAL JOIN justified?” → Small outer table + index on inner
Red flags (DO NOT say):
- ❌ “JOIN is always faster than a subquery” (depends on decorrelation)
- ❌ “work_mem doesn’t matter for JOIN” (Hash Join requires work_mem!)
- ❌ “LATERAL JOIN will solve everything” (only for small outer sets!)
- ❌ “NOT IN is safe” (NULL returns an empty result!)
Related topics:
- [[Difference between INNER JOIN and LEFT JOIN]] → logical vs physical JOINs
- [[What is better JOIN or subquery]] → Semi-Join, decorrelation
- [[What are indexes for]] → indexes on JOIN fields
- [[What is a correlated subquery]] → EXISTS vs JOIN