Question 20 · Section 1

What is an EXPLAIN plan?



🟢 Junior Level

EXPLAIN — a PostgreSQL command that shows how the database plans to execute a query, without actually running it.

Simple analogy: Before a trip, GPS shows the route: which roads, how many turns, estimated time. EXPLAIN is a GPS for a query: shows which indexes, which tables, how long it will take.

Example:

-- Show query plan
EXPLAIN SELECT * FROM users WHERE email = 'test@example.com';

-- Result:
-- Index Scan using idx_users_email on users  (cost=0.42..8.44 rows=1 width=100)
--   Index Cond: (email = 'test@example.com'::text)

What the numbers mean:

  • cost=0.42..8.44 — cost: startup..total (to get all rows)
  • rows=1 — how many rows the database expects
  • width=100 — average row size in bytes

When to use:

  • Query runs slowly
  • You want to check if an index is being used
  • Before creating an index — will it help?
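One way to answer the last question before committing to an index: build it inside a transaction, look at the plan, and roll back. A minimal sketch (the index name is illustrative; note that plain CREATE INDEX blocks writes to the table while it runs, so do this on a copy or during a quiet window):

```sql
BEGIN;
CREATE INDEX idx_users_email_test ON users(email);
EXPLAIN SELECT * FROM users WHERE email = 'test@example.com';
-- If the plan now shows "Index Scan using idx_users_email_test",
-- the index would help
ROLLBACK;  -- the index is discarded, nothing persists
```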

🟡 Middle Level

EXPLAIN vs EXPLAIN ANALYZE

-- EXPLAIN: plan only, does NOT execute the query
EXPLAIN SELECT * FROM users WHERE email = 'test@example.com';

-- EXPLAIN ANALYZE: executes the query + shows actual time
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'test@example.com';

-- Result:
-- Index Scan using idx_users_email on users  (cost=0.42..8.44 rows=1 width=100)
--   (actual time=0.035..0.037 rows=1 loops=1)
--                     ↑ actual time in ms
-- Planning Time: 0.123 ms
-- Execution Time: 0.058 ms

Key flags

-- Full set for analysis
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ...;

-- Flags:
-- ANALYZE  → execute the query + actual time
-- BUFFERS  → show cache/disk reads
-- COSTS    → show cost (on by default)
-- ROWS     → show estimated row count
-- WIDTH    → show row size
-- FORMAT   → TEXT (default), JSON, YAML, XML
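FORMAT matters when the plan is consumed by tools rather than read by eye; plan visualizers generally expect JSON. A sketch:

```sql
-- Machine-readable plan for plan visualizers
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM users WHERE email = 'test@example.com';
-- Returns one JSON document with the same data as the text form:
-- "Node Type", "Actual Total Time", "Shared Hit Blocks", ...
```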

Reading a plan

EXPLAIN (ANALYZE, BUFFERS)
SELECT u.name, COUNT(o.id) as orders
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.city = 'Moscow'
GROUP BY u.name;

-- Plan:
-- HashAggregate  (cost=... rows=... actual time=15.2..15.3 rows=500 loops=1)
--   Buffers: shared hit=250 read=50
--   → Hash Left Join  (cost=... actual time=10.1..12.5 rows=2000 loops=1)
--        Buffers: shared hit=200 read=40
--        Hash Cond: (o.user_id = u.id)
--        → Seq Scan on orders o  (actual time=0.01..5.2 rows=10000 loops=1)
--             Buffers: shared hit=150
--        → Hash  (cost=... actual time=3.1..3.1 rows=500 loops=1)
--             Buffers: shared hit=50 read=40
--             → Seq Scan on users u  (actual time=0.02..2.8 rows=500 loops=1)
--                  Filter: (city = 'Moscow')
--                  Rows Removed by Filter: 999500
-- Planning Time: 0.5 ms
-- Execution Time: 16.0 ms

Read bottom-up:

  1. Seq Scan on users → filtered 500 out of 1M
  2. Hash → built hash table for JOIN
  3. Seq Scan on orders → read 10,000 orders
  4. Hash Left Join → joined
  5. HashAggregate → grouped

Scan types

  • Seq Scan: reads the entire table. Slow for large tables, fast for small ones
  • Index Scan: index lookup, then fetch the row. Fast for small selections
  • Index Only Scan: all needed data is in the index. Very fast
  • Bitmap Index Scan: index → bitmap → rows. Medium selections (100-10,000 rows)

Bitmap Index Scan: collects ALL matching TIDs from the index into a bitmap, then reads rows. More efficient than Index Scan for 100-10,000 rows from different locations — fewer random I/O operations.
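A sketch of what a bitmap plan looks like (table, index, and numbers are illustrative):

```sql
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'pending';

-- Bitmap Heap Scan on orders  (actual time=... rows=5000 loops=1)
--   Recheck Cond: (status = 'pending')
--   → Bitmap Index Scan on idx_orders_status  (actual time=... rows=5000 loops=1)
--        Index Cond: (status = 'pending')
--
-- The index pass collects TIDs into the bitmap; the heap pass then
-- reads the matching pages in physical order (fewer random reads).
```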

Common problems

  1. Seq Scan on a large table
    EXPLAIN SELECT * FROM users WHERE email = 'test@example.com';
    -- Seq Scan on users  ← bad!
    --   Filter: (email = ...)
    --   Rows Removed by Filter: 999999
    
    -- Solution: create an index
    CREATE INDEX idx_users_email ON users(email);
    
  2. Estimate doesn’t match reality
    -- rows=1000 actual rows=1
    -- Planner was off by 1000x!
    
    -- Cause: stale statistics
    -- Solution: ANALYZE users;
    
  3. Spill to disk
    EXPLAIN (ANALYZE) SELECT ... GROUP BY ...;
    -- HashAggregate (actual time=... rows=...)
    --   Disk: 5GB  ← bad! work_mem wasn't enough
    
    -- Solution: SET work_mem = '256MB';
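For problem #2, you can check how stale the statistics actually are before re-running ANALYZE. A sketch using the standard statistics view:

```sql
-- When were statistics last refreshed, and how many rows changed since?
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'users';

-- Old last_analyze + large n_mod_since_analyze → refresh manually:
ANALYZE users;
```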
    

Practical examples

-- Check index usage
EXPLAIN SELECT * FROM orders WHERE user_id = 123;
-- Index Scan using idx_orders_user_id → great!

-- Check JOIN
EXPLAIN SELECT * FROM users u JOIN orders o ON u.id = o.user_id;
-- Hash Join → normal for large tables
-- Nested Loop → great if there's an index

-- Check sorting
EXPLAIN SELECT * FROM orders ORDER BY created_at DESC LIMIT 10;
-- Limit → Index Scan Backward → great!
-- Limit → Sort → can be optimized with an index
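When the plan shows Limit → Sort, an index matching the ORDER BY usually removes the sort step. A hedged sketch (the index name is illustrative):

```sql
-- With this index the planner can walk the index backward
-- instead of sorting the whole table
CREATE INDEX idx_orders_created_at ON orders(created_at);

EXPLAIN SELECT * FROM orders ORDER BY created_at DESC LIMIT 10;
-- Expected shape: Limit → Index Scan Backward using idx_orders_created_at
```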

🔴 Senior Level

Anatomy of node Costs

Sort  (cost=150.10..155.10 rows=2000 width=40)
             ↑        ↑
          Startup   Total
           Cost      Cost

Startup Cost: Cost to get the first row

  • Critical for LIMIT 1, WHERE EXISTS
  • Planner may choose a plan with huge Total Cost but low Startup Cost

Total Cost: Cost to get all rows

  • Critical for full retrieval

Arbitrary units:

  • 1 unit ≈ reading one 8KB page from disk
  • cost=100 ≈ reading 100 pages
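These units are anchored by planner cost parameters, which can be inspected and tuned per server. A sketch (the defaults in the comments are PostgreSQL's shipped values; verify on your instance):

```sql
SHOW seq_page_cost;        -- default 1.0: sequential read of one 8KB page
SHOW random_page_cost;     -- default 4.0: random page read;
                           -- often lowered to ~1.1 on SSDs
SHOW cpu_tuple_cost;       -- default 0.01: processing one row
SHOW cpu_operator_cost;    -- default 0.0025: evaluating one operator/function
```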

BUFFERS: I/O analysis

EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

-- Buffers: shared hit=250 read=50 written=10
--               ↑       ↑         ↑
--            from cache  from disk  written

-- shared hit: 250 / (250+50) = 83% Cache Hit Rate
-- read=50: 50 × 8KB = 400KB read from disk
-- written=10: 10 × 8KB = 80KB written (temp files?)

Interpretation:

  • shared hit » shared read → great (everything in cache)
  • shared read » shared hit → problem (lots of Disk I/O)
  • written > 0 → spill to disk or temp file writing
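The same hit-vs-read ratio can be checked for the whole database, not just one query. A sketch using pg_stat_database:

```sql
-- Database-wide cache hit rate (healthy OLTP is typically > 99%)
SELECT datname,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2)
         AS cache_hit_pct
FROM pg_stat_database
WHERE datname = current_database();
```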

Generic vs Custom Plans

The Prepared Statements problem:

-- Custom Plan (knows the specific value)
EXPLAIN SELECT * FROM users WHERE id = 5;
-- Plan: Index Scan (perfect for id=5)

-- Generic Plan (universal for any $1)
PREPARE stmt(int) AS SELECT * FROM users WHERE id = $1;
EXPLAIN EXECUTE stmt(5);
-- Plan: Seq Scan (universal, "average" for all ids)

Why Generic Plan may be worse:

Skewed data:
  id=5 → 1 row (Index Scan is ideal)
  id=100 → 500,000 rows (Seq Scan is ideal)

Generic Plan: "average" → Seq Scan
→ For id=5, catastrophically slow!

The planner "averages" the cost: for id=5, Index Scan is needed (1 row),
for id=100 — Seq Scan (500,000 rows). Seq Scan is "safer" on average,
but for specific values with small selections — disaster.

Solution:

-- 1. plan_cache_mode
SET plan_cache_mode = 'force_custom_plan';  -- Always Custom
SET plan_cache_mode = 'force_generic_plan'; -- Always Generic
SET plan_cache_mode = 'auto';               -- Default (5 times Custom, then Generic)

-- 2. In JDBC/Hibernate:
-- Check: prepareThreshold=0 (disable Prepared Statements)
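The auto-mode switch can be observed directly: run a prepared statement repeatedly and watch the plan. A sketch (under plan_cache_mode = 'auto'; the exact switch point depends on the planner's cost comparison):

```sql
PREPARE stmt(int) AS SELECT * FROM users WHERE id = $1;

-- Executions 1-5: custom plans, the literal value appears in the plan
EXPLAIN EXECUTE stmt(5);
-- ...run EXECUTE stmt(5) a few more times...

-- From the sixth execution the planner may cache a generic plan:
-- the condition then shows (id = $1) instead of (id = 5)
EXPLAIN EXECUTE stmt(5);

DEALLOCATE stmt;  -- clean up the prepared statement
```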

JIT Compilation

JIT (Just-In-Time) — PostgreSQL compiles parts of the query into machine code via LLVM. As if the database “rewrote” part of the query in C.

Speeds up:

  1. Heavy expressions (a * b + c / d) on millions of rows
  2. Aggregates (SUM, AVG) with many groupings
  3. Tuple evaluation during Hash Join

Slows down: simple queries (SELECT * FROM t WHERE id = 1) — compilation takes longer than execution!

-- In EXPLAIN ANALYZE, a JIT block may appear
-- JIT:
--   Functions: 5
--   Options: Inlining, Optimization
--   Timing: Min 2.1 ms, Max 2.5 ms

-- Pros: speeds up complex expressions and aggregates
-- Cons: for short queries, compilation > execution!

-- Disable for OLTP
SET jit = off;

-- Check
SHOW jit;
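Instead of disabling JIT outright, you can raise its cost thresholds so it only fires on genuinely expensive queries. A sketch (the defaults in the comments are the shipped values):

```sql
SHOW jit_above_cost;           -- default 100000: below this, no JIT
SHOW jit_inline_above_cost;    -- default 500000: function inlining step
SHOW jit_optimize_above_cost;  -- default 500000: LLVM optimization step

-- Keep JIT for OLAP but make sure OLTP queries never compile:
SET jit_above_cost = 1000000;
```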

Parallel Query in plans

EXPLAIN SELECT COUNT(*) FROM large_table;

-- Gather
--   Workers Planned: 4
--   → Parallel Seq Scan on large_table
--        Workers Launched: 4

-- Important metrics:
-- Workers Planned: how many the optimizer planned
-- Workers Launched: how many actually started
-- If Launched < Planned → check:
--   max_parallel_workers_per_gather
--   max_parallel_workers
--   Server load
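For testing, parallelism can be encouraged at the session level by zeroing the parallel cost penalties. A sketch (values are for experimentation, not production defaults):

```sql
-- Make the planner more willing to choose a parallel plan
SET max_parallel_workers_per_gather = 4;
SET parallel_setup_cost = 0;   -- default 1000
SET parallel_tuple_cost = 0;   -- default 0.1

EXPLAIN SELECT COUNT(*) FROM large_table;
-- Check Workers Planned in the Gather node
```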

Incremental Sort (PG 13+)

EXPLAIN SELECT * FROM orders ORDER BY user_id, created_at;

-- Incremental Sort  ← PG 13+!
--   Sort Key: user_id, created_at
--   Presorted Key: user_id  ← data already sorted by user_id
--   → Sorts only by created_at within each user_id
--   → 5-10x faster than a full sort!

On PG 12 and below — full sort of all columns, even if part is already sorted.
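The Presorted Key usually comes from an index. A hedged sketch of the prerequisite (the index name is illustrative):

```sql
-- An index on user_id delivers rows already ordered by user_id,
-- so only created_at needs sorting within each user_id group
CREATE INDEX idx_orders_user_id ON orders(user_id);

EXPLAIN SELECT * FROM orders ORDER BY user_id, created_at;
-- Expected on PG 13+: Incremental Sort with "Presorted Key: user_id"
```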

Edge Cases

  1. EXPLAIN ANALYZE executes the query!
    -- ⚠️ DML query will be executed!
    EXPLAIN ANALYZE DELETE FROM users WHERE status = 'INACTIVE';
    -- → All inactive users are DELETED!
    
    -- ✅ Safe way:
    BEGIN;
    EXPLAIN ANALYZE DELETE FROM users WHERE status = 'INACTIVE';
    ROLLBACK;  -- Undo the changes
    
  2. First EXPLAIN ANALYZE run may be misleading
    Data may not be in cache → the first run is slower.
    Run the query 2-3 times and look at the second/third result.
    Note: DISCARD ALL resets session state (cached plans, temp tables),
    but does not clear shared buffers or the OS page cache.
    
  3. WAL generation (PG 13+)
    EXPLAIN (ANALYZE, WAL) INSERT INTO orders SELECT ...;
    
    -- WAL:
    --   Records: 50000
    --   Bytes: 10MB
    --   FPI: 100 (Full Page Images — full copies of 8KB pages in WAL on
    --   the first page modification after a checkpoint. A large FPI count =
    --   high load on WAL and replication)
    
    -- Critical for estimating replication load
    

On PG 12 and below, the WAL flag in EXPLAIN is not available. Estimate replication load via pg_stat_wal or logs.

  4. Timing overhead
    -- EXPLAIN ANALYZE adds overhead for time measurements
    -- For micro-optimizations, disable TIMING
    
    EXPLAIN (ANALYZE, TIMING off) SELECT ...;
    -- Faster, but without detailed per-node measurements
    

Production Experience

Real scenario #1: Generic Plan kills API

  • Application: JDBC Prepared Statement for email search
  • In console: 5ms (Custom Plan, Index Scan)
  • In application: 500ms (Generic Plan, Seq Scan)
  • Cause: email skewed (some users have 1000 records)
  • Solution: plan_cache_mode = 'force_custom_plan'
  • Result: stable 5ms

Real scenario #2: JIT slows down OLTP

  • REST API: simple queries < 1ms
  • EXPLAIN ANALYZE: JIT Timing: 8ms
  • Cause: JIT enabled by default, compilation > execution
  • Solution: SET jit = off for the session
  • Result: 0.5ms instead of 8.5ms (17x faster)

Monitoring

-- 1. Find slow queries
SELECT query, mean_exec_time, calls, shared_blks_read
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- 2. Check a specific query
EXPLAIN (ANALYZE, BUFFERS, WAL)
SELECT ...;

-- 3. Find queries with Generic Plan problems
-- (in application logs: slow, in console: fast)

-- 4. Check JIT
EXPLAIN (ANALYZE) SELECT ...;
-- Look for the "JIT" block and Timing

-- 5. Parallel Query effectiveness
EXPLAIN (ANALYZE) SELECT ...;
-- Workers Launched vs Planned
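Step 1 relies on pg_stat_statements, which is not enabled by default. A sketch of the typical setup:

```sql
-- postgresql.conf (requires a server restart):
--   shared_preload_libraries = 'pg_stat_statements'

-- Then, once per database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Optional: reset counters before a measurement window
SELECT pg_stat_statements_reset();
```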

Best Practices

  1. Always use EXPLAIN (ANALYZE, BUFFERS)
  2. Compare Estimated vs Actual rows → discrepancy = statistics problem
  3. Check Cache Hit Rate via Buffers
  4. Generic vs Custom — test Prepared Statements
  5. JIT — disable for OLTP, enable for OLAP
  6. DML + EXPLAIN ANALYZE → only in a transaction with ROLLBACK
  7. WAL (PG 13+) for estimating replication load
  8. SETTINGS (PG 12+): EXPLAIN (ANALYZE, SETTINGS) shows session parameters changed from their defaults. On PG 11 and below, check parameters manually via SHOW.

Summary for Senior

  • Startup Cost is critical for LIMIT, Total Cost for full retrieval
  • BUFFERS = the main I/O metric (hit vs read)
  • Generic vs Custom Plans — a common cause of prod/dev discrepancies
  • plan_cache_mode controls Prepared Statements behavior
  • JIT speeds up OLAP, slows down OLTP → disable where not needed
  • WAL (PG 13+) shows replication load
  • Estimated ≠ Actual rows → run ANALYZE or tune SET STATISTICS
  • DML + EXPLAIN ANALYZE → always in BEGIN...ROLLBACK

🎯 Interview Cheat Sheet

Must know:

  • EXPLAIN: shows the plan WITHOUT execution; EXPLAIN ANALYZE: executes + actual time
  • Cost: startup..total in arbitrary units (1 unit ≈ reading an 8KB page)
  • BUFFERS: shared hit (from cache), shared read (from disk), written (write)
  • Generic Plan (Prepared Statements): universal, may be worse for skewed data
  • Custom Plan: knows the specific value, better for skewed data
  • plan_cache_mode: auto (5 times Custom → Generic), force_custom_plan, force_generic_plan
  • JIT: speeds up complex expressions/aggregates, slows down OLTP (< 1ms queries)
  • Parallel Query: Workers Planned vs Launched → if Launched < Planned, check settings
  • Incremental Sort (PG 13+): data partially sorted → sort the rest
  • WAL (PG 13+): Records, Bytes, FPI → estimate replication load
  • DML + EXPLAIN ANALYZE → EXECUTES the query! Only in BEGIN…ROLLBACK

Frequent follow-up questions:

  • “Why is it slower in the application than in the console?” → Generic Plan in Prepared Statements
  • “How to check Cache Hit Rate?” → BUFFERS: hit / (hit + read)
  • “Why does JIT slow things down?” → Compilation 8ms > execution 1ms → disable for OLTP
  • “What does Estimated rows ≠ Actual mean?” → Stale statistics → ANALYZE

Red flags (do NOT say):

  • ❌ “EXPLAIN ANALYZE is safe for DML” (it EXECUTES the query!)
  • ❌ “Generic Plan is always good” (for skewed data — disaster)
  • ❌ “JIT always speeds up” (it slows down OLTP!)
  • ❌ “Cost is milliseconds” (no, arbitrary units!)

Related topics:

  • [[How to optimize slow queries]] → practical application of EXPLAIN
  • [[Why is ANALYZE needed]] → Estimated ≠ Actual → ANALYZE
  • [[What is a correlated subquery]] → finding SubPlan in the plan