Question 4 · Section 1

When should you create an index?

Create an index when table searches have become slow and you understand which fields are used for searching.

Language versions: English Russian Ukrainian

🟢 Junior Level

Create an index when table searches have become slow and you understand which fields are used for searching.

Simple analogy: If you often look up a contact in a phone book by phone number, it makes sense to create a separate “by number” index.

Main idea:

  • An index is needed to speed up searches
  • Create indexes for fields in WHERE, JOIN, ORDER BY
  • Don’t create indexes “just in case”

Example:

-- Slow query without index
SELECT * FROM users WHERE email = 'test@example.com';

-- Creating an index
CREATE INDEX idx_users_email ON users(email);

-- Now the query is fast
SELECT * FROM users WHERE email = 'test@example.com';

When to create:

  • The field is frequently used in WHERE
  • The field is used in JOIN for table linking
  • The field is used in ORDER BY for sorting
  • The table is large (more than 10,000 rows)

🟡 Middle Level

How to identify the need for indexes

1. Slow query analysis

-- Enable slow query logging
-- postgresql.conf:
-- log_min_duration_statement = 1000  -- logs queries > 1 second

2. pg_stat_statements extension

-- Enable the extension
CREATE EXTENSION pg_stat_statements;

-- Find the slowest queries
SELECT query,
       calls,
       mean_exec_time as avg_time_ms,
       total_exec_time as total_time_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

Main criteria for creating indexes

1. Fields in WHERE with high selectivity

-- ✅ High selectivity (many unique values)
CREATE INDEX idx_users_email ON users(email);

-- ❌ Low selectivity (few unique values)
CREATE INDEX idx_users_gender ON users(gender); -- M/F

2. Foreign Keys

-- It is recommended to index FKs!
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id)  -- ← index this!
);

-- Creating an index on FK
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Why? When deleting a user, PG checks for absence of orders
-- Without index → Seq Scan on orders (very slow on large tables)

Exceptions: FK table < 1000 rows (Seq Scan is fast), or you never delete/update parent records.

3. Fields for JOIN

-- For Nested Loop Join, indexes on join fields are critical
SELECT * FROM orders o
JOIN users u ON o.user_id = u.id;

-- Indexes:
-- orders(user_id) ← for finding user's orders
-- users(id) ← already exists (PRIMARY KEY)

4. Fields for ORDER BY / GROUP BY

-- An index allows you to avoid sorting
CREATE INDEX idx_orders_created ON orders(created_at DESC);

-- Query without Sort (data is already sorted)
SELECT * FROM orders ORDER BY created_at DESC LIMIT 10;

Common mistakes

  1. Creating an index without analyzing the workload
    • ❌ Created an index “just in case”
    • ✅ First analyze queries via EXPLAIN
  2. Duplicate indexes
    -- ❌ Two identical indexes
    CREATE INDEX idx1 ON users(email);
    CREATE INDEX idx2 ON users(email);
    
  3. Indexes on small tables
    • Table < 1000 rows → Seq Scan is faster
    • Index just takes up space

Index cost

Each index means:

  1. Slower INSERT/UPDATE/DELETE (the tree needs updating)
  2. Disk space (from 10% to 100%+ of the table size)
  3. RAM load (the index must be in memory for efficiency)

Practical rule

-- Before creating an index, check:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE email = 'test@example.com';

-- If you see Seq Scan and buffer hits is low → index will help
-- If already Index Scan → maybe the index already exists

🔴 Senior Level

How to identify the need: Workload Analysis

pg_stat_statements — the main tool:

-- Find queries with the highest total_exec_time
SELECT
    query,
    calls,
    mean_exec_time,
    total_exec_time,
    rows,
    shared_blks_read,    -- Reading from disk (bad)
    shared_blks_hit      -- Reading from cache (good)
FROM pg_stat_statements
WHERE dbid = (SELECT oid FROM pg_database WHERE datname = current_database())
ORDER BY total_exec_time DESC
LIMIT 20;

-- Key metric: shared_blks_read / calls > 100
-- → Query reads many pages from disk → needs an index

HypoPG — hypothetical indexes:

-- Extension for testing "what if" scenarios
CREATE EXTENSION hypopg;

-- Create a virtual index (does NOT take space!)
SELECT * FROM hypopg_create_index(
    'CREATE INDEX idx_orders_user_date ON orders(user_id, created_at DESC)'
);

-- Check if the planner would choose it
EXPLAIN SELECT * FROM orders WHERE user_id = 123 ORDER BY created_at DESC LIMIT 10;
-- In the plan you'll see: Index Scan using <hypothetical>...

-- Remove the hypothetical index
SELECT * FROM hypopg_drop_index(indexrelid);

Architectural Trade-offs

1. Write Amplification

Each new index =
  + Slower INSERT/UPDATE/DELETE (the tree needs updating)
  + WAL load (Write Ahead Log: PostgreSQL first logs all changes to WAL files, then applies them to the main data. Each index increases WAL volume because index changes also need to be logged)
  + Increased Replica Lag (replica can't keep up)
  + Bloat (index inflation over time)

2. Index consolidation

-- ❌ Before: 3 separate indexes
CREATE INDEX idx_a ON t(a);              -- 2 GB
CREATE INDEX idx_a_b ON t(a, b);         -- 5 GB
CREATE INDEX idx_a_b_c ON t(a, b, c);    -- 8 GB
-- Total: 15 GB, write slowdown x3

-- ✅ After: 1 index covers all prefixes
CREATE INDEX idx_a_b_c ON t(a, b, c);    -- 8 GB
-- Covers: WHERE a, WHERE a,b, WHERE a,b,c
-- Savings: 7 GB, writes 3x faster

3. Partial Index for rare values

-- Index only errors (5% of all records)
CREATE INDEX idx_orders_errors ON orders(id, error_message)
WHERE status = 'ERROR';

-- Space savings: 95% smaller than a full index
-- Speedup: only for critical queries

Edge Cases

  1. Creating an index in Production
    -- ❌ Blocks writes (ShareLock)
    CREATE INDEX idx ON orders(user_id);
    
    -- ✅ Does not block, but takes longer
    CREATE INDEX CONCURRENTLY idx ON orders(user_id);
    
    -- ⚠️ If interrupted, remains INVALID
    SELECT indexrelid::regclass, indisvalid
    FROM pg_index
    WHERE NOT indisvalid;
    
  2. Bulk Load optimization
    -- When inserting millions of rows:
    -- 1. Drop indexes
    DROP INDEX idx_orders_user_id;
    
    -- 2. Insert data
    COPY orders FROM '/data/orders.csv';
    
    -- 3. Recreate indexes (faster than updating on each insert)
    CREATE INDEX idx_orders_user_id ON orders(user_id);
    
  3. Index bloat prevention
    -- For frequently updated indexes
    CREATE INDEX idx_orders_status ON orders(status)
    WITH (fillfactor = 70);
    
    -- 30% of page is free → fewer splits
    

Performance

Impact on writes (approximate numbers):

Number of indexes INSERT latency WAL volume
1-2 +5-10% Base
5-7 +50-100% x2-x3
10-15 +200-500% x5-x10

The 10-15% rule:

  • If an index is used in < 10-15% of total queries → likely not needed
  • Seq Scan on a table that fits entirely in cache may be faster than Index Scan

Production Experience

Real scenario #1:

  • Fintech startup: 50M transactions
  • Problem: Report queries > 30 seconds
  • Analysis via pg_stat_statements:
    • Top 3 queries read 80% of the time
    • All 3 use WHERE user_id = ? AND created_at > ?
  • Solution:
    • CREATE INDEX CONCURRENTLY idx_transactions_user_date ON transactions(user_id, created_at DESC)
    • Added INCLUDE (amount, type) for Index-Only Scan
  • Result: Queries became < 100ms (300x speedup)

Real scenario #2:

  • E-commerce: 200M orders
  • Problem: INSERT slowed down to 2 seconds, Replica Lag > 60 seconds
  • Analysis:
    • 18 indexes on the orders table
    • 12 of them were used < 5 times per month
    • WAL volume: 500 MB/minute
  • Solution:
    • Dropped 12 unused indexes
    • Consolidated the rest to 6
    • Result: INSERT < 100ms, Replica Lag < 5 seconds

Monitoring

-- 1. Check index usage
SELECT
    schemaname,
    relname as table_name,
    indexrelname as index_name,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch,
    pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan ASC  -- Unused at the bottom
LIMIT 20;

-- 2. Check Foreign Keys without indexes
SELECT
    conrelid::regclass AS table_with_fk,
    conname AS fk_constraint,
    confrelid::regclass AS referenced_table
FROM pg_constraint
WHERE contype = 'f'
  AND NOT EXISTS (
    SELECT 1 FROM pg_index
    WHERE indrelid = conrelid
      AND indkey[0] = ANY(conkey)
  );

-- 3. Check duplicates
-- (indexes with the same set of columns)
SELECT
    a.indexrelid::regclass as index1,
    b.indexrelid::regclass as index2,
    pg_get_indexdef(a.indexrelid) as definition
FROM pg_index a
JOIN pg_index b ON a.indrelid = b.indrelid
  AND a.indexrelid < b.indexrelid
  AND a.indkey::text = b.indkey::text;

-- 4. HypoPG for testing
SELECT * FROM hypopg_list_indexes();

Best Practices

  1. Analyze workload via pg_stat_statements before creating
  2. Test hypotheses via HypoPG
  3. It is recommended to index Foreign Keys
  4. Use CONCURRENTLY in production
  5. Consolidate indexes by prefixes
  6. Remove unused indexes regularly
  7. Monitor Replica Lag when adding new indexes
  8. For Bulk Load — drop and recreate indexes

Summary for Senior

  • Use pg_stat_statements to find bottleneck queries
  • Check hypotheses via HypoPG before creating real indexes
  • It is recommended to index Foreign Keys — a common cause of Seq Scan
  • In production — only CONCURRENTLY
  • Consolidate indexes: (a), (a,b), (a,b,c) → one (a,b,c)
  • Monitor Replica Lag and WAL volume after adding indexes
  • Remove unused indexes — each index is a write tax

🎯 Interview Cheat Sheet

Must know:

  • Analyze workload via pg_stat_statements before creating
  • It is recommended to index Foreign Keys — a common cause of Seq Scan
  • CONCURRENTLY in production (does not block writes, but takes longer)
  • HypoPG — testing “what if” without creating a real index
  • Write Amplification: each index = +INSERT/UPDATE/DELETE + WAL volume
  • Bulk Load: drop indexes → insert → recreate (faster)
  • fillfactor = 70-80 for frequently updated indexes

Common follow-up questions:

  • “How to know if an index is needed?” → shared_blks_read / calls > 100 in pg_stat_statements
  • “What to do before a bulk insert?” → Drop indexes → COPY → recreate
  • “How to check if an index will help?” → HypoPG + EXPLAIN
  • “What is Replica Lag and how do indexes affect it?” → More indexes → more WAL → replica falls behind

Red flags (DO NOT say):

  • ❌ “I’ll create an index and see” (analyze workload first!)
  • ❌ “Indexes just in case won’t hurt” (Write Amplification!)
  • ❌ “I’ll create an index in production without CONCURRENTLY” (write lock!)

Related topics:

  • [[What are indexes for]] → overview
  • [[What are the disadvantages of indexes]] → Write Amplification
  • [[What is a composite index]] → consolidation
  • [[How does MVCC work in PostgreSQL]] → HOT updates