When should you create an index?
Create an index when table searches have become slow and you understand which fields are used for searching.
🟢 Junior Level
Create an index when table searches have become slow and you understand which fields are used for searching.
Simple analogy: If you often look up a contact in a phone book by phone number, it makes sense to create a separate “by number” index.
Main idea:
- An index is needed to speed up searches
- Create indexes for fields in WHERE, JOIN, ORDER BY
- Don’t create indexes “just in case”
Example:
-- Slow query without index
SELECT * FROM users WHERE email = 'test@example.com';
-- Creating an index
CREATE INDEX idx_users_email ON users(email);
-- Now the query is fast
SELECT * FROM users WHERE email = 'test@example.com';
When to create:
- The field is frequently used in WHERE
- The field is used in JOIN for table linking
- The field is used in ORDER BY for sorting
- The table is large (more than 10,000 rows)
🟡 Middle Level
How to identify the need for indexes
1. Slow query analysis
-- Enable slow query logging
-- postgresql.conf:
-- log_min_duration_statement = 1000 -- logs queries > 1 second
2. pg_stat_statements extension
-- Enable the extension
CREATE EXTENSION pg_stat_statements;
-- Find the slowest queries
SELECT query,
calls,
mean_exec_time as avg_time_ms,
total_exec_time as total_time_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
Main criteria for creating indexes
1. Fields in WHERE with high selectivity
-- ✅ High selectivity (many unique values)
CREATE INDEX idx_users_email ON users(email);
-- ❌ Low selectivity (few unique values)
CREATE INDEX idx_users_gender ON users(gender); -- M/F
2. Foreign Keys
-- It is recommended to index FKs!
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INTEGER REFERENCES users(id) -- ← index this!
);
-- Creating an index on FK
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- Why? When deleting a user, PG checks for absence of orders
-- Without index → Seq Scan on orders (very slow on large tables)
Exceptions: FK table < 1000 rows (Seq Scan is fast), or you never delete/update parent records.
3. Fields for JOIN
-- For Nested Loop Join, indexes on join fields are critical
SELECT * FROM orders o
JOIN users u ON o.user_id = u.id;
-- Indexes:
-- orders(user_id) ← for finding user's orders
-- users(id) ← already exists (PRIMARY KEY)
4. Fields for ORDER BY / GROUP BY
-- An index allows you to avoid sorting
CREATE INDEX idx_orders_created ON orders(created_at DESC);
-- Query without Sort (data is already sorted)
SELECT * FROM orders ORDER BY created_at DESC LIMIT 10;
Common mistakes
- Creating an index without analyzing the workload
- ❌ Created an index “just in case”
- ✅ First analyze queries via
EXPLAIN
- Duplicate indexes
-- ❌ Two identical indexes CREATE INDEX idx1 ON users(email); CREATE INDEX idx2 ON users(email); - Indexes on small tables
- Table < 1000 rows → Seq Scan is faster
- Index just takes up space
Index cost
Each index means:
- Slower INSERT/UPDATE/DELETE (the tree needs updating)
- Disk space (from 10% to 100%+ of the table size)
- RAM load (the index must be in memory for efficiency)
Practical rule
-- Before creating an index, check:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE email = 'test@example.com';
-- If you see Seq Scan and buffer hits is low → index will help
-- If already Index Scan → maybe the index already exists
🔴 Senior Level
How to identify the need: Workload Analysis
pg_stat_statements — the main tool:
-- Find queries with the highest total_exec_time
SELECT
query,
calls,
mean_exec_time,
total_exec_time,
rows,
shared_blks_read, -- Reading from disk (bad)
shared_blks_hit -- Reading from cache (good)
FROM pg_stat_statements
WHERE dbid = (SELECT oid FROM pg_database WHERE datname = current_database())
ORDER BY total_exec_time DESC
LIMIT 20;
-- Key metric: shared_blks_read / calls > 100
-- → Query reads many pages from disk → needs an index
HypoPG — hypothetical indexes:
-- Extension for testing "what if" scenarios
CREATE EXTENSION hypopg;
-- Create a virtual index (does NOT take space!)
SELECT * FROM hypopg_create_index(
'CREATE INDEX idx_orders_user_date ON orders(user_id, created_at DESC)'
);
-- Check if the planner would choose it
EXPLAIN SELECT * FROM orders WHERE user_id = 123 ORDER BY created_at DESC LIMIT 10;
-- In the plan you'll see: Index Scan using <hypothetical>...
-- Remove the hypothetical index
SELECT * FROM hypopg_drop_index(indexrelid);
Architectural Trade-offs
1. Write Amplification
Each new index =
+ Slower INSERT/UPDATE/DELETE (the tree needs updating)
+ WAL load (Write Ahead Log: PostgreSQL first logs all changes to WAL files, then applies them to the main data. Each index increases WAL volume because index changes also need to be logged)
+ Increased Replica Lag (replica can't keep up)
+ Bloat (index inflation over time)
2. Index consolidation
-- ❌ Before: 3 separate indexes
CREATE INDEX idx_a ON t(a); -- 2 GB
CREATE INDEX idx_a_b ON t(a, b); -- 5 GB
CREATE INDEX idx_a_b_c ON t(a, b, c); -- 8 GB
-- Total: 15 GB, write slowdown x3
-- ✅ After: 1 index covers all prefixes
CREATE INDEX idx_a_b_c ON t(a, b, c); -- 8 GB
-- Covers: WHERE a, WHERE a,b, WHERE a,b,c
-- Savings: 7 GB, writes 3x faster
3. Partial Index for rare values
-- Index only errors (5% of all records)
CREATE INDEX idx_orders_errors ON orders(id, error_message)
WHERE status = 'ERROR';
-- Space savings: 95% smaller than a full index
-- Speedup: only for critical queries
Edge Cases
- Creating an index in Production
-- ❌ Blocks writes (ShareLock) CREATE INDEX idx ON orders(user_id); -- ✅ Does not block, but takes longer CREATE INDEX CONCURRENTLY idx ON orders(user_id); -- ⚠️ If interrupted, remains INVALID SELECT indexrelid::regclass, indisvalid FROM pg_index WHERE NOT indisvalid; - Bulk Load optimization
-- When inserting millions of rows: -- 1. Drop indexes DROP INDEX idx_orders_user_id; -- 2. Insert data COPY orders FROM '/data/orders.csv'; -- 3. Recreate indexes (faster than updating on each insert) CREATE INDEX idx_orders_user_id ON orders(user_id); - Index bloat prevention
-- For frequently updated indexes CREATE INDEX idx_orders_status ON orders(status) WITH (fillfactor = 70); -- 30% of page is free → fewer splits
Performance
Impact on writes (approximate numbers):
| Number of indexes | INSERT latency | WAL volume |
|---|---|---|
| 1-2 | +5-10% | Base |
| 5-7 | +50-100% | x2-x3 |
| 10-15 | +200-500% | x5-x10 |
The 10-15% rule:
- If an index is used in < 10-15% of total queries → likely not needed
- Seq Scan on a table that fits entirely in cache may be faster than Index Scan
Production Experience
Real scenario #1:
- Fintech startup: 50M transactions
- Problem: Report queries > 30 seconds
- Analysis via
pg_stat_statements:- Top 3 queries read 80% of the time
- All 3 use
WHERE user_id = ? AND created_at > ?
- Solution:
CREATE INDEX CONCURRENTLY idx_transactions_user_date ON transactions(user_id, created_at DESC)- Added
INCLUDE (amount, type)for Index-Only Scan
- Result: Queries became < 100ms (300x speedup)
Real scenario #2:
- E-commerce: 200M orders
- Problem: INSERT slowed down to 2 seconds, Replica Lag > 60 seconds
- Analysis:
- 18 indexes on the orders table
- 12 of them were used < 5 times per month
- WAL volume: 500 MB/minute
- Solution:
- Dropped 12 unused indexes
- Consolidated the rest to 6
- Result: INSERT < 100ms, Replica Lag < 5 seconds
Monitoring
-- 1. Check index usage
SELECT
schemaname,
relname as table_name,
indexrelname as index_name,
idx_scan,
idx_tup_read,
idx_tup_fetch,
pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan ASC -- Unused at the bottom
LIMIT 20;
-- 2. Check Foreign Keys without indexes
SELECT
conrelid::regclass AS table_with_fk,
conname AS fk_constraint,
confrelid::regclass AS referenced_table
FROM pg_constraint
WHERE contype = 'f'
AND NOT EXISTS (
SELECT 1 FROM pg_index
WHERE indrelid = conrelid
AND indkey[0] = ANY(conkey)
);
-- 3. Check duplicates
-- (indexes with the same set of columns)
SELECT
a.indexrelid::regclass as index1,
b.indexrelid::regclass as index2,
pg_get_indexdef(a.indexrelid) as definition
FROM pg_index a
JOIN pg_index b ON a.indrelid = b.indrelid
AND a.indexrelid < b.indexrelid
AND a.indkey::text = b.indkey::text;
-- 4. HypoPG for testing
SELECT * FROM hypopg_list_indexes();
Best Practices
- Analyze workload via
pg_stat_statementsbefore creating - Test hypotheses via
HypoPG - It is recommended to index Foreign Keys
- Use CONCURRENTLY in production
- Consolidate indexes by prefixes
- Remove unused indexes regularly
- Monitor Replica Lag when adding new indexes
- For Bulk Load — drop and recreate indexes
Summary for Senior
- Use pg_stat_statements to find bottleneck queries
- Check hypotheses via HypoPG before creating real indexes
- It is recommended to index Foreign Keys — a common cause of Seq Scan
- In production — only CONCURRENTLY
- Consolidate indexes:
(a),(a,b),(a,b,c)→ one(a,b,c) - Monitor Replica Lag and WAL volume after adding indexes
- Remove unused indexes — each index is a write tax
🎯 Interview Cheat Sheet
Must know:
- Analyze workload via
pg_stat_statementsbefore creating - It is recommended to index Foreign Keys — a common cause of Seq Scan
CONCURRENTLYin production (does not block writes, but takes longer)- HypoPG — testing “what if” without creating a real index
- Write Amplification: each index = +INSERT/UPDATE/DELETE + WAL volume
- Bulk Load: drop indexes → insert → recreate (faster)
fillfactor = 70-80for frequently updated indexes
Common follow-up questions:
- “How to know if an index is needed?” →
shared_blks_read / calls > 100in pg_stat_statements - “What to do before a bulk insert?” → Drop indexes → COPY → recreate
- “How to check if an index will help?” → HypoPG + EXPLAIN
- “What is Replica Lag and how do indexes affect it?” → More indexes → more WAL → replica falls behind
Red flags (DO NOT say):
- ❌ “I’ll create an index and see” (analyze workload first!)
- ❌ “Indexes just in case won’t hurt” (Write Amplification!)
- ❌ “I’ll create an index in production without CONCURRENTLY” (write lock!)
Related topics:
- [[What are indexes for]] → overview
- [[What are the disadvantages of indexes]] → Write Amplification
- [[What is a composite index]] → consolidation
- [[How does MVCC work in PostgreSQL]] → HOT updates