What is a composite index?

🟢 Junior Level

Composite (compound) index is an index created on multiple columns simultaneously.

Why: a single composite index can replace several individual ones, saving disk space and speeding up writes. Additionally, it allows optimizing queries that filter on multiple fields at once — which is common in most real-world applications.

Simple analogy: Imagine a phone book where contacts are sorted first by last name, then by first name. You first find the right last name, and within it — the right first name.

Main idea:

One index can speed up searches on multiple fields
The order of columns in the index is critically important
The left prefix rule applies

Example:

-- Creating a composite index
CREATE INDEX idx_users_city_age ON users(city, age);

-- ✅ Will be used (by city)
SELECT * FROM users WHERE city = 'Moscow';

-- ✅ Will be used (by city and age)
SELECT * FROM users WHERE city = 'Moscow' AND age = 25;

-- ❌ Will NOT be used (only age)
SELECT * FROM users WHERE age = 25;

When to use:

When you frequently search by a combination of fields
When queries use different combinations of the same fields
For optimizing sorting on multiple fields

🟡 Middle Level

How it works

A composite index sorts data hierarchically:

First by the first column
Within equal values of the first — by the second
And so on

Left prefix rule: An index (A, B, C) can be used for searches by:

✅ A
✅ A, B
✅ A, B, C
❌ B (without A)
❌ C (without A and B)
❌ B, C (without A)

Golden rules of column ordering

1. Equality before range (= before >)

-- Query: WHERE city = 'Moscow' AND age > 25

-- ✅ Correct order
CREATE INDEX idx_city_age ON users(city, age);

-- ❌ Incorrect order (age as range makes city useless)
CREATE INDEX idx_age_city ON users(age, city);

2. Sorting (ORDER BY)

-- The index covers ORDER BY in the same order
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at DESC);

-- ✅ Uses the index
SELECT * FROM orders WHERE user_id = 123 ORDER BY created_at DESC;

-- ❌ Does not use the index for sorting
SELECT * FROM orders ORDER BY created_at DESC; -- no filter on user_id

Common mistakes

Wrong column order

-- ❌ If city is used with '=' and age with '>', city must come first
CREATE INDEX idx_age_city ON users(age, city); -- Incorrect

Duplicate indexes

-- ❌ Index (A, B) already covers (A)
CREATE INDEX idx_a ON users(a);           -- Redundant
CREATE INDEX idx_a_b ON users(a, b);      -- Covers both cases

Too many columns
- PostgreSQL supports up to 32 columns in an index
- In practice: more than 5 columns — a red flag
- Large index → needs more RAM to stay in cache → if RAM is insufficient → Cache Misses → Random I/O → performance drop

Comparison with separate indexes

Query	Index (A) + (B)	Index (A, B)
`WHERE A = 1`	✅ Uses (A)	✅ Uses (A,B)
`WHERE B = 2`	✅ Uses (B)	❌ Does not use
`WHERE A = 1 AND B = 2`	⚠️ Bitmap or one of them	✅ Fully uses

Practical example

-- Typical e-commerce scenario
CREATE INDEX idx_orders_customer_date ON orders(customer_id, created_at DESC);

-- Query 1: All orders of a customer
SELECT * FROM orders WHERE customer_id = 123;
-- Plan: Index Scan (by customer_id)

-- Query 2: Recent orders of a customer
SELECT * FROM orders
WHERE customer_id = 123
ORDER BY created_at DESC
LIMIT 10;
-- Plan: Index Scan + Limit (very fast!)

🔴 Senior Level

Internal Implementation

Lexicographic sorting: A composite index key is essentially one large string. The B-tree stores a concatenation of column values.

Key structure in Leaf Node:

┌─────────────────────────────────────────┐
│ City="Moscow" | Age=25 | TID=(123,456)  │
├─────────────────────────────────────────┤
│ City="Moscow" | Age=26 | TID=(123,457)  │
├─────────────────────────────────────────┤
│ City="SPb"     | Age=20 | TID=(124,100) │
└─────────────────────────────────────────┘

Prefix compression: PostgreSQL can compress keys in Internal nodes, storing only the part needed for navigation.

Architectural Trade-offs

Column ordering rule:

[Equi-columns] → [Range-column] → [Include-columns]
     (WHERE =)      (WHERE >,<)     (SELECT only)

Example:

-- Query: WHERE status = 'ACTIVE' AND city = 'Moscow' AND created_at > '2024-01-01'
-- ORDER BY created_at DESC

-- ✅ Optimal index
CREATE INDEX idx_orders_optimal
ON orders(status, city, created_at DESC);

-- status (=) → city (=) → created_at (>) → ideal order

Mixed directions (PG 13+):

-- For ORDER BY a ASC, b DESC
CREATE INDEX idx_mixed ON table_name(a ASC, b DESC);

-- A regular index (ASC, ASC) will NOT cover such an ORDER BY

On PG 12 and below, mixed directions (ASC + DESC) in a single index are not optimized correctly — the index will be sorted in one direction. Solution: a separate index for the specific ORDER BY or additional sorting in the query.

Emulating Index Skip Scan

PostgreSQL does not have built-in Skip Scan (unlike Oracle/MySQL 8.0). But you can emulate it:

-- Problem: index (status, user_id), but searching only by user_id
-- status has few unique values (ACTIVE, INACTIVE, DELETED)

-- ✅ Recursive CTE emulating Skip Scan
WITH RECURSIVE skip_scan AS (
  (SELECT MIN(status) AS status FROM orders)
  UNION ALL
  SELECT (SELECT MIN(status) FROM orders WHERE status > s.status)
  FROM skip_scan s WHERE s.status IS NOT NULL
)
SELECT * FROM orders o
JOIN skip_scan s ON o.status = s.status
WHERE o.user_id = 123;

Performance: On a table with 100M records and 3 unique status values:

Without Skip Scan: Seq Scan ~30 seconds
With emulation: ~50 milliseconds (600x difference!)

Edge Cases

NULL in a composite index

-- NULL sorts last by default (NULLS LAST)
CREATE INDEX idx_a_b ON t(a, b);

-- WHERE a IS NULL → search at the end of the index
-- WHERE a = 1 AND b IS NULL → search at the end of the a=1 group

Operator Class (OpClass)

-- OpClass (Operator Class) — a set of rules defining how column values
-- are compared and sorted in the index. By default, text uses
-- standard character-by-character comparison, but you can use, e.g.,
-- hash comparison or case-insensitive comparison
CREATE INDEX idx_name ON users(name COLLATE "C");

-- For specific operations
CREATE INDEX idx_tags ON posts USING gin(tags);

Index-Only Scan and Visibility Map
- Even if all columns are in the index, PG may still go to Heap
- Reason: Pages are not marked in the Visibility Map
- Solution: VACUUM updates the Visibility Map

Performance

Impact on writes:

Each column in the index increases key size
Large key → fewer entries per page → more tree levels
For 100M records:
- Index (int): ~2 GB
- Index (int, text(50)): ~8 GB
- Index (int, text(50), text(100)): ~15 GB

Index consolidation:

-- ❌ Before (3 indexes, 15 GB total)
CREATE INDEX idx_a ON t(a);           -- 2 GB
CREATE INDEX idx_a_b ON t(a, b);      -- 5 GB
CREATE INDEX idx_a_b_c ON t(a, b, c); -- 8 GB

-- ✅ After (1 index, 8 GB)
CREATE INDEX idx_a_b_c ON t(a, b, c); -- Covers all prefixes!

Production Experience

Real scenario:

SaaS platform: 200M order records
Problem: 12 separate indexes, 80 GB total
Symptom: INSERT > 500ms, RAM insufficient, Cache Hit rate < 60%
Analysis via pg_stat_user_indexes:
- 8 indexes have idx_scan < 10 per month
- 4 indexes duplicate prefixes
Solution:
- Consolidate to 3 composite indexes
- Added INCLUDE for Index-Only Scan on hot queries
- Result: 15 GB of indexes, INSERT < 50ms, Cache Hit > 95%

Monitoring

-- Check composite index usage
SELECT
    indexrelname,
    idx_scan,
    idx_tup_read,
    pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
WHERE relname = 'orders'
ORDER BY idx_scan DESC;

-- Check column order
SELECT
    i.relname as table_name,
    ix.indexrelid::regclass as index_name,
    array_agg(a.attname ORDER BY k.n) as columns
FROM pg_index ix
JOIN pg_class i ON i.oid = ix.indrelid
JOIN pg_class idx ON idx.oid = ix.indexrelid
JOIN generate_series(0, array_length(ix.indkey, 1)-1) as k(n) ON true
JOIN pg_attribute a ON a.attrelid = i.oid AND a.attnum = ix.indkey[k.n]
WHERE i.relname = 'orders'
GROUP BY i.relname, ix.indexrelid;

-- Check prefix duplicates
-- (requires pg_stat_statements extension or a custom pg_index query)

Best Practices

Column order: = → ranges → INCLUDE
Consolidation: One index (a,b,c) replaces (a) and (a,b)
Limit size: Maximum 3-5 columns per index
INCLUDE: For columns only in SELECT, use INCLUDE
Mixed directions: Create indexes for specific ORDER BY clauses
Monitoring: Regularly check pg_stat_user_indexes for unused indexes

Summary for Senior

A composite index is a lexicographically sorted string-key
Column order is critical: Equi → Range → Include
Emulate Skip Scan via Recursive CTE for low-cardinality first columns
Consolidate indexes to save RAM and speed up writes
Visibility Map affects Index-Only Scan — keep up with VACUUM
Always check the plan via EXPLAIN (ANALYZE, BUFFERS)

🎯 Interview Cheat Sheet

Must know:

A composite index sorts data hierarchically: 1st column → 2nd → …
Left prefix rule: (A, B, C) works for A, A,B, A,B,C but NOT for B, C
Column order: = (equality) → ranges (>, <) → INCLUDE (SELECT only)
Consolidation: (a), (a,b), (a,b,c) → one (a,b,c) saves space
PostgreSQL does not have Skip Scan → emulate via Recursive CTE
Maximum 3-5 columns in practice (PG supports up to 32)
Mixed directions: (a ASC, b DESC) — PG 13+

Common follow-up questions:

“Why doesn’t WHERE age = 25 use the index (city, age)?” → Left prefix: city is first
“How to replace 3 separate indexes with one?” → A composite (a,b,c) covers all prefixes
“What is Skip Scan and why doesn’t PG have it?” → Skipping the first column; PG doesn’t have it, we emulate with CTE
“How does ORDER BY affect column order?” → The index must match ORDER BY

Red flags (DO NOT say):

❌ “Column order doesn’t matter” (it’s critically important!)
❌ “I’ll create a separate index on each field” (consolidate!)
❌ “32 columns in an index is normal practice” (max 3-5 in practice)

Related topics:

[[What are indexes for]] → when to create indexes
[[How does B-tree index work]] → internal structure
[[What is index cardinality]] → column order by cardinality
[[What are the disadvantages of indexes]] → write cost