Question 3 Β· Section 1

What is a composite index?

A composite index sorts data hierarchically:

Language versions: English Russian Ukrainian

🟒 Junior Level

Composite (compound) index is an index created on multiple columns simultaneously.

Why: a single composite index can replace several individual ones, saving disk space and speeding up writes. Additionally, it allows optimizing queries that filter on multiple fields at once β€” which is common in most real-world applications.

Simple analogy: Imagine a phone book where contacts are sorted first by last name, then by first name. You first find the right last name, and within it β€” the right first name.

Main idea:

  • One index can speed up searches on multiple fields
  • The order of columns in the index is critically important
  • The left prefix rule applies

Example:

-- Creating a composite index
CREATE INDEX idx_users_city_age ON users(city, age);

-- βœ… Will be used (by city)
SELECT * FROM users WHERE city = 'Moscow';

-- βœ… Will be used (by city and age)
SELECT * FROM users WHERE city = 'Moscow' AND age = 25;

-- ❌ Will NOT be used (only age)
SELECT * FROM users WHERE age = 25;

When to use:

  • When you frequently search by a combination of fields
  • When queries use different combinations of the same fields
  • For optimizing sorting on multiple fields

🟑 Middle Level

How it works

A composite index sorts data hierarchically:

  1. First by the first column
  2. Within equal values of the first β€” by the second
  3. And so on

Left prefix rule: An index (A, B, C) can be used for searches by:

  • βœ… A
  • βœ… A, B
  • βœ… A, B, C
  • ❌ B (without A)
  • ❌ C (without A and B)
  • ❌ B, C (without A)

Golden rules of column ordering

1. Equality before range (= before >)

-- Query: WHERE city = 'Moscow' AND age > 25

-- βœ… Correct order
CREATE INDEX idx_city_age ON users(city, age);

-- ❌ Incorrect order (age as range makes city useless)
CREATE INDEX idx_age_city ON users(age, city);

2. Sorting (ORDER BY)

-- The index covers ORDER BY in the same order
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at DESC);

-- βœ… Uses the index
SELECT * FROM orders WHERE user_id = 123 ORDER BY created_at DESC;

-- ❌ Does not use the index for sorting
SELECT * FROM orders ORDER BY created_at DESC; -- no filter on user_id

Common mistakes

  1. Wrong column order
    -- ❌ If city is used with '=' and age with '>', city must come first
    CREATE INDEX idx_age_city ON users(age, city); -- Incorrect
    
  2. Duplicate indexes
    -- ❌ Index (A, B) already covers (A)
    CREATE INDEX idx_a ON users(a);           -- Redundant
    CREATE INDEX idx_a_b ON users(a, b);      -- Covers both cases
    
  3. Too many columns
    • PostgreSQL supports up to 32 columns in an index
    • In practice: more than 5 columns β€” a red flag
    • Large index β†’ needs more RAM to stay in cache β†’ if RAM is insufficient β†’ Cache Misses β†’ Random I/O β†’ performance drop

Comparison with separate indexes

Query Index (A) + (B) Index (A, B)
WHERE A = 1 βœ… Uses (A) βœ… Uses (A,B)
WHERE B = 2 βœ… Uses (B) ❌ Does not use
WHERE A = 1 AND B = 2 ⚠️ Bitmap or one of them βœ… Fully uses

Practical example

-- Typical e-commerce scenario
CREATE INDEX idx_orders_customer_date ON orders(customer_id, created_at DESC);

-- Query 1: All orders of a customer
SELECT * FROM orders WHERE customer_id = 123;
-- Plan: Index Scan (by customer_id)

-- Query 2: Recent orders of a customer
SELECT * FROM orders
WHERE customer_id = 123
ORDER BY created_at DESC
LIMIT 10;
-- Plan: Index Scan + Limit (very fast!)

πŸ”΄ Senior Level

Internal Implementation

Lexicographic sorting: A composite index key is essentially one large string. The B-tree stores a concatenation of column values.

Key structure in Leaf Node:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ City="Moscow" | Age=25 | TID=(123,456)  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ City="Moscow" | Age=26 | TID=(123,457)  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ City="SPb"     | Age=20 | TID=(124,100) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Prefix compression: PostgreSQL can compress keys in Internal nodes, storing only the part needed for navigation.

Architectural Trade-offs

Column ordering rule:

[Equi-columns] β†’ [Range-column] β†’ [Include-columns]
     (WHERE =)      (WHERE >,<)     (SELECT only)

Example:

-- Query: WHERE status = 'ACTIVE' AND city = 'Moscow' AND created_at > '2024-01-01'
-- ORDER BY created_at DESC

-- βœ… Optimal index
CREATE INDEX idx_orders_optimal
ON orders(status, city, created_at DESC);

-- status (=) β†’ city (=) β†’ created_at (>) β†’ ideal order

Mixed directions (PG 13+):

-- For ORDER BY a ASC, b DESC
CREATE INDEX idx_mixed ON table_name(a ASC, b DESC);

-- A regular index (ASC, ASC) will NOT cover such an ORDER BY

On PG 12 and below, mixed directions (ASC + DESC) in a single index are not optimized correctly β€” the index will be sorted in one direction. Solution: a separate index for the specific ORDER BY or additional sorting in the query.

Emulating Index Skip Scan

PostgreSQL does not have built-in Skip Scan (unlike Oracle/MySQL 8.0). But you can emulate it:

-- Problem: index (status, user_id), but searching only by user_id
-- status has few unique values (ACTIVE, INACTIVE, DELETED)

-- βœ… Recursive CTE emulating Skip Scan
WITH RECURSIVE skip_scan AS (
  (SELECT MIN(status) AS status FROM orders)
  UNION ALL
  SELECT (SELECT MIN(status) FROM orders WHERE status > s.status)
  FROM skip_scan s WHERE s.status IS NOT NULL
)
SELECT * FROM orders o
JOIN skip_scan s ON o.status = s.status
WHERE o.user_id = 123;

Performance: On a table with 100M records and 3 unique status values:

  • Without Skip Scan: Seq Scan ~30 seconds
  • With emulation: ~50 milliseconds (600x difference!)

Edge Cases

  1. NULL in a composite index
    -- NULL sorts last by default (NULLS LAST)
    CREATE INDEX idx_a_b ON t(a, b);
    
    -- WHERE a IS NULL β†’ search at the end of the index
    -- WHERE a = 1 AND b IS NULL β†’ search at the end of the a=1 group
    
  2. Operator Class (OpClass)
    -- OpClass (Operator Class) β€” a set of rules defining how column values
    -- are compared and sorted in the index. By default, text uses
    -- standard character-by-character comparison, but you can use, e.g.,
    -- hash comparison or case-insensitive comparison
    CREATE INDEX idx_name ON users(name COLLATE "C");
    
    -- For specific operations
    CREATE INDEX idx_tags ON posts USING gin(tags);
    
  3. Index-Only Scan and Visibility Map
    • Even if all columns are in the index, PG may still go to Heap
    • Reason: Pages are not marked in the Visibility Map
    • Solution: VACUUM updates the Visibility Map

Performance

Impact on writes:

  • Each column in the index increases key size
  • Large key β†’ fewer entries per page β†’ more tree levels
  • For 100M records:
    • Index (int): ~2 GB
    • Index (int, text(50)): ~8 GB
    • Index (int, text(50), text(100)): ~15 GB

Index consolidation:

-- ❌ Before (3 indexes, 15 GB total)
CREATE INDEX idx_a ON t(a);           -- 2 GB
CREATE INDEX idx_a_b ON t(a, b);      -- 5 GB
CREATE INDEX idx_a_b_c ON t(a, b, c); -- 8 GB

-- βœ… After (1 index, 8 GB)
CREATE INDEX idx_a_b_c ON t(a, b, c); -- Covers all prefixes!

Production Experience

Real scenario:

  • SaaS platform: 200M order records
  • Problem: 12 separate indexes, 80 GB total
  • Symptom: INSERT > 500ms, RAM insufficient, Cache Hit rate < 60%
  • Analysis via pg_stat_user_indexes:
    • 8 indexes have idx_scan < 10 per month
    • 4 indexes duplicate prefixes
  • Solution:
    • Consolidate to 3 composite indexes
    • Added INCLUDE for Index-Only Scan on hot queries
    • Result: 15 GB of indexes, INSERT < 50ms, Cache Hit > 95%

Monitoring

-- Check composite index usage
SELECT
    indexrelname,
    idx_scan,
    idx_tup_read,
    pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
WHERE relname = 'orders'
ORDER BY idx_scan DESC;

-- Check column order
SELECT
    i.relname as table_name,
    ix.indexrelid::regclass as index_name,
    array_agg(a.attname ORDER BY k.n) as columns
FROM pg_index ix
JOIN pg_class i ON i.oid = ix.indrelid
JOIN pg_class idx ON idx.oid = ix.indexrelid
JOIN generate_series(0, array_length(ix.indkey, 1)-1) as k(n) ON true
JOIN pg_attribute a ON a.attrelid = i.oid AND a.attnum = ix.indkey[k.n]
WHERE i.relname = 'orders'
GROUP BY i.relname, ix.indexrelid;

-- Check prefix duplicates
-- (requires pg_stat_statements extension or a custom pg_index query)

Best Practices

  1. Column order: = β†’ ranges β†’ INCLUDE
  2. Consolidation: One index (a,b,c) replaces (a) and (a,b)
  3. Limit size: Maximum 3-5 columns per index
  4. INCLUDE: For columns only in SELECT, use INCLUDE
  5. Mixed directions: Create indexes for specific ORDER BY clauses
  6. Monitoring: Regularly check pg_stat_user_indexes for unused indexes

Summary for Senior

  • A composite index is a lexicographically sorted string-key
  • Column order is critical: Equi β†’ Range β†’ Include
  • Emulate Skip Scan via Recursive CTE for low-cardinality first columns
  • Consolidate indexes to save RAM and speed up writes
  • Visibility Map affects Index-Only Scan β€” keep up with VACUUM
  • Always check the plan via EXPLAIN (ANALYZE, BUFFERS)

🎯 Interview Cheat Sheet

Must know:

  • A composite index sorts data hierarchically: 1st column β†’ 2nd β†’ …
  • Left prefix rule: (A, B, C) works for A, A,B, A,B,C but NOT for B, C
  • Column order: = (equality) β†’ ranges (>, <) β†’ INCLUDE (SELECT only)
  • Consolidation: (a), (a,b), (a,b,c) β†’ one (a,b,c) saves space
  • PostgreSQL does not have Skip Scan β†’ emulate via Recursive CTE
  • Maximum 3-5 columns in practice (PG supports up to 32)
  • Mixed directions: (a ASC, b DESC) β€” PG 13+

Common follow-up questions:

  • β€œWhy doesn’t WHERE age = 25 use the index (city, age)?” β†’ Left prefix: city is first
  • β€œHow to replace 3 separate indexes with one?” β†’ A composite (a,b,c) covers all prefixes
  • β€œWhat is Skip Scan and why doesn’t PG have it?” β†’ Skipping the first column; PG doesn’t have it, we emulate with CTE
  • β€œHow does ORDER BY affect column order?” β†’ The index must match ORDER BY

Red flags (DO NOT say):

  • ❌ β€œColumn order doesn’t matter” (it’s critically important!)
  • ❌ β€œI’ll create a separate index on each field” (consolidate!)
  • ❌ β€œ32 columns in an index is normal practice” (max 3-5 in practice)

Related topics:

  • [[What are indexes for]] β†’ when to create indexes
  • [[How does B-tree index work]] β†’ internal structure
  • [[What is index cardinality]] β†’ column order by cardinality
  • [[What are the disadvantages of indexes]] β†’ write cost