What is index cardinality?
🟢 Junior Level
Cardinality is the number of unique values in a column of a table.
Why this matters: PostgreSQL’s query planner uses cardinality to decide whether to use an index or read the entire table. High cardinality = many unique values = the index is efficient. Low cardinality = few unique values = Seq Scan will likely be faster than an index.
Simple analogy: Imagine a bag of marbles of different colors. Cardinality is how many different colors are in the bag (not counting how many marbles of each color).
Main idea:
- High cardinality = many unique values (e.g., email)
- Low cardinality = few unique values (e.g., gender: M/F)
- Indexes are more effective on fields with high cardinality
Example:
-- Table users (1,000,000 rows)
-- High cardinality
SELECT COUNT(DISTINCT email) FROM users; -- ~1,000,000 (unique)
-- Low cardinality
SELECT COUNT(DISTINCT gender) FROM users; -- 2 (M and F)
-- Index on email — great choice ✅
-- Index on gender — poor choice ❌
When it matters:
- When choosing fields for indexing
- When analyzing why an index is not being used
- When optimizing slow queries
🟡 Middle Level
Cardinality vs Selectivity
These concepts are related but distinct:
| Term | Definition | Example |
|---|---|---|
| Cardinality | Number of unique values | COUNT(DISTINCT city) = 500 |
| Selectivity | Cardinality / Total count | 500 / 1,000,000 = 0.0005 |
Higher cardinality → higher selectivity → a more effective index.
Why: High selectivity means the query will return few rows. Index Scan reads 3-4 pages of B-tree + 1 Heap page per found row. If there are 100,000 rows out of 1,000,000 — that’s 100,000 random Heap reads, which is slower than one sequential read of the entire table (Seq Scan).
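The selectivity figure from the table above can be computed directly (a sketch, assuming the same `users` table with a `city` column):

```sql
-- Selectivity = distinct values / total rows
SELECT COUNT(DISTINCT city)::numeric / COUNT(*) AS selectivity
FROM users;
-- With 500 cities over 1,000,000 rows: 0.0005
```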
How PostgreSQL estimates cardinality
PostgreSQL collects statistics via the ANALYZE command (runs automatically):
-- View column statistics
SELECT
attname as column_name,
n_distinct, -- Uniqueness estimate
correlation -- Correlation with physical order
FROM pg_stats
WHERE tablename = 'users'
AND attname = 'email';
n_distinct:
- If > 0 — absolute number of unique values
- If < 0 (e.g., -1) — fraction of total (-1 = 100% unique)
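Since `n_distinct` can be either an absolute count or a negative fraction, a query like the following normalizes it into an estimated distinct count (a sketch; `reltuples` in `pg_class` is itself an estimate maintained by ANALYZE):

```sql
-- Estimated distinct values, normalizing negative n_distinct
SELECT s.attname,
       s.n_distinct,
       CASE
         WHEN s.n_distinct >= 0 THEN s.n_distinct
         ELSE -s.n_distinct * c.reltuples  -- fraction × total rows
       END AS est_distinct
FROM pg_stats s
JOIN pg_class c ON c.relname = s.tablename
WHERE s.tablename = 'users';
```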
Impact on execution plan choice
-- High cardinality (email)
EXPLAIN SELECT * FROM users WHERE email = 'test@example.com';
-- Plan: Index Scan (will find 1 row out of 1M)
-- Low cardinality (status)
EXPLAIN SELECT * FROM users WHERE status = 'ACTIVE';
-- Plan: Seq Scan (will return 500,000 rows — index is not beneficial)
Switching threshold (approximate):
| % rows in result | Plan |
|---|---|
| < 1-5% | Index Scan |
| 5-20% | Bitmap Index Scan (collects TIDs into a bitmap, then reads heap — fewer random I/Os than Index Scan) |
| > 20% | Seq Scan |
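In the middle band, the plan has a characteristic two-node shape (a sketch; real output depends on your data and statistics, and `idx_users_status` is a hypothetical index name):

```sql
EXPLAIN SELECT * FROM users WHERE status = 'PENDING';
-- Possible plan for a mid-selectivity predicate:
-- Bitmap Heap Scan on users
--   Recheck Cond: (status = 'PENDING')
--   -> Bitmap Index Scan on idx_users_status
--        Index Cond: (status = 'PENDING')
```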
Data Skew
-- Table orders: 1,000,000 rows
SELECT status, COUNT(*)
FROM orders
GROUP BY status;
-- Result:
-- ACTIVE: 950,000 (95%)
-- PENDING: 40,000 (4%)
-- ERROR: 10,000 (1%)
-- Query 1: WHERE status = 'ERROR' → Index Scan (rare value)
-- Query 2: WHERE status = 'ACTIVE' → Seq Scan (common value)
**Why different plans:** PostgreSQL stores MCV (Most Common Values) — a list of the most frequent values with their frequencies. For 'ERROR' (1%), the planner knows it will find ~10,000 rows out of 1M — Index Scan is beneficial. For 'ACTIVE' (95%) — 950,000 rows, which is cheaper to read via Seq Scan (one sequential read vs 950,000 random reads).
PostgreSQL stores Most Common Values (MCV) and can choose different plans for different values!
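The MCV list and its frequencies can be inspected directly in `pg_stats` (a sketch, using the `orders.status` distribution from the example above):

```sql
-- Inspect the MCV list and frequencies collected by ANALYZE
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
-- e.g. most_common_vals  = {ACTIVE,PENDING,ERROR}
--      most_common_freqs = {0.95,0.04,0.01}
```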
Common mistakes
- Indexing fields with low cardinality
-- ❌ Bad: only 2 values
CREATE INDEX idx_users_gender ON users(gender);
-- Seq Scan will still be chosen for most queries

- Ignoring composite cardinality

-- Individually: city (500), age (80) — medium cardinality
-- Together: (city, age) → high cardinality of the combination!
CREATE INDEX idx_users_city_age ON users(city, age);

- Outdated statistics

-- After a bulk insert, statistics are stale
-- The planner may choose the wrong plan
-- Solution: update statistics
ANALYZE users;
Practical example
-- Check column cardinality
SELECT
attname as column_name,
n_distinct,
CASE
WHEN n_distinct < 0 THEN 'High (' || ABS(n_distinct) * 100 || '%)'
WHEN n_distinct < 100 THEN 'Low'
WHEN n_distinct < 10000 THEN 'Medium'
ELSE 'High'
END as cardinality_level
FROM pg_stats
WHERE tablename = 'users';
🔴 Senior Level
How Postgres estimates cardinality: Deep Dive
Statistics are stored in pg_statistic (accessed via pg_stats):
Key fields:
- n_distinct

-- Uniqueness estimate
-- Examples:
-- -1.0 = 100% unique (each value is unique)
-- -0.5 = 50% unique
-- 100 = exactly 100 unique values

- most_common_vals (MCV)

-- Top most frequent values and their frequencies
-- Example for status:
-- MCV: ['ACTIVE', 'PENDING', 'ERROR']
-- Freq: [0.95, 0.04, 0.01]
-- The planner uses this for accurate estimates:
-- WHERE status = 'ERROR' → Index Scan (1% of rows)
-- WHERE status = 'ACTIVE' → Seq Scan (95% of rows)

- histogram_bounds

-- Value distribution for non-MCV values
-- Splits the range into buckets (100 by default)
-- Used for estimating range selectivity

- correlation

-- A number from -1 to 1
-- Shows how closely the physical row order
-- matches the logical order of values in the index
-- correlation = 1.0 → rows on disk are sorted the same as in the index
-- correlation = 0.0 → no ordering at all (Random I/O)
-- correlation = -1.0 → sorted in reverse order
Extended Statistics (PostgreSQL 10+)
Problem: PostgreSQL assumes column independence.
-- Query
SELECT * FROM cars
WHERE make = 'Honda' AND model = 'Civic';
-- Planner thinks:
-- P(make='Honda') = 0.1
-- P(model='Civic') = 0.05
-- P(both) = 0.1 * 0.05 = 0.005 (0.5%)
-- Reality:
-- If make='Honda', then model='Civic' is much more likely!
-- P(both) = 0.03 (3%) — 6x more!
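One way to observe this misestimate is to compare the planner's row estimate with the actual row count (a sketch; the `cars` table and the numbers are illustrative):

```sql
EXPLAIN ANALYZE
SELECT * FROM cars WHERE make = 'Honda' AND model = 'Civic';
-- Compare the estimate with reality, e.g.:
-- Seq Scan on cars (rows=5000 ...) (actual rows=30000 ...)
-- Estimated 0.5% vs actual 3%: the independence assumption failed
```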
Solution: Extended Statistics
-- 1. Functional Dependencies
-- If knowing the model, we know the make
CREATE STATISTICS s_make_model (dependencies)
ON make, model FROM cars;
-- 2. N-Distinct
-- Cardinality of column combinations
CREATE STATISTICS s_city_age (ndistinct)
ON city, age FROM users;
-- 3. MCV for combinations (PG 13+)
-- Most frequent value combinations
CREATE STATISTICS s_status_priority (mcv)
ON status, priority FROM orders;
-- Apply
ANALYZE cars;
ANALYZE users;
ANALYZE orders;
On PG 12 and below, the `mcv` type is unavailable; use `dependencies` and `ndistinct`.
Check:
-- View extended statistics
SELECT * FROM pg_stats_ext
WHERE tablename = 'cars';
-- Specifically ndistinct
SELECT
stxname,
stxkeys,
stxndistinct
FROM pg_statistic_ext
WHERE stxname = 's_city_age';
Impact of statistics_target
-- Default: 100 histogram buckets
SHOW default_statistics_target;
-- For columns with "irregular" distribution, increase accuracy
ALTER TABLE users ALTER COLUMN email SET STATISTICS 1000;
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
-- Update statistics
ANALYZE users;
ANALYZE orders;
-- ⚠️ Cost: ANALYZE will take longer
-- But planning becomes more accurate
Edge Cases
- Correlation kills the index

-- Check
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'created_at';
-- correlation = 0.05 → Random I/O
-- Solution: CLUSTER (re-sort the table)
CLUSTER orders USING idx_orders_created;
-- ⚠️ Locks the table! Do it during a maintenance window

- Outdated statistics

-- Check: when ANALYZE last ran
SELECT
relname,
last_analyze,
last_autoanalyze,
n_mod_since_analyze -- Changes since last ANALYZE
FROM pg_stat_user_tables
WHERE n_mod_since_analyze > 100000; -- Many changes!
-- Solution
ANALYZE VERBOSE orders;

- Partitioned Tables

-- For partitioned tables, statistics are collected on each partition
-- The planner may misestimate
-- Solution: increase statistics_target on key partitions
ALTER TABLE orders_y2024 ALTER COLUMN created_at SET STATISTICS 1000;
Performance
Impact on planning:
| statistics_target | Plan accuracy | ANALYZE time |
|---|---|---|
| 10 (minimum) | Low | Fast |
| 100 (default) | Good | Normal |
| 1000 | Excellent | 10x longer |
| 10000 | Maximum | 100x longer |
Statistics memory footprint:
- Each histogram bucket: ~24 bytes
- With statistics_target = 100: ~2.4 KB per column
- With 1000: ~24 KB per column
- RAM impact: minimal
Production Experience
Real scenario #1: Data Skew kills performance
- Log platform: 500M records
- Table: `level` (DEBUG: 80%, INFO: 15%, ERROR: 5%)
- Problem: ERROR queries were slow
- Cause: statistics_target = 100, the MCV list didn't cover all levels
- Solution:

ALTER TABLE logs ALTER COLUMN level SET STATISTICS 1000;
ANALYZE logs;

- Result: the planner accurately estimates selectivity and uses an Index Scan for ERROR
Real scenario #2: Correlated Columns
- E-commerce: orders by `(country, city)`
- Problem: the planner misestimated row counts by two orders of magnitude
- Cause: city strongly depends on country (Moscow appears only in Russia)
- Solution:

CREATE STATISTICS s_country_city (ndistinct, dependencies)
ON country, city FROM orders;
ANALYZE orders;

- Result: row estimation became accurate and the planner chose the correct Join
Monitoring
-- 1. Check statistics freshness
SELECT
relname,
last_analyze,
last_autoanalyze,
n_mod_since_analyze,
CASE
WHEN n_mod_since_analyze > 100000 THEN 'NEEDS ANALYZE'
ELSE 'OK'
END as status
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC;
-- 2. Check correlation
SELECT
attname,
ROUND(correlation::numeric, 3) as correlation,
CASE
WHEN ABS(correlation) < 0.3 THEN 'LOW - Random I/O'
WHEN ABS(correlation) < 0.7 THEN 'MEDIUM'
ELSE 'HIGH - Sequential I/O'
END as assessment
FROM pg_stats
WHERE tablename = 'orders'
ORDER BY ABS(correlation) ASC;
-- 3. Check n_distinct
SELECT
attname,
n_distinct,
CASE
WHEN n_distinct < 0 THEN 'High (' || ABS(n_distinct) * 100 || '%)'
WHEN n_distinct < 100 THEN 'Low'
WHEN n_distinct < 10000 THEN 'Medium'
ELSE 'High'
END as cardinality
FROM pg_stats
WHERE tablename = 'users';
-- 4. Check extended statistics
SELECT
stxname,
stxkeys,
stxndistinct,
stxdependencies
FROM pg_statistic_ext;
Best Practices
- Monitor `n_mod_since_analyze`: if it exceeds ~10% of the table, run ANALYZE
- Increase `statistics_target` for columns with skewed distributions
- Use Extended Statistics for correlated columns
- Check correlation: below 0.3 means Random I/O
- Don’t forget ANALYZE after bulk insert/delete
- MCV list should cover > 80% of data for accurate planning
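The MCV coverage from the last point can be checked per column by summing `most_common_freqs` (a sketch; columns whose coverage is far below 0.8 are candidates for a higher statistics target):

```sql
-- Fraction of rows covered by each column's MCV list
SELECT attname,
       (SELECT SUM(f) FROM unnest(most_common_freqs) AS f) AS mcv_coverage
FROM pg_stats
WHERE tablename = 'orders'
  AND most_common_freqs IS NOT NULL
ORDER BY mcv_coverage;
```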
Summary for Senior
- Cardinality is a key input parameter for the optimizer
- n_distinct, MCV, histogram, correlation — 4 pillars of statistics
- Data Skew — a common cause of wrong plans; check MCV
- Extended Statistics (PG 10+) solves the correlated columns problem
- correlation < 0.3 → Random I/O → consider CLUSTER
- statistics_target — balance between accuracy and ANALYZE time
- Regularly check `pg_stats` and `pg_statistic_ext`
🎯 Interview Cheat Sheet
Must know:
- Cardinality = number of unique values; Selectivity = cardinality / total rows
- n_distinct: `-1` = 100% unique, `500` = exactly 500 unique values
- MCV (Most Common Values): top frequent values + their frequencies, used for skewed data
- Correlation: `-1..1`, how closely physical order matches logical order
- correlation = 0 → Random I/O (no ordering on disk), correlation = 1 → Sequential I/O
- Extended Statistics (PG 10+): `dependencies`, `ndistinct`, `mcv` for correlated columns
- statistics_target = 100 (default); 1000 for skewed columns (≈10x longer ANALYZE)
Common follow-up questions:
- “Why isn’t the index on status (ACTIVE/PENDING) used?” → Low cardinality, Seq Scan is faster
- “What happens if statistics_target is too low?” → Planner misestimates rows
- “How to solve the correlated columns problem (country, city)?” → `CREATE STATISTICS ... (dependencies)`
- “What does correlation = 0.05 mean?” → Nearly random physical order → consider CLUSTER or a different index
Red flags (DO NOT say):
- ❌ “An index on a field with 2 values (M/F) will speed up search” (Seq Scan is still faster)
- ❌ “n_distinct = -1 means no unique values” (the opposite — 100% unique!)
- ❌ “statistics_target doesn’t affect performance” (it affects ANALYZE time and plan accuracy)
Related topics:
- [[What are indexes for]] → selectivity and cardinality
- [[Why is ANALYZE needed]] → statistics collection, n_distinct, MCV, histogram
- [[What is a composite index]] → composite cardinality
- [[How to optimize slow queries]] → statistics impact on plan