Question 6 · Section 1

What is index cardinality?



🟢 Junior Level

Cardinality is the number of unique values in a column of a table.

Why this matters: PostgreSQL’s query planner uses cardinality to decide whether to use an index or read the entire table. High cardinality = many unique values = the index is efficient. Low cardinality = few unique values = Seq Scan will likely be faster than an index.

Simple analogy: Imagine a bag of marbles of different colors. Cardinality is how many different colors are in the bag (not counting how many marbles of each color).

Main idea:

  • High cardinality = many unique values (e.g., email)
  • Low cardinality = few unique values (e.g., gender: M/F)
  • Indexes are more effective on fields with high cardinality

Example:

-- Table users (1,000,000 rows)

-- High cardinality
SELECT COUNT(DISTINCT email) FROM users;  -- ~1,000,000 (unique)

-- Low cardinality
SELECT COUNT(DISTINCT gender) FROM users; -- 2 (M and F)

-- Index on email — great choice ✅
-- Index on gender — poor choice ❌

When it matters:

  • When choosing fields for indexing
  • When analyzing why an index is not being used
  • When optimizing slow queries

🟡 Middle Level

Cardinality vs Selectivity

These concepts are related but distinct:

| Term | Definition | Example |
|------|------------|---------|
| Cardinality | Number of unique values | COUNT(DISTINCT city) = 500 |
| Selectivity | Cardinality / total row count | 500 / 1,000,000 = 0.0005 |

Higher cardinality → higher selectivity → a more effective index.

Why: High selectivity means the query will return few rows. Index Scan reads 3-4 pages of B-tree + 1 Heap page per found row. If there are 100,000 rows out of 1,000,000 — that’s 100,000 random Heap reads, which is slower than one sequential read of the entire table (Seq Scan).
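As a sanity check, selectivity can be computed by hand. A sketch assuming the users table from the examples, with a city column:

```sql
-- Manual selectivity check for one column.
-- COUNT(DISTINCT ...) scans the whole table: fine as a one-off check,
-- too heavy as a routine job on large tables.
SELECT
    COUNT(DISTINCT city)                     AS cardinality,
    COUNT(*)                                 AS total_rows,
    COUNT(DISTINCT city)::numeric / COUNT(*) AS selectivity
FROM users;
-- selectivity near 1 -> index-friendly; near 0 -> Seq Scan territory
```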

How PostgreSQL estimates cardinality

PostgreSQL collects statistics via the ANALYZE command (also run automatically by autovacuum):

-- View column statistics
SELECT
    attname as column_name,
    n_distinct,              -- Uniqueness estimate
    correlation              -- Correlation with physical order
FROM pg_stats
WHERE tablename = 'users'
  AND attname = 'email';

n_distinct:

  • If > 0 — absolute number of unique values
  • If < 0 (e.g., -1) — fraction of total (-1 = 100% unique)
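A negative n_distinct can be turned into an absolute estimate by multiplying by the table's row count. A sketch against the users example (the name-based join to pg_class ignores schemas, which is acceptable for a quick look):

```sql
-- Turn n_distinct into an estimated absolute distinct count:
-- negative values are a fraction of pg_class.reltuples
SELECT
    s.attname,
    s.n_distinct,
    CASE
        WHEN s.n_distinct >= 0 THEN s.n_distinct
        ELSE -s.n_distinct * c.reltuples  -- fraction * estimated row count
    END AS estimated_distinct
FROM pg_stats s
JOIN pg_class c ON c.relname = s.tablename
WHERE s.tablename = 'users';
```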

Impact on execution plan choice

-- High cardinality (email)
EXPLAIN SELECT * FROM users WHERE email = 'test@example.com';
-- Plan: Index Scan (will find 1 row out of 1M)

-- Low cardinality (status)
EXPLAIN SELECT * FROM users WHERE status = 'ACTIVE';
-- Plan: Seq Scan (will return 500,000 rows — index is not beneficial)

Switching threshold (approximate):

| % of rows in result | Typical plan |
|---------------------|--------------|
| < 1-5% | Index Scan |
| 5-20% | Bitmap Index Scan (collects TIDs into a bitmap, then reads the heap: fewer random I/Os than a plain Index Scan) |
| > 20% | Seq Scan |
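The middle band can be observed directly. A hedged sketch using the orders table from the skew example below; idx_orders_status and the plan shape are assumptions, since the actual plan depends on your data and current statistics:

```sql
-- A predicate matching a few percent of rows often lands in the middle
-- band. idx_orders_status is a hypothetical index on orders(status);
-- the plan below is one possible outcome, not a guaranteed one.
EXPLAIN SELECT * FROM orders WHERE status = 'PENDING';
-- Possible plan:
--   Bitmap Heap Scan on orders
--     Recheck Cond: (status = 'PENDING'::text)
--     ->  Bitmap Index Scan on idx_orders_status
--           Index Cond: (status = 'PENDING'::text)
```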

Data Skew

-- Table orders: 1,000,000 rows
SELECT status, COUNT(*)
FROM orders
GROUP BY status;

-- Result:
-- ACTIVE:   950,000 (95%)
-- PENDING:   40,000 (4%)
-- ERROR:     10,000 (1%)

-- Query 1: WHERE status = 'ERROR' → Index Scan (rare value)
-- Query 2: WHERE status = 'ACTIVE' → Seq Scan (common value)

**Why different plans:** PostgreSQL stores an MCV (Most Common Values) list: the most frequent values with their frequencies. For 'ERROR' (1%), the planner knows it will find ~10,000 rows out of 1M, so an Index Scan is beneficial. For 'ACTIVE' (95%) it expects 950,000 rows, which are cheaper to read via Seq Scan (one sequential pass vs 950,000 random reads).


Common mistakes

  1. Indexing fields with low cardinality
    -- ❌ Bad: only 2 values
    CREATE INDEX idx_users_gender ON users(gender);
    -- Seq Scan will still be chosen for most queries
    
  2. Ignoring composite cardinality
    -- Individually: city (500), age (80) — medium cardinality
    -- Together: (city, age) → high cardinality of the combination!
    CREATE INDEX idx_users_city_age ON users(city, age);
    
  3. Outdated statistics
    -- After a bulk insert, statistics are stale
    -- The planner may choose the wrong plan
    
    -- Solution: update statistics
    ANALYZE users;
    

Practical example

-- Check column cardinality
SELECT
    attname as column_name,
    n_distinct,
    CASE
        WHEN n_distinct < 0 THEN 'High (' || ABS(n_distinct) * 100 || '%)'
        WHEN n_distinct < 100 THEN 'Low'
        WHEN n_distinct < 10000 THEN 'Medium'
        ELSE 'High'
    END as cardinality_level
FROM pg_stats
WHERE tablename = 'users';

🔴 Senior Level

How Postgres estimates cardinality: Deep Dive

Statistics are stored in pg_statistic (accessed via pg_stats):

Key fields:

  1. n_distinct
    -- Uniqueness estimate
    -- Examples:
    -- -1.0 = 100% unique (each value is unique)
    -- -0.5 = 50% unique
    -- 100 = exactly 100 unique values
    
  2. most_common_vals (MCV)
    -- Top most frequent values and their frequency
    -- Example for status:
    -- MCV: ['ACTIVE', 'PENDING', 'ERROR']
    -- Freq: [0.95, 0.04, 0.01]
    
    -- The planner uses this for accurate estimates:
    -- WHERE status = 'ERROR' → Index Scan (1% of rows)
    -- WHERE status = 'ACTIVE' → Seq Scan (95% of rows)
    
  3. histogram_bounds
    -- Value distribution for non-MCV values
    -- Splits the range into buckets (100 by default)
    -- Used for estimating range selectivity
    
  4. correlation
    -- A number from -1 to 1
    -- Shows how much the physical row order
    -- matches the logical order of values in the index
    
    -- correlation = 1.0 → rows on disk are sorted same as in the index
    -- correlation = 0.0 → complete chaos (Random I/O)
    -- correlation = -1.0 → sorted in reverse order
    
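All four fields can be inspected side by side in one pg_stats query (orders.status as in the MCV example):

```sql
-- The four statistics fields for one column
SELECT
    n_distinct,
    most_common_vals,
    most_common_freqs,
    histogram_bounds,   -- NULL when the MCV list already covers all values
    correlation
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'status';
```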

Extended Statistics (PostgreSQL 10+)

Problem: PostgreSQL assumes column independence.

-- Query
SELECT * FROM cars
WHERE make = 'Honda' AND model = 'Civic';

-- Planner thinks:
-- P(make='Honda') = 0.1
-- P(model='Civic') = 0.05
-- P(both) = 0.1 * 0.05 = 0.005 (0.5%)

-- Reality:
-- If make='Honda', then model='Civic' is much more likely!
-- P(both) = 0.03 (3%) — 6x more!

Solution: Extended Statistics

-- 1. Functional Dependencies
-- If knowing the model, we know the make
CREATE STATISTICS s_make_model (dependencies)
ON make, model FROM cars;

-- 2. N-Distinct
-- Cardinality of column combinations
CREATE STATISTICS s_city_age (ndistinct)
ON city, age FROM users;

-- 3. MCV for combinations (PG 13+)
-- Most frequent value combinations
CREATE STATISTICS s_status_priority (mcv)
ON status, priority FROM orders;

-- Apply
ANALYZE cars;
ANALYZE users;
ANALYZE orders;

On PG 12 and below, the mcv type is unavailable — use dependencies and ndistinct.

Check:

-- View extended statistics (pg_stats_ext view, PG 12+)
SELECT * FROM pg_stats_ext
WHERE tablename = 'cars';

-- Specifically ndistinct
-- (PG 10-11: stored in pg_statistic_ext.stxndistinct;
--  PG 12+: moved to pg_statistic_ext_data, so query the view instead)
SELECT
    statistics_name,
    attnames,
    n_distinct
FROM pg_stats_ext
WHERE statistics_name = 's_city_age';

Impact of statistics_target

-- Default: 100 histogram buckets
SHOW default_statistics_target;

-- For columns with "irregular" distribution, increase accuracy
ALTER TABLE users ALTER COLUMN email SET STATISTICS 1000;
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;

-- Update statistics
ANALYZE users;
ANALYZE orders;

-- ⚠️ Cost: ANALYZE will take longer
-- But planning becomes more accurate

Edge Cases

  1. Correlation kills the index
    -- Check
    SELECT attname, correlation
    FROM pg_stats
    WHERE tablename = 'orders'
      AND attname = 'created_at';
    
    -- correlation = 0.05 → Random I/O
    -- Solution: CLUSTER (re-sort the table)
    CLUSTER orders USING idx_orders_created;
    -- ⚠️ Locks the table, and the ordering is one-off (new rows are not
    -- kept sorted). Do it during a maintenance window
    
  2. Outdated statistics
    -- Check: when was ANALYZE last run
    SELECT
        relname,
        last_analyze,
        last_autoanalyze,
        n_mod_since_analyze  -- Changes since last ANALYZE
    FROM pg_stat_user_tables
    WHERE n_mod_since_analyze > 100000;  -- Many changes!
    
    -- Solution
    ANALYZE VERBOSE orders;
    
  3. Partitioned Tables
    -- Statistics are collected per partition; autovacuum does not
    -- analyze the partitioned parent itself, so the planner may
    -- misestimate queries that span partitions
    -- Solution: ANALYZE the parent manually and increase
    -- statistics_target on key partitions
    ANALYZE orders;
    ALTER TABLE orders_y2024 ALTER COLUMN created_at SET STATISTICS 1000;
    

Performance

Impact on planning:

| statistics_target | Plan accuracy | ANALYZE time |
|-------------------|---------------|--------------|
| 10 (minimum) | Low | Fast |
| 100 (default) | Good | Normal |
| 1000 | Excellent | ~10x longer |
| 10000 | Maximum | ~100x longer |

Statistics memory footprint:

  • Each histogram bucket: ~24 bytes
  • With statistics_target = 100: ~2.4 KB per column
  • With 1000: ~24 KB per column
  • RAM impact: minimal

Production Experience

Real scenario #1: Data Skew kills performance

  • Log platform: 500M records
  • Column: level (DEBUG: 80%, INFO: 15%, ERROR: 5%)
  • Problem: ERROR queries were slow
  • Cause: statistics_target = 100, MCV list didn’t cover all levels
  • Solution:
    ALTER TABLE logs ALTER COLUMN level SET STATISTICS 1000;
    ANALYZE logs;
    
  • Result: Planner accurately estimates selectivity, Index Scan for ERROR

Real scenario #2: Correlated Columns

  • E-commerce: orders by (country, city)
  • Problem: Planner misestimated by 2 orders of magnitude
  • Cause: city heavily depends on country (Moscow only in Russia)
  • Solution:
    CREATE STATISTICS s_country_city (ndistinct, dependencies)
    ON country, city FROM orders;
    ANALYZE orders;
    
  • Result: Row estimation became accurate, planner chose the correct Join

Monitoring

-- 1. Check statistics freshness
SELECT
    relname,
    last_analyze,
    last_autoanalyze,
    n_mod_since_analyze,
    CASE
        WHEN n_mod_since_analyze > 100000 THEN 'NEEDS ANALYZE'
        ELSE 'OK'
    END as status
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC;

-- 2. Check correlation
SELECT
    attname,
    ROUND(correlation::numeric, 3) as correlation,
    CASE
        WHEN ABS(correlation) < 0.3 THEN 'LOW - Random I/O'
        WHEN ABS(correlation) < 0.7 THEN 'MEDIUM'
        ELSE 'HIGH - Sequential I/O'
    END as assessment
FROM pg_stats
WHERE tablename = 'orders'
ORDER BY ABS(correlation) ASC;

-- 3. Check n_distinct
SELECT
    attname,
    n_distinct,
    CASE
        WHEN n_distinct < 0 THEN 'High (' || ABS(n_distinct) * 100 || '%)'
        WHEN n_distinct < 100 THEN 'Low'
        WHEN n_distinct < 10000 THEN 'Medium'
        ELSE 'High'
    END as cardinality
FROM pg_stats
WHERE tablename = 'users';

-- 4. Check extended statistics (pg_stats_ext view, PG 12+)
SELECT
    statistics_name,
    attnames,
    n_distinct,
    dependencies
FROM pg_stats_ext;

Best Practices

  1. Monitor n_mod_since_analyze — if > 10% of the table, run ANALYZE
  2. Increase statistics_target for columns with abnormal distribution
  3. Use Extended Statistics for correlated columns
  4. Check correlation — < 0.3 means Random I/O
  5. Don’t forget ANALYZE after bulk insert/delete
  6. MCV list should cover > 80% of data for accurate planning
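The last point is checkable: most_common_freqs stores per-value frequencies, so their sum is the fraction of rows the MCV list covers. A sketch (the 0.8 threshold mirrors the rule of thumb above):

```sql
-- Fraction of rows covered by the MCV list, per column
SELECT
    tablename,
    attname,
    ROUND(
        (SELECT SUM(f)::numeric FROM unnest(most_common_freqs) AS f)
    , 2) AS mcv_coverage
FROM pg_stats
WHERE tablename = 'orders'
  AND most_common_freqs IS NOT NULL;
-- mcv_coverage < 0.8 on a skewed column: consider raising statistics_target
```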

Summary for Senior

  • Cardinality is a key input parameter for the optimizer
  • n_distinct, MCV, histogram, correlation — 4 pillars of statistics
  • Data Skew — a common cause of wrong plans; check MCV
  • Extended Statistics (PG 10+) solves the correlated columns problem
  • correlation < 0.3 → Random I/O → consider CLUSTER
  • statistics_target — balance between accuracy and ANALYZE time
  • Regularly check pg_stats and pg_statistic_ext

🎯 Interview Cheat Sheet

Must know:

  • Cardinality = number of unique values; Selectivity = cardinality / total rows
  • n_distinct: -1 = 100% unique, 500 = exactly 500 unique
  • MCV (Most Common Values): top frequent values + frequency — for skewed data
  • Correlation: -1..1 — how much physical order matches logical order
  • correlation = 0 → Random I/O (chaos on disk), correlation = 1 → Sequential I/O
  • Extended Statistics (PG 10+): dependencies, ndistinct, mcv for correlated columns
  • statistics_target = 100 (default), 1000 — for skewed columns (10x longer ANALYZE)

Common follow-up questions:

  • “Why isn’t the index on status (ACTIVE/PENDING) used?” → Low cardinality, Seq Scan is faster
  • “What happens if statistics_target is too low?” → Planner misestimates rows
  • “How to solve correlated columns problem (country, city)?” → CREATE STATISTICS ... (dependencies)
  • “What does correlation = 0.05 mean?” → Nearly chaotic order → CLUSTER or new index

Red flags (DO NOT say):

  • ❌ “An index on a field with 2 values (M/F) will speed up search” (Seq Scan is still faster)
  • ❌ “n_distinct = -1 means no unique values” (the opposite — 100% unique!)
  • ❌ “statistics_target doesn’t affect performance” (it affects ANALYZE time and plan accuracy)

Related topics:

  • [[What are indexes for]] → selectivity and cardinality
  • [[Why is ANALYZE needed]] → statistics collection, n_distinct, MCV, histogram
  • [[What is a composite index]] → composite cardinality
  • [[How to optimize slow queries]] → statistics impact on plan