Question 6 · Section 1

What is index cardinality?



🟢 Junior Level

Cardinality is the number of unique values in a column of a table.

Why this matters: PostgreSQL’s query planner uses cardinality to decide whether to use an index or read the entire table. High cardinality = many unique values = the index is efficient. Low cardinality = few unique values = Seq Scan will likely be faster than an index.

Simple analogy: Imagine a bag of marbles of different colors. Cardinality is how many different colors are in the bag (not counting how many marbles of each color).

Main idea:

  • High cardinality = many unique values (e.g., email)
  • Low cardinality = few unique values (e.g., gender: M/F)
  • Indexes are more effective on fields with high cardinality

Example:

-- Table users (1,000,000 rows)

-- High cardinality
SELECT COUNT(DISTINCT email) FROM users;  -- ~1,000,000 (unique)

-- Low cardinality
SELECT COUNT(DISTINCT gender) FROM users; -- 2 (M and F)

-- Index on email — great choice ✅
-- Index on gender — poor choice ❌

When it matters:

  • When choosing fields for indexing
  • When analyzing why an index is not being used
  • When optimizing slow queries

🟡 Middle Level

Cardinality vs Selectivity

These concepts are related but distinct:

| Term | Definition | Example |
|------|------------|---------|
| Cardinality | Number of unique values | COUNT(DISTINCT city) = 500 |
| Selectivity | Cardinality / total row count | 500 / 1,000,000 = 0.0005 |

Higher cardinality → higher selectivity → a more effective index.

Why: High selectivity means the query will return few rows. Index Scan reads 3-4 pages of B-tree + 1 Heap page per found row. If there are 100,000 rows out of 1,000,000 — that’s 100,000 random Heap reads, which is slower than one sequential read of the entire table (Seq Scan).
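As a sanity check, selectivity can be computed by hand. A sketch assuming the users table from the examples, with a city column:

```sql
-- Manual selectivity check for one column.
-- COUNT(DISTINCT ...) scans the whole table: fine as a one-off check,
-- too heavy as a routine job on large tables.
SELECT
    COUNT(DISTINCT city)                     AS cardinality,
    COUNT(*)                                 AS total_rows,
    COUNT(DISTINCT city)::numeric / COUNT(*) AS selectivity
FROM users;
-- selectivity near 1 -> index-friendly; near 0 -> Seq Scan territory
```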

How PostgreSQL estimates cardinality

PostgreSQL collects statistics via the ANALYZE command (also run automatically by autovacuum):

-- View column statistics
SELECT
    attname as column_name,
    n_distinct,              -- Uniqueness estimate
    correlation              -- Correlation with physical order
FROM pg_stats
WHERE tablename = 'users'
  AND attname = 'email';

n_distinct:

  • If > 0 — absolute number of unique values
  • If < 0 (e.g., -1) — fraction of total (-1 = 100% unique)
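A negative n_distinct can be turned into an absolute estimate by multiplying by the table's row count. A sketch against the users example (the name-based join to pg_class ignores schemas, which is acceptable for a quick look):

```sql
-- Turn n_distinct into an estimated absolute distinct count:
-- negative values are a fraction of pg_class.reltuples
SELECT
    s.attname,
    s.n_distinct,
    CASE
        WHEN s.n_distinct >= 0 THEN s.n_distinct
        ELSE -s.n_distinct * c.reltuples  -- fraction * estimated row count
    END AS estimated_distinct
FROM pg_stats s
JOIN pg_class c ON c.relname = s.tablename
WHERE s.tablename = 'users';
```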

Impact on execution plan choice

-- High cardinality (email)
EXPLAIN SELECT * FROM users WHERE email = 'test@example.com';
-- Plan: Index Scan (will find 1 row out of 1M)

-- Low cardinality (status)
EXPLAIN SELECT * FROM users WHERE status = 'ACTIVE';
-- Plan: Seq Scan (will return 500,000 rows — index is not beneficial)

Switching threshold (approximate):

| % of rows in result | Typical plan |
|---------------------|--------------|
| < 1-5% | Index Scan |
| 5-20% | Bitmap Index Scan (collects TIDs into a bitmap, then reads the heap: fewer random I/Os than a plain Index Scan) |
| > 20% | Seq Scan |
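The middle band can be observed directly. A hedged sketch using the orders table from the skew example below; idx_orders_status and the plan shape are assumptions, since the actual plan depends on your data and current statistics:

```sql
-- A predicate matching a few percent of rows often lands in the middle
-- band. idx_orders_status is a hypothetical index on orders(status);
-- the plan below is one possible outcome, not a guaranteed one.
EXPLAIN SELECT * FROM orders WHERE status = 'PENDING';
-- Possible plan:
--   Bitmap Heap Scan on orders
--     Recheck Cond: (status = 'PENDING'::text)
--     ->  Bitmap Index Scan on idx_orders_status
--           Index Cond: (status = 'PENDING'::text)
```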

Data Skew

-- Table orders: 1,000,000 rows
SELECT status, COUNT(*)
FROM orders
GROUP BY status;

-- Result:
-- ACTIVE:   950,000 (95%)
-- PENDING:   40,000 (4%)
-- ERROR:     10,000 (1%)

-- Query 1: WHERE status = 'ERROR' → Index Scan (rare value)
-- Query 2: WHERE status = 'ACTIVE' → Seq Scan (common value)

**Why different plans:** PostgreSQL stores an MCV (Most Common Values) list: the most frequent values with their frequencies. For 'ERROR' (1%), the planner knows it will find ~10,000 rows out of 1M, so an Index Scan is beneficial. For 'ACTIVE' (95%) it expects 950,000 rows, which are cheaper to read via Seq Scan (one sequential pass vs 950,000 random reads).


Common mistakes

  1. Indexing fields with low cardinality
    -- ❌ Bad: only 2 values
    CREATE INDEX idx_users_gender ON users(gender);
    -- Seq Scan will still be chosen for most queries
    
  2. Ignoring composite cardinality
    -- Individually: city (500), age (80) — medium cardinality
    -- Together: (city, age) → high cardinality of the combination!
    CREATE INDEX idx_users_city_age ON users(city, age);
    
  3. Outdated statistics
    -- After a bulk insert, statistics are stale
    -- The planner may choose the wrong plan
    
    -- Solution: update statistics
    ANALYZE users;
    

Practical example

-- Check column cardinality
SELECT
    attname as column_name,
    n_distinct,
    CASE
        WHEN n_distinct < 0 THEN 'High (' || ABS(n_distinct) * 100 || '%)'
        WHEN n_distinct < 100 THEN 'Low'
        WHEN n_distinct < 10000 THEN 'Medium'
        ELSE 'High'
    END as cardinality_level
FROM pg_stats
WHERE tablename = 'users';

🔴 Senior Level

How Postgres estimates cardinality: Deep Dive

Statistics are stored in pg_statistic (accessed via pg_stats):

Key fields:

  1. n_distinct
    -- Uniqueness estimate
    -- Examples:
    -- -1.0 = 100% unique (each value is unique)
    -- -0.5 = 50% unique
    -- 100 = exactly 100 unique values
    
  2. most_common_vals (MCV)
    -- Top most frequent values and their frequency
    -- Example for status:
    -- MCV: ['ACTIVE', 'PENDING', 'ERROR']
    -- Freq: [0.95, 0.04, 0.01]
    
    -- The planner uses this for accurate estimates:
    -- WHERE status = 'ERROR' → Index Scan (1% of rows)
    -- WHERE status = 'ACTIVE' → Seq Scan (95% of rows)
    
  3. histogram_bounds
    -- Value distribution for non-MCV values
    -- Splits the range into buckets (100 by default)
    -- Used for estimating range selectivity
    
  4. correlation
    -- A number from -1 to 1
    -- Shows how much the physical row order
    -- matches the logical order of values in the index
    
    -- correlation = 1.0 → rows on disk are sorted same as in the index
    -- correlation = 0.0 → complete chaos (Random I/O)
    -- correlation = -1.0 → sorted in reverse order
    
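All four fields can be inspected side by side in one pg_stats query (orders.status as in the MCV example):

```sql
-- The four statistics fields for one column
SELECT
    n_distinct,
    most_common_vals,
    most_common_freqs,
    histogram_bounds,   -- NULL when the MCV list already covers all values
    correlation
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'status';
```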

Extended Statistics (PostgreSQL 10+)

Problem: PostgreSQL assumes column independence.

-- Query
SELECT * FROM cars
WHERE make = 'Honda' AND model = 'Civic';

-- Planner thinks:
-- P(make='Honda') = 0.1
-- P(model='Civic') = 0.05
-- P(both) = 0.1 * 0.05 = 0.005 (0.5%)

-- Reality:
-- If make='Honda', then model='Civic' is much more likely!
-- P(both) = 0.03 (3%) — 6x more!

Solution: Extended Statistics

-- 1. Functional Dependencies
-- If knowing the model, we know the make
CREATE STATISTICS s_make_model (dependencies)
ON make, model FROM cars;

-- 2. N-Distinct
-- Cardinality of column combinations
CREATE STATISTICS s_city_age (ndistinct)
ON city, age FROM users;

-- 3. MCV for combinations (PG 13+)
-- Most frequent value combinations
CREATE STATISTICS s_status_priority (mcv)
ON status, priority FROM orders;

-- Apply
ANALYZE cars;
ANALYZE users;
ANALYZE orders;

On PG 12 and below, the mcv type is unavailable — use dependencies and ndistinct.

Check:

-- View extended statistics (pg_stats_ext view, PG 12+)
SELECT * FROM pg_stats_ext
WHERE tablename = 'cars';

-- Specifically ndistinct
-- (PG 10-11: stored in pg_statistic_ext.stxndistinct;
--  PG 12+: moved to pg_statistic_ext_data, so query the view instead)
SELECT
    statistics_name,
    attnames,
    n_distinct
FROM pg_stats_ext
WHERE statistics_name = 's_city_age';

Impact of statistics_target

-- Default: 100 histogram buckets
SHOW default_statistics_target;

-- For columns with "irregular" distribution, increase accuracy
ALTER TABLE users ALTER COLUMN email SET STATISTICS 1000;
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;

-- Update statistics
ANALYZE users;
ANALYZE orders;

-- ⚠️ Cost: ANALYZE will take longer
-- But planning becomes more accurate

Edge Cases

  1. Correlation kills the index
    -- Check
    SELECT attname, correlation
    FROM pg_stats
    WHERE tablename = 'orders'
      AND attname = 'created_at';
    
    -- correlation = 0.05 → Random I/O
    -- Solution: CLUSTER (re-sort the table)
    CLUSTER orders USING idx_orders_created;
    -- ⚠️ Locks the table, and the ordering is one-off (new rows are not
    -- kept sorted). Do it during a maintenance window
    
  2. Outdated statistics
    -- Check: when was ANALYZE last run
    SELECT
        relname,
        last_analyze,
        last_autoanalyze,
        n_mod_since_analyze  -- Changes since last ANALYZE
    FROM pg_stat_user_tables
    WHERE n_mod_since_analyze > 100000;  -- Many changes!
    
    -- Solution
    ANALYZE VERBOSE orders;
    
  3. Partitioned Tables
    -- Statistics are collected per partition; autovacuum does not
    -- analyze the partitioned parent itself, so the planner may
    -- misestimate queries that span partitions
    -- Solution: ANALYZE the parent manually and increase
    -- statistics_target on key partitions
    ANALYZE orders;
    ALTER TABLE orders_y2024 ALTER COLUMN created_at SET STATISTICS 1000;
    

Performance

Impact on planning:

| statistics_target | Plan accuracy | ANALYZE time |
|-------------------|---------------|--------------|
| 10 (minimum) | Low | Fast |
| 100 (default) | Good | Normal |
| 1000 | Excellent | ~10x longer |
| 10000 | Maximum | ~100x longer |

Statistics memory footprint:

  • Each histogram bucket: ~24 bytes
  • With statistics_target = 100: ~2.4 KB per column
  • With 1000: ~24 KB per column
  • RAM impact: minimal

Production Experience

Real scenario #1: Data Skew kills performance

  • Log platform: 500M records
  • Column: level (DEBUG: 80%, INFO: 15%, ERROR: 5%)
  • Problem: ERROR queries were slow
  • Cause: statistics_target = 100, MCV list didn’t cover all levels
  • Solution:
    ALTER TABLE logs ALTER COLUMN level SET STATISTICS 1000;
    ANALYZE logs;
    
  • Result: Planner accurately estimates selectivity, Index Scan for ERROR

Real scenario #2: Correlated Columns

  • E-commerce: orders by (country, city)
  • Problem: Planner misestimated by 2 orders of magnitude
  • Cause: city heavily depends on country (Moscow only in Russia)
  • Solution:
    CREATE STATISTICS s_country_city (ndistinct, dependencies)
    ON country, city FROM orders;
    ANALYZE orders;
    
  • Result: Row estimation became accurate, planner chose the correct Join

Monitoring

-- 1. Check statistics freshness
SELECT
    relname,
    last_analyze,
    last_autoanalyze,
    n_mod_since_analyze,
    CASE
        WHEN n_mod_since_analyze > 100000 THEN 'NEEDS ANALYZE'
        ELSE 'OK'
    END as status
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC;

-- 2. Check correlation
SELECT
    attname,
    ROUND(correlation::numeric, 3) as correlation,
    CASE
        WHEN ABS(correlation) < 0.3 THEN 'LOW - Random I/O'
        WHEN ABS(correlation) < 0.7 THEN 'MEDIUM'
        ELSE 'HIGH - Sequential I/O'
    END as assessment
FROM pg_stats
WHERE tablename = 'orders'
ORDER BY ABS(correlation) ASC;

-- 3. Check n_distinct
SELECT
    attname,
    n_distinct,
    CASE
        WHEN n_distinct < 0 THEN 'High (' || ABS(n_distinct) * 100 || '%)'
        WHEN n_distinct < 100 THEN 'Low'
        WHEN n_distinct < 10000 THEN 'Medium'
        ELSE 'High'
    END as cardinality
FROM pg_stats
WHERE tablename = 'users';

-- 4. Check extended statistics (pg_stats_ext view, PG 12+)
SELECT
    statistics_name,
    attnames,
    n_distinct,
    dependencies
FROM pg_stats_ext;

Best Practices

  1. Monitor n_mod_since_analyze — if > 10% of the table, run ANALYZE
  2. Increase statistics_target for columns with abnormal distribution
  3. Use Extended Statistics for correlated columns
  4. Check correlation — < 0.3 means Random I/O
  5. Don’t forget ANALYZE after bulk insert/delete
  6. MCV list should cover > 80% of data for accurate planning
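The last point is checkable: most_common_freqs stores per-value frequencies, so their sum is the fraction of rows the MCV list covers. A sketch (the 0.8 threshold mirrors the rule of thumb above):

```sql
-- Fraction of rows covered by the MCV list, per column
SELECT
    tablename,
    attname,
    ROUND(
        (SELECT SUM(f)::numeric FROM unnest(most_common_freqs) AS f)
    , 2) AS mcv_coverage
FROM pg_stats
WHERE tablename = 'orders'
  AND most_common_freqs IS NOT NULL;
-- mcv_coverage < 0.8 on a skewed column: consider raising statistics_target
```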

Summary for Senior

  • Cardinality is a key input parameter for the optimizer
  • n_distinct, MCV, histogram, correlation — 4 pillars of statistics
  • Data Skew — a common cause of wrong plans; check MCV
  • Extended Statistics (PG 10+) solves the correlated columns problem
  • correlation < 0.3 → Random I/O → consider CLUSTER
  • statistics_target — balance between accuracy and ANALYZE time
  • Regularly check pg_stats and pg_statistic_ext

🎯 Interview Cheat Sheet

Must know:

  • Cardinality = number of unique values; Selectivity = cardinality / total rows
  • n_distinct: -1 = 100% unique, 500 = exactly 500 unique
  • MCV (Most Common Values): top frequent values + frequency — for skewed data
  • Correlation: -1..1 — how much physical order matches logical order
  • correlation = 0 → Random I/O (chaos on disk), correlation = 1 → Sequential I/O
  • Extended Statistics (PG 10+): dependencies, ndistinct, mcv for correlated columns
  • statistics_target = 100 (default), 1000 — for skewed columns (10x longer ANALYZE)

Common follow-up questions:

  • “Why isn’t the index on status (ACTIVE/PENDING) used?” → Low cardinality, Seq Scan is faster
  • “What happens if statistics_target is too low?” → Planner misestimates rows
  • “How to solve correlated columns problem (country, city)?” → CREATE STATISTICS ... (dependencies)
  • “What does correlation = 0.05 mean?” → Nearly chaotic order → CLUSTER or new index

Red flags (DO NOT say):

  • ❌ “An index on a field with 2 values (M/F) will speed up search” (Seq Scan is still faster)
  • ❌ “n_distinct = -1 means no unique values” (the opposite — 100% unique!)
  • ❌ “statistics_target doesn’t affect performance” (it affects ANALYZE time and plan accuracy)

Related topics:

  • [[What are indexes for]] → selectivity and cardinality
  • [[Why is ANALYZE needed]] → statistics collection, n_distinct, MCV, histogram
  • [[What is a composite index]] → composite cardinality
  • [[How to optimize slow queries]] → statistics impact on plan