Why is ANALYZE needed?
🟢 Junior Level
ANALYZE — a command that collects statistics about data in a table, so PostgreSQL can choose optimal query execution plans.
Simple analogy: Imagine you’re going on a trip. Without statistics (which roads, traffic, distances), you’d pick a route at random. ANALYZE is like a GPS that knows the state of all roads and helps choose the best path.
What it does:
- Counts the number of unique values in columns
- Remembers the most frequent values
- Estimates data distribution
- Saves this information to pg_stats
Example:
-- Collect statistics for the entire database
ANALYZE;
-- For a specific table
ANALYZE users;
-- For specific columns
ANALYZE users (email, city);
-- With details
ANALYZE VERBOSE users;
Important: ANALYZE only collects statistics. VACUUM ANALYZE does both: cleans dead tuples AND collects statistics.
Why:
- Without statistics, the planner “guesses” and may choose a slow plan
- With statistics: it knows that Index Scan is faster for 1 row, and Seq Scan for 10,000
🟡 Middle Level
How ANALYZE collects statistics
ANALYZE does NOT scan the entire table! It uses random sampling:
Sample size = 300 × statistics_target
By default (target = 100):
300 × 100 = 30,000 rows
For a 1 billion row table:
Still 30,000 rows (fast!)
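The sample-size arithmetic above can be sketched in a few lines; nothing here is PostgreSQL-specific, it just reproduces the 300 × target rule:

```python
# Sketch of ANALYZE's sample-size rule: 300 rows per unit of
# statistics_target, independent of table size.
def analyze_sample_rows(statistics_target: int = 100) -> int:
    """Target row count for ANALYZE's random sample."""
    return 300 * statistics_target

print(analyze_sample_rows(100))   # 30000
print(analyze_sample_rows(1000))  # 300000
# A billion-row table is still sampled with the same 30,000 rows
# at the default target of 100.
```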
What goes into pg_stats
-- View statistics
SELECT * FROM pg_stats WHERE tablename = 'users';
Key fields:
| Field | Description |
|---|---|
| n_distinct | Uniqueness estimate (-1 = 100% unique) |
| most_common_vals | Top frequent values (MCV — Most Common Values) |
| most_common_freqs | Frequency of each top value |
| correlation | How much physical order matches logical order (-1 to 1) |
| histogram_bounds | Bucket boundaries for range estimation |
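To make the histogram_bounds field concrete: the bounds split the column into equal-frequency buckets, so a range predicate's selectivity is roughly the fraction of buckets it covers. A toy model (the bounds are hypothetical; the real planner also interpolates within partial buckets):

```python
import bisect

# histogram_bounds: each adjacent pair encloses ~the same number
# of rows. 11 bounds → 10 equi-height buckets of ~10% each.
bounds = [0, 10, 25, 50, 100, 250, 500, 1000, 5000, 10000, 100000]
n_buckets = len(bounds) - 1

def range_selectivity(lo, hi):
    # Count whole buckets inside [lo, hi] — a crude version of
    # the planner's histogram interpolation.
    i = bisect.bisect_left(bounds, lo)
    j = bisect.bisect_right(bounds, hi)
    return max(j - i - 1, 0) / n_buckets

print(range_selectivity(0, 100000))  # 1.0 — the whole range
print(range_selectivity(25, 250))    # 0.3 — three of ten buckets
```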
How the planner uses statistics
-- Query
SELECT * FROM users WHERE city = 'Moscow';
-- Planner checks pg_stats:
-- n_distinct = 500 (500 cities)
-- most_common_vals = ['Moscow', 'SPb', 'Kazan', ...]
-- most_common_freqs = [0.3, 0.2, 0.05, ...]
-- → Moscow = 30% of table → Seq Scan (too many for an index)
-- Query
SELECT * FROM users WHERE email = 'test@example.com';
-- Planner checks pg_stats:
-- n_distinct = -1 (100% unique)
-- → 1 row out of 1M → Index Scan!
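The two planner decisions above reduce to simple arithmetic over pg_stats fields. A toy model with hypothetical numbers mirroring the comments (the real planner also subtracts MCV mass before dividing by n_distinct):

```python
# Toy model of how the planner turns pg_stats into a row estimate.
def estimate_rows(total_rows, value, mcv_vals, mcv_freqs, n_distinct):
    if value in mcv_vals:
        # Value is in the MCV list: use its measured frequency.
        return total_rows * mcv_freqs[mcv_vals.index(value)]
    if n_distinct == -1:
        # n_distinct = -1 means the column is 100% unique.
        return 1
    # Otherwise assume remaining values are evenly spread (simplified).
    return total_rows / n_distinct

total = 1_000_000
moscow = estimate_rows(total, 'Moscow', ['Moscow', 'SPb'], [0.3, 0.2], 500)
email = estimate_rows(total, 'x@y.com', [], [], -1)
print(round(moscow))  # 300000 → 30% of the table → Seq Scan
print(email)          # 1 → Index Scan
```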
Autovacuum and ANALYZE
-- Auto-ANALYZE runs on its own
SHOW autovacuum_analyze_threshold; -- 50 rows
SHOW autovacuum_analyze_scale_factor; -- 0.1 (10%)
-- Triggers when:
-- modified rows > 50 + 0.1 * total_rows
-- For a table with 1000 rows:
-- 50 + 0.1 * 1000 = 150 modifications → ANALYZE
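The trigger condition above as a small predicate, using the default threshold (50) and scale factor (0.1):

```python
# Auto-ANALYZE fires when modified rows exceed
# threshold + scale_factor * total_rows.
def needs_autoanalyze(modified_rows, total_rows,
                      threshold=50, scale_factor=0.1):
    return modified_rows > threshold + scale_factor * total_rows

print(needs_autoanalyze(149, 1000))  # False: below 50 + 100 = 150
print(needs_autoanalyze(151, 1000))  # True: threshold crossed
```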
Common problems
- Outdated statistics
-- After a mass insert, statistics are stale
COPY users FROM '/data/users.csv'; -- 1M rows
-- Statistics are old! Planner is "blind"
-- Solution
ANALYZE users;
- Temporary tables
-- Autovacuum does NOT see temporary tables!
CREATE TEMP TABLE temp_data AS SELECT ...;
-- Manual ANALYZE needed
ANALYZE temp_data;
- Low accuracy
-- For a "ragged" distribution, 30,000 rows isn't enough
-- Increase accuracy
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;
-- Now sample: 300 × 1000 = 300,000 rows
Practical examples
-- Check when ANALYZE was last run
SELECT
relname,
last_analyze,
last_autoanalyze,
n_mod_since_analyze -- Modifications since last ANALYZE
FROM pg_stat_user_tables
WHERE relname = 'orders';
-- If n_mod_since_analyze is large → time for ANALYZE
🔴 Senior Level
Sampling algorithm
Reservoir Sampling:
PostgreSQL uses random sampling of PAGES (not rows):
1. Randomly selects pages from the table
2. Scans all rows on the selected pages
3. Builds statistics from the sample
Sampling pages gives more even coverage, accounting for data
locality (rows on the same page are usually related).
Advantage:
- O(1) time regardless of table size
- Representative sample
Disadvantage:
- May be inaccurate for very "ragged" distributions
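A toy sketch of the page-level stage only: pick random pages, then take every row on each chosen page, so the cost tracks the sample size rather than the table size. Real ANALYZE combines this block sampling with Vitter's reservoir algorithm over the rows; the page layout here is simulated:

```python
import random

# Simulate page-level sampling: choose random pages, keep all rows
# on each selected page.
def sample_by_pages(pages, sample_pages, seed=42):
    rng = random.Random(seed)
    chosen = rng.sample(range(len(pages)), min(sample_pages, len(pages)))
    rows = []
    for p in chosen:
        rows.extend(pages[p])  # every row on a selected page
    return rows

# 1000 pages x 100 rows each; sampling 30 pages reads ~3000 rows
# no matter how many pages the table actually has.
table = [[(page, i) for i in range(100)] for page in range(1000)]
sample = sample_by_pages(table, 30)
print(len(sample))  # 3000
```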
statistics_target: fine-tuning
-- Default: 100 (30,000 row sample)
SHOW default_statistics_target;
-- Increase for problematic columns
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ALTER TABLE users ALTER COLUMN email SET STATISTICS 500;
-- Apply
ANALYZE orders;
ANALYZE users;
-- Cost:
-- target = 100: ANALYZE 3 seconds
-- target = 1000: ANALYZE 30 seconds
-- target = 10000: ANALYZE 5 minutes
When to increase:
- Column with anomalous distribution (Data Skew: e.g., 90% of rows have status='ACTIVE', the remaining 10% spread across 20 other statuses)
- Planner frequently misestimates rows
- MCV (Most Common Values) doesn’t cover > 80% of data
Extended statistics (Multivariate Statistics)
Problem: PostgreSQL assumes columns are independent.
-- Query
SELECT * FROM cars WHERE make = 'Honda' AND model = 'Civic';
-- Planner (naively):
-- P(Honda) = 0.1
-- P(Civic) = 0.05
-- P(both) = 0.1 × 0.05 = 0.005 (0.5%)
-- Reality:
-- model depends on make!
-- P(both) = 0.03 (3%) — 6x more!
-- Due to the 6x estimation error, the planner may choose Nested Loop (expecting 5
-- rows instead of 300), when Hash Join would be 100x faster.
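The independence assumption in numbers; the selectivities are hypothetical, mirroring the Honda/Civic comments above:

```python
# Naive joint selectivity vs. reality for correlated columns.
total_rows = 10_000
p_make = 0.10   # P(make = 'Honda')
p_model = 0.05  # P(model = 'Civic')

naive = p_make * p_model  # planner's guess under independence: 0.005
actual = 0.03             # measured joint frequency: model implies make

print(round(total_rows * naive))   # 50 rows expected
print(round(total_rows * actual))  # 300 rows in reality
print(round(actual / naive))       # 6 — the 6x underestimate
```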
Solution: CREATE STATISTICS
-- 1. Functional Dependencies (PG 10+)
-- Knowing the model tells us the make
CREATE STATISTICS s_make_model (dependencies)
ON make, model FROM cars;
-- 2. N-Distinct (PG 10+)
-- Cardinality of a combination
CREATE STATISTICS s_city_age (ndistinct)
ON city, age FROM users;
-- 3. MCV Lists (PG 13+)
-- On PG 12 and below, the `mcv` kind is not supported → CREATE STATISTICS will error.
-- Top value combinations
CREATE STATISTICS s_status_priority (mcv)
ON status, priority FROM orders;
-- Apply
ANALYZE cars;
ANALYZE users;
ANALYZE orders;
Checking:
-- View extended statistics
SELECT * FROM pg_stats_ext WHERE tablename = 'cars';
-- Specifically dependencies
SELECT
stxname,
stxkeys,
stxdependencies
FROM pg_statistic_ext;
Expression statistics
-- Functional index
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
-- ANALYZE collects statistics on the function RESULT!
ANALYZE users;
-- Now the planner accurately estimates:
SELECT * FROM users WHERE LOWER(email) = 'test@example.com';
-- → Knows the selectivity of LOWER(email), not just email
pg_stat_progress_analyze (PG 13+)
-- Monitoring ANALYZE on large tables
SELECT
pid,
relid::regclass as table_name,
phase,
sample_blks_total,
sample_blks_scanned,
ext_stats_computed,
ROUND(100.0 * sample_blks_scanned / NULLIF(sample_blks_total, 0), 2) as progress_pct
FROM pg_stat_progress_analyze;
-- Phase:
-- 'acquiring sample rows'
-- 'acquiring inherited sample rows'
-- 'computing statistics'
-- 'computing extended statistics'
On PG 12 and below, ANALYZE progress monitoring is not available. Estimate time by table size: ~1 sec per 1M rows at default_statistics_target.
“Blind spots” of statistics
1. Cross-table relationships
PostgreSQL does NOT store statistics about JOINs between tables!
JOIN size estimate = product of selectivities
→ If tables correlate → estimate is wrong
→ Solution: CTE or Materialized Views
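A sketch of why this blind spot bites: the join estimate multiplies per-table selectivities and a join selectivity derived from n_distinct, with no measured correlation between the tables. All numbers are hypothetical:

```python
# Each filter's selectivity is estimated independently; the join
# estimate is their product times a join selectivity taken from
# n_distinct of the join keys — never from joint statistics.
orders, customers = 1_000_000, 100_000
sel_orders, sel_customers = 0.01, 0.02
join_selectivity = 1 / 100_000  # 1 / max(n_distinct of the keys)

estimate = (orders * sel_orders) * (customers * sel_customers) \
    * join_selectivity
print(round(estimate))  # 200 rows expected...
# ...but if both filters happen to select the SAME customers
# (perfect correlation), the real result can be orders of
# magnitude larger — and the plan is chosen for 200 rows.
```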
2. Partitioned Tables
-- For partitioned tables:
-- Statistics are collected on EACH partition
-- The planner may misestimate the total
-- Solution:
-- Increase statistics_target on key partitions
ALTER TABLE orders_y2024 ALTER COLUMN created_at SET STATISTICS 1000;
3. Time-Series Data
-- Data changes over time
-- Statistics become stale between ANALYZE runs
-- Solution:
-- Run ANALYZE more frequently for time-series tables
ALTER TABLE logs SET (
autovacuum_analyze_scale_factor = 0.01 -- 1% instead of 10%
);
Edge Cases
- ANALYZE locking
-- ANALYZE takes ShareUpdateExclusiveLock
-- Allows: SELECT, INSERT, UPDATE, DELETE
-- Blocks: another ANALYZE, VACUUM, ALTER TABLE
-- Two ANALYZEs at the same time → one waits
- Inheritance
-- For tables with inheritance:
ANALYZE parent_table; -- Parent only
ANALYZE; -- Entire DB, including children
-- Or separately:
ANALYZE child_table_1;
ANALYZE child_table_2;
- Foreign Tables
-- For foreign tables (FDW):
-- ANALYZE can be very slow (it queries the remote server)
-- IMPORT FOREIGN SCHEMA creates the foreign table definitions;
-- statistics still require a local ANALYZE (or, for postgres_fdw,
-- the use_remote_estimate option)
IMPORT FOREIGN SCHEMA public FROM SERVER remote_server INTO local_schema;
Performance Impact
Wrong statistics → bad plan:
| Estimation error | Result |
|---|---|
| Underestimated 10x | Index Scan instead of Seq Scan |
| Overestimated 10x | Seq Scan instead of Index Scan |
| Wrong JOIN order | Nested Loop instead of Hash Join |
Production Experience
Real scenario #1: Mass insert without ANALYZE
- ETL: loaded 50M rows into orders
- Queries became 100x slower
- Cause: stale statistics (table was empty)
- Planner thought: “table is empty → Seq Scan is fast”
- Reality: 50M rows → Seq Scan is a disaster
- Solution:
ANALYZE orders; → 2 seconds, everything fixed
Real scenario #2: Correlated columns
- Analytics: WHERE country = 'RU' AND city = 'Moscow'
- Planner was off by 50x (assumed independence)
- Solution:
CREATE STATISTICS s_country_city (dependencies) ON country, city;
- Result: accurate estimate, correct plan
Monitoring
-- 1. Check statistics freshness
SELECT
relname,
last_analyze,
last_autoanalyze,
n_mod_since_analyze,
CASE
WHEN n_mod_since_analyze > 100000 THEN 'NEEDS ANALYZE'
ELSE 'OK'
END as status
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC
LIMIT 20;
-- 2. Check statistics_target
SELECT
attname,
CASE
WHEN attstattarget = -1 THEN 'Default (' || current_setting('default_statistics_target') || ')'
ELSE attstattarget::text
END as statistics_target
FROM pg_attribute
WHERE attrelid = 'orders'::regclass
AND attnum > 0
AND NOT attisdropped;
-- 3. Extended statistics
SELECT * FROM pg_stats_ext;
-- 4. Progress
SELECT * FROM pg_stat_progress_analyze;
Best Practices
- ANALYZE after mass insert/delete
- Increase statistics_target for skewed columns
- CREATE STATISTICS for correlated columns
- Monitor n_mod_since_analyze — > 10% = time for ANALYZE
- Don’t rely solely on autovacuum for hot tables
- Check plans via EXPLAIN ANALYZE after ANALYZE
- Temporary tables → manual ANALYZE
- Partitioned tables → ANALYZE on key partitions
Summary for Senior
- ANALYZE = statistics collection via random sampling (300 × target rows)
- statistics_target — balance between accuracy and time
- Extended Statistics (PG 10+) solves the correlated columns problem
- MCV Lists (PG 13+) for top value combinations
- Cross-table relationships — planner’s blind spot
- Mass changes → always ANALYZE afterwards
- Monitor n_mod_since_analyze and plan accuracy
- pg_stat_progress_analyze (PG 13+) for monitoring on large tables
🎯 Interview Cheat Sheet
Must know:
- ANALYZE: random sample of 300 × statistics_target rows (doesn’t scan the whole table)
- pg_stats: n_distinct, most_common_vals (MCV), histogram_bounds, correlation
- statistics_target = 100 (default) → 30,000 row sample
- Autovacuum triggers ANALYZE when: modified > 50 + 0.1 × total_rows
- Extended Statistics (PG 10+): dependencies (correlated columns), ndistinct (composite cardinality), mcv (PG 13+, top combinations)
- Algorithm: Reservoir Sampling → O(1) time regardless of table size
- Blind spots: cross-table relationships, partitioned tables, time-series data
- pg_stat_progress_analyze (PG 13+): progress monitoring
- Expression statistics: ANALYZE collects statistics on the function result (LOWER(email))
Frequent follow-up questions:
- “Why did a query become slow after a mass insert?” → Stale statistics → ANALYZE
- “How to solve the correlated columns problem?” → CREATE STATISTICS ... (dependencies)
- “When to increase statistics_target?” → Skewed data, planner misestimates
- “Does PG analyze temporary tables?” → No, manual ANALYZE needed
Red flags (do NOT say):
- ❌ “ANALYZE scans the entire table” (no, random sampling!)
- ❌ “Autovacuum is always up to date” (after mass insert — no)
- ❌ “Extended Statistics solves all problems” (cross-table relationships are a blind spot)
- ❌ “Temporary tables analyze themselves” (no, autovacuum doesn’t see them)
Related topics:
- [[What is index cardinality]] → n_distinct, MCV, correlation
- [[What is VACUUM in PostgreSQL]] → Autovacuum + ANALYZE
- [[How to optimize slow queries]] → statistics impact on plans