Why is ANALYZE needed?
🟢 Junior Level
ANALYZE — a command that collects statistics about data in a table, so PostgreSQL can choose optimal query execution plans.
Simple analogy: Imagine you’re going on a trip. Without statistics (which roads, traffic, distances), you’d pick a route at random. ANALYZE is like a GPS that knows the state of all roads and helps choose the best path.
What it does:
- Counts the number of unique values in columns
- Remembers the most frequent values
- Estimates data distribution
- Saves this information to pg_stats
Example:
-- Collect statistics for the entire database
ANALYZE;
-- For a specific table
ANALYZE users;
-- For specific columns
ANALYZE users (email, city);
-- With details
ANALYZE VERBOSE users;
Important: ANALYZE only collects statistics. VACUUM ANALYZE does both: cleans dead tuples AND collects statistics.
Why:
- Without statistics, the planner “guesses” and may choose a slow plan
- With statistics: it knows that Index Scan is faster for 1 row, and Seq Scan for 10,000
🟡 Middle Level
How ANALYZE collects statistics
ANALYZE does NOT scan the entire table! It uses random sampling:
Sample size = 300 × statistics_target
By default (target = 100):
300 × 100 = 30,000 rows
For a 1 billion row table:
Still 30,000 rows (fast!)
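The sample-size arithmetic above can be sketched in a few lines; nothing here is PostgreSQL-specific, it just reproduces the 300 × target rule:

```python
# Sketch of ANALYZE's sample-size rule: 300 rows per unit of
# statistics_target, independent of table size.
def analyze_sample_rows(statistics_target: int = 100) -> int:
    """Target row count for ANALYZE's random sample."""
    return 300 * statistics_target

print(analyze_sample_rows(100))   # 30000
print(analyze_sample_rows(1000))  # 300000
# A billion-row table is still sampled with the same 30,000 rows
# at the default target of 100.
```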
What goes into pg_stats
-- View statistics
SELECT * FROM pg_stats WHERE tablename = 'users';
Key fields:
| Field | Description |
|---|---|
| n_distinct | Uniqueness estimate (-1 = 100% unique) |
| most_common_vals | Top frequent values (MCV — Most Common Values) |
| most_common_freqs | Frequency of each top value |
| correlation | How much physical order matches logical order (-1 to 1) |
| histogram_bounds | Bucket boundaries for range estimation |
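To make the histogram_bounds field concrete: the bounds split the column into equal-frequency buckets, so a range predicate's selectivity is roughly the fraction of buckets it covers. A toy model (the bounds are hypothetical; the real planner also interpolates within partial buckets):

```python
import bisect

# histogram_bounds: each adjacent pair encloses ~the same number
# of rows. 11 bounds → 10 equi-height buckets of ~10% each.
bounds = [0, 10, 25, 50, 100, 250, 500, 1000, 5000, 10000, 100000]
n_buckets = len(bounds) - 1

def range_selectivity(lo, hi):
    # Count whole buckets inside [lo, hi] — a crude version of
    # the planner's histogram interpolation.
    i = bisect.bisect_left(bounds, lo)
    j = bisect.bisect_right(bounds, hi)
    return max(j - i - 1, 0) / n_buckets

print(range_selectivity(0, 100000))  # 1.0 — the whole range
print(range_selectivity(25, 250))    # 0.3 — three of ten buckets
```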
How the planner uses statistics
-- Query
SELECT * FROM users WHERE city = 'Moscow';
-- Planner checks pg_stats:
-- n_distinct = 500 (500 cities)
-- most_common_vals = ['Moscow', 'SPb', 'Kazan', ...]
-- most_common_freqs = [0.3, 0.2, 0.05, ...]
-- → Moscow = 30% of table → Seq Scan (too many for an index)
-- Query
SELECT * FROM users WHERE email = 'test@example.com';
-- Planner checks pg_stats:
-- n_distinct = -1 (100% unique)
-- → 1 row out of 1M → Index Scan!
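The two planner decisions above reduce to simple arithmetic over pg_stats fields. A toy model with hypothetical numbers mirroring the comments (the real planner also subtracts MCV mass before dividing by n_distinct):

```python
# Toy model of how the planner turns pg_stats into a row estimate.
def estimate_rows(total_rows, value, mcv_vals, mcv_freqs, n_distinct):
    if value in mcv_vals:
        # Value is in the MCV list: use its measured frequency.
        return total_rows * mcv_freqs[mcv_vals.index(value)]
    if n_distinct == -1:
        # n_distinct = -1 means the column is 100% unique.
        return 1
    # Otherwise assume remaining values are evenly spread (simplified).
    return total_rows / n_distinct

total = 1_000_000
moscow = estimate_rows(total, 'Moscow', ['Moscow', 'SPb'], [0.3, 0.2], 500)
email = estimate_rows(total, 'x@y.com', [], [], -1)
print(round(moscow))  # 300000 → 30% of the table → Seq Scan
print(email)          # 1 → Index Scan
```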
Autovacuum and ANALYZE
-- Auto-ANALYZE runs on its own
SHOW autovacuum_analyze_threshold; -- 50 rows
SHOW autovacuum_analyze_scale_factor; -- 0.1 (10%)
-- Triggers when:
-- modified rows > 50 + 0.1 * total_rows
-- For a table with 1000 rows:
-- 50 + 0.1 * 1000 = 150 modifications → ANALYZE
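The trigger condition above as a small predicate, using the default threshold (50) and scale factor (0.1):

```python
# Auto-ANALYZE fires when modified rows exceed
# threshold + scale_factor * total_rows.
def needs_autoanalyze(modified_rows, total_rows,
                      threshold=50, scale_factor=0.1):
    return modified_rows > threshold + scale_factor * total_rows

print(needs_autoanalyze(149, 1000))  # False: below 50 + 100 = 150
print(needs_autoanalyze(151, 1000))  # True: threshold crossed
```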
Common problems
- Outdated statistics
-- After a mass insert, statistics are stale
COPY users FROM '/data/users.csv'; -- 1M rows
-- Statistics are old! Planner is "blind"
-- Solution
ANALYZE users;
- Temporary tables
-- Autovacuum does NOT see temporary tables!
CREATE TEMP TABLE temp_data AS SELECT ...;
-- Manual ANALYZE needed
ANALYZE temp_data;
- Low accuracy
-- For a "ragged" distribution, 30,000 rows isn't enough
-- Increase accuracy
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;
-- Now sample: 300 × 1000 = 300,000 rows
Practical examples
-- Check when ANALYZE was last run
SELECT
relname,
last_analyze,
last_autoanalyze,
n_mod_since_analyze -- Modifications since last ANALYZE
FROM pg_stat_user_tables
WHERE relname = 'orders';
-- If n_mod_since_analyze is large → time for ANALYZE
🔴 Senior Level
Sampling algorithm
Reservoir Sampling:
PostgreSQL uses random sampling of PAGES (not rows):
1. Randomly selects pages from the table
2. Scans all rows on the selected pages
3. Builds statistics from the sample
Sampling pages gives more even coverage, accounting for data
locality (rows on the same page are usually related).
Advantage:
- O(1) time regardless of table size
- Representative sample
Disadvantage:
- May be inaccurate for very "ragged" distributions
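A toy sketch of the page-level stage only: pick random pages, then take every row on each chosen page, so the cost tracks the sample size rather than the table size. Real ANALYZE combines this block sampling with Vitter's reservoir algorithm over the rows; the page layout here is simulated:

```python
import random

# Simulate page-level sampling: choose random pages, keep all rows
# on each selected page.
def sample_by_pages(pages, sample_pages, seed=42):
    rng = random.Random(seed)
    chosen = rng.sample(range(len(pages)), min(sample_pages, len(pages)))
    rows = []
    for p in chosen:
        rows.extend(pages[p])  # every row on a selected page
    return rows

# 1000 pages x 100 rows each; sampling 30 pages reads ~3000 rows
# no matter how many pages the table actually has.
table = [[(page, i) for i in range(100)] for page in range(1000)]
sample = sample_by_pages(table, 30)
print(len(sample))  # 3000
```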
statistics_target: fine-tuning
-- Default: 100 (30,000 row sample)
SHOW default_statistics_target;
-- Increase for problematic columns
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ALTER TABLE users ALTER COLUMN email SET STATISTICS 500;
-- Apply
ANALYZE orders;
ANALYZE users;
-- Cost:
-- target = 100: ANALYZE 3 seconds
-- target = 1000: ANALYZE 30 seconds
-- target = 10000: ANALYZE 5 minutes
When to increase:
- Column with anomalous distribution (Data Skew: e.g., 90% of rows have status='ACTIVE', the remaining 10% spread across 20 other statuses)
- Planner frequently misestimates rows
- MCV (Most Common Values) doesn’t cover > 80% of data
Extended statistics (Multivariate Statistics)
Problem: PostgreSQL assumes columns are independent.
-- Query
SELECT * FROM cars WHERE make = 'Honda' AND model = 'Civic';
-- Planner (naively):
-- P(Honda) = 0.1
-- P(Civic) = 0.05
-- P(both) = 0.1 × 0.05 = 0.005 (0.5%)
-- Reality:
-- model depends on make!
-- P(both) = 0.03 (3%) — 6x more!
-- Due to the 6x estimation error, the planner may choose Nested Loop (expecting 5
-- rows instead of 300), when Hash Join would be 100x faster.
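The independence assumption in numbers; the selectivities are hypothetical, mirroring the Honda/Civic comments above:

```python
# Naive joint selectivity vs. reality for correlated columns.
total_rows = 10_000
p_make = 0.10   # P(make = 'Honda')
p_model = 0.05  # P(model = 'Civic')

naive = p_make * p_model  # planner's guess under independence: 0.005
actual = 0.03             # measured joint frequency: model implies make

print(round(total_rows * naive))   # 50 rows expected
print(round(total_rows * actual))  # 300 rows in reality
print(round(actual / naive))       # 6 — the 6x underestimate
```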
Solution: CREATE STATISTICS
-- 1. Functional Dependencies (PG 10+)
-- Knowing the model tells us the make
CREATE STATISTICS s_make_model (dependencies)
ON make, model FROM cars;
-- 2. N-Distinct (PG 10+)
-- Cardinality of a combination
CREATE STATISTICS s_city_age (ndistinct)
ON city, age FROM users;
-- 3. MCV Lists (PG 13+)
-- On PG 12 and below, the `mcv` kind is not supported → CREATE STATISTICS will error.
-- Top value combinations
CREATE STATISTICS s_status_priority (mcv)
ON status, priority FROM orders;
-- Apply
ANALYZE cars;
ANALYZE users;
ANALYZE orders;
Checking:
-- View extended statistics
SELECT * FROM pg_stats_ext WHERE tablename = 'cars';
-- Specifically dependencies
SELECT
stxname,
stxkeys,
stxdependencies
FROM pg_statistic_ext;
Expression statistics
-- Functional index
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
-- ANALYZE collects statistics on the function RESULT!
ANALYZE users;
-- Now the planner accurately estimates:
SELECT * FROM users WHERE LOWER(email) = 'test@example.com';
-- → Knows the selectivity of LOWER(email), not just email
pg_stat_progress_analyze (PG 13+)
-- Monitoring ANALYZE on large tables
SELECT
pid,
relid::regclass as table_name,
phase,
sample_blks_total,
sample_blks_scanned,
ext_stats_computed,
ROUND(100.0 * sample_blks_scanned / NULLIF(sample_blks_total, 0), 2) as progress_pct
FROM pg_stat_progress_analyze;
-- Phase:
-- 'acquiring sample rows'
-- 'acquiring inherited sample rows'
-- 'computing statistics'
-- 'computing extended statistics'
On PG 12 and below, ANALYZE progress monitoring is not available. Estimate time by table size: ~1 sec per 1M rows at default_statistics_target.
“Blind spots” of statistics
1. Cross-table relationships
PostgreSQL does NOT store statistics about JOINs between tables!
JOIN size estimate = product of selectivities
→ If tables correlate → estimate is wrong
→ Solution: CTE or Materialized Views
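A sketch of why this blind spot bites: the join estimate multiplies per-table selectivities and a join selectivity derived from n_distinct, with no measured correlation between the tables. All numbers are hypothetical:

```python
# Each filter's selectivity is estimated independently; the join
# estimate is their product times a join selectivity taken from
# n_distinct of the join keys — never from joint statistics.
orders, customers = 1_000_000, 100_000
sel_orders, sel_customers = 0.01, 0.02
join_selectivity = 1 / 100_000  # 1 / max(n_distinct of the keys)

estimate = (orders * sel_orders) * (customers * sel_customers) \
    * join_selectivity
print(round(estimate))  # 200 rows expected...
# ...but if both filters happen to select the SAME customers
# (perfect correlation), the real result can be orders of
# magnitude larger — and the plan is chosen for 200 rows.
```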
2. Partitioned Tables
-- For partitioned tables:
-- Statistics are collected on EACH partition
-- The planner may misestimate the total
-- Solution:
-- Increase statistics_target on key partitions
ALTER TABLE orders_y2024 ALTER COLUMN created_at SET STATISTICS 1000;
3. Time-Series Data
-- Data changes over time
-- Statistics become stale between ANALYZE runs
-- Solution:
-- Run ANALYZE more frequently for time-series tables
ALTER TABLE logs SET (
autovacuum_analyze_scale_factor = 0.01 -- 1% instead of 10%
);
Edge Cases
- ANALYZE locking
-- ANALYZE takes ShareUpdateExclusiveLock
-- Allows: SELECT, INSERT, UPDATE, DELETE
-- Blocks: another ANALYZE, VACUUM, ALTER TABLE
-- Two ANALYZEs at the same time → one waits
- Inheritance
-- For tables with inheritance:
ANALYZE parent_table; -- Parent only
ANALYZE; -- Entire DB, including children
-- Or separately:
ANALYZE child_table_1;
ANALYZE child_table_2;
- Foreign Tables
-- For foreign tables (FDW):
-- ANALYZE can be very slow (it queries the remote server)
-- IMPORT FOREIGN SCHEMA creates the foreign table definitions;
-- statistics still require a local ANALYZE (or, for postgres_fdw,
-- the use_remote_estimate option)
IMPORT FOREIGN SCHEMA public FROM SERVER remote_server INTO local_schema;
Performance Impact
Wrong statistics → bad plan:
| Estimation error | Result |
|---|---|
| Underestimated 10x | Index Scan instead of Seq Scan |
| Overestimated 10x | Seq Scan instead of Index Scan |
| Wrong JOIN order | Nested Loop instead of Hash Join |
Production Experience
Real scenario #1: Mass insert without ANALYZE
- ETL: loaded 50M rows into orders
- Queries became 100x slower
- Cause: stale statistics (table was empty)
- Planner thought: “table is empty → Seq Scan is fast”
- Reality: 50M rows → Seq Scan is a disaster
- Solution:
ANALYZE orders; → 2 seconds, everything fixed
Real scenario #2: Correlated columns
- Analytics: WHERE country = 'RU' AND city = 'Moscow'
- Planner was off by 50x (assumed independence)
- Solution:
CREATE STATISTICS s_country_city (dependencies) ON country, city;
- Result: accurate estimate, correct plan
Monitoring
-- 1. Check statistics freshness
SELECT
relname,
last_analyze,
last_autoanalyze,
n_mod_since_analyze,
CASE
WHEN n_mod_since_analyze > 100000 THEN 'NEEDS ANALYZE'
ELSE 'OK'
END as status
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC
LIMIT 20;
-- 2. Check statistics_target
SELECT
attname,
CASE
WHEN attstattarget = -1 THEN 'Default (' || current_setting('default_statistics_target') || ')'
ELSE attstattarget::text
END as statistics_target
FROM pg_attribute
WHERE attrelid = 'orders'::regclass
AND attnum > 0
AND NOT attisdropped;
-- 3. Extended statistics
SELECT * FROM pg_stats_ext;
-- 4. Progress
SELECT * FROM pg_stat_progress_analyze;
Best Practices
- ANALYZE after mass insert/delete
- Increase statistics_target for skewed columns
- CREATE STATISTICS for correlated columns
- Monitor n_mod_since_analyze — > 10% = time for ANALYZE
- Don’t rely solely on autovacuum for hot tables
- Check plans via EXPLAIN ANALYZE after ANALYZE
- Temporary tables → manual ANALYZE
- Partitioned tables → ANALYZE on key partitions
Summary for Senior
- ANALYZE = statistics collection via random sampling (300 × target rows)
- statistics_target — balance between accuracy and time
- Extended Statistics (PG 10+) solves the correlated columns problem
- MCV Lists (PG 13+) for top value combinations
- Cross-table relationships — planner’s blind spot
- Mass changes → always ANALYZE afterwards
- Monitor n_mod_since_analyze and plan accuracy
- pg_stat_progress_analyze (PG 13+) for monitoring on large tables
🎯 Interview Cheat Sheet
Must know:
- ANALYZE: random sample of 300 × statistics_target rows (doesn’t scan the whole table)
- pg_stats: n_distinct, most_common_vals (MCV), histogram_bounds, correlation
- statistics_target = 100 (default) → 30,000 row sample
- Autovacuum triggers ANALYZE when: modified > 50 + 0.1 × total_rows
- Extended Statistics (PG 10+): dependencies (correlated columns), ndistinct (composite cardinality), mcv (PG 13+, top combinations)
- Algorithm: Reservoir Sampling → O(1) time regardless of table size
- Blind spots: cross-table relationships, partitioned tables, time-series data
- pg_stat_progress_analyze (PG 13+): progress monitoring
- Expression statistics: ANALYZE collects statistics on the function result (LOWER(email))
Frequent follow-up questions:
- “Why did a query become slow after a mass insert?” → Stale statistics → ANALYZE
- “How to solve the correlated columns problem?” → CREATE STATISTICS ... (dependencies)
- “When to increase statistics_target?” → Skewed data, planner misestimates
- “Does PG analyze temporary tables?” → No, manual ANALYZE needed
Red flags (do NOT say):
- ❌ “ANALYZE scans the entire table” (no, random sampling!)
- ❌ “Autovacuum is always up to date” (after mass insert — no)
- ❌ “Extended Statistics solves all problems” (cross-table relationships are a blind spot)
- ❌ “Temporary tables analyze themselves” (no, autovacuum doesn’t see them)
Related topics:
- [[What is index cardinality]] → n_distinct, MCV, correlation
- [[What is VACUUM in PostgreSQL]] → Autovacuum + ANALYZE
- [[How to optimize slow queries]] → statistics impact on plans