How does MVCC work in PostgreSQL?
🟢 Junior Level
MVCC (Multi-Version Concurrency Control) — a mechanism that allows PostgreSQL to show each user their own version of data, without blocking reads during writes.
Simple analogy: Imagine Google Docs. When you edit a document, others see the old version until you save. Meanwhile, they can continue reading — they don’t need to wait for you to finish.
Difference from other RDBMS: Oracle uses undo segments, SQL Server uses tempdb. PostgreSQL stores ALL versions in the same table → simpler, but bloat from UPDATE/DELETE.
Core idea:
- On UPDATE, PostgreSQL does not overwrite the row, but creates a new version
- On DELETE, the row is not physically removed, but marked as deleted
- Each transaction sees only the rows that existed at the moment it started
Example:
-- Two concurrent sessions, T1 and T2 (READ COMMITTED):
BEGIN;                                     -- T1
BEGIN;                                     -- T2
SELECT * FROM users WHERE id=1;            -- T2: sees name='Ivan'
UPDATE users SET name='Petr' WHERE id=1;   -- T1: creates a new row version
SELECT * FROM users WHERE id=1;            -- T2: still sees name='Ivan'!
SELECT * FROM users WHERE id=1;            -- T1: sees its own change, name='Petr'
COMMIT;                                    -- T1
SELECT * FROM users WHERE id=1;            -- T2: now sees name='Petr'
Why it’s needed:
- Readers don’t block writers
- Writers don’t block readers
- No “dirty read” problem
🟡 Middle Level
How it works internally
Each row (tuple) contains hidden system fields:
| Field | Description |
|---|---|
| xmin | ID of the transaction that created the row |
| xmax | ID of the transaction that deleted the row (0 = not deleted) |
| ctid | Physical address of the row (page_number, row_number) |
Mechanism:
-- Initial state
-- users table:
-- xmin=100, xmax=0, name='Ivan', ctid=(1,1)
-- Transaction 101 performs UPDATE
UPDATE users SET name='Petr' WHERE id=1;
-- New version of the row:
-- xmin=101, xmax=0, name='Petr', ctid=(1,2)
-- Old version:
-- xmin=100, xmax=101, name='Ivan', ctid=(1,1) ← xmax=101 (marked)
-- Transaction 102 (started before 101):
-- → Sees the row with xmin=100 (xmax=101 not yet committed for it)
-- → name='Ivan'
-- Transaction 103 (started after 101):
-- → Sees the row with xmin=101 (committed)
-- → name='Petr'
Snapshots
When a transaction starts a SELECT, it takes a snapshot of the database state:
Snapshot:
xmin = 100 ← All transactions < 100 have finished (committed or aborted)
xmax = 105 ← All transactions >= 105 haven't started yet (never visible)
xip_list = [101, 102, 103] ← Transactions in progress at snapshot time (never visible)
Visibility rule:
A row is visible if:
- its xmin is committed, xmin < snapshot.xmax, and xmin is NOT in xip_list
- AND its xmax = 0, OR the deleting transaction is not visible to the snapshot (xmax >= snapshot.xmax, xmax is in xip_list, or it aborted)
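The rule above can be sketched in a few lines of Python. This is a simplified educational model, not PostgreSQL's real visibility code (which also handles subtransactions, command IDs, and hint bits); transaction status is modeled as a plain `committed` set.

```python
# Simplified sketch of PostgreSQL's tuple visibility check.
# snapshot = {xmin, xmax, xip_list}; `committed` models CLOG lookups.

def xid_visible_in_snapshot(xid, snapshot, committed):
    """Was `xid` committed before the snapshot was taken?"""
    if xid >= snapshot["xmax"]:
        return False              # started after the snapshot
    if xid in snapshot["xip_list"]:
        return False              # was still in progress at snapshot time
    return xid in committed       # must have actually committed

def tuple_visible(tup, snapshot, committed):
    """Visible if the creator committed before the snapshot and no visible deleter exists."""
    if not xid_visible_in_snapshot(tup["xmin"], snapshot, committed):
        return False
    if tup["xmax"] == 0:
        return True               # never deleted
    # deleted, but the deletion only hides the row if the deleter is visible to us
    return not xid_visible_in_snapshot(tup["xmax"], snapshot, committed)

snapshot = {"xmin": 100, "xmax": 105, "xip_list": {101, 102, 103}}
committed = {99, 100, 104}

old = {"xmin": 100, "xmax": 101, "name": "Ivan"}  # deleter 101 still active
new = {"xmin": 101, "xmax": 0,   "name": "Petr"}  # creator 101 still active

print(tuple_visible(old, snapshot, committed))  # True  -> sees 'Ivan'
print(tuple_visible(new, snapshot, committed))  # False -> 'Petr' not yet visible
```

This reproduces the example from the Middle-level section: transaction 101's UPDATE is invisible to a snapshot taken while 101 is still in progress.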
UPDATE = DELETE + INSERT
-- This is NOT an in-place update!
UPDATE users SET name='Petr' WHERE id=1;
-- Actually:
-- 1. A NEW version of the row is created (name='Petr')
-- 2. The old version is marked (xmax = current transaction)
-- 3. Indexes are updated (reference to the new ctid)
VACUUM: cleaning up dead rows
-- After UPDATE/DELETE, "dead" rows remain (dead tuples)
-- VACUUM removes them, freeing space
-- Check dead rows
SELECT
schemaname,
relname,
n_live_tup, -- Live rows
n_dead_tup, -- Dead rows
last_vacuum
FROM pg_stat_user_tables
WHERE relname = 'users';
-- Manual run
VACUUM users;
Common problems
- Bloat
- Bloat

-- Table takes 10x more space than it should
-- Cause: many dead tuples, VACUUM can't keep up
-- Solution:
VACUUM FULL users;   -- ⚠️ Locks the table!
-- or, without a long lock:
pg_repack -t users;

- Long transactions block VACUUM

-- If a transaction has been open for 2 hours,
-- VACUUM cannot remove dead rows that transaction can still see
-- Find long transactions:
SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC;

- Transaction ID Wraparound

-- XID is a 32-bit number (~4 billion values)
-- If old rows aren't "frozen" → the counter wraps around
-- Check age:
SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY age DESC;
-- > 200M → urgent VACUUM needed!
-- approaching ~2B → database stops accepting writes (effectively READ ONLY)
🔴 Senior Level
Hint Bits optimization
Problem: Checking transaction status via CLOG (Commit Log) is expensive (disk read).
CLOG — files in pg_xact/, where each transaction = 2 bits (IN_PROGRESS/COMMITTED/ABORTED). This is NOT WAL. CLOG read = microseconds, but with millions of rows → seconds.
Solution: PostgreSQL uses Hint Bits — 2 bits in the row header:
Heap Tuple Header:
...
t_infomask:
HEAP_XMIN_COMMITTED ← xmin committed
HEAP_XMIN_INVALID ← xmin aborted
HEAP_XMAX_COMMITTED ← xmax committed
HEAP_XMAX_INVALID ← xmax aborted
...
Mechanism:
1. First reader checks xmin status in CLOG
2. Sets Hint Bits in the row header
3. All subsequent readers check Hint Bits instantly
→ No CLOG lookup needed!
This is called "Hint Bits propagation" or "Lazy hinting"
Important: Hint Bits are NOT WAL-logged! After crash, they are recalculated.
Multixacts
When multiple transactions lock the same row (SELECT FOR SHARE):
-- Transaction 1: SELECT FOR SHARE → lock
-- Transaction 2: SELECT FOR SHARE → also a lock
-- xmax cannot store two IDs!
-- Solution: MultiXact ID
-- xmax stores a reference to pg_multixact
-- pg_multixact stores the list of transactions: [101, 102]
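A minimal sketch of the promotion step, with hypothetical names (`lock_for_share`, `multixact_members` stand in for the real pg_multixact machinery): the single xmax slot is replaced by a MultiXact ID pointing at a member list once a second locker arrives.

```python
# Toy model of MultiXact promotion: xmax holds one 32-bit value, so the
# second shared locker forces creation of a MultiXact that lists both.

multixact_members = {}       # mxid -> list of member XIDs (like pg_multixact)
next_mxid = 1

def lock_for_share(tup, xid):
    global next_mxid
    if tup.get("is_multi"):
        multixact_members[tup["xmax"]].append(xid)    # join existing group
    elif tup["xmax"] != 0:
        # second locker: promote the single xmax to a MultiXact ID
        mxid = next_mxid; next_mxid += 1
        multixact_members[mxid] = [tup["xmax"], xid]
        tup["xmax"] = mxid
        tup["is_multi"] = True
    else:
        tup["xmax"] = xid    # first locker fits directly in xmax

tup = {"xmax": 0}
lock_for_share(tup, 101)     # xmax = 101
lock_for_share(tup, 102)     # xmax = MultiXact ID -> [101, 102]
print(tup["is_multi"], multixact_members[tup["xmax"]])  # True [101, 102]
```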
Problem: Multixact ages just like XID → VACUUM is needed.
Serializable Snapshot Isolation (SSI)
PostgreSQL implements SERIALIZABLE without read locks:
-- SERIALIZABLE level in PG ≠ SERIALIZABLE in other RDBMS
-- PG uses SSI (Serializable Snapshot Isolation)
-- Mechanism:
-- 1. Tracks SIRead locks (predicate locks)
-- 2. Detects RW-conflict cycles
-- 3. If a cycle is found → aborts one transaction with error:
-- ERROR: could not serialize access due to read/write dependencies
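The conflict-detection idea can be illustrated with a small dependency graph. Note this is a deliberate simplification: real SSI aborts on a "dangerous structure" of two consecutive rw-antidependency edges, a cheaper conservative test than full cycle detection shown here.

```python
# Toy illustration of SSI conflict detection: model rw-antidependencies
# ("T1 read data that T2 later wrote") as directed edges and flag a cycle.

def has_cycle(edges):
    """Detect a cycle in a directed graph given as {node: set(successors)}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in edges}
    def visit(n):
        color[n] = GRAY
        for m in edges.get(n, ()):
            if color.get(m, WHITE) == GRAY:
                return True                    # back edge -> cycle
            if color.get(m, WHITE) == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False
    return any(visit(n) for n in edges if color[n] == WHITE)

# T1 read rows T2 wrote, and T2 read rows T1 wrote (write skew):
rw_edges = {"T1": {"T2"}, "T2": {"T1"}}
print(has_cycle(rw_edges))  # True -> one transaction must abort
```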
Example:
-- Both sessions: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

SELECT SUM(balance) FROM accounts;    -- T1 reads
SELECT SUM(balance) FROM accounts;    -- T2 reads
UPDATE accounts SET balance = 1000
WHERE id = 1;                         -- T1 writes a row T2 has read
UPDATE accounts SET balance = 500
WHERE id = 2;                         -- T2 writes a row T1 has read
COMMIT;                               -- T1: succeeds
COMMIT;                               -- T2: ← ERROR: serialization failure!
⚠️ Do NOT use SERIALIZABLE in high-load OLTP: with frequent RW-conflicts, transactions constantly roll back. Use READ COMMITTED or REPEATABLE READ instead.
HOT (Heap Only Tuple) updates
Heap = the main storage of table rows (not index). “Heap Only Tuple” = new version on the same page, indexes are not updated.
HOT mechanism:
A critical optimization for UPDATE:
-- Normal UPDATE:
-- 1. New row in Heap
-- 2. Update ALL indexes (new ctid)
-- 3. Mark the old row (xmax)
-- HOT UPDATE (if conditions are met):
-- 1. New row in Heap (on the same page!)
-- 2. Indexes are NOT updated
-- 3. A HOT chain is created: old → new
-- Conditions for HOT:
-- - Updated columns are NOT indexed
-- - There is space on the Heap page (fillfactor < 100)
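The two preconditions boil down to a simple check. A sketch with hypothetical names (`hot_eligible` is illustrative; the real decision lives in PostgreSQL's heap_update code):

```python
# Simplified HOT eligibility check: the update may stay "heap only" if
# no indexed column changed AND the new version fits on the same page.

def hot_eligible(changed_cols, indexed_cols, page_free_bytes, new_tuple_size):
    no_index_touched = not (set(changed_cols) & set(indexed_cols))
    fits_on_page = page_free_bytes >= new_tuple_size
    return no_index_touched and fits_on_page

# UPDATE users SET name='Petr' -- 'name' is not indexed, page has room:
print(hot_eligible({"name"}, {"id", "email"}, 400, 120))   # True
# UPDATE users SET email=...  -- 'email' is indexed -> regular update:
print(hot_eligible({"email"}, {"id", "email"}, 400, 120))  # False
```

The second condition is why lowering fillfactor (reserving free space per page) raises the HOT ratio on frequently updated tables.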
HOT monitoring:
SELECT
relname,
n_tup_hot_update, -- HOT updates
n_tup_upd, -- Total updates
ROUND(100.0 * n_tup_hot_update / NULLIF(n_tup_upd, 0), 2) as hot_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 0;
-- hot_pct > 70% → great
-- hot_pct < 20% → problem (too many indexed updates)
CLOG (pg_xact)
pg_xact/ — directory with 256 KB files
Each transaction = 2 bits:
00 = IN_PROGRESS
01 = COMMITTED
10 = ABORTED
11 = SUB_COMMITTED
-- 4 transactions per byte
-- One 256 KB file = 262,144 bytes × 4 = ~1 million transactions
-- Files are deleted after VACUUM + Freeze
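The arithmetic behind the segment size:

```python
# CLOG layout math: 2 status bits per transaction -> 4 transactions per
# byte -> one 256 KB pg_xact segment covers about a million XIDs.

BITS_PER_XACT = 2
XACTS_PER_BYTE = 8 // BITS_PER_XACT          # 4
SEGMENT_BYTES = 256 * 1024                    # one pg_xact segment file

xacts_per_segment = SEGMENT_BYTES * XACTS_PER_BYTE
print(xacts_per_segment)  # 1048576 -> ~1M transactions per 256 KB file
```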
XID Wraparound: deep analysis
-- 32-bit XID = 4,294,967,296 transactions
-- XID comparison: cyclic arithmetic
-- autovacuum performs "Freezing":
-- 1. Finds rows with xmin < freeze_limit
-- 2. Replaces xmin with FrozenXID (a special value)
-- 3. FrozenXID is always considered "in the past"
-- "Freezing" = replacing xmin with FrozenXID, which is ALWAYS "in the past".
-- A frozen row is visible to any transaction. Like a "1900" date on
-- a document — guaranteed "older" than any current transaction.
-- Monitoring:
SELECT
datname,
age(datfrozenxid),
ROUND(100.0 * age(datfrozenxid) / 2000000000, 2) as pct_to_wraparound
FROM pg_database;
-- > 1.5B = 75% → critical!
-- 2B = 100% → database goes READ ONLY
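The "cyclic arithmetic" mentioned above works like PostgreSQL's TransactionIdPrecedes: two XIDs are compared by the sign of their 32-bit difference, so any XID can only "see" about 2 billion XIDs behind it, which is why old rows must be frozen before the counter laps them. A sketch:

```python
# 32-bit circular XID comparison: `a` precedes `b` if the wrapped
# difference (a - b), interpreted as a signed 32-bit int, is negative.

def xid_precedes(a, b):
    """True if XID `a` is logically older than `b` on the 32-bit circle."""
    diff = (a - b) & 0xFFFFFFFF
    return diff > 0x7FFFFFFF          # top bit set = negative = older

print(xid_precedes(100, 200))             # True: 100 is older
print(xid_precedes(200, 100))             # False
# After wraparound, a tiny XID in the new cycle is "newer" than a huge one:
print(xid_precedes(4_294_967_000, 10))    # True
```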
Edge Cases
- Long-running transactions
- Long-running queries on a replica

-- With hot_standby_feedback = on,
-- the replica asks the master not to remove rows it is still reading
-- → Master accumulates dead tuples → Bloat!

- TOAST and MVCC

-- Large fields (text, bytea) are stored out-of-line in TOAST tables
-- TOAST rows are versioned too,
-- but they have no visibility check of their own:
-- visibility is decided by the main heap row that references them

- Visibility Map

-- VM: bitmap of pages where ALL rows are visible to everyone
-- Used for:
-- 1. Index Only Scan (visibility check without touching the Heap)
-- 2. VACUUM (skipping "clean" pages)
-- Updated during VACUUM
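A toy model of how the visibility map lets an Index Only Scan skip heap fetches (hypothetical names; real PostgreSQL tracks one all-visible bit per heap page in the `_vm` fork):

```python
# Sketch: if a page's all-visible bit is set, an index-only scan can
# return the value straight from the index; otherwise it must visit
# the heap page to run the MVCC visibility check.

all_visible = [True, False, True]   # VM bit per heap page

def index_only_fetch(ctid, heap_fetches):
    page, _ = ctid                   # ctid = (page_number, row_number)
    if all_visible[page]:
        return "from index"          # no heap access needed
    heap_fetches.append(page)        # fall back to the heap
    return "from heap"

fetches = []
print(index_only_fetch((0, 1), fetches))  # from index
print(index_only_fetch((1, 3), fetches))  # from heap
print(fetches)  # [1] -> only the not-all-visible page was read
```

This is also why a freshly VACUUMed table makes Index Only Scans dramatically cheaper: VACUUM is what sets the all-visible bits.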
Performance Impact
| Operation | MVCC effect |
|---|---|
| SELECT | Not blocked, creates Snapshot |
| INSERT | Creates new row (xmin = current) |
| UPDATE | Creates new + marks old (xmax) |
| DELETE | Marks row (xmax = current) |
| VACUUM | Removes dead rows no longer visible to any active transaction |
Production Experience
Real scenario #1: Bloat from a long transaction
- Analyst opened a transaction and left for lunch (4 hours)
- VACUUM could not remove dead tuples
- Table bloated from 10 GB to 80 GB
- Solution: SELECT pg_terminate_backend(pid), then VACUUM FULL
Real scenario #2: Wraparound crisis
- autovacuum was disabled “for performance”
- age(datfrozenxid) = 1.9B (95%)
- Database started warning in logs
- Solution: urgent VACUUM FREEZE on the entire database
- Lesson: never disable autovacuum!
Monitoring
-- 1. Dead tuples
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
-- 2. XID age
SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY age DESC;
-- 3. Long transactions
SELECT pid, now() - xact_start as duration, state, query
FROM pg_stat_activity
WHERE xact_start < now() - INTERVAL '1 hour'
ORDER BY xact_start;
-- 4. HOT ratio
SELECT
relname,
ROUND(100.0 * n_tup_hot_update / NULLIF(n_tup_upd, 0), 2) as hot_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 1000
ORDER BY hot_pct ASC;
Best Practices
- Don’t keep transactions open longer than necessary
- autovacuum — do NOT disable! Tune it, but don’t turn it off
- fillfactor = 70-80 for frequently updated tables → more HOT
- Index consciously — every index kills HOT for that column
- Monitor age(datfrozenxid) — prevent wraparound
- pg_repack for fighting Bloat without locking
- hot_standby_feedback = on → monitor Bloat on master
- VACUUM FULL only during maintenance window (locks!)
Summary for Senior
- MVCC = row version = xmin/xmax, doesn’t block reads
- Hint Bits speed up subsequent reads without CLOG
- HOT = UPDATE optimization without index updates
- SSI = SERIALIZABLE without read locks (PG is unique!)
- XID Wraparound — a real threat → monitor age()
- Long transactions = the main cause of Bloat
- VACUUM doesn’t shrink files → use pg_repack
- Visibility Map is critical for Index Only Scan
🎯 Interview Cheat Sheet
Must know:
- MVCC: each row has xmin (created) and xmax (deleted)
- Snapshot: a transaction sees rows that existed when it started
- UPDATE = DELETE + INSERT (new row version, old → dead tuple)
- Hint Bits: 2 bits in the row header → avoid CLOG reads after the first reader
- HOT (Heap Only Tuple): UPDATE without index updates (if column is not indexed + space on page)
- SSI (Serializable Snapshot Isolation): SERIALIZABLE without read locks
- XID Wraparound: 32-bit → 4B transactions → autovacuum performs Freeze
- Visibility Map: bitmap of “clean” pages → Index Only Scan + VACUUM skip
- CLOG (pg_xact): 2 bits per transaction → IN_PROGRESS/COMMITTED/ABORTED
- Long transactions → VACUUM can’t remove dead tuples → Bloat
Frequent follow-up questions:
- “What happens if autovacuum is disabled?” → XID Wraparound → database goes READ ONLY
- “Why doesn’t VACUUM shrink files?” → Frees space for new data, doesn’t return it to the OS
- “How does HOT work?” → New version on the same page → indexes aren’t updated
- “What is hot_standby_feedback?” → Replica asks master not to delete rows → Bloat on master
Red flags (do NOT say):
- ❌ “VACUUM shrinks table files” (no, VACUUM FULL or pg_repack is needed)
- ❌ “autovacuum can be disabled for performance” (Wraparound → READ ONLY!)
- ❌ “MVCC blocks reads during writes” (readers don’t block writers!)
- ❌ “Hint Bits are WAL-logged” (no, they are recalculated after crash)
Related topics:
- [[What is VACUUM in PostgreSQL]] → cleaning dead tuples
- [[Why is ANALYZE needed]] → collecting statistics
- [[Why are indexes needed]] → HOT and Visibility Map