How does MVCC work in PostgreSQL?
🟢 Junior Level
MVCC (Multi-Version Concurrency Control) — a mechanism that allows PostgreSQL to show each user their own version of data, without blocking reads during writes.
Simple analogy: Imagine Google Docs. When you edit a document, others see the old version until you save. Meanwhile, they can continue reading — they don’t need to wait for you to finish.
Difference from other RDBMS: Oracle uses undo segments, SQL Server uses tempdb. PostgreSQL stores ALL versions in the same table → simpler, but bloat from UPDATE/DELETE.
Core idea:
- On UPDATE, PostgreSQL does not overwrite the row, but creates a new version
- On DELETE, the row is not physically removed, but marked as deleted
- Each transaction sees only the rows that existed at the moment it started
Example:
-- Two concurrent sessions, T1 and T2 (READ COMMITTED):
BEGIN;                                     -- T1
BEGIN;                                     -- T2
SELECT * FROM users WHERE id=1;            -- T2: sees name='Ivan'
UPDATE users SET name='Petr' WHERE id=1;   -- T1: creates a new row version
SELECT * FROM users WHERE id=1;            -- T2: still sees name='Ivan'!
SELECT * FROM users WHERE id=1;            -- T1: sees its own change, name='Petr'
COMMIT;                                    -- T1
SELECT * FROM users WHERE id=1;            -- T2: now sees name='Petr'
Why it’s needed:
- Readers don’t block writers
- Writers don’t block readers
- No “dirty read” problem
🟡 Middle Level
How it works internally
Each row (tuple) contains hidden system fields:
| Field | Description |
|---|---|
| xmin | ID of the transaction that created the row |
| xmax | ID of the transaction that deleted the row (0 = not deleted) |
| ctid | Physical address of the row (page_number, row_number) |
Mechanism:
-- Initial state
-- users table:
-- xmin=100, xmax=0, name='Ivan', ctid=(1,1)
-- Transaction 101 performs UPDATE
UPDATE users SET name='Petr' WHERE id=1;
-- New version of the row:
-- xmin=101, xmax=0, name='Petr', ctid=(1,2)
-- Old version:
-- xmin=100, xmax=101, name='Ivan', ctid=(1,1) ← xmax=101 (marked)
-- Transaction 102 (started before 101):
-- → Sees the row with xmin=100 (xmax=101 not yet committed for it)
-- → name='Ivan'
-- Transaction 103 (started after 101):
-- → Sees the row with xmin=101 (committed)
-- → name='Petr'
Snapshots
When a transaction starts a SELECT, it takes a snapshot of the database state:
Snapshot:
xmin = 100 ← All transactions < 100 have finished (committed or aborted)
xmax = 105 ← All transactions >= 105 haven't started yet (never visible)
xip_list = [101, 102, 103] ← Transactions in progress at snapshot time (never visible)
Visibility rule:
A row is visible if:
- its xmin is committed, xmin < snapshot.xmax, and xmin is NOT in xip_list
- AND its xmax = 0, OR the deleting transaction is not visible to the snapshot (xmax >= snapshot.xmax, xmax is in xip_list, or it aborted)
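The rule above can be sketched in a few lines of Python. This is a simplified educational model, not PostgreSQL's real visibility code (which also handles subtransactions, command IDs, and hint bits); transaction status is modeled as a plain `committed` set.

```python
# Simplified sketch of PostgreSQL's tuple visibility check.
# snapshot = {xmin, xmax, xip_list}; `committed` models CLOG lookups.

def xid_visible_in_snapshot(xid, snapshot, committed):
    """Was `xid` committed before the snapshot was taken?"""
    if xid >= snapshot["xmax"]:
        return False              # started after the snapshot
    if xid in snapshot["xip_list"]:
        return False              # was still in progress at snapshot time
    return xid in committed       # must have actually committed

def tuple_visible(tup, snapshot, committed):
    """Visible if the creator committed before the snapshot and no visible deleter exists."""
    if not xid_visible_in_snapshot(tup["xmin"], snapshot, committed):
        return False
    if tup["xmax"] == 0:
        return True               # never deleted
    # deleted, but the deletion only hides the row if the deleter is visible to us
    return not xid_visible_in_snapshot(tup["xmax"], snapshot, committed)

snapshot = {"xmin": 100, "xmax": 105, "xip_list": {101, 102, 103}}
committed = {99, 100, 104}

old = {"xmin": 100, "xmax": 101, "name": "Ivan"}  # deleter 101 still active
new = {"xmin": 101, "xmax": 0,   "name": "Petr"}  # creator 101 still active

print(tuple_visible(old, snapshot, committed))  # True  -> sees 'Ivan'
print(tuple_visible(new, snapshot, committed))  # False -> 'Petr' not yet visible
```

This reproduces the example from the Middle-level section: transaction 101's UPDATE is invisible to a snapshot taken while 101 is still in progress.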
UPDATE = DELETE + INSERT
-- This is NOT an in-place update!
UPDATE users SET name='Petr' WHERE id=1;
-- Actually:
-- 1. A NEW version of the row is created (name='Petr')
-- 2. The old version is marked (xmax = current transaction)
-- 3. Indexes are updated (reference to the new ctid)
VACUUM: cleaning up dead rows
-- After UPDATE/DELETE, "dead" rows remain (dead tuples)
-- VACUUM removes them, freeing space
-- Check dead rows
SELECT
schemaname,
relname,
n_live_tup, -- Live rows
n_dead_tup, -- Dead rows
last_vacuum
FROM pg_stat_user_tables
WHERE relname = 'users';
-- Manual run
VACUUM users;
Common problems
- Bloat
- Bloat

-- Table takes 10x more space than it should
-- Cause: many dead tuples, VACUUM can't keep up
-- Solution:
VACUUM FULL users;   -- ⚠️ Locks the table!
-- or, without a long lock:
pg_repack -t users;

- Long transactions block VACUUM

-- If a transaction has been open for 2 hours,
-- VACUUM cannot remove dead rows that transaction can still see
-- Find long transactions:
SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC;

- Transaction ID Wraparound

-- XID is a 32-bit number (~4 billion values)
-- If old rows aren't "frozen" → the counter wraps around
-- Check age:
SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY age DESC;
-- > 200M → urgent VACUUM needed!
-- approaching ~2B → database stops accepting writes (effectively READ ONLY)
🔴 Senior Level
Hint Bits optimization
Problem: Checking transaction status via CLOG (Commit Log) is expensive (disk read).
CLOG — files in pg_xact/, where each transaction = 2 bits (IN_PROGRESS/COMMITTED/ABORTED). This is NOT WAL. CLOG read = microseconds, but with millions of rows → seconds.
Solution: PostgreSQL uses Hint Bits — 2 bits in the row header:
Heap Tuple Header:
...
t_infomask:
HEAP_XMIN_COMMITTED ← xmin committed
HEAP_XMIN_INVALID ← xmin aborted
HEAP_XMAX_COMMITTED ← xmax committed
HEAP_XMAX_INVALID ← xmax aborted
...
Mechanism:
1. First reader checks xmin status in CLOG
2. Sets Hint Bits in the row header
3. All subsequent readers check Hint Bits instantly
→ No CLOG lookup needed!
This is called "Hint Bits propagation" or "Lazy hinting"
Important: Hint Bits are NOT WAL-logged! After crash, they are recalculated.
Multixacts
When multiple transactions lock the same row (SELECT FOR SHARE):
-- Transaction 1: SELECT FOR SHARE → lock
-- Transaction 2: SELECT FOR SHARE → also a lock
-- xmax cannot store two IDs!
-- Solution: MultiXact ID
-- xmax stores a reference to pg_multixact
-- pg_multixact stores the list of transactions: [101, 102]
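A minimal sketch of the promotion step, with hypothetical names (`lock_for_share`, `multixact_members` stand in for the real pg_multixact machinery): the single xmax slot is replaced by a MultiXact ID pointing at a member list once a second locker arrives.

```python
# Toy model of MultiXact promotion: xmax holds one 32-bit value, so the
# second shared locker forces creation of a MultiXact that lists both.

multixact_members = {}       # mxid -> list of member XIDs (like pg_multixact)
next_mxid = 1

def lock_for_share(tup, xid):
    global next_mxid
    if tup.get("is_multi"):
        multixact_members[tup["xmax"]].append(xid)    # join existing group
    elif tup["xmax"] != 0:
        # second locker: promote the single xmax to a MultiXact ID
        mxid = next_mxid; next_mxid += 1
        multixact_members[mxid] = [tup["xmax"], xid]
        tup["xmax"] = mxid
        tup["is_multi"] = True
    else:
        tup["xmax"] = xid    # first locker fits directly in xmax

tup = {"xmax": 0}
lock_for_share(tup, 101)     # xmax = 101
lock_for_share(tup, 102)     # xmax = MultiXact ID -> [101, 102]
print(tup["is_multi"], multixact_members[tup["xmax"]])  # True [101, 102]
```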
Problem: Multixact ages just like XID → VACUUM is needed.
Serializable Snapshot Isolation (SSI)
PostgreSQL implements SERIALIZABLE without read locks:
-- SERIALIZABLE level in PG ≠ SERIALIZABLE in other RDBMS
-- PG uses SSI (Serializable Snapshot Isolation)
-- Mechanism:
-- 1. Tracks SIRead locks (predicate locks)
-- 2. Detects RW-conflict cycles
-- 3. If a cycle is found → aborts one transaction with error:
-- ERROR: could not serialize access due to read/write dependencies
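The conflict-detection idea can be illustrated with a small dependency graph. Note this is a deliberate simplification: real SSI aborts on a "dangerous structure" of two consecutive rw-antidependency edges, a cheaper conservative test than full cycle detection shown here.

```python
# Toy illustration of SSI conflict detection: model rw-antidependencies
# ("T1 read data that T2 later wrote") as directed edges and flag a cycle.

def has_cycle(edges):
    """Detect a cycle in a directed graph given as {node: set(successors)}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in edges}
    def visit(n):
        color[n] = GRAY
        for m in edges.get(n, ()):
            if color.get(m, WHITE) == GRAY:
                return True                    # back edge -> cycle
            if color.get(m, WHITE) == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False
    return any(visit(n) for n in edges if color[n] == WHITE)

# T1 read rows T2 wrote, and T2 read rows T1 wrote (write skew):
rw_edges = {"T1": {"T2"}, "T2": {"T1"}}
print(has_cycle(rw_edges))  # True -> one transaction must abort
```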
Example:
-- Both sessions: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

SELECT SUM(balance) FROM accounts;    -- T1 reads
SELECT SUM(balance) FROM accounts;    -- T2 reads
UPDATE accounts SET balance = 1000
WHERE id = 1;                         -- T1 writes a row T2 has read
UPDATE accounts SET balance = 500
WHERE id = 2;                         -- T2 writes a row T1 has read
COMMIT;                               -- T1: succeeds
COMMIT;                               -- T2: ← ERROR: serialization failure!
⚠️ Do NOT use SERIALIZABLE in high-load OLTP: with frequent RW-conflicts, transactions constantly roll back. Use READ COMMITTED or REPEATABLE READ instead.
HOT (Heap Only Tuple) updates
Heap = the main storage of table rows (not index). “Heap Only Tuple” = new version on the same page, indexes are not updated.
HOT mechanism:
A critical optimization for UPDATE:
-- Normal UPDATE:
-- 1. New row in Heap
-- 2. Update ALL indexes (new ctid)
-- 3. Mark the old row (xmax)
-- HOT UPDATE (if conditions are met):
-- 1. New row in Heap (on the same page!)
-- 2. Indexes are NOT updated
-- 3. A HOT chain is created: old → new
-- Conditions for HOT:
-- - Updated columns are NOT indexed
-- - There is space on the Heap page (fillfactor < 100)
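The two preconditions boil down to a simple check. A sketch with hypothetical names (`hot_eligible` is illustrative; the real decision lives in PostgreSQL's heap_update code):

```python
# Simplified HOT eligibility check: the update may stay "heap only" if
# no indexed column changed AND the new version fits on the same page.

def hot_eligible(changed_cols, indexed_cols, page_free_bytes, new_tuple_size):
    no_index_touched = not (set(changed_cols) & set(indexed_cols))
    fits_on_page = page_free_bytes >= new_tuple_size
    return no_index_touched and fits_on_page

# UPDATE users SET name='Petr' -- 'name' is not indexed, page has room:
print(hot_eligible({"name"}, {"id", "email"}, 400, 120))   # True
# UPDATE users SET email=...  -- 'email' is indexed -> regular update:
print(hot_eligible({"email"}, {"id", "email"}, 400, 120))  # False
```

The second condition is why lowering fillfactor (reserving free space per page) raises the HOT ratio on frequently updated tables.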
HOT monitoring:
SELECT
relname,
n_tup_hot_update, -- HOT updates
n_tup_upd, -- Total updates
ROUND(100.0 * n_tup_hot_update / NULLIF(n_tup_upd, 0), 2) as hot_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 0;
-- hot_pct > 70% → great
-- hot_pct < 20% → problem (too many indexed updates)
CLOG (pg_xact)
pg_xact/ — directory with 256 KB files
Each transaction = 2 bits:
00 = IN_PROGRESS
01 = COMMITTED
10 = ABORTED
11 = SUB_COMMITTED
-- 4 transactions per byte
-- One 256 KB file = 262,144 bytes × 4 = ~1 million transactions
-- Files are deleted after VACUUM + Freeze
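The arithmetic behind the segment size:

```python
# CLOG layout math: 2 status bits per transaction -> 4 transactions per
# byte -> one 256 KB pg_xact segment covers about a million XIDs.

BITS_PER_XACT = 2
XACTS_PER_BYTE = 8 // BITS_PER_XACT          # 4
SEGMENT_BYTES = 256 * 1024                    # one pg_xact segment file

xacts_per_segment = SEGMENT_BYTES * XACTS_PER_BYTE
print(xacts_per_segment)  # 1048576 -> ~1M transactions per 256 KB file
```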
XID Wraparound: deep analysis
-- 32-bit XID = 4,294,967,296 transactions
-- XID comparison: cyclic arithmetic
-- autovacuum performs "Freezing":
-- 1. Finds rows with xmin < freeze_limit
-- 2. Replaces xmin with FrozenXID (a special value)
-- 3. FrozenXID is always considered "in the past"
-- "Freezing" = replacing xmin with FrozenXID, which is ALWAYS "in the past".
-- A frozen row is visible to any transaction. Like a "1900" date on
-- a document — guaranteed "older" than any current transaction.
-- Monitoring:
SELECT
datname,
age(datfrozenxid),
ROUND(100.0 * age(datfrozenxid) / 2000000000, 2) as pct_to_wraparound
FROM pg_database;
-- > 1.5B = 75% → critical!
-- 2B = 100% → database goes READ ONLY
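The "cyclic arithmetic" mentioned above works like PostgreSQL's TransactionIdPrecedes: two XIDs are compared by the sign of their 32-bit difference, so any XID can only "see" about 2 billion XIDs behind it, which is why old rows must be frozen before the counter laps them. A sketch:

```python
# 32-bit circular XID comparison: `a` precedes `b` if the wrapped
# difference (a - b), interpreted as a signed 32-bit int, is negative.

def xid_precedes(a, b):
    """True if XID `a` is logically older than `b` on the 32-bit circle."""
    diff = (a - b) & 0xFFFFFFFF
    return diff > 0x7FFFFFFF          # top bit set = negative = older

print(xid_precedes(100, 200))             # True: 100 is older
print(xid_precedes(200, 100))             # False
# After wraparound, a tiny XID in the new cycle is "newer" than a huge one:
print(xid_precedes(4_294_967_000, 10))    # True
```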
Edge Cases
- Long-running transactions
- Long-running queries on a replica

-- With hot_standby_feedback = on,
-- the replica asks the master not to remove rows it is still reading
-- → Master accumulates dead tuples → Bloat!

- TOAST and MVCC

-- Large fields (text, bytea) are stored out-of-line in TOAST tables
-- TOAST rows are versioned too,
-- but they have no visibility check of their own:
-- visibility is decided by the main heap row that references them

- Visibility Map

-- VM: bitmap of pages where ALL rows are visible to everyone
-- Used for:
-- 1. Index Only Scan (visibility check without touching the Heap)
-- 2. VACUUM (skipping "clean" pages)
-- Updated during VACUUM
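A toy model of how the visibility map lets an Index Only Scan skip heap fetches (hypothetical names; real PostgreSQL tracks one all-visible bit per heap page in the `_vm` fork):

```python
# Sketch: if a page's all-visible bit is set, an index-only scan can
# return the value straight from the index; otherwise it must visit
# the heap page to run the MVCC visibility check.

all_visible = [True, False, True]   # VM bit per heap page

def index_only_fetch(ctid, heap_fetches):
    page, _ = ctid                   # ctid = (page_number, row_number)
    if all_visible[page]:
        return "from index"          # no heap access needed
    heap_fetches.append(page)        # fall back to the heap
    return "from heap"

fetches = []
print(index_only_fetch((0, 1), fetches))  # from index
print(index_only_fetch((1, 3), fetches))  # from heap
print(fetches)  # [1] -> only the not-all-visible page was read
```

This is also why a freshly VACUUMed table makes Index Only Scans dramatically cheaper: VACUUM is what sets the all-visible bits.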
Performance Impact
| Operation | MVCC effect |
|---|---|
| SELECT | Not blocked, creates Snapshot |
| INSERT | Creates new row (xmin = current) |
| UPDATE | Creates new + marks old (xmax) |
| DELETE | Marks row (xmax = current) |
| VACUUM | Removes dead rows no longer visible to any active transaction |
Production Experience
Real scenario #1: Bloat from a long transaction
- Analyst opened a transaction and left for lunch (4 hours)
- VACUUM could not remove dead tuples
- Table bloated from 10 GB to 80 GB
- Solution: SELECT pg_terminate_backend(pid), then VACUUM FULL
Real scenario #2: Wraparound crisis
- autovacuum was disabled “for performance”
- age(datfrozenxid) = 1.9B (95%)
- Database started warning in logs
- Solution: urgent VACUUM FREEZE on the entire database
- Lesson: never disable autovacuum!
Monitoring
-- 1. Dead tuples
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
-- 2. XID age
SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY age DESC;
-- 3. Long transactions
SELECT pid, now() - xact_start as duration, state, query
FROM pg_stat_activity
WHERE xact_start < now() - INTERVAL '1 hour'
ORDER BY xact_start;
-- 4. HOT ratio
SELECT
relname,
ROUND(100.0 * n_tup_hot_update / NULLIF(n_tup_upd, 0), 2) as hot_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 1000
ORDER BY hot_pct ASC;
Best Practices
- Don’t keep transactions open longer than necessary
- autovacuum — do NOT disable! Tune it, but don’t turn it off
- fillfactor = 70-80 for frequently updated tables → more HOT
- Index consciously — every index kills HOT for that column
- Monitor age(datfrozenxid) — prevent wraparound
- pg_repack for fighting Bloat without locking
- hot_standby_feedback = on → monitor Bloat on master
- VACUUM FULL only during maintenance window (locks!)
Summary for Senior
- MVCC = row version = xmin/xmax, doesn’t block reads
- Hint Bits speed up subsequent reads without CLOG
- HOT = UPDATE optimization without index updates
- SSI = SERIALIZABLE without read locks (PG is unique!)
- XID Wraparound — a real threat → monitor age()
- Long transactions = the main cause of Bloat
- VACUUM doesn’t shrink files → use pg_repack
- Visibility Map is critical for Index Only Scan
🎯 Interview Cheat Sheet
Must know:
- MVCC: each row has xmin (created) and xmax (deleted)
- Snapshot: a transaction sees rows that existed when it started
- UPDATE = DELETE + INSERT (new row version, old → dead tuple)
- Hint Bits: 2 bits in the row header → avoid CLOG reads after the first reader
- HOT (Heap Only Tuple): UPDATE without index updates (if column is not indexed + space on page)
- SSI (Serializable Snapshot Isolation): SERIALIZABLE without read locks
- XID Wraparound: 32-bit → 4B transactions → autovacuum performs Freeze
- Visibility Map: bitmap of “clean” pages → Index Only Scan + VACUUM skip
- CLOG (pg_xact): 2 bits per transaction → IN_PROGRESS/COMMITTED/ABORTED
- Long transactions → VACUUM can’t remove dead tuples → Bloat
Frequent follow-up questions:
- “What happens if autovacuum is disabled?” → XID Wraparound → database goes READ ONLY
- “Why doesn’t VACUUM shrink files?” → Frees space for new data, doesn’t return it to the OS
- “How does HOT work?” → New version on the same page → indexes aren’t updated
- “What is hot_standby_feedback?” → Replica asks master not to delete rows → Bloat on master
Red flags (do NOT say):
- ❌ “VACUUM shrinks table files” (no, VACUUM FULL or pg_repack is needed)
- ❌ “autovacuum can be disabled for performance” (Wraparound → READ ONLY!)
- ❌ “MVCC blocks reads during writes” (readers don’t block writers!)
- ❌ “Hint Bits are WAL-logged” (no, they are recalculated after crash)
Related topics:
- [[What is VACUUM in PostgreSQL]] → cleaning dead tuples
- [[Why is ANALYZE needed]] → collecting statistics
- [[Why are indexes needed]] → HOT and Visibility Map