Question 17 · Section 1

How does MVCC work in PostgreSQL?

Each row (tuple) contains hidden system fields:

Language versions: English Russian Ukrainian

🟢 Junior Level

MVCC (Multi-Version Concurrency Control) — a mechanism that allows PostgreSQL to show each user their own version of data, without blocking reads during writes.

Simple analogy: Imagine Google Docs. When you edit a document, others see the old version until you save. Meanwhile, they can continue reading — they don’t need to wait for you to finish.

Difference from other RDBMS: Oracle uses undo segments, SQL Server uses tempdb. PostgreSQL stores ALL versions in the same table → simpler, but bloat from UPDATE/DELETE.

Core idea:

  • On UPDATE, PostgreSQL does not overwrite the row, but creates a new version
  • On DELETE, the row is not physically removed, but marked as deleted
  • Each transaction sees only the rows that existed at the moment it started

Example:

-- Transaction 1                    -- Transaction 2
BEGIN;                              BEGIN;
SELECT * FROM users WHERE id=1;     -- Sees: name='Ivan'

UPDATE users SET name='Petr'
WHERE id=1;                         -- Still sees: name='Ivan'!

SELECT * FROM users WHERE id=1;     -- Sees: name='Petr'
COMMIT;

                                    SELECT * FROM users WHERE id=1;
                                    -- Now sees: name='Petr'

Why it’s needed:

  • Readers don’t block writers
  • Writers don’t block readers
  • No “dirty read” problem

🟡 Middle Level

How it works internally

Each row (tuple) contains hidden system fields:

Field Description
xmin ID of the transaction that created the row
xmax ID of the transaction that deleted the row (0 = not deleted)
ctid Physical address of the row (page_number, row_number)

Mechanism:

-- Initial state
-- users table:
-- xmin=100, xmax=0, name='Ivan', ctid=(1,1)

-- Transaction 101 performs UPDATE
UPDATE users SET name='Petr' WHERE id=1;

-- New version of the row:
-- xmin=101, xmax=0, name='Petr', ctid=(1,2)
-- Old version:
-- xmin=100, xmax=101, name='Ivan', ctid=(1,1) ← xmax=101 (marked)

-- Transaction 102 (started before 101):
-- → Sees the row with xmin=100 (xmax=101 not yet committed for it)
-- → name='Ivan'

-- Transaction 103 (started after 101):
-- → Sees the row with xmin=101 (committed)
-- → name='Petr'

Snapshots

When a transaction starts a SELECT, it takes a snapshot of the database state:

Snapshot:
  xmin = 100  ← All transactions < 100 are committed
  xmax = 105  ← All transactions >= 105 haven't started yet
  xip_list = [101, 102, 103]  ← Active transactions

Visibility rule:
  A row is visible if:
    - its xmin < snapshot.xmin OR xmin is in xip_list (still active)
    - its xmax = 0 OR xmax >= snapshot.xmax

UPDATE = DELETE + INSERT

-- This is NOT an in-place update!
UPDATE users SET name='Petr' WHERE id=1;

-- Actually:
-- 1. A NEW version of the row is created (name='Petr')
-- 2. The old version is marked (xmax = current transaction)
-- 3. Indexes are updated (reference to the new ctid)

VACUUM: cleaning up dead rows

-- After UPDATE/DELETE, "dead" rows remain (dead tuples)
-- VACUUM removes them, freeing space

-- Check dead rows
SELECT
    schemaname,
    relname,
    n_live_tup,    -- Live rows
    n_dead_tup,    -- Dead rows
    last_vacuum
FROM pg_stat_user_tables
WHERE relname = 'users';

-- Manual run
VACUUM users;

Common problems

  1. Bloat
    -- Table takes 10x more space than it should
    -- Cause: many dead tuples, VACUUM can't keep up
    
    -- Solution
    VACUUM FULL users;  -- ⚠️ Locks the table!
    -- or
    pg_repack -t users; -- Without locking
    
  2. Long transactions block VACUUM
    -- If a transaction has been open for 2 hours
    -- VACUUM cannot remove rows that it can see
    
    -- Find long transactions
    SELECT pid, now() - xact_start as duration, query
    FROM pg_stat_activity
    WHERE state = 'active'
    ORDER BY xact_start ASC;
    
  3. Transaction ID Wraparound
    -- XID is a 32-bit number (4 billion)
    -- If old rows aren't "frozen" → overflow
    
    -- Check age
    SELECT datname, age(datfrozenxid)
    FROM pg_database
    ORDER BY age DESC;
    
    -- > 200M → urgent VACUUM needed!
    -- > 1.5B → database will go READ ONLY
    

🔴 Senior Level

Hint Bits optimization

Problem: Checking transaction status via CLOG (Commit Log) is expensive (disk read).

CLOG — files in pg_xact/, where each transaction = 2 bits (IN_PROGRESS/COMMITTED/ABORTED). This is NOT WAL. CLOG read = microseconds, but with millions of rows → seconds.

Solution: PostgreSQL uses Hint Bits — 2 bits in the row header:

Heap Tuple Header:
  ...
  t_infomask:
    HEAP_XMIN_COMMITTED    ← xmin committed
    HEAP_XMIN_INVALID      ← xmin aborted
    HEAP_XMAX_COMMITTED    ← xmax committed
    HEAP_XMAX_INVALID      ← xmax aborted
  ...

Mechanism:

1. First reader checks xmin status in CLOG
2. Sets Hint Bits in the row header
3. All subsequent readers check Hint Bits instantly
   → No CLOOKUP needed!

This is called "Hint Bits propagation" or "Lazy hinting"

Important: Hint Bits are NOT WAL-logged! After crash, they are recalculated.

Multixacts

When multiple transactions lock the same row (SELECT FOR SHARE):

-- Transaction 1: SELECT FOR SHARE → lock
-- Transaction 2: SELECT FOR SHARE → also a lock

-- xmax cannot store two IDs!
-- Solution: MultiXact ID

-- xmax stores a reference to pg_multixact
-- pg_multixact stores the list of transactions: [101, 102]

Problem: Multixact ages just like XID → VACUUM is needed.

Serializable Snapshot Isolation (SSI)

PostgreSQL implements SERIALIZABLE without read locks:

-- SERIALIZABLE level in PG ≠ SERIALIZABLE in other RDBMS
-- PG uses SSI (Serializable Snapshot Isolation)

-- Mechanism:
-- 1. Tracks SIRead locks (predicate locks)
-- 2. Detects RW-conflict cycles
-- 3. If a cycle is found → aborts one transaction with error:
--    ERROR: could not serialize access due to read/write dependencies

Example:

-- Transaction 1                           -- Transaction 2
SET TRANSACTION ISOLATION LEVEL
  SERIALIZABLE;                            SET TRANSACTION ISOLATION LEVEL
                                             SERIALIZABLE;
SELECT SUM(balance) FROM accounts;
                                           UPDATE accounts SET balance = 1000
                                             WHERE id = 1;
UPDATE accounts SET balance = 500
  WHERE id = 2;
COMMIT;
                                           COMMIT;   ERROR: serialization failure!

⚠️ Do NOT use SERIALIZABLE in high-load OLTP: with frequent RW-conflicts, transactions constantly roll back. Use READ COMMITTED or REPEATABLE READ instead.

HOT (Heap Only Tuple) updates

Heap = the main storage of table rows (not index). “Heap Only Tuple” = new version on the same page, indexes are not updated.

HOT mechanism:

A critical optimization for UPDATE:

-- Normal UPDATE:
-- 1. New row in Heap
-- 2. Update ALL indexes (new ctid)
-- 3. Mark the old row (xmax)

-- HOT UPDATE (if conditions are met):
-- 1. New row in Heap (on the same page!)
-- 2. Indexes are NOT updated
-- 3. A HOT chain is created: old → new

-- Conditions for HOT:
-- - Updated columns are NOT indexed
-- - There is space on the Heap page (fillfactor < 100)

HOT monitoring:

SELECT
    relname,
    n_tup_hot_update,     -- HOT updates
    n_tup_upd,            -- Total updates
    ROUND(100.0 * n_tup_hot_update / NULLIF(n_tup_upd, 0), 2) as hot_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 0;

-- hot_pct > 70% → great
-- hot_pct < 20% → problem (too many indexed updates)

CLOG (pg_xact)

pg_xact/ — directory with 256 KB files
Each transaction = 2 bits:
  00 = IN_PROGRESS
  01 = COMMITTED
  10 = ABORTED
  11 = SUB_COMMITTED

-- 4 transactions per byte
-- 256 KB = 1 billion transactions
-- Files are deleted after VACUUM + Freeze

XID Wraparound: deep analysis

-- 32-bit XID = 4,294,967,296 transactions
-- XID comparison: cyclic arithmetic

-- autovacuum performs "Freezing":
-- 1. Finds rows with xmin < freeze_limit
-- 2. Replaces xmin with FrozenXID (a special value)
-- 3. FrozenXID is always considered "in the past"
-- "Freezing" = replacing xmin with FrozenXID, which is ALWAYS "in the past".
-- A frozen row is visible to any transaction. Like a "1900" date on
-- a document — guaranteed "older" than any current transaction.

-- Monitoring:
SELECT
    datname,
    age(datfrozenxid),
    ROUND(100.0 * age(datfrozenxid) / 2000000000, 2) as pct_to_wraparound
FROM pg_database;

-- > 1.5B = 75% → critical!
-- 2B = 100% → database goes READ ONLY

Edge Cases

  1. Long-running transactions
    -- On replica: hot_standby_feedback = on
    -- Replica asks master not to delete rows it is reading
    -- → Master accumulates dead tuples → Bloat!
    
  2. TOAST and MVCC
    -- Large fields (text, bytea) are stored in TOAST tables
    -- TOAST rows are also versioned
    -- But TOAST does NOT use MVCC for reading!
    -- → May return data from a deleted row in rare cases
    
  3. Visibility Map
    -- VM: bitmap of pages where ALL rows are visible to everyone
    -- Used for:
    -- 1. Index Only Scan (visibility check without Heap)
    -- 2. VACUUM (skipping "clean" pages)
    
    -- Updated during VACUUM
    

Performance Impact

Operation MVCC effect
SELECT Not blocked, creates Snapshot
INSERT Creates new row (xmin = current)
UPDATE Creates new + marks old (xmax)
DELETE Marks row (xmax = current)
VACUUM Removes rows with xmax < oldest_active_xid

Production Experience

Real scenario #1: Bloat from a long transaction

  • Analyst opened a transaction and left for lunch (4 hours)
  • VACUUM could not remove dead tuples
  • Table bloated from 10 GB to 80 GB
  • Solution: SELECT pg_terminate_backend(pid) → VACUUM FULL

Real scenario #2: Wraparound crisis

  • autovacuum was disabled “for performance”
  • age(datfrozenxid) = 1.9B (95%)
  • Database started warning in logs
  • Solution: urgent VACUUM FREEZE on the entire database
  • Lesson: never disable autovacuum!

Monitoring

-- 1. Dead tuples
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;

-- 2. XID age
SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY age DESC;

-- 3. Long transactions
SELECT pid, now() - xact_start as duration, state, query
FROM pg_stat_activity
WHERE xact_start < now() - INTERVAL '1 hour'
ORDER BY xact_start;

-- 4. HOT ratio
SELECT
    relname,
    ROUND(100.0 * n_tup_hot_update / NULLIF(n_tup_upd, 0), 2) as hot_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 1000
ORDER BY hot_pct ASC;

Best Practices

  1. Don’t keep transactions open longer than necessary
  2. autovacuum — do NOT disable! Tune it, but don’t turn it off
  3. fillfactor = 70-80 for frequently updated tables → more HOT
  4. Index consciously — every index kills HOT for that column
  5. Monitor age(datfrozenxid) — prevent wraparound
  6. pg_repack for fighting Bloat without locking
  7. hot_standby_feedback = on → monitor Bloat on master
  8. VACUUM FULL only during maintenance window (locks!)

Summary for Senior

  • MVCC = row version = xmin/xmax, doesn’t block reads
  • Hint Bits speed up subsequent reads without CLOG
  • HOT = UPDATE optimization without index updates
  • SSI = SERIALIZABLE without read locks (PG is unique!)
  • XID Wraparound — a real threat → monitor age()
  • Long transactions = the main cause of Bloat
  • VACUUM doesn’t shrink files → use pg_repack
  • Visibility Map is critical for Index Only Scan

🎯 Interview Cheat Sheet

Must know:

  • MVCC: each row has xmin (created) and xmax (deleted)
  • Snapshot: a transaction sees rows that existed when it started
  • UPDATE = DELETE + INSERT (new row version, old → dead tuple)
  • Hint Bits: 2 bits in the row header → avoid CLOG reads after the first reader
  • HOT (Heap Only Tuple): UPDATE without index updates (if column is not indexed + space on page)
  • SSI (Serializable Snapshot Isolation): SERIALIZABLE without read locks
  • XID Wraparound: 32-bit → 4B transactions → autovacuum performs Freeze
  • Visibility Map: bitmap of “clean” pages → Index Only Scan + VACUUM skip
  • CLOG (pg_xact): 2 bits per transaction → IN_PROGRESS/COMMITTED/ABORTED
  • Long transactions → VACUUM can’t remove dead tuples → Bloat

Frequent follow-up questions:

  • “What happens if autovacuum is disabled?” → XID Wraparound → database goes READ ONLY
  • “Why doesn’t VACUUM shrink files?” → Frees space for new data, doesn’t return it to the OS
  • “How does HOT work?” → New version on the same page → indexes aren’t updated
  • “What is hot_standby_feedback?” → Replica asks master not to delete rows → Bloat on master

Red flags (do NOT say):

  • ❌ “VACUUM shrinks table files” (no, VACUUM FULL or pg_repack is needed)
  • ❌ “autovacuum can be disabled for performance” (Wraparound → READ ONLY!)
  • ❌ “MVCC blocks reads during writes” (readers don’t block writers!)
  • ❌ “Hint Bits are WAL-logged” (no, they are recalculated after crash)

Related topics:

  • [[What is VACUUM in PostgreSQL]] → cleaning dead tuples
  • [[Why is ANALYZE needed]] → collecting statistics
  • [[Why are indexes needed]] → HOT and Visibility Map