Question 2 Β· Section 1

How does B-tree index work?

To find a page, you go from top to bottom, discarding unnecessary sections.

Language versions: English Russian Ukrainian

🟒 Junior Level

B-tree (Balanced Tree) is a balanced tree that PostgreSQL uses for storing indexes by default.

Why β€œbalanced”? All paths from root to leaf have the same length. This guarantees that any search will take the same amount of time β€” there are no β€œbad” cases like in a regular binary tree, which can degenerate into a linked list.

Simple analogy: Imagine a table of contents in a book, split into levels:

  • Top level: sections (Chapters 1-10)
  • Middle level: subsections (1.1, 1.2, 1.3…)
  • Bottom level: specific pages

To find a page, you go from top to bottom, discarding unnecessary sections.

Main idea:

  • All data is sorted
  • Search is fast: even with a million elements, you only need ~20-30 comparisons
  • PostgreSQL automatically maintains tree balance

Example:

-- Creating a B-tree index (default)
CREATE INDEX idx_users_email ON users(email);

-- Indexed search: O(log n) instead of O(n)
SELECT * FROM users WHERE email = 'test@example.com';

When to use:

  • For most cases (this is the default index)
  • For equality search: WHERE id = 5
  • For range search: WHERE price BETWEEN 100 AND 500
  • For sorting: ORDER BY created_at DESC

🟑 Middle Level

How it works internally

The index is stored in 8 KB pages. The structure is hierarchical:

Root
  β”œβ”€β”€ Internal Node β†’ Internal Node β†’ Leaf Node
  └── Internal Node β†’ Leaf Node

Page structure:

  1. Page Header β€” metadata (LSN β€” Log Sequence Number, the WAL log entry number for crash recovery; checksums for integrity verification)
  2. Index Tuples β€” pairs of key + TID (pointer to the row in the table)
  3. Special Area β€” pointers to neighboring pages (P_NEXT, P_PREV)
  4. High Key β€” maximum value on the page

Search algorithm

  1. Start from the Root page
  2. Compare the search value with the keys
  3. Descend into the appropriate child node
  4. Repeat until you reach a Leaf page
  5. In the Leaf, find the TID and read the row from the table

Why B-tree instead of a regular binary tree?

Characteristic Binary tree B-tree (PostgreSQL)
Branching factor 2 (left/right) 100+ (many child nodes)
Height for 1M records ~20 levels ~3-4 levels
I/O operations Many Minimal (reads by pages)

Common mistakes

  1. Trying to index everything
    • B-tree takes disk space and RAM
    • Each INSERT/UPDATE requires index updates
  2. Wrong order in a composite index
    -- Index (last_name, first_name)
    -- βœ… Will be used
    WHERE last_name = 'Ivanov'
    
    -- βœ… Will be used
    WHERE last_name = 'Ivanov' AND first_name = 'Ivan'
    
    -- ❌ Will NOT be used (or inefficiently)
    WHERE first_name = 'Ivan'
    
  3. Searching with functions
    -- ❌ Index will not be used
    WHERE UPPER(name) = 'IVAN'
    
    -- βœ… Create a functional index
    CREATE INDEX idx_users_name_upper ON users(UPPER(name));
    

When NOT to use B-tree

  • For full-text search β†’ use GIN
  • For geospatial data β†’ use GiST
  • Only for equality and minimal memory footprint needed β†’ Hash
  • For arrays and JSONB β†’ GIN

Practical example

-- B-tree supports sorting
CREATE INDEX idx_orders_date ON orders(created_at DESC);

-- Query without sort (data is already sorted in the index)
SELECT * FROM orders ORDER BY created_at DESC LIMIT 10;
-- Plan: Index Scan Backward (very fast)

πŸ”΄ Senior Level

Internal Implementation

PostgreSQL’s B-tree implementation is based on a modified Lehman and Yao algorithm, which provides high concurrency without locking the entire path from root to leaf.

Meta-page structure (page 0):

Meta-page
  β”œβ”€β”€ Magic number (index type identifier)
  β”œβ”€β”€ Version
  β”œβ”€β”€ Root page number
  └── Fastroot (root cache for faster access)

Leaf Node structure:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Page Header (16 bytes)          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ ItemId_1 β†’ IndexTuple_1         β”‚
β”‚ ItemId_2 β†’ IndexTuple_2         β”‚
β”‚ ...                             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ High Key (maximum value)        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Special Area (P_NEXT, P_PREV)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

High Key: Critical for search correctness during concurrent splits. If during descent we see that the value > High Key β†’ we go to P_NEXT without returning to the root.

Concurrent access algorithm

Search:

  1. Take a short-term AccessShareLock on the page (a lightweight lock that doesn’t prevent other processes from reading the page in parallel)
  2. Descend, releasing locks on parent pages
  3. Even if a parallel process is doing a Split, the search remains correct

Page Split:

Before split:          After split:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ A B C D Eβ”‚  β†’     β”‚ A B C    β”‚β†’  β”‚ D E      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         ↑
                    Parent updated
  1. Create a new page on the β€œright”
  2. Move half of the data
  3. Update the parent node
  4. Important: INCLUDE columns are NOT stored in Internal nodes, only in Leaf

Modern optimizations (PG 13+)

On PG 12 and below, deduplication is absent β€” each key is stored separately. With many duplicates, the index bloats significantly. Solution: regular REINDEX.

Deduplication:

-- PostgreSQL 13+ combines identical keys
-- Before: (key=1, TID=1), (key=1, TID=2), (key=1, TID=3)
-- After: (key=1, TIDs=[1,2,3])

-- Particularly effective for low-cardinality columns
CREATE INDEX idx_orders_status ON orders(status);
-- Low cardinality = many duplicate keys. Instead of storing
-- (key='ACTIVE', TID=1), (key='ACTIVE', TID=2), ..., (key='ACTIVE', TID=1000000)
-- deduplication stores: (key='ACTIVE', TIDs=[1,2,...,1000000])
-- β†’ savings proportional to the number of duplicates: 70-90% space

LP_DEAD marking:

  • When a row is deleted from the Heap, the index entry is marked LP_DEAD
  • VACUUM physically removes such entries
  • Important: VACUUM frees space within pages but does NOT shrink the physical file size on disk

Edge Cases

  1. Bloat
    -- Diagnostics via pgstattuple
    SELECT * FROM pgstatindex('idx_users_email');
    -- avg_leaf_density < 50% β†’ time for REINDEX
    
    -- Treatment
    REINDEX INDEX CONCURRENTLY idx_users_email;
    
  2. Fill Factor
    -- For frequently updated indexes, reserve space
    CREATE INDEX idx_orders_status ON orders(status) WITH (fillfactor = 70);
    -- 30% of the page remains free for updates
    -- Minimizes Page Splits
    
  3. Corrupted Index
    -- Integrity check
    SELECT bt_page_items('idx_users_email', 1);
    
    -- Rebuild after failure
    REINDEX DATABASE my_database;
    

Performance

Operation complexity:

  • Search: O(logβ‚™ N), where n is the branching factor (typically 100-1000)
  • Tree height for 1 billion records: 3-4 levels
  • I/O operations for search: 3-4 page reads

Memory Footprint:

  • Each index entry: ~16-24 bytes + key size
  • For 100M records with int key: ~2-3 GB of index
  • Critical: The index should fit in shared_buffers (RAM)

Production Experience

Real scenario:

  • E-commerce platform: 500M orders
  • Problem: Index on order_id bloated to 50 GB (60% Bloat)
  • Symptom: Queries slowed down 10x, Cache Miss rate > 40%
  • Solution:
    • REINDEX CONCURRENTLY (without locking)
    • Added fillfactor = 80 on rebuild
    • Result: Index became 20 GB, Cache Hit rate > 95%

Monitoring

-- Check index state
SELECT
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) as size,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'orders';

-- Check Bloat
SELECT
    index_name,
    bloat_pct,
    real_size,
    extra_size
FROM pg_index_bloat_stats
WHERE schemaname = 'public';

-- Check tree level
SELECT * FROM pgstatindex('idx_orders_id');
-- tree_level = 0 β†’ everything in one page
-- tree_level = 3 β†’ normal large tree

Best Practices

  1. Use CONCURRENTLY for creation/rebuilding in production
  2. Set fillfactor = 70-80 for frequently updated indexes
  3. Regularly check avg_leaf_density via pgstattuple
  4. Deduplication is enabled by default in PG 13+ β€” make sure you’re using an up-to-date version
  5. Monitor index sizes relative to RAM

Summary for Senior

  • B-tree is a disk structure optimized for concurrency (Lehman & Yao)
  • High Key ensures search correctness during concurrent splits
  • INCLUDE columns only in Leaf nodes β€” don’t bloat the search tree
  • Deduplication (PG 13+) radically reduces Bloat for low-cardinality columns
  • REINDEX CONCURRENTLY β€” the main tool against Bloat in production

🎯 Interview Cheat Sheet

Must know:

  • B-tree is a hierarchical structure: Root β†’ Internal β†’ Leaf (TID β†’ Heap)
  • Branching factor 100+ β†’ height 3-4 levels even for 1 billion records
  • Search complexity: O(logβ‚™ N), where n is the branching factor
  • Lehman-Yao algorithm: concurrent access without locking the entire path
  • High Key β€” maximum value on a page, needed for correctness during splits
  • Deduplication (PG 13+): combines identical keys β†’ 70-90% space savings
  • fillfactor = 70-80 for frequently updated indexes β†’ fewer splits

Common follow-up questions:

  • β€œWhy B-tree instead of binary tree?” β†’ B-tree branching factor 100+ vs 2, fewer I/O
  • β€œWhat is a Page Split and why is it bad?” β†’ page division on insert β†’ Bloat
  • β€œHow does deduplication save space?” β†’ (key=1, TIDs=[1,2,3]) instead of 3 separate entries
  • β€œWhat happens at 60% Bloat?” β†’ Cache Miss, Random I/O, REINDEX CONCURRENTLY
  • β€œWhy are INCLUDE columns only in Leaf?” β†’ They don’t bloat Internal nodes β†’ fewer tree levels

Red flags (DO NOT say):

  • ❌ β€œB-tree is a binary tree” (no, B-tree = multi-way tree)
  • ❌ β€œVACUUM shrinks the index file size” (no, REINDEX is needed)
  • ❌ β€œDeduplication works for all indexes” (only PG 13+, not for unique)

Related topics:

  • [[What are indexes for]] β†’ overview of index types
  • [[What is a composite index]] β†’ lexicographic sorting
  • [[What is index cardinality]] β†’ deduplication for low-cardinality columns
  • [[How does MVCC work in PostgreSQL]] β†’ TID, HOT, Visibility Map