How does B-tree index work?

🟢 Junior Level

B-tree (Balanced Tree) is a balanced tree that PostgreSQL uses for storing indexes by default.

Why “balanced”? All paths from root to leaf have the same length. This guarantees that any search will take the same amount of time — there are no “bad” cases like in a regular binary tree, which can degenerate into a linked list.

Simple analogy: Imagine a table of contents in a book, split into levels:

Top level: sections (Chapters 1-10)
Middle level: subsections (1.1, 1.2, 1.3…)
Bottom level: specific pages

To find a page, you go from top to bottom, discarding unnecessary sections.

Main idea:

All data is sorted
Search is fast: even with a million elements, you only need ~20-30 comparisons
PostgreSQL automatically maintains tree balance

Example:

-- Creating a B-tree index (default)
CREATE INDEX idx_users_email ON users(email);

-- Indexed search: O(log n) instead of O(n)
SELECT * FROM users WHERE email = 'test@example.com';

When to use:

For most cases (this is the default index)
For equality search: WHERE id = 5
For range search: WHERE price BETWEEN 100 AND 500
For sorting: ORDER BY created_at DESC

🟡 Middle Level

How it works internally

The index is stored in 8 KB pages. The structure is hierarchical:

Root
  ├── Internal Node → Internal Node → Leaf Node
  └── Internal Node → Leaf Node

Page structure:

Page Header — metadata (LSN — Log Sequence Number, the WAL log entry number for crash recovery; checksums for integrity verification)
Index Tuples — pairs of key + TID (pointer to the row in the table)
Special Area — pointers to neighboring pages (P_NEXT, P_PREV)
High Key — maximum value on the page

Search algorithm

Start from the Root page
Compare the search value with the keys
Descend into the appropriate child node
Repeat until you reach a Leaf page
In the Leaf, find the TID and read the row from the table

Why B-tree instead of a regular binary tree?

Characteristic	Binary tree	B-tree (PostgreSQL)
Branching factor	2 (left/right)	100+ (many child nodes)
Height for 1M records	~20 levels	~3-4 levels
I/O operations	Many	Minimal (reads by pages)

Common mistakes

Trying to index everything
- B-tree takes disk space and RAM
- Each INSERT/UPDATE requires index updates

Wrong order in a composite index

-- Index (last_name, first_name)
-- ✅ Will be used
WHERE last_name = 'Ivanov'

-- ✅ Will be used
WHERE last_name = 'Ivanov' AND first_name = 'Ivan'

-- ❌ Will NOT be used (or inefficiently)
WHERE first_name = 'Ivan'

Searching with functions

-- ❌ Index will not be used
WHERE UPPER(name) = 'IVAN'

-- ✅ Create a functional index
CREATE INDEX idx_users_name_upper ON users(UPPER(name));

When NOT to use B-tree

For full-text search → use GIN
For geospatial data → use GiST
Only for equality and minimal memory footprint needed → Hash
For arrays and JSONB → GIN

Practical example

-- B-tree supports sorting
CREATE INDEX idx_orders_date ON orders(created_at DESC);

-- Query without sort (data is already sorted in the index)
SELECT * FROM orders ORDER BY created_at DESC LIMIT 10;
-- Plan: Index Scan Backward (very fast)

🔴 Senior Level

Internal Implementation

PostgreSQL’s B-tree implementation is based on a modified Lehman and Yao algorithm, which provides high concurrency without locking the entire path from root to leaf.

Meta-page structure (page 0):

Meta-page
  ├── Magic number (index type identifier)
  ├── Version
  ├── Root page number
  └── Fastroot (root cache for faster access)

Leaf Node structure:

┌─────────────────────────────────┐
│ Page Header (16 bytes)          │
├─────────────────────────────────┤
│ ItemId_1 → IndexTuple_1         │
│ ItemId_2 → IndexTuple_2         │
│ ...                             │
├─────────────────────────────────┤
│ High Key (maximum value)        │
├─────────────────────────────────┤
│ Special Area (P_NEXT, P_PREV)   │
└─────────────────────────────────┘

High Key: Critical for search correctness during concurrent splits. If during descent we see that the value > High Key → we go to P_NEXT without returning to the root.

Concurrent access algorithm

Search:

Take a short-term AccessShareLock on the page (a lightweight lock that doesn’t prevent other processes from reading the page in parallel)
Descend, releasing locks on parent pages
Even if a parallel process is doing a Split, the search remains correct

Page Split:

Before split:          After split:
┌──────────┐        ┌──────────┐   ┌──────────┐
│ A B C D E│  →     │ A B C    │→  │ D E      │
└──────────┘        └──────────┘   └──────────┘
                         ↑
                    Parent updated

Create a new page on the “right”
Move half of the data
Update the parent node
Important: INCLUDE columns are NOT stored in Internal nodes, only in Leaf

Modern optimizations (PG 13+)

On PG 12 and below, deduplication is absent — each key is stored separately. With many duplicates, the index bloats significantly. Solution: regular REINDEX.

Deduplication:

-- PostgreSQL 13+ combines identical keys
-- Before: (key=1, TID=1), (key=1, TID=2), (key=1, TID=3)
-- After: (key=1, TIDs=[1,2,3])

-- Particularly effective for low-cardinality columns
CREATE INDEX idx_orders_status ON orders(status);
-- Low cardinality = many duplicate keys. Instead of storing
-- (key='ACTIVE', TID=1), (key='ACTIVE', TID=2), ..., (key='ACTIVE', TID=1000000)
-- deduplication stores: (key='ACTIVE', TIDs=[1,2,...,1000000])
-- → savings proportional to the number of duplicates: 70-90% space

LP_DEAD marking:

When a row is deleted from the Heap, the index entry is marked LP_DEAD
VACUUM physically removes such entries
Important: VACUUM frees space within pages but does NOT shrink the physical file size on disk

Edge Cases

Bloat

-- Diagnostics via pgstattuple
SELECT * FROM pgstatindex('idx_users_email');
-- avg_leaf_density < 50% → time for REINDEX

-- Treatment
REINDEX INDEX CONCURRENTLY idx_users_email;

Fill Factor

-- For frequently updated indexes, reserve space
CREATE INDEX idx_orders_status ON orders(status) WITH (fillfactor = 70);
-- 30% of the page remains free for updates
-- Minimizes Page Splits

Corrupted Index

-- Integrity check
SELECT bt_page_items('idx_users_email', 1);

-- Rebuild after failure
REINDEX DATABASE my_database;

Performance

Operation complexity:

Search: O(logₙ N), where n is the branching factor (typically 100-1000)
Tree height for 1 billion records: 3-4 levels
I/O operations for search: 3-4 page reads

Memory Footprint:

Each index entry: ~16-24 bytes + key size
For 100M records with int key: ~2-3 GB of index
Critical: The index should fit in shared_buffers (RAM)

Production Experience

Real scenario:

E-commerce platform: 500M orders
Problem: Index on order_id bloated to 50 GB (60% Bloat)
Symptom: Queries slowed down 10x, Cache Miss rate > 40%
Solution:
- REINDEX CONCURRENTLY (without locking)
- Added fillfactor = 80 on rebuild
- Result: Index became 20 GB, Cache Hit rate > 95%

Monitoring

-- Check index state
SELECT
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) as size,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'orders';

-- Check Bloat
SELECT
    index_name,
    bloat_pct,
    real_size,
    extra_size
FROM pg_index_bloat_stats
WHERE schemaname = 'public';

-- Check tree level
SELECT * FROM pgstatindex('idx_orders_id');
-- tree_level = 0 → everything in one page
-- tree_level = 3 → normal large tree

Best Practices

Use CONCURRENTLY for creation/rebuilding in production
Set fillfactor = 70-80 for frequently updated indexes
Regularly check avg_leaf_density via pgstattuple
Deduplication is enabled by default in PG 13+ — make sure you’re using an up-to-date version
Monitor index sizes relative to RAM

Summary for Senior

B-tree is a disk structure optimized for concurrency (Lehman & Yao)
High Key ensures search correctness during concurrent splits
INCLUDE columns only in Leaf nodes — don’t bloat the search tree
Deduplication (PG 13+) radically reduces Bloat for low-cardinality columns
REINDEX CONCURRENTLY — the main tool against Bloat in production

🎯 Interview Cheat Sheet

Must know:

B-tree is a hierarchical structure: Root → Internal → Leaf (TID → Heap)
Branching factor 100+ → height 3-4 levels even for 1 billion records
Search complexity: O(logₙ N), where n is the branching factor
Lehman-Yao algorithm: concurrent access without locking the entire path
High Key — maximum value on a page, needed for correctness during splits
Deduplication (PG 13+): combines identical keys → 70-90% space savings
fillfactor = 70-80 for frequently updated indexes → fewer splits

Common follow-up questions:

“Why B-tree instead of binary tree?” → B-tree branching factor 100+ vs 2, fewer I/O
“What is a Page Split and why is it bad?” → page division on insert → Bloat
“How does deduplication save space?” → (key=1, TIDs=[1,2,3]) instead of 3 separate entries
“What happens at 60% Bloat?” → Cache Miss, Random I/O, REINDEX CONCURRENTLY
“Why are INCLUDE columns only in Leaf?” → They don’t bloat Internal nodes → fewer tree levels

Red flags (DO NOT say):

❌ “B-tree is a binary tree” (no, B-tree = multi-way tree)
❌ “VACUUM shrinks the index file size” (no, REINDEX is needed)
❌ “Deduplication works for all indexes” (only PG 13+, not for unique)

Related topics:

[[What are indexes for]] → overview of index types
[[What is a composite index]] → lexicographic sorting
[[What is index cardinality]] → deduplication for low-cardinality columns
[[How does MVCC work in PostgreSQL]] → TID, HOT, Visibility Map