How does B-tree index work?
To find a page, you go from top to bottom, discarding unnecessary sections.
🟢 Junior Level
B-tree (Balanced Tree) is the balanced tree structure that PostgreSQL uses for indexes by default.
Why "balanced"? All paths from the root to a leaf have the same length. This guarantees that any search takes the same amount of time — there are no "bad" cases like in an ordinary binary tree, which can degenerate into a linked list.
Simple analogy: Imagine a table of contents in a book, split into levels:
- Top level: sections (Chapters 1-10)
- Middle level: subsections (1.1, 1.2, 1.3…)
- Bottom level: specific pages
To find a page, you go from top to bottom, discarding unnecessary sections.
Main idea:
- All data is sorted
- Search is fast: even with a million elements, you only need ~20-30 comparisons
- PostgreSQL automatically maintains tree balance
Example:
```sql
-- Creating a B-tree index (default)
CREATE INDEX idx_users_email ON users(email);

-- Indexed search: O(log n) instead of O(n)
SELECT * FROM users WHERE email = 'test@example.com';
```
When to use:
- For most cases (this is the default index)
- For equality search: `WHERE id = 5`
- For range search: `WHERE price BETWEEN 100 AND 500`
- For sorting: `ORDER BY created_at DESC`
🟡 Middle Level
How it works internally
The index is stored in 8 KB pages. The structure is hierarchical:
```
Root
├── Internal Node → Internal Node → Leaf Node
└── Internal Node → Leaf Node
```
Page structure:
- Page Header — metadata (LSN — Log Sequence Number, the WAL record number used for crash recovery; checksums for integrity verification)
- Index Tuples — pairs of `key + TID` (a pointer to the row in the table)
- Special Area — pointers to neighboring pages (`P_NEXT`, `P_PREV`)
- High Key — the maximum value on the page
Search algorithm
- Start from the Root page
- Compare the search value with the keys
- Descend into the appropriate child node
- Repeat until you reach a Leaf page
- In the Leaf, find the TID and read the row from the table
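The descent described above can be sketched in Python (a conceptual model; the `Node` class, field names, and page contents are illustrative, not PostgreSQL's actual C structures):

```python
import bisect

class Node:
    """A toy B-tree page: internal pages hold children, leaf pages hold TIDs."""
    def __init__(self, keys, children=None, tids=None):
        self.keys = keys          # sorted keys on the page
        self.children = children  # child pages (internal nodes only)
        self.tids = tids          # heap pointers (leaf nodes only)

def btree_search(node, key):
    """Descend root -> leaf, then return the TID for the key (or None)."""
    while node.children is not None:             # still on an internal page
        i = bisect.bisect_right(node.keys, key)  # pick the child subtree
        node = node.children[i]
    i = bisect.bisect_left(node.keys, key)       # reached a leaf page
    if i < len(node.keys) and node.keys[i] == key:
        return node.tids[i]                      # TID = (block, offset) in the heap
    return None

# Two leaf pages and a root; the separator key 30 sends keys >= 30 to the right
leaf1 = Node([10, 20], tids=[(1, 1), (1, 2)])
leaf2 = Node([30, 40], tids=[(2, 1), (2, 2)])
root = Node([30], children=[leaf1, leaf2])

print(btree_search(root, 20))  # (1, 2)
print(btree_search(root, 25))  # None — key absent
```

Note that the tree only yields a TID; reading the actual row is a separate heap access, which is why index-only scans (when they apply) are cheaper.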
Why B-tree instead of a regular binary tree?
| Characteristic | Binary tree | B-tree (PostgreSQL) |
|---|---|---|
| Branching factor | 2 (left/right) | 100+ (many child nodes) |
| Height for 1M records | ~20 levels | ~3-4 levels |
| I/O operations | Many | Minimal (reads by pages) |
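The height numbers in the table follow directly from the logarithm base. A quick check, assuming roughly 300 keys fit on an 8 KB page (the exact fan-out depends on key size):

```python
import math

n = 1_000_000
# Binary tree: each comparison halves the search space
print(math.ceil(math.log(n, 2)))    # 20 levels
# B-tree with ~300 keys per page: each page read divides the space by ~300
print(math.ceil(math.log(n, 300)))  # 3 levels
```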
Common mistakes
- Trying to index everything
  - B-tree takes disk space and RAM
  - Each INSERT/UPDATE requires index updates
- Wrong order in a composite index

```sql
-- Index on (last_name, first_name)
-- ✅ Will be used
WHERE last_name = 'Ivanov'
-- ✅ Will be used
WHERE last_name = 'Ivanov' AND first_name = 'Ivan'
-- ❌ Will NOT be used (or used inefficiently)
WHERE first_name = 'Ivan'
```

- Searching with functions

```sql
-- ❌ Index will not be used
WHERE UPPER(name) = 'IVAN'
-- ✅ Create a functional index
CREATE INDEX idx_users_name_upper ON users(UPPER(name));
```
When NOT to use B-tree
- For full-text search → use GIN
- For geospatial data → use GiST
- For equality-only search with a minimal memory footprint → Hash
- For arrays and JSONB → GIN
Practical example
```sql
-- B-tree supports sorting
CREATE INDEX idx_orders_date ON orders(created_at DESC);

-- Query without a separate sort step (data is already sorted in the index)
SELECT * FROM orders ORDER BY created_at DESC LIMIT 10;
-- Plan: Index Scan (with a plain ASC index you would see
-- Index Scan Backward — equally fast, still no Sort node)
```
🔴 Senior Level
Internal Implementation
PostgreSQL's B-tree implementation is based on a modified Lehman and Yao algorithm, which provides high concurrency without locking the entire path from root to leaf.
Meta-page structure (page 0):
```
Meta-page
├── Magic number (index type identifier)
├── Version
├── Root page number
└── Fastroot (cached root for faster access)
```
Leaf Node structure:
```
┌─────────────────────────────────┐
│ Page Header (24 bytes)          │
├─────────────────────────────────┤
│ ItemId_1 │ IndexTuple_1         │
│ ItemId_2 │ IndexTuple_2         │
│ ...                             │
├─────────────────────────────────┤
│ High Key (maximum value)        │
├─────────────────────────────────┤
│ Special Area (P_NEXT, P_PREV)   │
└─────────────────────────────────┘
```
High Key: critical for search correctness during concurrent splits. If during the descent we see that the search value > High Key, we follow P_NEXT without returning to the root.
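This "move right" rule can be sketched as follows (a toy model; the dict-based page layout and the function name are illustrative):

```python
def find_page_for(key, page, pages):
    """Follow right-sibling links while the key is beyond this page's High Key."""
    while page['high_key'] is not None and key > page['high_key']:
        page = pages[page['next']]  # a concurrent split moved our key to the right
    return page

# Page 1 was split; keys > 30 now live on its right sibling, page 2
pages = {
    1: {'high_key': 30, 'next': 2, 'keys': [10, 20, 30]},
    2: {'high_key': None, 'next': None, 'keys': [40, 50]},  # rightmost page
}

print(find_page_for(45, pages[1], pages)['keys'])  # [40, 50] — reached via P_NEXT
print(find_page_for(20, pages[1], pages)['keys'])  # [10, 20, 30] — no move needed
```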
Concurrent access algorithm
Search:
- Take a short-term shared lock on the page's buffer (a lightweight lock that doesn't prevent other processes from reading the page in parallel)
- Descend, releasing locks on parent pages
- Even if a parallel process is performing a Split, the search remains correct
Page Split:
```
Before split:          After split:
┌───────────┐          ┌─────────┐   ┌─────┐
│ A B C D E │          │ A B C   │ → │ D E │
└───────────┘          └─────────┘   └─────┘
                            ↓
                      Parent updated
```
- Create a new page on the "right"
- Move half of the data to it
- Update the parent node
- Important: INCLUDE columns are NOT stored in Internal nodes, only in Leaf nodes
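A toy version of the split above (illustrative only; PostgreSQL's real choice of split point also weighs tuple sizes and insertion patterns):

```python
def split_page(keys):
    """Move the upper half of a full page to a new right sibling."""
    mid = (len(keys) + 1) // 2            # the left page keeps the lower half
    left, right = keys[:mid], keys[mid:]
    separator = right[0]                  # parent gets this key + pointer to 'right'
    return left, right, separator

left, right, sep = split_page(['A', 'B', 'C', 'D', 'E'])
print(left, right, sep)  # ['A', 'B', 'C'] ['D', 'E'] D
```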
Modern optimizations (PG 13+)
On PG 12 and below, deduplication is absent — each key is stored separately. With many duplicates, the index bloats significantly. Solution: regular REINDEX.
Deduplication:
```sql
-- PostgreSQL 13+ combines identical keys
-- Before: (key=1, TID=1), (key=1, TID=2), (key=1, TID=3)
-- After:  (key=1, TIDs=[1,2,3])

-- Particularly effective for low-cardinality columns
CREATE INDEX idx_orders_status ON orders(status);
-- Low cardinality = many duplicate keys. Instead of storing
-- (key='ACTIVE', TID=1), (key='ACTIVE', TID=2), ..., (key='ACTIVE', TID=1000000)
-- deduplication stores: (key='ACTIVE', TIDs=[1,2,...,1000000])
-- → savings proportional to the number of duplicates: 70-90% of space
```
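A back-of-envelope estimate of where the 70-90% figure comes from. The 6-byte TID is real; the other sizes are rough assumptions for illustration, not exact on-disk numbers:

```python
TUPLE_OVERHEAD = 16  # assumed bytes per index tuple (header + line pointer)
KEY_SIZE = 8         # assumed key size
TID_SIZE = 6         # a heap TID really is 6 bytes (block + offset)

def size_without_dedup(n_rows):
    # one full index tuple per table row
    return n_rows * (TUPLE_OVERHEAD + KEY_SIZE)

def size_with_dedup(n_rows, n_distinct):
    # one tuple per distinct key, plus a posting list of TIDs
    return n_distinct * (TUPLE_OVERHEAD + KEY_SIZE) + n_rows * TID_SIZE

rows, distinct = 1_000_000, 5  # e.g. a status column with 5 values
before, after = size_without_dedup(rows), size_with_dedup(rows, distinct)
print(f"savings: {1 - after / before:.0%}")  # savings: 75%
```

With larger keys or heavier per-tuple overhead the savings climb toward the upper end of the quoted range.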
LP_DEAD marking:
- When a row is deleted from the Heap, the index entry is marked `LP_DEAD`
- `VACUUM` physically removes such entries
- Important: VACUUM frees space within pages but does NOT shrink the physical file size on disk
Edge Cases
- Bloat

```sql
-- Diagnostics via pgstattuple
SELECT * FROM pgstatindex('idx_users_email');
-- avg_leaf_density < 50% → time for REINDEX

-- Treatment
REINDEX INDEX CONCURRENTLY idx_users_email;
```

- Fill Factor

```sql
-- For frequently updated indexes, reserve free space
CREATE INDEX idx_orders_status ON orders(status) WITH (fillfactor = 70);
-- 30% of each page remains free for updates
-- Minimizes Page Splits
```

- Corrupted Index

```sql
-- Page inspection (pageinspect extension)
SELECT * FROM bt_page_items('idx_users_email', 1);
-- Rebuild after a failure
REINDEX DATABASE my_database;
```
Performance
Operation complexity:
- Search: O(logₙ N), where n is the branching factor (typically 100-1000)
- Tree height for 1 billion records: 3-4 levels
- I/O operations per search: 3-4 page reads

Memory Footprint:
- Each index entry: ~16-24 bytes + key size
- For 100M records with an int key: ~2-3 GB of index
- Critical: the index should fit in `shared_buffers` (RAM)
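The "~2-3 GB" figure can be checked against the per-entry estimate above:

```python
def index_size_gb(n_rows, entry_bytes=24):
    """Rough index size: rows x bytes-per-entry, ignoring page-level overhead."""
    return n_rows * entry_bytes / 1024**3

print(f"{index_size_gb(100_000_000):.1f} GB")  # 2.2 GB for 100M int keys
```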
Production Experience
Real scenario:
- E-commerce platform: 500M orders
- Problem: the index on `order_id` bloated to 50 GB (60% Bloat)
- Symptom: queries slowed down 10x, Cache Miss rate > 40%
- Solution: `REINDEX CONCURRENTLY` (no locking), plus `fillfactor = 80` on rebuild
- Result: the index shrank to 20 GB, Cache Hit rate > 95%
Monitoring
```sql
-- Check index state
SELECT
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'orders';

-- Check Bloat (pg_index_bloat_stats is not a built-in view;
-- substitute your own bloat-estimation query or extension)
SELECT
    index_name,
    bloat_pct,
    real_size,
    extra_size
FROM pg_index_bloat_stats
WHERE schemaname = 'public';

-- Check tree level (pgstattuple extension)
SELECT * FROM pgstatindex('idx_orders_id');
-- tree_level = 0 → everything fits in one page
-- tree_level = 3 → a normal large tree
```
Best Practices
- Use `CONCURRENTLY` for creating/rebuilding indexes in production
- Set `fillfactor = 70-80` for frequently updated indexes
- Regularly check `avg_leaf_density` via `pgstattuple`
- Deduplication is enabled by default in PG 13+ — make sure you're on an up-to-date version
- Monitor index sizes relative to RAM
Summary for Senior
- B-tree is a disk structure optimized for concurrency (Lehman & Yao)
- High Key ensures search correctness during concurrent splits
- INCLUDE columns live only in Leaf nodes — they don't bloat the search tree
- Deduplication (PG 13+) radically reduces Bloat for low-cardinality columns
- `REINDEX CONCURRENTLY` — the main tool against Bloat in production
🎯 Interview Cheat Sheet
Must know:
- B-tree is a hierarchical structure: Root → Internal → Leaf (TID → Heap)
- Branching factor 100+ → a height of 3-4 levels even for 1 billion records
- Search complexity: O(logₙ N), where n is the branching factor
- Lehman-Yao algorithm: concurrent access without locking the entire path
- High Key — the maximum value on a page, needed for correctness during splits
- Deduplication (PG 13+): combines identical keys → 70-90% space savings
- `fillfactor = 70-80` for frequently updated indexes → fewer splits
Common follow-up questions:
- "Why B-tree instead of a binary tree?" → branching factor 100+ vs 2, fewer I/O operations
- "What is a Page Split and why is it bad?" → page division on insert → Bloat
- "How does deduplication save space?" → `(key=1, TIDs=[1,2,3])` instead of 3 separate entries
- "What happens at 60% Bloat?" → Cache Misses, Random I/O, `REINDEX CONCURRENTLY`
- "Why are INCLUDE columns only in Leaf?" → they don't bloat Internal nodes → fewer tree levels
Red flags (DO NOT say):
- ❌ "B-tree is a binary tree" (no, a B-tree is a multi-way tree)
- ❌ "VACUUM shrinks the index file size" (no, REINDEX is needed for that)
- ❌ "Deduplication works for all indexes" (only PG 13+, not for unique indexes)
Related topics:
- [[What are indexes for]] — overview of index types
- [[What is a composite index]] — lexicographic sorting
- [[What is index cardinality]] — deduplication for low-cardinality columns
- [[How does MVCC work in PostgreSQL]] — TID, HOT, Visibility Map