Question 10 · Section 17

What is sharding

Junior Level

Sharding is splitting a large database into several smaller parts (shards), each on its own server.

Unlike read replicas (copies of data for reading), shards contain DIFFERENT data. Each shard is an independent part of the dataset.

Large DB (100M records) -> slow

Sharding:
Shard 1: users A-F (server 1)
Shard 2: users G-M (server 2)
Shard 3: users N-Z (server 3)

Why:

  • Scaling — each shard on its own server
  • Performance — indexes become smaller, fit entirely in RAM, B-tree depth decreases. The query scans fewer pages — faster.
  • Reliability — if one shard goes down, the others keep working
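The A-F / G-M / N-Z split above can be sketched as a tiny routing function. This is a minimal illustration; the shard names and boundaries are made up for the example, not from any real system.

```python
# Range-based routing: pick a shard by the first letter of the username.
# Shard names and letter ranges are illustrative only.
SHARDS = [("shard1", "A", "F"), ("shard2", "G", "M"), ("shard3", "N", "Z")]

def route(username: str) -> str:
    first = username[0].upper()
    for name, lo, hi in SHARDS:
        if lo <= first <= hi:
            return name
    raise ValueError(f"no shard covers {username!r}")

print(route("alice"))  # shard1
print(route("mike"))   # shard2
print(route("zoe"))    # shard3
```

In a real system this routing logic lives in the application layer, a proxy, or the database's own query router.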

Middle Level

When NOT to shard

Don’t shard until you hit storage or write limits of a single server. For datasets up to ~100GB, sharding is usually excessive — its operational complexity outweighs the benefits.

Sharding strategies

1. Range-based:

Shard 1: ID 1-1000
Shard 2: ID 1001-2000
Shard 3: ID 2001-3000

2. Hash-based:

shard = hash(key) % num_shards
user_123 -> hash("user_123") % 3 = shard 2

3. Directory-based:

Lookup table: key -> shard
user_123 -> shard 2
user_456 -> shard 1
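The hash-based and directory-based strategies above can be sketched in a few lines. One assumption worth flagging: the sketch uses `hashlib.md5` for a stable hash, because Python's built-in `hash()` is randomized per process and unsuitable for routing; the directory entries mirror the example keys above.

```python
import hashlib

NUM_SHARDS = 3

def hash_shard(key: str) -> int:
    # Hash-based: shard = hash(key) % num_shards.
    # md5 gives the same result in every process, unlike built-in hash().
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Directory-based: an explicit key -> shard lookup table,
# typically kept in a small metadata store.
DIRECTORY = {"user_123": 2, "user_456": 1}

def directory_shard(key: str) -> int:
    # Fall back to hashing for keys missing from the directory.
    return DIRECTORY.get(key, hash_shard(key))
```

The directory buys flexibility (any key can be moved to any shard) at the cost of an extra lookup and a component that must itself be kept available.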

Common mistakes

  1. Hot shard:
    With range-based sharding, all new users (e.g., sequential IDs) land on the last shard
    -> uneven load
    Solution: hash-based sharding instead of range-based

Terms: hot spot — one shard gets disproportionately more traffic; cross-shard join — a query that combines data from different shards (expensive).


Senior Level

Architectural Trade-offs

Strategy  | Pros          | Cons
--------- | ------------- | ------------------------
Range     | Simpler       | Hot spots
Hash      | More even     | Hard to rebalance
Directory | More flexible | Single point of failure

Production Experience

Resharding:

When adding a new shard:
1. Recalculate hash distribution
2. Move data
3. Update routing

MongoDB and Cassandra support automatic rebalancing of data when shards are added.
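Why resharding needs to be planned in advance can be shown with naive modulo hashing: when the shard count changes, most keys map to a different shard and must be moved. A small sketch (assumed key format `user_N`, md5 as the stable hash):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    # Naive modulo hashing: shard = hash(key) % num_shards.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_shards

keys = [f"user_{i}" for i in range(10_000)]
# Count keys whose shard changes when growing from 3 to 4 shards.
moved = sum(shard_for(k, 3) != shard_for(k, 4) for k in keys)
print(f"{moved / len(keys):.0%} of keys change shards")  # roughly 75%
```

This is why systems that expect to grow use schemes like consistent hashing, which move only a small fraction of keys per added shard.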

Best Practices

✅ Hash-based for even distribution
✅ Monitor each shard's size
✅ Plan resharding in advance
✅ Cross-shard joins — avoid them

❌ Running without hot-spot monitoring
❌ Too many cross-shard queries

Interview Cheat Sheet

Must know:

  • Sharding — splitting a DB into parts (shards), each on its own server
  • Each shard contains DIFFERENT data (unlike read replicas)
  • Strategies: range-based, hash-based, directory-based
  • Hash-based distributes more evenly, range-based is simpler but has hot spots
  • Resharding — moving data when adding shards
  • Cross-shard JOINs are expensive — avoid them
  • Don’t shard until you hit storage/write limits (~100GB+)

Common follow-up questions:

  • What is a hot shard? One shard gets disproportionately more traffic — solution: hash-based instead of range-based.
  • How to do resharding? Recalculate distribution -> move data -> update routing. MongoDB/Cassandra support auto-resharding.
  • Range vs hash vs directory? Range is simpler, hash is more even, directory is more flexible (but SPOF).
  • When is sharding excessive? Dataset up to ~100GB — operational complexity outweighs the benefits.

Red flags (DO NOT say):

  • “Sharding = read replicas” — no, replicas = copies, shards = different data
  • “Range-based is always better” — no, hot spots with uneven distribution
  • “Cross-shard JOINs are normal practice” — no, they are expensive, avoid them
  • “Sharding is needed from the start” — no, start when you hit limits

Related topics:

  • [[11. What is the difference between sharding and partitioning]]
  • [[12. How to implement horizontal scaling of microservices]]
  • [[13. What is Database per Service pattern]]
  • [[14. What problems arise when using a shared database]]