Database Scaling & Sharding

While scaling stateless web servers horizontally is relatively easy (just spin up another EC2 instance behind a load balancer), scaling a stateful database horizontally is incredibly complex.

When your database becomes the bottleneck, there are two primary techniques to scale it out: Replication and Sharding.

1. Database Replication (Master-Slave)

Replication involves copying your data across multiple database servers to improve read throughput and provide fault tolerance.

In a traditional Master-Slave (or Primary-Replica) architecture:

There is exactly one Master database. It is the only database permitted to handle WRITE operations (Insert, Update, Delete).
There are multiple Slave databases. They are read-only.
Whenever data is written to the Master, it automatically syncs those changes down to all the Slaves.

The Benefits

Because 90% of traffic in most web applications is read traffic (users looking at tweets, not writing them), you can point all your SELECT queries to the fleet of Slave databases, massively offloading the CPU usage from your Master database.

The Drawback: Replication Lag

Since the Master must copy the data over the network to the Slaves, it takes a few milliseconds. If a user posts a comment (written to Master) and immediately refreshes the page (read from Slave), the Slave might not have received the update yet, and the comment will appear to have vanished.

2. Database Sharding (Data Partitioning)

Replication solves the problem of too many reads. But what if you have too many writes? Or what if your database size exceeds the maximum hard drive size you can physically buy?

Sharding is the process of breaking up a massive database into smaller, distinct chunks (shards) and spreading those chunks across multiple independent database servers.

Imagine you have a Users table with 1 billion rows. You could shard the database based on the User's ID:

Shard 1 (Server A): Stores User IDs 1 to 250,000,000.
Shard 2 (Server B): Stores User IDs 250,000,001 to 500,000,000.
Shard 3 (Server C): Stores User IDs 500,000,001 to 750,000,000.

The Challenges of Sharding

While sharding provides infinite write scalability and infinite storage capacity, it introduces nightmarish complexity into your application code:

Join Operations: You can no longer easily execute SQL JOIN queries across tables if the tables reside on completely different physical servers.
Resharding Data: If Shard 1 fills up completely because users 1 to 250M uploaded way more photos than everyone else, you have to rebalance the data across the shards, which requires massive downtime and data migration scripts.
Celebrity Problem: If you shard a social network by User ID, and Justin Bieber (User ID 50) posts a photo, millions of fans will query Shard 1 simultaneously, bringing that single server to its knees while the other shards sit idle. This is known as a "Hotspot".

Database Scaling & Sharding

While scaling stateless web servers horizontally is relatively easy (just spin up another EC2 instance behind a load balancer), scaling a stateful database horizontally is incredibly complex.

When your database becomes the bottleneck, there are two primary techniques to scale it out: Replication and Sharding.

1. Database Replication (Master-Slave)

Replication involves copying your data across multiple database servers to improve read throughput and provide fault tolerance.

In a traditional Master-Slave (or Primary-Replica) architecture:

There is exactly one Master database. It is the only database permitted to handle WRITE operations (Insert, Update, Delete).

There are multiple Slave databases. They are read-only.

Whenever data is written to the Master, it automatically syncs those changes down to all the Slaves.

The Benefits

The Drawback: Replication Lag

2. Database Sharding (Data Partitioning)

Replication solves the problem of too many reads. But what if you have too many writes? Or what if your database size exceeds the maximum hard drive size you can physically buy?

Sharding is the process of breaking up a massive database into smaller, distinct chunks (shards) and spreading those chunks across multiple independent database servers.

Imagine you have a Users table with 1 billion rows. You could shard the database based on the User's ID:

Shard 1 (Server A): Stores User IDs 1 to 250,000,000.

Shard 2 (Server B): Stores User IDs 250,000,001 to 500,000,000.

Shard 3 (Server C): Stores User IDs 500,000,001 to 750,000,000.

The Challenges of Sharding

While sharding provides infinite write scalability and infinite storage capacity, it introduces nightmarish complexity into your application code:

Join Operations: You can no longer easily execute SQL JOIN queries across tables if the tables reside on completely different physical servers.

Resharding Data: If Shard 1 fills up completely because users 1 to 250M uploaded way more photos than everyone else, you have to rebalance the data across the shards, which requires massive downtime and data migration scripts.

Celebrity Problem: If you shard a social network by User ID, and Justin Bieber (User ID 50) posts a photo, millions of fans will query Shard 1 simultaneously, bringing that single server to its knees while the other shards sit idle. This is known as a "Hotspot".