Database Sharding & Partitioning

When a single database server, even with replicas, can no longer store all your data (terabytes or petabytes), or when write traffic overwhelms a single master, you must split the database into smaller, faster, more manageable pieces. This is called Sharding (or Horizontal Partitioning).

1. What is Sharding?

Sharding breaks up a large database table into smaller chunks called Shards, and distributes them across multiple database server instances. Each shard holds a subset of the total data and operates independently.

Example: A Users table with 1 billion rows is split across 10 shards. Shard 1 holds users with IDs 1-100M, Shard 2 holds 100M-200M, and so on.

2. Sharding Strategies

Hash-Based Sharding (Key-Based)

A hash function is applied to a key attribute (like UserID). The result determines which shard the data belongs to.

shard_number = hash(UserID) % number_of_shards
Pros: Data is evenly distributed across shards.
Cons: Adding a new shard changes the modulo, requiring a massive redistribution of data (solved by Consistent Hashing).

Range-Based Sharding

Data is divided based on ranges of a key value.

Shard 1: Users A-H, Shard 2: Users I-P, Shard 3: Users Q-Z.
Pros: Range queries are efficient (finding all users from A-C only hits Shard 1).
Cons: Can cause Hotspots. If celebrities with millions of followers have names starting with "K", Shard 2 is disproportionately loaded.

Directory-Based Sharding

A lookup service (directory) maps each entity to its shard.

Pros: Maximum flexibility in assigning data to shards.
Cons: The lookup service itself becomes a single point of failure and a potential bottleneck.

3. Problems with Sharding

Sharding introduces significant operational complexity:

Joins Across Shards: You cannot easily perform a SQL JOIN between two tables that live on different physical servers. The application must query both shards separately and merge the results in application code.
Referential Integrity: Foreign key constraints cannot be enforced across different database servers.
Rebalancing: If one shard grows much larger than others, you need to split it and redistribute data, which is a complex and risky operation on a live system.
Increased Operational Overhead: You now have 10 database servers to monitor, back up, patch, and maintain instead of 1.

4. Vertical Partitioning

Unlike horizontal partitioning (sharding), Vertical Partitioning splits a table by columns.

A Users table might be vertically partitioned into UserProfile (id, name, bio) on one server and UserMedia (id, profile_picture_blob, video_blob) on another server optimized for large binary storage.

1. What is Sharding?

Example: A Users table with 1 billion rows is split across 10 shards. Shard 1 holds users with IDs 1-100M, Shard 2 holds 100M-200M, and so on.

2. Sharding Strategies

Hash-Based Sharding (Key-Based)

A hash function is applied to a key attribute (like UserID). The result determines which shard the data belongs to.

shard_number = hash(UserID) % number_of_shards

Pros: Data is evenly distributed across shards.

Cons: Adding a new shard changes the modulo, requiring a massive redistribution of data (solved by Consistent Hashing).

Range-Based Sharding

Data is divided based on ranges of a key value.

Shard 1: Users A-H, Shard 2: Users I-P, Shard 3: Users Q-Z.

Pros: Range queries are efficient (finding all users from A-C only hits Shard 1).

Cons: Can cause Hotspots. If celebrities with millions of followers have names starting with "K", Shard 2 is disproportionately loaded.

Directory-Based Sharding

A lookup service (directory) maps each entity to its shard.

Pros: Maximum flexibility in assigning data to shards.

Cons: The lookup service itself becomes a single point of failure and a potential bottleneck.

3. Problems with Sharding

Sharding introduces significant operational complexity:

Joins Across Shards: You cannot easily perform a SQL JOIN between two tables that live on different physical servers. The application must query both shards separately and merge the results in application code.

Referential Integrity: Foreign key constraints cannot be enforced across different database servers.

Rebalancing: If one shard grows much larger than others, you need to split it and redistribute data, which is a complex and risky operation on a live system.

Increased Operational Overhead: You now have 10 database servers to monitor, back up, patch, and maintain instead of 1.