How does Facebook scale MySQL?

Facebook is one of the largest social media platforms in the world, with over 2 billion monthly active users as of 2022. At that scale, storing and accessing user data efficiently presents unique infrastructure challenges. Let’s take a look at how Facebook has scaled MySQL to meet the demands of its enormous user base.

The Early Days

In the early days of Facebook, a single MySQL database server was sufficient. The initial prototype of Facebook, built by Mark Zuckerberg in 2004, used MySQL to store profile data for Harvard students. However, as Facebook started expanding to other universities and gaining millions of users, this setup quickly became untenable.

Once Facebook started experiencing significant growth, they had to shard their MySQL data across multiple servers to distribute the load. These early sharding efforts were fairly rudimentary: users were assigned to database servers based on the first letter of their last name.
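
As a rough illustration only (not Facebook’s actual code), a last-name scheme boils down to a lookup like the one below; the host names and letter ranges are invented for the sketch:

```python
# Hypothetical sketch of last-name-based sharding. Host names and
# letter ranges are made up for illustration.
SHARD_MAP = {
    ("a", "f"): "db1.internal",
    ("g", "m"): "db2.internal",
    ("n", "s"): "db3.internal",
    ("t", "z"): "db4.internal",
}

def shard_for_last_name(last_name: str) -> str:
    """Pick a database host based on the first letter of the last name."""
    first = last_name[0].lower()
    for (start, end), host in SHARD_MAP.items():
        if start <= first <= end:
            return host
    raise ValueError(f"No shard configured for last name {last_name!r}")

# Every "Smith" lands on the same host -- one reason this scheme
# produces hot spots for common surnames.
print(shard_for_last_name("Smith"))  # db3.internal
```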

Scaling with Memcached

As the number of queries to their MySQL databases increased into the millions per second, Facebook hit the limits of what MySQL could handle on its own. To help alleviate the load, they implemented Memcached, a distributed in-memory caching system.

Memcached acts as a cache layer in front of the MySQL databases. Frequently accessed data is served from memory in the Memcached pools instead of hitting the databases on every request, which dramatically reduces latency and improves throughput.
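
The pattern is essentially cache-aside: check Memcached first, query MySQL only on a miss, then populate the cache for the next reader. Below is a minimal sketch using the pymemcache and mysql-connector-python client libraries; the host names, credentials, and users table are placeholders, not Facebook’s actual setup:

```python
import json

import mysql.connector
from pymemcache.client.base import Client

cache = Client(("memcache.internal", 11211))        # placeholder host
db = mysql.connector.connect(host="db1.internal",   # placeholder host
                             user="app", password="secret", database="app")

def get_user(user_id: int, ttl: int = 300) -> dict | None:
    key = f"user:{user_id}"

    # 1. Try the cache first.
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    # 2. On a miss, fall back to MySQL.
    cur = db.cursor(dictionary=True)
    cur.execute("SELECT id, name, email FROM users WHERE id = %s", (user_id,))
    row = cur.fetchone()
    cur.close()

    # 3. Populate the cache so the next read is served from memory.
    if row is not None:
        cache.set(key, json.dumps(row), expire=ttl)
    return row
```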

However, caching alone wasn’t sufficient as Facebook continued to expand. They needed ways to scale the underlying MySQL databases themselves to handle the immense volume of queries.

Sharding

After hitting the limits of caching, Facebook doubled down on sharding, splitting the data across ever more MySQL servers. The initial approach of sharding by the first letter of last names had resulted in poor distribution and hot spots.

They gradually improved the sharding approach over time:

  • Sharding by user ID – More even distribution, since sequentially assigned user IDs could be mapped to shards by hash or modulo (see the sketch after this list).
  • Sharding by geography – Grouping users in the same geographic region onto the same set of shards.
  • Sharding by event – Segmenting high-volume event data into its own databases.
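
The user-ID approach mentioned above is, at its core, a hash- or modulo-based mapping from ID to shard. The sketch below is illustrative only; the shard count and naming convention are assumptions, not Facebook’s actual scheme:

```python
import hashlib

NUM_SHARDS = 256  # assumed shard count, for illustration only

def shard_for_user(user_id: int) -> int:
    """Map a user ID to a shard index.

    Hashing (rather than plain ID ranges) keeps newly registered users,
    who hold the most recently assigned IDs, from all piling onto the
    newest shard.
    """
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def database_for_user(user_id: int) -> str:
    # Hypothetical naming convention: one logical database host per shard.
    return f"db-shard-{shard_for_user(user_id):03d}.internal"

print(database_for_user(123456))
```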

With hundreds of shards powered by thousands of MySQL servers, Facebook could distribute the read and write load much more evenly. However, managing all these shards also posed its own difficulties.

Master-Slave Replication

To improve redundancy and reduce the load on master databases, Facebook implemented master-slave replication. Each shard had a master MySQL instance that handled writes, and that master replicated its data to multiple read-only slave replicas.

This improved both performance and redundancy: reads could be served by the replicas, and if a master went down, traffic could be shifted while a replica was promoted in its place. However, as the number of shards and replicas grew, managing failovers and promoting new masters became increasingly complex.
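
At the application level, the effect of this topology is straightforward: writes go to the shard’s master, while reads can be spread across its replicas. The toy router below illustrates the idea; the host names, credentials, and use of mysql-connector-python are assumptions for the sketch, not Facebook’s implementation:

```python
import random

import mysql.connector

class ShardRouter:
    """Toy read/write splitter for one shard: writes -> master, reads -> replicas."""

    def __init__(self, master_host: str, replica_hosts: list[str]):
        self.master_host = master_host
        self.replica_hosts = replica_hosts

    def _connect(self, host: str):
        # Placeholder credentials, for illustration only.
        return mysql.connector.connect(host=host, user="app",
                                       password="secret", database="app")

    def execute_write(self, sql: str, params: tuple = ()) -> None:
        conn = self._connect(self.master_host)
        try:
            cur = conn.cursor()
            cur.execute(sql, params)
            conn.commit()
        finally:
            conn.close()

    def execute_read(self, sql: str, params: tuple = ()) -> list[dict]:
        # Pick any replica; a real router would track health and replication lag.
        conn = self._connect(random.choice(self.replica_hosts))
        try:
            cur = conn.cursor(dictionary=True)
            cur.execute(sql, params)
            return cur.fetchall()
        finally:
            conn.close()

router = ShardRouter("db-shard-007-master.internal",
                     ["db-shard-007-replica-1.internal",
                      "db-shard-007-replica-2.internal"])
```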

MySQL at Scale

Today, Facebook’s MySQL infrastructure looks markedly different from those early days of a single database server. Some key facts about their MySQL deployment at scale:

  • Hundreds of shards storing petabytes of data
  • Thousands of MySQL instances as masters and replicas
  • Custom sharding algorithms based on access patterns
  • Automated management and failover systems (a simplified failover sketch follows this list)
  • Percona Server with Facebook-specific patches and optimizations
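
The “automated management and failover” bullet above boils down to detecting a dead master and promoting a healthy replica. The sketch below is a deliberately simplified version of that control loop, not Facebook’s internal tooling; host names, credentials, and function names are hypothetical:

```python
import mysql.connector
from mysql.connector import Error

def is_alive(host: str, timeout: int = 2) -> bool:
    """Health check: can we connect and run a trivial query?"""
    try:
        conn = mysql.connector.connect(host=host, user="monitor",
                                       password="secret", database="app",
                                       connection_timeout=timeout)
        cur = conn.cursor()
        cur.execute("SELECT 1")
        cur.fetchone()
        conn.close()
        return True
    except Error:
        return False

def failover(master: str, replicas: list[str]) -> str:
    """If the master is down, promote the first healthy replica.

    A real system would pick the most up-to-date replica, stop replication
    on it, repoint the remaining replicas, and update service discovery --
    all omitted here.
    """
    if is_alive(master):
        return master
    for candidate in replicas:
        if is_alive(candidate):
            print(f"Promoting {candidate} to master")
            return candidate
    raise RuntimeError("No healthy replica available for promotion")
```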

Several open source tools have also grown up around running MySQL at this kind of scale – some built by Facebook, others adopted from the wider community:

  • Vitess – Developed at YouTube for managing large, sharded MySQL deployments.
  • MyRocks – Facebook’s RocksDB-based MySQL storage engine, optimized for space efficiency on flash storage.
  • ProxySQL – A high-performance proxy for MySQL query routing and read/write splitting.

New Data Stores

In addition to scaling MySQL, Facebook has also adopted newer database technologies better suited for certain data models and access patterns:

  • Cassandra – Originally built at Facebook for Inbox search; suited to write-heavy, high-volume workloads.
  • HBase – A distributed wide-column store, used to back Facebook Messages.
  • HDFS and Hive – The Hadoop-based foundation of Facebook’s analytics data warehouse.

These systems complement MySQL for specialized use cases like messaging and analytics, while MySQL remains a foundational technology powering Facebook’s core data storage and access – including the social graph, which is served from sharded MySQL.

Lessons Learned

Facebook’s experience scaling MySQL provides valuable lessons for anyone running mission-critical MySQL databases:

  • Plan for scaling needs early – Facebook hit limits in their initial MySQL scaling approaches as usage exploded.
  • Implement caching – Adding cache layers like Memcached is crucial for high performance.
  • Shard intelligently – The sharding strategy can have a big impact on distribution and performance.
  • Automate management – Automating failovers, replication, and management is required at scale.
  • Offload analytics – Use separate data stores like HDFS for analytics to avoid bogging down OLTP workloads.

While Facebook’s infrastructure is uniquely massive, these lessons apply equally to smaller organizations expecting significant growth in their MySQL usage.

Conclusion

Facebook’s overwhelming scale poses unique challenges in storing user data and serving requests efficiently. By combining intelligent sharding, master-slave replication, caching with Memcached, and storage-engine innovations like MyRocks, they have scaled MySQL to meet the demands of over 2 billion users.

Facebook will likely continue innovating with MySQL and complementary data stores to power their platform’s exponential growth well into the future.