
Why are Facebook's documents and data so large?


Facebook is one of the largest and most popular social media platforms in the world, with roughly 2.8 billion monthly active users as of late 2020. With so many users constantly posting content, commenting, sharing photos and videos, and messaging each other, an enormous amount of data is generated on Facebook’s servers each day.

The Amount of User Content Generated

One of the main reasons why Facebook’s documents and data repositories are so large is simply the sheer amount of content that gets created and shared on the platform every day. Here are some statistics to put this into perspective:

  • Over 300 million photos are uploaded to Facebook every day
  • There are over 100 billion friend connections on Facebook
  • More than 30 billion pieces of content (web links, news stories, blog posts, notes, photos etc.) are shared on Facebook monthly
  • There are over 2.7 billion Likes and comments generated every day
  • Over 2.5 billion photos get Liked each day
  • On average, 20 minutes is spent on Facebook per visit

All of this content – photos, videos, comments, posts, shares, likes, messages between friends – needs to be stored somewhere. While Facebook uses techniques like compression to reduce the storage space needed, the sheer volume still requires massive storage capacity.

High Resolution Photos and Videos

A major driver behind Facebook’s massive data size is that users upload high-resolution photos and videos. With modern smartphone cameras having 12MP, 16MP or even 40MP sensors, image files easily reach into the multiple megabytes or tens of megabytes for a single photo.

When you multiply that by the 300 million+ photos uploaded per day, it adds up to unbelievable amounts of data. Even Facebook-owned Instagram sees over 100 million photos uploaded per day. All this data has to live somewhere on Facebook’s servers.
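The multiplication above is easy to sketch. A quick back-of-envelope calculation (using the article's 300 million figure and an assumed average compressed photo size, since the true per-photo size isn't given) shows the scale:

```python
# Back-of-envelope estimate of daily photo storage.
# PHOTOS_PER_DAY comes from the article; AVG_PHOTO_MB is an assumption.
PHOTOS_PER_DAY = 300_000_000   # uploads per day (article figure)
AVG_PHOTO_MB = 3               # assumed average size after compression

daily_tb = PHOTOS_PER_DAY * AVG_PHOTO_MB / 1_000_000  # MB -> TB
yearly_pb = daily_tb * 365 / 1_000                    # TB -> PB

print(f"~{daily_tb:,.0f} TB per day, ~{yearly_pb:,.1f} PB per year")
```

Even with a conservative 3 MB average, photos alone would add hundreds of terabytes per day, before any replication is counted.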

Lengthy Text Posts

While images and videos take up the most storage space, text-based posts and comments also accumulate into sizable amounts of data. Facebook allows up to 63,206 characters in a single post – roughly 10,000 words, about the length of a long short story.

With people often writing lengthy paragraphs or multiple paragraphs when sharing life updates, memories, opinions and more on Facebook, those tens of thousands of characters per post add up. There are also billions of comments left every day, which can be several sentences or paragraphs each. All this text data contributes to Facebook’s massive document and database sizes.
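To get a feel for how comment text alone adds up, here is a rough estimate. The comment count follows the article's "billions per day"; the average comment size is an assumption:

```python
# Rough estimate of daily comment-text volume.
# COMMENTS_PER_DAY reflects the article's "billions of comments" claim;
# AVG_COMMENT_BYTES is an assumed average including UTF-8 overhead.
COMMENTS_PER_DAY = 2_000_000_000
AVG_COMMENT_BYTES = 250

daily_gb = COMMENTS_PER_DAY * AVG_COMMENT_BYTES / 1_000_000_000
print(f"~{daily_gb:,.0f} GB of comment text per day")
```

Hundreds of gigabytes of plain text per day is small next to photos and video, but it compounds every single day, indefinitely.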

Messenger Chat Logs

Facebook’s Messenger app has over 1.3 billion monthly active users. All those Messenger conversations contain enormous amounts of text data that must be stored. There are billions of messages sent via Messenger every single day. In addition to storing the text, metadata like timestamps, sender/recipient information, attachments, geolocation data and more also add to the size of chat logs.
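The metadata point is worth illustrating: for short messages, the surrounding record can easily outweigh the message text itself. The sketch below uses an invented record layout – the field names are illustrative, not Messenger's actual schema:

```python
import json
from dataclasses import dataclass, asdict, field

# Hypothetical message record; field names are illustrative only.
@dataclass
class Message:
    sender_id: int
    recipient_id: int
    timestamp_ms: int
    text: str
    attachments: list = field(default_factory=list)

msg = Message(101, 202, 1_700_000_000_000, "ok")
encoded = json.dumps(asdict(msg))

# For a two-character message, metadata dominates the stored bytes.
print(len(msg.text), "bytes of text vs", len(encoded), "bytes stored")
```

Whatever encoding is actually used internally, the principle holds: timestamps, participant IDs and delivery state are stored for every one of the billions of daily messages.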

Multiple Copies of Data

For reliability and redundancy purposes, Facebook does not just store user data in single copies. There are multiple live copies of Facebook’s core databases stored across different data centers around the world. This is done to prevent data loss in the event of outages or disasters affecting one data center.

But the tradeoff is that now everything – photos, videos, posts, comments, messages – has to be copied multiple times. This multiplied storage requirement is factored into the massive data warehouses Facebook operates.
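The replication tradeoff is simple arithmetic. Assuming a hypothetical raw data volume and an assumed replica count (the article does not state either figure), the effective footprint scales linearly:

```python
# Effective storage footprint under replication.
# Both numbers are assumptions for illustration.
raw_pb = 100     # hypothetical volume of unique user data, in petabytes
replicas = 3     # assumed number of live copies across data centers

effective_pb = raw_pb * replicas
print(f"{raw_pb} PB of unique data -> {effective_pb} PB actually stored")
```

Every extra replica multiplies the bill for disks, power and data-center floor space, which is why redundancy decisions are made deliberately rather than by default.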

Growth Over Time

Facebook first launched in 2004. Since then, there has been nonstop user growth, engagement growth, product growth and data growth. Billions of users have been uploading photos, posting updates, commenting and messaging for nearly two decades, and data from all those years of usage continues to accumulate in Facebook’s databases.

Unlike some web services that might periodically delete old user data, Facebook retains this historical information indefinitely. This means storage needs constantly grow year after year as new data gets created in addition to the old data.

User Base Growth

Facebook’s user base has grown at an astonishing rate over the years. The platform has gone from 1 million users in 2004 to roughly 3 billion monthly active users as of 2023. With such tremendous user growth comes associated growth in photos, videos, posts, comments and other data generated.

More users equals more data. So Facebook’s databases need to keep expanding capacity to accommodate the surging volumes as the user base size increases over time.
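The sustained growth rate implied by those two endpoints is striking. Using the figures above (1 million users in 2004, roughly 3 billion in 2023), the implied compound annual growth works out as follows:

```python
# Implied average annual user growth, from the article's figures.
start_users = 1_000_000        # 2004
end_users = 3_000_000_000      # 2023 (approximate)
years = 2023 - 2004

annual_growth = (end_users / start_users) ** (1 / years) - 1
print(f"~{annual_growth:.0%} average annual growth over {years} years")
```

A user base compounding at over 50% a year for nearly two decades means storage capacity has had to compound at least as fast.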

Diversity of Content Types

Facebook supports many different types of content being shared and stored. There are photos, videos, text updates, links, comments, Messenger chats, notifications, events, groups, friend networks, topics of interest, and much more. Each content type has its own schema, storage and processing needs.

This diversity of data types that must be supported contributes to the massive overall data scale.

Ad Targeting Data

To enable targeted advertising, Facebook collects and stores extensive data about its users’ demographics, interests, behaviors and more. This includes information users provide directly to Facebook, data mined from users’ activities on Facebook, data from third party partnerships, and broader inferences made through analytics.

This wealth of user data enables advertisers to target marketing messages to very specific audiences. But it also means Facebook itself has to store and process huge amounts of analytics data in order to serve relevant ads.

Crawling and Indexing the Web

To power its social graph and content feeds, Facebook needs awareness of what’s happening across the web. The platform relies on crawlers (such as the facebookexternalhit bot that fetches pages to generate link previews) that constantly retrieve websites, news sources, blogs, and other content across the internet.

The content indexes created from this crawling require considerable storage space. All this external data being brought into Facebook’s systems contributes to its data warehouse size.

Internal Analytics Data

Facebook runs extensive analytics processes across its products and user data to optimize performance. This includes tracking user engagement metrics, ad conversion performance, feature usage, demographic trends, churn analysis, security analytics, operational analytics, and much more.

The internal analytics dashboards, log data and reports Facebook analyzes to improve its products also require storage space and contribute to its overall data scale.

Acquired Companies

Facebook has acquired dozens of companies over the years, including Instagram, WhatsApp, Oculus VR and many more. When these acquisitions occur, all of the new company’s user data, content and operational systems get absorbed into Facebook’s ecosystem.

Each major acquisition essentially dumps massive new amounts of data into Facebook’s overall storage scale.

Logging and Auditing

Facebook maintains exhaustive logs of system events, access requests, operational metrics, inbound traffic, security events, user account changes, and much more. Storing these verbose system logs and audit trails requires sizable data storage capacity.

But it provides essential system transparency, supports analytics, and aids security investigations in the event of an incident.

Backups and Archives

Facebook does not immediately delete user data when an account gets closed. For operational analysis, security forensics if needed, and compliance with legal retention policies, Facebook maintains archives of inactive account data.

These backups and archives allow Facebook to restore accounts if users change their minds. But they add to the platform’s overall data storage needs.

Compliance Requirements

As a large global business, Facebook has to comply with data protection and retention regulations like GDPR, CCPA, and various other international privacy laws. These regulations often require companies to retain certain user data for minimum periods of time before allowing deletion.

Having to preserve data to meet compliance adds further to Facebook’s data warehouse sizes.

Research and Academic Partnerships

Facebook makes some anonymized and aggregated user data available to academics and researchers through programs like Social Science One. Sharing data with research partners adds additional copies into circulation, increasing storage needs.

It also requires Facebook to develop anonymization pipelines that transform their raw data into research-safe datasets, adding engineering complexity.

Diversity of Data Formats

Facebook uses a variety of data formats, storage engines, and processing systems to handle different types of information optimally. This includes formats like:

  • Photos and videos stored as JPEG, PNG and MP4 files in dedicated blob stores (such as Facebook’s Haystack)
  • Text and structured documents in JSON and XML
  • Relational data in MySQL databases
  • Social graph data in Facebook’s TAO system
  • Key-value data in Memcached and RocksDB-based stores
  • Time-series metrics in in-memory stores like Facebook’s Gorilla
  • Search indexes in Facebook’s Unicorn engine
  • Log streams in Scribe

Supporting such a diversity of data storage formats contributes to the complexity of Facebook’s data infrastructure. There can’t be a “one size fits all” approach, necessitating specialized storage and processing systems for each data type.
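One concrete reason formats differ is storage efficiency versus readability. The sketch below compares the same small event encoded as human-readable JSON and as a packed binary row; the field layout is purely illustrative, not any actual Facebook schema:

```python
import json
import struct

# The same "like" event in two formats (layout is illustrative only).
event = {"user_id": 12345, "post_id": 67890, "ts": 1_700_000_000}

as_json = json.dumps(event).encode()
# Pack as two unsigned 64-bit ints and one unsigned 32-bit int,
# little-endian with no padding: 8 + 8 + 4 = 20 bytes.
as_binary = struct.pack("<QQI", event["user_id"], event["post_id"], event["ts"])

print(len(as_json), "bytes as JSON vs", len(as_binary), "bytes packed")
```

At billions of events per day, the difference between a self-describing format and a compact binary one is itself measured in terabytes, which is one reason specialized storage engines exist for each workload.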

Caching and Redundancy

To provide low latency, high availability access to all this data at global scale, Facebook heavily utilizes caching and redundancy techniques. This includes:

  • CDNs to cache content close to users
  • Memcached caching clusters
  • Multiple live copies of databases/tables
  • Replica shards of databases in data centers worldwide
  • Multi-datacenter replication
  • Error correction data in logs/datasets to handle data corruption

All these mechanisms for performance and reliability increase the number of data copies and caching instances that must be maintained. But they are critical to serving billions of users with responsive, always-on services.
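The basic idea behind all of these caching layers can be shown in miniature. This sketch uses Python's standard `functools.lru_cache` as a stand-in for a real cache tier, with a `time.sleep` simulating a slow database round trip:

```python
import time
from functools import lru_cache

# Minimal sketch of read-through caching: repeated reads hit an
# in-process cache instead of the (simulated) slow backing store.
@lru_cache(maxsize=1024)
def fetch_profile(user_id: int) -> dict:
    time.sleep(0.01)  # stand-in for a database round trip
    return {"id": user_id, "name": f"user{user_id}"}

fetch_profile(42)                 # cold read: pays the "database" latency
start = time.perf_counter()
fetch_profile(42)                 # warm read: served from the cache
print(f"cached read took {time.perf_counter() - start:.6f}s")
```

Each cache tier trades extra copies of data for lower latency, which is exactly the tradeoff the list above describes at global scale.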

Conclusion

In summary, Facebook’s massive data scale is driven by:

  • Billions of active users generating content and interactions continuously
  • High resolution photos and videos uploaded
  • Text posts, comments and messenger chats
  • Growth of userbase and engagement over 15+ years
  • Diversity of content types (photos, videos, docs, messages, etc)
  • Ad targeting data and analytics
  • Crawling and indexing web content
  • Internal analytics data
  • Absorbing companies acquired like Instagram
  • Logging/auditing for transparency and forensics
  • Backups, archives and compliance needs
  • Supporting many data formats, engines and structures
  • Caching, redundancy and availability mechanisms

The massive populations of people using Facebook products every day inevitably require enormous amounts of data storage and infrastructure to support all those interactions.