Fork me on GitHub

My data is bigger than your data!

How big is big data really? From time to time, various organizations brag about how much data they have, how big their clusters are, how many requests per second they serve, etc. Every time I come across these statistics, I make note of them. It's quite amazing to see how these numbers change over time... looking at the numbers from just a few years ago reminds you of this famous Austin Powers scene. Here's another gem.

Without further adieu, here's "big" data, in reverse chronological order...

Cassandra at Apple (June 2019)

Internet Archive (May 2018)

Netflix Statistics on AWS (April 2018)

A core component of its stream processing system is something called the Keystone data pipeline, which is used to move upwards of 12PB of data per day into its S3 data warehouse, where 100PB of highly compressed data reside.

Keystone Router, a key piece of software that distribute the 3 trillion events per day across 2,000 routing jobs and 200,000 parallel operators to other data sinks in Netflix's S3 repository.

Source: How Netflix Optimized Flink for Massive Scale on AWS

14 TB Hard Drives (April 2018)

Western Digital Introduces Ultrastar DC HC530 14TB Hard Drive.

Plusar and Heron (March 2018)

Pulsar has been in production for three years at Yahoo, where it handles 2 million plus topics and processes 100 billion messages per day.

At Twitter, Heron is used by over 50 teams and processes 2 trillion events per day, or about 20PB per day.

Source: Streamlio Claims Pulsar Performance Advantages Over Kafka

Kafka at Netflix: >1 trillion messages per day (March 2018)

Prime Day 2017 – Powered by AWS (September 2017)

During Amazon Prime Day, 2017: Amazon DynamoDB requests from Alexa, the sites, and the Amazon fulfillment centers totaled 3.34 trillion, peaking at 12.9 million per second.

Source: AWS Blog

Pulsar and Apache DistributedLog (September 2017)

Apache DistributedLog is a replicated log store originally developed at Twitter. It’s been used in production at Twitter for more than four years, supporting several critical services like pub-sub messaging, log replication for distributed databases, and real-time stream computing, delivering more than 1.5 trillion events (or about 17 PB) per day.

Source: Strata Data Conference

European Space Agency's Gaia Mission (August 2017)

Gaia continues to be a challenging mission in all areas even after 4 years of operation. In total we have processed almost 800 Billion (=800,000 Million) astrometric, 160 Billion (=160,000 Million) photometric and more than 15 Billion spectroscopic observation which is the largest astronomical dataset from a science space mission until the present day.

The Gaia mission is considered by the experts “the biggest data processing challenge to date in astronomy.

Source: ODBMS Blog: Gaia Mission maps 1 Billion stars. Interview with Uwe Lammers

Internet Archive (June 2017)

Archives Unleashed 4.0 (June 2017)

More details at Archives Unleashed 4.0.

Spark Summit (June 2017)

Addition details can be found on the DataBricks blog: Making Apache Spark the Fastest Open Source Streaming Engine

1 TB graph data in Neo4j (May 2017)

According to Niels Meersschaert, Chief Technology Officer at Qualia, the Qualia team relies on over one terabyte of graph data in Neo4j.

Source: ODBMS Blog: Identity Graph Analysis at Scale. Interview with Niels Meersschaert

Neo4j Pushes Graph DB Limits Past a Quadrillion Nodes (April 2017)

A graph database with a quadrillion nodes? Such a monstrous entity is beyond the scope of what technologist are trying to do now. But with the latest release of the Neo4j database from Neo Technology, such a graph is theoretically possible.

Source: datanami

NYSE Data Hub (March 2017)

The new enterprise data hub supports data processing and analytics across more than 20 PB of data, with 30 TB of fresh data added daily.

Source: NYSE: Gaining Real-time Insights from More Than 20 PB of Data

Twitter Infrastructure (January 2017)

Overview of Twitter's current infrastructure:


Source: The Infrastructure Behind Twitter

Internet Archive (January 2017)

Kafka at Uber (December 2016)

Since January 2016 Chaperone has been a key piece of Uber Engineering’s multi–data center infrastructure, currently handling about a trillion messages a day.

Source: Introducing Chaperone: How Uber Engineering Audits Kafka End-to-End

Internet Archive (October 2016)

In October of 2012, we held just over 10 petabytes of unique content. Today, we have archived a little over 30 petabytes, and we add between 13 and 15 terabytes of content per day (web and television are the most voluminous).

Currently, Internet Archive hosts about 20,000 individual disk drives. Each of these are housed in specialized computers (we call them “datanodes”) that have 36 data drives (plus two operating systems drives) per machine. Datanodes are organized into racks of 10 machines (360 data drives), and interconnected via high-speed ethernet to form our storage cluster. Even though our content storage has tripled over the past four years, our count of disk drives has stayed about the same. This is because disk drive technology improvements. Datanodes that were once populated with 36 individual 2-terabyte (2T) drives are today filled with 8-terabyte (8T) drives, moving single node capacity from 72 terabytes (64.8T formatted) to 288 terabytes (259.2T formatted) in the same physical space! This evolution of disk density did not happen in a single step, so we have populations of 2T, 3T, 4T, and 8T drives in our storage clusters.

Source: Internet Archive Blog

Metamarkets (September 2016)

Today Metamarkets processes over 300 billion events per day, representing over 100 TB going through a single pipeline built entirely on open source technologies including Druid, Kafka, and Samza. Growing to such a scale presents engineering challenges on many levels, not just in design but also with operations, especially when downtime is not an option.

Source: Strata+Hadoop World

Facebook (August 2016)

A single Spark job that reads 60 TB of compressed data and performs a 90 TB shuffle and sort.

Source: Facebook Engineering

Samsung's New SSD (August 2016)

Samsung is showing off a monster million IOPS SSD that can pump out read data at 6.4 gigabytes per second and store up to 6.4TB.

Source: The Register

Tesla (May 2016)

At ‪#‎EmTechDigital: Tesla now gains a million miles of driving data (comparing human to robot safety) every 10 hours. So they log more miles per day than the Google program has gathered since inception. Tesla's design goal is to be 2-10x better than human drivers.

Source: Steve Jurvetson

Google Translate (May 2016)

Google translates more than 100 billion words each day.

Source: Official Google Blog

Related, Google CEO Sundar Pichai announced during his Google I/O keynote that 20 percent of queries on its mobile app and on Android devices are voice searches.

Source: Search Engine Land

Google App Engine (April 2016)

Google App Engine serves over 100 billion requests per day.

Source: Google blog

What happens in a 2016 internet minute? (April 2016)

Scale of Spotify (March 2016)

EMC Unveils DSSD D5: A Quantum Leap In Flash Storage (February 2016)

Source: EMC Press Release

1PB Uploaded to YouTube Every Day (February 2016)

For YouTube alone, users upload over 400 hours of video every minute, which at one gigabyte per hour requires more than one petabyte (1M GB) of new storage every day or about 100x the Library of Congress. As shown in the graph, this continues to grow exponentially, with a 10x increase every five years.

Source: Google Cloud Platform Blog

Sequencing vs Moore's Law (January 2016)

Source: DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP)

One Trillion Messages per Day with Kafka (November 2015)

ONE MILLION HTTP requests per second (November 2015)

Demo of a Google Container Engine cluster serving 1 million HTTP requests per second.

Source: Kubernetes blog

Spark at Scale (November 2015)

Source: Keynote by Ion Stocia at IEE Big Data 2015

Big data... blast from the past

Large Scale Distributed Deep Learning on Hadoop Clusters (September 2015)

Yahoo has a current footprint of more than 40,000 servers and 600 petabytes of storage spread across 19 clusters.

Source: Yahoo! blog

Big Code at Google (September 2015)

Google has around two billion lines of code totaling around 85 TB, all in a mono-repo. There are around 45,000 commits per day from 25,000 engineers. See talk here.

Source: Wired

Hadoop at Twitter (September 2015)

Our Hadoop filesystems host over 300PB of data on tens of thousands of servers. Blog post below further describes Twitter's federated, multi-DC Hadoop setup.

Source: Hadoop filesystem at Twitter

HBase at Bloomberg (September 2015)

The Bloomberg end of day historical system is called PriceHistory... This system was designed for the high volume of requests for single securities and quite impressive for its day. The response time for a year's data for a field on a security is 5ms, and it receives billions of hits a day, around half a million a second at peak.

Source: Apache Blog

Kafka at LinkedIn handles 1.1 Trillion Messages Per Day (September 2015)

Source: Confluent blog post

Scalable of the Go language (July 2015)

Hadoop Summit (June 2015)

Kafka at Uber (May 2015)

Yahoo's Pistachio (May 2015)

Pistachio is Yahoo's recently open-sourced distributed key value store system:

It’s being used as the user profile storage for large scale ads serving products within Yahoo. 10+ billions of user profiles are being stored with ~2 million reads QPS, 0.8GB/s read throughput and ~0.5 million writes QPS, 0.3GB/s write throughput. Average latency is under 1ms. It guarantees strong consistency and fault-tolerance. We have hundreds of servers in 8 data centers all over the globe supporting hundreds of millions in revenue.

Source: Yahoo! Engineering Blog

Updates form HBaseCon (May 2015)

Bay Area Mesos Meetup (April 2015)

Blog post describing how Siri's backend is powered by Apache Mesos.

Statistics from the Hadoop Summit in Brussels (April 2015)

Hadoop at Twitter (March 2015)

The Internet? Bah! (February 2015)

The Explosive Growth of Spark (February 2015)

Kafka (February 2015)

Today at LinkedIn Kafka handles over 500 billion events per day spread over a number of data centers. It became the backbone for data flow between systems of all kinds, the core pipeline for Hadoop data, and the hub for stream processing.

Source: Confluence Blog

Strata (February 2015)

Andrew Moore shares his experiences at Google (January 2015)

Andrew Moore (CMU professor who built Google Pittsburgh and helped grow Google's Adwords and shopping systems) shared his experiences at the NITRD Big Data Strategic Initiative Workshop held at Georgetown in January, 2015. I attended the workshop and captured a few tidbits:

Petabyte Sort with Spark (November 2014)

Spark sorts 100 TB in 23 minutes on 206 machines (6592 virtual cores), which translates into 4.27 TB/min or 20.7 GB/min/node.

Spark sorts 1 PB in 234 minutes on 190 machines (6080 virtual cores), which translates into 4.27 TB/min or 22.5 GB/min/node.

Source: Databricks blog and also this older post.

Hadoop at Yahoo (September 2014)

From an interview with Mithun Radhakrishnan, member of the Yahoo Hive team:

"Y!Grid is Yahoo's Grid of Hadoop Clusters that's used for all the "big data" processing that happens in Yahoo today. It currently consists of 16 clusters in multiple datacenters, spanning 32,500 nodes, and accounts for almost a million Hadoop jobs every day. Around 10% of those are Hive jobs."

Source: ODBMS Industry Watch

Gartner: Hype Cycle for Big Data (August 2014)

Big data has just passed the top of the Hype Cycle, and moving toward the Trough of Disillusionment. This is not a bad thing. It means the market starts to mature, becoming more realistic about how big data can be useful for organizations   large and small. Big data will become business as usual.

Source: Gartner

Twitter (July 2014)

Pinterest (July 2014)

Some stats:

Source: Pinterest Engineering Blog

Baidu (July 2014)

HDFS at Twitter (June 2014)

ACL Lifetime Achievement Award (June 2014)

Airbnb (June 2014)

DataTorrent at the Hadoop Summit (June 2014)

[David] Hornik [of August Capital] was also an early stage investor in Splunk, and he sees lots of potential here for DataTorrent. “When you can process a billion data points in a second, there are a lot of possibilities.”

Source: TechCrunch

What would you do with a system that could process 1.5 billion events per second? That’s the mind-boggling rate at which DataTorrent’s Real-Time Streaming (RTS) offering for Hadoop was recently benchmarked. Now that RTS is generally available–DataTorrent announced its general availability today at Hadoop Summit in San Jose–we may soon find out.

That 1.5-billion-events-per-second figure was recorded on DataTorrent’s internal Hadoop cluster, which sports 34 nodes. Each node is able to process tens of thousands of incoming events (call data records, machine data, and clickstream data are common targets) per second, and in turn generates hundreds of thousands of secondary events that are then processed again using one of the 400 operators that DataTorrent makes available as part of its in-memory, big-data kit.

Source: Datanami

Yahoo! and the Hadoop Summit (June 2014)

[local cached copy]

Snapchat (May 2014)

Snapchat claims that over 700 million snaps are shared per day on the service, which could make it the most-used photo-sharing app in the world — ahead of Facebook, WhatsApp, and others. Even Snapchat’s Stories feature seems to be doing well, amassing 500 million views per day.

Source: The Verge

Kafka at LinkedIn (May 2014)

Source: Samza at LinkedIn: Taking Stream Processing to the Next Level by Martin Kleppmann at Berlin Buzzwords on May 27, 2014

What "Big Data" Means to Google (May 2014)

Tape Storage (May 2014)

Here's a local copy of infographic.

Here's more bragging from IBM about achieving a new record of 85.9 billion bits of data per square inch in areal data density on low-cost linear magnetic particulate tape.

Google's Bigdata at HBaseCon2014 (May 2014)

Bigtable scale numbers from keynote talk at HBaseCon2014 by Carter Page:

More HBaseCon2014 highlights.

Hadoop at eBay (April 2014)

Hive at Facebook (April 2014)

Our warehouse stores upwards of 300 PB of Hive data, with an incoming daily rate of about 600 TB.

Source: Facebook Engineering Blog

Flurry's HBase Cluster (April 2014)

Flurry has the largest contiguous HBase cluster: Mobile analytics company Flurry has an HBase cluster with 1,200 nodes (replicating into another 1,200 node cluster).

Editorial Note: Really? Even larger than Facebook's HBase cluster that powers Messages?

Source: O'Reilly Radar

Kafka at LinkedIn (April 2014)

Internet Archive (February 2014)

How Google Backs Up The Internet (January 2014)

Talk How Google Backs Up the Internet. Here's a nice blog post summarizing talk contents.

Interesting tidbits:

Other (personal) observations:

Google's BigTable (September 2013)

Source: Jeff Dean's talk slides at XLDB 2013 [local copy]

Google's Disks (2013)

Estimate: Google has close to 10 exabytes of active storage attached to running clusters.

Source: What if?

For reference: Total disk storage systems capacity shipped (in 2013) reached 8.2 exabytes.

Source: IDC Press Release

NSA's datacenter (Summer 2013)

Facebook's Giraph (August 2013)

All of these improvements have made Giraph fast, memory efficient, scalable and easily integrated into our existing Hive data warehouse architecture. On 200 commodity machines we are able to run an iteration of page rank on an actual 1 trillion edge social graph formed by various user interactions in under four minutes with the appropriate garbage collection and performance tuning. We can cluster a Facebook monthly active user data set of 1 billion input vectors with 100 features into 10,000 centroids with k-means in less than 10 minutes per iteration.

Source: Facebook Engineering

Hadoop Summit (June 2013)

Amazon S3 (April 2013)

There are now more than 2 trillion objects stored in Amazon S3 and that the service is regularly peaking at over 1.1 million requests per second.

Source: Amazon Web Services Blog

Hadoop Summit (March 2013)

Strata (February 2013)

Hadoop at Yahoo! (February 2013)

Around ~45k hadoop nodes, ~350 PB total

Source: YDN Blog

Internet Archive reaches 10 PB (October 2012)

Blog post about the Internet Archive's 10 PB party.

Source: Internet Archive

Google at SES San Francisco (August 2012)

Google has seen more than 30 trillion URLs and crawls 20 billion pages a day. One hundred billion searches are conducted each month on Google (3 billion a day).

Source: Spotlight Keynote With Matt Cutts #SESSF (from Google)

Cassandra at eBay (August 2012)

eBay Marketplaces:

A glimpse on our Cassandra deployment:

Source: Slideshare [local copy]

The size, scale, and numbers of (June 2012)

Source: Hugh Williams Blog Post

Facebook at Hadoop Summit (June 2012)

Square Kilometer Array (June 2012)

Amazon S3 Cloud Storage Hosts 1 Trillion Objects (June 2012)

Late last week the number of objects stored in Amazon S3 reached one trillion.

Source: Amazon Web Services Blog

Pinterest Architecture Update (May 2012)

18 Million Visitors, 10x Growth, 12 Employees, 410 TB Of Data.

80 million objects stored in S3 with 410 terabytes of user data, 10x what they had in August. EC2 instances have grown by 3x. Around $39K fo S3 and $30K for EC2.

Source: High Scalability Blog Post

Six Super-Scale Hadoop Deployments (April 2012)

Source: Datanami

Ranking at eBay (April 2012)

eBay is amazingly dynamic. Around 10% of the 300+ million items for sale end each day (sell or end unsold), and a new 10% is listed. A large fraction of items have updates: they get bids, prices change, sellers revise descriptions, buyers watch, buyers offer, buyers ask questions, and so on. We process tens of millions of change events on items in a typical day, that is, our search engine receives that many signals that something important has changed about an item that should be used in the search ranking process. And all that is happening while we process around 250 million queries on a typical day.

Source: Hugh Williams Blog Post

Modern HTTP Servers Are Fast (March 2012)

A modern HTTP server (nginx 1.0.14) running on somewhat recent hardware (dual Intel Xeon X5670, 6 cores at 2.93 GHz, with 24GB of RAM) is capable of servicing 500,000 Requests/Sec.

Source: The Low Latency Web

Strata (February 2012)

Tumblr Architecture (February 2012)

15 Billion Page Views A Month And Harder To Scale Than Twitter: 500 million page views a day, a peak rate of ~40k requests per second, ~3TB of new data to store a day, all running on 1000+ servers.

Source: High Scalability Blog Post

Digital Universe (2011)

In 2011 the world will create a staggering 1.8 zettabytes.

Source: IDC

DataSift Architecture (November 2011)

Source: High Scalability Blog Post

Hadoop at Facebook (July 2011)

In 2010, Facebook had the largest Hadoop cluster in the world, with over 20 PB of storage. By March 2011, the cluster had grown to 30 PB.

Source: Facebook Engineering Blog

Facebook (June 2010)

Hive at Facebook (May 2009)

In a new-to-me metric, Facebook has 610 Hadoop nodes, running in a single cluster, due to be increased to 1000 soon.

Source: DBMS2

Datawarehouses at eBay (April 2009)

Metrics on eBay's main Teradata data warehouse include:

Metrics on eBay's Greenplum data warehouse (or, if you like, data mart) include:

Source: DBMS2

The World's Technological Capacity (2007)

In 2007, humankind was able to store 295 exabytes.

Source: Science Magazine

All the empty or usable space on hard drives, tapes, CDs, DVDs, and memory (volatile and nonvolatile) in the market equaled 264 exabytes.

Source: IDC

Visa Credit Card Transactions (2007)

According to the Visa website, they processed 27.612 billion transactions in 2007. This means an average of 875 credit transactions per second based on a uniform distribution. Assuming that 80% of transactions occur in the 8 hours of the day, this gives an event rate of 2100 transactions per second.

Source: Schultz-Møller et al. (2009)