Inside The Systems That Drive Facebook
March 10, 2016 Timothy Prickett Morgan
Hyperscalers have hundreds of millions to more than a billion users, which requires infrastructure on a vast scale. Those hyperscalers that are not running applications on behalf of others on public cloud slices have it easier than their cloudy peers because they have a relatively small number of applications to support. That lets them keep the types of machines in their datacenters to a minimum, which in turn keeps per-unit and operational costs low.
You might be surprised to know how few different servers it takes to run social media juggernaut Facebook, the driving force behind the Open Compute Project that was founded five years ago to shake up server, storage, and datacenter design and its supply chain. And it is perhaps even more surprising to see that the number of different types of systems deployed by Facebook is going down, not up.
Facebook, like other hyperscalers, deploys its infrastructure at the rack level and software is designed to make the fullest use possible of the compute, storage, and networking shared across the rack to run specific workloads. At the Open Compute Summit this week in San Jose, Facebook representatives and manufacturers of Open Compute gear were showing off some new hardware toys, which we will cover separately. Facebook, interestingly, gave a peek inside the current configurations of its racks, which are used to support web, database, and various storage workloads.
At the moment, Facebook has six different configurations that it rolls into its datacenters, which are characterized by its workloads and which are based on two different servers – code-named “Leopard” and “Yosemite” – and its “Wedge” top of rack switch. Here is the lineup:
The servers and storage servers in the racks are configured with different amounts of main memory, disk, and flash and depending on the workload have adjacent “Knox” Open Vault disk arrays for additional capacity. Rather than changing the form factors of the servers to accommodate more or less storage, Facebook keeps the nodes the same form factor for its Open Rack sleds and adjusts the storage using local bays and Open Vault bays. In a very real sense, the rack is the server for Facebook and its peers.
The Type I rack at Facebook is used to run its web services front end, which hosts its HipHop virtual machine and PHP stack. At the moment, this web front end has 30 servers per rack, and it can be built from either Leopard two-socket server nodes using “Haswell” Xeon E5 v3 processors or Yosemite quad-node sleds using single-socket Xeon D processors. With the custom 16-core Xeon D chip that Intel created in conjunction with Facebook, each sled can have 64 cores. With Xeon E5 processors, the highest Facebook could drive that is 36 cores per two-socket node, and that would be using top-bin Xeon E5 parts that cost three times as much as middle SKU chips, and even more compared to the Xeon Ds. Facebook configures 32 GB per server node on the web service racks, and the Yosemite sleds have a better memory to core ratio and probably cost a lot less, too. Which is why we think Facebook is probably not adding a lot of Leopard machines for the web tier right now. Yosemite, which we detailed here, was designed explicitly for this kind of work to drive up density and drive down cost. (We will be looking at the performance of Yosemite and Facebook’s new “Lightning” all-flash storage arrays separately.) The web tier nodes have a 500 GB disk drive each, and as you can see, it only takes one Wedge switch to link the nodes together and two power shelves to feed all the gear. As you can also see, the Type I rack has some empty bays for further expansion should a slightly different workload come along that needs more compute or storage or power.
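To make the core and memory arithmetic above concrete, here is a rough sketch in Python. The constant and function names are ours, the 18 cores per Xeon E5 socket is simply what the 36-core ceiling implies for a two-socket node, and we are assuming the 32 GB web tier configuration applies to a Leopard node and to each Yosemite node alike.

```python
# Back-of-the-envelope comparison of the two web tier options.
# Figures come from the article except where marked as assumptions.

XEON_D_CORES = 16             # custom Xeon D, one socket per Yosemite node
YOSEMITE_NODES_PER_SLED = 4   # Yosemite is a quad-node sled
LEOPARD_SOCKETS = 2           # Leopard is a two-socket node
XEON_E5_TOP_BIN_CORES = 18    # assumption: implied by the 36-core ceiling
MEMORY_GB_PER_NODE = 32       # assumption: same web tier config for both node types


def cores_per_yosemite_sled() -> int:
    """Cores in one quad-node Yosemite sled."""
    return XEON_D_CORES * YOSEMITE_NODES_PER_SLED


def cores_per_leopard_node() -> int:
    """Cores in one two-socket Leopard node with top-bin parts."""
    return XEON_E5_TOP_BIN_CORES * LEOPARD_SOCKETS


def memory_gb_per_core(cores_per_node: int) -> float:
    """Memory to core ratio for a node with the web tier memory config."""
    return MEMORY_GB_PER_NODE / cores_per_node


if __name__ == "__main__":
    print("Yosemite cores per sled:", cores_per_yosemite_sled())
    print("Leopard cores per node:", cores_per_leopard_node())
    print("Yosemite GB per core:", memory_gb_per_core(XEON_D_CORES))
    print("Leopard GB per core:", round(memory_gb_per_core(cores_per_leopard_node()), 2))
```

Under these assumptions a Yosemite node gets 2 GB per core while a fully loaded Leopard node gets a bit under 1 GB per core, which is the gap in memory to core ratio mentioned above.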
The Type II rack, whatever it was, has been retired. The next size up is the Type III rack, which is aimed at supporting the MySQL databases that underpin the Facebook PHP application stack. This rack is based on the Leopard two-socket sleds, which have 256 GB of main memory each (which Facebook characterizes as a high amount, although we would not call it high until you are pushing 768 GB). The server nodes each have a 128 GB microSATA drive plus two high I/O flash drives that come in at 3.2 TB each. (Facebook did not say which drive it uses.) There are two power shelves and one Wedge switch to glue it all together, and again there is plenty of room to expand the rack with more compute and storage if necessary for workloads other than MySQL.
The Type IV rack at Facebook is configured for Hadoop data warehousing storage and analytics. Two years ago, Facebook had over 25,000 nodes dedicated to Hadoop, and this has probably at least doubled since that time. The Hadoop racks have 18 Leopard servers, each configured with a RAID disk controller, linking up to nine Knox Open Vault storage shelves. Each Knox storage shelf has two bays of disk drives, each supporting 15 3.5-inch SATA drives. While the industry has moved on to 6 TB and 8 TB drives, Facebook is still using 4 TB drives, for a total of 120 TB per Knox array. Facebook partitions each of the nine Knox storage arrays into two slices, with 15 drives allocated to each of the 18 Leopard nodes. Each Leopard node has 128 GB of main memory (which Facebook calls a medium configuration) and has 60 TB of disk across those 15 spindles. We do not know what processors Facebook is using for Hadoop, but it seems likely that it is keeping close to parity between processor core count per node and the number of spindles attached to it. As you can see from the diagram, there is not a lot of empty space in this rack.
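The way the Knox shelves are carved up among the Hadoop nodes is easy to check with a few lines of Python. This is only a sketch using the numbers quoted above; the names are ours, and the capacities are raw terabytes before any RAID or HDFS replication overhead, which Facebook did not detail.

```python
# Raw capacity arithmetic for the Hadoop rack described above.
# All names here are illustrative, not Facebook's.

KNOX_ARRAYS = 9       # Knox Open Vault shelves in the rack
BAYS_PER_KNOX = 2     # each shelf has two bays (one slice per bay)
DRIVES_PER_BAY = 15   # 3.5-inch SATA drives per bay
DRIVE_TB = 4          # Facebook is still on 4 TB drives
LEOPARD_NODES = 18    # Hadoop servers in the rack


def hadoop_rack_summary() -> dict:
    """Return per-array, per-node, and per-rack raw capacity figures."""
    slices = KNOX_ARRAYS * BAYS_PER_KNOX
    total_drives = slices * DRIVES_PER_BAY
    drives_per_node = total_drives // LEOPARD_NODES
    return {
        "tb_per_knox": BAYS_PER_KNOX * DRIVES_PER_BAY * DRIVE_TB,
        "slices": slices,
        "drives_per_node": drives_per_node,
        "tb_per_node": drives_per_node * DRIVE_TB,
        "raw_tb_per_rack": total_drives * DRIVE_TB,
    }


if __name__ == "__main__":
    for key, value in hadoop_rack_summary().items():
        print(f"{key}: {value}")
```

Nine arrays split in two gives 18 slices, one for each of the 18 Leopard nodes, and the rack as a whole works out to a little over a petabyte of raw disk.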
The Type V rack is used for Facebook’s “Haystack” object storage, which is used to house exabytes – yes, exabytes – of photos and videos, and it dials up the number of Knox arrays and dials down the number of Leopard servers within the Open Rack. The Leopard servers have a low 32 GB of memory per node, but each of the dozen nodes in the rack is allocated an entire Knox Open Vault array, yielding 30 4 TB drives per node for 120 TB of capacity per node. We are amazed that Facebook has not moved to fatter disk drives for Haystack, but when you buy in bulk, you can probably stay off the top-end parts. This time last year, Jason Taylor, vice president of infrastructure foundation at Facebook, told The Next Platform that users were uploading 40 PB of photos per day at Facebook, and with its new push into streaming video, the rate of capacity expansion here must be enormous.
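To get a feel for what exabytes means in racks, here is a rough Python sketch. It assumes decimal units (1 EB = 1,000,000 TB) and counts raw disk only; Facebook did not give replication or encoding overhead for Haystack, so the real rack count would be higher. The names are ours.

```python
import math

# Raw capacity of one Haystack rack, using the figures quoted above.
NODES_PER_RACK = 12          # a dozen Leopard servers
DRIVES_PER_NODE = 30         # one full Knox array per node
DRIVE_TB = 4                 # 4 TB SATA drives
TB_PER_EXABYTE = 1_000_000   # assumption: decimal units


def haystack_raw_tb_per_rack() -> int:
    """Raw terabytes in one Haystack rack."""
    return NODES_PER_RACK * DRIVES_PER_NODE * DRIVE_TB


def racks_per_raw_exabyte() -> int:
    """Racks needed to hold one exabyte of raw disk (no redundancy)."""
    return math.ceil(TB_PER_EXABYTE / haystack_raw_tb_per_rack())


if __name__ == "__main__":
    print("Raw TB per rack:", haystack_raw_tb_per_rack())
    print("Racks per raw exabyte:", racks_per_raw_exabyte())
```

At 1.44 PB of raw disk per rack, each exabyte takes roughly 700 of these racks before any redundancy is counted.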
The Type VI rack is for heavy cache applications like the Facebook News Feed, ad serving, and search. There are no Knox disk arrays in this setup, but each of the 30 Leopard servers has 256 GB of main memory (a high configuration) and a 2 TB disk drive (a midsized one in Facebook’s categorization). We don’t know this for certain, but the rack looks like a high compute/high memory setup designed to speed up access to data, and it probably has a high core count per CPU. You might wonder why the News Feed is backed by disk instead of flash, but as Bobby Johnson, the creator of Facebook’s Haystack object storage, explained in a contributed article recently, the size of the News Feed data quickly outgrew the size of the flash and they had to move to disk. Obviously, with flash drives now pushing 10 TB, size is not the issue, but cost still is.
The final Facebook rack is actually not a single rack, but one of Facebook’s triplet Open Rack setups crammed with a total of six Leopard servers and 48 Knox storage arrays for implementing Facebook’s cold storage. Here’s what it looks like:
Each Leopard server has 128 GB of memory and has 240 drives attached to it, for a total of 960 TB of capacity per server. Given the way the power management and erasure coding work on Facebook’s cold storage, data is spread across one drive per storage shelf per rack per server, and at any given time, only sixteen drives can be fired up and accessing data. The other drives in the rack are spun down and sitting quietly, waiting for an access. This allows Facebook to do 1 exabyte of storage in a power envelope of about 1.5 megawatts.
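The cold storage numbers hang together, as a quick Python sketch shows. It assumes the same 30-drive Knox arrays and 4 TB drives used in the other racks, decimal units (1 EB = 1,000,000 TB), and raw capacity with no allowance for erasure coding overhead; the names are ours.

```python
# Consistency check and power density for the cold storage triplet.
SERVERS_PER_TRIPLET = 6      # Leopard servers across the three racks
KNOX_PER_TRIPLET = 48        # Knox storage arrays across the three racks
DRIVES_PER_KNOX = 30         # assumption: same Knox layout as the other racks
DRIVE_TB = 4                 # assumption: same 4 TB drives
TB_PER_EXABYTE = 1_000_000   # assumption: decimal units
POWER_WATTS_PER_EB = 1.5e6   # about 1.5 megawatts per exabyte


def drives_per_server() -> int:
    """Drives attached to each Leopard server in the triplet."""
    return KNOX_PER_TRIPLET * DRIVES_PER_KNOX // SERVERS_PER_TRIPLET


def raw_tb_per_server() -> int:
    """Raw terabytes behind each server."""
    return drives_per_server() * DRIVE_TB


def raw_tb_per_triplet() -> int:
    """Raw terabytes in one three-rack cold storage unit."""
    return raw_tb_per_server() * SERVERS_PER_TRIPLET


def watts_per_tb() -> float:
    """Power budget per terabyte implied by the quoted envelope."""
    return POWER_WATTS_PER_EB / TB_PER_EXABYTE


if __name__ == "__main__":
    print("Drives per server:", drives_per_server())
    print("Raw TB per server:", raw_tb_per_server())
    print("Raw TB per triplet:", raw_tb_per_triplet())
    print("Watts per TB:", watts_per_tb())
```

The 48 arrays divide evenly into 240 drives and 960 TB per server, matching the figures above, and the quoted power envelope works out to about 1.5 watts per terabyte stored.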