Facebook has to keep digging into ever-lower levels of its architecture to make efficient use of endlessly growing training data. This has meant rethinking how ML training pipelines operate from ingestion to inference, sparking ideas about what system trends will dominate in industrial scale AI.
From post ranking and content understanding to object and face recognition, segmentation, tracking, speech recognition, and translation, Facebook is awash in models for predicting and understanding what its users do and what they might want next. All of this data feeding, back and forth, is executed on a wide range of devices for a long list of purposes.
On the backend, the social media giant is flooded with training data. While algorithms have helped decide what to keep or trash and where to feed it in the ML training pipeline, all of this comes with an increasing cost — and unexpected limitations in how it can deploy systems.
In two-thirds of its preprocessing pipelines, Facebook spends less power on the GPUs than on the data ingestion stage itself, says Facebook software engineer Niket Agarwal. “This is constraining. We need to deploy more training capacity, which means we have to deploy more data ingestion capacity to serve those GPUs. The effect is that this limits the number of training clusters we can deploy.”
This is particularly the case with recommender systems, which are one of the most important AI/ML workloads at Facebook — not to mention one of the most memory-intensive (thus expensive in terms of data movement/time). “The memory bandwidth requirements reach hundreds of TB/s and the compute throughput requirements reach petaops per second,” Agarwal adds.
This problem will only compound over the next few years. Agarwal adds that data storage requirements for recommendation systems have grown by 1.75X over the last two years with 13X growth in training data ingestion throughput projected over the next three years.
Colleague Carole-Jean Wu, research scientist at Facebook, adds to those numbers: “We start with unstructured data, extract features and use that to build our model and figure out how much weight we need for various features, then deploy to make various predictions. The scale of that process is hundreds of petabytes of data replicated daily, and data pipelines must operate at TB/s processing throughput. We devote a few million machines to efficiently feed data samples to tens of thousands of machine training pipelines.”
“At that scale, there are real implications for how we design and optimize the data preprocessing stage,” she adds.
Mere partitions of some of Facebook’s recommendation models are in the 25 petabyte range. During his Hot Chips presentation this week, Agarwal points to “traditional” ML training systems built on Nvidia DGX boxes as an example. “For us there is simply no way to store this on the trainer nodes.” He says that for preprocessing and data transformations, the number of CPU sockets (x86) required to build a single system is anywhere from 10–55. “There is no way to build a system with enough CPU sockets to do this. Local data storage and preprocessing don’t work for us.”
Here is how Facebook has navigated the heavy-duty ingest and preprocessing once teams realized that disaggregation was the only solution.
To store all of this, Facebook uses its own Tectonic distributed file system. Offline ETL pipelines generate the training features by pulling event logs from Facebook’s many services and converting them into data warehouse table format. The distributed file system stores these tables, and the data is then pulled for preprocessing and fed to GPU training systems. This is, by the way, how Facebook has been handling the ingest/preprocessing loop for three years and, while Agarwal says the approach is fault-tolerant and necessarily disaggregated, there are problems ahead.
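To make the flow concrete, here is a minimal sketch of that kind of offline ETL stage: event logs are pulled in, flattened into feature rows, and transposed into a columnar, warehouse-table-like layout. All names, fields, and the derived feature are illustrative assumptions, not Facebook’s actual schema.

```python
def extract_features(event: dict) -> dict:
    """Turn one raw event-log entry into a flat feature row (hypothetical fields)."""
    return {
        "user_id": event["user_id"],
        "item_id": event["item_id"],
        # One derived feature; real pipelines compute thousands of these.
        "engagement": event.get("clicks", 0) / max(event.get("views", 1), 1),
        "label": 1 if event.get("clicked") else 0,
    }

def to_warehouse_columns(rows: list) -> dict:
    """Transpose row-oriented feature dicts into column-oriented lists,
    the layout a data-warehouse table format would store."""
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}

# Simulated event logs standing in for service logs.
events = [
    {"user_id": 1, "item_id": 7, "clicks": 3, "views": 10, "clicked": True},
    {"user_id": 2, "item_id": 9, "clicks": 0, "views": 5, "clicked": False},
]
table = to_warehouse_columns([extract_features(e) for e in events])
```

A trainer-side reader would then scan `table` column by column, which is what makes the storage-level filtering optimizations discussed later possible.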
“Disaggregation is not enough,” he says. Consider the power requirements alone, and recall what he said above about keeping the GPUs fed.
The big system fixes, like taking a disaggregated approach, have been made. What this means is that Facebook is going to have to keep digging to ever-lower depths in its stack to optimize this stage, which is the linchpin for its ability to keep scaling AI training as it’s done now.
The lower-level projects to keep tuning this stage include adding some I/O buffer with NVMe (building a distributed cache on top of the preprocessing system) and getting more efficient with training data reuse.
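The NVMe buffering idea amounts to a read-through cache: hot training data is served from a fast local tier and only fetched from the remote distributed file system on a miss, which is how data reuse cuts ingest load. Below is a minimal LRU sketch of that pattern; the class, eviction policy, and capacity accounting are assumptions for illustration, not Facebook’s actual implementation.

```python
from collections import OrderedDict

class ReadThroughCache:
    """Local fast tier (standing in for NVMe) in front of remote storage."""
    def __init__(self, backing_store, capacity: int):
        self.backing_store = backing_store  # slow path, e.g. remote DFS read
        self.capacity = capacity            # budget, counted in entries here
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)     # LRU: mark as recently used
            return self.cache[key]
        self.misses += 1
        data = self.backing_store(key)      # remote fetch on a miss
        self.cache[key] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return data

remote_reads = []
def remote_fetch(key: str) -> bytes:
    remote_reads.append(key)                # track how often we hit remote storage
    return key.encode()

cache = ReadThroughCache(remote_fetch, capacity=2)
for key in ["a", "b", "a", "a", "c", "b"]:
    cache.read(key)
```

In this toy run, six reads become four remote fetches; at Facebook’s scale, the same reuse effect is what makes the NVMe tier worth its price.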
Other optimizations include feature flattening (pushing data filtering to storage nodes), optimizing data formats, merging reads to get better disk throughput, and using feature reordering to cut down on unnecessary reads. While these optimizations come with software cost, we can only imagine the NVMe price tag for infrastructure at this scale. Nonetheless, they are beginning to pay dividends, as Agarwal shared during Hot Chips.
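The read-merging and reordering ideas can be sketched together: if the features a model needs sit near each other on disk, their byte ranges coalesce into fewer, larger reads. The layout, gap threshold, and feature names below are illustrative assumptions, not Facebook’s actual parameters.

```python
def merge_reads(ranges, max_gap=0):
    """Coalesce sorted (offset, length) byte ranges whose gap is <= max_gap."""
    merged = []
    for off, length in sorted(ranges):
        if merged and off - (merged[-1][0] + merged[-1][1]) <= max_gap:
            prev_off, prev_len = merged[-1]
            # Extend the previous range to cover this one.
            merged[-1] = (prev_off, max(prev_len, off + length - prev_off))
        else:
            merged.append((off, length))
    return merged

# Hypothetical on-disk layout: feature name -> (offset, length) in bytes.
layout = {"f0": (0, 8), "f1": (8, 8), "f2": (16, 8), "f9": (72, 8)}

# A model requests f2, f0, f9, f1: naively four reads, but after sorting
# by offset the three adjacent features collapse into a single read.
wanted = ["f2", "f0", "f9", "f1"]
reads = merge_reads([layout[f] for f in wanted])
```

Feature reordering is the storage-side counterpart: laying out frequently co-accessed features contiguously so that this coalescing fires more often and cold features never get read at all.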
Hyperscalers have to figure out these AI pain points the hard way since they are blazing trails at scale. The takeaway from Facebook’s showing at Hot Chips this year is one we have known intellectually but haven’t heard much about on the ground: the limits of AI training systems might have far less to do with compute capacity and capability and more to do with caps on how much data ingestion can handle and efficiently feed onward.