Every large IT shop has its own unique I/O demands, from mixed workloads to high volume or velocity of data and all the points in between. When deep learning is added into the production mix, along with an advanced, large-scale analytics backbone that runs everything from back office analysis and billing to near real-time customer-facing services, the I/O burden gets heavier, even when that weight is increasingly distributed.
It is worth referencing this comprehensive view of the backend infrastructure that powers content discovery giant Taboola. All of the demands listed above, and those in the detailed overview of data ingest, analysis, training, and inference, make one thing clear: data movement matters more than ever. Building I/O subsystems that can keep up with constantly retrained models, many billions of data points flooding through daily, and the need to deliver the best content recommendation at the right time with speed and accuracy (and with proper billing and tracking for end users) is an incredible challenge, or really several challenges.
One thing that consistently makes Taboola interesting from a systems perspective, however, is that they manage complexity by taking a LEGO brick approach to their infrastructure, which allows the right server, storage, or network element to be pulled in where it’s needed the most. This type of disaggregation is nothing new with servers and has been in play at Taboola for several years, but the company has taken a similar approach with its storage infrastructure as well. By minimizing complexity and maximizing the agility and performance of their storage environment, Taboola makes sure every application has what it needs in the moment and can be flipped to support new requirements based on demand or application changes.
Yet another interesting data point about Taboola’s storage infrastructure: everything runs on HDFS. This was a bit more common five to eight years ago, when enterprise IT shops put Hadoop into production, but for distributed infrastructure running across several global datacenters it is not the norm. Really, however, Taboola rarely fits the norm; its workloads are highly tuned and tailored for the unique recommendation engine-based systems they support, as well as all the companion analytics, and it has taken years for them to get it just right. While vanilla HDFS took some wrangling, the company’s VP of IT, Ariel Pisetzky, says that they have worked to optimize it so it integrates well with the many workloads, missions, and datacenters they operate, including their AI training cluster. That is a remarkable achievement, since the file system side of AI training clusters, especially those with several hundred nodes, can get tricky.
When it comes to large-scale storage with AI in the mix, which is certainly the case at Taboola, NVIDIA GPUs are a consideration. With the high-I/O demands from the GPU-enabled training clusters, the company needs to be smart about having high performance storage that can keep up with the high compute and fast networking capabilities inside today’s most demanding GPU-dense training systems. As AI continues to evolve with NVIDIA GPUs at its heart, Taboola will be working to match high-impact GPU capability with smart, high performance storage and file systems.
The point is this: Taboola blazes its own IT trails across compute, storage, file systems, analytics integrations, AI training and inference and, for that matter, optimization around all of these areas. They rarely follow norms, at least in terms of how other large companies handle AI, analytics, and near real-time services. This is how they ended up using HDFS, for instance, and interestingly, how they brought Dell Technologies into their ultra-customizable IT worldview on the storage front. It might seem strange that a story about the need for highly customized storage infrastructure ends up hinging on customizations and unique tailoring from a company whose business is shipping out standard servers, but for Pisetzky, Dell’s role was central in helping Taboola’s IT scale with demand and new layers of complexity.
Pisetzky says that they have been at the front end of some trends, while other trends have come and gone. One early investment that turned out for the best was NVMe: the company made sizable purchases well before NVMe over Fabrics or other technologies came along to further exploit such capital expenses in hardware.
“When we started out, we were buying drives based on what Dell recommended. As we advanced with different use cases using NVMe, however, it was before there were SKUs for the kinds of servers we wanted to buy. We asked Dell to work with us, to understand our differentiated requirements, and they did. Not only did their teams work to see why this mattered for our use cases, but they worked with us to approve new drives with SKUs just for us, which they even tested in their backend labs to certify for us.”
It might seem like this is the kind of “special treatment” that OEMs give only to their largest clients, but Pisetzky says this was well before they were buying servers in the thousands and investing heavily in storage infrastructure. Dell Technologies took the time to understand their differentiation as a business and gave them the tools to make decisions about which server and storage lines to pick and how those could be tailored for their growth, which eventually exploded.
That’s because no one is doing quite what they do, at the scale they’re doing it, or with the same distributed computing, storage, network, and software/application challenges they have. “We don’t have companies we can look at for guidance on how to build the right system for this kind of mission, we have had to learn as we go,” Pisetzky says. What has emerged is a finely tuned infrastructure that works in harmony around very specific workloads and targets.
Pisetzky repeated that it was critical, especially when blazing new trails in server and storage infrastructure, to have partners that would not only lend expertise and guidance, but create new solutions based on unique, emerging needs. “Much of this work Dell did with us was before the kind of customer we are today, with tens of thousands of servers and rich storage infrastructure.”
Read the full case study about Taboola and its broader compute environment.