Why Dropbox’s Exascale Strategy Is Long-Term, On-Prem Disk

The various life-extension technologies that will keep disk at the forefront of some of the largest storage installations are working, and they are keeping disk's largest consumers, like Dropbox, around for the long haul…

When it comes to exascale storage capacity, the national labs have nothing on Dropbox.

The company's custom-built system for storing and managing hundreds of billions of user content items (with multiple petabytes uploaded daily) has had to stand the test of time, performance, and cost pressures. And the heart of it all, now and for at least the next five years, is disk, and, with the exception of international locations, all on-prem.

In the company's early days, the key to growth and scalability was the public cloud. Dropbox was one of the much-heralded use cases for Amazon's S3 service as far back as 2013, but it quickly became clear that being a cloud-native business came with significant costs, not just in dollar terms but in flexibility of the underlying infrastructure as well.

Thus began one of the largest migrations in early webscale history, with Dropbox moving from AWS to its own datacenters, with the centerpiece being its own Magic Pocket software and hardware, optimized for exactly what Dropbox is known for: storage and quick retrieval of data.

Now, with four datacenters in the U.S. and continued popularity of its service, Dropbox is thinking about what lies ahead for storage infrastructure. Despite a world awash in exciting new options for boosting I/O and storage capacity, the name of the game at Dropbox is optimizing that on-prem infrastructure around three key goals, and surprisingly, cost isn't at the top of the list. As the senior director of platform strategy and operations at Dropbox, Ali Zafar, tells us, the most important drivers center around control and flexibility.

“Cost isn’t number one, but it’s in the top three. There are three things we look at when we make a decision about hardware: Is it more cost-efficient to stay on prem or use the public cloud? We are operating at a scale where customization is cost-efficient. Can we innovate faster than the public cloud? And is our workload unique compared to what the public cloud supports?” He adds that with their Magic Pocket system, “with the scale and economics of multi-exabyte operations it makes sense for us to invest resources in building on-prem now because the unit economics are awesome.” Further, he says that for some specialized work, including AI/ML, it makes more sense to use the public cloud. In other words, AWS is always a ready option, but Dropbox infrastructure teams are constantly weighing the options.

So far, on-prem infrastructure wins the cost and flexibility war, yet the cloud is always there for special projects or specific hybrid or international goals. So let's take a look at the hardware pieces that are central to this cost-efficient multi-exabyte storage story. Unlike at the hyperscalers, HPC sites, or large enterprises we cover, the hardware story isn't complicated, but that simplicity is where the story is. And if it shows anything at all, it's that good old-fashioned disk has seen enough innovation to boost capacity, performance, and reliability without driving up costs so much that it fails to remain a long-term contender.

Actually, disk is not necessarily “old-fashioned” these days. Zafar says that for the next five years, if not well beyond that, disk will be their standby, with no need for flash or other novel extensions, because of new trends in disk like SMR, EAMR, and HAMR, all of which are currently implemented or will be in production in the near term.

Right now, shingled magnetic recording (SMR) is the latest, greatest addition to Dropbox's exascale storage fleet. This innovation is key to far higher densities in disk, given the stacked, or “shingled,” overlaying of written tracks. Magic Pocket (primarily storage, with some compute and databases) stores all file content for users. Zafar says more than 50% of that file content is on SMR, and any new racks landing soon are also SMR-based. More specifically, Dropbox is rolling out 20TB SMR drives at scale. “The cost and performance and overall unit economics are phenomenal,” he says.
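The trade-off behind shingling is worth making concrete: because tracks overlap, data in a zone cannot be updated in place; host software must write each zone sequentially and reset a zone wholesale to rewrite it. A minimal sketch of that constraint, with the zone size and class names invented for illustration:

```python
# Toy model of the host-managed SMR write constraint: within a zone,
# only sequential appends at the write pointer are legal; changing old
# data means resetting and rewriting the whole zone.

class SMRZone:
    def __init__(self, capacity_blocks=256):
        self.capacity = capacity_blocks
        self.blocks = []            # append-only log of written blocks

    def append(self, block):
        """Sequential writes at the write pointer are the only legal writes."""
        if len(self.blocks) >= self.capacity:
            raise IOError("zone full: reset before rewriting")
        self.blocks.append(block)

    def reset(self):
        """No in-place updates; the zone is erased wholesale."""
        self.blocks.clear()

zone = SMRZone(capacity_blocks=4)
for b in ("a", "b", "c"):
    zone.append(b)
print(zone.blocks)   # ['a', 'b', 'c']
zone.reset()
print(zone.blocks)   # []
```

A largely write-once, read-many workload like bulk file storage maps naturally onto this append-only pattern, which is one reason SMR economics can work at this scale.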

In fact, when Zafar points to what makes Dropbox innovative on the infrastructure side, he points to their widespread use of SMR drives, in addition to a shift to AMD CPUs for some of their compute workloads, both of which he says have given them significant cost and capability boosts in recent months.

On the near horizon, other disk-based technologies designed to keep pushing up media density, like EAMR and HAMR, are key to the Dropbox Magic Pocket exabyte-plus storage strategy.

Energy-Assisted Magnetic Recording (EAMR) boosts density in a different way than changing how recorded tracks are layered. Companies like Western Digital have filled drives with helium to reduce drag from air, and as small a difference as that sounds, it allows tracks to be crammed so close together they can overlap. That closeness, however, means narrowed write fields and more opportunity for error. EAMR gets around this by focusing extra energy on the track being written, making the media easier to write in that single location; call it an energy boost of concentration. And that focus is razor-sharp, literally: the heat-assisted magnetic recording (HAMR) technology Zafar points to uses laser-driven heat to ensure precise, fast writing to media where every bit of density is being preserved and enhanced. Seagate's teams describe it best:

HAMR technology solves both these problems. HAMR uses a new kind of media magnetic technology on each disk that allows data bits to become smaller and more densely packed than ever, while remaining magnetically and thermally stable. Then, to write new data, a small laser diode attached to each recording head momentarily heats a tiny spot on the disk, which enables the recording head to flip the magnetic polarity of a single bit at a time, enabling data to be written. Each bit is heated and cools down in a nanosecond, so the HAMR laser has no impact at all on drive temperature, or on the temperature, stability, or reliability of the media overall.

With all of this in mind, we can assume Dropbox will keep climbing with disk innovations, as incremental as they are. Anyone who has followed compute for a time is familiar with the idea that the more density you try to create, the more pockets of resistance and bottlenecks there are. For disk, the most important innovations are ensuring functionality, reliability, and performance out of all that areal density before the media has to give up the ghost (and by then, who knows, whatever iteration of flash is next might be what Dropbox does).

And it’s not just Dropbox, by the way, that will be riding this same curve with disk until the end of days.

Many companies with majority on-prem installations are bullish on disk beyond the 2023 end point on the chart above because they don't have the same ultra-high performance needs that HPC sites and even big transactional websites or businesses do. Dropbox's very own business is a density story of its own: how to fit as many user files as possible with high enough performance that users aren't waiting minutes to get their files. And for Zafar, that is less about the performance of the media and more about having smart strategies for placing data in warm and cold stores. He gives a great example, one many of us can relate to Dropbox-wise (or with any similar service). It's tax season 2014. You're accessing that puppy four or five times in a week for a month. And then it goes to the land of forgotten tax returns, accessed perhaps once every two years, if that.
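The tax-return example amounts to an access-frequency placement policy: recently hot files stay on warm storage, long-idle ones migrate to cold. A toy sketch of such a rule, with the thresholds invented purely for illustration (Dropbox's actual policy is not public):

```python
from datetime import datetime, timedelta

def choose_tier(last_access: datetime, accesses_last_30d: int,
                now: datetime) -> str:
    """Hypothetical warm/cold placement rule based on access history."""
    if accesses_last_30d >= 4:                   # tax-season burst: keep warm
        return "warm"
    if now - last_access > timedelta(days=365):  # forgotten return: go cold
        return "cold"
    return "warm"

now = datetime(2015, 4, 1)
# Accessed five times this month -> warm
print(choose_tier(datetime(2015, 3, 28), 5, now))   # warm
# Untouched since tax season two years ago -> cold
print(choose_tier(datetime(2013, 4, 10), 0, now))   # cold
```

In practice a policy like this would also weigh file size and retrieval latency targets, but the core signal is the same: recency and frequency of access.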

That AI/ML work Dropbox is doing on the cloud can make this placement smarter, and all of it can lend years to the disk-based strategy the company intends to pursue.

“Even now, the difference between pricing on drives versus flash is still too much. Unless that starts decreasing significantly, the overall TCO is on the drives, it’s significantly better, we can achieve cost targets and it makes our infrastructure super competitive—and that is also why we have so much on-prem,” Zafar concludes.
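Zafar's TCO point can be made concrete with back-of-the-envelope arithmetic. Every figure below ($/TB, watts per TB, electricity price, lifespan) is an invented placeholder rather than a Dropbox or vendor number; the point is only the shape of the comparison:

```python
# Illustrative 5-year TCO per TB: acquisition cost plus energy cost.
# All inputs are assumptions for the sake of the example.

def tco_per_tb(price_per_tb, watts_per_tb, years, usd_per_kwh=0.10):
    # kWh consumed per TB over the drive's life, times electricity price
    energy = watts_per_tb / 1000 * 24 * 365 * years * usd_per_kwh
    return price_per_tb + energy

disk  = tco_per_tb(price_per_tb=15.0, watts_per_tb=0.4, years=5)
flash = tco_per_tb(price_per_tb=60.0, watts_per_tb=0.2, years=5)
print(f"disk  5yr TCO/TB ~ ${disk:.2f}")
print(f"flash 5yr TCO/TB ~ ${flash:.2f}")
```

With any plausible inputs in this range, the acquisition-price gap dominates: flash's lower power draw cannot close a 3-4x difference in $/TB, which is the dynamic Zafar describes.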

For the wider strategy of overall infrastructure, of which disk is just one part, Zafar says that seeing what lies ahead in the decade is difficult across the compute, database, and Hadoop parts of their workloads. “We will continue to have our own custom infrastructure with Magic Pocket. The areal density growth from the market is well into the 50TB range and above. If we can do that, the TCO for us will be phenomenal,” Zafar says, adding, “the price point at which we can serve and store data is still going to be top of mind.”
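The 50TB figure matters because capacity per drive translates directly into fleet size. A quick sketch using the article's 20TB drives and the projected 50TB class, treating one exabyte as 1,000,000 TB and ignoring replication and erasure-coding overhead for simplicity:

```python
import math

def drives_per_exabyte(drive_tb, replication=1.0):
    """Raw drive count to hold 1 EB (1,000,000 TB) at a given capacity."""
    return math.ceil(1_000_000 * replication / drive_tb)

for cap in (20, 50):
    n = drives_per_exabyte(cap)
    print(f"{cap}TB drives: {n:,} drives per exabyte")
# 20TB drives: 50,000 drives per exabyte
# 50TB drives: 20,000 drives per exabyte
```

Sixty percent fewer spindles per exabyte means proportionally fewer racks, servers, and watts, which is why areal density growth alone can transform the TCO picture.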
