Unless you work at an IT-sophisticated HPC or enterprise shop with a long-term view on high performance storage, you probably understand well the “just in time” nature of storage procurement. With the growth in management requirements, especially in 2021, how do you retain all the investment in widely varied storage hardware without making an even bigger mess out of its management?
We are going to coin a new term here: cobbled storage.
This is an accurate descriptor for many environments that started small and have had to cobble together bits of on-prem storage infrastructure over the years out of fear of running out. Often, this was not acquired with long-term plans in mind, but rather to meet research or enterprise demands. This means there are multiple storage media, different vendors, varied interfaces and file systems. In short, it’s a wreck, even if it works.
While we are on the topic, take a look at this piece on the life sciences industry and what its challenges are for growth during the time of COVID and the research impetus that comes with it. While storage is just one part of this, it highlights when cobbled storage starts to fall apart. It turns out, that time is now.
The problem with cobbled storage is that eventually, when enough complexity is added in data and applications, management becomes a mess. And when management falls apart, so do performance, efficiency, and so on. These are reasons why we’ve seen end users of on-prem storage gear break down and look to some of the start-up NVMe vendors or to hardware that does sophisticated tiering, caching, and so forth.
Let’s consider this in the form of an example and look at what one organization did first, what it had to buy next, and what the end result was, for some context.
The storage evolution at the non-profit La Jolla Institute for Immunology (LJI) is similar to that of other scientific labs around the world. It began small, with more back office drivers than HPC, and grew swiftly in patchwork fashion over time with different storage media, capacities, tiers, and vendors. The arrival of genome sequencing machines and massive cryo-electron microscopes was the proverbial straw breaking the back of ho-hum storage infrastructure, and that push has only become more intense with time.
For a center like LJI, the problem with all of this is that its storage has slowly been cobbled together as new requirements emerged. While this has been useful from an evolutionary perspective and has allowed LJI to get around huge CAPEX for projected capacity storage, it means that getting new approaches to work well requires flexibility on the part of any grand-sweep approach to organize, categorize, and prioritize future storage systems.
In just the last few years, LJI has gone from relatively stable storage requirements to handling many petabytes of data and making all of it usable, discoverable, and searchable. This has meant getting creative with categorization and metadata tagging, but it also means storage performance and flexibility become more important. Further, the La Jolla Institute has a massive database for serving scientific data, and teams wanted fast response times. This, says Michael Scarpelli, Director of IT at the La Jolla Institute, meant they had to get creative and find a balance of cost, performance, and a flexible approach that could let them keep swapping in multiple vendors’ storage devices and have it all just work.
LJI also has a proprietary database that serves mission-critical science, but it was dragging under the data weight. “We needed something that would work with whatever storage we threw at it and let us do fast writes and transactions no matter what storage it was running with. Researchers demand speed and we wanted to be sure that if we were adding a layer of complexity and intelligence to the file server it wouldn’t slow them down or make them jump ship to something slower but more available,” Scarpelli says.
“Sequence data, a new Cryo-EM microscope and all of the infrastructure around that, are what drives our storage infrastructure now. The microscope alone is a beast. It generates 1-2 TB raw image data each day and runs 24/7. There’s a ton of data that comes off the machine and we don’t want to just sit on it, we have to get it into the HPC environment where now we also have a lot more GPU growth to handle analyzing the new image data,” Scarpelli adds.
LJI decided on Excelero NVMesh storage, which is based on the company’s elastic NVMe approach. This led to a 10X improvement in storage performance for the all-important database, with an average 6.8X speedup across other jobs. Given the team’s growth in GPUs for the Cryo-EM workloads, the ability to fit larger datasets on those GPU servers, and the improvements over the mixed bag of existing flash and disk, the move makes sense technology-wise. But it’s really about being able to continue with the ad hoc approach to adding storage infrastructure, no matter what vendor is providing it.
“We don’t tend to look at other architectures for inspiration necessarily, we make decisions on what to buy next based on the problems at hand. One of the researchers has a sudden rise in need for storage or compute based on a tsunami from the sequencers and we do what fits.” With this ad hoc approach to buying storage from multiple vendors with different performance and capacity profiles, it was the flexibility that sealed the deal for Scarpelli. “There are plenty of great intelligent storage products on the market but many are very locked together with hardware and software or the licensing is done by capacity. There was always something with each one of them. With Excelero we got the flexibility.”
Of course, cost is an issue, especially when NVMe is on the table. The La Jolla Institute has operated in a CAPEX world for thirty years, Scarpelli says, and a shift to cloud or OPEX isn’t in its foreseeable future. “We see a lot of things and say, that would be awesome, but we’re a non-profit research organization, we have enterprise needs but not an enterprise budget. We have to walk away from a lot of things we like because they don’t fit into our budget or if they do, they don’t match the scale we need.”
Excelero just happened to be the winner on this particular deal, but this kind of approach to storage, with smart layers and built-in intelligence, is what’s keeping a number of other NVMe-oriented storage players afloat. While companies like this might want to take on the hyperscale datacenters and HPC sites of the world, the real bread and butter for the market might be the small- to medium-sized enterprises and research organizations like LJI, which inadvertently created a mess over the years (albeit a functional one) and now need a “super-sweeper” to make the best use of it and provide the intelligence for fast searching, writing, and serving.