Specialized Supercomputing Cloud Turns Eye to Machine Learning
August 23, 2016 Nicole Hemsoth
Back in 2010, when the term “cloud computing” was still laden with peril and mystery for many users in enterprise and high performance computing, HPC cloud startup Nimbix stepped out to tackle that perceived risk for some of the most challenging, latency-sensitive applications.
At the time, there were only a handful of small companies catering to the needs of high performance computing applications and those that existed were developing clever middleware to hook into AWS infrastructure. There were a few companies offering true “HPC as a service” (distinct datacenters designed to fit such workloads that could be accessed via a web interface or APIs) but many of those have gone relatively quiet over the last couple of years.
When Nimbix got its start, the possibility of running HPC workloads in the cloud was the subject of great debate in the academic-dominated scientific computing realm. As mentioned above, concerns about latency for these performance-conscious applications loomed large, as did the more general concerns about the cost of moving data, the remote hardware capability for running demanding jobs, and the availability of notoriously expensive licenses from HPC ISVs.
While Amazon and its competitors plugged away at the licensing problem, they were still missing the hardware and middleware specialization needed to make HPC in the cloud truly possible, even though AWS tried early on to address this by adding 10 Gigabit Ethernet and multicore CPU options (and later, lower-end Nvidia GRID GPUs). In those early days, this difficulty is what fueled the rise of other HPC cloud startups like Cycle Computing, which made running complex jobs on AWS more seamless. The other way to tackle the problem was simply to build both the hardware and software and wrap it neatly in a cloud operating system that could orchestrate HPC workflows with those needs in mind.
This is the approach Nimbix took, and they quickly set about adding unique hardware in addition to building their JARVICE cloud operating system and orchestration layer, which is not entirely unlike OpenStack. The custom-built JARVICE platform sits on top of Linux, which allows it to run on the heterogeneous collection of hardware that sits in a distributed set of datacenters in the Dallas metro area (with more planned soon, including in Europe and Asia). JARVICE manages the clusters and workflows, assigns resources, and manages the containers that power user applications.
Leo Reiter, CTO at Nimbix, tells The Next Platform their typical users fall into two categories. On the one hand there is the bread and butter simulation customer who uses the many solvers and applications in the Nimbix library of scientific and technical computing applications, for which Nimbix has license agreements. These users provide the data and performance parameters, and the system orchestrates the workflows using JARVICE and its container approach to application delivery. Counted in this group are other users with high performance data analysis or machine learning needs. On the other end are their developer users, who can use Nimbix as a PaaS to deliver their own workflows or applications and place those in the public or private catalog. Of course, to do all of this with high performance and scalability means the Nimbix folks had to give some serious thought to hardware infrastructure.
Nimbix has been providing Xilinx FPGAs in their cloud since 2010 for researchers and the Xilinx development team, but they also have a wide range of Nvidia GPUs, from the low end Maxwell-based Titan X parts (for machine learning training) to the new M40 processors for deep learning, all the way up to the Nvidia Tesla K80 cards for those with high performance simulations and analytics. Much of the processor environment consists of 16-core Haswell parts, from which they can create secure, fractional nodes as needed (making a 16-core part look like a 4-core part with the necessary memory apportionment, etc.). They also use InfiniBand for all nodes and for their storage system. So far, their cloud compares only to some elements Microsoft has integrated (Microsoft now has some K80s and InfiniBand capabilities) but overall, Reiter says, they are succeeding because no other cloud provider is making hardware investments to quite the same degree. He points to the fact that there are GPUs on AWS, but the GRID parts aren’t meaty enough to handle the seismic, bioinformatics, engineering and other HPC-oriented workflows, and even for deep learning training they are insufficient for Nimbix users.
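To make the fractional node idea concrete, here is a minimal sketch of the arithmetic involved, assuming memory is apportioned in proportion to the cores requested. This is an illustration only, not how JARVICE actually does it: the `Node` type, the `carve_fraction` helper, and the 128 GB memory figure are all hypothetical and do not come from Nimbix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A (physical or fractional) node described by cores and memory."""
    cores: int
    memory_gb: int


def carve_fraction(node: Node, cores_wanted: int) -> Node:
    """Return a fractional node with memory proportional to its core share.

    Hypothetical helper: proportional apportionment is an assumption,
    not a documented Nimbix or JARVICE policy.
    """
    if cores_wanted <= 0 or cores_wanted > node.cores:
        raise ValueError("cores_wanted must be between 1 and node.cores")
    # Integer division keeps the memory share a whole number of GB.
    memory_share = node.memory_gb * cores_wanted // node.cores
    return Node(cores=cores_wanted, memory_gb=memory_share)


# A 16-core part, as in the article; 128 GB is an assumed figure.
physical = Node(cores=16, memory_gb=128)
fractional = carve_fraction(physical, 4)
print(fractional)
```

Under these assumptions, a 4-core slice of the 16-core part receives a quarter of the node's memory. A real scheduler would also have to enforce isolation between slices, which this sketch does not attempt.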
What is interesting here is that just as companies that have specialized in HPC hardware are finding their gear is a good fit for deep learning training and broader machine learning applications, so too is Nimbix finding a potential new path. They have managed to carve out a niche in supercomputing and a few other areas, but so far, there aren’t a lot of robust, tuned high performance hardware options as a service that fit the machine learning bill. We noted that Nervana Systems (recently acquired by Intel) is doing this, and there are a few others offering deep learning as a service, but a company that HPC users already know might be very well positioned as deep learning and HPC merge in some application areas and require a remote sandbox or eventual production environment.
Reiter says they are seeing more interest in deep learning and machine learning and have added robustness to their software stack with hooks for TensorFlow, Torch, and other frameworks. Since they already have the heterogeneous hardware on site and a proven business model behind them, we could see Nimbix move from being a quiet company in the research regions of HPC into greater visibility via a new crop of machine learning applications and end users.