How Machine Learning Will Improve Servers
February 28, 2015 Timothy Prickett Morgan
In an ideal world, you would not buy a server unless you knew how all of the components that comprise it would work together for several years supporting your specific applications. But you can’t know this ahead of time, and you can’t even know how the machines you have compare to similar machines used by your datacenter peers.
Providing such fine-grained data about servers is one of the goals of Coolan, a company just out of stealth mode that was founded by some of the top infrastructure people at Facebook. Coolan also wants to anonymize the data it gathers from customers and share it across them and with the industry at large to help IT vendors and those who make their own gear build better machinery and to assist customers in finding that better machinery and configuring it properly to run their workloads.
This is precisely the approach that you would expect from Amir Michael, who spent nearly six years as a hardware engineer at Google before moving to Facebook five years ago to take on the role of hardware and datacenter engineer. He was instrumental in creating the server designs for Facebook’s Prineville, Oregon datacenter, which were the founding kernels of the Open Compute Project. And as Michael tells The Next Platform, the ideas of sharing and openness embodied in that project to make the IT industry better pervade Coolan as well.
“A lot of our principles come from Open Compute and Facebook, with the idea that through sharing, infrastructure should be able to become better,” Michael explains. “With Facebook FBOSS and OCP and all of the projects that are in that, the focus is all in the design of the infrastructure, how you actually get it to run very efficiently and very fast. With Coolan, once you have already designed something efficiently, we are focusing on how to run it efficiently. You can buy a really fast car, but if you don’t know how to drive it, then there really is no point in that.”
That means, among other things, looking at thousands of data points collected from inside servers. For example, Coolan will monitor the temperatures for components as they run and do correlations between those temperatures and component failures to see what the optimal temperatures are. Through data extracted from the Coolan agent software that resides on servers, the service will be able to make recommendations for kernel and driver levels, or system configurations for running specific applications.
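Coolan has not published its analysis code, but the temperature-to-failure correlation described above is straightforward to illustrate. The sketch below buckets hypothetical component observations into temperature bands and computes an observed failure rate per band; the function name, band width, and sample data are all invented for illustration, not Coolan's actual method:

```python
from collections import defaultdict

def failure_rate_by_temp_band(samples, band_width=5):
    """Bucket (temp_celsius, failed) observations into temperature bands
    and compute the observed failure rate in each band."""
    counts = defaultdict(lambda: [0, 0])  # band -> [failures, total]
    for temp, failed in samples:
        band = int(temp // band_width) * band_width
        counts[band][0] += 1 if failed else 0
        counts[band][1] += 1
    return {band: failures / total
            for band, (failures, total) in sorted(counts.items())}

# Hypothetical observations: (drive temperature in C, failed within window?)
samples = [(38, False), (39, False), (41, False), (44, True),
           (46, True), (47, False), (52, True), (53, True)]
rates = failure_rate_by_temp_band(samples)
```

With enough real samples, a curve like this is what would let a service recommend an "optimal" operating temperature range.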
One of the key aspects of the Coolan service is the agent that is sitting on servers, gathering up all of the telemetry from the hardware, operating system, system software, and applications. This agent is not, like many system management agents, closed source and running as an executable. Michael says that Coolan is purposefully writing the agent in Python, an interpreted language, and that by open sourcing the code for this agent anyone and everyone can inspect precisely what it is doing on their machines and audit its behavior. “It is totally unnoticeable to the server as far as performance degradation is concerned,” Michael says, addressing another concern that IT managers have about any system management agents running on their systems. “Moreover, hardware doesn’t change that often, so it is less invasive and heavy than a traditional data collector might be.”
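The agent itself is not public at the time of writing, but a stripped-down Python collector in the spirit described here might look like the following: a set of per-sensor reader callables feeding one timestamped record, with a failed reader recorded as None so one bad sensor never breaks the whole sample. All names and the sensor path are illustrative assumptions, not Coolan's code:

```python
import json
import platform
import time

def collect_telemetry(readers):
    """Run each named reader callable and gather its value into a single
    timestamped record; a reader that raises is recorded as None."""
    record = {"host": platform.node(), "ts": time.time()}
    for name, reader in readers.items():
        try:
            record[name] = reader()
        except Exception:
            record[name] = None
    return record

def read_cpu_temp_c(path="/sys/class/thermal/thermal_zone0/temp"):
    # Many Linux kernels expose a CPU temperature here, in millidegrees C.
    with open(path) as f:
        return int(f.read().strip()) / 1000.0

record = collect_telemetry({"cpu_temp_c": read_cpu_temp_c})
payload = json.dumps(record)  # the kind of record an agent might ship upstream
```

Because the loop is a handful of file reads on a slow cadence, a collector like this would indeed be unnoticeable next to real workloads.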
With the Coolan service, the collection of the data is not the secret sauce, but rather the aggregation of that data within a large installation and, more importantly, across multiple installations. The wider the dataset for servers, their components, and the system and application software that is affected by their behavior, the better correlations and predictions that Coolan can make on behalf of its customers.
“If you think about many of the other solutions that are out there, they are static, they are not learning, and they are not based on community datasets,” Michael says. “A lot of the value we have is derived from having a larger sample size and doing things a little bit smarter with machine learning to predict failures, to more accurately do root cause failure analysis, to do alerts when you have potential system conflicts, and to bubble up trends in the infrastructure that you were previously unaware of because without these algorithms you wouldn’t be able to see them.”
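Coolan has not detailed its algorithms, but the larger-sample-size intuition is easy to demonstrate: even a simple nearest-neighbor vote over labelled telemetry from many machines can flag an at-risk server, and the vote only gets sharper as the community dataset grows. Everything below, including the features, labels, and the k-NN choice, is a hypothetical stand-in for whatever Coolan actually does:

```python
import math

def knn_predict_failure(history, candidate, k=3):
    """Predict whether a server with telemetry vector `candidate` is likely
    to fail, by majority vote over the k nearest labelled examples.
    history: list of (feature_vector, failed_bool) pairs."""
    nearest = sorted(history, key=lambda item: math.dist(item[0], candidate))[:k]
    votes = sum(1 for _, failed in nearest if failed)
    return votes > k // 2

# Hypothetical features: (average temp in C, correctable ECC errors per day)
history = [((40, 0), False), ((42, 1), False), ((41, 0), False),
           ((55, 30), True), ((58, 45), True), ((60, 20), True)]
at_risk = knn_predict_failure(history, (57, 35))
```

A real service would use far richer features and models, but the principle is the same: the prediction is only as good as the breadth of labelled failures it can draw on.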
Still, if companies do not feel like sharing data with others and just want to use Coolan to examine their own infrastructure, Michael says that they will still be able to get value out of the service. But increasing the sample size is important to make more accurate predictions. However, says Michael, without being terribly specific, the sample size needed to get meaningful results out of the Coolan service is not as large as you might expect because so many servers are based on common CPU, memory, disk, flash, networking, and other components. Some customers have fairly homogeneous components in their server infrastructure, and they can get useful information immediately. Those with more diverse server infrastructure need community data to get a big enough sample size for their components to make meaningful correlations and predictions.
What the Coolan agent does not do is look at the data stored on a server – Coolan has no interest in what is stored on a particular machine. And when it looks at applications, it will be tracking things like how many I/O operations per second they consume over time on disk, flash, or the network. The tool is not tracking the data that the applications are crunching, but rather how the applications are treading on the hardware. Moreover, the Coolan agent is set up in such a way that customers can choose which kinds of data float up to the central Coolan systems to be shared and which are blocked. It seems reasonable that Coolan will offer a private version of the tool for companies that don’t want their data uploaded because some companies – government agencies and financial services firms – are restrictive about the data they share and any systems that are sending out telemetry. This cuts against the whole openness theme of sharing data that Michael and his co-founders espouse, but that’s business. Some share, some don’t.
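The article does not describe the blocking mechanism, but an allow-list filter is one plausible way an agent could guarantee that only approved telemetry fields ever leave the machine: anything not explicitly named, such as hostnames, file paths, or application data, is dropped before upload. The field names here are invented for illustration:

```python
# Hypothetical allow-list of telemetry fields cleared for upload.
ALLOWED_FIELDS = {"cpu_temp_c", "disk_iops", "ecc_errors", "kernel_version"}

def filter_for_upload(record, allowed=ALLOWED_FIELDS):
    """Keep only explicitly allow-listed fields from a telemetry record;
    everything else is silently dropped before it leaves the server."""
    return {k: v for k, v in record.items() if k in allowed}

record = {"cpu_temp_c": 48.5, "hostname": "db01.internal",
          "disk_iops": 1200, "mounted_paths": ["/data"]}
upload = filter_for_upload(record)
```

The design choice matters for auditability: with an open source agent, a customer can read exactly which fields are on the list before deciding to share.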
The Coolan server management tool is currently running on the Amazon Web Services cloud, and the back-end system is designed for long-term storage of all of the data gathered from the servers under management. The intent, as with many data analytics projects, is to never throw any metrics gathered from the servers away because you never know what kinds of correlations you might want to make in the future. So Coolan could end up with a pretty hefty S3 storage bill from AWS at some point. Coolan is not divulging precisely how the service will work, but Michael says generally that there will be a dashboard for immediate trends in the server infrastructure as well as long-term analysis based on a fuller set of time series data relating to the machinery under its watchful eye.
While the immediate goal of the Coolan service is to help companies run their server infrastructure better, the larger goal will perhaps be a little disruptive to server makers and their supply chain partners. That is because Coolan is going to release reliability data about servers and their components based on the datasets it gathers from customers. Considering that Michael was instrumental in the launch of the Open Compute Project, this will not be the first time that he will have had a bull’s eye painted over his picture (metaphorically speaking) by many in the commercial server community.
“If you think about the hardware industry today, it is fairly opaque as far as this information goes and as for people making purchasing decisions. A lot of the decisions, debugging, and optimizations are really driven by the vendors, and this is our opportunity to add a lot more transparency to that industry. Our ultimate goal is to have infrastructure that is more stable, more reliable, and the vendors have to be part of that conversation.”
The Coolan service is in a private beta now with a handful of customers, whose infrastructure ranges from 100 to 1,000 servers and who are operating in a wide variety of industries. Michael says that eventually there will be a public beta, but only after the private beta customers give feedback on features and other aspects of the service so it can be refined. He anticipates that the Coolan service will be generally available to commercial customers within 9 to 12 months, and as for pricing, nothing has been decided as yet, but a per-server subscription model seems likely. The Coolan agent currently works on bare-metal servers running Linux, and at the moment it has only been tested on X86 hardware, but as long as Python runs on Linux, it doesn’t matter if it is running on an ARM or Power or any other processor. Over time, if customers demand it, Coolan will port the agent to Windows running on X86 hardware and perhaps to Unix systems, too.
The Coolan team has lots of expertise in server design and systems management, and from a lot of different areas. Michael is CEO at Coolan, and his brother, Yoni, who worked at cloud medical records service Practice Fusion, is also a co-founder. So is John Heiliger, who was vice president of infrastructure and technical operations at Facebook from 2007 through 2011, when a team of around 400 people created and ran the Facebook infrastructure as the site grew from 35 million to 800 million users. Interestingly, Heiliger did a stint at North Bridge Venture Partners in 2012, and North Bridge, along with Social+Capital and Keshif Ventures, were the three investors in the seed round of funding that Coolan took down in February 2014. Eduardo Pinheiro, an engineer from Google who co-wrote two important papers on failure rates for memory and disk drives on the search engine giant’s own infrastructure, is an advisor to Coolan. Other advisors include Jimmy Clidaris, who leads datacenter and platform infrastructure at Google, and Paul Santinelli, a partner at North Bridge who used to run the system patching service at commercial Linux distributor Red Hat a decade ago.