Lessons Learned, And The Year Ahead
January 4, 2016 Timothy Prickett Morgan, Nicole Hemsoth - Co-Editors
The high end of the computing industry has always captivated us, and we still find the forces at work in the upper echelons of the datacenters of the world, and the hardware and software that is created to run the largest and most complex workloads found there, fascinating.
It is that fascination that culminated in the founding of The Next Platform early last year, an embodiment of our belief that technologies developed in one area, such as high performance computing or hyperscale datacenters, would be adopted elsewhere, such as in large enterprises.
Such transformations are not instantaneous or easy, and not all technologies are adopted directly by all companies in their own datacenters. More and more, we think, the difficult technologies could end up being supplied as services, and only the largest organizations with a deep technical bench will be on the cutting edge, shaping those technologies for their own use and eventually helping make them more usable for the rest of the IT community. This has happened time and again with various open source infrastructure software, from Linux to Hadoop, and we do not think this will change. But the wide availability of economical cloud computing does change where companies might think about deploying new applications and therefore where they house their datasets, and this is a relatively new phenomenon.
In our years of working together, following the trends in high performance computing, hyperscale, cloud, and enterprise, we have learned a few lessons and, as the new year starts, we want to set these down before you to not only share our thinking, but to give us focus in what we do and how we do it. This is also an open invitation to readers to share their own ideas about the changes they see happening in their datacenters. A publication is an extended conversation, not a series of lectures. Our observations are not presented in any particular order of importance, so don’t read too much into it. We are just starting at the bottom and working our way up through the stack.
The Give And Take Of Hardware
It is no coincidence that Intel made its $16.7 billion acquisition of FPGA maker Altera just as GPU maker Nvidia has been able to cultivate a fairly large compute business out of its server-grade Tesla compute engines. Intel has a GPU accelerator of sorts for compute with the “Knights Landing” Xeon Phi processors, due later this year, and it could start offloading work to hybrid Xeons packed with integrated HD graphics GPUs if it wanted to do so as well.
It is ironic that just as the datacenter has become about as homogeneous as it has ever been with the ubiquity of Intel’s Xeon processors, this is the time that hybrid computing mixing all kinds of compute elements together seems to be coming to the fore. The reason is simple enough, but the long-term outcome is far from clear. The relative homogeneity of the Xeon platform has enabled organizations to create large pools of similar infrastructure that can be repurposed to support many different kinds of workloads, and this in turn has spurred the use of full-blown virtualization and now containers as a means of managing those workloads and driving up utilization.
It will be hard to walk away from this common computing substrate, but the future might be more heterogeneous than we have become accustomed to. It all depends on how difficult it will be to integrate different programming elements with disparate storage and networking elements, and how specific hardware designs deliver performance gains at a price. We expect storage to move closer and closer to compute, as is happening, for instance, with flash and other non-volatile memories, and compute to be embedded more and more in networking and storage. The movement of raw data is the big killer.
As far as hybrid computing goes, the question everyone has to ask is whether any performance or price/performance increases over running workloads on generic Xeon machines will be sufficient to warrant bringing different iron into the datacenter and subjecting programmers to a more complex programming environment. Some organizations need whatever performance boost they can get, as soon as they can get it, while others will just wait out Moore’s Law for 18 or 24 or perhaps now 30 months and make do until then.
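To put rough numbers on the wait-it-out option, here is a minimal sketch under one loud assumption: that generic server performance doubles at a fixed cadence (the 18, 24, or 30 months mentioned above). The function name and the example speedups of 2X and 4X are ours, chosen purely for illustration, and the model ignores price, power, and the cost of the more complex programming environment.

```python
import math


def months_to_match(speedup: float, doubling_months: float) -> float:
    """Months until generic hardware catches up to a given accelerator speedup.

    Assumes (hypothetically) that generic performance doubles every
    `doubling_months`, so performance after t months is 2 ** (t / doubling_months).
    Solving 2 ** (t / doubling_months) == speedup for t gives the result.
    """
    if speedup <= 0 or doubling_months <= 0:
        raise ValueError("speedup and doubling_months must be positive")
    return doubling_months * math.log2(speedup)


if __name__ == "__main__":
    # Illustrative speedups only; these are not measured figures.
    for doubling in (18, 24, 30):
        for speedup in (2.0, 4.0):
            wait = months_to_match(speedup, doubling)
            print(f"doubling every {doubling} months, {speedup:.0f}X speedup: "
                  f"wait about {wait:.0f} months")
```

Under these assumptions, the slower the doubling cadence, the longer the wait for the same speedup, which is one reason a stretch toward 30 months would make different iron look more attractive.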
Having said all that, we have new “Broadwell” Xeon E5 processors from Intel, the “Zen” Opterons from AMD, and the Power8+ chips from IBM coming down the pike, along with the “Pascal” GPUs from Nvidia in addition to the Knights Landing parallel chips from Intel. It won’t be boring on the systems front.
Scale Matters, But Not In The Way You Might Think
If it were not for Moore’s Law improvements in the processing capacity of server chips and in the storage capacity of disk drives and, now, flash chips, companies that had implemented distributed analytics technologies like Hadoop would be running ever-larger clusters to store and chew on their data. But luckily, so far at least, Moore’s Law has more or less kept pace such that the largest Hadoop clusters in the commercial world (meaning not counting the hyperscalers like Yahoo and Facebook) are able to keep their footprints well under 1,000 or 2,000 nodes, and most enterprises have something on the order of dozens to hundreds of nodes.
Scale means different things to different organizations. A complex business like a manufacturer or a bank has thousands of workloads, all of which might run on a large set of machines but which would be utterly dwarfed by the number of systems of a Google or a Yahoo or a Facebook. Most enterprise customers wrestle with the diversity of their platforms, including the operating system, the middleware and databases, and the applications, not with the number of machines. Google, by contrast, has a lot of machinery running a more modest number of workloads and it is hell bent on driving up utilization and therefore driving down its hardware costs. The cloud providers, interestingly enough, split the difference between the two, having fairly large (or in the case of Amazon Web Services, very large) infrastructure that has to support an extreme diversity of workloads. Most enterprises have to make their infrastructure look more like AWS rather than shoot for the purity of Google’s hyperscale, and it is not surprising that an increasing number of companies want to deploy at least some of their applications on cloudy infrastructure outside of their own datacenters.
It would be interesting to come up with a scale metric that married diversity of workloads and diversity of platform types along with scale of systems. Who has it easier, General Motors or Google? It just might be Google.
Data Big Enough For Machine Learning
Even with the scale and increasing heterogeneity of systems, new innovations at both hyperscale and general enterprise are adding a new, albeit more complex, layer to the stack, particularly for analytical workloads. While we tend to avoid the catch-all “big data” phrase here whenever possible, as we’ve noted throughout our first year, machine learning is the next evolutionary phase for data that, just a few years ago, might have been kicked to Hadoop and analytics platforms. As we noted above, many of the expensive, complex developments that are too hefty for most enterprise IT shops to take on tend to be developed at hyperscale companies like Google and Facebook before moving into wider adoption elsewhere. A good, popular example of this is Google’s recently released TensorFlow technology, which will be pushed out for wider development.
What is most interesting for us on the machine learning front, and what we tried to capture through a number of pieces in 2015, is that the machine learning story is far more complex from a hardware point of view than it might otherwise seem. Intel and other chipmakers, accelerator makers like Nvidia with its new line of machine learning-focused GPUs, and, more recently, FPGA makers Xilinx and Altera (now part of Intel) all see a prime spot for their devices in accelerating deep learning and machine learning workloads. The overall trend of heterogeneous systems, which was being driven by its own forces, will find a new stream of choices to snap in as both accelerators and processors.
For many of these large-scale machine learning workloads at Google, Facebook, Baidu, and other companies, GPU computing remains at the top of the list for the computationally intensive training task. But for the equally important execution side of such workloads, there appears to be a war of words (little else as of yet) over what the best platform is for delivering the end result. As the year unfolds, we expect to have a better sense of whether that will be low-power CPUs as stand-alone entities, or aided with FPGAs or energy efficient GPUs.
In short, what we’re watching on the machine learning and deep learning fronts this year will be, in equal measure, software and algorithm development matched with a keen eye on how the hardware environment will stack together. If this is indeed the next phase of the big data phenomenon that reshaped how social giants, enterprises, and research centers thought about collecting, storing, and processing data, then having a smart way to do all three will be at the top of many IT wishlists in 2016. All of this, of course, is aided by developments that have taken place in another segment of the market: high performance computing.
HPC Drives Deeper Into The Enterprise
This is a theme we have been hearing for the past several years from Cray, SGI, and other makers of parallel systems originally intended for giant simulation and modeling workloads largely run by government agencies, national labs, and academic institutions. Having had some success in getting enterprise customers to adopt their technologies, these HPC players will be architecting their future machines in such a way as to help them preserve their traditional HPC footprints as well as expand more into the enterprise space.
The details of these future machines remain unclear, but this is a common theme, as is the idea that data analytics and simulation and modeling workloads need to be converged on a single set of systems with more tightly coupled storage and networks that allow these workloads to operate together, not as separate silos. What is most interesting here is that while we have indeed been hearing about this shift from research and academic HPC to a far larger enterprise HPC set, it has taken some time for a new wealth of use cases exhibiting it to push into view. While we know about the standard commercial HPC segments, including oil and gas, financial services, and life sciences, the culmination of the trends discussed above is set, at least from our vantage point, to fan the flames of more numerous, and more interesting, examples of HPC in places one might not expect. This includes machine learning, by the way, which, if we are correct, is where the next big wave of both software and hardware (platform) innovation will take place.
While this was not necessarily meant to be a prediction piece, at the start of a new year we cannot help but look backward, adjust our course accordingly, and share our insights about what is ahead. This gives you an early view into the themes you might expect for 2016 and, for us, offers a chance to survey the landscape more generally. We look forward to the coming year, and in case you missed our hasty pre-holiday message, we appreciate your thoughts and ideas on these and other topics. Back to work we go!