During his sabbatical at Twitter, UC Davis professor John Owens, a GPU researcher and graph analytics specialist, made a list of the elements the social network might want to consider as it built out its recommendation and other real-time services.
His work on GPUs for supercomputing meant he carried quite a few performance lessons to Twitter, but when he sat down with the teams there, it quickly became apparent that their thinking about how accelerators could be used was not as simple as taking advantage of the speedups. It was more about delivering sub-second responses to beat the CPU times on questions like, “who else might this user be interested in following?” And for Owens, and for Twitter, that meant rethinking how GPUs can be useful for the problems that social media and web-scale giants are tackling.
“If you look at a problem like who to follow, which is relevant for Facebook and Twitter, among others, the goal is to compute that really quickly. If that’s something we can do well on a GPU in half a second, but that takes five seconds on a CPU, there’s a real value proposition for them there,” Owens explains. However, even if there is a clear benefit to this response time, there are much larger barriers that keep GPUs from invading the datacenters of the social giants, aside from their placement as small, dedicated clusters that are purpose-built for one particular problem.
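The “who to follow” computation Owens describes is, at its heart, a neighborhood query on the follower graph. As a rough illustration only (a toy sketch, not Twitter’s actual algorithm; the graph data and the `who_to_follow` function are invented for this example), ranking candidates by how many of a user’s followees also follow them might look like this:

```python
from collections import Counter

def who_to_follow(follows, user, k=2):
    """Rank accounts the user does not yet follow by how many of the
    user's followees also follow them (common-neighbor counting).
    `follows` maps each user to the set of accounts they follow."""
    scores = Counter()
    for friend in follows.get(user, set()):
        for candidate in follows.get(friend, set()):
            if candidate != user and candidate not in follows[user]:
                scores[candidate] += 1
    return [c for c, _ in scores.most_common(k)]

# Toy follower graph (hypothetical data).
follows = {
    "alice": {"bob", "carol"},
    "bob":   {"carol", "dave"},
    "carol": {"dave", "erin"},
    "dave":  {"erin"},
}
print(who_to_follow(follows, "alice"))  # ['dave', 'erin']
```

At web scale the same query walks a graph with hundreds of millions of vertices under a strict latency budget, which is exactly where the GPU’s half-second-versus-five-seconds gap matters.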
This barrier was at the edge of a conversation The Next Platform had earlier this month with Facebook Research lead, Yann LeCun, about the future of GPUs for deep neural networks. At hyperscale companies in particular, LeCun revealed that while there has been a great deal of research into where GPUs provide benefit, these workloads are not put into production for user-focused services. Rather, the GPU clusters are being used to train the complex models, which are then passed over to run on the CPU-only commodity systems that Facebook, Twitter, Google, and others rely on for their core services.
There are complex, almost insurmountable, reasons why the GPU is being reserved for non-production clusters and mere model training. At the top of the list is a problem that any custom piece of hardware will face as more of the specialization moves into software. Quite simply, as Owens argues, there is a firmly entrenched view that the only way to manage and operate at scale, for both hardware and software maintainability, is to have commodity-based clusters that are completely uniform. This makes it simple to update, add new code, and manage from a hardware standpoint. If performance benefits are left on the table by using just CPUs, then the accelerator companies need to work harder to make sure that GPUs are amenable to all the open source software frameworks designed for deep learning, machine learning, and other graph-like work. And for that to happen, one company, in this case Nvidia, has to be the force behind all that code work. It is very, very hard work.
The question is whether Nvidia could put that kind of resource push into its open source hooks, and even if so, the value (that is, how many web-scale companies would buy GPUs to put into massive-scale production systems) would have to be clear. Interestingly, deep learning and similar workloads were the target at this year’s Nvidia GPU Technology Conference, with a great many use cases highlighting where these fit into the grand-scale datacenters of Baidu and others for image recognition and other work.
But as Owens tells The Next Platform, “Deep learning is not necessarily a true graph problem because the graphs there are very structured. There is no irregularity, and that irregularity is the hard problem for graph analytics. It is a bit troubling to see so much focus on deep learning and GPUs, and I expect by next year the focus on where these fit will be something much broader.”
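The irregularity Owens points to shows up even in a tiny example. In a dense neural network layer every “vertex” does the same amount of work, but in a real social graph a celebrity vertex can have orders of magnitude more neighbors than a typical one, so a naive one-worker-per-vertex mapping leaves most workers idle. A minimal sketch, using a hypothetical skewed graph:

```python
# Toy illustration of load imbalance in frontier expansion on a skewed graph.
# With a "one worker per vertex" mapping, the hub's worker gets 1000x the work.
graph = {
    "hub": [f"v{i}" for i in range(1000)],  # celebrity-style vertex
    "v0":  ["v1"],
    "v1":  [],
}
work = {v: len(neighbors) for v, neighbors in graph.items()}
imbalance = max(work.values()) / max(1, min(work.values()))
print(work["hub"], work["v0"], imbalance)  # 1000 1 1000.0
```

A GPU runs threads in lockstep groups, so this kind of skew translates directly into idle hardware unless the work is rebalanced, which is one of the core challenges of GPU graph analytics.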
When it comes to graph analytics and GPUs, the bottlenecks, while not numerous, are quite thick and will limit progress until we move into the era of NVLink, which can connect multiple GPUs on the same node to each other and to CPUs, while offering a much higher bandwidth solution and access to shared memory. As has always been the case with GPUs, there are memory limitations—there is only so much of the problem that can fit within that space, and shuttling it around the system, with relatively high latency, is not a workaround either. Further, as Owens explains, graphs are difficult to parallelize, especially over multiple nodes and in a scalable manner.
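Back-of-the-envelope arithmetic makes the transfer bottleneck concrete. The figures below are rough assumptions for illustration, not measurements: roughly 12 GB/s effective for a PCIe 3.0 x16 link versus roughly 250 GB/s for on-board GPU memory of that era.

```python
def transfer_seconds(bytes_moved, gb_per_s):
    """Time to move a payload at a given sustained bandwidth."""
    return bytes_moved / (gb_per_s * 1e9)

# Ballpark assumptions: a graph with 1B edges at 8 bytes per edge.
edges = 1_000_000_000
payload = edges * 8
pcie = transfer_seconds(payload, 12)   # ~0.67 s just to ship the edges over PCIe
gddr = transfer_seconds(payload, 250)  # ~0.03 s to stream them from device memory
print(round(pcie, 2), round(gddr, 3))
```

Under these assumptions, merely copying a billion-edge graph over PCIe eats more than the half-second budget Owens cites, which is why keeping the working set resident in GPU memory, or widening the link with NVLink, matters so much.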
Even if Nvidia took charge and poured big efforts into adding more software staff to keep pace with the constant stream of new open source software tools used for deep learning and graph analytics, there is yet another really big problem, a “chicken and egg scenario,” as Owens describes it.
“If you think about Twitter, as just one example, they look at their homogeneous datacenter, which is relatively easy to manage and say, why develop new code for the GPU if we don’t have any GPUs in our systems. And on the other end, they might see GPU systems but there is not any code yet that has hooks for GPUs so they don’t develop either.”
The “we’ve never had it so we don’t want it” argument is a tough one for companies like Nvidia to overcome as it looks to penetrate the hyperscalers and social media giants, but Owens notes that there are some major performance leaps they can achieve now—and definitely in the future with NVLink and more innovations in graph analytics that can harness the core-rich GPUs.
Owens, coming from a research background, says that there is something else these large-scale datacenter operators tend to think about more than academics do. “What happens if you get great performance off a particular piece of hardware, make that big investment, and then something happens? They really do think about these things. So, think about this: BlueGene performed really well on graphs, but IBM is not supporting that. Larrabee did great for certain graphics workloads. Where is that technology now?”
In other words, having systems that are “easily” maintainable is critical—but so is a sense that these vanilla systems, if another flavor is mixed in, won’t spoil if one of those ingredients goes stale.
The reason Nvidia focused on deep learning at GTC 2015 is simple: they got stuck on 28nm, and GM200 had no double precision.
They took all the space filled by FP64 cores and put in more FP32 cores so they could sell new GPUs in the consumer space that were a little more powerful than the ancient GK110 parts.
The fact that the K80 with its GK210 GPU exists is further evidence. Smart business strategy on their part.
I too work on tree and graph problems on GPUs and CPUs. They are notoriously hard to accelerate on the GPU.
The major issue I face is that you always have to jostle data between CPU and GPU memories. This takes up a lot of time, given the huge difference between GPU memory bandwidth and the CPU-GPU PCI Express link.
I think fixing this requires not only NVLink but also that Nvidia somehow integrate its GPU on an AMD or Xeon SoC. It is quite possible that someday Nvidia would want to have its own SoC with its GPU and perhaps a 64-bit multicore ARM. But that is not going to happen soon.
The biggest change will happen when a single SoC has all three: a multicore CPU, a GPU, and an FPGA. I believe this will take another five years, if not more, to happen.