The frameworks are in place, the hardware infrastructure is robust, but what has been keeping machine learning performance at bay has far less to do with the system-level capabilities and more to do with intense model optimization.
While it might not be the sexy story that generates the unending wave of headlines around deep learning, hyperparameter tuning is a big barrier when it comes to new leaps in deep learning performance. In more traditional machine learning, there are plenty of open source tools for this, but where it is needed most is in deep learning, an area that does appear to be gaining a solid enterprise foothold outside of the initial web companies that spun up services based on image, speech, and video recognition.
Optimizing traditional machine learning and newer deep learning frameworks like TensorFlow is not simple, and it can have an incredible impact when it is done (or not done) well, potentially yielding order-of-magnitude improvements in accuracy, performance, or efficiency, depending on what users tune for.
Configuring around the number and scope of hyperparameters in a TensorFlow-driven workload leaves humans in the dust, and optimizing with brute-force methods is computationally wasteful, at least if there is a more targeted, streamlined way of knob-turning for the desired model modifications (performance, accuracy, etc.).
As Scott Clark, co-founder of SigOpt, which is one of a handful of companies focusing on this tunability problem for deep learning, tells The Next Platform, about 90% of machine learning and deep learning users in both academia and enterprise are using one of three methods—and none of them are ultimately very efficient or productive as stand-alone optimization approaches. This will become an even bigger problem in the near future as more actual enterprise use cases at scale for deep learning roll out—and as researchers enrich their work with ever-more sophisticated models under computational grant constraints.
Manual search, which amounts to condensing a ten- or twenty-dimensional optimization problem into one’s head, is not necessarily effective and is a drain on time and expertise. Grid search, which essentially lays down a grid of all possible combinations of a configuration and tries every one, is wasteful as well, especially as the number of parameters grows. And randomized search, one of the most popular methods of optimizing, is like trying to climb a mountain by jumping out of a plane until you land on the peak, Clark says. Of the three, this is the most effective. The remaining 10%, he says, are using open source codes that lock users into a single approach. This happens more in academia because of the need to validate a research course, but in industry, an ensemble approach based on the inputs and outputs is actually the most effective, something he and his team have been working to prove across a number of use cases in research and for companies like Prudential, Huawei, Hotwire, and others.
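To make the contrast between grid search and randomized search concrete, here is a minimal Python sketch. It is an illustration only: the `validation_score` function, the two hyperparameters, and their ranges are hypothetical stand-ins for a real train-and-validate run, and nothing here is taken from SigOpt's code.

```python
import itertools
import math
import random


def validation_score(learning_rate, dropout):
    """Hypothetical stand-in for training a model and measuring validation accuracy.

    The peak is placed near learning_rate=1e-2 and dropout=0.3 purely for illustration.
    """
    lr_term = math.exp(-(math.log10(learning_rate) + 2) ** 2)
    dropout_term = math.exp(-((dropout - 0.3) ** 2) / 0.05)
    return lr_term * dropout_term


def grid_search(grid):
    """Evaluate every combination of the listed values and keep the best one."""
    names = list(grid)
    best_score, best_config, evaluations = None, None, 0
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = validation_score(**config)
        evaluations += 1
        if best_score is None or score > best_score:
            best_score, best_config = score, config
    return best_score, best_config, evaluations


def random_search(budget, seed=0):
    """Sample configurations at random within the same ranges, using a fixed budget."""
    rng = random.Random(seed)
    best_score, best_config = None, None
    for _ in range(budget):
        config = {
            "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform sample
            "dropout": rng.uniform(0.0, 0.8),
        }
        score = validation_score(**config)
        if best_score is None or score > best_score:
            best_score, best_config = score, config
    return best_score, best_config


grid = {
    "learning_rate": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "dropout": [0.0, 0.2, 0.4, 0.6, 0.8],
}
grid_score, grid_config, grid_evaluations = grid_search(grid)
random_score, random_config = random_search(budget=grid_evaluations)

print("grid search:  ", grid_evaluations, "runs, best config", grid_config)
print("random search:", grid_evaluations, "runs, best config", random_config)
```

With five values per hyperparameter, two hyperparameters cost 25 training runs. Ten hyperparameters at the same resolution would cost 5**10 = 9,765,625 runs, which is why grid search becomes wasteful as the number of parameters grows, while random search can be stopped at whatever budget is affordable.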
Clark took the original automatic optimization code he developed at Yelp and asked eBay, LinkedIn, Apple, Microsoft and others how they were optimizing similar frameworks and found they had the same problem of inefficient, manual tuning for recommendation engines, search ranking, and other problems. However, he says their feedback on that platform was that it was too hard to use. He went back to the drawing board, assembled a team to create an ensemble approach that melded the different optimization methods as they suited different workloads, and built an API to abstract away the complexity for users via an optimization loop that runs once or as many times as needed when data or models change significantly.
The grand challenge for SigOpt is to generalize parameter tuning across a wide range of machine learning applications so models can get the full effect out of both the code and the underlying infrastructure. These things would appear to be very model specific, but there are far more generalizations than meet the eye, which are discoverable in a black box approach, taking only input and output parameters from a given workload. “As you tune the various architectures of a deep learning system, you can create arbitrarily complex underlying pipelines. We factor that in; it is part of this field of optimal learning: using only inputs and outputs of a system and leveraging that information alone to guide users into the best possible configuration,” explains Clark, who says this is hyperparameter tuning as a service (API-based, with AWS as the back-end infrastructure). “In terms of actual performance optimization, that output itself can be a composite of many things: accuracy, inference time, whatever correlates with business value.”
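As a rough illustration of what a composite output might look like, the sketch below folds accuracy and inference time into a single number for an optimizer to maximize. The function name, the latency budget, and the weighting are all assumptions made for this example; the article does not describe how SigOpt or its users actually combine metrics.

```python
def composite_objective(accuracy, inference_ms, latency_budget_ms=50.0, latency_weight=0.5):
    """Hypothetical single score to maximize: reward accuracy, penalize slow inference.

    Inference times within the budget cost nothing; time beyond the budget is
    penalized in proportion to how far over the budget it runs.
    """
    overage = max(0.0, inference_ms - latency_budget_ms) / latency_budget_ms
    return accuracy - latency_weight * overage


# A slightly less accurate but fast model can outscore a more accurate, slow one.
fast_model = composite_objective(accuracy=0.90, inference_ms=40.0)
slow_model = composite_objective(accuracy=0.95, inference_ms=100.0)
print(fast_model > slow_model)
```

Because the tuner only sees this one number, the choice of weights is where "whatever correlates with business value" gets encoded.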
“The goal is to trade off exploration (learning how different configurations work in systems) against exploitation, or leveraging what you have to get better results,” Clark explains. “In Bayesian optimization or optimal learning, it is a black box approach. You can’t introspect the underlying system; you don’t know the underlying data and model. But you do know what parameters can be tuned and you can probe the underlying system and observe different outputs. The whole idea is about looking at how previous configurations have performed to determine the best, most intelligent thing to try next.”
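The loop Clark describes can be sketched in a few dozen lines of plain Python. What follows is a generic, textbook-style Bayesian optimization loop (a Gaussian-process surrogate with an upper-confidence-bound rule for choosing the next point), not SigOpt's ensemble, which the article does not detail. The `black_box` objective, the single tuned parameter, the kernel length scale, and the `kappa` value are all assumptions chosen for illustration.

```python
import math


def black_box(log_lr):
    """Hypothetical system under test: score as a function of log10(learning rate)."""
    return math.exp(-(log_lr + 2.3) ** 2)


def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel: nearby configurations are assumed to score similarly."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def posterior(xs, ys, candidate, noise=1e-6):
    """Gaussian-process posterior mean and standard deviation at one candidate."""
    n = len(xs)
    gram = [[rbf(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
            for i in range(n)]
    k_star = [rbf(x, candidate) for x in xs]
    offset = sum(ys) / n  # use the observed average as a constant prior mean
    weights = solve(gram, [y - offset for y in ys])
    mean = offset + sum(k * w for k, w in zip(k_star, weights))
    reduction = sum(k * v for k, v in zip(k_star, solve(gram, k_star)))
    return mean, math.sqrt(max(1.0 - reduction, 0.0))


candidates = [-5.0 + 0.1 * i for i in range(51)]  # log10(learning rate) from -5 to 0
tried = {0, len(candidates) - 1}                  # start by probing both ends
observed_x = [candidates[i] for i in sorted(tried)]
observed_y = [black_box(x) for x in observed_x]
kappa = 2.0  # larger values favor exploration, smaller values favor exploitation


def acquisition(index):
    """Upper confidence bound: predicted score plus a bonus for uncertainty."""
    mean, std = posterior(observed_x, observed_y, candidates[index])
    return mean + kappa * std


for _ in range(10):
    untried = (i for i in range(len(candidates)) if i not in tried)
    next_index = max(untried, key=acquisition)  # most promising thing to try next
    tried.add(next_index)
    observed_x.append(candidates[next_index])
    observed_y.append(black_box(candidates[next_index]))

best_y, best_x = max(zip(observed_y, observed_x))
print("best log10(learning rate):", round(best_x, 1), "score:", round(best_y, 3))
```

The optimizer never looks inside `black_box`; it only records which inputs it tried and what came back. The `kappa` term is the explore/exploit dial: the uncertainty bonus pulls the search toward untested regions, while the predicted mean pulls it toward configurations near ones that already performed well.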
Clark says that their top user base is in academia, but close on its heels is the financial services sector. Algorithmic trading is a hot area, with statistical arbitrage applications leading the way, and of course the web companies that have computer vision and language processing workloads are also key. In their early days, SigOpt worked with Nervana Systems on its Neon framework, and they have a number of academic users already.
To compose this ensemble of approaches, Clark has put together a diverse team of Bayesian optimization and optimal learning experts from both academia and industry. He built a similar tool during his time working on the ad targeting team at Yelp and before that, was working on optimization for problems in genomics, which he says don’t look that different from ad targeting problems fundamentally. “One thing I came across during that work is that there was a lot of domain expertise poured into these algorithms, but we always had that heavy extra step of fine-tuning the algorithms to get peak performance. There was a lot of manual trial and error, and if there were enough compute resources, we would have to brute force it. People in finance, social media companies, and elsewhere were all having the same problem—they built these great things only to optimize them in this trial and error way.”
The two-and-a-half-year-old company has raised $8 million in funding to date. Clark will be speaking at the GPU Technology Conference (GTC17) Thursday, May 11 at 10:00 a.m. about how GPU computing is an enabler for larger-scale deep learning, and how that scale is creating some serious opportunities on the optimization front. We will be at that event, by the way, so do stay tuned.
Bayesian optimization seems to have been replaced by deep reinforcement learning for network construction and hyperparameter tuning (see “Neural Architecture Search with Reinforcement Learning”); however, that paper required *800* GPUs! I think it’s safe to say the hardware hasn’t quite settled yet.
Have a look at “Introspection: Accelerating Neural Network Training By Learning Weight Evolution”
TL;DR: Acceleration of training by performing weight updates, using knowledge obtained from training other neural networks.
I agree with the previous commenters. This problem will very likely be solved by a higher-level learning strategy itself. There are solutions based on recurrent networks, reinforcement learning, or genetic algorithms that optimize and build better, simpler CNN-based networks. Especially with reinforcement learning it should be trivial, as you have a simple optimization goal or cost function to hunt for.
In some way it is a bootstrap process.
I hope those people pouring in $8 million can spare the cash, as it sounds to me like they might have to write it off very soon.