The Theoretical Underbelly of Optimizing Machine Learning Systems

This week we will be delving into systems and architectures designed for machine learning, but since it’s Monday and there is quite a bit to come, it seemed worthwhile to take a step back and consider how architects and software developers are thinking about the current landscape.

With Intel touting next-generation deep learning and machine learning as a partial basis for its Altera buy, and webscale companies like Microsoft, Google, Baidu, and others seeking ways to boost machine learning algorithms with hardware, accelerator, and, of course, software approaches, the larger conversations tend to get lost in the mix. For instance, what does it mean to optimize for these codes, and what are the system design choices that seem to be the best fits?

It may seem a bit odd to take machine learning systems guidance from a UC Berkeley professor who specializes in gamma ray bursts and black holes, but when it comes to applied scientific algorithms across massive, shifting datasets, Dr. Joshua Bloom does have a view into the complexity and the systems required to tackle it. Part of that is understanding the machines required to process cosmological simulations (i.e., large-scale supercomputers), but for Bloom and his group at Berkeley, the equal challenge lies in the tradeoffs between that computation and the requirements of the models: sacrificing accuracy for performance, scalability for complexity, memory for model depth. The list goes on. But at the core, Bloom says, is an increased need for machine learning systems, and the people building them, to understand the purpose of optimizations and apply those to both the hardware and software (and by default, the outcomes and implementation).

Machine learning systems are alive, he says, both “influencing and responding to their environment. At best, they’re valuable, resilient, functioning systems composed of many imperfect parts with many weak contracts between them, built by fallible individuals with broken communication channels, all of whom are living in a resource-constrained world that’s constantly changing, with the results being consumed by exacting and capricious individuals.” This definition, as he told a group at PyData Seattle, which was hosted by Microsoft, indicates what we already know: this is hard stuff.

Optimization decisions up and down the stack.

The difficulty lies in variability. There are no standard tradeoffs that suit every algorithm, especially since the same model, applied to different datasets, can perform dramatically differently.

“The whole idea of machine learning is not that you build a system to run the algorithms and come back and check in with it a year later. It is evolving and alive,” Bloom said. And even so, he notes, referencing the seminal Google paper that outlines the challenges ahead for machine learning, these systems are not just chewing on machine learning problems. “A mature system might end up being (at most) 5% machine learning code and (at least) 95% glue code.” In production systems, this is certainly the case, Bloom says, pointing to examples from users of Wise.io, where Bloom is founder and CTO.

Bloom says software is the connective tissue to the hardware, but what is so often missing is a sense of what is being optimized. The results can be generalized (in that big data, throw it all at the wall to see what sticks sort of way), but actual machine learning requires great nuance, and it can be different each time. “At the algorithm and model level, it’s learning rate, convexity, error bounds, scaling and so on. And at the hardware level, it’s about accuracy, memory and disk usage, CPU needs, time to learn and predict. With all of this in mind, taking a myopic view of a machine learning system can be costly further up the stack.”

Optimization for machine learning means determining what matters in an algorithm's results. Some might be willing to accept false positives over missed detections, or the reverse; it’s all “just fishing otherwise,” Bloom explains.
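To make the false-positive versus missed-detection tradeoff concrete, here is a minimal sketch in Python. It is not from Bloom's talk: the function names, the toy scores and labels, and the per-error costs are all assumptions made up for illustration. It shows how the same classifier scores lead to a different decision threshold once you state which kind of error costs more.

```python
from typing import Sequence, Tuple


def count_errors(scores: Sequence[float], labels: Sequence[int],
                 threshold: float) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) when every score
    at or above `threshold` is predicted positive."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn


def pick_threshold(scores: Sequence[float], labels: Sequence[int],
                   cost_fp: float, cost_fn: float) -> Tuple[float, float]:
    """Return (threshold, total_cost) minimizing the weighted error cost.

    Candidate thresholds are the observed scores plus +inf
    (which means "never predict positive").
    """
    candidates = sorted(set(scores)) + [float("inf")]
    best_threshold, best_cost = candidates[0], float("inf")
    for t in candidates:
        fp, fn = count_errors(scores, labels, t)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:  # strict '<' keeps the lowest threshold on ties
            best_threshold, best_cost = t, cost
    return best_threshold, best_cost


# Hypothetical toy data: model scores and true labels (1 = real detection).
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

# Missed detections are expensive: the chosen threshold drops to catch more.
print(pick_threshold(scores, labels, cost_fp=1, cost_fn=5))
# False alarms are expensive: the chosen threshold rises to avoid them.
print(pick_threshold(scores, labels, cost_fp=5, cost_fn=1))
```

Nothing about the model changes between the two calls; only the stated costs do. Without such a statement there is no basis for preferring one threshold over another.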

The full talk, although pretty general as far as these go, deserves a pointer as a useful kick-off to the more focused architectural articles coming this week.


2 Comments

  1. Totally agree. Different learning algorithms have different bottlenecks, and often they lie in parts that are intrinsically serial and so do not benefit from any kind of parallel architecture. Data set structure, and especially density, also has a serious impact on the performance of the algorithm. But the biggest bottleneck is that nearly all of the current algorithms need human intervention, as all the big performers are supervised learning algorithms.
