Machine Learning for Auto-Tuning HPC Systems
March 6, 2018
On today’s episode of “The Interview” with The Next Platform, we discuss the art and science of tuning high performance systems for maximum performance, something that has traditionally cost performance engineering experts a great deal of time.
While the role of performance engineer will not disappear anytime soon, machine learning is making the tuning of systems (everything from CPUs to application-specific parameters) less of a burden. Despite the highly custom nature of systems and applications, reinforcement learning is enabling new leaps in time-saving tuning as software learns what works best for user applications and architectures, freeing performance engineers to focus on the finer points of system behavior.
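To make the idea of software that learns which settings work best more concrete, here is a minimal sketch of an epsilon-greedy search over a handful of tunable knobs. It is a simple explore-and-exploit scheme from the reinforcement learning family, not a description of how Concertio's tools work. The knob names, their values, and the synthetic `benchmark` function are all hypothetical stand-ins for a real system and workload.

```python
import itertools
import random

# Hypothetical knobs and candidate values, chosen only for illustration.
KNOBS = {
    "threads": [1, 2, 4, 8],
    "buffer_kb": [64, 256, 1024],
    "prefetch": [False, True],
}


def benchmark(config, rng):
    """Synthetic stand-in for running a real workload; returns a noisy score."""
    score = 100.0
    score -= 5.0 * abs(config["threads"] - 4)       # assumed sweet spot at 4 threads
    score -= 0.01 * abs(config["buffer_kb"] - 256)  # assumed sweet spot at 256 KB
    score += 3.0 if config["prefetch"] else 0.0     # assumed benefit from prefetching
    return score + rng.gauss(0.0, 0.2)              # measurement noise


def tune(trials=300, epsilon=0.2, seed=0):
    """Epsilon-greedy search: try every setting once, then mostly re-run the
    best setting seen so far while occasionally exploring a random one."""
    rng = random.Random(seed)
    names = list(KNOBS)
    arms = [dict(zip(names, values))
            for values in itertools.product(*KNOBS.values())]
    totals = [0.0] * len(arms)  # sum of scores observed per configuration
    counts = [0] * len(arms)    # number of runs per configuration

    for _ in range(trials):
        untried = [i for i, c in enumerate(counts) if c == 0]
        if untried:
            i = untried[0]                      # measure each setting at least once
        elif rng.random() < epsilon:
            i = rng.randrange(len(arms))        # explore
        else:
            i = max(range(len(arms)),
                    key=lambda j: totals[j] / counts[j])  # exploit
        totals[i] += benchmark(arms[i], rng)
        counts[i] += 1

    best = max(range(len(arms)), key=lambda j: totals[j] / counts[j])
    return arms[best], totals[best] / counts[best]


if __name__ == "__main__":
    best_config, mean_score = tune()
    print(best_config, round(mean_score, 2))
```

In a real setting, `benchmark` would launch the actual application and measure throughput or latency, and the search space would usually be far too large to try every combination once, which is where more sophisticated learning methods come in.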
In the podcast player below, we talk with Tomer Morad, founder of Concertio, about how machine learning copes with the many tunable parameters in complex systems. Concertio is one of a handful of new companies that rely on evolving machine learning techniques to extract better performance without hours of manual knob-tweaking.
We discuss the limitations of machine learning for complex systems as well as the opportunities across firmware, the OS, the software stack, MPI, and other components of a system.
The gains from automated tuning still vary widely, anywhere from no improvement at all to 3X, but advances in the underlying learning systems could lead to better-optimized systems out of the gate.
For more context, Morad describes an experiment in which Concertio and Mellanox teamed up to accelerate a specific networking use case running on Mellanox’s ConnectX-3 Pro Ethernet cards. The results showed that the settings discovered automatically by the AI-powered Optimizer Studio tool outperformed the best settings found through manual tuning.
Before founding Concertio, our guest, Tomer Morad, co-founded and served as CEO of transSpot, a provider of digital advertising solutions for the digital signage market. Before that, Tomer served as CTO and Chairman of transSpot, Chief Security Officer at Horizon Semiconductors, and a technical team leader at an intelligence unit in the Israel Defense Forces. Tomer holds a PhD from the Technion – Israel Institute of Technology, where his research focused on energy-efficient system resource allocation.