An Inside Look at What Powers Microsoft’s Internal Systems for AI R&D
March 28, 2018 Nicole Hemsoth
Those who might expect Microsoft to favor its own Windows-centric platforms and tools to power the infrastructure serving AI compute and software services to internal R&D groups should plan on being surprised.
While Microsoft does rely on some core Windows features and certainly its Azure cloud services, much of its infrastructure is powered by a broad suite of open source tools. As Jim Jernigan, senior R&D systems engineer at Microsoft Research, told us at the GPU Technology Conference (GTC18) this week, the highest volume of workloads on the diverse research clusters Microsoft uses for AI development runs on Docker on Linux on bare metal, even if that will eventually change as more services and resources move to Azure.
The infrastructure Jernigan speaks of is quite a rich mix of resources, ranging from standard CPU nodes to the highest-end Nvidia V100 GPUs. It is his team's job to provide infrastructure for the many systems, network, and services engineers building out AI-based tools, so it is no surprise that Docker, Chef, CoreOS, and many other open source tools are close at hand.
In many ways, Jernigan says his team is operating what looks a lot like an academic supercomputer center, with different clusters, a wide range of applications, high demand, and thus the need for fair scheduling that maximizes utilization and productivity. No small task, as many in HPC already know, but an even larger one given the complexities introduced by densely heterogeneous systems.
At the heart of these services, which include general research clusters as well as the more structured Internal Deep Learning Service that we will talk about in a moment, are a few critical tools. First is something that old timers in HPC will recognize as a souped-up version of Microsoft Compute Cluster Server, now called Microsoft HPC Pack, a free download that can serve as the cluster management fabric for deployment, monitoring, and job scheduling. This is one of the only Microsoft-centric pieces in the overall infrastructure puzzle.
Users can submit jobs, choosing whether they run on Windows or Linux and on-site or in the cloud. As Jernigan showed off the interface on his slides, it became clear that there were other interesting options, including running on a DGX appliance. It is no surprise that Microsoft has one, but we sure wish we knew more about their experiences with it.
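For a flavor of what command-line submission through HPC Pack looks like, here is a minimal sketch that assembles an invocation of its `job` tool. The scheduler address and the training command are illustrative assumptions; consult the HPC Pack documentation for exact flag syntax.

```python
# Sketch: HPC Pack ships a "job" command-line tool for submission.
# The head node name and workload below are illustrative assumptions,
# not details from Microsoft's actual deployment.
def hpc_pack_submit(scheduler, command, num_nodes=1):
    """Assemble an HPC Pack job-submission command line."""
    return ["job", "submit",
            f"/scheduler:{scheduler}",   # head node running the scheduler
            f"/numnodes:{num_nodes}",
            command]

cmd = hpc_pack_submit("headnode.example.com",
                      "cntk configFile=train.cntk", num_nodes=2)
print(" ".join(cmd))
```

On a real cluster, the assembled list would be handed to a process launcher on a node with the HPC Pack client tools installed.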
The more interesting infrastructure project from his team is the Internal Deep Learning Service, notable in that it offers a wide range of CPUs and GPUs and is entirely managed with open source approaches, with a Hadoop scheduler, modified for these workloads, at its base. The service is open to all of Microsoft, which means scheduling fairness is always an issue, but of note, this entire resource is built on Linux and Docker. Users submit jobs to the scheduler through a Python API, a web GUI, or an orchestrator that manages dataflows, and the compute is kept close to the training data as well.
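The internal Python API itself is not documented in public, but a submission client for a container-per-task scheduler of this kind typically builds a job specification naming the framework image and the resources requested. The field names and image below are illustrative assumptions, not Microsoft's actual API:

```python
# Hypothetical sketch of submitting a containerized training job to a
# Hadoop-based deep learning scheduler. All field names, the registry,
# and the image tag are illustrative assumptions.
import json

def build_job_spec(name, image, command, gpus=1, cpus=4, memory_gb=16):
    """Assemble a job specification for a container-per-task scheduler."""
    return {
        "jobName": name,
        "taskRoles": [{
            "name": "trainer",
            "dockerImage": image,   # every framework is containerized
            "command": command,
            "resources": {"gpu": gpus, "cpu": cpus, "memoryGB": memory_gb},
        }],
    }

spec = build_job_spec(
    name="cntk-mnist",
    image="registry.example.com/cntk:latest-gpu",
    command="python train.py --epochs 10",
    gpus=4,
)
print(json.dumps(spec, indent=2))
```

A web GUI or orchestrator would produce an equivalent specification; the scheduler only sees the resulting document.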
It's easy to trace a path through this. For instance, one could use HPC Pack with CNTK on Windows with a V100 GPU boost on bare metal (skipping Docker, since Linux containers were still not supported natively on Windows). It is also possible to go through the service and select a TensorFlow container (all of the frameworks are containerized, of course), which runs on Linux; beyond selecting the GPU cluster (anything from K40s to Voltas), much of the ugly infrastructure is abstracted away.
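Under the abstraction, launching a framework container on a GPU node in this 2018 timeframe would typically have gone through nvidia-docker. A minimal sketch of assembling such a launch follows; the image tag and mount paths are illustrative assumptions, not what the service actually runs.

```python
# Sketch: the kind of container launch the service abstracts away.
# GPU containers of this era were commonly run via nvidia-docker;
# the image tag and data mount below are illustrative assumptions.
def container_command(image, script, data_dir="/mnt/training-data"):
    """Build an nvidia-docker invocation for a containerized framework."""
    return [
        "nvidia-docker", "run", "--rm",
        "-v", f"{data_dir}:/data:ro",   # keep compute close to training data
        image,
        "python", script,
    ]

cmd = container_command("tensorflow/tensorflow:1.6.0-gpu", "train.py")
print(" ".join(cmd))
```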
Jernigan was also proud of another internal service that we would love to get our own hands on one of these days. This is called the hardware lending library, and it provides access to all the upcoming and cutting-edge hardware Microsoft gets before it hits the market: everything from new FPGAs and CPUs to systems with 6 TB of RAM and NVMe/Fusion-io gear, as well as Azure beta hardware.
Not pictured are other tools central to the operation of the service, including Chef to manage all the systems: their configuration, software installs, versions, auth management, and so on. The team also uses Grafana and Nagios for monitoring GPU utilization, something that sounds small until Microsoft shows what bad utilization looks like from a user who grabbed a lot of resources and squandered them.
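A collector feeding per-GPU utilization into a dashboard like Grafana commonly works by parsing `nvidia-smi`'s CSV query output. A minimal sketch of that parsing step follows; the metric names are our own, and the sample rows are canned for illustration rather than real cluster output.

```python
# Sketch of a collector feeding GPU utilization into a dashboard such as
# Grafana. Parsing nvidia-smi's CSV query output is a common approach;
# the metric names here are illustrative.
# On a GPU node one would capture the output of:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
#              --format=csv,noheader,nounits

def parse_gpu_stats(csv_text):
    """Turn nvidia-smi CSV rows into per-GPU metric dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        idx, util, mem = [field.strip() for field in line.split(",")]
        stats.append({"gpu": int(idx),
                      "util_pct": int(util),
                      "mem_used_mb": int(mem)})
    return stats

sample = "0, 87, 14320\n1, 3, 410"   # canned output for illustration
for gpu in parse_gpu_stats(sample):
    print(gpu)
```

The resulting dicts would then be pushed to a time-series store that Grafana reads, or emitted as Nagios check results.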
For the main systems management, the real star of Microsoft’s show is Docker.
Everything in green is a separate container and every row represents a separate compute node. It is not easy to see, but the top five rows are the infrastructure servers that handle job scheduling and container management; next come the storage clusters for hot data, followed by a long list of worker nodes. The services run in Docker containers, but so do the user workloads.
Here is the truly enlightening slide showing how these systems are managed at Microsoft. This chart is for the internal web service that supports AI and other research projects. Microsoft is using CoreOS with Docker and etcd; Sensu for monitoring; Hadoop, as we said, as the main resource manager; and Gluster as the off-cluster hot file storage for logs, with HDFS handling warm storage that the compute nodes can interact with and hold. Again, the whole cluster runs on Docker, infrastructure and all.
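To make the Sensu layer concrete, here is a sketch of what a 1.x-era client-side check definition looks like when generated programmatically. The check name, command, and subscription are illustrative assumptions, not Microsoft's actual configuration.

```python
# Sketch of a Sensu (1.x-era) check definition for the monitoring layer
# described above. The check name, command, and subscription are
# illustrative assumptions, not Microsoft's actual configuration.
import json

def sensu_check(name, command, interval=60, subscribers=("gpu-workers",)):
    """Build a Sensu check definition as it would appear in /etc/sensu/conf.d."""
    return {
        "checks": {
            name: {
                "command": command,
                "interval": interval,            # seconds between runs
                "subscribers": list(subscribers),
                "handlers": ["default"],
            }
        }
    }

check = sensu_check("check-gpu-util", "collect-gpu-util.py")
print(json.dumps(check, indent=2))
```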
In many ways, this reliance on Linux and open source should not really come as a surprise given the maturity of many of these tools as well as a very open source-centric deep learning ecosystem that has already developed outside of standard cluster operation norms. It was still neat to see inside the management process for what looks very much like an academic site serving a lot of different user demands—and to see how this is shared as fairly as possible within a large organization.