It is rare to find a story here at The Next Platform that does not focus on systems for large-scale use cases. However, given that AI training has been a driver for high-end, GPU-laden systems over the last four years or so, we wanted to give some thought to how that market might change. There are cloud-based options for model training and development, but there are also custom-built laptops and workstations that could carve out business from AWS and others, eliminating some of the unexpected costs of the cloud.
On the laptop side, we were curious about where a single machine maxes out for AI training and how and where it is a better fit, especially for research and development, than using cloud-based GPUs. We were also curious about how these laptops are engineered with this particular workload in mind. It would seem that power would be a major constraint for all the obvious reasons, but it turns out it’s a bit more complex, according to Remy Guercio of Lambda Labs, a company that builds custom deep learning laptops and workstations.
The startup got rolling in 2013 targeting facial recognition on the software side before moving into custom devices for data collection, including the “Lambda Hat,” which took a photo every ten seconds. It then ventured into cloud building, launching its internal GPU cloud in 2015 and then building GPU workstations and a plug-and-play scalable server akin to Nvidia’s DGX machines in mid-2017. This hardware business thrived, snagging customers including Apple, Raytheon, Tencent, LinkedIn, Intel, Microsoft, and others. The reason for this growth? The need for on-the-fly model development and experimentation on a local device, at a price that cuts into the cloud’s cost appeal for training.
Building a balanced piece of portable hardware for compute- and memory-intensive AI training is quite a bit more complex than it might sound, Guercio says, particularly when it comes to balancing memory capacity and GPU capability against the form factor and its need for battery life. The company’s top-end Tensorbook is best suited to training small and medium-sized models and, as one might expect, the more computationally intensive the model, the more strain it puts on the available memory and performance.
The most popular model is the Lambda Tensorbook Max, with an Nvidia GeForce 2080 Max-Q GPU (8 GB VRAM), a 6-core Intel i7-9750H processor, 64 GB RAM, and a terabyte of NVMe storage. Its success has been driven by academic researchers (ML students use the Tensorbook to experiment and train models for their masters and PhD theses, Guercio says), but it has found more traction in enterprise development in the last few months, with the Tensorbook being used as the daily driver for AI/DL training workloads along with coding tasks (prototyping, fixing bugs, making sure training/inference run without error).
“Most academic researchers develop deep learning models with around an 11GB memory limitation in mind, so our goal was to get as close to that limit as possible while still maintaining a portable laptop form factor. We were able to get 8GB with the 2080 Max-Q,” Guercio says, adding that around 80% of popular AI models need less than 8GB, although some (Pix2Pix HD, MT-DNN, and StyleGAN) are not well suited to laptop training at all due to memory constraints. Others, including NasNet Large, DeepLabV3, Yolo3, and MaskRCNN, as well as language models (Transformer Big, Convolutional Seq2Seq, unsupMT, BERT Base/Finetune), fit well within the limitations.
The Tensorbook is designed for development and training but Guercio says that it can handle inference better than one might expect, even for models that the system wouldn’t be powerful enough to train. Researchers can download a large image model trained on the company’s own cloud and run it locally. “As model sizes grow, batch sizes decrease, and while with training it’s important to have a relatively large batch size (the number of training sets the GPU can handle at once) inference only requires a batch size of 1. This opens the Tensorbook up to running on many more models.”
For training and development, the Tensorbook is competitive on cost with common cloud instances with similar capabilities. “We commonly have customers deciding between getting a Tensorbook and using an AWS p3.2xlarge single GPU instance. If you were to train 8 hours a day (only on working days) the cloud server would run you around $525 per month. Many would rather go for the $3300 upfront investment that pays itself off in around six months and can also be used as a personal/work laptop.”
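The payback claim in that quote is simple arithmetic, and a minimal sketch makes it easy to rerun with other numbers. The two dollar figures come from the quote above; the function names and the 21 working days per month are assumptions made for illustration, not figures from Lambda or AWS.

```python
# Figures quoted in the article.
CLOUD_COST_PER_MONTH = 525.0   # USD, training 8 hours a day on working days
LAPTOP_UPFRONT_COST = 3300.0   # USD, top-end Tensorbook

# Assumption for illustration only: roughly 21 working days in a month.
ASSUMED_WORKING_DAYS = 21
HOURS_PER_DAY = 8


def break_even_months(upfront: float, monthly: float) -> float:
    """Months of cloud spend needed to equal the upfront purchase price."""
    if monthly <= 0:
        raise ValueError("monthly cost must be positive")
    return upfront / monthly


def implied_hourly_rate(monthly: float, hours_per_day: float, working_days: float) -> float:
    """Hourly cloud price implied by a monthly bill and a usage pattern."""
    return monthly / (hours_per_day * working_days)


if __name__ == "__main__":
    months = break_even_months(LAPTOP_UPFRONT_COST, CLOUD_COST_PER_MONTH)
    rate = implied_hourly_rate(CLOUD_COST_PER_MONTH, HOURS_PER_DAY, ASSUMED_WORKING_DAYS)
    print(f"Break-even after about {months:.1f} months")
    print(f"Implied cloud rate: about ${rate:.2f} per hour")
```

With the quoted figures the break-even lands a little past six months, which matches the "around six months" in the quote. Heavier or lighter usage than 8 hours a day shifts that point proportionally.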
Power concerns are more prominent here than in other laptops, but so is the realization that, for this market, battery life can be traded for higher performance given the way most users interact with a deep learning notebook. “Training models requires a significant amount of energy to power the GPU, which leads to reduced battery life. However, since this is for researchers and developers, we found that they mostly did this type of work with a power outlet nearby. Additionally, by keeping the battery lighter we were able to keep the price within $3300 for the top-end model, the laptop size thin, and weight within 4.1 pounds to allow the users to carry the machine around.”
On the software side, the systems carry just enough for development: Ubuntu 18.04 with Lambda Stack on top, which provides Debian packages for several frameworks and libraries (CUDA, cuDNN, TensorFlow, PyTorch, Caffe, and Torch, and all their dependencies).
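For readers who want to confirm which of those frameworks are visible to Python on a given machine, a quick check might look like the sketch below. It is not part of Lambda Stack; the helper name is hypothetical, and the import names (`tensorflow`, `torch`, `caffe`) are assumed to be the usual Python module names for these frameworks. CUDA, cuDNN, and the Lua-based Torch are not Python modules, so they are left out.

```python
import importlib.util

# Assumed Python import names for the frameworks listed in the article.
FRAMEWORK_MODULES = {
    "TensorFlow": "tensorflow",
    "PyTorch": "torch",
    "Caffe": "caffe",
}


def installed_frameworks(modules=None):
    """Return {display name: True/False} for whether each module can be found.

    Uses importlib.util.find_spec, which locates a module without importing it.
    """
    if modules is None:
        modules = FRAMEWORK_MODULES
    found = {}
    for name, module_name in modules.items():
        try:
            found[name] = importlib.util.find_spec(module_name) is not None
        except (ImportError, ValueError):
            # Treat a broken or half-installed module as not available.
            found[name] = False
    return found


if __name__ == "__main__":
    for name, ok in installed_frameworks().items():
        print(f"{name}: {'found' if ok else 'not found'}")
```

This only reports whether the modules can be located; it does not verify that they were built with GPU support.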
Aside from the general questions about where these laptops max out workload-wise, one other thing caught our attention. We’re used to focusing on high-end Nvidia V100-dense systems à la DGX and traditional supercomputers, but it’s hard to ignore how much research and development in training is done on much lower-end GPUs. Lambda released some analysis and benchmarking comparing performance on several GPUs and AI workloads. Here are some interesting takeaways from the company’s Michael Balaban, who did the evaluation with cost in mind:
The RTX 2060 (6GB) is great for basic AI exploration that will not involve the constraints of large language models.
The 2070 or 2080 (8GB) falls right into the “just enough memory” sweet spot for models, but the memory requirements for language workloads are growing and could eventually outpace it. The GPU budget here is between $600 and $800, and it is workable for development on mid-sized and smaller models.
The real jump comes with the 2080Ti with 11GB. This is 50% more expensive and about 40% more powerful but it comes with major capability in compute and memory.
The Titan RTX and Quadro RTX 6000 come with 24GB. This tier is for the bulk of memory-intensive models (BERT large, etc.) and future-proofs against state-of-the-art models that will be even more memory intensive, with a dramatically increased number of parameters and input resolution.
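The tiers above amount to a lookup by memory capacity, which can be sketched as follows. The GPU names and memory sizes are taken from the list; the function name and the idea of picking the smallest card that fits are illustrative assumptions, and the memory a model actually needs depends on batch size and framework, so the input figure is something the user has to estimate.

```python
from typing import Optional

# (GPU tier, memory in GB) as listed in the article, smallest first.
GPU_TIERS = [
    ("RTX 2060", 6),
    ("2070 / 2080", 8),
    ("2080Ti", 11),
    ("Titan RTX / Quadro RTX 6000", 24),
]


def smallest_fitting_tier(required_gb: float) -> Optional[str]:
    """Return the smallest listed tier whose memory covers required_gb.

    Returns None when the requirement exceeds every tier in the list.
    """
    for name, memory_gb in GPU_TIERS:
        if required_gb <= memory_gb:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical memory requirements, purely for demonstration.
    for need in (5, 7.5, 10, 16, 30):
        print(f"{need} GB -> {smallest_fitting_tier(need)}")
```

A model that needs more than 24GB returns `None` here, which corresponds to workloads that outgrow a single card from this list.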
Lambda has published several workload-specific performance profiles with additional analysis, along with several other specific benchmarks over the last several months.