The origin of Google’s effort to build its own AI chips is well known by now, but for the sake of review, we’ll note that as early as 2013 the company anticipated that machine learning could consume the majority of its compute time. Inference was a particularly expensive proposition, forcing Google to consider building purpose-built chips for its own massive-scale AI inference operations. The TPU was born, with TPUv1 serving inference.
While high performance inference could be achieved, it didn’t take Google’s TPU designers and workload experts long to see that the real bottleneck had become training. This pushed development toward TPUv2 for efficient, scalable, high performance training.
Much has been written about the success and design of these devices, but we have never had good insight into the challenges of designing the TPU, from the hardware itself to engineering team requirements. Several of the developments between the first chips and TPUv4 have now been addressed by the core TPU design team, including Cliff Young, Norman Jouppi, and David Patterson, in a recent retrospective published by IEEE (behind paywall).
The TPU design team clarifies that while Google is teeming with engineers, the number assigned to its own chip effort was limited, as was the budget. This is likely the case for any chip startup or even software company, so it’s worth relating how they prioritized during the initial design process and throughout subsequent generations, all at a time when training complexity and scale were growing exponentially.
To work around these limitations, the Google team had two “buckets” of goals: those that had to be outstanding and those that simply had to work. Priority items included the ability to build quickly, deliver high performance devices that could scale across chips, survey and account for new workloads and, of course, be cost effective. Everything else, they say, was secondary. “While tempting to brush these second bucket issues aside as minor embarrassments, the reality is building and delivering a good system is as much about what you decide not to do as to what you do. In retrospect, these decisions are not embarrassing at all.”
Although many of the lessons Google can share about agile chip design processes are TPU-specific, several items in that first bucket are worth a closer look.
The “build quickly” mantra is naturally at the top of the list, whether for Google or for startups, but this stage is all about tradeoffs and sacrifices, even if they’re not “embarrassing” at the end of the process. “Our cross-team co-design philosophy found simpler-to-design hardware that also gave more predictable software control such as DMA to main memory (HBM) and compiler-controlled on-chip memory instead of caches,” the TPU team says, adding that the list also included difficult tradeoffs to protect the development schedule and even living with an “unstylish” chip layout.
They have little to say about high performance and the ability to string together many chips, as these were the linchpins of the effort. However, when it came to balancing tradeoffs amid budget and engineering limitations, keeping pace with the onslaught of new training workloads complicated the process. “To support the deluge of training workloads, we built a core grounded in linear algebra that works well with the XLA compiler and HBM ensures we have enough capacity and bandwidth to keep pace with growing models.”
And of course, none of this was cheap. Although it’s striking to think about serious budget limitations at a company like Google, the team had to keep constant tabs on simplicity, even if that meant a rather plain design, a theme that comes up repeatedly.
“The matrix units are efficient, the design was simple without gratuitous bells and whistles, and we got our money’s worth in performance.”
For chip startups there is no holy grail like the above. The tradeoffs never end, and progress already made should never be wasted.
As the design iterations continued, they say they didn’t want to “blow everything we worked hard for in TPUv2 by building the kitchen sink into TPUv3. TPUv3 is a ‘mid-life kicker’ that leveraged what we already built,” the team says, pointing to re-use of the familiar 16nm technology, doubling the existing matrix units to double peak FLOPS, boosting the clock from 700MHz to 940MHz (with 30% coming from tuning pipelines) and raising HBM bandwidth by 30% using a higher bus speed. They also found they could keep stretching the capabilities of HBM by doubling capacity to take on larger models and batch sizes. As a bonus, because so much was carried over from TPUv2, they could scale all this into a 1024-chip system, a vast improvement over the previous generation’s 256-chip limit.
“It seems quaint now, but we thought the 256-chip TPUv2 system was huge. ML’s voracious appetite continues, so moving to 1024-chip systems was critical.” In other words, to that first bucket list of important design points, add the ability to reuse previous innovations.
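To put those TPUv3 figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes, as a simplification that is ours and not the TPU team's, that peak throughput scales linearly with the number of matrix units and with clock frequency, and that a full system's peak scales linearly with chip count. The function and variable names are purely illustrative.

```python
# Back-of-the-envelope scaling from TPUv2 to TPUv3, using only the
# figures quoted in the article. ASSUMPTION (ours): peak throughput is
# proportional to matrix-unit count times clock frequency, and system
# peak is proportional to chip count. Delivered performance will differ.

def relative_peak(matrix_unit_ratio: float,
                  old_clock_mhz: float,
                  new_clock_mhz: float) -> float:
    """Peak-throughput ratio (new / old) under the linear-scaling assumption."""
    return matrix_unit_ratio * (new_clock_mhz / old_clock_mhz)

MATRIX_UNIT_RATIO = 2.0      # matrix units doubled
OLD_CLOCK_MHZ = 700.0        # TPUv2 clock
NEW_CLOCK_MHZ = 940.0        # TPUv3 clock
OLD_SYSTEM_CHIPS = 256       # largest TPUv2 system
NEW_SYSTEM_CHIPS = 1024      # largest TPUv3 system

clock_gain = NEW_CLOCK_MHZ / OLD_CLOCK_MHZ
per_chip_gain = relative_peak(MATRIX_UNIT_RATIO, OLD_CLOCK_MHZ, NEW_CLOCK_MHZ)
chip_count_gain = NEW_SYSTEM_CHIPS / OLD_SYSTEM_CHIPS
system_gain = per_chip_gain * chip_count_gain

print(f"clock gain:         {clock_gain:.2f}x")
print(f"per-chip peak gain: {per_chip_gain:.2f}x")
print(f"system peak gain:   {system_gain:.2f}x")
```

Under these assumptions the per-chip peak rises by roughly 2.7x and the largest system's peak by roughly 10.7x, without a new process node. These are idealized peak ratios, not measured results.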
“As deep learning continues to evolve and as we understand the workloads and scaling limits better, opportunities continue for further codesign across ML models, software, and hardware to improve next generation TPUs and other domain specific architectures,” the team concludes.
It is worth reading the entire piece on the design process specific to the TPU, as it walks through the various design points and pivots, of which there were many: not just the leap from inference to training, but also the push for higher performance against ever more ambitious efficiency targets amid changing memory and interconnect technologies. The full copy is hosted by IEEE.