It was only last month that we spoke in depth with Google distinguished hardware engineer Norman Jouppi about the tensor processing unit used internally at the search giant to accelerate deep learning inference, but that device, the first TPU, already looks rather out of fashion.
This morning at Google’s I/O event, the company stole Nvidia’s recent Volta GPU thunder by releasing details about its second-generation tensor processing unit (TPU), which will handle both training and inference on a rather staggering 180-teraflop system board, complete with a custom network to lash several together into “TPU pods” that can deliver Top 500-class supercomputing might at up to 11.5 petaflops of peak performance.
“We have a talented ASIC design team that worked on the first-generation TPU, and many of the same people were involved in this. The second generation is more of a design of an entire system, versus the first, which was a smaller thing because we were just running inference on a single chip. The training process is much more demanding; we need to think holistically about not just the underlying devices, but how they are connected into larger systems like the pods,” explains Jeff Dean, a senior fellow at Google.
We will follow up with Google to understand this custom network architecture, but below is what we were able to glean from the first high-level pre-briefs offered on the newest TPU, and how it racks and stacks to get that supercomputer-class performance. Google did not provide specifications for the TPU2 chip or its motherboard, but here is the only image out there, and we can start doing some backwards math with it.
To our eyes it looks a little like a Cray XT or XC board, which we find amusing, except that the interconnect chips seem to be soldered onto the center of the motherboard and the ports to the outside world are on the edges of the board. This TPU2 board has four of the TPU2 units, each chip capable of a maximum peak throughput of 45 teraflops, giving the system board an aggregate of 180 teraflops, as we said above. (We presume that this is using 16-bit half-precision floating point.)
There are eight interconnect ports on the left and right edges, plus another two ports on the left hand edge that break the pattern. It would be interesting if each TPU2 board had direct access to flash storage, as AMD is doing with its future “Vega” Radeon Instinct GPU accelerators. Those two ports on the left could be for linking directly to storage, or they could be an uplink to a higher level in the network used to interconnect the TPUs into a processing complex.
If we had to guess – and we do until Google divulges more details – each TPU2 has two ports to the outside world and across the racks, and the two extra ports on the left are for local storage and for node interconnect within the rack. (Google could be implementing a loosely or somewhat tightly coupled memory sharing protocol across these interconnects if they are fast enough and fat enough, and that would be neat.)
Here is what a pod of the TPU2 boards looks like, which Google says has an aggregate of 11.5 petaflops of machine learning number crunching capacity:
Take a hard look at it. To our eyes, it looks like these are actually Open Compute racks, or something of about the same dimensions, if not a little wider. There are eight rows of TPU2 units, with four of the TPU2 boards per enclosure, mounted horizontally. We can’t tell if the rack is full depth or half depth. You can see the six-ported sides poking out of the racks of TPU2 compute nodes, and those two extra ports reach up into enclosures above them.
Above the top row of TPU2 enclosures, those two ports reach up into an enclosure that does not appear to have TPU2 units in it. Our guess is that this is a bare-bones flash enclosure that stores local data at high speed for the TPUs. In any event, we can see at least 32 of the TPU2 motherboards, which means 128 TPU2 chips in the rack. If we do the math on that, this works out to the 11.5 petaflops across the two racks with the blue covers, which make up the pod.
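As a sanity check, the backwards math above can be sketched quickly. The per-chip, per-board, and per-rack counts here are our own inferences from the photos, not Google-published specs:

```python
# Back-of-envelope check of the pod numbers quoted above.
# Assumptions (ours, inferred from the images, not Google's specs):
# 45 teraflops per TPU2 chip, 4 chips per board, 32 boards per rack,
# and 2 racks per pod.
TFLOPS_PER_CHIP = 45
CHIPS_PER_BOARD = 4
BOARDS_PER_RACK = 32
RACKS_PER_POD = 2

board_tflops = TFLOPS_PER_CHIP * CHIPS_PER_BOARD
pod_petaflops = board_tflops * BOARDS_PER_RACK * RACKS_PER_POD / 1000

print(board_tflops)    # 180 teraflops per board
print(pod_petaflops)   # 11.52, matching the quoted 11.5 petaflops
```

The fact that 64 boards land almost exactly on the advertised 11.5 petaflops is what makes us fairly confident in the 32-boards-per-rack, two-racks-per-pod reading of the photo.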
For perspective on what all this floating point goodness means on the production end, Google’s newest large-scale translation model takes an entire day to train on 32 of the “best commercially available GPUs” (we can assume Pascal)—while one-eighth of a TPU pod can do the job in an afternoon. Keep in mind, of course, that the TPU is tailor-made for chewing on TensorFlow while commercial GPUs, even those at the highest end, have to be general purpose enough to suit both high- and low-precision workloads. Still, the ROI of Google rolling its own ASIC is not difficult to wrangle from this example.
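For a rough sense of that anecdote in raw flops, here is a comparison under our own assumptions; Google named neither the exact GPU model nor the board count in the pod fraction, so both figures below are our guesses:

```python
# Rough FP16 throughput comparison behind the training-time anecdote.
# Assumptions (ours, not Google's): the "best commercially available
# GPUs" are Tesla P100s at roughly 21 teraflops of FP16 each, and one
# eighth of a 64-board pod is 8 TPU2 boards at 180 teraflops apiece.
gpu_cluster_tflops = 32 * 21      # ~672 TF for the 32-GPU cluster
eighth_pod_tflops = 8 * 180       # 1,440 TF for one eighth of a pod

ratio = eighth_pod_tflops / gpu_cluster_tflops
print(round(ratio, 1))            # ~2.1x raw advantage on paper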
As an additional point, remember that with the first-generation TPU, Google might have had super-fast and efficient inference, but models had to be moved over from a GPU training cluster first, which slowed experimentation with new or retrained models moving into deployment and made developers at Google wait longer for results before iterating on their work. It is for this reason that training and inference on a single device is the holy grail of deep learning hardware – and we are finally at a point where multiple options for just that are cropping up: Knights Mill from Intel in the future, and of course, freshest in our minds, the Volta GPU.
Nvidia’s Volta GPUs, with “tensor core” processing elements for speeding machine learning training as well as eventual supercomputing workloads, can achieve 120 teraflops on a single device, a major leap over Pascal, which was released just a year ago. While that cadence is impressive, Google’s announcement just stole some of the oxygen from Nvidia’s, even if users will never get their hands on an in-house TPU-powered machine of their own anytime soon.
Dean says that the Volta architecture is interesting in that Nvidia realized “the core matrix multiply primitive is important to accelerating these applications.” He adds that Google’s first generation TPU also took the same idea of speeding matrix multiplication for inference, but of course, this device is doing that across the machine learning workflow. “Speeding up linear algebra is always a good idea,” he notes.
Hardware aside for a moment, here’s the really interesting bit from a user perspective. Instead of keeping its secret sauce inside, Google will be making these TPUs available via the Google Cloud Platform in the near future. Dean says the company does not wish to limit competition and will offer the TPU as an option alongside Volta GPUs and the Skylake Xeons GCP already has, giving developers several choices for how to build and execute their models. Google will also be offering 1,000 TPUs in its cloud for qualified research teams that are working on open science projects and might be willing to open source their machine learning work.
Dean explains that while GPUs and CPUs will still be used for some select internal machine learning workloads inside Google, the power of having both training and inference on the same device, one that is tuned for TensorFlow, will change the balance. While the power specs of the newest TPU were not relayed, we should note that the skinny power consumption of the first-generation device is not a good yardstick for gauging efficiency of the new device, since it is doing both training and inference. We can speculate that the power consumption is lower than Volta, which is a massive device by any measure, but one that is also being tasked with supporting a wide range of workloads, from HPC applications that require 64-bit floating point all the way down to ultra-low-precision workloads in machine learning. Nvidia standardized its approach on FP16 so users could cobble together or pare back their precision according to workload, but we have to assume this newest TPU architecture is geared toward 16- or 8-bit. We hope to verify in a follow-up with lead engineers.
On that note, Dean did say that “Unlike the first generation that supported quantized integer arithmetic, this uses floating point computations. You won’t have to transform the model once it’s trained for inference to use quantized arithmetic; you can use the same floating point representation throughout training and inference, which will make this easier to deploy.”
It is a good thing for Nvidia and Intel that Google is not in the business of pushing its custom-developed hardware into the market, because the TPU is fearsomely competitive in a market where both seek an edge. Putting the second-generation TPU in the Google Cloud Platform will certainly send some users that way for large-scale training, but as noted, there will also be high-end GPUs as well as CPUs for those workloads. The ability for users to run TensorFlow at scale on an architecture designed just for that purpose will be compelling, however. We imagine this move will light a fire under Amazon and under Microsoft with its Azure cloud when it comes to offering latest-generation GPUs, something they have been slow to do (the highest-end GPU available on Amazon is the Tesla K80, but Pascal P100s are now available on Azure).
For those who keep wondering why Google doesn’t commercialize its chips, read above and see how Google is already doing this – albeit via a less direct route (and one with less risk, for that matter). If these deep learning markets do expand at the level predicted, the differentiation provided by the TPU and TensorFlow will be enough to give Google Cloud Platform an edge like it has never had before. That gets Google around mass production – and into a mass user base, one that can help it build out TensorFlow in the process.
As a side note, we have to harken back to Google’s motto from so many years ago… “Don’t be evil.” Because let’s be honest, going public with this beast during the Volta unveil would have been…yes, evil.
The TPU signup link mentioned in “Google will also be offering 1000 TPUs in its cloud for qualified research teams that are working on open science projects” has a typo. The correct link is
This is very interesting, especially considering that Google will make instances of these new TPUs available to cloud users. However, not much information was released to judge their competitiveness with Volta, especially as it concerns both training and inferencing workloads. I think the biggest thing NVIDIA has going for it, however, is that machine learning might be combined with simulation and analytics by most users. Efforts like those being undertaken by GoAI to unify data structures to allow machine learning and other applications to be run on GPUs without moving the data around could turn out to be very advantageous for GPUs.
Also, I think I noticed a mistake in the article:
“Nvidia’s Volta GPUs, with “tensor core” processing elements for speeding machine learning training as well as eventual supercomputing workloads, can achieve 120 teraflops on a single device—just over 40% performance boost from the Pascal GPUs that just hit the market last year.”
NVIDIA’s Pascal-based Tesla P100 can achieve about 20 teraflops for FP16 machine learning training. So the promised 120 teraflops for the Volta-based Tesla V100 is a 6 times performance boost, not a 40% performance boost, over the P100 in machine learning training. The V100 has a 40% performance boost in generic floating point performance over the P100.
That’s an interesting point… Simulation, analytics and AI go hand in hand together
I think you answered your own question with regards to how they came up with that 11.5 petaflops number.
It wasn’t until the end of this article, where you mentioned that this is a pure FP engine, that it made sense to me why they referenced SP rather than the usually more apropos 8-bit.
Each of these TPU boards has 4 TPU2s; combined, they have an aggregate 180 DL TFLOPs, so each TPU2 has 45 DL TFLOPs… The V100 has 120 DL TFLOPs, so it is faster than a TPU2 – so there is a flaw in the article, am I correct?? Trying to run a single neural network on multiple chips simultaneously would be hampered by the interconnect even if it is fast.
This is again the typical buzz about something that Google does. It is surprising to see so much wow for an application-specific chip. Nothing new in terms of research and innovation, just pouring lots of resources in to get an ASIC fast. Time to market is the only thing I can congratulate Google on. It is a shame that the ASIC efficiency and performance is not orders of magnitude better than GPGPUs.
“It is a shame that the ASIC efficiency and performance is not orders of magnitude better than GPGPUs”
I don’t think that that speaks poorly towards Google. Rather, it shows that the GPU architecture, with some minor tweaks, is itself a rather good fit for machine learning. And that is a good thing for computing and the economy as a whole. If it took an exotic processor to do machine learning well it would likely reduce the scope and scale of the benefits of machine learning, because it would make machine learning harder to integrate into existing workflows.
Another thing to keep in mind is that Google isn’t just designing an ASIC die here. They are designing a scalable system built with ASICs.
I’m with ‘the architect’, nothing really interesting. Lots of resources and integration work… Regarding scalability, the HPC research community has been attacking the same problems for a while. They need a couple more iterations to have something decent, and then they will realise that it is not worth the investment.
nVidia has used up all the minor tweaks now; Volta will have a substantial amount of “dead silicon” for things that are not ML related. So nVidia has hit the end of the road and will have to make a serious decision about what to do in the future. Gamers are certainly not going to pay for dead silicon that brings them zero benefit.
“nVidia has used up all the minor tweaks now, Volta will have substantial amount of “dead silicon” for things that are not ML related.”
NVIDIA’s blog post says: “Tensor Cores and their associated data paths are custom-crafted to dramatically increase floating-point compute throughput at only modest area and power costs.” Besides adding the Tensor Cores, NVIDIA also made the integer units on the V100 independent of the FP units and added greater scheduling flexibility (and some other minor improvements to cache latency and flexibility). If one compares the number of SMs and transistors on the V100 to the number on the P100 then NVIDIA’s blog post claim about the area cost for the Tensor Cores seems to be true.
“Gamers are certainly not going to pay for dead silicon that brings them zero benefit.”
How do you even know the Tensor Cores will be on NVIDIA’s gaming GPUs? The gaming GPUs don’t have FP64 or NVLink, for instance. If NVIDIA chooses to leave the Tensor Cores on the gaming GPUs, it will probably be because they don’t significantly reduce the potential of the GPUs for gaming. And if NVIDIA does leave the cores in, maybe they’ll be able to take advantage of them in some of the GameWorks libraries.
“How do you even know the Tensor Cores will be on NVIDIA’s gaming GPUs?”
Big silicon -> low yields. So you can be sure that those dead ones will not end up in the bin but in a different channel; that’s the economics of the foundry business.
GV100 chips may go into Titans, but I seriously doubt they will appear in any GeForce cards.
Regardless, if NVIDIA were able to use partially disabled GV100 chips in gaming cards they would do so only under the condition that it benefited their gaming business, not penalized it. NVIDIA surely has no plans to sell their data center chips below cost, subsidizing them with their gaming chips. If it’s done, it will be because NVIDIA chooses it, and so thinks it’s beneficial, or because they poorly forecast the size of the data center business or some other blunder. Now that the data center business is profitable, there’s no reason that the GV100 has to increase the cost of NVIDIA’s gaming GPUs. In fact, with good execution I think it should decrease the cost.
But Titans are GeForce brand! I think you are a bit fooled by JJH. nVidia is the same as Intel: they need that consumer / standard business sector to keep the R&D, foundry costs, and everything else going. Without the large volumes they are screwed, and the datacenter business is certainly not anywhere near the numbers of the consumer segment (yet). If it ever dips in that direction, yes, you are right. But I doubt it will in the near future.
That’s the #1 reason why IBM has more or less become irrelevant in the datacenter / HPC business: their costs are simply way too high, and only long-term mainframe owners are willing to keep paying through the nose.
True, but in the end it doesn’t matter, as it shows that the big players out there are certainly willing to branch out into doing their own hardware or more exotic solutions, which in itself is very bad news for nVidia: who are they going to sell their expensive toys to in the future? Not the gamers; they are not interested in these architectural tweaks.
I’m sceptical that these things can compete in raw training performance: four of them yield 180 DL teraflops and a single Volta yields 120.
Meanwhile, Nvidia has announced its ~96 TFLOPS Volta at 150 watts, so I wonder how much more efficient Google really is. 1.5x?