Tearing Apart Google’s TPU 3.0 AI Coprocessor

Google did its best to impress this week at its annual IO conference. While Google rolled out a bunch of benchmarks that were run on its current Cloud TPU instances, based on TPUv2 chips, the company divulged a few skimpy details about its next generation TPU chip and its systems architecture. The company changed from version notation (TPUv2) to revision notation (TPU 3.0) with the update, but ironically the details we have assembled show that the step from TPUv2 to what we will call TPUv3 probably isn’t that big; it should probably be called TPU v2r5 or something like that.

If you are not familiar with the architecture, you might want to take a look at the drill down into the TPUv2 that we did last year. We use Google’s definition of a Cloud TPU, which is a board containing four TPU chips. Google’s current Cloud TPU beta program only allows users to access single Cloud TPUs; Cloud TPUs cannot yet be federated in any way, except by Google’s in-house developers. We have learned over the past year that Google has abstracted its Cloud TPUs behind its TensorFlow deep learning (DL) framework. We don’t expect that to change; no one outside of Google’s in-house TensorFlow development team will have direct access to Cloud TPU hardware, probably ever.

We also believe that Google has funded a huge software engineering and optimization effort to get to its current beta Cloud TPU deployment. That gives Google incentive to retain as much of TPUv2’s system interfaces and behavior – the hardware abstraction layer and application programming interfaces (APIs) – as possible with the TPUv3. Google offered no information on when TPUv3 will be offered as a service, either in Cloud TPUs or in multi-rack pod configurations. It did show photos of a TPUv3-based Cloud TPU board and of TPUv3 pods. The company made the following assertions:

  • The TPUv3 chip runs so hot that for the first time Google has introduced liquid cooling in its datacenters
  • Each TPUv3 pod will be eight times more powerful than a TPUv2 pod
  • Each TPUv3 pod will perform at “well over a hundred petaflops”

However, Google also restated that its TPUv2 pod clocks in at 11.5 petaflops. An 8X improvement should land a TPUv3 pod at a baseline of 92.2 petaflops, but 100 petaflops is almost 9X. We can’t believe Google’s marketing folks didn’t round up, so something is not quite right with the math. This might be a good place to insert a joke about floating point bugs, but we’ll move on.
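As a quick sanity check on that arithmetic, here is a minimal sketch in Python; the 11.5 petaflops and “well over a hundred petaflops” figures are Google’s, the rest is just multiplication:

```python
tpuv2_pod_pflops = 11.5        # Google's stated TPUv2 pod performance
claimed_scale    = 8           # "eight times more powerful"

print(tpuv2_pod_pflops * claimed_scale)  # 92.0 petaflops (92.2 if you start from an un-rounded 11.52)
print(100 / tpuv2_pod_pflops)            # ~8.7X needed just to reach 100 petaflops
```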

The Pods

It is obvious from the two photos of the full TPUv3 pod that Google scaled its next-generation system way up:

  • There are twice as many racks per pod
  • There are twice as many Cloud TPUs per rack

This nets a 4X performance improvement over a TPUv2 pod if nothing else changes.
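Here is that system-level scaling as a minimal sketch, using Google’s stated TPUv2 pod number (the conclusion about where the rest must come from is ours):

```python
tpuv2_pod_pflops = 11.5   # Google's stated TPUv2 pod performance
rack_scale       = 2      # twice as many racks per pod
density_scale    = 2      # twice as many Cloud TPUs per rack

system_scale = rack_scale * density_scale
print(system_scale)                     # 4X from the system build-out alone
print(tpuv2_pod_pflops * system_scale)  # 46 petaflops; the remaining 2X has to come from the chip
```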

Pods: TPUv2 (top) and TPUv3 (bottom)

The Racks

The TPUv3 pod racks are spaced closer together than the TPUv2 racks. But, like TPUv2 pods, there is still no storage evident in TPUv3 pods. TPUv3 racks are also taller, to accommodate the added water cooling.

Racks: TPUv2 (left) and TPUv3 (right)

Google moved the uninterruptible power supplies from the bottom of the TPUv2 rack to the top of the TPUv3 rack. We assume that the massive metal box now at the bottom of the rack contains a water pump or other water-cooling related gear.

TPUv2 top and bottom of rack (left) and TPUv3 top of rack (right)

Modern hyperscale datacenters don’t use raised floors. Google’s racks are heavy even before adding water, so they sit directly on concrete slabs. Water enters and leaves from the top of the rack. Google’s datacenters have plenty of overhead space, shown in a photo of the TPUv3 pod. However, routing and hanging heavy water conduits must have been an additional operations challenge.

TPUv3 water connections (top left), perhaps water pump (bottom left), and above rack datacenter infrastructure (right)

Note the twisted wire running in front of the rack on the floor, just in front of the big metal bottom-of-rack box. We suspect it is a moisture sensor.

Shelves And Boards

Google not only doubled the compute shelf density, it reduced the ratio of server boards to Cloud TPUs from one-to-one to one server board for every two Cloud TPU boards. This has an impact on power consumption estimates, as servers and Cloud TPUs in a TPUv3 pod will draw power from the same rack power supply.

Google bills the server board used with its current Cloud TPU beta instance as a Compute Engine n1-standard-2 instance on its Cloud Platform public cloud, which has two virtual CPUs and 7.5 GB of memory. We think it is a safe bet to assume this is a mainstream dual-socket X86 server.

Recall that a TPUv2 pod contains 256 TPUv2 chips and 128 server processors. A TPUv3 pod will double the server processors and quadruple the TPU chip count.
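Putting rough numbers on that (a sketch; the TPUv2 chip and processor counts are from above, while the board counts and the dual-socket assumption are our inferences):

```python
CHIPS_PER_CLOUD_TPU = 4

# TPUv2 pod, per the figures above: 256 chips, 128 server processors.
tpuv2_cloud_tpus    = 256 // CHIPS_PER_CLOUD_TPU   # 64 Cloud TPU boards
tpuv2_server_boards = tpuv2_cloud_tpus             # one server board per Cloud TPU
tpuv2_server_cpus   = tpuv2_server_boards * 2      # 128, assuming dual-socket servers

# TPUv3 pod: quadruple the chips, one server board per two Cloud TPU boards.
tpuv3_chips         = 256 * 4                             # 1,024 TPU chips
tpuv3_cloud_tpus    = tpuv3_chips // CHIPS_PER_CLOUD_TPU  # 256 Cloud TPU boards
tpuv3_server_boards = tpuv3_cloud_tpus // 2               # 128 server boards
tpuv3_server_cpus   = tpuv3_server_boards * 2             # 256, double the TPUv2 count
```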

We believe that Google over-provisioned the servers in its TPUv2 pod. This is understandable for a new chip and system architecture. After at least a year of tuning pod software and a minor revision of the silicon, it is likely that halving the number of servers had a negligible effect on pod performance. There could be many reasons for that: perhaps the servers were not compute or bandwidth bound, or perhaps Google deployed newer Intel Xeon or AMD Epyc processors with many more cores.

Integrating server boards into the Cloud TPU racks enabled Google to double the number of racks by using identical rack configurations. Standardizing on one rack configuration must help reduce the cost and complexity of hardware deployment.

Compute shelves: TPUv2 (left) and TPUv3 (right)

However, to achieve higher density, Google had to move from a 4U Cloud TPU form factor to a 2U high-density form factor. Google runs its datacenters warm (published figures are in the 80°F to 95°F range), so the TPUv2 air-cooled heat sinks had to be large. Google uses open racks, so moving enough air to cool a hot socket in a dense form factor becomes expensive – so expensive that water cooling becomes a viable alternative, especially for a high-value service like deep learning.

Moving the server board into the TPUv3 rack also shortens connecting cables, so in general we believe Google saved significant cable costs and eliminated dead space in the TPUv2 Pod’s server racks.

Close-up of compute shelves: TPUv2 (top) and TPUv3 (bottom)

Google did not show photos of the board-to-rack water interconnect.

Cloud TPUs

Google did, however, show two views of the TPUv3 Cloud TPU, which has a similar layout to the TPUv2 Cloud TPU. The addition of water cooling is the obvious change. The back-of-board power connector looks the same, but there are four additional connectors on the front of the board. The two large silver squares on the front (left) side of the photo are clusters of four connectors each.

TPUv3 board (top left), TPUv2 board (bottom left), and TPUv3 board close-up (right)

Google didn’t mention the additional connectors. We believe the most likely explanation is that Google added a dimension to the toroidal hyper-mesh and, specifically, moved from a 2D toroidal mesh to a 3D toroidal mesh.

Toroidal mesh interconnect diagrams: 2D (left) and 3D (right)
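To see why a third torus dimension implies more connectors, here is a toy sketch (ours, not Google’s actual topology; the dimension sizes are arbitrary) that counts the links each node needs in a 2D versus a 3D torus:

```python
def torus_neighbors(coord, dims):
    """Neighbors of a node in a torus with the given per-dimension sizes."""
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, +1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size   # wrap around: the wrap is what makes it a torus
            neighbors.add(tuple(n))
    return neighbors

# Each node in a 2D torus needs 4 links; in a 3D torus it needs 6.
print(len(torus_neighbors((0, 0), (16, 16))))       # 4
print(len(torus_neighbors((0, 0, 0), (8, 8, 16))))  # 6
```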

Last year we speculated on several types of interconnects and called it wrong – Google connects a server to a Cloud TPU using 32 lanes of cabled PCI-Express 3.0 (28 GB/s for each link). We believe that it is unlikely that Google increased bandwidth between the server boards and the Cloud TPUs, because PCI-Express bandwidth and latency are probably not big performance limiters.

While interconnect topology will help deep learning tasks scale better in a pod, it doesn’t contribute to raw theoretical petaflops of performance.

TPU Chips

Now we are down to the chip level, where we can try to answer the question: “Where does the remaining 2X performance improvement come from?” Google described its TPUv2 core in general terms:

  • There are two Matrix Units (MXUs)
  • Each MXU has 8 GB of dedicated high bandwidth memory (HBM)
  • Each MXU has a raw peak throughput of 22.5 teraflops
  • But the MXU does not use a standard floating point format to achieve its floating point throughput

Google invented its own internal floating point format called “bfloat,” for “brain floating point” (after Google Brain). The bfloat format uses an 8-bit exponent and 7-bit mantissa, instead of the IEEE standard FP16’s 5-bit exponent and 10-bit mantissa. Bfloat expresses values from roughly 1e-38 to 3e38, a dynamic range orders of magnitude wider than IEEE FP16’s. Google invented bfloat because it found that training with IEEE FP16 required data science experts to make sure that numbers stayed within FP16’s more limited range.
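A quick way to see that range difference is to compute each format’s smallest and largest positive normal values from its exponent and mantissa widths (a sketch; the bit widths are the ones quoted above):

```python
def format_range(exp_bits, mant_bits):
    """Smallest and largest positive normal values of an IEEE-style binary format."""
    bias       = 2 ** (exp_bits - 1) - 1
    min_normal = 2.0 ** (1 - bias)
    max_normal = (2 - 2.0 ** -mant_bits) * 2.0 ** bias
    return min_normal, max_normal

print(format_range(5, 10))  # IEEE FP16: (~6.1e-05, 65504.0)
print(format_range(8, 7))   # bfloat16:  (~1.2e-38, ~3.4e+38), the same exponent range as FP32
```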

We believe that Google has implemented hardware format conversion inside the MXU itself to virtually eliminate conversion latencies and software development headaches. Format conversion from FP16 to bfloat looks like a straightforward precision truncation to a smaller mantissa. Converting FP16 to FP32 and then FP32 back to FP16 is known practice; the same techniques can be used to convert from FP32 to bfloat and then from bfloat back to FP16 or FP32.
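As an illustration of how cheap the narrowing step can be, here is a minimal sketch (ours, not Google’s hardware) of FP32-to-bfloat16 conversion by truncation; bfloat16 is simply the top 16 bits of an FP32 value, though real hardware would likely round to nearest rather than truncate:

```python
import struct

def fp32_to_bfloat16_bits(x: float) -> int:
    """Truncate an FP32 value to bfloat16: keep sign (1) + exponent (8) + mantissa (7) bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16                      # drop the low 16 mantissa bits

def bfloat16_bits_to_fp32(b: int) -> float:
    """Widen bfloat16 back to FP32 by zero-filling the dropped mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

print(bfloat16_bits_to_fp32(fp32_to_bfloat16_bits(3.14159)))  # 3.140625: some precision lost
print(bfloat16_bits_to_fp32(fp32_to_bfloat16_bits(1e38)))     # ~9.97e+37: the huge range survives
```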

Google claimed “great” reuse of intermediate results as data flows through the MXU’s systolic array. That last sentence will take more words to unpack than we can devote to this post.

Given how well the MXU appears to be performing for Google, we believe it is unlikely that Google will change the MXU significantly from TPUv2 to TPUv3. The far more likely scenario is that Google will simply double the number of MXUs for TPUv3.
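Under that scenario, the chip-level math lines up neatly with the pod claims (a sketch using the per-MXU figure above; the doubling itself is our inference, not a confirmed spec):

```python
mxu_peak_tflops   = 22.5                     # per MXU, per Google's description
tpuv2_chip_tflops = 2 * mxu_peak_tflops      # 45 teraflops per TPUv2 chip
tpuv3_chip_tflops = 4 * mxu_peak_tflops      # 90 teraflops if the MXU count simply doubles

print(tpuv2_chip_tflops * 256 / 1000)        # 11.52 petaflops: the TPUv2 pod
print(tpuv3_chip_tflops * 1024 / 1000)       # 92.16 petaflops: 8X, matching Google's claim
```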

Block diagrams: TPUv2 (left) and TPUv3 (right)

One year between chip announcements is a very short chip design cadence. It does not leave time for significant architectural development. But it is enough time to shrink an existing MXU core to a new manufacturing process, tune power consumption and speed paths while doing so, and then stamp more MXU cores onto a die with a little additional floor-planning work. The following table contains what little hard information we have and our best estimates of where Google is headed with its TPUv3 chips.

Last year, we estimated that TPUv2 consumed 200 watts to 250 watts per chip. We now know that includes 16 GB of HBM in each package, with 2.4 TB/sec bandwidth between the MXU and HBM.

We will stick with last year’s estimate of 36 kilowatts of rack power supply (288 kilowatts total for a TPUv3 pod). If we assume 400 watts for each dual-socket server, we work backwards to about 200 watts per TPUv3 chip, including 32 GB of HBM. If these chips were not packed so densely onto boards and into racks, or they were deployed in cooler datacenters, they might not need water cooling. Another alternative might be that Google is deploying single-socket servers in their new TPUv3 clusters. Lowering server power to under 250 watts might give TPUv3 enough headroom to burn up to 225 watts.
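Here is that back-of-the-envelope power budget as a sketch; the 36 kilowatt rack and 400 watt server figures are from the estimates above, while the rack count, server count, and overhead fraction are our assumptions:

```python
racks_per_pod = 8                       # 288 kW pod budget / 36 kW per rack
pod_budget_w  = 36_000 * racks_per_pod  # 288,000 W for the whole TPUv3 pod
servers       = 128                     # assumed: one dual-socket server per two Cloud TPU boards
chips         = 1024                    # 4X the 256 chips in a TPUv2 pod
server_w      = 400                     # assumed draw per dual-socket server
overhead      = 0.10                    # assumed losses: power conversion, fans, pumps

per_chip_w = (pod_budget_w * (1 - overhead) - servers * server_w) / chips
print(round(per_chip_w))                # ~203 W per TPUv3 package, including its 32 GB of HBM
```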

Given a conservative initial TPUv2 MXU design, followed by TPUv3 process shrink, wider and faster HBM, and speed path tuning, it is reasonable to expect that performance per core can stay the same across the two generations without a radical MXU redesign.

Market Recap

Google is still deploying TPUv1 add-in cards for inferencing tasks, four to a server. Google has deployed TPUv1 to accelerate web searches and other large-scale inferencing tasks – if you’ve used Google’s search engine lately, you have probably used a TPUv1.

Google is only offering TPUv2 access via its beta Cloud TPU instances and has given no prediction of when production availability with a service level agreement will start. Google did state this week that it would make TPUv2 pods available to customers “later this year,” but it is unclear whether that would be a production service. Our best guess is that Google will wait until it has validated and debugged TPUv3 pods before deploying TPU pods at worldwide scale. Google is using TPUv2 pods internally for some training tasks. This week, Google made no statements at all about when it would deploy any capability or service based on TPUv3 chips. We believe the TPUv3 announcement was intended to highlight Google’s long-term commitment to controlling its own destiny for accelerating its TensorFlow deep learning framework.

However, we look at TPUv3 as more of a TPUv2.5 than a new generation of chips. Most of the new hardware development appears to be happening at a system level around the TPUv3 chip.

Paul Teich is a technologist and a principal analyst at TIRIAS Research, covering clouds, data analysis, the Internet of Things and infrastructure-scale user experience. He is also a contributor to Forbes/Tech, and focuses on how people interact with technology-based products and services. For three decades, Teich immersed himself in IT design, development, and strategy, including two decades at AMD in product management roles. Teich holds 12 US patents and earned a BSCS from Texas A&M University and an MS in Technology Commercialization from the University of Texas’ McCombs School of Business.


19 Comments

  1. Google has POWER9s… I bet they are using them, plus OpenCAPI and PCI-Express 4.0… all this requires water cooling

    • IBM’s POWER9 processor makes a lot of sense in GPU-based systems because POWER9 integrates NVLink. POWER9 is also the first processor to market with integrated PCIe 4, and I think PCIe 4 is a good bet as a future TPUv3 interconnect. I say future because PCIe 3/4 would be easy to design into TPUv3 and then wait for Intel and AMD to catch up to implement PCIe 4. I should have included that in my post. Simplest explanations are best, so I’ll stick with PCIe 4 as a more general, processor vendor neutral solution, and wait for more evidence that OpenCAPI might be in play here. It is possible, though.

  2. I keep wondering: what is the benefit of the TPUs?

    If they can do roughly 90 TOPS @ 250 watts.

    There are already GPUs doing roughly 110 TOPS @ 250 watts.

    And the GPUs are far more general purpose and amenable to reprogramming. For example, the TPUs can’t handle RNNs, and fixing this will require some serious engineering (hardware) work.

    Is the benefit that it is simply cheaper for Google to build these things rather than buy GPUs? I suppose it’s also a good negotiating chip when discussing bulk purchases from Nvidia (they basically have a monopoly on this market).

    • Designing your own chip is only cheaper in very high volume. Google does have very high volume internal deployment and is searching for higher efficiency for individual high-volume workloads. The only way to get there is to experiment. Google also repeated throughout the week that they will continue to deploy a lot of GPUs, especially for training.

    • You mean these 130K USD boards?

      TPUs are dirt cheap. Minimal IP in hardware, cheaper to manufacture (no extra FP64/graphics pipelines)…

    • It benefits Google because they get the TPUs at a much lower price than the $8,000 graphics cards (generally an ASIC costs $50 to a few hundred dollars once the design is finished). As a product it is definitely lame, and that is why they are not selling the chips. And yeah, it only helps AI people.

  3. Since the datacenters run warm, why not position boards vertically so that convection could remove at least a small part of the heat dissipated? It makes little difference for a small cluster, but their stuff is anything but small.

    • It’s a good thought. Pretty much every liquid immersion chassis I’ve seen has vertical cards. But for EMI shielding in air, we put cards in metal chassis with really crappy airflow. We can’t get around EMI, so we have to force air through the metal chassis. At that point it doesn’t matter what orientation the cards use, and it is easier to manage everything with fewer cards per chassis. Otherwise all of the hyperscalers would be using enterprise-style blade chassis – they don’t because they can measure the economics of fewer boards per smaller chassis.

  4. The TPU 2 was about half the cost of using Nvidia for the same work.

    https://medium.com/@8fee9a760280/c2bbb6a51e5e

    It will be interesting to see how much further ahead Google is now with the 3.

    But most impressive is being able to offer WaveNet at a cost competitive with the old TTS technique used by everyone else.

    Pushing 16k samples per second through a neural network in real time is just hard to believe. Nvidia has their work cut out for them.

    • The GPUs are cheaper if you buy them yourselves.

      You can’t buy TPUs.

      GPU rental pricing is currently pushed up due to huge demand.

      • Imagine you need professional knowledge, have to read tons of tech documents, and need extra time for configuration; no one wants to waste time on the infrastructure.

        That’s why people use the cloud to rent TPUs or GPUs rather than buy them when doing their own training as a startup or for research. The highly efficient training capability saves a lot of time for people who would otherwise have to configure and optimize everything themselves.

        People buy GPUs to build their own training centers for two reasons: 1. The scale is too large to rent from the cloud, and it is also too expensive to rent that many GPUs. 2. They want to keep their training data private. No one can be sure that the data you feed to the GPU cloud stays only yours; the cloud platform could use it too (it would be illegal, but it is possible).

        Another reason people will not buy TPUs for now is that they only support TensorFlow: no Caffe, no PyTorch, no MXNet or CNTK. No one wants to be locked into one platform.

        So it is reasonable that Google will never bring TPU products to market, only the cloud service.

    • Every time someone says this I cringe.
      Yes, pricing is low when there is no guarantee of service. Google’s Cloud TPU instances are great to benchmark with, but they are pre-production (beta) and not yet priced as a supported production service. The benchmarked GPU instances are priced with an SLA as a production-worthy service. (SLA = Service Level Agreement.)

      From Google’s Cloud TPU Pricing page
      https://cloud.google.com/tpu/docs/pricing
      (as of 14 May 2018):

      Beta
      This is a Beta release of Cloud TPU. This release is not covered by any SLA or deprecation policy and is not intended for real-time use in critical applications.

  5. Quote– “However, Google also restated that its TPUv2 pod clocks in at 11.5 petaflops. An 8X improvement should land a TPUv3 pod at a baseline of 92.2 petaflops, but 100 petaflops is almost 9X. We can’t believe Google’s marketing folks didn’t round up, so something is not quite right with the math. This might be a good place to insert a joke about floating point bugs, but we’ll move on.”

    I asked Zak Stone about this question; he said the 8x performance increase between the TPUv3 pod and the TPUv2 pod is the minimum number on some benchmarks, and the better number will be 10x or more.

    And I guess the 100+ petaflops Google mentioned is from the Linpack benchmark (or a tensor benchmark, etc.) – the raw peak performance.

    —-

    BTW, can I get your permission to translate and quote part of your articles in Chinese? I want to publish these good pictures and analysis through my we-media account on Tencent WeChat.

  6. Nice article, thank you! One detail is hard to believe:
    “Last year, we estimated that TPUv2 consumed 200 watts to 250 watts per chip. We now know that includes 16 GB of HBM in each package, with 2.4 TB/sec bandwidth between the MXU and HBM.”

    An HBM2 stack can do at most 256 GB/s, and actually less in practice. That’s why the V100 has “only” 900 GB/s with four HBM2 stacks. The two HBM2 stacks of the TPUv2 would therefore give us roughly 500 GB/s.
