For two decades now, Google has demonstrated perhaps more than any other company that the datacenter is the new computer, what the search engine giant called a “warehouse-scale machine” way back in 2009 with a paper written by Urs Hölzle, who was and still is senior vice president for Technical Infrastructure at Google, and Luiz André Barroso, who is vice president of engineering for the core products at Google and who was a researcher at Digital Equipment and Compaq before that.
Lots of people have come around to this idea because of its obvious merits, expressed by that Google paper and by the rise of its hyperscaler peers. Interestingly and most recently, for instance, Nvidia with its acquisitions of Mellanox Technology and Cumulus Networks and its desired acquisition of Arm Holdings has caught the “datacenter is the computer” bug.
After spending decades designing and perfecting its macro-scale system, Google is shifting the center of gravity to a different part of the system where much innovation needs to be accomplished if we are to survive the end of Moore’s Law, which most certainly is upon us as it gets more expensive – and not less so – to cram more transistors into a piece of silicon.
The big news driving this announcement was that Google has hired Uri Frank, a chip designer who worked on many generations of system-on-chip devices for client computers at Intel, to be vice president of engineering for server chip design at the Googleplex. Amin Vahdat, a Google Fellow who is known to readers of The Next Platform as the person who has steered development of the globe-spanning network that underpins Google as well as the datacenter-scale networks that are the heart of its vast operations, is now vice president of system infrastructure at the company and took some time to explain why the company was hiring Frank and why it believed the system-on-chip, or SoC, was going to be the focus of engineering to drive efficiencies and scale for compute into the future. Vahdat is also Frank’s boss, so this is the person who should be talking about the strategy on Day One.
What Google is up to is a bit subtle, and it is talking about what it thinks it needs to do before it has done it, which is a bit unusual for the company, and Vahdat conceded as much. We pointed out that normally, when Google talks about something “new” it solved the problem five years ago and is only now telling the world about it. This happened with MapReduce, which spawned Hadoop; then BigTable, which spawned HBase; then Spanner, which spawned CockroachDB. Google just took out the middleman with its Borg/Omega container controller and cloned it to create Kubernetes, which it open sourced.
We don’t think that Google will be open sourcing server SoC designs any time soon, but if it might help drive sales of its Google Cloud, we would not be surprised to see custom or semi-custom SoCs being offered for sale to on premises datacenters or co-location facilities running the Anthos Kubernetes stack that is about as close to what Google has internally as you can get. Or want, assuming Borg is highly tuned for Google-specific workloads and infrastructure.
Google must have made Frank a pretty compelling offer in terms of intellectual challenge and compensation to leave Intel. Only a few weeks ago, Frank was one of a number of executives on Intel’s Israeli chip design team who were promoted as the new chief executive officer, Pat Gelsinger, was coming back to the chip maker to take the helm. In Frank’s case, he was elevated to corporate vice president after being general manager of the Core and Client Development Group, which has a team of over 2,000 engineers working in the United States, Israel, and India.
Frank has been climbing the ranks at Intel since leaving college. He got a bachelor’s degree in electrical and electronics engineering from Technion, the MIT of Israel, in 2000, and followed that up with a master’s degree sponsored by Intel in 2004. In 2011, Frank was appointed director of engineering for a team of more than 200 engineers who worked on memory controllers, PCI-Express controllers, power management circuits, and on-chip ring and mesh fabrics. In February 2014, Frank moved to Intel’s Beaverton, Oregon offices as director of engineering and managed the 300 engineers working on the “Apollo Lake” PC chip, and in 2016 was named senior director of engineering in charge of Core SoC designs. In 2018, Frank was named a vice president of its Platform Engineering Group and director of product development for PC, AI, and IoT chips, and the title shifted in late 2018 but the job stayed largely the same.
At Intel, the server SoC design always starts with the client SoC, so it is no surprise that Frank might be tapped to lead custom server chip development. A core is a core is how Intel has always thought about it, and that may be precisely the problem that Google is trying to solve by “doubling down” on custom chips, as Vahdat put it. A server sometimes does very different things than a client, and even with those things that both clients and servers do, the ratios of them and the bandwidths required to process them are different. We think that Google may have concluded that it needs a true server core, and one that is tuned for the kinds of workloads that Google itself is running.
But it is important to not get carried away here. Google did not announce that it is creating its own instruction set and custom chip, as it did in 2015 with the Tensor Processing Unit (TPU) for running machine learning training and inference algorithms on its TensorFlow framework or in 2019 when it created its own Video Processing Unit (VPU) to handle video transcoding on media servers. As Hölzle has reminded us more than once, Google only makes custom silicon when it absolutely has to, and more often than not it has gotten semi-custom CPUs with a few tweaks here and there for specific workloads or worked with partners to create semi-custom disk drives, flash drives, network interface cards, or network switches.
“One of the things that I want to emphasize – and this is going to continue to be true – is that we are not looking to do it all,” Vahdat tells The Next Platform, and already people reading the announcement of Frank’s departure from Intel and move to Google and the blog post put out by Vahdat are jumping to the wrong conclusion. “We are looking to do as much with partners and the ecosystem as possible, and frankly this is increasingly so. A decade ago we did more in-house and tried to keep it in-house, but we are continuing our trend to partner. We made flash drives, but we never made our own NAND gates. But in some cases, as with our initial use of flash, we actually have to prove out that something has value before others can follow.”
Google’s own vertical integration, where it owns its entire software stack from the Linux kernel all the way up through application and data services to the Web browser, has given it some advantages for custom chippery or higher-level custom hardware, and Vahdat admits that. Flash is a good case in point. If you are building a video chip or a flash device for the entire world to use, it tends toward the lowest common denominator, which limits specific utility, or toward a very broad feature set, which makes it use up transistors and burn power unnecessarily. The way you write data placements or do garbage collection on flash, says Vahdat for example, is very different on a warehouse-scale computer than it is for a single laptop. The TPU and VPU are very precise devices tuned specifically for TensorFlow and YouTube or Hangouts, respectively, Vahdat says. But maybe you only go that far if you have to.
Google has a growing scale for its workloads as well as a growing number of workloads, plus a public cloud business that has to support a huge diversity of applications and systems software. In these cases, the best – and most economical – approach might be to find best-of-breed components and integrate them onto an SoC tuned specifically for workloads. This is where the SoC as the new motherboard idea comes in.
“All of the components in a system integrate on a motherboard, often on a PCI-Express bus,” says Vahdat. “The integration and customization point is that motherboard. We are in a place now, balancing application demand and efficiency, that it is very hard to know how much of a particular device to put on a motherboard. And it is actually hard to coordinate application code to be able to manage data movement and memory across all of the devices running out of the PCI-Express bus. Without talking about specifics, what we are talking about is innovating with components on a basic level and bringing them together when and where they matter, customized for individual applications – just as we did for storage, for machine learning, and for video – and put them all together on this new motherboard. We think of this as the motherboard of this decade and it will allow us to integrate different kinds of IP.”
This does not necessarily mean chiplets from different vendors integrated into a single package, but that could be part of what Frank and his team will be exploring. And it does not mean using protocols like CXL to extend the motherboard out beyond a single chassis – although Google will obviously be using CXL and perhaps other protocols like CCIX or Gen-Z where it is appropriate to link compute and storage elements together. What it does mean is that Google needs to specialize if it is to keep bringing something akin to Moore’s Law improvements into its systems. (And as Hölzle pointed out to us so many years ago, Google will do anything to beat Moore’s Law because that is what a hyperscaler has to do to remain in business.)
“Back in the day, when things were getting faster exponentially, at just a massive clip, it didn’t make sense to specialize for individual workloads,” explains Vahdat. “Back then at Google, we had a smaller number of workloads, too. So specializing for a couple of them was sufficient. In the cloud world, and also given the number of services that we host, it is no longer the case that one particular application dominates. So this model of being able to integrate the best of breed IP, buying as much of it as possible and partnering with others everywhere that it makes sense, lets us be able to rapidly specialize for individual applications.”
In a way, what Google really wants to do is teach the chip makers to cooperate in a way that they really do not, and have not historically. Imagine if you could take bits and pieces of technology from Intel, AMD, IBM, and Nvidia and make the right kind of specific compute device. This is the kind of thing Google is dreaming about, and maybe it can happen if Google buys some IP here and there and integrates it to prove it works. Maybe it will happen at the chiplet level first.
“We want to do as little as possible and only what we need to do,” Vahdat emphasizes, and this is a consistent message from Google with regard to hardware over the decades. Google only builds what it has to. “I think it depends on what it is that we’re trying to put together in the end and the particular use case. But again, we want to do as little of this as possible, ideally leading industry so that we decrease that over time. This is not the business we want to be in a long term.”