AMD Gets Zen About The Edge
February 22, 2018 Timothy Prickett Morgan
If there is one thing that can be said about modern distributed computing that has held true for three decades now, it is that the closer you get to the core of the datacenter, the beefier the compute tends to be. Conversely, as computing gets pushed to the edge, it gets lighter by the necessity of using little power and delivering just enough performance to accomplish whatever data crunching is necessary outside of the datacenter.
While we have focused on the compute in the traditional datacenter since founding The Next Platform three years ago, occasionally dabbling in the microserver arena or less beefy storage engines, starting this year, as we explained earlier this week, we are going to be tracking the computing on the edge with the same intensity we have brought to the datacenter core. (See Pushed To The Edge for our thoughts about what is happening in distributed computing and why we are compelled to do this.)
By the way, it is common to call these less beefy devices embedded processors, and they are so called because, as the name suggests, they were designed to be put inside all kinds of manufactured devices to provide local compute, initially in a standalone fashion but now increasingly networked to each other and with the back-end datacenter. We are not fond of the term embedded because of the historical implications, and prefer the term edge computing instead because it is broader. In many cases, these embedded processors won’t be inside of anything other than what could accurately be called a baby datacenter, gathering up data from endpoint devices – cars, drones, 5G cell towers, retail machinery, appliances, manufacturing equipment, thermostats, whatever – so the term doesn’t apply well. We think fog computing, another term to seemingly differentiate it from cloud computing, is amusing but vague. (Ironic, isn’t it?) We’ll stick to the edge.
The last two weeks have indeed been edgy when it comes to compute. Intel launched its “Skylake” update to the Xeon D edge processors, rounding out and finishing off its Skylake line as far as we know. (The Skylake Xeon E3-1500 v5 single-socket processors, which also have edge uses, came out in June 2016 if you can believe it, and with embedded Iris Pro graphics cards with 72 GPU cores that delivered 331.2 gigaflops of floating point oomph at double precision and 1.32 teraflops at single precision. This was followed up by the “Kaby Lake” Xeon E3 v6 server chip, which we looked at here and did a very deep historical price/performance analysis with other Xeon edge processors there. The Xeon E5 and Xeon E7 lines were of course merged together with the Skylake Xeon SP launch back in July 2017. We went over the Xeon SP announcement, drilled down into the architecture, and did a thorough price/performance analysis as well. So that’s all folks for the Skylakes.)
In the wake of the Xeon D announcements last week, Arm revealed its Project Trillium architecture, which, among other things, will allow machine learning inference to be yanked out of the datacenter and put into the devices that need to react to the real world in real time using machine learning algorithms.
Now, it is AMD’s turn to talk about edge computing, and the company is rolling out two distinct processors, one a derivative of its Epyc server CPU-only chips and another based on its Ryzen hybrid CPU-GPU chips that thus far have only been used in client devices but like the Xeon E3 (itself a derivative of the desktop Core i7 chip) can be used in lighter server workloads.
Here is how AMD sees the landscape for its new Epyc and Ryzen embedded processors, which obviously have edge uses that apply to compute, networking, and storage that are relevant for all kinds of jobs in the IT spectrum:
The big benefit that AMD is espousing for its core to edge strategy is the same one that Intel has been talking up for the better part of a decade and that the Arm collective has been using as well since a handful of vendors have gotten credible Arm server chips out the door. And that is simply the ability to have the same architecture and instruction set span from the datacenter core all the way out through one or many layers of edge computing. Intel has always contended that because it owns the datacenter and the desktop, it should also own other endpoints (like smartphones and tablets) as well as edge computing (inside switches and storage arrays, networking gear with virtualized functions, and so on). But to date Intel has failed with smartphones and tablets, where Arm rules, and the jury is still out having a late lunch on IoT devices and the edge computing that will serve them.
The Epyc 3000 series is a variant of the “Naples” Epyc 7000s that were announced last summer and that were created for servers with one or two sockets and a need for lots of memory and I/O. The Epyc 7000s basically compete with the belly of the Skylake Xeon SP space, while the Epyc 3000s take on the Xeon E3s and Xeon Ds as well as a smattering of the Atom C3000s. The Naples Epyc 7000 package has four of the “Zeppelin” eight-core chips on it, which are based on the Zen architecture; each of these chips is built from a pair of four-core chiplets that pack L2 cache and L3 cache onto the chip, and the whole shebang is glued together using AMD’s Infinity Fabric across the dies and, for the NUMA-enabled versions, across multiple sockets.
Here is the Epyc 3000 lineup:
AMD did not make clear precisely how the Epyc 3000 was made, but the top end parts have half the cores, half the number of memory controllers, and half the PCI-Express 3.0 lanes compared to the Epyc 7000s, and given the desire to make these smaller but also capable – and to use up as many chips that come out of the GlobalFoundries Fab 8 in Malta, New York as is possible, including those where only some of the cores on each Zeppelin die work – we think that the Epyc 3000 comprises two Zeppelin dies instead of the four in the Epyc 7000. The four top SKUs in the chart above, in fact, seem to have two Zeppelin dies. The four bottom ones could be a single Zeppelin die or two Zeppelin dies that have a lot of components inactive. Our initial guess was the latter, so you don’t have two distinct manufacturing processes, but if you look carefully at the spec sheet, you will see that the parts that top out at eight cores have single Zeppelin dies and the parts with sixteen cores have two Zeppelin dies.
The important thing is that the Epyc 3000 package has the features shown, and that includes integrated support on the package for up to sixteen SATA 6 storage ports, up to 64 lanes of PCI-Express 3.0 peripheral I/O, and up to eight 10 Gb/sec Ethernet ports. That is a lot of I/O, and by the way, those PCI-Express lanes support direct attachment of modern NVM-Express flash drives. Obviously, the memory and I/O capacity in the single die versions of the Epyc 3000s are cut in half. The other important thing is that AMD is guaranteeing that the Epyc 3000s will be available for ten years, so vendors making high volume equipment that needs robust computing, memory, storage, and I/O that falls short of a full-blown datacenter server can depend on this chip to be around for a long time and replaceable in the field. A lot of edge computing devices are going to be operating in places that are not precisely server friendly.
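To make the die-count arithmetic concrete, here is a minimal sketch of how the package maximums fall out if resources scale linearly with the number of Zeppelin dies. The per-die figures are our own inference, obtained by halving the two-die maximums quoted above; the class and function names are hypothetical, and AMD has not published the scaling in this form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DieResources:
    """Resources contributed by one die (inferred, not an AMD spec)."""
    cores: int
    pcie_lanes: int
    sata_ports: int
    ten_gbe_ports: int


# Assumption: per-die values are half of the two-die Epyc 3000 maximums
# quoted in the article (16 cores, 64 lanes, 16 SATA, 8 x 10 Gb/sec).
ZEPPELIN = DieResources(cores=8, pcie_lanes=32, sata_ports=8, ten_gbe_ports=4)


def package_maxima(dies: int, die: DieResources = ZEPPELIN) -> DieResources:
    """Scale per-die resources linearly by the number of dies on the package."""
    return DieResources(
        cores=die.cores * dies,
        pcie_lanes=die.pcie_lanes * dies,
        sata_ports=die.sata_ports * dies,
        ten_gbe_ports=die.ten_gbe_ports * dies,
    )


for dies in (1, 2, 4):
    print(dies, package_maxima(dies))
```

With two dies this reproduces the Epyc 3000 maximums, with one die everything is cut in half, and with four dies the core and lane counts are double those of the top Epyc 3000 parts, which is consistent with the comparison to the Epyc 7000 made above.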
The Ryzen V1000s are interesting in that they combine compute in the form of CPUs and GPUs, all on the same die, like this:
This entry CPU-GPU edge processor is presumably a multi-die implementation, considering that the marketing materials show the Zen core complex, the Vega graphics, the I/O and system hub all being linked by the Infinity Fabric.
The Ryzen V1000s have four Zen cores, with two cores sharing a single DDR4 memory controller; two memory sticks can hang off each controller for a maximum of four memory sticks per chip. The cores share 2 MB of L2 cache and 4 MB of L3 cache, so there is plenty of fast and close memory to keep those cores fed. Here are the feeds and speeds on the Ryzen V1000s:
On the other side of the chip is a baby “Vega” GPU with eleven GPU compute units (CUs) on it. Depending on the yield, AMD is scaling the CPU side to either two or four cores, with or without simultaneous multithreading turned on, and on the GPU side from three to eleven of those Vega CUs activated. As you can see from the block diagram, the bulk of the PCI-Express 3.0 lanes are used up to drive four displays, which we don’t care one whit about for edge computing. There are only sixteen lanes of PCI-Express coming off the processor complex, and presumably if you don’t want to drive displays, you can drive NVM-Express flash drives, which would be useful for edge computing. The chip has two 10 Gb/sec Ethernet ports, which is plenty for a lot of edge networking needs, and can support two SATA storage devices.
It is interesting to contemplate the possible uses of the Vega GPU accelerator. AMD says that it is rated at 3.66 teraflops at FP16 half precision, and that is a respectable and usable amount of performance. For one thing, FP16 is perfectly fine – and often employed – for machine learning inference, and it is also increasingly used for machine learning training. No one is suggesting that a cluster of these chips could outdo the performance of an Nvidia Tesla GV100 GPU when it comes to either inference or training – you are talking 125 teraops there, or 34X the performance. But if the GPU has full access to the main memory, as it seems to, and if you could put 64 GB memory sticks in this thing, then that would be 3.66 teraflops of processing against 512 GB of main memory. The Tesla V100 accelerator tops out at 16 GB of stacked HBM2 memory, and that is a factor of 32X in favor of this edge processor. Unfortunately, if you drill down into the specs, while the DDR4 memory scales up to 3.2 GHz in speed on the Ryzen V1000, it only supports one memory stick per channel and tops out at 32 GB per stick, for a maximum of 64 GB.
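The ratios in that comparison are easy to check. The sketch below uses only the figures quoted above (3.66 teraflops, 125 teraops, 16 GB, the hypothetical 512 GB, and the supported 32 GB per stick at one stick per channel); the two-channel count follows from the two memory controllers described earlier, and the variable names are ours.

```python
# Figures as quoted in the article.
VEGA_FP16_TFLOPS = 3.66        # Ryzen V1000 GPU, half precision
V100_TERAOPS = 125.0           # Tesla V100 figure used in the comparison
V100_HBM2_GB = 16              # Tesla V100 stacked memory
HYPOTHETICAL_DDR4_GB = 512     # the "if you could" capacity quoted above

# What the spec sheet actually supports, per the article.
CHANNELS = 2                   # one channel per memory controller
STICKS_PER_CHANNEL = 1
GB_PER_STICK = 32

compute_gap = V100_TERAOPS / VEGA_FP16_TFLOPS
hypothetical_capacity_gap = HYPOTHETICAL_DDR4_GB / V100_HBM2_GB
supported_gb = CHANNELS * STICKS_PER_CHANNEL * GB_PER_STICK
supported_capacity_gap = supported_gb / V100_HBM2_GB

print(f"V100 compute advantage:        {compute_gap:.1f}X")
print(f"Hypothetical memory advantage: {hypothetical_capacity_gap:.0f}X")
print(f"Supported memory:              {supported_gb} GB")
print(f"Supported memory advantage:    {supported_capacity_gap:.0f}X")
```

So the memory capacity advantage over the accelerator shrinks from the hoped-for 32X to 4X once the real memory limits are taken into account.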
As for more traditional HPC workloads, we presume that 1.83 teraflops of single precision FP32 floating point oomph is also interesting and useful, especially in a 45 watt thermal envelope that can be overclocked a bit on the CPU. This is especially true if it is an embarrassingly parallel job and if 10 Gb/sec Ethernet is sufficient to run the MPI stack.
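As a rough check on those numbers, the single precision figure is simply half of the half precision rating, and dividing by the thermal envelope gives a crude efficiency number. This is a sketch under a loud assumption: the 45 watts covers the whole chip, CPU included, so the result charges the full package power to the GPU and should not be read as a measured figure.

```python
FP16_TFLOPS = 3.66   # quoted half precision rating
TDP_WATTS = 45       # quoted thermal envelope for the whole chip

# Assumption: FP32 runs at half the FP16 rate, matching the quoted 1.83.
fp32_tflops = FP16_TFLOPS / 2

# Crude efficiency: all package power is attributed to the GPU math.
fp32_gflops_per_watt = fp32_tflops * 1000 / TDP_WATTS

print(f"FP32: {fp32_tflops:.2f} teraflops")
print(f"FP32 efficiency: {fp32_gflops_per_watt:.1f} gigaflops per watt")
```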
One last thing: In the two tables above, we have reckoned relative performance against the integer performance of the Ryzen V1605B, taking into account core count, clock speed, and whether or not SMT is active to give those cores a boost. Most of the benchmarks on the Ryzen chips that we can find show SMT adding around 25 percent to the throughput of the chips. The prices shown in black are the single unit prices that AMD has supplied based on the traditional 1,000-unit trays that the CPU industry uses; it is a reasonably high volume purchase, nothing crazy.
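For those who want to play along at home, here is a minimal sketch of that kind of reckoning. It assumes throughput is simply cores times clock speed times a 1.25 factor when SMT is on, normalized to a baseline chip. The exact formula, the choice of clock, and the example chips below are hypothetical stand-ins, not values taken from the tables.

```python
from dataclasses import dataclass

SMT_UPLIFT = 1.25  # roughly 25 percent more throughput with SMT on


@dataclass(frozen=True)
class Chip:
    name: str
    cores: int
    clock_ghz: float
    smt: bool


def raw_throughput(chip: Chip) -> float:
    """Cores x clock, boosted by the SMT factor when SMT is enabled."""
    return chip.cores * chip.clock_ghz * (SMT_UPLIFT if chip.smt else 1.0)


def relative_performance(chip: Chip, baseline: Chip) -> float:
    """Throughput of a chip normalized so the baseline scores 1.0."""
    return raw_throughput(chip) / raw_throughput(baseline)


# Hypothetical placeholder specs, for illustration only.
baseline = Chip("baseline (stand-in for the V1605B)", cores=4, clock_ghz=2.0, smt=True)
candidate = Chip("hypothetical two-core part, no SMT", cores=2, clock_ghz=3.0, smt=False)

print(f"{candidate.name}: {relative_performance(candidate, baseline):.2f}")
```

With these made-up inputs the two-core part lands at 0.60 of the baseline; plugging in the real core counts and clock speeds from a spec sheet gives the same kind of relative ranking.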
Next up, we will take a look at how these edge computing processors stack up to the Xeon E3 and Xeon D competition.