Compute is still by far the largest part of the hardware budget at most IT organizations, and even as advancing technology allows more compute, memory, storage, and I/O to be crammed into a server node, we always seem to want more. But with the tighter coupling of flash in systems and new memories like 3D XPoint coming to market, the server is set to become a more complex bit of machinery.
To try to figure out what is going on out there with memory on systems in the real world and how future technologies might affect how servers are configured, The Next Platform sat down with two experts from Micron Technology to talk about how systems are evolving today and what they might look like in the future given all of these changes.
Steve Pawlowski, vice president of advanced computing solutions at Micron and formerly the chief technology officer of Intel’s Data Center Group, helped develop the processor maker’s initial server platforms (including their main and cache memory subsystems) and also helped drive the PCI-Express and USB peripheral standards, among many other things. Brad Spiers, principal solutions architect for advanced storage at Micron, sat in on the chat with us, too. Spiers spent more than two decades architecting distributed platforms for Swiss Bank, Morgan Stanley, and Bank of America before joining Micron last July. Pawlowski left Intel after three decades, joining Micron in July 2014.
The conversation with Pawlowski and Spiers was wide ranging, which stands to reason given the broad possibilities for server architectures in the coming years and the pressure on system architects to squeeze more performance out of the circuits that make up compute, storage, and networking. Efficiency is going to be a bigger factor than it has ever been, particularly as Moore’s Law starts running out of gas.
Timothy Prickett Morgan: I am very interested in how different kinds of non-volatile memory are going to be woven into future systems. I know Intel has talked a bit about this, but Micron has a different perspective on this evolving memory hierarchy and possibly different customers.
Brad Spiers: What we are seeing so far is that there is a shift going on and it is really coming from Moore’s Law, but it is showing up in maybe some unexpected areas. We are starting to see a shift in big data. What happened initially was that companies began to build Hadoop systems as a cheap place to dump data. They built an architecture that had a lot of spinning disk, and that was great for dumping data on the cheap, whereas in the past they would have had to put it inside an expensive database.
But now, with the rise of Spark, people are starting to shift to in-memory computation, and this is impacting every workload, including machine learning, which drove much of the move to Spark. Because of the end of Moore’s Law, it is the movement of data – again and again and again – that is actually the challenge. What you want to focus on, at a systems scale, is how much energy is consumed just moving data and how much time it takes.
So one of the changes when you move to Spark is that DRAM is so much faster than spinning disk – everyone is so excited about SSDs, but with memory, you can do it more than 1,000 times faster. Overall, because of the other system benefits, we see Spark being about 100 times faster than Hadoop, and that lets you take on different classes of problems.
Hadoop is good for asking very simple questions using MapReduce – one-iteration questions like counting the number of times that “Micron” appears in articles published in the past five years, where the data is spread across the nodes and you add it up, and that’s fine. But machine learning uses more complex iterative algorithms, and once the work becomes iterative, you become bottlenecked by the speed of your storage. The real change is that if you can do all of that processing from data stored in main memory, that’s great. But as you can imagine, a lot of problems have datasets that are larger than DRAM.
What you do then is push the data into NVM-Express SSDs, and then the network quickly becomes a bottleneck because of the capabilities of SSDs. The overall system architecture has shifted radically.
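The distinction Spiers draws between one-pass and iterative work can be sketched in plain Python. This is an illustrative toy, not Hadoop or Spark code, and the article snippets, dataset, and learning rate are all invented for the example:

```python
# Illustrative sketch of the difference Spiers describes: a one-pass
# aggregation touches the dataset once, while an iterative machine
# learning algorithm re-reads the whole dataset every iteration.

articles = [
    "Micron ships new DRAM",
    "Intel and Micron talk 3D XPoint",
    "Server vendors weigh HMC",
]

# One-pass question, MapReduce style: count occurrences of "Micron".
count = sum(article.split().count("Micron") for article in articles)

# Iterative question: fit y ~= w * x by gradient descent. Each of the
# 100 iterations scans the whole dataset again -- on disk-backed
# storage, that repeated scan, not the arithmetic, is the bottleneck.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # invented (x, y) pairs
w = 0.0
for _ in range(100):
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= 0.05 * grad

print(count)        # occurrences of "Micron" across the articles
print(round(w, 1))  # fitted slope
```

When the dataset no longer fits in DRAM, the one-pass job pays the storage latency once, while the iterative job pays it on every one of its passes, which is exactly why the iterative case pushes people toward in-memory and SSD-backed clusters.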
TPM: So let’s get into some nuts and bolts here. A few years ago, a Hadoop node might have been a two-socket Xeon server using six core processors with 64 GB of memory, a slew of SATA disks, and crappy Gigabit Ethernet. It might have had 12 cores and 12 disk drives for a balance between compute and storage. So what does a Spark node and network look like?
Brad Spiers: You are spot on with the Hadoop configuration from the past. People across the industry are trying to figure out what the right configuration is for Spark, and people are calling us left and right, and we are also doing this ourselves, using Spark to help improve our own chip yields using SSD-enhanced Spark clusters.
Once you add SSDs, you can support a higher powered CPU, and you need that extra performance to do transformations on the data. Before, you didn’t need that many cores because they were essentially sitting idle and it was all about the disk drives. But with Spark, as you step through the machine learning algorithms, you create new features in the data and to do that you would do some computationally intense transformations.
The other piece is that you tend to buy a higher powered Intel CPU, and coupled with that you need to upgrade your network. Anyone who has built their Hadoop cluster with Gigabit Ethernet has essentially locked that data into that cluster and prevented themselves from getting the insights from that data. And the whole reason they bought the cluster to begin with is to get insight from the data.
It used to be when I was buying systems for banks, we had local disk that was composed of multiple drives and a host bus adapter. Banks have thrown out all of those RAID cards and replaced them with an SSD because they are seeing that the failure rate of an SSD is 0.1 percent year on year, which is so much lower than hard drives that they see a massive cost benefit compared to disks. And it helps because you have all of the local IOPS, so you can cope with the boot storm and shutdown storm as the day begins and ends.
TPM: So what is the recommended configuration for a Spark cluster? Do you jump to InfiniBand?
Brad Spiers: Ethernet with RoCE is what people will use; 10 Gb/sec is the absolute minimum, 25 Gb/sec is better, and the market is going in that direction. The other thing is that you tend to buy a number of SATA drives – maybe 8 to 24 per box – and you bump up the memory quite a bit, with 256 GB and even 512 GB as the new minimum. What people are seeing is that the benefit they get out of the DRAM is much larger than buying a bigger box with more cores.
TPM: By the way, that has always been the case in transaction processing environments, but processor makers didn’t want you to think about that. Throughout history, memory upgrades have been cheaper than processor upgrades to deliver a certain kind of performance.
Steve Pawlowski: The issue now is system architects are looking at the tradeoffs between memory capacity and memory bandwidth, and in many cases, the performance could be substantially improved with smaller capacities and much higher bandwidth.
TPM: Knights Landing is an example of this, getting high bandwidth memory much closer to the compute, or HBM memory on the “Pascal” Tesla GPU accelerator. Do you anticipate there being a regular Xeon with that style of on-package memory? You can’t do NUMA on such a machine.
Steve Pawlowski: Generally speaking, the processor is always going to be constrained when it comes to memory bandwidth, and adding co-processing with FPGAs on the fabric will make it worse, so getting more bandwidth to that compute complex would be of paramount importance.
TPM: Is this kind of architecture, where there is fast memory very close to the processor, going to become normal? I made a joke a while back that the future of compute was something that looked a lot like a graphics card with much better packaging and would look a lot less like a traditional server motherboard as we know it.
Brad Spiers: That’s a pretty good joke, and with the video cards it is just a lot of regular computation, but I think that your intuition is correct in that the memory and processing need to be closer and we need to learn from the lessons of exascale and reduce the number of picojoules per bit required to move data from memory to the processor. Hybrid Memory Cube reduces that quite a bit. (Pun intended.) Even though HMC is not JEDEC compliant, it does have a large amount of bandwidth and very low power per bit per fetch.
TPM: You are on your second generation of HMC, and I am wondering if anyone is putting this on processors? Knights Landing is sort of doing it . . . .
Steve Pawlowski: Yes, Knights Landing is doing it, even though it is sort of a proprietary interface, it is using the HMC protocols. The majority of HMC deployments, for both generations one and two, have been more on the networking side in proprietary switches because they are not necessarily pushing capacity but they do need bandwidth for packet inspection and other kinds of things.
Intel has significant market share in compute and storage, but not so much in networking, and that is why you see innovation in networking because this is where they can do it.
TPM: Are there demands for high bandwidth on the storage side as with switches and other network gear?
Brad Spiers: So far, not as much. What we see is that non-volatile DIMMs are of great interest there. NVDIMMs are sampling now at 8 GB. [See our coverage of Micron NVDIMMs here, the future possibilities of Intel’s 3D XPoint DIMMs there, and Diablo Technologies Memory1 flash DIMMs here.]
Steve Pawlowski: I think that with stacked memory close to the processor we are getting bandwidth, but we are paying for it with capacity. The HPC labs can certainly look at something like Knights Landing and if they have X GB of memory they will keep as much of their data in that memory as is possible and they can do a pretty good job of tailoring the application to meet the hardware.
The Holy Grail that we and the industry are pushing for is a high capacity, high bandwidth memory and we are looking at the technologies that would allow that and minimize any performance impact. And that is where I think you are going to see a lot of innovation over the next few years with DRAM and storage-class memories like 3D XPoint and whatever happens to follow that. We will look at applications and workloads and how we can build a class of systems based on those topologies and those stacking methodologies and deliver the best of both worlds.
TPM: That is the real question, and the thing that I have been trying to figure out. How do you build such a system and what will it look like. The “Skylake” Xeon processors and their “Purley” platform present a chance to have a radically different architecture with much tighter coupling of compute and memories of various kinds.
Steve Pawlowski: I think that what you are going to see is that there will be a transition where you can pretty much use any of the technologies, and the system configuration may not be optimal for all of them.
Unfortunately, the way that the industry works, with any major architectural shift, software will lag hardware and that lag can be anywhere from six to ten years. In my previous career at Intel, for instance, the transition was major and it took a good eight years for the software to catch up with the hardware, and some software is still trickling in.
So you have a transition where the new system runs the old software really well and the leading customers start experimenting with the new features. As the technology becomes less expensive and the benefits can be demonstrated, you will see greater and greater uptake. Certainly not everybody is going to buy a platform and jam it with 3D XPoint or whatever. It is going to depend on the size of the workload and the performance expectations and so on.
But once it is out there, people can port more software over and start leveraging the persistency of 3D XPoint, and then these platforms will become more pervasive. At that point, the ratio of standard DRAM to 3D XPoint to NVM-Express flash memory will start to change. I wish I could predict what those ratios will be, but it really depends on the applicability of the technology, how fast the software changes, and how fast the price comes down.
TPM: What is the typical configuration of a server these days, in terms of memory and flash?
Steve Pawlowski: Flash does have a presence and NVM-Express over PCI-Express is certainly being used because it is optimized for load store, but it has never really moved into the memory system domain because of the latencies. Reads are still on the order of microseconds and writes are on the order of almost a full millisecond when you look at full completion times. With 3D XPoint, the performance is getting closer to DRAM – you are probably three orders of magnitude better on write completion times and on reads you are getting closer and closer to DRAM.
The thing is, performance will always sell and that is something that Intel has proven time and time again.
If you put flash in a system and you don’t have enough DRAM in front of it to hide the latencies that you are looking at, the performance degradation of the application will increase significantly. That is why people have been working at putting flash into the hierarchy of storage-class memory rather than as a storage device per se. It was really the latency that limited it, but when you get to 3D XPoint, that gives you greater flexibility as a systems designer to change the equation of how much DRAM and how much storage-class memory. On a cost per bit basis, you can have a lot more memory for the same money and have a lot more data resident and you are not doing that data movement between the memory and the storage domains.
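Pawlowski’s latency figures can be put on one scale with some back-of-envelope arithmetic. The roughly 100 nanosecond DRAM access time below is an assumed ballpark for illustration, not a number from the conversation; the flash figures are the ones he quotes:

```python
# Back-of-envelope comparison of the latency gaps Pawlowski describes.
# DRAM_NS is an assumed ballpark; the flash numbers (reads on the order
# of microseconds, writes near a millisecond at full completion) come
# from the interview.

DRAM_NS        = 100        # assumed DRAM access latency, ~100 ns
FLASH_READ_NS  = 100_000    # ~100 microseconds
FLASH_WRITE_NS = 1_000_000  # ~1 millisecond full completion time

print(FLASH_READ_NS // DRAM_NS)    # flash read vs. DRAM access
print(FLASH_WRITE_NS // DRAM_NS)   # flash write vs. DRAM access

# "Three orders of magnitude better on write completion times" would
# put a 3D XPoint-class write at roughly:
print(FLASH_WRITE_NS // 1000)      # ~1 microsecond
```

A gap of three to four orders of magnitude is why flash needs a DRAM cache in front of it to hide latency; closing most of that gap is what makes a storage-class memory tier plausible.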
TPM: What kind of density are we talking about? If you are trying to cram everything into a 512 GB or 1 TB system today with DRAM, are we going to see 10 TB systems using a mix of DRAM and 3D XPoint?
Steve Pawlowski: We think it is reasonable to see systems with 5 TB to 10 TB of memory over the next several years on a two-socket configuration. I don’t think that’s unrealistic. The ratio is usually 80-20, and it depends on the application and the system balance.
TPM: Will the advent of 3D XPoint put less pressure on memory manufacturers like Micron to drive down the price of DRAM? By adding 3D XPoint to the system, server makers can drive up addressable memory capacity and, because of the lower cost of 3D XPoint, drive down the overall price of memory.
Steve Pawlowski: No. That is a rational assumption, but I don’t think that is what will happen. Quite honestly, when you look at the performance of DRAM, there is still nothing better out there. Not everybody will have a 3D XPoint architecture; there will still be a portion of customers who are dependent on DRAM.
TPM: As part of that memory change, I presume that something like HMC will also have a place in the system, too?
Steve Pawlowski: Not in all systems, and I am going to be a little bold here. This is all about data movement and the energy of moving those bits back and forth. One of the keen observations that we had when we were doing research at Intel was this: With the targets the industry has given us, you are looking at roughly 30 picojoules per operation across the system – that includes the cooling system, the storage, the network and everything else.
From that, the compute is really such a small amount and it is really about how you get data to the compute. Hybrid Memory Cube is about getting memory and compute closer and closer together, and with the first generation we saw that data movement go from 4 picojoules per bit down to a half or a third of a picojoule per bit. So the next thing to consider is putting logic on that compute and stacking up memory, and this becomes a basic building block for everything we build going forward. This is an area where I can see a trend growing over time.
Now, instead of buying a separate CPU chip and a separate memory chip and a separate networking chip, you basically buy these building blocks and have a way to interconnect them with a scalable fabric. And by the way, your storage would be architected the same way, with 3D XPoint and logic connected to the same network. You have the means to scale these blocks in a meaningful way, minimize the energy, and increase the density of the compute.
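The energy numbers Pawlowski cites translate directly into watts spent moving data. The sketch below assumes a 1 TB/sec memory bandwidth purely for illustration; the 4 picojoule and half-picojoule per bit figures are the ones he quotes for pre-HMC links versus first-generation Hybrid Memory Cube:

```python
# Data-movement power math behind Pawlowski's numbers: energy per bit
# times bandwidth gives the watts spent just moving data, before any
# computation happens. The 1 TB/s bandwidth is an assumed figure.

def movement_watts(bandwidth_bytes_per_s: float, pj_per_bit: float) -> float:
    """Power (W) consumed moving data at the given rate and energy cost."""
    bits_per_s = bandwidth_bytes_per_s * 8
    return bits_per_s * pj_per_bit * 1e-12  # picojoules -> joules

TB = 1e12
print(movement_watts(1 * TB, 4.0))  # pre-HMC link at 4 pJ/bit
print(movement_watts(1 * TB, 0.5))  # first-gen HMC at ~0.5 pJ/bit
```

At that assumed bandwidth the drop from 4 pJ/bit to half a picojoule cuts data-movement power from 32 W to 4 W, which is why stacking memory on logic looks like a building block rather than an optimization.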
TPM: Will these building blocks interconnect using silicon photonics?
Steve Pawlowski: It depends on the distances between components and the energy trade-offs you are going to make. A lot of people have told me that with silicon photonics we will be able to get down to a half or a third of a picojoule per bit because we will be signaling at terabits per second with wave division multiplexing. But when you look at what somebody is going to need going from chip to chip, it is going to be a long time before we can get down to those power levels. So I think that on package, signaling will be electrical for a long time. As you get off package, it will depend on the distance.
We did a presentation back at Intel in 2000 about photonics, and it kind of reminds me of fusion in that it is always 20 years out. Here we are, 15 years later and it is still 20 years out.
I do think that companies have driven down the costs so much that it is becoming an attractive alternative in the datacenter and someday in rack-to-rack communications, and possibly inside the systems. I think I will have been retired for a long time before we have full optical on a standard server platform. That’s just a prediction, and something could come along and change that. But every time the optical engineers have come in, the electrical engineers have come back with a new approach that has allowed us to scale up the electrical interconnects.