Flash Disruption Comes To Server Main Memory
August 5, 2015 Timothy Prickett Morgan
Flash memory is cheap, but it isn’t fast, at least not by the standards of DRAM. But Diablo Technologies, an upstart in the memory arena that has been making some waves in recent years, thinks that putting NAND flash into server main memory slots and actually addressing it, at the bit level like memory and not at the block level like regular NAND flash, is going to be extremely disruptive. Much more so than the Memory Channel Storage that Diablo co-developed with SanDisk and IBM, and possibly as disruptive as the 3D XPoint memory from Intel and Micron Technology that will ship next year.
The idea in any system design, as we discussed at length last week, is to get bigger chunks of memory – either volatile or non-volatile – very close to the CPU. This is what Memory Channel Storage, which has been shipping for about a year now, does.
Diablo created the chipset that converts the DDR3 protocol used for server main memory for the past several years into the SATA interface used by NAND flash chips. SanDisk, through its partnership with Toshiba, is a big manufacturer of flash memory and created DDR3 DIMM form factors that put together the Diablo chipset, its own flash chips, and some additional software from itself and Diablo. This flash DIMM was sold under the ULLtraDIMM brand by SanDisk and certified on servers from Supermicro and Huawei Technologies; it also came in a variant for high-end IBM/Lenovo System X6 servers, called eXFlash, that had some IBM microcode and software improvements to the SanDisk DIMM that made it distinct. SanDisk was working on other distribution deals for Memory Channel Storage as far as we know, but has not announced any publicly. The company had exclusive manufacturing rights to the first generation of flash memory using Diablo’s chipset, but does not have that deal with the new – and potentially more significant – development from Diablo.
With Memory Channel Storage, even though the flash memory is on a DRAM DIMM form factor, the BIOS updates to the server and the drivers added to the system make it look like a block device, like a PCI-Express flash card or an SSD. Just like these other flash devices, the data on an ULLtraDIMM or eXFlash stick doesn’t get flushed when the power goes off. The difference is that this flash DIMM is plugged into the memory bus and so the bandwidth is very high and the latency is very low – something like a 3.3 microsecond latency for reads, which is many times faster than reads from a flash card plugged into the PCI-Express bus. At first, Diablo and SanDisk did not reveal the write latencies, but did say that they are comparable to those of other flash devices, which is more indicative of the underlying flash than of the interface into it. They subsequently pegged it at 150 microseconds. Each flash memory stick could handle about 140,000 IOPS of random reads and about 44,000 IOPS of random writes; sequential reads delivered 880 MB/sec of bandwidth and sequential writes about 600 MB/sec. That’s not bad for a device that only burns about 9 watts at idle and around 12.5 watts when humming along.
The initial Memory Channel Storage sticks came in 200 GB and 400 GB capacities and the word on the street was that the plan was to expand that up to 800 GB and 1.6 TB as the market took off. That has not happened yet, and it may never happen because capacity is not as important as proximity to the CPU. That said, Jerome McFarland, principal product marketer at Diablo, tells The Next Platform that SanDisk and Diablo still plan to make and sell this DDR3 flash DIMM that looks like storage. Perhaps importantly, he adds that the lawsuit filed by memory maker Netlist against SanDisk and Diablo, contending that Memory Channel Storage violated some of Netlist’s patents, has been largely settled, with the courts saying that there is no infringement. The lawsuit curtailed adoption of Memory Channel Storage. Diablo is not saying how many customers are using it, but we know for a fact that Wall Street was very excited about this technology from the get-go for all kinds of low latency, high throughput jobs.
With the Memory1 flash-based memory that Diablo is announcing today, a number of different things change.
For one thing, Diablo is making the Memory1 sticks itself, using its own chips, microcode, and software. Diablo intends to sell them directly to hyperscalers and other large customers who do their own component shopping in volume (even if they don’t actually build their own machines as many think they do) as well as indirectly to customers through the big system makers (Hewlett-Packard, Dell, Lenovo, Supermicro, and so forth) and the rapidly rising original design manufacturers (Quanta, WiWynn, StackVelocity, Inspur, and others).
The big change with Memory1, however, is that although Memory1 sticks are based on NAND flash, they look and feel and act like main memory to the system, albeit slower if fatter main memory. The data that is housed on Memory1 flash DIMMs is not persistent, as is the case with DRAM DIMMs, and this is done intentionally, according to McFarland, so that it looks and behaves like main memory and the BIOS in the servers can more easily interact with it. (This may change in future releases of Memory1, where data can be made persistent or not, depending on BIOS and application settings.) Diablo has worked with American Megatrends to create a modification to its BIOS that is available to anyone who wants to use Memory1, a far simpler process than was the case with ULLtraDIMM and eXFlash.
The Memory1 flash sticks use the faster DDR4 interface that is employed on the current “Haswell” and impending “Broadwell” families of Xeon E5 and E7 processors, which offer higher bandwidth than the DDR3 interface. DDR4 memory is also available in IBM’s Power8 servers and with Cavium Networks’ ThunderX ARM server chips, but for now McFarland says that Diablo is focusing on supporting Intel Xeons and looking at its options for other architectures such as Power, Sparc, ARM, and Opteron. Diablo will be initially selling Memory1 sticks in 64 GB, 128 GB, and 256 GB capacities, which is much less dense than the Memory Channel Storage block devices. But the important thing to remember is that for most servers, 16 GB DRAM memory sticks are the sweet spot in terms of capacity, 32 GB DRAM sticks are a bit pricey but more common on big fat boxes for virtualization and in-memory processing, and 64 GB DRAM sticks, while technically available, are scarcer than hen’s teeth and prohibitively expensive even if you can find them.
Diablo is not giving out pricing on its Memory1 flash memory or any specific feeds and speeds on it yet, and is only talking in generalities until it becomes generally available. (Key hyperscalers, large enterprises, and server OEMs and ODMs already have parts to play with.) But the chart above has some strong hints.
McFarland says that an 8 GB DRAM memory stick is actually more expensive per gigabyte than a 16 GB DRAM stick; depending on the source and the volume bought, an 8 GB DRAM stick costs between $10 and $12 per GB, while a 16 GB DRAM stick costs between $8 and $10 per GB. The fatter 32 GB DRAM stick costs somewhere between $16 and $20 per GB, and if you can find a 64 GB DRAM stick, it would cost many tens of dollars per GB. (It is about $41 per GB in the chart above, if that is to scale.) If you get out the ruler on that chart above, it looks like Diablo will charge around $4 per GB for a 64 GB Memory1 flash stick, around $5 per GB for a 128 GB Memory1 stick, and around $10 per GB for a 256 GB Memory1 stick.
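To make those per-gigabyte figures easier to compare, here is a rough sketch that converts them into per-stick prices. It is only back-of-the-envelope arithmetic: the DRAM numbers are the midpoints of the ranges quoted above (our assumption), the Memory1 numbers are eyeballed from the chart rather than official pricing, and the names in the code are ours.

```python
# Dollars per GB. DRAM values are midpoints of the quoted ranges (an assumption);
# Memory1 values are estimates read off the chart, not published pricing.
DRAM_PER_GB = {8: 11.0, 16: 9.0, 32: 18.0, 64: 41.0}
MEMORY1_PER_GB = {64: 4.0, 128: 5.0, 256: 10.0}


def stick_prices(per_gb):
    """Return {capacity_gb: estimated price of one stick in dollars}."""
    return {cap: cap * price for cap, price in per_gb.items()}


dram_sticks = stick_prices(DRAM_PER_GB)
memory1_sticks = stick_prices(MEMORY1_PER_GB)

for cap, price in dram_sticks.items():
    print(f"DRAM    {cap:>3} GB stick: ${price:,.0f}")
for cap, price in memory1_sticks.items():
    print(f"Memory1 {cap:>3} GB stick: ${price:,.0f}")
```

On these assumptions, a 64 GB Memory1 stick would land at roughly a tenth of the price of a 64 GB DRAM stick.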
There are a bunch of ways of looking at a DRAM to Memory1 comparison. Generally speaking, flash costs about one-tenth as much per gigabyte as DRAM. It also burns one-third the watts per gigabyte of DRAM and has about ten times the storage density, as measured in gigabytes per square inch.
But there are limits to main memory. Current Xeon processors have 46-bit physical and 48-bit virtual addressing, and that is not going to change with the future “Broadwell” processors due early next year but could be bumped up by two bits in the “Skylake” generation coming out a few years hence. So there is a limit to how far any SMP or NUMA machine can push up main memory (whether it is based on DRAM or flash or a mix), and that limit is 64 TB.
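The 64 TB ceiling falls straight out of the 46-bit physical address width, counting in binary terabytes. The short sketch below works it out and also shows what two extra bits would allow, assuming the bump applies to physical addressing; the function name is ours.

```python
def addressable_tb(address_bits):
    """Memory reachable with this many address bits, in binary terabytes (2**40 bytes)."""
    return 2 ** address_bits / 2 ** 40


current_limit = addressable_tb(46)   # today's Xeon physical addressing
with_two_more = addressable_tb(48)   # hypothetical: two more physical address bits

print(f"46-bit physical addressing: {current_limit:.0f} TB")
print(f"48-bit physical addressing: {with_two_more:.0f} TB")
```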
Diablo is recommending that, for performance reasons, customers have at least 10 percent of the capacity in their machines coming from DRAM, and the mix will rise and fall as workloads dictate. A two-socket server will be able to have up to 4 TB of Memory1 flash main memory, and companies will not have to make any changes to the server other than the AMI BIOS update (presumably Phoenix BIOS support is coming); that means no changes to the operating system and no changes to the application. The add-on software from Diablo will shuffle the hottest data to DRAM and the coolest data to flash, automagically and transparently, but it seems likely there will be a software developer kit of some sort to let techies pin data themselves from within their applications if they so choose.
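Diablo has not said how its software decides what counts as hot or cool, so the following is only a toy illustration of the general idea and not its actual algorithm: rank pages by how often they are touched and keep as many of the hottest ones as will fit in DRAM. The page names, access counts, and function are all hypothetical.

```python
def place_pages(access_counts, dram_pages):
    """Split page ids into (dram, flash) sets, hottest pages going to DRAM.

    access_counts maps a page id to how many times it was touched;
    dram_pages is how many pages fit in DRAM. Ties break on page id.
    """
    ranked = sorted(access_counts, key=lambda page: (-access_counts[page], page))
    return set(ranked[:dram_pages]), set(ranked[dram_pages:])


# Hypothetical access counts for five pages, with room for two in DRAM.
counts = {"a": 50, "b": 3, "c": 20, "d": 1, "e": 7}
dram, flash = place_pages(counts, dram_pages=2)

print("DRAM: ", sorted(dram))
print("flash:", sorted(flash))
```

A real tiering layer would have to track accesses cheaply and migrate data continuously; this sketch only shows the placement decision at one instant.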
The main thing to consider is that companies will be able to get 256 GB of slower main memory for a slightly higher cost per GB than a 16 GB DRAM stick. They might spend $2,560 for a single Memory1 stick compared to around $144 for a 16 GB DRAM stick, but they will be able to hold 16 times the data on that stick. The price is only scaling up a little faster than the capacity. And Memory1 sticks delivering 64 GB and 128 GB capacities will be considerably more attractive because they are priced a lot lower.
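The claim that price scales only a little faster than capacity can be checked with the two stick prices quoted above. This is a quick sketch using our own variable names and the article's estimated prices, not figures from Diablo.

```python
# Estimated prices from the text: a 256 GB Memory1 stick and a 16 GB DRAM stick.
memory1_price, memory1_gb = 2560, 256
dram_price, dram_gb = 144, 16

capacity_ratio = memory1_gb / dram_gb        # how much more data fits
price_ratio = memory1_price / dram_price     # how much more it costs
per_gb_premium = (memory1_price / memory1_gb) / (dram_price / dram_gb) - 1

print(f"capacity: {capacity_ratio:.0f}x")
print(f"price:    {price_ratio:.1f}x")
print(f"per-GB premium over DRAM: {per_gb_premium:.0%}")
```

So the stick holds 16 times the data for a bit under 18 times the money, a per-gigabyte premium of about 11 percent.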
Based on the specs for Memory Channel Storage, we expect the write latency for Memory1 sticks to be very low, but this does not have to be the case. Diablo might be dialing up the read speed a bit because it expects this flash main memory to be used predominantly as a high-speed, bit-level read cache. The idea is that companies doing in-memory processing with the likes of Memcached, Spark, Redis, and SAP HANA will be able to stretch the main memory in their systems at an affordable price, and even though there may be a performance hit on a chunk of that memory, the fact that it is bit addressable, right on the memory bus, and offering substantially more capacity will more than make up for that performance hit. When pressed for actual benchmark results showing such data, McFarland said that Diablo was not yet ready to make such disclosures, but would be in a position to do so when Memory1 ships in volume in the fourth quarter.
The idea is that for these workloads that are more sensitive to memory capacity than memory bandwidth, Memory1 can deliver up to four times the effective DRAM in the system, burn 70 percent less power per gigabyte for the memory, and perhaps knock down the number of servers by a factor of ten. This latter bit is the big payoff.
The Memcached use case is an obvious one, says McFarland. Memcached front-ends a lot of the databases that comprise modern Web sites, but main memory is expensive, particularly for fat-node servers that could have a high memory footprint and thereby reduce cache misses. So what ends up happening is that companies tend to use 128 GB main memory in two-socket server nodes to run Memcached. That usually means a full set of data cannot be cached on the server nodes in the Memcached cluster and therefore the network becomes the bottleneck for Memcached performance because fragments of the full dataset are scattered around nodes. But by adding in a mix of Memory1 storage to the server nodes, each node could have its main memory boosted to 1 TB for a lot less money than it would cost to do this with DRAM and a full dataset could reside on each node. This means dropping cache misses and less time on the network in the cluster, and therefore much better Memcached performance. How much better performance, Diablo is not saying. But it has to be significant enough to bother, and the world likes 10X improvements in price/performance.
In the Internet search example that Diablo put together – which is a thought experiment not a benchmark test – the company actually put some numbers on it:
By adding 1 TB of Memory1 flash memory to server nodes with 128 GB of DRAM memory, Diablo says that it could reduce the server farm for a massive search engine indexing operation from 100,000 machines down to 10,000. (This means you, Google and Microsoft.) By doing so, Diablo reckons it could drop the capital expenses on the servers to $80 million for those 10,000 machines, with about $5,000 per node going for its Memory1 storage. (That seems to be eight Memory1 sticks at 128 GB, which is consistent with the pricing we worked out above.) The search engine operation would burn a lot less electricity, too.
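The numbers in that thought experiment can be cross-checked against the per-gigabyte estimate from the chart. The sketch below is our own arithmetic on Diablo's figures: the $5 per GB price for a 128 GB stick is our chart reading, and the eight-stick configuration is our inference, not something Diablo has confirmed.

```python
NODES_AFTER = 10_000
CAPEX_AFTER = 80_000_000        # dollars, Diablo's figure for the smaller farm
MEMORY1_PER_NODE = 5_000        # dollars, Diablo's figure

STICKS_PER_NODE = 8             # our inference
STICK_GB = 128
EST_PRICE_PER_GB = 5.0          # our estimate read off the chart

cost_per_node = CAPEX_AFTER / NODES_AFTER
memory1_share = MEMORY1_PER_NODE / cost_per_node
memory1_gb = STICKS_PER_NODE * STICK_GB
est_memory1_cost = memory1_gb * EST_PRICE_PER_GB
total_memory1_spend = MEMORY1_PER_NODE * NODES_AFTER

print(f"cost per node:        ${cost_per_node:,.0f}")
print(f"Memory1 share:        {memory1_share:.1%}")
print(f"Memory1 per node:     {memory1_gb} GB, est. ${est_memory1_cost:,.0f}")
print(f"Memory1 across farm:  ${total_memory1_spend:,.0f}")
```

Eight 128 GB sticks at the estimated price come to $5,120, close to the $5,000 per node Diablo cites, and Memory1 would account for well over half of the $8,000 spent on each node.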
The benefits of Memory1 are not restricted to the applications outlined above. All relational databases and other in-memory and distributed databases could see a performance boost.
The disruption in the memory market from Memory1 could be quite large, if the performance plays out the way that Diablo is hinting it can. What we do know is that the total addressable market for server memory is large, and that it could be expanded a bit by flash main memory. Intel and Micron certainly think they can expand the memory market by creating something that is halfway between DRAM and NAND, and there is no reason to believe Memory1 can’t be just as disruptive as 3D XPoint and a number of other memory technologies that will no doubt be commercialized in the coming years.
Judging from the addressable market figures that Diablo has put together with its own data and data from Intel and market researcher Gartner, the memory DIMM market (regardless of technology) is set to grow a lot faster than the server market overall, and if the benefits are what Diablo says they are, then Memory1 could have an adverse effect on server shipments. But companies usually also want to have bigger compute complexes to run larger or more jobs, so this effect might be mitigated. Server node count is really driven by and limited by the budget, which is flat to down a little at large enterprises in the aggregate, which goes up a little at HPC centers, and which skyrockets at hyperscalers and cloud builders.
The biggest effect of Memory1 might be on Hewlett-Packard and SGI, which have pinned the strategies of their big NUMA machines on being able to cram a lot of main memory into a single system image. With a two-socket Xeon E5-2600 server having a 4 TB memory footprint and a four-socket Xeon E5-4600 having an 8 TB footprint, their NUMA DRAM machines could have less of an advantage. And if Diablo works with Intel and server makers to create variants of its technology for Xeon E7 machines with four or eight sockets, it could have as much as a 16 TB footprint for memory in a much less costly and smaller machine. It will be interesting to see how this all plays out.