Intel Shuts Down Lustre File System Business
April 20, 2017 Timothy Prickett Morgan
Chip maker Intel is getting out of the business of trying to make money with a commercially supported release of the high-end Lustre parallel file system. Lustre is commonly used at HPC centers and is increasingly deployed by enterprises to take on their biggest file system jobs.
But don’t jump too far to any other conclusions. The core development and support team, minus a few key people who have already left, remains at Intel and will be working on Lustre for the foreseeable future.
Intel quietly announced its plans to shutter its Lustre commercialization efforts in a posting earlier this week, and it is not making any further statements or answering any questions about its plans for Lustre beyond what is in that post. The timing is interesting, and the move unwinds some of the efforts that Intel had made in the enterprise software space, including buying Whamcloud for its Lustre distribution in July 2012 and rolling its own variant of Hadoop in February 2013. The change comes during the same week that the Intel Developer Forum, the annual summer partner extravaganza that Intel hosted for almost two decades, was canceled, and just after Hadoop distributor Cloudera – in which Intel invested $740 million for a 22 percent stake in March 2014 – announced that it was going public.
The bean counters are clearly taking out their red pencils as Intel prepares for an assault on its hegemony in datacenter compute from IBM with Power9, a resurgent AMD with its Naples Opterons, and the efforts of Applied Micro, Cavium, and Qualcomm with their respective X-Gene, ThunderX, and Centriq processors. The word on the street is that the Lustre business was actually profitable, if only moderately so, but Intel’s presence as a software peddler no doubt confused some customers and irritated the companies that had been selling Lustre file systems to HPC and enterprise shops long before Intel decided it needed to bring order to this sometimes rambunctious parallel file system community.
As far as we know, the Lustre business inside of Intel had about 100 employees, with the 15 core developers led by Peter Jones, the Lustre engineering manager at Intel who managed the support and release rollups at Sun Microsystems, Oracle, and Whamcloud as each took control of the Lustre file system in turn. Another 15 people are involved in supporting Lustre for customers, and they are also staying on at Intel. Just as Intel pays a core set of developers to work on the Linux kernel because it is in its own enlightened self-interest, Intel seems content to continue to pay for the development and support of Lustre. This is good news for the Lustre community, particularly since Intel has been doing a lot of the heavy lifting on Lustre for years now. The remaining 70 employees are looking for jobs, including Brent Gorda, the president and CEO at Whamcloud, who way back in 2001 was the future technologies group lead at Lawrence Berkeley National Lab and who paid Hewlett-Packard to do further development on Lustre because of the scalability limits of IBM’s Global Parallel File System (GPFS, known as Spectrum Scale these days) as well as XFS and other file systems.
Taking On The Exascale Challenge
The Lustre project predates this investment, and it has a long and winding history, starting with its founding by Peter Braam when he was a researcher at Carnegie Mellon University. In 2001, Braam set up Cluster File Systems to provide commercial support for Lustre, the revenue from which paid for further development of the parallel file system. In September 2007, Sun Microsystems bought CFS as it was trying to expand its own HPC business and to marry some of the capabilities of the Zettabyte File System (ZFS) created by Sun to the parallel scalability and access of Lustre. (This work continues to this day, as we reported back in January.) Lustre changed hands in January 2010 when Oracle closed its $7.4 billion acquisition of Sun Microsystems, and by July of that year, Gorda had founded Whamcloud with $10 million in funding. Eric Barton, who worked at Lawrence Livermore National Laboratory, where Lustre cut its teeth on nuclear arms data for the US Department of Energy, and who became a principal engineer for Lustre at Sun after the CFS acquisition, joined Whamcloud, as did Robert Read, who was in charge of the Lustre work at Oracle. This annoyed Oracle, which turned around and sold the Lustre trademark to Xyratex, now owned by Seagate Technology, which created its own commercially supported Lustre distribution.
Back then, according to people we have spoken to, there was a real danger that the Lustre community would fork into two camps, with Whamcloud on one side and Xyratex on the other, each pushing Lustre in a slightly different direction. So it is perhaps fortunate for Lustre that Intel came along with its own distribution and, because of its might and muscle, got everyone behind a single code base. Cray, DataDirect Networks, Silicon Graphics (before its acquisition by Hewlett Packard Enterprise last year), HPE itself, Dell, Seagate, Sugon, Lenovo, and Bull all sell Lustre file systems substantially based on the Intel Enterprise Lustre (IEL) distribution.
To be sure, many of these companies did a lot of their own work on Lustre after they got the code from Intel, as Robert Triendl, senior vice president of sales, marketing, and field services at DDN, explains to The Next Platform. Triendl bristles a little at the idea that a piece of software as complex as Lustre can be called an appliance, even though all of the suppliers of commercial Lustre file systems talk about it that way and even though DDN’s own Exascaler setup is called an appliance. While companies like DDN can make Lustre easier to install and maintain, it is by its very nature a difficult beast to tame, as are, we would point out, the large scale compute clusters and networks that parallel file systems serve.
According to Triendl, DDN does not fork from the IEL release; rather, it creates a private branch of the code when it receives it and then spends three to six months hardening it and doing quality assurance testing on it. During that time, other features come out of the Lustre community, and these are backported to the earlier IEL release by DDN’s own software engineers and then pushed back to Intel for code review so they can be included in future IEL releases. It is this modified IEL code that is married to DDN’s own management, high availability, and support tools to create Exascaler. The experts at DDN provide Level 1 and Level 2 support for Exascaler customers, and Intel provides Level 3 backup support for when things get over their heads. Triendl says that of the thousands of support cases that DDN manages – and among those peddling Lustre file systems, only Cray might handle more – only somewhere between 3 percent and 4 percent are passed up to Intel these days for Level 3 handling.
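Stripped to its essentials, the process Triendl describes is a classic vendor-branch workflow: start from the upstream release, harden on a private branch, and cherry-pick community fixes onto it for eventual resubmission upstream. A rough sketch in git – the repository layout and commit contents here are purely illustrative, not DDN’s actual setup:

```shell
#!/bin/sh
# A minimal sketch of a vendor-branch workflow like the one DDN describes:
# take an upstream release, harden it on a private branch, then backport
# upstream fixes. Repository names and commits are hypothetical.
set -e
work=$(mktemp -d)

# "upstream" stands in for the Intel Enterprise Lustre (IEL) release tree.
git -c init.defaultBranch=main init -q "$work/upstream"
git -C "$work/upstream" -c user.email=dev@example.com -c user.name=Dev \
    commit -q --allow-empty -m "IEL release"

# The vendor clones the release and starts a private hardening branch.
git clone -q "$work/upstream" "$work/vendor"
git -C "$work/vendor" checkout -qb vendor-hardening
git -C "$work/vendor" -c user.email=dev@example.com -c user.name=Dev \
    commit -q --allow-empty -m "vendor QA hardening"

# Meanwhile, a community fix lands upstream...
git -C "$work/upstream" -c user.email=dev@example.com -c user.name=Dev \
    commit -q --allow-empty -m "community fix"

# ...and the vendor backports it onto the hardened branch as a discrete
# commit, ready to be pushed back upstream for code review.
git -C "$work/vendor" fetch -q origin HEAD
git -C "$work/vendor" -c user.email=dev@example.com -c user.name=Dev \
    cherry-pick --allow-empty FETCH_HEAD >/dev/null
git -C "$work/vendor" log --oneline
```

The point of the private branch is that the months of QA work stay isolated from upstream churn, while each backport arrives as a discrete, reviewable commit that can be submitted back for inclusion in future releases.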
This is probably what the situation looks like for the other Lustre distributors.
We think – and this is just based on gut feeling – that Intel has been running into issues as it competes with its partners in big HPC deals. The way that large scale systems are procured these days, someone has to be the prime contractor on the system, and it is clear from Intel’s acquisition of the interconnect business from Cray and its investment in Lustre that Intel wanted to push HPC technologies at higher volume and more consistently, democratizing HPC as Dell likes to put it. The OpenHPC software stack and the Scalable System Framework hardware stack illustrate this desire. But the big HPC vendors that peddle Lustre, such as Cray, Seagate, and SGI/HPE, do not want to have to compete against Intel for customers. Intel has to leave these companies enough room to innovate and differentiate; they don’t want to be mere resellers cobbling together Intel technologies for much effort and little profit. Contrary to what many believe or hope, the margins in HPC are not very good and never have been, just as the margins have never been good with hyperscalers and cloud builders. The biggest customers demand more and pay less, and it is always the enterprises, which are allergic to risk, that provide some black ink on the bottom line.
This is why the expansion of Lustre into enterprise accounts, first spearheaded by Sun, then Whamcloud, and then various Lustre resellers before and after Intel got involved, is so vital to the Lustre ecosystem. Assuming that the US government under the Trump administration is not going to fund a broader HPC agenda – and we can’t really tell yet what will happen here – and assuming ongoing pressure to economically justify ever-larger systems, Lustre has to be something more than a file system that only the propellerheads of the world can deploy. This is a challenge, but then again, so are Hadoop and Spark.
In fact, for the past several years, and under Intel’s guidance, several Lustre resellers have been trying to push Lustre underneath the Hadoop framework, putting a shim between the MapReduce layer of Hadoop and the interfaces of the Hadoop Distributed File System (HDFS), allowing Lustre to stand in for HDFS while also doing straight-up Lustre tasks. Such an approach allows for a convergence of traditional HPC workloads and data analytics workloads on the same clusters – something many companies and HPC centers want to do to save dough and get better utilization out of their clusters. (Red Hat has similarly been trying to position its Gluster clustered file system as an alternative to HDFS for Hadoop, and IBM has been doing the same with GPFS.) Ironically, Cloudera is rumored to have wanted cash from Intel to certify Intel’s Enterprise Lustre software as a supported part of the Cloudera Enterprise Hadoop stack and to integrate it with Cloudera Manager, just as EMC paid Cloudera to do a similar integration with its Isilon storage arrays. For whatever reason, Intel, as a major investor in Cloudera, was not able to just pick up the phone and simply demand, for the good of both parties, that this formal integration happen. So only the open source – and not commercially supported – Cloudera implementation of Hadoop could have Intel Enterprise Lustre as a file system. Yes, this is absolutely idiotic. You would think you would get better customer service for $740 million. . . .
It is hard to say for sure how many Lustre users there are out in the world, given that it is open source, but the people we spoke to who are in the know say that there are definitely thousands of sites, including academic and government HPC centers as well as enterprise customers who use it to underpin their applications. At the low end of the Lustre spectrum, some customers in academic settings support hundreds of terabytes of data, while a few dozen publicly known and secret customers (like governmental security agencies) have hundreds of petabytes of data under management by Lustre. Enterprise customers tend to be at petabyte scale, and you really have to need that scale and the performance that Lustre provides before adopting either Lustre or GPFS.
Thanks to IBM’s long commercial history, its leverage with enterprises, and its partnership with Lenovo, GPFS tends to have around 90 percent of the enterprise space (depending on how you slice up HPC) compared to around 10 percent for Lustre. But IBM has just shifted its pricing model for GPFS from per-client pricing to pricing based on the amount of data under management, and for a lot of customers this means a big price increase – an opportunity for all Lustre players. In the core HPC market, Lustre is on 75 of the top 100 systems in the Top 500 rankings and is used on nine of the top ten systems.
Lustre is pervasive and important, and it will continue to develop, evolve, and be supported by the community of vendors that have backed it, including Intel. To make sure this happens, Intel is essentially donating its Enterprise Lustre distribution to the community, hoping the community adopts it as the codebase for future development, while continuing to pay a core team to help with that effort.