DDN Breathes New Life Into Lustre File System

Lustre has been an essential component of HPC systems for a decade and a half, and has experienced a somewhat turbulent history of shifting ownership followed by uncertain support from various backers as an open source project.  Now, with DataDirect Networks acquiring Intel’s Lustre File System business and associated assets including the development team, what does the future hold for the file system and its users?

With more than 60 out of the top 100 supercomputers using the Lustre file system, the future of the project might have seemed to be secure, but broader adoption beyond the core HPC market has been slow. This seems to have contributed to Intel’s decision to give up trying to sell its own commercial distribution of Lustre in April last year, which cast doubts over the future of the project. Without the support and guidance of major backers like Intel, open source tools run the risk of fading away through neglect or forking as rival interests try to pull the project in different directions.

As far as HPC sites are concerned, DDN’s acquisition of the Lustre assets is encouraging, as it means the project now has the backing of a firm that lives and breathes scale-out storage systems, already uses Lustre in its own EXAScaler arrays, and is committed to continuing support and development in future.

DDN has basically taken everything to do with Lustre out of Intel’s hands, including the development team, the support team, the existing end user and partner worldwide support contracts, plus various assets such as the Jira repository and bug-tracking system for the code itself. In a nice twist, the Lustre team is being retained and operated as an autonomous division within DDN, which will be known as Whamcloud – reviving the company name for the commercial supporter of Lustre that existed before its acquisition by Intel back in 2012.

“DDN has been working very closely with that team for many years, even before their Whamcloud days, so when we took over Intel’s Lustre assets, we took over that Whamcloud domain name, and it seemed an obvious thing to do to place them back under that name again. It’s also something that the Lustre community recognises,” James Coomer, DDN’s vice president of product management, tells The Next Platform.

The go-to-market model for Lustre will not change as far as DDN is concerned: it sees its EXAScaler appliances as the simplest and most effective way to deliver a Lustre-based solution to customers, while partners and end users that download the Lustre code and choose to build it themselves can approach the Whamcloud team for paid technical support services.

Now, however, there is the prospect of closer collaboration and feedback between the EXAScaler and Whamcloud teams, according to Coomer.

“We expect some very healthy sort of cross-talk between those two sets of support and engineering communities. The developers on EXAScaler have hands-on experience with very large scale customers and a certain view of productization and simplicity at scale, and that feedback will go into the Whamcloud team, and hopefully feedback will come the other way to help the EXAScaler team improve their stuff,” he says.

And what of DDN’s plans for the Lustre file system in the future? DDN does not have the driving seat to itself, as there are contractual arrangements for feature development between the Whamcloud team and the Open Scalable File Systems (OpenSFS) and European Open File System (EOFS) groups. The members of these groups (DDN is a member of EOFS) coordinate on the long-term stability roadmap for Lustre. The current Lustre roadmap is below:

Nevertheless, DDN said it has strong ambitions to extend Lustre with new capabilities that make it suitable for applications beyond its traditional HPC role, which could broaden its appeal in the wider enterprise market. This means making a Lustre file system easier to deploy and manage, and also steering it towards new markets such as AI and analytics, although HPC remains a core focus.

“So the AI markets appreciate simplicity a lot more, they appreciate starting small. There’s work to be done in integrating these AI workflows, which are rather different from the HPC ones, and containerised workflows as well. We’ve already done a fair amount of that work at DDN already, and we want to continue that work and how Lustre can apply to those new and emerging fields,” Coomer says.

On the enterprise side, hybrid cloud is also being seen as a potential area of opportunity for Lustre, as there is already a cloud distribution of Lustre supporting some production analytics workloads at large scale, and making this work seamlessly with on-prem deployments of Lustre could open up new use cases.

One obvious area for expanding on existing capabilities is in support for flash storage. Lustre was developed with a focus on traditional HPC workloads, and is thus good at delivering a huge chunk of streamed I/O between the compute nodes and a large array of rotating hard drives. However, new workloads such as machine learning and analytics often comprise a more random mix of reads and writes of varying sizes, which a disk-based parallel file system will not handle so well. This is currently addressed by using some form of burst buffer.

Coomer says this is an area DDN is looking at on the roadmap, in particular the transparency of flash pools used as part of a scalable, economically viable capacity environment.

“Particularly with AI, but also for HPC, people want their application to work out of flash, but they also want an economic very scalable capacity option, which means hard drives today and probably for quite a while yet. So making that flash layer transparent is important, and there have been great developments within the Lustre community to support that kind of operation, and we want to bring that to maturity. So yes – pooling, managing pools, transparency of flash and optimisation of flash for these particular use cases,” he says.

Meanwhile, DDN also has a range of GRIDScaler appliances that are based not on Lustre, but on IBM’s Spectrum Scale (GPFS) parallel file system, which raises the question of whether one will eventually make the other redundant. It seems unlikely.

Coomer says that there is no simple choice between the two platforms, in that they both have their respective strengths and weaknesses, and so it would involve “a pretty nuanced conversation” with customers about their requirements before recommending one or the other.

“One example is, Lustre does rather nice quality of service, so it has this facility where you can merge live workloads and give priority to some over others, and there’s something called jobstats, it’s not a great name, but if you’ve got a large production system, jobstats allows you to see who is doing what and when. That’s very important, in that you can look at the system and see who is using your metadata services, who is using your I/O services most, and you can rank them and see what’s happening on the system, which makes it easier to debug certain issues,” Coomer says.

However, many of the differences are subtle, and because both file systems have been developing for a long time, they have gained many of the same features, such as snapshots and pools, although these may have been implemented in different ways.

For Lustre users, DDN’s newest acquisition looks set to bring some long-term stability, with the promise of a more rapid development pace for new features and capabilities.
