Like so many tried and true software platforms for high performance computing, the IBM-developed distributed computing management software, xCAT (Extreme Cluster Administration Toolkit), has been powering cluster needs since before the dawn of the new millennium.
Despite its prominence on large-scale supercomputers over the years, this wide adoption was never because of its ease of use, particularly for system administrators. The initial learning curve is steep, and even users who only submit and monitor jobs need some experience before they are productive. Still, xCAT does the job of managing large clusters, IBM ones in particular. However, as Lenovo takes the reins on a number of IBM projects, xCAT is set for some re-invention, particularly on the ease of learning and use fronts, as well as in its ability to allow even greater flexibility in a more open source-oriented environment.
As Luigi Brochard, Lenovo HPC Distinguished Engineer, described at the HPC Advisory Council Swiss conference, development teams are taking a dual-stack approach to integrate the needs of those using commercial and open source stacks. At the core of the open source effort is the management layer, which, for IBM customers, has been rooted in xCAT. That framework is at the heart of almost all IBM systems on the Top 500 list of the world’s fastest supercomputers, but its use is limited to those very large-scale sites because it is difficult to learn and use for both users and system administrators.
Brochard says that the key to the Lenovo strategy for HPC is to open access to mid-size and smaller clusters, which means tackling this critical bit of glue and making it simpler for new users. For these smaller shops, xCAT is far too hefty for the job at hand; they require a leaner, cleaner interface and the ability to snap into other open source tools. Accordingly, Lenovo teams have built a new adaptation of xCAT called Confluent, which targets ease of use and is based on LiCO (Lenovo Intelligent Computing Orchestration), a product similar to xCAT that is used in China for Lenovo’s large-scale clusters.
As you can see, there is still work to be done on the integration front with other schedulers, but Brochard says they are working with early test users and a handful of universities, including Oxford, to create a stable platform that will appeal to smaller-scale HPC cluster admins and their users while offering the robustness required for the large supercomputing sites.
Lenovo is working on extending xCAT for its large-scale customers via the awkwardly titled (as admitted by Brochard, who begged the audience for ideas on a new name) OSMWC, which has all the beefiness of xCAT but with cleaner management interfaces and new integrations coming along. Ultimately, via a partnership with the OpenHPC effort, there will be an OpenHPC stack running on top of xCAT to create an open system management framework for customers at all node counts. Such an effort will include some key new features Lenovo is developing now, including power and energy awareness (via some technology they’ve pulled from Platform LSF, now part of IBM), lightweight virtual HPC, big data and Spark workload support, and the ability to manage the datacenter better as a whole through more comprehensive monitoring.
It can be dizzying to keep up with the projects IBM, and now Lenovo, are creating and blending, especially in this case, where there are multiple names: the Chinese version of xCAT, called LiCO, which then became OSMWC, and now Confluent, which will help Lenovo break into HPC centers with lower node counts. Lenovo has a long hike ahead to capture the swath of the high performance computing market that once went to IBM, but they have undertaken some noteworthy technical efforts to start reeling in new business, particularly on the cluster management front.