Jupyter may have had its beginnings in enterprise data problems, but some in the high performance computing community have been adopting it as the prime platform for both data science and AI at massive scale.
Many supercomputing sites have responded to demand by supporting Jupyter in the last few years, with more expected to follow in the coming years. Among the HPC centers at the leading edge of pulling Jupyter into supercomputing is the National Energy Research Scientific Computing Center (NERSC) at Berkeley Lab, which has been exploring Jupyter as an interface for its Cori supercomputer since 2016, when the second phase of the Cray XC40 system hit NERSC’s floor. The design spec for the all-Intel system was always massive-scale, data-intensive computing, making the center’s Jupyter experiment well-placed.
NERSC workloads include everything from climate modeling and materials science to astrophysics and biomedical applications, with the Cori system taking on these diverse needs from over 3,000 unique users. According to a NERSC report, around 700 unique users each month now use Jupyter on Cori, a 3X jump over the last three years, and around 20-25% of user interaction now goes through the platform. “Jupyter has become indispensable, a primary point of entry to Cori and other systems for a substantial fraction of all NERSC users,” according to Rollin Thomas and Shreyas Cholia.
Jupyter is proving to have a transformative impact on modern computational science by enabling a new integrative mode of interactive supercomputing, where code, analysis, and data all come together under a single visual interface that can seamlessly access powerful hardware resources.
Getting to that point in adoption was not necessarily easy. In 2015, the NERSC team found that users were trying to launch and connect to their own Jupyter notebooks using SSH tunnels on an older system. While that might have been an unwelcome finding at first, NERSC looked at how it could embrace tooling so many users wanted. The result of that exploration was a deployment of JupyterHub, which provides a managed multi-user Jupyter service to enable access to NERSC (and now other) supercomputers. As the report’s authors explain, “it has a highly extensible, deployment-agnostic design built on powerful high-level abstractions (spawners, authenticators, services) and is developed by a robust, broad open source community.” They add that, from a supercomputing center perspective, it provides strategic leverage by supporting any platform for a diverse and demanding user base.
Jupyter is quickly becoming the entry point to HPC for a growing class of users. The ability to provision different resources, and integrate with HPC workload management systems through JupyterHub is an important enabler of easy-to-use interactive supercomputing.
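To make the spawner idea more concrete, here is a minimal Python sketch of how a hub might hand the job of starting a notebook server to interchangeable spawners, one for a shared login-style node and one that goes through a batch scheduler. It is a toy model, not JupyterHub's actual API: the class names `Spawner`, `LocalSpawner`, `BatchSpawner`, and `Hub`, the profile names, and the queue name are all hypothetical, and the use of Slurm's `sbatch` as the workload manager is an assumption made for illustration.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


class Spawner(ABC):
    """Hypothetical stand-in for a hub's spawner abstraction."""

    @abstractmethod
    def launch_command(self, user: str) -> list[str]:
        """Return the command that would start a notebook server for `user`."""


class LocalSpawner(Spawner):
    """Starts the notebook server directly on the node running the hub."""

    def launch_command(self, user: str) -> list[str]:
        return ["jupyterhub-singleuser"]


@dataclass
class BatchSpawner(Spawner):
    """Wraps the notebook server in a batch job (Slurm is assumed here)."""

    partition: str = "interactive"  # hypothetical queue name
    minutes: int = 60               # requested wall time in minutes
    nodes: int = 1

    def launch_command(self, user: str) -> list[str]:
        return [
            "sbatch",
            f"--partition={self.partition}",
            f"--time={self.minutes}",
            f"--nodes={self.nodes}",
            f"--job-name=jupyter-{user}",
            "--wrap=jupyterhub-singleuser",
        ]


@dataclass
class Hub:
    """Maps a user-selected profile to whichever spawner is configured for it."""

    spawners: dict[str, Spawner]

    def command_for(self, user: str, profile: str) -> list[str]:
        try:
            spawner = self.spawners[profile]
        except KeyError:
            raise ValueError(f"unknown profile: {profile!r}") from None
        return spawner.launch_command(user)


if __name__ == "__main__":
    hub = Hub({
        "login": LocalSpawner(),
        "compute": BatchSpawner(partition="debug", minutes=30),
    })
    # Print the command each profile would run; nothing is executed.
    for profile in ("login", "compute"):
        print(profile, "->", " ".join(hub.command_for("alice", profile)))
```

The point of the design is that the hub never needs to know how a server is started. A center could add a new resource type by registering another spawner, leaving site-specific details such as queue names and time limits in configuration rather than in the hub itself.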
NERSC’s JupyterHub deployment has evolved in a series of phases since 2015, beginning as a science gateway that let users run notebooks and kernels with access to NERSC’s file system. This was mostly relevant for smaller analytics and visualization workloads and did not use any of the main supercomputing resources. Later, the team created an SSH-based JupyterHub spawner to launch notebooks on the Cori supercomputer from external hubs. Now, Jupyter is an interface for the HPC center, running on a Rancher Labs-based Docker containers-as-a-service platform and allowing access to compute nodes on Cori. Finally, in 2018, NERSC moved to JupyterLab, which refines workflow management and adds text editors, file viewers, and other features.
In the report, NERSC developers discuss a few use cases of Jupyter and its related projects in the context of geophysical subsurface imaging (using Jupyter in a Docker container with a pre-defined reproducible environment), electron microscope image analytics, and advanced light source tomography.
“Because of mission, design, and technological trends, supercomputers and the HPC centers that run them are still less homogeneous as a group than cloud providers. This means ‘one size fits all’ solutions are sometimes harder to come by,” the NERSC development team says.
“And while providers of supercomputing power want to increase ease of use, they are not interested in homogenizing or concealing specialized capabilities from expert users. Developers working on Jupyter projects that may intersect with HPC especially should avoid making assumptions about HPC center policy (e.g., queue configuration, submit and run limits, privileged access) and seek input from HPC developers on how to generalize those assumptions. As long as Jupyter developers remain committed to extensibility, abstraction, and remaining agnostic about deployment options, developers at HPC centers and their research partners can help fill the gaps.”