Julia Still Not Grown Up Enough to Ride Exascale Train

We’ve been watching Julia, a programming language designed for technical and scientific computing, for a number of years to see if it can make inroads into supercomputing. This week we got a look at how it performs on an exascale system.

The main selling point of Julia has been that it combines the computational efficiency of languages like C and Fortran with the ease of use of languages like Python and R.

Julia is particularly strong in areas requiring high computational power and heavy data manipulation, such as supercomputing, and it has built-in support for parallel and distributed computing. It is open source and achieves its performance through Just-In-Time (JIT) compilation via LLVM.
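
To make the parallel-support claim concrete, here is a minimal shared-memory threading sketch in Julia. This is our own illustration rather than code from the paper, and the function name axpy_threaded! is just an example:

```julia
# A minimal sketch of Julia's built-in shared-memory parallelism.
# Start Julia with, e.g., `julia --threads=8` so multiple threads are available.
using Base.Threads

# y .+= a .* x, with the loop iterations split across the available threads.
function axpy_threaded!(y, a, x)
    @threads for i in eachindex(x, y)
        @inbounds y[i] += a * x[i]
    end
    return y
end

x = rand(1_000_000)
y = zeros(1_000_000)
axpy_threaded!(y, 2.0, x)   # the first call also triggers JIT compilation to native code
```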

One of the reasons it’s had some traction in HPC is that its many scientific libraries and MATLAB-like syntax make it easier for researchers who are not HPC experts to transition. Another notable feature is the ability to call C and Fortran libraries directly, without special wrappers, enabling seamless integration with legacy codebases. It also has native support for a range of parallel computing paradigms, from shared-memory multi-threading to distributed-memory clusters and GPUs, making it versatile across many kinds of computational workloads.
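
As a quick illustration of the no-wrapper interop, here is the classic example from the Julia manual (our own inclusion, not code from the Frontier paper): a C standard-library function is called directly, with no binding generator or glue code.

```julia
# Call a C library function directly from Julia; no wrapper code is required.
# Assumes a Unix-like system where the PATH environment variable is set.
path_ptr = ccall(:getenv, Cstring, (Cstring,), "PATH")
println(unsafe_string(path_ptr))   # convert the returned C string to a Julia String
```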

While this sounds ideal, Julia gets pushed beyond some of its limits at extreme scale, although it’s hard to fault anything for showing some cracks on the U.S.’s first exascale system, the Frontier supercomputer at Oak Ridge National Laboratory (ORNL).

A new breakdown of Julia’s performance on Frontier suggests that, as the paper’s title indicates, Julia is promising but not yet a fully optimized language for end-to-end workflows, at least at that scale.

It does hold up its end of the bargain as a unifying language for end-to-end HPC workflows on Frontier. The problem is that its computational performance lags significantly: the paper outlines a gap of nearly 50 percent compared with native AMD HIP codes on Frontier’s GPUs.

And while Julia scales well up to 4,096 MPI processes and exhibits near-zero overhead in MPI communication and parallel I/O, issues arise at larger scales, and JIT compilation introduces an initial overhead.

The language shows good weak-scaling properties, but the study notes increased variability in time-to-solution when scaling to larger process counts.

On a positive note, its MPI and ADIOS2 bindings demonstrate near-zero overheads, indicating Julia can handle communication and I/O tasks well. And sorry for the “but,” but the study didn’t explore GPU-aware MPI, which could be an avenue for further performance improvement and would lift Julia’s standing at exascale a bit.
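
For readers who haven’t seen Julia’s MPI bindings, a minimal MPI.jl sketch (our own illustration, not code from the paper) looks like the one below. It would typically be launched with MPI.jl’s `mpiexecjl` wrapper, e.g. `mpiexecjl -n 4 julia sum.jl`.

```julia
# Minimal MPI.jl example: each rank computes a local partial sum,
# then Allreduce combines the results across all ranks.
using MPI

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
nproc = MPI.Comm_size(comm)

local_sum  = sum(rand(1_000_000))              # per-rank work
global_sum = MPI.Allreduce(local_sum, +, comm) # reduction across ranks

rank == 0 && println("global sum across $nproc ranks = $global_sum")
```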

That other key feature, Just-In-Time (JIT) compilation, also introduces an initial overhead. That cost is amortized over time, which is fine for long runs, though in an HPC setting it could still be a concern for short, time-critical jobs.
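
The warm-up effect is easy to see for yourself. In this small, illustrative sketch the first call pays the compilation cost and the second does not:

```julia
# The first call compiles the function for the given argument types;
# later calls reuse the cached native code.
sum_of_squares(A) = sum(abs2, A)

A = rand(2_000, 2_000)
@time sum_of_squares(A)   # includes one-time JIT compilation
@time sum_of_squares(A)   # compiled code only; much less time reported
```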

Julia was successful in providing a seamless workflow from computational simulation to data analysis and visualization in Jupyter Notebooks. This flexibility is a strong point in favor of Julia as an end-to-end HPC solution.

All told, Julia shows a lot of promise: it unifies different HPC tasks under a single language and delivers near-zero overheads in communication and I/O.

Ah, but that pesky performance gap and the scalability concerns suggest Julia may not yet be fully ready for exascale computing without further optimization and testing.


6 Comments

  1. Great article! It adds to TNP’s 01/05/22 piece “Julia Across HPC”, where Julia was tested on CPUs and GPUs, and for which the journal article is now open-access at bris.ac.uk. Today’s brand new paper, about Julia on Frontier, will be presented at SC23 in Denver (bottom-left of 1st page), which TNP Experts will once again absolutely need to definitely attend, to bring us all the latest and greatest on HPC recipes, cookware, and master-chef competitions!

    Julia is a great concept for HPC gastronomy in my opinion. It has a decently-sized user-group, good performance, and hardware-agnosticism. The competition is heating-up in this space of effective high-level HPC/AI languages, especially with Mojo’s snaky python-orientation, and it should be interesting to see if the type-inference convergence algorithm 2.0 (or newer) in Julia’s JIT can match or beat explicit static typing in Mojo (where it led to 400x speedup relative to a very naive Python matmul).

    It is notable that in this Julia-Frontier paper, ORNL uses solutions of the Gray-Scott coupled nonlinear reaction-diffusion pair of PDEs to benchmark the system (see the minimal explicit-update sketch after this comment thread). These PDEs produce Turing patterns as seen on the faceplate of the Atos BullSequana XH3000 (2-D version; TNP 02/16/22). But they use explicit time-stepping (“simple forward […] difference”, Section 3.1), where Crank-Nicolson semi-implicit time-stepping with Picard fixed-point iteration may be preferable for accuracy, along with ADI, which does wonders for the spatial discretization (many very fast Thomas algorithms, in parallel).

    AMD could probably do much worse than help improve AMDGPU.jl so that it better matches HIP (2x speedup if I read correctly). The ease with which the 2-PDE Gray-Scott solution could be implemented in Julia (vs a single PDE in HIP) should be justification enough!

  2. In my opinion getting a GPU to work at all using anything but the vendor-supplied compiler is pretty amazing. Julia does this with efficient MPI communication while also supporting an interactive read-evaluate-print loop. Looking at the code in the paper, I’m wondering about the 6-condition if statement in kernel_amdgpu that checks whether a point is on the domain boundary. My understanding is that such conditionals make little difference for CPU calculations because of speculative execution but can significantly slow down GPU processing.

    While it’s admittedly unlikely my 5-minute code review found anything significant, I can’t help wondering whether removing those conditionals might recover some of the performance gap between the HIP version, which uses the vendor-optimized solver, and Julia.

    • I totally agree! Julia is great connective tissue for HPC and AI, and AMD has the hardware-specific expertise to help optimize its performance. I don’t know how many PhDs they have working on software infrastructure (NVIDIA has approx 300 based on Tim’s estimate), but dispatching a few to collaborate on Julia’s JIT & HAL should be well worth it in my opinion!

  3. One thing I noticed is that while the paper showed an AMDGPU kernel written in Julia, it didn’t, as far as I could tell, say much about how data was copied from the CPU to the GPU and back. For all I know, the performance lag could have come from, say, an automatic behind-the-scenes transfer that’s easy to code but not performant, similar to how Nvidia’s Unified Memory works without prefetching.

    • Sounds about right (and may connect back to Eric’s point in a way). Taking memory access patterns (as specified in the algorithm being compiled or JIT-ted) into account when lowering (adapting, splitting) the computational graph to the hardware is likely a key factor in eventual performance (in the face of the memory wall).

      The “Numenta” outfit apparently makes even CPUs run AI/ML/HPC faster by focusing on that.

      • Interesting — Numenta (eg. S. Ahmad, B.S. Cornell, M.S. & PhD. UIUC) champions Spatial Pooling (SP) within a Hierarchical Temporal Memory (HTM) perspective. I wonder how much of it could be implemented in memory-controller hardware (maybe with memory hint instructions from the CPU/code), or if it works best as a static, pre-compiled, optimization?
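
As referenced in the first comment, here is a minimal serial Julia sketch of an explicit forward-difference Gray-Scott update. This is our own illustration with arbitrary parameter values, not the paper’s code; the paper’s kernels are threaded and GPU variants of the same basic idea.

```julia
# Gray-Scott reaction-diffusion with forward-Euler time stepping on a periodic 2-D grid.
# u, v: species concentrations; Du, Dv: diffusion coefficients; F: feed rate; k: kill rate.
function gray_scott_step!(u, v, Du, Dv, F, k, dt)
    n, m = size(u)
    unew, vnew = similar(u), similar(v)           # unoptimized: allocates every step
    for j in 1:m, i in 1:n
        ip, im = mod1(i + 1, n), mod1(i - 1, n)   # periodic neighbors
        jp, jm = mod1(j + 1, m), mod1(j - 1, m)
        lap_u = u[ip, j] + u[im, j] + u[i, jp] + u[i, jm] - 4 * u[i, j]
        lap_v = v[ip, j] + v[im, j] + v[i, jp] + v[i, jm] - 4 * v[i, j]
        uvv = u[i, j] * v[i, j]^2
        unew[i, j] = u[i, j] + dt * (Du * lap_u - uvv + F * (1 - u[i, j]))
        vnew[i, j] = v[i, j] + dt * (Dv * lap_v + uvv - (F + k) * v[i, j])
    end
    u .= unew
    v .= vnew
    return u, v
end

# Arbitrary illustrative parameters; the grid spacing is folded into Du and Dv.
u = ones(128, 128); v = zeros(128, 128)
u[60:68, 60:68] .= 0.5; v[60:68, 60:68] .= 0.25   # small perturbation to seed patterns
for _ in 1:10_000
    gray_scott_step!(u, v, 0.16, 0.08, 0.035, 0.060, 1.0)
end
```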
