Very few organizations have enough iron to train a large language model in a reasonably short amount of time, and that is why most will be grabbing pre-trained models and then retraining the parameters in the models with much smaller datasets that are important to them. After that, they need to figure out an AI inference strategy, but that is really an entirely different problem.
Curating the datasets to train LLMs, particularly given the increasing sensitivity of copyright holders to the unauthorized use of their material to feed the models, is an issue, and it is one that supercomputer maker Fujitsu and its largest and most famous HPC customer, RIKEN Lab, have been mindful of as they have worked with researchers in Japan to train an open source variant of the GPT strain of LLMs on the “Fugaku” supercomputer, currently ranked as the fourth most powerful machine on the June 2024 Top500 supercomputer rankings.
For those of you who may want to train your own LLM, what Fujitsu and RIKEN have done is illustrative and also demonstrates how mature some of the technologies are that enable organizations to use open source tools to train an LLM more efficiently than they otherwise might.
The team that created the Fugaku-LLM, as the model is called, was led by professor Rio Yokota of Tokyo Institute of Technology, associate professor Keisuke Sakaguchi of Tohoku University, Koichi Shirahata of Fujitsu, team leader Mohamed Wahib of RIKEN, associate professor Koji Nishiguchi of Nagoya University, Shota Sasaki of CyberAgent, and Noriyuki Kojima of Kotoba Technologies.
Every part of the team had its own task.
CyberAgent curated the Fugaku-LLM dataset using Japanese text as well as some English text, mathematics, and programming code, according to Fujitsu. The resulting dataset had a total of 380 billion tokens – a pretty small scale compared to some of the English language models we see these days that have trillions of tokens. About 60 percent of the model’s data is in Japanese. The idea is to train the Fugaku-LLM from scratch and not retrain an existing model with Japanese data so “the entire learning process can be understood,” as Fujitsu put it in the statement announcing the model. Most of the models trained in Japan, according to the company, were built through a continual learning process, starting from a model trained outside of Japan in another language – almost certainly English, although Fujitsu did not say – and then continually updated with more and more Japanese text.
The Fugaku-LLM is unique in another way in that it is pushing up the parameter count, which raises the IQ of the model – to use what is perhaps a bad metaphor. Most of the existing Japanese models, according to Fujitsu, have less than 7 billion parameters, but the Fugaku-LLM is being open sourced with 13 billion parameters.
For comparison, the composite GPT-4 model from OpenAI is rumored to have around 1.76 trillion parameters spread across eight individual models. (This is known as a Mixture of Experts approach.) Google has not given parameter counts for its top-end Gemini Pro and Gemini Ultra LLMs, but the Nano-1 variant of Gemini has 1.8 billion parameters and the Nano-2 version has 3.25 billion parameters.
As we explained last July when talking about the Inflection-1 model from Inflection AI, and anthropomorphizing just a bit for effect: Tokens tell you how much you know and parameters tell you how well you can think about what you know. Smaller parameter counts against a larger set of tokens give you quicker, but simpler, answers. Larger parameter counts against a smaller set of tokens give you very good answers about a limited number of things. The key is to strike a balance between the tokens and the parameters, and that is what the Fugaku-LLM team has done given the size of the dataset that can be curated in Japanese and the size of the Fugaku supercomputer, which was not designed to run LLMs but can do so thanks to its mixed precision support for FP64, FP32, and FP16 data types and processing.
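To put some rough numbers on that balance, here is a quick back-of-the-envelope calculation. The 20 tokens per parameter reference point is the widely cited Chinchilla rule of thumb, which we are using only as a yardstick, not something the Fugaku-LLM team has said it targeted.

```python
# Back-of-the-envelope token/parameter balance for Fugaku-LLM. The 20
# tokens-per-parameter figure is the widely cited Chinchilla rule of thumb,
# used here only as a reference point.
tokens = 380e9   # training tokens in the Fugaku-LLM dataset
params = 13e9    # parameters in the open sourced model

ratio = tokens / params
print(f"Tokens per parameter: {ratio:.1f}")                                           # ~29.2
print(f"Chinchilla-style token budget for 13B parameters: {20 * params / 1e9:.0f}B")  # 260B
```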
Tohoku University participated in the collection of the training data with CyberAgent and also selected the model to be used. In this case, the model chosen was a version of the open source OpenAI GPT-2 LLM called Megatron-LM, which itself was trained by Nvidia several years ago for its own GPU accelerators and which was open sourced. Nvidia pushed Megatron all the way up to 1 trillion parameters, making it one of the most powerful of the so-called “transformer” models a few years back when it was trained on 3,072 of Nvidia’s “Ampere” A100 GPU accelerators in the “Selene” supercomputer.
The Megatron LLM chosen by the Tohoku team had its compute and networking performance goosed by the DeepSpeed library for the PyTorch framework, which was created and open sourced by Microsoft and which Microsoft has used to accelerate its own GPT implementations for the past several years. Three years ago, the Microsoft team that created DeepSpeed said the tool increased throughput for both AI training and AI inference, which has plenty of performance tricks of its own. (The Megatron-DeepSpeed code was open sourced by Microsoft here.)
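For those who want a feel for what wrapping a Megatron-style model with DeepSpeed looks like, here is a minimal sketch. The stand-in model and every config value below are illustrative assumptions on our part, not the settings the Fugaku-LLM team used, and a real job would be kicked off with the deepspeed launcher across many accelerators rather than run as a single script.

```python
# Minimal sketch of wrapping a PyTorch model with DeepSpeed. The stand-in
# model and every config value are illustrative assumptions, not the
# Fugaku-LLM training setup. A real job is launched with `deepspeed train.py`
# across many devices.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1.0e-4}},
    "fp16": {"enabled": True},           # mixed precision training
    "zero_optimization": {"stage": 1},   # shard optimizer state across ranks
}

# Stand-in for a GPT-style transformer block.
model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# One training step: DeepSpeed handles loss scaling and gradient reduction.
x = torch.randn(4, 128, 512, device=engine.device, dtype=torch.float16)
loss = engine(x).float().mean()
engine.backward(loss)
engine.step()
```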
Fujitsu and RIKEN tweaked the Megatron model so it could run well on the 6D mesh/torus interconnect, called Tofu D, employed in the Fugaku supercomputer, which uses 48-core A64FX Arm processors from Fujitsu and which has roughly 7.63 million cores running at 2.2 GHz to deliver 513.85 peak theoretical petaflops at FP64 precision. (Multiply by two to get FP32 throughput and by two again to get FP16 throughput.) Presumably the DeepSpeed tool was used to push as much of the Megatron model down to FP16 precision as possible to boost throughput on training. Tokyo Institute of Technology worked on the several layers of parallelism needed to train the Fugaku-LLM as well as on boosting the performance of the Tofu D collective communications underneath Megatron. Kotoba Technologies ported the PyTorch framework that underpins Megatron to Fugaku, and Nagoya University studied how to create applications from the Fugaku-LLM.
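To make those layers of parallelism a little more concrete, here is a toy bit of bookkeeping showing how 13,824 nodes might be carved into tensor, pipeline, and data parallel groups. The split shown is purely an assumption for illustration; the team has not published its parallelism configuration in the material we have seen.

```python
# Toy bookkeeping for 3D parallelism across the 13,824 Fugaku nodes used
# for training. The tensor/pipeline split below is an assumed example,
# not the configuration the Fujitsu-RIKEN team actually used.
nodes = 13_824

tensor_parallel   = 4    # each layer's weight matrices sharded 4 ways (assumed)
pipeline_parallel = 24   # layers split into 24 consecutive stages (assumed)
data_parallel     = nodes // (tensor_parallel * pipeline_parallel)

# Every node must belong to exactly one (tensor, pipeline, data) coordinate.
assert tensor_parallel * pipeline_parallel * data_parallel == nodes

print(f"Nodes holding one full copy of the model: {tensor_parallel * pipeline_parallel}")  # 96
print(f"Model replicas training in data parallel: {data_parallel}")                        # 144
```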
The source code for Fugaku-LLM is available on GitHub here, and the model is available on Hugging Face there. Quick to seize any opportunity, SambaNova Systems immediately added the Fugaku-LLM model to its Samba-1 collection of models for its eponymous AI systems.
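If you want to kick the tires yourself, pulling the model down through the Hugging Face transformers library looks roughly like the sketch below. The repository id, dtype, and generation settings are our assumptions, so check the Hugging Face listing for the exact names and the recommended prompt format.

```python
# Rough sketch of loading Fugaku-LLM through Hugging Face transformers.
# The repository id, dtype, and generation settings are assumptions; check
# the Hugging Face listing for the exact names and recommended prompt format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Fugaku-LLM/Fugaku-LLM-13B"   # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.float16, device_map="auto"
)

# "What is the supercomputer Fugaku?" in Japanese.
prompt = "スーパーコンピュータ「富岳」とは"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```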
Here is the fun bit for us HPC and AI enthusiasts. The Fugaku-LLM was trained on 13,824 nodes in the Fugaku system. This may sound like a lot, but it is not so much when you realize that the Fugaku system has 158,976 single-socket A64FX nodes. That is only 8.7 percent of the machine. Which suggests to us that the Fujitsu-RIKEN team could train a model with maybe 150 billion parameters and chew on perhaps 4.37 trillion tokens – if the latter could be found, of course.
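For those who want to check our napkin math, here it is spelled out. Note that it simply scales both parameters and tokens linearly with the node count, as we did above.

```python
# Back-of-envelope scaling behind the paragraph above. This simply scales
# both parameters and tokens linearly with node count, as the article does.
fugaku_nodes = 158_976   # total single-socket A64FX nodes in Fugaku
llm_nodes    = 13_824    # nodes used to train Fugaku-LLM
params       = 13e9      # parameters in Fugaku-LLM
tokens       = 380e9     # tokens in the training dataset

share = llm_nodes / fugaku_nodes
scale = fugaku_nodes / llm_nodes

print(f"Share of Fugaku used: {share:.1%}")                       # ~8.7%
print(f"Parameters at full scale: {params * scale / 1e9:.0f}B")   # ~150 billion
print(f"Tokens at full scale: {tokens * scale / 1e12:.2f}T")      # ~4.37 trillion
```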
Way to go! Training LLMs directly in different written languages (rather than via translation), including alphabetic, phonetic, ideographic, and even hieroglyphic ones, seems to me like the best way to enable the eventual identification of their linguistic sub-components, and their “distinct” cognitive components (if any), by inter-comparison of the trained results. And, like many, I definitely wonder what Fugaku-LLM would respond to the prompt: “わさびはいかがですか” (“How about some wasabi?”)? (eh-eh-eh!)