Nvidia Research Gives Generative AI Images And 3D A Speed Boost

Time is money when it comes to generative AI, as is the case with most technologies. But with generative AI, it is big money. And the longer it takes for an AI model to do anything, such as training on data or generating images and video, the more money is spent.

Researchers at Nvidia’s Toronto-based AI lab are tackling the issue of time when it comes to generating images and video, and this week at the company’s GTC 2024 conference, they outlined some of the fruits of that labor, showing off advancements aimed at making it faster – and thus more affordable – to generate significantly less “noisy” images and more detailed 3D images, reducing times from as long as weeks or months into days or minutes.

At a presentation during the show, Sanja Fidler, vice president of AI research at Nvidia, talked about “critical advancements” in generative AI in the “algorithms that can train very large models on very large datasets, on very large computers, and affordable training times as well affordable inference.”

For image generation, the researchers looked at accelerating the work of diffusion models, which are used to address the thorny issue of generating high-resolution images with high fidelity and are the basis for text-to-image models such as OpenAI’s DALL-E 3 and Google’s Imagen. Essentially, they remove the “noise” – artifacts that aren’t part of the image’s original scene content but that can make the image look blurry, pixelated, grainy, or otherwise poor.

Other model families – such as GANs and flow models – have been used to generate high-fidelity images, but diffusion models have moved to the forefront. They work in a two-step process: first adding Gaussian noise to the data – the forward diffusion process – and then reversing that noising process, essentially teaching the model how to remove noise from images.
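That two-step process can be sketched in a few lines of toy Python. This is an illustration of the general idea, not Nvidia’s code: the four-number “image,” the function names, and the single noise level are all made up for the example, and the “noise prediction” here is an oracle (the true noise) rather than a trained network.

```python
import math
import random

def forward_diffuse(x0, noise, alpha_bar):
    """Forward process q(x_t | x_0): scale the clean signal and blend in
    Gaussian noise. alpha_bar in (0, 1]; smaller values mean more noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * n for x, n in zip(x0, noise)]

def denoise(xt, predicted_noise, alpha_bar):
    """Reverse step: recover an estimate of x_0 given a prediction of the
    injected noise. A trained network would supply predicted_noise; passing
    the true noise shows the inversion is exact."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * n) / a for x, n in zip(xt, predicted_noise)]

rng = random.Random(0)
x0 = [0.2, -0.5, 0.9, 0.1]                   # toy "image" of four pixels
noise = [rng.gauss(0.0, 1.0) for _ in x0]    # Gaussian noise to inject
xt = forward_diffuse(x0, noise, alpha_bar=0.4)
recovered = denoise(xt, noise, alpha_bar=0.4)
```

In a real diffusion model the noising happens over many timesteps, and training consists of teaching a network to make that `predicted_noise` estimate from the noisy input alone.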

Nvidia researchers have taken hard looks at diffusion models in the past, including such aspects as sampling, parameterization, and training, and considered ways to improve the performance and speed of training in ADM (ablated diffusion model) denoiser networks.

Enter EDM-2

In a technical blog released on the last day of GTC, Miika Aittala, a senior research scientist at Nvidia who works on neural generative models and computer graphics, wrote that Nvidia researchers developed EDM-2, a “streamlined [neural] network architecture and training recipe [that is] a robust, clean slate that isolates the powerful core of ADM while shedding the baggage and cruft.”

The researchers also focused on the “poorly understood but crucially important” procedure of exponential moving averaging of network weights, drastically simplifying the tuning of this hyperparameter.

When tested against other methods such as GANs and VDM++, EDM-2 delivered comparable generation quality with less complexity and much faster training, an advantage that grew as the models got larger.

The results addressed concerns of both model users and developers, Fidler said in her presentation. Users are most focused on the quality of the images, and with EDM-2, the quality is high. In addition, “we measure computation time,” she said. “That’s actually when [we’re] training these models. That is what developers care about, the training time and turnaround as well as the cost that comes with the training.”

The faster the training, the lower the costs for doing the training, and according to Fidler, EDM-2 delivers five times faster training than other diffusion models. A training job that may have taken a month previously can now be done in days.

A key for the researchers was making the neural network run more efficiently by addressing the uncontrolled growth of both activations and weights in the model. The problem didn’t seem to prevent the network from learning, but it was an “unhealthy phenomenon that significantly hampers the speed, reliability, and predictability of the training and compromises the quality of the end result,” Aittala wrote.

Eliminating activation and weight growth, and changing the way exponential moving averages (EMA) of the weights are calculated – essentially by storing periodic snapshots of intermediate training states with shorter EMA lengths – is much more efficient than re-running the entire training, he wrote.
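The snapshot idea rests on the standard EMA of network weights, which can be sketched in a few lines. This is a hedged illustration, not EDM-2’s implementation: the two-element weight vector, the fixed decay, and the stand-in optimizer step are all invented for the example; the actual technique combines such snapshots after training to synthesize other EMA lengths without re-training.

```python
def ema_update(ema_w, new_w, decay):
    # classic exponential moving average of network weights:
    # each step blends the running average toward the latest weights
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_w, new_w)]

weights = [0.0, 0.0]      # toy "network weights"
ema = list(weights)       # short-length EMA, cheap to maintain during training
snapshots = []            # periodic states stored for post-hoc combination

for step in range(1, 101):
    weights = [w + 0.01 for w in weights]       # stand-in for an optimizer step
    ema = ema_update(ema, weights, decay=0.9)   # track the average as we go
    if step % 25 == 0:
        snapshots.append((step, list(ema)))     # snapshot instead of re-training
```

Because the snapshots are saved as training runs, different EMA profiles can be explored afterward, rather than re-running the whole (expensive) training for each candidate EMA length.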

Fidler said Nvidia will release the code for EDM-2 to help make diffusion models more efficient.

A Boost from LATTE3D

Speed and fidelity also are key components behind LATTE3D, a text-to-3D generative AI model for producing 3D representations of objects and animals. Using LATTE3D, high-quality 3D images can be produced almost instantaneously – in about a second, according to Fidler.

They can be used for everything from video games to design projects to virtual training for robotics, an area of generative AI and automation that got a lot of attention during the week of GTC.

“A year ago, it took an hour for AI models to generate 3D visuals of this quality – and the current state of the art is now around 10 to 12 seconds,” she said in a blog post. “We can now produce results an order of magnitude faster, putting near-real-time text-to-3D generation within reach for creators across industries.”

In a paper, Nvidia researchers noted that other approaches to 3D image generation came with tradeoffs. Some produced impressive results, but their optimization procedures could take as long as an hour per prompt.

“Amortized methods like ATT3D optimize multiple prompts simultaneously to improve efficiency, enabling fast text-to-3D synthesis,” they wrote. “However, they cannot capture high-frequency geometry and texture details and struggle to scale to large prompt sets, so they generalize poorly.”

In addition, “these methods often involve an expensive and time-consuming optimization that can take up to an hour to generate a single 3D object from a text prompt,” they wrote.
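The contrast between per-prompt optimization and amortized methods can be made concrete with a deliberately tiny sketch. Nothing here is LATTE3D’s code: a real system optimizes a 3D representation against a diffusion prior, while this toy reduces each “shape” to a single number and stands in a lookup for a prompt-conditioned network.

```python
def per_prompt_optimize(target, steps=100, lr=0.1):
    """The slow route: gradient descent on (theta - target)^2, started from
    scratch for every prompt. Real systems spend minutes to an hour here."""
    theta = 0.0
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - target)  # gradient of the squared error
    return theta

def make_amortized(prompt_to_target):
    """The amortized route: pay a one-time 'training' cost over many prompts,
    then answer each prompt with a single cheap forward pass. A dict stands
    in for a neural network conditioned on the prompt."""
    table = dict(prompt_to_target)
    def infer(prompt):
        return table[prompt]  # one forward pass, no per-prompt optimization
    return infer

infer = make_amortized({"a dog": 1.0, "a cat": 2.0})
```

The lookup also hints at the weakness the paper notes: a simple amortized model answers instantly for prompts like those it trained on, but capturing fine detail and generalizing to large, diverse prompt sets is the hard part LATTE3D targets.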

With LATTE3D (Large-scale Amortized Text-To-Enhanced3D), Nvidia built a scalable architecture and used 3D data during optimization via what the paper’s authors wrote were “3D-aware diffusion priors, shape regularization, and model initialization to achieve robustness to diverse and complex training prompts.” It leveraged neural field and textured surface generation to create “highly detailed textured meshes in a single forward pass,” generating 3D objects in 400ms.

Rather than starting a design from the ground up or parsing through a library of 3D assets, LATTE3D “generates a few different 3D shape options based on each text prompt, giving a creator options. Selected objects can be optimized for higher quality within a few minutes,” according to the blog post.

Users can then move the shape into graphics applications or platforms like Nvidia’s Omniverse, which uses Universal Scene Description (OpenUSD) 3D workflows and software.

The model was trained using the chip maker’s “Ampere” A100 GPUs on a broad array of text prompts generated with the ChatGPT chatbot, allowing it to easily adapt to the phrases users may input to describe 3D objects. Prompts featuring dogs, for example, should all generate dog-like shapes.

Nvidia researchers used a single RTX A6000 GPU for their demo.
