The Future Is The One We Generate

If you want to be at the top of the food chain on Earth, as human beings have ascended to after millions of years of evolution, there are a bunch of things that you need to be able to do.

One important one is the ability to take in experiences and learn from them sufficiently to make some sort of prediction about the outcome of discrete actions – what we have usually called thinking. Another is the ability to move through and change the physical world. And a third necessity for being in charge on Earth is to do all of this on a very low power budget. We would add that collective action and sacrificial action – a kind of merging of thinking and acting – is a fourth key component, and certainly something that separates us from the lower animals in the web of life.

If the keynote address by Nvidia co-founder and chief executive officer Jensen Huang is any indication, AI systems are well on their way to accomplishing the first. And with its new Cosmos world foundation model, announced at the Consumer Electronics Show in Las Vegas last night, Nvidia is blazing the trail on the second.

Thank heavens it looks like humans still have the thermal advantage of thinking and doing for a few thousand calories a day – but we may not be able to think and act quickly enough to keep up with the billions of humanoid robots that Nvidia and its partners are dreaming up in our future. Or rather, their version of our future. And no one who has children and raises them through college would argue that they are inexpensive. It takes a lot of time – call it 18 years at a minimum, but it is really more like 23 years to 25 years these days – and a lot of money – maybe between $300,000 and $500,000 – to end up with an adult human being who can be productive and self-sufficient.

And while no one will come out and say it directly, these are the economics that AI and its robotics packaging are going to disrupt in the coming years.

There were hints of the magnitude of this “opportunity” throughout the prebriefings made by Nvidia executives ahead of the CES keynote, as well as during it, as Huang painted a picture of the AI future for all of us – and Wall Street, too.

The Power Of Three

A message that we heard again and again is that most organizations on Earth are going to need three computers. You need a DGX system for training AI models, loaded up with lots of Nvidia GPUs, CPUs, and DPUs. The flagship DGX machine is the DGX GB200 NVL72 rackscale system, and Huang amused the Las Vegas crowd not only with his shiny alligator skin leather jacket, but with a mockup waferscale chip that might, in theory, be created to put all of this on one wafer. That is shown in the feature image at the top of this story and in more detail below:

The actual NVL72 system, with its NVSwitch interconnects, is the next best thing to that cardboard cutout waferscale NVL72, and it no doubt costs less and has better yield, too, as a rack of server and switch components. But, at some point in the future, a rackscale system of today will be crammed onto a single socket of chiplets, much as a NUMA server system from the late 1990s is now compressed down into a single socket today. Such miniaturization is an economic necessity, and it is a technical one now, too, with AI models being so sensitive to latency between compute and memory components.

In addition to this DGX training system, or a clone thereof from an ODM or OEM, organizations that use AI in the physical world will also need an Omniverse system to create a digital twin of their work environment or the vehicle or what have you. Omniverse needs to be supplemented with a physical AI model that literally understands the physics of the real world, and this is the new Cosmos world foundation model that Nvidia just released.

And the third thing is the GPU-accelerated factory, warehouse, car, or robot that exists in the real world and that is given autonomy.

When you connect all three – and we all know the third time is the charm and the three sisters are Charmed – then you can create a virtuous feedback loop between them, whereby a model is trained with real world data, then practices in a digital twin world that understands physics, runs at a much faster speed than reality, and is massively augmented with synthetic reality to train on more scenarios and thereby learn faster.
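For concreteness, here is a minimal sketch of that feedback loop as we read it, in Python, with every function passed in as a hypothetical placeholder – none of these names are Nvidia APIs:

```python
# A sketch of the three-computer loop under stated assumptions: train,
# simulate, and augment are placeholder callables supplied by the caller.
def three_computer_loop(policy, real_data, train, simulate, augment, epochs=10):
    dataset = list(real_data)                   # 1. DGX: learn from real-world data
    for _ in range(epochs):
        policy = train(policy, dataset)         #    (re)train the model
        episodes = simulate(policy, speed=100)  # 2. Omniverse twin: faster than real time
        dataset += augment(episodes)            #    synthetic data widens scenario coverage
    return policy                               # 3. deploy on the physical machine
```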

“These three computers are going to be working interactively,” Huang explained in the keynote. “Nvidia’s strategy for the industrial world – and we’ve been talking about this for some time – is this three computer system. Instead of a three body problem, we have a three computer solution.”

And you thought you only needed to buy one, or two. The more you buy, the more you save. . . .

So what exactly is this Cosmos thingamabob? Well, last fall it was a “comprehensive suite of continuous and discrete tokenizers for images and videos,” as Nvidia put it, and these work a little bit differently from the text tokenizers that underpin large language models. In general, they chop up images across space and chop up videos across space and time, so that foundation and diffusion models can draw relationships between snippets of this data and then output images and video from the derived tokens. As you can see from this blog post, the results of images and video generated using the Cosmos tokenizer are pretty impressive.
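To make that concrete, here is a toy sketch of spatiotemporal patching in Python. This illustrates the general technique, not Nvidia’s implementation – a real discrete tokenizer would also push each patch through a learned encoder and quantizer:

```python
# Illustrative "tubelet" tokenization: chop a video across space and time.
import numpy as np

def tokenize_video(video: np.ndarray, t: int = 4, p: int = 16) -> np.ndarray:
    """Split a video (frames, height, width, channels) into flattened
    patches spanning t frames and p x p pixels each."""
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0, "dims must divide evenly"
    grid = video.reshape(T // t, t, H // p, p, W // p, p, C)
    grid = grid.transpose(0, 2, 4, 1, 3, 5, 6)  # group the tubelet dims together
    return grid.reshape(-1, t * p * p * C)      # one row per spatiotemporal token

# A 16-frame, 256x256 RGB clip becomes 4 x 16 x 16 = 1,024 tokens of 3,072 values.
clip = np.random.rand(16, 256, 256, 3).astype(np.float32)
print(tokenize_video(clip).shape)  # (1024, 3072)
```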

In a few short months, Nvidia has turned Cosmos from a set of tokenizers into a full-blown platform:

With a large language model, you chew on data with a machine learning algorithm to create a neural network that encapsulates, through statistical tricks, the semantic landscape of that language. If you do that for many languages, you can use them to convert from one language to another, and if you attach a diffusion generative model to it, you can convert from one kind of input (text, speech, image, or video) into another one.

Physical AI is the next phase in the AI revolution, according to Huang, and it deals not with data, but with the real world. And to be clear, the other phases of AI, says Nvidia, start with perceptron AI, conceived in the 1940s and implemented in the 1950s, first on rudimentary IBM 704 supercomputers and then on custom machinery for the US Office of Naval Research.

It took nearly eight decades to get to the next phase of generative AI, where large language models were created that had massive numbers of parameters and demonstrated some emergent behaviors that sure as hell look like thinking and reasoning from the outside.

The third phase is agentic AI, where we essentially cross-link hierarchies of generative models all tuned for different tasks together and have them – for lack of better words – mull things over instead of just blurting out the first statistically superior response an LLM can think of when you feed it a query and some context data.
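As a sketch of that mulling-over pattern – with a hypothetical llm() helper standing in for whatever model API you actually use – the loop looks something like this:

```python
# Draft-critique-revise loop, the simplest agentic arrangement described above.
def llm(role: str, prompt: str) -> str:
    raise NotImplementedError("hypothetical stub: wire in a real model call")

def agentic_answer(query: str, context: str, rounds: int = 3) -> str:
    draft = llm("planner", f"Context:\n{context}\n\nAnswer this: {query}")
    for _ in range(rounds):
        critique = llm("critic", f"Find flaws in this answer to '{query}':\n{draft}")
        if "no flaws" in critique.lower():  # the critic is satisfied, stop mulling
            break
        draft = llm("planner", f"Revise.\nCritique: {critique}\nDraft: {draft}")
    return draft
```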

Huang explained well what physical AI means in his keynote:

“But what we need to do is we need to create effectively the world model as opposed to GPT, which is a language model. And this world model has to understand the language of the world. It has to understand physical dynamics, things like gravity and friction and inertia. It has to understand geometric and spatial relationships. It has to understand cause and effect. If you drop something and it falls to the ground, if you poke at it and it tips over. It has to understand object permanence. If you roll a ball over the kitchen counter, when it goes off the other side, the ball didn’t leap into another quantum universe, it’s still there.”

We all learn this, pretty early and automagically, through experience – and, to be fair, in a statistical way that resembles what these neural networks are doing through emulated neurons running on tensor and vector cores in a GPU. We believe the ball is still in the kitchen because we have seen it, or been given it back, and we trust that things don’t just go “poof!” – perhaps because our brains are neural and binary but not quantum enough to realize that there are quantum particles flitting in and out of existence all around us and within us. . . . And just perhaps, if we thought a different way – or the Universe did – maybe the ball would just disappear. Or, reversing the process like a diffusion model, it might pop into existence instead of poofing out.

But we digress.

A little later in a video cutaway during the keynote, Huang elaborated on Cosmos:

“Cosmos models ingest text, image, or video prompts and generate virtual world states as videos. Cosmos generations prioritize the unique requirements of AV and robotics use cases, like real-world environments, lighting, and object permanence. Developers use Nvidia Omniverse to build physics-based, geospatially accurate scenarios, then output Omniverse renders into Cosmos, which generates photoreal, physically based synthetic data. Whether diverse objects or environments – conditions like weather, or time of day, or edge case scenarios – developers use Cosmos to generate worlds for reinforcement learning AI feedback to improve policy models. Or to test and validate model performance. Even across multi-sensor views. Cosmos can generate tokens in real time, bringing the power of foresight and multiverse simulation to AI models, generating every possible future to help the model select the right path.”

It literally looks like this:

Isn’t that how you imagine possible actions?
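That last line of Huang’s quote – generating every possible future so the model can select the right path – boils down to sampling rollouts from a world model and scoring them. A hedged sketch, with world_model and score as hypothetical stand-ins rather than Cosmos APIs:

```python
# Pick the action whose sampled futures score best. A generative world
# model is stochastic, so we average over several rollouts per action.
def best_action(state, actions, world_model, score, samples=8):
    best, best_value = None, float("-inf")
    for action in actions:
        futures = [world_model(state, action) for _ in range(samples)]
        value = sum(score(f) for f in futures) / samples
        if value > best_value:
            best, best_value = action, value
    return best
```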

Huang said that Cosmos was the world’s first world foundation model, and that it was trained on 20 million hours of video that showed physical, dynamic things like people moving or their hands manipulating objects to help AI models – which will someday power robots – understand the physical world and how to manipulate it.

One last funny thing. Regarding that quote above, we honestly are not certain that Huang’s voice and words were not themselves generated by Nvidia’s AI models. It had a funny tinniness to it, lacking the usual spark in Nvidia’s co-founder’s words.

The Cosmos world foundation model will be freely distributed as an “open model” through Hugging Face and the Nvidia GPU Cloud, itself a kind of Nvidia veneer over cloud builder infrastructure spanning the globe. This is distinct from open sourcing Cosmos, which as far as we know Nvidia is not doing, any more than it opens up its CUDA libraries or video drivers.

Now, let’s shift gears towards the money, which is what everyone really wants to talk about.

With agentic AI, models are talking to models at a rate that is much faster than a human can read text or interpret an image or video, so that is going to require larger machines with more bandwidth. The expectation is that it will take at least two orders of magnitude more compute to create these agentic systems, which are in essence human workers encapsulated in software algorithms.

There are roughly 1 billion knowledge workers in the world, according to Nvidia. There are 30 million software developers, who are presumably among those knowledge workers and who are in the cross-hairs of code assistants and code generators that are being created based on GenAI technologies.

There are 10 million factories in the world, says Nvidia, and 200,000 warehouses for the distributors and retailers who stage the stuff these factories make so we can get to it or it can get to us. By poking around on the Internet, we reckon that the factories and warehouses might employ another 1 billion people or so. The global workforce is somewhere north of 3 billion people – with more than 1 billion in various kinds of services jobs – out of a total population of more than 8 billion people alive at this time.

Virtual robots using GenAI are gunning for the knowledge workers and actual physical robots are gunning for the factory and warehouse workers.

There is no doubt in our minds that replacing some or all of the functions performed as work by those billions of people is a multi-trillion dollar opportunity. Which is exciting if you like technology, as we do. But to what end? When does too much technology actually break a human economy, literally stop the flow of money across people and companies and governments?

We don’t know. But consider this sentence uttered by Huang in his keynote:

“In a lot of ways, the IT department of every company is going to be the HR department of AI agents in the future. Today, they manage and maintain a bunch of software from the IT industry. In the future, they will maintain, nurture, onboard, and improve a whole bunch of digital agents and provision them to the companies to use. And so your IT department is going to become kind of like AI agent HR.”

The answer in past revolutions, where everything changed, was for workers to learn new skills as the economy added new economic sectors. It is hard to see what these might be when a robot is better, smarter, and faster, and doesn’t cost a half million dollars and decades of training. You either download it from NGC and run it virtually on a cloud or buy a physical robot that runs on electricity and presumably lasts for decades. Nvidia is openly predicting there will be billions of humanoid robots in the not too distant future.

What has been clear for years – and what still is very clear – is just how far ahead Nvidia is compared to its competition in completeness of vision and implementation of that vision when it comes to AI in its many forms. Jensen Huang lives in the future, and that future reverberates with the science fiction that many of us have read. It remains to be seen how the rest of us will fare. This is not fiction, but it is science. And at some point, after it becomes economics, it will become politics. Perhaps faster than many think.

We shall see.

All of this seems so much larger than Nvidia wrapping up its own enterprise-grade implementations of the Llama 3.1 models from Meta Platforms, or creating a desktop PC out of Grace CPUs and Blackwell GPUs, both of which we found very interesting indeed. Like many others, we want to see how this stuff works, and a $3,000 Grace-Blackwell PC that is the size of a stack of smartphones, that can deliver 1 petaflops of FP4 tensor performance, that can run GenAI models up to 200 billion parameters in size, and that can be networked in pairs looks like something interesting to play with. We think lots of people will want one. Perhaps hundreds of millions.
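As a back-of-envelope check on that 200 billion parameter figure – our arithmetic, not an Nvidia spec sheet – FP4 packs two weights into every byte:

```python
# Why 200B parameters is roughly the ceiling for one of these machines at FP4.
params = 200e9
bytes_per_param = 0.5                       # FP4: two 4-bit weights per byte
weights_gb = params * bytes_per_param / 1e9
print(f"{weights_gb:.0f} GB of weights")    # 100 GB
# That fits within the 128 GB of unified memory Nvidia cited for the machine,
# leaving headroom for the KV cache and activations; networking two of them
# in a pair roughly doubles the budget.
```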

It is a strange new world, isn’t it? And the only true way to predict what will happen is to live through to the future, minute by minute, action by action, together. There aren’t enough GPUs in the world to simulate all of us at our natural granularity, at least not yet.


14 Comments

  1. So perhaps Nvidia would be the “Technology Mule” in Asimov’s Foundation?…At $3.66T market cap (“Too big to FPale”) while continually growing and absorbing most of the kinetic energy in the GPU Universe?

    PsychoHistory = LLM (Not2Distant.Future); // Reminder: Order a few more SMRs for Northern VA

    Oh and that $500K estimate to “train and make productive” ONE human being? Well add in (in 2001 dollars) $42K per annum for New York University…And only 25 years total runtime to 100% training task completion? Good luck with that! // #define NUMA Non-Uniform-Money-Access

      • Well, rich people drive Bugattis and poor people drive Kias. To get from point A to point B, these two dissimilar vehicles have to take the same highway, which imposes the same speed limit on both.

        My point is that there are diminishing returns in everything. Hyper-expensive universities won’t make rich kids more intelligent or get them faster to their destination (career). One thing over-rated schools certainly offer is a lot of entertainment for bored spoiled brats; they are becoming more like six-star resorts.

        In the meantime, there is a little guy who was born in Calcutta who just got his dream job at OpenAI. How much did it cost to get him there? I guess not much. Maybe India should teach America one or two things about efficiency.

        And by the way, the brain consumes just 20 watts. So a team of 100 engineers who build technological marvels collectively requires a mere 2,000 watts of power.

        • I was making a general point, not specifics, and about my home market, where most of this hyperscaling and clouding is still going on.

          But I see your points.

          My point is that the argument will be made that people evolve and move too slowly, that things are more complex than people can handle, and that automation will be cheaper and easier. And that there is plenty of room to argue for cost efficiencies for deploying AI on a broad scale.

          I did not wade even a little bit into the cultural and societal ramifications of this. I am aware of them, but that is not the task of The Next Platform. But I do feel that it is our job to establish the parameters of why certain kinds of computing are happening and the market forces at work. It’s 3 billion people in a workforce and heaven only knows how many billions of virtual and physical robots. We will see how this all shakes out.

          Just for fun, when I asked Claude Sonnet what the implications might be for AI adoption, it dodged the question. Perhaps it has been taught to do that.

          • I understand you.

            But the big question is: Do we need AI? And the answer is yes and no. That leads to a much bigger question: How big is the Yes, and how big is the No?

    • Hmmm, this seems to assume that all Ivy(ish) league school attendees are “rich”…Definitely not in the above case.

      Let’s just look at this as an example of the out-of-whack US costs for “training” carbon-based life forms from, say, the era of the 1970s to the turn of the millennium. My 70s CompSci program’s total four-year cost (at a US state Uni) versus first-year salary was perhaps $1 to $2 (in my favor).

      Versus the above school (non-STEM program) at $1 to 25 cents. A horrible ratio inversion, and it speaks to the increasing cost of training aspiring US middle class, non-STEM office workers in the modern age (yeah, definitely making the ROI for training silicon-based “life forms” look more attractive).

  2. A cool article for the New Year! It is mildly unfortunate however, given the advantages suggested by Jensen Huang as: “Instead of a three body problem, we have a three computer solution”, that the $3,000 Project DIGITS GB10 PCs can only be “networked in pairs” … I’d love to pair 3 of them together for the full three-body experience!

  3. Yet another great article! Thank you Timothy, and Happy New Year!
    Nvidia’s move into desktop computers, and Cosmos (a hyperscaler competitor when other hyperscalers move to their own silicon?), are very interesting developments.

  4. Despite being a very interesting article, it makes me sad:

    ####
    “If you want to be at the top of the food chain on Earth, as human beings have ascended to after millions of years of evolution, …”
    […]
    “What has been clear for years – and what still is very clear – is just how far ahead Nvidia is compared to its competition in completeness of vision and implementation of that vision when it comes to AI in its many forms. Jensen Huang lives in the future, and that future reverberates with the science fiction that many of us have read. It remains to be seen how the rest of us will fare. This is not fiction, but it is science. And at some point, after it becomes economics, it will become politics. Perhaps faster than many think.

    We shall see.”

    ####

    “And on the seventh day God came to the end of all his work; and on the seventh day he took his rest from all the work which he had done.”
    Genesis 2:2

    Nobody cares what God says in his word (the truth), because we humans today know better.

    …what does God do then?

    …because they received not the love of the truth, that they might be saved. And for this cause God shall send them strong delusion, that they should believe a lie:
    2 Thessalonians 2:10b-11

    …they really MUST believe their dreams, visions, simulations, … as God lets them come true and lets them find new things from day to day, and does not let them see the truth.
    “At that time Jesus answered and said, I thank thee, O Father, Lord of heaven and earth, because thou hast hid these things from the wise and prudent, and hast revealed them unto babes.” Matthew 11:25
    That’s the biggest punishment.

    What did Elon Musk say in an interview with Lex Fridman, when asked what he would ask an AGI if he could ask it one question (https://www.youtube.com/watch?v=dEv99vxKjVI&t=1913s), after thinking for a few seconds?

    “What’s outside the simulation.”

    He is not that far away…and at the same time he is one of the main persons who is most deeply blinded by this simulation, or better said, by this lie God sent to, for example, show him who is God and who is not (we humans)…

    At the end it’s still the old lie from the beginning, told by the already defeated devil: “…and ye shall be as gods, knowing good and evil….” Genesis 3:5

    “Death is inevitable…” as Matthew MacDougall, Head Neurosurgeon at Neuralink said (https://www.youtube.com/watch?v=Kbk9BiPhm7o&t=17637s).

    …and still people live as if they would live forever and could not die in any coming second…if they were able to see this, to really see this, everybody would say: “First of all I have to make sure that I have peace with this God who will judge me at the end and decide if I go to hell or heaven, as I do not know whether I will still be alive in a few seconds.”
    They would pray to God to let them understand his word, to know their sins (which can be seen by his commandments, which we do not keep), and they would find Jesus Christ as the only one, as he paid for their sins, and they would let themselves be baptized (if not already baptized as a baby, which is fine, if not even better, even if false prophets like charismatics, pentecostals, or baptists tell you otherwise:
    “Verily I say unto you, Whosoever shall not receive the kingdom of God as a little child, he shall not enter therein.” Mark 10:15)

    For he hath made him to be sin for us, who knew no sin; that we might be made the righteousness of God in him.
    2 Corinthians 5:21

    For as many of you as have been baptized into Christ have put on Christ.
    Galatians 3:27

    …and hope to the end for the grace that is to be brought unto you at the revelation of Jesus Christ;
    1 Peter 1:13

    …it is always sad for me to see how people are, as I also once was…and I loved my past life…when I was in the past…today, I hate it and every old “ping” from the past, and I am relieved not to live in it anymore…but I am still (sadly it is true) a sinner, though having the righteousness put on like clothes…and I do not have to fear death, as God is my father…all of which I do not deserve at all…au contraire…
    …that’s pure grace…

    • “This is not fiction, but it is science.”

      It’s about technology, not science. Science doesn’t alter the world; technology does.

  5. Thanks TPM for this great thinking and for making it available to us.

    IBM desperately needs you now.

    I started at IBM in 1962 as an IBM Systems Engineer, calculating on an IBM 604 calculating punch, before the IBM System/360 computer.

    The very basis of today’s AI technology, as well as Tesla’s autonomous Full Self Driving (FSD), is capturing information in real-time, and IBM simply does not do that, and rejected doing that a decade ago.

