Transcript
TTLX1bJhae4 • NVIDIA Cosmos Reason 2 Explained: The New Brain for Physical AI
Kind: captions
Language: en
So, what if an AI could not only see the
world, but truly understand it? And I
don't just mean labeling objects in a
picture. I'm talking about genuine
comprehension. You know, common sense
and intuition for physics and the
ability to actually plan and act in our
messy, unpredictable reality. Well, that
is the huge promise behind Nvidia's new
model, Cosmos Reason 2. It's a major
leap forward for what we call physical
AI. These are the systems built to break out of the digital world and operate
right here alongside us. So, for this
deep dive, we're going to break down
exactly how it works, what makes it so
different, and why it might just be the
key to unlocking the next generation of
real world robotics and automation.
Let's get into it. But first, let's
start with a really basic question. Have
you ever seen one of those videos of a
super expensive multi-million dollar
robot? You know, a total marvel of
engineering, and it's trying to do
something simple like fold a t-shirt,
and it just moves with this painful,
clumsy slowness. Or maybe it tries to
pick a strawberry and just completely
crushes it. It's kind of funny, right?
We have machines that can master the
most complex games ever invented, but
they fumble with tasks a toddler can do
without even thinking. Well, this gap,
this huge chasm between digital smarts
and physical competence is one of the
biggest challenges in all of AI. It's
the reason our homes aren't filled with
robot butlers yet. And it all boils down
to one missing ingredient. Common sense.
Okay, so this slide really breaks that
problem down. On the left, you've got us
human reasoning. When you decide to make
a cup of coffee, your brain doesn't just
think get coffee. It runs this whole
subconscious plan. Walk to the kitchen,
open the cupboard, grab the mug, which
hey, might not be exactly where you left
it. So, you adapt on the fly. You get
the coffee, you grip the machine, and
you deal with a dozen tiny unexpected
things along the way. We deal with
uncertainty constantly. Now, look at the
right side. That's been the story for
traditional AI. They've been brittle.
They're amazing in predictable environments, like a chessboard where the
rules are set in stone. But in the real
world, things get messy. A weird shadow,
an object that's been moved an inch, a
slippery floor. Any of those things can
make a traditional AI totally fall
apart. They just lack that fluid,
adaptable common sense, and that is the
exact gap Nvidia is trying to close. All
right, so here's the game plan for this
deep dive. We've just covered AI's
common sense problem. Next, we'll
officially meet the solution, Cosmos
Reason 2. Then, we're going to pop the
hood and look at the key upgrades that
make it tick. After that, we'll see it
in action with some incredible real
world examples. Then, we'll zoom out to
meet the whole Cosmos family of models
to see the bigger picture. And finally,
we'll look ahead to the next physical
frontier this tech is unlocking. Okay,
let's get to the main event, Cosmos
Reason 2. Now, that subtitle is super
important. A vision language model for
the physical world. This is not a
chatbot that they just taught how to
look at pictures. No, this was built
from the ground up with the physics and
the logic of our world baked right in.
Its whole purpose is to be the cognitive
engine, the brain really for robots and
other autonomous agents, giving them the
ability to see what's around them,
understand how things relate, and then
actually make and execute a plan. So,
let's be super clear about what a reasoning vision language model even is.
A standard VLM, which you've probably
seen, can look at a picture and just
label stuff. It's seizable of fruit, a
knife, and a cutting board, and it'll
spit out the words apple, knife, cutting
board. A reasoning BLM goes so much
deeper. It doesn't just see the objects.
It understands what you can do with
them, how they relate, the physics
involved. It knows a knife can be used
to slice the apple on the cutting board.
It gets that if you push that apple off
the edge, gravity is going to make it
fall. So, here's the bottom line. It's
the difference between just naming the
nouns in a scene and actually
understanding the verbs, the potential
for action. That's the reasoning part.
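To make that difference concrete, here's a tiny Python sketch. It's purely illustrative, with made-up output structures rather than Cosmos Reason 2's actual API, contrasting what a labeling model and a reasoning model return for the same scene:

```python
# Illustrative only: hypothetical output formats, not NVIDIA's actual API.

# A standard VLM stops at naming the nouns in the scene.
standard_vlm_output = ["apple", "knife", "cutting board"]

# A reasoning VLM also returns the verbs: relations, affordances, physics.
reasoning_vlm_output = {
    "objects": ["apple", "knife", "cutting board"],
    "relations": [("apple", "on", "cutting board")],
    "affordances": [("knife", "can_slice", "apple")],
    "physics": ["if the apple is pushed off the edge, it falls"],
}

print(standard_vlm_output)   # just the nouns
print(reasoning_vlm_output)  # nouns plus actionable understanding
```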
And get this, this isn't just some
marketing slogan from Nvidia. Cosmos
Reason 2 has been put through its paces
and it has climbed to the number one
spot on two really critical industry
leaderboards. You can think of these
leaderboards, the physical AI bench and
the physical reasoning leaderboard as
the Olympics for AI. They throw a whole
bunch of challenges at these models
specifically to see how well they can
reason about the physical world. And
right now among all the open models out
there, Cosmos Reason 2 is the champ.
That open model part is also a huge deal
because it means researchers and
developers everywhere can build on it
which just speeds up innovation for
everyone. So how did it get to be number
one? Well, it's not magic. It comes down
to some very specific, very powerful
technical upgrades. So now we're going
under the hood. We're going to look at
the new capabilities that give this
thing its edge. This is where we see
what really separates this model from
everything that came before it. The
first and maybe the most important
upgrade is something called long context
understanding. The best way to think
about this is like the AI's short-term
memory. Imagine you're trying to build a
really complex piece of IKEA furniture.
You have to remember what you did in
step one when you're all the way on step
12. If your memory is too short, you're
just going to get lost. It's the same
for a robot. It has to remember the main
goal, the steps it's already taken, and
what it's seen along the way. The more
information it can hold in its head at
once, the longer its context window, the
more complex the tasks it can actually
pull off. And the improvement here is
just wild. 256,000
tokens. Now, to put that in perspective,
a token is basically a piece of a word.
So, we are talking about the ability for
this model to read, process, and
understand the equivalent of a 400-page
book in one single go. That means you
could feed it an entire super complex
technical manual for a machine and it
could then use that manual to guide a
robot through a repair. It's just a
massive expansion of its cognitive
workspace. Yeah, this chart really puts
it into perspective, doesn't it? The
last version, Cosmos Reason 1, had a
context window of 16,000 tokens, which
was already pretty good. But this new
version expands that working memory 16
times over. Let's go back to that IKEA
analogy. The 16K model, maybe it could
handle a simple little bookshelf, but if
you gave it the instructions for a giant
wardrobe with 50 steps, by the time it
gets to step 30, it might have already
forgotten a critical detail from step
three. The 256K model can hold the
entire instruction book in its memory at
once. That's a true game-changer for any kind of long-term planning.
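If you want to sanity-check those numbers yourself, here's the back-of-the-envelope math. The conversion ratios are common rules of thumb (roughly 0.75 words per token, about 480 words per printed page), not NVIDIA's figures:

```python
# Rough arithmetic behind the "400-page book" claim.
# Conversion ratios are rules of thumb, not NVIDIA's figures.
context_tokens = 256_000
words_per_token = 0.75   # typical for English text
words_per_page = 480     # typical for a printed book page

words = context_tokens * words_per_token   # ~192,000 words
pages = words / words_per_page             # ~400 pages
expansion = context_tokens / 16_000        # vs. Cosmos Reason 1's window

print(f"~{words:,.0f} words, ~{pages:.0f} pages, {expansion:.0f}x more context")
```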
Okay, next up: a great memory isn't everything. A
physical AI also needs incredibly sharp
senses. So, the second huge area of
upgrades is in the model's eyes, its
visual perception, and its spatial
understanding. And this isn't just about
a higher resolution camera. This is
about giving the AI totally new ways to
see the world, moving it from just
seeing a flat 2D picture to
understanding a full 4D environment.
That's 3D space plus the dimension of
time and motion. Pretty cool, right?
Let's quickly break down this new set of
senses because this is where it gets its
real world smarts. 2D and 3D point
localization means it can see a specific
screw on a workbench and know its exact
coordinates in space. Bounding box
coordinates let it draw a perfect
digital box around an object. Trajectory
data. This one is a huge leap. It can
see a ball rolling and not just track
it, but actually output the coordinates
of where it's going to be. OCR support
means it can literally read any text it
sees in the world, like a serial number
on a part or a warning label. And
finally, timestamp precision lets it tie
all of this rich data to a specific
moment in time, so it can understand how a whole scene is changing.
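To picture how those senses fit together, here's a hypothetical perception record, invented purely for illustration; it is not the model's documented output schema:

```python
# Hypothetical perception record, for illustration only.
perception = {
    "timestamp_s": 12.4,                 # timestamp precision
    "object": "ball",
    "bbox_2d": [412, 220, 498, 305],     # pixel box: x1, y1, x2, y2
    "point_3d_m": [0.82, -0.15, 0.04],   # 3D localization in meters
    "trajectory_3d_m": [                 # predicted future positions
        [0.82, -0.15, 0.04],
        [0.95, -0.15, 0.04],
        [1.08, -0.15, 0.04],
    ],
    "ocr_text": None,                    # nothing readable on a ball
}
```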
All right, so the third key upgrade is all about practicality, because all this power is
great, but can you actually use it in
the real world? It's one thing to build
an AI that needs a supercomputer the size
of a room to run. It's another thing
entirely to make that power accessible.
So Nvidia has engineered Cosmos Reason 2
to be super flexible so it can run on
everything from a tiny chip on a drone
to a massive server in the cloud. And
they do this by offering two different
sizes. You can think of a model's
parameters as kind of like the neurons
in its brain. More parameters usually
means more power, but it also takes more
energy to run. So the 2 billion
parameter model is the lean, efficient
one. It's designed for edge deployment,
meaning it can run right on the robot or
camera itself. No internet needed, which
is perfect for real-time decisions. Then
you have the 8 billion parameter
powerhouse. That's the version you'd use
in the cloud to analyze video from, say,
a whole fleet of self-driving cars. This
flexibility means developers can pick the right tool for the job.
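To see why the two sizes matter for deployment, here's a rough weights-only memory estimate at 16-bit precision. It's a common rule of thumb that ignores activations and KV cache, so real requirements run higher:

```python
# Weights-only memory estimate: params x 2 bytes at 16-bit precision.
# A rough rule of thumb; runtime needs (activations, KV cache) add more.
def weights_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    return params_billions * 1e9 * bytes_per_param / 1e9

print(f"2B model: ~{weights_gb(2):.0f} GB")   # fits edge-class hardware
print(f"8B model: ~{weights_gb(8):.0f} GB")   # happier on a cloud GPU
```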
Okay, so we've geeked out on the upgraded memory, the new senses, and the flexible sizes.
But what does all this tech power
actually do? This is where theory meets
reality. Let's look at how industry
leaders are already putting this thing
to work to solve some really complex
real world problems. So, first let's
look at the classic challenge, robotics.
Picture this. In a workshop, you want a
robot arm to do what seems like a simple
task. Pick up a specific roll of
painter's tape from a cluttered table and
put it in a basket. For you or me,
that's nothing. But for a robot, it
requires this perfect seamless flow of
seeing, understanding, planning, and
then acting, all while dealing with
things like weird lighting or
reflections. Now, look at this. This is
how you tell the system what to do.
You're not writing hundreds of lines of
complex code. You just give it a natural
language prompt, just like you'd ask a
person for help. And notice how the
prompt specifically asks for both the
steps and the trajectory. This is a
direct payoff from those new
capabilities we just talked about.
You're not just asking what to do.
You're asking, "Show me exactly how to
do it down to the precise path through
space." And here's what it spits out. A
perfect logical plan. Step one, locate
painter's tape. That's the advanced
visual perception we talked about. Step
two, determine optimal gripper position.
That's its common sense physics kicking
in. Step three, calculate motion
trajectory coordinates. There's that
brand new trajectory data capability,
giving it a collision-free path. And
then steps four and five are the
execution. This shows how vision,
reasoning, and action all come together,
all from one simple English sentence.
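If you wanted to mock that flow up yourself, here's a rough Python sketch. The prompt wording and the plan format are illustrative stand-ins, not the model's actual API:

```python
# Illustrative prompt-and-plan exchange; not the actual Cosmos Reason 2 API.
prompt = (
    "Pick up the roll of painter's tape from the table and place it in "
    "the basket. List the steps and the motion trajectory."
)

# The kind of structured plan described above, mocked up by hand:
plan = [
    "1. Locate the painter's tape (visual perception)",
    "2. Determine the optimal gripper position (physical reasoning)",
    "3. Calculate collision-free trajectory coordinates (trajectory data)",
    "4. Execute the grasp",
    "5. Move to the basket and release",
]
print(prompt)
print("\n".join(plan))
```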
It's incredible. Okay, let's shift gears
literally to another massive industry,
autonomous vehicles. You know, one of
the biggest bottlenecks in developing
self-driving cars is just getting enough
perfectly labeled training data. For
years, this has meant paying armies of
people to manually watch video frame by
frame and draw boxes around every single
car, person, and traffic light. It's
incredibly slow. It's expensive and it's
full of human error. So now Uber is
looking at Cosmos Reason 2 to automate
and seriously upgrade this whole
process. And the results are already in.
After training the model on their own
specialized driving data, Uber saw some
major measurable improvements. Okay,
let's break down this table. That blue
score, which basically measures how
accurately an AI can describe a video,
it went up by over 10%. And that LingoQA score, which tests for a much deeper
understanding of the scene, it jumped by
a massive 13.8%.
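For a feel of what a BLEU score actually measures, here's a minimal sentence-level example using NLTK. The captions are invented for illustration, not Uber's data; the table above reports corpus-level scores on real annotations:

```python
# Minimal BLEU demo with NLTK; captions are invented for illustration.
from nltk.translate.bleu_score import sentence_bleu

reference = "a white sedan stops at the crosswalk for a pedestrian".split()
candidate = "a white car stops at the crosswalk for a pedestrian".split()

# BLEU compares n-gram overlap between the model caption and the reference.
score = sentence_bleu([reference], candidate)
print(f"BLEU: {score:.2f}")  # closer to 1.0 means closer to the reference
```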
Now, the key thing here isn't just the
numbers, it's what they mean. This model
can understand the super complex,
highstakes world of traffic with much
greater accuracy. And that leads to
better training data and at the end of
the day, safer self-driving cars. And
it's not just Uber. A whole ecosystem of
major companies is already building with
this. Salesforce is using it to analyze
factory videos to spot safety hazards
before anyone gets hurt. Hitachi is
using it to build their next-gen robots.
Milestone and VAST Data are using it in
smart cities to improve traffic flow.
And companies like Encord are plugging
it right into their data platforms. The
adoption is happening fast and it's
happening everywhere. Seeing how
companies like Uber and Salesforce are
applying this technology is absolutely
fascinating. And to stay on top of how
this tech is changing our world, make
sure you subscribe for more deep dives
just like this one. Now, it's really
important to get that Cosmos Reason 2,
as powerful as it is, doesn't exist all
by itself. It's actually the reasoning
core of a whole family of models from
NVIDIA. Think of it like a complete
cognitive toolkit, an entire ecosystem
of AI designed to tackle physical
intelligence from every possible angle.
First up, meet Cosmos Predict 2.5. So if
Cosmos Reason is the part of the brain that understands the present moment, Cosmos Predict is the part that imagines
the future. It's a generative AI. You
can give it a video clip and it will
actually generate a realistic prediction
of what's probably going to happen next.
This is obviously crucial for something
like a self-driving car that needs to
predict if a pedestrian is about to step
into the street. And its skills are
seriously impressive. It can generate up
to 30 seconds of video, which is an
eternity in the world of AI prediction.
It learned how to do this by being
trained on a crazy huge data set of 200
million video clips. So, it could just
soak up the rules of physics and motion.
And just like its sibling, it's also a
leader on the physical AI bench for the
quality of its predictions. This is how
an AI can game out what might happen
before it ever has to act in the real
world. Next in the family is Cosmos
Transfer 2.5. This model is designed to
solve a really stubborn problem in
robotics called the sim-to-real gap. See,
it's way cheaper and safer to train a
robot in a computer simulation. The
problem is simulations are usually too
perfect. They don't have the messiness
of reality, the weird lighting, the dust
on a sensor. So, the sim-to-real gap is
when a robot that's a genius in the
simulation moves to the real world and
suddenly it fails. Cosmos Transfer is
the bridge across that gap. It basically
takes the clean data from a simulation
like one built in NVIDIA's Isaac Sim
platform and it reskins it to look like
it was recorded in all sorts of real
world conditions, different lighting,
weather, textures, you name it. This
creates a way richer, more realistic
training data set, which makes the robot
way better when it's finally deployed
for real. It's a super powerful tool for
speeding up development. And all of this
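Cosmos Transfer's actual pipeline is far more sophisticated, but the underlying idea of varying how the same scene is rendered so a policy can't overfit to a too-perfect simulator is the classic domain-randomization trick. A bare-bones NumPy version of that generic idea might look like this:

```python
# Bare-bones domain randomization on a simulated frame (NumPy only).
# This shows the generic idea, not Cosmos Transfer's actual method.
import numpy as np

def randomize_frame(frame: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Jitter brightness and add sensor-like noise to one RGB frame."""
    brightness = rng.uniform(0.6, 1.4)          # lighting variation
    noise = rng.normal(0.0, 8.0, frame.shape)   # dusty-sensor noise
    out = frame.astype(np.float32) * brightness + noise
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
clean_sim_frame = np.full((480, 640, 3), 128, dtype=np.uint8)  # stand-in frame
augmented = [randomize_frame(clean_sim_frame, rng) for _ in range(4)]
```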
And all of this leads up to NVIDIA's most ambitious projects, like GR00T. This is a
foundational model for humanoid robots
and what's running inside its digital
brain? The Cosmos family. GR00T uses what's called a vision-language-action, or VLA, model. It takes in vision and
language and it outputs actions for the
robot's entire body. At its very core,
NVIDIA Cosmos Reason provides the high-level
thinking while the other family members
help it navigate the world. The Cosmos
family isn't just a set of tools. It's
the blueprint for the brain of the next generation of embodied AI.
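In code terms, the VLA idea boils down to a simple interface: images and an instruction in, joint-level actions out. Here's a skeletal sketch, with names and types invented purely for illustration:

```python
# Skeletal VLA interface; names and types invented for illustration.
from dataclasses import dataclass

@dataclass
class Observation:
    images: list          # camera frames from the robot's sensors
    instruction: str      # natural-language command

@dataclass
class Action:
    joint_targets: list   # one target position per actuated joint

def vla_policy(obs: Observation) -> Action:
    """Vision + language in, whole-body action out (stubbed)."""
    # A real VLA model (like GR00T) does the reasoning here.
    return Action(joint_targets=[0.0] * 7)

act = vla_policy(Observation(images=[], instruction="fold the t-shirt"))
```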
So, let's zoom out one last time and just think
about the big picture here. We've gone
through the problem, the tech upgrades,
the real world uses, and the whole
ecosystem. What is the ultimate
long-term impact of giving machines real
common sense? This really is the key
takeaway. For the last what, 70 years,
AI has been pretty much stuck in the
digital world of data and games. Models
like Cosmos Reason 2 represent the next
great step. They are giving machines the
fundamental building blocks, perception,
reasoning, planning that they need to
act safely and effectively in our world.
The AI is in a very real sense breaking
out of the computer and into our
reality. And for any of you developers,
researchers, or just curious folks out
there, Nvidia is making this technology
incredibly accessible. You can download
the models directly from Hugging Face.
You can go play with sample prompts on
Nvidia's website right now to get a feel
for it. It's being rolled out on all the
major cloud platforms. They've even
published a Cosmos cookbook with code
recipes to help you start your own
projects. And there's an active Discord
community to share ideas. They're actively encouraging the world to build with this stuff.
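Grabbing the weights is a one-liner with the huggingface_hub library. The repo id below is just a placeholder, so check NVIDIA's Hugging Face page for the actual model names:

```python
# Download model weights locally; the repo id is a placeholder, not verified.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="nvidia/<cosmos-reason-model-id>")
print(f"Model files downloaded to: {local_dir}")
```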
Which brings us to our final thought. This technology is about
so much more than a better factory robot
or a smarter car. It's a foundational
shift. So, I'll leave you with this
question to think about. What problems
are we going to solve when every machine
from a surgical assistant to a planetary
rover can truly see, reason, and
interact with the world around it? The
possibilities are just staggering. What
do you think is the most exciting
application for this? Let us know down
in the comments. To keep breaking down
the tech that shapes our future, make
sure you subscribe. We'll see you in the
next explanation.