The Cognitive Architecture of Future AI: From LLMs to Multimodal Embodied Systems
QyKSefEvEK8 • 2025-12-13
Hey everyone and welcome. Today we're
diving into something truly
mind-bending. How AI is making the
incredible leap from just being an
expert with words to becoming an agent
that can actually see, understand, and
act in our physical world. So, let's
kick things off with a pretty
fascinating question. We've all seen AI
do incredible things, right? Write
stories, generate code. But why can that
same super smart AI write a beautiful
poem about a cup, but it can't do
something as simple as just pick one up?
Well, the answer to that question is the
key to understanding the next huge leap
for AI. Okay, so to get to the bottom of
that, we have to start with what we all
know, right? The world of large language
models or LLMs.
You see, the real problem with these
text-only AIs boils down to something
called the symbol grounding problem. An
LLM knows the word cup because it's seen
it in billions of sentences online. It
knows all the words that go with cup,
but it has absolutely no idea what a cup
is in the real world. It doesn't know
its shape, its weight, or that you can't
just stick your fingers through it. It's
just shuffling symbols around without
any real world connection. It doesn't
get it. And that brings us to this
really powerful way of thinking about
it. LLMs are basically like a brain in a
vat. They have this gigantic universe of
information stored inside, but it's
completely cut off from physical
reality. This is exactly why they can
hallucinate and just make stuff up that
sounds plausible because there's no
reality check. It can't look out the
window and see if what it's saying
actually makes any sense. So, how do you
get the brain out of the vat? The first
step, you've got to give it senses. And
that brings us to the next stage in this
evolution, large multimodal models, or
LMMs.
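Just to make that "brain in a vat" idea concrete, here's a rough Python sketch. It's purely illustrative, not anything from the video: the dictionary, the GroundedObject class, and all the numbers in it are made up. The point is the contrast between what a text-only model effectively "knows" about a cup, word co-occurrence statistics, and the kind of grounded record an embodied agent would need, where the same symbol is tied to physical properties it can act on.

```python
# Hypothetical illustration of the symbol grounding problem (illustrative only).
# A text-only LLM effectively knows "cup" as statistics over neighboring words;
# a grounded agent needs the same symbol linked to perceptual/physical facts.

from dataclasses import dataclass

# What a text-only model has: which words tend to appear near "cup" (toy numbers).
ungrounded_cup = {
    "coffee": 0.31,
    "tea": 0.24,
    "drink": 0.22,
    "handle": 0.12,
    "porcelain": 0.05,
}

@dataclass
class GroundedObject:
    """A toy grounded representation: the symbol plus physical attributes
    an embodied agent could verify or act on."""
    name: str
    mass_kg: float    # needed to plan a lift
    graspable: bool   # affordance: can a gripper close around it?
    rigid: bool       # fingers cannot just pass through it

grounded_cup = GroundedObject(name="cup", mass_kg=0.3, graspable=True, rigid=True)

# The text-only view can rank likely neighboring words, but it cannot answer
# "how hard should I squeeze?" -- the grounded record can.
print(max(ungrounded_cup, key=ungrounded_cup.get))   # "coffee"
print(grounded_cup.mass_kg, grounded_cup.graspable)  # 0.3 True
```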
Okay, take a look at this chart. This is
going to be our roadmap for the whole
journey. We're going to use this to
track how AI is evolving, moving from
left to right, from those basic LLMs to
the really advanced stuff that's coming.
Now, let's zoom in on the first two
columns here. See that line for input
modalities? LLMs are text only, but look
at LMMs. They can handle text, vision,
and audio. Giving AI eyes and ears, I
mean, that's a total game-changer for
connecting it to the real world. Its
understanding is suddenly grounded in
what it can actually perceive. And this
is where things get really cool.
Google's RT-2 model was a massive
breakthrough because for the very first
time, a robot could tap into that huge
library of knowledge on the internet,
all those images, all that text, and use
it to figure out how to do something new
in the real world. And the results, I
mean, they were staggering. RT-2 was
nearly three times better at performing
tasks it had never ever been trained on
before. This wasn't just a tiny
improvement. It was a massive leap in
its ability to generalize and figure out
new stuff on its own. All thanks to that
new multimodal understanding. So, okay,
we've given the AI senses, but that's
not the whole story. To act
intelligently, it needs a better way to
well to think. And this brings us to a
really fascinating idea. Building an AI
that thinks a little more like we do.
You know, the psychologist Daniel
Kahneman came up with this idea that we
humans have two different ways of
thinking. System one is our fast,
intuitive gut reaction. You know, that
split-second decision when you slam on
the brakes. And system two, that's our
slow, deliberate, logical thinking.
That's your sit down and really think it
through mode, like when you're working
on a tough puzzle. The thing is, today's
LLMs are almost pure system one. They
are phenomenal pattern matchers, giving
you a quick, almost instinctive answer.
But, and this is a big but, that's also
why they can get things wrong or
hallucinate. They're great at quick
connections, but they fall apart when a
problem needs slow, careful, logical
steps. So, the future, the real goal, is
to build an AI that has both. See how
this diagram lays it out? You've got
that fast, reactive system one on one
side and the slow, deliberate system two
on the other. And the secret sauce is
right there in the middle, that
integration point. That's what lets the
AI be both quick and thoughtful to react
instantly when it needs to, but also to
pause, plan, and reason when it hits a
complex problem.
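As a rough sketch of what that integration point could look like, here's a hypothetical Python outline. It's my own illustration, not an architecture from the video: the names fast_policy and slow_planner and the 0.8 confidence threshold are all assumptions. A cheap, reactive System-1-style policy handles every step, and a slower, deliberate System-2-style planner only kicks in when the fast path isn't confident.

```python
# Hypothetical dual-process controller sketch (illustrative only; the function
# names and the 0.8 threshold are assumptions, not a real system's API).

def fast_policy(observation):
    """System 1: cheap, reactive mapping from observation to action,
    returning an action plus a confidence score."""
    if observation.get("obstacle_close"):
        return "brake", 0.99          # reflex-like response
    return "continue", observation.get("confidence", 0.5)

def slow_planner(observation, goal):
    """System 2: deliberate, multi-step reasoning; slower, used sparingly."""
    return ["stop", "scan_scene", f"replan_route_to:{goal}"]

def act(observation, goal, confidence_threshold=0.8):
    """Integration point: trust the fast path when it is confident,
    otherwise pause and fall back to deliberate planning."""
    action, confidence = fast_policy(observation)
    if confidence >= confidence_threshold:
        return [action]                          # quick, instinctive response
    return slow_planner(observation, goal)       # pause, plan, reason

# A confident reflex vs. an ambiguous situation that triggers planning.
print(act({"obstacle_close": True}, goal="kitchen"))   # ['brake']
print(act({"confidence": 0.4}, goal="kitchen"))        # multi-step plan
```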
So, we have an AI with senses. We have one with a more
sophisticated way of thinking. What's
the final piece of the puzzle? Well,
it's giving that brain a body so it can
finally get out and do things in the
world. All right, let's go back to our
road map here and look at that last
column, embodied AI. Check out the real
world grounding. It says strong,
grounded in physical interaction. This
right here is the final stage. This is
where it all comes together. The senses,
the smarter thinking, and now physical
action. And this is all possible because
of a brand new type of technology called
vision-language-action models, or VLAs.
Think about it. Instead of cobbling
together different systems for seeing,
thinking, and moving, a VLA bundles it
all into one seamless model. It can
literally see a scene, understand a
command like, "Hey, pick up the red
apple," and then translate that directly
into the right physical movements to get
it done.
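Here's a rough, hypothetical sketch of that "one seamless model" idea. To be clear, the VLAModel class, the token scheme, and decode_action below are my own stand-ins, not the actual interface of RT-2, GR00T, or any real system: the image and the instruction go in together, and what comes out is a short sequence of discrete action tokens that get decoded straight into a motor command.

```python
# Hypothetical vision-language-action (VLA) pipeline sketch (illustrative only).

import random

class VLAModel:
    """Stand-in for a single model mapping (image, instruction) -> action tokens."""
    def predict_action_tokens(self, image, instruction, num_tokens=7):
        random.seed(len(instruction))   # toy determinism, not real inference
        return [random.randint(0, 255) for _ in range(num_tokens)]

def decode_action(tokens):
    """Map discrete action tokens to a continuous end-effector command
    (dx, dy, dz, roll, pitch, yaw, gripper), each scaled to [-1, 1]."""
    return [round(t / 127.5 - 1.0, 3) for t in tokens]

# One perception-to-action step: see the scene, read the command, move.
image = "camera_frame_placeholder"          # stand-in for real pixels
instruction = "pick up the red apple"

model = VLAModel()
tokens = model.predict_action_tokens(image, instruction)
command = decode_action(tokens)
print(tokens)    # discrete action tokens
print(command)   # continuous robot command handed to the controller
```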
And this isn't science fiction. We're seeing it happen right now. You've
got NVIDIA's GR00T project, which is
trying to build a general purpose AI for
all kinds of humanoid robots. You've got
Tesla pushing forward with its Optimus
robot. And then you have incredible
research like DexMimicGen, which
lets robots learn how to do really
complex two-handed jobs just by watching
a person do it one time. So when you put
it all together, perception, cognition,
action, you realize we are stepping into
a brand new frontier. But you also
realize that with this kind of power
comes some seriously profound new
responsibilities.
I mean, the ultimate dream here is for
AIs to develop what are called emergent
capabilities. It's like how a child
learns to walk and then from that
figures out how to run and jump on their
own. These embodied AIs could start
picking up new skills just by
interacting with the world, learning and
growing in ways we didn't explicitly
program. It's truly unpredictable and
honestly a little mind-blowing. But to
get to that future, we have to tackle
some of the biggest questions humanity
has ever faced. Like who's in charge of
this stuff? Who governs it? How do we
guarantee that a robot acting in our
world actually shares our values? And as
these AIs get more and more complex,
what kind of ethical duties might we
have toward them? You know, when you
boil it all down, it comes back to this
one single absolutely crucial question.
As we teach our machines to move beyond
words and actually step into our world,
the great challenge of our time will be
making sure they act not just
intelligently, but wisely and for the
good of every single one of us.