The Dark Matter of Robotics: Why Machines Struggle with "Physical Commonsense"
qEdcohT283I • 2026-01-31
All right, let's dive right in. Today
we're talking about something that is
well, it's all around us, but it's
almost impossible to see. It's a force
that basically shapes our entire
physical world, but it has been
completely out of reach for our most
advanced AIs. You see, it's a kind of
intelligence that's so baked into who we
are that we don't even recognize it as
intelligence. This is the story of the
dark matter of robotics.
To really get what we're talking about,
I want you to just do a little thought
experiment with me. Picture a bookshelf,
maybe one in your own home, that is just
absolutely jammed with books. Now, in
your mind's eye, reach out and try to
grab one specific book from right in the
middle. It's really wedged in there. And
pay close attention to what your hand
actually does. It's not just a simple
reach, grasp, and pull, is it? No. It's
this little symphony of tiny
adjustments. And right there in that
unconscious performance lies one of the
biggest challenges in the history of AI.
You probably don't even think about
these movements. Maybe your fingers
press against the book next door, just
kind of wiggling it to create a tiny bit
of space. Or maybe you hook the spine
with one finger and slide the book out
just enough to get a real grip on it.
And what happens if the cover's a little
slick and it starts to slip as you pull
it out? Before you can even think, "Oh
no, I'm dropping it," your hand has
already tilted, your grip has changed,
and you've pinned it against the shelf
to save it. This is the invisible
dance, this constant super fast
conversation between your senses and
your muscles. So, this constant, fluid
interaction, well, researchers have a
name for it: reactive, closed-loop
physical common sense. And let's break
this down, because every word here is
important. Reactive, that means it's
not pre-planned. It happens in response
to the world as it's changing, moment
to moment. Closed loop, that's just a
way of saying there's a
constant feedback cycle. Your eyes see
the slip, your nerves feel the pressure
change, your brain sends a correction,
and your senses report back on how it
went, all in milliseconds. It's a gut
feeling for physics, not the equations,
but a real intuition for forces, for
friction, for weight, and for all the
messiness of the real world. And over
your lifetime, all this intuition gets
compiled kind of like software, right
into your reflexes and muscle memory.
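To make that loop concrete, here's a
minimal sketch of what a reflex-speed
feedback cycle looks like as code. The
`hand` object and its methods are
hypothetical stand-ins, not a real robot
API; the point is just the shape of the
loop: sense, correct, sense again,
hundreds of times a second.

```python
import time

GAIN = 8.0      # how strongly to correct per unit of slip; illustrative value
LOOP_HZ = 500   # reflex-speed update rate: one cycle every 2 milliseconds

def grip_reflex_loop(hand):
    """Closed-loop grip reflex. `hand` is a hypothetical interface
    with read_slip_mm(), squeeze(delta), and is_holding()."""
    period = 1.0 / LOOP_HZ
    while hand.is_holding():
        slip = hand.read_slip_mm()   # sense: tangential micro-slip at the fingertips
        hand.squeeze(GAIN * slip)    # act: tighten in proportion to the error
        time.sleep(period)           # close the loop again, milliseconds later
```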
You know, this ability to just do
things, to handle objects with this kind
of grace and adaptability, it feels like
second nature to us, but it's really an
invisible superpower. It's an
intelligence so basic, so ancient that
we just take it for granted. But the
second we try to build machines that can
live and work in our world, we slam
into the reality that this superpower is
the one thing they almost always lack.
And that brings us right to the heart of
the mystery we're unpacking today. How
can something a toddler can do without
even thinking be one of the toughest,
most stubborn problems in robotics? I
mean, think about it. We've built AI
that can write poetry, compose music,
and crush the world's best Go players
at a game with more possible board
positions than atoms in the universe.
And yet we still
struggle to build a robot that can
reliably pick a piece of fruit without
smooshing it or clear a dinner table
without dropping a glass. To really get
this, we've got to look back at how this
problem was first spotted. And look,
this isn't some new frustration. This is
a famous paradox that's been bugging AI
researchers for over 50 years. Sometimes
it's called the paradox of the toddler
for the simple reason that the physical
skills of a 2-year-old are in many ways
still way beyond our most advanced
machines. The history here really puts
the whole challenge into perspective. It
really starts back in 1966 with a
philosopher named Michael Polanyi. He
came up with this idea of tacit
knowledge. His famous line was, "We can
know more than we can tell." Just think
about riding a bike. You can't write a
perfect instruction manual for it, can
you? The knowledge isn't in words, it's
in your body's sense of balance, the
little shifts in weight. It's knowledge
you can only get by doing. Then fast
forward to 1988 and the brilliant
roboticist Hans Moravec puts a name to
it: his famous paradox. He realized that
the things we think of as hard, like
high-level reasoning, actually take very
little computation, while the "easy"
stuff, basic sensorimotor skills, takes
enormous computational power. And here
we are today and that paradox is still
alive and well. And this slide just lays
it out so clearly. On the left you've
got the world of pure logic. An AI can
master chess, something that takes
humans years of intense training. It can
do calculus in a blink. An industrial
robot can do the same exact perfect weld
on a car door 10,000 times without ever
messing up. But then you look at the
right side. This is the stuff evolution
has spent hundreds of millions of years
getting right. Walking on a patch of
ice, recognizing a friend in a crowd,
and most importantly for us, adapting to
a messy, unpredictable room, and
recovering when something, say, starts
to slip from your grip. These things are
completely effortless for us, but
they're exactly where rigid
pre-programmed machines just fall apart.
So, that just raises the question, why?
I mean, we're living in the age of large
language models, right? AIs that have
basically read the entire internet. Why
can't they learn this physical
intuition? Well, it turns out the answer
isn't about how much data they have, but
what kind of data they're learning from.
And this is it right here. Polanyi's idea
from the 60s is the absolute key.
Physical knowledge just isn't made of
words. You can't pack it into a
sentence. It's not a list of rules like
"if book slips, then readjust grip." No,
it's something that only exists inside
that continuous high-speed feedback loop
between what you sense and how you act.
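Here's one way to see that difference
in code. Both functions below are
illustrative sketches, not anyone's
actual system: the rule version fits in
a line but is blind to magnitude and
timing, while the tacit version is a
continuous mapping from raw sensor
streams to graded corrections, with no
single line you could point to and call
the knowledge.

```python
# The "list of rules" version: discrete, and blind to how fast,
# how far, and in which direction the book is slipping.
def rule_based_grip(book_is_slipping: bool) -> str:
    return "readjust grip" if book_is_slipping else "keep pulling"

# The tacit version: a continuous policy over continuous signals,
# re-evaluated every few milliseconds. The knowledge is the whole
# mapping, learned by doing, and can't be read off as a sentence.
def tacit_grip_policy(fingertip_forces, joint_angles, visual_flow):
    ...  # returns graded motor corrections, not a verbal rule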
Now, let's compare that to what language
models actually learn from the internet.
They learn what's called semantic common
sense. And they are masters at this.
It's all about understanding the
statistical patterns between words and
ideas. For example, an LLM knows that if
a sentence starts with "the bird flew out
of its," the next word is probably "nest"
or "cage," because it's seen that pattern
trillions of times. It has a common
sense for language, but it's a common
sense of text, not texture, of symbols,
not slipping. And this analogy is just
perfect. Reading the driver's manual is
exactly like an AI learning from the
internet. It gives you all the
background knowledge, the rules of the
road, what the signs mean, the theory of
it all. That's semantic knowledge. But
reading that book a hundred times will
never ever prepare you for the actual
physical feeling of your car
hydroplaning on a wet road or that
intuitive flick of the wheel you do when
you feel the car start to skid. That is
physical knowledge, tacit knowledge. And
you can only learn it by actually
holding the wheel and feeling the
consequences. So here's the bottom line.
The entire internet, all of it, is
basically a passive, third-person
recording of the world. It has no
proprioception, no feeling of a body
moving through space. It gives no chance
for intervention. An AI can't read about
a ball and then decide to go push it.
And most importantly, it has no
consequences. In all those trillions of
words, there is no data that captures
the feeling of an object slipping. And
because of that, there's no data on the
reflex you need to catch it. That key
ingredient, the interactive closed loop
experience, is just completely missing.
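You can see the gap in the training
signal itself. Here's a sketch of the
standard language-model objective, with
a toy corpus and a hypothetical `model`:
every term in the loss is a third-person
guess about the next symbol, and nowhere
is there a slot for an action taken or a
consequence felt.

```python
corpus = ["the", "bird", "flew", "out", "of", "its", "nest"]

def next_token_loss(model, tokens):
    """Standard next-token objective: predict each word from the words
    before it. `model.log_prob` is a hypothetical stand-in method."""
    loss = 0.0
    for i in range(1, len(tokens)):
        context, target = tokens[:i], tokens[i]
        loss -= model.log_prob(target, context)  # passive, third-person prediction
    # Note what's missing: no action channel, no felt consequence,
    # no way for the learner to intervene and see what happens.
    return loss
```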
And when you realize that, it leads to a
huge conclusion. If we want to teach
machines physical common sense, we need
a totally new kind of data. We can't
just keep showing them pictures of the
world. We need data that is born from
physical experience. We have to figure
out how to record the dance itself. We
have to capture that sensory motor loop.
Now, for decades, the go-to method for
this was teleoperation. Basically, a
human controlling a robot from afar to
collect data. But the old ways of doing
this were just bad. The interfaces were
these clunky joysticks and laggy screens
with almost no physical feedback. And
this awkward setup forces the human
operator to switch off their fast,
intuitive, reflexive System 1 thinking.
They have to start using their slow,
deliberate System 2 brain,
thinking through every single step.
"Okay, now move the gripper left. Now
close the fingers." The movements you
record are stiff, robotic, and totally
missing those fluid little corrections
that make us so good at this stuff. And
when you train an AI on that bad data,
well, you get a bad robot. It's jagged,
slow, and inefficient. So, the
breakthrough here isn't just about
getting more data. It's about inventing
a way to get the right data. The holy
grail is a system for collecting data
that's so smooth, so intuitive that the
human operator's natural, reflexive
behavior just flows right through to the
robot. The goal is to make the interface
basically disappear so we can finally
capture that physical intelligence that
evolution has been working on for
millions of years. So, how's this
actually being done? Well, companies
like Generalist are building these
lightweight ergonomic controllers that
let an operator move a robot's hands
almost like they're their own. But the
real game changer here is high-fidelity
force feedback. This means the operator
can actually feel what the robot is
feeling. They can feel the resistance of
pushing something, the texture of a
surface, the weight of an object. And
you know what? A few minutes into using
a system like this, something amazing
happens. The operator stops planning
their moves. They stop thinking and they
just start reacting. And the data that
comes out of that is a world away from
the old stuff. It's rich with the very
soul of physical common sense. All those
little reflexes, those real-time
recoveries, those tiny intuitive
corrections. We're at a really cool
moment in this whole story. We've
defined this huge problem. We've looked
at its long history, and we've seen why
the old solutions just didn't cut it.
And now we're right on the edge of a
potential breakthrough. The results of
this new approach are honestly
mind-blowing. And if this is the kind of
stuff that gets you excited, you should
definitely subscribe to see where it all
goes from here. So, we have this new way
of thinking and this new tech for
capturing data that's packed with human
intuition. The huge question is, what
happens when you train a giant AI model
on this amazing new data? The answer is
you start to see something that looks a
lot less like programming and a whole
lot more like improvisation. You see
these flashes of real physical
intelligence.
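The simplest version of that training
recipe is imitation: make the model's
output match the recorded human action
at every timestep. Here's a minimal
behavior-cloning sketch in PyTorch,
reusing the illustrative step record
from earlier; the real systems are far
more elaborate, so treat this as the
core idea only.

```python
import torch
import torch.nn.functional as F

def behavior_clone(policy, episodes, optimizer):
    """One pass of naive behavior cloning over recorded teleop episodes.
    `policy` is any torch module mapping observations to actions."""
    for episode in episodes:
        for step in episode:
            obs = torch.as_tensor(step.joint_angles + step.fingertip_forces)
            target = torch.as_tensor(step.action)
            pred = policy(obs)               # what would the human have done here?
            loss = F.mse_loss(pred, target)  # distance from the recorded reflex
            optimizer.zero_grad()
            loss.backward()                  # nudge the policy toward human behavior
            optimizer.step()
```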
Okay, check out this first example. The
robot's task is to put a small metal
washer into a really tight foam slot. As
it's pushing down, its sensors feel a
tiny rotation: the washer is starting to
slip out of its grip. Now, a normal
robot would probably just drop it or jam
it in crooked. But this model does
something totally different. It pauses
the main action, the pushing. It does
this super quick, tiny regrasp to stop
the slip. And then this is the part that
just feels so human. It gives the washer
this little double nudge, a quick push
and release just to make sure it's
seated properly. That final nudge wasn't
programmed. It's a learned trick for
making sure the job is done right.
Here's another one. The robot needs to
place a small box into its lid. It picks
up the box, but oops, it fumbles it and
flips it upside down. A total failure
for most robots. But instead of just
freezing or dropping it, this model
immediately does this fluid in-hand
regrasp. It basically juggles the box in
its fingers to flip it right side up all
in one smooth motion. And then just like
a person would, it gently pats the box
down into the lid a couple of times.
That final pat, it's not really
necessary, but it shows this intuitive,
learned understanding of making sure
something is secure. It's incredible.
This next one shows an even higher level
of problem solving. The robot is trying
to put something into a cardboard box,
but one of the inner flaps is bent in,
blocking the way. A simple robot would
just keep pushing and fail. This model
though, it sees the problem. It uses its
other finger, the one that isn't holding
the object, as a tool. It reaches out,
hooks that annoying flap, and
deliberately folds it out of the way.
And only after it's cleared the path,
does it go on to finish the job. It's
solving a problem by using its body in a
totally new way. And one last example of
this improvisation. The robot needs to
grab a Tic Tac container out of a bin,
but the container is right up against
the wall, so there's no room for the
fingers to get around it. The robot's
solution is so simple and so smart. It
realizes it can't just grab it directly.
So, first it does a prep move. It uses
one finger to just nudge the container
away from the wall out into the middle
of the bin. That one little move creates
just enough space for it to get a
perfect stable grip. It actually changed
the world a little bit to make its goal
possible. And this is the most
important thing to get here. None of
these clever moves, the catch, the flip,
the fold, the nudge, none of them were
programmed. There's no if-then statement
for a cardboard flap. These are emergent
behaviors. They're learned physical
intuitions that just happen when you
train a huge AI model on a massive
diverse data set that truly captures the
physics of the real world. This is the
big shift from fragile pre-programmed
robots to ones with robust learned
intuition. You know, what we're seeing
here is way bigger than just making
robots less clumsy. This could be the
start of a bridge between what we think
of as low-level physical reflexes and
high-level intelligent thinking. It's
suggesting that maybe those two things
aren't so separate after all. Take a
look at this demo where a robot is shown
a finished Lego model for just a second
and then it has to build copies of it
from a jumbled pile of bricks. To pull
this off, a single AI model has to work
on multiple levels at once. At the very
lowest level, it's doing all that
physical common sense stuff we just saw,
nudging a brick into place, regripping a
piece that's a little off. But at the
exact same time, it's doing high-level
reasoning. It has to remember the goal,
find the next piece it needs in that
messy pile, and plan out the building
sequence. And this is where it gets
really, really deep. As these physically
grounded models get better, that hard
line that AI designers have always drawn
between low-level motor control and
high-level strategy, well, it just starts to
melt away. Thinking and acting start to
become two sides of the same coin, woven
together, just like they are for us.
Your plan to make coffee is made up of a
thousand tiny physical intuitions about
handling the filter and pouring the
water. As the researchers put it, your
high-level plans have to happen in real time
because gravity doesn't wait for anyone.
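If you wanted to caricature that
collapse in code, it might look like the
loop below: one model, one tick rate,
with the plan update and the motor
command coming out of the same forward
pass. Everything here, `model`, `robot`,
the method names, is a hypothetical
sketch of the idea, not the actual
architecture.

```python
def unified_loop(model, robot, goal_image):
    """Planning and acting in one loop: no hand-off between a separate
    'planner' module and a 'controller' module. Illustrative only."""
    memory = model.encode_goal(goal_image)        # e.g., the glimpsed Lego model
    while not model.task_done(memory):
        obs = robot.sense()                       # vision, touch, proprioception
        memory, action = model.step(memory, obs)  # update the plan AND pick the
                                                  # next motor command, together
        robot.act(action)                         # in real time: gravity doesn't wait
```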
And this brings us to the biggest
takeaway from all of this. For decades,
AI research has mostly been a top-down
game, starting with abstract logic and
symbols and hoping to somehow connect it
all to the real world later. But maybe
that was all backwards. This new
evidence suggests that real general
intelligence, the kind that can actually
work in our world, doesn't start with
symbols. It has to be built up from a
foundation of physical experience. It
has to start with a body. Real robot
intelligence starts with physical common
sense. Unlocking this dark matter of
robotics could be the tipping point. The
moment when robots finally go from being
specialized tools stuck in cages to
being general purpose helpers in our
everyday lives. And it leaves us with
one final huge question to think about.
We are right at the beginning of this.
But what happens to our world, to how we
make things, to how we get things, to
healthcare, to our own homes? What
happens when the machines we build
finally develop a real gut feeling for
the physical reality we all live in? The
possibilities are just staggering, and
we'll be here to explain them as they
happen. Thanks for tuning in.