Have you ever thought about how weird it is that some of the most powerful AIs on the planet can write a beautiful poem, but can't tell you if a car is making a U-turn? Well, today we're diving into a project called Foundation Motion that's designed to fix that, tackling one of AI's biggest and honestly most surprising blind spots.

So, let's just jump right in. You've got these massive AI models, right? Things like Gemini and Qwen. They can generate incredible art, write code, do all this amazing stuff. But here's the wild part: you show them a video and ask them to describe something as basic as a car turning a corner, and they just fall apart. It's this massive, glaring gap in how they understand the physical world. And this slide really nails the problem. AIs are absolutely brilliant at identifying what something is. They'll look at a video and go, "Yep, that's a car. That's a person. That's a ball. No problem." But the second you ask them how that thing is moving, they get completely lost. They've mastered the nouns, you could say, but they're completely failing at the verbs.

So, here's how we're going to break it all down. First, we're going to dig a little deeper into this motion blind spot. Then, we'll look at the real reason this problem exists: a huge data bottleneck. After that, we'll introduce the solution, an automated data factory called Foundation Motion. We'll see how it works, check out some pretty shocking results, and then talk about what this all means for the future of AI in the real world.

Okay, first up: AI's motion blind spot. To really understand why this is such a big deal, you kind of have to take a step back and think about what intelligence even is. I really love this quote from the psychologist Barbara Tversky. She says, "Spatial thinking is the foundation of thought." What she means is that understanding space and movement isn't just one part of being smart; it's the bedrock of everything else. I mean, think about your own day. From squeezing your car into a tight parking spot to just stacking groceries in a bag so they don't crush each other, you're constantly, intuitively reasoning about physics. If we want AIs to be truly intelligent, they have to get this right.

So, if it's so fundamental, why are AIs so bad at it? Well, it's not because the models aren't big enough or powerful enough. The real culprit here is a massive shortage of the right kind of data: data that actually explains how things move. And just to give you a sense of the scale of this problem, get this: researchers calculated that it would take a team of 10 people working full-time about 100 days just to manually label the motion in 100,000 short videos. That's over three months of non-stop work for what, in the AI world, is actually a pretty tiny dataset. So there you have it. That's the core issue. Doing this by hand is a gigantic bottleneck. It is way too slow and way too expensive to ever work at the scale that modern AI needs. And this data scarcity is what has been holding AI back from learning real, common-sense physical reasoning.

Which brings us to the hero of our story: Foundation Motion. This is a system that was specifically designed to smash right through that data bottleneck by automating the entire creation process. The best way to think about Foundation Motion is as a data factory. You feed raw, unlabeled videos in one end, and out the other comes a massive, high-quality dataset packed with super-detailed descriptions of motion. It's churning out exactly the kind of "how" data that AI has been starved for. And the scale we're talking about here is just staggering. Using this automated factory, the team created a dataset with 467,000 high-quality question-and-answer pairs, all about motion. Just imagine trying to do that by hand. It would be practically impossible.
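To make that a little more concrete, here is roughly what a single entry in a dataset like that might look like. This is a purely hypothetical sample, since the talk doesn't show the dataset's actual schema or contents; the point is just the "how things move" flavor of the annotations:

```python
# A hypothetical motion question-answer pair, for illustration only.
# The real dataset's fields and format are not shown in this talk.
qa_pair = {
    "video": "clip_04172.mp4",  # made-up clip name
    "question": "How does the white car move during this clip?",
    "answer": (
        "It slows down as it approaches the intersection, makes a U-turn, "
        "and then accelerates back in the direction it came from."
    ),
}
```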
So, you're probably wondering: how does this magic data factory actually work? Let's pop the hood and take a quick look at the pipeline. It's basically a clever four-step process. First, it automatically finds video clips that have a lot of interesting movement. Second, it uses other AI models to detect and track the path of every single moving object in each clip. Third, it packages up all that tracking data along with the video itself. And fourth, here's the cool part, it hands everything over to a large language model to generate really detailed captions and even question-answer pairs about the motion.

And right there, that is the secret sauce. The system doesn't just show the video to the language model. It also gives it a cheat sheet: basically, a file with the precise trajectory, the speed, and the location of every moving object. Having that extra structured data is what allows the language model to go from a generic description to something incredibly detailed and accurate.
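Here is a minimal sketch of what steps two through four of that pipeline could look like in code. Everything in it is an assumption made for illustration: the function names, field names, and trajectory summary are invented, and the talk doesn't specify which detectors, trackers, or prompting details the real system uses.

```python
import json
import math
from dataclasses import dataclass


@dataclass
class Track:
    """One moving object's path through a clip, as (time_s, x, y) points."""
    object_id: int
    label: str  # e.g. "car", from an off-the-shelf detector/tracker (step 2)
    points: list[tuple[float, float, float]]


def summarize_track(track: Track) -> dict:
    """Step 3 (hypothetical): distill a raw trajectory into one structured
    'cheat sheet' entry: start/end location, displacement, average speed."""
    t0, x0, y0 = track.points[0]
    t1, x1, y1 = track.points[-1]
    dx, dy = x1 - x0, y1 - y0
    duration = max(t1 - t0, 1e-6)  # guard against zero-length clips
    return {
        "id": track.object_id,
        "label": track.label,
        "start_xy": [x0, y0],
        "end_xy": [x1, y1],
        "distance_px": math.hypot(dx, dy),
        "avg_speed_px_per_s": math.hypot(dx, dy) / duration,
    }


def build_prompt(tracks: list[Track]) -> str:
    """Step 4 (hypothetical): package the cheat sheet into a prompt for a
    large language model, which would also receive the video frames."""
    cheat_sheet = json.dumps([summarize_track(t) for t in tracks], indent=2)
    return (
        "Tracked trajectories of every moving object in this clip:\n"
        f"{cheat_sheet}\n\n"
        "Using the video and these trajectories, write a detailed caption of "
        "the motion, plus question-answer pairs about how each object moves."
    )


# Toy usage: one object drifting right and slightly down over two seconds.
car = Track(1, "car", [(0.0, 100.0, 200.0), (1.0, 160.0, 210.0), (2.0, 220.0, 220.0)])
print(build_prompt([car]))
```

The key design idea the talk describes survives even in this toy version: the language model never has to estimate motion from pixels alone, because the hard geometric work has already been done and handed to it as structured data.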
Okay, now we get to the really exciting part: the results. What happens when you actually train an AI on this new, super-smart data? Well, the answer is pretty mind-blowing. Just look at this chart. This is a classic David-versus-Goliath story. On a benchmark for understanding car movements, you've got the giants, Google's Gemini and Alibaba's Qwen, scoring in the low 80s. Pretty good. But then you have this other, much smaller model from NVIDIA. When it's trained on the Foundation Motion data, it just blows them out of the water, scoring over 91%. And it's not a one-off thing, either. We see this pattern again and again on other benchmarks: the smaller model with the smarter data consistently outperforms the giants. And check out the gains when you apply this to specific tasks. A nearly 15% jump in performance for robotics. Over 7% for driving tasks. That's huge. It's definitive proof that, at least for understanding the physical world, smarter data really can beat bigger models.

So, let's zoom out. What's the big picture? Why does it matter so much if an AI can tell that a car is making a left turn? What's the real endgame here? Well, the real-world impact is potentially massive. For autonomous cars, this means a much deeper, more nuanced understanding of complex traffic scenarios. For robotics, it could unlock the ability to perform delicate physical tasks that require a true feel for how objects move. It's a critical step toward what researchers call embodied AI: AIs that can actually understand and interact with the physical world around us safely and competently.

Now, of course, there's still work to do. Right now, this all operates on 2D video. The huge next leap, the next frontier, is to bring this same deep understanding of motion into true 3D space. When that happens, especially for robotics, it's going to be an absolute game-changer. Which leaves us with a really fascinating thought to end on. For the longest time, the AI race has felt like it's all about scale: who can build the biggest model with the most parameters? But what Foundation Motion strongly suggests is that maybe we've been focused on the wrong thing. If a smaller model with smarter, more targeted data can outperform the giants, maybe the future of AI isn't just about raw size. Maybe it's about wisdom.