Have you ever thought about how weird it is that some of the most powerful AIs on the planet can write a beautiful poem, but can't tell you if a car is making a U-turn? Well, today we're diving into a project called Foundation Motion that's designed to fix that, tackling one of AI's biggest and honestly most surprising blind spots.

So, let's just jump right in. You've got these massive AI models, right? Things like Gemini and Qwen. They can generate incredible art, write code, do all this amazing stuff. But here's the wild part: you show them a video and ask them to describe something as basic as a car turning a corner, and they just fall apart. It's this massive, glaring gap in how they understand the physical world. And this slide really nails the problem. AIs are absolutely brilliant at identifying what something is. They'll look at a video and go, "Yep, that's a car. That's a person. That's a ball. No problem." But the second you ask them how that thing is moving, they get completely lost. They've mastered the nouns, you could say, but they're completely failing at the verbs.

So, here's how we're going to break it all down. First, we're going to dig a little deeper into this motion blind spot. Then, we'll look at the real reason this problem exists: a huge data bottleneck. After that, we'll introduce the solution, an automated data factory called Foundation Motion. We'll see how it works, check out some pretty shocking results, and then talk about what this all means for the future of AI in the real world.

Okay, first up: AI's motion blind spot. To really understand why this is such a big deal, you kind of have to take a step back and think about what intelligence even is. I really love this quote from the psychologist Barbara Tversky. She says, "Spatial thinking is the foundation of thought." What she means is that understanding space and movement isn't just one part of being smart; it's the bedrock of everything else. I mean, think about your own day. From squeezing your car into a tight parking spot to just stacking groceries in a bag so they don't crush each other, you're constantly, intuitively reasoning about physics. If we want AIs to be truly intelligent, they have to get this right.

So, if it's so fundamental, why are AIs so bad at it? Well, it's not because the models aren't big enough or powerful enough. The real culprit here is a massive shortage of the right kind of data: data that actually explains how things move. And just to give you a sense of the scale of this problem, get this: researchers calculated that it would take a team of 10 people working full-time about 100 days just to manually label the motion in 100,000 short videos. That's over three months of non-stop work for what, in the AI world, is actually a pretty tiny dataset. So there you have it. That's the core issue. Doing this by hand is a gigantic bottleneck. It is way too slow and way too expensive to ever work at the scale that modern AI needs. And this data scarcity is what has been holding AI back from learning real, common-sense physical reasoning.

Which brings us to the hero of our story: Foundation Motion. This is a system that was specifically designed to smash right through that data bottleneck by automating the entire creation process. The best way to think about Foundation Motion is as a data factory. You feed raw, unlabeled videos in one end, and out the other comes a massive, high-quality dataset packed with super-detailed descriptions of motion. It's churning out exactly the kind of "how" data that AI has been starved for. And the scale we're talking about here is just staggering. Using this automated factory, the team created a dataset with 467,000 high-quality question-and-answer pairs, all about motion. Just imagine trying to do that by hand. It would be practically impossible.
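To make that a little more concrete, here is roughly what a single entry in a dataset like that might look like. This is a purely hypothetical sample, since the talk doesn't show the dataset's actual schema or contents; the point is just the "how things move" flavor of the annotations:

```python
# A hypothetical motion question-answer pair, for illustration only.
# The real dataset's fields and format are not shown in this talk.
qa_pair = {
    "video": "clip_04172.mp4",  # made-up clip name
    "question": "How does the white car move during this clip?",
    "answer": (
        "It slows down as it approaches the intersection, makes a U-turn, "
        "and then accelerates back in the direction it came from."
    ),
}
```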
So, you're probably wondering: how does this magic data factory actually work? Let's pop the hood and take a quick look at the pipeline. It's basically a clever four-step process. First, it automatically finds video clips that have a lot of interesting movement. Second, it uses other AI models to detect and track the path of every single moving object in each clip. Third, it packages up all that tracking data along with the video itself. And fourth, here's the cool part, it hands everything over to a large language model to generate really detailed captions and even question-answer pairs about the motion.

And right there, that is the secret sauce. The system doesn't just show the video to the language model. It also gives it a cheat sheet: basically, a file with the precise trajectory, the speed, and the location of every moving object. Having that extra structured data is what allows the language model to go from a generic description to something incredibly detailed and accurate.
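Here is a minimal sketch of what steps two through four of that pipeline could look like in code. Everything in it is an assumption made for illustration: the function names, field names, and trajectory summary are invented, and the talk doesn't specify which detectors, trackers, or prompting details the real system uses.

```python
import json
import math
from dataclasses import dataclass


@dataclass
class Track:
    """One moving object's path through a clip, as (time_s, x, y) points."""
    object_id: int
    label: str  # e.g. "car", from an off-the-shelf detector/tracker (step 2)
    points: list[tuple[float, float, float]]


def summarize_track(track: Track) -> dict:
    """Step 3 (hypothetical): distill a raw trajectory into one structured
    'cheat sheet' entry: start/end location, displacement, average speed."""
    t0, x0, y0 = track.points[0]
    t1, x1, y1 = track.points[-1]
    dx, dy = x1 - x0, y1 - y0
    duration = max(t1 - t0, 1e-6)  # guard against zero-length clips
    return {
        "id": track.object_id,
        "label": track.label,
        "start_xy": [x0, y0],
        "end_xy": [x1, y1],
        "distance_px": math.hypot(dx, dy),
        "avg_speed_px_per_s": math.hypot(dx, dy) / duration,
    }


def build_prompt(tracks: list[Track]) -> str:
    """Step 4 (hypothetical): package the cheat sheet into a prompt for a
    large language model, which would also receive the video frames."""
    cheat_sheet = json.dumps([summarize_track(t) for t in tracks], indent=2)
    return (
        "Tracked trajectories of every moving object in this clip:\n"
        f"{cheat_sheet}\n\n"
        "Using the video and these trajectories, write a detailed caption of "
        "the motion, plus question-answer pairs about how each object moves."
    )


# Toy usage: one object drifting right and slightly down over two seconds.
car = Track(1, "car", [(0.0, 100.0, 200.0), (1.0, 160.0, 210.0), (2.0, 220.0, 220.0)])
print(build_prompt([car]))
```

The key design idea the talk describes survives even in this toy version: the language model never has to estimate motion from pixels alone, because the hard geometric work has already been done and handed to it as structured data.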
Okay, now we get to the really exciting part: the results. What happens when you actually train an AI on this new, super-smart data? Well, the answer is pretty mind-blowing. Just look at this chart. This is a classic David-versus-Goliath story. On a benchmark for understanding car movements, you've got the giants, Google's Gemini and Alibaba's Qwen, scoring in the low 80s. Pretty good. But then you have this other, much smaller model from NVIDIA. When it's trained on the Foundation Motion data, it just blows them out of the water, scoring over 91%. And it's not a one-off thing, either. We see this pattern again and again on other benchmarks: the smaller model with the smarter data consistently outperforms the giants. And check out the gains when you apply this to specific tasks. A nearly 15% jump in performance for robotics. Over 7% for driving tasks. That's huge. It's definitive proof that, at least for understanding the physical world, smarter data really can beat bigger models.

So, let's zoom out. What's the big picture? Why does it matter so much if an AI can tell that a car is making a left turn? What's the real endgame here? Well, the real-world impact is potentially massive. For autonomous cars, this means a much deeper, more nuanced understanding of complex traffic scenarios. For robotics, it could unlock the ability to perform delicate physical tasks that require a true feel for how objects move. It's a critical step toward what researchers call embodied AI: AIs that can actually understand and interact with the physical world around us safely and competently.

Now, of course, there's still work to do. Right now, this all operates on 2D video. The huge next leap, the next frontier, is to bring this same deep understanding of motion into true 3D space. When that happens, especially for robotics, it's going to be an absolute game-changer. Which leaves us with a really fascinating thought to end on. For the longest time, the AI race has felt like it's all about scale: who can build the biggest model with the most parameters? But what Foundation Motion strongly suggests is that maybe we've been focused on the wrong thing. If a smaller model with smarter, more targeted data can outperform the giants, maybe the future of AI isn't just about raw size. Maybe it's about wisdom.