Hello and welcome. You know, we've all seen those mind-blowing videos that look like they're straight out of a sci-fi movie, right? Robots doing these incredibly complex things with almost humanlike skill. But here's the thing. There is a huge, huge gap between a robot that can work perfectly in a controlled lab and one that can actually function out here in our world, which, let's be honest, is messy. It's unpredictable and it's always changing. So today we're going to dive deep into a pretty revolutionary training idea that's closing that gap. It's a set of techniques called co-training, and it is paving the way for the very first true generalist robots. I mean, just look at this. It's a robot making a cup of coffee. And you have to admit, it's seriously impressive. The precision, the way the two arms coordinate, the way it handles that portafilter. It's the kind of thing that just makes you feel like, okay, the future is finally here. This is the promise of modern robotics all wrapped up in one single, really elegant action. But that leads to the big question, doesn't it? If robots can pull this off in a demo, why don't we have them in our homes or in our local coffee shops? You know, why isn't a robot butler tidying up your living room or whipping up breakfast? Well, the answer lies in a fundamental challenge in robotics: the problem of generalization. That amazing coffee-making robot? It might fail completely, just totally break down, if the lighting changes, or if the coffee machine gets moved just two inches to the left, or even if you use a slightly different brand of coffee beans. So, how do we fix that? That's the big puzzle. Here's our road map for this deep dive. First, we'll kick things off by really defining the core problem facing what we call robots in the wild. Then, we're going to introduce the really elegant solution, the co-training paradigm.
After that, we'll break down the specific tools in the co-training toolkit, looking at four really powerful techniques. From there, we'll see how this whole concept actually extends to humans and robots learning together. And finally, we'll see what it's all building towards: the emergence of the true generalist robot. Okay, so let's start at the very beginning. To really get why today's robots can be so brittle, we first need to pop the hood and see how they actually learn. The brains of these machines are a really powerful new kind of AI model. They're called vision-language-action models, or VLAs for short. And the concept is actually beautifully simple. The model sees the world through its cameras. That's the vision part. It's given a command in plain English, like "pick up the red block." That's the language part. And then its job is to figure out the exact sequence of motor commands to make that happen. And that is the action. And the main way they learn is through imitation, just by watching thousands and thousands of human demonstrations. And this right here gets us to the absolute heart of the problem. As the researchers from Stanford point out in this quote, even when you train these models on massive data sets, they end up being brittle. That is the perfect word for it. They just crack under pressure. If the real world doesn't look almost exactly like the data they were trained on, they fail. A shadow falls in the wrong spot, somebody moves a chair in the background, the camera angle is just a little bit different, and suddenly the robot is completely lost. So this brittleness, this is the key obstacle that's been holding robotics back. And to get over it, researchers are pioneering a completely new philosophy for teaching these machines, a really powerful set of ideas that we're calling the co-training paradigm. You know, the best way to wrap your head around co-training is with an analogy. So, imagine a student getting ready for a huge exam.
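To make the vision-language-action idea concrete, here's a minimal, purely illustrative Python sketch of the interface such a model exposes. Everything here (`Observation`, `ToyVLA`, the 7-joint arm) is invented for illustration; a real VLA replaces the body of `predict` with a large neural network trained by imitation on (observation, action) pairs from human demonstrations.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    camera_image: List[int]    # stand-in for raw pixels (the "vision" input)
    instruction: str           # plain-English command (the "language" input)

@dataclass
class Action:
    joint_deltas: List[float]  # motor commands (the "action" output)

class ToyVLA:
    """Illustrative policy: (pixels, text) -> a short motor trajectory."""
    def predict(self, obs: Observation) -> List[Action]:
        # A real VLA computes this with a neural network learned from
        # thousands of demonstrations; we just return a dummy trajectory.
        return [Action(joint_deltas=[0.0] * 7) for _ in range(3)]

policy = ToyVLA()
obs = Observation(camera_image=[0] * 16, instruction="pick up the red block")
trajectory = policy.predict(obs)
```

The point of the sketch is just the shape of the problem: images and text go in, a sequence of low-level motor commands comes out.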
A robot trained the old way, with just one single data set, is kind of like a student who only reads one textbook over and over and over again. Sure, they might memorize it perfectly. But if the exam asks a question in a slightly different way, they're completely stuck. But a co-trained robot, that's like a student in a study group. They're reading multiple textbooks. They're watching explainer videos. They're discussing the concepts with their friends. And they're working through all different kinds of practice problems. They learn the underlying principles, not just memorizing the words on a page. And that's exactly what co-training is. Instead of just relying on one single uniform data set, it's a strategy for training a single model on many different kinds of data all at the same time. This could be data from simulations, data from totally different robots, general knowledge scraped from the web, or even data from static images. The whole goal is to give the robot a more well-rounded, more comprehensive education so it can build an understanding of the world that is deep and flexible, not shallow and brittle. Okay, so we've covered the what and the why. Now we're about to get into the how: the really cool specific techniques that make co-training possible. If you're finding this level of detail as fascinating as I do, and you want to keep exploring the tech that's literally building our future, just take a quick second to subscribe. We would absolutely love to have you in our study group. All right, let's open up the toolbox. Co-training isn't some single magic bullet. It's a whole collection of powerful methods. We're going to break down four of the most important techniques that researchers are using right now to build these smarter, more robust robots. So, our first technique is called invariance co-training. The core idea here is to explicitly teach a robot what not to pay attention to. It's all about building invariance to all the distractions.
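In code, the core mechanic of co-training is really just a sampling strategy: every training batch is drawn from several data sources at once, with mixture weights chosen by the practitioner rather than dictated by raw dataset size. A minimal sketch, where the dataset names, sizes, and weights are all illustrative and not from any particular paper:

```python
import random

# Hypothetical data sources; the names and sizes are illustrative.
datasets = {
    "real_robot": [f"real_traj_{i}" for i in range(10)],
    "simulation": [f"sim_traj_{i}" for i in range(1000)],
    "web_images": [f"web_sample_{i}" for i in range(5000)],
}

# Mixture weights control how often each source appears in training,
# independent of how big each raw dataset is.
weights = {"real_robot": 0.5, "simulation": 0.3, "web_images": 0.2}

def sample_cotraining_batch(batch_size: int, rng: random.Random):
    """Draw one mixed batch: each element comes from a weighted source."""
    names = list(datasets)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        source = rng.choices(names, weights=probs, k=1)[0]
        batch.append((source, rng.choice(datasets[source])))
    return batch

rng = random.Random(0)
batch = sample_cotraining_batch(8, rng)  # one co-training batch
```

Notice that the tiny real-robot set can still dominate the mixture if you weight it heavily; choosing those weights is one of the main practical knobs in co-training.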
I mean, think about it. When you pick up a cup, the actual task doesn't change if the lights get brighter or if someone moves a plant in the background, right? You just instinctively know those things are totally irrelevant. But robots don't. Invariance co-training helps them learn this. It uses a mix of real robot data and a massive amount of synthetic images with all kinds of different camera angles, lighting, and just random objects in the background. By co-training on all this varied visual data, the model learns to filter out all that noise and focus only on the elements that are actually essential for the task at hand. And does it work? Oh, yeah. The results are dramatic. Researchers found this method, all by itself, boosts a robot's success rate by 40% when it's faced with these kinds of real-world visual distractions. And let me tell you, in the world of robotics, a 40% jump is a massive, massive leap in performance. All right, the next tool in our kit leverages the incredible power of simulation. It's called sim-and-real co-training, and it's designed to tackle the single biggest bottleneck in robotics: the scarcity of good data. This slide really illustrates a fundamental trade-off in robotics. See, collecting real-world robot data is super slow. It's expensive, and it requires a human to manually drive the robot for every single demonstration. But in simulation, we can generate a nearly infinite amount of data, basically for free and completely automatically. We can create millions of trajectories with endless variations of objects and scenarios. So, sim-and-real co-training combines the best of both worlds. You get the massive scale and diversity of simulation data combined with the high-fidelity real-world grounding of a smaller, more expensive physical data set. But there's a catch. The simulation can't just be any old thing.
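One way to picture invariance co-training is as aggressive randomization of everything task-irrelevant in the training images. The sketch below fakes this on a flat list of pixel values; real pipelines do it with rendered 3D scenes or learned augmentations, and the specific ranges here are made up for illustration:

```python
import random

def randomize_scene(image, rng):
    """Apply task-irrelevant variation: lighting shifts and distractors."""
    brightness = rng.uniform(0.5, 1.5)                 # lighting change
    augmented = [min(255, int(p * brightness)) for p in image]
    # A few random pixels stand in for distractor objects in the background.
    for _ in range(rng.randint(0, 4)):
        augmented[rng.randrange(len(augmented))] = rng.randrange(256)
    return augmented

rng = random.Random(0)
clean = [100] * 32                                     # toy 32-pixel "image"
variants = [randomize_scene(clean, rng) for _ in range(5)]
# Training on many such variants, with the SAME action labels, pushes the
# model to treat lighting and background clutter as noise to be ignored.
```

The key detail is in the last comment: the action labels don't change across variants, so the only consistent signal left for the model is the task-relevant content.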
As this key finding from a Google DeepMind paper points out, the real magic happens when the simulation is like a digital cousin of the real world. What that means is the tasks, the objects, and the general layout of the scene in the simulation should closely mirror the real environment. The closer that match is, the more effective the knowledge transfer and the bigger the performance boost for the robot in the real world. Our third technique basically expands the robot's classroom from the lab to the entire internet. This is web-data co-training. And believe it or not, it's about preventing our very smart models from getting dumber. So here's the problem. These vision-language-action models start their life as powerful vision-language models, or VLMs, that have been pre-trained on a huge chunk of the internet. They already have this vast general understanding of objects, concepts, and language. But when we then fine-tune them only on relatively tiny robot data sets, something called catastrophic forgetting can happen. The model basically overspecializes and forgets all that powerful general knowledge it started with. Co-training on web data right alongside the robot data solves this. It's like a constant refresher course, reminding the model of its vast pre-existing knowledge, keeping its language understanding sharp, and letting it connect a command like "clean up the spill" to a common-sense understanding of what spills and sponges and cleaning actually look like. Okay, now let's look at a really clever and more advanced technique. This one's called knowledge insulation, and it's all about carefully managing how the different parts of the robot's brain learn, just to make sure the whole process is stable and efficient. Okay, this one's a little more technical, but the idea is just brilliant. Imagine you're teaching a robot to pick up a pen. With knowledge insulation, you do two things at once.
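In training terms, the fix for catastrophic forgetting is simply to keep web data in the objective: instead of minimizing the robot imitation loss alone, you minimize a weighted blend of the robot loss and a loss on web data (captioning, visual question answering, plain text). A tiny sketch, where the 0.3 weight is an assumption for illustration, not a value from any paper:

```python
def cotraining_loss(robot_loss: float, web_loss: float,
                    web_weight: float = 0.3) -> float:
    """Blend the robot imitation loss with a web-data loss so the model
    keeps rehearsing its general knowledge instead of overwriting it."""
    return (1 - web_weight) * robot_loss + web_weight * web_loss

# Fine-tuning on robot data alone would optimize robot_loss in isolation
# and risk catastrophic forgetting; co-training keeps both terms active.
total = cotraining_loss(robot_loss=0.8, web_loss=0.4)
```

Because the web term never drops out of the objective, any update that badly damages the model's general language and vision skills also raises the total loss, so the optimizer is steered away from it.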
First, you teach the main VLM brain the high-level plan using these simple action tokens. Think of them like flash cards that say "move hand forward, then close gripper." It's a really stable, easy-to-learn signal. At the exact same time, a totally separate module, an action expert, learns the hard part: translating those flash cards into smooth, continuous motor commands. But here's the genius part. Step three, you build a firewall between them. You insulate the main brain from all the messy trial and error of the action expert. This means the expert's early, clumsy attempts at movement don't confuse the main brain while it's just trying to learn the basic plan. It's like having a soundproof practice room. The brain learns the clean theory, the expert perfects the messy practice, and they work together perfectly without messing up each other's education. So far, we've been talking about different data sets learning together inside a single AI model. But the principles of co-training actually go way beyond just data. The most powerful collaborations of all are going to be between robots and humans learning together in real time. This whole new field is called human-robot co-learning. And the key concept to get here is mutual adaptation. It's not a one-way street where the robot just learns from the human. Nope. For a team to become truly fluent and effective, the human also has to learn and adapt to the robot's capabilities, its tendencies, and yeah, even its mistakes. It's a continuous two-way feedback loop. This experiment just perfectly illustrates the idea. So, you have a person guiding a robot on a leash, and they have to navigate a course together. The thing is, the robot has its own intentions, and sometimes it's pulling in a different direction. The human has to constantly negotiate: when should I lead, and when should I let the robot lead? This isn't something you can just plan out ahead of time.
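The "firewall" in knowledge insulation is essentially a stop-gradient: the action expert reads the VLM's output, but the error signal from the expert's continuous-control loss is never propagated back into the VLM, which learns only from the stable action-token loss. Here's a toy gradient-descent sketch of that separation, with single scalar "parameters" and made-up targets and learning rate:

```python
vlm_params = {"w": 1.0}      # main brain: updated only by the token loss
expert_params = {"w": 0.5}   # action expert: updated only by the motor loss

def train_step(x: float, token_target: float, motor_target: float,
               lr: float = 0.1) -> None:
    # 1) The VLM predicts a coarse action token (the "flash card") and is
    #    updated by the gradient of that loss alone.
    token_pred = vlm_params["w"] * x
    vlm_params["w"] -= lr * 2 * (token_pred - token_target) * x

    # 2) Stop-gradient "firewall": the expert sees a detached value, so
    #    the motor loss below has no path back into vlm_params.
    detached = float(token_pred)

    # 3) Only the expert's own parameter absorbs the messy motor error.
    motor_pred = expert_params["w"] * detached
    expert_params["w"] -= lr * 2 * (motor_pred - motor_target) * detached

train_step(x=1.0, token_target=2.0, motor_target=2.0)
```

In a real framework the firewall is a `detach()` or `stop_gradient` call on the features handed to the action expert; the effect is the same, with the expert's early, clumsy updates unable to disturb what the VLM has learned.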
It's a dynamic, non-verbal co-learning process where both partners have to implicitly feel out and adapt to each other's strategy to have any chance of succeeding. And this is the critical insight from that research. The robots learned the most effectively when their human partner acted like a good teacher. When a person noticed the robot was about to mess up and they adapted their own strategy to kind of guide it away from that error, the robot didn't just avoid the mistake. It learned a better overall strategy that was perfectly in sync with its human partner. This just goes to show that the future of robotics isn't just about building better AIs. It's also about us becoming better collaborators. Okay, so we've looked at the problem of brittleness. We've explored the co-training paradigm and a whole toolkit of powerful techniques. Now it's time for the payoff. What happens when you combine all of these ideas? Well, you get the first real glimpse of a true generalist robot. And this right here brilliantly illustrates the result. This is the architecture for Google DeepMind's Gemini Robotics 1.5. Now, it looks complex, so let's walk through it. Look all the way to the far left. You see all the diverse inputs the system takes in: speech, text, images, and even the robot's own physical state, its proprioception. Now, follow those inputs to the central boxes. This is the brain, where the co-trained models use what they call thinking traces to reason about the task and plan the next steps. It can even call on external tools, like a web search, to get more info. Finally, look over to the right. You can see the incredible variety of outputs this one system can produce: pointing, segmenting images, and most importantly, generating actions that can be executed across completely different types of robots. So, let's break down what this new level of intelligence actually means, and let's connect it back to our co-training toolkit.
Embodied reasoning, its grounded understanding of physics? Well, that's supercharged by sim-and-real co-training, where it can learn from millions of simulated physical interactions. Thinking traces, its ability to literally talk itself through a problem? That's a direct result of co-training on massive web-scale language data. Motion transfer, which is kind of the holy grail of controlling different robots without retraining? That's enabled by co-training on data from tons of different robot types, both real and simulated, which lets the model learn the general concept of movement separate from any one body. And tool use, the ability to search the web? That's a clear benefit of web-data co-training, keeping the model plugged into a live source of information. And here is what that looks like in action. We're seeing the robot perform a long, multi-step task: organizing a cluttered desk. It has to identify multiple objects, understand their abstract categories, and then place them correctly. It knows that pens and staplers are office supplies because of that vast general knowledge it got from web-data co-training. And it avoids knocking things over because its embodied reasoning was hardened by millions and millions of trials in simulation. Here's another example: tidying up a kitchen shelf. The robot is taking items from a table and putting them away. Its visual system isn't thrown off by the shifting shadows or that cluttered background because of invariance co-training. It can handle all these different objects because it's learned from a huge diversity of examples. This is that generalist ability we've been building towards: a robot that you can give a high-level command to, and it can figure out the complex sequence of actions needed to get it done in a messy, unstructured environment. So, let's just bring it all together. The incredible versatility you're seeing here isn't the result of one single breakthrough. It's the result of a whole paradigm shift.
It's a system that's built on co-training: learning from simulation, from the web, from diverse visual data, and from those carefully insulated internal processes. This is what finally allows a robot to generalize, to take what it has learned and apply it flexibly to new tasks, new environments, and yeah, even entirely new bodies. And this brings us to our final thought. For decades, the focus has really been on making robots smart enough to work with us. But as co-training makes them exponentially better learners and more adaptive partners, the question starts to flip. It's no longer just about the robot adapting to us. It's about us adapting to them. So, how will we need to change our behaviors, our instructions, and our expectations to become effective partners and teachers for this new generation of generalist robots? That is all for this deep dive. The future of human-robot collaboration is being written as we speak, and co-learning is the language it's being written in. If you want to continue learning alongside us as we explore these frontiers, make sure you're subscribed. Thanks for watching.