Train LLMs for $5? DeepSeek’s mHC Breakthrough & The Blueberry 88M Project
9EEUbjf7Oig • 2026-01-08
Transcript
Welcome to the explainer. Today we're getting into the weeds on a totally new way to build large language models. And this whole thing is inspired by a really ambitious mission: to create a world-class, top-tier AI that is completely open-source for everyone.

So here's the plan. We're going to start with the quest that kicked this whole thing off. Then we'll get a handle on the basic tech that powers pretty much every LLM out there. After that, we'll uncover a brilliant but, as you'll see, deeply flawed new idea. Then comes the really cool part: the elegant mathematical fix. We'll see how they engineered it to work at massive scale, and then we'll zoom out and look at what this all means for the future of how we build AI.

Okay, so this all starts with the Open Superintelligence Lab. Now, these folks have one single, incredibly audacious goal: build one of the world's top-10 large language models and then just give it away. Keep it fully open. You know, this isn't just about making another chatbot. This is about democratizing the absolute bleeding edge of artificial intelligence. And trust me, they are not messing around. Just look at their public roadmap. It is an all-out sprint to catch up to, and then blow past, the biggest models out there, with the goal of hitting the top 10 by the end of 2027. Now, that raises a huge question, right? What kind of insane, radical new technology do you need to even try something like that? You can't just do what everyone else is doing. You need a fundamental edge. And that is where our story really gets interesting.

So, to really get what's so revolutionary here, we've got to go back to basics for a second. We need to talk about the fundamental plumbing that's inside almost every single large neural network today. It's called the residual connection. And honestly, it completely changed the game for training ridiculously deep networks. Here's the best way to think about it. Picture some data, like a word from a sentence, going into a layer of the network to get processed. While that's happening, a perfect, untouched copy of that original word takes an express lane right around the processing block. At the other end, the new processed version and the original version just get added together. Now, this is absolutely critical. It means that even if a layer learns nothing useful, the model doesn't get dumber. The original signal just passes straight through. It's the ultimate safety net, and it's what lets us build these models that are hundreds of layers deep without them just falling apart.
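To make that concrete, here's a minimal sketch of a residual block in plain NumPy. The layer function, sizes, and weights are illustrative placeholders, not the video's actual model; the point is just the x + f(x) pattern and the safety net it gives you.

```python
import numpy as np

def layer(x, W):
    """Stand-in for any processing block (attention, MLP, ...)."""
    return np.tanh(x @ W)

def residual_block(x, W):
    # The "express lane": an untouched copy of x is added back to the
    # processed output, so even a useless layer can't erase the signal.
    return x + layer(x, W)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 64))   # one token's hidden state
W = np.zeros((64, 64))         # a layer that has learned nothing useful

out = residual_block(x, W)
print(np.allclose(out, x))     # True: the original signal passes straight through
```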
And that leads to a really important distinction we have to make. You know, most of the progress you hear about, things like new attention mechanisms or mixture-of-experts, is all in what we call micro design. It's like inventing a better fuel injector: a more efficient part for the engine. But the story we're telling today is all about macro design. It's a radical change to the car's entire chassis. We're not just tuning up the engine. We are fundamentally rethinking how information flows through the entire system.

So with this focus on the big picture, on macro design, some researchers at ByteDance came up with a wild idea back in 2025. They called it hyper-connections, or HC. They looked at that single express lane we just talked about, and they asked a really simple question: why just one? Why not build a multi-lane superhighway? The core idea is brilliant. You take the data for a token and you expand it out into, let's say, four parallel streams. The hope is that each stream might specialize. You know, maybe stream one gets really good at grammar. Stream two holds the long-range context of the conversation. Stream three focuses on math. Now, here's the genius part. Right before you get to a really expensive part of the model, like the attention layer, you use a smart function to squish all four streams down into one. You do all the heavy lifting on that single condensed stream, and then you expand it back out into four. So you get the information capacity of a four-lane highway, but with the computational traffic of a single-lane road. It's super clever. And on the surface, this sounds amazing, right? It's a fantastic way to cram more memory and more information into the model. It's a genuinely clever idea. But there's a catch. There's a hidden flaw that you only see when you really, really scale it up.
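Before we get to that flaw, here's a rough sketch of the expand, squish, expand pattern. This is a loose simplification under assumed shapes (four streams, fixed random mixing weights standing in for learned ones), not ByteDance's actual hyper-connections implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 4                               # hidden size, number of streams

def expensive_block(x):
    """Stand-in for an attention/MLP layer that runs on ONE stream."""
    return np.tanh(x @ (rng.normal(size=(d, d)) * 0.1))

streams = rng.normal(size=(n, d))          # four parallel streams for a token

squish = rng.normal(size=(n,))             # mixing weights: 4 streams -> 1
expand = rng.normal(size=(n,))             # mixing weights: 1 -> 4 streams

condensed = squish @ streams               # (d,): squish down to a single lane
processed = expensive_block(condensed)     # heavy lifting happens once, on one lane
streams = streams + np.outer(expand, processed)  # fan the result back out to 4 lanes

print(streams.shape)                       # (4, 64): 4x the capacity,
                                           # roughly 1x the expensive compute
```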
So, what happens when you try to build a 100-story skyscraper with this shiny new design? Well, this happens: it completely breaks. I mean, it spectacularly fails. What you're looking at is the training progress of a massive 27-billion-parameter model using this design. And for a while, everything looks great. It's actually learning faster than the standard models. But then, right around the 12,000th training step, bam: the loss just goes through the roof. The whole learning process becomes totally unstable, and the model's performance just collapses into absolute chaos.

So, what on earth is going on here? Well, the problem is that we lost that beautiful safety net from the original design. The way the information is getting mixed between the four streams at each layer isn't controlled, so the signals are getting amplified layer after layer after layer. It's like you're turning up the volume knob just a tiny, tiny bit, but you do it a hundred times in a row; even an 8% gain per layer compounds to over 2,000x after a hundred layers. By the time you get to the top, the signal is absolutely deafening. This chart shows the signal gain, which should be one, has rocketed to 3,000. It's 3,000 times louder than it should be. At that point, it's not a signal anymore. It's just noise. The model is basically just listening to static. This is a classic large-scale engineering problem: a brilliant idea on paper that completely shatters when it meets the brutal reality of a truly massive system. Now, how they fixed it is, in my opinion, even more brilliant than the original idea. And if you want to see how researchers solve these kinds of huge engineering puzzles, make sure you're subscribed for more of these explainers.

Okay, so this is where DeepSeek AI comes into the picture with an incredibly elegant solution. They call it manifold-constrained hyper-connections, or mHC. And they didn't throw out the superhighway idea. They knew it was powerful. They just added a traffic controller, a very specific, mathematically perfect traffic controller. And that traffic controller has a really fancy name: the doubly stochastic matrix. Now, I know that sounds super complicated, but the idea behind it is actually incredibly simple and powerful. It's just a grid of numbers where every single row adds up to one and every single column also adds up to one. And that one simple rule guarantees that when you use it to mix the information between your streams, you cannot create or destroy signal energy. You can only redistribute it. The exploding-signal problem? Gone. It's a perfect fix.

So how do you actually force the model to learn a matrix with this special property? Well, you use this beautiful, classic algorithm from the 1960s called Sinkhorn-Knopp. It's actually pretty simple. You take your matrix and you force all the rows to add up to one. Of course, that messes up the columns. So then you force all the columns to add up to one, which messes up the rows again, but a little less this time. And you just keep doing that, back and forth, back and forth. And amazingly, it is mathematically guaranteed to eventually settle on a perfect doubly stochastic matrix.

Okay, let's make this super concrete. Let's say we only have two streams. Stream one has a really strong signal; let's call it (100, 100). Stream two is totally empty: (0, 0). Now we apply our special mixing matrix. To get the new stream one, we take 90% of the old stream one and 10% of the old, empty stream two. To get the new stream two, we do the opposite: 10% from the strong one, 90% from the empty one. And look what happens: the total signal, 200, is perfectly conserved. It's just been redistributed. It's a perfectly stable, completely controlled leak of information between the lanes.
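Here's a minimal sketch of the Sinkhorn-Knopp iteration, plus a check of that two-stream example. This is the textbook algorithm with illustrative sizes, not DeepSeek's actual training code; how mHC parameterizes and learns the matrix inside the network is more involved.

```python
import numpy as np

def sinkhorn_knopp(M, iters=50):
    """Alternately normalize rows and columns of a positive matrix.
    Converges to a doubly stochastic matrix (rows and columns sum to 1)."""
    M = np.asarray(M, dtype=float)
    for _ in range(iters):
        M = M / M.sum(axis=1, keepdims=True)  # rows sum to 1 (breaks the columns)
        M = M / M.sum(axis=0, keepdims=True)  # columns sum to 1 (breaks rows, but less)
    return M

rng = np.random.default_rng(0)
P = sinkhorn_knopp(rng.uniform(0.1, 1.0, size=(4, 4)))
print(P.sum(axis=1))   # each row  ~= 1
print(P.sum(axis=0))   # each column = 1

# The two-stream example from the video: a 90/10 mix is doubly stochastic.
mix = np.array([[0.9, 0.1],
                [0.1, 0.9]])
streams = np.array([[100.0, 100.0],    # stream 1: strong signal
                    [  0.0,   0.0]])   # stream 2: empty
mixed = mix @ streams
print(mixed)        # [[90. 90.] [10. 10.]]
print(mixed.sum())  # 200.0 -- total signal conserved, just redistributed
```

One detail worth noting: after the final column pass, the columns sum to one exactly and the rows only approximately; with enough iterations, both converge.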
Okay, so the math checks out. The theory is beautiful. We have a stable way to get all the benefits of this superhighway without the model blowing up. But, you know, there's always a "but." Theory is one thing; making this stuff actually run efficiently on thousands of GPUs is a whole other beast. And this new design introduces a new problem. It's called the memory wall. All this extra data from the multiple streams has to be shuffled around constantly, and that can create a massive traffic jam that slows down the whole training process. And this table really shows you why it's such a big deal. With a standard model, you're reading and writing a certain amount of data for every token. With hyper-connections, where you've expanded that data by four times, look at those formulas: the amount of data being moved around just skyrockets. This threatens to make the model so slow to train that it's completely impractical, no matter how clever the math is.

But, and this is where it gets really impressive, the DeepSeek team aren't just brilliant theorists. They are world-class engineers. They attacked this problem with everything they had, doing things like kernel fusion to reduce memory trips, recomputing values on the fly instead of storing them, all sorts of clever low-level optimizations. And the final result? This incredibly complex new architecture adds only a 6.7% time overhead during training. That's it. It's an absolute engineering marvel.

Okay, so let's recap. We have a stable theory. We have some hardcore engineering that makes it run efficiently. But the million-dollar question is still on the table: does it actually work any better? Does it make the model smarter? And the answer, thankfully, is a resounding yes. This is the proof in the pudding. On a 27-billion-parameter model, mHC doesn't just beat the standard design; it also outperforms the original, unstable version, even before that one blew up. And what's really telling is where it gets better. On tasks that require complex reasoning, like BBH or tough reading comprehension, it shows really significant gains. This tells us that the stable, principled way of mixing information isn't just about preventing explosions. It's actively leading to a more intelligent model.

And this is where we loop all the way back to the beginning, back to the Open Superintelligence Lab. A fundamental breakthrough like this is exactly what a team like that needs. It's not some small incremental tweak. It is a foundational change to the architecture that lets them scale better and get more performance. It's what turns their ambitious goal from a wild dream into a plausible engineering reality. The authors of the paper wrap it all up with this really powerful thought. For the past decade, we've basically been focused on micro design, right? Just upgrading the engine. mHC is a really compelling argument that we need to be paying just as much attention to macro design, to actually redesigning the entire chassis of the car. It opens up a whole new front for innovation.

And that leaves us with a pretty provocative question to end on. mHC proves that the basic blueprints we've been using for years aren't set in stone. We can fundamentally change how these massive networks are put together. So, if we can do that, what other assumptions are we making that we should challenge? What totally new, crazy-looking structures will we build next? If you enjoyed this deep dive into the architecture of AI, make sure you subscribe for more explainers that break down the complex science that is shaping our future.