I recently read "Climbing the Ladder of Reasoning: What LLMs Can—and Still Can’t—Solve after SFT", and it clarified something I’d suspected for a while: supervised fine-tuning really can make language models smarter, but only up to a point. The paper lays out a kind of "reasoning ladder" to sort problems by difficulty, from Easy to Extremely Hard, and then looks at how well large language models do at each level after different amounts of fine-tuning.
The results are striking. With just a small number of high-quality examples, models get dramatically better at intermediate tasks, especially the kind that require multi-step logic or basic mathematical reasoning. But the climb gets steep fast. At the highest tiers, even heavily fine-tuned models start to falter. They don’t just get slower; they get stuck. There’s a ceiling that extra supervision can’t seem to break through. And that’s where things get interesting.
One of the most misleading assumptions we carry from school is that progress is smooth. The math gets harder one chapter at a time. The grades go up or down by a few points. The learning curve is a curve.
But a lot of things in the real world don’t work that way. Startups don’t grow like that. They fumble for months, sometimes years, until something clicks, and then, boom, they’re growing 20% a week. Ideas behave similarly. They aren’t slowly assembled like IKEA furniture. They explode into your head after long periods of nothing.
And apparently, large language models evolve this way too.
If you’ve ever read about punctuated equilibrium in biology, you’ll recognize the pattern. The idea is that species stay more or less the same for long stretches of time (sometimes millions of years) and then, in a geological blink, they change. It’s not evolution as a gentle slope. It’s a staircase. Or a broken line. Stasis punctuated by leaps.
That’s exactly what seems to be happening with supervised fine-tuning.
The pretraining of a language model is like Darwinian evolution in slow motion. Billions of tokens, tiny updates, endlessly optimizing across a sea of generic internet data. And it works. But only to a point. Pretrained models can sound smart. They can autocomplete. They can imitate. But ask them to reason, and they fall apart like a middle school debate team.
Then someone gives the model 1,000 carefully crafted examples of how to reason step by step, and suddenly the model jumps from 26% accuracy to 76%. It’s not a curve. It’s a spike. A mutation. Something changed in the DNA of its behaviour.
That’s weird. But it’s not unheard of.
We see this in startups all the time. The product is almost good enough, the market is almost ready, the pitch is almost clear, and then a tiny tweak flips everything. A better onboarding flow. A single press article. A founder finally understanding their own idea. And growth takes off.
These leaps feel magical, but they’re not. What’s really happening is that the system, whether it’s a company, a mind, or a model, has reached the edge of a phase change. And something small triggers the shift. Ice melting. Water boiling. A model suddenly able to reason.
The mistake is to assume that scaling up is always gradual. That you just feed in more tokens, more compute, more training steps, and the intelligence line will rise accordingly. But real systems are often nonlinear. They hide tipping points. And when you reach one, you don’t get more of the same; you get something else entirely.
This also explains why high-quality data is so disproportionately powerful. It’s not just better; it’s leveraged. If you give a model 10,000 random Reddit posts, it will learn to talk like Reddit. But if you give it 10 well-designed examples of how to solve a logic puzzle step by step, you might unlock an entirely new behaviour.
It’s like teaching a child. You can’t brute-force them into understanding algebra with endless arithmetic. But at some point, after enough foundation, they see how x fits into the equation, and suddenly they get it. The change is conceptual. And fast.
There’s something deeply optimistic about this.
It means we might not need exponentially more data to get exponentially smarter models. We might just need the right data. Or the right kind. And that’s more tractable. More human. Because crafting 1,000 good reasoning examples is within reach. It’s an intellectual problem, not an industrial one.
Of course, this analogy isn’t perfect. Evolution doesn’t know what it’s doing. It doesn’t fine-tune. It just throws mutations at the wall and waits a million years. But we can choose the mutations. We can engineer the jumps. That’s what makes this so powerful.
It also suggests a different way to think about the future of AI. Not as a straight line toward AGI, but as a jagged path: plateaus interrupted by vertical ascents. So maybe the question isn’t, “How big is your model?” Maybe it’s, “What leap are you aiming for next?”
Because if these jumps are real, the most important thing isn’t how fast you’re running. It’s where the cliffs are.
Read the paper
In essence, the paper argues that fine-tuning helps models "climb the ladder of reasoning" only so far. It is surprisingly effective at the early stages, but current methods hit a ceiling, especially on problems that demand flexibility, intuition, or nonlinear thinking. That suggests new techniques beyond traditional supervised fine-tuning (SFT) will be needed to reach human-expert-level reasoning.
The authors introduce a structured, tiered "reasoning ladder" to categorize problem complexity (Easy → Medium → Hard → Extremely Hard) and systematically analyze how LLMs perform across these levels after various degrees of fine-tuning.
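To make the tiered analysis concrete, here is a minimal sketch of how per-tier accuracy could be tallied. The tier names come from the ladder described above. Everything else is an assumption for illustration: the function name `accuracy_by_tier`, the record format, and the sample results are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Tier names follow the ladder described above, ordered from easiest to hardest.
TIERS = ["Easy", "Medium", "Hard", "Extremely Hard"]


def accuracy_by_tier(records):
    """Return {tier: fraction correct} for an iterable of (tier, is_correct) pairs.

    Tiers with no records are left out of the result.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for tier, is_correct in records:
        totals[tier] += 1
        correct[tier] += int(is_correct)
    return {t: correct[t] / totals[t] for t in TIERS if totals[t]}


# Hypothetical results, invented for illustration (not figures from the paper).
sample = [
    ("Easy", True), ("Easy", True),
    ("Medium", True), ("Medium", False),
    ("Hard", False), ("Hard", False), ("Hard", True), ("Hard", False),
    ("Extremely Hard", False),
]

for tier, acc in accuracy_by_tier(sample).items():
    print(f"{tier}: {acc:.0%}")
```

Running the same tally before and after fine-tuning, tier by tier, is what would show whether an improvement is a jump on the middle rungs or a change at the top of the ladder.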
