I recently read "Climbing the Ladder of Reasoning: What LLMs Can—and Still Can’t—Solve after SFT", and it clarified something I’d been suspecting for a while: supervised fine-tuning (SFT) really can make language models smarter, but only up to a point. The paper lays out a "reasoning ladder" that sorts problems by difficulty, from Easy to Extremely Hard, and then looks at how well large language models do at each level after different amounts of fine-tuning.
The results are striking. With just a small number of high-quality examples, models get dramatically better at intermediate tasks, especially