😮 Massive Progress in Reasoning Models
We've entered the most exciting phase in AI since GPT-4 and ChatGPT
In this issue:
Beating OpenAI with Open-Source
99% performance with only 1% data
Chain-of-Associated-Thoughts (CoAT)
1. s1: Simple test-time scaling
What problem does it solve? Current approaches to test-time scaling—using additional compute during inference to boost performance—often rely on opaque or complex methodologies, as seen in OpenAI’s proprietary "o1" model. This lack of transparency and simplicity hinders reproducibility and practical adoption. The article addresses this gap by proposing an accessible, minimalistic framework for test-time compute scaling, focusing on enhancing model reasoning without intricate architectural changes.
How does it solve the problem? The authors combined two key innovations: a compact, high-quality dataset (s1K) and a novel "budget forcing" mechanism. s1K was curated via rigorous criteria (difficulty, diversity, quality) to maximize fine-tuning efficiency with minimal data. Budget forcing dynamically controls test-time compute by either truncating the model’s reasoning to limit resources or appending "Wait" tokens to prompt self-correction, effectively simulating iterative refinement. The approach was applied via supervised fine-tuning on the Qwen2.5-32B-Instruct model.
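To make the budget forcing idea concrete, here is a minimal sketch of what such a decoding loop could look like. Everything in it is an assumption for illustration only (the `generate` helper, the `</think>` delimiter, the crude whitespace token count); it is not the authors' released implementation.

```python
# Illustrative sketch of budget forcing at inference time: cap the reasoning
# trace at a token budget, or append "Wait" to extend it and trigger
# self-correction. `generate(prompt, stop=..., max_tokens=...) -> str` is a
# hypothetical helper that returns newly generated text.

def budget_forced_answer(generate, prompt, thinking_budget=2000, max_extensions=2):
    # Generate the initial reasoning trace, truncated at the budget.
    reasoning = generate(prompt, stop="</think>", max_tokens=thinking_budget)
    tokens_used = len(reasoning.split())

    extensions = 0
    while tokens_used < thinking_budget and extensions < max_extensions:
        # Suppress the end-of-thinking marker and nudge the model to
        # re-check its work by appending "Wait".
        reasoning += "\nWait,"
        more = generate(prompt + reasoning, stop="</think>",
                        max_tokens=thinking_budget - tokens_used)
        reasoning += more
        tokens_used += len(more.split())
        extensions += 1

    # Force the model to commit to a final answer once the budget is spent.
    answer = generate(prompt + reasoning + "\n</think>\nFinal Answer:",
                      max_tokens=256)
    return answer
```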
What are the key findings? The model (s1-32B) outperformed OpenAI’s o1-preview by up to 27% on competition math benchmarks (MATH and AIME24). With budget forcing, it also extrapolated beyond its performance without test-time intervention, improving from 50% to 57% on AIME24. The results demonstrate that controlled test-time compute interventions, even with a small fine-tuning dataset, yield significant gains in reasoning tasks. The open-source release of the model, data, and code further underscores reproducibility.
Why does it matter? This work democratizes test-time scaling by proving its viability through simple, transparent methods—contrasting with proprietary "black-box" approaches. Budget forcing introduces a lightweight, adaptive mechanism to optimize compute use during reasoning, applicable across domains like education or coding. By open-sourcing their framework, the authors enable broader community adoption and innovation, advancing equitable access to high-performance LLM capabilities.
2. LIMO: Less is More for Reasoning
Watching: LIMO (paper)
What problem does it solve? The article challenges the prevailing belief that eliciting complex reasoning—especially in mathematical domains—requires an enormous amount of training data (often exceeding 100,000 examples). It questions the long-held assumption that large-scale supervised fine-tuning (SFT) is necessary for fostering sophisticated reasoning skills in large language models (LLMs).
How does it solve the problem? The authors propose a model called LIMO (Less-Is-More Reasoning) that challenges the conventional wisdom. They demonstrate that complex mathematical reasoning abilities can be effectively elicited in LLMs with surprisingly few examples. By using only 817 carefully curated training samples, LIMO achieves remarkable performance on mathematical reasoning tasks, outperforming previous models trained on significantly larger datasets. Key innovations include iterative distillation of examples emphasizing intermediate cognitive steps "borrowed" from human solvers, combined with gradient-aware example pruning to maximize template efficiency.
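As a rough illustration of the "less is more" curation step, the sketch below selects a few hundred hard, diverse, high-quality examples from a larger candidate pool. The scoring functions, thresholds, and round-robin strategy are assumptions for illustration, not the LIMO authors' pipeline.

```python
# Rough illustration of "less is more" curation: keep only a small set of
# hard, diverse, high-quality problems instead of a 100k-scale corpus.
# Scorers, thresholds, and the round-robin strategy are illustrative assumptions.

def curate_small_set(candidates, score_difficulty, score_quality, topic_of,
                     target_size=817):
    # 1) Drop low-quality or easy problems.
    filtered = [ex for ex in candidates
                if score_quality(ex) > 0.9 and score_difficulty(ex) > 0.7]

    # 2) Group the survivors by topic to enforce diversity.
    by_topic = {}
    for ex in filtered:
        by_topic.setdefault(topic_of(ex), []).append(ex)
    for pool in by_topic.values():
        pool.sort(key=score_difficulty, reverse=True)  # hardest first

    # 3) Round-robin over topics, taking the hardest remaining example each
    #    pass, until the target size is reached or the pools run dry.
    curated = []
    while len(curated) < target_size and any(by_topic.values()):
        for pool in by_topic.values():
            if pool and len(curated) < target_size:
                curated.append(pool.pop(0))
    return curated
```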
What are the key findings? LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, a substantial improvement from previous SFT-based models' 6.5% and 59.2% respectively, while using only 1% of the training data. LIMO also demonstrates exceptional out-of-distribution generalization, achieving a 40.5% absolute improvement across 10 diverse benchmarks, surpassing models trained on 100x more data. These results challenge the notion that SFT leads to memorization rather than generalization.
Why does it matter? These findings fundamentally reframe how we approach specialized reasoning in LLMs: not as data-hungry pattern recognition tasks, but as knowledge-unlocking challenges. This reduces computational costs and democratizes the development of specialized models. The discovery also offers new perspectives on AI cognition, suggesting foundation models may possess "latent reasoning muscles" that require targeted activation rather than brute-force training. For practitioners, it enables deploying mathematical reasoning in scenarios where large labeled datasets are unavailable.
3. CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning
Watching: CoAT (paper)
What problem does it solve? Current LLM inference predominantly relies on a "fast thinking" approach, where models generate outputs in a single pass without iterative refinement. While effective for many tasks, this methodology lacks mechanisms to dynamically integrate new information or revisit earlier reasoning steps—key aspects of human-like "slow thinking." This limitation becomes pronounced in complex scenarios requiring adaptability, multi-step reasoning, or incorporation of evolving context.
How does it solve the problem? The authors introduced Chain-of-Associated-Thoughts (CoAT), blending Monte Carlo Tree Search (MCTS) with a dynamic "associative memory" system. MCTS enables structured exploration of diverse reasoning pathways similar to human brainstorming, while associative memory acts as a real-time knowledge repository. This combination allows LLMs to iteratively update their reasoning by retrieving and cross-referencing stored insights, mimicking the human ability to pause, reflect, and refine earlier conclusions.
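A minimal sketch of the idea described above: MCTS-style exploration over reasoning steps, with an associative memory that each expansion can read from and write to. The class names, the UCB selection heuristic, and the `propose_step`/`evaluate` callables are assumptions for illustration, not the paper's implementation.

```python
import math

# Sketch of CoAT-style search: MCTS over partial reasoning traces, plus an
# associative memory shared across branches. `propose_step(question, trace,
# memory) -> str` and `evaluate(question, trace) -> float in [0, 1]` are
# hypothetical LLM-backed callables.

class Node:
    def __init__(self, reasoning, parent=None):
        self.reasoning = reasoning  # partial reasoning trace so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def ucb(node, c=1.4):
    # Standard upper-confidence-bound score for tree search.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def coat_search(question, propose_step, evaluate, memory, iterations=50):
    root = Node(reasoning="")
    for _ in range(iterations):
        # Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)

        # Expansion: generate the next reasoning step, conditioned on memory.
        step = propose_step(question, node.reasoning, memory)
        child = Node(node.reasoning + "\n" + step, parent=node)
        node.children.append(child)

        # Associative memory update: store the new insight for other branches.
        memory.append(step)

        # Evaluation and backpropagation.
        score = evaluate(question, child.reasoning)
        while child is not None:
            child.visits += 1
            child.value += score
            child = child.parent

    # Return the reasoning trace of the most promising top-level branch.
    best = max(root.children, key=lambda n: n.value / max(n.visits, 1))
    return best.reasoning
```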
What are the key findings? CoAT significantly outperformed conventional inference methods across generative and reasoning tasks, scoring higher in accuracy, coherence, and output diversity. The framework’s iterative refinement—enabled by MCTS-guided exploration and associative memory integration—produced outputs better aligned with complex problem-solving requirements. Notably, the system demonstrated strong context retention even as search spaces expanded dynamically.
Why does it matter? These results address a critical gap in LLM capabilities: the inability to self-correct or incorporate new insights mid-reasoning. By aligning model reasoning closer to human cognitive processes, the CoAT framework paves the way for developing AI systems that are not only more accurate but also better equipped to handle complex, real-world tasks that demand flexibility and contextual adaptability.
Papers of the Week:
Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models
The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking
Reward-Guided Speculative Decoding for Efficient LLM Reasoning
Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment
SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models