DeepSeek-R1: What It Is & Why Everyone Is Talking About It
A gentle introduction to the open-source alternative to OpenAI's o1
In recent years, Large Language Models (LLMs) have made significant advances in their ability to understand and generate human-like text. Models such as GPT-4 and Claude 3.5 have shown impressive performance across a range of natural language processing tasks. However, there is still room for improvement, particularly in reasoning. To address this, researchers have explored a variety of techniques, iteratively moving toward more and more complex data regimes and, most recently, scaling up test-time compute.
Contrary to this trend, a rising AI research company, DeepSeek, has now reported more success with a far simpler approach: reinforcement learning (RL) without relying on supervised fine-tuning at all.
What is Reinforcement Learning?
Before diving into DeepSeek's approach, let's first understand what reinforcement learning is. Imagine you're teaching a dog a new trick. You give the dog a treat every time it performs the desired action, such as sitting or rolling over. The dog learns to associate the action with the reward and becomes more likely to repeat the behavior in the future. This is the basic principle behind reinforcement learning.
In the context of LLMs, the model is the "dog," and the reward is a score that measures how well the model performs on a specific task. The model learns to generate text that maximizes the reward, thereby improving its performance on the task.
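To make the analogy concrete, here is a minimal sketch in Python of what such a reward could look like (a hypothetical rule-based scorer, not DeepSeek's actual code): the "treat" is just a number computed from the model's output, for example by checking whether its final answer matches a known ground truth.

```python
# Hypothetical sketch of a rule-based reward for an LLM output (not DeepSeek's actual code).
# The output is scored 1.0 if its final answer matches the ground truth, else 0.0.

def reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the known answer, else 0.0."""
    # Assume the final answer is the last non-empty line of the output.
    lines = [line.strip() for line in model_output.strip().splitlines() if line.strip()]
    final_answer = lines[-1] if lines else ""
    return 1.0 if final_answer == ground_truth.strip() else 0.0

print(reward("2 + 2 equals...\n4", "4"))  # 1.0 -> the "treat"
print(reward("2 + 2 equals...\n5", "4"))  # 0.0 -> no treat
```

During RL training, the model is nudged toward generating outputs that score higher under this kind of signal.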
DeepSeek-R1-Zero: Pure Reinforcement Learning
The first step in DeepSeek's approach was to apply RL directly to the base model, DeepSeek-V3-Base, without any supervised fine-tuning (SFT). This model, called DeepSeek-R1-Zero, was allowed to explore different reasoning strategies, such as Chain-of-Thought (CoT), to solve complex problems.
Think of CoT as a step-by-step thought process that the model goes through to arrive at a solution. For example, if the model is asked, "What is the capital of France?", it might generate the following CoT:
France is a country in Europe.
The capital of a country is usually its largest and most important city.
Paris is the largest and most important city in France.
Therefore, the capital of France is Paris.
By exploring different CoT strategies, DeepSeek-R1-Zero was able to develop powerful reasoning capabilities without any supervised data. This was a significant finding, as it demonstrated that LLMs could improve their reasoning abilities through pure RL.
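As a rough illustration of how such training can work without labeled reasoning traces, here is a hedged sketch of the kind of rule-based checks DeepSeek-R1-Zero is reported to be trained with: a format reward that verifies the response wraps its reasoning and answer in tags, and an accuracy reward that compares the extracted answer to the ground truth. The exact tag names and scoring here are illustrative assumptions, not the released training code.

```python
import re

# Illustrative sketch of R1-Zero-style rule-based rewards (tags and scoring are assumptions).

def format_reward(response: str) -> float:
    """1.0 if the response has a <think>...</think> block followed by an <answer>...</answer> block."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, response, flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the text inside <answer>...</answer> matches the ground truth."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

response = "<think>France is in Europe; its largest and most important city is Paris.</think> <answer>Paris</answer>"
print(format_reward(response) + accuracy_reward(response, "Paris"))  # 2.0
```

Because neither check requires a human-written reasoning trace, the model is free to discover whatever chain of thought gets it to the correct, well-formatted answer.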

DeepSeek-R1: Multi-stage Training
While DeepSeek-R1-Zero showed impressive results, it still had some limitations, such as poor readability and language mixing (e.g., generating text in multiple languages within the same response). To address these issues, the researchers introduced DeepSeek-R1, which incorporated a small amount of high-quality data as a "cold start" to accelerate convergence and improve reasoning performance.
The cold start phase involved fine-tuning the model on thousands of long CoT examples. This helped the model learn to generate more coherent and readable text and gave it a stronger starting point for the reasoning-focused RL that followed.
After the cold start, DeepSeek-R1 underwent the same large-scale RL training process as DeepSeek-R1-Zero, focusing on enhancing reasoning capabilities in tasks such as coding, mathematics, science, and logic reasoning. This was followed by a rejection sampling and SFT phase, where the resulting checkpoint was used to collect additional data from other domains, such as writing and role-playing, to further enhance the model's general-purpose capabilities.
Finally, a secondary RL stage was implemented to align the model with human preferences, improving its helpfulness and harmlessness while refining its reasoning capabilities.
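Put together, the reported recipe can be read as four stages applied in sequence. The sketch below is only an outline of that pipeline; each function is a stub standing in for a full training stage, not actual training code.

```python
# Rough outline of the reported DeepSeek-R1 training pipeline (stubs only).

def cold_start_sft(model, long_cot_examples):
    """Stage 1: fine-tune the base model on thousands of curated long-CoT examples."""
    return model  # placeholder

def reasoning_rl(model, rule_based_rewards):
    """Stage 2: large-scale RL on coding, math, science and logic tasks."""
    return model  # placeholder

def rejection_sampling_sft(model, general_domain_data):
    """Stage 3: sample from the RL checkpoint, keep the best outputs, add general-purpose data, run SFT."""
    return model  # placeholder

def preference_rl(model, helpfulness_and_harmlessness_rewards):
    """Stage 4: a second RL stage aligning the model with human preferences."""
    return model  # placeholder

model = "DeepSeek-V3-Base"
for stage in (cold_start_sft, reasoning_rl, rejection_sampling_sft, preference_rl):
    model = stage(model, None)
print(model)
```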
Distillation: Empowering Smaller Models
One of the challenges with large language models like DeepSeek-R1 is that they require significant computational resources to run. To make the reasoning capabilities of DeepSeek-R1 more accessible, the researchers explored the use of distillation to transfer knowledge from the large model to smaller ones.
Distillation is like teaching a student. The large model (the teacher) generates a large number of examples, which are then used to train the smaller model (the student). By learning from the teacher's outputs, the student model can acquire similar knowledge and capabilities, even though it has a smaller capacity.
The researchers used the 800k samples generated by DeepSeek-R1 to fine-tune smaller open-source models like Qwen and Llama. This simple distillation method significantly improved the reasoning abilities of these smaller models, making them more practical for real-world applications.
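In practice, this kind of distillation is just supervised fine-tuning on teacher-generated data. The sketch below is illustrative only: the model names and helper functions are placeholders, not DeepSeek's released code. The teacher produces reasoning traces, and the student is then trained on them with ordinary next-token prediction.

```python
# Illustrative distillation sketch: the "teacher" produces reasoning traces,
# and the "student" is fine-tuned on them with plain supervised learning.
# `teacher_generate` and `supervised_finetune` are hypothetical placeholders.

def teacher_generate(prompt: str) -> str:
    """Stand-in for sampling a long reasoning trace from the large teacher model."""
    return "<think>...step-by-step reasoning...</think> <answer>42</answer>"

def supervised_finetune(student_model: str, dataset: list[dict]) -> str:
    """Stand-in for ordinary SFT (next-token cross-entropy) on the teacher's outputs."""
    return student_model + "-distilled"

prompts = ["Solve: 6 * 7 = ?", "What is the capital of France?"]
dataset = [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

student = supervised_finetune("Qwen-7B", dataset)
print(student, len(dataset))
```

Notably, the distilled models get their reasoning ability purely from imitating the teacher's outputs; no RL is applied to the students themselves.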
Evaluation and Results
To assess the effectiveness of their approach, the researchers evaluated DeepSeek-R1 on a wide range of benchmarks, including tests of general knowledge, coding, and mathematics. The model clearly outperformed its predecessor, DeepSeek-V3, and achieved results on par with OpenAI's flagship reasoning model (OpenAI-o1-1217).
The distilled models also showed impressive results, with DeepSeek-R1-Distill-Qwen-14B surpassing the larger QwQ-32B-Preview model on all evaluation metrics. Similarly, DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Llama-70B significantly outperformed OpenAI's o1-mini model on most benchmarks.

Key Takeaways
DeepSeek's research highlights several significant findings:
RL can effectively enhance the reasoning capabilities of LLMs without relying on SFT.
Ground-truth rewards computed by hard-coded rules (for example, checking a math answer or running test cases) are used instead of learned reward models, which RL can easily hack.
The thinking time of the model steadily increases as training proceeds, an emergent property rather than a pre-programmed feature.
Self-reflection and exploration behaviors emerge during the training process.
GRPO (Group Relative Policy Optimization), a simpler and more memory-efficient alternative to PPO, is used: it removes the critic network and instead uses the average reward of a group of sampled outputs as the baseline (see the sketch after this list).
Future work will focus on improving the general capabilities of DeepSeek-R1, addressing language mixing issues, exploring prompt engineering, and enhancing performance on software engineering tasks.
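Here is a hedged sketch of the core GRPO idea from the list above, simplified for illustration (the full objective also includes a PPO-style clipped probability ratio and a KL penalty): instead of a learned critic, the baseline is the mean reward of a group of answers sampled for the same prompt, and each sample's advantage is its reward relative to that group.

```python
# Simplified sketch of GRPO's group-relative advantage (not the full objective,
# which also uses a clipped probability ratio and a KL penalty).

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each sample's reward by the mean and std of its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Rewards for, say, 4 sampled answers to the same math problem (1 = correct, 0 = wrong).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # correct answers get positive advantage
```

Because the baseline comes from the group itself, there is no separate value network to train, which is where the memory savings come from.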
Conclusion
DeepSeek-R1-Zero and DeepSeek-R1 demonstrate the effectiveness of RL in enhancing the reasoning capabilities of LLMs. By openly sharing their training details and raw algorithms, DeepSeek (ironically, as noted by Jim Fan) is keeping the original mission of OpenAI alive: truly open, frontier research that empowers all. It reminds us of the not-too-distant past when this was the norm within the AI research community, before AI's economic impact was predicted to run into the trillions.
👍 If you enjoyed this article, give it a like and share it with your peers.
New to AI and all the terms are confusing? Then you might want to check out my previous article on AI agents:
"Content is interesting, but this dark theme is a bit aggressive on the eye. Not sure if it's the font or the main text that is too bright."
Here is what an AI thought about it :)
Looking at the image more carefully, there does seem to be a very stark contrast between the pure white text and what appears to be a very deep black background (#000000 or close to it). This extreme contrast ratio, combined with what looks like a fairly thick font weight, can indeed feel a bit harsh on the eyes.
More eye-friendly dark themes often use:
Off-white text (like #E8E8E8 or even lighter gray) instead of pure white
Dark gray backgrounds (like #121212 or #1A1A1A) instead of pure black
Slightly thinner font weights for body text
Lower contrast ratios for secondary text
For example, Discord uses a softer dark theme with a dark gray background and slightly off-white text, which many users find more comfortable to read. Similarly, the dark mode in Apple's iOS uses subtle gray tones rather than pure black for most interfaces.
The font choice in the image appears to be a fairly heavy weight variant, which, combined with the bright white color, might be contributing to that "aggressive" feeling you're experiencing. A lighter font weight or a different typeface with softer characteristics might help create a more comfortable reading experience while maintaining readability.