DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI
[email protected]

Abstract

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally develops numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

Figure 1 | Benchmark performance of DeepSeek-R1 (accuracy / percentile, %):

Benchmark                        DeepSeek-R1  OpenAI-o1-1217  DeepSeek-R1-32B  OpenAI-o1-mini  DeepSeek-V3
AIME 2024 (Pass@1)                      79.8            79.2             72.6            63.6         39.2
Codeforces (Percentile)                 96.3            96.6             90.6            93.4         58.7
GPQA Diamond (Pass@1)                   71.5            75.7             62.1            60.0         59.1
MATH-500 (Pass@1)                       97.3            96.4             94.3            90.0         90.2
MMLU (Pass@1)                           90.8            91.8             87.4            85.2         88.5
SWE-bench Verified (Resolved)           49.2            48.9             36.8            41.6         42.0
Contents

1 Introduction
  1.1 Contributions
  1.2 Summary of Evaluation Results
2 Approach
  2.1 Overview
  2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model
    2.2.1 Reinforcement Learning Algorithm
    2.2.2 Reward Modeling
    2.2.3 Training Template
    2.2.4 Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero
  2.3 DeepSeek-R1: Reinforcement Learning with Cold Start
    2.3.1 Cold Start
    2.3.2 Reasoning-oriented Reinforcement Learning
    2.3.3 Rejection Sampling and Supervised Fine-Tuning
    2.3.4 Reinforcement Learning for all Scenarios
  2.4 Distillation: Empower Small Models with Reasoning Capability
3 Experiment
  3.1 DeepSeek-R1 Evaluation
  3.2 Distilled Model Evaluation
4 Discussion
  4.1 Distillation v.s. Reinforcement Learning
  4.2 Unsuccessful Attempts
5 Conclusion, Limitation, and Future Work
A Contributions and Acknowledgments
1. Introduction

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI).

Recently, post-training has emerged as an important component of the full training pipeline. It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt to user preferences, all while requiring relatively minimal computational resources compared to pre-training. In the context of reasoning capabilities, OpenAI's o1 (OpenAI, 2024b) series models were the first to introduce inference-time scaling by increasing the length of the Chain-of-Thought reasoning process. This approach has achieved significant improvements in various reasoning tasks, such as mathematics, coding, and scientific reasoning. However, effective test-time scaling remains an open question for the research community. Several prior works have explored various approaches, including process-based reward models (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023), reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Trinh et al., 2024; Xin et al., 2024). However, none of these methods has achieved general reasoning performance comparable to OpenAI's o1 series models.

In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning. During training, DeepSeek-R1-Zero naturally develops numerous powerful and interesting reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits strong performance on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, the score further improves to 86.7%, matching the performance of OpenAI-o1-0912.

However, DeepSeek-R1-Zero encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start examples to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL as in DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process that takes into account prompts from all scenarios. After these steps, we obtain a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.

We further explore distillation from DeepSeek-R1 into smaller dense models. Using Qwen2.5-32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying RL to it. This demonstrates that the reasoning patterns discovered by larger base models are crucial for improving reasoning capabilities. We open-source the distilled Qwen and Llama (Dubey et al., 2024) series. Notably, our distilled 14B model outperforms the state-of-the-art open-source QwQ-32B-Preview (Qwen, 2024a) by a large margin, and the distilled 32B and 70B models set a new record on reasoning benchmarks among dense models.
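The introduction names GRPO (Shao et al., 2024) as the RL framework but defers its definition to Section 2.2.1. As a rough, non-authoritative sketch of the core idea, the snippet below shows the group-relative advantage computation that GRPO builds on: rewards for a group of responses sampled for the same prompt are standardized within that group, so no separate value (critic) model is needed. The function name, group size, and reward values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its sampling group.

    In GRPO-style training, a group of G responses is sampled per prompt
    and scored (e.g., by a rule-based reward); each response's advantage
    is its reward minus the group mean, divided by the group standard
    deviation, which removes the need for a separate critic model.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical rewards for G = 4 responses to one prompt
# (e.g., 1.0 for a correct, well-formatted answer, 0.0 otherwise).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In the full GRPO objective, these advantages weight a PPO-style clipped policy update with a KL penalty toward a reference policy; see Section 2.2.1 for the exact formulation.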
1.1. Contributions

Post-Training: Large-Scale Reinforcement Learning on the Base Model

• We directly apply reinforcement learning (RL) to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) reasoning for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that the reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.

• We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages, aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.

Distillation: Smaller Models Can Be Powerful Too

• We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance than what small models achieve by discovering reasoning patterns through RL on their own. The open-source DeepSeek-R1, as well as its API, will help the research community distill better small models in the future.

• Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. DeepSeek-R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing QwQ-32B-Preview. Additionally, DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench. These results significantly outperform previous open-source models and are comparable to o1-mini. We open-source the distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on the Qwen2.5 and Llama3 series to the community.

1.2. Summary of Evaluation Results

• Reasoning tasks: (1) DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of 97.3%, performing on par with OpenAI-o1-1217 and significantly outperforming other models. (2) On coding-related tasks, DeepSeek-R1 demonstrates expert-level performance in code competition tasks, achieving a 2,029 Elo rating on Codeforces and outperforming 96.3% of human participants in the competition. For engineering-related tasks, DeepSeek-R1 performs slightly better than DeepSeek-V3, which could help developers in real-world tasks. (A minimal sketch of the Pass@1 and majority-voting estimators follows this list.)

• Knowledge: On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores of 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its performance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1 surpasses other closed-source models, demonstrating its competitive edge in educational tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses GPT-4o on this benchmark.
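The results above are reported as Pass@1 and, in the introduction, with majority voting. As a minimal, hypothetical illustration of how such metrics can be estimated, the sketch below averages correctness over several sampled answers for Pass@1 and takes the most frequent final answer for majority voting; the answer strings, sample count, and exact averaging protocol are assumptions for illustration, not the paper's evaluation code.

```python
from collections import Counter

def pass_at_1(sampled_answers, reference):
    """Pass@1 estimated as the fraction of independently sampled final
    answers that match the reference, averaged over the samples."""
    return sum(a == reference for a in sampled_answers) / len(sampled_answers)

def majority_vote(sampled_answers):
    """Consensus answer: the most frequent final answer across samples."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Hypothetical example: 8 sampled final answers to one AIME-style problem.
answers = ["42", "42", "17", "42", "42", "9", "42", "42"]
print(pass_at_1(answers, "42"))   # 0.75
print(majority_vote(answers))     # "42"
```

Averaging over many non-greedy samples in this way reduces the variance of the Pass@1 estimate relative to scoring a single greedy decode, while majority voting typically yields a higher score than Pass@1 because occasional wrong samples are outvoted.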
