Deepseek pretrain dataset

1. Proximal Policy Optimization (PPO)

PPO is a well-established actor-critic algorithm in reinforcement learning. It optimizes a policy by maximizing a surrogate objective function while ensuring that the new policy does not deviate too drastically from the old policy.

Core Mechanism:

PPO’s objective function typically involves the ratio of probabilities between the current policy ($\pi_\theta$) and the old policy ($\pi_{\theta_{old}}$), scaled by an advantage term ($A_t$).

Objective Function (General Form): \(J_{PPO}(\theta) = \mathbb{E} \left[ \min \left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} A_t, \text{clip}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\right) A_t \right) \right]\)

  • $\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$: The probability ratio, measuring how much the new policy deviates from the old one for action $a_t$ in state $s_t$.
  • $A_t$: The advantage function, which estimates how much better an action is compared to the average action in a given state. In PPO, $A_t$ is typically computed using Generalized Advantage Estimation (GAE), which relies on rewards and a learned value function ($V_\psi$) from a separate critic model.
  • $\text{clip}(\cdot, 1-\epsilon, 1+\epsilon)$: A clipping function that limits the policy update step size, preventing excessively large updates and maintaining training stability. $\epsilon$ is a small hyperparameter.
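
To make the clipped surrogate concrete, here is a minimal PyTorch-style sketch of the loss (an illustration of the formula above, not DeepSeek's implementation); `logp_new`, `logp_old`, and `advantages` are assumed to be per-token tensors produced elsewhere in the training loop.

```python
import torch

def ppo_clipped_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Negative of the clipped PPO surrogate objective J_PPO."""
    # Probability ratio pi_theta / pi_theta_old, computed in log-space for stability.
    ratio = torch.exp(logp_new - logp_old)
    # Unclipped vs. clipped surrogate terms.
    surr_unclipped = ratio * advantages
    surr_clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximizing min(...) is the same as minimizing its negation.
    return -torch.min(surr_unclipped, surr_clipped).mean()
```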

Gradient Coefficient (from DeepSeekMath unified paradigm, Equation 18): The gradient coefficient for PPO, $GC_{PPO}$, is essentially the advantage term $A_t$: \(GC_{PPO}(q, o, t, \pi_{\theta_{rm}}) = A_t\) This means that the learning signal for each token $o_t$ in an output $o$ given a query $q$ is scaled by the advantage $A_t$. A higher positive advantage leads to a stronger reinforcement of the token’s probability.
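
Because $A_t$ in PPO comes from GAE on top of a learned critic $V_\psi$, a compact sketch of that step may help; the trajectory tensors below (`rewards`, `values`) are illustrative assumptions, and the recursion follows the standard GAE formulation rather than any DeepSeek-specific code.

```python
import torch

def gae_advantages(rewards: torch.Tensor,   # [T] per-step rewards
                   values: torch.Tensor,    # [T+1] critic estimates V_psi(s_t), incl. bootstrap value
                   gamma: float = 1.0,
                   lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation over a single trajectory (illustrative only)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1},
    # with TD error delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```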

Key Characteristics:

  • Actor-Critic Architecture: PPO requires training both a policy (actor) network and a value (critic) network. The critic network estimates the value function, which is used to calculate the advantage.
  • Stability: The clipping mechanism helps stabilize training by preventing large policy updates.
  • KL Divergence Penalty: PPO often incorporates a per-token KL divergence penalty from a reference model into the reward signal to prevent the policy from diverging too far from the initial fine-tuned model.
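
A minimal sketch of that reward shaping, assuming per-token log-probabilities from the policy and a frozen reference model are available (the tensor names and the $\beta$ value are illustrative, and this follows the common log-ratio penalty rather than any specific DeepSeek code):

```python
import torch

def reward_with_kl_penalty(rm_rewards: torch.Tensor,   # [T] reward-model signal (often nonzero only at the final token)
                           logp_policy: torch.Tensor,  # [T] log pi_theta(o_t | q, o_<t)
                           logp_ref: torch.Tensor,     # [T] log pi_ref(o_t | q, o_<t)
                           beta: float = 0.05) -> torch.Tensor:
    """Per-token reward shaped with a KL penalty against a frozen reference model."""
    kl_penalty = logp_policy - logp_ref   # per-token log-ratio estimate of the KL divergence
    return rm_rewards - beta * kl_penalty
```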

2. Group Relative Policy Optimization (GRPO)

DeepSeek-AI introduces GRPO as an efficient and effective alternative to PPO, particularly for large-scale LLM training. GRPO addresses the computational burden of PPO’s critic model while maintaining strong performance.

Core Mechanism (from DeepSeekMath paper, Section A.1.6, and DeepSeek-R1 paper, Abstract):

The most significant difference from PPO is that GRPO forgoes the critic (value) model. Instead, it estimates the baseline for the advantage calculation directly from the group scores of multiple outputs sampled for the same question.

GRPO Objective Function (from DeepSeekMath paper, Equation 3): \(J_{GRPO}(\theta)=\mathbb{E}\left[q\sim P(Q),\{o_{i}\}_{i=1}^{G}\sim\pi_{\theta_{old}}(O|q)\right] \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left\{\min\left[\frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})}\hat{A}_{i,t}, \text{clip}\left(\frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})},1-\epsilon,1+\epsilon\right)\hat{A}_{i,t}\right]-\beta D_{KL}\left(\pi_{\theta}||\pi_{ref}\right)\right\}\)

  • $\mathbb{E}\left[q\sim P(Q),\{o_{i}\}_{i=1}^{G}\sim\pi_{\theta_{old}}(O|q)\right]$: Expectation over questions $q$ and a **group** of $G$ outputs $\{o_i\}_{i=1}^G$ sampled from the old policy $\pi_{\theta_{old}}$. This group-based sampling is central to GRPO’s baseline estimation.
  • $\frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})}$: Probability ratio, similar to PPO, but computed for each token $o_{i,t}$ of the sampled output $o_i$.
  • $\hat{A}_{i,t}$: The **group relative advantage**. This is the key innovation: instead of relying on a learned value function, $\hat{A}_{i,t}$ is calculated from the relative rewards of the outputs within the sampled group only.
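
Putting the pieces together, here is a minimal per-token loss sketch (assuming the group-relative advantages $\hat{A}_{i,t}$ described next and the per-token log-probabilities are precomputed; padding/masking is omitted for brevity, and this is an illustration of Equation 3, not production code):

```python
import torch

def grpo_loss(logp_new: torch.Tensor,    # [G, T] log pi_theta(o_{i,t} | q, o_{i,<t})
              logp_old: torch.Tensor,    # [G, T] log pi_theta_old(...), detached
              logp_ref: torch.Tensor,    # [G, T] log pi_ref(...), detached
              advantages: torch.Tensor,  # [G, T] group-relative advantages A_hat_{i,t}
              clip_eps: float = 0.2,
              beta: float = 0.04) -> torch.Tensor:
    """Negative GRPO objective for one question's group of G sampled outputs."""
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages)
    # Unbiased, non-negative per-token KL estimate (DeepSeekMath Eq. 4).
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0
    # Average over tokens and group members; negate because optimizers minimize.
    return -(surrogate - beta * kl).mean()
```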

Advantage Calculation in GRPO (from DeepSeekMath paper, Sections 4.1.2 & 4.1.3): GRPO computes advantage based on the relative rewards within a group, aligning with the comparative nature of reward models (which are often trained on comparisons between outputs for the same question).

  • Outcome Supervision (OS) for GRPO: For each output $o_i$ with reward $r_i$ in a group $\mathbf{r} = \{r_1, \dots, r_G\}$: \(\hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}\) All tokens in output $o_i$ receive the same normalized reward $\tilde{r}_i$ as their advantage.
  • Process Supervision (PS) for GRPO: If step-wise rewards $r_i^{\text{index}(j)}$ are available for reasoning steps: \(\hat{A}_{i,t} = \sum_{\text{index}(j) \ge t} \tilde{r}_i^{\text{index}(j)}\) where $\tilde{r}_i^{\text{index}(j)}$ are the normalized step-wise rewards. This provides a more fine-grained signal for learning reasoning steps.
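
A small sketch of both schemes (shapes and names are illustrative; for process supervision the step rewards are assumed to have already been normalized across the whole group, as in the paper):

```python
import torch

def outcome_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Outcome supervision: normalize the G scalar rewards within the group.
    Every token of output o_i then reuses the i-th value as its advantage."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def process_advantages(step_rewards: torch.Tensor,      # normalized rewards of the reasoning steps of one output
                       step_end_tokens: torch.Tensor,   # token index at which each step ends
                       num_tokens: int) -> torch.Tensor:
    """Process supervision: token t accumulates the normalized rewards of all
    steps whose end index is >= t."""
    adv = torch.zeros(num_tokens)
    for r, end in zip(step_rewards, step_end_tokens):
        adv[: int(end) + 1] += r
    return adv
```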

  • $-\beta D_{KL}(\pi_{\theta}||\pi_{ref})$: Direct KL Divergence Regularization (from DeepSeekMath paper, Equation 4). Unlike PPO, which typically folds a KL penalty into the reward, GRPO adds the KL divergence term directly to the loss function. This term, with coefficient $\beta$, regularizes the updated policy $\pi_\theta$ towards a reference policy $\pi_{ref}$ (usually the SFT model), preventing large drifts and ensuring stability. \(D_{KL}(\pi_\theta || \pi_{ref}) = \frac{\pi_{ref}(o_{i,t}|q, o_{i,<t})}{\pi_\theta(o_{i,t}|q, o_{i,<t})} - \log \frac{\pi_{ref}(o_{i,t}|q, o_{i,<t})}{\pi_\theta(o_{i,t}|q, o_{i,<t})} - 1\) This unbiased estimator is guaranteed to be non-negative.
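
A quick sanity check on that last claim (not spelled out in the paper): writing \(x = \frac{\pi_{ref}(o_{i,t}|q, o_{i,<t})}{\pi_\theta(o_{i,t}|q, o_{i,<t})}\), each per-token term is \(x - \log x - 1\), which is non-negative for all \(x > 0\) because \(\log x \le x - 1\), with equality exactly when the two policies assign the token the same probability.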

Gradient Coefficient (from DeepSeekMath unified paradigm, Equation 21): \(GC_{GRPO}(q, o, t, \pi_{\theta_{rm}}) = \hat{A}_{i,t} + \beta \left( \frac{\pi_{ref}(o_{i,t}|q, o_{i,<t})}{\pi_\theta(o_{i,t}|q, o_{i,<t})} - 1 \right)\) The GRPO gradient coefficient combines the group-relative advantage with a direct KL regularization term: each response is reinforced or penalized in proportion to how much better or worse it scores than the group average, while the KL term keeps the policy close to the reference model.
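
As a quick check of where the second term comes from (a derivation sketch, not quoted from the paper): differentiating the per-token KL estimate of Equation 4 with \(x = \pi_{ref}/\pi_\theta\) gives \(\nabla_\theta\left(x - \log x - 1\right) = \left(1 - \tfrac{1}{x}\right)\nabla_\theta x = (1 - x)\,\nabla_\theta \log \pi_\theta\), since \(\nabla_\theta x = -x\,\nabla_\theta \log \pi_\theta\). The objective subtracts \(\beta\) times this KL term, so its contribution to the gradient coefficient is \(-\beta(1 - x) = \beta\left(\frac{\pi_{ref}}{\pi_\theta} - 1\right)\), which is exactly the second term above.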

Main Differences Between PPO and GRPO

GRPO’s “critic-free” (or “implicit-critic”) design: GRPO computes the advantage by sampling a group of responses for each prompt and using the group’s mean reward as the baseline (normalized by the group’s standard deviation). This can be viewed as an implicit value estimate, and because no separate critic network needs to be trained, it saves memory and simplifies training, which matters especially for large models such as LLMs. The intra-group comparison also reduces the variance of the learning signal.
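
As a toy illustration (numbers invented for this post): suppose $G = 4$ answers to a question receive binary rewards $\mathbf{r} = \{1, 0, 0, 1\}$. The group mean is $0.5$ and the (population) standard deviation is $0.5$, so the normalized advantages are $\{+1, -1, -1, +1\}$: every token of a correct answer is reinforced with weight $+1$ and every token of an incorrect one is pushed down with weight $-1$, all without a learned value network.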