Authors: Jiakang Wang*† ([email protected]), Runze Liu*†, Lei Lin, Wenpin Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou
*: Project Leads; †: Significant Contributor
First Published: September 18, 2025
<aside> ✨
TL;DR
Mainstream reinforcement learning algorithms for training large language models (LLMs), such as PPO and GRPO, rely on the PPO-Clip component. We identify a critical flaw in its original design: a weight mismatch for tokens with positive advantage, which leads to overfitting, entropy collapse, and repeated outputs, ultimately causing premature convergence to local optima. To address this, we propose Asymmetric Dual-Clipping, a redesigned clipping strategy. Experiments show that it effectively prevents premature convergence, improves training stability, and significantly enhances LLM performance.
👨💻 Github, 🤗 HF Model, 🤗 HF Dataset, 🌐 Zhihu, 📖 Paper
</aside>
<aside> 🎯
Key Insights for a Deeper Understanding of Reinforcement Learning in LLM Training
The PPO-Clip technique was introduced by OpenAI researchers as part of the PPO algorithm. It restricts the trust region by clipping the importance sampling ratio, thereby improving the stability of reinforcement learning: the policy is encouraged to move in a favorable direction, while the maximum step size of each update is strictly limited. The PPO loss with PPO-Clip is expressed as follows:
$\mathcal{L}_{\mathrm{PPO}}(\theta) = \mathbb{E}_{\tau}\bigg[\min\Big( r_\theta(\tau)\, A(\tau),\ \mathrm{clip}\big(r_\theta(\tau),\, 1 - \varepsilon_{\mathrm{low}},\, 1 + \varepsilon_{\mathrm{high}}\big)\, A(\tau) \Big)\bigg]$
where:
- $r_\theta(\tau) = \pi_\theta(\tau) / \pi_{\theta_\mathrm{old}}(\tau)$ is the importance sampling ratio between the current policy and the old (rollout) policy;
- $A(\tau)$ is the advantage estimate;
- $\varepsilon_\mathrm{low}$ and $\varepsilon_\mathrm{high}$ are the lower and upper clipping ranges (`cliprange_low` and `cliprange_high` in the code below).
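To make the interaction between the min and the clip concrete, here is a minimal PyTorch sketch (toy values of our own choosing, not the verl implementation) for two positive-advantage tokens: one whose ratio is still inside the trust region and one whose ratio has already exceeded $1 + \varepsilon_\mathrm{high}$.

```python
import torch

# Toy illustration of the per-token PPO-Clip objective (values are illustrative only).
eps_low, eps_high = 0.2, 0.2            # clipping ranges
ratio = torch.tensor([1.05, 1.50])      # r_theta for two tokens
adv = torch.tensor([1.0, 1.0])          # both tokens have positive advantage

unclipped = ratio * adv
clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv
objective = torch.minimum(unclipped, clipped)

print(objective)  # tensor([1.0500, 1.2000])
# The first token stays on the unclipped branch, so its gradient keeps pushing the
# policy in the favorable direction; the second token hits the 1 + eps_high ceiling,
# its term becomes a constant, and its gradient for this update is zero.
```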
On closer examination, the theoretical core of PPO-Clip can be summarized in two aspects:
Below is the core implementation of PPO-Clip, excerpted from the verl framework; we will occasionally refer back to this code to aid understanding of later concepts. We then delve into how PPO-Clip achieves these two functions in more detail.
```python
# Excerpt from verl's dual-clip PPO policy-loss computation.
# log_prob / old_log_prob: per-token log-probabilities; advantages: per-token advantages;
# response_mask: masks out prompt and padding tokens.
negative_approx_kl = torch.clamp(log_prob - old_log_prob, min=-20.0, max=20.0)
ratio = torch.exp(negative_approx_kl)  # importance sampling ratio r_theta
# Standard PPO-Clip: take the more pessimistic (larger) of the two per-token losses.
pg_losses1 = -advantages * ratio
pg_losses2 = -advantages * torch.clamp(ratio, 1 - cliprange_low, 1 + cliprange_high)
clip_pg_losses1 = torch.maximum(pg_losses1, pg_losses2)
pg_clipfrac = verl_F.masked_mean(torch.gt(pg_losses2, pg_losses1).float(), response_mask)
# Dual-clip PPO: additionally bound the loss for negative-advantage tokens.
pg_losses3 = -advantages * clip_ratio_c
clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1)
pg_clipfrac_lower = verl_F.masked_mean(torch.gt(clip_pg_losses1, pg_losses3) * (advantages < 0).float(), response_mask)
pg_losses = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1)
```
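To see this logic in isolation, here is a hypothetical, self-contained sketch (toy tokens, with clip settings assumed at common defaults rather than taken from our experiments) that re-runs the same clipping computation and checks which tokens still receive gradient and which are effectively masked.

```python
import torch

# Standalone demo of the clipping logic above (not part of verl).
# Clip settings are assumed common defaults, not our experimental configuration.
cliprange_low, cliprange_high, clip_ratio_c = 0.2, 0.2, 3.0

old_log_prob = torch.zeros(4)
log_prob = torch.log(torch.tensor([1.5, 0.9, 0.5, 4.0])).requires_grad_(True)
advantages = torch.tensor([1.0, 1.0, -1.0, -1.0])

ratio = torch.exp(torch.clamp(log_prob - old_log_prob, -20.0, 20.0))
pg_losses1 = -advantages * ratio
pg_losses2 = -advantages * torch.clamp(ratio, 1 - cliprange_low, 1 + cliprange_high)
clip_pg_losses1 = torch.maximum(pg_losses1, pg_losses2)
pg_losses3 = -advantages * clip_ratio_c
clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1)
pg_losses = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1)

pg_losses.sum().backward()
print(log_prob.grad)
# Only token 2 (positive advantage, ratio inside the clip range) keeps a nonzero
# gradient. Tokens 1 and 3 are masked by the standard clip, and token 4
# (negative advantage, very large ratio) is masked by the dual-clip bound.
```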
Let’s first take a look at token-masking and see which tokens PPO-Clip masks. Here, we will only visualize the final masking effect. For a detailed analysis of the PPO loss formula, you can refer to this guide. The result is illustrated in the figure below: