Authors: Jiakang Wang*† ([email protected]), Runze Liu*†, Lei Lin, Wenpin Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou
*: Project Leads; †: Significant Contributor
First Published: September 18, 2025
<aside> ✨
TL;DR
Mainstream reinforcement learning algorithms for training large language models (LLMs), such as PPO and GRPO, rely on the PPO-Clip component. We identify a critical flaw in its original design: a weight mismatch for tokens with positive advantage, which leads to overfitting, entropy collapse, and repeated outputs, ultimately causing premature convergence to local optima. To address this, we propose Asymmetric Dual-Clipping, a redesigned clipping strategy. Experiments show that it effectively prevents premature convergence, improves training stability, and significantly enhances LLM performance.
👨💻 Github, 🤗 HF Model, 🤗 HF Dataset, 🌐 Zhihu, 📖 Paper
</aside>
<aside> 🎯
Key Insights for a Deeper Understanding of Reinforcement Learning in LLM Training
The PPO-Clip technique was introduced by OpenAI researchers as part of the PPO algorithm. It restricts the trust region by clipping the importance sampling ratio, thereby improving the stability of reinforcement learning: the policy is encouraged to move in a favorable direction, while the maximum step size of each update is strictly limited. The PPO loss with PPO-Clip is expressed as follows:
$\mathcal{L}_{\mathrm{PPO}}(\theta) = \mathbb{E}_{\tau}\bigg[\min\Big( r_\theta(\tau)\, A(\tau),\ \mathrm{clip}\big(r_\theta(\tau),\, 1 - \varepsilon_{\mathrm{low}},\, 1 + \varepsilon_{\mathrm{high}}\big)\, A(\tau) \Big)\bigg]$
where:
- $r_\theta(\tau) = \pi_\theta(\tau) / \pi_{\theta_\mathrm{old}}(\tau)$ is the importance sampling ratio between the current policy and the old (rollout) policy;
- $A(\tau)$ is the advantage estimate;
- $\varepsilon_\mathrm{low}$ and $\varepsilon_\mathrm{high}$ are the lower and upper clipping ranges (`cliprange_low` and `cliprange_high` in the code below).
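To make the interaction between the min and the clip concrete, here is a minimal PyTorch sketch (toy values of our own choosing, not the verl implementation) for two positive-advantage tokens: one whose ratio is still inside the trust region and one whose ratio has already exceeded $1 + \varepsilon_\mathrm{high}$.

```python
import torch

# Toy illustration of the per-token PPO-Clip objective (values are illustrative only).
eps_low, eps_high = 0.2, 0.2            # clipping ranges
ratio = torch.tensor([1.05, 1.50])      # r_theta for two tokens
adv = torch.tensor([1.0, 1.0])          # both tokens have positive advantage

unclipped = ratio * adv
clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv
objective = torch.minimum(unclipped, clipped)

print(objective)  # tensor([1.0500, 1.2000])
# The first token stays on the unclipped branch, so its gradient keeps pushing the
# policy in the favorable direction; the second token hits the 1 + eps_high ceiling,
# its term becomes a constant, and its gradient for this update is zero.
```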
On closer examination, the theoretical core of PPO-Clip can be summarized in two aspects:
Below is the core implementation of PPO-Clip, excerpted from the verl framework; we will occasionally refer back to this code to aid understanding of later concepts. We then delve into how PPO-Clip achieves these two functions in more detail.
```python
# Excerpt from verl's dual-clip PPO policy-loss computation.
# log_prob / old_log_prob: per-token log-probabilities; advantages: per-token advantages;
# response_mask: masks out prompt and padding tokens.
negative_approx_kl = torch.clamp(log_prob - old_log_prob, min=-20.0, max=20.0)
ratio = torch.exp(negative_approx_kl)  # importance sampling ratio r_theta
# Standard PPO-Clip: take the more pessimistic (larger) of the two per-token losses.
pg_losses1 = -advantages * ratio
pg_losses2 = -advantages * torch.clamp(ratio, 1 - cliprange_low, 1 + cliprange_high)
clip_pg_losses1 = torch.maximum(pg_losses1, pg_losses2)
pg_clipfrac = verl_F.masked_mean(torch.gt(pg_losses2, pg_losses1).float(), response_mask)
# Dual-clip PPO: additionally bound the loss for negative-advantage tokens.
pg_losses3 = -advantages * clip_ratio_c
clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1)
pg_clipfrac_lower = verl_F.masked_mean(torch.gt(clip_pg_losses1, pg_losses3) * (advantages < 0).float(), response_mask)
pg_losses = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1)
```
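To see this logic in isolation, here is a hypothetical, self-contained sketch (toy tokens, with clip settings assumed at common defaults rather than taken from our experiments) that re-runs the same clipping computation and checks which tokens still receive gradient and which are effectively masked.

```python
import torch

# Standalone demo of the clipping logic above (not part of verl).
# Clip settings are assumed common defaults, not our experimental configuration.
cliprange_low, cliprange_high, clip_ratio_c = 0.2, 0.2, 3.0

old_log_prob = torch.zeros(4)
log_prob = torch.log(torch.tensor([1.5, 0.9, 0.5, 4.0])).requires_grad_(True)
advantages = torch.tensor([1.0, 1.0, -1.0, -1.0])

ratio = torch.exp(torch.clamp(log_prob - old_log_prob, -20.0, 20.0))
pg_losses1 = -advantages * ratio
pg_losses2 = -advantages * torch.clamp(ratio, 1 - cliprange_low, 1 + cliprange_high)
clip_pg_losses1 = torch.maximum(pg_losses1, pg_losses2)
pg_losses3 = -advantages * clip_ratio_c
clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1)
pg_losses = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1)

pg_losses.sum().backward()
print(log_prob.grad)
# Only token 2 (positive advantage, ratio inside the clip range) keeps a nonzero
# gradient. Tokens 1 and 3 are masked by the standard clip, and token 4
# (negative advantage, very large ratio) is masked by the dual-clip bound.
```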
Let’s first take a look at token-masking and see which tokens PPO-Clip masks. Here, we will only visualize the final masking effect. For a detailed analysis of the PPO loss formula, you can refer to this guide. The result is illustrated in the figure below: