Moonshot AI Researchers Introduce Seer: An Online Context Learning System for Fast Synchronous Reinforcement Learning (RL) Rollouts

How do you keep reinforcement learning for large reasoning models from stalling on a few very long, very slow rollouts while GPUs sit underused? A team of researchers from Moonshot AI and Tsinghua University introduces Seer, a new online context learning system that targets a specific systems bottleneck in reinforcement learning for large language models. In synchronous on policy setups, the rollout phase dominates the cost of each iteration. Seer restructures this phase and reports rollout throughput gains of 74 percent to 97 percent and tail latency reductions of 75 percent to 93 percent compared with a strong synchronous baseline called veRL.

https://arxiv.org/pdf/2511.14617

Why is synchronous rollout slow for reasoning models?

Modern reasoning RL workloads use long chain of thought style outputs. In the Seer experiments, the researchers apply GRPO to three models: Moonlight, Qwen2 VL 72B and Kimi K2. These workloads run on a cluster of 32 compute nodes with 8 H800 GPUs per node. The three tasks use 32, 128 and 256 GPUs respectively, with 400, 600 and 800 prompts per iteration and 8 or 16 responses per prompt.

Maximum generation lengths are large. Moonlight is configured for 65,536 tokens, Qwen2 VL 72B for 40,960 tokens and Kimi K2 for 98,304 tokens. A single long chain of thought request can grow from a few hundred megabytes of KVCache to tens of gigabytes as decoding progresses. This memory growth forces instances to reduce concurrency or to preempt requests, which triggers expensive re-decoding.
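To make the memory pressure concrete, the back-of-the-envelope sketch below estimates how KVCache grows with the number of generated tokens. The model dimensions used here (layers, KV heads, head size, FP16 storage) are illustrative assumptions, not the actual configurations of the three models above.

```python
# Back-of-the-envelope KVCache growth for one request.
# All model dimensions below are illustrative assumptions, not the real
# Moonlight / Qwen2 VL 72B / Kimi K2 configurations.
def kv_cache_bytes(tokens, layers=60, kv_heads=8, head_dim=128, dtype_bytes=2):
    # The factor 2 accounts for the key and value tensors in each layer.
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return tokens * per_token

for tokens in (2_048, 16_384, 98_304):
    print(f"{tokens:>6} tokens -> ~{kv_cache_bytes(tokens) / 2**30:.1f} GiB")

# With these assumed dimensions, a 2,048 token response sits in the
# hundreds-of-megabytes range, while a 98,304 token chain of thought
# approaches tens of gigabytes of KVCache for a single request.
```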

The research team defines tail requests as the last 10 percent of requests to finish in a rollout. For Moonlight and Qwen2 VL 72B, this tail alone can consume up to 50 percent of the total rollout time in the baseline system. Rollout already dominates iteration time, so this tail effect directly slows RL.

Seer architecture on top of Mooncake and vLLM

Seer keeps the RL algorithm identical to synchronous veRL. Each training iteration uses only data from the current rollout iteration, so the system preserves on policy behavior. The training phase uses Megatron for distributed optimization. The rollout phase uses an in house implementation of vLLM as the inference engine.

To support aggressive request scheduling, Seer relies on a Global KVCache Pool built on the Mooncake disaggregated KVCache architecture used in production for Kimi. Mooncake provides a two tier DRAM and SSD KVCache store shared across inference nodes, which lets Seer migrate requests between instances without recomputing prefills.
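As a mental model, the sketch below shows what a minimal interface to such a shared pool could look like, with a hot DRAM tier and a colder SSD tier. The class and method names are hypothetical and do not reflect Mooncake's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical interface for a two tier (DRAM + SSD) shared KVCache pool.
# Names and structure are illustrative; they are not Mooncake's real API.
@dataclass
class GlobalKVCachePool:
    dram: Dict[str, List[bytes]] = field(default_factory=dict)  # hot tier
    ssd: Dict[str, List[bytes]] = field(default_factory=dict)   # capacity tier

    def put(self, request_id: str, kv_blocks: List[bytes]) -> None:
        # Store KV blocks so any inference instance can resume this request.
        self.dram[request_id] = kv_blocks

    def get(self, request_id: str) -> List[bytes]:
        # Read from DRAM first, fall back to SSD; either way the resuming
        # instance avoids re-running the prefill.
        if request_id in self.dram:
            return self.dram[request_id]
        return self.ssd.get(request_id, [])

    def evict_to_ssd(self, request_id: str) -> None:
        # Move a cold entry to the capacity tier to free DRAM.
        if request_id in self.dram:
            self.ssd[request_id] = self.dram.pop(request_id)
```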

On top of this substrate, Seer introduces three key mechanisms:

  1. Divided Rollout
  2. Context Aware Scheduling
  3. Adaptive Grouped Speculative Decoding

These are orchestrated by a Request Buffer, a Context Manager and an Inference Engine Pool connected to the Global KVCache Pool.

Divided Rollout, fine grained scheduling and migration

Conventional synchronous rollout assigns whole GRPO groups to inference instances. A group is a set of requests that share one prompt. Once assigned, a group stays on the same instance until all responses finish. Due to large variance in output lengths, this leads to load imbalance and long running stragglers.

Seer breaks groups down in two steps. It first decomposes each group into individual requests. It then divides each request into multiple chunks based on generation length. When the scheduler dispatches a request from the Request Buffer, it sets a small max tokens value, such as 8,000 tokens, for that chunk. After each chunk, the request is re-enqueued until it reaches an end of sequence token or its original max tokens limit.

Because KVCache is stored in the Global KVCache Pool, divided requests can move between instances at chunk boundaries without re running the prefill. The scheduler maintains a concurrency level that keeps memory utilization high while avoiding preemption. This reduces waste and smooths KVCache usage across the iteration.
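A minimal sketch of this chunked dispatch loop is shown below. The Request Buffer is modeled as a simple deque, and generate_chunk stands in for one bounded decoding call on whichever instance the scheduler picks; both names are hypothetical, not Seer's actual API.

```python
import collections

CHUNK_TOKENS = 8_000  # per-dispatch max tokens budget for one chunk

def divided_rollout(requests, generate_chunk):
    """requests: list of dicts with 'id', 'prompt' and 'max_tokens'.
    generate_chunk(req, max_tokens) -> (new_tokens, finished) is a stand-in
    for one bounded decoding call on some instance in the pool."""
    buffer = collections.deque(requests)           # toy Request Buffer
    outputs = {r["id"]: [] for r in requests}
    while buffer:
        req = buffer.popleft()
        remaining = req["max_tokens"] - len(outputs[req["id"]])
        new_tokens, finished = generate_chunk(req, min(CHUNK_TOKENS, remaining))
        outputs[req["id"]].extend(new_tokens)
        # Re-enqueue unless the request hit EOS or its original token budget.
        # Because the KVCache lives in the shared pool, the next chunk may
        # run on a different instance without redoing the prefill.
        if not finished and len(outputs[req["id"]]) < req["max_tokens"]:
            buffer.append(req)
    return outputs
```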

Context Aware Scheduling using group length statistics

The research team observes that different requests in the same group tend to have correlated output lengths. Seer uses this structure as online context. For each prompt group, it designates one request as the speculative request. The scheduler keeps speculative requests in a high priority queue and serves them with a smallest first policy based on tokens generated so far. Short requests complete quickly and exit. Long requests remain and identify groups that are potential tail candidates.

The Context Manager maintains a length estimate for each group. It updates this estimate to the maximum generated length among completed requests in the group. If no request has finished, it uses the original max tokens as a conservative bound. Once speculative requests are in flight or done, Seer schedules remaining requests with an approximate longest first policy at group level. This design achieves throughput and tail behavior close to an oracle scheduler that knows all output lengths in advance.
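The following sketch captures this scheduling policy under simplified assumptions: each group is a dict of request dicts, the first request in a group plays the role of the speculative request, and helper names such as group_length_estimate are hypothetical.

```python
def group_length_estimate(group, default_max_tokens):
    """Estimate a group's output length: the maximum generated length among
    its finished requests, or the original max tokens as a conservative bound."""
    finished = [r["generated"] for r in group["requests"] if r["done"]]
    return max(finished) if finished else default_max_tokens

def schedule_order(groups, default_max_tokens):
    """Return (speculative_queue, remaining_queue) in dispatch order."""
    # One speculative request per group, served smallest first by the number
    # of tokens generated so far, so short groups finish and exit quickly.
    speculative = sorted(
        (g["requests"][0] for g in groups),
        key=lambda r: r["generated"],
    )
    # Remaining requests are dispatched approximately longest first at group
    # level, using the online length estimate as the sort key.
    remaining = []
    for g in sorted(
        groups,
        key=lambda g: group_length_estimate(g, default_max_tokens),
        reverse=True,
    ):
        remaining.extend(g["requests"][1:])
    return speculative, remaining
```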

Adaptive Grouped Speculative Decoding

Seer adds Adaptive Grouped Speculative Decoding on top of the previous two components to accelerate decoding, especially for long requests in the tail. It introduces a Distributed Grouped Draft Server, or DGDS. DGDS maintains a Compressed Suffix Tree for each group and aggregates token sequences from all requests in that group. Instances asynchronously append generated tokens to DGDS, periodically fetch updated suffix trees and perform local speculative decoding based on the shared pattern statistics.

The system adjusts draft length and the number of paths according to model architecture, batch size and measured acceptance length. For dense and Mixture of Experts models, it pre-computes different speculation thresholds and uses them to bound draft depth for each batch. In late tail stages, concurrency is low, so Seer increases draft depth and enables multi path drafting to raise accepted tokens per step.
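The toy sketch below illustrates the grouped drafting idea with a plain suffix statistics table rather than a true Compressed Suffix Tree, plus a simple rule for adapting draft depth to measured acceptance. The class names, thresholds and depth policy are illustrative assumptions, not the DGDS implementation.

```python
from collections import defaultdict

class GroupDraftModel:
    """Toy grouped draft source: counts next-token continuations of short
    suffixes seen in sibling responses of the same prompt group. A stand-in
    for DGDS's Compressed Suffix Tree, for illustration only."""
    def __init__(self, suffix_len=4):
        self.suffix_len = suffix_len
        self.next_counts = defaultdict(lambda: defaultdict(int))

    def append(self, tokens):
        # Aggregate token statistics from one sibling response in the group.
        for i in range(len(tokens) - 1):
            suffix = tuple(tokens[max(0, i + 1 - self.suffix_len): i + 1])
            self.next_counts[suffix][tokens[i + 1]] += 1

    def draft(self, context, depth):
        # Greedily extend the context with the most frequent continuation.
        out, ctx = [], list(context)
        for _ in range(depth):
            candidates = self.next_counts.get(tuple(ctx[-self.suffix_len:]))
            if not candidates:
                break
            nxt = max(candidates, key=candidates.get)
            out.append(nxt)
            ctx.append(nxt)
        return out

def adapt_depth(depth, accepted, min_depth=2, max_depth=16):
    # Grow the draft when most tokens are accepted (as in the low-concurrency
    # tail), shrink it when acceptance is poor. Thresholds are illustrative.
    if accepted >= 0.8 * depth:
        return min(depth * 2, max_depth)
    if accepted <= 0.3 * depth:
        return max(depth // 2, min_depth)
    return depth
```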

Ablation results show that divided rollout yields up to 35 percent throughput improvement over the baseline. Adding Context Aware Scheduling increases this to up to 47 percent over baseline. Enabling grouped speculative decoding raises the total speedup to 77 percent to 87 percent over the baseline in the evaluated iteration.

End to end impact on RL training

The research team evaluates Seer on three RL tasks built on Moonlight, Qwen2 VL 72B and Kimi K2. The evaluation runs 10 rollout iterations per task and measures output tokens per second and completion time for each rollout. Seer improves rollout throughput by 74 percent to 97 percent across these workloads relative to veRL with the same RL algorithm and vLLM based inference engine.

Tail latency is reduced by 75 percent to 93 percent. For memory constrained tasks, the baseline system spends up to half of its time on the last 10 percent of requests. Seer removes most of this tail by combining divided rollout, Context Aware Scheduling and Adaptive Grouped Speculative Decoding on top of the Mooncake based Global KVCache Pool.

Key Takeaways

  • Rollout bottleneck: Seer targets the rollout phase of synchronous RL, which accounts for about 63% to 87% of iteration time and is dominated by long tail requests and KV cache fragmentation.
  • Three core mechanisms: Seer combines divided rollout, context aware scheduling and adaptive grouped speculative decoding to exploit output length and pattern similarity among GRPO responses that share a prompt.
  • Fine grained scheduling on a global KV cache: Requests are split into chunks and migrated across a Mooncake style Global KVCache Pool, which preserves synchronous on policy RL while keeping GPU memory utilization high and reducing preemptions.
  • Online context for tail latency reduction: Group level length statistics from speculative requests drive context aware scheduling that approximates an oracle longest first scheduler and sharply reduces the time spent on the last 10 percent of requests.
  • Measured end to end gains: On production grade RL workloads with Moonlight, Qwen2 VL 72B and Kimi K2, Seer improves rollout throughput by 74% to 97% and reduces long tail latency by 75% to 93% relative to a state of the art synchronous vLLM based baseline.

Editorial Comments

Seer is an important systems contribution because it optimizes the rollout phase in synchronous RL without changing the underlying GRPO algorithm, so it preserves on policy guarantees and reproducibility while fixing a real infrastructure bottleneck. The combination of divided rollout, context aware scheduling and adaptive grouped speculative decoding offers a practical template for other RL stacks that rely on long chain of thought reasoning models and large KVCache footprints. Overall, Seer shows that online context learning at the systems level is now as critical as model architecture for scaling reasoning RL efficiently.


