ALOE: Action-Level Off-Policy Evaluation for Vision-Language-Action Model Post-Training

Real-world actor–critic RL for VLAs with off-policy value estimation from human-in-the-loop data

1 AgiBot   2 The Hong Kong University of Science and Technology   3 Fudan University   4 Independent Researcher   * Equal contribution   Corresponding author

Abstract

We study how to improve large foundation vision-language-action (VLA) systems through online reinforcement learning (RL) in real-world settings. Central to this process is the value function, which provides the learning signal that guides the VLA as it learns from experience. In practice, the value function is estimated from trajectory fragments collected from different sources, including historical policies and intermittent human interventions. Estimating the value of the current policy from this mixed data is inherently an off-policy evaluation problem. However, prior work often adopts conservative on-policy estimation for stability, which avoids directly evaluating the current high-capacity policy and limits learning effectiveness. In this paper, we propose ALOE, an action-level off-policy evaluation framework for VLA post-training. ALOE applies chunking-based temporal-difference bootstrapping to evaluate individual action sequences instead of predicting final task outcomes. This design improves credit assignment to critical action chunks under sparse rewards and supports stable policy improvement. We evaluate our method on three real-world manipulation tasks: smartphone packing as a high-precision task, laundry folding as a long-horizon deformable-object task, and bimanual pick-and-place involving multi-object perception. Across all tasks, ALOE improves learning efficiency without compromising execution speed, showing that off-policy RL can be reintroduced reliably for real-world VLA post-training.

Method

We adopt a real-world actor–critic framework: the actor is a flow-matching foundation VLA and the critic is a lightweight ensemble Q-network. The actor outputs action sequences for online rollouts; the critic evaluates action chunks via Q-chunking temporal-difference updates. Training proceeds in three stages: (1) human-in-the-loop data collection (warm-start the policy with behavior cloning, then store both policy rollouts and human-correction trajectories), (2) off-policy critic estimation on the aggregated data buffer, and (3) policy improvement with pessimistic value estimation and advantage-weighted maximum likelihood. A code sketch of stages (2) and (3) follows the overview below.

ALOE overview: actor–critic framework for VLA post-training.
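
The following sketch (not the authors' released code) illustrates stages (2) and (3) in PyTorch under simplifying assumptions: the names ChunkQCritic, critic_loss, actor_loss, sample_chunk, bc_loss_fn, and the temperature beta are illustrative placeholders, and the per-chunk reward is assumed to be the discounted return accumulated within the chunk. It shows Q-chunking TD bootstrapping with a pessimistic ensemble minimum, followed by an advantage-weighted maximum-likelihood policy update in which bc_loss_fn stands in for the per-chunk imitation (e.g., flow-matching) objective.

# Minimal sketch, assuming a PyTorch setup; names and hyperparameters are
# illustrative, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChunkQCritic(nn.Module):
    """Lightweight ensemble Q-network that scores an entire action chunk."""

    def __init__(self, obs_dim, act_dim, chunk_len, n_ensemble=2, hidden=256):
        super().__init__()
        in_dim = obs_dim + act_dim * chunk_len
        self.members = nn.ModuleList([
            nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(n_ensemble)
        ])

    def forward(self, obs, chunk):
        # obs: (B, obs_dim), chunk: (B, chunk_len, act_dim) -> (n_ensemble, B)
        x = torch.cat([obs, chunk.flatten(1)], dim=-1)
        return torch.stack([m(x).squeeze(-1) for m in self.members])


def critic_loss(critic, target_critic, sample_chunk, batch, chunk_len, gamma=0.99):
    """Chunking-based TD bootstrapping on the mixed (policy + human-correction) buffer.

    `reward` is assumed to be the discounted return accumulated within the chunk,
    so the bootstrap term is discounted by gamma ** chunk_len.
    """
    obs, chunk, reward, next_obs, done = batch
    with torch.no_grad():
        next_chunk = sample_chunk(next_obs)                         # chunk sampled from the current policy
        q_next = target_critic(next_obs, next_chunk).min(0).values  # pessimistic ensemble minimum
        target = reward + (gamma ** chunk_len) * (1.0 - done) * q_next
    q = critic(obs, chunk)                                          # (n_ensemble, B)
    return F.mse_loss(q, target.expand_as(q))


def actor_loss(critic, bc_loss_fn, batch, beta=1.0):
    """Advantage-weighted maximum likelihood: re-weight the per-chunk imitation
    loss (e.g., the flow-matching objective) by exp(advantage / beta)."""
    obs, chunk, *_ = batch
    with torch.no_grad():
        q = critic(obs, chunk).min(0).values
        weights = torch.exp((q - q.mean()) / beta).clamp(max=100.0)  # mean-Q baseline, clipped weights
    return (weights * bc_loss_fn(obs, chunk)).mean()

In this sketch, the ensemble minimum keeps bootstrapped targets conservative on chunks that are rare in the mixed buffer, while the exponential advantage weights concentrate the imitation loss on chunks the critic ranks above the batch average.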

Task Demos & Final Results

Task Demos

We evaluate ALOE on three representative tasks designed to stress long-horizon reasoning, precise action selection, and robustness: Pack Smart Phone (align and attach a phone case onto the device body; requires precise pose alignment and recovery from misalignment), Folding Laundry (long-horizon deformable-object manipulation with grasping, flattening, and sequential folds), and Product Sorting (bimanual pick-and-place: identify, manipulate, and place objects from bins onto a conveyor belt).

Final Results

Average success rate on the three manipulation tasks under real-world evaluation across multiple runs. Blue bars show the results of ALOE.

Success Rate

Pack Smart Phone is precision-critical: the phone case (17.8 cm × 8.8 cm) has minimal clearance inside the rigid container (17.5 cm × 8.6 cm). Under position control, slight pose errors can cause insertion failure, tipping, or damage. The video shows 21 consecutive trials with 19 successes.

Robustness

We inject random perturbations during execution (e.g., disturbing the garment in Folding Laundry or the assembly in Pack Smart Phone) and measure whether the policy can readjust and still complete the task, i.e., whether it recovers from unexpected disturbances.

Pack Smart Phone — recovery under disturbance.

Folding Laundry — recovery under disturbance.

Q-Value Learning

The learned action-value function identifies failure modes and successful recovery: the Q-value drops sharply when the robot fails a critical action (e.g., scooping the phone) and rises on recovery, providing finer credit assignment than trajectory-level estimation.

Q-value and task success
Q(s,a) visualization
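
As a rough illustration of how such a Q-trace can be produced, the snippet below (an assumed interface, not the authors' tooling) scores every executed action chunk of a recorded rollout with the learned critic and plots the resulting curve; critic follows the conventions of the sketch in the Method section, and trajectory is a hypothetical list of (observation, action-chunk) tensor pairs.

# Minimal sketch: trace per-chunk Q-values along a recorded rollout to localize
# failures and recoveries. `critic` and `trajectory` are assumptions (see above).
import torch
import matplotlib.pyplot as plt


@torch.no_grad()
def trace_chunk_q(critic, trajectory):
    """Return the pessimistic (ensemble-min) Q-value of every executed chunk."""
    return [
        critic(obs.unsqueeze(0), chunk.unsqueeze(0)).min(0).values.item()
        for obs, chunk in trajectory
    ]


def plot_q_trace(q_values):
    """Plot Q over chunk index; sharp dips flag failed critical actions."""
    plt.plot(q_values, marker="o")
    plt.xlabel("action chunk index")
    plt.ylabel("Q(s, a)")
    plt.title("Per-chunk Q-value along a rollout")
    plt.show()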

Zero-Shot Generalization

We evaluate generalization on the Product Sorting task by replacing objects with ones never seen during training (different shapes and colors). The policy is evaluated without fine-tuning, assessing the VLA's ability to generalize to unseen object appearances and geometries. Below, we show rollouts on in-distribution (seen) objects for comparison, followed by rollouts on OOD (unseen) objects.

In-distribution (seen objects).

OOD (unseen objects).

Objects used during training (left) and unseen objects used for zero-shot evaluation (right).

Citation

@article{yang2026aloe,
  title         = {ALOE: Action-Level Off-Policy Evaluation for Vision-Language-Action Model Post-Training},
  author        = {Yang, Rushuai and Wang, Hecheng and Liu, Chiming and Yan, Xiaohan and Wang, Yunlong and Du, Xuan and Yue, Shuoyu and Liu, Yongcheng and Zhang, Chuheng and Qi, Lizhe and Chen, Yi and Shan, Wei and Yao, Maoqing},
  year          = {2026},
  eprint        = {2602.12691},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO}
}