
Reproduction gap on LIBERO (especially libero_10): could the authors release the LIBERO evaluation pipeline? #4

@Liam-L2

🚀 The feature, motivation and pitch

Body:

Hi authors,

Thanks a lot for open-sourcing QuantVLA. The selective layout + ATM/OHB design is very clean. I have been trying to reproduce the π0.5 FP16 baseline numbers reported in Table 2 on the LIBERO benchmark, but I observe a non-trivial gap, particularly on libero_10. I would really appreciate your help in identifying what might be different in my setup.

Reproduction Results (π0.5 FP16 baseline)

| Suite | Paper (Table 2) | My Reproduction | Gap (pp) |
|---|---|---|---|
| libero_spatial | 98.5% | 95.0% | −3.5 |
| libero_object | 99.0% | 97.4% | −1.6 |
| libero_goal | 97.5% | 97.4% | −0.1 |
| libero_10 (long) | 93.5% | 84.4% | −9.1 |
| Average | 97.1% | 93.6% | −3.5 |
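As a quick sanity check on the arithmetic, the per-suite gaps and the averages can be recomputed from the success rates listed above. This is only a small sketch; the variable and function names are mine and do not come from the QuantVLA or OpenPI code.

```python
# Success rates (%) copied from the results above:
# (paper Table 2, my reproduction).
results = {
    "libero_spatial": (98.5, 95.0),
    "libero_object": (99.0, 97.4),
    "libero_goal": (97.5, 97.4),
    "libero_10": (93.5, 84.4),
}


def gap_pp(paper: float, mine: float) -> float:
    """Gap in percentage points (negative means my run is lower)."""
    return round(mine - paper, 1)


# Per-suite gap and unrounded averages over the four suites.
gaps = {suite: gap_pp(p, m) for suite, (p, m) in results.items()}
paper_avg = sum(p for p, _ in results.values()) / len(results)
mine_avg = sum(m for _, m in results.values()) / len(results)
avg_gap = mine_avg - paper_avg

for suite, g in gaps.items():
    print(f"{suite:15s} {g:+.1f} pp")
print(f"paper avg {paper_avg:.3f}, my avg {mine_avg:.3f}, gap {avg_gap:+.3f}")
```

The unrounded average gap is about −3.6 pp; the −3.5 listed above is the difference of the two rounded averages.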

libero_goal and libero_object are essentially matched, libero_spatial has a small gap, but libero_10 is clearly off. Since the paper notes that long-horizon tasks are where quantization-induced drift accumulates most, I want to make sure the FP16 baseline itself is aligned before drawing any conclusions about the quantized variant.

Environment & Setup

  • Model: π0.5 (OpenPI). Could you confirm which checkpoint was used: the official JAX checkpoint from Physical-Intelligence/openpi, or the PyTorch-converted lerobot/pi05_libero_base / lerobot/pi05_libero_finetuned?
  • Fine-tuning data: official LIBERO dataset (no additional mixing)
  • Evaluation: "standard LIBERO protocol" as described in Sec. 4.1, 20 episodes per task × tasks per suite, single-env rollout
  • Hardware: NVIDIA RTX 5090 (the paper uses A100; this shouldn't affect success rate, but I'm noting it in case of precision-related differences)

Specific Questions on Settings Not Fully Specified in the Paper

A few details are only partially covered in Sec. 4.1 and Appendix D, and I suspect the gap on libero_10 stems from one of these:

  1. Denoising / action steps for π0.5. Table 4 reports 8 and 16 steps for GR00T N1.5, but the π0.5 setting in Table 2 is not explicitly stated. Could you confirm n_action_steps and the number of flow-matching denoising steps used for π0.5 on LIBERO?
  2. Action chunk execution. Was open-loop chunk execution used (e.g., executing all 10 predicted actions before re-querying the policy), or was a replanning cadence applied? Long-horizon performance is very sensitive to this.
  3. Evaluation episode count and seeds. Sec. 4.1 references the "standard LIBERO protocol" but does not specify episodes-per-task or seed list. A single fixed seed set would help the community get matching numbers.
  4. Image preprocessing / prompt template. Any differences in image resolution, view ordering (wrist vs. third-person), or the language instruction template compared to the OpenPI reference implementation?
  5. Proprioception normalization. The LeRobot π0.5 stack uses QUANTILES normalization with q01/q99 stats. Was the same normalization used when reproducing the paper's numbers?
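To make questions 2 and 5 concrete, here is a minimal sketch of what I mean by open-loop chunk execution versus a replanning cadence, and by q01/q99 quantile normalization. Everything here is hypothetical: `rollout`, `quantile_normalize`, `replan_every`, and the toy policy/environment are illustrative names, not OpenPI, LeRobot, or QuantVLA APIs, and the mapping of q01/q99 to [-1, 1] is my assumption about the convention.

```python
import numpy as np


def quantile_normalize(x, q01, q99, eps=1e-6):
    """Assumed convention: map q01 -> -1 and q99 -> +1 (approximately)."""
    x, q01, q99 = (np.asarray(a, dtype=np.float64) for a in (x, q01, q99))
    return (x - q01) / (q99 - q01 + eps) * 2.0 - 1.0


def rollout(policy, env_step, obs, max_steps, chunk_len=10, replan_every=10):
    """Run one episode, re-querying the policy every `replan_every` steps.

    replan_every == chunk_len -> fully open-loop chunk execution
    replan_every <  chunk_len -> replanning before the chunk is exhausted
    Returns (steps taken, number of policy queries, success flag).
    """
    if not 1 <= replan_every <= chunk_len:
        raise ValueError("replan_every must be in [1, chunk_len]")
    steps, queries = 0, 0
    while steps < max_steps:
        chunk = policy(obs)[:chunk_len]  # predicted action chunk
        queries += 1
        for action in chunk[:replan_every]:  # execute only a prefix
            obs, done = env_step(action)
            steps += 1
            if done or steps >= max_steps:
                return steps, queries, done
    return steps, queries, False


# Toy stand-ins so the sketch runs without LIBERO installed.
def toy_policy(obs):
    return np.zeros((10, 7))  # 10 actions of dimension 7 (hypothetical)


def make_toy_env(horizon):
    state = {"t": 0}

    def env_step(action):
        state["t"] += 1
        return state["t"], state["t"] >= horizon  # (obs, done)

    return env_step


open_loop = rollout(toy_policy, make_toy_env(25), obs=0, max_steps=100,
                    replan_every=10)
replanned = rollout(toy_policy, make_toy_env(25), obs=0, max_steps=100,
                    replan_every=5)
print(open_loop, replanned)
```

In this toy setting the only difference is how often the policy is queried, but with a real policy the two cadences see different observations mid-chunk, which is why I expect libero_10 to be sensitive to this setting.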

Request

Would it be possible to release:

  • The LIBERO evaluation script corresponding to the numbers in Table 2 (including config files specifying denoising steps, action chunk length, episode count, and seeds),
  • And, if the FP16 baseline involved any LIBERO-specific fine-tuning beyond the public pi05_libero checkpoint, the fine-tuning script / config as well.

A minimal scripts/eval_libero_pi05.sh or equivalent would be enormously helpful for the community to verify and build on this work, especially since QuantVLA is positioned as the first PTQ framework for VLA systems and reliable baseline reproduction is a prerequisite for meaningful follow-up comparisons.
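For illustration, this is roughly the set of fields I would hope such a config pins down. It is a hypothetical sketch: `LiberoEvalConfig` and its field names are my own and do not correspond to any existing QuantVLA, OpenPI, or LeRobot config. Fields left as `None` or empty are the ones I could not determine from the paper; the filled-in values are the ones from my own setup.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class LiberoEvalConfig:
    """Hypothetical container for the settings asked about in this issue."""
    suite: str
    checkpoint: Optional[str] = None           # which π0.5 checkpoint
    num_denoising_steps: Optional[int] = None  # flow-matching steps
    n_action_steps: Optional[int] = None       # action chunk length
    replan_every: Optional[int] = None         # steps executed per query
    episodes_per_task: Optional[int] = None
    seeds: List[int] = field(default_factory=list)
    normalization: Optional[str] = None        # e.g. "QUANTILES"

    def missing(self) -> List[str]:
        """Names of fields that are still unspecified."""
        return sorted(k for k, v in asdict(self).items() if v in (None, []))


# Values from my own run; everything else is unknown to me.
cfg = LiberoEvalConfig(
    suite="libero_10",
    episodes_per_task=20,
    normalization="QUANTILES",
)
print(json.dumps(asdict(cfg), indent=2))
print("still unspecified:", cfg.missing())
```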

Thanks again for the great work and for considering this request!
