🚀 The feature, motivation and pitch
Body:
Hi authors,
Thanks a lot for open-sourcing QuantVLA; the selective layout + ATM/OHB design is very clean. I have been trying to reproduce the π0.5 FP16 baseline numbers reported in Table 2 on the LIBERO benchmark, but I observe a non-trivial gap, particularly on `libero_10`. I would really appreciate your help in identifying what might be different in my setup.
Reproduction Results (π0.5 FP16 baseline)

| Suite | Paper (Table 2) | My Reproduction | Gap |
| --- | --- | --- | --- |
| `libero_spatial` | 98.5% | 95.0% | −3.5 |
| `libero_object` | 99.0% | 97.4% | −1.6 |
| `libero_goal` | 97.5% | 97.4% | −0.1 |
| `libero_10` (long) | 93.5% | 84.4% | −9.1 |
| Average | 97.1% | 93.6% | −3.5 |
`libero_goal` and `libero_object` are essentially matched, `libero_spatial` has a small gap, but `libero_10` is clearly off. Since the paper notes that long-horizon tasks are where quantization-induced drift accumulates most, I want to make sure the FP16 baseline itself is aligned before drawing any conclusions about the quantized variant.
Environment & Setup
- Model: π0.5 (OpenPI). Could you confirm which checkpoint was used: the official JAX checkpoint from `Physical-Intelligence/openpi`, or the PyTorch-converted `lerobot/pi05_libero_base` / `lerobot/pi05_libero_finetuned`?
- Fine-tuning data: official LIBERO dataset (no additional mixing)
- Evaluation: "standard LIBERO protocol" as described in Sec. 4.1, 20 episodes per task × tasks per suite, single-env rollout (a sketch of my evaluation loop follows this list)
- Hardware: NVIDIA RTX 5090 (the paper uses an A100, which shouldn't affect success rate, but I note it in case of any precision-related differences)
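
For transparency, here is a minimal sketch of the evaluation loop I am using, in case the gap comes from my harness rather than the checkpoint. `policy.select_action` and `make_env` are placeholder interfaces (not names from QuantVLA, OpenPI, or LeRobot), and the 20-episode / fixed-seed choices are my assumptions:

```python
import numpy as np

EPISODES_PER_TASK = 20                    # my reading of the "standard protocol"
SEEDS = list(range(EPISODES_PER_TASK))    # fixed per-episode seeds; actual list unknown

def evaluate_suite(policy, tasks, make_env, max_steps=520):
    """Sequential single-env rollouts; returns mean success over all episodes.

    `policy.select_action(obs)` and `make_env(task, seed)` are placeholder
    interfaces standing in for my wrappers.
    """
    successes = []
    for task in tasks:
        for seed in SEEDS:
            env = make_env(task, seed=seed)
            obs = env.reset()
            success = False
            for _ in range(max_steps):
                action = policy.select_action(obs)
                obs, _, done, info = env.step(action)
                if done:
                    success = bool(info.get("success", False))
                    break
            successes.append(float(success))
            env.close()
    return float(np.mean(successes))
```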
Specific Questions on Settings Not Fully Specified in the Paper
A few details are only partially covered in Sec. 4.1 and Appendix D, and I suspect the gap on `libero_10` stems from one of these:
- Denoising / action steps for π0.5. Table 4 reports 8 and 16 steps for GR00T N1.5, but the π0.5 setting in Table 2 is not explicitly stated. Could you confirm `n_action_steps` and the number of flow-matching denoising steps used for π0.5 on LIBERO?
- Action chunk execution. Was open-loop chunk execution used (e.g., executing all 10 predicted actions before re-querying the policy), or was a replanning cadence applied? Long-horizon performance is very sensitive to this; see the first sketch after this list.
- Evaluation episode count and seeds. Sec. 4.1 references the "standard LIBERO protocol" but does not specify episodes-per-task or seed list. A single fixed seed set would help the community get matching numbers.
- Image preprocessing / prompt template. Any differences in image resolution, view ordering (wrist vs. third-person), or the language instruction template compared to the OpenPI reference implementation?
- Proprioception normalization. The LeRobot π0.5 stack uses QUANTILES normalization with `q01`/`q99` stats. Was the same normalization used when reproducing the paper's numbers? The second sketch after this list shows my understanding of it.
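
To make the chunk-execution question concrete, here is a sketch of the two strategies I have in mind. `policy.predict_chunk` is a placeholder returning an `(n_action_steps, action_dim)` array, not an actual OpenPI/LeRobot method:

```python
def rollout_open_loop(env, obs, policy, horizon, n_action_steps=10):
    """Open-loop: execute every action in the predicted chunk before re-querying."""
    t = 0
    while t < horizon:
        chunk = policy.predict_chunk(obs)          # placeholder inference call
        for action in chunk[:n_action_steps]:
            obs, _, done, _ = env.step(action)
            t += 1
            if done or t >= horizon:
                return obs
    return obs

def rollout_with_replanning(env, obs, policy, horizon, replan_every=5):
    """Replanning: execute only a prefix of each chunk, then re-query the policy."""
    t = 0
    while t < horizon:
        chunk = policy.predict_chunk(obs)
        for action in chunk[:replan_every]:        # remaining actions are discarded
            obs, _, done, _ = env.step(action)
            t += 1
            if done or t >= horizon:
                return obs
    return obs
```

On long-horizon suites like `libero_10` the choice between these two can plausibly account for a gap of this size, which is why I would like to confirm which was used.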
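And here is the `q01`/`q99` normalization as I understand the LeRobot QUANTILES convention: map each state dimension's [q01, q99] range to roughly [−1, 1]. The exact formula in the authors' pipeline may differ, so this is an assumption to be confirmed rather than a statement of their implementation:

```python
import numpy as np

def quantile_normalize(state, q01, q99, eps=1e-8):
    """Map each dimension's [q01, q99] range to approximately [-1, 1]."""
    return 2.0 * (state - q01) / (q99 - q01 + eps) - 1.0

def quantile_unnormalize(norm_state, q01, q99):
    """Inverse map from [-1, 1] back to the original state range."""
    return (norm_state + 1.0) / 2.0 * (q99 - q01) + q01
```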
Request
Would it be possible to release:
- The LIBERO evaluation script corresponding to the numbers in Table 2 (including config files specifying denoising steps, action chunk length, episode count, and seeds),
- And, if the FP16 baseline involved any LIBERO-specific fine-tuning beyond the public `pi05_libero` checkpoint, the fine-tuning script / config as well.
A minimal `scripts/eval_libero_pi05.sh` or equivalent would be enormously helpful for the community to verify and build on this work, especially since QuantVLA is positioned as the first PTQ framework for VLA systems and reliable baseline reproduction is a prerequisite for meaningful follow-up comparisons.
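
To be concrete about what "config files" means here, this is a hypothetical sketch of the fields such a script would need to pin down; every default below is a guess on my part, not a value from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class LiberoEvalConfig:
    checkpoint: str = "lerobot/pi05_libero_finetuned"  # or the JAX one? (see above)
    num_denoising_steps: int = 10      # flow-matching steps; unspecified in Table 2
    n_action_steps: int = 10           # actions executed per chunk; unspecified
    replan_every: int | None = None    # None = fully open-loop execution
    episodes_per_task: int = 20        # my assumption for the standard protocol
    seeds: list[int] = field(default_factory=lambda: list(range(20)))
    image_resolution: int = 224        # guess based on OpenPI defaults
```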
Thanks again for the great work and for considering this request!
Alternatives
No response
Additional context
No response