🚀 The feature, motivation and pitch
Body:
Hi authors,
Thanks a lot for open-sourcing QuantVLA; the selective layout + ATM/OHB design is very clean. I have been trying to reproduce the π0.5 FP16 baseline numbers reported in Table 2 on the LIBERO benchmark, but I observe a non-trivial gap, particularly on `libero_10`. I would really appreciate your help in identifying what might be different in my setup.
Reproduction Results (π0.5 FP16 baseline)

| Suite | Paper (Table 2) | My Reproduction | Gap |
| --- | --- | --- | --- |
| `libero_spatial` | 98.5% | 95.0% | −3.5 |
| `libero_object` | 99.0% | 97.4% | −1.6 |
| `libero_goal` | 97.5% | 97.4% | −0.1 |
| `libero_10` (long) | 93.5% | 84.4% | −9.1 |
| Average | 97.1% | 93.6% | −3.5 |
`libero_goal` and `libero_object` are essentially matched, `libero_spatial` has a small gap, but `libero_10` is clearly off. Since the paper notes that long-horizon tasks are where quantization-induced drift accumulates most, I want to make sure the FP16 baseline itself is aligned before drawing any conclusions about the quantized variant.
Environment & Setup
- Model: π0.5 (OpenPI). Could you confirm which checkpoint was used: the official JAX checkpoint from `Physical-Intelligence/openpi`, or the PyTorch-converted `lerobot/pi05_libero_base` / `lerobot/pi05_libero_finetuned`?
- Fine-tuning data: official LIBERO dataset (no additional mixing)
- Evaluation: "standard LIBERO protocol" as described in Sec. 4.1, 20 episodes per task × tasks per suite, single-env rollout (a sketch of my evaluation loop follows this list)
- Hardware: NVIDIA RTX 5090 (the paper uses an A100, which shouldn't affect success rate, but I note it in case of any precision-related differences)
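
For transparency, here is a minimal sketch of the evaluation loop I am using, in case the gap comes from my harness rather than the checkpoint. `policy.select_action` and `make_env` are placeholder interfaces (not names from QuantVLA, OpenPI, or LeRobot), and the 20-episode / fixed-seed choices are my assumptions:

```python
import numpy as np

EPISODES_PER_TASK = 20                    # my reading of the "standard protocol"
SEEDS = list(range(EPISODES_PER_TASK))    # fixed per-episode seeds; actual list unknown

def evaluate_suite(policy, tasks, make_env, max_steps=520):
    """Sequential single-env rollouts; returns mean success over all episodes.

    `policy.select_action(obs)` and `make_env(task, seed)` are placeholder
    interfaces standing in for my wrappers.
    """
    successes = []
    for task in tasks:
        for seed in SEEDS:
            env = make_env(task, seed=seed)
            obs = env.reset()
            success = False
            for _ in range(max_steps):
                action = policy.select_action(obs)
                obs, _, done, info = env.step(action)
                if done:
                    success = bool(info.get("success", False))
                    break
            successes.append(float(success))
            env.close()
    return float(np.mean(successes))
```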
Specific Questions on Settings Not Fully Specified in the Paper
A few details are only partially covered in Sec. 4.1 and Appendix D, and I suspect the gap on `libero_10` stems from one of these:
- Denoising / action steps for π0.5. Table 4 reports 8 and 16 steps for GR00T N1.5, but the π0.5 setting in Table 2 is not explicitly stated. Could you confirm `n_action_steps` and the number of flow-matching denoising steps used for π0.5 on LIBERO?
- Action chunk execution. Was open-loop chunk execution used (e.g., executing all 10 predicted actions before re-querying the policy), or was a replanning cadence applied? Long-horizon performance is very sensitive to this; see the first sketch after this list.
- Evaluation episode count and seeds. Sec. 4.1 references the "standard LIBERO protocol" but does not specify episodes-per-task or seed list. A single fixed seed set would help the community get matching numbers.
- Image preprocessing / prompt template. Any differences in image resolution, view ordering (wrist vs. third-person), or the language instruction template compared to the OpenPI reference implementation?
- Proprioception normalization. The LeRobot π0.5 stack uses QUANTILES normalization with `q01`/`q99` stats. Was the same normalization used when reproducing the paper's numbers? The second sketch after this list shows my understanding of it.
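
To make the chunk-execution question concrete, here is a sketch of the two strategies I have in mind. `policy.predict_chunk` is a placeholder returning an `(n_action_steps, action_dim)` array, not an actual OpenPI/LeRobot method:

```python
def rollout_open_loop(env, obs, policy, horizon, n_action_steps=10):
    """Open-loop: execute every action in the predicted chunk before re-querying."""
    t = 0
    while t < horizon:
        chunk = policy.predict_chunk(obs)          # placeholder inference call
        for action in chunk[:n_action_steps]:
            obs, _, done, _ = env.step(action)
            t += 1
            if done or t >= horizon:
                return obs
    return obs

def rollout_with_replanning(env, obs, policy, horizon, replan_every=5):
    """Replanning: execute only a prefix of each chunk, then re-query the policy."""
    t = 0
    while t < horizon:
        chunk = policy.predict_chunk(obs)
        for action in chunk[:replan_every]:        # remaining actions are discarded
            obs, _, done, _ = env.step(action)
            t += 1
            if done or t >= horizon:
                return obs
    return obs
```

On long-horizon suites like `libero_10` the choice between these two can plausibly account for a gap of this size, which is why I would like to confirm which was used.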
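And here is the `q01`/`q99` normalization as I understand the LeRobot QUANTILES convention: map each state dimension's [q01, q99] range to roughly [−1, 1]. The exact formula in the authors' pipeline may differ, so this is an assumption to be confirmed rather than a statement of their implementation:

```python
import numpy as np

def quantile_normalize(state, q01, q99, eps=1e-8):
    """Map each dimension's [q01, q99] range to approximately [-1, 1]."""
    return 2.0 * (state - q01) / (q99 - q01 + eps) - 1.0

def quantile_unnormalize(norm_state, q01, q99):
    """Inverse map from [-1, 1] back to the original state range."""
    return (norm_state + 1.0) / 2.0 * (q99 - q01) + q01
```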
Request
Would it be possible to release:
- The LIBERO evaluation script corresponding to the numbers in Table 2 (including config files specifying denoising steps, action chunk length, episode count, and seeds),
- And, if the FP16 baseline involved any LIBERO-specific fine-tuning beyond the public `pi05_libero` checkpoint, the fine-tuning script / config as well.
A minimal `scripts/eval_libero_pi05.sh` or equivalent would be enormously helpful for the community to verify and build on this work, especially since QuantVLA is positioned as the first PTQ framework for VLA systems and reliable baseline reproduction is a prerequisite for meaningful follow-up comparisons.
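
To be concrete about what "config files" means here, this is a hypothetical sketch of the fields such a script would need to pin down; every default below is a guess on my part, not a value from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class LiberoEvalConfig:
    checkpoint: str = "lerobot/pi05_libero_finetuned"  # or the JAX one? (see above)
    num_denoising_steps: int = 10      # flow-matching steps; unspecified in Table 2
    n_action_steps: int = 10           # actions executed per chunk; unspecified
    replan_every: int | None = None    # None = fully open-loop execution
    episodes_per_task: int = 20        # my assumption for the standard protocol
    seeds: list[int] = field(default_factory=lambda: list(range(20)))
    image_resolution: int = 224        # guess based on OpenPI defaults
```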
Thanks again for the great work and for considering this request!
Alternatives
No response
Additional context
No response