
GRPOTrainer and bnb quantization configs are incompatible #4634

@JitheshPavan

Reproduction

When I add a quantization config to the policy model, I get the following error:

/usr/local/lib/python3.11/dist-packages/transformers/models/qwen3/modeling_qwen3.py in forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, cache_position, logits_to_keep, **kwargs)
    492         # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
    493         slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
--> 494         logits = self.lm_head(hidden_states[:, slice_indices, :])
    495 
    496         loss = None

/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1737             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1738         else:
-> 1739             return self._call_impl(*args, **kwargs)
   1740 
   1741     # torchrec tests the code consistency with the following code

/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1748                 or _global_backward_pre_hooks or _global_backward_hooks
   1749                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1750             return forward_call(*args, **kwargs)
   1751 
   1752         result = None

/usr/local/lib/python3.11/dist-packages/torch/nn/modules/linear.py in forward(self, input)
    123 
    124     def forward(self, input: Tensor) -> Tensor:
--> 125         return F.linear(input, self.weight, self.bias)
    126 
    127     def extra_repr(self) -> str:

RuntimeError: expected scalar type Half but found Float
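
My reading of the traceback (treat it as an assumption): the failure happens in torch/nn/modules/linear.py, i.e. a plain nn.Linear, so lm_head is not quantized here; its weights are half precision (the "Half" in the error), while the hidden states fed into it follow the float32 bnb_4bit_compute_dtype I set. A quick sketch to see the two dtypes side by side, loading the policy model exactly as in the reproduction below:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same loading code as in the reproduction below.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float32,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# lm_head stays a regular nn.Linear; its weight dtype does not match the
# float32 compute dtype used for the activations.
print(type(model.lm_head), model.lm_head.weight.dtype)  # torch.float16 on my setup
print(bnb_config.bnb_4bit_compute_dtype)                # torch.float32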

The Code:

from datasets import load_dataset
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
)
from trl import GRPOTrainer, GRPOConfig
from peft import LoraConfig


# ======================================================
# 1. Dataset
# ======================================================

ds = load_dataset("AIPlans/Helpsteer2-helpfulness-prompts", split="train")


# ======================================================
# 2. Reward model + reward function
# ======================================================

rm_model_name = "AIPlans/Qwen3-0.6B-RM-hs2"

tokenizer_reward = AutoTokenizer.from_pretrained(
    rm_model_name,
    trust_remote_code=True,
)

model_reward = AutoModelForSequenceClassification.from_pretrained(
    rm_model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

device_reward = "cuda" if torch.cuda.is_available() else "cpu"
model_reward = model_reward.to(device_reward)
model_reward.eval()


def reward_model_score(prompts: list[str], completions: list[str], **kwargs):
    if not prompts:
        return []

    conversations = [
        [
            {"role": "user", "content": p},
            {"role": "assistant", "content": c},
        ]
        for p, c in zip(prompts, completions)
    ]

    input_ids = tokenizer_reward.apply_chat_template(
        conversations,
        tokenize=True,
        add_generation_prompt=False,
        padding=True,
        truncation=True,
        return_tensors="pt",
    ).to(device_reward)

    with torch.no_grad():
        outputs = model_reward(input_ids=input_ids)
        logits = outputs.logits  # shape: [batch, num_labels]

    # Assuming label 0 is the scalar reward
    return logits[:, 0].detach().cpu().tolist()


# ======================================================
# 3. Main: policy model, PEFT, GRPO trainer
# ======================================================

policy_model_name = "Qwen/Qwen3-0.6B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float32,
)
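# The error in the traceback above only appears when this quantization config
# is passed to the policy model; without it the script runs.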

tokenizer = AutoTokenizer.from_pretrained(
    policy_model_name,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    policy_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

training_args = GRPOConfig(
    "QwenModel",
    report_to="none",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=6,
    gradient_checkpointing=True,
)

trainer = GRPOTrainer(
    model=model,
    args=training_args,
    reward_funcs=reward_model_score,
    train_dataset=ds,
    peft_config=peft_config,
)

trainer.train()

PS: I got the same error with PPOTrainer as well.
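
For what it's worth, a variation I have not verified: the dtype mismatch might be sidestepped by keeping the compute dtype and the non-quantized weights consistent, either by computing in float16 or by loading the non-quantized modules in float32 via torch_dtype. Sketch only, not a confirmed fix:

# Sketch of a variation, not a confirmed fix: compute in float16 so the
# activations match the half-precision non-quantized modules such as lm_head.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
# Alternatively, keep bnb_4bit_compute_dtype=torch.float32 and pass
# torch_dtype=torch.float32 to from_pretrained so the non-quantized modules
# are float32 as well.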

System Info

  • Platform: Linux-6.6.105+-x86_64-with-glibc2.35
  • Python version: 3.11.13
  • TRL version: 0.25.1
  • PyTorch version: 2.6.0+cu124
  • accelerator(s): Tesla P100-PCIE-16GB
  • Transformers version: 4.57.3
  • Accelerate version: 1.9.0
  • Accelerate config: not found
  • Datasets version: 4.4.1
  • HF Hub version: 0.36.0
  • bitsandbytes version: 0.48.2
  • DeepSpeed version: 0.18.2
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: 2.7.1
  • PEFT version: 0.16.0
  • vLLM version: not installed

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
  • Any traceback provided is complete
