
GRPOTrainer and bnb quantization configs are incompatible #4634

@JitheshPavan

Reproduction

When I add a quantization config to the policy model, I get the following error:

/usr/local/lib/python3.11/dist-packages/transformers/models/qwen3/modeling_qwen3.py in forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, cache_position, logits_to_keep, **kwargs)
    492         # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
    493         slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
--> 494         logits = self.lm_head(hidden_states[:, slice_indices, :])
    495 
    496         loss = None

/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1737             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1738         else:
-> 1739             return self._call_impl(*args, **kwargs)
   1740 
   1741     # torchrec tests the code consistency with the following code

/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1748                 or _global_backward_pre_hooks or _global_backward_hooks
   1749                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1750             return forward_call(*args, **kwargs)
   1751 
   1752         result = None

/usr/local/lib/python3.11/dist-packages/torch/nn/modules/linear.py in forward(self, input)
    123 
    124     def forward(self, input: Tensor) -> Tensor:
--> 125         return F.linear(input, self.weight, self.bias)
    126 
    127     def extra_repr(self) -> str:

RuntimeError: expected scalar type Half but found Float
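
My reading of the traceback (treat it as an assumption): the failure happens in torch/nn/modules/linear.py, i.e. a plain nn.Linear, so lm_head is not quantized here; its weights are half precision (the "Half" in the error), while the hidden states fed into it follow the float32 bnb_4bit_compute_dtype I set. A quick sketch to see the two dtypes side by side, loading the policy model exactly as in the reproduction below:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same loading code as in the reproduction below.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float32,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# lm_head stays a regular nn.Linear; its weight dtype does not match the
# float32 compute dtype used for the activations.
print(type(model.lm_head), model.lm_head.weight.dtype)  # torch.float16 on my setup
print(bnb_config.bnb_4bit_compute_dtype)                # torch.float32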

The Code:

from datasets import load_dataset
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
)
from trl import GRPOTrainer, GRPOConfig
from peft import LoraConfig


# ======================================================
# 1. Dataset
# ======================================================

ds = load_dataset("AIPlans/Helpsteer2-helpfulness-prompts", split="train")


# ======================================================
# 2. Reward model + reward function
# ======================================================

rm_model_name = "AIPlans/Qwen3-0.6B-RM-hs2"

tokenizer_reward = AutoTokenizer.from_pretrained(
    rm_model_name,
    trust_remote_code=True,
)

model_reward = AutoModelForSequenceClassification.from_pretrained(
    rm_model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

device_reward = "cuda" if torch.cuda.is_available() else "cpu"
model_reward = model_reward.to(device_reward)
model_reward.eval()


def reward_model_score(prompts: list[str], completions: list[str], **kwargs):
    if not prompts:
        return []

    conversations = [
        [
            {"role": "user", "content": p},
            {"role": "assistant", "content": c},
        ]
        for p, c in zip(prompts, completions)
    ]

    input_ids = tokenizer_reward.apply_chat_template(
        conversations,
        tokenize=True,
        add_generation_prompt=False,
        padding=True,
        truncation=True,
        return_tensors="pt",
    ).to(device_reward)

    with torch.no_grad():
        outputs = model_reward(input_ids=input_ids)
        logits = outputs.logits  # shape: [batch, num_labels]

    # Assuming label 0 is the scalar reward
    return logits[:, 0].detach().cpu().tolist()


# ======================================================
# 3. Main: policy model, PEFT, GRPO trainer
# ======================================================

policy_model_name = "Qwen/Qwen3-0.6B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float32,
)
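# The error in the traceback above only appears when this quantization config
# is passed to the policy model; without it the script runs.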

tokenizer = AutoTokenizer.from_pretrained(
    policy_model_name,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    policy_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

training_args = GRPOConfig(
    "QwenModel",
    report_to="none",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=6,
    gradient_checkpointing=True,
)

trainer = GRPOTrainer(
    model=model,
    args=training_args,
    reward_funcs=reward_model_score,
    train_dataset=ds,
    peft_config=peft_config,
)

trainer.train()

PS: I got the same error with PPOTrainer as well.
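
For what it's worth, a variation I have not verified: the dtype mismatch might be sidestepped by keeping the compute dtype and the non-quantized weights consistent, either by computing in float16 or by loading the non-quantized modules in float32 via torch_dtype. Sketch only, not a confirmed fix:

# Sketch of a variation, not a confirmed fix: compute in float16 so the
# activations match the half-precision non-quantized modules such as lm_head.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
# Alternatively, keep bnb_4bit_compute_dtype=torch.float32 and pass
# torch_dtype=torch.float32 to from_pretrained so the non-quantized modules
# are float32 as well.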

System Info

  • Platform: Linux-6.6.105+-x86_64-with-glibc2.35
  • Python version: 3.11.13
  • TRL version: 0.25.1
  • PyTorch version: 2.6.0+cu124
  • accelerator(s): Tesla P100-PCIE-16GB
  • Transformers version: 4.57.3
  • Accelerate version: 1.9.0
  • Accelerate config: not found
  • Datasets version: 4.4.1
  • HF Hub version: 0.36.0
  • bitsandbytes version: 0.48.2
  • DeepSpeed version: 0.18.2
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: 2.7.1
  • PEFT version: 0.16.0
  • vLLM version: not installed

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
  • Any traceback provided is complete
