Reproduction
When I add a quantization config to the policy model, I get the following error:
```
/usr/local/lib/python3.11/dist-packages/transformers/models/qwen3/modeling_qwen3.py in forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, cache_position, logits_to_keep, **kwargs)
492 # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
493 slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
--> 494 logits = self.lm_head(hidden_states[:, slice_indices, :])
495
496 loss = None
/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
1737 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1738 else:
-> 1739 return self._call_impl(*args, **kwargs)
1740
1741 # torchrec tests the code consistency with the following code
/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1748 or _global_backward_pre_hooks or _global_backward_hooks
1749 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1750 return forward_call(*args, **kwargs)
1751
1752 result = None
/usr/local/lib/python3.11/dist-packages/torch/nn/modules/linear.py in forward(self, input)
123
124 def forward(self, input: Tensor) -> Tensor:
--> 125 return F.linear(input, self.weight, self.bias)
126
127 def extra_repr(self) -> str:
RuntimeError: expected scalar type Half but found Float
```
The Code:
```python
from datasets import load_dataset
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
)
from trl import GRPOTrainer, GRPOConfig
from peft import LoraConfig

# ======================================================
# 1. Dataset
# ======================================================
ds = load_dataset("AIPlans/Helpsteer2-helpfulness-prompts", split="train")

# ======================================================
# 2. Reward model + reward function
# ======================================================
rm_model_name = "AIPlans/Qwen3-0.6B-RM-hs2"

tokenizer_reward = AutoTokenizer.from_pretrained(
    rm_model_name,
    trust_remote_code=True,
)
model_reward = AutoModelForSequenceClassification.from_pretrained(
    rm_model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
)
device_reward = "cuda" if torch.cuda.is_available() else "cpu"
model_reward = model_reward.to(device_reward)
model_reward.eval()


def reward_model_score(prompts: list[str], completions: list[str], **kwargs):
    if not prompts:
        return []
    conversations = [
        [
            {"role": "user", "content": p},
            {"role": "assistant", "content": c},
        ]
        for p, c in zip(prompts, completions)
    ]
    input_ids = tokenizer_reward.apply_chat_template(
        conversations,
        tokenize=True,
        add_generation_prompt=False,
        padding=True,
        truncation=True,
        return_tensors="pt",
    ).to(device_reward)
    with torch.no_grad():
        outputs = model_reward(input_ids=input_ids)
    logits = outputs.logits  # shape: [batch, num_labels]
    # Assuming label 0 is the scalar reward
    return logits[:, 0].detach().cpu().tolist()
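
# (Added sketch, not part of the original repro: a quick standalone check of the
#  reward function on a toy prompt/completion pair; assumes the RM exposes its
#  scalar reward at logit index 0. Uncomment to run.)
# print(reward_model_score(["What is 2+2?"], ["2 + 2 = 4."]))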

# ======================================================
# 3. Main: policy model, PEFT, GRPO trainer
# ======================================================
policy_model_name = "Qwen/Qwen3-0.6B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float32,
)
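# (Added comment, not in the original code: this fp32 compute dtype is what I
#  suspect clashes with the fp16 lm_head weights; see the sketch after the code.)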
tokenizer = AutoTokenizer.from_pretrained(
    policy_model_name,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    policy_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
training_args = GRPOConfig(
    "QwenModel",
    report_to="none",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=6,
    gradient_checkpointing=True,
)
trainer = GRPOTrainer(
    model=model,
    args=training_args,
    reward_funcs=reward_model_score,
    train_dataset=ds,
    peft_config=peft_config,
)
trainer.train()
```

PS: I got the same error when working with PPOTrainer as well.
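
For what it's worth, my guess (unverified) is that the fp32 activations coming out of the 4-bit layers (`bnb_4bit_compute_dtype=torch.float32`) reach the fp16 `lm_head`, which would explain the Half vs Float mismatch in `F.linear`. Below is a minimal sketch of the policy-model setup I would expect to keep the dtypes consistent; the dtype choices are assumptions on my part, not a confirmed fix:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch only (assumption): keep the 4-bit compute dtype and the non-quantized
# modules (e.g. lm_head) in the same half precision, so F.linear sees matching dtypes.
bnb_config_fp16 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # was torch.float32 in the failing run
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    quantization_config=bnb_config_fp16,
    torch_dtype=torch.float16,  # load non-quantized weights (lm_head, norms) in fp16 too
    device_map="auto",
    trust_remote_code=True,
)
```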
System Info
- Platform: Linux-6.6.105+-x86_64-with-glibc2.35
- Python version: 3.11.13
- TRL version: 0.25.1
- PyTorch version: 2.6.0+cu124
- accelerator(s): Tesla P100-PCIE-16GB
- Transformers version: 4.57.3
- Accelerate version: 1.9.0
- Accelerate config: not found
- Datasets version: 4.4.1
- HF Hub version: 0.36.0
- bitsandbytes version: 0.48.2
- DeepSpeed version: 0.18.2
- Liger-Kernel version: not installed
- LLM-Blender version: not installed
- OpenAI version: 2.7.1
- PEFT version: 0.16.0
- vLLM version: not installed
Checklist
- I have checked that my issue isn't already filed (see open issues)
- I have included my system information
- Any code provided is minimal, complete, and reproducible (more on MREs)
- Any code provided is properly formatted in code blocks (no screenshot, more on code blocks)
- Any traceback provided is complete