CUDA Sampler

### 🚀 The feature, motivation and pitch

Run the sampler on CUDA (argmax for greedy decoding, softmax-based sampling for temperature > 0) so that the LLM workflow does not have to memcpy the logits to the CPU before sampling.
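For reference, the sampling behavior requested here can be sketched on the CPU as below. This is a minimal illustration in plain Python, not CUDA code and not an existing API: the function name `sample_token`, its parameters, and the choice to treat temperature 0 as argmax are assumptions made for the example.

```python
import math
import random


def sample_token(logits, temperature=0.0, rng=None):
    """Pick a token index from a list of logits (hypothetical helper).

    temperature == 0 -> argmax (greedy decoding)
    temperature > 0  -> sample from softmax(logits / temperature)
    """
    if not logits:
        raise ValueError("logits must be non-empty")
    if temperature < 0:
        raise ValueError("temperature must be >= 0")

    if temperature == 0:
        # Greedy path: index of the largest logit (first one on ties).
        return max(range(len(logits)), key=logits.__getitem__)

    rng = rng or random.Random()

    # Scale by temperature, then subtract the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)

    # Inverse-CDF sampling over the unnormalized softmax weights.
    threshold = rng.random() * total
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return index
    return len(weights) - 1  # guard against floating-point round-off


if __name__ == "__main__":
    logits = [0.1, 2.5, -1.0]
    print(sample_token(logits))                                   # greedy
    print(sample_token(logits, temperature=0.8, rng=random.Random(0)))
```

A CUDA implementation could be checked against a reference like this one: the greedy path should match exactly, while the temperature path would only match in distribution unless both sides share the same random number stream.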
