
Conversation

@kevinsala (Contributor) commented Aug 31, 2025

It seems that cuLaunchKernel expects the kernel arguments size (CU_LAUNCH_PARAM_BUFFER_SIZE) to exclude tail padding. For example, for a kernel with arguments int *A, short B, the function expects a size of 10 bytes. However, we are currently passing sizeof(struct { int *A; short B; }), which results in 16 bytes.

This commit exposes both sizes through KernelLaunchParamsTy so the plugins can decide which one to use. It fixes the offload/test/offloading/CUDA/basic_launch_multi_arg.cu test on NVIDIA GPUs, which was failing with the error too many resources requested for launch.
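For illustration only (not part of the patch): a minimal host-side sketch, assuming a 64-bit target where pointers are 8 bytes and short is 2 bytes, of the two sizes in question.

#include <cstddef>
#include <cstdio>

// Hypothetical host-side mirror of the kernel parameter list: int *A, short B.
struct KernelArgs {
  int *A;  // offset 0, 8 bytes
  short B; // offset 8, 2 bytes
};

int main() {
  size_t Padded = sizeof(KernelArgs);                        // 16: rounded up to alignof(int *)
  size_t Unpadded = offsetof(KernelArgs, B) + sizeof(short); // 10: last field offset + size
  printf("padded=%zu unpadded=%zu\n", Padded, Unpadded);
  return 0;
}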

@llvmbot added the clang, clang:codegen, and offload labels on Aug 31, 2025
@llvmbot (Member) commented Aug 31, 2025

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-offload
@llvm/pr-subscribers-clang-codegen

@llvm/pr-subscribers-clang

Author: Kevin Sala Penades (kevinsala)

Changes

It seems that cuLaunchKernel expects the kernel arguments size (CU_LAUNCH_PARAM_BUFFER_SIZE) to exclude tail padding. For example, for a kernel with arguments int *A, short B, the function expects a size of 10 bytes. However, we are currently passing sizeof(struct { int *A; short B; }), which results in 16 bytes.

This commit exposes both sizes through KernelLaunchParamsTy so the plugins can decide which one to use. It fixes the offload/test/offloading/CUDA/basic_launch_multi_arg.cu test on NVIDIA GPUs, which was failing with the error too many resources requested for launch.


Full diff: https://github.com/llvm/llvm-project/pull/156229.diff

5 Files Affected:

  • (modified) clang/lib/CodeGen/CGCUDANV.cpp (+19-5)
  • (modified) offload/include/Shared/APITypes.h (+2)
  • (modified) offload/plugins-nextgen/common/src/PluginInterface.cpp (+3-1)
  • (modified) offload/plugins-nextgen/cuda/src/rtl.cpp (+1-1)
  • (modified) offload/test/offloading/CUDA/basic_launch_multi_arg.cu (+8)
diff --git a/clang/lib/CodeGen/CGCUDANV.cpp b/clang/lib/CodeGen/CGCUDANV.cpp
index 5090a0559eab2..1f3492d57c6a1 100644
--- a/clang/lib/CodeGen/CGCUDANV.cpp
+++ b/clang/lib/CodeGen/CGCUDANV.cpp
@@ -327,9 +327,10 @@ void CGNVCUDARuntime::emitDeviceStub(CodeGenFunction &CGF,
 /// (void*, short, void*) is passed as {void **, short *, void **} to the launch
 /// function. For the LLVM/offload launch we flatten the arguments into the
 /// struct directly. In addition, we include the size of the arguments, thus
-/// pass {sizeof({void *, short, void *}), ptr to {void *, short, void *},
-/// nullptr}. The last nullptr needs to be initialized to an array of pointers
-/// pointing to the arguments if we want to offload to the host.
+/// pass {size of ({void *, short, void *}) without tail padding, ptr to {void
+/// *, short, void *}, nullptr}. The last nullptr needs to be initialized to an
+/// array of pointers pointing to the arguments if we want to offload to the
+/// host.
 Address CGNVCUDARuntime::prepareKernelArgsLLVMOffload(CodeGenFunction &CGF,
                                                       FunctionArgList &Args) {
   SmallVector<llvm::Type *> ArgTypes, KernelLaunchParamsTypes;
@@ -339,6 +340,7 @@ Address CGNVCUDARuntime::prepareKernelArgsLLVMOffload(CodeGenFunction &CGF,
 
   auto *Int64Ty = CGF.Builder.getInt64Ty();
   KernelLaunchParamsTypes.push_back(Int64Ty);
+  KernelLaunchParamsTypes.push_back(Int64Ty);
   KernelLaunchParamsTypes.push_back(PtrTy);
   KernelLaunchParamsTypes.push_back(PtrTy);
 
@@ -351,12 +353,24 @@ Address CGNVCUDARuntime::prepareKernelArgsLLVMOffload(CodeGenFunction &CGF,
       "kernel_launch_params");
 
   auto KernelArgsSize = CGM.getDataLayout().getTypeAllocSize(KernelArgsTy);
+
+  // Avoid accounting the tail padding for CUDA.
+  auto KernelArgsSizeNoTailPadding = llvm::TypeSize::getZero();
+  if (auto N = KernelArgsTy->getNumElements()) {
+    auto *SL = CGM.getDataLayout().getStructLayout(KernelArgsTy);
+    KernelArgsSizeNoTailPadding = SL->getElementOffset(N - 1);
+    KernelArgsSizeNoTailPadding += CGM.getDataLayout().getTypeAllocSize(
+        KernelArgsTy->getElementType(N - 1));
+  }
+
   CGF.Builder.CreateStore(llvm::ConstantInt::get(Int64Ty, KernelArgsSize),
                           CGF.Builder.CreateStructGEP(KernelLaunchParams, 0));
-  CGF.Builder.CreateStore(KernelArgs.emitRawPointer(CGF),
+  CGF.Builder.CreateStore(llvm::ConstantInt::get(Int64Ty, KernelArgsSizeNoTailPadding),
                           CGF.Builder.CreateStructGEP(KernelLaunchParams, 1));
-  CGF.Builder.CreateStore(llvm::Constant::getNullValue(PtrTy),
+  CGF.Builder.CreateStore(KernelArgs.emitRawPointer(CGF),
                           CGF.Builder.CreateStructGEP(KernelLaunchParams, 2));
+  CGF.Builder.CreateStore(llvm::Constant::getNullValue(PtrTy),
+                          CGF.Builder.CreateStructGEP(KernelLaunchParams, 3));
 
   for (unsigned i = 0; i < Args.size(); ++i) {
     auto *ArgVal = CGF.Builder.CreateLoad(CGF.GetAddrOfLocalVar(Args[i]));
diff --git a/offload/include/Shared/APITypes.h b/offload/include/Shared/APITypes.h
index 8c150b6bfc2d4..52725a0474c6a 100644
--- a/offload/include/Shared/APITypes.h
+++ b/offload/include/Shared/APITypes.h
@@ -121,6 +121,8 @@ static_assert(sizeof(KernelArgsTy) ==
 struct KernelLaunchParamsTy {
   /// Size of the Data array.
   size_t Size = 0;
+  /// Size of the Data array without tail padding.
+  size_t SizeNoTailPadding = 0;
   /// Flat array of kernel parameters.
   void *Data = nullptr;
   /// Ptrs to the Data entries. Only strictly required for the host plugin.
diff --git a/offload/plugins-nextgen/common/src/PluginInterface.cpp b/offload/plugins-nextgen/common/src/PluginInterface.cpp
index d4b5f914c6672..238f6dccc6640 100644
--- a/offload/plugins-nextgen/common/src/PluginInterface.cpp
+++ b/offload/plugins-nextgen/common/src/PluginInterface.cpp
@@ -627,7 +627,9 @@ KernelLaunchParamsTy GenericKernelTy::prepareArgs(
         (void *)((intptr_t)ArgPtrs[I - KLEOffset] + ArgOffsets[I - KLEOffset]);
     Ptrs[I] = &Args[I];
   }
-  return KernelLaunchParamsTy{sizeof(void *) * NumArgs, &Args[0], &Ptrs[0]};
+
+  size_t ArgsSize = sizeof(void *) * NumArgs;
+  return KernelLaunchParamsTy{ArgsSize, ArgsSize, &Args[0], &Ptrs[0]};
 }
 
 uint32_t GenericKernelTy::getNumThreads(GenericDeviceTy &GenericDevice,
diff --git a/offload/plugins-nextgen/cuda/src/rtl.cpp b/offload/plugins-nextgen/cuda/src/rtl.cpp
index c7984287f7533..ddb21f1678a6a 100644
--- a/offload/plugins-nextgen/cuda/src/rtl.cpp
+++ b/offload/plugins-nextgen/cuda/src/rtl.cpp
@@ -1430,7 +1430,7 @@ Error CUDAKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
 
   void *Config[] = {CU_LAUNCH_PARAM_BUFFER_POINTER, LaunchParams.Data,
                     CU_LAUNCH_PARAM_BUFFER_SIZE,
-                    reinterpret_cast<void *>(&LaunchParams.Size),
+                    reinterpret_cast<void *>(&LaunchParams.SizeNoTailPadding),
                     CU_LAUNCH_PARAM_END};
 
   // If we are running an RPC server we want to wake up the server thread
diff --git a/offload/test/offloading/CUDA/basic_launch_multi_arg.cu b/offload/test/offloading/CUDA/basic_launch_multi_arg.cu
index 1f84a0e1288d4..ab6f753150932 100644
--- a/offload/test/offloading/CUDA/basic_launch_multi_arg.cu
+++ b/offload/test/offloading/CUDA/basic_launch_multi_arg.cu
@@ -23,6 +23,10 @@ __global__ void square(int *Dst, short Q, int *Src, short P) {
   Src[1] = P;
 }
 
+__global__ void accumulate(short Q, int *Dst, char P) {
+  *Dst += Q + P;
+}
+
 int main(int argc, char **argv) {
   int DevNo = 0;
   int *Ptr = reinterpret_cast<int *>(llvm_omp_target_alloc_shared(4, DevNo));
@@ -39,5 +43,9 @@ int main(int argc, char **argv) {
   // CHECK: Ptr [[Ptr]], *Ptr: 42
   printf("Src: %i : %i\n", Src[0], Src[1]);
   // CHECK: Src: 3 : 4
+  accumulate<<<1, 1>>>(3, Ptr, 7);
+  printf("Ptr %p, *Ptr: %i\n", Ptr, *Ptr);
+  // CHECK: Ptr [[Ptr]], *Ptr: 52
   llvm_omp_target_free_shared(Ptr, DevNo);
+  llvm_omp_target_free_shared(Src, DevNo);
 }

@kevinsala kevinsala requested review from jhuber6 and shiltian August 31, 2025 07:49
@github-actions (bot) commented Aug 31, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

auto *SL = CGM.getDataLayout().getStructLayout(KernelArgsTy);
KernelArgsSizeNoTailPadding = SL->getElementOffset(N - 1);
KernelArgsSizeNoTailPadding += CGM.getDataLayout().getTypeAllocSize(
KernelArgsTy->getElementType(N - 1));

@kevinsala (Contributor, Author) commented:

If there is a cleaner way of getting this information, please let me know.

@shiltian shiltian requested a review from Artem-B August 31, 2025 09:32
@Artem-B (Member) commented Sep 2, 2025

It looks like we may be fixing a consequence of an issue somewhere else in the offloading code. I'm not familiar with the details, so I'm speculating based on general principles.

I suspect that shipping the size without trailing padding may not be sufficient. E.g., what if we pass void*, short, void*, short? Do we handle the internal padding correctly? If we do, what makes the trailing element's padding special? We should have handled it the same way as the padding for the inner short field. If we can't handle inner padding, then avoiding the trailing padding will only fix the problem for trailing fields with padding, but the general issue will remain.

What am I missing?

@kevinsala (Contributor, Author) commented Sep 3, 2025

@Artem-B My understanding is that the CUDA Driver API requires the arguments in the buffer to be placed with the proper alignment (i.e., with padding between fields). However, the trailing padding after the last element must not be included in the reported size; if it is, the cuLaunchKernel call fails.

The documentation of cuLaunchKernel says:

Kernel parameters can also be packaged by the application into a single buffer that is passed in via the extra parameter. This places the burden on the application of knowing each kernel parameter's size and alignment/padding within the buffer.

and

CU_LAUNCH_PARAM_BUFFER_SIZE, which specifies that the next value in extra will be a pointer to a size_t containing the size of the buffer specified with CU_LAUNCH_PARAM_BUFFER_POINTER.

The following code is a simple reproducer that works directly on top of the CUDA Driver API:

kernel.cu:

extern "C" __global__ void kernel(int *arg1, short arg2, int *arg3, short arg4) {
  *arg1 = arg2;
  *arg3 = arg4;
}

main.cu:

#include <cstdio>
#include <cuda.h>

#define CU_CHECK(err) \
  do { \
    CUresult err__ = (err); \
    if (err__ != CUDA_SUCCESS) { \
      const char *errStr; \
      cuGetErrorString(err__, &errStr); \
      fprintf(stderr, "Error: %s\n", errStr ? errStr : "Unknown"); \
      exit(1); \
    } \
  } while (0)


int main(int argc, char **argv) {
  CU_CHECK(cuInit(0));

  CUdevice device;
  CU_CHECK(cuDeviceGet(&device, 0));

  CUcontext context;
  CU_CHECK(cuCtxCreate(&context, 0, device));

  CUmodule module;
  CU_CHECK(cuModuleLoad(&module, "kernel.cubin"));

  CUfunction kernel;
  CU_CHECK(cuModuleGetFunction(&kernel, module, "kernel"));

  CUdeviceptr d_arg1, d_arg3;
  CU_CHECK(cuMemAlloc(&d_arg1, sizeof(int)));
  CU_CHECK(cuMemAlloc(&d_arg3, sizeof(int)));

  short arg2 = 2, arg4 = 4;

  struct Args {
    CUdeviceptr arg1;
    short arg2;
    CUdeviceptr arg3;
    short arg4;
  };

  Args args = { d_arg1, arg2, d_arg3, arg4 };
  size_t size = 8 + 8 + 8 + 2; // 26 bytes: last field at offset 24 + sizeof(short), no tail padding -- OK
  // size_t size = sizeof(Args); // 32 bytes, includes tail padding -- ERROR

  void *config[] = { CU_LAUNCH_PARAM_BUFFER_POINTER, &args,
                     CU_LAUNCH_PARAM_BUFFER_SIZE,
                     reinterpret_cast<void *>(&size),
                     CU_LAUNCH_PARAM_END };

  CU_CHECK(cuLaunchKernel(
    kernel, 1, 1, 1, 1, 1, 1,
    0, 0, nullptr, config
  ));

  CU_CHECK(cuCtxSynchronize());

  int h_out = 0;
  CU_CHECK(cuMemcpyDtoH(&h_out, d_arg1, sizeof(int)));
  printf("Result from kernel: %d\n", h_out);

  CU_CHECK(cuMemFree(d_arg1));
  CU_CHECK(cuMemFree(d_arg3));
  CU_CHECK(cuModuleUnload(module));
  CU_CHECK(cuCtxDestroy(context));

  return 0;
}

Commands to build the reproducer:

nvcc -arch=sm_90 --cubin kernel.cu -o kernel.cubin
nvcc -arch=sm_90 main.cu -o main -lcuda
./main

The code works as is, passing the size that skips the trailing padding. If it is replaced by sizeof(Args), the CUDA call fails.
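As a small variation on the reproducer above (with <cstddef> included for offsetof), the 26-byte value can also be derived from the struct layout instead of being added by hand:

  // Same 26 bytes as "8 + 8 + 8 + 2": offset of the last field plus its size,
  // i.e. sizeof(Args) minus the tail padding.
  size_t size = offsetof(Args, arg4) + sizeof(args.arg4);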

@Artem-B (Member) left a comment

I see. So offloading uses cuLaunchKernel, while LLVM-generated launch glue uses cudaLaunchKernel, which does not have this problem.

The offloading code grabs the launch data that was intended to be used for the CUDA runtime API, tries to use it for the CUDA driver API, and now we've discovered that they do not quite work the same way.

Fine. Passing unpadded args size for offloading glue looks like a necessary quirk.

LGTM, but please wait for @jhuber6 to chime in.

@jhuber6 (Contributor) left a comment

I'm not going to block this, but I think being divergent with CUDA code generation here is unwise in the long term. We should let both lower to the same thing but change the runtime calls; the runtime call then goes to cuLaunchKernel, which expects a struct + size.

@kevinsala (Contributor, Author):

I believe I can use the same size (without the trailing padding) for the amdgpu plugin. However, I have to remove the whole check below:

ArgsSize > LaunchParams.Size + getImplicitArgsSize())

The check will be invalid because it won't account for the padding in between. Let me know what you think.

@kevinsala (Contributor, Author):

@jhuber6 @Artem-B does the new approach look better? We now pass the same size for both nvidia and amdgpu kernels.

@@ -3655,11 +3655,6 @@ Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
KernelArgsTy &KernelArgs,
KernelLaunchParamsTy LaunchParams,
AsyncInfoWrapperTy &AsyncInfoWrapper) const {
if (ArgsSize != LaunchParams.Size &&

Contributor:


This is concerning; the argument size here is directly returned by the runtime. If this is false, then I'd assume the ABI is broken, unless I'm missing something.


@kevinsala (Contributor, Author):


The first condition will be true if there are implicit kernel arguments. The second condition can't be checked anymore because LaunchParams.Size does not account for the tail padding of the user arguments. Thus, ArgsSize can actually be larger than LaunchParams.Size + getImplicitArgsSize() because the padding between these last two is not accounted for.


@kevinsala (Contributor, Author):


I could compute the padded size (rounded up to pointer alignment, which is the alignment of the implicit args) and keep the check, but I'm not sure it's worth it.
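A hedged sketch of what that could look like (hypothetical helper, not in the patch), rounding the unpadded size up to pointer alignment, which is the alignment of the implicit arguments:

#include <cstddef>

// Hypothetical: round an unpadded argument-buffer size up to pointer
// alignment so an upper-bound check against the runtime-reported size
// (user args + implicit args) could be kept.
static size_t roundUpToPtrAlign(size_t Size) {
  const size_t Align = alignof(void *);
  return (Size + Align - 1) & ~(Align - 1);
}

// e.g. roundUpToPtrAlign(26) == 32 on a 64-bit target.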


Contributor:


The implicit arguments are placed directly after what HSA reports as the argument size; does this change that?


@kevinsala (Contributor, Author):


It shouldn't change that. There is the ImplArgsOffset computed later, which rounds up the size to ImplicitArgsTy's alignment.

@kevinsala force-pushed the fix-cuda-offload-kernel-params branch from 830a4ed to 08dfe94 on September 26, 2025 01:18
@kevinsala (Contributor, Author):

This fixes one of the disabled tests listed in #161265


-  auto KernelArgsSize = CGM.getDataLayout().getTypeAllocSize(KernelArgsTy);
+  // Avoid accounting the tail padding for the kernel arguments.
+  auto KernelArgsSize = llvm::TypeSize::getZero();

Contributor:


I'm still a little iffy on this. What's the reason we can't just mimic CUDA's handling? IIUC this is the old version where we passed these things through the libomptarget interface. Wasn't there some effort to make a different one on top of libcudart?


@kevinsala (Contributor, Author):


What do you mean by mimicking "CUDA's handling"?


Contributor:


I don't understand why this "Offload on LLVM" thing is using the OpenMP API in the first place.


@kevinsala (Contributor, Author):


Probably the title of the PR should be more generic. I don't think this issue only affects LLVM offloading; it could potentially affect any kernel in OpenMP offloading with parameters that have different alignments.


@jhuber6 (Contributor) commented Oct 3, 2025


That never occurs in OpenMP because we pad everything to u64, but I get what you mean. I need to look deeper into how argument passing works, and hopefully make some improvements in offload and pull some more things into OpenMP, so we can make the argument handling generic and leave the OpenMP-specific assumptions in libomptarget.
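As a side note, a minimal sketch (hypothetical values) of why the OpenMP path never hits this: every argument is a pointer-sized slot, so the packed buffer size is always a multiple of 8.

#include <cstddef>
#include <cstdio>

int main() {
  // Mirrors the prepareArgs logic in the diff: one void *-sized slot per
  // argument, hence no internal or tail padding in the packed buffer.
  const size_t NumArgs = 3; // hypothetical
  size_t ArgsSize = sizeof(void *) * NumArgs;
  printf("ArgsSize = %zu\n", ArgsSize); // 24 on a 64-bit target
  return 0;
}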

@kevinsala changed the title from "[clang][CUDA] Avoid accounting for tail padding in LLVM offloading" to "[clang] Avoid accounting for tail padding in kernel arguments" on Oct 6, 2025
@kevinsala (Contributor, Author):

Fixed by #172249 from the CUDA plugin.

@kevinsala kevinsala closed this Dec 15, 2025