
Conversation

@kevinsala (Contributor) commented Aug 31, 2025

It seems that cuLaunchKernel expects the kernel arguments size (CU_LAUNCH_PARAM_BUFFER_SIZE) to exclude tail padding. For example, for a kernel with arguments int *A, short B, the function expects a size of 10 bytes. However, we are currently passing sizeof(struct { int *A; short B; }), which results in 16 bytes.

This commit exposes both sizes through KernelLaunchParamsTy so the plugins can decide which one to use. It fixes the offload/test/offloading/CUDA/basic_launch_multi_arg.cu test on NVIDIA GPUs, which was failing with the error too many resources requested for launch.
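For illustration only (not part of the patch): a minimal host-side sketch, assuming a 64-bit target where pointers are 8 bytes and short is 2 bytes, of the two sizes in question.

#include <cstddef>
#include <cstdio>

// Hypothetical host-side mirror of the kernel parameter list: int *A, short B.
struct KernelArgs {
  int *A;  // offset 0, 8 bytes
  short B; // offset 8, 2 bytes
};

int main() {
  size_t Padded = sizeof(KernelArgs);                        // 16: rounded up to alignof(int *)
  size_t Unpadded = offsetof(KernelArgs, B) + sizeof(short); // 10: last field offset + size
  printf("padded=%zu unpadded=%zu\n", Padded, Unpadded);
  return 0;
}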

@llvmbot added the clang, clang:codegen, and offload labels on Aug 31, 2025
@llvmbot (Member) commented Aug 31, 2025

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-offload
@llvm/pr-subscribers-clang-codegen

@llvm/pr-subscribers-clang

Author: Kevin Sala Penades (kevinsala)

Changes

It seems that cuLaunchKernel expects the kernel arguments size (CU_LAUNCH_PARAM_BUFFER_SIZE) to exclude tail padding. For example, for a kernel with arguments int *A, short B, the function expects a size of 10 bytes. However, we are currently passing sizeof(struct { int *A; short B; }), which results in 16 bytes.

This commit exposes both sizes through KernelLaunchParamsTy so the plugins can decide which one to use. It fixes the offload/test/offloading/CUDA/basic_launch_multi_arg.cu test on NVIDIA GPUs, which was failing with the error too many resources requested for launch.


Full diff: https://github.com/llvm/llvm-project/pull/156229.diff

5 Files Affected:

  • (modified) clang/lib/CodeGen/CGCUDANV.cpp (+19-5)
  • (modified) offload/include/Shared/APITypes.h (+2)
  • (modified) offload/plugins-nextgen/common/src/PluginInterface.cpp (+3-1)
  • (modified) offload/plugins-nextgen/cuda/src/rtl.cpp (+1-1)
  • (modified) offload/test/offloading/CUDA/basic_launch_multi_arg.cu (+8)
diff --git a/clang/lib/CodeGen/CGCUDANV.cpp b/clang/lib/CodeGen/CGCUDANV.cpp
index 5090a0559eab2..1f3492d57c6a1 100644
--- a/clang/lib/CodeGen/CGCUDANV.cpp
+++ b/clang/lib/CodeGen/CGCUDANV.cpp
@@ -327,9 +327,10 @@ void CGNVCUDARuntime::emitDeviceStub(CodeGenFunction &CGF,
 /// (void*, short, void*) is passed as {void **, short *, void **} to the launch
 /// function. For the LLVM/offload launch we flatten the arguments into the
 /// struct directly. In addition, we include the size of the arguments, thus
-/// pass {sizeof({void *, short, void *}), ptr to {void *, short, void *},
-/// nullptr}. The last nullptr needs to be initialized to an array of pointers
-/// pointing to the arguments if we want to offload to the host.
+/// pass {size of ({void *, short, void *}) without tail padding, ptr to {void
+/// *, short, void *}, nullptr}. The last nullptr needs to be initialized to an
+/// array of pointers pointing to the arguments if we want to offload to the
+/// host.
 Address CGNVCUDARuntime::prepareKernelArgsLLVMOffload(CodeGenFunction &CGF,
                                                       FunctionArgList &Args) {
   SmallVector<llvm::Type *> ArgTypes, KernelLaunchParamsTypes;
@@ -339,6 +340,7 @@ Address CGNVCUDARuntime::prepareKernelArgsLLVMOffload(CodeGenFunction &CGF,
 
   auto *Int64Ty = CGF.Builder.getInt64Ty();
   KernelLaunchParamsTypes.push_back(Int64Ty);
+  KernelLaunchParamsTypes.push_back(Int64Ty);
   KernelLaunchParamsTypes.push_back(PtrTy);
   KernelLaunchParamsTypes.push_back(PtrTy);
 
@@ -351,12 +353,24 @@ Address CGNVCUDARuntime::prepareKernelArgsLLVMOffload(CodeGenFunction &CGF,
       "kernel_launch_params");
 
   auto KernelArgsSize = CGM.getDataLayout().getTypeAllocSize(KernelArgsTy);
+
+  // Avoid accounting the tail padding for CUDA.
+  auto KernelArgsSizeNoTailPadding = llvm::TypeSize::getZero();
+  if (auto N = KernelArgsTy->getNumElements()) {
+    auto *SL = CGM.getDataLayout().getStructLayout(KernelArgsTy);
+    KernelArgsSizeNoTailPadding = SL->getElementOffset(N - 1);
+    KernelArgsSizeNoTailPadding += CGM.getDataLayout().getTypeAllocSize(
+        KernelArgsTy->getElementType(N - 1));
+  }
+
   CGF.Builder.CreateStore(llvm::ConstantInt::get(Int64Ty, KernelArgsSize),
                           CGF.Builder.CreateStructGEP(KernelLaunchParams, 0));
-  CGF.Builder.CreateStore(KernelArgs.emitRawPointer(CGF),
+  CGF.Builder.CreateStore(llvm::ConstantInt::get(Int64Ty, KernelArgsSizeNoTailPadding),
                           CGF.Builder.CreateStructGEP(KernelLaunchParams, 1));
-  CGF.Builder.CreateStore(llvm::Constant::getNullValue(PtrTy),
+  CGF.Builder.CreateStore(KernelArgs.emitRawPointer(CGF),
                           CGF.Builder.CreateStructGEP(KernelLaunchParams, 2));
+  CGF.Builder.CreateStore(llvm::Constant::getNullValue(PtrTy),
+                          CGF.Builder.CreateStructGEP(KernelLaunchParams, 3));
 
   for (unsigned i = 0; i < Args.size(); ++i) {
     auto *ArgVal = CGF.Builder.CreateLoad(CGF.GetAddrOfLocalVar(Args[i]));
diff --git a/offload/include/Shared/APITypes.h b/offload/include/Shared/APITypes.h
index 8c150b6bfc2d4..52725a0474c6a 100644
--- a/offload/include/Shared/APITypes.h
+++ b/offload/include/Shared/APITypes.h
@@ -121,6 +121,8 @@ static_assert(sizeof(KernelArgsTy) ==
 struct KernelLaunchParamsTy {
   /// Size of the Data array.
   size_t Size = 0;
+  /// Size of the Data array without tail padding.
+  size_t SizeNoTailPadding = 0;
   /// Flat array of kernel parameters.
   void *Data = nullptr;
   /// Ptrs to the Data entries. Only strictly required for the host plugin.
diff --git a/offload/plugins-nextgen/common/src/PluginInterface.cpp b/offload/plugins-nextgen/common/src/PluginInterface.cpp
index d4b5f914c6672..238f6dccc6640 100644
--- a/offload/plugins-nextgen/common/src/PluginInterface.cpp
+++ b/offload/plugins-nextgen/common/src/PluginInterface.cpp
@@ -627,7 +627,9 @@ KernelLaunchParamsTy GenericKernelTy::prepareArgs(
         (void *)((intptr_t)ArgPtrs[I - KLEOffset] + ArgOffsets[I - KLEOffset]);
     Ptrs[I] = &Args[I];
   }
-  return KernelLaunchParamsTy{sizeof(void *) * NumArgs, &Args[0], &Ptrs[0]};
+
+  size_t ArgsSize = sizeof(void *) * NumArgs;
+  return KernelLaunchParamsTy{ArgsSize, ArgsSize, &Args[0], &Ptrs[0]};
 }
 
 uint32_t GenericKernelTy::getNumThreads(GenericDeviceTy &GenericDevice,
diff --git a/offload/plugins-nextgen/cuda/src/rtl.cpp b/offload/plugins-nextgen/cuda/src/rtl.cpp
index c7984287f7533..ddb21f1678a6a 100644
--- a/offload/plugins-nextgen/cuda/src/rtl.cpp
+++ b/offload/plugins-nextgen/cuda/src/rtl.cpp
@@ -1430,7 +1430,7 @@ Error CUDAKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
 
   void *Config[] = {CU_LAUNCH_PARAM_BUFFER_POINTER, LaunchParams.Data,
                     CU_LAUNCH_PARAM_BUFFER_SIZE,
-                    reinterpret_cast<void *>(&LaunchParams.Size),
+                    reinterpret_cast<void *>(&LaunchParams.SizeNoTailPadding),
                     CU_LAUNCH_PARAM_END};
 
   // If we are running an RPC server we want to wake up the server thread
diff --git a/offload/test/offloading/CUDA/basic_launch_multi_arg.cu b/offload/test/offloading/CUDA/basic_launch_multi_arg.cu
index 1f84a0e1288d4..ab6f753150932 100644
--- a/offload/test/offloading/CUDA/basic_launch_multi_arg.cu
+++ b/offload/test/offloading/CUDA/basic_launch_multi_arg.cu
@@ -23,6 +23,10 @@ __global__ void square(int *Dst, short Q, int *Src, short P) {
   Src[1] = P;
 }
 
+__global__ void accumulate(short Q, int *Dst, char P) {
+  *Dst += Q + P;
+}
+
 int main(int argc, char **argv) {
   int DevNo = 0;
   int *Ptr = reinterpret_cast<int *>(llvm_omp_target_alloc_shared(4, DevNo));
@@ -39,5 +43,9 @@ int main(int argc, char **argv) {
   // CHECK: Ptr [[Ptr]], *Ptr: 42
   printf("Src: %i : %i\n", Src[0], Src[1]);
   // CHECK: Src: 3 : 4
+  accumulate<<<1, 1>>>(3, Ptr, 7);
+  printf("Ptr %p, *Ptr: %i\n", Ptr, *Ptr);
+  // CHECK: Ptr [[Ptr]], *Ptr: 52
   llvm_omp_target_free_shared(Ptr, DevNo);
+  llvm_omp_target_free_shared(Src, DevNo);
 }

@kevinsala kevinsala requested review from jhuber6 and shiltian August 31, 2025 07:49
@github-actions (bot) commented Aug 31, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

auto *SL = CGM.getDataLayout().getStructLayout(KernelArgsTy);
KernelArgsSizeNoTailPadding = SL->getElementOffset(N - 1);
KernelArgsSizeNoTailPadding += CGM.getDataLayout().getTypeAllocSize(
KernelArgsTy->getElementType(N - 1));

@kevinsala (Contributor, Author) commented:

If there is a cleaner way of getting this information, please let me know.

@shiltian shiltian requested a review from Artem-B August 31, 2025 09:32
@Artem-B (Member) commented Sep 2, 2025

It looks like we may be fixing a consequence of an issue somewhere else in the offloading code. I'm not familiar with the details, so I'm speculating based on general principles.

I suspect that shipping the size without trailing padding may not be sufficient. E.g., what if we pass void*, short, void*, short? Do we handle the internal padding correctly? If we do, what makes the trailing element's padding special? We should have handled it the same way as the padding for the inner short field. If we can't handle inner padding, then avoiding the trailing padding will only fix the problem for trailing fields with padding, but the general issue will remain.

What am I missing?

@kevinsala (Contributor, Author) commented Sep 3, 2025

@Artem-B My understanding is that the CUDA Driver API requires the arguments in the buffer to be placed with the proper alignment (i.e., with padding between fields). However, the trailing padding after the last element must not be included in the reported size; if it is, the cuLaunchKernel call fails.

The documentation of cuLaunchKernel says:

Kernel parameters can also be packaged by the application into a single buffer that is passed in via the extra parameter. This places the burden on the application of knowing each kernel parameter's size and alignment/padding within the buffer.

and

CU_LAUNCH_PARAM_BUFFER_SIZE, which specifies that the next value in extra will be a pointer to a size_t containing the size of the buffer specified with CU_LAUNCH_PARAM_BUFFER_POINTER.

The following code is a simple reproducer that works directly on top of the CUDA Driver API:

kernel.cu:

extern "C" __global__ void kernel(int *arg1, short arg2, int *arg3, short arg4) {
  *arg1 = arg2;
  *arg3 = arg4;
}

main.cu:

#include <cstdio>
#include <cuda.h>

#define CU_CHECK(err) \
  do { \
    CUresult err__ = (err); \
    if (err__ != CUDA_SUCCESS) { \
      const char *errStr; \
      cuGetErrorString(err__, &errStr); \
      fprintf(stderr, "Error: %s\n", errStr ? errStr : "Unknown"); \
      exit(1); \
    } \
  } while (0)


int main(int argc, char **argv) {
  CU_CHECK(cuInit(0));

  CUdevice device;
  CU_CHECK(cuDeviceGet(&device, 0));

  CUcontext context;
  CU_CHECK(cuCtxCreate(&context, 0, device));

  CUmodule module;
  CU_CHECK(cuModuleLoad(&module, "kernel.cubin"));

  CUfunction kernel;
  CU_CHECK(cuModuleGetFunction(&kernel, module, "kernel"));

  CUdeviceptr d_arg1, d_arg3;
  CU_CHECK(cuMemAlloc(&d_arg1, sizeof(int)));
  CU_CHECK(cuMemAlloc(&d_arg3, sizeof(int)));

  short arg2 = 2, arg4 = 4;

  struct Args {
    CUdeviceptr arg1;
    short arg2;
    CUdeviceptr arg3;
    short arg4;
  };

  Args args = { d_arg1, arg2, d_arg3, arg4 };
  size_t size = 8 + 8 + 8 + 2; // 26 bytes: last field at offset 24 + sizeof(short), no tail padding -- OK
  // size_t size = sizeof(Args); // 32 bytes, includes tail padding -- ERROR

  void *config[] = { CU_LAUNCH_PARAM_BUFFER_POINTER, &args,
                     CU_LAUNCH_PARAM_BUFFER_SIZE,
                     reinterpret_cast<void *>(&size),
                     CU_LAUNCH_PARAM_END };

  CU_CHECK(cuLaunchKernel(
    kernel, 1, 1, 1, 1, 1, 1,
    0, 0, nullptr, config
  ));

  CU_CHECK(cuCtxSynchronize());

  int h_out = 0;
  CU_CHECK(cuMemcpyDtoH(&h_out, d_arg1, sizeof(int)));
  printf("Result from kernel: %d\n", h_out);

  CU_CHECK(cuMemFree(d_arg1));
  CU_CHECK(cuMemFree(d_arg3));
  CU_CHECK(cuModuleUnload(module));
  CU_CHECK(cuCtxDestroy(context));

  return 0;
}

Commands to build the reproducer:

nvcc -arch=sm_90 --cubin kernel.cu -o kernel.cubin
nvcc -arch=sm_90 main.cu -o main -lcuda
./main

The code works as is, passing the size that skips the trailing padding. If it is replaced by sizeof(Args), the CUDA call fails.
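As a small variation on the reproducer above (with <cstddef> included for offsetof), the 26-byte value can also be derived from the struct layout instead of being added by hand:

  // Same 26 bytes as "8 + 8 + 8 + 2": offset of the last field plus its size,
  // i.e. sizeof(Args) minus the tail padding.
  size_t size = offsetof(Args, arg4) + sizeof(args.arg4);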

@Artem-B (Member) left a comment

I see. So offloading uses cuLaunchKernel, while LLVM-generated launch glue uses cudaLaunchKernel, which does not have this problem.

The offloading code grabs the launch data that was intended to be used for the CUDA runtime API, tries to use it for the CUDA driver API, and now we've discovered that they do not quite work the same way.

Fine. Passing unpadded args size for offloading glue looks like a necessary quirk.

LGTM, but please wait for @jhuber6 to chime in.

@jhuber6 (Contributor) left a comment

I'm not going to block this, but I think being divergent with CUDA code generation here is unwise in the long term. We should let both lower to the same thing but change the runtime calls; the runtime call then goes to cuLaunchKernel, which expects a struct + size.

@kevinsala (Contributor, Author):

I believe I can use the same size (without the trailing padding) for the amdgpu plugin. However, I have to remove the whole check below:

ArgsSize > LaunchParams.Size + getImplicitArgsSize())

The check will be invalid because it won't account for the padding in between. Let me know what you think.

@kevinsala (Contributor, Author):

@jhuber6 @Artem-B does the new approach look better? We now pass the same size for both nvidia and amdgpu kernels.

@@ -3655,11 +3655,6 @@ Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
KernelArgsTy &KernelArgs,
KernelLaunchParamsTy LaunchParams,
AsyncInfoWrapperTy &AsyncInfoWrapper) const {
if (ArgsSize != LaunchParams.Size &&

Contributor:


This is concerning; the argument size here is directly returned by the runtime. If this is false, then I'd assume the ABI is broken, unless I'm missing something.


@kevinsala (Contributor, Author):


The first condition will be true if there are implicit kernel arguments. The second condition can't be checked anymore because LaunchParams.Size does not account for the tail padding of the user arguments. Thus, ArgsSize can actually be larger than LaunchParams.Size + getImplicitArgsSize() because the padding between these last two is not accounted for.


@kevinsala (Contributor, Author):


I could compute the padded size (rounded up to pointer alignment, which is the alignment of the implicit args) and keep the check, but I'm not sure it's worth it.
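A hedged sketch of what that could look like (hypothetical helper, not in the patch), rounding the unpadded size up to pointer alignment, which is the alignment of the implicit arguments:

#include <cstddef>

// Hypothetical: round an unpadded argument-buffer size up to pointer
// alignment so an upper-bound check against the runtime-reported size
// (user args + implicit args) could be kept.
static size_t roundUpToPtrAlign(size_t Size) {
  const size_t Align = alignof(void *);
  return (Size + Align - 1) & ~(Align - 1);
}

// e.g. roundUpToPtrAlign(26) == 32 on a 64-bit target.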


Contributor:


The implicit arguments are placed directly after what HSA reports as the argument size; does this change that?


@kevinsala (Contributor, Author):


It shouldn't change that. There is the ImplArgsOffset computed later, which rounds up the size to ImplicitArgsTy's alignment.

@kevinsala force-pushed the fix-cuda-offload-kernel-params branch from 830a4ed to 08dfe94 on September 26, 2025 01:18
@kevinsala (Contributor, Author):

This fixes one of the disabled tests listed in #161265


-  auto KernelArgsSize = CGM.getDataLayout().getTypeAllocSize(KernelArgsTy);
+  // Avoid accounting the tail padding for the kernel arguments.
+  auto KernelArgsSize = llvm::TypeSize::getZero();

Contributor:


I'm still a little iffy on this. What's the reason we can't just mimic CUDA's handling? IIUC this is the old version where we passed these things through the libomptarget interface. Wasn't there some effort to make a different one on top of libcudart?


@kevinsala (Contributor, Author):


What do you mean by mimicking "CUDA's handling"?


Contributor:


I don't understand why this "Offload on LLVM" thing is using the OpenMP API in the first place.


@kevinsala (Contributor, Author):


Probably the title of the PR should be more generic. I don't think this issue only affects LLVM offloading; it could potentially affect any kernel in OpenMP offloading with parameters that have different alignments.


@jhuber6 (Contributor) commented Oct 3, 2025


That never occurs in OpenMP because we pad everything to u64, but I get what you mean. I need to look deeper into how argument passing works, and hopefully make some improvements in offload and pull some more things into OpenMP, so we can make the argument handling generic and leave the OpenMP-specific assumptions in libomptarget.
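As a side note, a minimal sketch (hypothetical values) of why the OpenMP path never hits this: every argument is a pointer-sized slot, so the packed buffer size is always a multiple of 8.

#include <cstddef>
#include <cstdio>

int main() {
  // Mirrors the prepareArgs logic in the diff: one void *-sized slot per
  // argument, hence no internal or tail padding in the packed buffer.
  const size_t NumArgs = 3; // hypothetical
  size_t ArgsSize = sizeof(void *) * NumArgs;
  printf("ArgsSize = %zu\n", ArgsSize); // 24 on a 64-bit target
  return 0;
}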

@kevinsala changed the title from "[clang][CUDA] Avoid accounting for tail padding in LLVM offloading" to "[clang] Avoid accounting for tail padding in kernel arguments" on Oct 6, 2025
@kevinsala (Contributor, Author):

Fixed by #172249 from the CUDA plugin.

@kevinsala kevinsala closed this Dec 15, 2025