17 changes: 13 additions & 4 deletions clang/lib/CodeGen/CGCUDANV.cpp
@@ -327,9 +327,10 @@ void CGNVCUDARuntime::emitDeviceStub(CodeGenFunction &CGF,
/// (void*, short, void*) is passed as {void **, short *, void **} to the launch
/// function. For the LLVM/offload launch we flatten the arguments into the
/// struct directly. In addition, we include the size of the arguments, thus
/// pass {sizeof({void *, short, void *}), ptr to {void *, short, void *},
/// nullptr}. The last nullptr needs to be initialized to an array of pointers
/// pointing to the arguments if we want to offload to the host.
/// pass {size of ({void *, short, void *}) without tail padding, ptr to {void
/// *, short, void *}, nullptr}. The last nullptr needs to be initialized to an
/// array of pointers pointing to the arguments if we want to offload to the
/// host.
Address CGNVCUDARuntime::prepareKernelArgsLLVMOffload(CodeGenFunction &CGF,
FunctionArgList &Args) {
SmallVector<llvm::Type *> ArgTypes, KernelLaunchParamsTypes;
@@ -350,7 +351,15 @@ Address CGNVCUDARuntime::prepareKernelArgsLLVMOffload(CodeGenFunction &CGF,
KernelLaunchParamsTy, CharUnits::fromQuantity(16),
"kernel_launch_params");

auto KernelArgsSize = CGM.getDataLayout().getTypeAllocSize(KernelArgsTy);
// Avoid accounting the tail padding for the kernel arguments.
auto KernelArgsSize = llvm::TypeSize::getZero();
Contributor: I'm still a little iffy on this; what's the reason we can't just mimic CUDA's handling? IIUC this is the old version where we passed these things through the libomptarget interface. Wasn't there some effort to make a different one on top of libcudart?

Contributor Author: What do you mean by mimicking "CUDA's handling"?

Contributor: I don't understand why this "Offload on LLVM" thing is using the OpenMP API in the first place.

Contributor Author: The PR title should probably be more generic. I don't think this issue only affects LLVM offloading; it could potentially affect any kernel in OpenMP offloading whose parameters have different alignments.

Contributor (@jhuber6, Oct 3, 2025): That never occurs in OpenMP because we pad everything to u64, but I get what you mean. I need to look deeper into how argument passing works, hopefully make some improvements in offload, and pull more things into OpenMP so we can make the argument handling generic and leave the OpenMP-specific assumptions in libomptarget.

if (auto N = KernelArgsTy->getNumElements()) {
auto *SL = CGM.getDataLayout().getStructLayout(KernelArgsTy);
KernelArgsSize += SL->getElementOffset(N - 1);
KernelArgsSize += CGM.getDataLayout().getTypeAllocSize(
KernelArgsTy->getElementType(N - 1));
Contributor Author: If there is a cleaner way of getting this information, please let me know.

}

CGF.Builder.CreateStore(llvm::ConstantInt::get(Int64Ty, KernelArgsSize),
CGF.Builder.CreateStructGEP(KernelLaunchParams, 0));
CGF.Builder.CreateStore(KernelArgs.emitRawPointer(CGF),
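Note (not part of the patch): a minimal standalone C++ sketch of the size computation above, using the {short, int *, char} layout of the new accumulate test kernel; the struct name and printed values are illustrative assumptions for an LP64 target.

#include <cstddef>
#include <cstdio>

// Mirrors the flattened {short, int *, char} kernel-argument struct:
// Q at offset 0, Dst at offset 8 (pointer alignment), P at offset 16.
struct FlattenedArgs {
  short Q;
  int *Dst;
  char P;
};

int main() {
  // sizeof includes the tail padding up to alignof(FlattenedArgs) == 8.
  size_t AllocSize = sizeof(FlattenedArgs); // 24
  // Offset of the last element plus its size, i.e. without tail padding.
  size_t NoTailPadding = offsetof(FlattenedArgs, P) + sizeof(char); // 17
  std::printf("alloc size: %zu, without tail padding: %zu\n", AllocSize,
              NoTailPadding);
  return 0;
}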
2 changes: 1 addition & 1 deletion offload/include/Shared/APITypes.h
@@ -119,7 +119,7 @@ static_assert(sizeof(KernelArgsTy) ==

/// Flat array of kernel launch parameters and their total size.
struct KernelLaunchParamsTy {
/// Size of the Data array.
/// Size of the Data array without the tail padding.
size_t Size = 0;
/// Flat array of kernel parameters.
void *Data = nullptr;
5 changes: 0 additions & 5 deletions offload/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -3658,11 +3658,6 @@ Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
KernelArgsTy &KernelArgs,
KernelLaunchParamsTy LaunchParams,
AsyncInfoWrapperTy &AsyncInfoWrapper) const {
if (ArgsSize != LaunchParams.Size &&
Contributor: This is concerning; the argument size here is directly returned by the runtime. If this is false, then I'd assume the ABI is broken, unless I'm missing something.

Contributor Author: The first condition will be true if there are implicit kernel arguments. The second condition can't be checked anymore because LaunchParams.Size no longer accounts for the tail padding of the user arguments. Thus, ArgsSize can actually be larger than LaunchParams.Size + getImplicitArgsSize(), because the padding between those two is not accounted for.

Contributor Author: I could compute the padded size (rounded up to pointer alignment, which is the alignment of the implicit args) and keep the check, but I'm not sure it's worth it.

Contributor: The implicit arguments are placed directly after what HSA reports as the argument size; does this change that?

Contributor Author: It shouldn't change that. There is the ImplArgsOffset computed later, which rounds the size up to ImplicitArgsTy's alignment.

ArgsSize > LaunchParams.Size + getImplicitArgsSize())
return Plugin::error(ErrorCode::INVALID_ARGUMENT,
"invalid kernel arguments size");

AMDGPUPluginTy &AMDGPUPlugin =
static_cast<AMDGPUPluginTy &>(GenericDevice.Plugin);
AMDHostDeviceTy &HostDevice = AMDGPUPlugin.getHostDevice();
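Note (not part of the patch): a small C++ sketch of why the dropped check could now fire spuriously; alignTo and the implicit-args size are illustrative assumptions, not plugin code. The implicit arguments start after the user arguments rounded up to pointer alignment, so the ArgsSize reported by HSA can exceed LaunchParams.Size + getImplicitArgsSize() once tail padding is no longer counted.

#include <cstddef>
#include <cstdio>

// Round Value up to the next multiple of Align (Align must be a power of two).
static size_t alignTo(size_t Value, size_t Align) {
  return (Value + Align - 1) & ~(Align - 1);
}

int main() {
  size_t LaunchParamsSize = 17;  // {short, int *, char} without tail padding
  size_t ImplicitArgsSize = 256; // hypothetical getImplicitArgsSize() value
  size_t ArgsSize = alignTo(LaunchParamsSize, alignof(void *)) + ImplicitArgsSize;
  std::printf("%zu > %zu\n", ArgsSize, LaunchParamsSize + ImplicitArgsSize); // 280 > 273
  return 0;
}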
4 changes: 3 additions & 1 deletion offload/plugins-nextgen/common/src/PluginInterface.cpp
@@ -571,7 +571,9 @@ KernelLaunchParamsTy GenericKernelTy::prepareArgs(
(void *)((intptr_t)ArgPtrs[I - KLEOffset] + ArgOffsets[I - KLEOffset]);
Ptrs[I] = &Args[I];
}
return KernelLaunchParamsTy{sizeof(void *) * NumArgs, &Args[0], &Ptrs[0]};

size_t ArgsSize = sizeof(void *) * NumArgs;
return KernelLaunchParamsTy{ArgsSize, &Args[0], &Ptrs[0]};
}

uint32_t GenericKernelTy::getNumThreads(GenericDeviceTy &GenericDevice,
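Note (not part of the patch): a rough C++ sketch of the packing prepareArgs performs for OpenMP-style launches, where every argument occupies one pointer-sized slot (the "everything is padded to u64" remark above), so the reported size is simply sizeof(void *) * NumArgs; the addresses and offsets below are fabricated for illustration.

#include <cstdint>
#include <cstdio>

int main() {
  const int NumArgs = 3;
  void *Args[NumArgs];
  void *Ptrs[NumArgs];
  intptr_t ArgPtrs[NumArgs] = {0x1000, 0x2000, 0x3000}; // fake base addresses
  int64_t ArgOffsets[NumArgs] = {0, 8, 16};             // fake per-arg offsets
  for (int I = 0; I < NumArgs; ++I) {
    Args[I] = (void *)(ArgPtrs[I] + ArgOffsets[I]); // resolved argument value
    Ptrs[I] = &Args[I];                             // pointer to the slot
  }
  std::printf("slot 0 at %p holds %p; size = %zu\n", Ptrs[0], Args[0],
              sizeof(void *) * NumArgs);
  return 0;
}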
8 changes: 8 additions & 0 deletions offload/test/offloading/CUDA/basic_launch_multi_arg.cu
@@ -23,6 +23,10 @@ __global__ void square(int *Dst, short Q, int *Src, short P) {
Src[1] = P;
}

__global__ void accumulate(short Q, int *Dst, char P) {
*Dst += Q + P;
}

int main(int argc, char **argv) {
int DevNo = 0;
int *Ptr = reinterpret_cast<int *>(llvm_omp_target_alloc_shared(4, DevNo));
@@ -39,5 +43,9 @@ int main(int argc, char **argv) {
// CHECK: Ptr [[Ptr]], *Ptr: 42
printf("Src: %i : %i\n", Src[0], Src[1]);
// CHECK: Src: 3 : 4
accumulate<<<1, 1>>>(3, Ptr, 7);
printf("Ptr %p, *Ptr: %i\n", Ptr, *Ptr);
// CHECK: Ptr [[Ptr]], *Ptr: 52
llvm_omp_target_free_shared(Ptr, DevNo);
llvm_omp_target_free_shared(Src, DevNo);
}