[clang] Avoid accounting for tail padding in kernel arguments #156229
Conversation
@llvm/pr-subscribers-backend-amdgpu @llvm/pr-subscribers-clang

Author: Kevin Sala Penades (kevinsala)

Changes

It seems that cuLaunchKernel expects the arguments size (CU_LAUNCH_PARAM_BUFFER_SIZE) without accounting for tail padding. For example, for a kernel with arguments int *A, short B, the function requires a size of 10 bytes. However, we are currently passing sizeof(struct { int *A; short B; }), which results in 16 bytes. (A small illustration of the two sizes follows the diff below.)

This commit exposes both sizes in KernelLaunchParamsTy so the plugins can decide which one to use. It fixes the offload/test/offloading/CUDA/basic_launch_multi_arg.cu test on NVIDIA GPUs, which was failing with the error "too many resources requested for launch".

Full diff: https://github.com/llvm/llvm-project/pull/156229.diff

5 Files Affected:
diff --git a/clang/lib/CodeGen/CGCUDANV.cpp b/clang/lib/CodeGen/CGCUDANV.cpp
index 5090a0559eab2..1f3492d57c6a1 100644
--- a/clang/lib/CodeGen/CGCUDANV.cpp
+++ b/clang/lib/CodeGen/CGCUDANV.cpp
@@ -327,9 +327,10 @@ void CGNVCUDARuntime::emitDeviceStub(CodeGenFunction &CGF,
/// (void*, short, void*) is passed as {void **, short *, void **} to the launch
/// function. For the LLVM/offload launch we flatten the arguments into the
/// struct directly. In addition, we include the size of the arguments, thus
-/// pass {sizeof({void *, short, void *}), ptr to {void *, short, void *},
-/// nullptr}. The last nullptr needs to be initialized to an array of pointers
-/// pointing to the arguments if we want to offload to the host.
+/// pass {size of ({void *, short, void *}) without tail padding, ptr to {void
+/// *, short, void *}, nullptr}. The last nullptr needs to be initialized to an
+/// array of pointers pointing to the arguments if we want to offload to the
+/// host.
Address CGNVCUDARuntime::prepareKernelArgsLLVMOffload(CodeGenFunction &CGF,
FunctionArgList &Args) {
SmallVector<llvm::Type *> ArgTypes, KernelLaunchParamsTypes;
@@ -339,6 +340,7 @@ Address CGNVCUDARuntime::prepareKernelArgsLLVMOffload(CodeGenFunction &CGF,
auto *Int64Ty = CGF.Builder.getInt64Ty();
KernelLaunchParamsTypes.push_back(Int64Ty);
+ KernelLaunchParamsTypes.push_back(Int64Ty);
KernelLaunchParamsTypes.push_back(PtrTy);
KernelLaunchParamsTypes.push_back(PtrTy);
@@ -351,12 +353,24 @@ Address CGNVCUDARuntime::prepareKernelArgsLLVMOffload(CodeGenFunction &CGF,
"kernel_launch_params");
auto KernelArgsSize = CGM.getDataLayout().getTypeAllocSize(KernelArgsTy);
+
+ // Avoid accounting the tail padding for CUDA.
+ auto KernelArgsSizeNoTailPadding = llvm::TypeSize::getZero();
+ if (auto N = KernelArgsTy->getNumElements()) {
+ auto *SL = CGM.getDataLayout().getStructLayout(KernelArgsTy);
+ KernelArgsSizeNoTailPadding = SL->getElementOffset(N - 1);
+ KernelArgsSizeNoTailPadding += CGM.getDataLayout().getTypeAllocSize(
+ KernelArgsTy->getElementType(N - 1));
+ }
+
CGF.Builder.CreateStore(llvm::ConstantInt::get(Int64Ty, KernelArgsSize),
CGF.Builder.CreateStructGEP(KernelLaunchParams, 0));
- CGF.Builder.CreateStore(KernelArgs.emitRawPointer(CGF),
+ CGF.Builder.CreateStore(llvm::ConstantInt::get(Int64Ty, KernelArgsSizeNoTailPadding),
CGF.Builder.CreateStructGEP(KernelLaunchParams, 1));
- CGF.Builder.CreateStore(llvm::Constant::getNullValue(PtrTy),
+ CGF.Builder.CreateStore(KernelArgs.emitRawPointer(CGF),
CGF.Builder.CreateStructGEP(KernelLaunchParams, 2));
+ CGF.Builder.CreateStore(llvm::Constant::getNullValue(PtrTy),
+ CGF.Builder.CreateStructGEP(KernelLaunchParams, 3));
for (unsigned i = 0; i < Args.size(); ++i) {
auto *ArgVal = CGF.Builder.CreateLoad(CGF.GetAddrOfLocalVar(Args[i]));
diff --git a/offload/include/Shared/APITypes.h b/offload/include/Shared/APITypes.h
index 8c150b6bfc2d4..52725a0474c6a 100644
--- a/offload/include/Shared/APITypes.h
+++ b/offload/include/Shared/APITypes.h
@@ -121,6 +121,8 @@ static_assert(sizeof(KernelArgsTy) ==
struct KernelLaunchParamsTy {
/// Size of the Data array.
size_t Size = 0;
+ /// Size of the Data array without tail padding.
+ size_t SizeNoTailPadding = 0;
/// Flat array of kernel parameters.
void *Data = nullptr;
/// Ptrs to the Data entries. Only strictly required for the host plugin.
diff --git a/offload/plugins-nextgen/common/src/PluginInterface.cpp b/offload/plugins-nextgen/common/src/PluginInterface.cpp
index d4b5f914c6672..238f6dccc6640 100644
--- a/offload/plugins-nextgen/common/src/PluginInterface.cpp
+++ b/offload/plugins-nextgen/common/src/PluginInterface.cpp
@@ -627,7 +627,9 @@ KernelLaunchParamsTy GenericKernelTy::prepareArgs(
(void *)((intptr_t)ArgPtrs[I - KLEOffset] + ArgOffsets[I - KLEOffset]);
Ptrs[I] = &Args[I];
}
- return KernelLaunchParamsTy{sizeof(void *) * NumArgs, &Args[0], &Ptrs[0]};
+
+ size_t ArgsSize = sizeof(void *) * NumArgs;
+ return KernelLaunchParamsTy{ArgsSize, ArgsSize, &Args[0], &Ptrs[0]};
}
uint32_t GenericKernelTy::getNumThreads(GenericDeviceTy &GenericDevice,
diff --git a/offload/plugins-nextgen/cuda/src/rtl.cpp b/offload/plugins-nextgen/cuda/src/rtl.cpp
index c7984287f7533..ddb21f1678a6a 100644
--- a/offload/plugins-nextgen/cuda/src/rtl.cpp
+++ b/offload/plugins-nextgen/cuda/src/rtl.cpp
@@ -1430,7 +1430,7 @@ Error CUDAKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
void *Config[] = {CU_LAUNCH_PARAM_BUFFER_POINTER, LaunchParams.Data,
CU_LAUNCH_PARAM_BUFFER_SIZE,
- reinterpret_cast<void *>(&LaunchParams.Size),
+ reinterpret_cast<void *>(&LaunchParams.SizeNoTailPadding),
CU_LAUNCH_PARAM_END};
// If we are running an RPC server we want to wake up the server thread
diff --git a/offload/test/offloading/CUDA/basic_launch_multi_arg.cu b/offload/test/offloading/CUDA/basic_launch_multi_arg.cu
index 1f84a0e1288d4..ab6f753150932 100644
--- a/offload/test/offloading/CUDA/basic_launch_multi_arg.cu
+++ b/offload/test/offloading/CUDA/basic_launch_multi_arg.cu
@@ -23,6 +23,10 @@ __global__ void square(int *Dst, short Q, int *Src, short P) {
Src[1] = P;
}
+__global__ void accumulate(short Q, int *Dst, char P) {
+ *Dst += Q + P;
+}
+
int main(int argc, char **argv) {
int DevNo = 0;
int *Ptr = reinterpret_cast<int *>(llvm_omp_target_alloc_shared(4, DevNo));
@@ -39,5 +43,9 @@ int main(int argc, char **argv) {
// CHECK: Ptr [[Ptr]], *Ptr: 42
printf("Src: %i : %i\n", Src[0], Src[1]);
// CHECK: Src: 3 : 4
+ accumulate<<<1, 1>>>(3, Ptr, 7);
+ printf("Ptr %p, *Ptr: %i\n", Ptr, *Ptr);
+ // CHECK: Ptr [[Ptr]], *Ptr: 52
llvm_omp_target_free_shared(Ptr, DevNo);
+ llvm_omp_target_free_shared(Src, DevNo);
}
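For reference, here is the illustration mentioned in the description: a minimal, self-contained host-side sketch (not part of the patch; the struct name KernelArgs is made up) of the padded vs. unpadded size for the int *A, short B example. cuLaunchKernel wants the 10-byte figure, while sizeof reports 16.

```cpp
#include <cstddef>
#include <cstdio>

// Model of the flattened kernel-argument buffer for a kernel taking
// (int *A, short B) on a typical 64-bit host.
struct KernelArgs {
  int *A;  // offset 0, 8 bytes
  short B; // offset 8, 2 bytes; 6 bytes of tail padding follow
};

int main() {
  size_t Padded = sizeof(KernelArgs);                        // 16
  size_t Unpadded = offsetof(KernelArgs, B) + sizeof(short); // 10
  std::printf("padded=%zu unpadded=%zu\n", Padded, Unpadded);
  return 0;
}
```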
✅ With the latest revision this PR passed the C/C++ code formatter.
    auto *SL = CGM.getDataLayout().getStructLayout(KernelArgsTy);
    KernelArgsSizeNoTailPadding = SL->getElementOffset(N - 1);
    KernelArgsSizeNoTailPadding += CGM.getDataLayout().getTypeAllocSize(
        KernelArgsTy->getElementType(N - 1));
If there is a cleaner way of getting this information, please let me know.
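For context, the computation those lines perform, extracted into a standalone sketch (not part of the patch; assumes a recent LLVM where StructLayout::getElementOffset returns a TypeSize, and uses an example data layout string rather than a real target's):

```cpp
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

int main() {
  LLVMContext Ctx;
  DataLayout DL("e-p:64:64-i64:64-n32:64-S128"); // example 64-bit layout

  // Model the kernel arguments {int *A, short B} as a struct, as the stub
  // emission does.
  StructType *ArgsTy = StructType::create(
      {PointerType::get(Ctx, 0), Type::getInt16Ty(Ctx)}, "kernel_args");

  const StructLayout *SL = DL.getStructLayout(ArgsTy);
  unsigned N = ArgsTy->getNumElements();

  // With tail padding (what was passed before this patch): 16 bytes.
  uint64_t Padded = DL.getTypeAllocSize(ArgsTy).getFixedValue();
  // Offset of the last element plus its alloc size: 8 + 2 = 10 bytes.
  uint64_t Unpadded =
      SL->getElementOffset(N - 1).getFixedValue() +
      DL.getTypeAllocSize(ArgsTy->getElementType(N - 1)).getFixedValue();

  outs() << "padded=" << Padded << " unpadded=" << Unpadded << "\n";
  return 0;
}
```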
It looks like we may be fixing a consequence of an issue somewhere else in the offloading code. I'm not familiar with the details, so I'm speculating based on general principles. I suspect that shipping the size without trailing padding may not be sufficient. E.g., what if we pass […]? What am I missing?
@Artem-B My understanding is that the CUDA Driver API requires the arguments in the buffer to be placed with proper alignment (i.e., with padding between fields). However, the trailing padding after the last element should not be counted in the buffer size; if it is, the launch fails. See the documentation of CU_LAUNCH_PARAM_BUFFER_POINTER and CU_LAUNCH_PARAM_BUFFER_SIZE.
The following code is a simple reproducer that works directly on top of the CUDA Driver API:
extern "C" __global__ void kernel(int *arg1, short arg2, int *arg3, short arg4) {
*arg1 = arg2;
*arg3 = arg4;
}
#include <cstdio>
#include <cuda.h>
#define CU_CHECK(err) \
do { \
CUresult err__ = (err); \
if (err__ != CUDA_SUCCESS) { \
const char *errStr; \
cuGetErrorString(err__, &errStr); \
fprintf(stderr, "Error: %s\n", errStr ? errStr : "Unknown"); \
exit(1); \
} \
} while (0)
int main(int argc, char **argv) {
CU_CHECK(cuInit(0));
CUdevice device;
CU_CHECK(cuDeviceGet(&device, 0));
CUcontext context;
CU_CHECK(cuCtxCreate(&context, 0, device));
CUmodule module;
CU_CHECK(cuModuleLoad(&module, "kernel.cubin"));
CUfunction kernel;
CU_CHECK(cuModuleGetFunction(&kernel, module, "kernel"));
CUdeviceptr d_arg1, d_arg3;
CU_CHECK(cuMemAlloc(&d_arg1, sizeof(int)));
CU_CHECK(cuMemAlloc(&d_arg3, sizeof(int)));
short arg2 = 2, arg4 = 4;
struct Args {
CUdeviceptr arg1;
short arg2;
CUdeviceptr arg3;
short arg4;
};
Args args = { d_arg1, arg2, d_arg3, arg4 };
size_t size = 8 + 8 + 8 + 2; // OK
// size_t size = sizeof(Args); // ERROR
void *config[] = { CU_LAUNCH_PARAM_BUFFER_POINTER, &args,
CU_LAUNCH_PARAM_BUFFER_SIZE,
reinterpret_cast<void *>(&size),
CU_LAUNCH_PARAM_END };
CU_CHECK(cuLaunchKernel(
kernel, 1, 1, 1, 1, 1, 1,
0, 0, nullptr, config
));
CU_CHECK(cuCtxSynchronize());
int h_out = 0;
CU_CHECK(cuMemcpyDtoH(&h_out, d_arg1, sizeof(int)));
printf("Result from kernel: %d\n", h_out);
CU_CHECK(cuMemFree(d_arg1));
CU_CHECK(cuMemFree(d_arg3));
CU_CHECK(cuModuleUnload(module));
CU_CHECK(cuCtxDestroy(context));
return 0;
}

Commands to build the reproducer:

nvcc -arch=sm_90 --cubin kernel.cu -o kernel.cubin
nvcc -arch=sm_90 main.cu -o main -lcuda
./main

The code works as it is, passing the size that skips the trailing padding. If it is replaced by sizeof(Args), the launch fails.
Artem-B left a comment:
I see. So offloading uses cuLaunchKernel, while LLVM-generated launch glue uses cudaLaunchKernel which does not have this problem.
The offloading code grabs the launch data that was intended to be used for the CUDA runtime API, tries to use it for the CUDA driver API, and now we've discovered that they do not quite work the same way.
Fine. Passing unpadded args size for offloading glue looks like a necessary quirk.
LGTM, but please wait for @jhuber6 to chime in.
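To make the contrast concrete, a sketch (not from the patch; function and variable names are placeholders, error handling omitted) of how the two launch APIs take their arguments:

```cpp
#include <cstddef>
#include <cuda.h>
#include <cuda_runtime.h>

// kernelFn: host stub address for the CUDA runtime API.
// cuFn:     CUfunction handle for the CUDA driver API.
void launchBothWays(const void *kernelFn, CUfunction cuFn, int *A, short B) {
  // CUDA runtime API (what the LLVM-generated launch glue uses): an array of
  // pointers, one per argument. The runtime knows each argument's size and
  // alignment, so no buffer size is passed and tail padding never comes up.
  void *args[] = {&A, &B};
  cudaLaunchKernel(kernelFn, dim3(1), dim3(1), args, 0, nullptr);

  // CUDA driver API via the `extra` parameter (what the offload plugin uses):
  // one flat buffer plus an explicit byte count, which is where the padded
  // vs. unpadded size matters.
  struct ArgsBuf {
    int *A;
    short B;
  } buf = {A, B};
  size_t size = offsetof(ArgsBuf, B) + sizeof(short); // unpadded: 10 bytes
  void *config[] = {CU_LAUNCH_PARAM_BUFFER_POINTER, &buf,
                    CU_LAUNCH_PARAM_BUFFER_SIZE, &size, CU_LAUNCH_PARAM_END};
  cuLaunchKernel(cuFn, 1, 1, 1, 1, 1, 1, 0, nullptr, nullptr, config);
}
```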
jhuber6 left a comment:
I'm not going to block this, but I think long term being divergent with CUDA code generation here is unwise. We should let both lower to the same thing but change the runtime calls; the runtime call then goes to cuLaunchKernel, which expects a struct + size.
I believe I can use the same size (without the trailing padding) for the amdgpu plugin. However, I have to remove the whole check below:
The check will be invalid because it won't account for the padding in between. Let me know what you think.
@@ -3655,11 +3655,6 @@ Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
                                 KernelArgsTy &KernelArgs,
                                 KernelLaunchParamsTy LaunchParams,
                                 AsyncInfoWrapperTy &AsyncInfoWrapper) const {
  if (ArgsSize != LaunchParams.Size &&
This is concerning: the argument size here is directly returned by the runtime. If this is false, then I'd assume the ABI is broken unless I'm missing something.
The first condition will be true if there are implicit kernel arguments. The second condition can't be checked anymore because LaunchParams.Size does not account for the tail padding of the user arguments. Thus, ArgsSize can actually be larger than LaunchParams.Size + getImplicitArgsSize() because the padding between these last two is not accounted for.
I could compute the padded size (rounded up to pointer alignment, which is the alignment of the implicit args) and keep the check, but I'm not sure it's worth it.
The implicit arguments are placed directly after what HSA reports as the argument size; does this change that?
It shouldn't change that. There is the ImplArgsOffset computed later, which rounds the size up to ImplicitArgsTy's alignment.
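For illustration, the kind of rounding that offset computation performs (a generic sketch; the actual helper and type names in the plugin may differ):

```cpp
#include <cstdint>
#include <cstdio>

// Round Size up to the next multiple of Align (Align must be a power of two).
static uint64_t roundUpTo(uint64_t Size, uint64_t Align) {
  return (Size + Align - 1) & ~(Align - 1);
}

int main() {
  // E.g. user arguments ending at byte 26, implicit arguments aligned to 8:
  // the implicit-args region would start at byte 32.
  std::printf("%llu\n", (unsigned long long)roundUpTo(26, 8)); // prints 32
  return 0;
}
```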
Force-pushed from 830a4ed to 08dfe94.
This fixes one of the disabled tests listed in #161265.
    auto KernelArgsSize = CGM.getDataLayout().getTypeAllocSize(KernelArgsTy);
    // Avoid accounting the tail padding for the kernel arguments.
    auto KernelArgsSize = llvm::TypeSize::getZero();
I'm still a little iffy on this. What's the reason that we can't just mimic CUDA's handling? IIUC this is the old version where we passed these things through the libomptarget interface. Wasn't there some effort to make a different one on top of libcudart?
What do you mean by mimicking "CUDA's handling"?
I don't understand why this "Offload on LLVM" thing is using the OpenMP API in the first place.
Probably the title of the PR should be more generic. I don't think this issue only affects LLVM offloading; it could potentially affect any kernel in OpenMP offloading with parameters that have different alignments.
That never occurs in OpenMP because we pad everything to u64, but I get what you mean. I need to look deeper into how argument passing works, hopefully make some improvements in offload, and pull some more things into OpenMP so we can make the argument handling generic and leave the OpenMP-specific assumptions in libomptarget.
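As a small illustration of why the mismatch never shows up there (a sketch, not libomptarget code): when every argument occupies a pointer-sized slot, as in the prepareArgs path above (sizeof(void *) * NumArgs), the buffer has no internal or tail padding, so the padded and unpadded sizes coincide.

```cpp
#include <cstdio>

int main() {
  // OpenMP-style argument buffer: one pointer-sized slot per argument.
  struct OmpStyleArgs {
    void *Slots[3];
  };
  // Every slot is 8 bytes wide and 8-byte aligned on a 64-bit host, so
  // sizeof(OmpStyleArgs) == 3 * sizeof(void *): padded == unpadded.
  std::printf("%zu == %zu\n", sizeof(OmpStyleArgs), 3 * sizeof(void *));
  return 0;
}
```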
Fixed by #172249 from the CUDA plugin.