[SPIRV] Use AMDGPU ABI for AMDGCN flavoured SPIRV #169865
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Alex Voicu (AlexVlx)

Changes

At the moment AMDGCN flavoured SPIRV uses the SPIRV ABI with some tweaks revolving around passing aggregates as direct. This is problematic in multiple ways:
This patch switches over AMDGCN flavoured SPIRV to implement the AMDGPU ABI (except for dealing with variadic functions, which will be added in the future). One additional complication (and the primary motivation behind the current, less than ideal, state of affairs) stems from

Patch is 50.43 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/169865.diff

6 Files Affected:
diff --git a/clang/lib/CodeGen/Targets/SPIR.cpp b/clang/lib/CodeGen/Targets/SPIR.cpp
index 1a8c85d8871ec..3540093074bfe 100644
--- a/clang/lib/CodeGen/Targets/SPIR.cpp
+++ b/clang/lib/CodeGen/Targets/SPIR.cpp
@@ -9,6 +9,11 @@
#include "ABIInfoImpl.h"
#include "HLSLBufferLayoutBuilder.h"
#include "TargetInfo.h"
+#include "clang/Basic/LangOptions.h"
+#include "llvm/IR/DerivedTypes.h"
+
+#include <stdint.h>
+#include <utility>
using namespace clang;
using namespace clang::CodeGen;
@@ -33,9 +38,41 @@ class SPIRVABIInfo : public CommonSPIRABIInfo {
void computeInfo(CGFunctionInfo &FI) const override;
private:
+ ABIArgInfo classifyKernelArgumentType(QualType Ty) const;
+};
+
+class AMDGCNSPIRVABIInfo : public SPIRVABIInfo {
+ // TODO: this should be unified / shared with AMDGPU, ideally we'd like to
+ // re-use AMDGPUABIInfo eventually, rather than duplicate.
+ static constexpr unsigned MaxNumRegsForArgsRet = 16; // 16 32-bit registers
+ mutable unsigned NumRegsLeft = 0;
+
+ unsigned numRegsForType(QualType Ty) const;
+
+ bool isHomogeneousAggregateBaseType(QualType Ty) const override {
+ return true;
+ }
+ bool isHomogeneousAggregateSmallEnough(const Type *Base,
+ uint64_t Members) const override {
+ uint32_t NumRegs = (getContext().getTypeSize(Base) + 31) / 32;
+
+ // Homogeneous Aggregates may occupy at most 16 registers.
+ return Members * NumRegs <= MaxNumRegsForArgsRet;
+ }
+
+ // Coerce HIP scalar pointer arguments from generic pointers to global ones.
+ llvm::Type *coerceKernelArgumentType(llvm::Type *Ty, unsigned FromAS,
+ unsigned ToAS) const;
+
ABIArgInfo classifyReturnType(QualType RetTy) const;
ABIArgInfo classifyKernelArgumentType(QualType Ty) const;
ABIArgInfo classifyArgumentType(QualType Ty) const;
+public:
+ AMDGCNSPIRVABIInfo(CodeGenTypes &CGT) : SPIRVABIInfo(CGT) {}
+ void computeInfo(CGFunctionInfo &FI) const override;
+
+ llvm::FixedVectorType *getOptimalVectorMemoryType(
+ llvm::FixedVectorType *Ty, const LangOptions &LangOpt) const override;
};
} // end anonymous namespace
namespace {
@@ -83,7 +120,10 @@ class CommonSPIRTargetCodeGenInfo : public TargetCodeGenInfo {
class SPIRVTargetCodeGenInfo : public CommonSPIRTargetCodeGenInfo {
public:
SPIRVTargetCodeGenInfo(CodeGen::CodeGenTypes &CGT)
- : CommonSPIRTargetCodeGenInfo(std::make_unique<SPIRVABIInfo>(CGT)) {}
+ : CommonSPIRTargetCodeGenInfo(
+ (CGT.getTarget().getTriple().getVendor() == llvm::Triple::AMD)
+ ? std::make_unique<AMDGCNSPIRVABIInfo>(CGT)
+ : std::make_unique<SPIRVABIInfo>(CGT)) {}
void setCUDAKernelCallingConvention(const FunctionType *&FT) const override;
LangAS getGlobalVarAddressSpace(CodeGenModule &CGM,
const VarDecl *D) const override;
@@ -132,25 +172,6 @@ void CommonSPIRABIInfo::setCCs() {
RuntimeCC = llvm::CallingConv::SPIR_FUNC;
}
-ABIArgInfo SPIRVABIInfo::classifyReturnType(QualType RetTy) const {
- if (getTarget().getTriple().getVendor() != llvm::Triple::AMD)
- return DefaultABIInfo::classifyReturnType(RetTy);
- if (!isAggregateTypeForABI(RetTy) || getRecordArgABI(RetTy, getCXXABI()))
- return DefaultABIInfo::classifyReturnType(RetTy);
-
- if (const auto *RD = RetTy->getAsRecordDecl();
- RD && RD->hasFlexibleArrayMember())
- return DefaultABIInfo::classifyReturnType(RetTy);
-
- // TODO: The AMDGPU ABI is non-trivial to represent in SPIR-V; in order to
- // avoid encoding various architecture specific bits here we return everything
- // as direct to retain type info for things like aggregates, for later perusal
- // when translating back to LLVM/lowering in the BE. This is also why we
- // disable flattening as the outcomes can mismatch between SPIR-V and AMDGPU.
- // This will be revisited / optimised in the future.
- return ABIArgInfo::getDirect(CGT.ConvertType(RetTy), 0u, nullptr, false);
-}
-
ABIArgInfo SPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
if (getContext().getLangOpts().isTargetDevice()) {
// Coerce pointer arguments with default address space to CrossWorkGroup
@@ -167,18 +188,6 @@ ABIArgInfo SPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
}
if (isAggregateTypeForABI(Ty)) {
- if (getTarget().getTriple().getVendor() == llvm::Triple::AMD)
- // TODO: The AMDGPU kernel ABI passes aggregates byref, which is not
- // currently expressible in SPIR-V; SPIR-V passes aggregates byval,
- // which the AMDGPU kernel ABI does not allow. Passing aggregates as
- // direct works around this impedance mismatch, as it retains type info
- // and can be correctly handled, post reverse-translation, by the AMDGPU
- // BE, which has to support this CC for legacy OpenCL purposes. It can
- // be brittle and does lead to performance degradation in certain
- // pathological cases. This will be revisited / optimised in the future,
- // once a way to deal with the byref/byval impedance mismatch is
- // identified.
- return ABIArgInfo::getDirect(LTy, 0, nullptr, false);
// Force copying aggregate type in kernel arguments by value when
// compiling CUDA targeting SPIR-V. This is required for the object
// copied to be valid on the device.
@@ -193,11 +202,150 @@ ABIArgInfo SPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
return classifyArgumentType(Ty);
}
-ABIArgInfo SPIRVABIInfo::classifyArgumentType(QualType Ty) const {
- if (getTarget().getTriple().getVendor() != llvm::Triple::AMD)
- return DefaultABIInfo::classifyArgumentType(Ty);
- if (!isAggregateTypeForABI(Ty))
- return DefaultABIInfo::classifyArgumentType(Ty);
+void SPIRVABIInfo::computeInfo(CGFunctionInfo &FI) const {
+ // The logic is same as in DefaultABIInfo with an exception on the kernel
+ // arguments handling.
+ llvm::CallingConv::ID CC = FI.getCallingConvention();
+
+ if (!getCXXABI().classifyReturnType(FI))
+ FI.getReturnInfo() = classifyReturnType(FI.getReturnType());
+
+ for (auto &I : FI.arguments()) {
+ if (CC == llvm::CallingConv::SPIR_KERNEL) {
+ I.info = classifyKernelArgumentType(I.type);
+ } else {
+ I.info = classifyArgumentType(I.type);
+ }
+ }
+}
+
+unsigned AMDGCNSPIRVABIInfo::numRegsForType(QualType Ty) const {
+ // This duplicates the AMDGPUABI computation.
+ unsigned NumRegs = 0;
+
+ if (const VectorType *VT = Ty->getAs<VectorType>()) {
+ // Compute from the number of elements. The reported size is based on the
+ // in-memory size, which includes the padding 4th element for 3-vectors.
+ QualType EltTy = VT->getElementType();
+ unsigned EltSize = getContext().getTypeSize(EltTy);
+
+ // 16-bit element vectors should be passed as packed.
+ if (EltSize == 16)
+ return (VT->getNumElements() + 1) / 2;
+
+ unsigned EltNumRegs = (EltSize + 31) / 32;
+ return EltNumRegs * VT->getNumElements();
+ }
+
+ if (const auto *RD = Ty->getAsRecordDecl()) {
+ assert(!RD->hasFlexibleArrayMember());
+
+ for (const FieldDecl *Field : RD->fields()) {
+ QualType FieldTy = Field->getType();
+ NumRegs += numRegsForType(FieldTy);
+ }
+
+ return NumRegs;
+ }
+
+ return (getContext().getTypeSize(Ty) + 31) / 32;
+}
+
+llvm::Type *
+AMDGCNSPIRVABIInfo::coerceKernelArgumentType(llvm::Type *Ty, unsigned FromAS,
+ unsigned ToAS) const {
+ // Single value types.
+ auto *PtrTy = llvm::dyn_cast<llvm::PointerType>(Ty);
+ if (PtrTy && PtrTy->getAddressSpace() == FromAS)
+ return llvm::PointerType::get(Ty->getContext(), ToAS);
+ return Ty;
+}
+
+ABIArgInfo AMDGCNSPIRVABIInfo::classifyReturnType(QualType RetTy) const {
+ if (!isAggregateTypeForABI(RetTy) || getRecordArgABI(RetTy, getCXXABI()))
+ return DefaultABIInfo::classifyReturnType(RetTy);
+
+ // Ignore empty structs/unions.
+ if (isEmptyRecord(getContext(), RetTy, true))
+ return ABIArgInfo::getIgnore();
+
+ // Lower single-element structs to just return a regular value.
+ if (const Type *SeltTy = isSingleElementStruct(RetTy, getContext()))
+ return ABIArgInfo::getDirect(CGT.ConvertType(QualType(SeltTy, 0)));
+
+ if (const auto *RD = RetTy->getAsRecordDecl();
+ RD && RD->hasFlexibleArrayMember())
+ return DefaultABIInfo::classifyReturnType(RetTy);
+
+ // Pack aggregates <= 4 bytes into single VGPR or pair.
+ uint64_t Size = getContext().getTypeSize(RetTy);
+ if (Size <= 16)
+ return ABIArgInfo::getDirect(llvm::Type::getInt16Ty(getVMContext()));
+
+ if (Size <= 32)
+ return ABIArgInfo::getDirect(llvm::Type::getInt32Ty(getVMContext()));
+
+ // TODO: This carried over from AMDGPU oddity, we retain it to
+ // ensure consistency, but it might be reasonable to return Int64.
+ if (Size <= 64) {
+ llvm::Type *I32Ty = llvm::Type::getInt32Ty(getVMContext());
+ return ABIArgInfo::getDirect(llvm::ArrayType::get(I32Ty, 2));
+ }
+
+ if (numRegsForType(RetTy) <= MaxNumRegsForArgsRet)
+ return ABIArgInfo::getDirect();
+ return DefaultABIInfo::classifyReturnType(RetTy);
+}
+
+/// For kernels all parameters are really passed in a special buffer. It doesn't
+/// make sense to pass anything byval, so everything must be direct.
+ABIArgInfo AMDGCNSPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
+ Ty = useFirstFieldIfTransparentUnion(Ty);
+
+ // TODO: Can we omit empty structs?
+
+ if (const Type *SeltTy = isSingleElementStruct(Ty, getContext()))
+ Ty = QualType(SeltTy, 0);
+
+ llvm::Type *OrigLTy = CGT.ConvertType(Ty);
+ llvm::Type *LTy = OrigLTy;
+ if (getContext().getLangOpts().isTargetDevice()) {
+ LTy = coerceKernelArgumentType(
+ OrigLTy, /*FromAS=*/getContext().getTargetAddressSpace(LangAS::Default),
+ /*ToAS=*/getContext().getTargetAddressSpace(LangAS::opencl_global));
+ }
+
+ // FIXME: This doesn't apply the optimization of coercing pointers in structs
+ // to global address space when using byref. This would require implementing a
+ // new kind of coercion of the in-memory type when for indirect arguments.
+ if (LTy == OrigLTy && isAggregateTypeForABI(Ty)) {
+ return ABIArgInfo::getIndirectAliased(
+ getContext().getTypeAlignInChars(Ty),
+ getContext().getTargetAddressSpace(LangAS::opencl_constant),
+ false /*Realign*/, nullptr /*Padding*/);
+ }
+
+ // TODO: inhibiting flattening is an AMDGPU workaround for Clover, which might
+ // be vestigial and should be revisited.
+ return ABIArgInfo::getDirect(LTy, 0, nullptr, false);
+}
+
+ABIArgInfo AMDGCNSPIRVABIInfo::classifyArgumentType(QualType Ty) const {
+ assert(NumRegsLeft <= MaxNumRegsForArgsRet && "register estimate underflow");
+
+ Ty = useFirstFieldIfTransparentUnion(Ty);
+
+ // TODO: support for variadics.
+
+ if (!isAggregateTypeForABI(Ty)) {
+ ABIArgInfo ArgInfo = DefaultABIInfo::classifyArgumentType(Ty);
+ if (!ArgInfo.isIndirect()) {
+ unsigned NumRegs = numRegsForType(Ty);
+ NumRegsLeft -= std::min(NumRegs, NumRegsLeft);
+ }
+
+ return ArgInfo;
+ }
// Records with non-trivial destructors/copy-constructors should not be
// passed by value.
@@ -205,37 +353,87 @@ ABIArgInfo SPIRVABIInfo::classifyArgumentType(QualType Ty) const {
return getNaturalAlignIndirect(Ty, getDataLayout().getAllocaAddrSpace(),
RAA == CGCXXABI::RAA_DirectInMemory);
+ // Ignore empty structs/unions.
+ if (isEmptyRecord(getContext(), Ty, true))
+ return ABIArgInfo::getIgnore();
+
+ // Lower single-element structs to just pass a regular value. TODO: We
+ // could do reasonable-size multiple-element structs too, using getExpand(),
+ // though watch out for things like bitfields.
+ if (const Type *SeltTy = isSingleElementStruct(Ty, getContext()))
+ return ABIArgInfo::getDirect(CGT.ConvertType(QualType(SeltTy, 0)));
+
if (const auto *RD = Ty->getAsRecordDecl();
RD && RD->hasFlexibleArrayMember())
return DefaultABIInfo::classifyArgumentType(Ty);
- return ABIArgInfo::getDirect(CGT.ConvertType(Ty), 0u, nullptr, false);
+ uint64_t Size = getContext().getTypeSize(Ty);
+ if (Size <= 64) {
+ // Pack aggregates <= 8 bytes into single VGPR or pair.
+ unsigned NumRegs = (Size + 31) / 32;
+ NumRegsLeft -= std::min(NumRegsLeft, NumRegs);
+
+ if (Size <= 16)
+ return ABIArgInfo::getDirect(llvm::Type::getInt16Ty(getVMContext()));
+
+ if (Size <= 32)
+ return ABIArgInfo::getDirect(llvm::Type::getInt32Ty(getVMContext()));
+
+ // TODO: This is an AMDGPU oddity, and might be vestigial, we retain it to
+ // ensure consistency, but it should be revisited.
+ llvm::Type *I32Ty = llvm::Type::getInt32Ty(getVMContext());
+ return ABIArgInfo::getDirect(llvm::ArrayType::get(I32Ty, 2));
+ }
+
+ if (NumRegsLeft > 0) {
+ unsigned NumRegs = numRegsForType(Ty);
+ if (NumRegsLeft >= NumRegs) {
+ NumRegsLeft -= NumRegs;
+ return ABIArgInfo::getDirect();
+ }
+ }
+
+ // Use pass-by-reference in stead of pass-by-value for struct arguments in
+ // function ABI.
+ return ABIArgInfo::getIndirectAliased(
+ getContext().getTypeAlignInChars(Ty),
+ getContext().getTargetAddressSpace(LangAS::opencl_private));
}
-void SPIRVABIInfo::computeInfo(CGFunctionInfo &FI) const {
- // The logic is same as in DefaultABIInfo with an exception on the kernel
- // arguments handling.
+void AMDGCNSPIRVABIInfo::computeInfo(CGFunctionInfo &FI) const {
llvm::CallingConv::ID CC = FI.getCallingConvention();
if (!getCXXABI().classifyReturnType(FI))
FI.getReturnInfo() = classifyReturnType(FI.getReturnType());
+ NumRegsLeft = MaxNumRegsForArgsRet;
for (auto &I : FI.arguments()) {
- if (CC == llvm::CallingConv::SPIR_KERNEL) {
+ if (CC == llvm::CallingConv::SPIR_KERNEL)
I.info = classifyKernelArgumentType(I.type);
- } else {
+ else
I.info = classifyArgumentType(I.type);
- }
}
}
+llvm::FixedVectorType *AMDGCNSPIRVABIInfo::getOptimalVectorMemoryType(
+ llvm::FixedVectorType *Ty, const LangOptions &LangOpt) const {
+ // AMDGPU has legal instructions for 96-bit so 3x32 can be supported.
+ if (Ty->getNumElements() == 3 && getDataLayout().getTypeSizeInBits(Ty) == 96)
+ return Ty;
+ return DefaultABIInfo::getOptimalVectorMemoryType(Ty, LangOpt);
+}
+
namespace clang {
namespace CodeGen {
void computeSPIRKernelABIInfo(CodeGenModule &CGM, CGFunctionInfo &FI) {
- if (CGM.getTarget().getTriple().isSPIRV())
- SPIRVABIInfo(CGM.getTypes()).computeInfo(FI);
- else
+ if (CGM.getTarget().getTriple().isSPIRV()) {
+ if (CGM.getTarget().getTriple().getVendor() == llvm::Triple::AMD)
+ AMDGCNSPIRVABIInfo(CGM.getTypes()).computeInfo(FI);
+ else
+ SPIRVABIInfo(CGM.getTypes()).computeInfo(FI);
+ } else {
CommonSPIRABIInfo(CGM.getTypes()).computeInfo(FI);
+ }
}
}
}
diff --git a/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu b/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
index a48affaec3c8a..bf45a353851b4 100644
--- a/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
+++ b/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
@@ -95,7 +95,7 @@ __global__ void kernel1(int *x) {
// CHECK-NEXT: store ptr addrspace(1) [[X_COERCE]], ptr [[X_ASCAST]], align 8
// CHECK-NEXT: [[X1:%.*]] = load ptr, ptr [[X_ASCAST]], align 8
// CHECK-NEXT: store ptr [[X1]], ptr [[X_ADDR_ASCAST]], align 8
-// CHECK-NEXT: [[TMP0:%.*]] = load ptr, ptr [[X_ADDR_ASCAST]], align 8
+// CHECK-NEXT: [[TMP0:%.*]] = load ptr, ptr [[X_ADDR_ASCAST]], align 8, !nonnull [[META4:![0-9]+]], !align [[META5:![0-9]+]]
// CHECK-NEXT: [[TMP1:%.*]] = load i32, ptr [[TMP0]], align 4
// CHECK-NEXT: [[INC:%.*]] = add nsw i32 [[TMP1]], 1
// CHECK-NEXT: store i32 [[INC]], ptr [[TMP0]], align 4
@@ -111,7 +111,7 @@ __global__ void kernel1(int *x) {
// CHECK-SPIRV-NEXT: store ptr addrspace(1) [[X_COERCE]], ptr addrspace(4) [[X_ASCAST]], align 8
// CHECK-SPIRV-NEXT: [[X1:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X_ASCAST]], align 8
// CHECK-SPIRV-NEXT: store ptr addrspace(4) [[X1]], ptr addrspace(4) [[X_ADDR_ASCAST]], align 8
-// CHECK-SPIRV-NEXT: [[TMP0:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X_ADDR_ASCAST]], align 8
+// CHECK-SPIRV-NEXT: [[TMP0:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X_ADDR_ASCAST]], align 8, !align [[META6:![0-9]+]]
// CHECK-SPIRV-NEXT: [[TMP1:%.*]] = load i32, ptr addrspace(4) [[TMP0]], align 4
// CHECK-SPIRV-NEXT: [[INC:%.*]] = add nsw i32 [[TMP1]], 1
// CHECK-SPIRV-NEXT: store i32 [[INC]], ptr addrspace(4) [[TMP0]], align 4
@@ -302,28 +302,23 @@ struct S {
// CHECK-NEXT: ret void
//
// CHECK-SPIRV-LABEL: define spir_kernel void @_Z7kernel41S(
-// CHECK-SPIRV-SAME: [[STRUCT_S:%.*]] [[S_COERCE:%.*]]) addrspace(4) #[[ATTR0]] !max_work_group_size [[META5]] {
+// CHECK-SPIRV-SAME: ptr addrspace(2) noundef byref([[STRUCT_S:%.*]]) align 8 [[TMP0:%.*]]) addrspace(4) #[[ATTR0]] !max_work_group_size [[META5]] {
// CHECK-SPIRV-NEXT: [[ENTRY:.*:]]
-// CHECK-SPIRV-NEXT: [[S:%.*]] = alloca [[STRUCT_S]], align 8
-// CHECK-SPIRV-NEXT: [[S1:%.*]] = addrspacecast ptr [[S]] to ptr addrspace(4)
-// CHECK-SPIRV-NEXT: [[TMP0:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 0
-// CHECK-SPIRV-NEXT: [[TMP1:%.*]] = extractvalue [[STRUCT_S]] [[S_COERCE]], 0
-// CHECK-SPIRV-NEXT: store ptr addrspace(4) [[TMP1]], ptr addrspace(4) [[TMP0]], align 8
-// CHECK-SPIRV-NEXT: [[TMP2:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 1
-// CHECK-SPIRV-NEXT: [[TMP3:%.*]] = extractvalue [[STRUCT_S]] [[S_COERCE]], 1
-// CHECK-SPIRV-NEXT: store ptr addrspace(4) [[TMP3]], ptr addrspace(4) [[TMP2]], align 8
-// CHECK-SPIRV-NEXT: [[X:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 0
-// CHECK-SPIRV-NEXT: [[TMP4:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X]], align 8
-// CHECK-SPIRV-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP4]], i64 0
-// CHECK-SPIRV-NEXT: [[TMP5:%.*]] = load i32, ptr addrspace(4) [[ARRAYIDX]], align 4
-// CHECK-SPIRV-NEXT: [[INC:%.*]] = add nsw i32 [[TMP5]], 1
+// CHECK-SPIRV-NEXT: [[COERCE:%.*]] = alloca [[STRUCT_S]], align 8
+// CHECK-SPIRV-NEXT: [[S:%.*]] = addrspacecast ptr [[COERCE]] to ptr addrspace(4)
+// CHECK-SPIRV-NEXT: call addrspace(4) void @llvm.memcpy.p4.p2.i64(ptr addrspace(4) align 8 [[S]], ptr addrspace(2) align 8 [[TMP0]], i64 16, i1 false)
+// CHECK-SPIRV-NEXT: [[X:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S]], i32 0, i32 0
+// CHECK-SPIRV-NEXT: [[TMP1:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X]], align 8
+// CHECK-SPIRV-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP1]], i64 0
+// CHECK-SPIRV-NEXT: [[TMP2:%.*]] = load i32, ptr addrspace(4) [[ARRAYIDX]], align 4
+// CHECK-SPIRV-NEXT: [[INC:%.*]] = add nsw i32 [[TMP2]], 1
// CHECK-SPIRV-NEXT: store i32 [[INC]], ptr addrspace(4) [[ARRAYIDX]], align 4
-// CHECK-SPIRV-NEXT: [[Y:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 1
-// CHECK-SPIRV-NEXT: [[TMP6:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[Y]], align 8
-// CHECK-SPIRV-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds float, ptr addrspace(4) [[TMP6]], i64 0
-// CHECK-SPIRV-NEXT: [[TMP7:%.*]] = load float, ptr addrspace(4) [[ARRAYIDX2]], align 4
-// CHECK-SPIRV-NEXT: [[ADD:%.*]] = fadd contract float [[TMP7]], 1.000000e+00
-// CHECK-SPIRV-NEXT: store float [[ADD]], ptr addrspace(4) [[ARRAYIDX2]], align 4
+// CHECK-SPIRV-NEXT: [[Y:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S]], i32 0, i32 1
+// CHECK-SPIRV-NEXT: [[TMP3:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[Y]], align 8
+// CHECK-SPIRV-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds float, ptr addrspace(4) [[TMP3]], i64 0
+// CHECK-SPIRV-NEXT: [[TMP4:%.*]] = load float, ptr addrspace(4) [[ARRAYIDX1]], align 4
+// CHECK-SPIRV-NEXT: [[ADD:%.*]] = fadd contract float [[TMP4]], 1.000000e+00
+// CHECK-SPIRV-NEXT: store float [[ADD]], ptr addrspace(4) [[ARRAYIDX1]], align 4
// CHECK-SPIRV-NEXT: ...
[truncated]
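
For orientation, here is a minimal sketch (not part of the patch) of what the ABI switch means for a small aggregate argument, assuming a spirv64-amd-amdhsa target with this change applied. The struct and function names are made up for the example, and the IR spellings in the comments paraphrase the classification logic and CHECK-SPIRV lines shown in the diff above rather than quoting actual compiler output.

```c++
// Illustrative sketch only; names (Pair, sum, kern) are hypothetical.
#include <hip/hip_runtime.h>

struct Pair {
  int X;
  int Y; // 8 bytes, trivially copyable, no flexible array member
};

// Device function argument: previously passed direct as %struct.Pair with
// flattening disabled; under the AMDGPU-style classifyArgumentType an
// aggregate of <= 64 bits is coerced to [2 x i32]. Larger aggregates that
// exceed the remaining 16-register budget would instead be passed
// indirectly (byref-style) in the private address space.
__device__ int sum(Pair P) { return P.X + P.Y; }

// Kernel argument: previously passed direct as %struct.Pair; with the new
// classifyKernelArgumentType it is passed byref in the constant
// (opencl_constant) address space, matching the amdgpu_kernel ABI, as in
// the updated CHECK-SPIRV lines above (ptr addrspace(2) byref(%struct.S)).
__global__ void kern(Pair P, int *Out) { *Out = P.X - P.Y; }
```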
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
maarquitos14
left a comment
SPIRV part LGTM, just one nit.
        SPIRV::Decoration::FuncParamAttr, {Attr});
      }
-     if (Arg.hasAttribute(Attribute::ByVal)) {
+     if (Arg.hasAttribute(Attribute::ByVal) ||
Can we add a comment here explaining why you need this for AMD, so that people can easily understand when reading this code? I fear people might not understand, or might simply think it's wrong, to lower ByRef to ByVal.
Done, thank you for the suggestion.
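For readers skimming the thread, a minimal sketch of the shape of the check being discussed; the helper name is invented, and the real code lives in the SPIR-V backend's argument lowering, so treat this as illustrative only:

```cpp
#include "llvm/IR/Argument.h"
#include "llvm/IR/Attributes.h"

// Both `byval` and `byref` arguments map to the SPIR-V FuncParamAttr ByVal
// decoration, because `byref` (used by the AMDGPU kernel ABI) has no SPIR-V
// equivalent; reverse translation restores `byref` for AMDGCN-flavoured
// modules.
static bool wantsByValFuncParamAttr(const llvm::Argument &Arg) {
  return Arg.hasAttribute(llvm::Attribute::ByVal) ||
         Arg.hasAttribute(llvm::Attribute::ByRef);
}
```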
jmmartinez
left a comment
It looks good to me (but I'm also not super familiar with the ABI side)
Keenuts
left a comment
LGTM for the shared SPIR-V part. Glanced at the AMD part, code LGTM, but don't know about the ABI logic so haven't checked that.
efriedma-quic
left a comment
Is there any possibility this impacts interop between AMDGCN flavored SPIRV and generic SPIRV? Or is that not a thing?
MrSidims
left a comment
SPIR-V-related part is LGTM.
Not a thing, we don't compose them. If we were to, we'd have to constrain the ABI to unambiguous cases (so, probably, no aggregates), but that's not under active consideration at the moment.
+#include "clang/Basic/LangOptions.h"
+#include "llvm/IR/DerivedTypes.h"
+
+#include <stdint.h>
Suggested change:
-#include <stdint.h>
+#include <cstdint>
But I'd be surprised you need this?
Conventionally, cstdint is for std::foo_t and stdint.h is for foo_t without the namespace. The convention is weak and often broken, but it exists. As for needing it: there are uint64_t uses, and whilst cstdint / stdint.h is extremely likely to get included transitively, I thought we'd want to keep files as self-sufficient as possible.
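A small standalone illustration of that convention (not part of the patch):

```cpp
#include <cstdint>  // guarantees the names in namespace std (std::uint64_t)
#include <stdint.h> // guarantees the names in the global namespace (uint64_t)

std::uint64_t FromCstdint = 0; // provided by <cstdint>
uint64_t FromStdintH = 0;      // provided by <stdint.h>
```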
+  return Ty;
+}
+
+ABIArgInfo AMDGCNSPIRVABIInfo::classifyReturnType(QualType RetTy) const {
Can we do something better than duplicate all of this AMDGPU stuff?
It's not immediately apparent how to, without refactoring the way the ABIInfo implementations are currently structured (their internal types are inaccessible to other TUs). We could consider hoisting the shared implementation into a file that both AMDGPU and SPIRV include, but I didn't want to do that here since it's more intrusive. If you have another suggestion, please share; it's more than welcome.
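A hypothetical sketch of what that hoisting could look like; the header name, namespace, and helper below are invented for illustration and are not part of this patch:

```cpp
// AMDGPUABIHelpers.h (hypothetical): shared between AMDGPU.cpp and SPIR.cpp
// so the register-budget logic is written once.
#pragma once

#include <cstdint>

namespace clang::CodeGen::amdgpu_abi {

// 16 32-bit registers are available for passing arguments / return values.
inline constexpr unsigned MaxNumRegsForArgsRet = 16;

// Would a homogeneous aggregate of `Members` elements, each `BaseSizeInBits`
// wide, fit within the argument-register budget?
inline bool fitsInArgRegs(uint64_t BaseSizeInBits, uint64_t Members) {
  uint64_t NumRegs = (BaseSizeInBits + 31) / 32;
  return Members * NumRegs <= MaxNumRegsForArgsRet;
}

} // namespace clang::CodeGen::amdgpu_abi
```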
At the moment AMDGCN flavoured SPIRV uses the SPIRV ABI with some tweaks revolving around passing aggregates as direct. This is problematic in multiple ways:

- it leads to divergence from code compiled for a concrete target, which makes it difficult to debug;
- it incurs a run time cost, when dealing with larger aggregates;
- it incurs a compile time cost, when dealing with larger aggregates.

This patch switches over AMDGCN flavoured SPIRV to implement the AMDGPU ABI (except for dealing with variadic functions, which will be added in the future). One additional complication (and the primary motivation behind the current less than ideal state of affairs) stems from `byref`, which AMDGPU uses, not being expressible in SPIR-V. We deal with this by CodeGen-ing for `byref`, lowering it to the `FuncParamAttr ByVal` in SPIR-V, and restoring it when doing reverse translation from AMDGCN flavoured SPIR-V.
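As a source-level illustration, here is a HIP-style sketch modelled on the kernel4 / struct S case in the updated test above; the source itself is unaffected, only the way the aggregate argument is passed changes (indirectly via `byref`, lowered to FuncParamAttr ByVal in SPIR-V, rather than as a direct aggregate):

```cpp
struct S {
  int *x;
  float *y;
};

// Under the AMDGPU ABI this aggregate kernel argument is now passed
// indirectly (`byref`) instead of being expanded into a direct aggregate
// value, matching the updated CHECK-SPIRV lines above.
__global__ void kernel4(S s) {
  s.x[0] += 1;
  s.y[0] += 1.0f;
}
```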