
Conversation

@AlexVlx
Contributor

@AlexVlx AlexVlx commented Nov 28, 2025

At the moment, AMDGCN flavoured SPIR-V uses the SPIR-V ABI with some tweaks revolving around passing aggregates as direct. This is problematic in multiple ways:

  • it leads to divergence from code compiled for a concrete target, which makes it difficult to debug;
  • it incurs a run-time cost when dealing with larger aggregates;
  • it incurs a compile-time cost when dealing with larger aggregates.

This patch switches AMDGCN flavoured SPIR-V over to implementing the AMDGPU ABI (except for variadic functions, which will be added in the future). One additional complication (and the primary motivation behind the current less-than-ideal state of affairs) stems from byref, which AMDGPU uses, not being expressible in SPIR-V. We deal with this by CodeGen-ing for byref, lowering it to the FuncParamAttr ByVal in SPIR-V, and restoring it when reverse-translating from AMDGCN flavoured SPIR-V.
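
As a rough illustration of the change for aggregate kernel arguments, consider the hypothetical HIP kernel below (not part of the patch; the struct and kernel names are invented). Previously the aggregate reached the kernel as a direct value that was unpacked field by field; with this patch it arrives byref through the constant address space and is copied into a local alloca on entry, matching concrete AMDGPU targets (see the updated amdgpu-kernel-arg-pointer-type.cu checks in the diff below).

    // Hypothetical HIP example, for illustration only.
    struct Pair {
      int   *xs;
      float *ys;
    };

    __global__ void scale(Pair p, int n) {
      // Old AMDGCN SPIR-V ABI: `p` is a direct aggregate parameter.
      // New AMDGCN SPIR-V ABI: `p` is `ptr addrspace(2) byref(%struct.Pair)`,
      // memcpy'd into a local copy on entry, as on concrete AMDGPU targets.
      for (int i = 0; i != n; ++i)
        p.ys[i] += static_cast<float>(p.xs[i]);
    }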

@llvmbot llvmbot added backend:AMDGPU clang:codegen IR generation bugs: mangling, exceptions, etc. backend:SPIR-V labels Nov 28, 2025
@llvmbot
Member

llvmbot commented Nov 28, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Alex Voicu (AlexVlx)



Patch is 50.43 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/169865.diff

6 Files Affected:

  • (modified) clang/lib/CodeGen/Targets/SPIR.cpp (+245-47)
  • (modified) clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu (+71-73)
  • (modified) clang/test/CodeGenCUDA/kernel-args.cu (+4-4)
  • (added) clang/test/CodeGenHIP/amdgcnspirv-uses-amdgpu-abi.cpp (+321)
  • (modified) llvm/lib/Target/SPIRV/SPIRVCallLowering.cpp (+4-1)
  • (added) llvm/test/CodeGen/SPIRV/pointers/ptr-argument-byref-amdgcnspirv.ll (+24)
diff --git a/clang/lib/CodeGen/Targets/SPIR.cpp b/clang/lib/CodeGen/Targets/SPIR.cpp
index 1a8c85d8871ec..3540093074bfe 100644
--- a/clang/lib/CodeGen/Targets/SPIR.cpp
+++ b/clang/lib/CodeGen/Targets/SPIR.cpp
@@ -9,6 +9,11 @@
 #include "ABIInfoImpl.h"
 #include "HLSLBufferLayoutBuilder.h"
 #include "TargetInfo.h"
+#include "clang/Basic/LangOptions.h"
+#include "llvm/IR/DerivedTypes.h"
+
+#include <stdint.h>
+#include <utility>
 
 using namespace clang;
 using namespace clang::CodeGen;
@@ -33,9 +38,41 @@ class SPIRVABIInfo : public CommonSPIRABIInfo {
   void computeInfo(CGFunctionInfo &FI) const override;
 
 private:
+  ABIArgInfo classifyKernelArgumentType(QualType Ty) const;
+};
+
+class AMDGCNSPIRVABIInfo : public SPIRVABIInfo {
+  // TODO: this should be unified / shared with AMDGPU, ideally we'd like to
+  //       re-use AMDGPUABIInfo eventually, rather than duplicate.
+  static constexpr unsigned MaxNumRegsForArgsRet = 16; // 16 32-bit registers
+  mutable unsigned NumRegsLeft = 0;
+
+  unsigned numRegsForType(QualType Ty) const;
+
+  bool isHomogeneousAggregateBaseType(QualType Ty) const override {
+    return true;
+  }
+  bool isHomogeneousAggregateSmallEnough(const Type *Base,
+                                         uint64_t Members) const override {
+    uint32_t NumRegs = (getContext().getTypeSize(Base) + 31) / 32;
+
+    // Homogeneous Aggregates may occupy at most 16 registers.
+    return Members * NumRegs <= MaxNumRegsForArgsRet;
+  }
+
+  // Coerce HIP scalar pointer arguments from generic pointers to global ones.
+  llvm::Type *coerceKernelArgumentType(llvm::Type *Ty, unsigned FromAS,
+                                       unsigned ToAS) const;
+
   ABIArgInfo classifyReturnType(QualType RetTy) const;
   ABIArgInfo classifyKernelArgumentType(QualType Ty) const;
   ABIArgInfo classifyArgumentType(QualType Ty) const;
+public:
+  AMDGCNSPIRVABIInfo(CodeGenTypes &CGT) : SPIRVABIInfo(CGT) {}
+  void computeInfo(CGFunctionInfo &FI) const override;
+
+  llvm::FixedVectorType *getOptimalVectorMemoryType(
+      llvm::FixedVectorType *Ty, const LangOptions &LangOpt) const override;
 };
 } // end anonymous namespace
 namespace {
@@ -83,7 +120,10 @@ class CommonSPIRTargetCodeGenInfo : public TargetCodeGenInfo {
 class SPIRVTargetCodeGenInfo : public CommonSPIRTargetCodeGenInfo {
 public:
   SPIRVTargetCodeGenInfo(CodeGen::CodeGenTypes &CGT)
-      : CommonSPIRTargetCodeGenInfo(std::make_unique<SPIRVABIInfo>(CGT)) {}
+      : CommonSPIRTargetCodeGenInfo(
+            (CGT.getTarget().getTriple().getVendor() == llvm::Triple::AMD)
+                ? std::make_unique<AMDGCNSPIRVABIInfo>(CGT)
+                : std::make_unique<SPIRVABIInfo>(CGT)) {}
   void setCUDAKernelCallingConvention(const FunctionType *&FT) const override;
   LangAS getGlobalVarAddressSpace(CodeGenModule &CGM,
                                   const VarDecl *D) const override;
@@ -132,25 +172,6 @@ void CommonSPIRABIInfo::setCCs() {
   RuntimeCC = llvm::CallingConv::SPIR_FUNC;
 }
 
-ABIArgInfo SPIRVABIInfo::classifyReturnType(QualType RetTy) const {
-  if (getTarget().getTriple().getVendor() != llvm::Triple::AMD)
-    return DefaultABIInfo::classifyReturnType(RetTy);
-  if (!isAggregateTypeForABI(RetTy) || getRecordArgABI(RetTy, getCXXABI()))
-    return DefaultABIInfo::classifyReturnType(RetTy);
-
-  if (const auto *RD = RetTy->getAsRecordDecl();
-      RD && RD->hasFlexibleArrayMember())
-    return DefaultABIInfo::classifyReturnType(RetTy);
-
-  // TODO: The AMDGPU ABI is non-trivial to represent in SPIR-V; in order to
-  // avoid encoding various architecture specific bits here we return everything
-  // as direct to retain type info for things like aggregates, for later perusal
-  // when translating back to LLVM/lowering in the BE. This is also why we
-  // disable flattening as the outcomes can mismatch between SPIR-V and AMDGPU.
-  // This will be revisited / optimised in the future.
-  return ABIArgInfo::getDirect(CGT.ConvertType(RetTy), 0u, nullptr, false);
-}
-
 ABIArgInfo SPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
   if (getContext().getLangOpts().isTargetDevice()) {
     // Coerce pointer arguments with default address space to CrossWorkGroup
@@ -167,18 +188,6 @@ ABIArgInfo SPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
     }
 
     if (isAggregateTypeForABI(Ty)) {
-      if (getTarget().getTriple().getVendor() == llvm::Triple::AMD)
-        // TODO: The AMDGPU kernel ABI passes aggregates byref, which is not
-        // currently expressible in SPIR-V; SPIR-V passes aggregates byval,
-        // which the AMDGPU kernel ABI does not allow. Passing aggregates as
-        // direct works around this impedance mismatch, as it retains type info
-        // and can be correctly handled, post reverse-translation, by the AMDGPU
-        // BE, which has to support this CC for legacy OpenCL purposes. It can
-        // be brittle and does lead to performance degradation in certain
-        // pathological cases. This will be revisited / optimised in the future,
-        // once a way to deal with the byref/byval impedance mismatch is
-        // identified.
-        return ABIArgInfo::getDirect(LTy, 0, nullptr, false);
       // Force copying aggregate type in kernel arguments by value when
       // compiling CUDA targeting SPIR-V. This is required for the object
       // copied to be valid on the device.
@@ -193,11 +202,150 @@ ABIArgInfo SPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
   return classifyArgumentType(Ty);
 }
 
-ABIArgInfo SPIRVABIInfo::classifyArgumentType(QualType Ty) const {
-  if (getTarget().getTriple().getVendor() != llvm::Triple::AMD)
-    return DefaultABIInfo::classifyArgumentType(Ty);
-  if (!isAggregateTypeForABI(Ty))
-    return DefaultABIInfo::classifyArgumentType(Ty);
+void SPIRVABIInfo::computeInfo(CGFunctionInfo &FI) const {
+  // The logic is the same as in DefaultABIInfo, with an exception for the
+  // handling of kernel arguments.
+  llvm::CallingConv::ID CC = FI.getCallingConvention();
+
+  if (!getCXXABI().classifyReturnType(FI))
+    FI.getReturnInfo() = classifyReturnType(FI.getReturnType());
+
+  for (auto &I : FI.arguments()) {
+    if (CC == llvm::CallingConv::SPIR_KERNEL) {
+      I.info = classifyKernelArgumentType(I.type);
+    } else {
+      I.info = classifyArgumentType(I.type);
+    }
+  }
+}
+
+unsigned AMDGCNSPIRVABIInfo::numRegsForType(QualType Ty) const {
+  // This duplicates the AMDGPUABI computation.
+  unsigned NumRegs = 0;
+
+  if (const VectorType *VT = Ty->getAs<VectorType>()) {
+    // Compute from the number of elements. The reported size is based on the
+    // in-memory size, which includes the padding 4th element for 3-vectors.
+    QualType EltTy = VT->getElementType();
+    unsigned EltSize = getContext().getTypeSize(EltTy);
+
+    // 16-bit element vectors should be passed as packed.
+    if (EltSize == 16)
+      return (VT->getNumElements() + 1) / 2;
+
+    unsigned EltNumRegs = (EltSize + 31) / 32;
+    return EltNumRegs * VT->getNumElements();
+  }
+
+  if (const auto *RD = Ty->getAsRecordDecl()) {
+    assert(!RD->hasFlexibleArrayMember());
+
+    for (const FieldDecl *Field : RD->fields()) {
+      QualType FieldTy = Field->getType();
+      NumRegs += numRegsForType(FieldTy);
+    }
+
+    return NumRegs;
+  }
+
+  return (getContext().getTypeSize(Ty) + 31) / 32;
+}
+
+llvm::Type *
+AMDGCNSPIRVABIInfo::coerceKernelArgumentType(llvm::Type *Ty, unsigned FromAS,
+                                             unsigned ToAS) const {
+  // Single value types.
+  auto *PtrTy = llvm::dyn_cast<llvm::PointerType>(Ty);
+  if (PtrTy && PtrTy->getAddressSpace() == FromAS)
+    return llvm::PointerType::get(Ty->getContext(), ToAS);
+  return Ty;
+}
+
+ABIArgInfo AMDGCNSPIRVABIInfo::classifyReturnType(QualType RetTy) const {
+  if (!isAggregateTypeForABI(RetTy) || getRecordArgABI(RetTy, getCXXABI()))
+    return DefaultABIInfo::classifyReturnType(RetTy);
+
+  // Ignore empty structs/unions.
+  if (isEmptyRecord(getContext(), RetTy, true))
+    return ABIArgInfo::getIgnore();
+
+  // Lower single-element structs to just return a regular value.
+  if (const Type *SeltTy = isSingleElementStruct(RetTy, getContext()))
+    return ABIArgInfo::getDirect(CGT.ConvertType(QualType(SeltTy, 0)));
+
+  if (const auto *RD = RetTy->getAsRecordDecl();
+      RD && RD->hasFlexibleArrayMember())
+    return DefaultABIInfo::classifyReturnType(RetTy);
+
+  // Pack aggregates <= 4 bytes into single VGPR or pair.
+  uint64_t Size = getContext().getTypeSize(RetTy);
+  if (Size <= 16)
+    return ABIArgInfo::getDirect(llvm::Type::getInt16Ty(getVMContext()));
+
+  if (Size <= 32)
+    return ABIArgInfo::getDirect(llvm::Type::getInt32Ty(getVMContext()));
+
+  // TODO: This is carried over from an AMDGPU oddity; we retain it to
+  //       ensure consistency, but it might be reasonable to return Int64.
+  if (Size <= 64) {
+    llvm::Type *I32Ty = llvm::Type::getInt32Ty(getVMContext());
+    return ABIArgInfo::getDirect(llvm::ArrayType::get(I32Ty, 2));
+  }
+
+  if (numRegsForType(RetTy) <= MaxNumRegsForArgsRet)
+    return ABIArgInfo::getDirect();
+  return DefaultABIInfo::classifyReturnType(RetTy);
+}
+
+/// For kernels all parameters are really passed in a special buffer. It doesn't
+/// make sense to pass anything byval, so everything must be direct.
+ABIArgInfo AMDGCNSPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
+  Ty = useFirstFieldIfTransparentUnion(Ty);
+
+  // TODO: Can we omit empty structs?
+
+  if (const Type *SeltTy = isSingleElementStruct(Ty, getContext()))
+    Ty = QualType(SeltTy, 0);
+
+  llvm::Type *OrigLTy = CGT.ConvertType(Ty);
+  llvm::Type *LTy = OrigLTy;
+  if (getContext().getLangOpts().isTargetDevice()) {
+    LTy = coerceKernelArgumentType(
+        OrigLTy, /*FromAS=*/getContext().getTargetAddressSpace(LangAS::Default),
+        /*ToAS=*/getContext().getTargetAddressSpace(LangAS::opencl_global));
+  }
+
+  // FIXME: This doesn't apply the optimization of coercing pointers in structs
+  // to global address space when using byref. This would require implementing a
+  // new kind of coercion of the in-memory type for indirect arguments.
+  if (LTy == OrigLTy && isAggregateTypeForABI(Ty)) {
+    return ABIArgInfo::getIndirectAliased(
+        getContext().getTypeAlignInChars(Ty),
+        getContext().getTargetAddressSpace(LangAS::opencl_constant),
+        false /*Realign*/, nullptr /*Padding*/);
+  }
+
+  // TODO: inhibiting flattening is an AMDGPU workaround for Clover, which might
+  //       be vestigial and should be revisited.
+  return ABIArgInfo::getDirect(LTy, 0, nullptr, false);
+}
+
+ABIArgInfo AMDGCNSPIRVABIInfo::classifyArgumentType(QualType Ty) const {
+  assert(NumRegsLeft <= MaxNumRegsForArgsRet && "register estimate underflow");
+
+  Ty = useFirstFieldIfTransparentUnion(Ty);
+
+  // TODO: support for variadics.
+
+  if (!isAggregateTypeForABI(Ty)) {
+    ABIArgInfo ArgInfo = DefaultABIInfo::classifyArgumentType(Ty);
+    if (!ArgInfo.isIndirect()) {
+      unsigned NumRegs = numRegsForType(Ty);
+      NumRegsLeft -= std::min(NumRegs, NumRegsLeft);
+    }
+
+    return ArgInfo;
+  }
 
   // Records with non-trivial destructors/copy-constructors should not be
   // passed by value.
@@ -205,37 +353,87 @@ ABIArgInfo SPIRVABIInfo::classifyArgumentType(QualType Ty) const {
     return getNaturalAlignIndirect(Ty, getDataLayout().getAllocaAddrSpace(),
                                    RAA == CGCXXABI::RAA_DirectInMemory);
 
+  // Ignore empty structs/unions.
+  if (isEmptyRecord(getContext(), Ty, true))
+    return ABIArgInfo::getIgnore();
+
+  // Lower single-element structs to just pass a regular value. TODO: We
+  // could do reasonable-size multiple-element structs too, using getExpand(),
+  // though watch out for things like bitfields.
+  if (const Type *SeltTy = isSingleElementStruct(Ty, getContext()))
+    return ABIArgInfo::getDirect(CGT.ConvertType(QualType(SeltTy, 0)));
+
   if (const auto *RD = Ty->getAsRecordDecl();
       RD && RD->hasFlexibleArrayMember())
     return DefaultABIInfo::classifyArgumentType(Ty);
 
-  return ABIArgInfo::getDirect(CGT.ConvertType(Ty), 0u, nullptr, false);
+  uint64_t Size = getContext().getTypeSize(Ty);
+  if (Size <= 64) {
+    // Pack aggregates <= 8 bytes into single VGPR or pair.
+    unsigned NumRegs = (Size + 31) / 32;
+    NumRegsLeft -= std::min(NumRegsLeft, NumRegs);
+
+    if (Size <= 16)
+      return ABIArgInfo::getDirect(llvm::Type::getInt16Ty(getVMContext()));
+
+    if (Size <= 32)
+      return ABIArgInfo::getDirect(llvm::Type::getInt32Ty(getVMContext()));
+
+    // TODO: This is an AMDGPU oddity and might be vestigial; we retain it to
+    //       ensure consistency, but it should be revisited.
+    llvm::Type *I32Ty = llvm::Type::getInt32Ty(getVMContext());
+    return ABIArgInfo::getDirect(llvm::ArrayType::get(I32Ty, 2));
+  }
+
+  if (NumRegsLeft > 0) {
+    unsigned NumRegs = numRegsForType(Ty);
+    if (NumRegsLeft >= NumRegs) {
+      NumRegsLeft -= NumRegs;
+      return ABIArgInfo::getDirect();
+    }
+  }
+
+  // Use pass-by-reference instead of pass-by-value for struct arguments in
+  // the function ABI.
+  return ABIArgInfo::getIndirectAliased(
+      getContext().getTypeAlignInChars(Ty),
+      getContext().getTargetAddressSpace(LangAS::opencl_private));
 }
 
-void SPIRVABIInfo::computeInfo(CGFunctionInfo &FI) const {
-  // The logic is same as in DefaultABIInfo with an exception on the kernel
-  // arguments handling.
+void AMDGCNSPIRVABIInfo::computeInfo(CGFunctionInfo &FI) const {
   llvm::CallingConv::ID CC = FI.getCallingConvention();
 
   if (!getCXXABI().classifyReturnType(FI))
     FI.getReturnInfo() = classifyReturnType(FI.getReturnType());
 
+  NumRegsLeft = MaxNumRegsForArgsRet;
   for (auto &I : FI.arguments()) {
-    if (CC == llvm::CallingConv::SPIR_KERNEL) {
+    if (CC == llvm::CallingConv::SPIR_KERNEL)
       I.info = classifyKernelArgumentType(I.type);
-    } else {
+    else
       I.info = classifyArgumentType(I.type);
-    }
   }
 }
 
+llvm::FixedVectorType *AMDGCNSPIRVABIInfo::getOptimalVectorMemoryType(
+    llvm::FixedVectorType *Ty, const LangOptions &LangOpt) const {
+  // AMDGPU has legal instructions for 96-bit so 3x32 can be supported.
+  if (Ty->getNumElements() == 3 && getDataLayout().getTypeSizeInBits(Ty) == 96)
+    return Ty;
+  return DefaultABIInfo::getOptimalVectorMemoryType(Ty, LangOpt);
+}
+
 namespace clang {
 namespace CodeGen {
 void computeSPIRKernelABIInfo(CodeGenModule &CGM, CGFunctionInfo &FI) {
-  if (CGM.getTarget().getTriple().isSPIRV())
-    SPIRVABIInfo(CGM.getTypes()).computeInfo(FI);
-  else
+  if (CGM.getTarget().getTriple().isSPIRV()) {
+    if (CGM.getTarget().getTriple().getVendor() == llvm::Triple::AMD)
+      AMDGCNSPIRVABIInfo(CGM.getTypes()).computeInfo(FI);
+    else
+      SPIRVABIInfo(CGM.getTypes()).computeInfo(FI);
+  } else {
     CommonSPIRABIInfo(CGM.getTypes()).computeInfo(FI);
+  }
 }
 }
 }
diff --git a/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu b/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
index a48affaec3c8a..bf45a353851b4 100644
--- a/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
+++ b/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
@@ -95,7 +95,7 @@ __global__ void kernel1(int *x) {
 // CHECK-NEXT:    store ptr addrspace(1) [[X_COERCE]], ptr [[X_ASCAST]], align 8
 // CHECK-NEXT:    [[X1:%.*]] = load ptr, ptr [[X_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[X1]], ptr [[X_ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[X_ADDR_ASCAST]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[X_ADDR_ASCAST]], align 8, !nonnull [[META4:![0-9]+]], !align [[META5:![0-9]+]]
 // CHECK-NEXT:    [[TMP1:%.*]] = load i32, ptr [[TMP0]], align 4
 // CHECK-NEXT:    [[INC:%.*]] = add nsw i32 [[TMP1]], 1
 // CHECK-NEXT:    store i32 [[INC]], ptr [[TMP0]], align 4
@@ -111,7 +111,7 @@ __global__ void kernel1(int *x) {
 // CHECK-SPIRV-NEXT:    store ptr addrspace(1) [[X_COERCE]], ptr addrspace(4) [[X_ASCAST]], align 8
 // CHECK-SPIRV-NEXT:    [[X1:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X_ASCAST]], align 8
 // CHECK-SPIRV-NEXT:    store ptr addrspace(4) [[X1]], ptr addrspace(4) [[X_ADDR_ASCAST]], align 8
-// CHECK-SPIRV-NEXT:    [[TMP0:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X_ADDR_ASCAST]], align 8
+// CHECK-SPIRV-NEXT:    [[TMP0:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X_ADDR_ASCAST]], align 8, !align [[META6:![0-9]+]]
 // CHECK-SPIRV-NEXT:    [[TMP1:%.*]] = load i32, ptr addrspace(4) [[TMP0]], align 4
 // CHECK-SPIRV-NEXT:    [[INC:%.*]] = add nsw i32 [[TMP1]], 1
 // CHECK-SPIRV-NEXT:    store i32 [[INC]], ptr addrspace(4) [[TMP0]], align 4
@@ -302,28 +302,23 @@ struct S {
 // CHECK-NEXT:    ret void
 //
 // CHECK-SPIRV-LABEL: define spir_kernel void @_Z7kernel41S(
-// CHECK-SPIRV-SAME: [[STRUCT_S:%.*]] [[S_COERCE:%.*]]) addrspace(4) #[[ATTR0]] !max_work_group_size [[META5]] {
+// CHECK-SPIRV-SAME: ptr addrspace(2) noundef byref([[STRUCT_S:%.*]]) align 8 [[TMP0:%.*]]) addrspace(4) #[[ATTR0]] !max_work_group_size [[META5]] {
 // CHECK-SPIRV-NEXT:  [[ENTRY:.*:]]
-// CHECK-SPIRV-NEXT:    [[S:%.*]] = alloca [[STRUCT_S]], align 8
-// CHECK-SPIRV-NEXT:    [[S1:%.*]] = addrspacecast ptr [[S]] to ptr addrspace(4)
-// CHECK-SPIRV-NEXT:    [[TMP0:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 0
-// CHECK-SPIRV-NEXT:    [[TMP1:%.*]] = extractvalue [[STRUCT_S]] [[S_COERCE]], 0
-// CHECK-SPIRV-NEXT:    store ptr addrspace(4) [[TMP1]], ptr addrspace(4) [[TMP0]], align 8
-// CHECK-SPIRV-NEXT:    [[TMP2:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 1
-// CHECK-SPIRV-NEXT:    [[TMP3:%.*]] = extractvalue [[STRUCT_S]] [[S_COERCE]], 1
-// CHECK-SPIRV-NEXT:    store ptr addrspace(4) [[TMP3]], ptr addrspace(4) [[TMP2]], align 8
-// CHECK-SPIRV-NEXT:    [[X:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 0
-// CHECK-SPIRV-NEXT:    [[TMP4:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X]], align 8
-// CHECK-SPIRV-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP4]], i64 0
-// CHECK-SPIRV-NEXT:    [[TMP5:%.*]] = load i32, ptr addrspace(4) [[ARRAYIDX]], align 4
-// CHECK-SPIRV-NEXT:    [[INC:%.*]] = add nsw i32 [[TMP5]], 1
+// CHECK-SPIRV-NEXT:    [[COERCE:%.*]] = alloca [[STRUCT_S]], align 8
+// CHECK-SPIRV-NEXT:    [[S:%.*]] = addrspacecast ptr [[COERCE]] to ptr addrspace(4)
+// CHECK-SPIRV-NEXT:    call addrspace(4) void @llvm.memcpy.p4.p2.i64(ptr addrspace(4) align 8 [[S]], ptr addrspace(2) align 8 [[TMP0]], i64 16, i1 false)
+// CHECK-SPIRV-NEXT:    [[X:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S]], i32 0, i32 0
+// CHECK-SPIRV-NEXT:    [[TMP1:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X]], align 8
+// CHECK-SPIRV-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP1]], i64 0
+// CHECK-SPIRV-NEXT:    [[TMP2:%.*]] = load i32, ptr addrspace(4) [[ARRAYIDX]], align 4
+// CHECK-SPIRV-NEXT:    [[INC:%.*]] = add nsw i32 [[TMP2]], 1
 // CHECK-SPIRV-NEXT:    store i32 [[INC]], ptr addrspace(4) [[ARRAYIDX]], align 4
-// CHECK-SPIRV-NEXT:    [[Y:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 1
-// CHECK-SPIRV-NEXT:    [[TMP6:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[Y]], align 8
-// CHECK-SPIRV-NEXT:    [[ARRAYIDX2:%.*]] = getelementptr inbounds float, ptr addrspace(4) [[TMP6]], i64 0
-// CHECK-SPIRV-NEXT:    [[TMP7:%.*]] = load float, ptr addrspace(4) [[ARRAYIDX2]], align 4
-// CHECK-SPIRV-NEXT:    [[ADD:%.*]] = fadd contract float [[TMP7]], 1.000000e+00
-// CHECK-SPIRV-NEXT:    store float [[ADD]], ptr addrspace(4) [[ARRAYIDX2]], align 4
+// CHECK-SPIRV-NEXT:    [[Y:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S]], i32 0, i32 1
+// CHECK-SPIRV-NEXT:    [[TMP3:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[Y]], align 8
+// CHECK-SPIRV-NEXT:    [[ARRAYIDX1:%.*]] = getelementptr inbounds float, ptr addrspace(4) [[TMP3]], i64 0
+// CHECK-SPIRV-NEXT:    [[TMP4:%.*]] = load float, ptr addrspace(4) [[ARRAYIDX1]], align 4
+// CHECK-SPIRV-NEXT:    [[ADD:%.*]] = fadd contract float [[TMP4]], 1.000000e+00
+// CHECK-SPIRV-NEXT:    store float [[ADD]], ptr addrspace(4) [[ARRAYIDX1]], align 4
 // CHECK-SPIRV-NEXT:  ...
[truncated]
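
For non-kernel arguments, the classification added above follows the AMDGPU register-budget heuristic: sizes are rounded up to 32-bit registers against a per-function budget of 16, aggregates of 64 bits or less are always packed and passed direct, and larger aggregates are passed direct only while the budget holds out, falling back to pass-by-reference otherwise. The following is a minimal standalone sketch of that accounting (not the actual Clang code; it glosses over the empty-record, single-element-struct and flexible-array special cases, and takes a pre-computed bit size instead of walking fields the way numRegsForType does):

    #include <algorithm>
    #include <cstdint>

    constexpr unsigned MaxNumRegsForArgsRet = 16; // 16 32-bit registers

    // Round a bit size up to whole 32-bit registers.
    unsigned numRegsForSize(std::uint64_t SizeInBits) {
      return static_cast<unsigned>((SizeInBits + 31) / 32);
    }

    enum class ArgKind { Direct, Indirect };

    // Hypothetical helper mirroring the aggregate branch of
    // AMDGCNSPIRVABIInfo::classifyArgumentType.
    ArgKind classifyAggregate(std::uint64_t SizeInBits, unsigned &NumRegsLeft) {
      if (SizeInBits <= 64) {
        // Small aggregates are packed into i16/i32/[2 x i32] and passed direct;
        // they still consume register budget.
        NumRegsLeft -= std::min(NumRegsLeft, numRegsForSize(SizeInBits));
        return ArgKind::Direct;
      }
      unsigned NumRegs = numRegsForSize(SizeInBits);
      if (NumRegsLeft >= NumRegs) { // Large aggregate still fits the budget.
        NumRegsLeft -= NumRegs;
        return ArgKind::Direct;
      }
      return ArgKind::Indirect;     // Otherwise pass byref (private address space).
    }

As in the patch, the budget would be reset at the start of each computeInfo run, e.g. `unsigned NumRegsLeft = MaxNumRegsForArgsRet;` before classifying the arguments.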

@llvmbot
Member

llvmbot commented Nov 28, 2025

@llvm/pr-subscribers-clang-codegen


@llvmbot
Member

llvmbot commented Nov 28, 2025

@llvm/pr-subscribers-backend-spir-v

Author: Alex Voicu (AlexVlx)

Changes

At the moment AMDGCN flavoured SPIRV uses the SPIRV ABI with some tweaks revolving around passing aggregates as direct. This is problematic in multiple ways:

  • it leads to divergence from code compiled for a concrete target, which makes it difficult to debug;
  • it incurs a run time cost, when dealing with larger aggregates;
  • it incurs a compile time cost, when dealing with larger aggregates.

This patch switches over AMDGCN flavoured SPIRV to implement the AMDGPU ABI (except for dealing with variadic functions, which will be added in the future). One additional complication (and the primary motivation behind the current less than ideal state of affairs) stems from byref, which AMDGPU uses, not being expressible in SPIR-V. We deal with this by CodeGen-ing for byref, lowering it to the FuncParamAttr ByVal in SPIR-V, and restoring it when doing reverse translation from AMDGCN flavoured SPIR-V.


Patch is 50.43 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/169865.diff

6 Files Affected:

  • (modified) clang/lib/CodeGen/Targets/SPIR.cpp (+245-47)
  • (modified) clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu (+71-73)
  • (modified) clang/test/CodeGenCUDA/kernel-args.cu (+4-4)
  • (added) clang/test/CodeGenHIP/amdgcnspirv-uses-amdgpu-abi.cpp (+321)
  • (modified) llvm/lib/Target/SPIRV/SPIRVCallLowering.cpp (+4-1)
  • (added) llvm/test/CodeGen/SPIRV/pointers/ptr-argument-byref-amdgcnspirv.ll (+24)
diff --git a/clang/lib/CodeGen/Targets/SPIR.cpp b/clang/lib/CodeGen/Targets/SPIR.cpp
index 1a8c85d8871ec..3540093074bfe 100644
--- a/clang/lib/CodeGen/Targets/SPIR.cpp
+++ b/clang/lib/CodeGen/Targets/SPIR.cpp
@@ -9,6 +9,11 @@
 #include "ABIInfoImpl.h"
 #include "HLSLBufferLayoutBuilder.h"
 #include "TargetInfo.h"
+#include "clang/Basic/LangOptions.h"
+#include "llvm/IR/DerivedTypes.h"
+
+#include <stdint.h>
+#include <utility>
 
 using namespace clang;
 using namespace clang::CodeGen;
@@ -33,9 +38,41 @@ class SPIRVABIInfo : public CommonSPIRABIInfo {
   void computeInfo(CGFunctionInfo &FI) const override;
 
 private:
+  ABIArgInfo classifyKernelArgumentType(QualType Ty) const;
+};
+
+class AMDGCNSPIRVABIInfo : public SPIRVABIInfo {
+  // TODO: this should be unified / shared with AMDGPU, ideally we'd like to
+  //       re-use AMDGPUABIInfo eventually, rather than duplicate.
+  static constexpr unsigned MaxNumRegsForArgsRet = 16; // 16 32-bit registers
+  mutable unsigned NumRegsLeft = 0;
+
+  unsigned numRegsForType(QualType Ty) const;
+
+  bool isHomogeneousAggregateBaseType(QualType Ty) const override {
+    return true;
+  }
+  bool isHomogeneousAggregateSmallEnough(const Type *Base,
+                                         uint64_t Members) const override {
+    uint32_t NumRegs = (getContext().getTypeSize(Base) + 31) / 32;
+
+    // Homogeneous Aggregates may occupy at most 16 registers.
+    return Members * NumRegs <= MaxNumRegsForArgsRet;
+  }
+
+  // Coerce HIP scalar pointer arguments from generic pointers to global ones.
+  llvm::Type *coerceKernelArgumentType(llvm::Type *Ty, unsigned FromAS,
+                                       unsigned ToAS) const;
+
   ABIArgInfo classifyReturnType(QualType RetTy) const;
   ABIArgInfo classifyKernelArgumentType(QualType Ty) const;
   ABIArgInfo classifyArgumentType(QualType Ty) const;
+public:
+  AMDGCNSPIRVABIInfo(CodeGenTypes &CGT) : SPIRVABIInfo(CGT) {}
+  void computeInfo(CGFunctionInfo &FI) const override;
+
+  llvm::FixedVectorType *getOptimalVectorMemoryType(
+      llvm::FixedVectorType *Ty, const LangOptions &LangOpt) const override;
 };
 } // end anonymous namespace
 namespace {
@@ -83,7 +120,10 @@ class CommonSPIRTargetCodeGenInfo : public TargetCodeGenInfo {
 class SPIRVTargetCodeGenInfo : public CommonSPIRTargetCodeGenInfo {
 public:
   SPIRVTargetCodeGenInfo(CodeGen::CodeGenTypes &CGT)
-      : CommonSPIRTargetCodeGenInfo(std::make_unique<SPIRVABIInfo>(CGT)) {}
+      : CommonSPIRTargetCodeGenInfo(
+            (CGT.getTarget().getTriple().getVendor() == llvm::Triple::AMD)
+                ? std::make_unique<AMDGCNSPIRVABIInfo>(CGT)
+                : std::make_unique<SPIRVABIInfo>(CGT)) {}
   void setCUDAKernelCallingConvention(const FunctionType *&FT) const override;
   LangAS getGlobalVarAddressSpace(CodeGenModule &CGM,
                                   const VarDecl *D) const override;
@@ -132,25 +172,6 @@ void CommonSPIRABIInfo::setCCs() {
   RuntimeCC = llvm::CallingConv::SPIR_FUNC;
 }
 
-ABIArgInfo SPIRVABIInfo::classifyReturnType(QualType RetTy) const {
-  if (getTarget().getTriple().getVendor() != llvm::Triple::AMD)
-    return DefaultABIInfo::classifyReturnType(RetTy);
-  if (!isAggregateTypeForABI(RetTy) || getRecordArgABI(RetTy, getCXXABI()))
-    return DefaultABIInfo::classifyReturnType(RetTy);
-
-  if (const auto *RD = RetTy->getAsRecordDecl();
-      RD && RD->hasFlexibleArrayMember())
-    return DefaultABIInfo::classifyReturnType(RetTy);
-
-  // TODO: The AMDGPU ABI is non-trivial to represent in SPIR-V; in order to
-  // avoid encoding various architecture specific bits here we return everything
-  // as direct to retain type info for things like aggregates, for later perusal
-  // when translating back to LLVM/lowering in the BE. This is also why we
-  // disable flattening as the outcomes can mismatch between SPIR-V and AMDGPU.
-  // This will be revisited / optimised in the future.
-  return ABIArgInfo::getDirect(CGT.ConvertType(RetTy), 0u, nullptr, false);
-}
-
 ABIArgInfo SPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
   if (getContext().getLangOpts().isTargetDevice()) {
     // Coerce pointer arguments with default address space to CrossWorkGroup
@@ -167,18 +188,6 @@ ABIArgInfo SPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
     }
 
     if (isAggregateTypeForABI(Ty)) {
-      if (getTarget().getTriple().getVendor() == llvm::Triple::AMD)
-        // TODO: The AMDGPU kernel ABI passes aggregates byref, which is not
-        // currently expressible in SPIR-V; SPIR-V passes aggregates byval,
-        // which the AMDGPU kernel ABI does not allow. Passing aggregates as
-        // direct works around this impedance mismatch, as it retains type info
-        // and can be correctly handled, post reverse-translation, by the AMDGPU
-        // BE, which has to support this CC for legacy OpenCL purposes. It can
-        // be brittle and does lead to performance degradation in certain
-        // pathological cases. This will be revisited / optimised in the future,
-        // once a way to deal with the byref/byval impedance mismatch is
-        // identified.
-        return ABIArgInfo::getDirect(LTy, 0, nullptr, false);
       // Force copying aggregate type in kernel arguments by value when
       // compiling CUDA targeting SPIR-V. This is required for the object
       // copied to be valid on the device.
@@ -193,11 +202,150 @@ ABIArgInfo SPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
   return classifyArgumentType(Ty);
 }
 
-ABIArgInfo SPIRVABIInfo::classifyArgumentType(QualType Ty) const {
-  if (getTarget().getTriple().getVendor() != llvm::Triple::AMD)
-    return DefaultABIInfo::classifyArgumentType(Ty);
-  if (!isAggregateTypeForABI(Ty))
-    return DefaultABIInfo::classifyArgumentType(Ty);
+void SPIRVABIInfo::computeInfo(CGFunctionInfo &FI) const {
+  // The logic is same as in DefaultABIInfo with an exception on the kernel
+  // arguments handling.
+  llvm::CallingConv::ID CC = FI.getCallingConvention();
+
+  if (!getCXXABI().classifyReturnType(FI))
+    FI.getReturnInfo() = classifyReturnType(FI.getReturnType());
+
+  for (auto &I : FI.arguments()) {
+    if (CC == llvm::CallingConv::SPIR_KERNEL) {
+      I.info = classifyKernelArgumentType(I.type);
+    } else {
+      I.info = classifyArgumentType(I.type);
+    }
+  }
+}
+
+unsigned AMDGCNSPIRVABIInfo::numRegsForType(QualType Ty) const {
+  // This duplicates the AMDGPUABI computation.
+  unsigned NumRegs = 0;
+
+  if (const VectorType *VT = Ty->getAs<VectorType>()) {
+    // Compute from the number of elements. The reported size is based on the
+    // in-memory size, which includes the padding 4th element for 3-vectors.
+    QualType EltTy = VT->getElementType();
+    unsigned EltSize = getContext().getTypeSize(EltTy);
+
+    // 16-bit element vectors should be passed as packed.
+    if (EltSize == 16)
+      return (VT->getNumElements() + 1) / 2;
+
+    unsigned EltNumRegs = (EltSize + 31) / 32;
+    return EltNumRegs * VT->getNumElements();
+  }
+
+  if (const auto *RD = Ty->getAsRecordDecl()) {
+    assert(!RD->hasFlexibleArrayMember());
+
+    for (const FieldDecl *Field : RD->fields()) {
+      QualType FieldTy = Field->getType();
+      NumRegs += numRegsForType(FieldTy);
+    }
+
+    return NumRegs;
+  }
+
+  return (getContext().getTypeSize(Ty) + 31) / 32;
+}
+
+llvm::Type *
+AMDGCNSPIRVABIInfo::coerceKernelArgumentType(llvm::Type *Ty, unsigned FromAS,
+                                             unsigned ToAS) const {
+  // Single value types.
+  auto *PtrTy = llvm::dyn_cast<llvm::PointerType>(Ty);
+  if (PtrTy && PtrTy->getAddressSpace() == FromAS)
+    return llvm::PointerType::get(Ty->getContext(), ToAS);
+  return Ty;
+}
+
+ABIArgInfo AMDGCNSPIRVABIInfo::classifyReturnType(QualType RetTy) const {
+  if (!isAggregateTypeForABI(RetTy) || getRecordArgABI(RetTy, getCXXABI()))
+    return DefaultABIInfo::classifyReturnType(RetTy);
+
+  // Ignore empty structs/unions.
+  if (isEmptyRecord(getContext(), RetTy, true))
+    return ABIArgInfo::getIgnore();
+
+  // Lower single-element structs to just return a regular value.
+  if (const Type *SeltTy = isSingleElementStruct(RetTy, getContext()))
+    return ABIArgInfo::getDirect(CGT.ConvertType(QualType(SeltTy, 0)));
+
+  if (const auto *RD = RetTy->getAsRecordDecl();
+      RD && RD->hasFlexibleArrayMember())
+    return DefaultABIInfo::classifyReturnType(RetTy);
+
+  // Pack aggregates <= 4 bytes into single VGPR or pair.
+  uint64_t Size = getContext().getTypeSize(RetTy);
+  if (Size <= 16)
+    return ABIArgInfo::getDirect(llvm::Type::getInt16Ty(getVMContext()));
+
+  if (Size <= 32)
+    return ABIArgInfo::getDirect(llvm::Type::getInt32Ty(getVMContext()));
+
+  // TODO: This carried over from AMDGPU oddity, we retain it to
+  //       ensure consistency, but it might be reasonable to return Int64.
+  if (Size <= 64) {
+    llvm::Type *I32Ty = llvm::Type::getInt32Ty(getVMContext());
+    return ABIArgInfo::getDirect(llvm::ArrayType::get(I32Ty, 2));
+  }
+
+  if (numRegsForType(RetTy) <= MaxNumRegsForArgsRet)
+    return ABIArgInfo::getDirect();
+  return DefaultABIInfo::classifyReturnType(RetTy);
+}
+
+/// For kernels all parameters are really passed in a special buffer. It doesn't
+/// make sense to pass anything byval, so everything must be direct.
+ABIArgInfo AMDGCNSPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
+  Ty = useFirstFieldIfTransparentUnion(Ty);
+
+  // TODO: Can we omit empty structs?
+
+  if (const Type *SeltTy = isSingleElementStruct(Ty, getContext()))
+    Ty = QualType(SeltTy, 0);
+
+  llvm::Type *OrigLTy = CGT.ConvertType(Ty);
+  llvm::Type *LTy = OrigLTy;
+  if (getContext().getLangOpts().isTargetDevice()) {
+    LTy = coerceKernelArgumentType(
+        OrigLTy, /*FromAS=*/getContext().getTargetAddressSpace(LangAS::Default),
+        /*ToAS=*/getContext().getTargetAddressSpace(LangAS::opencl_global));
+  }
+
+  // FIXME: This doesn't apply the optimization of coercing pointers in structs
+  // to global address space when using byref. This would require implementing a
+  // new kind of coercion of the in-memory type for indirect arguments.
+  if (LTy == OrigLTy && isAggregateTypeForABI(Ty)) {
+    return ABIArgInfo::getIndirectAliased(
+        getContext().getTypeAlignInChars(Ty),
+        getContext().getTargetAddressSpace(LangAS::opencl_constant),
+        false /*Realign*/, nullptr /*Padding*/);
+  }
+
+  // TODO: inhibiting flattening is an AMDGPU workaround for Clover, which might
+  //       be vestigial and should be revisited.
+  return ABIArgInfo::getDirect(LTy, 0, nullptr, false);
+}
+
+ABIArgInfo AMDGCNSPIRVABIInfo::classifyArgumentType(QualType Ty) const {
+  assert(NumRegsLeft <= MaxNumRegsForArgsRet && "register estimate underflow");
+
+  Ty = useFirstFieldIfTransparentUnion(Ty);
+
+  // TODO: support for variadics.
+
+  if (!isAggregateTypeForABI(Ty)) {
+    ABIArgInfo ArgInfo = DefaultABIInfo::classifyArgumentType(Ty);
+    if (!ArgInfo.isIndirect()) {
+      unsigned NumRegs = numRegsForType(Ty);
+      NumRegsLeft -= std::min(NumRegs, NumRegsLeft);
+    }
+
+    return ArgInfo;
+  }
 
   // Records with non-trivial destructors/copy-constructors should not be
   // passed by value.
@@ -205,37 +353,87 @@ ABIArgInfo SPIRVABIInfo::classifyArgumentType(QualType Ty) const {
     return getNaturalAlignIndirect(Ty, getDataLayout().getAllocaAddrSpace(),
                                    RAA == CGCXXABI::RAA_DirectInMemory);
 
+  // Ignore empty structs/unions.
+  if (isEmptyRecord(getContext(), Ty, true))
+    return ABIArgInfo::getIgnore();
+
+  // Lower single-element structs to just pass a regular value. TODO: We
+  // could do reasonable-size multiple-element structs too, using getExpand(),
+  // though watch out for things like bitfields.
+  if (const Type *SeltTy = isSingleElementStruct(Ty, getContext()))
+    return ABIArgInfo::getDirect(CGT.ConvertType(QualType(SeltTy, 0)));
+
   if (const auto *RD = Ty->getAsRecordDecl();
       RD && RD->hasFlexibleArrayMember())
     return DefaultABIInfo::classifyArgumentType(Ty);
 
-  return ABIArgInfo::getDirect(CGT.ConvertType(Ty), 0u, nullptr, false);
+  uint64_t Size = getContext().getTypeSize(Ty);
+  if (Size <= 64) {
+    // Pack aggregates <= 8 bytes into single VGPR or pair.
+    unsigned NumRegs = (Size + 31) / 32;
+    NumRegsLeft -= std::min(NumRegsLeft, NumRegs);
+
+    if (Size <= 16)
+      return ABIArgInfo::getDirect(llvm::Type::getInt16Ty(getVMContext()));
+
+    if (Size <= 32)
+      return ABIArgInfo::getDirect(llvm::Type::getInt32Ty(getVMContext()));
+
+    // TODO: This is an AMDGPU oddity, and might be vestigial, we retain it to
+    //       ensure consistency, but it should be revisited.
+    llvm::Type *I32Ty = llvm::Type::getInt32Ty(getVMContext());
+    return ABIArgInfo::getDirect(llvm::ArrayType::get(I32Ty, 2));
+  }
+
+  if (NumRegsLeft > 0) {
+    unsigned NumRegs = numRegsForType(Ty);
+    if (NumRegsLeft >= NumRegs) {
+      NumRegsLeft -= NumRegs;
+      return ABIArgInfo::getDirect();
+    }
+  }
+
+  // Use pass-by-reference instead of pass-by-value for struct arguments in
+  // function ABI.
+  return ABIArgInfo::getIndirectAliased(
+      getContext().getTypeAlignInChars(Ty),
+      getContext().getTargetAddressSpace(LangAS::opencl_private));
 }
 
-void SPIRVABIInfo::computeInfo(CGFunctionInfo &FI) const {
-  // The logic is same as in DefaultABIInfo with an exception on the kernel
-  // arguments handling.
+void AMDGCNSPIRVABIInfo::computeInfo(CGFunctionInfo &FI) const {
   llvm::CallingConv::ID CC = FI.getCallingConvention();
 
   if (!getCXXABI().classifyReturnType(FI))
     FI.getReturnInfo() = classifyReturnType(FI.getReturnType());
 
+  NumRegsLeft = MaxNumRegsForArgsRet;
   for (auto &I : FI.arguments()) {
-    if (CC == llvm::CallingConv::SPIR_KERNEL) {
+    if (CC == llvm::CallingConv::SPIR_KERNEL)
       I.info = classifyKernelArgumentType(I.type);
-    } else {
+    else
       I.info = classifyArgumentType(I.type);
-    }
   }
 }
 
+llvm::FixedVectorType *AMDGCNSPIRVABIInfo::getOptimalVectorMemoryType(
+    llvm::FixedVectorType *Ty, const LangOptions &LangOpt) const {
+  // AMDGPU has legal instructions for 96-bit so 3x32 can be supported.
+  if (Ty->getNumElements() == 3 && getDataLayout().getTypeSizeInBits(Ty) == 96)
+    return Ty;
+  return DefaultABIInfo::getOptimalVectorMemoryType(Ty, LangOpt);
+}
+
 namespace clang {
 namespace CodeGen {
 void computeSPIRKernelABIInfo(CodeGenModule &CGM, CGFunctionInfo &FI) {
-  if (CGM.getTarget().getTriple().isSPIRV())
-    SPIRVABIInfo(CGM.getTypes()).computeInfo(FI);
-  else
+  if (CGM.getTarget().getTriple().isSPIRV()) {
+    if (CGM.getTarget().getTriple().getVendor() == llvm::Triple::AMD)
+      AMDGCNSPIRVABIInfo(CGM.getTypes()).computeInfo(FI);
+    else
+      SPIRVABIInfo(CGM.getTypes()).computeInfo(FI);
+  } else {
     CommonSPIRABIInfo(CGM.getTypes()).computeInfo(FI);
+  }
 }
 }
 }
diff --git a/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu b/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
index a48affaec3c8a..bf45a353851b4 100644
--- a/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
+++ b/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
@@ -95,7 +95,7 @@ __global__ void kernel1(int *x) {
 // CHECK-NEXT:    store ptr addrspace(1) [[X_COERCE]], ptr [[X_ASCAST]], align 8
 // CHECK-NEXT:    [[X1:%.*]] = load ptr, ptr [[X_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[X1]], ptr [[X_ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[X_ADDR_ASCAST]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[X_ADDR_ASCAST]], align 8, !nonnull [[META4:![0-9]+]], !align [[META5:![0-9]+]]
 // CHECK-NEXT:    [[TMP1:%.*]] = load i32, ptr [[TMP0]], align 4
 // CHECK-NEXT:    [[INC:%.*]] = add nsw i32 [[TMP1]], 1
 // CHECK-NEXT:    store i32 [[INC]], ptr [[TMP0]], align 4
@@ -111,7 +111,7 @@ __global__ void kernel1(int *x) {
 // CHECK-SPIRV-NEXT:    store ptr addrspace(1) [[X_COERCE]], ptr addrspace(4) [[X_ASCAST]], align 8
 // CHECK-SPIRV-NEXT:    [[X1:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X_ASCAST]], align 8
 // CHECK-SPIRV-NEXT:    store ptr addrspace(4) [[X1]], ptr addrspace(4) [[X_ADDR_ASCAST]], align 8
-// CHECK-SPIRV-NEXT:    [[TMP0:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X_ADDR_ASCAST]], align 8
+// CHECK-SPIRV-NEXT:    [[TMP0:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X_ADDR_ASCAST]], align 8, !align [[META6:![0-9]+]]
 // CHECK-SPIRV-NEXT:    [[TMP1:%.*]] = load i32, ptr addrspace(4) [[TMP0]], align 4
 // CHECK-SPIRV-NEXT:    [[INC:%.*]] = add nsw i32 [[TMP1]], 1
 // CHECK-SPIRV-NEXT:    store i32 [[INC]], ptr addrspace(4) [[TMP0]], align 4
@@ -302,28 +302,23 @@ struct S {
 // CHECK-NEXT:    ret void
 //
 // CHECK-SPIRV-LABEL: define spir_kernel void @_Z7kernel41S(
-// CHECK-SPIRV-SAME: [[STRUCT_S:%.*]] [[S_COERCE:%.*]]) addrspace(4) #[[ATTR0]] !max_work_group_size [[META5]] {
+// CHECK-SPIRV-SAME: ptr addrspace(2) noundef byref([[STRUCT_S:%.*]]) align 8 [[TMP0:%.*]]) addrspace(4) #[[ATTR0]] !max_work_group_size [[META5]] {
 // CHECK-SPIRV-NEXT:  [[ENTRY:.*:]]
-// CHECK-SPIRV-NEXT:    [[S:%.*]] = alloca [[STRUCT_S]], align 8
-// CHECK-SPIRV-NEXT:    [[S1:%.*]] = addrspacecast ptr [[S]] to ptr addrspace(4)
-// CHECK-SPIRV-NEXT:    [[TMP0:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 0
-// CHECK-SPIRV-NEXT:    [[TMP1:%.*]] = extractvalue [[STRUCT_S]] [[S_COERCE]], 0
-// CHECK-SPIRV-NEXT:    store ptr addrspace(4) [[TMP1]], ptr addrspace(4) [[TMP0]], align 8
-// CHECK-SPIRV-NEXT:    [[TMP2:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 1
-// CHECK-SPIRV-NEXT:    [[TMP3:%.*]] = extractvalue [[STRUCT_S]] [[S_COERCE]], 1
-// CHECK-SPIRV-NEXT:    store ptr addrspace(4) [[TMP3]], ptr addrspace(4) [[TMP2]], align 8
-// CHECK-SPIRV-NEXT:    [[X:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 0
-// CHECK-SPIRV-NEXT:    [[TMP4:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X]], align 8
-// CHECK-SPIRV-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP4]], i64 0
-// CHECK-SPIRV-NEXT:    [[TMP5:%.*]] = load i32, ptr addrspace(4) [[ARRAYIDX]], align 4
-// CHECK-SPIRV-NEXT:    [[INC:%.*]] = add nsw i32 [[TMP5]], 1
+// CHECK-SPIRV-NEXT:    [[COERCE:%.*]] = alloca [[STRUCT_S]], align 8
+// CHECK-SPIRV-NEXT:    [[S:%.*]] = addrspacecast ptr [[COERCE]] to ptr addrspace(4)
+// CHECK-SPIRV-NEXT:    call addrspace(4) void @llvm.memcpy.p4.p2.i64(ptr addrspace(4) align 8 [[S]], ptr addrspace(2) align 8 [[TMP0]], i64 16, i1 false)
+// CHECK-SPIRV-NEXT:    [[X:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S]], i32 0, i32 0
+// CHECK-SPIRV-NEXT:    [[TMP1:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X]], align 8
+// CHECK-SPIRV-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP1]], i64 0
+// CHECK-SPIRV-NEXT:    [[TMP2:%.*]] = load i32, ptr addrspace(4) [[ARRAYIDX]], align 4
+// CHECK-SPIRV-NEXT:    [[INC:%.*]] = add nsw i32 [[TMP2]], 1
 // CHECK-SPIRV-NEXT:    store i32 [[INC]], ptr addrspace(4) [[ARRAYIDX]], align 4
-// CHECK-SPIRV-NEXT:    [[Y:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 1
-// CHECK-SPIRV-NEXT:    [[TMP6:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[Y]], align 8
-// CHECK-SPIRV-NEXT:    [[ARRAYIDX2:%.*]] = getelementptr inbounds float, ptr addrspace(4) [[TMP6]], i64 0
-// CHECK-SPIRV-NEXT:    [[TMP7:%.*]] = load float, ptr addrspace(4) [[ARRAYIDX2]], align 4
-// CHECK-SPIRV-NEXT:    [[ADD:%.*]] = fadd contract float [[TMP7]], 1.000000e+00
-// CHECK-SPIRV-NEXT:    store float [[ADD]], ptr addrspace(4) [[ARRAYIDX2]], align 4
+// CHECK-SPIRV-NEXT:    [[Y:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S]], i32 0, i32 1
+// CHECK-SPIRV-NEXT:    [[TMP3:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[Y]], align 8
+// CHECK-SPIRV-NEXT:    [[ARRAYIDX1:%.*]] = getelementptr inbounds float, ptr addrspace(4) [[TMP3]], i64 0
+// CHECK-SPIRV-NEXT:    [[TMP4:%.*]] = load float, ptr addrspace(4) [[ARRAYIDX1]], align 4
+// CHECK-SPIRV-NEXT:    [[ADD:%.*]] = fadd contract float [[TMP4]], 1.000000e+00
+// CHECK-SPIRV-NEXT:    store float [[ADD]], ptr addrspace(4) [[ARRAYIDX1]], align 4
 // CHECK-SPIRV-NEXT:  ...
[truncated]
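To make the test delta above concrete, the kernel being checked is roughly the following (reconstructed from the CHECK-SPIRV lines, so the struct definition and kernel body are assumptions, and launch-bounds attributes are omitted); the trailing comments show how the lowering of its aggregate parameter changes.

// Reconstructed from the CHECK-SPIRV lines above; treat names and layout as assumptions.
struct S {
  int *x;
  float *y;
};

__global__ void kernel4(S s) {
  s.x[0]++;        // the new lowering first memcpys the byref kernarg into a local copy
  s.y[0] += 1.0f;
}

// Previous AMDGCN-flavoured SPIR-V lowering (aggregate passed direct):
//   define spir_kernel void @_Z7kernel41S(%struct.S %s.coerce) addrspace(4)
// New lowering (matches the AMDGPU kernel ABI):
//   define spir_kernel void @_Z7kernel41S(ptr addrspace(2) noundef byref(%struct.S) align 8 %0) addrspace(4)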

@github-actions

github-actions bot commented Nov 28, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Contributor

@maarquitos14 maarquitos14 left a comment


SPIRV part LGTM, just one nit.

SPIRV::Decoration::FuncParamAttr, {Attr});
}
if (Arg.hasAttribute(Attribute::ByVal)) {
if (Arg.hasAttribute(Attribute::ByVal) ||
Contributor


Can we add a comment here explaining why you need this for AMD, so that people can easily understand when reading this code? I fear people might not understand, or might simply think it's wrong, to lower ByRef to ByVal.

Contributor Author


Done, thank you for the suggestion.
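For readers unfamiliar with the SPIR-V side, the mapping discussed in this thread can be sketched as a stand-alone illustration. The enum names below are hypothetical and the numeric values only mirror SPIR-V's FunctionParameterAttribute operands as recalled from the spec; this is not the actual SPIRVCallLowering code.

// Hypothetical, self-contained sketch (not the backend's real types): byref
// parameters, which SPIR-V cannot express, are decorated the same way as byval
// ones, and byref is restored again during reverse translation.
#include <optional>

enum class IRParamAttr { ByVal, ByRef, SRet, Other };
enum class SPIRVFuncParamAttr : unsigned { ByVal = 2, Sret = 3 };

inline std::optional<SPIRVFuncParamAttr> lowerParamAttr(IRParamAttr A) {
  switch (A) {
  case IRParamAttr::ByVal:
  case IRParamAttr::ByRef: // lowered to FuncParamAttr ByVal, just like byval
    return SPIRVFuncParamAttr::ByVal;
  case IRParamAttr::SRet:
    return SPIRVFuncParamAttr::Sret;
  default:
    return std::nullopt;
  }
}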

Contributor

@jmmartinez jmmartinez left a comment


It looks good to me (but I'm also not super familiar with the ABI side)

Contributor

@Keenuts Keenuts left a comment


LGTM for the shared SPIR-V part. Glanced at the AMD part, code LGTM, but don't know about the ABI logic so haven't checked that.

Collaborator

@efriedma-quic efriedma-quic left a comment


Is there any possibility this impacts interop between AMDGCN flavored SPIRV and generic SPIRV? Or is that not a thing?

Contributor

@MrSidims MrSidims left a comment


The SPIR-V-related part LGTM.

@AlexVlx
Contributor Author

AlexVlx commented Dec 2, 2025

Is there any possibility this impacts interop between AMDGCN flavored SPIRV and generic SPIRV? Or is that not a thing?

Not a thing, we don't compose them. If we were to, we'd have to constrain the ABI to unambiguous cases (so, probably, no aggregates), but that's not under active consideration at the moment.

#include "clang/Basic/LangOptions.h"
#include "llvm/IR/DerivedTypes.h"

#include <stdint.h>
Contributor


Suggested change
#include <stdint.h>
#include <cstdint>

But I'd be surprised you need this?

Contributor Author


Conventionally, cstdint is for std::foo_t and stdint.h is for foo_t without the namespace. The convention is weak and often broken, but it exists. As for needing it: there are uint64_t uses, and whilst cstdint / stdint.h is extremely likely to get included transitively, I thought we want to make files as self-sufficient as possible.
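A minimal illustration of the convention being referred to (both inclusions are standard; the variable names are just for the example):

#include <cstdint>  // guarantees the namespaced spelling, std::uint64_t
#include <stdint.h> // guarantees the unqualified spelling, uint64_t

std::uint64_t NamespacedUse = 0; // what <cstdint> is conventionally included for
uint64_t UnqualifiedUse = 0;     // what <stdint.h> is conventionally included for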

return Ty;
}

ABIArgInfo AMDGCNSPIRVABIInfo::classifyReturnType(QualType RetTy) const {
Contributor


Can we do something better than duplicate all of this AMDGPU stuff?

Contributor Author


It's not immediately apparent how, without refactoring the way the ABIInfo implementations are currently organised (their internal types are inaccessible to other TUs). We could consider hoisting the shared implementation into a file that both AMDGPU and SPIRV include, but I didn't want to do that here since it's more intrusive. If you have another suggestion, please share; it's more than welcome.
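One possible shape for the hoisting mentioned above, strictly as a sketch: the header name, the mixin, and the register limit below are hypothetical, not files, classes, or values taken from the tree.

// Hypothetical clang/lib/CodeGen/Targets/AMDGPUABIHelpers.h, included by both
// AMDGPU.cpp and SPIR.cpp so the register accounting and aggregate
// classification live in a single place.
template <typename BaseABIInfo> class AMDGPUCommonABIMixin : public BaseABIInfo {
protected:
  static constexpr unsigned MaxNumRegsForArgsRet = 16; // assumed limit
  mutable unsigned NumRegsLeft = MaxNumRegsForArgsRet;

  using BaseABIInfo::BaseABIInfo;

  // numRegsForType, the small-aggregate packing rules, and the kernel-argument
  // coercion would be defined once here and reused by both AMDGPUABIInfo and
  // AMDGCNSPIRVABIInfo.
};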

@AlexVlx AlexVlx merged commit 68fea00 into llvm:main Dec 7, 2025
11 checks passed
honeygoyal pushed a commit to honeygoyal/llvm-project that referenced this pull request Dec 9, 2025