
Conversation

@AlexVlx
Contributor

@AlexVlx AlexVlx commented Nov 28, 2025

At the moment, AMDGCN flavoured SPIR-V uses the SPIR-V ABI with some tweaks revolving around passing aggregates as direct. This is problematic in multiple ways:

  • it leads to divergence from code compiled for a concrete target, which makes it difficult to debug;
  • it incurs a run-time cost when dealing with larger aggregates;
  • it incurs a compile-time cost when dealing with larger aggregates.

This patch switches AMDGCN flavoured SPIR-V over to implementing the AMDGPU ABI (except for variadic functions, which will be added in the future). One additional complication (and the primary motivation behind the current less-than-ideal state of affairs) stems from byref, which AMDGPU uses, not being expressible in SPIR-V. We deal with this by CodeGen-ing for byref, lowering it to the FuncParamAttr ByVal in SPIR-V, and restoring it when reverse-translating from AMDGCN flavoured SPIR-V.
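
As a rough illustration of the change for aggregate kernel arguments, consider the hypothetical HIP kernel below (not part of the patch; the struct and kernel names are invented). Previously the aggregate reached the kernel as a direct value that was unpacked field by field; with this patch it arrives byref through the constant address space and is copied into a local alloca on entry, matching concrete AMDGPU targets (see the updated amdgpu-kernel-arg-pointer-type.cu checks in the diff below).

    // Hypothetical HIP example, for illustration only.
    struct Pair {
      int   *xs;
      float *ys;
    };

    __global__ void scale(Pair p, int n) {
      // Old AMDGCN SPIR-V ABI: `p` is a direct aggregate parameter.
      // New AMDGCN SPIR-V ABI: `p` is `ptr addrspace(2) byref(%struct.Pair)`,
      // memcpy'd into a local copy on entry, as on concrete AMDGPU targets.
      for (int i = 0; i != n; ++i)
        p.ys[i] += static_cast<float>(p.xs[i]);
    }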

@llvmbot llvmbot added backend:AMDGPU clang:codegen IR generation bugs: mangling, exceptions, etc. backend:SPIR-V labels Nov 28, 2025
@llvmbot
Member

llvmbot commented Nov 28, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Alex Voicu (AlexVlx)



Patch is 50.43 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/169865.diff

6 Files Affected:

  • (modified) clang/lib/CodeGen/Targets/SPIR.cpp (+245-47)
  • (modified) clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu (+71-73)
  • (modified) clang/test/CodeGenCUDA/kernel-args.cu (+4-4)
  • (added) clang/test/CodeGenHIP/amdgcnspirv-uses-amdgpu-abi.cpp (+321)
  • (modified) llvm/lib/Target/SPIRV/SPIRVCallLowering.cpp (+4-1)
  • (added) llvm/test/CodeGen/SPIRV/pointers/ptr-argument-byref-amdgcnspirv.ll (+24)
diff --git a/clang/lib/CodeGen/Targets/SPIR.cpp b/clang/lib/CodeGen/Targets/SPIR.cpp
index 1a8c85d8871ec..3540093074bfe 100644
--- a/clang/lib/CodeGen/Targets/SPIR.cpp
+++ b/clang/lib/CodeGen/Targets/SPIR.cpp
@@ -9,6 +9,11 @@
 #include "ABIInfoImpl.h"
 #include "HLSLBufferLayoutBuilder.h"
 #include "TargetInfo.h"
+#include "clang/Basic/LangOptions.h"
+#include "llvm/IR/DerivedTypes.h"
+
+#include <stdint.h>
+#include <utility>
 
 using namespace clang;
 using namespace clang::CodeGen;
@@ -33,9 +38,41 @@ class SPIRVABIInfo : public CommonSPIRABIInfo {
   void computeInfo(CGFunctionInfo &FI) const override;
 
 private:
+  ABIArgInfo classifyKernelArgumentType(QualType Ty) const;
+};
+
+class AMDGCNSPIRVABIInfo : public SPIRVABIInfo {
+  // TODO: this should be unified / shared with AMDGPU, ideally we'd like to
+  //       re-use AMDGPUABIInfo eventually, rather than duplicate.
+  static constexpr unsigned MaxNumRegsForArgsRet = 16; // 16 32-bit registers
+  mutable unsigned NumRegsLeft = 0;
+
+  unsigned numRegsForType(QualType Ty) const;
+
+  bool isHomogeneousAggregateBaseType(QualType Ty) const override {
+    return true;
+  }
+  bool isHomogeneousAggregateSmallEnough(const Type *Base,
+                                         uint64_t Members) const override {
+    uint32_t NumRegs = (getContext().getTypeSize(Base) + 31) / 32;
+
+    // Homogeneous Aggregates may occupy at most 16 registers.
+    return Members * NumRegs <= MaxNumRegsForArgsRet;
+  }
+
+  // Coerce HIP scalar pointer arguments from generic pointers to global ones.
+  llvm::Type *coerceKernelArgumentType(llvm::Type *Ty, unsigned FromAS,
+                                       unsigned ToAS) const;
+
   ABIArgInfo classifyReturnType(QualType RetTy) const;
   ABIArgInfo classifyKernelArgumentType(QualType Ty) const;
   ABIArgInfo classifyArgumentType(QualType Ty) const;
+public:
+  AMDGCNSPIRVABIInfo(CodeGenTypes &CGT) : SPIRVABIInfo(CGT) {}
+  void computeInfo(CGFunctionInfo &FI) const override;
+
+  llvm::FixedVectorType *getOptimalVectorMemoryType(
+      llvm::FixedVectorType *Ty, const LangOptions &LangOpt) const override;
 };
 } // end anonymous namespace
 namespace {
@@ -83,7 +120,10 @@ class CommonSPIRTargetCodeGenInfo : public TargetCodeGenInfo {
 class SPIRVTargetCodeGenInfo : public CommonSPIRTargetCodeGenInfo {
 public:
   SPIRVTargetCodeGenInfo(CodeGen::CodeGenTypes &CGT)
-      : CommonSPIRTargetCodeGenInfo(std::make_unique<SPIRVABIInfo>(CGT)) {}
+      : CommonSPIRTargetCodeGenInfo(
+            (CGT.getTarget().getTriple().getVendor() == llvm::Triple::AMD)
+                ? std::make_unique<AMDGCNSPIRVABIInfo>(CGT)
+                : std::make_unique<SPIRVABIInfo>(CGT)) {}
   void setCUDAKernelCallingConvention(const FunctionType *&FT) const override;
   LangAS getGlobalVarAddressSpace(CodeGenModule &CGM,
                                   const VarDecl *D) const override;
@@ -132,25 +172,6 @@ void CommonSPIRABIInfo::setCCs() {
   RuntimeCC = llvm::CallingConv::SPIR_FUNC;
 }
 
-ABIArgInfo SPIRVABIInfo::classifyReturnType(QualType RetTy) const {
-  if (getTarget().getTriple().getVendor() != llvm::Triple::AMD)
-    return DefaultABIInfo::classifyReturnType(RetTy);
-  if (!isAggregateTypeForABI(RetTy) || getRecordArgABI(RetTy, getCXXABI()))
-    return DefaultABIInfo::classifyReturnType(RetTy);
-
-  if (const auto *RD = RetTy->getAsRecordDecl();
-      RD && RD->hasFlexibleArrayMember())
-    return DefaultABIInfo::classifyReturnType(RetTy);
-
-  // TODO: The AMDGPU ABI is non-trivial to represent in SPIR-V; in order to
-  // avoid encoding various architecture specific bits here we return everything
-  // as direct to retain type info for things like aggregates, for later perusal
-  // when translating back to LLVM/lowering in the BE. This is also why we
-  // disable flattening as the outcomes can mismatch between SPIR-V and AMDGPU.
-  // This will be revisited / optimised in the future.
-  return ABIArgInfo::getDirect(CGT.ConvertType(RetTy), 0u, nullptr, false);
-}
-
 ABIArgInfo SPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
   if (getContext().getLangOpts().isTargetDevice()) {
     // Coerce pointer arguments with default address space to CrossWorkGroup
@@ -167,18 +188,6 @@ ABIArgInfo SPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
     }
 
     if (isAggregateTypeForABI(Ty)) {
-      if (getTarget().getTriple().getVendor() == llvm::Triple::AMD)
-        // TODO: The AMDGPU kernel ABI passes aggregates byref, which is not
-        // currently expressible in SPIR-V; SPIR-V passes aggregates byval,
-        // which the AMDGPU kernel ABI does not allow. Passing aggregates as
-        // direct works around this impedance mismatch, as it retains type info
-        // and can be correctly handled, post reverse-translation, by the AMDGPU
-        // BE, which has to support this CC for legacy OpenCL purposes. It can
-        // be brittle and does lead to performance degradation in certain
-        // pathological cases. This will be revisited / optimised in the future,
-        // once a way to deal with the byref/byval impedance mismatch is
-        // identified.
-        return ABIArgInfo::getDirect(LTy, 0, nullptr, false);
       // Force copying aggregate type in kernel arguments by value when
       // compiling CUDA targeting SPIR-V. This is required for the object
       // copied to be valid on the device.
@@ -193,11 +202,150 @@ ABIArgInfo SPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
   return classifyArgumentType(Ty);
 }
 
-ABIArgInfo SPIRVABIInfo::classifyArgumentType(QualType Ty) const {
-  if (getTarget().getTriple().getVendor() != llvm::Triple::AMD)
-    return DefaultABIInfo::classifyArgumentType(Ty);
-  if (!isAggregateTypeForABI(Ty))
-    return DefaultABIInfo::classifyArgumentType(Ty);
+void SPIRVABIInfo::computeInfo(CGFunctionInfo &FI) const {
+  // The logic is the same as in DefaultABIInfo, with an exception for the
+  // handling of kernel arguments.
+  llvm::CallingConv::ID CC = FI.getCallingConvention();
+
+  if (!getCXXABI().classifyReturnType(FI))
+    FI.getReturnInfo() = classifyReturnType(FI.getReturnType());
+
+  for (auto &I : FI.arguments()) {
+    if (CC == llvm::CallingConv::SPIR_KERNEL) {
+      I.info = classifyKernelArgumentType(I.type);
+    } else {
+      I.info = classifyArgumentType(I.type);
+    }
+  }
+}
+
+unsigned AMDGCNSPIRVABIInfo::numRegsForType(QualType Ty) const {
+  // This duplicates the AMDGPUABI computation.
+  unsigned NumRegs = 0;
+
+  if (const VectorType *VT = Ty->getAs<VectorType>()) {
+    // Compute from the number of elements. The reported size is based on the
+    // in-memory size, which includes the padding 4th element for 3-vectors.
+    QualType EltTy = VT->getElementType();
+    unsigned EltSize = getContext().getTypeSize(EltTy);
+
+    // 16-bit element vectors should be passed as packed.
+    if (EltSize == 16)
+      return (VT->getNumElements() + 1) / 2;
+
+    unsigned EltNumRegs = (EltSize + 31) / 32;
+    return EltNumRegs * VT->getNumElements();
+  }
+
+  if (const auto *RD = Ty->getAsRecordDecl()) {
+    assert(!RD->hasFlexibleArrayMember());
+
+    for (const FieldDecl *Field : RD->fields()) {
+      QualType FieldTy = Field->getType();
+      NumRegs += numRegsForType(FieldTy);
+    }
+
+    return NumRegs;
+  }
+
+  return (getContext().getTypeSize(Ty) + 31) / 32;
+}
+
+llvm::Type *
+AMDGCNSPIRVABIInfo::coerceKernelArgumentType(llvm::Type *Ty, unsigned FromAS,
+                                             unsigned ToAS) const {
+  // Single value types.
+  auto *PtrTy = llvm::dyn_cast<llvm::PointerType>(Ty);
+  if (PtrTy && PtrTy->getAddressSpace() == FromAS)
+    return llvm::PointerType::get(Ty->getContext(), ToAS);
+  return Ty;
+}
+
+ABIArgInfo AMDGCNSPIRVABIInfo::classifyReturnType(QualType RetTy) const {
+  if (!isAggregateTypeForABI(RetTy) || getRecordArgABI(RetTy, getCXXABI()))
+    return DefaultABIInfo::classifyReturnType(RetTy);
+
+  // Ignore empty structs/unions.
+  if (isEmptyRecord(getContext(), RetTy, true))
+    return ABIArgInfo::getIgnore();
+
+  // Lower single-element structs to just return a regular value.
+  if (const Type *SeltTy = isSingleElementStruct(RetTy, getContext()))
+    return ABIArgInfo::getDirect(CGT.ConvertType(QualType(SeltTy, 0)));
+
+  if (const auto *RD = RetTy->getAsRecordDecl();
+      RD && RD->hasFlexibleArrayMember())
+    return DefaultABIInfo::classifyReturnType(RetTy);
+
+  // Pack aggregates <= 4 bytes into single VGPR or pair.
+  uint64_t Size = getContext().getTypeSize(RetTy);
+  if (Size <= 16)
+    return ABIArgInfo::getDirect(llvm::Type::getInt16Ty(getVMContext()));
+
+  if (Size <= 32)
+    return ABIArgInfo::getDirect(llvm::Type::getInt32Ty(getVMContext()));
+
+  // TODO: This is carried over from an AMDGPU oddity; we retain it to
+  //       ensure consistency, but it might be reasonable to return Int64.
+  if (Size <= 64) {
+    llvm::Type *I32Ty = llvm::Type::getInt32Ty(getVMContext());
+    return ABIArgInfo::getDirect(llvm::ArrayType::get(I32Ty, 2));
+  }
+
+  if (numRegsForType(RetTy) <= MaxNumRegsForArgsRet)
+    return ABIArgInfo::getDirect();
+  return DefaultABIInfo::classifyReturnType(RetTy);
+}
+
+/// For kernels all parameters are really passed in a special buffer. It doesn't
+/// make sense to pass anything byval, so everything must be direct.
+ABIArgInfo AMDGCNSPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
+  Ty = useFirstFieldIfTransparentUnion(Ty);
+
+  // TODO: Can we omit empty structs?
+
+  if (const Type *SeltTy = isSingleElementStruct(Ty, getContext()))
+    Ty = QualType(SeltTy, 0);
+
+  llvm::Type *OrigLTy = CGT.ConvertType(Ty);
+  llvm::Type *LTy = OrigLTy;
+  if (getContext().getLangOpts().isTargetDevice()) {
+    LTy = coerceKernelArgumentType(
+        OrigLTy, /*FromAS=*/getContext().getTargetAddressSpace(LangAS::Default),
+        /*ToAS=*/getContext().getTargetAddressSpace(LangAS::opencl_global));
+  }
+
+  // FIXME: This doesn't apply the optimization of coercing pointers in structs
+  // to global address space when using byref. This would require implementing a
+  // new kind of coercion of the in-memory type for indirect arguments.
+  if (LTy == OrigLTy && isAggregateTypeForABI(Ty)) {
+    return ABIArgInfo::getIndirectAliased(
+        getContext().getTypeAlignInChars(Ty),
+        getContext().getTargetAddressSpace(LangAS::opencl_constant),
+        false /*Realign*/, nullptr /*Padding*/);
+  }
+
+  // TODO: inhibiting flattening is an AMDGPU workaround for Clover, which might
+  //       be vestigial and should be revisited.
+  return ABIArgInfo::getDirect(LTy, 0, nullptr, false);
+}
+
+ABIArgInfo AMDGCNSPIRVABIInfo::classifyArgumentType(QualType Ty) const {
+  assert(NumRegsLeft <= MaxNumRegsForArgsRet && "register estimate underflow");
+
+  Ty = useFirstFieldIfTransparentUnion(Ty);
+
+  // TODO: support for variadics.
+
+  if (!isAggregateTypeForABI(Ty)) {
+    ABIArgInfo ArgInfo = DefaultABIInfo::classifyArgumentType(Ty);
+    if (!ArgInfo.isIndirect()) {
+      unsigned NumRegs = numRegsForType(Ty);
+      NumRegsLeft -= std::min(NumRegs, NumRegsLeft);
+    }
+
+    return ArgInfo;
+  }
 
   // Records with non-trivial destructors/copy-constructors should not be
   // passed by value.
@@ -205,37 +353,87 @@ ABIArgInfo SPIRVABIInfo::classifyArgumentType(QualType Ty) const {
     return getNaturalAlignIndirect(Ty, getDataLayout().getAllocaAddrSpace(),
                                    RAA == CGCXXABI::RAA_DirectInMemory);
 
+  // Ignore empty structs/unions.
+  if (isEmptyRecord(getContext(), Ty, true))
+    return ABIArgInfo::getIgnore();
+
+  // Lower single-element structs to just pass a regular value. TODO: We
+  // could do reasonable-size multiple-element structs too, using getExpand(),
+  // though watch out for things like bitfields.
+  if (const Type *SeltTy = isSingleElementStruct(Ty, getContext()))
+    return ABIArgInfo::getDirect(CGT.ConvertType(QualType(SeltTy, 0)));
+
   if (const auto *RD = Ty->getAsRecordDecl();
       RD && RD->hasFlexibleArrayMember())
     return DefaultABIInfo::classifyArgumentType(Ty);
 
-  return ABIArgInfo::getDirect(CGT.ConvertType(Ty), 0u, nullptr, false);
+  uint64_t Size = getContext().getTypeSize(Ty);
+  if (Size <= 64) {
+    // Pack aggregates <= 8 bytes into single VGPR or pair.
+    unsigned NumRegs = (Size + 31) / 32;
+    NumRegsLeft -= std::min(NumRegsLeft, NumRegs);
+
+    if (Size <= 16)
+      return ABIArgInfo::getDirect(llvm::Type::getInt16Ty(getVMContext()));
+
+    if (Size <= 32)
+      return ABIArgInfo::getDirect(llvm::Type::getInt32Ty(getVMContext()));
+
+    // TODO: This is an AMDGPU oddity and might be vestigial; we retain it to
+    //       ensure consistency, but it should be revisited.
+    llvm::Type *I32Ty = llvm::Type::getInt32Ty(getVMContext());
+    return ABIArgInfo::getDirect(llvm::ArrayType::get(I32Ty, 2));
+  }
+
+  if (NumRegsLeft > 0) {
+    unsigned NumRegs = numRegsForType(Ty);
+    if (NumRegsLeft >= NumRegs) {
+      NumRegsLeft -= NumRegs;
+      return ABIArgInfo::getDirect();
+    }
+  }
+
+  // Use pass-by-reference instead of pass-by-value for struct arguments in
+  // the function ABI.
+  return ABIArgInfo::getIndirectAliased(
+      getContext().getTypeAlignInChars(Ty),
+      getContext().getTargetAddressSpace(LangAS::opencl_private));
 }
 
-void SPIRVABIInfo::computeInfo(CGFunctionInfo &FI) const {
-  // The logic is same as in DefaultABIInfo with an exception on the kernel
-  // arguments handling.
+void AMDGCNSPIRVABIInfo::computeInfo(CGFunctionInfo &FI) const {
   llvm::CallingConv::ID CC = FI.getCallingConvention();
 
   if (!getCXXABI().classifyReturnType(FI))
     FI.getReturnInfo() = classifyReturnType(FI.getReturnType());
 
+  NumRegsLeft = MaxNumRegsForArgsRet;
   for (auto &I : FI.arguments()) {
-    if (CC == llvm::CallingConv::SPIR_KERNEL) {
+    if (CC == llvm::CallingConv::SPIR_KERNEL)
       I.info = classifyKernelArgumentType(I.type);
-    } else {
+    else
       I.info = classifyArgumentType(I.type);
-    }
   }
 }
 
+llvm::FixedVectorType *AMDGCNSPIRVABIInfo::getOptimalVectorMemoryType(
+    llvm::FixedVectorType *Ty, const LangOptions &LangOpt) const {
+  // AMDGPU has legal instructions for 96-bit so 3x32 can be supported.
+  if (Ty->getNumElements() == 3 && getDataLayout().getTypeSizeInBits(Ty) == 96)
+    return Ty;
+  return DefaultABIInfo::getOptimalVectorMemoryType(Ty, LangOpt);
+}
+
 namespace clang {
 namespace CodeGen {
 void computeSPIRKernelABIInfo(CodeGenModule &CGM, CGFunctionInfo &FI) {
-  if (CGM.getTarget().getTriple().isSPIRV())
-    SPIRVABIInfo(CGM.getTypes()).computeInfo(FI);
-  else
+  if (CGM.getTarget().getTriple().isSPIRV()) {
+    if (CGM.getTarget().getTriple().getVendor() == llvm::Triple::AMD)
+      AMDGCNSPIRVABIInfo(CGM.getTypes()).computeInfo(FI);
+    else
+      SPIRVABIInfo(CGM.getTypes()).computeInfo(FI);
+  } else {
     CommonSPIRABIInfo(CGM.getTypes()).computeInfo(FI);
+  }
 }
 }
 }
diff --git a/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu b/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
index a48affaec3c8a..bf45a353851b4 100644
--- a/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
+++ b/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
@@ -95,7 +95,7 @@ __global__ void kernel1(int *x) {
 // CHECK-NEXT:    store ptr addrspace(1) [[X_COERCE]], ptr [[X_ASCAST]], align 8
 // CHECK-NEXT:    [[X1:%.*]] = load ptr, ptr [[X_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[X1]], ptr [[X_ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[X_ADDR_ASCAST]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[X_ADDR_ASCAST]], align 8, !nonnull [[META4:![0-9]+]], !align [[META5:![0-9]+]]
 // CHECK-NEXT:    [[TMP1:%.*]] = load i32, ptr [[TMP0]], align 4
 // CHECK-NEXT:    [[INC:%.*]] = add nsw i32 [[TMP1]], 1
 // CHECK-NEXT:    store i32 [[INC]], ptr [[TMP0]], align 4
@@ -111,7 +111,7 @@ __global__ void kernel1(int *x) {
 // CHECK-SPIRV-NEXT:    store ptr addrspace(1) [[X_COERCE]], ptr addrspace(4) [[X_ASCAST]], align 8
 // CHECK-SPIRV-NEXT:    [[X1:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X_ASCAST]], align 8
 // CHECK-SPIRV-NEXT:    store ptr addrspace(4) [[X1]], ptr addrspace(4) [[X_ADDR_ASCAST]], align 8
-// CHECK-SPIRV-NEXT:    [[TMP0:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X_ADDR_ASCAST]], align 8
+// CHECK-SPIRV-NEXT:    [[TMP0:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X_ADDR_ASCAST]], align 8, !align [[META6:![0-9]+]]
 // CHECK-SPIRV-NEXT:    [[TMP1:%.*]] = load i32, ptr addrspace(4) [[TMP0]], align 4
 // CHECK-SPIRV-NEXT:    [[INC:%.*]] = add nsw i32 [[TMP1]], 1
 // CHECK-SPIRV-NEXT:    store i32 [[INC]], ptr addrspace(4) [[TMP0]], align 4
@@ -302,28 +302,23 @@ struct S {
 // CHECK-NEXT:    ret void
 //
 // CHECK-SPIRV-LABEL: define spir_kernel void @_Z7kernel41S(
-// CHECK-SPIRV-SAME: [[STRUCT_S:%.*]] [[S_COERCE:%.*]]) addrspace(4) #[[ATTR0]] !max_work_group_size [[META5]] {
+// CHECK-SPIRV-SAME: ptr addrspace(2) noundef byref([[STRUCT_S:%.*]]) align 8 [[TMP0:%.*]]) addrspace(4) #[[ATTR0]] !max_work_group_size [[META5]] {
 // CHECK-SPIRV-NEXT:  [[ENTRY:.*:]]
-// CHECK-SPIRV-NEXT:    [[S:%.*]] = alloca [[STRUCT_S]], align 8
-// CHECK-SPIRV-NEXT:    [[S1:%.*]] = addrspacecast ptr [[S]] to ptr addrspace(4)
-// CHECK-SPIRV-NEXT:    [[TMP0:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 0
-// CHECK-SPIRV-NEXT:    [[TMP1:%.*]] = extractvalue [[STRUCT_S]] [[S_COERCE]], 0
-// CHECK-SPIRV-NEXT:    store ptr addrspace(4) [[TMP1]], ptr addrspace(4) [[TMP0]], align 8
-// CHECK-SPIRV-NEXT:    [[TMP2:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 1
-// CHECK-SPIRV-NEXT:    [[TMP3:%.*]] = extractvalue [[STRUCT_S]] [[S_COERCE]], 1
-// CHECK-SPIRV-NEXT:    store ptr addrspace(4) [[TMP3]], ptr addrspace(4) [[TMP2]], align 8
-// CHECK-SPIRV-NEXT:    [[X:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 0
-// CHECK-SPIRV-NEXT:    [[TMP4:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X]], align 8
-// CHECK-SPIRV-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP4]], i64 0
-// CHECK-SPIRV-NEXT:    [[TMP5:%.*]] = load i32, ptr addrspace(4) [[ARRAYIDX]], align 4
-// CHECK-SPIRV-NEXT:    [[INC:%.*]] = add nsw i32 [[TMP5]], 1
+// CHECK-SPIRV-NEXT:    [[COERCE:%.*]] = alloca [[STRUCT_S]], align 8
+// CHECK-SPIRV-NEXT:    [[S:%.*]] = addrspacecast ptr [[COERCE]] to ptr addrspace(4)
+// CHECK-SPIRV-NEXT:    call addrspace(4) void @llvm.memcpy.p4.p2.i64(ptr addrspace(4) align 8 [[S]], ptr addrspace(2) align 8 [[TMP0]], i64 16, i1 false)
+// CHECK-SPIRV-NEXT:    [[X:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S]], i32 0, i32 0
+// CHECK-SPIRV-NEXT:    [[TMP1:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X]], align 8
+// CHECK-SPIRV-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP1]], i64 0
+// CHECK-SPIRV-NEXT:    [[TMP2:%.*]] = load i32, ptr addrspace(4) [[ARRAYIDX]], align 4
+// CHECK-SPIRV-NEXT:    [[INC:%.*]] = add nsw i32 [[TMP2]], 1
 // CHECK-SPIRV-NEXT:    store i32 [[INC]], ptr addrspace(4) [[ARRAYIDX]], align 4
-// CHECK-SPIRV-NEXT:    [[Y:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 1
-// CHECK-SPIRV-NEXT:    [[TMP6:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[Y]], align 8
-// CHECK-SPIRV-NEXT:    [[ARRAYIDX2:%.*]] = getelementptr inbounds float, ptr addrspace(4) [[TMP6]], i64 0
-// CHECK-SPIRV-NEXT:    [[TMP7:%.*]] = load float, ptr addrspace(4) [[ARRAYIDX2]], align 4
-// CHECK-SPIRV-NEXT:    [[ADD:%.*]] = fadd contract float [[TMP7]], 1.000000e+00
-// CHECK-SPIRV-NEXT:    store float [[ADD]], ptr addrspace(4) [[ARRAYIDX2]], align 4
+// CHECK-SPIRV-NEXT:    [[Y:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S]], i32 0, i32 1
+// CHECK-SPIRV-NEXT:    [[TMP3:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[Y]], align 8
+// CHECK-SPIRV-NEXT:    [[ARRAYIDX1:%.*]] = getelementptr inbounds float, ptr addrspace(4) [[TMP3]], i64 0
+// CHECK-SPIRV-NEXT:    [[TMP4:%.*]] = load float, ptr addrspace(4) [[ARRAYIDX1]], align 4
+// CHECK-SPIRV-NEXT:    [[ADD:%.*]] = fadd contract float [[TMP4]], 1.000000e+00
+// CHECK-SPIRV-NEXT:    store float [[ADD]], ptr addrspace(4) [[ARRAYIDX1]], align 4
 // CHECK-SPIRV-NEXT:  ...
[truncated]
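
For non-kernel arguments, the classification added above follows the AMDGPU register-budget heuristic: sizes are rounded up to 32-bit registers against a per-function budget of 16, aggregates of 64 bits or less are always packed and passed direct, and larger aggregates are passed direct only while the budget holds out, falling back to pass-by-reference otherwise. The following is a minimal standalone sketch of that accounting (not the actual Clang code; it glosses over the empty-record, single-element-struct and flexible-array special cases, and takes a pre-computed bit size instead of walking fields the way numRegsForType does):

    #include <algorithm>
    #include <cstdint>

    constexpr unsigned MaxNumRegsForArgsRet = 16; // 16 32-bit registers

    // Round a bit size up to whole 32-bit registers.
    unsigned numRegsForSize(std::uint64_t SizeInBits) {
      return static_cast<unsigned>((SizeInBits + 31) / 32);
    }

    enum class ArgKind { Direct, Indirect };

    // Hypothetical helper mirroring the aggregate branch of
    // AMDGCNSPIRVABIInfo::classifyArgumentType.
    ArgKind classifyAggregate(std::uint64_t SizeInBits, unsigned &NumRegsLeft) {
      if (SizeInBits <= 64) {
        // Small aggregates are packed into i16/i32/[2 x i32] and passed direct;
        // they still consume register budget.
        NumRegsLeft -= std::min(NumRegsLeft, numRegsForSize(SizeInBits));
        return ArgKind::Direct;
      }
      unsigned NumRegs = numRegsForSize(SizeInBits);
      if (NumRegsLeft >= NumRegs) { // Large aggregate still fits the budget.
        NumRegsLeft -= NumRegs;
        return ArgKind::Direct;
      }
      return ArgKind::Indirect;     // Otherwise pass byref (private address space).
    }

As in the patch, the budget would be reset at the start of each computeInfo run, e.g. `unsigned NumRegsLeft = MaxNumRegsForArgsRet;` before classifying the arguments.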

@llvmbot
Member

llvmbot commented Nov 28, 2025

@llvm/pr-subscribers-clang-codegen


@llvmbot
Member

llvmbot commented Nov 28, 2025

@llvm/pr-subscribers-backend-spir-v

Author: Alex Voicu (AlexVlx)

Changes

At the moment AMDGCN flavoured SPIRV uses the SPIRV ABI with some tweaks revolving around passing aggregates as direct. This is problematic in multiple ways:

  • it leads to divergence from code compiled for a concrete target, which makes it difficult to debug;
  • it incurs a run time cost, when dealing with larger aggregates;
  • it incurs a compile time cost, when dealing with larger aggregates.

This patch switches over AMDGCN flavoured SPIRV to implement the AMDGPU ABI (except for dealing with variadic functions, which will be added in the future). One additional complication (and the primary motivation behind the current less than ideal state of affairs) stems from byref, which AMDGPU uses, not being expressible in SPIR-V. We deal with this by CodeGen-ing for byref, lowering it to the FuncParamAttr ByVal in SPIR-V, and restoring it when doing reverse translation from AMDGCN flavoured SPIR-V.


Patch is 50.43 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/169865.diff

6 Files Affected:

  • (modified) clang/lib/CodeGen/Targets/SPIR.cpp (+245-47)
  • (modified) clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu (+71-73)
  • (modified) clang/test/CodeGenCUDA/kernel-args.cu (+4-4)
  • (added) clang/test/CodeGenHIP/amdgcnspirv-uses-amdgpu-abi.cpp (+321)
  • (modified) llvm/lib/Target/SPIRV/SPIRVCallLowering.cpp (+4-1)
  • (added) llvm/test/CodeGen/SPIRV/pointers/ptr-argument-byref-amdgcnspirv.ll (+24)
diff --git a/clang/lib/CodeGen/Targets/SPIR.cpp b/clang/lib/CodeGen/Targets/SPIR.cpp
index 1a8c85d8871ec..3540093074bfe 100644
--- a/clang/lib/CodeGen/Targets/SPIR.cpp
+++ b/clang/lib/CodeGen/Targets/SPIR.cpp
@@ -9,6 +9,11 @@
 #include "ABIInfoImpl.h"
 #include "HLSLBufferLayoutBuilder.h"
 #include "TargetInfo.h"
+#include "clang/Basic/LangOptions.h"
+#include "llvm/IR/DerivedTypes.h"
+
+#include <stdint.h>
+#include <utility>
 
 using namespace clang;
 using namespace clang::CodeGen;
@@ -33,9 +38,41 @@ class SPIRVABIInfo : public CommonSPIRABIInfo {
   void computeInfo(CGFunctionInfo &FI) const override;
 
 private:
+  ABIArgInfo classifyKernelArgumentType(QualType Ty) const;
+};
+
+class AMDGCNSPIRVABIInfo : public SPIRVABIInfo {
+  // TODO: this should be unified / shared with AMDGPU, ideally we'd like to
+  //       re-use AMDGPUABIInfo eventually, rather than duplicate.
+  static constexpr unsigned MaxNumRegsForArgsRet = 16; // 16 32-bit registers
+  mutable unsigned NumRegsLeft = 0;
+
+  unsigned numRegsForType(QualType Ty) const;
+
+  bool isHomogeneousAggregateBaseType(QualType Ty) const override {
+    return true;
+  }
+  bool isHomogeneousAggregateSmallEnough(const Type *Base,
+                                         uint64_t Members) const override {
+    uint32_t NumRegs = (getContext().getTypeSize(Base) + 31) / 32;
+
+    // Homogeneous Aggregates may occupy at most 16 registers.
+    return Members * NumRegs <= MaxNumRegsForArgsRet;
+  }
+
+  // Coerce HIP scalar pointer arguments from generic pointers to global ones.
+  llvm::Type *coerceKernelArgumentType(llvm::Type *Ty, unsigned FromAS,
+                                       unsigned ToAS) const;
+
   ABIArgInfo classifyReturnType(QualType RetTy) const;
   ABIArgInfo classifyKernelArgumentType(QualType Ty) const;
   ABIArgInfo classifyArgumentType(QualType Ty) const;
+public:
+  AMDGCNSPIRVABIInfo(CodeGenTypes &CGT) : SPIRVABIInfo(CGT) {}
+  void computeInfo(CGFunctionInfo &FI) const override;
+
+  llvm::FixedVectorType *getOptimalVectorMemoryType(
+      llvm::FixedVectorType *Ty, const LangOptions &LangOpt) const override;
 };
 } // end anonymous namespace
 namespace {
@@ -83,7 +120,10 @@ class CommonSPIRTargetCodeGenInfo : public TargetCodeGenInfo {
 class SPIRVTargetCodeGenInfo : public CommonSPIRTargetCodeGenInfo {
 public:
   SPIRVTargetCodeGenInfo(CodeGen::CodeGenTypes &CGT)
-      : CommonSPIRTargetCodeGenInfo(std::make_unique<SPIRVABIInfo>(CGT)) {}
+      : CommonSPIRTargetCodeGenInfo(
+            (CGT.getTarget().getTriple().getVendor() == llvm::Triple::AMD)
+                ? std::make_unique<AMDGCNSPIRVABIInfo>(CGT)
+                : std::make_unique<SPIRVABIInfo>(CGT)) {}
   void setCUDAKernelCallingConvention(const FunctionType *&FT) const override;
   LangAS getGlobalVarAddressSpace(CodeGenModule &CGM,
                                   const VarDecl *D) const override;
@@ -132,25 +172,6 @@ void CommonSPIRABIInfo::setCCs() {
   RuntimeCC = llvm::CallingConv::SPIR_FUNC;
 }
 
-ABIArgInfo SPIRVABIInfo::classifyReturnType(QualType RetTy) const {
-  if (getTarget().getTriple().getVendor() != llvm::Triple::AMD)
-    return DefaultABIInfo::classifyReturnType(RetTy);
-  if (!isAggregateTypeForABI(RetTy) || getRecordArgABI(RetTy, getCXXABI()))
-    return DefaultABIInfo::classifyReturnType(RetTy);
-
-  if (const auto *RD = RetTy->getAsRecordDecl();
-      RD && RD->hasFlexibleArrayMember())
-    return DefaultABIInfo::classifyReturnType(RetTy);
-
-  // TODO: The AMDGPU ABI is non-trivial to represent in SPIR-V; in order to
-  // avoid encoding various architecture specific bits here we return everything
-  // as direct to retain type info for things like aggregates, for later perusal
-  // when translating back to LLVM/lowering in the BE. This is also why we
-  // disable flattening as the outcomes can mismatch between SPIR-V and AMDGPU.
-  // This will be revisited / optimised in the future.
-  return ABIArgInfo::getDirect(CGT.ConvertType(RetTy), 0u, nullptr, false);
-}
-
 ABIArgInfo SPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
   if (getContext().getLangOpts().isTargetDevice()) {
     // Coerce pointer arguments with default address space to CrossWorkGroup
@@ -167,18 +188,6 @@ ABIArgInfo SPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
     }
 
     if (isAggregateTypeForABI(Ty)) {
-      if (getTarget().getTriple().getVendor() == llvm::Triple::AMD)
-        // TODO: The AMDGPU kernel ABI passes aggregates byref, which is not
-        // currently expressible in SPIR-V; SPIR-V passes aggregates byval,
-        // which the AMDGPU kernel ABI does not allow. Passing aggregates as
-        // direct works around this impedance mismatch, as it retains type info
-        // and can be correctly handled, post reverse-translation, by the AMDGPU
-        // BE, which has to support this CC for legacy OpenCL purposes. It can
-        // be brittle and does lead to performance degradation in certain
-        // pathological cases. This will be revisited / optimised in the future,
-        // once a way to deal with the byref/byval impedance mismatch is
-        // identified.
-        return ABIArgInfo::getDirect(LTy, 0, nullptr, false);
       // Force copying aggregate type in kernel arguments by value when
       // compiling CUDA targeting SPIR-V. This is required for the object
       // copied to be valid on the device.
@@ -193,11 +202,150 @@ ABIArgInfo SPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
   return classifyArgumentType(Ty);
 }
 
-ABIArgInfo SPIRVABIInfo::classifyArgumentType(QualType Ty) const {
-  if (getTarget().getTriple().getVendor() != llvm::Triple::AMD)
-    return DefaultABIInfo::classifyArgumentType(Ty);
-  if (!isAggregateTypeForABI(Ty))
-    return DefaultABIInfo::classifyArgumentType(Ty);
+void SPIRVABIInfo::computeInfo(CGFunctionInfo &FI) const {
+  // The logic is same as in DefaultABIInfo with an exception on the kernel
+  // arguments handling.
+  llvm::CallingConv::ID CC = FI.getCallingConvention();
+
+  if (!getCXXABI().classifyReturnType(FI))
+    FI.getReturnInfo() = classifyReturnType(FI.getReturnType());
+
+  for (auto &I : FI.arguments()) {
+    if (CC == llvm::CallingConv::SPIR_KERNEL) {
+      I.info = classifyKernelArgumentType(I.type);
+    } else {
+      I.info = classifyArgumentType(I.type);
+    }
+  }
+}
+
+unsigned AMDGCNSPIRVABIInfo::numRegsForType(QualType Ty) const {
+  // This duplicates the AMDGPUABI computation.
+  unsigned NumRegs = 0;
+
+  if (const VectorType *VT = Ty->getAs<VectorType>()) {
+    // Compute from the number of elements. The reported size is based on the
+    // in-memory size, which includes the padding 4th element for 3-vectors.
+    QualType EltTy = VT->getElementType();
+    unsigned EltSize = getContext().getTypeSize(EltTy);
+
+    // 16-bit element vectors should be passed as packed.
+    if (EltSize == 16)
+      return (VT->getNumElements() + 1) / 2;
+
+    unsigned EltNumRegs = (EltSize + 31) / 32;
+    return EltNumRegs * VT->getNumElements();
+  }
+
+  if (const auto *RD = Ty->getAsRecordDecl()) {
+    assert(!RD->hasFlexibleArrayMember());
+
+    for (const FieldDecl *Field : RD->fields()) {
+      QualType FieldTy = Field->getType();
+      NumRegs += numRegsForType(FieldTy);
+    }
+
+    return NumRegs;
+  }
+
+  return (getContext().getTypeSize(Ty) + 31) / 32;
+}
+
+llvm::Type *
+AMDGCNSPIRVABIInfo::coerceKernelArgumentType(llvm::Type *Ty, unsigned FromAS,
+                                             unsigned ToAS) const {
+  // Single value types.
+  auto *PtrTy = llvm::dyn_cast<llvm::PointerType>(Ty);
+  if (PtrTy && PtrTy->getAddressSpace() == FromAS)
+    return llvm::PointerType::get(Ty->getContext(), ToAS);
+  return Ty;
+}
+
+ABIArgInfo AMDGCNSPIRVABIInfo::classifyReturnType(QualType RetTy) const {
+  if (!isAggregateTypeForABI(RetTy) || getRecordArgABI(RetTy, getCXXABI()))
+    return DefaultABIInfo::classifyReturnType(RetTy);
+
+  // Ignore empty structs/unions.
+  if (isEmptyRecord(getContext(), RetTy, true))
+    return ABIArgInfo::getIgnore();
+
+  // Lower single-element structs to just return a regular value.
+  if (const Type *SeltTy = isSingleElementStruct(RetTy, getContext()))
+    return ABIArgInfo::getDirect(CGT.ConvertType(QualType(SeltTy, 0)));
+
+  if (const auto *RD = RetTy->getAsRecordDecl();
+      RD && RD->hasFlexibleArrayMember())
+    return DefaultABIInfo::classifyReturnType(RetTy);
+
+  // Pack aggregates <= 4 bytes into single VGPR or pair.
+  uint64_t Size = getContext().getTypeSize(RetTy);
+  if (Size <= 16)
+    return ABIArgInfo::getDirect(llvm::Type::getInt16Ty(getVMContext()));
+
+  if (Size <= 32)
+    return ABIArgInfo::getDirect(llvm::Type::getInt32Ty(getVMContext()));
+
+  // TODO: This carried over from AMDGPU oddity, we retain it to
+  //       ensure consistency, but it might be reasonable to return Int64.
+  if (Size <= 64) {
+    llvm::Type *I32Ty = llvm::Type::getInt32Ty(getVMContext());
+    return ABIArgInfo::getDirect(llvm::ArrayType::get(I32Ty, 2));
+  }
+
+  if (numRegsForType(RetTy) <= MaxNumRegsForArgsRet)
+    return ABIArgInfo::getDirect();
+  return DefaultABIInfo::classifyReturnType(RetTy);
+}
+
+/// For kernels all parameters are really passed in a special buffer. It doesn't
+/// make sense to pass anything byval, so everything must be direct.
+ABIArgInfo AMDGCNSPIRVABIInfo::classifyKernelArgumentType(QualType Ty) const {
+  Ty = useFirstFieldIfTransparentUnion(Ty);
+
+  // TODO: Can we omit empty structs?
+
+  if (const Type *SeltTy = isSingleElementStruct(Ty, getContext()))
+    Ty = QualType(SeltTy, 0);
+
+  llvm::Type *OrigLTy = CGT.ConvertType(Ty);
+  llvm::Type *LTy = OrigLTy;
+  if (getContext().getLangOpts().isTargetDevice()) {
+    LTy = coerceKernelArgumentType(
+        OrigLTy, /*FromAS=*/getContext().getTargetAddressSpace(LangAS::Default),
+        /*ToAS=*/getContext().getTargetAddressSpace(LangAS::opencl_global));
+  }
+
+  // FIXME: This doesn't apply the optimization of coercing pointers in structs
+  // to global address space when using byref. This would require implementing a
+  // new kind of coercion of the in-memory type for indirect arguments.
+  if (LTy == OrigLTy && isAggregateTypeForABI(Ty)) {
+    return ABIArgInfo::getIndirectAliased(
+        getContext().getTypeAlignInChars(Ty),
+        getContext().getTargetAddressSpace(LangAS::opencl_constant),
+        false /*Realign*/, nullptr /*Padding*/);
+  }
+
+  // TODO: inhibiting flattening is an AMDGPU workaround for Clover, which might
+  //       be vestigial and should be revisited.
+  return ABIArgInfo::getDirect(LTy, 0, nullptr, false);
+}
+
+ABIArgInfo AMDGCNSPIRVABIInfo::classifyArgumentType(QualType Ty) const {
+  assert(NumRegsLeft <= MaxNumRegsForArgsRet && "register estimate underflow");
+
+  Ty = useFirstFieldIfTransparentUnion(Ty);
+
+  // TODO: support for variadics.
+
+  if (!isAggregateTypeForABI(Ty)) {
+    ABIArgInfo ArgInfo = DefaultABIInfo::classifyArgumentType(Ty);
+    if (!ArgInfo.isIndirect()) {
+      unsigned NumRegs = numRegsForType(Ty);
+      NumRegsLeft -= std::min(NumRegs, NumRegsLeft);
+    }
+
+    return ArgInfo;
+  }
 
   // Records with non-trivial destructors/copy-constructors should not be
   // passed by value.
@@ -205,37 +353,87 @@ ABIArgInfo SPIRVABIInfo::classifyArgumentType(QualType Ty) const {
     return getNaturalAlignIndirect(Ty, getDataLayout().getAllocaAddrSpace(),
                                    RAA == CGCXXABI::RAA_DirectInMemory);
 
+  // Ignore empty structs/unions.
+  if (isEmptyRecord(getContext(), Ty, true))
+    return ABIArgInfo::getIgnore();
+
+  // Lower single-element structs to just pass a regular value. TODO: We
+  // could do reasonable-size multiple-element structs too, using getExpand(),
+  // though watch out for things like bitfields.
+  if (const Type *SeltTy = isSingleElementStruct(Ty, getContext()))
+    return ABIArgInfo::getDirect(CGT.ConvertType(QualType(SeltTy, 0)));
+
   if (const auto *RD = Ty->getAsRecordDecl();
       RD && RD->hasFlexibleArrayMember())
     return DefaultABIInfo::classifyArgumentType(Ty);
 
-  return ABIArgInfo::getDirect(CGT.ConvertType(Ty), 0u, nullptr, false);
+  uint64_t Size = getContext().getTypeSize(Ty);
+  if (Size <= 64) {
+    // Pack aggregates <= 8 bytes into single VGPR or pair.
+    unsigned NumRegs = (Size + 31) / 32;
+    NumRegsLeft -= std::min(NumRegsLeft, NumRegs);
+
+    if (Size <= 16)
+      return ABIArgInfo::getDirect(llvm::Type::getInt16Ty(getVMContext()));
+
+    if (Size <= 32)
+      return ABIArgInfo::getDirect(llvm::Type::getInt32Ty(getVMContext()));
+
+    // TODO: This is an AMDGPU oddity, and might be vestigial, we retain it to
+    //       ensure consistency, but it should be revisited.
+    llvm::Type *I32Ty = llvm::Type::getInt32Ty(getVMContext());
+    return ABIArgInfo::getDirect(llvm::ArrayType::get(I32Ty, 2));
+  }
+
+  if (NumRegsLeft > 0) {
+    unsigned NumRegs = numRegsForType(Ty);
+    if (NumRegsLeft >= NumRegs) {
+      NumRegsLeft -= NumRegs;
+      return ABIArgInfo::getDirect();
+    }
+  }
+
+  // Use pass-by-reference instead of pass-by-value for struct arguments in
+  // function ABI.
+  return ABIArgInfo::getIndirectAliased(
+      getContext().getTypeAlignInChars(Ty),
+      getContext().getTargetAddressSpace(LangAS::opencl_private));
 }
 
-void SPIRVABIInfo::computeInfo(CGFunctionInfo &FI) const {
-  // The logic is same as in DefaultABIInfo with an exception on the kernel
-  // arguments handling.
+void AMDGCNSPIRVABIInfo::computeInfo(CGFunctionInfo &FI) const {
   llvm::CallingConv::ID CC = FI.getCallingConvention();
 
   if (!getCXXABI().classifyReturnType(FI))
     FI.getReturnInfo() = classifyReturnType(FI.getReturnType());
 
+  NumRegsLeft = MaxNumRegsForArgsRet;
   for (auto &I : FI.arguments()) {
-    if (CC == llvm::CallingConv::SPIR_KERNEL) {
+    if (CC == llvm::CallingConv::SPIR_KERNEL)
       I.info = classifyKernelArgumentType(I.type);
-    } else {
+    else
       I.info = classifyArgumentType(I.type);
-    }
   }
 }
 
+llvm::FixedVectorType *AMDGCNSPIRVABIInfo::getOptimalVectorMemoryType(
+    llvm::FixedVectorType *Ty, const LangOptions &LangOpt) const {
+  // AMDGPU has legal instructions for 96-bit so 3x32 can be supported.
+  if (Ty->getNumElements() == 3 && getDataLayout().getTypeSizeInBits(Ty) == 96)
+    return Ty;
+  return DefaultABIInfo::getOptimalVectorMemoryType(Ty, LangOpt);
+}
+
 namespace clang {
 namespace CodeGen {
 void computeSPIRKernelABIInfo(CodeGenModule &CGM, CGFunctionInfo &FI) {
-  if (CGM.getTarget().getTriple().isSPIRV())
-    SPIRVABIInfo(CGM.getTypes()).computeInfo(FI);
-  else
+  if (CGM.getTarget().getTriple().isSPIRV()) {
+    if (CGM.getTarget().getTriple().getVendor() == llvm::Triple::AMD)
+      AMDGCNSPIRVABIInfo(CGM.getTypes()).computeInfo(FI);
+    else
+      SPIRVABIInfo(CGM.getTypes()).computeInfo(FI);
+  } else {
     CommonSPIRABIInfo(CGM.getTypes()).computeInfo(FI);
+  }
 }
 }
 }
diff --git a/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu b/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
index a48affaec3c8a..bf45a353851b4 100644
--- a/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
+++ b/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
@@ -95,7 +95,7 @@ __global__ void kernel1(int *x) {
 // CHECK-NEXT:    store ptr addrspace(1) [[X_COERCE]], ptr [[X_ASCAST]], align 8
 // CHECK-NEXT:    [[X1:%.*]] = load ptr, ptr [[X_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[X1]], ptr [[X_ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[X_ADDR_ASCAST]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[X_ADDR_ASCAST]], align 8, !nonnull [[META4:![0-9]+]], !align [[META5:![0-9]+]]
 // CHECK-NEXT:    [[TMP1:%.*]] = load i32, ptr [[TMP0]], align 4
 // CHECK-NEXT:    [[INC:%.*]] = add nsw i32 [[TMP1]], 1
 // CHECK-NEXT:    store i32 [[INC]], ptr [[TMP0]], align 4
@@ -111,7 +111,7 @@ __global__ void kernel1(int *x) {
 // CHECK-SPIRV-NEXT:    store ptr addrspace(1) [[X_COERCE]], ptr addrspace(4) [[X_ASCAST]], align 8
 // CHECK-SPIRV-NEXT:    [[X1:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X_ASCAST]], align 8
 // CHECK-SPIRV-NEXT:    store ptr addrspace(4) [[X1]], ptr addrspace(4) [[X_ADDR_ASCAST]], align 8
-// CHECK-SPIRV-NEXT:    [[TMP0:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X_ADDR_ASCAST]], align 8
+// CHECK-SPIRV-NEXT:    [[TMP0:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X_ADDR_ASCAST]], align 8, !align [[META6:![0-9]+]]
 // CHECK-SPIRV-NEXT:    [[TMP1:%.*]] = load i32, ptr addrspace(4) [[TMP0]], align 4
 // CHECK-SPIRV-NEXT:    [[INC:%.*]] = add nsw i32 [[TMP1]], 1
 // CHECK-SPIRV-NEXT:    store i32 [[INC]], ptr addrspace(4) [[TMP0]], align 4
@@ -302,28 +302,23 @@ struct S {
 // CHECK-NEXT:    ret void
 //
 // CHECK-SPIRV-LABEL: define spir_kernel void @_Z7kernel41S(
-// CHECK-SPIRV-SAME: [[STRUCT_S:%.*]] [[S_COERCE:%.*]]) addrspace(4) #[[ATTR0]] !max_work_group_size [[META5]] {
+// CHECK-SPIRV-SAME: ptr addrspace(2) noundef byref([[STRUCT_S:%.*]]) align 8 [[TMP0:%.*]]) addrspace(4) #[[ATTR0]] !max_work_group_size [[META5]] {
 // CHECK-SPIRV-NEXT:  [[ENTRY:.*:]]
-// CHECK-SPIRV-NEXT:    [[S:%.*]] = alloca [[STRUCT_S]], align 8
-// CHECK-SPIRV-NEXT:    [[S1:%.*]] = addrspacecast ptr [[S]] to ptr addrspace(4)
-// CHECK-SPIRV-NEXT:    [[TMP0:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 0
-// CHECK-SPIRV-NEXT:    [[TMP1:%.*]] = extractvalue [[STRUCT_S]] [[S_COERCE]], 0
-// CHECK-SPIRV-NEXT:    store ptr addrspace(4) [[TMP1]], ptr addrspace(4) [[TMP0]], align 8
-// CHECK-SPIRV-NEXT:    [[TMP2:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 1
-// CHECK-SPIRV-NEXT:    [[TMP3:%.*]] = extractvalue [[STRUCT_S]] [[S_COERCE]], 1
-// CHECK-SPIRV-NEXT:    store ptr addrspace(4) [[TMP3]], ptr addrspace(4) [[TMP2]], align 8
-// CHECK-SPIRV-NEXT:    [[X:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 0
-// CHECK-SPIRV-NEXT:    [[TMP4:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X]], align 8
-// CHECK-SPIRV-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP4]], i64 0
-// CHECK-SPIRV-NEXT:    [[TMP5:%.*]] = load i32, ptr addrspace(4) [[ARRAYIDX]], align 4
-// CHECK-SPIRV-NEXT:    [[INC:%.*]] = add nsw i32 [[TMP5]], 1
+// CHECK-SPIRV-NEXT:    [[COERCE:%.*]] = alloca [[STRUCT_S]], align 8
+// CHECK-SPIRV-NEXT:    [[S:%.*]] = addrspacecast ptr [[COERCE]] to ptr addrspace(4)
+// CHECK-SPIRV-NEXT:    call addrspace(4) void @llvm.memcpy.p4.p2.i64(ptr addrspace(4) align 8 [[S]], ptr addrspace(2) align 8 [[TMP0]], i64 16, i1 false)
+// CHECK-SPIRV-NEXT:    [[X:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S]], i32 0, i32 0
+// CHECK-SPIRV-NEXT:    [[TMP1:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[X]], align 8
+// CHECK-SPIRV-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP1]], i64 0
+// CHECK-SPIRV-NEXT:    [[TMP2:%.*]] = load i32, ptr addrspace(4) [[ARRAYIDX]], align 4
+// CHECK-SPIRV-NEXT:    [[INC:%.*]] = add nsw i32 [[TMP2]], 1
 // CHECK-SPIRV-NEXT:    store i32 [[INC]], ptr addrspace(4) [[ARRAYIDX]], align 4
-// CHECK-SPIRV-NEXT:    [[Y:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S1]], i32 0, i32 1
-// CHECK-SPIRV-NEXT:    [[TMP6:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[Y]], align 8
-// CHECK-SPIRV-NEXT:    [[ARRAYIDX2:%.*]] = getelementptr inbounds float, ptr addrspace(4) [[TMP6]], i64 0
-// CHECK-SPIRV-NEXT:    [[TMP7:%.*]] = load float, ptr addrspace(4) [[ARRAYIDX2]], align 4
-// CHECK-SPIRV-NEXT:    [[ADD:%.*]] = fadd contract float [[TMP7]], 1.000000e+00
-// CHECK-SPIRV-NEXT:    store float [[ADD]], ptr addrspace(4) [[ARRAYIDX2]], align 4
+// CHECK-SPIRV-NEXT:    [[Y:%.*]] = getelementptr inbounds nuw [[STRUCT_S]], ptr addrspace(4) [[S]], i32 0, i32 1
+// CHECK-SPIRV-NEXT:    [[TMP3:%.*]] = load ptr addrspace(4), ptr addrspace(4) [[Y]], align 8
+// CHECK-SPIRV-NEXT:    [[ARRAYIDX1:%.*]] = getelementptr inbounds float, ptr addrspace(4) [[TMP3]], i64 0
+// CHECK-SPIRV-NEXT:    [[TMP4:%.*]] = load float, ptr addrspace(4) [[ARRAYIDX1]], align 4
+// CHECK-SPIRV-NEXT:    [[ADD:%.*]] = fadd contract float [[TMP4]], 1.000000e+00
+// CHECK-SPIRV-NEXT:    store float [[ADD]], ptr addrspace(4) [[ARRAYIDX1]], align 4
 // CHECK-SPIRV-NEXT:  ...
[truncated]
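To make the test delta above concrete, the kernel being checked is roughly the following (reconstructed from the CHECK-SPIRV lines, so the struct definition and kernel body are assumptions, and launch-bounds attributes are omitted); the trailing comments show how the lowering of its aggregate parameter changes.

// Reconstructed from the CHECK-SPIRV lines above; treat names and layout as assumptions.
struct S {
  int *x;
  float *y;
};

__global__ void kernel4(S s) {
  s.x[0]++;        // the new lowering first memcpys the byref kernarg into a local copy
  s.y[0] += 1.0f;
}

// Previous AMDGCN-flavoured SPIR-V lowering (aggregate passed direct):
//   define spir_kernel void @_Z7kernel41S(%struct.S %s.coerce) addrspace(4)
// New lowering (matches the AMDGPU kernel ABI):
//   define spir_kernel void @_Z7kernel41S(ptr addrspace(2) noundef byref(%struct.S) align 8 %0) addrspace(4)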

@github-actions

github-actions bot commented Nov 28, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Contributor

@maarquitos14 maarquitos14 left a comment


SPIRV part LGTM, just one nit.

SPIRV::Decoration::FuncParamAttr, {Attr});
}
if (Arg.hasAttribute(Attribute::ByVal)) {
if (Arg.hasAttribute(Attribute::ByVal) ||
Contributor


Can we add a comment here explaining why you need this for AMD, so that people can easily understand when reading this code? I fear people might not understand, or might simply think it's wrong, to lower ByRef to ByVal.

Contributor Author


Done, thank you for the suggestion.
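For readers unfamiliar with the SPIR-V side, the mapping discussed in this thread can be sketched as a stand-alone illustration. The enum names below are hypothetical and the numeric values only mirror SPIR-V's FunctionParameterAttribute operands as recalled from the spec; this is not the actual SPIRVCallLowering code.

// Hypothetical, self-contained sketch (not the backend's real types): byref
// parameters, which SPIR-V cannot express, are decorated the same way as byval
// ones, and byref is restored again during reverse translation.
#include <optional>

enum class IRParamAttr { ByVal, ByRef, SRet, Other };
enum class SPIRVFuncParamAttr : unsigned { ByVal = 2, Sret = 3 };

inline std::optional<SPIRVFuncParamAttr> lowerParamAttr(IRParamAttr A) {
  switch (A) {
  case IRParamAttr::ByVal:
  case IRParamAttr::ByRef: // lowered to FuncParamAttr ByVal, just like byval
    return SPIRVFuncParamAttr::ByVal;
  case IRParamAttr::SRet:
    return SPIRVFuncParamAttr::Sret;
  default:
    return std::nullopt;
  }
}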

Contributor

@jmmartinez jmmartinez left a comment


It looks good to me (but I'm also not super familiar with the ABI side)

Contributor

@Keenuts Keenuts left a comment


LGTM for the shared SPIR-V part. Glanced at the AMD part, code LGTM, but don't know about the ABI logic so haven't checked that.

Collaborator

@efriedma-quic efriedma-quic left a comment


Is there any possibility this impacts interop between AMDGCN flavored SPIRV and generic SPIRV? Or is that not a thing?

Contributor

@MrSidims MrSidims left a comment


The SPIR-V-related part LGTM.

@AlexVlx
Contributor Author

AlexVlx commented Dec 2, 2025

Is there any possibility this impacts interop between AMDGCN flavored SPIRV and generic SPIRV? Or is that not a thing?

Not a thing, we don't compose them. If we were to, we'd have to constrain the ABI to unambiguous cases (so, probably, no aggregates), but that's not under active consideration at the moment.

#include "clang/Basic/LangOptions.h"
#include "llvm/IR/DerivedTypes.h"

#include <stdint.h>
Contributor


Suggested change
#include <stdint.h>
#include <cstdint>

But I'd be surprised you need this?

Contributor Author


Conventionally, cstdint is for std::foo_t and stdint.h is for foo_t without the namespace. The convention is weak and often broken, but it exists. As for needing it: there are uint64_t uses, and whilst cstdint / stdint.h is extremely likely to get included transitively, I thought we want to make files as self-sufficient as possible.
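A minimal illustration of the convention being referred to (both inclusions are standard; the variable names are just for the example):

#include <cstdint>  // guarantees the namespaced spelling, std::uint64_t
#include <stdint.h> // guarantees the unqualified spelling, uint64_t

std::uint64_t NamespacedUse = 0; // what <cstdint> is conventionally included for
uint64_t UnqualifiedUse = 0;     // what <stdint.h> is conventionally included for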

return Ty;
}

ABIArgInfo AMDGCNSPIRVABIInfo::classifyReturnType(QualType RetTy) const {
Contributor


Can we do something better than duplicate all of this AMDGPU stuff?

Contributor Author


It's not immediately apparent how, without refactoring the way the ABIInfo implementations are currently organised (their internal types are inaccessible to other TUs). We could consider hoisting the shared implementation into a file that both AMDGPU and SPIRV include, but I didn't want to do that here since it's more intrusive. If you have another suggestion, please share; it's more than welcome.
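One possible shape for the hoisting mentioned above, strictly as a sketch: the header name, the mixin, and the register limit below are hypothetical, not files, classes, or values taken from the tree.

// Hypothetical clang/lib/CodeGen/Targets/AMDGPUABIHelpers.h, included by both
// AMDGPU.cpp and SPIR.cpp so the register accounting and aggregate
// classification live in a single place.
template <typename BaseABIInfo> class AMDGPUCommonABIMixin : public BaseABIInfo {
protected:
  static constexpr unsigned MaxNumRegsForArgsRet = 16; // assumed limit
  mutable unsigned NumRegsLeft = MaxNumRegsForArgsRet;

  using BaseABIInfo::BaseABIInfo;

  // numRegsForType, the small-aggregate packing rules, and the kernel-argument
  // coercion would be defined once here and reused by both AMDGPUABIInfo and
  // AMDGCNSPIRVABIInfo.
};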

@AlexVlx AlexVlx merged commit 68fea00 into llvm:main Dec 7, 2025
11 checks passed
honeygoyal pushed a commit to honeygoyal/llvm-project that referenced this pull request Dec 9, 2025