Profile and optimise-away small GPU allocations #749

> [!NOTE]
> For all development, please work out of the [`devel`](https://github.com/QuEST-Kit/QuEST/tree/devel) branch

## Summary

Use a profiler to investigate and optimise QuEST's GPU Thrust backend in the "few-qubit" regime, by avoiding superfluous memory allocations.

## Context

QuEST juggles many algorithmic and implementation optimisations which have different performance benefits in different settings (e.g. multithreaded vs GPU-accelerated) and at different scales (e.g. number of amplitudes in a statevector). When a `Qureg` (the simulated quantum state) is _small_ (e.g. 10 qubits), overheads ordinarily handwaved as trivial can begin to dominate runtime. Such a phenomenon is witnessed [right now](https://github.com/QuEST-Kit/QuEST/pull/739) in QuEST's [GPU backend](https://github.com/QuEST-Kit/QuEST/tree/devel/quest/src/gpu), where the cost of copying a quantum operator's qubit lists from host memory to the GPU's memory exceeds the cost of _applying_ the quantum operator upon the state!

CUDA offers many tricks to address problems like this, such as strategies to reserve permanent space for the lists in faster device memory, as you can see being explored [here](https://github.com/QuEST-Kit/QuEST/pull/739/commits). But to figure out _how_ to optimise code, it is first necessary to understand precisely _what_ is slowing the code! This is the utility of a _profiler_: a tool to find and visualise which parts of the codebase are unexpectedly slow, or which are invoked unexpectedly frequently. Learning to use profilers and interpret their output is an extraordinarily useful skill in high-performance computing, and in software engineering generally.

## Task

The heart of QuEST's GPU backend is [`gpu_subroutines.cpp`](https://github.com/QuEST-Kit/QuEST/blob/devel/quest/src/gpu/gpu_subroutines.cpp). The functions therein accept host-memory data (like lists of qubit indices), copy them into fresh GPU device memory, then dispatch a GPU job - either invoking a kernel (like [here](https://github.com/QuEST-Kit/QuEST/blob/fed6452ebf8a5af8985a1595c01858404b440fed/quest/src/gpu/gpu_subroutines.cpp#L342-L345)) from [`gpu_kernels.cuh`](https://github.com/QuEST-Kit/QuEST/blob/devel/quest/src/gpu/gpu_kernels.cuh), or a Thrust routine (like [here](https://github.com/QuEST-Kit/QuEST/blob/fed6452ebf8a5af8985a1595c01858404b440fed/quest/src/gpu/gpu_subroutines.cpp#L1739)) from [`gpu_thrust.cuh`](https://github.com/QuEST-Kit/QuEST/blob/devel/quest/src/gpu/gpu_thrust.cuh). When the `Qureg` is small (e.g. a 10-qubit `Qureg` which has only `1024` elements = `16 KiB`), allocating the tiny, temporary GPU memory and copying over the lists takes up a significant fraction of the runtime! Therefore, accelerating the list copying can greatly improve QuEST's few-qubit performance!

Use a profiler, such as [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems), to measure the cost of copying over qubit lists to QuEST's [Thrust routines](https://github.com/QuEST-Kit/QuEST/blob/devel/quest/src/gpu/gpu_thrust.cuh). Using a candidate function (see below) running on any supported GPU, find the percentage of time that is spent allocating and copying data to the GPU in the few-qubit regime (e.g. 10 qubits). By varying the `Qureg` size, estimate the maximum speedup that could be achieved by eliminating the memory copy completely, as a percentage of the full runtime of the function. Support your estimate with plots or screenshots from the profiler.

Then, picking any candidate function, attempt to optimise away this few-qubit overhead by eliminating the memory copy. The method will be analogous to the (much more substantial) optimisation being investigated in #739, but tailored to the Thrust API / functor paradigm.

## Clues

> An example of a Thrust routine **_without_** a copy overhead is [`thrust_statevec_calcExpecAnyTargZ_sub()`](https://github.com/QuEST-Kit/QuEST/blob/fed6452ebf8a5af8985a1595c01858404b440fed/quest/src/gpu/gpu_thrust.cuh#L892-L904), which is invoked when calling QuEST's [`calcExpecPauliStr()`](https://quest-kit.github.io/QuEST/group__calc__expec.html#gafc9d2afc79495609b62f5a1825ff7218) API function, and passing a `PauliStr` containing only `I` or `Z` operators. This function has no copy overhead, because in lieu of passing over the entire qubit list to device memory, it passes instead a [bitmask](https://github.com/QuEST-Kit/QuEST/blob/fed6452ebf8a5af8985a1595c01858404b440fed/quest/src/gpu/gpu_thrust.cuh#L894) representation (stored in a `qindex = long long int`) - that's only possible here because the operation is mathematically agnostic to the qubit ordering. Because the bitmask is a mere primitive type, it can be passed and loaded directly into a register of a CUDA device - in the Thrust paradigm, we see this as a simple field of the [functor](https://github.com/QuEST-Kit/QuEST/blob/fed6452ebf8a5af8985a1595c01858404b440fed/quest/src/gpu/gpu_thrust.cuh#L222-L236).
> 
> This means, if you studied the execution of the below QuEST program through a profiler, you would see the ideal few-qubit performance, with no copy/alloc overheads!
> ```c
> #include "quest.h"
> #include <stdio.h>
> 
> int main(void)
> {
>     initQuESTEnv();
>     if (!getQuESTEnv().isGpuAccelerated)
>     {
>         printf(
>             "GPU acceleration is not enabled, so calling calcExpecPauliStr() will not "
>             "invoke thrust_statevec_calcExpecAnyTargZ_sub(). Exiting...\n");
>         finalizeQuESTEnv();
>         return 0;
>     }
> 
>     // create a 10-qubit random statevector with only GPU-acceleration
>     int numQubits = 10;
>     Qureg qureg = createCustomQureg(numQubits, 0, 0, 1, 0);
>     initRandomPureState(qureg);
>     reportQuregParams(qureg);
>     reportQureg(qureg);
> 
>     // obtain an all-Z PauliStr
>     PauliStr str = getInlinePauliStr("ZZZZZ", {0, 3, 5, 8, 9});
>     reportPauliStr(str);
> 
>     // invoke thrust_statevec_calcExpecAnyTargZ_sub()
>     qreal out = calcExpecPauliStr(qureg, str);
>     printf("out = %g\n", out);
> 
>     destroyQureg(qureg);
>     finalizeQuESTEnv();
>     return 0;
> }
> ```
> 
> In contrast, an API function like [`applyMultiQubitProjector`](https://quest-kit.github.io/QuEST/group__op__projectors.html#gaa4bde7e5a344fb46cf3119d462b18745) will invoke 
> https://github.com/QuEST-Kit/QuEST/blob/fed6452ebf8a5af8985a1595c01858404b440fed/quest/src/gpu/gpu_thrust.cuh#L1018-L1031
> which contains the damning line:
> ```cpp
> devints devQubits = getDevInts(qubits); 
> ```
> This line invokes [this copy](https://github.com/QuEST-Kit/QuEST/blob/fed6452ebf8a5af8985a1595c01858404b440fed/quest/src/gpu/gpu_thrust.cuh#L61-L82), which you can trigger by adapting the above example to include:
> ```cpp
>    int qubits[] = {0, 3, 5, 8, 9};
>    int outcomes[] = {0, 0, 1, 1, 0};
>    applyMultiQubitProjector(qureg, qubits, outcomes, 5);
> ```
> Can you spot it in your profiler?

For reference, the Thrust routine linked above (`thrust_statevec_multiQubitProjector_sub()`) reads:

    template <int NumQubits>
    void thrust_statevec_multiQubitProjector_sub(Qureg qureg, ConstList64 qubits, ConstList64 outcomes, qreal renorm) {

        devints devQubits = getDevInts(qubits);
        qindex retainValue = getIntegerFromBits(outcomes.data(), outcomes.size());
        auto projFunctor = functor_projectStateVec<NumQubits>(
            getPtr(devQubits), qubits.size(), retainValue, renorm);

        auto indIter = thrust::make_counting_iterator(QINDEX_ZERO);
        auto ampIter = getStartPtr(qureg);

        qindex numIts = qureg.numAmpsPerNode;
        thrust::transform(indIter, indIter + numIts, ampIter, ampIter, projFunctor); // 4th arg gets modified
    }
