Profile and optimise-away small GPU allocations #749

@TysonRayJones

Note

For all development, please work out of the devel branch

Summary

Use a profiler to investigate and optimise QuEST's GPU Thrust backend in the "few-qubit" regime, by avoiding superfluous memory allocations.

Context

QuEST juggles many algorithmic and implementation optimisations which have different performance benefits in different settings (e.g. multithreaded vs GPU-accelerated) and at different scales (e.g. number of amplitudes in a statevector). When a Qureg (the simulated quantum state) is small (e.g. 10 qubits), overheads ordinarily handwaved as trivial can begin to dominate runtime. Such a phenomenon is witnessed right now in QuEST's GPU backend, where the cost of copying a quantum operator's qubit lists from host memory to GPU memory exceeds the cost of applying the operator to the state!

CUDA offers many tricks to address problems like this, such as strategies to reserve permanent space for the lists in faster device memory, as you can see being explored here. But to figure out how to optimise code, it is first necessary to understand precisely what is slowing it down! This is the utility of a profiler: a tool to find and visualise which parts of the codebase are unexpectedly slow, or which are invoked unexpectedly frequently. Learning to use profilers and interpret their outputs is an extraordinarily useful skill in high-performance computing, and in software engineering generally.

Task

The heart of QuEST's GPU backend is gpu_subroutines.cpp. The functions therein accept host-memory data (like lists of qubit indices), copy them into fresh GPU device memory, then dispatch a GPU job - either invoking a kernel (like here) from gpu_kernels.cuh, or a Thrust routine (like here) from gpu_thrust.cuh. When the Qureg is small (e.g. a 10-qubit Qureg, which has only 1024 amplitudes = 16 KiB at double precision), allocating the tiny, temporary GPU memory and copying over the lists accounts for a significant fraction of the runtime! Therefore, accelerating the list copying can greatly improve QuEST's few-qubit performance!

Use a profiler, such as NVIDIA Nsight Systems, to measure the cost of copying over qubit lists to QuEST's Thrust routines. Using a candidate function (see below) running on any supported GPU, find the percentage of time that is spent allocating and copying data to the GPU at this scale. By varying the Qureg size, estimate the maximum speedup that could be achieved by eliminating the memory copy completely, as a percentage of the full runtime of the function. Support your estimate with plots or screenshots from the profiler.

Then, picking any candidate function, attempt to optimise away this few-qubit overhead by eliminating the memory copy. This will be through a method analogous to the (much more substantial) optimisation being investigated in #739, but tailored to the Thrust API / functor paradigm.

Clues

An example of a Thrust routine without a copy overhead is thrust_statevec_calcExpecAnyTargZ_sub(), which is invoked when calling QuEST's calcExpecPauliStr() API function, and passing a PauliStr containing only I or Z operators. This function has no copy overhead, because in lieu of passing the entire qubit list to device memory, it passes instead a bitmask representation (stored in a qindex = long long int) - that's only possible here because the operation is mathematically agnostic to the qubit ordering. Because the bitmask is a mere primitive type, it can be passed and loaded directly into a register of a CUDA device - in the Thrust paradigm, we see this as a simple field of the functor.

This means, if you studied the execution of the below QuEST program through a profiler, you would see the ideal few-qubit performance, with no copy/alloc overheads!

#include "quest.h"
#include <stdio.h>

int main(void)
{
    initQuESTEnv();
    if (!getQuESTEnv().isGpuAccelerated)
    {
        printf(
            "GPU acceleration is not enabled, so calling calcExpecPauliStr() will not "
            "invoke thrust_statevec_calcExpecAnyTargZ_sub(). Exiting...\n");
        finalizeQuESTEnv();
        return 0;
    }

    // create a 10-qubit random statevector with only GPU-acceleration
    int numQubits = 10;
    Qureg qureg = createCustomQureg(numQubits, 0, 0, 1, 0);
    initRandomPureState(qureg);
    reportQuregParams(qureg);
    reportQureg(qureg);

    // obtain an all-Z PauliStr
    PauliStr str = getInlinePauliStr("ZZZZZ", {0, 3, 5, 8, 9});
    reportPauliStr(str);

    // invoke thrust_statevec_calcExpecAnyTargZ_sub()
    qreal out = calcExpecPauliStr(qureg, str);
    printf("out = %g\n", out);

    destroyQureg(qureg);
    finalizeQuESTEnv();
    return 0;
}

In contrast, an API function like applyMultiQubitProjector will invoke

template <int NumQubits>
void thrust_statevec_multiQubitProjector_sub(Qureg qureg, ConstList64 qubits, ConstList64 outcomes, qreal renorm) {

    devints devQubits = getDevInts(qubits);
    qindex retainValue = getIntegerFromBits(outcomes.data(), outcomes.size());
    auto projFunctor = functor_projectStateVec<NumQubits>(
        getPtr(devQubits), qubits.size(), retainValue, renorm);

    auto indIter = thrust::make_counting_iterator(QINDEX_ZERO);
    auto ampIter = getStartPtr(qureg);
    qindex numIts = qureg.numAmpsPerNode;
    thrust::transform(indIter, indIter + numIts, ampIter, ampIter, projFunctor); // 4th arg gets modified
}

which contains the damning line:

devints devQubits = getDevInts(qubits); 

This line invokes this copy, which you can trigger by adapting the above example to include:

   int qubits[] = {0, 3, 5, 8, 9};
   int outcomes[] = {0, 0, 1, 1, 0};
   applyMultiQubitProjector(qureg, qubits, outcomes, 5);

Can you spot it in your profiler?
