For all development, please work out of the devel branch
Summary
Use a profiler to investigate and optimise QuEST's GPU Thrust backend in the "few-qubit" regime, by avoiding superfluous memory allocations.
Context
QuEST juggles many algorithmic and implementation optimisations which have different performance benefits in different settings (e.g. multithreaded vs GPU-accelerated) and at different scales (e.g. number of amplitudes in a statevector). When a Qureg (the simulated quantum state) is small (e.g. 10 qubits), some unexpected overheads ordinarily handwaved as trivial can begin to dominate runtime. Such a phenomenon is witnessed right now in QuEST's GPU backend, where the cost of copying a quantum operator's qubit lists from host memory to the GPU's memory exceeds the cost of applying the quantum operator upon the state!
CUDA offers many tricks to address problems like this, such as strategies to reserve permanent space for the lists in faster device memory, as you can see being explored here. But to figure out how to optimise code, it is first necessary to understand precisely what is slowing the code! This is the utility of a profiler: a tool to find and visualise which parts of the codebase are unexpectedly slow, or which are invoked unexpectedly frequently. Learning to use and interpret the outputs of profilers is an extraordinarily useful skill in high-performance computing and in general software engineering.
Task
The heart of QuEST's GPU backend is gpu_subroutines.cpp. The functions therein accept host-memory data (like lists of qubit indices), copy them into fresh GPU device memory, then dispatch a GPU job - either invoking a kernel (like here) from gpu_kernels.cuh, or a Thrust routine (like here) from gpu_thrust.cuh. When the Qureg is small (e.g. a 10-qubit Qureg which has only 1024 elements = 16 KiB), allocating the tiny, temporary GPU memory and copying over the lists accounts for a significant fraction of the runtime! Therefore, accelerating the list copying can greatly improve QuEST's few-qubit performance!
Use a profiler, such as NVIDIA Nsight Systems, to measure the cost of copying over qubit lists to QuEST's Thrust routines. Using a candidate function (see below) running on any supported GPU, find the percentage of time that is spent allocating and copying data to the GPU at this scale. By varying the Qureg size, estimate the maximum speedup that could be achieved by eliminating the memory copy completely, as a percentage of the full runtime of the function. Support your estimate with plots or screenshots from the profiler.
Then, picking any candidate function, attempt to optimise away this few-qubit overhead by eliminating the memory copy. The method will be analogous to the (much more substantial) optimisation being investigated in #739, but tailored to the Thrust API / functor paradigm.
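The estimate requested above can be reduced to simple arithmetic once the profiler has reported timings. The following is a minimal sketch, assuming you have read off two numbers for one call of the candidate function: the time spent in allocation plus host-to-device copies, and the total time. The function names are hypothetical helpers, not part of QuEST or of any profiler.

```cpp
#include <cassert>

// HYPOTHETICAL helper: fraction of one call's runtime spent allocating
// device memory and copying the qubit lists over. This is also the
// fraction of runtime saved if that overhead were removed entirely.
double overheadFraction(double allocCopyTime, double totalTime) {
    assert(totalTime > 0 && allocCopyTime >= 0 && allocCopyTime <= totalTime);
    return allocCopyTime / totalTime;
}

// HYPOTHETICAL helper: upper bound on the speedup factor from removing
// the overhead completely (Amdahl's law with the overhead reduced to zero).
double maxSpeedup(double allocCopyTime, double totalTime) {
    assert(totalTime > allocCopyTime && allocCopyTime >= 0);
    return totalTime / (totalTime - allocCopyTime);
}
```

For instance, with made-up timings of 30 µs of alloc/copy inside a 40 µs call, overheadFraction(30, 40) gives 0.75 (75% of the runtime) and maxSpeedup(30, 40) gives 4. Repeating this for several Qureg sizes shows where the overhead stops mattering.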
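One possible direction, offered here as an assumption rather than as the approach taken in #739, is to store a short qubit list inside the functor itself as a fixed-capacity array, so that the list travels with the functor instead of through a separate device allocation. The sketch below is a host-only C++ analogue with hypothetical names; a real Thrust functor would additionally need its call operator marked for device execution, and the capacity would have to respect CUDA's limits on kernel argument size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using qindex = long long int;

// HYPOTHETICAL capacity, chosen only for illustration
constexpr int MAX_NUM_TARGS = 16;

// HYPOTHETICAL functor (not part of QuEST): the qubit list is a by-value
// field, so copying the functor copies the list with it.
struct GatherTargBitsFunctor {
    int targs[MAX_NUM_TARGS];
    int numTargs;

    explicit GatherTargBitsFunctor(const std::vector<int>& list)
        : targs{}, numTargs(0) {
        assert(list.size() <= static_cast<std::size_t>(MAX_NUM_TARGS));
        numTargs = static_cast<int>(list.size());
        for (int t = 0; t < numTargs; t++)
            targs[t] = list[t];
    }

    // Collects the bits of basis-state index i at the target qubits, in
    // list order: bit t of the result is bit targs[t] of i. The result
    // depends on the ordering, so a bitmask alone would not suffice.
    qindex operator()(qindex i) const {
        qindex value = 0;
        for (int t = 0; t < numTargs; t++)
            value |= ((i >> targs[t]) & 1LL) << t;
        return value;
    }
};
```

The trade-off is that the functor grows with the capacity, so this only suits routines whose lists have a small, known upper bound.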
Clues
An example of a Thrust routine without a copy overhead is thrust_statevec_calcExpecAnyTargZ_sub(), which is invoked when calling QuEST's calcExpecPauliStr() API function and passing a PauliStr containing only I or Z operators. This function has no copy overhead because, in lieu of passing the entire qubit list to device memory, it passes a bitmask representation (stored in a qindex = long long int). That is only possible here because the operation is mathematically agnostic to the qubit ordering. Because the bitmask is a mere primitive type, it can be passed and loaded directly into a register of a CUDA device; in the Thrust paradigm, we see this as a simple field of the functor.
This means, if you studied the execution of the below QuEST program through a profiler, you would see the ideal few-qubit performance, with no copy/alloc overheads!
#include "quest.h"
#include <stdio.h>

int main(void)
{
    initQuESTEnv();

    if (!getQuESTEnv().isGpuAccelerated)
    {
        printf(
            "GPU acceleration is not enabled, so calling calcExpecPauliStr() will not "
            "invoke thrust_statevec_calcExpecAnyTargZ_sub(). Exiting...");
        finalizeQuESTEnv();
        return 0;
    }

    // create a 10-qubit random statevector with only GPU-acceleration
    int numQubits = 10;
    Qureg qureg = createCustomQureg(numQubits, 0, 0, 1, 0);
    initRandomPureState(qureg);
    reportQuregParams(qureg);
    reportQureg(qureg);

    // obtain an all-Z PauliStr
    PauliStr str = getInlinePauliStr("ZZZZZ", {0, 3, 5, 8, 9});
    reportPauliStr(str);

    // invoke thrust_statevec_calcExpecAnyTargZ_sub()
    qreal out = calcExpecPauliStr(qureg, str);
    printf("out = %g\n", out);

    destroyQureg(qureg);
    finalizeQuESTEnv();
    return 0;
}
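To make the bitmask idea behind thrust_statevec_calcExpecAnyTargZ_sub() concrete, here is a minimal host-only C++ sketch of the same technique. It is not QuEST's implementation: the names getBitMask, ExpecZFunctor and calcExpecZ are hypothetical, and a plain loop stands in for the Thrust reduction. The point it illustrates is that the functor's only state is a single integer, so nothing needs to be allocated or copied before the job is dispatched.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using qindex = long long int;

// HYPOTHETICAL helper: packs a qubit list into a bitmask.
// The ordering of the list is deliberately discarded.
qindex getBitMask(const std::vector<int>& qubits) {
    qindex mask = 0;
    for (int q : qubits)
        mask |= (1LL << q);
    return mask;
}

// HYPOTHETICAL functor: its only field is a primitive integer.
struct ExpecZFunctor {
    qindex targMask;

    // Contribution of amplitude amp at basis-state index i: its probability,
    // negated when an odd number of target qubits are in state 1.
    double operator()(qindex i, std::complex<double> amp) const {
        int parity = 0;
        for (qindex bits = i & targMask; bits != 0; bits >>= 1)
            parity ^= static_cast<int>(bits & 1);
        return (1 - 2 * parity) * std::norm(amp);
    }
};

// Sums the functor over all amplitudes (a stand-in for a parallel reduction).
double calcExpecZ(const std::vector<std::complex<double>>& amps,
                  const std::vector<int>& targs) {
    ExpecZFunctor functor{getBitMask(targs)};
    double sum = 0;
    for (std::size_t i = 0; i < amps.size(); i++)
        sum += functor(static_cast<qindex>(i), amps[i]);
    return sum;
}
```

For a two-qubit state with all amplitude on index 3 (both qubits in state 1), calcExpecZ with targets {0} yields -1 and with targets {0, 1} yields +1, regardless of the order in which the targets are listed.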