This is an initial attempt at microbenchmarking GPU code. It uses
several compiler hacks to ensure that only the code we want to test ends
up between the profiling instructions. I tested this on both the NVPTX
and AMDGPU architectures. AMDGPU works quite well and matches what I
expect from llvm-mca when checking the assembly via llvm-objdump -D
on the binary. NVPTX, on the other hand, requires -Xcuda-ptxas -O0 to
get consistent results; otherwise ptxas reorders the operations and the
measurements are mostly noise.
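
The core idea is roughly the following minimal sketch. The names here
(`time_region`) and the use of Clang's `__builtin_readcyclecounter()`
are assumptions standing in for whatever the patch actually emits; the
point is how the empty inline asm barriers keep the compiler from
hoisting the measured operation out of, or sinking other work into, the
timed region:

```cpp
#include <stdint.h>

// Sketch only: time a single operation in GPU cycles. The empty asm
// statements act as compiler barriers so `op` cannot be reordered past
// the counter reads. __builtin_readcyclecounter() lowers to the
// target's cycle counter (e.g. %clock64 on NVPTX).
template <typename F>
__attribute__((always_inline)) inline uint64_t time_region(F op) {
  __asm__ volatile("" ::: "memory"); // fence prior work out of the region
  uint64_t start = __builtin_readcyclecounter();
  op();                              // only this should be measured
  __asm__ volatile("" ::: "memory"); // keep `op` inside the region
  uint64_t stop = __builtin_readcyclecounter();
  return stop - start;
}
```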
This is difficult because a single load or store inside the timing
region will completely drown out any latency we are trying to measure.
A single load / store is likely more costly than most primitive math
functions, so it drowns out everything else.
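
One common workaround, sketched below under the assumption that the
operands fit in registers, is to pin values with an empty asm
constraint (the same idea as Google Benchmark's `DoNotOptimize`) so the
timed region contains no memory traffic. The `keep_in_register` helper
is hypothetical, and the `"+r"` constraint letter is target-dependent
(AMDGPU, for instance, typically wants `"v"` for VGPRs), so treat it as
a placeholder:

```cpp
#include <stdint.h>

// Force `value` to be materialized in a register at this point without
// emitting any instructions; no load or store is generated. The "+r"
// constraint is a target-dependent placeholder.
__attribute__((always_inline)) inline void keep_in_register(uint32_t &value) {
  __asm__ volatile("" : "+r"(value));
}

// Hypothetical usage with the time_region sketch above: the operand
// never touches memory, so the region contains only the work under test.
uint64_t measure() {
  uint32_t x = 42;
  keep_in_register(x);
  return time_region([&] {
    x *= x;              // stand-in for the function under test
    keep_in_register(x); // keep the result live without a store
  });
}
```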
I'm putting this up as a stand-in that can hopefully be refined further
in the future; as such, it currently has no users.