Added ptxwrap utility to help incorporate PTX into host-side object files.
Device-side CUDA compilation produces a text file containing PTX
assembly. For the GPU code to be usable, it must be passed to the GPU
driver, which then JIT-compiles it for the target GPU hardware.
Currently we rely on the CUDA runtime to launch kernels from the host
side. The cudaLaunch() function takes the host-side address of the
kernel we want to launch and expects the corresponding GPU kernel to be
registered with the CUDA runtime by the time the launch is attempted.
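The register-before-launch requirement can be illustrated with a toy
model. This is only a sketch: ToyRuntime, register_function and launch
are hypothetical names made up for illustration, not CUDA runtime
entry points, and a Python function object stands in for the host-side
address of a kernel.

```python
class ToyRuntime:
    """Toy stand-in for a runtime mapping host-side handles to GPU kernels."""

    def __init__(self):
        # host-side handle -> device-side kernel name
        self._kernels = {}

    def register_function(self, host_handle, device_name):
        # Hypothetical registration step; must happen before any launch.
        self._kernels[host_handle] = device_name

    def launch(self, host_handle):
        # A launch is only possible for handles registered earlier.
        if host_handle not in self._kernels:
            raise LookupError("kernel not registered for this host handle")
        return self._kernels[host_handle]


def saxpy_stub():
    """Host-side stub; its identity plays the role of the host address."""


runtime = ToyRuntime()
runtime.register_function(saxpy_stub, "_Z5saxpyfPfS_")
launched = runtime.launch(saxpy_stub)
```

Launching through a ToyRuntime that never saw saxpy_stub raises
LookupError, which mirrors why registration has to complete before the
first kernel launch.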
Before we can register kernels, we have to load the GPU code, which is
expected to be in a 'fatbin' container.
ptxwrap takes a file with PTX assembly and encapsulates it in a 'fatbin'
container. If the -fatbin flag is passed, it produces a fatbin binary.
If the -stub flag is passed (the default), ptxwrap generates kernel
registration code that embeds the fatbin contents as a string, loads
them, and registers all the kernels it finds in the PTX. The output can
be included in the host-side compilation or compiled separately and
linked in.
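Two of the steps described above, finding kernels in the PTX and
embedding binary data as a string, can be sketched as follows. This is
not ptxwrap's actual implementation: find_kernels, to_c_string_literal
and SAMPLE_PTX are hypothetical names, and the sketch assumes kernels
appear as `.entry` directives at the start of a line, optionally
preceded by `.visible` or `.weak`.

```python
import re

# Matches PTX kernel entry points such as ".visible .entry _Z3fooPf(".
ENTRY_RE = re.compile(
    r"^\s*(?:\.(?:visible|weak)\s+)?\.entry\s+([A-Za-z_$%][\w$]*)",
    re.MULTILINE,
)


def find_kernels(ptx_text):
    """Return the names of all .entry functions, in order of appearance."""
    return ENTRY_RE.findall(ptx_text)


def to_c_string_literal(data):
    """Encode bytes as a C string literal using fixed-width octal escapes.

    Three-digit octal escapes are used because a C octal escape consumes
    at most three digits, so a following digit cannot extend it.
    """
    return '"' + "".join("\\%03o" % b for b in data) + '"'


# Made-up PTX for illustration: two kernels and one device function.
SAMPLE_PTX = """
.version 4.2
.target sm_35
.address_size 64

.visible .entry _Z5saxpyfPfS_(
    .param .f32 _Z5saxpyfPfS__param_0
)
{
    ret;
}

.func helper()
{
    ret;
}

.entry plain_kernel()
{
    ret;
}
"""

kernels = find_kernels(SAMPLE_PTX)
literal = to_c_string_literal(b"A\x00\n")
```

Here kernels holds the two `.entry` names and skips the `.func` helper,
which is not launchable from the host. A generated stub would then emit
one registration call per name next to the embedded string.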
Caveats: most fatbin parameters are currently hardcoded and have only
been tested with CUDA-7.0 on sm_35 hardware.