To make the device runtime truly device independent, and thereby
reusable (e.g., for testing on the host), we want to eliminate the
remaining non-portable code. This mostly comes down to the CUDA
qualifiers __device__ and __shared__ and a few other attributes, e.g.,
__forceinline__.
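
A minimal sketch of the idea, not the code in this patch: the
device-specific qualifiers can be hidden behind macros that expand to
nothing (or to a portable equivalent) when compiling for the host. The
names DEVICE, SHARED, INLINE, TeamCounter, and nextTicket are
hypothetical and chosen only for illustration.

```cpp
// Hypothetical portability macros; names are illustrative assumptions.
#if defined(__CUDACC__)
// Compiling as CUDA: use the real device qualifiers.
#define DEVICE __device__
#define SHARED __shared__
#define INLINE __forceinline__
#else
// Compiling for the host: fall back to plain C++.
#define DEVICE
#define SHARED static
#define INLINE inline
#endif

// Per-team state. On the host this is an ordinary zero-initialized
// static; on the device it lives in shared memory and would need to be
// initialized explicitly before use.
SHARED int TeamCounter;

// Returns the current counter value and then increments it.
DEVICE INLINE int nextTicket() { return TeamCounter++; }
```

With the host fallback, the same source compiles with a regular C++
compiler, so functions such as nextTicket() can be exercised in
ordinary host-side tests.
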
NOTE: This does not yet create a valid .bc file because the IR files we
get via c++ are not directly usable by llvm-link/opt/.... I did not
investigate how to extract the pure IR. The entire NVPTX compilation
process should be re-investigated, as we might not need to link in
anything from CUDA.