ROCm device libs can emit those intrinsics with the +dpp attribute, and they count on the optimizer to remove the call if the GPU is too old.
When built at O0, the call is not optimized away, which caused codegen issues: Clang allowed this intrinsic to go through with just +dpp, but the backend also required the GPU to be GFX8 or newer.
This patch allows the backend to select that intrinsic with just +dpp.
Depends on D136945