- Device compilation must see source code that is consistent with the corresponding host compilation. If macros derived from the host-specific target processor are not populated properly, device compilation may fail because the preprocessed source is inconsistent. So far, only the host triple is used to build these macros. If a specific host CPU or certain features are specified, the macros derived from them are not populated, e.g. __SSE3__ is not defined unless the +sse3 feature is present. In MSVC-compatible compilation on Windows, the missing macros mean the intrinsics are not included, and device compilation then fails on the host-side source.
- This patch addresses the issue by introducing two cc1 options, -aux-target-cpu and -aux-target-feature. If a specific host CPU or certain features are specified, the compiler driver appends them during the construction of the offline compilation actions. The toolchain in the cc1 phase then populates the macros accordingly.
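As a rough illustration of the driver-side step described above, the following sketch shows how a host CPU name and a list of host features might be turned into the extra cc1 arguments. It is not the patch's actual code: the function names `buildAuxTargetArgs` and `checkExample` and the sample CPU and feature values are hypothetical, and only the option spellings -aux-target-cpu and -aux-target-feature come from the summary.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not Clang's real driver code): forwards the host CPU
// and host features to the device-side cc1 invocation as aux-target options.
std::vector<std::string>
buildAuxTargetArgs(const std::string &hostCPU,
                   const std::vector<std::string> &hostFeatures) {
  std::vector<std::string> args;
  // Only emit -aux-target-cpu when a specific host CPU was requested.
  if (!hostCPU.empty()) {
    args.push_back("-aux-target-cpu");
    args.push_back(hostCPU);
  }
  // Each feature (e.g. "+sse3") gets its own -aux-target-feature option.
  for (const std::string &feature : hostFeatures) {
    args.push_back("-aux-target-feature");
    args.push_back(feature);
  }
  return args;
}

// Usage example with assumed inputs: one CPU name and one feature.
void checkExample() {
  std::vector<std::string> args = buildAuxTargetArgs("skylake", {"+sse3"});
  assert(args.size() == 4);
  assert(args[2] == "-aux-target-feature" && args[3] == "+sse3");
}
```

With these arguments in place, the cc1 phase would have enough information to define host macros such as __SSE3__ during device compilation, which is the consistency the summary asks for.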
I think a gpu- prefix would work better, as it's common to both HIP and CUDA.