Before this patch, we computed the in-memory offsets of args passed to
GPU kernel functions by putting all of the args into an LLVM struct and
reading the offsets off that struct's layout. But clang emits packed
LLVM structs basically whenever it feels like it, and packed structs
have alignment 1. So we cannot rely on the LLVM type's alignment
matching the C++ type's alignment.
This patch fixes our codegen so we always respect the clang types'
alignments.
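
To illustrate the difference, here is a minimal standalone sketch. It is
not the code in this patch: ArgInfo, alignTo, alignedOffset, and
packedOffset are hypothetical names, and the sizes and alignments in
ExampleArgs are assumed values for a kernel taking (char, double, int).
It compares offsets computed with each arg's C++ alignment against the
offsets a packed (alignment 1) layout would give.

```cpp
#include <cstddef>

// Hypothetical description of one kernel argument: its size and alignment
// as the C++ type system would report them.
struct ArgInfo {
  std::size_t Size;
  std::size_t Align; // assumed to be a power of two
};

// Round Offset up to the next multiple of Align.
constexpr std::size_t alignTo(std::size_t Offset, std::size_t Align) {
  return (Offset + Align - 1) & ~(Align - 1);
}

// Offset of argument Index when every argument sits at its C++ alignment.
constexpr std::size_t alignedOffset(const ArgInfo *Args, std::size_t Index) {
  std::size_t Offset = 0;
  for (std::size_t I = 0; I < Index; ++I)
    Offset = alignTo(Offset, Args[I].Align) + Args[I].Size;
  return alignTo(Offset, Args[Index].Align);
}

// Offset of argument Index in a packed layout: alignment 1, no padding.
constexpr std::size_t packedOffset(const ArgInfo *Args, std::size_t Index) {
  std::size_t Offset = 0;
  for (std::size_t I = 0; I < Index; ++I)
    Offset += Args[I].Size;
  return Offset;
}

// Assumed sizes/alignments for a kernel taking (char, double, int).
constexpr ArgInfo ExampleArgs[] = {{1, 1}, {8, 8}, {4, 4}};

// The double lands at offset 8 with C++ alignment, but at 1 when packed.
static_assert(alignedOffset(ExampleArgs, 1) == 8, "double is 8-byte aligned");
static_assert(packedOffset(ExampleArgs, 1) == 1, "packed layout has no padding");
```

In this sketch the two layouts already disagree at the second arg, which
is the kind of mismatch you get when the offsets come from a packed
struct rather than from the C++ types' alignments.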
Typically clang doesn't need a registered backend for a target in order to generate IR for that target: it "knows" a whole bunch of stuff about every target's calling conventions and data layout. So unless the CUDA codegen goes out of its way to query LLVM backend information, we shouldn't need these REQUIRES lines.
You should probably test this theory, though, by configuring an ARM-only clang and running the tests. :)