This provides a substantial performance boost on some benchmarks
(~25% on SHOC's FFT) due to vectorized loads/stores.
Unfortunately, existing CUDA headers and user code occasionally
take pointers to vector fields, which clang does not allow, so
we can't use vector types by default.
While vectorized types help in some cases, they may lower
performance when the user reads or writes only part of a vector, as
Clang currently generates code that always loads/stores the complete vector.
This may also create data races if user code assumes that parts of the
same vector can be safely changed from different threads.
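To illustrate the data-race concern, here is a minimal single-threaded model, not actual Clang codegen. It assumes a component write is lowered to a whole-vector load, a modify, and a whole-vector store, and it hand-schedules two "threads" that each update a different component. `Vec4`, `load_whole`, `store_whole`, and `interleaved_component_writes` are hypothetical names used only for this sketch.

```cpp
#include <cassert>

// Hypothetical 4-float vector standing in for a vectorized float4.
struct Vec4 { float x, y, z, w; };

// Model of a whole-vector load and a whole-vector store.
Vec4 load_whole(const Vec4 &mem) { return mem; }
void store_whole(Vec4 &mem, const Vec4 &reg) { mem = reg; }

// Hand-scheduled interleaving of two "threads" that each change one
// component, assuming every component write becomes load/modify/store
// of the complete vector.
Vec4 interleaved_component_writes() {
  Vec4 mem = {0.0f, 0.0f, 0.0f, 0.0f};
  Vec4 a = load_whole(mem);  // thread A loads the whole vector
  Vec4 b = load_whole(mem);  // thread B loads the whole vector
  a.x = 1.0f;                // thread A changes only x
  b.y = 2.0f;                // thread B changes only y
  store_whole(mem, a);       // mem is now {1, 0, 0, 0}
  store_whole(mem, b);       // mem is now {0, 2, 0, 0}; A's write to x is lost
  return mem;
}
```

With per-field stores, the two writes would touch disjoint memory and both would survive; in this model the later whole-vector store overwrites the earlier one.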
For now, control this feature via -DCUDA_VECTOR_TYPES and let the user
choose whether to use Clang's vectorized types or CUDA's
non-vectorized ones.
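A rough sketch of what the macro-based selection could look like, assuming the header switches on `CUDA_VECTOR_TYPES` roughly as below. The type name `my_float4` and the helpers are hypothetical stand-ins; the real CUDA and Clang headers are laid out differently.

```cpp
#include <cassert>

#if defined(CUDA_VECTOR_TYPES) && defined(__clang__)
// Clang extended vector type: eligible for vectorized loads/stores.
typedef float my_float4 __attribute__((ext_vector_type(4)));
#else
// Plain struct, in the spirit of CUDA's non-vectorized definition.
struct my_float4 { float x, y, z, w; };
#endif

// Component access by name works with either definition.
float sum_components(my_float4 v) { return v.x + v.y + v.z + v.w; }

#if !(defined(CUDA_VECTOR_TYPES) && defined(__clang__))
// Taking the address of a field only compiles for the struct definition;
// clang rejects &v.x on an extended vector type.
float *pointer_to_x(my_float4 &v) { return &v.x; }
#endif
```

Building with `-DCUDA_VECTOR_TYPES` under clang picks the vector definition; code that relies on something like `pointer_to_x` then stops compiling, which is the incompatibility that keeps vector types from being the default.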
Hm, this is a surprising (to me) way of controlling this feature. Can we use a -f flag instead? That would work even if all the -f flag does is define a macro (although in that case I'd suggest giving the macro a longer name so it's harder to collide with).
-fsomething would be more discoverable and canonical, I think, and would be easier to document.