I really thought we were doing this already, but we were not. Given this input:
  void Test(int *res, int *c, int *d, int *p) {
  #pragma clang loop vectorize(assume_safety)
    for (int i = 0; i < 16; i++)
      res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
  }
we still don't vectorize the loop. Even with "assume_safety", the check that stops us from if-converting conditionally-executed loads (to protect against data-dependent dereferenceability) was not elided. We should vectorize this.
The change here seems straightforward. One subtlety: as implemented, it will still prefer a masked-load intrinsic (given target support) over the speculated load. The choice seems architecture-specific; the best option depends on how expensive the masked load is compared to a regular load. Ideally, the masked load still reduces unnecessary memory traffic, and so should be preferred. If we'd rather do it the other way, flipping the order of the checks is easy.
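To make the two options concrete, here is a scalar C sketch of what each form means for the loop above. It is illustration only, not the vectorizer's output: the function names are hypothetical, the unused "c" parameter is dropped, and the masked load is emulated per lane with a zero passthrough value. The speculated form reads d[i] on every iteration, which is only valid if every d[i] is dereferenceable; that is the property "assume_safety" is being taken to promise here.

```c
#include <stddef.h>

/* Original shape: d[i] is loaded only on iterations where p[i] != 0. */
void test_branchy(int *res, const int *d, const int *p, size_t n) {
    for (size_t i = 0; i < n; i++)
        res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
}

/* Speculated-load form (hypothetical name): d[i] is read unconditionally,
   then a select decides whether it contributes to the result. */
void test_speculated(int *res, const int *d, const int *p, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int loaded = d[i];                   /* unconditional load */
        int addend = (p[i] == 0) ? 0 : loaded; /* select */
        res[i] = res[i] + addend;
    }
}

/* Masked-load form (hypothetical name), emulated one lane at a time:
   lanes with a false mask do not touch memory and yield the passthrough
   value 0 instead. */
void test_masked(int *res, const int *d, const int *p, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int mask = (p[i] != 0);
        int loaded = mask ? d[i] : 0;        /* memory read only if mask set */
        res[i] = res[i] + loaded;
    }
}

/* Small self-check: all three forms produce the same res[] contents. */
int forms_agree(void) {
    enum { N = 16 };
    int d[N], p[N], a[N], b[N], r[N];
    for (int i = 0; i < N; i++) {
        d[i] = 10 * i;
        p[i] = i % 3;                        /* mix of zero and nonzero */
        a[i] = b[i] = r[i] = i;
    }
    test_branchy(r, d, p, N);
    test_speculated(a, d, p, N);
    test_masked(b, d, p, N);
    for (int i = 0; i < N; i++)
        if (a[i] != r[i] || b[i] != r[i])
            return 0;
    return 1;
}
```

Both rewritten forms are branch-free in the loop body and so can be widened to vector operations; they differ only in whether the lanes with p[i] == 0 read from d.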
There is still an issue with the generated code: it contains runtime overlap checks. Fixing that will be follow-up work.