Clang's current lowering of OpenMP parallel worksharing loops with a reduction clause blocks many optimization opportunities: the address of the reduction's stack variable is passed to an OpenMP runtime function after the loop, so the address escapes and SROA/mem2reg does not promote the variable to SSA form.
The intent of this work is to partially promote the reduction variable to SSA form before the runtime call takes place, so that optimizations such as vectorization can be performed on a loop like the following:
int loop(int data[restrict 128U]) {
  int retval = 0;
  #pragma omp parallel for simd schedule(simd:static) default(none) shared(data) reduction(+:retval)
  for (int i = 0; i < 128; i++) {
    int n = 0;
    if (data[i]) {
      n = 1;
      retval += n;
    }
  }
  return retval;
}
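To illustrate the idea at the source level, here is a rough sketch of what "partial promotion" amounts to. It is not the patch itself and does not use OpenMP: `fake_runtime_reduce` and `shared_total` are hypothetical stand-ins for the runtime reduction call and its shared state, and the function names are made up for this example. The first function has the shape of the current lowering, where the accumulator's address is taken after the loop. The second does by hand what the pass aims for: accumulate in a value whose address is never taken, then store it once before the address escapes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the OpenMP runtime reduction entry point.
   All that matters here is that it receives the variable's address. */
int shared_total = 0;

static void fake_runtime_reduce(int *private_copy) {
    assert(private_copy != NULL);
    shared_total += *private_copy;
}

/* Shape of the current lowering: retval is updated in the loop and its
   address is passed to the runtime call afterwards, so it cannot simply
   be promoted to a register for its whole lifetime. */
int count_nonzero_escaped(const int *data, size_t n) {
    int retval = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i])
            retval += 1;
    }
    fake_runtime_reduce(&retval);
    return retval;
}

/* Hand-written equivalent of partial promotion: the loop works on acc,
   whose address is never taken, and retval is written exactly once
   before the runtime call. */
int count_nonzero_promoted(const int *data, size_t n) {
    int retval = 0;
    int acc = retval;              /* single load before the loop */
    for (size_t i = 0; i < n; i++) {
        if (data[i])
            acc += 1;
    }
    retval = acc;                  /* single store before the escape */
    fake_runtime_reduce(&retval);
    return retval;
}
```

Both functions compute the same result. For plain C like this a compiler may well optimize them identically, so the sketch only shows the shape of the transformation, not a performance claim.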
The code as it stands was written to avoid clashing with other code, in order to reduce downstream maintenance costs. I expect I'll need to refactor it considerably, but I would like to hear from reviewers before undertaking that work.
I have a few questions to resolve first:
- Is this feature something the community wants, or am I just overcomplicating things? Is there an easier way to get the above loop to vectorize?
- I've been a bit paranoid about ensuring ordering here and used the PostDominatorTree. I think it may be possible to do this with a modification to the IDF (iterated dominance frontier) algorithm used in mem2reg, but I haven't worked through it yet. Can anyone with more experience of it offer guidance?
- This is currently a separate pass, but it could be implemented as part of the normal SROA/mem2reg passes. Would that be preferred? Does the outcome of the previous question about PostDom trees affect that?