The scheduler classifies the MemOps into groups and then clusters neighboring MemOps within each group. The current algorithm puts MemOps with the same ctrl (non-data) dependency into the same group. That works well for loads but not for stores, because a store might have two memory dependencies.
Consider this example: Store Addr and Store Addr+8 are a clusterable pair, but they have memory (ctrl) dependencies on different loads. The current implementation puts these two stores into different groups and fails to cluster them.
      Load X                 Load Y
        ^                      ^
        |                      |
        |mem                   |mem
        |                      |
        +                      +
    Store Addr            Store Addr+8
        ^                      ^
        +----------------------+
                cluster
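The missed pair can be reproduced with a small, self-contained toy model. This is only a sketch under stated assumptions and not the scheduler's actual code: `MemOp`, `groupMemOps`, `findClusterPairs` and `exampleStores` are hypothetical names, each op is assumed to carry at most one ctrl predecessor, and the `splitByCtrlPred == false` mode exists purely as a contrast, not as a description of any particular fix.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <utility>
#include <vector>

// Toy stand-in for a scheduler memory operation (hypothetical type).
struct MemOp {
  int id;        // node number
  int base;      // symbolic id of the base address
  long offset;   // byte offset from the base
  int ctrlPred;  // node this op has a memory (ctrl) dep on, -1 if none
};

using Groups = std::map<int, std::vector<MemOp>>;

// Group ops by their ctrl predecessor when splitByCtrlPred is true;
// otherwise put every op into a single group (key 0) for comparison.
Groups groupMemOps(const std::vector<MemOp> &ops, bool splitByCtrlPred) {
  Groups groups;
  for (const MemOp &op : ops)
    groups[splitByCtrlPred ? op.ctrlPred : 0].push_back(op);
  return groups;
}

// Within each group, report pairs of ops that share a base and whose
// offsets differ by exactly `width` bytes.
std::vector<std::pair<int, int>> findClusterPairs(const Groups &groups,
                                                  long width) {
  assert(width > 0 && "access width must be positive");
  std::vector<std::pair<int, int>> pairs;
  for (const auto &entry : groups) {
    const std::vector<MemOp> &g = entry.second;
    for (std::size_t i = 0; i < g.size(); ++i)
      for (std::size_t j = i + 1; j < g.size(); ++j)
        if (g[i].base == g[j].base &&
            std::labs(g[i].offset - g[j].offset) == width)
          pairs.emplace_back(g[i].id, g[j].id);
  }
  return pairs;
}

// The two stores from the diagram: Store Addr depends on Load X (node 0),
// Store Addr+8 depends on Load Y (node 1).
std::vector<MemOp> exampleStores() {
  return {{2, 0, 0, 0}, {3, 0, 8, 1}};
}
```

With `splitByCtrlPred` set to true, the two stores land in separate groups and `findClusterPairs` finds nothing. With it set to false, they share a group and the pair (2, 3) is reported.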
This affects cases such as the following:
    void foo(long long *restrict a, long long *restrict b,
             long long *restrict c, int n) {
      for (int i = 0; i < n; i++)
        a[i] += b[i] * c[i];
    }
It doesn't seem to me that the condition needs to be cached in a variable.