Compiling the following code with optimization will cause the compiler to fail with an assertion.
typedef float alpha __attribute__((__vector_size__(16))); float foo(int& i, alpha& v) { v *= v; return v[i]; }
Overran sorted position:
t15: ch,glue = CopyToReg t22:1, Register:f32 %XMM0, t22 t14: f32 = Register %XMM0 t22: f32,ch = load<LD4[<unknown>]> t9, t21, undef:i64 t21: i64 = add t24, t4 t24: i64 = shl t17, Constant:i8<2> t17: i64,ch = load<LD4[%i](tbaa=<0x5eab5d8>), sext from i32> t22:1, t2, undef:i64 t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0 t1: i64 = Register %vreg0 t6: i64 = undef t23: i8 = Constant<2> t4: i64,ch = CopyFromReg t0, Register:i64 %vreg1 t3: i64 = Register %vreg1 t6: i64 = undef
Checking if this is due to cycles
Detected cycle in SelectionDAG
Offending node:
t22: f32,ch = load<LD4[<unknown>]> t9, t21, undef:i64 t21: i64 = add t24, t4 t24: i64 = shl t17, Constant:i8<2> t17: i64,ch = load<LD4[%i](tbaa=<0x5eab5d8>), sext from i32> t22:1, t2, undef:i64 t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0 t1: i64 = Register %vreg0 t6: i64 = undef t23: i8 = Constant<2> t4: i64,ch = CopyFromReg t0, Register:i64 %vreg1 t3: i64 = Register %vreg1 t6: i64 = undef
This is the IR for the test case:
%1 = load <4 x float>, <4 x float>* %v, align 16 %mul = fmul <4 x float> %1, %1 store <4 x float> %mul, <4 x float>* %v, align 16 %2 = load i32, i32* %i, align 4 %vecext = extractelement <4 x float> %mul, i32 %2 ret float %vecext
During selection DAG legalization, the extractelement is replaced with a load instruction. To do this, a temporary store to the stack is used unless an existing store is found that can be re-used. In this case, the store is found.
After creating the load the code replaces the chain going out of the store with the one going out of the new load. The code does this to ensure that any stores that must take place after the store happens after the load, else the value might be overwritten before it is loaded.
The problem is, if the extractelement index is dependent on the store replacing the chain will introduce a cycle in the selection DAG (the load uses the index, and by replacing the chain we will make the index dependent on the load).
This patch is a simple fix for the problem, and adds an additional dependency test when searching for a store. This is conservative as we may end up creating an unnecessary extra store to the stack if another store cannot be found (as in this case). However, the logic required to calculate the correct dependency order is complex (see below), and the situation is not expected to occur very often.
Opinions on the patch are welcome. If the view is that the conservative approach is not appropriate I can try and implement the logic below and submit that instead.
Non-conservative fix
The change that introduced the problem is r231721 which added the code to update the dependency chain. As already stated, previously there was no dependency between the new load and any other store to the same location, meaning the load could occur after the store, getting a wrong value.
The problem is, the change always replaces the chain irrespective of whether there are any other stores. In the testcase above we have no additional stores and so the dependency chain does not need to be changed. However, it isn't as easy as simply scanning the uses of the re-used store and only updating the stores as we may have a load which is followed in the chain by a store. This is demonstrated by the testcase for r231721 which has multiple extractelements. As soon as an extractelement is rewritten into a load, the re-used store becomes followed in the chain by a load and then a store (all 4 extractelements end up chained between the original stores).
So we need to traverse the dependency chain looking for stores, and only replace the chain if a store occurs. However, this is also not sufficient. We can modify the testcase above:
typedef float alpha __attribute__((__vector_size__(16))); float foo(int& i, alpha& v) { v *= v; return v[i++]; }
To create the following selection DAG:
Initial selection DAG: BB#0 '_Z3fooRiRDv4_f:entry' SelectionDAG has 20 nodes: t0: ch = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0 t4: i64,ch = CopyFromReg t0, Register:i64 %vreg1 t5: i64 = Constant<0> t7: v4f32,ch = load<LD16[%v](tbaa=<0x5990268>)> t0, t4, undef:i64 t8: v4f32 = fmul t7, t7 t9: ch = store<ST16[%v](tbaa=<0x5990268>)> t7:1, t8, t4, undef:i64 t10: i32,ch = load<LD4[%i](tbaa=<0x59926b8>)> t9, t2, undef:i64 t12: i32 = add t10, Constant:i32<1> t13: ch = store<ST4[%i](tbaa=<0x59926b8>)> t10:1, t12, t2, undef:i64 t14: i64 = sign_extend t10 t15: f32 = extract_vector_elt t8, t14 t18: ch,glue = CopyToReg t13, Register:f32 %XMM0, t15 t19: ch = X86ISD::RET_FLAG t18, TargetConstant:i16<0>, Register:f32 %XMM0, t18:1
The store (t9) that will be re-used for the extract_vector_elt is now followed in the dependency chain by a load and a store (t9->t10->t13). By the previous logic we need to replace the chain going out of the store. However, this will get the cycle in the DAG as in the original testcase, so we haven't solved the problem.
In this case we can change the dependency chain and insert the new load between t10 and t13. The new load will occur before the store (t13) maintaining the store ordering but after the load so avoiding the cycle in the DAG.
The problem is, I can see that it is theoretically possible to have multiple chains emanating from the re-used store all of which have a problematic load/store. We can't fix all of these chains by inserting the new load between the load and the store.