We don't need to fix the PHI if the nearest common dominator of the two incoming blocks terminates with a uniform branch. But it looks like this condition is not strong enough to guarantee that both incoming blocks are reached through uniform branches, which causes a loop iteration order regression. Here is the debug output for the failing case:
MI: %vreg5<def> = PHI %vreg125, <BB#1>, %vreg12, <BB#3>; SReg_64:%vreg5,%vreg125,%vreg12

MBB0: BB#1: derived from LLVM BB %for.body.preheader
    Predecessors according to CFG: BB#0
    %vreg126<def> = S_MOV_B32 0; SReg_32_XM0:%vreg126
    %vreg125<def> = S_MOV_B64 0; SReg_64:%vreg125
    S_BRANCH <BB#3>
    Successors according to CFG: BB#3(?%)

MBB1: BB#3: derived from LLVM BB %for.body
    Predecessors according to CFG: BB#1 BB#3
    %vreg5<def> = PHI %vreg125, <BB#1>, %vreg12, <BB#3>; SReg_64:%vreg5,%vreg125,%vreg12
    %vreg6<def> = PHI %vreg1, <BB#1>, %vreg9, <BB#3>; VGPR_32:%vreg6,%vreg1,%vreg9
    %vreg7<def> = PHI %vreg126, <BB#1>, %vreg11, <BB#3>; SReg_32_XM0:%vreg7,%vreg126,%vreg11
    %vreg8<def> = PHI %vreg2, <BB#1>, %vreg10, <BB#3>; VGPR_32:%vreg8,%vreg2,%vreg10
    %vreg127<def,tied6> = V_MAC_F32_e64 0, %vreg6, 0, %vreg6, 0, %vreg1<tied0>, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg127,%vreg6,%vreg6,%vreg1
    %vreg9<def> = V_MAD_F32 1, %vreg8, 0, %vreg8, 0, %vreg127<kill>, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg9,%vreg8,%vreg8,%vreg127
    %vreg128<def> = V_ADD_F32_e64 0, %vreg6, 0, %vreg6, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg128,%vreg6,%vreg6
    %vreg10<def,tied6> = V_MAC_F32_e64 0, %vreg128<kill>, 0, %vreg8, 0, %vreg2<tied0>, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg10,%vreg128,%vreg8,%vreg2
    %vreg129<def> = S_MOV_B32 1; SReg_32_XM0:%vreg129
    %vreg11<def> = S_ADD_I32 %vreg7, %vreg129<kill>, %SCC<imp-def,dead>; SReg_32_XM0:%vreg11,%vreg7,%vreg129
    %vreg130<def> = V_MUL_F32_e64 0, %vreg10, 0, %vreg10, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg130,%vreg10,%vreg10
    %vreg131<def,tied6> = V_MAC_F32_e64 0, %vreg9, 0, %vreg9, 0, %vreg130<tied0>, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg131,%vreg9,%vreg9,%vreg130
    %vreg132<def> = V_MOV_B32_e32 1082130432, %EXEC<imp-use>; VGPR_32:%vreg132
    %vreg133<def> = V_CMP_NLE_F32_e64 0, %vreg131<kill>, 0, %vreg132<kill>, 0, 0, %EXEC<imp-use>; SReg_64:%vreg133 VGPR_32:%vreg131,%vreg132
    %vreg135<def> = COPY %vreg21; VGPR_32:%vreg135 SReg_32_XM0:%vreg21
    %vreg134<def> = V_CMP_GE_U32_e64 %vreg11, %vreg135, %EXEC<imp-use>; SReg_64:%vreg134 SReg_32_XM0:%vreg11 VGPR_32:%vreg135
    %vreg136<def> = S_OR_B64 %vreg134<kill>, %vreg133<kill>, %SCC<imp-def,dead>; SReg_64:%vreg136,%vreg134,%vreg133
    %vreg12<def> = SI_IF_BREAK %vreg136<kill>, %vreg5, %SCC<imp-def,dead>; SReg_64:%vreg12,%vreg136,%vreg5
    SI_LOOP %vreg12, <BB#3>, %EXEC<imp-def,dead>, %SCC<imp-def,dead>, %EXEC<imp-use>; SReg_64:%vreg12
    S_BRANCH <BB#4>
    Successors according to CFG: BB#4(0x04000000 / 0x80000000 = 3.12%) BB#3(0x7c000000 / 0x80000000 = 96.88%)

NCD: BB#1: derived from LLVM BB %for.body.preheader
    Predecessors according to CFG: BB#0
    %vreg126<def> = S_MOV_B32 0; SReg_32_XM0:%vreg126
    %vreg125<def> = S_MOV_B64 0; SReg_64:%vreg125
    S_BRANCH <BB#3>
    Successors according to CFG: BB#3(?%)
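To make the concern with the dump above concrete, here is a rough standalone sketch of the check as I read it. It is not the patch's code and does not use LLVM's API: `ToyBlock`, `nearestCommonDominator`, `phiLooksSafe`, and `makeLoopCfg` are hypothetical names, and the `uniformTerminator` flags are my assumptions from reading the dump (BB#1 ends in an unconditional S_BRANCH, BB#3 ends in SI_LOOP on a mask derived from VGPR compares). BB#0 is marked uniform purely as a placeholder, since its terminator is not shown.

```cpp
#include <cassert>
#include <vector>

// Toy model of a dominator tree; block ids index into the vector.
struct ToyBlock {
  int idom;                // immediate dominator, -1 for the entry block
  bool uniformTerminator;  // ASSUMED flag: block ends in a uniform branch
};

// Number of idom links between block b and the entry block.
static int depth(const std::vector<ToyBlock> &blocks, int b) {
  int d = 0;
  while (blocks[b].idom != -1) {
    b = blocks[b].idom;
    ++d;
  }
  return d;
}

// Walk both blocks up the dominator tree until they meet.
static int nearestCommonDominator(const std::vector<ToyBlock> &blocks,
                                  int a, int b) {
  int da = depth(blocks, a), db = depth(blocks, b);
  while (da > db) { a = blocks[a].idom; --da; }
  while (db > da) { b = blocks[b].idom; --db; }
  while (a != b) { a = blocks[a].idom; b = blocks[b].idom; }
  return a;
}

// The condition under discussion: only the NCD's terminator is inspected.
static bool phiLooksSafe(const std::vector<ToyBlock> &blocks,
                         int in0, int in1) {
  return blocks[nearestCommonDominator(blocks, in0, in1)].uniformTerminator;
}

// Index 0 = BB#0 (entry), 1 = BB#1 (preheader), 2 = BB#3 (loop body).
static std::vector<ToyBlock> makeLoopCfg() {
  return {
      {-1, true},  // BB#0: terminator not shown in the dump, placeholder
      {0, true},   // BB#1: unconditional S_BRANCH
      {1, false},  // BB#3: SI_LOOP back edge, assumed divergent
  };
}
```

With this model, `nearestCommonDominator(cfg, 1, 2)` returns 1 (BB#1), so `phiLooksSafe` accepts the PHI even though the back edge out of BB#3 is flagged as non-uniform. That matches the NCD printed in the dump, and it suggests the check would also need to look at the terminators on the paths from the NCD to each incoming block, not just at the NCD itself.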
This should be the first included LLVM header.