Instead of inserting every new instruction after the root of the reduction, insert each one as close to its operands as possible. This can help reduce register pressure, since the operands no longer have to stay live all the way down to the root.
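To illustrate the placement rule, here is a minimal sketch on a toy model, not the code from this patch. It assumes a single basic block in which every value is identified by the index of the instruction that defines it. The names `earliestInsertIndex` and `afterRootInsertIndex` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy model (assumption): one basic block, each value is named by the index
// of its defining instruction. PHIs, multiple blocks and dominance are ignored.

// New strategy: place the instruction immediately after its last-defined
// operand, which is the earliest point where all operands are available.
std::size_t earliestInsertIndex(const std::vector<std::size_t> &operandDefs) {
  std::size_t point = 0; // with no operands, the top of the block is valid
  for (std::size_t def : operandDefs)
    point = std::max(point, def + 1);
  return point;
}

// Old strategy, for comparison: always place the instruction after the root.
std::size_t afterRootInsertIndex(std::size_t rootIndex) {
  return rootIndex + 1;
}
```

With operands defined at indices 1 and 3 and the root at index 9, the new instruction would go at index 4 rather than index 10, so the two operands can die six instructions earlier.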
Note: I have no idea why git has decided that I've made a change to an MC test.
Just curious: why did you change the iteration order (Loads/Writes vs. Writes/Loads)?