This is another step towards ensuring that we produce optimal code for reductions, but there are other potential benefits, as seen in the test diffs:
- Memory loads may get scalarized resulting in more efficient code.
- Memory stores may get scalarized resulting in more efficient code.
- Complex ops like fdiv/sqrt get scalarized, which may select faster instructions depending on uarch.
- Even simple ops like addss/subss/mulss/roundss may run faster or cause less frequency throttling when scalarized, depending on uarch.
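To make the idea concrete, here is a minimal C++ sketch of the equivalence that scalarization relies on: extracting one lane of a vector binop gives the same value as applying the scalar op to the extracted lanes. This is not code from the patch; `Vec4`, `extractOfVectorDiv`, `scalarDivOfExtracts`, and `formsAgree` are hypothetical names used only for illustration, and `Vec4` is just a stand-in for a 4 x float vector.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a 4 x float vector (not an LLVM type).
using Vec4 = std::array<float, 4>;

// Vector form: divide every lane, then extract a single lane.
// Three of the four divisions are wasted if only one lane is used.
float extractOfVectorDiv(const Vec4 &x, const Vec4 &y, std::size_t lane) {
  assert(lane < x.size() && "lane index out of range");
  Vec4 q{};
  for (std::size_t i = 0; i < q.size(); ++i)
    q[i] = x[i] / y[i];
  return q[lane];
}

// Scalarized form: extract the lane from each operand first,
// then perform one scalar division.
float scalarDivOfExtracts(const Vec4 &x, const Vec4 &y, std::size_t lane) {
  assert(lane < x.size() && "lane index out of range");
  return x[lane] / y[lane];
}

// Returns true if both forms produce the same value for every lane.
// Assumes the inputs produce no NaN results (NaN != NaN would
// make the comparison fail even though the forms are equivalent).
bool formsAgree(const Vec4 &x, const Vec4 &y) {
  for (std::size_t lane = 0; lane < x.size(); ++lane)
    if (extractOfVectorDiv(x, y, lane) != scalarDivOfExtracts(x, y, lane))
      return false;
  return true;
}
```

For example, calling `formsAgree(Vec4{1, 2, 3, 4}, Vec4{2, 4, 8, 16})` compares the two forms on every lane. Whether the scalar form is actually cheaper on a given target is a separate question that depends on the uarch, as noted in the list above.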
The TODO comment suggests one or more follow-ups for opcodes that can currently result in regressions.
The tests for "minimum" and "maximum" IR in extractelement-fp.ll are commented out because those currently crash independently of this patch. I'm not sure what that problem is yet.
(style) Is clang-format happy with this?