[SROA] Teach SROA how to much more intelligently handle split loads and stores.

When there are accesses to an entire alloca with an integer load or
store as well as accesses to small pieces of the alloca, SROA splits up
the large integer accesses. In order to do that, it uses bit math to
merge the small accesses into large integers. While this is effective,
it produces insane IR that can cause significant problems in the rest
of the optimizer:

- It can cause load and store mismatches with GVN on the non-alloca side where we end up loading an i64 (or some such) rather than loading specific elements that are stored.
- We can't always get rid of the integer bit math, which is why we can't always fix the loads and stores to work well with GVN.
- This is especially bad when we have operations that mix poorly with integer bit math such as floating point operations.
- It will block things like the vectorizer, which might be able to handle the scalar stores that underlie the aggregate.
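
As a source-level analogy for the "bit math" described above, here is a
minimal sketch of two floats being merged into, and extracted from, one
64-bit integer. It is not the IR that SROA emits. All function names are
hypothetical, and the element order assumes a little-endian layout.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a float's bits as a 32-bit integer, and back, via memcpy
// so that no aliasing rules are violated.
static std::uint32_t floatBits(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  return u;
}

static float bitsToFloat(std::uint32_t u) {
  float f;
  std::memcpy(&f, &u, sizeof f);
  return f;
}

// Merge two 32-bit pieces into one 64-bit integer with shift and or.
// Assumption: the element at offset 0 lives in the low half.
std::uint64_t packPair(float lo, float hi) {
  return static_cast<std::uint64_t>(floatBits(lo)) |
         (static_cast<std::uint64_t>(floatBits(hi)) << 32);
}

// Extract each piece again with truncation and shift.
float extractLow(std::uint64_t v) {
  return bitsToFloat(static_cast<std::uint32_t>(v));
}

float extractHigh(std::uint64_t v) {
  return bitsToFloat(static_cast<std::uint32_t>(v >> 32));
}
```

Every floating point operation on such a value has to be bracketed by
an extract and a pack, which is the kind of traffic the bullets above
complain about.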

At the same time, we can't just directly split up these loads and
stores in all cases. If there is actual integer arithmetic involved on
the values, then using integer bit math is actually the perfect
lowering because we can often combine it heavily with the surrounding
math.

The solution this patch provides is to find places where SROA is
partitioning aggregates into small elements, and look for splittable
loads and stores that it can split all the way to some other adjacent
load and store. In every case I have seen where failing to split the
loads and stores hurts the optimizer, the accesses fit this pattern,
and I've looked extensively at the code produced by both more and less
aggressive approaches to this problem.
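
To make the target pattern concrete, here is a rough source-level
sketch of a wide load feeding a wide store, next to the element-wise
form that splitting aims for. The Pair type and both function names are
hypothetical, and the sketch assumes a 4-byte float. The real transform
operates on LLVM IR, not on C++.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical two-element aggregate standing in for std::complex<float>.
struct Pair {
  float re;
  float im;
};

static_assert(sizeof(Pair) == sizeof(std::uint64_t),
              "sketch assumes two 4-byte floats with no padding");

// Unsplit form: the whole aggregate travels through one 64-bit integer,
// much like a single wide integer load followed by a wide integer store.
void copyWide(const Pair &src, Pair &dst) {
  std::uint64_t tmp;
  std::memcpy(&tmp, &src, sizeof tmp);
  std::memcpy(&dst, &tmp, sizeof tmp);
}

// Split form: each element is loaded and stored at its own type, so no
// integer packing or extraction is needed in between.
void copySplit(const Pair &src, Pair &dst) {
  float re = src.re;
  float im = src.im;
  dst.re = re;
  dst.im = im;
}
```

Both functions copy the same bytes. The difference is only in which
value types the rest of the optimizer gets to see.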

However, it is quite tricky to actually do this in SROA. We may have
loads and stores to the same alloca, or other complex patterns that are
hard to handle. This complexity leads to the somewhat subtle algorithm
implemented here. We have to do this entire process as a separate pass
over the partitioning of the alloca, and split up all of the loads
prior to splitting the stores so that we can safely handle the cases of
overlapping, including partially overlapping, loads and stores to the
same alloca. We also have to reconstitute the post-split slice
configuration so we can avoid iterating again over all the alloca uses
(the slow part of SROA). But we also have to ensure that when we split
up loads and stores to *other* allocas, we *do* re-iterate over them in
SROA to adapt to the more refined partitioning now required.
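
The ordering constraint above can be illustrated with a toy model. This
is only a sketch under stated assumptions: Slice, Kind, splitAt, and
presplit are hypothetical names, not the data structures or functions in
SROA, and the model tracks byte ranges only. It does not rewrite any
instructions or requeue other allocas.

```cpp
#include <cstdint>
#include <vector>

enum class Kind { Load, Store };

// Hypothetical stand-in for one access to an alloca: the half-open byte
// range [begin, end) and whether it is a load or a store.
struct Slice {
  std::uint64_t begin;
  std::uint64_t end;
  Kind kind;
};

// Split one slice at every partition boundary that falls strictly
// inside it. Assumes `boundaries` is sorted in ascending order.
std::vector<Slice> splitAt(const Slice &s,
                           const std::vector<std::uint64_t> &boundaries) {
  std::vector<Slice> out;
  std::uint64_t cur = s.begin;
  for (std::uint64_t b : boundaries) {
    if (b <= cur || b >= s.end)
      continue; // boundary is outside the unsplit remainder
    out.push_back({cur, b, s.kind});
    cur = b;
  }
  out.push_back({cur, s.end, s.kind});
  return out;
}

// Rebuild the slice list with every load split first and every store
// split afterwards, mirroring the loads-before-stores ordering.
std::vector<Slice> presplit(const std::vector<Slice> &slices,
                            const std::vector<std::uint64_t> &boundaries) {
  std::vector<Slice> out;
  for (Kind k : {Kind::Load, Kind::Store})
    for (const Slice &s : slices)
      if (s.kind == k)
        for (const Slice &piece : splitAt(s, boundaries))
          out.push_back(piece);
  return out;
}
```

For example, one 8-byte load and one 8-byte store with a partition
boundary at byte 4 become four slices: the two load halves, followed by
the two store halves.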

With this, I actually think we can fix a long-standing TODO in SROA
where I avoided splitting as many loads and stores as probably should
be splittable. This limitation historically mitigated the fallout of
all the bad things mentioned above. Now that we have more intelligent
handling, I plan to remove the FIXME and more aggressively mark integer
loads and stores as splittable. I'll do that in a follow-up patch to
help with bisecting any fallout.

The net result of this change should be more fine-grained and accurate
scalars being formed out of aggregates. At the very least, Clang now
generates perfect code for this high-level test case using
std::complex<float>:

    #include <complex>

    void g1(std::complex<float> &x, float a, float b) {
      x += std::complex<float>(a, b);
    }

    void g2(std::complex<float> &x, float a, float b) {
      x -= std::complex<float>(a, b);
    }

    void foo(const std::complex<float> &x, float a, float b,
             std::complex<float> &x1, std::complex<float> &x2) {
      std::complex<float> l1 = x;
      g1(l1, a, b);
      std::complex<float> l2 = x;
      g2(l2, a, b);
      x1 = l1;
      x2 = l2;
    }

This code isn't just hypothetical either. It was reduced out of the hot
inner loops of essentially every part of the Eigen math library when
using std::complex<float>. Those loops would consistently and
pervasively hop between the floating point unit and the integer unit
due to bit math extraction and insertion of floating point values that
were "stored" in a 64-bit integer register around the loop backedge.

So far, this change has passed a bootstrap and some other testing with
no issues. That doesn't mean there won't be any, though, so I'll be
prepared to help with any fallout. If you see performance swings in
particular, please let me know. I'm very curious what the full impact
of this change will be. Stay tuned for the follow-up to also split more
integer loads and stores.