This is a WIP of a somewhat more principled attempt to solve https://github.com/llvm/llvm-project/issues/59116
Admittedly, i am not familiar neither with this pass, nor with MSSA.
We have something like this:
MDep: memcpy(tmp <- a) ... ... a is potentially modified inbetween, e.g.: memcpy(a <- b) ... M: memcpy(b <- tmp)
Since we know that tmp is last modified by MDep,
what we can do, is expand MDep's memcpy into load+store pair,
and then expand M's memcpy into a store of the MDep's load:
reload = load a store reload, tmp ; spill ... store reload, b ; final store
This pattern can happen e.g. when swapping contents of the a and b,
in which case tmp might go away completely, especially if it is an alloca.
This isn't always an obvious improvement, and in general, creating large
vectors can easily cause problematic compile-time implications, so there is a
profitability heuristic: loading said vector should not require more vectors
than theoretically available on the given target.
Conceptually, does this make sense? Thoughts on the profitability check?
Should we actually ensure that the clobber is a memcpy(a <- b) ?
I suspect it's still too lax, yet we will have compile-time implications regardless.
So we want to use up *all* vector registers for the copy? That's like 64 * 32 = 2048 bytes for AVX-512. That seems *way* too aggressive.