Currently SROA liberally creates llvm.memcpy calls when dealing with
small slices of allocas. Unfortunately there are multiple places in LLVM
that do not work too well with llvm.memcpy.
This can lead to surprising code gen, e.g. see PR47705 and PR47709 (LICM
does not hoist invariant llvm.memcpy calls).
We can side-step this issue in some cases by letting SROA emit
loads/stores instead of memcpy if the slice is small and we can
reasonably expect that vector loads and stores can be used for it.
The chosen threshold of 2 x widest vector register is somewhat
arbitrary, but should ensure that we can be reasonably confident that
those loads & stores will be lowered relatively efficiently.
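
A minimal standalone sketch of the size check described above. The
function name, its parameters, and the choice of an inclusive comparison
are assumptions made for illustration; this is not the code in the patch,
which would query the target for its vector register width.

```cpp
#include <cstdint>

// Hypothetical helper (not the actual SROA code): decide whether copying a
// slice should be emitted as plain loads/stores instead of llvm.memcpy.
// widestVectorRegBits is assumed to be supplied by the target, e.g. 128 for
// a target with 128-bit vector registers.
bool preferLoadStoreOverMemcpy(uint64_t sliceSizeBytes,
                               unsigned widestVectorRegBits) {
  // Nothing to copy, or no vector registers to rely on.
  if (sliceSizeBytes == 0 || widestVectorRegBits == 0)
    return false;
  // Threshold from the description: 2 x widest vector register, in bytes.
  // Treating the bound as inclusive is an assumption of this sketch.
  const uint64_t thresholdBytes = 2 * (uint64_t(widestVectorRegBits) / 8);
  return sliceSizeBytes <= thresholdBytes;
}
```

With 128-bit vector registers this accepts slices of up to 32 bytes and
leaves anything larger as llvm.memcpy.
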
The patch as-is is not ideal, because it potentially results in a large
number of insert/extractvalue instructions to move the loaded/stored
values to and from the slice. We could (and maybe should) try to
directly emit the correct vector loads/stores.
At this stage I am mainly interested in whether there's a reason for not
doing so already. It might not be desirable to bake too much
target-specific knowledge into something as general as SROA. I'll update
the tests if we settle on the final approach.
This potentially provides some nice performance benefits, e.g. on ARM64
with -O3 -flto, 450.soplex runs roughly 2.2% faster and the patch
generates the expected assembly for PR47705.
We should also work on improving the handling of llvm.memcpy in
different passes, but that might be tricky in some cases. For example,
it might be desirable to decompose llvm.memcpy into separate load and
store parts if this would make the load part loop-invariant.
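
A source-level illustration of that last point, written as ordinary C++
rather than LLVM IR. The struct and both function names are hypothetical.
The two loops are only equivalent under the assumptions that src does not
alias any element of dst and that src is valid to read even when n is 0;
a pass would have to prove both before hoisting.

```cpp
#include <cstddef>
#include <cstring>

struct Pair {
  double x;
  double y;
};

// Each iteration performs a full copy: the read of *src and the write to
// dst[i] are tied together inside a single memcpy.
void copyWithMemcpyInLoop(Pair *dst, const Pair *src, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    std::memcpy(&dst[i], src, sizeof(Pair));
}

// The copy is split into a load part and a store part. The load does not
// depend on i, so it can sit outside the loop; only the store remains.
// Assumes src does not alias dst[0..n) and *src is readable even if n == 0.
void copyWithHoistedLoad(Pair *dst, const Pair *src, std::size_t n) {
  Pair tmp;
  std::memcpy(&tmp, src, sizeof(Pair)); // load part, loop-invariant
  for (std::size_t i = 0; i < n; ++i)
    dst[i] = tmp;                       // store part
}
```
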