Ok, I must admit up front: this fold is esoteric,
it does not really fit into InstCombine,
and we don't have any reasonable place for it.
https://discourse.llvm.org/t/where-could-we-perform-sudo-sroa-with-non-constant-offsets/66645
First, consider: (example 0)
```cpp
#include <cstdlib>
#include <cstring>
#include <algorithm>

void sink(char*);

constexpr int size = 4;

void entry(char* input, int length, int position) {
  int max_bytes = length - position;
  int bytes = std::min(size, max_bytes);
  char res[size] = {};
  memcpy(res, input + position, bytes);
  sink(res);
}
```
This function has to perform a dynamically-sized, but bounded, memcpy,
which may or may not be acceptable depending on the particular use case: https://godbolt.org/z/3cd6bvEq5
Now, let's look at another variant (not strictly identical): (example 1)
```cpp
#include <cstdlib>
#include <cstring>
#include <algorithm>

void sink(char*);

constexpr int size = 4;

void entry(char* input, int length, int position) {
  int last_pos = length - size;
  int clamped_pos = std::min(position, last_pos);
  char tmp[2 * size] = {};
  memcpy(tmp, input + clamped_pos, size);
  int num_leading_padding_bytes = std::min(size, position - clamped_pos);
  char res[size] = {};
  memcpy(res, tmp + num_leading_padding_bytes, size);
  sink(res);
}
```
Here, both memory loads are statically-sized.
Under some external preconditions that are not relevant here,
the two examples are equivalent.
The problem is that the second memcpy loads from a non-constant offset into tmp.
SROA does not deal with non-constant offsets, so we end up with tmp
not being promoted into a register: https://godbolt.org/z/ebPrrjaa6
So while this may or may not already be better than the original variant,
it is still not great. This can come up in hot paths; e.g. (example 0) is
https://github.com/darktable-org/rawspeed/blob/6be00ea43b92c876692593436f9edbbf70d4c3d4/src/librawspeed/io/BitStream.h#L145-L173
and I was in the process of improving it into (example 1) but got stuck on performance.
The transformation itself isn't that complicated;
we just don't have a great place for it.
I've added hopefully sufficiently exhaustive test coverage,
and verified it with Alive2.
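For illustration, here is a rough C++ sketch of what the fold does for the "load from a variable offset into a fixed-size buffer" shape above (my own sketch, assuming little-endian layout and an offset in [0, size]; not the actual IR the patch emits): the variable-offset read from tmp is replaced by one statically-sized load of the whole buffer, a variable-amount logical shift right, and a truncation.

```cpp
#include <cstdint>
#include <cstring>

// Sketch only: read 4 bytes from a variable offset within an 8-byte buffer
// without ever forming a variable-offset load. Assumes little-endian layout
// and 0 <= offset_in_bytes <= 4.
uint32_t load_at_variable_offset(const char (&tmp)[8], int offset_in_bytes) {
  uint64_t whole;
  std::memcpy(&whole, tmp, sizeof(whole)); // one wide, statically-sized load
  whole >>= 8 * offset_in_bytes;           // variable-amount shift, in bits
  return static_cast<uint32_t>(whole);     // truncate to the requested width
}
```

Once the access looks like this, there is no non-constant offset into the alloca left for SROA to trip over.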
Now, huge caveat: indeed, this needs a profitability check.
Profitability reasoning: we expect that for the largest legal int type, we
do have good support for variable-amount shifts. For the type 2x that
width, the legalization will expand the shift into, at worst, 3 legal-sized
shifts + 5 supporting ALU ops. We expect that such an expansion is still
not worse than the original pattern we have matched here.
But for any bit width larger than that, this isn't worth it.
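To make that legalization estimate concrete, here is roughly how a variable-amount i128 logical shift right decomposes into i64 halves (a hand-written, branchy sketch of the usual expansion, not the exact branchless sequence the backend emits):

```cpp
#include <cstdint>

struct U128 { uint64_t lo, hi; }; // little-endian halves

// Sketch of legalizing `lshr i128 %v, %amt` in terms of 64-bit operations.
// Assumes 0 <= amt <= 127.
U128 lshr128(U128 v, unsigned amt) {
  U128 r;
  if (amt == 0) {
    r = v;                                       // avoid a shift by 64 below
  } else if (amt < 64) {
    r.lo = (v.lo >> amt) | (v.hi << (64 - amt)); // bits funneled across halves
    r.hi = v.hi >> amt;
  } else {
    r.lo = v.hi >> (amt - 64);                   // shift crosses the half boundary
    r.hi = 0;
  }
  return r;
}
```

The real expansion is branchless (shifts plus masks/selects), which is where the "3 legal-sized shifts + 5 supporting ALU ops" figure comes from.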
Codegen for true i128 case: https://alive2.llvm.org/ce/z/Tu85qE
I think this is pretty uncontentious for the largest legal integer type,
but unfortunately I'm interested in "load i32 from i64" and "load i64 from i128" :)
Silver lining: in the case I'm looking at, the upper half of the alloca is always zeros,
so after SROA, this becomes: https://alive2.llvm.org/ce/z/FgRHaZ,
and now that i128 codegen is rather good, isn't it?
Thoughts?
As far as I can tell, this transform can only ever be profitable if we actually do end up promoting the alloca. In your tests here this is not going to happen because the pointer escapes. In this case, we just end up replacing a simple load with a complex large load and extract pattern. That's generally going to be a pretty significant regression, no?
So I think promotion has to be a pre-condition for the transform, in which case this might potentially fit into SROA? Basically allow a whole-alloca promotion by rewriting the dynamic offset access in terms of operations like these. Not sure how well that would work in practice, and of course that leaves the cost modelling question.