This originally started out as an InstCombine patch,
but it does not really fit there: https://reviews.llvm.org/D138766
Here, when we encounter a load with a variable index,
instead of immediately bailing, we bail only if a profitability check says it's not worth it;
otherwise we record the load, and rewrite it iff we don't bail for other reasons.
The rewrite is straightforward: just load the whole alloca, and extract the required bit sequence.
https://discourse.llvm.org/t/where-could-we-perform-sudo-sroa-with-non-constant-offsets/66645
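To make the rewrite concrete, here is a rough C++ model of the before/after (just an illustration, not the actual SROA code; little-endian layout is assumed, and the helper names are made up):

```c++
#include <cstdint>
#include <cstring>

// Before: a 4-byte load at a variable byte offset into an 8-byte alloca.
// The variable-index access is what currently blocks SROA.
uint32_t load_at_variable_offset(const unsigned char buf[8], unsigned off) {
  uint32_t res;
  memcpy(&res, buf + off, sizeof(res)); // 0 <= off <= 4
  return res;
}

// After: load the whole alloca as one wide integer, shift the wanted
// bytes down to bit 0, and truncate.
uint32_t load_at_variable_offset_rewritten(const unsigned char buf[8],
                                           unsigned off) {
  uint64_t whole;
  memcpy(&whole, buf, sizeof(whole)); // constant-offset, whole-alloca load
  return static_cast<uint32_t>(whole >> (8 * off));
}
```

After the rewrite there is no variable-index memory access left, so the alloca can be promoted and the extraction happens entirely in registers.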
First, consider: (example 0)
```c++
#include <cstdlib>
#include <cstring>
#include <algorithm>

void sink(char*);

constexpr int size = 4;

void entry(char* input, int length, int position) {
  int max_bytes = length - position;
  int bytes = std::min(size, max_bytes);
  char res[size] = {};
  memcpy(res, input + position, bytes);
  sink(res);
}
```
This function has to perform a dynamically-sized, but bounded, memcpy,
which may or may not be good, depending on the particular use case: https://godbolt.org/z/3cd6bvEq5
Now, let's look at another variant (not strictly identical): (example 1)
```c++
#include <cstdlib>
#include <cstring>
#include <algorithm>

void sink(char*);

constexpr int size = 4;

void entry(char* input, int length, int position) {
  int last_pos = length - size;
  int clamped_pos = std::min(position, last_pos);
  char tmp[2 * size] = {};
  memcpy(tmp, input + clamped_pos, size);
  int num_leading_padding_bytes = std::min(size, position - clamped_pos);
  char res[size] = {};
  memcpy(res, tmp + num_leading_padding_bytes, size);
  sink(res);
}
```
Here, both memory loads are statically-sized.
Under some external preconditions that are not relevant here,
the two examples are equivalent.
The problem is that the second memcpy loads from a non-constant offset into tmp,
and SROA does not deal with non-constant offsets, so we end up with tmp
not being promoted into a register: https://godbolt.org/z/ebPrrjaa6
So while this may or may not already be better than the original variant,
it is still not great. This can come up in hot paths, e.g. (example 0) is
https://github.com/darktable-org/rawspeed/blob/6be00ea43b92c876692593436f9edbbf70d4c3d4/src/librawspeed/io/BitStream.h#L145-L173
and i was in the process of improving it into (example 1), but got stuck on performance.
The transformation itself isn't that complicated;
we just don't have a great place for it.
I've added hopefully sufficiently exhaustive test coverage,
and verified it with alive2.
Now, huge caveat: indeed, this needs a profitability check.
Profitability reasoning: we expect that for the largest legal int type, we
do have good support for variable-amount shifts. For the type 2x that
width, the legalization will expand the shift into, at worst, 3 legal-sized
shifts + 5 supporting ALU ops. We expect that such an expansion is still
not worse than the original pattern we have matched here.
But for any bit width larger than that, this isn't worth it.
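For intuition, here is roughly what that expansion looks like for a variable-amount logical right shift of a value 2x the legal width (here: i128 on a 64-bit target); this is a hand-written illustration, not actual legalizer output:

```c++
#include <cstdint>

struct u128 { uint64_t lo, hi; };

// lshr of a 128-bit value by n (0 <= n < 128) using only 64-bit operations:
// three legal-width shifts plus a handful of supporting ALU ops.
u128 lshr128(u128 x, unsigned n) {
  if (n == 0)
    return x; // sidestep the (undefined) 64-bit shift by 64 below
  if (n < 64)
    return {(x.lo >> n) | (x.hi << (64 - n)), x.hi >> n};
  return {x.hi >> (n - 64), 0};
}
```

In actual codegen the branches become selects/cmovs, which is where the supporting ALU ops come from.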
Codegen for the true i128 case: https://alive2.llvm.org/ce/z/Tu85qE
I think this is pretty uncontentious for the largest legal integer,
but unfortunately i'm interested in "load i32 from i64" and "load i64 from i128" :)
Silver lining: in the case i'm looking at, the upper half of the alloca is always all-zeros,
so after SROA, this becomes: https://alive2.llvm.org/ce/z/FgRHaZ,
and now the resulting i128 codegen is rather good, isn't it?
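To spell out why the known-zero upper half helps, here is a hypothetical sketch of the "load i64 from i128" case (mine, not taken from the patch or the linked example): with the top half of the 16-byte alloca known to be zero, the wide shift-and-truncate degenerates into at most one i64 shift.

```c++
#include <cstdint>

// Extract 8 bytes at a variable byte offset off (0 <= off <= 8) from a
// 16-byte alloca (hi:lo) where hi is known to be all-zeros. The generic
// form is trunc((hi:lo) >> (8 * off)) to i64, but since hi == 0, the
// upper half only ever contributes zero bits.
uint64_t load_i64_from_i128_hi_zero(uint64_t lo, unsigned off) {
  unsigned n = 8 * off;
  return n < 64 ? lo >> n : 0; // single legal-width shift (or a constant)
}
```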
D140638 ([Codegen][LegalizeIntegerTypes] New legalization strategy for scalar shifts: shift through stack.)
has taught LLVM how to undo this kind of transformation during codegen,
in fact using effectively the very same profitability heuristic,
so we don't seem to need a TLI hook here.
We could maintain a cache for these two functions,
but it would only be effective for the current AllocaSlices,
so i'm not sure if that is something we want?