This started out as an InstCombine patch,

but it does not really fit there: https://reviews.llvm.org/D138766

Here, when we encounter a load with a variable index,

instead of immediately bailing, we bail only if the profitability check says it's not worth it;

otherwise we record it, and rewrite it iff we don't bail for other reasons.

The rewrite is straightforward: just load the whole alloca, and extract the required bit sequence.
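To make that concrete, here is a rough C++ sketch (my illustration, not the patch's actual IR rewrite; `load_byte_at` is a made-up name): a variable-indexed byte load from an 8-byte alloca becomes one whole-alloca load plus a variable-amount shift and a truncate:

```cpp
#include <cstdint>
#include <cstring>
#include <cassert>

// Hypothetical model of the rewrite: instead of indexing into the
// alloca's bytes with a variable index, load the whole alloca as one
// wide integer and extract the addressed byte with a variable-amount
// shift plus truncation.
uint8_t load_byte_at(const unsigned char (&alloca_bytes)[8], unsigned index) {
  assert(index < 8);
  uint64_t whole;
  std::memcpy(&whole, alloca_bytes, sizeof(whole)); // load whole alloca
  // Assumes little-endian byte numbering; the real rewrite has to
  // account for target endianness.
  return static_cast<uint8_t>(whole >> (8 * index)); // extract bit sequence
}
```

The actual transform operates on LLVM IR (`load` + `lshr` + `trunc`), but the cost structure is the same: one wide load and one variable-amount shift.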

https://discourse.llvm.org/t/where-could-we-perform-sudo-sroa-with-non-constant-offsets/66645

First, consider: (example 0)

```cpp
#include <cstdlib>
#include <cstring>
#include <algorithm>

void sink(char*);

constexpr int size = 4;

void entry(char* input, int length, int position) {
  int max_bytes = length - position;
  int bytes = std::min(size, max_bytes);
  char res[size] = {};
  memcpy(res, input + position, bytes);
  sink(res);
}
```

This function has to perform a dynamically-sized, but bounded, `memcpy`,

which may or may not be good, depending on the particular use case: https://godbolt.org/z/3cd6bvEq5

Now, let's look at another variant (not strictly identical): (example 1)

```cpp
#include <cstdlib>
#include <cstring>
#include <algorithm>

void sink(char*);

constexpr int size = 4;

void entry(char* input, int length, int position) {
  int last_pos = length - size;
  int clamped_pos = std::min(position, last_pos);
  char tmp[2 * size] = {};
  memcpy(tmp, input + clamped_pos, size);
  int num_leading_padding_bytes = std::min(size, position - clamped_pos);
  char res[size] = {};
  memcpy(res, tmp + num_leading_padding_bytes, size);
  sink(res);
}
```

Here, both memory loads are statically-sized.

Under some external preconditions that are not relevant here,

the two examples are equivalent.

The problem is that the second `memcpy` loads from a non-constant offset into `tmp`,

and SROA does not deal with non-constant offsets, so we end up with `tmp`

not being promoted into a register: https://godbolt.org/z/ebPrrjaa6

So while this may or may not already be better than the original variant,

it is still not great. This can come up in hot paths; e.g., (example 0) is

https://github.com/darktable-org/rawspeed/blob/6be00ea43b92c876692593436f9edbbf70d4c3d4/src/librawspeed/io/BitStream.h#L145-L173

and I was in the process of improving it into (example 1), but got stuck on the performance.

The transformation itself isn't that complicated,

we just don't have a great place for it.

I've added hopefully sufficiently exhaustive test coverage,

and verified it with Alive2.

Now, huge caveat: indeed, this needs a profitability check.

Profitability reasoning: we expect that for the largest legal int type, we

do have good support for variable-amount shifts. For a type twice that

width, legalization will expand the shift into, at worst, 3 legal-sized

shifts + 5 supporting ALU ops. We expect that such an expansion is still

not worse than the original pattern we have matched here.

But for any bit width larger than that, this isn't worth it.
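For intuition, here is a rough C++ model (my sketch, not the actual legalizer output, which uses a branchless select-based expansion) of how a variable-amount i128 logical shift right decomposes into legal 64-bit operations:

```cpp
#include <cstdint>
#include <cassert>

// Rough model of legalizing a variable-amount i128 lshr into i64 ops.
// The branchy version below just shows the two cases and why the cost
// stays at a handful of legal-width shifts plus supporting ALU ops.
struct U128 { uint64_t lo, hi; };

U128 lshr128(U128 x, unsigned amt) { // amt in [0, 127]
  assert(amt < 128);
  if (amt == 0)
    return x; // avoid the UB of shifting a 64-bit value by 64
  if (amt < 64)
    return { (x.lo >> amt) | (x.hi << (64 - amt)), x.hi >> amt };
  return { x.hi >> (amt - 64), 0 };
}
```

Counting the worst-case path gives roughly the 3 shifts plus a few supporting ALU ops mentioned above, which is where the "one doubling past the largest legal width is still acceptable" cutoff comes from.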

Codegen for the true i128 case: https://alive2.llvm.org/ce/z/Tu85qE

I think this is pretty uncontentious for the largest legal integer,

but unfortunately I'm interested in "load i32 from i64" **and** "load i64 from i128" :)

Silver lining: in the case I'm looking at, the upper half of the alloca is always zero,

so after SROA, this becomes: https://alive2.llvm.org/ce/z/FgRHaZ,

and now the i128 codegen is rather good, isn't it?

D140638 (`[Codegen][LegalizeIntegerTypes] New legalization strategy for scalar shifts: shift through stack.`)

has taught LLVM how to undo this kind of transformation during codegen,

in fact, using effectively the very same profitability heuristic,

so we don't seem to need a TLI hook here.