This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] If loading from small alloca, load whole alloca and perform variable extraction
Abandoned · Public

Authored by lebedev.ri on Nov 27 2022, 11:18 AM.

Details

Summary

OK, I must admit up front: this fold is esoteric,
does not really fit into InstCombine,
and we don't have any reasonable place for it.

https://discourse.llvm.org/t/where-could-we-perform-sudo-sroa-with-non-constant-offsets/66645

First, consider (example 0):

#include <cstdlib>
#include <cstring>
#include <algorithm>

void sink(char*);

constexpr int size = 4;

void entry(char* input, int length, int position) {
    int max_bytes = length - position;
    int bytes = std::min(size, max_bytes);
    char res[size] = {};
    memcpy(res, input + position, bytes);
    sink(res);
}

This function has to perform a dynamically-sized, but bounded, memcpy,
which may or may not be good, depending on the particular use case: https://godbolt.org/z/3cd6bvEq5

Now, let's look at another variant, not strictly identical to the first (example 1):

#include <cstdlib>
#include <cstring>
#include <algorithm>

void sink(char*);

constexpr int size = 4;

void entry(char* input, int length, int position) {
    int last_pos = length - size;
    int clamped_pos = std::min(position, last_pos);
    char tmp[2 * size] = {};
    memcpy(tmp, input + clamped_pos, size);
    int num_leading_padding_bytes = std::min(size, position - clamped_pos);
    char res[size] = {};
    memcpy(res, tmp + num_leading_padding_bytes, size);
    sink(res);
}

Here, both memory loads are statically sized.
Under some external preconditions that are not relevant here,
the two examples are equivalent.

The problem is that the second memcpy reads tmp at a non-constant offset,
and SROA does not deal with non-constant offsets, so we end up with tmp
not being promoted into a register: https://godbolt.org/z/ebPrrjaa6
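
For reference, here is a minimal sketch of the blocking pattern in hand-written, simplified LLVM IR (illustrative names; not the exact IR clang emits for example 1):

; The load's offset into the alloca is not a compile-time constant,
; so SROA cannot split the alloca, and it stays in memory.
define i32 @load_at_variable_offset(ptr %src, i64 %byte_off) {
  %tmp = alloca [8 x i8], align 8
  call void @llvm.memcpy.p0.p0.i64(ptr %tmp, ptr %src, i64 8, i1 false)
  %addr = getelementptr inbounds i8, ptr %tmp, i64 %byte_off ; variable offset
  %res = load i32, ptr %addr                                 ; blocks promotion
  ret i32 %res
}
declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)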

So while this may or may not already be better than the original variant,
it is still not great. This can come up in hot paths, e.g. (example 0) is
https://github.com/darktable-org/rawspeed/blob/6be00ea43b92c876692593436f9edbbf70d4c3d4/src/librawspeed/io/BitStream.h#L145-L173
and I was in the process of improving it into (example 1) but got stuck on the performance.

The transformation itself isn't that complicated;
we just don't have a great place for it.
I've added hopefully sufficiently exhaustive test coverage,
and verified it with alive2.
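
To make that concrete, here is a hedged sketch of the shape of IR the fold produces (illustrative names; assumes a little-endian target and an 8-byte alloca, and glosses over byte-wise poison handling):

define i32 @extract_via_wide_load(ptr %tmp, i64 %byte_off) {
  %wide    = load i64, ptr %tmp            ; load the whole alloca at once
  %bitoff  = mul nuw nsw i64 %byte_off, 8  ; byte offset -> bit offset
  %shifted = lshr i64 %wide, %bitoff       ; little-endian: shift the wanted
  %res     = trunc i64 %shifted to i32     ; bytes into the low end, then trunc
  ret i32 %res
}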

Now, huge caveat: indeed, this needs a profitability check.
Profitability reasoning: we expect that, for the largest legal integer type, we
have good support for variable-amount shifts. For the type twice that
width, legalization will expand the shift into, at worst, 3 legal-sized
shifts plus 5 supporting ALU ops. We expect that such an expansion is still
not worse than the original pattern we have matched here.
But for any bit width larger than that, this isn't worth it.
Codegen for true i128 case: https://alive2.llvm.org/ce/z/Tu85qE
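
For the record, here is one correct way such a double-width shift can be expanded into legal-sized operations, sketched as an i128 lshr over i64 halves (a hand-written approximation of typical shift-parts legalization, not what any particular backend is guaranteed to emit); the funnel shift accounts for two of the three shifts, and the and/icmp/or/selects are the supporting ALU ops:

; Expand "lshr i128" into operations on i64 halves; %amt is in [0, 127].
define { i64, i64 } @lshr_i128_parts(i64 %lo, i64 %hi, i64 %amt) {
  %amt.lo   = and i64 %amt, 63        ; shift amount within one word
  %big      = icmp uge i64 %amt, 64   ; does the shift cross the word boundary?
  ; fshr is defined for a zero amount, unlike a raw (64 - %amt) shift pair
  %lo.small = call i64 @llvm.fshr.i64(i64 %hi, i64 %lo, i64 %amt.lo)
  %hi.shift = lshr i64 %hi, %amt.lo
  %lo.res   = select i1 %big, i64 %hi.shift, i64 %lo.small
  %hi.res   = select i1 %big, i64 0, i64 %hi.shift
  %r0 = insertvalue { i64, i64 } poison, i64 %lo.res, 0
  %r1 = insertvalue { i64, i64 } %r0, i64 %hi.res, 1
  ret { i64, i64 } %r1
}
declare i64 @llvm.fshr.i64(i64, i64, i64)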

I think this is pretty uncontentious for the largest legal integer,
but unfortunately I'm interested in "load i32 from i64" and "load i64 from i128" :)
Silver lining: in the case I'm looking at, the upper half of the alloca is always zeros,
so after SROA, this becomes: https://alive2.llvm.org/ce/z/FgRHaZ,
and now that i128 codegen is rather good, isn't it?

Thoughts?

Diff Detail

Event Timeline

lebedev.ri created this revision. Nov 27 2022, 11:18 AM
Herald added a project: Restricted Project. Nov 27 2022, 11:18 AM
Herald added a subscriber: hiraditya.
lebedev.ri requested review of this revision. Nov 27 2022, 11:18 AM
lebedev.ri edited the summary of this revision. Nov 28 2022, 5:03 AM

Correction: while we can't do much for the i128 case in general,
in the case I'm looking at, the upper half of the alloca is always zeros,
so after SROA, this becomes: https://alive2.llvm.org/ce/z/FgRHaZ,
and now that i128 codegen is rather good, isn't it?

I'm not quite sure we can/should factor that into the profitability check, though;
we'd need to look at other stores into the alloca, etc.
But we can at least look into teaching later optimizations to do that after SROA.

Note: the main question I want to resolve here is the profitability check.
For the rest, I'd be comfortable switching to post-commit review mode after adding a bit more tests.

I'm confused by the tests - do we care if the full load already exists? That's not part of the pattern match.

Does the 2x register limit mean we are also creating a double-width load? Are we relying on later passes/codegen to narrow that?

If we're trying to justify this target-independently, then let's use a less familiar target to avoid reaching the conclusion that the transform is generally good.
I have very little idea about what is happening here with RISCV64:
https://godbolt.org/z/vEcMP6P4x

lebedev.ri added a comment (edited). Dec 1 2022, 9:14 AM

Thank you for taking a look!
I'm really looking forward to this change.

I'm confused by the tests - do we care if the full load already exists? That's not part of the pattern match.

Perhaps I should adjust the tests. That was
mostly for illustrative purposes, and it stuck.

Does the 2x register limit mean we are also creating a double-width load? Are we relying on later passes/codegen to narrow that?

We are indeed creating a double-width load,
which is generally going to survive until codegen.
Alternatively, we could produce non-canonical IR here,
by performing the legalization ourselves.
Would that be better?

If we're trying to justify this target-independently, then let's use a less familiar target to avoid reaching the conclusion that the transform is generally good.
I have very little idea about what is happening here with RISCV64:
https://godbolt.org/z/vEcMP6P4x

Please clarify: would it be less burdensome to first proceed with
a single-width change, and then discuss relaxing it to double-width?

Does the 2x register limit mean we are also creating a double-width load? Are we relying on later passes/codegen to narrow that?

We are indeed creating a double-width load,
which is generally going to survive until codegen.
Alternatively, we could produce non-canonical IR here,
by performing the legalization ourselves.
Would that be better?

I'm not sure it makes much difference to subsequent IR transforms, but creating a known illegal load seems like a (scary) precedent for target-independent canonicalization.

If we're trying to justify this target-independently, then let's use a less familiar target to avoid reaching the conclusion that the transform is generally good.
I have very little idea about what is happening here with RISCV64:
https://godbolt.org/z/vEcMP6P4x

Please clarify: would it be less burdensome to first proceed with
a single-width change, and then discuss relaxing it to double-width?

Yes, the single-width seems less scary, but we're still really stretching to call this a canonicalization. Does transforming later in IR miss some other optimizations that you want to happen?
For example, I'm still not sure what is happening here:
https://godbolt.org/z/Tcq8af83j

Have I messed up the target specification in some way? Assuming the 2nd version is worse, how is a target expected to reverse it?

Thank you for taking a look!

I remember now: that extra IR was needed to make alive2 happy;
it still has rough edges around not checking the contents of pointers passed to calls,
so we have to spell out the IR somehow.

Does the 2x register limit mean we are also creating a double-width load? Are we relying on later passes/codegen to narrow that?

We are indeed creating a double-width load,
which is generally going to survive until codegen.
Alternatively, we could produce non-canonical IR here,
by performing the legalization ourselves.
Would that be better?

I'm not sure it makes much difference to subsequent IR transforms, but creating a known illegal load seems like a (scary) precedent for target-independent canonicalization.

If we're trying to justify this target-independently, then let's use a less familiar target to avoid reaching the conclusion that the transform is generally good.
I have very little idea about what is happening here with RISCV64:
https://godbolt.org/z/vEcMP6P4x

Please clarify: would it be less burdensome to first proceed with
a single-width change, and then discuss relaxing it to double-width?

Yes, the single-width seems less scary, but we're still really stretching to call this a canonicalization.

Does transforming later in IR miss some other optimizations that you want to happen?

What does "transforming later" mean? If we don't do this transform,
then we are left with a variable-indexed load from the alloca,
with which we won't be able to do anything else;
the entirety of the alloca won't be promoted to registers,
and that is usually a big loss.

For example, I'm still not sure what is happening here:
https://godbolt.org/z/Tcq8af83j

Have I messed up the target specification in some way? Assuming the 2nd version is worse, how is a target expected to reverse it?

I'm sorry, I'm not familiar with that architecture.
@craig.topper / @reames should be able to comment on that.

riscv64 doesn't have vectors without the "v" extension, so anything involving byte vectors is going to get scalarized to something very messy. With -mattr=+v, you get something similar to what you'd expect on other targets.

It feels a little weird to me that you're using an integer shift on a vector... I'd guess in most cases, you'd want some sort of variable shuffle (like x86 pshufb).

Thank you for taking a look!

riscv64 doesn't have vectors without the "v" extension, so anything involving byte vectors is going to get scalarized to something very messy. With -mattr=+v, you get something similar to what you'd expect on other targets.

It feels a little weird to me that you're using an integer shift on a vector... I'd guess in most cases, you'd want some sort of variable shuffle (like x86 pshufb).

As per LangRef, poison is byte-wise, so in situations like this we can't just load an i64;
we must load <8 x i8>, freeze it(!), and only then bitcast to i64.
I'm not using a vector for anything else here; the shift is scalar.
Also, we don't really have a way in LLVM IR to represent a shuffle with a variable mask.
The shift is the right choice here.
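
For illustration, a hedged sketch of that sequence (illustrative names; little-endian target, 8-byte alloca): the freeze happens while the value is still a byte vector, so each byte's poison is frozen independently before the bytes are merged into a single integer.

define i32 @extract_frozen(ptr %tmp, i64 %byte_off) {
  %bytes   = load <8 x i8>, ptr %tmp         ; load byte-wise...
  %frozen  = freeze <8 x i8> %bytes          ; ...freeze per-byte poison...
  %wide    = bitcast <8 x i8> %frozen to i64 ; ...then merge into one integer
  %bitoff  = mul nuw nsw i64 %byte_off, 8
  %shifted = lshr i64 %wide, %bitoff         ; the extraction shift is scalar
  %res     = trunc i64 %shifted to i32
  ret i32 %res
}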

I'm not using a vector for anything else here

All your testcases involve using the result as a vector? That seems relevant.

Also, we don't really have a way in LLVM IR to represent a shuffle with a variable mask.

Not in a target-independent way, no. Almost every target with a vector unit has some sort of variable shuffle intrinsic, but the semantics aren't consistent.

I'm not using a vector for anything else here

All your testcases involve using the result as a vector? That seems relevant.

The call void @use.v8i8(<8 x i8> %init) things are test artifacts,
common to InstCombine and other well-tested passes.
They have two purposes:

  1. mark the value as not having just a single use, but some unknown opaque user;
  2. escape it (its value) as a side effect, which allows e.g. alive2 to complain if the value somehow changes.

So no, I really don't use it as a vector. @spatel will confirm this.

Also, we don't really have a way in LLVM IR to represent a shuffle with a variable mask.

Not in a target-independent way, no. Almost every target with a vector unit has some sort of variable shuffle intrinsic, but the semantics aren't consistent.

nikic added a reviewer: nikic. Dec 5 2022, 11:53 AM

(if an off-list conversation could help move things forward, please feel free to email me...)

nikic added inline comments. Dec 8 2022, 6:24 AM
llvm/test/Transforms/InstCombine/widen-load-of-small-alloca.ll
111

As far as I can tell, this transform can only ever be profitable if we actually do end up promoting the alloca. In your tests here this is not going to happen because the pointer escapes. In this case, we just end up replacing a simple load with a complex large load and extract pattern. That's generally going to be a pretty significant regression, no?

So I think promotion has to be a pre-condition for the transform, in which case this might potentially fit into SROA? Basically allow a whole-alloca promotion by rewriting the dynamic offset access in terms of operations like these. Not sure how well that would work in practice, and of course that leaves the cost modelling question.

FWIW, I've discovered today that GVN does a similar optimization (but without the freeze..).
See here (scroll to the bottom): https://web.ist.utl.pt/nuno.lopes/alive2/index.php?hash=aed14c64378404c9&test=Transforms%2FPhaseOrdering%2FX86%2Fvec-load-combine.ll

FWIW, I've discovered today that GVN does a similar optimization (but without the freeze..).
See here (scroll to the bottom): https://web.ist.utl.pt/nuno.lopes/alive2/index.php?hash=aed14c64378404c9&test=Transforms%2FPhaseOrdering%2FX86%2Fvec-load-combine.ll

That seems to be with constant indexes, though?

FWIW, I've discovered today that GVN does a similar optimization (but without the freeze..).
See here (scroll to the bottom): https://web.ist.utl.pt/nuno.lopes/alive2/index.php?hash=aed14c64378404c9&test=Transforms%2FPhaseOrdering%2FX86%2Fvec-load-combine.ll

That seems to be with constant indexes, though?

True.
So it could use a simple extractelement rather than bit masking.

Thank you for looking into it!

FWIW, I've discovered today that GVN does a similar optimization (but without the freeze..).
See here (scroll to the bottom): https://web.ist.utl.pt/nuno.lopes/alive2/index.php?hash=aed14c64378404c9&test=Transforms%2FPhaseOrdering%2FX86%2Fvec-load-combine.ll

That seems to be with constant indexes, though?

True.
So it could use a simple extractelement rather than bit masking.

Define "could"? Define "simple"?
I've looked at alternative lowerings (shufflevector or chain of extractelement's),
and they all result in worse codegen. We can not use a single `extractelement,
because the byte offset may not be a multiple of the element size.
The shift is the optimal lowering here, any alternative chosen lowering
would need to be canonicalized into it, and and which point why bother?

(Yes, I will look into doing this in SROA.)

Thank you for looking into it!

FWIW, I've discovered today that GVN does a similar optimization (but without the freeze..).
See here (scroll to the bottom): https://web.ist.utl.pt/nuno.lopes/alive2/index.php?hash=aed14c64378404c9&test=Transforms%2FPhaseOrdering%2FX86%2Fvec-load-combine.ll

That seems to be with constant indexes, though?

True.
So it could use a simple extractelement rather than bit masking.

Define "could"? Define "simple"?
I've looked at alternative lowerings (shufflevector or chain of extractelement's),
and they all result in worse codegen. We can not use a single `extractelement,
because the byte offset may not be a multiple of the element size.
The shift is the optimal lowering here, any alternative chosen lowering
would need to be canonicalized into it, and and which point why bother?

I meant for the GVN case, with constant indexes.
Your case is annoying, as neither shufflevector nor extractelement allows for easy vector extraction with dynamic indexes.

Thank you for looking into it!

FWIW, I've discovered today that GVN does a similar optimization (but without the freeze..).
See here (scroll to the bottom): https://web.ist.utl.pt/nuno.lopes/alive2/index.php?hash=aed14c64378404c9&test=Transforms%2FPhaseOrdering%2FX86%2Fvec-load-combine.ll

That seems to be with constant indexes, though?

True.
So it could use a simple extractelement rather than bit masking.

Define "could"? Define "simple"?
I've looked at alternative lowerings (shufflevector or chain of extractelement's),
and they all result in worse codegen. We can not use a single `extractelement,
because the byte offset may not be a multiple of the element size.
The shift is the optimal lowering here, any alternative chosen lowering
would need to be canonicalized into it, and and which point why bother?

I meant for the GVN case, with constant indexes.

D'oh! Sorry!

Your case is annoying, as neither shufflevector nor extractelement allows for easy vector extraction with dynamic indexes.