This is an archive of the discontinued LLVM Phabricator instance.

[SROA] Support promotion in presence of variably-indexed loads
Needs ReviewPublic

Authored by lebedev.ri on Dec 21 2022, 12:41 PM.

Details

Summary

This originally started out as an InstCombine patch,
but it does not really fit there: https://reviews.llvm.org/D138766

Here, when we encounter a load with a variable index,
instead of immediately bailing, we bail only if the profitability check says it's not worth it;
otherwise we record the load, and rewrite it iff we don't bail for other reasons.
The rewrite is straightforward: just load the whole alloca and extract the required bit sequence.
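
For intuition, here is a rough scalar C++ analogue of that rewrite (an illustrative sketch only: the function and parameter names are made up here, and it assumes a little-endian target, an 8-byte alloca, and an offset of at most 4 so the shift amount stays below 64):

#include <cstdint>
#include <cstring>

// Instead of loading 4 bytes from buf at a variable offset, load all of buf
// as one wide integer and extract the requested bits with a variable shift.
uint32_t load_at_variable_offset(const unsigned char buf[8], unsigned offset) {
    uint64_t whole;
    std::memcpy(&whole, buf, sizeof(whole)); // load the whole alloca
    whole >>= 8 * uint64_t(offset);          // shift the wanted bytes down (little-endian)
    return uint32_t(whole);                  // truncate to the loaded type
}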

https://discourse.llvm.org/t/where-could-we-perform-sudo-sroa-with-non-constant-offsets/66645

First, consider: (example 0)

#include <cstdlib>
#include <cstring>
#include <algorithm>

void sink(char*);

constexpr int size = 4;

void entry(char* input, int length, int position) {
    int max_bytes = length - position;
    int bytes = std::min(size, max_bytes);
    char res[size] = {};
    memcpy(res, input + position, bytes);
    sink(res);
}

This function has to perform a dynamically-sized, but bounded, memcpy,
which may or may not be good, depending on the particular use case: https://godbolt.org/z/3cd6bvEq5

Now, let's look at another variant (not strictly identical): (example 1)

#include <cstdlib>
#include <cstring>
#include <algorithm>

void sink(char*);

constexpr int size = 4;

void entry(char* input, int length, int position) {
    int last_pos = length - size;
    int clamped_pos = std::min(position, last_pos);
    char tmp[2 * size] = {};
    memcpy(tmp, input + clamped_pos, size);
    int num_leading_padding_bytes = std::min(size, position - clamped_pos);
    char res[size] = {};
    memcpy(res, tmp + num_leading_padding_bytes, size);
    sink(res);
}

Here, both memory loads are statically-sized.
Under some external preconditions that are not relevant here,
the examples are equivalent.

The problem is that the second memcpy loads from a non-constant offset into tmp,
and SROA does not deal with non-constant offsets, so we end up with tmp
not being promoted into a register: https://godbolt.org/z/ebPrrjaa6

So while this may or may not already be better than the original variant,
it is still not great. This can come up in hot paths, e.g. (example 0) is
https://github.com/darktable-org/rawspeed/blob/6be00ea43b92c876692593436f9edbbf70d4c3d4/src/librawspeed/io/BitStream.h#L145-L173
and I was in the process of improving it into (example 1) but got stuck on performance.

The transformation itself isn't that complicated,
we just don't have a great place for it.
I've added hopefully sufficiently exhaustive test coverage,
and verified it with alive.

Now, huge caveat: indeed, this needs a profitability check.
Profitability reasoning: we expect that for the largest legal int type, we
do have good support for variable-amount shifts. For the type 2x that
width, the legalization will expand the shift into, at worst, 3 legal-sized
shifts + 5 supporting ALU ops. We expect that such an expansion is still
not worse than the original pattern we have matched here.
But for any bit width larger than that, this isn't worth it.
Codegen for true i128 case: https://alive2.llvm.org/ce/z/Tu85qE
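
For illustration, here is roughly what such an expansion looks like in plain C++ for a variable logical right shift of a double-legal-width integer (a sketch of the reasoning above with made-up names, not the exact sequence the legalizer emits; the i128 is modelled as a hi:lo pair of 64-bit halves):

#include <cstdint>

void lshr128(uint64_t hi, uint64_t lo, unsigned amt /* 0..127 */,
             uint64_t &hi_out, uint64_t &lo_out) {
    if (amt == 0) {                                // avoid the undefined 64-bit shift by 64 below
        hi_out = hi;
        lo_out = lo;
    } else if (amt < 64) {
        lo_out = (lo >> amt) | (hi << (64 - amt)); // two legal-sized shifts + an or
        hi_out = hi >> amt;                        // third legal-sized shift
    } else {
        lo_out = hi >> (amt - 64);
        hi_out = 0;
    }
}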

I think this is pretty uncontentious for the largest legal integer,
but unfortunately I'm interested in "load i32 from i64" and "load i64 from i128" :)
Silver lining: in the case I'm looking at, the upper half of the alloca is always zeros,
so after SROA, this becomes: https://alive2.llvm.org/ce/z/FgRHaZ,
and that i128 codegen is rather good now, isn't it?

D140638 ([Codegen][LegalizeIntegerTypes] New legalization strategy for scalar shifts: shift through stack.)
has taught LLVM how to undo this kind of transformation during codegen,
in fact, using effectively the very same profitability heuristic,
so we don't seem to need a TLI hook here.
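
For intuition, that "shift through stack" strategy amounts to something like the following C++ sketch (illustrative only, assuming a little-endian target; shown here for a 64-bit value for brevity, with made-up names):

#include <cstdint>
#include <cstring>

uint64_t lshr_via_stack(uint64_t v, unsigned amt /* 0..63 */) {
    // 16-byte "stack slot": the value in the low 8 bytes, zeros above.
    unsigned char slot[16] = {};
    std::memcpy(slot, &v, sizeof(v));
    // Reloading at a byte offset performs the byte-granular part of the shift...
    uint64_t reloaded;
    std::memcpy(&reloaded, slot + amt / 8, sizeof(reloaded));
    // ...and only a sub-byte shift remains to be done in registers.
    return reloaded >> (amt % 8);
}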

https://llvm-compile-time-tracker.com/compare.php?from=4d255f9e3374ecc5a85ac30ecbe65f3a737dfe35&to=aaaef46a6e3ef5c21178f86a18ce8911113ab026&stat=instructions%3Au

Diff Detail

Event Timeline

lebedev.ri created this revision.Dec 21 2022, 12:41 PM
Herald added a project: Restricted Project.Dec 21 2022, 12:41 PM
lebedev.ri requested review of this revision.Dec 21 2022, 12:41 PM
lebedev.ri added inline comments.Dec 21 2022, 12:45 PM
llvm/lib/Transforms/Scalar/SROA.cpp
1242–1243

We could maintain a cache for these two functions,
but it would only be effective for the current AllocaSlices,
so I'm not sure whether that is something we want.

nikic added a comment.Dec 21 2022, 1:49 PM

FYI sqlite3 from llvm-test-suite fails to build in ReleaseLTO-g configuration with the following error:

/usr/bin/ld: error: LLVM gold plugin: Invalid value reference from metadata

Some more tests.

FYI sqlite3 from llvm-test-suite fails to build in ReleaseLTO-g configuration with the following error:

/usr/bin/ld: error: LLVM gold plugin: Invalid value reference from metadata

Yeah, this is a rough version of the patch; it still probably needs some polishing.

Could ScalarEvolution + SCEVExpander help with computing / materializing the offset?
It is mainly used for loop transformations, but it should be possible to use it for linear code as well.
One caveat: ScalarEvolution does not peek through addrspacecasts (is it safe to ignore them anyway?).

Can't bitcast int <-> ptr, give up on those.

lebedev.ri edited the summary of this revision.Dec 21 2022, 2:47 PM

Could ScalarEvolution + SCEVExpander help with computing / materializing the offset?
It is mainly used for loop transformations, but it should be possible to use it for linear code as well.
One caveat: ScalarEvolution does not peek through addrspacecasts (is it safe to ignore them anyway?).

Oh believe me, I've thought about that one.
For the minuscule amount of functionality we need,
I don't think it's worth all the extra complexity it would bring.

FYI sqlite3 from llvm-test-suite fails to build in ReleaseLTO-g configuration with the following error:

/usr/bin/ld: error: LLVM gold plugin: Invalid value reference from metadata

Yeah, this is a rough version of the patch; it still probably needs some polishing.

All good now: https://llvm-compile-time-tracker.com/compare.php?from=4d255f9e3374ecc5a85ac30ecbe65f3a737dfe35&to=aaaef46a6e3ef5c21178f86a18ce8911113ab026&stat=instructions%3Au

Looks like caching makes things worse: https://llvm-compile-time-tracker.com/compare.php?from=967ba1a86d4f949e7663467b40abce0d97ac7673&to=247be4da9921aaf357db80032fce9a8d1d05414a&stat=instructions:u
Either switching to manual GEP expansion will help somewhat, or those are "expected" second-order effects.

nikic added a comment.Dec 22 2022, 1:36 AM

I think this is moving in the right direction in terms of how the transform should be implemented. However, I have two high-level concerns:

The first one is basically the same as for the "whole alloca to vector promotion": If we perform this transform and then inline and it turns out that the offset is now known, it's likely that we're now going to generate much worse code -- we're likely not going to be able to get rid of the freeze and vector manipulation. I think it is very likely that the "second-order effects" you refer to are codegen regressions rather than improvements (will need to be investigated in either case).

The second is that even if we ignore that case, the profitability of this transform is very unclear to me. https://llvm.godbolt.org/z/W6e7K5W9d has a representative case (load i32 from 8 byte buffer after sroa+instcombine, with two kinds of init because some targets really hate the vector init). I'm not even sure that this is better on x86_64, but there are some targets where it's clearly worse, e.g. riscv32: https://llvm.godbolt.org/z/fzK8sbaEv If we want to do this transform, we'll probably not be able to avoid TTI-based cost modelling. It's also possible that this transform just isn't profitable in isolation (i.e. without the larger context of your examples).

Regarding the rewrite itself, are you possibly looking for the EmitGEPOffset() helper? It should be possible to sum the results of EmitGEPOffset() on all the GEPs in the chain to obtain the desired offset.
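
For illustration, a rough sketch of that suggestion (not the patch's actual code; it assumes an IRBuilder Builder, the module's DataLayout DL, and the load's pointer operand Ptr are in scope, and uses the EmitGEPOffset() helper from llvm/Analysis/Utils/Local.h):

#include "llvm/Analysis/Utils/Local.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Operator.h"

using namespace llvm;

// Walk the GEP chain from the load's pointer operand back towards its base
// (e.g. the alloca), emitting and summing the byte offset of each GEP.
// Returns null if there is no GEP at all (i.e. the offset is zero).
static Value *emitTotalOffset(IRBuilder<> &Builder, const DataLayout &DL,
                              Value *Ptr) {
  Value *Offset = nullptr;
  while (auto *GEP = dyn_cast<GEPOperator>(Ptr)) {
    Value *Part = EmitGEPOffset(&Builder, DL, GEP);
    Offset = Offset ? Builder.CreateAdd(Offset, Part) : Part;
    Ptr = GEP->getPointerOperand();
  }
  // The caller still needs to check that Ptr is the expected alloca.
  return Offset;
}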

@craig.topper @reames Hi! Can you comment on the profitability heuristic here? Does it appear that we need a TTI hook?

Directly expanding GEPs into math seems nicer, and seems to help a tiny bit, but not sufficiently.
Also, don't cache non-GEPs.
https://llvm-compile-time-tracker.com/compare.php?from=19e55791c4fa484401d0b4a4e5d66dd313251ade&to=9b20759171567953b78b9a2c5b2b4257a4762f00&stat=instructions:u

@craig.topper @reames Hi! Can you comment on the profitability heuristic here? Does it appear that we need a TTI hook?

From a quick glance I'm seeing a couple issues. The first is that the basic RISC-V base ISA doesn't support vectors and SelectionDAG is fully scalarizing the vector load/store.

The other issue is that the test contains a variable shift of an i64 which isn't directly supported on a 32-bit target. RISC-V doesn't have instructions like X86's SHLD/SHRD so we have to do it in multiple instructions.

@craig.topper thank you!

From a quick glance I'm seeing a couple issues. The first is that the basic RISC-V base ISA doesn't support vectors and SelectionDAG is fully scalarizing the vector load/store.

Right, I'm looking into that right now. We should be getting rid of vector loads in these cases in SDAG.

The other issue is that the test contains a variable shift of an i64 which isn't directly supported on a 32-bit target. RISC-V doesn't have instructions like X86's SHLD/SHRD so we have to do it in multiple instructions.

Err, this is actually expected; see the profitability reasoning in the description/diff.

@craig.topper I'm somewhat confused by that RISC-V codegen.
The fact that it scalarizes *all* vector loads is a pretty glaring bug,
which somewhat undermines the cost-modelling question, since there are
wider integer loads that could be used instead. I'm not very familiar
with that target, so I'm not sure whether I should look into it.

And after that, we are back to the questions:

  • is there support for shifts in the base ISA?
  • what is the widest integer type that can be shifted?

From a quick glance I'm seeing a couple issues. The first is that the basic RISC-V base ISA doesn't support vectors and SelectionDAG is fully scalarizing the vector load/store.

Right, I'm looking into that right now. We should be getting rid of vector loads in these cases in SDAG.

At least for X86, I've mostly fixed that: except for the 16-byte alloca case,
there are no more vectors there: https://godbolt.org/z/TE78e44hq
I know how to deal with the 16-byte case too; will do that in a bit.

The other issue is that the test contains a variable shift of an i64 which isn't directly supported on a 32-bit target. RISC-V doesn't have instructions like X86's SHLD/SHRD so we have to do it in multiple instructions.

Err, this is actually expected; see the profitability reasoning in the description/diff.

@craig.topper ^ FYI

@craig.topper I'm somewhat confused by that RISC-V codegen.
The fact that it scalarizes *all* vector loads is a pretty glaring bug,
which somewhat undermines the cost-modelling question, since there are
wider integer loads that could be used instead. I'm not very familiar
with that target, so I'm not sure whether I should look into it.

I haven't looked but I assume that's the type legalizer just doing SplitVector repeatedly and then doing ScalarizeVector. There is no intelligence for loads that are only used by stores.

NOTE: RISC-V also requires all scalar loads to be naturally aligned, which can introduce additional restrictions on how a vector load/store can be scalarized.

And after that, we are back to the questions:

  • is there support for shifts in the base ISA?
  • what is the widest integer type that can be shifted?

Yes, there are shifts. riscv32 has 32-bit shifts, riscv64 has 64-bit shifts. Smaller shifts need to be promoted; wider shifts will be expanded. There is no cmov, so expansion requires a select to be expanded to control flow.

lebedev.ri added a comment.EditedDec 22 2022, 5:55 PM

(added x86 codegen tests in e7f21d750cc51f5a9610b7f13586b2b6907c6097)

I haven't looked but I assume that's the type legalizer just doing SplitVector repeatedly and then doing ScalarizeVector. There is no intelligence for loads that are only used by stores.

Yes, I had looked before posting that comment, and yes, that is what happens.

NOTE: RISC-V also requires all scalar loads to be naturally aligned, which can introduce additional restrictions on how a vector load/store can be scalarized.

Aha, I had forgotten about that. That does complicate things, to say the least,
but then the pre-SROA codegen would be just as, err, uninspiring:
here is the current codegen diff: https://godbolt.org/z/qzjc7Wxoa

Would you say that, as far as RISC-V is concerned, guarding with a simple profitability check
of "the alloca size must not be more than 2x the largest legal integer size
(and we expect good variable shifts for the largest legal integer size)"
is not sufficient, and that a TTI hook is needed?
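
For concreteness, the kind of check being described might look roughly like this (an illustrative sketch with made-up names, not the patch's actual code; it assumes a fixed-size, non-scalable allocated type, and relies on getLargestLegalIntTypeSizeInBits() returning 0 when the target declares no legal integer types):

#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

static bool allocaIsSmallEnoughToPromote(const AllocaInst &AI,
                                         const DataLayout &DL) {
  uint64_t AllocaBits =
      DL.getTypeSizeInBits(AI.getAllocatedType()).getFixedValue();
  unsigned LargestLegalIntBits = DL.getLargestLegalIntTypeSizeInBits();
  // Allow up to 2x the largest legal integer width; anything wider is
  // expected to legalize into too many shift/ALU ops to be worth it.
  return LargestLegalIntBits != 0 &&
         AllocaBits <= 2 * (uint64_t)LargestLegalIntBits;
}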

lebedev.ri added a comment.EditedDec 22 2022, 6:04 PM

Though even fully aligned loads are still split: https://godbolt.org/z/z5T4cKGYT
This is silly; there is no other way to write a load of bytes.

lebedev.ri edited the summary of this revision.Jan 17 2023, 9:43 AM

D140638 ([Codegen][LegalizeIntegerTypes] New legalization strategy for scalar shifts: shift through stack.)
has taught LLVM how to undo this kind of transformation during codegen,
in fact, using effectively the very same profitability heuristic that I use here,
so we don't seem to need a TLI hook here.

I'm still looking into a few codegen improvements,
but otherwise I'm not tracking any other issues with this,
so once I'm done with codegen, this will proceed.
(SROA still needs an "are we done with inlining?" flag, yes.)