[DAGCombiner] Match load by bytes idiom and fold it into a single load

Authored by apilipenko on Oct 31 2016, 7:39 AM.

Details

Summary

I don't know who would be a good reviewer in this area, so I added some people who have touched DAGCombiner recently. If you know somebody who is a better reviewer for this change, please let me know.

Match a pattern where a wide scalar value is loaded by several narrow loads and combined by shifts and ors. Fold it into a single load, or a load and a bswap, if the target supports it.

Assuming little endian target:

```
i8 *a = ...
i32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)
=>
i32 val = *((i32)a)
```

```
i8 *a = ...
i32 val = (a[0] << 24) | (a[1] << 16) | (a[2] << 8) | a[3]
=>
i32 val = BSWAP(*((i32)a))
```
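In C++ source terms, the two patterns correspond to functions like the following (a standalone sketch, not code from the patch; the function names are invented):

```cpp
#include <cstdint>

// Little-endian byte-by-byte read; on a little-endian target this whole
// expression can be folded into a single 32-bit load.
uint32_t load_le32(const uint8_t *a) {
  return (uint32_t)a[0] | ((uint32_t)a[1] << 8) |
         ((uint32_t)a[2] << 16) | ((uint32_t)a[3] << 24);
}

// Big-endian byte-by-byte read; on a little-endian target this folds into
// a single 32-bit load followed by a BSWAP.
uint32_t load_be32(const uint8_t *a) {
  return ((uint32_t)a[0] << 24) | ((uint32_t)a[1] << 16) |
         ((uint32_t)a[2] << 8) | (uint32_t)a[3];
}
```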

This optimization was discussed on llvm-dev some time ago in the "Load combine pass" thread. We came to the conclusion that we want to do this transformation late in the pipeline because, in the presence of atomic loads, load widening is an irreversible transformation and it might hinder other optimizations.

Eventually we'd like to support folding patterns like this where the offset has a variable and a constant part:

`i32 val = a[i] | (a[i + 1] << 8) | (a[i + 2] << 16) | (a[i + 3] << 24)`
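In C++ terms the variable-plus-constant form would look like this (an illustrative sketch; the patch does not yet handle it):

```cpp
#include <cstddef>
#include <cstdint>

// Same little-endian byte-by-byte combine, but at a runtime index i.
// The offsets share the variable part i and differ by constants 0..3.
uint32_t load_le32_at(const uint8_t *a, size_t i) {
  return (uint32_t)a[i] | ((uint32_t)a[i + 1] << 8) |
         ((uint32_t)a[i + 2] << 16) | ((uint32_t)a[i + 3] << 24);
}
```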

Matching the pattern above is easier at SelectionDAG level since address reassociation has already happened and the fact that the loads are adjacent is clear. Understanding that these loads are adjacent at IR level would have involved looking through geps/zexts/adds while looking at the addresses.

The general scheme is to match OR expressions by recursively calculating the origin of the individual bits which constitute the resulting OR value. If all the OR bits come from memory, verify that they are adjacent and match the little- or big-endian encoding of a wider value. If so, and if the load of the wider type (and the bswap, if needed) is allowed by the target, generate the load and, if needed, the bswap.
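A highly simplified model of this scheme, at byte granularity as in the final version of the patch (node types and helper names here are invented for illustration; the real code works on SDNodes):

```cpp
#include <optional>
#include <vector>

// Expression nodes: OR, SHL, ZEXT, or a one-byte LOAD at a known offset.
// For every byte of the result we compute its provider: the memory offset
// it comes from, or nullopt if the byte is known to be zero.
struct Node {
  enum Kind { Or, Shl, Zext, Load1 } kind;
  const Node *lhs = nullptr, *rhs = nullptr;
  unsigned shiftBytes = 0; // for Shl: shift amount divided by 8
  unsigned offset = 0;     // for Load1: byte offset from the base pointer
};

using Providers = std::vector<std::optional<unsigned>>;

static std::optional<Providers> collect(const Node *n, unsigned width) {
  switch (n->kind) {
  case Node::Load1: {
    Providers p(width, std::nullopt);
    p[0] = n->offset; // loaded byte lands in byte 0; upper bytes are zero
    return p;
  }
  case Node::Zext: // zero extension doesn't move any bytes
    return collect(n->lhs, width);
  case Node::Shl: { // shift moves the providers up by shiftBytes
    auto inner = collect(n->lhs, width);
    if (!inner)
      return std::nullopt;
    Providers p(width, std::nullopt);
    for (unsigned i = n->shiftBytes; i < width; ++i)
      p[i] = (*inner)[i - n->shiftBytes];
    return p;
  }
  case Node::Or: { // OR merges providers; overlapping bytes don't match
    auto l = collect(n->lhs, width), r = collect(n->rhs, width);
    if (!l || !r)
      return std::nullopt;
    Providers p(width, std::nullopt);
    for (unsigned i = 0; i < width; ++i) {
      if ((*l)[i] && (*r)[i])
        return std::nullopt;
      p[i] = (*l)[i] ? (*l)[i] : (*r)[i];
    }
    return p;
  }
  }
  return std::nullopt;
}

// The providers describe a plain little-endian load iff byte i of the
// result comes from memory offset base + i for every i.
static bool isLittleEndianLoad(const Providers &p) {
  if (p.empty() || !p[0])
    return false;
  for (unsigned i = 0; i < p.size(); ++i)
    if (!p[i] || *p[i] != *p[0] + i)
      return false;
  return true;
}

// Build a[0] | (zext(a[1]) << 8) | (zext(a[2]) << 16) | (zext(a[3]) << 24)
// and check that it is recognized as a single little-endian i32 load.
static bool demoLittleEndianPattern() {
  Node ld[4], zx[4], sh[4];
  for (unsigned i = 0; i < 4; ++i) {
    ld[i].kind = Node::Load1; ld[i].offset = i;
    zx[i].kind = Node::Zext;  zx[i].lhs = &ld[i];
    sh[i].kind = Node::Shl;   sh[i].lhs = &zx[i]; sh[i].shiftBytes = i;
  }
  Node o1{Node::Or, &sh[0], &sh[1]};
  Node o2{Node::Or, &o1, &sh[2]};
  Node o3{Node::Or, &o2, &sh[3]};
  auto p = collect(&o3, 4);
  return p && isLittleEndianLoad(*p);
}
```

A big-endian check would be the mirror image (byte i comes from offset base + width - 1 - i), which is what triggers the load-plus-BSWAP form.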

Diff Detail

Repository
rL LLVM

Event Timeline

apilipenko updated this revision to Diff 76397. Oct 31 2016, 7:39 AM
apilipenko updated this revision to Diff 76402.
apilipenko retitled this revision from to [DAGCombiner] Match load by bytes idiom and fold it into a single load.
apilipenko updated this object.

I found that this rule doesn't interact with the worklist mechanism in the right way. When a node is changed, all its direct users are added to the worklist so they can be visited and updated after the definition they are using has changed. The rule I'm adding matches complex patterns rooted at OR nodes. When a part of the pattern is updated (e.g. one of the loads), I need to revisit the OR node, but it's not necessarily a direct user of the changed node. For example, I'd like the address of this load to be reassociated so it's easier to analyze:

```
t25: i32 = add t4, Constant:i32<2>
t26: i64 = sign_extend t25
t27: i64 = add t2, t26
t28: i8,ch = load<LD1[%tmp9]> t0, t27, undef:i64
t29: i32 = zero_extend t28
t32: i32 = shl t29, Constant:i8<8>
t33: i32 = or t23, t32
```

But I have no way to trigger the OR pattern matching after this combine happens, because the OR is not a direct user of the address.

What is the best way to work around this limitation?

RKSimon edited edge metadata. Nov 8 2016, 8:29 AM

Before deep diving - I don't want to start making feature-creep demands here, but I wonder if this should try to handle more general cases than just BSWAP - ROTL/ROTR for instance.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4344 ↗(On Diff #76402)

I don't see any reason why these types are not inside DAGCombiner::MatchLoadCombine

4385 ↗(On Diff #76402)

Introducing this type seems to be almost useless - why not just use SmallVector/ArrayRef types directly?

4391 ↗(On Diff #76402)

Why not move this into MatchLoadCombine as a lambda?

2 ↗(On Diff #76402)

Please test with triples not archs and regenerate the test file using utils/update_llc_test_checks.py

443 ↗(On Diff #76402)

Please add i64 test cases - to see what happens on 32-bit targets especially.

hfinkel edited edge metadata. Nov 8 2016, 4:19 PM


apilipenko updated this revision to Diff 77463. Nov 10 2016, 4:10 AM
apilipenko marked 3 inline comments as done.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4344 ↗(On Diff #76402)

I guess it's a matter of style. We can move the BitProvider type and the collectBitProviders helper into MatchLoadCombine, but I doubt that it will improve readability.

I put them into an anonymous namespace so as not to pollute the enclosing namespace, but if you think they should live in MatchLoadCombine I don't mind.

4391 ↗(On Diff #76402)

Same as above.

As for other patterns: in the code which motivated this change we see reads of different endianness. I don't know how common other patterns will be, but in general it should not be difficult to extend the logic.

apilipenko updated this revision to Diff 77465. Nov 10 2016, 4:27 AM

Use anonymous namespace instead of static function.


I don't see how they can help me. What I need is to trigger combine for

`t33: i32 = or t23, t32`

When any of the nodes in this use chain is combined, e.g. when

`t26: i64 = sign_extend t25`

is combined.


Ah, indeed. You need, in essence, to trigger another whole iteration of the combiner. You can obviously search down the chain and add things to the worklist, but you want to avoid a lot of rarely-profitable work.

Can you make your code match the pattern before the sext combine?

@hfinkel, this problem is more hypothetical than practical. For the cases we are interested in, the BaseIndexOffset logic is smart enough to look through non-combined addresses, although triggering the load combine again might enable handling of more complex patterns.

As a possible fix we can add some logic in visitLoad which checks if the load can be a part of a load combine pattern. It should be fairly easy, since all the nodes in the pattern must have only one use and this use must be one of a few known node kinds (or, shl, zext). If the load can be a part of the pattern, visitLoad should add the potential root of the pattern (the "or" node) to the worklist. But this is definitely a separate improvement for this optimization.

213 ↗(On Diff #77465)

What is preventing the 32-bit target doing anything here?

299 ↗(On Diff #77465)

What is preventing the 32-bit target doing anything here?

213 ↗(On Diff #77465)

This patch can only handle situations where all the bits of the value are available and can be loaded in a single load. In this pattern the individual bytes are extended to i64 and all the computations are done on i64 values, so we have no chance to combine the i32 halves of the pattern. I'm going to send a follow-up patch to add support for partially available loads once this patch is in.

299 ↗(On Diff #77465)

The same as above.

@hfinkel, this problem is more hypothetical than practical. For the cases we are interested in, the BaseIndexOffset logic is smart enough to look through non-combined addresses, although triggering the load combine again might enable handling of more complex patterns.

As a possible fix we can add some logic in visitLoad which checks if the load can be a part of a load combine pattern. It should be fairly easy, since all the nodes in the pattern must have only one use and this use must be one of a few known node kinds (or, shl, zext). If the load can be a part of the pattern, visitLoad should add the potential root of the pattern (the "or" node) to the worklist. But this is definitely a separate improvement for this optimization.

Okay; makes sense to me.

filcab added a subscriber: filcab. Nov 22 2016, 10:18 AM

This is low-level enough that it's probably worth it to just do it byte by byte instead of bit by bit. At least for now, no?

Your pattern matching on the non-x86 cases needs to be improved. At the very least, we need to match:
a bswap if needed
a "ret-ish" instruction

Or:
no bswap/rev
a shift
an or

Ideally, you'd use the same kinds of checks as the x86 versions.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4434 ↗(On Diff #77465)

Why copy the whole vector only to replace parts of it?

4448 ↗(On Diff #77465)

Maybe use std::fill_n, std::copy?
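For reference, the suggestion is to replace manual element-by-element loops with the standard algorithms; a standalone illustration (the function name is invented, not from the patch):

```cpp
#include <algorithm>
#include <vector>

// Zero the first four elements and copy another range into the next four,
// using std::fill_n and std::copy instead of hand-written loops.
std::vector<int> fillAndCopyDemo() {
  std::vector<int> dst(8, -1);
  const std::vector<int> src = {1, 2, 3, 4};
  std::fill_n(dst.begin(), 4, 0);                      // dst[0..3] = 0
  std::copy(src.begin(), src.end(), dst.begin() + 4);  // dst[4..7] = src
  return dst;
}
```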

4492 ↗(On Diff #77465)

Can you make the string more explicit? Something like "Can only match load combining against OR nodes" or something?

4591 ↗(On Diff #77465)

Should we check this before we do all the work with the bitoffsets?
Might be hard as we probably need address space + alignment info. But we can at least bubble this up to above the little/big endian decision (unsure if that's very profitable, though).

@filcab, the idea was mentioned above that we might want to match patterns other than bswap, for example rotate. Given this, I'd prefer to keep bit-by-bit matching.

As for improving the matching, do you mean supporting the listed instructions as part of the pattern? I do plan to add bswap and different extension node support in a follow-up patch. I'd like to keep the initial patch simple, though.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4591 ↗(On Diff #77465)

I don't think it will make a difference, since LE/BE matching is not that expensive. On the other hand, this placement seems a bit more logical: we ask about the possibility of a load from a particular location once we know that the code in question is actually doing this load.

apilipenko updated this revision to Diff 79561. Nov 29 2016, 6:35 AM

@filcab, the idea was mentioned above that we might want to match patterns other than bswap, for example rotate. Given this, I'd prefer to keep bit-by-bit matching.

Fair enough. Do you have plans to do that "soon", or is it just a "in the future, we'll probably have X"?

As for improving the matching, do you mean supporting the listed instructions as part of the pattern? I do plan to add bswap and different extension node support in a follow-up patch. I'd like to keep the initial patch simple, though.

I'd prefer to be sure we're testing the thing we think we're testing. At the very least, we'd want to make sure there are no or instructions in most of those cases. I'd even go with "they should just load, maybe bswap, and exit the function", so most of those ARM tests would have two or three check lines (at least). I know that currently we're emitting ldrb to load each byte, but I prefer tighter checks, since we want them to fail as soon as possible.

btw, do you have numbers on how many times we end up triggering this on a specific code-base?

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4432 ↗(On Diff #79561)

Having both Res and Result seems confusing: "Which is the actual result?"
Can you change this name to something like LHSBits/LHSProviders or something?

4444 ↗(On Diff #79561)

Res vs Result again. If Res means something other than "result", please write the full word; it might become sufficiently less ambiguous.
Otherwise, something like Original, Narrow, Value (not ideal), or something else should be better than Res for code reading.

apilipenko updated this revision to Diff 79724. Nov 30 2016, 4:32 AM
apilipenko marked 5 inline comments as done.

Address review comments: rename locals in collectBitProviders, tighten CHECK patterns in non-x86 tests.

We see load-by-bytes (with optional bswap) patterns in Java code originating from the standard library's ByteBuffers. This is the main motivation for the change. I don't have any numbers for C/C++ code, but I expect it's not uncommon.

As for matching patterns like rotate, we don't see them in the code we are interested in, so it's more like "this is where this code can evolve in the future".

Thanks, the non-x86 tests look good.

We see load-by-bytes (with optional bswap) patterns in Java code originating from the standard library's ByteBuffers. This is the main motivation for the change. I don't have any numbers for C/C++ code, but I expect it's not uncommon.

I'm sure we see it. I'm just curious about how frequently this shows up :-)

As for matching patterns like rotate, we don't see them in the code we are interested in, so it's more like "this is where this code can evolve in the future".

If there's no actual planned usage of the "bit by bit" feature, I'd much rather have an implementation that tracks full bytes. Even if we wanted to track bit by bit, having byte tracking would get us most of the way there, since you can only fetch memory byte by byte. If an edge case arose that forced us to track bits, then I'd be ok with it. But as it is, I don't see why we should pay that price.

apilipenko updated this revision to Diff 80276. Dec 5 2016, 8:57 AM

Analyze the pattern byte by byte.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4617 ↗(On Diff #80276)

This all (4346:4617) looks like it needs to be run through clang-format

apilipenko updated this revision to Diff 80288. Dec 5 2016, 10:10 AM

Clang-formatted.

niravd added a subscriber: niravd. Dec 5 2016, 11:02 AM
filcab added a comment. Dec 7 2016, 5:39 AM

Some minor nits: no need to size the vectors at 32; 4 will do.
And use auto if you're declaring a variable and initializing it to the return of a ..._cast operator, since you already mention the type on that line.

Please run the ASan test before pushing the commit.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4389 ↗(On Diff #80288)

We can make this 4 now, no?
Making it 8 would save us a realloc() when compiling for 64-bit archs in a bunch of places, but I'm unsure it's worth it. Something to try later, though.

4397 ↗(On Diff #80288)

4401 ↗(On Diff #80288)

Aren't you binding a reference to the object returned by collectByteProviders? Can you run a test case that triggers this optimization under ASan with ASAN_OPTIONS=detect_stack_use_after_return=true?

4423 ↗(On Diff #80288)

4

4430 ↗(On Diff #80288)

auto ShiftOp = ...

4444 ↗(On Diff #80288)

4

4457 ↗(On Diff #80288)

4

4466 ↗(On Diff #80288)

auto L = ...

4474 ↗(On Diff #80288)

4

apilipenko updated this revision to Diff 80900. Dec 9 2016, 7:49 AM
apilipenko marked 8 inline comments as done.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4401 ↗(On Diff #80288)

collectByteProviders returns the result by value, so it's not a problem. The reference here is misleading and should just be auto. Fixed here and elsewhere.

filcab accepted this revision. Dec 9 2016, 8:03 AM

LGTM

This revision is now accepted and ready to land. Dec 9 2016, 8:03 AM
This revision was automatically updated to reflect the committed changes.
Joshua-401 added a subscriber: Joshua-401. Edited Nov 14 2021, 5:41 PM

Did you consider doing the load combine at IR level? I'm now trying to combine loads at IR level during the InstCombine process, so I'm wondering why you didn't do that.

Matching the pattern above is easier at SelectionDAG level since address reassociation has already happened and the fact that the loads are adjacent is clear. Understanding that these loads are adjacent at IR level would have involved looking through geps/zexts/adds while looking at the addresses.

Herald added a project: Restricted Project. Nov 14 2021, 5:41 PM

There was an IR load combine pass (see D3580 - although never enabled by default AFAIK), but it was removed because it could interfere with other passes like GVN.
https://lists.llvm.org/pipermail/llvm-dev/2016-September/105291.html
https://lists.llvm.org/pipermail/llvm-dev/2019-September/135052.html

But requests for this kind of transform pop up every few months or so, so it is probably worth experimenting to see if there is some limited/altered form of that pass that can get us optimizations without reducing effectiveness of other passes.


FYI, load merging at IR has some really hard cases.

One that I remember is that atomic loads are really problematic: four one-byte atomic loads are different from one four-byte atomic load, even if we have the four-byte load on the target. Once we've committed to the wider load type, we have no way to go back. This greatly limits our optimization and lowering, and can lead to very poor outcomes. We'd vaguely discussed the possibility of an IR extension for "element-wise atomic vectors" to resolve this particular case, but it never progressed.
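The atomicity gap can be sketched in C++ (an illustration, not code from the thread; function names are invented): four relaxed one-byte atomic loads guarantee only per-byte atomicity, so a concurrent four-byte store may be observed "torn" across the reads, whereas a single four-byte atomic load is indivisible. Once an optimizer has merged the narrow loads, the original per-byte semantics cannot be recovered.

```cpp
#include <atomic>
#include <cstdint>

// Each byte is read atomically, but the assembled 32-bit value is not an
// atomic observation: a concurrent writer can be seen half-old, half-new.
uint32_t read_bytewise(const std::atomic<uint8_t> *a) {
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i)
    v |= uint32_t(a[i].load(std::memory_order_relaxed)) << (8 * i);
  return v;
}

// A single four-byte atomic load: the whole value is read indivisibly.
uint32_t read_wide(const std::atomic<uint32_t> *w) {
  return w->load(std::memory_order_relaxed);
}
```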

I believe there were a couple of other really hard examples too. I just don't remember those. :)