This is an archive of the discontinued LLVM Phabricator instance.

[AggressiveInstCombine] Load merge the reverse load pattern of consecutive loads.
ClosedPublic

Authored by bipmis on Oct 4 2022, 3:28 AM.

Details

Summary

This patch extends the load merge/widen in AggressiveInstCombine() defined in
https://reviews.llvm.org/D127392

to handle reverse load patterns.

Diff Detail

Event Timeline

bipmis requested review of this revision.Oct 4 2022, 3:28 AM
bipmis created this revision.

Nice addition. I find the recursion a bit difficult to reason through, so forgive my questions below.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
713

Offset2.slt(Offset1) will prevent the conversion to i64.

725

What happens if the load sizes are not the same? That sounds like it was protecting against a number of things.

743

This looks like it just needs a Start and an End that are LI1/LI2 depending on which comesBefore. That could hopefully avoid the duplication.

spatel added inline comments.Oct 9 2022, 3:10 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
725

Related question: can we make this one-line change independently of the rest of the patch (either before or after)? That would reduce risk that we uncover some difficult-to-debug corner case.

bipmis marked 2 inline comments as done.Oct 12 2022, 6:10 AM
bipmis added inline comments.
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
725

This came from the review comments on the initial implementation, to enable mixed-size merges. Some of the scenarios added in the tests seem to work fine, and we still make sure that the entire chain is reduced to a single load.
I am reverting this change for now. If needed, it can go in independently as a separate patch.

725

That is a good suggestion. I think we can do this last bit towards the end if needed.

bipmis updated this revision to Diff 467114.Oct 12 2022, 6:12 AM

Revert mixed sized load merge and handle review comments.

This improves the pattern-matching, but it's still not complete, right? Ie, we should have a PhaseOrdering test with all 24 (4!) combinations of four i8-loads spliced into a 32-bit value, so we know the pattern is matched no matter what order the values are put back together with 'or'. We can probably rely on instcombine to canonicalize half of those patterns, but I'm not sure there's anything else there to reduce the matching space.

define i32 @loadCombine_4consecutive_1234(ptr %p) {
  %p1 = getelementptr i8, ptr %p, i32 1
  %p2 = getelementptr i8, ptr %p, i32 2
  %p3 = getelementptr i8, ptr %p, i32 3
  %l1 = load i8, ptr %p
  %l2 = load i8, ptr %p1
  %l3 = load i8, ptr %p2
  %l4 = load i8, ptr %p3

  %e1 = zext i8 %l1 to i32
  %e2 = zext i8 %l2 to i32
  %e3 = zext i8 %l3 to i32
  %e4 = zext i8 %l4 to i32

  %s2 = shl i32 %e2, 8
  %s3 = shl i32 %e3, 16
  %s4 = shl i32 %e4, 24

  %o1 = or i32 %e1, %s2
  %o2 = or i32 %o1, %s3
  %o3 = or i32 %o2, %s4
  ret i32 %o3
}

define i32 @loadCombine_4consecutive_1243(ptr %p) {
  %p1 = getelementptr i8, ptr %p, i32 1
  %p2 = getelementptr i8, ptr %p, i32 2
  %p3 = getelementptr i8, ptr %p, i32 3
  %l1 = load i8, ptr %p
  %l2 = load i8, ptr %p1
  %l3 = load i8, ptr %p2
  %l4 = load i8, ptr %p3

  %e1 = zext i8 %l1 to i32
  %e2 = zext i8 %l2 to i32
  %e3 = zext i8 %l3 to i32
  %e4 = zext i8 %l4 to i32

  %s2 = shl i32 %e2, 8
  %s3 = shl i32 %e3, 16
  %s4 = shl i32 %e4, 24

  %o1 = or i32 %e1, %s2
  %o2 = or i32 %o1, %s4
  %o3 = or i32 %o2, %s3
  ret i32 %o3
}

define i32 @loadCombine_4consecutive_1324(ptr %p) {
  %p1 = getelementptr i8, ptr %p, i32 1
  %p2 = getelementptr i8, ptr %p, i32 2
  %p3 = getelementptr i8, ptr %p, i32 3
  %l1 = load i8, ptr %p
  %l2 = load i8, ptr %p1
  %l3 = load i8, ptr %p2
  %l4 = load i8, ptr %p3

  %e1 = zext i8 %l1 to i32
  %e2 = zext i8 %l2 to i32
  %e3 = zext i8 %l3 to i32
  %e4 = zext i8 %l4 to i32

  %s2 = shl i32 %e2, 8
  %s3 = shl i32 %e3, 16
  %s4 = shl i32 %e4, 24

  %o1 = or i32 %e1, %s3
  %o2 = or i32 %o1, %s2
  %o3 = or i32 %o2, %s4
  ret i32 %o3
}

define i32 @loadCombine_4consecutive_1342(ptr %p) {

...

This improves the pattern-matching, but it's still not complete, right? Ie, we should have a PhaseOrdering test with all 24 (4!) combinations of four i8-loads spliced into a 32-bit value, so we know the pattern is matched no matter what order the values are put back together with 'or'. We can probably rely on instcombine to canonicalize half of those patterns, but I'm not sure there's anything else there to reduce the matching space.

Right, but won't these cover the most commonly seen patterns in real applications, if not all of the patterns that generate a wider load?
InstCombine could possibly canonicalize the or-chain of loads into ascending/descending order of load indexes. Since it runs multiple times, that should get us the pattern AggressiveInstCombine expects. However, I am not sure that is the right thing to do.

This improves the pattern-matching, but it's still not complete, right? Ie, we should have a PhaseOrdering test with all 24 (4!) combinations of four i8-loads spliced into a 32-bit value, so we know the pattern is matched no matter what order the values are put back together with 'or'. We can probably rely on instcombine to canonicalize half of those patterns, but I'm not sure there's anything else there to reduce the matching space.

Right, but won't these cover the most commonly seen patterns in real applications, if not all of the patterns that generate a wider load?

In dealing with the related case of trying to match bswap patterns, we've found that eventually we will see every possible pattern in source code somewhere.
I don't disagree that the in-order/reverse are the most likely (and I won't hold up improvements from this patch), but I'd like to have tests in place that acknowledge that we are aware of the limitations of the current implementation. For users, having inconsistent optimization can be almost as frustrating as having no optimization.

InstCombine could possibly canonicalize the or-chain of loads into ascending/descending order of load indexes. Since it runs multiple times, that should get us the pattern AggressiveInstCombine expects. However, I am not sure that is the right thing to do.

I expect that we'll end up using cooperation between "reassociate", "instcombine", and "aggressive-instcombine" to get the optimal loads. That's why I suggest PhaseOrdering tests - it's not clear which pass will be responsible for canonicalizing everything, but we do want to make sure that a complete trip through the optimizer will catch all patterns. There's also an open question of when and how often to invoke aggressive-instcombine (for example, it's only enabled at -O3 currently).

bipmis updated this revision to Diff 468298.Oct 17 2022, 1:09 PM

Add PhaseOrdering tests for loadCombine.

The tests for completeness look as expected, but we want those to live in tests/Transforms/PhaseOrdering and use "RUN: opt -O3 -S < %s" or similar. That way, we can verify that the various passes that are expected to alter this IR are working cooperatively (and some seemingly unrelated change in another pass won't break the optimization that we want to enable with this patch).

bipmis updated this revision to Diff 468508.Oct 18 2022, 6:04 AM

Move loadCombine.ll, which tests combinations of four i8 loads spliced into a 32-bit value, to Transforms/PhaseOrdering.

spatel added inline comments.Oct 18 2022, 6:25 AM
llvm/test/Transforms/PhaseOrdering/loadcombine.ll
2

I forgot to mention if the test uses a triple, then it needs to go in the X86 directory under this dir. Otherwise, you'll get test bot failures.

Please pre-commit the baseline test file (no pre-commit review needed to add tests). That way, we'll be sure it is working as expected before the code patch goes in.

bipmis updated this revision to Diff 468541.Oct 18 2022, 7:54 AM
bipmis marked 2 inline comments as done.

Repatch with the loadCombine test committed.

spatel accepted this revision.Oct 18 2022, 9:10 AM

LGTM.
I'm not sure if the transforms are completely reliable, but the PhaseOrdering tests show that we're getting 8 of the 24 patterns. Another 4 are partially folded, so those would presumably improve with the one-line enhancement that was originally in this patch. That leaves 12 that are escaping, but those might be canonicalized enough that we don't need to deal with every possible ordering.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
712

typo: Offest -> Offset

This revision is now accepted and ready to land.Oct 18 2022, 9:10 AM

LGTM.
I'm not sure if the transforms are completely reliable, but the PhaseOrdering tests show that we're getting 8 of the 24 patterns. Another 4 are partially folded, so those would presumably improve with the one-line enhancement that was originally in this patch. That leaves 12 that are escaping, but those might be canonicalized enough that we don't need to deal with every possible ordering.

Thanks.
It looks like Reassociate does convert all the cases into the needed pattern; it just happens late in the pipeline.

bipmis retitled this revision from [AggressiveInstCombine] Load merge the reverse load pattern and mixed load sizes. to [AggressiveInstCombine] Load merge the reverse load pattern of consecutive loads..Oct 19 2022, 3:14 AM
bipmis edited the summary of this revision. (Show Details)

LGTM.
I'm not sure if the transforms are completely reliable, but the PhaseOrdering tests show that we're getting 8 of the 24 patterns. Another 4 are partially folded, so those would presumably improve with the one-line enhancement that was originally in this patch. That leaves 12 that are escaping, but those might be canonicalized enough that we don't need to deal with every possible ordering.

Thanks.
It looks like Reassociate does convert all the cases into the needed pattern; it just happens late in the pipeline.

Thanks for checking. So there are 3 possible follow-ups to make the transform robust:

  1. Re-order the passes so AIC is after Reassociate
  2. Add a late run of AIC.
  3. Enhance the pattern matching in AIC to capture the alternate patterns.

Heads up! Some of our tests are breaking due to this patch. We are working on a reproducer on our end and will post it as soon as we have it.

Here is the reproducer - https://godbolt.org/z/7bcjzfYK4

The following program returns 33 on trunk and 0 on clang 14:

#include <utility>

template <typename ValueType>
class Cell {
 public:
  Cell() = default;

  void Set(ValueType value) {
    value_ = value;
    has_data_ = true;
  }
  std::pair<ValueType, bool> GetAndMarkRead() {
    bool fresh = has_data_;
    has_data_ = false;
    return std::make_pair(value_, fresh);
  }
  bool GetAndMarkRead2() {
    bool fresh = has_data_;
    has_data_ = false;
    return fresh;
  }

  ValueType value_;
  bool has_data_ = false;
};

bool foo() {
  Cell<bool> cell;
  cell.Set(true);
  return cell.GetAndMarkRead().second;
}

int main() {
    if (foo()) return 0;
    return 33;
}

GetAndMarkRead returns false instead of true. Returning std::pair matters.

Here is the reproducer - https://godbolt.org/z/7bcjzfYK4

Thanks for the test case. The issue is likely that the new merged load needs to be inserted at the right insertion point. I have created a patch to handle this. Can you please test the patch and confirm that it works correctly?

asmok-g added a subscriber: asmok-g.Nov 3 2022, 2:28 AM

Here is the reproducer - https://godbolt.org/z/7bcjzfYK4

Thanks for the test case. The issue is likely that the new merged load needs to be inserted at the right insertion point. I have created a patch to handle this. Can you please test the patch and confirm that it works correctly?

Hello,

This is blocking our releases in Google. I see the fix patch hasn't been updated since Tuesday. Can you please give it higher priority?

Hello, as mentioned above there is a fix in https://reviews.llvm.org/D137201. We found another case that was incorrect though, so we're looking into fixing that too. If it does take too long, perhaps a revert is in order in the meantime.

Did you get a chance to verify the fix worked for your case? Thanks.

bipmis added a comment.Nov 3 2022, 5:17 AM

Hello, as mentioned above there is a fix in https://reviews.llvm.org/D137201. We found another case that was incorrect though, so we're looking into fixing that too. If it does take too long, perhaps a revert is in order in the meantime.

Did you get a chance to verify the fix worked for your case? Thanks.

I think as a quick fix we can switch off load merging when we encounter stores, and eventually come up with a proper fix with additional test cases. I will put up a patch for this so that it does not block us.

Did you get a chance to verify the fix worked for your case? Thanks.

We unblocked ourselves for now. Will test the fix once it lands.

We unblocked ourselves for now. Will test the fix once it lands.

OK, sounds good. @bipmis was on training this week, but did manage to put https://reviews.llvm.org/D137333 together, which as far as we understand should fix all the known issues. Hopefully that will mean no more issues on your end, and we can work on a more substantial fix as we have more time. Please let us know if more issues do come up. Thanks.

Did you get a chance to verify the fix worked for your case? Thanks.

We unblocked ourselves for now. Will test the fix once it lands.

The test passed with the fix. Thanks!