
[AggressiveInstCombine] Combine consecutive loads which are being merged to form a wider load.
Needs Review · Public

Authored by bipmis on Jun 9 2022, 3:49 AM.

Details

Summary

This patch simplifies patterns such as the following:

1. (zExt(L1) << shift1) | (zExt(L2) << shift2) -> zExt(L3) << shift1
2. (? | (zExt(L1) << shift1)) | (zExt(L2) << shift2) -> ? | (zExt(L3) << shift1)

The pattern indicates that the loads are being merged into a wider value and that the only use of each load is in forming that value. In this case, for non-atomic, non-volatile loads, we reduce the pattern to a single combined load, which improves the cost modeling for inlining, unrolling, vectorization, etc.
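For illustration, a minimal little-endian sketch of pattern 1 (the function and value names here are invented for this example):

define i16 @load16_le(ptr %p) {
  %p1 = getelementptr i8, ptr %p, i32 1
  %l1 = load i8, ptr %p
  %l2 = load i8, ptr %p1
  %e1 = zext i8 %l1 to i16
  %e2 = zext i8 %l2 to i16
  %s2 = shl i16 %e2, 8
  %o1 = or i16 %e1, %s2
  ret i16 %o1
}

On a little-endian target this reduces to a single wider load:

define i16 @load16_le(ptr %p) {
  %l3 = load i16, ptr %p, align 1
  ret i16 %l3
}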


Event Timeline

bipmis created this revision.Jun 9 2022, 3:49 AM
bipmis requested review of this revision.Jun 9 2022, 3:49 AM

I'd like to see some form of load combining in IR, but we need to be very careful about how that is implemented.

Note that LLVM had a dedicated pass for load combining, but it caused problems and was removed:
https://lists.llvm.org/pipermail/llvm-dev/2016-September/105291.html

So any proposal to do it again should be aware of and avoid those problems. Putting it in InstCombine is controversial because it could interfere with other analysis/passes. We recently added a cost-aware transform to AggressiveInstCombine, so that might be a better fit.

I didn't check all of the tests, but at least the simplest cases should already be handled in DAGCombiner. Can you show a larger, motivating example where that backend transform is too late to get to the ideal code?

@spatel Thanks for reviewing it. The common scenario in LLVM where we see this as an issue is the cost computed by the inliner, unroller, and vectorizer, which results in sub-optimal code. One simple example is a dot product, where the vectorizer fails to generate vectorized code:
https://godbolt.org/z/Ee4cbf1PG

A similar situation can be seen with the unroller, as in the example below:
https://godbolt.org/z/dWo8n3Yz8

Thanks - those look like good motivating examples. I recommend moving this patch over to AggressiveInstCombine or recreating a limited form of the old LoadCombine pass. Either way, we need to be careful to make sure that the transform does not combine loads illegally and that it doesn't harm other analysis. You might want to start an RFC thread on discourse for more feedback and tag people who commented on earlier proposals to do this kind of transform.

I am in favour of moving the discussion to discourse.

AggressiveInstCombine and target-dependent InstCombine are two independent concepts for me.

RKSimon added inline comments.Jun 16 2022, 8:26 AM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
17 ↗(On Diff #435486)

clang-format this?

nikic added a subscriber: nikic.Jun 24 2022, 1:58 AM

The main question I see here is whether this needs to be TTI-based or not -- if yes, then it also shouldn't be in InstCombine. I think there are two reasons why we might want TTI:

  • Do we want to create unaligned loads? Creating them is legal, but if the target does not support fast unaligned loads, the backend will break them up again. Should we only perform this optimization if TTI.allowsMisalignedMemoryAccesses reports a fast unaligned access?
  • Do we want to create loads with illegal sizes? For example, if we have a 64-bit target, do we want to combine two i64 loads into an i128 load, even though it will later be broken up again? (At least for the current implementation, where both loads must have the same size, the question of whether to allow i24 loads or similar does not come up. See the sketch below.)
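To make the second question concrete, a sketch (illustrative IR, not taken from the patch): two adjacent i64 loads merged as in pattern 1 would combine into

define i128 @load128(ptr %p) {
  ; i128 is not a legal integer type on a typical 64-bit target, so the
  ; backend would split this wide load back into two i64 loads.
  %l3 = load i128, ptr %p, align 8
  ret i128 %l3
}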
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
3004 ↗(On Diff #435486)

nullptr

3017 ↗(On Diff #435486)

Move dyn_cast into initialization, having it in the if is hard to read.

3029 ↗(On Diff #435486)

GEPOperator

Also directly dyn_cast in the condition.

3035 ↗(On Diff #435486)

You are probably looking for GEP2->accumulateConstantOffset() here -- your current code will treat GEPs with multiple indices incorrectly.

3077 ↗(On Diff #435486)

I don't get this check -- you're checking that there is an available value, but not that this value is the previous load. There might still be a clobber in between.

What you're probably looking for is canInstructionRangeModRef() between the two instructions.

3108 ↗(On Diff #435486)

Unnecessary cast

3114 ↗(On Diff #435486)

Why does this not use the builder to create the instruction, only to insert it?

3119 ↗(On Diff #435486)

Uh, what's the point of ORing something with zero?

RKSimon added inline comments.Jun 24 2022, 3:26 AM
llvm/test/Transforms/InstCombine/or-load.ll
2 ↗(On Diff #435486)

add little and big endian test coverage

The main question I see here is whether this needs to be TTI-based or not -- if yes, then it also shouldn't be in InstCombine. I think there are two reasons why we might want TTI:

  • Do we want to create unaligned loads? Creating them is legal, but if the target does not support fast unaligned loads, the backend will break them up again. Should we only perform this optimization if TTI.allowsMisalignedMemoryAccesses reports a fast unaligned access?
  • Do we want to create loads with illegal sizes? For example, if we have a 64-bit target, do we want to combine two i64 loads into an i128 load, even though it will later be broken up again? (At least for the current implementation, where both loads must have the same size, the question of whether to allow i24 loads or similar does not come up.)

Thanks for the review. I have made most of the other suggested changes and can post a patch with them. It also handles some more test cases.

Right, I think we need to be sure about this particular path. If we want a TTI-based transform, would AggressiveInstCombine be a better choice? We do see some uses of TTI in InstCombine, but is ours a valid use case?
The next patch will be in InstCombine for this reason.
I will also run the patch through extended tests to see whether there are any performance issues and whether the backend splits the wide loads back up correctly.

We do see some uses of TTI in InstCombine

I believe there is a strong consensus to say no to TTI in InstCombine.

bipmis added inline comments.Jun 28 2022, 9:21 AM
llvm/test/Transforms/InstCombine/or-load.ll
2 ↗(On Diff #435486)

In IR, I believe the checks would be the same for LE and BE. The differences should show up in the assembly.
https://godbolt.org/z/xboKx47cc

bipmis updated this revision to Diff 441695.Jul 1 2022, 8:15 AM

Handle some of the review comments.
Added support for additional scenarios such as reverse-order loads and big-endian targets.

Moving the implementation to AggressiveInstCombine is still open for discussion, along with the TTI requirement.

spatel added a comment.Jul 1 2022, 9:04 AM

Handle some of the review comments.
Added support for additional scenarios such as reverse-order loads and big-endian targets.

Moving the implementation to AggressiveInstCombine is still open for discussion, along with the TTI requirement.

IMO, we should move this patch to AggressiveInstCombine with conservative TTI limitations to start.

Now that we've heard from several others, I'd summarize as:

  1. There's good motivation to have load combining in IR as a canonicalization.
  2. But it should be limited with TTI (avoid unaligned loads, illegal operations/types).
  3. Adding TTI to InstCombine is a non-starter; we don't want to have that level of target-dependence in IR canonicalization.
  4. AggressiveInstCombine is a lower-weight canonicalization pass that recently gained access to TTI.
  5. Potential follow-ups could relax the TTI constraints and/or allow bswap formation to handle the endian differences seen in the current tests (we already do that type of transform in the backend); see the sketch below.
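To illustrate point 5 (a sketch only; llvm.bswap is the standard intrinsic): on a big-endian target, the little-endian byte-merge pattern corresponds to a wide load followed by a byte swap, which is the form the backend already produces today:

define i16 @load16_on_be(ptr %p) {
  %w = load i16, ptr %p, align 1
  %r = call i16 @llvm.bswap.i16(i16 %w)
  ret i16 %r
}
declare i16 @llvm.bswap.i16(i16)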
bipmis added inline comments.Jul 1 2022, 9:07 AM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
3035 ↗(On Diff #435486)

Currently we handle GEPs with a single index; with multiple indices, we exit at the check "isa<GetElementPtrInst>(Op2)". I will look at whether this can be improved.

3077 ↗(On Diff #435486)

This checks whether there are stores or other memory accesses between the two loads that we must account for before merging them. The check starts at L2 and walks backwards, checking memory, until L1.
The alias tests loadCombine_4consecutive_with_alias* should give you the idea.

3119 ↗(On Diff #435486)

Yes, this is because visitOr() inserts the returned OR. It may not be needed once this moves to AggressiveInstCombine.

nikic added a comment.Jul 1 2022, 9:27 AM

When switching this to AggressiveInstCombine, I would strongly recommend to start with a much more minimal patch. Handle only a single simple case, without any of the possible variants. We can build on that base later.

When switching this to AggressiveInstCombine, I would strongly recommend to start with a much more minimal patch. Handle only a single simple case, without any of the possible variants. We can build on that base later.

Strongly agree - there's a lot of potential for this to go wrong both in correctness and perf regressions, so we need to build up in steps.
AFAIK, the load combine pass did not have correctness problems when it died, so that source code would be a good reference.

bipmis updated this revision to Diff 443893.Tue, Jul 12, 3:02 AM
bipmis set the repository for this revision to rG LLVM Github Monorepo.

A simple implementation in AggressiveInstCombine that handles the forward consecutive load sequences covered by the tests.
The implementation is limited to a specific consecutive-load pattern that reduces to a combined load only (the one and only use of the individual loads is to form the wider value). This is not a generic LoadCombine, as combining loads with other uses can result in poison propagation.

When switching this to AggressiveInstCombine, I would strongly recommend to start with a much more minimal patch. Handle only a single simple case, without any of the possible variants. We can build on that base later.

Strongly agree - there's a lot of potential for this to go wrong both in correctness and perf regressions, so we need to build up in steps.
AFAIK, the load combine pass did not have correctness problems when it died, so that source code would be a good reference.

The load combine pass is different in that it implements load combining in a more generic way, not based on a specific pattern. If it finds that two loads can be combined, it generates a wider load, and all subsequent uses are derived from that wider load using CreateExtractInteger. The current implementation is more of a pattern match, with offset and shift checks to confirm that the loads are consecutive and that their only use is to form a wider load.
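For contrast, a hypothetical sketch of what the generic approach produces (names invented; CreateExtractInteger emits the equivalent of the lshr/trunc sequence below), which lets the narrow loads have other uses:

define i8 @generic_combine(ptr %p) {
  %wide = load i16, ptr %p, align 1
  %b0 = trunc i16 %wide to i8       ; replaces the first narrow load (LE)
  %hi = lshr i16 %wide, 8
  %b1 = trunc i16 %hi to i8         ; replaces the second narrow load
  %sum = add i8 %b0, %b1            ; an extra use beyond the or/shl merge
  ret i8 %sum
}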

spatel retitled this revision from [InstCombine] Combine consecutive loads which are being merged to form a wider load. to [AggressiveInstCombine] Combine consecutive loads which are being merged to form a wider load..Thu, Jul 14, 7:14 AM

How does this code account for potential memory accesses between the loads that are getting combined?

define i16 @loadCombine_2consecutive_store_between(ptr %p) {
  %p1 = getelementptr i8, ptr %p, i32 1
  %l1 = load i8, ptr %p, align 2
  store i8 42, ptr %p  ; this must happen after a combined load?
  %l2 = load i8, ptr %p1

  %e1 = zext i8 %l1 to i16
  %e2 = zext i8 %l2 to i16
  %s2 = shl i16 %e2, 8
  %o1 = or i16 %e1, %s2
  ret i16 %o1
}
llvm/test/Transforms/AggressiveInstCombine/or-load.ll
5 ↗(On Diff #443893)

We need to have this test duplicated on a target (x86?) where the fold actually happens. Target-specific tests will need to go inside target subdirectories, otherwise we'll break testing bots. We can pre-commit those once we have the right set of tests.

24 ↗(On Diff #443893)

Why do the tests have non-canonical code (shift-by-0)? I don't think we'd ever see this pattern given where AIC is in the opt pipeline.

How does this code account for potential memory accesses between the loads that are getting combined?

define i16 @loadCombine_2consecutive_store_between(ptr %p) {
  %p1 = getelementptr i8, ptr %p, i32 1
  %l1 = load i8, ptr %p, align 2
  store i8 42, ptr %p  ; this must happen after a combined load?
  %l2 = load i8, ptr %p1

  %e1 = zext i8 %l1 to i16
  %e2 = zext i8 %l2 to i16
  %s2 = shl i16 %e2, 8
  %o1 = or i16 %e1, %s2
  ret i16 %o1
}

I should have mentioned: since this is the base version, that handling is not enabled yet; this patch targets only simple load scenarios. It was enabled in the InstCombine patch and will be enabled in subsequent patches to handle memory accesses between the loads.

I should have mentioned: since this is the base version, that handling is not enabled yet; this patch targets only simple load scenarios. It was enabled in the InstCombine patch and will be enabled in subsequent patches to handle memory accesses between the loads.

I don't understand the comment. Does the posted patch miscompile the example with a store? If so, then the patch can't be committed in this form.
If there are planned changes to the posted patch before it is ready for review, please note that in the patch status (or add "WIP" to the title).

bipmis updated this revision to Diff 445525.Mon, Jul 18, 9:03 AM

Updating the patch with AliasAnalysis. Tests split up into the respective target directories, and new tests added.
Currently handling only simple forward load patterns.

I should have mentioned: since this is the base version, that handling is not enabled yet; this patch targets only simple load scenarios. It was enabled in the InstCombine patch and will be enabled in subsequent patches to handle memory accesses between the loads.

I don't understand the comment. Does the posted patch miscompile the example with a store? If so, then the patch can't be committed in this form.
If there are planned changes to the posted patch before it is ready for review, please note that in the patch status (or add "WIP" to the title).

Yes, it should have been enabled to handle the store scenarios. I have updated the tests and added a few more; I can add more as needed.
Also, to keep it simple, only forward load sequences are handled.

dmgreen added inline comments.Mon, Jul 18, 11:12 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
443–444

The comment could flow better:

// 2. (? | (zExt(L1) << shift1)) | (zExt(L2) << shift2)
//  -> ? | (zExt(L3) << shift1)
474

Drop extra brackets from (LI1 == LI2)

484

I think this can use getPointersDiff to check the two pointers are the right distance apart.

526

Do we check anywhere that LI1 and LI2 are in the same block?

528–531

Does std::distance work?

578

This doesn't need to cast to a Value*

587

Allows -> Allowed

588

This could also be DL.isLegalInteger, which would avoid the need to create the Type.

596

Should this check the Fast flag too?

if (!Allowed || !Fast)
600–607

I think this is unneeded, and this can always just create:

NewLoad =
        new LoadInst(IntegerType::get(Load1Ptr->getContext(), LOps.LoadSize),
                     LI1->getPointerOperand(), "", LI1->isVolatile(), LI1->getAlign(),
                     LI1->getOrdering(), LI1->getSyncScopeID());

Or possibly use Builder.CreateLoad to avoid the separate Insert.

620–623

This could avoid the NewOp variable and just do:

if (LOps.zext) {
  NewLoad = Builder.CreateZExt(NewLoad, LOps.zext);
}
llvm/test/Transforms/AggressiveInstCombine/AArch64/or-load.ll
26–27

Can you run these tests through opt -O1 (without this patch) and use the result as the tests (maybe with a little cleanup)? LLVM IR will almost never include shl-by-0 nodes, and we should make sure we are testing what will appear in reality.
https://godbolt.org/z/raxKnEE9a

bipmis updated this revision to Diff 446107.Wed, Jul 20, 4:26 AM

Handle Review Comments from David.

bipmis updated this revision to Diff 448536.Fri, Jul 29, 12:47 AM

Handle GEPs in a more generic way, as requested earlier by @nikic.
Add a check that the loads belong to the same BB.

@spatel I have updated the patch, and it should handle the forward load scenarios. I have incorporated most of the review comments received.
The patch has been tested and passes all tests. Please review and let me know if you have more comments. Thanks.

nikic added inline comments.Fri, Jul 29, 2:33 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
507

This whole code can be replaced with stripAndAccumulateConstantOffsets().

537

FindAvailableLoadedValue() is not the correct way to check for clobbers. In particular, it will return an "available value" for a direct clobber (store to the same address).

What you want is to loop over the instructions and call getModRefInfo() on AliasAnalysis, together with a small limit (e.g. 16) at which you abort the walk and bail out of the transform.

llvm/test/Transforms/AggressiveInstCombine/AArch64/or-load.ll
204

Try a variant storing to %p3 rather than %pstr here. I believe your current implementation will incorrectly accept this.

bipmis added inline comments.Fri, Jul 29, 3:07 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
507

Sure. Will look into this.

llvm/test/Transforms/AggressiveInstCombine/AArch64/or-load.ll
204

It does not return an "available value" for a direct clobber. For example, changing the test to

%l1 = load i8, ptr %p
%l2 = load i8, ptr %p1
%l3 = load i8, ptr %p2
store i8 10, ptr %p3
%l4 = load i8, ptr %p3

still returns:
; LE-NEXT: [[TMP1:%.*]] = load i16, ptr [[P]], align 1
; LE-NEXT: [[TMP2:%.*]] = zext i16 [[TMP1]] to i32
; LE-NEXT: [[L3:%.*]] = load i8, ptr [[P2]], align 1
; LE-NEXT: store i8 10, ptr [[P3]], align 1
; LE-NEXT: [[L4:%.*]] = load i8, ptr [[P3]], align 1

I can add more tests if you suggest some.

nikic added inline comments.Fri, Jul 29, 5:57 AM
llvm/test/Transforms/AggressiveInstCombine/AArch64/or-load.ll
204

Okay, I had to test this patch locally to find a case where it fails. Try this variant:

store i8 0, ptr %p3
store i8 1, ptr %p

We are looking for an available value of %p, so we find the store i8 1, ptr %p and are happy. But before that, there is a clobber of %p3 that makes the transform invalid (the load is moved before the store).
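A sketch of that failing variant, assuming the shape of the existing four-consecutive-load test (the exact names and offsets are assumptions):

define i32 @loadCombine_4consecutive_clobber(ptr %p) {
  %p1 = getelementptr i8, ptr %p, i32 1
  %p2 = getelementptr i8, ptr %p, i32 2
  %p3 = getelementptr i8, ptr %p, i32 3
  %l1 = load i8, ptr %p
  %l2 = load i8, ptr %p1
  %l3 = load i8, ptr %p2
  store i8 0, ptr %p3    ; clobbers the byte %l4 will read
  store i8 1, ptr %p     ; found as an "available value" for %p
  %l4 = load i8, ptr %p3

  %e1 = zext i8 %l1 to i32
  %e2 = zext i8 %l2 to i32
  %e3 = zext i8 %l3 to i32
  %e4 = zext i8 %l4 to i32
  %s2 = shl i32 %e2, 8
  %s3 = shl i32 %e3, 16
  %s4 = shl i32 %e4, 24
  %o1 = or i32 %e1, %s2
  %o2 = or i32 %o1, %s3
  %o3 = or i32 %o2, %s4
  ret i32 %o3
}

Merging all four loads into one i32 load at %l1's position would read the byte at %p3 before the store of 0, so the combined value would disagree with what %l4 actually loads.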

nikic added a reviewer: nikic.Fri, Jul 29, 5:57 AM
bipmis updated this revision to Diff 448989.Mon, Aug 1, 4:25 AM

Handle review comments from nikic. Made changes to the alias analysis.

bipmis added a comment.Wed, Aug 3, 6:38 AM

Gentle ping for review on this! Thanks.