This is an archive of the discontinued LLVM Phabricator instance.

[AggressiveInstCombine] Combine consecutive loads which are being merged to form a wider load.
ClosedPublic

Authored by bipmis on Jun 9 2022, 3:49 AM.

Details

Summary

The patch simplifies patterns such as the ones below:

1. (zExt(L1) << shift1) | (zExt(L2) << shift2) -> zExt(L3) << shift1
2. (? | (zExt(L1) << shift1)) | (zExt(L2) << shift2) -> ? | (zExt(L3) << shift1)

The pattern indicates that the loads are being merged into a wider value and that the only use of the individual loads is to form that wider load. In this case, for non-atomic, non-volatile loads, reduce the pattern to a single combined load, which improves the cost modelling for inlining, unrolling, vectorization, etc.
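As a rough illustration only (this is not the patch's code; LOps, LI1, Ctx, and the insertion point are assumed names for the sketch), the replacement built for pattern 1 amounts to one wider load, zero-extended and shifted into place:

// Sketch: replace (zExt(L1) << shift1) | (zExt(L2) << shift2) with a single
// wider load L3, zero-extended to the result type and shifted by shift1.
IRBuilder<> Builder(InsertPt);                        // insertion point assumed
Type *WideTy = IntegerType::get(Ctx, LOps.LoadSize);  // e.g. i16/i32/i64
Value *NewLoad = Builder.CreateAlignedLoad(WideTy, LI1->getPointerOperand(),
                                           LI1->getAlign(), "wide.load");
Value *Res = Builder.CreateZExt(NewLoad, LOps.ZextType); // zExt(L3)
if (LOps.Shift)
  Res = Builder.CreateShl(Res, LOps.Shift);              // << shift1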

Diff Detail

Event Timeline

bipmis requested review of this revision.Jun 9 2022, 3:49 AM

I'd like to see some form of load combining in IR, but we need to be very careful about how that is implemented.

Note that LLVM had a dedicated pass for load combining, but it caused problems and was removed:
https://lists.llvm.org/pipermail/llvm-dev/2016-September/105291.html

So any proposal to do it again should be aware of and avoid those problems. Putting it in InstCombine is controversial because it could interfere with other analysis/passes. We recently added a cost-aware transform to AggressiveInstCombine, so that might be a better fit.

I didn't check all of the tests, but at least the simplest cases should already be handled in DAGCombiner. Can you show a larger, motivating example where that backend transform is too late to get to the ideal code?

@spatel Thanks for reviewing it. The common scenario in LLVM where we see this as an issue is the cost modelling in the inliner, unroller, and vectorizer, which results in sub-optimal code. One simple example is a dot product, where the vectorizer fails to generate vectorized code, as below:
https://godbolt.org/z/Ee4cbf1PG

A similar situation can be seen with the unroller, as in the example below:
https://godbolt.org/z/dWo8n3Yz8

https://godbolt.org/z/Ee4cbf1PG
A similar situation can be seen with the unroller, as in the example below:
https://godbolt.org/z/dWo8n3Yz8

Thanks - those look like good motivating examples. I recommend moving this patch over to AggressiveInstCombine or recreating a limited form of the old LoadCombine pass. Either way, we need to be careful to make sure that the transform does not combine loads illegally and that it doesn't harm other analysis. You might want to start an RFC thread on discourse for more feedback and tag people who commented on earlier proposals to do this kind of transform.

I am in favour of moving the discussion to discourse.

AggressiveInstCombine and target-dependent InstCombine are two independent concepts for me.

RKSimon added inline comments.Jun 16 2022, 8:26 AM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
17 ↗(On Diff #435486)

clang-format this?

nikic added a subscriber: nikic.Jun 24 2022, 1:58 AM

The main question I see here is whether this needs to be TTI-based or not -- if yes, then it also shouldn't be in InstCombine. I think there are two reasons why we might want TTI:

  • Do we want to create unaligned loads? Creating them is legal, but if the target does not support fast unaligned loads, the backend will break them up again. Should we only perform this optimization if TTI.allowsMisalignedMemoryAccesses reports a fast unaligned access?
  • Do we want to create loads with illegal sizes? For example, on a 64-bit target, do we want to combine two i64 loads into an i128 load, even though it will later be broken up again? (At least for the current implementation, where both loads must have the same size, the question of whether to allow i24 loads or similar does not come up.)
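(For reference, both questions map onto existing hooks; the following is only a sketch of the checks being discussed, with variable names such as Ctx, LoadSize, AS, Alignment, TTI, and DL assumed, not code from the patch.)

// Bail out unless the target reports that a misaligned access of this width
// is both allowed and fast, and the resulting integer width is legal.
bool Fast = false;
bool Allowed = TTI.allowsMisalignedMemoryAccesses(Ctx, LoadSize, AS,
                                                  Alignment, &Fast);
if (!Allowed || !Fast)
  return false;
if (!DL.isLegalInteger(LoadSize))
  return false;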
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
3004 ↗(On Diff #435486)

nullptr

3017 ↗(On Diff #435486)

Move dyn_cast into initialization, having it in the if is hard to read.

3029 ↗(On Diff #435486)

GEPOperator

Also directly dyn_cast in the condition.

3035 ↗(On Diff #435486)

You are probably looking for GEP2->accumulateConstantOffsets() here -- your current code will treat GEPs with multiple indices incorrectly.

3077 ↗(On Diff #435486)

I don't get this check -- you're checking that there is an available value, but not that this value is the previous load. There might still be a clobber in between.

What you're probably looking for is canInstructionRangeModRef() between the two instructions.

3108 ↗(On Diff #435486)

Unnecessary cast

3114 ↗(On Diff #435486)

Why does this not use the builder to create the instruction, only to insert it?

3119 ↗(On Diff #435486)

Uh, what's the point of oring something with zero?

RKSimon added inline comments.Jun 24 2022, 3:26 AM
llvm/test/Transforms/InstCombine/or-load.ll
2 ↗(On Diff #435486)

add little and big endian test coverage

The main question I see here is whether this needs to be TTI-based or not -- if yes, then it also shouldn't be in InstCombine. I think there are two reasons why we might want TTI:

  • Do we want to create unaligned loads? Creating them is legal, but if the target does not support fast unaligned loads, the backend will break them up again. Should we only perform this optimization if TTI.allowsMisalignedMemoryAccesses reports a fast unaligned access?
  • Do we want to create loads with illegal sizes? For example, on a 64-bit target, do we want to combine two i64 loads into an i128 load, even though it will later be broken up again? (At least for the current implementation, where both loads must have the same size, the question of whether to allow i24 loads or similar does not come up.)

Thanks for the review. I have made most of the other suggested changes and can post an updated patch. It also handles some more test cases.

Right, I think we need to be sure about this particular path. If we want a TTI-based approach, would AggressiveInstCombine be a better choice? We also see some uses of TTI in InstCombine, but is our case a valid use of it?
The next patch will be in InstCombine for this reason.
I will also test the patch with extended tests to see whether we hit any performance issues and whether the backend splits the wider loads up correctly.

We also see some uses of TTI in InstCombine

I believe there is a strong consensus to say no to TTI in InstCombine.

bipmis added inline comments.Jun 28 2022, 9:21 AM
llvm/test/Transforms/InstCombine/or-load.ll
2 ↗(On Diff #435486)

In IR I believe the checks would be the same for LE and BE. The differences should pop up in the assembly.
https://godbolt.org/z/xboKx47cc

bipmis updated this revision to Diff 441695.Jul 1 2022, 8:15 AM

Handle some of the review comments.
Added support for additional scenarios like reverse order loads and big-endian support.

Moving the implementation to AggressiveInstCombine is still open for discussion, along with the TTI requirement.

spatel added a comment.Jul 1 2022, 9:04 AM

Handle some of the review comments.
Added support for additional scenarios like reverse order loads and big-endian support.

Moving the implementation to AggressiveInstCombine is still open for discussion, along with the TTI requirement.

IMO, we should move this patch to AggressiveInstCombine with conservative TTI limitations to start.

Now that we've heard from several others, I'd summarize as:

  1. There's good motivation to have load combining in IR as a canonicalization.
  2. But it should be limited with TTI (avoid unaligned loads, illegal operations/types).
  3. Adding TTI to InstCombine is a non-starter; we don't want to have that level of target-dependence in IR canonicalization.
  4. AggressiveInstCombine is a lower-weight canonicalization pass that recently gained access to TTI.
  5. Potential follow-ups could relax the TTI constraints and/or allow bswap formation to deal with the endian differences seen in the current tests (we already do that type of transform in the backend).
bipmis added inline comments.Jul 1 2022, 9:07 AM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
3035 ↗(On Diff #435486)

Currently we are handling GEPs with a single index. With multiple indices we exit at the "isa<GetElementPtrInst>(Op2)" check. I will keep looking at whether this can be improved.

3077 ↗(On Diff #435486)

This is basically to check whether there are stores or other memory accesses between the two loads that we need to account for before merging them. The check starts at L2 and walks backwards, doing the memory check until it reaches L1.
There are alias tests loadCombine_4consecutive_with_alias* which should give you the idea.

3119 ↗(On Diff #435486)

Yeah, this is basically because visitOr() does an Insert of the returned OR. This may not be needed if we move this to AggressiveInstCombine.

nikic added a comment.Jul 1 2022, 9:27 AM

When switching this to AggressiveInstCombine, I would strongly recommend to start with a much more minimal patch. Handle only a single simple case, without any of the possible variants. We can build on that base later.

When switching this to AggressiveInstCombine, I would strongly recommend to start with a much more minimal patch. Handle only a single simple case, without any of the possible variants. We can build on that base later.

Strongly agree - there's a lot of potential for this to go wrong both in correctness and perf regressions, so we need to build up in steps.
AFAIK, the load combine pass did not have correctness problems when it died, so that source code would be a good reference.

bipmis updated this revision to Diff 443893.Jul 12 2022, 3:02 AM
bipmis set the repository for this revision to rG LLVM Github Monorepo.

A simple implementation in AggressiveInstCombine which handles the forward consecutive load sequences as provided in the tests.
The implementation is limited to a specific consecutive-load pattern that reduces to a combined load only (the one and only use of the individual loads is to form the wider load). This is not a generic LoadCombine, as combining loads that have other uses can result in poison propagation.

When switching this to AggressiveInstCombine, I would strongly recommend to start with a much more minimal patch. Handle only a single simple case, without any of the possible variants. We can build on that base later.

Strongly agree - there's a lot of potential for this to go wrong both in correctness and perf regressions, so we need to build up in steps.
AFAIK, the load combine pass did not have correctness problems when it died, so that source code would be a good reference.

The old load combine pass is different in that it implemented load combining in a more generic way, not based on a specific pattern. If it found that two loads could be combined, it generated a wider load, and the subsequent uses were derived from this wider load using CreateExtractInteger. The current implementation is more of a pattern match, verifying offsets and shifts to confirm that the loads are consecutive and that their only use is to create the wider load.

spatel retitled this revision from [InstCombine] Combine consecutive loads which are being merged to form a wider load. to [AggressiveInstCombine] Combine consecutive loads which are being merged to form a wider load..Jul 14 2022, 7:14 AM

How does this code account for potential memory accesses between the loads that are getting combined?

define i16 @loadCombine_2consecutive_store_between(ptr %p) {
  %p1 = getelementptr i8, ptr %p, i32 1
  %l1 = load i8, ptr %p, align 2
  store i8 42, ptr %p  ; this must happen after a combined load?
  %l2 = load i8, ptr %p1

  %e1 = zext i8 %l1 to i16
  %e2 = zext i8 %l2 to i16
  %s2 = shl i16 %e2, 8
  %o1 = or i16 %e1, %s2
  ret i16 %o1
}
llvm/test/Transforms/AggressiveInstCombine/or-load.ll
5

We need to have this test duplicated on a target (x86?) where the fold actually happens. Target-specific tests will need to go inside target subdirectories, otherwise we'll break testing bots. We can pre-commit those once we have the right set of tests.

24

Why do the tests have non-canonical code (shift-by-0)? I don't think we'd ever see this pattern given where AIC is in the opt pipeline.

How does this code account for potential memory accesses between the loads that are getting combined?

define i16 @loadCombine_2consecutive_store_between(ptr %p) {
  %p1 = getelementptr i8, ptr %p, i32 1
  %l1 = load i8, ptr %p, align 2
  store i8 42, ptr %p  ; this must happen after a combined load?
  %l2 = load i8, ptr %p1

  %e1 = zext i8 %l1 to i16
  %e2 = zext i8 %l2 to i16
  %s2 = shl i16 %e2, 8
  %o1 = or i16 %e1, %s2
  ret i16 %o1
}

I should have mentioned: this being the base version, that handling is not enabled yet. This patch only targets simple load scenarios. The handling was enabled in the InstCombine patch, and it will be enabled in subsequent patches to handle memory accesses between the loads.

I should have mentioned: this being the base version, that handling is not enabled yet. This patch only targets simple load scenarios. The handling was enabled in the InstCombine patch, and it will be enabled in subsequent patches to handle memory accesses between the loads.

I don't understand the comment. Does the posted patch miscompile the example with a store? If so, then the patch can't be committed in this form.
If there are planned changes to the posted patch before it is ready for review, please note that in the patch status (or add "WIP" to the title).

bipmis updated this revision to Diff 445525.Jul 18 2022, 9:03 AM

Updating the patch with AliasAnalysis. Tests split up for the respective targets and new tests added.
Currently handling only simple forward load patterns.

I should have mentioned: this being the base version, that handling is not enabled yet. This patch only targets simple load scenarios. The handling was enabled in the InstCombine patch, and it will be enabled in subsequent patches to handle memory accesses between the loads.

I don't understand the comment. Does the posted patch miscompile the example with a store? If so, then the patch can't be committed in this form.
If there are planned changes to the posted patch before it is ready for review, please note that in the patch status (or add "WIP" to the title).

Yes, it should have been enabled to handle the store scenarios. I have updated the tests and added a few more; I can add more as needed.
Also, to keep it simple, I have handled only forward load sequences.

dmgreen added inline comments.Jul 18 2022, 11:12 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
443–444

The comment could flow better:

// 2. (? | (zExt(L1) << shift1)) | (zExt(L2) << shift2)
//  -> ? | (zExt(L3) << shift1)
474

Drop extra brackets from (LI1 == LI2)

484

I think this can use getPointersDiff to check the two pointers are the right distance apart.

526

Do we check anywhere that LI1 and LI2 are in the same block?

528–531

Does std::distance work?

578

This doesn't need to cast to a Value*

587

Allows -> Allowed

588

This could also be DL.isLegalInteger, which would avoid the need to create the Type.

596

Should this check the Fast flag too?

if (!Allowed || !Fast)
600–607

I think this is unneeded, and this can always just create:

NewLoad =
        new LoadInst(IntegerType::get(Load1Ptr->getContext(), LOps.LoadSize),
                     LI1->getPointerOperand(), "", LI1->isVolatile(), LI1->getAlign(),
                     LI1->getOrdering(), LI1->getSyncScopeID());

Or possibly use Builder.CreateLoad to avoid the separate Insert.
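(A sketch of that builder-based alternative, using the same assumed names as the snippet above; CreateAlignedLoad inserts the instruction itself, so no separate Insert call is needed.)

LoadInst *NewLoad = Builder.CreateAlignedLoad(
    IntegerType::get(Load1Ptr->getContext(), LOps.LoadSize),
    LI1->getPointerOperand(), LI1->getAlign(), LI1->isVolatile(), "");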

620–623

This could avoid the NewOp variable and just do:

if (LOps.zext) {
  NewLoad = Builder.CreateZExt(NewLoad, LOps.zext);
}
llvm/test/Transforms/AggressiveInstCombine/AArch64/or-load.ll
25–26 ↗(On Diff #445525)

Can you run these tests through opt -O1 (without this patch) and use the result as the tests (maybe with a little cleanup). LLVM-IR will almost never include shl 0 nodes, and we should make sure we are testing what will appear in reality.
https://godbolt.org/z/raxKnEE9a

bipmis updated this revision to Diff 446107.Jul 20 2022, 4:26 AM

Handle Review Comments from David.

bipmis updated this revision to Diff 448536.Jul 29 2022, 12:47 AM

Handle GEP in a more generic way, as requested earlier by @nikic.
Add a check that the loads belong to the same BB.

@spatel I have updated the patch and it should handle the forward load scenarios. I have incorporated most of the review comments received.
The patch has been tested and passes all tests. Do review and suggest if you have more comments. Thanks.

nikic added inline comments.Jul 29 2022, 2:33 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
507

This whole code can be replaced with stripAndAccumulateConstantOffsets().
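(A sketch of that suggestion, with variable names assumed: one call strips all constant GEP offsets and accumulates them into an APInt.)

const DataLayout &DL = LI1->getModule()->getDataLayout();
APInt Offset1(DL.getIndexTypeSizeInBits(Load1Ptr->getType()), 0);
// Returns the underlying base pointer; Offset1 receives the constant byte offset.
Load1Ptr = Load1Ptr->stripAndAccumulateConstantOffsets(
    DL, Offset1, /*AllowNonInbounds=*/true);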

537

FindAvailableLoadedValue() is not the correct way to check for clobbers. In particular, it will return an "available value" for a direct clobber (store to the same address).

What you want is to loop over the instructions and call getModRefInfo() on AliasAnalysis, together with a small limit (e.g. 16) at which you abort the walk and bail out of the transform.
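(A sketch of that scan with assumed names, not the patch's exact code: walk the instructions between the two loads and bail out on any potential write to the merged location, or once the limit is hit.)

MemoryLocation Loc = MemoryLocation::get(LI2);
unsigned NumScanned = 0;
for (Instruction &Inst : make_range(LI1->getIterator(), LI2->getIterator())) {
  // Give up on degenerate blocks rather than scanning forever.
  if (++NumScanned > MaxInstrsToScan)
    return false;
  // Any instruction that may modify the merged location blocks the transform.
  if (Inst.mayWriteToMemory() && isModSet(AA.getModRefInfo(&Inst, Loc)))
    return false;
}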

llvm/test/Transforms/AggressiveInstCombine/AArch64/or-load.ll
203 ↗(On Diff #448536)

Try a variant storing to %p3 rather than %pstr here. I believe your current implementation will incorrectly accept this.

bipmis added inline comments.Jul 29 2022, 3:07 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
507

Sure. Will look into this.

llvm/test/Transforms/AggressiveInstCombine/AArch64/or-load.ll
203 ↗(On Diff #448536)

It does not return an "available value" for a direct clobber. For example, with this change to the test:

%l1 = load i8, ptr %p
%l2 = load i8, ptr %p1
%l3 = load i8, ptr %p2
store i8 10, ptr %p3
%l4 = load i8, ptr %p3

still returns
; LE-NEXT: [[TMP1:%.*]] = load i16, ptr [[P]], align 1
; LE-NEXT: [[TMP2:%.*]] = zext i16 [[TMP1]] to i32
; LE-NEXT: [[L3:%.*]] = load i8, ptr [[P2]], align 1
; LE-NEXT: store i8 10, ptr [[P3]], align 1
; LE-NEXT: [[L4:%.*]] = load i8, ptr [[P3]], align 1

Can add more tests if you suggest.

nikic added inline comments.Jul 29 2022, 5:57 AM
llvm/test/Transforms/AggressiveInstCombine/AArch64/or-load.ll
203 ↗(On Diff #448536)

Okay, I had to test this patch locally to find a case where it fails. Try this variant:

store i8 0, ptr %p3
store i8 1, ptr %p

We are looking for an available value of %p, so we find the store i8 1, ptr %p and are happy. But before that, there is a clobber of %p3 that makes the transform invalid (the load is moved before the store).

nikic added a reviewer: nikic.Jul 29 2022, 5:57 AM
bipmis updated this revision to Diff 448989.Aug 1 2022, 4:25 AM

Handle Review comments from nikic. Made changes to Alias Analysis.

bipmis added a comment.Aug 3 2022, 6:38 AM

Gentle ping for review on this! Thanks.

I feel like this being recursive makes it more difficult to reason about than it could be. I have added some mostly nitpicks/cleanups below.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
438

zext -> Zext (Or ZExt? Capitalized at least)

443

Are there any tests for case 2?

462–463

The variables can be defined where they are first used.

477

Should it check that the address space is the same?

497

Does it matter if it is a bitcast or a gep?

515

Capitalize variable names.

544

We checked that loadSize1 == loadSize2 above.

547

Demorgan this.

563

Maybe replace the name of foldLoadsIterative with foldConsecutiveLoads and vice-versa. foldLoadsIterative doesn't really explain what this function is folding, and it's not really iterative.

bipmis updated this revision to Diff 451989.Aug 11 2022, 2:37 PM

Thanks David for reviewing. Have handled most of the nits.

bipmis added inline comments.Aug 11 2022, 2:43 PM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
443

Case 2 is not required, as we are handling the pattern recursively. This also keeps the implementation simple.

497

I think this may be needed, so that we fall through and evaluate further if instructions are only of these types.

544

Right. I have added additional comments on why this is needed.

@dmgreen For pattern matching a chain of or(or,load), recursion seemed a good choice to get to the root node and evaluate whether the entire chain can be reduced. There are also other instances of pattern matching in AggressiveInstCombine, like matchAndOrChain(), which are implemented similarly.
@nikic Do you have any further comments on the alias analysis used? I have used the number of instructions between the two loads as the limit for the alias-analysis walk that looks for an intervening store.
@spatel Please do suggest if you have any other review comments. Thanks.

dmgreen added inline comments.Aug 25 2022, 3:47 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
497

What is it you mean by fall through? If stripAndAccumulateConstantOffsets could give us an Offset, it seems like we should just always call it and have it do what it can. It will return the original pointer if it couldn't do anything useful.
It may be better to keep Offset1/Offset2 as APInt too. It would help if the pointers were > 64 bits.

514

Drop these extra brackets

519

Should this have a limit on the number of instructions?

529

Is there a test for an i128 version of the fold?

596

Is all the metadata on the old instruction (like MD_range) always valid on the new load?

nikic added inline comments.Aug 30 2022, 6:29 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
473

Shouldn't this be !LI1->isSimple() || !LI2->isSimple()? We want to bail if either load isn't simple, not if both are.

Also, it looks like there are no (negative) tests for volatile/atomic loads.

Allen added a subscriber: Allen.Aug 30 2022, 6:45 AM
bipmis updated this revision to Diff 457929.Sep 5 2022, 4:05 AM

Handle review comments and add additional tests.

bipmis marked 5 inline comments as done.Sep 5 2022, 4:10 AM
bipmis added inline comments.
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
519

Some tests, like load64_farLoads(), which have a wider instruction gap between the loads, may result in only a partial combine (when tried with 16). I can possibly go for a bigger limit, or keep the limit on the actual number of instructions between two loads.

dmgreen added inline comments.Sep 6 2022, 7:19 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
519

Using what limit?

596

What happens if new metadata gets added in the future, that isn't valid?

Is it better to just drop all the metadata? Or is that too likely to be worse for performance?

bipmis added inline comments.Sep 7 2022, 3:10 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
519

In the current implementation the worst-case limit could be all the instructions in a BB. What is the issue with this?
For the test case load64_farLoads() it does fine with a limit of 35.

596

This being a specific pattern-match scenario, looking for an or-load chain, I don't think performance should be a big concern. It depends on the end use of the merged load.

What I am seeing in most cases is that they try to retain at least the AATags.

if (AATags)
      NewVal->setAAMetadata(AATags);
nikic added inline comments.Sep 7 2022, 3:19 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
596

Note that you can't simply take the AATags from one load, they have to be merged appropriately. I believe for this specific case you need the AAMDNodes::concat() method, because you are merging loads from different non-overlapping locations.
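(A sketch of that merging step, assuming LI1/LI2 are the source loads and NewLoad is the combined load.)

// Concatenate the AA metadata of both source loads onto the merged load.
AAMDNodes AATags = LI1->getAAMetadata().concat(LI2->getAAMetadata());
NewLoad->setAAMetadata(AATags);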

dmgreen added inline comments.Sep 8 2022, 1:21 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
519

It protects us where there are thousands of instructions in the block, just to be safe for degenerate cases. If we expect the maximum pattern to be (load+zext+shift+or) * i8->i64, so 4*8, then a limit of 64 instructions sounds fine.

bipmis added inline comments.Sep 8 2022, 5:07 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
519

Sounds right. Will implement the same with the 64 instruction limit. Thanks.

596

Agreed. Thanks for this.
I think we are better off dropping the metadata at this point.

alex added a subscriber: alex.Sep 9 2022, 7:24 AM
bipmis updated this revision to Diff 459444.Sep 12 2022, 6:47 AM

Add a limit of 64 instructions for Alias Analysis.
Concat AATags Metadata for merged loads.

bipmis added inline comments.Sep 12 2022, 6:50 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
596

The concat method leaves the TBAA blank, so maybe we want to drop the metadata altogether?
Currently the 'noalias' and 'alias.scope' metadata will be concatenated from the AAMDNodes.

dmgreen accepted this revision.Sep 15 2022, 12:38 AM

Thanks for the changes - as far as I understand, this LGTM now. Thanks for working through the details.

Any other comments?

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
596

I'm a little surprised that if the two TBAA infos are the same, we can't use that on the result node. I think using concat sounds sensible though. I suspect in practice we will often be combining char in any case.

This revision is now accepted and ready to land.Sep 15 2022, 12:38 AM

There are a number of test failures in pre-merge checks, probably pipeline tests.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
679

Just like all the other analyses, this should use a reference, not a pointer (the analysis is not optional).

nikic added inline comments.Sep 15 2022, 1:10 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
438

Zext -> ZextType, maybe?

445

Iterative -> Recursive

517

This check should come first, otherwise you don't count store instructions.

541

I think this handles non-byte-sized loads incorrectly. Let's say you do i4 loads with 1 byte offset. Then PrevSize will be 1 and match the offset, even though the loads are not actually consecutive.

Please add some tests for non-byte-sized loads.
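(One way to rule those out, sketched here with assumed names; it mirrors the "power of 2 loads and minimum load size 8 bits" restriction adopted later in this review.)

// Reject sub-byte or non-power-of-two load widths so the
// "previous load size == offset" check stays meaningful.
uint64_t LoadSize1 = DL.getTypeSizeInBits(LI1->getType());
if (LoadSize1 < 8 || !isPowerOf2_64(LoadSize1))
  return false;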

574

Combine these declarations with initialization.

595

Why does this not use the IRBuilder?

637

The DataLayout fetch can be moved outside the loop.

spatel added inline comments.Sep 15 2022, 6:11 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
636–637

The transform should be placed ahead of foldSqrt(), or we may hit a use-after-free bug (because foldSqrt can delete I).
There was a code comment about this in:
https://github.com/llvm/llvm-project/commit/df868edee561eb973edd85ec9df41c67aa0bff6b
...but that patch got reverted. We should probably add that code comment independently (or fix the bug some other way).

bipmis updated this revision to Diff 460709.Sep 16 2022, 6:00 AM

Update the patch with review comments.
-> Support power-of-2 loads with a minimum load size of 8 bits.
-> Use the IR builder for the new load.
-> New test for a 4-bit load.
-> Nits.

bipmis marked 7 inline comments as done.Sep 16 2022, 6:05 AM
bipmis added inline comments.
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
517

So it would count stores which do not alias. For the case where it aliases, we terminate anyway, and the count update won't matter?

541

Right. The shift condition will prevent it from merging.
But we do not want to combine loads smaller than a byte. Updated checks.

bipmis marked an inline comment as done.Sep 22 2022, 1:37 AM

Ping.
@nikic Any further comments on the patch?

nikic added a comment.Sep 22 2022, 1:49 AM

From a cursory look it looks fine, I assume @dmgreen has reviewed in detail :)

I have one note (that can be addressed later): There's currently a check that the loads have the same size. Does that mean that if you have a partially folded pattern (e.g. one i16 load plus two i8 loads) that it will not get folded? If so, this seems like something we should relax (later).

From a cursory look it looks fine, I assume @dmgreen has reviewed in detail :)

I have one note (that can be addressed later): There's currently a check that the loads have the same size. Does that mean that if you have a partially folded pattern (e.g. one i16 load plus two i8 loads) that it will not get folded? If so, this seems like something we should relax (later).

That's correct. Currently we only fold loads of the same size in a chain. So for the above case, if the two i8 loads belong to a single lower chain, they can be folded.
Agreed. This is more about having a stricter basic implementation first. We can relax this later.

Please ensure you precommit the new tests to trunk and then rebase this patch on it - so we see the effect of the patch on the IR

This revision was landed with ongoing or failed builds.Sep 23 2022, 2:20 AM
This revision was automatically updated to reflect the committed changes.

When I do a 3-stage bootstrap at this commit, the second-stage compiler crashes. The issue does not appear at the previous commit, so I'm reverting this commit.

@gribozavr2 Do you have any useful build log that you can provide to speed up triage please?

bipmis added a comment.EditedSep 26 2022, 4:43 AM

@gribozavr2 Do let me know how to reproduce this issue. I am trying a 3-stage build with "clang>/cmake/caches/3-stage.cmake". Linking the executable in the second stage takes quite some time even for clean HEAD code.
It builds fine on AArch64 and x86 with the patch; the stage2 and stage3 binaries are bit-exact.

I'm looking into coming up with a reduced repro.
We're seeing a crash in re2. The weird thing is that the compiler I've repro'd this with is not bootstrapped, but the crash in re2 happens even when compiling it with -O0. My guess is that something else we compile with the just-built clang, like libc++, is getting miscompiled, but more investigation is required.

it ended up being a function that we always compile with -O3...

anyway, reduced test case

$ cat /tmp/a.ll
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"                                                                                                       
target triple = "x86_64-grtev4-linux-gnu"                                                                                                                                                          
                                                                                                                                                                                                   
define i64 @eggs(ptr noundef readonly %arg) {                                                                                                                                                      
  %tmp3 = load i8, ptr %arg, align 1                                                             
  %tmp4 = getelementptr inbounds i8, ptr %arg, i64 1                                             
  %tmp5 = load i8, ptr %tmp4, align 1                                                            
  %tmp6 = getelementptr inbounds i8, ptr %arg, i64 2                                             
  %tmp7 = load i8, ptr %tmp6, align 1                                                            
  %tmp8 = getelementptr inbounds i8, ptr %arg, i64 3                                             
  %tmp9 = load i8, ptr %tmp8, align 1                                                            
  %tmp10 = getelementptr inbounds i8, ptr %arg, i64 4                                            
  %tmp11 = load i8, ptr %tmp10, align 1                                                          
  %tmp12 = getelementptr inbounds i8, ptr %arg, i64 5                                            
  %tmp13 = load i8, ptr %tmp12, align 1                                                          
  %tmp14 = getelementptr inbounds i8, ptr %arg, i64 6                                            
  %tmp15 = load i8, ptr %tmp14, align 1                                                          
  %tmp16 = getelementptr inbounds i8, ptr %arg, i64 7                                            
  %tmp17 = load i8, ptr %tmp16, align 1
  %tmp18 = zext i8 %tmp17 to i64     
  %tmp19 = shl nuw i64 %tmp18, 56    
  %tmp20 = zext i8 %tmp15 to i64     
  %tmp21 = shl nuw nsw i64 %tmp20, 48
  %tmp22 = or i64 %tmp19, %tmp21     
  %tmp23 = zext i8 %tmp13 to i64     
  %tmp24 = shl nuw nsw i64 %tmp23, 40
  %tmp25 = or i64 %tmp22, %tmp24     
  %tmp26 = zext i8 %tmp11 to i64     
  %tmp27 = shl nuw nsw i64 %tmp26, 32
  %tmp28 = or i64 %tmp25, %tmp27     
  %tmp29 = zext i8 %tmp9 to i64      
  %tmp30 = shl nuw nsw i64 %tmp29, 24
  %tmp31 = or i64 %tmp28, %tmp30     
  %tmp32 = zext i8 %tmp7 to i64      
  %tmp33 = shl nuw nsw i64 %tmp32, 16
  %tmp34 = zext i8 %tmp5 to i64     
  %tmp35 = shl nuw nsw i64 %tmp34, 8
  %tmp36 = or i64 %tmp31, %tmp33
  %tmp37 = zext i8 %tmp3 to i64 
  %tmp38 = or i64 %tmp36, %tmp35                                                                                                                                                                   
  %tmp39 = or i64 %tmp38, %tmp37                                                                                                                                                                   
  ret i64 %tmp39                                                                                                                                                                                   
}
$ ./build/rel/bin/opt -passes=aggressive-instcombine -S /tmp/a.ll
; ModuleID = '/tmp/b.ll'
source_filename = "/tmp/a.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-grtev4-linux-gnu"

define i64 @eggs(ptr noundef readonly %arg) {
  %tmp3 = load i8, ptr %arg, align 1
  %tmp4 = getelementptr inbounds i8, ptr %arg, i64 1
  %tmp5 = load i32, ptr %tmp4, align 1
  %1 = zext i32 %tmp5 to i64
  %2 = shl i64 %1, 8
  %tmp37 = zext i8 %tmp3 to i64
  %tmp39 = or i64 %2, %tmp37
  ret i64 %tmp39
}

that does look wrong, it looks like it should be optimized to load i64 rather than zext(load i32 (%a + 1)) | zext(load i8 %a)

aeubanks reopened this revision.Sep 26 2022, 11:08 AM
This revision is now accepted and ready to land.Sep 26 2022, 11:08 AM
bipmis added a comment.EditedSep 27 2022, 5:09 AM

@aeubanks Thanks for the test case.
So in this patch we have implemented the forward load merge, as can be seen in the test example. However, I had to handle the child node "or(load,load)", as this is currently being reversed by InstCombine, as seen below. This needed a corrected check.

define i32 @Load32(ptr noundef %ptr) local_unnamed_addr #0 {
entry:
  %0 = load i8, ptr %ptr, align 1, !tbaa !5
  %conv = zext i8 %0 to i32
  %arrayidx1 = getelementptr inbounds i8, ptr %ptr, i64 1
  %1 = load i8, ptr %arrayidx1, align 1, !tbaa !5
  %conv2 = zext i8 %1 to i32
  %shl = shl i32 %conv2, 8
  %or = or i32 %conv, %shl        ; <-- note the forward operand order
  %arrayidx3 = getelementptr inbounds i8, ptr %ptr, i64 2
  %2 = load i8, ptr %arrayidx3, align 1, !tbaa !5
  %conv4 = zext i8 %2 to i32
  %shl5 = shl i32 %conv4, 16
  %or6 = or i32 %or, %shl5
  %arrayidx7 = getelementptr inbounds i8, ptr %ptr, i64 3
  %3 = load i8, ptr %arrayidx7, align 1, !tbaa !5
  %conv8 = zext i8 %3 to i32
  %shl9 = shl i32 %conv8, 24
  %or10 = or i32 %or6, %shl9
  ret i32 %or10
}

*** IR Dump After InstCombinePass on Load32 ***
; Function Attrs: nounwind uwtable
define i32 @Load32(ptr noundef %ptr) local_unnamed_addr #0 {
entry:
  %0 = load i8, ptr %ptr, align 1, !tbaa !5
  %conv = zext i8 %0 to i32
  %arrayidx1 = getelementptr inbounds i8, ptr %ptr, i64 1
  %1 = load i8, ptr %arrayidx1, align 1, !tbaa !5
  %conv2 = zext i8 %1 to i32
  %shl = shl nuw nsw i32 %conv2, 8
  %or = or i32 %shl, %conv        ; <-- operands reversed by InstCombine
  %arrayidx3 = getelementptr inbounds i8, ptr %ptr, i64 2
  %2 = load i8, ptr %arrayidx3, align 1, !tbaa !5
  %conv4 = zext i8 %2 to i32
  %shl5 = shl nuw nsw i32 %conv4, 16
  %or6 = or i32 %or, %shl5
  %arrayidx7 = getelementptr inbounds i8, ptr %ptr, i64 3
  %3 = load i8, ptr %arrayidx7, align 1, !tbaa !5
  %conv8 = zext i8 %3 to i32
  %shl9 = shl nuw i32 %conv8, 24
  %or10 = or i32 %or6, %shl9
  ret i32 %or10
}

The full implementation of the reverse load-merge pattern is planned subsequently. So for now you will see the lowest-node loads being merged but not the other ones. I am updating the patch; it would be great if you could test it.

bipmis updated this revision to Diff 463194.Sep 27 2022, 5:14 AM

Handle the reverse load pattern checks correctly.
Currently we need to handle the leaf-node reverse loads, as the InstCombine pass folds the pattern from a forward to a reverse one.
Full reverse load patterns are planned to be implemented subsequently.

spatel added inline comments.Sep 27 2022, 7:41 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
482–485

I didn't notice this limitation before. "Forward" and "reverse" are referring to the order that the or instructions use the loaded values?

I agree that we don't want to complicate the initial implementation any more than necessary, but it might help to see how the backend handles that in DAGCombiner::MatchLoadCombine() (see D133584 for a proposed enhancement to that code).

bipmis added inline comments.Sep 27 2022, 9:22 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
482–485

Right, it is the order:
Forward - or(or(or(0,1),2),3)
Reverse - or(or(or(3,2),1),0)
considering 0,1,... as the zext and shl of the loads with index 0,1, etc.

For simplicity we wanted to implement the forward case first; if this looks good we can do the reverse and mixed-size loads. There should be minimal changes on top of this, so that is planned next.

Matt added a subscriber: Matt.Sep 27 2022, 11:37 AM
bipmis added inline comments.Sep 28 2022, 6:29 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
482–485

I didn't notice this limitation before. "Forward" and "reverse" are referring to the order that the or instructions use the loaded values?

I agree that we don't want to complicate the initial implementation any more than necessary, but it might help to see how the backend handles that in DAGCombiner::MatchLoadCombine() (see D133584 for a proposed enhancement to that code).

@spatel Verified DAGCombiner::MatchLoadCombine() handles the reverse load pattern fine and a single load is generated.

Please pre-commit the new tests with the baseline CHECK lines.
Also, it would be good to add a test that matches the reported failure case (the @eggs test from the earlier comment).
After that, I think it's fine to re-commit.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
482–485

Thanks for checking. It doesn't directly impact the initial/immediate patch here, but it might provide a template for the planned enhancements.

Kai added a subscriber: Kai.Oct 4 2022, 12:54 PM
Kai added inline comments.
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
604

FYI: The missing bitcast for the Load1Ptr argument means that this change only works with opaque pointers.

dstuttard added inline comments.
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
604

We had the same issue - I've just uploaded a patch. See D135249

uabelho added inline comments.
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
526

Since the NumScanned counter includes debug instructions, this code makes the result depend on the presence of debug instructions.
I wrote a ticket about that: https://github.com/llvm/llvm-project/issues/69925
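(For reference, a common way to avoid that kind of dependence, sketched here against the scan loop discussed above; this is not a committed fix.)

// Inside the alias-scan loop: skip debug/pseudo instructions before bumping
// NumScanned, so building with -g does not change the optimization result.
if (Inst.isDebugOrPseudoInst())
  continue;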