[AggressiveInstCombine] Combine consecutive loads which are being merged to form a wider load.
ClosedPublic

Authored by bipmis on Jun 9 2022, 3:49 AM.

Details

Summary

This patch simplifies patterns such as the following:

1. (zExt(L1) << shift1) | (zExt(L2) << shift2) -> zExt(L3) << shift1
2. (? | (zExt(L1) << shift1)) | (zExt(L2) << shift2) -> ? | (zExt(L3) << shift1)

The pattern indicates that the loads are being merged to form a wider value, and the only use of the pattern is that wider value. In this case, for non-atomic, non-volatile loads, reduce the pattern to a single combined load, which improves the cost of inlining, unrolling, vectorization, etc.
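As a minimal illustration (hypothetical IR, assuming a little-endian target; not taken from the patch's test suite), merging two adjacent i8 loads through this pattern would look like:

define i16 @load16_le(ptr %p) {
  %p1 = getelementptr inbounds i8, ptr %p, i64 1
  %b0 = load i8, ptr %p, align 1
  %b1 = load i8, ptr %p1, align 1
  %e0 = zext i8 %b0 to i16
  %e1 = zext i8 %b1 to i16
  %s1 = shl i16 %e1, 8
  %r = or i16 %s1, %e0
  ret i16 %r
}

After the combine, the body reduces to roughly:

  %r = load i16, ptr %p, align 1
  ret i16 %r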

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden.
bipmis updated this revision to Diff 451989.Aug 11 2022, 2:37 PM

Thanks David for reviewing. Have handled most of the nits.

bipmis added inline comments.Aug 11 2022, 2:43 PM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
448

Case 2 is not required, as we handle the pattern recursively. This also keeps the implementation simple.

502

I think this may be needed, so that we fall through and evaluate further if instructions are only of these types.

549

Right. I have added additional comments on why this is needed.

@dmgreen For pattern matching a chain of or(or, load), recursion seemed to be a good choice for walking to the root node and evaluating whether the entire chain can be reduced. There are also other pattern matches in AggressiveInstCombine, like matchAndOrChain(), which are implemented similarly.
@nikic Do you have any further comments on the alias analysis used? I have used the number of instructions between the two loads as the limit for the alias analysis, within which we need to look out for an intervening store.
@spatel Please do suggest if you have any other review comments. Thanks.

dmgreen added inline comments.Aug 25 2022, 3:47 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
502

What is it you mean by fall through? If stripAndAccumulateConstantOffsets could give us an Offset, it seems like we should just always call it and have it do what it can. It will return the original pointer if it couldn't do anything useful.
It may be better to keep Offset1/Offset2 as APInt too. It would help if the pointers were > 64bits.

564

Drop these extra brackets

569

Should this have a limit on the number of instructions?

579

Is there a test for an i128 version of the fold?

646

Is all the metadata on the old instruction (like MD_range) always valid on the new load?

nikic added inline comments.Aug 30 2022, 6:29 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
523

Shouldn't this be `!LI1->isSimple() || !LI2->isSimple()`? We want to bail if either load isn't simple, not if both are.

Also, it looks like there are no (negative) tests for volatile/atomic loads.
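A hypothetical negative test along these lines (illustrative IR, not taken from the patch) would mark one of the loads volatile and check that the chain is left untouched:

define i16 @load16_volatile(ptr %p) {
  %p1 = getelementptr inbounds i8, ptr %p, i64 1
  %b0 = load volatile i8, ptr %p, align 1   ; volatile: must not be merged
  %b1 = load i8, ptr %p1, align 1
  %e0 = zext i8 %b0 to i16
  %e1 = zext i8 %b1 to i16
  %s1 = shl i16 %e1, 8
  %r = or i16 %s1, %e0
  ret i16 %r
}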

Allen added a subscriber: Allen.Aug 30 2022, 6:45 AM
bipmis updated this revision to Diff 457929.Sep 5 2022, 4:05 AM

Handle review comments and add additional tests.

bipmis marked 5 inline comments as done.Sep 5 2022, 4:10 AM
bipmis added inline comments.
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
569

Some tests, like load64_farLoads(), which have a wider instruction gap between loads, may result in only a partial combine (when tried with a limit of 16). I can possibly go for a bigger limit, or keep the limit on the actual number of instructions between two loads.

dmgreen added inline comments.Sep 6 2022, 7:19 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
569

Using what limit?

646

What happens if new metadata gets added in the future, that isn't valid?

Is it better to just drop all the metadata? Or is that too likely to be worse for performance?

bipmis added inline comments.Sep 7 2022, 3:10 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
569

In the current implementation the worst-case limit would be all instructions in a BB. What is the issue with this?
For the test case load64_farLoads() it does fine with a limit of 35.

646

This being a specific scenario of the pattern match, looking for an or-load chain, I don't think performance should be a big concern. It depends on the end use of the merged load.

What I am seeing in most cases is that they try to retain at least the AATags.

if (AATags)
  NewVal->setAAMetadata(AATags);
nikic added inline comments.Sep 7 2022, 3:19 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
646

Note that you can't simply take the AATags from one load, they have to be merged appropriately. I believe for this specific case you need the AAMDNodes::concat() method, because you are merging loads from different non-overlapping locations.

dmgreen added inline comments.Sep 8 2022, 1:21 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
569

It protects us where there are thousands of instructions in the block, just to be safe for degenerate cases. If we expect the maximum pattern to be (load+zext+shift+or) per byte for an i8->i64 merge, so 4*8 = 32 instructions, then a limit of 64 instructions sounds fine.

bipmis added inline comments.Sep 8 2022, 5:07 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
569

Sounds right. Will implement the same with the 64 instruction limit. Thanks.

646

Agreed. Thanks for this.
I think we are better off dropping the metadata at this point.

alex added a subscriber: alex.Sep 9 2022, 7:24 AM
bipmis updated this revision to Diff 459444.Sep 12 2022, 6:47 AM

Add a limit of 64 instructions for Alias Analysis.
Concat AATags Metadata for merged loads.

bipmis added inline comments.Sep 12 2022, 6:50 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
646

The concat method leaves the tbaa blank, so maybe we want to drop the metadata altogether? Currently the ‘noalias’ and ‘alias.scope’ metadata will be concatenated from AAMDNodes.

dmgreen accepted this revision.Sep 15 2022, 12:38 AM

Thanks for the changes - as far as I understand, this LGTM now. Thanks for working through the details.

Any other comments?

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
646

I'm a little surprised that if the two tbaa infos are the same, we can't use the same on the result node. I think using concat sounds sensible though. I suspect in practice we will often be combining char in any case.

This revision is now accepted and ready to land.Sep 15 2022, 12:38 AM

There are a number of test failures in pre-merge checks, probably pipeline tests.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
909–910

Just like all the other analyses, this should use a reference, not a pointer (the analysis is not optional).

nikic added inline comments.Sep 15 2022, 1:10 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
488

Zext -> ZextType, maybe?

495

Iterative -> Recursive

567

This check should come first, otherwise you don't count store instructions.

591

I think this handles non-byte-sized loads incorrectly. Let's say you do i4 loads with 1 byte offset. Then PrevSize will be 1 and match the offset, even though the loads are not actually consecutive.

Please add some tests for non-byte-sized loads.
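As a sketch of the hazard (hypothetical IR, not from the patch): two i4 loads one byte apart. A byte-based size check alone would treat them as consecutive, even though the 4-bit values are not adjacent in memory, so the chain must not be folded to a single i8 load:

define i8 @load_i4_pair(ptr %p) {
  %p1 = getelementptr inbounds i8, ptr %p, i64 1
  %n0 = load i4, ptr %p, align 1        ; 4-bit load at byte offset 0
  %n1 = load i4, ptr %p1, align 1       ; 4-bit load at byte offset 1
  %e0 = zext i4 %n0 to i8
  %e1 = zext i4 %n1 to i8
  %s1 = shl i8 %e1, 4
  %r = or i8 %s1, %e0
  ret i8 %r
}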

624

Combine these declarations with initialization.

645

Why does this not use the IRBuilder?

867

The DataLayout fetch can be moved outside the loop.

spatel added inline comments.Sep 15 2022, 6:11 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
866–867

The transform should be placed ahead of foldSqrt(), or we may hit a use-after-free bug (because foldSqrt can delete I).
There was a code comment about this in:
https://github.com/llvm/llvm-project/commit/df868edee561eb973edd85ec9df41c67aa0bff6b
...but that patch got reverted. We should probably add that code comment independently (or fix the bug some other way).

bipmis updated this revision to Diff 460709.Sep 16 2022, 6:00 AM

Update the patch with review comments.
-> Support power-of-2 loads and a minimum load size of 8 bits.
-> IR builder for new load.
-> New test for a 4 bit load.
-> Nits.

bipmis marked 7 inline comments as done.Sep 16 2022, 6:05 AM
bipmis added inline comments.
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
567

So it would count stores which do not alias. For the case where it aliases we terminate anyway, so the count update won't matter?

591

Right. The shift condition will prevent it from merging.
But we do not want to combine loads smaller than a byte. Updated checks.

bipmis marked an inline comment as done.Sep 22 2022, 1:37 AM

Ping.
@nikic Any further comments on the patch?

nikic added a comment.Sep 22 2022, 1:49 AM

From a cursory look it looks fine, I assume @dmgreen has reviewed in detail :)

I have one note (that can be addressed later): There's currently a check that the loads have the same size. Does that mean that if you have a partially folded pattern (e.g. one i16 load plus two i8 loads) that it will not get folded? If so, this seems like something we should relax (later).

> From a cursory look it looks fine, I assume @dmgreen has reviewed in detail :)
>
> I have one note (that can be addressed later): There's currently a check that the loads have the same size. Does that mean that if you have a partially folded pattern (e.g. one i16 load plus two i8 loads) that it will not get folded? If so, this seems like something we should relax (later).

That's correct. Currently we fold only loads of the same size in a chain. So for the above case, if the two i8 loads belong to a single lower chain, they can be folded.
Agreed - this is more to do with a stricter basic implementation first. We can relax this later.
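For reference, a hypothetical partially folded input of the kind described above (an already-merged i16 low half plus two i8 loads for the upper bytes) would currently be left alone because of the same-size check:

define i32 @load32_partially_folded(ptr %p) {
  %p2 = getelementptr inbounds i8, ptr %p, i64 2
  %p3 = getelementptr inbounds i8, ptr %p, i64 3
  %lo = load i16, ptr %p, align 1       ; low two bytes already merged
  %b2 = load i8, ptr %p2, align 1
  %b3 = load i8, ptr %p3, align 1
  %e0 = zext i16 %lo to i32
  %e2 = zext i8 %b2 to i32
  %e3 = zext i8 %b3 to i32
  %s2 = shl i32 %e2, 16
  %s3 = shl i32 %e3, 24
  %o1 = or i32 %e0, %s2
  %r = or i32 %o1, %s3
  ret i32 %r
}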

Please ensure you precommit the new tests to trunk and then rebase this patch on it, so we see the effect of the patch on the IR.

This revision was landed with ongoing or failed builds.Sep 23 2022, 2:20 AM
This revision was automatically updated to reflect the committed changes.

When I do a 3-stage bootstrap at this commit, the second-stage compiler crashes. The issue does not appear at the previous commit, so I'm reverting this commit.

@gribozavr2 Do you have any useful build log that you can provide to speed up triage please?

bipmis added a comment.EditedSep 26 2022, 4:43 AM

@gribozavr2 Do let me know how to reproduce this issue. I am trying a 3-stage build with "clang/cmake/caches/3-stage.cmake". Linking the executable in the second stage takes quite some time even for a clean HEAD.
It builds fine on AArch64 and x86 with the patch; the stage2 and stage3 binaries are bit-exact.

I'm looking into coming up with a reduced repro.
We're seeing a crash in re2. The weird thing is that the compiler I've reproduced it with is not bootstrapped, but the crash in re2 happens even when compiling it with -O0. My guess is that something else we compile with the just-built clang, like libc++, is getting miscompiled, but more investigation is required.

it ended up being a function that we always compile with -O3...

anyway, reduced test case

$ cat /tmp/a.ll
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"                                                                                                       
target triple = "x86_64-grtev4-linux-gnu"                                                                                                                                                          
                                                                                                                                                                                                   
define i64 @eggs(ptr noundef readonly %arg) {                                                                                                                                                      
  %tmp3 = load i8, ptr %arg, align 1                                                             
  %tmp4 = getelementptr inbounds i8, ptr %arg, i64 1                                             
  %tmp5 = load i8, ptr %tmp4, align 1                                                            
  %tmp6 = getelementptr inbounds i8, ptr %arg, i64 2                                             
  %tmp7 = load i8, ptr %tmp6, align 1                                                            
  %tmp8 = getelementptr inbounds i8, ptr %arg, i64 3                                             
  %tmp9 = load i8, ptr %tmp8, align 1                                                            
  %tmp10 = getelementptr inbounds i8, ptr %arg, i64 4                                            
  %tmp11 = load i8, ptr %tmp10, align 1                                                          
  %tmp12 = getelementptr inbounds i8, ptr %arg, i64 5                                            
  %tmp13 = load i8, ptr %tmp12, align 1                                                          
  %tmp14 = getelementptr inbounds i8, ptr %arg, i64 6                                            
  %tmp15 = load i8, ptr %tmp14, align 1                                                          
  %tmp16 = getelementptr inbounds i8, ptr %arg, i64 7                                            
  %tmp17 = load i8, ptr %tmp16, align 1
  %tmp18 = zext i8 %tmp17 to i64     
  %tmp19 = shl nuw i64 %tmp18, 56    
  %tmp20 = zext i8 %tmp15 to i64     
  %tmp21 = shl nuw nsw i64 %tmp20, 48
  %tmp22 = or i64 %tmp19, %tmp21     
  %tmp23 = zext i8 %tmp13 to i64     
  %tmp24 = shl nuw nsw i64 %tmp23, 40
  %tmp25 = or i64 %tmp22, %tmp24     
  %tmp26 = zext i8 %tmp11 to i64     
  %tmp27 = shl nuw nsw i64 %tmp26, 32
  %tmp28 = or i64 %tmp25, %tmp27     
  %tmp29 = zext i8 %tmp9 to i64      
  %tmp30 = shl nuw nsw i64 %tmp29, 24
  %tmp31 = or i64 %tmp28, %tmp30     
  %tmp32 = zext i8 %tmp7 to i64      
  %tmp33 = shl nuw nsw i64 %tmp32, 16
  %tmp34 = zext i8 %tmp5 to i64     
  %tmp35 = shl nuw nsw i64 %tmp34, 8
  %tmp36 = or i64 %tmp31, %tmp33
  %tmp37 = zext i8 %tmp3 to i64 
  %tmp38 = or i64 %tmp36, %tmp35                                                                                                                                                                   
  %tmp39 = or i64 %tmp38, %tmp37                                                                                                                                                                   
  ret i64 %tmp39                                                                                                                                                                                   
}
$ ./build/rel/bin/opt -passes=aggressive-instcombine -S /tmp/a.ll
; ModuleID = '/tmp/b.ll'
source_filename = "/tmp/a.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-grtev4-linux-gnu"

define i64 @eggs(ptr noundef readonly %arg) {
  %tmp3 = load i8, ptr %arg, align 1
  %tmp4 = getelementptr inbounds i8, ptr %arg, i64 1
  %tmp5 = load i32, ptr %tmp4, align 1
  %1 = zext i32 %tmp5 to i64
  %2 = shl i64 %1, 8
  %tmp37 = zext i8 %tmp3 to i64
  %tmp39 = or i64 %2, %tmp37
  ret i64 %tmp39
}

that does look wrong, it looks like it should be optimized to load i64 rather than zext(load i32 (%a + 1)) | zext(load i8 %a)

aeubanks reopened this revision.Sep 26 2022, 11:08 AM
This revision is now accepted and ready to land.Sep 26 2022, 11:08 AM
bipmis added a comment.EditedSep 27 2022, 5:09 AM

(quoting the reduced test case from the comment above)

@aeubanks Thanks for the test case.
So in this patch we have implemented forward load merging, as can be seen from the test example. However, I had to handle the leaf node "or(load,load)", as InstCombine currently reverses its operands, as seen below. This needed a corrected check.

define i32 @Load32(ptr noundef %ptr) local_unnamed_addr #0 {
entry:
  %0 = load i8, ptr %ptr, align 1, !tbaa !5
  %conv = zext i8 %0 to i32
  %arrayidx1 = getelementptr inbounds i8, ptr %ptr, i64 1
  %1 = load i8, ptr %arrayidx1, align 1, !tbaa !5
  %conv2 = zext i8 %1 to i32
  %shl = shl i32 %conv2, 8
  %or = or i32 %conv, %shl        ; <-- note operand order
  %arrayidx3 = getelementptr inbounds i8, ptr %ptr, i64 2
  %2 = load i8, ptr %arrayidx3, align 1, !tbaa !5
  %conv4 = zext i8 %2 to i32
  %shl5 = shl i32 %conv4, 16
  %or6 = or i32 %or, %shl5
  %arrayidx7 = getelementptr inbounds i8, ptr %ptr, i64 3
  %3 = load i8, ptr %arrayidx7, align 1, !tbaa !5
  %conv8 = zext i8 %3 to i32
  %shl9 = shl i32 %conv8, 24
  %or10 = or i32 %or6, %shl9
  ret i32 %or10
}

*** IR Dump After InstCombinePass on Load32 ***
; Function Attrs: nounwind uwtable
define i32 @Load32(ptr noundef %ptr) local_unnamed_addr #0 {
entry:
  %0 = load i8, ptr %ptr, align 1, !tbaa !5
  %conv = zext i8 %0 to i32
  %arrayidx1 = getelementptr inbounds i8, ptr %ptr, i64 1
  %1 = load i8, ptr %arrayidx1, align 1, !tbaa !5
  %conv2 = zext i8 %1 to i32
  %shl = shl nuw nsw i32 %conv2, 8
  %or = or i32 %shl, %conv        ; <-- operands reversed by InstCombine
  %arrayidx3 = getelementptr inbounds i8, ptr %ptr, i64 2
  %2 = load i8, ptr %arrayidx3, align 1, !tbaa !5
  %conv4 = zext i8 %2 to i32
  %shl5 = shl nuw nsw i32 %conv4, 16
  %or6 = or i32 %or, %shl5
  %arrayidx7 = getelementptr inbounds i8, ptr %ptr, i64 3
  %3 = load i8, ptr %arrayidx7, align 1, !tbaa !5
  %conv8 = zext i8 %3 to i32
  %shl9 = shl nuw i32 %conv8, 24
  %or10 = or i32 %or6, %shl9
  ret i32 %or10
}

The full implementation of the reverse load merge pattern is planned as a follow-up. So for now you will see the lowest-node loads being merged but not the other ones. I am updating the patch; it would be great if you could test it.

bipmis updated this revision to Diff 463194.Sep 27 2022, 5:14 AM

Handle the reverse load pattern checks correctly.
Currently we need to handle the leaf-node reverse loads, as the InstCombine pass folds the pattern from a forward to a reverse one.
Full reverse load patterns are planned to be implemented subsequently.

spatel added inline comments.Sep 27 2022, 7:41 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
694–697

I didn't notice this limitation before. "Forward" and "reverse" are referring to the order that the or instructions use the loaded values?

I agree that we don't want to complicate the initial implementation any more than necessary, but it might help to see how the backend handles that in DAGCombiner::MatchLoadCombine() (see D133584 for a proposed enhancement to that code).

bipmis added inline comments.Sep 27 2022, 9:22 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
694–697

Right, it is the order:
Forward - or(or(or(0,1),2),3)
Reverse - or(or(or(3,2),1),0)
considering 0,1,... as the zext and shl of the loads with index 0,1, etc.

For simplicity we wanted to implement the forward case first, and if this looks good we can do the reverse and mixed-size loads. There should be minimal changes on top of this, so that is planned next.
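As a hypothetical illustration (not from the patch's tests), an i32 chain in the reverse order, where the highest-index bytes are combined first, would look like:

define i32 @load32_reverse(ptr %p) {
  %p1 = getelementptr inbounds i8, ptr %p, i64 1
  %p2 = getelementptr inbounds i8, ptr %p, i64 2
  %p3 = getelementptr inbounds i8, ptr %p, i64 3
  %b0 = load i8, ptr %p, align 1
  %b1 = load i8, ptr %p1, align 1
  %b2 = load i8, ptr %p2, align 1
  %b3 = load i8, ptr %p3, align 1
  %e0 = zext i8 %b0 to i32
  %e1 = zext i8 %b1 to i32
  %e2 = zext i8 %b2 to i32
  %e3 = zext i8 %b3 to i32
  %s1 = shl i32 %e1, 8
  %s2 = shl i32 %e2, 16
  %s3 = shl i32 %e3, 24
  %o1 = or i32 %s3, %s2       ; indices 3 and 2 combined first
  %o2 = or i32 %o1, %s1
  %r = or i32 %o2, %e0
  ret i32 %r
}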

Matt added a subscriber: Matt.Sep 27 2022, 11:37 AM
bipmis added inline comments.Sep 28 2022, 6:29 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
694–697

@spatel Verified DAGCombiner::MatchLoadCombine() handles the reverse load pattern fine and a single load is generated.

Please pre-commit the new tests with the baseline CHECK lines.
Also, it would be good to add a test that matches the reported failure case (the @eggs test from the earlier comment).
After that, I think it's fine to re-commit.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
694–697

Thanks for checking. It doesn't directly impact the initial/immediate patch here, but it might provide a template for the planned enhancements.

Kai added a subscriber: Kai.Oct 4 2022, 12:54 PM
Kai added inline comments.
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
816

FYI: The missing bitcast for the Load1Ptr argument means that this change only works with opaque pointers.

dstuttard added inline comments.
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
816

We had the same issue - I've just uploaded a patch. See D135249