This is an archive of the discontinued LLVM Phabricator instance.

[LV] Use lookThroughAnd with logical reductions
ClosedPublic

Authored by kmclaughlin on Jul 8 2021, 7:11 AM.

Details

Summary

If a reduction Phi has a single user which ANDs the Phi with a type mask,
lookThroughAnd will return the user of the Phi and the narrower type represented
by the mask. Currently this is only used for arithmetic reductions, whereas loops
containing logical reductions will create a reduction intrinsic using the widened
type, for example:

for.body:
  %phi = phi i32 [ %and, %for.body ], [ 255, %entry ]
  %mask = and i32 %phi, 255
  %gep = getelementptr inbounds i8, i8* %ptr, i32 %iv
  %load = load i8, i8* %gep
  %ext = zext i8 %load to i32
  %and = and i32 %mask, %ext
  ...

^ this will generate an and reduction intrinsic such as the following:

call i32 @llvm.vector.reduce.and.v8i32(<8 x i32>...)

The same example for an add instruction would create an intrinsic of type i8:

call i8 @llvm.vector.reduce.add.v8i8(<8 x i8>...)

This patch changes AddReductionVar to call lookThroughAnd for other integer
reductions, allowing loops similar to the example above with reductions such
as and, or & xor to vectorize.

Diff Detail

Event Timeline

kmclaughlin created this revision.Jul 8 2021, 7:11 AM
kmclaughlin requested review of this revision.Jul 8 2021, 7:11 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 8 2021, 7:11 AM

A nice patch! Good to see more vectorisation capability and code improvements here. Before I accept the patch it might be worth seeing what others think about the inloop tests?

llvm/test/Transforms/LoopVectorize/reduction-inloop-pred.ll
1598

Given the reduction intrinsic is no longer in the loop is it worth moving this to a different file? Any thoughts @dmgreen @spatel?

llvm/test/Transforms/LoopVectorize/reduction-inloop.ll
1065

Same comment as in reduction-inloop-pred.ll.

llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
120

This phi instruction hasn't been narrowed like it has for the cases above, but at least we do now vectorise this loop! Perhaps this is something for a future investigation?

198

nit: Is it worth removing the loop hints and just specifying -force-vector-interleave=1 -force-vector-width=8 on the RUN line?

dmgreen added inline comments.Jul 8 2021, 10:58 AM
llvm/test/Transforms/LoopVectorize/reduction-inloop-pred.ll
1598

I'm pretty sure these tests were added to make sure that the combination of inloop reductions and lookThroughAnd worked together correctly. If they still appear to be working, they should be OK to keep here to make sure that combo keeps working in the future.

llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
187

Are we sure that smin/smax work correctly with this? We go from cutting off the top bits (potentially turning a signed number into an unsigned number) to not doing that. Or maybe going from a signed min/max to a unsigned one, from the look of this test? I may be wrong, but I think something like this transform needs to be valid for the op: https://alive2.llvm.org/ce/z/5XJ4ZU.

umin/umax may still be OK, if the initial value from the phi is within range. But it doesn't reduce the vector width.

kmclaughlin marked an inline comment as done.
  • Added new tests for umax & smin
  • Changed the initial value of the reduction phi in @reduction_smax_trunc to be greater than the mask value
llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
187

I think the smax/smin tests are working correctly, as we are still performing an and on the phi before the load & compare instructions to mask the top bits. In this test for smax I was previously using an inital value for the phi which was smaller than the mask, which meant the and was removed by instcombine. I've changed the initial value to make sure we can check for the and in this test which hopefully makes it a bit clearer.

david-arm accepted this revision.Jul 12 2021, 8:50 AM

LGTM! Thanks for making the changes @kmclaughlin and I'm happy with your explanation for the smin/smax/umin/umax loops.

This revision is now accepted and ready to land.Jul 12 2021, 8:50 AM
dmgreen requested changes to this revision.Jul 12 2021, 9:51 AM

I still don't think this code is making correct assumption about smin at least.

llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
187

I'm still not sure this is correct. It's about the trunc(reduce.smin.v8i8(..)) performed on an i8, not an i32 like the original. A value with the '7th' bit set will change from being treated like a large positive value to a negative value. It seems that the code is assuming that the reduction can happen in the smaller type, even with signed min/max.

As for the initial value,

%sum.02p = phi i32 [ %min, %for.body ], [ 256, %entry ]
%sum.02 = and i32 %sum.02p, 255

with only one use on the phi is the same as using an input value of 0 (= 0ff & 0x100) from entry. I don't think any optimization will do that at the moment, but in principle it seems like it could.

This revision now requires changes to proceed.Jul 12 2021, 9:51 AM
david-arm added inline comments.Jul 13 2021, 1:02 AM
llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
187

Hi @dmgreen, ok I think I understand your concerns now with specific reference to the intrinsic after the loop. It wasn't obvious from your initial comment that's all. :) You're specifically worried not about the vector body (which I'm sure is correct), but the ordering of trunc and reduce.smin.v8i8. So basically for this to be correct it needs a minor tweak so that we do:

trunc(reduce.smin.v8i32) -> i8

instead, right? I think that would ensure we get the right answer.

  • Changed AddReductionVar so that for min & max reductions where lookThroughAnd has set the Start instruction to the user of the Phi, we do not change the return type. This ensures that we create the reduction with the wider type (e.g. i32) and truncate the result back to the smaller type (e.g. i8).

The test here might be OK. But what about when the type from the And mask and the reduction do not match?

Something like this, where the mask of the And is smaller than the bitwidth from the extend. The loads might also just not be extended, and the final result might not be truncated.

define i8 @reduction_umax_trunc(i8* noalias nocapture %ptr) {
; CHECK-LABEL: @reduction_umax_trunc(
; CHECK-NEXT:  entry:
; CHECK-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK:       vector.ph:
; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
; CHECK:       vector.body:
; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <8 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP6:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT:    [[TMP0:%.*]] = and <8 x i32> [[VEC_PHI]], <i32 127, i32 127, i32 127, i32 127, i32 127, i32 127, i32 127, i32 127>
; CHECK-NEXT:    [[TMP1:%.*]] = sext i32 [[INDEX]] to i64
; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i8, i8* [[PTR:%.*]], i64 [[TMP1]]
; CHECK-NEXT:    [[TMP3:%.*]] = bitcast i8* [[TMP2]] to <8 x i8>*
; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <8 x i8>, <8 x i8>* [[TMP3]], align 1
; CHECK-NEXT:    [[TMP4:%.*]] = zext <8 x i8> [[WIDE_LOAD]] to <8 x i32>
; CHECK-NEXT:    [[TMP5:%.*]] = icmp ugt <8 x i32> [[TMP0]], [[TMP4]]
; CHECK-NEXT:    [[TMP6]] = select <8 x i1> [[TMP5]], <8 x i32> [[TMP0]], <8 x i32> [[TMP4]]
; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
; CHECK-NEXT:    [[TMP7:%.*]] = icmp eq i32 [[INDEX_NEXT]], 256
; CHECK-NEXT:    br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP14:![0-9]+]]
; CHECK:       middle.block:
; CHECK-NEXT:    [[TMP8:%.*]] = call i32 @llvm.vector.reduce.umax.v8i32(<8 x i32> [[TMP6]])
; CHECK-NEXT:    [[EXTRACT_T:%.*]] = trunc i32 [[TMP8]] to i8
; CHECK-NEXT:    br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
; CHECK:       scalar.ph:
; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
; CHECK:       for.body:
; CHECK-NEXT:    br i1 undef, label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP15:![0-9]+]]
; CHECK:       for.end:
; CHECK-NEXT:    [[MIN_LCSSA_OFF0:%.*]] = phi i8 [ undef, [[FOR_BODY]] ], [ [[EXTRACT_T]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT:    ret i8 [[MIN_LCSSA_OFF0]]
;
entry:
  br label %for.body

for.body:
  %iv = phi i32 [ %iv.next, %for.body ], [ 0, %entry ]
  %sum.02p = phi i32 [ %max, %for.body ], [ 0, %entry ]
  %sum.02 = and i32 %sum.02p, 127
  %gep = getelementptr inbounds i8, i8* %ptr, i32 %iv
  %load = load i8, i8* %gep
  %ext = zext i8 %load to i32
  %icmp = icmp ugt i32 %sum.02, %ext
  %max = select i1 %icmp, i32 %sum.02, i32 %ext
  %iv.next = add i32 %iv, 1
  %exitcond = icmp eq i32 %iv.next, 256
  br i1 %exitcond, label %for.end, label %for.body

for.end:
  %ret = trunc i32 %max to i8
  ret i8 %ret
}

The order of max's is no longer the same, and the And will cut off bits from the value. So for example with values ptr[254] = 0x90 and ptr[255] = 0x70. In the scalar code it would have %max=0x90 on the penultimate iteration and on the final iteration %sum.02 = 0x90&0x7f = 0x10, %max=umax(0x10, 0x70) = 0x70, which is the final result. The vector version will compute its final iteration with exiting %max values <.., .., 0x90, 0x70>. This then goes into the vector.reduce.umax, producing 0x90.

It feel conceptually odd to me to treat max(and(x, mask), y) the same as max(x, y). They make sense for add and mul, and I think for and/or/xor too providing it's the bottom bits that are demanded. It may be possible to add an extra And with a mask in to get the same results. But I'm not sure there isn't anything else that would go wrong, considering how subtle the issues are here! And the various assumption other code would end up making.

Do the min/max cases come up in practice from C code?

llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
86

Can you change the start value to 0. Otherwise the loop doesn't really do much, other than return 65535.

Hi @dmgreen, I think the particular problem you're worried about may not happen. This is because @kmclaughlin's tests also include the flags -dce -instcombine, which I think may have eliminated a redundant and instruction after the select that deals with this. It might be worth @kmclaughlin just creating a test with the mask 0x7f (127) in your example and seeing what the output looks like? The additional flags do make the IR tidier, but sometimes hide the work done by the vectoriser to ensure correctness.

kmclaughlin marked an inline comment as done.
kmclaughlin edited the summary of this revision. (Show Details)
  • Changed the condition in IVDescriptiors before lookThroughAnd from isArithmeticRecurrenceKind to !isMinMaxRecurrenceKind
  • Updated the tests for min/max to ensure these are not vectorized

Hi @dmgreen, thank you for taking another look at this & for explaining the problem with min & max here again. I think I understand now :)
I tried your example with a mask of 127 without -instcombine as @david-arm suggested, and there is no additional and after the select which would ensure we get the correct result on the final iteration.

The min/max cases haven't come up in practice from C code. I've changed the condition in IVDescriptors so that the only other instructions which we call lookThroughAnd for are and, or & xor.

david-arm added inline comments.Jul 16 2021, 8:39 AM
llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
151

nit: Hi @kmclaughlin, sorry to be a pain, but perhaps it's worth having a single CHECK line for all the min/max tests, i.e.

CHECK-NOT: vector.body
CHECK-NOT: <8 x

instead of checking for the same scalar loop?

dmgreen accepted this revision.Jul 16 2021, 8:46 AM

Thanks. That makes sense to me. The And/Or/Xor do sound like they should be OK.

LGTM. It sounds like David has some test suggestions, but consider my objections lifted.

This revision is now accepted and ready to land.Jul 16 2021, 8:46 AM
kmclaughlin marked an inline comment as done.
  • Removed CHECK lines for the scalar loop from the min/max tests in trunc-reductions.ll and replaced with the following:
CHECK-NOT: vector.body
CHECK-NOT: <8 x
CHECK: ret
This revision was landed with ongoing or failed builds.Jul 21 2021, 1:56 AM
This revision was automatically updated to reflect the committed changes.

This patch appears to have broken the sanitizer-windows bot: https://lab.llvm.org/buildbot/#/builders/127/builds/14307

******************** TEST 'AddressSanitizer-x86_64-windows :: TestCases/Windows/recalloc_sanity.cpp' FAILED ********************
Exit Code: 1318
# command output:
   Creating library C:\b\slave\sanitizer-windows\build\stage1\projects\compiler-rt\test\asan\X86_64WindowsConfig\TestCases\Windows\Output\recalloc_sanity.cpp.tmp.lib and object C:\b\slave\sanitizer-windows\build\stage1\projects\compiler-rt\test\asan\X86_64WindowsConfig\TestCases\Windows\Output\recalloc_sanity.cpp.tmp.exp
LINK : fatal error LNK1318: Unexpected PDB error; FORMAT (11) ''
# command stderr:
clang-cl: error: linker command failed with exit code 1318 (use -v to see invocation)
error: command failed with exit status: 1318

Could you please take a look?

This patch appears to have broken the sanitizer-windows bot: https://lab.llvm.org/buildbot/#/builders/127/builds/14307

******************** TEST 'AddressSanitizer-x86_64-windows :: TestCases/Windows/recalloc_sanity.cpp' FAILED ********************
Exit Code: 1318
# command output:
   Creating library C:\b\slave\sanitizer-windows\build\stage1\projects\compiler-rt\test\asan\X86_64WindowsConfig\TestCases\Windows\Output\recalloc_sanity.cpp.tmp.lib and object C:\b\slave\sanitizer-windows\build\stage1\projects\compiler-rt\test\asan\X86_64WindowsConfig\TestCases\Windows\Output\recalloc_sanity.cpp.tmp.exp
LINK : fatal error LNK1318: Unexpected PDB error; FORMAT (11) ''
# command stderr:
clang-cl: error: linker command failed with exit code 1318 (use -v to see invocation)
error: command failed with exit status: 1318

Could you please take a look?

Hi @morehouse,
I have reverted this patch for now in case this is causing the failures, although I believe the sanitizer-windows bot is still failing with the reverted changes: https://lab.llvm.org/buildbot/#/builders/127/builds/14324
Looking at the LINK errors here, I think this can happen due to stale files which weren't deleted from a previous run and it's possible this may not be related to my patch?

Thanks for reverting. Indeed this patch was not the culprit. I clobbered the build directory on the bot and we're passing again.

Sorry for the false alarm!

spatel added inline comments.Jul 21 2021, 8:59 AM
llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
238

It's independent of this patch, but I noticed that all of the affected tests have this kind of mask op on the phi.
I think -instcombine will eliminate that op, so -loop-vectorize will not typically see this pattern, and we still get the wide reduction op. Is there a plan to account for that?

kmclaughlin reopened this revision.Jul 26 2021, 7:51 AM
kmclaughlin added inline comments.
llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
238

Hi @spatel, thanks for taking a look at this! I think it is possible for -loop-vectorize to see this pattern - I've tried using this patch with some small examples such as the following:

short foo(short *values, long long N) {
  int sum = 0;
  for(long long i=0; i<N; ++i) {
    sum &= 65535;
    sum |= (int)values[i];
  }
  return (short)sum;
}

I don't believe -instcombine is eliminating the and before -loop-vectorize with the above example, and we do get the narrower reduction op after -loop-vectorize with these changes.

This revision is now accepted and ready to land.Jul 26 2021, 7:51 AM
spatel added inline comments.Jul 28 2021, 5:19 AM
llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
238

Thanks for checking. I was seeing the mask op disappear in some of the reduced IR examples, but that may be due to over-simplification in tests.