
[SLP] Generalization of vectorization of CmpInst operands, NFC.
AbandonedPublic

Authored by ABataev on Feb 2 2017, 6:26 AM.

Details

Summary

This patch removes the CmpInst-specific code for vectorization of its operands and adds a general solution. It may improve compile time for code with many compare instructions.

Diff Detail

Event Timeline

ABataev created this revision.Feb 2 2017, 6:26 AM
mkuper added inline comments.Feb 2 2017, 11:42 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4860–4865

The code is not equivalent.

  1. A cmp doesn't have to feed a branch. You can have two icmps feeding an or, for instance, with the i1 being used by a branch. Why would we want to avoid vectorizing those? Or do we already fail to vectorize when the icmp feeds anything but a branch?
  2. On the flip side, you can have a branch that is fed by an "or (icmp, icmp)", or any other source of i1. I assume this will get filtered out in tryToVectorize() but it'd be good to have a test.

In any case, the comment above needs to be updated.

ABataev marked an inline comment as done.Feb 3 2017, 3:27 AM
ABataev added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
4860–4865

You're not quite right about this.

  1. Yes, of course. But that situation will be handled by the vectorizeRootInstruction() function. It calls canBeVectorized(), which traverses all sub-operations of the initial instruction (up to a tree height of RecursionMaxDepth). So the current version of the code actually does the same work twice: the first time when we perform the top-to-bottom analysis of all instructions in vectorizeChainsInBlock(), and the second time when we perform the bottom-to-top analysis in vectorizeRootInstruction().

This first top-to-bottom analysis breaks vectorization in some cases (for example, for future min/max reductions, and maybe in some cases for binops too).

  2. Your assumption is correct; this situation is handled by the tryToVectorize() function.

We already have tests for these (or similar) situations in Transforms/SLPVectorizer/X86/compare-reduce.ll, Transforms/SLPVectorizer/X86/horizontal-list.ll, and Transforms/SLPVectorizer/X86/in-tree-user.ll.
I'll update the comment.
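For context, the min/max reductions mentioned above are typically expressed in IR as icmp/select chains with no branch involved, so a top-to-bottom scan keyed on cmp-feeds-branch never considers them. A minimal hypothetical sketch (all value names are illustrative, not from the patch):

```llvm
; Hypothetical min-reduction tail: icmp/select pairs whose i1 results
; feed selects, not a branch, so a branch-condition-only root scan
; would never look at these compares.
%c1 = icmp slt i32 %a, %b
%m1 = select i1 %c1, i32 %a, i32 %b
%c2 = icmp slt i32 %m1, %d
%min = select i1 %c2, i32 %m1, i32 %d
```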

ABataev updated this revision to Diff 86948.Feb 3 2017, 4:02 AM
ABataev marked an inline comment as done.

Updated comment

mkuper added inline comments.Feb 3 2017, 3:49 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4860–4865

Re 2 - sounds good.
Re 1 - I'm not quite convinced. It will work if the cmp is in the middle of a tree and there's a root instruction somewhere below it, but that doesn't have to be the case.
For example, if you have:

%cmp = icmp eq i32 %foo, %bar
%cmp2 = icmp eq i32 %foo2, %bar2
%or = or i1 %cmp, %cmp2
br i1 %or, label %l1, label %l2

Then yes, you'll call vectorizeRootInstruction() on %or, and that's fine, because it can start from any binary operator and look through the cmps.
But what if the cmp doesn't feed a branch at all?

E.g.

%cmp = icmp eq i32 %foo, %bar
%z = zext i1 %cmp to i32

Or

%cmp = icmp eq i32 %foo, %bar
call @foo(i1 %cmp)

etc.

I agree that what we do now is pretty silly - we start from a fairly arbitrary subset of possible roots. But I don't think we should reduce that set further.

Or did I miss something about how this works?

ABataev added inline comments.Feb 6 2017, 1:26 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4860–4865
  1. Michael, the first example:
%cmp = icmp eq i32 %foo, %bar
%z = zext i1 %cmp to i32

still must feed some instruction like a return, a store, or something similar. Otherwise it is unused, will be removed from the code, and we don't need to perform any analysis of this CmpInst at all.

  2. Calls must also feed some other instructions. Of course, sometimes we may have standalone calls (for example, ones returning void or with an ignored result), but in that case the BinOps used as arguments of the CallInst won't be analyzed either. We just need to teach the vectorizer about this situation.
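To make that point concrete, here is a hedged sketch (the store and the %out name are hypothetical, not from the review) of how the zext example would actually have to appear, with the result reaching a user:

```llvm
%cmp = icmp eq i32 %foo, %bar
%z = zext i1 %cmp to i32
; The zext result cannot just dangle: it must reach a store, a return,
; or some other user, or the whole chain is dead and gets deleted.
store i32 %z, i32* %out
```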
mkuper edited edge metadata.Feb 6 2017, 2:49 PM

Sorry, I'm failing to communicate the example I have in mind.
Here it is, concretely:

declare void @bar(i1)

define void @foo(i32* %A, i32 %k, i32 %n) {
  %idx0 = getelementptr inbounds i32, i32* %A, i64 0
  %idx4 = getelementptr inbounds i32, i32* %A, i64 4
  %load0 = load i32, i32* %idx0, align 8
  %load4 = load i32, i32* %idx4, align 8
  %mul0 = mul i32 %load0, %k
  %mul4 = mul i32 %load4, %k
  %res = add i32 %mul0, %mul4
  %cmp = icmp eq i32 %res, %n
  call void @bar(i1 %cmp)
  ret void
}

With the current code, we get:
$ bin/opt -slp-vectorizer < ~/llvm/temp/cmpslp.ll -S -o - -slp-threshold=-10

declare void @bar(i1)

define void @foo(i32* %A, i32 %k, i32 %n) {
  %idx0 = getelementptr inbounds i32, i32* %A, i64 0
  %idx4 = getelementptr inbounds i32, i32* %A, i64 4
  %load0 = load i32, i32* %idx0, align 8
  %load4 = load i32, i32* %idx4, align 8
  %1 = insertelement <2 x i32> undef, i32 %k, i32 0
  %2 = insertelement <2 x i32> %1, i32 %k, i32 1
  %3 = insertelement <2 x i32> undef, i32 %load0, i32 0
  %4 = insertelement <2 x i32> %3, i32 %load4, i32 1
  %5 = mul <2 x i32> %2, %4
  %6 = extractelement <2 x i32> %5, i32 0
  %7 = extractelement <2 x i32> %5, i32 1
  %res = add i32 %6, %7
  %cmp = icmp eq i32 %res, %n
  call void @bar(i1 %cmp)
  ret void
}

The new code will not be able to vectorize this.

I agree with you that (a) what we do now is generally pretty bad, and (b) we handle this case more or less by accident.
But this patch is not NFC, and it has the potential to regress these kinds of cases.

Michael, I understand this.
What should I do then? Prepare a patch that vectorizes CallInst args first, and then an NFC patch for CmpInst? Or do you have something different in mind?

mkuper added a comment.Feb 7 2017, 1:08 PM

What should I do then?

Short term - maybe nothing?
Is this patch blocking anything? I understand this is part of the work to support min/max reductions, but why is it necessary? Can we go forward with that without regressing any existing cases?

Longer term - it would probably be good to come up with a saner, or at least more principled, way to do root selection that also doesn't cause us to look at instructions several times. I don't think adding more ad-hoc cases (CallInst) is the way to go. I'm fairly sure we can come up with other examples like this.

RKSimon resigned from this revision.Feb 8 2017, 3:41 AM
RKSimon added a subscriber: RKSimon.
ABataev abandoned this revision.Feb 10 2017, 7:26 AM