This patch removes the CmpInst-specific code for vectorization of its operands and adds a general solution instead. It may improve compile time for code with many compare instructions.
lib/Transforms/Vectorize/SLPVectorizer.cpp, lines 4860–4865:

The code is not equivalent.
In any case, the comment above needs to be updated.
lib/Transforms/Vectorize/SLPVectorizer.cpp, lines 4860–4865:

You're not quite correct about it.
The first top-to-bottom analysis breaks vectorization in some cases (for example, for future min/max reductions, and maybe in some cases for binops too).
We already have tests for these (or similar) situations: Transforms/SLPVectorizer/X86/compare-reduce.ll, Transforms/SLPVectorizer/X86/horizontal-list.ll and Transforms/SLPVectorizer/X86/in-tree-user.ll.
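For concreteness, the problematic shape is roughly a cmp/select min/max reduction chain. Here is a hypothetical sketch (mine, not copied from the tests named above) of the kind of tree a future min/max reduction matcher would want to consume whole, fcmps included, so a cmp-specific top-to-bottom scan claiming the fcmps first could break it:

```
; Hypothetical sketch: a scalar fmin reduction expressed as an
; fcmp/select chain. A min/max reduction matcher would be rooted at
; %m3 and needs to see the whole chain, including the fcmps.
define float @fmin4(float* %p) {
  %g1 = getelementptr inbounds float, float* %p, i64 1
  %g2 = getelementptr inbounds float, float* %p, i64 2
  %g3 = getelementptr inbounds float, float* %p, i64 3
  %a0 = load float, float* %p, align 4
  %a1 = load float, float* %g1, align 4
  %a2 = load float, float* %g2, align 4
  %a3 = load float, float* %g3, align 4
  %c1 = fcmp olt float %a1, %a0
  %m1 = select i1 %c1, float %a1, float %a0
  %c2 = fcmp olt float %a2, %m1
  %m2 = select i1 %c2, float %a2, float %m1
  %c3 = fcmp olt float %a3, %m2
  %m3 = select i1 %c3, float %a3, float %m2
  ret float %m3
}
```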
lib/Transforms/Vectorize/SLPVectorizer.cpp, lines 4860–4865:

Re 2 - sounds good.

```
%cmp = icmp eq i32 %foo, %bar
%cmp2 = icmp eq i32 %foo2, %bar2
%or = or i1 %cmp, %cmp2
br i1 %or, label %l1, label %l2
```

Then yes, you'll call vectorizeRootInstruction() on %or, and that's fine, because it can start with any binary operator and look through the cmps. But a cmp doesn't have to feed a binary operator at all. E.g.

```
%cmp = icmp eq i32 %foo, %bar
%z = zext i1 %cmp to i32
```

Or

```
%cmp = icmp eq i32 %foo, %bar
call @foo(i1 %cmp)
```

etc.

I agree that what we do now is pretty silly - we start from a fairly arbitrary subset of possible roots. But I don't think we should reduce that set further. Or did I miss something about how this works?
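For concreteness, here is a hypothetical whole-function version of the first snippet above; the loads and the %A/%B pointers are invented, arranged so that the cmp operands form consecutive-load pairs that a root walk starting at %or could reach by looking through the cmps:

```
; Hypothetical sketch: starting at %or and looking through %cmp/%cmp2,
; the operand bundles {%a0,%a1} and {%b0,%b1} are consecutive loads,
; so a vectorizable tree is reachable from the binary-operator root.
define void @or_of_cmps(i32* %A, i32* %B) {
entry:
  %a1p = getelementptr inbounds i32, i32* %A, i64 1
  %b1p = getelementptr inbounds i32, i32* %B, i64 1
  %a0 = load i32, i32* %A, align 4
  %a1 = load i32, i32* %a1p, align 4
  %b0 = load i32, i32* %B, align 4
  %b1 = load i32, i32* %b1p, align 4
  %cmp = icmp eq i32 %a0, %b0
  %cmp2 = icmp eq i32 %a1, %b1
  %or = or i1 %cmp, %cmp2
  br i1 %or, label %l1, label %l2
l1:
  ret void
l2:
  ret void
}
```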
lib/Transforms/Vectorize/SLPVectorizer.cpp, lines 4860–4865:

```
%cmp = icmp eq i32 %foo, %bar
%z = zext i1 %cmp to i32
```

still must feed some instruction, such as a return, a store, or something similar. Otherwise it is unused, will be removed from the code, and we don't need to perform analysis of this CmpInst at all.
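As a hypothetical completion of that snippet (the store and %out are invented for illustration): once %z feeds a store, the cmp is reachable from that user instead of having to be a root itself.

```
; Hypothetical completion: for %z to survive dead-code elimination it
; must reach a store, return, or similar user; analysis starting from
; that user can then find the cmp.
define void @cmp_through_zext(i32 %foo, i32 %bar, i32* %out) {
  %cmp = icmp eq i32 %foo, %bar
  %z = zext i1 %cmp to i32
  store i32 %z, i32* %out, align 4
  ret void
}
```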
Sorry, I'm failing to communicate the example I have in mind.
Here it is, concretely:
```
declare void @bar(i1)

define void @foo(i32* %A, i32 %k, i32 %n) {
  %idx0 = getelementptr inbounds i32, i32* %A, i64 0
  %idx4 = getelementptr inbounds i32, i32* %A, i64 4
  %load0 = load i32, i32* %idx0, align 8
  %load4 = load i32, i32* %idx4, align 8
  %mul0 = mul i32 %load0, %k
  %mul4 = mul i32 %load4, %k
  %res = add i32 %mul0, %mul4
  %cmp = icmp eq i32 %res, %n
  call void @bar(i1 %cmp)
  ret void
}
```
With the current code, we get:
```
$ bin/opt -slp-vectorizer < ~/llvm/temp/cmpslp.ll -S -o - -slp-threshold=-10

declare void @bar(i1)

define void @foo(i32* %A, i32 %k, i32 %n) {
  %idx0 = getelementptr inbounds i32, i32* %A, i64 0
  %idx4 = getelementptr inbounds i32, i32* %A, i64 4
  %load0 = load i32, i32* %idx0, align 8
  %load4 = load i32, i32* %idx4, align 8
  %1 = insertelement <2 x i32> undef, i32 %k, i32 0
  %2 = insertelement <2 x i32> %1, i32 %k, i32 1
  %3 = insertelement <2 x i32> undef, i32 %load0, i32 0
  %4 = insertelement <2 x i32> %3, i32 %load4, i32 1
  %5 = mul <2 x i32> %2, %4
  %6 = extractelement <2 x i32> %5, i32 0
  %7 = extractelement <2 x i32> %5, i32 1
  %res = add i32 %6, %7
  %cmp = icmp eq i32 %res, %n
  call void @bar(i1 %cmp)
  ret void
}
```
The new code will not be able to vectorize this.
I agree with you that (a) what we do now is generally pretty bad, and (b) we handle this case more or less by accident.
But this patch is not NFC, and it has the potential to regress these kinds of cases.
Michael, I understand this.
What should I do then? Prepare a patch with vectorization of CallInst args and then prepare an NFC patch for CmpInst? Or do you have something different in mind?
> What should I do then?
Short term - maybe nothing?
Is this patch blocking anything? I understand this is part of the work to support min/max reductions, but why is it necessary? Can we go forward with that without regressing any existing cases?
Longer term - it would probably be good to try to come up with a saner, or at least more principled, way to do root selection, one that also doesn't cause us to look at instructions several times. I don't think adding more ad-hoc cases (CallInst) is the way to go; I'm fairly sure we can come up with other examples like this.