This is an archive of the discontinued LLVM Phabricator instance.

[X86] Prefer vpmovq2m over vpternlogd + vpcmpgtq
Needs ReviewPublic

Authored by davezarzycki on Jun 7 2021, 8:26 AM.

Details

Summary

Let's undo earlier optimizations that turn < 0 comparisons into > -1. This enables us to use vpmovq2m instead of vpternlogd + vpcmpgtq.

Diff Detail

Event Timeline

davezarzycki created this revision.Jun 7 2021, 8:26 AM
davezarzycki requested review of this revision.Jun 7 2021, 8:26 AM

I'm surprised that this didn't break existing tests. Where is the right place to update the test suite?

RKSimon retitled this revision from Prefer vpmovq2m over vpternlogd + vpcmpgtq to [X86] Prefer vpmovq2m over vpternlogd + vpcmpgtq.

Actually, wait. Something weird is going on at the mid-level. These two functions should generate the same optimized IR, right?

typedef int V __attribute__((vector_size(64)));

V lt_zero_x_y(V mask, V x, V y) { return mask <  0 ? x : y; }
V ge_zero_y_x(V mask, V x, V y) { return mask >= 0 ? y : x; }
xbolva00 added a subscriber: xbolva00.EditedJun 7 2021, 9:59 AM

Actually, wait. Something weird is going on at the mid-level. These two functions should generate the same optimized IR, right?

typedef int V __attribute__((vector_size(64)));

V lt_zero_x_y(V mask, V x, V y) { return mask <  0 ? x : y; }
V ge_zero_y_x(V mask, V x, V y) { return mask >= 0 ? y : x; }

Should but currently we have a different IR. No canonicalization? Wow.

Plus, just cursious why with

typedef int V __attribute__((vector_size(4)));

we produce

define dso_local i32 @_Z11lt_zero_x_yDv1_iS_S_(i32 %0, i32 %1, i32 %2) local_unnamed_addr #0 {
  %4 = insertelement <1 x i32> poison, i32 %0, i32 0
  %5 = insertelement <1 x i32> poison, i32 %1, i32 0
  %6 = insertelement <1 x i32> poison, i32 %2, i32 0
  %7 = icmp sgt <1 x i32> %4, <i32 -1>
  %8 = select <1 x i1> %7, <1 x i32> %6, <1 x i32> %5
  %9 = extractelement <1 x i32> %8, i32 0
  ret i32 %9
}

Why not scalarize it on IR level?

Actually, wait. Something weird is going on at the mid-level. These two functions should generate the same optimized IR, right?

typedef int V __attribute__((vector_size(64)));

V lt_zero_x_y(V mask, V x, V y) { return mask <  0 ? x : y; }
V ge_zero_y_x(V mask, V x, V y) { return mask >= 0 ? y : x; }

I'm not sure we canonicalize select operand order.

vpmovq2m only exists with avx512dq. Is this transform still useful with just avx512f?

Actually, wait. Something weird is going on at the mid-level. These two functions should generate the same optimized IR, right?

typedef int V __attribute__((vector_size(64)));

V lt_zero_x_y(V mask, V x, V y) { return mask <  0 ? x : y; }
V ge_zero_y_x(V mask, V x, V y) { return mask >= 0 ? y : x; }

I'm not sure we canonicalize select operand order.

vpmovq2m only exists with avx512dq. Is this transform still useful with just avx512f?

On AVX512F? Sure. VPXOR has better code density than VPTERNLOG.

craig.topper added inline comments.Jun 7 2021, 10:20 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
22802

I'm a little confused. the vpmovq2m patter is (setgt imm_allzeros, X). Does this SETGE get canonicalized again?

davezarzycki added inline comments.Jun 7 2021, 10:29 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
22802

Oh, sorry about that. I originally made a logic error and converted > -1 to < 0 instead of >= 0; but at that time, I did verify that VPMOVQ2M was generated. You are right that the desired instruction is not generated with SETGE now that the logic error is fixed. In any case, I'd like to update the tests ahead of any code change. Is there a preferred file to put the current suboptimal code gen?

craig.topper added inline comments.Jun 7 2021, 11:08 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
22802

avx512-vec-cmp.ll seems like a decent file to put it in.

I filed a bug about the IR not being canonicalized. Let's wait for that before proceeding on generating VPMOVQ2M:

https://bugs.llvm.org/show_bug.cgi?id=50605

@davezarzycki What's going on the way phab is reporting the path in Revision Contents pane? llvm/lib/Target/X86 -> i/llvm/lib/Target/X86 ?!?

@davezarzycki What's going on the way phab is reporting the path in Revision Contents pane? llvm/lib/Target/X86 -> i/llvm/lib/Target/X86 ?!?

That's just git-diff. I thought Phab handled that but I guess not. From the documetation:

diff.mnemonicPrefix
If set, 'git diff' uses a prefix pair that is different from the standard "a/" and "b/" depending on what is being compared. When this configuration is in effect, reverse diff output also swaps the order of the prefixes:

git diff
compares the (i)ndex and the (w)ork tree;

git diff HEAD
compares a (c)ommit and the (w)ork tree;

git diff --cached
compares a (c)ommit and the (i)ndex;

git diff HEAD:file1 file2
compares an (o)bject and a (w)ork tree entity;

git diff --no-index a b
compares two non-git things (1) and (2).

@davezarzycki Are you still looking at this? What about PR50605?