This is an archive of the discontinued LLVM Phabricator instance.

[X86] Expand transform (icmp eq/ne (ABS A), C) -> (and/or (icmp eq/ne A, C), (icmp eq/ne A, -C))
AbandonedPublic

Authored by goldstein.w.n on Jan 25 2023, 10:14 PM.

Details

Reviewers
pengfei
RKSimon
Summary

This also makes sense if A is a vector type with i64 elements but
the target doesn't have avx512 but has avx2/sse4.1 (for ymm/xmm respectively).

In that case ABS will expand with 3 instructions `blendv(A, sub(set0,
A))` so its better to just to transform the version with fewer/faster
instructions.

Diff Detail

Event Timeline

goldstein.w.n created this revision.Jan 25 2023, 10:14 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 25 2023, 10:14 PM
goldstein.w.n requested review of this revision.Jan 25 2023, 10:14 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 25 2023, 10:14 PM
goldstein.w.n added a comment.EditedJan 25 2023, 10:20 PM

Should this be merged with D142345?
Also this has a habit causing extra constants to manifest, maybe not worth it? Or could guard behind a check that the necessary constant nodes already exist. Seems worth it but maybe not.

RKSimon added inline comments.Feb 5 2023, 9:18 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
53450

SSE41 implies SSE2

RKSimon added inline comments.Feb 5 2023, 9:20 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
53461

Doesn't this need to apply to the vector cases as well?

goldstein.w.n added inline comments.Feb 5 2023, 9:29 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
53461

Doesn't this need to apply to the vector cases as well?

No, for the scalar case we have a little hack to handle (X == Pow2C || X == -Pow2C) which is the only case this is really worth it.

For the vector case if there is no fast abs its preferable for an arbitrary C to do X == C || X == -C as opposed to Abs(X) == Abs(C).

goldstein.w.n marked an inline comment as done.Feb 5 2023, 9:37 AM

Remove unnecessary sse2 check + fixup commit

RKSimon added inline comments.Feb 11 2023, 9:36 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
53461

In which case, could we move the isConstOrConstSplat check inside the isScalarInteger test and replace the -CInt with getNode() / FoldConstantArithmetic() ?

goldstein.w.n added inline comments.Feb 11 2023, 2:17 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
53461

What do you mean by this? As in don't go through APInt to get the value of C/-C? Is there an issue how with it is now?

pengfei added inline comments.Feb 11 2023, 7:59 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
53465

whereas

53465

The comment is not clear to me. Can you refactor it?

53466

is

llvm/test/CodeGen/X86/icmp-abs-C-vec.ll
104–107

I doubt if this is beneficial. The transform neither reduces instructions nor improves throughput, but it introduces extra memory load. WDYT?

goldstein.w.n added inline comments.Feb 11 2023, 10:37 PM
llvm/test/CodeGen/X86/icmp-abs-C-vec.ll
104–107

It's not a lot more memory, only 8 more bytes for the broadcast. If the new constant micro-fused with the vpcmp then it would be +32 bytes but save a true instruction.

Also note vblendvpd is 2 uops, not 1.

But I see the point. Think it would generally make sense as in a loop the load can be hoisted in which case vpcmpeq + vpor is better than vpsub + vblendvpd but granted not by much.

Could make this transform only happen if -C already exists as a node in the DAG, you think that preferable?

RKSimon added inline comments.Feb 12 2023, 3:35 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
53461

As it is now it will only work for RHS with uniform/splat values - but there doesn't appear to be any reason for the vector case not to work for non-uniform cases

Also I'm curious how well this could work on pre-SSSE3 codegen - where we don't have PABS instructions at all

Also I'm curious how well this could work on pre-SSSE3 codegen - where we don't have PABS instructions at all

Pre ssse3 abs gets the following codegen:

0000000000000000 <abs_v2i64>:
   0:	66 0f 6f c8          	movdqa %xmm0,%xmm1
   4:	66 0f 72 e1 1f       	psrad  $0x1f,%xmm1
   9:	66 0f 70 c9 f5       	pshufd $0xf5,%xmm1,%xmm1
   e:	66 0f ef c1          	pxor   %xmm1,%xmm0
  12:	66 0f fb c1          	psubq  %xmm1,%xmm0
  16:	c3                   	ret    
  17:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
  1e:	00 00 

0000000000000020 <abs_v4i32>:
  20:	66 0f 6f c8          	movdqa %xmm0,%xmm1
  24:	66 0f 72 e1 1f       	psrad  $0x1f,%xmm1
  29:	66 0f ef c1          	pxor   %xmm1,%xmm0
  2d:	66 0f fa c1          	psubd  %xmm1,%xmm0
  31:	c3                   	ret    
  32:	66 2e 0f 1f 84 00 00 	cs nopw 0x0(%rax,%rax,1)
  39:	00 00 00 
  3c:	0f 1f 40 00          	nopl   0x0(%rax)

0000000000000040 <abs_v8i16>:
  40:	66 0f ef c9          	pxor   %xmm1,%xmm1
  44:	66 0f f9 c8          	psubw  %xmm0,%xmm1
  48:	66 0f ee c1          	pmaxsw %xmm1,%xmm0
  4c:	c3                   	ret    
  4d:	0f 1f 00             	nopl   (%rax)

0000000000000050 <abs_v16i8>:
  50:	66 0f ef c9          	pxor   %xmm1,%xmm1
  54:	66 0f f8 c8          	psubb  %xmm0,%xmm1
  58:	66 0f da c1          	pminub %xmm1,%xmm0
  5c:	c3                   	ret

Probably makes sense for i64 but for the rest will make this only if the node already exists.

Also I'm curious how well this could work on pre-SSSE3 codegen - where we don't have PABS instructions at all

Pre ssse3 abs gets the following codegen:

0000000000000000 <abs_v2i64>:
   0:	66 0f 6f c8          	movdqa %xmm0,%xmm1
   4:	66 0f 72 e1 1f       	psrad  $0x1f,%xmm1
   9:	66 0f 70 c9 f5       	pshufd $0xf5,%xmm1,%xmm1
   e:	66 0f ef c1          	pxor   %xmm1,%xmm0
  12:	66 0f fb c1          	psubq  %xmm1,%xmm0
  16:	c3                   	ret    
  17:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
  1e:	00 00 

0000000000000020 <abs_v4i32>:
  20:	66 0f 6f c8          	movdqa %xmm0,%xmm1
  24:	66 0f 72 e1 1f       	psrad  $0x1f,%xmm1
  29:	66 0f ef c1          	pxor   %xmm1,%xmm0
  2d:	66 0f fa c1          	psubd  %xmm1,%xmm0
  31:	c3                   	ret    
  32:	66 2e 0f 1f 84 00 00 	cs nopw 0x0(%rax,%rax,1)
  39:	00 00 00 
  3c:	0f 1f 40 00          	nopl   0x0(%rax)

0000000000000040 <abs_v8i16>:
  40:	66 0f ef c9          	pxor   %xmm1,%xmm1
  44:	66 0f f9 c8          	psubw  %xmm0,%xmm1
  48:	66 0f ee c1          	pmaxsw %xmm1,%xmm0
  4c:	c3                   	ret    
  4d:	0f 1f 00             	nopl   (%rax)

0000000000000050 <abs_v16i8>:
  50:	66 0f ef c9          	pxor   %xmm1,%xmm1
  54:	66 0f f8 c8          	psubb  %xmm0,%xmm1
  58:	66 0f da c1          	pminub %xmm1,%xmm0
  5c:	c3                   	ret

Probably makes sense for i64 but for the rest will make this only if the node already exists.

For the time being, think I'm going to drop this commit, doesn't seem to be much gain. Might revisit later.

goldstein.w.n abandoned this revision.May 12 2023, 3:16 PM