Differential D67799
[InstCombine] Fold a shifty implementation of clamp-to-zero
Authored by huihuiz on Sep 19 2019, 11:28 PM

Details

Fold
  and(ashr(subNSW(X, V), ScalarSizeInBits - 1), V)
into
  V s> X ? V : 0
https://rise4fun.com/Alive/0Mi

Folding the shift into a select enables more optimization, e.g., vmax generation for the ARM target.
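(For readers following along, here is a minimal standalone C sketch of the equivalence being folded; the function names and test values are illustrative, not part of the patch:)

```c
#include <stdio.h>

/* Shift form: and(ashr(sub nsw(X, V), 31), V) for 32-bit values.
   If V s> X, the difference X - V is negative, the arithmetic shift by 31
   smears the sign bit into an all-ones mask, and the AND returns V;
   otherwise the mask is all zeros and the result is 0.  Widening to
   64-bit emulates the nsw (no-signed-wrap) guarantee.  (Arithmetic right
   shift of negative values is implementation-defined in C but matches
   LLVM's ashr on mainstream compilers.) */
static int clamp_shift(int x, int v) {
    return (int)((((long long)x - v) >> 63) & v);
}

/* Select form the fold produces: V s> X ? V : 0. */
static int clamp_select(int x, int v) {
    return v > x ? v : 0;
}

int main(void) {
    const int tests[][2] = {{0, 5}, {0, -5}, {3, 3}, {-7, 2}, {10, 4}};
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; ++i) {
        int x = tests[i][0], v = tests[i][1];
        printf("X=%3d V=%3d  shift=%3d  select=%3d\n",
               x, v, clamp_shift(x, v), clamp_select(x, v));
    }
    return 0;
}
```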
Event Timeline

Comment:
E.g., vmax generation for the ARM target.

test.c:

  static __inline int clamp0(int v) {
    return ((-(v) >> 31) & (v));
  }

  void foo(const unsigned char *src0, const unsigned char *src1,
           unsigned char *dst, int width) {
    int i;
    for (i = 0; i < width; ++i) {
      const int b = src0[0];
      const int b_sub = src1[0];
      dst[0] = clamp0(b - b_sub);
      src0++;
      src1++;
      dst++;
    }
  }

Run:

  clang -cc1 -triple armv8.1a-linux-gnu -target-abi apcs-gnu -target-feature +neon -vectorize-loops -vectorize-slp -O2 -S test-clamp0.c -o -

Before this optimization, "vneg + vshr + vand" is generated instead.
Comment:
We need to confirm that the backend produces better asm for at least a few in-tree targets before/after this transform. Please attach output for x86 and AArch64. We'll want to have examples for scalar and vector code, so you probably need to suppress the vectorizers.
Comment:
For the X86, AArch64, and ARM targets, the backend produces better asm with this transformation. Please refer to the examples below.
X86 target:

Test input; run command:
  clang -O2 -target x86_64 -march=skylake -S clamp0.ll -o -

  define i32 @clamp0(i32 %v) {
    %sub = sub nsw i32 0, %v
    %shr = ashr i32 %sub, 31
    %and = and i32 %shr, %v
    ret i32 %and
  }

Before:
  clamp0:                  # @clamp0
  # %bb.0:
    movl %edi, %eax
    negl %eax
    sarl $31, %eax
    andl %edi, %eax
    retq

After this optimization:
  clamp0:                  # @clamp0
  # %bb.0:
    movl %edi, %eax
    sarl $31, %eax
    andnl %edi, %eax, %eax
    retq

AArch64 target:

Before:
  clamp0:                  // @clamp0
  // %bb.0:
    neg w8, w0
    and w0, w0, w8, asr #31
    ret

After this optimization:
  clamp0:                  // @clamp0
  // %bb.0:
    bic w0, w0, w0, asr #31
    ret

ARM target:

Before:
  clamp0:
    .fnstart
  @ %bb.0:
    rsb r1, r0, #0
    and r0, r0, r1, asr #31
    bx lr

After this optimization:
  clamp0:
    .fnstart
  @ %bb.0:
    bic r0, r0, r0, asr #31
    bx lr
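(Side note on why a single andnl/bic suffices above: with X = 0 the select form `v s> 0 ? v : 0` is the same as masking v with the complement of its own smeared sign bit, `v & ~(v >> 31)`. A small illustrative C sketch, not from the patch:)

```c
#include <assert.h>
#include <stdint.h>

/* v & ~(v >> 31): the arithmetic shift smears the sign bit, so the
   complemented mask is all ones for v >= 0 (keep v) and all zeros for
   v < 0 (yield 0) -- one AND-NOT, exactly the andnl/bic above.
   (Arithmetic shift of a negative value is implementation-defined in C
   but matches LLVM's ashr on mainstream compilers.) */
static int32_t clamp0_andnot(int32_t v) {
    return v & ~(v >> 31);
}

int main(void) {
    assert(clamp0_andnot(7) == 7);
    assert(clamp0_andnot(0) == 0);
    assert(clamp0_andnot(-7) == 0);
    return 0;
}
```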
X86 target:

Test input:
  define <4 x i32> @clamp0-vec(<4 x i32> %v) {
    %sub = sub nsw <4 x i32> zeroinitializer, %v
    %shr = ashr <4 x i32> %sub, <i32 31, i32 31, i32 31, i32 31>
    %and = and <4 x i32> %shr, %v
    ret <4 x i32> %and
  }

Before:
  "clamp0-vec":            # @clamp0-vec
  # %bb.0:
    vpxor %xmm1, %xmm1, %xmm1
    vpsubd %xmm0, %xmm1, %xmm1
    vpsrad $31, %xmm1, %xmm1
    vpand %xmm0, %xmm1, %xmm0
    retq

After this optimization:
  "clamp0-vec":            # @clamp0-vec
  # %bb.0:
    vpxor %xmm1, %xmm1, %xmm1
    vpmaxsd %xmm1, %xmm0, %xmm0
    retq

AArch64 target:

Before:
  "clamp0-vec":            // @clamp0-vec
  // %bb.0:
    neg v1.4s, v0.4s
    sshr v1.4s, v1.4s, #31
    and v0.16b, v1.16b, v0.16b
    ret

After this optimization:
  "clamp0-vec":            // @clamp0-vec
  // %bb.0:
    movi v1.2d, #0000000000000000
    smax v0.4s, v0.4s, v1.4s
    ret

ARM target:

Before:
  "clamp0-vec":
    .fnstart
    vmov d17, r2, r3
    vmov d16, r0, r1
    vneg.s32 q9, q8
    vshr.s32 q9, q9, #31
    vand q8, q9, q8
    vmov r0, r1, d16
    vmov r2, r3, d17
    bx lr

After this optimization:
  "clamp0-vec":
    .fnstart
    vmov d17, r2, r3
    vmov d16, r0, r1
    vmov.i32 q9, #0x0
    vmax.s32 q8, q8, q9
    vmov r0, r1, d16
    vmov r2, r3, d17
    bx lr
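(For the vector case, the "after" sequences are just a per-lane max(v, 0). As a reference point, the same clamp written with SSE4.1 intrinsics in C — illustrative, not from the patch; compile with -msse4.1:)

```c
#include <assert.h>
#include <smmintrin.h>  /* SSE4.1: _mm_max_epi32 */
#include <stdint.h>

/* Per-lane clamp-to-zero as max(v, 0); this is what the vpxor + vpmaxsd
   pair in the "after" Skylake output computes. */
static __m128i clamp0_vec(__m128i v) {
    return _mm_max_epi32(v, _mm_setzero_si128());
}

int main(void) {
    __m128i r = clamp0_vec(_mm_setr_epi32(-3, 0, 7, -1));
    int32_t out[4];
    _mm_storeu_si128((__m128i *)out, r);
    assert(out[0] == 0 && out[1] == 0 && out[2] == 7 && out[3] == 0);
    return 0;
}
```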
Comment:
Ah, finally got it. There is a more general fold here:

  Name: sub_ashr_and_nsw
    %sub = sub nsw i8 %X, %v
    %ashr = ashr i8 %sub, 7
    %r = and i8 %ashr, %v
  =>
    %cmp = icmp sle i8 %v, %X
    %r = select i1 %cmp, i8 0, i8 %v

Comment:
Yes, all asm diffs look good to me. DAGCombiner knows how to convert a select with a '0' false operand into something better ('max' or 'and not' instructions). I'm not sure that will be true for the more general fold, though, so more testing will be needed for those patterns.
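(Side note: the more general i8 fold quoted above can be sanity-checked exhaustively in plain C; the harness below is illustrative and not part of the patch. The loop skips (X, v) pairs where the i8 subtraction would wrap, mirroring the nsw precondition of the Alive proof.)

```c
#include <assert.h>
#include <stdint.h>

int main(void) {
    /* Exhaustively check: and(ashr(sub nsw(X, v), 7), v) == (v s> X ? v : 0)
       for every i8 pair where the subtraction does not wrap (the nsw
       precondition).  Shifting a negative value right is implementation-
       defined in C but matches LLVM's ashr on mainstream compilers. */
    for (int x = -128; x <= 127; ++x) {
        for (int v = -128; v <= 127; ++v) {
            int diff = x - v;              /* computed in int: never wraps */
            if (diff < -128 || diff > 127)
                continue;                  /* i8 'sub nsw' would wrap: skip */
            int8_t shift_form  = (int8_t)((((int8_t)diff) >> 7) & v);
            int8_t select_form = (int8_t)(v > x ? v : 0);
            assert(shift_form == select_form);
        }
    }
    return 0;
}
```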
Comment:
llvm-mca results for the more general folding pattern:

X86, Skylake (cmovgl latency 1)

Test input; run:
  clang clampNegToZero.ll -O2 -target x86_64 -march=skylake -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=skylake

  define i32 @clamp0(i32 %v, i32 %x) {
    %sub = sub nsw i32 %x, %v
    %shr = ashr i32 %sub, 31
    %and = and i32 %shr, %v
    ret i32 %and
  }

Instruction Info legend, applying to all llvm-mca tables below:
[1] #uOps, [2] Latency, [3] RThroughput, (U) HasSideEffects.

Before:
  Iterations:        100
  Instructions:      500
  Total Cycles:      159
  Total uOps:        700
  Dispatch Width:    6
  uOps Per Cycle:    4.40
  IPC:               3.14
  Block RThroughput: 1.2

  [1] [2] [3]
   1   1  0.25     movl %esi, %eax
   1   1  0.25     subl %edi, %eax
   1   1  0.50     sarl $31, %eax
   1   1  0.25     andl %edi, %eax
   3   7  1.00  U  retq

After this transformation:
  Iterations:        100
  Instructions:      400
  Total Cycles:      110
  Total uOps:        600
  Dispatch Width:    6
  uOps Per Cycle:    5.45
  IPC:               3.64
  Block RThroughput: 1.0

  [1] [2] [3]
   1   0  0.17     xorl %eax, %eax
   1   1  0.25     cmpl %esi, %edi
   1   1  0.50     cmovgl %edi, %eax
   3   7  1.00  U  retq

X86, Cooper Lake (cmovgl latency also 1)

Same input; run:
  clang clampNegToZero.ll -O2 -target x86_64 -march=cooperlake -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=cooperlake

Before:
  Iterations:        100
  Instructions:      500
  Total Cycles:      159
  Total uOps:        700
  Dispatch Width:    6
  uOps Per Cycle:    4.40
  IPC:               3.14
  Block RThroughput: 1.2

  [1] [2] [3]
   1   1  0.25     movl %esi, %eax
   1   1  0.25     subl %edi, %eax
   1   1  0.50     sarl $31, %eax
   1   1  0.25     andl %edi, %eax
   3   7  1.00  U  retq

After this transformation:
  Iterations:        100
  Instructions:      400
  Total Cycles:      110
  Total uOps:        600
  Dispatch Width:    6
  uOps Per Cycle:    5.45
  IPC:               3.64
  Block RThroughput: 1.0

  [1] [2] [3]
   1   0  0.17     xorl %eax, %eax
   1   1  0.25     cmpl %esi, %edi
   1   1  0.50     cmovgl %edi, %eax
   3   7  1.00  U  retq

AMD:

Before:
  Iterations:        100
  Instructions:      500
  Total Cycles:      155
  Total uOps:        600
  Dispatch Width:    4
  uOps Per Cycle:    3.87
  IPC:               3.23
  Block RThroughput: 1.5

  [1] [2] [3]
   1   1  0.25     movl %esi, %eax
   1   1  0.25     subl %edi, %eax
   1   1  0.25     sarl $31, %eax
   1   1  0.25     andl %edi, %eax
   2   1  0.50  U  retq

After this transformation:
  Iterations:        100
  Instructions:      400
  Total Cycles:      203
  Total uOps:        500
  Dispatch Width:    4
  uOps Per Cycle:    2.46
  IPC:               1.97
  Block RThroughput: 1.3

  [1] [2] [3]
   1   1  0.25     xorl %eax, %eax
   1   1  0.25     cmpl %esi, %edi
   1   1  0.25     cmovgl %edi, %eax
   2   1  0.50  U  retq

AArch64, cortex-a57 (csel latency 1)

Before:
  Iterations:        100
  Instructions:      300
  Total Cycles:      303
  Total uOps:        300
  Dispatch Width:    3
  uOps Per Cycle:    0.99
  IPC:               0.99
  Block RThroughput: 1.0

  [1] [2] [3]
   1   1  0.50     sub w8, w1, w0
   1   2  1.00     and w0, w0, w8, asr #31
   1   1  1.00  U  ret

After this transformation:
  Iterations:        100
  Instructions:      300
  Total Cycles:      203
  Total uOps:        300
  Dispatch Width:    3
  uOps Per Cycle:    1.48
  IPC:               1.48
  Block RThroughput: 1.0

  [1] [2] [3]
   1   1  0.50     cmp w0, w1
   1   1  0.50     csel w0, w0, wzr, gt
   1   1  1.00  U  ret
Test input (vector):
  define <4 x i32> @clamp0-vec(<4 x i32> %v, <4 x i32> %x) {
    %sub = sub nsw <4 x i32> %x, %v
    %shr = ashr <4 x i32> %sub, <i32 31, i32 31, i32 31, i32 31>
    %and = and <4 x i32> %shr, %v
    ret <4 x i32> %and
  }

X86, Skylake

Before:
  Iterations:        100
  Instructions:      400
  Total Cycles:      303
  Total uOps:        600
  Dispatch Width:    6
  uOps Per Cycle:    1.98
  IPC:               1.32
  Block RThroughput: 1.0

  [1] [2] [3]
   1   1  0.33     vpsubd %xmm0, %xmm1, %xmm1
   1   1  0.50     vpsrad $31, %xmm1, %xmm1
   1   1  0.33     vpand %xmm0, %xmm1, %xmm0
   3   7  1.00  U  retq

After this transformation:
  Iterations:        100
  Instructions:      300
  Total Cycles:      203
  Total uOps:        500
  Dispatch Width:    6
  uOps Per Cycle:    2.46
  IPC:               1.48
  Block RThroughput: 1.0

  [1] [2] [3]
   1   1  0.50     vpcmpgtd %xmm1, %xmm0, %xmm1
   1   1  0.33     vpand %xmm0, %xmm1, %xmm0
   3   7  1.00  U  retq

AMD, znver2

Before:
  Iterations:        100
  Instructions:      400
  Total Cycles:      303
  Total uOps:        500
  Dispatch Width:    4
  uOps Per Cycle:    1.65
  IPC:               1.32
  Block RThroughput: 1.3

  [1] [2] [3]
   1   1  0.25     vpsubd %xmm0, %xmm1, %xmm1
   1   1  0.25     vpsrad $31, %xmm1, %xmm1
   1   1  0.25     vpand %xmm0, %xmm1, %xmm0
   2   1  0.50  U  retq

After this transformation:
  Iterations:        100
  Instructions:      300
  Total Cycles:      203
  Total uOps:        400
  Dispatch Width:    4
  uOps Per Cycle:    1.97
  IPC:               1.48
  Block RThroughput: 1.0

  [1] [2] [3]
   1   1  0.25     vpcmpgtd %xmm1, %xmm0, %xmm1
   1   1  0.25     vpand %xmm0, %xmm1, %xmm0
   2   1  0.50  U  retq

AArch64, cortex-a57

Before:
  Iterations:        100
  Instructions:      400
  Total Cycles:      903
  Total uOps:        400
  Dispatch Width:    3
  uOps Per Cycle:    0.44
  IPC:               0.44
  Block RThroughput: 1.5

  [1] [2] [3]
   1   3  0.50     sub v1.4s, v1.4s, v0.4s
   1   3  0.50     sshr v1.4s, v1.4s, #31
   1   3  0.50     and v0.16b, v1.16b, v0.16b
   1   1  1.00  U  ret

After this transformation:
  Iterations:        100
  Instructions:      300
  Total Cycles:      603
  Total uOps:        300
  Dispatch Width:    3
  uOps Per Cycle:    0.50
  IPC:               0.50
  Block RThroughput: 1.0

  [1] [2] [3]
   1   3  0.50     cmgt v1.4s, v0.4s, v1.4s
   1   3  0.50     and v0.16b, v0.16b, v1.16b
   1   1  1.00  U  ret

Comment:
Another note: on older-generation X86 targets, e.g., Haswell, cmov indeed has latency 2,
but it is still able to achieve a comparable uOps-per-cycle rate:

Before:
  Iterations:        100
  Instructions:      500
  Total Cycles:      210
  Total uOps:        700
  Dispatch Width:    4
  uOps Per Cycle:    3.33
  IPC:               2.38
  Block RThroughput: 1.8

  [1] [2] [3]
   1   1  0.25     movl %esi, %eax
   1   1  0.25     subl %edi, %eax
   1   1  0.50     sarl $31, %eax
   1   1  0.25     andl %edi, %eax
   3   7  1.00  U  retq

After:
  Iterations:        100
  Instructions:      400
  Total Cycles:      209
  Total uOps:        700
  Dispatch Width:    4
  uOps Per Cycle:    3.35
  IPC:               1.91
  Block RThroughput: 1.8

  [1] [2] [3]
   1   0  0.25     xorl %eax, %eax
   1   1  0.25     cmpl %esi, %edi
   2   2  0.50     cmovgl %edi, %eax
   3   7  1.00  U  retq

Comment:
Thanks, looks good to me.
Comment:
cc'ing @craig.topper @RKSimon @andreadb in case the use of x86 'cmov' has any pitfalls that we're not seeing so far.

Comment:
This is just FYI: llvm-mca results for AMD btver2 and bdver2.

AMD btver2

Before:
  Iterations:        100
  Instructions:      500
  Total Cycles:      256
  Total uOps:        500
  Dispatch Width:    2
  uOps Per Cycle:    1.95
  IPC:               1.95
  Block RThroughput: 2.5

  [1] [2] [3]
   1   1  0.50     movl %esi, %eax
   1   1  0.50     subl %edi, %eax
   1   1  0.50     sarl $31, %eax
   1   1  0.50     andl %edi, %eax
   1   4  1.00  U  retq

After:
  Iterations:        100
  Instructions:      400
  Total Cycles:      206
  Total uOps:        400
  Dispatch Width:    2
  uOps Per Cycle:    1.94
  IPC:               1.94
  Block RThroughput: 2.0

  [1] [2] [3]
   1   0  0.50     xorl %eax, %eax
   1   1  0.50     cmpl %esi, %edi
   1   1  0.50     cmovgl %edi, %eax
   1   4  1.00  U  retq

AMD bdver2

Before:
  Iterations:        100
  Instructions:      500
  Total Cycles:      455
  Total uOps:        500
  Dispatch Width:    4
  uOps Per Cycle:    1.10
  IPC:               1.10
  Block RThroughput: 4.0

  [1] [2] [3]
   1   1  1.00     movl %esi, %eax
   1   1  1.00     subl %edi, %eax
   1   1  1.00     sarl $31, %eax
   1   1  1.00     andl %edi, %eax
   1   5  1.50  U  retq

After:
  Iterations:        100
  Instructions:      400
  Total Cycles:      208
  Total uOps:        400
  Dispatch Width:    4
  uOps Per Cycle:    1.92
  IPC:               1.92
  Block RThroughput: 1.5

  [1] [2] [3]
   1   0  0.25     xorl %eax, %eax
   1   1  1.00     cmpl %esi, %edi
   1   1  0.50     cmovgl %edi, %eax
   1   5  1.50  U  retq
Comment:
maybe just
?