This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Fold a shifty implementation of clamp-to-zero.
ClosedPublic

Authored by huihuiz on Sep 19 2019, 11:28 PM.

Details

Summary

Fold

and(ashr(subNSW(X, V), ScalarSizeInBits - 1), V)

into

V s> X ? V : 0

https://rise4fun.com/Alive/0Mi

Folding the shift into a select enables more optimizations, e.g., vmax generation for the ARM target.
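
A minimal sketch of the matcher and rewrite described above, using the PatternMatch helpers discussed in the inline comments; the variable names, the guard on the shift amount, and the exact placement inside InstCombineAndOrXor.cpp are illustrative assumptions, not the literal patch code:

Value *X, *V;
const APInt *ShAmt;
Type *Ty = I.getType();
// Match and(ashr(subNSW(X, V), BitWidth - 1), V); only the ashr is
// required to be one-use, and the 'and' is matched commutatively.
if (match(&I, m_c_And(m_OneUse(m_AShr(m_NSWSub(m_Value(X), m_Value(V)),
                                      m_APInt(ShAmt))),
                      m_Deferred(V))) &&
    *ShAmt == Ty->getScalarSizeInBits() - 1) {
  // Rewrite to: V s> X ? V : 0.
  Value *NewICmpInst = Builder.CreateICmpSGT(V, X);
  return SelectInst::Create(NewICmpInst, V, ConstantInt::getNullValue(Ty));
}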

Diff Detail

Event Timeline

huihuiz created this revision.Sep 19 2019, 11:28 PM

E.g., vmax generation for the ARM target:

test-clamp0.c

static __inline int clamp0(int v) {
  // (-v >> 31) is all-ones when v > 0 and zero when v <= 0,
  // so the AND yields v for positive inputs and 0 otherwise.
  return ((-(v) >> 31) & (v));
}

void foo(const unsigned char* src0,
         const unsigned char* src1,
         unsigned char* dst,
                       int width) {
  int i;
  for (i = 0; i < width; ++i) {
    const int b = src0[0];
    const int b_sub = src1[0];
    dst[0] = clamp0(b - b_sub);
    src0 ++;
    src1 ++;
    dst ++;
  }
}

Run: clang -cc1 -triple armv8.1a-linux-gnu -target-abi apcs-gnu -target-feature +neon -vectorize-loops -vectorize-slp -O2 -S test-clamp0.c -o -
and you can see the "vmax" optimization in the output.

Before this optimization, "vneg + vshr + vand" is generated instead.

lebedev.ri added inline comments.Sep 20 2019, 1:28 AM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
1933–1936
  1. Only m_AShr has to be one-use
  2. This doesn't actually deal with commutativity correctly

You want

match(&I, m_c_And(m_OneUse(m_AShr(m_NSWSub(m_Zero(),
                                           m_Specific(V)),
                                  m_APInt(ShAmt))),
                  m_Value(V)))
lebedev.ri added inline comments.Sep 20 2019, 1:41 AM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
1940

Hmm, super random thought.
@spatel, we convert code that was written without a branch, likely very intentionally,
into possibly-branchy code. Should we not add 'unpredictable' to this new switch?
I think it's almost a correctness question.

lebedev.ri added inline comments.Sep 20 2019, 3:09 AM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
1933–1936

Err,

match(&I, m_c_And(m_OneUse(m_AShr(m_NSWSub(m_Zero(),
                                           m_Specific(V)),
                                  m_APInt(ShAmt))),
                  m_Deferred(V)))

of course

1940

s/switch/select/

We need to confirm that the backend produces better asm for at least a few in-tree targets before/after this transform. Please attach output for x86 and AArch64. We'll want to have examples for scalar and vector code, so you probably need to suppress the vectorizers.

llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
1940

It's not a matter of correctness, but I agree that we do not want to end up with branchy code when the source used tricky bit-hacks almost certainly to avoid branching.

But adding 'unpredictable' is not a solution here AFAICT because we're not creating a branch or switch. We could add explicit profile metadata to the select to indicate the compare is 50/50, but that doesn't necessarily imply unpredictable.
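
For illustration of that idea only (this patch does not do it), attaching an explicit 50/50 weight to the new select could look roughly like the sketch below. MDBuilder::createBranchWeights and the !prof metadata kind are existing LLVM APIs, but Sel here is an assumed pointer to the newly created SelectInst:

#include "llvm/IR/MDBuilder.h"

// Hypothetical sketch: annotate the select's condition as 50/50.
MDBuilder MDB(Sel->getContext());
Sel->setMetadata(LLVMContext::MD_prof, MDB.createBranchWeights(1, 1));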

huihuiz marked 2 inline comments as done.EditedSep 20 2019, 4:35 PM

For the X86, AArch64, and ARM targets, the backend produces better asm with this transformation. Please refer to the examples below:

  • Scalar Test ---

X86 target:

Test input; Run command: clang -O2 -target x86_64 -march=skylake -S clamp0.ll -o -

define i32 @clamp0(i32 %v) {
  %sub = sub nsw i32 0, %v
  %shr = ashr i32 %sub, 31
  %and = and i32 %shr, %v
  ret i32 %and
}

before

clamp0:                                 # @clamp0
# %bb.0:
        movl    %edi, %eax
        negl    %eax
        sarl    $31, %eax
        andl    %edi, %eax
        retq

After this optimization

clamp0:                                 # @clamp0
# %bb.0:
        movl    %edi, %eax
        sarl    $31, %eax
        andnl   %edi, %eax, %eax
        retq

AArch64 target:
Same test input; Run command: clang -O2 -target aarch64 -march=armv8a -S clamp0.ll -o -

before

clamp0:                                 // @clamp0
// %bb.0:
        neg     w8, w0
        and     w0, w0, w8, asr #31
        ret

After this optimization

clamp0:                                 // @clamp0
// %bb.0:
        bic     w0, w0, w0, asr #31
        ret

ARM target:
Same input; run: clang -O2 -target arm -march=armv8.1a -S clamp0.ll -o -
before:

clamp0:
        .fnstart
@ %bb.0:
        rsb     r1, r0, #0
        and     r0, r0, r1, asr #31
        bx      lr

After this optimization

clamp0:
        .fnstart
@ %bb.0:
        bic     r0, r0, r0, asr #31
        bx      lr

  • Vector Test ---

X86 target:
Test input; Run command: clang -O2 -target x86_64 -march=skylake -S clamp0-vec.ll -o -

define <4 x i32> @clamp0-vec(<4 x i32> %v) {
  %sub = sub nsw <4 x i32> zeroinitializer, %v
  %shr = ashr <4 x i32> %sub, <i32 31, i32 31, i32 31, i32 31>
  %and = and <4 x i32> %shr, %v
  ret <4 x i32> %and
}

before

"clamp0-vec":                           # @clamp0-vec
# %bb.0:
        vpxor   %xmm1, %xmm1, %xmm1
        vpsubd  %xmm0, %xmm1, %xmm1
        vpsrad  $31, %xmm1, %xmm1
        vpand   %xmm0, %xmm1, %xmm0
        retq

After this optimization

"clamp0-vec":                           # @clamp0-vec
# %bb.0:
        vpxor   %xmm1, %xmm1, %xmm1
        vpmaxsd %xmm1, %xmm0, %xmm0
        retq

AArch64 target:
Same test input; Run: clang -O2 -target aarch64 -march=armv8a -S clamp0-vec.ll -o -
before

"clamp0-vec":                           // @clamp0-vec
// %bb.0:
        neg     v1.4s, v0.4s
        sshr    v1.4s, v1.4s, #31
        and     v0.16b, v1.16b, v0.16b
        ret

After this optimization

"clamp0-vec":                           // @clamp0-vec
// %bb.0:
        movi    v1.2d, #0000000000000000
        smax    v0.4s, v0.4s, v1.4s
        ret

ARM target
Same input; Run: clang -target arm-arm-none-eabi -mcpu=cortex-a57 -mfpu=neon-fp-armv8 -O2 -S clamp0-vec.ll -o -
before

"clamp0-vec":
        .fnstart
        vmov    d17, r2, r3
        vmov    d16, r0, r1
        vneg.s32        q9, q8
        vshr.s32        q9, q9, #31
        vand    q8, q9, q8
        vmov    r0, r1, d16
        vmov    r2, r3, d17
        bx      lr

After this optimization

"clamp0-vec":
        .fnstart
        vmov    d17, r2, r3
        vmov    d16, r0, r1
        vmov.i32        q9, #0x0
        vmax.s32        q8, q8, q9
        vmov    r0, r1, d16
        vmov    r2, r3, d17
        bx      lr
huihuiz updated this revision to Diff 221145.Sep 20 2019, 6:04 PM

Resolved review feedback.

lebedev.ri accepted this revision.Sep 21 2019, 2:40 AM

Please change clamp0 everywhere to 'clamp negative to zero'; it wasn't obvious what clamp0 means until reading all of the patch.
This looks ok otherwise. Please wait for @spatel to comment.

For the X86, AArch64, and ARM targets, the backend produces better asm with this transformation. Please refer to the examples below:

I'd agree. @spatel ?

llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
1927–1928

maybe just

// and(ashr(subNSW(0, V), ScalarSizeInBits -1), V) --> V s< 0 ? 0 : V

?

1937–1939

Let's emit what we get in the tests?

Value *NewICmpInst =
    Builder.CreateICmpSGT(V, ConstantInt::getNullValue(Ty));
return SelectInst::Create(NewICmpInst, V, ConstantInt::getNullValue(Ty));
This revision is now accepted and ready to land.Sep 21 2019, 2:40 AM
lebedev.ri requested changes to this revision.Sep 21 2019, 8:02 AM

Ah, finally got it. There is a more general fold here:

Name: sub_ashr_and_nsw
  %sub = sub nsw i8 %X, %v
  %ashr = ashr i8 %sub, 7
  %r = and i8 %ashr, %v
=>
  %cmp = icmp sle i8 %v, %X
  %r = select i1 %cmp, i8 0, i8 %v

https://rise4fun.com/Alive/urO

This revision now requires changes to proceed.Sep 21 2019, 8:02 AM

Please change clamp0 everywhere to 'clamp negative to zero'; it wasn't obvious what clamp0 means until reading all of the patch.
This looks ok otherwise. Please wait for @spatel to comment.

For the X86, AArch64, and ARM targets, the backend produces better asm with this transformation. Please refer to the examples below:

I'd agree. @spatel ?

Yes, all asm diffs look good to me. DAGCombiner knows how to convert a select with a '0' false operand into something better ('max' or 'and not' instructions). I'm not sure that will hold for the more general fold though, so more testing will be needed for those patterns.

huihuiz updated this revision to Diff 221257.Sep 23 2019, 12:13 AM
huihuiz marked 2 inline comments as done.
huihuiz retitled this revision from [InstCombine] Fold a shifty implementation of clamp0. to [InstCombine] Fold a shifty implementation of clamp negative to zero..
huihuiz edited the summary of this revision. (Show Details)

Make the fold more general.

llvm-mca results for the more general fold pattern:

  • Scalar Tests ---

X86: Skylake, cmovgl latency 1

Test input; run: clang clampNegToZero.ll -O2 -target x86_64 -march=skylake -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=skylake

define i32 @clamp0(i32 %v, i32 %x) {
  %sub = sub nsw i32 %x, %v
  %shr = ashr i32 %sub, 31
  %and = and i32 %shr, %v
  ret i32 %and
}

Before:

Iterations:        100
Instructions:      500
Total Cycles:      159
Total uOps:        700

Dispatch Width:    6
uOps Per Cycle:    4.40
IPC:               3.14
Block RThroughput: 1.2


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        movl  %esi, %eax
 1      1     0.25                        subl  %edi, %eax
 1      1     0.50                        sarl  $31, %eax
 1      1     0.25                        andl  %edi, %eax
 3      7     1.00                  U     retq

After this transformation:

Iterations:        100
Instructions:      400
Total Cycles:      110
Total uOps:        600

Dispatch Width:    6
uOps Per Cycle:    5.45
IPC:               3.64
Block RThroughput: 1.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      0     0.17                        xorl  %eax, %eax
 1      1     0.25                        cmpl  %esi, %edi
 1      1     0.50                        cmovgl        %edi, %eax
 3      7     1.00                  U     retq

X86: Cooper Lake, cmovgl latency also 1

Same input; run: clang clampNegToZero.ll -O2 -target x86_64 -march=cooperlake -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=cooperlake

before

Iterations:        100
Instructions:      500
Total Cycles:      159
Total uOps:        700

Dispatch Width:    6
uOps Per Cycle:    4.40
IPC:               3.14
Block RThroughput: 1.2


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        movl  %esi, %eax
 1      1     0.25                        subl  %edi, %eax
 1      1     0.50                        sarl  $31, %eax
 1      1     0.25                        andl  %edi, %eax
 3      7     1.00                  U     retq

After this transformation:

Iterations:        100
Instructions:      400
Total Cycles:      110
Total uOps:        600

Dispatch Width:    6
uOps Per Cycle:    5.45
IPC:               3.64
Block RThroughput: 1.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      0     0.17                        xorl  %eax, %eax
 1      1     0.25                        cmpl  %esi, %edi
 1      1     0.50                        cmovgl        %edi, %eax
 3      7     1.00                  U     retq

AMD znver2:
Same input; run: clang clampNegToZero.ll -O2 -target x86_64 -march=znver2 -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=znver2

Before

Iterations:        100
Instructions:      500
Total Cycles:      155
Total uOps:        600

Dispatch Width:    4
uOps Per Cycle:    3.87
IPC:               3.23
Block RThroughput: 1.5


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        movl  %esi, %eax
 1      1     0.25                        subl  %edi, %eax
 1      1     0.25                        sarl  $31, %eax
 1      1     0.25                        andl  %edi, %eax
 2      1     0.50                  U     retq

After this transformation:

Iterations:        100
Instructions:      400
Total Cycles:      203
Total uOps:        500

Dispatch Width:    4
uOps Per Cycle:    2.46
IPC:               1.97
Block RThroughput: 1.3


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        xorl  %eax, %eax
 1      1     0.25                        cmpl  %esi, %edi
 1      1     0.25                        cmovgl        %edi, %eax
 2      1     0.50                  U     retq

AArch64: cortex-a57, csel latency 1
Run: clang clampNegToZero.ll -O2 -target aarch64 -mcpu=cortex-a57 -S -o - | llvm-mca -mtriple=aarch64 -mcpu=cortex-a57
before:

Iterations:        100
Instructions:      300
Total Cycles:      303
Total uOps:        300

Dispatch Width:    3
uOps Per Cycle:    0.99
IPC:               0.99
Block RThroughput: 1.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.50                        sub   w8, w1, w0
 1      2     1.00                        and   w0, w0, w8, asr #31
 1      1     1.00                  U     ret

After this transformation:

Iterations:        100
Instructions:      300
Total Cycles:      203
Total uOps:        300

Dispatch Width:    3
uOps Per Cycle:    1.48
IPC:               1.48
Block RThroughput: 1.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.50                        cmp   w0, w1
 1      1     0.50                        csel  w0, w0, wzr, gt
 1      1     1.00                  U     ret

  • Vector Tests ---

test input

define <4 x i32> @clamp0-vec(<4 x i32> %v, <4 x i32> %x) {
  %sub = sub nsw <4 x i32> %x, %v
  %shr = ashr <4 x i32> %sub, <i32 31, i32 31, i32 31, i32 31>
  %and = and <4 x i32> %shr, %v
  ret <4 x i32> %and
}

X86: Skylake
clang clampNegToZero-vec.ll -O2 -target x86_64 -march=skylake -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=skylake

before

Iterations:        100
Instructions:      400
Total Cycles:      303
Total uOps:        600

Dispatch Width:    6
uOps Per Cycle:    1.98
IPC:               1.32
Block RThroughput: 1.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        vpsubd        %xmm0, %xmm1, %xmm1
 1      1     0.50                        vpsrad        $31, %xmm1, %xmm1
 1      1     0.33                        vpand %xmm0, %xmm1, %xmm0
 3      7     1.00                  U     retq

After this transformation

Iterations:        100
Instructions:      300
Total Cycles:      203
Total uOps:        500

Dispatch Width:    6
uOps Per Cycle:    2.46
IPC:               1.48
Block RThroughput: 1.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.50                        vpcmpgtd      %xmm1, %xmm0, %xmm1
 1      1     0.33                        vpand %xmm0, %xmm1, %xmm0
 3      7     1.00                  U     retq

AMD znver2
clang clampNegToZero-vec.ll -O2 -target x86_64 -march=znver2 -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=znver2

before

Iterations:        100
Instructions:      400
Total Cycles:      303
Total uOps:        500

Dispatch Width:    4
uOps Per Cycle:    1.65
IPC:               1.32
Block RThroughput: 1.3


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        vpsubd        %xmm0, %xmm1, %xmm1
 1      1     0.25                        vpsrad        $31, %xmm1, %xmm1
 1      1     0.25                        vpand %xmm0, %xmm1, %xmm0
 2      1     0.50                  U     retq

After this transformation

Iterations:        100
Instructions:      300
Total Cycles:      203
Total uOps:        400

Dispatch Width:    4
uOps Per Cycle:    1.97
IPC:               1.48
Block RThroughput: 1.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        vpcmpgtd      %xmm1, %xmm0, %xmm1
 1      1     0.25                        vpand %xmm0, %xmm1, %xmm0
 2      1     0.50                  U     retq

AArch64 cortex-a57
clang clampNegToZero-vec.ll -O2 -target aarch64 -mcpu=cortex-a57 -S -o - | llvm-mca -mtriple=aarch64 -mcpu=cortex-a57

before

Iterations:        100
Instructions:      400
Total Cycles:      903
Total uOps:        400

Dispatch Width:    3
uOps Per Cycle:    0.44
IPC:               0.44
Block RThroughput: 1.5


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      3     0.50                        sub   v1.4s, v1.4s, v0.4s
 1      3     0.50                        sshr  v1.4s, v1.4s, #31
 1      3     0.50                        and   v0.16b, v1.16b, v0.16b
 1      1     1.00                  U     ret

After this transformation

Iterations:        100
Instructions:      300
Total Cycles:      603
Total uOps:        300

Dispatch Width:    3
uOps Per Cycle:    0.50
IPC:               0.50
Block RThroughput: 1.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      3     0.50                        cmgt  v1.4s, v0.4s, v1.4s
 1      3     0.50                        and   v0.16b, v0.16b, v1.16b
 1      1     1.00                  U     ret

Another note: for older-generation X86 targets, e.g., Haswell, cmov indeed has latency 2, but we still achieve a comparable uOps Per Cycle.
Same test input; run:
clang clampNegToZero.ll -O2 -target x86_64 -march=haswell -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=haswell
before

Iterations:        100
Instructions:      500
Total Cycles:      210
Total uOps:        700

Dispatch Width:    4
uOps Per Cycle:    3.33
IPC:               2.38
Block RThroughput: 1.8


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        movl  %esi, %eax
 1      1     0.25                        subl  %edi, %eax
 1      1     0.50                        sarl  $31, %eax
 1      1     0.25                        andl  %edi, %eax
 3      7     1.00                  U     retq

After

Iterations:        100
Instructions:      400
Total Cycles:      209
Total uOps:        700

Dispatch Width:    4
uOps Per Cycle:    3.35
IPC:               1.91
Block RThroughput: 1.8


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      0     0.25                        xorl  %eax, %eax
 1      1     0.25                        cmpl  %esi, %edi
 2      2     0.50                        cmovgl        %edi, %eax
 3      7     1.00                  U     retq
lebedev.ri retitled this revision from [InstCombine] Fold a shifty implementation of clamp negative to zero. to [InstCombine] Fold a shifty implementation of clamp-to-zero..Sep 23 2019, 6:23 AM
lebedev.ri edited the summary of this revision. (Show Details)
lebedev.ri accepted this revision.Sep 23 2019, 6:29 AM

Thanks, looks good to me.
The -march=znver2 numbers both regress and improve somewhat, but there is no znver2 sched model in LLVM trunk, so those numbers use some default sched model.

llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
1927–1929

Super pedantic: can we streamline the variables here?
How about

// and(ashr(subNSW(Y, X), (ScalarSizeInBits(Y)-1)), X) --> X s> Y ? X : 0.
This revision is now accepted and ready to land.Sep 23 2019, 6:29 AM

cc'ing @craig.topper @RKSimon @andreadb in case the use of x86 'cmov' has any pitfalls that we're not seeing so far.

Thanks, looks good to me.
The -march=znver2 numbers both regress and improve somewhat, but there is no znver2 sched model in LLVM trunk, so those numbers use some default sched model.

There is a WIP patch: https://reviews.llvm.org/D66088

This is just FYI.

llvm-mca results for AMD btver2 and bdver2:

AMD btver2
clang clampNegToZero.ll -O2 -target x86_64 -march=btver2 -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2
before

Iterations:        100
Instructions:      500
Total Cycles:      256
Total uOps:        500

Dispatch Width:    2
uOps Per Cycle:    1.95
IPC:               1.95
Block RThroughput: 2.5


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.50                        movl  %esi, %eax
 1      1     0.50                        subl  %edi, %eax
 1      1     0.50                        sarl  $31, %eax
 1      1     0.50                        andl  %edi, %eax
 1      4     1.00                  U     retq

After

Iterations:        100
Instructions:      400
Total Cycles:      206
Total uOps:        400

Dispatch Width:    2
uOps Per Cycle:    1.94
IPC:               1.94
Block RThroughput: 2.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      0     0.50                        xorl  %eax, %eax
 1      1     0.50                        cmpl  %esi, %edi
 1      1     0.50                        cmovgl        %edi, %eax
 1      4     1.00                  U     retq

AMD bdver2
clang clampNegToZero.ll -O2 -target x86_64 -march=bdver2 -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=bdver2
before

Iterations:        100
Instructions:      500
Total Cycles:      455
Total uOps:        500

Dispatch Width:    4
uOps Per Cycle:    1.10
IPC:               1.10
Block RThroughput: 4.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     1.00                        movl  %esi, %eax
 1      1     1.00                        subl  %edi, %eax
 1      1     1.00                        sarl  $31, %eax
 1      1     1.00                        andl  %edi, %eax
 1      5     1.50                  U     retq

After

Iterations:        100
Instructions:      400
Total Cycles:      208
Total uOps:        400

Dispatch Width:    4
uOps Per Cycle:    1.92
IPC:               1.92
Block RThroughput: 1.5


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      0     0.25                        xorl  %eax, %eax
 1      1     1.00                        cmpl  %esi, %edi
 1      1     0.50                        cmovgl        %edi, %eax
 1      5     1.50                  U     retq
This revision was automatically updated to reflect the committed changes.