This is an archive of the discontinued LLVM Phabricator instance.

[DagCombiner] Not all 'andn''s work with immediates.
ClosedPublic

Authored by lebedev.ri on May 5 2018, 4:15 AM.

Details

Summary

Split off from D46031.

In masked merge case, this degrades IPC by decreasing instruction count.


The next patch should be able to recover and improve this.

This also affects the transform @spatel have added in D27489 / rL289738,
and the test coverage for X86 was missing.
But after i have added it, and looked at the changes in MCA, i'm somewhat confused.


I'd say this regression is an improvement, since IPC increased in that case?

Diff Detail

Repository
rL LLVM

Event Timeline

I'd say this regression is an improvement, since IPC increased in that case?

As a rule of thumb when using llvm-mca, it's best to always remove return statements from the assembly code sequence.
llvm-mca should have warned you about the presence of a return statement in the input sequence:

warning: found a return instruction in the input assembly sequence.
note: program counter updates are ignored.

To get the correct resource pressure distribution in example icmp-opt.txt, you should remove the retq.
As a result, you should see this:

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]
 -      -     0.75   0.75   0.75   0.75    -      -      -      -      -      -

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]       Instructions:
 -      -     0.25   0.25   0.25   0.25    -      -      -      -      -      -         shrq    $63, %rdi
 -      -     0.25   0.25   0.25   0.25    -      -      -      -      -      -         xorl    $1, %edi
 -      -     0.25   0.25   0.25   0.25    -      -      -      -      -      -         movl    %edi, %eax

And...

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]       Instructions:
 -      -     0.25   0.25   0.25   0.25    -      -      -      -      -      -         xorl    %eax, %eax
 -      -     0.25   0.25   0.25   0.25    -      -      -      -      -      -         testq   %rdi, %rdi
 -      -     0.25   0.25   0.25   0.25    -      -      -      -      -      -         setns   %al

In terms of register pressure, the two code sequences are equally good.

The shlq+xor+mov is worse in terms of IPC because of the data dependency on %edi that limits the ILP when executing multiple iterations of the loop.
If you run multiple iterations and print the timeline view, you can see how the "average wait time" in the scheduler's queue is quite high for the shlq instruction.

You can see a similar behavior in test pos_sel_constants. The only problem I see is the slow LEA instruction (which is not treated specially by the scheduling model; at the moment it uses the same resources as a normal LEA).
Assuming that these instructions are executed in a loop, the new variant suffer less for long data dependencies between iterations.

I'd say this regression is an improvement, since IPC increased in that case?

As a rule of thumb when using llvm-mca, it's best to always remove return statements from the assembly code sequence.
llvm-mca should have warned you about the presence of a return statement in the input sequence:

warning: found a return instruction in the input assembly sequence.
note: program counter updates are ignored.

To get the correct resource pressure distribution in example icmp-opt.txt, you should remove the retq.

Thank you for your comments!
I will filter it out, so hopefully in future my mca expirience will be better :)

If you run multiple iterations and print the timeline view, you can see how the "average wait time" in the scheduler's queue is quite high for the shlq instruction.

Aha, so far i kinda ignored -timeline switch.
Those flags need work i think. I have just tried enabling them all, and it seems like they invert the current state?
I'd like to 1. have a switch to turn them all on, 2. maybe print which ones are currently enabled in -help

I'd say this regression is an improvement, since IPC increased in that case?

As a rule of thumb when using llvm-mca, it's best to always remove return statements from the assembly code sequence.
llvm-mca should have warned you about the presence of a return statement in the input sequence:

warning: found a return instruction in the input assembly sequence.
note: program counter updates are ignored.

To get the correct resource pressure distribution in example icmp-opt.txt, you should remove the retq.

Thank you for your comments!
I will filter it out, so hopefully in future my mca expirience will be better :)

If you run multiple iterations and print the timeline view, you can see how the "average wait time" in the scheduler's queue is quite high for the shlq instruction.

Aha, so far i kinda ignored -timeline switch.
Those flags need work i think. I have just tried enabling them all, and it seems like they invert the current state?
I'd like to 1. have a switch to turn them all on, 2. maybe print which ones are currently enabled in -help

That should not happen.
I guess you passed flag -instruction-tables too? To avoid confusions, I will remove that flag from the "View Options", since it is used to print a completely different report.

I'd like to 1. have a switch to turn them all on,

Sure, that can be added.

  1. maybe print which ones are currently enabled in -help

That can be done. At the moment, the llvm-mca commandline documentation specifies which views are enabled by default. I am going to add that information to the "help" too.

Cheers,
-Andrea

I'd say this regression is an improvement, since IPC increased in that case?

As a rule of thumb when using llvm-mca, it's best to always remove return statements from the assembly code sequence.
llvm-mca should have warned you about the presence of a return statement in the input sequence:

warning: found a return instruction in the input assembly sequence.
note: program counter updates are ignored.

To get the correct resource pressure distribution in example icmp-opt.txt, you should remove the retq.

Thank you for your comments!
I will filter it out, so hopefully in future my mca expirience will be better :)

If you run multiple iterations and print the timeline view, you can see how the "average wait time" in the scheduler's queue is quite high for the shlq instruction.

Aha, so far i kinda ignored -timeline switch.
Those flags need work i think. I have just tried enabling them all, and it seems like they invert the current state?
I'd like to 1. have a switch to turn them all on, 2. maybe print which ones are currently enabled in -help

That should not happen.
I guess you passed flag -instruction-tables too? To avoid confusions, I will remove that flag from the "View Options", since it is used to print a completely different report.

Sounds about right.

I'd like to 1. have a switch to turn them all on,

Sure, that can be added.

  1. maybe print which ones are currently enabled in -help

That can be done. At the moment, the llvm-mca commandline documentation specifies which views are enabled by default. I am going to add that information to the "help" too.

Yay! Thank you!

Cheers,
-Andrea

spatel accepted this revision.May 7 2018, 10:22 AM

LGTM - but I want to point out that the mca analysis may not be accurate for these sequences yet. We don't have the zero idiom (xor %eax, %eax) recognition yet ( https://bugs.llvm.org/show_bug.cgi?id=36671 ), so the stats for those cases are probably pessimistic.

This revision is now accepted and ready to land.May 7 2018, 10:22 AM

LGTM

Thank you for the review!

but I want to point out that the mca analysis may not be accurate for these sequences yet. We don't have the zero idiom (xor %eax, %eax) recognition yet ( https://bugs.llvm.org/show_bug.cgi?id=36671 ), so the stats for those cases are probably pessimistic.

I defer to you on the question whether icmp-opt.ll, selectcc-to-shiftand.ll changes are a regression (in which case they should be updated to use hasAndNotCompare() instead of hasAndNot())

LGTM

Thank you for the review!

but I want to point out that the mca analysis may not be accurate for these sequences yet. We don't have the zero idiom (xor %eax, %eax) recognition yet ( https://bugs.llvm.org/show_bug.cgi?id=36671 ), so the stats for those cases are probably pessimistic.

I defer to you on the question whether icmp-opt.ll, selectcc-to-shiftand.ll changes are a regression (in which case they should be updated to use hasAndNotCompare() instead of hasAndNot())

No - I think the test+set sequences are likely equal or better for most uarch, so those are improvements. They should also be less instruction bytes, so that's good too.

This revision was automatically updated to reflect the committed changes.