This is an archive of the discontinued LLVM Phabricator instance.

[ExecutionDepsFix] Optimize instruction insertion
Needs ReviewPublic

Authored by loladiro on Jan 19 2017, 1:23 PM.

Details

Summary

In preparation for an upcoming commit (D28786) that will make
ExecutionDepsFix more agressive about inserting dependency breaking instructions,
teach ExecutionDepsFix to be smarter about inserting these instructions.
There are two aspects to this:

  1. Recognize dependency breaking instructions and if there already is such an instruction, simply re-use it, rather than inserting a new one.
  1. For undef reads, which register is used does not matter. Thus, if there is many such reads in close succession, we only need to insert one dependency breaking instruction for many such reads.

Note: This revision depends on D28759

Event Timeline

loladiro created this revision.Jan 19 2017, 1:23 PM
myatsina edited edge metadata.Feb 9 2017, 7:12 AM

First, I'm sorry for the late review :)

Second, I have a general comment:
I know that all the things you've put in this patch are aimed at improving the number of xors and the registers we pick as undefs, but it is much easier to review, check necessity of each change and make sure the feature is properly tested when you break down your commits to small changes/patches.
Removing redundant xors when you have several instructions that use same undef reg is one issue
isDependencyBreak support is another issue
Trying to add only true dependency to the undef regs set is another issue
Trying to choose a better register is another issue (and I would expect it to somehow make "pickBestReg" smarter and not rewrite everything at the end).

Most of these changes where not part of the original patch you've uploaded in your previous patch version that combined the function calls, so I'm wondering how they got themselves into this patch now? Do we really need them? Have you seen performance improvements by these changes too measuring each one separately (even if they look good "on paper" they might not bring the expected result, they might even make it worse)?

I'm sorry I'm being very strict about this again, but I want to make sure we do each such change the best possible way and keep them separate (with dedicated tests of course), so please break your changes as much as possible, upload each change you want in a different review with a different test case.
I would also think about the general approach of trying to rewrite the register again - how does it work together with "pickBestRegister"? Does it undo the work of that function? Or did we miss some cases there in the first place? With this issue i would start from the beginning - what problem are we trying to solve and what is the best way to solve it?

lib/CodeGen/ExecutionDepsFix.cpp
635

Maybe it's better to use iterator_range for this purpose?
You can see regIndices example in this file.

649

Please document this function and what's its purpose

656

formatting issue - please surround the if with {} and thus be consistent with the style of the other conditions in this if- if - else if-else

661

why do you need this case for?

666

why do we choose lowest register? why not choose a "far away" register (this way we may even be able to save the xor)?

690

Shouldn't "pickBestRegister" catch these cases? And if not, maybe we've missed an optimization case there.
I'm not sure we should have this in this form.
What is the scenario you're trying to optimizer and that "pickBestRegister" doesn't catch?

711

Please try to avoid defining variable's initial values this way. It not very readable.
How about:
size_t LastInit = ...;
size_t i = LastInit;

722

Aren't you checking the liverange set in updateChooseableRegs? Why do it here too?

729

Please document the meaning of this code, and the next section as well. What are each of these sections doing?
It's really hard for to follow this.
I read the documentation that you've added earlier on of what is the purpose of your transformation, but I don't really understand how you iterate over the undefs and achieve this, why you've split it to several sections and the cases each section handles.

lib/Target/X86/X86InstrInfo.cpp
7501

There are additional instructions like KXORBrr, KXORwrr, KXORDrr, KXORQrr(AVX512).
Others might be added in the future, so perhaps we should set the right infrastructure for that.

swithc (op):
default:
return false;
case XORPS:
case XORPS:
...
<check xor instruction>

7504

you can use X86::NoRegister instead

7505

You can iterate directly over operands

7508

I think this whole section of could would be a bit clearer if you write it in this spirit:

for ()

7511

I see that you write a lot of functions so that they could return 2 arguments and that you've used the same "trick" in other functions as well.
I don't thing it's a good approach in general. This need of a function to return 2 values generally indicates that you need to extract the common logic somehow or define a new type or do some other form of refactoring refactoring.
In this case I think "isDependencyBreak" should only return true/false (as the name indicates). If the module asking wants to know which register it is then it can find the dest reg on its own, or we can provide the module a "getDependencyBreakReg". In general it will behave similar to the already existing "isReg()" and "getReg()".

test/CodeGen/X86/break-false-dep.ll
236

Each feature should have it's own test. You are not supposed to eliminate the original test (unless the original test's logic is no longer true), you're suppose to a small unit test for each feature.
This test was originally designed to check "pickBestRegister".

237

This test is solely for testing the xor combining you've added? Or is it suppose to test something else?
I already see you've added a dedicated test for the "xor", so I'm not sure why you've changed this one.

If it's only for xor combine then you can do a very simple test like this:

define double @clearence(i64 %arg1, i64 %arg2) {
// the inline asm calls that make xmm6 as the best choice
%tmp1 = sitofp i64 %arg1 to double
%tmp2 = sitofp i64 %arg2 to double
%tmp3 = add i64 %tmp1, %tmp2

ret double %tmp3

}

In this case xmm6 should be chosen as the undef for both convert instructions.
and there should only be one xor

242

why did you add this part?
What is the logic that is tests?

367

where are the tests for all the other optimizations you've added?
The "isBreakingDependency", "register re-picking", etc?

test/CodeGen/X86/vec_int_to_fp.ll
1668

Why is using xmm1 ok?
If we fall through from the revius BB where xmm1 is xored, then it's ok but we might jump to this block from another place and if we use xmm1 and won't have a xor, then we will have a stall, no?

There's only really two major changes here:

  1. Updating the registers at the end
  2. isDependencyBreak support

Updating the registers at the end is not redundant with picking the best register,
because it only operates on instructions where we've determined earlier that a
dependency break is required (i.e. there are no suitable undef registers with
sufficient clearance). What updating at the end does is allow us to avoid the pathological
case in the comment I added above:

// All registers clobbered here, the code determines it needs a dependency break,
// so arbitrarily picks xmm0 as the undef read and then remembers for later to
// insert a dependency break here
vrandom %xmm0<undef>, %xmm0<def>
vrandom %xmm1<undef>, %xmm1<def>
vrandom %xmm2<undef>, %xmm2<def>
vrandom %xmm3<undef>, %xmm3<def>

The problem is that without this patch this turns into:

vxorps %xmm0, %xmm0, %xmm0
vrandom %xmm0<undef>, %xmm0<def>
vxorps %xmm1, %xmm1, %xmm1
vrandom %xmm1<undef>, %xmm1<def>
vxorps %xmm2, %xmm2, %xmm2
vrandom %xmm2<undef>, %xmm2<def>
vxorps %xmm3, %xmm3, %xmm3
vrandom %xmm3<undef>, %xmm3<def>

Clearly, we can do better, which is what this patch does. However, doing so involves
a backwards walk over all the undef reads we know need to be broken, so it's fundamentally
incompatible with the forwards algorithm in pickBestRegister.

Most of these changes where not part of the original patch you've uploaded in your previous patch version that combined the function calls, so I'm wondering how they got themselves into this patch now? Do we really need them? Have you seen performance improvements by these changes too measuring each one separately (even if they look good "on paper" they might not bring the expected result, they might even make it worse)?

I think I had an earlier version for avoiding redundant xorps in a forward manner, which you may have looked at.
That turned out to be insufficient, and is subsumed by this algorithm. In general this
revision is aimed at limiting the impact of D28786, which without this enhancement,
caused a very large number of vxorps to be inserted.

The isDependencyBreak is mostly included because it makes it possible to easily create registers that the algorithm
recognizes as having lots of clearance (and thus not requiring an extra dependency break), even when clearance is
killed at function entry as D28786 does. After that change it is very rare for there to be registers of sufficient clearance otherwise.
I'd be happy to split that out if you want.

lib/CodeGen/ExecutionDepsFix.cpp
661

Only to make sure we actually return the correct LowestValid.

666

We want to pick an arbitrary register from the choosable set. Picking the lowest one ensures a consistent choice. Picking a far away one doesn't make sense, because to get to this point, we've already determined that no registers are sufficiently far away.

690

See the non-inline reply. This is fundamentally a reverse operation.

722

updateChoosableRegs only looks at the instructions before which we require dependency breaks.
This kills choosability for any registers that have liveness entirely between two such instructions.

lib/Target/X86/X86InstrInfo.cpp
7511

Fair enough (maybe I've been writing too much C code). In general I dislike splitting such logic across two functions since it needs to be kept in sync (e.g. when adding additional dependency breaking instructions to be recognized). How about adding an getDependencyBreakReg that just returns 0 or an llvm::Optional, if it's not a dependency breaking instruction.

test/CodeGen/X86/break-false-dep.ll
236

It still checks pickBestRegister, but in a way that remains valid after D28786, by making use of the fact that the fcmp ult double %x, 0.0 materializes a constant 0.0 as a dependency breaking instruction, so pickBestRegister can pick up on that.

237

See above

242

See above

367

This is the test for register re-picking.

test/CodeGen/X86/vec_int_to_fp.ll
1668

As far as I can tell, nothing else jumps in between the vxorps and this undefined use, so it's a good register to use.

I'm sorry, I'm getting too lost in this review with the interaction between all the changes and especially trying to check if all the corner cases were tested sufficiently.
I want to make sure we implement each of your changes the best possible way.

Let's do the separation to the 3 incremental reviews. I think the order should be:
Trying to add only true dependency to the undef regs
isDependencyBreak support (which is the "xor fusion" if I understand it correctly)
Register re-picking.
(and the call affect on clearance calculations can follow)

I'm emphasizing the fact that the "xor fusion" and "register re-picking" should have several unit tests dedicated only to these features, testing both simple and comlex cases.

Also, I think a cleaner implementation can be done for the register re-picking, but I need to stew on it a bit more until I have a good concrete suggestion.

lib/CodeGen/ExecutionDepsFix.cpp
499

Why not make it return true/false (true dependency) instead of void?

584

And then here you will have :
bool TrueDependency = pickBest...()

628

This feature should be heavily tested.
This is the "xor fusion" logic, right?

We're adding xors for dependency break in a late stage of the program, after clearance calculation of everyone was already done, including loops (i.e. you're previous patch that changes the calculation algorithm that you've comitted).
Now you are changing the clearance for the xor's we've added, if I understand this correctly.
So somehow you need to update your old clearance calculation, no?
Or is this code aimed to only catch "xor"s that were in the original code.

I would like to see a lot of tests for these feature (without the affect of the other features):
How does it affect simple one BB code?
What is the affect on simple loops with one xor to this register and no dependency?
What is the affect on complex loops that no xors and assigns to the same register:

loop:

cvt
xor xmm0
cvt
xmm0 =
cvt
cvt
xor xmm0
cvt
xmm0 =

656

There's something that doesn't make sense for me.
Sometimes we're adding registers to ChoosableRegs and sometimes we're erasing registers from it.
Why did we add something "bad" in the first place?
It seem like this function is trying to mix 2 different (but close) functionalities, and I don't think it's right.

681

Do you mean vroundss? In this case it's a bad example. It uses 3 xmms and you should hide the false dependency under the true dependency of the source xmm.

I think you should put here the full instruction with all the operands (vcvtsi2sd will probably be good enough for your point).

lib/Target/X86/X86InstrInfo.cpp
7508

Ignore the comment above :) I forgot to delete it.

7511

The llvm:optional seems very cool :) I wasn't familiar with it :) I think it's the best for us. I like it better than reg == 0 (I don't think there's a 0 is "non reg" for all targets).

Anyway, here are some additional options I thought of as well:

  1. implement getDepRegBreak and isDepRegBreak when one uses the other

isDepsRegBreak() {
return getRegBreak() != X86::NON_REG
}
This way you don't have the duplication of the logic that you've talked about.

  1. getDependencyBreak will return operand number (of dest reg) or -1 if it's not breaking
test/CodeGen/X86/break-false-dep.ll
367

I think you need added additional more complex tests fore the re-picking.
What happens if we're in a loop? or in a loop with a carry on dependency from the previous iteration?
What happens if you have instructions between the converts that make some of you original potential registers "alive" and then you can no longer use them?

I'll split up this review further and let's go from there.

lib/CodeGen/ExecutionDepsFix.cpp
499

That seems fine. I think I originally had it return a value, which is why I went this way, but since that's no longer the case, I'll update accordingly.

628

This is just for xors that were in the original code. xor fusion is no longer there, because it got subsumed by the register re-picking.

Ok, I've split everything but the register re-picking into D30173 and D30177. Let's get those reviewed and once that's done, I'll rebase this with just the register re-picking code. I hope I've also addressed the review comments for the relevant parts in those revisions.

MatzeB resigned from this revision.Aug 15 2017, 11:10 AM