This is an archive of the discontinued LLVM Phabricator instance.

[ExecutionDepsFix] Kill clearance at function entry/calls
Needs ReviewPublic

Authored by loladiro on Jan 16 2017, 3:49 PM.

Details

Summary

This is a follow up to D28759 and together with that commit fixes
all (maybe all, pending another look at the benchmarks) of
the benchmarks that regressed due to rL278321 (while keeping the
performance enhancements in cases where rL278321 was beneficial).

Prior to this commit, the analysis would simply ignore any function
calls for the clearance calulation, causing incorrect results after
any function call (for the benchmarks that regressed rL278321 just
happened to pick a register that was worse than the xmm0 default).
With this patch, we kill clearance for all registers when a function
call occurs.

Similarly, we kill clearance at function entry. This is by far the more
disruprive of the two cases, but is necessary to avoid 2x penalty in some
common cases. The most obvious case where this happens is calling a
small-ish non-inlined function in a loop. If the function uses xmm
registers that are not live-ins, it is likely to fall into this performance
trap, if we don't consider clearance small on live-ins.

This is obviously more pessimistic than reality in a lot cases. However,
the combination of the immense penalty for not having the dependency
breaking instruction (3-5x), together with the fact that these instructions
are extremely cheap (they are special cased in the decoder, AFAIK, so
don't even take up an execution unit).

Event Timeline

loladiro created this revision.Jan 16 2017, 3:49 PM
myatsina added inline comments.Jan 17 2017, 5:48 AM
lib/CodeGen/ExecutionDepsFix.cpp
503

Isn't it enough to do this only in the primary pass?

696

Better not mix 2 different issues in one patch, so please separate this optimization (and the related pickBestRegisterForUndef changes) to another review with a dedicated test.
Please also upload here a patch (+test) that relate just to the "call" clearance calculation.

test/CodeGen/X86/break-false-dep.ll
336

Isn't this suppose to be part of the https://reviews.llvm.org/D28759 patch?

loladiro updated this revision to Diff 85000.Jan 19 2017, 11:43 AM

Updated to also kill clearance at function entry

And a revised optimization procedure to avoid excessive vxorps caused by
that.

loladiro edited the summary of this revision. (Show Details)Jan 19 2017, 11:45 AM

Sorry to make you re-review this, but the solution I had put up previously turned out to be insufficient to address all the regressions we were seeing on benchmarks. Combining your previous changes with the two pending revisions, we're seeing 2-3x improvements on a number of our benchmarks and no regressions!

I'll update this to pull out the optimization into a prior patch, stacked before this one.

loladiro updated this revision to Diff 85019.Jan 19 2017, 1:34 PM

Split out the optimization part into D28915

loladiro retitled this revision from [ExecutionDepsFix] Kill clearance at function calls to [ExecutionDepsFix] Kill clearance at function entry/calls.
loladiro edited the summary of this revision. (Show Details)
loladiro set the repository for this revision to rL LLVM.

Sorry to make you re-review this, but the solution I had put up previously turned out to be insufficient to address all the regressions we were seeing on benchmarks. Combining your previous changes with the two pending revisions, we're seeing 2-3x improvements on a number of our benchmarks and no regressions! As requested, I split up the revision to pull out the vxorps insertion optimization into a separate revision.

myatsina edited edge metadata.Feb 9 2017, 8:31 AM

The existent tests you've changes seem to be affect by other changes as well.
Can you please upload only the changed tests' logic that is affected by this patch alone?
I would not like to see the affects of "xor fusion" or the other things here, just the support for call instructions in clearance calculation

lib/CodeGen/ExecutionDepsFix.cpp
422

I was about to upload a review for this issue :)
It solves at least 2 bugzillas:
https://llvm.org/bugs/show_bug.cgi?id=25277
https://llvm.org/bugs/show_bug.cgi?id=27573

One of them is marked as duplicate of other issues, so it will solve those too probably.

You should mention these bugzillas in your commit message.

In my version I had some small refactoring to this code section, feel free to adopt it if you like it:

// If this is an entery block, we don't know what the caller function did
// with the register, therfore we treat them as they were defined just before
// the first instruction.
// Otherwise, default values are 'nothing happened a long time ago'.
int defaultDefVal = MBB->pred_empty() ? -1 : -(1 << 20);

 for (unsigned rx = 0; rx != NumRegs; ++rx) {
   LiveRegs[rx].Value = nullptr;
   LiveRegs[rx].Def = defaultDefVal;
 }

 // This is the entry block.
 if (MBB->pred_empty()) {
   DEBUG(dbgs() << "BB#" << MBB->getNumber() << ": entry\n");
   return;
 }
640

Here, you've made this version dependent on the D28915 review.
I suggest you commit this patch before D28915, but then you need to change this condition a bit (else if --> if).

You can change it back in the "isDependencyBreak" review.

test/CodeGen/X86/avx512-cvt.ll
19–20

I think the checks in this test mix some of of your separate changes (the "XOR" fusion?).
Are you sure this tst is correct? Shouldn't it have 2 xors? also, I'm not sure why xmm2, xmm3 changed to xmm4 here.

60

Same here and in the rest of this file, I don;t understand why the register changed from xmm2 to xmm3. This should not have been caused by the change in this patch.

test/CodeGen/X86/break-false-dep.ll
269

Why the change? What are you testing here?

362

let's separate between xor after call instruction and "xor fusion". these are 2 different features and each should have it's own tests, and this is the exact purpose of unit tests.
If you want a test that combines several feature it shouldn't come at the expense of each feature's stand alone tests. It can be added on top of them.

364

This change should contain 2 tests,

  1. Check xor in the callee after a call (which callclearance does)
  2. Check xor in a function (I know it is indirectly tests in a lot of other places, but I think we should have an explicit test for this).
371

A suggestion for a simpler test that tests the call feature:

define <4 x double> @callclearance( i64 %val ) {
// inline asm to force choosing one register
%val1 = sitofp i64 %val to double
%val2 = sitofp i64 %val to double

%2 = call <4 x double> @myfunc(i64 %val, i64 %val1)

%val3 = sitofp i64 %val to double
...
}

Show that the undef register of the first and the second convert are the same, but the undef register of the third convert is different from them (--> the call instruction changes the clearance and thus another register was chosen).

test/CodeGen/X86/half.ll
104

the register is written explicitly, so better check the register too, no?
There are other tests like this further on.

test/CodeGen/X86/i64-to-float.ll
281–282

why the change to xmm2?
There is a dependency on it "really close", 3 instructions above it (vpcmpgtq %xmm1, %xmm0, %xmm2)

loladiro added inline comments.Feb 9 2017, 4:31 PM
lib/CodeGen/ExecutionDepsFix.cpp
640

Without D28915 there several hundred more xorps to be inserted in the tests ;)

test/CodeGen/X86/avx512-cvt.ll
19–20

Yes, this relies on the optimization in D28915. With this patch, but without the other one it would have inserted three xorps here (one for each of xmm2, xmm3, xmm4). The optimization in that patch allows us to get away with one.

The reason this changed here and not in the other patch is that before this, xmm2 was considered to have sufficient clearance (same for xmm3 and xmm4), but now it doesn't any more.

test/CodeGen/X86/break-false-dep.ll
269

As remarked in the other revision, this is a trick for making sure that pickBestRegister is still tested.

371

What kind of inline asm do you have in mind for this purpose? I tried for a bit to come up with something, but didn't manage to.

test/CodeGen/X86/half.ll
104

Fair enough

test/CodeGen/X86/i64-to-float.ll
281–282

Well, there needs to be an xor with some register here. xmm2 seems as good as any other, unless the fact that is was used shortly before actually makes a difference (I'm not aware of such an impact, since this should be handle in the register renaming unit, but happy to be corrected).

myatsina added inline comments.Feb 15 2017, 8:02 AM
lib/CodeGen/ExecutionDepsFix.cpp
640

Good point :)

test/CodeGen/X86/avx512-cvt.ll
19–20

Why doesn't xmm2 have enough clearance and xmm4 does?
There is no loop here, xmm4 and xmm2 should have the same clearance at this point.
Or is it getting together with the other optimization you did that merges a bunch of instructions and re-writes their undef register?

This is why I want to keep all these changes you are making separate from each other as much as possible.
I understand they are ALL needed for avoiding the regressions, but they have complex interaction with each other therefore I think it's crucial to do them separately and incrementally. It will also help "documentation" of the different parts of this mechanism.

test/CodeGen/X86/break-false-dep.ll
371

The inline asm itself can be nop or empty string, it is treated as a black box for most of the compiler and nobody actually analyzes the registers that are used there. The important part is to mark the relevant registers as clobbered by the asm call (equivelent to the clobber list in extended inline asm):
tail call void asm sideeffect "", "~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{dirflag},~{fpsr},~{flags}"()

Here the inline asm is "" and it marks xmm8-xmm11 as clobbered, so this instruction will mark these registers as defs for clearance calculation purpose.

test/CodeGen/X86/i64-to-float.ll
281–282

What I meant in my comment is the original choice of xmm3 seems to be better than the new choice of xmm2.
In theory if you prefer far away register, you may find a register that is far enough so that you wouldn't have to insert a xor and by this save an instruction.

myatsina added inline comments.Feb 15 2017, 8:10 AM
test/CodeGen/X86/break-false-dep.ll
269

Well, you're no longer just testing pickBestReg, you're also testing the "xor fusion" which beats the purpose of unit tests.

Why do you think this test didn't test "pickBestReg" after your change?
But this test does need to be changes to expect a "xor" before the convert (which should be the only effect of your change on this test). Or is there some other way you've affected this test that I'm missing?

loladiro added inline comments.Feb 15 2017, 1:12 PM
test/CodeGen/X86/avx512-cvt.ll
19–20

Without the change in D28915, this diff would have been:

 ; KNL-NEXT:    vextracti32x4 $3, %zmm0, %xmm1
 ; KNL-NEXT:    vpextrq $1, %xmm1, %rax
+; KNL-NEXT:    vxorps %xmm2, %xmm2, %xmm2
 ; KNL-NEXT:    vcvtsi2sdq %rax, %xmm2, %xmm2
 ; KNL-NEXT:    vmovq %xmm1, %rax
+; KNL-NEXT:    vxorps %xmm3, %xmm3, %xmm3
 ; KNL-NEXT:    vcvtsi2sdq %rax, %xmm3, %xmm1
 ; KNL-NEXT:    vunpcklpd {{.*#+}} xmm1 = xmm1[0],xmm2[0]
 ; KNL-NEXT:    vextracti32x4 $2, %zmm0, %xmm2
 ; KNL-NEXT:    vpextrq $1, %xmm2, %rax
+; KNL-NEXT:    vxorps %xmm3, %xmm3, %xmm3
 ; KNL-NEXT:    vcvtsi2sdq %rax, %xmm3, %xmm3
 ; KNL-NEXT:    vmovq %xmm2, %rax
+; KNL-NEXT:    vxorps %xmm4, %xmm4, %xmm4
 ; KNL-NEXT:    vcvtsi2sdq %rax, %xmm4, %xmm2
 ; KNL-NEXT:    vunpcklpd {{.*#+}} xmm2 = xmm2[0],xmm3[0]
 ; KNL-NEXT:    vinsertf128 $1, %xmm1, %ymm2, %ymm1
 ; KNL-NEXT:    vextracti32x4 $1, %zmm0, %xmm2
 ; KNL-NEXT:    vpextrq $1, %xmm2, %rax
+; KNL-NEXT:    vxorps %xmm4, %xmm4, %xmm4
 ; KNL-NEXT:    vcvtsi2sdq %rax, %xmm4, %xmm3
 ; KNL-NEXT:    vmovq %xmm2, %rax
+; KNL-NEXT:    vxorps %xmm4, %xmm4, %xmm4
 ; KNL-NEXT:    vcvtsi2sdq %rax, %xmm4, %xmm2
 ; KNL-NEXT:    vunpcklpd {{.*#+}} xmm2 = xmm2[0],xmm3[0]
 ; KNL-NEXT:    vpextrq $1, %xmm0, %rax
+; KNL-NEXT:    vxorps %xmm4, %xmm4, %xmm4
 ; KNL-NEXT:    vcvtsi2sdq %rax, %xmm4, %xmm3
 ; KNL-NEXT:    vmovq %xmm0, %rax
+; KNL-NEXT:    vxorps %xmm4, %xmm4, %xmm4
 ; KNL-NEXT:    vcvtsi2sdq %rax, %xmm4, %xmm0
 ; KNL-NEXT:    vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm3[0]
 ; KNL-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
 ; KNL-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
 ; KNL-NEXT:    retq

The optimization in D28915 allows it to recognize that it only needs on vxorps.

test/CodeGen/X86/break-false-dep.ll
269

Because after this change, there's no register with sufficient clearance (since all clearances are killed at function entry, so there's no "best register" for it to pick - they're all bad). This construction (comparison against a constant 0) forces llvm to materialize a vxorps, which breaks the dependence, so pickBestReg actually has some work to do.

371

Yes, but that's the opposite of what we need. We'd need some inline assembly to revive clearances (because they're now dead-by-default), not kill them.

test/CodeGen/X86/i64-to-float.ll
281–282

Why does xmm3 seem better? At this point the compiler has already determined that no register has sufficient clearance, which is why it inserts the dependency break. As far as I know, which register it uses then doesn't matter.

MatzeB resigned from this revision.Aug 15 2017, 11:06 AM