This is an archive of the discontinued LLVM Phabricator instance.

Relax the clearance calculating for breaking partial register dependency.
ClosedPublic

Authored by danielcdh on Jun 21 2016, 10:04 AM.

Details

Summary

LLVM assumes that a large clearance will hide the partial register update penalty. But in our experiments, a clearance of 16 instructions is too small. As the inserted XOR is normally fairly cheap, we should use a higher clearance threshold to more aggressively insert the XORs needed to break partial register dependencies.
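For context, here is a minimal sketch of the decision this threshold drives (illustrative names only, not the code in the patch): an xor is inserted to zero the register only when its last write was recent enough that a partial write would still stall on it.

    // Sketch only: illustrative names; the real hooks live in X86InstrInfo.cpp.
    // Clearance = number of instructions since the register was last written.
    static const unsigned ClearanceThreshold = 64; // previously 16

    bool shouldInsertDependencyBreakingXor(unsigned Clearance) {
      // A partial write such as cvtsi2ss merges into the old register value.
      // If that value was produced recently, the merge stalls on it, so zero
      // the register with an xorps first to cut the false dependence.
      return Clearance < ClearanceThreshold;
    }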

Diff Detail

Event Timeline

danielcdh updated this revision to Diff 61397.Jun 21 2016, 10:04 AM
danielcdh retitled this revision from to Relax the clearance calculating for breaking partial register dependency..
danielcdh updated this object.
danielcdh added reviewers: mkuper, davidxl, wmi.
danielcdh added a subscriber: llvm-commits.
davidxl edited edge metadata.Jun 21 2016, 10:18 AM

Can you also add the original author who set the current threshold to the review?

mkuper edited edge metadata.Jun 21 2016, 11:29 AM

This mostly makes sense to me. 16 is as much of a magic number as 64, so if this is supported by performance numbers, it's fine.

The problem is that we're using the number of instructions between the write and the read as a proxy for "the latency of the critical chain between the write and the read". And it's not a very good proxy, since how correct we are depends on how much ILP the loop (assuming this is a loop-carried dependency) has. So any number we pick here will be just hand-waving.

Adding some Intel people in case they have more input.

I think this is a good change. The cost of inserting an xor when it is not needed is very small. The cost of failing to insert an xor when it IS needed can be huge.

But fixing the threshold isn't good enough for several reasons.
(1) It isn't always possible, because the register may hold a live value. Marina (added) is working on a patch that will fix this problem.
(2) If the instruction has a "real" XMM operand, it is better to choose that register for the undef operand rather than inserting an xor. (Doing so hides the false dependence behind a true dependence that is unavoidable.) This applies to instructions like vcvtsd2ss et al. Marina is planning to work on this also.

-Dave

lib/Target/X86/X86InstrInfo.cpp
5983

minor nit, instruction's --> instructions'

Here is a testcase extracted from an internal benchmark, which runs 2% faster on Sandy Bridge if the clearance threshold is changed from 16 to 64. For this extracted testcase itself, when the threshold is set to 16, the cvtsi2ss for "g" will not have an xor inserted to break the dependency. As a result, the inner loop takes 13 cycles (compared with 11 cycles when the dependency is broken for all of "r", "g" and "b").

int datas[10000];
int datad[10000];
void foo(float r, float g, float b, int s, int d, int h, int w, int *datas, int *datad) __attribute__((noinline));
void foo(float r, float g, float b, int s, int d, int h, int w, int *datas, int *datad) {
  int i, j;
  for (i = 0; i < h; i++) {
    int *lines = datas + i * s;
    int *lined = datad + i * d;
    for (j = 0; j < w; j++) {
      int word = *(lines + j);
      int val = (int)(r * ((word >> 8) & 0xff) +
                      g * ((word >> 16) & 0xff) +
                      b * ((word >> 24) & 0xff) + 0.5);
      *((char *)(lined) + j) = val;
    }
  }
}

int main() {
  for (int i = 0; i < 100000; i++) {
    foo(2.0, 3.0, 4.0, 100, 100, 100, 100, datas, datad);
  }
  return 0;
}

I don't have public benchmark results for this change yet. For internal benchmarks, no noticeable code size change has been observed. The change gives a 2% speedup on the benchmark that motivated this patch and has no performance impact on any other internal benchmark.

I think this is a good change. The cost of inserting an xor when it is not needed is very small. The cost of failing to insert an xor when it IS needed can be huge.

But fixing the threshold isn't good enough for several reasons.
(1) It isn't always possible, because the register may hold a live value. Marina (added) is working on a patch that will fix this problem.
(2) If the instruction has a "real" XMM operand, it is better to choose that register for the undef operand rather than inserting an xor. (Doing so hides the false dependence behind a true dependence that is unavoidable.) This applies to instructions like vcvtsd2ss et al. Marina is planning to work on this also.

-Dave

It's great that Marina is working on a better fix for this problem. What is the ETA for the fix? If it's coming pretty soon, we should abandon this change. Otherwise, if the change does no harm, let's just update the magic number so that we can reclaim performance for the motivating benchmark. Thoughts?

Thanks,
Dehao

I wasn't suggesting that Marina's fixes should replace this one. Sorry for that confusion. My point was simply that this is just one piece of a more comprehensive solution to the false dependence problem.

I think we should go ahead with this change. But I agree with davidxl that you should get an 'ok' from the original author, if possible.

Thanks,
-Dave

spatel edited edge metadata.Jun 22 2016, 12:47 PM

Since even the code comments acknowledge that this is a repeated magic number, please give it a name. Even better would be to make it a cl::opt, so we can run experiments more easily.
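A minimal sketch of what such a cl::opt could look like, assuming a single shared threshold; the option name, description, and default below are illustrative, not the exact code in the patch:

    #include "llvm/Support/CommandLine.h"
    using namespace llvm;

    // Sketch only: illustrative option name and default value.
    static cl::opt<unsigned> PartialRegClearanceThreshold(
        "x86-partial-reg-clearance",
        cl::desc("Minimum clearance (instructions since the last write) below "
                 "which an xor is inserted to break a partial register "
                 "dependency"),
        cl::init(64), cl::Hidden);

With a cl::opt, the threshold can then be overridden for experiments (e.g. -mllvm -x86-partial-reg-clearance=128 with the illustrative name above) instead of rebuilding the compiler.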

FWIW, I don't see any measurable perf difference for the example test case (compiled with -O2) running on Haswell.

Is there already a regression test to show that we're not adding instructions if we're in MinSize mode?

Is there already a regression test to show that we're not adding instructions if we're in MinSize mode?

I believe we currently add the xors in MinSize mode, and, as I wrote on PR22024, I think it's justified.
I guess it's a question of how -Oz is defined - whether it's "make code smaller at all costs", or (as the manual currently says):

  • -Os: Like -O2 with extra optimizations to reduce code size.
  • -Oz: Like -Os (and thus -O2), but reduces code size further.

My gut feeling is that omitting the xor is not a good performance/size trade-off even for -Oz.

Is there already a regression test to show that we're not adding instructions if we're in MinSize mode?

I believe we currently add the xors in MinSize mode, and, as I wrote on PR22024, I think it's justified.

Yes, I pasted the links before I scrolled through the comments.
In that case, let me negate my question. :)

Is there already a regression test to show that we ARE adding instructions if we're in MinSize mode?

The performance of the attached testcase running on Haswell (to reproduce, build with -O2 -fno-tree-vectorize):

$ time ./good.out
real 0m2.442s
user 0m2.437s
sys 0m0.001s

$ time ./bad.out
real 0m3.480s
user 0m3.475s
sys 0m0.002s

The performance of the attached testcase running on Haswell (to reproduce, build with -O2 -fno-tree-vectorize):

I missed -fno-tree-vectorize in my earlier experiment. With that setting, I am able to reproduce the Haswell win. This is at nominal 4GHz:

lessxor: user   0m2.977s
morexor: user	0m2.068s

I would expect all OoO SSE machines to have the same problem, and testing on AMD Jaguar generally shows that. But in this particular case, performance gets worse. This is at nominal 1.5GHz:

lessxor: user	0m11.916s
morexor: user	0m12.795s

I don't have an explanation for that loss yet. The loop is optimally aligned on a 64-byte boundary. The extra xorps adds 3 bytes, causing the inner loop to grow from an even 80 bytes to 83 bytes. Perhaps the extra bytes require an additional ifetch operation that somehow slows down the whole chain?

Note that the partial update problem is limited to SSE codegen. If we use -mavx, we generate the 'v' versions of the conversion instructions. Those do not have partial reg update problems, so we don't need any (v)xorps instructions in the loop. Performance for the AVX versions of the program is slightly better than the best SSE case for both CPUs.

Here's the asm code for the inner loop that I'm testing with:

LBB0_4:                                 ## %for.body6
                                        ##   Parent Loop BB0_2 Depth=1
                                        ## =>  This Inner Loop Header: Depth=2
  movl  (%rbx), %eax
  movzbl  %ah, %edi  # NOREX
  xorps %xmm4, %xmm4  <--- this gets generated with the current '16' clearance setting
  cvtsi2ssl %edi, %xmm4
  mulss %xmm0, %xmm4
  movl  %eax, %edi
  shrl  $16, %edi
  movzbl  %dil, %edi
  xorps %xmm5, %xmm5  <--- this is the added instruction generated by this patch
  cvtsi2ssl %edi, %xmm5
  mulss %xmm1, %xmm5
  addss %xmm4, %xmm5
  shrl  $24, %eax
  xorps %xmm4, %xmm4  <--- this gets generated with the current '16' clearance setting
  cvtsi2ssl %eax, %xmm4
  mulss %xmm2, %xmm4
  addss %xmm5, %xmm4
  cvtss2sd  %xmm4, %xmm4
  addsd %xmm3, %xmm4
  cvttsd2si %xmm4, %eax
  movb  %al, (%rsi)
  addq  $4, %rbx
  incq  %rsi
  decl  %r15d
  jne LBB0_4
danielcdh updated this revision to Diff 61852.Jun 24 2016, 3:47 PM
danielcdh edited edge metadata.

Put the magic number in an option.

spatel added inline comments.Jun 26 2016, 4:59 PM
lib/Target/X86/X86InstrInfo.cpp
61–66

The option is used for both partial reg clearance and undef reg clearance. It should have a more generic name and description, or we should have 2 independent variables. I'm fine either way, but what the patch has in this version is misleading.

5981

Don't hard-code any specific value into the comment, since the value above may change.

danielcdh updated this revision to Diff 61982.Jun 27 2016, 10:27 AM

update comments and options

Thanks.

Earlier in the thread, Joerg requested more general perf data. Has that request been fulfilled or cancelled?

It would be interesting to know what the optimal setting is for at least one benchmark, so you can leave a comment by the cl::opt definition to justify the default setting.

Also, we don't have to directly answer the MinSize question that I raised in this patch, but my perf data shows that we really shouldn't be adding xors in that situation - we hope these are free, but they are not in all cases. IMO, our documentation for -Oz is wrong. There shouldn't be any room for interpretation about the definition of "minimum"; this is the flag used by devs (embedded, boot code, etc.) who require the smallest possible code, and perf shouldn't matter.

danielcdh added a comment (edited).Jun 27 2016, 6:28 PM

I just tested the perf impact of this patch on the SPEC CPU2006 integer benchmarks (perf tests were done on Sandy Bridge):

400.perlbench 2.21%
401.bzip2 -0.46%
403.gcc 1.59%
429.mcf 1.79%
445.gobmk 1.52%
456.hmmer 1.63%
458.sjeng 5.62%
462.libquantum 1.54%
464.h264ref -1.28%
471.omnetpp 0.00%
473.astar 2.35%
483.xalancbmk -0.27%
overall 1.29%

About the -Oz issue, I agree that there should be no xor inserted at -Oz. But that's a different issue and should be fixed in a separate patch.

What platform, Ivy Bridge?

David

spatel accepted this revision.Jun 28 2016, 9:04 AM
spatel edited edge metadata.

I just tested the perf impact of this patch on the SPEC CPU2006 integer benchmarks (perf tests were done on Sandy Bridge):

400.perlbench 2.21%
401.bzip2 -0.46%
403.gcc 1.59%
429.mcf 1.79%
445.gobmk 1.52%
456.hmmer 1.63%
458.sjeng 5.62%
462.libquantum 1.54%
464.h264ref -1.28%
471.omnetpp 0.00%
473.astar 2.35%
483.xalancbmk -0.27%
overall 1.29%

That is a surprisingly large gain. This raises the question: if 64 is good, is 128 even better?

About the -Oz issue, I agree that there should be no xor inserted at -Oz. But that's a different issue and should be fixed in a separate patch.

Agreed. Please let me know if you plan to follow up on that part.

I have no further comments on this patch, so LGTM.

This revision is now accepted and ready to land.Jun 28 2016, 9:04 AM
danielcdh closed this revision.Jun 28 2016, 2:26 PM