LLVM assumes that a large clearance will hide the partial register update penalty. But in our experiment, a clearance of 16 is too small. Since the inserted XOR is normally fairly cheap, we should use a higher clearance threshold and insert XORs more aggressively where they are needed to break partial register dependencies.
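For context, here is a minimal sketch of the false dependence in question (illustrative only, not taken from the patch; the function name is made up and the asm comments assume x86-64 SysV SSE codegen):

// cvtsi2ss writes only the low 32 bits of its XMM destination and leaves the
// upper bits unchanged, so it carries a dependence on whatever last wrote
// that register even though the old value is never used. The compiler can
// break that false dependence by zeroing the register first: the hardware
// recognizes "xorps %xmmN, %xmmN" as a zeroing idiom with no input
// dependence.
float intToFloat(int x) {
  // Typical SSE codegen for the return below:
  //   xorps     %xmm0, %xmm0    ; emitted only if the clearance heuristic fires
  //   cvtsi2ssl %edi, %xmm0
  return static_cast<float>(x);
}

The testcase later in this thread has three such int-to-float conversions per loop iteration, and the clearance threshold is what decides whether each of them gets its own dependency-breaking xorps.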
This mostly makes sense to me. 16 is as much of a magic number as 64, so if this is supported by performance numbers, it's fine.
The problem is that we're using the number of instructions between the write and the read as a proxy for "the latency of the critical chain between the write and the read". And it's not a very good proxy, since how correct we are depends on how much ILP the loop (assuming this is a loop-carried dependency) has. So any number we pick here will be just hand-waving.
Adding some Intel people in case they have more input.
I think this is a good change. The cost of inserting an xor when it is not needed is very small. The cost of failing to insert an xor when it IS needed can be huge.
But fixing the threshold isn't good enough for several reasons.
(1) It isn't always possible, because the register may hold a live value. Marina (added) is working on a patch that will fix this problem.
(2) If the instruction has a "real" XMM operand, it is better to choose that register for the undef operand rather than inserting an xor. (Doing so hides the false dependence behind a true dependence that is unavoidable.) This applies to instructions like vcvtsd2ss et al. Marina is planning to work on this also.
-Dave
lib/Target/X86/X86InstrInfo.cpp | |
---|---|
5988 | minor nit, instruction's --> instructions' |
Here is a testcase extracted from an internal benchmark, which runs 2% faster on Sandy Bridge if the clearance threshold is changed from 16 to 64. For this extracted testcase, when the threshold is set to 16, the cvtsi2ss for "g" does not get an xor inserted to break the dependency. As a result, the inner loop takes 13 cycles (compared with 11 cycles when the dependency is broken for all of "r", "g" and "b").
int datas[10000];
int datad[10000];

void foo(float r, float g, float b, int s, int d, int h, int w, int *datas, int *datad) __attribute__((noinline));

void foo(float r, float g, float b, int s, int d, int h, int w, int *datas, int *datad) {
  int i, j;
  for (i = 0; i < h; i++) {
    int *lines = datas + i * s;
    int *lined = datad + i * d;
    for (j = 0; j < w; j++) {
      int word = *(lines + j);
      int val = (int)(r * ((word >> 8) & 0xff) + g * ((word >> 16) & 0xff) +
                      b * ((word >> 24) & 0xff) + 0.5);
      *((char *)(lined) + j) = val;
    }
  }
}

int main() {
  for (int i = 0; i < 100000; i++)
    foo(2.0, 3.0, 4.0, 100, 100, 100, 100, datas, datad);
  return 0;
}
I don't have public benchmark results for this change yet. For internal benchmarks, no noticeable code size change has been observed. The change gives a 2% speedup on the benchmark that motivated this patch and has no performance impact on any other internal benchmark.
It's great that Marina is working on a better fix for this problem. What is the ETA for that fix? If it's coming pretty soon, we should abandon this change. Otherwise, if the change does no harm, let's just update the magic number so that we can reclaim the performance on the motivating benchmark. Thoughts?
Thanks,
Dehao
I wasn't suggesting that Marina's fixes should replace this one. Sorry for that confusion. My point was simply that this is just one piece of a more comprehensive solution to the false dependence problem.
I think we should go ahead with this change. But I agree with davidxl that you should get an 'ok' from the original author, if possible.
Thanks,
-Dave
Since even the code comments acknowledge that this is a repeated magic number, please give it a name. Even better would be to make it a cl::opt, so we can run experiments more easily.
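A rough sketch of what such a cl::opt could look like in X86InstrInfo.cpp (the option name, description, and default below are illustrative, not taken from the patch):

#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Hypothetical command-line knob for the clearance threshold; making it a
// cl::opt lets us sweep values without rebuilding the compiler.
static cl::opt<unsigned>
UndefRegClearance("undef-reg-clearance",
                  cl::desc("Number of idle instructions we require before "
                           "certain undef register reads"),
                  cl::init(64), cl::Hidden);

With something like that in place, experiments become a matter of passing, e.g., -mllvm -undef-reg-clearance=128 on the clang command line (flag name hypothetical, matching the sketch above).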
FWIW, I don't see any measurable perf difference for the example test case (compiled with -O2) running on Haswell.
Is there already a regression test to show that we're not adding instructions if we're in MinSize mode?
For reference, some bug reports about generating xorps here:
https://llvm.org/bugs/show_bug.cgi?id=22024
https://llvm.org/bugs/show_bug.cgi?id=25277
https://llvm.org/bugs/show_bug.cgi?id=26491
https://llvm.org/bugs/show_bug.cgi?id=27573
> Is there already a regression test to show that we're not adding instructions if we're in MinSize mode?
I believe we currently add the xors in MinSize mode, and, as I wrote on PR22024, I think it's justified.
I guess it's a question of how -Oz is defined - whether it's "make code smaller at all costs", or (as the manual currently says):
- -Os: Like -O2 with extra optimizations to reduce code size.
- -Oz: Like -Os (and thus -O2), but reduces code size further.
My gut feeling is that omitting the xor is not a good performance/size trade-off even for -Oz.
Yes, I pasted the links before I scrolled through the comments.
In that case, let me negate my question. :)
Is there already a regression test to show that we ARE adding instructions if we're in MinSize mode?
The performance of the attached testcase running on Haswell (to reproduce, build with -O2 -fno-tree-vectorize):
$ time ./good.out
real 0m2.442s
user 0m2.437s
sys 0m0.001s
$ time ./bad.out
real 0m3.480s
user 0m3.475s
sys 0m0.002s
I missed -fno-tree-vectorize in my earlier experiment. With that setting, I am able to reproduce the Haswell win. This is at nominal 4GHz:
lessxor: user 0m2.977s
morexor: user 0m2.068s
I would expect all OoO SSE machines to have the same problem, and testing on AMD Jaguar generally shows that. But in this particular case, performance gets worse. This is at nominal 1.5GHz:
lessxor: user 0m11.916s
morexor: user 0m12.795s
I don't have an explanation for that loss yet. The loop is optimally aligned on a 64-byte boundary. The extra xorps adds 3 bytes, causing the inner loop to grow from an even 80 bytes to 83 bytes. Perhaps the extra bytes require an additional ifetch operation that somehow slows down the whole chain?
Note that the partial update problem is limited to SSE codegen. If we use -mavx, we generate the 'v' versions of the conversion instructions. Those do not have partial reg update problems, so we don't need any (v)xorps instructions in the loop. Performance for the AVX versions of the program is slightly better than the best SSE case on both CPUs.
Here's the asm code for the inner loop that I'm testing with:
LBB0_4:                                 ## %for.body6
                                        ##   Parent Loop BB0_2 Depth=1
                                        ## =>  This Inner Loop Header: Depth=2
        movl    (%rbx), %eax
        movzbl  %ah, %edi               # NOREX
        xorps   %xmm4, %xmm4            <--- this gets generated with the current '16' clearance setting
        cvtsi2ssl       %edi, %xmm4
        mulss   %xmm0, %xmm4
        movl    %eax, %edi
        shrl    $16, %edi
        movzbl  %dil, %edi
        xorps   %xmm5, %xmm5            <--- this is the added instruction generated by this patch
        cvtsi2ssl       %edi, %xmm5
        mulss   %xmm1, %xmm5
        addss   %xmm4, %xmm5
        shrl    $24, %eax
        xorps   %xmm4, %xmm4            <--- this gets generated with the current '16' clearance setting
        cvtsi2ssl       %eax, %xmm4
        mulss   %xmm2, %xmm4
        addss   %xmm5, %xmm4
        cvtss2sd        %xmm4, %xmm4
        addsd   %xmm3, %xmm4
        cvttsd2si       %xmm4, %eax
        movb    %al, (%rsi)
        addq    $4, %rbx
        incq    %rsi
        decl    %r15d
        jne     LBB0_4
lib/Target/X86/X86InstrInfo.cpp | |
---|---|
61–66 | The option is used for partial reg clearance and undef reg clearance. It should have a more generic name and description, or we should have 2 independent variables. I'm fine with either way, but what the patch has in this version is misleading. |
5986–5989 | Don't hard-code the comment to any specific value since it may change above. |
Thanks.
Earlier in the thread, Joerg requested more general perf data. Has that request been fulfilled or cancelled?
It would be interesting to know what the optimal setting is for at least one benchmark, so you can leave a comment by the cl::opt definition to justify the default setting.
Also, we don't have to answer the MinSize question that I raised in this patch directly, but my perf data shows that we really shouldn't be adding xors in that situation - we hope these xors are free, but they are not in all cases. IMO, our documentation for -Oz is wrong: there shouldn't be any room for interpretation about the definition of "minimum". This is the flag used by devs (embedded, boot code, etc.) who require the smallest possible code, and perf shouldn't matter.
I just tested the perf impact of this patch on the SPEC CPU INT2006 benchmarks (perf tests were done on Sandy Bridge):
400.perlbench 2.21%
401.bzip2 -0.46%
403.gcc 1.59%
429.mcf 1.79%
445.gobmk 1.52%
456.hmmer 1.63%
458.sjeng 5.62%
462.libquantum 1.54%
464.h264ref -1.28%
471.omnetpp 0.00%
473.astar 2.35%
483.xalancbmk -0.27%
overall 1.29%
About the -Oz issue, I agree that there should be no xor inserted at -Oz. But that's a different issue and should be fixed in a separate patch.
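If someone picks up that follow-up, one possible shape is sketched below. Nothing here is from this patch: the helper name is made up, and it assumes this tree's MachineFunction::getFunction() returns a pointer and that Function::optForMinSize() is available.

#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/IR/Function.h"
using namespace llvm;

// Hypothetical helper the clearance hooks could consult so that functions
// compiled for minimum size never get dependency-breaking xors, no matter
// what the clearance threshold says.
static bool allowDependencyBreakingXor(const MachineInstr &MI) {
  const MachineFunction &MF = *MI.getParent()->getParent();
  // At -Oz the extra xorps is pure code-size cost (3 bytes each in the loop
  // above), so skip it.
  return !MF.getFunction()->optForMinSize();
}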
That is a surprisingly large gain. This raises the question: if 64 is good, is 128 even better?
> About the -Oz issue, I agree that there should be no xor inserted at -Oz. But that's a different issue and should be fixed in a separate patch.
Agreed. Please let me know if you plan to follow-up on that part.
I have no further comments on this patch, so LGTM.