[X86] Enable call frame optimization ("mov to push") not only for optsize…

Description

[X86] Enable call frame optimization ("mov to push") not only for optsize (PR26325)

The size savings are significant, and from what I can tell, both ICC and GCC do this.

Differential Revision: http://reviews.llvm.org/D18573

What hardware was this change benchmarked on? The last time I looked at this was a few years ago, but back then the increased implicit hazards on %esp introduced noticeable performance regressions. Have recent CPUs optimised the microcode for this? The Intel optimisation guide used to explicitly say not to do this, but I've not checked recently.

hans added a comment.Apr 4 2016, 9:08 AM

> What hardware was this change benchmarked on? The last time I looked at this was a few years ago, but back then the increased implicit hazards on %esp introduced noticeable performance regressions. Have recent CPUs optimised the microcode for this? The Intel optimisation guide used to explicitly say not to do this, but I've not checked recently.

I didn't benchmark it. My justification for the change is that GCC, ICC, and MSVC all lower calls like this, so it seems like the way to do it, and the binary size savings are significant.

According to the latest version of the Intel 64 and IA-32 Architectures Optimization Reference Manual, my concerns do not apply to the latest microarchitectures, as the Stack Pointer Tracker eliminates the implicit dependencies. I am not sure whether this applies to AMD / Centaur microarchitectures, nor how recently Intel added it to their chips. I'd still like to see some benchmarks, as this had a 5-10% performance penalty last time I looked at it.

hans added a comment.Apr 4 2016, 3:15 PM

Was the 5-10% regression you mention on a micro-benchmark or a full benchmark?

I don't have access to the architectures you mention, and I'm not sure what you'd consider the most relevant benchmark, but here are some numbers from V8, which is relevant for me.

Binary size before my patch: 23,248,074 bytes; after: 22,953,162 bytes (-294,912 bytes, or -1.3%).

Octane before (higher numbers are better):

Richards-octane2.1(Score): 27391
DeltaBlue-octane2.1(Score): 51519
Crypto-octane2.1(Score): 24533
RayTrace-octane2.1(Score): 68005
EarleyBoyer-octane2.1(Score): 40427
RegExp-octane2.1(Score): 4632
Splay-octane2.1(Score): 18678
SplayLatency-octane2.1(Score): 29875
NavierStokes-octane2.1(Score): 26214
PdfJS-octane2.1(Score): 17575
Mandreel-octane2.1(Score): 19200
MandreelLatency-octane2.1(Score): 45320
Gameboy-octane2.1(Score): 42320
CodeLoad-octane2.1(Score): 10595
Box2D-octane2.1(Score): 31416
zlib-octane2.1(Score): 51697
Typescript-octane2.1(Score): 30467

Octane after:

Richards-octane2.1(Score): 26621
DeltaBlue-octane2.1(Score): 47645
Crypto-octane2.1(Score): 25604
RayTrace-octane2.1(Score): 66525
EarleyBoyer-octane2.1(Score): 42885
RegExp-octane2.1(Score): 4814
Splay-octane2.1(Score): 21905
SplayLatency-octane2.1(Score): 37410
NavierStokes-octane2.1(Score): 29147
PdfJS-octane2.1(Score): 19737
Mandreel-octane2.1(Score): 21306
MandreelLatency-octane2.1(Score): 52476
Gameboy-octane2.1(Score): 48493
CodeLoad-octane2.1(Score): 12187
Box2D-octane2.1(Score): 49634
zlib-octane2.1(Score): 56661
Typescript-octane2.1(Score): 31267

Some tests scored lower with my patch (Richards, DeltaBlue, RayTrace), but the rest improved. I'd say it's in the noise, but at least the noise seems to point in a favourable direction for my change :-)

This was a 32-bit V8 release build on Linux, running on an Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz.

5% overall on macrobenchmarks: adding a pipeline stall on every function entry and exit adds up quickly. Good to see that it doesn't happen on modern hardware, but I'd still be interested in seeing what impact it has on AMD, Centaur, or older Intel hardware.