[X86] Enable call frame optimization ("mov to push") not only for optsize (PR26325)
The size savings are significant, and from what I can tell, both ICC and GCC do this.
Differential Revision: http://reviews.llvm.org/D18573
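For readers unfamiliar with the "mov to push" wording, here is a rough sketch of where the size savings come from. It tallies the encoded size of a hypothetical 32-bit cdecl call `f(1, 2, 3)` lowered both ways. The function `f`, its arguments, and the assumption that the mov form reserves its outgoing argument area in the prologue are illustrative and not taken from the patch. The byte counts are the usual IA-32 encodings for small immediates, not measurements.

```python
# Illustrative only: a hypothetical call f(1, 2, 3) on 32-bit x86 (cdecl).
# Each entry is (assembly text, encoded size in bytes).

# Stores into a pre-reserved outgoing argument area (assumed to be
# allocated once in the function prologue, so no per-call adjustment).
MOV_LOWERING = [
    ("movl $1, (%esp)", 7),    # C7 04 24 imm32
    ("movl $2, 4(%esp)", 8),   # C7 44 24 disp8 imm32
    ("movl $3, 8(%esp)", 8),   # C7 44 24 disp8 imm32
    ("calll f", 5),            # E8 rel32
]

# Pushes the arguments right to left, then pops them after the call.
PUSH_LOWERING = [
    ("pushl $3", 2),           # 6A imm8
    ("pushl $2", 2),           # 6A imm8
    ("pushl $1", 2),           # 6A imm8
    ("calll f", 5),            # E8 rel32
    ("addl $12, %esp", 3),     # 83 C4 imm8, caller cleans up the stack
]


def encoded_size(sequence):
    """Sum the encoded sizes of an instruction sequence."""
    return sum(size for _, size in sequence)


mov_bytes = encoded_size(MOV_LOWERING)
push_bytes = encoded_size(PUSH_LOWERING)
print(f"mov lowering:  {mov_bytes} bytes")
print(f"push lowering: {push_bytes} bytes")
```

The push form is shorter per call site in this sketch, but every push also modifies %esp implicitly, which is the hazard discussed in the comments.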
What hardware was this change benchmarked on? The last time I looked at this was a few years ago, but back then the increased implicit hazards on %esp introduced noticeable performance regressions. Have recent CPUs optimised the microcode for this? The Intel optimisation guide used to explicitly say not to do this, but I've not checked recently.

I didn't benchmark it. My justification for the change is that GCC, ICC, and MSVC all lower calls like this, so it seems like the way to do it, and the binary size savings are significant.

According to the latest version of the Intel 64 and IA-32 Architectures Optimization Reference Manual, my concerns are not valid for the latest microarchitectures as the Stack Pointer Tracker eliminates the implicit dependencies. I am not sure that this applies to AMD / Centaur microarchitectures and how recent the addition of this to Intel chips is. I'd like to see some benchmarks, as this had a 5-10% performance penalty last time I looked at it.

Was the 5-10% regression you mention on a micro-benchmark or a full benchmark? I don't have access to the architectures you mention, and I'm not sure what you'd consider the most relevant benchmark, but here are some numbers from V8, which is relevant for me.
Binary size before my patch: 23,248,074 b, after: 22,953,162 b (-294,912 b, or -1.3%).
Octane before (higher numbers are better): Richards-octane2.1(Score): 27391
Octane after: Richards-octane2.1(Score): 26621
Some tests scored lower with my patch (Richards, DeltaBlue, RayTrace), but the rest improved. I'd say it's in the noise, but at least the noise seems to point in a favourable direction for my change :-)
This was a 32-bit V8 release build on Linux, running on Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz.

5% overall on macrobenchmarks - adding a pipeline stall on every function entry and exit adds up a lot. Good to see that it doesn't happen on modern hardware, but I'd still be interested in seeing what impact it has on AMD, Centaur, or older Intel hardware.
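The V8 figures quoted in the thread can be checked with a few lines of arithmetic. This is only a sanity check on the numbers as posted (binary size and the single Richards score). It says nothing about the other Octane tests, whose scores are not shown here.

```python
# Numbers copied from the V8 comment in this thread.
size_before = 23_248_074   # bytes, before the patch
size_after = 22_953_162    # bytes, after the patch

size_delta = size_after - size_before
size_pct = 100.0 * size_delta / size_before

richards_before = 27391    # Octane Richards score (higher is better)
richards_after = 26621

richards_pct = 100.0 * (richards_after - richards_before) / richards_before

print(f"binary size: {size_delta:+,} bytes ({size_pct:+.1f}%)")
print(f"Richards:    {richards_pct:+.1f}%")
```

The size delta matches the quoted -294,912 bytes (about -1.3%), and the Richards score drops by roughly 2.8%, consistent with the comment that Richards was one of the tests that scored lower.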