Two of the biggest problems with the current PGO instrumentation are 1) the performance of instrumented multi-threaded programs and 2) the precision of profile counters under multi-threading. Both stem from contention on shared profile counter updates in hot regions of the program. In a multi-threaded program with a work-sharing loop, instrumentation makes increasing the number of threads actually slow the program down significantly -- from 10s elapsed time with one thread to more than 5 minutes with 16 threads. What is worse, the hottest block count is 4000000000 with 1 thread but drops to only 177119367 with 32 threads -- more than 95% of the counts are lost to data races. Using atomic RMW is one way to fix the precision problem, but it greatly slows down the instrumented program (see data below).
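To make the failure mode concrete, here is a small C++ sketch (mine, not code from the patch; counter, atomic_counter, and hot_block are illustrative names) of what the lowered counter update amounts to, next to the atomic alternative:

    #include <atomic>
    #include <cstdint>

    uint64_t counter;                      // what the instrumentation emits today
    std::atomic<uint64_t> atomic_counter;  // the atomic RMW alternative

    void hot_block() {
      // Plain non-atomic read-modify-write: concurrent threads read the
      // same stale value and lose each other's increments (the data race).
      counter = counter + 1;

      // Atomic fetch-add keeps the count exact, but the contended RMW on a
      // shared cache line serializes the threads and slows the program
      // down (see the data at the end).
      atomic_counter.fetch_add(1, std::memory_order_relaxed);
    }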
This patch implements loop-based register promotion for counter updates. It reuses the existing SSA updater utility (load/store promotion) and keeps the change isolated inside the lowerer, without exposing the aliasing properties of the counter variables to any of the existing optimizer components. The lowerer has full knowledge of the counters and requires very little analysis. It supports speculative code motion and works at lower optimization levels.
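Conceptually, the promotion has the effect sketched below (a hand-written C++ illustration of the transformation's effect, not the pass's actual IR output; prof_counter and work are placeholder names):

    #include <cstdint>

    extern uint64_t prof_counter;  // stands in for one profile counter slot
    void work(int i);              // stands in for the loop body

    void before(int n) {
      for (int i = 0; i < n; i++) {
        work(i);
        prof_counter += 1;  // memory load/add/store on every iteration
      }
    }

    void after(int n) {
      uint64_t tmp = prof_counter;  // promoted: one load in the preheader
      for (int i = 0; i < n; i++) {
        work(i);
        tmp += 1;                   // register update inside the loop
      }
      prof_counter = tmp;           // one store in the loop exit block
    }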
With this patch, the performance of the multi-threaded program mentioned above improves greatly. With one thread, it speeds the program up by 22%. With 16 threads, the elapsed time is only 0.9s, a more than 300x speedup over the build without the patch. Profile counter precision is also greatly improved: with 32 threads, the hottest block count is 3996000000, so only 0.1% of the counts are lost.
The patch speeds up single-threaded instrumented binary performance as well.
Here are the SPEC2000 int numbers:
    164.gzip      -0.81%
    175.vpr        2.05%
    176.gcc       11.08%
    181.mcf       -0.61%
    186.crafty    -3.46%
    197.parser     4.90%
    252.eon       18.00%
    253.perlbmk   11.21%
    255.vortex    -0.04%
    256.bzip2      8.67%
    300.twolf      3.89%
Here are the SPEC2006 numbers:
    400.perlbench   -1.87%
    401.bzip2       16.98%
    403.gcc          4.82%
    429.mcf         12.88%
    445.gobmk        1.83%
    456.hmmer       12.48%
    458.sjeng       -0.19%
    462.libquantum  28.09%
    464.h264ref      6.49%
    471.omnetpp      1.21%
    473.astar        8.31%
    483.xalancbmk    0.95%
    450.soplex      12.35%
    447.dealII       6.33%
    453.povray      -3.29%
    444.namd         1.88%
I did some analysis of the povray regression: the program has a few hot loops in which some blocks are guarded by input flags and are never executed. Hoisting the counter updates out of the loop therefore increases the dynamic instruction count, as the sketch below illustrates.
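Here is my reading of that pattern as a C++ sketch (hypothetical; flag, rare_path, work, and prof_counter are illustrative names, not code from povray or the patch):

    #include <cstdint>

    extern uint64_t prof_counter;
    extern bool flag;       // input-dependent, never true in this run
    void rare_path();
    void work(int i);

    void loop(int n) {
      // Original instrumentation: the guarded increment never executes,
      // so it costs nothing at run time.
      for (int i = 0; i < n; i++) {
        work(i);
        if (flag) { rare_path(); prof_counter += 1; }
      }
    }

    void loop_promoted(int n) {
      // With speculative promotion, the preheader load and exit store run
      // unconditionally even though the guarded block never does -- the
      // likely source of the extra dynamic instructions.
      uint64_t tmp = prof_counter;  // executes even if flag is never true
      for (int i = 0; i < n; i++) {
        work(i);
        if (flag) { rare_path(); tmp += 1; }
      }
      prof_counter = tmp;           // likewise
    }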
Lastly, here are the SPEC2000 numbers for using atomic fetch-add, compared against the baseline without this patch:
    164.gzip     -14.02%
    175.vpr      -16.02%
    176.gcc      -14.15%
    181.mcf       -4.48%
    186.crafty   -44.98%
    197.parser   -11.69%
    252.eon      -13.79%
    253.perlbmk    6.26%
    255.vortex    -4.70%
    256.bzip2     -4.06%
    300.twolf    -17.24%
One code cleanliness note: it may be more convenient to introduce a type alias, 'using LoadStorePair = ...'.
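For instance (a sketch only; the concrete element types are my assumption, not taken from the patch):

    #include <utility>
    #include "llvm/IR/Instructions.h"

    // Hypothetical alias for the (promoted load, matching store) pairs the
    // lowerer might track; the real member types may differ.
    using LoadStorePair = std::pair<llvm::LoadInst *, llvm::StoreInst *>;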