Page MenuHomePhabricator

[MIR] Add simple PRE pass to MachineCSE

Authored by anton-afanasyev on Jan 16 2019, 12:15 AM.


This is the second part of the commit fixing PR38917 (hoisting
partitially redundant machine instruction). Most of PRE (partitial
redundancy elimination) and CSE work is done on LLVM IR, but some of
redundancy arises during DAG legalization. Machine CSE is not enough
to deal with it. This simple PRE implementation works a little bit
intricately: it passes before CSE, looking for partitial redundancy
and transforming it to fully redundancy, anticipating that the next
CSE step will eliminate this created redundancy. If CSE doesn't
eliminate this, than created instruction will remain dead and eliminated
later by Remove Dead Machine Instructions pass.

The third part of the commit is supposed to refactor MachineCSE,
to make it more clear and to merge MachinePRE with MachineCSE,
so one need no rely on further Remove Dead pass to clear instrs
not eliminated by CSE.

First step:


Diff Detail


Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Should I clarify anything?


FYI: Generally accepted ping frequency is once per week.

54 ↗(On Diff #181982)

Our general style is not to use line continuations like this, but instead rely on adjacent string concatenation, so this would be:

STATISTIC(NumPREs,      "Number of partial redundant expression"
                            " transformed to fully redundant");

(where the starting quotes are vertically aligned)

754 ↗(On Diff #181982)

Comments should be full sentences and end with a period (or other punctuation). Also, please explain *why* the conditions are stronger in this case.

790 ↗(On Diff #181982)

It looks like this could be:

if (!PREMap.count(MI)) {
  PREMap[MI] = MBB;

and then you don't need the else and extra associated indentation.

812 ↗(On Diff #181982)

It would be better to not create, and then delete, instructions speculatively. isProfitableToCSE does look at the parent blocks of the instructions, but we could abstract that by:

  1. Making isProfitableToCSE also take the blocks of the instructions as arguments; explain that isProfitableToCSE treats the instructions as though they are in the specified parent blocks (even if they're not currently).
  2. Make an overload isProfitableToCSE which retains the current behavior (where the parent blocks come from the instructions themselves) which calls the underlying implementation from (1).
anton-afanasyev marked 8 inline comments as done.Jan 24 2019, 2:22 AM


FYI: Generally accepted ping frequency is once per week.

Ok, thanks!

54 ↗(On Diff #181982)

Ok, fixed.

754 ↗(On Diff #181982)

Ok, changed.

790 ↗(On Diff #181982)

Thanks, that is better! Fixed.

812 ↗(On Diff #181982)

Thanks, changed. Actually isProfitableToCSE() doesn't need CSMI (new instr) as argument, only its basic block, so just changed function arglist.

anton-afanasyev marked 4 inline comments as done.

Have you measured the compile-time impact?

755 ↗(On Diff #183272)

I think it would be useful to elaborate on under what conditions we create these PRE instructions that are not later CSEd. Do you know how often that happens (I assume that you can figure this out by looking at the statistics and comparing runs where this happens vs. where it doesn't)?

anton-afanasyev marked an inline comment as done.Jan 29 2019, 2:54 PM

Have you measured the compile-time impact?

What do you mean by compile-time impact? I've run test suite of course, compared with baseline.
The results for exec_time and size..text metrics are like here:

> ~/llvm/test-suite/utils/  --filter-short -m exec_time ~/llvm/test-suite-results/results_rel_base.json ~/llvm/test-suite-results/results_rel_exp3.json 
Program                                         results_rel_base  results_rel_exp3 diff  
MicroBench...RayFDRMultiThreaded/threads:32   1055.94           931.21            -11.8%
Bitcode/Be...rid/halide_bilateral_grid.test    30.26             26.70            -11.8%
MultiSourc...mbolics-flt/Symbolics-flt.test     1.10              0.98            -10.9%
MicroBench...XRayFDRMultiThreaded/threads:2   497.85            547.92            10.1% 
MicroBench...lcalsCRaw.test:BM_EOS_RAW/5001    10.33              9.49            -8.2% 
MicroBench...bda.test:BM_PIC_1D_LAMBDA/5001    75.75             81.43             7.5% 
MicroBench...XRayFDRMultiThreaded/threads:8   1029.95           1100.16            6.8% 
MicroBench...XRayFDRMultiThreaded/threads:4   626.46            667.17             6.5% 
MicroBench...ambda.test:BM_EOS_LAMBDA/44217    87.21             92.66             6.2% 
MultiSourc...flt/LoopRestructuring-flt.test     3.76              3.53            -6.2% 
MicroBench...calsCRaw.test:BM_EOS_RAW/44217    85.94             90.96             5.8% 
MicroBench...Lambda.test:BM_ADI_LAMBDA/5001   128.27            135.35             5.5% 
MicroBench...est:BM_ReturnNeverInstrumented     1.60              1.68             5.3% 
MicroBench....test:BM_MULADDSUB_LAMBDA/5001    14.14             14.84             4.9% 
SingleSour...ncils/fdtd-apml/fdtd-apml.test     0.63              0.66             4.4%
> ~/llvm/test-suite/utils/  --filter-short -m size..text ~/llvm/test-suite-results/results_rel_base.json ~/llvm/test-suite-results/results_rel_exp3.json 
Program                                         results_rel_base  results_rel_exp3 diff 
SingleSour...marks/Shootout/Shootout-random   658.00            642.00            -2.4%
SingleSour...ce/UnitTests/SignlessTypes/rem   4834.00           4930.00            2.0%
SingleSour...ootout-C++/Shootout-C++-random   818.00            802.00            -2.0%
MultiSource/Benchmarks/Olden/mst/mst          2658.00           2706.00            1.8%
SingleSource/Benchmarks/Misc/ffbench          2322.00           2290.00           -1.4%
MultiSource/Benchmarks/NPB-serial/is/is       4946.00           4898.00           -1.0%
MultiSource/Benchmarks/McCat/18-imp/imp       13266.00          13138.00          -1.0%
MultiSourc...rks/McCat/03-testtrie/testtrie   1810.00           1794.00           -0.9%
MultiSourc...chmarks/McCat/01-qbsort/qbsort   2066.00           2050.00           -0.8%
MultiSourc...iabench/g721/g721encode/encode   7394.00           7346.00           -0.6%
SingleSource/Benchmarks/Misc/fbench           2546.00           2562.00            0.6%
SingleSour...ce/Benchmarks/Misc/ReedSolomon   11170.00          11234.00           0.6%
SingleSource/Benchmarks/Stanford/Towers       2802.00           2786.00           -0.6%
MultiSourc...e/Applications/SIBsim4/SIBsim4   47666.00          47938.00           0.6%
MultiSourc...OE-ProxyApps-C/miniGMG/miniGMG   76258.00          76674.00           0.5%

Although I'm not sure exec_time is good metrics. I've also tried compile_time, never used it before:

~/llvm/test-suite/utils/  --filter-short -m compile_time ~/llvm/test-suite-results/results_rel_base.json ~/llvm/test-suite-results/results_rel_exp3.json 
Program                                         results_rel_base  results_rel_exp3 diff 
SingleSour.../Benchmarks/Misc-C++/Large/ray     0.68              0.74             7.6%
MultiSourc...Prolangs-C/simulator/simulator     1.56              1.64             5.7%
MultiSource/Benchmarks/McCat/18-imp/imp         0.82              0.78            -5.3%
MultiSourc...OE-ProxyApps-C/RSBench/rsbench     0.88              0.84            -4.6%
MultiSource/Benchmarks/Ptrdist/bc/bc            2.42              2.52             4.1%
SingleSour...enchmarks/CoyoteBench/fftbench     1.08              1.03            -4.1%
SingleSour...UnitTests/Vectorizer/gcc-loops     1.42              1.48             3.9%
MultiSourc...chmarks/Prolangs-C++/city/city     3.20              3.08            -3.9%
MultiSource/Benchmarks/sim/sim                  0.72              0.75             3.9%
MultiSourc...nchmarks/FreeBench/pifft/pifft     2.56              2.65             3.6%
MultiSource/Applications/spiff/spiff            1.58              1.64             3.5%     3.80              3.94             3.5%
MultiSourc...s/Prolangs-C/compiler/compiler     0.84              0.81            -3.3%
MultiSourc...rks/Prolangs-C++/employ/employ     0.74              0.71            -3.3%
SingleSour...ks/Misc-C++/stepanov_container     3.01              3.11             3.2%

Also one can see many tests increased their machine-cse.NumCSEs statistics (meaning that PRE step created common subexpressions eliminated by CSE):

> ~/llvm/test-suite/utils/  --filter-short -m machine-cse.NumCSEs ~/llvm/test-suite-results_deb_base.json ~/llvm/test-suite-results/results_deb_exp.json
Program                                         results_deb_base  results_deb_exp diff   
SingleSour...ce/UnitTests/SignlessTypes/rem     1.00             19.00           1800.0%
MultiSource/Benchmarks/llubenchmark/llu         1.00             10.00           900.0% 
MultiSource/Benchmarks/NPB-serial/is/is         1.00              9.00           800.0% 
SingleSource/Benchmarks/Misc/fbench             1.00              7.00           600.0% 
MultiSourc...OE-ProxyApps-C/XSBench/XSBench     2.00              8.00           300.0% 
SingleSource/Benchmarks/Misc/ffbench            2.00              8.00           300.0% 
SingleSour...chmarks/Misc-C++/stepanov_v1p2     1.00              3.00           200.0% 
MultiSourc...arks/FreeBench/distray/distray     7.00             20.00           185.7% 
MicroBench.../ImageProcessing/Dither/Dither     3.00              8.00           166.7% 
MultiSourc...ench/telecomm-FFT/telecomm-fft     5.00             13.00           160.0% 
MultiSourc...enchmarks/mafft/pairlocalalign   482.00            1214.00          151.9% 
MultiSourc...aran/netbench-url/netbench-url     3.00              7.00           133.3% 
MultiSourc...Benchmarks/SciMark2-C/scimark2     6.00             13.00           116.7% 
MultiSourc...nch/mpeg2/mpeg2dec/mpeg2decode    57.00            115.00           101.8% 
MultiSourc...chmarks/McCat/01-qbsort/qbsort     2.00              4.00           100.0%
755 ↗(On Diff #183272)

Thanks for pointing this out, I've cleaned this place. I've removed needless checks like MI->isReturn() after comparing statistics (no change).

Mostly, these checks are implicitly done while CSE work (not inside isCSECandidate()). I've left aux checks of virtual registers though -- they are also implicitly done.

I've also updated comment.

This looks artificially, but I'm going to refactor this place in the third (and last) step of this work -- merging PRE and CSE, so we need no different checks.

Fixed revision, forgot to add changes.

Ping. Does it have a chance to get lgtm?

Ping. Please, review this.

hfinkel added inline comments.Feb 26 2019, 12:03 PM
800 ↗(On Diff #185240)

Are these calls expensive?

SingleSour.../Benchmarks/Misc-C++/Large/ray 0.68 0.74 7.6%

That 7% increase in compile time, and a few other others, look significant.

(Maybe there's something you can cache?)

anton-afanasyev marked an inline comment as done.Feb 26 2019, 2:15 PM
anton-afanasyev added inline comments.
800 ↗(On Diff #185240)

I believe caching this gives nothing, cause two common subexpressions unlikely belong to the same basic blocks pair. 7% are only for one test among hundreds, there are also -5% and so on. I'm to run benchmarks again to compare results.

Hi @hfinkel,
Unfortunately, the results of compile_time are unstable and not reproducible. Below are two different runs.

The first run:

> ~/llvm/test-suite/utils/  --filter-short -m compile_time ~/llvm/test-suite-results/results_rel_base.json ~/llvm/test-suite-results/results_rel_exp.json 
Tests: 1163
Short Running: 1010 (filtered out)
Remaining: 153
Metric: compile_time

Program                                         results_rel_base  results_rel_exp diff 
MultiSourc...OE-ProxyApps-C/XSBench/XSBench     0.74              0.79            6.5%
MicroBench...y/ReturnReference/retref-bench     1.24              1.17           -6.1%
SingleSour...rks/Misc-C++/Large/sphereflake     0.71              0.68           -4.5%     0.72              0.76            4.4%
MultiSourc...chmarks/Prolangs-C/gnugo/gnugo     1.40              1.34           -4.3%
MultiSourc...rks/Prolangs-C++/employ/employ     0.76              0.73           -4.2%
MicroBench...arks/ImageProcessing/Blur/blur     1.61              1.55           -4.0%
MultiSourc...rks/Trimaran/enc-3des/enc-3des     0.62              0.65            3.8%
MultiSourc...rks/mediabench/gsm/toast/toast     2.29              2.38            3.7%
MultiSourc...Benchmarks/SciMark2-C/scimark2     0.79              0.82            3.6%
MicroBench.../ImageProcessing/Dilate/Dilate     1.32              1.36            2.7%
MicroBench...oopInterchange/LoopInterchange     0.74              0.72           -2.7%
MultiSourc...s/Prolangs-C/unix-tbl/unix-tbl     2.22              2.16           -2.7%
MultiSourc...rolangs-C/archie-client/archie     1.20              1.17           -2.7%
MultiSourc...marks/Prolangs-C/loader/loader     0.62              0.60           -2.6%
       results_rel_base  results_rel_exp        diff
count  153.000000        152.000000       152.000000
mean   7.235216          7.270763        -0.000904  
std    12.322023         12.328571        0.016021  
min    0.600000          0.604000        -0.061093  
25%    1.516000          1.535000        -0.007504  
50%    3.784000          3.800000         0.000000  
75%    5.568000          5.581000         0.006387  
max    88.724000         88.972000        0.064516

and the second run:

> ~/llvm/test-suite/utils/  --filter-short -m compile_time ~/llvm/test-suite-results/results_rel_base2.json ~/llvm/test-suite-results/results_rel_exp2.json 
Tests: 1163
Short Running: 1009 (filtered out)
Remaining: 154
Metric: compile_time

Program                                         results_rel_base2  results_rel_exp2 diff 
ultiSource/Benchmarks/McCat/09-vor/vor          0.66               0.62            -6.6%
ultiSource...Benchmarks/SciMark2-C/scimark2     0.80               0.75            -5.5%
ultiSource...rolangs-C/archie-client/archie     1.16               1.22             5.2%
ultiSource...OE-ProxyApps-C/XSBench/XSBench     0.79               0.76            -4.1%
ultiSource/Benchmarks/McCat/18-imp/imp          0.80               0.83             4.0%
ultiSource...OE-ProxyApps-C/RSBench/rsbench     0.87               0.90             3.7%
ultiSource...rks/Trimaran/enc-3des/enc-3des     0.66               0.63            -3.7%
ultiSource...hmarks/MallocBench/cfrac/cfrac     1.64               1.70             3.6%
icroBenchm.../ImageProcessing/Dilate/Dilate     1.37               1.32            -3.5%
ultiSource...ALAC/decode/alacconvert-decode     2.41               2.34            -3.0%
ultiSource...chmarks/Prolangs-C++/city/city     3.22               3.13            -3.0%
ultiSource...xyApps-C/Pathfinder/PathFinder     1.77               1.72            -2.9%
ultiSource...s/Prolangs-C/compiler/compiler     0.86               0.84            -2.8%
ultiSource...rks/Prolangs-C++/employ/employ     0.73               0.71            -2.7%
ultiSource/Applications/sgefa/sgefa             0.99               1.02             2.4%
       results_rel_base2  results_rel_exp2        diff
count  152.000000         153.000000        151.000000
mean   7.284842           7.236418         -0.002551  
std    12.346005          12.320806         0.015407  
min    0.600000           0.604000         -0.066265  
25%    1.567000           1.504000         -0.009324  
50%    3.794000           3.776000         -0.001112  
75%    5.609000           5.584000          0.005669  
max    89.004000          89.036000         0.051724

Also, I don't think the heavy functions here (like findNearestCommonDominator() and isPotentiallyReachable()) can significant impact compile time -- they are called rarely only for two BBs with common subexpressions, which is unusual. Taking the machine-cse.NumCSE statistics above as example, one can make an estimation of no more than 10-100 calls of these functions for the benchs from test suite. Also, these heavy calls are already cached in a sense, they are using prebuilt Machine Dominator Tree.

Hi all, please review this.

@hfinkel, should I clarify anything?

Rebase. Please, review this.

Herald added a project: Restricted Project. · View Herald TranscriptApr 24 2019, 3:06 AM
Herald added a subscriber: hiraditya. · View Herald Transcript

@hfinkel Any more comments?

RKSimon accepted this revision.Apr 30 2019, 6:01 AM

LGTM - based on the improved codegen and the negligible effect on compile time. Thanks.

784 ↗(On Diff #196406)

Very minor - but why can't the ++I be moved inside the for-loop increment statement?

This revision is now accepted and ready to land.Apr 30 2019, 6:01 AM
anton-afanasyev marked an inline comment as done.Apr 30 2019, 6:26 AM
anton-afanasyev added inline comments.
784 ↗(On Diff #196406)

It can, I'm to fix it, thanks.

Small fix and tests update

RKSimon added inline comments.Apr 30 2019, 8:11 AM
784 ↗(On Diff #196406)

This one too?

anton-afanasyev marked an inline comment as done.Apr 30 2019, 8:16 AM
anton-afanasyev added inline comments.
784 ↗(On Diff #196406)

Hmm, looks like I've eventually fixed L506 -- the original loop inside ProcessBlockCSE() instead of ProcessBlockPRE().

Update, clang-format

This revision was automatically updated to reflect the committed changes.
spatel added a comment.May 3 2019, 7:20 AM

We should reopen this review since the patch was reverted at rL359875?

anton-afanasyev reopened this revision.May 3 2019, 7:51 AM
This revision is now accepted and ready to land.May 3 2019, 7:51 AM
john.brawn reopened this revision.Jun 11 2019, 4:33 AM
john.brawn added a subscriber: john.brawn.

This is causing incorrect code generation for this piece of IR:

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-arm-none-eabi"

@var = hidden local_unnamed_addr global i32 0, align 4
@_ZTIi = external dso_local constant i8*
declare dso_local void @_Z2fnv() local_unnamed_addr #1
declare dso_local i32 @__gxx_personality_v0(...)
declare i32*) #2
declare dso_local i8* @__cxa_begin_catch(i8*) local_unnamed_addr
declare dso_local void @__cxa_end_catch() local_unnamed_addr

define hidden i32 @_Z7examplev() personality i8* bitcast (i32 (...)* @__gxx_personality_v0 to i8*) {
  invoke void @_Z2fnv()
          to label %try.cont unwind label %lpad

lpad:                                             ; preds = %entry
  %0 = landingpad { i8*, i32 }
          catch i8* bitcast (i8** @_ZTIi to i8*)
          catch i8* null
  %1 = extractvalue { i8*, i32 } %0, 0
  %2 = extractvalue { i8*, i32 } %0, 1
  %3 = tail call i32* bitcast (i8** @_ZTIi to i8*))
  %matches = icmp eq i32 %2, %3
  %4 = tail call i8* @__cxa_begin_catch(i8* %1)
  %5 = load i32, i32* @var, align 4
  br i1 %matches, label %catch1, label %catch

catch1:                                           ; preds = %lpad
  %or3 = or i32 %5, 4
  store i32 %or3, i32* @var, align 4
  tail call void @__cxa_end_catch()
  br label %try.cont

try.cont:                                         ; preds = %entry, %catch1, %catch
  %6 = load i32, i32* @var, align 4
  ret i32 %6

catch:                                            ; preds = %lpad
  %or = or i32 %5, 8
  store i32 %or, i32* @var, align 4
  tail call void @__cxa_end_catch()
  br label %try.cont

As part of the accesses to var an ADRP instruction is being generated and MachineCSE is hoisting it to the entry block, but it gets hoisted to after the call to fn so when we catch an exception the ADRP hasn't been executed so the loads and stores use an undefined base register.

This revision is now accepted and ready to land.Jun 11 2019, 4:33 AM
anton-afanasyev added a comment.EditedJun 11 2019, 4:42 AM

This is causing incorrect code generation for this piece of IR:

As part of the accesses to var an ADRP instruction is being generated and MachineCSE is hoisting it to the entry block, but it gets hoisted to after the call to fn so when we catch an exception the ADRP hasn't been executed so the loads and stores use an undefined base register.

Hi @john.brawn , thanks, I'm looking into it. At first glance, this instruction should have special flag to being non-hoistable.

anton-afanasyev added a comment.EditedJun 11 2019, 8:23 AM

The issue actually is not with this instruction, but it's related to exception handling. Hoisted instruction is inserted before getFirstTerminator(), but there could be EH_LABEL's which are not terminators, but could change control flow.
I'm to fix it.

Hi @john.brawn , here is the fix, could you please look to it:

Last issue fixed by this revision:

LuoYuanke added inline comments.

Do we need to enhance the algorithm to consider more about register pressure on the profit calculation? I'm afraid there is performance drop when the register pressure is heavy.

anton-afanasyev marked 2 inline comments as done.Jun 21 2019, 8:19 AM
anton-afanasyev added inline comments.

Hi @LuoYuanke, yes, this could be the case. Actually this commit doesn't change profit calculation, ProcessBlockPRE() uses the same isProfitableToCommit() function as ProcessBlockCSE() uses. Do you have concrete test cases where register pressure increases?

LuoYuanke added inline comments.Jun 22 2019, 6:15 PM

CSE only eliminate MI, but current PRE insert MI to the common dominated block, so the instruction is hoisted to dominated block. It increase register pressure more than CSE.
I have some cases that got performance regression with this patch due to the register pressure, but I am not able to extract a small case from it.
I notice in LICM, MachineLICMBase::IsProfitableToHoist() is more considerate for the register pressure. I wonder if PRE can apply the same algorithm. Do you have any idea for a better register pressure solution?

anton-afanasyev marked 3 inline comments as done.Jun 23 2019, 2:58 PM
anton-afanasyev added inline comments.

Hi @LuoYuanke, actually PRE doesn't hoist instruction, but _duplicates_ it, real hoisting is made by CSE when it _eliminates_ instruction. I do believe that CSE should consider register pressure while eliminating instruction.

Have you tried MachineLICMBase::isProfitableToHoist() with your test case? We can adopt and insert it to CSE.

LuoYuanke added inline comments.Jul 1 2019, 6:14 PM

Hi @anton-afanasyev, sorry for reply late. I'm trying to figure out this issue, but it seems complex for me. Can you run SPEC cpu2017/500.perlbench_r? There is some significant performance drop on X86 with the patch. If you have resource to run the benchmark, could you help to look into the issue?

Did you look into the SPEC cpu2017/500.perlbench_r issue? There is some significant performance drop on X86 with the patch. I ask you to revert the patch first, and when the SPEC2017 regression is fixed, we can submit the patch again. How do you think?

anton-afanasyev marked 3 inline comments as done.Jul 19 2019, 12:41 AM

Did you look into the SPEC cpu2017/500.perlbench_r issue? There is some significant performance drop on X86 with the patch. I ask you to revert the patch first, and when the SPEC2017 regression is fixed, we can submit the patch again. How do you think?

Hi @LuoYuanke, I'm sorry, it looks like my previous answer to you remained unsubmitted.
Here it is:

Hi @LuoYuanke , I have only access to SPEC cpu2006, I see 400.perlbench there. Could it be the case you know regressed sample from cpu2006?
Also, I have prepared new patch for review: -- I believe it can potentially decrease reg pressure.

At the moment, I believe that another patch (from Kai Luo) fixes this regression.

Btw, this fix ( was commited recently.

lkail added a subscriber: lkail.Jul 20 2019, 8:07 AM
lkail added inline comments.

Hi @anton-afanasyev , I have a concern here that CFG of LLVM IR might not be equivalent to CFG of Machine IR.

Do you have any performance data for the patch? I'd like to know what benchmark has performance gain with your patch. fixed perlbench regression, but I wonder what the performance gain do we achieve with the 2 patch?

anton-afanasyev marked an inline comment as done.Jul 23 2019, 8:26 AM

Do you have any performance data for the patch? I'd like to know what benchmark has performance gain with your patch. fixed perlbench regression, but I wonder what the performance gain do we achieve with the 2 patch?

Hi @LuoYuanke, I've benchmarked the first patch and posted results in two posts started from here:
But you are right: with the second patch the performance gain may be eliminated! I'm to measure second patch effect.


Hi @lkail, yes this is possible case. But that is not an issue actually, for this case in place it would be just non-optimization. PRE will create instruction, but if CSE doesn't eliminate original instructions, than "hoisted" instruction will be deleted by Remove Dead Machine Instructions pass later.

Hi @LuoYuanke, I've benchmarked the effect of the revertion my and @lkail patches.
The benchmark showed some increase of the exec_time:

~/llvm/test-suite/utils/  --filter-short -m exec_time ~/llvm/test-suite-results/results_rel_base.json ~/llvm/test-suite-results/results_rel_base2.json vs ~/llvm/test-suite-results/results_rel_exp.json  ~/llvm/test-suite-results/results_rel_exp2.json 

Program                                        lhs    rhs     diff 
 test-suite...Raw.test:BM_MULADDSUB_RAW/5001    13.96  17.35  24.3%
 test-suite...XRayFDRMultiThreaded/threads:4   568.16 634.38  11.7%
 test-suite...RayFDRMultiThreaded/threads:32   858.02 944.45  10.1%
 test-suite...XRayFDRMultiThreaded/threads:2   464.93 501.92   8.0%
 test-suite...algebra/kernels/symm/symm.test    14.00  15.12   7.9%    39.41  37.34  -5.2%
 test-suite...test:BM_MULADDSUB_LAMBDA/44217   141.97 148.38   4.5%
 test-suite...Lambda.test:BM_EOS_LAMBDA/5001     9.88   9.48  -4.0%
 test-suite...ambda.test:BM_EOS_LAMBDA/44217    89.08  85.78  -3.7%
 test-suite...s/Halide/blur/halide_blur.test     1.77   1.83   3.6%
 test-suite...BenchmarkGame/Large/fasta.test     0.72   0.70  -3.3%
 test-suite...lcalsCRaw.test:BM_EOS_RAW/5001     9.86   9.54  -3.3%
 test-suite...calsCRaw.test:BM_EOS_RAW/44217    88.90  86.00  -3.3%
 test-suite...XRayFDRMultiThreaded/threads:8   983.92 1014.32  3.1%
 test-suite...RayFDRMultiThreaded/threads:16   979.55 1008.68  3.0%

So I'd like to leave these changes unreverted and to proceed with particular patches (like this one Please tell if you have any objections.

@anton-afanasyev, I have no objection. Thank you for the effort to improve the performance.