This is an archive of the discontinued LLVM Phabricator instance.

Improve the cost evaluation of LSR
Needs Review · Public

Authored by wmi on May 1 2015, 12:10 AM.

Details

Summary
  1. Motivation driven by testcases:

Existing LSR evaluates the cost of an LSRUse Formula and an LSR Solution by assigning weights to some indexes from high to low: NumRegs > AddRecCost > NumIVMuls > ... This doesn't work well for the testcase mentioned in https://llvm.org/bugs/show_bug.cgi?id=23384. For that testcase, on x86, AddRecCost should be weighted more heavily than NumRegs, because it translates directly into an increased instruction count, especially when the register pressure in the loop is low.
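In other words, the existing rating is essentially a lexicographic comparison in which NumRegs always dominates. A minimal sketch of that comparison scheme (a hypothetical struct with a subset of the fields, for illustration only, not the actual LSR code):

```cpp
#include <tuple>

// Simplified stand-in for LSR's cost record: components are compared
// lexicographically, so NumRegs dominates AddRecCost, which in turn
// dominates NumIVMuls, and so on down the list.
struct LSRCostSketch {
  unsigned NumRegs = 0;
  unsigned AddRecCost = 0;
  unsigned NumIVMuls = 0;
  unsigned NumBaseAdds = 0;
  unsigned ImmCost = 0;
  unsigned SetupCost = 0;

  bool operator<(const LSRCostSketch &Other) const {
    // std::tie builds tuples of references and compares them
    // element by element, left to right.
    return std::tie(NumRegs, AddRecCost, NumIVMuls, NumBaseAdds, ImmCost,
                    SetupCost) <
           std::tie(Other.NumRegs, Other.AddRecCost, Other.NumIVMuls,
                    Other.NumBaseAdds, Other.ImmCost, Other.SetupCost);
  }
};
```

With this ordering, a solution that saves one register is always preferred, even when it costs several extra AddRec updates per iteration, which is the behavior the testcase above runs into.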

This problem may not be apparent on other architectures. x86 supports complex addressing modes, so several load/store instructions can share the same AddRec by using a complex addressing mode. Architectures without complex addressing modes need extra instructions for loads/stores to share the same AddRec, so saving AddRecCost may not translate into saving instructions.

However, for architectures with pre/post-increment addressing modes, we may even want to increase the number of AddRecs. Here is another testcase:
int a[1000], b[1000], N, j;

void foo() {
  long i;
  for (i = 0; i < N; i += 1) {
    a[i + 12] = 15;
    b[i - 2] = 15;
  }
}

For AArch64, llvm trunk generates:

.LBB0_2: // %for.body

lsl     x13, x9, #2
add     x9, x9, #1              // =1
add     x14, x10, x13
add     x13, x12, x13
str     w11, [x14, #48]
stur    w11, [x13, #-8]
cmp     x9, x8
b.lt    .LBB0_2

With the patch attached, it generates more AddRecs but utilizes the post-increment addressing mode to reduce the number of instructions needed.
.LBB0_2: // %for.body

str     w12, [x10], #4
str     w12, [x11], #4
add     x9, x9, #1              // =1
cmp     x9, x8
b.lt    .LBB0_2

From the two testcases above, I think the existing cost evaluation, which uses NumRegs as the major index, may not be optimal, because instruction count is the most important factor in the cost. Understanding all the cases in which extra instructions are introduced, and modeling those cases in the LSR cost model, is the key to getting the best performance.

  2. Major changes in the patch:
    1. The weights of the new indexes from high to low: InstNumCost + SpillRegsCost > InstCmplxCost > SetupCost. InstNumCost is the number of extra instructions needed. SpillRegsCost is computed from the registers used, by Cost::UpdateSpillRegsCost (a simple version is used for now and will be improved in the future). InstCmplxCost reflects the complexity of the instructions in the loop, which affects the latency of instruction decoding and the available instruction execution ports. SetupCost is the cost of the instructions inserted in the loop preheader.
    2. InstNumCost is updated for some cases according to benchmark analysis on the x86 architecture. It is also updated to handle the second testcase above, so the pre/post-increment addressing modes on AArch64 and ARM are utilized.
      • For an AddRec, InstNumCost is increased by 1 unless the stride can be folded into load/store instructions using a pre/post-increment addressing mode.
      • For ICmpZero, the loop can become a countdown loop only when the ICmpZero formula contains a single register -- in that case, the conditional compare or test instruction can be omitted. In other cases, InstNumCost is increased by 1.
      • The original NumBaseAdds/NumIVMuls/ImmCost are folded into InstNumCost.
  3. Request for suggestions: I did performance testing and tuning for the patch using Google benchmarks on x86 and saw some perf improvement (0.2% geomean on Sandy Bridge and 0.4% on Westmere). But I didn't do any testing or tuning for other architectures, so the patch may be worse than the original LSR there. Any suggestions to improve it on other architectures are welcome, if the overall idea here is ok.
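As a sketch, the proposed ordering described above can be expressed as another lexicographic comparison, in which the sum of the instruction-count terms dominates (again a hypothetical illustration with made-up names, not the patch's actual code):

```cpp
#include <tuple>

// Simplified stand-in for the proposed rating: InstNumCost and
// SpillRegsCost are summed and compared first, then InstCmplxCost,
// then SetupCost.
struct NewLSRCostSketch {
  unsigned InstNumCost = 0;
  unsigned SpillRegsCost = 0;
  unsigned InstCmplxCost = 0;
  unsigned SetupCost = 0;

  bool operator<(const NewLSRCostSketch &Other) const {
    return std::make_tuple(InstNumCost + SpillRegsCost, InstCmplxCost,
                           SetupCost) <
           std::make_tuple(Other.InstNumCost + Other.SpillRegsCost,
                           Other.InstCmplxCost, Other.SetupCost);
  }
};
```

The key design difference from the existing rating is that register-related cost no longer dominates unconditionally: trading a register for fewer instructions is allowed as long as the estimated spill cost does not outweigh the instruction savings.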

Thanks,
Wei.

Diff Detail

Repository
rL LLVM

Event Timeline

wmi updated this revision to Diff 24794.May 1 2015, 12:10 AM
wmi retitled this revision from to Improve the cost evaluation of LSR .
wmi updated this object.
wmi edited the test plan for this revision. (Show Details)
wmi added reviewers: atrick, sanjoy, hfinkel.
wmi set the repository for this revision to rL LLVM.
wmi added subscribers: Unknown Object (MLST), davidxl.
qcolombet edited edge metadata.May 1 2015, 10:30 AM

Hi Wei,

The short story is: I do not think this is the way to go.

Now, the long story.

I have mixed feelings about the direction of the approach. On one hand, I also think we should optimize for performance as long as register pressure is not a problem. On the other hand, the register pressure estimate at this level is too rough to make any useful decisions.

Your current approach illustrates this point. Indeed, IIRC, NumRegs only gives you the number of registers you need to materialize the formulae; we do not consider how many registers we already need in the loop or through the loop. Therefore, I believe that by tweaking the body of the loop in your motivating example (i.e., adding just enough live ranges), we can bloat the register pressure with the new rating and have spills within the loop, whereas we wouldn't with the previous rating.

I also mentioned this in the related PR, but I believe the way to go is not to care about register pressure and just rate the cost of the loop body. However, this implies the backends are able to recover from the register pressure bloat, and I believe we are not quite there yet.

  • How do we move forward?

I would suggest we add an internal option to make LSR more aggressive w.r.t. register pressure, and fix all the problems that arise in the backends. Then, we can turn that option on by default.

  • What if other people still believe this is the right way to move?

Like I said, I do not think this is the right way to go. Now, if other people believe it is, I would at least expect you to supply more detailed numbers. In particular, what are the actual numbers (not just the geometric means), what are the regressions, and why do we not care about them or how do we plan to fix them?

Cheers,
-Quentin

wmi added a comment.May 1 2015, 11:32 AM

Hi Quentin,

Thanks for explaining your ideas about the LSR problem in detail.

Now, the long story.

I have mixed feelings about the direction of the approach. On one hand, I also think we should optimize for performance as long as register pressure is not a problem. On the other hand, the register pressure estimate at this level is too rough to make any useful decisions.

I agree with you, except with the claim that the register pressure estimate cannot make any useful decisions. I understand that the register pressure estimation cannot be very precise at an early stage like LSR, but it can still be useful, especially when the real register pressure in a loop is very low or very high. When the register pressure in a loop is close to the number of available registers, I admit my patch may intend to use more registers, but the intention is still to reduce the instruction count elsewhere, such as reducing recurrence adds at the end of the loop or reducing NumAddParts for an LSRUse. The perf impact in such cases will be hard to tell, because we may have more spills but fewer add instructions in the loop at the same time.

Your current approach illustrates this point. Indeed, IIRC, NumRegs only gives you the number of registers you need to materialize the formulae; we do not consider how many registers we already need in the loop or through the loop. Therefore, I believe that by tweaking the body of the loop in your motivating example (i.e., adding just enough live ranges), we can bloat the register pressure with the new rating and have spills within the loop, whereas we wouldn't with the previous rating.

For the tweaked test, I expect that although the new rating may produce more spills, it will have fewer add instructions. If that is not the case, it is a bug in the new rating that I should look at. The new rating has no reason to increase NumRegs when it cannot reduce InstNumCost.

I also mentioned this in the related PR, but I believe the way to go is not to care about register pressure and just rate the cost of the loop body. However, this implies the backends are able to recover from the register pressure bloat, and I believe we are not quite there yet.

I agree it is a difficult way to go.

  • How do we move forward?

I would suggest we add an internal option to make LSR more aggressive w.r.t. register pressure, and fix all the problems that arise in the backends. Then, we can turn that option on by default.

That is a good suggestion! We can then collect more testcases by comparing the various approaches.

  • What if other people still believe this is the right way to move?

Like I said, I do not think this is the right way to go. Now, if other people believe it is, I would at least expect you to supply more detailed numbers. In particular, what are the actual numbers (not just the geometric means), what are the regressions, and why do we not care about them or how do we plan to fix them?

The improvements are: 1% for Ad Delivery, 1.5% for license plate detection, 1.5% for object recognition, and 3% for a matrix computation library. I also tested SPEC2000 and found almost no perf difference. I don't see significant regressions caused by LSR now. I actually saw some at first, then fixed those regressions and collected some testcases. There were still some regressions, but analysis showed they were caused by other side effects. However, all the tests were only on x86; I believe there may be many regressions on other platforms.

Thanks,
Wei.

spatel added a subscriber: spatel.May 1 2015, 1:08 PM
spatel added a comment.May 4 2015, 1:10 PM

Something to consider when evaluating changes in this area: I see no perf difference between the existing and proposed codegen for the loops seen in https://llvm.org/bugs/show_bug.cgi?id=23384#c0 when running on an Intel Haswell system, but there is a >10% win running on an AMD Jaguar system when using the complex addressing. The Jaguar core is narrower (2-wide issue) and has much less micro-architectural shock absorption than Haswell. Wins and losses may be more apparent on smaller cores such as this or Atom.

FWIW, I applied the patch, ran the benchmarking subset of test-suite on the Jaguar system, and saw a 0.7% geomean improvement after filtering out lots of noisy tests. Benchmarks/McGill/chomp is the largest improvement: +31%; Benchmarks/Stanford/FloatMM is the worst regression: -10%. I haven't analyzed the results any more than that yet.

wmi added a comment.May 4 2015, 1:49 PM

Thanks, Sanjay, for the testing. I haven't considered the differences between microarchitectures, which will need to be considered in the future. For the patch, I treated all x86 architectures as the same and evaluated the cost mainly using instruction counts.

It is good to know that the patch got some performance improvement on Jaguar. And I am especially interested in the cause of the regression. I will try the benchmark and see if there is any instruction increase there.

Wei.

wmi updated this revision to Diff 25507.May 11 2015, 2:06 PM
wmi edited the test plan for this revision. (Show Details)
wmi edited edge metadata.

Hi,

I separated the existing and new cost model code, and used the following options to choose the cost model.

-lsrmodel                                               - Choose the LSR cost model:
  =basic                                                -   enable basic cost model
  =aggr-unlimited                                       -   enable aggressive cost model and assume unlimited hard regs
  =aggr-estimate                                        -   enable aggressive cost model and estimate reg pressure
  =aggr-few                                             -   enable aggressive cost model and assume very few hard regs

-lsrmodel=basic is the existing implementation. -lsrmodel=aggr-estimate is the new cost model. -lsrmodel=aggr-unlimited and -lsrmodel=aggr-few represent the cases where we have unlimited hard regs or very few hard regs, respectively.

It would be much appreciated if people could try the new cost model on various platforms, so it can be improved based on more testcases.

Other changes include adding two testcases and fixing an instruction-increase regression on x86.

Thanks,
Wei.

sanjoy resigned from this revision.Jun 22 2015, 2:36 PM
sanjoy removed a reviewer: sanjoy.

I don't think I'm qualified to review this.

Hi Wei,

What is the long-term plan here?

My understanding is that you want to gather benchmark numbers and maybe change the default to the best one.
Is that what you are aiming for?

Thanks,
-Quentin

wmi added a comment.Jul 28 2015, 3:00 PM

Hi Wei,

What is the long-term plan here?

My understanding is that you want to gather benchmark numbers and maybe change the default to the best one.
Is that what you are aiming for?

Thanks,
-Quentin

Hi Quentin,

Sorry for not updating this one for a while. For the Google benchmarks on x86, aggr-estimate is better than basic mode on several benchmarks. There is not much difference between aggr-unlimited, aggr-few, and aggr-estimate. My plan is to compare the performance of the various modes and do another round of perf analysis based on the llvm test-suite (this part has not been done yet), and refine the logic of aggr-estimate mode if there are regressions. Then, if the status is satisfying, check in the patch (if review is ok) and make aggr-estimate mode available for wider use and further tuning (but still keep basic mode as the default until aggr-estimate becomes good enough on platforms other than x86).

Thanks,
Wei.

Hi Wei,

Thanks for the update.

There is not much difference between aggr-unlimited, aggr-few, and aggr-estimate.

This seems to concur with my feeling that the register pressure estimation of the proposed model is off.
Given that evidence, I would drop aggr-estimate and aggr-few and focus on aggr-unlimited. Indeed, that is the only configuration that this new model represents correctly AFAICT.

Thanks,
-Quentin