This is an archive of the discontinued LLVM Phabricator instance.

[LSR] Narrow search space by filtering non-optimal formulae with the same ScaledReg and Scale.
ClosedPublic

Authored by wmi on Jun 23 2017, 5:02 PM.

Details

Summary

When the formulae search space is huge, LSR uses a series of heuristics to keep pruning the search space until the number of possible solutions is within a certain limit.

The big hammer among these heuristics is NarrowSearchSpaceByPickingWinnerRegs, which picks the register used by the most LSRUses and deletes the other formulae that don't use it. This is an effective way to prune the search space, but quite often not a good way to keep the best solution. We have seen cases where this heuristic pruned the best formula candidate out of the search space.

To relieve the problem, we introduce a new heuristic called NarrowSearchSpaceByFilterFormulaWithSameScaledReg. The basic idea is that, in order to reduce the search space while keeping the best formula, we want to keep as many formulae with different Scale and ScaledReg pairs as possible. This is because the central idea of LSR is to choose a group of loop induction variables and use those induction variables to represent the LSRUses, and an induction variable candidate is typically represented by the Scale and ScaledReg of a formula. The more formulae with different ScaledReg and Scale we have to choose from, the better our opportunity to find the best solution. That is why we believe pruning the search space by keeping only the best formula for each ScaledReg and Scale pair should be more effective than PickingWinnerRegs. We use two criteria to choose the best formula among those with the same ScaledReg and Scale: the first is to select the formula using fewer non-shared registers, and the second is to select the formula with the lower cost computed by RateFormula. The patch runs the new heuristic before NarrowSearchSpaceByPickingWinnerRegs, which remains the last resort.
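The filtering idea described above can be sketched in simplified form. This is not the actual patch code: the real pass keys on the SCEV* ScaledReg and the int64_t Scale of each formula inside an LSRUse, and the struct fields and function name below are hypothetical stand-ins.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for LSR's Formula. In LLVM the key
// would be the (const SCEV *ScaledReg, int64_t Scale) pair of the formula.
struct Formula {
  std::string ScaledReg; // stand-in for const SCEV *
  int Scale;
  int NumNonSharedRegs;  // first criterion: fewer is better
  int RateFormulaCost;   // second criterion: lower is better
};

// Sketch of the heuristic: within one LSRUse, keep only the best formula
// for each (ScaledReg, Scale) pair, so the solver still sees every
// induction-variable candidate but far fewer formulae.
std::vector<Formula> filterByScaledReg(const std::vector<Formula> &Formulae) {
  std::map<std::pair<std::string, int>, Formula> Best;
  for (const Formula &F : Formulae) {
    auto Key = std::make_pair(F.ScaledReg, F.Scale);
    auto It = Best.find(Key);
    bool IsBetter =
        It == Best.end() ||
        F.NumNonSharedRegs < It->second.NumNonSharedRegs ||
        (F.NumNonSharedRegs == It->second.NumNonSharedRegs &&
         F.RateFormulaCost < It->second.RateFormulaCost);
    if (IsBetter)
      Best[Key] = F;
  }
  std::vector<Formula> Kept;
  for (const auto &KV : Best)
    Kept.push_back(KV.second);
  return Kept;
}
```

With three formulae, two of which share the (ScaledReg, Scale) pair, the sketch keeps one formula per pair and prefers the one with fewer non-shared registers.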

Testing shows we get 1.8% and 2% improvements on two internal benchmarks on x86. llvm nightly testsuite performance is neutral. We also tried lsr-exp-narrow and it didn't help on the two improved internal cases we saw.

Diff Detail

Event Timeline

wmi created this revision.Jun 23 2017, 5:02 PM
wmi added a reviewer: sanjoy.Jun 26 2017, 3:09 PM
sanjoy requested changes to this revision.Jun 29 2017, 3:38 PM
sanjoy added inline comments.
lib/Transforms/Scalar/LoopStrengthReduce.cpp
146 ↗(On Diff #103799)

I'd s/scaledreg/scaled-reg/

4318 ↗(On Diff #103799)

Just to be clear, the benefit here is that you can just look at the base regs and (cheaply) decide which one is better? If so, add a line in the description about that.

4338 ↗(On Diff #103799)

I think the convention would be IsBetterThan.

I'd also add a one-liner on the direction of the relation you're computing ("return true if A is better than B").

4339 ↗(On Diff #103799)

I'm being hyper pedantic here, but I suspect this will be "fewer" new registers.

4345 ↗(On Diff #103799)

It's not clear to me why you need to use a floating point reciprocal here. Why not just count the total number of uses (as an integer) and compare that? Or is there some more subtle metric you're tracking that requires this reciprocal logic (if so, please comment on that)?

4355 ↗(On Diff #103799)

Maybe the DenseSet can be declared at function scope and re-used across different invocations of isBetterThan? This is a worthwhile micro-optimization since DenseSets are somewhat heavyweight.

4360 ↗(On Diff #103799)

As far as I can tell, in this situation we're making an arbitrary choice when the register use count is the same and both formulas have the same cost (we'll just pick the second one). Can we instead keep both formulas when Cost does not give us an unambiguous signal?

4364 ↗(On Diff #103799)

Let's call Any something more descriptive, like FormulaPruned.

[edit: I just saw that other parts of LSR also call this variable Any. I think those too should change, but given the "prior art" I'd be okay if you want to keep this named Any.]

This revision now requires changes to proceed.Jun 29 2017, 3:38 PM
wmi marked 4 inline comments as done.Jun 29 2017, 5:04 PM
wmi added inline comments.
lib/Transforms/Scalar/LoopStrengthReduce.cpp
4318 ↗(On Diff #103799)

The narrowing heuristic is to keep as many formulae with different Scale and ScaledReg pairs as possible while narrowing the search space to within the limit. The benefit is that it is more likely to find a better solution from a formulae set with more Scale and ScaledReg variations than with the picking-winner-reg heuristic. Comments updated.

4345 ↗(On Diff #103799)

Yeah, I don't really need to use floating point. I changed it to use integers.

4360 ↗(On Diff #103799)

We use a similar strategy in other places, like in FilterOutUndesirableDedicatedRegisters at the place: CostF.isLess(CostBest, TTI). I think the key point here is to keep the formula set small while providing more induction variable choices for the LSR solver, so making an arbitrary choice to reduce the formula set may not be too bad.

wmi updated this revision to Diff 104786.Jun 29 2017, 5:22 PM
wmi edited edge metadata.

Address Sanjoy's comments.

sanjoy added inline comments.Jun 29 2017, 5:53 PM
lib/Transforms/Scalar/LoopStrengthReduce.cpp
4360 ↗(On Diff #103799)

Sounds good.

4352 ↗(On Diff #104786)

Can we instead take the total number of uses of the registers in FA.BaseRegs and FB.BaseRegs and compare them? That is:

int TotalUsesOfA = 0, TotalUsesOfB = 0;
for (const SCEV *Reg : FA.BaseRegs)
  TotalUsesOfA += RegUses.getUsedByIndices(Reg).count();
for (const SCEV *Reg : FB.BaseRegs)
  TotalUsesOfB += RegUses.getUsedByIndices(Reg).count();

if (TotalUsesOfA != TotalUsesOfB)
  return TotalUsesOfA > TotalUsesOfB;

Or does it have to be exactly the expression you're using?

test/Transforms/LoopStrengthReduce/X86/lsr-filtering-scaledreg.ll
7 ↗(On Diff #104786)

Can you please clean up the names a bit here? Perhaps using metarenamer?

Also, are both the loops necessary to show the difference in behavior?

wmi added inline comments.Jun 30 2017, 10:10 AM
lib/Transforms/Scalar/LoopStrengthReduce.cpp
4352 ↗(On Diff #104786)

We don't want to choose the formula with the most sharing. What we expect is to have fewer registers overall. If a register is shared, we count it proportionally; that is why a shared register is better than a non-shared register.
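The proportional counting idea could be illustrated as follows. This is a hypothetical sketch, not the patch code: a register used by N LSRUses contributes 1/N to each formula that uses it, so a formula built from widely shared registers looks cheaper, and (as the updated patch does) the comparison can stay in integer arithmetic rather than use floating-point reciprocals.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: UsesA/UsesB hold, for each base register of a
// formula, the number of LSRUses that share that register. A register
// shared by N uses contributes 1/N, so sharing is rewarded. A fixed-point
// scale keeps the comparison in integer arithmetic.
// Returns true if formula A is better than (has fewer effective regs than) B.
bool fewerEffectiveRegs(const std::vector<int> &UsesA,
                        const std::vector<int> &UsesB) {
  const long long Scale = 1LL << 20; // fixed-point scale, an assumption
  long long ScaledA = 0, ScaledB = 0;
  for (int U : UsesA)
    ScaledA += Scale / U; // 1/U in fixed point
  for (int U : UsesB)
    ScaledB += Scale / U;
  return ScaledA < ScaledB;
}
```

For example, a formula using two registers each shared by four uses (effective count 0.5) beats a formula using one completely unshared register (effective count 1.0), even though it uses more registers in absolute terms.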

test/Transforms/LoopStrengthReduce/X86/lsr-filtering-scaledreg.ll
7 ↗(On Diff #104786)

metarenamer is a great tool. Thanks for the suggestion.

I reduced the test to some extent, but the second loop is still necessary to create a complex enough search space.

wmi updated this revision to Diff 104888.Jun 30 2017, 10:13 AM

Cleanup and reduce the testcase.

sanjoy accepted this revision.Jun 30 2017, 4:55 PM

lgtm

test/Transforms/LoopStrengthReduce/X86/lsr-filtering-scaledreg.ll
7 ↗(On Diff #104786)

I reduced the test to some extent, but the second loop is still necessary to create a complex enough search space.

Sounds good!

This revision is now accepted and ready to land.Jun 30 2017, 4:55 PM
This revision was automatically updated to reflect the committed changes.
eastig reopened this revision.Aug 3 2017, 4:36 AM
eastig added a subscriber: eastig.

This patch caused regressions from 5% to 23% in two of our internal benchmarks on Cortex-M23 and Cortex-M0+. I attached test.ll, which is reduced from the benchmarks. I used LLVM revision 309830. 'test.good.ll' is the result when filtering is disabled; 'test.bad.ll' is the result when filtering is enabled.
Comparing them, I can see that this optimization changes how an induction variable is updated. Originally it is incremented from 0 to 256; the optimization changes this into decrementing from 0 to -256. This induction variable is also used as an offset into memory, so to preserve the semantics a conversion of the induction variable from a negative value to a positive one is inserted. This is lowered to additional instructions, which causes the performance regressions.

Could you please have a look at this issue?

Thanks,
Evgeny Astigeevich
The ARM Compiler Optimization team leader

This revision is now accepted and ready to land.Aug 3 2017, 4:36 AM
wmi added a comment.Aug 3 2017, 11:12 AM

This patch caused regressions from 5% to 23% in two of our internal benchmarks on Cortex-M23 and Cortex-M0+. I attached test.ll, which is reduced from the benchmarks. I used LLVM revision 309830. 'test.good.ll' is the result when filtering is disabled; 'test.bad.ll' is the result when filtering is enabled.
Comparing them, I can see that this optimization changes how an induction variable is updated. Originally it is incremented from 0 to 256; the optimization changes this into decrementing from 0 to -256. This induction variable is also used as an offset into memory, so to preserve the semantics a conversion of the induction variable from a negative value to a positive one is inserted. This is lowered to additional instructions, which causes the performance regressions.

Could you please have a look at this issue?

Thanks,
Evgeny Astigeevich
The ARM Compiler Optimization team leader

Hi Evgeny,

Thanks for providing the testcase.

It looks like an existing issue in LSR cost evaluation exposed by the patch. Comparing the traces obtained by adding -debug-only=loop-reduce, all the candidates chosen by LSR without filtering are still kept in the candidate set after adding the filtering patch. However, the filtering patch provides some more candidates that are interesting to the LSR cost model, and LSR chooses a different set of candidates in the final result, which it thinks is better (one less base add) but is actually not. We can see that in the trace:

LSR without the filtering patch:
The chosen solution requires 5 regs, with addrec cost 2, plus 2 base adds, plus 4 imm cost, plus 1 setup cost:

LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*
  reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + 1*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)
LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
  reg(256) + -1*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)
LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
  reg(%c0.0103.us.i.i) + 4*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)
LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0,4}, widest fixup type: i32*
  reg({(-4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>)
LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
  reg(%c0.0103.us.i.i)
LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
  reg(%c1.0104.us.i.i) + 4*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>) + imm(4)
LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
  reg(%c1.0104.us.i.i)

LSR with the filtering patch:
The chosen solution requires 5 regs, with addrec cost 2, plus 1 base add, plus 4 imm cost, plus 1 setup cost:

LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*
  reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)
LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
  reg(-256) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)
LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
  reg(%c0.0103.us.i.i) + -4*reg({0,+,-1}<nw><%for.body8.us.i.i>)
LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0,4}, widest fixup type: i32*
  reg({(4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>) + imm(-8)
LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
  reg(%c0.0103.us.i.i)
LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
  reg({(4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>)
LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
  reg(%c1.0104.us.i.i)

The real problem is that LSR has no idea about the cost of materializing a negative value. It thinks 4*reg({0,+,-1}) and -4*reg({0,+,-1}) have the same cost.

LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*
  reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)
LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
  reg(%c0.0103.us.i.i) + -4*reg({0,+,-1}<nw><%for.body8.us.i.i>)
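The blind spot could be illustrated with a toy cost function. This is not LSR's actual cost model; it is a hypothetical sketch showing how a cost keyed only on the magnitude of the scale cannot distinguish a formula that forces the backend to negate the index (costly on targets such as Cortex-M0+ without negative scaled addressing) from one that does not.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical: a cost model that looks only at |Scale| rates
// 4*reg({0,+,1}) and -4*reg({0,+,-1}) identically.
int naiveScaleCost(int Scale) {
  return std::abs(Scale) > 1 ? 1 : 0; // ignores the sign entirely
}

// A sign-aware variant would charge for the extra instruction the backend
// must emit to materialize the negated index on such targets.
int signAwareScaleCost(int Scale, bool TargetHasNegScaledAddressing) {
  int Cost = std::abs(Scale) > 1 ? 1 : 0;
  if (Scale < 0 && !TargetHasNegScaledAddressing)
    Cost += 1; // conversion from negative to positive offset
  return Cost;
}
```

Under the naive cost the two candidate sets tie, so the solver is free to pick the decrementing form; the sign-aware variant would break the tie in favor of the incrementing form.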

I will think about how to fix it.

Wei.

eastig added a comment.Aug 4 2017, 2:52 AM

Thank you, Wei. Just let me know when you need help to test a fix.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project.Oct 7 2019, 5:50 AM
Herald added a subscriber: hiraditya.