This is an archive of the discontinued LLVM Phabricator instance.

Add Instruction number to LSR cost model (PR23384) part 1 of 3
ClosedPublic

Authored by evstupac on Jan 4 2017, 11:46 AM.

Details

Summary

Fix PR23384.
The patch adds instructions number generated by a solution to LSR cost under "-lsr-insns-cost" option.
Performance improvement on x86:
spec2000

177.mesa on -O2 +3%
256.bzip2 on -Ofast -flto +1.5%

Diff Detail

Repository
rL LLVM

Event Timeline

evstupac updated this revision to Diff 83095.Jan 4 2017, 11:46 AM
evstupac retitled this revision from to Add Instruction number to LSR cost model (PR23384) part 1 of 3.
evstupac updated this object.
evstupac added a reviewer: qcolombet.
evstupac set the repository for this revision to rL LLVM.
evstupac added subscribers: llvm-commits, wmi.
evstupac updated this revision to Diff 83111.Jan 4 2017, 1:00 PM

Fix "copy/paste" misprint.

hfinkel added inline comments.
lib/Transforms/Scalar/LoopStrengthReduce.cpp
1193

This represents the spill of the old value, right? We should say so.

1233

Please add a comment this explain this (a subtraction for the constant term in the formula?).

1235

This is the instruction that represents the increment?

1267

Do you really want to make this the most-important factor?

evstupac added inline comments.Jan 24 2017, 4:19 PM
lib/Transforms/Scalar/LoopStrengthReduce.cpp
1193

To be exact at least 1 fill.
Ideally RA should start with fills.

1235

Yes.

1267

Yes. What is more important than instruction count on this stage?

evstupac updated this revision to Diff 85652.Jan 24 2017, 4:33 PM

Comments fixed according to review.

hfinkel added inline comments.Jan 24 2017, 5:29 PM
lib/Transforms/Scalar/LoopStrengthReduce.cpp
1267

Fair enough.

test/Transforms/LoopStrengthReduce/X86/lsr-insns-1.ll
2

Please use FileCheck, not grep (and match the name of the statistic, not just the number and the pass name).

evstupac updated this revision to Diff 85683.Jan 24 2017, 7:24 PM

Tests fixed according to the comments.

hfinkel added inline comments.Jan 25 2017, 6:32 AM
test/Transforms/LoopStrengthReduce/X86/lsr-insns-1.ll
15

Ah, okay. You're just counting the total number of instructions. Please actually match the desired output instruction pattern. Otherwise these tests will be fragile to unrelated changes.

evstupac added inline comments.Jan 25 2017, 10:22 AM
test/Transforms/LoopStrengthReduce/X86/lsr-insns-1.ll
15

Well, the tests are really small and simple.
I expect more fragility from instruction pattern.
Anyway, if any change will increase instruction count here it would be a bad sign and the change author should pay attention to this.

hfinkel added inline comments.Jan 25 2017, 11:55 AM
test/Transforms/LoopStrengthReduce/X86/lsr-insns-1.ll
15

Well, the tests are really small and simple.

This is exactly why matching the output you expect is reasonable.

I expect more fragility from instruction pattern.

Perhaps, but make proper use of CHECK-DAG and named regular expressions for allocated registers and the test should not be fragile.

evstupac updated this revision to Diff 85832.Jan 25 2017, 4:18 PM
evstupac added a reviewer: hfinkel.

Updated test according to the latest comments.
For lsr-insns-1.ll the most important is to check that there are no "cmp" in the loop.
For lsr-insns-2.ll it is important that calculations are moved to address mode. However several variants are ok:

movl    (%rsi,%rcx,4), %eax
...
incq %rax
cmpq %rcx, %r8

or

movl    (%rsi,%rcx), %eax
...
addq $4, %rax
cmpq %rcx, %r8

"cmpq %rcx, %r8" could also be replaced with "decq %r8".
So I simplified checks to avoid fails in case of the replacements above.

hfinkel accepted this revision.Jan 25 2017, 4:23 PM

LGTM. I assume that you're going to send a message to llvm-dev asking people to test this?

This revision is now accepted and ready to land.Jan 25 2017, 4:23 PM

LGTM. I assume that you're going to send a message to llvm-dev asking people to test this?

Thanks!
Yes. The change is very sensitive. For x86 results are good for benchmarks set I have, but there could be corner cases that I've missed.
Ultimate goal is to make LSR cost target depended (there will be a separate patch for this) and move InsnsCost to 1st priority for x86.

qcolombet edited edge metadata.Jan 25 2017, 5:35 PM

Hi,

Sorry for the delay.
I'll have a look tomorrow.

Cheers,
-Quentin

qcolombet requested changes to this revision.Jan 26 2017, 5:58 PM

Hi,

I'd like to see opt tests as well and one piece of the code seems dead to me unless I am missing something. See my inline comment.

Also a related question, what is your plan to move with that heuristic?

Generally speaking, I think the number of instructions is a good heuristic only for code size and it is not even a given. Indeed, for performance we care about critical path, IPL, this kind of thing. What am I saying is having more instructions is not necessarily a bad thing. The other thing is I believe we can end up with cases with less instructions, but more registers and potentially more spill. This is possible to happen because the heuristic that account for "spill" only trigger when the NumRegs gets high enough. However, NumRegs may be low enough and spill may already happen. See my comment below as well.

Cheers,
-Quentin

lib/Transforms/Scalar/LoopStrengthReduce.cpp
1195

With this kind of check we are (sort of) sure we are going to spill to materialize this formulae, but that does not mean the other don't.
Bare in mind that NumRegs is only what's use in the formulae, not what is live.

Your comment is right, just wanted to make sure we are on the same page.

1198

I'm confused, this was tested in the condition just before, i.e., it's always going to be true and the else statement is always going to be false.

Am I missing something?

test/Transforms/LoopStrengthReduce/X86/lsr-insns-1.ll
2

I would like to see opt -loop-reduce checks as well.

test/Transforms/LoopStrengthReduce/X86/lsr-insns-2.ll
2

Ditto.

This revision now requires changes to proceed.Jan 26 2017, 5:58 PM

Hi,

Thanks for taking a look.

Also a related question, what is your plan to move with that heuristic?

As we deal in D27695, there will be next part moving Solution cost comparison to a target part.
For x86 I want to move "Insns" to number 1 priority. I've got good performance and code size results on my benchmarks (only linpack got 3% regression, which transforms into gain for some CPUs).
Anyway the results differs for x86 CPUs. So maybe we'll end up with different cost models.

I understand that the change is very sensitive - so an option gives testing opportunity.

Generally speaking, I think the number of instructions is a good heuristic only for code size and it is not even a given. Indeed, for performance we care about critical path, IPL, this kind of thing. What am I saying is having more instructions is not necessarily a bad thing. The other thing is I believe we can end up with cases with less instructions, but more registers and potentially more spill. This is possible to happen because the heuristic that account for "spill" only trigger when the NumRegs gets high enough. However, NumRegs may be low enough and spill may already happen.

Currently LSR take in account only Number of Registers. It is very important for 32 bit mode, but less important for 64 bit mode and float/vector loops.
While testing I have not seen cases where Insns count is less, but NumRegs are much bigger. Insns correlate with NumRegs (because Insns depend AddRec and NumBaseAdds).
Usually the case is that we have 1 more register (or same number) but less instructions. Putting Insns as first priority does not add much spill/fills. However it would be good to get such statistics. I'll gather this for x86.

Thanks,
Evgeny

lib/Transforms/Scalar/LoopStrengthReduce.cpp
1195

Yes. I understand this.
Instruction number and Register number have correlations.
When we move cost calculation to target part we can give Insns and RegNum some weights instead of priority.

1198

This is a misprint.
It should be PrevNumRegs here like it was in inital patch D27695.
Thanks for catching this.

test/Transforms/LoopStrengthReduce/X86/lsr-insns-1.ll
2

It is harder to see profit in IR as we are getting even more instructions (assuming target will combine them to complicated address or remove test on 0).
The will be harder to understand the test goal, but it is not a big deal to add such.

evstupac updated this revision to Diff 86014.Jan 26 2017, 9:28 PM
evstupac edited edge metadata.

Fixed according to the comments.
Added 2 opt tests.

Hi,

...

Generally speaking, I think the number of instructions is a good heuristic only for code size and it is not even a given. Indeed, for performance we care about critical path, IPL, this kind of thing. What am I saying is having more instructions is not necessarily a bad thing.

I agree, for performance we should look at critical-path length, ILP, etc. However, here, we're only ranking one formula at a time, and these generally correspond to dependent chains of computations. Making each formula shorter (i.e. using fewer instructions), as a result, will tend to make each path-length shorter. This is obviously heuristic, because we don't have an interface here to query expected latencies, but most of these things are simple integer operations, so I suspect it is not a bad approximation. Nevertheless, if any of these formula are on the critical path, then hopefully it too will get shorter.

For ILP, we may want a longer sequence, but I think it would be hard to reason about that here. It seems better to do that in the MachineCombiner, or something at that level, and for that purpose, using fewer instructions is also probably better: they'll be easier to pattern-match at the MI level.

qcolombet added inline comments.Jan 27 2017, 9:00 AM
test/Transforms/LoopStrengthReduce/X86/lsr-insns-1.ll
2

I'm not really interested by checking the profit, more by checking that this does what we want with the fewer possible interaction with something else.
Like you said, CodeGen could be fragile so it is best if we have better isolated testing, which is what opt is going to achieve.

evstupac added inline comments.Jan 27 2017, 9:31 AM
test/Transforms/LoopStrengthReduce/X86/lsr-insns-1.ll
2

I'm not really interested by checking the profit, more by checking that this does what we want with the fewer possible interaction with something else.

Ok. Are current opt tests meet the needs?

Please merge the opt and llc test cases into one file.
Note: It would be nice to have two opt run lines per test case:

  • one with lsr insns cost
  • one without

That may help catch the difference between those two modes. E.g., have a common prefix BOTH and one WITH and one WITHOUT. If the difference between WITH and WITHOUT is too big (i.e., not that much BOTH prefix), don't bother.

test/Transforms/LoopStrengthReduce/X86/lsr-insns-2a.ll
5

You can set several run line with different check prefix in the same file.
I.e., do not duplicate the test input.

Could you check that we have the desired addressing pattern instead of not having an lsr generated variable?

evstupac updated this revision to Diff 86371.Jan 30 2017, 5:07 PM

Tests updated according to the latest comments.

qcolombet accepted this revision.Jan 31 2017, 2:27 PM

LGTM, nitpicks on the check pattern for the llc test case.

test/Transforms/LoopStrengthReduce/X86/lsr-insns-1.ll
30

I usually like to check that the operands are correct as well.

This revision is now accepted and ready to land.Jan 31 2017, 2:27 PM
This revision was automatically updated to reflect the committed changes.