This is an archive of the discontinued LLVM Phabricator instance.

Add Instruction number to LSR cost model (PR23384)
AbandonedPublic

Authored by evstupac on Dec 12 2016, 6:07 PM.

Details

Reviewers
qcolombet
hfinkel
Summary

Fix PR23384.
The patch do the following:

  1. Add instructions number generated by a solution to LSR cost
  2. Move LSR cost comparison to target part
  3. Add new cross use generation for ICmpZero that ends with zero

One LIT test fails. However it should be fixed when D26367 is committed.

Performance improvement on x86:
spec2000

177.mesa on -O2 +3%
256.bzip2 on -Ofast -flto +1.5%

Diff Detail

Repository
rL LLVM

Event Timeline

evstupac updated this revision to Diff 81165.Dec 12 2016, 6:07 PM
evstupac retitled this revision from to Add Instruction number to LSR cost model (PR23384).
evstupac updated this object.
evstupac added reviewers: qcolombet, hfinkel.
evstupac set the repository for this revision to rL LLVM.
wmi added a comment.Dec 15 2016, 5:24 PM

Thanks for working on it. We also noticed the same problem and wanted it to be fixed.

  1. Add instructions number generated by a solution to LSR cost
  2. Move LSR cost comparison to target part
  1. seems a good idea for me, because for arch with post load/store increments addinc cost are not as significant as the cost on arches like x86 without the support.

It may be separated as a NFC patch?

  1. Add new cross use generation for ICmpZero that ends with zero

Can it be split into a separate patch?

lib/Transforms/Scalar/LoopStrengthReduce.cpp
917

Better to use member initializer list? Cost() : C({0, 0, ... ,0}) {}

1208

C.NumRegs only calculate the regs used in induction var expr so it is not a good estimation for register number used in the loop. It can be much less than the real register number used.

We can have a utility like that in vectorization to get a better register number estimation used in the loop, but that can be in a further enhancement. Before we have such a utility in place, I would rather conservatively think every loop has high register pressure, and always add C.NumRegs into C.Insns. I think it avoids the case that LSR significantly increase register pressure just to reduce one addinc.

Hi Wei,

Thanks for taking a look and you comments.
This is a change affecting almost every test. With benchmarks I have it looks good. However more data is better.
Would you mind testing it on your benchmarks/machines, please?

It may be separated as a NFC patch?

Yes. But that will be a patch with postponed effect. Before introducing Insns in solution cost I don't see a reason for x86 to get own cost function.
If we introduce Insns first that will affect others architectures as well (which I'm not specialist in). So I'd like to leave target cost functions tune for the target professionals.

Add new cross use generation for ICmpZero that ends with zero
Can it be split into a separate patch?

I'd like to leave this here as well. When LSR starts to count Insns, it is able to remove some unnecessary compares, but not all (because for some solution we don't generate appropriate cross use). That way for some test I need to remove cmp from one function and leave it in another. I'm trying to avoid a case when say for one arch/mode we check for cmp and for another just add.

lib/Transforms/Scalar/LoopStrengthReduce.cpp
917

Yes. Will fix.

1208

C.NumRegs only calculate the regs used in induction var expr so it is not a good estimation for register number used in the loop. It can be much less than the real register number used.

That is right. But we can say for sure that when we exceed TTIRegNum for a solution, we'll get at least fill. Why not to do this if already have NumRegs used by solution.

I like the idea to get better estimation. However let's leave this for a separate patch.

qcolombet requested changes to this revision.Dec 19 2016, 6:03 PM
qcolombet edited edge metadata.

Hi Evgeny,

Like Wei, I'd like #3 to be a separate patch.
Indeed, #1 and #2 should be NFC as long as the targets do not change the cost model.

Cheers,
-Quentin

lib/Target/X86/X86TargetTransformInfo.cpp
1874

Do you have data to support that heuristic?
Like Wei said, I suspect this may lead to pretty bad side effect where we will increase the register pressure by a lot to save a few instruction.
So before we switch the default, I want supportive evidence that this is general goodness.

For the record, we discuss with Wei this cost model issue and I still have on my todolist a better register pressure estimation for the loop. E.g., we can ignore NumRegs as long as it is below the regpressure.

This revision now requires changes to proceed.Dec 19 2016, 6:03 PM

Hi Quentin,

Thanks for taking a look.

Like Wei, I'd like #3 to be a separate patch.
Indeed, #1 and #2 should be NFC as long as the targets do not change the cost model.

Ok. I'll split the patch.
However #3 without #2 has almost no effect. So the following order seems the best #1, #3, #2.

Thanks,
Evgeny

lib/Target/X86/X86TargetTransformInfo.cpp
1874

Do you have data to support that heuristic?

Yes on spec2000:

177.mesa on -O2 +3%
256.bzip2 on -Ofast -flto +1.5%

There are gains on EEMBC tests. I don't see any significant (>2% regressions). And I'm waiting for data from Wei.
However I agree that for better testing we can introduce an option like -lsr-count-insns.

E.g., we can ignore NumRegs as long as it is below the regpressure.

That is roughly estimated in my patch. When TTI.getNumberOfRegisters(false) - 1 is exceeded we start to increment instruction counter (1 insn per 1 new register). In most cases that is enough to take a solution with low register pressure.

Quentin,

I've put first part in a separate review: https://reviews.llvm.org/D28307

Wei,

Did you have a chance to test the patch performance on your benchmarks?
evstupac abandoned this revision.May 2 2018, 12:01 PM