This is an archive of the discontinued LLVM Phabricator instance.

Improve LoopStrengthReduce RateFormula, by calling a new target hook increasedLSRFormulaCost()
ClosedPublic

Authored by jonpa on Apr 15 2016, 2:15 AM.

Details

Reviewers
qcolombet
hfinkel
Summary

While experimenting with the LoopUnroller pass for SystemZ, it was obvious that some loops regressed, due to the fact that LSR introduced negative offsets for load / store operations. These require an extra load-address instruction for float and vector types on SystemZ..

To be able to avoid this, I tried using isLegalAddressingMode() by returning false for float / double. This however caused other regressions.
Therefore I had to add a new method to TargetTransformInfo which adds an extra cost for the where appropriate. It seems to make sense generally to let targets adjust LSR formula rating.

During rating of formulas, the new TTI method needs to look at the actual user instructions. This was a bit awkward to get in place, and it may very well be that the code owner of this file could do a better job of it, or do some refactoring. I resorted to passing the Fixups to RateFormula, while letting the LSRUse access the right LSRFixup by means of an index. I had to rearrange code a bit to be able to use the LSRFixup class as an argument to RateFormula.

This seems to be the only viable way of using partial / runtime loop unrolling on SystemZ.

Diff Detail

Event Timeline

jonpa updated this revision to Diff 53858.Apr 15 2016, 2:15 AM
jonpa retitled this revision from to Improve LoopStrengthReduce RateFormula, by calling a new target hook increasedLSRFormulaCost().
jonpa updated this object.
jonpa added a reviewer: hfinkel.
jonpa added subscribers: uweigand, chandlerc.

ping

(Once again, this patch isn't really that big except I had to move up a class definition, so it just looks that way.)

The question that this patch answers is "Does this memory access (load/store, type) with the new total offset require an extra reg add?".

Perhaps this should be part of isAMCompletelyFolded()?

hfinkel edited edge metadata.Apr 26 2016, 9:26 AM

Could this problem be solved in the backend by implementing needsFrameBaseReg, materializeFrameBaseRegister, resolveFrameIndex and isFrameOffsetLegal in SystemZRegisterInfo.{h,cpp}?

I have no fundamental problem with the target being able to adjust the LSR formula costs, but the specific problem described seems to be what the virtual base register functionality should address.

lib/Target/SystemZ/SystemZISelLowering.cpp
31

rm extra blank line.

jonpa added a comment.Apr 27 2016, 1:55 AM

I don't think the RegisterInfo methods you mention will do the trick, as they all handle just frame index operands, meaning variables on the stack. Wouldn't that still leave e.g. global variables and argument addresses unhandled? Besides, since this relates to loops, even a simple extra register load could perhaps be worth avoiding.

The motivation for an LSR target cost adjustment is that it seems to be a good way to improve performance while avoiding regressions. As I said earlier, I tried using isLegalAddressingMode(), but that also caused regressions. I think this difference is because SystemZ actually has support for loading with big offset of float values. However, as those loads get folded into their uses an extra load-address instruction is emitted if the offset is negative. The instructions with a folded memory-operand only support small offsets.

I am not really sure about the way LoopStrengthReduce.cpp is organized. It seems intuitive to me that the LSRUse class should know about the actual user instructions (lsr-fixups). This was not the case, so I added just the indexes of the LSRFixups to LSRUse, since the fixups are stored in LSRInstance. This has the drawback of having to pass the Fixups vector to RateFormula as well, so that the LSRUse can index into it. Perhaps the LSRUse class should own the fixups instead?

hfinkel edited edge metadata.May 12 2016, 10:25 PM
hfinkel added subscribers: atrick, qcolombet, anemet.

I don't think the RegisterInfo methods you mention will do the trick, as they all handle just frame index operands, meaning variables on the stack. Wouldn't that still leave e.g. global variables and argument addresses unhandled? Besides, since this relates to loops, even a simple extra register load could perhaps be worth avoiding.

This makes sense.

The motivation for an LSR target cost adjustment is that it seems to be a good way to improve performance while avoiding regressions. As I said earlier, I tried using isLegalAddressingMode(), but that also caused regressions. I think this difference is because SystemZ actually has support for loading with big offset of float values. However, as those loads get folded into their uses an extra load-address instruction is emitted if the offset is negative. The instructions with a folded memory-operand only support small offsets.

I am not really sure about the way LoopStrengthReduce.cpp is organized. It seems intuitive to me that the LSRUse class should know about the actual user instructions (lsr-fixups). This was not the case, so I added just the indexes of the LSRFixups to LSRUse, since the fixups are stored in LSRInstance. This has the drawback of having to pass the Fixups vector to RateFormula as well, so that the LSRUse can index into it. Perhaps the LSRUse class should own the fixups instead?

Andy, Quentin, Adam, Sanjoy, et al., opinions?

Adding TTI hooks in LSR is fine with me. increasedLSRFormulaCost seems a little specific though for a TTI hook.

Iterating over Fixups.Offset does seem redundant with LSRUse.Offsets. My general concern is really the growing complexity of LSR utilities, with multiple ways of handling the same instruction set problems. I feels to me like this is something that should be handled by isAlwaysFoldable.

Can someone else take a look at this? Adam, MichaelZ, Quentin?

Hi Jonas,

I share Andy’s points:

  • This seems redundant with LU.Offsets.
  • That may be accounted within isAlwaysFoldable.

Could you add a test case to see how things are playing together?

Cheers,
-Quentin

jonpa updated this revision to Diff 57779.May 19 2016, 6:21 AM

The SystemZTargetLowering::increasedLSRFormulaCost() has been extended to consider vector store instructions also.

jonpa added a comment.May 19 2016, 6:29 AM

Hi Andy and Quentin,

thanks for review so far!

My feeling is that looking at LU.Offsets is not enough, since not all offsets of all user instructions are guaranteed / supposed to be present in the Offsets vector.

Using isAlwaysFoldable() also seems difficult, since there is no info about if the user instruction is a store or a load.

These two parameters are what is needed to make a proper cost function, at least for SystemZ, since it makes a difference if the user instruction is a load or a store (see the implementation in this patch).

One alternative to iterating over the LSRFixups (use instructions) may be to make sure *all* LU.Offsets are present in the vector, and also have keep track of load / store type for each offset somehow.

Another thought is to extend MemAccessTy (or KindType) with load / store attribute, although, that seems problematic since then stores and loads would then not belong to the same LSRUse even if they are otherwise compatible.

It may also be possible to compromise and ignore the load / store differences, but that is of course not as good.

What are your thoughts on this?

/Jonas

PS will send test case in separate mail

jonpa updated this revision to Diff 57929.May 20 2016, 6:32 AM

Patch updated so that

  • LSRUse will own its fixups instead of indexes into the LSRFixups vector.
  • Renamed the new method to isFoldableMemAccessOffset(), since it really isn't an LSR specific function.

I think it's generally useful to let the LSRUse own its fixups, and as Andy pointed out, the Offsets vector is not necessary if already iterating over the fixups, so that vector has been removed. (And again, class definitions have been moved around, so it looks like more changes than there really are).

LSR is with this patch *NOT* making entirely identical decisions. In RateFormula all fixup offets are now considered, whereas before Offsets were a bit sloppily kept unique, without any guarantee of them being so. From what I could quickly see, there seems to be less or smaller offsets used with the patch applied, which means it seems to be some welcome changes here and there. Perhaps someone could look over the logic under "Tally up the non-zero immediates." in RateFormula(), which is where this effect comes into play?

The new TargetTransformInfo hook could be considered a complement to isLegalAddressingMode(), which is more of a high level, while this new method is a more precise cost function with the Instruction included as argument. Perhaps they could be merged somehow, but I am not sure that is the right thing to do.

There is a myriad of isAlwaysFoldable() and isAMCompletelyFolded() functions used in LSR. There is also the Formula::UnfoldedOffset and the MaxOffset and MinOffset of the LSRUse which all play a role in the handling of immediate offsets. I am not sure about all the reasoning behind all this, so I have now only done the simple thing by adding this new cost function in RateFormula(). It may be that it could / should be used at earlier stages also, for efficiency's sake.

Even if this is not a complete reworking of LSR, I think this patch is a refactoring in the right direction, along with a needed improvement for SystemZ.

Just to clarify: LSR results are changed by both 1) Considering all LSRFixups instead of the old Offsets vector. This gives a little different ImmCost sums. 2) The new target hook, which may add 1 to NumBaseAdds.

Thanks for the clarifications and updates.

Will have a closer look.

qcolombet edited edge metadata.Jun 6 2016, 11:19 AM
qcolombet added a subscriber: Andy.

Hi Jonas,

The updated patch makes sense to me.
The patch is missing a test case though.

@Andy, do you have any additional comment?

Cheers,
-Quentin

include/llvm/Analysis/TargetTransformInfo.h
357

We should mention that the addressing mode is of the form reg + Offset.

jonpa updated this revision to Diff 59859.Jun 7 2016, 3:41 AM
jonpa edited edge metadata.

Comment fixed as suggested.
New tests for the SystemZ backend.

qcolombet added inline comments.Jun 20 2016, 11:51 AM
test/CodeGen/SystemZ/loop-01.ll
5

How different is the codegen for the z13 cpu?
The reason why I am asking is because right now, the test cases are partitioned between one target and the other with no overlap whereas if we have several RUN command in the file, I would have expected at least some overlap.

I.e., I would expect a common prefix between both CPUs that is used for most of the tests.

152

Use "opt -instnamer" to get %[0-9]+ variables.

jonpa added inline comments.Jun 21 2016, 1:31 AM
test/CodeGen/SystemZ/loop-01.ll
5

The one test for z13 is for vector instructions, which only z13 supports.

The other tests I added should be common for all subtargets, so I just reused the already present RUN command. Perhaps it should use the generic subtarget instead of z10 (which I think would be equivalent)?

jonpa added inline comments.Jun 21 2016, 2:01 AM
test/CodeGen/SystemZ/loop-01.ll
152

It seems to replace %[0-9]+ variable names with %tmp[1-9]* names. Is that what you want?

The other tests I added should be common for all subtargets, so I just reused the already present RUN command. Perhaps it should use the generic subtarget instead of z10 (which I think would be equivalent)?

In the past, LLVM would default to detecting the host CPU when compiling natively, which would cause random test failures depending on the particular host that was running the test suite. To fix this, most tests hard-coded a z10 CPU.

These days, this is no longer necessary since LLVM no longer defaults to automatically detecting the CPU, but always defaults to z10 anyway.

qcolombet added inline comments.Jun 21 2016, 8:21 AM
test/CodeGen/SystemZ/loop-01.ll
5

In that case, for the RUN line with z13 also add —check-prefix=CHECK.

152

Yes, that what I want.
[0-9]+ variables cannot be reordered or removed, tmp[0-9]+ can :).

jonpa updated this revision to Diff 61652.Jun 23 2016, 2:22 AM

Updated per review:
-check-prefix=CHECK added to -z13 RUN line so that all tests run for -z13 as well.
New tests filtered through opt -instnamer, to eliminate numeral based instruction names.

qcolombet accepted this revision.Jul 6 2016, 11:27 AM
qcolombet edited edge metadata.

LGTM.

This revision is now accepted and ready to land.Jul 6 2016, 11:27 AM
jonpa updated this revision to Diff 66931.Aug 5 2016, 5:15 AM
jonpa edited edge metadata.

I know this patch was approved already, but I have made some minor changes, which might as well get review since this hasn't been commited yet.

  • Added a check so that the new function is only called for loads and stores, after realizing it also gets called on PHIs etc.

+ if ((isa<LoadInst>(Fixup.UserInst) || isa<StoreInst>(Fixup.UserInst)) &&
+ !TTI.isFoldableMemAccessOffset(Fixup.UserInst, Offset))
+ NumBaseAdds++;

  • SystemZ implementation of isFoldableMemAccessOffset() rewritten to also handle the case of fp-stores.
jonpa requested a review of this revision.Aug 5 2016, 5:17 AM
jonpa edited edge metadata.

Quentin, you approved this patch before, could you please take a quick look again so that my last changes look good to you? Sorry for confusion.

qcolombet accepted this revision.Aug 8 2016, 5:32 PM
qcolombet edited edge metadata.

New rev LGTM as well!

Thanks,
-Quentin

This revision is now accepted and ready to land.Aug 8 2016, 5:32 PM
jonpa closed this revision.Aug 17 2016, 7:04 AM

Commited as r278927.